G. Cormode and M. Garofalakis. Streaming in a connected world: Querying and tracking distributed data streams, March 2008. Tutorial at VLDB 2006, SIGMOD 2007, EDBT 2008.

Today, a majority of data is fundamentally distributed in nature. Data for almost any task is collected over a broad area, and streams in at a much greater rate than ever before. In particular, advances in sensor technology and miniaturization have led to the concept of the sensor network: a (typically wireless) collection of sensing devices collecting detailed data about their surroundings. A fundamental question arises: how to query and monitor this rich new source of data? Similar scenarios emerge within the context of monitoring more traditional, wired networks, and in other emerging models such as P2P networks and grid-based computing. In all cases, if we can perform more computational work within the network to reduce the communication needed, then we can reduce bandwidth usage and improve performance.

We consider two broad classes of approaches to such in-network query processing, by analogy to query types in traditional DBMSs. In the one shot model, a query is issued by a user at some site, and must be answered based on the current state of data in the network. We identify several possible approaches to this problem. For simple queries, partial computation of the result over a tree can reduce the data transferred significantly. For “holistic” queries, such as medians, count distinct and so on, clever composable summaries give a compact way to accurately approximate query answers. Lastly, careful modeling of correlations between measurements and other trends in the data can further reduce the number of sensors probed. In the continuous model, a query is placed by a user which requires the answer to be available continuously. This yields yet further challenges, since even using tree computation, summarization and modeling, we cannot afford to communicate every time new data is received by one of the remote sites. Instead, the result of work on this problem has been a new tradeoff of reduced accuracy in the query answer for reduced communication cost. This has led to a variety of techniques for different query types to apportion the available “uncertainty” in the query answer between different sites, and to model the evolution of measured values to anticipate future values and so reduce communication further.

Our objective in this tutorial is to discuss the algorithmic foundations of this new world, illustrate some of the powerful techniques that have been developed to address these challenges, and outline interesting directions for future work in the area.

Note: these slides are provided in powerpoint format for ease of use (due to embedded animations) and for use in other presentations. Please feel free to use these slides for your own purposes; we would appreciate acknowledgements and citations (to the VLDB 06 proceedings).

bib | video | slides | .pdf ] Back


This file was generated by bibtex2html 1.92.