G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SIAM Conference on Data Mining (SDM), 2005.

Many applications generate massive data streams. Summarizing such massive data requires fast, small space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, z, that captures the amount of skew.

We present a data stream summary that can answer point queries with ε accuracy and show that the space needed is only O-min{1,1/z}). This is the first o(1/ε) space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the L2 norm of the stream in o(1/ε2) space for z>1/2, another improvement over the existing Ω(1/ε2) methods.

We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.

bib | slides | .pdf ] Back


This file was generated by bibtex2html 1.92.