Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, tuples pertaining to recent events are typically more important than older ones. This can be formalized via time decay functions, which assign weights based on age. Decay functions such as sliding windows and exponential decay have been well studied under the assumption of well-ordered arrivals, i.e., data arrives in order of increasing timestamps. However, data quality issues are prevalent in massive streams (due to network asynchrony and delays, or due to features inherent to the measurement process), and the correct order of arrival cannot be guaranteed. We focus on the computation of decayed aggregates such as range queries, quantiles, and heavy hitters on out-of-order streams, where elements do not necessarily arrive in increasing order of timestamps. We give the first deterministic algorithms for approximating these aggregates under the sliding-window, exponential, and polynomial decay functions. Our techniques are based on extending existing data stream summaries, such as the q-digest. We show that the overhead for allowing out-of-order arrivals is small compared to the case of well-ordered arrivals. Our experimental study confirms that these methods can be applied in practice and investigates how the specific decay function impacts performance.
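To make the notion of a time decay function concrete, the following is a minimal sketch of the three decay families named above, applied to a toy out-of-order stream. It computes the decayed sum exactly by storing every element; the paper's contribution is approximating such aggregates with small summaries (e.g., extensions of the q-digest), which this sketch does not attempt. The function names and parameters (`window`, `lam`, `alpha`) are illustrative choices, not notation from the paper.

```python
import math

# Each decay function maps an element's age (now - timestamp) to a weight.

def sliding_window(age, window=10):
    """Weight 1 while the element is inside the window, 0 afterwards."""
    return 1.0 if age < window else 0.0

def exponential_decay(age, lam=0.1):
    """Weight falls off as exp(-lam * age)."""
    return math.exp(-lam * age)

def polynomial_decay(age, alpha=2.0):
    """Weight falls off as (age + 1) ** (-alpha)."""
    return (age + 1) ** (-alpha)

def decayed_sum(items, now, decay):
    """Decayed aggregate over (timestamp, value) pairs.

    Note that arrival order is irrelevant here: only the age
    now - t matters, which is why out-of-order streams are
    well-defined for time-decayed aggregates.
    """
    return sum(v * decay(now - t) for t, v in items)

# An out-of-order stream: timestamps 5, 1, 8 do not arrive in increasing order.
stream = [(5, 2.0), (1, 3.0), (8, 1.0)]
print(decayed_sum(stream, now=10, decay=sliding_window))  # all ages < 10 -> 6.0
```

Exact computation like this requires storing the whole stream; the summaries studied in the paper trade a small approximation error for sublinear space.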