## G. Cormode, H. Karloff, and T. Wirth.
Set cover algorithms for very large datasets.
In *ACM Conference on Information and Knowledge Management
(CIKM)*, 2010.

The problem of Set Cover-to find the smallest subcollection of
sets that covers some universe-is at the heart of many data and
analysis tasks. It arises in a wide range of settings, including
operations research, machine learning, planning, data quality and data
mining. Although finding an optimal solution is NP-hard, the greedy
algorithm is widely used, and typically finds solutions that are close
to optimal.
However, a direct implementation of the greedy approach, which
picks the set with the largest number of uncovered items at each step,
does not behave well when the input is very large and disk resident. The greedy
algorithm must make many random accesses to disk, which are
unpredictable and costly in comparison to linear scans. In order to
scale Set Cover to large datasets, we provide a new
algorithm which finds a solution that is provably close to that of
greedy, but which is much more efficient to implement using modern
disk technology. Our experiments show a ten-fold improvement in
speed on moderately-sized datasets, and an even greater improvement on
larger datasets.

[ bib |
slides |
.pdf ]
Back

*This file was generated by
bibtex2html 1.92.*