We initiate a new class of string matching problems called Substring Compression Problems. Given a string S that may be preprocessed, the problem is to quickly find the compressed representation or the compressed size of any query substring of S (Substring Compression Query or SCQ) or to find the length l substring of S whose compressed size is the least (Least Compressible Substring or LCS problem).
Starting from the seminal paper of Lempel and Ziv over 25 years ago, many different methods have emerged for compressing entire strings. Determining substring compressibility is a natural variant that is combinatorially and algorithmically challenging, yet surprisingly has not been studied before. In addition, compressibility of strings is emerging as a tool to compare biological sequences and analyze their information content. However, typically, the compressibility of the entire sequence is not as informative as that of portions of the sequences. Thus substring compressibility may be a more suitable basis for sequence analysis.
We present the first known, nearly optimal algorithms for substring compression problems-SCQ, LCS and their generaliztions-that are exact or provably approximate. Our exact algorithms exploit the structure in strings via suffix trees and our approximate algorithms rely on new relationships we find between Lempel-Ziv compression and string parsings.
[ bib | slides | .pdf ] Back
This file was generated by bibtex2html 1.92.