HOME

TheInfoList



OR:

The term compressed data structure arises in the
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
subfields of
algorithms In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing c ...
,
data structures In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
, and
theoretical computer science Theoretical computer science (TCS) is a subset of general computer science and mathematics that focuses on mathematical aspects of computer science such as the theory of computation, lambda calculus, and type theory. It is difficult to circumsc ...
. It refers to a data structure whose operations are roughly as fast as those of a conventional data structure for the problem, but whose size can be substantially smaller. The size of the compressed data structure is typically highly dependent upon the entropy of the data being represented. Important examples of compressed data structures include the
compressed suffix array In computer science, a compressed suffix arrayR. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching, ''SIAM Journal on Computing,'' 35(2), 2005, 378-407. An earlier version ap ...
and the
FM-index In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manz ...
,P. Ferragina and G. Manzini, Opportunistic Data Structures with Applications, ''Proceedings of the 41st IEEE Symposium on Foundations of Computer Science'', November 2000, 390-398. Journal version in Indexing Compressed Text, ''Journal of the ACM'', 52(4), 2005, 552-581. both of which can represent an arbitrary text of characters ''T'' for
pattern matching In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
. Given any input pattern ''P'', they support the operation of finding if and where ''P'' appears in ''T''. The search time is proportional to the sum of the length of pattern ''P'', a very slow-growing function of the length of the text ''T'', and the number of reported matches. The space they occupy is roughly equal to the size of the text ''T'' in entropy-compressed form, such as that obtained by
Prediction by Partial Matching Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the st ...
or
gzip gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and i ...
. Moreover, both data structures are self-indexing, in that they can reconstruct the text ''T'' in a random access manner, and thus the underlying text ''T'' can be discarded. In other words, they simultaneously provide a compressed and quickly searchable representation of the text ''T''. They represent a substantial space improvement over the conventional suffix tree and suffix array, which occupy many times more space than the size of ''T''. They also support searching for arbitrary patterns, as opposed to the
inverted index In computer science, an inverted index (also referred to as a postings list, postings file, or inverted file) is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of d ...
, which can support only word-based searches. In addition, inverted indexes do not have the self-indexing feature. An important related notion is that of a
succinct data structure In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. Th ...
, which uses space roughly equal to the information-theoretic minimum, which is a worst-case notion of the space needed to represent the data. In contrast, the size of a compressed data structure depends upon the particular data being represented. When the data are compressible, as is often the case in practice for natural language text, the compressed data structure can occupy space very close to the information-theoretic minimum, and significantly less space than most compression schemes.{{citation needed, reason=Until it wasn't demonstrated, there no reason to think it is true; many of the algorithms are only good at the paper, date=May 2016


References

Data structures