HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, a compressed suffix arrayR. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching, ''SIAM Journal on Computing,'' 35(2), 2005, 378-407. An earlier version appeared in ''Proceedings of the 32nd ACM Symposium on Theory of Computing,'' May 2000, 397-406.Paolo Ferragina and Giovanni Manzini (2000). "Opportunistic Data Structures with Applications". Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390.R. Grossi, A. Gupta, and J. S. Vitter, High-Order Entropy-Compressed Text Indexes, ''Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms,'' January 2003, 841-850. is a
compressed data structure The term compressed data structure arises in the computer science subfields of algorithms, data structures, and theoretical computer science. It refers to a data structure whose operations are roughly as fast as those of a conventional data structu ...
for
pattern matching In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually has to be exact: "either it will or will not be ...
. Compressed suffix arrays are a general class of
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
that improve on the
suffix array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by as ...
. These data structures enable quick search for an arbitrary
string String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian anim ...
with a comparatively small index. Given a text ''T'' of ''n'' characters from an alphabet Σ, a compressed suffix array supports searching for arbitrary patterns in ''T''. For an input pattern ''P'' of ''m'' characters, the search time is typically O(''m'') or O(''m'' + log(''n'')). The space used is typically O(n H_k(T)) + o(n), where H_k(T) is the k-th order empirical entropy of the text ''T''. The time and space to construct a compressed suffix array are normally . The original instantiation of the compressed suffix array solved a long-standing open problem by showing that fast pattern matching was possible using only a linear-space data structure, namely, one proportional to the size of the text ''T'', which takes O(n \, ) bits. The conventional suffix array and suffix tree use \Omega(n \, {\log n}) bits, which is substantially larger. The basis for the data structure is a recursive decomposition using the "neighbor function," which allows a suffix array to be represented by one of half its length. The construction is repeated multiple times until the resulting suffix array uses a linear number of bits. Following work showed that the actual storage space was related to the zeroth-order entropy and that the index supports self-indexing. The space bound was further improved achieving the ultimate goal of higher-order entropy; the compression is obtained by partitioning the neighbor function by high-order contexts, and compressing each partition with a
wavelet tree The Wavelet Tree is a succinct data structure to store strings in compressed space. It generalizes the \mathbf_q and \mathbf_q operations defined on bitvectors to arbitrary alphabets. Originally introduced to represent compressed suffix arrays, ...
. The space usage is extremely competitive in practice with other state-of-the-art compressors, and it also supports fast pattern matching. The memory accesses made by compressed suffix arrays and other compressed data structures for pattern matching are typically not localized, and thus these data structures have been notoriously hard to design efficiently for use in external memory. Recent progress using geometric duality takes advantage of the block access provided by disks to speed up the I/O time significantly In addition, potentially practical search performance for a compressed suffix array in external-memory has been demonstrated.M. P. Ferguson, ''FEMTO: fast search of large sequence collections'', ''Proceedings of the 23rd Annual Conference on Combinatorial Pattern Matching'', July 2012


Open Source Implementations

There are several open source implementations of compressed suffix arrays available (see
External Links An internal link is a type of hyperlink on a web page to another page or resource, such as an image or document, on the same website or domain. Hyperlinks are considered either "external" or "internal" depending on their target or destination ...
below). Bowtie and Bowtie2 are open-source compressed suffix array implementations of read alignment for use in
bioinformatics Bioinformatics () is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combi ...
. The Succinct Data Structure Library (SDSL) is a library containing a variety of compressed data structures including compressed suffix arrays. FEMTO is an implementation of compressed suffix arrays for external memory. In addition, a variety of implementations, including the original
FM-index In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manz ...
implementations, are available from the Pizza & Chili Website (see external links).


See also

*
FM-index In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manz ...
*
Suffix Array In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by as ...


References


External links

Implementations:
Bowtie
an
Bowtie2

Succinct Data Structure Library (SDSL)

FEMTO


String data structures Database index techniques Substring indices