Compressed Data Structure
   HOME





Compressed Data Structure
The term compressed data structure arises in the computer science subfields of algorithms, data structures, and theoretical computer science. It refers to a data structure whose operations are roughly as fast as those of a conventional data structure for the problem, but whose size can be substantially smaller. The size of the compressed data structure is typically highly dependent upon the information entropy of the data being represented. Important examples of compressed data structures include the compressed suffix array and the FM-index, both of which can represent an arbitrary text of characters ''T'' for pattern matching. Given any input pattern ''P'', they support the operation of finding if and where ''P'' appears in ''T''. The search time is proportional to the sum of the length of pattern ''P'', a very slow-growing function of the length of the text ''T'', and the number of reported matches. The space they occupy is roughly equal to the size of the text ''T'' in entrop ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Computer Science
Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, applied disciplines (including the design and implementation of Computer architecture, hardware and Software engineering, software). Algorithms and data structures are central to computer science. The theory of computation concerns abstract models of computation and general classes of computational problem, problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and preventing security vulnerabilities. Computer graphics (computer science), Computer graphics and computational geometry address the generation of images. Programming language theory considers different ways to describe computational processes, and database theory concerns the management of re ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Algorithms
In mathematics and computer science, an algorithm () is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can use conditionals to divert the code execution through various routes (referred to as automated decision-making) and deduce valid inferences (referred to as automated reasoning). In contrast, a heuristic is an approach to solving problems without well-defined correct or optimal results.David A. Grossman, Ophir Frieder, ''Information Retrieval: Algorithms and Heuristics'', 2nd edition, 2004, For example, although social media recommender systems are commonly called "algorithms", they actually rely on heuristics as there is no truly "correct" recommendation. As an effective method, an algorithm can be expressed within a finite amount of space and time"Any classic ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Data Structures
In computer science, a data structure is a data organization and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data. Usage Data structures serve as the basis for abstract data types (ADT). The ADT defines the logical form of the data type. The data structure implements the physical form of the data type. Different types of data structures are suited to different kinds of applications, and some are highly specialized to specific tasks. For example, relational databases commonly use B-tree indexes for data retrieval, while compiler implementations usually use hash tables to look up identifiers. Data structures provide a means to manage large amounts of data efficiently for uses such as large databases and internet indexing services. Usually, effici ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Theoretical Computer Science
Theoretical computer science is a subfield of computer science and mathematics that focuses on the Abstraction, abstract and mathematical foundations of computation. It is difficult to circumscribe the theoretical areas precisely. The Association for Computing Machinery, ACM's Special Interest Group on Algorithms and Computation Theory (SIGACT) provides the following description: History While logical inference and mathematical proof had existed previously, in 1931 Kurt Gödel proved with his incompleteness theorem that there are fundamental limitations on what statements could be proved or disproved. Information theory was added to the field with A Mathematical Theory of Communication, a 1948 mathematical theory of communication by Claude Shannon. In the same decade, Donald Hebb introduced a mathematical model of Hebbian learning, learning in the brain. With mounting biological data supporting this hypothesis with some modification, the fields of neural networks and para ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Information Entropy
In information theory, the entropy of a random variable quantifies the average level of uncertainty or information associated with the variable's potential states or possible outcomes. This measures the expected amount of information needed to describe the state of the variable, considering the distribution of probabilities across all potential states. Given a discrete random variable X, which may be any member x within the set \mathcal and is distributed according to p\colon \mathcal\to , 1/math>, the entropy is \Eta(X) := -\sum_ p(x) \log p(x), where \Sigma denotes the sum over the variable's possible values. The choice of base for \log, the logarithm, varies for different applications. Base 2 gives the unit of bits (or " shannons"), while base ''e'' gives "natural units" nat, and base 10 gives units of "dits", "bans", or " hartleys". An equivalent definition of entropy is the expected value of the self-information of a variable. The concept of information entropy was ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Compressed Suffix Array
In computer science, a compressed suffix arrayR. Grossi and J. S. VitterCompressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching ''SIAM Journal on Computing,'' 35(2), 2005, 378–407. An earlier version appeared in ''Proceedings of the 32nd ACM Symposium on Theory of Computing,'' May 2000, 397–406.Paolo Ferragina and Giovanni Manzini (2000)"Opportunistic Data Structures with Applications" Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390.R. Grossi, A. Gupta, and J. S. VitterHigh-Order Entropy-Compressed Text Indexes ''Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms,'' January 2003, 841–850. is a compressed data structure for pattern matching. Compressed suffix arrays are a general class of data structure that improve on the suffix array. These data structures enable quick search for an arbitrary string with a comparatively small index. Given a text ''T'' of ''n'' characters from ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


FM-index
In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manzini (2000)"Opportunistic Data Structures with Applications".Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390. who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. The query time, as well as the required storage space, has a sublinear complexity with respect to the size of the input data. The original authors have devised improvements to their original approach and dubbed it "FM-Index version 2". A further improvem ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Pattern Matching
In computer science, pattern matching is the act of checking a given sequence of tokens for the presence of the constituents of some pattern. In contrast to pattern recognition, the match usually must be exact: "either it will or will not be a match." The patterns generally have the form of either sequences or tree structures. Uses of pattern matching include outputting the locations (if any) of a pattern within a token sequence, to output some component of the matched pattern, and to substitute the matching pattern with some other token sequence (i.e., search and replace). Sequence patterns (e.g., a text string) are often described using regular expressions and matched using techniques such as backtracking. Tree patterns are used in some programming languages as a general tool to process data based on its structure, e.g. C#, F#, Haskell, Java, ML, Python, Ruby, Rust, Scala, Swift and the symbolic mathematics language Mathematica have special syntax for expressing ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  




Prediction By Partial Matching
Prediction by partial matching (PPM) is an adaptive statistical data compression technique based on context modeling and prediction. PPM models use a set of previous symbols in the uncompressed symbol stream to predict the next symbol in the stream. PPM algorithms can also be used to cluster data into predicted groupings in cluster analysis. Theory Predictions are usually reduced to symbol rankings. Each symbol (a letter, bit or any other amount of data) is ranked before it is compressed, and the ranking system determines the corresponding codeword (and therefore the compression rate). In many compression algorithms, the ranking is equivalent to probability mass function estimation. Given the previous letters (or given a context), each symbol is assigned with a probability. For instance, in arithmetic coding the symbols are ranked by their probabilities to appear after previous symbols, and the whole sequence is compressed into a single fraction that is computed according t ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Gzip
gzip is a file format and a software application used for file compression and decompression. The program was created by Jean-loup Gailly and Mark Adler as a free software replacement for the compress program used in early Unix systems, and intended for use by GNU (from which the "g" of gzip is derived). Version 0.1 was first publicly released on 31 October 1992, and version 1.0 followed in February 1993. The decompression of the ''gzip'' format can be implemented as a streaming algorithm, an important feature for Web protocols, data interchange and ETL (in standard pipes) applications. File format gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. DEFLATE was intended as a replacement for LZW and other patent-encumbered data compression algorithms which, at the time, limited the usability of the compress utility and other popular archivers. "gzip" also refers to the gzip file format (described in the table below). In sho ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Suffix Tree
In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations. The construction of such a tree for the string S takes time and space linear in the length of S. Once constructed, several operations can be performed quickly, such as locating a substring in S, locating a substring if a certain number of mistakes are allowed, and locating matches for a regular expression pattern. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. These speedups come at a cost: storing a string's suffix tree typically requires significantly more space than storing the string itself. History The concept was first introduced by . Rather than the suffix S ..n/math>, Weiner stored in his trie the ''prefix ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Suffix Array
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics. Suffix arrays were introduced by as a simple, space efficient alternative to suffix trees. They had independently been discovered by Gaston Gonnet in 1987 under the name ''PAT array'' . gave the first in-place \mathcal(n) time suffix array construction algorithm that is optimal both in time and space, where ''in-place'' means that the algorithm only needs \mathcal(1) additional space beyond the input string and the output suffix array. Enhanced suffix arrays (ESAs) are suffix arrays with additional tables that reproduce the full functionality of suffix trees preserving the same time and memory complexity. A sorted array of only some (rather than all) suffixes of a string is called a sparse suffix array. Definition Let S=S ..S /math> be an n-string and let S ,j/math ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]