Substring Index
   HOME





Substring Index
In computer science, a substring index is a data structure which gives substring search in a text or text collection in sublinear time. Once constructed from a document or set of documents, a substring index can be used to locate all occurrences of a pattern in time linear or near-linear in the pattern size, with no dependence or only logarithmic dependence on the document size. The phrase full-text index is often used for substring indexes. But this is ambiguous, as it is also used for regular word indexes such as inverted files and document retrieval. See full text search. General considerations These data structures typically treat their text and pattern as strings over a fixed alphabet, and search for locations where the pattern occurs as a substring of the text. The symbols of the alphabet may be characters (for instance in Unicode) but in practical applications for text retrieval it may be preferable to treat the ( stemmed) words of a document as the symbols of its alphabe ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Computer Science
Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, applied disciplines (including the design and implementation of Computer architecture, hardware and Software engineering, software). Algorithms and data structures are central to computer science. The theory of computation concerns abstract models of computation and general classes of computational problem, problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and preventing security vulnerabilities. Computer graphics (computer science), Computer graphics and computational geometry address the generation of images. Programming language theory considers different ways to describe computational processes, and database theory concerns the management of re ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Deterministic Finite Automaton
In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automaton (DFSA)—is a finite-state machine that accepts or rejects a given string of symbols, by running through a state sequence uniquely determined by the string. ''Deterministic'' refers to the uniqueness of the computation run. In search of the simplest models to capture finite-state machines, Warren McCulloch and Walter Pitts were among the first researchers to introduce a concept similar to finite automata in 1943. The figure illustrates a deterministic finite automaton using a state diagram. In this example automaton, there are three states: S0, S1, and S2 (denoted graphically by circles). The automaton takes a finite sequence of 0s and 1s as input. For each state, there is a transition arrow leading out to a next state for both 0 an ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


String Data Structures
String or strings may refer to: *String (structure), a long flexible structure made from threads twisted together, which is used to tie, bind, or hang other objects Arts, entertainment, and media Films * ''Strings'' (1991 film), a Canadian animated short * ''Strings'' (2004 film), a film directed by Anders Rønnow Klarlund * ''Strings'' (2011 film), an American dramatic thriller film * ''Strings'' (2012 film), a British film by Rob Savage * '' Bravetown'' (2015 film), an American drama film originally titled ''Strings'' * '' The String'' (2009), a French film Music Instruments * String (music), the flexible element that produces vibrations and sound in string instruments * String instrument, a musical instrument that produces sound through vibrating strings ** List of string instruments * String piano, a pianistic extended technique in which sound is produced by direct manipulation of the strings, rather than striking the piano's keys Types of groups * String band, music ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Algorithms On Strings
In mathematics and computer science, an algorithm () is a finite sequence of mathematically rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing calculations and data processing. More advanced algorithms can use conditionals to divert the code execution through various routes (referred to as automated decision-making) and deduce valid inferences (referred to as automated reasoning). In contrast, a heuristic is an approach to solving problems without well-defined correct or optimal results.David A. Grossman, Ophir Frieder, ''Information Retrieval: Algorithms and Heuristics'', 2nd edition, 2004, For example, although social media recommender systems are commonly called "algorithms", they actually rely on heuristics as there is no truly "correct" recommendation. As an effective method, an algorithm can be expressed within a finite amount of space and time"Any classical mathe ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Journal Of The ACM
The ''Journal of the ACM'' (''JACM'') is a peer-reviewed scientific journal covering computer science in general, especially theoretical aspects. It is an official journal of the Association for Computing Machinery. Its current editor-in-chief is Venkatesan Guruswami. The journal was established in 1954 and "computer scientists universally hold the ''Journal of the ACM'' in high esteem". See also * ''Communications of the ACM ''Communications of the ACM'' (''CACM'') is the monthly journal of the Association for Computing Machinery (ACM). History It was established in 1958, with Saul Rosen as its first managing editor. It is sent to all ACM members. Articles are i ...'' References External links * {{DEFAULTSORT:Journal Of The Acm Academic journals established in 1954 Computer science journals Association for Computing Machinery academic journals Bimonthly journals English-language journals ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Burrows–Wheeler Transform
The Burrows–Wheeler transform (BWT) rearranges a character string into runs of similar characters, in a manner that can be reversed to recover the original string. Since compression techniques such as move-to-front transform and run-length encoding are more effective when such runs are present, the BWT can be used as a preparatory step to improve the efficiency of a compression algorithm, and is used this way in software such as bzip2. The algorithm can be implemented efficiently using a suffix array thus reaching linear time complexity. It was invented by David Wheeler in 1983, and later published by him and Michael Burrows in 1994. Their paper included a compression algorithm, called the Block-sorting Lossless Data Compression Algorithm or BSLDCA, that compresses data by using the BWT followed by move-to-front coding and Huffman coding or arithmetic coding. Description The transform is done by constructing a matrix (known as the Burrows-Wheeler Matrix) whose rows are the ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  




FM-index
In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manzini (2000)"Opportunistic Data Structures with Applications".Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390. who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries. The name stands for Full-text index in Minute space. It can be used to efficiently find the number of occurrences of a pattern within the compressed text, as well as locate the position of each occurrence. The query time, as well as the required storage space, has a sublinear complexity with respect to the size of the input data. The original authors have devised improvements to their original approach and dubbed it "FM-Index version 2". A further improvem ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Data Compression
In information theory, data compression, source coding, or bit-rate reduction is the process of encoding information using fewer bits than the original representation. Any particular compression is either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy. No information is lost in lossless compression. Lossy compression reduces bits by removing unnecessary or less important information. Typically, a device that performs data compression is referred to as an encoder, and one that performs the reversal of the process (decompression) as a decoder. The process of reducing the size of a data file is often referred to as data compression. In the context of data transmission, it is called source coding: encoding is done at the source of the data before it is stored or transmitted. Source coding should not be confused with channel coding, for error detection and correction or line coding, the means for mapping data onto a sig ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


Compressed Suffix Array
In computer science, a compressed suffix arrayR. Grossi and J. S. VitterCompressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching ''SIAM Journal on Computing,'' 35(2), 2005, 378–407. An earlier version appeared in ''Proceedings of the 32nd ACM Symposium on Theory of Computing,'' May 2000, 397–406.Paolo Ferragina and Giovanni Manzini (2000)"Opportunistic Data Structures with Applications" Proceedings of the 41st Annual Symposium on Foundations of Computer Science. p.390.R. Grossi, A. Gupta, and J. S. VitterHigh-Order Entropy-Compressed Text Indexes ''Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms,'' January 2003, 841–850. is a compressed data structure for pattern matching. Compressed suffix arrays are a general class of data structure that improve on the suffix array. These data structures enable quick search for an arbitrary string with a comparatively small index. Given a text ''T'' of ''n'' characters from ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


SIAM Journal On Computing
The ''SIAM Journal on Computing'' is a scientific journal focusing on the mathematical and formal aspects of computer science. It is published by the Society for Industrial and Applied Mathematics (SIAM). Although its official ISO abbreviation is ''SIAM J. Comput.'', its publisher and contributors frequently use the shorter abbreviation ''SICOMP''. SICOMP typically hosts the special issues of the IEEE Annual Symposium on Foundations of Computer Science (FOCS) and the Annual ACM Symposium on Theory of Computing (STOC), where about 15% of papers published in FOCS and STOC each year are invited to these special issues. For example, Volume 48 contains 11 out of 85 papers published in FOCS 2016. References External linksSIAM Journal on Computing
on

LCP Array
In computer science, the longest common prefix array (LCP Array data structure, array) is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array. For example, if ''A'' := [aab, ab, abaab, b, baab] is a suffix array, the longest common prefix between ''A''[1] = aab and ''A''[2] = ab is ''a'' which has length 1, so ''H''[2] = 1 in the LCP array ''H''. Likewise, the LCP of ''A''[2] = ab and ''A''[3] = abaab is ab, so ''H''[3] = 2. Augmenting the suffix array with the LCP array allows one to efficiently simulate top-down and bottom-up Tree traversal, traversals of the suffix tree, speeds up pattern matching on the suffix array and is a prerequisite for compressed suffix trees. History The LCP array was introduced in 1993, by Udi Manber and Gene Myers alongside the suffix array in order to improve the running time of their string search algorithm. Definition Le ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]  


picture info

Binary Search
In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the middle element of the array. If they are not equal, the half in which the target cannot lie is eliminated and the search continues on the remaining half, again taking the middle element to compare to the target value, and repeating this until the target value is found. If the search ends with the remaining half being empty, the target is not in the array. Binary search runs in Time complexity#Logarithmic time, logarithmic time in the Best, worst and average case, worst case, making O(\log n) comparisons, where n is the number of elements in the array. Binary search is faster than linear search except for small arrays. However, the array must be sorted first to be able to apply binary search. There are specialized data structures designed fo ...
[...More Info...]      
[...Related Items...]     OR:     [Wikipedia]   [Google]   [Baidu]