Wavelet Tree
   HOME

TheInfoList



OR:

The Wavelet Tree is a
succinct data structure In computer science, a succinct data structure is a data structure which uses an amount of space that is "close" to the information-theoretic lower bound, but (unlike other compressed representations) still allows for efficient query operations. Th ...
to store strings in compressed space. It generalizes the \mathbf_q and \mathbf_q operations defined on bitvectors to arbitrary alphabets. Originally introduced to represent
compressed suffix array In computer science, a compressed suffix arrayR. Grossi and J. S. Vitter, Compressed Suffix Arrays and Suffix Trees, with Applications to Text Indexing and String Matching, ''SIAM Journal on Computing,'' 35(2), 2005, 378-407. An earlier version ap ...
s, it has found application in several contexts. The tree is defined by recursively partitioning the alphabet into pairs of subsets; the leaves correspond to individual symbols of the alphabet, and at each node a bitvector stores whether a symbol of the string belongs to one subset or the other. The name derives from an analogy with the
wavelet transform In mathematics, a wavelet series is a representation of a square-integrable (real number, real- or complex number, complex-valued) function (mathematics), function by a certain orthonormal series (mathematics), series generated by a wavelet. This ...
for signals, which recursively decomposes a signal into low-frequency and high-frequency components.


Properties

Let \Sigma be a finite alphabet with \sigma=, \Sigma, . By using succinct dictionaries in the nodes, a string s \in \Sigma^* can be stored in , s, H_0(s) + o(, s, \log \sigma), where H_0(s) is the order-0 empirical entropy of s. If the tree is balanced, the operations \mathbf, \mathbf_q, and \mathbf_q can be supported in O(\log \sigma) time.


Access operation

A wavelet tree contains a bitmap representation of a string. If we know the alphabet set, then the exact string can be inferred by tracking bits down the tree. To find the letter at ith position in the string :- Input: - The position ''i'' in the string of which we want to know the letter, starting at 1. - The top node ''W'' of the wavelet tree that represents the string Output: The letter at position ''i'' if ''W.isLeafNode'' return ''W.letter'' if ''W.bitvector ' = 0 return access(''i'' - rank(''W.bitvector'', ''i''), ''W.left'') else return access(rank(''W.bitvector'', ''i''), ''W.right'') In this context, the rank of a position i in a bitvector b is the number of ones that appear in the first i positions of b. Because the rank can be calculated in O(1) by using succinct dictionaries, any S in string S can be accessed in O(\log \sigma) time, as long as the tree is balanced.


Extensions

Several extensions to the basic structure have been presented in the literature. To reduce the height of the tree, multiary nodes can be used instead of binary. The data structure can be made dynamic, supporting insertions and deletions at arbitrary points of the string; this feature enables the implementation of dynamic
FM-index In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array. It was created by Paolo Ferragina and Giovanni Manzini,Paolo Ferragina and Giovanni Manz ...
es. This can be further generalized, allowing the update operations to change the underlying alphabet: the Wavelet Trie exploits the
trie In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes def ...
structure on an alphabet of strings to enable dynamic tree modifications.


Further reading


Wavelet Trees
A blog post describing the construction of a wavelet tree, with examples. The wavelet Tree has many applications in different domains such as geographical search. Researchers presented a dual indexing technique based on a wavelet tree for Geographical Information Retrieval (GIR). The large amount of geographical information on the internet has made the indexing challenging during the information retrieval which is defined as an activity to retrieve a set of information based on our need from a set of resources. An efficient indexing technique is an essential requirement in processing of the plenty of data that are present in today’s information retrieval systems. This has been the motivation towards the design and development of several different indexing structures with a primary focus towards secondary memory as compared to the main memory which used to consist of a smaller size as well as a higher cost in the beginning, however with a dramatic reduction in the main memory prices, indexes in the main memory has also been introduced which led to the development of primary memory-based indexing structures. Dual indexing technique uses wavelet tree data structure for both textual and spatial indexing and allows for the use of MBR in dynamic insertion during index construction. It has the advantage of supporting parallel computing when there is an extensive data set involved. It can be used extensively and concurrently on different machines such that textual data can be searched on one machine and spatial data on another machine which in turn results in the reduction of the search time and the storage space, hence making this method way more efficient as compared to the previous proposed methods. Experimental results support this by showing that when pure spatial indexing is used, search time complexity is reduced and takes even less than one third time of that of spatial indexing performed using previously introduced methods namely R-tree or R*-tree. In addition, the search time is decreased by half in the case of dual indexing (textual and spatial) using wavelet tree, as compared to the other techniques like B/R, B/R* when the search query length is larger than 2 keywords.


References

R. Grossi, A. Gupta, and J. S. Vitter
High-order entropy-compressed text indexes
''Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA)'', January 2003, 841-850.
P. Ferragina, R. Giancarlo, G. Manzini
The myriad virtues of Wavelet Trees
''Information and Computation'', Volume 207, Issue 8, August 2009, Pages 849-866
G. Navarro
Wavelet Trees for All
''Proceedings of 23rd Annual Symposium on Combinatorial Pattern Matching (CPM)'', 2012
H.-L. Chan, W.-K. Hon, T.-W. Lam, and K. Sadakane
Compressed Indexes for dynamic text collections
''ACM Transactions on Algorithms'', 3(2), 2007
R. Grossi and G. Ottaviano
The Wavelet Trie: maintaining an indexed sequence of strings in compressed space
''In Proceedings of the 31st Symposium on the Principles of Database Systems (PODS)'', 2012
Yadav, A. K., & Yadav, D. (2019)
Wavelet tree based dual indexing technique for geographical search
''Int. Arab J. Inf. Technol., 16(4), 624-632.


External links

*{{Commonscatinline Trees (data structures) Succinct data structure String data structures