HOME

TheInfoList



OR:

In computer science, a trie, also called digital tree or prefix tree, is a type of ''k''-ary
search tree In computer science, a search tree is a tree data structure used for locating specific keys from within a set. In order for a tree to function as a search tree, the key for each node must be greater than any keys in subtrees on the left, and les ...
, a
tree In botany, a tree is a perennial plant with an elongated Plant stem, stem, or trunk (botany), trunk, usually supporting branches and leaves. In some usages, the definition of a tree may be narrower, including only woody plants with secondar ...
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual
characters Character or Characters may refer to: Arts, entertainment, and media Literature * ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk * ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
. In order to access a key (to recover its value, change it, or remove it), the trie is traversed
depth-first Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root node in the case of a graph) and explores as far as possible alon ...
, following the links between nodes, which represent each character in the key. Unlike a binary search tree, nodes in the trie do not store their associated key. Instead, a node's position in the trie defines the key with which it is associated. This distributes the value of each key across the data structure, and means that not every node necessarily has an associated value. All the children of a node have a common
prefix A prefix is an affix which is placed before the stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy''. Particula ...
of the string associated with that parent node, and the root is associated with the
empty string In formal language theory, the empty string, or empty word, is the unique string of length zero. Formal theory Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special case ...
. This task of storing data accessible by its prefix can be accomplished in a memory-optimized way by employing a
radix tree In computer science, a radix tree (also radix trie or compact prefix tree or compressed trie) is a data structure that represents a space-optimized trie (prefix tree) in which each node that is the only child is merged with its parent. The resul ...
. Though tries can be keyed by character strings, they need not be. The same algorithms can be adapted for ordered lists of any underlying type, e.g.
permutation In mathematics, a permutation of a set is, loosely speaking, an arrangement of its members into a sequence or linear order, or if the set is already ordered, a rearrangement of its elements. The word "permutation" also refers to the act or pr ...
s of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up a piece of fixed-length binary data, such as an integer or
memory address In computing, a memory address is a reference to a specific memory location used at various levels by software and hardware. Memory addresses are fixed-length sequences of digits conventionally displayed and manipulated as unsigned integers. ...
. The key lookup complexity of a trie remains proportional to the key size. Specialized trie implementations such as compressed tries are used to deal with the enormous space requirement of a trie in naive implementations.


History, etymology, and pronunciation

The idea of a trie for representing a set of strings was first abstractly described by
Axel Thue Axel Thue (; 19 February 1863 – 7 March 1922) was a Norwegian mathematician, known for his original work in diophantine approximation and combinatorics. Work Thue published his first important paper in 1909. He stated in 1914 the so-called w ...
in 1912. Cited by Knuth. Tries were first described in a computer context by René de la Briandais in 1959. The idea was independently described in 1960 by
Edward Fredkin Edward Fredkin (born October 2, 1934) is a distinguished career professor at Carnegie Mellon University (CMU), and an early pioneer of digital physics. Fredkin's primary contributions include work on reversible computing and cellular automata. ...
, who coined the term ''trie'', pronouncing it (as "tree"), after the middle syllable of ''retrieval''. However, other authors pronounce it (as "try"), in an attempt to distinguish it verbally from "tree".


Overview

Tries are a form of string-indexed look-up data structure, which is used to store a dictionary list of words that can be searched on in a manner that allows for efficient generation of completion lists. A prefix trie is an
ordered tree In graph theory, a tree is an undirected graph in which any two vertices are connected by ''exactly one'' path, or equivalently a connected acyclic undirected graph. A forest is an undirected graph in which any two vertices are connected by '' ...
data structure used in the representation of a set of strings over a finite alphabet set, which allows efficient storage of words with common prefixes. Tries can be efficacious on
string-searching algorithm In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger strin ...
s such as
predictive text Predictive text is an input technology used where one key or button represents many letters, such as on the numeric keypads of mobile phones and in accessibility technologies. Each key press results in a ''prediction'' rather than repeatedly ...
, approximate string matching, and
spell checking In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic di ...
in comparison to a binary search trees. A trie can be seen as a tree-shaped
deterministic finite automaton In the theory of computation, a branch of theoretical computer science, a deterministic finite automaton (DFA)—also known as deterministic finite acceptor (DFA), deterministic finite-state machine (DFSM), or deterministic finite-state automa ...
.


Operations

Tries support various operations: insertion, deletion, and lookup of a string key. Tries are composed of \text that contain ''links'' that are either references to other child suffix child nodes, or \text . Except for ''root'', each node is pointed to by just one other node, called the ''parent''. Each node contains \text links, where \text is the
cardinality In mathematics, the cardinality of a set is a measure of the number of elements of the set. For example, the set A = \ contains 3 elements, and therefore A has a cardinality of 3. Beginning in the late 19th century, this concept was generalized ...
of the set of
alphabets An alphabet is a standardized set of basic written graphemes (called letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character represents a syllab ...
, although tries have a substantial number of \text links. In most cases, the size of \text array is bitlength of the
character encoding Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
- 256 in the case of (unsigned) ASCII. The \text links within \text in \text emphasizes the following characteristics: # Characters and string keys are implicitly stored in the trie data structure representation, and include a character
sentinel value In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically i ...
indicating string-termination. # Each node contains one possible link to a
prefix A prefix is an affix which is placed before the stem of a word. Adding it to the beginning of one word changes it into another word. For example, when the prefix ''un-'' is added to the word ''happy'', it creates the word ''unhappy''. Particula ...
of strong keys of the set. A basic structure type of nodes in the trie is as follows; \text may contain an optional \text, which is associated with each key stored in the last character of string, or terminal node.


Searching

Searching a \text in a trie is guided by the characters in the search string key, as each node in the trie contains a corresponding link to each possible character in the given string. Thus, following the string within the trie yields the associated \text for the given string key. A \text link within search execution indicates the inexistence of the key. Following pseudocode implements the search procedure for a given string key (\text) in a rooted trie (\text). In the above pseudocode, \text and \text correspond to the pointer of trie's root node and the string key respectively. The search operation, in a standard trie, takes O(\text), \text is the size of the string parameter \text, and \text corresponds to the alphabet size. Binary search trees, on the other hand, take O(m \log n) on the worst case, since the search depends on the height of the tree (\log n) of the BST (in case of
balanced tree In computer science, a self-balancing binary search tree (BST) is any node-based binary search tree that automatically keeps its height (maximal number of levels below the root) small in the face of arbitrary item insertions and deletions.Donald ...
s), where \text and \text being number of keys and the length of the keys. Tries occupy less space in comparison with BST if it encompasses a large number of short strings, since nodes share common initial string subsequences and stores the keys implicitly on the structure. The terminal node of the tree contains a non-nil \text, and it is a ''search hit'' if the associated value is found in the trie, and ''search miss'' if it isn't.


Insertion

Insertion into trie is guided by using the character sets as the indexes into the \text array until last character of the string key is reached. Each node in the trie corresponds to one call of the
radix sorting In computer science, radix sort is a non- comparative sorting algorithm. It avoids comparison by creating and distributing elements into buckets according to their radix. For elements with more than one significant digit, this bucketing process ...
routine, as the trie structure reflects the execution of patten of the top-down radix sort. If a \text link is encountered prior to reaching the last character of the string key, a new \text is created, such along lines 3-5. \text gets assigned to input \text; if \text wasn't \text at the time of insertion, the value associated with the given string key gets substituted with the current one.


Deletion

Deletion of a key–value pair from a trie involves finding the terminal node with the corresponding string key, marking the terminal indicator and value to ''false'' and \text correspondingly. Following is a
recursive Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...
procedure for removing a string key (\text) from rooted trie (\text). The procedures begins by examining the \text; \text denotes the arrival of a terminal node or end of string key. If terminal and if it has no children, the node gets removed from the trie (line 14 assign the character index to \text). However, an end of string key without the node being terminal indicates that the key does not exist, thus the procedure does not modify the trie. The recursion proceeds by incrementing \text's index.


Replacing other data structures


Replacement for hash tables

A trie can be used to replace a
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', ...
, over which it has the following advantages: * Searching for a node with an associated key of size m has the complexity of O(m), whereas an imperfect hash function may have numerous colliding keys, and the worst-case lookup speed of such a table would be O(N), where N denotes the total number of nodes within the table. * Tries do not need a hash function for the operation, unlike a hash table; there are also no
collisions In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...
of different keys in a trie. * Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value. * String keys within the trie can be sorted using a predetermined alphabetical ordering. However, tries are less efficient than a hash table when the data is directly accessed on a
secondary storage device Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a compute ...
such as a hard disk drive that has higher
random access Random access (more precisely and more generally called direct access) is the ability to access an arbitrary element of a sequence in equal time or any datum from a population of addressable elements roughly as easily and efficiently as any othe ...
time than the
main memory Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers. The central processing unit (CPU) of a compute ...
. Tries are also disadvantageous when the key value cannot be easily represented as string, such as
floating point numbers In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately, using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. For example, 12.345 can be ...
where multiple representations are possible (e.g. 1 is equivalent to 1.0, +1.0, 1.00, etc.), however it can be unambiguously represented as a
binary number A binary number is a number expressed in the base-2 numeral system or binary numeral system, a method of mathematical expression which uses only two symbols: typically "0" (zero) and "1" ( one). The base-2 numeral system is a positional notation ...
in
IEEE 754 The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical standard for floating-point arithmetic established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). The standard addressed many problems found in ...
, in comparison to
two's complement Two's complement is a mathematical operation to reversibly convert a positive binary number into a negative binary number with equivalent (but negative) value, using the binary digit with the greatest place value (the leftmost bit in big- endian ...
format.


Implementation strategies

Tries can be represented in several ways, corresponding to different trade-offs between memory use and speed of the operations. Using a vector of pointers for representing a trie consumes enormous space; however, memory space can be reduced at the expense of running time if a
singly linked list In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes whic ...
is used for each node vector, as most entries of the vector contains \text. Techniques such as ''alphabet reduction'' may alleviate the high space complexity by reinterpreting the original string as a long string over a smaller alphabet i.e. a string of bytes can alternatively be regarded as a string of four-bit units and stored in a trie with sixteen pointers per node. However, lookups need to visit twice as many nodes in the worst-case, although space requirements go down by a factor of eight. Other techniques include storing a vector of 256 ASCII pointers as a bitmap of 256 bits representing ASCII alphabet, which reduces the size of individual nodes dramatically.


Bitwise tries

Bitwise tries are used to address the enormous space requirement for the trie nodes in a naive simple pointer vector implementations. Each character in the string key set is represented via individual bits, which are used to traverse the trie over a string key. The implementations for these types of trie use vectorized CPU instructions to find the first set bit in a fixed-length key input (e.g. GCC's __builtin_clz()
intrinsic function In computer software, in compiler theory, an intrinsic function (or built-in function) is a function (subroutine) available for use in a given programming language whose implementation is handled specially by the compiler. Typically, it may su ...
). Accordingly, the set bit is used to index the first item, or child node, in the 32- or 64-entry based bitwise tree. Search then proceeds by testing each subsequent bit in the key. This procedure is also cache-local and highly parallelizable due to
register Register or registration may refer to: Arts entertainment, and media Music * Register (music), the relative "height" or range of a note, melody, part, instrument, etc. * ''Register'', a 2017 album by Travis Miller * Registration (organ), the ...
independency, and thus performant on
out-of-order execution In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a process ...
CPUs.


Compressed tries

Radix tree In computer science, a radix tree (also radix trie or compact prefix tree or compressed trie) is a data structure that represents a space-optimized trie (prefix tree) in which each node that is the only child is merged with its parent. The resul ...
, also known as a compressed trie, is a space-optimized variant of a trie in which nodes with only one child get merged with its parents; elimination of branches of the nodes with a single child results in better in both space and time metrics. This works best under the following conditions where the trie remains static and set of keys stored are very sparse within their representation space. One more approach is to "pack" the trie, in which a space-efficient implementation of a sparse packed trie applied to automatic hyphenation, in which the descendants of each node may be interleaved in memory.


Patricia trees

Patricia trees are a particular implementation of compressed binary trie that utilize binary encoding of the string keys in its representation. Every node in a Patricia tree contains an index, known as a "skip number", that stores the node's branching index to avoid empty subtrees during traversal. A naive implementation of a trie consumes immense storage due to larger number of leaf-nodes caused by sparse distribution of keys; Patricia trees can be efficient for such cases. A representation of a Patricia tree with string keys \ is shown in figure 4, and each index value adjacent to the nodes represents the "skip number" - the index of the bit with which branching is to be decided. The skip number 1 at node 0 corresponds to the position 1 in the binary encoded ASCII where the leftmost bit differed in the key set X. The skip number is crucial for search, insertion, and deletion of nodes in the Patricia tree, and a bit masking operation is performed during every iteration.


Applications

Trie data structures are commonly used in
predictive text Predictive text is an input technology used where one key or button represents many letters, such as on the numeric keypads of mobile phones and in accessibility technologies. Each key press results in a ''prediction'' rather than repeatedly ...
or
autocomplete Autocomplete, or word completion, is a feature in which an application predicts the rest of a word a user is typing. In Android and iOS smartphones, this is called predictive text. In graphical user interfaces, users can typically press the tab ...
dictionaries, and approximate matching algorithms. Tries enable faster searches, occupy less space, especially when the set contains large number of short strings, thus used in
spell checking In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic di ...
, hyphenation applications and
longest prefix match Longest prefix match (also called Maximum prefix length match) refers to an algorithm used by routers in Internet Protocol (IP) networking to select an entry from a routing table. Because each entry in a forwarding table may specify a sub-netwo ...
algorithms. However, if storing dictionary words is all that is required (i.e. there is no need to store metadata associated with each word), a minimal deterministic acyclic finite state automaton (DAFSA) or radix tree would use less storage space than a trie. This is because DAFSAs and radix trees can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored. String dictionaries are also utilized in
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to proc ...
, such as finding
lexicon A lexicon is the vocabulary of a language or branch of knowledge (such as nautical or medical). In linguistics, a lexicon is a language's inventory of lexemes. The word ''lexicon'' derives from Greek word (), neuter of () meaning 'of or for w ...
of a
text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical ...
.


Sorting

Lexicographic sorting of a set of string keys can be implemented by building a trie for the given keys and traversing the tree in
pre-order A pre-order is an order placed for an item that has not yet been released. The idea for pre-orders came because people found it hard to get popular items in stores because of their popularity. Companies then had the idea to allow customers to r ...
fashion; this is also a form of
radix sort In computer science, radix sort is a non- comparative sorting algorithm. It avoids comparison by creating and distributing elements into buckets according to their radix. For elements with more than one significant digit, this bucketing process ...
. Tries are also fundamental data structures for burstsort, which is notable for being the fastest string sorting algorithm as of 2007, accompanied for its efficient use of CPU
cache Cache, caching, or caché may refer to: Places United States * Cache, Idaho, an unincorporated community * Cache, Illinois, an unincorporated community * Cache, Oklahoma, a city in Comanche County * Cache, Utah, Cache County, Utah * Cache Count ...
.


Full-text search

A special kind of trie, called a
suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
, can be used to index all
suffix In linguistics, a suffix is an affix which is placed after the stem of a word. Common examples are case endings, which indicate the grammatical case of nouns, adjectives, and verb endings, which form the conjugation of verbs. Suffixes can carry g ...
es in a text to carry out fast full-text searches.


Web search engines

A specialized kind of trie called a compressed trie, is used in
web search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
s for storing the
indexes Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
- a collection of all searchable words. Each terminal node is associated with a list of
URL A Uniform Resource Locator (URL), colloquially termed as a web address, is a reference to a web resource that specifies its location on a computer network and a mechanism for retrieving it. A URL is a specific type of Uniform Resource Identifi ...
s—called occurrence list—to pages that match the keyword. The trie is stored in the main memory, whereas the occurrence is kept in an external storage, frequently in large clusters, or the in-memory index points to documents stored in an external location.


Bioinformatics

Tries are used in Bioinformatics, notably in
sequence alignment In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Ali ...
software applications such as
BLAST Blast or The Blast may refer to: *Explosion, a rapid increase in volume and release of energy in an extreme manner *Detonation, an exothermic front accelerating through a medium that eventually drives a shock front Film * ''Blast'' (1997 film), ...
, which indexes all the different substring of length ''k'' (called
k-mer In bioinformatics, ''k''-mers are substrings of length k contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which ''k''-mers are composed of nucleotides (''i.e''. A, T, G, ...
s) of a text by storing the positions of their occurrences in a compressed trie sequence databases.


Internet routing

Compressed variants of tries, such as databases for managing
Forwarding Information Base A forwarding information base (FIB), also known as a forwarding table or MAC table, is most commonly used in network bridging, routing, and similar functions to find the proper output network interface controller to which the input interface shou ...
(FIB), are used in storing IP address prefixes within routers and bridges for prefix-based lookup to resolve mask-based operations in
IP routing IP routing is the application of routing methodologies to IP networks. This involves not only protocols and technologies but includes the policies of the worldwide organization and configuration of Internet infrastructure. In each IP network no ...
.


See also

*
Suffix tree In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particu ...
*
Hash trie In computer science, hash trie can refer to: * Hash tree (persistent data structure), a trie used to map hash values to keys * A space-efficient implementation of a sparse trie, in which the descendants of each node may be interleaved in memory. ( ...
*
Hash array mapped trie A hash array mapped trie (HAMT) is an implementation of an associative array that combines the characteristics of a hash table and an array mapped trie. It is a refined version of the more general notion of a Hash tree (persistent data structure), ...
*
Prefix hash tree A prefix hash tree (PHT) is a distributed data structure that enables more sophisticated queries over a distributed hash table (DHT). The prefix hash tree uses the lookup interface of a DHT to construct a trie In computer science, a trie, al ...
* Ctrie *
HAT-trie The HAT-trie is a type of radix trie that uses array nodes to collect individual key–value pairs under radix nodes and hash buckets into an associative array. Unlike a simple hash table, HAT-tries store key–value in an ordered collection. The ...


References


External links


NIST's Dictionary of Algorithms and Data Structures: Trie
{{Strings Trees (data structures) Finite automata Articles with example Python (programming language) code