Bitwise Trie With Bitmap
   HOME

TheInfoList



OR:

A bitwise trie is a special form of
trie In computer science, a trie (, ), also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store t ...
where each node with its child-branches represents a bit sequence of one or more bits of a key. A bitwise trie with bitmap uses a
bitmap In computing, a bitmap (also called raster) graphic is an image formed from rows of different colored pixels. A GIF is an example of a graphics image file that uses a bitmap. As a noun, the term "bitmap" is very often used to refer to a partic ...
to denote valid child branches.


Tries and bitwise tries

A
trie In computer science, a trie (, ), also known as a digital tree or prefix tree, is a specialized search tree data structure used to store and retrieve strings from a dictionary or set. Unlike a binary search tree, nodes in a trie do not store t ...
is a type of
search tree In computer science, a search tree is a tree data structure used for locating specific keys from within a set. In order for a tree to function as a search tree, the key for each node must be greater than any keys in subtrees on the left, and les ...
where – unlike for example a
B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing fo ...
– keys are not stored in the nodes but in the path to leaves. The key is distributed across the tree structure. In a "classic" trie, each node with its child-branches represents one symbol of the alphabet of one position (character) of a key. In bitwise tries, keys are treated as bit-sequence of some binary representation and each node with its child-branches represents the value of a sub-sequence of this bit-sequence to form a binary tree (the sub-sequence contains only one bit) or n-ary tree (the sub-sequence contains multiple bits). To give an example that explains the difference between "classic" tries and bitwise tries: For numbers as keys, the alphabet for a trie could consist of the symbols '0' .. '9' to represent digits of a number in the decimal system and the nodes would have up to 10 possible children. There are multiple straight forward approaches to implement such a trie as physical data structure. To state two: * A node can be represented having an array of child pointers for each symbol of the alphabet \Sigma – an array of 10 pointers per node in the decimal number example. This gives a O(, M, ) lookup time with , M, being the length of the key. But this isn't space efficient as each node preserves space for all possible child symbols even if there's no key that realizes that path. * A node contains a list of (symbol, child pointer) tuples, ordered by symbol to allow a binary search through that list. This has a better space efficiency but the lookup time now is O(, M, \cdot \log , \Sigma, ). An ideal trie has an access time that is independent of the amount of keys stored. These approaches get worse for larger alphabets, if, for example, the key is a string of
Unicode Unicode or ''The Unicode Standard'' or TUS is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 defines 154,998 Char ...
characters. Treating the key as bit-sequence allows to have a fixed cardinality per node.


Bitwise trie with bitmap

Bagwell presented a time and space efficient solution for tries named Array Mapped Tree (AMT). The
Hash array mapped trie A hash array mapped trie (HAMT, ) is an implementation of an associative array that combines the characteristics of a hash table and an array mapped trie. It is a refined version of the more general notion of a hash tree. Operation A HAMT is ...
(HAMT) is based on AMT. The compact trie node representation uses a bitmap to mark every valid branch – a bitwise trie with bitmap. The AMT uses eight 32-bit bitmaps per node to represent a 256-ary trie that is able to represent an 8 bit sequence per node. With 64-Bit-CPUs (
64-bit computing In computer architecture, 64-bit integers, memory addresses, or other data units are those that are 64 bits wide. Also, 64-bit central processing units (CPU) and arithmetic logic units (ALU) are those that are based on processor registers, ...
) a variation is to have a 64-ary trie with only one 64-bit bitmap per node that is able to represent a 6 bit sequence. To determine the index of the child pointer of a node for such a given 6-bit value, the amount of preceding child pointers has to be calculated. It turns out that this can be implemented quite efficiently.


Node traversal

long bitMap = mem odeIdx long bitPos = 1L << value; // 6-bit-value if ((bitMap & bitPos)

0) return false; // not found long childNodeIdx = mem odeIdx + 1 + Long.bitCount(bitMap & (bitPos - 1))
The offset to find the index based on the current node index is the amount of least significant bits set in the bitmap before the target position plus one for the bitmap. The amount of least significant bits set can be calculated efficiently with constant time complexity using simple bit operations and a CTPOP (count population) operation that determines the number of set bits, which is available as Long.bitCount() in Java. CTPOP itself can be implemented quite efficiently using a "bit-hack" and many modern CPUs even provide CTPOP as a dedicated instruction treated by compilers as
intrinsic function In computer software, in compiler theory, an intrinsic function, also called built-in function or builtin function, is a function ( subroutine) available for use in a given programming language whose implementation is handled specially by the com ...
.


CTPOP bit-hack implementation

int bitCount(long x) x -= ((x >>> 1) & 0x5555555555555555L); x = (x & 0x3333333333333333L) + ((x >>> 2) & 0x3333333333333333L); x = (x + (x >>> 4)) & 0x0f0f0f0f0f0f0f0fL; x += (x >>> 8); x += (x >>> 16); x += (x >>> 32); return x & 0x7f; How to use this principle for an universal
index Index (: indexes or indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on the Halo Array in the ...
for
database In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and a ...
and
Information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
applications is described in. A specialized and simplified solution demonstrating these concepts is shown below for an implementation of a 32-Bit-Integer set.


Set implementation example


Physical data structure

In this example implementation for a bitwise trie with bitmap, nodes are placed in an array of long (64-bit) integers. A node is identified by the position (index) in that array. The index of the root node marks the root of the trie. Nodes are allocated from unused space in that array, extending the array if necessary. In addition, nodes, that are replaced, are collected in free lists and their space is recycled. Without this recycling, the data structure can be used to implement a
persistent data structure In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (v ...
by just keeping the previous root index and never overriding existing nodes but always creating a copy of a changed node. Leaf nodes are inlined: Instead of having a child-pointer to a leaf node, the bitmap of the leaf node itself is stored. public class BBTrieSet


Set operations


Contains key

The get method tests, if a key is part of the set. The key is delivered as byte[] where each byte represents one 6-bit bit sequence of the key – so only 6 of the 8 bits per byte are used. public boolean get(byte[] key, int len)


Set (add) key

public boolean set(byte[] key, int len) private long set(long nodeRef, byte[] key, int off, int len)


Clear (remove) key

public boolean clear(byte[] key, int len) public long clear(long nodeRef, byte[] key, int off, int len) }


Set operators

Set operators for intersection (and), union (or) and difference (minus) are feasible using a
flyweight pattern In computer programming, the flyweight software design pattern refers to an Object (computer science), object that minimizes Computer memory, memory usage by sharing some of its data with other similar objects. The flyweight pattern is one of twe ...
as shown below. An interface represents physical nodes and "virtual" result nodes of an operator. Instances of this interface are created on demand during a trie traversal. Compound expressions, involving more than one operator, can be expressed directly by combining these operators as an operator can be used as argument (input) for another operator.


Flyweight interface

public interface BBTrieNode public static class BBTrieNodeMem implements BBTrieNode


Intersection (AND)

The intersection operator is very efficient as it automatically performs pruning even over subexpressions. Nonrelevant child nodes don't have to be accessed because the bitmap and a bitwise AND operation allows to determine the result upfront. For example, calculating \\cap(\\cup\)=\, the subexpression \\cup\=\ would not be materialized as intermediate result. public static class BBTrieAnd implements BBTrieNode


Union (OR)

public static class BBTrieOr implements BBTrieNode


Difference (MINUS)

public static class BBTrieMinus implements BBTrieNode


Ranges

Using the virtual node approach, range queries can be accomplished by intersecting a range generating virtual trie (see below) with another operator. So to determine which numbers of a set, say \, lay in certain range, say 0..50 instead of iterating through the set and checking each entry, this is performed by evaluating \\cap\. public static class BBTrieIntRange implements BBTrieNode


Usage example

The example shows the usage with 32-bit integers as keys. public class BBTrieSetSample


References

{{reflist Trees (data structures) Articles with example Java code