Cuckoo Filter

A cuckoo filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set, as a Bloom filter does. False positive matches are possible, but false negatives are not; in other words, a query returns either "possibly in set" or "definitely not in set". A cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many items and target moderately low false positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters. Cuckoo filters were first described in 2014.


Algorithm description

A cuckoo filter uses a hash table based on cuckoo hashing to store the fingerprints of items. The data structure is broken into buckets of some size b. To insert the fingerprint of an item x, one first computes two potential buckets h_1(x) and h_2(x) where x could go. These buckets are calculated using the formulas

    h_1(x) = \text{hash}(x)
    h_2(x) = h_1(x) \oplus \text{hash}(\text{fingerprint}(x))

Note that, due to the symmetry of the XOR operation, one can compute h_2(x) from h_1(x), and h_1(x) from h_2(x): since h_2(x) = h_1(x) \oplus \text{hash}(\text{fingerprint}(x)), it follows that h_1(x) = h_2(x) \oplus \text{hash}(\text{fingerprint}(x)). This property is what makes it possible to store only fingerprints with cuckoo hashing, because an evicted fingerprint can be moved to its alternate bucket without access to the original item.

The fingerprint of x is placed into one of the buckets h_1(x) and h_2(x). If both buckets are full, then one of the fingerprints in a bucket is evicted using cuckoo hashing, and placed into the other bucket where it can go. If that bucket, in turn, is also full, then that may trigger another eviction, and so on. The hash table can achieve both high utilization (thanks to cuckoo hashing) and compactness (because only fingerprints are stored).

Lookup and delete operations of a cuckoo filter are straightforward. There are at most two buckets to check, h_1(x) and h_2(x). If the fingerprint is found, the appropriate lookup or delete operation can be performed in O(b) time. In practice, b is often a constant.

In order for the hash table to offer theoretical guarantees, the fingerprint size f must be at least \Omega((\log n) / b) bits. Subject to this constraint, cuckoo filters guarantee a false-positive rate of at most \epsilon \le b/2^{f-1}.
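The insertion, lookup, and deletion steps described above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the class name, the method names, the default parameters, the use of SHA-256 as the hash function, and the fixed random seed are all assumptions made for the example.

```python
import hashlib
import random


def _hash(data: bytes) -> int:
    """64-bit hash taken from SHA-256 (an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CuckooFilter:
    """Illustrative cuckoo filter; all names and defaults are hypothetical."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=12, max_kicks=500):
        # A power-of-two bucket count keeps (index XOR offset) inside the table.
        assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
        self.mask = num_buckets - 1
        self.bucket_size = bucket_size      # b in the text
        self.fp_mask = (1 << fp_bits) - 1   # fingerprints are f = fp_bits wide
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(0)         # fixed seed, for reproducibility only

    def _fingerprint(self, item: bytes) -> int:
        # Salted so the fingerprint is not simply the low bits of h_1.
        return _hash(b"fp:" + item) & self.fp_mask

    def _h1(self, item: bytes) -> int:
        return _hash(item) & self.mask

    def _alt(self, index: int, fp: int) -> int:
        # index XOR hash(fingerprint): maps h_1 to h_2 and h_2 back to h_1.
        return index ^ (_hash(fp.to_bytes(8, "big")) & self.mask)

    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._h1(item)
        i2 = self._alt(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)  # the evicted fingerprint's other bucket
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Gave up: in this sketch the fingerprint still in hand is dropped.
        # A real implementation would keep it aside or rebuild the table.
        return False

    def contains(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._h1(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]

    def delete(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._h1(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)  # removes a single copy
                return True
        return False


cf = CuckooFilter()
cf.insert(b"alice")
print(cf.contains(b"alice"))  # True: an inserted item is always reported
cf.delete(b"alice")
print(cf.contains(b"alice"))  # False: the table is empty again
```

Buckets are plain Python lists here for readability; a space-efficient implementation would pack the f-bit fingerprints into a flat array, which is where the compactness discussed above comes from.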


Comparison to Bloom filters

A cuckoo filter is similar to a Bloom filter in that they both are fast and compact, and they may both return false positives as answers to set-membership queries:

* Space-optimal Bloom filters use 1.44\log_2(1/\epsilon) bits of space per inserted key, where \epsilon is the false positive rate. A cuckoo filter requires (\log_2(1/\epsilon) + 1 + \log_2 b)/\alpha bits of space per key, where \alpha is the hash table load factor, which can be 95.5% depending on the cuckoo filter's settings. Note that the information-theoretic lower bound is \log_2(1/\epsilon) bits per item. Both Bloom filters and cuckoo filters with low load can be compressed when not in use.
* On a positive lookup, a space-optimal Bloom filter requires a constant \log_2(1/\epsilon) memory accesses into the bit array, whereas a cuckoo filter requires at most 2b memory accesses, which can be a constant in practice.
* Cuckoo filters have degraded insertion speed after reaching a load threshold, at which point expanding the table is recommended. In contrast, Bloom filters can keep inserting new items, at the cost of a higher false positive rate, before expansion.
* Bloom filters offer fast union and approximate intersection operations using cheap bitwise operations, which can also be applied to compressed Bloom filters if streaming compression is used.
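To make the space comparison concrete, the two per-key formulas above can be evaluated side by side. The sketch below assumes a bucket size of b = 4 (a value chosen for illustration, not given in the text) and the 95.5% load factor quoted above; the function names are hypothetical.

```python
import math


def bloom_bits_per_key(eps: float) -> float:
    """Space-optimal Bloom filter: 1.44 * log2(1/eps) bits per key."""
    return 1.44 * math.log2(1 / eps)


def cuckoo_bits_per_key(eps: float, b: int = 4, alpha: float = 0.955) -> float:
    """Cuckoo filter: (log2(1/eps) + 1 + log2(b)) / alpha bits per key.

    b = 4 is an assumed bucket size; alpha = 0.955 is the load factor
    quoted in the text.
    """
    return (math.log2(1 / eps) + 1 + math.log2(b)) / alpha


def lower_bound_bits_per_key(eps: float) -> float:
    """Information-theoretic lower bound: log2(1/eps) bits per key."""
    return math.log2(1 / eps)


for eps in (0.03, 0.01, 0.001, 0.0001):
    print(
        f"eps={eps:<7} "
        f"bloom={bloom_bits_per_key(eps):6.2f}  "
        f"cuckoo={cuckoo_bits_per_key(eps):6.2f}  "
        f"bound={lower_bound_bits_per_key(eps):6.2f}"
    )
```

Under these assumed parameters, the formulas as stated favor the Bloom filter at \epsilon = 1% and the cuckoo filter at \epsilon = 0.1%, with the crossover at roughly \epsilon = 0.4%. Other choices of b and \alpha shift that point.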


Limitations

* A cuckoo filter can only delete items that are known to have been inserted previously.
* Insertion can fail, in which case rehashing is required, as with other cuckoo hash tables. Note that the amortized insertion complexity is still O(1).
* Cuckoo filters require a fingerprint size f of at least \Omega((\log n) / b) bits. This means that the space per key must be at least \Omega((\log n) / b) bits, even if \epsilon is large. In practice, b is chosen to be large enough that this is not a major issue.
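The first limitation can be illustrated with a small experiment. The sketch below is a deliberately stripped-down table (no bucket capacity, no eviction) with tiny, assumed parameters, so that an item sharing both the fingerprint and the bucket pair of an inserted item is easy to find by brute force. All names, sizes, and the SHA-256-based hash are hypothetical choices for the demonstration.

```python
import hashlib

NUM_BUCKETS = 8  # power of two; tiny on purpose
FP_BITS = 4      # tiny fingerprints make collisions easy to find

table = [[] for _ in range(NUM_BUCKETS)]


def _hash(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def slots(item: bytes):
    """Return (fingerprint, h_1, h_2) for an item."""
    fp = _hash(b"fp:" + item) & ((1 << FP_BITS) - 1)
    i1 = _hash(item) & (NUM_BUCKETS - 1)
    i2 = i1 ^ (_hash(bytes([fp])) & (NUM_BUCKETS - 1))
    return fp, i1, i2


def insert(item: bytes) -> None:
    fp, i1, _ = slots(item)
    table[i1].append(fp)  # no capacity limit or eviction: enough for this demo


def contains(item: bytes) -> bool:
    fp, i1, i2 = slots(item)
    return fp in table[i1] or fp in table[i2]


def delete(item: bytes) -> bool:
    fp, i1, i2 = slots(item)
    for i in (i1, i2):
        if fp in table[i]:
            table[i].remove(fp)
            return True
    return False


def find_impostor(target: bytes) -> bytes:
    """Search for a different item with the same fingerprint and bucket pair."""
    fp_t, i1_t, i2_t = slots(target)
    for k in range(100_000):
        cand = f"other-{k}".encode()
        fp_c, i1_c, i2_c = slots(cand)
        if fp_c == fp_t and {i1_c, i2_c} == {i1_t, i2_t}:
            return cand
    raise RuntimeError("no colliding item found")


insert(b"alice")
impostor = find_impostor(b"alice")  # never inserted
print(contains(impostor))           # True: a false positive
delete(impostor)                    # removes the fingerprint stored for alice
print(contains(b"alice"))           # False: a false negative has been created
```

The filter cannot tell the two items apart, so deleting the item that was never inserted silently removes the fingerprint that belonged to the one that was.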




External links


Probabilistic Filters By Example – A tutorial comparing Cuckoo and Bloom filters.