HOME

TheInfoList



OR:

Cuckoo hashing is a scheme in
computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...
for resolving
hash collision In computer science, a hash collision or hash clash is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits. Al ...
s of values of
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
s in a
table Table may refer to: * Table (furniture), a piece of furniture with a flat surface and one or more legs * Table (landform), a flat area of land * Table (information), a data arrangement with rows and columns * Table (database), how the table data ...
, with
worst-case In computer science, best, worst, and average cases of a given algorithm express what the resource usage is ''at least'', ''at most'' and ''on average'', respectively. Usually the resource being considered is running time, i.e. time complexity, b ...
constant lookup time. The name derives from the behavior of some species of
cuckoo Cuckoos are birds in the Cuculidae family, the sole taxon in the order Cuculiformes . The cuckoo family includes the common or European cuckoo, roadrunners, koels, malkohas, couas, coucals and anis. The coucals and anis are sometimes separ ...
, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as
brood parasitism Brood parasites are animals that rely on others to raise their young. The strategy appears among birds, insects and fish. The brood parasite manipulates a host, either of the same or of another species, to raise its young as if it were its own ...
; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.


History

Cuckoo hashing was first described by Rasmus Pagh and
Flemming Friche Rodler Flemming is a surname and a male given name referring, like the more common '' Fleming'', to an inhabitant (or descendant thereof) of Flanders,
in a 2001 conference paper. The paper was awarded the
European Symposium on Algorithms The European Symposium on Algorithms (ESA) is an international conference covering the field of algorithms. It has been held annually since 1993, typically in early Autumn in a different European location each year. Like most theoretical computer ...
Test-of-Time award in 2020.


Operations

Cuckoo hashing is a form of
open addressing Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternative locations in the array (the ''probe sequence'') until either the t ...
in which each non-empty cell of a
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
contains a
key Key or The Key may refer to: Common meanings * Key (cryptography), a piece of information that controls the operation of a cryptography algorithm * Key (lock), device used to control access to places or facilities restricted by a lock * Key (map ...
or key–value pair. A
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
is used to determine the location for each key, and its presence in the table (or the value associated with it) can be found by examining that cell of the table. However, open addressing suffers from
collisions In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great f ...
, which happens when more than one key is mapped to the same cell. The basic idea of cuckoo hashing is to resolve collisions by using two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one of the commonly used variants of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables. It is also possible for both hash functions to provide indexes into a single table.


Lookup

Cuckoo hashing uses two hash tables, T_1 and T_2 and assuming r be the length of each tables, the hash functions for the two tables is defined as, h_1,\ h_2\ :\ \cup \rightarrow \ and \forall x \in S where x be the key and S be the set whose keys are stored in h_1(x) of T_1 or h_2(x) of T_2. The lookup operation is as follows: The
logical or In logic, disjunction is a logical connective typically notated as \lor and read aloud as "or". For instance, the English language sentence "it is raining or it is snowing" can be represented in logic using the disjunctive formula R \lor S ...
(\vee) denotes that, the value of the key x is found in either T_1 or T_2, which is O(1) in worst case.


Deletion

Deletion is performed in O(1) since there isn't involvement of probing—not considering the cost of shrinking operation if table is too sparse.


Insertion

Insertion of a new
item Item may refer to: Organizations * ''Instituto del Tercer Mundo'' (ITeM), the Third World Institute * ITEM club, an economic forecasting group based in the United Kingdom Newspapers * ''The Item'', an American independent, morning newspaper ...
, the first step involves examining if the slot h_1(x) of the table T_1 is occupied; if it is not, the item is inserted at that cell. However, if the slot is occupied, the preoccupied item gets removed—let it be x'—and x is inserted at T_1 _1(x)/math>. The removed item x' is inserted into the table T_2 by following the same procedure; the process continues until an empty position is found to insert the key. To avoid the possible
infinite iteration In computer programming, an infinite loop (or endless loop) is a sequence of instructions that, as written, will continue endlessly, unless an external intervention occurs ("pull the plug"). It may be intentional. Overview This differs from: * ...
in the process loop, a \text is specified such that if the
iterations Iteration is the repetition of a process in order to generate a (possibly unbounded) sequence of outcomes. Each repetition of the process is a single iteration, and the outcome of each iteration is then the starting point of the next iteration. ...
exceeds the fixed threshold, the hash tables—both T_1 and T_2—are rehashed with newer hash functions and the insertion procedure repeats. Following is a
pseudocode In computer science, pseudocode is a plain language description of the steps in an algorithm or another system. Pseudocode often uses structural conventions of a normal programming language, but is intended for human reading rather than machine re ...
for insertion: On lines 10 and 15, the "cuckoo approach" of kicking other keys—which was preoccupied at T_ _(x)/math>—takes place until every key has its own "nest" i.e. the item x is inserted into a spot on either one of the two tables; the notation \leftrightarrow expresses the process of swapping.


Theory

Insertions succeed in expected constant time, even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%. One method of proving this uses the theory of
random graph In mathematics, random graph is the general term to refer to probability distributions over graphs. Random graphs may be described simply by a probability distribution, or by a random process which generates them. The theory of random graphs li ...
s: one may form an
undirected graph In discrete mathematics, and more specifically in graph theory, a graph is a structure amounting to a set of objects in which some pairs of the objects are in some sense "related". The objects correspond to mathematical abstractions called '' v ...
called the "cuckoo graph" that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a
pseudoforest In graph theory, a pseudoforest is an undirected graphThe kind of undirected graph considered here is often called a multigraph or pseudograph, to distinguish it from a simple graph. in which every connected component has at most one cycle. Tha ...
, a graph with at most one cycle in each of its connected components. Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table. When the hash function is chosen randomly, the cuckoo graph is a
random graph In mathematics, random graph is the general term to refer to probability distributions over graphs. Random graphs may be described simply by a probability distribution, or by a random process which generates them. The theory of random graphs li ...
in the
Erdős–Rényi model In the mathematical field of graph theory, the Erdős–Rényi model is either of two closely related models for generating random graphs or the evolution of a random network. They are named after Hungarian mathematicians Paul Erdős and Alfrà ...
. With high probability, for load factor less than 1/2 (corresponding to a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2), the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys. The same theory also proves that the expected size of a connected component of the cuckoo graph is small, ensuring that each insertion takes constant expected time. However, also with high probability, a load factor greater than 1/2 will lead to a
giant component In network theory, a giant component is a connected component of a given random graph that contains a finite fraction of the entire graph's vertices. Giant component in Erdős–Rényi model Giant components are a prominent feature of the ErdŠ...
with two or more cycles, causing the data structure to fail and need to be resized. Since a theoretical random hash function requires too much space for practical usage, an important theoretical question is which practical hash functions suffice for Cuckoo hashing. One approach is to use
k-independent hashing In computer science, a family of hash functions is said to be ''k''-independent, ''k''-wise independent or ''k''-universal if selecting a function at random from the family guarantees that the hash codes of any designated ''k'' keys are independe ...
. In 2009 it was shown that O(\log n)-independence suffices, and at least 6-independence is needed. Another approach is to use
Tabulation hashing In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work b ...
, which is not 6-independent, but was shown in 2012 to have other properties sufficient for Cuckoo hashing. A third approach from 2014 is to slightly modify the cuckoo hashtable with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.


Practice

In practice, cuckoo hashing is about 20–30% slower than
linear probing Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene ...
, which is the fastest of the common approaches. The reason is that cuckoo hashing often causes two cache misses per search, to check the two locations where a key might be stored, while linear probing usually causes only one cache miss per search. However, because of its worst case guarantees on search time, cuckoo hashing can still be valuable when real-time response rates are required. One advantage of cuckoo hashing is its link-list free property, which fits GPU processing well.


Example

The following hash functions are given: h\left(k\right)=k\bmod 11
h'\left(k\right)=\left\lfloor\frac\right\rfloor\bmod 11 The following two tables show the insertion of some example elements. Each column corresponds to the state of the two hash tables over time. The possible insertion locations for each new value are highlighted.


Cycle

If you now attempt to insert the element 6, then you get into a cycle, and fail. In the last row of the table we find the same initial situation as at the beginning again. h\left(6\right)=6\bmod 11=6
h'\left(6\right)=\left\lfloor\frac\right\rfloor\bmod 11=0


Variations

Several variations of cuckoo hashing have been studied, primarily with the aim of improving its space usage by increasing the load factor that it can tolerate to a number greater than the 50% threshold of the basic algorithm. Some of these methods can also be used to reduce the failure rate of cuckoo hashing, causing rebuilds of the data structure to be much less frequent. Generalizations of cuckoo hashing that use more than two alternative hash functions can be expected to utilize a larger part of the capacity of the hash table efficiently while sacrificing some lookup and insertion speed. Using just three hash functions increases the load to 91%. Another generalization of cuckoo hashing, called ''blocked cuckoo hashing'' consists in using more than one key per bucket. Using just 2 keys per bucket permits a load factor above 80%. Another variation of cuckoo hashing that has been studied is ''cuckoo hashing with a stash''. The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure. This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size. However, larger stashes also mean slower searches for keys that are not present or are in the stash. A stash can be used in combination with more than two hash functions or with blocked cuckoo hashing to achieve both high load factors and small failure rates. The analysis of cuckoo hashing with a stash extends to practical hash functions, not just to the random hash function model commonly used in theoretical analysis of hashing. Some people recommend a simplified generalization of cuckoo hashing called skewed-associative cache in some
CPU cache A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which ...
s. Another variation of a cuckoo hash table, called a cuckoo filter, replaces the stored keys of a cuckoo hash table with much shorter fingerprints, computed by applying another hash function to the keys. In order to allow these fingerprints to be moved around within the cuckoo filter, without knowing the keys that they came from, the two locations of each fingerprint may be computed from each other by a bitwise
exclusive or Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false). It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...
operation with the fingerprint, or with a hash of the fingerprint. This data structure forms an approximate set membership data structure with much the same properties as a
Bloom filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
: it can store the members of a set of keys, and test whether a query key is a member, with some chance of
false positive A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result ...
s (queries that are incorrectly reported as being part of the set) but no
false negative A false positive is an error in binary classification in which a test result incorrectly indicates the presence of a condition (such as a disease when the disease is not present), while a false negative is the opposite error, where the test result ...
s. However, it improves on a Bloom filter in multiple respects: its memory usage is smaller by a constant factor, it has better
locality of reference In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference localit ...
, and (unlike Bloom filters) it allows for fast deletion of set elements with no additional storage penalty.


Comparison with related structures

A study by Zukowski et al. has shown that cuckoo hashing is much faster than chained hashing for small,
cache Cache, caching, or caché may refer to: Places United States * Cache, Idaho, an unincorporated community * Cache, Illinois, an unincorporated community * Cache, Oklahoma, a city in Comanche County * Cache, Utah, Cache County, Utah * Cache Count ...
-resident hash tables on modern processors. Kenneth Ross has shown bucketized versions of cuckoo hashing (variants that use buckets that contain more than one key) to be faster than conventional methods also for large hash tables, when space utilization is high. The performance of the bucketized cuckoo hash table was investigated further by Askitis, with its performance compared against alternative hashing schemes. A survey by Mitzenmacher presents open problems related to cuckoo hashing as of 2009.


See also

*
Perfect hashing In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function. Perfect hash functions may be used to im ...
*
Double hashing Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is ...
*
Quadratic probing Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions in hash tables. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial unti ...
*
Hopscotch hashing Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by ...


References


External links


A cool and practical alternative to traditional hash tables
U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
Cuckoo Hashing for Undergraduates, 2006
R. Pagh, 2006.

(Part 1

an

, Michael Mitzenmacher, 2007. * {{cite conference , last=Naor , first=Moni , author2=Segev, Gil , author3=Wieder, Udi , title=History-Independent Cuckoo Hashing , book-title=International Colloquium on Automata, Languages and Programming (ICALP) , place=Reykjavik, Iceland , year=2008 , url=http://www.wisdom.weizmann.ac.il/~naor/PAPERS/cuckoo_hi_abs.html , access-date=2008-07-21
Algorithmic Improvements for Fast Concurrent Cuckoo Hashing
X. Li, D. Andersen, M. Kaminsky, M. Freedman. EuroSys 2014.


Examples


Concurrent high-performance Cuckoo hashtable written in C++

Cuckoo hash map written in C++





Cuckoo hashing for Go
Search algorithms Hashing pl:Tablica mieszajÄ…ca#Haszowanie kuku.C5.82cze