HOME

TheInfoList



OR:

Double hashing is a
computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as ana ...
technique used in conjunction with
open addressing Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternative locations in the array (the ''probe sequence'') until either the t ...
in
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s to resolve
hash collision In computer science, a hash collision or hash clash is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits. Al ...
s, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is a classical data structure on a table T. The double hashing technique uses one hash value as an index into the table and then repeatedly steps forward an interval until the desired value is located, an empty location is reached, or the entire table has been searched; but this interval is set by a second, independent
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
. Unlike the alternative collision-resolution methods of
linear probing Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene ...
and
quadratic probing Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions in hash tables. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial unti ...
, the interval depends on the data, so that values mapping to the same location have different bucket sequences; this minimizes repeated collisions and the effects of clustering. Given two random, uniform, and independent hash functions h_1 and h_2, the ith location in the bucket sequence for value k in a hash table of , T, buckets is: h(i,k)=(h_1(k) + i \cdot h_2(k))\bmod, T, . Generally, h_1 and h_2 are selected from a set of
universal hash In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees ...
functions; h_1 is selected to have a range of \ and h_2 to have a range of \. Double hashing approximates a random distribution; more precisely, pair-wise independent hash functions yield a probability of (n/, T, )^2 that any pair of keys will follow the same bucket sequence.


Selection of h2(k)

The secondary hash function h_2(k) should have several characteristics: *it should never yield an index of zero *it should cycle through the whole table *it should be very fast to compute *it should be pair-wise independent of h_1(k) *The distribution characteristics of h_2 are irrelevant. It is analogous to a random-number generator. * All h_2(k) be ''relatively prime'' to , ''T'', . In practice: * If division hashing is used for both functions, the divisors are chosen as primes. * If the ''T'' is a power of 2, the first and last requirements are usually satisfied by making h_2(k) always return an odd number. This has the side effect of doubling the chance of collision due to one wasted bit.


Analysis

Let n be the number of elements stored in T, then T's load factor is \alpha = n/, T, . That is, start by randomly, uniformly and independently selecting two
universal hash In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees ...
functions h_1 and h_2 to build a double hashing table T. All elements are put in T by double hashing using h_1 and h_2. Given a key k, the (i+1)-st hash location is computed by: h(i,k) = ( h_1(k) + i \cdot h_2(k) ) \bmod , T, . Let T have fixed load factor \alpha: 1 > \alpha > 0. Bradford and Katehakis showed the expected number of probes for an unsuccessful search in T, still using these initially chosen hash functions, is \tfrac regardless of the distribution of the inputs. Pair-wise independence of the hash functions suffices. Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The usual heuristic is to limit the table loading to 75% of capacity. Eventually, rehashing to a larger size will be necessary, as with all other open addressing schemes.


Variants

Peter Dillinger's PhD thesis points out that double hashing produces unwanted equivalent hash functions when the hash functions are treated as a set, as in
Bloom filter A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
s: If h_2(y) = -h_2(x) and h_1(y) = h_1(x) + k\cdot h_2(x), then h(i, y) = h(k - i, x) and the sets of hashes \left\ = \left\ are identical. This makes a collision twice as likely as the hoped-for 1/, T, ^2. There are additionally a significant number of mostly-overlapping hash sets; if h_2(y) = h_2(x) and h1(y) = h_1(x) \pm h_2(x), then h(i, y) = h(i\pm 1, x), and comparing additional hash values (expanding the range of i) is of no help.


Triple hashing

Adding a quadratic term i^2, i(i+1)/2 (a
triangular number A triangular number or triangle number counts objects arranged in an equilateral triangle. Triangular numbers are a type of figurate number, other examples being square numbers and cube numbers. The th triangular number is the number of dots in ...
) or even i^2 \cdot h_3(x) (triple hashing)Alternatively defined with the triangular number, as in Dillinger 2004. to the hash function improves the hash function somewhat but does not fix this problem; if: : h_1(y) = h_1(x) + k \cdot h_2(x) + k^2 \cdot h_3(x), : h_2(y) = -h_2(x) - 2k \cdot h_3(x), and : h_3(y) = h_3(x). then : \begin h(k-i, y) &= h_1(y) + (k - i) \cdot h_2(y) + (k-i)^2 \cdot h_3(y) \\ &= h_1(y) + (k - i) (-h_2(x) - 2k h_3(x)) + (k-i)^2 h_3(x) \\ &= \ldots \\ &= h_1(x) + k h_2(x) + k^2 h_3(x) + (i - k) h_2(x) + (i^2 - k^2) h_3(x) \\ &= h_1(x) + i h_2(x) + i^2 h_3(x) \\ &= h(i, x). \\ \end


Enhanced double hashing

Adding a cubic term i^3 or (i^3-i)/6 (a
tetrahedral number A tetrahedral number, or triangular pyramidal number, is a figurate number that represents a pyramid with a triangular base and three sides, called a tetrahedron. The th tetrahedral number, , is the sum of the first triangular numbers, that is, ...
), does solve the problem, a technique known as enhanced double hashing. This can be computed efficiently by forward differencing: struct key; /// Opaque /// Use other data types when needed. (Must be unsigned for guaranteed wrapping.) extern unsigned int h1(struct key const *), h2(struct key const *); /// Calculate k hash values from two underlying hash functions /// h1() and h2() using enhanced double hashing. On return, /// hashes = h1(x) + i*h2(x) + (i*i*i - i)/6. /// Takes advantage of automatic wrapping (modular reduction) /// of unsigned types in C. void ext_dbl_hash(struct key const *x, unsigned int hashes[], unsigned int n) In addition to rectifying the collision problem, enhanced double hashing also removes double-hashing's numerical restrictions on h_2(x)'s properties, allowing a hash function similar in property to (but still independent of) h_1 to be used.


See also

*
Cuckoo hashing Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick ...
*
2-choice hashing 2-choice hashing, also known as 2-choice chaining, is "a variant of a hash table in which keys are added by hashing with two hash functions. The key is put in the array position with the fewer (colliding) keys. Some collision resolution scheme is ...


References

{{reflist


External links


How Caching Affects Hashing
by Gregory L. Heileman and Wenbin Luo 2005.
klib
a C library that includes double hashing functionality. Search algorithms Hashing