In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, tabulation hashing is a method for constructing
universal families of hash functions by combining
table lookup with
exclusive or
Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false).
It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...
operations. It was first studied in the form of
Zobrist hashing
Zobrist hashing (also referred to as Zobrist keys or Zobrist signatures Bruce Moreland/ref>) is a hash function construction used in computer programs that play abstract board games, such as chess and Go, to implement transposition tables, a speci ...
for computer games; later work by Carter and
Wegman Wegman is a surname. Notable people with the surname include:
* Bill Wegman, Major League Baseball player
* Dorothy Wegman Raphaelson, American dancer, Ziegfeld Girl and vaudeville performer, and novelist
* Edward Wegman, American professor of stat ...
extended this method to arbitrary fixed-length keys. Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.
Despite its simplicity, tabulation hashing has strong theoretical properties that distinguish it from some other hash functions. In particular, it is
3-independent: every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values. However, it is not 4-independent. More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence.
Because of its high degree of independence, tabulation hashing is usable with hashing methods that require a high-quality hash function, including
hopscotch hashing
Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing was introduced by ...
,
cuckoo hashing
Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick ...
, and the
MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how Similarity measure, similar two sets are. The scheme was invented by , and initially ...
technique for estimating the size of set intersections.
Method
The basic idea is as follows:
First, divide the key to be hashed into smaller "blocks" of a chosen length. Then, create a set of
lookup table
In computer science, a lookup table (LUT) is an array that replaces runtime computation with a simpler array indexing operation. The process is termed as "direct addressing" and LUTs differ from hash tables in a way that, to retrieve a value v wi ...
s, one for each block, and fill them with random values. Finally, use the tables to compute a hash value for each block, and combine all of these hashes into a final hash value using the bitwise
exclusive or
Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false).
It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...
operation.
[; .]
More formally:
Let ''p'' be the number of
bit
The bit is the most basic unit of information in computing and digital communications. The name is a portmanteau of binary digit. The bit represents a logical state with one of two possible values. These values are most commonly represente ...
s in a key to be hashed, and ''q'' be the number of bits desired in an output hash function. Choose a block size ''r'' ≤ ''p''; the choice of block size controls the tradeoff between time and memory usage, so it should be made so that the tables are not too large, e.g., so that the tables fit into the computer's
cache memory
In computing, a cache ( ) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewher ...
. Smaller blocks use less memory but slow down the hash function. Compute ''t'' = ceil(''p''/''r''), the number of ''r''-bit blocks needed to represent a key.
Create a two-dimensional 2
''r'' × ''t'' array, ''T'', and fill it with random ''q''-bit numbers. Now ''T'' can be used to compute the hash value ''h''(''x'') of any given key ''x''. To do so, partition ''x'' into ''r''-bit values, where ''x''
0 consists of the lowest ''r'' bits of ''x'', ''x''
1 consists of the next ''r'' bits, etc. For example, if ''r'' = 8, then ''x''
''i'' is just the ''i''th byte of ''x''. Then, use these ''r''-bit and position values as indices into ''T'', and combine the results using the exclusive or operation:
:''h''(''x'') = ''T''
''x''
0] ⊕ ''T''
''x''
1] ⊕ ''T''
''x''
2] ⊕ ... ⊕ ''T''
-1''x''
t-1].
Note that it is not valid to use the same table (e.g. ''T
') for each ''x''
i, since then the hash function would not be able to distinguish between strings with the same ''x''
is, but permuted differently.
Code for a typical example with ''r'' = ''t'' = 8 and ''q'' = ''p'' = 64 is given below.
// Secret table of random numbers
uint64_t T 256];
for (int i = 0; i < 8; i++)
for (int j = 0; j < 256; j++)
T j] = getRandomUInt64();
// Simple Tabulation Hash function
uint64_t hash(uint64_t x)
History
The first instance of tabulation hashing is
Zobrist hashing
Zobrist hashing (also referred to as Zobrist keys or Zobrist signatures Bruce Moreland/ref>) is a hash function construction used in computer programs that play abstract board games, such as chess and Go, to implement transposition tables, a speci ...
, a method for hashing positions in abstract board games such as
chess
Chess is a board game for two players, called White and Black, each controlling an army of chess pieces in their color, with the objective to checkmate the opponent's king. It is sometimes called international chess or Western chess to disti ...
named after
Albert Lindsey Zobrist
Albert Lindsey Zobrist (born February 27, 1942) is an American computer scientist, games researcher, and inventor of the famous Zobrist Hashing, which was published in 1970. He is further author of the first Go program in 1968 as part of his P ...
, who published it in 1970. In this method, a random bitstring is generated for each game feature such as a combination of a chess piece and a square of the chessboard. Then, to hash any game position, the bitstrings for the features of that position are combined by a bitwise exclusive or. The resulting hash value can then be used as an index into a
transposition table
{{no footnotes, date=November 2017
A transposition table is a cache of previously seen positions, and associated evaluations, in a game tree generated by a computer game playing program. If a position recurs via a different sequence of moves, the ...
. Because each move typically changes only a small number of game features, the Zobrist value of the position after a move can be updated quickly from the value of the position before the move, without needing to loop over all of the features of the position.
Tabulation hashing in greater generality, for arbitrary binary values, was later rediscovered by and studied in more detail by .
Universality
define a randomized scheme for generating hash functions to be
universal
Universal is the adjective for universe.
Universal may also refer to:
Companies
* NBCUniversal, a media and entertainment company
** Universal Animation Studios, an American Animation studio, and a subsidiary of NBCUniversal
** Universal TV, a ...
if, for any two keys, the probability that they
collide (that is, they are mapped to the same value as each other) is 1/''m'', where ''m'' is the number of values that the keys can take on. They defined a stronger property in the subsequent paper : a randomized scheme for generating hash functions is
''k''-independent'' if, for every ''k''-tuple of keys, and each possible ''k''-tuple of values, the probability that those keys are mapped to those values is 1/''m''
''k''. 2-independent hashing schemes are automatically universal, and any universal hashing scheme can be converted into a 2-independent scheme by storing a random number ''x'' as part of the initialization phase of the algorithm and adding ''x'' to each hash value. Thus, universality is essentially the same as 2-independence. However, ''k''-independence for larger values of ''k'' is a stronger property, held by fewer hashing algorithms.
As observe, tabulation hashing is 3-independent but not 4-independent. For any single key ''x'', ''T''
0,0">'x''0,0is equally likely to take on any hash value, and the exclusive or of ''T''
0,0">'x''0,0with the remaining table values does not change this property. For any two keys ''x'' and ''y'', ''x'' is equally likely to be mapped to any hash value as before, and there is at least one position ''i'' where ''x
i'' ≠ ''y
i''; the table value ''T''
''i'',''i''">'y''''i'',''i''is used in the calculation of ''h''(''y'') but not in the calculation of ''h''(''x''), so even after the value of ''h''(''x'') has been determined, ''h''(''y'') is equally likely to be any valid hash value. Similarly, for any three keys ''x'', ''y'', and ''z'', at least one of the three keys has a position ''i'' where its value ''z''
''i'' differs from the other two, so that even after the values of ''h''(''x'') and ''h''(''y'') are determined, ''h''(''z'') is equally likely to be any valid hash value.
[; .]
However, this reasoning breaks down for four keys because there are sets of keys ''w'', ''x'', ''y'', and ''z'' where none of the four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes each, and ''w'', ''x'', ''y'', and ''z'' are the four keys that have either zero or one as their byte values, then each byte value in each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation hashing will always satisfy the equation , whereas for a 4-independent hashing scheme the same equation would only be satisfied with probability 1/''m''. Therefore, tabulation hashing is not 4-independent.
Application
Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which universality is sufficient. For instance, in
hash chaining, the expected time per operation is proportional to the sum of collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions, and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation.
However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms. For instance, for
linear probing
Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene ...
, 5-independent hash functions are strong enough to guarantee constant time operation, but there are 4-independent hash functions that fail. Nevertheless, despite only being 3-independent, tabulation hashing provides the same constant-time guarantee for linear probing.
Cuckoo hashing
Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick ...
, another technique for implementing
hash table
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s, guarantees constant time per lookup (regardless of the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a cuckoo hash table for a static set of keys that does not change as the table is used.
Extensions
Although tabulation hashing as described above ("simple tabulation hashing") is only 3-independent, variations of this method can be used to obtain hash functions with much higher degrees of independence.
uses the same idea of using exclusive or operations to combine random values from a table, with a more complicated algorithm based on
expander graph
In graph theory, an expander graph is a sparse graph that has strong connectivity properties, quantified using vertex, edge or spectral expansion. Expander constructions have spawned research in pure and applied mathematics, with several applicati ...
s for transforming the key bits into table indices, to define hashing schemes that are ''k''-independent for any constant or even logarithmic value of ''k''. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive.
provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way.
He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter ''r'', the number of bits per block in the partition of the keys into blocks.
Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys.
studies variations of tabulation hashing suitable for variable-length keys such as character strings. The general type of hashing scheme studied by Lemire uses a single table ''T'' indexed by the value of a block, regardless of its position within the key.
However, the values from this table may be combined by a more complicated function than bitwise exclusive or.
Lemire shows that no scheme of this type can be 3-independent. Nevertheless, he shows that it is still possible to achieve 2-independence. In particular, a tabulation scheme that interprets the values ''T''
''i''">'x''''i''(where ''x''
''i'' is, as before, the ''i''th block of the input) as the coefficients of a
polynomial
In mathematics, a polynomial is an expression consisting of indeterminates (also called variables) and coefficients, that involves only the operations of addition, subtraction, multiplication, and positive-integer powers of variables. An exa ...
over a finite field and then takes the remainder of the resulting polynomial modulo another polynomial, gives a 2-independent hash function.
Mixed Tabulation
Mixed Tabulation hashing (and the less general Twisted Tabulation) were introduced by Dahlgaard and Thorup as a way to strengthen the properties of Tabulation hashing while keeping nearly the same performance.
Mixed tabulation can be seen as a xor'ing a "Double Tabulation" hash function with a simple tabulation hash function.
This turns out to have many nice properties even when parameters are chosen to make the mixed tabulation much faster than double tabulation
The idea is to pick a number
and hash to
bits rather than just
.
This gives
new "derived characters" which are hashed by a second hash function and the two values are xor'ed together.
Formally we have