Universal Hashing
   HOME

TheInfoList



OR:

In
mathematics Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in modern mathematics ...
and
computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
, universal hashing (in a
randomized algorithm A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performan ...
or data structure) refers to selecting a
hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of
hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s,
randomized algorithm A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performan ...
s, and
cryptography Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...
.


Introduction

Assume we want to map keys from some universe U into m bins (labelled = \). The algorithm will have to handle some data set S \subseteq U of , S, =n keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from S that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if , U, > m \cdot n, since the adversary may choose S to be precisely the
preimage In mathematics, the image of a function is the set of all output values it may produce. More generally, evaluating a given function f at each element of a given subset A of its domain produces a set, called the "image of A under (or through) ...
of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for ''rehashing'': sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function. The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions H = \ is called a universal family if, \forall x, y \in U, ~ x\ne y: ~~ , \, \le \frac. In other words, any two different keys of the universe collide with probability at most 1/m when the hash function h is drawn uniformly at random from H. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key. Sometimes, the definition is relaxed by a constant factor, only requiring collision probability O(1/m) rather than \leq 1/m. This concept was introduced by Carter and Wegman in 1977, and has found numerous applications in computer science (see, for . If we have an upper bound of \epsilon<1 on the collision probability, we say that we have \epsilon-almost universality. So for example, a universal family has 1/m-almost universality. Many, but not all, universal families have the following stronger uniform difference property: : \forall x,y\in U, ~ x\ne y, when h is drawn randomly from the family H, the difference h(x)-h(y) ~\bmod~ m is uniformly distributed in /math>. Note that the definition of universality is only concerned with whether h(x)-h(y)=0, which counts collisions. The uniform difference property is stronger. (Similarly, a universal family can be XOR universal if \forall x,y\in U, ~ x\ne y, the value h(x) \oplus h(y) ~\bmod~ m is uniformly distributed in /math> where \oplus is the bitwise exclusive or operation. This is only possible if m is a power of two.) An even stronger condition is
pairwise independence In probability theory, a pairwise independent collection of random variables is a set of random variables any two of which are independent. Any collection of mutually independent random variables is pairwise independent, but some pairwise independ ...
: we have this property when \forall x,y\in U, ~ x\ne y we have the probability that x,y will hash to any pair of hash values z_1, z_2 is as if they were perfectly random: P(h(x)=z_1 \land h(y)=z_2)= 1/m^2. Pairwise independence is sometimes called strong universality. Another property is uniformity. We say that a family is uniform if all hash values are equally likely: P(h(x)=z)=1/m for any hash value z. Universality does not imply uniformity. However, strong universality does imply uniformity. Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in /math> to the hash functions. (Similarly, if m is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise independent is sometimes not made. For some applications (such as hash tables), it is important for the least significant bits of the hash values to be also universal. When a family is strongly universal, this is guaranteed: if H is a strongly universal family with m=2^L, then the family made of the functions h \bmod for all h \in H is also strongly universal for L'\leq L. Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function h(x)=x is clearly universal, but the family made of the function h(x)=x \bmod fails to be universal.
UMAC In cryptography, a message authentication code based on universal hashing, or UMAC, is a type of message authentication code (MAC) calculated choosing a hash function from a class of hash functions according to some secret (random) process and ap ...
and Poly1305-AES and several other
message authentication code In cryptography, a message authentication code (MAC), sometimes known as a ''tag'', is a short piece of information used for authenticating a message. In other words, to confirm that the message came from the stated sender (its authenticity) and ...
algorithms are based on universal hashing. In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message. Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as
dynamic perfect hashing In computer science, dynamic perfect hashing is a programming technique for resolving collisions in a hash table data structure.Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a Sparse Table with 0(1) Worst Case Access Time. J. ACM 3 ...
, pick a new hash function every time there is a collision. Other collision resolution schemes, such as
cuckoo hashing Cuckoo hashing is a scheme in computer programming for resolving hash collisions of values of hash functions in a table, with worst-case constant lookup time. The name derives from the behavior of some species of cuckoo, where the cuckoo chick ...
and
2-choice hashing 2-choice hashing, also known as 2-choice chaining, is "a variant of a hash table in which keys are added by hashing with two hash functions. The key is put in the array position with the fewer (colliding) keys. Some collision resolution scheme is ...
, allow a number of collisions before picking a new hash function). A survey of fastest known universal and strongly universal hash functions for integers, vectors, and strings is found in.


Mathematical guarantees

For any fixed set S of n keys, using a universal family guarantees the following properties. # For any fixed x in S, the expected number of keys in the bin h(x) is n/m. When implementing hash tables by
chaining Chaining is a type of intervention that aims to create associations between behaviors in a behavior chain. A behavior chain is a sequence of behaviors that happen in a particular order where the outcome of the previous step in the chain serves as ...
, this number is proportional to the expected running time of an operation involving the key x (for example a query, insertion or deletion). # The expected number of pairs of keys x,y in S with x\ne y that collide (h(x) = h(y)) is bounded above by n(n-1)/2m, which is of order O(n^2/m). When the number of bins, m is chosen linear in n (i.e., is determined by a function in \Omega(n)), the expected number of collisions is O(n). When hashing into n^2 bins, there are no collisions at all with probability at least a half. # The expected number of keys in bins with at least t keys in them is bounded above by 2n/(t-2(n/m)+1). Thus, if the capacity of each bin is capped to three times the average size (t = 3n/m), the total number of keys in overflowing bins is at most O(m). This only holds with a hash family whose collision probability is bounded above by 1/m. If a weaker definition is used, bounding it by O(1/m), this result is no longer true. As the above guarantees hold for any fixed set S, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing. The second and third guarantee are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some O(n) number of collisions. If it observes too many collisions, it chooses another random h from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable.


Constructions

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").


Hashing integers

This section refers to the case of hashing integers that fit in machines words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be \. The original proposal of Carter and Wegman was to pick a prime p \ge , U, and define : h_(x) = ((ax + b)~\bmod ~ p)~\bmod ~ m where a,b are randomly chosen integers modulo p with a \neq 0. (This is a single iteration of a
linear congruential generator A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generat ...
.) To see that H = \ is a universal family, note that h(x) = h(y) only holds when : ax+b \equiv ay + b + i\cdot m \pmod for some integer i between 0 and (p-1)/m. Since p \ge , U, , if x \neq y their difference x-y is nonzero and has an inverse modulo p. Solving for a yields : a \equiv i\cdot m \cdot (x-y)^ \pmod. There are p-1 possible choices for a (since a=0 is excluded) and, varying i in the allowed range, \lfloor (p - 1)/m \rfloor possible non-zero values for the right hand side. Thus the collision probability is : \lfloor (p - 1)/m \rfloor / (p-1) \le ((p-1)/m)/(p-1) = 1/m. Another way to see H is a universal family is via the notion of
statistical distance In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be be ...
. Write the difference h(x) - h(y) as : h(x)-h(y) \equiv (a(x-y)~ \bmod~ p) \pmod. Since x - y is nonzero and a is uniformly distributed in \, it follows that a(x-y) modulo p is also uniformly distributed in \. The distribution of (h(x)-h(y)) ~\bmod~ m is thus almost uniform, up to a difference in probability of \pm 1/p between the samples. As a result, the statistical distance to a uniform family is O(m/p), which becomes negligible when p \gg m. The family of simpler hash functions : h_a(x) = (ax~\bmod~p)~\bmod~m is only ''approximately'' universal: \Pr\ \le 2/m for all x\neq y. Moreover, this analysis is nearly tight; Carter and Wegman show that \Pr\ \ge 2/(m-1) whenever (p-1)~\bmod~ m = 1.


Avoiding modular arithmetic

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997. By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four). The scheme assumes the number of bins is a power of two, m=2^M. Let w be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers a < 2^w (that fit in a word of w bits). To evaluate h_(x), multiply x by a modulo 2^w and then keep the high order M bits as the hash code. In mathematical notation, this is : h_a(x) = (a\cdot x\,\, \bmod\, 2^w)\,\, \mathrm\,\, 2^ and it can be implemented in C-like programming languages by : h_a(x) = (size_t) (a*x) >> (w-M) This scheme does ''not'' satisfy the uniform difference property and is only ''2/m-almost-universal''; for any x\neq y, \Pr\ \le 2/m. To understand the behavior of the hash function, notice that, if ax \bmod 2^w and ay\bmod 2^w have the same highest-order 'M' bits, then a(x-y) \bmod 2^w has either all 1's or all 0's as its highest order M bits (depending on whether ax \bmod 2^w or ay \bmod 2^w is larger). Assume that the least significant set bit of x-y appears on position w-c. Since a is a random odd integer and odd integers have inverses in the
ring Ring may refer to: * Ring (jewellery), a round band, usually made of metal, worn as ornamental jewelry * To make a sound with a bell, and the sound made by a bell :(hence) to initiate a telephone connection Arts, entertainment and media Film and ...
Z_, it follows that a(x-y)\bmod 2^w will be uniformly distributed among w-bit integers with the least significant set bit on position w-c. The probability that these bits are all 0's or all 1's is therefore at most 2/2^M=2/m. On the other hand, if c < M, then higher-order M bits of a(x-y) \bmod 2^w contain both 0's and 1's, so it is certain that h(x) \ne h(y). Finally, if c=M then bit w-M of a(x-y) \bmod 2^w is 1 and h_a(x)=h_a(y) if and only if bits w-1,\ldots,w-M+1 are also 1, which happens with probability 1/2^=2/m. This analysis is tight, as can be shown with the example x=2^ and y=3x. To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme that picks higher-order bits : h_(x) = ((ax + b) \bmod 2^)\, \mathrm\, 2^w which can be implemented in C-like programming languages by : h_(x) = ((size_t) (a*x+b) & (1<<(w+M) - 1) ) >> w where a is a random positive integer with a < 2^ and b is a random non-negative integer with b < 2^. This requires doing arithmetic on 2w-bit unsigned integers. This version of multiply-shift is due to Dietzfelbinger, and was later analyzed more precisely by Woelfel.


Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector \bar = (x_0, \dots, x_) of k machine words (integers of w bits each). If H is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman) also has the uniform difference property (and hence is universal): : h(\bar) = \left( \sum_^ h_i(x_i) \right)\,\bmod~m, where each h_i\in H is chosen independently at random. If m is a power of two, one may replace summation by exclusive or. , section 5.3 In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family of hash functions. Initialize the hash function with a vector \bar = (a_0, \dots, a_) of random odd integers on 2w bits each. Then if the number of bins is m=2^M for M\le w: : h_(\bar) = \left(\big( \sum_^ x_i \cdot a_i \big) ~\bmod ~ 2^ \right) \,\, \mathrm\,\, 2^. It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice. Initialize the hash function with a vector \bar = (a_0, \dots, a_) of random odd integers on 2w bits each. The following hash family is universal: , Equation 1 : h_(\bar) = \left(\Big( \sum_^ (x_ + a_) \cdot (x_ + a_) \Big) \bmod ~ 2^ \right) \,\, \mathrm\,\, 2^. If double-precision operations are not available, one can interpret the input as a vector of half-words (w/2-bit integers). The algorithm will then use \lceil k/2 \rceil multiplications, where k was the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input. The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as
tabulation hashing In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work b ...
and it provides a practical alternative to multiplication-based universal hashing schemes. Strong universality at high speed is also possible. Initialize the hash function with a vector \bar = (a_0, \dots, a_) of random integers on 2w bits. Compute : h_(\bar)^ = (a_0 + \sum_^ a_ x_ \bmod ~ 2^ ) \,\, \mathrm\,\, 2^w . The result is strongly universal on w bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for w = 32.


Hashing strings

This refers to hashing a ''variable-sized'' vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate h(s) is just the length of s. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality. Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected. Now assume we want to hash \bar = (x_0,\dots, x_\ell), where a good bound on \ell is not known a priori. A universal family proposed by treats the string x as the coefficients of a polynomial modulo a large prime. If x_i \in /math>, let p \ge \max \ be a prime and define: :h_a(\bar) = h_\mathrm \left( \big(\sum_^\ell x_i\cdot a^ \big) \bmod ~p \right), where a \in /math> is uniformly random and h_\mathrm is chosen randomly from a universal family mapping integer domain \mapsto /math>. Using properties of modular arithmetic, above can be computed without producing large numbers for large strings as follows: uint hash(String x, int a, int p) uint h = INITIAL_VALUE for (uint i=0 ; i < x.length ; ++i) h = ((h*a) + x mod p return h This Rabin-Karp rolling hash is based on a
linear congruential generator A linear congruential generator (LCG) is an algorithm that yields a sequence of pseudo-randomized numbers calculated with a discontinuous piecewise linear equation. The method represents one of the oldest and best-known pseudorandom number generat ...
. Above algorithm is also known as ''Multiplicative hash function''. In practice, the ''mod'' operator and the parameter ''p'' can be avoided altogether by simply allowing integer to overflow because it is equivalent to ''mod'' (''Max-Int-Value'' + 1) in many programming languages. Below table shows values chosen to initialize ''h'' and a for some of the popular implementations. Consider two strings \bar, \bar and let \ell be length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length \ell. A collision before applying h_\mathrm implies that a is a root of the polynomial with coefficients \bar - \bar. This polynomial has at most \ell roots modulo p, so the collision probability is at most \ell/p. The probability of collision through the random h_\mathrm brings the total collision probability to \frac + \frac. Thus, if the prime p is sufficiently large compared to the length of strings hashed, the family is very close to universal (in
statistical distance In statistics, probability theory, and information theory, a statistical distance quantifies the distance between two statistical objects, which can be two random variables, or two probability distributions or samples, or the distance can be be ...
). Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the
Rabin fingerprint The Rabin fingerprinting scheme is a method for implementing fingerprints using polynomials over a finite field. It was proposed by Michael O. Rabin. Scheme Given an ''n''-bit message ''m''0,...,''m''n-1, we view it as a polynomial of degree ''n''- ...
and the
Buzhash A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value ...
.


Avoiding modular arithmetic

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice: # One chooses the prime p to be close to a power of two, such as a
Mersenne prime In mathematics, a Mersenne prime is a prime number that is one less than a power of two. That is, it is a prime number of the form for some integer . They are named after Marin Mersenne, a French Minim friar, who studied them in the early 17t ...
. This allows arithmetic modulo p to be implemented without division (using faster operations like addition and shifts). For instance, on modern architectures one can work with p = 2^-1, while x_i's are 32-bit values. # One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the \lceil k/16 \rceil results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing. # One chooses a power-of-two as the divisor, allowing arithmetic modulo 2^w to be implemented without division (using faster operations of
bit masking In computer science, a mask or bitmask is data that is used for bitwise operations, particularly in a bit field. Using a mask, multiple bits in a byte, nibble, word, etc. can be set either on or off, or inverted from on to off (or vice versa) i ...
). The NH hash-function family takes this approach.


See also

*
K-independent hashing In computer science, a family of hash functions is said to be ''k''-independent, ''k''-wise independent or ''k''-universal if selecting a function at random from the family guarantees that the hash codes of any designated ''k'' keys are independe ...
* Rolling hashing *
Tabulation hashing In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations. It was first studied in the form of Zobrist hashing for computer games; later work b ...
* Min-wise independence *
Universal one-way hash function In cryptography a universal one-way hash function (UOWHF, often pronounced "woof"), is a type of universal hash function of particular importance to cryptography. UOWHF's are proposed as an alternative to collision-resistant hash functions (CRHFs ...
*
Low-discrepancy sequence In mathematics, a low-discrepancy sequence is a sequence with the property that for all values of ''N'', its subsequence ''x''1, ..., ''x'N'' has a low discrepancy. Roughly speaking, the discrepancy of a sequence is low if the proportion of poi ...
*
Perfect hashing In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function. Perfect hash functions may be used to im ...


References


Further reading

* {{cite book , last = Knuth , first = Donald Ervin , author-link = Donald Knuth , title = The Art of Computer Programming, Vol. III: Sorting and Searching , edition = 3rd , year = 1998 , publisher = Addison-Wesley , location = Reading, Mass; London , isbn = 0-201-89685-0


External links


Open Data Structures - Section 5.1.1 - Multiplicative Hashing
Pat Morin Patrick Ryan Morin is a Canadian computer scientist specializing in computational geometry and data structures. He is a professor in the School of Computer Science at Carleton University. Education and career Morin was educated at Carleton Univers ...
Cryptographic hash functions Hashing Search algorithms Computational complexity theory