In
mathematics
Mathematics is an area of knowledge that includes the topics of numbers, formulas and related structures, shapes and the spaces in which they are contained, and quantities and their changes. These topics are represented in modern mathematics ...
and
computing
Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, e ...
, universal hashing (in a
randomized algorithm
A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performan ...
or data structure) refers to selecting a
hash function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in
expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of
hash table
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
s,
randomized algorithm
A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performan ...
s, and
cryptography
Cryptography, or cryptology (from grc, , translit=kryptós "hidden, secret"; and ''graphein'', "to write", or ''-logia'', "study", respectively), is the practice and study of techniques for secure communication in the presence of adver ...
.
Introduction
Assume we want to map keys from some universe
into
bins (labelled
). The algorithm will have to handle some data set
of
keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from
that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if
, since the adversary may choose
to be precisely the
preimage
In mathematics, the image of a function is the set of all output values it may produce.
More generally, evaluating a given function f at each element of a given subset A of its domain produces a set, called the "image of A under (or through) ...
of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for ''rehashing'': sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.
The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions
is called a universal family if,
.
In other words, any two different keys of the universe collide with probability at most
when the hash function
is drawn uniformly at random from
. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key.
Sometimes, the definition is relaxed by a constant factor, only requiring collision probability
rather than
. This concept was introduced by Carter and Wegman
[
] in 1977, and has found numerous applications in computer science (see, for .
If we have an upper bound of
on the collision probability, we say that we have
-almost universality. So for example, a universal family has
-almost universality.
Many, but not all, universal families have the following stronger uniform difference property:
:
, when
is drawn randomly from the family
, the difference
is uniformly distributed in