In computer science, a perfect hash function for a set S is a hash function that maps distinct elements in S to a set of integers, with no collisions. In mathematical terms, it is an injective function.
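As a small illustration (not part of the original article), a perfect hash function for a fixed set can be found by brute force: try candidate parameters until an injective mapping is obtained. The multiplier-based form below is one common choice, used here purely as an example.

```python
def find_perfect_hash(keys, m):
    """Search for a multiplier k such that h(x) = (k * x) % m
    is injective (collision-free) on the given integer keys."""
    for k in range(1, 1000):
        values = {(k * x) % m for x in keys}
        if len(values) == len(keys):     # no collisions: injective on the keys
            return k
    return None

S = [10, 22, 37, 40, 52]
k = find_perfect_hash(S, m=8)            # range slightly larger than |S|
assert k is not None
h = lambda x: (k * x) % 8
assert len({h(x) for x in S}) == len(S)  # distinct keys map to distinct slots
```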
Perfect hash functions may be used to implement a lookup table with constant worst-case access time. A perfect hash function can, as any hash function, be used to implement hash tables, with the advantage that no collision resolution has to be implemented. In addition, if the keys are not the data and if it is known that queried keys will be valid, then the keys do not need to be stored in the lookup table, saving space.
A disadvantage of perfect hash functions is that S needs to be known for the construction of the perfect hash function. Non-dynamic perfect hash functions need to be re-constructed if S changes. For frequently changing S, dynamic perfect hash functions may be used at the cost of additional space. The space requirement to store the perfect hash function is in O(n), where n = |S|.
The important performance parameters for perfect hash functions are the evaluation time, which should be constant, the construction time, and the representation size.
Application
A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function. One can then test whether a key is present in S, or look up a value associated with that key, by looking for it at its cell of the table. Each such lookup takes constant time in the worst case.
With perfect hashing, the associated data can be read or written with a single access to the table.
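As a toy illustration of such a lookup table (the hash below is hand-picked for this particular key set, and the keys and associated values are invented for the example):

```python
# Sketch: constant-time lookups via a table indexed by a perfect hash.
KEYS = ["apple", "banana", "cherry"]

def h(key):
    # A perfect hash for this specific set (injectivity verified below).
    return (len(key) + ord(key[0])) % 4

assert len({h(k) for k in KEYS}) == len(KEYS)   # no collisions on KEYS

table = [None] * 4
for key, value in zip(KEYS, [1.25, 0.50, 3.00]):
    table[h(key)] = value

# If queried keys are known to be valid, the keys themselves need not be stored:
assert table[h("banana")] == 0.50
```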
Performance of perfect hash functions
The important performance parameters for perfect hashing are the representation size, the evaluation time, the construction time, and additionally the range requirement m/n (the average number of table cells per key).
The evaluation time can be as fast as O(1), which is optimal.
The construction time needs to be at least O(n), because each element in S needs to be considered, and S contains n elements. This lower bound can be achieved in practice.
The lower bound for the representation size depends on m and n. Let m = (1+ε)n and let h be a perfect hash function. A good approximation for the lower bound is log₂ e − ε log₂ ((1+ε)/ε) bits per element. For minimal perfect hashing, ε = 0, the lower bound is log₂ e ≈ 1.44 bits per element.
Construction
A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S.
The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values. The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn), and a parameter k, and maps each element x of S to the index
: g(x) = (kx mod p) mod n.
If k is chosen randomly, this step is likely to have collisions, but the number of elements nᵢ that are simultaneously mapped to the same index i is likely to be small.
The second level of their construction assigns disjoint ranges of O(nᵢ²) integers to each index i. It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x).
As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n). Additionally, for each value of g(x), there exists a linear modular function that maps the corresponding subset of S into the range associated with that value. Both k, and the second-level functions for each value of g(x), can be found in polynomial time by choosing values randomly until finding one that works.
The hash function itself requires storage space O(n) to store k, p, and all of the second-level linear modular functions. Computing the hash value of a given key x may be performed in constant time by computing g(x), looking up the second-level function associated with g(x), and applying this function to x.
A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n).
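The two-level scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' exact construction: the acceptance threshold of 4n on the sum of squared bucket sizes, and giving each bucket of size b a range of b² slots, are simplifying assumptions.

```python
import random

def fks_build(S, p):
    """Sketch of a two-level (FKS-style) scheme for a set S of integers < p (p prime).
    First level: g(x) = (k*x % p) % n scatters keys into buckets.
    Second level: each bucket of size b gets its own injective map into b*b slots."""
    n = len(S)
    while True:
        k = random.randrange(1, p)
        buckets = {}
        for x in S:
            buckets.setdefault((k * x % p) % n, []).append(x)
        # Accept k when the total second-level space stays linear (here: < 4n).
        if sum(len(b) ** 2 for b in buckets.values()) < 4 * n:
            break
    second = {}
    for i, bucket in buckets.items():
        size = len(bucket) ** 2
        while True:   # rejection-sample a collision-free second-level function
            k2 = random.randrange(1, p)
            if len({(k2 * x % p) % size for x in bucket}) == len(bucket):
                second[i] = (k2, size)
                break
    return k, second

def fks_lookup_slot(x, k, second, p, n):
    i = (k * x % p) % n
    k2, size = second[i]
    return i, (k2 * x % p) % size   # (bucket index, slot within bucket's range)

S = [3, 12, 25, 33, 41, 56, 70, 88]
k, second = fks_build(S, 97)
slots = {fks_lookup_slot(x, k, second, 97, len(S)) for x in S}
assert len(slots) == len(S)         # no two keys share a (bucket, slot) pair
```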
A more recent method for constructing a perfect hash function is described by Belazzougui, Botelho & Dietzfelbinger (2009) as "hash, displace, and compress". Here a first-level hash function g is also used to map elements onto a range of r integers. An element x ∈ S is stored in the bucket B_g(x).
Then, in descending order of size, each bucket's elements are hashed by a hash function of a sequence of independent fully random hash functions (Φ₁, Φ₂, Φ₃, ...), starting with Φ₁. If the hash function does not produce any collisions for the bucket, and the resulting values are not yet occupied by other elements from other buckets, the function is chosen for that bucket. If not, the next hash function in the sequence is tested.
To evaluate the perfect hash function h(x) one only has to save the mapping σ of the bucket index g(x) onto the correct hash function in the sequence, resulting in h(x) = Φ_σ(g(x))(x).
Finally, to reduce the representation size, the values σ(i), 0 ≤ i < r, are compressed into a form that still allows evaluation in O(1).
This approach needs linear time in n for construction, and constant evaluation time. The representation size is in O(n), and depends on the achieved range. For example, with m = 1.23n, Belazzougui, Botelho & Dietzfelbinger (2009) achieved a representation size between 3.03 bits/key and 1.40 bits/key for their given example set of 10 million entries, with lower values needing a higher computation time. The space lower bound in this scenario is 0.88 bits/key.
Pseudocode
algorithm ''hash, displace, and compress'' is
(1) Split S into buckets B_i := g⁻¹({i}) ∩ S, 0 ≤ i < r
(2) Sort buckets B_i in falling order according to size |B_i|
(3) Initialize array T[0...m-1] with 0's
(4) for all i ∈ [r], in the order from (2), do
(5)     for l ← 1,2,...
(6)         repeat forming K_i ← {Φ_l(x) | x ∈ B_i}
(6)         until |K_i| = |B_i| and K_i ∩ {j | T[j] = 1} = ∅
(7)     let σ(i) := the successful l
(8)     for all j ∈ K_i let T[j] := 1
(9) Transform (σ(i))₀≤i<r into compressed form, retaining O(1) access.
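The pseudocode can be transliterated into Python as a sketch. The fully random family Φ_l is simulated here with a seeded cryptographic hash (an assumption of this sketch), and step (9), the compression of σ, is omitted.

```python
import hashlib

def phi(l, x, m):
    """Stand-in for the l-th fully random hash function, mapping x into [0, m)."""
    digest = hashlib.sha256(f"{l}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def hash_displace(S, g, r, m):
    """Construct sigma so that x -> phi(sigma[g(x)], x, m) is perfect on S."""
    # (1) Split S into buckets B_i = g^{-1}({i}) ∩ S.
    buckets = [[] for _ in range(r)]
    for x in S:
        buckets[g(x)].append(x)
    # (2) Process buckets in falling order of size.
    order = sorted(range(r), key=lambda i: -len(buckets[i]))
    T = [0] * m                    # (3) occupancy array
    sigma = [0] * r
    for i in order:                # (4)
        l = 1
        while True:                # (5)-(6) try hash functions until collision-free
            K = {phi(l, x, m) for x in buckets[i]}
            if len(K) == len(buckets[i]) and all(T[j] == 0 for j in K):
                break
            l += 1
        sigma[i] = l               # (7) record the successful l
        for j in K:                # (8) mark the cells as occupied
            T[j] = 1
    return sigma

# Usage: g is a first-level hash onto r buckets (chosen here for the example).
S = list(range(20))
r, m = 5, 25
g = lambda x: x % r
sigma = hash_displace(S, g, r, m)
h = lambda x: phi(sigma[g(x)], x, m)
assert len({h(x) for x in S}) == len(S)   # h is perfect on S
```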
Space lower bounds
The use of O(n) words of information to store the function of Fredman, Komlós & Szemerédi (1984) is near-optimal: any perfect hash function that can be calculated in constant time requires at least a number of bits that is proportional to the size of S.
For minimal perfect hash functions the information theoretic space lower bound is
: log₂ e ≈ 1.44
bits/key.
For perfect hash functions, it is first assumed that the range of h is bounded as m = (1+ε)n. With the formula given by Belazzougui, Botelho & Dietzfelbinger (2009) and for a universe whose size tends towards infinity, the space lower bound is
: log₂ e − ε log₂ ((1+ε)/ε)
bits/key, minus log₂(n) bits overall.
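The lower bound is easy to evaluate numerically. For ε = 0.23 (the range m = 1.23n mentioned earlier) it gives about 0.886 bits/key, consistent with the 0.88 bits/key quoted for that scenario, and for ε = 0 it reduces to log₂ e ≈ 1.44 bits/key:

```python
import math

def space_lower_bound(eps):
    """Bits per key: log2(e) - eps * log2((1 + eps) / eps); log2(e) for eps = 0."""
    if eps == 0:
        return math.log2(math.e)
    return math.log2(math.e) - eps * math.log2((1 + eps) / eps)

print(round(space_lower_bound(0.23), 2))  # → 0.89 (about 0.886 bits/key)
print(round(space_lower_bound(0.0), 2))   # → 1.44 (minimal perfect hashing)
```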
Extensions
Memory address identity
A trivial but pervasive example of perfect hashing is implicit in the (virtual) memory address space of a computer. Since each byte of virtual memory is a distinct, unique, directly addressable storage location, the value of the (starting) address of any object stored in memory can be considered a ''de-facto'' perfect hash of that object into the entire memory address range.
Dynamic perfect hashing
Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated. This is because any modification of the set may cause the hash function to no longer be perfect for the modified set. Solutions which update the hash function any time the set is modified are known as dynamic perfect hashing, but these methods are relatively complicated to implement.
Minimal perfect hash function
A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. Then h is a minimal perfect hash function if and only if h(j) = h(k) implies j = k (injectivity) and there exists an integer a such that the range of h is a..a + |S| − 1. It has been proven that a general purpose minimal perfect hash scheme requires at least log₂ e ≈ 1.44 bits/key. Although this space bound has been achieved by theoretical works, in practice, the best known minimal perfect hashing schemes require roughly 1.56 bits/key if given enough time.
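The definition can be checked mechanically: a function is a minimal perfect hash for S exactly when it is injective on S and its image is a block of |S| consecutive integers. A small illustrative checker (not from the article):

```python
def is_minimal_perfect(h, S):
    """True iff h is injective on S and its image is a..a+|S|-1 for some a."""
    image = {h(x) for x in S}
    if len(image) != len(S):                       # collision: not even perfect
        return False
    return max(image) - min(image) + 1 == len(S)   # consecutive range

S = ["cat", "dog", "emu", "fox"]
assert is_minimal_perfect(len, S) is False                    # lengths collide
assert is_minimal_perfect(lambda w: "cdef".index(w[0]), S) is True
```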
k-perfect hashing
A hash function is k-perfect if at most k elements from S are mapped onto the same value in the range. The "hash, displace, and compress" algorithm can be used to construct k-perfect hash functions by allowing up to k collisions. The changes necessary to accomplish this are minimal, and are shown in the adapted pseudocode below:
(4) for all i ∈ [r], in the order from (2), do
(5)     for l ← 1,2,...
(6)         repeat forming K_i ← {Φ_l(x) | x ∈ B_i}
(6)         until |K_i| = |B_i| and K_i ∩ {j | T[j] = k} = ∅
(7)     let σ(i) := the successful l
(8)     for all j ∈ K_i set T[j] ← T[j] + 1
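Whether a given function is k-perfect can be verified by counting how many keys of S land on each value (a small illustrative check, with an example set chosen for this sketch):

```python
from collections import Counter

def is_k_perfect(h, S, k):
    """True iff at most k elements of S share any single hash value."""
    counts = Counter(h(x) for x in S)
    return max(counts.values()) <= k

S = [3, 7, 12, 18, 25, 31]
h = lambda x: x % 4        # values: 3, 3, 0, 2, 1, 3
assert is_k_perfect(h, S, 3) is True    # at most three keys per value
assert is_k_perfect(h, S, 2) is False   # the value 3 is hit three times
```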
Order preservation
A minimal perfect hash function F is ''order preserving'' if keys are given in some order a₁, a₂, ..., aₙ and for any keys aⱼ and aₖ, j < k implies F(aⱼ) < F(aₖ). In this case, the function value is just the position of each key in the sorted ordering of all of the keys. A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function or cuckoo hashing to store a lookup table of the positions of each key. If the keys to be hashed are themselves stored in a sorted array, it is possible to store a small number of additional bits per key in a data structure that can be used to compute hash values quickly. Order-preserving minimal perfect hash functions require necessarily Ω(n log n) bits to be represented.
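The simple implementation just described, storing each key's rank in a lookup table, can be sketched directly; a Python dict stands in here for the perfect-hash-backed table of positions:

```python
def build_order_preserving_mph(keys):
    """Map each key to its position in the given order.
    A dict stands in for the perfect-hash-backed table of positions."""
    return {key: rank for rank, key in enumerate(keys)}

keys = ["ant", "bee", "cow", "dog"]
h = build_order_preserving_mph(keys)
assert [h[k] for k in keys] == [0, 1, 2, 3]        # minimal: range is 0..n-1
assert all(h[a] < h[b] for a, b in zip(keys, keys[1:]))   # order preserving
```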
Related constructions
A simple alternative to perfect hashing, which also allows dynamic updates, is cuckoo hashing. This scheme maps keys to two or more locations within a range (unlike perfect hashing which maps each key to a single location) but does so in such a way that the keys can be assigned one-to-one to locations to which they have been mapped. Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time.
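The cuckoo scheme can be sketched with two hash functions and displacement on insertion. This is a simplified illustration with hash functions and keys chosen for the example; real implementations rehash or resize the table when displacement cycles.

```python
def cuckoo_insert(table, h1, h2, key, max_kicks=32):
    """Insert key into one of its two candidate cells, evicting as needed."""
    pos = h1(key)
    for _ in range(max_kicks):
        if table[pos] is None:
            table[pos] = key
            return True
        key, table[pos] = table[pos], key            # evict the occupant
        # Move the evicted key to its *other* candidate cell.
        pos = h2(key) if pos == h1(key) else h1(key)
    return False   # displacement cycle; a real implementation would rehash

def cuckoo_lookup(table, h1, h2, key):
    """Worst-case constant time: exactly two cells can hold the key."""
    return table[h1(key)] == key or table[h2(key)] == key

m = 11
h1 = lambda x: x % m
h2 = lambda x: (x // m) % m
table = [None] * m
for key in [3, 14, 27, 38, 53]:
    assert cuckoo_insert(table, h1, h2, key)
assert all(cuckoo_lookup(table, h1, h2, k) for k in [3, 14, 27, 38, 53])
assert not cuckoo_lookup(table, h1, h2, 999)
```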
Further reading
*Richard J. Cichelli. ''Minimal Perfect Hash Functions Made Simple'', Communications of the ACM, Vol. 23, Number 1, January 1980.
* Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. ''Introduction to Algorithms'', Third Edition. MIT Press, 2009. Section 11.5: Perfect hashing, pp. 267, 277–282.
* Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani. "Perfect Hashing for Data Management Applications".
* Fabiano C. Botelho and Nivio Ziviani. "External perfect hashing for very large key sets". 16th ACM Conference on Information and Knowledge Management (CIKM07), Lisbon, Portugal, November 2007.
* Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. "Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses". In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete Mathematics (SODA), New York, 2009. ACM Press.
* Marshall D. Brain and Alan L. Tharp. "Near-perfect Hashing of Large Word Sets". Software--Practice and Experience, vol. 19(10), 967–978, October 1989. John Wiley & Sons.
* Douglas C. Schmidt. "GPERF: A Perfect Hash Function Generator". C++ Report, SIGS, Vol. 10, No. 10, November/December, 1998.
External links
* gperf is an open-source C and C++ perfect hash generator (very fast, but only works for small sets)
* Minimal Perfect Hashing (bob algorithm) by Bob Jenkins
* C Minimal Perfect Hashing Library, open source implementations for many (minimal) perfect hashes (works for big sets)
* Sux4J: open source monotone minimal perfect hashing in Java
* MPHSharp: perfect hashing methods in C#
* BBHash: minimal perfect hash function in header-only C++
* Perfect::Hash: perfect hash generator in Perl that makes C code. Has a "prior art" section worth looking at.
{{DEFAULTSORT:Perfect Hash Function}}
Hashing
Hash functions
Search algorithms