A hash function is any
function
Function or functionality may refer to:
Computing
* Function key, a type of key on computer keyboards
* Function model, a structured representation of processes in a system
* Function object or functor or functionoid, a concept of object-oriente ...
that can be used to map
data
In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually used to index a fixed-size table called a ''
hash table
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
''. Use of a hash function to index a hash table is called ''hashing'' or ''scatter storage addressing''.
Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval. They require an amount of storage space only fractionally greater than the total space required for the data or records themselves. Hashing is a computationally and storage space-efficient form of data access that avoids the non-constant access time of ordered and unordered lists and structured trees, and the often exponential storage requirements of direct access of state spaces of large or variable-length keys.
Use of hash functions relies on statistical properties of key and function interaction: worst-case behaviour is intolerably bad with a vanishingly small probability, and average-case behaviour can be nearly optimal (minimal
collision
In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...
).
Hash functions are related to (and often confused with)
checksums
A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data ...
,
check digit
A check digit is a form of redundancy check used for error detection on identification numbers, such as bank account numbers, which are used in an application where they will at least sometimes be input manually. It is analogous to a binary parity ...
s,
fingerprints
A fingerprint is an impression left by the friction ridges of a human finger. The recovery of partial fingerprints from a crime scene is an important method of forensic science. Moisture and grease on a finger result in fingerprints on surf ...
,
lossy compression
In information technology, lossy compression or irreversible compression is the class of data compression methods that uses inexact approximations and partial data discarding to represent the content. These techniques are used to reduce data size ...
,
randomization function
Random number generation is a process by which, often by means of a random number generator (RNG), a sequence of numbers or symbols that cannot be reasonably predicted better than by random chance is generated. This means that the particular outc ...
s,
error-correcting codes
In computing, telecommunication, information theory, and coding theory, an error correction code, sometimes error correcting code, (ECC) is used for controlling errors in data over unreliable or noisy communication channels. The central idea is ...
, and
cipher
In cryptography, a cipher (or cypher) is an algorithm for performing encryption or decryption—a series of well-defined steps that can be followed as a procedure. An alternative, less common term is ''encipherment''. To encipher or encode i ...
s. Although the concepts overlap to some extent, each one has its own uses and requirements and is designed and optimized differently. The hash function differs from these concepts mainly in terms of
data integrity
Data integrity is the maintenance of, and the assurance of, data accuracy and consistency over its entire Information Lifecycle Management, life-cycle and is a critical aspect to the design, implementation, and usage of any system that stores, proc ...
.
Overview
A hash function takes a key as an input, which is associated with a datum or record and used to identify it to the data storage and retrieval application. The keys may be fixed length, like an integer, or variable length, like a name. In some cases, the key is the datum itself. The output is a hash code used to index a hash table holding the data or records, or pointers to them.
A hash function may be considered to perform three functions:
*Convert variable-length keys into fixed length (usually machine word length or less) values, by folding them by words or other units using a parity-preserving operator like ADD or XOR.
*Scramble the bits of the key so that the resulting values are uniformly distributed over the keyspace.
*Map the key values into ones less than or equal to the size of the table
A good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize duplication of output values (collisions). Hash functions rely on generating favourable probability distributions for their effectiveness, reducing access time to nearly constant. High table loading factors,
pathological
Pathology is the study of the causal, causes and effects of disease or injury. The word ''pathology'' also refers to the study of disease in general, incorporating a wide range of biology research fields and medical practices. However, when us ...
key sets and poorly designed hash functions can result in access times approaching linear in the number of items in the table.
Hash functions can be designed to give the best worst-case performance,
[This is useful in cases where keys are devised by a malicious agent, for example in pursuit of a DOS attack.] good performance under high table loading factors, and in special cases, perfect (collisionless) mapping of keys into hash codes. Implementation is based on parity-preserving bit operations (XOR and ADD), multiply, or divide. A necessary adjunct to the hash function is a collision-resolution method that employs an auxiliary data structure like
linked lists
In computer science, a linked list is a linear collection of data elements whose order is not given by their physical placement in memory. Instead, each element points to the next. It is a data structure consisting of a collection of nodes which ...
, or systematic probing of the table to find an empty slot.
Hash tables
Hash functions are used in conjunction with
hash tables
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
to store and retrieve data items or data records. The hash function translates the key associated with each datum or record into a hash code, which is used to index the hash table. When an item is to be added to the table, the hash code may index an empty slot (also called a bucket), in which case the item is added to the table there. If the hash code indexes a full slot, some kind of collision resolution is required: the new item may be omitted (not added to the table), or replace the old item, or it can be added to the table in some other location by a specified procedure. That procedure depends on the structure of the hash table: In ''chained hashing'', each slot is the head of a linked list or chain, and items that collide at the slot are added to the chain. Chains may be kept in random order and searched linearly, or in serial order, or as a self-ordering list by frequency to speed up access. In ''open address hashing'', the table is probed starting from the occupied slot in a specified manner, usually by
linear probing
Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene ...
,
quadratic probing, or
double hashing Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is ...
until an open slot is located or the entire table is probed (overflow). Searching for the item follows the same procedure until the item is located, an open slot is found or the entire table has been searched (item not in table).
Specialized uses
Hash functions are also used to build
caches for large data sets stored in slow media. A cache is generally simpler than a hashed search table since any collision can be resolved by discarding or writing back the older of the two colliding items.
Hash functions are an essential ingredient of the
Bloom filter
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in ...
, a space-efficient
probabilistic
Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...
data structure
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
that is used to test whether an
element is a member of a
set
Set, The Set, SET or SETS may refer to:
Science, technology, and mathematics Mathematics
*Set (mathematics), a collection of elements
*Category of sets, the category whose objects and morphisms are sets and total functions, respectively
Electro ...
.
A special case of hashing is known as
geometric hashing
In computer science, geometric hashing is a method for efficiently finding two-dimensional objects represented by discrete points that have undergone an affine transformation, though extensions exist to other object representations and transformat ...
or ''the grid method''. In these applications, the set of all inputs is some sort of
metric space
In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general settin ...
, and the hashing function can be interpreted as a
partition
Partition may refer to:
Computing Hardware
* Disk partitioning, the division of a hard disk drive
* Memory partition, a subdivision of a computer's memory, usually for use by a single job
Software
* Partition (database), the division of a ...
of that space into a grid of ''cells''. The table is often an array with two or more indices (called a ''
grid file In computer science, a grid file or bucket grid is a point access method which splits a space into a non-periodic grid where one or more cells of the grid refer to a small set of points. Grid files (a symmetric data structure) provide an efficient m ...
'', ''grid index'', ''bucket grid'', and similar names), and the hash function returns an index
tuple
In mathematics, a tuple is a finite ordered list (sequence) of elements. An -tuple is a sequence (or ordered list) of elements, where is a non-negative integer. There is only one 0-tuple, referred to as ''the empty tuple''. An -tuple is defi ...
. This principle is widely used in
computer graphics
Computer graphics deals with generating images with the aid of computers. Today, computer graphics is a core technology in digital photography, film, video games, cell phone and computer displays, and many specialized applications. A great de ...
,
computational geometry
Computational geometry is a branch of computer science devoted to the study of algorithms which can be stated in terms of geometry. Some purely geometrical problems arise out of the study of computational geometric algorithms, and such problems ar ...
and many other disciplines, to solve many
proximity problem
Proximity problems is a class of problems in computational geometry which involve estimation of distances between geometric objects.
A subset of these problems stated in terms of points only are sometimes referred to as closest point problems, ...
s in the
plane
Plane(s) most often refers to:
* Aero- or airplane, a powered, fixed-wing aircraft
* Plane (geometry), a flat, 2-dimensional surface
Plane or planes may also refer to:
Biology
* Plane (tree) or ''Platanus'', wetland native plant
* ''Planes' ...
or in
three-dimensional space
Three-dimensional space (also: 3D space, 3-space or, rarely, tri-dimensional space) is a geometric setting in which three values (called ''parameters'') are required to determine the position (geometry), position of an element (i.e., Point (m ...
, such as finding
closest pairs in a set of points, similar shapes in a list of shapes, similar
images
An image is a visual representation of something. It can be two-dimensional, three-dimensional, or somehow otherwise feed into the visual system to convey information. An image can be an artifact, such as a photograph or other two-dimensiona ...
in an
image database, and so on.
Hash tables are also used to implement
associative array
In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In mathematical terms an as ...
s and
dynamic sets.
Properties
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash value in the output range should be generated with roughly the same
probability
Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...
. The reason for this last requirement is that the cost of hashing-based methods goes up sharply as the number of ''collisions''—pairs of inputs that are mapped to the same hash value—increases. If some hash values are more likely to occur than others, a larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
Note that this criterion only requires the value to be ''uniformly distributed'', not ''random'' in any sense. A good randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set of all possible entries.
In other words, if a typical set of records is hashed to table slots, the probability of a bucket receiving many more than records should be vanishingly small. In particular, if is less than , very few buckets should have more than one or two records. A small number of collisions is virtually inevitable, even if is much larger than – see the
birthday problem
In probability theory, the birthday problem asks for the probability that, in a set of randomly chosen people, at least two will share a birthday. The birthday paradox is that, counterintuitively, the probability of a shared birthday exceeds 5 ...
.
In special cases when the keys are known in advance and the key set is static, a hash function can be found that achieves absolute (or collisionless) uniformity. Such a hash function is said to be ''
perfect''. There is no algorithmic way of constructing such a function - searching for one is a
factorial
In mathematics, the factorial of a non-negative denoted is the product of all positive integers less than or equal The factorial also equals the product of n with the next smaller factorial:
\begin
n! &= n \times (n-1) \times (n-2) \t ...
function of the number of keys to be mapped versus the number of table slots they're tapped into. Finding a perfect hash function over more than a very small set of keys is usually computationally infeasible; the resulting function is likely to be more computationally complex than a standard hash function and provides only a marginal advantage over a function with good statistical properties that yields a minimum number of collisions. See
universal hash function
In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees ...
.
Testing and measurement
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the
chi-squared test
A chi-squared test (also chi-square or test) is a statistical hypothesis test used in the analysis of contingency tables when the sample sizes are large. In simpler terms, this test is primarily used to examine whether two categorical variable ...
. This test is a goodness-of-fit measure: it's the actual distribution of items in buckets versus the expected (or uniform) distribution of items. The formula is:
where:
is the number of keys,
is the number of buckets,
is the number of items in bucket
A ratio within one confidence interval (0.95 - 1.05) is indicative that the hash function evaluated has an expected uniform distribution.
Hash functions can have some technical properties that make it more likely that they'll have a uniform distribution when applied. One is the
strict avalanche criterion
In cryptography, the avalanche effect is the desirable property of cryptographic algorithms, typically block ciphers and cryptographic hash functions, wherein if an input is changed slightly (for example, flipping a single bit), the output changes ...
: whenever a single input bit is complemented, each of the output bits changes with a 50% probability. The reason for this property is that selected subsets of the keyspace may have low variability. For the output to be uniformly distributed, a low amount of variability, even one bit, should translate into a high amount of variability (i.e. distribution over the tablespace) in the output. Each bit should change with a probability of 50% because if some bits are reluctant to change, the keys become clustered around those values. If the bits want to change too readily, the mapping is approaching a fixed XOR function of a single bit. Standard tests for this property have been described in the literature. The relevance of the criterion to a multiplicative hash function is assessed here.
Efficiency
In data storage and retrieval applications, the use of a hash function is a trade-off between search time and data storage space. If search time were unbounded, a very compact unordered linear list would be the best medium; if storage space were unbounded, a randomly accessible structure indexable by the key-value would be very large, very sparse, but very fast. A hash function takes a finite amount of time to map a potentially large keyspace to a feasible amount of storage space searchable in a bounded amount of time regardless of the number of keys. In most applications, the hash function should be computable with minimum latency and secondarily in a minimum number of instructions.
Computational complexity varies with the number of instructions required and latency of individual instructions, with the simplest being the bitwise methods (folding), followed by the multiplicative methods, and the most complex (slowest) are the division-based methods.
Because collisions should be infrequent, and cause a marginal delay but are otherwise harmless, it's usually preferable to choose a faster hash function over one that needs more computation but saves a few collisions.
Division-based implementations can be of particular concern because the division is microprogrammed on nearly all chip architectures. Divide (
modulo) by a constant can be inverted to become a multiply by the word-size multiplicative-inverse of the constant. This can be done by the programmer, or by the compiler. Divide can also be reduced directly into a series of shift-subtracts and shift-adds, though minimizing the number of such operations required is a daunting problem; the number of assembly instructions resulting may be more than a dozen, and swamp the pipeline. If the architecture has hardware multiply functional units, the multiply-by-inverse is likely a better approach.
We can allow the table size to not be a power of and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let be significantly less than . Consider a
pseudorandom number generator
A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers whose properties approximate the properties of sequences of random numbers. The PRNG-generate ...
function that is uniform on the interval . A hash function uniform on the interval is . We can replace the division by a (possibly faster) right
bit shift
In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral (considered as a bit string) at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic operati ...
: .
If keys are being hashed repeatedly, and the hash function is costly, computing time can be saved by precomputing the hash codes and storing them with the keys. Matching hash codes almost certainly means the keys are identical. This technique is used for the transposition table in game-playing programs, which stores a 64-bit hashed representation of the board position.
Universality
A ''universal hashing'' scheme is a
randomized algorithm
A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performan ...
that selects a hashing function among a family of such functions, in such a way that the probability of a collision of any two distinct keys is , where is the number of distinct hash values desired—independently of the two keys. Universal hashing ensures (in a probabilistic sense) that the hash function application will behave as well as if it were using a random function, for any distribution of the input data. It will, however, have more collisions than perfect hashing and may require more operations than a special-purpose hash function.
Applicability
A hash function is applicable in a variety of situations.
A hash function that allows only certain table sizes, strings only up to a certain length, or can't accept a seed (i.e. allow double hashing) isn't as useful as one that does.
Deterministic
A hash procedure must be
deterministic
Determinism is a philosophical view, where all events are determined completely by previously existing causes. Deterministic theories throughout the history of philosophy have developed from diverse and sometimes overlapping motives and consi ...
—meaning that for a given input value it must always generate the same hash value. In other words, it must be a
function
Function or functionality may refer to:
Computing
* Function key, a type of key on computer keyboards
* Function model, a structured representation of processes in a system
* Function object or functor or functionoid, a concept of object-oriente ...
of the data to be hashed, in the mathematical sense of the term. This requirement excludes hash functions that depend on external variable parameters, such as
pseudo-random number generator
A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers whose properties approximate the properties of sequences of random numbers. The PRNG-generate ...
s or the time of day. It also excludes functions that depend on the memory address of the object being hashed in cases that the address may change during execution (as may happen on systems that use certain methods of
garbage collection
Waste collection is a part of the process of waste management. It is the transfer of solid waste from the point of use and disposal to the point of treatment or landfill. Waste collection also includes the curbside collection of recyclable m ...
), although sometimes rehashing of the item is possible.
The determinism is in the context of the reuse of the function. For example,
Python
Python may refer to:
Snakes
* Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia
** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia
* Python (mythology), a mythical serpent
Computing
* Python (pro ...
adds the feature that hash functions make use of a randomized seed that is generated once when the Python process starts in addition to the input to be hashed. The Python hash (
SipHash
SipHash is an add–rotate–xor (ARX) based family of pseudorandom functions created by Jean-Philippe Aumasson and Daniel J. Bernstein in 2012, in response to a spate of "hash flooding" denial-of-service attacks (HashDoS) in late 2011.
Althou ...
) is still a valid hash function when used within a single run. But if the values are persisted (for example, written to disk) they can no longer be treated as valid hash values, since in the next run the random value might differ.
Defined range
It is often desirable that the output of a hash function have fixed size (but see below). If, for example, the output is constrained to 32-bit integer values, the hash values can be used to index into an array. Such hashing is commonly used to accelerate data searches.
Producing fixed-length output from variable length input can be accomplished by breaking the input data into chunks of specific size. Hash functions used for data searches use some arithmetic expression that iteratively processes chunks of the input (such as the characters in a string) to produce the hash value.
Variable range
In many applications, the range of hash values may be different for each run of the program or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters—the input data , and the number of allowed hash values.
A common solution is to compute a fixed hash function with a very large range (say, to ), divide the result by , and use the division's
remainder
In mathematics, the remainder is the amount "left over" after performing some computation. In arithmetic, the remainder is the integer "left over" after dividing one integer by another to produce an integer quotient (integer division). In algebr ...
. If is itself a power of , this can be done by
bit masking
In computer science, a mask or bitmask is data that is used for bitwise operations, particularly in a bit field. Using a mask, multiple bits in a byte, nibble, word, etc. can be set either on or off, or inverted from on to off (or vice versa) i ...
and
bit shifting
In computer programming, a bitwise operation operates on a bit string, a bit array or a binary numeral (considered as a bit string) at the level of its individual bits. It is a fast and simple action, basic to the higher-level arithmetic oper ...
. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between and , for any value of that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of , e.g.
odd
Odd means unpaired, occasional, strange or unusual, or a person who is viewed as eccentric.
Odd may also refer to:
Acronym
* ODD (Text Encoding Initiative) ("One Document Does it all"), an abstracted literate-programming format for describing X ...
or
prime number
A prime number (or a prime) is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a composite number. For example, 5 is prime because the only ways ...
s.
Variable range with minimal movement (dynamic hash function)
When the hash function is used to store values in a hash table that outlives the run of the program, and the hash table needs to be expanded or shrunk, the hash table is referred to as a dynamic hash table.
A hash function that will relocate the minimum number of records when the table is resized is desirable.
What is needed is a hash function – where is the key being hashed and is the number of allowed hash values – such that with probability close to .
Linear hashing Linear hashing (LH) is a dynamic data structure which implements a hash table and grows or shrinks one bucket at a time. It was invented by Witold Litwin in 1980.
It has been analyzed by Baeza-Yates and Soza-Pollman.
It is the first in a number ...
and spiral storage are examples of dynamic hash functions that execute in constant time but relax the property of uniformity to achieve the minimal movement property.
Extendible hashing Extendible hashing is a type of hash system which treats a hash as a bit string and uses a trie for bucket lookup. Because of the hierarchical nature of the system, re-hashing is an incremental operation (done one bucket at a time, as needed). Thi ...
uses a dynamic hash function that requires space proportional to to compute the hash function, and it becomes a function of the previous keys that have been inserted. Several algorithms that preserve the uniformity property but require time proportional to to compute the value of have been invented.
A hash function with minimal movement is especially useful in
distributed hash table
A distributed hash table (DHT) is a distributed system that provides a lookup service similar to a hash table: key–value pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. The m ...
s.
Data normalization
In some applications, the input data may contain features that are irrelevant for comparison purposes. For example, when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters. For such data, one must use a hash function that is compatible with the data
equivalence criterion being used: that is, any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by normalizing the input before hashing it, as by upper-casing all letters.
Hashing integer data types
There are several common algorithms for hashing integers. The method giving the best distribution is data-dependent. One of the simplest and most common methods in practice is the modulo division method.
Identity hash function
If the data to be hashed is small enough, one can use the data itself (reinterpreted as an integer) as the hashed value. The cost of computing this ''
identity
Identity may refer to:
* Identity document
* Identity (philosophy)
* Identity (social science)
* Identity (mathematics)
Arts and entertainment Film and television
* ''Identity'' (1987 film), an Iranian film
* ''Identity'' (2003 film), ...
'' hash function is effectively zero. This hash function is
perfect, as it maps each input to a distinct hash value.
The meaning of "small enough" depends on the size of the type that is used as the hashed value. For example, in
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
, the hash code is a 32-bit integer. Thus the 32-bit integer
Integer
and 32-bit floating-point
Float
objects can simply use the value directly; whereas the 64-bit integer
Long
and 64-bit floating-point
Double
cannot use this method.
Other types of data can also use this hashing scheme. For example, when mapping
character string
In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...
s between
upper and lower case, one can use the binary encoding of each character, interpreted as an integer, to index a table that gives the alternative form of that character ("A" for "a", "8" for "8", etc.). If each character is stored in 8 bits (as in extended
ASCII
ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
or
ISO Latin 1
ISO/IEC 8859-1:1998, ''Information technology — 8-bit single-byte coded graphic character sets — Part 1: Latin alphabet No. 1'', is part of the ISO/IEC 8859 series of ASCII-based standard character encodings, first edition published in ...
), the table has only 2
8 = 256 entries; in the case of
Unicode
Unicode, formally The Unicode Standard,The formal version reference is is an information technology Technical standard, standard for the consistent character encoding, encoding, representation, and handling of Character (computing), text expre ...
characters, the table would have 17×2
16 = entries.
The same technique can be used to map
two-letter country codes like "us" or "za" to country names (26
2 = 676 table entries), 5-digit zip codes like 13083 to city names ( entries), etc. Invalid data values (such as the country code "xx" or the zip code 00000) may be left undefined in the table or mapped to some appropriate "null" value.
Trivial hash function
If the keys are uniformly or sufficiently uniformly distributed over the key space, so that the key values are essentially random, they may be considered to be already 'hashed'. In this case, any number of any bits in the key may be dialled out and collated as an index into the hash table. For example, a simple hash function might mask off the least significant ''m'' bits and use the result as an index into a hash table of size 2
m.
Folding
A folding hash code is produced by dividing the input into n sections of m bits, where 2
m is the table size, and using a parity-preserving bitwise operation such as ADD or XOR to combine the sections, followed by a mask or shifts to trim off any excess bits at the high or low end. For example, for a table size of 15 bits and key value of 0x0123456789ABCDEF, there are five sections consisting of 0x4DEF, 0x1357, 0x159E, 0x091A and 0x8. Adding, we obtain 0x7AA4, a 15-bit value.
Mid-squares
A mid-squares hash code is produced by squaring the input and extracting an appropriate number of middle digits or bits. For example, if the input is 123,456,789 and the hash table size 10,000, squaring the key produces 15,241,578,750,190,521, so the hash code is taken as the middle 4 digits of the 17-digit number (ignoring the high digit) 8750. The mid-squares method produces a reasonable hash code if there is not a lot of leading or trailing zeros in the key. This is a variant of multiplicative hashing, but not as good because an arbitrary key is not a good multiplier.
Division hashing
A standard technique is to use a modulo function on the key, by selecting a divisor
which is a prime number close to the table size, so
. The table size is usually a power of 2. This gives a distribution from
. This gives good results over a large number of key sets. A significant drawback of division hashing is that division is microprogrammed on most modern architectures including x86 and can be 10 times slower than multiply. A second drawback is that it won't break up clustered keys. For example, the keys 123000, 456000, 789000, etc. modulo 1000 all map to the same address. This technique works well in practice because many key sets are sufficiently random already, and the probability that a key set will be cyclical by a large prime number is small.
Algebraic coding
Algebraic coding is a variant of the division method of hashing which uses division by a polynomial modulo 2 instead of an integer to map n bits to m bits. In this approach,
and we postulate an
th degree polynomial
. A key
can be regarded as the polynomial
. The remainder using polynomial arithmetic modulo 2 is
. Then
. If
is constructed to have t or fewer non-zero coefficients, then keys which share less than t bits are guaranteed to not collide.
Z a function of k, t and n, a divisor of 2
k-1, is constructed from the GF(2
k) field. Knuth gives an example: for n=15, m=10 and t=7,
. The derivation is as follows:
Let
be the smallest set of integers such that
and
.
[For example, for n=15, k=4, t=6, ]nuth
Nuth (; li, Nut ) is a village and was a municipality in the province of Limburg, situated in the southern Netherlands. In January 2019, the municipality merged with Schinnen and Onderbanken to form Beekdaelen.
The village of Nuth has 4,595 inha ...
/ref>
Define where and where the coefficients of are computed in this field. Then the degree of . Since is a root of whenever is a root, it follows that the coefficients of satisfy so they are all 0 or 1. If is any nonzero polynomial modulo 2 with at most t nonzero coefficients, then is not a multiple of modulo 2.[Knuth conveniently leaves the proof of this to the reader.] If follows that the corresponding hash function will map keys with fewer than t bits in common to unique indices.
The usual outcome is that either n will get large, or twill gets large, or both, for the scheme to be computationally feasible. Therefore, its more suited to hardware or microcode implementation.
Unique permutation hashing
See also unique permutation hashing, which has a guaranteed best worst-case insertion time.
Multiplicative hashing
Standard multiplicative hashing uses the formula which produces a hash value in . The value is an appropriately chosen value that should be relatively prime
In mathematics, two integers and are coprime, relatively prime or mutually prime if the only positive integer that is a divisor of both of them is 1. Consequently, any prime number that divides does not divide , and vice versa. This is equivale ...
to ; it should be large and its binary representation a random mix of 1's and 0's. An important practical special case occurs when and are powers of 2 and is the machine word size
In computing, a word is the natural unit of data used by a particular processor design. A word is a fixed-sized datum handled as a unit by the instruction set or the hardware of the processor. The number of bits or digits in a word (the ''word si ...
. In this case this formula becomes . This is special because arithmetic modulo is done by default in low-level programming languages and integer division by a power of 2 is simply a right-shift, so, in C, for example, this function becomes
unsigned hash(unsigned K)
and for fixed and this translates into a single integer multiplication and right-shift making it one of the fastest hash functions to compute.
Multiplicative hashing is susceptible to a "common mistake" that leads to poor diffusion—higher-value input bits do not affect lower-value output bits. A transmutation on the input which shifts the span of retained top bits down and XORs or ADDs them to the key before the multiplication step corrects for this. So the resulting function looks like:
unsigned hash(unsigned K)
Fibonacci hashing
Fibonacci
Fibonacci (; also , ; – ), also known as Leonardo Bonacci, Leonardo of Pisa, or Leonardo Bigollo Pisano ('Leonardo the Traveller from Pisa'), was an Italian mathematician from the Republic of Pisa, considered to be "the most talented Western ...
hashing is a form of multiplicative hashing in which the multiplier is , where is the machine word length and
(phi) is the golden ratio
In mathematics, two quantities are in the golden ratio if their ratio is the same as the ratio of their sum to the larger of the two quantities. Expressed algebraically, for quantities a and b with a > b > 0,
where the Greek letter phi ( ...
(approximately 5/3). A property of this multiplier is that it uniformly distributes over the table space, blocks of consecutive keys with respect to any block of bits in the key. Consecutive keys within the high bits or low bits of the key (or some other field) are relatively common. The multipliers for various word lengths are:
*16: a=4050310
*32: a=265443576910
*48: a=17396110258977110[Unisys large systems]
*64: a=1140071481932319848510[11400714819323198486 is closer, but the bottom bit is zero, essentially throwing away a bit. The next closest odd number is that given.]
Zobrist hashing
Tabulation hashing, more generally known as ''Zobrist hashing'' after Albert Zobrist, an American computer scientist, is a method for constructing universal families of hash functions by combining table lookup with XOR operations. This algorithm has proven to be very fast and of high quality for hashing purposes (especially hashing of integer-number keys).
Zobrist hashing was originally introduced as a means of compactly representing chess positions in computer game-playing programs. A unique random number was assigned to represent each type of piece (six each for black and white) on each space of the board. Thus a table of 64x12 such numbers is initialized at the start of the program. The random numbers could be any length, but 64 bits was natural due to the 64 squares on the board. A position was transcribed by cycling through the pieces in a position, indexing the corresponding random numbers (vacant spaces were not included in the calculation), and XORing them together (the starting value could be 0, the identity value for XOR, or a random seed). The resulting value was reduced by modulo, folding or some other operation to produce a hash table index. The original Zobrist hash was stored in the table as the representation of the position.
Later, the method was extended to hashing integers by representing each byte in each of 4 possible positions in the word by a unique 32-bit random number. Thus, a table of 28x4 of such random numbers is constructed. A 32-bit hashed integer is transcribed by successively indexing the table with the value of each byte of the plain text integer and XORing the loaded values together (again, the starting value can be the identity value or a random seed). The natural extension to 64-bit integers is by use of a table of 28x8 64-bit random numbers.
This kind of function has some nice theoretical properties, one of which is called ''3-tuple independence'' meaning every 3-tuple of keys is equally likely to be mapped to any 3-tuple of hash values.
Customised hash function
A hash function can be designed to exploit existing entropy in the keys. If the keys have leading or trailing zeros, or particular fields that are unused, always zero or some other constant, or generally vary little, then masking out only the volatile bits and hashing on those will provide a better and possibly faster hash function. Selected divisors or multipliers in the division and multiplicative schemes may make more uniform hash functions if the keys are cyclic or have other redundancies.
Hashing variable-length data
When the data values are long (or variable-length) character string
In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). ...
s—such as personal names, web page addresses, or mail messages—their distribution is usually very uneven, with complicated dependencies. For example, text in any natural language
In neuropsychology, linguistics, and philosophy of language, a natural language or ordinary language is any language that has evolved naturally in humans through use and repetition without conscious planning or premeditation. Natural languages ...
has highly non-uniform distributions of characters
Character or Characters may refer to:
Arts, entertainment, and media Literature
* ''Character'' (novel), a 1936 Dutch novel by Ferdinand Bordewijk
* ''Characters'' (Theophrastus), a classical Greek set of character sketches attributed to The ...
, and character pairs, characteristic of the language. For such data, it is prudent to use a hash function that depends on all characters of the string—and depends on each character in a different way.
Middle and ends
Simplistic hash functions may add the first and last characters of a string along with the length, or form a word-size hash from the middle 4 characters of a string. This saves iterating over the (potentially long) string,
but hash functions that do not hash on all characters of a string can readily become linear due to redundancies, clustering or other pathologies in the key set. Such strategies may be effective as a custom hash function if the structure of the keys is such that either the middle, ends or other fields are zero or some other invariant constant that doesn't differentiate the keys; then the invariant parts of the keys can be ignored.
Character folding
The paradigmatic example of folding by characters is to add up the integer values of all the characters in the string. A better idea is to multiply the hash total by a constant, typically a sizable prime number, before adding in the next character, ignoring overflow. Using exclusive 'or' instead of add is also a plausible alternative. The final operation would be a modulo, mask, or other function to reduce the word value to an index the size of the table. The weakness of this procedure is that information may cluster in the upper or lower bits of the bytes, which clustering will remain in the hashed result and cause more collisions than a proper randomizing hash. ASCII byte codes, for example, have an upper bit of 0 and printable strings don't use the first 32 byte codes, so the information (95-byte codes) is clustered in the remaining bits in an unobvious manner.
The classic approach dubbed the PJW hash based on the work of Peter. J. Weinberger at ATT Bell Labs in the 1970s, was originally designed for hashing identifiers into compiler symbol tables as given in the "Dragon Book". This hash function offsets the bytes 4 bits before ADDing them together. When the quantity wraps, the high 4 bits are shifted out and if non-zero, XORed back into the low byte of the cumulative quantity. The result is a word size hash code to which a modulo or other reducing operation can be applied to produce the final hash index.
Today, especially with the advent of 64-bit word sizes, much more efficient variable-length string hashing by word chunks is available.
Word length folding
Modern microprocessors will allow for much faster processing if 8-bit character strings are not hashed by processing one character at a time, but by interpreting the string as an array of 32 bit or 64-bit integers and hashing/accumulating these "wide word" integer values by means of arithmetic operations (e.g. multiplication by constant and bit-shifting). The final word, which may have unoccupied byte positions, is filled with zeros or a specified "randomizing" value before being folded into the hash. The accumulated hash code is reduced by a final modulo or other operation to yield an index into the table.
Radix conversion hashing
Analogous to the way an ASCII or EBCDIC character string representing a decimal number is converted to a numeric quantity for computing, a variable length string can be converted as . This is simply a polynomial in a radix that takes the components as the characters of the input string of length . It can be used directly as the hash code, or a hash function applied to it to map the potentially large value to the hash table size. The value of is usually a prime number at least large enough to hold the number of different characters in the character set of potential keys. Radix conversion hashing of strings minimizes the number of collisions. Available data sizes may restrict the maximum length of string that can be hashed with this method. For example, a 128-bit double long word will hash only a 26 character alphabetic string (ignoring case) with a radix of 29; a printable ASCII string is limited to 9 characters using radix 97 and a 64-bit long word. However, alphabetic keys are usually of modest length, because keys must be stored in the hash table. Numeric character strings are usually not a problem; 64 bits can count up to , or 19 decimal digits with radix 10.
Rolling hash
In some applications, such as substring search
In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger stri ...
, one can compute a hash function for every -character substring
In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "''the best of''" is a substring of "''It was the best of times''". In contrast, "''Itwastimes''" is a subsequence ...
of a given -character string by advancing a window of width characters along the string; where is a fixed integer, and is greater than . The straightforward solution, which is to extract such a substring at every character position in the text and compute separately, requires a number of operations proportional to . However, with the proper choice of , one can use the technique of rolling hash to compute all those hashes with an effort proportional to where is the number of occurrences of the substring.
The most familiar algorithm of this type is Rabin-Karp with best and average case performance and worst case (in all fairness, the worst case here is gravely pathological: both the text string and substring are composed of a repeated single character, such as ="AAAAAAAAAAA", and ="AAA"). The hash function used for the algorithm is usually the Rabin fingerprint The Rabin fingerprinting scheme is a method for implementing fingerprints using polynomials over a finite field. It was proposed by Michael O. Rabin.
Scheme
Given an ''n''-bit message ''m''0,...,''m''n-1, we view it as a polynomial of degree ''n''- ...
, designed to avoid collisions in 8-bit character strings, but other suitable hash functions are also used.
Analysis
Worst case result for a hash function can be assessed two ways: theoretical and practical. Theoretical worst case is the probability that all keys map to a single slot. Practical worst case is expected longest probe sequence (hash function + collision resolution method). This analysis considers uniform hashing, that is, any key will map to any particular slot with probability , characteristic of universal hash functions.
While Knuth worries about adversarial attack on real time systems, Gonnet has shown that the probability of such a case is "ridiculously small". His representation was that the probability of of keys mapping to a single slot is where is the load factor, .
History
The term ''hash'' offers a natural analogy with its non-technical meaning (to chop up or make a mess out of something), given how hash functions scramble their input data to derive their output. In his research for the precise origin of the term, Donald Knuth
Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer sc ...
notes that, while Hans Peter Luhn
Hans Peter Luhn (July 1, 1896 – August 19, 1964) was a German researcher in the field of computer science and Library & Information Science for IBM, and creator of the Luhn algorithm, KWIC (Key Words In Context) indexing, and Selective ...
of IBM appears to have been the first to use the concept of a hash function in a memo dated January 1953, the term itself would only appear in published literature in the late 1960s, in Herbert Hellerman's ''Digital Computer System Principles'', even though it was already widespread jargon by then.
See also
Notes
References
External links
{{Wiktionary, hash
Calculate hash of a given value
by Timo Denk
The Goulburn Hashing Function
(PDF
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
) by Mayur Patel
Hash Function Construction for Textual and Geometrical Data Retrieval
(PDF
Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
) Latest Trends on Computers, Vol.2, pp. 483–489, CSCC Conference, Corfu, 2010
Search algorithms