Linear probing is a scheme in

computer programming Computer programming is the process of performing a particular computation (or more generally, accomplishing a specific computing result), usually by designing and building an executable computer program. Programming involves tasks such as anal ...

for resolving

collisions In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...

hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', ...

data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...

s for maintaining a collection of key–value pairs and looking up the value associated with a given key. It was invented in 1954 by Gene Amdahl,

Elaine M. McGraw Elaine M. McGraw (née Boehme) was an American computer programmer who, together with Arthur Samuel and Gene Amdahl, invented open addressing based hash tables in 1954. After studying economics, McGraw began working as a computer programmer for ...

, and

Arthur Samuel Arthur Lee Samuel (December 5, 1901 – July 29, 1990) was an American pioneer in the field of computer gaming and artificial intelligence. He popularized the term "machine learning" in 1959. The Samuel Checkers-playing Program was among the wo ...

and first analyzed in 1963 by

Donald Knuth Donald Ervin Knuth ( ; born January 10, 1938) is an American computer scientist, mathematician, and professor emeritus at Stanford University. He is the 1974 recipient of the ACM Turing Award, informally considered the Nobel Prize of computer ...

. Along with

quadratic probing Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions in hash tables. Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic polynomial u ...

and

double hashing Double hashing is a computer programming technique used in conjunction with open addressing in hash tables to resolve hash collisions, by using a secondary hash of the key as an offset when a collision occurs. Double hashing with open addressing is ...

, linear probing is a form of open addressing. In these schemes, each cell of a hash table stores a single key–value pair. When the

hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...

causes a collision by mapping a new key to a cell of the hash table that is already occupied by another key, linear probing searches the table for the closest following free location and inserts the new key there. Lookups are performed in the same way, by searching the table sequentially starting at the position given by the hash function, until finding a cell with a matching key or an empty cell. As write, "Hash tables are the most commonly used nontrivial data structures, and the most popular implementation on standard hardware uses linear probing, which is both fast and simple." Linear probing can provide high performance because of its good

locality of reference In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference localit ...

, but is more sensitive to the quality of its hash function than some other collision resolution schemes. It takes constant expected time per search, insertion, or deletion when implemented using a random hash function, a 5-independent hash function, or tabulation hashing. Good results can also be achieved in practice with other hash functions such as MurmurHash.

Operations

Linear probing is a component of open addressing schemes for using a

to solve the dictionary problem. In the dictionary problem, a data structure should maintain a collection of key–value pairs subject to operations that insert or delete pairs from the collection or that search for the value associated with a given key. In open addressing solutions to this problem, the data structure is an

array An array is a systematic arrangement of similar objects, usually in rows and columns. Things called an array include: {{TOC right Music * In twelve-tone and serial composition, the presentation of simultaneous twelve-tone sets such that the ...

(the hash table) whose cells (when nonempty) each store a single key–value pair. A

is used to map each key into the cell of where that key should be stored, typically scrambling the keys so that keys with similar values are not placed near each other in the table. A

hash collision In computer science, a hash collision or hash clash is when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits. Al ...

occurs when the hash function maps a key into a cell that is already occupied by a different key. Linear probing is a strategy for resolving collisions, by placing the new key into the closest following empty cell.

Search

To search for a given key , the cells of are examined, beginning with the cell at index (where is the hash function) and continuing to the adjacent cells , , ..., until finding either an empty cell or a cell whose stored key is . If a cell containing the key is found, the search returns the value from that cell. Otherwise, if an empty cell is found, the key cannot be in the table, because it would have been placed in that cell in preference to any later cell that has not yet been searched. In this case, the search returns as its result that the key is not present in the dictionary.

Insertion

To insert a key–value pair into the table (possibly replacing any existing pair with the same key), the insertion algorithm follows the same sequence of cells that would be followed for a search, until finding either an empty cell or a cell whose stored key is . The new key–value pair is then placed into that cell. If the insertion would cause the load factor of the table (its fraction of occupied cells) to grow above some preset threshold, the whole table may be replaced by a new table, larger by a constant factor, with a new hash function, as in a

dynamic array In computer science, a dynamic array, growable array, resizable array, dynamic table, mutable array, or array list is a random access, variable-size list data structure that allows elements to be added or removed. It is supplied with standard lib ...

. Setting this threshold close to zero and using a high growth rate for the table size leads to faster hash table operations but greater memory usage than threshold values close to one and low growth rates. A common choice would be to double the table size when the load factor would exceed 1/2, causing the load factor to stay between 1/4 and 1/2.

Deletion

It is also possible to remove a key–value pair from the dictionary. However, it is not sufficient to do so by simply emptying its cell. This would affect searches for other keys that have a hash value earlier than the emptied cell, but that are stored in a position later than the emptied cell. The emptied cell would cause those searches to incorrectly report that the key is not present. Instead, when a cell is emptied, it is necessary to search forward through the following cells of the table until finding either another empty cell or a key that can be moved to cell (that is, a key whose hash value is equal to or earlier than ). When an empty cell is found, then emptying cell is safe and the deletion process terminates. But, when the search finds a key that can be moved to cell , it performs this move. This has the effect of speeding up later searches for the moved key, but it also empties out another cell, later in the same block of occupied cells. The search for a movable key continues for the new emptied cell, in the same way, until it terminates by reaching a cell that was already empty. In this process of moving keys to earlier cells, each key is examined only once. Therefore, the time to complete the whole process is proportional to the length of the block of occupied cells containing the deleted key, matching the running time of the other hash table operations. Alternatively, it is possible to use a lazy deletion strategy in which a key–value pair is removed by replacing the value by a special

flag value In computer programming, a sentinel value (also referred to as a flag value, trip value, rogue value, signal value, or dummy data) is a special value in the context of an algorithm which uses its presence as a condition of termination, typically i ...

indicating a deleted key. However, these flag values will contribute to the load factor of the hash table. With this strategy, it may become necessary to clean the flag values out of the array and rehash all the remaining key–value pairs once too large a fraction of the array becomes occupied by deleted keys.

Properties

Linear probing provides good

, which causes it to require few uncached memory accesses per operation. Because of this, for low to moderate load factors, it can provide very high performance. However, compared to some other open addressing strategies, its performance degrades more quickly at high load factors because of primary clustering, a tendency for one collision to cause more nearby collisions. Additionally, achieving good performance with this method requires a higher-quality hash function than for some other collision resolution schemes. When used with low-quality hash functions that fail to eliminate nonuniformities in the input distribution, linear probing can be slower than other open-addressing strategies such as

, which probes a sequence of cells whose separation is determined by a second hash function, or

, where the size of each step varies depending on its position within the probe sequence.

Analysis

Using linear probing, dictionary operations can be implemented in constant expected time. In other words, insert, remove and search operations can be implemented in

O(1) Big ''O'' notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by Paul Bachmann, Edmund Land ...

, as long as the load factor of the hash table is a constant strictly less than one. In more detail, the time for any particular operation (a search, insertion, or deletion) is proportional to the length of the contiguous block of occupied cells at which the operation starts. If all starting cells are equally likely, in a hash table with cells, then a maximal block of occupied cells will have probability of containing the starting location of a search, and will take time whenever it is the starting location. Therefore, the expected time for an operation can be calculated as the product of these two terms, , summed over all of the maximal blocks of contiguous cells in the table. A similar sum of squared block lengths gives the expected time bound for a random hash function (rather than for a random starting location into a specific state of the hash table), by summing over all the blocks that could exist (rather than the ones that actually exist in a given state of the table), and multiplying the term for each potential block by the probability that the block is actually occupied. That is, defining to be the event that there is a maximal contiguous block of occupied cells of length beginning at index , the expected time per operation is :

E = O(1) + \sum_^N \sum_^n O(k^2/N) \operatorname operatorname(i,k)

This formula can be simplified by replacing by a simpler necessary condition , the event that at least elements have hash values that lie within a block of cells of length . After this replacement, the value within the sum no longer depends on , and the factor cancels the terms of the outer summation. These simplifications lead to the bound :

E \le O(1) + \sum_^n O(k^2) \operatorname operatorname(k)

But by the multiplicative form of the

Chernoff bound In probability theory, the Chernoff bound gives exponentially decreasing bounds on tail distributions of sums of independent random variables. Despite being named after Herman Chernoff, the author of the paper it first appeared in, the result is d ...

, when the load factor is bounded away from one, the probability that a block of length contains at least hashed values is exponentially small as a function of , causing this sum to be bounded by a constant independent of . It is also possible to perform the same analysis using

Stirling's approximation In mathematics, Stirling's approximation (or Stirling's formula) is an approximation for factorials. It is a good approximation, leading to accurate results even for small values of n. It is named after James Stirling, though a related but less p ...

instead of the Chernoff bound to estimate the probability that a block contains exactly hashed values. In terms of the load factor , the expected time for a successful search is , and the expected time for an unsuccessful search (or the insertion of a new key) is . For constant load factors, with high probability, the longest probe sequence (among the probe sequences for all keys stored in the table) has logarithmic length.

Choice of hash function

Because linear probing is especially sensitive to unevenly distributed hash values, it is important to combine it with a high-quality hash function that does not produce such irregularities. The analysis above assumes that each key's hash is a random number independent of the hashes of all the other keys. This assumption is unrealistic for most applications of hashing. However, random or

pseudorandom A pseudorandom sequence of numbers is one that appears to be statistically random, despite having been produced by a completely deterministic and repeatable process. Background The generation of random numbers has many uses, such as for random ...

hash values may be used when hashing objects by their identity rather than by their value. For instance, this is done using linear probing by the IdentityHashMap class of the Java collections framework. The hash value that this class associates with each object, its identityHashCode, is guaranteed to remain fixed for the lifetime of an object but is otherwise arbitrary. Because the identityHashCode is constructed only once per object, and is not required to be related to the object's address or value, its construction may involve slower computations such as the call to a random or pseudorandom number generator. For instance, Java 8 uses an Xorshift pseudorandom number generator to construct these values. For most applications of hashing, it is necessary to compute the hash function for each value every time that it is hashed, rather than once when its object is created. In such applications, random or pseudorandom numbers cannot be used as hash values, because then different objects with the same value would have different hashes. And

cryptographic hash function A cryptographic hash function (CHF) is a hash algorithm (a map of an arbitrary binary string to a binary string with fixed size of n bits) that has special properties desirable for cryptography: * the probability of a particular n-bit output ...

s (which are designed to be computationally indistinguishable from truly random functions) are usually too slow to be used in hash tables. Instead, other methods for constructing hash functions have been devised. These methods compute the hash function quickly, and can be proven to work well with linear probing. In particular, linear probing has been analyzed from the framework of -independent hashing, a class of hash functions that are initialized from a small random seed and that are equally likely to map any -tuple of distinct keys to any -tuple of indexes. The parameter can be thought of as a measure of hash function quality: the larger is, the more time it will take to compute the hash function but it will behave more similarly to completely random functions. For linear probing, 5-independence is enough to guarantee constant expected time per operation, while some 4-independent hash functions perform badly, taking up to logarithmic time per operation. Another method of constructing hash functions with both high quality and practical speed is tabulation hashing. In this method, the hash value for a key is computed by using each byte of the key as an index into a table of random numbers (with a different table for each byte position). The numbers from those table cells are then combined by a bitwise

exclusive or Exclusive or or exclusive disjunction is a logical operation that is true if and only if its arguments differ (one is true, the other is false). It is symbolized by the prefix operator J and by the infix operators XOR ( or ), EOR, EXOR, , ...

operation. Hash functions constructed this way are only 3-independent. Nevertheless, linear probing using these hash functions takes constant expected time per operation. Both tabulation hashing and standard methods for generating 5-independent hash functions are limited to keys that have a fixed number of bits. To handle strings or other types of variable-length keys, it is possible to compose a simpler

universal hashing In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantee ...

technique that maps the keys to intermediate values and a higher quality (5-independent or tabulation) hash function that maps the intermediate values to hash table indices. In an experimental comparison, Richter et al. found that the Multiply-Shift family of hash functions (defined as

h_z(x) = (x \cdot z \bmod 2^w) \div 2^

) was "the fastest hash function when integrated with all hashing schemes, i.e., producing the highest throughputs and also of good quality" whereas tabulation hashing produced "the lowest throughput". They point out that each table look-up require several cycles, being more expensive than simple arithmetic operations. They also found MurmurHash to be superior than tabulation hashing: "By studying the results provided by Mult and Murmur, we think that the trade-off for by tabulation (...) is less attractive in practice".

History

The idea of an

associative array In computer science, an associative array, map, symbol table, or dictionary is an abstract data type that stores a collection of (key, value) pairs, such that each possible key appears at most once in the collection. In mathematical terms an ...

that allows data to be accessed by its value rather than by its address dates back to the mid-1940s in the work of

Konrad Zuse Konrad Ernst Otto Zuse (; 22 June 1910 – 18 December 1995) was a German civil engineer, pioneering computer scientist, inventor and businessman. His greatest achievement was the world's first programmable computer; the functional program- ...

and

Vannevar Bush Vannevar Bush ( ; March 11, 1890 – June 28, 1974) was an American engineer, inventor and science administrator, who during World War II headed the U.S. Office of Scientific Research and Development (OSRD), through which almost all wartim ...

, but hash tables were not described until 1953, in an IBM memorandum by Hans Peter Luhn. Luhn used a different collision resolution method, chaining, rather than linear probing. summarizes the early history of linear probing. It was the first open addressing method, and was originally synonymous with open addressing. According to Knuth, it was first used by Gene Amdahl,

(née Boehme), and

in 1954, in an assembler program for the

IBM 701 The IBM 701 Electronic Data Processing Machine, known as the Defense Calculator while in development, was IBM’s first commercial scientific computer and its first series production mainframe computer, which was announced to the public on May ...

computer. The first published description of linear probing is by , who also credits Samuel, Amdahl, and Boehme but adds that "the system is so natural, that it very likely may have been conceived independently by others either before or since that time". Another early publication of this method was by Soviet researcher

Andrey Ershov Andrey Petrovich Yershov (russian: Андре́й Петро́вич Ершо́в; 19 April 1931, Moscow – 8 December 1988, Moscow) was a Soviet computer scientist, notable as a pioneer in systems programming and programming language rese ...

, in 1958. The first theoretical analysis of linear probing, showing that it takes constant expected time per operation with random hash functions, was given by Knuth. Sedgewick calls Knuth's work "a landmark in the analysis of algorithms". Significant later developments include a more detailed analysis of the

probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomeno ...

of the running time, and the proof that linear probing runs in constant time per operation with practically usable hash functions rather than with the idealized random functions assumed by earlier analysis.

References

{{reflist, 30em Search algorithms Hashing