
In computing, a hash table, also known as a hash map, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', also called a ''hash code'', into an array of ''buckets'' or ''slots'', from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash ''collisions'', where the hash function generates the same index for more than one key. Such collisions are typically accommodated in some way. In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation.
(Charles E. Leiserson, ''Amortized Algorithms, Table Doubling, Potential Method'', Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms, Fall 2005.)
Hashing is an example of a space-time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element. In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.


History

The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining. Open addressing was later proposed by A. D. Linh, building on Luhn's paper. Around the same time, Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler. Open addressing with linear probing is credited to Amdahl, although Ershov independently had the same idea. The term "open addressing" was coined by W. Wesley Peterson in his article discussing the problem of search in large files. The first published work on hashing with chaining is credited to Arnold Dumey. A theoretical analysis of linear probing was submitted originally by Konheim and Weiss.


Overview

An associative array stores a set of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of unique keys. In the hash table implementation of associative arrays, an array A of length m is partially filled with n elements, where m \ge n. A value x gets stored at an index location A[h(x)], where h is a hash function, and h(x) < m. Under reasonable assumptions, hash tables have better time complexity bounds on search, delete, and insert operations in comparison to self-balancing binary search trees. Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.


Load factor

A ''load factor'' \alpha is a critical statistic of a hash table, and is defined as follows:

\text{load factor}\ (\alpha) = \frac{n}{k},

where
* n is the number of entries occupied in the hash table.
* k is the number of buckets.

The performance of the hash table deteriorates in relation to the load factor \alpha. Therefore a hash table is resized or ''rehashed'' if the load factor \alpha approaches 1. A table is also resized if the load factor drops below \alpha_{\max}/4. Acceptable figures of load factor \alpha include 0.6 and 0.75.
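The resize rule above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed policy: the function names, the default maximum load factor of 0.75, and the "grow"/"shrink"/"keep" labels are assumptions chosen for the example.

```python
def load_factor(n_entries: int, n_buckets: int) -> float:
    """Return the load factor: occupied entries divided by bucket count."""
    if n_buckets <= 0:
        raise ValueError("a table needs at least one bucket")
    return n_entries / n_buckets


def resize_decision(n_entries: int, n_buckets: int, alpha_max: float = 0.75) -> str:
    """Decide whether a table should grow, shrink, or stay as it is.

    alpha_max is an assumed tuning parameter; the shrink threshold of
    alpha_max / 4 follows the rule stated in the text.
    """
    alpha = load_factor(n_entries, n_buckets)
    if alpha > alpha_max:
        return "grow"
    if alpha < alpha_max / 4 and n_buckets > 1:
        return "shrink"
    return "keep"
```

Keeping the shrink threshold well below the grow threshold avoids resizing back and forth when the number of entries hovers around a single boundary.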


Hash function

A hash function h maps the universe U of keys to array indices or slots within the table, h : U \rightarrow \{0, ..., m-1\}, so that h(x) \in \{0, ..., m-1\} for each key x \in S, where S is the set of keys to be stored. The conventional implementations of hash functions are based on the ''integer universe assumption'' that all elements of the table stem from the universe U = \{0, ..., u-1\}, where the bit length of u is confined within the word size of a computer architecture. A perfect hash function h is defined as an injective function such that each element x in S maps to a unique value in \{0, ..., m-1\}. A perfect hash function can be created if all the keys are known ahead of time.


Integer universe assumption

The schemes of hashing used under the ''integer universe assumption'' include hashing by division, hashing by multiplication, universal hashing, dynamic perfect hashing, and static perfect hashing. However, hashing by division is the most commonly used scheme.


Hashing by division

The scheme in hashing by division is as follows: h(x)\ =\ M\, \bmod\, m, where M is the hash digest of x \in S and m is the size of the table.
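As a minimal sketch of hashing by division, the Python function below reduces an integer digest to a slot index with the modulo operator. The function and parameter names are illustrative, and the table size of 97 in the usage note is only an example value.

```python
def hash_by_division(digest: int, table_size: int) -> int:
    """Map an integer hash digest to a slot index in range(table_size)."""
    if table_size <= 0:
        raise ValueError("table size must be positive")
    # Python's % always returns a non-negative result for a positive
    # modulus, so negative digests still land on a valid slot.
    return digest % table_size
```

For example, `hash_by_division(1234567, 97)` selects one of the 97 slots; two digests that differ by a multiple of the table size collide.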


Hashing by multiplication

The scheme in hashing by multiplication is as follows: h(x) = \lfloor m \bigl((M A) \bmod 1\bigr) \rfloor, where M is the hash digest of x, A is a real-valued constant, and m is the size of the table. An advantage of hashing by multiplication is that the value of m is not critical. Although any value of A produces a hash function, Donald Knuth suggests using the golden ratio.
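A minimal Python sketch of hashing by multiplication follows. The constant used here, (\sqrt{5} - 1)/2 \approx 0.618, is the fractional part of the golden ratio and is the value usually associated with Knuth's suggestion; treating it as the default is an assumption of this example, as are the function and parameter names. Floating-point arithmetic keeps the sketch short but loses precision for very large digests, so practical implementations typically use fixed-point integer arithmetic instead.

```python
import math

# Fractional part of the golden ratio, used as the multiplier A (assumed default).
GOLDEN_A = (math.sqrt(5) - 1) / 2


def hash_by_multiplication(digest: int, table_size: int, a: float = GOLDEN_A) -> int:
    """Map a non-negative integer digest to a slot index in range(table_size)."""
    fractional = (digest * a) % 1.0  # keep only the fractional part of digest * A
    return math.floor(table_size * fractional)
```

Because the index is derived from the fractional part of the product, the table size does not need to be prime or otherwise special for this sketch to produce valid indices.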


Choosing a hash function

Uniform distribution of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., a Pearson's chi-squared test for discrete uniform distributions. The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number. For open addressing schemes, the hash function should also avoid ''clustering'', the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior. K-independent hashing offers a way to prove a certain hash function does not have bad keysets for a given type of hash table. A number of K-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove a hash function works, one can then focus on finding the fastest possible such hash function.


Collision resolution

A search algorithm that uses hashing consists of two parts. The first part is computing a hash function which transforms the search key into an array index. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case and is impossible to guarantee for unseen data. Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.


Separate chaining

In separate chaining, a linked list of key–value pairs is built for each array index. Colliding items are chained together through a singly linked list, which can be traversed to access the item with a unique search key. Collision resolution through chaining with linked lists is a common method of implementing hash tables. Let T and x be the hash table and the node respectively; the operations are as follows:

Chained-Hash-Insert(''T'', ''k'')
  ''insert'' ''x'' ''at the head of linked list'' ''T''[''h''(''k'')]
Chained-Hash-Search(''T'', ''k'')
  ''search for an element with key'' ''k'' ''in linked list'' ''T''[''h''(''k'')]
Chained-Hash-Delete(''T'', ''k'')
  ''delete'' ''x'' ''from the linked list'' ''T''[''h''(''k'')]

If the elements are comparable either numerically or lexically, and are inserted into the list in a way that maintains the total order, unsuccessful searches terminate faster.
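The three chained operations can be sketched in Python as follows. This is an illustrative implementation under stated assumptions: the class names `Node` and `ChainedHashTable` are invented for the example, Python's built-in `hash` followed by a modulo stands in for the hash function, and the table has a fixed number of buckets (no resizing). Unlike the pseudocode, insertion first checks whether the key is already present so that keys stay unique.

```python
class Node:
    """One link in a bucket's singly linked list."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [None] * n_buckets  # each entry is the head of a chain

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        i = self._index(key)
        node = self.buckets[i]
        while node is not None:  # update in place if the key already exists
            if node.key == key:
                node.value = value
                return
            node = node.next
        # Otherwise insert the new node at the head of the chain.
        self.buckets[i] = Node(key, value, self.buckets[i])

    def search(self, key):
        node = self.buckets[self._index(key)]
        while node is not None:
            if node.key == key:
                return node.value
            node = node.next
        return None  # key not present

    def delete(self, key):
        i = self._index(key)
        prev, node = None, self.buckets[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.buckets[i] = node.next  # unlink the head
                else:
                    prev.next = node.next  # unlink an interior node
                return True
            prev, node = node, node.next
        return False
```

With only two buckets and ten integer keys, every chain holds several nodes, yet each operation still touches only the one chain selected by the hash.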


Other data structures for separate chaining

If the keys are ordered, it could be efficient to use "self-organizing" concepts such as a self-balancing binary search tree, through which the theoretical worst case could be brought down to O(\log n), although it introduces additional complexities. In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to a guaranteed O(1) in the worst case. In this technique, the buckets of k entries are organized as perfect hash tables with k^2 slots, providing constant worst-case lookup time and low amortized time for insertion. A study shows array-based separate chaining to be 97% more performant than the standard linked list method under heavy load. Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.


Caching and locality of reference

The linked list of a separate chaining implementation may not be cache-conscious because of poor spatial locality (locality of reference): when the nodes of the linked list are scattered across memory, the list traversal during insert and search may entail CPU cache inefficiencies. In cache-conscious variants, a dynamic array, found to be more cache-friendly, is used in the place where a linked list or self-balancing binary search tree is usually deployed for collision resolution through separate chaining, since the contiguous allocation pattern of the array can be exploited by hardware such as cache prefetchers and the translation lookaside buffer, resulting in reduced access time and memory consumption.


Open addressing

Open addressing is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through probing. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some ''probe sequence'', until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search. Well-known probe sequences include:
* Linear probing, in which the interval between probes is fixed (usually 1).
* Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.
* Double hashing, in which the interval between probes is computed by a secondary hash function.

The performance of open addressing may be slower compared to separate chaining, since the probe sequence grows longer as the load factor \alpha approaches 1. The probing results in an infinite loop if the load factor reaches 1, in the case of a completely filled table. The average cost of linear probing depends on the hash function's ability to distribute the elements uniformly throughout the table to avoid clustering, since formation of clusters would result in increased search time.
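A minimal Python sketch of open addressing with linear probing (probe interval 1) is shown below. The class name, the `_EMPTY` sentinel, and the use of Python's built-in `hash` are assumptions made for illustration; deletion and resizing are deliberately left out. The probe sequence is capped at one full pass over the array, so a completely filled table raises an error rather than looping forever.

```python
_EMPTY = object()  # sentinel marking a slot that has never been used


class LinearProbingTable:
    def __init__(self, capacity=8):
        self.keys = [_EMPTY] * capacity
        self.values = [None] * capacity

    def _probe_sequence(self, key):
        """Yield each slot index once, starting at the hashed-to slot."""
        m = len(self.keys)
        start = hash(key) % m
        for step in range(m):
            yield (start + step) % m

    def insert(self, key, value):
        for i in self._probe_sequence(key):
            if self.keys[i] is _EMPTY or self.keys[i] == key:
                self.keys[i] = key
                self.values[i] = value
                return
        raise RuntimeError("table is full")

    def search(self, key):
        for i in self._probe_sequence(key):
            if self.keys[i] is _EMPTY:
                return None  # an unused slot ends an unsuccessful search
            if self.keys[i] == key:
                return self.values[i]
        return None
```

Inserting the keys 0, 4, and 8 into a table of capacity 4 sends all three to slot 0 first, so they end up in consecutive slots 0, 1, and 2, which is a small instance of the clustering described above.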


Caching and locality of reference

Since the slots are located in successive locations, linear probing could lead to better utilization of the CPU cache due to locality of reference, resulting in reduced memory latency.


Other collision resolution techniques based on open addressing


Coalesced hashing

Coalesced hashing is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table. The algorithm is ideally suited for fixed memory allocation. A collision in coalesced hashing is resolved by identifying the largest-indexed empty slot in the hash table and inserting the colliding value into that slot. The bucket is also linked to the inserted node's slot, which contains its colliding hash address.


Cuckoo hashing

Cuckoo hashing is a form of open addressing collision resolution technique which guarantees O(1) worst-case lookup complexity and constant amortized time for insertions. Collisions are resolved by maintaining two hash tables, each having its own hash function: the item being inserted replaces the occupant of the collided slot, and the displaced occupant is moved into the other hash table. The process continues until every key has its own spot in the empty buckets of the tables; if the procedure enters an infinite loop, which is identified by maintaining a threshold loop counter, both hash tables get rehashed with newer hash functions and the procedure continues.
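The displacement loop and the rehash fallback can be sketched in Python as below. Several details are assumptions made for this illustration rather than part of the description above: the class and method names, the use of `hash((salt, key))` with random salts as a stand-in for a family of hash functions, the default threshold of 32 displacements, and the choice to double the capacity during a rehash (the text only requires new hash functions).

```python
import random


class CuckooHashTable:
    def __init__(self, capacity=8, max_kicks=32, seed=0):
        self.capacity = capacity
        self.max_kicks = max_kicks  # threshold loop counter
        self.rng = random.Random(seed)
        self.tables = [[None] * capacity, [None] * capacity]
        self._new_hash_functions()

    def _new_hash_functions(self):
        # One random salt per table; salting is an assumed stand-in
        # for drawing two fresh hash functions.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _index(self, which, key):
        return hash((self.salts[which], key)) % self.capacity

    def search(self, key):
        # At most two slots are inspected, one per table.
        for which in (0, 1):
            entry = self.tables[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        for which in (0, 1):  # update in place if the key is present
            i = self._index(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._index(which, entry[0])
            # Place the item and pick up whatever occupied the slot.
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which  # the displaced item goes to the other table
        self._rehash(entry)

    def _rehash(self, pending):
        items = [e for table in self.tables for e in table if e is not None]
        items.append(pending)
        self.capacity *= 2  # assumption: grow as well as re-salt
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self._new_hash_functions()
        for k, v in items:
            self.insert(k, v)
```

Lookups never follow a probe sequence: a key is either in its slot in the first table, in its slot in the second table, or absent, which is where the constant worst-case lookup bound comes from.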


Hopscotch hashing

Hopscotch hashing is an open addressing based algorithm which combines the elements of cuckoo hashing, linear probing and chaining through the notion of a ''neighbourhood'' of buckets: the subsequent buckets around any given occupied bucket, also called a "virtual" bucket. The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in concurrent settings, and is thus well suited for implementing a resizable concurrent hash table. The neighbourhood characteristic of hopscotch hashing guarantees that the cost of finding the desired item in any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to place an item into its neighbourhood, with a possible cost involved in displacing other items. Each bucket within the hash table includes an additional "hop-information": an ''H''-bit bit array indicating the relative distance of the item which was originally hashed into the current virtual bucket within ''H''-1 entries. Let k and Bk be the key to be inserted and the bucket to which the key is hashed, respectively; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is preserved:
* If Bk is empty, the element is inserted, and the leftmost bit of the bitmap is set to 1.
* If Bk is not empty, linear probing is used to find an empty slot in the table, and the bitmap of the bucket is updated, followed by the insertion.
* If the empty slot is not within the range of the ''neighbourhood'', i.e. ''H''-1, subsequent swaps and hop-info bit array manipulations of each bucket are performed in accordance with its neighbourhood invariant properties.


Robin Hood hashing

Robin Hood hashing is an open addressing based collision resolution algorithm; collisions are resolved by favouring the displacement of the element that is farthest from its "home location", i.e. the element with the longest ''probe sequence length'' (PSL), where the home location is the bucket to which the item was hashed. Although Robin Hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items over the buckets, i.e. it deals with cluster formation in the hash table. Each node within a hash table that uses Robin Hood hashing should be augmented to store an extra PSL value. Let x be the key to be inserted, x.psl be the (incremental) PSL of x, T be the hash table and j be the index; the insertion procedure is as follows:

* If x.psl \le T[j].psl: the iteration goes into the next bucket without attempting an external probe.
* If x.psl > T[j].psl: insert the item x into the bucket j; swap x with T[j], and let the displaced item be x'; continue the probe from the (j+1)st bucket to insert x'; repeat the procedure until every element is inserted.
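
The insertion rule described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the class name `RobinHoodTable`, the fixed capacity, the `[key, value, psl]` slot layout and the absence of deletion and resizing are all assumptions made for brevity.

```python
class RobinHoodTable:
    """Open-addressing table; each slot is None or [key, value, psl]."""

    def __init__(self, capacity=8):
        self.slots = [None] * capacity
        self.count = 0

    def insert(self, key, value):
        # Assumption: a real table would resize instead of raising here.
        if self.count == len(self.slots):
            raise OverflowError("table is full")
        n = len(self.slots)
        j = hash(key) % n            # home location of the key
        entry = [key, value, 0]      # the item being carried, with psl 0
        while True:
            slot = self.slots[j]
            if slot is None:         # empty bucket: place the carried item
                self.slots[j] = entry
                self.count += 1
                return
            if slot[0] == entry[0]:  # same key already stored: update it
                slot[1] = entry[1]
                return
            if entry[2] > slot[2]:   # carried item is farther from home:
                self.slots[j], entry = entry, slot  # swap, carry the other
            j = (j + 1) % n          # move on to the next bucket
            entry[2] += 1            # carried item is one step farther away

    def get(self, key, default=None):
        n = len(self.slots)
        j = hash(key) % n
        for dist in range(n):
            slot = self.slots[j]
            # Without deletions, a resident closer to home than our probe
            # distance means the key cannot be stored further along.
            if slot is None or slot[2] < dist:
                return default
            if slot[0] == key:
                return slot[1]
            j = (j + 1) % n
        return default
```

With small integer keys (which CPython hashes to themselves), inserting 1, then 0, then 8 into a table of 8 slots makes key 8 displace key 1 from bucket 1, because 8 is one step from its home bucket while 1 is still at home.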


Dynamic resizing

Repeated insertions cause the number of entries in a hash table to grow, which consequently increases the load factor; to maintain the amortized O(1) performance of the lookup and insertion operations, a hash table is dynamically resized and the items of the table are ''rehashed'' into the buckets of the new hash table. The items cannot simply be copied over, because a different table size yields a different bucket index for the same hash value under the modulo operation. If a hash table becomes "too empty" after deleting some elements, resizing may be performed to avoid excessive memory usage.
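
A small Python illustration of why entries must be rehashed rather than copied: the helper `bucket_index` and the sample keys are hypothetical, and the example relies on CPython hashing small non-negative integers to themselves.

```python
def bucket_index(key, n_buckets):
    # Bucket index = hash value reduced modulo the table size.
    return hash(key) % n_buckets

keys = [3, 11, 19, 27]

# In a table of 8 buckets all four keys share bucket 3.
old_indices = {k: bucket_index(k, 8) for k in keys}

# After doubling to 16 buckets, two of them belong in bucket 11 instead.
new_indices = {k: bucket_index(k, 16) for k in keys}
```

Copying bucket 3 of the old table verbatim into bucket 3 of the new one would leave keys 11 and 27 where a lookup in the larger table would never search for them.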


Resizing by moving all entries

Generally, a new hash table with a size double that of the original hash table is allocated privately, and every item in the original hash table is moved to the newly allocated one by computing the hash value of each item followed by the insertion operation. Rehashing is simple but computationally expensive.
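
As a rough sketch of all-at-once resizing, the following Python class doubles a separately chained table once the load factor passes a threshold. The class name `ChainedTable`, the initial capacity and the 0.75 threshold are assumptions chosen for illustration only.

```python
class ChainedTable:
    def __init__(self, capacity=4, max_load=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.max_load = max_load  # assumed threshold, not prescribed

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for pair in bucket:
            if pair[0] == key:    # key already present: overwrite value
                pair[1] = value
                return
        bucket.append([key, value])
        self.size += 1
        if self.size / len(self.buckets) > self.max_load:
            self._resize(2 * len(self.buckets))

    def get(self, key, default=None):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return default

    def _resize(self, new_capacity):
        # Allocate the new table, then re-insert every entry using its
        # index under the new capacity.
        new_buckets = [[] for _ in range(new_capacity)]
        for bucket in self.buckets:
            for key, value in bucket:
                new_buckets[hash(key) % new_capacity].append([key, value])
        self.buckets = new_buckets
```

Every call to `_resize` touches all stored entries, which is the cost the paragraph above refers to.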


Alternatives to all-at-once rehashing

Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If dynamic resizing cannot be avoided, a solution is to perform the resizing gradually, which avoids a storage blip (typically 50% of the new table's size) during rehashing and avoids memory fragmentation that triggers heap compaction when the large memory block of the old hash table is deallocated. In such a case, the rehashing operation is done incrementally by extending the memory block previously allocated for the old hash table, so that the buckets of the hash table remain unaltered. A common approach for amortized rehashing involves maintaining two hash functions h_\text{old} and h_\text{new}. The process of rehashing a bucket's items in accordance with the new hash function is termed ''cleaning'', which is implemented through the command pattern by encapsulating operations such as \mathrm{Add}(\mathrm{key}), \mathrm{Get}(\mathrm{key}) and \mathrm{Delete}(\mathrm{key}) in a \mathrm{Lookup}(\mathrm{key}, \text{command}) wrapper such that each element in the bucket gets rehashed. The procedure is as follows:

* Clean the \mathrm{Table}[h_\text{old}(\mathrm{key})] bucket.
* Clean the \mathrm{Table}[h_\text{new}(\mathrm{key})] bucket.
* The ''command'' gets executed.
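
The cleaning procedure can be sketched in Python under some simplifying assumptions: the table uses separate chaining, growth doubles the bucket array in place, and the two hash functions differ only in their modulus. The names `IncrementalTable`, `dirty`, `grow`, `add`, `get` and `delete` are hypothetical and chosen for this illustration.

```python
class IncrementalTable:
    def __init__(self, n=4):
        self.old_n = n
        self.new_n = n
        self.buckets = [[] for _ in range(n)]
        self.dirty = set()   # indices of buckets not yet cleaned

    def h_old(self, key):
        return hash(key) % self.old_n

    def h_new(self, key):
        return hash(key) % self.new_n

    def grow(self):
        for i in list(self.dirty):       # finish any earlier migration
            self._clean(i)
        self.old_n = self.new_n
        self.new_n = 2 * self.old_n
        # Extend the existing array; old buckets stay where they are.
        self.buckets.extend([] for _ in range(self.old_n))
        self.dirty = set(range(self.old_n))

    def _clean(self, i):
        if i not in self.dirty:
            return
        self.dirty.discard(i)
        items, self.buckets[i] = self.buckets[i], []
        for k, v in items:               # rehash with the new function
            self.buckets[self.h_new(k)].append((k, v))

    def lookup(self, key, command):
        self._clean(self.h_old(key))
        self._clean(self.h_new(key))
        return command(self.buckets[self.h_new(key)], key)


def add(value):
    def command(bucket, key):
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
    return command


def get(bucket, key):
    for k, v in bucket:
        if k == key:
            return v
    return None


def delete(bucket, key):
    bucket[:] = [(k, v) for k, v in bucket if k != key]
```

Each operation migrates at most the buckets it touches, so the cost of rehashing is spread over many calls, for example `table.lookup(5, add(50))` followed by `table.lookup(5, get)`.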


Linear hashing

Linear hashing is an implementation of the hash table which enables the table to grow or shrink dynamically one bucket at a time.


Performance

The performance of a hash table depends on the hash function's ability to generate quasi-random numbers (\sigma) for entries in the hash table, where K, n and h(x) denote the key, the number of buckets and the hash function, such that \sigma\ =\ h(K)\ \%\ n. If the hash function generates the same \sigma for distinct keys (K_1 \ne K_2,\ h(K_1)\ =\ h(K_2)), this results in a ''collision'', which can be dealt with in several ways. The constant time complexity (O(1)) of the operations in a hash table presupposes that the hash function does not generate colliding indices; thus, the performance of the hash table is directly proportional to the chosen hash function's ability to disperse the indices. However, constructing such a hash function is practically infeasible; consequently, implementations depend on case-specific collision resolution techniques to achieve higher performance.
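
To make the effect of dispersion concrete, the following Python sketch compares two hash functions on the same keys using the index formula above. Both functions, the key set and the multiplicative constant are illustrative assumptions, not functions prescribed by the text.

```python
from collections import Counter


def identity_hash(k):
    return k


def multiplicative_hash(k):
    # Multiply by an odd 32-bit constant and keep the upper 16 bits.
    return ((k * 2654435769) & 0xFFFFFFFF) >> 16


def max_bucket_load(keys, h, n):
    # Size of the fullest bucket when each key goes to bucket h(K) % n.
    return max(Counter(h(k) % n for k in keys).values())


keys = list(range(0, 800, 8))        # 100 keys, all multiples of 8
n = 8

worst = max_bucket_load(keys, identity_hash, n)
better = max_bucket_load(keys, multiplicative_hash, n)
```

With the identity function every key is a multiple of the table size, so all 100 keys collide in bucket 0; the second function spreads them over more than one bucket, which shortens the longest chain a lookup may have to scan.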


Applications


Associative arrays

Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays.


Database indexing

Hash tables may also be used as disk-based data structures and database indices (such as in dbm), although B-trees are more popular in these applications.


Caches

Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries—usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.
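
A minimal Python sketch of such a cache, in which a colliding entry simply overwrites whatever is stored in the slot. The class name `OverwritingCache` and the slot count are assumptions for illustration.

```python
class OverwritingCache:
    def __init__(self, n_slots=64):
        self.slots = [None] * n_slots

    def put(self, key, value):
        # No collision resolution: the new item replaces the old one.
        self.slots[hash(key) % len(self.slots)] = (key, value)

    def get(self, key, default=None):
        entry = self.slots[hash(key) % len(self.slots)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return default   # a miss: fetch from the slower medium instead
```

Losing an entry is acceptable here because the cached data still exists in the slower primary store; a miss only costs time.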


Sets

Hash tables can be used in the implementation of the set data structure, which stores unique values in no particular order; a set is typically used to test the membership of a value in the collection, rather than for element retrieval.
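
A set can be sketched as a hash table that stores keys only. The class `HashSet` below is a hypothetical, fixed-size illustration using separate chaining; Python's built-in `set` type already provides this functionality.

```python
class HashSet:
    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, value):
        return self._buckets[hash(value) % len(self._buckets)]

    def add(self, value):
        bucket = self._bucket(value)
        if value not in bucket:      # keep stored values unique
            bucket.append(value)

    def __contains__(self, value):
        # Membership test only inspects the one bucket the value maps to.
        return value in self._bucket(value)

    def __len__(self):
        return sum(len(b) for b in self._buckets)
```

Adding the same value twice leaves the set unchanged, and `value in s` answers the membership question without retrieving any associated data.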


Transposition table

A transposition table is a complex hash table which stores information about each section that has been searched.


Implementations


In programming languages

Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules. In JavaScript, every value except for 7 "primitive" data types is called an "object", which uses either integers, strings, or guaranteed-unique "symbol" primitive values as keys for a hash map. ECMAScript 6 also added Map and Set data structures.

C++11 includes unordered_map in its standard library for storing keys and values of arbitrary types.

The Java programming language includes the HashSet, HashMap, LinkedHashSet, and LinkedHashMap generic collections.

Python's built-in dict implements a hash table in the form of a type.

Ruby's built-in Hash uses the open addressing model from Ruby 2.4 onwards.

The Rust programming language includes HashMap and HashSet as part of the Rust Standard Library.


See also

* Rabin–Karp string search algorithm
* Stable hashing
* Consistent hashing
* Extendible hashing
* Lazy deletion
* Pearson hashing
* PhotoDNA
* Search data structure
* Concurrent hash table
* Bloom filter
* Hash array mapped trie
* Distributed hash table



External links

* NIST entry on hash tables
* Pat Morin
* MIT's Introduction to Algorithms: Hashing 1, MIT OCW lecture video
* MIT's Introduction to Algorithms: Hashing 2, MIT OCW lecture video