In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to Applied science, practical discipli ...
, dynamic perfect hashing is a programming technique for resolving
collisions
In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great f ...
in a
hash table
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', als ...
data structure
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
.
[Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a Sparse Table with 0(1) Worst Case Access Time. J. ACM 31, 3 (Jun. 1984), 538-544 http://portal.acm.org/citation.cfm?id=1884#][Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan, R. E. 1994]
"Dynamic Perfect Hashing: Upper and Lower Bounds"
.
SIAM J. Comput. 23, 4 (Aug. 1994), 738-761.
http://portal.acm.org/citation.cfm?id=182370
While more memory-intensive than its hash table counterparts, this technique is useful for situations where fast queries, insertions, and deletions must be made on a large set of elements.
Details
Static case
FKS Scheme
The problem of optimal
static hashing was first solved in general by Fredman, Komlós and Szemerédi.
In their 1984 paper,
they detail a two-tiered hash table scheme in which each bucket of the (first-level) hash table corresponds to a separate second-level hash table. Keys are hashed twice—the first hash value maps to a certain bucket in the first-level hash table; the second hash value gives the position of that entry in that bucket's second-level hash table. The second-level table is guaranteed to be collision-free (i.e.
perfect hashing
In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function.
Perfect hash functions may be used to im ...
) upon construction. Consequently, the look-up cost is guaranteed to be
O(1)
Big ''O'' notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by Paul Bachmann, Edmund Land ...
in the worst-case.
In the static case, we are given a set with a total of entries, each one with a unique key, ahead of time.
Fredman, Komlós and Szemerédi pick a first-level hash table with size
buckets.
To construct, entries are separated into buckets by the top-level hashing function, where
. Then for each bucket with entries, a second-level table is allocated with
slots, and its
hash function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually u ...
is selected at random from a
universal hash function
In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees ...
set so that it is collision-free (i.e. a
perfect hash function
In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function.
Perfect hash functions may be used to imp ...
) and stored alongside the hash table. If the hash function randomly selected creates a table with collisions, a new hash function is randomly selected until a collision-free table can be guaranteed. Finally, with the collision-free hash, the entries are hashed into the second-level table.
The quadratic size of the
space ensures that randomly creating a table with collisions is infrequent and independent of the size of , providing linear amortized construction time. Although each second-level table requires quadratic space, if the keys inserted into the first-level hash table are
uniformly distributed, the structure as a whole occupies expected
space, since bucket sizes are small with high
probability
Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...
.
The first-level hash function is specifically chosen so that, for the specific set of unique key values, the total space used by all the second-level hash tables has expected
space, and more specifically
.
Fredman, Komlós and Szemerédi showed that given a
universal hashing
In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees ...
family of hash functions, at least half of those functions have that property.
Dynamic case
Dietzfelbinger et al. present a dynamic dictionary algorithm that, when a set of n items is incrementally added to the dictionary, membership queries always run in constant time and therefore
worst-case time, the total storage required is
(linear), and
expected amortized insertion and deletion time (
amortized constant time
In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case ...
).
In the dynamic case, when a key is inserted into the hash table, if its entry in its respective subtable is occupied, then a collision is said to occur and the subtable is rebuilt based on its new total entry count and randomly selected hash function. Because the
load factor of the second-level table is kept low
, rebuilding is infrequent, and the
amortized
In computer science, amortized analysis is a method for analyzing a given algorithm's complexity, or how much of a resource, especially time or memory, it takes to execute. The motivation for amortized analysis is that looking at the worst-case ...
expected cost of insertions is
.
Similarly, the amortized expected cost of deletions is
.
Additionally, the ultimate sizes of the top-level table or any of the subtables is unknowable in the dynamic case. One method for maintaining expected
space of the table is to prompt a full reconstruction when a sufficient number of insertions and deletions have occurred. By results due to Dietzfelbinger et al.,
as long as the total number of insertions or deletions exceeds the number of elements at the time of last construction, the amortized expected cost of insertion and deletion remain
with full rehashing taken into consideration.
The implementation of dynamic perfect hashing by Dietzfelbinger et al. uses these concepts, as well as
lazy deletion In computer science, lazy deletion refers to a method of deleting elements from a hash table that uses open addressing
Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is r ...
, and is shown in pseudocode below.
Pseudocode implementation
Locate
function Locate(''x'') is
''j'' := h(x)
if (position h
j(''x'') of subtable ''T
j'' contains ''x'' (not deleted))
return (''x'' is in ''S'')
end if
else
return (''x'' is not in ''S'')
end else
end
Insert
During the insertion of a new entry ''x'' at ''j'', the global operations counter, ''count'', is incremented.
If ''x'' exists at ''j'', but is marked as deleted, then the mark is removed.
If ''x'' exists at ''j'' or at the subtable ''T
j'', and is not marked as deleted, then a collision is said to occur and the ''j''
th bucket's second-level table ''T
j'' is rebuilt with a different randomly selected hash function ''h
j''.
function Insert(''x'') is
''count'' = ''count'' + 1;
if (''count'' > ''M'')
FullRehash(''x'');
end if
else
''j'' = h(''x'');
if (Position h
''j''(x) of subtable ''T
j'' contains ''x'')
if (''x'' is marked deleted)
remove the delete marker;
end if
end if
else
''b
j'' = ''b
j'' + 1;
if (''b
j'' <= ''m
j'')
if position h
''j''(''x'') of ''T
j'' is empty
store ''x'' in position h
''j''(''x'') of ''T
j'';
end if
else
Put all unmarked elements of ''T
j'' in list ''L
j'';
Append ''x'' to list ''L
j'';
''b
j'' = length of ''L
j'';
repeat
''h
j'' = randomly chosen function in ''H
sj'';
until ''h
j'' is injective on the elements of ''L
j'';
for all ''y'' on list ''L
j''
store ''y'' in position h
''j''(''y'') of ''T
j'';
end for
end else
end if
else
''m
j'' = 2 * max;
''s
j'' = 2 * ''m
j'' * (''m
j'' - 1);
if the sum total of all s
j ≤ 32 * ''M''
2 / ''s''(''M'') + 4 * ''M''
Allocate ''s
j'' cells for ''T
j'';
Put all unmarked elements of ''T
j'' in list ''L
j'';
Append ''x'' to list ''L
j'';
''b
j'' = length of ''L
j'';
repeat
''h
j'' = randomly chosen function in ''H
sj'';
until ''h
j'' is injective on the elements of ''L
j'';
for all ''y'' on list ''L
j''
store ''y'' in position h
''j''(''y'') of ''T
j'';
end for
end if
else
FullRehash(''x'');
end else
end else
end else
end else
end
Delete
Deletion of ''x'' simply flags ''x'' as deleted without removal and increments ''count''. In the case of both insertions and deletions, if ''count'' reaches a threshold ''M'' the entire table is rebuilt, where ''M'' is some constant multiple of the size of S at the start of a new ''phase''. Here ''phase'' refers to the time between full rebuilds. Note that here the -1 in "Delete(''x'')" is a representation of an element which is not in the set of all possible elements ''U''.
function Delete(''x'') is
''count'' = ''count'' + 1;
''j'' = h(''x'');
if position h
j(''x'') of subtable ''Tj'' contains ''x''
mark ''x'' as deleted;
end if
else
return (x is not a member of S);
end else
if (''count'' >= ''M'')
FullRehash(-1);
end if
end
Full rebuild
A full rebuild of the table of ''S'' first starts by removing all elements marked as deleted and then setting the next threshold value ''M'' to some constant multiple of the size of ''S''. A hash function, which partitions ''S'' into ''s''(''M'') subsets, where the size of subset ''j'' is ''s
j'', is repeatedly randomly chosen until:
Finally, for each subtable ''T
j'' a hash function ''h
j'' is repeatedly randomly chosen from ''H
sj'' until ''h
j'' is injective on the elements of ''T
j''. The expected time for a full rebuild of the table of ''S'' with size ''n'' is O(''n'').
function FullRehash(''x'') is
Put all unmarked elements of ''T'' in list ''L'';
if (''x'' is in ''U'')
append ''x'' to ''L'';
end if
''count'' = length of list ''L'';
''M'' = (1 + ''c'') * max;
repeat
h = randomly chosen function in ''H
s(M)'';
for all ''j'' < ''s''(''M'')
form a list ''L
j'' for h(''x'') = ''j'';
''b
j'' = length of ''L
j'';
''m
j'' = 2 * ''b
j'';
''s
j'' = 2 * ''m
j'' * (''m
j'' - 1);
end for
until the sum total of all s
j ≤ 32 * ''M''
2 / ''s''(''M'') + 4 * ''M''
for all ''j'' < ''s''(''M'')
Allocate space ''s
j'' for subtable ''T
j'';
repeat
''h
j'' = randomly chosen function in ''H
sj'';
until ''h
j'' is injective on the elements of list ''L
j'';
end for
for all ''x'' on list ''L
j''
store ''x'' in position h
''j''(''x'') of ''T
j'';
end for
end
See also
*
Perfect hashing
In computer science, a perfect hash function for a set is a hash function that maps distinct elements in to a set of integers, with no collisions. In mathematical terms, it is an injective function.
Perfect hash functions may be used to im ...
References
{{DEFAULTSORT:Dynamic Perfect Hashing
Hashing
Search algorithms