MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder in 1997, and was initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words.


Jaccard similarity and minimum hash values

The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U; then the Jaccard index J(A,B) is defined to be the ratio of the number of elements of their intersection to the number of elements of their union:

: J(A,B) = \frac{|A \cap B|}{|A \cup B|}.

This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1 otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the intersection and union.

Let h be a hash function that maps the members of U to distinct integers, let perm be a random permutation of the elements of the set U, and for any subset S of U define h_\min(S) to be the minimal member of S with respect to h; that is, the member x of S with the minimum value of h(x). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation would not be used.)

Now, applying h_\min to both A and B, and assuming no hash collisions, the values are equal (h_\min(A) = h_\min(B)) if and only if, among all elements of A\cup B, the element with the minimum hash value lies in the intersection A\cap B. The probability of this being true is exactly the Jaccard index, therefore:

: \Pr[h_\min(A) = h_\min(B)] = J(A,B).

That is, the probability that h_\min(A) = h_\min(B) is equal to the similarity J(A,B), assuming h is drawn uniformly at random. In other words, if r is the random variable that is one when h_\min(A) = h_\min(B) and zero otherwise, then r is an unbiased estimator of J(A,B). However, r has too high a variance to be a useful estimator for the Jaccard similarity on its own, because r is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.
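The collision argument above can be checked empirically. The sketch below (the helper names are illustrative, not from any particular library) draws random permutations of A \cup B and confirms that the rate at which the two minima agree approaches J(A,B):

```python
import random

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def minhash_collision_rate(a, b, trials=5000, seed=0):
    """Estimate Pr[h_min(A) = h_min(B)] by drawing random permutations
    of the universe A ∪ B and checking whether the minima coincide."""
    rng = random.Random(seed)
    universe = list(a | b)
    hits = 0
    for _ in range(trials):
        # A random permutation, represented as a rank for each element.
        rank = {x: i for i, x in enumerate(rng.sample(universe, len(universe)))}
        if min(a, key=rank.get) == min(b, key=rank.get):
            hits += 1
    return hits / trials

A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
# jaccard(A, B) = 2/6 ≈ 0.333; the measured collision rate should be close.
```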


Algorithm


Variant with many hash functions

The simplest version of the minhash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_\min(S) for these k functions. To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which h_\min(A) = h_\min(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is one when h_\min(A) = h_\min(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore, their average is also an unbiased estimator, and by standard bounds on the deviation of sums of 0-1 random variables, its expected error is O(1/\sqrt{k}). Therefore, for any constant \epsilon > 0 there is a constant k = O(1/\epsilon^2) such that the expected error of the estimate is at most \epsilon. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to .05.
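A minimal sketch of this variant, using random linear hash functions of the form (ax + b) mod p as a stand-in for truly random hash functions (the function names, prime, and parameters are illustrative):

```python
import random

P = 2_147_483_647  # a Mersenne prime used as the hash modulus (illustrative)

def make_hashes(k, seed=0):
    """Draw k random linear hash functions h(x) = (a*x + b) mod P,
    each represented by its coefficient pair (a, b)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def signature(s, hashes):
    """MinHash signature of a set of integers: the minimum hash value
    of the set under each of the k functions."""
    return [min((a * x + b) % P for x in s) for a, b in hashes]

def estimate_jaccard(sig_a, sig_b):
    """y/k: the fraction of hash functions whose minima agree."""
    matches = sum(ma == mb for ma, mb in zip(sig_a, sig_b))
    return matches / len(sig_a)

hashes = make_hashes(400)
A = set(range(0, 60))
B = set(range(30, 90))  # true Jaccard index = 30/90 ≈ 0.333
est = estimate_jaccard(signature(A, hashes), signature(B, hashes))
```

With k = 400 the expected error is about 1/sqrt(400) = 0.05, matching the example in the text.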


Variant with a single hash function

It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function and using it to select multiple values from each set, rather than selecting only a single minimum value per hash function. Let h be a hash function, and let k be a fixed integer. If S is any set of k or more values in the domain of h, define h_{(k)}(S) to be the subset of the k members of S that have the smallest values of h. This subset h_{(k)}(S) is used as a ''signature'' for the set S, and the similarity of any two sets is estimated by comparing their signatures.

Specifically, let A and B be any two sets. Then X = h_{(k)}(h_{(k)}(A) \cup h_{(k)}(B)) = h_{(k)}(A \cup B) is a set of k elements of A \cup B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple random sample of A \cup B. The subset Y = X \cap h_{(k)}(A) \cap h_{(k)}(B) is the set of members of X that belong to the intersection A \cap B. Therefore, |Y|/k is an unbiased estimator of J(A,B). The difference between this estimator and the estimator produced by multiple hash functions is that X always has exactly k members, whereas the multiple hash functions may lead to a smaller number of sampled elements due to the possibility that two different hash functions may have the same minima. However, when k is small relative to the sizes of the sets, this difference is negligible. By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/\sqrt{k}), matching the performance of the multiple-hash-function scheme.
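The bottom-k scheme might be sketched as follows, using an MD5-based hash as a stand-in for a random function (the names are illustrative). Note that X = h_{(k)}(A \cup B) can be recovered from the two signatures alone:

```python
import hashlib

def h(x):
    """A single hash function (MD5-based; an illustrative stand-in for
    a random function, used here via its 64-bit prefix)."""
    return int.from_bytes(hashlib.md5(str(x).encode()).digest()[:8], "big")

def bottom_k(s, k):
    """Signature h_(k)(S): the k smallest hash values of S."""
    return set(sorted(h(x) for x in s)[:k])

def estimate_jaccard(a, b, k):
    sig_a, sig_b = bottom_k(a, k), bottom_k(b, k)
    # X = h_(k)(A ∪ B), recoverable from the two signatures alone.
    x = set(sorted(sig_a | sig_b)[:k])
    # Y = X ∩ h_(k)(A) ∩ h_(k)(B): sampled elements lying in both sets.
    y = x & sig_a & sig_b
    return len(y) / k

A = set(range(0, 600))
B = set(range(300, 900))  # true Jaccard index = 300/900 ≈ 0.333
est = estimate_jaccard(A, B, 150)
```

Working directly on hash values (rather than the original elements) is safe here because, with a 64-bit hash, collisions are negligible for sets of this size.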


Time analysis

The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when \epsilon and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in linear time in the size of the set, so when many pairwise similarities need to be estimated this method can lead to a substantial savings in running time compared to doing a full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(nk) time. The single-hash variant is generally faster, requiring O(n \log k) time to maintain the queue of minimum hash values, assuming n \gg k.
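The O(n \log k) bound comes from maintaining the k smallest hash values seen so far in a bounded-size heap; a minimal sketch using Python's heapq:

```python
import heapq

def bottom_k_stream(hash_values, k):
    """Return the k smallest values of a stream in O(n log k) time by
    maintaining a max-heap of size k (values are negated, since heapq
    implements a min-heap)."""
    heap = []  # negated values; -heap[0] is the current k-th smallest
    for v in hash_values:
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            # v displaces the largest of the k values kept so far.
            heapq.heapreplace(heap, -v)
    return sorted(-v for v in heap)
```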


Incorporating weights

A variety of techniques to introduce weights into the computation of MinHashes have been developed. The simplest extends it to integer weights. Extend the hash function h to accept both a set member and an integer, then generate multiple hashes for each item according to its weight: if item i occurs n times, generate the hashes h(i,1), h(i,2), \ldots, h(i,n). Run the original algorithm on this expanded set of hashes. Doing so yields the weighted Jaccard index as the collision probability:

: J_\mathcal{W}(x,y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}.

Further extensions that achieve this collision probability on real weights with better runtime have been developed, one for dense data and another for sparse data.

Another family of extensions uses exponentially distributed hashes. A uniformly random hash between 0 and 1 can be converted to follow an exponential distribution by CDF inversion. This method exploits the many useful properties of the minimum of a set of exponential variables:

: H(x) = \underset{i}{\arg\min} \; \frac{-\log h(i)}{x_i}.

This yields as its collision probability the probability Jaccard index:

: J_\mathcal{P}(x,y) = \sum_{i : x_i \ne 0,\, y_i \ne 0} \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)}.
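The integer-weight expansion can be sketched as follows (the hash helper and weight encoding are illustrative):

```python
import hashlib

def h(item, rep):
    """Hash of an (item, replica) pair (MD5-based; illustrative)."""
    key = f"{item}:{rep}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

def weighted_minhash(weights):
    """Minimum hash over the expanded multiset: an item i with integer
    weight n contributes the hashes h(i,1), ..., h(i,n)."""
    return min(h(i, r) for i, n in weights.items() for r in range(1, n + 1))

x = {"a": 3, "b": 1}
y = {"a": 2, "b": 2, "c": 1}
# Over a random choice of h, Pr[weighted_minhash(x) == weighted_minhash(y)]
# equals the weighted Jaccard index:
#   sum_i min(x_i, y_i) / sum_i max(x_i, y_i) = (2 + 1 + 0) / (3 + 2 + 1) = 1/2.
```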


Min-wise independent permutations

In order to implement the MinHash scheme as described above, one needs the hash function h to define a random permutation on n elements, where n is the total number of distinct elements in the union of all of the sets to be compared. But because there are n! different permutations, it would require \Omega(n \log n) bits just to specify a truly random permutation, an infeasibly large number for even moderate values of n. Because of this fact, by analogy to the theory of universal hashing, there has been significant work on finding a family of permutations that is "min-wise independent", meaning that for any subset of the domain, any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least

: \operatorname{lcm}(1, 2, \cdots, n) \ge e^{n - o(n)}

different permutations, and therefore that it needs \Omega(n) bits to specify a single permutation, still infeasibly large.


Practical min-wise independent hash functions

Because of the above impracticality, two variant notions of min-wise independence have been introduced: restricted min-wise independent permutation families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k. Approximate min-wise independence has at most a fixed probability \epsilon of varying from full independence.

In 1999 Piotr Indyk proved that any k-wise independent family of hash functions is also approximately min-wise independent for k large enough. In particular, there are constants c, c' > 0 such that if k \ge c \log\tfrac{1}{\epsilon}, then

: \Pr_h[h(x) < \min h(X \setminus \{x\})] = \frac{1}{|X|}(1 \pm \epsilon),

for all sets with |X| \le \epsilon n c' and all x \in X. (Note: here (1 \pm \epsilon) means the probability is at most a factor 1 + \epsilon too big, and at most 1 - \epsilon too small.) This guarantee is, among other things, sufficient to give the Jaccard bound required by the MinHash algorithm. That is, if A and B are sets, then

: \Pr_h[\min h(A) = \min h(B)] = \frac{|A \cap B|}{|A \cup B|} \pm \epsilon.

Since k-wise independent hash functions can be specified using just k \log n bits, this approach is much more practical than using completely min-wise independent permutations. Another practical family of hash functions that gives approximate min-wise independence is tabulation hashing.
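A minimal sketch of tabulation hashing for 32-bit keys (table sizes, seeding, and function names are illustrative): each key is split into bytes, and one random table entry per byte position is XORed together.

```python
import random

def make_tabulation_hash(seed=0):
    """Build a simple tabulation hash for 32-bit keys: 4 lookup tables
    of 256 random 32-bit words, one per byte position. Tabulation hashing
    is only 3-independent, yet is known to give approximate min-wise
    independence in practice and in theory."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]

    def h(x):
        v = 0
        for i in range(4):
            # XOR in the table entry selected by byte i of the key.
            v ^= tables[i][(x >> (8 * i)) & 0xFF]
        return v

    return h

h = make_tabulation_hash()
```

Each evaluation is just four table lookups and XORs, which is why tabulation hashing is attractive as a practical substitute for min-wise independent permutations.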


Applications

The original applications for MinHash involved clustering and eliminating near-duplicates among web documents, represented as sets of the words occurring in those documents. Similar techniques have also been used for clustering and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.

In data mining, MinHash has been used as a tool for association rule learning. Given a database in which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a column per attribute), MinHash-based approximations to the Jaccard index can identify candidate pairs of attributes that frequently co-occur; the exact value of the index is then computed for only those pairs, to determine the ones whose frequencies of co-occurrence are below a given strict threshold.

The MinHash algorithm has been adapted for bioinformatics, where the problem of comparing genome sequences has a similar theoretical underpinning to that of comparing documents on the web. MinHash-based tools allow rapid comparison of whole genome sequencing data with reference genomes (around 3 minutes to compare one genome with the 90000 reference genomes in RefSeq), and are suitable for speciation and perhaps a limited degree of microbial sub-typing. There are also applications for metagenomics and the use of MinHash-derived algorithms for genome alignment and genome assembly. Accurate average nucleotide identity (ANI) values can be generated very efficiently with MinHash-based algorithms.


Other uses

The MinHash scheme may be seen as an instance of locality sensitive hashing, a collection of techniques for using hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search algorithms. For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to help compute similarities with no dependence on the point dimension.


Evaluation and benchmarks

A large-scale evaluation was conducted by Google in 2006 to compare the performance of the MinHash and SimHash algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling, and using MinHash and LSH for Google News personalization.


See also

* Bloom filter
* SimHash
* w-shingling
* Count–min sketch

