computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, ...

, the count–min sketch (CM sketch) is a

probabilistic Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, ...

data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...

that serves as a frequency table of events in a stream of data. It uses

hash function A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...

s to map events to frequencies, but unlike a

hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', ...

uses only sub-linear space, at the expense of overcounting some events due to

collisions In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...

. The count–min sketch was invented in 2003 by Graham Cormode and S. Muthu Muthukrishnan and described by them in a 2005 paper. Count–min sketch is an alternative to

count sketch Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms. It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton in an effort to speed up the AMS Sketch by ...

and

AMS sketch AMS or Ams may refer to: Organizations Companies * Alenia Marconi Systems * American Management Systems * AMS (Advanced Music Systems) * ams AG, semiconductor manufacturer * AMS Pictures * Auxiliary Medical Services Educational institutions * ...

and can be considered an implementation of a counting Bloom filter (Fan et al., 1998) or multistage-filter. However, they are used differently and therefore sized differently: a count–min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.

Data structure

The goal of the basic version of the count–min sketch is to consume a stream of events, one at a time, and count the frequency of the different types of events in the stream. At any time, the sketch can be queried for the frequency of a particular event type from a universe of event types

\mathcal

, and will return an estimate of this frequency that is within a certain distance of the true frequency, with a certain probability. The actual sketch data structure is a two-dimensional array of columns and rows. The parameters and are fixed when the sketch is created, and determine the time and space needs and the probability of error when the sketch is queried for a frequency or

inner product In mathematics, an inner product space (or, rarely, a Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, often ...

. Associated with each of the rows is a separate hash function; the hash functions must be pairwise independent. The parameters and can be chosen by setting and , where the error in answering a query is within an additive factor of with probability (see below), and is

Euler's number The number , also known as Euler's number, is a mathematical constant approximately equal to 2.71828 that can be characterized in many ways. It is the base of the natural logarithms. It is the limit of as approaches infinity, an expressi ...

. When a new event of type arrives we update as follows: for each row of the table, apply the corresponding hash function to obtain a column index . Then increment the value in row , column by one. Several types of queries are possible on the stream. * The simplest is the ''point query'', which asks for the count of an event type . The estimated count is given by the least value in the table for , namely

\hat a_i=\min_j \mathrm, h_j(i) /math>, where \mathrm is the table.

Obviously, for each , one has a_i \leq \hat a_i, where a_i is the true frequency with which  occurred in the stream.

Additionally, this estimate has the guarantee that \hat a_i \leq a_i + \varepsilon N with probability 1-\delta, where N = \sum_ a_i is the stream size, i.e. the total number of items seen by the sketch.

* An ''inner product query'' asks for the

between the histograms represented by two count–min sketches,

\mathrm_a

and

\mathrm_b

. Small modifications to the data structure can be used to sketch other different stream statistics. Like the

, the Count–min sketch is a linear sketch. That is, given two streams, constructing a sketch on each stream and summing the sketches yields the same result as concatenating the streams and constructing a sketch on the concatenated streams. This makes the sketch mergeable and appropriate for use in distributed settings in addition to streaming ones.

Reducing bias and error

One potential problem with the usual min estimator for count–min sketches is that they are biased estimators of the true frequency of events: they may overestimate, but never underestimate the true count in a point query. Furthermore, while the min estimator works well when the distribution is highly skewed, other sketches such as the Count sketch based on means are more accurate when the distribution is not sufficiently skewed. Several variations on the sketch have been proposed to reduce error and reduce or eliminate bias. To remove bias, the estimator repeatedly randomly selects d random entries in the sketch and takes the minimum to obtain an unbiased estimate of the bias and subtracts it off. A

maximum likelihood estimator In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data. This is achieved by maximizing a likelihood function so that, under the assumed stati ...

(MLE) was derived in Ting. By using the MLE, the estimator is always able to match or better the min estimator and works well even if the distribution is not skewed. This paper also showed the hCount* debiasing operation is a bootstrapping procedure that can be efficiently computed without random sampling and can be generalized to any estimator. Since errors arise from hash collisions with unknown items from the universe, several approaches correct for the collisions when multiple elements of the universe are known or queried for simultaneously . For each of these, a large proportion of the universe must be known to observe a significant benefit. ''Conservative updating'' changes the update, but not the query algorithms. To count instances of event type , one first computes an estimate

\hat a_i=\min_j \mathrm, h_j(i) /math>, then updates \mathrm, h_j(i) \leftarrow \max\ for each row . While this update procedure makes the sketch not a linear sketch, it is still mergeable.

Notes

References

External links

Count–min FAQ
{{DEFAULTSORT:Count-min sketch Hashing Probabilistic data structures Hash-based data structures

Data structure

Reducing bias and error

See also

Notes

References

Further reading

External links