In
computing
Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, ...
, the count–min sketch (CM sketch) is a
probabilistic
Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, ...
data structure
In computer science, a data structure is a data organization, management, and storage format that is usually chosen for Efficiency, efficient Data access, access to data. More precisely, a data structure is a collection of data values, the rel ...
that serves as a frequency table of events in a
stream of data. It uses
hash function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values. The values returned by a hash function are called ''hash values'', ''hash codes'', ''digests'', or simply ''hashes''. The values are usually ...
s to map events to frequencies, but unlike a
hash table
In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', ...
uses only
sub-linear space, at the expense of overcounting some events due to
collisions
In physics, a collision is any event in which two or more bodies exert forces on each other in a relatively short time. Although the most common use of the word ''collision'' refers to incidents in which two or more objects collide with great fo ...
. The count–min sketch was invented in 2003 by
Graham Cormode and
S. Muthu Muthukrishnan and described by them in a 2005 paper.
Count–min sketch is an alternative to
count sketch
Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms.
It was invented by
Moses Charikar, Kevin Chen and Martin Farach-Colton in an effort to speed up the AMS Sketch by ...
and
AMS sketch AMS or Ams may refer to:
Organizations Companies
* Alenia Marconi Systems
* American Management Systems
* AMS (Advanced Music Systems)
* ams AG, semiconductor manufacturer
* AMS Pictures
* Auxiliary Medical Services
Educational institutions
* ...
and can be considered an implementation of a
counting Bloom filter (Fan et al., 1998) or
multistage-filter.
[ However, they are used differently and therefore sized differently: a count–min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.
]
Data structure
The goal of the basic version of the count–min sketch is to consume a stream of events, one at a time, and count the frequency of the different types of events in the stream. At any time, the sketch can be queried for the frequency of a particular event type from a universe of event types , and will return an estimate of this frequency that is within a certain distance of the true frequency, with a certain probability.
The actual sketch data structure is a two-dimensional array of columns and rows. The parameters and are fixed when the sketch is created, and determine the time and space needs and the probability of error when the sketch is queried for a frequency or inner product
In mathematics, an inner product space (or, rarely, a Hausdorff pre-Hilbert space) is a real vector space or a complex vector space with an operation called an inner product. The inner product of two vectors in the space is a scalar, often ...
. Associated with each of the rows is a separate hash function; the hash functions must be pairwise independent. The parameters and can be chosen by setting and , where the error in answering a query is within an additive factor of with probability (see below), and is Euler's number
The number , also known as Euler's number, is a mathematical constant approximately equal to 2.71828 that can be characterized in many ways. It is the base of the natural logarithms. It is the limit of as approaches infinity, an expressi ...
.
When a new event of type arrives we update as follows: for each row of the table, apply the corresponding hash function to obtain a column index . Then increment the value in row , column by one.
Several types of queries are possible on the stream.
* The simplest is the ''point query'', which asks for the count of an event type . The estimated count is given by the least value in the table for , namely