In computer science, the count-distinct problem
(also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements.
This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, unique visitors to a web site, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.
Formal definition
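: Instance: A stream of elements $x_1, x_2, \ldots, x_s$ with repetitions, and an integer $m$. Let $n$ be the number of distinct elements, namely $n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$.
: Objective: Find an estimate $\hat{n}$ of $n$ using only $m$ storage units, where $m \ll n$.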
An example of an instance for the cardinality estimation problem is the stream $a, b, a, c, d, b, d$. For this instance, $n = |\{a, b, c, d\}| = 4$.
Naive solution
The naive solution to the problem is as follows:
Initialize a counter, $c$, to zero ($c \leftarrow 0$).
Initialize an efficient dictionary data structure, $D$, such as a hash table or search tree, in which insertion and membership can be performed quickly.
For each element $x_i$, a membership query is issued.
If $x_i$ is not a member of $D$ ($x_i \notin D$), add $x_i$ to $D$ and increase $c$ by one ($c \leftarrow c + 1$).
Otherwise ($x_i \in D$), do nothing.
As long as the number of distinct elements is not too big, $D$ fits in main memory and an exact answer can be retrieved.
However, this approach does not scale for bounded storage, or if the computation performed for each element $x_i$ should be minimized. In such a case, several streaming algorithms have been proposed that use a fixed number of storage units.
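The naive procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_distinct_exact` is an assumption of this sketch, and Python's built-in `set` stands in for the dictionary structure.

```python
from typing import Hashable, Iterable


def count_distinct_exact(stream: Iterable[Hashable]) -> int:
    """Exact distinct count; memory grows linearly with the number of distinct elements."""
    seen = set()   # plays the role of the dictionary structure
    count = 0      # plays the role of the counter
    for x in stream:
        if x not in seen:   # membership query
            seen.add(x)
            count += 1
    return count


print(count_distinct_exact(["a", "b", "a", "c", "d", "b", "d"]))  # 4
```

The set holds every distinct element, which is exactly the storage cost that the streaming approaches try to avoid.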
HyperLogLog algorithm
Streaming algorithms
To handle the bounded storage constraint, streaming algorithms use randomization to produce a non-exact estimate of the number of distinct elements, $n$.
State-of-the-art estimators hash every element $e_j$ into a low-dimensional data sketch using a hash function, $h(e_j)$.
The different techniques can be classified according to the data sketches they store.
Min/max sketches
Min/max sketches store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. present the max sketch, which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator is the maximum likelihood estimator. The estimator of choice in practice is the HyperLogLog algorithm.
The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element $e_j$ is associated with a uniform random variable, $h(e_j) \sim U(0,1)$, the expected minimum value of $h(e_1), h(e_2), \ldots, h(e_n)$ is $1/(n+1)$. The hash function guarantees that $h(e_j)$ is identical for all the appearances of $e_j$. Thus, the existence of duplicates does not affect the value of the extreme order statistics.
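The intuition above can be turned into a rough estimator by inverting the relation between the expected minimum and the number of distinct elements. The sketch below is a minimal illustration, not one of the cited estimators: it assumes SHA-256 behaves like a uniform hash, uses several seeded hash functions to reduce variance, and all function names are hypothetical.

```python
import hashlib
from typing import Iterable, List


def uniform_hash(element: str, seed: int) -> float:
    """Map an element to a pseudo-uniform value in [0, 1) (SHA-256 is an assumption)."""
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def min_sketch(stream: Iterable[str], num_hashes: int) -> List[float]:
    """Keep one running minimum per seeded hash function."""
    minima = [1.0] * num_hashes
    for x in stream:
        for seed in range(num_hashes):
            minima[seed] = min(minima[seed], uniform_hash(x, seed))
    return minima


def estimate_from_minima(minima: List[float]) -> float:
    """Invert E[min] = 1/(n+1), using the average minimum in place of the expectation."""
    mean_min = sum(minima) / len(minima)
    return 1.0 / mean_min - 1.0


distinct = [f"item-{i}" for i in range(300)]
stream = distinct + distinct[:100]          # repeated elements do not change the sketch
print(round(estimate_from_minima(min_sketch(stream, 64))))
```

The storage is one value per hash function, independent of the stream length. The estimate is noisy, and its accuracy improves as the number of hash functions grows.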
There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation, by Flajolet et al., describes a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. The first asymptotically space- and time-optimal algorithm for this problem was given by Daniel M. Kane, Jelani Nelson, and David P. Woodruff.
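A bit pattern sketch of the kind described above can be illustrated as follows. This is a simplified single-bitmap sketch in the style of Flajolet and Martin, written under assumptions: SHA-256 stands in for the hash function, the function names are hypothetical, and the correction constant 0.77351 is the one commonly quoted for this estimator. A single bitmap has high variance, so practical variants average many bitmaps.

```python
import hashlib
from typing import Iterable

PHI = 0.77351  # correction constant commonly quoted for the Flajolet-Martin estimator


def hash64(element: str) -> int:
    """64-bit integer hash of an element (SHA-256 is an assumption of this sketch)."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bitmap_sketch(stream: Iterable[str]) -> int:
    """OR together one-hot patterns: bit r is set if some hash has exactly r trailing zeros."""
    bitmap = 0
    for x in stream:
        h = hash64(x)
        r = (h & -h).bit_length() - 1 if h else 63   # index of the lowest set bit
        bitmap |= 1 << r
    return bitmap


def estimate_from_bitmap(bitmap: int) -> float:
    """Estimate the distinct count as 2^R / PHI, where R is the lowest unset bit index."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return 2**r / PHI


words = [f"word-{i}" for i in range(1000)]
print(estimate_from_bitmap(bitmap_sketch(words + words)))
```

Because the sketch is a bitwise OR, the sketches of two streams can be merged by OR-ing them, and repeated elements leave the bitmap unchanged.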
Bottom-''m'' sketches
Bottom-''m'' sketches are a generalization of min sketches, which maintain the $m$ minimal values, where $m \geq 1$.
See Cosma et al. for a theoretical overview of count-distinct estimation algorithms, and Metwally for a practical overview with comparative simulation results.
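As an illustration of a bottom-''m'' sketch, the following Python sketch keeps the smallest hash values seen so far and estimates the distinct count from the largest value it retains. It is written under assumptions: SHA-256 stands in for a uniform hash, the function names are hypothetical, and the estimator is the common "(m - 1) divided by the m-th smallest hash" form, which needs at least two retained values to be meaningful.

```python
import hashlib
import heapq
from typing import Iterable


def uniform_hash(element: str) -> float:
    """Map an element to a pseudo-uniform value in [0, 1) (SHA-256 is an assumption)."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def bottom_m_estimate(stream: Iterable[str], m: int) -> float:
    """Estimate the distinct count from the m smallest distinct hash values."""
    heap = []      # max-heap via negation, holding the m smallest hashes
    kept = set()   # hash values currently in the heap, used to skip duplicates
    for x in stream:
        v = uniform_hash(x)
        if v in kept:
            continue
        if len(heap) < m:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)   # drop the current largest
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < m:
        return float(len(heap))   # fewer than m distinct hashes: the count is exact
    return (m - 1) / -heap[0]     # -heap[0] is the m-th smallest hash value


items = [f"user-{i}" for i in range(5000)]
print(round(bottom_m_estimate(items + items[:1000], m=256)))
```

The sketch stores at most `m` hash values regardless of stream length. With `m = 1` it degenerates to a plain min sketch, for which this particular estimator is not useful.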
Weighted count-distinct problem
In its weighted version, each element is associated with a weight and the goal is to estimate the total sum of weights.
Formally,
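: Instance: A stream of weighted elements $x_1, x_2, \ldots, x_s$ with repetitions, and an integer $m$. Let $n$ be the number of distinct elements, namely $n = |\{x_1, x_2, \ldots, x_s\}|$, and let these elements be $\{e_1, e_2, \ldots, e_n\}$. Finally, let $w_j$ be the weight of $e_j$.
: Objective: Find an estimate $\hat{w}$ of $w = \sum_{j=1}^{n} w_j$ using only $m$ storage units, where $m \ll n$.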
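An example of an instance for the weighted problem is: $a(3), b(4), a(3), c(2), d(3), b(4), d(3)$. For this instance, $e_1 = a, e_2 = b, e_3 = c, e_4 = d$, the weights are $w_1 = 3, w_2 = 4, w_3 = 2, w_4 = 3$ and $\sum w_j = 12$.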
As an application example, $x_1, x_2, \ldots, x_s$ could be IP packets received by a server. Each packet belongs to one of $n$ IP flows $e_1, e_2, \ldots, e_n$. The weight $w_j$ can be the load imposed by flow $e_j$ on the server. Thus, $\sum_{j=1}^{n} w_j$ represents the total load imposed on the server by all the flows to which packets $x_1, x_2, \ldots, x_s$ belong.
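For reference, the weighted quantity can be computed exactly with a dictionary, in the same way as the naive unweighted solution. The sketch below is illustrative only: the function name and the small stream of (element, weight) pairs are assumptions made for this example.

```python
from typing import Dict, Hashable, Iterable, Tuple


def weighted_distinct_exact(stream: Iterable[Tuple[Hashable, float]]) -> float:
    """Sum the weight of each distinct element once, ignoring repeated appearances."""
    weights: Dict[Hashable, float] = {}
    for element, weight in stream:
        weights.setdefault(element, weight)   # the first appearance fixes the weight
    return sum(weights.values())


stream = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
print(weighted_distinct_exact(stream))  # 12
```

As in the unweighted case, this exact baseline needs storage proportional to the number of distinct elements.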
Solving the weighted count-distinct problem
Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem.
For example, the weighted estimator proposed by Cohen et al. can be obtained when the continuous max sketches estimator is extended to solve the weighted problem.
In particular, the HyperLogLog algorithm can be extended to solve the weighted problem. The extended HyperLogLog algorithm is reported to offer the best performance, in terms of statistical accuracy and memory usage, among the known algorithms for the weighted problem.
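One standard way to carry an extreme order statistics sketch over to the weighted setting is to replace each uniform hash value by an exponential variable whose rate is the element's weight. The minimum over all distinct elements is then exponential with rate equal to the total weight. The sketch below illustrates that idea only and is not claimed to be the exact construction of the cited works: SHA-256 as the hash, the function names, and the "(k - 1) divided by the sum of minima" estimator are assumptions of this example, and weights must be positive.

```python
import hashlib
import math
from typing import Iterable, List, Tuple


def uniform_hash(element: str, seed: int) -> float:
    """Pseudo-uniform value in (0, 1], so that its logarithm is always defined."""
    digest = hashlib.sha256(f"{seed}:{element}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def weighted_min_sketch(stream: Iterable[Tuple[str, float]], num_hashes: int) -> List[float]:
    """Keep, per seeded hash, the minimum of -ln(U)/w over all stream elements."""
    minima = [math.inf] * num_hashes
    for element, weight in stream:
        for seed in range(num_hashes):
            # -ln(U)/w is exponential with rate w; duplicates hash to the same value
            value = -math.log(uniform_hash(element, seed)) / weight
            minima[seed] = min(minima[seed], value)
    return minima


def estimate_total_weight(minima: List[float]) -> float:
    """Each minimum is exponential with rate sum(w_j); (k - 1) / sum estimates that rate."""
    return (len(minima) - 1) / sum(minima)


flows = [(f"flow-{i}", 1.0 + (i % 5)) for i in range(200)]   # true total weight: 600
print(round(estimate_total_weight(weighted_min_sketch(flows + flows[:50], 64))))
```

Setting every weight to 1 recovers an unweighted distinct-count estimator, which is one way to see the weighted problem as a generalization.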
See also
* Count–min sketch
* Streaming algorithm
* Maximum likelihood
* Minimum-variance unbiased estimator