
In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). In most models, these algorithms have access to limited memory (generally logarithmic in the size of the stream and/or in the maximum value in the stream). They may also have limited processing time per item. These constraints may mean that an algorithm produces an approximate answer based on a summary or "sketch" of the data stream.


History

Though streaming algorithms had already been studied by Munro and Paterson as early as 1978, as well as by Philippe Flajolet and G. Nigel Martin in 1982/83, the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy. For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs, in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in o(n) space.


Models


Data stream model

In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream". If the stream has length m and the domain has size n, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.


Turnstile and cash register models

Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector \mathbf{a} = (a_1, \dots, a_n) (initialized to the zero vector \mathbf{0}) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of \mathbf{a} using considerably less space than it would take to represent \mathbf{a} precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models. In the cash register model, each update is of the form \langle i, c\rangle, so that a_i is incremented by some positive integer c. A notable special case is when c = 1 (only unit insertions are permitted). In the turnstile model, each update is of the form \langle i, c\rangle, so that a_i is incremented by some (possibly negative) integer c. In the "strict turnstile" model, no a_i at any time may be less than zero.
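As a concrete illustration of the update models, the following Python sketch (the function and parameter names are ours, not from the literature) maintains the frequency vector \mathbf{a} under a stream of \langle i, c\rangle updates and enforces each model's constraint:

```python
from collections import defaultdict

def apply_update(a, i, c, model="turnstile"):
    """Apply one stream update <i, c> to the frequency vector a.

    model: "cash_register"    -- only positive increments c are allowed;
           "strict_turnstile" -- c may be negative, but a_i may never go below zero;
           "turnstile"        -- any integer c is allowed.
    """
    if model == "cash_register" and c <= 0:
        raise ValueError("cash register model allows only positive increments")
    a[i] += c
    if model == "strict_turnstile" and a[i] < 0:
        raise ValueError("strict turnstile model forbids negative frequencies")
    return a

# A short stream of <i, c> updates in the (unrestricted) turnstile model.
a = defaultdict(int)
for i, c in [(1, 3), (2, 5), (1, -2)]:
    apply_update(a, i, c)
```

A streaming algorithm never stores `a` explicitly as above; it maintains a small sketch from which functions of \mathbf{a} can be approximated.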


Sliding window model

Several papers also consider the "sliding window" model. In this model, the function of interest is computed over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.
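A minimal illustration of the sliding-window model is a running sum over the most recent w items. This toy version stores the whole window, so it is exact; genuine sliding-window algorithms use sublinear summaries (such as exponential histograms) instead:

```python
from collections import deque

def sliding_window_sums(stream, w):
    """Yield the sum over the most recent w items after each arrival.

    As each new item enters the window, the oldest item (if the window
    is full) leaves it, mirroring the sliding-window model.
    """
    window = deque()
    total = 0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > w:
            total -= window.popleft()
        yield total

sums = list(sliding_window_sums([1, 2, 3, 4, 5], w=3))
```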


Evaluation

The performance of an algorithm that operates on data streams is measured by three basic factors:
* The number of passes the algorithm must make over the stream.
* The available memory.
* The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives. If the algorithm is an approximation algorithm, then the accuracy of the answer is another key factor. The accuracy is often stated as an (\epsilon,\delta) approximation, meaning that the algorithm achieves an error of less than \epsilon with probability 1-\delta.


Applications

Streaming algorithms have several applications in networking, such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on. They also have applications in databases, such as estimating the size of a join.


Some streaming problems


Frequency moments

The k-th frequency moment of a set of frequencies \mathbf{a} is defined as F_k(\mathbf{a}) = \sum_{i=1}^n a_i^k. The first moment F_1 is simply the sum of the frequencies (i.e., the total count). The second moment F_2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation. F_\infty is defined as the frequency of the most frequent item. The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.
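For reference, the exact (non-streaming) computation of F_k is straightforward but needs space linear in the number of distinct items, which is exactly what streaming algorithms avoid:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i a_i^k, where a_i is the
    number of occurrences of item i in the stream. This needs memory
    proportional to the number of distinct items; streaming algorithms
    approximate F_k in far less space."""
    counts = Counter(stream)
    return sum(a ** k for a in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]
f1 = frequency_moment(stream, 1)  # equals the length of the stream
f2 = frequency_moment(stream, 2)
```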


Calculating frequency moments

A direct approach to finding the frequency moments requires maintaining a register for every distinct element, which requires memory of order \Omega(N). But we have space limitations and require an algorithm that computes in much lower memory. This can be achieved by using approximations instead of exact values: an algorithm that computes an (\epsilon,\delta)-approximation of F_k, where F'_k is the (\epsilon,\delta)-approximated value of F_k, \epsilon is the approximation parameter, and \delta is the confidence parameter.


Calculating F_0 (distinct elements in a data stream)


FM-Sketch algorithm

Flajolet et al. introduced a probabilistic method of counting which was inspired by a paper by Robert Morris. Morris observed that, if the requirement of accuracy is dropped, a counter n can be replaced by a counter \log n, which can be stored in \log\log n bits. Flajolet et al. improved this method by using a hash function h which is assumed to uniformly distribute the elements in the hash space (a binary string of length L):

:h: [m] \rightarrow [0, 2^L - 1]

Let \mathrm{bit}(y,k) represent the k-th bit in the binary representation of y:

:y = \sum_{k \ge 0} \mathrm{bit}(y,k) \cdot 2^k

Let \rho(y) represent the position of the least significant 1-bit in the binary representation of y, with a suitable convention for \rho(0):

:\rho(y) = \begin{cases} \min\{k : \mathrm{bit}(y,k) = 1\} & \text{if } y > 0 \\ L & \text{if } y = 0 \end{cases}
Let A be the sequence of the data stream, of length M, whose cardinality needs to be determined. Let BITMAP[0 \ldots L-1] be the hash space where the \rho(\text{hashed values}) are recorded. The algorithm below then determines the approximate cardinality of A.
Procedure FM-Sketch:

    for i in 0 to L − 1 do
        BITMAP[i] := 0
    end for
    for x in A do
        index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := position of leftmost 0 bit of BITMAP[]
    return 2 ^ B
If there are N distinct elements in a data stream:
* For i \gg \log(N), BITMAP[i] is certainly 0.
* For i \ll \log(N), BITMAP[i] is certainly 1.
* For i \approx \log(N), BITMAP[i] is a fringe of 0's and 1's.
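The procedure above can be sketched in Python as follows. The MD5-based hash is our choice for reproducibility; the analysis assumes a uniformly random hash function, and the raw 2^B output is a biased estimate (Flajolet and Martin divide by a correction constant φ ≈ 0.77351, omitted here):

```python
import hashlib

def rho(y, L):
    """Position of the least significant 1-bit of y; L if y == 0."""
    if y == 0:
        return L
    return (y & -y).bit_length() - 1

def fm_sketch(stream, L=32):
    """FM-Sketch: mark bit rho(hash(x)) for each item, then return 2^B
    where B is the position of the leftmost (lowest-index) 0 bit."""
    bitmap = [0] * L
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (2 ** L)
        r = rho(h, L)
        if r < L:  # rho(0, L) == L would fall outside the bitmap
            bitmap[r] = 1
    B = next(i for i, b in enumerate(bitmap + [0]) if b == 0)
    return 2 ** B

estimate = fm_sketch(range(1000))
```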


''K''-minimum value algorithm

The previous algorithm describes the first attempt, by Flajolet and Martin, to approximate F_0 in the data stream. Their algorithm picks a random hash function which they assume uniformly distributes the hash values in hash space. Bar-Yossef et al. introduced the k-minimum value (KMV) algorithm for determining the number of distinct elements in a data stream. They used a similar hash function h, which can be normalized to [0,1] as h: [m] \rightarrow [0,1]. But they fixed a limit t on the number of values kept from the hash space. The value of t is assumed to be of the order O\left(\tfrac{1}{\varepsilon^2}\right) (i.e., a smaller approximation value \varepsilon requires a larger t). The KMV algorithm keeps only the t smallest hash values in the hash space. After all the m values of the stream have arrived, \upsilon = \mathrm{Max}(h(a_i)) over the retained values is used to calculate F'_0 = \tfrac{t}{\upsilon}. That is, in a close-to-uniform hash space, they expect at least t elements to be smaller than O\left(\tfrac{t}{F_0}\right).
Procedure 2 K-Minimum Value

    Initialize first t values of KMV
    for a in a_1 to a_n do
        if h(a) < Max(KMV) then
            Remove Max(KMV) from KMV set
            Insert h(a) to KMV
        end if
    end for
    return t/Max(KMV)
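A minimal Python sketch of KMV follows, again with a fixed MD5-based hash normalized to [0, 1) as our stand-in for the uniform hash function the analysis assumes:

```python
import hashlib
import heapq

def kmv_estimate(stream, t=100):
    """K-minimum value sketch: keep the t smallest normalized hash
    values; estimate F_0 as t / max(kept values)."""
    heap = []      # negated values, so heap[0] is -(current max of the t kept)
    kept = set()   # hash values currently retained, to skip duplicates
    for x in stream:
        # MD5 gives 128 bits; divide to normalize into [0, 1).
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 2.0 ** 128
        if h in kept:
            continue
        if len(heap) < t:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            kept.discard(-heapq.heappop(heap))
            heapq.heappush(heap, -h)
            kept.add(h)
    if len(heap) < t:
        return float(len(heap))  # fewer than t distinct items: count is exact
    return t / -heap[0]

estimate = kmv_estimate(range(1000), t=100)
```

With t = 100 the relative error is roughly 1/\sqrt{t} = 10%, so the estimate above should land near the true count of 1000.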


Complexity analysis of KMV

The KMV algorithm can be implemented in O\left(\left(\tfrac{1}{\varepsilon^2}\right)\cdot\log(m)\right) memory bits: each hash value requires space of order O(\log(m)) memory bits, and there are hash values of the order O\left(\tfrac{1}{\varepsilon^2}\right). The access time can be reduced if we store the t hash values in a binary tree; the time complexity is then reduced to O\left(\log\left(\tfrac{1}{\varepsilon}\right)\cdot\log(m)\right).


Calculating F_k

Alon et al. estimate F_k by defining random variables that can be computed within given space and time. The expected value of these random variables gives the approximate value of F_k. Assume the length of the sequence m is known in advance. Then construct a random variable X as follows:
* Select a_p, a random member of the sequence, at index p, with a_p = l \in \{1,2,3,\ldots,n\}.
* Let r = |\{q : q \ge p,\ a_q = l\}| represent the number of occurrences of l among the members of the sequence from position p onward.
* Define the random variable X = m(r^k - (r-1)^k).

Assume S_1 is of the order O(n^{1-1/k}/\lambda^2) and S_2 is of the order O(\log(1/\varepsilon)). The algorithm takes S_2 random variables Y_1, Y_2, \ldots, Y_{S_2} and outputs the median Y, where each Y_i is the average of X_{ij} for 1 \le j \le S_1. Now calculate the expectation of the random variable X:

: \begin{align} E(X) &= \sum_{i=1}^{n} \sum_{j=1}^{m_i} (j^k - (j-1)^k) \\ &= \frac{m}{m}\left[\left(1^k + (2^k - 1^k) + \ldots + (m_1^k - (m_1 - 1)^k)\right)\right. \\ &\quad + \left(1^k + (2^k - 1^k) + \ldots + (m_2^k - (m_2 - 1)^k)\right) + \ldots \\ &\quad + \left.\left(1^k + (2^k - 1^k) + \ldots + (m_n^k - (m_n - 1)^k)\right)\right] \\ &= \sum_{i=1}^{n} m_i^k = F_k \end{align}
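The basic estimator X can be simulated directly (here averaging many independent copies in place of the full median-of-averages scheme; `trials` and `seed` are our illustrative knobs, not part of the original analysis):

```python
import random

def ams_estimate(stream, k, trials=500, seed=1):
    """Basic AMS estimator for the k-th frequency moment F_k:
    pick a random position p, let r be the number of occurrences of
    stream[p] from p onward, and output X = m * (r^k - (r-1)^k).
    Averaging independent copies reduces the variance."""
    m = len(stream)
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        p = rng.randrange(m)
        l = stream[p]
        r = sum(1 for q in range(p, m) if stream[q] == l)
        total += m * (r ** k - (r - 1) ** k)
    return total / trials

stream = [1, 2, 1, 3, 1, 2, 4, 1]   # frequencies: 1->4, 2->2, 3->1, 4->1
est = ams_estimate(stream, k=2)      # true F_2 = 16 + 4 + 1 + 1 = 22
```

Note the code scans the suffix of the stream to compute r; in the streaming setting this count is maintained incrementally as the remaining items arrive.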


Complexity of F_k

From the algorithm to calculate F_k discussed above, we can see that each random variable X stores the value of a_p and r. So, to compute X we need to maintain only \log(n) bits for storing a_p and \log(m) bits for storing r. The total number of random variables X will be S_1 \cdot S_2. Hence the total space complexity the algorithm takes is of the order of O\left(\dfrac{k \log \frac{1}{\varepsilon}}{\lambda^2} n^{1-\frac{1}{k}}\left(\log n + \log m\right)\right).


Simpler approach to calculate F_2

The previous algorithm calculates F_2 in order of O(\sqrt{n}(\log m + \log n)) memory bits. Alon et al. simplified this algorithm using four-wise independent random variables with values mapped to \{-1, 1\}. This further reduces the complexity of calculating F_2 to O\left(\dfrac{\log \frac{1}{\varepsilon}}{\lambda^2}\left(\log n + \log m\right)\right).
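The idea behind this "tug-of-war" estimator can be sketched as follows. Each sketch maintains Z = \sum_i s(i) a_i for random signs s(i) \in \{-1, +1\}, and Z^2 is an unbiased estimate of F_2. For simplicity we draw fully independent pseudorandom signs per sketch; the space-efficient version replaces them with a four-wise independent sign function:

```python
import random

def tug_of_war_f2(stream, sketches=200, seed=7):
    """Estimate F_2 by averaging Z^2 over several sketches, where
    Z = sum over stream items of a random sign s(item)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(sketches):
        signs = {}  # one random sign per distinct item, per sketch
        z = 0
        for x in stream:
            if x not in signs:
                signs[x] = rng.choice((-1, 1))
            z += signs[x]   # accumulates s(x) once per occurrence
        estimates.append(z * z)
    return sum(estimates) / sketches

stream = [1, 2, 1, 3, 1, 2, 4, 1]  # true F_2 = 22
est = tug_of_war_f2(stream)
```

In a real streaming implementation, only Z (a single counter per sketch) is updated as each item arrives, which is what yields the logarithmic space bound.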


Frequent elements

In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream. More formally, fix some positive constant c > 1, let the length of the stream be m, and let f_i denote the frequency of value i in the stream. The frequent elements problem is to output the set \{i : f_i > m/c\}. Some notable algorithms are:
* Boyer–Moore majority vote algorithm
* Count-Min sketch
* Lossy counting
* Multi-stage Bloom filters
* Misra–Gries heavy hitters algorithm
* Misra–Gries summary
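The simplest of these, the Boyer–Moore majority vote, solves the majority special case in one pass and constant space:

```python
def boyer_moore_majority(stream):
    """Boyer–Moore majority vote: one pass, constant space. Returns a
    candidate that is the majority element if one exists; a second
    verification pass is needed to confirm the candidate actually
    occurs more than m/2 times."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate

cand = boyer_moore_majority([1, 2, 1, 1, 3, 1, 1])
```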


Event detection

Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as a trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.
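An exponentially weighted moving average of the kind mentioned above can be sketched as follows (the smoothing parameter `alpha` is an illustrative choice):

```python
def ewma_series(values, alpha=0.3):
    """Exponentially weighted moving average of a series of counts, as
    might be used to smooth per-interval frequencies before flagging
    a large deviation as a trend."""
    avg = None
    out = []
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        out.append(avg)
    return out

smoothed = ewma_series([10, 12, 11, 50, 13])  # the spike at 50 is damped
```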


Counting distinct elements

Counting the number of distinct elements in a stream (sometimes called the F_0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson, and David Woodruff found an asymptotically optimal algorithm for this problem. It uses O(\varepsilon^{-2} + \log n) space, with O(1) worst-case update and reporting times, as well as universal hash functions and an r-wise independent hash family where r = \Omega(\log(1/\varepsilon) / \log\log(1/\varepsilon)).


Entropy

The (empirical) entropy of a set of frequencies \mathbf{a} is defined as H(\mathbf{a}) = \sum_{i=1}^n \frac{a_i}{m}\log\frac{m}{a_i}, where m = \sum_{i=1}^n a_i.
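Computed exactly (base-2 logarithm is our choice of unit), the definition reads:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Empirical entropy H = sum_i (a_i/m) * log2(m/a_i), where a_i is
    the frequency of item i and m is the stream length. Exact
    computation; streaming algorithms approximate H in small space."""
    counts = Counter(stream)
    m = sum(counts.values())
    return sum((a / m) * math.log2(m / a) for a in counts.values())

h = empirical_entropy(["a", "b", "c", "d"])  # uniform over 4 items -> 2 bits
```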


Online learning

Learn a model (e.g., a classifier) by a single pass over a training set.
* Feature hashing
* Stochastic gradient descent


Lower bounds

Lower bounds have been computed for many of the data streaming problems that have been studied. By far the most common technique for computing these lower bounds has been communication complexity.


See also

* Data stream mining
* Data stream clustering
* Online algorithm
* Stream processing
* Sequential algorithm


Notes


References

* Heath, D.; Kasif, S.; Kosaraju, R.; Salzberg, S.; Sullivan, G. (1991). "Learning Nested Concepts With Limited Storage". Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI'91), Vol. 2, pp. 777–782. Morgan Kaufmann Publishers.