Online aggregation is a technique for improving the interactive behavior of database systems processing expensive analytical queries. Almost all database
operations are performed in batch mode, i.e. the user issues a query and waits until the database has finished processing the entire query. In contrast, with online aggregation the user gets estimates of an aggregate query in an online fashion as soon as the query is issued. For example, if the final answer is 1000, after k seconds the user gets an estimate in the form of a confidence interval such as [90, 1020] with 95% probability. This confidence interval keeps shrinking as the system gets more and more samples.
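
The shrinking interval can be illustrated with a minimal sketch. It assumes a single in-memory table, an AVG aggregate, a random scan order standing in for random sampling, and a normal-approximation interval with no finite-population correction. The function name `online_mean_estimates` and its parameters are hypothetical and are not taken from any of the systems described here.

```python
import math
import random


def online_mean_estimates(values, report_every=1000, z=1.96, seed=0):
    """Yield (rows_seen, estimate, low, high) while scanning rows in random order.

    Assumption: z=1.96 gives an approximate 95% interval via the central
    limit theorem; real systems may use different bounds.
    """
    rng = random.Random(seed)
    rows = list(values)
    rng.shuffle(rows)  # random scan order stands in for random sampling

    n, mean, m2 = 0, 0.0, 0.0  # Welford's running mean and sum of squares
    for x in rows:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1 and n % report_every == 0:
            std_err = math.sqrt(m2 / (n - 1) / n)  # sample std / sqrt(n)
            yield n, mean, mean - z * std_err, mean + z * std_err


if __name__ == "__main__":
    data_rng = random.Random(1)
    table = [data_rng.gauss(1000, 200) for _ in range(10_000)]
    for seen, est, low, high in online_mean_estimates(table):
        print(f"{seen:>6} rows: {est:8.2f} in [{low:8.2f}, {high:8.2f}]")
```

Each reported interval is based on more rows than the previous one, so its width falls roughly in proportion to one over the square root of the number of rows seen.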
Online aggregation was proposed in 1997 by Hellerstein, Haas and Wang
for group-by aggregation queries over a single table. Later, the authors showed how to evaluate joins in an online fashion.
In 2007, Jermaine et al. designed and implemented a prototype database system called Database-Online (or DBO) that computes group-by aggregate queries over multiple tables in an online and, more importantly, scalable fashion.
All the approaches for online aggregation use random sampling, which is non-trivial in a distributed environment due to the inspection paradox of renewal-reward theory. In 2011, Pansare et al. proposed a Bayesian model to deal with the inspection paradox and implemented online aggregation for a MapReduce-like environment.
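
One way to see why the inspection paradox complicates sampling is a small simulation. It is a toy model and not the Bayesian model of Pansare et al. It assumes each worker processes randomly assigned blocks one after another, and that a block's processing time equals its aggregate value, which is drawn from an exponential distribution with mean 1. The name `snapshot_estimates` and all parameters are hypothetical.

```python
import random


def snapshot_estimates(workers=5000, snapshot_time=1.0, seed=0):
    """Compare two block averages taken at a fixed wall-clock snapshot.

    Returns (finished_only_mean, all_started_mean). The true mean block
    value in this toy model is 1.0.
    """
    rng = random.Random(seed)
    finished = []  # blocks fully processed before the snapshot
    started = []   # every block started, including those still running

    for _ in range(workers):
        clock = 0.0
        while clock < snapshot_time:
            value = rng.expovariate(1.0)  # block's aggregate value
            started.append(value)
            clock += value  # assumption: processing time equals the value
            if clock <= snapshot_time:
                finished.append(value)

    finished_mean = sum(finished) / len(finished)
    started_mean = sum(started) / len(started)
    return finished_mean, started_mean


if __name__ == "__main__":
    finished_mean, started_mean = snapshot_estimates()
    print(f"finished blocks only: {finished_mean:.3f}")
    print(f"all started blocks:   {started_mean:.3f}")
```

In this sketch the blocks still running at the snapshot tend to be the long, large-valued ones, so averaging only the finished blocks underestimates the true mean. Including the in-progress blocks brings the average back close to 1.0.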