In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic.

Figure 1: Aggregation in an entity relationship diagram

Common aggregate functions include:
* Average (i.e., arithmetic mean)
* Count
* Maximum
* Median
* Minimum
* Mode
* Range
* Sum

Others include:
* Nanmean (mean ignoring NaN values, also known as "nil" or "null")
* Stddev

Formally, an aggregate function takes as input a set, a multiset (bag), or a list from some input domain and outputs an element of an output domain. The input and output domains may be the same, such as for SUM, or may be different, such as for COUNT.

Aggregate functions occur commonly in numerous programming languages, in spreadsheets, and in relational algebra.

The listagg function, as defined in the SQL:2016 standard, aggregates data from multiple rows into a single concatenated string.

In the entity relationship diagram, aggregation is represented as seen in Figure 1 with a rectangle around the relationship and its entities to indicate that it is being treated as an aggregate entity.
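As a rough illustration of the aggregates listed above, the following Python sketch applies each of them to a small list using only the standard library. The helper names `summarize`, `nanmean`, and `listagg` and the sample data are assumptions made for this example. In particular, `listagg` only loosely imitates the SQL function by joining values with a separator, and Stddev is taken here to mean the population standard deviation.

```python
import math
import statistics


def nanmean(values):
    """Mean of the values that are not NaN (hypothetical helper)."""
    kept = [v for v in values if not math.isnan(v)]
    return statistics.fmean(kept)


def summarize(values):
    """Apply each common aggregate to a non-empty list of numbers."""
    return {
        "average": statistics.fmean(values),
        "count": len(values),
        "maximum": max(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "sum": sum(values),
        # Assumption: population (not sample) standard deviation.
        "stddev": statistics.pstdev(values),
    }


def listagg(values, separator=","):
    """Join values into one string, loosely imitating SQL's listagg."""
    return separator.join(str(v) for v in values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative sample data
summary = summarize(data)
```

The input and output domains can be seen to differ here: `count` is an integer regardless of the element type, and `listagg` returns a string, whereas `sum`, `maximum`, and `minimum` stay in the domain of the inputs.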

Decomposable aggregate functions

Aggregate functions present a bottleneck, because they potentially require having all input values at once. In distributed computing, it is desirable to divide such computations into smaller pieces and distribute the work, usually computing in parallel, via a divide and conquer algorithm.

Some aggregate functions can be computed by computing the aggregate for subsets, and then aggregating these aggregates; examples include COUNT, MAX, MIN, and SUM. In other cases the aggregate can be computed by computing auxiliary numbers for subsets, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In other cases the aggregate cannot be computed without analyzing the entire set at once, though in some cases approximations can be distributed; examples include DISTINCT COUNT (see the count-distinct problem), MEDIAN, and MODE.

Functions of the first two kinds are called decomposable aggregation functions or decomposable aggregate functions. The simplest may be referred to as self-decomposable aggregation functions, which are defined as those functions f for which there is a ''merge operator'' ⋄ such that:

f(X \uplus Y) = f(X) \diamond f(Y)

where ⊎ is the union of multisets (see monoid homomorphism).

For example, SUM:

\operatorname{SUM}(\{x\}) = x

for a singleton, and

\operatorname{SUM}(X \uplus Y) = \operatorname{SUM}(X) + \operatorname{SUM}(Y)

meaning that merge is simply addition.

COUNT:

\operatorname{COUNT}(\{x\}) = 1

\operatorname{COUNT}(X \uplus Y) = \operatorname{COUNT}(X) + \operatorname{COUNT}(Y)

MAX:

\operatorname{MAX}(\{x\}) = x

\operatorname{MAX}(X \uplus Y) = \max\bigl(\operatorname{MAX}(X), \operatorname{MAX}(Y)\bigr)

MIN:

\operatorname{MIN}(\{x\}) = x

\operatorname{MIN}(X \uplus Y) = \min\bigl(\operatorname{MIN}(X), \operatorname{MIN}(Y)\bigr)
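The singleton and merge rules for SUM, COUNT, MAX, and MIN can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the table `SELF_DECOMPOSABLE` and the functions `aggregate` and `aggregate_in_parts` are names invented for this example, and lists stand in for multisets, with list concatenation playing the role of the multiset union.

```python
from functools import reduce

# Each self-decomposable aggregate is described by two pieces:
#   single(x):   the aggregate of the singleton {x}
#   merge(a, b): the merge operator combining two partial aggregates
SELF_DECOMPOSABLE = {
    "SUM": (lambda x: x, lambda a, b: a + b),
    "COUNT": (lambda x: 1, lambda a, b: a + b),
    "MAX": (lambda x: x, max),
    "MIN": (lambda x: x, min),
}


def aggregate(name, values):
    """Fold a non-empty collection using the singleton and merge rules."""
    single, merge = SELF_DECOMPOSABLE[name]
    return reduce(merge, (single(v) for v in values))


def aggregate_in_parts(name, parts):
    """Aggregate each non-empty part separately, then merge the results."""
    _, merge = SELF_DECOMPOSABLE[name]
    return reduce(merge, (aggregate(name, part) for part in parts))


# Illustrative data: two parts whose concatenation is the whole input.
x_part, y_part = [3, 1, 4], [1, 5, 9, 2]
results = {
    name: aggregate_in_parts(name, [x_part, y_part])
    for name in SELF_DECOMPOSABLE
}
```

Because each merge operator is associative and commutative, the parts can be processed in any order, or on different machines, and still agree with `aggregate` applied to the whole input.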

Note that self-decomposable aggregation functions can be combined (formally, taking the product) by applying them separately, so for instance one can compute both the SUM and COUNT at the same time, by tracking two numbers.

More generally, one can define a decomposable aggregation function f as one that can be expressed as the composition of a final function g and a self-decomposable aggregation function h,

f = g \circ h, \quad f(X) = g(h(X))

For example, AVERAGE=SUM/COUNT and RANGE=MAX−MIN.

In the MapReduce framework, these steps are known as InitialReduce (value on individual record/singleton set), Combine (binary merge on two aggregations), and FinalReduce (final function on auxiliary values), and moving decomposable aggregation before the Shuffle phase is known as an InitialReduce step.

Decomposable aggregation functions are important in online analytical processing (OLAP), as they allow aggregation queries to be computed on the pre-computed results in the OLAP cube, rather than on the base data. For example, it is easy to support COUNT, MAX, MIN, and SUM in OLAP, since these can be computed for each cell of the OLAP cube and then summarized ("rolled up"), but it is difficult to support MEDIAN, as that must be computed for every view separately.
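The three steps and the roll-up idea can be sketched in Python as below. The function names mirror the InitialReduce, Combine, and FinalReduce steps described above but are plain functions written for this example, not the API of any MapReduce or OLAP system; the `cells` data is hypothetical.

```python
from functools import reduce


def initial_reduce(x):
    """Auxiliary values for a single record: (sum, count, max, min)."""
    return (x, 1, x, x)


def combine(a, b):
    """Binary merge of two auxiliary tuples, applied component-wise."""
    return (a[0] + b[0], a[1] + b[1], max(a[2], b[2]), min(a[3], b[3]))


def final_average(aux):
    """Final function for AVERAGE = SUM / COUNT."""
    return aux[0] / aux[1]


def final_range(aux):
    """Final function for RANGE = MAX - MIN."""
    return aux[2] - aux[3]


def partial_aggregate(records):
    """Reduce one non-empty partition (e.g. one cell) to its auxiliary tuple."""
    return reduce(combine, map(initial_reduce, records))


# Hypothetical pre-aggregated cells; only the small tuples are rolled up.
cells = [[10, 20], [30], [40, 50, 60]]
rolled_up = reduce(combine, (partial_aggregate(cell) for cell in cells))
average = final_average(rolled_up)
value_range = final_range(rolled_up)
```

Tracking all four auxiliary values in one tuple is an instance of the product construction: SUM, COUNT, MAX, and MIN are merged independently, and the final functions are applied only once, after the roll-up.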

Other decomposable aggregate functions

In order to calculate the average and standard deviation from aggregate data, it is necessary to have available for each group: the total of the values (Σx_i = SUM(x)), the number of values (N = COUNT(x)), and the total of the squares of the values (Σx_i² = SUM(x²)).

AVG:

\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{AVG}(X) * \operatorname{COUNT}(X) + \operatorname{AVG}(Y) * \operatorname{COUNT}(Y)\bigr) / \bigl(\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)\bigr)

or

\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{SUM}(X) + \operatorname{SUM}(Y)\bigr) / \bigl(\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)\bigr)

or, only if COUNT(X)=COUNT(Y)

\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{AVG}(X) + \operatorname{AVG}(Y)\bigr) / 2
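A short Python sketch may help show why the unweighted form is restricted to groups of equal size. The function `merge_avg` and the two sample groups are assumptions made for this illustration.

```python
def merge_avg(avg_x, count_x, avg_y, count_y):
    """Average of the union, from per-group averages and counts."""
    return (avg_x * count_x + avg_y * count_y) / (count_x + count_y)


# Illustrative groups of unequal size.
x, y = [1, 2, 3, 4], [10, 20]
avg_x, avg_y = sum(x) / len(x), sum(y) / len(y)

weighted = merge_avg(avg_x, len(x), avg_y, len(y))
from_sums = (sum(x) + sum(y)) / (len(x) + len(y))
naive = (avg_x + avg_y) / 2  # only valid when the counts are equal
```

With these groups the weighted form and the sum-based form agree, while the simple mean of the two averages gives a different value because the groups have different counts.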

SUM(x²): The sum of the squares of the values is needed in order to calculate the standard deviation of groups.

\operatorname{SUM}(X^2 \uplus Y^2) = \operatorname{SUM}(X^2) + \operatorname{SUM}(Y^2)

STDDEV:
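For a finite population with equal probabilities at all points, we have (see Standard deviation, "Identities and mathematical properties"):

\operatorname{STDDEV}(X) = s(x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \overline{x})^2} = \sqrt{\frac{1}{N} \left(\sum_{i=1}^{N} x_i^2\right) - \overline{x}^2} = \sqrt{\operatorname{SUM}(x^2) / \operatorname{COUNT}(x) - \operatorname{AVG}(x)^2}

This means that the standard deviation is equal to the square root of the difference between the average of the squares of the values and the square of the average value.

\operatorname{STDDEV}(X \uplus Y) = \sqrt{\operatorname{SUM}(X^2 \uplus Y^2) / \operatorname{COUNT}(X \uplus Y) - \operatorname{AVG}(X \uplus Y)^2}

\operatorname{STDDEV}(X \uplus Y) = \sqrt{\frac{\operatorname{SUM}(X^2) + \operatorname{SUM}(Y^2)}{\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)} - \left(\frac{\operatorname{SUM}(X) + \operatorname{SUM}(Y)}{\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)}\right)^2}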
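The following Python sketch shows one way the standard deviation of a union could be computed from per-group COUNT, SUM, and SUM of squares alone. The function names and sample groups are assumptions for this example, and the population form of the standard deviation is used, matching the identity above.

```python
import math


def summarize_group(values):
    """Per-group auxiliary values: (COUNT, SUM, SUM of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge_groups(a, b):
    """Merge two groups by adding their auxiliary values component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def stddev_from_summary(summary):
    """Population standard deviation: sqrt(mean of squares - square of mean)."""
    count, total, total_sq = summary
    mean = total / count
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    return math.sqrt(max(total_sq / count - mean * mean, 0.0))


# Illustrative groups; the raw values are not needed after summarizing.
x, y = [2, 4, 4, 4], [5, 5, 7, 9]
merged = merge_groups(summarize_group(x), summarize_group(y))
stddev_xy = stddev_from_summary(merged)
```

This form is convenient for merging, but subtracting two large, nearly equal quantities can lose precision in floating-point arithmetic when the mean is large relative to the spread, which is why the sketch clamps the radicand at zero.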
