Aggregation Operator

In database management, an aggregate function or aggregation function is a function where multiple values are processed together to form a single summary statistic.

Common aggregate functions include:

* Average (i.e., arithmetic mean)
* Count
* Maximum
* Median
* Minimum
* Mode
* Range
* Sum

Others include:

* Nanmean (mean ignoring NaN values, also known as "nil" or "null")
* Stddev

Formally, an aggregate function takes as input a set, a multiset (bag), or a list from some input domain and outputs an element of an output domain. The input and output domains may be the same, such as for SUM, or may be different, such as for COUNT.

Aggregate functions occur commonly in numerous programming languages, in spreadsheets, and in relational algebra. The listagg function, as defined in the SQL:2016 standard, aggregates data from multiple rows into a single concatenated string.

In the entity relationship diagram, aggregation is represented as seen in Figure 1 with a rectangle around the relationship and its entities to indicate that it is being treated as an aggregate entity.
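
As a minimal sketch, the common aggregate functions named in this section can be computed with Python's standard library. The sample data, the `nanmean` helper, and the `listagg` variable are assumptions made for illustration; they are not taken from any particular database system.

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 8.0]

aggregates = {
    "average": statistics.fmean(values),   # arithmetic mean
    "count": len(values),                  # output domain differs from input
    "maximum": max(values),
    "median": statistics.median(values),
    "minimum": min(values),
    "mode": statistics.mode(values),
    "range": max(values) - min(values),
    "sum": math.fsum(values),              # output domain same as input
    "stddev": statistics.pstdev(values),   # population standard deviation
}


def nanmean(xs):
    """Mean of the values that are not NaN (hypothetical helper)."""
    kept = [x for x in xs if not math.isnan(x)]
    return statistics.fmean(kept)


# A listagg-style aggregate: many rows become one concatenated string.
listagg = ", ".join(["a", "b", "c"])
```

Each entry maps a whole collection to a single value, which is the defining property of an aggregate function.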


Decomposable aggregate functions

Aggregate functions present a bottleneck, because they potentially require having all input values at once. In distributed computing, it is desirable to divide such computations into smaller pieces and distribute the work, usually computing in parallel, via a divide and conquer algorithm.

Some aggregate functions can be computed by computing the aggregate for subsets, and then aggregating these aggregates; examples include COUNT, MAX, MIN, and SUM. In other cases the aggregate can be computed by computing auxiliary numbers for subsets, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In other cases the aggregate cannot be computed without analyzing the entire set at once, though in some cases approximations can be distributed; examples include DISTINCT COUNT (the count-distinct problem), MEDIAN, and MODE.

Functions of the first two kinds are called decomposable aggregation functions or decomposable aggregate functions. The simplest may be referred to as self-decomposable aggregation functions, which are defined as those functions $f$ such that there is a merge operator $\diamond$ such that

$f(X \uplus Y) = f(X) \diamond f(Y)$

where $\uplus$ is the union of multisets (see monoid homomorphism).

For example, SUM:

$\operatorname{SUM}(\{x\}) = x$, for a singleton;
$\operatorname{SUM}(X \uplus Y) = \operatorname{SUM}(X) + \operatorname{SUM}(Y)$, meaning that merge is simply addition.

COUNT:

$\operatorname{COUNT}(\{x\}) = 1$,
$\operatorname{COUNT}(X \uplus Y) = \operatorname{COUNT}(X) + \operatorname{COUNT}(Y)$.

MAX:

$\operatorname{MAX}(\{x\}) = x$,
$\operatorname{MAX}(X \uplus Y) = \max\bigl(\operatorname{MAX}(X), \operatorname{MAX}(Y)\bigr)$.

MIN:

$\operatorname{MIN}(\{x\}) = x$,
$\operatorname{MIN}(X \uplus Y) = \min\bigl(\operatorname{MIN}(X), \operatorname{MIN}(Y)\bigr)$.

Note that self-decomposable aggregation functions can be combined (formally, taking the product) by applying them separately, so for instance one can compute both the SUM and COUNT at the same time, by tracking two numbers.

More generally, one can define a decomposable aggregation function $f$ as one that can be expressed as the composition of a final function $g$ and a self-decomposable aggregation function $h$:

$f = g \circ h, \quad f(X) = g(h(X))$.

For example, AVERAGE = SUM / COUNT and RANGE = MAX - MIN.

In the MapReduce framework, these steps are known as InitialReduce (value on individual record/singleton set), Combine (binary merge on two aggregations), and FinalReduce (final function on auxiliary values), and moving decomposable aggregation before the Shuffle phase is known as an InitialReduce step.

Decomposable aggregation functions are important in online analytical processing (OLAP), as they allow aggregation queries to be computed on the pre-computed results in the OLAP cube, rather than on the base data. For example, it is easy to support COUNT, MAX, MIN, and SUM in OLAP, since these can be computed for each cell of the OLAP cube and then summarized ("rolled up"), but it is difficult to support MEDIAN, as that must be computed for every view separately.
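
The three steps described in this section (value on a singleton, binary merge, final function) can be sketched in Python as follows. This is an illustrative sketch, not the API of any MapReduce implementation: the `decomposable` helper and the `distributed_*` names are hypothetical, and every partition is assumed to be non-empty.

```python
from functools import reduce
import operator


def decomposable(initial_reduce, combine, final_reduce):
    """Build an aggregate from its three steps (hypothetical helper).

    initial_reduce: value on a single record (singleton set)
    combine:        binary merge of two partial aggregations
    final_reduce:   final function applied to the merged auxiliary values
    """
    def run(partitions):
        # Each partition is reduced independently; this part could run in parallel.
        # Assumption: no partition is empty (reduce would raise TypeError).
        partials = [reduce(combine, map(initial_reduce, p)) for p in partitions]
        # The partial results are then merged and finalized.
        return final_reduce(reduce(combine, partials))
    return run


def identity(state):
    return state


# Self-decomposable aggregates: the final function is the identity.
distributed_sum = decomposable(lambda x: x, operator.add, identity)
distributed_count = decomposable(lambda x: 1, operator.add, identity)
distributed_max = decomposable(lambda x: x, max, identity)
distributed_min = decomposable(lambda x: x, min, identity)

# Decomposable aggregates: track auxiliary values, finish at the end.
distributed_average = decomposable(
    lambda x: (x, 1),                                  # (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
    lambda state: state[0] / state[1],                 # SUM / COUNT
)
distributed_range = decomposable(
    lambda x: (x, x),                                  # (max, min)
    lambda a, b: (max(a[0], b[0]), min(a[1], b[1])),
    lambda state: state[0] - state[1],                 # MAX - MIN
)
```

For instance, `distributed_average([[1, 2, 3], [4, 5], [6]])` merges the per-partition pairs (6, 3), (9, 2), and (6, 1) into (21, 6) before dividing. MEDIAN has no such small auxiliary state, which is why it does not fit this pattern.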


Other decomposable aggregate functions

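In order to calculate the average and standard deviation from aggregate data, it is necessary to have available for each group: the total of the values ($\sum x_i = \operatorname{SUM}(x)$), the number of values ($N = \operatorname{COUNT}(x)$), and the total of the squares of the values ($\sum x_i^2 = \operatorname{SUM}(x^2)$).

AVG:

$\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{AVG}(X) \cdot \operatorname{COUNT}(X) + \operatorname{AVG}(Y) \cdot \operatorname{COUNT}(Y)\bigr) / \bigl(\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)\bigr)$

or

$\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{SUM}(X) + \operatorname{SUM}(Y)\bigr) / \bigl(\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)\bigr)$

or, only if COUNT(X) = COUNT(Y),

$\operatorname{AVG}(X \uplus Y) = \bigl(\operatorname{AVG}(X) + \operatorname{AVG}(Y)\bigr) / 2$

SUM(x^2): The sum of the squares of the values is needed in order to calculate the standard deviation of groups.

$\operatorname{SUM}(X^2 \uplus Y^2) = \operatorname{SUM}(X^2) + \operatorname{SUM}(Y^2)$

STDDEV: For a finite population with equal probabilities at all points, we have (see Standard deviation, Identities and mathematical properties)

$\operatorname{STDDEV}(X) = s(x) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2} = \sqrt{\frac{1}{N}\Bigl(\sum_{i=1}^{N} x_i^2\Bigr) - \bar{x}^2} = \sqrt{\frac{\operatorname{SUM}(x^2)}{\operatorname{COUNT}(x)} - \operatorname{AVG}(x)^2}$

This means that the standard deviation is equal to the square root of the difference between the average of the squares of the values and the square of the average value.

$\operatorname{STDDEV}(X \uplus Y) = \sqrt{\frac{\operatorname{SUM}(X^2 \uplus Y^2)}{\operatorname{COUNT}(X \uplus Y)} - \operatorname{AVG}(X \uplus Y)^2}$

$\operatorname{STDDEV}(X \uplus Y) = \sqrt{\frac{\operatorname{SUM}(X^2) + \operatorname{SUM}(Y^2)}{\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)} - \left(\frac{\operatorname{SUM}(X) + \operatorname{SUM}(Y)}{\operatorname{COUNT}(X) + \operatorname{COUNT}(Y)}\right)^2}$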
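
The merge rules for AVG and STDDEV can be sketched in Python by keeping, for each group, the three quantities named in this section: COUNT(x), SUM(x), and SUM(x^2). The function names and sample groups below are assumptions for illustration only. The sketch computes the population standard deviation, and this formula can lose precision when the mean is large relative to the spread.

```python
import math
import statistics


def summarize(values):
    """Per-group auxiliary values: (COUNT(x), SUM(x), SUM(x^2))."""
    return (len(values), math.fsum(values), math.fsum(v * v for v in values))


def merge(a, b):
    """Merge two group summaries by adding them component-wise."""
    return tuple(p + q for p, q in zip(a, b))


def avg(state):
    """AVG = SUM(x) / COUNT(x)."""
    n, s, _ = state
    return s / n


def stddev(state):
    """STDDEV = sqrt(SUM(x^2) / COUNT(x) - AVG(x)^2)."""
    n, s, sq = state
    mean = s / n
    # Clamp at zero: rounding can make the difference slightly negative.
    return math.sqrt(max(sq / n - mean * mean, 0.0))


# Hypothetical groups, e.g. rows held on two different machines.
x = [2.0, 4.0, 4.0, 4.0]
y = [5.0, 5.0, 7.0, 9.0]

merged = merge(summarize(x), summarize(y))
```

Here `merged` is (8, 40.0, 232.0), so the average is 40 / 8 = 5 and the standard deviation is sqrt(232 / 8 - 5^2) = 2, the same values obtained by processing all eight numbers together.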


See also

* Cross-tabulation a.k.a. Contingency table
* Data drilling
* Data mining
* Data processing
* Extract, transform, load
* Fold (higher-order function)
* Group by (SQL), SQL clause
* OLAP cube
* Online analytical processing
* Pivot table
* Relational algebra
* Utility functions on indivisible goods (Aggregates of utility functions)
* XML for Analysis
* AggregateIQ
* MapReduce


Literature

* Oracle Aggregate Functions: MAX, MIN, COUNT, SUM, AVG Examples


External links


Aggregate Functions (Transact-SQL)