Aggregate (data Warehouse)
   HOME

TheInfoList



OR:

An aggregate is a type of summary used in dimensional models of
data warehouse In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for Business intelligence, reporting and data analysis and is a core component of business intelligence. Data warehouses are central Re ...
s to shorten the time it takes to provide answers to typical queries on large sets of
data Data ( , ) are a collection of discrete or continuous values that convey information, describing the quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted for ...
. The reason why aggregates can make such a dramatic increase in the performance of a data warehouse is the reduction of the number of rows to be accessed when responding to a query.


Use

In its simplest form, an aggregate is a simple summary table that can be derived by performing a '' Group by'' SQL query. A more common use of aggregates is to take a
dimension In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coo ...
and change its granularity. When changing the
granularity Granularity (also called graininess) is the degree to which a material or system is composed of distinguishable pieces, "granules" or "grains" (metaphorically). It can either refer to the extent to which a larger entity is subdivided, or the ...
of the dimension the
fact A fact is a truth, true data, datum about one or more aspects of a circumstance. Standard reference works are often used to Fact-checking, check facts. Science, Scientific facts are verified by repeatable careful observation or measurement by ...
table has to be partially summarized to fit the new
grain A grain is a small, hard, dry fruit (caryopsis) – with or without an attached husk, hull layer – harvested for human or animal consumption. A grain crop is a grain-producing plant. The two main types of commercial grain crops are cereals and ...
of the new
dimension In physics and mathematics, the dimension of a mathematical space (or object) is informally defined as the minimum number of coordinates needed to specify any point within it. Thus, a line has a dimension of one (1D) because only one coo ...
, thus creating new dimensional and fact tables, to fit this new level of grain.


Design

Aggregates are sometimes referred to as pre-calculated summary data, since aggregations are usually precomputed, partially summarized data, that are stored in new aggregated tables. When facts are aggregated, it is either done by eliminating dimensionality or by associating the facts with a rolled up dimension. Rolled up dimensions should be shrunken versions of the dimensions associated with the granular base facts. This way, the aggregated dimension tables should conform to the base dimension tables.


Performance

In 1996,
Ralph Kimball Ralph Kimball (born July 18, 1944) is an author on the subject of data warehousing and business intelligence. He is one of the original architects of data warehousing and is known for long-term convictions that data warehouses must be designed to ...
, who is widely regarded as one of the original architects of data warehousing, stated:
''The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records that coexist with the primary base records. Aggregates can have a very significant effect on performance, in some cases speeding queries by a factor of one hundred or even one thousand. No other means exist to harvest such spectacular gains.''


Complexity

Having aggregates and atomic data increases the complexity of the dimensional model. This complexity should be transparent to the users of the data warehouse, thus when a request is made, the data warehouse should return data from the table with the correct grain. So when requests to the data warehouse are made, aggregate navigator functionality should be implemented, to help determine the correct table with the correct grain. The number of possible aggregations is determined by every possible combination of dimension granularities. Since it would produce a lot of overhead to build all possible aggregations, it is a good idea to choose a subset of tables on which to make aggregations. The best way to choose this subset and decide which aggregations to build is to monitor queries and design aggregations to match query patterns.


Aggregate navigation

Having aggregate data in the dimensional model makes the environment more complex. To make this extra complexity transparent to the user, functionality known as aggregate navigation is used to query the dimensional and fact tables with the correct grain level. The aggregate navigation essentially examines the query to see if it can be answered using a smaller, aggregate table. Implementations of aggregate navigators can be found in a range of technologies: *
OLAP In computing, online analytical processing (OLAP) (), is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term ''OLAP'' was created as a slight modification of the traditional database term online transaction processi ...
engines *
Materialized view In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summa ...
s *Relational OLAP (
ROLAP In computing, online analytical processing (OLAP) (), is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term ''OLAP'' was created as a slight modification of the traditional database term online transaction proces ...
) services * BI application servers or query tools It is generally recommended to use either of the first three technologies, since the benefits in the latter case is restricted to a single front end BI tool


Challenges

*Since dimensional models only gain from aggregates on large data sets, it should be considered at what size of the data sets one should start using aggregates *One can also ask oneself if a data warehouse always handles data sets that are too large for direct queries, or if it sometimes is a good idea to omit the aggregate tables when starting a new data warehouse project * An open question is whether omitting aggregates in the first iteration of building a new data warehouse will make the structure of the dimensional model simpler


References

{{Data warehouse Data warehousing