
In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information. Mathematically, compositional data are represented by points on a simplex. Measurements involving probabilities, proportions, percentages, and ppm can all be thought of as compositional data.


Ternary plot

Compositional data in three variables can be plotted via ternary plots. The use of a barycentric plot on three variables graphically depicts the ratios of the three variables as positions in an equilateral triangle.
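
As a rough illustration of the barycentric mapping, the sketch below converts a three-part composition into planar coordinates inside an equilateral triangle. The function name `ternary_xy` and the vertex placement (first part at (0, 0), second at (1, 0), third at (1/2, √3/2)) are assumptions chosen for this example; other plotting conventions place the vertices differently.

```python
import math

def ternary_xy(a, b, c):
    """Map three non-negative parts to (x, y) in a unit-side equilateral triangle.

    Assumed vertex layout: part a -> (0, 0), part b -> (1, 0),
    part c -> (0.5, sqrt(3)/2).
    """
    total = a + b + c
    if total <= 0 or min(a, b, c) < 0:
        raise ValueError("parts must be non-negative with a positive sum")
    # Only the proportions matter, so rescale to sum to one first.
    b, c = b / total, c / total
    # Barycentric combination of the three vertices (vertex a contributes 0).
    return (b + 0.5 * c, (math.sqrt(3) / 2) * c)

corner_a = ternary_xy(1, 0, 0)   # pure first part: lands on the first vertex
corner_c = ternary_xy(0, 0, 1)   # pure third part: lands on the top vertex
centre = ternary_xy(1, 1, 1)     # equal parts: lands on the centroid
```

Because the parts are rescaled before mapping, `ternary_xy(10, 30, 60)` and `ternary_xy(0.1, 0.3, 0.6)` give the same point.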


Simplicial sample space

In 1982, John Aitchison defined compositional data to be proportions of some whole. In particular, a compositional data point (or ''composition'' for short) can be represented by a real vector with positive components. The sample space of compositional data is a simplex:
:: \mathcal{S}^D=\left\{\mathbf{x}=[x_1,x_2,\dots,x_D]\in\mathbb{R}^D \,\left|\, x_i>0,\ i=1,2,\dots,D;\ \sum_{i=1}^D x_i=\kappa \right.\right\}.
The only information is given by the ratios between components, so the information of a composition is preserved under multiplication by any positive constant. Therefore, the sample space of compositional data can always be assumed to be a standard simplex, i.e. \kappa = 1. In this context, normalization to the standard simplex is called closure and is denoted by \mathcal{C}[\,\cdot\,]:
:: \mathcal{C}[x_1,x_2,\dots,x_D]=\left[\frac{x_1}{\sum_{i=1}^D x_i},\frac{x_2}{\sum_{i=1}^D x_i},\dots,\frac{x_D}{\sum_{i=1}^D x_i}\right],
where ''D'' is the number of parts (components) and [\,\cdot\,] denotes a row vector.
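
The closure operation can be sketched in a few lines of Python. The function name `closure` and the optional `kappa` argument are choices made for this illustration, not part of any particular library.

```python
def closure(parts, kappa=1.0):
    """Rescale strictly positive parts so that they sum to kappa."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = sum(parts)
    return [kappa * p / total for p in parts]

# Two vectors that differ only by a positive factor carry the same
# ratios, so they close to the same composition.
a = closure([1.0, 3.0, 6.0])
b = closure([10.0, 30.0, 60.0])
```

Both `a` and `b` represent the composition with parts in the ratio 1 : 3 : 6, which illustrates why the value of the constant \kappa is immaterial.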


Aitchison geometry

The simplex can be given the structure of a vector space in several different ways. The following vector space structure is called Aitchison geometry or the Aitchison simplex and has the following operations (here S^D denotes the simplex and C the closure operation):
; Perturbation (vector addition)
:: x \oplus y = \left[\frac{x_1 y_1}{\sum_{i=1}^D x_i y_i},\frac{x_2 y_2}{\sum_{i=1}^D x_i y_i}, \dots, \frac{x_D y_D}{\sum_{i=1}^D x_i y_i}\right] = C[x_1 y_1, \ldots, x_D y_D] \qquad \forall x, y \in S^D
; Powering (scalar multiplication)
:: \alpha \odot x = \left[\frac{x_1^\alpha}{\sum_{i=1}^D x_i^\alpha},\frac{x_2^\alpha}{\sum_{i=1}^D x_i^\alpha}, \ldots,\frac{x_D^\alpha}{\sum_{i=1}^D x_i^\alpha}\right] = C[x_1^\alpha, \ldots, x_D^\alpha] \qquad \forall x \in S^D, \; \alpha \in \mathbb{R}
; Inner product
:: \langle x, y \rangle = \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \log \frac{x_i}{x_j} \log \frac{y_i}{y_j} \qquad \forall x, y \in S^D
Endowed with those operations, the Aitchison simplex forms a (D-1)-dimensional Euclidean inner product space. The uniform composition \left[\frac{1}{D}, \dots, \frac{1}{D}\right] is the zero vector.
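
The three operations translate directly into code. The following is a minimal sketch using only the Python standard library; the function names (`closure`, `perturb`, `power`, `inner`) are illustrative choices, and the inputs are assumed to have strictly positive parts.

```python
import math

def closure(parts):
    """Rescale positive parts so that they sum to one."""
    total = sum(parts)
    return [p / total for p in parts]

def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(alpha, x):
    """Powering: component-wise power followed by closure."""
    return closure([xi ** alpha for xi in x])

def inner(x, y):
    """Aitchison inner product as a double sum over pairwise log-ratios."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(d):
            total += math.log(x[i] / x[j]) * math.log(y[i] / y[j])
    return total / (2 * d)

x = closure([0.1, 0.3, 0.6])
y = closure([0.2, 0.5, 0.3])
neutral = [1 / 3, 1 / 3, 1 / 3]          # uniform composition (zero vector)

same_as_x = perturb(x, neutral)           # adding the zero vector changes nothing
back_to_zero = perturb(x, power(-1, x))   # x plus its additive inverse
```

Perturbing `x` by the uniform composition returns `x`, and perturbing `x` by `power(-1, x)` returns the uniform composition, matching the vector space axioms for a zero vector and additive inverses.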


Orthonormal bases

Since the Aitchison simplex forms a finite-dimensional Hilbert space, it is possible to construct orthonormal bases in the simplex. Every composition x can be decomposed as follows:
:: x = \bigoplus_{i=1}^{D-1} x_i^* \odot e_i
where e_1, \ldots, e_{D-1} forms an orthonormal basis in the simplex. The values x_i^*, i=1,2,\ldots,D-1 are the (orthonormal and Cartesian) coordinates of x with respect to the given basis. They are called isometric log-ratio coordinates (\operatorname{ilr}).
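
The decomposition can be checked numerically for ''D'' = 3. The sketch below is illustrative only: the particular basis is an assumed example (built by exponentiating two orthonormal, zero-sum vectors and closing the result), the coordinates are taken as the inner products of x with each basis element, and all function names are hypothetical.

```python
import math

def closure(parts):
    total = sum(parts)
    return [p / total for p in parts]

def perturb(x, y):
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(alpha, x):
    return closure([xi ** alpha for xi in x])

def inner(x, y):
    d = len(x)
    return sum(
        math.log(x[i] / x[j]) * math.log(y[i] / y[j])
        for i in range(d) for j in range(d)
    ) / (2 * d)

# Assumed example basis for D = 3: two orthonormal zero-sum vectors,
# mapped into the simplex by exponentiation and closure.
s6, s2 = math.sqrt(6), math.sqrt(2)
log_basis = [[1 / s6, 1 / s6, -2 / s6], [1 / s2, -1 / s2, 0.0]]
basis = [closure([math.exp(v) for v in row]) for row in log_basis]

x = closure([0.1, 0.3, 0.6])

# Coordinates of x with respect to the basis.
coords = [inner(x, e) for e in basis]

# Rebuild x as the perturbation-sum of the powered basis elements,
# starting from the uniform composition (the zero vector).
rebuilt = [1 / 3, 1 / 3, 1 / 3]
for c, e in zip(coords, basis):
    rebuilt = perturb(rebuilt, power(c, e))
```

Up to floating-point error, `rebuilt` equals `x`, so the two coordinates in `coords` carry all the information in the three-part composition.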


Linear transformations

There are three well-characterized isomorphisms that transform from the Aitchison simplex to real space. All of these transforms are linear; they are described below.


Additive log ratio transform

The additive log ratio (alr) transform is an isomorphism where \operatorname{alr}: S^D \rightarrow \mathbb{R}^{D-1}. This is given by
:: \operatorname{alr}(x) = \left[ \log \frac{x_1}{x_D}, \cdots, \log \frac{x_{D-1}}{x_D} \right]
The choice of denominator component is arbitrary, and could be any specified component. This transform is commonly used in chemistry with measurements such as pH. In addition, this is the transform most commonly used for multinomial logistic regression. The alr transform is not an isometry, meaning that distances on transformed values will not be equivalent to distances on the original compositions in the simplex.
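
A minimal sketch of the alr transform and its inverse follows, taking the last component as the denominator. The function names `alr` and `alr_inv` are illustrative, and the input is assumed to have strictly positive parts.

```python
import math

def alr(x):
    """Additive log-ratio: log of each part relative to the last part."""
    ref = x[-1]
    return [math.log(xi / ref) for xi in x[:-1]]

def alr_inv(z):
    """Inverse alr: exponentiate, append 1 for the reference part, close."""
    expd = [math.exp(zi) for zi in z] + [1.0]
    total = sum(expd)
    return [e / total for e in expd]

x = [0.1, 0.3, 0.6]
z = alr(x)                 # two real coordinates for a three-part composition
x_back = alr_inv(z)        # recovers the original composition
```

The round trip recovers `x`, and the uniform composition maps to the origin, since every log-ratio is then log 1 = 0.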


Center log ratio transform

The center log ratio (clr) transform is both an isomorphism and an isometry where \operatorname{clr}: S^D \rightarrow U, \quad U \subset \mathbb{R}^D
:: \operatorname{clr}(x) = \left[ \log \frac{x_1}{g(x)}, \cdots, \log \frac{x_D}{g(x)} \right]
where g(x) is the geometric mean of x. The inverse of this function is also known as the softmax function.
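
The clr transform and its softmax inverse can be sketched as follows. The function names are illustrative; subtracting the maximum inside `softmax` is a common numerical-stability device added here and does not change the result.

```python
import math

def clr(x):
    """Centered log-ratio: log of each part relative to the geometric mean."""
    log_x = [math.log(xi) for xi in x]
    mean_log = sum(log_x) / len(x)        # log of the geometric mean g(x)
    return [lx - mean_log for lx in log_x]

def softmax(z):
    """Exponentiate and normalize to sum to one (inverse of clr)."""
    m = max(z)                            # shift for numerical stability
    expd = [math.exp(zi - m) for zi in z]
    total = sum(expd)
    return [e / total for e in expd]

x = [0.1, 0.3, 0.6]
z = clr(x)
x_back = softmax(z)
```

The clr coordinates always sum to zero, which is why the image U is a (D-1)-dimensional subspace of \mathbb{R}^D rather than all of \mathbb{R}^D.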


Isometric log ratio transform

The isometric log ratio (ilr) transform is both an isomorphism and an isometry where \operatorname{ilr}: S^D \rightarrow \mathbb{R}^{D-1}
:: \operatorname{ilr}(x) = \big[ \langle x, e_1 \rangle, \ldots, \langle x, e_{D-1} \rangle \big]
There are multiple ways to construct orthonormal bases, including using the Gram–Schmidt orthogonalization or singular-value decomposition of clr transformed data. Another alternative is to construct log contrasts from a bifurcating tree. If we are given a bifurcating tree, we can construct a basis from the internal nodes in the tree. Each vector in the basis would be determined as follows:
:: e_\ell = C[\exp( \,\underbrace{0,\ldots,0}_k, \underbrace{a,\ldots,a}_r, \underbrace{b,\ldots,b}_s, \underbrace{0,\ldots,0}_t \, )]
The elements within each vector are given as follows:
:: a = \frac{\sqrt{s}}{\sqrt{r(r+s)}} \quad \text{and} \quad b = \frac{-\sqrt{r}}{\sqrt{s(r+s)}}
where k, r, s, t are the respective numbers of tips in the corresponding subtrees shown in the figure. It can be shown that the resulting basis is orthonormal. Once the basis \Psi is built, the ilr transform can be calculated as follows:
:: \operatorname{ilr}(x) = \operatorname{clr}(x) \Psi^T
where each element in the ilr transformed data is of the following form:
:: b_i = \sqrt{\frac{rs}{r+s}} \log \frac{g(x_R)}{g(x_S)}
where x_R and x_S are the sets of values corresponding to the tips in the subtrees R and S, and g denotes the geometric mean.
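
The tree-based construction can be sketched for a small case. The example below assumes a three-part composition and the bifurcating tree ((x1, x2), x3); the helper names `balance_row` and `ilr`, and the representation of each internal node as a pair of index lists, are choices made for this illustration. Each row of `psi` holds the exponents of one basis vector: a on the r tips of one subtree, b on the s tips of the other, and 0 elsewhere.

```python
import math

def clr(x):
    log_x = [math.log(xi) for xi in x]
    mean_log = sum(log_x) / len(x)
    return [lx - mean_log for lx in log_x]

def balance_row(d, left, right):
    """Exponent vector for one internal node: a on `left`, b on `right`."""
    r, s = len(left), len(right)
    a = math.sqrt(s) / math.sqrt(r * (r + s))
    b = -math.sqrt(r) / math.sqrt(s * (r + s))
    row = [0.0] * d
    for i in left:
        row[i] = a
    for i in right:
        row[i] = b
    return row

def ilr(x, psi):
    """ilr(x) = clr(x) times the transpose of psi."""
    c = clr(x)
    return [sum(ci * pi for ci, pi in zip(c, row)) for row in psi]

# Internal nodes of the assumed tree ((x1, x2), x3), using 0-based indices:
# the root separates {x1, x2} from {x3}; the inner node separates x1 from x2.
psi = [balance_row(3, [0, 1], [2]), balance_row(3, [0], [1])]

x = [0.1, 0.3, 0.6]
z = ilr(x, psi)
```

Here `z[0]` equals sqrt(2/3) · log(g(x1, x2) / x3) and `z[1]` equals sqrt(1/2) · log(x1 / x2), which agrees with the closed form for each balance with (r, s) = (2, 1) and (1, 1) respectively.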


Examples

* In chemistry, compositions can be expressed as molar concentrations of each component. As the sum of all concentrations is not determined, the whole composition of ''D'' parts is needed and thus expressed as a vector of ''D'' molar concentrations. These compositions can be translated into weight per cent by multiplying each component by the appropriate constant.
* In demography, a town may be a compositional data point in a sample of towns; a town in which 35% of the people are Christians, 55% are Muslims, 6% are Jews, and the remaining 4% are others would correspond to the quadruple [0.35, 0.55, 0.06, 0.04]. A data set would correspond to a list of towns.
* In geology, a rock composed of different minerals may be a compositional data point in a sample of rocks; a rock of which 10% is the first mineral, 30% is the second, and the remaining 60% is the third would correspond to the triple [0.1, 0.3, 0.6]. A data set would contain one such triple for each rock in a sample of rocks.
* In high throughput sequencing, data obtained are typically transformed to relative abundances, rendering them compositional.
* In probability and statistics, a partition of the sampling space into disjoint events is described by the probabilities assigned to such events. The vector of ''D'' probabilities can be considered as a composition of ''D'' parts. As they add to one, one probability can be suppressed and the composition is still completely determined.
* In chemometrics, compositional data are used for the classification of petroleum oils.
* In a survey, the proportions of people positively answering some different items can be expressed as percentages. As the total amount is identified as 100, the compositional vector of ''D'' components can be defined using only ''D'' − 1 components, assuming that the remaining component is the percentage needed for the whole vector to add to 100.


See also

* Mixture model
* Response surface methodology
* Applications of simplices
* Ternary plot




External links


* CoDaWeb – Compositional Data Website
* Pawlowsky-Glahn, V.; Egozcue, J.J.; Tolosana-Delgado, R. (2007). Lecture Notes on Compositional Data Analysis. Universitat de Girona. hdl:10256/297. https://hdl.handle.net/10256/297
* Why, and How, Should Geologists Use Compositional Data Analysis (wikibook)