Aitchison Geometry

In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information. Mathematically, compositional data are represented by points on a simplex. Measurements involving probabilities, proportions, percentages, and ppm can all be thought of as compositional data.


Ternary plot

Compositional data in three variables can be plotted via ternary plots. The use of a barycentric plot on three variables graphically depicts the ratios of the three variables as positions in an equilateral triangle.
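
As a minimal sketch of how a three-part composition becomes a position in the triangle, the snippet below converts barycentric weights to Cartesian coordinates. The function name `ternary_to_cartesian` and the vertex layout (first part at the origin, second part at (1, 0), third part at the apex) are assumptions made for illustration; plotting libraries may use a different orientation.

```python
import numpy as np

def ternary_to_cartesian(composition):
    """Map a 3-part composition to 2-D coordinates in an equilateral triangle."""
    parts = np.asarray(composition, dtype=float)
    if parts.shape != (3,) or np.any(parts < 0) or parts.sum() <= 0:
        raise ValueError("expected three non-negative parts with a positive sum")
    # Only the ratios matter, so rescale the parts to sum to 1 first.
    a, b, c = parts / parts.sum()
    # Assumed vertex layout: A = (0, 0), B = (1, 0), C = (1/2, sqrt(3)/2).
    # The point is the weighted average a*A + b*B + c*C.
    x = b + 0.5 * c
    y = (np.sqrt(3.0) / 2.0) * c
    return np.array([x, y])

# Equal parts land on the centroid of the triangle.
print(ternary_to_cartesian([1.0, 1.0, 1.0]))
```

A composition consisting of a single part maps to the corresponding vertex, and rescaling all three parts by the same positive constant leaves the plotted point unchanged.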


Simplicial sample space

In 1982, John Aitchison defined compositional data to be proportions of some whole. In particular, a compositional data point (or ''composition'' for short) can be represented by a real vector with positive components. The sample space of compositional data is a simplex:
:: \mathcal{S}^D = \left\{ \mathbf{x} = [x_1, x_2, \dots, x_D] \in \mathbb{R}^D \,\left|\, x_i > 0, \; i = 1, 2, \dots, D; \; \sum_{i=1}^D x_i = \kappa \right. \right\}.
The only information is given by the ratios between components, so the information of a composition is preserved under multiplication by any positive constant. Therefore, the sample space of compositional data can always be assumed to be a standard simplex, i.e. \kappa = 1. In this context, normalization to the standard simplex is called closure and is denoted by \mathcal{C}[\,\cdot\,]:
:: \mathcal{C}[x_1, x_2, \dots, x_D] = \left[ \frac{x_1}{\sum_{i=1}^D x_i}, \frac{x_2}{\sum_{i=1}^D x_i}, \dots, \frac{x_D}{\sum_{i=1}^D x_i} \right],
where ''D'' is the number of parts (components) and [\,\cdot\,] denotes a row vector.
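
The closure operation can be sketched in a few lines of NumPy. The function name `closure` and the example values are illustrative assumptions; the check for strictly positive parts mirrors the definition of the sample space above.

```python
import numpy as np

def closure(x):
    """Rescale a vector of positive parts so that it sums to 1."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    return x / x.sum(axis=-1, keepdims=True)

# Hypothetical raw measurements in arbitrary units.
raw = np.array([2.0, 6.0, 12.0])
closed = closure(raw)

# Multiplying by a positive constant does not change the closed composition.
rescaled = closure(5.0 * raw)
print(closed, np.allclose(closed, rescaled))
```

Here the parts 2 : 6 : 12 close to the proportions 0.1, 0.3 and 0.6, and the rescaled vector closes to the same composition, which is the scale invariance described above.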


Aitchison geometry

The simplex can be given the structure of a real vector space in several different ways. The following vector space structure is called Aitchison geometry or the Aitchison simplex and has the following operations:
; Perturbation
:: x \oplus y = \left[ \frac{x_1 y_1}{\sum_{i=1}^D x_i y_i}, \frac{x_2 y_2}{\sum_{i=1}^D x_i y_i}, \dots, \frac{x_D y_D}{\sum_{i=1}^D x_i y_i} \right] = C[x_1 y_1, \ldots, x_D y_D] \qquad \forall x, y \in S^D
; Powering
:: \alpha \odot x = \left[ \frac{x_1^\alpha}{\sum_{i=1}^D x_i^\alpha}, \frac{x_2^\alpha}{\sum_{i=1}^D x_i^\alpha}, \ldots, \frac{x_D^\alpha}{\sum_{i=1}^D x_i^\alpha} \right] = C[x_1^\alpha, \ldots, x_D^\alpha] \qquad \forall x \in S^D, \; \alpha \in \mathbb{R}
; Inner product
:: \langle x, y \rangle = \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \log \frac{x_i}{x_j} \log \frac{y_i}{y_j} \qquad \forall x, y \in S^D
With these operations, the Aitchison simplex can be shown to form a (D-1)-dimensional Euclidean vector space.
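
The three operations translate directly into NumPy. The sketch below is illustrative only: the function names `perturb`, `power` and `aitchison_inner` and the sample compositions are assumptions, and the inner product is computed literally as the double sum rather than in an optimized form.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(alpha, x):
    """Powering: component-wise power followed by closure."""
    return closure(np.asarray(x, dtype=float) ** alpha)

def aitchison_inner(x, y):
    """Inner product as the double sum over all pairwise log-ratios."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    D = lx.size
    dx = lx[:, None] - lx[None, :]   # entry (i, j) is log(x_i / x_j)
    dy = ly[:, None] - ly[None, :]   # entry (i, j) is log(y_i / y_j)
    return (dx * dy).sum() / (2 * D)

x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 3.0, 4.0])
neutral = np.full(3, 1.0 / 3.0)      # uniform composition

print(perturb(x, neutral))           # perturbing by the uniform composition returns x
print(perturb(x, power(-1.0, x)))    # x combined with its inverse gives the uniform composition
print(aitchison_inner(power(2.0, x), y), 2.0 * aitchison_inner(x, y))
```

The checks illustrate the vector space behaviour: the uniform composition acts as the zero vector for perturbation, powering by -1 gives the additive inverse, and the inner product is linear with respect to powering.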


Orthonormal bases

Since the Aitchison simplex forms a finite-dimensional Hilbert space, it is possible to construct orthonormal bases in the simplex. Every composition x can be decomposed as follows
:: x = \bigoplus_{i=1}^{D-1} x_i^* \odot e_i
where e_1, \ldots, e_{D-1} forms an orthonormal basis in the simplex. The values x_i^*, i=1,2,\ldots,D-1 are the (orthonormal and Cartesian) coordinates of x with respect to the given basis. They are called isometric log-ratio coordinates (\operatorname{ilr}).


Linear transformations

There are three well-characterized isomorphisms that transform from the Aitchison simplex to real space. All of these transforms satisfy linearity and are given below.


Additive logratio transform

The additive log ratio (alr) transform is an isomorphism where \operatorname{alr}: S^D \rightarrow \mathbb{R}^{D-1}. This is given by
:: \operatorname{alr}(x) = \left[ \log \frac{x_1}{x_D}, \cdots, \log \frac{x_{D-1}}{x_D} \right]
The choice of denominator component is arbitrary, and could be any specified component. This transform is commonly used in chemistry with measurements such as pH. In addition, this is the transform most commonly used for multinomial logistic regression. The alr transform is not an isometry, meaning that distances on transformed values will not be equivalent to distances on the original compositions in the simplex.
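
A short sketch of the alr transform and its inverse follows. It fixes the last part as the denominator, which is only one of the possible choices, and the names `alr`, `alr_inverse` and `aitchison_distance` are assumptions for illustration. The Aitchison distance is taken here as the Euclidean distance between log-ratios centered on the geometric mean, which is assumed to agree with the inner product defined above.

```python
import numpy as np

def alr(x):
    """Additive log-ratio with the last part as denominator (one possible choice)."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs[:-1] - logs[-1]

def alr_inverse(z):
    """Map alr coordinates back to a composition that sums to 1."""
    y = np.exp(np.append(np.asarray(z, dtype=float), 0.0))
    return y / y.sum()

def aitchison_distance(x, y):
    """Distance induced by the Aitchison inner product, via centered log-ratios."""
    cx = np.log(x) - np.log(x).mean()
    cy = np.log(y) - np.log(y).mean()
    return np.linalg.norm(cx - cy)

x = np.array([0.1, 0.3, 0.6])
y = np.array([0.2, 0.5, 0.3])

print(alr_inverse(alr(x)))                      # recovers x
print(np.linalg.norm(alr(x) - alr(y)))          # Euclidean distance in alr coordinates
print(aitchison_distance(x, y))                 # distance in the simplex
```

The last two printed values differ for this pair of compositions, which illustrates that the alr transform does not preserve distances.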


Center logratio transform

The center log ratio (clr) transform is both an isomorphism and an isometry where \operatorname{clr}: S^D \rightarrow U, \quad U \subset \mathbb{R}^D
:: \operatorname{clr}(x) = \left[ \log \frac{x_1}{g(x)}, \cdots, \log \frac{x_D}{g(x)} \right]
where g(x) is the geometric mean of x. The inverse of this function is also known as the softmax function.
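
The relationship between the clr transform and the softmax function can be checked numerically. The sketch below is illustrative, with `clr` and `softmax` as assumed function names; subtracting the maximum inside `softmax` is a common numerical-stability step and does not change the result.

```python
import numpy as np

def clr(x):
    """Log of each part divided by the geometric mean of all parts."""
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean()        # log g(x) is the mean of the logs

def softmax(z):
    """Normalized exponential; shifting by max(z) avoids overflow."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.array([0.1, 0.3, 0.6])
z = clr(x)

print(z, z.sum())      # clr coordinates sum to zero (up to rounding)
print(softmax(z))      # applying softmax recovers the composition
```

Because clr coordinates always sum to zero, the image U is a (D-1)-dimensional subspace of the D-dimensional real space rather than the whole space.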


Isometric logratio transform

The isometric log ratio (ilr) transform is both an isomorphism and an isometry where \operatorname{ilr}: S^D \rightarrow \mathbb{R}^{D-1}
:: \operatorname{ilr}(x) = \big[ \langle x, e_1 \rangle, \ldots, \langle x, e_{D-1} \rangle \big]
There are multiple ways to construct orthonormal bases, including using the Gram–Schmidt orthogonalization or singular-value decomposition of clr transformed data. Another alternative is to construct log contrasts from a bifurcating tree. If we are given a bifurcating tree, we can construct a basis from the internal nodes in the tree. Each vector in the basis would be determined as follows
:: e_\ell = C[\exp( \,\underbrace{0, \ldots, 0}_k, \underbrace{a, \ldots, a}_r, \underbrace{b, \ldots, b}_s, \underbrace{0, \ldots, 0}_t \, )]
The elements within each vector are given as follows
:: a = \frac{\sqrt{s}}{\sqrt{r(r+s)}} \quad \text{and} \quad b = \frac{-\sqrt{r}}{\sqrt{s(r+s)}}
where k, r, s, t are the respective numbers of tips in the corresponding subtrees shown in the figure: r and s count the tips in the two subtrees below the internal node, and k and t count the remaining tips on either side. It can be shown that the resulting basis is orthonormal. Once the basis \Psi is built, the ilr transform can be calculated as follows
:: \operatorname{ilr}(x) = \operatorname{clr}(x) \Psi^T
where each element in the ilr transformed data is of the following form
:: b_i = \sqrt{\frac{rs}{r+s}} \log \frac{g(x_R)}{g(x_S)}
where x_R and x_S are the sets of values corresponding to the tips in the subtrees R and S.
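
The tree-based construction can be sketched for a small case. The example below assumes a hypothetical four-part composition with the bifurcating tree ((x1, x2), (x3, x4)), and it assumes that each row of \Psi holds the exponents (0, ..., a, ..., b, ..., 0) of one basis vector. The helper names `balance_row`, `ilr` and `ilr_inverse` are illustrative, not taken from any library.

```python
import numpy as np

def balance_row(D, R, S):
    """Exponent vector for one internal node: a on the tips in R, b on the tips in S."""
    r, s = len(R), len(S)
    row = np.zeros(D)
    row[list(R)] = np.sqrt(s) / np.sqrt(r * (r + s))     # a
    row[list(S)] = -np.sqrt(r) / np.sqrt(s * (r + s))    # b
    return row

def clr(x):
    logs = np.log(np.asarray(x, dtype=float))
    return logs - logs.mean()

def ilr(x, psi):
    """ilr(x) = clr(x) Psi^T."""
    return clr(x) @ psi.T

def ilr_inverse(z, psi):
    """Back-transform: exponentiate z Psi and close to sum 1."""
    y = np.exp(np.asarray(z, dtype=float) @ psi)
    return y / y.sum()

# Assumed tree ((x1, x2), (x3, x4)): one (R, S) pair of tip indices per internal node.
partitions = [([0, 1], [2, 3]), ([0], [1]), ([2], [3])]
psi = np.vstack([balance_row(4, R, S) for R, S in partitions])

x = np.array([0.1, 0.2, 0.3, 0.4])
z = ilr(x, psi)

print(np.round(psi @ psi.T, 12))   # identity matrix: the rows are orthonormal
print(z)                           # D - 1 = 3 coordinates
print(ilr_inverse(z, psi))         # recovers x
```

The first coordinate equals the scaled log-ratio of the geometric means of the two groups {x1, x2} and {x3, x4}, matching the element-wise formula given above with r = s = 2.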


Examples

* In chemistry, compositions can be expressed as molar concentrations of each component. As the sum of all concentrations is not determined, the whole composition of ''D'' parts is needed and thus expressed as a vector of ''D'' molar concentrations. These compositions can be translated into weight per cent by multiplying each component by the appropriate constant.
* In demography, a town may be a compositional data point in a sample of towns; a town in which 35% of the people are Christians, 55% are Muslims, 6% are Jews, and the remaining 4% are others would correspond to the quadruple [0.35, 0.55, 0.06, 0.04]. A data set would correspond to a list of towns.
* In geology, a rock composed of different minerals may be a compositional data point in a sample of rocks; a rock of which 10% is the first mineral, 30% is the second, and the remaining 60% is the third would correspond to the triple [0.1, 0.3, 0.6]. A data set would contain one such triple for each rock in a sample of rocks.
* In high throughput sequencing, data obtained are typically transformed to relative abundances, rendering them compositional.
* In probability and statistics, a partition of the sampling space into disjoint events is described by the probabilities assigned to such events. The vector of ''D'' probabilities can be considered as a composition of ''D'' parts. As they add to one, one probability can be suppressed and the composition is completely determined.
* In chemometrics, compositional data are used for the classification of petroleum oils.
* In a survey, the proportions of people answering different items positively can be expressed as percentages. As the total amount is identified as 100, the compositional vector of ''D'' components can be defined using only ''D'' − 1 components, assuming that the remaining component is the percentage needed for the whole vector to add to 100.


See also

* Mixture model
* Response surface methodology
* Applications of simplices
* Ternary plot




External links


* CoDaWeb – Compositional Data Website
* Pawlowsky-Glahn, V.; Egozcue, J.J.; Tolosana-Delgado, R. (2007). Lecture Notes on Compositional Data Analysis. hdl:10256/297.
* Why, and How, Should Geologists Use Compositional Data Analysis (wikibook)