Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. Like principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied with a focus either on the rows or on the columns, it should in fact be called simple (symmetric) correspondence analysis.

It is traditionally applied to the contingency table of a pair of nominal variables, where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data, given that the presence/absence coding represents simplified count data, i.e. a 1 describes a positive count and a 0 stands for a count of zero. Depending on the scores used, CA preserves the chi-square distance between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of whether a chi-squared test is significant. Although the \chi^2 statistic used in inferential statistics and the chi-square distance are computationally related, they should not be confused: the latter works as a multivariate statistical distance measure in CA, while the \chi^2 statistic is in fact a scalar, not a metric.


Details

Like principal components analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table, i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as a matrix ''C'' of size ''m'' × ''n'', where ''m'' is the number of rows and ''n'' is the number of columns. In the following mathematical description of the method, capital letters in italics refer to a matrix while lower-case letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra.


Preprocessing

Before proceeding to the central computational step of the algorithm, the values in matrix ''C'' have to be transformed. First compute a set of weights for the rows and the columns (sometimes called ''masses''). The row weights w_m form a column vector of length ''m'' and the column weights w_n form a row vector of length ''n'':
:w_m = \frac{1}{n_C} C \mathbf{1}, \quad w_n = \frac{1}{n_C} \mathbf{1}^T C.
Here n_C = \sum_{j=1}^n \sum_{i=1}^m C_{ij} is the sum of all cell values in matrix ''C'', or in short the sum of ''C'', and \mathbf{1} is a column vector of ones with the appropriate dimension. Put in simple words, w_m is just a vector whose elements are the row sums of ''C'' divided by the sum of ''C'', and w_n is a vector whose elements are the column sums of ''C'' divided by the sum of ''C''.

The weights are transformed into diagonal matrices
:W_m = \operatorname{diag}(1/\sqrt{w_m})
and
:W_n = \operatorname{diag}(1/\sqrt{w_n})
where the diagonal elements of W_n are 1/\sqrt{w_n} and those of W_m are 1/\sqrt{w_m}, respectively, i.e. the diagonal elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0.

Next, compute matrix ''P'' by dividing ''C'' by its sum
:P = \frac{1}{n_C} C.
In simple words, matrix ''P'' is just the data matrix (contingency table or binary table) transformed into proportions, i.e. each cell value is that cell's share of the sum of the whole table.

Finally, compute matrix ''S'', sometimes called the matrix of ''standardized residuals'', by matrix multiplication as
:S = W_m(P - w_m w_n)W_n
Note that the vectors w_m and w_n are combined in an outer product, resulting in a matrix of the same dimensions as ''P''. In words the formula reads: matrix \operatorname{outer}(w_m, w_n) is subtracted from matrix ''P'' and the resulting matrix is scaled (weighted) by the diagonal matrices W_m and W_n. Multiplying the resulting matrix by the diagonal matrices is equivalent to multiplying its i-th row by the i-th diagonal element of W_m and its j-th column by the j-th diagonal element of W_n.
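The preprocessing steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the 4 × 3 table of counts is invented for the example, and the variable names simply mirror the symbols used in the text.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0,  5.0],
              [ 8.0, 15.0, 12.0],
              [ 5.0,  9.0, 25.0],
              [12.0, 12.0, 12.0]])

n_C = C.sum()                       # sum of all cell values
w_m = C.sum(axis=1) / n_C           # row masses (length m)
w_n = C.sum(axis=0) / n_C           # column masses (length n)

# Diagonal weighting matrices: inverse square roots of the masses.
W_m = np.diag(1.0 / np.sqrt(w_m))   # m x m
W_n = np.diag(1.0 / np.sqrt(w_n))   # n x n

P = C / n_C                         # table expressed as proportions

# Matrix of standardized residuals.
S = W_m @ (P - np.outer(w_m, w_n)) @ W_n
print(S.shape)
```

Because W_m and W_n are diagonal, the same matrix can be obtained elementwise as `(P - np.outer(w_m, w_n)) / np.sqrt(np.outer(w_m, w_n))`, which avoids building the diagonal matrices for large tables.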


Interpretation of preprocessing

The vectors w_m and w_n are the row and column masses, or the marginal probabilities for the rows and columns, respectively. Subtracting matrix \operatorname{outer}(w_m, w_n) from matrix ''P'' is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix \operatorname{outer}(w_m, w_n). In fact matrix \operatorname{outer}(w_m, w_n) is the matrix of ''expected frequencies'' in the chi-squared test, expressed as proportions of the table sum. Therefore ''S'' is computationally related to the independence model used in that test. But since CA is ''not'' an inferential method, the term independence model is inappropriate here.
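The computational link to the chi-squared test can be checked numerically. The sketch below uses an invented table and relies on the standard Pearson formula for the statistic, which is not spelled out in the text above. It suggests that n_C times the outer product of the masses reproduces the expected counts, and that the sum of the squared entries of ''S'' equals the \chi^2 statistic divided by n_C.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0,  5.0],
              [ 8.0, 15.0, 12.0],
              [ 5.0,  9.0, 25.0],
              [12.0, 12.0, 12.0]])

n_C = C.sum()
w_m = C.sum(axis=1) / n_C
w_n = C.sum(axis=0) / n_C
P = C / n_C

# outer(w_m, w_n) holds expected proportions; times n_C gives expected counts.
expected_prop = np.outer(w_m, w_n)
expected_counts = n_C * expected_prop

# Pearson chi-squared statistic, computed from observed and expected counts.
chi2 = ((C - expected_counts) ** 2 / expected_counts).sum()

# Standardized residuals, written elementwise.
S = (P - expected_prop) / np.sqrt(expected_prop)

# The two quantities agree up to the factor n_C.
print(np.isclose(chi2, n_C * (S ** 2).sum()))
```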


Orthogonal components

The table ''S'' is then decomposed by a singular value decomposition as
:S = U\Sigma V^* \,
where U and V are the left and right singular vectors of ''S'' and \Sigma is a square diagonal matrix with the singular values \sigma_i of ''S'' on the diagonal. \Sigma is of dimension p×p with p \leq (\min(m,n)-1), hence U is of dimension m×p and V is of dimension n×p. As orthonormal vectors, U and V fulfill
:U^* U = V^* V = I.
In other words, the multivariate information that is contained in ''C'' as well as in ''S'' is now distributed across two (coordinate) matrices U and V and a diagonal (scaling) matrix \Sigma. The vector space defined by them has p dimensions, where p is at most the smaller of the two values, number of rows and number of columns, minus 1.
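A hedged sketch of the decomposition step, using `numpy.linalg.svd` on an invented table. The thin SVD returns min(m, n) singular values, of which the last is numerically zero here because of the double centering, so only the first p = min(m, n) − 1 components are kept. For real-valued data V^* is simply the transpose of V.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0,  5.0],
              [ 8.0, 15.0, 12.0],
              [ 5.0,  9.0, 25.0],
              [12.0, 12.0, 12.0]])

n_C = C.sum()
w_m = C.sum(axis=1) / n_C
w_n = C.sum(axis=0) / n_C
expected_prop = np.outer(w_m, w_n)
S = (C / n_C - expected_prop) / np.sqrt(expected_prop)

# Thin SVD: S = U_full @ diag(sigma_full) @ Vt_full
U_full, sigma_full, Vt_full = np.linalg.svd(S, full_matrices=False)

# Keep the p non-trivial dimensions.
p = min(C.shape) - 1
U = U_full[:, :p]          # m x p
sigma = sigma_full[:p]     # p singular values, in descending order
V = Vt_full[:p, :].T       # n x p
Sigma = np.diag(sigma)     # p x p

print(U.shape, Sigma.shape, V.shape)
```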


Inertia

While a principal component analysis may be said to decompose the (co)variance, so that its measure of success is the amount of (co)variance covered by the first few PCA axes (measured by their eigenvalues), a CA works with a weighted (co)variance which is called ''inertia''. The sum of the squared singular values is the ''total inertia'' \Iota of the data table, computed as
:\Iota = \sum_{i=1}^p \sigma_i^2.
The ''total inertia'' \Iota of the data table can also be computed directly from ''S'' as
:\Iota = \sum_{j=1}^n \sum_{i=1}^m s_{ij}^2.
The amount of inertia covered by the i-th set of singular vectors is \iota_i = \sigma_i^2, the ''principal inertia''. The higher the portion of inertia covered by the first few singular vectors, i.e. the larger the sum of the first few principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore all principal inertia values are expressed as a portion \epsilon_i of the total inertia
:\epsilon_i = \sigma_i^2 / \sum_{k=1}^p \sigma_k^2
and are presented in the form of a scree plot. In fact a scree plot is just a bar plot of all principal inertia portions \epsilon_i.
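The inertia quantities follow directly from the singular values. The sketch below, again on an invented table, computes the principal inertias, the total inertia by both routes, and the portions \epsilon_i that a scree plot would display as bars.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0,  5.0],
              [ 8.0, 15.0, 12.0],
              [ 5.0,  9.0, 25.0],
              [12.0, 12.0, 12.0]])

n_C = C.sum()
w_m = C.sum(axis=1) / n_C
w_n = C.sum(axis=0) / n_C
expected_prop = np.outer(w_m, w_n)
S = (C / n_C - expected_prop) / np.sqrt(expected_prop)

# Singular values only; keep the p non-trivial ones.
p = min(C.shape) - 1
sigma = np.linalg.svd(S, compute_uv=False)[:p]

principal_inertia = sigma ** 2                 # iota_i
total_inertia = principal_inertia.sum()        # route 1: sum of sigma_i^2
total_inertia_direct = (S ** 2).sum()          # route 2: sum of s_ij^2

eps = principal_inertia / total_inertia        # heights of the scree-plot bars
cumulative = np.cumsum(eps)                    # inertia covered by first k axes

for k, (e, c) in enumerate(zip(eps, cumulative), start=1):
    print(f"axis {k}: portion {e:.3f}, cumulative {c:.3f}")
```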


Coordinates

To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns, an additional weighting step is necessary. The resulting coordinates are called ''principal coordinates'' in CA textbooks. If principal coordinates are used for rows, their visualization is called a ''row isometric'' scaling in econometrics and ''scaling 1'' in ecology. Since the weighting includes the singular values \Sigma of the matrix of standardized residuals S, these coordinates are sometimes referred to as ''singular value scaled singular vectors'' or, somewhat misleadingly, as eigenvalue scaled eigenvectors. In fact the non-trivial eigenvectors of S S^* are the left singular vectors U of S and those of S^* S are the right singular vectors V of S, while the eigenvalues of either of these matrices are the squares of the singular values \Sigma. But since all modern algorithms for CA are based on a singular value decomposition, this terminology should be avoided. In the French tradition of CA the coordinates are sometimes called (factor) ''scores''.

Factor scores or ''principal coordinates'' for the rows of matrix ''C'' are computed by
:F_m = W_m U \Sigma
i.e. the left singular vectors are scaled by the inverse of the square roots of the row masses and by the singular values. Because principal coordinates are computed using singular values, they contain the information about the spread between the rows (or columns) in the original table. Computing the Euclidean distances between the entities in principal coordinates results in values that equal their chi-square distances, which is the reason why CA is said to ''"preserve chi-square distances"''. Compute principal coordinates for the columns by
:F_n = W_n V \Sigma.

To represent the result of CA in a proper biplot, those categories which are ''not'' plotted in principal coordinates, i.e. in chi-square distance preserving coordinates, should be plotted in so-called ''standard coordinates''. They are called standard coordinates because each vector of standard coordinates has been standardized to exhibit mean 0 and variance 1 when weighted by the respective masses. When computing standard coordinates the singular values are omitted. This is a direct result of applying the biplot rule, by which one of the two sets of singular vectors must be scaled by singular values raised to the power of zero, i.e. multiplied by one, if the other set of singular vectors has been scaled by the singular values. This ensures the existence of an inner product between the two sets of coordinates, i.e. it leads to meaningful interpretations of their spatial relations in a biplot. In practical terms one can think of the standard coordinates as the vertices of the vector space in which the set of principal coordinates (i.e. the respective points) "exists". The standard coordinates for the rows are
:G_m = W_m U
and those for the columns are
:G_n = W_n V

Note that a ''scaling 1'' biplot in ecology implies the rows to be in principal and the columns to be in standard coordinates, while ''scaling 2'' implies the rows to be in standard and the columns to be in principal coordinates. That is, scaling 1 implies a biplot of F_m together with G_n, while scaling 2 implies a biplot of F_n together with G_m.
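The coordinate formulas and the distance-preservation claim can be illustrated as follows. This is a sketch on an invented table. The helper `chi2_distance` is a name introduced here, and it assumes the usual definition of the chi-square distance between two rows (squared differences of row profiles, each divided by the corresponding column mass), which the text above does not spell out.

```python
import numpy as np

# Hypothetical 4 x 3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0,  5.0],
              [ 8.0, 15.0, 12.0],
              [ 5.0,  9.0, 25.0],
              [12.0, 12.0, 12.0]])

n_C = C.sum()
w_m = C.sum(axis=1) / n_C
w_n = C.sum(axis=0) / n_C
W_m = np.diag(1.0 / np.sqrt(w_m))
W_n = np.diag(1.0 / np.sqrt(w_n))
S = W_m @ (C / n_C - np.outer(w_m, w_n)) @ W_n

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
p = min(C.shape) - 1
U, sigma, V = U[:, :p], sigma[:p], Vt[:p, :].T
Sigma = np.diag(sigma)

F_m = W_m @ U @ Sigma   # principal coordinates of the rows
F_n = W_n @ V @ Sigma   # principal coordinates of the columns
G_m = W_m @ U           # standard coordinates of the rows
G_n = W_n @ V           # standard coordinates of the columns

# Row profiles: each row of C divided by its row sum.
row_profiles = C / C.sum(axis=1, keepdims=True)

def chi2_distance(i, k):
    """Chi-square distance between rows i and k (assumed standard definition)."""
    diff = row_profiles[i] - row_profiles[k]
    return np.sqrt((diff ** 2 / w_n).sum())

def euclidean_distance(i, k):
    """Euclidean distance between rows i and k in principal coordinates."""
    return np.linalg.norm(F_m[i] - F_m[k])

print(np.isclose(chi2_distance(0, 2), euclidean_distance(0, 2)))
```

The equality holds exactly here because all p dimensions are retained. Restricting F_m to the first two columns of a larger solution would only approximate the chi-square distances.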


Graphical representation of result

The visualization of a CA result always starts with displaying the scree plot of the principal inertia values to evaluate the success of summarizing spread by the first few singular vectors. The actual ordination is presented in a graph which could, at first look, be confused with a complicated scatter plot. In fact it consists of two scatter plots printed one upon the other, one set of points for the rows and one for the columns. But because it is a biplot, a clear interpretation rule relates the two coordinate matrices used.

Usually the first two dimensions of the CA solution are plotted because they encompass the maximum of information about the data table that can be displayed in 2D, although other combinations of dimensions may be investigated by a biplot. A biplot is in fact a low-dimensional mapping of a part of the information contained in the original table. As a rule of thumb, the set (rows or columns) which should be analysed with respect to its composition as measured by the other set is displayed in principal coordinates, while the other set is displayed in standard coordinates. For example, a table displaying voting districts in rows and political parties in columns, with the cells containing the counted votes, may be displayed with the districts (rows) in principal coordinates when the focus is on ordering districts according to similar voting.

Traditionally, originating from the French tradition in CA, early CA biplots mapped both entities in the same coordinate version, usually principal coordinates, but this kind of display is misleading insofar as: "Although this is called a biplot, it does ''not'' have any useful inner product relationship between the row and column scores", as Brian Ripley, maintainer of the R package MASS, correctly points out. Today that kind of display should be avoided, since laymen are usually not aware of the lacking relation between the two point sets.

A ''scaling 1'' biplot (rows in principal coordinates, columns in standard coordinates) is interpreted as follows:
* The distances between row points approximate their chi-square distance. Points close to each other represent rows with very similar values in the original data table, i.e. they may exhibit rather similar frequencies in the case of count data or closely related binary values in the case of presence/absence data.
* (Column) points in standard coordinates represent the vertices of the vector space, i.e. the outer corners of something that in multidimensional space has the shape of an irregular polyhedron. Project row points onto the line connecting the origin and the standard coordinate of a column; if the projected position along that connection line is close to the position of the standard coordinate, that row point is strongly associated with this column, i.e. in the case of count data the row has a high frequency of that category and in the case of presence/absence data the row is likely to exhibit a 1 in that column. Row points whose projection would require extending the connection line beyond the origin have a lower than average value in that column.
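The projection rule for a scaling 1 biplot can be made concrete through the inner product between row principal coordinates and column standard coordinates. From the definitions of F_m, G_n and S it follows that F_m G_n^T = W_m S W_n, whose entries are p_ij / (w_m,i · w_n,j) − 1. A positive inner product therefore corresponds to a cell above its expected value and a negative one to a cell below it. The sketch below checks this on an invented table. The relation is exact here because all p dimensions are kept, whereas a two-dimensional plot of a larger table would show only an approximation.

```python
import numpy as np

# Hypothetical 4 x 3 table: districts in rows, parties in columns (invented votes).
C = np.array([[20.0, 10.0,  5.0],
              [ 8.0, 15.0, 12.0],
              [ 5.0,  9.0, 25.0],
              [12.0, 12.0, 12.0]])

n_C = C.sum()
w_m = C.sum(axis=1) / n_C
w_n = C.sum(axis=0) / n_C
W_m = np.diag(1.0 / np.sqrt(w_m))
W_n = np.diag(1.0 / np.sqrt(w_n))
P = C / n_C
S = W_m @ (P - np.outer(w_m, w_n)) @ W_n

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
p = min(C.shape) - 1
U, sigma, V = U[:, :p], sigma[:p], Vt[:p, :].T

# Scaling 1: rows in principal, columns in standard coordinates.
F_m = W_m @ U @ np.diag(sigma)
G_n = W_n @ V

# Inner products between every row point and every column point.
inner = F_m @ G_n.T

# Relative deviation of each cell from its expected value.
relative_deviation = P / np.outer(w_m, w_n) - 1.0

print(np.allclose(inner, relative_deviation))
```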


Extensions and applications

Several variants of CA are available, including detrended correspondence analysis (DCA) and canonical correspondence analysis (CCA). The latter (CCA) is the method to use when there is information about possible causes for the similarities between the investigated entities. The extension of correspondence analysis to many categorical variables is called multiple correspondence analysis. An adaptation of correspondence analysis to the problem of discrimination based upon qualitative variables (i.e., the equivalent of discriminant analysis for qualitative data) is called discriminant correspondence analysis or barycentric discriminant analysis. In the social sciences, correspondence analysis, and particularly its extension multiple correspondence analysis, was made known outside France through French sociologist Pierre Bourdieu's application of it.


Implementations

* The data visualization system Orange includes the module orngCA.
* The statistical programming language R includes several packages which offer a function for (simple symmetric) correspondence analysis. Using the R notation package_name::function_name, the packages and respective functions are: ade4::dudi.coa(), ca::ca(), ExPosition::epCA(), FactoMineR::CA(), MASS::corresp(), vegan::cca(). The easiest approach for beginners is ca::ca(), as there is an extensive textbook accompanying that package.
* The freeware PAST (PAleontological STatistics) offers (simple symmetric) correspondence analysis via the menu "Multivariate/Ordination/Correspondence (CA)".


See also

* Formal concept analysis


External links

* Greenacre, Michael (2008), ''La Práctica del Análisis de Correspondencias'', BBVA Foundation, Madrid, Spanish translation of ''Correspondence Analysis in Practice'', available for free download from BBVA Foundation publications
* Greenacre, Michael (2010), ''Biplots in Practice'', BBVA Foundation, Madrid, available for free download at multivariatestatistics.org