Multiple correspondence analysis

In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables.


As an extension of correspondence analysis

MCA is performed by applying the CA algorithm to either an indicator matrix (also called ''complete disjunctive table'', CDT) or a ''Burt table'' formed from these variables. An indicator matrix is an individuals × variables matrix, where the rows represent individuals and the columns are dummy variables representing categories of the variables. Analyzing the indicator matrix allows the direct representation of individuals as points in geometric space. The Burt table is the symmetric matrix of all two-way cross-tabulations between the categorical variables, and has an analogy to the covariance matrix of continuous variables. Analyzing the Burt table is a more natural generalization of simple correspondence analysis, and individuals or the means of groups of individuals can be added as supplementary points to the graphical display.

In the indicator matrix approach, associations between variables are uncovered by calculating the chi-square distance between different categories of the variables and between the individuals (or respondents). These associations are then represented graphically as "maps", which eases the interpretation of the structures in the data. Oppositions between rows and columns are then maximized, in order to uncover the underlying dimensions best able to describe the central oppositions in the data. As in factor analysis or principal component analysis, the first axis is the most important dimension, the second axis the second most important, and so on, in terms of the amount of variance accounted for. The number of axes to be retained for analysis is determined by calculating modified eigenvalues.
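The two input tables described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data frame and its column names (`color`, `size`) are hypothetical, and pandas is used purely for convenience.

```python
import pandas as pd

# Hypothetical toy survey: 5 respondents, 2 categorical variables.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "size": ["S", "L", "L", "S", "S"],
})

# Indicator matrix (complete disjunctive table):
# one row per individual, one 0/1 dummy column per category.
indicator = pd.get_dummies(df).astype(int)

# Burt table: all two-way cross-tabulations between the variables,
# obtained as the product of the indicator matrix with its transpose.
burt = indicator.T @ indicator

print(indicator)
print(burt)
```

Each row of the indicator matrix sums to the number of variables (here 2), and the diagonal of the Burt table holds the category frequencies.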


Details

Since MCA is designed to draw statistical conclusions from categorical variables (such as multiple-choice questions), the first step is to transform quantitative data (such as age, size, weight, time of day, etc.) into categories (for instance, using statistical quantiles). Once the dataset is completely represented as categorical variables, one can build the corresponding so-called complete disjunctive table. We denote this table X. If I persons answered a survey with K multiple-choice questions with 4 answers each, X will have I rows and 4K columns.

More theoretically, assume X is the complete disjunctive table of I observations of K categorical variables. Assume also that the k-th variable has J_k different levels (categories) and set J=\sum_{k=1}^{K} J_k. The table X is then an I \times J matrix with all coefficients being 0 or 1. Set the sum of all entries of X to be N and introduce Z=X/N. In an MCA, there are also two special vectors: r, which contains the sums along the rows of Z, and c, which contains the sums along the columns of Z. Let D_r = \text{diag}(r) and D_c = \text{diag}(c) denote the diagonal matrices with r and c respectively on the diagonal. With this notation, computing an MCA consists essentially of the singular value decomposition of the matrix:
: M = D_r^{-1/2} (Z - r c^t) D_c^{-1/2}
The decomposition of M gives P, \Delta and Q such that M = P \Delta Q^t, where P and Q are two unitary matrices and \Delta is the generalized diagonal matrix of the singular values (with the same shape as Z). The positive coefficients of \Delta^2 are the non-zero eigenvalues of M^t M, also called the principal inertias.

The interest of MCA comes from the way observations (rows) and variables (columns) in Z can be decomposed. This decomposition is called a factor decomposition. The coordinates of the observations in the factor space are given by
: F = D_r^{-1/2} P \Delta
The i-th row of F represents the i-th observation in the factor space. Similarly, the coordinates of the variables (in the same factor space as the observations) are given by
: G = D_c^{-1/2} Q \Delta
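The computation of M, its singular value decomposition, and the coordinates F and G can be sketched with NumPy. This is a minimal sketch, not a reference implementation: the small table X below is hypothetical, and the thin SVD is used instead of the full-shape \Delta for brevity.

```python
import numpy as np

# Hypothetical complete disjunctive table: I = 6 individuals,
# K = 2 variables with 3 and 2 levels, hence J = 5 columns.
X = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
], dtype=float)

N = X.sum()           # sum of all entries (here I * K = 12)
Z = X / N             # normalized table
r = Z.sum(axis=1)     # sums along the rows of Z
c = Z.sum(axis=0)     # sums along the columns of Z

# M = D_r^{-1/2} (Z - r c^t) D_c^{-1/2}, written entrywise.
M = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Thin SVD: M = P diag(delta) Q^t
P, delta, Qt = np.linalg.svd(M, full_matrices=False)

eigenvalues = delta ** 2
F = (P / np.sqrt(r)[:, None]) * delta       # D_r^{-1/2} P Delta
G = (Qt.T / np.sqrt(c)[:, None]) * delta    # D_c^{-1/2} Q Delta

print(eigenvalues)
print(F[:, :2])   # observations on the first two axes
print(G[:, :2])   # categories on the first two axes
```

For an indicator matrix in which every category is observed, the eigenvalues sum to J/K - 1 (1.5 in this example), and at most J - K of them are non-zero.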


Recent works and extensions

In recent years, several students of Jean-Paul Benzécri have refined MCA and incorporated it into a more general framework of data analysis known as geometric data analysis. This involves the development of direct connections between simple correspondence analysis, principal component analysis and MCA with a form of cluster analysis known as Euclidean classification.

Two extensions have great practical use.
* It is possible to include, as active elements in the MCA, several quantitative variables. This extension is called factor analysis of mixed data (see below).
* Very often, in questionnaires, the questions are structured around several issues. The statistical analysis must take this structure into account. This is the aim of multiple factor analysis, which balances the different issues (i.e. the different groups of variables) within a global analysis and provides, beyond the classical results of factorial analysis (mainly graphics of individuals and of categories), several results (indicators and graphics) specific to the group structure.


Application fields

In the social sciences, MCA is arguably best known for its application by Pierre Bourdieu, notably in his books ''La Distinction'', ''Homo Academicus'' and ''The State Nobility''. Bourdieu argued that there was an internal link between his vision of the social as spatial and relational, captured by the notion of field, and the geometric properties of MCA. Sociologists following Bourdieu's work most often opt for the analysis of the indicator matrix, rather than the Burt table, largely because of the central importance accorded to the analysis of the 'cloud of individuals'.


Multiple correspondence analysis and principal component analysis

MCA can also be viewed as a PCA applied to the complete disjunctive table. To do this, the CDT must be transformed as follows. Let y_{ik} denote the general term of the CDT: y_{ik} is equal to 1 if individual i possesses the category k and 0 if not. Let p_k denote the proportion of individuals possessing the category k. The transformed CDT (TCDT) has as general term:
: x_{ik} = y_{ik}/p_k - 1
The unstandardized PCA applied to the TCDT, the column k having the weight p_k, leads to the results of MCA.

This equivalence is fully explained in a book by Jérôme Pagès. It plays an important theoretical role because it opens the way to the simultaneous treatment of quantitative and qualitative variables. Two methods simultaneously analyze these two types of variables: factor analysis of mixed data and, when the active variables are partitioned into several groups, multiple factor analysis.

This equivalence does not mean that MCA is a particular case of PCA, just as it is not a particular case of CA. It only means that these methods are closely linked to one another, as they belong to the same family: the factorial methods.
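The link between the two methods can be checked numerically. The sketch below is only illustrative: the table Y is hypothetical, uniform row weights 1/I are an assumption of this example, and the weighted PCA is carried out as an SVD of the weighted TCDT. With column weights p_k as stated above, the eigenvalues obtained in this sketch equal those of the MCA computed from the indicator matrix up to the constant factor K (the number of variables); weights p_k/K would make them identical.

```python
import numpy as np

# Hypothetical CDT: I = 6 individuals, K = 2 variables (3 and 2 levels).
Y = np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
], dtype=float)

I, J = Y.shape
K = int(Y[0].sum())    # each row contains exactly one 1 per variable
p = Y.mean(axis=0)     # p_k: proportion of individuals in category k

# Transformed CDT: x_ik = y_ik / p_k - 1 (its columns have zero mean).
T = Y / p - 1.0

# Unstandardized PCA with row weights 1/I (assumed) and column weights p_k,
# computed as the SVD of the weighted matrix.
W = T * np.sqrt(p) / np.sqrt(I)
pca_eig = np.linalg.svd(W, compute_uv=False) ** 2

# MCA eigenvalues from the indicator matrix, for comparison.
Z = Y / Y.sum()
r, c = Z.sum(axis=1), Z.sum(axis=0)
M = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
mca_eig = np.linalg.svd(M, compute_uv=False) ** 2

print(pca_eig)
print(K * mca_eig)   # matches pca_eig up to floating-point error
```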


Software

Numerous data analysis software packages include MCA, such as Stata and SPSS. The R package FactoMineR also features MCA. This software is related to a book describing the basic methods for performing MCA. There is also a Python package for MCA that works with NumPy array matrices; the package has not yet been implemented for Spark dataframes.



External links

* Le Roux, B. and H. Rouanet (2004), Geometric Data Analysis, From Correspondence Analysis to Structured Data Analysis at Google Books

* Greenacre, Michael (2008), ''La Práctica del Análisis de Correspondencias'', BBVA Foundation, Madrid, available for free download at the foundation's web site


* FactoMineR, an R software package devoted to exploratory data analysis.