ANOVA–simultaneous component analysis

In computational biology and bioinformatics, analysis of variance – simultaneous component analysis (ASCA or ANOVA–SCA) is a method that partitions variation and enables interpretation of these partitions by simultaneous component analysis (SCA), a method that is similar to principal components analysis (PCA). Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures used to analyze differences. SCA here should not be confused with statistical coupling analysis, a bioinformatics technique that shares the abbreviation and measures covariation between pairs of amino acids in a protein multiple sequence alignment (MSA). ASCA is a multivariate or even megavariate extension of ANOVA, and its variation partitioning is similar to that of ANOVA. Each partition contains all variation induced by one effect or factor, usually a treatment regime or experimental condition. The calculated effect partitions are called effect estimates. Because the effect estimates are themselves multivariate, interpreting them is not intuitive. Applying SCA to the effect estimates gives a simple, interpretable result. When there is more than one effect, the method estimates the effects in such a way that the different effects are not correlated.


Details

Many research areas see increasingly large numbers of variables measured on only a few samples. The low sample-to-variable ratio creates problems known as multicollinearity and singularity, so most traditional multivariate statistical methods cannot be applied.
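The singularity problem can be illustrated with a minimal sketch, assuming NumPy and made-up sizes (6 samples, 50 variables) that do not come from any real study. It shows that a mean-centered data matrix with fewer samples than variables cannot have full column rank.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration: few samples, many variables.
n_samples, n_variables = 6, 50

rng = np.random.default_rng(0)
X = rng.normal(size=(n_samples, n_variables))   # placeholder data
X = X - X.mean(axis=0)                          # mean-center each column

# Centering removes one degree of freedom, so rank(X) <= n_samples - 1.
rank = np.linalg.matrix_rank(X)
print(f"rank {rank} with {n_variables} variables")
```

The covariance matrix X^T X / (n_samples - 1) has the same rank as X, so here it would be a 50 × 50 matrix of rank at most 5. It cannot be inverted, which is what rules out methods that rely on the inverse covariance.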


ASCA algorithm

This section details how to calculate the ASCA model for a case of two main effects with one interaction effect. The rationale extends easily to more main effects and more interaction effects. With two effects, for example time as the first effect and dosage as the second, the only interaction is the one between time and dosage. We assume there are four time points and three dosage levels. Let X be a matrix that holds the data. X is mean centered, so its columns have zero mean. Let A and B denote the main effects and AB the interaction of these effects. In a biological experiment the two main effects could equally be time (A) and pH (B), and these two effects may interact. In designing such experiments one controls the main effects at several (at least two) levels. The levels of effect A can be referred to as A1, A2, A3 and A4, representing 2, 3, 4 and 5 hours from the start of the experiment. The same holds for effect B; for example, pH 6, pH 7 and pH 8 can be considered effect levels. The design must be balanced in A and B if the effect estimates are to be orthogonal and the partitioning unique. Matrix E holds the variation that is not assigned to any effect. The partitioning is written as:

: X = A + B + AB + E
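As a rough sketch of this setup, the snippet below builds a balanced design with four time levels and three dosage levels and mean-centers a placeholder data matrix. The replicate count (`n_rep = 2`), the number of variables (`n_vars = 5`) and the random data are assumptions made only for illustration.

```python
import numpy as np

n_time, n_dose = 4, 3     # four time points, three dosage levels (from the text)
n_rep = 2                 # assumed number of replicates per combination
n_vars = 5                # assumed number of measured variables

# One label per row; every time/dose combination occurs n_rep times (balanced).
time = np.repeat(np.arange(n_time), n_dose * n_rep)
dose = np.tile(np.repeat(np.arange(n_dose), n_rep), n_time)

rng = np.random.default_rng(1)
X = rng.normal(size=(time.size, n_vars))   # placeholder data, one row per sample
X = X - X.mean(axis=0)                     # mean-center: zero-mean columns

# Balance check: count the rows in each time/dose cell.
cells, counts = np.unique(np.column_stack([time, dose]), axis=0, return_counts=True)
print(len(cells), "cells with", set(counts.tolist()), "rows each")
```

If the counts differed between cells the design would be unbalanced, and the orthogonality of the effect estimates described above would no longer be guaranteed.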


Calculating main effect estimate A (or B)

Find all rows that correspond to effect A level 1 and average these rows. The result is a vector. Repeat this for the other effect levels. Make a new matrix of the same size as X and place the calculated averages in the matching rows. That is, give every row that matches, for example, effect A level 1 the average of effect A level 1. After completing the level estimates for the effect, perform an SCA. The scores of this SCA are the sample deviations for the effect; the important variables of this effect are those with large weights in the SCA loading vector.
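The procedure can be sketched in a few lines of NumPy. The helper name `effect_estimate`, the toy data and the design (four time levels with six rows each) are hypothetical, and the SCA step is computed here as a singular value decomposition, which is one common way to obtain PCA-type scores and loadings.

```python
import numpy as np

def effect_estimate(X, labels):
    """Replace every row of X by the mean of all rows sharing its level label."""
    est = np.zeros_like(X, dtype=float)
    for level in np.unique(labels):
        rows = labels == level
        est[rows] = X[rows].mean(axis=0)   # level average, copied into matching rows
    return est

# Toy balanced data: 4 time levels, 6 rows per level, 5 variables (all assumed).
rng = np.random.default_rng(2)
time = np.repeat(np.arange(4), 6)
X = rng.normal(size=(24, 5))
X -= X.mean(axis=0)                        # mean-center the columns

A = effect_estimate(X, time)               # main effect estimate for time

# SCA of the effect matrix via SVD: A = scores @ loadings.T
U, s, Vt = np.linalg.svd(A, full_matrices=False)
scores = U * s        # one row per sample
loadings = Vt.T       # one column per component
```

Because the four level means of a balanced, centered data set sum to zero, at most three singular values in `s` are non-zero in this example.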


Calculating interaction effect estimate AB

Estimating the interaction effect is similar to estimating main effects. The difference is that for interaction estimates the rows that match both effect A level 1 and effect B level 1 are averaged together, and all combinations of effects and levels are cycled through. In our example setting, with four time points and three dosage levels, there are 12 interaction sets. It is important to deflate (remove) the main effects before estimating the interaction effect.
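A minimal sketch of the interaction estimate, under the same kind of assumptions (hypothetical helper `group_means`, random placeholder data, two replicates per cell): the main effects are subtracted first, and the remainder is averaged within each of the 12 time/dosage cells.

```python
import numpy as np

def group_means(X, labels):
    """Replace every row of X by the mean of the rows with the same label."""
    est = np.empty_like(X)
    for g in np.unique(labels):
        est[labels == g] = X[labels == g].mean(axis=0)
    return est

# Balanced toy design: 4 time levels x 3 dosage levels x 2 replicates (assumed).
time = np.repeat(np.arange(4), 6)
dose = np.tile(np.repeat(np.arange(3), 2), 4)
rng = np.random.default_rng(3)
X = rng.normal(size=(24, 5))
X -= X.mean(axis=0)

A = group_means(X, time)            # main effect of time
B = group_means(X, dose)            # main effect of dosage

cell = time * 3 + dose              # one label per time/dosage combination
AB = group_means(X - A - B, cell)   # deflate main effects, then average per cell
E = X - A - B - AB                  # variation not assigned to any effect
```

In a balanced design such as this one the partitions come out mutually orthogonal, so the total sum of squares of X splits exactly into the sums of squares of A, B, AB and E.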


SCA on partitions A, B and AB

Simultaneous component analysis is mathematically identical to PCA, but is semantically different in that it models different objects or subjects at the same time. The standard notation for an SCA (and PCA) model is:

: X = T P^{\top} + E

where ''X'' is the data, ''T'' are the component scores and ''P'' are the component loadings. ''E'' is the residual or error matrix. Because ASCA models the variation partitions by SCA, the models for the effect estimates are:

: A = T_{a} P_{a}^{\top} + E_{a}
: B = T_{b} P_{b}^{\top} + E_{b}
: AB = T_{ab} P_{ab}^{\top} + E_{ab}
: E = T_{e} P_{e}^{\top} + E_{e}

Note that every partition has its own error matrix. However, algebra dictates that in a balanced, mean-centered data set every two-level system is of rank 1. For such an effect the error matrix is zero, since any rank 1 matrix can be written as the product of a single component score vector and a single loading vector. The full ASCA model with two effects and interaction, including the SCA, is:

: X = A + B + AB + E
: X = T_{a} P_{a}^{\top} + T_{b} P_{b}^{\top} + T_{ab} P_{ab}^{\top} + T_{e} P_{e}^{\top} + E_{a} + E_{b} + E_{ab} + E_{e}
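The rank 1 remark can be checked numerically. The sketch below fits a one-component model to a balanced two-level effect estimate using a truncated SVD. The function name `sca`, the group sizes and the data are assumptions made for illustration, not part of any published ASCA implementation.

```python
import numpy as np

def sca(M, n_components):
    """Fit M = T @ P.T + E by truncated SVD; return scores T, loadings P, residual E."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # component scores
    P = Vt[:n_components].T                      # component loadings
    return T, P, M - T @ P.T

# Balanced two-level effect in mean-centered toy data (5 rows per level, assumed).
rng = np.random.default_rng(4)
group = np.repeat([0, 1], 5)
X = rng.normal(size=(10, 4))
X -= X.mean(axis=0)

A = np.empty_like(X)
for g in (0, 1):
    A[group == g] = X[group == g].mean(axis=0)   # level averages in matching rows

T, P, E_a = sca(A, n_components=1)
print(np.abs(E_a).max())   # numerically zero: one component reproduces A
```

With two balanced levels the two level means are each other's negatives, so every row of A is a multiple of the same vector and a single score and loading vector suffice.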


Time as an effect

Because 'time' is treated as a qualitative factor in the ANOVA decomposition preceding ASCA, a nonlinear multivariate time trajectory can be modeled. An example of this is shown in Figure 10 of Smilde, A. K., Hoefsloot, H. C. and Westerhuis, J. A. (2008), "The geometry of ASCA", ''Journal of Chemometrics'', 22, 464–471.

