Multivariate Analysis
   HOME

TheInfoList



OR:

Multivariate statistics is a subdivision of
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
encompassing the simultaneous observation and analysis of more than one outcome variable. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical application of multivariate statistics to a particular problem may involve several types of univariate and multivariate analyses in order to understand the relationships between variables and their relevance to the problem being studied. In addition, multivariate statistics is concerned with multivariate
probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon i ...
s, in terms of both :*how these can be used to represent the distributions of observed data; :*how they can be used as part of
statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution, distribution of probability.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical ...
, particularly where several different quantities are of interest to the same analysis. Certain types of problems involving multivariate data, for example
simple linear regression In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the ''x'' and ...
and
multiple regression In statistical modeling, regression analysis is a set of statistical processes for Estimation theory, estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning ...
, are ''not'' usually considered to be special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables.


Multivariate analysis

Multivariate analysis (MVA) is based on the principles of multivariate statistics. Typically, MVA is used to address the situations where multiple measurements are made on each experimental unit and the relations among these measurements and their structures are important. A modern, overlapping categorization of MVA includes: * Normal and general multivariate models and distribution theory * The study and measurement of relationships * Probability computations of multidimensional regions * The exploration of data structures and patterns Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems". Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of
surrogate model A surrogate model is an engineering method used when an outcome of interest cannot be easily measured or computed, so a model of the outcome is used instead. Most engineering design problems require experiments and/or simulations to evaluate design ...
s, highly accurate approximations of the physics-based code. Since surrogate models take the form of an equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA studies: while a
Monte Carlo simulation Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. The underlying concept is to use randomness to solve problems that might be determini ...
across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response-surface equations.


Types of analysis

There are many different models, each with its own type of analysis: #
Multivariate analysis of variance In statistics, multivariate analysis of variance (MANOVA) is a procedure for comparing multivariate sample means. As a multivariate procedure, it is used when there are two or more dependent variables, and is often followed by significance tests i ...
(MANOVA) extends the
analysis of variance Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyze the differences among means. ANOVA was developed by the statisticia ...
to cover cases where there is more than one dependent variable to be analyzed simultaneously; see also
Multivariate analysis of covariance Multivariate analysis of covariance (MANCOVA) is an extension of analysis of covariance (ANCOVA) methods to cover cases where there is more than one dependent variable and where the control of concomitant continuous independent variables – covari ...
(MANCOVA). #Multivariate regression attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the
general linear model The general linear model or general multivariate regression model is a compact way of simultaneously writing several multiple linear regression models. In that sense it is not a separate statistical linear model. The various multiple linear regre ...
. Some suggest that multivariate regression is distinct from multivariable regression, however, that is debated and not consistently true across scientific fields. #
Principal components analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
(PCA) creates a new set of orthogonal variables that contain the same information as the original set. It rotates the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation. #
Factor analysis Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed ...
is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to account for covariation in a group of observed variables. #
Canonical correlation analysis In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors ''X'' = (''X''1, ..., ''X'n'') and ''Y' ...
finds linear relationships among two sets of variables; it is the generalised (i.e. canonical) version of bivariate correlation. # Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables from one set of (independent) variables that explain as much variance as possible in another (independent) set. It is a multivariate analogue of regression. #
Correspondence analysis Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rat ...
(CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The underlying model assumes chi-squared dissimilarities among records (cases). # Canonical (or "constrained") correspondence analysis (CCA) for summarising the joint variation in two sets of variables (like redundancy analysis); combination of correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared dissimilarities among records (cases). #
Multidimensional scaling Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of n objects or individuals" into a configurati ...
comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances between records. The original method is
principal coordinates analysis Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of n objects or individuals" into a configurati ...
(PCoA; based on PCA). #
Discriminant analysis Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features ...
, or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or more groups of cases. #
Linear discriminant analysis Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features ...
(LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations. # Clustering systems assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other than objects from different clusters. #
Recursive partitioning Recursive partitioning is a statistical method for multivariable analysis. Recursive partitioning creates a decision tree that strives to correctly classify members of the population by splitting it into sub-populations based on several dichotomous ...
creates a decision tree that attempts to correctly classify members of the population based on a dichotomous dependent variable. #
Artificial neural networks Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains. An ANN is based on a collection of connected unit ...
extend regression and clustering methods to non-linear multivariate models. #
Statistical graphics Statistical graphics, also known as statistical graphical techniques, are graphics used in the field of statistics for data visualization. Overview Whereas statistics and data analysis procedures generally yield their output in numeric or tabul ...
such as tours, parallel coordinate plots, scatterplot matrices can be used to explore multivariate data. #
Simultaneous equations model Simultaneous equations models are a type of statistical model in which the dependent variables are functions of other dependent variables, rather than just independent variables. This means some of the explanatory variables are jointly determined ...
s involve more than one regression equation, with different dependent variables, estimated together. #
Vector autoregression Vector autoregression (VAR) is a statistical model used to capture the relationship between multiple quantities as they change over time. VAR is a type of stochastic process model. VAR models generalize the single-variable (univariate) autoregres ...
involves simultaneous regressions of various
time series In mathematics, a time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Exa ...
variables on their own and each other's lagged values. #
Principal response curve In multivariate statistics, principal response curves (PRC) are used for analysis of treatment effects in experiments with a repeated measures design Repeated measures design is a research design that involves multiple measures of the same variab ...
s analysis (PRC) is a method based on RDA that allows the user to focus on treatment effects over time by correcting for changes in control treatments over time. #
Iconography of correlations In exploratory data analysis, the iconography of correlations is a method which consists in replacing a correlation matrix by a diagram where the “remarkable” correlations are represented by a solid line (positive correlation), or a dotted line ...
consists in replacing a correlation matrix by a diagram where the “remarkable” correlations are represented by a solid line (positive correlation), or a dotted line (negative correlation).


Important probability distributions

There is a set of
probability distribution In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon i ...
s used in multivariate analyses that play a similar role to the corresponding set of distributions that are used in
univariate analysis Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved. Univariate analysis can yield misleading results i ...
when the
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
is appropriate to a dataset. These multivariate distributions are: :*
Multivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One d ...
:*
Wishart distribution In statistics, the Wishart distribution is a generalization to multiple dimensions of the gamma distribution. It is named in honor of John Wishart, who first formulated the distribution in 1928. It is a family of probability distributions define ...
:* Multivariate Student-t distribution. The
Inverse-Wishart distribution In statistics, the inverse Wishart distribution, also called the inverted Wishart distribution, is a probability distribution defined on real-valued positive-definite matrices. In Bayesian statistics it is used as the conjugate prior for the co ...
is important in
Bayesian inference Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, a ...
, for example in
Bayesian multivariate linear regression In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random v ...
. Additionally,
Hotelling's T-squared distribution In statistics, particularly in hypothesis testing, the Hotelling's ''T''-squared distribution (''T''2), proposed by Harold Hotelling, is a multivariate probability distribution that is tightly related to the ''F''-distribution and is most notab ...
is a multivariate distribution, generalising
Student's t-distribution In probability and statistics, Student's ''t''-distribution (or simply the ''t''-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in sit ...
, that is used in multivariate
hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. ...
.


History

Anderson's 1958 textbook,'' An Introduction to Multivariate Statistical Analysis'', educated a generation of theorists and applied statisticians; Anderson's book emphasizes
hypothesis testing A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. ...
via
likelihood ratio test In statistics, the likelihood-ratio test assesses the goodness of fit of two competing statistical models based on the ratio of their likelihoods, specifically one found by maximization over the entire parameter space and another found after im ...
s and the properties of
power function Exponentiation is a mathematical operation, written as , involving two numbers, the '' base'' and the ''exponent'' or ''power'' , and pronounced as " (raised) to the (power of) ". When is a positive integer, exponentiation corresponds to re ...
s:
admissibility Admissibility may refer to: Law * Admissible evidence, evidence which may be introduced in a court of law *Admissibility (ECHR), whether a case will be considered in the European Convention on Human Rights system Mathematics and logic * Admissible ...
,
unbiasedness In statistics, the bias of an estimator (or bias function) is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called ''unbiased''. In stat ...
and
monotonicity In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order. This concept first arose in calculus, and was later generalized to the more abstract setting of order ...
. MVA once solely stood in the statistical theory realms due to the size, complexity of underlying data set and high computational consumption. With the dramatic growth of computational power, MVA now plays an increasingly important role in data analysis and has wide application in
OMICS The branches of science known informally as omics are various disciplines in biology whose names end in the suffix '' -omics'', such as genomics, proteomics, metabolomics, metagenomics, phenomics and transcriptomics. Omics aims at the collect ...
fields.


Applications

* Multivariate hypothesis testing *
Dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...
* Latent structure discovery * Clustering * Multivariate regression analysis * Classification and discrimination analysis *
Variable selection In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construc ...
*
Multidimensional analysis In statistics, econometrics and related fields, multidimensional analysis (MDA) is a data analysis process that groups data into two categories: data dimensions and measurements. For example, a data set consisting of the number of wins for a sin ...
*
Multidimensional scaling Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. MDS is used to translate "information about the pairwise 'distances' among a set of n objects or individuals" into a configurati ...
* Data mining


Software and tools

There are an enormous number of software packages and other tools for multivariate analysis, including: *
JMP (statistical software) JMP (pronounced "jump") is a suite of computer programs for statistical analysis developed by JMP, a subsidiary of SAS Institute. It was launched in 1989 to take advantage of the graphical user interface introduced by the Macintosh operating s ...
*
MiniTab Minitab is a statistics package developed at the Pennsylvania State University by researchers Barbara F. Ryan, Thomas A. Ryan, Jr., and Brian L. Joiner in conjunction with Triola Statistics Company in 1972. It began as a light version of OMNITA ...
* Calc *
PSPP PSPP is a free software application for analysis of sampled data, intended as a free alternative for IBM SPSS Statistics. It has a graphical user interface and conventional command-line interface. It is written in C and uses GNU Scientific Lib ...
* RCRAN
has details on the packages available for multivariate data analysis
*
SAS (software) SAS (previously "Statistical Analysis System") is a statistical software suite developed by SAS Institute for data management, advanced analytics, multivariate analysis, business intelligence, criminal investigation, and predictive analytics. ...
*
SciPy SciPy (pronounced "sigh pie") is a free and open-source Python library used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal ...
for
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...
*
SPSS SPSS Statistics is a statistical software suite developed by IBM for data management, advanced analytics, multivariate analysis, business intelligence, and criminal investigation. Long produced by SPSS Inc., it was acquired by IBM in 2009. C ...
*
Stata Stata (, , alternatively , occasionally stylized as STATA) is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fie ...
*
STATISTICA Statistica is an advanced analytics software package originally developed by StatSoft and currently maintained by TIBCO Software Inc. Statistica provides data analysis, data management, statistics, data mining, machine learning, text analytics a ...
* The Unscrambler *
WarpPLS WarpPLS is a software with graphical user interface for variance-based and factor-based structural equation modeling, structural equation modeling (SEM) using the partial least squares path modeling, partial least squares and factor-based methods. ...
*
SmartPLS SmartPLS is a software with graphical user interface for variance-based structural equation modeling (SEM) using the partial least squares (PLS) path modeling method. Users can estimate models with their data by using basic PLS-SEM, weighted PL ...
*
MATLAB MATLAB (an abbreviation of "MATrix LABoratory") is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. MATLAB allows matrix manipulations, plotting of functions and data, implementation ...
*
Eviews EViews is a statistical package for Microsoft Windows, Windows, used mainly for time-series oriented econometrics, econometric analysis. It is developed by Quantitative Micro Software (QMS), now a part of IHS Inc., IHS. Version 1.0 was released ...
*
NCSS (statistical software) NCSS is a statistics package produced and distributed by NCSS, LLC. Created in 1981 by Jerry L. Hintze, NCSS, LLC specializes in providing statistical analysis software to researchers, businesses, and academic institutions. It also produces PAS ...
includes multivariate analysis.
The Unscrambler® X
is a multivariate analysis tool.
SIMCA
*DataPandit (Free SaaS applications b
Let's Excel Analytics Solutions


See also

*
Estimation of covariance matrices In statistics, sometimes the covariance matrix of a multivariate random variable is not known but has to be estimated. Estimation of covariance matrices then deals with the question of how to approximate the actual covariance matrix on the basis o ...
* Important publications in multivariate analysis *
Multivariate testing in marketing In marketing, multivariate testing or multi-variable testing techniques apply statistical hypothesis testing on multi-variable systems, typically consumers on websites. Techniques of multivariate statistics are used. In internet marketing In inte ...
*
Structured data analysis (statistics) Structured data analysis is the statistical data analysis of structured data. This can arise either in the form of an ''a priori'' structure such as multiple-choice questionnaires or in situations with the need to search for structure that fits t ...
*
Structural equation modeling Structural equation modeling (SEM) is a label for a diverse set of methods used by scientists in both experimental and observational research across the sciences, business, and other fields. It is used most in the social and behavioral scienc ...
*
RV coefficient In statistics, the RV coefficient is a multivariate generalization of the ''squared'' Pearson correlation coefficient (because the RV coefficient takes values between 0 and 1). It measures the closeness of two set of points that may each be represe ...
*
Bivariate analysis Bivariate analysis is one of the simplest forms of quantitative (statistical) analysis.Earl R. Babbie, ''The Practice of Social Research'', 12th edition, Wadsworth Publishing, 2009, , pp. 436–440 It involves the analysis of two variables (often ...
*
Design of experiments The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe and explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associ ...
(DoE) *
Dimensional analysis In engineering and science, dimensional analysis is the analysis of the relationships between different physical quantities by identifying their base quantities (such as length, mass, time, and electric current) and units of measure (such as m ...
*
Exploratory data analysis In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but pr ...
* OLS *
Partial least squares regression Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a li ...
*
Pattern recognition Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphi ...
*
Principal component analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
(PCA) *
Regression analysis In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one ...
* Soft independent modelling of class analogies (SIMCA) *
Statistical interference When two probability distributions overlap, statistical interference exists. Knowledge of the distributions can be used to determine the likelihood that one parameter exceeds another, and by how much. This technique can be used for dimensioning ...
*
Univariate analysis Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved. Univariate analysis can yield misleading results i ...


References


Further reading

* * * A. Sen, M. Srivastava, ''Regression Analysis — Theory, Methods, and Applications'', Springer-Verlag, Berlin, 2011 (4th printing). * * Malakooti, B. (2013). Operations and Production Systems with Multiple Objectives. John Wiley & Sons. * T. W. Anderson, ''An Introduction to Multivariate Statistical Analysis'', Wiley, New York, 1958. * (M.A. level "likelihood" approach) * Feinstein, A. R. (1996) ''Multivariable Analysis''. New Haven, CT: Yale University Press. * Hair, J. F. Jr. (1995) ''Multivariate Data Analysis with Readings'', 4th ed. Prentice-Hall. * * Schafer, J. L. (1997) ''Analysis of Incomplete Multivariate Data''. CRC Press. (Advanced) * Sharma, S. (1996) ''Applied Multivariate Techniques''. Wiley. (Informal, applied) *Izenman, Alan J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Texts in Statistics. New York: Springer-Verlag. . *"Handbook of Applied Multivariate Statistics and Mathematical Modeling , ScienceDirect". Retrieved 2019-09-03.


External links


Statnotes: Topics in Multivariate Analysis, by G. David Garson

Mike Palmer: The Ordination Web Page

InsightsNow: Makers of ReportsNow, ProfilesNow, and KnowledgeNow
{{Authority control