In statistics, pooled variance (also known as combined variance, composite variance, or overall variance, and written \sigma^2) is a method for estimating the variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same. The numerical estimate resulting from the use of this method is also called the pooled variance.

Under the assumption of equal population variances, the pooled sample variance provides a higher precision estimate of variance than the individual sample variances. This higher precision can lead to increased statistical power when used in statistical tests that compare the populations, such as the ''t''-test.

The square root of a pooled variance estimator is known as a pooled standard deviation (also known as combined standard deviation, composite standard deviation, or overall standard deviation).


Motivation

In statistics, data are often collected for a dependent variable, ''y'', over a range of values for the independent variable, ''x''. For example, the observation of fuel consumption might be studied as a function of engine speed while the engine load is held constant. If, in order to achieve a small variance in ''y'', numerous repeated tests are required at each value of ''x'', the expense of testing may become prohibitive. Reasonable estimates of variance can be determined by using the principle of pooled variance after repeating each test at a particular ''x'' only a few times.


Definition and computation

The pooled variance is an estimate of the fixed common variance \sigma^2 underlying various populations that have different means. We are given a set of sample variances s^2_i, where the populations are indexed i = 1, \ldots, m:

:s^2_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} \left(y_j - \overline{y}_i \right)^2.

Assuming uniform sample sizes, n_i = n, the pooled variance s^2_p can be computed by the arithmetic mean:

:s_p^2 = \frac{\sum_{i=1}^m s_i^2}{m} = \frac{s_1^2 + s_2^2 + \cdots + s_m^2}{m}.

If the sample sizes are non-uniform, then the pooled variance s^2_p can be computed by the weighted average, using as weights w_i = n_i - 1 the respective degrees of freedom (see also: Bessel's correction):

:s_p^2 = \frac{\sum_{i=1}^m (n_i - 1) s_i^2}{\sum_{i=1}^m (n_i - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_m - 1)s_m^2}{n_1 + n_2 + \cdots + n_m - m}.
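The following Python sketch illustrates both cases (the function name and the example numbers are illustrative, not part of the article; it assumes the supplied per-group variances are the usual Bessel-corrected sample variances):

```python
from statistics import mean

def pooled_variance(sizes, variances):
    """Pooled variance: per-group sample variances weighted by their
    degrees of freedom, n_i - 1 (illustrative helper, not a library call)."""
    if len(sizes) != len(variances):
        raise ValueError("sizes and variances must have the same length")
    dof = [n - 1 for n in sizes]
    return sum(w * s2 for w, s2 in zip(dof, variances)) / sum(dof)

variances = [2.0, 3.0, 4.0]

# Equal group sizes: the weights are identical, so the pooled variance
# reduces to the plain arithmetic mean of the group variances.
assert abs(pooled_variance([10, 10, 10], variances) - mean(variances)) < 1e-12

# Unequal group sizes: larger groups dominate the weighted average.
print(pooled_variance([5, 20, 50], variances))  # 3.625
```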


Variants

The unbiased least squares estimate of \sigma^2 (as presented above), and the biased maximum likelihood estimate below:

:s_p^2 = \frac{\sum_{i=1}^m (n_i - 1) s_i^2}{\sum_{i=1}^m n_i},

are used in different contexts. The former can give an unbiased s_p^2 to estimate \sigma^2 when the groups share an equal population variance. The latter can give a more efficient s_p^2 to estimate \sigma^2, although subject to bias. Note that the quantities s_i^2 on the right-hand sides of both equations are the unbiased per-group estimates.
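A minimal comparison of the two variants, under the same assumptions as the sketch above (function names and numbers are illustrative):

```python
def pooled_variance_unbiased(sizes, variances):
    # Unbiased (least squares) variant: divide by the total degrees of
    # freedom, sum(n_i - 1).
    num = sum((n - 1) * s2 for n, s2 in zip(sizes, variances))
    return num / sum(n - 1 for n in sizes)

def pooled_variance_mle(sizes, variances):
    # Biased maximum likelihood variant: same numerator, but divide by
    # the total number of observations, sum(n_i).
    num = sum((n - 1) * s2 for n, s2 in zip(sizes, variances))
    return num / sum(sizes)

sizes, variances = [8, 12], [2.5, 3.1]
print(pooled_variance_unbiased(sizes, variances))  # the larger of the two
print(pooled_variance_mle(sizes, variances))       # shrunk toward zero (biased low)
```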


Example

Consider the following set of data for ''y'' obtained at various levels of the independent variable ''x''. The number of trials, mean, variance and standard deviation are presented in the next table. These statistics represent the variance and standard deviation for each subset of data at the various levels of ''x''. If we can assume that the same phenomena are generating random error at every level of ''x'', the above data can be “pooled” to express a single estimate of variance and standard deviation. In a sense, this suggests finding a mean variance or standard deviation among the five results above. This mean variance is calculated by weighting the individual variances by their degrees of freedom (one less than the subset size) for each level of ''x''. Thus, the pooled variance is defined by

: s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \cdots + (n_k - 1)s_k^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1)}

where n_1, n_2, \ldots, n_k are the sizes of the data subsets at each level of the variable ''x'', and s_1^2, s_2^2, \ldots, s_k^2 are their respective variances. The pooled variance of the data shown above is therefore:

: s_p^2 = 2.764 \,
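The same calculation can be sketched on raw replicate data grouped by level of ''x'' (the measurements below are made up for illustration and are not the values from the table above):

```python
from statistics import variance  # unbiased sample variance (n - 1 denominator)

# Hypothetical replicate measurements of y at three levels of x.
groups = {
    30: [1.5, 2.1, 1.8],
    60: [3.2, 3.9, 3.5, 3.3],
    90: [5.1, 4.6, 5.4],
}

sizes = [len(ys) for ys in groups.values()]
variances = [variance(ys) for ys in groups.values()]

# Pool by weighting each group's variance with its degrees of freedom.
s_p2 = (sum((n - 1) * s2 for n, s2 in zip(sizes, variances))
        / sum(n - 1 for n in sizes))
print(f"pooled variance: {s_p2:.4f}")
print(f"pooled standard deviation: {s_p2 ** 0.5:.4f}")
```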


Effect on precision

Pooled variance is only an estimate of the variance of the combined data when the pooled data sets are correlated or their averages differ; it becomes less accurate the stronger the correlation between data sets and the further apart their averages are. The variance of the combination of non-overlapping data sets is:

: \sigma_X^2 = \frac{\sum_i \left[(N_{X_i} - 1)\sigma_{X_i}^2 + N_{X_i}\mu_{X_i}^2\right] - \left[\sum_i N_{X_i}\right]\mu_X^2}{\sum_i N_{X_i} - 1}

where the mean is defined as:

: \mu_X = \frac{\sum_i N_{X_i}\mu_{X_i}}{\sum_i N_{X_i}}

Given a biased maximum likelihood estimate defined as:

:s_p^2 = \frac{\sum_i (N_{X_i} - 1)\sigma_{X_i}^2}{\sum_i N_{X_i}},

the error in the biased maximum likelihood estimate is:

:\begin{align}
\text{Error} &= s_p^2 - \sigma_X^2 \\
&= \frac{\sum_i (N_{X_i} - 1)\sigma_{X_i}^2}{\sum_i N_{X_i}} - \frac{1}{\sum_i N_{X_i} - 1} \left( \sum_i \left[(N_{X_i} - 1)\sigma_{X_i}^2 + N_{X_i}\mu_{X_i}^2\right] - \left[\sum_i N_{X_i}\right]\mu_X^2 \right)
\end{align}

Assuming ''N'' is large such that:

: \sum_i N_{X_i} \approx \sum_i N_{X_i} - 1

the error in the estimate reduces to:

:\begin{align}
E &= - \frac{\sum_i \left[N_{X_i}\mu_{X_i}^2\right] - \left[\sum_i N_{X_i}\right]\mu_X^2}{\sum_i N_{X_i}}\\
&= \mu_X^2 - \frac{\sum_i \left[N_{X_i}\mu_{X_i}^2\right]}{\sum_i N_{X_i}}
\end{align}

Or alternatively:

:\begin{align}
E &= \left[\frac{\sum_i N_{X_i}\mu_{X_i}}{\sum_i N_{X_i}}\right]^2 - \frac{\sum_i \left[N_{X_i}\mu_{X_i}^2\right]}{\sum_i N_{X_i}}\\
&= \frac{\left[\sum_i N_{X_i}\mu_{X_i}\right]^2 - \sum_i N_{X_i}\sum_i \left[N_{X_i}\mu_{X_i}^2\right]}{\left[\sum_i N_{X_i}\right]^2}
\end{align}
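A numerical sketch of this error term (the group sizes, means, and variances are illustrative): it computes the variance of the combined data from the group statistics, the biased maximum likelihood pooled variance, and checks that their difference is close to the large-''N'' approximation above.

```python
# Illustrative group statistics: sizes, means, and unbiased variances.
N  = [200, 300, 500]
mu = [10.0, 12.0, 11.0]
s2 = [4.0, 5.0, 4.5]

total = sum(N)
mu_X = sum(n * m for n, m in zip(N, mu)) / total

# Variance of the combined, non-overlapping data sets (exact).
sigma_X2 = (sum((n - 1) * v + n * m**2 for n, v, m in zip(N, s2, mu))
            - total * mu_X**2) / (total - 1)

# Biased maximum likelihood pooled variance.
s_p2 = sum((n - 1) * v for n, v in zip(N, s2)) / total

error = s_p2 - sigma_X2
# Large-N approximation: E ~ mu_X^2 - sum(N_i * mu_i^2) / sum(N_i)
approx = mu_X**2 - sum(n * m**2 for n, m in zip(N, mu)) / total
print(error, approx)  # about -0.495 vs -0.490: close for large N
```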


Aggregation of standard deviation data

Rather than estimating a pooled standard deviation, the following shows how to aggregate standard deviations exactly when more statistical information is available.


Population-based statistics

The populations of sets, which may overlap, can be calculated simply as follows:

:N_{X \cup Y} = N_X + N_Y - N_{X \cap Y}

The populations of sets, which do not overlap, can be calculated simply as follows:

:X \cap Y = \varnothing \Rightarrow N_{X \cap Y} = 0 \Rightarrow N_{X \cup Y} = N_X + N_Y

Standard deviations of non-overlapping (X \cap Y = \varnothing) sub-populations can be aggregated as follows if the size (actual or relative to one another) and means of each are known:

:\begin{align}
\mu_{X \cup Y} &= \frac{N_X\mu_X + N_Y\mu_Y}{N_X + N_Y} \\
\sigma_{X \cup Y} &= \sqrt{\frac{N_X\left[\sigma_X^2 + (\mu_X - \mu_{X \cup Y})^2\right] + N_Y\left[\sigma_Y^2 + (\mu_Y - \mu_{X \cup Y})^2\right]}{N_X + N_Y}}
\end{align}

For example, suppose it is known that the average American man has a mean height of 70 inches with a standard deviation of three inches and that the average American woman has a mean height of 65 inches with a standard deviation of two inches. Also assume that the number of men, ''N'', is equal to the number of women. Then the mean and standard deviation of heights of American adults could be calculated as

:\begin{align}
\mu &= \frac{N \cdot 70 + N \cdot 65}{N + N} = \frac{70 + 65}{2} = 67.5 \\
\sigma &= \sqrt{\frac{3^2 + 2^2}{2} + \left(\frac{70 - 65}{2}\right)^2} = \sqrt{12.75} \approx 3.57
\end{align}

For the more general case of ''M'' non-overlapping populations, X_1 through X_M, and the aggregate population X = \bigcup_i X_i,

:\begin{align}
\mu_X &= \frac{\sum_i N_{X_i}\mu_{X_i}}{\sum_i N_{X_i}} \\
\sigma_X &= \sqrt{\frac{\sum_i N_{X_i}\left[\sigma_{X_i}^2 + (\mu_{X_i} - \mu_X)^2\right]}{\sum_i N_{X_i}}}
\end{align}

where

: X_i \cap X_j = \varnothing, \quad \forall\ i < j.

If the size (actual or relative to one another), mean, and standard deviation of two overlapping populations are known for the populations as well as their intersection, then the standard deviation of the overall population can still be calculated as follows:

:\begin{align}
\mu_{X \cup Y} &= \frac{1}{N_{X \cup Y}}\left(N_X\mu_X + N_Y\mu_Y - N_{X \cap Y}\mu_{X \cap Y}\right)\\
\sigma_{X \cup Y} &= \sqrt{\frac{1}{N_{X \cup Y}}\left(N_X\left[\sigma_X^2 + \mu_X^2\right] + N_Y\left[\sigma_Y^2 + \mu_Y^2\right] - N_{X \cap Y}\left[\sigma_{X \cap Y}^2 + \mu_{X \cap Y}^2\right]\right) - \mu_{X \cup Y}^2}
\end{align}

If two or more sets of data are being added together datapoint by datapoint, the standard deviation of the result can be calculated if the standard deviation of each data set and the covariance between each pair of data sets is known:

:\sigma_X = \sqrt{\sum_i \sigma_{X_i}^2 + 2\sum_{i<j} \operatorname{cov}(X_i, X_j)}

For the special case where no correlation exists between any pair of data sets, the relation reduces to the root sum of squares:

:\operatorname{cov}(X_i, X_j) = 0, \quad \forall\ i < j \Rightarrow \sigma_X = \sqrt{\sum_i \sigma_{X_i}^2}.
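The height example can be checked with a short script (a sketch of the population-based formula for two non-overlapping groups; the function name is illustrative, and the common group size ''N'' cancels out of the result):

```python
import math

def combine_two_populations(nx, mu_x, sd_x, ny, mu_y, sd_y):
    """Aggregate mean and (population) standard deviation of two
    non-overlapping groups from their sizes, means, and standard deviations."""
    mu = (nx * mu_x + ny * mu_y) / (nx + ny)
    var = (nx * (sd_x**2 + (mu_x - mu)**2)
           + ny * (sd_y**2 + (mu_y - mu)**2)) / (nx + ny)
    return mu, math.sqrt(var)

# Men: mean 70 in, sd 3 in; women: mean 65 in, sd 2 in; equal counts N.
N = 1000  # any common value of N gives the same answer
mu, sigma = combine_two_populations(N, 70, 3, N, 65, 2)
print(mu, sigma)  # 67.5 and about 3.57
```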


Sample-based statistics

Standard deviations of non-overlapping (X \cap Y = \varnothing) sub-samples can be aggregated as follows if the actual size and means of each are known:

:\begin{align}
\mu_{X \cup Y} &= \frac{1}{N_X + N_Y}\left(N_X\mu_X + N_Y\mu_Y\right)\\
\sigma_{X \cup Y} &= \sqrt{\frac{(N_X - 1)\sigma_X^2 + (N_Y - 1)\sigma_Y^2 + \frac{N_X N_Y}{N_X + N_Y}(\mu_X - \mu_Y)^2}{N_X + N_Y - 1}}
\end{align}

For the more general case of ''M'' non-overlapping data sets, X_1 through X_M, and the aggregate data set X = \bigcup_i X_i,

:\begin{align}
\mu_X &= \frac{1}{\sum_i N_{X_i}}\left(\sum_i N_{X_i}\mu_{X_i}\right)\\
\sigma_X &= \sqrt{\frac{\sum_i \left[(N_{X_i} - 1)\sigma_{X_i}^2 + N_{X_i}\mu_{X_i}^2\right] - \left(\sum_i N_{X_i}\right)\mu_X^2}{\sum_i N_{X_i} - 1}}
\end{align}

where

:X_i \cap X_j = \varnothing, \quad \forall\ i < j.

If the size, mean, and standard deviation of two overlapping samples are known for the samples as well as their intersection, then the standard deviation of the aggregated sample can still be calculated. In general,

:\begin{align}
\mu_{X \cup Y} &= \frac{1}{N_{X \cup Y}}\left(N_X\mu_X + N_Y\mu_Y - N_{X \cap Y}\mu_{X \cap Y}\right)\\
\sigma_{X \cup Y} &= \sqrt{\frac{\left[(N_X - 1)\sigma_X^2 + N_X\mu_X^2\right] + \left[(N_Y - 1)\sigma_Y^2 + N_Y\mu_Y^2\right] - \left[(N_{X \cap Y} - 1)\sigma_{X \cap Y}^2 + N_{X \cap Y}\mu_{X \cap Y}^2\right] - N_{X \cup Y}\mu_{X \cup Y}^2}{N_{X \cup Y} - 1}}
\end{align}
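A hedged check of the non-overlapping two-sample formula (the data and function name are illustrative): aggregating from summary statistics should reproduce the sample standard deviation of the concatenated data exactly.

```python
import math
from statistics import mean, stdev  # stdev uses the n - 1 denominator

def combine_two_samples(nx, mu_x, sd_x, ny, mu_y, sd_y):
    """Aggregate mean and (Bessel-corrected) standard deviation of two
    non-overlapping samples from their sizes, means, and standard deviations."""
    mu = (nx * mu_x + ny * mu_y) / (nx + ny)
    var = ((nx - 1) * sd_x**2 + (ny - 1) * sd_y**2
           + nx * ny / (nx + ny) * (mu_x - mu_y)**2) / (nx + ny - 1)
    return mu, math.sqrt(var)

x = [2.0, 4.0, 4.0, 5.0, 7.0]
y = [1.0, 3.0, 6.0, 6.0]

mu, sigma = combine_two_samples(len(x), mean(x), stdev(x), len(y), mean(y), stdev(y))
assert math.isclose(mu, mean(x + y))
assert math.isclose(sigma, stdev(x + y))
print(mu, sigma)
```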


See also

* Chi-squared distribution#Asymptotic properties
* Cohen's ''d'' (effect size), which is calculated using the pooled standard deviation
* Distribution of the sample variance
* Pooled covariance matrix
* Pooled degree of freedom
* Pooled mean


External links


* IUPAC Gold Book – pooled standard deviation
* – also referring to Cohen's ''d'' (on page 6)