The sample mean (sample average) or empirical mean (empirical average), and the sample covariance or empirical covariance are statistics computed from a sample of data on one or more random variables.
The sample mean is the average value (or mean value) of a sample of numbers taken from a larger population of numbers, where "population" indicates not a number of people but the entirety of relevant data, whether collected or not. A sample of 40 companies' sales from the Fortune 500 might be used for convenience instead of looking at the population, all 500 companies' sales. The sample mean is used as an estimator for the population mean, the average value in the entire population, where the estimate is more likely to be close to the population mean if the sample is large and representative. The reliability of the sample mean is estimated using the standard error, which in turn is calculated using the variance of the sample. If the sample is random, the standard error falls with the size of the sample and the sample mean's distribution approaches the normal distribution as the sample size increases.
The term "sample mean" can also be used to refer to a vector of average values when the statistician is looking at the values of several variables in the sample, e.g. the sales, profits, and employees of a sample of Fortune 500 companies. In this case, there is not just a sample variance for each variable but a sample variance-covariance matrix (or simply ''covariance matrix'') showing also the relationship between each pair of variables. This would be a 3×3 matrix when 3 variables are being considered. The sample covariance is useful in judging the reliability of the sample means as estimators and is also useful as an estimate of the population covariance matrix.
Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics to represent the location and dispersion of the distribution of values in the sample, and to estimate the values for the population.
Definition of the sample mean
The sample mean is the average of the values of a variable in a sample, which is the sum of those values divided by the number of values. Using mathematical notation, if a sample of ''N'' observations on variable ''X'' is taken from the population, the sample mean is:
: <math>\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i.</math>
Under this definition, if the sample (1, 4, 1) is taken from the population (1,1,3,4,0,2,1,0), then the sample mean is <math>\bar{x} = (1+4+1)/3 = 2</math>, as compared to the population mean of <math>\mu = (1+1+3+4+0+2+1+0)/8 = 12/8 = 1.5</math>. Even if a sample is random, it is rarely perfectly representative, and other samples would have other sample means even if the samples were all from the same population. The sample (2, 1, 0), for example, would have a sample mean of 1.
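The arithmetic in this example can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name `sample_mean` is an illustrative choice, not part of any library.

```python
def sample_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

population = [1, 1, 3, 4, 0, 2, 1, 0]
sample_a = [1, 4, 1]
sample_b = [2, 1, 0]

mean_a = sample_mean(sample_a)             # (1 + 4 + 1) / 3
mean_b = sample_mean(sample_b)             # (2 + 1 + 0) / 3
population_mean = sample_mean(population)  # 12 / 8

print(mean_a, mean_b, population_mean)  # 2.0 1.0 1.5
```

Two samples from the same population give two different sample means, neither equal to the population mean.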
If the statistician is interested in ''K'' variables rather than one, each observation having a value for each of those ''K'' variables, the overall sample mean consists of ''K'' sample means for individual variables. Let <math>x_{ij}</math> be the ''i''-th independently drawn observation (''i''=1,...,''N'') on the ''j''-th random variable (''j''=1,...,''K''). These observations can be arranged into ''N'' column vectors, each with ''K'' entries, with the ''K''×1 column vector giving the ''i''-th observations of all variables being denoted <math>\mathbf{x}_i</math> (''i''=1,...,''N'').
The sample mean vector <math>\mathbf{\bar{x}}</math> is a column vector whose ''j''-th element <math>\bar{x}_j</math> is the average value of the ''N'' observations of the ''j''-th variable:
: <math>\bar{x}_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \quad j = 1, \ldots, K.</math>
Thus, the sample mean vector contains the average of the observations for each variable, and is written
: <math>\mathbf{\bar{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_j \\ \vdots \\ \bar{x}_K \end{bmatrix}.</math>
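As a rough illustration of the sample mean vector, the sketch below averages each of the ''K'' variables over ''N'' observations. The function name `sample_mean_vector` and the data are hypothetical and chosen only for the example.

```python
def sample_mean_vector(observations):
    """Return the K per-variable averages of N observations (each of length K)."""
    n = len(observations)
    k = len(observations[0])
    return [sum(obs[j] for obs in observations) / n for j in range(k)]

# Hypothetical data: N = 4 observations on K = 3 variables.
observations = [
    (1.0, 2.0, 10.0),
    (2.0, 4.0, 8.0),
    (3.0, 6.0, 6.0),
    (4.0, 8.0, 4.0),
]

x_bar = sample_mean_vector(observations)
print(x_bar)  # [2.5, 5.0, 7.0]
```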
Definition of sample covariance
The sample covariance matrix is a ''K''-by-''K'' matrix <math>\mathbf{Q} = [q_{jk}]</math> with entries
: <math>q_{jk} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right),</math>
where <math>q_{jk}</math> is an estimate of the covariance between the ''j''-th variable and the ''k''-th variable of the population underlying the data.
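In terms of the observation vectors, the sample covariance is
: <math>\mathbf{Q} = \frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_i-\mathbf{\bar{x}})(\mathbf{x}_i-\mathbf{\bar{x}})^\mathrm{T}.</math>
Alternatively, the observation vectors can be arranged as the columns of a matrix, so that
: <math>\mathbf{F} = \begin{bmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \dots & \mathbf{x}_N\end{bmatrix},</math>
which is a matrix of ''K'' rows and ''N'' columns.
Here, the sample covariance matrix can be computed as
: <math>\mathbf{Q} = \frac{1}{N-1}\left(\mathbf{F}-\mathbf{\bar{x}}\,\mathbf{1}_N^\mathrm{T}\right)\left(\mathbf{F}-\mathbf{\bar{x}}\,\mathbf{1}_N^\mathrm{T}\right)^\mathrm{T},</math>
where <math>\mathbf{1}_N</math> is an ''N''×1 vector of ones.
If the observations are arranged as rows instead of columns, so <math>\mathbf{\bar{x}}</math> is now a 1×''K'' row vector and <math>\mathbf{M} = \mathbf{F}^\mathrm{T}</math> is an ''N''×''K'' matrix whose column ''j'' is the vector of ''N'' observations on variable ''j'', then applying transposes in the appropriate places yields
: <math>\mathbf{Q} = \frac{1}{N-1}\left(\mathbf{M}-\mathbf{1}_N\mathbf{\bar{x}}\right)^\mathrm{T}\left(\mathbf{M}-\mathbf{1}_N\mathbf{\bar{x}}\right).</math>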
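The entrywise definition of the sample covariance matrix can be sketched in plain Python as follows. The function name `sample_covariance` and the data are hypothetical. The code accumulates the outer products of the deviations from the sample mean and divides by ''N'' − 1.

```python
def sample_covariance(observations):
    """Return the K-by-K sample covariance matrix with denominator N - 1."""
    n = len(observations)
    k = len(observations[0])
    if n < 2:
        raise ValueError("at least two observations are needed")
    # Sample mean of each variable.
    x_bar = [sum(obs[j] for obs in observations) / n for j in range(k)]
    # Accumulate the outer products of the deviations from the mean.
    q = [[0.0] * k for _ in range(k)]
    for obs in observations:
        dev = [obs[j] - x_bar[j] for j in range(k)]
        for j in range(k):
            for m in range(k):
                q[j][m] += dev[j] * dev[m]
    return [[entry / (n - 1) for entry in row] for row in q]

# Hypothetical data: N = 4 observations on K = 3 variables.
observations = [
    (1.0, 2.0, 10.0),
    (2.0, 4.0, 8.0),
    (3.0, 6.0, 6.0),
    (4.0, 8.0, 4.0),
]

Q = sample_covariance(observations)
for row in Q:
    print(row)
```

In this data set the second variable is twice the first and the third decreases as the first increases, so the off-diagonal entries in the first row are positive and negative respectively.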
Like covariance matrices for random vectors, sample covariance matrices are positive semi-definite. To prove it, note that for any matrix <math>\mathbf{A}</math> the matrix <math>\mathbf{A}^\mathrm{T}\mathbf{A}</math> is positive semi-definite. Furthermore, a covariance matrix is positive definite if and only if the rank of the <math>\mathbf{x}_i-\mathbf{\bar{x}}</math> vectors is ''K''.
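The positive semi-definiteness argument can be written out in one line. The sketch below assumes the notation used in this article for the observation vectors and the sample mean vector, with the symbol Q for the sample covariance matrix and v an arbitrary vector with ''K'' real entries.

```latex
\mathbf{v}^{\mathrm{T}} \mathbf{Q}\, \mathbf{v}
  = \frac{1}{N-1} \sum_{i=1}^{N}
    \mathbf{v}^{\mathrm{T}} (\mathbf{x}_i - \bar{\mathbf{x}})
    (\mathbf{x}_i - \bar{\mathbf{x}})^{\mathrm{T}} \mathbf{v}
  = \frac{1}{N-1} \sum_{i=1}^{N}
    \left( \mathbf{v}^{\mathrm{T}} (\mathbf{x}_i - \bar{\mathbf{x}}) \right)^{2}
  \;\ge\; 0 .
```

The sum is zero for some nonzero v exactly when v is orthogonal to every deviation vector, which is possible only if the deviation vectors span fewer than ''K'' dimensions.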
Unbiasedness
The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector <math>\mathbf{X}</math>, a row vector whose ''j''-th element (''j'' = 1, ..., ''K'') is one of the random variables. The sample covariance matrix has <math>N-1</math> in the denominator rather than <math>N</math> due to a variant of Bessel's correction: in short, the sample covariance relies on the difference between each observation and the sample mean, but the sample mean is slightly correlated with each observation since it is defined in terms of all observations. If the population mean <math>\operatorname{E}(\mathbf{X})</math> is known, the analogous unbiased estimate
: <math>q_{jk} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\operatorname{E}(X_j)\right)\left(x_{ik}-\operatorname{E}(X_k)\right),</math>
using the population mean, has <math>N</math> in the denominator. This is an example of why in probability and statistics it is essential to distinguish between random variables (upper case letters) and realizations of the random variables (lower case letters).
The maximum likelihood estimate of the covariance
: <math>q_{jk} = \frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right)</math>
for the Gaussian distribution case has ''N'' in the denominator as well. The ratio of 1/''N'' to 1/(''N'' − 1) approaches 1 for large ''N'', so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
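The effect of the denominator can be checked exactly on a small case. The sketch below is an illustration under stated assumptions, not a general proof: it takes the eight-value population used earlier in the article, treats every sample of size 2 drawn with replacement as equally likely, and averages three variance estimators over all such samples using exact fractions. The helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [1, 1, 3, 4, 0, 2, 1, 0]
N = 2  # sample size (assumed for this illustration)

def average(values):
    """Exact arithmetic mean as a Fraction."""
    values = list(values)
    return Fraction(sum(values)) / len(values)

def sum_sq_dev(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((Fraction(v) - center) ** 2 for v in values)

mu = average(population)
pop_var = sum_sq_dev(population, mu) / len(population)

# Every ordered sample of size N drawn independently with replacement.
samples = list(product(population, repeat=N))

# Denominator N - 1, deviations from the sample mean.
expected_unbiased = average(sum_sq_dev(s, average(s)) / (N - 1) for s in samples)
# Denominator N, deviations from the sample mean (maximum likelihood form).
expected_ml = average(sum_sq_dev(s, average(s)) / N for s in samples)
# Denominator N, deviations from the known population mean.
expected_known_mean = average(sum_sq_dev(s, mu) / N for s in samples)

print(pop_var, expected_unbiased, expected_ml, expected_known_mean)
# 7/4 7/4 7/8 7/4
```

Only the estimator that divides by ''N'' while also using the sample mean falls short of the population variance, by the factor (''N'' − 1)/''N''.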
Distribution of the sample mean
For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. Of course the estimator will likely not be the true value of the population mean since different samples drawn from the same distribution will give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution. For a random sample of ''N'' observations on the ''j''-th random variable, the sample mean's distribution itself has mean equal to the population mean <math>\operatorname{E}(X_j)</math> and variance equal to <math>\sigma_j^2/N</math>, where <math>\sigma_j^2</math> is the population variance.
The arithmetic mean of a population, or population mean, is often denoted ''μ''. The sample mean <math>\bar{x}</math> (the arithmetic mean of a sample of values drawn from the population) makes a good estimator of the population mean, as its expected value is equal to the population mean (that is, it is an unbiased estimator). The sample mean is a random variable, not a constant, since its calculated value will randomly differ depending on which members of the population are sampled, and consequently it will have its own distribution. For a random sample of ''n'' independent observations, the expected value of the sample mean is
: <math>\operatorname{E}(\bar{x}) = \mu</math>
and the variance of the sample mean is
: <math>\operatorname{var}(\bar{x}) = \frac{\sigma^2}{n}.</math>
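The expected value and variance of the sample mean can be verified exactly on a small example. The sketch below assumes independent draws with replacement from the eight-value population used earlier in the article and a sample size of 3, and enumerates all equally likely samples with exact fractions. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

population = [1, 1, 3, 4, 0, 2, 1, 0]
n = 3  # sample size (assumed for this illustration)

# Population mean and population variance.
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

# Sample mean of every ordered sample of size n drawn with replacement.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]

expected_mean = sum(sample_means) / len(sample_means)
variance_of_mean = sum((m - expected_mean) ** 2 for m in sample_means) / len(sample_means)

print(expected_mean, mu)             # 3/2 3/2
print(variance_of_mean, sigma2 / n)  # 7/12 7/12
```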
If the samples are not independent, but correlated, then special care has to be taken in order to avoid the problem of pseudoreplication.
If the population is normally distributed, then the sample mean is normally distributed as follows:
: <math>\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right).</math>
If the population is not normally distributed, the sample mean is nonetheless approximately normally distributed if ''n'' is large and <math>\sigma^2/n < +\infty</math>. This is a consequence of the central limit theorem.
Weighted samples
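In a weighted sample, each vector <math>\mathbf{x}_i</math> (each set of single observations on each of the ''K'' random variables) is assigned a weight <math>w_i \geq 0</math>. Without loss of generality, assume that the weights are normalized:
: <math>\sum_{i=1}^{N} w_i = 1.</math>
(If they are not, divide the weights by their sum.)
Then the weighted mean vector <math>\mathbf{\bar{x}}</math> is given by
: <math>\mathbf{\bar{x}} = \sum_{i=1}^{N} w_i \mathbf{x}_i</math>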
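and the elements <math>q_{jk}</math> of the weighted covariance matrix <math>\mathbf{Q}</math> are<ref>Mark Galassi, Jim Davies, James Theiler, Brian Gough, Gerard Jungman, Michael Booth, and Fabrice Rossi. GNU Scientific Library - Reference manual, Version 2.6, 2021.</ref>
: <math>q_{jk} = \frac{1}{1-\sum_{i=1}^{N} w_i^2}\sum_{i=1}^{N} w_i\left(x_{ij}-\bar{x}_j\right)\left(x_{ik}-\bar{x}_k\right).</math>
If all weights are the same, <math>w_i = 1/N</math>, the weighted mean and covariance reduce to the sample mean and the sample covariance defined above, since the leading factor then becomes <math>N/(N-1)</math>.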
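A weighted mean vector and weighted covariance matrix can be sketched as below. This is an illustration under an explicit assumption: the covariance is scaled by 1/(1 − Σ ''w''<sub>''i''</sub><sup>2</sup>) for normalized weights, which is one common convention and may differ from other definitions. The function name `weighted_mean_and_covariance` and the data are hypothetical.

```python
def weighted_mean_and_covariance(observations, weights):
    """Weighted mean vector and covariance matrix for N observations of length K.

    Assumes the scaling 1 / (1 - sum(w_i ** 2)) for normalized weights.
    """
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize so the weights sum to 1
    sum_sq = sum(wi * wi for wi in w)
    if sum_sq >= 1.0:
        raise ValueError("at least two observations need non-zero weight")
    k = len(observations[0])
    x_bar = [sum(wi * obs[j] for wi, obs in zip(w, observations)) for j in range(k)]
    scale = 1.0 / (1.0 - sum_sq)
    q = [
        [
            scale * sum(
                wi * (obs[j] - x_bar[j]) * (obs[m] - x_bar[m])
                for wi, obs in zip(w, observations)
            )
            for m in range(k)
        ]
        for j in range(k)
    ]
    return x_bar, q

# Hypothetical data: N = 4 observations on K = 3 variables, equal weights.
observations = [
    (1.0, 2.0, 10.0),
    (2.0, 4.0, 8.0),
    (3.0, 6.0, 6.0),
    (4.0, 8.0, 4.0),
]
x_bar, Q = weighted_mean_and_covariance(observations, [1, 1, 1, 1])
print(x_bar)
print(Q[0][0])
```

With equal weights the scaling factor is ''N''/(''N'' − 1), so under this convention the result matches the unweighted sample covariance with ''N'' − 1 in the denominator.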
Criticism
The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location<ref>Bart Kosko, "The Sample Mean", The World Question Center 2006.</ref> and interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorising, as in the trimmed mean and the Winsorized mean.
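The sensitivity to outliers is easy to see on a toy data set. The sketch below uses Python's standard `statistics` module. The numbers are made up for illustration.

```python
from statistics import mean, median

clean = [2, 3, 3, 4, 5]
# Replace the largest value with a gross outlier.
contaminated = clean[:-1] + [500]

print(mean(clean), median(clean))                # 3.4 3
print(mean(contaminated), median(contaminated))  # 102.4 3
```

A single corrupted value moves the sample mean far from the bulk of the data, while the sample median is unchanged.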
See also
* Estimation of covariance matrices
* Scatter matrix
* Unbiased estimation of standard deviation
References