HOME

TheInfoList



OR:

In
statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
, robust measures of scale are methods that quantify the
statistical dispersion In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a Probability distribution, distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard de ...
in a
sample Sample or samples may refer to: Base meaning * Sample (statistics), a subset of a population – complete data set * Sample (signal), a digital discrete sample of a continuous analog signal * Sample (material), a specimen or small quantity of s ...
of numerical
data In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted ...
while resisting
outliers In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...
. The most common such
robust statistics Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, suc ...
are the ''
interquartile range In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference ...
'' (IQR) and the ''
median absolute deviation In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample. For a un ...
'' (MAD). These are contrasted with conventional or non-robust measures of scale, such as sample
variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...
or
standard deviation In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while ...
, which are greatly influenced by outliers. These robust statistics are particularly used as
estimator In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the ...
s of a
scale parameter In probability theory and statistics, a scale parameter is a special kind of numerical parameter of a parametric family of probability distributions. The larger the scale parameter, the more spread out the distribution. Definition If a family o ...
, and have the advantages of both robustness and superior
efficiency Efficiency is the often measurable ability to avoid wasting materials, energy, efforts, money, and time in doing something or in producing a desired result. In a more general sense, it is the ability to do things well, successfully, and without ...
on contaminated data, at the cost of inferior efficiency on clean data from distributions such as the normal distribution. To illustrate robustness, the standard deviation can be made arbitrarily large by increasing exactly one observation (it has a
breakdown point Robust statistics are statistics with good performance for data drawn from a wide range of probability distributions, especially for distributions that are not normal. Robust statistical methods have been developed for many common problems, such ...
of 0, as it can be contaminated by a single point), a defect that is not shared by robust statistics.


IQR and MAD

One of the most common robust measures of scale is the ''
interquartile range In descriptive statistics, the interquartile range (IQR) is a measure of statistical dispersion, which is the spread of the data. The IQR may also be called the midspread, middle 50%, fourth spread, or H‑spread. It is defined as the difference ...
'' (IQR), the difference between the 75th
percentile In statistics, a ''k''-th percentile (percentile score or centile) is a score ''below which'' a given percentage ''k'' of scores in its frequency distribution falls (exclusive definition) or a score ''at or below which'' a given percentage falls ...
and the 25th
percentile In statistics, a ''k''-th percentile (percentile score or centile) is a score ''below which'' a given percentage ''k'' of scores in its frequency distribution falls (exclusive definition) or a score ''at or below which'' a given percentage falls ...
of a sample; this is the 25%
trimmed ''Trimmed'' is a 1922 American silent Western film directed by Harry A. Pollard and featuring Hoot Gibson. It is not known whether the film currently survives, and it may be a lost film. Cast * Hoot Gibson as Dale Garland * Patsy Ruth Miller ...
range Range may refer to: Geography * Range (geographic), a chain of hills or mountains; a somewhat linear, complex mountainous or hilly area (cordillera, sierra) ** Mountain range, a group of mountains bordered by lowlands * Range, a term used to i ...
, an example of an
L-estimator In statistics, an L-estimator is an estimator which is a linear combination of order statistics of the measurements (which is also called an L-statistic). This can be as little as a single point, as in the median (of an odd number of values), or as ...
. Other trimmed ranges, such as the
interdecile range In statistics, the interdecile range is the difference between the first and the ninth deciles (10% and 90%). The interdecile range is a measure of statistical dispersion of the values in a set of data, similar to the range (statistics), range and ...
(10% trimmed range) can also be used. For a Gaussian distribution, IQR is related to \sigma as \sigma \approx 0.7413 \operatorname = \operatorname/1.349. Another familiar robust measure of scale is the ''
median absolute deviation In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample. For a un ...
'' (MAD), the
median In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic fe ...
of the absolute values of the differences between the data values and the overall median of the data set; for a Gaussian distribution, MAD is related to \sigma as \sigma \approx 1.4826\ \operatorname (the derivation can be found
here Here is an adverb that means "in, on, or at this place". It may also refer to: Software * Here Technologies, a mapping company * Here WeGo (formerly Here Maps), a mobile app and map website by Here Television * Here TV (formerly "here!"), a TV ...
).


Estimation

Robust measures of scale can be used as
estimator In statistics, an estimator is a rule for calculating an estimate of a given quantity based on observed data: thus the rule (the estimator), the quantity of interest (the estimand) and its result (the estimate) are distinguished. For example, the ...
s of properties of the population, either for
parameter estimation Estimation theory is a branch of statistics that deals with estimating the values of parameters based on measured empirical data that has a random component. The parameters describe an underlying physical setting in such a way that their value ...
or as estimators of their own
expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a l ...
. For example, robust estimators of scale are used to estimate the
population variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of number ...
or population
standard deviation In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while ...
, generally by multiplying by a
scale factor In affine geometry, uniform scaling (or isotropic scaling) is a linear transformation that enlarges (increases) or shrinks (diminishes) objects by a '' scale factor'' that is the same in all directions. The result of uniform scaling is similar ...
to make it an
unbiased Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, ...
consistent estimator In statistics, a consistent estimator or asymptotically consistent estimator is an estimator—a rule for computing estimates of a parameter ''θ''0—having the property that as the number of data points used increases indefinitely, the result ...
; see scale parameter: estimation. For example, dividing the IQR by 2 erf−1(1/2) (approximately 1.349), makes it an unbiased, consistent estimator for the population standard deviation if the data follow a
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
. In other situations, it makes more sense to think of a robust measure of scale as an estimator of its own
expected value In probability theory, the expected value (also called expectation, expectancy, mathematical expectation, mean, average, or first moment) is a generalization of the weighted average. Informally, the expected value is the arithmetic mean of a l ...
, interpreted as an alternative to the population variance or standard deviation as a measure of scale. For example, the MAD of a sample from a standard
Cauchy distribution The Cauchy distribution, named after Augustin Cauchy, is a continuous probability distribution. It is also known, especially among physicists, as the Lorentz distribution (after Hendrik Lorentz), Cauchy–Lorentz distribution, Lorentz(ian) fun ...
is an estimator of the population MAD, which in this case is 1, whereas the population variance does not exist.


Efficiency

These robust estimators typically have inferior
statistical efficiency In statistics, efficiency is a measure of quality of an estimator, of an experimental design, or of a hypothesis testing procedure. Essentially, a more efficient estimator, needs fewer input data or observations than a less efficient one to achie ...
compared to conventional estimators for data drawn from a distribution without outliers (such as a normal distribution), but have superior efficiency for data drawn from a
mixture distribution In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection a ...
or from a
heavy-tailed distribution In probability theory, heavy-tailed distributions are probability distributions whose tails are not exponentially bounded: that is, they have heavier tails than the exponential distribution. In many applications it is the right tail of the distrib ...
, for which non-robust measures such as the standard deviation should not be used. For example, for data drawn from the normal distribution, the MAD is 37% as efficient as the sample standard deviation, while the Rousseeuw–Croux estimator ''Q''''n'' is 88% as efficient as the sample standard deviation.


Absolute pairwise differences

Rousseeuw and Croux propose alternatives to the MAD, motivated by two weaknesses of it: # It is
inefficient Efficiency is the often measurable ability to avoid wasting materials, energy, efforts, money, and time in doing something or in producing a desired result. In a more general sense, it is the ability to do things well, successfully, and without ...
(37% efficiency) at Gaussian distributions. # it computes a symmetric statistic about a location estimate, thus not dealing with
skewness In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined. For a unimodal d ...
. They propose two alternative statistics based on pairwise differences: ''Sn'' and ''Qn'', defined as: : \begin S_n &:= 1.1926 \, \operatorname_i \left( \operatorname_j (\,\left, x_i - x_j \\,) \right) ,\\ Q_n & := c_n \text \left( \left, x_i - x_j \ : i < j \right), \end where c_n is a constant depending on n. These can be computed in ''O''(''n'' log ''n'') time and ''O''(''n'') space. Neither of these requires
location In geography, location or place are used to denote a region (point, line, or area) on Earth's surface or elsewhere. The term ''location'' generally implies a higher degree of certainty than ''place'', the latter often indicating an entity with an ...
estimation, as they are based only on differences between values. They are both more efficient than the MAD under a Gaussian distribution: ''Sn'' is 58% efficient, while ''Qn'' is 82% efficient. For a sample from a normal distribution, ''S''n is approximately unbiased for the population standard deviation even down to very modest sample sizes (<1% bias for ''n'' = 10). For a large sample from a normal distribution, 2.219144465985075864722''Q''n is approximately unbiased for the population standard deviation. For small or moderate samples, the expected value of ''Q''n under a normal distribution depends markedly on the sample size, so finite-sample correction factors (obtained from a table or from simulations) are used to calibrate the scale of ''Q''n.


The biweight midvariance

Like ''S''n and ''Q''n, the biweight midvariance aims to be robust without sacrificing too much efficiency. It is defined as : \frac , where ''I'' is the
indicator function In mathematics, an indicator function or a characteristic function of a subset of a set is a function that maps elements of the subset to one, and all other elements to zero. That is, if is a subset of some set , one has \mathbf_(x)=1 if x\i ...
, ''Q'' is the sample median of the ''X''i, and : u_i = \frac. Its square root is a robust estimator of scale, since data points are downweighted as their distance from the median increases, with points more than 9 MAD units from the median having no influence at all.


Extensions

propose a robust depth-based estimator for location and scale simultaneously. They propose a new measure named the Student median.


Confidence intervals

{{cleanup merge, 21=section, Robust confidence intervals A robust confidence interval is a
robust Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
modification of
confidence interval In frequentist statistics, a confidence interval (CI) is a range of estimates for an unknown parameter. A confidence interval is computed at a designated ''confidence level''; the 95% confidence level is most common, but other levels, such as 9 ...
s, meaning that one modifies the non-robust calculations of the confidence interval so that they are not badly affected by outlying or aberrant observations in a data-set.


Example

In the process of weighing 1000 objects, under practical conditions, it is easy to believe that the operator might make a mistake in procedure and so report an incorrect mass (thereby making one type of
systematic error Observational error (or measurement error) is the difference between a measured value of a quantity and its true value.Dodge, Y. (2003) ''The Oxford Dictionary of Statistical Terms'', OUP. In statistics, an error is not necessarily a " mistak ...
). Suppose there were 100 objects and the operator weighed them all, one at a time, and repeated the whole process ten times. Then the operator can calculate a sample
standard deviation In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while ...
for each object, and look for
outlier In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are ...
s. Any object with an unusually large standard deviation probably has an outlier in its data. These can be removed by various non-parametric techniques. If the operator repeated the process only three times, simply taking the
median In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic fe ...
of the three measurements and using σ would give a confidence interval. The 200 extra weighings served only to detect and correct for operator error and did nothing to improve the confidence interval. With more repetitions, one could use a
truncated mean A truncated mean or trimmed mean is a statistical measure of central tendency, much like the mean and median. It involves the calculation of the mean after discarding given parts of a probability distribution or sample at the high and low end, an ...
, discarding the largest and smallest values and averaging the rest. A bootstrap calculation could be used to determine a confidence interval narrower than that calculated from σ, and so obtain some benefit from a large amount of extra work. These procedures are
robust Robustness is the property of being strong and healthy in constitution. When it is transposed into a system, it refers to the ability of tolerating perturbations that might affect the system’s functional body. In the same line ''robustness'' ca ...
against procedural errors which are not modeled by the assumption that the balance has a fixed known standard deviation σ. In practical applications where the occasional operator error can occur, or the balance can malfunction, the assumptions behind simple statistical calculations cannot be taken for granted. Before trusting the results of 100 objects weighed just three times each to have confidence intervals calculated from σ, it is necessary to test for and remove a reasonable number of outliers (testing the assumption that the operator is careful and correcting for the fact that he is not perfect), and to test the assumption that the data really have a
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
with standard deviation σ.


Computer simulation

The theoretical analysis of such an experiment is complicated, but it is easy to set up a
spreadsheet A spreadsheet is a computer application for computation, organization, analysis and storage of data in tabular form. Spreadsheets were developed as computerized analogs of paper accounting worksheets. The program operates on data entered in cel ...
which draws random numbers from a normal distribution with standard deviation σ to simulate the situation; this can be done in
Microsoft Excel Microsoft Excel is a spreadsheet developed by Microsoft for Microsoft Windows, Windows, macOS, Android (operating system), Android and iOS. It features calculation or computation capabilities, graphing tools, pivot tables, and a macro (comp ...
using =NORMINV(RAND(),0,σ)), as discussed in Wittwer, J.W.
"Monte Carlo Simulation in Excel: A Practical Guide"
June 1, 2004
and the same techniques can be used in other spreadsheet programs such as in
OpenOffice.org Calc OpenOffice.org (OOo), commonly known as OpenOffice, is a discontinued open-source office suite. Active successor projects include LibreOffice (the most actively developed), Apache OpenOffice, Collabora Online (enterprise ready LibreOffice) a ...
and
gnumeric Gnumeric is a spreadsheet program that is part of the GNOME Free Software Desktop Project. Gnumeric version 1.0 was released on 31 December 2001. Gnumeric is distributed as free software under the GNU General Public License; it is intended to r ...
. After removing obvious outliers, one could subtract the median from the other two values for each object, and examine the distribution of the 200 resulting numbers. It should be normal with mean near zero and standard deviation a little larger than σ. A simple
Monte Carlo Monte Carlo (; ; french: Monte-Carlo , or colloquially ''Monte-Carl'' ; lij, Munte Carlu ; ) is officially an administrative area of the Principality of Monaco, specifically the ward of Monte Carlo/Spélugues, where the Monte Carlo Casino is ...
spreadsheet calculation would reveal typical values for the standard deviation (around 105 to 115% of σ). Or, one could subtract the mean of each triplet from the values, and examine the distribution of 300 values. The mean is identically zero, but the standard deviation should be somewhat smaller (around 75 to 85% of σ).


See also

*
Heteroscedasticity-consistent standard errors The topic of heteroskedasticity-consistent (HC) standard errors arises in statistics and econometrics in the context of linear regression and time series analysis. These are also known as heteroskedasticity-robust standard errors (or simply robust ...


References

Robust statistics Statistical deviation and dispersion Scale statistics