M-estimator

In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average. Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. Forty-eight robust M-estimators are surveyed in a recent review study.

More generally, an M-estimator may be defined to be a zero of an estimating function. This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is the point where the derivative of the likelihood function with respect to the parameter is zero; thus, a maximum-likelihood estimator is a critical point of the score function. In many applications, such M-estimators can be thought of as estimating characteristics of the population.


Historical motivation

The method of least squares is a prototypical M-estimator, since the estimator is defined as a minimum of the sum of squares of the residuals.

Another popular M-estimator is maximum-likelihood estimation. For a family of probability density functions ''f'' parameterized by ''θ'', a maximum likelihood estimator of ''θ'' is computed for each set of data by maximizing the likelihood function over the parameter space. When the observations are independent and identically distributed, an ML-estimate \hat{\theta} satisfies

:\widehat{\theta} = \arg\max_{\theta}\left(\prod_{i=1}^n f(x_i, \theta)\right)

or, equivalently,

:\widehat{\theta} = \arg\min_{\theta}\left(\sum_{i=1}^n -\log f(x_i, \theta)\right).

Maximum-likelihood estimators have optimal properties in the limit of infinitely many observations under rather general conditions, but may be biased and not the most efficient estimators for finite samples.
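To make the equivalence above concrete, here is a minimal sketch (in Python, with illustrative names, assuming a normal location model with known scale) that computes an ML-estimate by minimizing the summed negative log-density:

```python
# Minimal sketch: the MLE of a normal location model, obtained by minimizing
# the sum of -log f(x_i, theta). The model choice and names are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(theta, x, sigma=1.0):
    # sum of -log f(x_i, theta) for the N(theta, sigma^2) density
    return np.sum(0.5 * ((x - theta) / sigma) ** 2
                  + np.log(sigma * np.sqrt(2.0 * np.pi)))

x = np.array([1.1, 0.8, 1.4, 0.9, 1.2])
theta_hat = minimize_scalar(lambda t: neg_log_likelihood(t, x)).x
print(theta_hat)  # agrees with x.mean(), the closed-form MLE for this model
```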


Definition

In 1964, Peter J. Huber proposed generalizing maximum likelihood estimation to the minimization of

:\sum_{i=1}^n\rho(x_i, \theta),

where ρ is a function with certain properties (see below). The solutions

:\hat{\theta} = \arg\min_{\theta}\left(\sum_{i=1}^n\rho(x_i, \theta)\right)

are called M-estimators ("M" for "maximum likelihood-type" (Huber, 1981, page 43)); other types of robust estimators include L-estimators, R-estimators and S-estimators. Maximum likelihood estimators (MLE) are thus a special case of M-estimators. With suitable rescaling, M-estimators are special cases of extremum estimators (in which more general functions of the observations can be used).

The function ρ, or its derivative, ψ, can be chosen in such a way as to provide the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, ''close'' to the assumed distribution.
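As a hedged illustration of this minimization, the sketch below computes a location estimate for one common choice of ρ, the Huber function; the tuning constant k = 1.345 is a conventional value assumed here for illustration, and the names are not from any particular library:

```python
# Sketch of an M-estimate of location with the Huber rho function: quadratic
# near zero, linear in the tails, so gross outliers have bounded effect.
import numpy as np
from scipy.optimize import minimize_scalar

def huber_rho(r, k=1.345):
    # rho(r) = r^2/2 for |r| <= k, and k*|r| - k^2/2 otherwise
    return np.where(np.abs(r) <= k, 0.5 * r ** 2, k * np.abs(r) - 0.5 * k ** 2)

x = np.array([0.9, 1.1, 1.0, 1.2, 8.0])  # one gross outlier at 8.0
theta_hat = minimize_scalar(lambda t: np.sum(huber_rho(x - t))).x
print(theta_hat)  # near 1.05, far less distorted than the mean (2.44)
```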


Types

M-estimators are solutions, ''θ'', which minimize

:\sum_{i=1}^n\rho(x_i,\theta).

This minimization can always be done directly. Often it is simpler to differentiate with respect to ''θ'' and solve for the root of the derivative. When this differentiation is possible, the M-estimator is said to be of ψ-type. Otherwise, the M-estimator is said to be of ρ-type. In most practical cases, the M-estimators are of ψ-type.


ρ-type

For positive integer ''r'', let (\mathcal{X},\Sigma) and (\Theta\subset\mathbb{R}^r,S) be measure spaces. \theta\in\Theta is a vector of parameters. An M-estimator of ρ-type ''T'' is defined through a measurable function \rho:\mathcal{X}\times\Theta\rightarrow\mathbb{R}. It maps a probability distribution ''F'' on \mathcal{X} to the value T(F)\in\Theta (if it exists) that minimizes \int_{\mathcal{X}}\rho(x,\theta)\,dF(x):

:T(F):=\arg\min_{\theta\in\Theta}\int_{\mathcal{X}}\rho(x,\theta)\,dF(x)

For example, for the maximum likelihood estimator, \rho(x,\theta)=-\log(f(x,\theta)), where f(x,\theta)=\frac{\partial F(x,\theta)}{\partial x}.


ψ-type

If \rho is differentiable with respect to \theta, the computation of \widehat{\theta} is usually much easier. An M-estimator of ψ-type ''T'' is defined through a measurable function \psi:\mathcal{X}\times\Theta\rightarrow\mathbb{R}^r. It maps a probability distribution ''F'' on \mathcal{X} to the value T(F)\in\Theta (if it exists) that solves the vector equation

:\int_{\mathcal{X}}\psi(x,T(F)) \, dF(x)=0.

For example, for the maximum likelihood estimator, \psi(x,\theta)=\left(\frac{\partial\log f(x,\theta)}{\partial\theta_1},\dots,\frac{\partial\log f(x,\theta)}{\partial\theta_r}\right)^\mathrm{T}, where u^\mathrm{T} denotes the transpose of vector ''u'' and f(x,\theta)=\frac{\partial F(x,\theta)}{\partial x}.

Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to \theta, then a necessary condition for an M-estimator of ψ-type to be an M-estimator of ρ-type is \psi(x,\theta)=\nabla_\theta\rho(x,\theta). The previous definitions can easily be extended to finite samples.

If the function ψ decreases to zero as x \rightarrow \pm \infty, the estimator is called redescending. Such estimators have some additional desirable properties, such as complete rejection of gross outliers.
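A small sketch of a redescending ψ, here Tukey's biweight with the conventional cutoff c = 4.685 (both the choice of function and the constant are assumptions made for illustration); observations beyond the cutoff receive ψ = 0 and are rejected entirely:

```python
# Sketch of a redescending psi function (Tukey's biweight): psi returns to
# exactly zero beyond the cutoff c, so gross outliers are fully rejected.
import numpy as np

def biweight_psi(r, c=4.685):
    # psi(r) = r * (1 - (r/c)^2)^2 for |r| <= c, else 0
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)

print(biweight_psi(np.array([0.5, 2.0, 10.0])))  # the 10.0 maps to 0.0
```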


Computation

For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algorithms, such as Newton–Raphson. However, in most cases an iteratively re-weighted least squares fitting algorithm can be performed; this is typically the preferred method.

For some choices of ψ, specifically, ''redescending'' functions, the solution may not be unique. The issue is particularly relevant in multivariate and regression problems. Thus, some care is needed to ensure that good starting points are chosen. Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common.
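The following is a minimal IRLS sketch for a Huber location estimate, using the median and MAD starting points mentioned above; it is a generic illustration under those assumptions, not any particular library's implementation:

```python
# Sketch of iteratively re-weighted least squares for a Huber location
# estimate, started from the median (location) and the MAD (scale).
import numpy as np

def huber_location_irls(x, k=1.345, n_iter=50, tol=1e-8):
    theta = np.median(x)                           # robust starting point
    scale = np.median(np.abs(x - theta)) / 0.6745  # MAD-based scale estimate
    for _ in range(n_iter):
        r = (x - theta) / scale
        # Huber weights w(r) = psi(r)/r: 1 near zero, k/|r| in the tails
        w = np.where(np.abs(r) <= k, 1.0, k / np.maximum(np.abs(r), 1e-12))
        theta_new = np.sum(w * x) / np.sum(w)      # weighted least squares step
        if abs(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta

x = np.array([0.9, 1.1, 1.0, 1.2, 8.0])
print(huber_location_irls(x))  # close to 1.05, largely ignoring the outlier
```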


Concentrating parameters

In computation of M-estimators, it is sometimes useful to rewrite the objective function so that the dimension of parameters is reduced. The procedure is called "concentrating" or "profiling". Examples in which concentrating parameters increases computation speed include seemingly unrelated regressions (SUR) models. Consider the following M-estimation problem:

:(\hat\beta_N,\hat\gamma_N):=\arg\max_{\beta,\gamma}\sum_{i=1}^N q(w_i,\beta,\gamma)

Assuming differentiability of the function ''q'', the M-estimator solves the first-order conditions:

:\sum_{i=1}^N \nabla_\beta \, q(w_i,\beta,\gamma) = 0

:\sum_{i=1}^N \nabla_\gamma \, q(w_i,\beta,\gamma) = 0

Now, if we can solve the second equation for γ in terms of W:=(w_1,w_2,\dots,w_N) and \beta, say \gamma = g(W,\beta), then the first equation becomes:

:\sum_{i=1}^N \nabla_\beta \, q(w_i,\beta,g(W,\beta)) = 0

We can thus rewrite the original objective function solely in terms of β by inserting the function g in place of \gamma. As a result, there is a reduction in the number of parameters. Whether this procedure can be done depends on the particular problem at hand. However, when it is possible, concentrating parameters can facilitate computation to a great degree. For example, in estimating a SUR model of 6 equations with 5 explanatory variables in each equation by maximum likelihood, the number of parameters declines from 51 to 30.

Despite its appealing feature in computation, concentrating parameters is of limited use in deriving asymptotic properties of the M-estimator. The presence of W in each summand of the objective function makes it difficult to apply the law of large numbers and the central limit theorem.
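A toy sketch of the procedure described above, under assumptions chosen purely for illustration: in a normal model with mean β and variance γ, the first-order condition for γ has the closed-form solution g(W, β) = mean((w_i − β)²), which can be substituted back to leave a one-dimensional problem in β:

```python
# Toy sketch of concentrating out a parameter: the variance gamma of a normal
# model is solved analytically as a function of beta, then substituted back.
import numpy as np
from scipy.optimize import minimize_scalar

w = np.array([1.0, 2.0, 1.5, 2.5, 2.0])

def concentrated_objective(beta):
    gamma = np.mean((w - beta) ** 2)  # g(W, beta): gamma concentrated out
    # profile negative log-likelihood, up to additive constants
    return 0.5 * w.size * np.log(gamma)

beta_hat = minimize_scalar(concentrated_objective,
                           bounds=(w.min(), w.max()), method='bounded').x
gamma_hat = np.mean((w - beta_hat) ** 2)  # recover gamma by back-substitution
print(beta_hat, gamma_hat)                # beta_hat equals w.mean()
```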


Properties


Distribution

It can be shown that M-estimators are asymptotically normally distributed. As such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used. However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or bootstrap distribution.
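One way to carry out the check suggested above is a simple bootstrap, sketched below for the median as the M-estimator; the resample count B = 2000 is an arbitrary illustrative choice:

```python
# Sketch: bootstrap the sampling distribution of an M-estimator (the median)
# to check the asymptotic-normality approximation on a finite sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
B = 2000
boot = np.array([np.median(rng.choice(x, size=x.size, replace=True))
                 for _ in range(B)])
print(np.percentile(boot, [2.5, 97.5]))  # compare with a Wald-type interval
```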


Influence function

The influence function of an M-estimator of \psi-type is proportional to its defining \psi function. Let ''T'' be an M-estimator of ψ-type, and ''G'' be a probability distribution for which T(G) is defined. Its influence function IF is

:\operatorname{IF}(x;T,G) = -\frac{\psi(x,T(G))}{\int\left[\frac{\partial\psi(y,\theta)}{\partial\theta}\right]_{\theta=T(G)} f(y)\,\mathrm{d}y}

assuming the density function f(y) exists. A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
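The proportionality can be checked numerically from the contamination definition IF(x) ≈ (T((1−ε)G + εδ_x) − T(G))/ε; the sketch below does so for the median, whose ψ(x) = sgn(x − T(G)) is bounded. The sample size and ε are arbitrary illustrative choices, and the result is a finite-sample approximation, not the exact functional derivative:

```python
# Sketch: approximate the influence function of the median empirically by
# adding a small mass of contamination at x and rescaling the shift by eps.
import numpy as np

rng = np.random.default_rng(1)
sample = rng.standard_normal(100_001)
t = np.median(sample)
eps = 0.01
n_extra = int(eps * sample.size / (1.0 - eps))  # point mass of weight eps
for x in (-3.0, -0.5, 0.5, 3.0):
    contaminated = np.concatenate([sample, np.full(n_extra, x)])
    shift = (np.median(contaminated) - t) / eps
    print(x, shift)  # roughly sgn(x) / (2 f(0)) = ±1.25 for a standard normal
```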


Applications

M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.


Examples


Mean

Let (''X''1, ..., ''X''''n'') be a set of independent, identically distributed random variables, with distribution ''F''. If we define

:\rho(x, \theta)=\frac{(x - \theta)^2}{2},

we note that this is minimized when ''θ'' is the mean of the ''X''s. Thus the mean is an M-estimator of ρ-type, with this ρ function. As this ρ function is continuously differentiable in ''θ'', the mean is thus also an M-estimator of ψ-type for ψ(''x'', ''θ'') = ''θ'' − ''x''.
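A brief numeric check of this example, with names chosen for illustration: minimizing the ρ sum directly and solving the summed ψ for its root both recover the sample mean:

```python
# Sketch: the mean as an M-estimator, computed both ways (rho-type via direct
# minimization, psi-type via root-finding) on the same data.
import numpy as np
from scipy.optimize import brentq, minimize_scalar

x = np.array([2.0, 3.0, 5.0, 10.0])

theta_rho = minimize_scalar(lambda t: np.sum(0.5 * (x - t) ** 2)).x
theta_psi = brentq(lambda t: np.sum(t - x), x.min(), x.max())  # root of sum psi
print(theta_rho, theta_psi, x.mean())  # all three equal 5.0
```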


Median

For the median estimation of (''X''1, ..., ''X''''n''), instead we can define the ρ function as

:\rho(x, \theta)=|x - \theta|

and similarly, the ρ function is minimized when ''θ'' is the median of the ''X''s. While this ρ function is not differentiable in ''θ'', the ψ-type M-estimator, which is the subgradient of the ρ function, can be expressed as

:\psi(x, \theta)=\sgn(x - \theta)

or, written as a subdifferential,

:\psi(x, \theta) = \begin{cases} -1, & \mbox{if } x - \theta < 0 \\ 1, & \mbox{if } x - \theta > 0 \\ \left[-1,1\right], & \mbox{if } x - \theta = 0 \end{cases}
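A matching sketch for this example: a grid search over ''θ'' (used here only to keep the illustration dependency-light) confirms that minimizing the absolute-deviation ρ recovers the sample median:

```python
# Sketch: minimizing sum |x_i - theta| over a fine grid of theta values
# recovers the sample median.
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0])
grid = np.linspace(x.min(), x.max(), 10_001)
objective = np.abs(x[:, None] - grid[None, :]).sum(axis=0)
print(grid[np.argmin(objective)], np.median(x))  # both give 3.0
```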


See also

* Two-step M-estimators
* Robust statistics
* Robust regression
* Redescending M-estimator
* S-estimator
* Fréchet mean


References

* Huber, P. J. (1981). ''Robust Statistics''. New York: John Wiley & Sons.




External links


M-estimators
— an introduction to the subject by Zhengyou Zhang