James–Stein Estimator

The James–Stein estimator is a biased estimator of the mean, \boldsymbol\theta, of (possibly) correlated Gaussian distributed random vectors Y = \{Y_1, Y_2, \ldots, Y_m\} with unknown means \{\theta_1, \theta_2, \ldots, \theta_m\}.

It arose sequentially in two main published papers. The earlier version of the estimator was developed by Charles Stein in 1956 and reached the relatively shocking conclusion that while the then-usual estimate of the mean, the sample mean written by Stein and James as \widehat{\theta}(Y_i) = Y_i, is admissible when m \leq 2, it is inadmissible when m \geq 3. Stein proposed a possible improvement: an estimator that shrinks the sample means towards a more central mean vector \boldsymbol\nu (which can be chosen a priori, or commonly taken as the "average of averages" of the sample means, given that all samples share the same size). This result is commonly referred to as Stein's example or Stein's paradox. The earlier result was improved later by Willard James and Charles Stein in 1961, who simplified the original process.

It can be shown that the James–Stein estimator dominates the "ordinary" least squares approach, meaning the James–Stein estimator has a lower or equal mean squared error than the "ordinary" least squares estimator.


Setting

Let \mathbf{y} \sim N_m(\boldsymbol\theta, \sigma^2 I), where the vector \boldsymbol\theta is the unknown mean of \mathbf{y}, which is m-variate normally distributed with known covariance matrix \sigma^2 I.

We are interested in obtaining an estimate, \widehat{\boldsymbol\theta}, of \boldsymbol\theta, based on a single observation of \mathbf{y}. In real-world applications, this is a common situation in which a set of parameters is sampled, and the samples are corrupted by independent Gaussian noise. Since this noise has mean zero, it may be reasonable to use the samples themselves as an estimate of the parameters. This approach is the least squares estimator, which is \widehat{\boldsymbol\theta}_{LS} = \mathbf{y}.

Stein demonstrated that in terms of mean squared error \operatorname{E}\left[ \left\| \boldsymbol\theta - \widehat{\boldsymbol\theta} \right\|^2 \right], the least squares estimator, \widehat{\boldsymbol\theta}_{LS}, is sub-optimal to shrinkage-based estimators such as the James–Stein estimator, \widehat{\boldsymbol\theta}_{JS}. The paradoxical result, that there is a (possibly) better and never any worse estimate of \boldsymbol\theta in mean squared error as compared to the sample mean, became known as Stein's example.
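As an illustration of this setting, the following minimal simulation (not from the original article; the choices m = 10, \sigma = 1, and the randomly drawn \boldsymbol\theta are assumptions made for the example) checks that the least squares estimator attains a total mean squared error of about m\sigma^2:

 import numpy as np
 
 rng = np.random.default_rng(0)
 m, sigma = 10, 1.0
 theta = rng.normal(size=m)            # the unknown mean vector, held fixed for the experiment
 
 n_trials = 100_000
 # Each trial draws one observation y ~ N(theta, sigma^2 I); the least squares
 # estimate is the observation itself.
 y = theta + sigma * rng.normal(size=(n_trials, m))
 mse_ls = np.mean(np.sum((y - theta) ** 2, axis=1))
 print(mse_ls)                         # close to m * sigma^2 = 10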


The James–Stein estimator

If \sigma^2 is known, the James–Stein estimator is given by

: \widehat{\boldsymbol\theta}_{JS} = \left( 1 - \frac{(m-2)\sigma^2}{\|\mathbf{y}\|^2} \right) \mathbf{y}.

James and Stein showed that the above estimator dominates \widehat{\boldsymbol\theta}_{LS} for any m \ge 3, meaning that the James–Stein estimator always achieves lower mean squared error (MSE) than the maximum likelihood estimator. By definition, this makes the least squares estimator inadmissible when m \ge 3.

Notice that if (m-2)\sigma^2 < \|\mathbf{y}\|^2, then this estimator simply takes the natural estimator \mathbf{y} and shrinks it towards the origin 0. In fact, this is not the only direction of shrinkage that works. Let \boldsymbol\nu be an arbitrary fixed vector of dimension m. Then there exists an estimator of the James–Stein type that shrinks toward \boldsymbol\nu, namely

: \widehat{\boldsymbol\theta}_{JS} = \left( 1 - \frac{(m-2)\sigma^2}{\|\mathbf{y} - \boldsymbol\nu\|^2} \right) (\mathbf{y} - \boldsymbol\nu) + \boldsymbol\nu, \qquad m \ge 3.

The James–Stein estimator dominates the usual estimator for any \boldsymbol\nu. A natural question to ask is whether the improvement over the usual estimator is independent of the choice of \boldsymbol\nu. The answer is no. The improvement is small if \|\boldsymbol\theta - \boldsymbol\nu\| is large. Thus, to get a very great improvement, some knowledge of the location of \boldsymbol\theta is necessary. Of course, this is the quantity we are trying to estimate, so we do not have this knowledge a priori. But we may have some guess as to what the mean vector is. This can be considered a disadvantage of the estimator: the choice is not objective, as it may depend on the beliefs of the researcher. Nonetheless, James and Stein's result is that ''any'' finite guess \boldsymbol\nu improves the expected MSE over the maximum-likelihood estimator, which is tantamount to using an infinite \boldsymbol\nu, surely a poor guess.
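The following sketch implements the formula above and compares its empirical risk with that of the least squares estimator. The helper name james_stein, the seed, and the choices m = 10, \sigma^2 = 1 and \boldsymbol\nu = 0 are illustrative assumptions, not part of the original derivation:

 import numpy as np
 
 def james_stein(y, sigma2, nu=0.0):
     """James-Stein estimate of theta from one observation y ~ N(theta, sigma2*I),
     shrinking toward the fixed vector (or scalar) nu."""
     y = np.asarray(y, dtype=float)
     m = y.size
     resid = y - nu
     factor = 1.0 - (m - 2) * sigma2 / np.sum(resid ** 2)
     return nu + factor * resid
 
 rng = np.random.default_rng(1)
 m, sigma2, n_trials = 10, 1.0, 20_000
 theta = rng.normal(size=m)
 
 ys = theta + np.sqrt(sigma2) * rng.normal(size=(n_trials, m))
 js = np.array([james_stein(y, sigma2) for y in ys])
 mse_js = np.mean(np.sum((js - theta) ** 2, axis=1))
 mse_ls = np.mean(np.sum((ys - theta) ** 2, axis=1))
 print(mse_js, mse_ls)                 # empirically mse_js <= mse_ls (about m * sigma2)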


Interpretation

Seeing the James–Stein estimator as an empirical Bayes method gives some intuition to this result: one assumes that \boldsymbol\theta itself is a random variable with prior distribution \boldsymbol\theta \sim N(0, A), where A is estimated from the data itself. Estimating A only gives an advantage compared to the maximum-likelihood estimator when the dimension m is large enough; hence it does not work for m \leq 2. The James–Stein estimator is a member of a class of Bayesian estimators that dominate the maximum-likelihood estimator.

A consequence of the above discussion is the following counterintuitive result: when three or more unrelated parameters are measured, their total MSE can be reduced by using a combined estimator such as the James–Stein estimator; whereas when each parameter is estimated separately, the least squares (LS) estimator is admissible. A quirky example would be estimating the speed of light, tea consumption in Taiwan, and hog weight in Montana, all together. The James–Stein estimator always improves upon the ''total'' MSE, i.e., the sum of the expected squared errors of each component. Therefore, the total MSE in measuring light speed, tea consumption, and hog weight would improve by using the James–Stein estimator. However, any particular component (such as the speed of light) would improve for some parameter values and deteriorate for others. Thus, although the James–Stein estimator dominates the LS estimator when three or more parameters are estimated, any single component does not dominate the respective component of the LS estimator.

The conclusion from this hypothetical example is that measurements should be combined if one is interested in minimizing their total MSE. For example, in a telecommunication setting, it is reasonable to combine channel tap measurements in a channel estimation scenario, as the goal is to minimize the total channel estimation error. Conversely, there could be objections to combining channel estimates of different users, since no user would want their channel estimate to deteriorate in order to improve the average network performance.

The James–Stein estimator has also found use in fundamental quantum theory, where it has been used to improve the theoretical bounds of the entropic uncertainty principle for more than three measurements.

An intuitive derivation and interpretation is given by the Galtonian perspective. Under this interpretation, we aim to predict the population means using the imperfectly measured sample means. The equation of the OLS estimator in a hypothetical regression of the population means on the sample means gives an estimator of the form of either the James–Stein estimator (when we force the OLS intercept to equal 0) or of the Efron–Morris estimator (when we allow the intercept to vary).
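A brief sketch of this empirical Bayes argument may help (taking, as an illustrative assumption, the isotropic prior \boldsymbol\theta \sim N(0, A I) with scalar A and known \sigma^2). The posterior mean is

: \operatorname{E}[\boldsymbol\theta \mid \mathbf{y}] = \left( 1 - \frac{\sigma^2}{A + \sigma^2} \right) \mathbf{y},

while marginally \mathbf{y} \sim N(0, (A + \sigma^2) I), so that for m \ge 3

: \operatorname{E}\left[ \frac{(m-2)\sigma^2}{\|\mathbf{y}\|^2} \right] = \frac{\sigma^2}{A + \sigma^2}.

Replacing the unknown shrinkage factor \sigma^2/(A+\sigma^2) by the unbiased estimate (m-2)\sigma^2/\|\mathbf{y}\|^2 recovers the James–Stein estimator given above.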


Improvements

Despite the intuition that the James–Stein estimator shrinks the maximum-likelihood estimate ''toward'' \boldsymbol\nu, the estimate actually moves ''away'' from \boldsymbol\nu for small values of \|\mathbf{y} - \boldsymbol\nu\|, as the multiplier on \mathbf{y} - \boldsymbol\nu is then negative. This can be easily remedied by replacing this multiplier by zero when it is negative. The resulting estimator is called the ''positive-part James–Stein estimator'' and is given by

: \widehat{\boldsymbol\theta}_{JS+} = \left( 1 - \frac{(m-2)\sigma^2}{\|\mathbf{y} - \boldsymbol\nu\|^2} \right)^+ (\mathbf{y} - \boldsymbol\nu) + \boldsymbol\nu, \qquad m \ge 4.

This estimator has a smaller risk than the basic James–Stein estimator. It follows that the basic James–Stein estimator is itself inadmissible. It turns out, however, that the positive-part estimator is also inadmissible. This follows from a general result which requires admissible estimators to be smooth.
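A minimal sketch of the positive-part rule, following the same conventions as the hypothetical james_stein helper above (the function name and the scalar default \boldsymbol\nu = 0 are again assumptions of the example):

 import numpy as np
 
 def james_stein_positive_part(y, sigma2, nu=0.0):
     """Positive-part James-Stein: clamp the shrinkage multiplier at zero so the
     estimate never moves away from nu."""
     y = np.asarray(y, dtype=float)
     m = y.size
     resid = y - nu
     factor = max(0.0, 1.0 - (m - 2) * sigma2 / np.sum(resid ** 2))
     return nu + factor * resid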


Extensions

The James–Stein estimator may seem at first sight to be a result of some peculiarity of the problem setting. In fact, the estimator exemplifies a very wide-ranging effect; namely, the fact that the "ordinary" or least squares estimator is often inadmissible for simultaneous estimation of several parameters. This effect has been called Stein's phenomenon, and has been demonstrated for several different problem settings, some of which are briefly outlined below (a code sketch of the first two items follows the list).

* James and Stein demonstrated that the estimator presented above can still be used when the variance \sigma^2 is unknown, by replacing it with the standard estimator of the variance, \widehat{\sigma}^2 = \frac{1}{n}\sum ( y_i - \overline{y} )^2. The dominance result still holds under the same condition, namely, m > 2.

* The results in this article are for the case when only a single observation vector \mathbf{y} is available. For the more general case when n vectors are available, the results are similar:

:: \widehat{\boldsymbol\theta}_{JS} = \left( 1 - \frac{(m-2)\frac{\sigma^2}{n}}{\|\overline{\mathbf{y}}\|^2} \right) \overline{\mathbf{y}},

: where \overline{\mathbf{y}} is the m-length average of the n observations.

* The work of James and Stein has been extended to the case of a general measurement covariance matrix, i.e., where measurements may be statistically dependent and may have differing variances. A similar dominating estimator can be constructed, with a suitably generalized dominance condition. This can be used to construct a linear regression technique which outperforms the standard application of the LS estimator.

* Stein's result has been extended to a wide class of distributions and loss functions. However, this theory provides only an existence result, in that explicit dominating estimators were not actually exhibited. It is quite difficult to obtain explicit estimators improving upon the usual estimator without specific restrictions on the underlying distributions.
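As a rough illustration of the first two items above, the sketch below forms the m-length average of n observation vectors and plugs in a data-based variance estimate. The pooled per-component sample variance used here is an assumption of this sketch, not necessarily the exact variance estimator of James and Stein:

 import numpy as np
 
 def james_stein_from_samples(Y):
     """James-Stein-type estimate of the mean vector from an (n, m) array of
     observations, using the sample mean and an estimated variance."""
     Y = np.asarray(Y, dtype=float)
     n, m = Y.shape
     y_bar = Y.mean(axis=0)                     # m-length average of the n observations
     sigma2_hat = Y.var(axis=0, ddof=1).mean()  # pooled variance estimate (illustrative choice)
     factor = 1.0 - (m - 2) * (sigma2_hat / n) / np.sum(y_bar ** 2)
     return factor * y_bar
 
 # Example usage with illustrative n, m and theta:
 rng = np.random.default_rng(2)
 n, m = 25, 10
 theta = rng.normal(size=m)
 Y = theta + rng.normal(size=(n, m))
 print(james_stein_from_samples(Y))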


See also

* Admissible decision rule
* Hodges' estimator
* Shrinkage estimator

