In decision theory and estimation theory, Stein's example (also known as Stein's phenomenon or Stein's paradox) is the observation that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected mean squared error) than any method that handles the parameters separately. It is named after Charles Stein of Stanford University, who discovered the phenomenon in 1955.

An intuitive explanation is that optimizing for the mean squared error of a ''combined'' estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. If one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.


Formal statement

The following is the simplest form of the paradox, the special case in which the number of observations is equal to the number of parameters to be estimated. Let \boldsymbol\theta be a vector consisting of n \geq 3 unknown parameters. To estimate these parameters, a single measurement X_i is performed for each parameter \theta_i, resulting in a vector \mathbf{X} of length n. Suppose the measurements are known to be independent Gaussian random variables, with mean \boldsymbol\theta and variance 1, i.e., \mathbf{X} \sim \mathcal{N}(\boldsymbol\theta, \mathbf{I}_n). Thus, each parameter is estimated using a single noisy measurement, and each measurement is equally inaccurate.

Under these conditions, it is intuitive and common to use each measurement as an estimate of its corresponding parameter. This so-called "ordinary" decision rule can be written as \hat{\boldsymbol\theta} = \mathbf{X}, which is the maximum likelihood estimator (MLE). The quality of such an estimator is measured by its risk function. A commonly used risk function is the mean squared error, defined as \operatorname{E}\left[\|\boldsymbol\theta - \hat{\boldsymbol\theta}\|^2\right]. Surprisingly, it turns out that the "ordinary" decision rule is suboptimal (inadmissible) in terms of mean squared error when n \geq 3. In other words, in the setting discussed here, there exist alternative estimators which ''always'' achieve lower ''mean'' squared error, no matter what the value of \boldsymbol\theta is.

For a given \boldsymbol\theta one could obviously define a perfect "estimator" which is always just \boldsymbol\theta, but this estimator would be bad for other values of \boldsymbol\theta. The estimators of Stein's paradox are, for a given \boldsymbol\theta, better than the "ordinary" decision rule \mathbf{X} for some \mathbf{X} but necessarily worse for others; it is only on average that they are better. More precisely, an estimator \hat{\boldsymbol\theta}_1 is said to dominate another estimator \hat{\boldsymbol\theta}_2 if, for all values of \boldsymbol\theta, the risk of \hat{\boldsymbol\theta}_1 is lower than, or equal to, the risk of \hat{\boldsymbol\theta}_2, ''and'' if the inequality is strict for some \boldsymbol\theta. An estimator is said to be admissible if no other estimator dominates it; otherwise it is ''inadmissible''. Thus, Stein's example can be stated simply as follows: ''The "ordinary" decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk.''

Many simple, practical estimators achieve better performance than the "ordinary" decision rule. The best-known example is the James–Stein estimator, which shrinks \mathbf{X} towards a particular point (such as the origin) by an amount inversely proportional to the distance of \mathbf{X} from that point. For a sketch of the proof of this result, see the sketched proof below. An alternative proof is due to Larry Brown: he proved that the ordinary estimator for an n-dimensional multivariate normal mean vector is admissible if and only if the n-dimensional Brownian motion is recurrent. Since Brownian motion is not recurrent for n \geq 3, the MLE is not admissible for n \geq 3.
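The phenomenon is easy to reproduce numerically. The following Monte Carlo sketch is an illustration written to accompany this article rather than part of it; the particular \boldsymbol\theta, the sample size and the use of NumPy are choices of the sketch. For any fixed \boldsymbol\theta with n \geq 3, the average squared error of the James–Stein shrinkage rule should come out below that of the "ordinary" rule \mathbf{X}.

# Monte Carlo sketch: average squared error of the "ordinary" rule X versus the
# James-Stein rule, for X ~ N(theta, I_n) with n >= 3. Illustrative only.
import numpy as np

def james_stein(x):
    """Shrink x towards the origin by the factor 1 - (n - 2)/||x||^2."""
    n = x.shape[-1]
    shrink = 1.0 - (n - 2) / np.sum(x**2, axis=-1, keepdims=True)
    return shrink * x

rng = np.random.default_rng(0)
n, reps = 5, 200_000
theta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])    # any fixed theta works

x = theta + rng.standard_normal((reps, n))      # one noisy measurement per parameter
mse_ordinary = np.mean(np.sum((x - theta) ** 2, axis=1))          # close to n
mse_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))   # smaller on average
print(mse_ordinary, mse_js)

With the \boldsymbol\theta above, the ordinary risk should come out close to n = 5 while the James–Stein risk is noticeably lower; changing \boldsymbol\theta changes the size of the gap but, on average, not its direction.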


An intuitive explanation

For any particular value of \boldsymbol\theta the new estimator will improve at least one of the individual mean squared errors \operatorname{E}\left[(\theta_i - \hat{\theta}_i)^2\right]. This is not hard to achieve: for instance, if each component of \boldsymbol\theta is between −1 and 1, and \sigma = 1, then an estimator that shrinks \mathbf{X} towards 0 by 0.5 (i.e., \operatorname{sign}(X_i)\max(|X_i| - 0.5, 0), soft thresholding with threshold 0.5) will have a lower mean squared error than \mathbf{X} itself. But there are other values of \boldsymbol\theta for which this estimator is worse than \mathbf{X} itself. The trick of the Stein estimator, and others that yield the Stein paradox, is that they adjust the shift in such a way that there is always (for any \boldsymbol\theta vector) at least one X_i whose mean squared error is improved, and this improvement more than compensates for any degradation in mean squared error that might occur for another \hat{\theta}_i. The trouble is that, without knowing \boldsymbol\theta, you don't know which of the n mean squared errors are improved, so you can't use the Stein estimator only for those parameters. An example of this setting occurs in channel estimation in telecommunications, for instance, because different factors affect overall channel performance.
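The claim about soft thresholding can be checked numerically. This is a sketch written for this text; the handful of \theta values tested and the NumPy setup are its own assumptions.

# Monte Carlo check: with sigma = 1 and theta in [-1, 1], soft thresholding at 0.5
# has lower mean squared error than X itself. Illustrative sketch only.
import numpy as np

def soft_threshold(x, t=0.5):
    """Soft thresholding: move x towards 0 by t, clipping at 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(1)
reps = 500_000
for theta in (0.0, 0.5, 1.0):            # the least favourable case in [-1, 1] is |theta| = 1
    x = theta + rng.standard_normal(reps)
    mse_x = np.mean((x - theta) ** 2)                      # baseline, close to sigma^2 = 1
    mse_st = np.mean((soft_threshold(x) - theta) ** 2)     # stays below 1 on this range
    print(theta, round(mse_x, 3), round(mse_st, 3))

Within [−1, 1] the soft-thresholded rule should stay below the baseline of 1, with the smallest improvement at \theta = \pm 1; for \theta far from 0 the same rule does worse than \mathbf{X}, which is exactly the trade-off described above.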


Implications

Stein's example is surprising, since the "ordinary" decision rule is intuitive and commonly used. In fact, numerous methods for estimator construction, including maximum likelihood estimation, best linear unbiased estimation, least squares estimation and optimal equivariant estimation, all result in the "ordinary" estimator. Yet, as discussed above, this estimator is suboptimal.


Example

To demonstrate the unintuitive nature of Stein's example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we can get a better estimate (on average) for the vector of three parameters by simultaneously using the three unrelated measurements. At first sight it appears that somehow we get a better estimator for US wheat yield by measuring some other unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar. However, we have not obtained a better estimator for US wheat yield by itself, but we have produced an estimator for the vector of the means of all three random variables, which has a reduced ''total'' risk. This occurs because the cost of a bad estimate in one component of the vector is compensated by a better estimate in another component. Also, a specific set of the three estimated mean values obtained with the new estimator will not necessarily be better than the ordinary set (the measured values). It is only on average that the new estimator is better.
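A small simulation makes this concrete. It is a hypothetical sketch, not taken from the article: the three standardized means and the sample size are made-up choices, and the real quantities would first have to be rescaled to unit measurement variance. Applying the James–Stein rule jointly lowers the total mean squared error on average, even though the component with the large mean can come out slightly worse on its own.

# Sketch (hypothetical numbers): three unrelated quantities, each measured once with
# unit-variance Gaussian noise after standardization. The James-Stein rule lowers the
# *total* risk on average even though one component's own risk may get worse.
import numpy as np

def james_stein(x):
    """James-Stein shrinkage towards the origin for unit-variance measurements."""
    n = x.shape[-1]
    return (1.0 - (n - 2) / np.sum(x**2, axis=-1, keepdims=True)) * x

rng = np.random.default_rng(2)
theta = np.array([5.0, 0.3, -0.7])       # stand-ins for the three standardized means
reps = 300_000
x = theta + rng.standard_normal((reps, 3))

mse_mle = np.mean((x - theta) ** 2, axis=0)                 # per component, each close to 1
mse_js = np.mean((james_stein(x) - theta) ** 2, axis=0)     # the first entry can exceed 1
print(mse_mle, mse_js)
print(mse_mle.sum(), mse_js.sum())                          # but the total is smaller on average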


Sketched proof

The risk function of the decision rule d(\mathbf{X}) = \mathbf{X} is

:R(\theta, d) = \operatorname{E}_\theta\left[\|\boldsymbol\theta - \mathbf{X}\|^2\right]
:= \int (\boldsymbol\theta - \mathbf{x})^T(\boldsymbol\theta - \mathbf{x})\left(\frac{1}{2\pi}\right)^{n/2} e^{-\|\mathbf{x} - \boldsymbol\theta\|^2/2}\, d\mathbf{x}
:= n.

Now consider the decision rule

:d'(\mathbf{X}) = \mathbf{X} - \frac{\alpha}{\|\mathbf{X}\|^2}\mathbf{X},

where \alpha = n - 2. We will show that d' is a better decision rule than d. The risk function is

:R(\theta, d') = \operatorname{E}_\theta\left[\left\|\boldsymbol\theta - \mathbf{X} + \frac{\alpha}{\|\mathbf{X}\|^2}\mathbf{X}\right\|^2\right]
:= \operatorname{E}_\theta\left[\|\boldsymbol\theta - \mathbf{X}\|^2 + 2(\boldsymbol\theta - \mathbf{X})^T\frac{\alpha}{\|\mathbf{X}\|^2}\mathbf{X} + \frac{\alpha^2}{\|\mathbf{X}\|^4}\|\mathbf{X}\|^2\right]
:= \operatorname{E}_\theta\left[\|\boldsymbol\theta - \mathbf{X}\|^2\right] + 2\alpha\operatorname{E}_\theta\left[\frac{(\boldsymbol\theta - \mathbf{X})^T\mathbf{X}}{\|\mathbf{X}\|^2}\right] + \alpha^2\operatorname{E}_\theta\left[\frac{1}{\|\mathbf{X}\|^2}\right],

which is a quadratic in \alpha. We may simplify the middle term by considering a general "well-behaved" function h : \mathbf{x} \mapsto h(\mathbf{x}) \in \mathbb{R} and using integration by parts. For 1 \leq i \leq n, for any continuously differentiable h growing sufficiently slowly for large x_i we have:

:\operatorname{E}_\theta\left[(\theta_i - X_i) h(\mathbf{X}) \mid X_j = x_j\ (j \neq i)\right] = \int (\theta_i - x_i) h(\mathbf{x})\left(\frac{1}{2\pi}\right)^{1/2} e^{-(x_i - \theta_i)^2/2}\, dx_i
:= \left[h(\mathbf{x})\left(\frac{1}{2\pi}\right)^{1/2} e^{-(x_i - \theta_i)^2/2}\right]_{x_i = -\infty}^{\infty} - \int \frac{\partial h}{\partial x_i}(\mathbf{x})\left(\frac{1}{2\pi}\right)^{1/2} e^{-(x_i - \theta_i)^2/2}\, dx_i
:= -\operatorname{E}_\theta\left[\frac{\partial h}{\partial x_i}(\mathbf{X}) \mid X_j = x_j\ (j \neq i)\right].

Therefore,

:\operatorname{E}_\theta\left[(\theta_i - X_i) h(\mathbf{X})\right] = -\operatorname{E}_\theta\left[\frac{\partial h}{\partial x_i}(\mathbf{X})\right].

(This result is known as Stein's lemma.) Now, we choose

:h(\mathbf{x}) = \frac{x_i}{\|\mathbf{x}\|^2}.

If h met the "well-behaved" condition (it does not, but this can be remedied; see below), we would have

:\frac{\partial h}{\partial x_i} = \frac{1}{\|\mathbf{x}\|^2} - \frac{2 x_i^2}{\|\mathbf{x}\|^4}

and so

:\operatorname{E}_\theta\left[\frac{(\boldsymbol\theta - \mathbf{X})^T\mathbf{X}}{\|\mathbf{X}\|^2}\right] = \sum_{i=1}^n \operatorname{E}_\theta\left[(\theta_i - X_i)\frac{X_i}{\|\mathbf{X}\|^2}\right]
:= -\sum_{i=1}^n \operatorname{E}_\theta\left[\frac{1}{\|\mathbf{X}\|^2} - \frac{2 X_i^2}{\|\mathbf{X}\|^4}\right]
:= -(n - 2)\operatorname{E}_\theta\left[\frac{1}{\|\mathbf{X}\|^2}\right].

Then, returning to the risk function of d':

:R(\theta, d') = n - 2\alpha(n - 2)\operatorname{E}_\theta\left[\frac{1}{\|\mathbf{X}\|^2}\right] + \alpha^2\operatorname{E}_\theta\left[\frac{1}{\|\mathbf{X}\|^2}\right].

This quadratic in \alpha is minimized at \alpha = n - 2, giving

:R(\theta, d') = R(\theta, d) - (n - 2)^2\operatorname{E}_\theta\left[\frac{1}{\|\mathbf{X}\|^2}\right],

which of course satisfies R(\theta, d') < R(\theta, d), making d an inadmissible decision rule.

It remains to justify the use of

:h(\mathbf{x}) = \frac{x_i}{\|\mathbf{x}\|^2}.

This function is not continuously differentiable, since it is singular at \mathbf{x} = 0. However, the function

:h(\mathbf{x}) = \frac{x_i}{\varepsilon + \|\mathbf{x}\|^2}

is continuously differentiable, and after following the algebra through and letting \varepsilon \to 0, one obtains the same result.
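The closing identity R(\theta, d') = n - (n - 2)^2\operatorname{E}_\theta\left[1/\|\mathbf{X}\|^2\right] can also be checked numerically. The sketch below is an illustration added to accompany the proof, not part of it; the choices of \boldsymbol\theta, n and the sample size are arbitrary.

# Monte Carlo check of the risk identity from the sketch above:
#   R(theta, d') = n - (n - 2)^2 * E[ 1 / ||X||^2 ],  with d'(X) = X - (n-2) X / ||X||^2.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 4, 1_000_000
theta = np.array([0.5, -1.0, 2.0, 0.0])

x = theta + rng.standard_normal((reps, n))
s = np.sum(x**2, axis=1)                                    # ||X||^2 for each replicate
d_prime = x - ((n - 2) / s)[:, None] * x                    # the shrinkage rule d'

lhs = np.mean(np.sum((d_prime - theta) ** 2, axis=1))       # direct Monte Carlo risk of d'
rhs = n - (n - 2) ** 2 * np.mean(1.0 / s)                   # n - (n-2)^2 E[1/||X||^2]
print(lhs, rhs)                                             # agree up to Monte Carlo error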


See also

* James–Stein estimator

