Rao–Blackwell Theorem

In statistics, the Rao–Blackwell theorem, sometimes referred to as the Rao–Blackwell–Kolmogorov theorem, is a result which characterizes the transformation of an arbitrarily crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria.

The Rao–Blackwell theorem states that if ''g''(''X'') is any kind of estimator of a parameter θ, then the conditional expectation of ''g''(''X'') given ''T''(''X''), where ''T'' is a sufficient statistic, is typically a better estimator of θ, and is never worse. Sometimes one can very easily construct a very crude estimator ''g''(''X''), and then evaluate that conditional expected value to get an estimator that is in various senses optimal.

The theorem is named after Calyampudi Radhakrishna Rao and David Blackwell. The process of transforming an estimator using the Rao–Blackwell theorem can be referred to as Rao–Blackwellization. The transformed estimator is called the Rao–Blackwell estimator.


Definitions

*An estimator δ(''X'') is an ''observable'' random variable (i.e. a statistic) used for estimating some ''unobservable'' quantity. For example, one may be unable to observe the average height of ''all'' male students at the University of X, but one may observe the heights of a random sample of 40 of them. The average height of those 40 (the "sample average") may be used as an estimator of the unobservable "population average".
*A sufficient statistic ''T''(''X'') is a statistic calculated from data ''X'' to estimate some parameter θ, such that no other statistic that can be calculated from the data ''X'' provides any additional information about θ. It is defined as an ''observable'' random variable such that the conditional probability distribution of all observable data ''X'' given ''T''(''X'') does not depend on the ''unobservable'' parameter θ, such as the mean or standard deviation of the whole population from which the data ''X'' was taken. In the most frequently cited examples, the "unobservable" quantities are parameters that parametrize a known family of probability distributions according to which the data are distributed.
::In other words, a sufficient statistic ''T''(''X'') for a parameter θ is a statistic such that the conditional distribution of the data ''X'', given ''T''(''X''), does not depend on the parameter θ.
*A Rao–Blackwell estimator δ1(''X'') of an unobservable quantity θ is the conditional expected value E(δ(''X'') | ''T''(''X'')) of some estimator δ(''X'') given a sufficient statistic ''T''(''X''). Call δ(''X'') the "original estimator" and δ1(''X'') the "improved estimator". It is important that the improved estimator be ''observable'', i.e. that it does not depend on θ. Generally, the conditional expected value of one function of the data given another function of the data ''does'' depend on θ, but the very definition of sufficiency given above entails that this one does not. (See the numerical sketch following this list.)
*The ''mean squared error'' of an estimator is the expected value of the square of its deviation from the unobservable quantity θ being estimated.
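
To make these definitions concrete, the following minimal Python sketch simulates a hypothetical setting: ''n'' Bernoulli(θ) trials, for which the sum ''T''(''X'') = ''X''1 + ... + ''X''''n'' is sufficient. Conditioning the crude unbiased estimator δ(''X'') = ''X''1 on ''T''(''X'') empirically recovers the sample mean ''T''/''n'', with no knowledge of θ required, as sufficiency guarantees (all parameter values here are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 200_000

# Simulate many samples of n Bernoulli(theta) observations.
x = rng.binomial(1, theta, size=(reps, n))
t = x.sum(axis=1)       # sufficient statistic T(X) = X_1 + ... + X_n
delta = x[:, 0]         # crude unbiased estimator delta(X) = X_1

# The empirical conditional expectation E(delta(X) | T = t), obtained by
# averaging X_1 within each level of T, matches t/n (the sample mean)
# and involves no knowledge of theta, as sufficiency requires.
for t_val in range(n + 1):
    mask = t == t_val
    if mask.sum() > 1000:   # skip levels of T observed too rarely
        print(t_val, round(delta[mask].mean(), 3), t_val / n)
```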


The theorem


Mean-squared-error version

One case of the Rao–Blackwell theorem states:

:The mean squared error of the Rao–Blackwell estimator does not exceed that of the original estimator.

In other words,

:\operatorname{E}\big((\delta_1(X)-\theta)^2\big) \leq \operatorname{E}\big((\delta(X)-\theta)^2\big).

The essential tools of the proof, besides the definition above, are the law of total expectation and the fact that for any random variable ''Y'', E(''Y''^2) cannot be less than (E(''Y''))^2. That inequality is a case of Jensen's inequality, although it may also be shown to follow instantly from the frequently mentioned fact that

:0 \leq \operatorname{Var}(Y) = \operatorname{E}\big((Y-\operatorname{E}(Y))^2\big) = \operatorname{E}(Y^2)-(\operatorname{E}(Y))^2.

More precisely, the mean squared error of the Rao–Blackwell estimator has the following decomposition:

:\operatorname{E}\big[(\delta_1(X)-\theta)^2\big] = \operatorname{E}\big[(\delta(X)-\theta)^2\big] - \operatorname{E}\big[\operatorname{Var}(\delta(X)\mid T(X))\big].

Since \operatorname{E}\big[\operatorname{Var}(\delta(X)\mid T(X))\big] \ge 0, the Rao–Blackwell theorem immediately follows.
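
Spelling out the step that yields this decomposition: by the law of total expectation and the usual splitting of a conditional second moment into conditional variance plus squared conditional mean,

:\begin{align} \operatorname{E}\big[(\delta(X)-\theta)^2\big] &= \operatorname{E}\Big[\operatorname{E}\big[(\delta(X)-\theta)^2 \mid T(X)\big]\Big] \\ &= \operatorname{E}\Big[\operatorname{Var}\big(\delta(X)\mid T(X)\big) + \big(\operatorname{E}[\delta(X)\mid T(X)] - \theta\big)^2\Big] \\ &= \operatorname{E}\big[\operatorname{Var}(\delta(X)\mid T(X))\big] + \operatorname{E}\big[(\delta_1(X)-\theta)^2\big], \end{align}

which rearranges to the decomposition above.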


Convex loss generalization

The more general version of the Rao–Blackwell theorem speaks of the "expected loss" or risk function:

:\operatorname{E}(L(\delta_1(X))) \leq \operatorname{E}(L(\delta(X)))

where the "loss function" ''L'' may be any convex function. If the loss function is twice-differentiable, as in the case of mean squared error, then we have the sharper inequality

:\operatorname{E}(L(\delta(X))) - \operatorname{E}(L(\delta_1(X))) \ge \frac{1}{2}\operatorname{E}_T\left[\inf_x L''(x)\operatorname{Var}(\delta(X)\mid T)\right].
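
The first inequality is the conditional form of Jensen's inequality applied to the convex loss, followed by the law of total expectation:

:\operatorname{E}\big[L(\delta_1(X))\big] = \operatorname{E}\Big[L\big(\operatorname{E}[\delta(X)\mid T(X)]\big)\Big] \le \operatorname{E}\Big[\operatorname{E}\big[L(\delta(X))\mid T(X)\big]\Big] = \operatorname{E}\big[L(\delta(X))\big].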


Properties

The improved estimator is unbiased if and only if the original estimator is unbiased, as may be seen at once by using the law of total expectation. The theorem holds regardless of whether biased or unbiased estimators are used.

The theorem seems very weak: it says only that the Rao–Blackwell estimator is no worse than the original estimator. In practice, however, the improvement is often enormous.
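
Explicitly, by the law of total expectation,

:\operatorname{E}[\delta_1(X)] = \operatorname{E}\big[\operatorname{E}(\delta(X)\mid T(X))\big] = \operatorname{E}[\delta(X)],

so the two estimators always have the same expectation, and in particular the same bias.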


Example

Phone calls arrive at a switchboard according to a Poisson process at an average rate of λ per minute. This rate is not observable, but the numbers ''X''1, ..., ''X''''n'' of phone calls that arrived during ''n'' successive one-minute periods are observed. It is desired to estimate the probability ''e''^−λ that the next one-minute period passes with no phone calls.

An ''extremely'' crude estimator of the desired probability is

:\delta_0=\left\{\begin{matrix}1 & \text{if}\ X_1=0, \\ 0 & \text{otherwise,}\end{matrix}\right.

i.e., it estimates this probability to be 1 if no phone calls arrived in the first minute and zero otherwise. Despite the apparent limitations of this estimator, the result given by its Rao–Blackwellization is a very good estimator.

The sum

:S_n = \sum_{i=1}^n X_i = X_1+\cdots+X_n

can be readily shown to be a sufficient statistic for λ, i.e., the ''conditional'' distribution of the data ''X''1, ..., ''X''''n'' depends on λ only through this sum. Therefore, we find the Rao–Blackwell estimator

:\delta_1=\operatorname{E}(\delta_0\mid S_n=s_n).

After doing some algebra we have

:\begin{align} \delta_1 &= \operatorname{E} \left(\mathbf{1}_{\{X_1=0\}} \,\Bigg|\, \sum_{i=1}^n X_i = s_n \right) \\ &= P \left(X_1=0 \,\Bigg|\, \sum_{i=1}^n X_i = s_n \right) \\ &= P \left(X_1=0, \sum_{i=2}^n X_i = s_n \right) \times P \left(\sum_{i=1}^n X_i = s_n \right)^{-1} \\ &= e^{-\lambda}\frac{\left((n-1)\lambda\right)^{s_n}e^{-(n-1)\lambda}}{s_n!} \times \left(\frac{(n\lambda)^{s_n}e^{-n\lambda}}{s_n!} \right)^{-1} \\ &= \frac{\left((n-1)\lambda\right)^{s_n}e^{-n\lambda}}{s_n!} \times \frac{s_n!}{(n\lambda)^{s_n}e^{-n\lambda}} \\ &= \left(1-\frac{1}{n}\right)^{s_n} \end{align}

Since the average number of calls arriving during the first ''n'' minutes is ''n''λ, one might not be surprised if this estimator has a fairly high probability (if ''n'' is big) of being close to

:\left(1-\frac{1}{n}\right)^{n\lambda}\approx e^{-\lambda}.

So δ1 is clearly a very much improved estimator of that last quantity. In fact, since ''S''''n'' is complete and δ0 is unbiased, δ1 is the unique minimum variance unbiased estimator by the Lehmann–Scheffé theorem.
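
The size of the improvement can be checked by Monte Carlo simulation. The following minimal Python sketch (with λ and ''n'' chosen arbitrarily for illustration) compares the mean squared errors of δ0 and δ1 as estimators of ''e''^−λ:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 2.0, 20, 100_000
target = np.exp(-lam)   # true probability of a call-free minute

# Each row is one experiment: n one-minute Poisson(lam) call counts.
x = rng.poisson(lam, size=(reps, n))

delta0 = (x[:, 0] == 0).astype(float)   # crude estimator: 1 iff X_1 = 0
s_n = x.sum(axis=1)                     # sufficient statistic S_n
delta1 = (1.0 - 1.0 / n) ** s_n         # Rao-Blackwellized estimator

print("MSE of delta0:", np.mean((delta0 - target) ** 2))
print("MSE of delta1:", np.mean((delta1 - target) ** 2))   # far smaller
```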


Idempotence

Rao–Blackwellization is an idempotent operation: using it to improve the already improved estimator does not obtain a further improvement, but merely returns as its output the same improved estimator. This is because the improved estimator is itself a function of the sufficient statistic ''T''(''X''), so conditioning it on ''T''(''X'') again leaves it unchanged.


Completeness and Lehmann–Scheffé minimum variance

If the conditioning statistic is both complete and sufficient, and the starting estimator is unbiased, then the Rao–Blackwell estimator is the unique "best unbiased estimator": see the Lehmann–Scheffé theorem.

An example of an improvable Rao–Blackwell improvement, when using a minimal sufficient statistic that is not complete, was provided by Galili and Meilijson in 2016. Let X_1, \ldots, X_n be a random sample from a scale-uniform distribution X \sim U\left((1-k)\theta, (1+k)\theta\right), with unknown mean \operatorname{E}[X] = \theta and known design parameter k \in (0,1). In the search for "best" possible unbiased estimators for \theta, it is natural to consider X_1 as an initial (crude) unbiased estimator for \theta and then try to improve it. Since X_1 is not a function of T = \left(X_{(1)}, X_{(n)}\right), the minimal sufficient statistic for \theta (where X_{(1)} = \min(X_i) and X_{(n)} = \max(X_i)), it may be improved using the Rao–Blackwell theorem as follows:

:\hat{\theta}_{RB}=\operatorname{E}_{\theta}\left[X_1 \mid X_{(1)}, X_{(n)}\right] = \frac{X_{(1)}+X_{(n)}}{2}.

However, the following unbiased estimator can be shown to have lower variance:

:\hat{\theta}_{LV} = \frac{1}{2\left(k^2 \frac{n-1}{n+1}+1\right)} \left[(1-k)X_{(1)}+(1+k)X_{(n)}\right].

And in fact, it could be even further improved when using the following estimator:

:\hat{\theta}_{\text{BAYES}}=\frac{n+1}{n} \left[1- \frac{\frac{X_{(1)}}{1-k} \Big/ \frac{X_{(n)}}{1+k}-1}{\left(\frac{X_{(1)}}{1-k} \Big/ \frac{X_{(n)}}{1+k}\right)^{n+1}-1}\right] \frac{X_{(n)}}{1+k}.

The model is a scale model. Optimal equivariant estimators can then be derived for loss functions that are invariant.
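
A minimal simulation sketch of this comparison, with ''n'', ''k'' and θ chosen arbitrarily for illustration, confirms that both \hat{\theta}_{RB} and \hat{\theta}_{LV} are unbiased and that the latter has the smaller variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, k, n, reps = 1.0, 0.5, 10, 200_000

# Each row is one sample of size n from U((1-k)*theta, (1+k)*theta).
x = rng.uniform((1 - k) * theta, (1 + k) * theta, size=(reps, n))
x_min, x_max = x.min(axis=1), x.max(axis=1)

# Rao-Blackwell estimator: midrange of the minimal sufficient statistic.
theta_rb = (x_min + x_max) / 2.0
# Lower-variance unbiased estimator from Galili and Meilijson (2016).
theta_lv = ((1 - k) * x_min + (1 + k) * x_max) / (
    2.0 * (k**2 * (n - 1) / (n + 1) + 1.0)
)

for name, est in [("RB", theta_rb), ("LV", theta_lv)]:
    # Both means should be close to theta = 1; LV has the smaller variance.
    print(name, "mean:", round(est.mean(), 4), "variance:", round(est.var(), 6))
```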


See also

* Basu's theorem — Another result on complete sufficient and ancillary statistics


References

* Galili, Tal; Meilijson, Isaac (2016). "An Example of an Improvable Rao–Blackwell Improvement, Inefficient Maximum Likelihood Estimator, and Unbiased Generalized Bayes Estimator". ''The American Statistician''. 70 (1): 108–113.