In estimation theory and decision theory, a Bayes estimator or a Bayes action is an estimator or decision rule that minimizes the posterior expected value of a loss function (i.e., the posterior expected loss). Equivalently, it maximizes the posterior expectation of a utility function. An alternative way of formulating an estimator within Bayesian statistics is maximum a posteriori estimation.
Definition
Suppose an unknown parameter θ is known to have a prior distribution π. Let \hat{\theta} = \hat{\theta}(x) be an estimator of θ (based on some measurements ''x''), and let L(\theta,\hat{\theta}) be a loss function, such as squared error. The Bayes risk of \hat{\theta} is defined as E_\pi\bigl(L(\theta,\hat{\theta})\bigr), where the expectation is taken over the probability distribution of θ: this defines the risk function as a function of \hat{\theta}. An estimator \hat{\theta} is said to be a ''Bayes estimator'' if it minimizes the Bayes risk among all estimators. Equivalently, the estimator which minimizes the posterior expected loss E\bigl(L(\theta,\hat{\theta}) \mid x\bigr) ''for each x'' also minimizes the Bayes risk and therefore is a Bayes estimator.
If the prior is improper then an estimator which minimizes the posterior expected loss ''for each x'' is called a generalized Bayes estimator.
[Lehmann and Casella, Definition 4.2.9]
Examples
Minimum mean square error estimation
The most common risk function used for Bayesian estimation is the mean square error (MSE), also called ''squared error risk''. The MSE is defined by
: \mathrm{MSE} = E\left[ \bigl( \hat{\theta}(x) - \theta \bigr)^2 \right],
where the expectation is taken over the joint distribution of θ and ''x''.
Posterior mean
Using the MSE as risk, the Bayes estimate of the unknown parameter is simply the mean of the posterior distribution,
: \hat{\theta}(x) = E[\theta \mid x] = \int \theta \, p(\theta \mid x) \, d\theta.
This is known as the ''minimum mean square error'' (MMSE) estimator.
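The posterior mean can be approximated numerically when no closed form is available. The following is a minimal sketch in Python; the Normal prior, Normal likelihood, grid and observed value are illustrative assumptions, not taken from the text above.

```python
import numpy as np

# Minimal sketch: approximate the MMSE (posterior-mean) estimate on a grid.
# The N(0, 2^2) prior, N(theta, 1) likelihood and observation are hypothetical.
theta = np.linspace(-10.0, 10.0, 2001)             # grid over the parameter
dtheta = theta[1] - theta[0]
prior = np.exp(-0.5 * (theta / 2.0) ** 2)          # unnormalized N(0, 4) prior
x_obs = 1.3                                        # a single hypothetical measurement
likelihood = np.exp(-0.5 * (x_obs - theta) ** 2)   # N(theta, 1) likelihood

posterior = prior * likelihood
posterior /= posterior.sum() * dtheta              # normalize numerically

theta_mmse = (theta * posterior).sum() * dtheta    # posterior mean = MMSE estimate
print(theta_mmse)   # approx. (4 / (4 + 1)) * 1.3 = 1.04, the conjugate closed form
```

The printed value matches the closed-form conjugate answer given in the next section for the Normal–Normal case.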
Bayes estimators for conjugate priors
If there is no inherent reason to prefer one prior probability distribution over another, a conjugate prior is sometimes chosen for simplicity. A conjugate prior is defined as a prior distribution belonging to some parametric family
, for which the resulting posterior distribution also belongs to the same family. This is an important property, since the Bayes estimator, as well as its statistical properties (variance, confidence interval, etc.), can all be derived from the posterior distribution.
Conjugate priors are especially useful for sequential estimation, where the posterior of the current measurement is used as the prior in the next measurement. In sequential estimation, unless a conjugate prior is used, the posterior distribution typically becomes more complex with each added measurement, and the Bayes estimator cannot usually be calculated without resorting to numerical methods.
Following are some examples of conjugate priors.
* If x \mid \theta is normal, x \mid \theta \sim N(\theta,\sigma^2), and the prior is normal, \theta \sim N(\mu,\tau^2), then the posterior is also normal and the Bayes estimator under MSE is given by
: \hat{\theta}(x) = \frac{\sigma^{2}}{\sigma^{2}+\tau^{2}}\,\mu + \frac{\tau^{2}}{\sigma^{2}+\tau^{2}}\,x.
* If x_1, \ldots, x_n are iid Poisson random variables, x_i \mid \theta \sim P(\theta), and if the prior is Gamma distributed, \theta \sim G(a,b) (shape ''a'', scale ''b''), then the posterior is also Gamma distributed, and the Bayes estimator under MSE is given by (a numerical check appears after this list)
: \hat{\theta}(X) = \frac{n\overline{X} + a}{n + \frac{1}{b}}.
* If x_1, \ldots, x_n are iid uniformly distributed, x_i \mid \theta \sim U(0,\theta), and if the prior is Pareto distributed, \theta \sim Pa(\theta_0, a), then the posterior is also Pareto distributed, and the Bayes estimator under MSE is given by
: \hat{\theta}(X) = \frac{(a+n)\max(\theta_0, x_1, \ldots, x_n)}{a+n-1}.
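As referenced in the Poisson–Gamma item above, the closed-form estimator can be checked against a direct grid computation of the posterior mean. This is a minimal sketch with simulated data; the prior shape and scale, the sample size and the random seed are all hypothetical choices.

```python
import numpy as np

# Minimal sketch: verify the Poisson-Gamma Bayes estimator
#   theta_hat = (n * xbar + a) / (n + 1/b)
# against a grid approximation of the posterior mean.  Gamma(a, b) is taken
# with shape a and scale b, as assumed in the bullet above.
rng = np.random.default_rng(0)
a, b = 2.0, 3.0                        # hypothetical prior shape and scale
x = rng.poisson(lam=4.0, size=25)      # simulated Poisson counts
n, xbar = x.size, x.mean()

closed_form = (n * xbar + a) / (n + 1.0 / b)

theta = np.linspace(1e-3, 20.0, 4000)
log_post = (a - 1) * np.log(theta) - theta / b     # log Gamma prior (unnormalized)
log_post += x.sum() * np.log(theta) - n * theta    # log Poisson likelihood (unnormalized)
post = np.exp(log_post - log_post.max())           # stabilize before exponentiating
post /= post.sum()
grid_mean = (theta * post).sum()

print(closed_form, grid_mean)          # the two values should agree closely
```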
Alternative risk functions
Risk functions are chosen depending on how one measures the distance between the estimate and the unknown parameter. The MSE is the most common risk function in use, primarily due to its simplicity. However, alternative risk functions are also occasionally used. The following are several examples of such alternatives. We denote the posterior generalized distribution function by ''F''.
Posterior median and other quantiles
* A "linear" loss function, with
, which yields the posterior median as the Bayes' estimate:
:
:
* Another "linear" loss function, which assigns different "weights"
to over or sub estimation. It yields a
quantile
In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. There is one fewer quantile th ...
from the posterior distribution, and is a generalization of the previous loss function:
:
:
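When the posterior is available only through samples (for example from a simulation), the Bayes estimate under this asymmetric loss is simply the a/(a+b) sample quantile. A minimal sketch follows; the stand-in posterior, the weights and the grid are hypothetical.

```python
import numpy as np

# Minimal sketch: under the asymmetric linear loss above, the Bayes estimate is
# the a/(a+b) quantile of the posterior.  The "posterior" here is a hypothetical
# N(2, 0.5^2) stand-in represented by Monte Carlo samples.
rng = np.random.default_rng(1)
posterior_samples = rng.normal(loc=2.0, scale=0.5, size=20_000)

a, b = 1.0, 3.0                              # heavier penalty b on overestimation
estimate = np.quantile(posterior_samples, a / (a + b))

# Sanity check: minimize the posterior expected loss directly on a grid.
grid = np.linspace(0.0, 4.0, 401)
diff = posterior_samples[None, :] - grid[:, None]
exp_loss = np.where(diff >= 0, a * diff, -b * diff).mean(axis=1)
print(estimate, grid[np.argmin(exp_loss)])   # the two values should nearly coincide
```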
Posterior mode
* The following loss function is trickier: it yields either the posterior mode, or a point close to it depending on the curvature and properties of the posterior distribution. Small values of the parameter K > 0 are recommended, in order to use the mode as an approximation (L > 0):
: L(\theta,\hat{\theta}) = \begin{cases} 0 & \text{for } |\theta - \hat{\theta}| < K \\ L & \text{for } |\theta - \hat{\theta}| \ge K \end{cases}
Other loss functions can be conceived, although the mean squared error is the most widely used and validated. Other loss functions are used in statistics, particularly in robust statistics.
Generalized Bayes estimators
The prior distribution p has thus far been assumed to be a true probability distribution, in that
: \int p(\theta) \, d\theta = 1.
However, occasionally this can be a restrictive requirement. For example, there is no distribution (covering the set, R, of all real numbers) for which every real number is equally likely. Yet, in some sense, such a "distribution" seems like a natural choice for a non-informative prior, i.e., a prior distribution which does not imply a preference for any particular value of the unknown parameter. One can still define a function p(\theta) = 1, but this would not be a proper probability distribution since it has infinite mass,
: \int p(\theta) \, d\theta = \infty.
Such measures p(\theta), which are not probability distributions, are referred to as improper priors.
The use of an improper prior means that the Bayes risk is undefined (since the prior is not a probability distribution and we cannot take an expectation under it). As a consequence, it is no longer meaningful to speak of a Bayes estimator that minimizes the Bayes risk. Nevertheless, in many cases, one can define the posterior distribution
: p(\theta \mid x) = \frac{p(x \mid \theta) \, p(\theta)}{\int p(x \mid \theta) \, p(\theta) \, d\theta}.
This is a definition, and not an application of Bayes' theorem, since Bayes' theorem can only be applied when all distributions are proper. However, it is not uncommon for the resulting "posterior" to be a valid probability distribution. In this case, the posterior expected loss
: \int L(\theta, a) \, p(\theta \mid x) \, d\theta
is typically well-defined and finite. Recall that, for a proper prior, the Bayes estimator minimizes the posterior expected loss. When the prior is improper, an estimator which minimizes the posterior expected loss is referred to as a generalized Bayes estimator.
Example
A typical example is estimation of a location parameter with a loss function of the type L(a - \theta). Here θ is a location parameter, i.e., p(x \mid \theta) = f(x - \theta).
It is common to use the improper prior p(\theta) = 1 in this case, especially when no other more subjective information is available. This yields
: p(\theta \mid x) = \frac{p(x \mid \theta) \, p(\theta)}{p(x)} = \frac{f(x - \theta)}{p(x)},
so the posterior expected loss
: E[L(a - \theta) \mid x] = \int L(a - \theta) \, p(\theta \mid x) \, d\theta = \frac{1}{p(x)} \int L(a - \theta) \, f(x - \theta) \, d\theta.
The generalized Bayes estimator is the value a(x) that minimizes this expression for a given ''x''. This is equivalent to minimizing
: \int L(a - \theta) \, f(x - \theta) \, d\theta for a given ''x''.   (1)
In this case it can be shown that the generalized Bayes estimator has the form x + a_0, for some constant a_0. To see this, let a_0 be the value minimizing (1) when x = 0. Then, given a different value x_1, we must minimize
: \int L(a - \theta) \, f(x_1 - \theta) \, d\theta = \int L(a - x_1 - \theta') \, f(-\theta') \, d\theta'.   (2)
This is identical to (1), except that a has been replaced by a - x_1. Thus, the minimizing value is given by a - x_1 = a_0, so that the optimal estimator has the form
: a(x) = a_0 + x.
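The translation-invariance argument above can also be checked numerically: for a fixed loss and sampling density, the minimizer of expression (1) shifts with ''x'' by a constant offset. The sketch below uses an illustrative shifted-exponential density and absolute loss; both choices, and the grids, are assumptions made for the example only.

```python
import numpy as np

# Minimal sketch: for a location family f(x - theta) with improper prior
# p(theta) = 1, minimize expression (1) numerically and confirm that the
# generalized Bayes estimate has the form a(x) = x + a0.
def f(u):                          # hypothetical sampling density: shifted exponential
    return np.where(u > -1.0, np.exp(-(u + 1.0)), 0.0)

def loss(d):                       # hypothetical loss of the type L(a - theta)
    return np.abs(d)

theta = np.linspace(-20.0, 20.0, 2001)
dtheta = theta[1] - theta[0]
a_grid = np.linspace(-10.0, 10.0, 801)

def gen_bayes(x):
    # expression (1): integral over theta of L(a - theta) * f(x - theta), minimized over a
    integrand = loss(a_grid[:, None] - theta[None, :]) * f(x - theta)[None, :]
    return a_grid[np.argmin(integrand.sum(axis=1) * dtheta)]

for x in (0.0, 2.0, 5.0):
    print(x, gen_bayes(x) - x)     # the offset a0 is (nearly) the same for every x
```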
Empirical Bayes estimators
A Bayes estimator derived through the empirical Bayes method is called an empirical Bayes estimator. Empirical Bayes methods enable the use of auxiliary empirical data, from observations of related parameters, in the development of a Bayes estimator. This is done under the assumption that the estimated parameters are obtained from a common prior. For example, if independent observations of different parameters are performed, then the estimation performance of a particular parameter can sometimes be improved by using data from other observations.
There are parametric and non-parametric approaches to empirical Bayes estimation. Parametric empirical Bayes is usually preferable since it is more applicable and more accurate on small amounts of data.
Example
The following is a simple example of parametric empirical Bayes estimation. Given past observations x_1, \ldots, x_n having conditional distribution f(x_i \mid \theta_i), one is interested in estimating \theta_{n+1} based on x_{n+1}. Assume that the \theta_i's have a common prior π which depends on unknown parameters. For example, suppose that π is normal with unknown mean \mu_\pi and variance \sigma_\pi^2. We can then use the past observations to determine the mean and variance of π in the following way.
First, we estimate the mean \mu_m and variance \sigma_m^2 of the marginal distribution of x_1, \ldots, x_n using the maximum likelihood approach:
: \hat{\mu}_m = \frac{1}{n} \sum_i x_i,
: \hat{\sigma}_m^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu}_m)^2.
Next, we use the law of total expectation to compute \mu_m and the law of total variance to compute \sigma_m^2 such that
: \mu_m = E_\pi[\mu_f(\theta)],
: \sigma_m^2 = E_\pi[\sigma_f^2(\theta)] + E_\pi[(\mu_f(\theta) - \mu_m)^2],
where \mu_f(\theta) and \sigma_f^2(\theta) are the moments of the conditional distribution f(x \mid \theta), which are assumed to be known. In particular, suppose that \mu_f(\theta) = \theta and that \sigma_f^2(\theta) = K; we then have
: \mu_\pi = \mu_m,
: \sigma_\pi^2 = \sigma_m^2 - \sigma_f^2 = \sigma_m^2 - K.
Finally, we obtain the estimated moments of the prior,
: \hat{\mu}_\pi = \hat{\mu}_m,
: \hat{\sigma}_\pi^2 = \hat{\sigma}_m^2 - K.
For example, if x_i \mid \theta_i \sim N(\theta_i, 1), and if we assume a normal prior (which is a conjugate prior in this case), we conclude that \theta_{n+1} \sim N(\hat{\mu}_\pi, \hat{\sigma}_\pi^2), from which the Bayes estimator of \theta_{n+1} based on x_{n+1} can be calculated.
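The recipe above translates directly into a few lines of code. This is a minimal sketch with simulated data; the true prior parameters, the sample size, the known sampling variance K and the new observation are all hypothetical.

```python
import numpy as np

# Minimal sketch of the parametric empirical Bayes recipe above:
#   x_i | theta_i ~ N(theta_i, K),  theta_i ~ N(mu_pi, sigma_pi^2), with K known.
rng = np.random.default_rng(2)
mu_pi_true, sigma_pi_true = 5.0, 2.0
n, K = 500, 1.0

theta = rng.normal(mu_pi_true, sigma_pi_true, size=n)
x = rng.normal(theta, np.sqrt(K))                 # past observations x_1..x_n

# Maximum likelihood moments of the marginal distribution of the x_i.
mu_m = x.mean()
sigma_m2 = x.var()                                # divides by n, as in the formulas above

# Estimated prior moments via total expectation / total variance.
mu_pi_hat = mu_m
sigma_pi2_hat = sigma_m2 - K

# Bayes (posterior-mean) estimate of a new theta from a new observation x_new,
# treating the estimated normal prior as if it were the true conjugate prior.
x_new = 7.3
theta_hat = (K * mu_pi_hat + sigma_pi2_hat * x_new) / (K + sigma_pi2_hat)
print(mu_pi_hat, sigma_pi2_hat, theta_hat)
```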
Properties
Admissibility
Bayes rules having finite Bayes risk are typically admissible. The following are some specific examples of admissibility theorems.
* If a Bayes rule is unique then it is admissible. For example, as stated above, under mean squared error (MSE) the Bayes rule is unique and therefore admissible.
* If θ belongs to a discrete set, then all Bayes rules are admissible.
* If θ belongs to a continuous (non-discrete) set, and if the risk function R(θ,δ) is continuous in θ for every δ, then all Bayes rules are admissible.
By contrast, generalized Bayes rules often have undefined Bayes risk in the case of improper priors. These rules are often inadmissible and the verification of their admissibility can be difficult. For example, the generalized Bayes estimator of a location parameter θ based on Gaussian samples (described in the "Generalized Bayes estimators" section above) is inadmissible in dimension greater than two; this is known as Stein's phenomenon.
Asymptotic efficiency
Let θ be an unknown random variable, and suppose that x_1, x_2, \ldots are iid samples with density f(x_i \mid \theta). Let \delta_n = \delta_n(x_1, \ldots, x_n) be a sequence of Bayes estimators of θ based on an increasing number of measurements. We are interested in analyzing the asymptotic performance of this sequence of estimators, i.e., the performance of \delta_n for large ''n''.
To this end, it is customary to regard θ as a deterministic parameter whose true value is \theta_0. Under specific conditions, for large samples (large values of ''n''), the posterior density of θ is approximately normal. In other words, for large ''n'', the effect of the prior probability on the posterior is negligible. Moreover, if δ is the Bayes estimator under MSE risk, then it is asymptotically unbiased and it converges in distribution to the normal distribution:
: \sqrt{n}\,(\delta_n - \theta_0) \to N\!\left(0, \frac{1}{I(\theta_0)}\right),
where I(\theta_0) is the Fisher information of \theta_0.
It follows that the Bayes estimator \delta_n under MSE is asymptotically efficient.
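This limiting behaviour can be illustrated by simulation. The sketch below uses a Bernoulli model with a Beta prior (so the Bayes estimator has a closed form); the true parameter, prior parameters, sample size and number of replications are illustrative assumptions.

```python
import numpy as np

# Minimal Monte Carlo sketch: for Bernoulli(theta_0) data with a Beta(2, 2) prior,
# the Bayes estimator under MSE is delta_n = (a + sum(x)) / (a + b + n).
# Check that sqrt(n) * (delta_n - theta_0) has spread close to 1 / sqrt(I(theta_0)),
# where the Fisher information is I(theta) = 1 / (theta * (1 - theta)).
rng = np.random.default_rng(4)
theta0, a, b = 0.3, 2.0, 2.0
n, reps = 5_000, 20_000

counts = rng.binomial(n, theta0, size=reps)          # sum of n Bernoulli trials, repeated
delta_n = (a + counts) / (a + b + n)
scaled = np.sqrt(n) * (delta_n - theta0)

print(scaled.std(), np.sqrt(theta0 * (1 - theta0)))  # both close to about 0.46
```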
Another estimator which is asymptotically normal and efficient is the maximum likelihood estimator (MLE). The relations between the maximum likelihood and Bayes estimators can be shown in the following simple example.
Example: estimating ''p'' in a binomial distribution
Consider the estimator of θ based on binomial sample ''x''~b(θ,''n'') where θ denotes the probability for success. Assuming θ is distributed according to the conjugate prior, which in this case is the Beta distribution B(''a'',''b''), the posterior distribution is known to be B(''a''+''x'',''b''+''n''−''x''). Thus, the Bayes estimator under MSE is
: \delta_n(x) = E[\theta \mid x] = \frac{a + x}{a + b + n}.
The MLE in this case is x/n and so we get,
: \delta_n(x) = \frac{a + b}{a + b + n} E[\theta] + \frac{n}{a + b + n} \delta_{MLE}.
The last equation implies that, for ''n'' → ∞, the Bayes estimator (in the described problem) is close to the MLE.
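A minimal sketch of this convergence, using hypothetical prior parameters and simulated data:

```python
import numpy as np

# Minimal sketch: compare the Bayes estimator (a + x) / (a + b + n) under a
# Beta(a, b) prior with the MLE x / n as the sample size n grows.
rng = np.random.default_rng(3)
theta_true, a, b = 0.3, 4.0, 4.0

for n in (10, 100, 10_000):
    x = rng.binomial(n, theta_true)           # one binomial sample of size n
    mle = x / n
    bayes = (a + x) / (a + b + n)
    print(n, round(mle, 4), round(bayes, 4))  # the two estimates converge as n grows
```

For small ''n'' the two estimates differ noticeably, which is the situation discussed next.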
On the other hand, when ''n'' is small, the prior information is still relevant to the decision problem and affects the estimate. To see the relative weight of the prior information, assume that ''a''=''b''; in this case each measurement brings in 1 new bit of information; the formula above shows that the prior information has the same weight as ''a''+''b'' bits of the new information. In applications, one often knows very little about fine details of the prior distribution; in particular, there is no reason to assume that it coincides with B(''a'',''b'') exactly. In such a case, one possible interpretation of this calculation is: "there is a non-pathological prior distribution with the mean value 0.5 and the standard deviation ''d'' which gives the weight of prior information equal to 1/(4''d''²)−1 bits of new information."
Another example of the same phenomenon is the case when the prior estimate and a measurement are normally distributed. If the prior is centered at ''B'' with deviation Σ, and the measurement is centered at ''b'' with deviation σ, then the posterior is centered at \frac{\alpha}{\alpha+\beta}B + \frac{\beta}{\alpha+\beta}b, with the weights in this weighted average being α=σ², β=Σ². Moreover, the squared posterior deviation is Σ²σ²/(Σ²+σ²). In other words, the prior is combined with the measurement in ''exactly'' the same way as if it were an extra measurement to take into account.
For example, if Σ=σ/2, then the deviation of 4 measurements combined matches the deviation of the prior (assuming that errors of measurements are independent). And the weights α,β in the formula for the posterior match this: the weight of the prior is 4 times the weight of the measurement. Combining this prior with ''n'' measurements with average ''v'' results in the posterior centered at \frac{4}{4+n}B + \frac{n}{4+n}v; in particular, the prior plays the same role as 4 measurements made in advance. In general, the prior has the weight of (σ/Σ)² measurements.
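The claim that the prior acts like (σ/Σ)² extra measurements can be verified with a small calculation; the numbers below are purely illustrative.

```python
import numpy as np

# Minimal sketch: with a normal prior N(B, Sigma^2) and n independent normal
# measurements of deviation sigma averaging v, the posterior mean equals the
# precision-weighted combination, and the prior acts like (sigma / Sigma)^2
# extra measurements placed at B.
B, Sigma = 10.0, 0.5         # hypothetical prior center and deviation
sigma = 1.0                  # measurement deviation, so (sigma / Sigma)^2 = 4
n, v = 6, 11.2               # number of measurements and their average

w = (sigma / Sigma) ** 2     # weight of the prior in "measurement" units

posterior_mean = (w * B + n * v) / (w + n)

# The same value computed from precisions (inverse variances):
prec_prior, prec_data = 1 / Sigma**2, n / sigma**2
posterior_mean_precision = (prec_prior * B + prec_data * v) / (prec_prior + prec_data)

print(posterior_mean, posterior_mean_precision)   # identical values
```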
Compare to the example of binomial distribution: there the prior has the weight of (σ/Σ)²−1 measurements. One can see that the exact weight does depend on the details of the distribution, but when σ≫Σ, the difference becomes small.
Practical example of Bayes estimators
The Internet Movie Database (IMDb) uses a formula for calculating and comparing the ratings of films by its users, including their Top Rated 250 Titles, which is claimed to give "a true Bayesian estimate". [IMDb Top 250] The following Bayesian formula was initially used to calculate a weighted average score for the Top 250, though the formula has since changed:
: W = \frac{Rv + Cm}{v + m}
where:
: ''W'' = weighted rating
: ''R'' = average rating for the movie as a number from 1 to 10 (mean) = (Rating)
: ''v'' = number of votes/ratings for the movie = (votes)
: ''m'' = weight given to the prior estimate (in this case, the number of votes IMDb deemed necessary for the average rating to approach statistical validity)
: ''C'' = the mean vote across the whole pool (currently 7.0)
Note that ''W'' is just the weighted arithmetic mean of ''R'' and ''C'' with weight vector (''v'', ''m''). As the number of ratings surpasses ''m'', the confidence of the average rating surpasses the confidence of the mean vote for all films (''C''), and the weighted Bayesian rating (''W'') approaches a straight average (''R''). The closer ''v'' (the number of ratings for the film) is to zero, the closer ''W'' is to ''C''. In simpler terms, the fewer ratings/votes cast for a film, the more that film's weighted rating skews towards the average across all films, while films with many ratings/votes will have a rating approaching their pure arithmetic average.
IMDb's approach ensures that a film with only a few ratings, all at 10, would not rank above ''The Godfather'', for example, which has a 9.2 average from over 500,000 ratings.
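A minimal sketch of the formula above; the cutoff ''m'' and the sample figures are illustrative stand-ins, not actual IMDb values.

```python
# Minimal sketch of the weighted-rating formula W = (R*v + C*m) / (v + m).
def weighted_rating(R: float, v: int, C: float = 7.0, m: int = 25_000) -> float:
    """Bayesian weighted rating: shrink a film's average R towards the global mean C."""
    return (R * v + C * m) / (v + m)

# A film rated 10.0 by only 3 voters stays near the global mean C,
# while a film rated 9.2 by 500,000 voters keeps essentially its own average.
print(weighted_rating(R=10.0, v=3))         # about 7.0
print(weighted_rating(R=9.2, v=500_000))    # about 9.1
```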
See also
* Recursive Bayesian estimation
* Generalized expected utility
Notes
References
* Lehmann, E. L.; Casella, G. (1998). ''Theory of Point Estimation'' (2nd ed.). New York: Springer-Verlag.
External links
Bayesian estimation on cnx.org