Empirical Bayes Method

Empirical Bayes methods are procedures for statistical inference in which the prior probability distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents a convenient approach for setting hyperparameters, but it has been mostly supplanted by fully Bayesian hierarchical analyses since the 2000s, with the increasing availability of well-performing computational techniques.


Introduction

Empirical Bayes methods can be seen as an approximation to a fully Bayesian treatment of a hierarchical Bayes model.

In a two-stage hierarchical Bayes model, for example, observed data y = \{y_1, y_2, \dots, y_n\} are assumed to be generated from an unobserved set of parameters \theta = \{\theta_1, \theta_2, \dots, \theta_n\} according to a probability distribution p(y\mid\theta). In turn, the parameters \theta can be considered samples drawn from a population characterised by hyperparameters \eta according to a probability distribution p(\theta\mid\eta). In the hierarchical Bayes model, though not in the empirical Bayes approximation, the hyperparameters \eta are considered to be drawn from an unparameterized distribution p(\eta).

Information about a particular quantity of interest \theta_i therefore comes not only from the properties of those data y that directly depend on it, but also from the properties of the population of parameters \theta as a whole, inferred from the data as a whole and summarised by the hyperparameters \eta.

Using Bayes' theorem,

: p(\theta\mid y) = \frac{p(y\mid\theta)\, p(\theta)}{p(y)} = \frac{p(y\mid\theta)}{p(y)} \int p(\theta\mid\eta)\, p(\eta) \, d\eta \,.

In general, this integral will not be tractable analytically or symbolically and must be evaluated by numerical methods. Stochastic (random) approximations, such as Markov chain Monte Carlo and Monte Carlo sampling, or deterministic approximations, such as quadrature, may be used.

Alternatively, the expression can be written as

: p(\theta\mid y) = \int p(\theta\mid\eta, y)\, p(\eta\mid y) \; d\eta = \int \frac{p(y\mid\theta)\, p(\theta\mid\eta)}{p(y\mid\eta)}\, p(\eta\mid y) \; d\eta \,,

and the final factor in the integral can in turn be expressed as

: p(\eta\mid y) = \int p(\eta\mid\theta)\, p(\theta\mid y) \; d\theta .

These suggest an iterative scheme, qualitatively similar in structure to a Gibbs sampler, to evolve successively improved approximations to p(\theta\mid y) and p(\eta\mid y). First, calculate an initial approximation to p(\theta\mid y) ignoring the \eta dependence completely; then calculate an approximation to p(\eta\mid y) based upon the initial approximate distribution of p(\theta\mid y); then use this p(\eta\mid y) to update the approximation for p(\theta\mid y); then update p(\eta\mid y); and so on.

When the true distribution p(\eta\mid y) is sharply peaked, the integral determining p(\theta\mid y) may be not much changed by replacing the probability distribution over \eta with a point estimate \eta^* representing the distribution's peak (or, alternatively, its mean),

: p(\theta\mid y) \simeq \frac{p(y\mid\theta)\, p(\theta\mid\eta^*)}{p(y\mid\eta^*)} \,.

With this approximation, the above iterative scheme becomes the EM algorithm.

The term "empirical Bayes" can cover a wide variety of methods, but most can be regarded as an early truncation of either the above scheme or something quite like it. Point estimates, rather than the whole distribution, are typically used for the parameter(s) \eta. The estimates for \eta^* are typically made from the first approximation to p(\theta\mid y) without subsequent refinement. These estimates for \eta^* are usually made without considering an appropriate prior distribution for \eta.
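As an illustration of the maximum marginal likelihood idea, the sketch below (in Python) fits the hyperparameters of an assumed beta-binomial setup; the model, sample size, and starting values are illustrative assumptions, not taken from the text. The counts y_i are binomial with latent success rates \theta_i drawn from a Beta(a, b) prior, \eta = (a, b) is chosen to maximise the marginal likelihood, and each \theta_i is then summarised by its conjugate posterior mean under \eta^*.

# A minimal sketch of empirical Bayes as maximum marginal likelihood, using an
# assumed beta-binomial setup: y_i | theta_i ~ Binomial(n, theta_i), theta_i ~ Beta(a, b).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

rng = np.random.default_rng(0)
n = 20                                                  # trials per unit (assumed)
theta = rng.beta(2.0, 5.0, size=500)                    # latent success rates (simulation only)
y = rng.binomial(n, theta)                              # observed counts

def neg_log_marginal(log_ab):
    a, b = np.exp(log_ab)                               # optimise on the log scale to keep a, b > 0
    return -betabinom.logpmf(y, n, a, b).sum()          # negative log marginal likelihood p(y | a, b)

res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)

# With eta* = (a_hat, b_hat) fixed, each theta_i has the conjugate posterior
# Beta(a_hat + y_i, b_hat + n - y_i); its mean is the empirical Bayes point estimate.
theta_eb = (a_hat + y) / (a_hat + b_hat + n)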


Point estimation


Robbins' method: non-parametric empirical Bayes (NPEB)

Robbins considered a case of sampling from a mixed distribution, where the probability for each y_i (conditional on \theta_i) is specified by a Poisson distribution,

: p(y_i\mid\theta_i) = \frac{\theta_i^{y_i}\, e^{-\theta_i}}{y_i!},

while the prior on θ is unspecified except that it is also i.i.d. from an unknown distribution, with cumulative distribution function G(\theta). Compound sampling arises in a variety of statistical estimation problems, such as accident rates and clinical trials. We simply seek a point prediction of \theta_i given all the observed data. Because the prior is unspecified, we seek to do this without knowledge of G.

Under squared error loss (SEL), the conditional expectation E(\theta_i \mid Y_i = y_i) is a reasonable quantity to use for prediction. For the Poisson compound sampling model, this quantity is

: \operatorname{E}(\theta_i\mid y_i) = \frac{\int (\theta^{y_i + 1} e^{-\theta} / y_i!) \, dG(\theta)}{\int (\theta^{y_i} e^{-\theta} / y_i!) \, dG(\theta)}.

This can be simplified by multiplying both the numerator and denominator by (y_i + 1), yielding

: \operatorname{E}(\theta_i\mid y_i) = \frac{(y_i + 1)\, p_G(y_i + 1)}{p_G(y_i)},

where p_G is the marginal distribution obtained by integrating out θ over G.

To take advantage of this, Robbins suggested estimating the marginals with their empirical frequencies (\#\{Y_j\}), yielding the fully non-parametric estimate

: \operatorname{E}(\theta_i\mid y_i) \approx (y_i + 1)\, \frac{\#\{Y_j = y_i + 1\}}{\#\{Y_j = y_i\}},

where \# denotes "number of". (See also Good–Turing frequency estimation.)

Example – Accident rates

Suppose each customer of an insurance company has an "accident rate" Θ and is insured against accidents; the probability distribution of Θ is the underlying distribution, and is unknown. The number of accidents suffered by each customer in a specified time period has a Poisson distribution with expected value equal to the particular customer's accident rate. The actual number of accidents experienced by a customer is the observable quantity. A crude way to estimate the underlying probability distribution of the accident rate Θ is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, ... accidents during the specified time period as the corresponding proportion in the observed random sample. Having done so, it is then desired to predict the accident rate of each customer in the sample. As above, one may use the conditional expected value of the accident rate Θ given the observed number of accidents during the baseline period. Thus, if a customer suffers six accidents during the baseline period, that customer's estimated accident rate is 7 × [the proportion of the sample who suffered 7 accidents] / [the proportion of the sample who suffered 6 accidents]. Note that if the proportion of people suffering k accidents is a decreasing function of k, the customer's predicted accident rate will often be lower than their observed number of accidents. This shrinkage effect is typical of empirical Bayes analyses.
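The sketch below (in Python) applies Robbins' formula to simulated accident counts. The gamma mixing distribution and the sample size are illustrative assumptions; the estimator itself never uses knowledge of G.

# A sketch of Robbins' non-parametric estimate
# E(theta_i | y_i) ~ (y_i + 1) * #{Y_j = y_i + 1} / #{Y_j = y_i},
# applied to simulated accident counts.
import numpy as np

rng = np.random.default_rng(1)
rates = rng.gamma(shape=2.0, scale=0.5, size=10_000)   # unknown mixing distribution G (assumed here)
accidents = rng.poisson(rates)                         # observed counts, one per customer

counts = np.bincount(accidents)                        # counts[k] = number of customers with k accidents

def robbins_estimate(y):
    """Predicted accident rate for a customer who had y accidents."""
    n_y = counts[y] if y < len(counts) else 0
    n_y_plus_1 = counts[y + 1] if y + 1 < len(counts) else 0
    if n_y == 0:
        return float("nan")                            # no customers observed at this count
    return (y + 1) * n_y_plus_1 / n_y

print([round(robbins_estimate(k), 3) for k in range(6)])

Because the empirical frequencies typically decrease in k, the predicted rates are pulled below the raw counts, which is the shrinkage effect described above.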


Parametric empirical Bayes

If the likelihood and its prior take on simple parametric forms (such as 1- or 2-dimensional likelihood functions with simple conjugate priors), then the empirical Bayes problem is only to estimate the marginal m(y\mid\eta) and the hyperparameters \eta using the complete set of empirical measurements. For example, one common approach, called parametric empirical Bayes point estimation, is to approximate the marginal using the maximum likelihood estimate (MLE), or a moments expansion, which allows one to express the hyperparameters \eta in terms of the empirical mean and variance. This simplified marginal allows one to plug the empirical averages into a point estimate for the prior \theta. The resulting equation for the prior \theta is greatly simplified, as shown below.

There are several common parametric empirical Bayes models, including the Poisson–gamma model (below), the beta-binomial model, the Gaussian–Gaussian model, the Dirichlet-multinomial model, as well as specific models for Bayesian linear regression and Bayesian multivariate linear regression. More advanced approaches include hierarchical Bayes models and Bayesian mixture models.


Gaussian–Gaussian model

For an example of empirical Bayes estimation using a Gaussian-Gaussian model, see Empirical Bayes estimators.
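As a compressed stand-in for that material, the sketch below (in Python) shows the standard normal–normal shrinkage calculation; the noise variance, sample size, and simulated values are assumptions. With y_i \mid \theta_i \sim N(\theta_i, \sigma^2) and \theta_i \sim N(\mu, \tau^2), each y_i is marginally N(\mu, \sigma^2 + \tau^2), so \mu and \tau^2 can be estimated from the sample mean and variance of the y_i, and each \theta_i is then shrunk toward the estimated grand mean.

# A minimal Gaussian-Gaussian empirical Bayes sketch (assumed model, not from the article text):
# y_i | theta_i ~ N(theta_i, sigma2) with sigma2 known, theta_i ~ N(mu, tau2).
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 1.0                                            # assumed known observation variance
theta = rng.normal(3.0, np.sqrt(0.5), size=200)         # latent means (simulation only)
y = rng.normal(theta, np.sqrt(sigma2))                  # one observation per unit

mu_hat = y.mean()                                       # marginal mean estimates mu
tau2_hat = max(y.var(ddof=1) - sigma2, 0.0)             # marginal variance minus noise variance estimates tau2

shrink = tau2_hat / (tau2_hat + sigma2)                 # weight given to each unit's own observation
theta_eb = mu_hat + shrink * (y - mu_hat)               # shrink each y_i toward the grand mean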


Poisson–gamma model

For instance, in the accident-rate example above, let the likelihood be a Poisson distribution, and let the prior now be specified by the conjugate prior, which is a gamma distribution G(\alpha,\beta) (where \eta = (\alpha,\beta)):

: \rho(\theta\mid\alpha,\beta) = \frac{\theta^{\alpha-1}\, e^{-\theta/\beta}}{\beta^{\alpha}\, \Gamma(\alpha)} \ \mathrm{for}\ \theta > 0,\ \alpha > 0,\ \beta > 0 \,.

It is straightforward to show the posterior is also a gamma distribution. Write

: \rho(\theta\mid y) \propto \rho(y\mid\theta)\, \rho(\theta\mid\alpha, \beta),

where the marginal distribution has been omitted since it does not depend explicitly on \theta. Expanding terms which do depend on \theta gives the posterior as:

: \rho(\theta\mid y) \propto (\theta^{y}\, e^{-\theta}) (\theta^{\alpha-1}\, e^{-\theta/\beta}) = \theta^{y + \alpha - 1}\, e^{-\theta (1 + 1/\beta)}.

So the posterior density is also a gamma distribution G(\alpha',\beta'), where \alpha' = y + \alpha and \beta' = (1 + 1/\beta)^{-1}. Also notice that the marginal is simply the integral of the posterior over all \theta, which turns out to be a negative binomial distribution.

To apply empirical Bayes, we will approximate the marginal using the maximum likelihood estimate (MLE). But since the posterior is a gamma distribution, the MLE of the marginal turns out to be just the mean of the posterior, which is the point estimate \operatorname{E}(\theta\mid y) we need. Recalling that the mean \mu of a gamma distribution G(\alpha', \beta') is simply \alpha' \beta', we have

: \operatorname{E}(\theta\mid y) = \alpha' \beta' = \frac{\bar{y} + \alpha}{1 + 1/\beta} = \frac{\beta}{1+\beta}\, \bar{y} + \frac{1}{1+\beta}\, (\alpha \beta).

To obtain the values of \alpha and \beta, empirical Bayes prescribes estimating the mean \alpha\beta and variance \alpha\beta^2 using the complete set of empirical data.

The resulting point estimate \operatorname{E}(\theta\mid y) is therefore like a weighted average of the sample mean \bar{y} and the prior mean \mu = \alpha\beta. This turns out to be a general feature of empirical Bayes: the point estimates for the prior (i.e. the mean) will look like weighted averages of the sample estimate and the prior estimate (likewise for estimates of the variance).
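A short sketch of this calculation (in Python) follows. It reads "estimating the mean \alpha\beta and variance \alpha\beta^2 from the data" as matching the marginal (negative binomial) mean and excess variance of the counts, which is one common choice rather than the only one; the simulated data are assumptions.

# Poisson-gamma empirical Bayes point estimate via moment matching (illustrative data).
import numpy as np

rng = np.random.default_rng(3)
theta = rng.gamma(shape=3.0, scale=0.8, size=5_000)     # latent rates (simulation only)
y = rng.poisson(theta)                                  # one count per unit

# Marginally, E[y] = alpha*beta and Var[y] = alpha*beta + alpha*beta^2,
# so the prior variance alpha*beta^2 is the excess of Var[y] over E[y].
mean_y = y.mean()
prior_var = max(y.var(ddof=1) - mean_y, 1e-12)          # estimate of alpha*beta^2
beta_hat = prior_var / mean_y                           # (alpha*beta^2) / (alpha*beta)
alpha_hat = mean_y / beta_hat                           # so that alpha_hat * beta_hat = mean_y

# Shrinkage point estimate: E(theta | y) = beta/(1+beta) * y + 1/(1+beta) * (alpha*beta)
w = beta_hat / (1.0 + beta_hat)
theta_eb = w * y + (1.0 - w) * (alpha_hat * beta_hat)

Here w = \beta/(1+\beta) is the weight on each unit's own count: a small estimated \beta (a tight prior) produces strong shrinkage toward the prior mean \alpha\beta.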


See also

* Bayes estimator
* Bayesian network
* Hyperparameter
* Hyperprior
* Best linear unbiased prediction
* Robbins lemma
* Spike-and-slab variable selection




External links


* Use of empirical Bayes Method in estimating road safety (North America)
* Empirical Bayes methods for missing data analysis
* http://www.biomedcentral.com/1471-2105/7/514/abstract/ A Hierarchical Naive Bayes Classifiers (for continuous and discrete variables)