prior probabilities
   HOME

TheInfoList



OR:

In
Bayesian Thomas Bayes (/beɪz/; c. 1701 – 1761) was an English statistician, philosopher, and Presbyterian minister. Bayesian () refers either to a range of concepts and approaches that relate to statistical methods based on Bayes' theorem, or a followe ...
statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a
parameter A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...
of the model or a
latent variable In statistics, latent variables (from Latin: present participle of ''lateo'', “lie hidden”) are variables that can only be inferred indirectly through a mathematical model from other observable variables that can be directly observed or me ...
rather than an
observable variable In physics, an observable is a physical quantity that can be measured. Examples include position and momentum. In systems governed by classical mechanics, it is a real-valued "function" on the set of all possible system states. In quantum phys ...
. Bayes' theorem calculates the renormalized pointwise product of the prior and the
likelihood function The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
, to produce the ''
posterior probability distribution The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. From an epistemological perspective, the posterior ...
'', which is the conditional distribution of the uncertain quantity given the data. Similarly, the prior probability of a random event or an uncertain proposition is the unconditional probability that is assigned before any relevant evidence is taken into account. Priors can be created using a number of methods. A prior can be determined from past information, such as previous experiments. A prior can be ''elicited'' from the purely subjective assessment of an experienced expert. An ''uninformative prior'' can be created to reflect a balance among outcomes when no information is available. Priors can also be chosen according to some principle, such as symmetry or maximizing entropy given constraints; examples are the
Jeffreys prior In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher info ...
or Bernardo's reference prior. When a family of ''
conjugate prior In Bayesian probability theory, if the posterior distribution p(\theta \mid x) is in the same probability distribution family as the prior probability distribution p(\theta), the prior and posterior are then called conjugate distributions, and th ...
s'' exists, choosing a prior from that family simplifies calculation of the posterior distribution. Parameters of prior distributions are a kind of ''
hyperparameter In Bayesian statistics, a hyperparameter is a parameter of a prior distribution; the term is used to distinguish them from parameters of the model for the underlying system under analysis. For example, if one is using a beta distribution to mo ...
''. For example, if one uses a
beta distribution In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval , 1in terms of two positive parameters, denoted by ''alpha'' (''α'') and ''beta'' (''β''), that appear as ...
to model the distribution of the parameter ''p'' of a
Bernoulli distribution In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli,James Victor Uspensky: ''Introduction to Mathematical Probability'', McGraw-Hill, New York 1937, page 45 is the discrete probabi ...
, then: * ''p'' is a parameter of the underlying system (Bernoulli distribution), and * ''α'' and ''β'' are parameters of the prior distribution (beta distribution); hence ''hyper''parameters. Hyperparameters themselves may have
hyperprior In Bayesian statistics, a hyperprior is a prior distribution on a hyperparameter, that is, on a parameter of a prior distribution. As with the term ''hyperparameter,'' the use of ''hyper'' is to distinguish it from a prior distribution of a param ...
distributions expressing beliefs about their values. A Bayesian model with more than one level of prior like this is called a
hierarchical Bayes model Multilevel models (also known as hierarchical linear models, linear mixed-effect model, mixed models, nested data models, random coefficient, random-effects models, random parameter models, or split-plot designs) are statistical models of parame ...
.


Informative priors

An ''informative prior'' expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
with expected value equal to today's noontime temperature, with
variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbe ...
equal to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that day of the year. This example has a property in common with many priors, namely, that the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature); pre-existing evidence which has already been taken into account is part of the prior and, as more evidence accumulates, the posterior is determined largely by the evidence rather than any original assumption, provided that the original assumption admitted the possibility of what the evidence is suggesting. The terms "prior" and "posterior" are generally relative to a specific datum or observation.


Weakly informative priors

A ''weakly informative prior'' expresses partial information about a variable. An example is, when setting the prior distribution for the temperature at noon tomorrow in St. Louis, to use a normal distribution with mean 50 degrees Fahrenheit and standard deviation 40 degrees, which very loosely constrains the temperature to the range (10 degrees, 90 degrees) with a small chance of being below -30 degrees or above 130 degrees. The purpose of a weakly informative prior is for regularization, that is, to keep inferences in a reasonable range.


Uninformative priors

An uninformative, flat, or diffuse prior expresses vague or general information about a variable. The term "uninformative prior" is somewhat of a misnomer. Such a prior might also be called a ''not very informative prior'', or an ''objective prior'', i.e. one that's not subjectively elicited. Uninformative priors can express "objective" information such as "the variable is positive" or "the variable is less than some limit". The simplest and oldest rule for determining a non-informative prior is the
principle of indifference The principle of indifference (also called principle of insufficient reason) is a rule for assigning epistemic probabilities. The principle of indifference states that in the absence of any relevant evidence, agents should distribute their cre ...
, which assigns equal probabilities to all possibilities. In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior. Some attempts have been made at finding a priori probabilities, i.e. probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy, with Bayesians being roughly divided into two schools: "objective Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who believe that in practice priors usually represent subjective judgements of opinion that cannot be rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism were given by Edwin T. Jaynes, based mainly on the consequences of symmetries and on the principle of maximum entropy. As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a ball has been hidden under one of three cups, A, B, or C, but no other information is available about its location. In this case a ''uniform prior'' of ''p''(''A'') = ''p''(''B'') = ''p''(''C'') = 1/3 seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which a permutation of the labels would cause a change in our predictions about which cup the ball will be found under; the uniform prior is the only one which preserves this invariance. If one accepts this invariance principle then one can see that the uniform prior is the logically correct prior to represent this state of knowledge. This prior is "objective" in the sense of being the correct choice to represent a particular state of knowledge, but it is not objective in the sense of being an observer-independent feature of the world: in reality the ball exists under a particular cup, and it only makes sense to speak of probabilities in this situation if there is an observer with limited knowledge about the system. As a more contentious example, Jaynes published an argument (Jaynes 1968) based on the invariance of the prior under a change of parameters that suggests that the prior representing complete uncertainty about a probability should be the Haldane prior ''p''−1(1 − ''p'')−1. The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior gives by far the most weight to p=0 and p=1, indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed samples of the chemical to dissolve in one experiment and not to dissolve in another experiment then this prior is updated to the uniform distribution on the interval
, 1 The comma is a punctuation mark that appears in several variants in different languages. It has the same shape as an apostrophe or single closing quotation mark () in many typefaces, but it differs from them in being placed on the baseline o ...
This is obtained by applying Bayes' theorem to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior is an improper prior distribution (meaning that it has an infinite mass).
Harold Jeffreys Sir Harold Jeffreys, FRS (22 April 1891 – 18 March 1989) was a British mathematician, statistician, geophysicist, and astronomer. His book, ''Theory of Probability'', which was first published in 1939, played an important role in the revival ...
devised a systematic way for designing uninformative priors as e.g.,
Jeffreys prior In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher info ...
''p''−1/2(1 − ''p'')−1/2 for the Bernoulli random variable. Priors can be constructed which are proportional to the Haar measure if the parameter space ''X'' carries a natural group structure which leaves invariant our Bayesian state of knowledge (Jaynes, 1968). This can be seen as a generalisation of the invariance principle used to justify the uniform prior over the three cups in the example above. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the
translation group In Euclidean geometry, a translation is a geometric transformation that moves every point of a figure, shape or space by the same distance in a given direction. A translation can also be interpreted as the addition of a constant vector to every ...
on ''X'', which determines the prior probability as a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (e.g., whether centimeters or inches are used, the physical results should be equal). In such a case, the scale group is the natural group structure, and the corresponding prior on ''X'' is proportional to 1/''x''. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left and right invariant Haar measures on the
affine group In mathematics, the affine group or general affine group of any affine space over a field is the group of all invertible affine transformations from the space into itself. It is a Lie group if is the real or complex field or quaternions. Rela ...
are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice. Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy (MAXENT). The motivation is that the
Shannon entropy Shannon may refer to: People * Shannon (given name) * Shannon (surname) * Shannon (American singer), stage name of singer Shannon Brenda Greene (born 1958) * Shannon (South Korean singer), British-South Korean singer and actress Shannon Arrum W ...
of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on ''X'', one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and unit variance is the standard
normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...
. The principle of '' minimum cross-entropy'' generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitable constraints in the maximum-entropy sense. A related idea, reference priors, was introduced by José-Miguel Bernardo. Here, the idea is to maximize the expected
Kullback–Leibler divergence In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy and I-divergence), denoted D_\text(P \parallel Q), is a type of statistical distance: a measure of how one probability distribution ''P'' is different fr ...
of the posterior distribution relative to the prior. This maximizes the expected posterior information about ''X'' when the prior density is ''p''(''x''); thus, in some sense, ''p''(''x'') is the "least informative" prior about X. The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. In the present case, the KL divergence between the prior and posterior distributions is given by : KL = \int p(t) \int p(x\mid t) \log\frac \, dx \, dt. Here, t is a sufficient statistic for some parameter x . The inner integral is the KL divergence between the posterior p(x\mid t) and prior p(x) distributions and the result is the weighted mean over all values of t . Splitting the logarithm into two parts, reversing the order of integrals in the second part and noting that \log \, (x) does not depend on t yields : KL = \int p(t) \int p(x\mid t) \log (x\mid t)\, dx \, dt \, - \, \int \log (x)\, \int p(t) p(x\mid t) \, dt \, dx. The inner integral in the second part is the integral over t of the joint density p(x, t) . This is the marginal distribution p(x) , so we have : KL = \int p(t) \int p(x\mid t) \log (x\mid t)\, dx \, dt \, - \, \int p(x) \log (x)\, dx. Now we use the concept of entropy which, in the case of probability distributions, is the negative expected value of the logarithm of the probability mass or density function or H(x) = - \int p(x) \log (x)\, dx. Using this in the last equation yields : KL = - \int p(t)H(x\mid t) \, dt + \, H(x). In words, KL is the negative expected value over t of the entropy of x conditional on t plus the marginal (i.e. unconditional) entropy of x . In the limiting case where the sample size tends to infinity, the Bernstein-von Mises theorem states that the distribution of x conditional on a given observed value of t is normal with a variance equal to the reciprocal of the Fisher information at the 'true' value of x . The entropy of a normal density function is equal to half the logarithm of 2 \pi ev where v is the variance of the distribution. In this case therefore H = \log \sqrt where N is the arbitrarily large sample size (to which Fisher information is proportional) and x* is the 'true' value. Since this does not depend on t it can be taken out of the integral, and as this integral is over a probability space it equals one. Hence we can write the asymptotic form of KL as : KL = - \log \sqrt- \, \int p(x) \log (x)\, dx. where k is proportional to the (asymptotically large) sample size. We do not know the value of x* . Indeed, the very idea goes against the philosophy of Bayesian inference in which 'true' values of parameters are replaced by prior and posterior distributions. So we remove x* by replacing it with x and taking the expected value of the normal entropy, which we obtain by multiplying by p(x) and integrating over x . This allows us to combine the logarithms yielding : KL = - \int p(x) \log (x) / \sqrt\, dx. This is a quasi-KL divergence ("quasi" in the sense that the square root of the Fisher information may be the kernel of an improper distribution). Due to the minus sign, we need to minimise this in order to maximise the KL divergence with which we started. The minimum value of the last equation occurs where the two distributions in the logarithm argument, improper or not, do not diverge. This in turn occurs when the prior distribution is proportional to the square root of the Fisher information of the likelihood function. Hence in the single parameter case, reference priors and Jeffreys priors are identical, even though Jeffreys has a very different rationale. Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys' rule) may result in priors with problematic behavior. Objective prior distributions may also be derived from other principles, such as
information Information is an abstract concept that refers to that which has the power to inform. At the most fundamental level information pertains to the interpretation of that which may be sensed. Any natural process that is not completely random ...
or
coding theory Coding theory is the study of the properties of codes and their respective fitness for specific applications. Codes are used for data compression, cryptography, error detection and correction, data transmission and data storage. Codes are studied ...
(see e.g.
minimum description length Minimum Description Length (MDL) is a model selection principle where the shortest description of the data is the best model. MDL methods learn through a data compression perspective and are sometimes described as mathematical applications of Occa ...
) or frequentist statistics (see frequentist matching). Such methods are used in
Solomonoff's theory of inductive inference Solomonoff's theory of inductive inference is a mathematical proof that if a universe is generated by an algorithm, then observations of that universe, encoded as a dataset, are best predicted by the smallest executable archive of that dataset. T ...
. Constructing objective priors have been recently introduced in bioinformatics, and specially inference in cancer systems biology, where sample size is limited and a vast amount of prior knowledge is available. In these methods, either an information theory based criterion, such as KL divergence or log-likelihood function for binary supervised learning problems and mixture model problems. Philosophical problems associated with uninformative priors are associated with the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is proportional to the reciprocal of the first prior. These are very different priors, but it is not clear which is to be preferred. Jaynes' often-overlooked method of transformation groups can answer this question in some situations. Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely, and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, the , which is the uniform prior on the logarithm of proportion. The
Jeffreys prior In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space; its density function is proportional to the square root of the determinant of the Fisher info ...
attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion ''p'' is ''p''−1/2(1 − ''p'')−1/2, which differs from Jaynes' recommendation. Priors based on notions of
algorithmic probability In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. It is used in induc ...
are used in
inductive inference Inductive reasoning is a method of reasoning in which a general principle is derived from a body of observations. It consists of making broad generalizations based on specific observations. Inductive reasoning is distinct from ''deductive'' rea ...
as a basis for induction in very general settings. Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used ''routinely'', i.e., with many different data sets, it should have good
frequentist Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or pro ...
properties. Normally a
Bayesian Thomas Bayes (/beɪz/; c. 1701 – 1761) was an English statistician, philosopher, and Presbyterian minister. Bayesian () refers either to a range of concepts and approaches that relate to statistical methods based on Bayes' theorem, or a followe ...
would not be concerned with such issues, but it can be important in this situation. For example, one would want any
decision rule In decision theory, a decision rule is a function which maps an observation to an appropriate action. Decision rules play an important role in the theory of statistics and economics, and are closely related to the concept of a strategy in game t ...
based on the posterior distribution to be admissible under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with
hierarchical Bayes model Multilevel models (also known as hierarchical linear models, linear mixed-effect model, mixed models, nested data models, random coefficient, random-effects models, random parameter models, or split-plot designs) are statistical models of parame ...
s; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.


Improper priors

Let events A_1, A_2, \ldots, A_n be mutually exclusive and exhaustive. If Bayes' theorem is written as :P(A_i\mid B) = \frac\, , then it is clear that the same result would be obtained if all the prior probabilities ''P''(''A''''i'') and ''P''(''A''''j'') were multiplied by a given constant; the same would be true for a
continuous random variable In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon ...
. If the summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors may only need to be specified in the correct proportion. Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior. However, the posterior distribution need not be a proper distribution if the prior is improper. This is clear from the case where event ''B'' is independent of all of the ''A''''j''. Statisticians sometimes use improper priors as
uninformative prior In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into ...
s. For example, if they need a prior distribution for the mean and variance of a random variable, they may assume ''p''(''m'', ''v'') ~ 1/''v'' (for ''v'' > 0) which would suggest that any value for the mean is "equally likely" and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of over-interpreting those priors since they are not probability densities. The only relevance they have is found in the corresponding posterior, as long as it is well-defined for all observations. (The Haldane prior is a typical counterexample.) By contrast,
likelihood function The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
s do not need to be integrated, and a likelihood function that is uniformly 1 corresponds to the absence of data (all models are equally likely, given no data): Bayes' rule multiplies a prior by the likelihood, and an empty product is just the constant likelihood 1. However, without starting with a prior probability distribution, one does not end up getting a posterior probability distribution, and thus cannot integrate or compute expected values or loss. See for details.


Examples

Examples of improper priors include: * The uniform distribution on an infinite interval (i.e., a half-line or the entire real line). * Beta(0,0), the
beta distribution In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval , 1in terms of two positive parameters, denoted by ''alpha'' (''α'') and ''beta'' (''β''), that appear as ...
for α=0, β=0 (uniform distribution on
log-odds In statistics, the logit ( ) function is the quantile function associated with the standard logistic distribution. It has many uses in data analysis and machine learning, especially in data transformations. Mathematically, the logit is the ...
scale). * The logarithmic prior on the
positive reals In mathematics, the set of positive real numbers, \R_ = \left\, is the subset of those real numbers that are greater than zero. The non-negative real numbers, \R_ = \left\, also include zero. Although the symbols \R_ and \R^ are ambiguously used fo ...
(uniform distribution on
log scale A logarithmic scale (or log scale) is a way of displaying numerical data over a very wide range of values in a compact way—typically the largest numbers in the data are hundreds or even thousands of times larger than the smallest numbers. Such a ...
). These functions, interpreted as uniform distributions, can also be interpreted as the
likelihood function The likelihood function (often simply called the likelihood) represents the probability of random variable realizations conditional on particular values of the statistical parameters. Thus, when evaluated on a given sample, the likelihood funct ...
in the absence of data, but are not proper priors.


See also

*
Base rate In probability and statistics, the base rate (also known as prior probabilities) is the class of probabilities unconditional on "featural evidence" (likelihoods). For example, if 1% of the population were medical professionals, and remaining ...
*
Bayesian epistemology Bayesian epistemology is a formal approach to various topics in epistemology that has its roots in Thomas Bayes' work in the field of probability theory. One advantage of its formal method in contrast to traditional epistemology is that its concep ...
* Strong prior


Notes


References

* * * * * * * ** Reprinted in * * {{DEFAULTSORT:Prior Probability Bayesian statistics Probability assessment