Bayesian hierarchical modelling is a

statistical model A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of Sample (statistics), sample data (and similar data from a larger Statistical population, population). A statistical model repre ...

written in multiple levels (hierarchical form) that estimates the

parameters A parameter (), generally, is any characteristic that can help in defining or classifying a particular system (meaning an event, project, object, situation, etc.). That is, a parameter is an element of a system that is useful, or critical, when ...

of the

posterior distribution The posterior probability is a type of conditional probability that results from updating the prior probability with information summarized by the likelihood via an application of Bayes' rule. From an epistemological perspective, the posterior ...

using the

Bayesian method Bayesian inference ( or ) is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian inferen ...

.Allenby, Rossi, McCulloch (January 2005)
"Hierarchical Bayes Model: A Practitioner’s Guide"Journal of Bayesian Applications in Marketing
pp. 1–4. Retrieved 26 April 2014, p. 3 The sub-models combine to form the hierarchical model, and

Bayes' theorem Bayes' theorem (alternatively Bayes' law or Bayes' rule, after Thomas Bayes) gives a mathematical rule for inverting Conditional probability, conditional probabilities, allowing one to find the probability of a cause given its effect. For exampl ...

is used to integrate them with the observed data and account for all the uncertainty that is present. The result of this integration is it allows calculation of the posterior distribution of the

prior The term prior may refer to: * Prior (ecclesiastical), the head of a priory (monastery) * Prior convictions, the life history and previous convictions of a suspect or defendant in a criminal case * Prior probability, in Bayesian statistics * Prio ...

, providing an updated probability estimate.

Frequentist statistics Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or pro ...

may yield conclusions seemingly incompatible with those offered by Bayesian statistics due to the Bayesian treatment of the parameters as

random variable A random variable (also called random quantity, aleatory variable, or stochastic variable) is a Mathematics, mathematical formalization of a quantity or object which depends on randomness, random events. The term 'random variable' in its mathema ...

s and its use of subjective information in establishing assumptions on these parameters. As the approaches answer different questions the formal results aren't technically contradictory but the two approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision-making and updating beliefs cannot be ignored and that hierarchical modeling has the potential to overrule classical methods in applications where respondents give multiple observational data. Moreover, the model has proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors. Hierarchical modeling, as its name implies, retains nested data structure, and is used when information is available at several different levels of observational units. For example, in epidemiological modeling to describe infection trajectories for multiple countries, observational units are countries, and each country has its own time-based profile of daily infected cases. In

decline curve analysis Decline curve analysis is a means of predicting future oil well or gas well production based on past production history. Production decline curve analysis is a traditional means of identifying well production problems and predicting well performan ...

to describe oil or gas production decline curve for multiple wells, observational units are oil or gas wells in a reservoir region, and each well has each own time-based profile of oil or gas production rates (usually, barrels per month). Hierarchical modeling is used to devise computatation based strategies for multiparameter problems.

Philosophy

Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence of the joint probability model for these parameters. Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. Amidst this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, “The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality.” These subjective probabilities are more directly involved in the mind rather than the physical probabilities. Hence, it is with this need of updating beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event.

Bayes' theorem

The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options. Suppose in a study of the effectiveness of cardiac treatments, with the patients in hospital ''j'' having survival probability

\theta_j

, the survival probability will be updated with the occurrence of ''y'', the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients. In order to make updated probability statements about

\theta_j

, given the occurrence of event ''y'', we must begin with a model providing a

joint probability distribution A joint or articulation (or articular surface) is the connection made between bones, ossicles, or other hard structures in the body which link an animal's skeletal system into a functional whole.Saladin, Ken. Anatomy & Physiology. 7th ed. McGraw- ...

for

\theta_j

and ''y''. This can be written as a product of the two distributions that are often referred to as the prior distribution

P(\theta)

and the

sampling distribution In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given random-sample-based statistic. For an arbitrarily large number of samples where each sample, involving multiple observations (data poi ...

P(y\mid\theta)

respectively: :

P(\theta, y) = P(\theta)P(y\mid\theta)

Using the basic property of

conditional probability In probability theory, conditional probability is a measure of the probability of an Event (probability theory), event occurring, given that another event (by assumption, presumption, assertion or evidence) is already known to have occurred. This ...

, the posterior distribution will yield: :

P(\theta\mid y)=\frac = \frac

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to deconstruct the probability,

P(\theta\mid y)

, relative to solvable subsets of its supportive evidence.

Exchangeability

The usual starting point of a statistical analysis is the assumption that the ''n'' values

y_1, y_2, \ldots, y_n

are exchangeable. If no information – other than data ''y'' – is available to distinguish any of the

\theta_j

’s from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry of prior distribution parameters. This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector

\theta

, with distribution

P(\theta)

Finite exchangeability

For a fixed number ''n'', the set

y_1, y_2, \ldots, y_n

is exchangeable if the joint probability

P(y_1, y_2, \ldots, y_n)

is invariant under

permutation In mathematics, a permutation of a set can mean one of two different things: * an arrangement of its members in a sequence or linear order, or * the act or process of changing the linear order of an ordered set. An example of the first mean ...

s of the indices. That is, for every permutation

\pi

(\pi_1,  \pi_2, \ldots, \pi_n)

of (1, 2, …, ''n''),

P(y_1, y_2, \ldots, y_n)= P(y_, y_, \ldots, y_).

The following is an exchangeable, but not independent and identical (iid), example: Consider an urn with a red ball and a blue ball inside, with probability

\frac

of drawing either. Balls are drawn without replacement, i.e. after one ball is drawn from the ''n'' balls, there will be ''n'' − 1 remaining balls left for the next draw. :

\text
Y_i =
\begin
1, & \texti\text,\\
0, & \text.
\end

The probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red on the second, both of which are 1/2: :

(y_1 = 1, y_2 =0) = P(y_1=0,y_2=1)= \frac /math>)

This makes y_1 and y_2 exchangeable.

But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first is 0. This is not equal to the probability that the red ball is selected in the second draw, which is 1/2:

: (y_2=1\mid y_1=1)=0 \ne P(y_2=1)=  \frac /math>). 

Thus, y_1 and y_2 are not independent.

If x_1, \ldots, x_n are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. Diaconis, Freedman (1980)

“Finite exchangeable sequences”
Annals of Probability, pp. 745–747

Infinite exchangeability

Infinite exchangeability is the property that every finite subset of an infinite sequence

y_1

y_2, \ldots

is exchangeable. For any ''n'', the sequence

y_1, y_2, \ldots, y_n

is exchangeable.

Hierarchical models

Components

Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely: # Hyperparameters: parameters of the prior distribution #

Hyperprior In Bayesian statistics, a hyperprior is a prior distribution on a hyperparameter, that is, on a parameter of a prior distribution. As with the term ''hyperparameter,'' the use of ''hyper'' is to distinguish it from a prior distribution of a para ...

s: distributions of Hyperparameters Suppose a random variable ''Y'' follows a normal distribution with parameter

\theta

as the

mean A mean is a quantity representing the "center" of a collection of numbers and is intermediate to the extreme values of the set of numbers. There are several kinds of means (or "measures of central tendency") in mathematics, especially in statist ...

and 1 as the

variance In probability theory and statistics, variance is the expected value of the squared deviation from the mean of a random variable. The standard deviation (SD) is obtained as the square root of the variance. Variance is a measure of dispersion ...

, that is

Y\mid \theta \sim N(\theta,1)

. The

tilde The tilde (, also ) is a grapheme or with a number of uses. The name of the character came into English from Spanish , which in turn came from the Latin , meaning 'title' or 'superscription'. Its primary use is as a diacritic (accent) in ...

relation

\sim

can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter

\theta

has a distribution given by a

normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac ...

with mean

\mu

and variance 1, i.e.

\theta\mid\mu \sim N(\mu,1)

. Furthermore,

\mu

follows another distribution given, for example, by the

standard normal distribution In probability theory and statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is f(x) = \frac e^ ...

\text(0,1)

. The parameter

\mu

is called the hyperparameter, while its distribution given by

\text(0,1)

is an example of a hyperprior distribution. The notation of the distribution of ''Y'' changes as another parameter is added, i.e.

Y \mid \theta,\mu \sim  N(\theta,1)

. If there is another stage, say,

\mu

following another normal distribution with a mean of

\beta

and a variance of

\epsilon

, then

\mu \sim N(\beta,\epsilon)

\mbox

\beta

and

\epsilon

can also be called hyperparameters with hyperprior distributions.

Framework

Let

y_j

be an observation and

\theta_j

a parameter governing the data generating process for

y_j

. Assume further that the parameters

\theta_1, \theta_2, \ldots, \theta_j

are generated exchangeably from a common population, with distribution governed by a hyperparameter

\phi

.
The Bayesian hierarchical model contains the following stages: :

\text y_j\mid\theta_j,\phi \sim P(y_j\mid\theta_j,\phi)

\text \theta_j\mid\phi \sim P(\theta_j\mid\phi)

\text \phi \sim P(\phi)

The likelihood, as seen in stage I is

P(y_j\mid\theta_j,\phi)

, with

P(\theta_j,\phi)

as its prior distribution. Note that the likelihood depends on

\phi

only through

\theta_j

. The prior distribution from stage I can be broken down into: :

P(\theta_j,\phi) = P(\theta_j\mid\phi)P(\phi)

'' rom the definition of conditional probability' With

\phi

as its hyperparameter with hyperprior distribution,

P(\phi)

. Thus, the posterior distribution is proportional to: :

P(\phi,\theta_j\mid y)  \propto P(y_j \mid\theta_j,\phi) P(\theta_j,\phi)

'' sing Bayes' Theorem' :

P(\phi,\theta_j\mid y)  \propto P(y_j\mid\theta_j ) P(\theta_j \mid\phi ) P(\phi)

Bernardo, Degroot, Lindley (September 1983)
“Proceedings of the Second Valencia International Meeting”Bayesian Statistics 2
Amsterdam: Elsevier Science Publishers B.V, , pp. 371–372

Example calculation

As an example, a teacher wants to estimate how well a student did on the

SAT The SAT ( ) is a standardized test widely used for college admissions in the United States. Since its debut in 1926, its name and Test score, scoring have changed several times. For much of its history, it was called the Scholastic Aptitude Test ...

. The teacher uses the current

grade point average Grading in education is the application of standardized Measurement, measurements to evaluate different levels of student achievement in a course. Grades can be expressed as letters (usually A to F), as a range (for example, 1 to 6), percentage ...

(GPA) of the student for an estimate. Their current GPA, denoted by

Y

, has a likelihood given by some probability function with parameter

\theta

, i.e.

Y\mid\theta \sim P(Y\mid\theta)

. This parameter

\theta

is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter

\phi

, which is the high school grade of the student (freshman, sophomore, junior or senior). That is,

\theta\mid\phi \sim P(\theta\mid\phi)

. Moreover, the hyperparameter

\phi

follows its own distribution given by

P(\phi)

, a hyperprior. These relationships can be used to calculate the likelihood of a specific SAT score relative to a particular GPA: :

P(\theta,\phi\mid Y) \propto P(Y\mid\theta,\phi)P(\theta,\phi)

P(\theta,\phi\mid Y) \propto P(Y\mid\theta)P(\theta\mid\phi)P(\phi)

All information in the problem will be used to solve for the posterior distribution. Instead of solving only using the prior distribution and the likelihood function, using hyperpriors allows a more nuanced distinction of relationships between given variables. Box G. E. P., Tiao G. C. (1965)
"Multiparameter problem from a bayesian point of view"Multiparameter Problems From A Bayesian Point of View Volume 36 Number 5
New York City: John Wiley & Sons,

2-stage hierarchical model

In general, the joint posterior distribution of interest in 2-stage hierarchical models is: :

P(\theta,\phi\mid Y) =  =

P(\theta,\phi\mid Y) \propto P(Y\mid\theta)P(\theta\mid\phi)P(\phi)

3-stage hierarchical model

For 3-stage hierarchical models, the posterior distribution is given by: :

P(\theta,\phi, X\mid Y) =

P(\theta,\phi, X\mid Y) \propto P(Y\mid\theta)P(\theta\mid\phi)P(\phi\mid X)P(X)

Bayesian nonlinear mixed-effects model

A three stage version of Bayesian hierarchical modeling could be used to calculate probability at 1) an individual level, 2) at the level of population and 3) the prior, which is an assumed probability distribution that takes place before evidence is initially acquired: ''Stage 1: Individual-Level Model''

_ = f(t_;\theta_,\theta_,\ldots,\theta_,\ldots,\theta_ ) + \epsilon_,\quad \epsilon_ \sim  N(0, \sigma^2), \quad i =1,\ldots, N, \, j = 1,\ldots, M_i.

''Stage 2: Population Model''

\theta_= \alpha_l + \sum_^\beta_x_ + \eta_, \quad  \eta_ \sim N(0, \omega_l^2), \quad i =1,\ldots, N, \, l=1,\ldots, K.

''Stage 3: Prior''

\sigma^2 \sim \pi(\sigma^2),\quad  \alpha_l \sim \pi(\alpha_l),  \quad (\beta_,\ldots,\beta_,\ldots,\beta_) \sim  \pi(\beta_,\ldots,\beta_,\ldots,\beta_), \quad \omega_l^2 \sim \pi(\omega_l^2), \quad l=1,\ldots, K.

Here,

y_

denotes the continuous response of the

i

-th subject at the time point

t_

, and

x_

is the

b

-th covariate of the

i

-th subject. Parameters involved in the model are written in Greek letters. The variable

f(t ; \theta_,\ldots,\theta_)

is a known function parameterized by the

K

-dimensional vector

(\theta_,\ldots,\theta_)

. Typically,

f

is a `nonlinear' function and describes the temporal trajectory of individuals. In the model,

\epsilon_

and

\eta_

describe within-individual variability and between-individual variability, respectively. If the prior is not considered, the relationship reduces to a frequentist nonlinear mixed-effect model. A central task in the application of the Bayesian nonlinear mixed-effect models is to evaluate posterior density:

\pi(\_^,\sigma^2, \_^K, \_^,\_^K ,  \_^)

\propto \pi(\_^, \_^,\sigma^2, \_^K, \_^,\_^K)

= \underbrace_
\times 
\underbrace_
\times  
\underbrace_

The panel on the right displays Bayesian research cycle using Bayesian nonlinear mixed-effects model. A research cycle using the Bayesian nonlinear mixed-effects model comprises two steps: (a) standard research cycle and (b) Bayesian-specific workflow. A standard research cycle involves 1) literature review, 2) defining a problem and 3) specifying the research question and hypothesis. Bayesian-specific workflow stratifies this approach to include three sub-steps: (b)–(i) formalizing prior distributions based on background knowledge and prior elicitation; (b)–(ii) determining the likelihood function based on a nonlinear function

f

; and (b)–(iii) making a posterior inference. The resulting posterior inference can be used to start a new research cycle.

References

{{Reflist, 2 Bayesian networks