In statistics, Markov chain Monte Carlo (MCMC) is a class of algorithms used to draw samples from a probability distribution. Given a probability distribution, one can construct a Markov chain whose elements' distribution approximates it – that is, the Markov chain's equilibrium distribution matches the target distribution. The more steps that are included, the more closely the distribution of the sample matches the actual desired distribution.
Markov chain Monte Carlo methods are used to study probability distributions that are too complex or too high-dimensional to study with analytic techniques alone. Various algorithms exist for constructing such Markov chains, including the Metropolis–Hastings algorithm.
General explanation

Markov chain Monte Carlo methods create samples from a continuous random variable, with probability density proportional to a known function. These samples can be used to evaluate an integral over that variable, as its expected value or variance.
Practically, an
ensemble of chains is generally developed, starting from a set of points arbitrarily chosen and sufficiently distant from each other. These chains are
stochastic processes of "walkers" which move around randomly according to an algorithm that looks for places with a reasonably high contribution to the integral to move into next, assigning them higher probabilities.
Random walk Monte Carlo methods are a kind of random simulation or Monte Carlo method. However, whereas the random samples of the integrand used in a conventional Monte Carlo integration are statistically independent, those used in MCMC are autocorrelated. Correlation between samples introduces the need to use the Markov chain central limit theorem when estimating the error of mean values.
These algorithms create Markov chains such that they have an equilibrium distribution which is proportional to the function given.
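As a concrete illustration of the random-walk picture above, the following minimal Python sketch implements a random-walk Metropolis sampler for a one-dimensional target known only up to a normalizing constant (the function name <code>random_walk_metropolis</code>, the step size, and the chain length are illustrative choices, not a reference implementation):
<syntaxhighlight lang="python">
import numpy as np

def random_walk_metropolis(log_target, x0, n_steps, step_size=1.0, rng=None):
    """Minimal random-walk Metropolis sampler for a 1-D target density
    known only up to a normalizing constant (log_target returns log f(x))."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.empty(n_steps)
    x, logf_x = x0, log_target(x0)
    for i in range(n_steps):
        # Symmetric Gaussian proposal centred at the current state
        y = x + step_size * rng.standard_normal()
        logf_y = log_target(y)
        # Accept with probability min(1, f(y) / f(x))
        if np.log(rng.random()) < logf_y - logf_x:
            x, logf_x = y, logf_y
        samples[i] = x
    return samples

# Example: sample from an unnormalized standard normal density
draws = random_walk_metropolis(lambda x: -0.5 * x**2, x0=0.0, n_steps=10_000)
</syntaxhighlight>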
History
The development of MCMC methods is deeply rooted in the early exploration of Monte Carlo (MC) techniques in the mid-20th century, particularly in physics, marked by the Metropolis algorithm proposed by Nicholas Metropolis, Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller, and Edward Teller in 1953, designed to tackle high-dimensional integration problems using early computers.
W. K. Hastings generalized this algorithm in 1970 and inadvertently introduced the component-wise updating idea later known as
Gibbs sampling, while theoretical foundations for Gibbs sampling, such as the
Hammersley–Clifford theorem (published via
Julian Besag's 1974 paper), were also developing. Although the seeds of MCMC were sown earlier, including the formal naming of Gibbs sampling in image processing by
Stuart Geman and
Donald Geman (1984) and the
data augmentation method by Martin A. Tanner and
Wing Hung Wong (1987), its "revolution" in mainstream statistics largely followed demonstrations of the universality and ease of implementation of sampling methods (especially Gibbs sampling) for complex statistical (particularly
Bayesian) problems, spurred by increasing computational power and software like
BUGS. This transformation was accompanied by significant theoretical advancements, such as
Luke Tierney's (1994) rigorous treatment of MCMC convergence, and
Jun S. Liu, Wong, and
Augustine Kong's (1994, 1995) analysis of Gibbs sampler structure. Subsequent developments further expanded the MCMC toolkit, including
particle filters (
Sequential Monte Carlo) for sequential problems,
perfect sampling aiming for exact simulation (
Jim Propp and David B. Wilson, 1996),
RJMCMC (
Peter J. Green, 1995) for handling variable-dimension models, and deeper investigations into convergence diagnostics and the
central limit theorem. Overall, the evolution of MCMC represents a paradigm shift in statistical computation, enabling the analysis of numerous previously intractable complex models and continually expanding the scope and impact of statistics.
Mathematical Setting
Suppose <math>(X_n)</math> is a Markov chain in a general state space <math>\mathsf{X}</math> with specific properties. We are interested in the limiting behavior of the partial sums
:<math>S_n(h) = \frac{1}{n}\sum_{i=1}^{n} h(X_i)</math>
as ''n'' goes to infinity. Particularly, we hope to establish the Law of Large Numbers and the Central Limit Theorem for MCMC. In the following, we state some definitions and theorems necessary for the important convergence results. In short, we need the existence of an invariant measure and Harris recurrence to establish the Law of Large Numbers for MCMC (the Ergodic Theorem), and we need aperiodicity, irreducibility and extra conditions such as reversibility to ensure that the Central Limit Theorem holds for MCMC.
[Robert and Casella (2004), pp. 205–246]
Irreducibility and Aperiodicity
Recall that in the discrete setting, a Markov chain is said to be ''irreducible'' if it is possible to reach any state from any other state in a finite number of steps with positive probability. However, in the continuous setting, point-to-point transitions have zero probability. In this case, φ-irreducibility generalizes irreducibility by using a reference measure φ on the measurable space <math>(\mathsf{X}, \mathcal{B}(\mathsf{X}))</math>.
;Definition (φ-irreducibility)
Given a measure <math>\varphi</math> defined on <math>(\mathsf{X}, \mathcal{B}(\mathsf{X}))</math>, the Markov chain <math>(X_n)</math> with transition kernel <math>K(x, \cdot)</math> is φ-irreducible if, for every <math>A \in \mathcal{B}(\mathsf{X})</math> with <math>\varphi(A) > 0</math>, there exists <math>n</math> such that <math>K^n(x, A) > 0</math> for all <math>x \in \mathsf{X}</math> (equivalently, <math>P_x(\tau_A < \infty) > 0</math>, where <math>\tau_A</math> is the first <math>n</math> for which the chain enters the set <math>A</math>).
This is a more general definition for irreducibility of a Markov chain in a non-discrete state space. In the discrete case, an irreducible Markov chain is said to be ''aperiodic'' if it has period 1. Formally, the period of a state <math>x</math> is defined as:
:<math>d(x) = \gcd\{\, n \ge 1 : K^n(x, x) > 0 \,\}.</math>
For the general (non-discrete) case, we define aperiodicity in terms of small sets:
;Definition (Cycle length and small sets)
A φ-irreducible Markov chain
has a ''cycle of length d'' if there exists a small set
, an associated integer
, and a probability distribution
such that ''d'' is the
greatest common divisor
In mathematics, the greatest common divisor (GCD), also known as greatest common factor (GCF), of two or more integers, which are not all zero, is the largest positive integer that divides each of the integers. For two integers , , the greatest co ...
of:
:
A set
is called small if there exists
and a nonzero measure
such that:
:
Harris recurrence
;Definition (Harris recurrence)
A set <math>A</math> is Harris recurrent if <math>P_x(\eta_A = \infty) = 1</math> for all <math>x \in A</math>, where <math>\eta_A</math> is the number of visits of the chain <math>(X_n)</math> to the set <math>A</math>.
The chain <math>(X_n)</math> is said to be Harris recurrent if there exists a measure <math>\psi</math> such that the chain is <math>\psi</math>-irreducible and every measurable set <math>A</math> with <math>\psi(A) > 0</math> is Harris recurrent.
A useful criterion for verifying Harris recurrence is the following:
;Proposition
If for every <math>A \in \mathcal{B}(\mathsf{X})</math> with <math>\psi(A) > 0</math>, we have <math>P_x(\tau_A < \infty) = 1</math> for every <math>x \in \mathsf{X}</math>, then <math>P_x(\eta_A = \infty) = 1</math> for all <math>x \in \mathsf{X}</math>, and the chain <math>(X_n)</math> is Harris recurrent.
This definition is only needed when the state space <math>\mathsf{X}</math> is uncountable. In the countable case, recurrence corresponds to <math>E_x[\eta_x] = \infty</math>, which is equivalent to <math>P_x(\tau_x < \infty) = 1</math> for all <math>x</math>.
;Definition (Invariant measure)
A <math>\sigma</math>-finite measure <math>\pi</math> is said to be invariant for the transition kernel <math>K(\cdot, \cdot)</math> (and the associated chain) if:
:<math>\pi(B) = \int_{\mathsf{X}} K(x, B) \, \pi(dx), \qquad \text{for all } B \in \mathcal{B}(\mathsf{X}).</math>
When there exists an ''invariant probability measure'' for a ψ-irreducible (hence recurrent) chain, the chain is said to be positive recurrent.
Recurrent chains that do not allow for a finite invariant measure are called null recurrent.
In applications of Markov Chain Monte Carlo (MCMC), a very useful criterion for Harris recurrence involves the use of bounded harmonic functions.
;Definition (Harmonic function)
A measurable function <math>h</math> is said to be harmonic for the chain <math>(X_n)</math> if:
:<math>E[h(X_{n+1}) \mid X_n = x] = h(x), \qquad \text{for all } x \in \mathsf{X}.</math>
These functions are ''invariant'' under the transition kernel in the functional sense, and they help characterize Harris recurrence.
;Proposition
''For a positive Markov chain, if the only bounded harmonic functions are the constant functions, then the chain is Harris recurrent.''
Law of Large Numbers for MCMC
;Theorem (Ergodic Theorem for MCMC)
If <math>(X_n)</math> has a <math>\sigma</math>-finite invariant measure <math>\pi</math>, then the following two statements are equivalent:
# The Markov chain <math>(X_n)</math> is Harris recurrent.
# If <math>f, g \in L^1(\pi)</math> with <math>\int g(x)\, \pi(dx) \neq 0</math>, then
:<math>\lim_{n \to \infty} \frac{S_n(f)}{S_n(g)} = \frac{\int f(x)\, \pi(dx)}{\int g(x)\, \pi(dx)}.</math>
This theorem provides a fundamental justification for the use of Markov Chain Monte Carlo (MCMC) methods, and it serves as the counterpart of the
Law of Large Numbers (LLN) in classical Monte Carlo.
An important aspect of this result is that <math>\pi</math> does not need to be a probability measure. Therefore, there can be some type of strong stability even if the chain is null recurrent. Moreover, the Markov chain can be started from an arbitrary state.
If <math>\pi</math> is a probability measure, we can let <math>g \equiv 1</math> and get
:<math>\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(X_i) = \int f(x)\, \pi(dx).</math>
This is the Ergodic Theorem that we are more familiar with.
Central Limit Theorem for MCMC
There are several conditions under which the
Central Limit Theorem (CLT) holds for Markov chain Monte Carlo (MCMC) methods. One of the most commonly used is the condition of reversibility.
;Definition (Reversibility)
A stationary Markov chain <math>(X_n)</math> is said to be reversible if the distribution of <math>X_{n+1}</math> given <math>X_n = x</math> is the same as the distribution of <math>X_n</math> given <math>X_{n+1} = x</math>.
This is equivalent to the ''detailed balance condition'', which is defined as follows:
;Definition (Detailed balance)
A Markov chain with transition kernel <math>K(x, y)</math> satisfies the detailed balance condition if there exists a function <math>f</math> such that:
:<math>K(x, y)\, f(x) = K(y, x)\, f(y)</math>
for every pair <math>(x, y)</math> in the state space.
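For example, the Metropolis–Hastings kernel with proposal density <math>q(y \mid x)</math> and acceptance probability <math>\alpha(x, y) = \min\{1,\, f(y) q(x \mid y) / (f(x) q(y \mid x))\}</math> can be checked to satisfy detailed balance with respect to the target <math>f</math>: for <math>x \neq y</math> the kernel is <math>K(x, y) = q(y \mid x)\, \alpha(x, y)</math>, and
:<math>f(x) K(x, y) = f(x)\, q(y \mid x) \min\left\{1, \frac{f(y) q(x \mid y)}{f(x) q(y \mid x)}\right\} = \min\{ f(x) q(y \mid x),\ f(y) q(x \mid y) \},</math>
which is symmetric in <math>x</math> and <math>y</math>, so <math>f(x) K(x, y) = f(y) K(y, x)</math>.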
;Theorem (CLT under reversibility)
If <math>(X_n)</math> is aperiodic, irreducible, and reversible with invariant distribution <math>\pi</math>, then:
:<math>\sqrt{N} \left( \frac{1}{N} \sum_{n=1}^{N} h(X_n) - E_\pi[h] \right) \xrightarrow{\mathcal{L}} N(0, \gamma_h^2),</math>
where
:<math>0 < \gamma_h^2 = E_\pi[\bar{h}^2(X_0)] + 2 \sum_{k=1}^{\infty} E_\pi[\bar{h}(X_0)\, \bar{h}(X_k)] < \infty</math>
and
:<math>\bar{h} = h - E_\pi[h].</math>
Even though reversibility is a restrictive assumption in theory, it is often easily satisfied in practical MCMC algorithms by introducing auxiliary variables or using symmetric proposal mechanisms. There are many other conditions that can be used to establish a CLT for MCMC, such as geometric ergodicity or a discrete state space.
Autocorrelation
MCMC methods produce autocorrelated samples, in contrast to standard Monte Carlo techniques that draw independent samples.
Autocorrelation means successive draws from the Markov chain are statistically dependent, so each new sample adds less fresh information than an independent draw would. As a result, one must account for this correlation when assessing the accuracy of estimates from the chain. In particular, positive autocorrelation in the chain increases the variance of estimators and slows the convergence of sample averages toward the true expectation.
Autocorrelation and efficiency
The effect of correlation on estimation can be quantified through the Markov chain central limit theorem. For a chain targeting a distribution with variance <math>\sigma^2</math>, the variance of the sample mean after <math>N</math> steps is approximately <math>\sigma^2 / N_{\text{eff}}</math>, where <math>N_{\text{eff}}</math> is an effective sample size smaller than <math>N</math>. Equivalently, one can express this as:
:<math>\operatorname{Var}(\hat{\mu}) \approx \frac{\sigma^2}{N} \left( 1 + 2 \sum_{k=1}^{\infty} \rho_k \right),</math>
where <math>\hat{\mu}</math> is the sample mean and <math>\rho_k</math> is the autocorrelation of the chain at lag <math>k</math>, defined as <math>\rho_k = \operatorname{Corr}(X_0, X_k)</math>. The term in parentheses, <math>1 + 2 \sum_{k \ge 1} \rho_k</math>, is often called the integrated autocorrelation. When the chain has no autocorrelation (<math>\rho_k = 0</math> for all <math>k \ge 1</math>), this factor equals 1, and one recovers the usual <math>\sigma^2 / N</math> variance for independent samples. If the chain's samples are highly correlated, the sum of autocorrelations is large, leading to a much bigger variance for <math>\hat{\mu}</math> than in the independent case.
Effective sample size (ESS)
The effective sample size <math>N_{\text{eff}}</math> is a useful diagnostic that translates the autocorrelation in a chain into an equivalent number of independent samples. It is defined by the formula:
:<math>N_{\text{eff}} = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k},</math>
so that <math>N_{\text{eff}}</math> is the number of independent draws that would yield the same estimation precision as the <math>N</math> dependent draws from the Markov chain. For example, if <math>1 + 2 \sum_{k \ge 1} \rho_k = 10</math>, then <math>N_{\text{eff}} = N/10</math>, meaning the chain of length <math>N</math> carries information equivalent to <math>N/10</math> independent samples. In an ideal scenario with no correlation, <math>\rho_k = 0</math> and thus <math>N_{\text{eff}} = N</math>. But in a poorly mixing chain with strong autocorrelation, <math>N_{\text{eff}}</math> can be much smaller than <math>N</math>. In practice, monitoring the ESS for each parameter is a way to gauge how much correlation is present: a low ESS indicates that many more iterations may be needed to achieve a desired effective sample of independent draws.
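A simple Python sketch of this calculation (the function name and the initial-positive-sequence truncation rule are illustrative choices; more careful estimators exist):
<syntaxhighlight lang="python">
import numpy as np

def effective_sample_size(chain, max_lag=None):
    """Estimate N_eff = N / (1 + 2 * sum of positive-lag autocorrelations).
    Summation stops at the first lag whose autocorrelation estimate is
    non-positive (a crude truncation rule)."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x.var()
    max_lag = n // 2 if max_lag is None else max_lag
    tau = 1.0  # integrated autocorrelation factor, 1 + 2 * sum(rho_k)
    for k in range(1, max_lag):
        rho_k = np.dot(x[:-k], x[k:]) / ((n - k) * var)
        if rho_k <= 0:
            break
        tau += 2.0 * rho_k
    return n / tau

# e.g. effective_sample_size(draws) for an MCMC output array `draws`
</syntaxhighlight>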
Reducing correlation
While MCMC methods were created to address multi-dimensional problems better than generic Monte Carlo algorithms, when the number of dimensions rises they too tend to suffer the
curse of dimensionality: regions of higher probability tend to stretch and get lost in an increasing volume of space that contributes little to the integral. One way to address this problem could be shortening the steps of the walker, so that it does not continuously try to exit the highest probability region, though this way the process would be highly autocorrelated and expensive (i.e. many steps would be required for an accurate result). More sophisticated methods such as
Hamiltonian Monte Carlo and the
Wang and Landau algorithm use various ways of reducing this autocorrelation, while managing to keep the process in the regions that give a higher contribution to the integral. These algorithms usually rely on a more complicated theory and are harder to implement, but they usually converge faster.
We outline several general strategies such as reparameterization, adaptive proposal tuning, parameter blocking, and overrelaxation that help reduce correlation and improve sampling efficiency within the standard MCMC framework.
Reparameterization
One way to reduce autocorrelation is to reformulate or reparameterize the statistical model so that the posterior geometry leads to more efficient sampling. By changing the coordinate system or using alternative variable definitions, one can often lessen correlations. For example, in
Bayesian hierarchical modeling, a non-centered parameterization can be used in place of the standard (centered) formulation to avoid extreme posterior correlations between latent and higher-level parameters. This involves expressing
latent variables in terms of independent auxiliary variables, dramatically improving mixing. Such reparameterization strategies are commonly employed in both
Gibbs sampling and
the Metropolis–Hastings algorithm to enhance convergence and reduce autocorrelation.
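As a small illustration of the idea (the model and variable names are purely illustrative), the following Python sketch contrasts the centered and non-centered forms of a single Gaussian latent effect in a hierarchical model; in the non-centered form, the auxiliary variable is a priori independent of the group-level parameters, which typically weakens posterior correlations:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
mu, tau = 0.0, 0.1  # group-level mean and (small) scale

# Centered parameterization: the latent effect is drawn directly from
# N(mu, tau); when tau is small, theta is tightly coupled to (mu, tau),
# which can induce strong posterior correlations and slow MCMC mixing.
theta_centered = rng.normal(mu, tau)

# Non-centered parameterization: draw an auxiliary standard-normal
# variable and transform; theta_raw is a priori independent of (mu, tau).
theta_raw = rng.standard_normal()
theta_noncentered = mu + tau * theta_raw
</syntaxhighlight>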
Proposal tuning and adaptation
Another approach to reducing correlation is to improve the MCMC proposal mechanism. In
Metropolis–Hastings algorithm, step size tuning is critical: if the proposed steps are too small, the sampler moves slowly and produces highly correlated samples; if the steps are too large, many proposals are rejected, resulting in repeated values. Adjusting the proposal step size during an initial testing phase helps find a balance where the sampler explores the space efficiently without too many rejections.
Adaptive MCMC methods modify proposal distributions based on the chain's past samples. For instance, the adaptive Metropolis algorithm updates the Gaussian proposal distribution using the full information accumulated from the chain so far, allowing the proposal to adapt over time.
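The following hedged Python sketch illustrates one common adaptation scheme in this spirit: the Gaussian proposal covariance is periodically refit to the empirical covariance of the chain's history (the function name, scaling constant, adaptation interval, and regularization term are illustrative choices, not a reference implementation):
<syntaxhighlight lang="python">
import numpy as np

def adaptive_metropolis(log_target, x0, n_steps, adapt_every=100, eps=1e-6, rng=None):
    """Random-walk Metropolis whose Gaussian proposal covariance is
    periodically refit to the empirical covariance of past samples."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(x0)
    cov = np.eye(d)                      # initial proposal covariance
    scale = 2.38**2 / d                  # classic scaling for Gaussian targets
    samples = np.empty((n_steps, d))
    x, logf_x = np.array(x0, dtype=float), log_target(x0)
    for i in range(n_steps):
        y = rng.multivariate_normal(x, scale * cov)
        logf_y = log_target(y)
        if np.log(rng.random()) < logf_y - logf_x:
            x, logf_x = y, logf_y
        samples[i] = x
        # Adapt: refit the proposal covariance to the chain's history
        if (i + 1) % adapt_every == 0:
            cov = np.cov(samples[: i + 1].T) + eps * np.eye(d)
    return samples
</syntaxhighlight>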
Parameter blocking
Parameter blocking is a technique that reduces autocorrelation in MCMC by updating parameters jointly rather than one at a time. When parameters exhibit strong posterior correlations, one-by-one updates can lead to poor mixing and slow exploration of the target distribution. By identifying and sampling blocks of correlated parameters together, the sampler can more effectively traverse high-density regions of the posterior.
Parameter blocking is commonly used in both Gibbs sampling and Metropolis–Hastings algorithms. In blocked Gibbs sampling, entire groups of variables are updated conditionally at each step. In Metropolis–Hastings, multivariate proposals enable joint updates (i.e., updates of multiple parameters at once using a vector-valued proposal distribution, typically a multivariate Gaussian), though they often require careful tuning of the proposal covariance matrix.
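A small Python sketch of the effect (target and parameter values are illustrative): for a strongly correlated bivariate Gaussian, single-site Gibbs updates move slowly along the ridge of the density, whereas a blocked update that draws both coordinates jointly avoids the problem:
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
rho = 0.99          # strong correlation between the two coordinates
n_steps = 5000
cov = np.array([[1.0, rho], [rho, 1.0]])

# Single-site Gibbs: update x1 | x2, then x2 | x1 (conditionals are normal)
x = np.zeros(2)
single_site = np.empty((n_steps, 2))
for i in range(n_steps):
    x[0] = rng.normal(rho * x[1], np.sqrt(1 - rho**2))
    x[1] = rng.normal(rho * x[0], np.sqrt(1 - rho**2))
    single_site[i] = x

# Blocked update: draw the whole correlated pair jointly in one step
blocked = rng.multivariate_normal(np.zeros(2), cov, size=n_steps)
</syntaxhighlight>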
Overrelaxation
Overrelaxation is a technique to reduce autocorrelation between successive samples by proposing new samples that are negatively correlated with the current state. This helps the chain explore the posterior more efficiently, especially in high-dimensional Gaussian models or when using Gibbs sampling. The basic idea is to reflect the current sample across the conditional mean, producing proposals that retain the correct stationary distribution but with reduced serial dependence. Overrelaxation is particularly effective when combined with Gaussian conditional distributions, where exact reflection or partial overrelaxation can be analytically implemented.
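A hedged sketch of an Adler-type overrelaxed update for a Gaussian full conditional (the function name and the value of the relaxation parameter are illustrative): the new state is reflected about the conditional mean, which preserves the conditional distribution while inducing negative correlation with the current value:
<syntaxhighlight lang="python">
import numpy as np

def overrelaxed_gaussian_update(x_current, cond_mean, cond_sd, alpha=-0.9, rng=None):
    """Overrelaxed update for a Gaussian full conditional N(cond_mean, cond_sd^2).
    alpha in (-1, 0]: alpha = 0 is an ordinary Gibbs draw, alpha near -1
    reflects the state about the conditional mean while keeping the
    conditional distribution invariant."""
    rng = np.random.default_rng() if rng is None else rng
    noise = cond_sd * np.sqrt(1.0 - alpha**2) * rng.standard_normal()
    return cond_mean + alpha * (x_current - cond_mean) + noise
</syntaxhighlight>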
Examples
Random walk Monte Carlo methods
*
Metropolis–Hastings algorithm: This method generates a Markov chain using a proposal density for new steps and a method for rejecting some of the proposed moves. It is actually a general framework which includes as special cases the very first and simpler MCMC method (the Metropolis algorithm) and many more recent variants listed below.
**
Gibbs sampling: When the target distribution is multi-dimensional, the Gibbs sampling algorithm updates each coordinate from its full conditional distribution given the other coordinates. Gibbs sampling can be viewed as a special case of the Metropolis–Hastings algorithm with acceptance rate uniformly equal to 1. When drawing from the full conditional distributions is not straightforward, other samplers-within-Gibbs are used. Gibbs sampling is popular partly because it does not require any 'tuning'. The algorithmic structure of Gibbs sampling closely resembles that of coordinate ascent variational inference in that both algorithms utilize the full conditional distributions in the updating procedure.
**
Metropolis-adjusted Langevin algorithm and other methods that rely on the gradient (and possibly second derivative) of the log target density to propose steps that are more likely to be in the direction of higher probability density.
**
Hamiltonian (or hybrid) Monte Carlo (HMC): Tries to avoid random walk behaviour by introducing an auxiliary momentum vector and implementing Hamiltonian dynamics, so the potential energy function is the negative log of the target density. The momentum samples are discarded after sampling. The result of hybrid Monte Carlo is that proposals move across the sample space in larger steps; they are therefore less correlated and converge to the target distribution more rapidly.
**
Pseudo-marginal Metropolis–Hastings: This method replaces the evaluation of the density of the target distribution with an unbiased estimate and is useful when the target density is not available analytically, e.g.
latent variable models.
*
Slice sampling: This method depends on the principle that one can sample from a distribution by sampling uniformly from the region under the plot of its density function. It alternates uniform sampling in the vertical direction with uniform sampling from the horizontal 'slice' defined by the current vertical position.
*
Multiple-try Metropolis: This method is a variation of the Metropolis–Hastings algorithm that allows multiple trials at each point. By making it possible to take larger steps at each iteration, it helps address the curse of dimensionality.
*
Reversible-jump: This method is a variant of the Metropolis–Hastings algorithm that allows proposals that change the dimensionality of the space. Markov chain Monte Carlo methods that change dimensionality have long been used in
statistical physics applications, where for some problems a distribution that is a
grand canonical ensemble
is used (e.g., when the number of molecules in a box is variable). But the reversible-jump variant is useful when doing Markov chain Monte Carlo or Gibbs sampling over
nonparametric Bayesian models such as those involving the
Dirichlet process or
Chinese restaurant process, where the number of mixing components/clusters/etc. is automatically inferred from the data.
Interacting particle methods
Interacting MCMC methodologies are a class of
mean-field particle methods for obtaining
random samples from a sequence of probability distributions with an increasing level of sampling complexity.
These probabilistic models include path space state models with increasing time horizon, posterior distributions w.r.t. sequence of partial observations, increasing constraint level sets for conditional distributions, decreasing temperature schedules associated with some Boltzmann–Gibbs distributions, and many others. In principle, any Markov chain Monte Carlo sampler can be turned into an interacting Markov chain Monte Carlo sampler. These interacting Markov chain Monte Carlo samplers can be interpreted as a way to run in parallel a sequence of Markov chain Monte Carlo samplers. For instance, interacting
simulated annealing algorithms are based on independent Metropolis–Hastings moves interacting sequentially with a selection-resampling type mechanism. In contrast to traditional Markov chain Monte Carlo methods, the precision parameter of this class of interacting Markov chain Monte Carlo samplers is ''only'' related to the number of interacting Markov chain Monte Carlo samplers. These advanced particle methodologies belong to the class of Feynman–Kac particle models,
also called Sequential Monte Carlo or
particle filter methods in
Bayesian inference and
signal processing
communities.
Interacting Markov chain Monte Carlo methods can also be interpreted as a mutation-selection
genetic particle algorithm with Markov chain Monte Carlo mutations.
Quasi-Monte Carlo
The
quasi-Monte Carlo method is an analog to the normal Monte Carlo method that uses
low-discrepancy sequences instead of random numbers.
It yields an integration error that decays faster than that of true random sampling, as quantified by the
Koksma–Hlawka inequality. Empirically it allows the reduction of both estimation error and convergence time by an order of magnitude.
Markov chain quasi-Monte Carlo methods such as the Array–RQMC method combine randomized quasi–Monte Carlo and Markov chain simulation by simulating <math>n</math> chains simultaneously in a way that better approximates the true distribution of the chain than with ordinary MCMC. In empirical experiments, the variance of the average of a function of the state sometimes converges at rate <math>O(n^{-2})</math> or even faster, instead of the <math>O(n^{-1})</math> Monte Carlo rate.
Applications
MCMC methods are primarily used for calculating numerical approximations of multi-dimensional integrals, for example in Bayesian statistics, computational physics, computational biology and computational linguistics.
Bayesian Statistics
In Bayesian statistics, Markov chain Monte Carlo methods are typically used to calculate moments and credible intervals of posterior probability distributions. The use of MCMC methods makes it possible to compute large hierarchical models that require integrations over hundreds to thousands of unknown parameters.
Statistical Physics
Many contemporary research problems in statistical physics can be addressed by approximate solutions using Monte Carlo simulation, which provides valuable insights into the properties of complex systems. Monte Carlo methods are fundamental in computational physics, physical chemistry, and related disciplines, with broad applications including medical physics, where they are employed to model radiation transport for radiation dosimetry calculations. Instead of exhaustively analyzing all possible system states, the Monte Carlo method randomly examines a subset of them to form a representative sample, and yields accurate approximations of the system's characteristic properties. As the number of sampled states increases, the approximation error can be reduced further.
Complex Distribution Sampling

Langevin dynamics are typically used in complex distribution sampling and generative modeling, via an MCMC procedure. Specifically, given the probability density function <math>p(x)</math>, we use its log gradient <math>\nabla_x \log p(x)</math> as the score function and start from a prior distribution <math>x_0 \sim \pi(x)</math>. Then, a chain is built by
:<math>x_{i+1} \leftarrow x_i + \epsilon \nabla_x \log p(x_i) + \sqrt{2\epsilon}\, z_i, \qquad z_i \sim N(0, I),</math>
for <math>i = 0, 1, \ldots, K</math>. When <math>\epsilon \to 0</math> and <math>K \to \infty</math>, <math>x_K</math> converges to a sample from the target distribution <math>p(x)</math>.
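A minimal Python sketch of the update rule above, assuming the score function is available in closed form (here a standard normal target, whose score is <math>-x</math>); the step size and number of steps are illustrative:
<syntaxhighlight lang="python">
import numpy as np

def langevin_sampler(score, x0, step_size=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + eps * score(x) + sqrt(2*eps) * z,
    where `score` is the gradient of the log target density."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + step_size * score(x) + np.sqrt(2.0 * step_size) * z
    return x

# Example: the target is a standard normal, whose score is -x
sample = langevin_sampler(score=lambda x: -x, x0=np.zeros(2))
</syntaxhighlight>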
For some complex distributions, if we know the probability density function but find it difficult to sample from directly, we can apply Langevin dynamics as an alternative. However, in most cases, especially in generative modeling, we usually do not know the exact probability density function of the target distribution we wish to sample from, nor the score function <math>\nabla_x \log p(x)</math>. In this case, score matching methods provide feasible solutions, minimizing the Fisher divergence between a parameterized score-based model <math>s_\theta(x)</math> and the score function without knowing the ground-truth data score. The score function can be estimated on a training dataset by stochastic gradient descent.
In real cases, however, the training data only occupies a small region of the target distribution, and the estimated score functions are inaccurate in other low-density regions with fewer available data examples. To overcome this challenge, denoising score matching methods perturb the available data examples with noise of different scales, which can improve the coverage of low-density regions, and use them as the training dataset for the score-based model. Note that the choice of noise scales is tricky: too large a noise will corrupt the original data, while too small a noise will not populate the original data into those low-density regions. Thus, carefully crafted noise schedules are applied for higher-quality generation.
Convergence
Usually it is not hard to construct a Markov chain with the desired properties. The more difficult problem is to determine (1) when to start collecting statistics and (2) how many steps are needed to converge to the stationary distribution within an acceptable error.
Fortunately, there are a variety of practical diagnostics to empirically assess convergence.
Total Variation Distance
Formally, let <math>\pi</math> denote the stationary distribution and <math>P^t(x, \cdot)</math> the distribution of the Markov chain after <math>t</math> steps starting from state <math>x</math>. Theoretically, convergence can be quantified by measuring the total variation distance:
:<math>\| P^t(x, \cdot) - \pi \|_{TV} = \sup_{A} \left| P^t(x, A) - \pi(A) \right|.</math>
A chain is said to mix rapidly if <math>\| P^t(x, \cdot) - \pi \|_{TV} \le \varepsilon</math> for all <math>x</math> within a small number of steps <math>t</math> under a pre-defined tolerance <math>\varepsilon</math>. In other words, the stationary distribution is reached quickly starting from an arbitrary position, and the minimum such <math>t</math> is known as the mixing time. In practice, however, the total variation distance is generally intractable to compute, especially in high-dimensional problems or when the stationary distribution is only known up to a normalizing constant (as in most Bayesian applications).
Gelman-Rubin Diagnostics
The Gelman-Rubin statistic, also known as the potential scale reduction factor (PSRF), evaluates MCMC convergence by sampling multiple independent Markov chains and comparing within-chain and between-chain variances.
If all chains have converged to the same stationary distribution, the between-chain and within-chain variances should be similar, and thus the PSRF should approach 1. In practice, a value of <math>\hat{R} < 1.1</math> is often taken as evidence of convergence. Higher values suggest that the chains are still exploring different parts of the target distribution.
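A compact Python sketch of the PSRF computation, assuming <code>chains</code> is an array of shape (m, n) holding m chains of equal length with burn-in already discarded (the function name and this basic variant of the statistic are illustrative):
<syntaxhighlight lang="python">
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for `chains` of shape (m, n):
    m independent chains, n samples each (after discarding burn-in)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)               # PSRF (R-hat)
</syntaxhighlight>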
Geweke Diagnostics
The Geweke diagnostic examines whether the distribution of samples in the early part of the Markov chain is statistically indistinguishable from the distribution in a later part. Given a sequence of correlated MCMC samples <math>\theta^{(1)}, \ldots, \theta^{(N)}</math>, the diagnostic splits the chain into an early segment consisting of the first <math>N_A</math> samples, typically chosen as <math>N_A = 0.1N</math> (i.e., the first 10% of the chain), and a late segment consisting of the last <math>N_B</math> samples, typically chosen as <math>N_B = 0.5N</math> (i.e., the last 50% of the chain).
Denote the sample means of these segments as:
:<math>\bar{\theta}_A = \frac{1}{N_A} \sum_{i=1}^{N_A} \theta^{(i)}, \qquad \bar{\theta}_B = \frac{1}{N_B} \sum_{i=N-N_B+1}^{N} \theta^{(i)}.</math>
Since MCMC samples are autocorrelated, a simple comparison of sample means is insufficient. Instead, the difference in means is standardized using an estimator of the spectral density at zero frequency, which accounts for the long-range dependencies in the chain. The test statistic is computed as:
:<math>Z = \frac{\bar{\theta}_A - \bar{\theta}_B}{\sqrt{\hat{S}_A(0)/N_A + \hat{S}_B(0)/N_B}},</math>
where <math>\hat{S}(0)</math> is an estimate of the long-run variance (i.e., the spectral density at frequency zero), commonly estimated using Newey-West estimators or batch means. Under the null hypothesis of convergence, the statistic <math>Z</math> follows an approximately standard normal distribution <math>N(0, 1)</math>.
If <math>|Z| > 1.96</math>, the null hypothesis is rejected at the 5% significance level, suggesting that the chain has not yet reached stationarity.
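A minimal Python sketch of this diagnostic (the function names are illustrative, and the long-run variance is estimated with a crude batch-means estimator, assuming a reasonably long chain):
<syntaxhighlight lang="python">
import numpy as np

def batch_means_lrv(x, n_batches=20):
    """Crude long-run variance estimate (spectral density at zero) via batch means."""
    x = np.asarray(x, dtype=float)
    batch = len(x) // n_batches
    means = x[: batch * n_batches].reshape(n_batches, batch).mean(axis=1)
    return batch * means.var(ddof=1)

def geweke_z(chain, first=0.1, last=0.5):
    """Geweke convergence diagnostic: z-score comparing the mean of the
    first 10% of the chain with the mean of the last 50%."""
    x = np.asarray(chain, dtype=float)
    n = len(x)
    a, b = x[: int(first * n)], x[-int(last * n):]
    se2 = batch_means_lrv(a) / len(a) + batch_means_lrv(b) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)

# |geweke_z(chain)| > 1.96 suggests the chain has not reached stationarity
</syntaxhighlight>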
Heidelberger-Welch Diagnostics
The Heidelberger-Welch diagnostic is grounded in
spectral analysis and
Brownian motion theory, and is particularly useful in the early stages of simulation to determine appropriate burn-in and stopping time. The diagnostic consists of two components: a stationarity test that assesses whether the Markov chain has reached a steady state, and a half-width test that determines whether the estimated expectation is within a user-specified precision.
Stationarity Test
Let <math>X_1, \ldots, X_N</math> be the output of an MCMC simulation for a scalar function <math>g</math>, and <math>Y_i = g(X_i)</math> the evaluations of the function <math>g</math> over the chain. Define the standardized cumulative sum process:
:<math>B_N(t) = \frac{\sum_{i=1}^{\lfloor Nt \rfloor} (Y_i - \bar{Y}_N)}{\sqrt{N \hat{S}(0)}}, \qquad t \in [0, 1],</math>
where <math>\bar{Y}_N</math> is the sample mean and <math>\hat{S}(0)</math> is an estimate of the spectral density at frequency zero.
Under the null hypothesis of convergence, the process <math>B_N(t)</math> converges in distribution to a Brownian bridge. The following Cramér-von Mises statistic is used to test for stationarity:
:<math>\int_0^1 B_N(t)^2 \, dt.</math>
This statistic is compared against known critical values from the Brownian bridge distribution. If the null hypothesis is rejected, the first 10% of the samples are discarded and the test can be repeated on the remaining chain until either stationarity is accepted or 50% of the chain is discarded.
Half-Width Test (Precision Check)
Once stationarity is accepted, the second part of the diagnostic checks whether the Monte Carlo estimator is accurate enough for practical use. Assuming the central limit theorem holds, the confidence interval for the mean