
In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms ''the'' Chernoff or Chernoff–Cramér bound, which may decay faster than exponentially (e.g. sub-Gaussian). It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.

The bound is commonly named after Herman Chernoff, who described the method in a 1952 paper, though Chernoff himself attributed it to Herman Rubin. In 1938 Harald Cramér had published an almost identical concept, now known as Cramér's theorem.

The Chernoff bound is sharper than the first- or second-moment-based tail bounds such as Markov's inequality or Chebyshev's inequality, which only yield power-law bounds on tail decay. However, when applied to sums the Chernoff bound requires the random variables to be independent, a condition that neither Markov's inequality nor Chebyshev's inequality requires. The Chernoff bound is related to the Bernstein inequalities. It is also used to prove Hoeffding's inequality, Bennett's inequality, and McDiarmid's inequality.


Generic Chernoff bounds

The generic Chernoff bound for a random variable X is attained by applying Markov's inequality to e^{tX} (which is why it is sometimes called the ''exponential Markov'' or ''exponential moments'' bound). For positive t this gives a bound on the right tail of X in terms of its moment-generating function M(t) = \operatorname E (e^{tX}):

:\operatorname P \left(X \geq a \right) = \operatorname P \left(e^{tX} \geq e^{ta}\right) \leq M(t) e^{-ta} \qquad (t > 0)

Since this bound holds for every positive t, we may take the infimum:

:\operatorname P \left(X \geq a\right) \leq \inf_{t > 0} M(t) e^{-ta}

Performing the same analysis with negative t we get a similar bound on the left tail:

:\operatorname P \left(X \leq a \right) = \operatorname P \left(e^{tX} \geq e^{ta}\right) \leq M(t) e^{-ta} \qquad (t < 0)

and

:\operatorname P \left(X \leq a\right) \leq \inf_{t < 0} M(t) e^{-ta}

The quantity M(t) e^{-ta} can be expressed as the expected value \operatorname E (e^{tX}) e^{-ta}, or equivalently \operatorname E (e^{t(X-a)}).
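
In practice the infimum can be evaluated numerically whenever the cumulant generating function K(t) = \log M(t) is available. The following minimal sketch (assuming NumPy and SciPy; the helper name is illustrative) minimizes K(t) - ta over t > 0 for a standard normal variable, for which the exact value is e^{-a^2/2}.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_bound(cgf, a, t_max=50.0):
    """Evaluate inf_{t>0} M(t) e^{-ta} = exp( inf_{t>0} [K(t) - t*a] ) numerically."""
    res = minimize_scalar(lambda t: cgf(t) - t * a, bounds=(1e-9, t_max), method="bounded")
    return np.exp(res.fun)

# Standard normal: K(t) = t^2/2, so the bound should equal exp(-a^2/2).
a = 2.0
print(chernoff_bound(lambda t: 0.5 * t**2, a))   # ~0.1353
print(np.exp(-a**2 / 2))                         # analytic value for comparison
```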


Properties

The exponential function is convex, so by Jensen's inequality \operatorname E (e^{tX}) \ge e^{t\operatorname E (X)}. It follows that the bound on the right tail is greater or equal to one when a \le \operatorname E (X), and therefore trivial; similarly, the left bound is trivial for a \ge \operatorname E (X). We may therefore combine the two infima and define the two-sided Chernoff bound:

:C(a) = \inf_{t} M(t) e^{-ta}

which provides an upper bound on the folded cumulative distribution function of X (folded at the mean, not the median).

The logarithm of the two-sided Chernoff bound is known as the rate function (or ''Cramér transform'') I = -\log C. It is equivalent to the Legendre–Fenchel transform or convex conjugate of the cumulant generating function K = \log M, defined as:

:I(a) = \sup_{t} (at - K(t))

The moment generating function is log-convex, so by a property of the convex conjugate, the Chernoff bound must be log-concave. The Chernoff bound attains its maximum at the mean, C(\operatorname E(X))=1, and is invariant under translation: C_{X+k}(a) = C_X(a - k).

The Chernoff bound is exact if and only if X is a single concentrated mass (degenerate distribution). The bound is tight only at or beyond the extremes of a bounded random variable, where the infima are attained at infinite t. For unbounded random variables the bound is nowhere tight, though it is asymptotically tight up to sub-exponential factors ("exponentially tight"). Individual moments can provide tighter bounds, at the cost of greater analytical complexity.

In practice, the exact Chernoff bound may be unwieldy or difficult to evaluate analytically, in which case a suitable upper bound on the moment (or cumulant) generating function may be used instead (e.g. a sub-parabolic CGF giving a sub-Gaussian Chernoff bound).
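For instance, a standard normal random variable has K(t) = t^2/2, so I(a) = \sup_{t} (at - t^2/2) = a^2/2 and C(a) = e^{-a^2/2}; more generally, any zero-mean random variable whose CGF satisfies K(t) \le \sigma^2 t^2/2 obeys the same sub-Gaussian bound C(a) \le e^{-a^2/(2\sigma^2)}.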


Bounds from below from the MGF

Using only the moment generating function, a lower bound on the tail probabilities can be obtained by applying the Paley–Zygmund inequality to e^{tX}, yielding:

:\operatorname P \left(X > a\right) \geq \sup_{t > 0} \left( 1 - \frac{e^{ta}}{M(t)} \right)^2 \frac{M(t)^2}{M(2t)}

(a bound on the left tail is obtained for negative t). Unlike the Chernoff bound however, this result is not exponentially tight. Theodosopoulos constructed a tight(er) MGF-based lower bound using an exponential tilting procedure. For particular distributions (such as the binomial) lower bounds of the same exponential order as the Chernoff bound are often available.
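
A minimal numerical sketch of this lower bound (assuming NumPy and SciPy, and working through the CGF for stability) is shown below; for a standard normal variable it illustrates how far the Paley–Zygmund bound sits from the true tail.

```python
import numpy as np
from scipy.stats import norm

def paley_zygmund_lower(cgf, a, t_grid=np.linspace(1e-3, 20, 20000)):
    """sup_{t>0} (1 - e^{ta}/M(t))^2 * M(t)^2 / M(2t), evaluated on a grid of t."""
    k_t, k_2t = cgf(t_grid), cgf(2 * t_grid)
    theta = np.exp(t_grid * a - k_t)               # e^{ta}/M(t); the bound needs theta < 1
    vals = np.where(theta < 1.0, (1.0 - theta) ** 2 * np.exp(2 * k_t - k_2t), 0.0)
    return vals.max()

a = 1.0
print(paley_zygmund_lower(lambda t: 0.5 * t**2, a))   # lower bound on P(X > 1), ~5e-4
print(norm.sf(a))                                     # true tail, about 0.159
```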


Sums of independent random variables

When X is the sum of n independent random variables X_1, \dots, X_n, the moment generating function of X is the product of the individual moment generating functions, giving that:

:\Pr (X \geq a) \leq \inf_{t>0} e^{-ta} \prod_i \operatorname E \left[e^{t X_i} \right] \qquad (1)

and:

:\Pr (X \leq a) \leq \inf_{t<0} e^{-ta} \prod_i \operatorname E \left[e^{t X_i} \right]

Specific Chernoff bounds are attained by calculating the moment-generating function \operatorname E \left[e^{t X_i} \right] for specific instances of the random variables X_i.

When the random variables are also ''identically distributed'' (iid), the Chernoff bound for the sum reduces to a simple rescaling of the single-variable Chernoff bound. That is, the Chernoff bound for the ''average'' of ''n'' iid variables is equivalent to the ''n''th power of the Chernoff bound on a single variable (see Cramér's theorem).
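
This rescaling is easy to check numerically; the sketch below (assuming NumPy, using a Bernoulli(p) example and a grid search over t) confirms that the bound for the sum of n iid variables at threshold na equals the nth power of the single-variable bound at threshold a.

```python
import numpy as np

p, a, n = 0.3, 0.5, 20
t = np.linspace(1e-4, 10, 100_000)
mgf = 1 - p + p * np.exp(t)                      # Bernoulli(p) moment generating function
single = np.min(mgf * np.exp(-t * a))            # inf_t M(t) e^{-ta}
joint = np.min(mgf ** n * np.exp(-t * n * a))    # inf_t M(t)^n e^{-tna}
print(single ** n, joint)                        # the two printed values agree
```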


Sums of independent bounded random variables

Chernoff bounds may also be applied to general sums of independent, bounded random variables, regardless of their distribution; this is known as Hoeffding's inequality. The proof follows a similar approach to the other Chernoff bounds, but applying Hoeffding's lemma to bound the moment generating functions (see Hoeffding's inequality).

:Hoeffding's inequality. Suppose X_1, \dots, X_n are independent random variables taking values in [a, b]. Let X denote their sum and let \mu = \operatorname E[X] denote the sum's expected value. Then for any t>0,
::\Pr (X \le \mu-t) < e^{-2t^2/(n(b-a)^2)},
::\Pr (X \ge \mu+t) < e^{-2t^2/(n(b-a)^2)}.
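
As a quick sanity check, a sketch assuming NumPy and Uniform[0,1] summands shows the empirical tail probability staying below the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 50_000, 5.0
a, b = 0.0, 1.0                                   # each variable takes values in [a, b]
sums = rng.uniform(a, b, size=(trials, n)).sum(axis=1)
mu = n * (a + b) / 2                              # expected value of the sum
print(np.mean(sums >= mu + t))                    # empirical tail, roughly 0.04 here
print(np.exp(-2 * t**2 / (n * (b - a) ** 2)))     # Hoeffding bound, exp(-0.5) ~ 0.61
```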


Sums of independent Bernoulli random variables

The bounds in the following sections for Bernoulli random variables are derived by using that, for a Bernoulli random variable X_i with probability ''p'' of being equal to 1,

:\operatorname E \left[e^{t X_i} \right] = (1 - p) e^0 + p e^t = 1 + p (e^t -1) \leq e^{p (e^t - 1)}.

One can encounter many flavors of Chernoff bounds: the original ''additive form'' (which gives a bound on the absolute error) or the more practical ''multiplicative form'' (which bounds the error relative to the mean).


Multiplicative form (relative error)

Multiplicative Chernoff bound. Suppose X_1, \dots, X_n are independent random variables taking values in \{0, 1\}. Let X denote their sum and let \mu = \operatorname E[X] denote the sum's expected value. Then for any \delta > 0,

:\Pr ( X \ge (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.

A similar proof strategy can be used to show that for 0 < \delta < 1

:\Pr(X \le (1-\delta)\mu) \leq \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu.

The above formula is often unwieldy in practice, so the following looser but more convenient bounds are often used, which follow from the inequality \textstyle\frac{2\delta}{2+\delta} \le \log(1+\delta) from the list of logarithmic inequalities:

:\Pr( X \ge (1+\delta)\mu)\le e^{-\delta^2\mu/(2+\delta)}, \qquad 0 \le \delta,
:\Pr( X \le (1-\delta)\mu) \le e^{-\delta^2\mu/2}, \qquad 0 < \delta < 1,
:\Pr( |X - \mu| \ge \delta\mu) \le 2e^{-\delta^2\mu/3}, \qquad 0 < \delta < 1.

Notice that the bounds are trivial for \delta = 0. In addition, based on the Taylor expansion for the Lambert W function,

:\Pr( X \ge R)\le 2^{-xR}, \qquad x > 0, \ R \ge (2^x e -1)\mu.
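
The sketch below (assuming NumPy, with a Binomial example; helper names are illustrative) evaluates the tight multiplicative bound and its looser convenient form next to an empirical estimate of the tail:

```python
import numpy as np

def mult_chernoff(mu, delta):
    """Multiplicative Chernoff bound on Pr(X >= (1 + delta) * mu)."""
    return (np.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def mult_chernoff_loose(mu, delta):
    """Looser convenient form exp(-delta^2 * mu / (2 + delta))."""
    return np.exp(-delta**2 * mu / (2 + delta))

rng = np.random.default_rng(1)
n, p, delta = 1000, 0.1, 0.5
mu = n * p
samples = rng.binomial(n, p, size=200_000)       # X = sum of n Bernoulli(p) variables
print(np.mean(samples >= (1 + delta) * mu))      # empirical tail (near zero here)
print(mult_chernoff(mu, delta))                  # ~2e-5
print(mult_chernoff_loose(mu, delta))            # exp(-10) ~ 4.5e-5
```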


Additive form (absolute error)

The following theorem is due to Wassily Hoeffding and hence is called the Chernoff–Hoeffding theorem.

:Chernoff–Hoeffding theorem. Suppose X_1, \dots, X_n are i.i.d. random variables, taking values in \{0, 1\}. Let p = \operatorname E[X_1] and \varepsilon > 0.
::\begin{align} \Pr \left (\frac{1}{n} \sum X_i \geq p + \varepsilon \right ) \leq \left (\left (\frac{p}{p + \varepsilon}\right )^{p+\varepsilon} \left (\frac{1 - p}{1-p- \varepsilon}\right )^{1-p- \varepsilon}\right )^n &= e^{- D(p+\varepsilon\parallel p) n} \\ \Pr \left (\frac{1}{n} \sum X_i \leq p - \varepsilon \right ) \leq \left (\left (\frac{p}{p - \varepsilon}\right )^{p-\varepsilon} \left (\frac{1 - p}{1-p+ \varepsilon}\right )^{1-p+ \varepsilon}\right )^n &= e^{- D(p-\varepsilon\parallel p) n} \end{align}
:where
:: D(x\parallel y) = x \ln \frac{x}{y} + (1-x) \ln \left (\frac{1-x}{1-y} \right )
:is the Kullback–Leibler divergence between Bernoulli distributed random variables with parameters ''x'' and ''y'' respectively.

If p \ge \tfrac{1}{2}, then D(p+\varepsilon\parallel p)\ge \tfrac{\varepsilon^2}{2p(1-p)}, which means

:: \Pr\left ( \frac{1}{n}\sum X_i>p+x \right ) \leq \exp \left (-\frac{x^2 n}{2p(1-p)} \right ).

A simpler bound follows by relaxing the theorem using D(p + \varepsilon \parallel p) \geq 2\varepsilon^2, which follows from the convexity of D(p+\varepsilon\parallel p) and the fact that

:\frac{d^2}{d\varepsilon^2} D(p+\varepsilon\parallel p) = \frac{1}{(p+\varepsilon)(1-p-\varepsilon)} \geq 4 =\frac{d^2}{d\varepsilon^2}(2\varepsilon^2).

This result is a special case of Hoeffding's inequality. Sometimes, the bounds

: \begin{align} D( (1+x) p \parallel p) \geq \tfrac{1}{4} x^2 p, & & & -\tfrac{1}{2} \leq x \leq \tfrac{1}{2},\\ D(x \parallel y) \geq \frac{3(x-y)^2}{2(x+2y)}, \\ D(x \parallel y) \geq \frac{(x-y)^2}{2y}, & & & x \leq y,\\ D(x \parallel y) \geq \frac{(x-y)^2}{2x}, & & & x \geq y \end{align}

which are stronger for small ''p'', are also used.
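
The KL-based bound is straightforward to evaluate; a small sketch (assuming NumPy) compares it with the weaker 2\varepsilon^2 relaxation:

```python
import numpy as np

def kl_bernoulli(x, y):
    """Kullback-Leibler divergence D(x || y) between Bernoulli(x) and Bernoulli(y)."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def chernoff_hoeffding(n, p, eps):
    """Bound on Pr( (1/n) sum X_i >= p + eps ) for i.i.d. Bernoulli(p) variables."""
    return np.exp(-n * kl_bernoulli(p + eps, p))

n, p, eps = 500, 0.2, 0.1
print(chernoff_hoeffding(n, p, eps))   # KL-based bound, ~7.6e-7
print(np.exp(-2 * eps**2 * n))         # simpler relaxation exp(-2 eps^2 n), ~4.5e-5
```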


Applications

Chernoff bounds have very useful applications in set balancing and packet routing in sparse networks.

The set balancing problem arises while designing statistical experiments. Typically, given the features of each participant in the experiment, we need to know how to divide the participants into two disjoint groups such that each feature is roughly as balanced as possible between the two groups. Chernoff bounds are also used to obtain tight bounds for permutation routing problems, which reduce network congestion while routing packets in sparse networks.

Chernoff bounds are used in computational learning theory to prove that a learning algorithm is probably approximately correct, i.e. with high probability the algorithm has small error on a sufficiently large training data set.

Chernoff bounds can be effectively used to evaluate the "robustness level" of an application or algorithm by exploring its perturbation space with randomization. The use of the Chernoff bound permits one to abandon the strong (and mostly unrealistic) small-perturbation hypothesis, i.e. the assumption that the perturbation magnitude is small. The robustness level can, in turn, be used either to validate or reject a specific algorithmic choice, a hardware implementation, or the appropriateness of a solution whose structural parameters are affected by uncertainties.

A simple and common use of Chernoff bounds is for "boosting" of randomized algorithms. If one has an algorithm that outputs a guess that is the desired answer with probability ''p'' > 1/2, then one can get a higher success rate by running the algorithm n = \log(1/\delta) 2p/(p - 1/2)^2 times and outputting a guess that is output by more than ''n''/2 runs of the algorithm. (There cannot be more than one such guess.) Assuming that these algorithm runs are independent, the probability that more than ''n''/2 of the guesses is correct is equal to the probability that the sum of independent Bernoulli random variables X_k that are 1 with probability ''p'' is more than ''n''/2. This can be shown to be at least 1-\delta via the multiplicative Chernoff bound (Corollary 13.3 in Sinclair's class notes):

:\Pr\left[\sum_{k=1}^n X_k > \frac{n}{2}\right] \ge 1 - e^{-n\left(p - 1/2\right)^2/(2p)} \geq 1-\delta.
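
A small sketch of this boosting calculation (assuming NumPy; the helper name is illustrative) picks ''n'' from the formula above and checks the majority-vote success rate by simulation:

```python
import numpy as np

def boosting_runs(p, delta):
    """Number of runs n = log(1/delta) * 2p / (p - 1/2)^2 from the text (rounded up)."""
    return int(np.ceil(np.log(1 / delta) * 2 * p / (p - 0.5) ** 2))

p, delta = 0.6, 0.01                       # per-run success probability and target failure rate
n = boosting_runs(p, delta)                # 553 runs for these parameters
rng = np.random.default_rng(2)
correct = rng.binomial(n, p, size=100_000) # number of correct guesses in each simulated trial
print(n, np.mean(correct > n / 2))         # empirical majority-vote success rate, above 0.99
```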


Matrix Chernoff bound

Rudolf Ahlswede and Andreas Winter introduced a Chernoff bound for matrix-valued random variables. The following version of the inequality can be found in the work of Tropp.

Let M_1, \dots, M_t be independent matrix-valued random variables such that M_i\in \mathbb{C}^{d_1 \times d_2} and \mathbb{E}[M_i]=0. Let us denote by \lVert M \rVert the operator norm of the matrix M. If \lVert M_i \rVert \leq \gamma holds almost surely for all i\in\{1,\dots,t\}, then for every \varepsilon > 0

:\Pr\left( \left\| \frac{1}{t} \sum_{i=1}^t M_i \right\| > \varepsilon \right) \leq (d_1+d_2) \exp \left( -\frac{3\varepsilon^2 t}{8\gamma^2} \right).

Notice that in order to conclude that the deviation from 0 is bounded by \varepsilon with high probability, we need to choose a number of samples t proportional to the logarithm of d_1+d_2. In general, unfortunately, a dependence on \log(\min(d_1,d_2)) is inevitable: take for example a diagonal random sign matrix of dimension d\times d. The operator norm of the sum of ''t'' independent samples is precisely the maximum deviation among ''d'' independent random walks of length ''t''. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that ''t'' should grow logarithmically with ''d'' in this scenario.

The following theorem can be obtained by assuming ''M'' has low rank, in order to avoid the dependency on the dimensions.
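
The diagonal random-sign example can be simulated directly; the sketch below (assuming NumPy) shows the operator norm of the sum growing roughly like \sqrt{2 t \log d} as the dimension increases:

```python
import numpy as np

rng = np.random.default_rng(3)
t = 1000
for d in (10, 100, 1000):
    # Each sample is a diagonal matrix with i.i.d. +-1 entries; the operator norm of
    # the sum is the largest absolute value among d random walks of length t.
    walks = rng.choice([-1, 1], size=(d, t)).sum(axis=1)
    print(d, np.abs(walks).max(), round(np.sqrt(2 * t * np.log(d)), 1))
```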


Theorem without the dependency on the dimensions

Let 0 < \varepsilon < 1 and ''M'' be a random symmetric real matrix with \| \operatorname E[M] \| \leq 1 and \| M\| \leq \gamma almost surely. Assume that each element on the support of ''M'' has at most rank ''r''. Set

: t = \Omega \left( \frac{\gamma \log (\gamma/\varepsilon^2)}{\varepsilon^2} \right).

If r \leq t holds almost surely, then

:\Pr\left(\left\| \frac{1}{t} \sum_{i=1}^t M_i - \operatorname E[M] \right\| > \varepsilon \right) \leq \frac{1}{\mathbf{poly}(t)}

where M_1, \dots, M_t are i.i.d. copies of ''M''.


Sampling variant

The following variant of Chernoff's bound can be used to bound the probability that a majority in a population will become a minority in a sample, or vice versa.

Suppose there is a general population ''A'' and a sub-population ''B'' ⊆ ''A''. Mark the relative size of the sub-population (|''B''|/|''A''|) by ''r''. Suppose we pick an integer ''k'' and a random sample ''S'' ⊂ ''A'' of size ''k''. Mark the relative size of the sub-population in the sample (|''B''∩''S''|/|''S''|) by ''r_S''. Then, for every fraction ''d'' ∈ [0,1]:

:\Pr\left(r_S < (1-d)\cdot r\right) < \exp\left(-r\cdot d^2 \cdot \frac{k}{2}\right)

In particular, if ''B'' is a majority in ''A'' (i.e. ''r'' > 0.5) we can bound the probability that ''B'' will remain a majority in ''S'' (''r_S'' > 0.5) by taking ''d'' = 1 − 1/(2''r''):

:\Pr\left(r_S > 0.5\right) > 1 - \exp\left(-r\cdot \left(1 - \frac{1}{2r}\right)^2 \cdot \frac{k}{2} \right)

This bound is of course not tight at all. For example, when ''r'' = 0.5 we get the trivial bound Prob > 0.
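
A short sketch (assuming NumPy; the helper name is illustrative) evaluates this lower bound for a few values of ''r'':

```python
import numpy as np

def majority_stays_majority(r, k):
    """Lower bound on Pr(r_S > 0.5) for sub-population fraction r > 0.5 and sample size k,
    using d = 1 - 1/(2r) in the bound above."""
    d = 1 - 1 / (2 * r)
    return 1 - np.exp(-r * d**2 * k / 2)

for r in (0.55, 0.6, 0.7):
    print(r, majority_stays_majority(r, k=1000))   # ~0.90, ~0.9998, ~1.0 respectively
```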


Proofs


Multiplicative form

Following the conditions of the multiplicative Chernoff bound, let X_1, \dots, X_n be independent Bernoulli random variables, whose sum is X, each having probability ''p_i'' of being equal to 1. For a Bernoulli variable:

:\operatorname E \left[e^{t X_i} \right] = (1 - p_i) e^0 + p_i e^t = 1 + p_i (e^t -1) \leq e^{p_i (e^t - 1)}

So, using (1) with a = (1+\delta)\mu for any \delta>0 and where \mu = \operatorname E[X] = \textstyle\sum_{i=1}^n p_i,

:\begin{align} \Pr (X > (1 + \delta)\mu) &\le \inf_{t > 0} \exp(-t(1+\delta)\mu)\prod_{i=1}^n\operatorname E[\exp(tX_i)] \\ & \leq \inf_{t > 0} \exp\Big(-t(1+\delta)\mu + \sum_{i=1}^n p_i(e^t - 1)\Big) \\ & = \inf_{t > 0} \exp\Big(-t(1+\delta)\mu + (e^t - 1)\mu\Big). \end{align}

If we simply set t = \log(1+\delta) so that t > 0 for \delta > 0, we can substitute and find

:\exp\Big(-t(1+\delta)\mu + (e^t - 1)\mu\Big) = \frac{e^{\delta\mu}}{(1+\delta)^{(1+\delta)\mu}} = \left[\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right]^\mu.

This proves the result desired.


Chernoff–Hoeffding theorem (additive form)

Let q = p + \varepsilon. Taking a = nq in (1), we obtain:

:\Pr\left ( \frac{1}{n} \sum X_i \ge q\right )\le \inf_{t>0} \frac{\operatorname E \left[\prod e^{t X_i}\right]}{e^{tnq}} = \inf_{t>0} \left ( \frac{\operatorname E\left[e^{tX_i} \right]}{e^{tq}}\right )^n.

Now, knowing that \operatorname E\left[e^{tX_i} \right] = pe^t + (1-p), we have

:\left (\frac{\operatorname E\left[e^{tX_i} \right]}{e^{tq}}\right )^n = \left (\frac{pe^t + (1-p)}{e^{tq}}\right )^n = \left ( pe^{(1-q)t} + (1-p)e^{-qt} \right )^n.

Therefore, we can easily compute the infimum, using calculus:

:\frac{d}{dt} \left (pe^{(1-q)t} + (1-p)e^{-qt} \right) = (1-q)pe^{(1-q)t}-q(1-p)e^{-qt}

Setting the equation to zero and solving, we have

:\begin{align} (1-q)pe^{(1-q)t} &= q(1-p)e^{-qt} \\ (1-q)pe^{t} &= q(1-p) \end{align}

so that

:e^t = \frac{q(1-p)}{(1-q)p}.

Thus,

:t = \log\left(\frac{q(1-p)}{(1-q)p}\right).

As q = p + \varepsilon > p, we see that t > 0, so our bound is satisfied for this choice of ''t''. Having solved for ''t'', we can plug back into the equations above to find that

:\begin{align} \log \left (pe^{(1-q)t} + (1-p)e^{-qt} \right ) &= \log \left ( e^{-qt}(1-p+pe^t) \right ) \\ &= \log\left (e^{-qt}\right) + \log\left(1-p+ pe^{\log\frac{1-p}{1-q}}e^{\log\frac{q}{p}}\right ) \\ &= -q\log\frac{1-p}{1-q} -q \log\frac{q}{p} + \log\left(1-p+ p\left(\frac{1-p}{1-q}\right)\frac{q}{p}\right) \\ &= -q\log\frac{1-p}{1-q} -q \log\frac{q}{p} + \log\left(\frac{(1-p)(1-q)}{1-q}+\frac{(1-p)q}{1-q}\right) \\ &= -q \log\frac{q}{p} + \left ( -q\log\frac{1-p}{1-q} + \log\frac{1-p}{1-q} \right ) \\ &= -q\log\frac{q}{p} + (1-q)\log\frac{1-p}{1-q} \\ &= -D(q \parallel p). \end{align}

We now have our desired result, that

:\Pr \left (\tfrac{1}{n}\sum X_i \ge p + \varepsilon\right ) \le e^{-D(p+\varepsilon\parallel p) n}.

To complete the proof for the symmetric case, we simply define the random variable Y_i = 1 - X_i, apply the same proof, and plug it into our bound.


See also

* Bernstein inequalities
* Concentration inequality − a summary of tail bounds on random variables
* Cramér's theorem
* Entropic value at risk
* Hoeffding's inequality
* Matrix Chernoff bound
* Moment generating function

