Sub-gaussian

In probability theory, a subgaussian distribution, the distribution of a subgaussian random variable, is a probability distribution with strong tail decay. More specifically, the tails of a subgaussian distribution are dominated by (i.e. decay at least as fast as) the tails of a Gaussian. This property gives subgaussian distributions their name.

Often in analysis, we divide an object (such as a random variable) into two parts, a central bulk and a distant tail, then analyze each separately. In probability, this division usually goes like "Everything interesting happens near the center. The tail event is so rare, we may safely ignore that." Subgaussian distributions are worthy of study, because the gaussian distribution is well-understood, and so we can give sharp bounds on the rarity of the tail event. Similarly, the subexponential distributions are also worthy of study.

Formally, the probability distribution of a random variable X is called subgaussian if there is a positive constant C such that for every t \geq 0,
: \operatorname{P}(|X| \geq t) \leq 2 \exp\left(-\frac{t^2}{C^2}\right).
There are many equivalent definitions. For example, a random variable X is sub-Gaussian iff its distribution function is bounded from above (up to a constant) by the distribution function of a Gaussian:
: \operatorname{P}(|X| \geq t) \leq c\operatorname{P}(|Z| \geq t) \quad \forall t > 0,
where c \geq 0 is a constant and Z is a mean zero Gaussian random variable.


Definitions


Subgaussian norm

The subgaussian norm of X, denoted as \Vert X \Vert_{\psi_2}, is
: \Vert X \Vert_{\psi_2} = \inf\left\{c > 0 : \operatorname{E}\left[e^{X^2/c^2}\right] \leq 2\right\}.
In other words, it is the Orlicz norm of X generated by the Orlicz function \Phi(u) = e^{u^2} - 1. By condition (2) below, subgaussian random variables can be characterized as those random variables with finite subgaussian norm.
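
As a concrete illustration, here is a minimal numerical sketch of this definition (assuming Python with NumPy; the function name psi2_norm is illustrative, not a standard API). It estimates \Vert X \Vert_{\psi_2} from samples by bisecting on c in the defining inequality \operatorname{E}[e^{X^2/c^2}] \leq 2:

```python
import numpy as np

def psi2_norm(samples, tol=1e-4):
    """Estimate the subgaussian norm: the smallest c > 0 with E[exp(X^2/c^2)] <= 2."""
    x2 = np.asarray(samples, dtype=float) ** 2

    def orlicz_expectation(c):
        return np.mean(np.exp(x2 / c ** 2))

    # E[exp(X^2/c^2)] decreases in c, so bracket the crossing of level 2 and bisect.
    lo, hi = 1e-6, 1.0
    while orlicz_expectation(hi) > 2:
        hi *= 2
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if orlicz_expectation(mid) > 2:
            lo = mid
        else:
            hi = mid
    return hi

rng = np.random.default_rng(0)
rademacher = rng.choice([-1.0, 1.0], size=100_000)
# For a Rademacher variable X^2 = 1, so the exact norm is 1/sqrt(ln 2) ~ 1.2011.
print(psi2_norm(rademacher))
```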


Variance proxy

If there exists some s^2 such that \operatorname{E}\left[e^{t(X - \operatorname{E}[X])}\right] \leq e^{s^2 t^2/2} for all t, then s^2 is called a ''variance proxy'', and the smallest such s^2 is called the ''optimal variance proxy'' and denoted by \Vert X\Vert_{\mathrm{vp}}^2. Since \operatorname{E}\left[e^{t(X-\mu)}\right] = e^{\sigma^2 t^2/2} when X \sim \mathcal{N}(\mu, \sigma^2) is Gaussian, we then have \Vert X\Vert_{\mathrm{vp}}^2 = \sigma^2, as it should.
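
Equivalently, the optimal variance proxy can be read off the cumulant generating function as \sup_{t \neq 0} 2\ln \operatorname{E}[e^{t(X - \operatorname{E}X)}]/t^2. A small sketch (assuming NumPy; the grid, the distributions, and the function name are illustrative) evaluates this supremum on a grid for finitely supported distributions, where the expectation can be computed exactly:

```python
import numpy as np

def variance_proxy(values, probs, t_grid):
    """Approximate the optimal variance proxy of a finitely supported distribution
    by maximizing 2*ln(MGF(t))/t^2 over a grid of nonzero t."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    centered = values - probs @ values
    mgf = np.exp(np.outer(t_grid, centered)) @ probs   # exact MGF of X - E[X] on the grid
    return np.max(2 * np.log(mgf) / t_grid ** 2)

t_grid = np.concatenate([np.linspace(-20, -1e-2, 2000), np.linspace(1e-2, 20, 2000)])

# Rademacher: the proxy equals the variance 1 (the supremum is approached as t -> 0).
print(variance_proxy([-1.0, 1.0], [0.5, 0.5], t_grid))
# Skewed Bernoulli with P(X=1) = 0.9: variance 0.09, but the optimal proxy is
# (p - q) / (2 ln(p/q)) ~ 0.182, so the proxy is strictly larger than the variance.
print(variance_proxy([0.0, 1.0], [0.1, 0.9], t_grid))
```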


Equivalent definitions

Let X be a random variable and let K_1, K_2, K_3, \dots be positive constants. The following conditions are equivalent: (Proposition 2.5.2)
# Tail probability bound: \operatorname{P}(|X| \geq t) \leq 2 \exp\left(-t^2/K_1^2\right) for all t \geq 0;
# Finite subgaussian norm: \Vert X \Vert_{\psi_2} = K_2 < \infty;
# Moment: \operatorname{E}|X|^p \leq 2K_3^p \Gamma\left(\frac{p}{2}+1\right) for all p \geq 1, where \Gamma is the gamma function;
# Moment: \operatorname{E}|X|^p \leq K_4^p p^{p/2} for all p \geq 1;
# Moment-generating function (of X), or variance proxy: \operatorname{E}\left[e^{tX}\right] \leq e^{K_5^2 t^2} for all t (for this equivalence X is additionally assumed to have mean zero);
# Moment-generating function of X^2: \operatorname{E}\left[e^{t^2 X^2}\right] \leq e^{K_6^2 t^2} for all t \in [-1/K_6, +1/K_6];
# Union bound: for some c > 0, \operatorname{E}\left[\max\{|X_1 - \operatorname{E}[X]|, \ldots, |X_n - \operatorname{E}[X]|\}\right] \leq c \sqrt{\ln n} for all n > c, where X_1, \ldots, X_n are i.i.d. copies of X (see the numerical sketch after this list);
# Subexponential: X^2 has a subexponential distribution.
Furthermore, the constant K is the same in the definitions (1) to (5), up to an absolute constant. So for example, given a random variable satisfying (1) and (2), the minimal constants K_1, K_2 in the two definitions satisfy K_1 \leq cK_2 and K_2 \leq c' K_1, where c, c' are constants independent of the random variable.
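
The union-bound condition (7) can be seen numerically. A minimal Monte Carlo sketch (assuming NumPy; the sample sizes are arbitrary) estimates \operatorname{E}[\max_i |X_i|] for i.i.d. standard normals (which are mean zero) and shows that it stays within a constant multiple of \sqrt{\ln n}:

```python
import numpy as np

rng = np.random.default_rng(1)

# E[max_i |X_i|] for i.i.d. standard normals grows like sqrt(2 ln n),
# so the ratio to sqrt(ln n) should stay bounded as n increases.
for n in [10, 100, 1_000, 10_000]:
    maxima = np.abs(rng.standard_normal((1_000, n))).max(axis=1)   # 1000 repetitions
    expected_max = maxima.mean()
    print(n, round(expected_max, 3), round(expected_max / np.sqrt(np.log(n)), 3))
```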


Proof of equivalence

As an example, the first four definitions are equivalent by the proof below.

''Proof''. (1)\implies(3): By the layer cake representation,
:\begin{align} \operatorname{E}|X|^p &= \int_0^\infty \operatorname{P}(|X|^p \geq t)\, dt\\ &= \int_0^\infty pt^{p-1}\operatorname{P}(|X| \geq t)\, dt\\ &\leq 2\int_0^\infty pt^{p-1}\exp\left(-\frac{t^2}{K_1^2}\right) dt. \end{align}
After a change of variables u = t^2/K_1^2, we find that
:\begin{align} \operatorname{E}|X|^p &\leq 2K_1^p \frac{p}{2}\int_0^\infty u^{p/2-1}e^{-u}\, du\\ &= 2K_1^p \frac{p}{2}\Gamma\left(\frac{p}{2}\right)\\ &= 2K_1^p \Gamma\left(\frac{p}{2}+1\right). \end{align}

(3)\implies(2): By the Taylor series e^x = 1 + \sum_{p=1}^\infty \frac{x^p}{p!},
:\begin{align} \operatorname{E}\left[e^{\lambda X^2}\right] &= 1 + \sum_{p=1}^\infty \frac{\lambda^p \operatorname{E}|X|^{2p}}{p!}\\ &\leq 1 + \sum_{p=1}^\infty \frac{2\lambda^p K_3^{2p} \Gamma(p+1)}{p!}\\ &= 1 + 2 \sum_{p=1}^\infty \left(\lambda K_3^2\right)^p\\ &= 2 \sum_{p=0}^\infty \left(\lambda K_3^2\right)^p - 1\\ &= \frac{2}{1 - \lambda K_3^2} - 1 \quad\text{when } \lambda K_3^2 < 1, \end{align}
which is less than or equal to 2 for \lambda \leq \frac{1}{3K_3^2}. Let K_2 \geq \sqrt{3}\,K_3; then \operatorname{E}\left[e^{X^2/K_2^2}\right] \leq 2.

(2)\implies(1): By Markov's inequality,
:\operatorname{P}(|X| \geq t) = \operatorname{P}\left( \exp\left(\frac{X^2}{K_2^2}\right) \geq \exp\left(\frac{t^2}{K_2^2}\right) \right) \leq \frac{\operatorname{E}\left[e^{X^2/K_2^2}\right]}{\exp\left(t^2/K_2^2\right)} \leq 2 \exp\left(-\frac{t^2}{K_2^2}\right).

(3) \iff (4) by the asymptotic formula for the gamma function: \Gamma(p/2 + 1) \sim \sqrt{\pi p} \left(\frac{p}{2e}\right)^{p/2}.

From the proof, we can extract a cycle of three inequalities:
* If \operatorname{P}(|X| \geq t) \leq 2 \exp\left(-t^2/K^2\right), then \operatorname{E}|X|^p \leq 2K^p \Gamma\left(\frac{p}{2}+1\right) for all p \geq 1.
* If \operatorname{E}|X|^p \leq 2K^p \Gamma\left(\frac{p}{2}+1\right) for all p \geq 1, then \Vert X \Vert_{\psi_2} \leq \sqrt{3}\,K.
* If \Vert X \Vert_{\psi_2} \leq K, then \operatorname{P}(|X| \geq t) \leq 2 \exp\left(-t^2/K^2\right).
In particular, the constant K provided by the definitions is the same up to a constant factor, so we can say that the definitions are equivalent up to a constant independent of X. Similarly, because \Gamma(p/2 + 1) agrees with p^{p/2} up to a factor of the form c^p for all p \geq 1, definitions (3) and (4) are also equivalent up to a constant.
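
As a sanity check of the implication (1)\implies(3), consider the standard normal, which satisfies the tail bound with K_1 = \sqrt{2}. The following sketch (standard library only; the choice of distribution and exponents is illustrative) compares its exact absolute moments \operatorname{E}|Z|^p = 2^{p/2}\Gamma((p+1)/2)/\sqrt{\pi} with the bound 2K_1^p\Gamma(p/2+1):

```python
from math import gamma, sqrt, pi

K1 = sqrt(2)  # the standard normal satisfies P(|Z| >= t) <= 2 exp(-t^2 / K1^2)

for p in [1, 2, 3, 4, 6, 8, 10]:
    abs_moment = 2 ** (p / 2) * gamma((p + 1) / 2) / sqrt(pi)   # exact E|Z|^p
    bound = 2 * K1 ** p * gamma(p / 2 + 1)                      # bound from (1) => (3)
    print(p, abs_moment, bound, abs_moment <= bound)
```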


Basic properties

Notation: X \lesssim X' means that X \leq CX', where the positive constant C is independent of X and X'.


Concentration


Strictly subgaussian

Expanding the cumulant generating function:
:\frac{1}{2} s^2 t^2 \geq \ln \operatorname{E}\left[e^{tX}\right] = \frac{1}{2} \operatorname{Var}[X]\, t^2 + \frac{\kappa_3}{6} t^3 + \cdots
we find that \operatorname{Var}[X] \leq \Vert X\Vert_{\mathrm{vp}}^2. At the edge of possibility, a random variable X satisfying \operatorname{Var}[X] = \Vert X\Vert_{\mathrm{vp}}^2 is called ''strictly'' subgaussian.


Properties

Theorem. Let X be a subgaussian random variable with mean zero. If all zeros of its characteristic function are real, then X is strictly subgaussian.

Corollary. If X_1, \dots, X_n are independent and strictly subgaussian, then any linear sum of them is strictly subgaussian.


Examples

By calculating the characteristic functions, we can show that some distributions are strictly subgaussian: the symmetric uniform distribution and the symmetric Bernoulli distribution. Since a symmetric uniform distribution is strictly subgaussian, its convolution with itself is strictly subgaussian. That is, the symmetric triangular distribution is strictly subgaussian. Since the symmetric Bernoulli distribution is strictly subgaussian, any symmetric binomial distribution is strictly subgaussian.
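
These claims can be probed directly against the defining inequality \ln \operatorname{E}[e^{tX}] \leq \tfrac{1}{2}\operatorname{Var}[X]\, t^2 (a sketch assuming NumPy; the grid only checks finitely many t and is of course not a proof):

```python
import numpy as np

t = np.concatenate([np.linspace(-50, -1e-2, 5000), np.linspace(1e-2, 50, 5000)])

# Symmetric Bernoulli (Rademacher): Var = 1 and ln E[exp(tX)] = ln cosh(t).
rademacher_ok = np.all(np.log(np.cosh(t)) <= t ** 2 / 2)

# Symmetric uniform on [-1, 1]: Var = 1/3 and ln E[exp(tX)] = ln(sinh(t)/t).
uniform_ok = np.all(np.log(np.sinh(t) / t) <= t ** 2 / 6)

print(rademacher_ok, uniform_ok)   # both should print True
```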


Examples

The optimal variance proxy \Vert X\Vert_{\mathrm{vp}}^2 is known for many standard probability distributions, including the beta, Bernoulli, Dirichlet, Kumaraswamy, triangular, truncated Gaussian, and truncated exponential.


Bernoulli distribution

Let p and q be two positive numbers with p + q = 1. Let X follow the centered Bernoulli distribution p\delta_q + q \delta_{-p}, so that it has mean zero. Then \Vert X\Vert_{\mathrm{vp}}^2 = \frac{p-q}{2(\ln p - \ln q)}. Its subgaussian norm is t, where t is the unique positive solution to pe^{q^2/t^2} + qe^{p^2/t^2} = 2.

Let X be a random variable with symmetric Bernoulli distribution (or Rademacher distribution). That is, X takes values -1 and 1 with probabilities 1/2 each. Since X^2 = 1, it follows that
:\Vert X \Vert_{\psi_2} = \inf\left\{c > 0 : \operatorname{E}\left[e^{X^2/c^2}\right] \leq 2\right\} = \inf\left\{c > 0 : e^{1/c^2} \leq 2\right\} = \frac{1}{\sqrt{\ln 2}},
and hence X is a subgaussian random variable.
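
A short numerical check of these two formulas (assuming NumPy; p = 0.9 is an arbitrary choice): it verifies the variance-proxy bound on the cumulant generating function over a grid of t, and solves the norm equation by bisection:

```python
import numpy as np

p, q = 0.9, 0.1                                  # arbitrary example with p + q = 1
s2 = (p - q) / (2 * (np.log(p) - np.log(q)))     # claimed optimal variance proxy

# X takes the value q with probability p and -p with probability q (mean zero).
# Check the bound ln E[exp(tX)] <= s2 * t^2 / 2 on a grid of t.
t = np.linspace(-40, 40, 8001)
cgf = np.log(p * np.exp(t * q) + q * np.exp(-t * p))
print(np.all(cgf <= s2 * t ** 2 / 2 + 1e-12))    # True (tolerance for the point of equality)

# Solve p*exp(q^2/c^2) + q*exp(p^2/c^2) = 2 for the subgaussian norm by bisection.
def orlicz(c):
    return p * np.exp(q ** 2 / c ** 2) + q * np.exp(p ** 2 / c ** 2)

lo, hi = 1e-3, 10.0
while hi - lo > 1e-10:
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if orlicz(mid) > 2 else (lo, mid)
print(hi)                                        # the subgaussian norm (about 0.58 here)
```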


Bounded distributions

Bounded distributions have no tail at all, so clearly they are subgaussian. If X is bounded within the interval [a, b], Hoeffding's lemma states that \Vert X\Vert_{\mathrm{vp}}^2 \leq \left(\frac{b-a}{2}\right)^2. Hoeffding's inequality is the Chernoff bound obtained using this fact.
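
A quick check of Hoeffding's lemma on a toy distribution (assuming NumPy; the support points and probabilities are arbitrary):

```python
import numpy as np

values = np.array([0.0, 0.3, 1.0])      # a distribution supported in [a, b] = [0, 1]
probs = np.array([0.5, 0.3, 0.2])
a, b = 0.0, 1.0

t = np.linspace(-30, 30, 6001)
centered = values - probs @ values
cgf = np.log(np.exp(np.outer(t, centered)) @ probs)   # ln E[exp(t(X - EX))], exact

# Hoeffding's lemma: ln E[exp(t(X - EX))] <= t^2 (b - a)^2 / 8 for every t.
print(np.all(cgf <= t ** 2 * (b - a) ** 2 / 8 + 1e-12))
```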


Convolutions

Since the sum of subgaussian random variables is still subgaussian, the convolution of subgaussian distributions is still subgaussian. In particular, any convolution of the normal distribution with any bounded distribution is subgaussian.


Mixtures

Given subgaussian distributions X_1, X_2, \dots, X_n, we can construct an additive mixture X as follows: first randomly pick a number i \in \{1, \dots, n\} with probability p_i, then pick X_i. Since
:\operatorname{E}\left[\exp\left(\frac{X^2}{c^2}\right)\right] = \sum_i p_i \operatorname{E}\left[\exp\left(\frac{X_i^2}{c^2}\right)\right],
we have \Vert X\Vert_{\psi_2} \leq \max_i \Vert X_i\Vert_{\psi_2}, and so the mixture is subgaussian. In particular, any gaussian mixture is subgaussian. More generally, the mixture of infinitely many subgaussian distributions is also subgaussian, if the subgaussian norm has a finite supremum: \Vert X\Vert_{\psi_2} \leq \sup_i \Vert X_i\Vert_{\psi_2}.
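
The mixture bound can be illustrated with a two-component Gaussian mixture, using the identity \operatorname{E}[e^{X^2/c^2}] = \sum_i p_i \operatorname{E}[e^{X_i^2/c^2}] together with the closed form \operatorname{E}[e^{Z^2/c^2}] = (1 - 2\sigma^2/c^2)^{-1/2} for Z \sim \mathcal{N}(0, \sigma^2) (a sketch in plain Python; the weights and standard deviations are arbitrary):

```python
from math import sqrt

def gaussian_orlicz(sigma, c):
    """E[exp(Z^2 / c^2)] for Z ~ N(0, sigma^2), valid when c^2 > 2 sigma^2."""
    return 1.0 / sqrt(1.0 - 2.0 * sigma ** 2 / c ** 2)

def mixture_psi2(weights, sigmas):
    """Subgaussian norm of a centered Gaussian mixture, found by bisection on c."""
    lo = sqrt(2.0) * max(sigmas) * (1.0 + 1e-12)   # below this the expectation diverges
    hi = 10.0 * max(sigmas)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        value = sum(w * gaussian_orlicz(s, mid) for w, s in zip(weights, sigmas))
        lo, hi = (mid, hi) if value > 2 else (lo, mid)
    return hi

sigmas = [1.0, 2.0]
weights = [0.5, 0.5]
# Each component N(0, sigma^2) has psi_2 norm sigma * sqrt(8/3); the mixture's norm
# should come out no larger than the largest of these.
print(mixture_psi2(weights, sigmas), max(sigmas) * sqrt(8.0 / 3.0))
```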


Subgaussian random vectors

So far, we have discussed subgaussianity for real-valued random variables. We can also define subgaussianity for random vectors. The purpose of subgaussianity is to make the tails decay fast, so we generalize accordingly: a subgaussian random vector is a random vector where the tail decays fast.

Let X be a random vector taking values in \R^n. Define:
* \Vert X\Vert_{\psi_2} := \sup_{v \in S^{n-1}}\Vert v^T X\Vert_{\psi_2}, where S^{n-1} is the unit sphere in \R^n; similarly for the variance proxy, \Vert X\Vert_{\mathrm{vp}} := \sup_{v \in S^{n-1}}\Vert v^T X\Vert_{\mathrm{vp}}.
* X is subgaussian iff \Vert X\Vert_{\psi_2} < \infty.

Theorem. (Theorem 3.4.6) For any positive integer n, the uniformly distributed random vector X \sim U(\sqrt{n}\, S^{n-1}) is subgaussian, with \Vert X\Vert_{\psi_2} \lesssim 1.

This is not so surprising, because as n \to \infty, the projection of U(\sqrt{n}\, S^{n-1}) to the first coordinate converges in distribution to the standard normal distribution.
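
A Monte Carlo sketch of this theorem (assuming NumPy; n and the sample size are arbitrary). It samples X uniformly from the sphere of radius \sqrt{n}, projects onto the fixed unit direction e_1, and compares the empirical tail of the projection with a Gaussian-type bound 2\exp(-t^2/2), which this marginal satisfies by a classical spherical-cap estimate (the constant in the exponent is for illustration, not the theorem's unspecified one):

```python
import numpy as np

rng = np.random.default_rng(2)
n, samples = 100, 1_000_000

# First coordinate of a uniform point on the sphere of radius sqrt(n):
# g1 / sqrt(g1^2 + chi^2_{n-1}) scaled by sqrt(n).
g1 = rng.standard_normal(samples)
rest = rng.chisquare(n - 1, size=samples)
marginal = np.sqrt(n) * g1 / np.sqrt(g1 ** 2 + rest)

for t in [1.0, 2.0, 3.0, 4.0]:
    empirical = np.mean(np.abs(marginal) >= t)
    print(t, empirical, 2 * np.exp(-t ** 2 / 2))   # empirical tail vs. Gaussian-type bound
```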


Maximum inequalities


Inequalities

Theorem. (Theorem 2.6.1) There exists a positive constant C such that given any number of independent mean-zero subgaussian random variables X_1, \dots, X_n,
:\left\Vert \sum_{i=1}^n X_i\right\Vert_{\psi_2}^2 \leq C \sum_{i=1}^n\left\Vert X_i\right\Vert_{\psi_2}^2.

Theorem. (Hoeffding's inequality) (Theorem 2.6.3) There exists a positive constant c such that given any number of independent mean-zero subgaussian random variables X_1, \dots, X_N,
:\mathbb{P}\left(\left| \sum_{i=1}^N X_i\right| \geq t\right) \leq 2 \exp \left(-\frac{ct^2}{\sum_{i=1}^N \Vert X_i\Vert_{\psi_2}^2}\right) \quad \forall t > 0.

Theorem. (Bernstein's inequality) (Theorem 2.8.1) There exists a positive constant c such that given any number of independent mean-zero subexponential random variables X_1, \dots, X_N,
:\mathbb{P}\left(\left| \sum_{i=1}^N X_i\right| \geq t\right) \leq 2 \exp \left(-c \min \left(\frac{t^2}{\sum_{i=1}^N \Vert X_i\Vert_{\psi_1}^2}, \frac{t}{\max_i \Vert X_i\Vert_{\psi_1}}\right)\right).

Theorem. (Khinchine inequality) (Exercise 2.6.5) There exists a positive constant C such that given any number of independent mean-zero variance-one subgaussian random variables X_1, \dots, X_N, any p \geq 2, and any a_1, \dots, a_N \in \R,
:\left(\sum_{i=1}^N a_i^2\right)^{1/2} \leq\left\Vert \sum_{i=1}^N a_i X_i\right\Vert_{L^p} \leq C K \sqrt{p}\left(\sum_{i=1}^N a_i^2\right)^{1/2},
where K = \max_i \Vert X_i\Vert_{\psi_2}.
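
Hoeffding's inequality above can be watched in simulation (assuming NumPy; N and the Rademacher summands are illustrative). For ±1-valued summands a classical explicit form of the bound is \operatorname{P}(|\sum_i X_i| \geq t) \leq 2\exp(-t^2/(2N)); the theorem's unspecified constant c is replaced here by this concrete special case:

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 100, 100_000

# Sums of N independent Rademacher variables, repeated many times.
sums = rng.choice([-1, 1], size=(reps, N)).sum(axis=1)

for t in [10, 20, 30, 40]:
    empirical = np.mean(np.abs(sums) >= t)
    print(t, empirical, 2 * np.exp(-t ** 2 / (2 * N)))   # empirical tail vs. Hoeffding bound
```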


Hanson-Wright inequality

The Hanson-Wright inequality states that if a random vector X is subgaussian in a certain sense, then any quadratic form X^TAX of this vector is also subgaussian/subexponential. Further, the upper bound on the tail of X^TAX is uniform.

A weak version of the following theorem was proved in (Hanson, Wright, 1971). There are many extensions and variants. Much like the central limit theorem, the Hanson-Wright inequality is more a cluster of theorems with the same purpose, than a single theorem. The purpose is to take a subgaussian vector and uniformly bound its quadratic forms.

Theorem. There exists a constant c, such that: Let n be a positive integer. Let X_1, ..., X_n be independent subgaussian random variables, such that each satisfies \operatorname{E}[X_i] = 0. Combine them into a random vector X = (X_1, \dots, X_n). For any n\times n matrix A, we have
:\operatorname{P}\left(|X^T AX - \operatorname{E}[X^TAX]| > t \right) \leq \max\left( 2 e^{-ct^2/(K^4\Vert A\Vert_F^2)}, 2 e^{-ct/(K^2\Vert A\Vert)} \right) = 2 \exp \left[-c \min \left(\frac{t^2}{K^4\Vert A\Vert_F^2}, \frac{t}{K^2\Vert A\Vert}\right)\right],
where K = \max_i \Vert X_i\Vert_{\psi_2}, \Vert A\Vert_F = \sqrt{\sum_{i,j} A_{ij}^2} is the Frobenius norm of the matrix, and \Vert A\Vert = \max_{\Vert x\Vert_2 = 1} \Vert Ax\Vert_2 is the operator norm of the matrix.

In words, the quadratic form X^TAX has its tail uniformly bounded by an exponential, or a gaussian, whichever is larger. In the statement of the theorem, the constant c is an "absolute constant", meaning that it has no dependence on n, X_1, \dots, X_n, A. It is a mathematical constant much like \pi and e.
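
A small simulation illustrating the scale in the theorem (assuming NumPy; the matrix and the Rademacher entries are arbitrary choices). For independent mean-zero unit-variance entries \operatorname{E}[X^TAX] = \operatorname{tr}(A), and the fluctuations of X^TAX around \operatorname{tr}(A) should be on the order of \Vert A\Vert_F:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 200, 20_000

A = rng.standard_normal((n, n))
A = (A + A.T) / 2                                  # an arbitrary symmetric matrix

X = rng.choice([-1.0, 1.0], size=(reps, n))        # rows are independent Rademacher vectors
quadratic = np.sum((X @ A) * X, axis=1)            # X^T A X for each repetition

print(quadratic.mean(), np.trace(A))               # the mean matches E[X^T A X] = tr(A)
print(quadratic.std(), np.linalg.norm(A, 'fro'))   # fluctuations are of order ||A||_F
```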


Consequences

Theorem (subgaussian concentration). There exists a constant c, such that: Let n, m be positive integers. Let X_1, ..., X_n be independent subgaussian random variables, such that each satisfies \operatorname{E}[X_i] = 0 and \operatorname{E}[X_i^2] = 1, and let K = \max_i \Vert X_i\Vert_{\psi_2}. Combine them into a random vector X = (X_1, \dots, X_n). For any m\times n matrix A, we have
:\operatorname{P}\left( \left|\, \Vert AX\Vert_2 - \Vert A\Vert_F \,\right| > t \right) \leq 2 \exp\left(-\frac{ct^2}{K^4 \Vert A\Vert^2}\right).
In words, the random vector AX is concentrated on a spherical shell of radius \Vert A\Vert_F, such that \Vert AX\Vert_2 - \Vert A\Vert_F is subgaussian, with subgaussian norm \leq \sqrt{1/c}\, \Vert A\Vert K^2.
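
A simulation illustrating subgaussian concentration (assuming NumPy; the dimensions are arbitrary). For a random m \times n matrix A and Rademacher vectors X, the values \Vert AX\Vert_2 cluster around \Vert A\Vert_F with spread on the order of \Vert A\Vert, as the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, reps = 50, 400, 10_000

A = rng.standard_normal((m, n))
X = rng.choice([-1.0, 1.0], size=(reps, n))        # rows are Rademacher vectors

norms = np.linalg.norm(X @ A.T, axis=1)            # ||AX||_2 for each repetition

print(norms.mean(), np.linalg.norm(A, 'fro'))      # ||AX||_2 concentrates near ||A||_F
print(norms.std(), np.linalg.norm(A, 2))           # spread is of the order of ||A|| (operator norm)
```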


See also

* Platykurtic distribution


Notes


References

* Vershynin, R. (2018). ''High-dimensional probability: An introduction with applications in data science''. Volume 47 of ''Cambridge Series in Statistical and Probabilistic Mathematics''. Cambridge University Press, Cambridge.
* Zajkowski, K. (2020). "On norms in some class of exponential type Orlicz spaces of random variables". ''Positivity''. 24(5): 1231–1240. doi:10.1007/s11117-019-00729-6.