In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of real-valued functions with respect to a probability distribution.


Definitions


Rademacher complexity of a set

Given a set A\subseteq \mathbb{R}^m, the Rademacher complexity of ''A'' is defined as follows:

: \operatorname{Rad}(A) := \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{a \in A} \sum_{i=1}^m \sigma_i a_i \right]

where \sigma_1, \sigma_2, \dots, \sigma_m are independent random variables drawn from the Rademacher distribution, i.e. \Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2 for i=1,2,\dots,m, and a=(a_1, \ldots, a_m). Some authors take the absolute value of the sum before taking the supremum, but if A is symmetric this makes no difference.
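For a finite set A and small m, this expectation can be computed exactly by enumerating all 2^m sign vectors. The following Python sketch is a minimal illustration (the function name and the NumPy dependency are choices made here, not part of the source); the second call anticipates Example 2 in the Examples section below.

```python
import itertools

import numpy as np

def rademacher_complexity(A):
    """Exact Rad(A) for a finite set A of vectors in R^m, computed by
    enumerating all 2^m equally likely sign vectors (feasible for small m)."""
    A = np.asarray(A, dtype=float)
    m = A.shape[1]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=m):
        # sup over a in A of the correlation <sigma, a>
        total += np.max(A @ np.array(signs))
    # average over the 2^m sign vectors, then apply the 1/m factor
    return total / (2 ** m) / m

print(rademacher_complexity([(0.5, -0.3)]))    # singleton set: 0.0
print(rademacher_complexity([(1, 1), (1, 2)])) # 0.25, as in Example 2 below
```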


Rademacher complexity of a function class

Let S=(z_1, z_2, \dots, z_m) \in Z^m be a sample of points and consider a function class \mathcal{F} of real-valued functions over Z. Then, the empirical Rademacher complexity of \mathcal{F} given S is defined as:

: \operatorname{Rad}_S(\mathcal{F}) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{f \in \mathcal{F}} \sum_{i=1}^m \sigma_i f(z_i) \right]

This can also be written using the previous definition:

: \operatorname{Rad}_S(\mathcal{F}) = \operatorname{Rad}(\mathcal{F} \circ S)

where \mathcal{F} \circ S denotes function composition, i.e.:

: \mathcal{F} \circ S := \{ (f(z_1), \ldots, f(z_m)) \mid f \in \mathcal{F} \}

Let P be a probability distribution over Z. The Rademacher complexity of the function class \mathcal{F} with respect to P for sample size m is:

: \operatorname{Rad}_{P,m}(\mathcal{F}) := \mathbb{E}_{S \sim P^m} \left[ \operatorname{Rad}_S(\mathcal{F}) \right]

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample S=(z_1, z_2, \dots, z_m) generated according to P.
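A minimal sketch of the empirical version for a finite function class, reusing the exact-enumeration approach above (the toy threshold functions and the sample are illustrative assumptions, not from the source): each f is mapped to the vector (f(z_1), \ldots, f(z_m)), and the set definition is applied to \mathcal{F} \circ S.

```python
import itertools

import numpy as np

def empirical_rademacher(functions, sample):
    """Empirical Rademacher complexity Rad_S(F) of a finite function class,
    computed as Rad(F o S) by exact enumeration of sign vectors."""
    # F o S: one row per function, holding (f(z_1), ..., f(z_m))
    F_circ_S = np.array([[f(z) for z in sample] for f in functions])
    m = len(sample)
    total = sum(np.max(F_circ_S @ np.array(signs))
                for signs in itertools.product([-1.0, 1.0], repeat=m))
    return total / (2 ** m) / m

# Illustrative class: three threshold indicators on a 1-D sample.
sample = [0.2, 0.5, 0.9]
functions = [lambda z, t=t: float(z >= t) for t in (0.0, 0.4, 0.7)]
print(empirical_rademacher(functions, sample))
```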


Intuition

The Rademacher complexity is typically applied to a function class of models that are used for classification, with the goal of measuring their ability to classify points drawn from a probability space under arbitrary labellings. When the function class is rich enough, it contains functions that can adapt appropriately to each arrangement of labels, simulated by the random draw of \sigma_i under the expectation, so that the quantity inside the expectation is maximised.


Examples

1. A contains a single vector, e.g., A = \{(a, b)\} \subset \mathbb{R}^2. Then:

: \operatorname{Rad}(A) = \frac{1}{2} \cdot \left(\frac{1}{4}\cdot(a+b) + \frac{1}{4}\cdot(a-b) + \frac{1}{4}\cdot(-a+b) + \frac{1}{4}\cdot(-a-b)\right) = 0

The same is true for every singleton hypothesis class.

2. A contains two vectors, e.g., A = \{(1,1), (1,2)\} \subset \mathbb{R}^2. Then:

: \begin{align} \operatorname{Rad}(A) & = \frac{1}{2} \cdot \left(\frac{1}{4}\cdot\max(1+1, 1+2) + \frac{1}{4}\cdot\max(1-1, 1-2) + \frac{1}{4} \cdot \max(-1+1, -1+2) + \frac{1}{4}\cdot\max(-1-1, -1-2)\right) \\ & = \frac{1}{8} (3+0+1-2) = \frac{1}{4} \end{align}
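The second calculation can also be checked by Monte Carlo sampling of the sign vectors instead of exact enumeration; a sketch (the random seed and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [1.0, 2.0]])
m = A.shape[1]

# Sample many Rademacher sign vectors; each row of sigma @ A.T holds the
# correlations <sigma, a> for both vectors a in A.
sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
estimate = np.max(sigma @ A.T, axis=1).mean() / m
print(estimate)  # approximately 0.25, matching the exact value above
```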


Using the Rademacher complexity

The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.


Bounding the representativeness

In machine learning, it is desired to have a training set that represents the true distribution of some sample data S. This can be quantified using the notion of representativeness. Denote by P the probability distribution from which the samples are drawn. Denote by H the set of hypotheses (potential classifiers) and denote by F the corresponding set of error functions, i.e., for every hypothesis h\in H, there is a function f_h\in F that maps each training sample (features, label) to the error of the classifier h (note that in this case hypothesis and classifier are used interchangeably). For example, in the case that h represents a binary classifier, the error function is a 0–1 loss function, i.e. the error function f_h returns 0 if h correctly classifies a sample and 1 otherwise. We omit the index and write f instead of f_h when the underlying hypothesis is irrelevant. Define:

: L_P(f) := \mathbb{E}_{z \sim P}[f(z)] – the expected error of some error function f\in F on the real distribution P;

: L_S(f) := \frac{1}{m} \sum_{i=1}^m f(z_i) – the estimated error of some error function f\in F on the sample S.

The representativeness of the sample S, with respect to P and F, is defined as:

: \operatorname{Rep}_P(F,S) := \sup_{f\in F} (L_P(f) - L_S(f))

Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, so selecting a classifier that has a low estimated error ensures that the true error is also low. Note, however, that the concept of representativeness is relative and hence cannot be compared across distinct samples. The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class:

: \mathbb{E}_{S \sim P^m}\left[\operatorname{Rep}_P(F,S)\right] \leq 2 \cdot \mathbb{E}_{S \sim P^m}\left[\operatorname{Rad}(F\circ S)\right]
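As an illustration of this bound, the following sketch estimates both sides by simulation for a toy setup (all modelling choices here – uniform data, noiseless threshold labels, a grid of 21 threshold hypotheses with 0–1 loss – are assumptions made for the example, not from the source):

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)

# Toy setup: x ~ Uniform[0,1], true label y = 1{x >= 0.5};
# hypotheses h_t(x) = 1{x >= t}; F holds the corresponding 0-1 losses.
thresholds = np.linspace(0.0, 1.0, 21)
m = 10
all_sigma = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))

def loss_matrix(xs):
    """0-1 losses: one row per hypothesis, one column per sample point."""
    ys = (xs >= 0.5).astype(float)
    preds = (xs[None, :] >= thresholds[:, None]).astype(float)
    return (preds != ys[None, :]).astype(float)

# Approximate the true error L_P(f_h) of each hypothesis on a large sample.
L_P = loss_matrix(rng.uniform(size=200_000)).mean(axis=1)

reps, rads = [], []
for _ in range(200):
    F_circ_S = loss_matrix(rng.uniform(size=m))   # F o S for this sample
    L_S = F_circ_S.mean(axis=1)
    reps.append(np.max(L_P - L_S))                # Rep_P(F, S)
    rads.append(np.max(all_sigma @ F_circ_S.T, axis=1).mean() / m)

print(f"E[Rep] ~ {np.mean(reps):.3f} <= 2 E[Rad] ~ {2 * np.mean(rads):.3f}")
```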


Bounding the generalization error

When the Rademacher complexity is small, it is possible to learn the hypothesis class H using empirical risk minimization. For example, (with binary error function) for every \delta>0, with probability at least 1-\delta, for every hypothesis h\in H:

: L_P(h) - L_S(h) \leq 2 \operatorname{Rad}(F\circ S) + 4 \sqrt{\frac{2 \ln(4/\delta)}{m}}
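A small sketch evaluating the right-hand side of this bound for illustrative values (the numbers plugged in are arbitrary assumptions):

```python
import numpy as np

def generalization_gap_bound(rad_FS, m, delta):
    """Right-hand side of the bound: 2*Rad(F o S) + 4*sqrt(2*ln(4/delta)/m)."""
    return 2 * rad_FS + 4 * np.sqrt(2 * np.log(4 / delta) / m)

# With probability >= 0.95, every hypothesis generalizes to within this gap.
print(generalization_gap_bound(rad_FS=0.05, m=10_000, delta=0.05))
```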


Bounding the Rademacher complexity

Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set A \subset \mathbb{R}^m; property 6 is verified numerically in the sketch after this list.

1. If all vectors in A are translated by a constant vector a_0 \in \mathbb{R}^m, then Rad(''A'') does not change.

2. If all vectors in A are multiplied by a scalar c\in \mathbb{R}, then Rad(''A'') is multiplied by |c|.

3. \operatorname{Rad}(A+B) = \operatorname{Rad}(A) + \operatorname{Rad}(B).

4. (Kakade & Tewari Lemma) If all vectors in A are operated on by a Lipschitz function, then Rad(''A'') is (at most) multiplied by the Lipschitz constant of the function. In particular, if all vectors in A are operated on by a contraction mapping, then Rad(''A'') strictly decreases.

5. The Rademacher complexity of the convex hull of A equals Rad(''A'').

6. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let A be a set of N vectors in \mathbb{R}^m, and let \bar{a} be the mean of the vectors in A. Then:

: \operatorname{Rad}(A) \leq \max_{a \in A} \|a-\bar{a}\| \cdot \frac{\sqrt{2 \log N}}{m}

In particular, if A is a set of binary vectors, the norm is at most \sqrt{m}, so:

: \operatorname{Rad}(A) \leq \sqrt{\frac{2 \log N}{m}}
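A numeric check of Massart's lemma on a random finite set, comparing the exact Rademacher complexity against the bound (the set itself is an arbitrary assumption):

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
N, m = 8, 10
A = rng.normal(size=(N, m))  # N random vectors in R^m

# Exact Rad(A) by enumerating all 2^m sign vectors.
total = sum(np.max(A @ np.array(signs))
            for signs in itertools.product([-1.0, 1.0], repeat=m))
rad = total / (2 ** m) / m

# Massart's bound: max ||a - a_bar|| * sqrt(2 log N) / m.
a_bar = A.mean(axis=0)
bound = np.max(np.linalg.norm(A - a_bar, axis=1)) * np.sqrt(2 * np.log(N)) / m
print(f"Rad(A) = {rad:.4f} <= Massart bound = {bound:.4f}")
```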


Bounds related to the VC dimension

Let H be a set family whose VC dimension is d. It is known that the growth function of H is bounded as:

: for all m>d+1: \operatorname{Growth}(H,m)\leq (em/d)^d

This means that, for every set h with at most m elements, |H\cap h| \leq (em/d)^d. The set family H\cap h can be considered as a set of binary vectors over \mathbb{R}^m. Substituting this in Massart's lemma gives:

: \operatorname{Rad}(H\cap h) \leq \sqrt{\frac{2 d \log(em/d)}{m}}

With more advanced techniques (Dudley's entropy bound and Haussler's upper bound) one can show, for example, that there exists a constant C, such that any class of \{0,1\}-indicator functions with Vapnik–Chervonenkis dimension d has Rademacher complexity upper-bounded by C\sqrt{\frac{d}{m}}.
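For a concrete check, threshold indicators on the line have VC dimension d = 1: on m distinct points they realise exactly the m+1 "suffix" binary vectors, so the exact Rademacher complexity can be compared against the VC-based bound above (this toy class is an assumption made for illustration):

```python
import itertools

import numpy as np

m, d = 8, 1
# On m sorted distinct points, 1{x >= t} realises the m+1 suffix patterns.
patterns = np.array([[0.0] * (m - k) + [1.0] * k for k in range(m + 1)])

# Exact Rademacher complexity of this finite set of binary vectors.
total = sum(np.max(patterns @ np.array(signs))
            for signs in itertools.product([-1.0, 1.0], repeat=m))
rad = total / (2 ** m) / m

bound = np.sqrt(2 * d * np.log(np.e * m / d) / m)
print(f"Rad = {rad:.4f} <= VC-based bound = {bound:.4f}")
```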


Bounds related to linear classes

The following bounds are related to linear operations on S – a constant set of m vectors in \mathbb{R}^n; a numerical check of the first bound appears after the list.

1. Define A_2 = \{(\langle w, x_1 \rangle, \ldots, \langle w, x_m \rangle) \mid \|w\|_2 \leq 1\} = the set of dot-products of the vectors in S with vectors in the unit ball. Then:

: \operatorname{Rad}(A_2) \leq \frac{\max_i \|x_i\|_2}{\sqrt{m}}

2. Define A_1 = \{(\langle w, x_1 \rangle, \ldots, \langle w, x_m \rangle) \mid \|w\|_1 \leq 1\} = the set of dot-products of the vectors in S with vectors in the unit ball of the 1-norm. Then:

: \operatorname{Rad}(A_1) \leq \max_i\|x_i\|_\infty\cdot \sqrt{\frac{2\log(2n)}{m}}
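For the 2-norm case, the supremum over the unit ball has a closed form – \sup_{\|w\|_2 \leq 1} \langle w, \sum_i \sigma_i x_i \rangle = \|\sum_i \sigma_i x_i\|_2 – which makes a direct Monte Carlo check easy (the data below are an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 5
S = rng.normal(size=(m, n))  # m vectors x_i in R^n

# Each row of sigma @ S is sum_i sigma_i x_i for one draw of signs, and
# its 2-norm is the supremum over the unit ball for that draw.
sigma = rng.choice([-1.0, 1.0], size=(20_000, m))
rad_estimate = np.linalg.norm(sigma @ S, axis=1).mean() / m
bound = np.max(np.linalg.norm(S, axis=1)) / np.sqrt(m)
print(f"Rad(A_2) ~ {rad_estimate:.4f} <= {bound:.4f}")
```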


Bounds related to covering numbers

The following bound relates the Rademacher complexity of a set A to its external covering number – the number of balls of a given radius r whose union contains A. The bound is attributed to Dudley.

Suppose A\subset \mathbb{R}^m is a set of vectors whose length (norm) is at most c. Then, for every integer M>0:

: \operatorname{Rad}(A) \leq \frac{c\cdot 2^{-M}}{\sqrt{m}} + \frac{6c}{m}\cdot \sum_{i=1}^M 2^{-i}\sqrt{\log\left(N^{ext}_{c\cdot 2^{-i}}(A)\right)}

In particular, if A lies in a ''d''-dimensional subspace of \mathbb{R}^m, then:

: \forall r>0: N^{ext}_r(A) \leq (2 c \sqrt{m}/r)^d

Substituting this in the previous bound gives the following bound on the Rademacher complexity:

: \operatorname{Rad}(A) \leq \frac{6c}{m}\cdot \left(\sqrt{d\log(2\sqrt{m})} + 2\sqrt{d}\right) = O\left(\frac{c\sqrt{d\log(m)}}{m}\right)


Gaussian complexity

Gaussian complexity is a similar complexity with similar physical meanings, and can be obtained from the Rademacher complexity using the random variables g_i instead of \sigma_i, where g_i are Gaussian i.i.d. random variables with zero mean and variance 1, i.e. g_i \sim \mathcal{N}(0,1). Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors.


Equivalence of Rademacher and Gaussian complexity

Given a set A\subseteq\mathbb{R}^n, it holds that:

: \frac{G(A)}{2\sqrt{\log n}} \leq \operatorname{Rad}(A) \leq \sqrt{\frac{\pi}{2}} G(A)

where G(A) is the Gaussian complexity of A.
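A Monte Carlo check of this sandwich for one random finite set, estimating both complexities by sampling (the set, seed, and sample sizes are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 4))  # 6 vectors in R^n with n = 4
n = A.shape[1]

sigma = rng.choice([-1.0, 1.0], size=(100_000, n))  # Rademacher draws
g = rng.normal(size=(100_000, n))                   # Gaussian draws

# Same formula for both complexities, only the sign distribution differs.
rad = np.max(sigma @ A.T, axis=1).mean() / n
gauss = np.max(g @ A.T, axis=1).mean() / n

lower = gauss / (2 * np.sqrt(np.log(n)))
upper = np.sqrt(np.pi / 2) * gauss
print(f"{lower:.4f} <= Rad(A) ~ {rad:.4f} <= {upper:.4f}")
```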


References

* Peter L. Bartlett, Shahar Mendelson (2002). ''Rademacher and Gaussian Complexities: Risk Bounds and Structural Results''. Journal of Machine Learning Research 3: 463–482.
* Giorgio Gnecco, Marcello Sanguineti (2008). ''Approximation Error Bounds via Rademacher's Complexity''. Applied Mathematical Sciences, Vol. 2, no. 4: 153–176.