The sample complexity of a machine learning algorithm represents the number of training samples that it needs in order to successfully learn a target function.
More precisely, the sample complexity is the number of training samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1.
There are two variants of sample complexity:
* The weak variant fixes a particular input-output distribution;
* The strong variant takes the worst-case sample complexity over all input-output distributions.
The No free lunch theorem, discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples.
However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the VC dimension of the class of target functions.
Definition
Let $X$ be a space which we call the input space, and $Y$ be a space which we call the output space, and let $Z$ denote the product $X \times Y$. For example, in the setting of binary classification, $X$ is typically a finite-dimensional vector space and $Y$ is the set $\{-1, 1\}$.
Fix a hypothesis space $\mathcal{H}$ of functions $h \colon X \to Y$. A learning algorithm over $\mathcal{H}$ is a computable map from $Z^*$ to $\mathcal{H}$. In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from $X$ to $Y$. Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization.
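As an illustration of the last point, the following sketch implements empirical risk minimization for linear hypotheses under the square loss, with an optional Tikhonov (ridge) penalty; the data-generating process, the function name erm_linear and the regularization weight lam are hypothetical choices made for this example, not part of the article.
<syntaxhighlight lang="python">
import numpy as np

def erm_linear(X, y, lam=0.0):
    """Empirical risk minimization for linear hypotheses h(x) = <w, x>
    under the square loss, with optional Tikhonov (ridge) regularization.

    Minimizes (1/n) * sum_i (y_i - <w, x_i>)^2 + lam * ||w||^2,
    using the closed-form solution of this convex problem."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.eye(d)
    b = X.T @ y / n
    return np.linalg.solve(A, b)  # the hypothesis returned by the algorithm

# Toy data: a hypothetical linear target with additive noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w_erm = erm_linear(X, y)            # plain empirical risk minimization
w_reg = erm_linear(X, y, lam=0.1)   # with Tikhonov regularization
</syntaxhighlight>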
Fix a loss function $\mathcal{L} \colon Y \times Y \to \mathbb{R}_{\ge 0}$, for example, the square loss $\mathcal{L}(y, y') = (y - y')^2$, where $y' = h(x)$. For a given distribution $\rho$ on $X \times Y$, the expected risk of a hypothesis (a function) $h \in \mathcal{H}$ is
:$\mathcal{E}(h) := \mathbb{E}_\rho[\mathcal{L}(h(x), y)] = \int_{X \times Y} \mathcal{L}(h(x), y)\, d\rho(x, y).$
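Since $\rho$ is generally unknown or the integral is intractable, the expected risk is in practice often approximated by averaging the loss over fresh samples drawn from $\rho$. The sketch below is only an illustration of this definition: the distribution in sample_rho and the hypothesis passed to expected_risk are hypothetical choices, not something specified by the article.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

def sample_rho(n):
    """Draw n i.i.d. pairs (x, y) from a hypothetical distribution rho:
    x ~ Uniform[0, 1], y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

def expected_risk(h, n_samples=100_000):
    """Monte Carlo estimate of E(h) = E_rho[(h(x) - y)^2] (square loss)."""
    x, y = sample_rho(n_samples)
    return np.mean((h(x) - y) ** 2)

# Example hypothesis: a fixed affine function of x.
print(expected_risk(lambda x: 1.0 - 2.0 * x))
</syntaxhighlight>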
In our setting, we have $h = \mathcal{A}(S_n)$, where $\mathcal{A}$ is a learning algorithm and $S_n = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \rho^n$ is a sequence of vectors which are all drawn independently from $\rho$. Define the optimal risk
:$\mathcal{E}^*_{\mathcal{H}} = \underset{h \in \mathcal{H}}{\inf} \mathcal{E}(h).$
Set $h_n = \mathcal{A}(S_n)$, for each sample size $n$.
$h_n$ is a random variable and depends on the random variable $S_n$, which is drawn from the distribution $\rho^n$. The algorithm $\mathcal{A}$ is called consistent if $\mathcal{E}(h_n)$ probabilistically converges to $\mathcal{E}^*_{\mathcal{H}}$. In other words, for all $\epsilon, \delta > 0$, there exists a positive integer $N$, such that, for all sample sizes $n \geq N$, we have
:$\Pr_{\rho^n}\big[\mathcal{E}(h_n) - \mathcal{E}^*_{\mathcal{H}} \geq \epsilon\big] < \delta.$
The sample complexity of $\mathcal{A}$ is then the minimum $N$ for which this holds, as a function of $\rho$, $\epsilon$, and $\delta$. We write the sample complexity as $N(\rho, \epsilon, \delta)$ to emphasize that this value of $N$ depends on $\rho$, $\epsilon$, and $\delta$. If $\mathcal{A}$ is not consistent, then we set $N(\rho, \epsilon, \delta) = \infty$. If there exists an algorithm for which $N(\rho, \epsilon, \delta)$ is finite, then we say that the hypothesis space $\mathcal{H}$ is learnable.
In other words, the sample complexity $N(\rho, \epsilon, \delta)$ defines the rate of consistency of the algorithm: given a desired accuracy $\epsilon$ and confidence $\delta$, one needs to sample $N(\rho, \epsilon, \delta)$ data points to guarantee that the risk of the output function is within $\epsilon$ of the best possible, with probability at least $1 - \delta$.
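This definition can be explored by simulation. The sketch below uses an entirely hypothetical setup (not from the article): $X$ is uniform on $[0, 1]$, labels are given by the threshold $x \geq 1/2$, the hypothesis space is the class of threshold classifiers, and the learning algorithm returns the midpoint of the empirical decision boundary. The smallest $n$ at which the estimated failure probability drops below $\delta$ approximates $N(\rho, \epsilon, \delta)$ for this particular $\rho$.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
eps, delta = 0.05, 0.05   # desired accuracy and confidence

def risk_of_threshold(t):
    """True risk of the classifier 1[x >= t] when x ~ Uniform[0, 1] and the
    true label is 1[x >= 0.5]: the probability mass between t and 0.5."""
    return abs(t - 0.5)

def learn_threshold(n):
    """A simple consistent learner: place the threshold at the midpoint between
    the largest 0-labelled point and the smallest 1-labelled point."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = (x >= 0.5).astype(int)
    lo = x[y == 0].max() if np.any(y == 0) else 0.0
    hi = x[y == 1].min() if np.any(y == 1) else 1.0
    return (lo + hi) / 2.0

def failure_probability(n, trials=2000):
    """Estimate Pr[ E(h_n) - E*_H >= eps ]; here the optimal risk E*_H is 0."""
    fails = sum(risk_of_threshold(learn_threshold(n)) >= eps for _ in range(trials))
    return fails / trials

# Scan sample sizes until the estimated failure probability falls below delta.
for n in (5, 10, 20, 40, 80, 160):
    print(n, failure_probability(n))
</syntaxhighlight>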
In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is ''polynomial'', that is, whether $N(\rho, \epsilon, \delta)$ is bounded by a polynomial in $1/\epsilon$ and $1/\delta$. If $N(\rho, \epsilon, \delta)$ is polynomial for some learning algorithm, then one says that the hypothesis space $\mathcal{H}$ is PAC-learnable. This is a stronger notion than being learnable.
Unrestricted hypothesis space: infinite sample complexity
One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm $\mathcal{A}$, such that, for all $\epsilon, \delta > 0$, there exists a positive integer $N$ such that for all $n \geq N$, we have
:$\sup_\rho \Big( \Pr_{\rho^n}\big[\mathcal{E}(h_n) - \mathcal{E}^*_{\mathcal{H}} \geq \epsilon\big] \Big) < \delta,$
where $h_n = \mathcal{A}(S_n)$, with $S_n = ((x_1, y_1), \ldots, (x_n, y_n)) \sim \rho^n$ as above. The No Free Lunch Theorem says that without restrictions on the hypothesis space $\mathcal{H}$, this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large.
Thus, in order to make statements about the rate of convergence of the quantity
:$\sup_\rho \Big( \Pr_{\rho^n}\big[\mathcal{E}(h_n) - \mathcal{E}^*_{\mathcal{H}} \geq \epsilon\big] \Big),$
one must either
* constrain the space of probability distributions $\rho$, e.g. via a parametric approach, or
* constrain the space of hypotheses $\mathcal{H}$, as in distribution-free approaches.
Restricted hypothesis space: finite sample-complexity
The latter approach leads to concepts such as VC dimension and Rademacher complexity which control the complexity of the space $\mathcal{H}$. A smaller hypothesis space introduces more bias into the inference process, meaning that $\mathcal{E}^*_{\mathcal{H}}$ may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization.
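A rough way to see this trade-off numerically (the data-generating process and the polynomial hypothesis spaces below are arbitrary choices for illustration, not part of the article) is to fit a small and a large hypothesis space to the same small sample: the small space carries more bias, while the large space fits the sample well but generalizes erratically unless it is regularized.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)

def make_data(n):
    """Hypothetical data: x ~ Uniform[0, 1], y = sin(2*pi*x) + noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)

def test_risk(coeffs, n_test=50_000):
    """Monte Carlo estimate of the expected square loss of a fitted polynomial."""
    x, y = make_data(n_test)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_tr, y_tr = make_data(15)               # a small training sample
small = np.polyfit(x_tr, y_tr, deg=1)    # small hypothesis space: lines
large = np.polyfit(x_tr, y_tr, deg=12)   # large hypothesis space: degree-12 polynomials

print("degree 1 test risk: ", test_risk(small))   # biased but stable
print("degree 12 test risk:", test_risk(large))   # low training error, high variance
</syntaxhighlight>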
It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space $\mathcal{H}$:
# $\mathcal{H}$ is PAC-learnable.
# The VC dimension of $\mathcal{H}$ is finite.
# $\mathcal{H}$ is a uniform Glivenko-Cantelli class.
This gives a way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable.
An example of a PAC-learnable hypothesis space
Let $X = \mathbb{R}^2$, and let $\mathcal{H}$ be the space of affine functions on $X$, that is, functions of the form $x \mapsto \langle w, x \rangle + b$ for some $w \in \mathbb{R}^2$, $b \in \mathbb{R}$. This is the linear classification with offset learning problem. Now, any three points in general position can be shattered by affine functions, but four coplanar points in a square cannot, since no affine function can be positive on two diagonally opposite vertices and negative on the remaining two. Thus, the VC dimension of $\mathcal{H}$ is $3$, so it is finite. It follows by the above characterization of PAC-learnable classes that $\mathcal{H}$ is PAC-learnable, and by extension, learnable.
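The shattering argument can be checked numerically. The brute-force sketch below (an added illustration; the helper names and the randomized search are assumptions of this example) tests whether a point set in $\mathbb{R}^2$ can be shattered by affine classifiers $x \mapsto \operatorname{sign}(\langle w, x \rangle + b)$: it finds all $2^3$ labellings of three points in general position, but not the two "diagonal" labellings of the four corners of a square.
<syntaxhighlight lang="python">
import itertools
import numpy as np

rng = np.random.default_rng(4)

def realizable(points, labels, trials=20_000):
    """Randomized check of whether some affine classifier sign(<w, x> + b)
    produces the given +/-1 labelling. Being a random search, it can only
    confirm realizability; a failure is strong evidence, not a proof."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    for _ in range(trials):
        w = rng.normal(size=2)
        b = rng.normal()
        if np.all(y * (X @ w + b) > 0):
            return True
    return False

def shattered(points):
    """A point set is shattered if every labelling in {-1, +1}^n is realizable."""
    return all(realizable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

three_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]            # general position
square       = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

print(shattered(three_points))  # True: all 8 labellings are achievable
print(shattered(square))        # False: the diagonal labellings are not
</syntaxhighlight>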
Sample-complexity bounds
Suppose $\mathcal{H}$ is a class of binary functions (functions to $\{0, 1\}$). Then, $\mathcal{H}$ is $(\epsilon, \delta)$-PAC-learnable with a sample of size:
:$N = O\!\left(\frac{VC(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon}\right),$
where $VC(\mathcal{H})$ is the VC dimension of $\mathcal{H}$.
Moreover, any $(\epsilon, \delta)$-PAC-learning algorithm for $\mathcal{H}$ must have sample-complexity:
:$N = \Omega\!\left(\frac{VC(\mathcal{H}) + \ln\frac{1}{\delta}}{\epsilon}\right).$
Thus, the sample-complexity is a linear function of the VC dimension of the hypothesis space.
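The constant hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notation is not pinned down by these bounds, so the sketch below simply evaluates the scaling term $(VC(\mathcal{H}) + \ln(1/\delta))/\epsilon$ with the constant set to 1; the function name and the default constant are assumptions made for this illustration.
<syntaxhighlight lang="python">
import math

def vc_sample_bound(vc_dim, eps, delta, const=1.0):
    """Evaluate const * (VC(H) + ln(1/delta)) / eps, the scaling that appears in
    the PAC sample-complexity bounds; the true constant is absorbed by O(.)/Omega(.)."""
    return math.ceil(const * (vc_dim + math.log(1.0 / delta)) / eps)

# Affine classifiers on R^2 have VC dimension 3 (see the example above).
for eps in (0.1, 0.05, 0.01):
    print(eps, vc_sample_bound(vc_dim=3, eps=eps, delta=0.05))
</syntaxhighlight>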
Suppose $\mathcal{H}$ is a class of real-valued functions with range in