Cover's Theorem

Cover's theorem is a statement in computational learning theory and is one of the primary theoretical motivations for the use of non-linear kernel methods in machine learning applications. It is so termed after the information theorist Thomas M. Cover, who stated it in 1965, referring to it as the ''counting function theorem''.


The Theorem

The theorem expresses the number of homogeneously linearly separable sets of N points in D dimensions as an explicit ''counting'' function C(N,D) of the number of points N and the dimensionality D. It requires, as a necessary and sufficient condition, that the points be in general position. Simply put, this means that the points should be as linearly independent (non-aligned) as possible. This condition is satisfied "with probability 1", or almost surely, for random point sets, while it may easily be violated for real data, since these are often structured along lower-dimensional manifolds within the data space.

The function C(N,D) follows two different regimes depending on the relationship between N and D, as the short numerical sketch at the end of this section illustrates.
* For N \le D + 1, the function is exponential in N. This essentially means that ''any'' set of labelled points in general position and in number no larger than the dimensionality plus one is linearly separable; in jargon, it is said that a linear classifier ''shatters'' any point set with N \le D + 1. This limiting quantity is also known as the Vapnik–Chervonenkis dimension of the linear classifier.
* For N > D + 1, the counting function starts growing less than exponentially. This means that, given a sample of fixed size N, a random set of labelled points is more likely to be linearly separable when the dimensionality D is larger. Conversely, with fixed dimensionality, the fraction of labellings of random points that are linearly separable shrinks as the sample size grows; in other words, the probability of finding a linearly separable sample decreases with N.

A consequence of the theorem is that, given a set of training data that is not linearly separable, one can with high probability transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some non-linear transformation, or, as Cover put it:

''"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated."''


Proof

The proof of Cover's counting function theorem can be obtained from the recursive relation C(N+1,D) = C(N,D) + C(N,D-1). To show that, with fixed N, increasing D may turn a set of points from non-separable to separable, a deterministic mapping may be used: suppose there are N points. Lift them onto the vertices of the simplex in the (N-1)-dimensional real space. Since every partition of the samples into two sets is then separable by a linear separator, the property follows.
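Both steps of the argument can be checked numerically. The sketch below reuses the hypothetical cover_count helper and bias convention from the sketch above, and represents the simplex vertices, for convenience, by the standard basis vectors of R^N (which span an (N-1)-dimensional simplex). It confirms the recursive relation on a small grid and shows that, after the lift, the weight vector obtained by copying the labels separates every possible dichotomy.

```python
from itertools import product
from math import comb

import numpy as np

def cover_count(N: int, D: int) -> int:
    # Closed form under the bias/augmented convention used above.
    return 2 * sum(comb(N - 1, k) for k in range(D + 1))

# Check the recursion C(N+1, D) = C(N, D) + C(N, D-1) on a small grid.
for N in range(1, 12):
    for D in range(1, 8):
        assert cover_count(N + 1, D) == cover_count(N, D) + cover_count(N, D - 1)

# Simplex lift: send point i to the i-th standard basis vector of R^N
# (the vertices of a regular simplex lying in an (N-1)-dimensional
# affine subspace).  For any labelling y in {-1, +1}^N the weight
# vector w = y is a linear separator, because w . e_i = y_i.
N = 5
E = np.eye(N)                              # the lifted points, one per row
for labels in product([-1.0, 1.0], repeat=N):
    y = np.array(labels)
    assert np.all(np.sign(E @ y) == y)     # w = y separates this dichotomy
print("recursion and simplex-lift checks passed")
```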


See also

* Support vector machine
* Kernel method

