Vapnik–Chervonenkis Dimension

In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity, expressive power, richness, or flexibility) of a set of functions that can be learned by a statistical binary classification algorithm. It is defined as the cardinality of the largest set of points that the algorithm can shatter: for at least one configuration of that many data points, the algorithm can learn a perfect classifier for every possible labeling of the points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.

Informally, the capacity of a classification model is related to how complicated it can be. For example, consider the thresholding of a high-degree polynomial: if the polynomial evaluates above zero, the point is classified as positive, otherwise as negative. A high-degree polynomial can be wiggly, so it can fit a given set of training points well. But one can expect the classifier to make errors on other points, precisely because it is too wiggly. Such a polynomial has a high capacity. A much simpler alternative is to threshold a linear function. This function may not fit the training set well, because it has a low capacity. This notion of capacity is made rigorous below.


Definitions


VC dimension of a set-family

Let H be a set family (a set of sets) and C a set. Their ''intersection'' is defined as the following set family:

:H \cap C := \{ h \cap C \mid h \in H \}.

We say that a set C is ''shattered'' by H if H \cap C contains all the subsets of C, i.e.:

:|H \cap C| = 2^{|C|}.

The ''VC dimension'' D of H is the largest cardinality of a set shattered by H. If arbitrarily large sets can be shattered, the VC dimension is \infty.
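
This definition can be checked mechanically for small finite families. The following is a minimal brute-force sketch in Python (the helper names is_shattered and vc_dimension are illustrative, not from any library): it tests every candidate set C drawn from a finite universe and reports the largest cardinality that H shatters.

```python
from itertools import combinations

def is_shattered(H, C):
    """True iff the family H shatters the set C: the traces h & C,
    over all h in H, must realize all 2^|C| subsets of C."""
    C = frozenset(C)
    return len({frozenset(h) & C for h in H}) == 2 ** len(C)

def vc_dimension(H, universe):
    """Largest |C| over subsets C of the universe shattered by H (brute force)."""
    d = 0
    for k in range(1, len(universe) + 1):
        if any(is_shattered(H, C) for C in combinations(universe, k)):
            d = k
    return d

# Example: the "intervals" over three ordered points.
H = [set(), {1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}]
print(vc_dimension(H, [1, 2, 3]))  # 2 -- the trace {1, 3} is never realized
```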


VC dimension of a classification model

A binary classification model f with some parameter vector \theta is said to ''shatter'' a set of generally positioned data points (x_1, x_2, \ldots, x_n) if, for every assignment of labels to those points, there exists a \theta such that the model f makes no errors when evaluating that set of data points. The VC dimension of the model f is the maximum number of points that can be arranged so that f shatters them. More formally, it is the maximum cardinality D such that some set of data points of cardinality D can be shattered by f.


Examples

1. f is a constant classifier (with no parameters). Its VC dimension is 0, since it cannot shatter even a single point. In general, the VC dimension of a finite classification model, which can return at most 2^d different classifiers, is at most d (this is an upper bound on the VC dimension; the Sauer–Shelah lemma gives a lower bound on the dimension).

2. f is a single-parameter threshold classifier on real numbers; i.e., for a certain threshold \theta, the classifier f_\theta returns 1 if the input number is larger than \theta and 0 otherwise. The VC dimension of f is 1 because: (a) It can shatter a single point. For every point x, the classifier f_\theta labels it as 0 if \theta > x and as 1 if \theta < x. (b) It cannot shatter any set of two points. For every set of two numbers, if the smaller is labeled 1, then the larger must also be labeled 1, so not all labelings are possible.

3. f is a single-parameter interval classifier on real numbers; i.e., for a certain parameter \theta, the classifier f_\theta returns 1 if the input number is in the interval [\theta, \theta + 4] and 0 otherwise. The VC dimension of f is 2 because: (a) It can shatter some sets of two points. E.g., for every set \{x, x+2\}, the classifier f_\theta labels it as (0,0) if \theta < x - 4 or \theta > x + 2, as (1,0) if \theta \in [x-4, x-2), as (1,1) if \theta \in [x-2, x], and as (0,1) if \theta \in (x, x+2]. (b) It cannot shatter any set of three points. For every set of three numbers, if the smallest and the largest are labeled 1, then the middle one must also be labeled 1, so not all labelings are possible.

4. f is a straight line as a classification model on points in a two-dimensional plane (this is the model used by a perceptron). The line should separate positive data points from negative data points. There exist sets of 3 points that can indeed be shattered using this model (any 3 points that are not collinear can be shattered). However, no set of 4 points can be shattered: by Radon's theorem, any four points can be partitioned into two subsets with intersecting convex hulls, so it is not possible to separate one of these two subsets from the other. Thus, the VC dimension of this particular classifier is 3. It is important to remember that while one can choose any arrangement of points, the arrangement of those points cannot change when attempting to shatter them for some label assignment.

5. f is a single-parameter sine classifier; i.e., for a certain parameter \theta, the classifier f_\theta returns 1 if the input number x has \sin(\theta x) > 0 and 0 otherwise. The VC dimension of f is infinite, since it can shatter any finite subset of the set \{ 2^{-m} \mid m \in \mathbb{N} \}.
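
Examples 2 and 3 can be confirmed numerically. The sketch below (illustrative code, not from any library) sweeps a grid of parameter values and collects which labelings of a point set each classifier family realizes; a set is shattered exactly when all 2^n labelings appear.

```python
def shatters(classifier, thetas, points):
    """True iff the parametric classifier realizes all 2^n labelings of
    `points` for some parameter value in the grid `thetas`."""
    labelings = {tuple(classifier(t, x) for x in points) for t in thetas}
    return len(labelings) == 2 ** len(points)

threshold = lambda t, x: int(x > t)           # example 2: f_theta(x) = [x > theta]
interval = lambda t, x: int(t <= x <= t + 4)  # example 3: f_theta(x) = [x in [theta, theta+4]]

thetas = [i / 100.0 for i in range(-1000, 1001)]  # parameter grid over [-10, 10]

print(shatters(threshold, thetas, [0.0]))           # True:  a single point is shattered
print(shatters(threshold, thetas, [0.0, 1.0]))      # False: labeling (1, 0) is impossible
print(shatters(interval, thetas, [0.0, 2.0]))       # True:  the pair {x, x+2} is shattered
print(shatters(interval, thetas, [0.0, 1.0, 2.0]))  # False: labeling (1, 0, 1) is impossible
```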


Uses


In statistical learning theory

The VC dimension can predict a probabilistic upper bound on the test error of a classification model. Vapnik proved that the probability of the test error (i.e., risk with the 0-1 loss function) distancing from an upper bound (on data that is drawn i.i.d. from the same distribution as the training set) is given by:

:\Pr \left( \text{test error} \leqslant \text{training error} + \sqrt{\frac{1}{N} \left[ D \left( \log \tfrac{2N}{D} + 1 \right) - \log \tfrac{\eta}{4} \right]} \,\right) = 1 - \eta,

where D is the VC dimension of the classification model, 0 < \eta \leqslant 1, and N is the size of the training set. (Restriction: this formula is valid when D \ll N; when D is larger, the test error may be much higher than the training error, due to overfitting.)

The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC dimension D can be learned with:

:N = \Theta\left( \frac{D + \log \frac{1}{\delta}}{\varepsilon} \right)

samples, where \varepsilon is the learning error and \delta is the failure probability. Thus, the sample complexity is a linear function of the VC dimension of the hypothesis space.
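
As a sanity check on how the bound behaves, the sketch below evaluates the confidence-interval term as reconstructed above (valid only in the D \ll N regime); the specific numbers are illustrative.

```python
import math

def vc_bound_gap(D, N, eta):
    """The term sqrt((1/N) * (D * (log(2N/D) + 1) - log(eta/4)))
    added to the training error in the bound above."""
    return math.sqrt((D * (math.log(2 * N / D) + 1) - math.log(eta / 4)) / N)

# A model with VC dimension 3 (e.g., lines in the plane), at 95% confidence:
for N in (100, 1_000, 10_000, 100_000):
    print(f"N = {N:>7,}: test error <= training error + {vc_bound_gap(3, N, 0.05):.3f}")
```

The gap shrinks roughly like \sqrt{D \log N / N}, which is one way to see why low-capacity models generalize from fewer samples.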


In computational geometry

The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-nets at all.
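
As a concrete illustration (a minimal sketch with hypothetical helper names, using the family of intervals on the line, whose VC dimension is 2): a sample S is an ε-net for a finite point set if every interval containing at least an ε fraction of the points contains a point of S.

```python
import random
from itertools import combinations

def is_eps_net(ground, sample, eps):
    """True iff every interval covering >= eps * |ground| of the ground
    points also contains at least one sample point."""
    n = len(ground)
    pts = sorted(ground)
    for lo, hi in combinations(pts, 2):  # endpoints at ground points suffice
        covered = sum(1 for x in pts if lo <= x <= hi)
        if covered >= eps * n and not any(lo <= s <= hi for s in sample):
            return False
    return True

random.seed(0)
ground = [random.random() for _ in range(200)]
# Theory: a random sample of size O((d / eps) * log(1 / eps)) is an eps-net
# with high probability; here d = 2 for intervals and eps = 0.1.
sample = random.sample(ground, 60)
print(is_eps_net(ground, sample, 0.1))  # reports whether this sample is an eps-net
```

Shrinking the sample makes failures progressively more likely, mirroring the dependence of ε-net size on the VC dimension.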


Bounds

0. The VC dimension of the dual set-family of a set-family \mathcal{F} is strictly less than 2^{\operatorname{VCDim}(\mathcal{F}) + 1}, and this is best possible.

1. The VC dimension of a finite set-family H is at most \log_2 |H|. This is because |H \cap C| \leq |H| by definition.

2. Given a set-family H, define H_s as a set-family that contains all intersections of s elements of H. Then:

::\operatorname{VCDim}(H_s) \leq \operatorname{VCDim}(H) \cdot (2 s \log_2(3s))

3. Given a set-family H and an element h_0 \in H, define H \,\Delta\, h_0 := \{ h \,\Delta\, h_0 \mid h \in H \}, where \Delta denotes symmetric set difference. Then:

::\operatorname{VCDim}(H \,\Delta\, h_0) = \operatorname{VCDim}(H)
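
Bounds 1 and 3 can be spot-checked by brute force on random families over a small universe; the sketch below redefines the same illustrative helpers used in the definitions section.

```python
import math
import random
from itertools import combinations

def is_shattered(H, C):
    C = frozenset(C)
    return len({frozenset(h) & C for h in H}) == 2 ** len(C)

def vc_dimension(H, universe):
    return max((k for k in range(1, len(universe) + 1)
                if any(is_shattered(H, C) for C in combinations(universe, k))),
               default=0)

random.seed(1)
universe = list(range(6))
for _ in range(5):
    H = [set(random.sample(universe, random.randint(0, 6))) for _ in range(8)]
    d = vc_dimension(H, universe)
    assert d <= math.log2(len(H))   # bound 1: VCDim(H) <= log2 |H|
    h0 = random.choice(H)
    H_delta = [h ^ h0 for h in H]   # bound 3: symmetric difference with fixed h0
    assert vc_dimension(H_delta, universe) == d
print("bounds 1 and 3 hold on all trials")
```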


VC dimension of a finite projective plane

A finite projective plane of order ''n'' is a collection of ''n''^2 + ''n'' + 1 sets (called "lines") over ''n''^2 + ''n'' + 1 elements (called "points"), for which:

* Each line contains exactly ''n'' + 1 points.
* Each line intersects every other line in exactly one point.
* Each point is contained in exactly ''n'' + 1 lines.
* Every two distinct points lie together on exactly one line.
* At least four points do not lie in a common line.

The VC dimension of a finite projective plane is 2.

''Proof'': (a) For each pair of distinct points, there is one line that contains both of them, lines that contain only one of them, and lines that contain none of them, so every set of size 2 is shattered. (b) Consider any triple of distinct points. If no line contains all three, then the labeling (1,1,1) cannot be realized. If there is a line ''x'' that contains all three, then there is no line ''y'' that contains exactly two of them (since then ''x'' and ''y'' would intersect in two points, which is contrary to the definition of a projective plane). Hence, no set of size 3 is shattered.
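
The smallest case, the Fano plane (the projective plane of order 2, with 7 points and 7 lines), makes a convenient test for the brute-force checker sketched in the definitions section; the line list below is one standard labeling.

```python
from itertools import combinations

def is_shattered(H, C):
    C = frozenset(C)
    return len({frozenset(h) & C for h in H}) == 2 ** len(C)

# Fano plane: order n = 2, so 7 points and 7 lines with 3 points each.
lines = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5}, {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]

print(all(is_shattered(lines, p) for p in combinations(range(7), 2)))  # True: every pair
print(any(is_shattered(lines, t) for t in combinations(range(7), 3)))  # False: no triple
```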


VC dimension of a boosting classifier

Suppose we have a base class B of simple classifiers, whose VC dimension is D. We can construct a more powerful classifier by combining several different classifiers from B; this technique is called boosting. Formally, given T classifiers h_1, \ldots, h_T \in B and a weight vector w \in \mathbb{R}^T, we can define the following classifier:

:f(x) = \operatorname{sign} \left( \sum_{t=1}^T w_t \cdot h_t(x) \right)

The VC dimension of the set of all such classifiers (over all selections of T classifiers from B and a weight vector from \mathbb{R}^T), assuming T, D \ge 3, is at most:

:T \cdot (D+1) \cdot (3 \log(T \cdot (D+1)) + 2)
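
A minimal sketch of the combined classifier and a numeric evaluation of the bound (the function names are illustrative, and the base classifiers are assumed to output ±1):

```python
import math

def boosted(hs, w):
    """f(x) = sign(sum_t w_t * h_t(x)) for base classifiers hs and weights w."""
    return lambda x: 1 if sum(wt * h(x) for wt, h in zip(w, hs)) > 0 else -1

def boosting_vc_bound(T, D):
    """The upper bound T * (D+1) * (3 * log(T * (D+1)) + 2), for T, D >= 3."""
    return T * (D + 1) * (3 * math.log(T * (D + 1)) + 2)

# E.g., T = 10 base classifiers of VC dimension D = 3 (such as lines in the plane):
print(boosting_vc_bound(10, 3))  # ~522.7
```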


VC dimension of a neural network

A neural network is described by a directed acyclic graph ''G''(''V'', ''E''), where:

* ''V'' is the set of nodes. Each node is a simple computation cell.
* ''E'' is the set of edges. Each edge has a weight.
* The input to the network is represented by the sources of the graph – the nodes with no incoming edges.
* The output of the network is represented by the sinks of the graph – the nodes with no outgoing edges.
* Each intermediate node gets as input a weighted sum of the outputs of the nodes at its incoming edges, where the weights are the weights on the edges.
* Each intermediate node outputs a certain increasing function of its input, such as the sign function or the sigmoid function. This function is called the ''activation function''.

The VC dimension of a neural network is bounded as follows (the growth of these bounds is tabulated in the sketch below):

* If the activation function is the sign function and the weights are general, then the VC dimension is at most O(|E| \cdot \log |E|).
* If the activation function is the sigmoid function and the weights are general, then the VC dimension is at least \Omega(|E|^2) and at most O(|E|^2 \cdot |V|^2).
* If the weights come from a finite family (e.g., the weights are real numbers that can be represented by at most 32 bits in a computer), then, for both activation functions, the VC dimension is at most O(|E|).
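
To get a feel for how these regimes separate, the following sketch (illustrative only; the constants hidden inside the O(\cdot) are unknown, so only the bare growth terms are compared) counts the edges of a fully connected feedforward architecture and tabulates |E|, |E| \log |E|, and |E|^2:

```python
import math

def edge_count(layer_sizes):
    """Number of weighted edges in a fully connected feedforward network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(f"{'layers':>18} {'|E|':>6} {'|E| log|E|':>12} {'|E|^2':>10}")
for layers in ([2, 4, 1], [10, 20, 10, 1], [100, 50, 25, 1]):
    E = edge_count(layers)
    print(f"{str(layers):>18} {E:>6} {E * math.log(E):>12.0f} {E ** 2:>10}")
```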


Generalizations

The VC dimension is defined for spaces of binary functions (functions to \{0, 1\}). Several generalizations have been suggested for spaces of non-binary functions.

* For multi-class functions (e.g., functions to \{0, \ldots, n\}), the Natarajan dimension can be used. Ben-David et al. present a generalization of this concept.
* For real-valued functions (e.g., functions to a real interval [0, 1]), Pollard's pseudo-dimension can be used.
* The Rademacher complexity provides bounds similar to those of the VC dimension, and can sometimes provide more insight than VC dimension calculations into statistical methods such as those using kernels.
* The memory capacity (sometimes memory equivalent capacity) gives a lower-bound capacity rather than an upper bound (see, for example, the capacity of an artificial neural network) and therefore indicates the point of potential overfitting.


See also

* Growth function
* Sauer–Shelah lemma, a bound on the number of sets in a set system in terms of the VC dimension
* Karpinski–Macintyre theorem, a bound on the VC dimension of general Pfaffian formulas

