Vapnik–Chervonenkis Dimension
   HOME

TheInfoList



OR:

In
Vapnik–Chervonenkis theory Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a stat ...
, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the
cardinality The thumb is the first digit of the hand, next to the index finger. When a person is standing in the medical anatomical position (where the palm is facing to the front), the thumb is the outermost digit. The Medical Latin English noun for thum ...
of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis. Informally, the capacity of a classification model is related to how complicated it can be. For example, consider the thresholding of a high- degree
polynomial In mathematics, a polynomial is a Expression (mathematics), mathematical expression consisting of indeterminate (variable), indeterminates (also called variable (mathematics), variables) and coefficients, that involves only the operations of addit ...
: if the polynomial evaluates above zero, that point is classified as positive, otherwise as negative. A high-degree polynomial can be wiggly, so that it can fit a given set of training points well. But one can expect that the classifier will make errors on other points, because it is too wiggly. Such a polynomial has a high capacity. A much simpler alternative is to threshold a linear function. This function may not fit the training set well, because it has a low capacity. This notion of capacity is made rigorous below.


Definitions


VC dimension of a set-family

Let \mathcal C = \_ be a
family of sets In set theory and related branches of mathematics, a family (or collection) can mean, depending upon the context, any of the following: set, indexed set, multiset, or class. A collection F of subsets of a given set S is called a family of su ...
(also called set family, collection of sets or set of sets) and X a set. Their ''intersection'' is defined as the following set family: :\mathcal C\cap X := \. Here typically X and each C \in \mathcal C are subsets of a big "universe" of possibilities U where intersection takes place. We say that a set X is '' shattered'' by \mathcal C if \mathcal P(X) = \mathcal C\cap X i.e. the set of intersections contains (hence is equal to) all the subsets of X. For finite sets X this is equivalent to :, \mathcal C\cap X, = 2^. The ''VC dimension'' D of \mathcal C is the
cardinality The thumb is the first digit of the hand, next to the index finger. When a person is standing in the medical anatomical position (where the palm is facing to the front), the thumb is the outermost digit. The Medical Latin English noun for thum ...
of the largest set that is shattered by \mathcal C. If arbitrarily large sets can be shattered, the VC dimension of \mathcal C is \infty.


VC dimension of a classification model

A binary classification model f with some parameter vector \theta is said to '' shatter'' a set of generally positioned data points (x_1,x_2,\ldots,x_n) if, for every assignment of labels to those points, there exists a \theta such that the model f makes no errors when evaluating that set of data points. The VC dimension of a model f is the maximum number of points that can be arranged so that f shatters them. More formally, it is the maximum cardinal D such that there exists a generally positioned data point set of
cardinality The thumb is the first digit of the hand, next to the index finger. When a person is standing in the medical anatomical position (where the palm is facing to the front), the thumb is the outermost digit. The Medical Latin English noun for thum ...
D that can be shattered by f.


Examples

# f is a constant classifier (with no parameters); Its VC dimension is 0 since it cannot shatter even a single point. In general, the VC dimension of a finite classification model, which can return at most 2^d different classifiers, is at most d (this is an upper bound on the VC dimension; the
Sauer–Shelah lemma In combinatorial mathematics and extremal set theory, the Sauer–Shelah lemma states that every family of sets with small VC dimension consists of a small number of sets. It is named after Norbert Sauer and Saharon Shelah, who published it inde ...
gives a lower bound on the dimension). # f is a single-parametric threshold classifier on real numbers; i.e., for a certain threshold \theta, the classifier f_\theta returns 1 if the input number is larger than \theta and 0 otherwise. The VC dimension of f is 1 because: (a) It can shatter a single point. For every point x, a classifier f_\theta labels it as 0 if \theta>x and labels it as 1 if \theta. (b) It cannot shatter all the sets with two points. For every set of two numbers, if the smaller is labeled 1, then the larger must also be labeled 1, so not all labelings are possible. # f is a single-parametric interval classifier on real numbers; i.e., for a certain parameter \theta, the classifier f_\theta returns 1 if the input number is in the interval theta,\theta+4/math> and 0 otherwise. The VC dimension of f is 2 because: (a) It can shatter some sets of two points. E.g., for every set \, a classifier f_\theta labels it as (0,0) if \theta < x - 4 or if \theta > x + 2, as (1,0) if \theta\in -4,x-2), as (1,1) if \theta\in[x-2,x/math>, and as (0,1) if \theta\in(x,x+2">-2,x.html" ;"title="-4,x-2), as (1,1) if \theta\in[x-2,x">-4,x-2), as (1,1) if \theta\in[x-2,x/math>, and as (0,1) if \theta\in(x,x+2/math>. (b) It cannot shatter any set of three points. For every set of three numbers, if the smallest and the largest are labeled 1, then the middle one must also be labeled 1, so not all labelings are possible. # f is a linear classifier">straight line In geometry, a straight line, usually abbreviated line, is an infinitely long object with no width, depth, or curvature, an idealization of such physical objects as a straightedge, a taut string, or a ray of light. Lines are spaces of dimens ...
as a classification model on points in a two-dimensional plane (this is the model used by a perceptron). The line should separate positive data points from negative data points. There exist sets of 3 points that can indeed be shattered using this model (any 3 points that are not collinear can be shattered). However, no set of 4 points can be shattered: by Radon's theorem, any four points can be partitioned into two subsets with intersecting
convex hull In geometry, the convex hull, convex envelope or convex closure of a shape is the smallest convex set that contains it. The convex hull may be defined either as the intersection of all convex sets containing a given subset of a Euclidean space, ...
s, so it is not possible to separate one of these two subsets from the other. Thus, the VC dimension of this particular classifier is 3. It is important to remember that while one can choose any arrangement of points, the arrangement of those points cannot change when attempting to shatter for some label assignment. Note, only 3 of the 23 = 8 possible label assignments are shown for the three points. # f is a single-parametric
sine In mathematics, sine and cosine are trigonometric functions of an angle. The sine and cosine of an acute angle are defined in the context of a right triangle: for the specified angle, its sine is the ratio of the length of the side opposite th ...
classifier, i.e., for a certain parameter \theta, the classifier f_\theta returns 1 if the input number x has \sin(\theta x)>0 and 0 otherwise. The VC dimension of f is infinite, since it can shatter any finite subset of the set \.


Uses


In statistical learning theory

The VC dimension can predict a
probabilistic Probability is a branch of mathematics and statistics concerning events and numerical descriptions of how likely they are to occur. The probability of an event is a number between 0 and 1; the larger the probability, the more likely an e ...
upper bound In mathematics, particularly in order theory, an upper bound or majorant of a subset of some preordered set is an element of that is every element of . Dually, a lower bound or minorant of is defined to be an element of that is less ...
on the test error of a classification model. Vapnik proved that the probability of the test error (i.e., risk with 0–1 loss function) distancing from an upper bound (on data that is drawn i.i.d. from the same distribution as the training set) is given by: : \Pr \left(\text \leqslant \text + \sqrt \, \right )= 1 - \eta, where D is the VC dimension of the classification model, 0 < \eta \leqslant 1, and N is the size of the training set (restriction: this formula is valid when D\ll N. When D is larger, the test-error may be much higher than the training-error. This is due to
overfitting In mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfi ...
). The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC dimension D can be learned with: : N = \Theta\left(\frac\right) samples, where \varepsilon is the learning error and \delta is the failure probability. Thus, the sample-complexity is a linear function of the VC dimension of the hypothesis space.


In computational geometry

The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-nets at all.


Bounds

#
  • The VC dimension of the dual set-family of \mathcal C is strictly less than 2^, and this is best possible.
  • # The VC dimension of a finite set-family \mathcal C is at most \log_2, \mathcal C, . This is because , \mathcal C\cap X, \leq , X, by definition. # Given a set-family \mathcal C, define _s as a set-family that contains all intersections of s elements of \mathcal C. Then: \operatorname(_s) \leq \operatorname(\mathcal C)\cdot (2 s \log_2(3s)) # Given a set-family \mathcal C and an element C_0\in \mathcal C, define \mathcal C\,\Delta C_0 := \ where \Delta denotes symmetric set difference. Then: \operatorname(\mathcal C\,\Delta C_0) = \operatorname(\mathcal C)


    Examples of VC Classes


    VC dimension of a finite projective plane

    A
    finite projective plane In mathematics, a projective plane is a geometric structure that extends the concept of a plane (geometry), plane. In the ordinary Euclidean plane, two lines typically intersect at a single point, but there are some pairs of lines (namely, paral ...
    of order ''n'' is a collection of ''n''2 + ''n'' + 1 sets (called "lines") over ''n''2 + ''n'' + 1 elements (called "points"), for which: * Each line contains exactly ''n'' + 1 points. * Each line intersects every other line in exactly one point. * Each point is contained in exactly ''n'' + 1 lines. * Each point is in exactly one line in common with every other point. * At least four points do not lie in a common line. The VC dimension of a finite projective plane is 2. ''Proof'': (a) For each pair of distinct points, there is one line that contains both of them, lines that contain only one of them, and lines that contain none of them, so every set of size 2 is shattered. (b) For any triple of three distinct points, if there is a line ''x'' that contain all three, then there is no line ''y'' that contains exactly two (since then ''x'' and ''y'' would intersect in two points, which is contrary to the definition of a projective plane). Hence, no set of size 3 is shattered.


    VC dimension of a boosting classifier

    Suppose we have a base class B of simple classifiers, whose VC dimension is D. We can construct a more powerful classifier by combining several different classifiers from B; this technique is called boosting. Formally, given T classifiers h_1,\ldots,h_T \in B and a weight vector w\in \mathbb^T, we can define the following classifier: :f(x) = \operatorname \left( \sum_^T w_t\cdot h_t(x) \right) The VC dimension of the set of all such classifiers (for all selections of T classifiers from B and a weight-vector from \mathbb^T), assuming T,D\ge 3, is at most: : T\cdot (D+1) \cdot (3\log(T\cdot (D+1)) + 2)


    VC dimension of a neural network

    A
    neural network A neural network is a group of interconnected units called neurons that send signals to one another. Neurons can be either biological cells or signal pathways. While individual neurons are simple, many of them together in a network can perfor ...
    is described by a
    directed acyclic graph In mathematics, particularly graph theory, and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it consists of vertices and edges (also called ''arcs''), with each edge directed from one ...
    ''G''(''V'',''E''), where: * ''V'' is the set of nodes. Each node is a simple computation cell. * ''E'' is the set of edges, Each edge has a weight. * The input to the network is represented by the sources of the graph – the nodes with no incoming edges. * The output of the network is represented by the sinks of the graph – the nodes with no outgoing edges. * Each intermediate node gets as input a weighted sum of the outputs of the nodes at its incoming edges, where the weights are the weights on the edges. * Each intermediate node outputs a certain increasing function of its input, such as the
    sign function In mathematics, the sign function or signum function (from '' signum'', Latin for "sign") is a function that has the value , or according to whether the sign of a given real number is positive or negative, or the given number is itself zer ...
    or the
    sigmoid function A sigmoid function is any mathematical function whose graph of a function, graph has a characteristic S-shaped or sigmoid curve. A common example of a sigmoid function is the logistic function, which is defined by the formula :\sigma(x ...
    . This function is called the ''activation function''. The VC dimension of a neural network is bounded as follows: * If the activation function is the sign function and the weights are general, then the VC dimension is at most O(, E, \cdot \log(, E, )). * If the activation function is the sigmoid function and the weights are general, then the VC dimension is at least \Omega(, E, ^2) and at most O(, E, ^2\cdot , V, ^2). * If the weights come from a finite family (e.g. the weights are real numbers that can be represented by at most 32 bits in a computer), then, for both activation functions, the VC dimension is at most O(, E, ).


    Generalizations

    The VC dimension is defined for spaces of binary functions (functions to ). Several generalizations have been suggested for spaces of non-binary functions. * For multi-class functions (e.g., functions to ), the Natarajan dimension, and its generalization the DS dimension can be used. * For real-valued functions (e.g., functions to a real interval, ,1, the Graph dimension or Pollard's pseudo-dimension can be used. * The
    Rademacher complexity In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures richness of a class of sets with respect to a probability distribution. The concept can also be extended to ...
    provides similar bounds to the VC, and can sometimes provide more insight than VC dimension calculations into such statistical methods such as those using
    kernels Kernel may refer to: Computing * Kernel (operating system), the central component of most operating systems * Kernel (image processing), a matrix used for image convolution * Compute kernel, in GPGPU programming * Kernel method, in machine learnin ...
    . * The Memory Capacity (sometimes Memory Equivalent Capacity) gives a lower bound capacity, rather than an upper bound (see for example: Artificial neural network#Capacity) and therefore indicates the point of potential overfitting.


    See also

    *
    Growth function The growth function, also called the shatter coefficient or the shattering number, measures the richness of a set family or class of functions. It is especially used in the context of statistical learning theory, where it is used to study propertie ...
    *
    Sauer–Shelah lemma In combinatorial mathematics and extremal set theory, the Sauer–Shelah lemma states that every family of sets with small VC dimension consists of a small number of sets. It is named after Norbert Sauer and Saharon Shelah, who published it inde ...
    , a bound on the number of sets in a set system in terms of the VC dimension. * Karpinski–Macintyre theorem, a bound on the VC dimension of general Pfaffian formulas.


    Footnotes


    References

    * * * * (containing information also for VC dimension) * * * * * * * * {{DEFAULTSORT:Vapnik-Chervonenkis dimension Dimension Statistical classification Computational learning theory Measures of complexity