Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.


Introduction

VC theory covers at least four parts (as explained in ''The Nature of Statistical Learning Theory''):
*Theory of consistency of learning processes
**What are (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle?
*Nonasymptotic theory of the rate of convergence of learning processes
**How fast is the rate of convergence of the learning process?
*Theory of controlling the generalization ability of learning processes
**How can one control the rate of convergence (the generalization ability) of the learning process?
*Theory of constructing learning machines
**How can one construct algorithms that can control the generalization ability?

VC theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization.

In addition, VC theory and the VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of VC theory, and they are employed in proving generalization bounds. Several techniques that are widely used in empirical process theory and VC theory will be introduced below. The discussion is mainly based on the book ''Weak Convergence and Empirical Processes: With Applications to Statistics''.


Overview of VC theory in empirical processes


Background on empirical processes

Let X_1, \ldots, X_n be independent, identically distributed random elements of a measurable space (\mathcal{X}, \mathcal{A}). For any measure Q on (\mathcal{X}, \mathcal{A}) and any measurable function f : \mathcal{X} \to \mathbf{R}, define
:Qf = \int f \, dQ.
Measurability issues will be ignored here; for more technical detail, see the references. Let \mathcal{F} be a class of measurable functions f : \mathcal{X} \to \mathbf{R} and define:
:\|Q\|_{\mathcal{F}} = \sup \{ |Qf| : f \in \mathcal{F} \}.
Define the empirical measure
:\mathbb{P}_n = n^{-1} \sum_{i=1}^n \delta_{X_i},
where \delta_x here stands for the Dirac measure at the point x. The empirical measure induces a map \mathcal{F} \to \mathbf{R} given by:
:f \mapsto \mathbb{P}_n f = \frac{1}{n} \bigl(f(X_1) + \cdots + f(X_n)\bigr).

Now suppose P is the underlying true distribution of the data, which is unknown. Empirical process theory aims at identifying classes \mathcal{F} for which statements such as the following hold:
*uniform law of large numbers:
:\|\mathbb{P}_n - P\|_{\mathcal{F}} \xrightarrow{P} 0,
:that is, as n \to \infty,
:\left| \frac{1}{n} \bigl(f(X_1) + \cdots + f(X_n)\bigr) - \int f \, dP \right| \to 0
:uniformly for all f \in \mathcal{F};
*uniform central limit theorem:
:\mathbb{G}_n = \sqrt{n} \, (\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}, \quad \text{in } \ell^{\infty}(\mathcal{F}).

In the former case \mathcal{F} is called a ''Glivenko–Cantelli'' class, and in the latter case (under the assumption \forall x, \; \sup\nolimits_{f \in \mathcal{F}} |f(x) - Pf| < \infty) the class \mathcal{F} is called ''Donsker'' or P-Donsker. A Donsker class is Glivenko–Cantelli in probability by an application of Slutsky's theorem.

These statements are true for a single f, by standard LLN and CLT arguments under regularity conditions, and the difficulty in empirical processes comes in because joint statements are being made for all f \in \mathcal{F}. Intuitively then, the set \mathcal{F} cannot be too large, and it turns out that the geometry of \mathcal{F} plays a very important role.

One way of measuring how big the function set \mathcal{F} is, is to use the so-called covering numbers. The covering number
:N(\varepsilon, \mathcal{F}, \|\cdot\|)
is the minimal number of balls \{g : \|g - f\| < \varepsilon\} needed to cover the set \mathcal{F} (here it is obviously assumed that there is an underlying norm on \mathcal{F}). The entropy is the logarithm of the covering number.

Two sufficient conditions are provided below, under which it can be proved that the set \mathcal{F} is Glivenko–Cantelli or Donsker. A class \mathcal{F} is P-Glivenko–Cantelli if it is P-measurable with envelope F such that P^{*} F < \infty and satisfies:
:\forall \varepsilon > 0 \quad \sup\nolimits_{Q} N(\varepsilon \|F\|_{Q}, \mathcal{F}, L_1(Q)) < \infty.
The next condition is a version of the celebrated Dudley's theorem. If \mathcal{F} is a class of functions such that
:\int_0^{\infty} \sup\nolimits_{Q} \sqrt{\log N \bigl(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q) \bigr)} \, d\varepsilon < \infty,
then \mathcal{F} is P-Donsker for every probability measure P such that P^{*} F^2 < \infty. In the last integral, the notation means
:\|f\|_{Q,2} = \left(\int |f|^2 \, dQ\right)^{1/2}.
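To make the uniform law of large numbers concrete, the following sketch (added here for illustration, not part of the source text) estimates \|\mathbb{P}_n - P\|_{\mathcal{F}} for the classical Glivenko–Cantelli class of half-line indicators f_t(x) = \mathbb{I}(x \leq t). With X_i uniform on [0, 1] one has Pf_t = t, while \mathbb{P}_n f_t is the empirical distribution function; the supremum over t is approximated on a finite grid.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n, grid_size=2000):
    """Approximate ||P_n - P||_F = sup_t |P_n f_t - P f_t| for the class
    f_t(x) = 1{x <= t}, with X_i ~ Uniform(0, 1) so that P f_t = t."""
    x = rng.uniform(size=n)                      # sample X_1, ..., X_n
    t = np.linspace(0.0, 1.0, grid_size)         # grid approximating the supremum over t
    empirical = (x[:, None] <= t).mean(axis=0)   # P_n f_t = empirical CDF evaluated at t
    return np.abs(empirical - t).max()

for n in (100, 1_000, 10_000):
    print(n, round(sup_deviation(n), 4))         # the supremum deviation shrinks as n grows
</syntaxhighlight>

This class of indicators is in fact also Donsker, so the printed deviations shrink at roughly the rate n^{-1/2}.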


Symmetrization

The majority of the arguments for bounding the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality which is discussed in the next section), it is presented here.

Consider the empirical process:
:f \mapsto (\mathbb{P}_n - P)f = \dfrac{1}{n} \sum_{i=1}^n (f(X_i) - Pf).
It turns out that there is a connection between the empirical process and the following symmetrized process:
:f \mapsto \mathbb{P}^0_n f = \dfrac{1}{n} \sum_{i=1}^n \varepsilon_i f(X_i),
where \varepsilon_1, \ldots, \varepsilon_n are independent Rademacher signs (each equal to \pm 1 with probability 1/2), independent of the data. The symmetrized process is a Rademacher process, conditionally on the data X_i. Therefore, it is a sub-Gaussian process by Hoeffding's inequality.

Lemma (Symmetrization). For every nondecreasing, convex \Phi : \mathbf{R} \to \mathbf{R} and class of measurable functions \mathcal{F},
:\mathbb{E} \, \Phi \bigl(\|\mathbb{P}_n - P\|_{\mathcal{F}}\bigr) \leq \mathbb{E} \, \Phi \left(2 \left\| \mathbb{P}^0_n \right\|_{\mathcal{F}} \right).

The proof of the symmetrization lemma relies on introducing independent copies of the original variables X_i (sometimes referred to as a ''ghost sample'') and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality, different signs can be introduced (hence the name symmetrization) without changing the expectation. A typical way of proving empirical CLTs first uses symmetrization to pass from the empirical process to \mathbb{P}_n^0, and then argues conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.
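A rough Monte Carlo illustration of the lemma (added here, not part of the source text): it takes \Phi to be the identity and reuses the half-line indicator class from the previous subsection, comparing estimates of \mathbb{E}\|\mathbb{P}_n - P\|_{\mathcal{F}} and 2\,\mathbb{E}\|\mathbb{P}^0_n\|_{\mathcal{F}}.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

def sup_norms(n, grid_size=500):
    """Return (||P_n - P||_F, ||P_n^0||_F) for one sample of size n, using the
    half-line indicators f_t(x) = 1{x <= t} with X_i ~ Uniform(0, 1), so P f_t = t."""
    x = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)             # Rademacher signs, independent of the data
    t = np.linspace(0.0, 1.0, grid_size)
    f = (x[:, None] <= t).astype(float)               # matrix of values f_t(X_i)
    empirical = np.abs(f.mean(axis=0) - t).max()      # ||P_n - P||_F
    symmetrized = np.abs((eps[:, None] * f).mean(axis=0)).max()  # ||P_n^0||_F
    return empirical, symmetrized

vals = np.array([sup_norms(500) for _ in range(200)])
print("E ||P_n - P||_F  ~", vals[:, 0].mean())
print("2 E ||P_n^0||_F  ~", 2 * vals[:, 1].mean())    # lemma (with Phi the identity): first <= second
</syntaxhighlight>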


VC Connection

It turns out that there is a fascinating connection between certain combinatorial properties of the set \mathcal{F} and the entropy numbers. Uniform covering numbers can be controlled by the notion of ''Vapnik–Chervonenkis classes of sets'', or shortly ''VC sets''.

Consider a collection \mathcal{C} of subsets of the sample space \mathcal{X}. \mathcal{C} is said to ''pick out'' a certain subset W of the finite set S = \{x_1, \ldots, x_n\} \subset \mathcal{X} if W = S \cap C for some C \in \mathcal{C}. \mathcal{C} is said to ''shatter'' S if it picks out each of its subsets. The ''VC-index'' (similar to the VC dimension + 1 for an appropriately chosen classifier set) V(\mathcal{C}) of \mathcal{C} is the smallest n for which no set of size n is shattered by \mathcal{C}.

Sauer's lemma then states that the number \Delta_n(\mathcal{C}, x_1, \ldots, x_n) of subsets picked out by a VC-class \mathcal{C} satisfies:
:\max_{x_1, \ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n) \leq \sum_{j=0}^{V(\mathcal{C}) - 1} \binom{n}{j} \leq \left( \frac{n e}{V(\mathcal{C}) - 1}\right)^{V(\mathcal{C}) - 1},
which is a polynomial number O(n^{V(\mathcal{C}) - 1}) of subsets rather than an exponential number. Intuitively this means that a finite VC-index implies that \mathcal{C} has an apparently simple structure.

A similar bound can be shown (with a different constant, same rate) for the so-called ''VC subgraph classes''. For a function f : \mathcal{X} \to \mathbf{R}, the ''subgraph'' is the subset of \mathcal{X} \times \mathbf{R} given by \{(x, t) : t < f(x)\}. A collection \mathcal{F} is called a VC subgraph class if all subgraphs form a VC-class.

Consider a set of indicator functions \mathcal{F}_{\mathcal{C}} = \{1_C : C \in \mathcal{C}\} in L_1(Q) for a discrete, empirical type of measure Q (or equivalently for any probability measure Q). It can then be shown that, quite remarkably, for r \geq 1:
:N(\varepsilon, \mathcal{F}_{\mathcal{C}}, L_r(Q)) \leq K V(\mathcal{C}) (4e)^{V(\mathcal{C})} \varepsilon^{-r(V(\mathcal{C}) - 1)}.

Further consider the ''symmetric convex hull'' of a set \mathcal{F}: \operatorname{sconv} \mathcal{F}, the collection of functions of the form \sum_{i=1}^m \alpha_i f_i with \sum_{i=1}^m |\alpha_i| \leq 1. Then if
:N \left(\varepsilon \|F\|_{Q,2}, \mathcal{F}, L_2(Q) \right) \leq C \varepsilon^{-V},
the following is valid for the convex hull of \mathcal{F}:
:\log N \left(\varepsilon \|F\|_{Q,2}, \operatorname{sconv} \mathcal{F}, L_2(Q) \right) \leq K \varepsilon^{-2V/(V+2)}.
The important consequence of this fact is that the exponent satisfies
:\frac{2V}{V + 2} < 2,
which is just enough so that the entropy integral is going to converge, and therefore the class \operatorname{sconv} \mathcal{F} is going to be P-Donsker.

Finally, an example of a VC-subgraph class is considered. Any finite-dimensional vector space \mathcal{F} of measurable functions f : \mathcal{X} \to \mathbf{R} is VC-subgraph of index smaller than or equal to \dim(\mathcal{F}) + 2.

Proof: Take n = \dim(\mathcal{F}) + 2 points (x_1, t_1), \ldots, (x_n, t_n). The vectors
:(f(x_1), \ldots, f(x_n)) - (t_1, \ldots, t_n)
are in an (n - 1)-dimensional subspace of \mathbf{R}^n. Take a \neq 0, a vector that is orthogonal to this subspace. Therefore:
:\sum_{i : a_i > 0} a_i (f(x_i) - t_i) = \sum_{i : a_i < 0} (-a_i) (f(x_i) - t_i), \quad \forall f \in \mathcal{F}.
Consider the set S = \{(x_i, t_i) : a_i > 0\}. This set cannot be picked out, since if there were some f such that S = \{(x_i, t_i) : t_i < f(x_i)\}, the LHS would be strictly positive while the RHS would be non-positive.

There are generalizations of the notion of a VC subgraph class, e.g. there is the notion of pseudo-dimension. The interested reader can look into the references.
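The following small sketch (added for illustration) checks Sauer's lemma for the class of half-lines (-\infty, c] on the real line, a VC class with V(\mathcal{C}) = 2 (it shatters every single point but no two-point set): the number of subsets it picks out of n distinct points is n + 1, which matches the Sauer bound \sum_{j=0}^{V - 1} \binom{n}{j} and is far below 2^n.

<syntaxhighlight lang="python">
from math import comb

def picked_out(points, sets):
    """Distinct subsets of `points` picked out by a collection of sets,
    each set represented as a membership predicate."""
    return {frozenset(p for p in points if member(p)) for member in sets}

# Half-lines (-inf, c]: a VC class of sets with VC index V = 2.
points = [0.3, 1.1, 2.5, 4.0, 7.2]
cuts = [-1.0, 0.5, 1.5, 3.0, 5.0, 8.0]             # thresholds realizing every achievable pick-out
half_lines = [lambda x, c=c: x <= c for c in cuts]

n, V = len(points), 2
sauer_bound = sum(comb(n, j) for j in range(V))    # sum_{j=0}^{V-1} C(n, j) = n + 1 here
print(len(picked_out(points, half_lines)), "subsets picked out;",
      "Sauer bound:", sauer_bound, "; 2^n =", 2 ** n)
</syntaxhighlight>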


VC inequality

A similar setting is considered, which is more common to machine learning. Let \mathcal{X} be a feature space and \mathcal{Y} = \{0, 1\}. A function f : \mathcal{X} \to \mathcal{Y} is called a classifier. Let \mathcal{F} be a set of classifiers. Similarly to the previous section, define the ''shattering coefficient'' (also known as the growth function):
:S(\mathcal{F}, n) = \max_{x_1, \ldots, x_n} |\{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\}|.
Note here that there is a 1:1 correspondence between each of the functions in \mathcal{F} and the set on which that function is 1. We can thus define \mathcal{C} to be the collection of subsets obtained from the above mapping for every f \in \mathcal{F}. Therefore, in terms of the previous section, the shattering coefficient is precisely
:\max_{x_1, \ldots, x_n} \Delta_n(\mathcal{C}, x_1, \ldots, x_n).
This equivalence, together with Sauer's lemma, implies that S(\mathcal{F}, n) is going to be polynomial in n, for sufficiently large n, provided that the collection \mathcal{C} has a finite VC-index.

Let D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\} be an observed dataset. Assume that the data are generated by an unknown probability distribution P_{XY}. Define R(f) = P(f(X) \neq Y) to be the expected 0/1 loss. Of course, since P_{XY} is unknown in general, one has no access to R(f). However, the ''empirical risk'', given by
:\hat{R}_n(f) = \dfrac{1}{n}\sum_{i=1}^n \mathbb{I}(f(X_i) \neq Y_i),
can certainly be evaluated. Then one has the following theorem:


Theorem (VC Inequality)

For binary classification and the 0/1 loss function we have the following generalization bounds:
:P\left(\sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| > \varepsilon \right) \leq 8 S(\mathcal{F}, n) e^{-n \varepsilon^2 / 32}
:\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| \right] \leq 2 \sqrt{\dfrac{\log S(\mathcal{F}, n) + \log 2}{n}}
In words, the VC inequality says that as the sample size increases, provided that \mathcal{F} has a finite VC dimension, the empirical 0/1 risk becomes a good proxy for the expected 0/1 risk. Note that both right-hand sides of the two inequalities will converge to 0, provided that S(\mathcal{F}, n) grows polynomially in n.

The connection between this framework and the empirical process framework is evident. Here one is dealing with a modified empirical process
:\left\| \hat{R}_n - R \right\|_{\mathcal{F}},
but not surprisingly the ideas are the same. The proof of the (first part of the) VC inequality relies on symmetrization, and then argues conditionally on the data using concentration inequalities (in particular Hoeffding's inequality). The interested reader can check the book, Theorems 12.4 and 12.5.
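As an added numerical sketch (not part of the source text), the first bound can be compared with a simulation. The class of threshold classifiers f_c(x) = \mathbb{I}(x \geq c) is assumed here; its growth function is S(\mathcal{F}, n) = n + 1, and the data-generating distribution (uniform features with a noisy threshold label) is chosen purely for illustration.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)

def sup_risk_gap(n, noise=0.1, grid_size=501):
    """One draw of sup_c |Rhat_n(f_c) - R(f_c)| for threshold classifiers
    f_c(x) = 1{x >= c}, with X ~ Uniform(0, 1) and Y = 1{X >= 0.5} flipped
    with probability `noise`, so that R(f_c) = noise + (1 - 2*noise)*|c - 0.5|."""
    x = rng.uniform(size=n)
    y = (x >= 0.5)
    flips = rng.uniform(size=n) < noise
    y = np.where(flips, ~y, y)
    c = np.linspace(0.0, 1.0, grid_size)
    emp_risk = ((x[:, None] >= c) != y[:, None]).mean(axis=0)   # Rhat_n(f_c)
    true_risk = noise + (1 - 2 * noise) * np.abs(c - 0.5)       # R(f_c), computed analytically
    return np.abs(emp_risk - true_risk).max()

n, eps = 2000, 0.05
growth = n + 1                                       # S(F, n) for threshold classifiers
vc_bound = 8 * growth * np.exp(-n * eps ** 2 / 32)   # right-hand side of the first inequality
exceed_freq = np.mean([sup_risk_gap(n) > eps for _ in range(200)])
print("VC bound:", vc_bound, "| observed exceedance frequency:", exceed_freq)
</syntaxhighlight>

Because the VC bound is distribution-free and worst-case, it is typically far more conservative (at this sample size it even exceeds 1) than the exceedance frequency observed for any particular distribution.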


References

* See references in the articles: Richard M. Dudley, empirical processes, Shattered set.