Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the ''empirical frequencies'' of all events in a certain event-family converge to their ''theoretical probabilities''. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.

The law of large numbers says that, for each ''single'' event A, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class S from one and the same sample. Moreover, it is required that the relative frequencies of the events converge to their probabilities uniformly over the entire class of events S. The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.


Definitions

For a class of predicates H defined on a set X and a set of samples x=(x_1,x_2,\dots,x_m), where x_i\in X, the ''empirical frequency'' of h\in H on x is

: \widehat{Q}_x(h)=\frac 1 m |\{i: 1\leq i\leq m, h(x_i)=1\}| \, .

The ''theoretical probability'' of h\in H is defined as Q_P(h) = P\{y\in X: h(y)=1\}.

The Uniform Convergence Theorem states, roughly, that if H is "simple" and we draw samples independently (with replacement) from X according to any distribution P, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability. Here "simple" means that the Vapnik–Chervonenkis dimension of the class H is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.

The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis using the concept of growth function.
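To make the definitions concrete, the following minimal Python sketch (an illustration only; the threshold class and the uniform distribution on [0, 1] are assumptions, not part of the source) compares empirical frequencies with theoretical probabilities over a whole class of predicates at once:

```python
import numpy as np

rng = np.random.default_rng(0)

# Predicate family: h_t(x) = 1 if x <= t, for thresholds t on a finite grid.
# Under the uniform distribution on [0, 1], the theoretical probability
# Q_P(h_t) = P{x : h_t(x) = 1} is simply t.
thresholds = np.linspace(0.0, 1.0, 101)

for m in (100, 1000, 10000):
    x = rng.uniform(0.0, 1.0, size=m)  # m i.i.d. draws from P
    # Empirical frequency of each predicate: (1/m) * |{i : h_t(x_i) = 1}|
    empirical = (x[None, :] <= thresholds[:, None]).mean(axis=1)
    theoretical = thresholds
    # Worst-case deviation over the class: sup_h |Q_P(h) - Q_hat_x(h)|
    print(m, np.abs(empirical - theoretical).max())
```

The printed maximum deviation shrinks as m grows, which is the uniform convergence phenomenon the theorem below quantifies.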


Uniform convergence theorem

The statement of the uniform convergence theorem is as follows (Martin Anthony, Peter L. Bartlett. ''Neural Network Learning: Theoretical Foundations'', pages 46–50. First Edition, 1999. Cambridge University Press):
If H is a set of \{0,1\}-valued functions defined on a set X and P is a probability distribution on X, then for \varepsilon>0 and m a positive integer, we have:

: P^m\{|Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon \text{ for some } h\in H\}\leq 4\Pi_H(2m)e^{-\varepsilon^2 m/8},

where, for any x\in X^m,

: Q_P(h)=P\{y\in X: h(y)=1\},
: \widehat{Q}_x(h)=\frac 1 m |\{i: 1\leq i\leq m, h(x_i)=1\}|,

and |x|=m. P^m indicates that the probability is taken over x consisting of m i.i.d. draws from the distribution P.

\Pi_H is defined as follows: for any \{0,1\}-valued functions H over X and D\subseteq X,

: \Pi_H(D)=\{h|_D : h\in H\}.

And for any natural number m, the shattering number \Pi_H(m) is defined as:

: \Pi_H(m)=\max_{|D|=m} |\Pi_H(D)| \, .

From the point of view of learning theory one can consider H to be the concept/hypothesis class defined over the instance set X. Before getting into the details of the proof of the theorem we state Sauer's lemma, which we will need in our proof.
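As a numerical illustration (not part of the theorem itself, and using the threshold class as an assumed example), the right-hand side of the bound can be evaluated directly. For threshold predicates h_t(x)=1 iff x\leq t on the real line, the growth function is \Pi_H(k)=k+1, so the bound 4\Pi_H(2m)e^{-\varepsilon^2 m/8} tends to zero as m grows:

```python
import numpy as np

# Threshold predicates h_t(x) = 1[x <= t] have growth function Pi_H(k) = k + 1
# (VC dimension 1), so the theorem's right-hand side decays to 0 in m.
eps = 0.1
for m in (100, 1_000, 10_000, 100_000):
    bound = 4 * (2 * m + 1) * np.exp(-eps**2 * m / 8)
    print(m, bound)
```

The polynomial growth of \Pi_H(2m) is eventually overwhelmed by the exponential factor, which is exactly why a finite VC dimension suffices.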


Sauer–Shelah lemma

The Sauer–Shelah lemma (see Sham Kakade and Ambuj Tewari, CMSC 35900 (Spring 2008) Learning Theory, Lecture 11) relates the shattering number \Pi_H(m) to the VC dimension.

Lemma: \Pi_H(m)\leq\left(\frac{em}{d}\right)^d, where d is the VC dimension of the concept class H.

Corollary: \Pi_H(m)=O(m^d); in particular, the shattering number grows only polynomially in m whenever the VC dimension of H is finite.


Proof of uniform convergence theorem

The references cited above are the sources of the proof below. Before we get into the details of the proof of the ''Uniform Convergence Theorem'' we present a high level overview of the proof.
#''Symmetrization:'' We transform the problem of analyzing |Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon into the problem of analyzing |\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2, where r and s are i.i.d. samples of size m drawn according to the distribution P. One can view r as the original randomly drawn sample of length m, while s may be thought of as a testing sample which is used to estimate Q_P(h).
#''Permutation:'' Since r and s are picked identically and independently, swapping elements between them will not change the probability distribution on r and s. So, we will try to bound the probability of |\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2 for some h\in H by considering the effect of a specific collection of permutations of the joint sample x=r||s. Specifically, we consider permutations \sigma(x) which swap x_i and x_{m+i} for i in some subset of \{1,2,\ldots,m\}. The symbol r||s means the concatenation of r and s.
#''Reduction to a finite class:'' We can now restrict the function class H to a fixed joint sample and hence, if H has finite VC dimension, the problem reduces to one involving a finite function class.
We present the technical details of the proof.


Symmetrization

Lemma: Let V=\{x\in X^m: |Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon \text{ for some } h\in H\} and

: R=\{(r,s)\in X^m\times X^m: |\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2 \text{ for some } h\in H\}.

Then for m\geq\frac{2}{\varepsilon^2}, P^m(V)\leq 2P^{2m}(R).

Proof: By the triangle inequality, if |Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon and |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2 then |\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2. Therefore,

: \begin{align} P^{2m}(R) & \geq P^{2m}\{|Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon \text{ and } |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2 \text{ for some } h\in H\} \\ & = \int_V P^m\{s: |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2 \text{ for some } h\in H\}\, dP^m(r) \\ & = A \end{align}

since r and s are independent.

Now for r\in V fix an h\in H such that |Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon. For this h, we shall show that

: P^m\left\{|Q_P(h)-\widehat{Q}_s(h)|\leq\frac{\varepsilon}{2}\right\}\geq\frac 1 2.

Thus for any r\in V the inner probability is at least 1/2, so A\geq\frac{P^m(V)}{2} and hence P^{2m}(R)\geq\frac{P^m(V)}{2}. And hence we perform the first step of our high level idea.

Notice that m\cdot \widehat{Q}_s(h) is a binomial random variable with expectation m\cdot Q_P(h) and variance m\cdot Q_P(h)(1-Q_P(h)). By Chebyshev's inequality we get

: P^m\left\{|Q_P(h)-\widehat{Q}_s(h)|>\frac{\varepsilon}{2}\right\}\leq\frac{m\cdot Q_P(h)(1-Q_P(h))}{(\varepsilon m/2)^2}\leq\frac{1}{\varepsilon^2 m}\leq\frac 1 2

for the mentioned bound on m. Here we use the fact that x(1-x)\leq 1/4 for any x.
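As a quick sanity check of this Chebyshev step (a hypothetical numerical sketch; the values Q_P(h)=0.3 and \varepsilon=0.2 are arbitrary choices, not from the source), one can estimate P^m\{|Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2\} by simulation and confirm it exceeds 1/2 once m\geq 2/\varepsilon^2:

```python
import numpy as np

rng = np.random.default_rng(1)

q, eps = 0.3, 0.2             # theoretical probability and accuracy (arbitrary)
m = int(np.ceil(2 / eps**2))  # sample size required by the lemma; here m = 50

# m * Q_hat_s(h) is Binomial(m, q); estimate P{|q - Q_hat_s(h)| <= eps/2}.
trials = 100_000
q_hat = rng.binomial(m, q, size=trials) / m
print((np.abs(q - q_hat) <= eps / 2).mean())  # should be at least 1/2
```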


Permutations

Let \Gamma_m be the set of all permutations of \{1,2,\ldots,2m\} that swap i and m+i, for all i in some subset of \{1,2,\ldots,m\}.

Lemma: Let R be any subset of X^{2m} and P any probability distribution on X. Then,

: P^{2m}(R)=E[\Pr[\sigma(x)\in R]]\leq\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]),

where the expectation is over x chosen according to P^{2m}, and the probability is over \sigma chosen uniformly from \Gamma_m.

Proof: For any \sigma\in\Gamma_m,

: P^{2m}(R)=P^{2m}\{x: \sigma(x)\in R\}

(since coordinate permutations preserve the product distribution P^{2m}). Therefore,

: \begin{align} P^{2m}(R) & = \int_{X^{2m}} 1_R(x)\, dP^{2m}(x) \\ & = \frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}\int_{X^{2m}} 1_R(\sigma(x))\, dP^{2m}(x) \\ & = \int_{X^{2m}}\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m} 1_R(\sigma(x))\, dP^{2m}(x) \quad (\text{because } |\Gamma_m| \text{ is finite}) \\ & = \int_{X^{2m}}\Pr[\sigma(x)\in R]\, dP^{2m}(x) \\ & \leq \max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]). \end{align}

The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
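A permutation \sigma\in\Gamma_m can be sampled uniformly by flipping an independent fair coin for each index i\leq m and swapping positions i and m+i when the coin comes up heads. The following Python sketch (an illustration only; function and variable names are hypothetical) shows one such draw applied to a joint sample x=r||s:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_swap_permutation(m, rng):
    """Sample sigma uniformly from Gamma_m: for each i < m, independently
    decide whether to swap positions i and m + i of a 2m-long joint sample."""
    perm = np.arange(2 * m)
    swap = rng.integers(0, 2, size=m).astype(bool)
    lo, hi = perm[:m][swap].copy(), perm[m:][swap].copy()
    perm[:m][swap] = hi
    perm[m:][swap] = lo
    return perm

# When the 2m coordinates of x are i.i.d., applying sigma leaves the
# distribution of x unchanged; here we just print one permuted joint sample.
m = 4
x = rng.uniform(size=2 * m)
print(x[random_swap_permutation(m, rng)])
```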


Reduction to a finite class

Lemma: Based on the previous lemma,

: \max_{x\in X^{2m}}(\Pr[\sigma(x)\in R])\leq 4\Pi_H(2m)e^{-\varepsilon^2 m/8}.

Proof: Let us define x=(x_1,x_2,\ldots,x_{2m}) and t=|H|_x|, which is at most \Pi_H(2m). This means there are functions h_1,h_2,\ldots,h_t\in H such that for any h\in H there exists i between 1 and t with h_i(x_k)=h(x_k) for 1\leq k\leq 2m.

We see that \sigma(x)\in R iff some h in H satisfies

: \left|\frac 1 m |\{1\leq i\leq m: h(x_{\sigma(i)})=1\}|-\frac 1 m |\{m+1\leq i\leq 2m: h(x_{\sigma(i)})=1\}|\right|\geq\frac{\varepsilon}{2}.

Hence we define w^j_i=1 if h_j(x_i)=1 and w^j_i=0 otherwise, for 1\leq i\leq 2m and 1\leq j\leq t. Then \sigma(x)\in R iff some j in \{1,\ldots,t\} satisfies

: \left|\frac 1 m\left(\sum_i w^j_{\sigma(i)}-\sum_i w^j_{\sigma(m+i)}\right)\right|\geq\frac{\varepsilon}{2}.

By the union bound we get

: \Pr[\sigma(x)\in R]\leq t\cdot\max_j\left(\Pr\left[\left|\frac 1 m\left(\sum_i w^j_{\sigma(i)}-\sum_i w^j_{\sigma(m+i)}\right)\right|\geq\frac{\varepsilon}{2}\right]\right)\leq\Pi_H(2m)\cdot\max_j\left(\Pr\left[\left|\frac 1 m\left(\sum_i w^j_{\sigma(i)}-\sum_i w^j_{\sigma(m+i)}\right)\right|\geq\frac{\varepsilon}{2}\right]\right).

Since the distribution over the permutations \sigma is uniform for each i, w^j_{\sigma(i)}-w^j_{\sigma(m+i)} equals \pm|w^j_i-w^j_{m+i}| with equal probability. Thus,

: \Pr\left[\left|\frac 1 m\left(\sum_i\left(w^j_{\sigma(i)}-w^j_{\sigma(m+i)}\right)\right)\right|\geq\frac{\varepsilon}{2}\right]=\Pr\left[\left|\frac 1 m\left(\sum_i|w^j_i-w^j_{m+i}|\beta_i\right)\right|\geq\frac{\varepsilon}{2}\right],

where the probability on the right is over the \beta_i\in\{-1,+1\} and both possibilities are equally likely. By Hoeffding's inequality, this is at most 2e^{-m\varepsilon^2/8}.

Finally, combining all the three parts of the proof we get the Uniform Convergence Theorem.
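The final Hoeffding step can also be checked numerically (a hypothetical sketch; m and \varepsilon are arbitrary, and all coefficients |w^j_i-w^j_{m+i}| are set to 1, which is the worst case) by comparing the simulated tail probability of the random-sign sum with the bound 2e^{-m\varepsilon^2/8}:

```python
import numpy as np

rng = np.random.default_rng(3)

m, eps = 200, 0.2
trials = 20_000

# Worst case: every coefficient |w_i - w_{m+i}| equals 1, so the quantity is a
# pure Rademacher average (1/m) * sum_i beta_i with beta_i uniform on {-1, +1}.
beta = rng.integers(0, 2, size=(trials, m)) * 2 - 1
tail = (np.abs(beta.mean(axis=1)) >= eps / 2).mean()

print(tail, 2 * np.exp(-m * eps**2 / 8))  # simulated tail vs. Hoeffding bound
```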


_References

{{Reflist Combinatorics Machine_learning Articles_containing_proofs Probability_theoremshtml" ;"title="\frac 1 m \left(\sum_i w^j_ - \sum_i w^j_\right), \geq \frac \varepsilon 2]\right) :\leq \Pi_(2m)\cdot \max\left(\Pr\left \left, \frac 1 m \left(\sum_i w^j_-\sum_i w^j_\right)\ \geq \frac \varepsilon 2 \right\right). Since, the distribution over the permutations \sigma is uniform for each i, so w^j_-w^j_ equals \pm , w^j_i-w^j_, , with equal probability. Thus, : \Pr\left \frac 1 m \left(\sum_i \left(w^j_-w^j_\right)\right)\\geq\frac \varepsilon 2\right= \Pr\left \frac 1 m \left( \sum_i, w^j_i-w^j_, \beta_i\right)\\geq\frac \varepsilon 2\right where the probability on the right is over \beta_ and both the possibilities are equally likely. By
Hoeffding's inequality In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassi ...
, this is at most 2e^. Finally, combining all the three parts of the proof we get the Uniform Convergence Theorem.


References

{{Reflist Combinatorics Machine learning Articles containing proofs Probability theorems>\frac 1 m \left(\sum_i w^j_ - \sum_i w^j_\right), \geq \frac \varepsilon 2right) :\leq \Pi_(2m)\cdot \max\left(\Pr\left \left, \frac 1 m \left(\sum_i w^j_-\sum_i w^j_\right)\ \geq \frac \varepsilon 2 \right\right). Since, the distribution over the permutations \sigma is uniform for each i, so w^j_-w^j_ equals \pm , w^j_i-w^j_, , with equal probability. Thus, : \Pr\left \frac 1 m \left(\sum_i \left(w^j_-w^j_\right)\right)\\geq\frac \varepsilon 2\right= \Pr\left \frac 1 m \left( \sum_i, w^j_i-w^j_, \beta_i\right)\\geq\frac \varepsilon 2\right where the probability on the right is over \beta_ and both the possibilities are equally likely. By
Hoeffding's inequality In probability theory, Hoeffding's inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. Hoeffding's inequality was proven by Wassi ...
, this is at most 2e^. Finally, combining all the three parts of the proof we get the Uniform Convergence Theorem.


References

{{Reflist}}