Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.
The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class $S$ from one and the same sample. Moreover, it is required that the relative frequency of the events converge to the probability uniformly over the entire class of events $S$.[1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event family is sufficiently simple (its VC dimension is sufficiently small), then uniform convergence holds.
For a class of predicates $H$ defined on a set $X$ and a set of samples $x=(x_1,x_2,\dots,x_m)$, where $x_i\in X$, the empirical frequency of $h\in H$ on $x$ is

$$\widehat{Q}_x(h)=\frac{1}{m}\,|\{i:1\leq i\leq m,\ h(x_i)=1\}|.$$

The theoretical probability of $h\in H$ is defined as

$$Q_P(h)=P\{y\in X:h(y)=1\}.$$
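To make the two quantities concrete, the following minimal Python sketch (an illustration, not part of the theorem) computes the empirical frequency of a single predicate and compares it with its theoretical probability; the threshold class on $[0,1]$ with the uniform distribution is an assumed example.

```python
import random

# Illustrative setup (an assumption, not from the theorem): X = [0, 1] with the
# uniform distribution P, and threshold predicates h_t(y) = 1 iff y <= t,
# so the theoretical probability Q_P(h_t) is exactly t.

def empirical_frequency(h, sample):
    """Q_hat_x(h): the fraction of sample points on which the predicate equals 1."""
    return sum(h(x_i) for x_i in sample) / len(sample)

t = 0.3
h = lambda y: 1 if y <= t else 0                  # one predicate from the class
sample = [random.random() for _ in range(1000)]   # 1000 i.i.d. draws from P

print(empirical_frequency(h, sample))             # close to the theoretical probability
print(t)                                          # Q_P(h) = 0.3
```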
The Uniform Convergence Theorem states, roughly, that if $H$ is "simple" and we draw samples independently (with replacement) from $X$ according to any distribution $P$, then with high probability the empirical frequency will be close to its expected value, which is the theoretical probability.[2]
Here "simple" means that the Vapnik–Chervonenkis dimension of the class H {\displaystyle H} is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.
The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of the growth function.
The statement of the uniform convergence theorem is as follows:[3]
If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon>0$ and $m$ a positive integer, we have

$$P^m\{|Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon \text{ for some } h\in H\}\leq 4\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8},$$

where, for any $x\in X^m$ with $|x|=m$,

$$Q_P(h)=P\{y\in X:h(y)=1\},\qquad \widehat{Q}_x(h)=\frac{1}{m}\,|\{i:1\leq i\leq m,\ h(x_i)=1\}|,$$

and $P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.

$\Pi_H$ is defined as follows: for any class $H$ of $\{0,1\}$-valued functions over $X$ (identifying each $h\in H$ with the subset of $X$ on which it equals 1) and any $D\subseteq X$,

$$\Pi_H(D)=\{h\cap D:h\in H\},$$

and for any natural number $m$, the shattering number $\Pi_H(m)$ is defined as

$$\Pi_H(m)=\max_{|D|=m}|\{h\cap D:h\in H\}|.$$

From the point of view of learning theory, one can regard $H$ as the concept (hypothesis) class defined over the instance set $X$.
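As a concrete, purely illustrative example of the shattering number, the following brute-force sketch computes $\Pi_H(m)$ for the class of discrete intervals over a small domain (a class of VC dimension 2); the domain and class are assumptions made for the example.

```python
from itertools import combinations

def shattering_number(H, X, m):
    """Pi_H(m): the maximum, over m-point subsets D of X, of |{h ∩ D : h in H}|.

    H is given as a collection of subsets of X (each predicate identified with
    the set of points on which it equals 1)."""
    best = 0
    for D in combinations(X, m):
        D = frozenset(D)
        best = max(best, len({h & D for h in H}))
    return best

# Discrete intervals over a 6-point domain: VC dimension 2,
# so Sauer's lemma (below) predicts polynomial growth in m.
X = range(6)
H = [frozenset(range(a, b)) for a in range(7) for b in range(a, 7)]
print([shattering_number(H, X, m) for m in range(1, 6)])   # 2, 4, 7, 11, 16
```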
Before getting into the details of the proof of the theorem, we state the Sauer–Shelah lemma,[4] which will be needed in the proof; it relates the shattering number $\Pi_H(m)$ to the VC dimension.
Lemma: $\Pi_H(m)\leq\left(\frac{em}{d}\right)^d$, where $d$ is the VC dimension of the concept class $H$.
Corollary: $\Pi_H(m)\leq m^d$.
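Combining the theorem with the corollary $\Pi_H(2m)\leq(2m)^d$, one can numerically solve for a sample size $m$ that makes the right-hand side $4\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8}$ at most a target failure probability $\delta$. The sketch below does this by doubling, so it returns a sufficient (not necessarily minimal) $m$; the specific values of $d$, $\varepsilon$ and $\delta$ are illustrative assumptions.

```python
import math

def sufficient_sample_size(d, eps, delta):
    """Return an m with 4 * (2m)^d * exp(-eps^2 * m / 8) <= delta,
    using the corollary Pi_H(2m) <= (2m)^d for a class of VC dimension d.

    Found by doubling, so the result is sufficient but not necessarily minimal."""
    m = 1
    while 4 * (2 * m) ** d * math.exp(-eps ** 2 * m / 8) > delta:
        m *= 2
    return m

print(sufficient_sample_size(d=3, eps=0.1, delta=0.05))   # a sample size that suffices
```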
The proof below follows [1] and [3]. Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview of the proof.
1. Symmetrization: We transform the problem of analyzing $|Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as a testing sample which is used to estimate $Q_P(h)$.
2. Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So we bound the probability of $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2$ for some $h\in H$ by considering the effect of a specific collection of permutations of the joint sample $x=r||s$, where $r||s$ denotes the concatenation of $r$ and $s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ for $i$ in some subset of $\{1,2,\ldots,m\}$.
3. Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample, and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.

We now present the technical details of the proof.
Lemma: Let $V=\{x\in X^m:|Q_P(h)-\widehat{Q}_x(h)|\geq\varepsilon \text{ for some } h\in H\}$ and

$$R=\{(r,s)\in X^m\times X^m:|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2 \text{ for some } h\in H\}.$$

Then for $m\geq\frac{2}{\varepsilon^2}$, $P^m(V)\leq 2P^{2m}(R)$.
Proof: By the triangle inequality, if $|Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon$ and $|Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2$, then $|\widehat{Q}_r(h)-\widehat{Q}_s(h)|\geq\varepsilon/2$.
Therefore,
$$\begin{aligned} P^{2m}(R) &\geq P^{2m}\{\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon \text{ and } |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2\}\\ &=\int_V P^m\{s:\exists h\in H,\ |Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon \text{ and } |Q_P(h)-\widehat{Q}_s(h)|\leq\varepsilon/2\}\,dP^m(r)\\ &= A, \end{aligned}$$

since $r$ and $s$ are independent.
Now for $r\in V$ fix an $h\in H$ such that $|Q_P(h)-\widehat{Q}_r(h)|\geq\varepsilon$. For this $h$, we shall show that

$$P^m\left\{|Q_P(h)-\widehat{Q}_s(h)|\leq\frac{\varepsilon}{2}\right\}\geq\frac{1}{2}.$$

Thus the inner probability is at least $\frac{1}{2}$ for every $r\in V$, so $A\geq\frac{P^m(V)}{2}$ and hence $P^{2m}(R)\geq\frac{P^m(V)}{2}$. This completes the symmetrization step of the high-level outline.
Notice that $m\cdot\widehat{Q}_s(h)$ is a binomial random variable with expectation $m\cdot Q_P(h)$ and variance $m\cdot Q_P(h)(1-Q_P(h))$. By Chebyshev's inequality we get
$$P^m\left\{|Q_P(h)-\widehat{Q}_s(h)|>\frac{\varepsilon}{2}\right\}\leq\frac{m\cdot Q_P(h)(1-Q_P(h))}{(\varepsilon m/2)^2}\leq\frac{1}{\varepsilon^2 m}\leq\frac{1}{2}$$

for the stated bound on $m$. Here we use the fact that $x(1-x)\leq 1/4$ for all $x$.
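As a quick numerical check of this step (with arbitrarily chosen, illustrative values of $\varepsilon$ and $Q_P(h)$), the exact binomial tail can be computed and compared against the bound of $\frac{1}{2}$:

```python
import math

def deviation_probability(p, m, eps):
    """Exact P^m{ |Q_P(h) - Q_hat_s(h)| > eps/2 } when m * Q_hat_s(h) ~ Binomial(m, p)."""
    return sum(math.comb(m, k) * p ** k * (1 - p) ** (m - k)
               for k in range(m + 1)
               if abs(p - k / m) > eps / 2)

eps = 0.2
m = math.ceil(2 / eps ** 2)            # the lemma's requirement m >= 2 / eps^2
worst = max(deviation_probability(p, m, eps) for p in (0.1, 0.25, 0.5))
print(m, worst)                        # stays below 1/2, as the lemma claims
```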
Let $\Gamma_m$ be the set of all permutations of $\{1,2,3,\dots,2m\}$ that swap $i$ and $m+i$ for each $i$ in some subset of $\{1,2,3,\ldots,m\}$.
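Drawing a uniformly random element of $\Gamma_m$ amounts to flipping an independent fair coin for each index pair $(i,m+i)$, as in the following sketch (the sample values are placeholders):

```python
import random

def apply_random_swap_permutation(x):
    """Apply a uniformly random sigma in Gamma_m to the joint sample x = r || s:
    for each i in {0, ..., m-1}, independently swap x[i] and x[m+i] with probability 1/2."""
    x = list(x)
    m = len(x) // 2
    for i in range(m):
        if random.random() < 0.5:
            x[i], x[m + i] = x[m + i], x[i]
    return x

r = [0.11, 0.42, 0.73, 0.95]    # placeholder sample values
s = [0.08, 0.37, 0.66, 0.84]
print(apply_random_swap_permutation(r + s))
```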
Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then

$$P^{2m}(R)=E[\Pr[\sigma(x)\in R]]\leq\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]),$$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.
Proof: For any $\sigma\in\Gamma_m$,

$$P^{2m}(R)=P^{2m}\{x:\sigma(x)\in R\}$$

(since coordinate permutations preserve the product distribution $P^{2m}$). Therefore,

$$\begin{aligned} P^{2m}(R) &=\int_{X^{2m}}1_R(x)\,dP^{2m}(x)\\ &=\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}\int_{X^{2m}}1_R(\sigma(x))\,dP^{2m}(x)\\ &=\int_{X^{2m}}\frac{1}{|\Gamma_m|}\sum_{\sigma\in\Gamma_m}1_R(\sigma(x))\,dP^{2m}(x) \quad\text{(because }|\Gamma_m|\text{ is finite)}\\ &=\int_{X^{2m}}\Pr[\sigma(x)\in R]\,dP^{2m}(x) \quad\text{(the inner expectation over }\sigma\text{)}\\ &\leq\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R]). \end{aligned}$$

The maximum is guaranteed to exist since there is only a finite set of values that the probability under a random permutation can take.
Reduction to a finite class

Lemma: Using the previous lemma,

$$\max_{x\in X^{2m}}(\Pr[\sigma(x)\in R])\leq 4\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8}.$$

Proof: Let us define $x=(x_1,x_2,\ldots,x_{2m})$ and $t=|H|_x|$, the number of distinct restrictions of functions in $H$ to the points of $x$, which is at most $\Pi_H(2m)$. This means there are functions $h_1,h_2,\ldots,h_t\in H$ such that for any $h\in H$ there exists an $i$ between $1$ and $t$ with $h_i(x_k)=h(x_k)$ for $1\leq k\leq 2m$.
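The quantity $t=|H|_x|$ is simply the number of distinct behaviour vectors the class produces on the fixed joint sample; the following sketch computes it for an assumed, illustrative threshold class restricted to six points:

```python
def restriction_size(H, x):
    """t = |H|_x|: the number of distinct vectors (h(x_1), ..., h(x_2m)) for h in H."""
    return len({tuple(h(x_k) for x_k in x) for h in H})

# Threshold predicates h_t(y) = 1 iff y <= t, restricted to a 6-point joint sample:
# only 7 distinct behaviours are possible, no matter how many thresholds we list.
x = [0.1, 0.25, 0.4, 0.55, 0.7, 0.85]
H = [(lambda y, t=t: 1 if y <= t else 0) for t in [i / 100 for i in range(101)]]
print(restriction_size(H, x))    # 7, which is at most Pi_H(2m)
```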
We see that $\sigma(x)\in R$ iff some $h\in H$ satisfies

$$\left|\frac{1}{m}|\{1\leq i\leq m:h(x_{\sigma(i)})=1\}|-\frac{1}{m}|\{m+1\leq i\leq 2m:h(x_{\sigma(i)})=1\}|\right|\geq\frac{\varepsilon}{2}.$$

Hence let us define $w_i^j=1$ if $h_j(x_i)=1$ and $w_i^j=0$ otherwise, for $1\leq i\leq 2m$ and $1\leq j\leq t$. Then $\sigma(x)\in R$ iff some $j$ in $\{1,\ldots,t\}$ satisfies

$$\left|\frac{1}{m}\left(\sum_i w_{\sigma(i)}^j-\sum_i w_{\sigma(m+i)}^j\right)\right|\geq\frac{\varepsilon}{2}.$$

By the union bound we get

$$\Pr[\sigma(x)\in R]\leq t\cdot\max_j\left(\Pr\left[\left|\frac{1}{m}\left(\sum_i w_{\sigma(i)}^j-\sum_i w_{\sigma(m+i)}^j\right)\right|\geq\frac{\varepsilon}{2}\right]\right)\leq\Pi_H(2m)\cdot\max_j\left(\Pr\left[\left|\frac{1}{m}\left(\sum_i w_{\sigma(i)}^j-\sum_i w_{\sigma(m+i)}^j\right)\right|\geq\frac{\varepsilon}{2}\right]\right).$$

Since the random permutation $\sigma$ swaps each pair independently and uniformly, for each $i$ the difference $w_{\sigma(i)}^j-w_{\sigma(m+i)}^j$ equals $\pm|w_i^j-w_{m+i}^j|$, with each sign occurring with probability $1/2$.
Thus,
$$\Pr\left[\left|\frac{1}{m}\sum_i\left(w_{\sigma(i)}^j-w_{\sigma(m+i)}^j\right)\right|\geq\frac{\varepsilon}{2}\right]=\Pr\left[\left|\frac{1}{m}\sum_i|w_i^j-w_{m+i}^j|\,\beta_i\right|\geq\frac{\varepsilon}{2}\right],$$

where the probability on the right is over the $\beta_i$, which are independent random signs taking the values $+1$ and $-1$ with equal probability. By Hoeffding's inequality, this is at most $2e^{-m\varepsilon^2/8}$.
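The following sketch (with arbitrarily chosen $m$ and $\varepsilon$, and the worst case in which every coordinate pair differs) estimates the left-hand side by simulating the random signs $\beta_i$ and compares it with the Hoeffding bound $2e^{-m\varepsilon^2/8}$:

```python
import math
import random

def sign_sum_tail(w_diff, eps, trials=20000):
    """Monte Carlo estimate of Pr[ |(1/m) * sum_i |w_i - w_{m+i}| * beta_i| >= eps/2 ],
    where the beta_i are independent uniform +/-1 signs."""
    m = len(w_diff)
    hits = 0
    for _ in range(trials):
        total = sum(d * random.choice((-1, 1)) for d in w_diff)
        if abs(total / m) >= eps / 2:
            hits += 1
    return hits / trials

m, eps = 200, 0.2
w_diff = [1] * m                                   # worst case: every pair differs
print(sign_sum_tail(w_diff, eps))                  # empirical tail probability
print(2 * math.exp(-m * eps ** 2 / 8))             # Hoeffding bound, which dominates it
```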
Finally, combining the three parts of the proof, we obtain the Uniform Convergence Theorem.
References

1. Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2): 264. doi:10.1137/1116025. This is an English translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Measures of Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3. ISBN 978-3-319-21851-9.
2. "Martingales", Probability with Martingales, Cambridge University Press, pp. 93–105, 1991-02-14, doi:10.1017/cbo9780511813658.014, ISBN 978-0-521-40455-6.
3. Martin Anthony, Peter L. Bartlett. Neural Network Learning: Theoretical Foundations, pages 46–50. First Edition, 1999. Cambridge University Press. ISBN 0-521-57353-X.
4. Sham Kakade and Ambuj Tewari, CMSC 35900 (Spring 2008) Learning Theory, Lecture 11.