The book shows how the central limit theorem for independent, identically distributed random variables with values in general, multidimensional spaces holds uniformly over some large classes of functions. The book contains, with complete proofs, the Fernique-Talagrand majorizing measure theorem for Gaussian processes, an extended treatment of Vapnik-Červonenkis combinatorics, the Ossiander L² bracketing central limit theorem, the Giné-Zinn bootstrap central limit theorem in probability, the Bronstein theorem on approximation of convex sets, and the Shor theorem on rates of convergence over lower layers. The book incorporates an updated form of the author's 1984 St.-Flour lecture notes and also gives various results of the author's not previously collected in one place. A number of recent results of Talagrand and others are surveyed without proofs in separate sections. The book will interest mathematicians working in probability, mathematical statisticians, and computer scientists working in computer learning theory.
R. M. Dudley is Professor of Mathematics at the Massachusetts Institute of Technology in Cambridge, Massachusetts.
Cambridge Studies in Advanced Mathematics 63

Editorial Board: W. Fulton, D. J. H. Garling, T. tom Dieck, P. Walters

Already published
2 K. Petersen Ergodic theory
3 P. T. Johnstone Stone spaces
5 J.-P. Kahane Some random series of functions, 2nd edition
7 J. Lambek & P. J. Scott Introduction to higher-order categorical logic
8 H. Matsumura Commutative ring theory
9 C. B. Thomas Characteristic classes and the cohomology of finite groups
10 M. Aschbacher Finite group theory
11 J. L. Alperin Local representation theory
12 P. Koosis The logarithmic integral I
13 A. Pietsch Eigenvalues and s-numbers
14 S. J. Patterson An introduction to the theory of the Riemann zeta-function
15 H. J. Baues Algebraic homotopy
16 V. S. Varadarajan Introduction to harmonic analysis on semisimple Lie groups
17 W. Dicks & M. Dunwoody Groups acting on graphs
18 L. J. Corwin & F. P. Greenleaf Representations of nilpotent Lie groups and their applications
19 R. Fritsch & R. Piccinini Cellular structures in topology
20 H. Klingen Introductory lectures on Siegel modular forms
21 P. Koosis The logarithmic integral II
22 M. J. Collins Representations and characters of finite groups
24 H. Kunita Stochastic flows and stochastic differential equations
25 P. Wojtaszczyk Banach spaces for analysts
26 J. E. Gilbert & M. A. M. Murray Clifford algebras and Dirac operators in harmonic analysis
27 A. Fröhlich & M. J. Taylor Algebraic number theory
28 K. Goebel & W. A. Kirk Topics in metric fixed point theory
29 J. F. Humphreys Reflection groups and Coxeter groups
30 D. J. Benson Representations and cohomology I
31 D. J. Benson Representations and cohomology II
32 C. Allday & V. Puppe Cohomological methods in transformation groups
33 C. Soulé et al. Lectures on Arakelov geometry
34 A. Ambrosetti & G. Prodi A primer of nonlinear analysis
35 J. Palis & F. Takens Hyperbolicity, stability and chaos at homoclinic bifurcations
37 Y. Meyer Wavelets and operators 1
38 C. Weibel An introduction to homological algebra
39 W. Bruns & J. Herzog Cohen-Macaulay rings
40 V. Snaith Explicit Brauer induction
41 G. Laumon Cohomology of Drinfeld modular varieties I
42 E. B. Davies Spectral theory and differential operators
43 J. Diestel, H. Jarchow, & A. Tonge Absolutely summing operators
44 P. Mattila Geometry of sets and measures in Euclidean spaces
45 R. Pinsky Positive harmonic functions and diffusion
46 G. Tenenbaum Introduction to analytic and probabilistic number theory
47 C. Peskine An algebraic introduction to complex projective geometry
49 R. Stanley Enumerative combinatorics I
50 I. Porteous Clifford algebras and the classical groups
51 M. Audin Spinning tops
52 V. Jurdjevic Geometric control theory
53 H. Völklein Groups as Galois groups
54 J. Le Potier Lectures on vector bundles
55 D. Bump Automorphic forms and representations
56 G. Laumon Cohomology of Drinfeld modular varieties II
60 M. P. Brodmann & R. Y. Sharp Local cohomology
UNIFORM CENTRAL LIMIT THEOREMS R. M. DUDLEY Massachusetts Institute of Technology
CAMBRIDGE UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE
The Pitt Building, Trumpington Street, Cambridge CB2 1RP
40 West 20th Street, New York, NY 10011-4211, USA
10 Stamford Road, Oakleigh, Melbourne 3166, Australia
© Cambridge University Press 1999 First published 1999 Typeface Times 10/13 pt. System LATEX [RW]
A catalog record of this book is available from the British Library
Library of Congress cataloging in publication data

Dudley, R. M. (Richard M.)
Uniform central limit theorems / R. M. Dudley.
p. cm. (Cambridge studies in advanced mathematics: 63)
Includes bibliographical references.
ISBN 0 521 46102 2
1. Central limit theorem. I. Title. II. Series.
QA273.67.D84 1999
519.2 dc21 98-35582 CIP
ISBN 0 521 46102 2 hardback
Transferred to digital printing 2004
To Liza
Contents

Preface page xiii

1 Introduction: Donsker's Theorem, Metric Entropy, and Inequalities 1
1.1 Empirical processes: the classical case 2
1.2 Metric entropy and capacity 10
1.3 Inequalities 12
Problems 18
Notes 19
References 21

2 Gaussian Measures and Processes; Sample Continuity 23
2.1 Some definitions 23
2.2 Gaussian vectors are probably not very large 24
2.3 Inequalities and comparisons for Gaussian distributions 31
2.4 Gaussian measures and convexity 40
2.5 The isonormal process: sample boundedness and continuity 43
2.6 A metric entropy sufficient condition for sample continuity 52
2.7 Majorizing measures 59
2.8 Sample continuity and compactness 74
**2.9 Volumes, mixed volumes, and ellipsoids 78
**2.10 Convex hulls of sequences 82
Problems 83
Notes 86
References 88

3 Foundations of Uniform Central Limit Theorems: Donsker Classes 91
3.1 Definitions: convergence in law 91
3.2 Measurable cover functions 95
3.3 Almost uniform convergence and convergence in outer probability 100
3.4 Perfect functions 103
3.5 Almost surely convergent realizations 106
3.6 Conditions equivalent to convergence in law 111
3.7 Asymptotic equicontinuity and Donsker classes 117
3.8 Unions of Donsker classes 121
3.9 Sequences of sets and functions 122
Problems 127
Notes 130
References 132

4 Vapnik-Červonenkis Combinatorics 134
4.1 Vapnik-Červonenkis classes 134
4.2 Generating Vapnik-Červonenkis classes 138
*4.3 Maximal classes 142
*4.4 Classes of index 1 145
*4.5 Combining VC classes 152
4.6 Probability laws and independence 156
4.7 Vapnik-Červonenkis properties of classes of functions 159
4.8 Classes of functions and dual density 161
**4.9 Further facts about VC classes 165
Problems 166
Notes 167
References 168

5 Measurability 170
*5.1 Sufficiency 171
5.2 Admissibility 179
5.3 Suslin properties, selection, and a counterexample 185
Problems 191
Notes 193
References 194

6 Limit Theorems for Vapnik-Červonenkis and Related Classes 196
6.1 Koltchinskii-Pollard entropy and Glivenko-Cantelli theorems 196
6.2 Vapnik-Červonenkis-Steele laws of large numbers 203
6.3 Pollard's central limit theorem 208
6.4 Necessary conditions for limit theorems 215
**6.5 Inequalities for empirical processes 220
**6.6 Glivenko-Cantelli properties and random entropy 223
**6.7 Classification problems and learning theory 226
Problems 227
Notes 228
References 230

7 Metric Entropy, with Inclusion and Bracketing 234
7.1 Definitions and the Blum-DeHardt law of large numbers 234
7.2 Central limit theorems with bracketing 238
7.3 The power set of a countable set: the Borisov-Durst theorem 244
**7.4 Bracketing and majorizing measures 246
Problems 247
Notes 248
References 248

8 Approximation of Functions and Sets 250
8.1 Introduction: the Hausdorff metric 250
8.2 Spaces of differentiable functions and sets with differentiable boundaries 252
8.3 Lower layers 264
8.4 Metric entropy of classes of convex sets 269
Problems 281
Notes 282
References 283

9 Sums in General Banach Spaces and Invariance Principles 285
9.1 Independent random elements and partial sums 286
9.2 A CLT implies measurability in separable normed spaces 291
9.3 A finite-dimensional invariance principle 293
9.4 Invariance principles for empirical processes 301
**9.5 Log log laws and speeds of convergence 306
Problems 309
Notes 310
References 311

10 Universal and Uniform Central Limit Theorems 314
10.1 Universal Donsker classes 314
10.2 Metric entropy of convex hulls in Hilbert space 322
**10.3 Uniform Donsker classes 328
Problems 330
Notes 330
References 330

11 The Two-Sample Case, the Bootstrap, and Confidence Sets 332
11.1 The two-sample case 332
11.2 A bootstrap central limit theorem in probability 335
11.3 Other aspects of the bootstrap 357
**11.4 Further Giné-Zinn bootstrap central limit theorems 358
Problems 359
Notes 360
References 361

12 Classes of Sets or Functions Too Large for Central Limit Theorems 363
12.1 Universal lower bounds 363
12.2 An upper bound 365
12.3 Poissonization and random sets 367
12.4 Lower bounds in borderline cases 373
12.5 Proof of Theorem 12.4.1 384
Problems 388
Notes 388
References 389

Appendix A Differentiating under an Integral Sign 391
Appendix B Multinomial Distributions 399
Appendix C Measures on Nonseparable Metric Spaces 402
Appendix D An Extension of Lusin's Theorem 405
Appendix E Bochner and Pettis Integrals 407
Appendix F Nonexistence of Types of Linear Forms on Some Spaces 413
Appendix G Separation of Analytic Sets; Borel Injections 417
Appendix H Young-Orlicz Spaces 421
Appendix I Modifications and Versions of Isonormal Processes 425
Subject Index 427
Author Index 432
Index of Notation 435
Preface
Suppose given a probability distribution P on the plane and a random sample of n points, chosen independently with distribution P. For each half-plane H bounded by a line, the number k of sample points in H has a binomial distribution. Suitably centered, taking k − nP(H), and divided by √n, k has an asymptotically normal (Gaussian) distribution as n → ∞, by De Moivre's classical central limit theorem. It will be seen herein, as one example of a uniform central limit theorem, that the asymptotic normality holds simultaneously and uniformly over all half-planes. The corresponding property of half-lines in the line was first shown by M. Donsker. Thus classes of sets, or functions, for which a uniform central limit theorem holds are called Donsker classes. It turns out, as will be seen, that rather general classes of sets in, or functions on, general spaces are Donsker classes.

This book developed out of some topics courses given at M.I.T. and my lectures at the St.-Flour probability summer school in 1982. The material of the book has been expanded and extended considerably since then. The reader will need to know some real analysis including Lebesgue integration, and probability based on it, including the finite-dimensional central limit theorem.

Starred sections are not cited later in the book except perhaps in other starred sections. At the end of some chapters are doubly starred sections. These are surveys, without proofs, on topics not covered in the book, usually because I did not know short enough proofs to include. Also at the end of each chapter are some problems, notes, and references on that chapter.

For useful conversations on topics in the book, I'm glad to thank Kenneth Alexander, Niels Trolle Andersen, Miguel Arcones, Patrice Assouad, Erich Berger, Lucien Birgé, Igor S. Borisov, Donald Cohn, Yves Derriennic, Uwe Einmahl, Joseph Fu, Evarist Giné, Sam Gutmann, David Haussler, Jørgen Hoffmann-Jørgensen, Yen-Chin Huang, Vladimir Koltchinskii, Lucien Le Cam,
Pascal Massart, James Munkres, Rimas Norvaiša, Walter Philipp, Tom Salisbury, Galen Shorack, Rae Shortt, Michel Talagrand, He Sheng Wu, Joe Yukich, and Joel Zinn. I especially thank Yong Chen, Peter Gaenssler, Evarist Giné, Jinghua Qian, Arvind Sankar, and Franz Strobl for providing lists of corrections and suggestions. I also thank Xavier Fernique and Evarist Giné very much for sending me copies of recent expositions.

Cambridge, Massachusetts
February 22, 1999
Note. Throughout this book, all references to "RAP" are to the author's book Real Analysis and Probability, Wadsworth and Brooks/Cole, Pacific Grove, Calif. 1989, reprinted with corrections by Chapman and Hall, New York, 1993. Also, "A := B" means A is defined by B, whereas "A =: B" means B is defined by A.
1
Introduction: Donsker's Theorem, Metric Entropy, and Inequalities
Let P be a probability measure on the Borel sets of the real line ℝ with distribution function F(x) := P((−∞, x]). Here and throughout, ":=" means "equals by definition." Let X_1, X_2, ..., be i.i.d. (independent, identically distributed) random variables with distribution P. For each n = 1, 2, ..., and any Borel set A ⊂ ℝ, let P_n(A) := (1/n) Σ_{j=1}^n δ_{X_j}(A), where δ_x(A) := 1_A(x). Then P_n is a probability measure for each X_1, ..., X_n and is called the empirical measure. Let F_n be the distribution function of P_n. Then F_n is called the empirical distribution function.

The developments to be described in this book began with the Glivenko-Cantelli theorem, a uniform law of large numbers, which says that with probability 1, F_n converges to F as n → ∞, uniformly on ℝ, meaning that sup_x |(F_n − F)(x)| → 0 as n → ∞ (RAP, Theorem 11.4.2); as mentioned in the Note at the end of the Preface, "RAP" refers to the author's book Real Analysis and Probability.

The next step was to consider the limiting behavior of α_n := n^{1/2}(F_n − F) as n → ∞. For any fixed t, the central limit theorem in its most classical form, for binomial distributions, says that α_n(t) converges in distribution to N(0, F(t)(1 − F(t))), in other words a normal (Gaussian) law, with mean 0 and variance F(t)(1 − F(t)). Here a law is a probability measure defined on the Borel sets. For any finite set T of values of t, the multidimensional central limit theorem (RAP, Theorem 9.5.6) tells us that α_n(t) for t in T converges in distribution as n → ∞ to a normal law N(0, C_F) with mean 0 and covariance C_F(s, t) = F(s)(1 − F(t)) for s ≤ t.

The Brownian bridge (RAP, Section 12.1) is a stochastic process y_t(ω), defined for 0 ≤ t ≤ 1 and ω in some probability space Ω, such that for any finite set S ⊂ [0, 1], the y_t for t in S have distribution N(0, C), where C = C_U for the uniform distribution function U(t) = t, 0 ≤ t ≤ 1, and t ↦ y_t(ω) is continuous for almost all ω. So the empirical process α_n converges in distribution to the Brownian bridge composed with F, namely t ↦ y_{F(t)}, at least when restricted to finite sets.

It was then natural to ask whether this convergence extends to infinite sets or the whole interval or line. Kolmogorov (1933) showed that when F is continuous, the supremum sup_t α_n(t) and the supremum of absolute value, sup_t |α_n(t)|, converge in distribution to the laws of the same functionals of y_F. Then, these functionals of y_F have the same distributions as for the Brownian bridge itself, since F takes ℝ onto an interval including (0, 1), which may or may not contain 0 or 1; this makes no difference to the suprema, since y_0 = y_1 = 0. Also, y_t → 0 almost surely as t ↓ 0 or t ↑ 1 by sample continuity; the suprema can be restricted to a countable dense set such as the rational numbers in (0, 1) and are thus measurable. Kolmogorov evaluated the distributions of sup_t y_t and sup_t |y_t| explicitly (see RAP, Propositions 12.3.3 and 12.3.4).

Doob (1949) asked whether the convergence in distribution held for more general functionals. Donsker (1952) stated and proved (not quite correctly) a general extension. This book will present results proved over the past few decades by many researchers, where the collection of half-lines (−∞, x], x ∈ ℝ, is replaced by much more general classes of sets in, and functions on, general sample spaces, for example the class of all ellipsoids in ℝ³. To motivate and illustrate the general theory, the first section will give a revised formulation and proof of Donsker's theorem. Then the next two sections, on metric entropy and inequalities, provide concepts and facts to be used in the rest of the book.
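As a numerical illustration (not part of the book's development), the objects just defined are easy to compute: the following Python sketch draws a uniform sample, forms the empirical distribution function F_n, and evaluates the Glivenko-Cantelli supremum sup_x |(F_n − F)(x)|, which for the uniform law F(t) = t is attained at (or just before) the order statistics. The sample size is arbitrary.

```python
import bisect
import random

def empirical_cdf(sample):
    """Return the empirical distribution function F_n of the sample."""
    xs = sorted(sample)
    n = len(xs)
    # F_n(t) = (number of sample points <= t) / n
    return lambda t: bisect.bisect_right(xs, t) / n

rng = random.Random(0)                       # fixed seed for reproducibility
n = 2000
sample = [rng.random() for _ in range(n)]    # i.i.d. uniform on [0, 1]
F_n = empirical_cdf(sample)

# For F(t) = t, sup_t |F_n(t) - t| is attained at or just before the
# order statistics x_(1) <= ... <= x_(n).
xs = sorted(sample)
ks = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
print(ks)   # of order n^(-1/2): small for large n, as Glivenko-Cantelli predicts
```

The deviation shrinks like n^{−1/2}, which is exactly why the next step in the text rescales by n^{1/2}.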
1.1 Empirical processes: the classical case

In this section, the aim is to treat an illuminating and historically basic special case. There will be plenty of generality later on. Here let P be the uniform distribution (Lebesgue measure) on the unit interval [0, 1]. Let U be its distribution function, U(t) = t, 0 ≤ t ≤ 1. Let U_n be its empirical distribution function and α_n := n^{1/2}(U_n − U) on [0, 1]. It will be proved that as n → ∞, α_n converges in law (in a sense to be made precise below) to a Brownian bridge process y_t, 0 ≤ t ≤ 1 (RAP, before Theorem 12.1.5). Recall that y_t can be written in terms of a Wiener process (Brownian motion) x_t, namely y_t = x_t − t x_1, 0 ≤ t ≤ 1. Or, y_t is x_t conditioned on x_1 = 0 in a suitable sense (RAP, Proposition 12.3.2). The Brownian bridge (like the Brownian motion) is sample-continuous, that is, it can be chosen such that for all ω, the function t ↦ y_t(ω) is continuous on [0, 1] (RAP, Theorem 12.1.5).
Donsker in 1952 proved that the convergence in law of α_n to the Brownian bridge holds, in a sense, with respect to uniform convergence in t on the whole interval [0, 1]. How to define such convergence in law correctly, however, was not clarified until much later. General definitions will be given in Chapter 3. Here, a more special approach will be taken in order to state and prove an accessible form of Donsker's theorem. For a function f on [0, 1], we have the sup norm

‖f‖_∞ := sup{|f(t)|: 0 ≤ t ≤ 1}.

Here is the form of Donsker's theorem that will be the main result of this section.
1.1.1 Theorem  For n = 1, 2, ..., there exist probability spaces Ω_n such that:

(a) On Ω_n, there exist n i.i.d. random variables X_1, ..., X_n with uniform distribution in [0, 1]. Let α_n be the nth empirical process based on these X_i;
(b) On Ω_n a sample-continuous Brownian bridge process Y_n: (t, ω) ↦ Y_n(t, ω) is also defined;
(c) ‖α_n − Y_n‖_∞ is measurable, and for all ε > 0, Pr(‖α_n − Y_n‖_∞ > ε) → 0 as n → ∞.
Notes. (i) Part (c) gives a sense in which the empirical process α_n converges in distribution to the Brownian bridge with respect to the sup norm ‖·‖_∞. (ii) It is actually possible to use one probability space on which X_1, X_2, ... are i.i.d., while Y_n = (B_1 + ... + B_n)/√n, the B_j being independent Brownian bridges. This is an example of an invariance principle, to be treated in Chapter 9, not proved in this section. (iii) One can define all α_n and Y_n on one probability space and make the Y_n all equal some Y, although here the joint distributions of α_n for different n will be different from their original ones. Then α_n will converge to Y in probability and moreover can be defined so that ‖α_n − Y‖_∞ → 0 almost surely, as will be shown in Section 3.5.

Proof  For a positive integer k, let L_k be the set of k + 1 equally spaced points,

L_k := {0, 1/k, 2/k, ..., 1} ⊂ [0, 1].

It will first be shown that both processes α_n and y_t, for large enough n and k, can be well approximated by step functions and then by piecewise-linear interpolation of their values on L_k.
Given 0 < ε < 1, take k = k(ε) large enough so that

(1.1.2)  4k exp(−kε²/648) < ε/6.

Let I_jk := [j/k, (j + 1)/k], j = 0, ..., k − 1. By the representation y_t = x_t − t x_1, we have Pr{|y_t − y_{j/k}| > ε/6 for some t ∈ I_jk} ≤ p_1 + p_2, where

p_1 := Pr(|x_1| > kε/18),
p_2 := Pr{|x_t − x_{j/k}| > ε/9 for some t ∈ I_jk}.

Then p_1 ≤ 2 exp(−k²ε²/648) (RAP, Lemma 12.1.6(b)). For p_2, via a reflection principle (RAP, 12.3.1) and the fact that {x_{u+h} − x_u}_{h≥0} has the same distribution as {x_h}_{h≥0} (applied to u = j/k), we have p_2 ≤ 4 exp(−kε²/162). Thus by (1.1.2),

(1.1.3)  Pr{|y_t − y_{j/k}| > ε/6 for some j = 0, ..., k − 1 and some t ∈ I_jk} < ε/3.
Next, we need a similar bound for a when n is large. The following will help:
1.1.4 Lemma Given the uniform distribution U on [0, 1]:
(a) For 0 < u < 1 and any finite set S C [0, 1  u], the joint distribution of { Un (u + s)  U (u) }SES is the same as for u = 0. (b) The same holds for an in place of Un.
(c) The distribution of sup[ lan(t + j1 k)  an(j/k)I: 0 < t < 1/k) is the same for all j.
Proof (a) Let S = NJ', where we can assume so = 0. It's enough to consider { Un (u + sj)  Un (u + sj_ t) }iL1, whose partial sums give the desired quantities. Multiplying by n, we get m random variables from a multinomial distribution for n observations for the first m of m + 1 categories, which have
probabilities {sj  sj_111, where sn,+l = 1 (Appendix B, Theorem B.2). This distribution doesn't depend on u.
(b) Since an (u + s)  a (u) = n 1/2 (Un (u + s)  U. (u)  s), (b) follows from (a). (c) The statement holds for finite subsets of Ijk by (b). By monotone convergence, we can let the finite sets increase up to the countable set of rational
numbers in Ilk. Since Un is rightcontinuous, suprema over the rationals in Ilk equal suprema over the whole interval (the right endpoint is rational), and Lemma 1.1.4 is proved.
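Lemma 1.1.4(a) lends itself to a quick numerical check. The following Python sketch (a Monte Carlo illustration only; sample size, increment length, and replication count are arbitrary) estimates the mean of the increment U_n(u + s) − U_n(u) at two different values of u; both estimates should be near s, reflecting that the increment's distribution does not depend on u.

```python
import random

rng = random.Random(2)
n, s, reps = 50, 0.2, 5000

def mean_increment(u):
    """Monte Carlo mean of U_n(u + s) - U_n(u) for uniform samples."""
    total = 0.0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]
        # fraction of the sample falling in (u, u + s]
        total += sum(u < x <= u + s for x in sample) / n
    return total / reps

m0 = mean_increment(0.0)
m5 = mean_increment(0.5)
print(m0, m5)   # both near s = 0.2, whatever u is
```

The lemma of course asserts equality of whole joint distributions, not just of means; the means are checked here only because they are the simplest statistic to estimate.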
So in bounding the supremum in Lemma 1.1.4(c) we can take j = 0, and we need to bound Pr{n^{1/2}|U_n(t) − t| > ε for some t ∈ [0, 1/k]}. Suppose given a multinomial distribution of numbers n_1, ..., n_r of sample size n = n_1 + ... + n_r in r bins with probabilities p_1, ..., p_r. Then for each j, the conditional distribution of n_{j+1} given n_1, ..., n_j is the same as that given n_1 + ... + n_j, namely a binomial distribution for n − n_1 − ... − n_j trials with probability p_{j+1}/(p_{j+1} + ... + p_r) of success on each (see Appendix B, Theorem B.3(c)). It follows that the empirical distribution function U_n has the following Markov property: if 0 < t_1 < ... < t_j < t < u, then the conditional distribution of U_n(u) given U_n(t_1), ..., U_n(t_j), U_n(t) is the same as that given U_n(t). Specifically, given that U_n(t) = m/n, the conditional distribution of U_n(u) is that of (m + X)/n, where X has a binomial distribution for n − m trials with success probability (u − t)/(1 − t).

To be given U_n(t) = m/n is equivalent to being given α_n(t) = n^{1/2}((m/n) − t), and α_n also has the Markov property. So the conditional distribution of α_n(u) given m = nU_n(t) has mean

μ_m := n^{1/2}((m/n) − t)(1 − u)/(1 − t)

and variance

(n − m)(u − t)(1 − u)/(n(1 − t)²) ≤ (u − t)/(1 − t) ≤ u.

So, by Chebyshev's inequality, Pr{|α_n(u) − μ_m| > 2u^{1/2} | m} ≤ 1/4.

Let 0 < δ ≤ 1 and suppose u ≤ 1/2, so that (1 − u)/(1 − t) ≥ 1/2. If α_n(t) ≥ δ, then (m/n) − t ≥ δ/n^{1/2} and μ_m ≥ δ(1 − u)/(1 − t) ≥ δ/2, so for any y ≥ δ (such that Pr{α_n(t) = y} > 0),

Pr{α_n(u) ≥ δ/2 − 2u^{1/2} | α_n(t) = y} ≥ 3/4.

(For such a y, y = n^{1/2}((m/n) − t) for some integer m.) If u ≤ δ²/64, then u ≤ 1/2 and

Pr{α_n(u) ≥ δ/4 | α_n(t) = y} ≥ 3/4.

Let u = 1/k and δ = ε/4. Then by (1.1.2), since e^{−x} < 1/24 implies x > 2, we have u ≤ δ²/64, so

Pr{α_n(1/k) ≥ ε/16 | α_n(t) = y} ≥ 3/4  for y ≥ ε/4.

Now take a positive integer r and let τ be the smallest value of j/(kr), if any, for j = 1, ..., r, for which α_n(τ) ≥ ε/4. Let A_r be the event that such a j exists. Let A_rj := {τ = j/(kr)}. Then A_r is the union of the disjoint sets A_rj for j = 1, ..., r. For each such j, by the Markov property, Pr{α_n(1/k) ≥ ε/16 | A_rj} ≥ 3/4. Thus

Pr{α_n(1/k) ≥ ε/16 | A_r} ≥ 3/4.

Let r → ∞. Then by right continuity of U_n and α_n, we get

Pr{α_n(t) ≥ ε/4 for some t ∈ [0, 1/k]} ≤ (4/3) Pr{α_n(1/k) ≥ ε/16}.

Likewise,

Pr{α_n(t) ≤ −ε/4 for some t ∈ [0, 1/k]} ≤ (4/3) Pr{α_n(1/k) ≤ −ε/16}.

Thus by Lemma 1.1.4(c),

(1.1.5)  Pr{|α_n(t) − α_n(j/k)| > ε/4 for some t ∈ I_jk and j = 0, 1, ..., k − 1} ≤ (4k/3) Pr{|α_n(1/k)| ≥ ε/16}.

As n → ∞, for our fixed k, by the central limit theorem and RAP, Lemma 12.1.6(b),

Pr{|α_n(1/k)| > ε/16} → Pr{|y_{1/k}| > ε/16} ≤ 2 exp(−kε²/512).

So for n large enough, say n ≥ n_0 = n_0(ε), recalling that k = k(ε),

Pr{|α_n(1/k)| > ε/16} ≤ 3 exp(−kε²/512).

Then by (1.1.5) and (1.1.2), for n ≥ n_0,

(1.1.6)  Pr{|α_n(t) − α_n(j/k)| > ε/4 for some j = 0, ..., k − 1 and t ∈ I_jk} < ε/6.
As mentioned previously, the law, say L_k(α_n), of {α_n(i/k)}_{i=0}^k converges by the central limit theorem in ℝ^{k+1} to that of {y_{i/k}}_{i=0}^k, say L_k(y). On ℝ^{k+1} put the metric d_∞(x, y) := |x − y|_∞ := max_i |x_i − y_i|, which of course metrizes the usual topology. Since convergence of laws is metrized by Prokhorov's metric ρ (RAP, Theorem 11.3.3), for n large enough, say n ≥ n_1(ε) ≥ n_0(ε), we have ρ(L_k(α_n), L_k(y)) < ε/6. Then by Strassen's theorem (RAP, Corollary 11.6.4), there is a probability measure μ_n on ℝ^{k+1} × ℝ^{k+1} such that for (X, Y) with L(X, Y) = μ_n, we have

(1.1.7)  L(X) = L_k(α_n),  L(Y) = L_k(y),  and  μ_n{(x, y): |x − y|_∞ > ε/6} < ε/6

(RAP, Section 9.2).
Let Lib_k (for "linear in between") be the function from ℝ^{k+1} into the space C[0, 1] of all continuous real functions on [0, 1] such that Lib_k(x)(j/k) = x_j, j = 0, ..., k, and Lib_k(x) is linear (affine) on each closed interval I_jk = [j/k, (j + 1)/k], j = 0, ..., k − 1. For any x, y ∈ ℝ^{k+1}, Lib_k(x) − Lib_k(y) is also linear on each I_jk, so it attains its maximum, minimum, and maximum absolute value at endpoints. So for the supremum norm ‖f‖_∞ := sup_{0≤t≤1} |f(t)|,

(1.1.8)  ‖Lib_k(x) − Lib_k(y)‖_∞ = |x − y|_∞.

On the event in (1.1.3), |y_t − y_{j/k}| ≤ ε/6 for all j and all t ∈ I_jk, and then y is within ε/3 of its piecewise-linear interpolation at every point, so by (1.1.3),

(1.1.9)  Pr{‖y − Lib_k({y_{j/k}}_{j=0}^k)‖_∞ > ε/3} < ε/3,

where for each ω, we have a function y: t ↦ y_t(ω), 0 ≤ t ≤ 1.

We can take the probability space for each α_n process as the unit cube I^n, with x = (x_1, ..., x_n) ∈ I^n, where the n i.i.d. uniform variables defining U_n and α_n are x_1, ..., x_n. Then

A_k: x ↦ {α_n(j/k)}_{j=0}^k

is measurable from I^n into ℝ^{k+1} and has distribution L_k(α_n) on ℝ^{k+1}. Also, x ↦ Lib_k(A_k(x)) is measurable from I^n into C[0, 1].

The next theorem will give a way of linking up or "coupling" processes. Recall that a Polish space is a topological space metrizable by a complete separable metric.

1.1.10 Theorem (Vorob'ev-Berkes-Philipp)  Let X, Y, and Z be Polish spaces with Borel σ-algebras. Let α be a law on X × Y and let β be a law on Y × Z.
Let π_Y(x, y) := y and π_Y(y, z) := y for all (x, y, z) ∈ X × Y × Z. Suppose the marginal distributions of α and β on Y are equal, in other words η := α ∘ π_Y^{−1} = β ∘ π_Y^{−1} on Y. Let π_12(x, y, z) := (x, y) and π_23(x, y, z) := (y, z). Then there exists a law γ on X × Y × Z such that γ ∘ π_12^{−1} = α and γ ∘ π_23^{−1} = β.

Proof  There exist conditional distributions α_y for α on X given y ∈ Y, so that for each y ∈ Y, α_y is a probability measure on X; for any Borel set A ⊂ X, the function y ↦ α_y(A) is measurable; and for any integrable function f for α,

∫ f dα = ∫∫ f(x, y) dα_y(x) dη(y)

(RAP, Section 10.2). Likewise, there exist conditional distributions β_y on Z for β. Let x and z be conditionally independent given y. In other words, define a set function γ on X × Y × Z by

γ(C) = ∫∫∫ 1_C(x, y, z) dα_y(x) dβ_y(z) dη(y).

The integral is well-defined if

(a) C = U × V × W for Borel sets U, V, and W in X, Y, and Z, respectively;
(b) C is a finite union of such sets, which can be taken to be disjoint (RAP, Proposition 3.2.2 twice); or
(c) C is any Borel set in X × Y × Z, by RAP, Proposition 3.2.3 and the monotone class theorem (RAP, Theorem 4.4.2).

Also, γ is countably additive by monotone convergence (for all three integrals). So γ is a law on X × Y × Z. Clearly γ ∘ π_12^{−1} = α and γ ∘ π_23^{−1} = β.
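For finitely supported laws, the construction in this proof is elementary arithmetic. The following Python sketch (with made-up two-point marginals, purely for illustration) glues a law alpha on X × Y to a law beta on Y × Z through their common Y-marginal eta by making x and z conditionally independent given y, then checks that both given marginals are recovered.

```python
# Hypothetical discrete laws (invented numbers, for illustration only):
# alpha on X x Y and beta on Y x Z share the Y-marginal
# eta(0) = 0.4, eta(1) = 0.6.
alpha = {("a", 0): 0.3, ("a", 1): 0.2, ("b", 0): 0.1, ("b", 1): 0.4}
beta = {(0, "u"): 0.1, (0, "v"): 0.3, (1, "u"): 0.5, (1, "v"): 0.1}

eta = {}
for (_, y), p in alpha.items():
    eta[y] = eta.get(y, 0.0) + p

# gamma(x, y, z) = alpha(x, y) * beta(y, z) / eta(y):
# x and z conditionally independent given y, as in the proof.
gamma = {}
for (x, y), p in alpha.items():
    for (y2, z), q in beta.items():
        if y2 == y:
            gamma[(x, y, z)] = p * q / eta[y]

# gamma should project back onto alpha on X x Y and beta on Y x Z.
marg12, marg23 = {}, {}
for (x, y, z), p in gamma.items():
    marg12[(x, y)] = marg12.get((x, y), 0.0) + p
    marg23[(y, z)] = marg23.get((y, z), 0.0) + p

print(round(sum(gamma.values()), 12))   # 1.0: gamma is a probability law
```

Summing gamma over z gives alpha(x, y) · (Σ_z beta(y, z))/eta(y) = alpha(x, y), which is the discrete case of γ ∘ π_12^{−1} = α; the other marginal works symmetrically.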
Now, let's continue the proof of Theorem 1.1.1. The function (x, f) ↦ ‖α_n − f‖_∞ is jointly Borel measurable for x ∈ I^n and f ∈ C[0, 1]. Also, u ↦ Lib_k(u) is continuous and thus Borel measurable from ℝ^{k+1} into C[0, 1]. So (x, u) ↦ ‖α_n − Lib_k(u)‖_∞ is jointly measurable on I^n × ℝ^{k+1}. (This is true even though α_n ∉ C[0, 1] and the functions t ↦ α_n(t) for different ω form a nonseparable space for ‖·‖_∞.) Thus x ↦ ‖α_n − Lib_k(A_k(x))‖_∞ is measurable on I^n. From (1.1.6), we then have

(1.1.11)  Pr{‖α_n − Lib_k(A_k(x))‖_∞ > ε/2} < ε/6.

Apply Theorem 1.1.10 to (X, Y, Z) = (I^n, ℝ^{k+1}, ℝ^{k+1}) with the law of (x, A_k(x)) on I^n × ℝ^{k+1} and μ_n from (1.1.7) on ℝ^{k+1} × ℝ^{k+1}, both of which induce the law L_k(α_n) on Y = ℝ^{k+1}, to get a law γ_n.
Then apply Theorem 1.1.10, this time to (X, Y, Z) = (I^n × ℝ^{k+1}, ℝ^{k+1}, C[0, 1]), with γ_n on X × Y and the law of ({y_{j/k}}_{j=0}^k, y) on Y × Z, where y is the Brownian bridge.

We see that there is a probability measure κ_n on I^n × C[0, 1] such that if L(V_n, Y_n) = κ_n, then L(V_n) is uniform on I^n, L(Y_n) is the law of the Brownian bridge, and if we take α_n = α_n(V_n), then for n ≥ n_1(ε) defined after (1.1.6),

Pr{‖α_n − Y_n‖_∞ > ε} < ε.

For r = 1, 2, ..., let n_r := n_1(1/r). Let N_r be an increasing sequence with N_r ≥ n_r for all r. For n < N_1, define μ_n as in (1.1.7) but with 1 in place of ε/6 (both times), so that it always holds: one can take μ_n as the product measure L_k(α_n) × L_k(y). Define κ_n as above, but with 1 in place of ε/m for m = 2, 4, or 6 in (1.1.6) and (1.1.11). For N_r ≤ n < N_{r+1}, define μ_n and κ_n as for ε = 1/r. Then Pr(‖α_n − Y_n‖_∞ > 1/r) ≤ 1/r for n ≥ N_r, r ≥ 1, and Theorem 1.1.1 is proved.  □
Remarks. It would be nice to be able to say that α_n converges to the Brownian bridge y in law in some space S of functions with supremum norm. The standard definition of convergence in law, at least if S is a separable metric space, would say that EH(α_n) → EH(y) for all bounded continuous real functions H on S (RAP, Section 9.3). Donsker (1952) stated this when continuity is assumed only at almost all values of y in C[0, 1]. But then, H could be nonmeasurable away from the support of y, and EH(α_n) is not necessarily defined. Perhaps more surprisingly, EH(α_n) may not be defined even if H is bounded and continuous everywhere. Consider for example n = 1. Then in the set of all possible functions U_1 − U, any two distinct functions are at distance 1 apart for ‖·‖_∞. So the set and all its subsets are complete, closed, and discrete for ‖·‖_∞. If the image of Lebesgue (uniform) measure on [0, 1] by the function x ↦ (t ↦ 1_{x≤t} − t) were defined on all Borel sets for ‖·‖_∞ in its range, or specifically on all complete, discrete sets, it would give an extension of Lebesgue measure to a countably additive measure on all subsets of [0, 1]. Assuming the continuum hypothesis, which is consistent with the other axioms of set theory, such an extension is not possible (RAP, Appendix C).
So in a nonseparable metric space, such as a space of empirical distribution functions with supremum norm, the Borel σ-algebra may be too large. In Chapter 3 it will be shown how to get around the lack of Borel measurability.

Here is an example relating to the Vorob'ev theorem (1.1.10). Let X = Y = Z = {−1, 1}. In X × Y × Z, let each coordinate x, y, z have the uniform distribution giving probability 1/2 each to −1 and 1. Consider the laws on the products of two of the three spaces such that y = x, z = y, and x = −z. There exist such laws having the given marginals on X, Y, and Z. But there is no law on X × Y × Z having the given marginals on X × Y, Y × Z, and Z × X, since the three equations together yield a contradiction.
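The contradiction can be checked mechanically. In the following Python sketch (an illustration of the remark, nothing more), a joint law with the three given two-dimensional marginals would have to put all its mass on triples satisfying y = x, z = y, and x = −z at once, and no such triple of ±1's exists.

```python
from itertools import product

# Triples (x, y, z) in {-1, 1}^3 satisfying all three constraints at once.
support = [(x, y, z) for x, y, z in product((-1, 1), repeat=3)
           if y == x and z == y and x == -z]
print(support)   # []: empty, so no joint law can have all three marginals
```

Indeed the first two constraints force x = y = z, and the third then forces x = −x, impossible in {−1, 1}.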
1.2 Metric entropy and capacity

The word "entropy" is applied to several concepts in mathematics. What they have in common is apparently that they give some measure of the size or complexity of some set or transformation and that their definitions involve logarithms. Beyond this rather superficial resemblance, there are major differences. What are here called "metric entropy" and "metric capacity" are measures of the size of a metric space, which must be totally bounded (have compact completion) in order for the metric entropy or capacity to be finite. Metric entropy will provide a useful general technique for dealing with classes of sets or functions in general spaces, as opposed to Markov (or martingale) methods. The latter methods apply, as in the last section, when the sample space is ℝ and the class C of sets is the class of half-lines (−∞, x], x ∈ ℝ, so that C with its ordering by inclusion is isomorphic to ℝ with its usual ordering.

Let (S, d) be a metric space and A a subset of S. Let ε > 0. A set F ⊂ S (not necessarily included in A) is called an ε-net for A if and only if for each x ∈ A, there is a y ∈ F with d(x, y) ≤ ε. Let N(ε, A, S, d) denote the minimal number of points in an ε-net in S for A. Here N(ε, A, S, d) is sometimes called a covering number. It's the number of closed balls of radius ε with centers in S needed to cover A. For any set C ⊂ S, define the diameter of C by diam C := sup{d(x, y): x, y ∈ C}. Let N(ε, C, d) be the smallest n such that C is the union of n sets of diameter at most 2ε. Let D(ε, A, d) denote the largest n such that there is a subset F ⊂ A with F having n members and d(x, y) > ε whenever x ≠ y for x and y in F. Then, in a Banach space, D(2ε, A, d) is the largest number of disjoint closed balls of radius ε that can be "packed" into A and is sometimes called a "packing number."
The three quantities just defined are related by the following inequalities:

1.2.1 Theorem. For any ε > 0 and set A in a metric space S with metric d,

D(2ε, A, d) ≤ N(ε, A, d) ≤ N(ε, A, S, d) ≤ N(ε, A, A, d) ≤ D(ε, A, d).

Proof. The first inequality holds since a set of diameter 2ε can contain at most one of a set of points more than 2ε apart. The next holds because any ball B(x, ε) := {y : d(x, y) ≤ ε} is a set of diameter at most 2ε. The third inequality holds since requiring centers to be in A is more restrictive. The last holds because a set F of points more than ε apart, with maximal cardinality, must be an ε-net: otherwise there would be a point more than ε away from each point of F, which could be adjoined to F, a contradiction unless F is infinite, but then the inequality holds trivially. □

It follows that as ε ↓ 0, when all the functions in the theorem go to ∞ unless S is a finite set, they have the same asymptotic behavior up to a factor of 2 in ε. So it will be convenient to choose one of the four and make statements about
it, which will then yield corresponding results for the others. The choice is somewhat arbitrary. Here are some considerations that bear on the choice. The finite sets of points, whether more than ε apart or forming an ε-net, are often useful, as opposed to the sets in the definition of N(ε, A, d). N(ε, A, S, d) depends not only on A but on the larger space S. Many workers, possibly for these reasons, have preferred N(ε, A, A, d). But the latter may decrease when the set A increases. For example, let A be the surface of a sphere of radius 1 around 0 in a Euclidean space S and let B := A ∪ {0}. Then N(ε, B, B, d) = 1 < N(ε, A, A, d) for 1 ≤ ε < 2. This was the reason, apparently, that Kolmogorov chose to use N(ε, A, d). In this book I adopt D(ε, A, d) as basic. It depends only on A, not on the larger space S, and is nondecreasing in A. If D(ε, A, d) = n, then there are n points which are more than ε apart and at the same time form an ε-net. Now, the ε-entropy of the metric space (A, d) is defined as H(ε, A, d) := log N(ε, A, d), and the ε-capacity as log D(ε, A, d). Some other authors take logarithms to the base 2, by analogy with information-theoretic entropy. In this book logarithms will be taken to the usual base e, which fits for example with bounds coming from moment generating functions as in the next section, and with Gaussian measures as in Chapter 2. There are a number of interesting sets
of functions where N(ε, A, d) is of the order of magnitude exp(ε^(−r)) as ε ↓ 0, for some power r > 0, so that the ε-entropy, and likewise the ε-capacity, have
Introduction: Donsker's Theorem, Metric Entropy, and Inequalities
the simpler order ε^(−r). But in other cases below, D(ε, A, d) is itself of the order of a power of 1/ε.
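For finite sets these quantities can be computed by brute force. The following Python sketch (the point set A and the value of ε are arbitrary, chosen only for illustration) computes packing and covering numbers for a subset of the line and checks part of the chain of inequalities in Theorem 1.2.1:

```python
from itertools import combinations

def covering_number(points, eps, centers):
    """Smallest number of closed eps-balls with centers in `centers`
    needed to cover `points` (brute force over center subsets)."""
    for k in range(1, len(centers) + 1):
        for cs in combinations(centers, k):
            if all(any(abs(p - c) <= eps for c in cs) for p in points):
                return k
    return float("inf")

def packing_number(points, eps):
    """Largest subset of `points` with all pairwise distances > eps."""
    for k in range(len(points), 0, -1):
        for ps in combinations(points, k):
            if all(abs(a - b) > eps for a, b in combinations(ps, 2)):
                return k
    return 0

A = [0.0, 0.1, 0.35, 0.4, 0.77, 0.9, 1.0]
eps = 0.2
D2 = packing_number(A, 2 * eps)      # D(2*eps, A, d)
N_AA = covering_number(A, eps, A)    # N(eps, A, A, d)
D1 = packing_number(A, eps)          # D(eps, A, d)
print(D2, N_AA, D1)
```

For this A one gets D(2ε) ≤ N(ε, A, A, d) ≤ D(ε), as Theorem 1.2.1 requires.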
1.3 Inequalities

This section collects several inequalities bounding the probabilities that random variables, and specifically sums of independent random variables, are large. Many of these follow from a basic inequality of S. Bernstein and P. L. Chebyshev.
1.3.1 Theorem. For any real random variable X and t ∈ ℝ,

Pr{X ≥ t} ≤ inf_{u≥0} e^(−ut) E e^(uX).

Proof. For any fixed u ≥ 0, the indicator function of the set where X ≥ t satisfies 1_{X≥t} ≤ e^(u(X−t)), so the inequality holds for each fixed u; then take the infimum over u ≥ 0. □
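As a numeric illustration (a sketch, not from the text: the standard normal case, where E e^(uX) = e^(u²/2) and the infimum is attained at u = t), the moment generating function bound can be compared with a simulated tail probability:

```python
import math
import random

random.seed(0)
n = 200_000
t = 1.5
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

# Empirical tail probability Pr{X >= t} for X ~ N(0, 1).
tail = sum(x >= t for x in xs) / n

# For N(0, 1), E e^{uX} = e^{u^2/2}, so inf_{u >= 0} e^{-ut} E e^{uX}
# is attained at u = t and equals e^{-t^2/2}.
bound = math.exp(-t * t / 2)
print(tail, bound)
```

The bound is far from sharp for moderate t, but it holds, and it decays at the correct Gaussian rate as t grows.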
For any independent real random variables X1, ..., Xn, let Sn := X1 + ⋯ + Xn.

1.3.2 Bernstein's inequality. Let X1, X2, ..., Xn be independent real random variables with mean 0. Let 0 < M < ∞ and suppose that |Xj| ≤ M almost surely for j = 1, ..., n. Let σj² := Var(Xj) and τn := Var(Sn) = σ1² + ⋯ + σn². Then for any K > 0,

(1.3.3)  Pr{|Sn| ≥ Kn^(1/2)} ≤ 2 exp(−nK²/(2τn + 2MKn^(1/2)/3)).

Proof. We can assume τn > 0, since otherwise Sn = 0 a.s. (where a.s. means almost surely) and the inequality holds. For any u > 0 and j = 1, ..., n,

(1.3.4)  E exp(uXj) = 1 + u²σj²Fj/2 ≤ exp(σj²Fju²/2),

where Fj := Σ_{r≥2} 2u^(r−2) E Xj^r/(r! σj²), or Fj := 0 if σj² = 0. For r ≥ 2, |Xj|^r ≤ Xj² M^(r−2) a.s., so Fj ≤ Σ_{r≥2} 2(Mu)^(r−2)/r! ≤ Σ_{r≥2} (Mu/3)^(r−2) = 1/(1 − Mu/3) for 0 < Mu < 3. Thus by independence and Theorem 1.3.1, for any v > 0 and 0 < u < 3/M,

Pr{Sn ≥ v} ≤ exp(−uv + τnu²/(2(1 − Mu/3))).

Taking u := v/(τn + Mv/3), which satisfies Mu < 3, gives

Pr{Sn ≥ v} ≤ exp(−v²/(2τn + 2Mv/3)) = exp(−nK²/(2τn + 2MKn^(1/2)/3))

for v := Kn^(1/2). Applying the same bound to −Sn and adding the two bounds gives (1.3.3). □
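A quick numeric sketch of the bound (1.3.3) (illustrative only; the uniform summands and the constants K, M are chosen arbitrarily) shows how, for i.i.d. bounded summands, it decreases toward the Gaussian-type bound 2 exp(−K²/(2σ²)) as n grows:

```python
import math

def bernstein_bound(n, K, M, var):
    """Right side of (1.3.3) for n i.i.d. summands of variance `var`,
    bounded by M, at threshold K * sqrt(n)."""
    tau_n = n * var  # Var(S_n) for i.i.d. summands
    return 2 * math.exp(-n * K**2 / (2 * tau_n + 2 * M * K * math.sqrt(n) / 3))

# X_j uniform on [-1, 1]: M = 1, variance 1/3 (arbitrary example).
M, var, K = 1.0, 1.0 / 3.0, 1.0
normal_bound = 2 * math.exp(-K**2 / (2 * var))
for n in (10, 100, 10_000, 1_000_000):
    print(n, bernstein_bound(n, K, M, var))
print("Gaussian limit:", normal_bound)
```

The denominator 2τn + 2MKn^(1/2)/3 is dominated by 2τn for large n, which is why the limit is the Gaussian bound.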
Here are some remarks on Bernstein's inequality. Note that for fixed K and M, if the Xj are i.i.d. with variance σ², then as n → ∞, the bound approaches the normal bound 2 exp(−K²/(2σ²)), as given in RAP, Lemma 12.1.6. Moreover, this is true even if M := Mn → ∞ as n → ∞ while K stays constant, provided that Mn/n^(1/2) → 0. Sometimes the inequality can be applied to unbounded variables, replacing them by "truncated" ones, say replacing an unbounded f by fm, where fm(x) := f(x)1_{|f(x)| ≤ m}.

1.3.5 Proposition. Let ε1, ..., εn be i.i.d. Rademacher variables, that is, Pr(εj = 1) = Pr(εj = −1) = 1/2. Then for any t > 0 and real a1, ..., an,

Pr{ Σ_{j=1}^n aj εj ≥ t } ≤ exp(−t²/(2 Σ_{j=1}^n aj²)).
Proof. Since 1/(2n)! ≤ 2^(−n)/n! for n = 0, 1, ..., we have cosh x = (e^x + e^(−x))/2 ≤ exp(x²/2) for all x, and so E exp(u Σ_{j=1}^n ajεj) = Π_{j=1}^n cosh(uaj) ≤ exp(Σ_{j=1}^n aj²u²/2). Apply Theorem 1.3.1, where by calculus inf_{u≥0} exp(−ut + Σ_{j=1}^n aj²u²/2) is attained at u = t/Σ_{j=1}^n aj², and the result follows. □

Here are some remarks on Proposition 1.3.5. Let Y1, Y2, ..., be independent variables which are symmetric, in other words Yj has the same distribution as −Yj for all j. Let εj be Rademacher variables independent of each other and of all the Yj. Then the sequence {εjYj}_{j≥1} has the same distribution as {Yj}_{j≥1}. Thus to bound the probability that Σ_{j=1}^n Yj > K, for example, we can consider the conditional probability for given values of Y1, ..., Yn,

Pr{ Σ_{j=1}^n εjYj > K | Y1, ..., Yn }.
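For small n the tail probability in Proposition 1.3.5 can be computed exactly by enumerating all 2ⁿ sign patterns; the following sketch (the coefficients aj are arbitrary) confirms the bound:

```python
import math
from itertools import product

def rademacher_tail(a, t):
    """Exact Pr{sum_j a_j * eps_j >= t} over all 2^n sign patterns."""
    n = len(a)
    hits = sum(
        sum(s * x for s, x in zip(signs, a)) >= t
        for signs in product((-1, 1), repeat=n)
    )
    return hits / 2**n

a = [0.5, 1.0, 1.5, 2.0, 0.25]  # arbitrary coefficients
ss = sum(x * x for x in a)      # sum of a_j^2
for t in (1.0, 2.0, 3.0):
    exact = rademacher_tail(a, t)
    bound = math.exp(-t**2 / (2 * ss))
    print(t, exact, bound)
```

Each printed exact probability is below the corresponding Hoeffding-type bound exp(−t²/(2 Σ aj²)).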
In the following two facts, let X1, X2, ..., Xn be independent random variables with values in a separable normed space S with norm ‖·‖. (Such spaces are defined, for example, in RAP, Section 5.2.) Let Sj := X1 + ⋯ + Xj for j = 1, ..., n.
1.3.15 Ottaviani's inequality. If for some α > 0 and c with 0 ≤ c < 1 we have P(‖Sn − Sj‖ > α) ≤ c for j = 1, ..., n, then

P(max_{j≤n} ‖Sj‖ > 2α) ≤ P(‖Sn‖ > α)/(1 − c).

1.3.16 Lemma (P. Lévy's inequality). Let V be a real vector space and Y a countable set of linear functions on V; for x ∈ V let ‖x‖_Y := sup{|f(x)| : f ∈ Y}. Let X1, ..., Xn be independent V-valued random variables which are symmetric, that is, Xj has the same distribution as −Xj for each j. Then for any M > 0,

P(max_{j≤n} ‖Sj‖_Y > M) ≤ 2 P(‖Sn‖_Y > M).

Notes. Each ‖Sj‖_Y is a measurable random variable because Y is countable. Lemma 9.1.9 treats uncountable Y. The norm on a separable Banach space (X, ‖·‖) can always be written in the form ‖·‖_Y for Y countable, via the Hahn-Banach theorem (apply RAP, Corollary 6.1.5, to a countable dense set in the unit ball of X to get a countable norming subset Y in the dual X' of X, although X' may not be separable). On the other hand, the preceding lemma applies to some nonseparable Banach spaces: the space of all bounded functions on an infinite Y with supremum norm is itself nonseparable.

Proof. Let Cm be the event {‖Sj‖_Y ≤ M for j < m, ‖Sm‖_Y > M}, so the Cm are disjoint. If ‖Sm‖_Y > M, then ‖Sn‖_Y > M or ‖2Sm − Sn‖_Y > M or both. The transformation which interchanges Xj and −Xj just for m < j ≤ n preserves probabilities, by symmetry and independence. Then Sn is interchanged with 2Sm − Sn, while the Xj are preserved for j ≤ m. So P(Cm ∩ {‖Sn‖_Y > M}) = P(Cm ∩ {‖2Sm − Sn‖_Y > M}) ≥ P(Cm)/2, and

P(max_{j≤n} ‖Sj‖_Y > M) = Σ_{m=1}^n P(Cm) ≤ 2 P(‖Sn‖_Y > M). □
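Here is a Monte Carlo sketch of Lévy's inequality in the simplest case: real-valued symmetric Gaussian summands, with Y consisting of the identity and its negative so that ‖s‖_Y = |s| (the constants are arbitrary):

```python
import random

random.seed(1)
n_steps, n_trials, M = 20, 100_000, 3.0
count_max, count_end = 0, 0
for _ in range(n_trials):
    s, smax = 0.0, 0.0
    for _ in range(n_steps):
        s += random.gauss(0.0, 1.0)   # symmetric increments
        smax = max(smax, abs(s))      # running max of |S_j|
    count_max += smax > M
    count_end += abs(s) > M
p_max = count_max / n_trials          # estimates P(max_j |S_j| > M)
p_end = count_end / n_trials          # estimates P(|S_n| > M)
print(p_max, 2 * p_end)
```

The simulated P(max_j |Sj| > M) stays below twice the simulated P(|Sn| > M), with a wide margin for these parameters.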
Problems

1. Find the covariance matrix on {0, 1/4, 1/2, 3/4, 1} of (a) the Brownian bridge process y_t; (b) U4 − U. Hint: Recall that n^(1/2)(Un − U) has the same covariances as y_t.
2. Let 0 < t < u < 1. Let αn be the empirical process for the uniform distribution on [0, 1].
(a) Show that the distribution of αn(t) is concentrated in some finite set At.
(b) Let f(t, y, u) := E(αn(u) | αn(t) = y). Show that for any y in At, the point (u, f(t, y, u)) is on the straight line segment joining (t, y) to (1, 0).
3. Let (S, d) be a complete separable metric space. Let μ be a law on S × S and let δ > 0 satisfy

μ({(x, y) : d(x, y) > 2δ}) ≤ 3δ.

Let π2(x, y) := y and P := μ ∘ π2^(−1). Let Q be a law on S such that ρ(P, Q) < δ, where ρ is Prokhorov's metric. On S × S × S let π12(x, y, z) := (x, y) and π3(x, y, z) := z. Show that there exists a law α on S × S × S such that α ∘ π12^(−1) = μ, α ∘ π3^(−1) = Q, and

α({(x, y, z) : d(x, z) > 3δ}) ≤ 4δ.

Hint: Use Strassen's theorem, which implies that for some law ν on S × S, if L(Y, Z) = ν, then L(Y) = P, L(Z) = Q, and ν({d(Y, Z) > δ}) ≤ δ. Then the Vorob'ev-Berkes-Philipp theorem applies.
4. Let A = B = C = {0, 1}. On A × B, let μ := (δ(0,0) + 2δ(1,0) + 5δ(0,1) + δ(1,1))/9. On B × C, let ν := (δ(0,0) + δ(1,0) + δ(0,1) + 3δ(1,1))/6. Find a law γ on A × B × C such that if γ = L(X, Y, Z), then L(X, Y) = μ and L(Y, Z) = ν.

5. Let I = [0, 1] with its usual metric d. For ε > 0, evaluate D(ε, I, d), N(ε, I, d), and N(ε, I, I, d). Hint: The ceiling function ⌈x⌉ is defined as the least integer ≥ x. Answers can be written in terms of ⌈·⌉.
6. For a Poisson variable X with parameter λ > 0, that is, P(X = k) = e^(−λ)λ^k/k! for k = 0, 1, 2, ..., evaluate the moment generating function Ee^(tX) for all t. For M > λ, find the bound for Pr(X ≥ M) given by the moment generating function inequality (1.3.1).
Notes

Notes to Section 1.1. The contributions of Kolmogorov (1933), Doob (1949), and Donsker (1952) were mentioned in the text. When it was realized that the formulation by Donsker (1952) was incorrect because of measurability problems, Skorokhod (1956) (see also Kolmogorov, 1956) defined a separable metric d on the space D[0, 1] of right-continuous functions with left limits on [0, 1], such that convergence for d to a continuous function is equivalent to convergence for the sup norm, and the empirical process αn converges in law in D[0, 1] to the Brownian bridge; see, for example, Billingsley (1968, Chapter 3). The formulation of Theorem 1.1.1 avoids the need for the Skorokhod topology and deals with measurability. I don't know whether Theorem 1.1.1 has
been stated before explicitly, although it is within the ken of researchers on empirical processes. In Theorem 1.1.10, the assumption that X, Y and Z are Polish can be weakened: they could instead be any Borel sets in Polish spaces (RAP, Section 13.1). Still more generally, since the proof of Theorem 10.2.2 in RAP depends just on
tightness, it is enough to assume that X, Y and Z are universally measurable subsets of their completions, in other words, measurable for the completion of any probability measure on the Borel sets (RAP, Section 11.5). Shortt (1983) treats universally measurable spaces and considers just what hypotheses on X, Y and Z are necessary. Vorob'ev (1962) proved Theorem 1.1.10 for finite sets. Then Berkes and Philipp (1977, Lemma Al) proved it for separable Banach spaces. Their proof carries over to the present case. Vorob'ev (1962) treated more complicated families of joint distributions on finite sets, as did Shortt (1984) for more general measurable spaces.
Notes to Section 1.2. Apparently the first publication on ε-entropy was the announcement by Kolmogorov (1955). Theorem 1.2.1, and the definitions of all the quantities in it, are given in the longer exposition by Kolmogorov and Tikhomirov (1959, Section 1, Theorem IV). Lorentz (1966) proposed the name "metric entropy" rather than "ε-entropy," urging that functions should not be named after their arguments, as functions of a complex variable z are not called "z-functions." The name "metric entropy" emphasizes the purely metric nature of the concept. Actually, "ε-entropy" has been used for different quantities. Posner, Rodemich, and Rumsey (1967, 1969) define an (ε, δ)-entropy, for a metric space S with a probability measure P defined on it, in terms of a decomposition of S into sets of diameter at most ε and one set of probability at most δ. Also, Posner et al. define ε-entropy as the infimum of entropies −Σi P(Ui) log P(Ui) where the Ui have diameters at most ε. So Lorentz's term "metric entropy" seems useful and will be adopted here.
Notes to Section 1.3. Sergei Bernstein (1927, pp. 159-165) published his inequality. The proof given is based on Bennett (1962, p. 34), with some incorrect but unnecessary steps (his (3), (4), ...) removed as suggested by Giné (1974). For related and stronger inequalities under weaker conditions, such as unbounded variables, see also Bernstein (1924, 1927), Hoeffding (1963), and Uspensky (1937, p. 205). Hoeffding (1963, Theorem 2) implies Proposition 1.3.5. Chernoff (1952, (5.11)) proved (1.3.9). Okamoto (1958, Lemma 2(b')) proved (1.3.10). Inequality (1.3.11) appeared in Dudley (1978, Lemma 2.7) and Lemma 1.3.12 in
Dudley (1982, Lemma 3.3). On Ottaviani's inequality (1.3.15) for real-valued random variables, see (9.7.2) and the notes to Section 9.7 in RAP. The P. Lévy inequality (1.3.16) is given for Banach-valued random variables in Kahane (1985, Section 2.3). For the case of real-valued random variables, it was known much earlier; see the notes to Section 12.3 in RAP.
References

*An asterisk indicates a work I have seen discussed in secondary sources but not in the original.
Bennett, George (1962). Probability inequalities for the sum of independent random variables. J. Amer. Statist. Assoc. 57, 33-45.
Berkes, István, and Philipp, Walter (1977). An almost sure invariance principle for the empirical distribution function of mixing random variables. Z. Wahrscheinlichkeitsth. verw. Gebiete 41, 115-137.
Bernstein, Sergei N. (1924). Ob odnom vidoizmenenii neravenstva Chebysheva i o pogreshnosti formuly Laplasa (in Russian). Uchen. Zapiski Nauchn.-issled. Kafedr Ukrainy, Otdel. Mat., vyp. 1, 38-48; reprinted in S. N. Bernstein, Sobranie Sochinenii [Collected Works], Tom IV, Teoriya Veroiatnostei, Matematicheskaya Statistika, Nauka, Moscow, 1964, pp. 71-79.
*Bernstein, Sergei N. (1927). Teoriya Veroiatnostei (in Russian), 2d ed. Moscow, 1934.
Billingsley, Patrick (1968). Convergence of Probability Measures. Wiley, New York.
Chernoff, Herman (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statist. 23, 493-507.
Donsker, Monroe D. (1952). Justification and extension of Doob's heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 23, 277-281.
Doob, J. L. (1949). Heuristic approach to the Kolmogorov-Smirnov theorems. Ann. Math. Statist. 20, 393-403.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction 7 (1979), 909-911.
Dudley, R. M. (1982). Empirical and Poisson processes on classes of sets or functions too large for central limit theorems. Z. Wahrscheinlichkeitsth. verw. Gebiete 61, 355-368.
Feller, W. (1968). An Introduction to Probability Theory and Its Applications, 3d ed. Wiley, New York.
Giné, Evarist (1974). On the central limit theorem for sample continuous processes. Ann. Probab. 2, 629-641.
Hoeffding, Wassily (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13-30.
Kahane, J.-P. (1985). Some Random Series of Functions, 2d ed. Cambridge University Press, Cambridge.
*Kolmogorov, A. N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giorn. Ist. Ital. Attuari 4, 83-91.
Kolmogorov, A. N. (1955). Bounds for the minimal number of elements of an ε-net in various classes of functions and their applications to the question of representability of functions of several variables by superpositions of functions of fewer variables (in Russian). Uspekhi Mat. Nauk 10, no. 1 (63), 192-194.
Kolmogorov, A. N. (1956). On Skorokhod convergence. Theory Probab. Appl. 1, 215-222.
Kolmogorov, A. N., and Tikhomirov, V. M. (1959). ε-entropy and ε-capacity of sets in function spaces. Amer. Math. Soc. Transls. (Ser. 2) 17 (1961), 277-364 (Uspekhi Mat. Nauk 14, vyp. 2 (86), 3-86).
Lorentz, G. G. (1966). Metric entropy and approximation. Bull. Amer. Math. Soc. 72, 903-937.
Okamoto, Masashi (1958). Some inequalities relating to the partial sum of binomial probabilities. Ann. Inst. Statist. Math. 10, 29-35.
Posner, Edward C., Rodemich, Eugene R., and Rumsey, Howard Jr. (1967). Epsilon entropy of stochastic processes. Ann. Math. Statist. 38, 1000-1020.
Posner, Edward C., Rodemich, Eugene R., and Rumsey, Howard Jr. (1969). Epsilon entropy of Gaussian processes. Ann. Math. Statist. 40, 1272-1296.
Shortt, Rae M. (1983). Universally measurable spaces: an invariance theorem and diverse characterizations. Fund. Math. 121, 169-176.
Shortt, Rae M. (1984). Combinatorial methods in the study of marginal problems over separable spaces. J. Math. Anal. Appl. 97, 462-479.
Skorokhod, A. V. (1956). Limit theorems for stochastic processes. Theory Probab. Appl. 1, 261-290.
Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw-Hill, New York.
Vorob'ev, N. N. (1962). Consistent families of measures and their extensions. Theory Probab. Appl. 7, 147-163 (English), 153-169 (Russian).
2 Gaussian Measures and Processes; Sample Continuity
Let X1, X2, ..., be independent, identically distributed real-valued random variables with EX1 = 0 and EX1² = σ² < ∞. Let Sn := X1 + ⋯ + Xn. Then the one-dimensional central limit theorem says that the distribution of Sn/n^(1/2) converges as n → ∞ to the normal distribution N(0, σ²), which (if σ > 0) has a density σ^(−1)(2π)^(−1/2) exp(−x²/(2σ²)) with respect to Lebesgue measure on ℝ (RAP, Theorem 9.5.6). Also, if the Xi are i.i.d. with values in ℝ^k, EX1 = 0, and E|X1|² < ∞, then the distribution of Sn/n^(1/2) converges to a normal distribution N(0, C), where C is the covariance matrix of X1 (ibid.). This book is mainly about extensions of the central limit theorem to infinite-dimensional situations. Here the limit distributions will be normal distributions on infinite-dimensional spaces. Since their behavior is not as simple as in the finite-dimensional case, this chapter is devoted to a study of normal or Gaussian measures.
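A quick simulation sketch of the one-dimensional statement (centered uniform summands; the sample sizes are arbitrary) compares the empirical distribution of Sn/(σn^(1/2)) with the standard normal distribution function:

```python
import math
import random

random.seed(2)

def std_normal_cdf(x):
    # Standard normal distribution function via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, trials = 200, 20_000
var = 1.0 / 12.0  # variance of Uniform[0, 1]
samples = []
for _ in range(trials):
    s = sum(random.random() - 0.5 for _ in range(n))  # mean-zero summands
    samples.append(s / math.sqrt(n * var))            # S_n / (sigma * n^{1/2})

for x in (-1.0, 0.0, 1.0):
    emp = sum(v <= x for v in samples) / trials
    print(x, emp, std_normal_cdf(x))  # empirical vs. limiting N(0, 1) values
```

For these symmetric summands the agreement is already close at n = 200, up to Monte Carlo noise.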
2.1 Some definitions

Let X be a real vector space. Recall that a seminorm is a function ‖·‖ from X into the nonnegative real numbers such that ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y in X and ‖cx‖ = |c|‖x‖ for all real c and x ∈ X. The seminorm ‖·‖ is called a norm if ‖x‖ = 0 only for x = 0 in X, and then (X, ‖·‖) is called a normed linear space. A norm defines a metric by d(x, y) := ‖x − y‖. A normed linear space complete for this metric is called a Banach space. As with any metric space, it is called separable if it has a countable dense subset. A probability distribution P defined on a separable Banach space will be assumed to be defined on the Borel σ-algebra generated by the open sets, unless another σ-algebra is specified. Then P will be called a law. Let (X, ‖·‖) be a separable Banach space. A law P on X will be called Gaussian or normal if for every continuous linear form f ∈ X', P ∘ f^(−1) is
a normal law on ℝ. Recall that a law on a finite-dimensional real vector space is normal if and only if every real linear form is normally distributed (RAP, Theorem 9.5.13).
2.2 Gaussian vectors are probably not very large

First, let's have some bounds for one-dimensional Gaussian variables. Let Φ be the standard normal distribution function and φ its density function, so φ(x) = (2π)^(−1/2) exp(−x²/2) for all real x, and Φ(x) = ∫_{−∞}^x φ(u) du.

2.2.1 Proposition. Let X be a real-valued random variable with a normal distribution N(0, σ²). Then

(a) for any M > 0, Pr(|X| ≥ M) ≤ exp(−M²/(2σ²));
(b) if M/σ ≥ 1, then (σ/M)φ(M/σ) ≤ Pr(|X| ≥ M) ≤ (2σ/M)φ(M/σ).

Proof. Replacing X by X/σ, we can assume σ = 1. For (a) we want to prove 2Φ(−c) ≤ exp(−c²/2) for any c ≥ 0. This holds for c = 0 and follows by differentiating both sides for 0 ≤ c ≤ (2/π)^(1/2). For larger c, it follows from 1 − Φ(c) ≤ φ(c)/c (RAP, Lemma 12.1.6(a)), as does the right side of (b). For the left side of (b), note that φ is a convex function for x ≥ 1, since there φ''(x) = (x² − 1)φ(x) ≥ 0. Thus, the region between the graph of φ and the x axis for x ≥ c includes a right triangle with right-angle vertex at (c, 0) and a vertex at (c, φ(c)), and whose hypotenuse is along the tangent line to the graph of φ at c. It's easily seen that this triangle has area φ(c)/(2c), which finishes the proof. □
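Both parts of Proposition 2.2.1 (in the reconstructed form above, with σ = 1) can be checked numerically against the exact tail computed from the complementary error function:

```python
import math

def phi_density(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def two_sided_tail(M):
    """Exact Pr(|X| > M) for X ~ N(0, 1), via erfc."""
    return math.erfc(M / math.sqrt(2))

for M in (1.0, 2.0, 4.0):
    tail = two_sided_tail(M)
    # Part (a): tail <= exp(-M^2 / 2).
    # Part (b): phi(M)/M <= tail <= 2*phi(M)/M for M >= 1.
    print(M, tail, math.exp(-M * M / 2), phi_density(M) / M)
```

For M = 4 the two-sided bound in (b) traps the true tail within a factor of 2, while the cruder bound (a) is off by a larger factor; that is the expected behavior of these estimates.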
This section will prove an extension of inequality (a) to infinite-dimensional Gaussian variables such as those taking values in separable Banach spaces. It will be said that a law P on a separable Banach space (X, ‖·‖) has mean 0 if ∫ ‖x‖ dP(x) < ∞ and ∫ f(x) dP(x) = 0 for each f ∈ X'. Recall the dual norm ‖f‖' := sup{|f(x)| : ‖x‖ ≤ 1}. Here is one of the main results.
2.2.2 Theorem (Landau-Shepp-Marcus-Fernique). Let P be a normal law with mean 0 on a separable Banach space X. For f ∈ X' let σ²(f) := ∫ f² dP. Then τ² := sup{σ²(f) : ‖f‖' ≤ 1} < ∞ and

∫ exp(α‖x‖²) dP(x) < ∞ for any α < 1/(2τ²).
By Proposition 2.2.1(a), the theorem holds in the one-dimensional case, and by the left side of part (b), the conclusion fails when α = 1/(2τ²). Before proving the theorem in general, some other facts will be developed.
Definition. Let X be a real vector space and B a σ-algebra of subsets of X. Then (X, B) is called a measurable vector space if both

(a) addition is jointly measurable from X × X to X, and
(b) scalar multiplication is jointly measurable from ℝ × X to X (for the usual Borel σ-algebra on ℝ).

Example. Let X be a topological vector space, namely a vector space with a topology for which (a) and (b) hold with "measurable" replaced by "continuous." Suppose the topology of X is metrizable and separable. For a Cartesian product of two separable metric spaces, since their topologies have countable bases, the Borel σ-algebra in the product equals the product σ-algebra of the Borel σ-algebras in the two spaces (RAP, Proposition 4.1.7). Thus X with its Borel σ-algebra is a measurable vector space.
The notion of normal law can't be defined for general measurable vector spaces by way of linear forms, as it was for Banach spaces in the last section, since there exist measurable vector spaces, such as the spaces L^p[0, 1] for 0 < p < 1, which have nontrivial normal measures but turn out to have no nontrivial measurable linear forms (Appendix F). Fernique (1970) proposed the following ingenious definition.
Definition. A probability measure P on a measurable vector space (X, B) will be called centered Gaussian if for variables U and V independent with law P (say, coordinates on the product X × X for the product law P × P) and any θ with 0 ≤ θ ≤ 2π, U cos θ + V sin θ and −U sin θ + V cos θ are also independent with distribution P.

If X = ℝ, the transformation of (U, V) ∈ ℝ² in the last definition is a rotation through an angle θ. Normal laws with mean 0 on finite-dimensional real vector spaces are centered Gaussian in this sense, as can be seen from covariances. Conversely, a law on X = ℝ satisfying the above definition of "centered Gaussian," even for one value of θ with sin(2θ) ≠ 0, must be normal according to the Darmois-Skitovich theorem; see the notes for this section. We will not need the full strength of the latter theorem below, but the following will be proved.
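Here is a simulation sketch of the defining property in the real-valued case (the angle and sample size are arbitrary): for U, V i.i.d. N(0, 1), the rotated pair should again have unit variances and zero covariance:

```python
import math
import random

random.seed(3)
theta = 0.6   # arbitrary rotation angle
n = 100_000
u = [random.gauss(0, 1) for _ in range(n)]
v = [random.gauss(0, 1) for _ in range(n)]
# Rotate each pair (U, V) through angle theta.
a = [x * math.cos(theta) + y * math.sin(theta) for x, y in zip(u, v)]
b = [-x * math.sin(theta) + y * math.cos(theta) for x, y in zip(u, v)]

def mean(xs):
    return sum(xs) / len(xs)

var_a = mean([x * x for x in a]) - mean(a) ** 2
var_b = mean([x * x for x in b]) - mean(b) ** 2
cov = mean([x * y for x, y in zip(a, b)]) - mean(a) * mean(b)
print(var_a, var_b, cov)
```

The simulation only checks second moments (variances near 1, covariance near 0); for jointly Gaussian pairs, zero covariance is equivalent to the independence required by the definition.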
2.2.3 Proposition. A centered Gaussian law P on ℝ is a law N(0, σ²) for some σ² ≥ 0.
Proof. P × P on ℝ² is invariant under all rotations. A rotation through θ = π shows that P is symmetric, dP(x) = dP(−x). Let f be the characteristic function of P, f(t) := ∫ e^(itx) dP(x). Then f is real-valued, f(0) = 1, and f(−t) = f(t). Any point (t, u) ∈ ℝ² can be rotated to a point on a coordinate axis, so f(t)f(u) = f((t² + u²)^(1/2)). Let h(t) := log f(|t|^(1/2)) where it is defined and finite, that is, where f > 0, as is true at least in a neighborhood of 0. Then h(t + u) = h(t) + h(u) for t, u ≥ 0 and, perhaps, small enough. Where both sides are defined and finite, we have h(qu) = qh(u) first when q is an integer, then when it is rational, and then for general real q ≥ 0 by continuity. Since h thus doesn't become unbounded on finite intervals where it's defined, it is defined and continuous on the whole half-line, with h(t) = ct for some constant c = h(1), so f(t) = exp(ct²) for all t, and c ≤ 0 since |f(t)| ≤ 1. Thus P = N(0, σ²) where σ² = −2c (RAP, Proposition 9.4.2, Theorem 9.5.1). □
Given a normal measure P = N(0, C) on a finite-dimensional space X and a vector subspace Y of X, it follows from the structure of normal measures (RAP, Theorem 9.5.7) that P(Y) = 0 or 1. This fact extends to general measurable vector spaces:
2.2.4 Theorem (0-1 law). Let (X, B) be a measurable vector space and Y a vector subspace with Y ∈ B. Then for any centered Gaussian law P on X, P(Y) = 0 or 1.

Proof. Let U and V be independent in X with law P. For 0 ≤ θ ≤ π/2, let A(θ) be the event

A(θ) := {U cos θ + V sin θ ∈ Y, −U sin θ + V cos θ ∉ Y}.

If 0 ≤ θ < φ ≤ π/2, then cos θ sin φ − sin θ cos φ = sin(φ − θ) ≠ 0, so if y1 := u cos θ + v sin θ ∈ Y and y2 := u cos φ + v sin φ ∈ Y, then u and v can be solved for as linear combinations of y1, y2, so they are in Y, and then −u sin θ + v cos θ ∈ Y and −u sin φ + v cos φ ∈ Y. So the sets A(θ) are disjoint for different values of θ ∈ [0, π/2]. By the definition of centered Gaussian, these sets all have the same probability, which thus must be 0. Taking θ = 0 gives

0 = Pr(U ∈ Y) Pr(V ∉ Y) = P(Y)P(X∖Y), so P(Y) = 0 or 1. □
2.2 Gaussian vectors are probably not very large A measurable function II
27
II from a measurable vector space X into [0, oo]
will be called a pseudoseminorm if Y := {x E X : 11x 11 < oo} is a vector subspace of X and I I I I is a seminorm on Y, that is, I I cx I I = I c I Ilx II for each real
candx E Y, and so forallx E X,with
Ilxll+llyll for all x, y E Y, and so for all x, y E X. By the 01 law (Theorem 2.2.4), for any pseudoseminorm II II and centered Gaussian P on X, P(II II < oo) = 0 or 1. Likewise, P(11 II = 0) = 0 or 1. A realvalued stochastic process consists of a set T, a probability space (0, A, P) , and a map (t, w) i+ Xt(w) from T x 0 into R such that for each t E T, Xt is measurable from 0 into R. The process is called Gaussian if for every finite subset F of T, the law G({Xt}tEF) is a normal distribution on 1(S F.
If S is a countable set, then RS, the set of all realvalued functions on S, with product topology, is a separable metric topological linear space, hence a measurable vector space. If P is the law of a Gaussian stochastic process {xt, t e S} on ]IBS, with Ext = 0 for all t e S, then P is centered Gaussian on IRS. The supremum "norm" II { yt, t E S) II := suet lyt I is clearly a pseudoseminorm on IRS.
Here is a step toward proving Theorem 2.2.2:
2.2.5 Lemma (Landau-Shepp-Fernique). Let (X, B) be a measurable vector space, P a centered Gaussian measure, and ‖·‖ a pseudo-seminorm on X with P(‖·‖ < ∞) > 0. Then for some ε > 0,

∫ exp(α‖x‖²) dP(x) < ∞ for 0 < α < ε.

Proof. As noted above, P(‖·‖ < ∞) = 1. Let U and V be independent with distribution P in X. The definition of centered Gaussian for θ = π/4 yields, for any real s < t,

(2.2.6)  P(‖·‖ ≤ s)P(‖·‖ > t) = Pr{‖(U + V)/2^(1/2)‖ ≤ s, ‖(U − V)/2^(1/2)‖ > t}.

Note that

2^(1/2) min(‖U‖, ‖V‖) ≥ ‖(U − V)/2^(1/2)‖ − ‖(U + V)/2^(1/2)‖,

where the event that the right side is undefined, equaling ∞ − ∞, has zero probability and so can be neglected. Thus on the event on the right in (2.2.6), we have ‖U‖ > (t − s)/2^(1/2) and ‖V‖ > (t − s)/2^(1/2). So

(2.2.7)  P(‖·‖ ≤ s)P(‖·‖ > t) ≤ P(‖·‖ > (t − s)/2^(1/2))².
Choose s with 0 < s < ∞ large enough so that q := P(‖·‖ ≤ s) > 1/2. Define a sequence tn recursively by t0 := s, t_{n+1} := s + 2^(1/2) tn, n = 0, 1, .... Then we have by induction

tn = (2^(1/2) + 1)(2^((n+1)/2) − 1)s.

So tn increases up to +∞ with n. Let xn := P(‖·‖ > tn)/q. Then by (2.2.7), applied with s and t_{n+1} in place of s and t, we get x_{n+1} ≤ xn². By induction, we then have

P(‖·‖ > tn) ≤ q((1 − q)/q)^(2^n).

It follows that

E exp(α‖·‖²) ≤ q e^(αs²) + Σ_{n≥0} P(tn < ‖·‖ ≤ t_{n+1}) exp(α t_{n+1}²),

and since t_{n+1}² ≤ (2^(1/2) + 1)² 2^(n+2) s², the nth term of the series is at most q[((1 − q)/q) exp(4α(2^(1/2) + 1)²s²)]^(2^n). So the series converges whenever ((1 − q)/q) exp(4α(2^(1/2) + 1)²s²) < 1, in particular for 0 < α < log(q/(1 − q))/(24s²). □

2.2.8 Theorem. Let P be a centered Gaussian measure on a measurable vector space (X, B), and let y1, y2, ..., be a sequence of measurable linear forms from X into ℝ. Let ‖x‖ := sup_n |yn(x)|. Suppose that P(‖x‖ < ∞) > 0. Then τ := (sup_n ∫ yn² dP)^(1/2) < ∞, and E exp(α‖x‖²) < ∞ if and only if α < 1/(2τ²).
Proof. For each n, ‖·‖ ≥ |yn|. It's easily checked that P ∘ yn^(−1) is centered Gaussian and thus by Proposition 2.2.3 is a law N(0, σn²). Now P(|yn| > σn) ≥ c > 0 for all n (with c > 0.3). Thus if the σn were unbounded, P(‖·‖ > σn) ≥ c for all n would give a contradiction; so τ < ∞. Then to prove "only if," we have E exp(|yn|²/(2τ²)) = 1/(1 − σn²/τ²)^(1/2), or = +∞ if σn² = τ². Taking the supremum over n gives E exp(‖x‖²/(2τ²)) = +∞. Now to prove "if," recall the space ℓ^∞ of all bounded sequences of real numbers with supremum norm. This is a nonseparable Banach space. With the smallest σ-algebra making the coordinates measurable, it is a measurable vector space (by the way, this σ-algebra is smaller than the Borel σ-algebra for
the supremum norm). Let Y(x) := {yn(x)}_{n≥1}. Let S be the vector subspace of X where ‖·‖ is finite. Then P(S) = 1 by Theorem 2.2.4, and Y is linear, measurable, and preserves norms from S into ℓ^∞. So it will be enough to prove the theorem in ℓ^∞ with coordinates {yn}.
By Gram-Schmidt orthonormalization in L²(P) (RAP, 5.4.6) we can write yn = Σ_{j=1}^{m(n)} a_{nj} gj for all n, where the gj are linear functions on ℓ^∞ (finite linear combinations of coordinates), are orthonormal in L²(P), and are normally distributed, so they are i.i.d. N(0, 1). If the yn are linearly independent in L²(P), then m(n) = n. Otherwise, m(n + 1) = m(n) + 1 or m(n) according as y_{n+1} is or is not linearly independent of the yj for j ≤ n. Let a_{nj} := 0 for j > m(n). Each gi is in turn a linear combination of y1, ..., yn for the least n such that m(n) ≥ i.
For k = 0, 1, ..., and n = 1, 2, ..., let

V_{kn} := Σ_{j>k} a_{nj} gj.

Since a_{nj} = 0 for j > m(n), the sum defining V_{kn} runs over k < j ≤ m(n) and there is no problem of convergence. Let B_k be the smallest σ-algebra for which the gj are measurable for all j > k. Then for any 0 ≤ j ≤ k and n, we have V_{kn} = E(V_{jn} | B_k). Let ‖V_k‖ := sup_n |V_{kn}| ≤ +∞. Then for α > 0 we have the inequalities

exp(α‖V_k‖²) = exp(α sup_n |V_{kn}|²)
= exp(α sup_n |E(V_{jn} | B_k)|²)
≤ exp(α {E(sup_n |V_{jn}| | B_k)}²)
= exp(α {E(‖V_j‖ | B_k)}²)
≤ E(exp(α‖V_j‖²) | B_k)

by the conditional Jensen inequality (RAP, 10.2.7) if the expectations are finite. First taking j = 0, E exp(α‖V_0‖²) < ∞ for some α > 0 by Lemma 2.2.5. Then for j = 0 ≤ k, the inequalities hold and give, for that α, E exp(α‖V_k‖²) ≤ E exp(α‖V_0‖²) < ∞. Let W_k := exp(α‖V_k‖²), k = 0, 1, .... Then by the inequalities for general 0 ≤ j ≤ k, {(W_j, B_j) : j = ..., 2, 1, 0} is a submartingale (RAP, Section 10.3) and, in view of its index set, a reversed submartingale. For any s > 0 and finite k, by the Doob maximal inequality (RAP, 10.4.2),

Pr{max_{0≤j≤k} W_j ≥ s} ≤ E W_0 / s.

The variable lim sup_{k→∞} ‖V_k‖ is measurable for the tail σ-algebra of the gj, so by this maximal inequality and the Kolmogorov 0-1 law (RAP, 8.4.4), Pr(lim sup_{k→∞} ‖V_k‖ > s) = 0 for each s > 0. Then for 0 < ε < 1/2, there is a k(ε) < ∞ such that Pr(‖V_k‖ > ε) < ε for k ≥ k(ε), since if Pr(‖V_k‖ ≥ ε) ≥ ε for infinitely many values of k, then Pr(lim sup_{k→∞} ‖V_k‖ ≥ ε) ≥ ε. Then by the last line of the proof of Lemma 2.2.5,
(2.2.9)  E exp(γ‖V_{k(ε)}‖²) < ∞ for 0 < γ ≤ log((1 − ε)/ε)/(24ε²).

Now let 0 < α < γ < 1/(2τ²) and take ε > 0 small enough so that

(2.2.10)  αγ/(γ^(1/2) − α^(1/2))² ≤ log((1 − ε)/ε)/(24ε²).

Let k := k(ε) and U_k(y) := y − V_k(y). Then

α^(1/2)‖y‖ ≤ α^(1/2)‖U_k(y)‖ + α^(1/2)‖V_k(y)‖
= (α/γ)^(1/2) · γ^(1/2)‖U_k(y)‖ + (1 − (α/γ)^(1/2)) · δ^(1/2)‖V_k(y)‖,

where δ := αγ/(γ^(1/2) − α^(1/2))². Since t ↦ exp(t²) is a convex function (RAP, Section 6.3, see Problem 1),

exp(α‖y‖²) ≤ (α/γ)^(1/2) exp(γ‖U_k‖²) + (1 − (α/γ)^(1/2)) exp(δ‖V_k‖²).

By (2.2.9) and (2.2.10), E exp(δ‖V_k‖²) < ∞. Now, by the Cauchy inequality, for each n,

U_{kn}² = (Σ_{j=1}^k a_{nj} gj)² ≤ (Σ_{j=1}^k gj²)(Σ_{j=1}^k a_{nj}²) ≤ τ² Σ_{j=1}^k gj²,

so

E exp(γ‖U_k‖²) = E exp(γ sup_n U_{kn}²) ≤ E exp(γτ² Σ_{j=1}^k gj²) = (1 − 2γτ²)^(−k/2) < ∞.
So Theorem 2.2.8 is proved.
Proof of Theorem 2.2.2. Let {xn}_{n≥1} be dense in X. For each n, by the Hahn-Banach theorem and a corollary (RAP, 6.1.5), there is a yn ∈ X' with ‖yn‖' = 1 and |yn(xn)| = ‖xn‖. Then for all x ∈ X, we have sup_n |yn(x)| = ‖x‖, so the theorem follows from Theorem 2.2.8. □
2.3 Inequalities and comparisons for Gaussian distributions

The main result of this section will show that if a set of Gaussian random variables is large enough in the sense of metric entropy (as defined in Section 1.2), meaning that the number of variables more than ε apart grows rather fast as ε ↓ 0, then it is almost surely unbounded. The main steps in the proof will be some inequalities, one due to Slepian and another to Sudakov and Chevet.

2.3.1 Slepian's inequality. Let X1, ..., Xn be real random variables with a normal joint distribution N(0, r) on ℝⁿ. Let Pn(r) := Pr{Xj > 0 for all j = 1, ..., n}. Let q be another covariance matrix, with r_ii = q_ii = 1 for all i = 1, ..., n. If r_ij ≥ q_ij for all i and j, then Pn(r) ≥ Pn(q).

Remarks. Since each Xj has distribution N(0, 1), clearly Pn(r) ≤ 1/2, with Pn(r) = 1/2 when r_ij = 1 for all i and j. On the other hand, Pn(r) = 0 if r_ij = −1 for some i ≠ j.

Proof. First, suppose that r is nonsingular, so that it is strictly positive definite and N(0, r) has a density gn given by Fourier inversion of its characteristic function as
gn(xl,...,xn)
:=
gn(xl,...,xn;r) = (2n)n Jt n
eXp(ixjtj2
n
j=1
k,m=1
1
00
...
t
J
oo
o0
rkmtktm dt1...dtn
(RAP, Theorem 9.5.4). Since r is symmetric, it is given by the n(n + 1)/2 variables rkm, 1 < k < m < n. The partial derivatives a2gn/axkaxm can be evaluated by differentiating under the integral signs, applying Corollary A.15 (in Appendix A) twice, thus multiplying the integrand by tktm (RAP, Theorem 9.4.4). The same integral results from taking ag,,/arkm for k < m, where a/arkm can be taken under the integral sign by Proposition A.16 (since r is strictly positive definite). So (2.3.2)
a2gn/axkaxm = agnlarkm,
k
m.
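The identity (2.3.2) can be checked numerically in the bivariate case. The sketch below (not from the text; point choices are arbitrary) compares central finite-difference approximations of both sides for the standard bivariate normal density, where the correlation ρ plays the role of the off-diagonal covariance entry r_{12}.

```python
import math

def g2(x1, x2, rho):
    # Bivariate standard normal density with correlation rho
    det = 1.0 - rho * rho
    q = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / (2.0 * det)
    return math.exp(-q) / (2.0 * math.pi * math.sqrt(det))

def d2g_dx1dx2(x1, x2, rho, h=1e-4):
    # Central mixed second difference in x1 and x2
    return (g2(x1 + h, x2 + h, rho) - g2(x1 + h, x2 - h, rho)
            - g2(x1 - h, x2 + h, rho) + g2(x1 - h, x2 - h, rho)) / (4.0 * h * h)

def dg_drho(x1, x2, rho, h=1e-6):
    # Central first difference in the covariance entry rho = r_{12}
    return (g2(x1, x2, rho + h) - g2(x1, x2, rho - h)) / (2.0 * h)

# (2.3.2): d^2 g / dx1 dx2 = d g / d rho, checked at a few sample points
for (x1, x2, rho) in [(0.3, -0.7, 0.2), (1.0, 0.5, -0.4), (0.0, 0.0, 0.6)]:
    assert abs(d2g_dx1dx2(x1, x2, rho) - dg_drho(x1, x2, rho)) < 1e-5
```

The agreement to roughly eight digits is what one expects from second-order finite differences at these step sizes.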
Gaussian Measures and Processes; Sample Continuity
Now

(2.3.3)   P_n(r) = ∫_0^∞ ⋯ ∫_0^∞ g_n(x_1, …, x_n) dx_1 ⋯ dx_n,

where g_n(x) = (2π)^{−n/2}(det r)^{−1/2} exp(−(r^{−1}x, x)/2) (RAP, 9.5.8). For a positive definite symmetric matrix s, by Proposition A.16, applied to t = s_{ij} and ψ(x) = x_i x_j (or x_i²/2 if i = j), the integral of exp(−(sx, x)/2) over any measurable region in R^n, specifically the positive orthant {0 ≤ x_i < ∞, i = 1, …, n}, can be differentiated under the integral sign with respect to any component of s. Then, since the functions r ↦ r^{−1} and r ↦ (det r)^{−1/2} are smooth for r in the set of symmetric, (strictly) positive definite matrices, the integral (2.3.3) can be differentiated under the integral sign with respect to any r_{km}, k ≠ m.
Pr{max_{p≤n} ‖S_p‖_Y > t} ≤ 2 Pr{‖S_n‖_Y > t}.

So, assuming (d) for the given sequence π_n, given ε > 0, take k large enough so that Pr{|L(π_k^⊥ C)|* > ε} < ε/2. It follows that for any n ≥ k, Pr{|L((π_n − π_k)C)|* > ε} < ε/2, by Lemma 2.5.2 with π_k^⊥ C in place of C and B := π_n, noting that π_n ∘ π_k^⊥ = π_n ∘ (I − π_k) = π_n − π_k. Thus by the Lévy inequality (Lemma 1.3.16),

Pr{max_{k<j≤n} sup_{y∈Y} |L((π_j − π_k)(y))| > ε} < ε.
Thus the processes L ∘ π_j converge uniformly on Y a.s. Let e_1, e_2, … be an orthonormal basis of H including an orthonormal basis of the range of π_n for each n, by Lemma 2.5.4. Consider the process M(x) := Σ_j (x, e_j) L(e_j) and the approximating processes M_n(x) := Σ_j {(x, e_j) L(e_j): e_j ∈ range(π_n)}. We have M_n(y) = L(π_n(y)) a.s. for each y ∈ Y, thus for all y ∈ Y since Y is countable. It will be shown that for each x, M_n(x) converges to M(x) a.s. Let the variables be defined on a probability space (Ω, P). In the Hilbert space J := L²(Ω, P), the series defining M(x) converges since the L(e_j) are orthonormal and Σ_j (x, e_j)² = ‖x‖² < ∞. In J, the series converges in any order. Thus M_n(x) converges to M(x) in J, and so in probability by Chebyshev's inequality. Then, since the L(e_j) are independent, the M_n(x) are partial sums of a series of independent variables and so converge a.s. by the Lévy equivalence theorem (RAP, Theorem 9.7.1). Now, M_n converges uniformly on Y a.s., and since each M_n is uniformly continuous on C a.s., because C is bounded, the limit is uniformly continuous on Y a.s. and is a version of L on Y; in fact for each y ∈ Y, M(y) = L(y) a.s. Then M extends by uniform continuity to a process defined on C and uniformly continuous there. Since L is continuous in probability, M is then a modification of L on C. Now M − M_n converges uniformly to 0 on Y and on C a.s., which implies that |L(π_n^⊥ C)|* → 0 a.s., proving (d′).

Now let Q_m be another sequence of fdp's with Q_m ↑ I. Then for k fixed as above, let e_1, …, e_r be a basis for the range of π_k. Then Q_m^⊥ e_j → 0 in H for each j. Since r is fixed and C is bounded, |L(Q_m^⊥ π_k C)|* → 0 in probability. We also have Pr{|L(Q_m^⊥ π_k^⊥ C)|* > ε} ≤ Pr{|L(π_k^⊥ C)|* > ε} by Lemma 2.5.2. The latter is < ε by choice of k. Since |L(Q_m^⊥ C)|* ≤ |L(Q_m^⊥ π_k C)|* + |L(Q_m^⊥ π_k^⊥ C)|* a.s., we have |L(Q_m^⊥ C)|* → 0 in probability as m → ∞. By the last paragraph, the convergence is almost sure. So the properties (d), (d′), (e), and (e′) are equivalent.

These properties clearly imply (c). To see that they imply (f), note that in the above proof, each M_n(ω) is the inner product with some element of H, and M_n almost surely converges uniformly on C to M. Each M_n defines a measurable function from Ω into H, thus into V₃. Hence M defines a random variable with values in V₃ (RAP, Theorem 4.2.2). So (f) follows.
2.5 The isonormal process: sample boundedness and continuity
Clearly (f) implies (g) implies (h) implies (a). On the other hand, (d) implies that the M_n converge uniformly also on the closed symmetric convex hull sco(C) of C. In fact, for any fdp π,

|L(π^⊥ sco(C))|* = |L(π^⊥ C)|*  a.s.,

since |L(−x)| = |L(x)| a.s. for any x, and finite convex combinations with rational coefficients of elements of Y ∪ −Y give a countable dense set in sco(C). The limit of the M_n is again M, a modification of L, now on sco(C). So (d) implies (a′), which implies (a) clearly.

Next, to see that (d) implies (b), given ε > 0, take an fdp π such that Pr{|L(π^⊥ C)|* > ε/2} < 1/2. Also Pr{|L(πC)|* < ε/2} > 0, since πC is a bounded set in a finite-dimensional space, and since L ∘ π and L ∘ π^⊥ can be taken to be independent (on Y), we get Pr{|L(C)|* < ε} > 0, proving (b).

Next it will be shown that (b) implies (c). For ε > 0, and fdp's π_n ↑ I, Lemma 2.5.2 implies Pr{|L(π_n^⊥ C)|* < ε} ≥ Pr{|L(C)|* < ε} ≥ δ for some δ > 0 for all n. The event D that |L(π_n^⊥ C)|* < ε for infinitely many n, that is ∩_{m≥1} ∪_{n≥m} {|L(π_n^⊥ C)|* < ε}, thus has probability at least δ. But D is a "tail event," since it depends on the sequence of independent random variables L(e_j) only for j > k for k arbitrarily large. It follows that D has probability 0 or 1 (Kolmogorov's zero–one law, RAP, 8.4.4), thus probability 1. This yields (c).

Next (c) implies (a): for any ε > 0, suppose that |L(π_n^⊥ C)|* < ε/2 for some ω and n. Then M_n, being linear on the finite-dimensional bounded set π_n C, is uniformly continuous there, so for some γ > 0, ‖x − y‖ < γ implies |M_n(x) − M_n(y)| < ε/2, for x, y ∈ π_n(Y) and thus for x and y in Y, and then since M_n + L ∘ π_n^⊥ = L almost surely on Y, |L(x) − L(y)| < ε. Thus L(ω) is almost surely uniformly continuous on Y, hence again extends to a uniformly continuous function on C which yields a modification of L, giving (a).

It will now be enough to prove that (a) implies (d). Given ε > 0, take a version of L and δ > 0 such that
Pr{sup{|L(x) − L(y)|: x, y ∈ Y, ‖x − y‖ < δ} > ε} < ε.

Take a finite-dimensional linear subspace F of H such that F ∩ C is within δ/2 of every point of C. We can assume that Y ∩ F is dense in F ∩ C. Let π be the orthogonal projection onto F. Since Y is countable, we have L(x − y) = L(x) − L(y) and L(π^⊥(x − y)) = L(π^⊥ x) − L(π^⊥ y) almost surely for all x, y ∈ Y. Then by Lemma 2.5.2,

ε > Pr{sup{|L(π^⊥ x) − L(π^⊥ y)|: x, y ∈ Y, ‖x − y‖ < δ} > ε}
  ≥ Pr{sup{|L(π^⊥ x)|: x ∈ Y} > ε},
since for any x ∈ Y there is a y in F ∩ Y with ‖x − y‖ < δ and π^⊥ y = 0. Letting ε = 1/n ↓ 0, n → ∞, (d) holds. □

Recall that a Borel probability measure μ on a separable Banach space B is called Gaussian if every continuous linear form in B′ has a Gaussian distribution. It follows that the norm ‖·‖ on B satisfies some inequalities on the upper tail of its distribution for μ (Landau–Shepp–Marcus–Fernique bounds, Theorem 2.2.2). In particular, ∫ ‖x‖² dμ(x) < ∞.
2.5.7 Theorem Let (B, ‖·‖) be a separable Banach space. Let μ be a Gaussian probability measure with mean 0 on the Borel sets of B. Then the unit ball B′₁ := {f: ‖f‖′ ≤ 1} in the dual Banach space B′ is a compact GC-set in L²(B, μ).
Proof A Cauchy sequence {y_n} in B′₁ for the L²(μ) norm converges in L²(μ). Consider the weak-star topology on B′, in other words the topology of pointwise convergence on B. The functions in B′₁ are uniformly equicontinuous, indeed Lipschitz, with the uniform bound |f(x) − f(y)| ≤ ‖x − y‖, f ∈ B′₁, x, y ∈ B. Thus in B′₁, pointwise convergence on B is equivalent to convergence on a countable dense set. So the weak-star topology on B′₁ is metrizable (cf. RAP, Theorem 2.4.4). Any linear function on B is given by its values on B₁ := {x ∈ B: ‖x‖ ≤ 1}, and pointwise convergence on B is equivalent to pointwise convergence on B₁. The set of all functions from B₁ into [−1, 1], with the topology of pointwise convergence, is compact by Tychonoff's theorem (RAP, 2.2.8). It's easily seen that B′₁ is a closed subset of this compact space, so it is also compact. (Compactness of B′₁ in the weak* topology for a general Banach space B is known as Alaoglu's theorem; see, for example, Dunford and Schwartz, 1958, pp. 424–426.) So {y_n} has a subsequence converging pointwise on B to some element y of B′. For jointly Gaussian variables, pointwise convergence (convergence in probability) implies L² convergence, so y is the L² limit of {y_n} and B′₁ is compact in L²(μ).

The natural mapping T of B′ into L²(μ) has an adjoint T* taking L²(μ) into B″, the dual space of (B′, ‖·‖′). There is a natural map of B into B″ given by x ↦ (h ↦ h(x)) for x ∈ B and h ∈ B′. The map is an isometry (RAP, Corollary 6.1.5, of the Hahn–Banach theorem 6.1.4). So B can be viewed as a linear subspace of B″. If it is all of B″, then B is called reflexive. In the present case, whether or not B is reflexive, T* actually has values in B:
2.5.8 Lemma Let B be a separable Banach space and μ a measure on B such that ∫ ‖x‖² dμ(x) < ∞. Then for the natural mapping T of B′ into H := L²(μ), the adjoint T* on H′ has values in B.

Proof For any h ∈ H and y ∈ B′,

(T*h)(y) = (h, Ty) = ∫_B h(x) y(x) dμ(x) = y(u),

where u ∈ B is defined by the Bochner integral u = ∫_B h(x) x dμ(x) (Appendix E, Theorem E.9). The linear form y can be taken under the integral sign since the Bochner integral, when it exists, equals the Pettis integral (Appendix E). □
So let J be the range of T*, a linear subspace of B, and S its closure, a Banach subspace of B. Note that each element of S is uniformly continuous on B′₁ for the H = L²(μ) norm topology, since it is a limit in the norm ‖·‖ on B, and thus uniformly on B′₁, of such functions. It will be shown that μ(S) = 1. If S ≠ B, take a countable dense subset {x_m} of B\S. By the Hahn–Banach theorem (RAP, 6.1.4), for each m = 1, 2, …, there is a u_m ∈ B′₁ such that u_m = 0 on S and u_m(x_m) = d(x_m, S) := inf{‖x_m − y‖: y ∈ S}. For any x ∈ B\S, let ε := d(x, S) > 0 and take m with ‖x − x_m‖ < ε/2. Then |u_m(x)| ≥ u_m(x_m) − ‖x_m − x‖ > d(x_m, S) − ε/2 ≥ ε − ε/2 − ε/2 = 0, so u_m(x) ≠ 0. Thus S = ∩_m u_m^{−1}(0). For each m, to show that μ(u_m = 0) = 1 is equivalent to showing that T(u_m) = 0. If not, let T(u_m) = v ≠ 0. Then 0 < (v, v) = (T(u_m), v) = u_m(T*v) = 0 since T*v ∈ S, a contradiction. So μ(u_m = 0) = 1 for each of the countably many values of m. It follows that μ(S) = 1.

Let K be the closure of the range of T in H. Then K is a Hilbert space. A limit of Gaussian random variables with mean 0 in L²(μ) is also such a random variable, so K consists of such random variables, and any finite set of them has a joint normal distribution. Thus the identity from K to itself is an isonormal process L. For this L, we can apply Theorem 2.5.5, where S is the space V₃ of Theorem 2.5.5(f). It follows that B′₁ is a GC-set. □

The next fact is a direct consequence of Theorem 2.5.5.
2.5.9 Corollary For any two GCsets C and D, their union C U D is also a GCset.
Proof Condition (e) or (e') in Theorem 2.5.5 holds on C and on D and so, clearly, on C U D.
2.6 A metric entropy sufficient condition for sample continuity

Recall that a stochastic process X_t(ω), t ∈ T, is said to be sample-bounded on T if sup_{t∈T} |X_t| is finite for almost all ω. If T is a topological space, then the process is said to be sample-continuous if for almost all ω, t ↦ X_t(ω) is continuous. The isonormal process is not sample-continuous on the Hilbert space H: let {e_n} be an orthonormal sequence. Then the L(e_n) are i.i.d. N(0, 1) variables. Thus if a_n → 0 slowly enough, specifically if a_n(log n)^{1/2} → ∞ as n → ∞, the L(a_n e_n) are almost surely unbounded (by Theorem 2.3.5). So not all bounded sets or even compact sets are GB-sets or GC-sets. Such sets must be small enough in a metric entropy sense. This section will prove a sufficient condition based on metric entropy (defined in Section 1.2), while Section 2.7 will give a characterization based on what are called majorizing measures.

A metric entropy sufficient condition for sample continuity of L will actually give a quantitative bound for the continuity. Let (T, d) be a metric space. A function f will be called a sample modulus for a real stochastic process {X_t, t ∈ T} if there is a process Y_t with the same laws as X_t and such that for almost all ω, there is an M(ω) < ∞ such that for all s, t ∈ T, |Y_s − Y_t|(ω) ≤ M(ω) f(d(s, t)).

Whenever f is a sample modulus for L on C ⊂ H, and {X_t, t ∈ T} is a Gaussian process with mean 0 and {X_t: t ∈ T} = C, then f is also a sample modulus for the process X_t, with the intrinsic pseudometric d(s, t) := (E(X_s − X_t)²)^{1/2} on T. Recall from Section 1.2 the definitions of N(ε, C) and H(ε, C), for the usual metric d(x, y) := ‖x − y‖ on H. Now the main theorem of this section can be stated.
2.6.1 Theorem For any C ⊂ H, if ∫_0^1 (log N(t, C))^{1/2} dt < ∞, then C is a GC-set, and if

f(x) := ∫_0^x (log N(t, C))^{1/2} dt,   x > 0,

then f is a sample modulus for L on C.

Note. If C is bounded, then N(t, C) = 1 and log N(t, C) = 0 for t large enough, and N(·, C) is a nonincreasing function, so integrability of (log N(t, C))^{1/2} is only an issue near t = 0. If f(x) = +∞ for some x > 0, then f(x) = +∞ for all x > 0, so it still provides a sample modulus but only a trivial one. By Theorem 1.2.1, N(t, C) could be replaced equivalently by D(t, C).
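As a numerical illustration of the entropy-integral criterion (not from the text; the model profiles below, stated up to constants, are illustrative assumptions), consider C = {a_n e_n: n ≥ 2} for an orthonormal sequence {e_n}. For a_n = 1/log n one gets log N(t, C) of order 1/t, with a convergent integral, while a_n = (log n)^{−1/4} gives order t^{−4}, a divergent integral, and indeed t² log N(t, C) → ∞, consistent with C failing to be a GB-set.

```python
import math

# Model entropy profiles for C = {a_n e_n : n >= 2} (assumed, up to constants):
#   a_n = 1/log n        =>  log N(t, C) ~ 1/t   (entropy integral converges)
#   a_n = (log n)^(-1/4) =>  log N(t, C) ~ t^-4  (integral diverges)
def entropy_integral(log_N, lower, upper=1.0, steps=200000):
    # Midpoint rule for the integral of (log N(t))^(1/2) on [lower, upper]
    h = (upper - lower) / steps
    return sum(math.sqrt(log_N(lower + (i + 0.5) * h)) for i in range(steps)) * h

finite_profile = lambda t: 1.0 / t        # integral of t^(-1/2) over (0,1] is 2
divergent_profile = lambda t: t ** -4.0   # integral of t^(-2) blows up like 1/lower

# Convergent case: shrinking the lower cutoff changes the value only slightly.
assert entropy_integral(finite_profile, 1e-6) < 2.0
assert entropy_integral(finite_profile, 1e-6) - entropy_integral(finite_profile, 1e-4) < 0.05

# Divergent case: the partial integrals blow up as the cutoff shrinks.
assert entropy_integral(divergent_profile, 1e-3) > 900.0
```

The contrast between the two profiles is exactly the gap discussed after the proof of Theorem 2.6.1.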
Proof We have f(1) < ∞ and can assume C is infinite. Then H(ε) := H(ε, C) → +∞ as ε ↓ 0. Sequences δ_n ↓ 0 and ε(n) := ε_n ↓ 0 will be defined recursively as follows. Let ε₁ := 1. Given ε₁, …, ε_n, let

δ_n := (1/2) inf{ε: H(ε) ≤ 2H(ε_n)},   ε_{n+1} := min(ε_n/3, δ_n).

Then ε_n ≤ 3(ε_n − ε_{n+1})/2. Also, if ε_{n+1} = δ_n, then

∫_{ε(n+1)}^{ε(n)} H(x)^{1/2} dx ≤ 2H(ε_n)^{1/2} ε_n,

while otherwise ε_{n+1} = ε_n/3 and

∫_{ε(n+1)}^{ε(n)} H(x)^{1/2} dx ≥ (2/3) ε_n H(ε_n)^{1/2}.

If Φ is the standard normal distribution function, then for T ≥ 0, 1 − Φ(T) ≤ exp(−T²/2) (RAP, Lemma 12.1.6(b)). Then

p_n ≤ 4 exp(4H(ε_n) − 9H(ε_n)/2) = 4 exp(−H(ε_n)/2).

Since H(ε_{n+2}) ≥ H(δ_n/3) ≥ 2H(ε_n), Σ_n p_n is dominated for n ≥ 2 by a sum of two geometric series, one for n even and one for n odd, and so converges. Then for almost all ω, there is an n₀(ω) such that for all n ≥ n₀(ω), |L(z)| ≤ 3‖z‖H(ε_n)^{1/2} for all z ∈ G_n.

Either δ_m = ε_{m+1} or δ_m < 2ε_m = 6ε_{m+1}, so δ_m ≤ 6ε_{m+1} for all m ≥ 1, and Σ_{n≥2} δ_{n−1} H(ε_n)^{1/2} < ∞. For each x ∈ C choose A_n(x) ∈ A_n with ‖x − A_n(x)‖ ≤ 2δ_n. Then ‖A_{n−1}(x) − A_n(x)‖ ≤ 2δ_{n−1} + 2δ_n.

On the other hand, Theorem 2.3.5 implies that C is not a GB-set if as ε ↓ 0, eventually N(ε, C) ≥ exp(ε^{−p}) for some p > 2 or N(ε, C) ≥ exp(ε^{−2}|log ε|^{s}) for some s > 0. It turns out that the gap cannot be closed further: if N(ε, C) is of the order of exp(ε^{−2}|log ε|^{−r}) for 0 < r < 2, there are examples showing that C may or may not be a GB-set (see Problems 14 and 15). So a characterization of the GB-property can't be given in terms of metric entropy, although it comes rather close. For a characterization in other terms, see the next section.
Remark. If C is a GC-set, then a version of L can be chosen such that for all ω, x ↦ L(x)(ω) is continuous for x ∈ C. Then for any countable dense subset A of C, L(C)* = sup_{x∈A} L(x) a.s.
Next, the same integral as in Theorem 2.6.1 yields a bound for expectations of certain suprema.
2.6.2 Theorem Let C ⊂ H be nonempty and let D := diam C := sup_{x,y∈C} ‖x − y‖. Let B := {x − y: x, y ∈ C}. Then for f as in Theorem 2.6.1,

(a) E|L(B)|* ≤ 81 f(D/4), and
(b) EL(C)* ≤ 81 f(D/4).

Remarks. All three quantities in (a) and (b) are invariant under translation, replacing C by {c + u: c ∈ C} for any fixed u. But E|L(C)|* does not have such invariance, and it becomes unbounded as ‖u‖ → ∞, so for it we cannot have an upper bound Kf(D), K < ∞. If the constant 81 is replaced by a larger one, one can have, instead of the quantities on the left in (a) and (b), Young–Orlicz norms (Appendix H) ‖·‖_g, where g(x) := exp(x²) − 1; see Theorem 2.6.8 at the end of this section.
Proof Note that log N(t, C) = 0 for t > D/2, so f(D/2) = f(+∞). If f(x) = +∞ for some (and hence all) x > 0, then (a) and (b) hold trivially (under the given definitions). If f(D) < ∞, then we can take L sample-continuous on C by Theorem 2.6.1. We first have:

2.6.3 Lemma Let g₀(u) := 2uφ(Φ^{−1}(1/(2u))) for u ≥ 1/2, where φ and Φ are the standard normal density and distribution function, respectively. Then g₀ is concave. For any random variable Z with distribution N(0, σ²) and any event A with P(A) > 0,

(2.6.4)   ∫_A |Z| dP ≤ σ P(A) g₀(1/P(A)).
Proof Let h(v) := g₀(v/2) = vφ(Φ^{−1}(1/v)) for v ≥ 1. Writing q(v) := Φ^{−1}(1/v), so that q′(v) = −1/(v²φ(q(v))), and using φ′(x) = −xφ(x), we get

h′(v) = φ(q(v)) + vφ′(q(v))q′(v) = φ(q(v)) + q(v)/v,

and then

h″(v) = φ′(q(v))q′(v) + q′(v)/v − q(v)/v² = −1/(v³φ(Φ^{−1}(1/v))) < 0,

so h is concave for v ≥ 1 and g₀ is concave for u ≥ 1/2.

Next, the left side of (2.6.4) is maximized for fixed P(A) > 0 when A is a set {|Z| ≥ r} for some r ≥ 0, by the Neyman–Pearson lemma (e.g., Lehmann, 1986, p. 74). Then P(A) = 2Φ(−r/σ) and

∫_A |Z| dP = (2/π)^{1/2} σ exp(−r²/(2σ²)).

Letting x = r/σ, we need to prove, for x ≥ 0,

(2.6.5)   2φ(x) ≤ 2Φ(−x) g₀(1/(2Φ(−x))).

Setting u := 1/(2Φ(−x)), so that x = −Φ^{−1}(1/(2u)), shows that (2.6.5) holds, with equality. □
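Both conclusions of Lemma 2.6.3 can be checked numerically. The sketch below (not from the text) builds Φ from the error function and its inverse by bisection, then verifies that (2.6.5) holds with equality and that g₀ has nonpositive second differences.

```python
import math

def Phi(x):
    # Standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    # Standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi_inv(p, lo=-12.0, hi=12.0):
    # Quantile function by bisection (Phi is strictly increasing)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def g0(u):
    # g0(u) = 2 u phi(Phi^{-1}(1/(2u))), u >= 1/2, as in Lemma 2.6.3
    return 2.0 * u * phi(Phi_inv(1.0 / (2.0 * u)))

# (2.6.5) holds with equality: 2 phi(x) = 2 Phi(-x) g0(1/(2 Phi(-x)))
for x in [0.0, 0.5, 1.0, 2.0, 3.0]:
    p = Phi(-x)
    assert abs(2.0 * phi(x) - 2.0 * p * g0(1.0 / (2.0 * p))) < 1e-9

# Concavity of g0: discrete second differences are nonpositive
for k in range(1, 100):
    u = 0.5 + 0.1 * k
    assert g0(u + 0.1) - 2.0 * g0(u) + g0(u - 0.1) <= 1e-12
```

Equality in (2.6.5) is algebraic once u = 1/(2Φ(−x)); the numeric check mainly exercises the quantile inversion.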
2.6.6 Lemma If Z₁, …, Z_N are each normally distributed with mean 0 and variance ≤ σ², then E max_{1≤j≤N} |Z_j| ≤ σ g₀(N).

2.6.7 Lemma For some constant K < 4, g₀(y) ≤ g₁(y) := K(log(1 + y))^{1/2} for all y ≥ 1/2.

Proof Since g₀(1/(2Φ(−x))) = φ(x)/Φ(−x) ≤ (x² + 4)^{1/2}, Φ(−x) ≤ exp(−x²/2) for x ≥ 0 (RAP, Lemma 12.1.6(b)), and g₁ is nondecreasing, it will be enough to show that

g₁(exp(x²/2)/2) ≥ (x² + 4)^{1/2},   x ≥ 0.

Letting y := exp(x²/2)/2, we need to show that

g₁(y) = K(log(1 + y))^{1/2} ≥ (4 + 2 log(2y))^{1/2},   y ≥ 1/2,

or

K² log(1 + y) ≥ 4 + 2 log 2 + 2 log y,   y ≥ 1/2,

which follows from the definition of K. □
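The text does not pin down the exact value of K, only that K < 4 suffices (and that constant is what the proof of Theorem 2.6.2 uses). The sketch below (an illustration, not from the text) scans g₀ through the parametrization u = 1/(2Φ(−x)) and checks both the intermediate Mills-ratio bound and the conclusion with the generous constant 4.

```python
import math

def Phi(x):
    # Standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    # Standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

# g0(1/(2 Phi(-x))) simplifies to the Mills-type ratio phi(x)/Phi(-x),
# so scanning x >= 0 scans g0 over u >= 1.
for k in range(0, 400):
    x = 0.01 * k
    u = 1.0 / (2.0 * Phi(-x))
    g0_u = phi(x) / Phi(-x)
    # Intermediate bound used in the proof of Lemma 2.6.7
    assert g0_u <= math.sqrt(x * x + 4.0) + 1e-12
    # Conclusion with constant 4, as used in the proof of Theorem 2.6.2
    assert g0_u <= 4.0 * math.sqrt(math.log(1.0 + u))
```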
Now to prove Theorem 2.6.2: D = 0 if and only if C consists of a single point. Then both sides of (a) and (b) are 0 and they hold. So assume D > 0. Let ε_k := D/2^k and N_k := N(ε_k/2, C), k = 0, 1, …. Then for each k = 0, 1, 2, …, there is a set C_k of N_k points x_{kj}, j = 1, …, N_k, such that for all x ∈ C, ‖x − x_{kj}‖ ≤ ε_k for some j. Then N₀ = 1, so C₀ = {x₀₁} for some x₀₁. For each k = 1, 2, … and j = 1, …, N_k, choose and fix a point y_{kj} = x_{k−1,i} for some i such that ‖x_{kj} − y_{kj}‖ ≤ ε_{k−1}. Let W_k be the set of all variables L(x_{kj}) − L(y_{kj}), j = 1, …, N_k. Then by Lemma 2.6.6,

E max{|z|: z ∈ W_k} ≤ ε_{k−1} g₀(N_k).

For any u_k ∈ C_k, there is a sequence of points u_j ∈ C_j, j = 0, …, k, such that L(u_j) − L(u_{j−1}) ∈ W_j, j = 1, …, k. Thus

E sup{|L(x) − L(y)|: x, y ∈ ∪_{i=1}^k C_i} ≤ 2 Σ_{j=1}^k ε_{j−1} g₀(N_j).
The union of all the C_i is dense in C, so by sample continuity and monotone convergence,

E_C := E sup{|L(x) − L(y)|: x, y ∈ C} ≤ 2 Σ_{j=1}^∞ ε_{j−1} g₀(N_j) = 4D Σ_{j=1}^∞ g₀(N_j)/2^j.

By Lemma 2.6.7, where K < 4, we get

E_C ≤ 16D Σ_{j=1}^∞ (log(1 + N_j))^{1/2}/2^j.

For all j ≥ 1, N_j ≥ 2, so [log(1 + N_j)/log N_j]^{1/2} ≤ (log 3/log 2)^{1/2} < 1.26. Thus

E_C ≤ 20.2D Σ_{j=1}^∞ (log N_j)^{1/2}/2^j ≤ 81 Σ_{j=1}^∞ ∫_{ε_{j+2}}^{ε_{j+1}} (log N(t, C))^{1/2} dt = 81 f(D/4),

proving (a). Then for any fixed y ∈ C, sup_{x∈C} L(x) ≤ L(y) + sup_{x∈C} |L(x) − L(y)|, so (b) follows and Theorem 2.6.2 is proved. □
Let g be a convex, increasing function from [0, ∞) onto itself. If Y is a random variable such that Eg(δY) < ∞ for some δ > 0, let ‖Y‖_g := inf{c > 0: Eg(|Y|/c) ≤ 1}. Then ‖·‖_g is a seminorm on such random variables (Appendix H). If there is no such δ > 0, let ‖Y‖_g := +∞.
2.6.8 Theorem There is an absolute constant M < ∞ such that for any subset C of a Hilbert space H, and g(x) := exp(x²) − 1,

‖L(C)*‖_g ≤ ‖|L(C)|*‖_g ≤ M E(|L(C)|*).

Proof If C is not a GB-set, all three expressions will be infinite, so suppose C is a GB-set, which we can then take to be countable. Then by the Landau–Shepp–Marcus–Fernique theorem (2.2.2), ‖|L(C)|*‖_g < ∞.
The first inequality in the theorem is clear. Suppose there is no such M < ∞. Then there are countable GB-sets C_j ⊂ H with ‖|L(C_j)|*‖_g > j³ E|L(C_j)|* for j = 1, 2, …. By homogeneity, we can assume E|L(C_j)|* = 1 for each j. Let H₁, H₂, … be infinite-dimensional Hilbert spaces and form the direct sum H := ⊕_j H_j, so that the H_j are taken as orthogonal subspaces of H. We can take C_j ⊂ H_j for each j. Let D_j := C_j/j² for each j. Let D := ∪_j D_j ⊂ H. Then

E|L(D)|* = E max_j |L(D_j)|* ≤ Σ_j E|L(D_j)|* = Σ_j j^{−2} < ∞,

so D is a GB-set. Thus by the Landau–Shepp–Marcus–Fernique theorem (2.2.2), ‖|L(D)|*‖_g < ∞. But for each j, ‖|L(D)|*‖_g ≥ ‖|L(D_j)|*‖_g > j, a contradiction; so Theorem 2.6.8 is proved. □
2.7 Majorizing measures

This section will prove a characterization of GB-sets, that is, subsets of a Hilbert space H on which the isonormal process is sample-bounded, in terms of majorizing measures, to be defined next. Problems 14 and 15 at the end of this chapter show that there is no such characterization in terms of metric entropy in general, although there is under further restrictions (Theorem 2.7.4). The majorizing measure characterization and its proof are due to X. Fernique and M. Talagrand.

For a metric space (T, d), r > 0, and x ∈ T, the open ball of center x and radius r is B(x, r) := {y: d(x, y) < r}.

Definition. Let (T, d) be a metric space and P(T) the set of all laws (Borel probability measures) on T. For m ∈ P(T) let

γ_m(T) := sup_{x∈T} ∫_0^∞ [log(1/m(B(x, r)))]^{1/2} dr.

If γ_m(T) < ∞, then m is called a majorizing measure for (T, d). Let γ(T) := γ(T, d) := inf{γ_m(T): m ∈ P(T)}. Then γ(T, d) < ∞ if and only if there exists a majorizing measure on T.

If m is a majorizing measure on T, then for all x ∈ T, m(B(x, r)) > 0 for all r > 0 and does not approach 0 too fast as r ↓ 0. For example, if T is finite, then γ_m(T) < ∞ if and only if m({x}) > 0 for all x ∈ T. On [0, 1] with usual metric, Lebesgue measure λ is a majorizing measure. More generally, so is any law having an absolutely continuous component with a density h ≥ c for some c > 0.
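The Lebesgue example can be made quantitative. The sketch below (a numerical illustration, not from the text) computes γ_λ at a few points of [0, 1]; the supremum defining γ_λ([0, 1]) is attained at the endpoints, where the ball mass min(r, 1) is smallest, and the integral there equals ∫₀¹ (log(1/r))^{1/2} dr = √π/2.

```python
import math

def ball_mass(x, r):
    # Lebesgue measure of B(x, r) intersected with [0, 1]
    return min(x + r, 1.0) - max(x - r, 0.0)

def gamma_integrand(x, r):
    m = ball_mass(x, r)
    return math.sqrt(math.log(1.0 / m)) if m < 1.0 else 0.0

def gamma_at(x, steps=200000):
    # Midpoint rule over (0, 2]; the integrand vanishes once the ball covers [0, 1]
    h = 2.0 / steps
    return sum(gamma_integrand(x, (i + 0.5) * h) for i in range(steps)) * h

g_end = gamma_at(0.0)   # worst point: an endpoint
g_mid = gamma_at(0.5)

# At x = 0, m(B(0, r)) = min(r, 1), and int_0^1 sqrt(log(1/r)) dr = sqrt(pi)/2
assert abs(g_end - math.sqrt(math.pi) / 2.0) < 0.01
# The supremum over x is finite: Lebesgue measure is a majorizing measure
assert g_mid < g_end
```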
First, two theorems will be stated, which together characterize GB-sets as sets in Hilbert space having majorizing measures. Here T = C will be a subset of a Hilbert space with the usual Hilbert metric. Recall L(C)* := ess sup_{x∈C} L(x) and |L(C)|* as defined by Lemma 2.5.1.

2.7.1 Theorem (Fernique, 1975) If C is a subset of a Hilbert space H and γ(C) < ∞, then C is a GB-set. For some absolute constant K, EL(C)* ≤ Kγ(C).

For r > 0, consider the random variable X := μ(B(x, r)), where x has distribution m. Evaluating EX by the Tonelli–Fubini theorem, since y ∈ B(x, r) is equivalent to x ∈ B(y, r), gives EX = ∫ m(B(y, r)) dμ(y). Thus for any x ∈ T, m(B(x, r))/M ≤ EX ≤ M m(B(x, r)). Jensen's inequality (RAP, 10.2.6) gives g(EX/M) ≤ Eg(X/M). Let K := sup_{x,y∈T} d(x, y). Then for any x, using M ≥ e, one can bound

∫_0^K [log(1/m(B(x, r)))]^{1/2} dr.

Notes. Fernique (1975) proved that for Gaussian processes satisfying a homogeneity condition like that in Corollary 2.7.5, the metric entropy integral condition (or the corresponding condition on N(ε, T), equivalent by (1.2.1)) is necessary and sufficient for sample continuity. Theorems 2.7.1 and 2.7.2, with
Theorem 2.7.4, show that the metric entropy integral condition is equivalent to
sample continuity for the isonormal process on T for a subset T of a Hilbert space satisfying the conditions of Theorem 2.7.4. For an example of the situation in Corollary 2.7.5, let T be the unit circle x² + y² = 1 in R², let G be the group of rotations, let d be the usual metric on R², and let dm(θ) = dθ/(2π). Likewise, T could be a sphere of any dimension, with the orthogonal group G. To see how Theorem 2.7.4 applies beyond Corollary 2.7.5, suppose one wants to prove sample continuity of a Gaussian process on a locally compact but not compact metric space, such as a Euclidean space or a noncompact manifold. Then it suffices to prove sample continuity on each of a family of compact sets whose interiors form a base for the topology, such as balls or cubes in Euclidean spaces. Then one can often define a measure, such as Lebesgue measure in a Euclidean space, restrict it to a compact set C, and normalize it to have mass 1 to get a law m. Then m(B(x, r)) may not depend on x while B(x, r) is included in the interior of C, but become smaller as x approaches the boundary of C, yet the hypothesis of Theorem 2.7.4 still holds.
For any set A ⊂ H, let E(A) := EL(A)*. Then if A is countable, E(A) = E sup_{x∈A} L(x).

Proof of Theorem 2.7.2 The proof below for Talagrand's theorem 2.7.2 will be mainly as given by Fernique (1997). We have:

2.7.6 Lemma Let S ⊂ T ⊂ H, u > 0, and suppose d(x, y) ≥ 4u for all x ≠ y in S. Let M be the cardinality of S. Then

E(T) ≥ min_{s∈S} E[B(s, u) ∩ T] + (1/2)(log M)^{1/2} u.

Proof The statement holds if E(T) = +∞, so we can assume E(T) < ∞. This implies that T is totally bounded by Theorem 2.3.5, so S is finite. Let S = {s₁, …, s_M}. Let L₁, …, L_M be independent copies of L. Let Z₁, …, Z_M be i.i.d. N(0, 1) variables independent of L₁, …, L_M. We can assume T ⊂ ∪_{j=1}^M B(s_j, u). Let Y(t) := L_j(t) − L_j(s_j) + uZ_j for each t ∈ B(s_j, u) and j = 1, …, M. Let v, t ∈ T. If v, t ∈ B(s_j, u) for the same j, then d_Y(t, v)² := E((Y(v) − Y(t))²) = d(v, t)² = ‖v − t‖². Otherwise, we have d(v, t) ≥ 2u ≥ d_Y(v, t). Thus d_Y(v, t) ≤ d(v, t) in all cases, so by Theorem 2.3.7(b), E(T) ≥ E sup_{t∈T} Y(t). To find a lower bound
for E sup_{t∈T} Y(t), we have

sup_{t∈T} Y(t) = max_{1≤j≤M} [uZ_j + sup{L_j(t) − L_j(s_j): t ∈ B(s_j, u) ∩ T}].

By derivatives, Φ(−x) ≤ φ(x)/x for x > 0, and (a) follows from Theorem H.6. Now to prove (b), let ρ < 1/2. Then sup_{x∈W} E exp(ρY(x)²) < ∞. Thus for almost all ω, x ↦ exp(ρY(x)(ω)²) is integrable for μ by the Tonelli–Fubini theorem. Let

z := inf{α > 0: ∫(exp[α^{−2}Y(x)²] − 1) dμ(x) ≤ 1}.

(Thus z = ‖Y‖_g as defined in Appendix H for g(x) := exp(x²) − 1.) The dominated convergence theorem implies that as α → ∞,

∫(exp[α^{−2}Y(x)²] − 1) dμ(x) → 0

almost surely. Thus {α > 0: ∫(exp[α^{−2}Y(x)²] − 1) dμ(x) ≤ 1} is almost surely nonempty. So z has finite values a.s. It is easily seen that z(ω) = 0 if and only if Y(x)(ω) = 0 for μ-almost all x, and then (b) holds. It remains to prove (b) when z(ω) > 0. Since f is not 0 a.s., it follows by part (a) that for almost all ω,
(2.7.21)   ∫ |Y(x)(ω)| f(x) dμ(x) ≤ z[∫ f(x) dμ(x) + ∫ f(x)[log(1 + f(x)/∫ f dμ)]^{1/2} dμ(x)].

For u ≥ 0 let F(u) := u[log(1 + (u/β))]^{1/2}, where β := ∫ f dμ. Then F is increasing and
convex on [0, ∞) by derivatives. Jensen's inequality (RAP, 10.2.6) then gives

(log 2)^{1/2} ∫ f dμ = F(∫ f dμ) ≤ ∫ F(f) dμ = ∫ f(x)[log(1 + f(x)/β)]^{1/2} dμ(x).
This, with (2.7.21), gives part (b) except for bounding Ez. To do that, we have for any u > 0 and p ≥ 1, by definition of z and then the Chebyshev–Markov inequality, that

P(z > u) ≤ P{∫ exp(Y(x)²/u²) dμ(x) > 2} ≤ 2^{−p} E[(∫ exp(Y(x)²/u²) dμ(x))^p].

By Hölder's inequality, multiplying the latter integrand by 1 (or Jensen's inequality), we have

P(z > u) ≤ 2^{−p} E ∫ exp(pY(x)²/u²) dμ(x).

For 1 ≤ p < u²/2, the quantity on the right is finite and, by the Tonelli–Fubini theorem, bounded above by g(p, u) := 2^{−p}(1 − 2p/u²)^{−1/2}. For u ≥ u₀ := (2 + (log 2)^{−1})^{1/2}, if we set p := u²/2 − 1/(2 log 2), then p ≥ 1 and

g(p, u) = u(e log 2)^{1/2} exp(−u²(log 2)/2).

For any t ≥ 0 and random variable X with law P,
∫_t^∞ P(X > u) du = ∫_t^∞ ∫ 1_{x>u} dP(x) du = ∫_{x>t} ∫_t^x du dP(x) = ∫_{x>t} (x − t) dP(x) = E(X 1_{X>t}) − t P(X > t).

It follows that if X ≥ 0, then EX ≤ t + ∫_t^∞ P(X > u) du. Thus

Ez ≤ u₀ + (e log 2)^{1/2} ∫_{u₀}^∞ u·2^{−u²/2} du = u₀ + (e log 2)^{1/2}(log 2)^{−1} exp(−u₀²(log 2)/2) < 2.5,

as stated, which finishes the proof of the lemma. □
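The closing arithmetic of the lemma is easy to verify numerically. The sketch below (not from the text) checks that the closed form for g(p, u) at p = u²/2 − 1/(2 log 2) agrees with the definition, and that u₀ plus the tail integral indeed stays below 2.5.

```python
import math

log2 = math.log(2.0)
u0 = math.sqrt(2.0 + 1.0 / log2)

def g(p, u):
    # g(p, u) = 2^{-p} (1 - 2p/u^2)^{-1/2}, valid for 1 <= p < u^2/2
    return 2.0 ** (-p) / math.sqrt(1.0 - 2.0 * p / (u * u))

# Closed form at the chosen p: g(p, u) = u (e log 2)^{1/2} exp(-u^2 (log 2)/2)
for u in [u0, 2.0, 3.0, 5.0]:
    p = u * u / 2.0 - 1.0 / (2.0 * log2)
    assert p >= 1.0 - 1e-12
    closed = u * math.sqrt(math.e * log2) * math.exp(-u * u * log2 / 2.0)
    assert abs(g(p, u) - closed) < 1e-12 * max(1.0, closed)

# Ez <= u0 + (e log 2)^{1/2} (log 2)^{-1} exp(-u0^2 (log 2)/2), and this is < 2.5
tail = math.sqrt(math.e * log2) * math.exp(-u0 * u0 * log2 / 2.0) / log2
assert u0 + tail < 2.5
```

Numerically u₀ ≈ 1.855 and the tail is about 0.60, giving roughly 2.46.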
Now to continue with the proof of Theorem 2.7.18, recall that we have chosen a jointly measurable version of L(·)(·). It will be shown that L on T can be well approximated by its averages over small balls with respect to μ. Recall that B(x, r) := {y: d(x, y) < r}. For each k = 0, 1, … and t ∈ T let

(2.7.22)   μ_k(t) := μ(B(t, D/2^k)),   p_k(t, ·) := (1/μ_k(t)) 1_{B(t, D/2^k)}(·),

and M_k(t) := M_k(t, ω) := ∫ p_k(t, u) L(u) dμ(u), which exists and is finite a.s. by the Tonelli–Fubini theorem since for any x, t₀ ∈ T, E|L(x)| ≤ ‖t₀‖ + D. For each t ∈ T, we have B(t, D) = T, so M₀(t) = ∫ L dμ doesn't depend on t. We also have E|L(x) − M_k(x)| ≤ D/2^k for each x and k. It follows that for each x in T, a.s. as k → ∞,

(2.7.23)   M_k(x) → L(x).

The rest of the proof will bound the increments of the processes M_k and their rate of convergence to L.
Let Y(u, v) := (L(u) − L(v))/d(u, v) if u ≠ v and 0 if u = v. Then Y is jointly measurable on H × H, EY(u, v) = 0, and EY(u, v)² ≤ 1 for all u, v ∈ H. For each t ∈ T and k = 0, 1, 2, …, letting p_j(u) := p_j(t, u),

M_k(t) − M₀ = Σ_{j=1}^k [M_j(t) − M_{j−1}(t)],

and

|M_j(t) − M_{j−1}(t)| = |∫∫ Y(u, v) d(u, v) p_j(u) p_{j−1}(v) dμ(u) dμ(v)|
  ≤ 3D·2^{−j} ∫∫ |Y(u, v)| p_j(u) p_{j−1}(v) dμ(u) dμ(v).

Applying Lemma 2.7.20(b) to W := T × T with f(u, v) := p_j(u) p_{j−1}(v) for j = 1, 2, … and μ × μ gives a random variable z ≥ 0 with Ez ≤ 5/2, not depending on j or t, and a set Ω₁ ⊂ Ω with P(Ω₁) = 1 such that for all ω ∈ Ω₁, n ≥ 1, and t ∈ T,

|M_n(t, ω) − M_{n−1}(t, ω)| ≤ 3D·2^{−n} z(ω)[1 + [log(1 + (μ_n(t) μ_{n−1}(t))^{−1})]^{1/2}],

while log(1 + x²) ≤ 2 log(1 + x) for all x ≥ 0. Now, letting h(t, r) := [log(1 + μ(B(t, r))^{−1})]^{1/2}, we have

(2.7.24)   log(1 + μ_k(t)^{−1}) ≤ h(t, r)²   for r ≤ D/2^k.

Applying this for r ≥ D/2^{k+1} for each k ≥ 1, we get

Σ_{n=1}^∞ |M_n − M_{n−1}|(t, ω) ≤ 18·2^{1/2} z(ω) ∫_0^{D/2} h(t, r) dr.
For x ≥ 1, we have (log(1 + x))^{1/2} ≤ (log(2x))^{1/2} ≤ (log 2)^{1/2} + (log x)^{1/2}. So for ω ∈ Ω₁, the series M₀ + Σ_{n≥1} [M_n(t, ω) − M_{n−1}(t, ω)] converges absolutely and converges to a limit

M(t, ω) := M(t)(ω) := lim_{k→∞} M_k(t, ω).

Thus by (2.7.23), for each t, M(t, ω) = L(t)(ω) a.s., so M(·, ·) gives a version of L. Then we have

(2.7.25)   |M(t, ω) − M₀(ω)| ≤ 18·2^{1/2} z(ω) ∫_0^{D/2} h(t, r) dr.
Next, for any s and t in T, take the unique integer k = k(s, t) ≥ 0 such that

(2.7.26)   D/2^{k+1} ≤ d(s, t) < D/2^k.

Then

|M(s) − M(t)| ≤ |M(s) − M_k(s)| + |M_k(s) − M_k(t)| + |M(t) − M_k(t)|.

The proof of (2.7.25) yields

(2.7.27)   |(M − M_k)(t, ω)| ≤ 18·2^{1/2} z(ω) ∫_0^{D/2^{k+1}} h(t, r) dr,

and likewise for s in place of t. We have

(2.7.28)   M_k(s) − M_k(t) = ∫∫ Y(u, v) d(u, v) p_k(s, u) p_k(t, v) dμ(u) dμ(v).

By the definition of p_k (2.7.22) and (2.7.26), where the integrand in (2.7.28) is nonzero, we have d(u, v) ≤ 3D/2^k. Thus, for all ω, by Lemma 2.7.20(b) again, |M_k(s) − M_k(t)| can be bounded as follows. For r > 0 let μ(r) := μ_{s,t}(r) := min(μ(B(t, r)), μ(B(s, r))). Then

|M_k(s) − M_k(t)| ≤ 18·2^{1/2} z ∫_0^{D/2^{k+1}} [log(1 + 1/μ(r))]^{1/2} dr.
For any nonincreasing function f ≥ 0 and δ > 0 we have

∫_0^δ f(r) dr ≤ 2 ∫_0^{δ/2} f(r) dr.

Recalling (2.7.27) for s and t, we then have for ω ∈ Ω₁,

|M(s) − M(t)| ≤ 154 z ∫_0^{D/2^{k+2}} [log(1 + 1/μ(r))]^{1/2} dr.

Since d(s, t) ≥ D/2^{k+1} by (2.7.26), for r ≤ D/2^{k+2} the balls B(t, r) and B(s, r) are disjoint and so μ(r) ≤ 1/2. For x ≥ 2, we have log(1 + x) ≤ (log 3)(log 2)^{−1} log x by first derivatives. Thus
[log(1 + 1/μ(r))]^{1/2} ≤ [(log 3)/(log 2)]^{1/2} [log(1/μ(r))]^{1/2}
  ≤ [(log 3)/(log 2)]^{1/2} ([log(1/μ(B(t, r)))]^{1/2} + [log(1/μ(B(s, r)))]^{1/2}).

Then (2.7.19) follows with Y(ω) := 194 z(ω). Since Ez ≤ 2.5 from Lemma 2.7.20, we get EY ≤ 485 as stated. So Theorem 2.7.18 is proved. □

Now to prove Theorem 2.7.1: by Theorem 2.7.18 there is a version M of L such that inequality (2.7.19) holds, and therefore, letting t be fixed and s vary,
EL(C)* = EM(C)*, and (2.7.29) bounds EM(C)* accordingly.

For δ > 0, γ > 0, and any countable set F ⊂ T, define a random variable by

D(F, δ, γ) := sup{ρ(X_s, X_t): s, t ∈ F, d(s, t) < δ, e(s, t) < γ}.

This is measurable since F is countable and S is separable (RAP, Proposition 4.1.7). Let D(F, δ) := D(F, δ, 1 + diam_e T), and
UC := ∩_{n=1}^∞ ∪_{m=1}^∞ {D(A, 1/m) < 1/n},
so that UC is measurable. If P(UC) = 1, then the sample functions t ↦ X_t(ω) are almost surely uniformly continuous with respect to d on A. Since A is countable and Y ∘ h has the same laws as X, Y ∘ h also has sample functions almost surely uniformly continuous for d on A. Equivalently, Y has sample functions almost surely uniformly continuous for ρ on B. For any x ∈ K let Y′(x) := lim{Y(u): u → x, u ∈ B}. Almost surely, all these limits exist (by uniform continuity and since S is complete), and x ↦ Y′(x) is continuous. Since X and Y′ ∘ h both have continuous sample functions on T and have the same laws on A, they have the same laws on T, and so does Y ∘ h by choice of X. It follows that Y′ has the same laws as Y on K, so Y is version-continuous as desired.

Otherwise, P(UC) < 1. Then for some ε > 0, inf_{δ>0} P(D(A, δ) > 3ε) > 3ε. Then by inclusion (monotone convergence, with δ = 1/m),

(2.8.3)   P(∩_{δ>0} {D(A, δ) > 3ε}) ≥ 3ε.
On the other hand, continuity of t ↦ X_t and compactness of T imply that for some γ > 0 and any countable set C ⊂ T,

(2.8.4)   P{D(C, 1 + diam_d T, γ) > ε} < ε.

(Otherwise, take a countable union of countable sets for γ = 1/n, n = 1, 2, …, to get a contradiction.)
T is a finite union of e-open sets T_j with diam_e T_j < γ. For each i ≠ j, if there are s ∈ T_i and t ∈ T_j with h(s) = h(t), then we say (i, j) ∈ G, and let us choose and fix such s = s(i, j) and t = t(i, j). Let C be the union of A and the set of all s(i, j) and t(i, j). Since G is finite, we can assume that X_{s(i,j)}(ω) = X_{t(i,j)}(ω) for all ω and (i, j) ∈ G. Let

J := ⋂_{n=1}^∞ {D(C, 1/n) > 3ε} ∩ {D(C, 1 + diam_d T, γ) ≤ ε}.
Then (2.8.3) and (2.8.4) give P(J) > 3ε − ε > ε. Fix an ω ∈ J and choose s_n ∈ C and t_n ∈ C such that d(s_n, t_n) < 1/n and d_S(X_{s_n}(ω), X_{t_n}(ω)) > 3ε. By compactness, we can assume that the sequences s_n and t_n both converge for e and hence also for d. Let s_n → s and t_n → t. Then d(s, t) = 0, X_{s_n}(ω) → X_s(ω), and X_{t_n}(ω) → X_t(ω) as n → ∞. Let s ∈ T_i and t ∈ T_j. If i = j we have, since ω ∈ J, 3ε ≤ d_S(X_s(ω), X_t(ω)) ≤ ε, a contradiction. If i ≠ j, then (i, j) ∈ G. For n large enough, s_n ∈ T_i and t_n ∈ T_j, so 3ε …

… n, the process is not 1-1 into H. Recall that a sample function of a stochastic process X_t is a function t ↦ X_t(ω) for a fixed ω. The usual metric on a Hilbert space is the natural one for an isonormal process, but the GC-property holds for other metrics in the following sense:
2.8.6 Theorem  Let C be a subset of a Hilbert space H. Then the following are equivalent:

(I) C is a GC-set;
(II) L on C has a version with bounded, uniformly continuous sample functions;
(III) there exists a metric ρ on C such that (C, ρ) is totally bounded, and the sample functions of the isonormal process L on C can be chosen to be ρ-uniformly continuous a.s.

Proof  (I) implies (II) since by definition a GC-set is totally bounded, so a uniformly continuous function on it must be bounded. (II) implies (III) directly, where ρ is the usual metric. Suppose (III) holds. Take a version of L such that on a set of probability 1, the sample functions of L are ρ-uniformly continuous on C. Then L extends to a Gaussian process t ↦ X_t on the compact completion M of C for ρ. Here X_t is version-continuous, and so by Corollary 2.8.5, C is included in, and thus is, a GC-set.
**2.9 Volumes, mixed volumes, and ellipsoids

There will be no proofs in this section.
2.9.1 Volumes

Any closed set C in a finite-dimensional Euclidean space R^k has a well-defined volume, its Lebesgue measure V(C) := V_k(C) := λ_k(C), invariant under Euclidean transformations (rotations, reflections, translations, etc.). Let C be a compact set in a Hilbert space H of dimension ≥ k. Then V_k(C) will be defined by V_k(C) := sup V_k(π(C)), where the supremum is over all orthogonal projections π with k-dimensional range. Define the exponent of volume of C by

EV(C) := lim sup_{n→∞} −(log V_n(C))/(n log n).

The exponent of entropy of C is defined by

r(C) := lim sup_{ε↓0} (log log D(ε, C))/(log(1/ε)).

Thus C is a GC-set if r(C) < 2 by Theorem 2.6.1, and is not a GB-set if r(C) > 2 by the Sudakov-Chevet theorem (2.3.5). Milman and Pisier (1987) (see also Pisier (1989)) proved the following relation between r(C) and EV(C):

Theorem (Milman-Pisier)  Let C be a convex, symmetric, compact subset of a Hilbert space. Suppose that for some δ > 0 and M < ∞, V_k(C) ≤ (M k^{−1−δ})^k for all k = 1, 2, …. Then C is a GC-set and

EV(C) = 1/r(C) + 1/2.
2.9.2 Mixed volumes and mean widths

Let C be a convex set in a Euclidean space R^k and let B be the unit ball B := {x : |x| ≤ 1}. For r > 0 let C + rB := {x + ry : x ∈ C, y ∈ B}. Then it is known that V_k(C + rB) is a polynomial of degree k in r, namely

V_k(C + rB) = Σ_{j=0}^k β_j(C) r^{k−j},

where β_j(C) is called the jth mixed volume of C (see, for example, Bonnesen and Fenchel (1934, section 29) or Leichtweiss (1980, Satz 15.4, p. 162); other normalizations of the mixed volumes can be found in the literature). We have β_k(C) = V_k(C). Here, if C is viewed as a subset of a space of higher dimension, its mixed volumes may change, so more precisely one can write β_j(C) = β_{j,k}(C). If C is a straight line segment, then β_1(C) is proportional to its length. Let h_1(C) := β_{1,k}(C)/γ_k, where γ_k is such that for a line segment S, h_1(S) equals the length of the segment for each k. Now if C is a compact, convex set in a separable, possibly infinite-dimensional Hilbert space and π_m are finite-dimensional projections increasing up to the identity, then h_1(π_m(C)) increase up to a limit h_1(C) ≤ +∞.

Let S^{k−1} be the unit sphere in R^k, S^{k−1} := {x ∈ R^k : |x| = 1}. For any convex set C in R^k and x ∈ S^{k−1}, C has a width w_x(C) in the direction x defined by

w_x(C) := sup_{z∈C} (x, z) − inf_{y∈C} (x, y),

where (·, ·) is the usual inner product. Let ω_{k−1} be the unique rotationally invariant Borel probability measure on S^{k−1} (normalized (k−1)-dimensional surface area measure). Then the mean width of C is defined by w(C) := ∫ w_x(C) dω_{k−1}(x). It is known that for each k, w(C) is proportional to h_1(C) or to β_1(C) for convex sets C; see, for example, McMullen (1993, Theorem 5.5, p. 970), who refers for a proof to Hadwiger (1957). Sudakov (1971, Theorem 4, and 1973) discovered that:

Theorem (Sudakov)  A compact, convex set C in a Hilbert space is a GB-set if and only if h_1(C) < ∞. In fact,

h_1(C) = (2π)^{1/2} EL(C)*.

Indeed, it is easily seen that for each k, EL(C)* for a convex set C in R^k is proportional to the mean width. Since for a finite-dimensional set C, EL(C)* doesn't depend on the dimension of the space in which C is embedded, and neither does h_1(C), one can find the proportionality constant between h_1(C) and EL(C)* by evaluating both on a line segment.
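That evaluation on a segment can be sketched numerically. The following Monte Carlo check (an illustration only; the segment length 3.0, the sample size, and the seed are arbitrary choices, not from the text) takes C to be a segment of length ℓ from the origin, for which sup_{x∈C} L(x) = ℓ·max(G_1, 0) with G_1 a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 3.0            # length of the segment C = {t e_1 : 0 <= t <= ell}
n_sim = 200_000

# For the isonormal process, L(x) = <x, G> with G a standard Gaussian vector,
# so the supremum of L over the segment is ell * max(G_1, 0).
g1 = rng.standard_normal(n_sim)
sup_L = ell * np.maximum(g1, 0.0)

estimate = np.sqrt(2.0 * np.pi) * sup_L.mean()
print(estimate)      # close to ell = 3.0, the segment's length
```

Since E max(G_1, 0) = (2π)^{−1/2}, the factor (2π)^{1/2} cancels exactly and the estimate recovers the segment's length, matching Sudakov's identity h_1(C) = (2π)^{1/2}EL(C)* on segments.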
2.9.3 Ellipsoids

Let H be a separable, infinite-dimensional Hilbert space. Then a subset E of H will be called a (compact) ellipsoid if there exists an orthonormal basis {e_n}_{n≥1} of H such that for some sequence a_n ↓ 0, we have

E = E({a_n}) := { x = Σ_{n=1}^∞ x_n e_n ∈ H : Σ_{n=1}^∞ (x_n/a_n)² ≤ 1 }.

Then E is called a Schmidt ellipsoid if Σ_n a_n² < ∞. Suppose for x ∈ E we write L(x) = Σ_n x_n G_n, where G_n = L(e_n) are i.i.d. N(0, 1) variables, and then Σ_n x_n G_n = Σ_n (x_n/a_n) a_n G_n. We then have a.s. L(E({a_n}))* = (Σ_n a_n² G_n²)^{1/2}, where …

… ≥ c_j, then m(B(g_n, ε)) ≥
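The identity L(E({a_n}))* = (Σ_n a_n² G_n²)^{1/2} is the Cauchy-Schwarz inequality over the ellipsoid. A finite-dimensional sketch (the semiaxes a_n = 1/n, the dimension 50, and the seed are arbitrary illustration choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
a = 1.0 / np.arange(1, 51)          # example semiaxes a_n, decreasing to 0
g = rng.standard_normal(a.size)     # one sample of the coordinates G_n = L(e_n)

# Cauchy-Schwarz closed form: sup_{x in E} sum x_n g_n = (sum a_n^2 g_n^2)^{1/2}.
closed_form = float(np.sqrt(np.sum(a**2 * g**2)))

# The maximizer x_n = a_n^2 g_n / closed_form lies on the boundary of E
# and attains the supremum.
x_max = a**2 * g / closed_form
on_boundary = abs(float(np.sum((x_max / a)**2)) - 1.0) < 1e-9
attained = float(np.sum(x_max * g))

# Random boundary points of E never exceed the closed form.
u = rng.standard_normal((1000, a.size))
x_rand = u / np.linalg.norm(u / a, axis=1, keepdims=True)
never_exceeds = bool((x_rand @ g).max() <= closed_form + 1e-9)
print(on_boundary, never_exceeds)
```

Taking the sequence infinite, the supremum is finite a.s. exactly when Σ_n a_n² G_n² converges a.s., which is the Schmidt (Σ_n a_n² < ∞) case.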
Σ_{i>j} c/(i(Li)²) ≥ c/Lj. Break up the range of integration [0, ∞) as the union of the intervals [0, c_n), [c_n, c_{n−1}), …, [c_4, c_3), [c_3, ∞). We have

∫_0^{c_n} [log(1/m(B(g_n, ε)))]^{1/2} dε ≤ c_n (log[n(Ln)²/c])^{1/2} ≤ 2(Ln)^{−1/2}(Ln + 2LLn − log c)^{1/2},

which is bounded in n. Next, the integral from c_{j+1} to c_j is bounded by (c_j − c_{j+1})(L(L(j + 1)/c))^{1/2}. By the mean value theorem, we get c_j − c_{j+1} = (Lj)^{−1/2} − (L(j + 1))^{−1/2} ≤ 1/(2j(Lj)^{3/2}). The sum from j = 3 to n − 1 of (LL(j + 1) − log c)^{1/2}/(2j(Lj)^{3/2}) is bounded in n. The integral from c_3 to ∞ is the integral from 2 to ∞, which is 0, plus the integral from c_3 to 2, which is bounded above by 2(L(3/c))^{1/2} for all n (using log y ≤ y for all y), so m is a majorizing measure.
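The boundedness claim can be probed numerically. The sketch below assumes g_n = e_n/(Ln)^{1/2} with masses m({g_i}) = c/(i(Li)²) as above, but truncates both the point set and the normalization at N points and uses a crude Riemann sum for J_n := ∫_0^∞ [log(1/m(B(g_n, ε)))]^{1/2} dε, so it is an illustration only, not part of the proof:

```python
import numpy as np

def L(x):
    # Lx := max(1, log x), as in the text.
    return np.maximum(1.0, np.log(x))

N = 20_000                            # truncation level (an approximation)
i = np.arange(1, N + 1)
masses = 1.0 / (i * L(i)**2)
masses /= masses.sum()                # normalize c over the truncated set

eps_grid = np.linspace(1e-4, 1.5, 600)
step = eps_grid[1] - eps_grid[0]

def J(n):
    # |g_n - g_i| = (1/Ln + 1/Li)^{1/2} for i != n, since the e_i are orthonormal.
    dist = np.sqrt(1.0 / max(1.0, np.log(n)) + 1.0 / L(i))
    dist[n - 1] = 0.0
    total = 0.0
    for eps in eps_grid:
        mB = masses[dist <= eps].sum()
        if mB < 1.0:
            total += step * np.sqrt(np.log(1.0 / mB))
    return total

vals = [round(J(n), 2) for n in (2, 10, 100, 1000, 10_000)]
print(vals)    # the integrals stay of roughly the same size as n grows
```

The computed values stay bounded as n increases, as the estimates above predict.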
To show {g_n} is a GB-set, Proposition 2.2.1 gives P(|L(g_n)| > M) ≤ exp(−M²Ln/2), which is summable in n for M > 2^{1/2}, and the Borel-Cantelli lemma applies. The set Y of finite convex combinations with rational coefficients of the g_n is countable and dense in co({g_n}), and L is a.s. bounded on Y, so co({g_n}) is a GB-set by Lemma 2.5.1 and its proof.
2.10.4 Proposition  Let e_n be orthonormal in H and let a_n ↓ 0, C := {a_n e_n}_{n≥1}. Then the following are equivalent:

(I) C is a GB-set;
(II) a_n = O((Ln)^{−1/2});
(III) γ(C) < ∞.

Proof  (I) implies (II) by Theorem 2.3.5. (II) implies (I) and (III) by Proposition 2.10.3. To show that (III) implies (II), let m be a majorizing measure and m_n := m{a_n e_n}. Then for each n,

∞ > γ_m(C) ≥ ∫_0^∞ (log(1/m(B(a_n e_n, ε))))^{1/2} dε.

The integral is bounded below by the integral from 0 to a_n, which is at least a_n (log(1/m_n))^{1/2}. Let J := γ_m(C)² < ∞. Then m_n ≥ exp(−J/a_n²), Σ_n m_n ≤ 1, and exp(−J/a_n²) is nonincreasing in n, so n exp(−J/a_n²) ≤ 1 and a_n ≤ (J/Ln)^{1/2}, proving (II).
(a) G(X) = N(0, 1) in R, ‖X‖ = |X|;
(b) G(X) = N(0, C) in R², C = (…), and ‖(x_1, x_2)‖ = (x_1² + x_2²)^{1/2}.
3. Let H be a Hilbert space with orthonormal basis {e_j}_{j≥1}. Let G_n be independent with laws N(0, σ_n²).
(a) Under what conditions on {σ_n} does Σ_n G_n e_n converge almost surely in the norm of H? Hint: Apply the 3-series theorem (RAP, 9.7.3) to suitable real-valued random variables.
(b) If G = Σ_n G_n e_n in H as in (a), where the sum converges almost surely, find for what α > 0 we have E exp(α‖G‖²) < ∞. After doing this directly, compare it with the results of Section 2.2.

4. Let G_n be i.i.d. N(0, 1) variables and a_n ≥ 0 for each n. Under what conditions on a_n is Σ_n a_n |G_n| < ∞ a.s.? Hint: Use the 3-series theorem.
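A quick simulation can build intuition for problem 4 before attempting a proof (this is illustration only, not a solution; the two coefficient sequences a_n = n^{−2} and a_n = 1/n and the seed are arbitrary test cases): partial sums of Σ_n a_n|G_n| appear to stabilize for one sequence and to keep growing for the other.

```python
import numpy as np

rng = np.random.default_rng(2)
n = np.arange(1, 100_001)
absG = np.abs(rng.standard_normal(n.size))    # |G_n| for i.i.d. N(0,1) G_n

partials = {}
for label, a in (("n^-2", n ** -2.0), ("1/n", 1.0 / n)):
    partials[label] = np.cumsum(a * absG)
    # partial sums at n = 10^3, 10^4, 10^5
    print(label, [round(float(partials[label][k - 1]), 3)
                  for k in (1000, 10_000, 100_000)])
```

The contrast between the two printed rows is what the 3-series theorem, applied to the positive summands a_n|G_n|, is meant to explain.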
5. (a) Show that for any set A in a real vector space V and for any vector space W of linear forms on V, the polar A*¹ of A, defined by A*¹ := {w ∈ W : w(v) ≤ 1 for all v ∈ A}, is convex in W.
(b) Let C be a set and D its convex hull, that is, the smallest convex set including C. Show that D*¹ = C*¹.
(c) In R² let C be the unit square {0 ≤ x ≤ 1, 0 ≤ y ≤ 1}. Evaluate the polar C*¹.

6. Let H be a Hilbert space with orthonormal basis {e_n}_{n≥1}. For c_n > 0 let E({c_n}_{n≥1}) := {Σ_{n≥1} x_n e_n : Σ_{n≥1} x_n²/c_n² ≤ 1}, an infinite-dimensional ellipsoid. Show that E is a GB-set if and only if Σ_n c_n² < ∞.

7. With notation as in the previous problem, let C := {e_n/(log n)^{1/2} : n ≥ 2}. Show that C is not a GC-set (it is a GB-set as shown in the example before Theorem 2.3.7).
8. Let ψ be a real-valued characteristic function on R, where ψ(t) = ∫_{−∞}^∞ e^{ixt} dP(x) for some symmetric law P on R, P(A) = P(−A) for all Borel sets A. Show that there exists a Gaussian process X_t, t ∈ R, with mean 0 and covariance EX_s X_t = ψ(s − t) for all real s, t.
9. (a) For each t > 0 let ψ_t be a "triangle function" on R with ψ_t(0) = 1 and ψ_t(s) = 0 whenever |s| ≥ t, while ψ_t is linear on each interval [−t, 0] and [0, t]. Show that ψ_t satisfies the conditions of the previous problem. Hint: Find its (inverse) Fourier transform and show that it is a probability density.
(b) Let ψ be a continuous real-valued function on R which is even, ψ(−x) = ψ(x), ψ(0) = 1, and on [0, ∞), ψ is nonincreasing, nonnegative, and convex. Show that ψ is a characteristic function. Hint: Use a mixture of triangle functions. First consider the case that ψ is piecewise linear and is 0 outside some finite interval. Take increasing limits of such piecewise linear functions.
10. For c > 0 let ψ(x) = 1 − (log(1/|x|))^{−c} for x in some neighborhood of 0 (piecewise linear elsewhere).
(a) Show that there exists such a ψ satisfying the conditions of problem 8, assuming problem 9.
(b) What can be said about sample continuity of the Gaussian process X_t for different values of c?

11. Let μ be a majorizing measure on a metric space (T, d).
(a) If T is a subset of a Hilbert space H with the usual metric, then T is a GB-set by Theorem 2.7.1 and is totally bounded by Theorem 2.3.5. Prove directly, in general, that (T, d) is totally bounded. Hint: If for some r > 0 there are infinitely many disjoint balls B(x_i, r),
then inf_i μ(B(x_i, r)) = 0, and 1/μ(B(x_i, s)) is nonincreasing in s for each i.
(b) Let ρ be another probability measure on (T, d). If 0 < λ < 1, show that λμ + (1 − λ)ρ is also a majorizing measure on (T, d).

12. Let X_t(ω) = Σ_{n=−∞}^∞ G_n(ω) e^{int}, where G_n are independent random variables with laws N(0, σ_n²) and Σ_n σ_n² < ∞. Show that the process t ↦ X_t is sample-continuous if and only if {X_t(·) : 0 ≤ t ≤ 2π} is a GC-set in L²(Ω).

13. Let z be as in Lemma 2.7.20. Show that E(z²) ≤ 100/9.

14. Let e_n be orthonormal in H and C := {a_n (log n)^{−1/2} e_n}_{n≥2}, where a_n → 0 as n → ∞.
(a) Show that every such set is a GC-set.
(b) By taking a_n → 0 slowly enough, show that for any τ > 0 there exist GC-sets C with D(ε, C) ≥ exp[1/(ε² |log ε|^τ)] for ε > 0 small enough.

15. (A further extension of problem 10.) For c > 0 let ψ(x) := 1 − (log(1/|x|))^{−1}(log log(1/|x|))^c for x in some neighborhood of 0 (piecewise linear elsewhere).
(a) Show that there exists such a ψ satisfying the conditions of problem 8, assuming problem 9.
(b) What can be said about sample continuity of the Gaussian process X_t for different values of c?
(c) Show that for any τ < 2, there exist sets C that are not GC-sets with D(ε, C) ≤ exp(1/(ε² |log ε|^τ)) for ε > 0 small enough. Compare this with problem 14, part (b), to see that for 0 < τ < 2, one cannot tell from D(ε, C) whether C is a GC-set or not.

16. Let v_k be the Lebesgue volume of the unit ball in R^k. Then it is known that v_k = π^{k/2}/Γ(1 + (k/2)) for k = 1, 2, …. For any c_i > 0, i = 1, …, k, the ellipsoid E({c_i}_{i=1}^k) := {x : Σ_{i=1}^k x_i²/c_i² ≤ 1} ⊂ R^k has volume v_k c_1 c_2 ⋯ c_k. For ε > 0 let m := D(ε, E).
(a) Show that m ≥ c_1 c_2 ⋯ c_k / ε^k.
(b) If c_j ≥ ε for j = 1, …, k, show that m (ε/2)^k ≤ 2^k c_1 c_2 ⋯ c_k.
(c) If c_j = j^{−a} for j = 1, 2, …, give upper and lower bounds for D(ε, E) as ε ↓ 0 and k → ∞. Hint: Recall Stirling's formula, k!/[(k/e)^k (2πk)^{1/2}] → 1 as k → ∞ (Theorem 1.3.13).
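The ingredients of problem 16 can be tabulated for concreteness (an illustration, not a solution; it assumes c_j = 1/j and ε = 0.01, chosen so that c_j ≥ ε for all j ≤ k ≤ 50): v_k from the Gamma function, the Stirling ratio from the hint tending to 1, and the consistency of the lower bound from (a) with the upper bound implied by (b).

```python
import math

eps = 0.01
rows = []
for k in (1, 2, 5, 10, 50):
    v_k = math.pi ** (k / 2) / math.gamma(1 + k / 2)
    stirling = math.factorial(k) / ((k / math.e) ** k * math.sqrt(2 * math.pi * k))
    prod_c = 1.0
    for j in range(1, k + 1):
        prod_c *= 1.0 / j                   # c_j = 1/j, so c_j >= eps for j <= 50
    lower = prod_c / eps ** k               # (a): D(eps, E) >= c_1...c_k / eps^k
    upper = 4.0 ** k * prod_c / eps ** k    # (b): m (eps/2)^k <= 2^k c_1...c_k
    rows.append((k, v_k, stirling, lower <= upper))
    print(rows[-1])
```

Note how fast v_k and the product c_1 ⋯ c_k decay in k; Stirling's formula is what turns those products into clean asymptotics in part (c).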
17. If g is a Young-Orlicz modulus, x² = o(log g(x)) as x → +∞, and Y is a N(0, 1) random variable, show that ‖Y‖_g = +∞.
18. If g is a Young-Orlicz modulus, log g(x) = O(x²) as x → +∞, and Y is a N(0, 1) random variable, show that ‖Y‖_g < ∞.
19. Let f(x) = rx for 0 …

… h ≥ g with ∫ h dμ = ∞, in which case ∫* g dμ will also be defined as ∞. There always exists at least one measurable h ≥ g, namely h ≡ +∞.

We will be dealing often with compositions of functions. If f is a function whose domain includes the range of g, then either f(g) or f ∘ g will denote the function such that (f ∘ g)(x) ≡ f(g(x)). A function, which may not be measurable, from a probability space into a metric space will be called a random element. Now here is a definition of convergence in law, where only the limit variable necessarily has a law:
Definition (J. Hoffmann-Jørgensen).  Let (S, d) be any metric space. Let (Ω_n, A_n, Q_n) be probability spaces for n = 0, 1, 2, …, and Y_n, n ≥ 0, functions from Ω_n into S. Suppose that Y_0 takes values in some separable subset of S and is measurable for the Borel sets on its range. Then Y_n will be said to converge to Y_0 in law as n → ∞, in symbols Y_n ⇒ Y_0, if for every bounded continuous real-valued function g on S,

∫* g(Y_n) dQ_n → ∫ g(Y_0) dQ_0 as n → ∞.

For g bounded, ∫* g(Y_n) dQ_n is always defined and finite. Then, here is a general definition of when the central limit theorem for empirical measures holds with respect to uniform convergence over a class F of functions. The metric space S will be the space ℓ^∞(F) of all bounded real-valued functions on F, with the metric given by the supremum norm ‖H‖_F := sup{|H(f)| : f ∈ F}.

Definition.  Let (Ω, A, P) be a probability space and F ⊂ L²(P). Then F will be called a Donsker class for P, or P-Donsker class, or be said to satisfy the central limit theorem (for empirical measures) for P, if F is pregaussian for P and ν_n ⇒ G_P in ℓ^∞(F).

Later on, a number of rather large classes F of functions will be shown to be Donsker classes for various laws P. The next few sections develop some of the needed theory.
3.2 Measurable cover functions

In the last section, convergence in law was defined in terms of upper integrals. The notion of upper integral is related to that of measurable cover. Let (Ω, A, P) be a probability space. Then for a possibly nonmeasurable set A ⊂ Ω, a set B is called a measurable cover of A if A ⊂ B, B ∈ A, and P(B) = inf{P(C) : A ⊂ C, C measurable}. For A ⊂ B_n ∈ A with P(B_n) decreasing to this infimum, a measurable cover of A is ⋂_n B_n. If B and C are measurable covers of the same set A, then clearly so is B ∩ C. It follows that B = C up to a set of measure 0, in other words P(BΔC) = 0, where Δ denotes the symmetric difference, or equivalently P(1_B = 1_C) = 1. For any set A ⊂ Ω let P*(A) := inf{P(B) : B measurable, A ⊂ B}. Then for any measurable cover B of A, clearly P*(A) = P(B).

Let L⁰ := L⁰(Ω, A, P, R̄) denote the set of all measurable functions from Ω into R̄ := [−∞, ∞]. Then L⁰ is a lattice: for any f, g ∈ L⁰, f ∨ g := max(f, g) and f ∧ g := min(f, g) are in L⁰. But this L⁰ is not a vector space, since we could have, for example, f ≡ +∞ and g ≡ −∞, so f + g would be undefined. The map y ↦ tan⁻¹ y is one-to-one from R̄ onto [−π/2, π/2]. Then a metric on R̄ is defined from the usual metric on [−π/2, π/2] by d(x, y) := |tan⁻¹ x − tan⁻¹ y|. On L⁰ we have the Ky Fan metric (RAP, Theorem 9.2.2) defined by

d(f, g) := inf{ε > 0 : P(d(f(x), g(x)) > ε) ≤ ε} …

… ≥ f everywhere), then f* := ess.inf J can be chosen so that f* ≥ f everywhere. Also, ∫ f* dP and E*f := ∫* f dP are both defined and equal if either of them is well-defined (possibly infinite), for example if f* is bounded below.

Proof  Let J₁ be the class of all functions min(f_1, …, f_m) for f_1, …, f_m ∈ J and m = 1, 2, …. Then J₁ is a lower semilattice. For f ∈ L⁰, f = ess.inf J₁ if and only if f = ess.inf J. So we can assume J is a lower semilattice. For
each j ∈ J, tan⁻¹ j is a measurable function with values in [−π/2, π/2]. Take j_n ∈ J such that ∫ tan⁻¹ j_n dP decreases to inf_{j∈J} ∫ tan⁻¹ j dP. Then min(j_1, …, j_m) is in J₁ and decreases to g as m → ∞ for some g ∈ L⁰(Ω, A, P, R̄). For any h ∈ J, min(h, j_1, …, j_m) ∈ J decreases to min(h, g), so ∫ tan⁻¹ min(h, g) dP = ∫ tan⁻¹ g dP and g ≤ h a.s., so g satisfies the definition of ess.inf J. If J = {h ∈ L⁰ : h ≥ f} then J is a lower semilattice and g ≥ f everywhere. By the definitions, ∫* f dP ≤ ∫ f* dP if either side is well-defined, and the inequality is an equation by the definition of essential infimum.
Here f* is called the measurable cover function of f. Recall that in Chapter 2, L(A)* was the essential supremum of L(x) for x ∈ A, and so the essential infimum of random variables Y such that for each x ∈ A, Y ≥ L(x) a.s. (a different, although related, notion). If f is real-valued and bounded above by some finite-valued measurable function, then f* is a measurable real-valued function. But whenever there exist nonmeasurable sets A_n ↓ ∅ with P*(A_n) ≡ 1, as for Lebesgue measure (e.g., RAP, Section 3.4, problem 2), let f := n on A_n \ A_{n+1}. Then f is real-valued but f* ≡ +∞ a.s. The next two lemmas on measurable cover functions are basic.
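In the benign finite case, by contrast, f* is easy to compute: measurable functions are constant on the atoms of the σ-algebra, so the measurable cover function is the atom-wise maximum. A minimal sketch (the atoms and the values of f and g are made-up illustration data, not from the text):

```python
# Toy space Omega = {0,...,5}; the sigma-algebra is generated by three atoms.
atoms = [{0, 1}, {2, 3}, {4, 5}]
f = {0: 1.0, 1: 3.0, 2: -2.0, 3: 0.5, 4: 2.0, 5: 2.0}
g = {0: 0.0, 1: -1.0, 2: 1.0, 3: 1.0, 4: -3.0, 5: 4.0}

def cover(func):
    # The smallest measurable function >= func is constant on each atom,
    # equal there to the maximum of func over the atom.
    out = {}
    for A in atoms:
        m = max(func[x] for x in A)
        out.update({x: m for x in A})
    return out

f_star, g_star = cover(f), cover(g)
fg_star = cover({x: f[x] + g[x] for x in f})
# Lemma 3.2.2(a): (f + g)* <= f* + g*, here verified at every point.
print(all(fg_star[x] <= f_star[x] + g_star[x] for x in f))
```

The same atom-wise computation makes the coming Lemmas 3.2.2 and 3.2.3 easy to check by hand on finite examples.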
3.2.2 Lemma  For any two functions f, g : Ω → (−∞, ∞], we have

(a) (f + g)* ≤ f* + g* a.s., and
(b) (f − g)* ≥ f* − g* whenever both sides are defined, a.s.

Proof  (a) We have −∞ < f* ≤ +∞ and −∞ < g* ≤ +∞ everywhere, so f* + g* is an everywhere defined, measurable function ≥ f + g, and (a) follows. For part (b), on the measurable set where g* = +∞, where by assumption f* is finite a.s., the right side is −∞ and the inequality holds. Where g* is finite, g is also finite and f = (f − g) + g. Then f* ≤ (f − g)* + g* by (a), so f* − g* ≤ (f − g)*, since this holds where (f − g)* < ∞ and where (f − g)* = ∞.

3.2.3 Lemma  Let S be a vector space with a seminorm ‖·‖. Then for any two functions X, Y from Ω into S,

‖X + Y‖* ≤ (‖X‖ + ‖Y‖)* ≤ ‖X‖* + ‖Y‖* a.s.,

and ‖cX‖* = |c| ‖X‖* a.s. for any real c.

Proof  The first inequality is clear, the second follows from Lemma 3.2.2, and the equation is clear (for c = 0 and c ≠ 0).
Next, in some cases of independence, the upper-star operation can be distributed over products or sums.

3.2.4 Lemma  Let (Ω_j, A_j, P_j), j = 1, …, n, be any n probability spaces. Let f_j be functions from Ω_j into R̄. Suppose either

(a) f_j ≥ 0, j = 1, …, n, or
(b) f_1 ≡ 1 and n = 2.

Then on the Cartesian product Π_{j=1}^n (Ω_j, A_j, P_j) with x := (x_1, …, x_n), if f(x) := Π_{j=1}^n f_j(x_j), we have f*(x) = Π_{j=1}^n f_j*(x_j) a.s., where 0·∞ is set equal to 0.

(c) Or, if f_j(x_j) > −∞ for all x_j, j = 1, …, n, and g(x_1, …, x_n) := f_1(x_1) + ⋯ + f_n(x_n), then g*(x_1, …, x_n) = f_1*(x_1) + ⋯ + f_n*(x_n) a.s.

Proof  First, for (c), by induction we can assume n = 2. Clearly g*(x, y) ≤ f_1*(x) + f_2*(y) a.s., and if equality does not hold a.s., there is a rational t such that on a set C of positive probability in the product space, g*(x, y) < t < f_1*(x) + f_2*(y), and there exist rationals q, r with q + r ≥ t such that C can be chosen with f_1*(x) > q and f_2*(y) > r for (x, y) ∈ C. Let C_x := {y : (x, y) ∈ C}. By the Tonelli-Fubini theorem there is a set D ⊂ Ω_1 with P_1(D) > 0 such that P_2(C_x) > 0 for all x ∈ D. If f_1 ≤ q on D then f_1* ≤ q a.s. on D, but for any x ∈ D and y ∈ C_x ≠ ∅ we have f_1*(x) > q, a contradiction. So choose and fix an x ∈ D with f_1(x) > q. Then for any y ∈ C_x, q + f_2(y) < f_1(x) + f_2(y) ≤ g*(x, y), so f_2(y) < g*(x, y) − q, and f_2*(y) ≤ g*(x, y) − q for almost all y ∈ C_x. For any such y, q + f_2*(y) < q + r and f_2*(y) < r, a contradiction. So (c) is proved.

Now for products, in case (a) or (b), clearly f*(x) ≤ Π_{j=1}^n f_j*(x_j) a.s., with 1* = 1. For the converse inequality we can assume n = 2, by induction in case (a). Suppose f*(x) < f_1*(x_1) f_2*(x_2) with positive probability. Then for some rational r, f*(x) < r < f_1*(x_1) f_2*(x_2) with positive probability. If f_1 ≡ 1, this gives f(x) ≤ f*(x) < r < f_2*(x_2) on a set of positive probability. Then by the Tonelli-Fubini theorem, for some x_1, f_2(x_2) ≤ f*(x_1, x_2) < r < f_2*(x_2) on a set of x_2 with P_2 > 0, contradicting the choice of f_2*. So assume f_1 ≥ 0 and f_2 ≥ 0. Then as in case (c), there are rationals a, b with ab > r, a > 0, and b > 0, such that on a set C in the product with positive probability, f_1*(x_1) > a, f_2*(x_2) > b, and f*(x_1, x_2) < r. Again by the Tonelli-Fubini theorem, there is a set D ⊂ Ω_1 with P_1(D) > 0 and P_2(C_u) > 0 for u ∈ D, and there is a point u of D where f_1(u) > a. Then for any v ∈ C_u, f_2(v) ≤ f*(u, v)/a, so f_2*(v) ≤ f*(u, v)/a for almost all
v ∈ C_u. For such a v we have a f_2*(v) ≤ f*(u, v) < r < ab, so f_2*(v) < b, a contradiction, finishing the proof.
For the next fact, here is some notation: given two functions f, g and a σ-algebra S on the range of f, let (f, g)(x) := (f(x), g(x)) and f⁻¹(S) := {f⁻¹(A) : A ∈ S}.

3.2.5 Lemma  Let (Ω, A, P) = Π_{i=1}^3 (Ω_i, S_i, P_i) with coordinate projections Π_i : Π_i(x_1, x_2, x_3) := x_i, i = 1, 2, 3. Let S_1 ⊗ S_2 denote the product σ-algebra on Ω_1 × Ω_2. Then for any bounded real function f on Ω_1 × Ω_3 and g(x_1, x_2, x_3) := f(x_1, x_3), conditional expectations of g* satisfy

E(g* | (Π_1, Π_2)⁻¹(S_1 ⊗ S_2)) = E(g* | Π_1⁻¹(S_1)) a.s. for P.

Proof  By Lemma 3.2.4(b), for Ω_2 × (Ω_1 × Ω_3), g* equals P-almost surely a measurable function not depending on x_2, thus independent of Π_2⁻¹(S_2). Let S be the collection of all sets A ∈ (Π_1, Π_2)⁻¹(S_1 ⊗ S_2) such that g* and h := E(g* | Π_1⁻¹(S_1)) have the same integral over A. Then S contains all finite disjoint unions of sets (Π_1, Π_2)⁻¹(B_1 × B_2) = Π_1⁻¹(B_1) ∩ Π_2⁻¹(B_2), B_i ∈ S_i, i = 1, 2, since both g* and h are independent of Π_2⁻¹(S_2). Now S is easily seen to be a monotone class, so it equals all of (Π_1, Π_2)⁻¹(S_1 ⊗ S_2) (RAP, Theorem 4.4.2).
3.2.6 Lemma  Let X be a real-valued function on a probability space (Ω, A, P). Then for any t ∈ R,

(a) P*(X > t) = P(X* > t);
(b) for any ε > 0, P*(X ≥ t) ≤ P(X* ≥ t) ≤ P*(X ≥ t − ε).

Proof  Clearly {X > t} ⊂ {X* > t} and {X ≥ t} ⊂ {X* ≥ t}, so we have …

… ≥ m_k. Then d(f_n, f_0)* < 1/k on C_k, so d(f_n, f_0)* → 0 a.s., proving (A). Clearly, (A) and (D) are equivalent.

Example.  In [0, 1] with Lebesgue measure P, let A_1 ⊃ A_2 ⊃ … be sets with P*(A_n) = 1 and ⋂_{n=1}^∞ A_n = ∅ (e.g., RAP, Section 3.4, problem 2; Cohn, 1980, p. 35). Then 1_{A_n} → 0 everywhere and, in that sense, almost surely, but not almost uniformly. Note also that 1_{A_n} doesn't converge to 0 in law as defined in Section 3.1. To avoid such pathology, almost uniform convergence is helpful.
3.3.3 Proposition  Let (S, d) and (Y, e) be two metric spaces and (Ω, A, P) a probability space. Let f_n be functions from Ω into S for n = 1, 2, …, such that f_n → f_0 in outer probability as n → ∞. Assume that f_0 has separable range and is measurable (for the Borel σ-algebra on S). Let g be a continuous function from S into Y. Then g(f_n) → g(f_0) in outer probability.

Note.  If ∫* G(f_0) dP is defined for all bounded continuous real G (as it must be if f_n ⇒ f_0, by definition), then the image measure P ∘ f_0⁻¹ is defined on all Borel subsets of S (RAP, Theorem 7.1.1). Such a law does have a separable support except perhaps in some set-theoretically pathological cases (Appendix C).

Proof  Given ε > 0, for k = 1, 2, …, let B_k := {x ∈ S : d(x, y) ≤ 1/k implies e(g(x), g(y)) ≤ ε, y ∈ S}. Then each B_k is closed and B_k ↑ S as k → ∞. Fix k large enough so that P(f_0⁻¹(B_k)) > 1 − ε. Then

{e(g(f_n), g(f_0)) > ε} ∩ f_0⁻¹(B_k) ⊂ {d(f_n, f_0) > 1/k}.

Thus

P*{e(g(f_n), g(f_0)) > ε} ≤ ε + P*{d(f_n, f_0) > 1/k} < 2ε

for n large enough.
3.3.4 Lemma  Let (Ω, A, P) be a probability space and {g_n}_{n≥0} a uniformly bounded sequence of real-valued functions on Ω such that g_0 is measurable. If g_n → g_0 in outer probability then lim sup_{n→∞} ∫* g_n dP ≤ ∫ g_0 dP.

Proof  Let |g_n(x)| ≤ M < ∞ for all n and all x ∈ Ω. We can assume M = 1. Given ε > 0, for n large enough P*(|g_n − g_0| > ε) < ε. Let A_n be a measurable set on which |g_n − g_0| ≤ ε with P(Ω \ A_n) < ε. Then

∫* g_n dP ≤ ε + ∫*_{A_n} g_n dP ≤ 2ε + ∫_{A_n} g_0 dP ≤ 3ε + ∫ g_0 dP.

Letting ε ↓ 0 completes the proof.

On any metric space, the σ-algebra will be the Borel σ-algebra unless something is said to the contrary.

3.3.5 Corollary  If f_n are functions from a probability space into a metric space, f_n → f_0 in outer probability, and f_0 is measurable with separable range, then f_n ⇒ f_0.

Proof  Apply Proposition 3.3.3 to g = G for any bounded continuous G, and Lemma 3.3.4 to g_n := G ∘ f_n and also to g_n := −G ∘ f_n.
3.4 Perfect functions

For a function g defined on a set A, let g[A] := {g(x) : x ∈ A}. It will be useful that under some conditions on a measurable function g and general real-valued f, (f ∘ g)* = f* ∘ g. Here are some equivalent conditions:

3.4.1 Theorem  Let (X, A, P) be a probability space, (Y, B) any measurable space, and g a measurable function from X to Y. Let Q be the restriction of P ∘ g⁻¹ to B. For any real-valued function f on Y, define f* for Q. Then the following are equivalent:

(a) For any A ∈ A there is a B ∈ B with B ⊂ g[A] and Q(B) ≥ P(A);
(b) for any A ∈ A with P(A) > 0 there is a B ∈ B with B ⊂ g[A] and Q(B) > 0;
(c) for every real function f on Y, (f ∘ g)* = f* ∘ g a.s.;
(d) for any D ⊂ Y, (1_D ∘ g)* = (1_D)* ∘ g a.s.

Proof  Clearly (a) implies (b).

To show (b) implies (c), note that always (f ∘ g)* ≤ f* ∘ g. Suppose (f ∘ g)* < f* ∘ g on a set of positive probability. Then for some rational r, (f ∘ g)* < r < f* ∘ g on a set A ∈ A with P(A) > 0. Let g[A] ⊃ B ∈ B with Q(B) > 0. Then f ∘ g < r on A implies f < r on B, so f* ≤ r on B a.s., contradicting f* ∘ g > r on A.

Clearly (c) implies (d). Now to show (d) implies (a), given A ∈ A, let D := Y \ g[A]. Then we can take (1_D)* = 1_C for some C ∈ B: let C be the set where (1_D)* ≥ 1. Then D ⊂ C and 1_D ∘ g = (1_D ∘ g)* = 0 a.s. on A. Let B := Y \ C. Then B ⊂ g[A], and

Q(B) = 1 − Q(C) = 1 − ∫ (1_D)* d(P ∘ g⁻¹),

which by the image measure theorem (RAP, Theorem 4.1.11) equals

1 − ∫ (1_D)* ∘ g dP = 1 − ∫ (1_D ∘ g)* dP ≥ P(A).

Note.  In (a) or (b), if the direct image g[A] ∈ B, we could just set B := g[A]. But, for any uncountable complete separable metric space Y, there exists a complete separable metric space S (for example, a countable product N^∞ of copies of N) and a continuous function f from S into Y such that f[B] is not a Borel set in Y (RAP, Theorem 13.2.1, Proposition 13.2.5). If f is only required to be Borel measurable, then S can also be any uncountable complete metric space (RAP, Theorem 13.1.1).
A function g satisfying any of the four conditions in Theorem 3.4.1 will be called perfect or P-perfect. Coordinate projections on a product space are, as one would hope, perfect:

3.4.2 Proposition  Suppose A = X × Y, P is a product probability ν × m, and g is the natural projection of A onto Y. Then g is P-perfect.

Proof  Here P ∘ g⁻¹ = m. For any B ⊂ A let B_y := {x : (x, y) ∈ B}, y ∈ Y. If B is measurable, then by the Tonelli-Fubini theorem, for C := {y : ν(B_y) > 0}, C is measurable, C ⊂ g[B], and P(B) ≤ m(C), so condition (a) of Theorem 3.4.1 holds.
3.4.3 Theorem  Let (Ω, A, P) be a probability space and (S, d) a metric space. Suppose that for n = 0, 1, …, (Y_n, B_n) is a measurable space, g_n a perfect measurable function from Ω into Y_n, and f_n a function from Y_n into S, where f_0 has separable range and is measurable. Let Q_n := P ∘ g_n⁻¹ on B_n and suppose f_n ∘ g_n → f_0 ∘ g_0 in outer probability as n → ∞. Then f_n ⇒ f_0 as n → ∞ for f_n on (Y_n, B_n, Q_n).

Before this is proved, an example will be given. Recall that for any measure space (X, S, μ) and C ⊂ X, the inner measure is defined by μ_*(C) := sup{μ(A) : A ∈ S, A ⊂ C}.

3.4.4 Proposition  Theorem 3.4.3 can fail without the hypothesis that the g_n be perfect.
Proof  Let C ⊂ I := [0, 1] satisfy 0 = λ_*(C) < λ*(C) = 1 for Lebesgue measure λ (RAP, Theorem 3.4.4). Let P := λ*, giving a probability measure on the Borel sets of C (RAP, Theorem 3.3.6). Let Ω = C, f_0 ≡ 0, Y_n = I, f_n := 1_{I\C}, and let g_n be the identity from C into Y_n for all n. Then f_n ∘ g_n ≡ 0 for all n, so f_n ∘ g_n → f_0 ∘ g_0 in outer probability (and in any other sense). Let B_n be the Borel σ-algebra on Y_n = I for each n. Let G be the identity from I into R. Then ∫* G(f_n) dQ_n = ∫* f_n dQ_n = 1 for n ≥ 1, while ∫ G(f_0) dQ_0 = 0, so f_n does not converge to f_0 in law.

After Theorem 3.4.3 is proved, it will follow that the g_n in the last proof are not perfect, as can also be seen directly from condition (c) or (d) in Theorem 3.4.1.
Proof of Theorem 3.4.3  By Corollary 3.3.5, f_n ∘ g_n ⇒ f_0 ∘ g_0. Let H be any bounded, continuous, real-valued function on S. Then by RAP, Theorem 4.1.11,

∫* H(f_n(g_n)) dP → ∫ H(f_0(g_0)) dP = ∫ H(f_0) dQ_0.

Also,

∫* H(f_n(g_n)) dP = ∫ H(f_n(g_n))* dP      by Theorem 3.2.1
   = ∫ (H ∘ f_n)*(g_n) dP      by Theorem 3.4.1
   = ∫ (H ∘ f_n)* dQ_n      (RAP, Theorem 4.1.11)
   = ∫* H(f_n) dQ_n      by Theorem 3.2.1,

and the theorem follows.
In Proposition 3.4.2, X × Y could be an arbitrary product probability space, but projection is a rather special function. The following fact will show that all measurable functions on reasonable domain spaces are perfect.

Recall that a law P is called tight if sup{P(K) : K compact} = 1. A set P of laws is called uniformly tight if for every ε > 0 there is a compact K such that P(K) > 1 − ε for all P ∈ P. Also, a metric space (S, d) is called universally measurable (u.m.) if for every law P on the completion of S, S is measurable for the completion of P (RAP, Section 11.5). So any complete metric space, or any Borel set in its completion, is u.m.
3.4.5 Theorem  Let (S, d) be a u.m. separable metric space. Let P be a probability measure on the Borel σ-algebra of S. Then any Borel measurable function g from S into a separable metric space Y is perfect for P.

Note.  In view of Appendix C, the hypothesis that S be separable is not very restrictive.

Proof  Let A be any Borel set in S with P(A) > 0. Let 0 < ε < P(A). By the extended Lusin theorem (Theorem D.1, Appendix D) there is a closed set F with P(F) > 1 − ε/2 such that g restricted to F is continuous. Since P is tight (RAP, Theorems 11.5.1 and 7.1.3), there is a compact set K ⊂ A with P(K) > ε. Then C := F ∩ K is compact, C ⊂ A, P(C) > 0, and g is continuous on C, so g[C] is compact, g[C] ⊂ g[A], and (P ∘ g⁻¹)(g[C]) ≥ P(C) > 0, so the conclusion follows from Theorem 3.4.1.
106
Uniform Central Limit Theorems: Donsker Classes
Let (Ω, A, P) be a probability space and g a measurable function from Ω into Y where (Y, B) is a measurable space. Let Q := P ∘ g⁻¹ on B. Call g quasiperfect for P, or P-quasiperfect, if for every C ⊂ Y with g⁻¹(C) ∈ A, C is measurable for the completion of Q. Then the probability space (Ω, A, P) is called perfect if every real-valued function G on Ω, measurable for the usual Borel σ-algebra on ℝ, is quasiperfect.
Example. A measurable, quasiperfect function g on a finite set need not be perfect: let X := {a1, a2, a3, a4, a5, a6}, U := {a1, a2}, V := {a3, a4}, W := {a5, a6}, A := {∅, U, V, W, U ∪ V, U ∪ W, V ∪ W, X}, P(U) = P(V) = P(W) = 1/3, Y := {0, 1, 2}, g(a1) := g(a3) := 0, g(a2) := g(a5) := 1, g(a4) := g(a6) := 2. Let B := {∅, Y}. For C ⊂ Y, g⁻¹(C) ∈ A if and only if C ∈ B, so g is quasiperfect. But P(U) > 0 and g[U] does not include any nonempty set in B, so g is not perfect.
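Since the example is finite, both properties can be checked by brute force. The following sketch (our own encoding, with the points a1, ..., a6 coded as 1, ..., 6) enumerates all subsets of Y:

```python
from itertools import chain, combinations

# Brute-force check of the finite example: g is quasiperfect but not perfect.
X = {1, 2, 3, 4, 5, 6}                        # the points a1, ..., a6
U, V, W = {1, 2}, {3, 4}, {5, 6}
A = [set(), U, V, W, U | V, U | W, V | W, X]  # the sigma-algebra A on X
g = {1: 0, 2: 1, 3: 0, 4: 2, 5: 1, 6: 2}     # g(a1) = g(a3) = 0, etc.
Y = {0, 1, 2}
B = [set(), Y]                                # the trivial sigma-algebra B on Y

def subsets(s):
    s = list(s)
    return (set(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

# quasiperfect: g^{-1}(C) lies in A exactly for the C in B
quasiperfect = all(({x for x in X if g[x] in C} in A) == (C in B)
                   for C in subsets(Y))
# not perfect: P(U) > 0 but g[U] = {0, 1} contains no nonempty member of B
image_U = {g[x] for x in U}
not_perfect = not any(C and C <= image_U for C in B)
print(quasiperfect, not_perfect)              # -> True True
```

The enumeration confirms that only ∅ and Y pull back into A, while the image of U misses every nonempty set of B.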
3.4.6 Proposition Any perfect function is quasiperfect.
Proof Let C ⊂ Y, A := g⁻¹(C) ∈ A. By Theorem 3.4.1 take B ⊂ g[A] with B ∈ B and Q(B) ≥ P(A). Then B ⊂ C, so Q(B) = P(g⁻¹(B)) ≤ P(g⁻¹(C)) = P(A), and Q(B) = P(A). Thus the inner measure Q_*(C) = (P ∘ g⁻¹)(C). Likewise, Q_*(Y \ C) = (P ∘ g⁻¹)(Y \ C), so Q*(C) = (P ∘ g⁻¹)(C) and C is Q-completion measurable. □
3.5 Almost surely convergent realizations

First let's recall a theorem of Skorokhod (RAP, Theorem 11.7.2): if (S, d) is a complete separable metric space, and Pn are laws on S converging to a law P0, then on some probability space there exist S-valued measurable functions Xn such that Xn has law Pn for each n and Xn → X0 almost surely. This section will prove an extension of Skorokhod's theorem to our current setup. Having almost uniformly convergent realizations shows that the definition of convergence in law for random elements is reasonable. The realizations will be useful in some later proofs on convergence in law.
Suppose fn ⇒ f0 where the fn are random elements, in other words functions, not necessarily measurable except for n = 0, defined on some probability spaces (Ωn, An, Qn) into a possibly nonseparable metric space S. We want to find random elements Yn "having the same laws" as fn for each n such that Yn → Y0 almost surely or, better, almost uniformly. At first look it isn't clear what "having the same laws" should mean for random elements fn, n ≥ 1, not having laws defined on any nontrivial σ-algebra. A way that turns out to work is to define Yn := fn ∘ gn where the gn are functions from some other probability space Ω with probability measure Q into Ωn such that each gn is measurable and Q ∘ gn⁻¹ = Qn for each n. Thus the argument of fn will have the same law Qn as before. It turns out, moreover, that the gn should be not only measurable but perfect. Before stating the theorem, here is an example to show that there may really be no way to define a σ-algebra on S on which laws could be defined and yield an equivalence as in the next theorem, even if S is a finite set.
Example. Let (Xn, An, Qn) = ([0, 1], B, λ) for all n (λ = Lebesgue measure, B = Borel σ-algebra). Take sets C(n) ⊂ [0, 1] with 0 = λ_*(C(n)) < λ*(C(n)) = 1/n² (RAP, Theorem 3.4.4). Let S be the two-point space {0, 1} with usual metric. Then fn := 1_{C(n)} → 0 in law and almost uniformly, but each "law" µn := Qn ∘ fn⁻¹ is only defined on the trivial σ-algebra {∅, S}. The only larger σ-algebra on S is 2^S, but no µn for n ≥ 1 is defined on 2^S.
3.5.1 Theorem Let (S, d) be any metric space, (Xn, An, Qn) any probability spaces, and fn a function from Xn into S for each n = 0, 1, ⋯. Suppose f0 has separable range S0 and is measurable (for the Borel σ-algebra on S0). Then fn ⇒ f0 if and only if there exist a probability space (Ω, S, Q) and perfect measurable functions gn from (Ω, S) to (Xn, An) for each n = 0, 1, ⋯, such that Q ∘ gn⁻¹ = Qn on An for each n and fn ∘ gn → f0 ∘ g0 almost uniformly as n → ∞.

Notes. Proposition 3.4.4 and the "if and only if" in Theorem 3.5.1 show that the hypothesis that the gn be perfect can't just be dropped from the theorem.
Proof "If" follows from Proposition 3.3.1 and Theorem 3.4.3. "Only if" will be proved very much as in RAP, Theorem 11.7.2. Let Ω be the Cartesian product ∏_{n=0}^∞ Xn × In where each In is a copy of [0, 1]. Here gn will be the natural projection of Ω onto Xn for each n. Let P := Q0 ∘ f0⁻¹ on the Borel σ-algebra of S, concentrated in the separable subset S0. A set B ⊂ S will be called a continuity set (RAP, Section 11.1) for P if P(∂B) = 0 where ∂B is the boundary of B. Then:
3.5.2 Lemma For any ε > 0 there are disjoint open continuity sets Uj, j = 1, ⋯, J, for P, for some J < ∞, where for each j, diam Uj := sup{d(x, y) : x, y ∈ Uj} ≤ ε, while P(⋃_{j=1}^J Uj) > 1 − ε.
Proof Let {xj}_{j≥1} be dense in S0. Let B(x, r) := {y ∈ S : d(x, y) < r} for 0 < r < ∞ and x ∈ S0. Then B(xj, r) is a continuity set of P for all but at most countably many values of r. Choose rj with ε/3 < rj < ε/2 such that B(xj, rj) is a continuity set of P for each j. The continuity sets form an algebra (RAP, Proposition 11.1.4). Let

Uj := B(xj, rj) \ ⋃_{i<j} {y : d(xi, y) ≤ ri}.

Then the Uj are disjoint open continuity sets with diam Uj ≤ 2rj < ε, and P(⋃_{j≥1} Uj) = 1 since the spheres {y : d(xi, y) = ri} have P-measure 0, so P(⋃_{j=1}^J Uj) > 1 − ε for some J < ∞, proving the lemma. □

For each k = 1, 2, ⋯, apply Lemma 3.5.2 with ε = min(1/k, 2^{-k}) to get disjoint open continuity sets Ukj, j = 1, ⋯, J(k), with diam Ukj ≤ 1/k and

(3.5.3)   Σ_{j=1}^{J(k)} P(Ukj) > 1 − 2^{-k}.
For any open set U in S with complement F, let d(x, F) := inf{d(x, y) : y ∈ F}. For r = 1, 2, ⋯, let Fr := {x : d(x, F) ≥ 1/r}. Then Fr is closed and Fr ↑ U as r → ∞. There is a continuous hr on S with 0 ≤ hr ≤ 1, hr = 1 on Fr, and hr = 0 outside F2r: let hr(x) := min(1, max(0, 2r d(x, F) − 1)). For each j and k, let F(k, j) := S \ Ukj and F(k, j)_r := {x : d(x, F(k, j)) ≥ 1/r}. Take r := r(k, j) large enough so that P(F(k, j)_r) > (1 − 2^{-k})P(Ukj). Let hkj be the hr as defined above for such an r, and Hkj the h2r. For n large enough, say for n ≥ nk, we have

∫_* hkj(fn) dQn > (1 − 2^{-k})P(Ukj)   and   ∫* Hkj(fn) dQn < (1 + 2^{-k})P(Ukj)

for all j = 1, ⋯, J(k). We may assume n1 < n2 < ⋯. For every n = 0, 1, ⋯, let fkjn := (hkj ∘ fn)_* for Qn, so that by Theorem 3.2.1,

∫_* hkj(fn) dQn = ∫ fkjn dQn,   0 ≤ fkjn ≤ hkj(fn),

and fkjn is An-measurable. For n ≥ 1 let Bkjn := {fkjn > 0} ∈ An. Let Bkj0 := f0⁻¹(Ukj) ∈ A0. For each k and n, the Bkjn for j = 1, ⋯, J(k) are disjoint, and Hkj(fn) = 1 on Bkjn, so for n ≥ nk,

(3.5.4)   (1 − 2^{-k})P(Ukj) < Qn(Bkjn) < (1 + 2^{-k})P(Ukj),
and Q0(Bkj0) = P(Ukj). Let Tn := Xn × In. Let µn be the product law Qn × λ on Tn where λ is Lebesgue measure on the Borel σ-algebra B in In. For each k ≥ 1, let

Ckjn := Bkjn × [0, F(k, j, n)] ⊂ Tn,
Dkjn := Bkj0 × [0, G(k, j, n)] ⊂ T0,

where F and G are defined so that

µn(Ckjn) = µ0(Dkjn) = min(Qn(Bkjn), Q0(Bkj0)).

Then for each k, j, and n ≥ nk, we have by (3.5.4), since 1/(1 + 2^{-k}) > 1 − 2^{-k},

(3.5.5)   1 − 2^{-k} < min(F, G)(k, j, n) ≤ max(F, G)(k, j, n) = 1.

Let

Ck0n := Tn \ ⋃_{j=1}^{J(k)} Ckjn,   Dk0n := T0 \ ⋃_{j=1}^{J(k)} Dkjn.

For k = 0 let J0 := J(0) := 0, C00n := Tn, D00n := T0, and n0 := 0. For each n = 1, 2, ⋯, let k(n) be the unique k such that nk ≤ n < nk+1. Then for n ≥ 1, Tn is the disjoint union of sets Wnj := Ck(n)jn, j = 0, 1, ⋯, J(k(n)). We also have:

(3.5.6)   If j ≥ 1 and (v, s) ∈ Wnj then v ∈ Bk(n)jn, so fn(v) ∈ Uk(n)j.

Next, T0 is the disjoint union of sets Enj := Dk(n)jn. Then µn(Wnj) = µ0(Enj) for each n and j, and if j ≥ 1 or k(n) = 0, then µ0(Enj) > 0. For x in T0 and each n,

(3.5.7)   let j(n, x) be the j such that x ∈ Enj.

Let L := {x ∈ T0 : µ0(Enj(n,x)) > 0 for all n}. Then T0 \ L ⊂ ⋃_i En(i)0 for some (possibly empty or finite) sequence n(i) such that µ0(En(i)0) = 0 for all i. Thus µ0(L) = 1. For x ∈ L and any measurable set B ⊂ Tn, in other words B in the product σ-algebra An ⊗ B, let j := j(n, x), recall that µn(Wnj) = µ0(Enj), and let

(3.5.8)   Pnj(B) := µn(B ∩ Wnj)/µ0(Enj),   Pnx := Pnj(n,x).

Then Pnx is a probability measure on An ⊗ B. Let px be the product measure ∏_{n=1}^∞ Pnx on T := ∏_{n=1}^∞ Tn (RAP, Theorem 8.2.2).
3.5.9 Lemma For any measurable set H ⊂ T (for the infinite product σ-algebra with An ⊗ B on each factor), x ↦ px(H) is measurable on (T0, A0 ⊗ B).
Proof Let H be the collection of all H for which the assertion holds. Given n, Pnx is one of finitely many laws, each obtained for x in a measurable subset Enj. Thus if Yn is the natural projection of T onto Tn and H = Yn⁻¹(B) for some B ∈ An ⊗ B, then H ∈ H. If H = ⋂_{i∈M} Y_{m(i)}⁻¹(Bi), where Bi ∈ A_{m(i)} ⊗ B and M is finite, we may assume the m(i) are distinct. Then px(H) = ∏_{i∈M} px(Y_{m(i)}⁻¹(Bi)), so H ∈ H. Then any finite, disjoint union of such intersections is in H. Such unions form an algebra. If Hn ∈ H and Hn ↑ H or Hn ↓ H, then H ∈ H. Since the smallest monotone class containing an algebra is a σ-algebra (RAP, Theorem 4.4.2), the lemma follows. □
Now returning to the proof of Theorem 3.5.1, Ω = T0 × T. For any set C ⊂ Ω, measurable for the product σ-algebra, and x ∈ T0, let Cx := {y ∈ T : (x, y) ∈ C}, and Q(C) := ∫ px(Cx) dµ0(x). Here x ↦ px(Cx) is measurable if C is a finite union of products Ai × Fi where Ai ∈ A0 ⊗ B and Fi is product measurable in T. Such a union equals a disjoint union (RAP, Proposition 3.2.2). Thus by monotone classes again, x ↦ px(Cx) is measurable on T0 for any product-measurable set C ⊂ Ω. Thus Q is defined. It is then clearly a countably additive probability measure.
Let pn be the natural projection of Tn onto Xn. Recall that Pnx = Pnj for all x ∈ Enj, by (3.5.7) and (3.5.8). The marginal of Q on Xn, in other words Q ∘ gn⁻¹, is by (3.5.8) again

Σ_{j=0}^{J(k(n))} µ0(Enj) Pnj ∘ pn⁻¹ = µn ∘ pn⁻¹ = Qn.
Thus Q has marginal Qn on Xn for each n, as desired. By (3.5.3),

Σ_{k=1}^∞ Q0(X0 \ ⋃_{j=1}^{J(k)} f0⁻¹(Ukj)) ≤ Σ_{k=1}^∞ 2^{-k} < ∞.

So Q0-almost every y ∈ X0 belongs to ⋃_j f0⁻¹(Ukj) for all large enough k. Also if t ∈ I0 and t < 1, then by (3.5.5), t < G(k, j, n) for all j ≥ 1 as soon
as 1 − 2^{-k} > t and n ≥ nk. Thus for µ0-almost all (y, t), there is an m such that (y, t) ∈ ⋃_{j=1}^{J(k(n))} Enj for all n ≥ m. If x := (y, t) ∈ Enj for j ≥ 1, then y ∈ Bk(n)j0, so f0(y) ∈ Uk(n)j. Also, by (3.5.8), Pnx = Pnj is concentrated in Wnj. For (v, s) ∈ Wnj, fn(v) ∈ Uk(n)j by (3.5.6). Since diam(Ukj) ≤ 1/k for each j ≥ 1,

Q*(d(fn(gn), f0(g0)) > 1/k(n) for some n ≥ m) ≤ µ0({(y, t) : (y, t) ∈ En0 for some n ≥ m}) → 0 as m → ∞,

so fn(gn) → f0(g0) almost uniformly. Lastly, let's show that the gn are perfect. Suppose Q(A) > 0 for some A.
Now Q(A) = ∫ px(Ax) dµ0(x). First let n ≥ 1. Then for some x, px(Ax) > 0. If µ0(En0) = 0, we take x ∉ En0. Now T = Tn × T(n) where T(n) := ∏_{m≥1, m≠n} Tm, and px is the product of Pnx with ∏_{m≠n} Pmx. By the Tonelli–Fubini theorem there is a v ∈ T(n) with Pnx({u : (u, v) ∈ Ax}) > 0. Choose and fix such a v as well as x. Now Pnx = Pnj for j = j(n, x) with µ0(Enj) > 0. Let u = (s, t), s ∈ Xn, and t ∈ In. Then since Pnj is Qn × λ restricted to a set of positive measure and normalized,

0 < ∫∫ 1_{Ax}(s, t, v) dQn(s) dt.

Choose and fix a t with 0 < ∫ 1_{Ax}(s, t, v) dQn(s). Let C := {s ∈ Xn : (s, t, v) ∈ Ax}. Then Qn(C) > 0. Clearly C ⊂ gn[A], so gn is perfect for n ≥ 1 by Theorem 3.4.1. To show g0 is perfect, we have µ0 = Q0 × λ, and px(Ax) > 0 for x = (y, t) in a set X with µ0(X) > 0. There is a t ∈ I0 such that Q0(C) > 0 where C := {y : (y, t) ∈ X}. Then C ⊂ g0[A], so g0 is perfect, finishing the proof of Theorem 3.5.1. □
3.6 Conditions equivalent to convergence in law

Conditions equivalent to convergence of laws on separable metric spaces are given in the portmanteau theorem (RAP, Theorem 11.1.1) and the metrization theorem (RAP, Theorem 11.3.3). Here, the conditions will be extended to general random elements for the theory being developed in this chapter. For any probability space (Ω, A, P) and real-valued function f on Ω let E* f := ∫* f dP, E_* f := ∫_* f dP. If (S, d) is a metric space and f is a real-valued function on S, recall (RAP, Section 11.2) that the Lipschitz seminorm of f is defined by

‖f‖_L := sup{|f(x) − f(y)|/d(x, y) : x ≠ y}

and f is called a Lipschitz function if ‖f‖_L < ∞. The bounded Lipschitz norm is defined by ‖f‖_BL := ‖f‖_L + ‖f‖_∞ where ‖f‖_∞ := sup_x |f(x)|. Then f is called a bounded Lipschitz function if ‖f‖_BL < ∞, and ‖·‖_BL is a norm on the space of all such functions.
The extended portmanteau theorem about to be proved is an adaptation of RAP, Theorem 11.1.1, and some further facts based on the last section (Theorem
3.5.1). The proof to be given includes relatively easy implications, some of which consist of putting in stars at appropriate places in the proofs in RAP.
3.6.1 Theorem Let (S, d) be any metric space. For n = 0, 1, 2, ⋯, let (Xn, An, Qn) be a probability space and fn a function from Xn into S. Suppose f0 has separable range S0 and is measurable. Let P := Q0 ∘ f0⁻¹ on S. Then the following are equivalent:

(a) fn ⇒ f0;
(a') lim sup_{n→∞} E*G(fn) ≤ EG(f0) for each bounded continuous real-valued G on S;
(b) E*G(fn) → EG(f0) as n → ∞ for every bounded Lipschitz function G on S;
(b') (a') holds for all bounded Lipschitz G on S;
(c) sup{|E*G(fn) − EG(f0)| : ‖G‖_BL ≤ 1} → 0 as n → ∞;
(d) for any closed F ⊂ S, P(F) ≥ lim sup_{n→∞} (Qn)*(fn ∈ F);
(e) for any open U ⊂ S, P(U) ≤ lim inf_{n→∞} (Qn)_*(fn ∈ U);
(f) for any continuity set A of P in S, (Qn)*(fn ∈ A) → P(A) and (Qn)_*(fn ∈ A) → P(A) as n → ∞;
(g) there exist a probability space (Ω, S, Q) and measurable functions gn from Ω into Xn and hn from Ω into S such that the gn are perfect, Q ∘ gn⁻¹ = Qn and Q ∘ hn⁻¹ = P for all n, and d(fn ∘ gn, hn) → 0 almost uniformly.

Moreover, (g) remains equivalent if any of the following changes are made in it: "almost uniformly" can be replaced by "in outer probability"; we can take hn = f0 ∘ γn for some measurable functions γn from Ω into X0, which can be taken to be perfect; and we can take the γn to be all the same, γn = γ1 for all n.
Proof Clearly (a) implies (a'). Conversely, by interchanging G and −G, (a') implies

lim inf_{n→∞} E*G(fn) ≥ lim inf_{n→∞} E_* G(fn) ≥ EG(f0),

and (a) follows, so (a) and (a') are equivalent.
Clearly (a) implies (b), which is equivalent to (b') just as (a) is to (a'). To show (b) implies (c), let T be the completion of S. Then all the fn take values in T. Each bounded Lipschitz function G on S extends uniquely to such a function on T, and the functions G ∘ fn on Xn are exactly the same. So we can assume in this step that S and S0 are complete. Let ε > 0. By Ulam's theorem (RAP, Theorem 7.1.4), take a compact K ⊂ S0 with P(K) = Q0(f0 ∈ K) > 1 − ε. Recall that d(x, K) := inf{d(x, y) : y ∈ K} and K^ε := {x : d(x, K) < ε}. Let g(x) := max(0, 1 − d(x, K)/ε) for x ∈ S. Then g is a bounded Lipschitz function with ‖g‖_BL ≤ 1 + 1/ε < ∞. Clearly 1_K ≤ g ≤ 1_{K^ε}. Since E_* g(fn) → Eg(f0) as n → ∞, it follows that for n large enough

(3.6.2)   (Qn)_*(fn ∈ K^ε) ≥ E_* g(fn) ≥ Eg(f0) − ε > 1 − 2ε.
Let B be the set of all G on S with ‖G‖_BL ≤ 1. Then the functions in B are uniformly equicontinuous, so their restrictions to K are totally bounded for the supremum distance over K by the Arzelà–Ascoli theorem (RAP, Theorem 2.4.7). Let G1, ⋯, GJ for some finite J be functions in B such that for each G ∈ B, sup_{x∈K} |(G − Gj)(x)| < ε for some j = 1, ⋯, J. Next, for any G ∈ B, choose such a j. Then

(3.6.3)   |E*G(fn) − EG(f0)| ≤ |E*G(fn) − E*Gj(fn)| + |E*(Gj(fn)) − E(Gj(f0))| + |E(Gj(f0) − G(f0))|,

a sum of three terms. For the last term, splitting the integral into two parts according as f0 ∈ K or not, the first part is bounded by ε and the second by 2ε since |Gj − G| ≤ 2 everywhere and Q0(f0 ∉ K) < ε. The middle term on the right side of (3.6.3) is bounded above by max_{1≤j≤J} |E*(Gj(fn)) − E(Gj(f0))|, which converges to 0 as n → ∞ by (b). For the first term, split the integral into a part over a set where fn ∈ K^ε, and the rest, whose outer probability is at most 2ε by (3.6.2) for n large. Since G and Gj ∈ B, and |G − Gj| < ε on K, we have |G − Gj| < ε + 2ε = 3ε on K^ε, while |G − Gj| ≤ 2 everywhere, so the first term is at most 3ε + 4ε = 7ε, uniformly in G ∈ B. Letting ε ↓ 0 gives (c).

3.7 Asymptotic equicontinuity and Donsker classes

Given F ⊂ L²(X, A, P) and a pseudometric ρ on F, say that F satisfies the asymptotic equicontinuity condition F ∈ AEC(P, ρ) if for every ε > 0 there is a δ > 0 and an n0 large enough such that for n ≥ n0,

Pr*{sup{|νn(f − g)| : f, g ∈ F, ρ(f, g) < δ} > ε} < ε.

Then F ∈ AEC(P) will mean F ∈ AEC(P, ρP).
3.7.2 Theorem Let F ⊂ L²(X, A, P). Then the following are equivalent:

(I) F is a Donsker class for P, in other words F is P-pregaussian and νn ⇒ GP in ℓ^∞(F);
(II) (a) F is totally bounded for ρP, and (b) F satisfies the asymptotic equicontinuity condition for P, F ∈ AEC(P);
(III) there is a pseudometric τ on F such that F is totally bounded for τ and F ∈ AEC(P, τ).
Proof If F is a Donsker class and ε > 0, then since F is pregaussian, it is totally bounded for ρP by Theorem 2.3.5, so (a) holds. Take 0 < δ < ε/3 such that for any coherent GP process,

Pr{sup{|GP(f) − GP(g)| : ρP(f, g) < δ} > ε/3} < ε/2.

By almost surely convergent realizations (Theorem 3.5.1), for n ≥ n0 large enough we can assume Pr*{‖νn − GP‖_F > ε/3} < ε/2. If ‖νn − GP‖_F ≤ ε/3 and |GP(f) − GP(g)| ≤ ε/3 then |νn(f) − νn(g)| ≤ ε. So the asymptotic equicontinuity condition holds with the δ and n0 chosen, and (b) holds, so (I) implies (II).

(II) implies (III) directly, with τ = ρP.
To show (III) implies (I), suppose F is τ-totally bounded and F ∈ AEC(P, τ). Let UC := UC(F) denote the set of all real-valued functions on F uniformly continuous for τ. Then UC is a separable subspace of ℓ^∞(F) for ‖·‖_F, since F is totally bounded and UC equals, in effect, the space of continuous functions on the compact completion (RAP, Corollary 11.2.5). For any finite subset G of F, by the finite-dimensional central limit theorem (RAP, Theorem 9.5.6) we can let n → ∞ in the asymptotic equicontinuity condition and so replace νn(f − g) by GP(f) − GP(g) for f, g ∈ G. Given ε > 0, take a δ := δ(ε) > 0 and n0 := n0(ε) < ∞ from the definition of the asymptotic equicontinuity condition. Then δ and n0 don't depend on G ⊂ F. So we can let G increase up to a countable dense set H ⊂ F. GP has sample functions almost surely uniformly continuous on H.

If f0 ∈ F and fk ∈ H are such that τ(fk, f0) → 0, then by applying the finite-dimensional central limit theorem to the finite set G = {f0, f1, ⋯, fk} for each k, applying the asymptotic equicontinuity condition and letting k → ∞, we see that GP is almost surely uniformly continuous for τ on the set {fj}_{j≥0}. So GP(fk) → GP(f0) almost surely. Thus for each f ∈ F, the almost sure limit of GP(hk) as hk → f through H, which exists by uniform continuity, equals GP(f) a.s., so the almost sure limit defines a GP process. This GP has uniformly continuous sample functions for τ on F. So by Theorem 2.8.6, since
GP is isonormal for (·, ·)_{0,P}, it follows that F is pregaussian. So GP has a law µ3 defined on the Borel sets of the separable Banach space UC. Given ε > 0, take δ > 0 from the asymptotic equicontinuity condition and a finite set G ⊂ F such that for each f ∈ F there is a g ∈ G such that τ(f, g) < δ. Then ℝ^G is the set of all real-valued functions on G. Let µ2 be the law of GP restricted to G, on ℝ^G, and let µ23 be the law on ℝ^G × UC of (GP restricted to G, GP), where GP on G in ℝ^G is just the restriction of GP on UC. So µ23 has marginals µ2 and µ3. Let µ1,n be the law of νn on G, so µ1,n is also defined on ℝ^G. Then by the finite-dimensional central limit theorem again, the laws µ1,n converge to µ2 on ℝ^G. So for the Prokhorov metric p, since it metrizes convergence of laws (RAP, Theorem 11.3.3), p(µ1,n, µ2) < ε for n large enough. Take n ≥ n0 also, then fix n. By Strassen's theorem (RAP, Corollary 11.6.4), there is a law µ12 on ℝ^G × ℝ^G with marginals µ1,n and µ2 such that µ12{(x, y) : |x − y| > ε} < ε. By the Vorob'ev–Berkes–Philipp theorem (1.1.10), there is a Borel measure µ123 on ℝ^G × ℝ^G × UC having marginals µ12 and µ23 on the appropriate spaces.
The next step is to link up νn with its restriction to the finite set G. The Vorob'ev–Berkes–Philipp theorem may not apply here since νn on F may not be in a Polish space, at least not one that seems apparent. (About nonmeasurability on nonseparable spaces see the remarks at the end of Section 1.1.) Here we can use instead:
3.7.3 Lemma Let S and T be Polish spaces and (Ω, A, P) a probability space. Let Q be a law on S × T with marginal q on S. Let V be a random variable on Ω with values in S and law L(V) = q. Suppose there is a real random variable U on Ω independent of V with continuous distribution function F_U. Then there is a random variable W : Ω → T such that the joint law L(V, W) of (V, W) is Q.

Proof Two metric spaces X, Y are called Borel-isomorphic if there is a one-to-one Borel measurable map of X onto Y with Borel measurable inverse. Every Polish space is Borel-isomorphic to some compact subset of [0, 1]: either the whole interval, a finite set, or a convergent sequence and its limit (RAP, Theorem 13.1.1). Since the lemma involves only measurability and not topological properties of the Polish spaces, we can assume S = T = [0, 1]. Here we need the fact that for any real-valued random variable X with continuous distribution function F, F(X) has a uniform distribution in [0, 1], which can be seen as follows. For −∞ < t < ∞, the probability that X ≤ t equals F(t). Now X ≤ t implies F(X) ≤ F(t). On the other hand, the probability that F(X) ≤ F(t) is the supremum of the probabilities that X ≤ y over y such that F(y) ≤ F(t), and this supremum is at most F(t). So the probability that X ≤ t equals the probability
that F(X) ≤ F(t). Since F is continuous, for any s with 0 < s < 1 there is a t ∈ ℝ with F(t) = s, so the probability that F(X) ≤ s equals F(t) = s, and F(X) is uniformly distributed in [0, 1] as claimed. So, taking F_U(U) in place of U, we can assume U is uniformly distributed in [0, 1]. By way of regular conditional probabilities (RAP, Section 10.2; Bauer, 1981) we can write Q = ∫ Qx dq(x) where for each x, Qx is a probability measure on T, so that for any measurable set A in (the square) S × T, Q(A) = ∫∫ 1_A(x, y) dQx(y) dq(x) (RAP, Theorems 10.2.1 and 10.2.2). Let Fx be the distribution function of Qx and
let

Fx⁻¹(t) := inf{u : Fx(u) ≥ t},   0 < t < 1.

Then for any real z and 0 < t < 1, Fx⁻¹(t) ≤ z if and only if Fx(z) ≥ t. Now x ↦ Fx(z) is measurable for any fixed z. It follows that x ↦ Fx⁻¹(t) is measurable for each t, 0 < t < 1. For each x, Fx⁻¹ is left-continuous and nondecreasing in t. It follows that (x, t) ↦ Fx⁻¹(t) is jointly measurable. Thus for W(ω) := F_{V(ω)}⁻¹(U(ω)), ω ↦ W(ω) is measurable. For each x we have the image measure λ ∘ (Fx⁻¹)⁻¹ = Qx (RAP, Proposition 9.1.2). So for any bounded Borel function g,

∫ g dQ = ∫ ∫ g(x, y) dQx(y) dq(x)    by Theorem 10.2.1 of RAP
       = ∫ ∫₀¹ g(x, Fx⁻¹(y)) dy dq(x)    by the image measure theorem
       = ∫ g(x, Fx⁻¹(y)) d(q × λ)(x, y)    by the Tonelli–Fubini theorem
       = E(g(V, F_V⁻¹(U))) = Eg(V, W),

since U is independent of V and L(V) = q and by the image measure theorem again. So L(V, W) = Q, proving Lemma 3.7.3. □

Now let (Ω, S, Q) be a probability space on which all the empirical processes νn and an independent U are defined, specifically a countable product of copies of the probability space (X, A, P) times one copy of [0, 1] with Lebesgue measure λ. Then Lemma 3.7.3 applies with S = ℝ^G and T = ℝ^G × UC, where V is νn restricted to G, and the law Q of the lemma is µ123 on S × T. On Ω we then have processes νn and GP defined on F, which by construction and the asymptotic equicontinuity condition are within 3ε of each other uniformly on F except with a probability at most 3ε (as in the proof of Donsker's theorem in Section 1.1).
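The construction W := F_V⁻¹(U) in Lemma 3.7.3 can be tried out numerically. In the following sketch (the two-point joint law Q and all numbers are our own illustration; for a discrete law the same generalized inverse applies), a given joint law on {0,1} × {0,1} is recovered from its first marginal plus an independent uniform variable:

```python
import random

# Joint law Q on {0,1} x {0,1} and its marginal q on the first coordinate
Q = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
q = {0: 0.4, 1: 0.6}

def cond_quantile(x, t):
    # F_x^{-1}(t) = inf{u : F_x(u) >= t} for the conditional law Q_x = Q(x, .)/q(x)
    acc = 0.0
    for u in (0, 1):
        acc += Q[(x, u)] / q[x]
        if acc >= t:
            return u
    return 1

random.seed(0)
n = 200_000
counts = {pair: 0 for pair in Q}
for _ in range(n):
    v = 0 if random.random() < q[0] else 1   # V with law q
    u = random.random()                      # U uniform on [0,1], independent of V
    w = cond_quantile(v, u)                  # W := F_V^{-1}(U)
    counts[(v, w)] += 1

empirical = {pair: counts[pair] / n for pair in Q}
print(all(abs(empirical[pair] - Q[pair]) < 0.01 for pair in Q))
```

The empirical law of (V, W) matches Q to within Monte Carlo error, as the lemma predicts.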
Let ε ↓ 0 through the sequence ε = 1/k, k = 1, 2, ⋯. Let the approximation just shown hold for n ≥ nk on a probability space (Ωk, Sk, Qk). We can assume nk is nondecreasing in k. Let Akn be the νn process defined on Ωk and let Gkn be the corresponding GP process on Ωk. Let n0 := 1 and let (Ω0, S0, Q0) be a probability space on which νn processes A0n and GP processes G0n are defined, where each G0n is independent of the A0n processes. Let An := Akn and Gn := Gkn if and only if nk ≤ n < nk+1 for k = 0, 1, ⋯. Then for all n, An is a νn process and Gn is a GP process. Also, ‖An − Gn‖_F → 0 in outer probability, so by Theorem 3.6.1, νn ⇒ GP on F, and (III) implies (I), proving Theorem 3.7.2. □
3.8 Unions of Donsker classes

It will be shown in this section that the union of any two Donsker classes F and G is a Donsker class. This is not surprising: one might think it was enough, given the asymptotic equicontinuity conditions for the separate classes, for a given ε > 0, to take the larger of the two n0's and the smaller of the two δ's. But it is not so easy as that. For example, F and G could both be finite sets, with distinct elements of F at distance, say, more than 0.2 apart for ρP, and likewise for G, but there may be some element of F very close to an element of G. So the equicontinuity condition on the union won't just follow from the conditions on the separate families. Given a probability measure P, F ⊂ L²(P), ε > 0, δ > 0, and a positive integer n0, say that AE(F, n0, ε, δ) holds if for all n ≥ n0,

Pr*{sup{|νn(f − g)| : f, g ∈ F, ρP(f, g) < δ} > ε} < ε.

Then the asymptotic equicontinuity condition, as in the previous section, holds for F and P if and only if for every ε > 0 there is a δ > 0 and an n0 such that AE(F, n0, ε, δ) holds. The asymptotic equicontinuity condition, together with total boundedness of F for ρP, is equivalent to the Donsker property of F for P (Theorem 3.7.2).
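The outer probability appearing in AE(F, n0, ε, δ) can be estimated by simulation for a concrete class. The sketch below (the class, grid, and all parameter values are our own choices, not the book's) takes F = {1_{[0,t]} : t in a grid} under P = U[0,1], for which ρP(1_{[0,t]}, 1_{[0,s]})² = |t − s| − (t − s)²:

```python
import bisect, random

def sup_close_pairs(sample, grid, delta):
    # sup |nu_n(f - g)| over f = 1_[0,t], g = 1_[0,s] with rho_P(f, g) < delta
    sample = sorted(sample)
    n = len(sample)
    # nu_n(1_[0,t]) = sqrt(n) * (P_n([0,t]) - t), via the empirical CDF
    nu = {t: n ** 0.5 * (bisect.bisect_right(sample, t) / n - t) for t in grid}
    best = 0.0
    for i, t in enumerate(grid):
        for s in grid[i + 1:]:
            if abs(t - s) - (t - s) ** 2 < delta ** 2:
                best = max(best, abs(nu[t] - nu[s]))
    return best

random.seed(1)
grid = [i / 50 for i in range(51)]
n, delta, eps, reps = 400, 0.15, 1.0, 200
exceed = sum(
    sup_close_pairs([random.random() for _ in range(n)], grid, delta) > eps
    for _ in range(reps))
print(exceed / reps)   # estimate of the outer probability in AE(F, n0, eps, delta)
```

For this class the increments over ρP-close pairs are small, so the estimated probability is (typically) near zero, consistent with the asymptotic equicontinuity of the class of left intervals.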
3.8.1 Theorem (K. Alexander) Let (Ω, A, P) be a probability space and let F1 and F2 be two Donsker classes for P. Then F := F1 ∪ F2 is also a Donsker class for P.
Proof (M. Arcones) Given ε > 0, take δi > 0 and ni < ∞ so that AE(Fi, ni, ε/3, δi) holds, i = 1, 2. F1 and F2, being Donsker classes, are pregaussian, and GP is an isonormal process for (·, ·)_{0,P}, so F is pregaussian by Corollary 2.5.9. So there is an α > 0 such that for a suitable version of GP,

Pr{sup{|GP(f) − GP(g)| : ρP(f, g) ≤ α} ≥ ε/3} < ε/3.

Let δ := min(δ1, δ2, α/3). Take finite sets Hi ⊂ Fi, i = 1, 2, such that for each i and f ∈ Fi there is an h := πi f ∈ Hi with ρP(f, h) < δ. Since H := H1 ∪ H2 is finite, by the finite-dimensional central limit theorem (RAP, Theorem 9.5.6), νn restricted to H converges in law to GP restricted to H. Let F(H, α, ε/3) be the set of all y ∈ ℝ^H such that |y(f) − y(g)| ≥ ε/3 for some f, g ∈ H with ρP(f, g) ≤ α. Then F(H, α, ε/3) is closed and has probability less than ε/3 for the law of GP on ℝ^H. Thus by the portmanteau theorem (RAP, Theorem 11.1.1), there is an m such that

(3.8.2)   AE(H, m, ε/3, α) holds.

Let n0 := max(n1, n2, m). It will be shown that AE(F, n0, ε, δ) holds. By the asymptotic equicontinuity conditions in each Fi and since n0 ≥ max(n1, n2) and δ ≤ min(δ1, δ2), there is a set of probability less than ε/3 for each i = 1, 2 such that outside these sets, |νn(f) − νn(g)| ≤ ε/3 for any f, g in the same Fi with ρP(f, g) < δ. For pairs f, g with ρP(f, g) < δ, with f ∈ F1 and g ∈ F2, we have ρP(π1 f, π2 g) ≤ α since ρP(f, π1 f) < δ, ρP(g, π2 g) < δ, and 3δ ≤ α. Thus by (3.8.2), |νn(π1 f) − νn(π2 g)| ≤ ε/3 for all such f, g, outside of another set of probability at most ε/3. Thus

|νn(f) − νn(g)| ≤ |νn(f) − νn(π1 f)| + |νn(π1 f) − νn(π2 g)| + |νn(π2 g) − νn(g)| ≤ ε/3 + ε/3 + ε/3 = ε

for all f ∈ F1 and g ∈ F2 except on a set of probability at most ε/3 + ε/3 + ε/3 = ε. □
3.9 Sequences of sets and functions

This section will show how the asymptotic equicontinuity condition in Theorem 3.7.2 can be applied to prove that some sequences of sets and functions are Donsker classes. In Chapters 6 and 7, other sufficient conditions for the Donsker property will be given that will apply to uncountable families of sets and functions.

3.9.1 Theorem Let (X, A, P) be a probability space and {Cm}_{m≥1} a sequence of measurable sets. If

(3.9.2)   Σ_{m=1}^∞ (P(Cm)(1 − P(Cm)))^r < ∞ for some r < ∞,
then the sequence {Cm}_{m≥1} is a Donsker class for P. Conversely, if the sets Cm are independent for P, then the sequence is a Donsker class only if (3.9.2) holds.

Proof Suppose (3.9.2) holds. Then the positive integers can be decomposed into two subsequences, over one of which P(Cm) → 0 and over the other P(Cm) → 1. It's enough to prove the Donsker property separately for each subsequence by Theorem 3.8.1. For any measurable set A with complement A^c, νn(A^c) ≡ −νn(A) and GP(A^c) ≡ −GP(A). The transformation of these processes into their negatives preserves convergence in law if it holds. So we can assume P(Cm) ↓ 0 as m → ∞, and then Σ_m P(Cm)^r < ∞. Also, {Cm}_{m≥1} is totally bounded for ρP. By Theorem 3.8.1 we can assume pm := P(Cm) < 1/2 for all m. For any i and m such that P(Ci Δ Cm) = 0, we will have almost surely for any n that Pn(Ci) = Pn(Cm). So we can assume that P(Ci Δ Cm) > 0 for all i ≠ m. For any m such that P(Cm) = 0 we will have Pn(Cm) = 0 almost surely for any n, and then νn(Cm) = 0, so we can assume P(Cm) > 0 for all m. Let 0 < ε < 1. Suppose we can find M and N such that for all n ≥ N,

(3.9.3)   Pr{sup_{m≥M} |νn(Cm)| > ε} < ε.
Then for J large enough, pm ≤ pM/2 for m ≥ J. Let γ := min{P(Ci Δ Cj) : 1 ≤ i < j ≤ J} and α := min(γ, pM)/2. Then

sup{|νn(Ci) − νn(Cj)| : P(Ci Δ Cj) < α} ≤ 2 sup{|νn(Cj)| : j ≥ M} ≤ 2ε

with probability at least 1 − ε, proving the asymptotic equicontinuity condition. So it will be enough to prove (3.9.3). For that, recalling the binomial probabilities defined in Section 1.3, it will suffice to find M and N such that for n ≥ N,

(3.9.4)   Σ_{m=M}^∞ E(npm + εn^{1/2}, n, pm) < ε/2

and

(3.9.5)   Σ_{m=M}^∞ B(npm − εn^{1/2}, n, pm) < ε/2.
Let qm := 1 − pm. For (3.9.5), by the Chernoff–Okamoto inequality (1.3.10),

B(npm − εn^{1/2}, n, pm) ≤ exp(−ε²/(2 pm qm)).

For some K, 1 ≤ K < ∞, pm ≤ K m^{−1/r} for all m. Choose M large enough so that

Σ_{m=M}^∞ exp(−m^{1/r} ε²/(2K)) < ε/2,

and so (3.9.5) holds for all n.
and so (3.9.5) holds for all n. The other side, (3.9.4), is harder. Bernstein's inequality 1.3.2 gives, if 2pn1/2 > e and 0 < p < 1/2, with q := 1  p, that
E(np + En1/2, n, p) < exp (  e2/(2pq + en1/2)} < exp ( E2/(6pq)). Then 00
E {E(npm +En1/2, n, pm) : 2pmn1/2 >
(3.9.6)
m=M 00
{expE2/(6pm)): 2pmn1/2 > e} M=M 00
exp ( e2m1/rl (6K)) < e/4
< M=M
for all n if M is large enough. It remains to treat the sum, which will be called S2, of (3.9.4) restricted to values of m with 2pmn1/2 < E.
(3.9.7)
For p:= pm, inequality (1.3.11) implies
E(np + en 1/2, n, p) < (npl (np +
En1/2))np+En1/2eenl/2
Lety:= y(n,m,e) :=n1/2pm/e,so y> 0. Let f(x)  (1+x)log(1+x Then
 (np + en1/2) log (1 + e/(pn1/2)) = En1/2 (1  f(y)). Forx > 0, f(x) < 0. By (3.9.7), y < 1/2, so f(y) > f(1/2) > 3/2. Thus En1/2
1 < 2f(y)/3, En1/2 (1  f(y)) < En 1/2 f (y) /3, and 00
{exp((En1/2+npm) [log (1+E/(n1/2pm)}]/3): 2pmn112 n3 for some n3. Thus forn > max(n 1, n2, n3), S2 < e/4. This and (3.9.6) give (3.9.4). So (3.9.2) implies that {Cm}m>1 is a Donsker class. For the converse, if the measurable sets {Cm }m>1 are independent for P and form a Donsker class for P, it will be shown that (3.9.2) holds. Note that for each n, P (Cm) are independent random variables for in = 1, 2, . First suppose
lim supm,,,,, pm =: p < 1 and for all n, EM' 1 pm = +oo. Then for each n, Pr{ P (Cm) = 1 for infinitely many in } = 1 by the BorelCantelli lemma. Now
Pn(Cm) = 1 implies vn(Cm) = n1/2(1  pm) > n1/2(1  p) /2 form large. From Theorem 3.7.2, { 1 C. : m > 1) must be totally bounded for pp and satisfy
the asymptotic equicontinuity condition. It follows that pm * 0 as m  oo, and by the asymptotic equicontinuity condition, we must have En° 1 pm < 00 for some n. A symmetrical argument applies if lim infra + pm > 0. We can write {Cm )m> l as a union of two subsequences, one for which
P(C,) < 1/2 and another for which P(Cm) > 1/2, both Donsker classes. Thus by the argument just given, for some n < oo, the first subsequence sat
isfies E. P(Cm)" < oo and the second, F_m(1  P(Cm))" < oo. So for the original sequence, (3.9.2) holds.
Next, let's consider sequences of functions. For a probability space (X, A, P) and f ∈ L²(X, A, P) let σ²_P(f) := ∫ f² dP − (∫ f dP)² (the variance of f). Here is a sufficient condition for the Donsker property of a sequence {fm} which is easy to prove, yet turns out to be optimal of its kind:
3.9.8 Theorem If {fm}_{m≥1} ⊂ L²(P) and Σ_{m=1}^∞ σ²_P(fm) < ∞, then {fm}_{m≥1} is a Donsker class for P.
Proof Since νn and GP are the same on fm − Efm as on fm a.s. for all m, we can assume Efm = 0 for all m. Then fm → 0 in L²(P), so the sequence {fm} is totally bounded for ρP. For any 0 < ε < 1, n ≥ 1 and m ≥ 1, by Chebyshev's inequality,

Pr{|νn(fj)| > ε/2 for some j > m} ≤ 4 Σ_{j>m} σ²_P(fj)/ε² < ε

for m ≥ m0 for some m0 < ∞. We have a.s. for all n that νn(fj) = 0 for all j such that σ²_P(fj) = 0, and νn(fj) = νn(fk) for all j and k with σ_P(fj − fk) = 0. Let α be the infimum of σ²_P(fj − fk) over all j and k such that σ²_P(fj − fk) > 0, j ≤ m0, and σ²_P(fj) > 0. Then α > 0 since in some ρP-neighborhood of fj there are only finitely many fk. Let δ := min(α, 1). Then the probability that |νn(fj) − νn(fk)| > ε for some j, k with σ²_P(fj − fk) < δ is less than ε, implying the asymptotic equicontinuity condition and so finishing the proof by Theorem 3.7.2. □

The following shows that Theorem 3.9.8, although it does not imply the first half of Theorem 3.9.1, is sharp in one sense:
3.9.9 Proposition Let A := [0, 1] and P := U[0, 1] := Lebesgue measure on A. Let a_m > 0 satisfy Σ_{m≥1} a_m = +∞. Then there is a sequence {f_m} ⊂ L²(A, 𝒜, P) with σ_P²(f_m) ≤ a_m for all m where {f_m} is not a Donsker class.

Proof We can assume a_m → 0. There exist c_m ↓ 0 such that Σ_m a_m c_m = +∞ (see problem 12). In A let C_m be independent sets with P(C_m) = a_m c_m for each m (see problem 13). Let f_m := c_m^{−1/2} 1_{C_m}. Then σ_P²(f_m) ≤ ∫f_m² dP = a_m. For each n, almost surely P_n(C_m) ≥ 1/n for infinitely many m. Then ν_n(C_m) ≥ n^{−1/2}/2 for infinitely many m, and sup_m ν_n(f_m) = +∞ a.s., so the asymptotic equicontinuity condition fails and {f_m} is not a Donsker class for P.
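The sequence c_m ↓ 0 cited from problem 12 can be built by the blockwise construction of that problem's hint. The following sketch (the function name and the choice c_m = 1/k on the k-th block are our own illustration, not the book's) closes block k as soon as its partial sum of a_m reaches k, so each completed block contributes at least (1/k)·k = 1 to Σ c_m a_m:

```python
def slowing_sequence(a):
    """Given a_m > 0 with divergent sum, return c_m nonincreasing to 0
    with sum(c_m * a_m) still divergent: c_m = 1/k on the k-th block,
    where block k is closed once its partial sum of a_m reaches k."""
    c, k, block_sum = [], 1, 0.0
    for am in a:
        c.append(1.0 / k)
        block_sum += am
        if block_sum >= k:    # block k complete; start block k + 1
            k += 1
            block_sum = 0.0
    return c

a = [1.0 / m for m in range(1, 200001)]   # divergent harmonic series
c = slowing_sequence(a)
weighted = sum(ci * ai for ci, ai in zip(c, a))
```

Since each completed block contributes at least 1, the weighted sum diverges along with the number of blocks, while c_m decreases to 0.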
Problems
1. Let (Ω, 𝒜, P) be a probability space, let (S, d) be a (possibly nonseparable) metric space, and let x_n, n = 0, 1, 2, ..., be points of S. Let f_n(ω) = x_n for all ω. Show that f_n ⇒ f₀ if and only if x_n → x₀.
2. In a general (possibly nonseparable) metric space show that if X₀ ≡ p ∈ S is a constant random variable, then random elements X_n ⇒ X₀ if and only if X_n → X₀ in outer probability.
3. Let (T, d) be any metric space and S ⊂ T a separable subspace. Let X_n, n ≥ 0, be S-valued measurable random variables on some probability space such that X_n ⇒ X₀ in law as n → ∞ in S. Show that X_n ⇒ X₀ as T-valued random elements.
4. Show that C := {(−∞, t] : t ∈ ℝ} is a Donsker class for any law P on ℝ. Hint: Use Section 1.1.
5. Let A_k be independent sets in a probability space (Ω, 𝒜, P) such that Σ_k [P(A_k)(1 − P(A_k))]^n = +∞ for all n = 1, 2, .... Show that {A_k}_{k≥1} is not a Glivenko–Cantelli class, that is, sup_k |(P_n − P)(A_k)| doesn't converge to 0 in probability as n → ∞. Hint: See the last part of the proof of Theorem 3.9.1.
6. Let (A, 𝒜, P) be a probability space and let F ⊂ L²(A, 𝒜, P) be a Donsker class for P.
(a) Show that the convex hull of F, namely

co(F) := {Σ_{j=1}^k λ_j f_j : f_j ∈ F, λ_j ≥ 0 for j = 1, ..., k, Σ_{j=1}^k λ_j = 1, k = 1, 2, ...},

is a Donsker class. Hint: Use Theorem 2.5.5 to get that co(F) is pregaussian and that G_P can be taken to be prelinear on co(F). Then use almost surely convergent realizations (3.5.1).
(b) For any fixed k < ∞, show that {Σ_{j=1}^k f_j : f_j ∈ F for j = 1, ..., k} is also a Donsker class. Hints: Use induction and Theorem 3.8.1 on unions. Thus take k = 2. It is easy to show that 2F := {2f : f ∈ F} is Donsker. Then apply part (a).
7. Let c > 0. For Lebesgue measure λ on [0, 1], the Poisson process with intensity measure cλ is defined by first choosing n at random having a Poisson distribution with parameter c, namely P(n = k) = e^{−c}c^k/k! for k = 0, 1, ..., then setting Y_c := Σ_{j≤n} δ_{X_j}, where the X_j are i.i.d. with law λ on [0, 1]. In the Banach space of bounded functions on [0, 1] with supremum
norm, prove that as c → ∞ (along any sequence) the random functions

t ↦ (Y_c − cλ)c^{−1/2}[0, t], 0 ≤ t ≤ 1,

converge in law to the Brownian motion process x_t, 0 ≤ t ≤ 1. (Recall that x_t = L(1_{[0,t]}).) Hints: For c = c_k → ∞ let n = n_k be Poisson (c_k). For F(t) ≡ t, 0 ≤ t ≤ 1, write Y_c(t) := Y_c([0, t]) = nF_n(t), so

c^{−1/2}(Y_c(t) − ct) = (n/c)^{1/2}[n^{1/2}(F_n − F)(t)] + c^{−1/2}(n − c)t.

By Donsker's theorem take Brownian bridges y^{(n)} such that n^{1/2}(F_n − F) is close to y^{(n)} for n large. Also, c^{−1/2}(n − c) is close in law by the central limit theorem, and so in probability (Strassen's theorem), to some random N(0, 1) variable Z_c. Show that one can take n(ω) independent of X₁, X₂, ...; then y_t^{(n(ω))}(ω) is a Brownian bridge; apply the Vorob'ev–Berkes–Philipp theorem to take Z_c independent of y_t, and then y_t + Z_c t is Brownian motion.
8. Let P be a law on a separable, infinite-dimensional Hilbert space H such that ∫‖x‖² dP(x) < ∞ and with mean 0, so that ∫(x, h) dP(x) = 0 for all h ∈ H. Let X₁, X₂, ... be i.i.d. in H with law P and S_n := X₁ + ⋯ + X_n.
(a) Show that the central limit theorem holds in H, that is, S_n/n^{1/2} converges in law to some normal measure on H. Hint: Prove, using variances, that the laws of S_n/n^{1/2} are uniformly tight.
(b) Show that the class of functions x ↦ (x, h) for h ∈ H with ‖h‖ ≤ 1 (the unit ball of the dual space of H) is a Donsker class of functions on H for P.
Hints: For part (a) let {e_n} be an orthonormal basis, so that for any x ∈ H, x = Σ_n x_n e_n with ‖x‖² = Σ_n x_n². Thus Σ_n E x_n² is finite. Show that for some c_n ↓ 0 slowly enough, Σ_n E x_n²/c_n² is still finite. For any K let C_K be the set of x such that Σ_n (x_n/c_n)² ≤ K² (an infinite-dimensional ellipsoid). Show that each C_K is compact and that the laws of S_n/n^{1/2} are uniformly tight by way of the sets C_K. Thus subsequences of these laws converge. Show that they all converge to the same Gaussian limit law, since by the Stone–Weierstrass theorem, functions depending on only finitely many x_j are dense in the continuous functions on each C_K.
9. The two-sample empirical process. Let X₁, ..., X_m, Y₁, ..., Y_n be i.i.d. with the uniform distribution U[0, 1] on [0, 1] (Lebesgue measure restricted to [0, 1]). Let F_m be the empirical distribution function based on X₁, ..., X_m, and G_n likewise based on Y₁, ..., Y_n. Show that in the space of all bounded functions on [0, 1] with supremum norm,

(mn/(m + n))^{1/2}(F_m − G_n) ⇒ y
as m, n → ∞, where t ↦ y_t, 0 ≤ t ≤ 1, is a Brownian bridge process. Hints: As in Donsker's theorem (1.1.1), for m, n large, m^{1/2}(F_m − F) is close to a Brownian bridge Y^{(m)}, and n^{1/2}(G_n − F) is close to an independent Brownian bridge Z^{(n)}. It follows that [mn/(m + n)]^{1/2}(F_m − G_n) is close to (n/(m + n))^{1/2}Y^{(m)} − (m/(m + n))^{1/2}Z^{(n)}, which is a Brownian bridge.
10. Let f_n(x) := cos(2πnx) for 0 ≤ x ≤ 1 with law U[0, 1] on [0, 1]. For real c let F_c be the sequence of functions n^{−c}f_n for n = 1, 2, .... For what values of c is F_c pregaussian? A Donsker class? Hints: If c ≤ 0 show easily that the class is not pregaussian and so not Donsker. If c > 0 then it is pregaussian, by metric entropy. If c > 1/2, it is Donsker by 3.9.8. Show that it is Donsker for 0 < c ≤ 1/2 by the Bernstein inequality and the asymptotic equicontinuity condition.
11. In ℝ, for any law P on ℝ, show that for any fixed k < ∞,

C := {∪_{j=1}^k (a_j, b_j] : a_j ≤ b_j for all j}

is a Donsker class, that is, F := {1_C : C ∈ C} is a Donsker class. Hint: Apply Donsker's theorem (1.1.1) and take differences (and sums). For k = 2 reduce to the case of disjoint intervals and apply problem 6. Do induction to get general k.
12. Show that, as stated in the proof of Proposition 3.9.9, for any a_m > 0 with Σ_m a_m = +∞ there are c_m ↓ 0 with Σ_m c_m a_m = +∞. Hint: Take a sequence m_k such that Σ{a_m : m_k ≤ m < m_{k+1}} ≥ k for each k = 1, 2, .... Let c_m have the same value for m_k ≤ m < m_{k+1}.
13. Show that, as also stated in the proof of Proposition 3.9.9, in [0, 1] for P = U[0, 1] there exist independent sets C_m with any given probabilities. Hint: Use binary expansions. By decomposing the set of positive integers into a countable union of countably infinite sets, show that ([0, 1], P) is isomorphic as a probability space to a countable Cartesian product of copies of itself.
14. Suppose that {C_m}_{m≥1} are independent for P,

Σ_{m=1}^∞ P(C_m)(1 − P(C_m)) = +∞,

and c_m → ∞. Show that {c_m 1_{C_m}}_{m≥1} is not a Donsker class.
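The binary-expansion hint of problem 13 can be checked exactly on a dyadic grid. The sketch below (helper name and parameters are our own, and it handles just two events with dyadic probabilities; the full construction uses infinitely many digits per event) builds each event from its own disjoint block of binary digits and verifies exact independence by counting:

```python
def dyadic_independent_pair(p1_num, b1, p2_num, b2):
    """Two events in [0,1) built from disjoint binary-digit blocks:
    C1 depends only on the first b1 bits of x, C2 only on the next b2
    bits, so P(C1) = p1_num/2^b1 and P(C2) = p2_num/2^b2 exactly.
    Probabilities are computed by counting over all x = j/2^(b1+b2)."""
    t = b1 + b2
    n1 = n2 = n12 = 0
    for j in range(2 ** t):
        hi = j >> b2                  # first b1 bits of x
        lo = j & ((1 << b2) - 1)      # next b2 bits of x
        in1, in2 = hi < p1_num, lo < p2_num
        n1 += in1
        n2 += in2
        n12 += in1 and in2
    total = 2 ** t
    return n1 / total, n2 / total, n12 / total

p1, p2, p12 = dyadic_independent_pair(3, 3, 5, 4)   # P(C1) = 3/8, P(C2) = 5/16
```

Because the two events read disjoint digit blocks, the joint probability factors exactly, which is the mechanism behind the hint.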
Notes
Notes to Section 3.1. For any metric space (S, d), let B_b(S, d) be the σ-algebra generated by all balls B(x, r) := {y : d(x, y) < r}, x ∈ S, r > 0. Then B_b(S, d) is always included in the Borel σ-algebra B(S, d) generated by all the open sets, with B_b(S, d) = B(S, d) if (S, d) is separable. Suppose Y_n are functions from a probability space (Ω, P) into S, measurable for B_b(S, d). Then each Y_n has a law μ_n = P ∘ Y_n^{−1} on B_b(S, d). Dudley (1966, 1967) defined convergence in law of Y_n to Y₀ to mean that ∫* H dμ_n → ∫ H dμ₀ for every bounded continuous real-valued function H on S. Hoffmann-Jørgensen (1984) gave the newer definition adopted generally and here, where the upper integrals and integral are taken over Ω, not S, so that the laws μ_n are not necessarily defined on any particular σ-algebra in S. Hoffmann-Jørgensen's monograph was published in 1991. Andersen (1985a,b), Andersen and Dobrić (1987, 1988), and Dudley (1985) developed further the theory based on Hoffmann-Jørgensen's definition.

Notes to Section 3.2. Blumberg (1935) defined the measurable cover function f*; see also Goffman and Zink (1960). (Thanks to Rae M. Shortt for pointing out Blumberg's paper.) Later, Eames and May (1967) also defined f*. Lemmas 3.2.2 through 3.2.6 are more or less as in Dudley and Philipp (1983, Section 2), except that Lemma 3.2.4(c) is new here. Theorem 3.2.1 and its proof are as in Vulikh (1967, pp. 78-79) for Lebesgue measure on an interval (the proof needs no change). Luxemburg and Zaanen (1983, Lemma 94.4, p. 222) also prove existence of essential suprema and infima of families of measurable (extended) real functions.
Note to Section 3.3. This section is based on parts of Dudley (1985).

Notes to Section 3.4. Perfect probability spaces were apparently first defined by Gnedenko and Kolmogorov (1949, Section 3), and their theory was carried on among others by Ryll-Nardzewski (1953), Sazonov (1962), and Pachl (1979).
Perfect functions are defined and treated in Hoffmann-Jørgensen (1984, 1985) and Andersen (1985a,b); see also Dudley (1985).
Notes to Section 3.5. The existence of almost surely convergent random variables with a given converging sequence of laws was first proved by Skorokhod (1956) for complete separable metric spaces, then by Dudley (1968) for any separable metric space, with a re-exposition in RAP, Section 11.7, and by Wichura (1970) for laws on the σ-algebra generated by balls in an arbitrary metric space
as mentioned in the notes to Section 3.1. The current version was given in Dudley (1985).
Notes to Section 3.6. Hoffmann-Jørgensen (1984), who defined convergence in law in the sense adopted in this chapter, also developed the theory of it as in this section, and partly in a more general form (with nets instead of sequences, and other classes of functions in place of the bounded Lipschitz functions). Andersen and Dobrić (1987, Remark 2.13) pointed out that the portmanteau theorem (as in Topsøe, 1970, Theorem 8.1) "can be extended to the nonmeasurable case. The proof of this extension is the same as the ordinary proof." Much the same might be said of other equivalences in this section. Dudley (1990, Theorem A) gave a form of the portmanteau theorem and (Theorem B) of the metrization theorem (3.6.4).
But not all facts or proofs from the separable case extend so easily: for example, in the separable case there is an inequality for the two metrics, β ≤ 2ρ, in the opposite direction to Lemma 3.6.5, which follows from Strassen's theorem on nearby variables with nearby laws (RAP, Corollary 11.6.5), but Strassen's theorem seems not to extend well to the nonmeasurable case (Dudley, 1994).
Notes to Section 3.7. An early form of the asymptotic equicontinuity condition appeared in Dudley (1966, Proposition 2) and a later form in Dudley (1978).
The equivalence with a different pseudometric τ in Theorem 3.7.2 is due to Giné and Zinn (1986, p. 58). Lemma 3.7.3 is essentially contained in the proof of Skorokhod (1976, Theorem 1), as Erich Berger kindly pointed out. See also Ershov (1975).

Notes to Section 3.8. Alexander (1987, Corollary 2.7) stated Theorem 3.8.1 but didn't publish a proof of it, although he had written out an unpublished proof several years earlier. He says that the result is "an extension of a slightly weaker result of Dudley (1981)," where F₂ is finite, but this author himself doesn't think his 1981 result was only "slightly weaker"! The proof presented was suggested by Miguel Arcones in Berkeley during the fall of 1991, but I take responsibility for any possible errors in it. Apparently van der Vaart (1996, Theorem A.3) first published a proof.
Notes to Section 3.9. Theorem 3.9.1 first appeared in Dudley (1978, Section 2), Theorem 3.9.8 and Proposition 3.9.10 in Dudley (1981), and Proposition 3.9.9 in Dudley (1984).
References

Alexander, K. S. (1987). The central limit theorem for empirical processes on Vapnik-Cervonenkis classes. Ann. Probab. 15, 178-203.
Andersen, N. T. (1985a). The central limit theorem for non-separable valued functions. Z. Wahrscheinlichkeitstheorie verw. Gebiete 70, 445-455.
Andersen, N. T. (1985b). The calculus of non-measurable functions and sets. Various Publication Series no. 36, Matematisk Institut, Aarhus Universitet.
Andersen, N. T., and Dobrić, V. (1987). The central limit theorem for stochastic processes. Ann. Probab. 15, 164-177.
Andersen, N. T., and Dobrić, V. (1988). The central limit theorem for stochastic processes II. J. Theoret. Probab. 1, 287-303.
Bauer, H. (1981). Probability Theory and Elements of Measure Theory, 2d ed. Academic Press, London.
Blumberg, Henry (1935). The measurable boundaries of an arbitrary function. Acta Math. (Uppsala) 65, 263-282.
Cohn, D. L. (1980). Measure Theory. Birkhäuser, Boston.
Dudley, R. M. (1966). Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces. Illinois J. Math. 10, 109-126.
Dudley, R. M. (1967). Measures on non-separable metric spaces. Illinois J. Math. 11, 449-453.
Dudley, R. M. (1968). Distances of probability measures and random variables. Ann. Math. Statist. 39, 1563-1572.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Ann. Probab. 6, 899-929; Correction 7 (1979), 909-911.
Dudley, R. M. (1981). Donsker classes of functions. In Statistics and Related Topics (Proc. Symp. Ottawa, 1980), North-Holland, New York, 341-352.
Dudley, R. M. (1984). A course on empirical processes. École d'été de probabilités de Saint-Flour, 1982. Lecture Notes in Math. (Springer) 1097, 1-142.
Dudley, R. M. (1985). An extended Wichura theorem, definitions of Donsker class, and weighted empirical distributions. In Probability in Banach Spaces V (Proc. Conf. Medford, 1984), Lecture Notes in Math. (Springer) 1153, 141-178.
Dudley, R. M. (1990). Nonlinear functionals of empirical measures and the bootstrap. In Probability in Banach Spaces VII (Proc. Conf. Oberwolfach, 1988), Progress in Probability 21, Birkhäuser, Boston, 63-82.
Dudley, R. M. (1994). Metric marginal problems for set-valued or non-measurable variables. Probab. Theory Related Fields 100, 175-189.
Dudley, R. M., and Philipp, Walter (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrscheinlichkeitsth. verw. Gebiete 62, 509-552.
Eames, W., and May, L. E. (1967). Measurable cover functions. Canad. Math. Bull. 10, 519-523.
Ershov, M. P. (1975). The Choquet theorem and stochastic equations. Analysis Math. 1, 259-271.
Giné, E., and Zinn, J. (1986). Lectures on the central limit theorem for empirical processes. In Probability and Banach Spaces (Proc. Conf. Zaragoza, 1985), Lecture Notes in Math. (Springer) 1221, 50-113.
Gnedenko, B. V., and Kolmogorov, A. N. (1949). Limit Distributions for Sums of Independent Random Variables. Moscow. Transl. and ed. by K. L. Chung, Addison-Wesley, Reading, MA, 1954; rev. ed. 1968.
Goffman, C., and Zink, R. E. (1960). Concerning the measurable boundaries of a real function. Fund. Math. 48, 105-111.
Hoffmann-Jørgensen, Jørgen (1984). Stochastic processes on Polish spaces. Published (1991): Various Publication Series no. 39, Matematisk Institut, Aarhus Universitet.
Hoffmann-Jørgensen, Jørgen (1985). The law of large numbers for non-measurable and non-separable random elements. Astérisque 131, 299-356.
Luxemburg, W. A. J., and Zaanen, A. C. (1983). Riesz Spaces, vol. 2. North-Holland, Amsterdam.
Pachl, Jan K. (1979). Two classes of measures. Colloq. Math. 42, 331-340.
Ryll-Nardzewski, C. (1953). On quasi-compact measures. Fund. Math. 40, 125-130.
Sazonov, V. V. (1962). On perfect measures (in Russian). Izv. Akad. Nauk SSSR 26, 391-414.
Skorokhod, Anatolii Vladimirovich (1956). Limit theorems for stochastic processes. Theor. Probab. Appl. 1, 261-290.
Skorokhod, A. V. (1976). On a representation of random variables. Theor. Probab. Appl. 21, 628-632 (English), 645-648 (Russian).
Topsøe, Flemming (1970). Topology and Measure. Lecture Notes in Math. (Springer) 133.
van der Vaart, Aad (1996). New Donsker classes. Ann. Probab. 24, 2128-2140.
Vulikh, B. Z. (1961). Introduction to the Theory of Partially Ordered Spaces (transl. by L. F. Boron, 1967). Wolters-Noordhoff, Groningen.
Wichura, Michael J. (1970). On the construction of almost uniformly convergent random variables with given weakly convergent image laws. Ann. Math. Statist. 41, 284-291.
4
VapnikCervonenkis Combinatorics
This chapter will treat some classes of sets satisfying a combinatorial condition. In Chapter 6, it will be shown that under a mild measurability condition to be treated in Chapter 5, these classes have the Donsker property for all probability measures P on the sample space, and satisfy a law of large numbers (Glivenko-Cantelli property) uniformly in P. Moreover, for either of these limit-theorem properties of a class of sets (without assuming any measurability), the Vapnik-Cervonenkis property is necessary (Section 6.4). The present chapter will be self-contained, not depending on anything earlier in this book, except in some examples.
4.1 Vapnik-Cervonenkis classes
Let X be any set and C a collection of subsets of X. For A ⊂ X let C ∩ A := A ∩ C := {C ∩ A : C ∈ C}. Let card(A) := |A| denote the cardinality (number of elements) of A and 2^A := {B : B ⊂ A}. Let Δ^C(A) := |C ∩ A|. If C ∩ A = 2^A, then C is said to shatter A. If A is finite, then C shatters A if and only if Δ^C(A) = 2^{|A|}.
Let m^C(n) := max{Δ^C(F) : F ⊂ X, |F| = n} for n = 0, 1, ..., or if |X| < n let m^C(n) := m^C(|X|). Then m^C(n) ≤ 2^n for all n. Let

V(C) := inf{n : m^C(n) < 2^n} if this is finite, := +∞ if m^C(n) = 2^n for all n;

S(C) := sup{n : m^C(n) = 2^n}, := −1 if C is empty.

Then S(C) = V(C) − 1, and S(C) is the largest cardinality of a set shattered by C, or +∞ if arbitrarily large finite sets are shattered. So, V(C) is the smallest n
such that no set of cardinality n is shattered by C. If V(C) < ∞, or equivalently if S(C) < ∞, C will be called a Vapnik-Cervonenkis class or VC class. If X is finite, with n elements, then clearly 2^X is a VC class, with S(2^X) = n. Let

$\binom{N}{j}$ := N!/(j!(N − j)!) for j = 0, 1, ..., N, and $\binom{N}{j}$ := 0 for j > N.
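For orientation, these binomial coefficients are exactly what bounds m^C(n): by the classical Sauer-Shelah inequality (a standard fact, stated here only for context), m^C(n) ≤ Σ_{j=0}^{S(C)} $\binom{n}{j}$. A brute-force check on a small class, with helper names of our own choosing, illustrates both the definitions above and the bound:

```python
from itertools import combinations
from math import comb

def shatters(C, A):
    """C shatters A iff the traces c & A realize all subsets of A."""
    return len({frozenset(c & A) for c in C}) == 2 ** len(A)

def m_C(C, X, n):
    """m^C(n): the largest number of traces on an n-point subset of X."""
    return max(len({frozenset(c & set(F)) for c in C})
               for F in combinations(X, n))

def S(C, X):
    """Largest cardinality of a subset of X shattered by C."""
    best = -1
    for n in range(len(X) + 1):
        if any(shatters(C, set(F)) for F in combinations(X, n)):
            best = n
    return best

X = list(range(10))
# discrete "intervals" {a, a+1, ..., b-1}, including the empty set
intervals = [frozenset(range(a, b)) for a in range(11) for b in range(a, 11)]
```

Intervals shatter pairs but never three points (no trace can pick the two outer points without the middle one), so S = 2, and the Sauer-Shelah bound holds with equality here.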
Then (4.1.5) holds for n ≥ 7, which follows from (e/2)^n > 2n^{1/2}, n ≥ 7: for f(x) := (e/2)^x and g(x) := 2x^{1/2} it is straightforward to check that f(7) > g(7), f′(7) > g′(7), f″ > 0, and g″ < 0, so f(x) > g(x) for all x ≥ 7. Now suppose the proposition has been proved for n = k + i, i = 2, ..., j, and for n = k + J, J := j + 1, for k = 1, ..., K, as we have done for j = 2 and for K = 1. We need to prove (4.1.5) for n = k + J and k = K + 1. We have k + j = K + J.

4.2 Generating Vapnik-Cervonenkis classes

For a class G of real-valued functions on a set X let pos(g) := {x : g(x) > 0} and nn(g) := {x : g(x) ≥ 0} for g ∈ G, and

pos(G) := {pos(g) : g ∈ G}, nn(G) := {nn(g) : g ∈ G},
U(G) := pos(G) ∪ nn(G).
4.2.1 Theorem Let H be an m-dimensional real vector space of functions on a set X, let f be any real function on X, and let H₁ := {f + h : h ∈ H}. Then S(pos(H₁)) = S(nn(H₁)) = m. If H contains the constants, then also S(U(H₁)) = m.
Proof First it will be shown that S(pos(H₁)) = m. Clearly card(X) ≥ m. If card(X) = m, then H₁ = H = ℝ^X, the set of all real-valued functions on X, so the result holds.

Otherwise, let A ⊂ X with card(A) = m + 1. Let G be the vector space {af + h : a ∈ ℝ, h ∈ H}. Let r_A : G → ℝ^A be the restriction of functions in G to A. If r_A is not onto, take 0 ≠ v ∈ ℝ^A where v is orthogonal to r_A(G) for the usual inner product (·,·)_A. Let A⁺ := {x ∈ A : v(x) > 0}. We can assume A⁺ is nonempty, replacing v by −v if necessary. If A⁺ = A ∩ pos(g) for some g ∈ G, then (r_A(g), v)_A > 0, a contradiction. So pos(G) doesn't shatter A.

Suppose instead that r_A(G) = ℝ^A. Then r_A is 1-1 on G, f ∉ H, and r_A(H₁) is a hyperplane in ℝ^A not containing 0, so that for some v ∈ ℝ^A, (j, v)_A = −1 for all j ∈ r_A(H₁). Again let A⁺ := {x ∈ A : v(x) > 0}. If A⁺ = A ∩ pos(f + h) for some h ∈ H, then (r_A(f + h), v)_A ≥ 0, a contradiction (here A⁺ may be empty). Thus pos(H₁) never shatters A, so S(pos(H₁)) ≤ m.

For each x ∈ X, a linear form δ_x is defined on H by δ_x(h) := h(x), h ∈ H.
Let H′ be the vector space of all real linear forms on H. Then H′ is m-dimensional. Let H_X be the linear span in H′ of the set of all δ_x, x ∈ X,

(4.2.2) H_X := {Σ_{j=1}^r a_j δ_{x_j} : x_j ∈ X, a_j ∈ ℝ, r = 1, 2, ...}.

The map h ↦ (ψ ↦ ψ(h)), h ∈ H, ψ ∈ H_X, is 1-1 and linear from H into (H_X)′, so H_X is m-dimensional. Take B = {x₁, ..., x_m} ⊂ X such that the δ_{x_i} are linearly independent. So r_B(H) = ℝ^B, r_B(H₁) = ℝ^B, and pos(r_B(H₁)) = 2^B, so S(pos(H₁)) = m.

Then S(nn(H₁)) = m by taking complements (Proposition 4.1.8). If H contains the constant functions, then the sets nn(f), f ∈ H₁, are the same as
the sets {f ≥ t}, f ∈ H₁, t ∈ ℝ, and the sets pos(f), f ∈ H₁, are the same as the sets {f > t}, f ∈ H₁, t ∈ ℝ. Now for any finite subset A of X, f ∈ H₁ and t ∈ ℝ, since f takes only finitely many values on A, there exist s and u such that A ∩ {f ≥ t} = A ∩ {f > s} and A ∩ {f > t} = A ∩ {f ≥ u}. So in this case S(U(H₁)) = m.

Examples. (I) Let H := P_{d,k} be the space of all polynomials of degree at most k on ℝ^d. Then for each d and k, H is a finite-dimensional vector space of functions, so pos(H) is a Vapnik-Cervonenkis class. For k = 2, it follows specifically that the set of all ellipsoids in ℝ^d is included in a Vapnik-Cervonenkis class and thus is one.

(II) Let X = ℝ. Let H be the 1-dimensional space of linear functions f(x) = cx, x ∈ ℝ, c ∈ ℝ. Then S(pos(H)) = S(nn(H)) = 1 by Theorem 4.2.1, but U(H) shatters {0, 1}. Since sets in U(H) are convex (half-lines), it follows that S(U(H)) = 2. So the condition that H contains the constants can't just be dropped from Theorem 4.2.1 for U(H).

Let X be a real vector space of dimension m. Let H be the space of all real affine functions on X, in other words functions of the form h + c where h is real linear and c is any real constant. Then H has dimension m + 1 and pos(H) is the set of all open half-spaces of X. Letting f = 0 in Theorem 4.2.1 for this H gives a special case known as Radon's theorem. On the other hand, Theorem 4.2.1 for f = 0 with general X and H follows from Radon's theorem via the following stability fact.
4.2.3 Theorem If X and Y are sets, F is a function from X into Y, C ⊂ 2^Y, and F⁻¹(C) := {F⁻¹(A) : A ∈ C}, then S(F⁻¹(C)) ≤ S(C). If F is onto Y, then S(F⁻¹(C)) = S(C).

Proof Let F⁻¹(C) shatter {x₁, ..., x_m} where x_i ≠ x_j for i ≠ j. Then F(x_i) ≠ F(x_j) for i ≠ j and C shatters {F(x₁), ..., F(x_m)}. So S(F⁻¹(C)) ≤ S(C). If F is onto Y and H ⊂ Y with card(H) = m, choose G ⊂ X such that F takes G 1-1 onto H. Then if C shatters H, F⁻¹(C) shatters G, so S(F⁻¹(C)) = S(C).

Now let X be any set and G a finite-dimensional real vector space of real functions on X. Then there is a natural map F : x ↦ δ_x from X into the space of linear functions on G. Then by Theorem 4.2.3 one could deduce Theorem 4.2.1 from its special case where X is an m- or (m + 1)-dimensional real vector space and f and all functions in H are affine, so that sets in pos(H₁) are open half-spaces.
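With H the affine functions on ℝ² (so m = 3), Theorem 4.2.1 says open half-planes shatter some 3-point set but no 4-point set. A direction-sweep search, a numerical illustration and not a proof (the angle grid and point sets are our own choices), finds every labeling realizable by an open half-plane {x : w·x > t}:

```python
import math

def halfplane_patterns(pts, n_dirs=720):
    """All subsets of pts cut off by some open half-plane {x: w.x > t},
    found by sweeping directions w and taking all thresholds t between
    the sorted projections (grid offset by half a step to avoid ties)."""
    pats = set()
    for k in range(n_dirs):
        th = 2 * math.pi * (k + 0.5) / n_dirs
        w = (math.cos(th), math.sin(th))
        order = sorted(range(len(pts)),
                       key=lambda i: w[0] * pts[i][0] + w[1] * pts[i][1])
        for cut in range(len(pts) + 1):
            pats.add(frozenset(order[cut:]))  # points strictly above the cut
    return pats

tri = halfplane_patterns([(0, 0), (1, 0), (0, 1)])         # 3 points: shattered
sq = halfplane_patterns([(0, 0), (1, 0), (0, 1), (1, 1)])  # unit square corners
```

For the square, the two diagonal pairs are exactly the labelings no half-plane achieves (a Radon partition), so this 4-point set is not shattered, consistent with S = 3.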
Next it will be seen how a bounded number of Boolean operations preserves the Vapnik-Cervonenkis property.

4.2.4 Theorem Let X be a set, C ⊂ 2^X, and for k = 1, 2, ..., let C(k) be the union of all (Boolean) algebras generated by k or fewer elements of C. Then dens(C(k)) ≤ k·dens(C), so if S(C) < ∞ then S(C(k)) < ∞.

Proof Let dens(C) = r, so that for any ε > 0 there is some M < ∞ such that m^C(n) ≤ Mn^{r+ε} for all n. For any A ⊂ X we have A ∩ C(k) = (A ∩ C)(k). An algebra 𝒜 with k generators A₁, ..., A_k has at most 2^k atoms, which are those nonempty sets that are intersections of some of the A_i and the complements of the rest. Sets in 𝒜 are unions of atoms, so |𝒜| ≤ 2^{2^k}. Thus |A ∩ C(k)| ≤ 2^{2^k}|A ∩ C|^k ≤ 2^{2^k}M^k|A|^{k(r+ε)}. Letting ε ↓ 0 gives dens(C(k)) ≤ k·dens(C). If S(C) < ∞ then by Corollary 4.1.7, S(C(k)) < ∞.

The constant 2^{2^k} is very large if k is at all large. Let C(∩k) be the class of all intersections of at most k sets in C. Then C(∩k) ⊂ C(k). For C(∩k) the constant 2^{2^k} is not needed. Theorems 4.2.1 and 4.2.4 can be combined to generate Vapnik-Cervonenkis classes. For example, half-spaces in ℝ^d form a VC class. Intersections of at most k half-spaces give convex polytopes with at most k faces, so these form a VC class.
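The atom count in the proof of Theorem 4.2.4 can be checked directly. The helper below (our own illustration, not the book's) generates the algebra from k sets via its membership-signature atoms and confirms the bounds 2^k on atoms and 2^{2^k} on members:

```python
from itertools import product

def generated_algebra(X, gens):
    """Boolean algebra on X generated by the sets in gens: its atoms are
    the nonempty cells on which membership in every generator is constant,
    and every member of the algebra is a union of atoms."""
    cells = {}
    for x in X:
        key = tuple(x in g for g in gens)   # membership signature of x
        cells.setdefault(key, set()).add(x)
    atoms = [frozenset(s) for s in cells.values()]
    members = {frozenset().union(*(a for a, bit in zip(atoms, bits) if bit))
               for bits in product([0, 1], repeat=len(atoms))}
    return atoms, members

X = set(range(12))
gens = [set(range(6)), set(range(3, 9)), set(range(0, 12, 2))]
atoms, algebra = generated_algebra(X, gens)
```

Since the atoms are disjoint and nonempty, distinct unions of atoms are distinct, so the algebra has exactly 2^{#atoms} members and is closed under complement.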
Remarks. Let X be an infinite set, r = 1, 2, ..., and let C_r be the collection of all subsets of X with at most r elements. Then clearly dens(C_r) = S(C_r) = r. It's easy to check that D := C_r(k) consists of all sets B such that either B or X∖B has at most kr elements. Thus m^D(n) ≤ 2Σ_{j≤kr} $\binom{n}{j}$.

... 2 sets which contain x, but it only contains one, a contradiction. This proves the first conclusion. The second follows on taking complements, setting A = X in Proposition 4.3.3.
On ℤ₂^X = 2^X there is a product topology coming from the discrete topology on ℤ₂. The product topology is compact by Tychonoff's theorem (RAP, Theorem 2.2.8).

4.3.5 Proposition For any set X, any n-maximal class C ⊂ 2^X is closed and therefore compact in 2^X.

Proof Suppose C_α → C is a convergent net in 2^X with C_α ∈ C for all α. Then for any finite set F ⊂ X, there is some α with C_α ∩ F = C ∩ F, so S(C ∪ {C}) = S(C) and C ∈ C. So C is closed (RAP, Theorem 2.1.3).

A class C of subsets of a set X will be called complemented if X∖A ∈ C for every A ∈ C.
4.3.6 Theorem If S(C) = n, C ⊂ 𝒜 strictly, and C is complemented, then C is not (𝒜, n)-maximal.

Proof For any finite set F ⊂ X and G ⊂ F, G ∈ C ∩ F if and only if F∖G ∈ C ∩ F. Thus if |F| = n + 1, then |C ∩ F| ≤ 2^{n+1} − 2. So, for any A ∈ 𝒜∖C, S(C ∪ {A}) = n.

If F is a k-dimensional real vector space of real-valued functions on a set X containing the constants, and C is the collection U(F) of all sets {x : f(x) > 0} or {x : f(x) ≥ 0} for f ∈ F, then S(C) = k by Theorem 4.2.1. Since C is complemented, if U(F) ≠ 2^X, then C is not k-maximal.
*4.4 Classes of index 1
Let X be any set and C_k the collection of all subsets of X with at most k elements. Then clearly S(C_k) = k. Also, C_k is k-maximal, since if A ∉ C_k, A ⊂ X, then |A| > k, and if B is any subset of A with |B| = k + 1 then B is shattered by C_k ∪ {A}. For C = C_k we have m^C(n) = Σ_{j≤k} $\binom{n}{j}$.

... 2, a contradiction. So (b) holds. Now (b) implies (c) directly. If (c) holds and |Y| = 2, since 2^Y doesn't have a treelike ordering by inclusion, C must not shatter Y, so (a) follows.
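The link between index 1 and treelike ordering can be watched by brute force. In this sketch (helper names are ours), a class is "treelike" when the members below each of its sets form a chain under inclusion, and such a class cannot shatter a two-point set:

```python
def shatters_pair(C, x, y):
    """True iff the traces of C on {x, y} give all four subsets."""
    return len({(x in A, y in A) for A in C}) == 4

def treelike(C):
    """Every {B in C: B subset of A} is linearly ordered by inclusion."""
    sets = [set(A) for A in C]
    for A in sets:
        below = [B for B in sets if B <= A]
        if any(not (B <= D or D <= B) for B in below for D in below):
            return False
    return True

chain = [set(), {1}, {1, 2}, {1, 2, 3}]   # treelike: a single chain
cross = [set(), {1}, {2}, {1, 2}]         # {1} and {2} sit below {1, 2}
```

The chain realizes at most three of the four traces on any pair, while the second class shatters {1, 2} and, correspondingly, fails the treelike condition below {1, 2}.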
4.4.3 Proposition Let X be a set and 𝒜 ⊂ 2^X where ∅ ∈ 𝒜 and for any B and C in 𝒜, B ∩ C ∈ 𝒜. If C is (𝒜, 1)-maximal and satisfies (4.4.2) for any x ≠ y in X, then ∅ ∈ C and B ∩ C ∈ C for any B and C in C.

Proof If x ≠ y and (4.4.2) holds for A, then A ∩ {x, y} = ∅ = ∅ ∩ {x, y}, so adjoining ∅ to C doesn't induce any additional subsets of sets with two elements, and S(C ∪ {∅}) = 1, so by maximality ∅ ∈ C. Suppose B, C ∈ C and S(C ∪ {B ∩ C}) > 1. Then for some x ≠ y in X, B ∩ C ∩ {x, y} ≠ D ∩ {x, y} for all D ∈ C. Then by (4.4.2), we can assume x ∈ B ∩ C. If {x, y} ⊂ B ∩ C ⊂ B then taking D = B would give a contradiction, so y ∉ B ∩ C. Now B ∩ C ∩ {x, y} = {x} ≠ D ∩ {x, y} for D = B or C implies y ∈ B and y ∈ C, again a contradiction. So B ∩ C ∈ C.

4.4.4 Proposition Let X be a set and C a finite class of subsets of X with S(C) = 1 such that for any x ≠ y in X, (4.4.2) holds. Let D := D(C) consist of ∅ and all intersections of nonempty subclasses of C. Then S(D) = 1. For each nonempty set D ∈ D there is a C := C(D) ∈ D such that C ⊂ D strictly (C ≠ D) and if B is any set in D with B ⊂ D strictly, then B ⊂ C.

Proof By Theorem 4.3.1, let C ⊂ ℰ where ℰ is 1-maximal. Then by Proposition 4.4.3 for 𝒜 = 2^X and induction, C ⊂ D ⊂ ℰ, so S(D) = 1. Clearly, D is finite. For each nonempty D ∈ D, by Theorem 4.4.1, {B ∈ D : B ⊂ D} is linearly ordered by inclusion and contains ∅, so it has a largest element C(D) other than D itself.
4.4.5 Proposition Under the hypotheses of Proposition 4.4.4, the sets D∖C(D) for distinct nonempty D ∈ D are all disjoint and are nonempty.

Proof Let A ≠ B in D. If B ⊂ A, then B ⊂ C(A), so B and hence B∖C(B) are disjoint from A∖C(A). Otherwise, A ∩ B ⊂ B strictly, and then A ∩ B ⊂ C(B), so again A∖C(A) is disjoint from B∖C(B). That A∖C(A) ≠ ∅ for A ≠ ∅ follows from the definitions.
A graph is a nonempty set S together with a set E of unordered pairs {x, y} for some x ≠ y in S. Then S will be called the set of nodes and E the set of edges of the graph. The graph (S, E) is called a tree if (a) it is connected, in other words for any x and y in S there are a finite n and x_i ∈ S, i = 0, 1, ..., n, such that x₀ = x, x_n = y, and {x_{k−1}, x_k} ∈ E for
k= (b) the graph is acyclic, which means that there is no cycle, where a cycle
is a set of distinct xl,
, x E S such that n > 3, and letting xo := x,,,
{xk_1,xk} E Efork= 1, ,n. 4.4.6 Theorem (a) For in nodes, for any positive integer m, there exist connected graphs with in  1 edges. (b) A connected graph with in nodes cannot have fewer than in  1 edges. (c) A connected graph with m nodes has exactly m  1 edges if and only if it is a tree.
Proof (a) is clear. (b) will be proved by induction. It is clearly true for in = 1, 2. Suppose (S, E) is a connected graph with BSI = in, I EI < in  2, and m > 3. The edges in E contain at most 2m  4 nodes, counted with multiplicity, so at least four nodes appear in only one edge each, or some node is in no edge, a contradiction. Select a node in only one edge and delete it and the edge that contains it. The remaining graph must be connected, but is not by induction assumption, a contradiction, so (b) holds.
For (c), let (S, E) be a connected graph with ISO = in and IEJ = m  1. If the graph contains a cycle, we can delete any one edge in the cycle and the graph remains connected, contradicting (b). So (S, E) is a tree. Conversely, let (S, E) be a tree with I SI = m. It will be proved by induction that I E I < in  1.
This is clearly true for m = 1, 2. Suppose I E I > in > 3. Take a maximal , xk} C S such that the xi are distinct and {xj_1, Xj) E E for set C := {xl, j = 2, , k. Then there is no y 0 X2 with {x1, y) E E: y cannot be any xj, j > 3, or there would be a cycle, and there is no such y C since C and k are maximal. So we can delete the node xl and the edge {xl, x21 from the graph, leaving a graph which is still a tree with in  1 nodes and at least in  1 edges, contradicting the induction hypothesis and so proving (c). Let the class D in Propositions 4.4.4 and 4.4.5 form the nodes of a graph G
whose edges are the pairs {C(D), D} for D E D, D # 0.
148
VapnikCervonenkis Combinatorics
4.4.7 Proposition The graph G is a tree. Proof If D has m elements, then there are exactly m  I pairs {C(D), D}, for D E D, D # 0. Starting with any D E D, we have a decreasing sequence of sets D D C(D) D C(C(D)) D , which must end with the empty set, so all sets in D are connected in G via the empty set, and G is connected. Then by Theorem 4.4.6 it is a tree. 4.4.8 Proposition Let X be a finite set. Let C be 1maximal in X and suppose (4.4.2) holds for all x y in X. Then C = D(C) as defined in Proposition 4.4.4. The sets D\C(D)for nonempty D E C are all the singletons {x}, x E X.
IfIXI=mthen ICI =m+1. Proof C = D(C) by Proposition 4.4.4. Suppose that for some D E C, D\C(D)
has two or more elements. Then for some B, C := C(D) C B C D where both inclusions are strict. It will be shown that S(C U {B}) = 1. If not, then for some x 0 y, C U {B} shatters {x, y}, so B fl {x, y} 54 F fl {x, y} for all F E C. Letting F = C shows that B fl {x, y} 0 0. Likewise, letting F = D shows that B fl {x, y} {x, y}. So we can assume B fl {x, y} = {x}. Taking F = D shows
that y e D. If G n {x, y} = (y) for some G E C, then y E G fl D E C, and G n D C D strictly, so G n D C C and y E C C B, giving a contradiction. So C U { BI does not shatter {x, y}, and S(C U { B)) = 1, contradicting 1maximality of C.
So, each set D \ C(D) for 0 0 D E C is a singleton. Each singleton {x} equals D \ C(D) for at most one D E C by Proposition 4.4.5. By Proposition
4.3.4, X = UDEC D. For any x E X take D1 E C with x e D. Let . Forsomem, D,,, = 0, and {x} = Dj\C(Dj) D,+1 := C(D,) forn = 1, 2, for some j < m. So all singletons are of the form D \ C(D), D E C. This
gives a 11 correspondence between singletons and nonempty sets in D, so there are exactly m such sets and I D I = m + 1. Suppose throughout this paragraph that C is a class of two or more sets such that (4.4.2) holds with 0 replaced by {x, y}. Then the class of complements, N := {X \ C: C E C), satisfies the original hypotheses of Proposition 4.4.1. If C is 1maximal, so is N by Proposition 4.3.3. So Theorem 4.4.1 and the facts numbered 4.4.3 through 4.4.7 apply to N, and so does Proposition 4.4.8 if X is finite. Then, C itself has a "cotreelike" ordering, where for each C E C, {D E C : C C D} is linearly ordered by inclusion. Propositions 4.4.3 and 4.4.4 apply to C if 0 is replaced by X and intersections by unions; in Proposition 4.4.4, we will have an immediate successor D(C) D C instead of a
predecessor; and sets D(C) \ C instead of D \ C(D) in Propositions 4.4.5 and 4.4.8. The resulting tree (Proposition 4.4.7) then branches out as sets become smaller rather than larger. Next will be several facts in the general case, that is, without the hypothesis (4.4.2).
4.4.9 Theorem Let X be any set and C any collection of subsets with S(C) = 1. Then for any C ∈ C, the collection C_{X\C} := {B \ C : B ∈ C} satisfies (4.4.2) for any x ≠ y as a collection of subsets of X \ C. Likewise, C_{C\·} := {C \ B : B ∈ C} satisfies (4.4.2) for any x ≠ y as a collection of subsets of C, S(C_{C\·}) ≤ 1 and S(C_{X\C}) ≤ 1.
Proof Letting B = C shows that both classes C_{C\·} and C_{X\C} contain ∅, so (4.4.2) holds for them. Both have index S ≤ 1 by Propositions 4.3.2 and 4.3.3.
So, for an arbitrary class C with S(C) = 1, we have by Theorem 4.4.1 a treelike partial ordering by inclusion in one part X \ C of X and a cotreelike ordering in the complementary part C, for any C ∈ C. If also X \ C happens to be in C, both orderings are linear. To see how the two orderings fit together in general, Proposition 4.3.3 gives:
4.4.10 Corollary Let C be any class of sets with S(C) = 1 and A ∈ C. Let D := AΔC := {AΔC : C ∈ C}. Then S(D) = 1 and ∅ ∈ D. If C is 1-maximal, so is D. Then Theorem 4.4.1, Proposition 4.4.3, and if C is finite, Propositions 4.4.4, 4.4.5, 4.4.7, and if X is finite, Proposition 4.4.8, apply to D.
The last sentence in Proposition 4.4.8 has a converse and extension:
4.4.11 Proposition Let X be finite with m elements and C ⊂ 2^X with S(C) = 1. Then C is 1-maximal if and only if |C| = m + 1.
Proof For any fixed C ∈ C, we can replace C by CΔC := {CΔB : B ∈ C} without loss of generality by Proposition 4.3.3. So we can assume ∅ ∈ C, and then (4.4.2) holds for all x, y. Then Proposition 4.4.8 implies "only if."
Conversely, let S(C) = 1 and |C| = m + 1. Let C ⊂ D strictly. Then |D| > m + 1, so by Sauer's lemma (Theorem 4.1.2), S(D) ≥ 2. So C is 1-maximal.
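Proposition 4.4.11 can be confirmed exhaustively for small X. The sketch below (function names are our own) takes m = |X| = 3, runs over every class C ⊂ 2^X with S(C) = 1, and checks that adjoining any further set raises the index exactly when |C| = m + 1; here S(C) is computed as the largest cardinality of a shattered set.

```python
from itertools import chain, combinations

X = (0, 1, 2)
all_subsets = [frozenset(s) for s in
               chain.from_iterable(combinations(X, r) for r in range(len(X) + 1))]

def shatters(cls, pts):
    # cls shatters pts iff every subset of pts is induced by some C in cls
    return len({C & frozenset(pts) for C in cls}) == 2 ** len(pts)

def vc_index(cls):
    # S(C): the largest cardinality of a set shattered by cls
    return max(r for r in range(len(X) + 1)
               if any(shatters(cls, p) for p in combinations(X, r)))

m = len(X)
for bits in range(1, 2 ** len(all_subsets)):
    cls = [s for i, s in enumerate(all_subsets) if bits >> i & 1]
    if vc_index(cls) != 1:
        continue
    # 1-maximal: every proper enlargement has index at least 2
    maximal = all(vc_index(cls + [s]) > 1 for s in all_subsets if s not in cls)
    assert maximal == (len(cls) == m + 1)      # Proposition 4.4.11
print("ok")
```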
Now m + 1 = m^C(m). For classes satisfying stronger conditions, more can be proved:
4.5.7 Theorem Let X and Y be sets, C ⊂ 2^X and D ⊂ 2^Y. If C is linearly ordered by inclusion, then S(C ⊗ D) ≤ S(D) + 1.
Proof Let m = S(D). We can assume that m < ∞ and X ∈ C. Let F ⊂ X × Y with |F| = m + 2. Suppose C ⊗ D shatters F. The subsets of Π_X F ⊂ X induced by C are linearly ordered by inclusion. Let G be the next largest subset other than Π_X F itself, and p ∈ Π_X F \ G. Take y such that (p, y) ∈ F. Then all subsets of F containing (p, y) are induced by sets of the form Π_X F × D, D ∈ D. Thus H := F \ {(p, y)} is shattered by such sets. Now |H| = m + 1, and Π_Y must be one-to-one on H or it could not be shattered by sets of the given form. But then D shatters Π_Y H, giving a contradiction.
4.5.8 Theorem For any set X and C, D ⊂ 2^X, if C is linearly ordered by inclusion, then S(C □ D) ≤ S(D) + 1 for □ = ⊓ or ⊔, where C ⊓ D := {C ∩ D : C ∈ C, D ∈ D} and C ⊔ D := {C ∪ D : C ∈ C, D ∈ D}.
Proof First consider intersections. Again let m = S(D), and we can assume m < ∞. Let F ⊂ X with |F| = m + 2. Suppose C ⊓ D shatters F. Now C_F is linearly ordered by inclusion. We can assume that X ∈ C, so F ∈ C_F. Let G be the next largest element of C_F and p ∈ F \ G. Each set A ⊂ F containing p is of the form C ∩ D ∩ F, C ∈ C, D ∈ D, so we must have C ∩ F = F, and A = D ∩ F. So D shatters F \ {p}, a contradiction. Since the complements of a class linearly ordered by inclusion are again linearly ordered by inclusion, the case of unions follows by taking complements (Proposition 4.1.8).
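Theorem 4.5.8 lends itself to random testing. In the sketch below (names ours), C is the chain of initial segments of a 6-point set and D is a small random class; the bound S(C □ D) ≤ S(D) + 1 is checked for both intersections and unions.

```python
import random
from itertools import combinations

random.seed(2)
X = list(range(6))

def shatters(cls, pts):
    return len({C & frozenset(pts) for C in cls}) == 2 ** len(pts)

def vc_index(cls):
    # largest cardinality of a subset of X shattered by cls
    return max(r for r in range(len(X) + 1)
               if any(shatters(cls, p) for p in combinations(X, r)))

# a class linearly ordered by inclusion
chain_cls = [frozenset(range(k)) for k in range(len(X) + 1)]

for _ in range(25):
    D = [frozenset(random.sample(X, random.randrange(len(X) + 1)))
         for _ in range(5)]
    bound = vc_index(D) + 1
    inters = [C & d for C in chain_cls for d in D]
    unions = [C | d for C in chain_cls for d in D]
    assert vc_index(inters) <= bound and vc_index(unions) <= bound
print("ok")
```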
Then by Theorem 4.2.6 and induction we have:
4.5.9 Corollary Let C_i be classes of subsets of a set X and C := {⋂_{i=1}^n C_i : C_i ∈ C_i, i = 1, …, n}, where each C_i is linearly ordered by inclusion. Then S(C) ≤ n.
Definition. For any set X and Vapnik-Červonenkis class C ⊂ 2^X, C will be called bordered if for some F ⊂ X, with |F| = S(C), and x ∈ X \ F, F is shattered by sets in C all containing x.
4.5.10 Theorem Let C_i ⊂ 2^{X(i)} be bordered Vapnik-Červonenkis classes for i = 1, 2. Then S(C_1 ⊗ C_2) ≥ S(C_1) + S(C_2).
Proof Take F_i ⊂ X(i) and x_i as in the definition of "bordered." Let H := ({x_1} × F_2) ∪ (F_1 × {x_2}). Then |H| = S(C_1) + S(C_2). For any V_i ⊂ F_i, i = 1, 2, take C_i ∈ C_i with C_i ∩ F_i = V_i and x_i ∈ C_i. Then (C_1 × C_2) ∩ H = ({x_1} × V_2) ∪ (V_1 × {x_2}), so C_1 ⊗ C_2 shatters H.
Theorem 4.5.10 extends by induction to any number of factors. One consequence is:
4.5.11 Corollary In ℝ let J be the set of all intervals, which may be open or closed, bounded or unbounded on each side; in other words, J is the set of all convex subsets of ℝ. In ℝ^m let C be the collection of all rectangles parallel to the axes, C := {∏_{i=1}^m J_i : J_i ∈ J, i = 1, …, m}. Then S(C) = 2m.
Let D be the set of all left half-lines (−∞, x] or (−∞, x) for x ∈ ℝ. Let
T := {∏_{i=1}^m H_i : H_i ∈ D, i = 1, …, m}, so T is the class of lower orthants parallel to the given axes. Then S(T) = m.
Proof The class D is linearly ordered by inclusion and is bordered with S(D) = 1, and so is the class of half-lines [a, ∞) or (a, ∞), a ∈ ℝ. The class J of all intervals in ℝ is bordered with S(J) = 2. The results now follow from Corollary 4.5.9 with n = 2m and n = m, and Theorem 4.5.10 and induction.
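The first two cases of the corollary (intervals, and rectangles in the plane, m = 1 and m = 2) can be verified mechanically: a finite set F is shattered by intervals iff for each A ⊂ F the convex hull of A contains no point of F \ A, and similarly with bounding boxes for axis-parallel rectangles. A sketch (helper names ours):

```python
from itertools import combinations

def interval_shatters(pts):
    # each nonempty A must be cut out by the interval [min A, max A]
    for r in range(1, len(pts) + 1):
        for A in combinations(pts, r):
            if any(min(A) < x < max(A) and x not in A for x in pts):
                return False
    return True

def rect_shatters(pts):
    # same test with bounding boxes, for axis-parallel rectangles in R^2
    for r in range(1, len(pts) + 1):
        for A in combinations(pts, r):
            x0, x1 = min(p[0] for p in A), max(p[0] for p in A)
            y0, y1 = min(p[1] for p in A), max(p[1] for p in A)
            if any(x0 <= p[0] <= x1 and y0 <= p[1] <= y1 and p not in A
                   for p in pts):
                return False
    return True

assert interval_shatters((0, 1))          # any 2 points: shattered, S(J) = 2
assert not interval_shatters((0, 1, 2))   # the middle point blocks {0, 2}
diamond = ((0, 1), (1, 0), (2, 1), (1, 2))
assert rect_shatters(diamond)             # four extreme points are shattered
# adding the center gives a 5-point set that is not shattered, in line
# with S(C) = 2m = 4 for rectangles in the plane
assert not rect_shatters(diamond + ((1, 1),))
print("ok")
```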
4.5.12 Proposition Let J be the set of all intervals in ℝ. Let Y be any set and C ⊂ 2^Y with Y ∈ C. Then in ℝ × Y, S(J ⊗ C) ≤ 2 + S(C).
Proof If S(C) = +∞ the result is clear, so suppose m := S(C) < ∞. Let F ⊂ ℝ × Y with |F| = 3 + m and suppose J ⊗ C shatters F. Let (x_i, y_i), i = 1, …, m + 3, be the points of F. Let u := min_i x_i and v := max_i x_i, and let p, q ∈ F be points with first coordinates u and v respectively. All subsets of F which include {p, q} must be induced by sets of the form ℝ × C, C ∈ C. So Π_Y must be one-to-one on F \ {p, q}, and C shatters Π_Y(F \ {p, q}) of cardinality m + 1, a contradiction.
4.6 Probability laws and independence

Let (X, A, P) be a probability space. Recall the pseudometric d_P(A, B) := P(AΔB) on A, where AΔB := (A \ B) ∪ (B \ A). Recall also that for any (pseudo)metric space (S, d) and ε > 0, D(ε, S, d) denotes the maximum number of points more than ε apart (Section 1.2).
Definition. For a measurable space (X, A) and C ⊂ A let s(C) := inf{w : there is a K = K(w, C) < ∞ such that for every law P on A and 0 < ε < 1, D(ε, C, d_P) ≤ K/ε^w}.
This index s(C) turns out to equal the density:
4.6.1 Theorem For any measurable space (X, A) and C ⊂ A, dens(C) = s(C).
Proof Let P be a probability measure on A. Suppose A_1, …, A_m ∈ C and d_P(A_i, A_j) > ε > 0 for 1 ≤ i < j ≤ m, with m ≥ 2. Let X_1, X_2, … be i.i.d. (P), specifically coordinates on a countable product of copies of (X, A, P). Note that P may have atoms. For n = 1, 2, …,

Pr{for some i ≠ j, X_k ∉ A_iΔA_j for all k ≤ n} ≤ m(m − 1)(1 − ε)^n/2 < 1

for n large enough, n > log(m(m − 1)/2)/(−log(1 − ε)). Let P_n := n^{−1} Σ_{k=1}^n δ_{X_k} (as usual) be empirical measures for P. For such n, there is positive probability that P_n(A_iΔA_j) > 0 for all i ≠ j, and so the A_i induce different subsets of {X_1, …, X_n} and m^C(n) ≥ m. For any r > dens(C) there is an M = M(r, C) < ∞, where we can take M ≥ 2, such that m^C(n) ≤ Mn^r for all n.
Note that −log(1 − ε) > ε. Thus for m ≥ 2, m ≤ M(2 log(m²)/ε)^r, or m(log m)^{−r} ≤ M_1 ε^{−r} for some M_1 = M_1(r, C) = 4^r M. For any δ > 0 and C_δ large enough, (log m)^r ≤ C_δ m^δ for all m ≥ 1. Thus for all m ≥ 2, m^{1−δ} ≤ M_2 ε^{−r} for 0 < ε < 1, for some M_2 = M_2(r, C, δ). Thus m ≤ (M_2 ε^{−r})^{1/(1−δ)}.
Letting r ↓ dens(C) and δ ↓ 0 gives dens(C) ≥ s(C).
In the converse direction, let |A| := card(A). Since it is not the case that m^C(n) ≤ kn^t for all n ≥ 1, for r < t < dens(C) and k = 1, 2, …, let A_k ⊂ X with A_k ≠ ∅ and |A_k ∩ C| > k|A_k|^t. Then |A_k| → ∞ as k → ∞. Let B_0 := A_1. Other sets B_n will be defined recursively. Let B_0, …, B_{n−1} be disjoint subsets of X and let C(n) := ⋃_{0≤j<n} B_j. For suitable k(n), B_n ⊂ A_{k(n)} can be chosen disjoint from C(n) with |B_n ∩ C| ≥ 2^{−2t−4}|A_{k(n)}|^t ≥ 2^{−2t−4}|B_n|^t. Since B_n is nonempty, it follows that |B_n ∩ C| ≥ 2^{2^n}, and hence |B_n| ≥ 2^n. Let

a_n := |B_n|^{−t/r},   S := Σ_{n=0}^∞ |B_n|^{1−t/r} < ∞.

Let P be the probability measure on ⋃_{n≥0} B_n ⊂ X giving mass a_n/S to each point of B_n for each n. The distinct sets in B_n ∩ C are at d_P-distance at least a_n/S apart, and so are some elements of C which induce the subsets of B_n. So for all n,

D(a_n/(2S), C, d_P) ≥ |B_n ∩ C| ≥ 2^{−2t−4}|B_n|^t = 2^{−2t−4}a_n^{−r} = 2^{−2t−4}(2S)^{−r}(a_n/(2S))^{−r}.

For ε := a_n/(2S) → 0 as n → ∞, this implies D(ε, C, d_P) ≥ 2^{−2t−4}(2S)^{−r}ε^{−r}.
Since a_n ↓ 0 as n → ∞, this implies r ≤ s(C). Then letting r ↑ dens(C), the proof is complete.

There is a notion of independence for sets without probability. To define it, for any set X and subset A ⊂ X let A^1 := A and A^{−1} := X \ A. Sets A_1, …, A_m ⊂ X are called independent, or independent as sets, if for every function s from {1, …, m} into {−1, +1}, ⋂_{j=1}^m A_j^{s(j)} ≠ ∅. Such intersections, when they are nonempty, are called atoms of the Boolean algebra generated by A_1, …, A_m. Thus for A_1, …, A_m to be independent as sets means that the Boolean algebra they generate has the maximum possible number, 2^m, of atoms. If A_1, …, A_m are independent as sets, then one can define a probability law on the algebra they generate for which they are jointly independent in the usual probability sense and for which P(A_i) = 1/2, i = 1, …, m. For example, choose a point in each atom and put mass 1/2^m at each point chosen. Or, if desired, given any q_i, 0 < q_i < 1, one can define a probability measure Q for which the A_i are jointly independent and have Q(A_i) = q_i, i = 1, …, m. For a set X and C ⊂ 2^X let I(C) be the supremum of m such that A_1, …, A_m are independent as sets for some A_i ∈ C, i = 1, …, m.
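Independence as sets is easy to exhibit concretely: on a space with 2^m points, the sets A_j of points whose j-th binary digit is 1 are independent, every one of the 2^m atoms being a single point; the uniform law then makes the A_j independent in the probability sense with P(A_j) = 1/2. A sketch (names ours):

```python
from fractions import Fraction
from itertools import product

def independent_as_sets(sets, space):
    # all 2^m atoms (intersections of the sets or their complements) nonempty
    for signs in product((1, -1), repeat=len(sets)):
        atom = set(space)
        for s, B in zip(signs, sets):
            atom &= B if s == 1 else space - B
        if not atom:
            return False
    return True

m = 4
space = set(range(2 ** m))                 # one point per prospective atom
A = [{x for x in space if x >> j & 1} for j in range(m)]
assert independent_as_sets(A, space)

P = lambda B: Fraction(len(B), len(space))   # uniform law on the 2^m points
assert all(P(Aj) == Fraction(1, 2) for Aj in A)
assert P(A[0] & A[1] & A[2]) == P(A[0]) * P(A[1]) * P(A[2])
print("ok")
```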
4.6.2 Theorem For any set X, C ⊂ 2^X, and n = 1, 2, …, if S(C) ≥ 2^n then I(C) ≥ n. Conversely, if I(C) ≥ 2^n then S(C) ≥ n. So I(C) < ∞ if and only if S(C) < ∞. In both cases, 2^n cannot be replaced by 2^n − 1.

Proof Clearly, if a set Y has n or more independent subsets, then |Y| ≥ 2^n. Conversely, if |Y| = 2^n, we can assume that Y is the set of all strings of n digits each equal to 0 or 1. Let A_j be the set of strings for which the jth digit is 1. Then the A_j are independent. It follows that if S(C) ≥ 2^n, then I(C) ≥ n, while if |Y| = 2^n − 1 and C = 2^Y, then S(C) = 2^n − 1 while I(C) < n, as stated.
Conversely, if B_j are independent as sets for j = 1, …, 2^n, B_j ∈ C, let A(i) := A_i be independent subsets of {1, …, 2^n} for i = 1, …, n. Choose

x_i ∈ ⋂_{j∈A(i)} B_j ∩ ⋂_{j∉A(i)} (X \ B_j).

Then x_i ∈ B_j if and only if j ∈ A(i). For each set S ⊂ {1, …, n},

⋂_{i∈S} A_i ∩ ⋂_{i∉S} ({1, …, 2^n} \ A_i) ≠ ∅,

so for some j := j_S ∈ {1, …, 2^n}, j ∈ A_i if and only if i ∈ S, and B_j ∩ {x_1, …, x_n} = {x_i : i ∈ S}.
So C shatters {x_1, …, x_n} and S(C) ≥ n, as stated. If C consists of 2^n − 1 (independent) sets, then clearly S(C) < n.
For any set X, C ⊂ 2^X and Y ⊂ X, recall that C_Y := Y ⊓ C := {Y ∩ C : C ∈ C}. Let At(C|Y) be the set of atoms of the algebra of subsets of Y generated by C_Y, where in the cases to be considered, C_Y will be finite because C or Y is. Let Δ^C(Y) := |At(C|Y)| be the number of such atoms. Let m_C(n) := sup{Δ^A(X) : A ⊂ C, |A| ≤ n} ≤ 2^n. Let dens*(C) denote the infimum of all s > 0 such that for some C < ∞, m_C(n) ≤ Cn^s for all n. For any x ∈ X let C^x := {A ∈ C : x ∈ A}. Let C^Y := {C^y : y ∈ Y}.

4.6.3 Theorem For any set X and A ⊂ C ⊂ 2^X, with A finite,

(a) Δ^A(X) = Δ^{C^X}(A); (b) for n = 1, 2, …, m_C(n) = m^{C^X}(n); (c) S(C^X) = I(C); (d) dens*(C) = dens(C^X) ≤ I(C).

Proof For any B ⊂ A let

a(B) := ⋂_{B∈B} B ∩ ⋂_{A∈A\B} (X \ A).

Then a(B) is an atom, in At(A|X), if and only if it is nonempty. Now y ∈ a(B) if and only if A ∩ C^y = B, so (a) follows. Then taking the maximum over A with |A| = n on both sides of (a) gives (b), which then implies (c) and (d). The last inequality follows from Corollary 4.1.7.
4.7 Vapnik-Červonenkis properties of classes of functions

The notion of VC class of sets has several extensions to classes of functions.
Definitions. Let X be a set and F a class of real-valued functions on X. Let C ⊂ 2^X. If f is any real-valued function, each set {f > t} for t ∈ ℝ will be called a major set of f. The class F will be called a major class for C if all the major sets of each f ∈ F are in C. If C is a Vapnik-Červonenkis class, then F will be called a VC major class (for C).
The subgraph of a real-valued function f will be the set {(x, t) ∈ X × ℝ : 0 < t < f(x) or f(x) < t < 0}. If D is a class of subsets of X × ℝ, and for each f ∈ F, the subgraph of f is in D, then F will be called a subgraph class for D. If D is a VC class in X × ℝ then F will be called a VC subgraph class.
The symmetric convex hull of F is the set of all functions Σ_{j=1}^m c_j f_j for f_j ∈ F, c_j ∈ ℝ, any finite m, and Σ_{j=1}^m |c_j| ≤ 1. If 0 < M < ∞, let H(F, M) denote M times the symmetric convex hull of F. Let H_S(F, M) be the smallest class G of functions including H(F, M) such that whenever g_n ∈ G for all n and g_n(x) → g(x) as n → ∞ for all x, we have g ∈ G. A class F of functions such that F ⊂ H_S(G, M) for some M < ∞ and a given G will be called a VC subgraph hull class if G is a VC subgraph class, and a VC hull class if G = {1_C : C ∈ C} where C is a VC class of sets.
So there are at least four possible ways to extend the notion of VC class to classes of functions. Some implications hold between these different conditions, but no two of them are equivalent. The next theorem deals with some of the easier cases of implication or nonimplication.
4.7.1 Theorem Let F be a uniformly bounded class of nonnegative real-valued functions on a set X. Then

(a) if F is the set of indicators of members of a VC class of sets, then F is also a VC major class, a VC subgraph class, and a VC hull class;
(b) if F is a VC major class then it is a VC hull class;
(c) there exist VC hull classes F which are not VC major;
(d) there exist VC subgraph classes F which are not VC major.

Proof (a) The indicators of a VC class C of sets clearly form a VC major class (for C ∪ {∅, X}) and a VC hull class. Also, the sets (C × [0, 1]) ∪ (X × {0}) for C ∈ C form a VC class, for example by Theorem 4.5.4. So (a) holds.
(b) Since F is uniformly bounded and the definition of VC hull class allows multiplication by a constant, we can assume 0 ≤ f ≤ 1 for all f ∈ F. For such an f let

f_n := (1/n) Σ_{j=1}^{n−1} 1_{{f > j/n}} = Σ_{j=1}^{n−1} (j/n) 1_{{j/n < f ≤ (j+1)/n}}.

Then f_n(x) → f(x) as n → ∞ for all x. (In fact, f(x) − 1/n ≤ f_n(x) ≤ f(x) for all x, so f_n → f uniformly.) So F is a VC hull class.
(c) Let X = ℝ². Let C be the set of open lower left quadrants {(x, y) : x < a, y < b} for all a, b ∈ ℝ. Then S(C) = 2 by Corollary 4.5.11. Let F be the set of all sums Σ_{k≥1} 1_{C(k)}/2^k where C(k) ∈ C. Then clearly F is a VC hull class. The sets where f > 0 for f in F are exactly the countable unions of sets
in C. But such unions are not a VC class; for example, they shatter any finite subset of the line x + y = 1. So F is not a VC major class and (c) is proved.
(d) Let f_n := n^{−1} + n^{−2} 1_{B_n} for n = 1, 2, …, and for any measurable sets B_n. Then f_{n+1} < f_n everywhere and f_n ↓ 0, so the subgraphs of the functions f_n are linearly ordered by inclusion and form a VC class of index 1 (Theorem 4.2.6(a)). Now {f_n > 1/n} = B_n for each n, and the sequence {B_n} need not form a VC class; for example, let B_n be a sequence of independent sets (see Section 4.6). Then {f_n} is not a VC major class, proving (d).
4.8 Classes of functions and dual density

For a metric space (S, d) and ε > 0 recall D(ε, S, d), the maximum number of points more than ε apart. For a probability measure Q and 1 ≤ p < ∞ we have the L^p metric d_{p,Q}(f, g) := (∫ |f − g|^p dQ)^{1/p}. For a class F ⊂ L^p(Q) let D^{(p)}(ε, F, Q) := D(ε, F, d_{p,Q}). Let D^{(p)}(ε, F) be the supremum of D^{(p)}(ε, F, Q) over all laws Q concentrated in finite sets.
If F is a class of measurable real-valued functions on a measurable space (X, A), let F_F(x) := sup_{f∈F} |f(x)|. Then a measurable function F will be called an envelope function for F if and only if F_F ≤ F. If F_F is measurable it will be called the envelope function of F. For any law P on (X, A), F_F* is an envelope function for F, which in general depends on P.
Given F, an envelope function F for it, ε > 0, and 1 ≤ p < ∞, let D_F^{(p)}(ε, F, Q) be the supremum of m such that there exist f_1, …, f_m ∈ F for which ∫ |f_i − f_j|^p dQ > ε^p ∫ F^p dQ for all i ≠ j.
The next fact extends part of Theorem 4.6.1 to families of functions. The proof is also similar.
4.8.1 Theorem Let 1 ≤ p < ∞. Let (X, A, Q) be a probability space and F a VC subgraph class of measurable real-valued functions on X. Let F have an envelope F ∈ L^p(X, A, Q) with 0 < ∫ F dQ and F ≥ 1. Let C be the collection of subgraphs in X × ℝ of functions in F. Then for any W > S(C) there is an A < ∞, depending only on W and S(C), such that

(4.8.2)   D_F^{(p)}(ε, F, Q) ≤ A(2^{p−1}/ε^p)^W for 0 < ε ≤ 1.
Proof Given 0 < ε ≤ 1, take a maximal m and f_1, …, f_m as in the definition of D_F^{(p)}. First suppose p = 1. For any measurable set B let (FQ)(B) := ∫_B F dQ. Let Q_F := FQ/QF where QF := ∫ F dQ, so Q_F is a probability measure. Let k = k(m, ε) be the smallest integer such that e^{kε/2} ≥ m(m − 1)/2. Then k ≤ 1 + (4 log m)/ε. Let X_1, …, X_k be i.i.d. (Q_F). Given X_i, let Y_i be uniformly distributed on the interval [−F(X_i), F(X_i)], with the vectors (X_i, Y_i) ∈ ℝ² independent for i = 1, …, k. Let C_j be the subgraph of f_j, j = 1, …, m. Then for all i and j ≠ s,

Pr((X_i, Y_i) ∈ C_jΔC_s) = ∫ [|f_j(x) − f_s(x)|/(2F(x))] dQ_F(x) = ∫ |f_j − f_s| dQ / (2 ∫ F dQ) > ε/2.

Let A_{jsk} be the event that (X_i, Y_i) ∉ C_jΔC_s for all i = 1, …, k. Then by independence, for each j ≠ s,

Pr(A_{jsk}) ≤ (1 − ε/2)^k < e^{−kε/2},

and

Pr{A_{jsk} occurs for some j ≠ s} < (m(m − 1)/2)e^{−kε/2} ≤ 1.

So with positive probability, for all j ≠ s there is some (X_i, Y_i) ∈ C_jΔC_s. Fix X_i and Y_i, i = 1, …, k, for which this happens. Then the sets C_j ∩ {(X_i, Y_i)}_{i=1}^k are distinct for all j, so m^C(k) ≥ m. Let S := S(C). By the Sauer and Vapnik-Červonenkis lemmas (4.1.2 and 4.1.5), m^C(k) ≤ 1.5k^S/S! for k ≥ S + 2. It follows that for some constant C depending only on S, m^C(k) ≤ Ck^S for all k ≥ 1, where C ≤ 2^{S+1} − 1. We can assume that C ≥ 1. So

m ≤ Ck^S ≤ C((1 + 4 log m)/ε)^S.

For any α > 0 there is an m_0 such that 1 + 4 log m ≤ m^α for m ≥ m_0, and then m^{1−αS} ≤ C/ε^S. Choosing α small enough so that αS < 1, we have m ≤ C_1/ε^{S/(1−αS)} for m ≥ m_0 and a constant C_1. For any W > S, we can solve W = S/(1 − αS) by α = (W − S)/(WS). Then m ≤ A/ε^W for A := max(m_0, C_1), and then m_0 and A are functions of W and S, finishing the proof for p = 1.
Now suppose p > 1. Let Q_{F,p} := F^{p−1}Q/Q(F^{p−1}). Then for i ≠ j,

ε^p ∫ F^p dQ < ∫ |f_i − f_j|^p dQ ≤ ∫ |f_i − f_j|(2F)^{p−1} dQ = (∫ |f_i − f_j| dQ_{2F,p}) Q((2F)^{p−1}),

and Q_{2F,p} = Q_{F,p}. Since ∫ F dQ_{F,p} = Q(F^p)/Q(F^{p−1}) and Q((2F)^{p−1}) = 2^{p−1}Q(F^{p−1}), it follows that ∫ |f_i − f_j| dQ_{F,p} > (ε^p/2^{p−1}) ∫ F dQ_{F,p}. Thus by the p = 1 case,

D_F^{(p)}(ε, F, Q) ≤ D_F^{(1)}(ε^p/2^{p−1}, F, Q_{F,p}) ≤ A(2^{p−1}/ε^p)^W.
6. Define an open half-plane in ℝ² as a set {(x, y) : ax + by > c} for real a, b, c with a and b not both 0. Define a wedge as an intersection of two half-planes. Let C be the collection of all wedges in ℝ². Show that S(C) ≥ 5. Also find an upper bound for S(C).
7. Let C be the set of all interiors of triangles in ℝ². Show that S(C) ≥ 7. Also give an upper bound for S(C).
8. Show that the lower bounds for S(C) in problems 6 and 7 are the values of S(C). Hint: For a convex polygon, the set F of vertices can be arranged in cyclic order, say clockwise around the boundary of the polygon, v_1, v_2, …, v_n, v_1. Show that if a half-plane J contains v_i and v_j with i < j then it includes either {v_i, v_{i+1}, …, v_j} or {v_j, v_{j+1}, …, v_n, v_1, …, v_i}. Thus find what kind of set the intersection of J and F must be. From that, find what occurs if two or three half-planes are intersected (or unioned, via complements).
9. In the example at the end of Section 4.3, for each set A ⊂ X with three elements, find a specific subset of A not in A ⊓ C.
10. Let F be the class of all probability distribution functions on ℝ. Show that F is a VC major class but not a VC subgraph class. Hint: Show that the subgraphs of functions in F shatter all sets {(x_j, y_j) : 1 ≤ j ≤ n} with x_1 < ⋯ < x_n and 1 > y_1 > ⋯ > y_n > 0.
We noted previously that A in the supremum norm is nonseparable: it is an uncountable set in which any two points are at distance 1 apart. Thus A and all its subsets are closed. If x := X_1 has a continuous distribution such as the uniform distribution U[0, 1] on [0, 1], then x ↦ (t ↦ 1_{t≥x}) takes [0, 1] onto A, but it is not continuous for the supremum norm. Also, it is not measurable for the Borel σ-algebra on the range space. So, in Chapter 3, functions f* and upper expectations E* were used to get around measurability problems.
Here is a different kind of example. It is related to the basic "ordinal triangle" counterexample in integration theory, showing why measurability is needed in the Tonelli-Fubini theorem on Cartesian product integrals. Let (Ω, …
… > 0, and g_P(x) := 0 otherwise. Then λ(k = 0) = 0, and for each P ∈ P, g_P is A-measurable, P is absolutely continuous with respect to λ, and dP/dλ = g_P. So A is sufficient for P by Theorem 5.1.5, and Theorem 5.1.2 is proved.
Given a statistic T, that is, a measurable function, from S into Y for measurable spaces (S, B) and (Y, F), let A := T^{−1}(F) := {T^{−1}(A) : A ∈ F}, a σ-algebra. For a family Q of laws on (S, B), T is called a sufficient statistic for Q if A is sufficient for Q. If T is sufficient we can write f_P = g_P ∘ T for some F-measurable function g_P by RAP, Theorem 4.2.8. Sufficiency, defined in terms of conditional probabilities of measurable sets, can be extended to suitable conditional expectations:
5.1.6 Theorem Let A be sufficient for a family P of laws on a measurable space (S, B). Then for any measurable real-valued function f on (S, B) which is integrable for each P ∈ P, there is an A-measurable function g such that g = E_P(f | A) a.s. for all P ∈ P.
Proof When f is the indicator function of a set in B, the assertion is the definition of sufficiency. It then follows for any simple function, which is a finite linear combination of such indicators. If f is nonnegative, there is a sequence of nonnegative simple functions increasing up to f, and the conclusion
holds (RAP, Proposition 4.1.5 and Theorem 10.1.7). Then any f satisfying the
hypothesis can be written as f = f^+ − f^− for f^+ and f^− nonnegative, and the result follows.
Let μ and ν be two probability measures on the same measurable space (V, U). Take the Lebesgue decomposition (RAP, Theorem 5.5.3) ν = ν_ac + ν_s where ν_ac is absolutely continuous, and ν_s is singular, with respect to μ. Let A ∈ U with ν_s(A) = μ(V \ A) = 0, so ν_ac(V \ A) = 0. Then the likelihood ratio R_{ν/μ} is defined as the Radon-Nikodym derivative dν_ac/dμ on A and as +∞ on V \ A. By uniqueness of the Hahn decomposition of V for ν_s − μ (RAP, Theorem 5.6.1), R_{ν/μ} is defined up to equality (μ + ν_s)-, and so (μ + ν)-, almost everywhere.
5.1.7 Theorem For any family P of laws on a measurable space (S, B) and sub-σ-algebra A ⊂ B, A is sufficient for P if and only if for all P, Q ∈ P, R_{Q/P} can be taken to be A-measurable, that is, it is equal (P + Q)-almost everywhere to an A-measurable function.
Proof Only if: suppose A is sufficient for P. Then it is also sufficient for {P, Q}, which is dominated by μ := P + Q. So by factorization (Theorem 5.1.2) there are A-measurable functions f_P and f_Q and a B-measurable function h such that dP/dμ = f_P h and dQ/dμ = f_Q h. Then R_{Q/P} = f_Q h/(f_P h) = f_Q/f_P, where y/0 := +∞ if y > 0 and := 0 if y = 0, is A-measurable (note that R_{Q/P} doesn't depend on the choice of dominating measure).
Conversely, given P, Q ∈ P, suppose R_{Q/P} is A-measurable. Again let μ := P + Q. Then dP/dμ = 1/(1 + R_{Q/P}) and dQ/dμ = R_{Q/P}/(1 + R_{Q/P}) almost everywhere for μ, so dP/dμ and dQ/dμ can be chosen to be A-measurable. Thus by factorization again (this time with h = 1), A is sufficient for {P, Q}. So for all B ∈ B, P(B|A) = Q(B|A) almost surely for P and Q. Taking P fixed and letting Q vary over P, we get that A is sufficient for P.
Suppose we observe X_1, …, X_n i.i.d. with law P or Q but we do not know which and want to decide. Suppose we have no a priori reason to favor a choice of P or Q, only the data. Then it is natural to evaluate the likelihood ratio R_{Q^n/P^n} and choose Q if R_{Q^n/P^n} > 1 and P if R_{Q^n/P^n} < 1, while if R_{Q^n/P^n} = 1 we still have no basis to prefer P or Q. More generally, decisions between P and Q can be made optimally, in terms of minimizing error probabilities or expected losses, by way of the likelihood ratio R_{Q/P} or R_{Q^n/P^n} as appropriate (the Neyman-Pearson lemma; Lehmann, 1986, pp. 74, 125;
Ferguson, 1967, Section 5.1). By Theorem 5.1.7, if B is sufficient for P^n for some P ⊃ {P, Q}, then R_{Q^n/P^n} is B-measurable. Specifically, if T is a sufficient statistic, then by Theorem 5.1.7 and RAP, Theorem 4.2.8, R_{Q^n/P^n} is a measurable function of T. Thus, no information in (X_1, …, X_n) beyond T is helpful in choosing P or Q. In this sense, the definition of sufficiency fits with the informal notion of sufficiency given at the beginning of the section.
It will be shown that empirical measures are sufficient in a sense to be defined. Let S_n be the sub-σ-algebra of A^n consisting of sets invariant under all permutations of the coordinates.
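For laws that are mutually absolutely continuous with densities p and q, the likelihood ratio of an i.i.d. sample factors as R_{Q^n/P^n}(x_1, …, x_n) = ∏_i q(x_i)/p(x_i), and it depends on the data only through a sufficient statistic. A sketch for two Bernoulli laws (the parameters and names are our choices, not from the text):

```python
import math
import random

random.seed(0)

p = {0: 0.75, 1: 0.25}      # P = Bernoulli(1/4)
q = {0: 0.25, 1: 0.75}      # Q = Bernoulli(3/4)

def likelihood_ratio(sample):
    # R_{Q^n/P^n} = prod_i q(x_i)/p(x_i); finite and positive here since
    # P and Q are mutually absolutely continuous
    return math.prod(q[x] / p[x] for x in sample)

sample = [1 if random.random() < 0.75 else 0 for _ in range(100)]  # from Q
r = likelihood_ratio(sample)
# Here q(x)/p(x) is 3 or 1/3, so r = 3^(2T - n) with T = sum(sample):
# the ratio is a function of the sufficient statistic T alone.
T, n = sum(sample), len(sample)
assert math.isclose(r, 3.0 ** (2 * T - n))
print("decide Q" if r > 1 else "decide P")
```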
5.1.8 Theorem S_n is sufficient for P^n := {P^n : P ∈ P} where P is the set of all laws on (X, A).

Proof For each permutation π of {1, 2, …, n} and x := (x_1, …, x_n) ∈ X^n, set f_π(x) := (x_{π(1)}, …, x_{π(n)}). Then f_π is a 1-1 measurable transformation of X^n onto itself with measurable inverse, and it preserves the product law P^n for each law P on (X, A). For any C ∈ A^n, we have

P^n(C | S_n) = (1/n!) Σ_π 1_{f_π(C)}(·),

summing over all n! permutations π, almost surely for P^n, since for any B ∈ S_n,

P^n(B ∩ C) = (1/n!) Σ_π P^n(C ∩ f_π^{−1}(B)) = (1/n!) Σ_π P^n(f_π(C) ∩ B).

The conclusion follows.
For example, if X has just two points, say X = {0, 1}, and S := Σ_{i=1}^n x_i, then S_n is the smallest σ-algebra for which S is measurable. In this case no σ-algebra strictly smaller than S_n is sufficient (S_n is "minimal sufficient").
For each B ∈ A and x = (x_1, …, x_n) ∈ X^n, let

P_n(B)(x) := (1/n) Σ_{j=1}^n 1_B(x_j).

So P_n is the usual empirical measure, except that in this section, x ↦ P_n(B)(x) is a measurable function, or statistic, on a measurable space, rather than a
probability space, since no particular law P or P^n has been specified as yet. Here, P_n(B)(x) is just a function of B and x.
For a collection F of measurable functions on (X^n, A^n), let S_F be the smallest σ-algebra making all functions in F measurable. Then F will be called sufficient if and only if S_F is sufficient.
5.1.9 Theorem For any measurable space (X, A) and for each n = 1, 2, …, the empirical measure P_n is sufficient for P^n where P is the set of all laws on (X, A). In other words, the set F of functions x ↦ P_n(B)(x), for all B ∈ A, is sufficient. In fact the σ-algebra S_F is exactly S_n.
Proof Clearly S_F ⊂ S_n. To prove the converse inclusion, for each set B ∈ A^n let S(B) := ⋃_π f_π(B) ∈ S_n, the union being over all permutations π of the coordinates. Then if B ∈ S_n, S(B) = B. Let E := {C ∈ A^n : S(C) ∈ S_F}. We want to prove E = A^n. Now E is a monotone class: if C_m ∈ E and C_m ↑ C or C_m ↓ C, then C ∈ E. Also, since S commutes with finite unions, any finite union of sets in E is in E. So it will be enough to prove that

(5.1.10)   A_1 × ⋯ × A_n ∈ E for any A_i ∈ A,

since the collection C of finite unions of such sets, which can be taken to be disjoint, is an algebra, and the smallest monotone class including C is A^n (RAP, Propositions 3.2.2 and 3.2.3 and Theorem 4.4.2). Here by another finite union the A_i can be replaced by B_{l(i)} where B_1, …, B_r are atoms of the algebra generated by A_1, …, A_n, so r ≤ 2^n. So we just need to show that for all j(1), …, j(n) with 1 ≤ j(i) ≤ r, i = 1, …, n, we have S(B_{j(1)} × ⋯ × B_{j(n)}) ∈ S_F. Now, x ∈ S(B_{j(1)} × ⋯ × B_{j(n)}) if and only if for each i = 1, …, r, P_n(B_i)(x) = k_i/n where k_i is the number of values of s such that j(s) = i, s = 1, …, n. So S_F = S_n, and by Theorem 5.1.8 the conclusion follows.
For some subclasses C ⊂ A, the restriction of P_n to C may be sufficient, and handier than the values of P_n on the whole σ-algebra A. Recall that a class C included in a σ-algebra A is called a determining class if any two measures on A, equal and finite on C, are equal on all of A. If C generates the σ-algebra A, C is not necessarily a determining class unless for example it is an algebra (RAP, Theorem 3.2.7 and the example after it). Sufficiency of P_n(A) for A ∈ C can depend on n. Let X = {1, 2, 3, 4, 5}, A = 2^X, and C = {{1, 2, 3}, {2, 3, 4}, {3, 4, 5}}. Then C is sufficient for n = 1, but not for n = 2 since for example (δ_1 + δ_4)/2 = (δ_2 + δ_5)/2 on C. This is a case where C generates A but is not a determining class.
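The five-point example can be verified in a few lines: the laws (δ_1 + δ_4)/2 and (δ_2 + δ_5)/2 are distinct but agree on all of C, while C nevertheless generates the algebra 2^X. A sketch (helper names ours):

```python
from fractions import Fraction

X = {1, 2, 3, 4, 5}
C = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 4, 5})]

def two_point_law(a, b):
    # the law (delta_a + delta_b)/2, evaluated on a set B
    return lambda B: Fraction((a in B) + (b in B), 2)

P, Q = two_point_law(1, 4), two_point_law(2, 5)
assert all(P(B) == Q(B) for B in C)      # the two laws agree on C ...
assert P({1}) != Q({1})                  # ... but are different measures

# Nevertheless C generates the algebra 2^X: every singleton is reachable.
A1, A2, A3 = C
assert A1 & A3 == {3}
assert (A1 & A2) - {3} == {2} and (A2 & A3) - {3} == {4}
assert A1 - A2 == {1} and A3 - A2 == {5}
print("C generates A but is not a determining class")
```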
5.1.11 Theorem Let (X, d) be a separable metric space which is a Borel subset of its completion, with Borel σ-algebra A. Suppose C = {C_k}_{k≥1} is a countable determining class for A. Then for each n = 1, 2, …, the sequence {P_n(C_k)}_{k≥1} is sufficient for the class P^n of all laws P^n on (X^n, A^n) where P ∈ P, the class of all laws on (X, A).
Proof For n = 1, 2, …, let I_n be the finite set {j/n : j = 0, 1, …, n}. Let I_n^∞ be a countable product of copies of I_n with the product σ-algebra defined from the σ-algebra of all subsets of I_n. We have:
5.1.12 Lemma Under the hypotheses of Theorem 5.1.11, for each n = 1, 2, … and A ∈ A, there is a Borel measurable function f_A on I_n^∞ such that for any x_1, …, x_n ∈ X, and P_n := n^{−1} Σ_{j=1}^n δ_{x_j}, we have

P_n(A) = f_A({P_n(C_k)}_{k≥1}).
Proof Since C is a determining class, the function f_A exists. We need to show it is measurable. If X is uncountable, then by the Borel isomorphism theorem (RAP, Theorem 13.1.1) we can assume X = [0, 1]. Or if X is countable, then A is the σ-algebra of all its subsets and we can assume X ⊂ [0, 1]. Let X(n) := {x := (x_j)_{j=1}^n ∈ X^n : x_1 ≤ x_2 ≤ ⋯ ≤ x_n}. The map x ↦ P_n^{(x)} is 1-1 from X(n) into the set of all possible empirical measures P_n, since if Q := P_n^{(x)} = P_n^{(y)} with x, y ∈ X(n), then x_1 = y_1 = the smallest u such that Q({u}) > 0. Next, x_2 = y_2 = x_1 if and only if Q({x_1}) ≥ 2/n, while otherwise x_2 = y_2 = the next-smallest v such that Q({v}) > 0, and so on.
Now, X(n) is a Borel subset of X^n. The completion of X^n for any of the usual product metrics is isometric to S^n where S is the completion of X. Clearly X^n is a Borel subset of S^n. It follows that X(n) is a Borel subset of its completion. Here the following will be useful:
5.1.13 Lemma On I_n^∞, the product σ-algebra B^∞, the smallest σ-algebra for which all the coordinates are measurable, equals the Borel σ-algebra B of the product topology T.
Proof Clearly B^∞ ⊂ B since the coordinates are continuous. Conversely, T has a base R consisting of all sets ∏_{j≥1} A_j where A_j = I_n for all but finitely many j; R is countable (since I_n is finite) and consists of sets in B^∞, and every
U ∈ T is a countable union of sets in R, so U ∈ S^∞ and B^∞ = S^∞. Lemma 5.1.13 is proved.

Now to continue the proof of Lemma 5.1.12, the map f : x ↦ {P_n^{(x)}(C_k)}_{k≥1} is Borel measurable from X^{(n)} into I_n^∞. Since {C_k}_{k≥1} is a determining class, f is one-to-one. Thus by Appendix G, Theorem G.6, f has Borel image f[X^{(n)}] in I_n^∞ and f^{-1} is Borel measurable from f[X^{(n)}] onto X^{(n)}. Then f^{-1} extends to a Borel measurable function h on all of I_n^∞ into X^{(n)}, since X^{(n)} is Borel-isomorphic to ℝ, or to a countable subset, with Borel σ-algebra, by the Borel isomorphism theorem (RAP, Theorem 13.1.1 again), and thus the extension works as for real-valued functions (RAP, Theorem 4.2.5). For any A ∈ A, g_A : x ↦ P_n^{(x)}(A) is Borel measurable. Thus f_A = g_A ∘ h is Borel measurable and Lemma 5.1.12 is proved.

Now to prove Theorem 5.1.11, we know from Theorem 5.1.9 that the smallest σ-algebra S_n making all functions x ↦ P_n(B)(x) measurable for B ∈ A is sufficient. By Lemma 5.1.12, S_n is the same as the smallest σ-algebra making x ↦ P_n(C_k) measurable for all k, which finishes the proof.
In the real line ℝ, the closed half-lines (−∞, x] form a determining class. In other words, as is well known, a probability measure P on the Borel σ-algebra of ℝ is uniquely determined by its distribution function F (RAP, Theorem 3.2.6). It follows that the half-lines (−∞, q] for q rational are a determining class: for any real x, take rationals q_k ↓ x; then F(q_k) ↓ F(x). Thus we have:
5.1.14 Corollary In ℝ, the empirical distribution functions F_n(x) := P_n((−∞, x]) are sufficient for the family P^n of all laws P^n on ℝ^n, where P varies over all laws on the Borel σ-algebra in ℝ.
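The inversion argument in Lemma 5.1.12 can be illustrated numerically: the values P_n((−∞, q]) on a grid of rationals containing the sample points determine the order statistics, hence P_n itself. A small Python sketch (the sample and dyadic grid are illustrative choices, not from the text):

```python
from fractions import Fraction

def empirical_cdf(sample, q):
    # P_n((-inf, q]) for the empirical measure of `sample`
    return Fraction(sum(1 for x in sample if x <= q), len(sample))

def recover_sample(n, cdf_values, grid):
    # each jump of size j/n at a grid point q means q occurs j times;
    # the grid is assumed to contain every sample point
    recovered, prev = [], Fraction(0)
    for q, v in zip(grid, cdf_values):
        jump = v - prev
        recovered += [q] * int(jump * n)
        prev = v
    return recovered

sample = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 2), Fraction(3, 4)]
grid = [Fraction(k, 8) for k in range(9)]
cdf_values = [empirical_cdf(sample, q) for q in grid]
assert recover_sample(len(sample), cdf_values, grid) == sorted(sample)
```

Exact rational arithmetic makes the jump sizes j/n recoverable without rounding error.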
5.2 Admissibility

Let F be a family of real-valued functions on a set X, measurable for a σ-algebra A on X. Then there is a natural function, called here the evaluation map, F × X ↦ ℝ, given by (f, x) ↦ f(x). It turns out that for general F there may not exist any σ-algebra of subsets of F for which the evaluation map is jointly measurable. The possible existence of such a σ-algebra and its uses will be the subject of this section.

Let (X, B) be a measurable space. Then (X, B) will be called separable if B is generated by some countable subclass C ⊂ B and B contains all singletons
180
Measurability
{x}, x ∈ X. In this section, (X, B) will be assumed to be such a space. Let F be a collection of real-valued functions on X. (The following definition is unrelated to the usage of "admissible" for estimators in statistics.)

Definition. F is called admissible if there is a σ-algebra T of subsets of F such that the evaluation map (f, x) ↦ f(x) is jointly measurable from (F, T) × (X, B) (with product σ-algebra) to ℝ with Borel sets. Then T will be called an admissible structure for F. F will be called image admissible via (Y, S, T) if (Y, S) is a measurable space and T is a function from Y onto F such that the map (y, x) ↦ T(y)(x) is jointly measurable from (Y, S) × (X, B) with product σ-algebra to ℝ with Borel sets. To apply these definitions to a family C of sets, let F := {1_A : A ∈ C}.
Remarks. For one example, let (K, d) be a compact metric space and let F be a set of continuous real-valued functions on K, compact for the supremum norm. Then the functions in F are uniformly equicontinuous on K by the Arzelà-Ascoli theorem (RAP, 2.4.7). It follows that the map (f, x) ↦ f(x) is jointly continuous for the supremum norm on f ∈ F and d on K. Since both spaces are separable metric spaces, the map is also jointly measurable, so that F is admissible.

If a family F is admissible, then it is image admissible, taking T to be the identity. In regard to the converse direction, here is an example. Let X = [0, 1] with the usual Borel σ-algebra B. Let (Y, S) be a countable product of copies of (X, B). For y = {y_n}_{n≥1} ∈ Y let T(y)(x) := 1_J(x, y) where J := {(x, y) : x = y_n for some n}. Let C be the class of all countable subsets of X and F the class of indicator functions of sets in C. Then it is easy to check that F is image admissible via (Y, S, T). If a σ-algebra T is defined on F by setting T := {F ⊂ F : T^{-1}(F) ∈ S}, then T is not countably generated (see problem 5(b)) although S is. This example shows how sometimes image admissibility may work better than admissibility.
5.2.1 Theorem For any separable measurable space (X, B), there is a subset Y of [0, 1] and a one-to-one function M from X onto Y which is a measurable isomorphism (is measurable and has measurable inverse) for the Borel σ-algebra on Y.

Remark. Note that Y is not necessarily a measurable subset of [0, 1]. On the other hand, if (X, B) is given as a separable metric space which is a Borel subset of its completion, with Borel σ-algebra, then (X, B) is measurably isomorphic
either to a countable set, with the σ-algebra of all its subsets, or to all of [0, 1], by the Borel isomorphism theorem (RAP, Theorem 13.1.1).
Proof Recall that the Borel subsets of Y as a metric space with the usual metric are the same as the intersections with Y of Borel sets in [0, 1], since the same is true for open sets. Let A := {A_j}_{j≥1} be a countable set of generators of B. Consider the map f : x ↦ {1_{A_j}(x)}_{j≥1} from X into a countable product 2^∞ of copies of {0, 1} with product σ-algebra. Then f is one-to-one and onto its range Z. Thus it preserves all set operations, specifically countable unions and complements. So it is easily seen that f is a measurable isomorphism of X onto Z.

Next consider the map g : {z_j} ↦ Σ_j 2z_j/3^j from 2^∞ into [0, 1], actually onto the Cantor set C (RAP, proof of Proposition 3.4.1). Then g is continuous from the compact space 2^∞ with product topology onto C. It is easily seen that g is one-to-one. Thus g is a homeomorphism (RAP, Theorem 2.2.11) and a measurable isomorphism for Borel σ-algebras. The Borel σ-algebra on 2^∞ equals the product σ-algebra (Lemma 5.1.13). So the restriction of g to Z is a measurable isomorphism onto its range Y. It follows that the composition g ∘ f, called the Marczewski function,

M(x) := Σ_{n=1}^∞ 2·1_{A_n}(x)/3^n,

is one-to-one from X onto Y ⊂ I := [0, 1] and is a measurable isomorphism onto Y.
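For a concrete finite example, the Marczewski function can be computed directly; the following sketch (with an illustrative X and dyadic generating sets) checks that distinct points get distinct values, since base-3 expansions using only digits 0 and 2 are unique:

```python
from fractions import Fraction

def marczewski(x, generators):
    # M(x) = sum_{n >= 1} 2 * 1_{A_n}(x) / 3^n (a finite sum here)
    return sum(Fraction(2 * (x in A), 3 ** (n + 1))
               for n, A in enumerate(generators))

X = range(8)  # an illustrative finite space
# A_n = points whose n-th binary digit is 1; these generate 2^X
gens = [{x for x in X if (x >> b) & 1} for b in range(3)]
values = [marczewski(x, gens) for x in X]
# distinct indicator sequences give distinct base-3 expansions in {0, 2}
assert len(set(values)) == len(X)
assert all(0 <= v < 1 for v in values)
```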
Let (X, B) be a separable measurable space where B is generated by a sequence {A_j}_{j≥1}. By taking the union of the finite algebras generated by A_1, ..., A_n for each n, we can and do take A := {A_j}_{j≥1} to be an algebra. Let F_0 be the class of all finite sums Σ_{i=1}^n c_i 1_{A_i} for rational c_i ∈ ℝ and n = 1, 2, .... Then "Borel classes" or "Baire classes" F_α are defined as follows by transfinite recursion (RAP, 1.3.2). Let Ω be the set of all countable ordinals. For β ∈ Ω, if β > 0 and F_α is defined for all α < β, let F_β be the set of all pointwise limits of sequences of functions in the union of all F_α for α < β. Note that F_α ⊂ F_β whenever α < β. Let U := ∪_{α∈Ω} F_α.

5.2.2 Theorem U is the set of all measurable real functions on X.
Proof Clearly, each function in U is measurable. Conversely, the class of all sets B such that 1_B ∈ U is a monotone class and includes the generating algebra A, so it is the σ-algebra B of all measurable sets (RAP, Theorem 4.4.2). Likewise, for a fixed A ∈ A and constants c, d, the collection of all sets B such that c·1_A + d·1_B ∈ U is B. Then for a fixed B ∈ B, the set of all C ∈ B such that c·1_C + d·1_B ∈ U is all of B. By a similar proof for a sum of n terms, we get that all simple functions Σ_{i=1}^n c_i 1_{B(i)} are in U for any c_i ∈ ℝ and B(i) ∈ B. Since for any measurable real function f there are simple functions f_n → max(f, 0) and g_n → min(f, 0) pointwise (RAP, Proposition 4.1.5), so that f_n + g_n → f, every measurable real function on X is in U.
On admissibility there is the following main theorem:
5.2.3 Theorem (Aumann) Let I := [0, 1] with the usual Borel σ-algebra. Given a separable measurable space (X, B) and a class F of measurable real-valued functions on X, the following are equivalent:

(i) F ⊂ F_α for some α ∈ Ω;
(ii) there is a jointly measurable function G : I × X ↦ ℝ such that for each f ∈ F, f = G(t, ·) for some t ∈ I;
(iii) there is a separable admissible structure for F;
(iv) F is admissible;
(v) 2^F is an admissible structure for F;
(vi) F is image admissible via some (Y, S, T).
Remarks. The specific classes F_α depend on the choice of the countable family A of generators, but condition (i) does not: if C is another countable set of generators of B with corresponding classes G_α, then for any α ∈ Ω there are β ∈ Ω and γ ∈ Ω with F_α ⊂ G_β and G_α ⊂ F_γ.
Proof (ii) implies (iii): for each f ∈ F choose a unique t ∈ I and restrict the Borel σ-algebra to the set of t's chosen. Then (iii) follows. Clearly (iii) implies (iv), which is equivalent to (v). (iv) implies (iii): note that any real-valued measurable function f (for the Borel σ-algebra on ℝ as usual) is always measurable for some countably generated sub-σ-algebra, for example the one generated by {f > t}, t rational. A product
σ-algebra is the union of the σ-algebras generated by countable sets of rectangles A_i × B_i. So the evaluation map is measurable for such a sub-σ-algebra. Let D be the σ-algebra of subsets of F generated by the A_i. For any two distinct functions f, g in F, f(x) ≠ g(x) for some x. The map h ↦ h(x) is D-measurable. So {f} is the intersection of those A_i which contain f and the complements of the others, and (iii) follows. So (iii) through (v) are equivalent.
(iii) implies (ii): by Theorem 5.2.1, there is a subset Y ⊂ I = [0, 1] with Borel σ-algebra and a measurable isomorphism t ↦ G(t, ·) from Y onto F. The assumed admissibility implies that (t, x) ↦ G(t, x) is jointly measurable. By the general extension theorem for real-valued measurable functions (RAP, Theorem 4.2.5), although Y is not necessarily measurable, we can assume G is jointly measurable on I × X, proving (ii). So (ii) through (v) are equivalent.

(ii) implies (i): G belongs to some F_α on [0, 1] × X, where we take generators of the form A_i × B_i, A_i Borel, B_i ∈ B, and {A_i}_{i≥1} and {B_i}_{i≥1} are algebras. Hence the sections G(t, ·) on X all belong to F_α on X for the generators {B_i}.

(i) implies (ii): for this we need universal functions, defined as follows. A jointly measurable function G : I × X ↦ ℝ will be called a universal class α function if every function f ∈ F_α on X is of the form G(t, ·) for some t ∈ I. (G itself will not necessarily be of class α on I × X.) Recall by the way that a universal open set in ℕ^∞ × X exists for any separable metric space X (RAP, Proposition 13.2.3).
5.2.4 Theorem (Lebesgue) For any α ∈ Ω there exists a universal class α function G : I × X ↦ ℝ.
Proof For α = 0, F_0 is a countable sequence {f_k}_{k≥1} of functions. Let G(1/k, x) := f_k(x) and G(t, x) := 0 if t ≠ 1/k for all k. Then G is jointly measurable and a universal class 0 function.

For general α > 0, by transfinite induction (RAP, Section 1.3) suppose there is a universal class β function for all β < α. First suppose α is a successor, that is, α = β + 1 for some β. Let H be a universal class β function on I × X. Let I^∞ be the countable product of copies of I with the product σ-algebra. The product topology on I^∞ is compact and metrizable (RAP, Theorem 2.2.8 and Proposition 2.4.4). Its Borel σ-algebra is the same as the product σ-algebra, by way of the usual base or subbase of the product topology (RAP, pp. 30-31). For t = {t_n}_{n≥1} ∈ I^∞ let G(t, x) := lim sup_n H(t_n, x) if the lim sup is finite, otherwise G(t, x) := 0. Then G is jointly measurable. Now I^∞, as a
Polish space, is Borel-isomorphic to I (RAP, Section 13.1), so we can replace I^∞ by I. Then G is a universal class α function.

If α is not a successor, then there is a sequence β_k ↑ α, β_k < α, meaning that for every β < α there is some k with β < β_k. For each k let G_k be a universal class β_k function. Define G on I² × X by G(s, t, x) := G_k(t, x) if s = 1/k and G(s, t, x) := 0 otherwise. Then G is jointly measurable. Since I² is Borel-isomorphic to I we again have a universal class α function, proving Theorem 5.2.4.
With Theorem 5.2.4, (i) in Theorem 5.2.3 clearly implies (ii), so (i) through (v) are equivalent. (iv) implies (vi) directly. If (vi) holds, let Z be a subset of Y on which the map z ↦ T(z) is one-to-one and onto F. Let S_Z and T_Z be the restrictions to Z of S and T respectively. For a function f and set B let f[B] := {f(x) : x ∈ B}. Then F remains image admissible via (Z, S_Z, T_Z), and {T_Z[A] : A ∈ S_Z} is an admissible structure for F, giving (iv).
For 0 < p < ∞ and a probability law Q on (X, B) we have the space L^p(X, B, Q) of measurable real-valued functions f on X such that ∫ |f|^p dQ < ∞, with the pseudometrics

d_{p,Q}(f, g) := (∫ |f − g|^p dQ)^{1/p},  1 ≤ p < ∞,

d_{p,Q}(f, g) := ∫ |f − g|^p dQ,  0 < p < 1,

d_{0,Q}(f, g) := inf{ε > 0 : Q(|f − g| > ε) ≤ ε}.

For 0 < p < ∞, (y, x) ↦ |T(y)(x) − g(x)|^p is jointly measurable, and the Tonelli-Fubini theorem implies the desired measurability of y ↦ d_{p,Q}(T(y), g). For p = 0, given ε > 0, by the Tonelli-Fubini theorem again, the set A_ε of y such that Q(|T(y)(x) − g(x)| > ε) ≤ ε is
5.3 Suslin properties, selection, and a counterexample
185
measurable, and {y : d_{0,Q}(T(y), g) < r} is the union of the A_ε for ε rational, 0 < ε < r, so it is measurable.

6.1.5 Lemma (symmetrization) Let δ > 0 and F ⊂ L²(X, A, P) with

∫ f² dP ≤ δ² for all f ∈ F.
6.1 Koltchinskii-Pollard entropy and Glivenko-Cantelli theorems
199
Assume F is image admissible Suslin via some (Y, S, T). Then for any η > 0,

Pr{||ν_n^0||_F > η} ≥ (1 − δ²/η²) Pr{||ν_n||_F > 2η}.

Proof Note that the given events are measurable by Corollary 5.3.5. For x = (x_1, ..., x_{2n}) let H := {(x, f) : |ν_n(f)| > 2η, f ∈ F}. Then by admissibility, H is a product measurable subset of X^{2n} × F. The Suslin property implies (Theorem 5.3.2) that there is a universally measurable selector h such that whenever (x, f) ∈ H for some f ∈ F, h(x) ∈ F and (x, h(x)) ∈ H. Let x_τ := (x_{τ(1)}, ..., x_{τ(n)}). Then for some function h_1, h(x) = h_1(x_τ), where h_1(y) is defined if and only if y ∈ J for some u.m. set J ⊂ X^n (by Theorem 5.3.2). Let T_n be the smallest σ-algebra with respect to which x_τ(·) is measurable. Then on the set where x_τ ∈ J, since ν_n^0 = ν_n − ν_n',

Pr(||ν_n^0||_F > η | T_n) ≥ Pr(|ν_n'(h_1(x_τ))| ≤ η | T_n).

Given T_n, that is, given x_τ, h_1(x_τ)(·) is a fixed function f ∈ F with ∫ f² dP ≤ δ². Then since ν_n' is independent of T_n, we can apply Chebyshev's inequality to obtain

Pr(|ν_n'(f)| ≤ η) ≥ 1 − (δ/η)².

Integrating gives Pr{||ν_n^0||_F > η} ≥ (1 − δ²/η²) Pr{||ν_n||_F > 2η}, which, since ν_n' is a copy of ν_n, gives the result.
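The Chebyshev step above is easy to check by simulation: if ∫ f² dP ≤ δ², then ν_n(f) = n^{1/2}(P_n − P)(f) has variance at most δ², so Pr(|ν_n(f)| ≤ η) ≥ 1 − (δ/η)². A Monte Carlo sketch, with P uniform on [0, 1] and an illustrative f scaled so that ∫ f² dP = δ² (all numerical choices illustrative):

```python
import math
import random

random.seed(0)
delta, eta, n, reps = 0.3, 0.6, 200, 2000

def f(x):
    # illustrative f with integral f dP = 0 and integral f^2 dP = delta^2
    return delta * math.sqrt(3.0) * (2.0 * x - 1.0)

def nu_n():
    # nu_n(f) = sqrt(n) * (P_n - P)(f), P uniform on [0, 1], P(f) = 0
    sample = [random.random() for _ in range(n)]
    return math.sqrt(n) * sum(f(x) for x in sample) / n

hits = sum(abs(nu_n()) <= eta for _ in range(reps))
# Chebyshev: Pr(|nu_n(f)| <= eta) >= 1 - (delta/eta)^2 = 0.75
assert hits / reps >= 1 - (delta / eta) ** 2
```

The empirical frequency is well above the Chebyshev bound, as expected, since the bound is distribution-free.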
Some reverse martingale and submartingale properties of the empirical measures P_n will be proved. Recall that Q(f) := ∫ f dQ for any f ∈ L¹(Q), and that in defining empirical measures P_n := n^{-1} Σ_{j=1}^n δ_{X_j}, the X_j are always (in this book) taken as coordinates on a product of copies of a probability space (X, A, P), so that the underlying probability measure for P_n is P^n, a product of n copies of P. The definitions of (reversed) (sub)martingale (e.g., RAP, Sections 10.3 and 10.6), for random variables Y_n and σ-algebras A_n, will be slightly extended here by allowing Y_n to be measurable for the completion of A_n.

6.1.6 Theorem Let (X, A, P) be a probability space, let F ⊂ L¹(P), and let P_n be empirical measures for P. Let S_n be the smallest σ-algebra for which P_k(f) are measurable for all k ≥ n and f ∈ L¹(X, A, P). Then:

(a) For any f ∈ F, {P_n(f), S_n}_{n≥1} is a reversed martingale; in other words, E(P_{n−1}(f) | S_n) = P_n(f) a.s. if n ≥ 2.
200
Limit Theorems for Vapnik-Cervonenkis and Related Classes
(b) (F. Strobl) Suppose that F has an envelope function F ∈ L¹(X, A, P) and that for each n, ||P_n − P||_F is measurable for the completion of P^n. Then (||P_n − P||_F, S_n)_{n≥1} is a reversed submartingale; in other words, ||P_n − P||_F ≤ E(||P_k − P||_F | S_n) a.s. for k ≤ n.

... Then by Markov's inequality,

(6.2.10)  Pr(log Δ^C_n > nε³) ≤ ε.
Given A_{2n}, the event |(P_n − P_n')(A)| > ε is the same for any two sets A having the same intersection with {x_1, ..., x_{2n}}. Thus

Pr{||P_n − P_n'||_C > ε | A_{2n}} ≤ Δ^C_{2n} exp(−nε²/2).

Hence, for ε ≤ 1/8 and n ≥ n_0 large enough so that exp(−nε²/4) < ε, we have by (6.2.10)

Pr(||P_n − P_n'||_C > ε) ≤ ε + exp(2nε³ − nε²/2) ≤ ε + exp(−nε²/4) < 2ε,

so (c) implies (b).
Now let us show (b) implies (c). By (6.2.9) we have 1 ≥ n^{-1} log Δ^C_n → c a.s. for some constant c := c_2 ≥ 0. Thus n^{-1} E log Δ^C_n → c, and we want to prove c = 0. Suppose c > 0. Given ε > 0, for n large enough,

Pr{(2n)^{-1} log Δ^C_{2n} > c/2} = Pr{Δ^C_{2n} > e^{nc}} > 1 − ε.

Next, to symmetrize,

Pr{||P_n − P_n'||_C > 2ε} ≤ Pr{||P_n − P||_C > ε} + Pr{||P_n' − P||_C > ε} = 2 Pr{||P_n − P||_C > ε}.

So it will suffice to prove that ||P_n − P_n'||_C does not converge to 0 in probability. If 2 ≤ k := [αn] where [x] is the largest integer ≤ x and 0 < α < 1/2, then αn < k + 1 ≤ 3k/2,
6.2 Vapnik-Cervonenkis-Steele laws of large numbers
207
so by the inequality (4.1.5) and Stirling's formula (Theorem 1.3.13) we have

Σ_{j≤k} (2n choose j) ≤ (3e/α)^{αn}  for n ≥ 2/α.

Choose α small enough so that c − α log(3e/α) > 0. Then for n large enough,

(6.2.11)  n ≥ 2/α and (3e/α)^{αn} < e^{nc}.

Hence by Sauer's lemma (4.1.2), if Δ^C_{2n} > e^{nc} then k^C_{2n} ≥ [αn]. Fix n satisfying (6.2.11). Let k := [αn]. Now on an event U with Pr(U) > 1 − ε, there is a subset T of the indices {1, ..., 2n} such that card T = k, C shatters {x_i : i ∈ T}, and x_i ≠ x_j for i ≠ j in T. If there is more than one such T, let us select each of the possible T's with equal probability, using for this a random variable η independent of the x_j and σ_j, 1 ≤ j ≤ 2n. Then since the x_j are i.i.d., T is uniformly distributed over its (N choose k) possible values, N := 2n. For any distinct j_i ∈ {1, ..., 2n}, we have, where the following equations are conditional on U,
(6.2.12)
Pr(j_1 ∈ T) = k/N,  Pr(j_1, j_2 ∈ T) = k(k − 1)/(N(N − 1)),
Pr(j_i ∈ T, i = 1, 2, 3, 4) = k(k − 1)(k − 2)(k − 3)/(N(N − 1)(N − 2)(N − 3)).
Let M_n be the number of values of j ≤ n such that both 2j − 1 and 2j are in T. Then from (6.2.12),

E M_n = k(k − 1)/(2(N − 1)) = α²n/4 + O(1), n → ∞;

E M_n² = k(k − 1)/(2(N − 1)) + n(n − 1) k(k − 1)(k − 2)(k − 3)/(N(N − 1)(N − 2)(N − 3)) = (E M_n)² + O(n),
and a bit of algebra gives σ²(M_n) = E M_n² − (E M_n)² = O(n) as n → ∞. Thus for 0 < δ < α²/4, by Chebyshev's inequality, Pr(M_n ≥ δn) > 1 − 2ε for n large. On the event U ∩ {M_n ≥ δn}, let us make a measurable selection (5.3.2) of a sequence J of [δn] values of i such that J' := ∪_{i∈J} {2i − 1, 2i} ⊂ T. Here M_n and J are independent of the σ(j). Now measurably select, by Theorem 5.3.2 again, a set A = A(ω) = T(y(ω)) ∈ C such that {j ∈ J' : x_j ∈ A} = {σ(i) : i ∈ J}. Then Σ_{i∈J} (δ_{x_{σ(i)}} − δ_{x_{σ'(i)}})(A) = [δn], where σ'(i) denotes the element of {2i − 1, 2i} other than σ(i). Here A(·) is measurable for the σ-algebra B_J generated by all the x_j, by η, and by σ(i) for i ∈ J. Conditional on B_J,

Σ_{i∉J} (δ_{x_{σ(i)}} − δ_{x_{σ'(i)}})(A) = Σ_{i∉J} s_i a_i,

where the a_i are B_J-measurable functions with values 1, 0, and −1, and the s_i have values ±1 with probability 1/2 each, independently of each other and of B_J. Thus by Chebyshev's inequality,

Pr{ |Σ_{i∉J} s_i a_i| > nδ/3 | B_J } ≤ 9/(nδ²)

on the event where J is defined. Thus for n large,

Pr{ (P_n' − P_n)(A(ω)) > δ/3 } > 1 − 3ε,

so ||P_n − P_n'||_C does not converge to 0 in probability, the desired contradiction. So Theorem 6.2.1 is proved.
6.2.13 Theorem In (6.2.9), if any c_i is 0, all three are 0.

Proof In the last proof, we saw that c_1 = 0 if and only if c_2 = 0, and just after (6.2.11), that if c_2 > 0 then c_3 > 0. On the other hand, if c_3 > 0 then for some δ > 0, k^C_n ≥ δn for n large enough a.s., and then Δ^C_n ≥ 2^{δn}. Thus c_2 ≥ c_3 log 2 > 0.

6.3 Pollard's central limit theorem

By way of the Koltchinskii-Pollard kind of entropy and law of large numbers (Section 6.1) the following will be proved:
6.3.1 Theorem (Pollard) Let (X, A, P) be a probability space and F ⊂ L²(X, A, P). Let F be image admissible Suslin via (Y, S, T) and have an envelope function F ∈ L²(X, A, P). Suppose that

(6.3.2)  ∫₀¹ (log D_F^{(2)}(x, F))^{1/2} dx < ∞.

Then F is a Donsker class for P.

The hypothesis (6.3.2) will be called Pollard's entropy condition. Before the proof of the theorem, here is a consequence:
6.3.3 Theorem (Jain and Marcus) Let (K, d) be a compact metric space. Let C(K) be the space of continuous real functions on K with supremum norm. Let X_1, X_2, ... be i.i.d. random variables in C(K). Suppose EX_1(t) = 0 and EX_1(t)² < ∞ for all t ∈ K. Assume that for some random variable M with EM² < ∞,

|X_1(s) − X_1(t)|(ω) ≤ M(ω) d(s, t)
for all ω and all s, t ∈ K. Suppose that

(6.3.4)  ∫₀¹ (log D(s, K, d))^{1/2} ds < ∞.

Then the central limit theorem holds, that is, in C(K), L(n^{−1/2}(X_1 + ... + X_n)) converges to some Gaussian law.
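The entropy integral (6.3.4) can be estimated numerically for a concrete compact metric space. A sketch for K = [0, 1] with the usual metric, where D(s, K, d) is approximated by a greedy maximal s-separated set on a grid (all numerical choices illustrative):

```python
import math

def packing_number(s, grid):
    # cardinality of a greedy maximal s-separated subset of [0, 1],
    # a grid approximation to D(s, K, d) for K = [0, 1], d(x, y) = |x - y|
    chosen = []
    for t in grid:
        if all(abs(t - u) > s for u in chosen):
            chosen.append(t)
    return len(chosen)

grid = [i / 1000 for i in range(1001)]
# Riemann sum for the entropy integral (6.3.4) over [0.01, 1]
ss = [i / 100 for i in range(1, 101)]
integral = sum(math.sqrt(math.log(packing_number(s, grid)))
               for s in ss) / 100
# the integrand grows only like sqrt(log(1/s)), so the integral is small
assert 0 < integral < 2
```

Since D(s, [0, 1], d) grows only like 1/s, the integrand is of order (log(1/s))^{1/2} and the integral converges, so (6.3.4) holds here.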
Proof For a real-valued function h on K recall (RAP, Section 11.2) that

||h||_L := sup_{s≠t} |h(s) − h(t)|/d(s, t),  ||h||_sup := sup_t |h(t)|,  ||h||_BL := ||h||_L + ||h||_sup,

BL(K) := {h ∈ C(K) : ||h||_BL < ∞}. ... Here d may be replaced by f ∘ d for suitable f with f(0) = 0, for example f(x) = x^α for some α > 0 or f(x) = 1/max(|log x|, 2). Thus one can increase the possibilities for obtaining the Lipschitz property of X_1 with respect to d, so long as (6.3.4) holds for d. Now to prove Theorem 6.3.1 we first have:
6.3.5 Lemma Let (X, A, P) be a probability space, F ∈ L²(X, A, P) and F ⊂ L²(X, A, P) having envelope function F. Let H := 4F² and H := {(f − g)² : f, g ∈ F}. Then 0 ≤ φ(x) ≤ H(x) for all φ ∈ H and x ∈ X, and for any δ > 0,

D_H^{(1)}(4δ, H) ≤ D_F^{(2)}(δ, F)².

Proof Clearly 0 ≤ φ ≤ H for φ ∈ H. Given any γ, choose m ≤ D_F^{(2)}(δ, F) and f_1, ..., f_m ∈ F such that (6.1.1) holds with p = 2. For any f, g ∈ F, take i and j such that

max(γ((f − f_i)²), γ((g − f_j)²)) ≤ δ² γ(F²).

It follows that if D_F^{(2)}(δ, F) < ∞ for all δ > 0 then D_H^{(1)}(ε, H) < ∞ for all ε > 0. Thus hypothesis (6.3.2) lets us apply Theorem 6.1.7, with F there = H.
6.3.6 Proposition Let (X, A, P) be a probability space. Suppose an image admissible Suslin class F of measurable real-valued functions on X has an envelope function F ∈ L²(X, A, P). If D_F^{(2)}(δ, F) < ∞ for all δ > 0, then F is totally bounded in L²(X, A, P).
Proof We may assume ∫ F² dP > 0, as otherwise F is trivially totally bounded. By Lemma 6.3.5 and Theorem 6.1.7 we have

sup{ |(P_n − P)((f − g)²)| : f, g ∈ F } → 0 a.s., n → ∞.

Also, as n → ∞, ∫ F² dP_{2n} → ∫ F² dP a.s. Given ε > 0, take n_0 large enough and a value of P_{2n}, n ≥ n_0, such that ∫ F² dP_{2n} ≤ 2 ∫ F² dP and

sup{ |(P_{2n} − P)((f − g)²)| : f, g ∈ F } ≤ ε/2.

Take 0 < δ < (ε/(4P(F²)))^{1/2} and choose f_1, ..., f_m ∈ F to satisfy (6.1.1) for p = 2 and γ = P_{2n}. Then for each f ∈ F we have for some j,

∫ (f − f_j)² dP ≤ ε/2 + ∫ (f − f_j)² dP_{2n} ≤ ε/2 + δ² ∫ F² dP_{2n} < ε.
Now to continue the proof of Theorem 6.3.1, the total boundedness gives us condition 3.7.2(II)(a); it remains to check the asymptotic equicontinuity condition 3.7.2(II)(b). For this, given ε > 0, let us apply the symmetrization lemma (6.1.5) to the sets

F_{j,δ} := {f − f_j : f ∈ F, ∫ (f − f_j)² dP ≤ δ²}

for δ > 0 small enough, with η = ε/2. By Theorem 5.2.5 with p = 2, {y ∈ Y : T(y) − f_j ∈ F_{j,δ}} ∈ S, and since a measurable subset of a Suslin space is Suslin with its relative Borel structure (by RAP, Theorem 13.2.1), F_{j,δ} is image admissible Suslin. Thus Lemma 6.1.5 applies and we need only check 3.7.2(II)(b) for the symmetrized ν_n^0 in place of ν_n. To do this we will prove 3.7.2(II)(b) conditionally on P_{2n}, for P_{2n} in a set it belongs to with probability converging to 1 as n → ∞. Via Lemma 6.3.5 and Theorem 6.1.7 again, F_{j,δ} will be treated by way of

{f − f_j : f ∈ F, ∫ (f − f_j)² dP_{2n} ≤ δ²}.

Given ε > 0, it will be enough to prove

(6.3.7)  Pr{ Pr[ sup{ |ν_n^0(f − g)| : f, g ∈ F, ∫ (f − g)² dP_{2n} ≤ δ² } > 3η | A_{2n} ] > 3ε } < 3ε

for δ small enough and n large enough, the inner conditional probability being a measurable function of (x_1, ..., x_{2n}).
Given A_{2n}, that is, given x, let ||f||_{2n} := (P_{2n}(f²))^{1/2}. Let δ_i := 2^{−i}, i = 1, 2, .... Choose finite subsets F(1, x), F(2, x), ... of F such that for all i and f ∈ F,

(6.3.8)  min{ ||f − g||_{2n} : g ∈ F(i, x) } ≤ δ_i ||F||_{2n},

with card(F(i, x)) ≤ D_F^{(2)}(δ_i, F). We can write F(i, x) = {g_{i1}, ..., g_{i,k(i,x)}} where by Lemma 6.1.9 we have g_{im} = T(y_{im}(x)), k(i, ·) and y_{im}(·) being universally measurable in x, and where y_{im}(x) is defined if and only if m ≤ k(i, x).
For each f ∈ F and each i, let f_i := g_{im} ∈ F(i, x) achieve the minimum in (6.3.8), with m minimal in case of a tie. For each k, let A_u^k denote the σ-algebra of universally measurable sets for laws defined on A^k. For each f = T(y) ∈ F and i, we have f_i = T(y)_i = g_{i,m(x,y,i)} where m(·, ·, i) is A_u^{2n} ⊗ S measurable. Thus (u, x, y) ↦ g_{i,m(x,y,i)}(u) is A_u^{2n+1} ⊗ S measurable. Hence ν_n^0(T(y) − T(y)_i) is A_u^{2n} ⊗ S measurable and thus equals some A^{2n} ⊗ S measurable function G(x, y) for x ∉ V, where P^{2n}(V) = 0. Then by the selection theorem (5.3.2), sup_y |G(x, y)| is universally measurable in x. Thus for each j,

(6.3.9)  sup_y |ν_n^0(T(y) − T(y)_j)| is P^{2n}-measurable in x.

Now ||f_i − f||_{2n} → 0 as i → ∞ by (6.3.8), and for any fixed r, f − f_r = Σ_{j>r} (f_j − f_{j−1}). Choose η_j > 0 such that

(6.3.10)  Σ_{j≥1} η_j < ∞,

(6.3.11)  η_j ≥ 576 P(F²) δ_j² H_j,

and

(6.3.12)  Σ_{j≥1} exp(−η_j/(288 δ_j P(F²))) ≤ Σ_{j≥1} exp(−j²/(288 P(F²))) < ∞.
Then

(6.3.13)  Pr{ sup_{f∈F} |ν_n^0(f − f_r)| > Σ_{j>r} η_j | A_{2n} } ≤ Σ_{j>r} Pr{ sup_{f∈F} |ν_n^0(f_j − f_{j−1})| > η_j | A_{2n} },

since there are at most exp(H_j) possibilities for the functions f_i, i = j − 1, j. For a fixed j and f, let

z_i := (f_j − f_{j−1})(x_{2i}) − (f_j − f_{j−1})(x_{2i−1}).

Then

ν_n^0(f_j − f_{j−1}) = n^{−1/2} Σ_{i=1}^n (−1)^{e(i)} z_i,

where e(i) := 1_{σ(i)=2i} are random variables taking values 0 and 1 with probability 1/2 each, independently of each other and of the z_i. Then by an inequality of Hoeffding (Proposition 1.3.5),

(6.3.14)  Pr{ |n^{−1/2} Σ_{i=1}^n (−1)^{e(i)} z_i| > η_j | A_{2n} } ≤ 2 exp(−n η_j²/(2 Σ_{i=1}^n z_i²)).
Next, if ||f − g||_{2n} ≤ δ, ... Σ_{j>r} η_j → 0 as r → ∞ (see (6.3.10) and just before it). Clearly Pr(B_n) → 1 as n → ∞. Now for f and g ranging over F,

sup{ |ν_n^0(f − g)| : ||f − g||_{2n} ≤ δ } ...

... j ≥ 1, M_j := {|W_P(B_j)| > 2K} ∖ ∪ ... < ∞, so Σ_{n=1}^∞ P(F* > n) < ∞, which implies E F*(X_1) < ∞ (RAP, Lemma 8.3.6).
**6.5 Inequalities for empirical processes

This section gives no proofs, but gives references. Recall that for the Brownian bridge y_t and 0 < M < ∞,

P(sup_{0≤t≤1} |y_t| > M) ≤ 2 exp(−2M²).

... For any ε > 0, π(n, C, M) ≤ K exp(−(2 − ε)M²) for a large enough constant K = K(ε, S(C)). Here, unlike the case of the Dvoretzky-Kiefer-Wolfowitz inequality, 2 − ε cannot be replaced by 2. Let

π(∞, F, M) := sup_P Pr(||G_P||_F > M).
Any upper bound for π(n, C, M) that holds for all finite n, uniformly in n, will also hold for π(∞, F, M), for any countable class F, by the finite-dimensional central limit theorem, and thus for any class F such that || · ||_F = || · ||_G for a countable G ⊂ F, as is true for many classes F. Consider inequalities of the form

(6.5.1)  π(∞, F, M) ≤ C M^γ exp(−2M²)

and

(6.5.2)  sup_n π(n, F, M) ≤ C M^γ exp(−2M²),

where C is a constant possibly depending on γ and F. For the VCM class O(d) of orthants O_y := {x : x_j ≤ y_j, j = 1, ..., d} in ℝ^d for all y ∈ ℝ^d, the smallest possible value of γ in (6.5.1) is 2(d − 1), by results of Goodman (1976) for d = 2, Massart (1983, 1986, Theorem A.1), Cabaña (1984, Section 3.2), and Adler and Brown (1986). For d ≥ 2 it follows that γ in (6.5.1) and (6.5.2) cannot be taken as 0. Adler and Samorodnitsky (1987, Example 3.2 p. 1347) found that for the VCM class B(d) of rectangular blocks parallel to the axes in ℝ^d, one has γ = 2(2d − 1) as the precise value in (6.5.1). Here S(O(d)) = d and S(B(d)) = 2d (Corollary 4.5.11). For these classes of sets, γ = 2(S(C) − 1) is optimal in (6.5.1). For the VCM class H(2) of all open half-planes in ℝ², where S(H(2)) = 3 by Theorem 4.2.1, and P is uniform on the unit square, Adler and Samorodnitsky (1987, Example 3.3 p. 1349) showed that γ < 2, so γ is smaller than the other examples suggested.

On the other hand, for S(C) = 1, γ can be 1 > 2(S(C) − 1), as follows. In the unit circle S¹ := {(cos θ, sin θ) : 0 ≤ θ < 2π} ⊂ ℝ², let HC be the class of half-open half-circles H_t := {(cos θ, sin θ) : t < θ ≤ t + π} for 0 ≤ t < π. Then it is easy to check that S(HC) = 1. Let P be the uniform law on S¹, dP(θ) = dθ/(2π) for 0 ≤ θ < 2π. Let X_t := G_P(H_t). Then X_t, −∞ < t < ∞, is a stationary Gaussian process, periodic of period 2π. Here γ = 1 is the precise exponent in (6.5.1) by a result of Pickands (1969, Lemma 2.5). See also Leadbetter, Lindgren and Rootzén (1983, Theorem 12.2.9) and Adler (1990, pp. 117-118). The class HC does not contain the empty set, and HC ∪ {∅} shatters some 2-element sets. Smith and Dudley (1992) showed that there exists a VCM class C with S(C) = 1 and ∅ ∈ C such that γ = 1 is optimal in (6.5.1). Here C has a treelike partial ordering by inclusion as in Section 4.4.

Returning to general cases, Samorodnitsky (1991) shows that for any VCM class C, (6.5.1) holds for any γ > 2S(C) − 1. By way of a theorem of Haussler stated in Section 4.9.2, (6.5.1) also holds for γ = 2S(C) − 1.
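That S(HC) = 1 can be checked by brute force on a discretization: no pair of points of the circle is shattered by the half-circles H_t, t ∈ [0, π). A sketch (the grids of angles are illustrative):

```python
import math

def in_halfcircle(t, theta):
    # H_t = {(cos a, sin a) : t < a <= t + pi}, angles mod 2*pi
    return 0.0 < (theta - t) % (2.0 * math.pi) <= math.pi

ts = [math.pi * k / 400 for k in range(400)]        # t in [0, pi)
thetas = [2.0 * math.pi * k / 7 for k in range(7)]  # test points

def shatters_pair(a, b):
    patterns = {(in_halfcircle(t, a), in_halfcircle(t, b)) for t in ts}
    return len(patterns) == 4

# no 2-element subset of the circle is shattered, so S(HC) = 1
assert not any(shatters_pair(a, b)
               for i, a in enumerate(thetas) for b in thetas[:i])
```

The restriction t ∈ [0, π) is essential: allowing all rotations t ∈ [0, 2π) would realize all four patterns on suitable pairs.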
For empirical processes, upper bounds for exponents γ have been harder to obtain than for Gaussian processes, while lower bounds in (6.5.1) also provide them for (6.5.2). Massart (1986, Theorem 3.3.1°(a)) proved (6.5.2) for any VCM class of sets and also for VC subgraph classes F of functions with values in [0, 1], for any γ > 6S(C), where C is the class of subgraphs of functions in F and is VCM. Talagrand (1994) showed that for any VCM class C one can take γ = 2S(C) − 1 in (6.5.2). Talagrand also proves similar bounds for other, not necessarily VC, classes. The half-circle and tree examples show that for S(C) = 1, Talagrand's bound is precise, so the best value of γ in (6.5.1) and in (6.5.2) for VCM classes C with S(C) = 1 is γ = 1. At this writing, the problem of finding optimal or useful constants C in (6.5.1) and (6.5.2) for general VCM classes with given S(C) seems to be open.

Another direction for inequalities is what is called the concentration of measure phenomenon. Ledoux (1996) gives an exposition. Here is one of the many results. Let P be the standard normal measure N(0, I) on ℝ^d. Let f be a real-valued Lipschitz function on ℝ^d, so that

||f||_L := sup_{x≠y} |f(x) − f(y)|/|x − y| < ∞.

Then for any r > 0,

P(|f − ∫ f dP| ≥ r) ≤ 2 exp(−r²/(2||f||_L²))

(Ledoux, 1996, (2.9) p. 181). Notably, the dimension d doesn't appear in the inequality. Concentration inequalities of similar form have been proved for spheres and some other Riemannian manifolds and for product spaces. Two major works for product spaces are Talagrand (1995, 1996a).
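The dimension-free character of the inequality shows up clearly in simulation. A sketch with d = 20 and the 1-Lipschitz function f(x) = |x| (Euclidean norm); the mean of f is estimated from the same sample, an illustrative shortcut:

```python
import math
import random

random.seed(1)
d, n, r = 20, 4000, 1.5

def f(x):
    # f(x) = Euclidean norm of x, Lipschitz with ||f||_L = 1
    return math.sqrt(sum(xi * xi for xi in x))

vals = [f([random.gauss(0, 1) for _ in range(d)]) for _ in range(n)]
mean_f = sum(vals) / n  # estimate of the integral of f dP
emp = sum(abs(v - mean_f) > r for v in vals) / n
bound = 2 * math.exp(-r * r / 2)  # since ||f||_L = 1
assert emp <= bound  # concentration holds regardless of d
```

The empirical tail probability is far below the bound; indeed |x| concentrates with fluctuations of constant order however large d is.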
**6.6 Glivenko-Cantelli properties and random entropy

This section is also a survey, giving references rather than proofs. Let (X, A, P) be a probability space and F a set of measurable real-valued functions on X. Recall that F is called a weak (resp. strong) Glivenko-Cantelli class for P if both F ⊂ L¹(X, A, P) and ||P_n − P||_F → 0 in outer probability (resp. almost uniformly) as n → ∞. Also, F is called order bounded for P if it has an integrable envelope function F_F, E F_F* < ∞. Let F_{0,P} := {f − Pf : f ∈ F}. Talagrand (1987a, Theorem 22) proved the following:

Theorem A (Talagrand) For a probability space (X, A, P) and a class F ⊂ L¹(X, A, P), the following are equivalent as n → ∞:

(a) F is a strong Glivenko-Cantelli class;
(b) the possibly nonmeasurable functions ||P_n − P||_F converge to 0 a.s.;
(c) F is a weak Glivenko-Cantelli class and F_{0,P} is order bounded.

Proof Each of (a), (b), or (c) for F is equivalent to the same statement for F_{0,P}. Clearly (a) implies (b), which for F_{0,P} is equivalent to (I) in Talagrand's theorem (1987a, Theorem 22), while (c) is intermediate between Talagrand's (VI) and (VII). Talagrand's (V) implies (a). □
A class F satisfying any of the three equivalent conditions in Theorem A will be called a Glivenko-Cantelli class. There exist weak Glivenko-Cantelli classes which are not Glivenko-Cantelli classes (problem 12). Talagrand (1987a, 1996b) gave a kind of Vapnik-Cervonenkis criterion for the Glivenko-Cantelli property, as follows. For a finite set E ⊂ X and −∞ < α < β < ∞, F shatters E at levels α, β if for every G ⊂ E there is an f ∈ F with f(y) ≥ β for all y ∈ G and f(y) ≤ α for all y ∈ E ∖ G. Thus, if F is the class of indicators of sets in a class C, then F shatters E at levels α, β, where 0 < α < β < 1, if and only if C shatters E as defined in Section 4.1.
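Shattering at levels α, β is easy to test for small finite classes: each function whose values on the finite set all avoid the interval (α, β) realizes exactly one subset G, so one can count realized subsets. A sketch (the classes, points, and levels are illustrative):

```python
def shatters_at_levels(fns, points, alpha, beta):
    # `points` is shattered at levels alpha < beta if every subset G
    # is realized by some f with f >= beta on G and f <= alpha off G;
    # an f whose values all avoid (alpha, beta) realizes exactly one G
    patterns = set()
    for f in fns:
        vals = [f(x) for x in points]
        if all(v <= alpha or v >= beta for v in vals):
            patterns.add(tuple(v >= beta for v in vals))
    return len(patterns) == 2 ** len(points)

points = [0.2, 0.8]
halflines = [lambda x, t=t: float(x <= t) for t in (0.0, 0.5, 1.0)]
# four functions realizing every 0-1 pattern on `points`
every = [lambda x, a=a, b=b: float(a if x < 0.5 else b)
         for a in (0, 1) for b in (0, 1)]
assert not shatters_at_levels(halflines, points, 1 / 3, 2 / 3)
assert shatters_at_levels(every, points, 1 / 3, 2 / 3)
```

With indicator functions and 0 < α < β < 1 this reduces to ordinary Vapnik-Cervonenkis shattering, as the text notes.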
Suppose given a probability space (X, A, P). Let F ⊂ L¹(X, A, P), −∞ < α < β < ∞, and A ∈ A. Let X_1, X_2, ..., X_n be coordinates on a product of copies of (X, A, P). Let W(F, A, α, β, n) be the set of all X := {X_j}_{j=1}^n ∈ A^n such that the X_j are all distinct and F shatters {X_1, X_2, ..., X_n} at levels α, β. The triple (A, α, β) is called a witness of irregularity for F if the following all hold: P(A) > 0; the restriction of P to A has no atoms; and for all n = 1, 2, ...,

(P^n)*(W(F, A, α, β, n)) = P(A)^n.

The terminology is as in Talagrand (1996b), except that in the 1996 paper, measurability assumptions make the star unnecessary. Talagrand (1987a, Theorem 2) proved:

Theorem B (Talagrand) Given (X, A, P), an order bounded class F ⊂ L¹(X, A, P) fails to be a Glivenko-Cantelli class if and only if there exists a witness of irregularity for it.

Thus, a class F is a Glivenko-Cantelli class for P if and only if F_{0,P} is order bounded and has no witness of irregularity. Now, given a measurable space (X, A), a class F of real-valued measurable functions on X is called a universal Glivenko-Cantelli class if it is a Glivenko-Cantelli class for every law (probability measure) on (X, A).
6.6 Glivenko-Cantelli properties and random entropy
While a universal Donsker class C of sets must be a VC class (Theorem 6.4.1), a universal Glivenko-Cantelli class need not be, even if X and C are both countable (problems 14, 15). In any uncountable complete separable metric space X, there exists an uncountable "universally null" set Z, which has outer measure 0 for every nonatomic law P on the Borel sets of X (Sierpiński and Szpilrajn, 1936). [As Shortt (1984) notes, Szpilrajn later changed his name to Marczewski.] The collection of all subsets of a universally null set is clearly a universal Glivenko-Cantelli class. Such examples and extensions of them show that the notion "universal Glivenko-Cantelli class" is surprisingly wide (Dudley, Giné and Zinn, 1991, Section 3). Some of the width is in seemingly pathological directions. More useful are classes F such that ||P_n − P||_F* → 0 uniformly in P, as follows. Given a measurable space (X, A) and a class F of real-valued measurable functions on X, F is a uniform Glivenko-Cantelli class as defined in Theorem 6.4.5 if and only if for every ε > 0 there is an n₀ such that for every law P on A and n ≥ n₀, Pr*{||P_n − P||_F > ε} < ε. Each function f in a universal (or uniform) Glivenko-Cantelli class must be bounded, and F₀ := {f − inf f : f ∈ F} must be uniformly bounded (problem 13). F₀ is a universal or uniform Glivenko-Cantelli class if and only if F has the same property.
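For the class of indicators of intervals [0, t] under the uniform law on [0, 1], the deviation ||P_n − P||_F is the classical Kolmogorov statistic, computable exactly from the order statistics. The sketch below (helper names are assumed for illustration) compares the closed-form supremum with a direct evaluation at and just below the jump points of the empirical distribution function.

```python
import random

def sup_deviation(sample):
    """sup over t of |F_n(t) - t| for the uniform law on [0, 1],
    where F_n is the empirical distribution function of `sample`."""
    u = sorted(sample)
    n = len(u)
    return max(max(i / n - u[i - 1], u[i - 1] - (i - 1) / n) for i in range(1, n + 1))

def sup_deviation_bruteforce(sample, eps=1e-12):
    u = sorted(sample)
    n = len(u)
    def fn(t):
        return sum(1 for x in u if x <= t) / n
    # the sup is attained at a jump point or immediately to its left
    candidates = [t for x in u for t in (x, max(0.0, x - eps))] + [0.0, 1.0]
    return max(abs(fn(t) - t) for t in candidates)

random.seed(0)
sample = [random.random() for _ in range(200)]
d_exact = sup_deviation(sample)
d_brute = sup_deviation_bruteforce(sample)
print(abs(d_exact - d_brute) < 1e-9)
```

As n grows this statistic tends to 0 a.s., the Glivenko-Cantelli property for this class.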
For x := (x₁, …, x_n) ∈ Xⁿ, n = 1, 2, …, and 0 < p < ∞, define on F₀ the pseudometrics
e_{x,p}(f, g) := [ n^{-1} Σ_{i=1}^{n} |f(x_i) − g(x_i)|^p ]^{min(1, 1/p)},
e_{x,∞}(f, g) := max_{i≤n} |f(x_i) − g(x_i)|.

… Σ_{m≥1} Pr(|(P_n − P)(C_m)| > ε) < ∞, and the Borel-Cantelli lemma gives the result. Now, given functions f₁, …, f_k and g₁, …, g_k in L¹, suppose
{C_m}_{m≥1} ⊂ ∪_i {[f_i, g_i] : P(g_i − f_i) < 1/2}. We may assume 0 ≤ f_i ≤ g_i ≤ 1 for all i. For each i with P(g_i − f_i) < 1/2, we have Σ{P(C_m) : f_i ≤ 1_{C(m)} ≤ g_i} < +∞, since if the series diverges, then for a subsequence C_{m(r)} we have Σ_r P(C_{m(r)}) = +∞, and for C := ∪_r C_{m(r)} we have P(C) = 1 by the Borel-Cantelli lemma, so f_i ≤ 1_C ≤ g_i implies
7.1 Definitions and the Blum-DeHardt law of large numbers
P(g_i) = 1 and P(f_i) ≥ 1 − P(g_i − f_i) > 1/2 > 0, but then f_i ≤ 1_{C(m)} ≤ g_i can hold for only finitely many m, a contradiction. Thus N(s, {C_m}_{m≥1}, P) < +∞ for every s < 1/2, which finishes the proof. ∎

On the other hand, let C be the collection of all finite subsets of [0, 1] with Lebesgue law P. Then ||P_n − P||_C = 1 ↛ 0, although 1_A = 0 a.s. for all A ∈ C. This shows that in Theorem 7.1.5, N_I < ∞ cannot be replaced by N(ε, F, d_p) = 1 for any L^p distance d_p.

A Banach space (S, ||·||) has a dual space (S', ||·||') of continuous linear forms f : S → ℝ with ||f||' := sup{|f(x)| : x ∈ S, ||x|| ≤ 1}. Let {f_m}_{m≥1} ⊂ S' be a countable total set: if f_m(x) = 0 for all m, then x = 0. Such f_m exist by the Hahn-Banach theorem (RAP, Corollary 6.1.5). Let D := ∩_m f_m^{-1}({P(f_m)}). Then Y ∈ D a.s., so D is nonempty. But if y, z ∈ D, then ||y − z|| = sup_m |f_m(y − z)| = 0, so D = {x₀} for some x₀.
Metric Entropy, with Inclusion and Bracketing
Direct proof Given ε > 0, there is a Borel measurable function g from S into a finite subset of itself such that P(||x − g(x)||) < ε. To show this, let {x_i}_{i=1}^∞ be dense in S, with x₀ := 0. For k = 1, 2, …, let g_k(x) := x_i for the smallest i ≤ k such that ||x − x_i|| = min_{r≤k} ||x − x_r||. …

7.2.2 Lemma Let G be a class of measurable functions with measurable envelope function G, let B be a measurable set, and let δ > 0. Suppose that P(G 1_B) < δ/2 and that for each g ∈ G, A(g) is a measurable set with A(g) ⊂ B. Then
Pr*{||(P_n − P)(g 1_{A(g)})||_G > 2δ} ≤ Pr{|(P_n − P)(G 1_B)| > δ}.

Proof |P(g 1_{A(g)})| ≤ P(G 1_B) < δ/2 for all g ∈ G, so
Pr*{||(P_n − P)(g 1_{A(g)})||_G > 2δ} ≤ Pr*{||P_n(g 1_{A(g)})||_G > 3δ/2} ≤ Pr{P_n(G 1_B) > 3δ/2} ≤ Pr{|(P_n − P)(G 1_B)| > δ}. ∎

Now to prove the theorem, let N_k := N_I(2^{-k}, F, P), k = 1, 2, …, and γ_k := (log(k N₁ ⋯ N_k))^{1/2}. Then γ_k is increasing in k. By the integral test,
Σ_{k=1}^∞ (log N_k)^{1/2}/2^{k+1} < ∞, so
Σ_{k=1}^∞ 2^{-k} γ_k ≤ Σ_{k=1}^∞ 2^{-k} [ (log k)^{1/2} + Σ_{j=1}^{k} (log N_j)^{1/2} ]
= Σ_{k=1}^∞ (log k)^{1/2}/2^k + Σ_{j=1}^∞ (log N_j)^{1/2} Σ_{k=j}^∞ 2^{-k} < ∞.
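The interchange of sums above rests on the square-root subadditivity γ_k ≤ (log k)^{1/2} + Σ_{j≤k} (log N_j)^{1/2}, since γ_k² = log k + Σ_{j≤k} log N_j. A quick numerical check, with an assumed toy sequence N_j = 2^j (not from the text), is below.

```python
import math

def gamma(k, logNs):
    """gamma_k = sqrt(log(k * N_1 * ... * N_k)), computed from the log N_j values."""
    return math.sqrt(math.log(k) + sum(logNs[:k]))

logNs = [j * math.log(2.0) for j in range(1, 41)]  # toy choice N_j = 2**j
lhs = sum(gamma(k, logNs) / 2.0 ** k for k in range(1, 41))
rhs = sum(math.sqrt(math.log(k)) / 2.0 ** k for k in range(1, 41)) \
    + sum(math.sqrt(lN) * 2.0 ** (1 - j) for j, lN in enumerate(logNs, start=1))
print(lhs <= rhs + 1e-12)  # sqrt(a + b) <= sqrt(a) + sqrt(b), summed with weights 2**-k
```

Here the factor 2^{1−j} is the tail sum Σ_{k≥j} 2^{-k}, exactly as in the display.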
Let β_k := Σ_{j=k}^∞ γ_j/2^j. Then β_k ↓ 0 as k → ∞. Let S_{ki} := [f_{ki}, h_{ki}], i = 1, …, N_k, be a set of 2^{-k}-brackets covering F, and let T_{ki} := S_{ki} \ ∪_{s<i} S_{ks}. For k ≥ 1, n ≥ 1, and f ∈ F let
(7.2.4)  B_k := B(k) := B(k, f, n) := {x ∈ X : Δ_k f > n^{1/2}/(2^{k+1} γ_{k+1})}.
For any fixed j and n and x ∈ X let
τ_f := τ_{j,n}(f, x) := min{k ≥ j : x ∈ B(k, f, n)}, where min ∅ := +∞. Then {τ_f = j} = B(j, f, n), {τ_f > j} = {Δ_j f ≤ n^{1/2}/(2^{j+1} γ_{j+1})}, and for k > j,
(7.2.5)  {τ_f ≥ k} ⊂ X \ B_{k−1} ⊂ {Δ_k f ≤ Δ_{k−1} f ≤ n^{1/2}/(2^k γ_k)}
by (7.2.3), and {τ_f = k} ⊂ B_k \ B_{k−1}. For any f, g ∈ F, let ρ_I(f, g) := 1/2^K for the largest K such that for some s, f and g are both in A_{K,s}, or ρ_I(f, g) := 0 if this holds for arbitrarily large K. Then F is totally bounded for ρ_I. So by Theorem 3.7.2 it will be enough to prove the asymptotic equicontinuity condition for ρ_I, in other words that for every α > 0,
(7.2.6)  lim_{j→∞} lim sup_{n→∞} Pr*{ n^{1/2} ||(P_n − P)(f − π_j f)||_F > α } = 0.
We can assume 0 < α < 1. Then for any positive integers j < r, f − π_j f will be decomposed as follows:
(7.2.7)  f − π_j f = (f − π_j f) 1_{τ_f = j} + (f − π_r f) 1_{τ_f ≥ r}
       + Σ_{k=j+1}^{r−1} (f − π_k f) 1_{τ_f = k}
       + Σ_{k=j+1}^{r} (π_k f − π_{k−1} f) 1_{τ_f ≥ k};
this is easily seen for r = j + 1 and then by induction on r. The decomposition (7.2.7) will give a bound for the outer probability in (7.2.6) by a sum of four terms to be labeled (I), (II), (III), and (IV) below.
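The decomposition (7.2.7) is a pointwise telescoping identity, so it can be verified mechanically. In the sketch below (names are illustrative), π_k f is just an arbitrary array of real numbers and τ ranges over all admissible values j, …, r and +∞ — exactly the cases the induction on r covers.

```python
import math
import random

def decomposition_rhs(f, pi, j, r, tau):
    """Right side of (7.2.7); pi[k] plays the role of (pi_k f)(x), tau of tau_f(x)."""
    term1 = (f - pi[j]) * (tau == j)
    term2 = (f - pi[r]) * (tau >= r)
    term3 = sum((f - pi[k]) * (tau == k) for k in range(j + 1, r))
    term4 = sum((pi[k] - pi[k - 1]) * (tau >= k) for k in range(j + 1, r + 1))
    return term1 + term2 + term3 + term4

random.seed(1)
j, r = 2, 7
ok = True
for trial in range(100):
    f = random.uniform(-1, 1)
    pi = [random.uniform(-1, 1) for _ in range(r + 1)]
    for tau in list(range(j, r + 1)) + [math.inf]:
        ok = ok and abs(decomposition_rhs(f, pi, j, r, tau) - (f - pi[j])) < 1e-12
print(ok)
```

For τ = k with j < k < r the last sum telescopes to π_k f − π_j f, which combines with f − π_k f to give f − π_j f, and similarly in the other cases.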
Let ε := α/8. Fix j = j(ε) large enough so that
(7.2.8)  β_j < ε/24 and Σ_{k>j} k^{-2} < 2ε.
7.2 Central limit theorems with bracketing
Then choose r > j large enough so that, since γ_r increases with r,
(7.2.9)  n^{1/2} 2^{-r} < ε/4 and 2 exp(−ε γ_r 2^r) < ε.
Lemma 7.2.2 will be applied to classes of functions
G := G(k, s) := G_{k,s} := {f − π_k f : f ∈ F, π_k f = f_{k,s}}, with envelope ≤ G := G_{k,s} := Δ_k f_{k,s}.
About (I): for any function ψ ≥ 0 and t > 0, 1_{ψ>t} ≤ ψ/t. So by (7.2.4),
n^{1/2} E(1_{B_j} Δ_j f) ≤ 2^{j+1} γ_{j+1} E((Δ_j f)²) ≤ 2^{1−j} γ_{j+1} ≤ 4β_j < ε/4
for all f ∈ F. Then since {τ_f = j} = B_j,
n^{1/2} |P((f − π_j f) 1_{τ_f = j})| ≤ n^{1/2} P(1_{B_j} Δ_j f) < ε/4.
Apply Lemma 7.2.2 to G_{j,s} for each s with B := B_j := B(j, f_{j,s}, n), δ := ε/n^{1/2}, and A(f − π_j f) := B_j = {τ_f = j} in this case. Then
Pr*{ ||n^{1/2}(P_n − P)(g 1_{B_j})||_{G(j,s)} > 2ε } ≤ Pr{ n^{1/2} |(P_n − P)((Δ_j f_{j,s}) 1_{B_j})| > ε }.
Then summing over s,
(I) := Pr*{ ||n^{1/2}(P_n − P)((f − π_j f) 1_{τ_f = j})||_F > 2ε }
  ≤ exp(γ_j²) max_s Pr{ n^{1/2} |(P_n − P)(Δ_j f_{j,s} 1_{B_j})| > ε }
  ≤ exp(γ_j²) ε^{-2} max_s Var(Δ_j f_{j,s} 1_{B_j}).
As n → ∞, for fixed j and s, by (7.2.3) and (7.2.4), since P(B_j) = P(B(j, f_{j,s}, n)) → 0 for each j and s, we have (I) → 0, so (I) < ε for n large enough.
About (II): we have by (7.2.3) and (7.2.9),
(7.2.10)  n^{1/2} E(Δ_r f 1_{Δ_r f ≤ n^{1/2}/(2^r γ_r)}) ≤ n^{1/2} E(Δ_r f) ≤ n^{1/2} 2^{-r} < ε/4.
Apply Lemma 7.2.2 with A(f − π_r f) := {τ_f ≥ r} and B := {Δ_r f_{r,s} ≤ n^{1/2}/(2^r γ_r)}, noting that Δ_r f ≤ Δ_{r−1} f for all f. Thus by (7.2.5) again,
(II) := Pr*{ ||n^{1/2}(P_n − P)((f − π_r f) 1_{τ_f ≥ r})||_F > 2ε }
  ≤ Pr{ n^{1/2} ||(P_n − P)(Δ_r f 1_{Δ_r f ≤ n^{1/2}/(2^r γ_r)})|| > ε }.
… For each k, the number of distinct functions π_k f − π_{k−1} f, f ∈ F, is at most exp(γ_k²). Also, π_k f ∈ A_{k−1}(f), and so by (7.2.3), E((π_k f − π_{k−1} f)²) ≤ 2^{2−2k}. Now
(IV) := Pr*{ n^{1/2} || Σ_{k=j+1}^{r} (P_n − P)((π_k f − π_{k−1} f) 1_{τ_f ≥ k}) ||_F > ε }
  ≤ Σ_{k=j+1}^{r} Pr{ n^{1/2} ||(P_n − P)((π_k f − π_{k−1} f) 1_{τ_f ≥ k})||_F > 2^{-k} ε γ_k/β_j }
from the definition of β_j. Bernstein's inequality and (7.2.5) give
(IV) ≤ Σ_{k=j+1}^{r} 2 exp( γ_k² − (ε² γ_k²/β_j²) / (8 + (2/3) ε/β_j) ).
Now since β_j < ε/24, 8 + (2/3) ε β_j^{-1} < ε β_j^{-1}, and
(IV) ≤ Σ_{k=j+1}^{r} 2 exp( γ_k² (1 − ε/β_j) ) ≤ 2 Σ_{k=j+1}^{r} exp(−23 γ_k²) < 2ε
as in (III). Thus the expression in (7.2.6) is less than α. Letting α ↓ 0, j → ∞, and n → ∞, the proof of Theorem 7.2.1 is complete. ∎

Theorem 7.2.1 implies the following for L¹ entropy with bracketing:

7.2.13 Corollary Let (X, A, P) be a probability space and F a uniformly bounded set of measurable functions on X. Suppose that
∫₀¹ (log N_I^{(1)}(x², F, P))^{1/2} dx < ∞.
Then F is a Donsker class for P.

Proof Suppose |f(x)| ≤ M < ∞ for all f ∈ F and x ∈ X. Since multiplication by a constant preserves the Donsker property (by Theorem 3.7.2), we can assume M = 1/2. Then for any f, g ∈ F, |f − g| ≤ 1 everywhere. So if ∫|f − g| dP < ε², then (∫|f − g|² dP)^{1/2} < ε. So N_I^{(2)}(ε, F, P) ≤ N_I^{(1)}(ε², F, P), and the result follows from Theorem 7.2.1. ∎

We also find that the sufficient condition given in Corollary 7.2.13 is necessary for 2^ℕ. Recall N_I as defined above Theorem 7.1.5.
7.3 The power set of a countable set: the Borisov-Durst theorem

7.3.1 Theorem Let P be a law on ℕ and p_m := P({m}). The following are equivalent:
(a) 2^ℕ is a Donsker class for P;
(b) Σ_m p_m^{1/2} < ∞;
(c) ∫₀¹ (log N_I(x², 2^ℕ, P))^{1/2} dx < ∞.
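Condition (b) is easy to test for concrete laws on ℕ. The sketch below (helper names assumed) contrasts a geometric law p_m = 2^{−m}, for which Σ p_m^{1/2} converges, so 2^ℕ is Donsker, with p_m = 1/(m(m+1)), for which the square-root series diverges like a harmonic series.

```python
import math

def sqrt_partial_sum(p, M):
    """Partial sum of p(m)**(1/2) for m = 1..M."""
    return sum(math.sqrt(p(m)) for m in range(1, M + 1))

geometric = lambda m: 2.0 ** (-m)              # sqrt series is geometric: converges
harmonic_like = lambda m: 1.0 / (m * (m + 1))  # sqrt series ~ sum 1/m: diverges

print(sqrt_partial_sum(geometric, 10_000))     # bounded, about 1/(sqrt(2) - 1)
growth = sqrt_partial_sum(harmonic_like, 10_000) - sqrt_partial_sum(harmonic_like, 100)
print(growth)                                  # keeps growing, roughly log(100)
```

Both p's sum to 1 over m ≥ 1, so each is a genuine law on ℕ; only the first satisfies (b).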
Proof We have (c) ⟹ (a) by Corollary 7.2.13. Next, to prove (a) ⟹ (b), suppose Σ_m p_m^{1/2} = ∞. The random variables W(m) := W_P(1_{{m}}) (for the isonormal W_P on L²(P) as defined in Section 2.5) are independent and Gaussian with mean 0 and variances p_m. We can write G_P(f) = W_P(f) − P(f)W_P(1), since the right side is Gaussian and has mean 0 and the covariances of G_P. Then Σ_m E|W_P({m})| = (2/π)^{1/2} Σ_m p_m^{1/2} diverges, while Σ_m Var(|W_P({m})|) ≤ Σ_m p_m < ∞. Thus for any M < ∞, by Chebyshev's inequality, Pr(Σ_{j=1}^m |W_P({j})| > M) → 1 as m → ∞. Thus Σ_j |W_P({j})| = +∞ almost surely. Now Σ_m p_m |W_P(1_ℕ)| < ∞ a.s., so Σ_m |G_P({m})| = +∞ a.s. Hence sup_{A⊂ℕ} G_P(1_A) = +∞ a.s., and 2^ℕ is not a pregaussian class, so a fortiori not a Donsker class. Thus (a) implies (b).
Next, to prove (b) ⟹ (c): equivalently, let us prove
Σ_{k=1}^∞ 2^{-k} (log N_I(4^{-k}, 2^ℕ, P))^{1/2} < ∞.
We can assume p_m ≥ p_r > 0 for m < r. Let r_j be the number of values of m such that 4^{-j-1} < p_m^{1/2} ≤ 4^{-j}, j = 0, 1, 2, …, and C_j := r_j/4^j. Then Σ_j C_j < ∞. For k ≥ k₀ large enough there is a unique j(k) such that
Σ_{j>j(k)} C_j/4^j < 4^{-k} ≤ Σ_{j≥j(k)} C_j/4^j.
Let m(k) := M_k := Σ_{j=0}^{j(k)} r_j. Then
Σ_{m>m(k)} p_m ≤ Σ_{j>j(k)} r_j/4^{2j} = Σ_{j>j(k)} C_j/4^j < 4^{-k}.
Let A_i run over all subsets of {1, …, m(k)}, where i = 1, …, 2^{m(k)}. Let B_i := A_i ∪ {m ∈ ℕ : m > m(k)}. Then for any C ⊂ ℕ, C ∩ {1, …, m(k)} = A_i for some i. Then A_i ⊂ C ⊂ B_i and P(B_i \ A_i) < 4^{-k}. So N_I(4^{-k}, 2^ℕ, P) ≤ 2^{m(k)+1}. Thus it will be enough to prove
(7.3.3)  Σ_k m_k^{1/2}/2^k < ∞,
with Σ_k restricted to k ≥ k₀. We have m_k = Σ_{j=0}^{j(k)} r_j = Σ_{j=0}^{j(k)} 4^j C_j, so
Σ_k m_k^{1/2}/2^k ≤ Σ_k 2^{-k} ( Σ_{j=0}^{j(k)} 4^j C_j )^{1/2} ≤ Σ_k 2^{-k} Σ_{j=0}^{j(k)} 2^j C_j^{1/2} …
… exp(C ε^{−d/α}) for ε small enough, and if d ≥ 2, for small enough ε > 0 and M := C(γ, α, K, d−1),
N_I(ε, C(α, K, d), P) ≥ D(ε, C(α, K, d), d_P) ≥ exp(M ε^{−(d−1)/α}).

Proof The following combinatorial fact will be used:
8.2.11 Lemma Let B be a set with n elements, n = 0, 1, …. Then there exist subsets E_i ⊂ B, i = 1, …, k, where k ≥ e^{n/6}, such that for i ≠ j, the symmetric difference E_i Δ E_j has at least n/5 elements.

Proof For any set E ⊂ B, the number of sets F ⊂ B such that card(E Δ F) ≤ n/5 is 2ⁿ B(n/5, n, 1/2), where binomial probabilities B(k, n, p) are as defined before the Chernoff-Okamoto inequalities (1.3.8). If S_n is the sum of n independent Rademacher variables X_i, taking values ±1 with probability 1/2 each,
Approximation of Functions and Sets
then by one of Hoeffding's inequalities (1.3.5), defining "success" as X_i = 1,
B(n/5, n, 1/2) = Pr(S_n ≥ 3n/5) ≤ exp(−9n/50).
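The binomial tail bound above can be checked exactly for moderate n with integer arithmetic; the sketch below (helper name assumed) evaluates B(n/5, n, 1/2) directly and compares it with exp(−9n/50).

```python
import math

def binom_cdf_half(k, n):
    """B(k, n, 1/2) = Pr(at most k successes in n fair coin flips), computed exactly."""
    return sum(math.comb(n, i) for i in range(k + 1)) / 2.0 ** n

n = 50
exact = binom_cdf_half(n // 5, n)
bound = math.exp(-9 * n / 50)
print(exact <= bound)  # Hoeffding's bound holds, with room to spare
```

For n = 50 the exact probability is on the order of 10^{-5}, comfortably below the bound exp(−9) ≈ 1.2 × 10^{-4}.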
… Applying Lemma 8.2.11, and obtaining sets S, T with card(S Δ T) ≥ m^d/5, gives ∫ f_{SΔT} dP ≥ δ m^{−α}/6, and ∫ |f_S − f_T| dP = ∫ f_{SΔT} dP. So
D(δ m^{−α}/6, G_{α,K,d}, d_{1,P}) ≥ exp(m^d/6).
Thus if 0 < ε ≤ δ/6, since ⌊x⌋ ≥ x/2 for x ≥ 1,
D(ε, G_{α,K,d}, d_{1,P}) ≥ exp( ⌊(δ/(6ε))^{1/α}⌋^d / 6 ) ≥ exp( 2^{-d} (δ/(6ε))^{d/α} / 6 ),
proving the statement about G_{α,K,d}.
If C = J(f) and D = J(g), then
d_P(C, D) := P(C Δ D) ≥ γ λ(C Δ D) = γ ∫ |f − g| dλ,
so for ε > 0,
N_I(ε, C(α, K, d), P) ≥ D(ε, C(α, K, d), d_P) ≥ D(ε/γ, G_{α,K,d−1}, d_{1,λ}),
which finishes the proof of Theorem 8.2.1. ∎

To get lower bounds for the Hausdorff metric, the following will help:
8.2.12 Lemma If α ≥ 1 and f, g ∈ G_{α,K,d}, then
h(J_f, J_g) ≥ d_sup(f, g)/(2 max(1, Kd)).

Proof Note that for α ≥ 1, any g ∈ G_{α,K,d} is Lipschitz in each coordinate, by the mean value theorem, with |g(x) − g(y)| ≤ K|x − y| if x_j = y_j for all but one value of j. In the cube, one can go from a general x to a general y by changing one coordinate at a time, so g is Lipschitz with ||g||_L ≤ Kd, and Lemma 8.1.3 applies. ∎
8.2 Spaces of differentiable functions and sets
8.2.13 Corollary If α ≥ 1 and d = 1, 2, …, then as ε ↓ 0,
log D(ε, C(α, K, d+1), h) ≍ ε^{−d/α}.

Proof This follows from Lemma 8.2.12 and the first and third statements in Theorem 8.2.1. ∎
8.2.14 Remark For m = 1, 2, …, let I^d be decomposed into a grid of m^d subcubes of side 1/m. Let E be the set of centers of the cubes. For any A ⊂ I^d, let B ⊂ E be the set of centers of the cubes in the grid that A intersects. Then
h(A, B) ≤ d^{1/2}/(2m), which includes the possibility that A = B = ∅. For 0 < ε ≤ 1 there is a least m = 1, 2, …, such that d^{1/2}/m ≤ ε, namely m = ⌈d^{1/2}/ε⌉. It follows that
D(ε, 2^{I^d}, h) ≤ 2^{(1 + d^{1/2}/ε)^d}.
Hence for α < d/(d + 1), Corollary 8.2.13 cannot hold, nor can the upper bound for h in Theorem 8.2.1 be sharp.
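The grid-center approximation of Remark 8.2.14 is easy to exercise numerically: for any finite set A in the unit square, its set B of occupied cell centers is within half a cell diagonal in Hausdorff distance. The sketch below (helper names assumed) checks the bound h(A, B) ≤ d^{1/2}/(2m) for d = 2.

```python
import math
import random

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets in the plane."""
    def directed(P, Q):
        return max(min(math.dist(p, q) for q in Q) for p in P)
    return max(directed(A, B), directed(B, A))

def grid_centers(A, m):
    """Centers of the cells of the m x m grid on the unit square that A meets."""
    cells = {(min(int(x * m), m - 1), min(int(y * m), m - 1)) for (x, y) in A}
    return {((i + 0.5) / m, (j + 0.5) / m) for (i, j) in cells}

random.seed(2)
m = 10
A = [(random.random(), random.random()) for _ in range(300)]
B = grid_centers(A, m)
print(hausdorff(A, B) <= math.sqrt(2) / (2 * m) + 1e-12)
```

Each point of A lies within half a diagonal of its own cell center, and each center in B was contributed by some point of A, so both directed distances obey the bound.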
The classes C(α, K, d) considered so far contain sets with flat faces except for one curved face. There are at least two ways to form more general classes of sets with piecewise differentiable boundaries, still satisfying the bounds in Theorem 8.2.1. One is to take a bounded number of Boolean operations. Let v₁, …, v_k be nonzero vectors in ℝ^d, where d ≥ 2. For constants c₁, …, c_k let H_j := {x ∈ ℝ^d : ⟨x, v_j⟩ = c_j}, a hyperplane. Let π_j map each x ∈ ℝ^d to its nearest point in H_j, π_j(x) := x − (⟨x, v_j⟩ − c_j) v_j/|v_j|². Let T_j be a cube in H_j and α, K > 0. Let f_j be a linear transformation taking T_j onto I^{d−1}. For g ∈ G_{α,K,d−1}, let
J_j(g) := {x ∈ ℝ^d : π_j(x) ∈ T_j, c_j ≤ ⟨v_j, x⟩ ≤ c_j + g(f_j(π_j(x)))},
and let C_j := {J_j(g) : g ∈ G_{α,K,d−1}}. Then Theorem 8.2.1 implies that if K is a compact set in ℝ^d (e.g., a cube) including all sets in C_j, and P is a law on K having bounded density with respect to Lebesgue measure λ, then for some M_j < ∞,
log D(ε, C_j, d_P) ≤ log N_I(ε, C_j, P) ≤ M_j ε^{−(d−1)/α}.
By Theorem 8.1.4 we then have the following:
8.2.15 Theorem Let d ≥ 2 and let C₀ := {A₁ ∩ ⋯ ∩ A_k : A_j ∈ C_j}, or C₀ := {A₁ ∪ ⋯ ∪ A_k : A_j ∈ C_j}, for the classes C_j just defined. Then for some M < ∞,
log D(ε, C₀, d_P) ≤ log N_I(ε, C₀, P) ≤ M ε^{−(d−1)/α}.

For α > 1, the minimum or maximum of two functions in G_{α,K,d} need not have first derivatives everywhere and then will not be in G_{γ,K,d} for any γ > 1 and K < ∞.

The other way uses parametrizations by spheres, as follows. Let A := {x ∈ S^{d−1} : x₁ ≥ −1/2} and C := {x ∈ S^{d−1} : x₁ ≤ 1/2}. There is a one-to-one C^∞ function ψ from {x ∈ ℝ^{d−1} : |x| < 9/8} into ℝ^d, with derivative matrix (∂ψ_i/∂x_j)_{i,j} of maximum rank d − 1 everywhere,
such that ψ takes B_{d−1} := {x ∈ ℝ^{d−1} : |x| ≤ 1} onto A. Let η(y) := (−ψ₁(y), ψ₂(y), …, ψ_d(y)). Then the above statements for ψ and A also hold for η and C. For 0 < α, K < ∞, let F_{α,K}(S^{d−1}) be the set of functions h : S^{d−1} → ℝ such that h∘ψ and h∘η ∈ F_{α,K}(B_{d−1}), recalling that f∘g(y) := f(g(y)). Let F^{(d)}_{α,K}(S^{d−1}) be the set of functions h = (h₁, …, h_d) such that h_j ∈ F_{α,K}(S^{d−1}) for each j = 1, …, d.

Two continuous functions F, G from one topological space X to another, Y, are called homotopic if there exists a jointly continuous function H from X × [0, 1] into Y such that H(·, 0) = F and H(·, 1) = G. H is then called a homotopy of F and G. Let I(F) be the set of all y ∈ Y, not in the range of F, such that among mappings of X into Y \ {y}, F is not homotopic to any constant map G(x) ≡ z ≠ y. For a function F, let R(F) := ran(F) := range(F) and C(F) := I(F) ∪ R(F). For example, if F is the identity from S^{d−1} onto itself in ℝ^d, then I(F) = {y : |y| < 1}, by well-known facts in algebraic topology; see, for example, Eilenberg and Steenrod (1952, Chapter 11, Theorem 3.1).

Let I(d, α, K) := {I(F) : F ∈ F^{(d)}_{α,K}(S^{d−1})} and let K(d, α, K) := {C(F) : F ∈ F^{(d)}_{α,K}(S^{d−1})}. Then I(d, α, K) is a collection of open sets and K(d, α, K) of compact sets, each of which, in a sense, has boundaries differentiable of order α. (For functions F that are not one-to-one, the boundaries may not be differentiable in some other senses.) For K(d, α, K), and to some extent for
I(d, α, K), there are bounds as for other classes of sets with α-times differentiable boundaries (Theorem 8.2.15):

8.2.16 Theorem For each d = 2, 3, …, K ≥ 1, and α ≥ 1,
(a) there is a constant H_{d,α,K} < ∞ such that for 0 < ε ≤ 1, and the Hausdorff metric h,
log D(ε, K(d, α, K), h) ≤ H_{d,α,K} ε^{−(d−1)/α}. …

… Let |y| ≤ 1/2 and g(u) := u + y(1 − |u|) for |u| ≤ 1, g(u) := u for |u| ≥ 1. Then g is the identity for |u| ≥ 1 and is continuous, with g(0) = y. Also, g is 1-1: |g(u)| < 1 for |u| < 1, and if g(u) = g(v) with |u|, |v| < 1, then u − v = y(|v| − |u|) and |u − v| ≤ ||u| − |v||/2 ≤ |u − v|/2, so u = v. Thus y ∈ I(F), so I(F) is open. Since C(F) is closed by Lemma 8.2.18, it follows that the boundary of I(F) is included in R(F).

Recall that for a metric space (S, d), a set A ⊂ S, and δ > 0, the δ-interior of A is defined by δA := {x : d(x, y) < δ implies y ∈ A}, and the δ-neighborhood
by A^δ := {y : d(x, y) < δ for some x ∈ A}.

8.2.20 Lemma For continuous functions F, G from S^{d−1} into ℝ^d, if
d_sup(F, G) := sup{|F(u) − G(u)| : u ∈ S^{d−1}} < δ,
then δI(F) ⊂ I(G) ⊂ C(G) ⊂ C(F)^δ.

Proof If x ∈ δI(F) and x ∈ R(G), then d(x, y) < δ for some y ∈ R(F), so y ∈ I(F) although y ∈ R(F), a contradiction. Thus x ∉ R(G). For 0 ≤ t ≤ 1 and u ∈ S^{d−1} let H(u, t) := (1 − t)F(u) + tG(u). Then H is a homotopy of F and G, and R(H) ⊂ R(F)^δ, but I(F) ∩ R(F) = ∅, so x ∉ R(H). Thus by Lemma 8.2.17, x ∈ I(G).
Next, let y ∈ C(G). If y ∈ R(G), then y ∈ R(F)^δ ⊂ C(F)^δ. Otherwise y ∈ I(G). Then y ∈ I(F) or, by Lemma 8.2.17, y ∈ R(H) ⊂ R(F)^δ. ∎
8.2.21 Lemma For any continuous function F from a compact Hausdorff space into ℝ^d, and δ > 0, C(F)^δ \ δI(F) ⊂ R(F)^δ.

Proof Let x ∈ C(F)^δ \ δI(F). Suppose d(x, R(F)) > δ. Then |x − y| < δ for some y ∈ I(F). Since the boundary of I(F) is included in R(F) by Lemma 8.2.19, the line segment {tx + (1 − t)y : 0 ≤ t ≤ 1} is included in I(F). It follows that x ∈ I(F), and then likewise that z ∈ I(F) whenever |x − z| < δ. Thus x ∈ δI(F), a contradiction. ∎

The Lipschitz seminorm ||F||_L is defined for functions with values in ℝ^d just as for real-valued functions. Let v_k be the Lebesgue volume of the unit ball in ℝ^k.
8.2.22 Lemma For k = 1, 2, …, if (T, d) is a metric space, δ > 0, D(δ, T, d) ≤ M δ^{1−k} for some M < ∞, and F is Lipschitz from T into ℝ^k with ||F||_L ≤ K, then
λ_k(C(F)^δ \ δI(F)) ≤ v_k M (K + 2)^k δ.

Proof For the usual metric e on ℝ^k, we have D(Kδ, R(F), e) ≤ M δ^{1−k}. It follows that D((K + 2)δ, R(F)^δ, e) ≤ M δ^{1−k}. Lemma 8.2.21 gives the conclusion. ∎
Proof of Theorem 8.2.16 By the definitions, a function F ∈ F^{(d)}_{α,K}(S^{d−1}) is given by a pair F(1), F(2) of functions F(j) := (F(j)₁, …, F(j)_d), where each F(j)_i ∈ F_{α,K}(B_{d−1}), B_j being the closed unit ball in ℝ^j. Since α ≥ 1, each F(j)_i is Lipschitz with ||F(j)_i||_L ≤ K, so each F(j) is Lipschitz with ||F(j)||_L ≤ dK. Let T := T₁ ∪ T₂ be a union of two disjoint copies T_i of B_{d−1}, with the Euclidean metric e on each and e(x, y) := 2 for x ∈ T_i, y ∈ T_j, i ≠ j. Letting G := F(j) on T_j, j = 1, 2, gives a function G := G_F on T with
(8.2.23)  ||G||_L ≤ max_j ||F(j)||_L ≤ dK.
Let 0 < s ≤ 1. Then there are D(s, B_j, e) disjoint balls of radius s/2, included in a ball of radius 1 + (s/2). It follows by volumes that
(8.2.24)  D(s, B_j, e) ≤ [(2 + s)/2]^j (2/s)^j ≤ (3/s)^j.
By Theorem 8.2.1 for the ball case, for any K ≥ 1, d ≥ 2, and α ≥ 1, there is a C = C(K, d, α) < ∞ such that for 0 < δ ≤ 1,
log D(δ, F_{α,K}(B_{d−1}), d_sup) ≤ C δ^{−(d−1)/α} …

… A corollary follows. Let P be a law on ℝ^d, d ≥ 2, having bounded density with respect to Lebesgue measure, and K < ∞. Then:
(a) (Tze-Gong Sun) I(d, α, K) is a Donsker class for P if α > d − 1;
(b) I(d, α, K) is a Glivenko-Cantelli class for P whenever α ≥ 1.

Proof Apply 8.2.16(b) and, for part (a), Corollary 7.2.13; for part (b), the Blum-DeHardt theorem (7.1.5). ∎
8.3 Lower layers

A set B ⊂ ℝ^d is called a lower layer if and only if for all x = (x₁, …, x_d) ∈ B and y = (y₁, …, y_d) with y_j ≤ x_j for j = 1, …, d, we have y ∈ B. Let LL_d denote the collection of all nonempty lower layers in ℝ^d with nonempty complement. Let ∅ be the empty set and let
LL_{d,1} := {L ∩ I^d : L ∈ LL_d, L ∩ I^d ≠ ∅}.
Let λ := λ_d denote Lebesgue measure on I^d. Recall that f ~ g means f/g → 1. The size of LL_{d,1} will be bounded first when d = 1 and 2. Let ⌈x⌉ be the smallest integer ≥ x.
8.3.1 Theorem For d = 1, D(ε, LL_{1,1}, h) = D(ε, LL_{1,1}, d_λ) = N_I(ε, LL_{1,1}, λ) = ⌈1/ε⌉. For d = 2, for any m = 1, 2, … and 0 < t ≤ 2^{1/2}/m, we have
max( N_I(2/m, LL_{2,1}, λ), D(2^{1/2}/m, LL_{2,1}, h) ) ≤ (2m−2 choose m−1) ≤ D(t, LL_{2,1}, h).
For 0 < ε ≤ 1, N_I(ε, LL_{2,1}, λ) ≤ 4^{2/ε} and
D(ε, LL_{2,1}, h) ≤ exp((2^{1/2} log 4)/ε).
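The binomial coefficient in the d = 2 bound counts staircase boundaries: monotone paths of m − 1 steps right and m − 1 steps down through the grid of squares, as in the proof below. A dynamic-programming count (helper name assumed) confirms that the number of such paths is exactly (2m−2 choose m−1).

```python
import math

def count_staircases(m):
    """Number of right/down paths of grid squares from the upper left square
    S_{1,m} to the lower right square S_{m,1} of an m x m grid."""
    ways = {(1, m): 1}
    for i in range(1, m + 1):
        for j in range(m, 0, -1):
            if (i, j) != (1, m):
                # reached from the square to the left or the square above
                ways[(i, j)] = ways.get((i - 1, j), 0) + ways.get((i, j + 1), 0)
    return ways[(m, 1)]

for m in range(2, 9):
    assert count_staircases(m) == math.comb(2 * m - 2, m - 1)
print("paths for m = 5:", count_staircases(5))  # comb(8, 4) = 70
```

Each path makes 2m − 2 moves, of which m − 1 go right, giving the binomial count directly; the DP just re-derives it without the formula.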
Proof For d = 1, sets in LL_{1,1} are intervals [0, t), 0 < t ≤ 1, or [0, t], 0 ≤ t ≤ 1. For any ε with 0 < ε ≤ 1, let m := m(ε) := ⌈1/ε⌉. Then the collection of m brackets [[0, (k−1)ε], [0, kε]], k = 1, …, m, covers LL_{1,1}, with minimal m for ε, showing that N_I(ε, LL_{1,1}, λ) = m. For any δ > 0, the points k(ε + δ), for k = 0, 1, …, ⌊1/(ε + δ)⌋, are at distances at least ε + δ apart. For 0 ≤ x < y ≤ 1, h([0, x], [0, y]) = d_λ([0, x], [0, y]) = y − x, so
D(ε, LL_{1,1}, h) = D(ε, LL_{1,1}, d_λ) = D(ε, [0, 1], d) =: D(ε)
for the usual metric d. Letting δ ↓ 0 gives D(ε) = ⌈1/ε⌉ = m, finishing the proof for d = 1.
For d = 2, decompose the unit square I² into a union of m² squares S_{ij} := [(i−1)/m, i/m) × [(j−1)/m, j/m), i, j = 1, …, m, but for i = m or j = m replace "i/m)" or "j/m)" respectively by "1]". For any L ∈ LL_{2,1}, let _mL be the union of the squares in the grid included in L and let L_m be the union of the squares which intersect L. Then _mL ⊂ L ⊂ L_m, and both _mL and L_m are in LL_{2,1} ∪ {∅}.
For each m and each function f from {2, 3, …, 2m−1} into {0, 1}, taking the value 1 exactly m−1 times, define a sequence S(f)(k), k = 1, …, 2m−1, of squares in the grid as follows. Let S(f)(1) be the upper left square S_{1m}. Given S(f)(k−1) = S_{ij}, let S(f)(k) be the square S_{i+1,j} just to its right if f(k) = 1, otherwise the square S_{i,j−1} just below it, for k = 2, …, 2m−1. Then S(f)(2m−1) is always the lower right square S_{m1}. Let B_m(f) := ∪_{k=1}^{2m−1} S(f)(k). Let A_m(f) be the union of the squares below and to the left of B_m(f), and C_m(f) := A_m(f) ∪ B_m(f). Here A_m(f) and C_m(f) belong to LL_{2,1} ∪ {∅}. Also, if f ≠ g, then h(C_m(f), C_m(g)) ≥ 1/m. Let L ∈ LL_{2,1} ∪ {∅}. Let L̄ be its closure and M := M_L := L̄ ∪ {(0, y) : 0 ≤ y ≤ 1}. …

8.3.2 Theorem For d ≥ 2, as ε ↓ 0,
log D(ε, LL_{d,1}, h) ≍ log D(ε, LL_d, d_λ) = log D(ε, LL_{d,1}, d_λ) ≍ log N_I(ε, LL_{d,1}, λ) ≍ ε^{1−d}.
Proof First, for the Hausdorff metric h, it will be shown that for some constants c_d with 1 ≤ c_d < ∞,
(8.3.3)  log D(ε, LL_{d,1}, h) ≤ c_d ε^{1−d}
for d ≥ 2, by induction on d. Suppose it holds for d − 1, for d ≥ 3. Given 0 < ε ≤ 1, take a maximal number of sets L₁, …, L_m ∈ LL_{d−1,1} such that h(L_i, L_j) > ε/4 for i ≠ j, where m ≤ exp(c_{d−1}(4/ε)^{d−2}). Let k := ⌈3/ε⌉ and A ∈ LL_{d,1}. For j = 0, 1, …, k let A_j := {x ∈ I^{d−1} : (x, j/k) ∈ A} and A(j) := A_j × {j/k} ⊂ A. Then A_j = ∅ or A_j ∈ LL_{d−1,1}. In the latter case we can choose i := i(j, A) such that h(A_j, L_i) ≤ ε/4. Let L₀ := ∅ and i := i(j, A) := 0 if A_j = ∅, so h(A_j, L_i) = 0 ≤ ε/4 in that case also.
Let A, B ∈ LL_{d,1} and suppose that i(j, A) = i(j, B) for j = 0, 1, …, k. It will be shown that h(A, B) ≤ ε. Let x ∈ A. There is a j = 0, 1, …, k−1 such that j/k ≤ x_d ≤ (j+1)/k. Let y := (x₁, …, x_{d−1}, j/k) ∈ A(j). Then A_j ≠ ∅ and h(A_j, B_j) ≤ ε/2, so for some z ∈ B(j) ⊂ B we have |y − z| ≤ 2ε/3 and |x − z| ≤ 1/k + 2ε/3 ≤ ε. So d(x, B) ≤ ε and, by symmetry, h(A, B) ≤ ε. Thus
D(ε, LL_{d,1}, h) ≤ (m+1)^{k+1} ≤ [exp(2 c_{d−1}(4/ε)^{d−2})]^{4/ε} ≤ exp(c_d ε^{1−d})
for c_d := 2 · 4^{d−1} c_{d−1}, so (8.3.3) is proved. For the metrics in terms of λ we have the following:
8.3.4 Lemma Let δ > 0 and let A, B ∈ LL_{d,1} with h(A, B) ≤ δ. Then
λ_d(A Δ B) ≤ d^{d/2} δ.

Proof Let U be a rotation of ℝ^d which takes v := (1, 1, …, 1) into (0, 0, …, 0, d^{1/2}). Let π_d(y) := (y₁, …, y_{d−1}, 0). Let C be the cube
C := U[I^d] := {U(x) : x ∈ I^d}. Each point of I^d or C is within d^{1/2}/2 of its respective center. Thus each point z ∈ H := π_d[C] is within d^{1/2}/2 of 0. Also, for any z ∈ ℝ^{d−1}, {t ∈ ℝ : (z, t) ∈ C} is empty or a closed interval h(z) ≤ t ≤ j(z). Let C_z := {(w, t) ∈ C : w = z}, a line segment. The intersections of U[A] and U[B] with C_z are each either empty or line segments with the same lower endpoint (z, h(z)) as C_z, so the two sets are linearly ordered by inclusion. Thus the intersection of U[A] Δ U[B] = U[A Δ B] with C_z is some line segment S_{A,B,z}. It will be shown that S_{A,B,z} has length at most d^{1/2}δ. Suppose not. Then by symmetry we can assume that there is some ζ > δ and a point x ∈ B \ A such that v := x + ζ(1, 1, …, 1) ∈ B. The orthant O := {y : y_j > x_j for all j = 1, …, d} is disjoint from A. But the open ball of radius ζ and center v is included in O, contradicting h(A, B) ≤ δ. Now, H is included in a cube of side d^{1/2} in ℝ^{d−1} with center at 0, so by the Tonelli-Fubini theorem,
λ_d(A Δ B) ≤ (d^{1/2}δ)(d^{1/2})^{d−1} = d^{d/2} δ,
proving the lemma. ∎
Returning to the proof of Theorem 8.3.2, from the last lemma and (8.3.3) it follows that for each d = 2, 3, … and some C_d < ∞, for 0 < ε ≤ 1,
log D(ε, LL_d, d_λ) = log D(ε, LL_{d,1}, d_λ) ≤ C_d ε^{1−d}.
For the bracketing bound, rotate as in the proof of Lemma 8.3.4: the boundary function f of a lower layer satisfies f(s) ≥ f(t) − K|s − t|, K := (d − 1)^{1/2}. Hence, interchanging s and t, |f(s) − f(t)| ≤ K|s − t|. So ||f||_L ≤ K. Let
J(f) := {x : −∞ < x_d ≤ f(x^{(d)})}, where x^{(d)} := (x₁, …, x_{d−1}).
Thus for each B ∈ LL_{d,1} we have U[B] = J(f) ∩ U[I^d] for a function f = f_B on ℝ^{d−1} with ||f||_L ≤ K. We can restrict the functions f to a cube T of side d^{1/2}, centered at the origin in ℝ^{d−1}, parallel to the axes, which includes the projection of U[I^d]. We can also assume that ||f_B||_sup ≤ d^{1/2} for each B, since replacing f by max(−d^{1/2}, min(f, d^{1/2})) doesn't change J(f) ∩ U[I^d], nor does it increase ||f||_L (RAP, Proposition 11.2.2(a), since ||g||_L = 0 if g is constant). Now, apply Theorem 8.2.1 for α = 1 and d − 1 in place of d, where by a fixed linear transformation we have a correspondence between I^{d−1} and the cube T. Since f ≤ g implies J(f) ⊂ J(g) and J(f) ∩ U[I^d] ⊂ J(g) ∩ U[I^d], the bracketing parts of Theorem 8.2.1 imply the desired upper bound for log N_I(ε, LL_{d,1}, λ), of order ε^{−(d−1)/α} = ε^{1−d}. This finishes the proof for upper bounds.
Now for lower bounds, it will be enough to prove them for D(ε, LL_d, d_λ), in light of Lemma 8.3.4 and since N_I(ε, ·) ≥ D(ε, ·). The angle between v = (1, 1, …, 1) and each coordinate axis is
θ_d := cos^{-1}(d^{-1/2}) = sin^{-1}(((d−1)/d)^{1/2}) = tan^{-1}((d−1)^{1/2}).
Thus if f : ℝ^{d−1} → ℝ satisfies ||f||_L ≤ (d−1)^{-1/2}, then L := U^{-1}(J(f)) is a lower layer. To see this, suppose not: for some x ∈ L and y ∉ L, y_i = x_i for all i except that y_j < x_j for some j. Then U transforms the line through x, y to a line ℓ forming an angle θ_d with the d-th coordinate axis. Writing ℓ as t_d = h(t^{(d)}), we have ||h||_L = cot θ_d = (d−1)^{-1/2}, which yields a contradiction.
Recall (Section 8.2) that for δ > 0 and a metric space Q,
F_{1,δ}(Q) := {f : Q → ℝ : max(||f||_L, ||f||_sup) ≤ δ}.
… > 0 and all ε small enough, which finishes the proof of Corollary 8.4.2 from Theorem 8.4.1.
Now Theorem 8.4.1 will be proved. Let 0 < ε ≤ 1. For any set C ⊂ ℝ^d and r > 0 let C^{r]} := {x ∈ ℝ^d : d(x, C) ≤ r}. Then the open set C^r is included in the closed set C^{r]}, ∂C^r = ∂C^{r]}, a closed set, and h(C^r, C^{r]}) = 0.
8.4.4 Lemma For any C, D ∈ C_d and r > 0, h(C^{r]}, D^{r]}) = h(C, D); in other words, ◊_r : E ↦ E^{r]} is an isometry for h.

Proof Let s > 0. It will be shown that D ⊂ C^s if and only if D^{r]} ⊂ (C^{r]})^s = C^{r+s}. "Only if" is straightforward. To prove "if," suppose not. Let a ∈ D \ C^s. There is a unique point q of C^{s]} closest to a: there is a nearest point q since C^{s]} is compact, and if b were another nearest point, then (q + b)/2 ∈ C^{s]} since C^{s]} is convex, and (q + b)/2 is nearer to a, a contradiction. (Possibly q = a.) Now q ∈ ∂(C^{s]}). If q = a, take a support hyperplane H to C^{s]} at q (RAP, Theorem 6.2.7). If q ≠ a, then the hyperplane H through q perpendicular to the line segment aq is a support hyperplane to C^{s]} at q (if there were a point c of C^{s]} on the same side of H as a, then on the line segment cq there would be a point of C^{s]} closer to a than q is, a contradiction). Let p be a point at distance r from a in the direction perpendicular to H and heading away from C^{s]}. Then p ∈ D^{r]} but p ∉ (C^{s]})^{r]} = C^{r+s}, a contradiction. So "if" is proved. Since C and D can be interchanged, the lemma follows. ∎
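In dimension d = 1, convex sets are intervals and both h and the smoothing ◊_r are explicit, so Lemma 8.4.4 can be checked directly. The sketch below (helper names assumed) uses the closed-form Hausdorff distance between intervals, h([a, b], [c, d]) = max(|a − c|, |b − d|).

```python
import random

def hausdorff_intervals(I, J):
    """Hausdorff distance between closed intervals I = [a, b] and J = [c, d]."""
    (a, b), (c, d) = I, J
    return max(abs(a - c), abs(b - d))

def dilate(I, r):
    """The r-smoothing I^{r]} = {x : d(x, I) <= r} of an interval."""
    a, b = I
    return (a - r, b + r)

random.seed(3)
ok = True
for _ in range(1000):
    a, c = random.uniform(0, 1), random.uniform(0, 1)
    I = (a, a + random.uniform(0, 1))
    J = (c, c + random.uniform(0, 1))
    r = random.uniform(0, 2)
    ok = ok and abs(hausdorff_intervals(dilate(I, r), dilate(J, r))
                    - hausdorff_intervals(I, J)) < 1e-9
print(ok)
```

Dilation shifts both endpoints of each interval by the same r, so the endpoint differences, and hence h, are unchanged — the one-dimensional case of the isometry.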
For r > 0, ◊_r is a useful smoothing, as it takes a convex set D, which may have a sharply curved boundary (vertices, edges, etc.), to a convex set D^{r]} whose boundary is no more curved than a sphere of radius r, and so will be easier to approximate. Now, for a given C ∈ C_d and s > 0, let
A₁(C) := {D ∈ C_d : h(C, D) ≤ s}. …
Let C_{d,1,3} := {E ∈ C_d : B(0, 1) ⊂ E ⊂ B(0, 3)}.

8.4.5 Lemma Let E ∈ C_{d,1,3}, let x ∈ ∂E, let H be a support hyperplane to E at x, and let p be the point of H nearest to 0. Then |p| ≥ 1 and ∠0xp ≥ sin^{-1}(1/3).

Proof Existence of support hyperplanes is proved in RAP (Theorem 6.2.7). Clearly H is disjoint from the open ball B(0, 1), so |p| ≥ 1. Since |x| ≤ 3 and ∠0px = π/2, the lemma follows. ∎

8.4.6 Lemma If E ∈ C_{d,1,3}, r > 0, and z is a point such that z ∈ E^r \ E, let x(z) be the point on the half-line from 0 to z and in ∂E. Then |z − x(z)| ≤ 3r.
Proof Note that x = x(z) is uniquely determined since E ⊃ B(0, 1). Apply Lemma 8.4.5. Let z₁ be the point of H closest to z. Then |z − z₁| ≤ r. The vectors z − z₁ and p are parallel, so θ := ∠p0z = ∠0zz₁, and |z − x| ≤ |z − z₁| |x|/|p| ≤ 3r. ∎

8.4.7 Lemma Suppose C ∈ C_d, y ∈ ∂C, x ∈ ∂C^{2]}, and |x − y| = 2. Then for any two-dimensional subspace V containing x, V ∩ B(y, 2) is a disk containing 0 of radius at least 2/3.

Proof Clearly 0 ∈ B(y, 2) ⊂ C^{2]}. Apply Lemma 8.4.5 again. Then H is also a support hyperplane to B(y, 2) at x, so x − y is orthogonal to H and in the same direction as p. Let q be the point on the segment [0, x] closest to y. Then θ := ∠0xp = ∠qyx, sin θ ≥ 1/3, and so |x − q| ≥ 2/3. Let u be the closest point to y in V. Then V ∩ B(y, 2) ⊃ V ∩ B(u, 2/3). ∎

8.4 Metric entropy of classes of convex sets

Now polyhedra to approximate convex sets will be constructed. Let W_d be the cube centered at 0 in ℝ^d of side 2/d^{1/2}, parallel to the axes, so that the coordinates of the vertices are ±1/d^{1/2}. Given ε > 0, decompose the 2d faces of W_d into equal (d−1)-cubes of side s_d := s_d(ε), where s_d ~ c(ε/d)^{1/2} as ε ↓ 0 and c := 10^{-4}/(d^{1/2}(d−1)), so s_d ~ ε^{1/2} 10^{-4}/(d(d−1)). Specifically, let s_d := 2/(d^{1/2} k_d), where k_d := k_d(ε) is the smallest positive integer such that s_d ≤ ε^{1/2} 10^{-4}/(d(d−1)). Then for 0 < ε ≤ 1,
ε^{1/2} 10^{-4}/d² ≤ s_d ≤ ε^{1/2} 10^{-4}/(d(d−1)).
Let L_d be the set of all (d−1)-cubes thus formed. The diameter of each cube in L_d is d^{1/2} s_d ≤ ε^{1/2} 10^{-4}/(d^{1/2}(d−1)). The next fact follows directly, by the law of sines, since 0 < ε ≤ 1 and d ≥ 2, and sin^{-1} x ≤ 1.1x for 0 ≤ x ≤ 10^{-4}.

8.4.8 Lemma (a) For any cube in L_d and any two vertices p and q of the cube, ∠p0q ≤ 1.1 · 10^{-4} ε^{1/2}/(d − 1). …

Triangulations of prisms will also be used. Let S be a simplex with vertices a₁, …, a_k; identify each a_i with (a_i, 0) ∈ S × [0, 1], let b_i := (a_i, 1), and for j = 1, …, k let S_j be the simplex with vertices a₁, …, a_j, b_j, …, b_k. Write a point of S × [0, 1] as z = (Σ_i λ_i a_i, x) with λ_i ≥ 0 and Σ_i λ_i = 1. Then z = Σ_{i≤j} μ_i a_i + Σ_{i≥j} ρ_i b_i, where μ_i ≥ 0, ρ_i ≥ 0, and Σ_{i≤j} μ_i + Σ_{i≥j} ρ_i = 1, if and only if μ_i = λ_i for i < j, ρ_i = λ_i for i > j, λ_j = μ_j + ρ_j, and x = Σ_{i≥j} ρ_i. Thus z is in S_j if and only if Σ_{i>j} λ_i ≤ x ≤ Σ_{i≥j} λ_i. If both inequalities are strict, then j is unique. Every point of S × [0, 1] is in some S_j, and a point in more than one S_j is on the boundary of both. Let K be a convex set including a neighborhood of 0. For each vertex p_i of a cube in L_d, let H_i be the half-line starting at 0 passing through p_i and let v_i
be the unique point at which H_i passes through the boundary of K. For each simplex S_j in the triangulation of the cubes in L_d, let T_j be the corresponding simplex with vertices v_i in place of p_i. Let π_ε(K) be the polyhedron with faces T_j, in other words the union of the d-dimensional simplices which are convex hulls of T_j ∪ {0}. For d ≥ 3, π_ε(K) is not necessarily convex.

8.4.9 Lemma Let E ∈ C_{d,1,3} and 0 < δ ≤ 1/15. For i = 1, 2 let z_i be points such that z_i ∈ E^{16ε} \ E, and assume that ∠z₁0z₂ ≤ δ. Then |z₁ − z₂| ≤ 96ε + 5δ.

Proof Let x_i := x(z_i) as in Lemma 8.4.6, so that |z_i − x_i| ≤ 48ε and |x_i| ≥ 1 for all i. By symmetry we can assume |x₁| ≥ |x₂|. To get a bound for |x₁ − x₂|, we can assume x₁ and x₂ are not on the same line through 0, or they would be equal. Take a half-line L starting at x₁ which is tangent to the unit circle ∂B(0, 1) at a point v and crosses the half-line from 0 through z₂ at a point y. If v is between x₁ and y, then |x₁ − v| ≤ tan δ ≤ 2δ, since δ ≤ 1/15 < π/4. Likewise, |y − v| ≤ 2δ. Also, 1 ≤ |x₂| ≤ |x₁| ≤ (1 + 4δ²)^{1/2} ≤ 1 + 2δ² ≤ 1 + δ; bounding |x₂ − y| and then |x₁ − y| in the same way completes the estimate. ∎

8.4.10 Lemma Let E ∈ C_{2,1,3}, K > 1, and ε > 0. Let r ≥ 2/3 and let a, b, and α be points of ℝ² such that: a ∈ E, B(a, r) ⊂ E, b is on the boundary of E and of B(a, r), d(α, E) ≤ 16ε, and β := ∠α0b ≤ c(K)ε^{1/2}. Then
d(a, B(a, r)) < 50e. Also, Ia  bl < 0.0015e1/2/(K  1) and Laab < D(K)E112 where D(K) := 0.004/(K  1). Proof Apply Lemma 8.4.5 at x = b. The tangent line L to E at b is also tangent to the circle 8B(a, r). Let y LObp. Then y > sin1(1/3) > 1/3. We have (8.4.11)
6 < c(KE) 1/2 = (1.1)104e1/2/ (K  1) < (1.1)1010
Let H be the halfplane including E bounded by L. First suppose a V H. Then d (a, H) < 16e. Let the line from 0 to a intersect L at a point q. If 17 is between p and b, then y + ,B < 7r/2 and
Ill  bl = Ipl(cot y  cot(y + #)) < 30 csc2 y < 27k. If p is between n and b, then
|η − b| = |p|(cot γ + tan(γ + β − π/2)) = |p|(cot γ − cot(γ + β)), where now π/2 ≤ γ + β < π. Thus by (8.4.11),
|η − b| ≤ |p| β max(csc² γ, csc²(γ + β)) ≤ 3β max(csc² γ, csc²(γ + β)) ≤ 27β
… and γ ≥ sin⁻¹(1/3) imply sin(γ − β) > 0.333, so
|η − b| ≤ 3β/(0.333)² ≤ 28β ≤ 28c(Kε)^{1/2} in all three cases. For any ordering of b, p, and η,
|α − η| ≤ 16ε/ sin(γ − β) ≤ 49ε. Next, let x be the distance from a varying point on L to b. Then the distance y from that point to the circle ∂B(a, r) satisfies y = (r² + x²)^{1/2} − r. Now
(r² + t)^{1/2} ≤ r + t for t ≥ 0 and r ≥ 2/3, so 0 ≤ y ≤ x² for all x. So the distance from α to B(a, r) is at most Cε for C = 49 + 28²c²K < 50, giving the first conclusion for α ∉ H. A line W through α, orthogonal to the line V through 0 and b, meets V at
a point ξ. Then ∠baξ = γ ≥ sin⁻¹(1/3), so |a − ξ| ≤ r(1 − 1/9)^{1/2}. Let q be
the point on the circle |q − a| = r and the line W, on the same side of α as ξ. Then |q − ξ| ≥ (2/3)(1 − (8/9)^{1/2}) > 0.03. By (8.4.11), β ≤ tan⁻¹(0.01), so the line Λ through 0, η, and α must intersect B(a, r). Then since E is convex, a ∈ E, 0 ∈ E, and B(a, r) ⊂ E, for α ∈ H, d(α, B(a, r)) is maximized when α = q on L, and d(α, B(a, r)) ≤ Cε for the same C < 50 as before. So the first conclusion is proved. Lemma 8.4.9 with b = η gives
|a − b| … ≥ 2/3 and 0 ∈ B, so |a| … ≥ 2. If V_d is the volume of B(0, 1) ⊂ ℝ^d, then λ((S^{d−1})^ε) ≥ V_d[(1 + ε)^d − 1] ≥ dV_d ε.
Then by the left side of the last displayed inequality in the proof of Lemma 8.4.3, there is an a_d > 0 such that D(ε, S^{d−1}, ρ) ≥ a_d ε^{1−d} for 0 < ε ≤ 1.
Given ε, take a set {x_i}_{i=1}^m of points of S^{d−1} more than 2ε apart, of maximal cardinality m := D(2ε) := D(2ε, S^{d−1}, ρ). As above let ∠abc denote the angle at b in the triangle abc. Then for i ≠ j,
θ := ∠x_i 0 x_j satisfies 2 sin(θ/2) > 2ε, so θ > 2ε. Let K_i be the half-line from 0 through x_i. Let C_i be the spherical cap cut from the unit ball B_d by a hyperplane orthogonal to K_i at a distance cos ε from 0. Then the caps C_i are disjoint.
For any set I ⊂ {1, …, m} let D_I := B_d \ ∪_{i∈I} C_i. Then each D_I is convex. Let λ_d be d-dimensional Lebesgue measure (volume). Then for all i, λ_d(C_i) ≥ b_d ε^{d+1} for some constant b_d > 0. By Lemma 8.2.11 there are at
least e^{m/6} sets I(j) such that for all j ≠ k the symmetric difference I(j) Δ I(k) contains at least m/5 elements. Then
λ_d(D_{I(j)} Δ D_{I(k)}) ≥ c_d ε^{1−d} ε^{d+1} = c_d ε²
for a constant c_d := a_d b_d 2^{1−d}/5. So for some β_d > 0,
D(δ, C_d, dλ) ≥ exp(β_d δ^{(1−d)/2})   for 0 < δ ≤ 1.
This finishes the proof of the lower bound for 0 < δ ≤ 1, and so also for h. Theorem 8.4.1 is proved.
Problems
1. Let (K, d) be a compact metric space. Show that the collection of all nonempty closed subsets of K, with the Hausdorff metric, is also compact.
2. In the proof of Theorem 8.2.1, just after Lemma 8.2.7, show that the cubes can be ordered so that i = j − 1.
3. If inequality (1.3.12) is used instead of (1.3.5) in the proof of Lemma 8.2.11, what is the result?
4. Show that in Theorem 8.2.16(a), I_C(d, α, K) cannot be replaced by I(d, α, K) if d = 2 and α > 1. Hint: Let f((cos θ, sin θ)) := (1 − cos θ)/2, so f takes S¹ onto [0, 1]. Then I((f, 0)) = 0 where (f, 0)(u, v) := (f(u, v), 0). For any interval (a, b) ⊂ ℝ there is a C^∞ function g_{(a,b)} ≥ 0 with g > 0 just on (a, b).
Show that for functions ψ = (f, Σ_{i=1}^k δ_i g_{(a_i,b_i)}) for disjoint (a_i, b_i) and small enough δ_i > 0, depending on k, ‖ψ_i‖_α can remain bounded as k increases while I(ψ) can approximate any finite subset of [0, 1] × {0} for h.
5. Show that the unit disk {(x, y) : x² + y² ≤ 1} belongs to a class C₀ in Theorem 8.2.15 for k = 4, any α ∈ (0, ∞), and some K = K(α) < ∞. Hint: The function g(x) := (1 − x²)^{1/2} is smooth on intervals |x| ≤ … .
6. Show that there is a c >
0 such that lim inf_{ε↓0} ε log N_I(ε, LL₂, λ) ≥ c, and likewise for D(ε, LL_{2,1}, dλ). Hint: Consider squares along a decreasing diagonal, S_j := S_{j,m+1−j}, j = 1, …, m, for the grid defined in the proof of Theorem 8.3.1. …
7. (a) Show that … ≥ 4^k k^{−1/2}/3. Hint: Use Stirling's formula (Theorem 1.3.13).
(b) Use this to give a lower bound for lim inf_{ε↓0} ε log D(ε, LL_{2,1}, h).
8. In the proof of Lemma 8.3.2, after Lemma 8.3.4, a function f is defined with ‖f‖ ≤ (d − 1)^{1/2}. For d = 2, deduce this from monotonicity of x(·) in the proof of Theorem 8.3.1.
9. Show that each class G_{α,K,d} in Theorem 8.2.1 for α > 0 and K < ∞ is a uniform Glivenko–Cantelli class as defined in Section 6.6.
10. Show that the lower layers in ℝ^d form a Glivenko–Cantelli class for any law having a bounded support and bounded density with respect to Lebesgue measure.
Notes Note to Section 8.1. Hausdorff (1914, Section 28) defined his metric between closed, bounded subsets of a metric space.
Notes to Section 8.2. Kolmogorov (1955) gave the first statement in Theorem 8.2.1. The proof of that part as given is essentially that of Kolmogorov and Tikhomirov (1959). Lorentz (1966, p. 920) sketches another proof. Theorem 8.2.10 is essentially due to Clements (1963, Theorem 3); the proof here is adapted from Dudley (1974, Lemmas 3.5 and 3.6). Remark 8.2.14, for which I am very grateful to Joseph Fu, shows that there is an error in Dudley (1974, (3.2)), even with the correction (1979), which is proved only for a > 1. Theorem 8.2.16 and its proof by a sequence of lemmas are newly corrected and extended versions of results and proofs of Dudley (1974). TzeGong Sun and R. Pyke (1982), whose research began in 1974, proved Corollary 8.2.25(a) independently by a different method. See also Pyke (1983). The statement was apparently first published in Dudley (1978, Theorem 5.12), and attributed to Sun.
Notes to Section 8.3. In Theorem 8.3.2 and its proof for d ≥ 3, I am grateful to Lucien Birgé (1982, personal communication) for the idea of the transformation U. For Theorem 8.3.1, I am much indebted to earlier conversations with Mike Steele. Any errors, however, are mine. A lower bound for empirical processes on lower layers in the plane will be given in Section 12.4, where P is uniform on the unit square. For other, previous results, such as laws of large numbers uniformly over LL_d for more general P, and on the statistical interest of lower layers (monotone regression), see Wright (1981) and references given there.
Notes to Section 8.4. Bolthausen (1978) proved the Donsker property of the class of convex sets in the plane for the uniform law on I² (cf. Corollary 8.4.2(b)). Theorem 8.4.1, as mentioned, is due to Bronshtein (1976). Specifically, the smoothing C ↦ C′ and the set of lemmas used follows mainly, but not entirely, his original proof. Perhaps most notably, it appears that the polyhedra used to approximate convex sets in Bronshtein's construction need not be convex themselves, and this required some adjustments in the proof. A more minor point is that if C ∈ C_d then C¹ doesn't necessarily include B(0, 1), although C² does. So Bronshtein's Lemma 1 seems incorrect as stated but could be repaired by changing various constants.
In an earlier result of Dudley (1974, Theorem 4.1), for d ≥ 2, in the upper bound, ε^{(1−d)/2} was multiplied by |log ε|. This bound, weaker than Bronshtein's, is easier to prove and suffices to give Corollary 8.4.2(b) and (c). The lower bound in Dudley (1974) was reproduced here. I thank James Munkres for telling me about the triangulation method used after Lemma 8.4.8. Gruber (1983) surveys other aspects of approximation of convex sets.
References

*An asterisk indicates a work I have seen discussed in secondary sources but not in the original.

Bolthausen, E. (1978). Weak convergence of an empirical process indexed by the closed convex subsets of I². Z. Wahrscheinlichkeitsth. verw. Gebiete 43, 173–181.
Bonnesen, Tommy, and Fenchel, Werner (1934). Theorie der konvexen Körper. Springer, Berlin; repub. Chelsea, New York, 1948.
Bronshtein [Bronšteĭn], E. M. (1976). ε-entropy of convex sets and functions. Siberian Math. J. 17, 393–398, transl. from Sibirsk. Mat. Zh. 17, 508–514.
Clements, G. F. (1963). Entropies of several sets of real valued functions. Pacific J. Math. 13, 1085–1095.
Dudley, R. M. (1974). Metric entropy of some classes of sets with differentiable boundaries. J. Approx. Theory 10, 227–236; Correction, ibid. 26 (1979), 192–193.
Eggleston, H. G. (1958). Convexity. Cambridge University Press, Cambridge. Reprinted with corrections, 1969.
Eilenberg, S., and Steenrod, N. (1952). Foundations of Algebraic Topology. Princeton University Press.
Gruber, P. M. (1983). Approximation of convex bodies. In Convexity and Its Applications, ed. P. M. Gruber and J. M. Wills. Birkhäuser, Basel, pp. 131–162.
Hausdorff, Felix (1914). Mengenlehre, transl. by J. P. Aumann et al. as Set Theory, 3d English ed. of transl. of 3d German edition (1937). Chelsea, New York, 1978.
Kolmogorov, A. N. (1955). Bounds for the minimal number of elements of an ε-net in various classes of functions and their applications to the question of representability of functions of several variables by functions of fewer variables (in Russian). Uspekhi Mat. Nauk (N.S.) 10, no. 1 (63), 192–194.
Kolmogorov, A. N., and Tikhomirov, V. M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14, no. 2, 3–86 = Amer. Math. Soc. Transl. (Ser. 2) 17 (1961), 277–364.
Lorentz, George G. (1966). Metric entropy and approximation. Bull. Amer. Math. Soc. 72, 903–937.
Pyke, R. (1983). The Haar-function construction of Brownian motion indexed by sets. Z. Wahrscheinlichkeitsth. verw. Gebiete 64, 523–539.
*Sun, Tze-Gong, and Pyke, R. (1982). Weak convergence of empirical measures. Technical Report no. 19, Dept. of Statistics, Univ. of Washington, Seattle.
Wright, F. T. (1981). The empirical discrepancy over lower layers and a related law of large numbers. Ann. Probab. 9, 323–329.
9 Sums in General Banach Spaces and Invariance Principles
Let (S, ‖·‖) be a Banach space (in general nonseparable). A subset F of the unit ball {f ∈ S′ : ‖f‖′ ≤ 1} is called a norming subset if and only if
‖s‖ = sup_{f∈F} |f(s)|   for all s ∈ S.
The whole unit ball in S′ is always a norming subset by the Hahn–Banach theorem (RAP, Corollary 6.1.5). Conversely, given any set F, let S := ℓ^∞(F) be the set of all bounded real functions on F, with the supremum norm
‖s‖ = ‖s‖_F := sup_{f∈F} |s(f)|,   s ∈ S.
Then the natural map f ↦ (s ↦ s(f)) takes F one-to-one onto a norming subset of S. So, limit theorems for empirical measures, uniformly over a class F of functions, can be viewed as limit theorems in a Banach space S with norm ‖·‖_F. Conversely, limit theorems in a general Banach space S with norm ‖·‖ can be viewed as limit theorems for empirical measures on S, uniformly over a class F of functions, such as the unit ball of S′, since for f ∈ S′ and x₁, …, xₙ ∈ S,
(δ_{x₁} + ⋯ + δ_{xₙ})(f) = f(x₁) + ⋯ + f(xₙ).

Suppose that X_j are i.i.d. real random variables with mean 0 and variance 1. Let S_n := Σ_{j≤n} X_j. … We claim that g(u)/u² → 1/2 as u → 0. If not, first suppose for some δ > 0, g(u_k) ≥ (1/2 + δ)u_k² for some u_k → 0. Taking a subsequence, we can assume that for all n, u_n = t_n/n^{1/2} where |t_n| ≤ 1. Then (1 − g(t_n/n^{1/2}))ⁿ ≤ exp(−(1/2 + δ)t_n²), contradicting (9.2.3) as n → ∞. Or, if g(u_k) ≤ (1/2 − δ)u_k² for some u_k → 0, again we can assume u_n = t_n/n^{1/2} where |t_n| ≤ 1 for all n. Then (1 − g(t_n/n^{1/2}))ⁿ ≥ exp(−(1 − δ)t_n²/2) for n large enough, again contradicting (9.2.3). So, as claimed, g(u)/u² → 1/2 as u → 0. Now for P := L(Z₁), and all u,
(9.2.4)
(g(u) + g(−u))/(2u²) = ∫_{−∞}^{∞} (1 − cos(ux))/u² dP(x).

Letting u → 0 and applying Fatou's Lemma (RAP, 4.3.3) gives that
∫ x² dP(x) ≤ 1. Then by the Tonelli–Fubini theorem, E(X₁²) < ∞, so E|X₁| < ∞, and (S_n − nEX₁)/n^{1/2} converges in law to some N(0, σ²). Since S_n/n^{1/2} also converges in law, it follows by tightness that EX₁ = 0 and then EX₁² = σ² = 1.
So, to continue the proof of Theorem 9.2.1, suppose h is nonmeasurable for
P, so that the X_n are nonmeasurable. Let B := {x ∈ A : h*(x) = +∞}. Suppose P(B) > 0. Then for each j, Pr(X_j* = +∞) = P(B) by Lemma 3.2.4, case (b). By Lemma 3.2.4(c), S_n* = +∞ a.s. on the set where X_j* = +∞ for any j ≤ n. Thus
Pr((S_n/n^{1/2})* = +∞) ≥ Pr(X_j* = +∞ for some j ≤ n) = 1 − (1 − P(B))ⁿ → 1
9.3 A finite-dimensional invariance principle
293
as n → ∞. By Lemma 3.2.6 this contradicts the hypothesis. Thus P(B) = Pr(X₁* = +∞) = 0.
Let B_j := B(j) := {X_j* < X_{j*} + 2^{−j}}. Then … C_n := ∩_{j=1}^n B_j. Apply Lemma 3.2.4 with P_j = P and f = 1_{B(j)} to obtain Pr*(C_n) = 1. On C_n,
S_n ≤ X₁* + ⋯ + X_n* ≤ S_n + 1.
Thus S_n*/n^{1/2} → N(0, 1). Hence by Theorem 9.2.2, EX₁* = 0. Likewise EX_{1*} = 0. Thus X₁* = X_{1*} = X₁ a.s., that is, X₁ is completion measurable.
9.2.5 Corollary  Suppose (S, ‖·‖) is a separable normed space and X_n = h(x_n) where the x_n are independent, identically distributed random variables with values in some measurable space (A, A) and h is any function from A into S (not assumed measurable). Suppose Y_n are i.i.d. Gaussian variables in S with mean 0 and
(9.2.6)   lim_{n→∞} n^{−1/2} ‖ Σ_{j=1}^n (X_j − Y_j) ‖* = 0
j_ 1/(68)}+86. Also, for S > 0 small enough, G{x : Ilxll > 1/(6e)} < 86 for 0 < 8 < S by the LandauSheppMarcusFernique theorem (Lemma 2.2.5 suffices in this case). Take S < 1/2. Then for such an e and k > kl (e) large enough, we have nk and 84(1 + E4)k > 256, so
4s2(l + e4)k/2 +4 < 4 821 + E4)k12,
(l + e4)k  1 >
n (k)
Etk/2/4) 1/(5E)} 8tk12/41 < 2e6
also for n < No, fork > k2(8) > k1 (8) large enough, and thus for 1 < n < nk. Then in particular, Ck as defined above is less than 1/2. Applying Ottaviani's inequality (1.3.15) gives
Xi
Pr S maim E) < E for i = 1, 2. Now by (9.3.13), k
X(P)
VP = n1/2 max1 no(p + 1) and n > no(p), (9.3.12) gives Pr(Vp > 2P) < 2P. Further, since n > r(M) > no(M) and n > n  r(M), we have Pr(U3 > 2M) < 2M, Thus M1
Vp
> 0. Now, there is a local law of the iterated logarithm for W_t at t = 0, namely for v(t) := (2t log log(1/t))^{1/2},
almost surely lim sup_{t↓0} W_t/v(t) = −lim inf_{t↓0} W_t/v(t) = 1. This follows from the law at infinity (RAP, Theorem 12.5.2) and the fact that the process {tW_{1/t}}_{t>0} has the same distribution as {W_t}_{t>0}, being Gaussian with mean 0 and the same covariances. So, suppose W_τ = b(τ). Then b(τ + h) − b(τ) ≤ v(h)/2 for small enough h > 0 by the Hölder condition, so we will have almost surely W(τ + h) > b(τ + h) for some h with 0 < h < 1 − τ. So the path t ↦ W_t, 0 ≤ t ≤ 1, will not be in the boundary of (a, b). A similar argument holds if W_τ = a(τ), so the proof is done.
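The time-inversion fact used above — that {tW_{1/t}}_{t>0} has the same law as {W_t}_{t>0} — reduces to a covariance computation, since both processes are centered Gaussian: Cov(sW_{1/s}, tW_{1/t}) = st·min(1/s, 1/t) = min(s, t). A small exact-arithmetic check (function names are ours):

```python
from fractions import Fraction

def cov_bm(s, t):
    # Standard Brownian motion: Cov(W_s, W_t) = min(s, t).
    return min(s, t)

def cov_inverted(s, t):
    # Time-inverted process V_t := t * W_{1/t}:
    # Cov(V_s, V_t) = s * t * Cov(W_{1/s}, W_{1/t}) = s * t * min(1/s, 1/t).
    return s * t * min(1 / s, 1 / t)

# Exact check on a few rational time points.
times = [Fraction(1, 3), Fraction(1, 2), Fraction(5, 7), Fraction(2), Fraction(3)]
assert all(cov_inverted(s, t) == cov_bm(s, t) for s in times for t in times)
print("time inversion preserves the covariance")
```

Since a centered Gaussian process is determined by its covariance, this identity is the whole proof of the distributional equality.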
9.4 Invariance principles for empirical processes

Recall the notion of coherent G_P process (Section 3.1). Let (A, A, P) be a probability space and F ⊂ L²(A, A, P). Let Ω be the product of ([0, 1], B, λ), where B is the Borel σ-algebra and λ is Lebesgue measure, and a countable product of copies of (A, A, P), with coordinates x_i. Then F will be called a functional Donsker class for P if it is pregaussian for P and there are independent coherent G_P processes Y_j(f, ω), f ∈ F, ω ∈ Ω, such that f ↦ Y_j(f, ω) is bounded and ρ_P-uniformly continuous for each j and almost all ω, and such that in outer probability,
(9.4.1)   n^{−1/2} max_{m≤n} ‖ Σ_{j=1}^m (X_j − Y_j) ‖_F → 0.
… Pr{ … > ε/3 } < ε/2.
By definition, F is pregaussian for P, so F is totally bounded for ρ_P by the Sudakov–Chevet theorem (2.3.5). Also, there is a δ > 0, by Theorem 3.1.1, such that for a coherent version of G_P,
Pr{sup{|G_P(f) − G_P(g)| : f, g ∈ F, ρ_P(f, g) < δ} > ε/3} < ε/2.
By assumption, each Y_j, and so each (Y₁ + ⋯ + Y_n)/n^{1/2}, is a coherent version of G_P. Combining gives an asymptotic equicontinuity condition for the empirical processes ν_n, proving "only if" by Theorem 3.7.2. To prove "if," assume F is a P-Donsker class and again apply Theorem 3.7.2.
So F is totally bounded for ρ_P and we have an asymptotic equicontinuity condition: for each k = 1, 2, …, there is a δ_k > 0 and an N_k < ∞ such that
(9.4.3)   Pr*{ sup{ |ν_n(f) − ν_n(g)| : f, g ∈ F, ρ_P(f, g) < δ_k } > 2^{−k} } < 2^{−k},   n ≥ N_k.
We may assume N_k increases with k. As in the proof of Theorem 3.7.2, we have for a coherent G_P that
(9.4.4)   Pr{ sup{ |G_P(f) − G_P(g)| : f, g ∈ F, ρ_P(f, g) < δ_k } > 2^{−k} } < 2^{−k}.
Let U := UC(F) denote the set of all prelinear real-valued functions on F uniformly continuous for ρ_P. Then U is a separable subspace of S := ℓ^∞(F), as shown after the proof of Lemma 2.5.4. F is pregaussian for P and G_P has a law μ defined on the Borel sets of U by Theorem 2.5.5. For k = 1, 2, …, let F_k be a finite subset of F such that
sup_{f∈F} min{ ρ_P(f, g) : g ∈ F_k } < δ_k.
Let T_k denote the finite-dimensional space of all real functions on F_k, with supremum norm ‖·‖_k. Let m(k) be the number of functions in F_k and let F_k = {g₁, …, g_{m(k)}}. For each f ∈ F let f_k := g_j for the least j such that ρ_P(f, g_j) < δ_k. For any φ ∈ S let φ_k(f) := φ(f_k), f ∈ F. Then φ_k ∈ S. Let A_k(φ) := φ_k and X_{kj} := X_j − A_k X_j, where X_j := δ_{x_j} − P. There is a complete separable subspace T of S which includes both U and the (finite-dimensional) ranges of all the A_k.
Let ‖·‖ = ‖·‖_F from here on. Note that ‖A_k φ‖ ≤ ‖φ‖ for all k and all φ ∈ S. Then by (9.4.3) we have for n ≥ N_k,
(9.4.5)   Pr*{ ‖ n^{−1/2} Σ_{j=1}^n X_{kj} ‖ > 2^{−k} } = Pr*{ sup_{f∈F} |ν_n(f − f_k)| > 2^{−k} } < 2^{−k}.
Thus by Lemma 3.2.6,
Pr{ ‖ n^{−1/2} Σ_{j=1}^n X_{kj} ‖* > 2^{−k} } < 2^{−k},   n ≥ N_k.
Then if n ≥ 2N_k and m = 0, 1, …, n, we have by Lemma 3.2.3 a.s.
‖ Σ_{j=1}^m X_{kj} ‖* ≤ ‖ Σ_{j=1}^n X_{kj} ‖* + ‖ Σ_{j=m+1}^n X_{kj} ‖*,
and since either n − m ≥ N_k or m ≥ N_k,
Pr{ ‖ n^{−1/2} Σ_{j=m+1}^n X_{kj} ‖* > 2^{1−k} } < 2^{1−k}.
Thus for k ≥ 2, Ottaviani's inequality (Lemma 9.1.6) with c = 1/2 gives
(9.4.6)   Pr{ n^{−1/2} max_{m≤n} ‖ Σ_{j=1}^m X_{kj} ‖* > 2^{2−k} } < 2^{2−k}
for n ≥ 2N_k. Let P_k be the law on T_k of f ↦ f(x₁) − ∫ f dP, f ∈ F_k. Then by Theorem 9.3.1 there exist V_{kj} i.i.d. P_k, W_{kj} i.i.d. with a Gaussian law Q_k, and some n_k := n(k) > 2N_k such that for all n ≥ n_k, k ≥ 1,
(9.4.7)
Pr{ n^{−1/2} max_{m≤n} ‖ Σ_{j=1}^m (V_{kj} − W_{kj}) ‖_k > 2^{−k} } < 2^{−k}, where the {(V_{kj}, W_{kj})}_{j≥1} are independent of each other for different k. It can and will be assumed that n₀ := 1 < n₁ < n₂ < ⋯, and that {V_j}_{j≥1} and {W_j}_{j≥1} are sequences of independent random variables. (Note: The notations n_k, V_k, W_k are all different from those in the previous section.) Each sequence has its values in a countable product of Polish spaces, which itself is Polish (RAP, Theorem 2.5.7). Let {Y_j, j ≥ 1} be i.i.d. μ on U. Then W_j has the law of the restriction Y_j ↾ F_k of Y_j to F_k
for n_k ≤ j < n_{k+1}. Let T_{kj} be a copy of T_k and U_j of U for each j, and let T_∞ := Π_j T(j), U_∞ := Π_j U_j, F(j) := F_{k(j)}, where T(j) := T_{k(j),j}. Then apply the Vorob'ev–Berkes–Philipp theorem (1.1.10) to P := L({(V_j, W_j)}_{j≥1}) and L({(Y_j ↾ F(j), Y_j)}_{j≥1}). So we can assume W_j = Y_j ↾ F(j) for all j ≥ 1. Now for ((x_j)_{j≥1}, t) ∈ Ω := A^∞ × [0, 1], let
V(ω) := { f ↦ f(x_j) − ∫ f dP, f ∈ F(j) }_{j≥1} ∈ T_∞.
Let Q := L({(V_j, W_j, Y_j)}_{j≥1}) on T_∞ × T_∞ × U_∞. By Lemma 3.7.3, V_j and Y_j can be taken to be defined on Ω with V(ω)_j = V_j for all j. It remains to prove (9.4.1) for the given Y_j, with X_j(f) := f(x_j) − ∫ f dP, f ∈ F. Given 0 < ε < 1, take k large enough so that 2^{5−k} < ε. We have X_j ↾ F(j) = V_j, j ≥ 1. Let M_k ≥ n_k be large enough so that for all n ≥ M_k,
(9.4.8)   Pr*{ n^{−1/2} Σ_{j=1}^{n(k)} ( ‖A_k X_j‖ + ‖Y_j‖ ) > 2^{−k} } < 2^{−k}.
Let n ≥ M_k. Then there is a unique r such that n_r ≤ n < n_{r+1}, and
(9.4.9)
max_{m≤n} ‖ Σ_{j=1}^m (X_j − Y_j) ‖ …

… e_Q(f_i, f_j) > ε for all i ≠ j. Also, D^(2)(ε, F) is the supremum over all laws Q with finite support of D^(2)(ε, F, Q). The following is a continuation of Theorem 4.8.3.
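For a finitely supported Q, a lower bound for the packing number D^(2)(ε, F, Q) just described can be computed by greedily selecting functions pairwise more than ε apart in L²(Q). An illustrative sketch, not from the text (function and variable names are ours):

```python
import math

def d2_packing(funcs, support, eps):
    """Greedy lower bound for D^(2)(eps, F, Q): a maximal subset of
    functions pairwise more than eps apart in L^2(Q), with Q uniform
    on the given finite support."""
    n = len(support)
    def dist(f, g):
        return math.sqrt(sum((f(x) - g(x)) ** 2 for x in support) / n)
    chosen = []
    for f in funcs:
        if all(dist(f, g) > eps for g in chosen):
            chosen.append(f)
    return len(chosen)

# Indicators of [0, t] sampled at 5 uniform points: distinct thresholds
# below differ on at least one point, so they are >= 1/sqrt(5) apart.
support = [0.1, 0.3, 0.5, 0.7, 0.9]
funcs = [lambda x, t=t: 1.0 if x <= t else 0.0
         for t in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)]
print(d2_packing(funcs, support, 0.3))  # → 6
```

All six indicators are at L²(Q) distance at least 5^{−1/2} ≈ 0.447 from one another, so the greedy pass keeps every one.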
10.1.2 Proposition  There exist VC major (thus VC hull) classes which do not satisfy (4.8.4), thus are not VC subgraph classes.

Proof  A VC major class is VC hull by Theorem 4.7.1(b). Let F be the set of all right-continuous nonincreasing functions f on ℝ with 0 ≤ f ≤ 1. … ∫ |f − g| dQ. Thus D^(2)(ε, F, Q) ≥ D^(1)(ε, F, Q) for any ε > 0.
Let P be Lebesgue measure on [0, 1]. Then D^(1)(ε, F, P) = D(ε, LL_{2,1}, dλ), where LL_{2,1} is the set of all lower layers (defined in Section 8.3) in the unit square I² in ℝ², and dλ is the Lebesgue measure of the symmetric difference of sets in I². For some c > 0, D(ε, LL_{2,1}, dλ) ≥ e^{c/ε} as ε ↓ 0 by Theorem 8.3.1. For each ε > 0 small enough, by the law of large numbers, there is a law Q with finite support and D^(1)(ε, F, Q) ≥ e^{c/ε} − 1, so (4.8.4) fails and by Theorem 4.8.3, F is not a VC subgraph class.
Recall that if
(10.1.3)   ∫₀¹ (log D^(2)(ε, F))^{1/2} dε < ∞,
as in Theorem 6.3.1 for F = 1, then F is said to satisfy Pollard's entropy condition.
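For a polynomial bound D^(2)(ε, F) ≤ Kε^{−γ}, the integrand in the entropy condition is (log K + γ log(1/ε))^{1/2}, which is integrable near 0, so the condition holds; substituting u = log(1/ε) turns the integral into ∫₀^∞ (log K + γu)^{1/2} e^{−u} du. A numerical check of ours (assuming natural logarithms; names are ours):

```python
import math

def entropy_integral(gamma, K=1.0, n=200_000):
    """Evaluate the entropy integral for D(eps) = K * eps**(-gamma):
    integral_0^1 sqrt(log D(eps)) d eps
      = integral_0^inf sqrt(log K + gamma*u) * exp(-u) du
    (substituting u = log(1/eps)), by the midpoint rule on [0, U]."""
    U = 40.0  # exp(-40) makes the truncated tail negligible
    h = U / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        total += math.sqrt(math.log(K) + gamma * u) * math.exp(-u) * h
    return total

# For gamma = 2, K = 1 the exact value is sqrt(2) * Gamma(3/2) = sqrt(2*pi)/2.
print(round(entropy_integral(2.0), 4))  # → 1.2533
```

The finite value confirms that any polynomial covering bound, in particular the VC subgraph case below, satisfies Pollard's condition.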
10.1.4 Theorem If F is a uniformly bounded, image admissible Suslin class of measurable functions and satisfies Pollard's entropy condition then F is a universal Donsker class.
Proof  F has a finite constant C as an envelope function. For a constant envelope, the hypotheses of Theorem 6.3.1 do not depend on the law P, so Theorem 10.1.4 is a corollary of Theorem 6.3.1. Here are some more details. Let F/C := {f/C : f ∈ F}, so that F/C has as an envelope the constant 1. It will be enough to show that F/C is a universal Donsker class. Then for δ > 0, D^(2)(δ, F/C) = D₁^(2)(δ, F/C) as in Theorem 6.3.1. Make the substitution
10.1 Universal Donsker classes
317
δ = ε/C and note that D^(2)(ε/C, F/C) = D^(2)(ε, F). It follows that F/C satisfies Pollard's entropy condition. So Theorem 6.3.1 applies and F/C and F are universal Donsker classes.

10.1.5 Corollary  A uniformly bounded, image admissible Suslin VC subgraph class is a universal Donsker class.
Proof  This follows from Theorems 4.8.3 and 10.1.4. Specializing further, the set of indicators of an image admissible Suslin VC class of sets is a universal Donsker class (Corollary 6.3.16 for F = 1). For a class F of real-valued functions on a set X, recall from Section 4.7 the class H(F, M), which is M times the symmetric convex hull of F, and H_s(F, M), which is the closure of H(F, M) for sequential pointwise convergence. Note that for any uniformly bounded class F of measurable functions for a σ-algebra A and any law Q defined on A, H(F, M) is dense in H_s(F, M) for the L²(Q) distance (or any L^p(Q) distance, 1 ≤ p < ∞).
10.1.6 Theorem  If F is a Donsker class for a law P such that F has an envelope function in L²(P), or a universal Donsker class, then for any M < ∞, H_s(F, M) is a P-Donsker (resp. universal Donsker) class.

Proof  First, for a given P, the limiting Gaussian process G_P can be viewed as an isonormal process, for the inner product
(f, g)_{0,P} := ∫ fg dP − ∫ f dP ∫ g dP.
So, by Corollary 2.5.9, F ∪ −F is P-pregaussian. (Or, by Theorem 3.8.1, F ∪ −F is P-Donsker, thus P-pregaussian.) Clearly, MF is an envelope function for H_s(F, M) and is in L²(P). Suppose h_k ∈ H_s(F, M) and h_k → h pointwise. Then for any empirical measure P_n, ∫ h_k dP_n → ∫ h dP_n as k → ∞. Also, ∫ h_k dP → ∫ h dP and ∫ (h_k − h)² dP → 0, both by dominated convergence. It follows that ρ_P(h_k, h) → 0. Then by Theorem 2.5.5, the set H_s(F, M) is pregaussian for P. If g is any prelinear function on H(F, M), then ‖g‖_{H(F,M)} = M‖g‖_F. Next, ‖g‖_{H_s(F,M)} = M‖g‖_F if g is any linear combination of P, P_n, and a sample function of a coherent G_P process. Then, the Donsker property of H_s(F, M) follows from the invariance principle, Theorem 9.4.2, where the G_P processes are coherent.
Suppose F is a universal Donsker class. Let G := {f − inf f : f ∈ F}, as in the discussion before Proposition 10.1.2. Then functions in H(F, M)
318
Universal and Uniform Central Limit Theorems
differ from functions in H(G, M) by additive constants. If h_k ∈ H_s(F, M), h_k → h pointwise, and for all k, h_k = φ_k + c_k for some φ_k ∈ H_s(G, M) and constants c_k, then since the φ_k are uniformly bounded and h has finite values, the c_k are bounded. So, taking a subsequence, we can assume c_k converges to some c. Then the φ_k converge pointwise to some φ ∈ H_s(G, M) and h = φ + c. Thus all functions in H_s(F, M) differ by additive constants from functions in H_s(G, M) (the converse may not hold). Thus, if H_s(G, M) is a universal Donsker class, so is H_s(F, M). So we can assume F is uniformly bounded. Then it has an envelope function in L²(P) for all P, so by the first half of the proof, H_s(F, M) is a universal Donsker class.
By Theorem 10.1.4, for any δ > 0, if log D^(2)(ε, F) = O(1/ε^{2−δ}) as ε ↓ 0, and if F satisfies a measurability condition (specifically, if F is image admissible Suslin), then F is a universal Donsker class. In the converse direction, we have:
10.1.7 Theorem  For a uniformly bounded class F to be a universal Donsker class it is necessary that
log D^(2)(ε, F) = O(ε^{−2})   as ε ↓ 0.
Proof  Suppose not. Then there are a universal Donsker class F and ε_k ↓ 0 such that log D^(2)(ε_k, F) ≥ k³/ε_k² for k = 1, 2, …, so there are probability laws P_k with finite support for which log D^(2)(ε_k, F, P_k) ≥ k³/ε_k² for k = 2, 3, …. Let P be a law with P ≥ Σ_{k≥2} P_k/k². Then for any measurable f and g,
( ∫ (f − g)² dP )^{1/2} ≥ ( ∫ (f − g)² dP_k )^{1/2} / k.
Let δ_k := ε_k/k. Then
log D^(2)(δ_k, F, P) ≥ log D^(2)(ε_k, F, P_k) ≥ k³/ε_k² = k/δ_k².
So any isonormal process L on L²(P) is a.s. unbounded on F by Theorem 2.3.5.
We can write L(f) = G_P(f) + G ∫ f dP, where G is a standard normal variable independent of G_P. Since F is uniformly bounded, G_P is a.s. unbounded on (a countable ρ_P-dense subset of) F, so F is not P-pregaussian and so not a P-Donsker class. Theorem 10.1.7 is optimal, as the following shows:
10.1.8 Proposition  There exists a universal Donsker class E such that lim inf_{δ↓0} δ² log D^(2)(δ, E) > 0.
Proof  Let A_j := A(j) be disjoint, nonempty measurable sets for j = 1, 2, …. Let ‖·‖₂ be the ℓ² norm, ‖x‖₂ = (Σ_j x_j²)^{1/2} for x = {x_j}_{j≥1}. Let
E := { Σ_j x_j 1_{A(j)} : ‖x‖₂ ≤ 1 }.
… We can assume p_j > 0 for all j, since if B is the union of all A_j such that p_j = 0, then P(B) = 0, G_P(B) = 0 a.s., and ν_n(B) = 0 a.s. for all n. Let ε > 0. For any k and n, let ‖ν_n‖_{2,k} := (Σ_{j≥k} ν_n(A_j)²)^{1/2}. Then for all n,
E‖ν_n‖²_{2,k} ≤ Σ_{j≥k} p_j → 0   as k → ∞.
Take k = k(ε) large enough so that Σ_{j≥k} p_j ≤ ε³/18. Then
(10.1.9)   Pr{ ‖ν_n‖_{2,k} > ε/3 } < ε/2.
If ‖ν_n‖_{2,k} ≤ ε/3, ‖x‖₂ ≤ 1, and ‖y‖₂ ≤ 1, then by the Cauchy–Schwarz inequality,
(10.1.10)   | ν_n( Σ_{j≥k} (x_j − y_j) 1_{A(j)} ) | ≤ 2ε/3.
Also, E‖ν_n‖_{2,1} ≤ (E‖ν_n‖²_{2,1})^{1/2} ≤ 1, so
(10.1.11)   Pr{ ‖ν_n‖_{2,1} > 2/ε } < ε/2.
Let δ := (min_{j≤k} p_j)^{1/2} ε²/6 > 0. For each x = {x_j}_{j≥1} let
f_x := Σ_{j≥1} x_j 1_{A(j)}.
If f_x and f_y ∈ E and e_P(f_x, f_y) := ( ∫ (f_x − f_y)² dP )^{1/2} < δ, then
(10.1.12)   ( Σ_{j≤k} (x_j − y_j)² )^{1/2} < ε²/6.
D^(2)(δ, E, P) ≥ … ≥ ((m^{1/2} + δ)/(2δ))^m … ≥ (3/2)^m ≥ exp(log(3/2)/(4δ²)).
Letting δ ↓ 0, Proposition 10.1.8 follows. □
10.1.14 Proposition  There is a uniformly bounded class F of measurable functions, which is not a universal Donsker class, such that
log D^(2)(ε, F) = O(ε^{−2}/log(1/ε))   as ε ↓ 0.

Proof  Let B_j := B(j) be disjoint nonempty measurable sets. Recall that Lx := max(1, log x). Let a_j := 1/(jLj)^{1/2}, j ≥ 1, and
F := { Σ_{j≥1} x_j 1_{B(j)} : x_j = ±a_j for all j }.
Take c such that Σ_{j≥1} p_j = 1, where p_j := c(a_j/LLj)². Take a probability measure P with P(B_j) = p_j for all j. Let α := (1 − p₁)^{1/2}/2 > 0. Then
E‖G_P‖_F = Σ_{j≥1} a_j E|G_P(B_j)| = Σ_{j≥1} a_j (2/π)^{1/2} (p_j(1 − p_j))^{1/2} ≥ α Σ_{j≥1} a_j p_j^{1/2} = α c^{1/2} Σ_{j≥1} 1/(jLjLLj) = +∞
by the integral test, since (d/dx)(LLLx) = 1/(xLxLLx) for x large enough. If F were P-pregaussian, then since G_P can be treated as an isonormal process (see Section 3.1), by Theorem 2.5.5 ((a) if and only if (h)) and the material just before it, G_P could be realized on a separable Banach space with norm ‖·‖_F. Then the norm would have finite expectation by the Landau–Shepp–Marcus–Fernique theorem (2.2.2), a contradiction. So F is not pregaussian for P and so not a universal Donsker class. For any probability measure Q and r = 1, 2, …,
if f = Σ_j x_j 1_{B(j)} ∈ F and g = Σ_j y_j 1_{B(j)} ∈ F with x_j = y_j for 1 ≤ j < r, then
(10.1.15)   e_Q(f, g) = ( ∫ ( Σ_{j≥r} (x_j − y_j) 1_{B(j)} )² dQ )^{1/2} ≤ 2a_r ( Σ_{j≥r} Q(B_j) )^{1/2} ≤ 2a_r.
Given ε > 0, let r := r(ε) be the smallest integer r ≥ 1 such that a_r ≤ ε/2. By (10.1.15), if x_j = y_j for 1 ≤ j < r, then e_Q(f, g) ≤ ε. Thus, since there are only 2^{r−1} possibilities for x_j = ±a_j, j < r, we have D^(2)(ε, F, Q) ≤ 2^{r−1}, and since r doesn't depend on Q, D^(2)(ε, F) ≤ 2^{r−1} and log D^(2)(ε, F) ≤ r log 2. As ε ↓ 0, we have a_{r(ε)} ~ ε/2, so
log(1/ε) ~ log(2/ε) ~ log(1/a_{r(ε)}) ~ ½ log r(ε),
and ε/2 ~ 1/(r(ε) · 2 log(1/ε))^{1/2}, so r(ε) ~ 2/(ε² log(1/ε)). Since log D^(2)(ε, F) ≤ r(ε) log 2 ≤ r(ε), the conclusion follows.
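The rate r(ε) ~ 2/(ε² log(1/ε)) at the end of this proof can be checked numerically: with a_r = 1/(r log r)^{1/2}, the smallest r with a_r ≤ ε/2 satisfies r · ε² · log(1/ε) → 2 (slowly) as ε ↓ 0. A sketch of ours (names are ours):

```python
import math

def smallest_r(eps):
    """Smallest integer r >= 3 with a_r = 1/sqrt(r * log r) <= eps/2,
    i.e. r * log(r) >= 4 / eps**2 (natural log; valid for small eps)."""
    target = 4.0 / eps ** 2
    r = max(3.0, target)
    for _ in range(60):            # fixed-point iteration r = target / log r
        r = target / math.log(r)
    r = int(r)
    while r * math.log(r) < target:
        r += 1
    while r > 3 and (r - 1) * math.log(r - 1) >= target:
        r -= 1
    return r

# r(eps) * eps^2 * log(1/eps) approaches 2 (slowly) as eps -> 0.
print(smallest_r(1e-6) * 1e-12 * math.log(1e6))
```

The printed value is close to 2 but above it, reflecting the slowly varying LLj factors suppressed in the asymptotics.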
Theorems 10.1.4 and 10.1.7 show that Pollard's entropy condition (10.1.3) comes close to characterizing the universal Donsker property, but Propositions 10.1.8 and 10.1.14 show that there is no characterization of the universal Donsker property in terms of D^(2).
10.2 Metric entropy of convex hulls in Hilbert space

Let H be a real Hilbert space and for any subset B of H let co(B) be its convex hull,
co(B) := { Σ_{j=1}^k t_j x_j : t_j ≥ 0, Σ_{j=1}^k t_j = 1, x_j ∈ B, k = 1, 2, … }.
Recall that D(ε, B) is the maximum number of points in B more than ε apart.
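The greedy selection used in the proof below (pick any point, then repeatedly pick a point far from those already chosen) also gives a practical way to estimate D(ε, B) for a finite B: the selected set is pairwise more than ε apart, and by maximality every point of B lies within ε of it. An illustrative sketch of ours in the plane:

```python
def greedy_separated(points, eps, dist):
    """Greedily select points pairwise more than eps apart; the size of
    the result is a lower bound for the packing number D(eps, B), and
    every rejected point is within eps of some selected point."""
    chosen = []
    for p in points:
        if all(dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
net = greedy_separated(grid, 0.5, dist)
# Pairwise separation and eps-covering both hold by construction.
assert all(dist(p, q) > 0.5 for p in net for q in net if p != q)
assert all(min(dist(p, q) for q in net) <= 0.5 for p in grid)
print(len(net))  # → 6
```

For the 5 × 5 grid in the unit square, the greedy pass keeps 6 points that are mutually more than 1/2 apart and cover the grid to within 1/2.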
10.2.1 Theorem  Suppose that B is an infinite subset of a Hilbert space H, ‖x‖ ≤ 1 for all x ∈ B, and for some K < ∞ and 0 < γ < ∞, we have D(ε, B) ≤ Kε^{−γ} for 0 < ε ≤ 1. Let s := 2γ/(2 + γ). Then for any t > s, there is a constant C, which depends only on K, γ, and t, such that
D(ε, co(B)) ≤ exp(Cε^{−t})   for 0 < ε ≤ 1.

Note. Carl (1997) gives the sharper bound with t = s.
Proof  We may assume K ≥ 1. Choose any x₁ ∈ B. Given
B(n) := {x₁, …, x_{n−1}},   n ≥ 2,
let d(x, B(n)) := min_{y∈B(n)} ‖x − y‖ and δ_n := sup_{x∈B} d(x, B(n)). Since B is infinite, δ_n > 0 for all n. Choose x_n ∈ B with d(x_n, B(n)) > δ_n/2. Then for all n, K(2/δ_n)^γ ≥ D(δ_n/2, B) ≥ n, so δ_n ≤ Mn^{−1/γ} for all n, where M := 2K^{1/γ}. Let 0 < ε ≤ 1. Let N := N(ε) be the next integer larger than (4M/ε)^γ. Then δ_N < ε/4. Let G := B(N). For each x ∈ B there is an i = i(x) ≤ N − 1 with ‖x − x_i‖ ≤ δ_N. For any convex combination z = Σ_{x∈B} η_x x, where η_x ≥ 0 and Σ_{x∈B} η_x = 1, with η_x = 0 except for finitely many x, let z_N := Σ_{x∈B} η_x x_{i(x)}.
Then ‖z − z_N‖ ≤ δ_N < ε/4, so
(10.2.2)   D(ε, co(B)) ≤ D(ε/2, co(G)).
To bound D(ε/2, co(G)), let m := m(ε) be the largest integer ≤ ε^{−s}. Note that γ > s. Then for each i with m ≤ i ≤ N, there is a j ≤ m such that
(10.2.3)   ‖x_i − x_j‖ ≤ δ_{m+1} ≤ Mε^{s/γ}.
Let j(i) be the least such j. Let Λ_m := {λ_j} … with λ_j ≥ 0 for all j. For any j = 1, …, m and x ∈ …, let x(j) := Σ_{y∈A(j)} μ_{y,x} y. Let Y_j be a random variable with values in A(j) and P(Y_j = y) = μ_{y,x}/λ_j for each y ∈ A(j). Then EY_j = x(j)/λ_j =: z_j. Take Y₁, …, Y_m to be independent and let Y := Σ_{j=1}^m λ_j Y_j. Then EY = x and
E‖Y − x‖² = E‖ Σ_{j=1}^m λ_j (Y_j − z_j) ‖² = Σ_{j=1}^m λ_j² E‖Y_j − z_j‖²,
since the Y_j − z_j are independent and have mean 0, and H is a Hilbert space. Now the diameter of A(j) is at most 2Mε^{s/γ} by (10.2.3), and z_j is a convex combination of elements of A(j). Thus
E‖Y_j − z_j‖² = λ_j^{−1} Σ_{y∈A(j)} μ_{y,x} ‖y − z_j‖² ≤ 4M²ε^{2s/γ},
YEA(j)
and for any set F C {1, 2
E Y,.lj(Yj  zj) jEF
ν_n(C_n). By Stirling's formula (Theorem 1.3.13), as n → ∞,
ν_n(C_n)/c_n ~ (e/n)^{n(2+γ)/(2γ)} n^{−1/(2γ)} D_γ
for a constant D_γ > 0.
Take any d such that d > 1/2 + 1/γ. Then for n large enough, ν_n(C_n)/c_n ≥ n^{−dn}.
Let g_n(ε) := n^{−dn}ε^{−n}. Then m ≥ g_n(ε). The following paragraph is only for motivation. As ε ↓ 0, n = n(ε) will be chosen to make g_n(ε) about as large as possible. We have
g_{n+1}(ε)/g_n(ε) = ε^{−1}(n + 1)^{−d(n+1)} n^{dn} ~ 1/(ε(en)^d).
This sequence is decreasing in n, so to maximize g_n(ε) for a given ε we want to take n such that the ratio is approximately 1.
At any rate, let n(ε) be the largest integer ≤ f(ε) := e^{−1}ε^{−1/d}. Then for ε small enough,
g_{n(ε)}(ε) ≥ f(ε)^{−df(ε)} ε^{−f(ε)+1} = ε exp(df(ε)) = ε exp(de^{−1}ε^{−1/d}).
Now ε ≥ exp(−ε^{−δ}) as ε ↓ 0 for any δ > 0. Taking δ < 1/d and letting d ↓ (γ + 2)/(2γ), we have 1/d ↑ (2γ)/(2 + γ), showing that the exponent in Theorem 10.2.1 is indeed optimal. Recall the definitions of D^(2) from Section 4.8 and H_s from after Corollary 10.1.5.
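The key probabilistic step in the proof of Theorem 10.2.1 — replacing a convex combination x = Σ_j λ_j z_j by Y = Σ_j λ_j Y_j with independent Y_j, so that E‖Y − x‖² = Σ_j λ_j² E‖Y_j − z_j‖² — can be verified by exhaustive enumeration in a small case. A sketch with a toy two-group example of ours:

```python
from itertools import product

def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])

def norm2(u):
    return u[0] * u[0] + u[1] * u[1]

# Two independent groups of point masses in R^2: group j is a list of
# (point, probability) pairs, mixed with convex weights lambda_j.
groups = [
    [((0.0, 0.0), 0.5), ((1.0, 0.0), 0.5)],
    [((0.0, 1.0), 0.25), ((0.0, 3.0), 0.75)],
]
lam = [0.6, 0.4]

# z_j := E Y_j and the target x := sum_j lambda_j * z_j.
z = [(sum(p * y[0] for y, p in g), sum(p * y[1] for y, p in g)) for g in groups]
x = (sum(l * zj[0] for l, zj in zip(lam, z)),
     sum(l * zj[1] for l, zj in zip(lam, z)))

# Left side: E|| sum_j lambda_j Y_j - x ||^2 by exhaustive enumeration.
lhs = 0.0
for combo in product(*groups):
    prob = 1.0
    ysum = (0.0, 0.0)
    for l, (y, p) in zip(lam, combo):
        prob *= p
        ysum = (ysum[0] + l * y[0], ysum[1] + l * y[1])
    lhs += prob * norm2(sub(ysum, x))

# Right side: sum_j lambda_j^2 * E||Y_j - z_j||^2 (independence, mean zero).
rhs = sum(l * l * sum(p * norm2(sub(y, zj)) for y, p in g)
          for l, g, zj in zip(lam, groups, z))

assert abs(lhs - rhs) < 1e-12
print(round(rhs, 6))  # → 0.21
```

Because the variance contracts by λ_j², averaging over independent draws gives the O(1/√m)-type error that drives the entropy bound for convex hulls.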
10.2.7 Corollary  If G is a uniformly bounded class of measurable functions and for some K < ∞ and 0 < γ < ∞, D^(2)(ε, G) ≤ Kε^{−γ} for 0 < ε ≤ 1, then for any t > r := 2γ/(2 + γ), and for the constants C_i = C_i(2K, γ, t), i = 1, 2, of Theorem 10.2.1,
D^(2)(ε, H_s(G, 1)) ≤ C₁ exp(C₂ ε^{−t})   for 0 < ε ≤ 1.

Proof  We have D^(2)(ε, G ∪ −G) ≤ 2Kε^{−γ}, 0 < ε ≤ 1. Thus for any law Q with finite support, D^(2)(ε, G ∪ −G, Q) ≤ 2Kε^{−γ}. By Theorem 10.2.1, D^(2)(ε, H(G, 1), Q) ≤ C₁ exp(C₂ ε^{−t}) for C_i = C_i(2K, γ, t). It is easily seen that H(G, 1) is a dense subset of H_s(G, 1) in L²(Q). The conclusion follows.
Now, recall the notions of VC subgraph and VC subgraph hull class from Section 4.7.

10.2.8 Corollary If G is a uniformly bounded VC subgraph class and M < ∞ then the VC subgraph hull class H_s(G, M) satisfies Pollard's entropy condition (10.1.3).
Universal and Uniform Central Limit Theorems
Proof Let MG := {Mg: g ∈ G}. Then MG is a uniformly bounded VC subgraph class. By Theorem 4.8.3(a), MG satisfies the hypothesis of Corollary 10.2.7, so τ < 2 and we can take t < 2.
Remark. It follows by Theorem 6.3.1 that for a uniformly bounded VC subgraph class G, if F ⊂ H_s(G, M) and F satisfies the image admissible Suslin measurability condition, then F is a universal Donsker class. This also follows from Corollary 10.1.5 and Theorem 10.1.6.
Example. Let C be the set of all intervals (a, b] for 0 ≤ a ≤ b ≤ 1. Let G be the set of all real functions f such that |f(x)| ≤ 1/2 for all x, |f(x) − f(y)| ≤ |x − y| for 0 ≤ x, y ≤ 1, and f(x) = 0 for x < 0 or x > 1. Each f in G has total variation at most 2 (at most 1 on the open interval 0 < x < 1 and 1/2 at each endpoint 0, 1). By the Jordan decomposition we have, for each f ∈ G, f = g − h where g and h are both nondecreasing functions, 0 for x < 0. Then g and h have equal total variations ≤ 1 and G ⊂ H_s(C, 2) by the proof of Theorem 4.7.1(b). Let P be Lebesgue measure on [0, 1]. By Theorem 8.2.10, and since (∫|f|² dP)^{1/2} ≥ ∫|f| dP, there is a c > 0 such that D^{(2)}(ε, G) ≥ e^{c/ε} as ε ↓ 0 (consider laws with finite support which approach P). Since S(C) = 2, the exponent γ can be any number larger than 2 by Corollary 4.1.7 and Theorem 4.6.1. Letting γ ↓ 2, t in Corollary 10.2.7 can be any number > 1, and we saw above that it cannot be < 1 in this case, so again the exponent is sharp.
**10.3 Uniform Donsker classes

This section will state some results with references but without proofs. A class F of measurable functions on a measurable space (X, A) is a uniform Donsker class if it is a universal Donsker class and the convergence in law of ν_n to G_P is also uniform in P. To formulate this notion precisely we will follow Giné and Zinn (1991) in using the dual-bounded-Lipschitz metric β as defined just before Theorem 3.6.4.
Let P(X) be the set of all probability measures on (X, A) and let P_f(X) be the set of all laws in P(X) with finite support. For δ > 0, a class F of measurable real-valued functions on X and a pseudometric d on F, let

F(δ, d) := {f − g: f, g ∈ F, d(f, g) < δ}.

… For every ε > 0, there are m₀ and n₀ such that for m ≥ m₀ and n ≥ n₀, |ν_{m,n} ∘ h_{m,n} − G^{(m,n)}| < ε on a set V_ε ⊂ V with μ(V_ε) > 1 − ε. It follows that

d_{m,n} := |H(ν_{m,n} ∘ h_{m,n}) − H(G^{(m,n)})| …
= 2 Fo 1(4k2u2  1) exp(2k2u2). Proof The distributions of the given functionals of the Brownian bridge yt are given in RAP, Propositions 12.3.3, 12.3.4, and 12.3.6. All three are continuous in u for u > 0. Thus convergence follows from RAP, Theorem 9.3.6. Note. The statistics supx Hmn (x) and sup., I Hmn (x) I in (a) and (b) are called KolmogorovSmirnov statistics, while sup., Hmn (x)  infy Hmn (y) is called a Kuiper statistic.
11.2 A bootstrap central limit theorem in probability

Iterating the operation by which we get an empirical measure P_n from a law P, we form the bootstrap empirical measure P_n^B by sampling n independent points whose distribution is the empirical measure P_n. The bootstrap was first introduced in nonparametric statistics, where the law P is unknown and we want to make inferences about it from the observed P_n. This can be done by way of bootstrap central limit theorems, which say that under some conditions, n^{1/2}(P_n^B − P_n) behaves like n^{1/2}(P_n − P) and both behave like G_P.
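In code, the resampling operation just described amounts to drawing with replacement (a sketch, not from the text; the function name is illustrative):

```python
import random

def bootstrap_sample(data, rng=random):
    """Draw n points i.i.d. from the empirical measure P_n of `data`,
    i.e. sample n times with replacement: X^B_1, ..., X^B_n."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]
```

The bootstrap empirical measure P_n^B then puts mass 1/n at each resampled point, so functionals of P_n^B are computed from the resampled list exactly as functionals of P_n are computed from the original data.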
Let (S, S, P) be a probability space and F a class of real-valued measurable functions on S. Let as usual X₁, X₂, … be coordinates on a countable product of copies of (S, S, P). Then let X_1^B, …, X_n^B be independent with distribution P_n. Let

P_n^B := n^{−1} Σ_{j=1}^n δ_{X_j^B}.

Then P_n^B will be called a bootstrap empirical measure. A statistician has a data set, represented by a fixed P_n, and estimates the distribution of P_n^B by repeated resampling from the same P_n. So we are interested not so much in the unconditional distribution of P_n^B as P_n varies, but
rather in the conditional distribution of P_n^B given P_n or (X₁, …, X_n). Let ν_n^B := n^{1/2}(P_n^B − P_n). The limit theorems will be formulated in terms of the dual-bounded-Lipschitz "metric" β of Section 3.6, which metrizes convergence in distribution for not necessarily measurable random elements of a possibly nonseparable metric space (S, d), to a limit which is a measurable random variable with separable range. Let β_F be the β distance where d is the metric defined by the norm ‖·‖_F.
Definition. Let (S, S, P) be a probability space and F a class of measurable real-valued functions on S. Then the bootstrap central limit theorem holds in probability (resp. almost surely) for P and F if and only if F is pregaussian for P and β_F(ν_n^B, G_P), conditional on X₁, …, X_n, converges to 0 in outer probability (resp. almost uniformly) as n → ∞.

In other words, the bootstrap central limit theorem holds in probability if and only if for any ε > 0 there is an n₀ large enough so that for any n ≥ n₀, there is a set A of values of X^{(n)} := (X₁, …, X_n) with P^n(A) < ε such that if X^{(n)} ∉ A, and ν_n^B(X^{(n)})(·) is ν_n^B conditional on X^{(n)}, we have β_F(ν_n^B(X^{(n)})(·), G_P) < ε. A main bootstrap limit theorem will be stated. The rest of this section will be devoted to proving it, with a number of lemmas and auxiliary theorems.
11.2.1 Theorem (Giné and Zinn) Let (X, A, P) be any probability space. Then the bootstrap central limit theorem holds in probability for P and F if F is a Donsker class for P.

Remarks. Giné and Zinn (1990), see also Giné (1997), also proved "only if" under a measurability condition and proved a corresponding almost sure form of the theorem when F has an L² envelope up to additive constants. See Section 11.4 for detailed statements (but not proofs) of these other theorems.
Proof E_B, Pr_B, and L_B will denote the conditional expectation, probability, and law given the sample X^{(n)} := (X₁, …, X_n). Given the sample, ν_n^B has only finitely many possible values. First, a finite-dimensional bootstrap central limit theorem is needed.
11.2.2 Theorem Let X₁, X₂, … be i.i.d. random variables with values in R^d and let X^B_{n,i}, i = 1, …, n, be i.i.d. (P_n), where P_n := n^{−1} Σ_{j=1}^n δ_{X_j}. Let X̄_n := n^{−1} Σ_{j=1}^n X_j. Assume that E|X₁|² < ∞. Let C be the covariance matrix of X₁, C_{rs} := E(X_{1r} X_{1s}) − E(X_{1r}) E(X_{1s}). Then for the usual convergence of laws in R^d, almost surely as n → ∞,

(11.2.3)  L_B(n^{−1/2} Σ_{j=1}^n (X^B_{n,j} − X̄_n)) → N(0, C).
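Before the proof, a quick simulation sketch of the d = 1 case (illustrative, not part of the text): given the sample, the bootstrap statistic n^{−1/2} Σ_j (X^B_j − X̄_n) has conditional mean 0 and conditional variance s_n², so its empirical variance over many resamplings should be close to s_n².

```python
import math
import random

def bootstrap_variance_check(n=500, reps=500, seed=1):
    """Compare s_n^2 with the empirical variance, over `reps` resamplings,
    of the bootstrap statistic n^{-1/2} * sum_j (X^B_j - Xbar_n)."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(n)]   # any law with E|X|^2 < oo
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    stats = []
    for _ in range(reps):
        boot = [xs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(b - xbar for b in boot) / math.sqrt(n))
    emp_var = sum(t * t for t in stats) / reps   # conditional mean is 0
    return s2, emp_var
```

A histogram of the returned statistics would likewise be approximately N(0, s_n²) for large n, which is the content of (11.2.3) in one dimension.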
Proof Note: N(0, 0) is defined as a point mass at 0. Suppose the theorem holds for d = 1. Then for each t ∈ R^d, the theorem holds for the inner products (t, X_j) and thus (t, X^B_{n,j}), and so on. This implies that the characteristic functions of the laws on the left in (11.2.3) converge to that of N(0, C), and thus that the laws converge by the Lévy continuity theorem (RAP, Theorem 9.8.2). (This last argument is sometimes called the "Cramér-Wold device.") So, we can assume d = 1, and on the right in (11.2.3) we have a law N(0, σ²) where σ² is the variance of X₁. If σ² = 0 there is no problem, so assume σ² > 0. We can assume that EX₁ = 0, since subtracting EX₁ from all X_j doesn't change the expression on the left in (11.2.3). Let

s_n² := n^{−1} Σ_{j=1}^n (X_j − X̄_n)²,  s_n := (s_n²)^{1/2}.
Then by the strong law of large numbers for the X_j and the X_j², s_n² converges a.s. as n → ∞ to σ². For a given n and sample X^{(n)}, X̄_n is fixed and the variables

Y_{ni} := (X^B_{ni} − X̄_n)/n^{1/2}

for i = 1, …, n are i.i.d. We have a triangular array and will apply the Lindeberg theorem (RAP, Theorem 9.6.1). For each n and i, E_B Y_{ni} = 0. Each Y_{ni} has variance Var_B(Y_{ni}) = s_n²/n. The sum from i = 1 to n of these variances is s_n². Let Z_{ni} := Y_{ni}/s_n. Then the Z_{ni} remain i.i.d. given P_n, and the sum of their variances is 1. We would like to show by the Lindeberg theorem that L_B(Σ_{i=1}^n Z_{ni}) converges to N(0, 1) as n → ∞, for almost all sequences X₁, X₂, …. We have E_B(Z_{ni}) = 0. Let 1{…} denote the indicator of an event {…}. For ε > 0 let E_{njε} := E_B(|Z_{nj}|² 1{|Z_{nj}| > ε}). It remains to show that Σ_j E_{njε} = n E_{n1ε} → 0 as n → ∞, almost surely. Now

A_{nj} := {|Z_{nj}| > ε} = {|X^B_{nj} − X̄_n| > n^{1/2} s_n ε},

which for n large enough, almost surely, is included in the set

C(n, j) := C_{nj} := {|X^B_{nj}| > n^{1/2} σε/2}.

Also, Z_{nj}² = (X^B_{nj} − X̄_n)²/(n s_n²) ≤ 2[(X^B_{nj})² + X̄_n²]/(n s_n²). Then X̄_n²/s_n², which is constant given the sample, approaches 0 a.s. as n → ∞ since EX₁ = 0
and so does its integral over C_{nj}. For the previous term,

E_B((X^B_{n1})² 1_{C(n,1)})/(n s_n²) = n^{−2} Σ_{i=1}^n X_i² 1{|X_i| > n^{1/2} σε/2} / s_n²,

which, multiplied by n, still goes to 0 a.s. as n → ∞, by the strong law of large numbers for the variables X_i² 1{|X_i| > K} for any fixed K, then letting K → ∞. Theorem 11.2.2 is proved.

Here is a desymmetrization fact:
11.2.4 Lemma Let T be a set, and for any real-valued function f on T let ‖f‖_T := sup_{t∈T} |f(t)|. Let X and Y be two stochastic processes indexed by t ∈ T, defined on a probability space (Ω × Ω′, S ⊗ S′, P × P′), where X(t)(ω, ω′) depends only on ω ∈ Ω, and Y(t)(ω, ω′) only on ω′ ∈ Ω′. Then

(a) for any s > 0 and any u > 0 such that sup_{t∈T} Pr{|Y(t)| > u} < 1, we have

Pr*{‖X‖_T > s} ≤ Pr*{‖X − Y‖_T > s − u} / [1 − sup_{t∈T} Pr{|Y(t)| > u}].

(b) If θ ≥ sup_{t∈T} E(Y(t)²) then for any s > 0,

Pr*{‖X‖_T > s} ≤ 2 Pr*{‖X − Y‖_T > s − (2θ)^{1/2}}.

Proof Let A(s) := {ω: ‖X(ω)‖_T > s}. For part (a) we have by (3.2.8),

Pr*{‖X − Y‖_T > s − u} ≥ E_P (P′)*{‖X − Y‖_T > s − u}
  ≥ P*{‖X‖_T > s} inf_{ω∈A(s)} (P′)*{‖X(ω) − Y‖_T > s − u}
  ≥ P*{‖X‖_T > s} inf_{t∈T} P′{|Y(t)| ≤ u},

since if ‖X(ω)‖_T > s then for some t ∈ T, |X(ω, t)| > s. This proves part (a). Then part (b) follows by Chebyshev's inequality.
Now ε_i will denote i.i.d. Rademacher variables, that is, variables with the distribution P(ε_i = 1) = P(ε_i = −1) = 1/2, defined on a probability space (Ω_ε, S_ε, P_ε) which we can take as [0, 1] with Lebesgue measure. We also take the product of (Ω_ε, S_ε, P_ε) with the probability space on which other random variables and elements are defined. Next, some hypotheses will be given for later reference.
Let (S, S, P) be a probability space and (S^n, S^n, P^n) a Cartesian product of n copies of (S, S, P). Let F ⊂ L²(S, S, P). Let ε₁, …, ε_n be i.i.d. Rademacher variables defined on a probability space (Ω′, A, P′). Then take the probability space

(11.2.5)  (S^n × Ω′, S^n ⊗ A, P^n × P′).
References to (11.2.5) will also be to the preceding paragraph. Here is another fact on symmetrization and desymmetrization.
11.2.6 Lemma Under (11.2.5), for any t > 0 and n = 1, 2, …,

(a) Pr*{‖Σ_{j=1}^n ε_j f(X_j)‖_F > t} ≤ 2 max_{k≤n} Pr*{‖Σ_{j=1}^k f(X_j)‖_F > t/2}.

(b) Suppose that σ² := sup_{f∈F} ∫(f − Pf)² dP < ∞. Then for all n = 1, 2, …, and t > 2^{1/2} σ n^{1/2},

Pr*{‖Σ_{i=1}^n (f(X_i) − Pf)‖_F > t} ≤ 2 Pr*{‖Σ_{i=1}^n ε_i f(X_i)‖_F > (t − (2n)^{1/2} σ)/2}.
Proof For part (a), let E(n) := {r = {r_i}_{i=1}^n: r_i = ±1 for each i}. Let ε_{(n)} := {ε_i}_{i=1}^n. Then

p_t := Pr*{‖Σ_{i=1}^n ε_i f(X_i)‖_F > t}
     = Σ_{r∈E(n)} Pr*({ε_{(n)} = r} × {‖Σ_{r_i=1} f(X_i) − Σ_{r_i=−1} f(X_i)‖_F > t}).

By Lemma 3.2.4 and Theorem 3.2.1, for laws μ, ν and sets A, B, (μ × ν)*(A × B) = μ*(A)ν*(B) = μ(A)ν*(B) if A is measurable, so

p_t ≤ Σ_{r∈E(n)} 2^{−n} · 2 max_{k≤n} Pr*{‖Σ_{i=1}^k f(X_i)‖_F > t/2} = 2 max_{k≤n} Pr*{‖Σ_{i=1}^k f(X_i)‖_F > t/2}.

… Then for any s > 0 and t > 0,

(11.2.10)  Pr{max_{k≤n} ‖S_k‖* > 3t + s} ≤ (Pr{max_{k≤n} ‖S_k‖* > t})² + Pr{max_{j≤n} ‖X_j‖* > s}.
Proof Let ‖·‖ := ‖·‖_T. Let τ := min{j ≤ n: ‖S_j‖* > t}, or τ := n + 1 if there is no such j ≤ n. Then for j = 1, …, n, {τ = j} is by Lemma 3.2.4 a measurable function of X₁, …, X_j, and {max_{k≤n} ‖S_k‖* > t} is the disjoint union ∪_{j=1}^n {τ = j}. On {τ = j}, we have ‖S_k‖* ≤ t if k < j and, when k ≥ j, ‖S_k‖* ≤ t + ‖X_j‖* + ‖S_k − S_j‖*. Thus max_{k≤n} …

… ≤ Σ_{i=1}^{N(τ)} n Pr{|U_{n,1}(t_i)| > γ n^{1/2}} → 0
by hypothesis (i) and (11.2.23). For an arbitrary set A in a probability space, let A* be a measurable cover, so that A* is a measurable set and (1_A)* = 1_{A*}. Then it is easily checked that for any sets C₁, …, C_k, (∪_{i≤k} C_i)* = ∪_{i≤k} C_i* up to sets of probability zero. Take C_i := {‖U_{n,i} − U_{n,i,τ}‖_T > γ n^{1/2}}. Then by Theorem 3.2.1 and Lemma 3.2.4,

1 − exp(−n Pr*(C₁)) ≤ 1 − (1 − Pr*(C₁))^n = Pr(∪_{j=1}^n C_j*) …
… Pr*{max … ≥ r} = … E*‖Σ_{i=1}^n ε_i f(X_i)‖_F
since the (ε_i, X_i) are i.i.d. Thus E(n) is bounded above by

∫₀^∞ Pr{‖Σ_{i=1}^n 1{|f(X_i)| > t} ε_i f(X_i)‖_F > …} dt + Σ_{k=m+1}^n ∫₀^∞ Pr{max_{k≤n} … > t} dt …
maxk 0 there is an N(r) < oo and a map in from F into a subset having N(r) elements such that pp(Jrt f, f) < a for all f E F. Let vBt (f) := vB(nt f) and Gp,t (f) := Gp(Jr7 (f )). Then for any bounded Lipschitz function H on £°O(F) for II ILr,
I EBH(vB)  EH(Gp)I < I EBH(vB) E BH(vB.,)I
 EH(Gp,t)
+ EBH(vnB
+ EH(Gp,t)  EH(Gp) I
A(H) + Bn,r (H) + Ct (H). Let BL (1) denote the set of all functions H on £°O (F) with bounded Lipschitz norm < 1. Then supHEBL(1)Ct(H) + 0 as r 0 since by definition of Donsker class, Gp is a.s. uniformly continuous with respect to pp, so Gp,t +
Gp uniformly on F as r 1 0. For fixed r > 0, the supremum for H E BL(1) of Bn,t(H) is measurable and converges to 0 a.s. by Theorem 11.2.2, the finitedimensional bootstrap central limit theorem, in RN(,). Recall the definition of II 11s,., from before Theorem 11.2.28. For the An,t term, we have
so it will be enough to show that for all
SUPHEBL(1) An,t (H) < 2EB II vBllt,
e>0,as8 J. 0, (11.2.39)
{EBIIvBIIS"
limsupn,,,,, Pr*
>
Apply (11.2.18) with B := … and v_i := δ_{X_i(ω)} for each ω, and the starred Tonelli-Fubini theorem where one coordinate is discrete (3.2.9). We get
E* E_B ‖ν_n^B‖_{δ,F} = n^{−1/2} E* E_B ‖Σ_{i=1}^n (δ_{X_i^B} − P_n)‖_{δ,F}
  ≤ n^{−1/2} (e/(e−1)) E* ‖Σ_{i=1}^n (N_i − 1)(δ_{X_i} − P_n)‖_{δ,F}
  ≤ n^{−1/2} (e/(e−1)) E* ‖Σ_{i=1}^n (N_i − 1) δ_{X_i}‖_{δ,F} + n^{−1/2} (e/(e−1)) E* |Σ_{i=1}^n (N_i − 1)| ‖P_n‖_{δ,F} =: S + T.

Since F is Donsker for P, the limit as δ ↓ 0 of the lim sup as n → ∞ of S is 0 by Theorem 11.2.37. For T we can apply independence, Lemma 3.2.6, and (3.2.9). Then E|Σ_{i=1}^n (N_i − 1)| ≤ n^{1/2}, so T ≤ (e/(e−1)) E* ‖P_n‖_{δ,F}.
11.3 Other aspects of the bootstrap
357
Recalling that each f ∈ F is taken to be centered, we have E*‖P_n‖_{δ,F} = E*‖n^{−1/2} ν_n + P‖_{δ,F} → 0 as n → ∞ and δ ↓ 0 by Theorem 11.2.28(d). Thus Theorem 11.2.1 is proved.
11.3 Other aspects of the bootstrap

Efron (1979) invented the bootstrap, and by now there is a very large literature about it. This section will address some aspects of the application of the Giné-Zinn theorems. These do not cover the entire field by any means. For example, some statistics of interest, such as max(X₁, …, X_n), are not averages (f(X₁) + … + f(X_n))/n as f ranges over a class F.

Some bootstrap limit theorems are stated in probability, and others for almost sure convergence. To compare their usefulness, first note that almost sure convergence is not always preferable to convergence in probability:
Example. Let X_n be a sequence of real-valued random variables converging to some X₀ in probability but not almost surely. Then some subsequences X_{n_k} converge to X₀ almost surely. Suppose this occurs whenever n_k ≥ k² for all k. Let Y_n := X_{2^k} for 2^k ≤ n < 2^{k+1}, where k = 0, 1, …. Then Y_n → X₀ almost surely, but in a sense, X_n → X₀ faster although it converges only in probability.

Another point is that almost sure convergence is applicable in statistics when
inferences will be made from data sets with increasing values of n, in other words, in the part of statistics called sequential analysis. But suppose one has a fixed value of the sample size n, as has generally been the case with the bootstrap. Then the probability of an error of a given size, for a given n, which relates to convergence in probability, may be more relevant than the question of what would happen for values of n → ∞, as in almost sure convergence.

The rest of this section will be devoted to confidence sets. A basic example of a confidence set is a confidence interval. As an example, suppose X₁, …, X_n are i.i.d. with distribution N(μ, σ²) where σ² is known but μ is not. Then X̄ := (X₁ + … + X_n)/n has a distribution N(μ, σ²/n). Thus

P(X̄ < μ − 1.96σ/n^{1/2}) = 0.025 = P(X̄ > μ + 1.96σ/n^{1/2}).

So we have 95% confidence that the unknown μ belongs to the interval [X̄ − 1.96σ/n^{1/2}, X̄ + 1.96σ/n^{1/2}], which is then called a 95% confidence interval for μ.
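As a sketch in code (the function name is illustrative), the interval above is computed directly from the sample mean:

```python
import math

def normal_mean_interval(xs, sigma, z=1.96):
    """95% confidence interval for mu when the X_i are i.i.d. N(mu, sigma^2)
    with sigma known: Xbar +/- z * sigma / sqrt(n)."""
    n = len(xs)
    xbar = sum(xs) / n
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half
```

Replacing z = 1.96 by another standard normal quantile gives other confidence levels.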
Next, suppose X₁, …, X_n are i.i.d. in R^k with a normal (Gaussian) distribution N(μ, σ²I) where I is the identity matrix. Suppose α > 0 and M_α = M_α(k) is such that N(0, I){x: |x| > M_α} = α. Then n^{1/2}(X̄ − μ)/σ has distribution N(0, I), so P(|X̄ − μ| > M_α σ/n^{1/2}) = α. Thus, the ball with center X̄ and radius M_α σ/n^{1/2} is called a 100(1 − α)% confidence set for the unknown μ. When the distribution of the X_i is not necessarily normal, but has finite variance, then the distribution of X̄ will be approximately normal by the central limit theorem for n large, so we get some approximate confidence sets.

Now let's extend these ideas to the bootstrap. Let X₁, …, X_n be i.i.d. from an otherwise unknown distribution P. Let P_n be the empirical measure formed from X₁, …, X_n. Let F be a universal Donsker class (Section 10.1). Then we know from the Giné-Zinn theorem in the last section that ν_n^B and ν_n have asymptotically the same distribution on F. By repeated resampling, given a small α > 0 such as α = 0.05, one can find M = M(α) such that approximately P(‖ν_n^B‖_F > M) = α. Then

{Q: ‖Q − P_n‖_F ≤ M/n^{1/2}}

is an approximate 100(1 − α)% confidence set for P.
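The resampling estimate of M(α) can be sketched as follows (illustrative names, not from the text), taking F to be the universal Donsker class of indicators of half-lines, so that ‖ν_n^B‖_F is a Kolmogorov-Smirnov-type supremum; the sketch assumes distinct data values.

```python
import bisect
import math
import random

def bootstrap_radius(data, alpha=0.05, reps=400, seed=0):
    """Estimate M(alpha): the (1 - alpha) empirical quantile, over `reps`
    resamplings, of sqrt(n) * sup_t |(P_n^B - P_n)((-oo, t])|."""
    rng = random.Random(seed)
    n = len(data)
    s = sorted(data)
    sups = []
    for _ in range(reps):
        boot = sorted(data[rng.randrange(n)] for _ in range(n))
        # both empirical d.f.s jump only at the original data values,
        # so the sup over t is attained on those values
        sup = max(abs(bisect.bisect_right(boot, t) / n - (i + 1) / n)
                  for i, t in enumerate(s))
        sups.append(math.sqrt(n) * sup)
    sups.sort()
    return sups[min(reps - 1, int(math.ceil((1 - alpha) * reps)) - 1)]
```

With M the returned value, {Q: ‖Q − P_n‖_F ≤ M/n^{1/2}} is the approximate confidence set described above.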
**11.4 Further Giné-Zinn bootstrap central limit theorems

Giné and Zinn (1990) proved two theorems giving criteria for bootstrap central limit theorems, one for the almost sure bootstrap central limit property, the other in (outer) probability, both under some measurability conditions. Later, in work by Strobl (1994) and van der Vaart and Wellner (1996), the measurability conditions were removed from the direct parts of each theorem (that the Donsker property, together with an L² envelope in the almost sure case, implies the corresponding bootstrap central limit theorem). Giné (1997) gives an exposition, with proofs, of the resulting forms of the theorems. A law μ on the Borel σ-algebra of a metric space is said to be Radon if it is tight and thus concentrated in a separable subspace. The theorems then are, first (Giné, 1997, Theorem 2.2):
Theorem A If F is any Donsker class for a law P then the bootstrap central limit theorem in probability also holds for F and P. Conversely, if (S, S, P) is any probability space, F is a class of real-valued measurable functions on S such that F_F(x) := sup_{f∈F} |f(x)| < ∞ for all x ∈ S, F is image admissible Suslin, and there exists a Gaussian process G indexed by F with Radon law on ℓ^∞(F) such that β_F(ν_n^B, G) → 0 in outer probability as n → ∞, then F is Donsker for P and G = G_P.

Only the first (direct) statement in Theorem A is proved in Section 11.2. In the almost sure case, Giné (1997, Theorem 3.2) is:
Theorem B Let F be any Donsker class for a law P and assume that E*‖f(X₁) − Pf‖²_F < ∞. Then the almost sure bootstrap central limit theorem also holds for F and P. Conversely, if a class F of measurable real-valued functions is image admissible Suslin, has an everywhere finite envelope F_F, and there is a centered Gaussian process G indexed by F having a Radon law in ℓ^∞(F) and such that β_F(ν_n^B, G) → 0 almost uniformly as n → ∞, then F is Donsker for P, E F_F² < ∞, and G = G_P. Note that the image admissible Suslin property implies that E F_F² is well-defined without any star.
The proof of a form of Theorem B in Giné and Zinn (1990) relies on results from Araujo and Giné (1980), Fernique (1975, 1985), Giné and Zinn (1984, 1986), Jain and Marcus (1978), and Ledoux and Talagrand (1988). Especially when the needed material from these references is included, the proof is quite long (over 50 pages in a version I prepared). The proof of the "in probability" version is shorter, but still rather long, as is the proof of it in one direction in Section 11.2.
Problems

1. If the sample space is the unit circle S¹ := {(cos θ, sin θ): 0 ≤ θ < 2π}, and distribution functions are defined by evaluating laws on arcs {(cos θ, sin θ): 0 ≤ θ ≤ x}, show that the quantity in Corollary 11.1.2(c), sup_x H_{mn}(x) − inf_y H_{mn}(y), is invariant under rotations of the circle.

2. (Continuation) For m = n = 20, find the approximation given by Corollary 11.1.2(c) to the probability that there exists an arc A := {(cos θ, sin θ): a ≤ θ ≤ b} such that X_j ∈ A and Y_j ∉ A for j = 1, …, 20, if all 40 variables are i.i.d. from the same continuous distribution on the circle. Hint: Not many terms of the series should be needed.

3. In Corollary 11.1.2,
(i) show that the series in (b) and (c) diverge when u = 0.
(ii) Show that for u = 0, the probabilities equal 1 for any m ≥ 1 and n ≥ 1 and for y_t.
(iii) Show that the three expressions on the right are all less than 1 for u > 0 and converge to 1 as u ↓ 0, so that the corresponding distribution functions are continuous everywhere. Hint: For (b) and (c) use the left sides.
4. Let X₁, X₂, … be i.i.d. in R with a continuous distribution. Given n, arrange X₁, …, X_n in order as X_(1) < X_(2) < … < X_(n). Let X_j^B, j = 1, …, n, be i.i.d. (P_n), a bootstrap sample. Thus P(X_j^B = X_(i)) = 1/n for i, j = 1, …, n. Let X^B_(1) := min_{1≤j≤n} X_j^B …

… Decompose the unit cube I^d into m^d cubes C_i of side 1/m, with m^d ≥ 2n. For any P_n let S := {i: P_n(C_i) = 0}. Then card(S) ≥ n. For x^(i), f and f_S as defined after (8.2.9), we then have for each i,

P(f_i) = m^{−α} ∫ f(m(x − x^(i))) dx = m^{−α−d} ∫_{I^d} f(y) dy.

Thus |(P_n − P)(f_S)| = P(f_S) ≥ n m^{−α−d} ∫_{I^d} f(y) dy ≥ c n^{−α/d} for some constant c = c(d, α) > 0, while ‖f_S‖_α ≤ ‖f‖_α ≤ 1 (dividing the original f by some constant depending on d and α if necessary).
12.1.2 Theorem For P = U(I^d), any K > 0, and 0 < α < d − 1 there is a δ = δ(α, K, d) > 0 such that for all n = 1, 2, … and all possible values of P_n,

‖P_n − P‖_{C(α,K,d)} ≥ δ n^{−α/(d−1+α)}.

Remark. Since α/(d − 1 + α) < 1/2 for α < d − 1, the classes C(α, K, d) are then not Donsker classes. For α > d − 1, C(α, K, d) is a Donsker class by Theorem 8.2.1 and Corollary 7.2.13. For α = d − 1, it is not a Donsker class (Theorem 12.4.1).

Proof Again, the construction and notation around (8.2.9) will be applied, now on I^{d−1}. Let c := K λ_{d−1}(f)/‖f‖_α and m := m(n) := [(nc)^{1/(α+d−1)}]. Then 1/n ≤ c m^{1−d−α}. Take n ≥ r (the result holds for n < r) for an r with M := sup_{n≥r} n c m(n)^{1−d−α} < ∞. Let θ_n := m(n)^{α+d−1}/(nc). Then 1/M ≤ θ_n ≤ 1. Either at least half the B_i have
P_n(B_i) = 0, or at least half have P_n(B_i) ≥ 1/n. In either case,

‖P_n − P‖_{C(α,K,d)} ≥ m^{d−1}/(4n) ≥ c m^{−α}/(4M) ≥ δ n^{−α/(α+d−1)}

for some δ(α, K, d) > 0.

The method of the last proof applies to convex sets in R^d, as follows.
12.1.3 Theorem (W. Schmidt) Let d = 2, 3, …. For the collection C_d of closed convex subsets of a bounded nonempty open set U in R^d, there is a constant b := b(d, U) > 0 such that for P = Lebesgue measure normalized on U, and all P_n,

sup{|(P_n − P)(C)|: C ∈ C_d} ≥ b n^{−2/(d+1)}.

Proof We can assume by a Euclidean transformation that U includes the unit ball. Take disjoint spherical caps as at the end of Section 8.4. Let each cap C have volume P(C) = 1/(2n). Then as n → ∞, the angular radius ε_n of such caps is asymptotic to c_d n^{−1/(d+1)} for some constant c_d. Thus the number of such disjoint caps is of the order of n^{(d−1)/(d+1)}. Either (P_n − P)(C) ≥ 1/(2n) for at least half of such caps, or (P_n − P)(C) = −1/(2n) for at least half of them. Thus for some constant η = η(U, d), there exist convex sets D, E, differing by a union of caps, such that

|(P_n − P)(D) − (P_n − P)(E)| ≥ η n^{(d−1)/(d+1)}/n = η n^{−2/(d+1)},

and the result follows with b = η/2. Thus for d ≥ 4, C_d is not a Donsker class for P. If d = 3 it is not either; see problem 3 and Dudley (1982). C₂ is a Donsker class for λ² on I² by Theorem 8.4.1 and Corollary 7.2.13.
12.2 An upper bound

Here, using bracketing numbers N_I as in Section 7.1, is an upper bound for ‖ν_n‖_C := sup_{B∈C} |ν_n(B)| which applies in many cases where the hypotheses of Corollary 7.2.13 fail. Let (X, A, Q) be a probability space, ν_n := n^{1/2}(Q_n − Q), and recall N_I as defined before (7.1.4).
Classes Too Large for Central Limit Theorems
12.2.1 Theorem Let C ⊂ A, 1 ≤ ζ < ∞, η > 2/(ζ + 1) and Θ := (ζ − 1)/(2(ζ + 1)). If for some K < ∞, N_I(ε, C, Q) ≤ exp(K ε^{−ζ}) for 0 < ε ≤ 1, then

lim_{n→∞} Pr{‖ν_n‖_C > n^Θ (log n)^η} = 0.

Remarks. The classes C = C(α, M, d) satisfy the hypothesis of Theorem 12.2.1 for ζ = (d−1)/α > 1, that is, α < d − 1, by the last inequality in Theorem 8.2.1. Then Θ = 1/2 − α/(d−1+α). Thus Theorem 12.1.2 shows that the exponent Θ is sharp for ζ > 1. Conversely, Theorem 12.2.1 shows that the exponent on n in Theorem 12.1.2 cannot be improved. In Theorem 12.2.1 we cannot take ζ < 1, for then Θ < 0, which is impossible even for a single set, C = {C}, with 0 < P(C) < 1.

Proof The chaining method will be used as in Section 7.2. Let log₂ be logarithm to the base 2. For each n ≥ 3 let

(12.2.2)  k(n) := [(1/2 − Θ) log₂ n − η log₂ log n].
Let N(k) := N_I(2^{−k}, C, Q), k = 1, 2, …. Then there exist A_{ki} and B_{ki}, i = 1, …, N(k), such that for any A ∈ C, there is an i ≤ N(k) with A_{ki} ⊂ A ⊂ B_{ki} and Q(B_{ki} \ A_{ki}) ≤ 2^{−k}. Let A₀₁ := ∅ (the empty set) and B₀₁ := X. Choose such i = i(k, A) for k = 0, 1, …. Then for each k ≥ 1,

Q(A_{k,i(k,A)} Δ A_{k−1,i(k−1,A)}) ≤ 2^{−k} + 2^{1−k} < 2^{2−k}.

For k ≥ 1 let B(k) be the collection of sets B, with Q(B) ≤ 2^{2−k}, of the form A_{ki} \ A_{k−1,l} or A_{k−1,l} \ A_{ki} or B_{ki} \ A_{ki}. Then

card(B(k)) ≤ 2N(k−1)N(k) + N(k) ≤ 3 exp(2K·2^{kζ}).

For each B ∈ B(k), Bernstein's inequality (1.3.2) implies, for any t > 0,

(12.2.3)  Pr{|ν_n(B)| > t} ≤ 2 exp(−t²/(2^{3−k} + t n^{−1/2})).

Choose δ > 0 such that δ < 1 and (2 + 2δ)/(1 + ζ) < η. Let c := δ/(1 + δ) and t := t_{n,k} := c n^Θ (log n)^η k^{−1−δ}. Then for each k = 1, …, k(n), 2^{3−k} ≥ t_{n,k} n^{−1/2}. Hence by (12.2.3), Pr{|ν_n(B)| > t_{n,k}} ≤ 2 exp(−t_{n,k}² 2^{k−4}), and

p_{nk} := Pr{sup_{B∈B(k)} |ν_n(B)| > t_{n,k}} ≤ 6 exp(2K·2^{kζ} − 2^{k−4} t_{n,k}²)
        = 6 exp(2K·2^{kζ} − 2^{k−4} c² n^{2Θ} (log n)^{2η} k^{−2−2δ}).
For k ≤ k(n) and n large enough, 2K·2^{kζ} ≤ 2^{k−5} c² n^{2Θ} (log n)^{2η} k^{−2−2δ}, so

Σ_{k=1}^{k(n)} p_{nk} ≤ 6 (log n) exp(−2^{−5} c² (log n)^{η(1+ζ)−2−2δ}) → 0

as n → ∞.
Let E_n := {ω: sup_{B∈B(r)} |ν_n(B)| ≤ t_{n,r}, r = 1, …, k(n)}.

Then lim_{n→∞} Pr(E_n) = 1. For any A ∈ C and n, let k := k(n) and i := i(k, A). Then for each ω ∈ E_n, |ν_n(A_{ki})| is bounded above by

Σ_{r=1}^k |ν_n(A_{r,i(r,A)} \ A_{r−1,i(r−1,A)})| + |ν_n(A_{r−1,i(r−1,A)} \ A_{r,i(r,A)})|
  ≤ 2 Σ_{r=1}^k t_{n,r} ≤ 2 n^Θ (log n)^η Σ_{r≥1} c r^{−1−δ} = 2 n^Θ (log n)^η.

Also |ν_n(B_{ki} \ A_{ki})| ≤ n^Θ (log n)^η, and by (12.2.2),

n^{1/2} Q(B_{ki} \ A_{ki}) ≤ n^{1/2}/2^k ≤ 2 n^Θ (log n)^η.

Hence n^{1/2} Q_n(B_{ki} \ A_{ki}) ≤ 3 n^Θ (log n)^η, and |ν_n(A \ A_{ki})| ≤ 3 n^Θ (log n)^η. So on E_n, |ν_n(A)| ≤ 5 n^Θ (log n)^η. As η ↓ 2/(ζ + 1) the factor of 5 can be dropped. Since A ∈ C is arbitrary, the proof is done.
12.3 Poissonization and random sets

Section 12.4 will give some lower bounds ‖ν_n‖_C ≥ f(n) with probability converging to 1 as n → ∞, where f is a product of powers of logarithms or iterated logarithms. Such an f has the following property. A real-valued function f defined for large enough x > 0 is called slowly varying (in the sense of Karamata) if for every c > 0, f(cx)/f(x) → 1 as x → +∞.
12.3.1 Lemma If f is continuous and slowly varying then for every ε > 0 there is a δ = δ(ε) > 0 such that whenever x > 1/δ and |r − 1| < δ we have |f(rx)/f(x) − 1| < ε.

Proof For any c > 0 and ε > 0 there is an x(c, ε) such that for x ≥ x(c, ε), |f(cx)/f(x) − 1| < ε/4. Note that if x(c, ε) ≤ n, even for one c, then f(x) ≠ 0 for all x ≥ n. By the category theorem (e.g., RAP, Theorem 2.5.2), for fixed ε > 0 there is an n < ∞ such that x(c, ε) ≤ n for all c in a set dense in some interval [a, b], where 0 < a < b, and thus by continuity for all c in [a, b]. Then for c, d ∈ [a, b] and x ≥ n, |(f(cx) − f(dx))/f(x)| ≤ ε/2. Fix c := (a + b)/2. Let u := cx. Then for u ≥ nc, we have

|f(ud/c)/f(u) − 1| ≤ (ε/2) f(u/c)/f(u).

As u → +∞, f(u)/f(u/c) → 1. Then there is a δ > 0 with δ < (b − a)/(b + a) such that for u > 1/δ and all r with |r − 1| < δ we have |f(ru)/f(u) − 1| < ε.
Recall the Poisson law P_c on N with parameter c > 0, so that P_c(k) := e^{−c} c^k/k! for k = 0, 1, …. Given a probability space (X, A, P), let U_c be a Poisson point process on (X, A) with intensity measure cP. That is, for any disjoint A₁, …, A_m in A, the U_c(A_j) are independent random variables, j = 1, …, m, and for any A ∈ A, U_c(A) has law P_{cP(A)}. Let Y_c(A) := (U_c − cP)(A), A ∈ A. Then Y_c has mean 0 on all A and still has independent values on disjoint sets. Let x(1), x(2), … be coordinates for the product space (X^∞, A^∞, P^∞). For c > 0 let n(c) be a random variable with law P_c, independent of the x(i). Then for P_n := n^{−1}(δ_{x(1)} + … + δ_{x(n)}), n ≥ 1, P₀ := 0, we have:

12.3.2 Lemma The process Z_c := n(c) P_{n(c)} is a Poisson process with intensity measure cP.

Proof We have laws L(U_c(X)) = L(Z_c(X)) = P_c. If X_i are independent Poisson variables with L(X_i) = P_{c(i)} and Σ_{i≤m} c(i) = c, then given n := Σ_{i≤m} X_i, the conditional distribution of {X_i}_{i=1}^m is multinomial with total n and probabilities p_i = c(i)/c. Thus U_c and Z_c have the same conditional distributions on disjoint sets given their values on X. This implies the lemma.
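The construction in Lemma 12.3.2 can be sketched directly (illustrative names, not from the text; P is taken uniform on [0, 1) and the cells are intervals between sorted breakpoints):

```python
import math
import random

def poisson_variate(c, rng):
    """Sample N ~ Poisson(c) by inversion of the c.d.f."""
    n, p = 0, math.exp(-c)
    acc, u = p, rng.random()
    while u > acc:
        n += 1
        p *= c / n
        acc += p
    return n

def poissonized_counts(breaks, c, reps, seed=0):
    """Z_c := n(c) P_{n(c)} with n(c) ~ Poisson(c) independent of the points:
    returns, per repetition, the cell counts Z_c(A_j) for the disjoint cells
    A_j = [breaks[j], breaks[j+1]).  By Lemma 12.3.2 these counts are
    independent Poisson(c * P(A_j)) variables."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        pts = [rng.random() for _ in range(poisson_variate(c, rng))]
        out.append([sum(1 for x in pts if a <= x < b)
                    for a, b in zip(breaks, breaks[1:])])
    return out
```

Averaging the counts in a cell over many repetitions should give approximately c times the cell's P-measure.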
From here on, the version U_c = Z_c will be used. Thus for each ω, U_c(·)(ω) is a countably additive integer-valued measure of total mass U_c(X)(ω) = n(c)(ω). Then

(12.3.3)  Y_c = n(c) P_{n(c)} − cP = n(c)(P_{n(c)} − P) + (n(c) − c)P,
          Y_c/c^{1/2} = (n(c)/c)^{1/2} ν_{n(c)} + (n(c) − c) c^{−1/2} P.

The following shows that the empirical process ν_n is asymptotically "as large" as a corresponding Poisson process.
12.3.4 Lemma Let (X, A, P) be a probability space and C ⊂ A. Assume that for each n and constant t, sup_{A∈C} |(P_n − tP)(A)| is measurable. Let f be a continuous, slowly varying function such that as x → +∞, f(x) → +∞. For b > 0 let

g(b) := liminf_{x→+∞} Pr{sup_{A∈C} |Y_x(A)| > b f(x) x^{1/2}}.

Then for any a < b,

liminf_{n→∞} Pr{sup_{A∈C} |ν_n(A)| > a f(n)} ≥ g(b).

Proof It follows from Lemma 12.3.1 that f(x)/x → 0 as x → +∞. From the measurability assumption and (12.3.3), sup_{A∈C} |Y_x(A)| is measurable. As x → +∞, Pr(n(x) > 0) → 1 and n(x)/x → 1 in probability. If the lemma is false there is a θ < g(b) and a sequence m_k → +∞ with, for each m = m_k,

Pr{sup_{A∈C} |ν_m(A)| > a f(m)} ≤ θ.
Choose 0 < ε < 1/3 such that a(1 + 7ε) < b. Then let 0 < δ ≤ 1/2 be such that δ < δ(ε) in Lemma 12.3.1 and (1 + δ)(1 + 5ε) < 1 + 6ε. Let δ_m := (f(m)/m)^{1/2}. We may assume that for all k, m = m_k > 2/δ, 2δ_m < δ, and (1 + 2ε) f(m)^{−1/2} < aε. If n ≤ m then m P_m(A) ≥ n P_n(A) for all A ∈ A, so m(P_m − P)(A) ≥ n(P_n − P)(A) − mδ_m. Conversely n P_n(A) ≥ m P_m(A) − mδ_m and n(P_n − P)(A) ≥ m(P_m − P)(A) − mδ_m. Thus

|ν_m(A)| ≥ (n/m)^{1/2} |ν_n(A)| − f(m)^{1/2} ≥ (1 + δ)^{−1} |ν_n(A)| − f(m)^{1/2},

so |ν_n(A)| ≤ (1 + δ)(|ν_m(A)| + f(m)^{1/2}). Next, |1 − n/m| ≤ 2δ_m < δ implies |f(m)/f(n) − 1| < ε, so 1/f(n) ≤ (1 + ε)/f(m). Hence

|ν_n(A)|/f(n) ≤ (1 + 2ε)(|ν_m(A)|/f(m) + f(m)^{−1/2}).

Thus, since (1 + 2ε) f(m)^{−1/2} < aε,

Pr{sup_{A∈C} |ν_n(A)| > a f(n)(1 + 3ε)} ≤ θ.
For each m = m_k, set c = c_m := (1 − δ_m)m. Then as k → +∞, since f(m_k) → ∞, by Chebyshev's inequality and since P_c has variance c, (1 − 2δ_m)m ≤ n(c) ≤ m with probability tending to 1, so for some γ < g := g(b) and all large k,

Pr{sup_{A∈C} |ν_{n(c)}(A)| > a f(c)(1 + 5ε)} < γ.

For k large enough, applying Chebyshev's inequality to n(c) − c, we have, since c → ∞,

Pr{(n(c)/c)^{1/2} > (1 + 6ε)/(1 + 5ε)} < (g − γ)/4 and Pr{|n(c) − c| c^{−1/2} > a f(c) ε} < (g − γ)/4.

Thus by (12.3.3),

Pr{sup_{A∈C} |Y_c(A)| > a c^{1/2} f(c)(1 + 7ε)} < (γ + g)/2 < g,
a contradiction, proving Lemma 12.3.4.

Next, the Poisson process's independence property on disjoint sets will be extended to suitable random sets. Let (X, A) be a measurable space and (Ω, B, Pr) a probability space. A collection {B_A: A ∈ A} of sub-σ-algebras of B will be called a filtration if B_A ⊂ B_B whenever A ⊂ B in A. A stochastic process Y indexed by A, (A, ω) ↦ Y(A)(ω), will be called adapted to {B_A: A ∈ A} if for every A ∈ A, Y(A)(·) is B_A measurable. Then the process and filtration will be written {Y(A), B_A}_{A∈A}. A stochastic process Y: (A, ω) ↦ Y(A)(ω), A ∈ A, ω ∈ Ω, will be said to have independent pieces if for any disjoint A₁, …, A_m ∈ A, the Y(A_j) are independent, j = 1, …, m, and Y(A₁ ∪ A₂) = Y(A₁) + Y(A₂) almost surely. Clearly each Y_c has independent pieces. If in addition the process is adapted to a filtration {B_A: A ∈ A}, the process {Y(A), B_A}_{A∈A} will be said to have independent pieces if for any disjoint sets A₁, …, A_n in A, the random variables Y(A₂), …, Y(A_n) and any random variable measurable for the σ-algebra B_{A₁} are jointly independent. For example, for any C ∈ A let B_C be the smallest σ-algebra for which Y(A) is measurable for every A ⊂ C, A ∈ A. This is clearly a filtration, and the smallest filtration to which Y is adapted.

A function G from Ω into A will be called a stopping set for a filtration {B_A: A ∈ A} if for all C ∈ A, {ω: G(ω) ⊂ C} ∈ B_C. Given a stopping set G(·), let B_G be the σ-algebra of all sets B ∈ B such that for every C ∈ A, B ∩ {G ⊂ C} ∈ B_C. (Note that if G is not a stopping set, then Ω ∉ B_G, so B_G would not be a σ-algebra.) If G(ω) ≡ H ∈ A then it is easy to check that G is a stopping set and B_G = B_H.
a stopping set and BG = 13x 12.3.5 Lemma Suppose (Y (A), BA}AEA has independent pieces and for all
WES2,G(w)EA,A(W)EA,and E(w)EA. Assume that:
(i) G(·) is a stopping set;
(ii) for all ω, G(ω) is disjoint from A(ω) and from E(ω);
(iii) G(ω), A(ω), and E(ω) each have just countably many possible values G(j) := G_j ∈ A, C(i) := C_i ∈ A, and D(j) := D_j ∈ A, respectively;
(iv) for all i, j, {A(·) = C_i} ∈ B_G and {E(·) = D_j} ∈ B_G.
Then the conditional probability law (joint distribution) of Y(A) and Y(E) given B_G satisfies

L{(Y(A), Y(E)) | B_G} = Σ_{i,j} 1_{{A(·)=C(i), E(·)=D(j)}} L(Y(C_i), Y(D_j)),

where L(Y(C_i), Y(D_j)) is the unconditional joint distribution of Y(C_i) and Y(D_j). If this unconditional distribution is the same for all i, j, then (Y(A), Y(E)) is independent of B_G.

Proof The proof will be given when there is only one random set A(·) rather than two, A(·) and E(·). The proof for two is essentially the same. If for some ω, A(ω) = C_i and G(ω) = G_j, then by (ii), C_i ∩ G_j = ∅ (the empty set). Thus Y(C_i) is independent of B_{G(j)}. Let B_i := B(i) := {A(·) = C_i} ∈ B_G by
(iv). For each j,

{G = G_j} = {G ⊂ G_j} \ ⋃{{G ⊂ G_i} : G_i ⊂ G_j, G_i ≠ G_j},

so by (i),

{G = G_j} ∈ B_{G(j)}.   (12.3.6)

Let H_j := {G = G_j}. For any D ∈ A, H_j ∩ {G ⊂ D} = ∅ ∈ B_D if G_j ⊄ D. If G_j ⊂ D, then H_j ∩ {G ⊂ D} = H_j ∈ B_{G(j)} ⊂ B_D by (12.3.6). Thus

H(j) := H_j ∈ B_G.   (12.3.7)

For any B ∈ B_G, by (12.3.6),

B ∩ H_j = B ∩ {G ⊂ G_j} ∩ H_j ∈ B_{G(j)}.   (12.3.8)

We have for any real t almost surely, since B_i ∈ B_G,

Pr(Y(A) ≤ t | B_G) = Σ_{i,j} Pr(Y(A) ≤ t | B_G) 1_{B(i)} 1_{H(j)} = Σ_{i,j} Pr(Y(C_i) ≤ t | B_G) 1_{B(i)} 1_{H(j)}.

The sums can be restricted to i, j such that B_i ∩ H_j ≠ ∅ and therefore C_i ∩ G_j = ∅. Now Pr(Y(C_i) ≤ t | B_G) is a B_G-measurable function such that for any Γ ∈ B_G,

Pr({Y(C_i) ≤ t} ∩ Γ ∩ H(j)) …
For α > d − 1, C(α, K, d) is a Donsker class by Theorem 8.2.1 and Corollary 7.2.13, so ‖ν_n‖_{C(α,K,d)} is bounded in probability. Thus α = d − 1 is a borderline case. Other such cases are given by the class LL_2 of lower layers in R² (Section 8.3) and the class C_3 of convex sets in R³ (Section 8.4), for λ_d = Lebesgue measure on the unit cube I^d, where I := [0, 1]. Any lower layer A has a closure Ā which is also a lower layer, with λ_d(Ā \ A) = 0, where in the present case d = 2. It is easily seen that suprema of our processes over all lower layers are equal to suprema over closed lower layers, so it will be enough to consider closed lower layers. Let LL_2 be the class of all closed lower layers in R².
Let P = λ_d and c > 0. Recall the centered Poisson process Y_c from Section 12.3. Let N_c := U_c − V_c where U_c and V_c are independent Poisson processes, each with intensity measure cP. Equivalently, we could take U_c and V_c to be centered. The following lower bound holds for all the above borderline cases:
12.4.1 Theorem For any K > 0 and δ > 0 there is a γ = γ(d, K, δ) > 0 such that

lim_{x→+∞} Pr{‖Y_x‖_C > γ(x log x)^{1/2}(log log x)^{−δ−1/2}} = 1

and

lim_{n→∞} Pr{‖ν_n‖_C > γ(log n)^{1/2}(log log n)^{−δ−1/2}} = 1
where C = C(d − 1, K, d), d ≥ 2, or C = LL_2, or C = C_3. For a proof see the next section, except that here I will give a larger lower bound with probability close to 1, of order (log n)^{3/4} in the lower layer case (C = LL_2, d = 2). Shor (1986) first showed that E‖Y_x‖_C ≥ γx^{1/2}(log x)^{3/4} for some γ > 0 and x large enough. Shor's lower bound also applies to C(1, K, 2) by a 45° rotation as in Section 8.3. For an upper bound with a 3/4 power of the log, also for convex subsets of a fixed bounded set in R³, see Talagrand (1994, Theorem 1.6).

To see that the supremum of N_c, Y_c, or an empirical process ν_n over LL_2 is measurable, note first, for P_n, that for each F ⊂ {1, …, n} and each ω there is a smallest closed lower layer L_F(ω) containing the x_j for j ∈ F, with L_F(ω) := ∅ for F = ∅. For any c > 0, ω ↦ (P_n − cP)(L_F(ω))(ω) is measurable. The supremum of P_n − cP over LL_2, as the maximum of these 2^n measurable functions, is measurable. Letting n = n(c) as in Lemma 12.3.2 and (12.3.3) then shows that sup{Y_c(A) : A ∈ LL_2} is measurable. Likewise, there is a largest open lower layer not containing x_j for any j ∈ F, so sup{|Y_c(A)| : A ∈ LL_2} and sup{|ν_n(A)| : A ∈ LL_2} are measurable. For N_c, taking noncentered Poisson processes U_c and V_c, their numbers of points m(ω) and n(ω) are measurable, as are the m-tuple and n-tuple of points occurring in each. For each i = 0, 1, …, m and k = 0, 1, …, n, it is a measurable event that there exists a lower layer containing exactly i of the m points and k of the n, and so the supremum of N_c over all lower layers, as a measurable function of the indicators of these finitely many events, is measurable.
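The reduction to a maximum over 2^n layers can be carried out literally for a small sample. The following Python sketch is my own illustration, not from the book: the function names, the tiny sample size, and the choice P = Lebesgue measure on I² are assumptions. It computes sup{(P_n − P)(A) : A ∈ LL_2} by enumerating the smallest closed lower layers L_F over subsets F of the sample, using the fact that L_F is a finite union of closed lower-left quadrants whose area is a staircase sum.

```python
import itertools
import random

def staircase_area(corners):
    """Area of the union of rectangles [0, a] x [0, b] over (a, b) in corners,
    i.e. the Lebesgue measure of the smallest closed lower layer containing them."""
    if not corners:
        return 0.0
    kept, best_b = [], -1.0
    for a, b in sorted(corners, reverse=True):   # a descending: drop dominated corners
        if b > best_b:
            kept.append((a, b))
            best_b = b
    kept.reverse()                               # now a ascending, b descending
    area, prev_a = 0.0, 0.0
    for a, b in kept:                            # staircase: strips of height b
        area += (a - prev_a) * b
        prev_a = a
    return area

def sup_lower_layers(sample):
    """max over subsets F of (P_n - P)(L_F), with L_F the smallest closed
    lower layer containing the points indexed by F (empty layer for F = {})."""
    n, best = len(sample), 0.0
    for F in itertools.chain.from_iterable(
            itertools.combinations(range(n), k) for k in range(n + 1)):
        corners = [sample[i] for i in F]
        area = staircase_area(corners)
        inside = sum(1 for (x, y) in sample
                     if any(x <= a and y <= b for a, b in corners))
        best = max(best, inside / n - area)
    return best

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(8)]
print(sup_lower_layers(pts))   # sup over all 2^8 candidate layers
```

Since every subset F is tried, the value returned is exactly the supremum over all closed lower layers, in line with the measurability argument above.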
12.4 Lower bounds in borderline cases
12.4.2 Theorem For every ε > 0 there is a δ > 0 such that for the uniform distribution P on the unit square I², and n large enough,

Pr{sup{|ν_n(A)| : A ∈ LL_2} > δ(log n)^{3/4}} > 1 − ε,

and the same holds for C(1, 2, 2) in place of LL_2. Also, ν_n can be replaced by N_c/c^{1/2} or Y_c/c^{1/2} if log n is replaced by log c, for c large enough.
Remark. The order (log n)^{3/4} of the lower bound is optimal, as there is an upper bound in expectation of the same order, not proved here; see Rhee and Talagrand (1988), Leighton and Shor (1989), and Coffman and Shor (1991).
Proof Let M_1 be the set of all functions f on [0, 1] with f(0) = f(1) = 1/2 and ‖f‖_L ≤ 1, that is, |f(u) − f(x)| ≤ |x − u| for 0 ≤ x ≤ u ≤ 1. Then 0 ≤ f(x) ≤ 1 for 0 ≤ x ≤ 1. For any f : [0, 1] → [0, ∞) let S_f be the subgraph of f, S_f := S(f) := {(x, y) : 0 ≤ x ≤ 1, 0 ≤ y ≤ f(x)}. Then for each f ∈ M_1 we have S_f ∈ C(1, 2, 2). Let S_1 := {S_f : f ∈ M_1}. Let R be the counterclockwise rotation of R² by 45°, R = 2^{−1/2}(1 −1; 1 1) acting on column vectors. Then for each f ∈ M_1, R^{−1}(S_f) = M ∩ R^{−1}(I²) where M is a lower layer, I² := I × I, and I := [0, 1]. So, it will be enough to prove the theorem for S_1 ⊂ C(1, 2, 2).
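The rotation claim can be checked numerically. The sketch below is my own illustration under assumptions of my choosing: f(x) = 1/2 + 0.9·min(x, 1 − x) as a sample member of M_1, and random sampling of test points. It verifies the lower-layer property of R^{−1}(S_f): any point componentwise below a member of the set, and still inside R^{−1}(I²), stays in the set.

```python
import math
import random

SQ2 = math.sqrt(2.0)

def f(x):
    """A sample member of M_1: f(0) = f(1) = 1/2, Lipschitz constant 0.9 <= 1."""
    return 0.5 + 0.9 * min(x, 1.0 - x)

def R(p):
    """Counterclockwise rotation of R^2 by 45 degrees."""
    x, y = p
    return ((x - y) / SQ2, (x + y) / SQ2)

def in_subgraph(p):            # p in S_f
    x, y = p
    return 0.0 <= x <= 1.0 and 0.0 <= y <= f(x)

def in_rotated_square(p):      # p in R^{-1}(I^2), i.e. R(p) in I^2
    x, y = R(p)
    return 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0

random.seed(2)
violations = 0
for _ in range(20000):
    q = (random.uniform(-0.2, 1.6), random.uniform(-0.8, 0.8))
    if not in_subgraph(R(q)):  # need q in R^{-1}(S_f)
        continue
    p = (q[0] - random.uniform(0.0, 0.5), q[1] - random.uniform(0.0, 0.5))
    if in_rotated_square(p) and not in_subgraph(R(p)):
        violations += 1        # would contradict the lower-layer claim
print(violations)              # 0
```

No violations occur, consistent with R^{−1}(S_f) being the trace of a lower layer on R^{−1}(I²).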
Let [x] denote the largest integer ≤ x and let 0 < δ < 1/3. For each c large enough so that δ²(log c) > 1, and each ω, functions f_0 ≤ f_1 ≤ … ≤ f_L ≤ g_L ≤ … ≤ g_1 ≤ g_0 will be defined for L := [δ²(log c)], with f_j := f_j(ω, t) for ω ∈ Ω and 0 ≤ t ≤ 1.

For any ε > 0 there is a constant C = C_ε such that |u| ≤ C_ε(e^{εu} + e^{−εu}) for all real u. If |h| ≤ δ := ε/2 then, since e^{uh} ≤ e^{δu} + e^{−δu}, and e^{cu} + e^{−cu} is increasing in u ≥ 0 for any real c, we have |(e^{uh} − 1)/h| ≤ 3C_δ(e^{εu} + e^{−εu}). Letting u = ψ(x), we have domination by an integrable function, so Corollary A.10 applies for the first derivative. The proof extends to higher derivatives since |ψ(x)|^n ≤ K e^{δ|ψ(x)|} for some K = K_{n,δ}.
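The general pattern — exponential domination justifying differentiation under the integral sign — can be illustrated on a concrete moment generating function. The example below is mine, not the text's: P = N(0, 1), so M(θ) = e^{θ²/2} in closed form. It compares differentiating inside the integral, a finite difference of the integral, and the exact derivative.

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def integral(h, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule approximation of the integral of h over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(h(lo + (k + 0.5) * step) for k in range(n))

theta = 0.7
M = lambda t: integral(lambda u: math.exp(t * u) * phi(u))       # M(t) = E e^{tZ}
inside = integral(lambda u: u * math.exp(theta * u) * phi(u))    # derivative taken under the integral
outside = (M(theta + 1e-5) - M(theta - 1e-5)) / 2e-5             # derivative of the integral
exact = theta * math.exp(theta * theta / 2.0)                    # M'(t) = t e^{t^2/2}
print(inside, outside, exact)
```

The three values agree to several decimal places, as the domination argument predicts they must.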
Notes

Perhaps the most classical result on interchange of integral and derivative is attributed to Leibniz: if f and ∂f/∂y are continuous on a finite rectangle [a, b] × [c, d], then d/dy ∫_a^b f(x, y) dx = ∫_a^b (∂f/∂y)(x, y) dx for c ≤ y ≤ d.

By convexity of Φ, for any 0 ≤ λ ≤ 1 and measurable f, h ≥ 0,

∫ Φ(λf + (1 − λ)h) dμ ≤ λ ∫ Φ(f) dμ + (1 − λ) ∫ Φ(h) dμ,   (H.6)

where some or all three integrals may be infinite. Applying this for c = 1 and h = 0 gives

∫ Φ(λf) dμ ≤ λ ∫ Φ(f) dμ.   (H.7)
Clearly, if the inequality in (H.1) holds for some c > 0 it also holds for any larger c. It follows that λf + (1 − λ)h ∈ 𝓛_Φ(X, S, μ). For the triangle inequality, let c := ‖f‖_Φ and d := ‖h‖_Φ. If c = 0 then f = 0 a.e. (μ), so ‖f + h‖_Φ = ‖h‖_Φ ≤ c + d, and likewise if d = 0. If c > 0 and d > 0, let λ := c/(c + d) in (H.6). Applying Proposition H.2 to both terms on the right in (H.6) gives ∫ Φ((f + h)/(c + d)) dμ ≤ 1, and so ‖f + h‖_Φ ≤ c + d. So the triangle inequality holds and ‖·‖_Φ is a seminorm on 𝓛_Φ(X, S, μ). Clearly it becomes a norm on L_Φ(X, S, μ). To see that the latter space is complete for ‖·‖_Φ, let {f_k} be a Cauchy sequence. By Lemma H.4, take δ_j := δ for ε := ε_j := 1/2^j. Take a subsequence f_{k(j)} with ‖f_i − f_{k(j)}‖_Φ < δ_j for any i ≥ k(j) and j = 1, 2, …. Then f_{k(j)} converges μ-almost everywhere, by the proof of the Borel–Cantelli lemma, to some f. Then ‖f_{k(j)} − f‖_Φ → 0 as j → ∞ by Fatou's lemma applied to the functions Φ(|f_{k(j)} − f_{k(r)}|/c) as j ≤ r → ∞ for c > δ_j. It follows that ‖f_i − f‖_Φ → 0 as i → ∞, completing the proof.
Let Φ be a Young–Orlicz modulus. Then it has one-sided derivatives as follows (RAP, Corollary 6.3.3):

φ(x) := Φ′(x+) := lim_{y↓x} (Φ(y) − Φ(x))/(y − x)

exists for all x ≥ 0, and Φ′(x−) := lim_{y↑x} (Φ(x) − Φ(y))/(x − y) exists for all x > 0. As the notation suggests, for each x > 0, Φ′(x−) = lim_{y↑x} φ(y), and φ is a nondecreasing function on [0, ∞). Thus Φ′(x−) = φ(x) except for at most countably many values of x, where φ may have jumps with φ(x) > Φ′(x−). On any bounded interval, where φ is bounded, Φ is Lipschitz and so absolutely continuous. Thus since Φ(0) = 0 we have Φ(x) = ∫₀^x φ(u) du for any x ≥ 0 (e.g., Rudin, 1974, Theorem 8.18). For any x > 0, φ(x) > 0 since Φ is strictly increasing.
If φ is unbounded, for 0 ≤ y < ∞ let ψ(y) := inf{x ≥ 0 : φ(x) > y}. Then ψ(0) = 0 and ψ is nondecreasing. Let Ψ(y) := ∫₀^y ψ(t) dt. Then Ψ is convex and Ψ′ = ψ except on the at most countable set where ψ has jumps. Thus for each y > 0 we have Ψ(y) > 0 and Ψ is also strictly increasing. For any nondecreasing function f from [0, ∞) into itself, with f←(u) := inf{t ≥ 0 : f(t) > u}, it is easily seen that for any x ≥ 0 and u ≥ 0, f←(u) ≥ x if and only if f(t) ≤ u for all t < x. It follows that (f←)←(x) = f(x) for all x ≥ 0. Since a change in φ or ψ on a countable set (of its jumps) doesn't change its indefinite integral Φ or Ψ respectively, the relation between Φ and Ψ is symmetric.

A Young–Orlicz modulus Φ such that φ is unbounded and φ(x) ↓ 0 as x ↓ 0 will be called an Orlicz modulus. Then ψ is also unbounded and ψ(y) > 0 for all y > 0, so Ψ is also an Orlicz modulus. In that case, Φ and Ψ will be called dual Orlicz moduli. For such moduli we have a basic inequality due to W. H. Young:
H.8 Theorem (W. H. Young) Let Φ, Ψ be any two dual Orlicz moduli from [0, ∞) onto itself. Then for any x, y ≥ 0 we have

xy ≤ Φ(x) + Ψ(y),

with equality if x > 0 and Φ′(x−) ≤ y ≤ φ(x).
Proof If x = 0 or y = 0 there is no problem. Let x > 0 and y > 0. Then Φ(x) is the area of the region A : 0 ≤ u ≤ x, 0 ≤ v ≤ φ(u) in the (u, v) plane. Likewise, Ψ(y) is the area of the region B : 0 ≤ v ≤ y, 0 ≤ u ≤ ψ(v). By monotonicity and right-continuity of φ, u ≥ ψ(v) is equivalent to φ(u) ≥ v, so u < ψ(v) is equivalent to φ(u) < v, so A ∩ B = ∅. The rectangle R_{x,y} : 0 ≤ u ≤ x, 0 ≤ v ≤ y is included in A ∪ B ∪ C, where C has zero area, and if Φ′(x−) ≤ y ≤ φ(x), then R_{x,y} = A ∪ B up to a set of zero area, so the conclusions hold.
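Young's inequality is easy to check numerically for the classical dual pair Φ(x) = x^p/p, Ψ(y) = y^q/q with 1/p + 1/q = 1: here φ(x) = x^{p−1} and ψ(y) = y^{q−1} is its generalized inverse, so these really are dual Orlicz moduli. The sketch below is my illustration; the exponent p = 3 and the grid are arbitrary choices.

```python
p = 3.0
q = p / (p - 1.0)                   # conjugate exponent: 1/p + 1/q = 1, here q = 1.5
Phi = lambda x: x ** p / p          # phi(x) = x**(p - 1)
Psi = lambda y: y ** q / q          # psi(y) = y**(q - 1), the inverse of phi

# xy <= Phi(x) + Psi(y) on a grid of x, y values
worst = max(i / 10.0 * (j / 10.0) - Phi(i / 10.0) - Psi(j / 10.0)
            for i in range(1, 41) for j in range(1, 41))
print(worst)                        # <= 0 up to rounding

x = 1.7
y = x ** (p - 1.0)                  # the equality case y = phi(x)
gap = Phi(x) + Psi(y) - x * y
print(gap)                          # ~ 0
```

The gap vanishes exactly when y = φ(x), matching the equality condition in Theorem H.8.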
One of the main uses of Young's inequality (Theorem H.8) is to prove an extension of the Rogers–Hölder inequality to Young–Orlicz spaces:

H.9 Theorem Let Φ and Ψ be dual Orlicz moduli, and for a measure space (X, S, μ) let f ∈ 𝓛_Φ(X, S, μ) and g ∈ 𝓛_Ψ(X, S, μ). Then fg ∈ 𝓛₁(X, S, μ) and ∫ |fg| dμ ≤ 2‖f‖_Φ‖g‖_Ψ.

Proof By homogeneity we can assume ‖f‖_Φ = ‖g‖_Ψ = 1. Then applying Proposition H.2 with c = 1 and Theorem H.8 we get ∫ |fg| dμ ≤ 2 and the conclusion follows.
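The norm ‖·‖_Φ of (H.1) can be computed by bisection, since c ↦ ∫ Φ(|f|/c) dμ is nonincreasing in c. The sketch below is a discretized illustration with moduli and test functions of my own choosing (the x^p/p pair again, f(x) = x and g(x) = 1 − x on [0, 1]); it then checks the bound of Theorem H.9.

```python
def lux_norm(vals, Phi, w):
    """Luxemburg-type norm inf{c > 0 : sum_i w_i Phi(|v_i|/c) <= 1},
    found by bisection on the nonincreasing map c -> J(c)."""
    J = lambda c: sum(wi * Phi(abs(v) / c) for v, wi in zip(vals, w))
    hi = 1.0
    while J(hi) > 1.0:          # grow until feasible
        hi *= 2.0
    lo = 0.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if J(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

n = 4000
xs = [(k + 0.5) / n for k in range(n)]   # midpoint grid on [0, 1]
w = [1.0 / n] * n                        # weights for Lebesgue measure

p, q = 3.0, 1.5
Phi = lambda t: t ** p / p
Psi = lambda t: t ** q / q               # dual Orlicz moduli, as in the Young pair

fv = xs                                  # f(x) = x
gv = [1.0 - x for x in xs]               # g(x) = 1 - x

nf, ng = lux_norm(fv, Phi, w), lux_norm(gv, Psi, w)
lhs = sum(wi * abs(a * b) for a, b, wi in zip(fv, gv, w))
print(lhs, 2.0 * nf * ng)                # integral |fg| dmu  vs  2 ||f||_Phi ||g||_Psi
```

For this pair the norm has a closed form (here ‖f‖_Φ = (1/12)^{1/3}), which the bisection reproduces, and the Rogers–Hölder bound holds with room to spare.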
Notes
W. H. Young (1912) proved his inequality (Theorem H.8) for smooth functions Φ. Birnbaum and Orlicz (1931) apparently began the theory of "Orlicz spaces," and W. A. J. Luxemburg defined the norms ‖·‖_Φ; see Luxemburg and Zaanen (1956). Krasnosel'skii and Rutitskii (1961) wrote a book on the topic.
References
Birnbaum, Z. W., and Orlicz, W. (1931). Über die Verallgemeinerung des Begriffes der zueinander konjugierten Potenzen. Studia Math. 3, 1–67.
Krasnosel'skii, M. A., and Rutitskii, Ia. B. (1961). Convex Functions and Orlicz Spaces. Transl. by L. F. Boron. Noordhoff, Groningen.
Luxemburg, W. A. J., and Zaanen, A. C. (1956). Conjugate spaces of Orlicz spaces. Akad. Wetensch. Amsterdam Proc. Ser. A 59 (= Indag. Math. 18), 217–228.
Rudin, Walter (1974). Real and Complex Analysis, 2d ed. McGraw-Hill, New York.
Young, W. H. (1912). On classes of summable functions and their Fourier series. Proc. Roy. Soc. London Ser. A 87, 225–229.
Appendix I Modifications and Versions of Isonormal Processes
Let T be any set and (Ω, A, P) a probability space. Recall that a real-valued stochastic process indexed by T is a function (t, ω) ↦ X_t(ω) from T × Ω into R such that for each t ∈ T, X_t(·) is measurable from Ω into R. A modification of the process is another stochastic process Y_t defined for the same T and Ω such that for each t, we have P(X_t = Y_t) = 1. A version of the process X_t is a process Z_t, t ∈ T, for the same T but defined on a possibly different probability space (Ω₁, B, Q) such that X_t and Z_t have the same laws, that is, for each finite subset F of T, L({X_t}_{t∈F}) = L({Z_t}_{t∈F}). Clearly, any modification of a process is also a version of the process, but a version, even if on the same probability space, may not be a modification. For example, for an isonormal process L on a Hilbert space H, the process M(x) := −L(x) is a version, but not a modification, of L. One may take a version or modification of a process in order to get better properties such as continuity. It turns out that for the isonormal process on subsets of Hilbert space, what can be done with a version can also be done by a modification, as follows.

I.1 Theorem Let L be an isonormal process restricted to a subset C of Hilbert space. For each of the following two properties, if there exists a version M of L with the property, there also is a modification N with the property. For each ω, x ↦ M(x)(ω) for x ∈ C is:

(a) bounded;
(b) uniformly continuous.

Also, if there is a version with (a) and another with (b), then there is a modification N(·) having both properties.
Proof Let A be a countable dense subset of C. For each x ∈ C, take x_n ∈ A with ‖x_n − x‖ < 1/n² for all n. Then L(x_n) → L(x) almost surely. Thus if we define

N(x)(ω) := lim sup_{n→∞} L(x_n)(ω),

or 0 on the set of probability 0 where the lim sup is infinite, then N is a modification of L. If (a) holds for M it will also hold for L on A and so for N, and likewise for (b), since a uniformly continuous function L(·)(ω) on A has a unique uniformly continuous extension to C, given by N. Since N(·) is the same in both cases, the last conclusion follows.
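The example M = −L above can be made concrete in finite dimensions. In this sketch (my illustration: three points of R² standing in for a subset C of Hilbert space), the covariance ⟨x_i, x_j⟩, which determines the law of the centered Gaussian vector (L(x))_{x}, is unchanged by the sign flip, so −L is a version of L; yet L(x) = −L(x) only on the null event {L(x) = 0}, so −L is not a modification.

```python
import random

xs = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0)]     # three points of H = R^2
dot = lambda u, v: u[0] * v[0] + u[1] * v[1]

# Law of the centered Gaussian vector (L(x))_{x in xs} is determined by
# Cov(L(u), L(v)) = <u, v>; for M := -L, Cov(M(u), M(v)) = (-1)(-1)<u, v>.
cov_L = [[dot(u, v) for v in xs] for u in xs]
cov_M = [[(-1.0) * (-1.0) * dot(u, v) for v in xs] for u in xs]
print(cov_L == cov_M)                          # True: M is a version of L

random.seed(1)
z = random.gauss(0.0, 1.0)                     # a draw of L(x) for ||x|| = 1
print(z == -z)                                 # False: Pr(L(x) = M(x)) = 0
```

The same-covariance computation is exact, not statistical, which is why a two-line check suffices.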
Subject Index
a.s., almost surely, 12 acyclic graph, 147 admissible (measurability), 180, 184 Alaoglu's theorem, 50 almost uniformly, see convergence analytic sets, 417420 asymptotic equicontinuity, 117, 118, 121, 131, 315, 349 atom of a Boolean algebra, 141, 158, 159, 163 of a probability space, 92 soft, 192, 247
Banach space, 23, 50, 51, 285 binomial distribution, 1, 5, 16, 123, 400 BlumDeHardt theorem, 235 Bochner integral, 51, 248, 407412 Boolean algebras, 141, 158, 165 bootstrap, 335362 bordered, 155 Borel classes, 181 Borel injections, 417420 Borelisomorphic, 119 BorisovDurst theorem, 244 boundaries, differentiable, 252264, 363 bracket, bracketing, 234313, 365 and majorizing measures, 246 Brownian bridge, 13, 7, 18, 19, 91, 220, 334, 335 Brownian motion, 2, 300 in d dimensions, 310 Cestimator, 217 Cantor set, 181, 187 cardinals, 167, 403 ceiling function, 19 centered Gaussian, 2528, 359 central limit theorems general, 94, 208, 238, 285
see also Donsker class, 94 triangular arrays, 246 in finite dimensions, 1, 23 in separable Banach spaces, 209, 291 characteristic function, 26, 31, 84, 337 classification problems, 226 closed regular measure, 405 coherent process, 93, 117, 315 compact operator, 81 comparable, 145 complemented, 144 composition of functions, 94 concentration of measure, 223 confidence sets, 357–358 continuity set, 107, 112 continuous mapping theorem, 116 convergence almost uniform, 100, 101, 107, 112, 223, 306, 336, 359 in law, 3, 93, 94 classical, 9, 91, 93, 128, 130 Dudley (1966), 130 Hoffmann-Jørgensen, 94, 99, 101, 104, 106, 107, 111–117, 127, 130, 131, 291, 328, 360
in outer probability, 100102, 104, 112, 127, 215, 223, 301, 310, 336, 358 convex combination, 46, 322 functions, 14, 24, 30, 58, 62, 70, 84, 421,423 hull, 46, 47, 84, 117, 127, 159, 166, 322 metric entropy of, 322328 of a sequence, 8283, 88 sets, 227, 384 and Gaussian laws, 4043 approximation of, 269283, 322328 427
428
Subject Index
convex sets (cont.) classes of, 138, 166, 227, 248, 283, 309, 365, 373, 374, 384, 388
symmetric, 40, 47 convolution, 345, 395, 397 correlation, 33 cotreelike ordering, 148, 149 coupling, see Strassen's theorem Vorob'ev form, 7, 10, 19, 119, 128, 296, 304 Cramér–Wold device, 337 cubature, numerical integration, 363 cubes, unions of, 247, 364 cycle in a graph, 147 Darmois–Skitovič theorem, 25, 86 desymmetrization, 338, 339, 343 determining class, 177, 179 diameter, 61, 75 differentiability degrees for functions, and bounding sets, 252–264, 363 differentiable boundaries, 309 differentiating under integrals, 391–398 dominated family of laws, 172 Donsker classes, 94, 134 and invariance principles, 286, 301, 306, 307, 310 and log log laws, 307 criteria for, 349 asymptotic equicontinuity, 117, 118 multipliers, 355 envelopes of, 307, 308, 310 examples all subsets of N, 228, 244 bounded variation, 227 convex sets in I², 228 convex sets in I², 269, 283 half-lines, 127 unions of k intervals, 129 unit ball in Hilbert space, 128 weighted intervals, 227 functional, 301 non-Donsker classes, 129, 138, 228,
363390 Puniversal, 333 stability of convex hull, 127 subclass, 315 sums, 127 union of two, 121122 sufficient conditions for bracketing, 238248 differentiability classes, 253, 264 KoltchinskiiPollard entropy, 208 random entropy, 226 sequences of functions, 122126, 129 VapnikCervonenkis classes, 214, 215
uniform, 328329, 334 universal, 314322, 328330, 334 Donsker's invariance principle, 311 Donsker's theorem, 3, 9, 19, 129 dual Banach space, 50, 237 edge (of a graph), 147 Effros Borel structure, 190, 194 Egorov's theorem, 101, 405 ellipsoids, 80, 81, 84, 85, 128, 140, 319, 329 empirical measure, 1, 91, 94, 100, 138, 199, 285, 306, 399 as statistic, 176, 191, 226, 332, 335 bootstrap, 335 empirical process, 9194, 117121, 220223, 286, 301, 308, 309, 311 in one dimension, 2, 3, 7, 18, 19 envelope function, 161, 196, 220, 223, 234,
306308,310,316,317,358 enet, 10, 228 equicontinuity, asymptotic, 117, 118, 121, 131, 315, 349 equivalent measures, 172 essential infimum or supremum, 44, 95, 96, 98, 130 estimator, 217219, 226 exchangeable, 191 exponent of entropy, 79 exponent of volume, 79 exponential family, 191, 397 factorization theorem, 172, 174, 193 filtration, 370, 371 Gaussian law, see normal law Gaussian processes, 2390, 222, 223 see also pregaussian, 92 GBsets, 44, 52 and metric entropy, 54, 79 criteria for convex hull, 82 majorizing measures, 59, 60 mean width, 80 implied by GCset, 45 necessary conditions, 45, 54, 64 special classes ellipsoids, 80, 84 sequences, 82 GCsets, 44, 52, 93 and Gaussian processes, 77 and metric entropy, 79, 85 criteria for, 46, 47 convex hull, 82 other metrics, 78 implies GBset, 45 special classes ellipsoids, 80, 81
Subject Index random Fourier series, 85 sequences, 84, 85 stability of convex hull, 47 union of two, 51 sufficient conditions majorizing measures, 74 metric entropy, 52, 54, 87 unit ball of dual space, 50 volumes, 79 geometric mean, 14 GlivenkoCantelli class, 101, 223, 224,
234238 criteria for, 223, 224, 226 envelope, integrability, 220 examples, 247 convex sets, 269 lower layers, 282 nonexamples, 127, 138, 171, 247 strong, 100 sufficient conditions for bracketing (BlumDeHardt), 235 differentiability classes, 253, 264 unit ball (Mourier), 237 VapnikCervonenkis classes, 134 uniform, 219, 225, 282 universal, 224, 225, 228 weak, 101 not strong, 228 GlivenkoCantelli theorem, 1 graph, 147 Hausdorff metric, 190, 250 HilbertSchmidt operator, 81 Holder condition, 252, 384 homotopy, 260 image admissible, 180, 182, 184, 185, 193 Suslin, 186, 189, 190, 192, 193, 217, 229,
316,317,359 incomparable, 145 independent as sets, 158, 220 events, sequences of, 123, 126, 129 pieces, process with, 370, 371 random elements, 286291 inequalities, 1218 Bernstein's, 1214, 20, 124, 129 ChernoffOkamoto, 16 DvortezkyKieferWolfowitz, 221 for empirical processes, 220223 Hoeffding's, 13, 14, 20 HoffmannJorgensen, 288, 341 Jensen's, 340 conditional, 29 Komatsu's, 57, 87
429
Levy, 17, 21, 289, 310 Ottaviani's, 17, 21, 288 Slepian's, 31, 33, 34
with stars, 95100, 285286, 339341, 343, 344, 352
Young, W. H., 423,424 inner measure, 104 integral Bochner, 51 Pettis, 51 integral operator, 81 intensity measure, 127, 368 invariance principles, 3, 285313 isonormal process, 39, 51, 52, 59, 63, 74, 75, 78, 92, 425426
Kolmogorov–Smirnov statistics, 335 Koltchinskii–Pollard entropy, 196, 228, 316 Kuiper statistic, 335 law, see convergence in law law, probability measure, 1, 23, 345 laws of the iterated logarithm, 306–308 learning theory, 226, 227 likelihood ratio, 175 linearly ordered by inclusion, 142, 143, 145, 154, 155, 161, 247 Lipschitz functions, seminorm, 112, 210, 223, 251, 356 log log laws, 306–308 loss function, 217 lower layers, 264–269, 282, 316, 373–384, 389 lower semilattice, 95 Lusin's theorem, 105, 405, 406
major set or class, 159 majorizing measures, 5974, 8185, 87, 246 Marczewski function, 181 marginals, laws with given, 8, 10, 119, 296 Markov property, 5, 10, 301, 389 for random sets, 373 measurability, 10, 19, 93, 170194, 217, 221, 226, 229, 291293, 402406,
413420 measurable cover functions, 95100, 130,
285286 cover of a set, 95, 348 universally, 20, 105, 186, 199, 202, 203,
413,415
vector space, 2527 measurable cardinals, 403 metric dualboundedLipschitz, 115, 328, 336 Prokhorov, 6, 19, 114, 119, 295 Skorokhod's, 19
430
Subject Index
metric entropy, 1012 of convex hulls, 322328 with bracketing, 234 metrization of convergence in law, 111, 112, 329, 336 Mills' ratio, 57 minimax risk, 217, 218 minoration, Sudakov, 33 mixed volume, 79, 270 modification, 43, 138, 229,425426 monotone regression, 282 multinomial distribution, 4, 16, 368, 399401 NeymanPearson lemma, 56, 175 node (of a graph), 147 nonatomic measure, 92, 192, 225, 403 norm, 23 bounded Lipschitz, 112, 252, 336 supremum, 3, 7, 10, 18, 19, 27, 170, 180, 190, 300 normal distribution or law, 357 in finite dimensions, 23, 24, 26, 36 in infinite dimensions, 23, 25 in one dimension, 2325, 86 normed space, 23 norming set, 285 null hypothesis, 332 order bounded, 223 ordinal triangle, 170 orthants, as VC class, 155 outer measure, 100 outer probability, 100, 336, 358
PDonsker class, see Donsker class Pperfect, 104 pvariation, 329 PAC learning, 227 Pascal's triangle, 135 pattern recognition, 226 perfect function, 104107, 112, 333 perfect probability space, 106 Pettis integral, 51, 248, 407412 Poisson distributions, 17, 19, 127, 344, 346, 368, 379 process, 127, 368370, 374, 388 Poissonization, 344346, 363, 367373, 389 polar, of a set, 43, 84 Polish space, 7, 20, 119, 303 Pollard's entropy condition, 208, 316, 322, 327, 329 polyhedra, polytopes, 273, 283, 384 polynomials, 140 polytope, 141
portmanteau theorem, 111, 112, 131
pregaussian, 92–94, 117, 129, 215, 315, 388 uniformly, 329 prelinear, 45, 46, 93, 138 pseudo-seminorm, 27 pseudometric, 75 quasi-order, 145 quasi-perfect, 106 Rademacher variables, 13, 206, 338 Radon law, 358 random element, 94, 100, 106, 111, 115, 127, 286–291, 336 random entropy, 223, 226 RAP, xiii realization a.s. convergent, 106–111 of a stochastic process, 46, 47 reflexive Banach space, 50 regular conditional probabilities, 120 reversed (sub)martingale, 199–201, 205 risk, 217, 218 sample, 332, 336 bounded, 52, 59 continuous, 2, 3, 7, 52, 55, 76, 85, 92 function, 44, 76–78 modulus, 52 space, 91, 351 Sauer's lemma, 135, 145, 149, 154, 162, 167, 207, 221 Schmidt ellipsoid, 80, 81 selection theorem, 186, 188, 193 semialgebraic sets, 165 seminorm, 23 separable measurable space, 179 shatter, 134, 330 for functions, 224 σ-algebra Borel, 7, 23, 25, 28, 74 product, 91, 98, 181, 183
tail, 30,49 slowly varying function, 367 standard deviation, 33 standard model, 91, 191 standard normal law, 24, 39, 43, 53 stationary process, 204 statistic, 171, 174, 176, 357, 360 Stirling's formula, 16 stochastic process, 27, 43, 52, 74, 75, 78, 87, 217, 370, 425 stopping set, 363, 371, 373, 388 Strassen's theorem, 6, 19, 131 subadditive process, 204, 205 subgraph, 159, 250, 373 submartingale, 29, 199, 200
431
Subject Index
SudakovChevet theorem, 33, 34, 39, 45, 79,81 sufficient aalgebra, 171179 collection of functions, 177 statistic, 171, 174, 176, 191, 192 superadditive process, 204 supremum norm, 28 Suslin property, 185, 186, 190, 217, 229 symmetric random elements, 289 symmetrization, 198, 205, 206, 211, 228, 339, 343
universally measurable, 20, 105, 186, 199, 202, 203, 413
universally null, 225 upper integral, 93, 95 Vapnik–Červonenkis classes of functions hull, 160, 163, 164, 315, 316, 327 major, 159, 160, 164, 167, 308, 315, 316
subgraph, 159, 160, 214, 223, 307, 308, 315–317, 327, 328, 330 of sets, 134–159, 165, 169, 197, 215, 218,
tail event or aalgebra, 30, 49 tail probability, 50 tetrahedra, 227 tight, 105, 358 uniformly, 105 tightness, 20 topological vector space, 25, 27, 413 tree, 147 treelike ordering, 145, 222 triangle function, 84 triangular arrays, 247, 337, 346 triangulation, 273, 283 truncation, 13, 247 twosample process, 128, 332335 u.m., see universally measurable, 105 uniform distribution on [0, 1], 2, 119 uniformly integrable, 392, 393, 396, 397 uniformly pregaussian, 330 universal class a function, 183
219,221223,225,226,314 VC, see VapnikCervonenkis classes VCM class, 221 VCSM class, 307 vector space
measurable, 2527 topological, 25, 27, 410 version, 43, 71, 74, 138, 425426 versioncontinuous, 74, 75, 77, 78 volume, 78, 79, 85, 262, 270, 320, 365 weak* topology, 50 width, of a convex set, 80 Wiener process, 2 witness of irregularity, 224
YoungOrlicz norms, modulus, 55, 85, 86,
420424 zeroone law for Gaussian measures, 26
Author Index
Adamski, W., 191,194 Adler, R. J., 222 Alexander, K. S., 121, 131, 168, 221, 307, 308 Alon, N., 166, 226, 227 Andersen, N. T., 88, 130, 131, 246, 248 Anderson, T. W., 86 Araujo, A., 359 Arcones, M., 121, 131, 239, 248 Assouad, P., 167, 168, 218, 219, 225, 229, 230 Aumann, R. J., 182, 193 Bahadur, R. R., 193 Bakhvalov, N. S., 363, 364, 388 Bauer, H., 120 Beck, J., 309 BenDavid, S., 226 Bennett, G., 20 Berger, E., 131 Berkes, I., 7, 20 Bernstein, S. N., 12, 20 Bickel, P., 360 Billingsley, P., 19 Birge, L., 154, 282 Birkhoff, Garrett, 411 Birnbaum, Z. W., 424 Blum, J. R., 234, 235, 248 Blumberg, H., 130 Blumer, A., 227 Bochner, S., 411 Bolthausen, E., 269, 283 Bonnesen, T., 79, 270 Borell, C., 86 Borisov, I. S., 244,248 Borovkov, A. A., 311 Breiman, L., 311 Bretagnolle, J., 308
Bron"stein, E. M., 269, 282, 283 Brown, L. D., 222, 397
Cabana, E. M., 222 Carl, B., 322, 330 Červonenkis, A. Ya., 134, 135, 162, 167, 221, 226, 229 Cesa-Bianchi, N., 226 Chebyshev, P. L., 5, 12 Chernoff, H., 16, 20, 124 Chevet, S., 31, 33, 34, 86 Clements, G. F., 282 Coffman, E. G., 375, 389 Cohn, D. L., 101, 193, 411, 420 Danzer, L., 167 Darmois, G., 25, 86 DeHardt, J., 234, 235, 248 Devroye, L., 221 Dobrić, V., 88, 130, 131 Donsker, M. D., 2, 3, 9, 19, 311 Doob, J. L., 2, 19, 229 Douady, A., 415 Drake, F. R., 403 Dudley, R. M., 20, 86, 87, 130, 131, 154, 167, 168, 193, 222, 225, 229, 248, 282, 307, 310, 311, 329, 330, 360,
365, 388, 389 Dunford, N., 50, 411, 412 Durst, M., 193, 229, 244, 248 Dvoretzky, A., 221 Eames, W., 130 Effros, E. G., 190, 194 Efron, B., 357, 361 Eggleston, H. G., 270 Ehrenfeucht, A., 227 Eilenberg, S., 260 Einmahl, U., 193
Author Index Ersov, M. P., 131 Evstigneev, I. V., 389
Feldman, J., 87 Feller, W., 17, 389 Fenchel, W., 79, 270 Ferguson, T. S., 176
Fernique, X., 24, 25, 27, 39, 40, 58–60, 62, 63, 86, 87, 295, 321, 359 Fisher, R. A., 193 Freedman, D. A., 193, 311, 360 Fu, J., 282 Gaenssler, P., 191, 194, 248, 360 Gelfand, I. M., 411, 412 Giné, E., 20, 88, 131, 225, 226, 239, 246,
248, 307, 328–330, 336, 357–360 Gnedenko, B. V., 130, 311 Goffman, C., 130 Goodman, V., 222, 307 Gordon, Y., 86 Grünbaum, B., 167 Graves, L. M., 411 Gross, L., 86, 87 Gruber, P. M., 283 Gutmann, S., 193 Györfi, L., 227 Hadwiger, H., 80 Hall, P., 361 Halmos, P. R., 193 Harary, F., 167 Hausdorff, F., 282 Haussler, D., 165, 166, 168, 222, 226, 227 Heinkel, B., 307 Hobson, E. W., 397 Hoeffding, W., 13, 14, 16, 20 Hoffmann-Jørgensen, J., 130, 131, 310, 341
Il'in, V. A., 397 Itô, K., 87 Jain, N. C., 208, 229, 308, 310, 359
Kac, M., 389 Kahane, J.-P., 21, 86, 310 Kartashev, A. P., 397 Kelley, J. L., 404 Kiefer, J., 221 Kingman, J. F. C., 204, 229 Klee, V. L., 167 Kolmogorov, A. N., 2, 11, 19, 20, 130, 282, 311 Koltchinskii, V. I., 196, 198, 208, 228,
229, 307–309 Komatsu, Y., 57, 87 Komlós, J., 308
433
Korolyuk, V. S., 311 Krasnosel'skii, M. A., 424 Kuelbs, J., 307, 310 Kulkarni, S. R., 227
Landau, H. J., 24, 27, 50, 58, 59, 86, 295, 321 Lang, S., 397 Laskowski, M. C., 165 Le Cam, L., 193, 311 Leadbetter, M. R., 222 Lebesgue, H., 183, 193 Ledoux, M., 81, 87, 223, 307, 359 Lehmann, E. L., 56, 175 Leichtweiss, K., 79 Leighton, T., 375 Levy, P., 17, 21 Lindgren, G., 222 Lorentz, G. G., 20, 282 Lueker, G. S., 389 Lugosi, G., 227 Luxemburg, W. A. J., 130, 424 Major, P., 308, 311 Marcus, M. B., 24, 50, 58, 59, 81, 86, 208, 229, 295, 310, 321, 359 Marczewski, E., 404 Massart, P., 221223, 308, 309 Maurey, B., 324, 330 May, L. E., 130 McKean, H. P., 87 McMullen, P., 80 Milman, V. D., 79 Mityagin, B. S., 81 Mourier, E., 237, 248 Munkres, J. R., 283 Nagaev, S. V., 311 Natanson, I. P., 193 Neveu, J., 193 Neyman, J., 175, 193
Okamoto, M., 16, 20, 124 Orlicz, W., 423, 424 Ossiander, M., 239, 246, 248, 253
Pachl, J. K., 130 Pettis, B. J., 411 Philipp, W., 7, 20, 130, 307, 310, 311, 388 Pickands, J., 222 Pisier, G., 79, 307, 308, 330 Pollard, D. B., 168, 196, 198, 208, 214, 228, 229 Posner, E. C., 20 Pozniak, E. G., 397 Price, G. B., 412
434
Author Index
Prokhorov, Yu. V., 6, 19 Pyke, R., 389
Strassen, V., 6
Révész, P., 309 Radon, J., 140, 141, 167 Rao, B. V., 193 Rao, C. R., 86 Rhee, W. T., 375 Richardson, T., 227 Rio, E., 309 Rodemich, E. R., 20 Rootzén, H., 222 Rozhdestvenskii, B. L., 397 Rudin, W., 423, 424 Rumsey, H., 20 Rutitskii, Ia. B., 424 Ryll-Nardzewski, C., 130
Sudakov, V. N., 31, 33, 34, 80, 81, 86, 87 Sun, TzeGong, 264, 282
Sainte-Beuve, M.-F., 186, 193 Samorodnitsky, G., 222 Sauer, N., 135, 167 Savage, L. J., 193 Sazonov, V. V., 130 Schaefer, H. H., 413, 416 Schaerf, H. M., 406 Schmidt, W. M., 365, 388 Schwartz, J. T., 50, 411, 412 Schwartz, L., 37, 415, 416 Shao, Jun, 361 Shelah, S., 167 Shepp, L. A., 24, 27, 50, 58, 59, 86, 295, 321 Shor, P. W., 374, 375, 389 Shortt, R. M., 20, 130 Sikorski, R., 404 Skitovič, V. P., 25, 86 Skorokhod, A. V., 19, 106, 130, 131 Slepian, D., 31, 86 Smith, D. L., 222 Steele, J. M., 203, 204, 226, 229, 282 Steenrod, N., 260 Stengle, G., 165 Stone, A. H., 404
Valiant, L. G., 227 van der Vaart, A., 131, 358 Vapnik, V. N., 134, 135, 162, 167, 221, 226, 227, 229 Vorob'ev, N. N., 7, 20 Vulikh, B. Z., 130
Strobl, F., 191, 200, 229, 358 Stute, W., 248
Talagrand, M., 59, 60, 63, 81, 82, 87, 88, 223, 224, 226, 307, 308, 359, 374, 375 Tarsi, M., 166 Tibshirani, R. J., 361 Tikhomirov, V. M., 20, 282 Topsøe, F., 131 Tu, Dongsheng, 361 Tusnády, G., 308
Uspensky, J. V., 20
Warmuth, M., 227 Wellner, J., 358 Wenocur, R. S., 167, 168 Wiener, N., 2 Wolfowitz, J., 221, 229 Wright, F T., 282
Young, W. H., 423,424 Yukich, J., 165, 389
Zaanen, A. C., 130, 424 Zakon, E., 406 Zeitouni, O., 227 Ziegler, K., 191 Zink, R. E., 130 Zinn, J., 88, 131, 225, 226, 246, 307,
328–330, 336, 357–360
Index of Notation
A := B means A is defined by B, xiv A =: B means B is defined by A, xiv ≍, of the same order, 252 ⊗, set of Cartesian products, 153
8 A, Sinterior of A, 262
A
δ_x, point mass at x, 1 D(ε, A, d), packing number, 10 D_F^{(1)}(δ, F), 196
⇒, converges in law, 94 ⊗, product σ-algebra, 109 ⊓, class of intersections, 134 ⊓_{i=1}^k, class of intersections, 251 ⊔, class of unions, 153 ⊔_{k=1}, class of unions, 251
D_F^{(1)}(ε, F, Q), 161 D^{(p)}(ε, F), 161 D^{(p)}(ε, F, Q), 161 D^p, partial derivative, 252 dens, combinatorial density, 137 diam, diameter, 10
(·, ·)_{0,P}, covariance, 92
‖·‖_α, differentiability norm, 252
‖·‖_BL, bounded Lipschitz norm, 112
‖·‖_C, sup norm over C, 138
‖·‖_𝓕, sup norm over 𝓕, 94
‖·‖_L, Lipschitz seminorm, 112
‖·‖′, dual norm, 24
d_P, d_P(A, B) := P(A Δ B), 156
d_{p,Q}, L^p(Q) distance, 161
d_sup, supremum distance, 235
E(k, n, p), binomial probability, 16
E*, upper expectation, 95
e^λ, λ a signed measure, 345
ess.inf, essential infimum, 95
ess.sup, essential supremum, 43
[f, g] := {h: f ≤ h ≤ g}, 234
α_n := n^{1/2}(F_n − F), 1
AEC(P, 𝓕), asymptotic equicontinuity condition, 117
β(·, ·), bounded Lipschitz distance, 115
B(k, n, p), binomial probability, 16
C(F) := I(F) ∪ R(F), 260
C°, polar of C, 43
card, cardinality, 134
cov_P, 92
C(α, K, d), subgraphs of α-smooth f's, 252
Δ, symmetric difference, 95
ΔA, set of symmetric differences, 143
Δ^𝒞, number of induced subsets, 134
Φ, normal distribution function, 55
φ, normal density, 55
F, distribution function, 1
F_n, empirical distribution function, 1
F_𝓕, envelope, 161
f_*, 98
f ∘ g, composition, 94
f*, measurable cover function, 95
F_{α,K}, α-differentiable functions, 252
γ(T) = inf_μ γ_μ(T), 59
γ_μ(T), majorizing measure integral, 59
G_P, 91
𝒢_{α,K,d}, differentiable functions, 252
H(ε, A, d) = log N(ε, A, d), 11
H(𝓕, M), hull, 160
P_n^B, bootstrap empirical measure, 335
P*, outer probability, 95
h(·, ·), Hausdorff metric, 250
pos(G) := {pos(g): g ∈ G}, 139
pos(g) := {x: g(x) > 0}, 139
I(F), inside boundary of F, 260
I = [0, 1], unit interval, 19
Iⁿ, unit cube in ℝⁿ, 7
i.i.d., independent, identically distributed, 1
L, isonormal process, 39
L(A)*, ess.sup_A L, 43
|L(A)|*, ess.sup_A |L|, 43
ℒ²(P), 91
ℒ⁰, 95
ℒ₀, 92
ℒℒ_d, lower layers in ℝ^d, 264
ℒℒ_{d,1}, lower layers in the unit cube, 264
m^𝒞(n), 134
ν_n, empirical process, 91
ν_n^B, bootstrap empirical process, 336
NC 01, 139
π₀, centering projection, 92
P_n, empirical measure, 1
ρ(·, ·), Prokhorov distance, 115
ρ_P, cov_P distance, 92
R(F) = range(F), 260
ℝ̄ = [−∞, ∞], 93
ℝ^X, all functions X → ℝ, 139
∫*, upper integral, 94
∫_*, lower integral, 98
S¹, unit circle, 138
S(C) = V(C) − 1, VC index, 134
s̄co, symmetric closed convex hull, 49
sco, symmetric convex hull, 117
𝒮(ℝⁿ), L. Schwartz space, 37
U[0, 1], uniform law on [0, 1], 126
V(C) = S(C) + 1, VC index, 134
Var(X) = E((X − EX)²), 12
[x], least integer > x, 19
x^p = x_1^{p(1)} x_2^{p(2)} ⋯, 252
x_t, Brownian motion, 2
y_t, Brownian bridge, 1