David Pollard
Convergence of Stochastic Processes With 36 Illustrations
Springer-Verlag New York Berlin Heidelberg Tokyo
David Pollard
Department of Statistics
Yale University
New Haven, CT 06520
U.S.A.
AMS Subject Classifications: 60F99, 60G07, 60H99, 62M99
Library of Congress Cataloging in Publication Data Pollard, David
Convergence of stochastic processes.
… $W(a^*, b^*, P)$. $C \setminus U$
Continuity of $W(\cdot, \cdot, P)$ takes care of the infimum over bounded regions of $C \setminus U$. If there were an unbounded sequence $(\alpha_i, \beta_i)$ in $C$ with $W(\alpha_i, \beta_i, P) \to W(a^*, b^*, P)$, we could extract a subsequence along which, say, $\alpha_i \to -\infty$ and $\beta_i \to \beta$, with $|\beta| \le M$. Dominated convergence would give $W(a^*, b^*, P) = P|x - \beta|^2$, which would contradict uniqueness of $(a^*, b^*)$: for every $a$, the pair $(a, \beta)$ would minimize $W(\cdot, \cdot, P)$. The pair $(a_n, b_n)$, by seeking out the unique minimum of $W(\cdot, \cdot, P)$ over the region $C$, must converge to $(a^*, b^*)$. □
II. Uniform Convergence of Empirical Measures
The k-means example typifies consistency proofs for estimators defined by optimization of a random criterion function. By ad hoc arguments one forces the optimal solution into a restricted, often compact, region. That is usually the hardest part of the proof. (Problem 2 describes one particularly nice ad hoc argument.) Then one appeals to a uniform strong law over the restricted region, to replace the random criterion function by a deterministic limit function. Global properties of the limit function force the optimal solution into desired neighborhoods. If one wants consistency results that apply not just to independent sequences but also, for example, to stationary ergodic sequences, one is stuck with cumbersome direct approximation arguments; but for independent sampling, slicker methods are available for proving the uniform strong laws. We shall return to the k-means problem in Section 5 (Example 29 to be precise) after we have developed these methods.

5 Example. Let $\theta$ be the parameter of a stationary autoregressive process
$$y_{n+1} = \theta y_n + u_{n+1}$$
for independent, identically distributed innovations $\{u_n\}$. Stationarity requires $|\theta| \le 1$. A generalized M-estimator for $\theta$ is any value $\theta_n$ for which the random function
$$H_n(\theta) = (n-1)^{-1} \sum_{i=1}^{n-1} g(y_i)\,\phi(y_{i+1} - \theta y_i)$$
takes the value zero. We would hope that $\theta_n$ converges to the $\theta^*$ at which the deterministic function
$$H(\theta) = \mathbb{P}\, g(y_1)\,\phi(y_2 - \theta y_1)$$
takes the value zero. If $|g| \le 1$ and $|\phi| \le 1$ and $\phi$ is continuous, we can go part of the way towards proving this by means of a uniform strong law for a bivariate empirical measure. Write $Q_n$ for the probability measure that puts equal mass $(n-1)^{-1}$ on each of the pairs $(y_1, y_2), \ldots, (y_{n-1}, y_n)$. For fixed (integrable) $f(\cdot, \cdot)$,
$$Q_n f \to Q f \quad \text{almost surely,}$$
where $Q$ denotes the joint distribution of $(y_1, y_2)$. This follows from the ergodic theorem for the stationary bivariate process $\{(y_n, y_{n+1})\}$. Check the approximation conditions of Theorem 2, with $Q$ in place of $P$, for the class of functions
$$\{ g(y_1)\,\phi(y_2 - \theta y_1) : |\theta| \le 1 \}.$$
First, choose an integer $K$ so large that
$$\mathbb{P}\{ |y_1| \le K,\ |y_2| \le K \} > 1 - \varepsilon.$$
II.3. The Combinatorial Method
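Then appeal to uniform continuity of … $> 0$ such that … $\beta > 0$ and $\alpha > 0$ such that $\mathbb{P}\{ |Z'(t)| \le \alpha \} \ge \beta$ for every $t$ in $T$. Then
$$\mathbb{P}\Bigl\{ \sup_t |Z(t)| > \varepsilon \Bigr\} \le \beta^{-1}\, \mathbb{P}\Bigl\{ \sup_t |Z(t) - Z'(t)| > \varepsilon - \alpha \Bigr\}. \tag{9}$$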
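PROOF. Select a random $\tau$ for which $|Z(\tau)| > \varepsilon$ on the set $\{\sup_t |Z(t)| > \varepsilon\}$. Since $\tau$ is determined by $Z$, it is independent of $Z'$. It behaves like a fixed index value when we condition on $Z$:
$$\mathbb{P}\{ |Z'(\tau)| \le \alpha \mid Z \} \ge \beta.$$
Integrate out.
$$\begin{aligned}
\beta\, \mathbb{P}\Bigl\{ \sup_t |Z(t)| > \varepsilon \Bigr\}
&\le \mathbb{P}\{ |Z'(\tau)| \le \alpha,\ |Z(\tau)| > \varepsilon \} \\
&\le \mathbb{P}\{ |Z(\tau) - Z'(\tau)| > \varepsilon - \alpha \} \\
&\le \mathbb{P}\Bigl\{ \sup_t |Z(t) - Z'(t)| > \varepsilon - \alpha \Bigr\}.
\end{aligned}$$
□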
Close inspection of the proof would reveal a disregard for a number of measure-theoretic niceties. A more careful treatment may be found in Appendix C. For our present purpose it would suffice if we assumed $T$ countable; the proof is impeccable for stochastic processes sharing a countable index set. We could replace suprema over all intervals $(-\infty, t]$ by suprema over intervals with a rational end point.

For fixed $t$, $P_n(-\infty, t]$ is an average of the $n$ independent random variables $\{\xi_i \le t\}$, each having expected value $P(-\infty, t]$ and variance $P(-\infty, t] - (P(-\infty, t])^2$, which is less than one. By Tchebychev's inequality,
$$\mathbb{P}\{ |P_n'(-\infty, t] - P(-\infty, t]| \le \tfrac12 \varepsilon \} \ge \tfrac12 \quad \text{if } n \ge 8\varepsilon^{-2}.$$
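Apply the Symmetrization Lemma with $Z = P_n - P$ and $Z' = P_n' - P$, the class $\mathcal{J}$ as index set, $\alpha = \tfrac12 \varepsilon$, and $\beta = \tfrac12$.
$$\mathbb{P}\{ \|P_n - P\| > \varepsilon \} \le 2\, \mathbb{P}\{ \|P_n - P_n'\| > \tfrac12 \varepsilon \} \quad \text{if } n \ge 8\varepsilon^{-2}. \tag{10}$$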
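SECOND SYMMETRIZATION.
The difference $P_n - P_n'$ depends on $2n$ observations. The double sample size creates a minor nuisance, at least notationally. It can be avoided by a second symmetrization trick, at the cost of a further diminution of the $\varepsilon$. Independently of the observations $\xi_1, \ldots, \xi_n, \xi_1', \ldots, \xi_n'$ from which the empirical measures are constructed, generate independent sign random variables $\sigma_1, \ldots, \sigma_n$ for which $\mathbb{P}\{\sigma_i = +1\} = \mathbb{P}\{\sigma_i = -1\} = \tfrac12$. The symmetric random variables $\{\xi_i \le t\} - \{\xi_i' \le t\}$, for $i = 1, \ldots, n$ and $-\infty < t < \infty$, have the same joint distribution as the random variables $\sigma_i[\{\xi_i \le t\} - \{\xi_i' \le t\}]$. (Consider the conditional distribution given $\{\sigma_i\}$.) Thus
$$\begin{aligned}
\mathbb{P}\{ \|P_n - P_n'\| > \tfrac12 \varepsilon \}
&= \mathbb{P}\Bigl\{ \sup_t \Bigl| n^{-1} \sum_{i=1}^n \sigma_i [\{\xi_i \le t\} - \{\xi_i' \le t\}] \Bigr| > \tfrac12 \varepsilon \Bigr\} \\
&\le \mathbb{P}\Bigl\{ \sup_t \Bigl| n^{-1} \sum_{i=1}^n \sigma_i \{\xi_i \le t\} \Bigr| > \tfrac14 \varepsilon \Bigr\}
 + \mathbb{P}\Bigl\{ \sup_t \Bigl| n^{-1} \sum_{i=1}^n \sigma_i \{\xi_i' \le t\} \Bigr| > \tfrac14 \varepsilon \Bigr\}.
\end{aligned}$$
Write $P_n^\circ$ for the signed measure that places mass $n^{-1}\sigma_i$ at $\xi_i$. The two symmetrizations give, for $n \ge 8\varepsilon^{-2}$,
$$\mathbb{P}\{ \|P_n - P\| > \varepsilon \} \le 4\, \mathbb{P}\{ \|P_n^\circ\| > \tfrac14 \varepsilon \}. \tag{11}$$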
To bound the right-hand side, work conditionally on the vector of observations $\xi$, leaving only the randomness contributed by the sign variables.

MAXIMAL INEQUALITY.
Once the locations of the $\xi$ observations are fixed, the supremum $\|P_n^\circ\|$ reduces to a maximum taken over a strategically chosen set of intervals $I_j = (-\infty, t_j]$, for $j = 0, 1, \ldots, n$. Of course the choice of these intervals depends on $\xi$; we need one $t_j$ between each pair of adjacent observations. (The $t_0$ and $t_n$ are not really necessary.) With the number of intervals reduced so drastically, we can afford a crude bound for the supremum.
$$\mathbb{P}\{ \|P_n^\circ\| > \tfrac14 \varepsilon \mid \xi \} \le \sum_{j=0}^n \mathbb{P}\{ |P_n^\circ I_j| > \tfrac14 \varepsilon \mid \xi \} \le (n+1) \max_j \mathbb{P}\{ |P_n^\circ I_j| > \tfrac14 \varepsilon \mid \xi \}. \tag{12}$$
This bound will be adequate for the present because the conditional probabilities decrease exponentially fast with $n$, thanks to an inequality of Hoeffding for sums of independent, bounded random variables.

EXPONENTIAL BOUNDS.
Let $Y_1, \ldots, Y_n$ be independent random variables, each with zero mean and bounded range: $a_i \le Y_i \le b_i$. For each $\eta > 0$, Hoeffding's Inequality (Appendix B) asserts
$$\mathbb{P}\{ |Y_1 + \cdots + Y_n| \ge \eta \} \le 2 \exp\Bigl[ -2\eta^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2 \Bigr].$$
Apply the inequality with $Y_i = \sigma_i \{\xi_i \le t\}$. Given $\xi$, the random variable $Y_i$ takes only two values, $\pm\{\xi_i \le t\}$, each with probability $\tfrac12$.
$$\mathbb{P}\{ |P_n^\circ(-\infty, t]| \ge \tfrac14 \varepsilon \mid \xi \} \le 2 \exp\Bigl[ -2(n\varepsilon/4)^2 \Big/ \sum_{i=1}^n 4\{\xi_i \le t\} \Bigr] \le 2 \exp(-n\varepsilon^2/32),$$
because the indicator functions sum to at most $n$. Use this for each $I_j$ in inequality (12).
$$\mathbb{P}\{ \|P_n^\circ\| > \tfrac14 \varepsilon \mid \xi \} \le 2(n+1) \exp(-n\varepsilon^2/32).$$
Notice that the right-hand side now does not depend on $\xi$.
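As a numerical sanity check on the exponential bound (a sketch added for illustration, not part of the original argument): given $\xi$, the sum $\sum_i \sigma_i\{\xi_i \le t\}$ is a sum of $m$ independent fair signs, where $m$ is the number of observations at or below $t$. Its tail can therefore be computed exactly from binomial coefficients and compared with $2\exp(-n\varepsilon^2/32)$. The function names below are hypothetical choices made for this sketch.

```python
from math import comb, exp


def sign_sum_tail(m, threshold):
    """Exact P{|s_1 + ... + s_m| >= threshold} for independent fair signs s_i = +1 or -1."""
    favourable = 0
    for plus in range(m + 1):
        total = 2 * plus - m  # value of the sum when `plus` of the signs equal +1
        if abs(total) >= threshold:
            favourable += comb(m, plus)
    return favourable / 2 ** m


def hoeffding_bound(n, eps):
    """The bound 2 exp(-n eps^2 / 32) quoted in the text."""
    return 2 * exp(-n * eps ** 2 / 32)


def worst_conditional_tail(n, eps):
    """Largest exact value of P{|sum of m signs| >= n eps / 4} over m = 0, 1, ..., n.

    Only m = #{i : xi_i <= t} matters once the observations are fixed.
    """
    return max(sign_sum_tail(m, n * eps / 4) for m in range(n + 1))
```

For any $n$ and $\varepsilon$, `worst_conditional_tail(n, eps)` should not exceed `hoeffding_bound(n, eps)`; the exact tails are typically much smaller, which reflects the slack in the constant 32.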
INTEGRATION.
Take expectations over $\xi$.
$$\mathbb{P}\{ \|P_n - P\| > \varepsilon \} \le 8(n+1) \exp(-n\varepsilon^2/32).$$
This gives very fast convergence in probability, so fast that
$$\sum_{n=1}^{\infty} \mathbb{P}\{ \|P_n - P\| > \varepsilon \} < \infty$$
for each $\varepsilon > 0$. The Borel-Cantelli lemma turns this into the full almost sure convergence asserted by the Glivenko-Cantelli theorem.
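To get a feel for the final inequality, the following sketch evaluates the bound $8(n+1)\exp(-n\varepsilon^2/32)$ and computes $\|P_n - P\|$ for a simulated sample. It assumes, purely for illustration, that $P$ is Uniform(0, 1); the function names and the seed are arbitrary choices, not anything prescribed by the text.

```python
import random
from math import exp


def sup_distance_uniform(sample):
    """sup over t of |P_n(-inf, t] - P(-inf, t]| when P is Uniform(0, 1)."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        # Candidate values of the empirical d.f. around x:
        # i/n (before the jump) and (i + 1)/n (after it).
        worst = max(worst, abs(i / n - x), abs((i + 1) / n - x))
    return worst


def tail_bound(n, eps):
    """Right-hand side 8(n + 1) exp(-n eps^2 / 32) of the final inequality."""
    return 8 * (n + 1) * exp(-n * eps ** 2 / 32)


def simulated_distance(n, seed=0):
    """One simulated value of ||P_n - P|| for a Uniform(0, 1) sample."""
    rng = random.Random(seed)
    return sup_distance_uniform([rng.random() for _ in range(n)])
```

With $\varepsilon = 0.5$ the bound exceeds 1 at $n = 1000$ and drops below 1 by $n = 1500$, so it says nothing for small samples; what matters for the Borel-Cantelli step is only that it is summable in $n$.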
II.4. Classes of Sets with Polynomial Discrimination

We made use of very few distinguishing properties of intervals for the proof of the Glivenko-Cantelli theorem in Section 3. The main requirement was that they should pick out at most $n + 1$ subsets from any set of $n$ points. Other classes have a similar property. For example, quadrants of the form $(-\infty, t]$ in $\mathbb{R}^2$ can pick out fewer than $(n + 1)^2$ different subsets from a
set of $n$ points in the plane: there are at most $n + 1$ places to set the horizontal boundary and at most $n + 1$ places to set the vertical boundary. (Problem 8 gives the precise upper bound.) With $(n + 1)^2$ replacing the $n + 1$ factor, we could repeat the arguments from Section 3 to get the bivariate analogue of the Glivenko-Cantelli theorem. The exponential bound would swallow up $(n + 1)^2$, just as it did the $n + 1$. Indeed, it would swallow up any polynomial. The argument works for intervals, quadrants, and any other class of sets that picks out a polynomial number of subsets.

13 Definition. Let $\mathcal{D}$ be a class of subsets of some space $S$. It is said to have polynomial discrimination (of degree $v$) if there exists a polynomial $p(\cdot)$ (of degree $v$) such that, from every set of $N$ points in $S$, the class picks out at most $p(N)$ distinct subsets. Formally, if $S_0$ consists of $N$ points, then there are at most $p(N)$ distinct sets of the form $S_0 \cap D$ with $D$ in $\mathcal{D}$. Call $p(\cdot)$ the discriminating polynomial for $\mathcal{D}$. □
When the risk of confusion with the algebraic sort of polynomial is slight, let us shorten the name "class having polynomial discrimination" to "polynomial class," and adopt the usual terminology for polynomials of low degree. For example, the intervals on the real line have linear discrimination (they form a linear class) and the quadrants in the plane have quadratic discrimination (they form a quadratic class). Of course there are classes that don't have polynomial discrimination. For example, from every collection of $N$ points lying on the circumference of a circle in $\mathbb{R}^2$ the class of closed, convex sets can pick out all $2^N$ subsets, and $2^N$ increases much faster than any polynomial.
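The counting behind linear and quadratic discrimination can be checked by brute force on small point sets. The sketch below enumerates the distinct subsets picked out by half-lines $(-\infty, t]$ on the line and by quadrants in the plane; the function names are hypothetical, and the enumeration relies on the observation that the picked-out subset changes only when a boundary crosses an observed coordinate.

```python
from itertools import product


def picked_by_half_lines(points):
    """Distinct subsets of `points` of the form {x : x <= t}, t real."""
    # One threshold below the minimum plus the observations themselves
    # exhaust all the subsets that can arise.
    thresholds = [min(points) - 1] + sorted(set(points))
    return {frozenset(x for x in points if x <= t) for t in thresholds}


def picked_by_quadrants(points):
    """Distinct subsets of planar `points` of the form {(x, y) : x <= s, y <= t}."""
    s_values = [min(x for x, _ in points) - 1] + sorted({x for x, _ in points})
    t_values = [min(y for _, y in points) - 1] + sorted({y for _, y in points})
    return {
        frozenset((x, y) for x, y in points if x <= s and y <= t)
        for s, t in product(s_values, t_values)
    }
```

For four points on the line, `picked_by_half_lines` returns $n + 1 = 5$ subsets. For the four points $(1,4), (2,3), (3,2), (4,1)$ on an anti-diagonal, `picked_by_quadrants` returns 11 subsets, fewer than $(n+1)^2 = 25$ and equal to the value $1 + \tfrac12 N + \tfrac12 N^2$ quoted in Problem 8 for $N = 4$.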
The method of proof set out in Section 3 applies to any polynomial class of sets, provided measurability complications can be taken care of. Appendix C describes a general method for guarding against these complications. Classes satisfying the conditions described there are called permissible. Every specific class we shall encounter will be permissible. As the precise details of the method are rather delicate (they depend upon properties of analytic sets), let us adopt a naive approach. Ignore measurability problems from now on, but keep the term permissible as a reminder that some regularity conditions are needed if pathological examples (Problem 10) are to be excluded. Problems 3 through 7 describe a simpler approach, based on the more familiar idea of existence of countable, dense subclasses.
14 Theorem. Let $P$ be a probability measure on a space $S$. For every permissible class $\mathcal{D}$ of subsets of $S$ with polynomial discrimination,
$$\sup_{\mathcal{D}} |P_n D - P D| \to 0 \quad \text{almost surely.}$$

PROOF. Go back to Section 3, change $\mathcal{J}$ to $\mathcal{D}$, replace the $n + 1$ multiplier by the polynomial appropriate to $\mathcal{D}$, and strike out the odd reference to interval and real line. □

Which classes have only polynomial discrimination? We already know about intervals and quadrants; their higher-dimensional analogues have the property too. Other classes can be built up from these.

15 Lemma. If $\mathcal{C}$ and $\mathcal{D}$ have polynomial discrimination, then so do each of:
(i) $\{D^c : D \in \mathcal{D}\}$;
(ii) $\{C \cup D : C \in \mathcal{C} \text{ and } D \in \mathcal{D}\}$;
(iii) $\{C \cap D : C \in \mathcal{C} \text{ and } D \in \mathcal{D}\}$.
PROOF. Write $c(\cdot)$ and $d(\cdot)$ for the discriminating polynomials. We may assume them both to be increasing functions of $N$. From a set $S_0$ consisting of $N$ points, suppose $\mathcal{C}$ picks out subsets $S_1, \ldots, S_k$ with $k \le c(N)$. Suppose $S_i$ consists of $N_i$ points. The class $\mathcal{D}$ picks out at most $d(N_i)$ distinct subsets from $S_i$. This gives the bound $d(N_1) + \cdots + d(N_k)$ for the size of the class in (iii). The sum is less than $c(N)\,d(N)$. That proves the assertion for (iii). The other two are just as easy. □

The lemma can be applied repeatedly to generate larger and larger polynomial classes. We must place a fixed limit on the number of operations allowed, though. For instance, the class of all singletons has only linear discrimination, but with arbitrary finite unions of singletons we can pick out any finite set.

Very quickly we run out of interesting new classes to manufacture by means of Lemma 15 from quadrants and the like. Fortunately, there are other systematic methods for finding polynomial classes.

Polynomials increase much more slowly than exponentials. For $N$ large enough, a polynomial class must fail to pick out at least one of the $2^N$ subsets from each collection of $N$ points. Surprisingly, this characterizes polynomial discrimination.

Some picturesque terminology to describe the situation has become accepted in the literature. A class $\mathcal{D}$ is said to shatter a set of points $F$ if it can pick out every possible subset (the empty subset and the whole of $F$ included); that is, $\mathcal{D}$ shatters $F$ if each of the subsets of $F$ has the form $D \cap F$ for some $D$ in $\mathcal{D}$. This conveys a slightly inappropriate image, in which $F$ gets broken into tiny fragments, rather than an image of a diligent $\mathcal{D}$ trying to pick out all the different subsets of $F$; but at least it is vivid.
For example, the class of all closed discs in $\mathbb{R}^2$ can shatter each three-point set, provided the points are not collinear. But from no set of four points, no matter what its configuration, can the discs pick out more than 15 of the 16 possible subsets. The discs shatter some sets of three points; they shatter no set of four points.

16 Theorem. Let $S_0$ be a set of $N$ points in $S$. Suppose there is an integer $V \le N$ such that $\mathcal{D}$ shatters no set of $V$ points in $S_0$. Then $\mathcal{D}$ picks out no more than $\binom{N}{0} + \binom{N}{1} + \cdots + \binom{N}{V-1}$ subsets from $S_0$.

PROOF. Write $F_1, \ldots, F_k$ for the collection of all subsets of $V$ elements from $S_0$. Of course $k = \binom{N}{V}$. By assumption, each $F_i$ has a "hidden" subset $H_i$ that $\mathcal{D}$ overlooks: $D \cap F_i \ne H_i$ for every $D$ in $\mathcal{D}$. That is, all the sets of the form $D \cap S_0$, with $D$ in $\mathcal{D}$, belong to
$$\mathcal{C}_0 = \{ C \subseteq S_0 : C \cap F_i \ne H_i \text{ for each } i \}.$$
It will suffice to find an upper bound for the size of $\mathcal{C}_0$.
In one special case it is possible to count the number of sets in $\mathcal{C}_0$ directly. If $H_i = F_i$ for every $i$ then no $C$ in $\mathcal{C}_0$ can contain an $F_i$; no $C$ can contain a set of $V$ points. In other words, members of $\mathcal{C}_0$ consist of either $0, 1, \ldots,$ or $V - 1$ points. The sum of the binomial coefficients gives the number of sets of this form.

By playing around with the hidden sets we can reduce the general case to the special case just treated. Label the points of $S_0$ as $1, \ldots, N$. For each $i$ define $H_i' = (H_i \cup \{1\}) \cap F_i$; that is, augment $H_i$ by the point 1, provided it can be done without violating the constraint that the hidden set be contained in $F_i$. Define the corresponding class
$$\mathcal{C}_1 = \{ C \subseteq S_0 : C \cap F_i \ne H_i' \text{ for each } i \}.$$
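The class $\mathcal{C}_1$ has nothing much to do with $\mathcal{C}_0$. The only connection is that all its hidden sets, the sets it overlooks, are bigger. Let us show that this implies $\mathcal{C}_1$ has a greater cardinality than $\mathcal{C}_0$. (Notice: the assertion is not that $\mathcal{C}_0 \subseteq \mathcal{C}_1$.)

Check that the map $C \mapsto C \setminus \{1\}$ is one-to-one from $\mathcal{C}_0 \setminus \mathcal{C}_1$ into $\mathcal{C}_1 \setminus \mathcal{C}_0$. Start with any $C$ in $\mathcal{C}_0 \setminus \mathcal{C}_1$. By definition, $C \cap F_i \ne H_i$ for every $i$, but $C \cap F_j = H_j'$ for at least one $j$. Deduce that $H_j \ne H_j'$, so 1 belongs to $C$ and $F_j$ and $H_j'$, but not to $H_j$. The stripping of the point 1 does define a one-to-one map. Why should $C \setminus \{1\}$ belong to $\mathcal{C}_1 \setminus \mathcal{C}_0$? Observe that
$$(C \setminus \{1\}) \cap F_j = H_j' \setminus \{1\} = H_j,$$
which bars $C \setminus \{1\}$ from belonging to $\mathcal{C}_0$. Also, if $F_i$ contains 1 then so must $H_i'$, but $C \setminus \{1\}$ certainly cannot; and if $F_i$ doesn't contain 1 then
$$(C \setminus \{1\}) \cap F_i = C \cap F_i \ne H_i = H_i'.$$
In either case $(C \setminus \{1\}) \cap F_i \ne H_i'$, so $C \setminus \{1\}$ belongs to $\mathcal{C}_1$.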
Repeat the procedure, starting from the new hidden sets and with 2 taking over the role played by 1. Define $H_i'' = (H_i' \cup \{2\}) \cap F_i$ and
$$\mathcal{C}_2 = \{ C \subseteq S_0 : C \cap F_i \ne H_i'' \text{ for each } i \}.$$
The cardinality of $\mathcal{C}_2$ is greater than the cardinality of $\mathcal{C}_1$. Another $N - 2$ repetitions would generate classes $\mathcal{C}_3, \mathcal{C}_4, \ldots, \mathcal{C}_N$ with increasing cardinalities. The hidden sets for $\mathcal{C}_N$ would fill out the whole of each $F_i$: the special case already treated. □

17 Corollary. If a class shatters no set of $V$ points, then it must have polynomial discrimination of degree no greater than $V - 1$. □
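Theorem 16 can be verified by brute force on small examples. The sketch below represents a class by its traces $D \cap S_0$ on a finite set $S_0$, finds the smallest $V$ for which no $V$-point subset is shattered, and compares the number of traces with the binomial sum. All function names are hypothetical, and the two example classes (half-lines and closed intervals on the line) are chosen only because their traces are easy to list.

```python
from itertools import combinations
from math import comb


def binomial_sum(N, V):
    """The bound C(N,0) + C(N,1) + ... + C(N,V-1) of Theorem 16."""
    return sum(comb(N, j) for j in range(V))


def shatters(traces, F):
    """True if the traces {D & S0} pick out every subset of F."""
    F = frozenset(F)
    return len({t & F for t in traces}) == 2 ** len(F)


def smallest_unshattered_size(traces, S0):
    """Smallest V such that no V-point subset of S0 is shattered."""
    for V in range(1, len(S0) + 2):
        if not any(shatters(traces, F) for F in combinations(S0, V)):
            return V


def half_line_traces(S0):
    """Traces on S0 of the half-lines (-inf, t]."""
    pts = sorted(S0)
    return {frozenset(pts[:j]) for j in range(len(pts) + 1)}


def interval_traces(S0):
    """Traces on S0 of the closed intervals [a, b], plus the empty set."""
    pts = sorted(S0)
    n = len(pts)
    return {frozenset(pts[i:j]) for i in range(n + 1) for j in range(i, n + 1)}
```

On six points the half-lines shatter no two-point set ($V = 2$) and pick out exactly $\binom{6}{0} + \binom{6}{1} = 7$ subsets; on five points the closed intervals have $V = 3$ and pick out exactly $\binom{5}{0} + \binom{5}{1} + \binom{5}{2} = 16$ subsets. In both cases the bound of the theorem is attained.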
All we lack now is a good method for identifying classes that have trouble picking out subsets from large enough sets of points.

18 Lemma. Let … , $\xi_n\}$; it could be any of the $(n + 1)$ possible subsets of size $n$ obtained by deleting one of the support points of $P_{n+1}$. (Count coincident observations as distinct support points.) The conditional distribution of $P_n$ given $P_{n+1}$ must be uniform on one of these $(n + 1)$ subsets, each subset being chosen with probability $(n + 1)^{-1}$. The conditional expectation of $P_n$ given $P_{n+1}$ (in the intuitive sense of the average over the $n + 1$ possible choices for $P_n$) must be $P_{n+1}$. The extra information carried
by $P_{n+2}, P_{n+3}, \ldots$ adds nothing more to our knowledge about $P_n$; the conditional expectation of $P_n$ given the $\sigma$-field generated by $P_{n+1}, P_{n+2}, \ldots$ still equals $P_{n+1}$. That is, the sequence $\{P_n\}$ is a reversed martingale, in some wonderful measure-valued sense. Apply Jensen's inequality to the convex function that takes $P_n$ onto $\|P_n - P\|$ to deduce that $\{\|P_n - P\|\}$ is a bounded, reversed submartingale. (Problem 11 arrives at the same conclusion in a slightly more rigorous manner.) Such a sequence must converge almost surely (Neveu 1975, Proposition V-3-13) to a limit random variable, $W$. Since $W$ is unchanged by finite permutations of $\{\xi_i\}$, the zero-one law of Hewitt and Savage (Breiman 1968, Section 3.9) forces it to take on a constant value almost surely. The only question remaining for the proof of a uniform strong law of large numbers is whether the constant equals zero or not: convergence in probability to zero plus convergence almost surely to a constant gives convergence almost surely to zero.

21 Theorem. Let $\mathcal{D}$ be a permissible class of subsets of $S$. A necessary and sufficient condition for
$$\sup_{\mathcal{D}} |P_n D - P D| \to 0 \quad \text{almost surely}$$
is the convergence of $n^{-1} V_n$ to zero in probability, where $V_n = V_n(\xi_1, \ldots, \xi_n)$ is the smallest integer such that $\mathcal{D}$ shatters no collection of $V_n$ points from $\{\xi_1, \ldots, \xi_n\}$.

PROOF. You can formalize the sufficiency argument outlined above; necessity is taken care of in Problem 12. □

Because $0 \le n^{-1} V_n \le 1$, convergence in probability of $n^{-1} V_n$ to zero is equivalent to $n^{-1}\, \mathbb{P} V_n \to 0$. This has an appealing interpretation. The uniform strong law of large numbers holds if and only if, on the average, the class of sets behaves as if it has polynomial discrimination with degree but a tiny fraction of the sample size.

22 Example. Let's see how easy it is to check the necessary and sufficient condition stated in Theorem 21. Consider the class $\mathcal{C}$ of all closed, convex subsets of the unit square $[0, 1]^2$. We know that there exist arbitrarily large collections of points shattered by $\mathcal{C}$. Were we sampling from a nonatomic
distribution concentrated around the rim of a disc inside $[0, 1]^2$, the class $\mathcal{C}$ could always pick out too many subsets from the sample. Indeed, there would always exist a convex $C$ with $P_n C = 1$ and $P C = 0$. But such configurations of sample points should be thoroughly atypical for sampling from the uniform distribution on $[0, 1]^2$. Theorem 21 should say something useful in that case.

How large a subcollection of sample points can $\mathcal{C}$ shatter? Suppose it is larger than the size requested by Theorem 21. That is, for some $\varepsilon > 0$,
$$\mathbb{P}\{ n^{-1} V_n \ge \varepsilon \} \ge \varepsilon \quad \text{infinitely often.}$$
This will lead us to a contradiction. A set of $k$ points is shattered by $\mathcal{C}$ if and only if none of the points can be written as a convex combination of the others; each must be an extreme point of their convex hull. So there exists a convex set whose boundary has empirical measure at least $k/n$, which seems highly unlikely because $P$ puts zero measure around the boundary of every convex set. Be careful of this plausibility argument; it contains a hidden appeal to the very uniformity result we are trying to establish. An approximation argument will help us to avoid the trap.

Divide $[0, 1]^2$ into a patchwork of $m^2$ equal subsquares, for some fixed $m$ that will be specified shortly. Because the class $\mathcal{A}$ of all possible unions of these subsquares is finite,
$$\mathbb{P}\Bigl\{ \sup_{\mathcal{A}} |P_n A - P A| \ge \tfrac12 \varepsilon \Bigr\} < \tfrac12 \varepsilon \quad \text{for all } n \text{ large enough.}$$
The $\tfrac12 \varepsilon$ here is chosen to ensure that, for some $n$,
$$\mathbb{P}\Bigl\{ n^{-1} V_n \ge \varepsilon \text{ and } \sup_{\mathcal{A}} |P_n A - P A| < \tfrac12 \varepsilon \Bigr\} > \tfrac12 \varepsilon.$$
Since a set with positive probability can't be empty, there must exist a sample configuration for which $\mathcal{C}$ shatters some collection of at least $n\varepsilon$ sample points and for which $|P_n A - P A| < \tfrac12 \varepsilon$ for every $A$ in $\mathcal{A}$. Write $H$ for the convex hull of the shattered set, and $A_H$ for the union of those subsquares that intersect the boundary of $H$. The set $A_H$ contains all the extreme points of $H$, so $P_n A_H \ge \varepsilon$; it belongs to $\mathcal{A}$, so $|P_n A_H - P A_H| < \tfrac12 \varepsilon$. Consequently $P A_H > \tfrac12 \varepsilon$, which will give the desired contradiction if we make $m$ large enough.
Experiment with values of $m$ equal to a power of 3. No convex set can have boundary points in all nine of the subsquares; the middle subsquare would lie inside the convex hull of four points occupying each of the four corner squares. For every convex $C$ the $P$ measure of the union of those subsquares intersecting its boundary must be less than $\tfrac89$. Subdivide each of the nine subsquares into nine parts, then repeat the same argument eight times. This brings the measure of squares on the boundary down to $(\tfrac89)^2$. Keep repeating the argument until the power of $\tfrac89$ falls below $\tfrac12 \varepsilon$. That destroys the claim made for $A_H$. □
II.5. Classes of Functions

The direct approximation methods of Section 2 gave us sufficient conditions for the empirical measure $P_n$ to converge to the underlying $P$ uniformly over a class of functions,
$$\sup_{\mathcal{F}} |P_n f - P f| \to 0 \quad \text{almost surely.}$$
The conditions, though straightforward, can prove burdensome to check. In this section a transfusion of ideas from Sections 3 and 4 will lead to a more tractable condition for the uniform convergence. The method will depend heavily on the independence of the observations $\{\xi_i\}$, but the assumption of identical distribution could be relaxed (Problem 23).

Throughout the section write $\|\cdot\|$ to denote $\sup_{\mathcal{F}} |\cdot|$. Let us again adopt a naive approach towards possible measurability difficulties, with only the word permissible (explained in Appendix C) to remind us that some regularity conditions are needed to exclude pathological examples.

A domination condition will guard against any complications that could be caused by $\mathcal{F}$ containing unbounded functions. Call each measurable $F$ such that $|f| \le F$, for every $f$ in $\mathcal{F}$, an envelope for $\mathcal{F}$. Often $F$ will be taken as the pointwise supremum of $|f|$ over $\mathcal{F}$, the natural envelope, but it will
be convenient not to force this. We shall assume $P F < \infty$. With the proper centering, the natural envelope must satisfy this condition (Problem 14) if the uniform strong law holds.

The key to the uniform convergence will again be an approximation condition, but this time with distances calculated using the $\mathcal{L}^1$ seminorm for the empirical measures themselves. This allows us to drop the requirement that the approximating functions sandwich each member of $\mathcal{F}$.

23 Definition. Let $Q$ be a probability measure on $S$ and $\mathcal{F}$ be a class of functions in $\mathcal{L}^1(Q)$. For each $\varepsilon > 0$ define the covering number $N_1(\varepsilon, Q, \mathcal{F})$ as the smallest value of $m$ for which there exist functions $g_1, \ldots, g_m$ (not necessarily in $\mathcal{F}$) such that $\min_j Q|f - g_j| \le \varepsilon$ for each $f$ in $\mathcal{F}$. For definiteness set $N_1(\varepsilon, Q, \mathcal{F}) = \infty$ if no such $m$ exists. □
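For an empirical measure $P_n$ and a finite list of functions, an upper bound on the covering number can be found by a simple greedy pass. The sketch below is one such procedure, offered as an illustration rather than as the method used in the text: each function is stored as its vector of values at the $n$ sample points, and the helper names are hypothetical. The centers it returns belong to the class and are more than $\varepsilon$ apart in $\mathcal{L}^1(P_n)$ distance, so their number bounds $N_1(\varepsilon, P_n, \cdot)$ from above and, by the triangle inequality, bounds $N_1(\tfrac12\varepsilon, P_n, \cdot)$ from below.

```python
def l1_distance(f, g):
    """P_n|f - g| for functions given by their values at the n sample points."""
    return sum(abs(a - b) for a, b in zip(f, g)) / len(f)


def greedy_cover(functions, eps):
    """Greedy selection of centers; every function ends within eps of a center.

    The returned centers belong to the class and are more than eps apart.
    """
    centers = []
    for f in functions:
        if all(l1_distance(f, g) > eps for g in centers):
            centers.append(f)
    return centers


def half_line_indicators(sample):
    """Value vectors of the indicators {x <= t}, one per distinct trace."""
    xs = sorted(sample)
    thresholds = [xs[0] - 1] + xs
    return [[1 if x <= t else 0 for x in xs] for t in thresholds]
```

For the indicators of half-lines evaluated on ten distinct sample points, `greedy_cover(half_line_indicators(range(10)), 0.25)` selects four centers, whatever the sample, because the $\mathcal{L}^1(P_n)$ distance between two such indicators is just the fraction of sample points on which they differ.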
If $\mathcal{F}$ has envelope $F$ we can require that the approximating functions satisfy the inequality $|g_j| \le F$ without increasing $N_1(\varepsilon, Q, \mathcal{F})$: replace $g_j$ by $\max\{-F, \min[F, g_j]\}$. We could also require $g_j$ to belong to $\mathcal{F}$, at the cost of a doubling of $\varepsilon$: replace $g_j$ by an $f_j$ in $\mathcal{F}$ for which $Q|f_j - g_j| \le \varepsilon$.

24 Theorem. Let $\mathcal{F}$ be a permissible class of functions with envelope $F$. Suppose $P F < \infty$. If $P_n$ is obtained by independent sampling from the probability measure $P$ and if $\log N_1(\varepsilon, P_n, \mathcal{F}) = o_p(n)$ for each fixed $\varepsilon > 0$, then $\sup_{\mathcal{F}} |P_n f - P f| \to 0$ almost surely.
PROOF. Problem 11 (or the slightly less formal symmetry argument leading up to Theorem 21 in Section 4) shows that $\{\|P_n - P\|\}$ is a reversed submartingale; it converges almost surely to a constant. It will suffice if we deduce from the approximation condition that $\{\|P_n - P\|\}$ converges in probability to zero.

Exploit integrability of the envelope to truncate the functions back to a finite range. Given $\varepsilon > 0$, choose a constant $K$ so large that $P F\{F > K\} < \varepsilon$. Then
$$\sup_{\mathcal{F}} |P_n f - P f| \le \sup_{\mathcal{F}} |P_n f\{F \le K\} - P f\{F \le K\}| + \sup_{\mathcal{F}} P_n |f| \{F > K\} + \sup_{\mathcal{F}} P |f| \{F > K\}.$$
Because $|f| \le F$ for each $f$ in $\mathcal{F}$, the last two terms sum to less than
$$P_n F\{F > K\} + P F\{F > K\}.$$
This converges almost surely to $2 P F\{F > K\}$, which is less than $2\varepsilon$. It remains for us to show that the supremum over the functions $f\{F \le K\}$ converges in probability to zero. As truncation can only decrease the $\mathcal{L}^1(P_n)$ distance between two functions, the condition on log covering numbers also
holds if each $f$ is replaced by its truncation; without loss of generality we may assume that $|f| \le K$ for each $f$ in $\mathcal{F}$.

In the two SYMMETRIZATION steps of the proof of the Glivenko-Cantelli theorem (Section 3) we showed that
$$\mathbb{P}\{ \|P_n - P\| > \varepsilon \} \le 4\, \mathbb{P}\{ \|P_n^\circ\| > \tfrac14 \varepsilon \} \quad \text{for } n \ge 8\varepsilon^{-2},$$
where $\|\cdot\|$ denoted a supremum over intervals $(-\infty, t]$ of the real line. The signed measure $P_n^\circ$ put mass $\pm n^{-1}$ on each observation $\xi_1, \ldots, \xi_n$, the random $\pm$ signs being decided independently of the $\{\xi_i\}$. The argument works just as well if $\|\cdot\|$ denotes a supremum over $\mathcal{F}$, the interpretation adopted in the current section. The only property of the indicator function $(-\infty, t]$ needed in the SYMMETRIZATION steps was the boundedness, which implied $\operatorname{var}(P_n(-\infty, t]) \le n^{-1}$. This time an extra factor of $K^2$ would appear in the lower bound for $n$.

With intervals we were able to reduce $\|P_n^\circ\|$ to a maximum over a finite collection; for functions the reduction will not be quite so startling. Given $\xi$, choose functions $g_1, \ldots, g_M$, where $M = N_1(\tfrac18 \varepsilon, P_n, \mathcal{F})$, such that
$$\min_j P_n |f - g_j| \le \tfrac18 \varepsilon \quad \text{for each } f \text{ in } \mathcal{F}.$$
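Write $f^*$ for the $g_j$ at which the minimum is achieved. Now we reap the benefits of approximation in the $\mathcal{L}^1(P_n)$ sense. For any function $g$,
$$|P_n^\circ g| = \Bigl| n^{-1} \sum_{i=1}^n \pm g(\xi_i) \Bigr| \le n^{-1} \sum_{i=1}^n |g(\xi_i)| = P_n |g|.$$
Choose $g = f - f^*$ for each $f$ in turn.
$$\begin{aligned}
\mathbb{P}\Bigl\{ \sup_{\mathcal{F}} |P_n^\circ f| > \tfrac14 \varepsilon \Bigm| \xi \Bigr\}
&\le \mathbb{P}\Bigl\{ \sup_{\mathcal{F}} \bigl[ |P_n^\circ f^*| + P_n |f - f^*| \bigr] > \tfrac14 \varepsilon \Bigm| \xi \Bigr\} \\
&\le \mathbb{P}\Bigl\{ \max_j |P_n^\circ g_j| > \tfrac18 \varepsilon \Bigm| \xi \Bigr\} \quad \text{because } P_n |f - f^*| \le \tfrac18 \varepsilon \\
&\le N_1(\tfrac18 \varepsilon, P_n, \mathcal{F}) \max_j \mathbb{P}\{ |P_n^\circ g_j| > \tfrac18 \varepsilon \mid \xi \}.
\end{aligned}$$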
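Once again Hoeffding's Inequality (Appendix B) gives an excellent bound on the conditional probabilities for each $g_j$.
$$\begin{aligned}
\mathbb{P}\{ |P_n^\circ g_j| > \tfrac18 \varepsilon \mid \xi \}
&= \mathbb{P}\Bigl\{ \Bigl| \sum_{i=1}^n \pm g_j(\xi_i) \Bigr| > \tfrac18 n\varepsilon \Bigm| \xi \Bigr\} \\
&\le 2 \exp\Bigl[ -2(\tfrac18 n\varepsilon)^2 \Big/ \sum_{i=1}^n (2 g_j(\xi_i))^2 \Bigr] \\
&\le 2 \exp(-n\varepsilon^2 / 128K^2) \quad \text{because } |g_j| \le K.
\end{aligned}$$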
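When the logarithm of the covering number is less than $n\varepsilon^2/256K^2$, the inequality
$$\mathbb{P}\{ \|P_n^\circ\| > \tfrac14 \varepsilon \mid \xi \} \le 2 \exp\bigl[ \log N_1(\tfrac18 \varepsilon, P_n, \mathcal{F}) - n\varepsilon^2/128K^2 \bigr]$$
will serve us well; otherwise use the trivial upper bound of 1. Integrate out.
$$\mathbb{P}\{ \|P_n^\circ\| > \tfrac14 \varepsilon \} \le 2 \exp(-n\varepsilon^2/256K^2) + \mathbb{P}\{ \log N_1(\tfrac18 \varepsilon, P_n, \mathcal{F}) > n\varepsilon^2/256K^2 \}.$$
Both terms on the right-hand side of the inequality converge to zero. □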
For some classes of functions the conditions of the theorem are easily met because $N_1(\varepsilon, P_n, \mathcal{F})$ remains bounded for each fixed $\varepsilon > 0$. This happens if the graphs of the functions in $\mathcal{F}$ form a polynomial class of sets. The graph of a real-valued function $f$ on a set $S$ is defined as the subset
$$G_f = \{ (s, t) : 0 \le t \le f(s) \text{ or } f(s) \le t \le 0 \}$$
of $S \otimes \mathbb{R}$. We learn something about the covering numbers of a class $\mathcal{F}$ by observing how its graphs pick out sets of points in $S \otimes \mathbb{R}$.

25 Approximation Lemma. Let $\mathcal{F}$ be a class of functions on a set $S$ with envelope $F$, and let $Q$ be a probability measure on $S$ with $0 < QF < \infty$. If the graphs of functions in $\mathcal{F}$ form a polynomial class of sets then
$$N_1(\varepsilon QF, Q, \mathcal{F}) \le A \varepsilon^{-W}$$
for $0$ … $\varepsilon QF$ if $i \ne j$.

Maximality means that no larger collection has the same property; each $f$ must lie within $\varepsilon QF$ of at least one $f_j$. Thus $m \ge N_1(\varepsilon QF, Q, \mathcal{F})$.

Choose independent points $(s_1, t_1), \ldots, (s_k, t_k)$ in $S \otimes \mathbb{R}$ by a two-step procedure. First sample $s_i$ from the distribution $Q(\cdot\, F)/Q(F)$ on $S$. Given $s_i$, sample $t_i$ from the conditional distribution Uniform$[-F(s_i), F(s_i)]$. The value of $k$, which depends on $m$ and $\varepsilon$, will be specified soon.
The graphs $G_1$ and $G_2$, corresponding to $f_1$ and $f_2$, pick out the same subset from this sample if and only if every one of the $k$ points lands outside the region $G_1 \triangle G_2$. This occurs with probability equal to
$$\begin{aligned}
\prod_{i=1}^k \bigl[ 1 - \mathbb{P}\,\mathbb{P}\{ (s_i, t_i) \in G_1 \triangle G_2 \mid s_i \} \bigr]
&= \bigl[ 1 - \mathbb{P}\bigl( |f_1(s_1) - f_2(s_1)| / 2F(s_1) \bigr) \bigr]^k \\
&= \bigl[ 1 - Q|f_1 - f_2| / 2Q(F) \bigr]^k \\
&\le (1 - \tfrac12 \varepsilon)^k \\
&\le \exp(-\tfrac12 k\varepsilon).
\end{aligned}$$
Apply the same reasoning to each of the $\binom{m}{2}$ possible pairs of functions $f_i$ and $f_j$. The probability that at least one pair of graphs picks out the same set of points from the $k$ sample is less than
$$\binom{m}{2} \exp(-\tfrac12 k\varepsilon) \le \tfrac12 \exp(2 \log m - \tfrac12 k\varepsilon).$$
Choose $k$ to be the smallest value that makes the upper bound strictly less than 1. Certainly $k \le (1 + 4 \log m)/\varepsilon$. With positive probability the graphs all pick different subsets from the $k$ sample; there exists a set of $k$ points in $S \otimes \mathbb{R}$ from which the polynomial class of graphs can pick out $m$ distinct subsets. From the defining property of polynomial classes, there exist constants $B$ and $V$ such that $m \le B k^V$ for all $k \ge 1$. Find $n_0$ so that $(1 + 4 \log n)^V \le n^{1/2}$ for all $n \ge n_0$. Then either $m < n_0$ or $m \le B m^{1/2} \varepsilon^{-V}$. Set $W = 2V$ and $A = \max(B^2, n_0)$. □

To show that a class of graphs has only polynomial discrimination we can call upon the results of Section 4. We build up the graphs as finite unions and intersections (Lemma 15) of simpler classes of sets. We establish their discrimination properties by direct geometric argument (as for intervals and quadrants) or by exploitation of finite dimensionality (as in Lemma 18) of a generating class of functions.
26 Example. Define a center of location for a distribution $P$ on $\mathbb{R}^m$ as any value $\theta$ minimizing the criterion function
$$H(\theta, P) = P \phi(|x - \theta|),$$
where $\phi(\cdot)$ is a continuous, nondecreasing function on $[0, \infty)$ and $|\cdot|$ denotes the usual euclidean distance. If $P\phi(|x|) < \infty$ and $\phi(\cdot)$ does not increase too rapidly, in the sense that there exists a constant $C$ for which $\phi(2t) \le C\phi(t)$ for all $t$, then the function $H(\cdot, P)$ is well defined:
$$H(\theta, P) \le P\bigl[ \phi(2|\theta|)\{|x| \le |\theta|\} + C\phi(|x|)\{|x| > |\theta|\} \bigr] < \infty.$$
…
the minimizing value will be achieved (Problem 21); extra regularity conditions on $P$, which are satisfied by distributions such as the multivariate normal, ensure uniqueness (Problem 22). For this example, let us not get bogged down by the exact conditions needed; just assume that $H(\cdot, P)$ has a unique minimum at some $\theta_0$. Estimate $\theta_0$ by any value $\theta_n$ that minimizes the sample criterion function $H(\cdot, P_n)$. To show that $\theta_n$ converges to $\theta_0$ almost surely, it will suffice to prove that $H(\theta_n, P) \to H(\theta_0, P)$ almost surely, because $H(\theta, P)$ is bounded away from $H(\theta_0, P)$ outside each neighborhood of $\theta_0$.

The argument follows the same pattern as for k-means (Example 4). First show that $\theta_n$ eventually stays within a large compact ball $\{|x| \le K\}$. Choose the $K$ greater than $|\theta_0|$ and large enough to ensure that $P$ …

… $> 2\delta \mid X\} \le M \max_j \mathbb{P}\{ |Z(g_j) - Z'(g_j)| > 2\delta \mid X \}$.

Fix a $g$ with $|g| \le 1$. Bound $|Z(g) - Z'(g)|$ by
$$|Z(g)^2 - Z'(g)^2| / [Z(g) + Z'(g)],$$
which is less than
II.6. Rates of Convergence
thanks to the inequality $a^{1/2} + b^{1/2} \ge (a + b)^{1/2}$, for $a, b \ge 0$. Apply Hoeffding's Inequality (Appendix B).
IP{IZ(g)  Z'(g) 1 > 2 2n 11'0 I/( I1'0 I + 8).
Show that $\liminf \inf_{\mathcal{F} \setminus \mathcal{N}} P_n f > \gamma_0$ almost surely. [Trivial if $\gamma_0 < 0$.] Deduce that $f_n$ belongs to $\mathcal{N}$ eventually (almost surely). Now read the case A consistency proof of Huber (1967). Compare the last part of his argument with our Theorem 3.

[3] Call a class $\mathcal{F}$ of functions universally separable if there exists a countable subclass $\mathcal{F}_0$ such that each $f$ in $\mathcal{F}$ can be written as a pointwise limit of a sequence in $\mathcal{F}_0$. If $\mathcal{F}$ has an envelope $F$ for which $P F < \infty$, prove that universal separability implies measurability of $\|P_n - P\|$.

[4] For any finite-dimensional vector space $\mathcal{G}$ of real functions on $S$, the class $\mathcal{C}$ of sets of the form $\{g \ge 0\}$, for $g$ in $\mathcal{G}$, is universally separable. [Express each $g$ in $\mathcal{G}$ as a linear combination of some fixed finite collection of nonnegative functions. Let $\mathcal{G}_0$ be the countable subclass generated by taking rational coefficients. For each $g$ in $\mathcal{G}$ there exists a sequence $\{g_n\}$ in $\mathcal{G}_0$ for which $g_n \downarrow g$. Show that $\{g_n \ge 0\} \downarrow \{g \ge 0\}$ pointwise.]

[5] The operations in Lemma 15 preserve universal separability.
Problems
[6J For a universally separable class 2&, the quantity v" defined in Theorem 21 is unchanged if 2& is replaced by its countable subclass 2&0' Prove that v" is measurable. [7J Prove that the class of indicator functions of closed, convex subsets of IRd is universally separable. [Consider convex hulls of finite sets of points with rational coordinates.J [8J Theorem 16 informs us that the class of quadrants in IR2 picks out at most 1 + tN + tN 2 subsets from any collection of N points. Find a configuration for which this bound is achieved. [9J For N ~ 2 the sum of binomial coefficients singled out by Theorem 16 is bounded by NV. [Count subsets of {I, ... , N} containing fewer than V elements by arranging each subset into increasing order then padding it out with enough copies of the largest element to bring it up to a Vtuple. Don't forget the empty set.J [lOJ Let M be a subset of [0, IJ that has inner lebesgue measure zero and outer lebesgue measure one (Halmos 1969, Section 16). Define the probability measure fl as the trace of lebesgue measure on M (the measure defined in Theorem A of Halmos (1969), Section 17). Assuming the validity of the continuum hypothesis, put M into a onetoone correspondence with the space [0, U) of all ordinals less than the first uncountable ordinal U (Kelley 1955, Chapter 0). Define 2& as the class of subsets of [0, IJ corresponding to the initial segments [0, xJ in [0, U). (a) Show that 2& has linear discrimination. [It shatters no twopoint set.J (b) Equip MOO with its product (jfield and product measure floo. Generate observations ~ I' ~2' ... on P = Uniform(O, 1) by taking them as the coordinate projection maps on MOO. Construct empirical measures {Pn} from these observations. Show that sUPii! IPn D  PD I is identically one. (c) Repeat the construction with the same 2&, but replace (MOO, floo) by a countable product of copies of MC equipped with the product measure A00, where A equals the trace of lebesgue measure on MC. 
This time sUPii! IPnD  PDI is identically zero. [Funny things can happen when 2& has measurability problems. Argument adapted from Pollard (1981a) and Durst and Dudley (1981).J [l1J For independent and identically distributed random elements {O, write gn for the (jfield generated by all symmetric functions of ~I"'" ~N as N ranges over n, n + 1, .... For a fixed function f, apply the usual reversed martingale argument (Ash 1972, page 310) to show that lP(Pnflgn+l ) = Pn+d If P(sup,? If I) < 00, deduce
for every class of functions g; that makes both suprema measurable. [12J Here is one way to prove necessity in Theorem 21. Suppose liPn  PII + 0 almost surely. Construct fl: by placing mass n  I at each ~i for which the sign variable (ji equals + 1; construct fl;; similarly from the remaining ~/s. Notice that P~ = fl:  fl;;· Let N be the number of sign variables (j I, ... , (jn equal to + 1. (a) Prove that (n/N)fln+ has the same distributions as P N • [What if N = O?J (b) Deduce that both Ilfl:  tP11 + 0 and Ilfl;;  tPII + 0, in probability.
(c) Deduce that Ilfl:  fl;; II > 0 in probability. (d) Suppose PJ shatters a set F consisting of at least n1J of the points ~1"'" ~n' Without loss of generality, at least tn1J of the points in F are allocated to fl:. Choose a D to pick out just those points from F. Use independence properties of the {O';} to show that, with high probability, fl;;(D\F) and fl;;(D\F) are nearly equal. [Argue conditionally on Pn and the O'i for those ~i in F.] (e) Show that fl;;(D)  fl;;(D) ~ t1J with high conditional probability. This contradicts (c).
[13] Rederive the uniform strong law for convex sets (Example 22) by the direct approximation method of Theorem 2. [14] Let ff be a permissible class with natural envelope F = sup&>" II I. If IlPn  PII > 0 almost surely and if sup&>" IPII < 00 then PF < 00. [The condition on sup&>" IPII excludes trivial cases such as ff consisting of all constant functions. From IlPn  PII < e and IlPnl  PII < e deduce nll/(~n)  PII < 2e; almost sure convergence implies
lP{s~ 1/(~n) 
PII
~n
infinitely often} = O.
Invoke the nontrivial half of the BorelCantelli lemma, then replace each ~1 to get 00
>
lP(sulI/(~l) 
PII)
~ IPF(~l) 
~n
by
constant.
Noted by Gine and Zinn (1984).] [15] Here is an example of how Theorem 24 can go wrong if the envelope F has PF = 00. Let P be the Uniform(O, 1) distribution and let ff be the countable class consisting of the sequence {fi}, where J;(x) = xz{(i + 1)1 ::;; X < i 1 }. Show that the graphs have polynomial discrimination and that Ph = 1 for every i. But SUPi Pnh > 00 almost surely. [Find an exn with nex; > 0, such that [0, exn] contains at least one observation, for n large enough.] [16] Let ff be the class of all monotone increasing functions on IR taking values in the range [0, 1]. The class of graphs does not have polynomial discrimination, but it does satisfy the conditions of Theorem 24 for every P. [If {x;} and {tJ are strictly increasing sequences, the graphs can shatter the set of points (Xl' t 1), ... , (x N, tN)'] [17] For the ff of the previous problem, rewrite Pnl as g PnU ~ t} dt. Deduce uniform almost sure convergence from the classical result for intervals. [Suggested by Peter Gaenssler.] [18] Let ff and ':§ be classes of functions on S with envelopes F and G. Write :7 for the class of all sums I + g with I in ff and g in ':§. Prove that
N;(bQ(F
+ G), Q, ,9»
::;;
N;CbQF, Q, ff)N;CbQG, Q,
':§)
for
i = 1,2.
[19] A condition involving only covering numbers for P would not be enough to give a uniform strong law of large numbers. Let P be Uniform(O, 1). Let PJ consist of all sets that are unions of at most n intervals each with length less than n z, for n = 1,2, .... Show that sup§) IPnD  PDI = 1, even though N 1 (e, P, PJ) < 00 for each e > O.
[20] Deduce Theorem 2 from Theorem 24.
[21] Under the conditions set down in Example 26, the function H(·, P) achieves its minimum. [If H(el> P) converges to the infimum as i + 00, use Fatou's lemma to show that the infimum is achieved at a cluster point of {eJ; condition (27) rules out cluster points at infinity.] Notice that only left continuity of cP is needed for the proof. Find and overcome the extra complications in the argument that would be caused if cp were only leftcontinuous. [22] This problem assumes familiarity with convexity methods, as described in Section 4.2 of Tong (1980). Suppose that the distribution P of Example 26 has a density p(.) whose high level setsD, = {p ;::: t} are convex and symmetric about the origin. Prove that H(e, P) has a minimum at e = O. [By Fubini's theorem, H(e, P) =
$$\iiint \{0 \le s \le \varphi(|x - \theta|)\}\{0 \le t \le p(x)\}\,ds\,dt\,dx = \iint \operatorname{volume}[B(\theta, \alpha(s))^c \cap D_t]\,ds\,dt,$$
where B(e, r) denotes the closed ball of radius r centered at e. The volume of B(e, r) n D, is maximized at e = 0.] When is the minimum unique? Show that a multivariate normal with zero means and nonsingular variance matrix satisfies the condition for uniqueness. [23J Suppose K} are independent, but that the distribution of ~;, call it Q;, changes with i. Write pIn) for the average distribution of the first n observations, pIn) = n I(QI + ... + Qn). Show for a permissible polynomial class !!fi that sup IPnD  p(n)DI+ 0
almost surely.
!'J
What difficulties occur in the extension to more general classes of sets, or functions? [Adapt the doublesample symmetrization method of Lemma 33: sample a pair (X Zi I, X Zi) from Qi; use the selection variable Ti to choose which member of the pair is allocated to Pn , and which to P~.J [24J Show that
Nz(fib, tcQI
+ Qz), $'):::;
Nz(b, QI, $')Nz{b, Qz, $').
[Let hi be the density of Ql with respect to QI functions gl {hi> t} + gz{h 1 :::; t}.]
+ Qz. Consider the approximating
[25] Let P be the uniform distribution on [0, 1]Z. For a sample of n independent observations on P show that IP{some square of area
IXn
contains no observations} + 1
if IXn is just slightly smaller than n I log n. [Break [0, I]Z into N subsquares each with area slightly less than n I log n. Set Ai = {ith subsquare contains at least one observation}. Show that IP(Ai+ IIAI n··· n Ai) :::; IPA i +1' The probability that each of these subsquares contains at least one point is less than (IP All. BertrandRetali (1978).]
[26J In one dimension, write the bias term for the kernel density estimate as jj(x)  p(x) =
$$\int K(z)[p(x + \sigma z) - p(x)]\,dz.$$
Suppose $p$ has a bounded derivative, and that $\int |z|K(z)\,dz < \infty$. Show that the bias is of order $O(\sigma)$. Generalize to higher dimensions. [If $p$ has higher-order smooth derivatives, and $K$ is replaced by a function orthogonal to low degree polynomials, the bias can be made to depend only on higher powers of $\sigma$.]
[27] The graphs of translated kernels $K_{x,\sigma}$ have polynomial discrimination for any $K$ on the real line with bounded variation. [Break $K$ into a difference of monotone functions.]
[28] Let $K$ be a density on $\mathbb{R}^d$ of the form $h(|x|)$, where $h(\cdot)$ is a monotone decreasing function on $[0, \infty)$. Adapt the method of Example 26 to prove that the graphs of the functions $K_{x,\sigma}$ have polynomial discrimination.
[29] Modify the density estimate of Example 38 for distributions on the real line by choosing $K$ as a function of bounded variation for which $\int K(z)\,dz = 0$ and $\int zK(z)\,dz = 1$ and $\int |zK(z)|\,dz < \infty$. Replace $p_n$ by $q_n(x) = \sigma^{-2}P_nK_{x,\sigma}$. Show that $\mathbb{P}q_n(x)$ converges to the derivative of $p$. How fast can $\sigma$ tend to zero without destroying the almost sure uniform convergence $\sup_x |q_n(x) - \mathbb{P}q_n(x)| \to 0$?
CHAPTER III
Convergence in Distribution in Euclidean Spaces

... which runs through some of the standard methods for proving convergence in distribution of sequences of random vectors, and for proving weak convergence of sequences of probability measures on euclidean spaces. These include: checking convergence for expectations of smooth functions of the random vectors; checking moment conditions for sums of independent random variables (the Central Limit Theorems); checking convergence of characteristic functions (the Continuity Theorems for characteristic functions); and reduction to analogous problems of almost sure convergence via quantile transformations.
III.1. The Definition

Convergence in distribution of a sequence $\{X_n\}$ of real random variables is traditionally defined to mean convergence of distribution functions at each continuity point of the limit distribution function:

$$\mathbb{P}\{X_n \le x\} \to \mathbb{P}\{X \le x\} \quad\text{whenever}\quad \mathbb{P}\{X = x\} = 0.$$
Although convenient for work with order statistics and quantiles, this definition does have some disadvantages. Distribution functions are not well suited to calculations involving sums of independent random variables. The simplest proofs of the Central Limit Theorem, for example, do not directly check pointwise convergence of distribution functions; they show that sequences of characteristic functions, or expectations of other smooth functions of the sums, converge. With the extensions to sequences of random vectors (measurable maps into multidimensional euclidean space $\mathbb{R}^d$), the difficulties multiply. And for random elements of more general spaces not equipped with a partial ordering, even the concept of distribution function disappears. With all this in mind, let us start afresh from an equivalent definition, which lends itself more readily to generalization.

1 Definition. A sequence of random vectors $\{X_n\}$ is said to converge in distribution to a random vector $X$, written $X_n \rightsquigarrow X$, if $\mathbb{P}f(X_n) \to \mathbb{P}f(X)$ for every $f$ belonging to the class $\mathcal{C}(\mathbb{R}^k)$ of all bounded, continuous, real functions on $\mathbb{R}^k$. □

This notion of convergence does not specify the limit random vector uniquely. If $X$ and $Y$ have the same distribution, that is, if

$$\mathbb{P}\{X \in A\} = \mathbb{P}\{Y \in A\} \quad\text{for each borel set } A,$$
then $X_n \rightsquigarrow X$ means the same as $X_n \rightsquigarrow Y$. This invites slight abuses of notation. For example, it is convenient to write $X_n \rightsquigarrow N(0, 1)$, meaning that the sequence of real random variables $\{X_n\}$ converges in distribution to any random variable having a standard normal distribution. The symbol $N(0, 1)$ stands not only for a particular probability measure on the borel σ-field $\mathcal{B}(\mathbb{R})$ but also for any random variable having that distribution. Similarly, we can avoid much circumlocution by writing, for example,

$$n^{-1/2}[\mathrm{Bin}(n, \tfrac12) - \tfrac12 n] \rightsquigarrow N(0, \tfrac14),$$

instead of: if $X_n$ has a binomial distribution with parameters $n$ and $\tfrac12$ and $X$ has a normal distribution with mean zero and variance $\tfrac14$, then $n^{-1/2}(X_n - \tfrac12 n) \rightsquigarrow X$.

In general, for probability measures on the borel σ-field $\mathcal{B}(\mathbb{R}^k)$, define $P_n \rightsquigarrow P$ to mean convergence in distribution for random vectors having these distributions. This definition is equivalent to the requirement: $P_nf \to Pf$ for every $f$ in $\mathcal{C}(\mathbb{R}^k)$. Most authors call this weak convergence.
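The binomial illustration can be checked numerically for one particular bounded, continuous function. The sketch below is only an illustration under stated assumptions: the choice $f = \cos$ and the helper names `expect_binomial` and `gaps` are ours, not part of the text. It computes $\mathbb{P}f(n^{-1/2}[\mathrm{Bin}(n, \tfrac12) - \tfrac12 n])$ exactly from the binomial probabilities and compares it with $\mathbb{P}\cos(N(0, \tfrac14)) = e^{-1/8}$.

```python
import math


def expect_binomial(f, n):
    """Exact value of P f(n^{-1/2} [Bin(n, 1/2) - n/2]), summed over the pmf."""
    total = 2 ** n
    return sum(
        math.comb(n, k) / total * f((k - n / 2) / math.sqrt(n))
        for k in range(n + 1)
    )


def f(x):
    # Assumed test function: bounded and continuous on the real line.
    return math.cos(x)


# For a N(0, s^2) variable, P cos = exp(-s^2 / 2); here s^2 = 1/4.
LIMIT = math.exp(-1.0 / 8.0)


def gaps(ns=(1, 10, 100, 1000)):
    """Distance between the binomial expectation and its normal limit."""
    return [abs(expect_binomial(f, n) - LIMIT) for n in ns]


if __name__ == "__main__":
    for n, gap in zip((1, 10, 100, 1000), gaps()):
        print(n, gap)
```

A single test function proves nothing about convergence in distribution, of course; the definition requires the convergence for every $f$ in $\mathcal{C}(\mathbb{R})$.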
III.2. The Continuous Mapping Theorem

Suppose $X_n \rightsquigarrow X$, as random vectors taking values in $\mathbb{R}^k$, and let $H$ be a measurable map from $\mathbb{R}^k$ into $\mathbb{R}^s$. Does it follow that $HX_n \rightsquigarrow HX$? That is, does $\mathbb{P}f(HX_n) \to \mathbb{P}f(HX)$ for every $f$ in $\mathcal{C}(\mathbb{R}^s)$? If $H$ were continuous, $f \circ H$ would belong to $\mathcal{C}(\mathbb{R}^k)$ for every $f$ in $\mathcal{C}(\mathbb{R}^s)$. The result would be trivially true. The convergence $HX_n \rightsquigarrow HX$ also holds under a slightly weaker assumption: it suffices that $H$ be continuous at almost all points of the range of $X$. This will follow as a simple corollary to the next lemma.
2 Convergence Lemma. Let $h$ be a bounded, measurable, real-valued function on $\mathbb{R}^k$, continuous at each point of a measurable set $C$.

(i) Let $\{X_n\}$ be a sequence of random vectors converging in distribution to $X$. If $\mathbb{P}\{X \in C\} = 1$, then $\mathbb{P}h(X_n) \to \mathbb{P}h(X)$.
(ii) Let $\{P_n\}$ be a sequence of probability measures converging weakly to $P$. If $P(C) = 1$, then $P_nh \to Ph$.

PROOF. As the two assertions are similar, we need only prove (ii). Consider any increasing sequence $\{f_i\}$ of bounded, continuous functions for which $f_i \le h$ everywhere and $f_i \uparrow h$ at each point of $C$. Accept for the moment that such a sequence exists. Then weak convergence of $\{P_n\}$ to $P$ implies that

$$P_nf_i \to Pf_i \quad\text{for each fixed } i. \tag{3}$$
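On the left-hand side bound $f_i$ by $h$: $\liminf P_nh \ge Pf_i$ for each fixed $i$. Invoke monotone convergence as $i$ tends to infinity on the right-hand side; because $P(C) = 1$ and $f_i \uparrow h$ at each point of $C$, it gives $Pf_i \uparrow Ph$, whence

$$\liminf P_nh \ge Ph. \tag{4}$$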
The companion inequality obtained by substituting $-h$ for $h$ combines with (4) to complete the proof.

Now to construct the functions $\{f_i\}$. They must be chosen from the family $\mathcal{F} = \{f \in \mathcal{C}(\mathbb{R}^k) : f \le h\}$. If we can find a countable subfamily of $\mathcal{F}$, say $\{g_1, g_2, \dots\}$, whose pointwise supremum equals $h$ at each point of $C$, then setting $f_i = \max\{g_1, \dots, g_i\}$ will do the trick. Without loss of generality suppose $h > 0$. (A constant could be added to $h$ to achieve this.) For each subset $A$ of $\mathbb{R}^k$ define the distance function $d(\cdot, A)$ by $d(x, A) = \inf\{|x - y| : y \in A\}$. It is a continuous function of $x$, for each fixed $A$. For positive integral $m$ and positive rational $r$ define

$$f_{m,r}(x) = r \wedge m\,d(x, \{h \le r\}).$$
[Figure: the function $h$, the level $r$, and the approximating function $f_{m,r}$.]
Each $f_{m,r}$ is bounded and continuous; it is at most $r$ if $h(x) > r$; it takes the value zero if $h(x) \le r$: it belongs to $\mathcal{F}$. Given a point $x$ in $C$ and an $\varepsilon > 0$, choose a positive rational number $r$ with $h(x) - \varepsilon < r < h(x)$. Continuity of $h$ at $x$ keeps its value greater than $r$ in some neighborhood of $x$. Consequently, $d(x, \{h \le r\}) > 0$ and $f_{m,r}(x) = r > h(x) - \varepsilon$ for all $m$ large enough. □

Weak convergence of $\{P_n\}$ was needed only to establish the convergence (3) for the functions $\{f_i\}$. These functions were, however, not just continuous, but uniformly continuous. The functions from which they were constructed even satisfied a Lipschitz condition:

$$|f_{m,r}(x) - f_{m,r}(y)| \le m|x - y|.$$
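To see the construction of the $f_{m,r}$ at work, here is a minimal sketch on the real line. Everything specific in it is an assumption made for illustration: the function `h` (a step function discontinuous only at the origin), the finite grid `GRID` that stands in for the real line when computing $d(x, \{h \le r\})$, and the helper names. It is not the construction used in the proof, only a numerical shadow of it.

```python
import numpy as np

# Finite grid standing in for the real line (an assumption of this sketch).
GRID = np.linspace(-3.0, 3.0, 6001)


def h(x):
    # Hypothetical example: bounded, strictly positive, discontinuous only at 0.
    return 1.0 + (np.asarray(x) > 0).astype(float)


def dist_to_level_set(x, r):
    """Grid approximation to d(x, {h <= r}); infinite when the set is empty."""
    x = np.asarray(x, dtype=float)
    level_set = GRID[h(GRID) <= r]
    if level_set.size == 0:
        return np.full(x.shape, np.inf)
    return np.min(np.abs(x[:, None] - level_set[None, :]), axis=1)


def f_mr(x, m, r):
    """f_{m,r}(x) = min(r, m * d(x, {h <= r})), evaluated at a 1-d array x."""
    return np.minimum(r, m * dist_to_level_set(x, r))


if __name__ == "__main__":
    x = np.array([-1.0, 1.0])
    for r in (0.9, 1.9):
        print(r, f_mr(x, 50, r), h(x))
```

At the continuity point $x = 1$, where $h(x) = 2$, taking $r$ just below 2 and $m$ large gives $f_{m,r}(x) = r$, as in the last step of the proof.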
Thus the lemma could have been proved using only convergence of expectations of bounded, uniformly continuous functions of the random vectors. In particular, such a requirement would imply the convergence $P_nh \to Ph$ for each $h$ in $\mathcal{C}(\mathbb{R}^k)$.

5 Corollary. If $P_nf \to Pf$ for every bounded, uniformly continuous $f$ then $P_n \rightsquigarrow P$. (And similarly for convergence in distribution of random vectors.) □

The lemma also provides an answer to the question asked at the start of the section.

6 Continuous Mapping Theorem. Let $H$ be a measurable mapping from $\mathbb{R}^k$ into $\mathbb{R}^s$. Write $C$ for the set of points in $\mathbb{R}^k$ at which $H$ is continuous. If a sequence $\{X_n\}$ of random vectors taking values in $\mathbb{R}^k$ converges in distribution to a random vector $X$ for which $\mathbb{P}\{X \in C\} = 1$, then $HX_n \rightsquigarrow HX$.

PROOF. For each fixed $f$ in $\mathcal{C}(\mathbb{R}^s)$, the bounded function $f \circ H$ is continuous at all points of $C$. □
Some authors seem to regard this result as trivial and obvious; they scarcely notice that, at least implicitly, they make use of it for many applications. It is better to recognize these covert appeals to the Continuous Mapping Theorem. That way the more general form of the result in Chapter IV will come as no surprise.
7 Example. If the real random variables $\{X_n\}$ converge in distribution to $X$ then $\mathbb{P}\{X_n \le x\} \to \mathbb{P}\{X \le x\}$ at each $x$ for which $\mathbb{P}\{X = x\} = 0$. That is, the sequence converges at each continuity point $x$ of the distribution function of $X$. This holds because $x$ is the only point of discontinuity of (the indicator function of) the set $(-\infty, x]$. Problem 1 shows you how to go the other way, from pointwise convergence of distribution functions to convergence in distribution of the random variables. The same result is true in higher dimensions if the inequalities $X_n \le x$ and $X \le x$ are taken componentwise and $(-\infty, x]$ is interpreted as a multidimensional orthant with vertex at $x$. Continuity of the multidimensional distribution function of $X$ at $x$ requires that $X$ lands on the boundary of $(-\infty, x]$ with zero probability. □

8 Example. Consider the multinomial distribution obtained by independent placement of $n$ objects into $k$ disjoint cells. Write $F_n = (F_{n1}, \dots, F_{nk})$ for the column vector of observed frequencies (cell $i$ receives $F_{ni}$ objects) and $p = (p_1, \dots, p_k)$ for the column vector of cell probabilities. Pearson's chi-square statistic is

$$Z_n = \sum_{i=1}^{k} (F_{ni} - np_i)^2/(np_i).$$
Write $Z_n$ as a function of a standardized column vector, by setting $X_n = n^{-1/2}(F_n - np)$ and $Z_n = X_n'\Delta^{-2}X_n$, where $\Delta$ denotes the diagonal matrix with $\sqrt{p_i}$ as its $i$th diagonal element. By the Multivariate Central Limit Theorem, which will be proved later in this chapter (Theorem 30), the random vectors $\{X_n\}$ converge to a $N(0, V)$ distribution whose variance matrix $V$ has $(i,j)$th element $p_i - p_i^2$ if $i = j$ and $-p_ip_j$ otherwise. Manufacture a random vector with this limit distribution by applying a linear transformation to a column vector $W$ of independent $N(0, 1)$ random variables: $X_n \rightsquigarrow \Delta(I_k - uu')W$, where $u$ denotes the unit column vector $(\sqrt{p_1}, \dots, \sqrt{p_k})$. The mapping $H$ from $\mathbb{R}^k$ into $\mathbb{R}$ defined by

$$Hx = x'\Delta^{-2}x = |\Delta^{-1}x|^2$$

is continuous. Apply the Continuous Mapping Theorem.

$$Z_n = HX_n \rightsquigarrow H\Delta(I_k - uu')W = |(I_k - uu')W|^2 \sim \chi^2_{k-1},$$

because $I_k - uu'$ represents the projection orthogonal to the unit vector $u$. (The squared length of the projection of $W$ onto any $(k-1)$-dimensional subspace has a chi-square distribution with $k - 1$ degrees of freedom.) □

9 Example. Suppose $\xi_1, \xi_2, \dots$ are independent random variables each with a Uniform(0, 1) distribution. Neyman (1937) developed a goodness-of-fit test whose asymptotic properties depended on the behavior of the statistics
$$G_n = n^{-1}\sum_{j=1}^{k}\Big[\sum_{i=1}^{n}\pi_j(\xi_i)\Big]^2,$$

where $\pi_0, \pi_1, \dots$ are given polynomials defined on $[0, 1]$, with the orthonormality property:

$$\int_0^1 \pi_i(y)\pi_j(y)\,dy = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \ne j. \end{cases}$$

Explicitly, $\pi_0(y) = 1$, $\pi_1(y) = \sqrt{12}\,(y - \tfrac12)$, $\pi_2(y) = \sqrt{5}\,[6(y - \tfrac12)^2 - \tfrac12]$, and so on. Define random column vectors

$$X_i = (\pi_1(\xi_i), \dots, \pi_k(\xi_i)) \quad\text{for } i = 1, 2, \dots.$$

The statistic $G_n$ can then be written as $\big|n^{-1/2}\sum_{i=1}^{n} X_i\big|^2$.

The Multivariate Central Limit Theorem (Theorem 30) and the orthonormality properties of the polynomials ensure that

$$n^{-1/2}(X_1 + \dots + X_n) \rightsquigarrow N(0, I_k).$$

The Continuous Mapping Theorem (applied to which map?) allows us to deduce that $G_n \rightsquigarrow \chi^2_k$. Neyman used this to determine the approximate critical region for his test. □
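The linear algebra behind Examples 8 and 9 is easy to verify numerically. The following sketch makes two checks under assumptions of our own: an arbitrary probability vector `p` is used for the multinomial variance matrix, and Gauss-Legendre quadrature (exact for the low-degree polynomials involved) is used for the orthonormality integrals. The function names are hypothetical.

```python
import numpy as np


def pearson_limit_check(p):
    """Return V = diag(p) - pp', the product Delta (I - uu') (I - uu')' Delta,
    and the matrix I - uu' from Example 8."""
    p = np.asarray(p, dtype=float)
    k = p.size
    delta = np.diag(np.sqrt(p))          # diagonal matrix with sqrt(p_i)
    u = np.sqrt(p)                       # unit vector because sum(p) = 1
    proj = np.eye(k) - np.outer(u, u)    # projection orthogonal to u
    v_direct = np.diag(p) - np.outer(p, p)
    v_factored = delta @ proj @ proj.T @ delta
    return v_direct, v_factored, proj


def neyman_gram(order=6):
    """Matrix of integrals of pi_i * pi_j over [0, 1] for i, j = 0, 1, 2."""
    nodes, weights = np.polynomial.legendre.leggauss(order)
    y = 0.5 * (nodes + 1.0)              # map nodes from [-1, 1] to [0, 1]
    w = 0.5 * weights
    pi = np.vstack([
        np.ones_like(y),
        np.sqrt(12.0) * (y - 0.5),
        np.sqrt(5.0) * (6.0 * (y - 0.5) ** 2 - 0.5),
    ])
    return (pi * w) @ pi.T


if __name__ == "__main__":
    v1, v2, proj = pearson_limit_check([0.2, 0.3, 0.5])
    print(np.allclose(v1, v2), np.trace(proj))
    print(np.round(neyman_gram(), 12))
```

The trace of $I_k - uu'$ equals $k - 1$, which matches the degrees of freedom of the limiting chi-square distribution in Example 8.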
III.3. Expectations of Smooth Functions

There are two sorts of perturbation of a random vector $X$ that don't affect the expectation $\mathbb{P}f(X)$ of a smooth, bounded function of $X$ too greatly: changes, however gross, that occur with only small probability; and changes that might occur with high probability but which alter $X$ by only small amounts. These effects are easy to quantify when the smooth function $f$ is uniformly continuous. Suppose $|f(x) - f(z)| < \varepsilon$ whenever $|x - z| < \delta$. Write $\|f\|$ for the supremum of $|f(\cdot)|$. Then for any random vectors $X$ and $Y$, whether dependent or not,

$$\begin{aligned} |\mathbb{P}f(X) - \mathbb{P}f(X + Y)| &\le \mathbb{P}\{|Y| < \delta\}|f(X) - f(X + Y)| + \mathbb{P}\{|Y| \ge \delta\}(|f(X)| + |f(X + Y)|) \\ &\le \varepsilon + 2\|f\|\,\mathbb{P}\{|Y| \ge \delta\}. \end{aligned} \tag{10}$$

The inequality lets us deduce convergence in distribution of a sequence of random vectors from convergence of slightly perturbed sequences.

11 Lemma. Let $\{X_n\}$, $X$ and $Y$ be random vectors for which $X_n + \sigma Y \rightsquigarrow X + \sigma Y$ for each fixed positive $\sigma$. Then $X_n \rightsquigarrow X$.
PROOF. Remember (Corollary 5) we have only to check that $\mathbb{P}f(X_n) \to \mathbb{P}f(X)$ for each bounded, uniformly continuous $f$. Apply inequality (10) with $X$ replaced by $X_n$ and $Y$ replaced by $\sigma Y$.

$$\sup_n |\mathbb{P}f(X_n) - \mathbb{P}f(X_n + \sigma Y)| \le \varepsilon + 2\|f\|\,\mathbb{P}\{|Y| \ge \delta\sigma^{-1}\}.$$

Similarly

$$|\mathbb{P}f(X) - \mathbb{P}f(X + \sigma Y)| \le \varepsilon + 2\|f\|\,\mathbb{P}\{|Y| \ge \delta\sigma^{-1}\}.$$

Choose $\sigma$ small enough to make both right-hand sides less than $2\varepsilon$, then invoke the known convergence of $\{\mathbb{P}f(X_n + \sigma Y)\}$ to $\mathbb{P}f(X + \sigma Y)$ to deduce that $\limsup |\mathbb{P}f(X_n) - \mathbb{P}f(X)| \le 4\varepsilon$. □

Now, instead of thinking of the $\sigma Y$ as a perturbation of the random vectors, treat it as a means for smoothing the function $f$. This can be arranged
by choosing independently of $X$ and the $\{X_n\}$ a random vector $Y$ having a smooth density function with respect to lebesgue measure; for convenience, take $Y$ to have a $N(0, I_k)$ distribution. Integrate out first with respect to the distribution of $Y$ then with respect to the distribution of $X$.

$$\mathbb{P}f(X + \sigma Y) = \mathbb{P}f_\sigma(X),$$

where

$$\begin{aligned} f_\sigma(x) &= \int (2\pi)^{-k/2} f(x + \sigma y)\exp(-\tfrac12|y|^2)\,dy \\ &= \int (2\pi\sigma^2)^{-k/2} f(z)\exp(-\tfrac12|z - x|^2/\sigma^2)\,dz. \end{aligned}$$

The function $f$ has been smoothed by convolution. Dominated convergence justifies repeated differentiation under the last integral sign to prove that $f_\sigma$ belongs to the class $\mathcal{C}^\infty(\mathbb{R}^k)$ of all bounded real functions on $\mathbb{R}^k$ having bounded, continuous partial derivatives of all orders.

12 Theorem. If $\mathbb{P}f(X_n) \to \mathbb{P}f(X)$ for every $f$ in $\mathcal{C}^\infty(\mathbb{R}^k)$ then $X_n \rightsquigarrow X$.

PROOF. Convergence holds for every $f_\sigma$ produced by convolution smoothing. Apply Lemma 11. □
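The smoothing operation $f \mapsto f_\sigma$ can be imitated numerically in one dimension. The sketch below is a rough illustration, assuming Gauss-Hermite quadrature as a stand-in for the exact integral against the $N(0, 1)$ density; the helper `smooth` and the test function `clipped_abs` are names introduced here, not taken from the text. For $f = \cos$ the exact answer is known, $f_\sigma(x) = e^{-\sigma^2/2}\cos x$, which gives a check on the quadrature.

```python
import numpy as np

# Nodes and weights for the weight function exp(-y^2 / 2); dividing the
# weights by sqrt(2 pi) turns them into probabilities for Y ~ N(0, 1).
NODES, WEIGHTS = np.polynomial.hermite_e.hermegauss(60)
PROBS = WEIGHTS / np.sqrt(2.0 * np.pi)


def smooth(f, x, sigma):
    """Quadrature approximation to f_sigma(x) = P f(x + sigma * Y)."""
    x = np.asarray(x, dtype=float)
    return f(x[..., None] + sigma * NODES) @ PROBS


def clipped_abs(x):
    # Assumed example: bounded, uniformly continuous, not differentiable.
    return np.minimum(np.abs(x), 1.0)


if __name__ == "__main__":
    xs = np.linspace(-2.0, 2.0, 9)
    for sigma in (1.0, 0.3, 0.1):
        err = np.max(np.abs(smooth(clipped_abs, xs, sigma) - clipped_abs(xs)))
        print(sigma, err)
```

Because `clipped_abs` has Lipschitz constant one, the smoothed version differs from it by at most $\sigma\,\mathbb{P}|Y| \le \sigma$, in line with the perturbation bound (10).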
For the remainder of the section assume that $k = 1$. That is, consider only real random variables. As the results of Section 5 will show, no great generality will be lost thereby: a trick with multidimensional characteristic functions will reduce problems of convergence of random vectors to their one-dimensional analogues. For expectations of smooth functions of $X$, the effect of small perturbations can be expressed in terms of moments by applying Taylor's theorem. Suppose $f$ belongs to $\mathcal{C}^\infty(\mathbb{R})$. Then, ignoring the niceties of convergence, we can write

$$f(x + y) = f(x) + yf'(x) + \tfrac12 y^2f''(x) + \cdots.$$

Suppose the random variable $X$ is incremented by an independent amount $Y$. Then, again ignoring problems of convergence and finiteness, deduce

$$\mathbb{P}f(X + Y) = \mathbb{P}f(X) + \mathbb{P}(Y)\mathbb{P}f'(X) + \tfrac12\mathbb{P}(Y^2)\mathbb{P}f''(X) + \cdots. \tag{13}$$

Try to mimic the effect of the increment $Y$ by a different increment $W$, also independent of $X$. As long as $\mathbb{P}(Y) = \mathbb{P}(W)$ and $\mathbb{P}(Y^2) = \mathbb{P}(W^2)$, the expectations $\mathbb{P}f(X + Y)$ and $\mathbb{P}f(X + W)$ should differ only by terms involving third or higher moments of $Y$ and $W$. These higher-order terms should amount to very little provided both $Y$ and $W$ are small; the effect of substituting $W$ for $Y$ should be small in that case. This method of substitution can be applied repeatedly for a random variable $Z$ made up of a lot of little independent increments. We can replace
the increments one after another by new independent random variables. If at each substitution we match up the first and second moments, as above, the overall effect on $\mathbb{P}f(Z)$ should involve only a sum of quantities of third or higher order. In the next section this approach, with normally distributed replacement increments, will establish the Liapounoff and Lindeberg forms of the Central Limit Theorem.

To make these approximation ideas more precise we need to bound the remainder terms in the informal expansion (13). Because only the first two moments of the increments are to be matched, a Taylor expansion to quadratic terms will suffice. Existence of third derivatives for $f$ will help to control the error terms. Assume $f$ belongs to the class $\mathcal{C}^3(\mathbb{R})$ of all bounded real functions on $\mathbb{R}$ having bounded continuous derivatives up to third order. Then the remainder term in the Taylor expansion

$$f(x + y) = f(x) + yf'(x) + \tfrac12 y^2f''(x) + R(x, y) \tag{14}$$

can be expressed as

$$R(x, y) = \tfrac16 y^3f'''(x + \theta_1y)$$

with $\theta_1$ (depending on $x$ and $y$) between 0 and 1. Write $\|f'''\|$ for the supremum of $|f'''(\cdot)|$. Then

$$|R(x, y)| \le \tfrac16\|f'''\|\,|y|^3. \tag{15}$$

Set $C$ equal to $\tfrac16\|f'''\|$. Then from (14) and (15),

$$|\mathbb{P}f(X + Y) - \mathbb{P}f(X) - \mathbb{P}(Y)\mathbb{P}f'(X) - \tfrac12\mathbb{P}(Y^2)\mathbb{P}f''(X)| \le \mathbb{P}|R(X, Y)| \le C\,\mathbb{P}(|Y|^3).$$

Apply the same argument with $Y$ replaced by the increment $W$, which is also independent of $X$. Because $\mathbb{P}(Y) = \mathbb{P}(W)$ and $\mathbb{P}(Y^2) = \mathbb{P}(W^2)$, when the resulting expansion for $\mathbb{P}f(X + W)$ is subtracted from the expansion for $\mathbb{P}f(X + Y)$ most of the terms cancel out, leaving

$$|\mathbb{P}f(X + Y) - \mathbb{P}f(X + W)| \le \mathbb{P}|R(X, Y)| + \mathbb{P}|R(X, W)| \le C\,\mathbb{P}(|Y|^3) + C\,\mathbb{P}(|W|^3). \tag{16}$$

This inequality is sharp enough for the proof of a limit theorem for sums of independent random variables with third moments.
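Inequality (16) can be checked in closed form for one convenient choice, which we assume purely for illustration: $f = \cos$ (so that $\|f'''\| = 1$ and $C = \tfrac16$), $X$ distributed $N(0, 1)$, $Y$ equal to $\pm a$ with probability $\tfrac12$ each, and $W$ distributed $N(0, a^2)$. The first two moments of $Y$ and $W$ agree, and characteristic functions give $\mathbb{P}\cos(X + Y) = e^{-1/2}\cos a$ and $\mathbb{P}\cos(X + W) = e^{-1/2}e^{-a^2/2}$. The function name `both_sides` is hypothetical.

```python
import math

# Third absolute moment of a standard normal: P|N(0, 1)|^3 = 2 * sqrt(2 / pi).
THIRD_ABS_MOMENT_NORMAL = 2.0 * math.sqrt(2.0 / math.pi)


def both_sides(a):
    """Left- and right-hand sides of (16) for the assumed f, X, Y and W."""
    lhs = math.exp(-0.5) * abs(math.cos(a) - math.exp(-0.5 * a * a))
    c = 1.0 / 6.0                                  # (1/6) * sup |f'''|
    third_moment_y = a ** 3                        # P|Y|^3 for Y = +-a
    third_moment_w = THIRD_ABS_MOMENT_NORMAL * a ** 3
    rhs = c * (third_moment_y + third_moment_w)
    return lhs, rhs


if __name__ == "__main__":
    for a in (0.1, 0.5, 1.0, 2.0):
        print(a, both_sides(a))
```

For small $a$ the left-hand side behaves like a multiple of $a^4$, smaller than the cubic bound; in this symmetric case the third moments of $Y$ and $W$ also agree, so the first discrepancy appears at fourth order.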
III.4. The Central Limit Theorem

A sum of a large number of small, independent random variables is approximately normally distributed; that is roughly what the Central Limit Theorem asserts. The rigorous formulations of the theorem set forth conditions for convergence in distribution of sums of independent random variables to a standard normal distribution. We shall prove two versions of the theorem.

To begin with, consider a sum $Z = \xi_1 + \dots + \xi_k$ of independent random variables with finite third moments. Write $\sigma_j^2$ for $\mathbb{P}\xi_j^2$. Standardize, if necessary, to ensure that $\mathbb{P}\xi_j = 0$ for each $j$ and $\sigma_1^2 + \dots + \sigma_k^2 = 1$. Independently of the $\{\xi_j\}$, choose independent $N(0, \sigma_j^2)$-distributed random variables $\{\eta_j\}$, for $j = 1, \dots, k$. Start replacing the $\{\xi_j\}$ by the $\{\eta_j\}$, beginning at the right-hand end. Define

$$S_j = \xi_1 + \dots + \xi_{j-1} + \eta_{j+1} + \dots + \eta_k.$$

Notice that $S_k + \xi_k = Z$ and that $S_1 + \eta_1$ has a $N(0, 1)$ distribution. Choose a $\mathcal{C}^3(\mathbb{R})$ function $f$, as in Section 3. Theorem 12 has shown that convergence for expectations of infinitely differentiable functions of random variables is enough to establish convergence in distribution; convergence for functions in $\mathcal{C}^3(\mathbb{R})$ is more than enough. We need to show that $\mathbb{P}f(Z)$ is close to $\mathbb{P}f(N(0, 1))$. Apply inequality (16) with $X = S_j$, $Y = \xi_j$, and $W = \eta_j$. Because $S_j + \xi_j = S_{j+1} + \eta_{j+1}$ for $j = 1, \dots, k - 1$,
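$$\begin{aligned} |\mathbb{P}f(Z) - \mathbb{P}f(N(0, 1))| &\le \sum_{j=1}^{k} |\mathbb{P}f(S_j + \xi_j) - \mathbb{P}f(S_j + \eta_j)| \\ &\le \sum_{j=1}^{k} \mathbb{P}|R(S_j, \xi_j)| + \mathbb{P}|R(S_j, \eta_j)| \\ &\le C\sum_{j=1}^{k} \mathbb{P}|\xi_j|^3 + C\sum_{j=1}^{k} \mathbb{P}|\eta_j|^3. \end{aligned} \tag{17}$$

With this bound in hand, the proof of the first version of the Central Limit Theorem presents no difficulty.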
18 Liapounoff Central Limit Theorem. For each $n$ let $Z_n$ be a sum of independent random variables $\xi_{n1}, \xi_{n2}, \dots, \xi_{nk(n)}$ with zero means and variances that sum to one. If the Liapounoff condition,

$$\sum_{j=1}^{k(n)} \mathbb{P}|\xi_{nj}|^3 \to 0 \quad\text{as } n \to \infty, \tag{19}$$

is satisfied, then $Z_n \rightsquigarrow N(0, 1)$.

PROOF. Choose and fix an $f$ in $\mathcal{C}^3(\mathbb{R})$. Check that $\mathbb{P}f(Z_n) \to \mathbb{P}f(N(0, 1))$. The replacement normal random variables are denoted by $\eta_{n1}, \dots, \eta_{nk(n)}$. The sum $\eta_{n1} + \dots + \eta_{nk(n)}$ has a $N(0, 1)$ distribution. Write $\sigma_{nj}^2$ for the variance of $\xi_{nj}$, and $A_n$ for the sum on the left-hand side of (19). With subscripting $n$'s attached, the bound (17) becomes

$$|\mathbb{P}f(Z_n) - \mathbb{P}f(N(0, 1))| \le CA_n + C\sum_{j=1}^{k(n)} \sigma_{nj}^3\,\mathbb{P}|N(0, 1)|^3.$$
By Jensen's inequality, $\sigma_{nj}^3 = (\mathbb{P}\xi_{nj}^2)^{3/2} \le \mathbb{P}|\xi_{nj}|^3$, which shows that the sum contributed by the normal increments is less than $\mathbb{P}|N(0, 1)|^3A_n$. Two calls upon (19) as $n \to \infty$ complete the proof. □

The Liapounoff condition (19) imposes the unnecessary constraint of finite third moments upon the summands. Liapounoff himself was able to weaken this to a condition on the $(2 + \delta)$th moments, for some $\delta > 0$. The remainder term $R(x, y)$ in the Taylor expansion (14) does not increase as fast as $|y|^{2+\delta}$ though:

$$\begin{aligned} |R(x, y)| &= |[f(x + y) - f(x) - yf'(x)] - \tfrac12 y^2f''(x)| \\ &= |\tfrac12 y^2f''(x + \theta_2y) - \tfrac12 y^2f''(x)| \\ &\le \|f''\|\,|y|^2 \quad\text{for all } x \text{ and } y. \end{aligned} \tag{20}$$

The new bound improves upon (15) for large $|y|$, but not for small $|y|$. To have it both ways, apply (15) if $|y| < \varepsilon$ and (20) otherwise. Increase $C$ to the maximum of $\tfrac16\|f'''\|$ and $\|f''\|$. Then the bound on the expected remainder is sharpened to

$$\begin{aligned} \mathbb{P}|R(X, Y)| &\le \mathbb{P}C|Y|^3\{|Y| < \varepsilon\} + \mathbb{P}C|Y|^2\{|Y| \ge \varepsilon\} \\ &\le \varepsilon C\,\mathbb{P}(Y^2) + C\,\mathbb{P}Y^2\{|Y| \ge \varepsilon\}. \end{aligned} \tag{21}$$
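Returning for a moment to the bound in the proof of Theorem 18, a small closed-form check may be reassuring. The sketch assumes a particular triangular array of our own choosing: $\xi_{nj} = \pm n^{-1/2}$ with probability $\tfrac12$ each, for $j = 1, \dots, n$, together with $f = \cos$, so that $C = \tfrac16$. Then $\mathbb{P}\cos(Z_n) = \cos(n^{-1/2})^n$, $\mathbb{P}\cos(N(0, 1)) = e^{-1/2}$, and both $A_n$ and $\sum_j \sigma_{nj}^3$ equal $n^{-1/2}$. The function name `liapounoff_check` is hypothetical.

```python
import math


def liapounoff_check(n):
    """Exact gap |P cos(Z_n) - P cos(N(0,1))| and the bound
    C * A_n + C * sum_j sigma_nj^3 * P|N(0,1)|^3 for signs scaled by n^{-1/2}."""
    gap = abs(math.cos(n ** -0.5) ** n - math.exp(-0.5))
    a_n = n * n ** -1.5                       # sum of third absolute moments
    sum_sigma_cubed = n * n ** -1.5           # sigma_nj = n^{-1/2} for every j
    third_abs_moment_normal = 2.0 * math.sqrt(2.0 / math.pi)
    c = 1.0 / 6.0                             # (1/6) * sup |f'''| for f = cos
    bound = c * a_n + c * sum_sigma_cubed * third_abs_moment_normal
    return gap, bound


if __name__ == "__main__":
    for n in (1, 4, 100, 10000):
        print(n, liapounoff_check(n))
```

For this symmetric array the true gap shrinks like $n^{-1}$, faster than the $n^{-1/2}$ rate of the bound; the bound is a tool for proving convergence, not a sharp rate.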
22 Lindeberg Central Limit Theorem. For each $n$ let $Z_n$ be a sum of independent random variables $\xi_{n1}, \xi_{n2}, \dots, \xi_{nk(n)}$ with zero means and variances that sum to one. If, for each fixed $\varepsilon > 0$, the Lindeberg condition

$$\sum_{j=1}^{k(n)} \mathbb{P}\xi_{nj}^2\{|\xi_{nj}| \ge \varepsilon\} \to 0 \quad\text{as } n \to \infty \tag{23}$$

is satisfied, then $Z_n \rightsquigarrow N(0, 1)$.
PROOF. Use the same notation as in the proof of Theorem 18. Denote the left-hand side of (23) by $L_n(\varepsilon)$. Stop one line earlier than before in the application of (17):

$$|\mathbb{P}f(Z_n) - \mathbb{P}f(N(0, 1))| \le \sum_{j=1}^{k(n)} \mathbb{P}|R(S_{nj}, \xi_{nj})| + \sum_{j=1}^{k(n)} \mathbb{P}|R(S_{nj}, \eta_{nj})|.$$

From (21), the first sum is less than

$$C\sum_{j=1}^{k(n)} [\varepsilon\,\mathbb{P}\xi_{nj}^2 + \mathbb{P}\xi_{nj}^2\{|\xi_{nj}| \ge \varepsilon\}],$$

which equals $C\varepsilon + CL_n(\varepsilon)$ because the variances sum to one. For the second sum retain the bound

$$C\sum_{j=1}^{k(n)} \sigma_{nj}^3\,\mathbb{P}|N(0, 1)|^3$$

but, in place of Jensen's inequality, use:

$$\Big(\sum_j \sigma_{nj}^3\Big)^2 \le \Big(\max_j \sigma_{nj}\Big)^2\Big(\sum_j \sigma_{nj}^2\Big)^2 = \max_j \sigma_{nj}^2 \le \max_j\,[\varepsilon^2 + \mathbb{P}\xi_{nj}^2\{|\xi_{nj}| \ge \varepsilon\}].$$
□

The strange-looking Lindeberg condition is not as artificial as one might think. For example, consider the standardized summands for a sequence $\{Y_n\}$ of independent, identically distributed random variables with zero means and unit variances: $\xi_{nj} = n^{-1/2}Y_j$ for $j = 1, 2, \dots, n$. In this case

$$L_n(\varepsilon) = n\,\mathbb{P}\,n^{-1}Y_1^2\{|Y_1| \ge n^{1/2}\varepsilon\},$$

which tends to zero as $n$ tends to infinity, by dominated convergence, because $Y_1^2$ is integrable. It is even more comforting to know that the Lindeberg condition comes very close to being a necessary condition (Feller 1971, Section XV.6) for the Central Limit Theorem to hold.
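For a concrete feel for how $L_n(\varepsilon)$ decays, take the $Y_j$ to be standard normal; this distribution is our assumption, chosen only because the truncated second moment then has a closed form. Integration by parts gives $\mathbb{P}Y^2\{|Y| \ge c\} = 2[c\,\varphi(c) + 1 - \Phi(c)]$, where $\varphi$ and $\Phi$ denote the standard normal density and distribution function. The function name `lindeberg_normal` is hypothetical.

```python
import math


def lindeberg_normal(n, eps):
    """L_n(eps) = P Y^2 {|Y| >= sqrt(n) * eps} for Y ~ N(0, 1)."""
    c = math.sqrt(n) * eps
    density = math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)   # phi(c)
    upper_tail = 0.5 * math.erfc(c / math.sqrt(2.0))              # 1 - Phi(c)
    return 2.0 * (c * density + upper_tail)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, lindeberg_normal(n, 0.1))
```

With $\varepsilon = 0$ the function returns the full variance, one, which is a quick sanity check on the formula.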
24 Example (A Central Limit Theorem for the Sample Median). Let $M_n$ be the median of a sample $\{Y_1, Y_2, \dots, Y_n\}$ from a distribution function $G$ with median $M$. For simplicity, suppose the sample size $n$ is odd, $n = 2N + 1$, so that $M_n$ equals the $(N + 1)$st order statistic of the sample. Suppose also that the underlying distribution function $G$ has a positive derivative $\gamma$ at its median. To prove that

$$n^{1/2}(M_n - M) \rightsquigarrow N(0, \tfrac14\gamma^{-2}),$$

it suffices to check pointwise convergence of the distribution functions.

$$\begin{aligned} \mathbb{P}\{n^{1/2}(M_n - M) \le x\} &= \mathbb{P}\{M_n \le M + n^{-1/2}x\} \\ &= \mathbb{P}\{\text{at least } N + 1 \text{ observations} \le M + n^{-1/2}x\} \\ &= \mathbb{P}\Big[N + 1 \le \sum_{j=1}^{n} \{Y_j \le M + n^{-1/2}x\}\Big]. \end{aligned} \tag{25}$$

Define

$$\begin{aligned} p_n &= \mathbb{P}\{Y_j \le M + n^{-1/2}x\} = G(M + n^{-1/2}x), \\ b_n &= [np_n(1 - p_n)]^{1/2}, \\ \xi_{nj} &= [\{Y_j \le M + n^{-1/2}x\} - p_n]/b_n, \\ t_n &= (N + 1 - np_n)/b_n. \end{aligned}$$
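Check that $\mathbb{P}\xi_{nj} = 0$ and $\mathbb{P}\xi_{n1}^2 + \dots + \mathbb{P}\xi_{nn}^2 = 1$. Continuity of $G$ at $M$ gives $p_n \to \tfrac12$; differentiability of $G$ at $M$ gives $t_n \to -2x\gamma$. The last probability in (25) becomes
$$\mathbb{P}\Bigl\{\sum_{j=1}^n \xi_{nj} \ge t_n\Bigr\}. \tag{26}$$
As each $|\xi_{nj}|$ is less than $b_n^{-1}$, which converges to zero, both the Liapounoff and Lindeberg conditions are easy to check.
$$\sum_{j=1}^n \mathbb{P}|\xi_{nj}|^3 \le b_n^{-1}\sum_{j=1}^n \mathbb{P}\xi_{nj}^2 = b_n^{-1},$$
$$\sum_{j=1}^n \mathbb{P}\xi_{nj}^2\{|\xi_{nj}| \ge \epsilon\} = 0 \quad\text{if } \epsilon > b_n^{-1}.$$
By either of two routes,
$$\sum_{j=1}^n \xi_{nj} \rightsquigarrow N(0, 1).$$
Problem 13 shows that substitution of $-2x\gamma$ for $t_n$ in (26) leads to the correct result: $\mathbb{P}\{n^{1/2}(M_n - M) \le x\} \to \mathbb{P}\{N(0, 1) \ge -2x\gamma\}$.
□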
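The limit in Example 24 can be examined by simulation. The following is only a sketch under assumptions that are not part of the text: the sample is taken from the Exponential(1) distribution, for which $M = \log 2$ and $\gamma = G'(M) = \tfrac12$, so the limiting variance $\tfrac14\gamma^{-2}$ equals 1. The function name, the seed, and the sample sizes are arbitrary.

```python
import math
import random
import statistics

def scaled_median_errors(n: int, reps: int, seed: int = 0) -> list[float]:
    """Draw `reps` samples of odd size n from Exp(1); return n^{1/2}(M_n - M) for each."""
    rng = random.Random(seed)
    true_median = math.log(2.0)          # median of the Exp(1) distribution
    errors = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        errors.append(math.sqrt(n) * (statistics.median(sample) - true_median))
    return errors

errors = scaled_median_errors(n=201, reps=2000)
gamma = 0.5                               # density of Exp(1) at log 2
limit_variance = 1.0 / (4.0 * gamma ** 2)
empirical_variance = statistics.pvariance(errors)
```

The empirical variance of the standardized median should lie near `limit_variance`; the agreement is only approximate, because both $n$ and the number of replications are finite.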
III.5. Characteristic Functions

The smoothing argument of Section 3 showed that only expectations of $\mathscr{C}^\infty(\mathbb{R}^k)$ functions need be considered when checking convergence in distribution of random vectors. In this section, further analysis of the method of smoothing will show that the class of functions can be narrowed a little more: it suffices to check the convergence
$$\mathbb{P}\exp(iy\cdot X_n) \to \mathbb{P}\exp(iy\cdot X)$$
for each fixed vector $y$ in $\mathbb{R}^k$. That is, pointwise convergence of characteristic functions implies convergence in distribution. Start again from the convolution expression
$$\mathbb{P}f(X + \sigma Y) = \mathbb{P}\int (2\pi\sigma^2)^{-k/2} f(z)\exp\bigl(-\tfrac12|z - X|^2/\sigma^2\bigr)\,dz,$$
derived under the assumption that $Y$ has a $N(0, I_k)$ distribution independent of $X$. This holds for every bounded measurable $f$. Reverse the order of
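integration (Fubini) on the right-hand side to show that
$$\mathbb{P}f(X + \sigma Y) = \int f(z)\ell(z)\,dz,$$
where
$$\ell(z) = \mathbb{P}\,(2\pi\sigma^2)^{-k/2}\exp\bigl(-\tfrac12|z - X|^2/\sigma^2\bigr). \tag{27}$$
The distribution of $X + \sigma Y$ has density function $\ell(\cdot)$ with respect to lebesgue measure. The dependence on $\sigma$, which will remain fixed for most of the argument, need not be made explicit. The integrand appearing on the right-hand side of (27) comes from the density function of $\sigma Y$ evaluated at $(z - X)$. It is also proportional to the characteristic function of $Y$ evaluated at $(z - X)/\sigma$:
$$\exp\bigl(-\tfrac12|z - X|^2/\sigma^2\bigr) = (2\pi)^{-k/2}\int \exp\bigl(iy\cdot(z - X)/\sigma - \tfrac12|y|^2\bigr)\,dy.$$
Invoke Fubini's theorem.
$$\begin{aligned}
\ell(z) &= \int (2\pi\sigma)^{-k}\,\mathbb{P}\exp(-iy\cdot X/\sigma)\exp\bigl(iy\cdot z/\sigma - \tfrac12|y|^2\bigr)\,dy \\
&= (2\pi\sigma)^{-k}\int \phi(-y/\sigma)\exp\bigl(iy\cdot z/\sigma - \tfrac12|y|^2\bigr)\,dy, \tag{28}
\end{aligned}$$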
where $\phi(\cdot)$ denotes the characteristic function of $X$. Formula (28) shows that the characteristic function of $X$ uniquely determines the distribution of $X + \sigma Y$ for each fixed $\sigma$. Because $X + \sigma Y$ converges almost surely to $X$ as $\sigma$ tends to zero, this proves that the characteristic function also uniquely determines the distribution of $X$. A stronger result can be proved by exploiting continuity properties of the dependence of $\ell$ on $\phi$.

29 Continuity Theorem. A sequence of random vectors $\{X_n\}$ converges in distribution to a random vector $X$ if and only if the corresponding sequence of characteristic functions converges pointwise:
$$\mathbb{P}\exp(iy\cdot X_n) \to \mathbb{P}\exp(iy\cdot X)$$
for each fixed $y$ in $\mathbb{R}^k$.

PROOF. Prove necessity by splitting $\exp(ix\cdot y)$ into its real and imaginary parts, both of which define functions in $\mathscr{C}(\mathbb{R}^k)$.
For the proof of sufficiency, denote the characteristic function of $X_n$ by $\phi_n$, and the density function of $X_n + \sigma Y$ by $\ell_n$. The domain of integration is $\mathbb{R}^k$. For fixed $f$ and $\sigma$,
$$\begin{aligned}
|\mathbb{P}f(X_n + \sigma Y) - \mathbb{P}f(X + \sigma Y)| &= \Bigl|\int f(z)\ell_n(z)\,dz - \int f(z)\ell(z)\,dz\Bigr| \\
&\le \|f\|\int |\ell_n(z) - \ell(z)|\,dz \\
&= \|f\|\Bigl[\int (\ell(z) - \ell_n(z))^+\,dz + \int (\ell(z) - \ell_n(z))^-\,dz\Bigr] \\
&= 2\|f\|\int (\ell(z) - \ell_n(z))^+\,dz,
\end{aligned}$$
because
$$0 = 1 - 1 = \int (\ell(z) - \ell_n(z))\,dz = \int (\ell(z) - \ell_n(z))^+\,dz - \int (\ell(z) - \ell_n(z))^-\,dz.$$
Dominated convergence applied to (28) with $\phi$ replaced by $\phi_n$ shows that the functions $\{\ell_n\}$ converge pointwise to $\ell$. Thus $\{(\ell - \ell_n)^+\}$ converges pointwise to zero. Notice that $(\ell - \ell_n)^+ \le \ell$ for each $n$. Invoke dominated convergence again.
$$2\|f\|\int (\ell(z) - \ell_n(z))^+\,dz \to 0.$$
Complete the proof by an appeal to Lemma 11.
□
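A small numerical illustration of pointwise convergence of characteristic functions, in the spirit of Theorem 29, is sketched below. It assumes the standard fact that a Poisson($n$) variable has characteristic function $\exp(n(e^{it} - 1))$; the function names are hypothetical and chosen only for this sketch.

```python
import cmath
import math

def standardized_poisson_cf(t: float, n: int) -> complex:
    """Characteristic function of n^{-1/2}(Poisson(n) - n), evaluated at t."""
    s = t / math.sqrt(n)
    # Poisson(n) has characteristic function exp(n (e^{is} - 1)); centering
    # by n and scaling by n^{-1/2} contributes the factor exp(-i t n^{1/2}).
    return cmath.exp(n * (cmath.exp(1j * s) - 1.0) - 1j * t * math.sqrt(n))

def normal_cf(t: float) -> complex:
    """Characteristic function of N(0, 1)."""
    return complex(math.exp(-0.5 * t * t), 0.0)

# Distance between the two characteristic functions at t = 1 for growing n.
gaps = [abs(standardized_poisson_cf(1.0, n) - normal_cf(1.0)) for n in (10, 1000, 100000)]
```

The gaps shrink as $n$ increases, which is the pointwise convergence that the theorem converts into convergence in distribution.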
Perhaps this result should have been proved before we launched into the Central Limit Theorem proofs of Section 4. At least for identically distributed random variables, Theorems 18 and 22 could have been disposed of more rapidly; but the general cases would have required much the same level of effort, applied to $\exp(iyx)$ in place of a general $\mathscr{C}^3(\mathbb{R})$ function. The advantages of working with characteristic functions come from the special multiplicative properties of the complex exponential. A central limit theorem for martingales in Chapter VIII will make use of this property.

Theorem 29 contains a bonus: convergence in distribution of random vectors, $X_n \rightsquigarrow X$, is equivalent to the convergence in distribution of all linear functions of the random vectors: $y\cdot X_n \rightsquigarrow y\cdot X$, for every $y$. (For the
nontrivial half of the assertion, notice that $\phi_n(t)$ is the expectation of a bounded continuous function of $t\cdot X_n$.) Convergence problems for random vectors reduce to collections of convergence problems for random variables; limit theorems for random vectors can be deduced from their univariate analogues.

30 Multivariate Central Limit Theorem. For every sequence $\{\xi_n\}$ of independent, identically distributed random vectors with $\mathbb{P}\xi_j = 0$ and $\mathbb{P}(\xi_j\xi_j') = V$,
$$n^{-1/2}(\xi_1 + \dots + \xi_n) \rightsquigarrow N(0, V).$$
PROOF. For fixed $y$, the random variables $\{y\cdot\xi_n\}$ are independent and identically distributed with zero means and variances equal to $y'Vy$. If $y'Vy = 0$ the random variables $y\cdot\xi_j$ and $y\cdot N(0, V)$ degenerate to zero; the assertion then holds for trivial reasons. Otherwise standardize to unit variance. The standardized sequence satisfies the Lindeberg condition. By Theorem 22,
$$(y'Vy)^{-1/2}\,y\cdot n^{-1/2}(\xi_1 + \dots + \xi_n) \rightsquigarrow N(0, 1) = (y'Vy)^{-1/2}\,y\cdot N(0, V).$$
□
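The reduction to one-dimensional projections can be tried out numerically. The sketch below rests on illustrative assumptions only: the vectors are $\xi = (U_1, U_1 + U_2)$ with $U_1, U_2$ independent Uniform$(-1, 1)$, for which $V$ has entries $1/3, 1/3, 1/3, 2/3$, and the direction $y = (1, -2)$ is arbitrary. It compares the distribution function of the standardized projection at the point 1 with the N(0, 1) value.

```python
import math
import random

# Covariance matrix of xi = (U1, U1 + U2), with U1, U2 independent Uniform(-1, 1).
V = ((1 / 3, 1 / 3), (1 / 3, 2 / 3))

def quadratic_form(y, V):
    """Return y'Vy for a 2-vector y and a 2x2 matrix V."""
    return sum(y[a] * V[a][b] * y[b] for a in range(2) for b in range(2))

def standardized_projection(y, n, rng):
    """(y'Vy)^{-1/2} y . n^{-1/2}(xi_1 + ... + xi_n) for one simulated sample."""
    total = 0.0
    for _ in range(n):
        u1, u2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        xi = (u1, u1 + u2)
        total += y[0] * xi[0] + y[1] * xi[1]
    return total / math.sqrt(n * quadratic_form(y, V))

rng = random.Random(1)
y = (1.0, -2.0)
draws = [standardized_projection(y, n=50, rng=rng) for _ in range(4000)]
fraction_below_one = sum(d <= 1.0 for d in draws) / len(draws)
normal_cdf_at_one = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))
```

The two numbers agree up to simulation error; repeating the comparison for other directions $y$ probes other linear functions of the sum.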
III.6. Quantile Transformations and Almost Sure Representations

The definition of convergence in distribution by means of pointwise convergence of distribution functions does have its advantages, at least for real random variables. Behind some of these advantages lies a construction that reduces problems involving arbitrary distributions on the real line to the uniform case. For each distribution function $F$ define a quantile function
$$Q(u) = \inf\{t: F(t) \ge u\} \quad\text{for } 0 < u < 1.$$
Let $\{\phi_n\}$ be the characteristic functions corresponding to a weakly convergent sequence of probability measures $\{P_n\}$, and $\phi$ be the characteristic function of the limit distribution $P$. Take $\{X_n\}$ as the almost surely convergent representing sequence of random variables. For each fixed $t$, the sequence $\{\exp(itX_n)\}$ converges almost surely to $\exp(itX)$. Moreover, the convergence is uniform on each bounded interval $I$ of $t$ values:
$$\sup_I |\exp(itX_n) - \exp(itX)| \le \sup_I |\exp(it(X_n - X)) - 1| \to 0 \quad\text{almost surely,}$$
by virtue of the continuity of the exponential function at zero. Because
$$\sup_I |\mathbb{P}\exp(itX_n) - \mathbb{P}\exp(itX)| \le \mathbb{P}\sup_I |\exp(it(X_n - X)) - 1|,$$
the sequence $\{\phi_n(t)\}$ converges uniformly to $\phi(t)$ over bounded intervals.
□
The argument for the proof of the Representation Theorem made little explicit use of the limit distribution function $F$. The essentials were: (i) existence of $\lim F_n(t)$ for each $t$ in a dense subset $T_0$; (ii) for each fixed $u$ in $(0, 1)$ there exists a $t$ in $T_0$ for which $\liminf F_n(t) > u$, and an $s$ in $T_0$ for which $\limsup F_n(s) < u$.
The second property is equivalent to the existence, for each fixed $\epsilon > 0$, of a constant $K$ such that
$$\limsup P_n[-K, K]^c < \epsilon.$$
Of course $K$ depends on $\epsilon$. A sequence $\{P_n\}$ with this property is said to be uniformly tight. The corresponding property for a sequence of random variables $\{X_n\}$, that for each $\epsilon > 0$ there exists a $K$ such that $\limsup \mathbb{P}\{|X_n| > K\} < \epsilon$
$Z$, prove that $n^{1/2}(HX_n - Hx_0) \rightsquigarrow LZ$. [Some authors call this the delta method.]
[16] If $X_n \sim \text{Bin}(n, \theta)$ for some fixed $\theta$ in $(0, 1)$, find the limiting distribution of $n^{1/2}[\arcsin (X_n/n)^{1/2} - \arcsin \theta^{1/2}]$ as $n \to \infty$.

[17] For $y$ real, define $H_n(y) = \exp(iy) - 1 - iy - \dots - (iy)^n/n!$. Prove that $|H_n(y)| \le |y|^{n+1}/(n + 1)!$. [Proceed inductively, using $i\int_0^t H_n(s)\,ds = H_{n+1}(t)$ for $t > 0$. Take complex conjugates for $t < 0$. Borrowed from Feller (1971).]
[18] Use the inequality in the previous problem to show that the characteristic function of $n^{-1/2}(\text{Poisson}(n) - n)$ converges pointwise to $\exp(-t^2/2)$. [This is one way to find the characteristic function of the $N(0, 1)$ distribution.]

[19] For each $n$ let $Z_n$ be a sum of independent random variables $\xi_{n1}, \dots, \xi_{nk(n)}$ with zero means and variances that sum to one. If for some $\delta > 0$,
$$\sum_{j=1}^{k(n)} \mathbb{P}|\xi_{nj}|^{2+\delta} \to 0 \quad\text{as } n \to \infty,$$
then $Z_n \rightsquigarrow N(0, 1)$. [Apply the Lindeberg Central Limit Theorem. Liapounoff.]

[20] Suppose a random vector $X$ has a characteristic function
say, then checking $\mathscr{E}/\mathscr{P}$-measurability of $U_n$ and $\mathscr{P}$-measurability of $f$. The borel σ-field will not be the best choice for $\mathscr{P}$. The definition of convergence in distribution for random elements of a general metric space anticipates this complication for $D[0, 1]$.

1 Definition. An $\mathscr{E}/\mathscr{A}$-measurable map $X$ from a probability space $(\Omega, \mathscr{E}, \mathbb{P})$ into a set $\mathscr{X}$ with σ-field $\mathscr{A}$ is called a random element of $\mathscr{X}$. If $\mathscr{X}$ is a metric space, the set of all bounded, continuous, $\mathscr{A}/\mathscr{B}(\mathbb{R})$-measurable, real-valued functions on $\mathscr{X}$ is denoted by $\mathscr{C}(\mathscr{X}; \mathscr{A})$. A sequence $\{X_n\}$ of random elements of $\mathscr{X}$ converges in distribution to a random element $X$, written $X_n \rightsquigarrow X$, if $\mathbb{P}f(X_n) \to \mathbb{P}f(X)$ for each $f$ in $\mathscr{C}(\mathscr{X}; \mathscr{A})$.
A sequence $\{P_n\}$ of probability measures on $\mathscr{A}$ converges weakly to $P$, written $P_n \to P$, if $P_n f \to Pf$ for every $f$ in $\mathscr{C}(\mathscr{X}; \mathscr{A})$. □

The borel σ-field $\mathscr{B}(\mathscr{X})$, the σ-field generated by the closed sets, will always contain $\mathscr{A}$. For those spaces where we need $\mathscr{A}$ strictly smaller than the borel σ-field, we will usually have it generated by the collection of all closed balls in $\mathscr{X}$. Also the trace of $\mathscr{A}$ on each separable subset of $\mathscr{X}$ will coincide with the trace of the borel σ-field on the same subset. Limit distributions will always be borel measures concentrating on separable, $\mathscr{A}$-measurable subsets of $\mathscr{X}$. We could build these properties into the definition of weak convergence, but it would neither save us any extra work, nor simplify the theory much.

2 Example. If $D[0, 1]$ is equipped with the borel σ-field $\mathscr{B}$ generated by the closed sets under the uniform metric, the empirical processes $\{U_n\}$ will not be random elements of $D[0, 1]$ in the sense of Definition 1. That is, $U_n$ is not $\mathscr{E}/\mathscr{B}$-measurable. Consider, for example, the situation for a sample of size one. (Problem 1 extends the argument to larger sample sizes.) For each subset $A$ of $[0, 1]$ define
$$G_A = \{x \in D[0, 1]: x \text{ has a jump at some point of } A\}.$$
Each $G_A$ is open because $|x(t) - x(t-)|$ depends continuously upon $x$, for fixed $t$. If $U_1$ were $\mathscr{E}/\mathscr{B}$-measurable, the set $\{U_1 \in G_A\} = \{\xi_1 \in A\}$ would belong to $\mathscr{E}$. A probability measure $\mu$ could be defined on the class of all subsets of $[0, 1]$ by setting $\mu(A) = \mathbb{P}\{\xi_1 \in A\}$. This $\mu$ would be an extension of the uniform distribution to all subsets of $[0, 1]$. Unfortunately, such an extension cannot coexist with the usual axioms of set theory (Oxtoby 1971, Section 5): if we wish to retain the axiom of choice, or accept the continuum hypothesis, we must give up borel measurability of $U_1$. The borel σ-field generated by the uniform metric on $D[0, 1]$ contains too many sets.

There is a simple alternative to the borel σ-field. For each fixed $t$, the map $U_n(\cdot, t)$ from $\Omega$ into $\mathbb{R}$ is a random variable. That is, if $\pi_t$ denotes the
coordinate projection map that takes a function $x$ in $D[0, 1]$ onto its value at $t$, the composition $\pi_t \circ U_n$ is $\mathscr{E}/\mathscr{B}(\mathbb{R})$-measurable. Each $U_n$ is measurable with respect to the σ-field $\mathscr{P}$ generated by the coordinate projection maps (Problem 2). Call $\mathscr{P}$ the projection σ-field. Problem 4 shows that $\mathscr{P}$ coincides with the σ-field generated by the closed balls. All interesting functionals on $D[0, 1]$ are $\mathscr{P}$-measurable. □
Too large a σ-field $\mathscr{A}$ makes it too difficult for a map into $\mathscr{X}$ to be a random element. We must also guard against too small an $\mathscr{A}$. Even though the metric on $\mathscr{X}$ has lost the right to have $\mathscr{A}$ equal to the borel σ-field, it can still demand some degree of compatibility before a fruitful weak convergence theory will result. If $\mathscr{C}(\mathscr{X}; \mathscr{A})$ contains too few functions, the approximation arguments underlying the Continuous Mapping Theorem will fail. Without that key theorem, weak convergence becomes a barren theory. An extreme example should give you some idea of the worst that might happen.

3 Example. Allow the real line to retain its usual euclidean metric, but change its σ-field to the one generated by the intervals of the form $[n, n + 1)$, with $n$ ranging over the integers. Call this σ-field $\mathscr{Z}$. Functions measurable with respect to $\mathscr{Z}$ must stay constant over each of the generating intervals. For a continuous function, this imposes a harsh restriction; continuity at each integer forces a $\mathscr{Z}$-measurable function to be constant over the whole real line. This completely degrades the weak convergence concept: every sequence of $\mathscr{Z}$-measurable random elements converges in distribution. It bodes ill for a sensible Continuous Mapping Theorem. Consider the map $H$ from the disfigured real line into the real real line (equipped with its usual metric and σ-field) defined by $Hx = 1$ if $x <$ […] $> 0$. Suppose also that $h$ is continuous at a point $x$. Choose $r$ with $0 < r < h(x)$. Look for an $f$ in $\mathscr{F}$ with $f(x) \ge r$. Continuity provides a $\delta > 0$ such that $h(y) > r$ on the closed ball $B(x, \delta)$ centered at $x$. If we could find a $g$ in $\mathscr{C}(\mathscr{X}; \mathscr{A})$ with $0 \le g \le B(x, \delta)$ and $g(x) = 1$, the function $rg$ would meet our requirements. Notice the similarity to the topological notion of complete regularity (Simmons 1963, Section 27). If $\mathscr{A}$ happened to contain all the closed balls centered at $x$, a property enjoyed by the projection σ-field on $D[0, 1]$ (Problem 4), the function
$$g(y) = [1 - \delta^{-1} d(x, y)]^+ \tag{5}$$
would do, because $\{g \ge 1 - s\} = B(x, s\delta)$. For general $\mathscr{A}$ we must postulate existence of the appropriate $g$. To maintain the parallel with euclidean spaces as closely as possible, strengthen the requirements on $g$ to include uniform continuity. We lose only a scintilla of generality thereby; the special $g$ of (5) still passes the test.

6 Definition. Call a point $x$ in $\mathscr{X}$ completely regular (with respect to the metric $d$ and the σ-field $\mathscr{A}$) if to each neighborhood $V$ of $x$ there exists a uniformly continuous, $\mathscr{A}$-measurable function $g$ with $g(x) = 1$ and $g \le V$.
□

You might well object to yet another mathematical notion attaining the status of regularity; the world is already overloaded with instances of "regular" as a synonym for "amenable to our current theory." At least it has the virtue of reminding us of its topological counterpart. (A more sadistic author might have called it $T_{3\frac12}$.) The terminology would not be wasted if we were to expand our weak convergence theory to cover borel measures on general topological spaces, for there topological complete regularity seems just the thing needed for a well-behaved theory.
7 Convergence Lemma. Let $h$ be a bounded, $\mathscr{A}$-measurable, real-valued function on $\mathscr{X}$. If $h$ is continuous at each point of some separable, $\mathscr{A}$-measurable set $C$ of completely regular points, then:
(i) $X_n \rightsquigarrow X$ and $\mathbb{P}\{X \in C\} = 1$ imply $\mathbb{P}h(X_n) \to \mathbb{P}h(X)$;
(ii) $P_n \to P$ and $PC = 1$ imply $P_n h \to Ph$.

PROOF. As the arguments for both assertions are quite similar, let us prove (ii) only. Assume that $h > 0$ (add a constant to $h$ if necessary). Define $\mathscr{F}$ as in (4), but with the continuity requirement strengthened to uniform continuity. At those completely regular points of $\mathscr{X}$ where $h$ is continuous, the supremum of $\mathscr{F}$ equals $h$. This applies to points in $C$. Separability of $C$ will enable us to extract a suitable countable subfamily from $\mathscr{F}$. Argue as for the classical Lindelöf theorem (Simmons 1963, Section 18). Let $C_0$ be a countable, dense subset of $C$. Let $\{g_1, g_2, \dots\}$ be the set of all those functions of the form $rB$, with $r$ rational, $B$ a closed ball of rational radius centered at a point of $C_0$, and $rB \le f$ for at least one $f$ in $\mathscr{F}$. For each $g_i$ choose one $f$ satisfying the inequality $g_i \le f$. Denote it by $f_i$. This picks out the required countable subfamily:
$$\sup_i f_i = \sup \mathscr{F} \quad\text{on } C. \tag{8}$$
To see this, consider any point $z$ in $C$ and any $f$ in $\mathscr{F}$. For each rational number $r$ such that $f(z) > r > 0$ choose a rational $\epsilon$ for which $f > r$ at all points within a distance $2\epsilon$ of $z$. Let $B$ be the closed ball of radius $\epsilon$ centered at a point $x$ of $C_0$ for which $d(x, z) < \epsilon$. The function $rB$ lies completely below $f$; it must be one of the $\{g_i\}$. The corresponding $f_i$ takes a value greater than $r$ at $z$. Assertion (8) follows.
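[Figure: the function $g_i = rB$, a ball centered at $x$ scaled to height $r$, lying below $f$ near the point $z$.]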
Complete the argument as for the Convergence Lemma of Section III.2. Assume without loss of generality that $f_i \uparrow h$ at points of $C$. Then
$$\begin{aligned}
\liminf P_n h &\ge \liminf P_n f_i && \text{for each } i \\
&= Pf_i && \text{because } P_n \to P \\
&\to Ph && \text{as } i \to \infty, \text{ by monotone convergence.}
\end{aligned}$$
Replace $h$ by $-h +$ (a big constant) to get the companion inequality for the limsup. □
9 Corollary. If $\mathbb{P}f(X_n) \to \mathbb{P}f(X)$ for each bounded, uniformly continuous, $\mathscr{A}$-measurable $f$, and if $X$ concentrates on a separable set of completely regular points, then $X_n \rightsquigarrow X$. □

The corollary flows directly from the decision to insist upon uniformly continuous separating functions in the definition of a completely regular point. As with its counterpart for euclidean spaces, it makes some weak convergence arguments just a little bit more straightforward than the corresponding arguments with continuous functions.

10 Example. Let $\mathscr{X}$ be a space equipped with a σ-field $\mathscr{A}$ and metric $d$, and $\mathscr{Y}$ be a space equipped with a σ-field $\mathscr{B}$ and metric $e$. Equip $\mathscr{X} \otimes \mathscr{Y}$ with its product σ-field and the metric $\sigma$ defined by
$$\sigma[(x, y), (x', y')] = \max[d(x, x'), e(y, y')].$$
Suppose $X_n \rightsquigarrow X$, as random elements of $\mathscr{X}$. If $\mathbb{P}_X$ concentrates on a separable set of completely regular points, and $Y_n \to y_0$ in probability for some fixed completely regular point $y_0$ in $\mathscr{Y}$, then $(X_n, Y_n) \rightsquigarrow (X, y_0)$, as random elements of the product space $\mathscr{X} \otimes \mathscr{Y}$. Of course the assertion only makes sense if $X_n$ and $Y_n$ are defined on the same probability space. Given that prerequisite, measurability with respect to the product σ-field presents no problem, because
$$(X_n, Y_n)^{-1}(A \otimes B) = (X_n^{-1}A) \cap (Y_n^{-1}B),$$
and similarly for $(X, y_0)$. Write $C$ for the separable set on which $\mathbb{P}_X$ concentrates. Then $\mathbb{P}_{X, y_0}$ concentrates on the product set $C \otimes \{y_0\}$, which is separable. Each point of this set is completely regular: if $f(c) = 1$ and $f = 0$ outside the ball of $d$-radius $\epsilon$, and $g(y_0) = 1$ and $g = 0$ outside a ball of $e$-radius $\epsilon$, then the product $f(x)g(y)$ equals 1 at $(c, y_0)$ and vanishes outside a ball of $\sigma$-radius $\epsilon$. The product is uniformly continuous if both $f$ and $g$ are bounded and uniformly continuous; it is $\mathscr{A} \otimes \mathscr{B}$-measurable if $f$ is $\mathscr{A}$-measurable and $g$ is $\mathscr{B}$-measurable. By virtue of Corollary 9, to prove $(X_n, Y_n) \rightsquigarrow (X, y_0)$ we have only to check that $\mathbb{P}h(X_n, Y_n) \to \mathbb{P}h(X, y_0)$ for each bounded, uniformly continuous, $\mathscr{A} \otimes \mathscr{B}$-measurable, real function $h$ on $\mathscr{X} \otimes \mathscr{Y}$. Given $\epsilon > 0$ choose
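$\delta > 0$ so that $|h(x, y) - h(x', y')| < \epsilon$ whenever $\sigma[(x, y), (x', y')] < \delta$. Write $k(\cdot)$ for the bounded, uniformly continuous, $\mathscr{A}$-measurable function $h(\cdot, y_0)$. Then
$$|\mathbb{P}h(X_n, Y_n) - \mathbb{P}h(X, y_0)| \le \epsilon + 2\|h\|\,\mathbb{P}^*\{e(Y_n, y_0) \ge \delta\} + |\mathbb{P}k(X_n) - \mathbb{P}k(X)|.$$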
Convergence in probability of $Y_n$ to $y_0$ makes the middle term converge to zero. (Notice the outer measure $\mathbb{P}^*$. By definition, $\mathbb{P}^*Z$ equals the infimum of $\mathbb{P}W$ over all $\mathscr{E}$-measurable real functions with $W \ge Z$. For most applications $e(\cdot, y_0)$ will be $\mathscr{B}$-measurable, in which case $\mathbb{P}^*$ can be replaced by $\mathbb{P}$.) The last term converges to zero because $X_n \rightsquigarrow X$. □

11 Example (Convergence in Distribution via Uniform Approximation). Let $X, X_1, X_2, \dots$ be random elements of $\mathscr{X}$ with $\mathbb{P}_X$ concentrated on a separable set of completely regular points. Suppose, for each $\epsilon > 0$ and $\delta > 0$, there exist approximating random elements $AX, AX_1, AX_2, \dots$ such that:
(i) $\mathbb{P}^*\{d(X, AX) \ge \delta\} < \epsilon$;
(ii) $\limsup \mathbb{P}^*\{d(X_n, AX_n) \ge \delta\} < \epsilon$;
(iii) $AX_n \rightsquigarrow AX$.
Then $X_n \rightsquigarrow X$. Notice again the use of outer measure to guard against nonmeasurability. We have already met a special case of this result in Lemma III.11, where $AX_n = X_n + \sigma Y$. In applications to stochastic processes, the approximations are typically constructed from the values of the processes at a fixed, finite set of index points. For such approximations, classical weak convergence methods can handle (iii). The assumptions (i) and (ii) place restrictions on the irregularity of the sample paths. Chapter V will take up this idea.

The convergence $X_n \rightsquigarrow X$ follows from convergence of expectations for every bounded, uniformly continuous, $\mathscr{A}$-measurable $f$. If $|f(x) - f(y)| < \epsilon$ whenever $d(x, y) < \delta$ then $|\mathbb{P}f(X_n) - \mathbb{P}f(X)|$ is less than
$$\mathbb{P}|f(X_n) - f(AX_n)| + |\mathbb{P}f(AX_n) - \mathbb{P}f(AX)| + \mathbb{P}|f(AX) - f(X)|.$$
The convergence (iii) takes care of the middle term. Handle the first term by splitting it into the contributions from $\{d(X_n, AX_n) \ge \delta\}$ and its complement; and similarly for the last term. □

The Convergence Lemma has one other important corollary, the result that tells us how to transfer convergence in distribution of random elements of $\mathscr{X}$ to convergence in distribution of selected functionals of those random elements. For substantial applications turn to Chapter V.

12 Continuous Mapping Theorem. Let $H$ be an $\mathscr{A}/\mathscr{A}'$-measurable map from $\mathscr{X}$ into another metric space $\mathscr{X}'$. If $H$ is continuous at each point of some separable, $\mathscr{A}$-measurable set $C$ of completely regular points, then $X_n \rightsquigarrow X$ and $\mathbb{P}\{X \in C\} = 1$ together imply $HX_n \rightsquigarrow HX$. □
IV.3. Representation by Almost Surely Convergent Sequences

In Section III.6 we used the quantile transformation to construct almost surely convergent sequences of random variables representing weakly convergent sequences of probability measures. That method will not work for probabilities on more general spaces; it even breaks down for $\mathbb{R}^2$. But the representation result itself still holds.

13 Representation Theorem. Let $\{P_n\}$ be a sequence of probability measures on a metric space. If $P_n \to P$ and $P$ concentrates on a separable set of completely regular points, then there exist random elements $\{X_n\}$ and $X$ with distributions $\{P_n\}$ and $P$ such that $X_n \to X$ almost surely. □
The new construction makes repeated use of a lemma that can be applied to any two probability measures $P$ and $Q$ that are close in a weak convergence sense. Roughly speaking, the idea is to cut up the metric space $\mathscr{X}$ into pieces $B_0, B_1, \dots, B_k$ for which $PB_i \approx QB_i$ for each $i$, so that the set $B_0$ has small $P$ measure and each of the other $B_i$'s has small diameter. We use these sets to construct a random element $Y$ of $\mathscr{X}$, starting from an $X$ with distribution $P$. If $X$ lands in $B_i$ choose $Y$ in $B_i$ according to the conditional distribution $Q(\cdot \mid B_i)$. For $i \ge 1$ this forces $Y$ to lie close to $X$, because $B_i$ doesn't contain any pairs of points too far apart. The random element $Y$ has approximately the distribution $Q$:
$$\begin{aligned}
\mathbb{P}\{Y \in A\} &= \sum_{i=0}^k \mathbb{P}\{Y \in A \mid X \in B_i\}\,\mathbb{P}\{X \in B_i\} \\
&= \sum_{i=0}^k Q(A \mid B_i)P(B_i) \\
&\approx \sum_{i=0}^k Q(A \mid B_i)Q(B_i) \\
&= Q(A). \tag{14}
\end{aligned}$$
A slight refinement of the construction will turn the approximation into an equality. When applied with $Q = P_n$ and partitions growing finer with $n$, it will generate the sequence $\{X_n\}$ promised by the Representation Theorem.

15 Lemma. For each $\epsilon > 0$ and each $P$ concentrating on a separable set of completely regular points, the space $\mathscr{X}$ can be partitioned into finitely many disjoint, $\mathscr{A}$-measurable sets $B_0, B_1, \dots, B_k$ such that:
(i) the boundary of each $B_i$ has zero $P$ measure (a $P$-continuity set);
(ii) $P(B_0) < \epsilon$;
(iii) diameter$(B_i) < 2\epsilon$ for $i = 1, 2, \dots, k$.
PROOF. Call the separable set $C$. To each $x$ in $C$ there exists a uniformly continuous, $\mathscr{A}$-measurable $f$ with $f(x) = 1$ and $f = 0$ for points a distance greater than $\epsilon$ from $x$. The open sets of the form $\{f > \alpha\}$, for $0 < \alpha < 1$, are all $\mathscr{A}$-measurable and of diameter less than $2\epsilon$. At each point on the boundary of $\{f > \alpha\}$, the continuous function $f$ takes the value $\alpha$. Because $P\{f = \alpha\}$ can be nonzero for at most countably many different values of $\alpha$, there must exist at least one $\alpha$ for which the probability equals zero. Choose and fix such an $\alpha$, then write $G(x)$ for the corresponding set $\{f > \alpha\}$. It has diameter less than $2\epsilon$ and is a $P$-continuity set.

The union of the family of open sets $\{G(x): x \in C\}$ contains the separable set $C$. Extract a countable subfamily $\{G(x_i): i = 1, 2, \dots\}$ containing $C$. (Every open cover of a separable subset of a metric space has a countable subcover: Problem 5.) Because
$$P\Bigl[\bigcup_{i=1}^k G(x_i)\Bigr] \uparrow P\Bigl[\bigcup_{i=1}^\infty G(x_i)\Bigr] \ge P(C) = 1,$$
there exists a $k$ such that $P\bigl[\bigcup_{i=1}^k G(x_i)\bigr] > 1 - \epsilon$. Define $B_i = G(x_i) \setminus [G(x_1) \cup \dots \cup G(x_{i-1})]$ for $i = 1, \dots, k$ and $B_0 = [G(x_1) \cup \dots \cup G(x_k)]^c$, a process known to the uncouth as disjointification. The boundary of $B_i$ is covered by the union of the boundaries of the $P$-continuity sets $G(x_1), \dots, G(x_k)$. Each $B_i$ lies completely inside the corresponding $G(x_i)$, a set of diameter less than $2\epsilon$ if $i \ge 1$. □

PROOF OF THEOREM 13. Holding $\epsilon$ fixed for the moment, carry out the construction detailed in the proof of the lemma, generating $P$-continuity sets $B_0, B_1, \dots, B_k$ as described. The indicator function of $B_i$ is almost surely continuous $[P]$ because it has discontinuities only at the boundary of $B_i$. So by the Convergence Lemma $P_n(B_i) \to P(B_i)$. When $n$ is large enough, say $n \ge n(\epsilon)$,
$$P_n(B_i) \ge (1 - \epsilon)P(B_i) \quad\text{for } i = 0, 1, \dots, k. \tag{16}$$
Write $n_m$ for $n(2^{-m})$. Without loss of generality suppose $1 = n_1 < n_2 < \cdots$. For $n_m \le n < n_{m+1}$ construct $X_n$ using the $\{B_i\}$ partition corresponding to $\epsilon_m = 2^{-m}$. Notice that $B_i$ now depends on $n$ through the value of $m$. Let $\xi$ be a random variable that has a Uniform(0, 1) distribution independent of $X$. If $\xi \le 1 - \epsilon_m$ and $X$ lands in $B_i$, choose $X_n$ according to the conditional distribution $P_n(\cdot \mid B_i)$. So far no $B_i$ has received more than its quota of $P_n$ measure, because of (16). The extra probability will be distributed over the space $\mathscr{X}$ to bring $X_n$ up to its desired distribution $P_n$. If $\xi > 1 - \epsilon_m$ choose $X_n$ according to the distribution $\mu_n$ determined by
$$P_n(A) = \mu_n(A)\,\mathbb{P}\{\xi > 1 - \epsilon_m\} + \sum_{i=0}^k P_n(A \mid B_i)(1 - \epsilon_m)P(B_i).$$
That is,
$$\mu_n(A) = \epsilon_m^{-1}\sum_{i=0}^k P_n(A \mid B_i)\bigl[P_n(B_i) - (1 - \epsilon_m)P(B_i)\bigr].$$
By (16), the right-hand side is nonnegative. And clearly $\mu_n\mathscr{X} = 1$. Except on the set $\Omega_m = \{X \in B_0 \text{ or } \xi > 1 - \epsilon_m\}$, which has measure at most $2\epsilon_m$, the random elements $X$ and $X_n$ lie within $2\epsilon_m$ of each other. On the complement of the set $\{\Omega_m \text{ infinitely often}\}$, the sequence $\{X_n\}$ converges to $X$. By the Borel-Cantelli lemma $\mathbb{P}\{\Omega_m \text{ infinitely often}\} = 0$.
□
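The bookkeeping in the proof of Theorem 13 can be checked exactly on a finite space. The sketch below is an illustration under assumed data: a six-point space, a partition into three blocks, measures `P` and `Pn` chosen so that (16) holds with $\epsilon = 1/4$, and helper names invented for the purpose. It confirms that $\mu_n$ is a probability measure and that the two-stage choice of $X_n$ reproduces $P_n$ exactly.

```python
from fractions import Fraction as F

# Hypothetical finite example: points 0..5, blocks B_0, B_1, B_2.
blocks = [[0, 1], [2, 3], [4, 5]]
P = {0: F(1, 20), 1: F(1, 20), 2: F(2, 5), 3: F(1, 10), 4: F(1, 5), 5: F(1, 5)}
Pn = {0: F(3, 20), 1: F(0), 2: F(3, 10), 3: F(3, 20), 4: F(1, 4), 5: F(3, 20)}
eps = F(1, 4)

def mass(measure, block):
    """Total mass that `measure` gives to the points of `block`."""
    return sum(measure[x] for x in block)

def conditional(measure, block, x):
    """measure({x} | block); zero when x lies outside the block."""
    return measure[x] / mass(measure, block) if x in block else F(0)

def mu_n(x):
    """Residual distribution used when xi > 1 - eps."""
    return sum(conditional(Pn, B, x) * (mass(Pn, B) - (1 - eps) * mass(P, B))
               for B in blocks) / eps

def law_of_Xn(x):
    """P{X_n = x}: the conditional step with probability 1 - eps, else mu_n."""
    main = (1 - eps) * sum(mass(P, B) * conditional(Pn, B, x) for B in blocks)
    return main + eps * mu_n(x)

points = sorted(P)
quota_ok = all(mass(Pn, B) >= (1 - eps) * mass(P, B) for B in blocks)   # inequality (16)
```

Exact rational arithmetic avoids rounding questions: the identity for the law of $X_n$ holds as an equality of fractions, not merely to within a tolerance.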
The applications of Theorem 13 follow the same pattern as in Section III.6. Problems of weak convergence transform into problems of almost sure convergence, to which the standard tools (monotone convergence, dominated convergence, and so on) can be applied.

17 Example. Most of the proof of the Convergence Lemma did not use the full force of almost sure continuity for the function $h$. To get the inequality for the liminf we only needed lower-semicontinuity of $h$ at points of $C$. (Remember that semicontinuity imposes only half the constraint of continuity: only a lower bound is set on the oscillations of $h$ in a neighborhood of a point. Problem 9 will refresh your memory on semicontinuity.) The Representation Theorem gives a quick proof of the same result. If $g$ is bounded below, lower-semicontinuous, and $\mathscr{A}$-measurable (automatic if $\mathscr{A}$ equals the borel σ-field), then $\liminf P_n g \ge Pg$ whenever $P_n \to P$ with $P$ concentrated on a separable set of completely regular points. To prove it, switch to almost surely convergent representations. Lower-semicontinuity at $X(\omega)$ plus almost sure convergence of the representing sequence imply
$$\liminf g(X_n(\omega)) \ge g(X(\omega)) \quad\text{almost surely.}$$
Take expectations.
$$\begin{aligned}
\liminf P_n g &= \liminf \mathbb{P}g(X_n) \\
&\ge \mathbb{P}g(X) && \text{by Fatou's lemma} \\
&= Pg.
\end{aligned}$$
A similar inequality holds for upper-semicontinuous, $\mathscr{A}$-measurable functions that are bounded above. As a special case,
for each closed, dmeasurable set F. If inequality (18) holds for all such F then necessarily P n .... P (Problem 12). 0 19 Example. Let 0, there were functions {gn} in '!J for which IPng n  Pg nI ;;::: 8 infinitely often. Apply the dominated convergence argument to the countable family '!J 0 = {g b g 2, ... } to reach a con0 tradiction. 22 Example (The BoundedLipschitz Metric for Weak Convergence). Suppose that si contains all the closed balls, as in the case of D[O, 1] under its uniform metric. The function fO = r[1  md(·, z)] +, which serves to separate z from points outside a small neighborhood of z, has the strong uniformity property
I f(x)  fey) I ~ mr d(x, y). A function satisfying such a condition, with mr replaced possibly by a different constant, is called a Lipschitz function. For the proof of the Convergence Lemma, Pnf + Pf for each bounded, dmeasurable Lipschitz function would have sufficed; convergence for bounded Lipschitz functions implies weak convergence. From Example 19 we draw a sharper conclusion. Define Y to be the set of all d measurable Lipschitz functions for which If(x)  f(Y)1 ~ d(x, y) and supx I f(x) I ~ 1. The class Y is equicontinuous at each point of :!E. Every bounded Lipschitz function can be expressed as a multiple of a function in Y. Define the distance between two probability measures on d by }'(P, Q) = sup{ IPf  Qf I: fEY}.
You can check that}, has all the properties required of a metric. If P concentrates on a separable set and P n ... P, the distance }'(Pn, P) converges to zero, in obedience to the uniformity result of Example 19. Conversely, the
convergence of $\lambda(P_n, P)$ to zero would ensure that $P_n f \to Pf$ for each bounded Lipschitz function $f$, which, as noted above, implies weak convergence. □
23 Example (The Prohorov Metric for Weak Convergence). Suppose $\mathscr{X}$ is a separable metric space equipped with its borel σ-field. For each $\delta > 0$ and each borel subset $A$ of $\mathscr{X}$ define
$$A^\delta = \{x \in \mathscr{X}: d(x, A) < \delta\}$$
[…]
$$\rho(P, Q) = \inf\{\delta > 0: PA^\delta + \delta \ge QA \text{ for every } A\}.$$
This distance has great appeal for robustniks, who interpret the delta halo as a way of constraining small migrations of $Q$ mass and the added delta as insurance against a small proportion of gross changes. To us it will be just another metric for weak convergence.

It is not obvious that $\rho$ is symmetric, one of the properties required of a metric. We need to show that $QA^\delta + \delta \ge PA$ for every $A$, whenever $\rho(P, Q) < \delta$. Set $B$ equal to the complement of $A^\delta$. We know that $QB \le PB^\delta + \delta$. Subtract both sides from 1, after replacing $B^\delta$ by the complement of $A$, a larger set. (No point of $A$ can be less than $\delta$ from a point in $B$.) We have symmetry.

If $\rho(P, Q) = 0$ then certainly $PF^\delta + \delta \ge QF$ for every closed $F$ and every $\delta > 0$. Hold $F$ fixed but let $\delta$ tend to zero through a sequence of values. The sequence $\{F^\delta\}$ shrinks to $F$, giving $PF \ge QF$ in the limit. Interchange the roles of $P$ and $Q$ then repeat the argument to deduce that $P$ and $Q$ agree on all closed sets, and hence (Problem 11) on all borel sets.

For the triangle inequality, suppose that $\rho(P, Q) < \delta$ and $\rho(Q, R) < \eta$. Temporarily set $B = A^\eta$. Then
$$RA \le QA^\eta + \eta = QB + \eta \le PB^\delta + \eta + \delta.$$
Check that $A^{\delta + \eta}$ contains $B^\delta$. Deduce that $\rho(R, P) \le \eta + \delta$.

Next, show that weak convergence implies convergence in the $\rho$ metric. It suffices to deduce that $\rho(P_n, P) \le \delta$ eventually if $P_n \to P$. For each borel set $A$ define
$$f_A(x) = [1 - \delta^{-1} d(x, A)]^+.$$
Notice that $A^\delta \ge f_A \ge A$. Also, because
$$|f_A(x) - f_A(y)| \le \delta^{-1} d(x, y),$$
the class of all such $f_A$ functions is equicontinuous. By Example 19,
$$\sup_A |P_n f_A - Pf_A| \to 0.$$
Call this supremum $\Delta_n$. Then
$$PA^\delta \ge Pf_A \ge P_n f_A - \Delta_n \ge P_n A - \Delta_n$$
for every $A$. Wait until $\Delta_n \le \delta$ to be able to assert that $\rho(P, P_n) \le \delta$.

Finally, if $\rho(P_n, P) \to 0$ then, for fixed closed $F$,
$$\limsup P_n F \le PF^\delta + \delta \quad\text{for every } \delta > 0.$$
Let $\delta$ decrease to zero then deduce from Problem 12 that $P_n \to P$. Convergence in the $\rho$ metric is equivalent to weak convergence. □
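For measures with finite support on the real line the distance of Example 23 can be computed by brute force. The sketch below makes two assumptions beyond the text: the measures are given as dictionaries from points to masses, and only nonempty subsets $A$ of the support of $Q$ are tested. The second is harmless, because replacing a set $A$ by its intersection with that support leaves $QA$ unchanged and can only shrink $A^\delta$. Since the defining condition is monotone in $\delta$, bisection locates the infimum. All function names are hypothetical.

```python
from itertools import combinations

def halo_mass(P, A, delta):
    """P(A^delta), where A^delta = {x : d(x, A) < delta} on the real line."""
    return sum(p for x, p in P.items() if min(abs(x - a) for a in A) < delta)

def condition_holds(P, Q, delta):
    """Check P(A^delta) + delta >= Q(A) for every nonempty A inside supp(Q)."""
    support = list(Q)
    for size in range(1, len(support) + 1):
        for A in combinations(support, size):
            # The small slack guards against floating-point rounding only.
            if halo_mass(P, A, delta) + delta < sum(Q[a] for a in A) - 1e-12:
                return False
    return True

def prohorov(P, Q, tol=1e-9):
    """Approximate inf{delta > 0 : condition_holds(P, Q, delta)} by bisection."""
    lo, hi = 0.0, 1.0          # delta = 1 always satisfies the condition
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if condition_holds(P, Q, mid):
            hi = mid
        else:
            lo = mid
    return hi

point_a = {0.0: 1.0}
point_b = {0.3: 1.0}
contaminated = {0.0: 0.9, 10.0: 0.1}
```

For two point masses a distance 0.3 apart the result is about 0.3 (a small migration of mass), whereas moving a tenth of the mass far away, as in `contaminated`, gives about 0.1 (a small proportion of gross change). The enumeration of subsets is exponential in the size of the support, so the sketch is suitable only for very small examples.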
IV.4. Coupling

The Representation Theorems of Sections III.6 and IV.3 both depended upon methods for coupling distributions $P_n$ and $P$. That is, we needed to construct random elements $X_n$ and $X$, on the same probability space, such that $X_n$ had distribution $P_n$ and $X$ had distribution $P$. Closeness of $P_n$ and $P$, in a weak convergence sense, allowed us to choose $X_n$ and $X$ close in a stronger, almost sure sense. This section will examine coupling in more detail.

A coupling of probability measures $P$ and $Q$, on a space $\mathscr{X}$, can be realized as a measure $M$ on the product space $\mathscr{X} \otimes \mathscr{X}$, with $X$ and $Y$ defined by the coordinate projections. The product measure $P \otimes Q$ is a coupling, albeit a not very informative one. More useful are those couplings for which $M$ concentrates near the diagonal. For example, in the Representation Theorem we put as much mass as possible on the set $\{(x, y): d(x, y) \le \epsilon\}$.

Roughly speaking, one can construct such couplings in two steps. First treat the desired property (that as much mass as possible be allocated to a particular region $D$ in the product space) as a strict requirement. Imagine building up $M$ slowly by drawing off mass from the $P$ marginal measure and relocating it within $D$, subject to a matching constraint: to put an amount $\delta$ near $(x, y)$ one must deplete the $P$ supply near $x$ by $\delta$ and the $Q$ supply near $y$ by $\delta$. When as much mass as possible has been shifted into $D$ by this method, forget about the constraint imposed by $D$. In the second step, complete the transfer of mass from $P$ into the product space subject only to the matching constraint. The final $M$ will have the correct marginals, $P$ and $Q$.

A precise formulation of the coupling algorithm just sketched is easiest when both $P$ and $Q$ concentrate on a finite set of points. The first step can be represented by a picture that looks like a crossword puzzle. Label the points on which $Q$ concentrates as $1, \dots, r$; let these correspond to rows of a two-way array of cells. Similarly, let $1, \dots, c$ label both the points on which $P$ concentrates and the columns of the two-way array. The stack beside row $i$ represents the mass $Q$ puts on point $i$, and the stack under column $j$ represents the mass $P$ puts on $j$. The unshaded cells correspond to $D$. The
77
IV.4. Coupling
aim is to place as much mass as possible in the unshaded cells without violating the constraint that the total mass in a row or column should not exceed the amount originally in the marginal stacks.
Q
p
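The two-step recipe can be made concrete for two distributions on the same finite set of points, taking $D$ to be the diagonal. The following is only a minimal sketch under stated assumptions: the function name `diagonal_coupling`, the choice of the diagonal as $D$, and the proportional rule used to complete the allocation in the second step are illustrative choices, not part of the text. Rows index the $Q$ masses and columns the $P$ masses, as in the picture.

```python
from fractions import Fraction


def diagonal_coupling(p, q):
    """Two-step coupling of finite distributions p (columns) and q (rows).

    Hypothetical helper for illustration. Step 1 treats D = {(i, i)} as a
    strict requirement; step 2 forgets D and only matches the leftovers.
    """
    n = len(p)
    M = [[Fraction(0)] * n for _ in range(n)]

    # Step 1: shift as much mass as possible into the diagonal cells.
    for i in range(n):
        M[i][i] = min(p[i], q[i])

    # Supplies still left in the row (Q) and column (P) stacks.
    row_left = [q[i] - M[i][i] for i in range(n)]
    col_left = [p[j] - M[j][j] for j in range(n)]
    rest = sum(row_left)  # equals sum(col_left) when p and q have equal mass

    # Step 2: complete the transfer subject only to the matching constraint
    # (here: proportionally to the leftover supplies, an arbitrary choice).
    if rest > 0:
        for i in range(n):
            for j in range(n):
                M[i][j] += row_left[i] * col_left[j] / rest
    return M


p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
q = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
M = diagonal_coupling(p, q)
row_sums = [sum(row) for row in M]                                  # should equal q
col_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]       # should equal p
diagonal_mass = sum(M[i][i] for i in range(3))
```

With exact `Fraction` arithmetic the marginals can be compared to `q` and `p` with `==`; the mass on the diagonal is the sum of the pointwise minima of the two distributions.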
This formulation makes sense even if the marginal supplies don't both correspond to measures with total mass one. In general we could allow any nonnegative masses $R(i)$ and $C(j)$ in the supply stacks for row $i$ and column $j$. We would seek a nonnegative allocation $M(i, j)$ of as much mass as possible into the unshaded cells, subject to

$$\sum_i M(i,j) \le C(j) \quad\text{and}\quad \sum_j M(i,j) \le R(i)$$

for each $i$ and $j$. A continuous analogue of the classical marriage lemma (a sort of fractional polygamy) will give the necessary and sufficient conditions for existence of an $M$ that turns the inequalities for the columns into equalities.

Treat $C$ and $R$ as measures. Write $C(J)$ for the sum of supply masses in a set of columns $J$. Denote by $D_J$ the set of rows $i$ for which cell $(i, j)$ belongs to $D$ for at least one column $j$ in $J$. It is easy to see that $M$ can have column marginal $C$ only if $R(D_J) \ge C(J)$ for every $J$, because the rows in $D_J$ contain all the $D$-cells in the columns of $J$. Sufficiency is a little trickier.

24 Allocation Lemma. If $R(D_J) \ge C(J)$ for every set of columns $J$, then there exists an allocation $M(i, j)$ into the cells of $D$ such that

$$\sum_i M(i,j) = C(j) \quad\text{and}\quad \sum_j M(i,j) \le R(i)$$
for every $i$ and $j$.

PROOF. Use induction on the number of columns. The result is trivial for $c = 1$. Suppose it is true for every number of columns strictly less than $c$. Construct $M$ by transferring mass from the column margins into $D$. Shift mass at a constant rate into each of the $D$-cells in row $r$. For any mass shifted from $C(j)$ into $(r, j)$ discard an equal amount from $R(r)$. If $R(r)$ becomes exhausted, move on to row $r - 1$, and so on. Stop when either:

(i) some $C(j)$ is exhausted; or
(ii) one of the constraints $R(D_J) \ge C(J)$ would be violated by continuation of the current method of allocation.

Here $R$ and $C$ are used as variable measures that decrease as mass is drawn off; the supply stacks diminish as the allocation proceeds. Notice that the mass transferred at each step can be specified as the largest solution to a system of linear inequalities.

If the allocation halts because of (i), the problem is transformed into an allocation for $c - 1$ columns. The inductive hypothesis can be invoked to complete the allocation.

If allocation halts because of (ii), then there must now exist some $K$ for which $R(D_K) = C(K)$. Continued allocation would have caused $R(D_K) < C(K)$. The matching constraint prevents $K$ from containing every column: the total column supply always decreases at the same rate as the total row supply. Write $K^c$ for the nonempty set of columns not in $K$.

[Figure: the array with its columns split into $K^c$ and $K$.]

If the marginal demands of the columns in $K$ are to be met, the entire remaining supply $R(D_K)$ must be devoted to those columns. With this requirement the problem splits into two subproblems: rows in $D_K$ may match only mass drawn off from the columns in $K$; from the rows $D_K^c$ not in $D_K$, match mass from the columns in $K^c$. Both subproblems satisfy the initial assumptions of the lemma. For subsets of $K$ this follows because allocation halted before $R(D_J) < C(J)$ for any $J$. For subsets of $K^c$, it follows from

$$R(D_J \cap D_K^c) = R(D_{J \cup K}) - R(D_K) \ge C(J \cup K) - C(K) = C(J).$$

Invoke the inductive hypothesis for both subproblems to complete the proof of the lemma. □
25 Corollary. If $R$ and $C$ have the same total mass and $R(D_J) \ge C(J)$ for every $J$, then the allocation measure $M$ has marginal measures $R$ and $C$. □
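For small arrays both the hypothesis and the conclusion of the Allocation Lemma can be checked by machine. The sketch below is not the inductive construction of the proof: it swaps in a standard max-flow computation (shortest augmenting paths), with a source feeding the column stacks, the $D$-cells as unrestricted edges, and the row stacks draining into a sink. The function names `hall_condition` and `allocate`, and the encoding of $D$ as a set of `(row, column)` pairs, are assumptions made for illustration.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations


def hall_condition(R, C, D):
    """Brute-force check of R(D_J) >= C(J) for every nonempty column set J."""
    for size in range(1, len(C) + 1):
        for J in combinations(range(len(C)), size):
            D_J = {i for (i, j) in D if j in J}   # rows meeting D in columns J
            if sum(R[i] for i in D_J) < sum(C[j] for j in J):
                return False
    return True


def allocate(R, C, D):
    """Allocation into the cells of D via max flow (illustrative helper)."""
    r, c = len(R), len(C)
    src, snk = 0, 1 + c + r              # columns are 1..c, rows are c+1..c+r
    n = snk + 1
    cap = [[Fraction(0)] * n for _ in range(n)]
    big = sum(C, Fraction(0))            # no cell ever needs more than this
    for j in range(c):
        cap[src][1 + j] = Fraction(C[j])
    for i in range(r):
        cap[1 + c + i][snk] = Fraction(R[i])
    for (i, j) in D:
        cap[1 + j][1 + c + i] = big
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [None] * n
        parent[src] = src
        queue = deque([src])
        while queue and parent[snk] is None:
            u = queue.popleft()
            for v in range(n):
                if parent[v] is None and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[snk] is None:
            break
        # Find the bottleneck along the path, then push that much mass.
        push, v = None, snk
        while v != src:
            u = parent[v]
            push = cap[u][v] if push is None else min(push, cap[u][v])
            v = u
        v = snk
        while v != src:
            u = parent[v]
            cap[u][v] -= push
            cap[v][u] += push
            v = u
    # The reverse residual capacity row -> column records the mass in the cell.
    return {(i, j): cap[1 + c + i][1 + j] for (i, j) in D}


R = [Fraction(1, 2), Fraction(1, 2)]
C = [Fraction(1, 3), Fraction(2, 3)]
D = {(0, 0), (0, 1), (1, 1)}
M = allocate(R, C, D)
```

When `hall_condition(R, C, D)` holds, the column sums of the returned allocation equal `C` and the row sums do not exceed `R`; with equal total masses, as in the Corollary, the row sums equal `R` as well.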
The Allocation Lemma applies directly only to discrete distributions supported by finite sets. For distributions not of that type a preliminary discretization, as in the proof of the Representation Theorem, is needed.

26 Example. Let $P$ and $Q$ be borel probability measures on a separable metric space. The Prohorov distance $\rho(P, Q)$ determines how closely $P$ and $Q$ can be coupled, in the sense that $\rho(P, Q)$ equals the infimum of those values of $\varepsilon$ such that

(27) $\mathbb{P}\{d(X, Y) \ge \varepsilon\} \le \varepsilon,$

with $X$ having distribution $P$ and $Y$ having distribution $Q$. We can use the Allocation Lemma to help prove this.

Half of the argument is easy. From (27) deduce, for every $A$,

$$QA = \mathbb{P}\{Y \in A\} \le \mathbb{P}\{X \in A^{\varepsilon}\} + \mathbb{P}\{d(X, Y) \ge \varepsilon\} \le PA^{\varepsilon} + \varepsilon,$$

whence $\rho(P, Q) \le \varepsilon$.

For the other half of the argument suppose $\rho(P, Q) < \varepsilon$. Construct $X$ and $Y$ by means of a two-stage coupling. Apply the method of Lemma 15 twice to partition the underlying space into sets $B_0, B_1, \dots, B_k$ with both $QB_0 < \delta$ and $PB_0 < \delta$, and diameter$(B_i) < \delta$ for $i = 1, \dots, k$. Choose $\delta$ as a quantity much smaller than $\varepsilon$; it will eventually be forced down to zero while $\varepsilon$ stays fixed. The requirement that each $B_i$ be a $Q$- or $P$-continuity set is irrelevant to our present purpose. Set $R(i)$ equal to $QB_i$ and $C(j)$ equal to $PB_j$. Into the region $D$ allow only those cells $(i, j)$, for $1 \le i \le k$ and $1 \le j \le k$, whose corresponding $B_i$ and $B_j$ contain a pair of points, one in $B_i$ and one in $B_j$, a distance $\le \varepsilon$ apart. Augment the double array by one more row, call it $\infty$, whose row stack contains mass $\varepsilon + 2\delta$. Include $(\infty, 0), \dots, (\infty, k)$ in the region $D$.
The hypotheses of the Allocation Lemma are satisfied. For any collection of columns $J$,

$$\begin{aligned}
C(J) &\le PB_0 + P\Big(\bigcup_{J \setminus \{0\}} B_j\Big) \\
&< \delta + Q\Big(\Big(\bigcup_{J \setminus \{0\}} B_j\Big)^{\varepsilon}\Big) + \varepsilon \\
&\le \delta + Q\Big(\bigcup_{D_J \setminus \{\infty\}} B_i\Big) + QB_0 + \varepsilon \qquad \text{by definition of } D \\
&< \delta + R(D_J \setminus \{\infty\}) + \delta + \varepsilon \\
&= R(D_J).
\end{aligned}$$

Distribute all the mass from the column stacks into $D$, as in the Allocation Lemma. The $\infty$ row acts as a temporary repository for the small amount of mass that cannot legally be shifted into the desired small-diameter cells. Return the mass in this row to the column stacks, leaving at least $1 - \varepsilon - 2\delta$ of the original $C$ mass in the desired cells. Strip away the $\infty$ row. Allocate the remaining mass in the column stacks after expanding $D$ to include all cells $(i, j)$, for $0 \le i \le k$ and $0 \le j \le k$.

So far we have only decided the allocation of masses $M(i, j)$ between the cells. Within the cells distribute according to the product measures

$$M(i,j)\, Q(\cdot \mid B_i) \otimes P(\cdot \mid B_j).$$
The resulting $M$ on $\mathcal{X} \otimes \mathcal{X}$ has marginal measures $P$ and $Q$. For example, within $B_0$ the column marginal is

$$\sum_i M(i, 0)\, Q(B_i \mid B_i)\, P(\cdot \mid B_0) = P(B_0)\, P(\cdot \mid B_0) = P(\cdot \cap B_0).$$

The $M$ measure concentrates at least $1 - \varepsilon - 2\delta$ of its mass within the original $D$, a cluster of cells each of diameter less than $\delta$ in both row and column directions. For a point $(x, y)$ lying in a cell $(i, j)$ of this cluster, there exist points $z_i$ and $z_j$ with

$$d(x, z_i) < \delta, \qquad d(z_i, z_j) \le \varepsilon, \qquad d(z_j, y) < \delta,$$

which gives $d(x, y) < \varepsilon + 2\delta$. Put another way, if $X$ and $Y$ denote the coordinate projections then

$$\mathbb{P}\{d(X, Y) \ge \varepsilon + 2\delta\} \le \varepsilon + 2\delta.$$

As $\delta$ can be chosen arbitrarily small, and $\varepsilon$ can be chosen as close to $\rho(P, Q)$ as we please, we have the desired result. Problem 17 gives a condition under which the bound $\rho(P, Q)$ can be achieved by a coupling of $P$ and $Q$. □
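The easy half of Example 26 can be checked numerically for measures concentrated on finitely many points of the real line. The sketch below assumes that $A^{\varepsilon}$ denotes the set of points at distance strictly less than $\varepsilon$ from $A$ (the reading required by the easy half of the argument) and that it suffices to test subsets $A$ of the common support; the function names `satisfies_27` and `prohorov_inequality_holds` are hypothetical. The coupling is stored with rows indexing the value of $Y$ (distribution $Q$) and columns the value of $X$ (distribution $P$).

```python
from fractions import Fraction
from itertools import combinations


def satisfies_27(points, M, eps):
    """True if the coupling M puts mass at most eps on {d(x, y) >= eps}."""
    n = len(points)
    far = sum(M[i][j] for i in range(n) for j in range(n)
              if abs(points[i] - points[j]) >= eps)
    return far <= eps


def prohorov_inequality_holds(points, P, Q, eps):
    """True if Q(A) <= P(A^eps) + eps for every nonempty subset A of the support."""
    n = len(points)
    for size in range(1, n + 1):
        for A in combinations(range(n), size):
            # Support points within distance < eps of A (assumed meaning of A^eps).
            A_eps = [j for j in range(n)
                     if any(abs(points[j] - points[i]) < eps for i in A)]
            if sum(Q[i] for i in A) > sum(P[j] for j in A_eps) + eps:
                return False
    return True


points = [0, 1, 3]
M = [[Fraction(1, 4), 0, 0],
     [0, Fraction(1, 4), 0],
     [Fraction(1, 4), Fraction(1, 12), Fraction(1, 6)]]
P = [sum(M[i][j] for i in range(3)) for j in range(3)]   # column marginal
Q = [sum(row) for row in M]                              # row marginal
easy_half_ok = (not satisfies_27(points, M, Fraction(1, 2))
                or prohorov_inequality_holds(points, P, Q, Fraction(1, 2)))
```

Whenever `satisfies_27` returns `True` for some `eps`, the easy half guarantees that `prohorov_inequality_holds` returns `True` for the same `eps`; the converse direction is the part that needs the Allocation Lemma.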
IV.5. Weakly Convergent Subsequences

A reader not interested in existence theorems could skip this section, which presents a method for constructing measures on metric spaces. The results will be used in Section V.3 to prove existence of the brownian bridge. The method will be generalized in Chapter VII.

We saw in Section III.6 how to modify the quantile-transformation construction of the one-dimensional Representation Theorem to turn it into an existence theorem, a method for constructing a probability measure as the distribution of the almost sure limit of a sequence of random variables. We had to impose a uniform tightness constraint to stop the sequence from drifting off to infinity. The analogous result for probabilities on metric spaces plays a much more important role than in euclidean spaces, because existence theorems of any sort are so much harder to come by in abstract spaces. Again the key to the construction is a uniform tightness property, which ensures that sequences that ought to converge really do converge.

The setting is still that of a metric space $\mathcal{X}$ equipped with a sub-$\sigma$-field $\mathcal{A}$ of its borel $\sigma$-field.

28 Definitions. Call a probability measure $P$ on $\mathcal{A}$ tight if for every $\varepsilon > 0$ there exists a compact set $K(\varepsilon)$ of completely regular points such that $PK(\varepsilon) > 1 - \varepsilon$. Call a sequence $\{P_n\}$ of probability measures on $\mathcal{A}$ uniformly tight if for every $\varepsilon > 0$ there exists a compact set $K(\varepsilon)$ of completely regular points such that $\liminf P_n G > 1 - \varepsilon$ for every open, $\mathcal{A}$-measurable $G$ containing $K(\varepsilon)$. □

Problem 7 justifies the implicit assumption of $\mathcal{A}$-measurability for the $K(\varepsilon)$ in the definition of tightness; every compact set of completely regular points can be written as a countable intersection of open, $\mathcal{A}$-measurable sets.

If $G$ is replaced by $K(\varepsilon)$, the uniform tightness condition becomes a slightly tidier, but stronger, condition. It is, however, more natural to retain the open $G$. If $P_n \rightsquigarrow P$ and $P$ is tight then, by virtue of the results proved in Example 17, the liminf condition for open $G$ is satisfied; it might not be satisfied if $G$ were replaced by $K(\varepsilon)$. More importantly, one does not need the stronger condition to get weakly convergent subsequences, as will be shown in the next theorem.

For the proof of the theorem we shall make use of a property of compact sets: If $\{x_n\}$ is a Cauchy sequence in a metric space, and if $d(x_n, K) \to 0$ for some fixed compact set $K$, then $\{x_n\}$ converges to a point of $K$. This follows easily from one of a set of alternative characterizations of compactness in metric spaces. As we shall be making free use of these characterizations in later chapters, a short digression on the topic will not go amiss.
To prove the assertion we have only to choose, according to the definition of $d(x_n, K)$, points $\{y_n\}$ in $K$ for which $d(x_n, y_n) \to 0$. From $\{y_n\}$ we can extract a subsequence converging to a point $y$ in $K$. For if no subsequence of $\{y_n\}$ converged to a point of $K$, then around each $x$ in $K$ we could put an open neighborhood $G_x$ that excluded $y_n$ for all large enough values of $n$. This would imply that $\{y_n\}$ is eventually outside the union of the finite collection of $G_x$ sets covering the compact $K$, a contradiction. The corresponding subsequence of $\{x_n\}$ also converges to $y$. The Cauchy property forces $\{x_n\}$ to follow the subsequence in converging to $y$.

A set with the property that every sequence has a convergent subsequence (with limit point in the set) is said to be sequentially compact. Every compact set is sequentially compact. This leads to another characterization of compactness: A sequentially compact set is complete (every Cauchy sequence converges to a point of the set) and totally bounded (for every positive $\varepsilon$, the set can be covered by a finite union of closed balls of radius less than $\varepsilon$). For clearly a Cauchy sequence in a sequentially compact $K$ must converge to the same limit as the convergent subsequence. And if $K$ were not totally bounded, there would be some positive $\varepsilon$ for which no finite collection of balls of radius $\varepsilon$ could cover $K$. We could extract a sequence $\{x_n\}$ in $K$ with $x_{n+1}$ at least $\varepsilon$ away from each of $x_1, \dots, x_n$ for every $n$. No subsequence of $\{x_n\}$ could converge, in defiance of sequential compactness.

For us the last link in the chain of characterizations will be the most important: A complete, totally bounded subset of a metric space is compact. Suppose, to the contrary, that $\{G_i\}$ is an open cover of a totally bounded set $K$ for which no finite union of $\{G_i\}$ sets covers $K$. We can cover $K$ by a finite union of closed balls of radius $\tfrac12$, though. There must be at least one such ball, $B_1$ say, for which $K \cap B_1$ has no finite $\{G_i\}$ subcover. Cover $K \cap B_1$ by finitely many closed balls of radius $\tfrac14$. For at least one of these balls, $B_2$ say, $K \cap B_1 \cap B_2$ has no finite $\{G_i\}$ subcover. Continuing in this way we discover a sequence of closed balls $\{B_n\}$ of radii $\{2^{-n}\}$ for which $K \cap B_1 \cap \dots \cap B_n$ has no finite $\{G_i\}$ cover. Choose a point $x_n$ from this (necessarily nonempty) intersection. The sequence $\{x_n\}$ is Cauchy. If $K$ were also complete, $\{x_n\}$ would converge to some $x$ in $K$. Certainly $x$ would belong to some $G_i$, which would necessarily contain $B_n$ for $n$ large enough. A single $G_i$ is about as finite a subcover as one could wish for. Completeness would indeed force $\{G_i\}$ to have a finite subcover for $K$. End of digression.

29 Compactness Theorem. Every uniformly tight sequence of probability measures contains a subsequence that converges weakly to a tight borel measure.
PROOF. Write $\{P_n\}$ for the uniformly tight sequence, and $K_k$ for the compact set $K(\varepsilon_k)$, for a fixed sequence $\{\varepsilon_k\}$ that converges to zero. We may assume that $\{K_k\}$ is an increasing sequence of sets. The proof will use a coupling to represent a subsequence of $\{P_n\}$ by an almost surely convergent sequence of random elements. The limit of these random elements will concentrate on the union of the compact $K_k$ sets; it will induce the tight borel measure on $\mathcal{X}$ to which the subsequence $\{P_n\}$ will converge weakly.

Complete regularity of each point in $K_k$ allows us to cover $K_k$ by a collection of open $\mathcal{A}$-measurable sets, each of diameter less than $\varepsilon_k$. Invoke compactness to extract a finite subcover, $\{U_{ki}: 1 \le i \le i_k\}$. Define $\mathcal{A}_m$ to be the finite subfield of $\mathcal{A}$ generated by the open sets $U_{ki}$ for $1 \le k \le m$ and $1 \le i \le i_k$. The union of the fields $\{\mathcal{A}_m\}$ is a countable subfield $\mathcal{A}_\infty$ of $\mathcal{A}$.

Apply Cantor's diagonalization argument to extract a subsequence of $\{P_n\}$ along which $\lim P_n A$ exists for each $A$ in $\mathcal{A}_\infty$. Write $\lambda A$ for this limit. It is a finitely additive measure on the field $\mathcal{A}_\infty$. Avoid the mess of double-subscripting by assuming, with no loss of generality, that the subsequence is $\{P_n\}$ itself.

If $\{P_n\}$ were weakly convergent to a measure $P$ we would be able to deduce that

$$P(\text{interior of } A) \le \lambda A \le P(\text{closure of } A)$$

for each $A$ in $\mathcal{A}_\infty$. If we could assume further that $P$ put zero mass on the boundary of each such $A$, we would know the $P$ measure of enough sets to allow almost surely convergent representing sequences to be constructed as in the Representation Theorem. Unfortunately there is no reason to expect $P$ to cooperate in this way. Instead, we must turn to $\lambda$ as a surrogate for the unknown, but sought after, probability measure $P$. Since $\lambda$ need not be countably additive, it would be wicked of us to presume the existence of a random element of $\mathcal{X}$ having distribution $\lambda$. We must take a more devious approach.

We can build a passable imitation of $\mathcal{A}_\infty$ on the unit interval. Partition $(0, 1)$ into as many intervals as there are atoms of $\mathcal{A}_1$, making the lebesgue measure of each interval $\bar{A}$ equal to the $\lambda$ measure of the corresponding $A$ in $\mathcal{A}_1$. These intervals generate a finite field $\bar{\mathcal{A}}_1$ on $(0, 1)$. Partition each atom $\bar{A}$ in $\bar{\mathcal{A}}_1$ into as many subintervals as there are atoms of $\mathcal{A}_2$ in $A$, matching up lebesgue and $\lambda$ measures as before. The subintervals together generate a second field $\bar{\mathcal{A}}_2$ on $(0, 1)$, finer than $\bar{\mathcal{A}}_1$. Continuing in similar fashion, we set up an increasing sequence of fields $\{\bar{\mathcal{A}}_k\}$ on $(0, 1)$ that fit together in the same way as the fields $\{\mathcal{A}_k\}$ on $\mathcal{X}$. The union of the $\bar{\mathcal{A}}_k$'s is a countable subfield $\bar{\mathcal{A}}_\infty$ on $(0, 1)$. There is a bijection $A \leftrightarrow \bar{A}$ between $\mathcal{A}_\infty$ and $\bar{\mathcal{A}}_\infty$ that preserves inclusion, maps $\mathcal{A}_k$ onto $\bar{\mathcal{A}}_k$, and preserves measure, in the sense that the lebesgue measure of $\bar{A}$ equals $\lambda A$. The construction ensures that, if $\eta$ has a Uniform$(0, 1)$ distribution, $\mathbb{P}\{\eta \in \bar{A}\} = \lambda A$ for every $A$ in $\mathcal{A}_\infty$. The random variable $\eta$ chooses between the sets in $\bar{\mathcal{A}}_k$ in much the same way as a random element $X$ with distribution $P$ would choose between the sets in $\mathcal{A}_k$.
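The nested partition of the unit interval can be sketched in a few lines. Everything in this block is an illustrative assumption: atoms are labelled by tuples, with an atom of the finer field extending the label of the coarser atom that contains it; `lam` plays the role of $\lambda$ on the atoms; intervals are taken half-open on the right, a convention the text does not specify; and the names `build_intervals` and `locate` are hypothetical.

```python
from fractions import Fraction


def build_intervals(lam):
    """Map each atom label to a subinterval (left, right) of the unit interval.

    The length of each interval equals the lambda-mass of its atom, and the
    interval of a finer atom sits inside the interval of its parent atom.
    """
    intervals = {}

    def place(prefix, left):
        depth = len(prefix) + 1
        children = sorted(k for k in lam
                          if len(k) == depth and k[:depth - 1] == prefix)
        for child in children:
            right = left + lam[child]
            intervals[child] = (left, right)
            place(child, left)      # subdivide this interval among finer atoms
            left = right

    place((), Fraction(0))
    return intervals


def locate(intervals, eta, level):
    """Return the level-`level` atom whose interval contains eta, if any."""
    for label, (left, right) in intervals.items():
        if len(label) == level and left <= eta < right:
            return label
    return None


# Two atoms at level 1; the first splits into two atoms at level 2.
lam = {
    (0,): Fraction(1, 2), (1,): Fraction(1, 2),
    (0, 0): Fraction(1, 6), (0, 1): Fraction(1, 3), (1, 0): Fraction(1, 2),
}
intervals = build_intervals(lam)
atom_level_1 = locate(intervals, Fraction(1, 4), 1)
atom_level_2 = locate(intervals, Fraction(1, 4), 2)
```

A single uniform draw `eta` then selects one atom at every level simultaneously, and the selected atoms are nested, which is how the same $\eta$ can serve in the construction of every $X_n$.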
By definition of $\lambda$, there exists an $n(k)$ such that

(30) $P_n A \ge (1 - \varepsilon_k)\lambda A$ for every $A$ in $\mathcal{A}_k$ whenever $n \ge n(k)$.

Lighten the notation by assuming that $n(k) = k$. (If you suspect these notational tricks for avoiding an orgy of subsequencing, feel free to rewrite the argument using, by now, triple subscripting.) As in the proof of the Representation Theorem, this allows us to construct a random element $X_n$, with distribution $P_n$, by means of an auxiliary random variable $\xi$ that has a Uniform$(0, 1)$ distribution independent of $\eta$: For each atom $A$ of $\mathcal{A}_n$, if $\eta$ falls in the corresponding $\bar{A}$ of $\bar{\mathcal{A}}_n$ and $\xi \le 1 - \varepsilon_n$, distribute $X_n$ on $A$ according to the conditional distribution $P_n(\cdot \mid A)$. If $\xi > 1 - \varepsilon_n$ distribute $X_n$ with whatever conditional distribution is necessary to bring its overall distribution up to $P_n$. We have coupled each $P_n$ with lebesgue measure on the unit square.

To emphasize that $X_n$ depends on $\eta$, $\xi$, and the randomization necessary to generate observations on $P_n(\cdot \mid A)$, write it as $X_n(\omega, \eta, \xi)$. Notice that the same $\eta$ and $\xi$ figure in the construction of every $X_n$. It will suffice for us to prove that $\{X_n(\omega, \eta, \xi)\}$ converges to a point $X(\omega, \eta, \xi)$ of $K_k$ for every $\omega$ and every pair $(\eta, \xi)$ lying in a region of probability at least $(1 - \varepsilon_k)^2$, a result stronger than mere almost sure convergence to a point in the union of the compact sets $\{K_k\}$. Problem 16 provides the extra details needed to deduce borel measurability of $X$.

For each $m$ greater than $k$, let $G_{mk}$ be the smallest open, $\mathcal{A}_m$-measurable set containing $K_k$. Uniform tightness tells us that

$$\lambda G_{mk} = \liminf P_n G_{mk} > 1 - \varepsilon_k,$$

which implies $\mathbb{P}\{\eta \in \bar{G}_{mk}\} > 1 - \varepsilon_k$. Define $\bar{G}_k$ as the intersection of the decreasing sequence of sets $\{\bar{G}_{mk}\}$ for $m = k, k + 1, \dots$. The overbar here is slightly misleading, because $\bar{G}_k$ need not belong to $\bar{\mathcal{A}}_\infty$. But it is a borel subset of $(0, 1)$. Countable additivity of lebesgue measure allows us to deduce that $\mathbb{P}\{\eta \in \bar{G}_k\} \ge 1 - \varepsilon_k$. Notice how we have gotten around lack of countable additivity for $\lambda$, by pulling the construction back into a more familiar measure space.

Whenever $\eta$ falls in $\bar{G}_k$ and $\xi \le 1 - \varepsilon_k$, which occurs with probability at least $(1 - \varepsilon_k)^2$, the random elements $X_k, X_{k+1}, \dots$ crowd together into a shrinking neighborhood of a point of $K_k$. There exists a decreasing sequence $\{A_m\}$ with:

(i) $A_m$ is an atom of $\mathcal{A}_m$;
(ii) $A_m$ is contained in $G_{mk}$;
(iii) $X_m(\omega, \eta, \xi)$ lies in $A_m$.

Properties (i) and (iii) are consequences of the method of construction for $X_m$; property (ii) holds because $\bar{G}_k$ is a subset of $\bar{G}_{mk}$. The set $G_{mk}$, being the smallest open, $\mathcal{A}_m$-measurable set containing $K_k$, must be contained within the union of those $U_{mi}$ that intersect $K_k$. The atom $A_m$ must lie wholly within one such $U_{mi}$, a set of diameter less than $\varepsilon_m$. So whenever $\eta$ falls in $\bar{G}_k$ and $\xi \le 1 - \varepsilon_k$ the sequence $\{X_m\}$ satisfies:

(i) $d(X_m(\omega, \eta, \xi), X_n(\omega, \eta, \xi)) \le \varepsilon_m$ for $k \le m \le n$;
(ii) $d(X_m(\omega, \eta, \xi), K_k) \le \varepsilon_m$ for $k \le m$.

As explained at the start of the digression, this forces convergence to a point $X(\omega, \eta, \xi)$ of $K_k$. □

NOTES
Any reader uncomfortable with the metric space ideas used in this chapter might consult Simmons (1963, especially Chapters 2 and 5).

The advantages of equipping a metric space with a $\sigma$-field different from the borel $\sigma$-field were first exploited by Dudley (1966a, 1967a), who developed a weak convergence theory for measures living on the $\sigma$-field generated by the closed balls. The measurability problem for empirical processes (Example 2) was noted by Chibisov (1965); he opted for the Skorohod metric. Pyke and Shorack (1968) suggested another way out: $X_n \rightsquigarrow X$ should mean $\mathbb{P}f(X_n) \to \mathbb{P}f(X)$ for all those bounded, continuous $f$ that make $f(X_n)$ and $f(X)$ measurable. They noted the equivalence of this definition to the definition based on the Skorohod metric, for random elements of $D[0, 1]$ converging to a process with continuous sample paths.

Separability has a curious role in the theory. With it, the closed balls generate the borel $\sigma$-field (Problem 6); but this can also hold without separability (Talagrand 1978). Borel measures usually have separable support (Dudley 1967a, 1976, Lecture 5).

Alexandroff (1940, 1941, 1943) laid the foundation for a theory of weak convergence on abstract spaces, not necessarily topological. Prohorov (1956) reset the theory in complete, separable metric space, where most probabilistic and statistical applications can flourish. He and LeCam (1957) proved different versions of the Compactness Theorem, whose form (but not the proof) I have borrowed from Dudley (1966a). Weak convergence of baire measures on general topological spaces was thoroughly investigated by Varadarajan (1965). Topsøe (1970) put together a weak convergence theory for borel measures; he used the liminf property for semicontinuous functions (Example 17) to define weak convergence. These two authors made clear the need for added regularity conditions on the limit measure and separation properties on the topology. One particularly nice combination (a completely regular topology and a $\tau$-additive limit measure) corresponds closely to my assumption that limit measures concentrate on separable sets of completely regular points.

The best references to the weak convergence theory for borel measures on metric spaces remain Billingsley (1968, 1971) and Parthasarathy (1967).
Dudley's (1976) lecture notes offer an excellent condensed exposition of both the mathematical theory and the statistical applications.

Example 11 is usually attributed to Wichura (1971), although Hájek (1965) used a similar approximation idea to prove convergence for random elements of $C[0, 1]$.

Skorohod (1956) hit upon the idea of representing sequences that converge in distribution by sequences that converge almost surely, for the case of random elements of complete, separable metric spaces. The proof in Section 3 is adapted from Dudley (1968). He paid more attention to some of the points glossed over in my proof; for example, he showed how to construct a probability space supporting all the $\{X_n\}$. Here, and in Section 5, one needs the existence theorem for product measures on infinite-product spaces. Pyke (1969, 1970) has been a most persuasive advocate of this method for proving theorems about weak convergence. Many of the applications now belong to the folklore.

The uniformity result of Example 19 comes from Ranga Rao (1962); Billingsley and Topsøe (1967) and Topsøe (1970) perfected the idea. Not surprisingly, the original proofs of this type of result made direct use of the dissection technique of Lemma 15.

Prohorov (1956) defined the Prohorov metric; Dudley (1966b) defined the bounded Lipschitz metric.

Strassen (1965) invoked convexity arguments to establish the coupling characterization of the Prohorov metric (Example 26). My proof comes essentially from Dudley (1968), via Dudley (1976, Lecture 18), who introduced the idea of building a coupling between discrete measures by application of the marriage lemma. The Allocation Lemma can also be proved by the max-flow min-cut theorem (an elementary result from graph theory; for a proof see Bollobás (1979)). The conditions of my Lemma ensure that the minimum capacity of a cut will correspond to the total column mass. Appendix B of Jacobs (1978) contains an exposition of this approach, following Hansel and Troallic (1978). Major (1978) has described more refined forms of coupling.

PROBLEMS
[1] Suppose the empirical process $U_2$ were measurable with respect to the borel $\sigma$-field on $D[0, 1]$ generated by the uniform metric. For each subset $A$ of $(1, 2)$ define $J_A$ as the open set of functions in $D[0, 1]$ with jumps at some pair of distinct points $t_1$ and $t_2$ in $[0, 1]$ with $t_1 + t_2$ in $A$. Define a nonatomic measure on the class of all subsets of $(1, 2)$ by setting $\gamma(A) = \mathbb{P}\{U_2 \in J_A\}$. This contradicts the continuum hypothesis (Oxtoby 1971, Section 5). Manufacture from $\gamma$ an extension of the uniform distribution to all subsets of $(1, 2)$ if you would like to offend the axiom of choice as well. Extend the argument to larger sample sizes.

[2] Write $\mathcal{A}$ for the $\sigma$-field on a set $\mathcal{X}$ generated by a family $\{f_i\}$ of real-valued functions on $\mathcal{X}$. That is, $\mathcal{A}$ is the smallest $\sigma$-field containing $f_i^{-1}B$ for each $i$ and each borel set $B$. Prove that a map $X$ from $(\Omega, \mathcal{E})$ into $\mathcal{X}$ is $\mathcal{E}/\mathcal{A}$-measurable if and only if the composition $f_i \circ X$ is $\mathcal{E}/\mathcal{B}(\mathbb{R})$-measurable for each $i$.
[3] Every function in $D[0, 1]$ is bounded: $|x(t_n)| \to \infty$ as $n \to \infty$ would violate either the right continuity or the existence of the left limit at some cluster point of the sequence $\{t_n\}$.

[4] Write $\mathcal{P}$ for the projection $\sigma$-field on $D[0, 1]$ and $\mathcal{B}_0$ for the $\sigma$-field generated by the closed balls of the uniform metric. Write $\pi_t$ for the projection map that takes an $x$ in $D[0, 1]$ onto its value $x(t)$.
(a) Prove that each $\pi_t$ is $\mathcal{B}_0$-measurable. [Express $\{x: \pi_t x > \alpha\}$ as a countable union of closed balls $B(x_n, n)$, where $x_n$ equals $\alpha$ plus $(n + n^{-1})$ times the indicator function of $[t, t + n^{-1})$.] Deduce that $\mathcal{B}_0$ contains $\mathcal{P}$.
(b) Prove that the $\sigma$-field $\mathcal{P}$ contains each closed ball $B(x, r)$. [Express the ball as an intersection of sets $\{z: |\pi_t x - \pi_t z| \le r\}$, with $t$ rational.] Deduce that $\mathcal{P}$ contains $\mathcal{B}_0$.

[5] Let $\{G_i\}$ be a family of open sets whose union covers a separable subset $C$ of a metric space. Adapt the argument of Lemma 7 to prove that $C$ is contained in the union of some countable subfamily of the $\{G_i\}$. [This is Lindelöf's theorem.]

[6] Every separable, open subset of a metric space can be written as a countable union of closed balls. [Rational radii, centered at points of the countable dense set.] The closed balls generate the borel $\sigma$-field on a separable metric space.

[7] Every closed, separable set of completely regular points belongs to $\mathcal{A}$. [Cover it with open, $\mathcal{A}$-measurable sets of small diameter. Use Lindelöf's theorem to extract a countable subcover. The union of these sets belongs to $\mathcal{A}$. Represent the closed set as a countable intersection of such unions.]

[8] Let $C_0$ be the countable subset of $C[0, 1]$ consisting of all piecewise linear functions with corners at only a finite set of rational pairs $(t_i, r_i)$. Argue from uniform continuity to prove that $C[0, 1]$ equals the closure of $C_0$. Deduce that $C[0, 1]$ is a projection-measurable subset of $D[0, 1]$.

[9] A function $h$ is said to be lower-semicontinuous at a point $x$ if, for each $M < h(x)$, $h$ is greater than $M$ in some neighborhood of $x$. To say $h$ is lower-semicontinuous means that it is lower-semicontinuous at every point. Show that the upper envelope of any set of continuous functions is lower-semicontinuous. Adapt the construction of Lemma 7 to prove that every lower-semicontinuous function that is bounded below can be represented on a separable set of completely regular points as the pointwise limit of an increasing sequence of continuous functions. How would one define upper-semicontinuity? Which sets should have upper-semicontinuous indicator functions? What does a combination of both semicontinuities imply?

[10] If $X_n \rightsquigarrow X$ as random elements of a metric space $\mathcal{X}$ and $d(X_n, Y_n) \to 0$ in probability, then $Y_n \rightsquigarrow X$, provided that $\mathbb{P}_X$ concentrates on a separable set of completely regular points. [Convergence in probability means $\mathbb{P}^*\{d(X_n, Y_n) > \varepsilon\} \to 0$ for each $\varepsilon > 0$.]

[11] Let $P$ be a borel measure on a metric space. For every borel set $B$ there exists an open $G_\varepsilon$ containing $B$ and a closed $F_\varepsilon$ contained in $B$ with $P(G_\varepsilon \setminus F_\varepsilon) < \varepsilon$. [The class of all sets with this property forms a $\sigma$-field. Each closed set has the property because it can be written as a countable intersection of open sets.] Deduce that $P$ is uniquely determined by the values it gives to closed sets. Extend the result to measures defined on the $\sigma$-field generated by the closed balls.
[12] Suppose $\limsup P_n F \le PF$ for each closed, $\mathcal{A}$-measurable set $F$. Prove that $P_n \rightsquigarrow P$ by applying the inequalities

$$k^{-1} \sum_{i=1}^{\infty} \{f \ge i/k\} \le f \le k^{-1} + k^{-1} \sum_{i=1}^{\infty} \{f \ge i/k\}$$

for each nonnegative $f$ in $\mathcal{C}(\mathcal{X}; \mathcal{A})$. [The summands are identically zero for all $i$ large enough. Apply the same argument to $-f$ + (a big constant).]

[13] If $P_n B \to PB$ for each $\mathcal{A}$-measurable set $B$ whose boundary has zero $P$ measure then $P_n \rightsquigarrow P$. [Replace the levels $i/k$ of the previous problem by levels $t_i$ for which $P\{f = t_i\} = 0$.]
[14] The functions in $\mathcal{C}(\mathcal{X}; \mathcal{A})$ generate a sub-$\sigma$-field $\mathcal{B}_c$ of $\mathcal{A}$. A map $X$ from $(\Omega,$ […]

[…] Questions of measurability in this chapter will always refer to this $\sigma$-field. A stochastic process $X$ on $(\Omega, \mathcal{E}, \mathbb{P})$ with sample paths in $D[0, 1]$, such as an empirical process, is $\mathcal{E}/\mathcal{P}$-measurable provided $\pi_t X$ is $\mathcal{E}/\mathcal{B}(\mathbb{R})$-measurable for each fixed $t$ (Problem IV.2). Probability measures on $\mathcal{P}$ are uniquely determined by the values they give to the generating sets $\{\pi_S^{-1} B\}$, with $S$ a finite subset of $[0, 1]$ and $B$ a borel subset of $\mathbb{R}^S$. Equivalently, the distribution of a random element of $D[0, 1]$ is uniquely specified by giving the distributions of all its finite-dimensional projections.

As every cadlag function on $[0, 1]$ is bounded (Problem IV.3), the uniform distance

$$d(x, y) = \|x - y\| = \sup\{|x(t) - y(t)|: 0 \le t \le 1\}$$

defines a metric on $D[0, 1]$. No other metric will be used for $D[0, 1]$ in this chapter. The closed balls for $d$ generate $\mathcal{P}$ but not the larger borel $\sigma$-field (Problem IV.4). Every point in $D[0, 1]$ is completely regular; see Definition IV.6 and the discussion preceding it.

The difficulty with the $\sigma$-field, which can be blamed on the lack of a countable, dense subset of functions in $D[0, 1]$, has dissuaded many authors from working with the uniform metric. Chapter IV showed that the difficulty can be surmounted, at least when limit distributions concentrate on separable subsets of $D[0, 1]$. As compensation for persisting with the uniform metric, we shall find its topological properties much easier to understand and manipulate than those of its main competitor, the Skorohod metric, which will be discussed in Chapter VI. That will make life more pleasant for us in Section 3 when we come to apply the Compactness Theorem.

The limit processes for the applications in this chapter will always concentrate in a separable subset of $D[0, 1]$, usually $C[0, 1]$, the set of all continuous, real functions on $[0, 1]$. As a closed (uniform convergence preserves continuity), separable, $\mathcal{P}$-measurable subset of $D[0, 1]$ (Problem IV.8), $C[0, 1]$ has several attractive properties. For example, it inherits completeness from $D[0, 1]$, and its projection $\sigma$-field coincides with its borel $\sigma$-field for the uniform metric (Problem IV.6).

How do we establish convergence in distribution of a sequence $\{X_n\}$ of random elements of $D[0, 1]$ to a limit process $X$? Certainly we need the finite-dimensional projections $\{\pi_S X_n\}$ to converge in distribution, as random vectors in $\mathbb{R}^S$, to the finite-dimensional projection $\pi_S X$, for each finite subset $S$ of $[0, 1]$. Continuity of $\pi_S$ and the Continuous Mapping Theorem make that a necessary condition. The methods of Chapter III usually help here. But that alone could hardly suffice. Continuous functionals such as $Mx = \|x\|$, a typical nontrivial example, depend on $x$ through more than its values at a fixed, finite $S$. Indeed, direct examination of that very functional gives the clue to the extra condition needed.
V.1. Approximation of Stochastic Processes
Intuitively, it should be possible to approximate MX n by taking the maximum of IXn(s) I over a large, finely spaced, finite subset S of [0,1]. If S expands up to a countable, dense subset of [0, 1] containing 1, then
MsXn = max IXn(s) 1+ MX n
s for every sample path of X n. The cadlag property assures us of this. (Notice the special treatment demanded by 1.) Given any (j > and e > 0, a large enough S could be found to ensure that
°
(1)
At first sight this seems to have solved the problem. Because MsXn depends continuously on nsXn' it converges in distribution to MsX as n + 00. For f bounded and uniformly continuous, IPf(MsXn) + IPf(MsX). From (1) with (j chosen appropriately,
IIPf(MsXn)  IPf(MX n) I ~
e + 211flle.
A similar inequality holds for $X$. Taken together, these seem to add up to convergence in distribution of $MX_n$ to $MX$. But the argument is flawed. The problem lies with the choice of $S$ in (1). If the sample paths of $X_n$ jumped about more and more wildly as $n$ increased, $S$ would have to get bigger with $n$. That would undermine the convergence of $\{M_S X_n\}$ to $M_S X$, since finite-dimensional convergence says nothing about $\{\pi_S X_n\}$ for $S$ varying with $n$. We need the same $S$ to work for each $n$.
A similar argument can be invoked for any other continuous functional $H$ on $D[0,1]$. We approximate $HX_n$ by a continuous function of $\pi_S X_n$ for some large, finite set $S$ not depending on $n$. We construct the approximation by applying $H$ to an element $A_S X_n$ in $D[0,1]$. For simplicity, suppose $S$ has been rearranged into increasing order and augmented by the points 0 and 1, if necessary, to form a grid $0 = s_0 < s_1 < \cdots < s_k = 1$. For $x$ in $D[0,1]$ define the approximating path $A_S x$ by
$$(A_S x)(t) = x(s_i) \quad \text{for } s_i \le t < s_{i+1}, \tag{2}$$
[…] such that $|H(A_S x) - Hx| < \varepsilon$ whenever $\|A_S x - x\| < \delta$; the random variable $HX_n$ lies within $\varepsilon$ of $H(A_S X_n)$ with probability no less than $\mathbb{P}\{\|A_S X_n - X_n\| < \delta\}$. Again, the same grid would have to work for every $X_n$, or at least for every $X_n$ with $n$ large enough, to allow finite-dimensional convergence to imply convergence in distribution of $\{A_S X_n\}$ to $A_S X$. (The example of the empirical process $U_n$, whose sample paths have jumps of size $n^{-1/2}$, shows that uniform approximation of $X_n$ by $A_S X_n$ can only be required for large values of $n$, anyway.) If such a map $A_S$ does exist, the argument sketched above for the supremum functional $M$ will carry over to the functional $H$, proving that $HX_n \rightsquigarrow HX$. (The argument might seem familiar: it is a special case of Example IV.11.)
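The step approximation $A_S x$ can be tried out numerically. The following is only a rough sketch, not part of the text: the test path `x`, the two grids, and the mesh used to approximate the uniform norm are all illustrative assumptions. It shows that a grid containing the jump point of the path gives a small uniform error, while a grid that misses the jump does not.

```python
from bisect import bisect_right

def step_approximation(x, grid):
    """Return A_S x for a grid 0 = s_0 < s_1 < ... < s_k = 1 (a sorted list)."""
    def approx(t):
        i = bisect_right(grid, t) - 1   # largest i with s_i <= t
        return x(grid[i])
    return approx

def sup_distance(f, g, m=10_000):
    """Approximate the uniform distance on [0, 1] using the mesh j/m."""
    return max(abs(f(j / m) - g(j / m)) for j in range(m + 1))

def x(t):
    # hypothetical cadlag path: linear drift plus a unit jump at t = 0.5
    return t + (1.0 if t >= 0.5 else 0.0)

coarse = [0.0, 0.3, 0.7, 1.0]            # misses the jump at 0.5
fine = [j / 10 for j in range(11)]       # contains the jump point 0.5

err_coarse = sup_distance(x, step_approximation(x, coarse))
err_fine = sup_distance(x, step_approximation(x, fine))
print(err_coarse, err_fine)
```

The coarse grid straddles the jump, so the error is of the size of the jump; the fine grid picks the jump up and the error is governed by the grid spacing alone.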
V. The Uniform Metric on Spaces of Cadlag Functions
For the sake of brevity, from now on shorten "finite-dimensional distributions" to fidis and "finite-dimensional projections" to fidi projections.
3 Theorem. Let $X, X_1, X_2, \ldots$ be random elements of $D[0,1]$ (under its uniform metric and projection $\sigma$-field). Suppose $\mathbb{P}\{X \in C\} = 1$ for some separable subset $C$ of $D[0,1]$. The necessary and sufficient conditions for $\{X_n\}$ to converge in distribution to $X$ are:
(i) the fidis of $X_n$ converge to the fidis of $X$; that is, $\pi_S X_n \rightsquigarrow \pi_S X$ for each finite subset $S$ of $[0,1]$;
(ii) to each $\varepsilon > 0$ and $\delta > 0$ there corresponds a grid $0 = t_0 < t_1 < \cdots < t_m = 1$ such that
$$\limsup_n \mathbb{P}\Big\{\max_i \sup_{t \in J_i} |X_n(t) - X_n(t_i)| > \delta\Big\} < \varepsilon, \tag{4}$$
where $J_i$ denotes the interval $[t_i, t_{i+1})$, for $i = 0, 1, \ldots, m-1$.
PROOF. Suppose that $X_n \rightsquigarrow X$. The projection map $\pi_S$ is both projection-measurable (by definition) and continuous. Condition (i) follows by the Continuous Mapping Theorem.
To simplify the proof of (ii) suppose that the separable subset $C$ equals $C[0,1]$. Continuity of the sample paths makes the choice of the grid in (ii) easier. (Problem 4 outlines the extra arguments needed for the general case.) Let $\{s_0, s_1, \ldots\}$ be a countable, dense subset of $[0,1]$. To avoid trivialities, assume that $s_0 = 0$ and $s_1 = 1$. Write $A_k$ for the interpolation map constructed as in (2) from the grid obtained by rearranging $s_0, \ldots, s_k$ into increasing order. For fixed $x$ in $C[0,1]$, the distance $\|A_k x - x\|$ converges to zero as $k$ increases, by virtue of the uniform continuity of $x$. Applied to the sample paths of $X$, this shows that $\|A_k X - X\| \to 0$ almost surely. Convergence in probability would be enough to assure the existence of some $k$ for which
$$\mathbb{P}\{\|A_k X - X\| \ge \delta\} < \varepsilon. \tag{5}$$
Choose and fix such a $k$. Because $\|A_k x - x\|$ varies continuously with $x$, the set
$$F = \{x \in D[0,1] : \|A_k x - x\| \ge \delta\}$$
is closed. By Example IV.17 (inequality IV.18 to be precise), $\limsup \mathbb{P}\{X_n \in F\} \le \mathbb{P}\{X \in F\}$. The left-hand side bounds the limsup in (4). We could require "$\ge \delta$" rather than "$> \delta$" in (4), but that would create a minor complication in the next lemma.
Now let us show that (i) and (ii) imply $X_n \rightsquigarrow X$. Retain the assumption that $X$ has continuous sample paths (see Problem 4 for the general case),
so that (5) still holds. Choose any bounded, uniformly continuous, projection-measurable, real function $f$ on $D[0,1]$. Given $\varepsilon > 0$ find $\delta > 0$ such that $|f(x) - f(y)| < \varepsilon$ whenever $\|x - y\| \le \delta$. Write $A_T$ for the approximation map constructed from the grid in (ii) corresponding to this $\delta$ and $\varepsilon$. Condition (4) becomes $\limsup \mathbb{P}\{\|A_T X_n - X_n\| > \delta\}$ […]
[…] $\varepsilon_n\} \to 0$. Set $\xi_{ni} = X(i/n) - X((i-1)/n)$. Define step-function approximations to $X$ by
$$X_n(t) = X(0) + \sum_{i=1}^{n} \{i/n \le t\}\,\xi_{ni}\{|\xi_{ni}| \le \varepsilon_n\}.$$
Whenever $H_{2/n}(X) \le \varepsilon_n$, the $X_n$ process lies uniformly within $\varepsilon_n$ of $X$. Thus $\|X_n - X\| \to 0$ in probability.
Fix $t$. Write $\sigma_n^2$ for the variance of the sum appearing in the definition of $X_n(t)$, and $a_n$ for its expectation. If $\sigma_n \to 0$, then $X_n(t) = X(0) + a_n + o_p(1)$, from which we get $X(t) - X(0) = $ constant, a degenerate sort of normality. If, on the other hand, $\{\sigma_n\}$ does not tend to zero, then, along some subsequence, $(X_n(t) - X(0) - a_n)/\sigma_n \rightsquigarrow N(0,1)$ by the Lindeberg Central Limit Theorem. (Necessarily $\varepsilon_n/\sigma_n \to 0$ along any subsequence for which $\sigma_n$ is bounded away from zero.) Convergence of types (Breiman 1968, Section 8.8) allows nothing but normality for the distribution of $X(t) - X(0)$. A similar argument works for any other increment of $X$.
Even with the gloss of apparent generality rubbed off the limit distribution, Theorem 19 still has enough content to handle some nonclassical problems about sums of independent random variables.
20 Example. Let $\{\xi_{ni}: i = 1, \ldots, k(n);\ n = 1, 2, \ldots\}$ be a triangular array of random variables satisfying the conditions of the Lindeberg Central Limit Theorem. That is, they are independent within each row, they have zero means, they have variances $\{\sigma_{ni}^2\}$ that sum to one within each row, and
$$\sum_i \mathbb{P}\xi_{ni}^2\{|\xi_{ni}| \ge \varepsilon\} \to 0 \tag{21}$$
for each fixed $\varepsilon > 0$. Set $S_{ni} = \xi_{n1} + \cdots + \xi_{ni}$. What is the asymptotic behavior of a random variable such as $\max_i S_{ni}$? We may write it as a functional of the partial-sum process $S_n(\cdot)$, the random element of $D[0,1]$ defined by
$$S_n(t) = S_{ni} \quad \text{for } \operatorname{var} S_{ni} \le t < \operatorname{var} S_{n,i+1}.$$
The curious choice for the location of the jumps of $S_n(\cdot)$ has the virtue that $\operatorname{var} S_n(t) \to t$ as $n \to \infty$, for each fixed $t$, because $\max_i \sigma_{ni}^2 \to 0$. The increments of $S_n(\cdot)$ over disjoint subintervals of $[0,1]$ are sums over disjoint groups of the $\{\xi_{ni}\}$; the process has independent increments. If we joined up the vertices, to produce an interpolated partial-sum process with sample paths in $C[0,1]$, we would destroy this property.
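The placement of the jumps at the cumulative variances can be made concrete with a small sketch. This is only an illustration under stated assumptions: the row of increments `xi` and their variances are made-up numbers, and the convention $S_{n0} = 0$ on $[0, \operatorname{var} S_{n1})$ is assumed here rather than taken from the text.

```python
from bisect import bisect_right
from itertools import accumulate

def partial_sum_process(xi, variances):
    """Build S_n(t) = S_ni for var S_ni <= t < var S_{n,i+1}.

    xi        : one row of the triangular array (realized values)
    variances : the variances sigma_ni^2 of that row (assumed to sum to one)
    Assumed convention: S_n0 = 0, so S_n(t) = 0 before the first jump.
    """
    sums = [0.0] + list(accumulate(xi))          # S_n0, S_n1, ...
    times = [0.0] + list(accumulate(variances))  # var S_n0, var S_n1, ...

    def S(t):
        i = bisect_right(times, t) - 1           # largest i with var S_ni <= t
        return sums[i]

    return S, times, sums

# hypothetical row with four increments, each of variance 1/4
xi = [0.5, -0.5, 0.5, 0.5]
variances = [0.25, 0.25, 0.25, 0.25]
S, times, sums = partial_sum_process(xi, variances)

print([S(t) for t in (0.0, 0.3, 0.6, 1.0)])
print(max(sums))   # the supremum of the step path is attained at a jump time
```

Because the path is a step function, its supremum over $[0,1]$ reduces to a maximum over the finitely many values `sums`, which is how $\max_i S_{ni}$ becomes a functional of $S_n(\cdot)$.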
[Figure: a sample path of the partial-sum process $S_n(\cdot)$.]
For fixed $s$ and $t$, with $s$ less than $t$, the increment $S_n(t) - S_n(s)$ is also a sum of the elements in a triangular array of independent random variables. Because
$$\operatorname{var}[S_n(t) - S_n(s)] \to t - s$$
and because (21) also holds for summation over any subset of the $\{\xi_{ni}\}$, the Lindeberg Central Limit Theorem makes $S_n(t) - S_n(s)$ converge to a $N(0, t - s)$ distribution, one of the conditions we need to make $S_n(\cdot)$ converge in distribution to a brownian motion.
What about condition (iii) of Theorem 19? Tchebychev's inequality is good enough.
$$\mathbb{P}\{|S_n(t) - S_n(s)| \le \tfrac12\delta\} \ge 1 - \operatorname{var}[S_n(t) - S_n(s)]/(\tfrac12\delta)^2 \to 1 - (t - s)/(\tfrac12\delta)^2,$$
uniformly in $t$ and $s$. You can figure out what IJ. should be from this.
Only our ingenuity in thinking up functionals that are continuous at the sample paths of brownian motion can curtail our supply of limit theorems for the partial sums. One example:
$$\max_i S_{ni} = \sup_t S_n(t) \rightsquigarrow \sup_t B(t).$$
Calculation of the limit distribution in closed form awaits us in Section 6, where this and several other functionals of brownian motion and brownian bridge will be examined. □
V.5. Infinite Time Scales
Both the spaces $D[0,1]$ and $D[-\infty, \infty]$ have compact intervals of the extended real line as their index sets; the theories for convergence in distribution of random elements of these spaces differ only superficially. For spaces with noncompact index sets some extra complications arise. As a typical example, consider $D[0,\infty)$, the set of all real-valued cadlag functions on $[0,\infty)$. A function $x$ in $D[0,\infty)$ must be right-continuous at each point of $[0,\infty)$, with a finite left limit existing at each point of $(0,\infty)$. Because the limit point $+\infty$ does not belong to the index set, no constraint is placed on the behavior
of $x(t)$ as $t \to \infty$; it could diverge or oscillate about in any fashion. Such a function need not be bounded. The uniform distance between two functions in $D[0,\infty)$ might be infinite. Even for bounded random elements of $D[0,\infty)$, convergence in distribution in the sense of the uniform metric may impose far stronger requirements on their tail behavior (that is, on what happens as $t \to \infty$) than we can hope to verify. Sometimes the best we can try for is control over large compact subintervals of $[0,\infty)$. That corresponds to convergence in the sense of the metric for uniform convergence on compacta.
22 Definition. A sequence of functions $\{x_n\}$ in $D[0,\infty)$ converges uniformly on compacta to a function $x$ if $\sup_{t \le k} |x_n(t) - x(t)| \to 0$ as $n \to \infty$ for each fixed $k$. Equivalently, $d(x_n, x) \to 0$, where
$$d(x_n, x) = \sum_{k=1}^{\infty} 2^{-k} \min[1, d_k(x_n, x)], \qquad d_k(x_n, x) = \sup_{t \le k} |x_n(t) - x(t)|.$$
In all that follows, $D[0,\infty)$ will be equipped with the metric $d$ and the projection $\sigma$-field. With that combination, each $x$ in $D[0,\infty)$ is completely regular (Problem 12). For each $k$ define a truncation map $L_k$ from $D[0,\infty)$ into $D[0,k]$; set $L_k x$ equal to the restriction of $x$ to the interval $[0,k]$. By construction, $\{x_n\}$ converges to $x$ if and only if $\{L_k x_n\}$ converges uniformly to $L_k x$ for each $k$. Convergence in distribution has a similar characterization.
23 Theorem. Let $X, X_1, X_2, \ldots$ be random elements of $D[0,\infty)$, with $\mathbb{P}\{X \in C\} = 1$ for some separable set $C$. Then $X_n \rightsquigarrow X$ if and only if $L_k X_n \rightsquigarrow L_k X$, as random elements of $D[0,k]$, for each fixed $k$.
PROOF. Necessity of the condition follows from the Continuous Mapping Theorem, because each $L_k$ is both continuous and measurable. For sufficiency, define a continuous, measurable map $H_k$ from $D[0,k]$ into $D[0,\infty)$ by $(H_k z)(t) = z(t \wedge k)$. It comes as close as possible to defining an inverse to the map $L_k$: the function $H_k \circ L_k x$ equals $x$ on $[0,k]$. Deduce that $d(x, H_k \circ L_k x) \le 2^{-k}$ for every $x$ in $D[0,\infty)$. Suppose $L_k X_n \rightsquigarrow L_k X$. Because $H_k$ is continuous, $H_k \circ L_k X_n \rightsquigarrow H_k \circ L_k X$. Pick $k$ large enough to make $2^{-k} < \varepsilon$. Then both $\mathbb{P}\{d(X_n, H_k \circ L_k X_n) > \varepsilon\}$ and $\mathbb{P}\{d(X, H_k \circ L_k X) > \varepsilon\}$ equal zero. Example IV.11, with $H_k \circ L_k$ playing the role of the approximation map, does the rest. □
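The bound $d(x, H_k \circ L_k x) \le 2^{-k}$ used in the proof can be checked numerically. The sketch below is an illustration only: the infinite sum in $d$ is truncated at `kmax` terms, each supremum is approximated over a finite mesh, and the test function is an arbitrary unbounded choice; none of these details come from the text.

```python
def d_k(x, y, k, mesh=50):
    """Approximate sup over [0, k] of |x - y| using the mesh j/mesh."""
    return max(abs(x(j / mesh) - y(j / mesh)) for j in range(k * mesh + 1))

def d(x, y, kmax=30, mesh=50):
    """Truncated version of the metric for uniform convergence on compacta.

    The neglected tail of the series is at most 2**-kmax.
    """
    return sum(2.0 ** -k * min(1.0, d_k(x, y, k, mesh))
               for k in range(1, kmax + 1))

def truncate_and_extend(x, k):
    """The map H_k composed with L_k: t -> x(min(t, k))."""
    return lambda t: x(min(t, k))

x = lambda t: t * t          # hypothetical unbounded path in D[0, infinity)
k = 3
dist = d(x, truncate_and_extend(x, k))
print(dist, 2.0 ** -k)
```

The first $k$ terms of the series vanish because the two functions agree on $[0,k]$; each remaining term is at most $2^{-j}$, which gives the geometric tail bound.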
Typically $C$ equals $C[0,\infty)$, the set of all continuous functions in $D[0,\infty)$. As in the case of compact time intervals, $C[0,\infty)$ sits inside $D[0,\infty)$ as a closed, separable, measurable subset (Problem 13). The borel $\sigma$-field on $C[0,\infty)$, generated by the open sets for the $d$ metric, coincides with the projection $\sigma$-field. The most important of the limit processes living in $C[0,\infty)$ is brownian motion on $[0,\infty)$.
24 Example. Let $S_n$ denote the $n$th partial sum of a sequence $\xi_1, \xi_2, \ldots$ of independent, identically distributed random variables for which $\mathbb{P}\xi_i = 0$ and $\mathbb{P}\xi_i^2 = 1$. What is the asymptotic behavior of the hitting time
$$\tau_n = \inf\{j : n^{-1/2} S_j > 1\}$$
as $n$ tends to infinity?
Define $Hx = \inf\{t \ge 0 : x(t) > 1\}$, with the usual convention that the empty set has infimum $+\infty$. Right continuity of functions in $D[0,\infty)$ allows us to take the infimum over rational $t$ values to verify measurability of $H$. Express $n^{-1}\tau_n$ as the functional $HX_n$ of the process
$$X_n(t) = n^{-1/2} S_j \quad \text{for } j/n \le t < (j+1)/n.$$
Use Theorem 23 to check that $X_n \rightsquigarrow B$, a brownian motion on $[0,\infty)$. For fixed $k$, the truncated process $L_k X_n$ has the same form as the partial-sum process of Example 20, but stretched out and rescaled to fit the interval $[0,k]$ instead of $[0,1]$. The modifications have little effect on the arguments adduced there to prove convergence to brownian motion; exactly the same idea works in $D[0,k]$.
If we are to use the Continuous Mapping Theorem to prove $HX_n \rightsquigarrow HB$ we will need $H$ continuous at almost all sample paths of $B$. By itself, continuity of a sample path $x$ will not suffice for continuity of $H$ at $x$. You can construct sequences of functions $\{x_n\}$ that converge uniformly to a bad continuous $x$ without having $Hx_n$ converging to $Hx$.
BAD: [figure omitted]
The functional $H$ has better success with a good path like this:
GOOD: [figure omitted; axis labels $\tau - \delta$, $\tau + \delta$, $1 - \varepsilon$]
Here $Hx = \tau$. If $x(t) \le 1 - \varepsilon$ for $t \le \tau - \delta$ (a continuous function achieves its maximum on a compact interval) and $x(\tau + \delta) \ge 1 + \varepsilon$, then any $y$ in $D[0,\infty)$ with $d_{\tau+\delta}(x, y) < \varepsilon$ must satisfy $\tau - \delta \le Hy \le \tau + \delta$.
That brownian motion has only good sample paths, almost surely, can best be shown by arguments based on a strong markov property, a topic to be taken up in the next section. The distribution of the functional $HB$ of brownian motion will also be derived in that section. For the moment, we must content ourselves with knowing that $n^{-1}\tau_n$ has a limiting distribution that we could calculate if we were better acquainted with the limit process of the sequence $\{X_n\}$. □
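The identity $n^{-1}\tau_n = HX_n$ from this example can be illustrated with a few lines of code. This is a minimal sketch under assumptions of our own: the increments are a fixed, made-up list rather than random draws, and the step path $X_n$ is represented only by its values at the jump times $j/n$, which suffices because a right-continuous step function first exceeds a level at one of its jump times.

```python
import math

def tau(xi, n):
    """tau_n = inf{j : n^{-1/2} S_j > 1}; math.inf if the level is never passed."""
    s = 0.0
    for j, step in enumerate(xi, start=1):
        s += step
        if s / math.sqrt(n) > 1.0:
            return j
    return math.inf

def step_path_values(xi, n):
    """Values X_n(j/n) = n^{-1/2} S_j for j = 0, 1, ..., len(xi)."""
    values, s = [0.0], 0.0
    for step in xi:
        s += step
        values.append(s / math.sqrt(n))
    return values

def H_step(values, n):
    """H x = inf{t >= 0 : x(t) > 1} for the step path x(t) = values[j] on [j/n, (j+1)/n)."""
    for j, v in enumerate(values):
        if v > 1.0:
            return j / n
    return math.inf

n = 4
xi = [1.0, 1.0, -1.0, 1.0, 1.0, 1.0]      # hypothetical increments
print(tau(xi, n) / n, H_step(step_path_values(xi, n), n))
```

Both expressions give the same value, which is the sense in which the hitting time is a functional of the rescaled partial-sum process.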
V.6. Functionals of Brownian Motion and Brownian Bridge
From the definition of brownian motion on $[0,\infty)$ it is easy to deduce (Problem 14), for each fixed $\tau$, that the shifted process
$$B_\tau(t) = B(\tau + t) - B(\tau) \quad \text{for } 0 \le t < \infty$$
is a new brownian motion, independent of the $\sigma$-field $\mathcal{E}_\tau$ generated by the random variables $\{B(s): 0 \le s \le \tau\}$. Equivalently,
$$\mathbb{P}f(B_\tau)A = \mathbb{P}f(B)\,\mathbb{P}A \tag{25}$$
for every $A$ in $\mathcal{E}_\tau$ and every bounded measurable $f$ on $C[0,\infty)$.
[Figure: a brownian motion path $B$ and the shifted path $B_\tau$.]
The assertion is also valid for a wide class of random $\tau$ values. The precisely formulated generalization, known as the strong markov property, underlies most of the clever tricks one can perform with brownian motion. It will get us the distributions of a few interesting functionals.
The random variable $\tau$ will need to be a stopping time, that is, a random variable, taking values in $[0,\infty]$, for which $\{\tau \le t\}$ belongs to $\mathcal{E}_t$ for each $t$.
Interpret $\mathcal{E}_\infty$ as the smallest $\sigma$-field containing every $\mathcal{E}_t$; we learn everything about the brownian motion if we watch it forever. Define
$$B_\tau(\omega, t) = [B(\omega, t + \tau(\omega)) - B(\omega, \tau(\omega))]\{\tau(\omega) < \infty\}.$$
Problem 17 shows that $B_\tau$ is projection-measurable, as a map from $\Omega$ into $C[0,\infty)$. The stopping time $\tau$ determines a $\sigma$-field $\mathcal{E}_\tau$, which captures the intuitive idea of an event being observable before time $\tau$. By definition, an event $A$ belongs to $\mathcal{E}_\tau$ if and only if $A\{\tau \le t\}$ belongs to $\mathcal{E}_t$ for every $t$.
The strong markov property asserts that $B_\tau$ is a brownian motion independent of $\mathcal{E}_\tau$ on $\{\tau < \infty\}$. Equivalently, (25) holds for every $\mathcal{E}_\tau$-measurable $A$ contained in $\{\tau < \infty\}$ and every bounded, measurable $f$ on $C[0,\infty)$. To prove the assertion it suffices that we check the equality for each bounded, uniformly continuous $f$. (Apply Problem 15 to the distributions induced on $C[0,\infty)$ by $B$, under $\mathbb{P}$, and $B_\tau$, under $\mathbb{P}(\cdot \mid A)$.) With such an $f$, continuity of sample paths implies
$$f(B_\tau)A = \lim_{n \to \infty} \sum_{k=1}^{\infty} f(B_{k/n})A\{(k-1)/n \le \tau < k/n\}.$$
Of course only one term in the sum will be nonzero for each fixed $n$.
The event $A\{(k-1)/n \le \tau < k/n\}$ belongs to $\mathcal{E}_{k/n}$; the independence asserted in (25) can be invoked for each $k/n$ value.
$$\mathbb{P}f(B_\tau)A = \lim \sum_{k=1}^{\infty} \mathbb{P}f(B_{k/n})A\{(k-1)/n \le \tau < k/n\}$$
$$= \lim \sum_{k=1}^{\infty} \mathbb{P}f(B)\,\mathbb{P}A\{(k-1)/n \le \tau < k/n\} \quad \text{by (25)}$$
$$= \mathbb{P}f(B)\,\mathbb{P}A \quad \text{because } \tau < \infty \text{ on } A.$$
The strong markov property is established.
26 Example. The classical reflection principle for brownian motion involves a hidden appeal to the strong markov property. It enables us to find the distribution of the stopping time $\tau = \inf\{t : B(t) = \alpha\}$ for fixed $\alpha > 0$.
The reflection principle uses the symmetry of brownian motion (the processes $B_\tau$ and $-B_\tau$ have the same distribution) to argue that $B$ should be just as likely to hit $\alpha$ and end up with $B(t) > \alpha$ as it is to hit $\alpha$ and end up with $B(t) < \alpha$. The probability that $B$ hits $\alpha$ before time $t$ should be twice the probability that $B(t) > \alpha$. A formal proof works backwards.
$$\mathbb{P}\{B(t) > \alpha\} = \mathbb{P}\{B(t) > \alpha,\ \tau < t\}$$
$$= \mathbb{P}\{B(\tau) + B_\tau(t - \tau) > \alpha,\ \tau < t\}$$
$$= \mathbb{P}\{B_\tau(t - \tau) > 0,\ \tau < t\} \quad \text{because } B(\tau) = \alpha \text{ if } \tau \text{ […]}$$
$$= \mathbb{P}[\{\tau < t\}\,\mathbb{P}\{B_\tau(t - \tau) > 0 \mid \mathcal{E}_\tau\}].$$
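The heuristic that hitting a level is twice as likely as ending above it has an exact discrete analogue for the simple symmetric random walk, which can be checked by counting paths. The sketch below is an aside, not the argument of the text: it replaces brownian motion by a $\pm 1$ walk $S_0 = 0, S_1, \ldots, S_n$ and verifies the classical identity $\#\{\max_{k \le n} S_k \ge a\} = 2\,\#\{S_n > a\} + \#\{S_n = a\}$ for integer $a \ge 1$ (the extra term appears because the walk can end exactly at the level, an event of probability zero for brownian motion). All function names are illustrative.

```python
from math import comb

def count_paths_reaching(n, a):
    """Number of +-1 paths of length n whose running maximum reaches level a >= 1."""
    alive = {0: 1}       # paths that have not yet hit a, keyed by current position
    hit = 0
    for step in range(n):
        nxt = {}
        for pos, count in alive.items():
            for move in (-1, 1):
                p = pos + move
                if p == a:
                    # first visit to a; the remaining steps are unconstrained
                    hit += count * 2 ** (n - step - 1)
                else:
                    nxt[p] = nxt.get(p, 0) + count
        alive = nxt
    return hit

def count_end_above(n, a):
    """Number of paths with S_n > a (u counts the upward steps)."""
    return sum(comb(n, u) for u in range(n + 1) if 2 * u - n > a)

def count_end_at(n, a):
    """Number of paths with S_n == a."""
    return sum(comb(n, u) for u in range(n + 1) if 2 * u - n == a)

for n, a in [(3, 1), (10, 2), (25, 4)]:
    lhs = count_paths_reaching(n, a)
    rhs = 2 * count_end_above(n, a) + count_end_at(n, a)
    print(n, a, lhs, rhs)
```

Dividing the counts by $2^n$ turns them into probabilities; the reflection of the path after its first visit to the level is what pairs each path ending above $a$ with one ending below it.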
e, otherwise the cadlag property would fail at a cluster point of those jumps. Take the union over rational e. If x is a uniform limit of {xn} then it can jump only where one of the {xn} jumps.]
Problems
[3] Calculate the covariances of the tied-down process $B(t) - tB(1)$ to show that its fidis agree with those of the brownian bridge. [Means and covariances determine a multivariate normal distribution.]
[4] Extend the proof of Theorem 3 to limit processes with jumps. [If $X$ has paths in a separable set $C$, and if the grid points for the approximation maps $A_k$ pick up every jump point of $C$ as $k \to \infty$, then $\|A_k X - X\| \to 0$ almost surely.]
[5] Would the argument in the proof of Theorem 9 for bounding $U_n(t)$ over $[0, b]$ work for a different interval, say $[1 - b, 1]$? This would be necessary if $U_n$ were not stationary; direct analysis of the nonuniform empirical process would require such an extension. [Try a different definition for 6",.]
[6] Every function in $D[-\infty, \infty]$ is bounded. [If $|x(t_n)| \ge n$ then $x$ would not have the cadlag property at a cluster point of $\{t_n\}$.]
[7] Give a direct proof of the Empirical Central Limit Theorem for sampling from a general $F$. [Make sure the jump points of $F$ appear in the sequence of grids from which fidi approximations are calculated.]
[8] The image of a separable metric space under a continuous map is separable. The empirical process $E$ concentrates on a separable subset of $D[-\infty, \infty]$. [The image of a dense subset is dense.]
[9] Suppose the distribution function $F(x, \theta)$ has a bounded partial derivative $\Delta(x, \theta)$ with respect to $\theta$. If $\Delta(\cdot, \cdot)$ is uniformly continuous (or even just $\Delta(x, \cdot)$ equicontinuous), then
$$\sup_x |F(x, \theta) - F(x, \theta_0) - (\theta - \theta_0)\Delta(x, \theta)| = o(\theta - \theta_0),$$
[10] Find the limiting distribution of Kolmogorov's goodness-of-fit statistic for sampling from the $N(\theta, 1)$ distribution, when $\theta$ is estimated by the sample mean. Show that this limit does not depend on the true value $\theta_0$; express it as a functional of the gaussian process with covariance function $\Phi(s)[1 - \Phi(t)] - \phi(s)\phi(t)$, where $\Phi$ and $\phi$ denote the distribution function and density function of the $N(0,1)$ distribution.
[11] Here is another construction for the brownian bridge.
(a) Temporarily suppose there exists a brownian bridge $U$. Define, recursively, new processes $Y_0, Y_1, \ldots$ and $Z_0, Z_1, \ldots$ by setting $Y_0 = U$, $Y_{n+1} = Y_n - Z_n$, and
$$Z_n(t) = \sum_{j=1}^{2^n} h_{nj}(t)\,Y_n((2j-1)/2^{n+1}),$$
where $h_{nj}$ is the function whose graph is an isosceles triangle of height one sitting on the base $[(j-1)/2^n, j/2^n]$. Show that $Y_{n+1}$ and $Z_n$ are independent. Deduce that the $\{Z_n\}$ are mutually independent. [Calculate covariances. The process $Y_n$ would be obtained from the brownian bridge by tying it down at the points $\{(2j-1)/2^n\}$; it is made up of $2^n$ independent, rescaled brownian bridges sitting side by side. The covariance calculations all reduce to the same thing.]
(b) Show that the process $Z_0 + \cdots + Z_n$ interpolates linearly between the vertices $(j/2^{n+1}, U(j/2^{n+1}))$, for $j = 0, \ldots, 2^{n+1}$.
(c) Now run the argument the other way. Construct processes $\{X_n\}$ with the same distributions as the $\{Z_n\}$, then recover the brownian bridge as a limit of sums of the $\{X_n\}$. Let $\{\xi_{nj}: j = 1, \ldots, 2^n;\ n = 0, 1, \ldots\}$ be independent random variables, with $\xi_{nj}$ distributed $N(0, 1/2^{n+2})$. Define
$$X_n(t) = \sum_{j=1}^{2^n} h_{nj}(t)\,\xi_{nj}.$$
At the points $\{j/2^{n+1}: j = 0, \ldots, 2^{n+1}\}$ the sum $X_0 + \cdots + X_n$ has the right fidis for a brownian bridge. [Only the fidis of $U$, which we know are well defined, were needed to calculate the distributional properties of $Z_0 + \cdots + Z_n$.]
(d) Show that $\mathbb{P}\{\|X_n\| \ge \varepsilon_n\} \le 2^{n+1}\exp(-2^{n+1}\varepsilon_n^2)$. [Apply the exponential inequality from Appendix B for normal tails: $\|X_n\|$ is a maximum of $2^n$ independent $|N(0, 1/2^{n+2})|$ random variables.]
(e) By choosing $\varepsilon_n = (2n/2^{n+1})^{1/2}$ and then applying the Borel-Cantelli lemma, show that $\sum_{n=0}^{\infty} X_n(t)$ converges uniformly in $t$, almost surely; it defines a process $X$ with continuous sample paths, almost surely.
(f) At dyadic rational values for $t$, the series for $X$ contains only finitely many nonzero terms. The fidi projections of $X$ at a dense subset of $[0,1]$ have the distributions of a brownian bridge. The process $X$, with a negligible set of sample paths discarded, is a brownian bridge.
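The construction in parts (c) to (f) is easy to try out. The sketch below is illustrative only: the function names, the number of levels, and the seed are assumptions, and the sampler uses the standard library's gaussian generator. It builds the partial sum $X_0 + \cdots + X_N$ from triangle functions $h_{nj}$ and coefficients with variance $1/2^{n+2}$, and computes the variance of the partial sum at a point, which at a dyadic rational $t$ should equal the brownian bridge value $t(1-t)$ once enough levels are included.

```python
import random

def h(n, j, t):
    """Triangle of height one on the base [(j-1)/2^n, j/2^n]."""
    width = 2.0 ** -n
    left = (j - 1) * width
    if t <= left or t >= left + width:
        return 0.0
    mid = left + width / 2
    return 1.0 - abs(t - mid) / (width / 2)

def sample_partial_sum(levels, rng):
    """Return one realization of X_0 + ... + X_levels as a function of t."""
    coeffs = [[rng.gauss(0.0, (2.0 ** -(n + 2)) ** 0.5) for _ in range(2 ** n)]
              for n in range(levels + 1)]
    def path(t):
        return sum(h(n, j + 1, t) * c
                   for n, row in enumerate(coeffs)
                   for j, c in enumerate(row))
    return path

def partial_sum_variance(levels, t):
    """Variance of (X_0 + ... + X_levels)(t), using independence of the coefficients."""
    return sum(h(n, j, t) ** 2 * 2.0 ** -(n + 2)
               for n in range(levels + 1)
               for j in range(1, 2 ** n + 1))

path = sample_partial_sum(6, random.Random(0))
print(path(0.0), path(1.0))                    # tied down at both ends
print(partial_sum_variance(2, 0.375), 0.375 * (1 - 0.375))
```

At $t = 3/8$ only levels 0, 1, and 2 contribute, which mirrors the remark in part (f) that the series has finitely many nonzero terms at dyadic rationals.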
[12] Every point of $D[0,\infty)$ is completely regular. [The distance $d_k(x, y)$ can be expressed as a supremum involving only rational time points. For each $x$ and $k$, the function $d_k(x, \cdot)$ is projection measurable. Use $[1 - m\,d_k(x, \cdot)]^+$ as a separating function.]
[13] $C[0,\infty)$ is a closed, separable subset of $D[0,\infty)$. [It is the closure of a countable collection of piecewise linear, continuous functions, each constant over an interval of the form $[t, \infty)$.]
[14] For each fixed $\tau$ and each finite subset $S$ of $[0,\infty)$, the random vector $\pi_S B_\tau$ is independent of $\mathcal{E}_\tau$, because brownian motion has independent increments. Deduce independence of $B_\tau$ and $\mathcal{E}_\tau$. [The projection maps $\{\pi_S\}$ generate the projection $\sigma$-field on $C[0,\infty)$.]
[15] Let $P$ and $Q$ be probability measures on the $\sigma$-field $\mathcal{B}_0$ generated by the closed balls of a metric space. If $Pf = Qf$ for every bounded, uniformly continuous, $\mathcal{B}_0$-measurable $f$ then $P = Q$. [Every closed ball is a pointwise decreasing limit of a sequence of such functions. The same is true for the intersection of any finite collection of closed balls. These sets form a generating class for $\mathcal{B}_0$ that is closed under the formation of finite intersections.]
[16] The borel $\sigma$-field on the product $\mathcal{X} \otimes \mathcal{Y}$ of two separable metric spaces coincides with the product $\sigma$-field $\mathcal{B}(\mathcal{X}) \otimes \mathcal{B}(\mathcal{Y})$. [Every open set in the product space is a countable union of sets of the form $G_x \otimes G_y$, with both $G_x$ and $G_y$ open. Compare with Problem IV.5.]
[17] Show that the shifted process $B_\tau$ is projection measurable for each random time $\tau$. [Prove measurability of $B(\omega, t + \tau(\omega))$ by writing it as a composition of two measurable maps: $\omega \mapsto (\tau(\omega), B(\omega, \cdot))$ from $\Omega$ into $[0,\infty] \otimes C[0,\infty)$ and $(s, x) \mapsto \lim_n x((t + s) \wedge n)[1 \wedge (n - s)^+]$ from $[0,\infty] \otimes C[0,\infty)$ into $\mathbb{R}$.]
[18] In the sense of Example 24, almost all brownian motion sample paths are good. If $\tau$ denotes the first time that $B$ hits level 1, prove that
$$\mathbb{P}\{B_\tau(s) \le 0 \text{ for } 0 \le s \le \delta\} = 0.$$
Let $\delta$ tend to zero through a sequence to show that bad paths belong to a set of probability zero. [You could try letting the level $\alpha$ in Example 26 sink to zero.]
CHAPTER VI
The Skorohod Metric on $D[0,\infty)$
... in which an alternative to the metric of uniform convergence on compacta is studied. With the new metric the limit processes need not confine their jumps to a countable set of time points. Amongst the convergence criteria developed is an elegant condition based on random increments, due to Aldous. The chapter might be regarded as an extended appendix to Chapter V.
VI.1. Properties of the Metric
The uniform metric on $D[0,1]$ is the best choice for applications where the limit distribution concentrates on $C[0,1]$, or on some other separable subset of $D[0,1]$. It is well suited for convergence to brownian motion, brownian bridge, and the gaussian processes that appear as limits in the Empirical Central Limit Theorem. But it excludes, for example, poisson processes and other nongaussian processes with independent increments, whose jumps are not constrained to lie in a fixed, countable subset of $[0,1]$. To analyze such processes, Skorohod (1956) introduced four new metrics, all weaker than the uniform metric. Of these, the $J_1$ metric has since become the most popular. (Too popular in my opinion: too often it is dragged into problems for which the uniform metric would suffice.) But Skorohod's $J_1$ metric on $D[0,1]$ will not be the main concern of this chapter. Instead we shall investigate a sort of $J_1$ convergence on compacta for $D[0,\infty)$, the space where the interesting applications live.
With the results from Section V.5 in mind, and even without seeing the $J_1$ metric defined, you might suspect that convergence $X_n \rightsquigarrow X$ of random elements of $D[0,\infty)$ should reduce to convergence of their restrictions to each finite interval $[0,T]$, in the sense of the $J_1$ metric on $D[0,T]$. This is almost true. We need to avoid those values of $T$ at which $X$ has positive probability of jumping. The difficulty arises because projection maps are not automatically continuous for $J_1$ metrics. Both the points 0 and $T$ require special treatment from the $J_1$ metric on $D[0,T]$, whereas only 0 has an a priori right to special treatment in $D[0,\infty)$. That tiny distinction makes it slightly more convenient to study $D[0,\infty)$ directly than to deduce all its properties from those of $D[0,T]$. As we shall not be concerned with
Skorohod's $J_2$, $M_1$, and $M_2$ metrics, let us drop the $J_1$ designation from now on.
1 Definition. For each finite $T$ and each pair of functions $x$ and $y$ in $D[0,\infty)$ define the distance $d_T(x, y)$ as the infimum of all those values of $\delta$ for which there exist grids $0 = t_0 < t_1 < \cdots < t_k$, with $t_k \ge T$, and $0 = s_0 < s_1 < \cdots < s_k$, with $s_k \ge T$, such that $|t_i - s_i| \le \delta$ for $i = 0, \ldots, k$, and $|x(t) - y(s)| \le \delta$ if
ti~t 0(: Ix(t)  x(O() I ~ 1c}, f3 = inf{t > ,: Ix(t)  x(,) I ~ 1c}.
1C
So if the assertion were false there would exist sequences of points an < On < f3n ~ T for which f3n  an ~ and
°
IX(on)  X(a n) I ~!8
and
IX(on)  X(f3n) I ~
k
We could extract sub sequences along which an ~ a, On ~ a, f3n ~ a, for some a in [0, T]. This would violate the cadlag property at a: there must exist a (j > for which
°
Ix(s)  x(a)I
0, choose $N$ with $2^{-N}$ less than $\varepsilon$ and $N^{-1}$ less than the $\delta$ of Lemma 5. We may assume that $\delta < \varepsilon$. Construct the approximation $A_N x$ to $x$ using the grid points $S(N)$. That is, set $A_N = h_N \circ \pi_{S(N)}$. From Lemma 4, $d_N(x, A_N x) \le \varepsilon$. Hence $d(x, A_N x) < 2\varepsilon$. Find rational numbers $r_j$ with $|r_j - x(j/N)| < \varepsilon$. Then $h_N(r)$ belongs to $D_0$ and $d(x, h_N(r)) < 3\varepsilon$.
For the moment write $\mathcal{B}$ for the borel $\sigma$-field and $\mathcal{P}$ for the projection $\sigma$-field on $D[0,\infty)$. To prove that $\mathcal{P} \subseteq \mathcal{B}$, observe that $x(t_0) = \lim H_n(x)$ for fixed $t_0$, where $H_n(x) = \sup_t x(t)s_n(t)$, $s_n(t) = (1 - n|t - t_0 - 2/n|)^+$. For fixed $n$, the functional $H_n$ is continuous (Problem 3). As a pointwise limit of continuous functions on $D[0,\infty)$, the projection $\pi_{t_0}$ must be $\mathcal{B}$-measurable. The projections generate $\mathcal{P}$.
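To prove that $\mathcal{B} \subseteq \mathcal{P}$, where $\mathcal{B}$ is the borel $\sigma$-field and $\mathcal{P}$ the projection $\sigma$-field, it is enough to establish that each continuous, real function $f$ on $D[0,\infty)$ is $\mathcal{P}/\mathcal{B}(\mathbb{R})$-measurable. (Every closed set in a metric space can be represented as $f^{-1}\{0\}$ for some continuous $f$.) We know that $d(x, h_N \circ \pi_{S(N)}(x)) \to 0$ as $N \to \infty$. Thus $f \circ h_N \circ \pi_{S(N)}(x) \to f(x)$ for each $x$, by continuity of $f$. The map $f \circ h_N$ is continuous from $\mathbb{R}^{S(N)}$ into $\mathbb{R}$, and hence $\mathcal{B}(\mathbb{R}^{S(N)})/\mathcal{B}(\mathbb{R})$-measurable. The map $\pi_{S(N)}$ is, by definition, $\mathcal{P}/\mathcal{B}(\mathbb{R}^{S(N)})$-measurable. Thus their composition $f \circ h_N \circ \pi_{S(N)}$ must be $\mathcal{P}/\mathcal{B}(\mathbb{R})$-measurable. As a pointwise limit of such functions, $f$ must also be $\mathcal{P}/\mathcal{B}(\mathbb{R})$-measurable. □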
Needless to say, from now on we shall always equip $D[0,\infty)$ with its projection $\sigma$-field, alias the borel $\sigma$-field for the Skorohod metric. Every point of $D[0,\infty)$ is completely regular under this $\sigma$-field.
7 Example. The asymptotic theory for maxima of independent random variables bears some similarity to the theory for sums of independent random variables. The role played by the normal is taken over by the extreme-value distributions, whose distribution functions are of the form $\exp(-G(x))$ for $G(x)$ equal to one of $e^{-x}$, or $x^{-\alpha}\{x \ge 0\}$, or $(-x)^{\alpha}\{x \le 0\}$, with $\alpha$ a positive parameter. If the maximum $M_n$ of $n$ independent observations from a distribution function $F$ can be standardized to converge in distribution, then the limit must be one of these: for constants $a_n$ (positive) and $b_n$,
This convergence implies a much stronger result for the joint asymptotic behavior of the maxima at different sample sizes, a result analogous to the convergence of the partial-sum process to brownian motion (Example V.20). Define the maxima process $Y_n(\cdot)$ as the random element of $D[0,\infty)$ with
$$Y_n(t) = (M_j - b_n)/a_n \quad \text{for } j/n \le t < (j+1)/n.$$
The assumption (8) gives convergence for $\{Y_n(1)\}$. Using only the facts about the Skorohod metric that we have so far accumulated, we can strengthen this to convergence in distribution of the $\{Y_n\}$ process. The method of proof depends upon a representation of $Y_n$ as a continuous transformation of a poisson process.
To minimize extraneous detail, assume $G(x) = x^{-\alpha}\{x \ge 0\}$. Trivial modifications of the argument would cover the other two cases. Define a sequence of measures $\{R_n\}$ on $(0,\infty)$ by means of their distribution functions:
$$R_1(0, x] = \exp(-G(x)), \qquad R_n(0, x] = n\exp(-G(x)/n) - (n-1)\exp(-G(x)/(n-1)).$$
Calculate their density functions if you doubt that these are well-defined measures. On $S = [0,\infty) \otimes (0,\infty)$ generate independent poisson processes
{nn} with intensity measures {A @ Hn}, where A denotes lebesgue measure on [0, (0). The sum (J n = n 1 + ... + nn is also a poisson process. As n tends to infinity, (In increases to a poisson process (J on S with intensity measure ro
I
n
A @ Hi = A @ lim
i=l
n
L Hi
= A @ y,
i=l
the measure Y being determined on (0, (0) by y(x, (0) = limen  n exp( G(x)/n)] = G(x).
Label the points of (In as (fJni' hnJ, where fJnl < fJn2 < .... The {fJnJ form a poisson process on [0, (0), with intensity nA, independent of the {hnJ; the gaps between adjacent fJni have independent exponential distributions with mean n 1 • The {hnJ are independent observations on the distribution function exp(  G/n). When subjected to a slight vertical perturbation they will become standardized observations on F. Let Q be the quanti le transformation corresponding to the distribution function F (Section III.6). Define T,,(y) = [Q(exp(  G(y)/n»  bn]/an·
For large n, this transformation hardly disturbs y: T,,(y) = inf{(z  bn)/an : F(z) 2 exp( G(y)/n)} = inf{x: Fn(anx + b) 2 exp(  G(y»} + inf{x: exp( G(x» 2 exp( G(y»} by (8)
= y. The transformed variables {an T,,(h ni ) + bn} form a sequence of independent observations on F, because hni has distribution function exp(  G/n): lP{a nT,,(h ni )
+ bn :::; x} =
lP{Q(exp(  G(hni)/n» :::; x} = lP{Uniform(O, 1) :::; F(x)} = F(x).
The T" has the desired effect, in the vertical direction, on the points of (In. Define a random element Zn of D[O, (0) by setting
If Zn had its jumps at n 1, 2n1, ... instead of at fJnl' fJn2' ... it would be a probabilistic copy of Y". Remedy the defect by applying to the time axis the random, piecewise linear transformation Yn that sends onto andj/n onto fJnj, for j = 1,2, .... The processes Y,,(.) and Zn YnO have the same distribution as random elements of D[O, (0). By the weak law of large numbers, Yn(t) + t in probability uniformly on compact intervals (Problem 4). Thus d(Zn' Zn Yn) + in probability. The
° °
0
0
°
random elements {Zn} themselves converge almost surely to the random element
Z(t) = sup{h;: 11;
~
t},
where (111, h 1 ), (112, h2 ), ••• denote the points of the poisson process (J arranged in order of increasing time coordinate. Deduce that {Zn}, and hence {Y,,}, converges in distribution to z. 0
VI.2. Convergence in Distribution
In Section V.1 we found a necessary and sufficient condition for convergence in distribution of random elements $\{X_n\}$ of $D[0, 1]$, under its uniform metric, to a limit process $X$ concentrating on a separable subset. The separability allowed $X$ to have discontinuities only at fixed locations. For the proof we constructed an approximation $AX_n$ to each $X_n$ based on the values it took at a fixed finite grid on $[0, 1]$. The conditions we imposed ensured that, with high probability, the $AX_n$ process was uniformly close to $X_n$. A similar method of proof will apply for convergence in distribution of random elements of $D[0, \infty)$ under $d$, its Skorohod metric. The constraint on the limit process will disappear, because $D[0, \infty)$ itself is separable under $d$. Each approximation $AX_n$ will, with high probability, be close to its $X_n$ in the sense of $d$ distance. But one extra complication will arise because the fidi projections are not automatically continuous. If $x$ belongs to $D[0, \infty)$ and $x(\tau) \ne x(\tau-)$, the projection map $\pi_\tau$ is not continuous at $x$. For example, if $x_n(t) = x(nt/(n + 1))$ then $d(x_n, x) \to 0$ but $\pi_\tau x_n \to x(\tau-) \ne \pi_\tau x$. For $\tau$ a continuity point of $x$, however, $\pi_\tau$ is continuous at $x$ (Problem 5). Necessarily, $\pi_0$ is continuous at every $x$, because every increasing $\lambda$ that maps $[0, \infty)$ onto itself must set $\lambda(0)$ equal to 0. For a random element $X$ of $D[0, \infty)$, the projection $\pi_\tau$ will be continuous at all sample paths except those that have a jump at $\tau$.
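The failure of continuity for the projection at a jump point can be checked on a concrete path. The following is a minimal sketch, assuming a hypothetical step function with a single unit jump at `TAU = 1`; the names `x`, `x_n`, `lam`, `time_distortion` and `path_mismatch` are illustrative and do not come from the text. Exact rational arithmetic is used so that the comparison at the jump is not disturbed by rounding.

```python
from fractions import Fraction

TAU = Fraction(1)  # hypothetical jump location, chosen only for illustration


def x(t):
    """Cadlag step function with a single unit jump at TAU."""
    return 1 if t >= TAU else 0


def x_n(n, t):
    """Time-compressed copy x_n(t) = x(n t / (n + 1))."""
    return x(Fraction(n) * t / (n + 1))


def lam(n, t):
    """Increasing time change lambda_n(t) = (n + 1) t / n, so x_n(lambda_n(t)) = x(t)."""
    return Fraction(n + 1) * t / n


def grid(horizon, steps=1000):
    """Equally spaced rational grid on [0, horizon]."""
    return [horizon * Fraction(k, steps) for k in range(steps + 1)]


def time_distortion(n, horizon):
    """Largest |lambda_n(t) - t| over the grid; equals horizon / n."""
    return max(abs(lam(n, t) - t) for t in grid(horizon))


def path_mismatch(n, horizon):
    """Largest |x_n(lambda_n(t)) - x(t)| over the grid; zero for every n."""
    return max(abs(x_n(n, lam(n, t)) - x(t)) for t in grid(horizon))
```

After a time change that moves no point of $[0, 2]$ by more than $2/n$, the paths agree exactly, which is consistent with $d(x_n, x) \to 0$. Yet `x_n(n, TAU)` stays at 0 for every `n`, while `x(TAU)` equals 1.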
9 Lemma. For each random element $X$ of $D[0, \infty)$ there exists a subset $\Gamma_X$ of $[0, \infty)$ such that $[0, \infty) \setminus \Gamma_X$ is countable and $\mathbb{P}\{X(t) = X(t-)\} = 1$ for $t$ in $\Gamma_X$. The projection $\pi_t$ is $\mathbb{P}_X$ almost surely continuous at each $t$ in $\Gamma_X$.
PROOF. It is enough to show that if $\varepsilon > 0$ then
$$\mathbb{P}\{|X(t) - X(t-)| \ge \varepsilon\} \ge \varepsilon$$
for at most finitely many $t$ values in each bounded interval $[0, T]$. Write $J(t)$ for $\{|X(t) - X(t-)| \ge \varepsilon\}$. If $\mathbb{P}J(t_n) \ge \varepsilon$ for an infinite sequence $\{t_n\}$ of distinct points in $[0, T]$ then $\mathbb{P}\{J(t_n) \text{ infinitely often}\} \ge \varepsilon$. There would exist an $\omega$ belonging to infinitely many of the $J(t_n)$ sets. At some cluster point $t$ in $[0, T]$ the inequality $|X(\omega, t_n) - X(\omega, t_n-)| \ge \varepsilon$ would hold for infinitely many distinct $t_n$ values in every neighborhood of $t$. This would violate the cadlag property of $X(\omega, \cdot)$ at $t$. □
Because the projection $\pi_0$ is continuous at every $x$ in $D[0, \infty)$, let us also admit 0 as a point of $\Gamma_X$, even though $X(0-)$ is not defined.
10 Theorem. Let $X, X_1, X_2, \dots$ be random elements of $D[0, \infty)$. Necessary and sufficient conditions for $X_n \rightsquigarrow X$, in the sense of the Skorohod metric, are:
(i) the fidis of $X_n$ corresponding to finite subsets of $\Gamma_X$ converge to the fidis of $X$;
(ii) for each $\varepsilon > 0$, each $\eta > 0$, and each finite $T$ in $\Gamma_X$, there exists a grid $0 = t_0 < \dots < t_K = T$ of points from $\Gamma_X$ with
$$\limsup \mathbb{P}\{\max_i \Delta(X_n, [t_{i-1}, t_i]) > \eta\} < \varepsilon.$$
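PROOF OF NECESSITY. Appeal to the Representation Theorem (IV.13) for a new sequence $\{\tilde X_n\}$, with the same distributions as $\{X_n\}$, and an $\tilde X$ with the same distribution as $X$, for which
(11) $d(\tilde X_n(\omega, \cdot), \tilde X(\omega, \cdot)) \to 0$ for almost all $\omega$.
For $t$ in $\Gamma_X$, $\tilde X_n(\omega, t) \to \tilde X(\omega, t)$ at almost every $\omega$. Fidi convergence for $\{X_n\}$ at points of $\Gamma_X$ follows. Find a grid $0 = t_0 < \dots < t_k = T$ of points from $\Gamma_X$ for which
(12) $\mathbb{P}\{\max_i \Delta(X, [t_{i-1}, t_i]) \ge \eta\} < \varepsilon$.
Write $\Delta_T(\cdot)$ for $\max_i \Delta(\cdot, [t_{i-1}, t_i])$. Consider an $\omega$ at which the convergence (11) holds. Write $x_n(\cdot)$ for $\tilde X_n(\omega, \cdot)$, and $x(\cdot)$ for $\tilde X(\omega, \cdot)$. If we show $\limsup \Delta_T(x_n) \le 2\Delta_T(x)$, then it will follow that
$$\limsup \mathbb{P}\{\Delta_T(X_n) \ge 2\eta\} = \limsup \mathbb{P}\{\Delta_T(\tilde X_n) \ge 2\eta\} \le \mathbb{P}\{\limsup \Delta_T(\tilde X_n) \ge 2\eta\} \le \mathbb{P}\{\Delta_T(\tilde X) \ge \eta\} < \varepsilon$$
by (12), as required by hypothesis (ii).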
Choose 0. By definition of ~T(X), there exists points {Si} with t i 1 < Si ::; ti and
Ix(t)  X(t i  1)1 < ~T(X) + so that If(x)  f(y)1 < 8 whenever d(x, y)::;; 21]. Choose from r x a T large enough to ensure d(x, y) ::;; dT(x, y) + 1] for every pair x, y. Let ~TO have the same meaning as in the proof of necessity. According to hypothesis (ii) there exists a grid on [0, TJ for which
°
JP{~T(Xn)
> I]}
I]} < 8. Lemma 4 shows that the approximations constructed from this grid are, with high probability, close to the sample paths of the processes:
JP{dT(Xn' AXn) > I]} < IP{dT(X, AX) > I]}
0, and each $\varepsilon > 0$, there exists a $\delta > 0$ and an $n_0$ such that
(14) $\mathbb{P}\{|X_n(\rho_n + \delta') - X_n(\rho_n)| \ge \eta\} < \varepsilon$ for $n \ge n_0$,
whenever $\rho_n$ is a stopping time for $X_n$ that takes values in $[0, T]$ and $\delta'$ is a real number with $0 \le \delta' \le \delta$. The proof of Aldous's result is built up by repeated application of inequality (14).
15 Lemma. Let $Z$ be a random element of $D[0, \infty)$ for which
JP{ IZ(p + b')  Z(p) I ~ 1]}
JP
f
{IZ(s)  Z(O'T)/
+ {/Z(s)
~ JP
f
~ 1]}{O'T ~ S ~ O'T +
 Z(LT) I ~ 1]}{LT
~ S ~ LT
[{Z(s)  Z(O'T)/
~ 1]} + {/Z(s)
x {LT ~
+ b} ds.
S ~
O'T
b}
+ b} ds
 Z(LT)/
~ 1]}]
0'
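On the set $\{\tau \le T\}$, the sum of the two indicators is at least 1, because at least one of the inequalities
$$|Z(s) - Z(\sigma)| \ge \eta, \qquad |Z(s) - Z(\tau)| \ge \eta$$
must hold if $\sigma_T = \sigma$, $\tau_T = \tau$, and $|Z(\sigma) - Z(\tau)| \ge 2\eta$. Deduce that
$$2\varepsilon\delta > \mathbb{P}\int \{\tau \le T\}\{\tau \le s \le \sigma + \delta\}\,ds \ge \mathbb{P}\int \{\tau \le T \wedge (\sigma + \tfrac12\delta)\}\{\sigma + \tfrac12\delta \le s \le \sigma + \delta\}\,ds \ge \tfrac12\delta\,\mathbb{P}\{\tau \le T \wedge (\sigma + \tfrac12\delta)\}.$$ □
16 Theorem. Let $X, X_1, X_2, \dots$ be random elements of $D[0, \infty)$ for which:
(i) the fidis of $X_n$ corresponding to finite subsets of $\Gamma_X$ converge to the fidis of $X$;
(ii) Aldous's condition (13) holds.
Then $X_n \rightsquigarrow X$ in the sense of the Skorohod metric.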
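PROOF. Verify condition (ii) of Theorem 10. Set down a grid $0 = t_0 < \dots < t_K = T$ of points from $\Gamma_X$ with the maximum grid interval shorter than $\tfrac12\delta$, for a $\delta$ still to be chosen. Define stopping times $\tau_0 = 0$ and
$$\tau_{j+1} = \inf\{t > \tau_j : |X_n(t) - X_n(\tau_j)| \ge 2\eta\},$$
with the usual convention that the infimum of the empty set equals $+\infty$. We should perhaps add an extra subscript $n$ to each $\tau_j$. If $(t_{i-1}, t_i]$ contains at most one of the $\{\tau_j\}$ then we must have $\Delta(X_n, [t_{i-1}, t_i]) \le 4\eta$. For if $\tau_{j-1} \le t_{i-1} < \tau_j \le t_i < \tau_{j+1}$ then
$$|X_n(t) - X_n(\tau_{j-1})| < 2\eta \quad\text{if } t_{i-1} \le t < \tau_j,$$
$$|X_n(t) - X_n(\tau_j)| < 2\eta \quad\text{if } \tau_j \le t \le t_i,$$
and hence
$$|X_n(t) - X_n(t_{i-1})| < 4\eta \quad\text{if } t_{i-1} \le t < \tau_j,$$
$$|X_n(t) - X_n(t_i)| < 4\eta \quad\text{if } \tau_j \le t \le t_i.$$
Apply the same reasoning to each grid interval.
$$\{\max_i \Delta(X_n, [t_{i-1}, t_i]) > 4\eta\} \le \{\text{some } (t_{i-1}, t_i] \text{ contains at least two of the } \{\tau_j\}\} \le \{\text{some pair } \tau_{j-1}, \tau_j \text{ has } \tau_j \le T \wedge (\tau_{j-1} + \tfrac12\delta)\}.$$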
Fix an integer $m$, whose value will be specified soon. Bound the last indicator function by
$$\sum_{j=1}^{2m} \{\tau_j \le T \wedge (\tau_{j-1} + \tfrac12\delta)\} + m^{-1}\sum_{j=1}^{2m} \{\tau_j \le T \wedge (\tau_{j-1} + T/m)\}.$$
The reasoning here is: either $\tau_{2m} \ge T$, in which case the pair $\tau_{j-1}, \tau_j$ would be detected by the first sum; or $\tau_{2m} < T$, in which case at least $m$ terms of the second sum must equal one, for otherwise $[0, T]$ would have to contain $(m + 1)$ disjoint intervals $(\tau_{j-1}, \tau_j)$ of length greater than $T/m$. Take expectations.
(17) $$\mathbb{P}\{\max_i \Delta(X_n, [t_{i-1}, t_i]) > 4\eta\} \le 2m \max_{j \le 2m} \mathbb{P}\{\tau_j \le T \wedge (\tau_{j-1} + \tfrac12\delta)\} + 2 \max_{j \le 2m} \mathbb{P}\{\tau_j \le T \wedge (\tau_{j-1} + T/m)\}.$$
Now we choose m and a. Invoke Lemma 15 for Z = Xn with by (14). For n 2': no,
(J
IP{Tj:S; T /\ (rjl
= rj_ 10 r = rj and the
+ t(j)}
0 if and only if there exist continuous, strictly increasing maps $\{\lambda_n\}$ from $[0, \infty)$ onto itself such that, uniformly on compact sets of $t$ values, $\lambda_n(t) - t \to 0$ and $x(\lambda_n(t)) - x_n(t) \to 0$.
[Construct $\lambda_n$ as a piecewise linear map that takes gridpoints for $x_n$ onto gridpoints for $x$, for pairs of grids chosen according to the definition of $d_T(x_n, x)$ with $T$ depending on $n$.]
[3] Prove continuity of the functional $H_n$ that appeared in the proof of Theorem 6. [If $d(x, x_i) \to 0$, choose continuous, increasing $\{\lambda_i\}$ with $x(\lambda_i(t)) - x_i(t) \to 0$ and $\lambda_i(t) - t \to 0$ uniformly on compact $t$ sets. For $i$ large enough, bound $|H_n(x) - H_n(x_i)|$ by
$$\sup_{t \le c} |x(t)| \, \sup |s_n(\lambda_i(t)) - s_n(t)| + \sup_{t \le c} |x(\lambda_i(t)) - x_i(t)|$$
for some constant $c$.]
[4] Let $\{\gamma_n\}$ be a sequence of random, increasing maps from $[0, \infty)$ onto itself such that $\gamma_n(t) \to t$ in probability, for each fixed $t$. Show that $\sup_{0 \le t \le T} |\gamma_n(t) - t| \to 0$ in probability for each fixed $T$. [If $|\gamma_n(s) - s| < \varepsilon$ and $|\gamma_n(t) - t| < \varepsilon$ then $|\gamma_n(u) - u| < \varepsilon + |s - t|$ for $u$ between $s$ and $t$.]
[5] Suppose $d(x_n, x) \to 0$ and that $x$ is continuous at $\tau$. Show that $x_n(\tau) \to x(\tau)$. [Use $x(\lambda_n(\tau)) - x_n(\tau) \to 0$ and $\lambda_n(\tau) \to \tau$.]
[6] If $d(x_n, x) \to 0$ and $x$ belongs to $C[0, \infty)$ then $x_n$ converges to $x$ uniformly on compacta.
[7] If $X_n \rightsquigarrow X$ in the Skorohod sense, and if $X$ has sample paths in $C[0, \infty)$, then $X_n \rightsquigarrow X$ in the sense of the metric for uniform convergence on compacta. [Switch to versions that converge almost surely in the Skorohod sense.]
CHAPTER VII
Central Limit Theorems ... in which the chaining method for proving maximal inequalities for the increments of stochastic processes is established. Applications include construction of gaussian processes with continuous sample paths, central limit theorems for empirical measures, and justification of a stochastic equicontinuity assumption that is needed to prove central limit theorems for statistics defined by minimization of a stochastic process.
VII.1. Stochastic Equicontinuity
Much asymptotic theory boils down to careful application of Taylor's theorem. To bound remainder terms we impose regularity conditions, which add rigor to informal approximation arguments, but usually at the cost of increased technical detail. For some asymptotics problems, especially those concerned with central limit theorems for statistics defined by maximization or minimization of a random process, many of the technicalities can be drawn off into a single stochastic equicontinuity condition. This section shows how. Empirical process methods for establishing stochastic equicontinuity will be developed later in the chapter.
Maximum likelihood estimation is the prime example of a method that defines a statistic by maximization of a random criterion function. Independent observations $\xi_1, \dots, \xi_n$ are drawn from a distribution $P$, which is assumed to be a member of a parametric family defined by density functions $\{p(\cdot, \theta)\}$. For simplicity take $\theta$ real-valued. The true, but unknown, $\theta_0$ can be estimated by the value $\theta_n$ that maximizes
$$G_n(\theta) = n^{-1}\sum_{i=1}^{n} \log p(\xi_i, \theta).$$
Let us recall how one proves asymptotic normality for $\theta_n$, assuming it is consistent for $\theta_0$. Write $g_0(\cdot, \theta)$ for $\log p(\cdot, \theta)$, and $g_1(\cdot, \theta)$, $g_2(\cdot, \theta)$, $g_3(\cdot, \theta)$ for the first three partial derivatives with respect to $\theta$, whose existence we impose as a regularity condition. Using Taylor's theorem, expand $g_0(\cdot, \theta)$ into
$$g_0(\cdot, \theta_0) + (\theta - \theta_0)g_1(\cdot, \theta_0) + \tfrac12(\theta - \theta_0)^2 g_2(\cdot, \theta_0) + \tfrac16(\theta - \theta_0)^3 g_3(\cdot, \theta^*)$$
with $\theta^*$ between $\theta_0$ and $\theta$. Integrate with respect to the empirical measure $P_n$.
$$G_n(\theta) = G_n(\theta_0) + (\theta - \theta_0)P_n g_1 + \tfrac12(\theta - \theta_0)^2 P_n g_2 + R_n(\theta).$$
If we impose, as an extra regularity condition, the domination
$$|g_3(\cdot, \theta)| \le H(\cdot) \quad\text{for all } \theta,$$
then the remainder term will satisfy
$$|R_n(\theta)| \le \tfrac16|\theta - \theta_0|^3 P_n|g_3(\cdot, \theta^*)| \le \tfrac16|\theta - \theta_0|^3 P_n H.$$
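The remainder bound can be checked numerically on a concrete family. The sketch below is only an illustration under stated assumptions: it uses the logistic location family (not discussed in the text), for which $|g_3| \le 2/(6\sqrt{3}) < 0.2$, so the constant function `H = 0.2` serves as a dominating function. All names (`g0`, `g1`, `g2`, `remainder`, `remainder_bound`) are hypothetical.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def g0(x, theta):
    """Log density of the logistic location family (an assumed example family)."""
    u = x - theta
    return -u - 2.0 * math.log1p(math.exp(-u))


def g1(x, theta):
    """First partial derivative of g0 with respect to theta."""
    return 2.0 * sigmoid(x - theta) - 1.0


def g2(x, theta):
    """Second partial derivative of g0 with respect to theta."""
    s = sigmoid(x - theta)
    return -2.0 * s * (1.0 - s)


# Assumed dominating constant: |g3| = 2 s (1 - s) |1 - 2 s| <= 2 / (6 sqrt(3)) < 0.2.
H = 0.2


def empirical_mean(func, sample, theta):
    """P_n applied to func(., theta)."""
    return sum(func(x, theta) for x in sample) / len(sample)


def remainder(sample, theta, theta0):
    """R_n(theta) = G_n(theta) - G_n(theta0) - linear term - quadratic term."""
    d = theta - theta0
    return (empirical_mean(g0, sample, theta)
            - empirical_mean(g0, sample, theta0)
            - d * empirical_mean(g1, sample, theta0)
            - 0.5 * d * d * empirical_mean(g2, sample, theta0))


def remainder_bound(theta, theta0):
    """(1/6) |theta - theta0|^3 P_n H for the constant dominating function H."""
    return abs(theta - theta0) ** 3 * H / 6.0


# Simulated sample from the logistic distribution centered at theta0 (inverse CDF).
rng = random.Random(0)
theta0 = 0.0
sample = []
for _ in range(500):
    p = rng.uniform(1e-12, 1.0 - 1e-12)
    sample.append(theta0 + math.log(p / (1.0 - p)))
```

Because the bound comes from Taylor's theorem applied observation by observation, `abs(remainder(sample, theta, theta0))` should not exceed `remainder_bound(theta, theta0)` for any `theta`, up to floating-point error.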
Assume $PH < \infty$ and $P|g_2| < \infty$. Then, by the strong law of large numbers, for each sequence of shrinking neighborhoods of $\theta_0$ we can absorb the remainder term into the quadratic, leaving
(1) $$G_n(\theta) = G_n(\theta_0) + (\theta - \theta_0)P_n g_1 + \tfrac12(\theta - \theta_0)^2(Pg_2 + o_p(1)) \quad\text{near } \theta_0.$$
The $o_p(1)$ stands for a sequence of random functions of $\theta$ that are bounded uniformly on the shrinking neighborhoods of $\theta_0$ by random variables of order $o_p(1)$. Provided $Pg_2 < 0$, such a bound on the error of approximation will lead to the usual central limit theorem for $\{n^{1/2}(\theta_n - \theta_0)\}$. As a more general result will be proved soon, let us not pursue that part of the argument further. Instead, reconsider the regularity conditions. The third partial derivative of $g_0(\cdot, \theta)$ was needed only to bound the remainder term in the Taylor expansion. The second partial derivative enters (1) only through its integrated value $Pg_2$. But the first partial derivative plays a critical role; its value at each $\xi_i$ comes into the linear term. That suggests we might relax the assumptions about existence of the higher derivatives and still get (1). We can. In place of $Pg_2$ we shall require a second derivative for $Pg_0(\cdot, \theta)$; and for the remainder term we shall invoke stochastic equicontinuity.
In its abstract form stochastic equicontinuity refers to a sequence of stochastic processes $\{Z_n(t): t \in T\}$ whose shared index set $T$ comes equipped with a semimetric $d(\cdot, \cdot)$. (In case you have forgotten, a semimetric has all the properties of a metric except that $d(s, t) = 0$ need not imply that $s$ equals $t$.) We shall later need it in that generality.
2 Definition. Call $\{Z_n\}$ stochastically equicontinuous at $t_0$ if for each $\eta > 0$ and $\varepsilon > 0$ there exists a neighborhood $U$ of $t_0$ for which
$$\limsup \mathbb{P}\{\sup_U |Z_n(t) - Z_n(t_0)| > \eta\} < \varepsilon.$$
On $A^c$, these bounds, together with their companions for $(s_n, t_n)$ and $(t_{i+1}, t_i)$, allow $|Z(s) - Z(t)|$ to be at most
$$D\,d(s_n, t_n)H(\delta_n) + 2\sum_{i=n}^{m} 2D\delta_i H(\delta_{i+1}).$$
The distance $d(s_n, t_n)$ is at most
$$d(s, t) + \sum_{i=n}^{m} d(s_{i+1}, s_i) + \sum_{i=n}^{m} d(t_{i+1}, t_i) \le 2\delta_n + 2\sum_{i=n}^{m} 2\delta_i \le 10\delta_n.$$
IZ(s)  Z(t) I ::;; lODbnH(b n)
+ 4D
L 4(bi+ 1 
bi+2 )H(bi+ 1)
i=n
::;; lODbnH(b n)
+
16D In f{b i+2
26DJ(£) for some s, tin T* with des, t) ::;; £} ::;; 2£. A direct derivation of the weaker inequality would be slightly simpler than the proof of the lemma. But there are applications where the stronger result is needed.
11 Example. Brownian motion on $[0, 1]$, you will recall, is a stochastic process $\{B(\cdot, t): 0 \le t \le 1\}$ with continuous sample paths, independent increments, $B(\cdot, 0) = 0$, and $B(t) - B(s)$ distributed $N(0, t - s)$ for $t \ge s$. If we measure distances between points of $[0, 1]$ in a strange way, the Chaining Lemma will give a so-called modulus of continuity for the sample paths of $B$. The normal distribution has tails that decrease exponentially fast: from Appendix B,
$$\mathbb{P}\{|B(t) - B(s)| > \eta\} \le 2\exp(-\tfrac12\eta^2/|t - s|).$$
Define a new metric on $[0, 1]$ by setting $d(s, t) = |s - t|^{1/2}$. Then $B$ satisfies inequality (10) with $D = 1$. The covering number $N(\delta, d, [0, 1])$ is smaller than $2\delta^{-2}$, which gives the bound
$$J(\delta) \le \int_0^\delta [2\log 4 + 10\log(1/u)]^{1/2}\,du \le (2\log 4)^{1/2}\delta + \sqrt{10}\,[\log(1/\delta)]^{-1/2}\int_0^\delta \log(1/u)\,du \le 4\delta[\log(1/\delta)]^{1/2}$$
for $\delta$ small enough. From the Chaining Lemma,
$$\mathbb{P}\{|B(s) - B(t)| > 26J(d(s, t)) \text{ for some pair with } d(s, t) \le \delta\} \le 2\delta.$$
The event appearing on the left-hand side gets smaller as $\delta$ decreases. Let $\delta \downarrow 0$. Conclude that for almost all $\omega$,
$$|B(\omega, s) - B(\omega, t)| \le 74\,|(s - t)\log|s - t||^{1/2} \quad\text{for } |s - t|^{1/2} < \delta(\omega).$$
Except for the unimportant factor of 74, this is the best modulus possible (McKean 1969, Section 1.6). □
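The bound on the covering integral in Example 11 holds only "for $\delta$ small enough", and a quick numerical integration shows where it starts to apply. This is a rough sketch using a midpoint rule; the function names `covering_integral_bound` and `simple_bound` are illustrative and not part of the text.

```python
import math


def covering_integral_bound(delta, cells=20000):
    """Midpoint-rule value of the integral over (0, delta] of
    [2 log 4 + 10 log(1/u)]^(1/2), the upper bound for J(delta) in Example 11."""
    h = delta / cells
    total = 0.0
    for k in range(cells):
        u = (k + 0.5) * h
        total += math.sqrt(2.0 * math.log(4.0) + 10.0 * math.log(1.0 / u))
    return total * h


def simple_bound(delta):
    """The closed-form bound 4 * delta * [log(1/delta)]^(1/2)."""
    return 4.0 * delta * math.sqrt(math.log(1.0 / delta))
```

Comparing `covering_integral_bound(d)` with `simple_bound(d)` for a few values of `d` suggests that the closed form dominates once `d` is of the order of one hundredth or smaller, but not for `d` as large as one half.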
VII.3. Gaussian Processes In Section 5 we shall generalize the Empirical Central Limit Theorem of Chapter V to empirical processes indexed by classes of functions. The limit processes will be analogues of the brownian bridge, gaussian processes with sample paths continuous in an appropriate sense. Even though existence of the limits will be guaranteed by the method of proof, it is no waste of effort if we devote a few pages here to a direct construction, which makes nontrivial application of the Chaining Lemma. The direct argument tells us more about the sample path properties of the gaussian processes. We start with analogues of brownian motion. The argument will extend an idea already touched on in Example 11.
Look at brownian motion in a different way. Regard it as a stochastic process indexed by the class of indicator functions
$$\mathcal{F} = \{[0, t]: 0 \le t \le 1\}.$$
The covariance $\mathbb{P}[B(\cdot, f)B(\cdot, g)]$ can then be written as $P(fg)$, where $P = \text{Uniform}[0, 1]$. The process maps the subset $\mathcal{F}$ of $\mathcal{L}^2(P)$ into the space $\mathcal{L}^2(\mathbb{P})$ in such a way that inner products are preserved. From this perspective it becomes more natural to characterize the sample path property as continuity with respect to the $\mathcal{L}^2(P)$ seminorm $\rho_P$ on $\mathcal{F}$. Notice that
$$\rho_P(|[0, s] - [0, t]|) = (P|[0, s] - [0, t]|^2)^{1/2} = |s - t|^{1/2}.$$
It is no accident that we used the same distance function in Example 11. The new notion of sample path continuity also makes sense for stochastic processes indexed by subclasses of other $\mathcal{L}^2(P)$ spaces, for probability measures different from Uniform[0, 1].
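The two identities used here, covariance $P(fg) = \min(s, t)$ and seminorm $|s - t|^{1/2}$ for indicators of $[0, s]$ and $[0, t]$, can be confirmed with a crude Riemann sum. A minimal sketch, with hypothetical helper names `indicator`, `P` and `rho`, and with $P$ taken to be Uniform[0, 1] as in the text:

```python
import math


def indicator(t):
    """Indicator function of the interval [0, t]."""
    return lambda x: 1.0 if 0.0 <= x <= t else 0.0


def P(func, cells=20000):
    """Midpoint Riemann sum for the integral of func against Uniform[0, 1]."""
    h = 1.0 / cells
    return sum(func((k + 0.5) * h) for k in range(cells)) * h


def rho(f, g):
    """L2(P) seminorm of f - g."""
    return math.sqrt(P(lambda x: (f(x) - g(x)) ** 2))


f, g = indicator(0.2), indicator(0.7)
covariance = P(lambda x: f(x) * g(x))  # approximates min(0.2, 0.7)
distance = rho(f, g)                   # approximates |0.2 - 0.7| ** 0.5
```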
12 Definition. Let $\mathcal{F}$ be a class of measurable functions on a set $S$ with a $\sigma$-field supporting a probability measure $P$. Suppose $\mathcal{F}$ is contained in $\mathcal{L}^2(P)$. A $P$-motion is a stochastic process $\{B_P(\cdot, f): f \in \mathcal{F}\}$ indexed by $\mathcal{F}$ for which:
(i) $B_P$ has joint normal finite-dimensional distributions with zero means and covariance $\mathbb{P}[B_P(\cdot, f)B_P(\cdot, g)] = P(fg)$;
(ii) each sample path $B_P(\omega, \cdot)$ is bounded and uniformly continuous with respect to the $\mathcal{L}^2(P)$ seminorm $\rho_P(\cdot)$ on $\mathcal{F}$.
The name does not quite fit unless one reads "Uniform[0, 1]" as "brownian," but it is easy to remember. The uniform continuity and boundedness that crept into the definition come automatically for brownian motion on the compact interval $[0, 1]$. In general $\mathcal{F}$ need not be a compact subset of $\mathcal{L}^2(P)$, although it must be totally bounded if it is to index a $P$-motion (Problem 3); uniformly continuous functions on a totally bounded $\mathcal{F}$ must be bounded. We seek conditions on $P$ and $\mathcal{F}$ for existence of the $P$-motion. The Chaining Lemma will give us much more: a bound on the increments of the process in terms of the covering integral
$$J(\delta) = J(\delta, \rho_P, \mathcal{F}) = \int_0^\delta [2\log(N(u, \rho_P, \mathcal{F})^2/u)]^{1/2}\,du.$$
Finiteness of $J(\cdot)$ will guarantee existence of $B_P$.
13 Theorem. Let $\mathcal{F}$ be a subset of $\mathcal{L}^2(P)$ with a finite covering integral, $J(\cdot)$, under the $\mathcal{L}^2(P)$ seminorm $\rho_P(\cdot)$. There exists a $P$-motion, $B_P$, indexed by $\mathcal{F}$, for which
$$|B_P(\omega, f) - B_P(\omega, g)| \le 26J(\rho_P(f - g)) \quad\text{if } \rho_P(f - g) < \delta(\omega),$$
with $\delta(\omega)$ finite for every $\omega$.
PROOF. Construct the process first on a countable dense subset $\mathcal{F}_0 = \{f_j\}$ of $\mathcal{F}$. Such a subset exists because $\mathcal{F}$ has a finite $\delta$-net for each $\delta > 0$ (otherwise $J$ could not be finite). Apply the Gram-Schmidt procedure to $\mathcal{F}_0$, generating an orthonormal sequence of functions $\{u_j\}$. Each $f$ in $\mathcal{F}_0$ is a finite linear combination $\sum_j$
Problems 5 and 6 provides the details behind (17). That gets rid of [15]. The reason we needed to replace [15] by Ob) becomes evident when we condition on 1;. Write PnO for the !£2(Pn) seminorm. We have no direct control over Pn(f  g) for functions in [15]; but for Ob), whose members are determined as soon as I; is specified, Pn(f  g) < 2b. Apply the Chaining Lemma. IP{IE~(f
 g)1 > 26Ji2b, Pn, 3") for some (f, g) in IJ} < 8.
Notice that Vs:::: {r1  r2:pp(r 1  r2) S b}, because r(·, to) = 0 by definition. Thus we may check for stochastic equicontinuity by showing: (i) The class f!Jl has an envelope belonging to !£2(p).
(ii) $f(\cdot, t)$ is differentiable in quadratic mean at $t_0$. From (i), this follows by dominated convergence if $r(\cdot, t) \to 0$ almost surely $[P]$ as $t \to t_0$.
(iii) Condition (16) is satisfied for $\mathcal{F} = \mathcal{R}$.
These three conditions place constraints on the class $\{f(\cdot, t)\}$.
18 Example. The spatial median of a bivariate distribution $P$ is the value of $\theta$ that minimizes $M(\theta) = P|x - \theta|$. Estimate it by the $\theta_n$ that minimizes $M_n(\theta) = P_n|x - \theta|$. Example II.26 gave conditions for consistency of such an estimator. Those conditions apply when $P$ equals the symmetric normal $N(0, I_2)$, a pleasant distribution to work with because explicit values can be calculated for all the quantities connected with the asymptotics for $\{\theta_n\}$. For this $P$, convexity and symmetry force $M(\cdot)$ to have its unique minimum at zero, so $\theta_n$ converges almost surely to zero. Theorem 5 will produce the central limit theorem, after we check its nonobvious conditions (ii), (iv), and (v).
Change variables to reexpress $M(\theta)$ in a form that makes it easier to find derivatives.
$$M(\theta) = (2\pi)^{-1}\int |x| \exp(-\tfrac12|x + \theta|^2)\,dx.$$
Differentiate under the integral sign. $M'(0) = 0$, of course, and
$$M''(0) = (2\pi)^{-1}\int |x|(xx' - I_2)\exp(-\tfrac12|x|^2)\,dx.$$
A random vector $X$ with a $N(0, I_2)$ distribution has the factorization $X = RU$ where $R^2 = |X|^2$ has a $\chi^2_2$ distribution independent of the random unit vector $U = X/|X|$, which is uniformly distributed around the rim of the unit circle.
$$V = M''(0) = \mathbb{P}(R^3UU' - RI_2) = \mathbb{P}R^3\,\mathbb{P}UU' - (\mathbb{P}R)I_2 = (\pi/8)^{1/2}I_2.$$
Condition (ii) wasn't so hard to check. To figure out the $\Delta(x)$ that should appear in the linear approximation
$$|x - \theta| = |x| + \theta'\Delta(x) + |\theta|\,r(x, \theta),$$
carry out the usual pointwise differentiation. That gives $\Delta(x) = -x/|x|$ for $x \ne 0$. Set $\Delta(0) = 0$, for completeness. The components of $\Delta(\cdot)$ all belong to $\mathcal{L}^2(P)$. Indeed, $P\Delta\Delta' = \mathbb{P}UU' = \tfrac12 I_2$. That's condition (iv) taken care of.
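The value $(\pi/8)^{1/2}$ follows from the moments of $R$, which have a closed form in terms of the gamma function. The sketch below evaluates them; the helper name `chi_moment` is hypothetical. The final line additionally assumes the usual sandwich form $V^{-1}(P\Delta\Delta')V^{-1}$ for the limiting variance, in the spirit of Problem 2 of this chapter; that form is an assumption here, since the statement of Theorem 5 is not reproduced in this passage.

```python
import math


def chi_moment(k, dof=2):
    """E[R^k] when R^2 has a chi-square distribution with `dof` degrees of freedom."""
    return 2.0 ** (k / 2.0) * math.gamma((dof + k) / 2.0) / math.gamma(dof / 2.0)


# Scalar multiple v of I_2 in V = P(R^3) P(UU') - P(R) I_2, using P(UU') = I_2 / 2.
v = 0.5 * chi_moment(3) - chi_moment(1)

# Assumed sandwich form V^{-1} (P Delta Delta') V^{-1}, with P Delta Delta' = I_2 / 2.
sandwich = 0.5 / v ** 2
```

Under that assumption the scalar `sandwich` comes out as $4/\pi$, so each coordinate of $n^{1/2}\theta_n$ would have limiting variance slightly larger than the value 1 for the coordinatewise sample mean.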
Now comes the hard part, or at least it would be hard if we hadn't already proved the Equicontinuity Lemma. Start by checking that the class $\mathcal{R}$ of remainder functions $r(\cdot, \theta)$ has an envelope in $\mathcal{L}^2(P)$. For $\theta \ne 0$,
$$|r(x, \theta)| = \big||x - \theta| - |x| - \theta'\Delta(x)\big|/|\theta| \le |\theta|^{-1}\big||x - \theta|^2 - |x|^2\big|/(|x - \theta| + |x|) + 1 \le (2|x| + |\theta|)/(|x - \theta| + |x|) + 1 \le 4.$$
It follows that $|\cdot - \theta|$ is differentiable in quadratic mean at $\theta = 0$. We have only to verify condition (16) of the Equicontinuity Lemma to complete the proof of stochastic equicontinuity. Each $r(\cdot, \theta)$, for $\theta \ne 0$, can be broken into a difference of two bounded functions:
$$r_1(\cdot, \theta) = \theta'\Delta(\cdot)/|\theta|, \qquad r_2(\cdot, \theta) = (|x - \theta| - |x|)/|\theta|.$$
Write $\mathcal{R}_1$ and $\mathcal{R}_2$ for the corresponding classes of functions. The linear space spanned by $\mathcal{R}_1$ has finite dimension; the graphs have polynomial discrimination, by Lemma II.28; the covering numbers $N_2(u, P_n, \mathcal{R}_1)$ are bounded by a polynomial $Au^{-W}$ in $u^{-1}$, with $A$ and $W$ not depending on $P_n$ (Lemma II.36). The graphs of functions in $\mathcal{R}_2$ also have polynomial discrimination, because $\{(x, t): |x - \theta| - |x| \ge |\theta|t\}$ can be written as
$$\{-2\theta'x + |\theta|^2 \ge 2|\theta||x|t + |\theta|^2t^2\} \cap \{|x| + |\theta|t \ge 0\} \cup \{|x| + |\theta|t < 0\}$$
5Y}
~ IP{ s~p IZ(t) 
Z(ta) I > Y}
+ IP{su p IZ(so) 
+s
Z(to) I >
y}.
[0]
The distance between the is less than
So
and to of each pair appearing in the last term
kl
~
kl
L bi + !Y. + 15 + !Y. + L bi i=O
~
315 0
~
1215.
i=O
+ 2!Y. + 15
There are at most N(b o)2 such pairs. The exponential inequality holds for each pair, because 12b/(!Y.yl/2) ;;::: 12yl/2 ;;::: 1. IP{sup IZ(so)  Z(to) I >
y}
[0]
~
N(b o)Z2 exp( h 2/144D 2b2)
~
215
because d(so, to) ~ 1215 Z 2 ~ 215 exp[log(N(b)Z /15)  h2/144D b ] because N(b o) ~ N(b) ~ 215 exp[tb Z(H(b)Zb 2  y2/144D 2 )]
< s.
because
H(b)b
~
J(b)
~
y/12D
o
Theorem 21 states a sufficient condition for empirical processes indexed by a pointwise bounded, totally bounded, permissible subclass $\mathcal{F}$ of $\mathcal{L}^2(P)$ to converge in distribution to a $P$-bridge: given $\eta > 0$ and $\varepsilon > 0$ there exists a $\delta > 0$ for which
$$\limsup \mathbb{P}\{\sup_{[\delta]} |E_n(f - g)| > \eta\} < \varepsilon.$$
For a permissible class of bounded functions, say $0 \le f \le 1$, any condition implying finiteness of $N_1(\cdot, P, \mathcal{F})$ or $N_2(\cdot, P, \mathcal{F})$ will take care of the total boundedness. Finiteness of a covering integral will allow us to apply Theorem 26, leaving only a supremum over the class $\mathcal{H} = \{f - f_\alpha: f \in \mathcal{F}\}$ of little links. It will then suffice to prove $\sup_{\mathcal{H}} |E_n h| = o_p(1)$ to get the empirical central limit theorem. Notice that $\alpha$, and hence $\mathcal{H}$, will depend on $n$. The next three examples sketch typical methods for handling $\mathcal{H}$.
27 Example. Equip $\mathcal{F}$ with the semimetric $d(f, g) = (P|f - g|)^{1/2}$. (This is the $\mathcal{L}^2(P)$ seminorm applied to the function $|f - g|^{1/2}$.) The square root ensures that the variance $\sigma^2(f - g)$ is less than $d(f, g)^2$. If we take $\lambda = \tfrac12$, the exponential bound (25) becomes, for $d(f, g) \le \delta$,
IP{IEnC!  g)1 > I]}
s
2 exp( i1J z/£5 Z)
if £5zl1J ~ 2/(n 1 / 2 B 1 (t».
That is, D = 2 and rt. = (2/B 1(t»1/2 n 1/4 for Theorem 26. The covering numbers for d(·, .) are closely related to the .;tJ1(p) covering numbers: in terms of the covering integral,
If J is finite, Theorem 26 can chain down to leave a class Ye of little links with Ihis 1 and PI his rt. 2 . If we add to this the condition (29)
log N 1(cn 1/Z, Pn, Ye) = op(n1/Z)
for each c > 0,
the empirical central limit theorem will hold. The methods of Section II.6 work for the class Ye1 / 2 = {I h 11 / Z : hE Ye}. Notice that
N 2 (£5, Pn , Ye1 / Z ) S N 1 (£5 Z , P n , Ye) because Pn(lh 111/2

Ih211/2)2 S P nlh 1  h 2 1. From Lemma II.33,
(30) IP{S,(P n l h l )1/Z > 8rt.} S 4IP[N 1(rt. z, P n , Ye) exp( nrt. 2 )
= 4IP[exp(log N 1 (rt. Z , Pn , Ye) +
0
by (29).
/\
1J
 nrt. Z )
/\
1J
Symmetrize. For n large enough,
IP{s~p IEnhl > 4Y} :s; 4IP{S~P IE~hl >
y}.
Condition on 1;. Cover Ye by M = N 1 (tyn 1 / 2 , Pn , Ye) balls, for the ft'l(P n ) seminorm, with centers gl' ... , gM in .Yf. Then as in Section 11.6,
IP{s~ IE~hl > yll;} :s; M m~x IP{ IE~gjl > hll;}· On the set of I; where suP.Yt' PnIh I :s; 640: 2, Hoeffding's Inequality bounds the righthand side by 2 exp[log M  ~(!y)2/(640:2)] which is of order oil) because (29) says log M = oin1/2). The central limit theorem follows. 0 31 Example. The direct approximation method of Section 11.2 gave uniform strong laws of large numbers. With a suitable bound on the number of functions needed for the approximations, we get central limit theorems. Define a direct covering number Ll(b, P, Ye) as the smallest M for which there exist functions gl' ... , gM such that, for every h in £,
Ihl :s; gi
°
and
Pgi:S; b
We may assume :s; gi :s; 1. If (32) log Ll(cn 1 / 2 , P, Ye)
+ Plhl
= 0(n 1 / 2 )
for some i.
for each c > 0,
and if the covering integral (28) from the previous example is finite, then the empirical central limit theorem holds. Given y > 0, choose A in the exponential inequality (25) so that 2/B 1 (A) = y. The dependence of A on y does not vitiate the chaining argument in Theorem 26; it does ensure that functions in Ye satisfy Plhl :s; 0: 2 = yn 1/2 • Find gl"'" gM according to the definition of Ll(yn 1/ 2 , P, Ye). Because Pg i :s; 2yn 1/2 for each i, the contributions of the means to En are small.
IP{s~ IEnhl > 4Y} :s; IP{s~ n1/2Pnlhl > 3Y} :s;
IP{m~x n 1/ Pngi > 3Y} 2
:s; M max IP{Engi > y}
because n1/2Plhl :s; y because Ihl :s; gJor some i because n1/2Pg i :s; 2y
i
:s; M max 2 exp[ _~(y2 /Pg i)B(2y/(n 1/2Pg i))]
from (25)
i
:s; 2 exp[log M 
= 0(1) by (32).
iYn 1/ 2 B(1)]
o
33 Example. In the previous two examples, the method of chaining left links of small 2 1 (P) seminorm at the end of the chain; 21 approximation methods took care of Yt. If we chain instead with 2 2 (P) covering numbers, we need 22 approximation methods for Yt. Set d(·, .) equal to the 2 2(P) semimetric. Because a 2(f  g) s d(!, g)2, the chaining down to £' requires J il, P, $') finite. At the end
Ph 2 S a 2 = (2jB 1(i»n 1 / 2 •
Invoke Lemma II.33.
IP{S~(Pnh2)1/2 S
8a}
>
1 as
n
> 00
if the random covering numbers satisfy log N 2(cn 1/4, Pn, £') = op(n 1/2) for each c >
o.
This would follow from (34) because oin1/4) = (cn1/4)lJ2(cn1/4, Pn, £') ~ [210g(N 2 (cn 1 /4, P n , £,)2n 1/4jcW/ 2.
Symmetrize. For all n large enough,
IP{s~p IEnhl > 4Y} s 4IP{S~ IE~hl > Y}. Now we are back to the sort of problem we were solving in Section 4. Condition on ;. On the set of those; for which sup£(P n h 2)1/2 S 8a, chain using the Hoeffding Inequality to bound the tail probabilities. Apply the Chaining Lemma for IP( ·1;), the 2 2 (P n) seminorm, and I> = 8et.
IP{s~ IE~hl > 26J 2(8a, Pn , £')I;} s
16a
if
S~(Pnh2)1/2 S
8a.
Condition (34) and finiteness of $J_2(1, P, \mathcal{F})$ are sufficient for the empirical central limit theorem to hold. □
NOTES
Theorem 5 draws on ideas from Chernoff (1954), but substitutes stochastic equicontinuity where he placed domination conditions on third-order partial derivatives. The theorem also holds if $t_0$ is just a local minimum for $F(\cdot)$, or if $\tau_n$ is a minimum for $F_n(\cdot)$ over a large enough neighborhood of $t_0$. Huber (1967, Lemma 3) made explicit the role of stochastic equicontinuity in a proof of the central limit theorem for an M-estimator.
The chaining argument abstracts the idea behind construction of processes on a dyadic rational skeleton. It appears to have entered weak convergence theory through the work of Kolmogorov and the Soviet School; it is closely related to the arguments for construction of measures in function spaces (Gihman and Skorohod 1974, Sections III.4, III.5). The Chaining Lemma is based on an arrangement by Le Cam (1983) of an argument of Dudley (1967a, 1973, 1978). Le Cam's approach avoids the complications introduced into Dudley's proof by the nuisance possibility that covering numbers $N(\delta)$ might not increase rapidly enough as $\delta$ decreases to zero. Alexander (1984a, 1984b) has refined Dudley's form of the chaining argument to prove the most precise maximal inequalities for general empirical processes to be found in the literature.
Theorem 13 is based on Theorem 2.1 of Dudley (1973), but with his modulus function increased slightly to take advantage of Le Cam's (1983) cleaner bound for the error term. The extra $(\delta\log(1/\delta))^{1/2}$ does not change the order of magnitude of the modulus for most processes.
The argument in Section 4 is based on Pollard (1982c), except for the substitution of convergence in probability (condition (16)) for uniform convergence. Kolchinsky (1982) developed a similar technique to prove a similar central limit theorem for bounded classes of functions. He imposed finiteness of $J_2(\cdot, P, \mathcal{F})$ plus a growth condition on $N_1(\cdot, P_n, \mathcal{F})$ to get results closer to those of my Example 27. Giné and Zinn (1984) have found a necessary and sufficient random entropy condition for the empirical central limit theorem.
Brown (1983) sketched the large-sample theory for the spatial median. He referred to Brown and Kildea (1979) and the appendix he wrote for Maritz (1981) for rigorous proofs, which depend on a form of stochastic equicontinuity. The central limit theorem for k-means was proved by Pollard (1982b, 1982d) for a fixed number of clusters in euclidean space. The one-dimensional result was proved by Hartigan (1978), using a different method.
Dudley (1978, 1981a, 1981b, 1984) has developed the application of metric entropy (covering numbers) to empirical process theory. These papers extended his earlier work on entropy and sample path properties of gaussian processes (1967b, 1973), and on the multidimensional empirical distribution function (1966a). Dudley (1966a, 1978) introduced most of the ideas needed to prove central limit theorems for empirical processes indexed by sets. He extended these ideas to classes of functions in (1981a, 1981b). His lecture notes (1984) provide the best available overview of empirical process theory, as of this writing. The proof of my Theorem 21 was inspired by Chapter 4 of those lecture notes, which reworked ideas from Dudley and Philipp (1983). If $Pf = 0$ for each $f$ in $\mathcal{F}$, a standardization that can be imposed without affecting $E_n$ or $E_P$, the conditions of Theorem 21 are also necessary for the empirical central limit theorem.
The first central limit theorems for empirical processes indexed by classes of sets were proved by the direct approximation method. Bolthausen (1978) worked with the class of compact, convex subsets of the unit square in $\mathbb{R}^2$. He applied an entropy bound due to Dudley (1974). Révész (1976) indexed the processes by classes of sets with smooth boundaries. Earlier work of Sun was, unfortunately, not published until quite recently (Pyke and Sun 1982). Dudley's (1978) Theorem 5.1 imposed a condition on the "metric entropy with inclusion" that corresponds to finiteness of a covering integral. Strassen and Dudley (1969) proved a central limit theorem for empirical processes indexed by classes of smooth functions. They deduced the result from their central limit theorem for sums of independent random elements of spaces of continuous functions. All these theorems depend on existence of good bounds for the rate of growth of entropy functions (covering numbers). For more about this see Dudley (1984, Sections 6 and 7) and Gaenssler (1984).
Theorem 26 resets an argument of Le Cam (1983). Such an approximation theorem has been implicit in the work of Dudley. Giné and Zinn (1984) have pointed out the benefits of stripping off the $\mathcal{L}^2(P)$ chaining argument, to expose more clearly the problem of how to handle the little links left at the end of the chain. They have also stressed the strong parallels between empirical processes and gaussian processes. The examples in Section 6 follow the lead of Giné and Zinn: Example 27 is based on their adaptation of Le Cam's (1983) square-root trick; Example 31 is based on their improvement of Dudley's (1978) "metric entropy with inclusion" method; Example 33 is based on their Theorem 5.5.
PROBLEMS
[1] Prove that the stochastic equicontinuity concept of Definition 2 follows from: $Z_n(\tau_n) - Z_n(t_0) \to 0$ in probability for every sequence $\{\tau_n\}$ that converges in probability to $t_0$. [Suppose the defining property fails for some $\eta > 0$ and $\varepsilon > 0$. For a sequence of neighborhoods $\{U_k\}$ that shrink to $t_0$ find positive integers $n(1) < n(2) < \dots$ with
$$\mathbb{P}\{\sup_{U_k} |Z_{n(k)}(t) - Z_{n(k)}(t_0)| > \eta\} > \varepsilon$$
for every $k$. Choose random elements $\{\tau_n\}$ of $T$ such that, for $n(k) \le n < n(k + 1)$,
$$|Z_n(\omega, \tau_n(\omega)) - Z_n(\omega, t_0)| \ge \tfrac12 \sup_{U_k} |Z_n(\omega, t) - Z_n(\omega, t_0)|$$
and $\tau_n(\omega)$ belongs to $U_k$. Appendix C covers measurability of $\tau_n$.]
[2] Let $\{f(\cdot, t): t \in T\}$ be a collection of $\mathbb{R}^k$-valued functions indexed by a subset of $\mathbb{R}^k$. Suppose $P|f(\cdot, t)|^2 < \infty$ for each $t$. Set $F(t) = Pf(\cdot, t)$ and $F_n(t) = P_nf(\cdot, t)$. Let $\{\tau_n\}$ be a sequence converging in probability to a value $t_0$ at which $F(t_0) = 0$. If
(a) $F(\cdot)$ has a nonsingular derivative matrix $D$ at $t_0$;
(b) $F_n(\tau_n) = o_p(n^{-1/2})$;
(c) {Enf(" t)} is stochastically equicontinuous at to; then n 1/1(Tn  to) ...... N(O, D 1P[f(" to)f(', to)']D1). [Compare with Huber (1967).J [3J For a class ff to index a Pmotion it must be totally bounded under the 'pl(P) seminorm pp. [First show ff is bounded: otherwise IBp(f,,)I CfJ in probability for some {f,,}, violating boundedness of Pmotion sample paths. Total boundedness will then follow from: for each e > 0, every f lies within e of some linear combination of a fixed, finite subclass of:F. If for some e no such finite subclass exists, find {f,,} such that n
fn+1
=
gn+1
+
Ianjij, j= 1
where pp(gn+ 1) :2: e and gn+ 1 is orthogonal to f1' ... ,f". Fix an M. Show that there exists aD> 0, depending on M and e, for which
Deduce that $\mathbb{P}\{\sup_n B_P(f_n) \ge M\} = 1$ for every $M$, which contradicts boundedness of the sample paths. Notice that continuity of the sample paths does not enter the argument. Dudley (1967b).]
[4] If $\sup_{\mathcal{F}} |Pf|$ is finite then $\mathcal{F}$ must be totally bounded under the $\mathcal{L}^2(P)$ seminorm $\rho_P$ if it supports a $P$-bridge. [Choose $Z$ with a $N(0, 1)$ distribution independent of $E_P$. The process $B(f) = E_P(f) + ZPf$ is a $P$-motion with bounded sample paths. Invoke Problem 3. The condition on the means is needed: consider the $\mathcal{F}$ consisting of all constant functions. The $P$-bridge is unaffected by addition of arbitrary constants to functions in $\mathcal{F}$; it depends only on the projection of $\mathcal{F}$ onto the subspace of $\mathcal{L}^2(P)$ orthogonal to the constants.]
[5] Let $\mathcal{H}_1$ be a class of functions with an envelope $H$ in $\mathcal{L}^2(P)$. Set $\mathcal{H}_2 = \{h^2 : h \in \mathcal{H}_1\}$. Show that
$$N_1(4\varepsilon(QH^2)^{1/2}, Q, \mathcal{H}_2) \le N_2(2\varepsilon, Q, \mathcal{H}_1).$$
[By the Cauchy-Schwarz inequality,
$$Q|h_1^2 - h_2^2| \le Q(2H|h_1 - h_2|) \le 2(QH^2)^{1/2}(Q|h_1 - h_2|^2)^{1/2}$$
if both $|h_1| \le H$ and $|h_2| \le H$.]
[6] Let $\mathcal{F}$ be a permissible class of functions with envelope $F$. Suppose
$$J_2(\delta, P_n, \mathcal{F}) = o_p(n^{1/2}) \quad \text{for each } \delta > 0.$$
[Condition (16) of the Equicontinuity Lemma implies that $J_2(\delta, P_n, \mathcal{F}) = O_p(1)$ for each $\delta > 0$.] Show that $\mathcal{H}_2 = \{(f - g)^2 : f, g \in \mathcal{F}\}$ satisfies the sufficient condition (Theorem II.24) for the uniform strong law of large numbers:
$$\log N_1(\varepsilon, P_n, \mathcal{H}_2) = o_p(n) \quad \text{for each } \varepsilon > 0.$$
[Set $H = 2F$ and $\mathcal{H}_1 = \{f - g : f, g \in \mathcal{F}\}$. Show that, for $1 > \varepsilon > 0$,
$$N_2(2\varepsilon, P_n, \mathcal{H}_1) \le N_2(\varepsilon, P_n, \mathcal{F})^2 \le \varepsilon \exp\big(\tfrac12 J_2(\varepsilon, P_n, \mathcal{F})^2/\varepsilon^2\big).$$
Deduce from this inequality, Problem 5, and the strong law of large numbers for $\{P_n H^2\}$ that, if $1 > \varepsilon > 0$,
$$\mathbb{P}\{\log N_1(4\varepsilon(2PH^2)^{1/2}, P_n, \mathcal{H}_2) > n\eta\}$$
$$\le \mathbb{P}\{\log N_1(4\varepsilon(P_nH^2)^{1/2}, P_n, \mathcal{H}_2) > n\eta\} + \mathbb{P}\{P_nH^2 > 2PH^2\}$$
$$\le \mathbb{P}\{\log N_2(2\varepsilon, P_n, \mathcal{H}_1) > n\eta\} + \mathbb{P}\{P_nH^2 > 2PH^2\}$$
$$\le \mathbb{P}\{\tfrac12 J_2(\varepsilon, P_n, \mathcal{F})^2/\varepsilon^2 > n\eta\} + \mathbb{P}\{P_nH^2 > 2PH^2\} \to 0.$$
A weaker result was proved by Pollard (1982c).]
[7] If $\mathcal{F}$ is totally bounded under the $\mathcal{L}^2(P)$ seminorm, then the space $C(\mathcal{F}, P)$ of bounded, uniformly continuous, real functions on $\mathcal{F}$ is separable. [Suppose $|x(f) - x(g)| < \varepsilon$ whenever $\rho_P(f - g) \le 2\delta$. Choose $\{f_1, \dots, f_m\}$ as a maximal set with $\rho_P(f_i - f_j) \ge \tfrac12\delta$. Use the weighting functions $\Delta_i(\cdot)$ from the proof of Theorem 21 to interpolate between rational approximations to the $\{x(f_i)\}$.]
[8] Suppose $\mathcal{F}$ is totally bounded under the $\mathcal{L}^2(P)$ seminorm. If two probability measures $\lambda$ and $\mu$ on the $\sigma$-field $\mathcal{B}^P$ have the same fidis, and if both concentrate on $C(\mathcal{F}, P)$, then they must agree everywhere on $\mathcal{B}^P$. [Show that $\lambda$ and $\mu$ agree for all finite intersections of fidi sets and closed balls with centers in $C(\mathcal{F}, P)$. For example, consider a closed ball $B(x, r)$ with $x$ in $C(\mathcal{F}, P)$. Let $\{f_1, f_2, \dots\}$ be a countable, dense subset of $\mathcal{F}$. Define
$$B_n = \{z \in C(\mathcal{F}, P) : |z(f_i) - x(f_i)| \le r \text{ for } 1 \le i \le n\}.$$
Show that $\mu B(x, r) \leftarrow \mu B_n = \lambda B_n \to \lambda B(x, r)$ as $n \to \infty$. Extend the result to finite collections of closed balls and fidi sets, then apply a generating-class argument.]
[9] The property that the graphs have only polynomial discrimination is not preserved by the operation of summing two classes of functions. That is, both $\mathcal{F}$ and $\mathcal{G}$ can have the property without the class $\mathcal{S} = \{f + g : f \in \mathcal{F}, g \in \mathcal{G}\}$ having it. Let $\mathcal{D} = \{D_1, D_2, \dots\}$ be the set of indicator functions of all finite sets of rational numbers in $[0, 1]$. Let $\mathcal{F} = \{2n + D_n : n = 1, 2, \dots\}$ and $\mathcal{G} = \{-2n : n = 1, 2, \dots\}$. The graphs from neither class can shatter two-point sets, but $\mathcal{S}$ can shatter arbitrarily large finite sets of rationals in $[0, 1]$. [The roundabout reasoning used to bound the covering numbers in Example 18 may not be completely unnecessary.]
CHAPTER VIII
Martingales
... in which martingale central limit theorems in discrete and continuous time are proved. An extended non-trivial application to Kaplan-Meier estimators (estimation of distribution functions from censored data) is sketched.
VIII.1. A Central Limit Theorem for Martingale-Difference Arrays
Martingale theory must surely be the most successful of all the attempts to extend the classical theory for sums of independent random variables to cover dependent variables. Many of the classical limit theorems have martingale analogues that rival them for elegance and far exceed them in diversity of application. We shall explore two of these martingale theorems in this chapter.
One main change in technique will become apparent. Where proofs for independent summands use truncation to protect against occasional abnormally large increments (the sort of thing implicit in something like the Lindeberg condition), martingale proofs can resort to stopping time arguments. Optional stopping preserves the conditional expectation connection within a martingale sequence as long as the decision to stop is based only on past behavior of the sequence. The prohibition against peering into the future to anticipate abnormally large increments imposes a characteristic feature on martingale theorems. One needs conditions that protect against the behavior of the worst single increment, because the decision to stop can be taken only after that increment has had its effect.
In central limit theorems, independence allows one to factorize expectations of products and separate out the contribution of a particular increment from contributions of past and future increments. The martingale property allows a weaker factorization, for expectations conditional on the past alone. Conditional variances take over the role played by variances of independent summands. But apart from that, the arguments for martingales share the
same inspiration as the proof of the Lindeberg Central Limit Theorem in Section III.4.
We shall prove asymptotic normality for row sums of martingale-difference arrays. That is, for each $n$ we have random variables $\xi_{n1}, \dots, \xi_{nn}$ (we avoid some messy notation, and lose no generality, by assuming exactly $n$ variables in the $n$th row) and $\sigma$-fields $\mathcal{E}_{n0} \subseteq \dots \subseteq \mathcal{E}_{nn}$ for which
(a) $\mathbb{P}(\xi_{nj} \mid \mathcal{E}_{n,j-1}) = 0$ for $j = 1, \dots, n$;
(b) $\xi_{nj}$ is $\mathcal{E}_{nj}$-measurable.
Define conditional variances
$$v_{nj} = \mathbb{P}(\xi_{nj}^2 \mid \mathcal{E}_{n,j-1}) \quad \text{for } j = 1, \dots, n.$$
Notice that $v_{nj}$ is an $\mathcal{E}_{n,j-1}$-measurable random variable. Convergence of sums of conditional variances will be the only connection tying together the variables in different rows of the array.
1 Theorem. Let $\{\xi_{nj}\}$ be a martingale-difference array. If as $n \to \infty$,
(i) $\sum_j v_{nj} \to \sigma^2$ in probability, with $\sigma^2$ a positive constant;
(ii) for every $\varepsilon > 0$, the sum $\sum_j \mathbb{P}(\xi_{nj}^2\{|\xi_{nj}| > \varepsilon\} \mid \mathcal{E}_{n,j-1})$ converges in probability to zero (a Lindeberg condition);
then $\xi_{n1} + \dots + \xi_{nn} \rightsquigarrow N(0, \sigma^2)$.
PROOF. Without loss of generality set $\sigma^2 = 1$. Let us check pointwise convergence of characteristic functions:
$$\mathbb{P}\exp\big(it(\xi_{n1} + \dots + \xi_{nn})\big) \to \exp(-\tfrac12 t^2) \quad \text{for each real } t.$$
We really will need some of the special multiplicative properties of the complex exponential function, and not just its smoothness by way of three bounded, continuous derivatives as in Section III.4. The randomness of the conditional variances fouls up the argument based on successive substitution of matching normal increments, which worked for independent summands.
At the risk of notational abuse (more will follow) abbreviate the conditional expectation $\mathbb{P}(\cdot \mid \mathcal{E}_{nj})$ to $\mathbb{P}_j(\cdot)$. Write $R(x)$ for the remainder term $e^{ix} - 1 - ix$ and $S_{nj}$ for the partial sum $\xi_{n1} + \dots + \xi_{nj}$. Define $r_{nj}$ as the conditional expectation $\mathbb{P}_{j-1}R(t\xi_{nj})$. When the $\{\xi_{nj}\}$ are small, in a sense to be made precise by condition (ii), we shall get $r_{nj} \approx -\tfrac12 t^2 v_{nj}$.
The proof will work by successive conditioning. We pin down the effect of individual increments by evaluating
$$\mathbb{P}_0\mathbb{P}_1 \cdots \mathbb{P}_{n-1}\exp(itS_{nn})$$
a layer at a time. Start with the innermost conditional expectation. Factorize out the part depending on $\mathcal{E}_{n,n-1}$ then expand the remaining $\exp(it\xi_{nn})$.
$$\mathbb{P}_{n-1}\exp(itS_{nn}) = \exp(itS_{n,n-1})\,\mathbb{P}_{n-1}[1 + it\xi_{nn} + R(t\xi_{nn})] = \exp(itS_{n,n-1})(1 + r_{nn}).$$
The $1 + r_{nn}$ factor foils the attempt to work the same idea with $\mathbb{P}_{n-2}$; it won't cooperate by slipping outside the conditional expectation, leaving $\exp(itS_{n,n-1})$ to enjoy the same treatment from $\mathbb{P}_{n-2}$ that $\exp(itS_{nn})$ received from $\mathbb{P}_{n-1}$. We could clear away the obstacle by dividing out the offending factor.
$$\mathbb{P}_{n-2}\mathbb{P}_{n-1}[(1 + r_{nn})^{-1}\exp(itS_{nn})] = \mathbb{P}_{n-2}\exp(itS_{n,n-1}) = \exp(itS_{n,n-2})(1 + r_{n,n-1}).$$
We could get rid of the $(1 + r_{n,n-1})$ in a similar fashion. If you pursue this idea through each layer of conditioning, you will see the sense in starting from
$$\prod_{j=1}^{n}(1 + r_{nj})^{-1}\exp(itS_{nn}).$$
The remote possibility that one $r_{nj}$ might get close to $-1$ could cause minor difficulty when we come to bound the product term. To avoid this, replace $(1 + r_{nj})^{-1}$ by $(1 - r_{nj})$. When $r_{nj} \approx 0$ the change has little effect. Define
$$T_{nk} = \prod_{j=1}^{k}(1 - r_{nj}) \quad \text{and} \quad Z_{nk} = T_{nk}\exp(itS_{nk}).$$
If we could show $\mathbb{P}|T_{nn} - \exp(\tfrac12 t^2)| \to 0$ and $\mathbb{P}Z_{nn} \to 1$, then it would follow that
$$|\mathbb{P}\exp(itS_{nn}) - \exp(-\tfrac12 t^2)| \le \exp(-\tfrac12 t^2)\big[\mathbb{P}|\exp(itS_{nn} + \tfrac12 t^2) - Z_{nn}| + |\mathbb{P}Z_{nn} - 1|\big] \to 0.$$
We would get the desired results for $\{T_{nn}\}$ and $\{Z_{nn}\}$ from:
(a) $\sum_j r_{nj} \to -\tfrac12 t^2$ in probability;
(b) $\sum_j |r_{nj}| \le t^2$;
(c) $\max_j |r_{nj}| \to 0$ in probability.
The second of these requirements need not be satisfied but, without loss of generality, we may act as if it were. We replace $\xi_{nj}$ by $\xi_{nj}\{j \le \sigma_n\}$, where $\sigma_n = \max\{k : \sum_{j=1}^{k} v_{nj} \le 2\}$. Interpret $\sigma_n$ as zero if $v_{n1} > 2$. Because $v_{nj}$ is $\mathcal{E}_{n,j-1}$-measurable, the event $\{j \le \sigma_n\}$ is $\mathcal{E}_{n,j-1}$-measurable; $\sigma_n$ is a stopping time. The new variables are martingale differences. The new row sums have the same asymptotic behavior as the original row sums:
$$\mathbb{P}\Big\{\sum_j \xi_{nj} \ne \sum_j \xi_{nj}\{j \le \sigma_n\}\Big\} \le \mathbb{P}\{n > \sigma_n\} = \mathbb{P}\Big\{\sum_{j=1}^{n} v_{nj} > 2\Big\} \to 0.$$
The quantities $w_{nj} = \mathbb{P}_{j-1}R(t\xi_{nj}\{j \le \sigma_n\})$ satisfy analogues of the requirements (a), (b), and (c). The argument depends on the inequalities (Problem 1) for real $x$,
$$|R(x)| \le \tfrac12|x|^2; \qquad |R(x) + \tfrac12 x^2| \le \min\{|x|^2, \tfrac16|x|^3\}.$$
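The two inequalities for the remainder term can be spot-checked numerically before they are put to work. The following is only a minimal sketch, not part of the argument: the function names, the grid of test points, and the rounding tolerance are illustrative assumptions.

```python
import cmath


def R(x: float) -> complex:
    """Remainder e^{ix} - 1 - ix of the complex exponential, for real x."""
    return cmath.exp(1j * x) - 1 - 1j * x


def check_remainder_bounds(xs, tol=1e-12):
    """Return True if both remainder inequalities hold at every x in xs.

    The tolerance tol (an assumed value) only absorbs floating-point rounding.
    """
    for x in xs:
        first = abs(R(x)) <= 0.5 * x * x + tol
        second = abs(R(x) + 0.5 * x * x) <= min(x * x, abs(x) ** 3 / 6) + tol
        if not (first and second):
            return False
    return True


# An arbitrary grid of real test points between -20 and 20.
grid = [k / 100 for k in range(-2000, 2001)]
print(check_remainder_bounds(grid))
```

A grid check of this kind proves nothing, of course; it merely guards against misreading the constants $\tfrac12$ and $\tfrac16$.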
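For the analogue of (c): if $\delta > 0$, no $|w_{nj}|$ exceeds
$$\max_j \mathbb{P}_{j-1}\tfrac12 t^2\xi_{nj}^2\{j \le \sigma_n\} \le \tfrac12 t^2\Big[\delta^2 + \sum_j \mathbb{P}_{j-1}\xi_{nj}^2\{|\xi_{nj}| > \delta\}\Big].$$
Set $\delta$ small, then invoke the Lindeberg condition. For the analogue of (b):
$$\sum_{j=1}^{n}|w_{nj}| \le \tfrac12 t^2\sum_{j=1}^{n}\mathbb{P}_{j-1}\xi_{nj}^2\{j \le \sigma_n\}$$
$$= \tfrac12 t^2\sum_{j=1}^{n}\{j \le \sigma_n\}v_{nj} \quad \text{because } \{j \le \sigma_n\} \text{ is } \mathcal{E}_{n,j-1}\text{-measurable}$$
$$\le t^2 \quad \text{by definition of } \sigma_n.$$
For the analogue of (a), fix $\delta > 0$:
$$\Big|\sum_{j=1}^{n} w_{nj} + \tfrac12 t^2\sum_{j=1}^{n} v_{nj}\Big| \le \sum_j \mathbb{P}_{j-1}\big|R(t\xi_{nj}\{j \le \sigma_n\}) + \tfrac12 t^2\xi_{nj}^2\{j \le \sigma_n\} + \tfrac12 t^2\xi_{nj}^2\{j > \sigma_n\}\big|$$
$$\le \sum_j \mathbb{P}_{j-1}\big[\tfrac16|t\xi_{nj}\{j \le \sigma_n\}|^3\{|\xi_{nj}| \le \delta\} + t^2\xi_{nj}^2\{j \le \sigma_n\}\{|\xi_{nj}| > \delta\} + \tfrac12 t^2\xi_{nj}^2\{j > \sigma_n\}\big]$$
$$\le \sum_j \tfrac16\delta|t|^3 v_{nj} + t^2\sum_j \mathbb{P}_{j-1}\xi_{nj}^2\{|\xi_{nj}| > \delta\} + \tfrac12 t^2\{n > \sigma_n\}\sum_j v_{nj}.$$
The first sum can be made small with high probability, by an appropriate choice of $\delta$; the last two terms converge in probability to zero.
We could carry $\sigma_n$ along throughout the rest of the argument, but that would clutter up the notation. Instead, let us assume that (a), (b), and (c) hold. While we are at it, let us drop the $n$ subscript for variables in the $n$th row of the array; all calculations take place within the $n$th row. Our task is to prove that
(2) $\mathbb{P}|T_n - \exp(\tfrac12 t^2)| \to 0$ and $\mathbb{P}Z_n \to 1$,
where $T_k = \prod_{j=1}^{k}(1 - r_j)$ and $Z_k = T_k\exp(itS_k)$.
The path leading from (a), (b), and (c) to the first assertion in (2) is well-worn (Chung 1968, Section 7.1). For complex $\theta$,
$$|\log(1 - \theta) + \theta| \le |\theta|^2 \quad \text{if } |\theta| \le \tfrac12.$$
Apply the inequality to each $r_j$. When $\max_j |r_j| \le \tfrac12$,
$$\Big|\sum_j \log(1 - r_j) + \sum_j r_j\Big| \le \sum_j |r_j|^2 \le t^2\max_j |r_j| \quad \text{by (b)}.$$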
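It follows from (a), (c), and continuity of the exponential function that $\{T_n\}$ converges in probability to $\exp(\tfrac12 t^2)$. Each $|T_n|$ is bounded, because of (b):
(3) $|T_n| \le \prod_j (1 + |r_j|) \le \exp\big(\sum_j |r_j|\big) \le \exp(t^2)$.
Boundedness plus convergence in probability imply convergence in $L^1$.
For the proof of the second assertion in (2), bound the errors that accrue during the calculation of conditional expectations layer-by-layer.
$$\mathbb{P}_{j-1}Z_j = \mathbb{P}_{j-1}[T_j\exp(itS_{j-1} + it\xi_j)]$$
$$= T_j\exp(itS_{j-1})\,\mathbb{P}_{j-1}\exp(it\xi_j)$$
$$= (1 - r_j)Z_{j-1}[1 + \mathbb{P}_{j-1}(it\xi_j) + \mathbb{P}_{j-1}R(t\xi_j)]$$
$$= (1 - r_j)Z_{j-1}(1 + r_j)$$
$$= Z_{j-1} - r_j^2Z_{j-1}.$$
Thus $|\mathbb{P}Z_j - \mathbb{P}Z_{j-1}| \le \mathbb{P}|r_j^2Z_{j-1}| \le \exp(t^2)\,\mathbb{P}|r_j|^2$, because inequality (3) implies $|Z_{j-1}| = |T_{j-1}| \le \exp(t^2)$. Sum over $j$.
$$|\mathbb{P}Z_n - 1| \le \exp(t^2)\,\mathbb{P}\sum_{j=1}^{n}|r_j|^2 \to 0 \quad \text{as } n \to \infty,$$
by dominated convergence: the sum $\sum_j |r_j|^2 \le t^2\max_j |r_j|$ is bounded by $t^4$ because of (b), and it converges in probability to zero because of (c).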
4 Example. A sequence of real random variables $X_0, X_1, \dots$ is called an autoregression of order one if $X_n = \theta_0X_{n-1} + u_n$ for some fixed $\theta_0$. The innovations $\{u_n\}$ are assumed independent and identically distributed with zero mean and finite variance $\sigma^2$. The initial value $X_0$ is independent of the $\{u_n\}$. The least-squares estimator $\theta_n$ minimizes
$$\sum_{j=1}^{n}(X_j - \theta X_{j-1})^2.$$
Solving for $\theta_n$ and standardizing, we get
(5) $n^{1/2}(\theta_n - \theta_0) = \Big[n^{-1/2}\sum_{j=1}^{n} u_jX_{j-1}\Big] \Big/ \Big[n^{-1}\sum_{j=1}^{n} X_{j-1}^2\Big]$.
With the help of Theorem 1 we can prove
$$n^{1/2}(\theta_n - \theta_0) \rightsquigarrow N(0, 1 - \theta_0^2)$$
provided $|\theta_0| < 1$ and $\mathbb{P}X_0^2 < \infty$. This will follow from convergence results for the denominator and numerator in (5):
$$n^{-1}\sum_{j=1}^{n} X_{j-1}^2 \to \sigma^2/(1 - \theta_0^2) \quad \text{in probability},$$
$$\sum_{j=1}^{n} n^{-1/2}u_jX_{j-1} \rightsquigarrow N(0, \sigma^4/(1 - \theta_0^2)).$$
Start with the denominator.
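Square both sides of the defining equation of the autoregression, then sum over $j$.
$$\sum_{j=1}^{n} X_j^2 = \sum_{j=1}^{n} u_j^2 + 2\theta_0\sum_{j=1}^{n} u_jX_{j-1} + \theta_0^2\sum_{j=1}^{n} X_{j-1}^2.$$
Rearrange then divide through by $n$.
(6) $(1 - \theta_0^2)\,n^{-1}\sum_{j=1}^{n} X_{j-1}^2 = n^{-1}\sum_{j=1}^{n} u_j^2 + 2\theta_0 n^{-1}\sum_{j=1}^{n} u_jX_{j-1} + n^{-1}(X_0^2 - X_n^2)$.
On the right-hand side, the first term converges almost surely to $\sigma^2$, by the strong law of large numbers. The third term converges in $L^1$ to zero, because repeated application of the equality
$$\mathbb{P}X_n^2 = \mathbb{P}u_n^2 + 2\theta_0\,\mathbb{P}u_n\,\mathbb{P}X_{n-1} + \theta_0^2\,\mathbb{P}X_{n-1}^2 = \sigma^2 + \theta_0^2\,\mathbb{P}X_{n-1}^2$$
yields
$$\mathbb{P}X_n^2 = \sigma^2(1 + \theta_0^2 + \dots + \theta_0^{2n-2}) + \theta_0^{2n}\,\mathbb{P}X_0^2 \to \sigma^2/(1 - \theta_0^2).$$
The $n^{-1}$ brings the limit down to zero. The middle term on the right-hand side of (6) converges in $L^2$ to zero:
$$n^{-2}\,\mathbb{P}\Big(\sum_{j=1}^{n} u_jX_{j-1}\Big)^2 = n^{-2}\sum_{j=1}^{n}\mathbb{P}u_j^2X_{j-1}^2 \quad \text{independence kills the cross-product terms}$$
$$= n^{-2}\sum_{j=1}^{n}\sigma^2\,\mathbb{P}X_{j-1}^2$$
$$= O(n^{-1}) \quad \text{because } \{\mathbb{P}X_j^2\} \text{ is convergent}.$$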
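So much for the denominator.
Write $\mathcal{E}_n$ for the $\sigma$-field generated by $\{X_0, u_1, \dots, u_n\}$. Abbreviate $\mathbb{P}(\cdot \mid \mathcal{E}_j)$ to $\mathbb{P}_j(\cdot)$. The variables $\{u_jX_{j-1}\}$ are martingale differences for $\{\mathcal{E}_j\}$. Apply Theorem 1 to the sum in the numerator of (5). For condition (i):
$$v_{nj} = \mathbb{P}_{j-1}(n^{-1}u_j^2X_{j-1}^2) = n^{-1}\sigma^2X_{j-1}^2,$$
$$\sum_{j=1}^{n} v_{nj} = \sigma^2n^{-1}\sum_{j=1}^{n} X_{j-1}^2 \to \sigma^4/(1 - \theta_0^2) \quad \text{in probability}.$$
The Lindeberg condition demands a more delicate argument if higher-moment constraints on the innovations are to be avoided.
$$n^{-1}\sum_{j=1}^{n}\mathbb{P}_{j-1}u_j^2X_{j-1}^2\{|u_jX_{j-1}| > \varepsilon n^{1/2}\}$$
$$\le n^{-1}\sum_{j=1}^{n}\mathbb{P}_{j-1}u_j^2X_{j-1}^2\big[\{u_j^2 > \varepsilon n^{1/2}\} + \{X_{j-1}^2 > \varepsilon n^{1/2}\}\big]$$
$$= n^{-1}\sum_{j=1}^{n}X_{j-1}^2\,\mathbb{P}u_1^2\{u_1^2 > \varepsilon n^{1/2}\} + n^{-1}\sum_{j=1}^{n}\sigma^2X_{j-1}^2\{X_{j-1}^2 > \varepsilon n^{1/2}\}.$$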
The first sum converges to zero in probability, because $u_1^2$ is integrable. The second sum converges in $L^1$ to zero because the sequence $\{X_n^2\}$ is uniformly integrable (Problem 2). □
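The limit in Example 4 lends itself to a quick simulation check. What follows is only a sketch under assumptions that are not part of the example: the innovations are taken to be standard normal, the initial value is set to $X_0 = 0$, and the sample size, number of replications, and seed are arbitrary choices.

```python
import math
import random


def simulate_standardized_estimate(theta0, n, rng):
    """Simulate X_1, ..., X_n and return n^{1/2} (theta_n - theta0).

    Assumptions for this sketch: N(0, 1) innovations and X_0 = 0.
    """
    x_prev = 0.0
    cross = 0.0    # running sum of X_j * X_{j-1}
    squares = 0.0  # running sum of X_{j-1}^2
    for _ in range(n):
        x = theta0 * x_prev + rng.gauss(0.0, 1.0)
        cross += x * x_prev
        squares += x_prev * x_prev
        x_prev = x
    theta_n = cross / squares  # least-squares estimator
    return math.sqrt(n) * (theta_n - theta0)


def sample_variance(values):
    """Unbiased sample variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)
theta0 = 0.5
draws = [simulate_standardized_estimate(theta0, 500, rng) for _ in range(1000)]
# Compare the simulated variance with the limiting variance 1 - theta0^2.
print(sample_variance(draws), 1 - theta0 ** 2)
```

The simulated variance should land near $1 - \theta_0^2$ whatever the value of $\sigma^2$, because the innovation variance cancels between the numerator and denominator of (5).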
VIII.2. Continuous Time Martingales
A stochastic process $\{Z(t) : 0 \le t < \infty\}$ is said to be a martingale with respect to an increasing family of $\sigma$-fields $\{\mathcal{E}_t : 0 \le t < \infty\}$ if $Z(t)$ is adapted to the $\sigma$-fields (that is, $Z(t)$ is $\mathcal{E}_t$-measurable) and $\mathbb{P}(Z(s) \mid \mathcal{E}_t) = Z(t)$ whenever $s > t$. After some fiddling around with sets of measure zero it can usually be arranged that such a process has cadlag sample paths (Dellacherie and Meyer 1982, Section VI.1), in which case it may be studied as a random element of $D[0, \infty)$. Call $Z$ an $L^2$-martingale if it has cadlag sample paths and $\mathbb{P}Z(t)^2 < \infty$ for each $t$.
The behavior of an $L^2$-martingale is largely determined by the conditional variances of its increments. The conditional expectation of $[Z(t + \delta) - Z(t)]^2$ given $\mathcal{E}_t$ plays a role similar to that of the conditional variance $v_{nj}$ in Section 1. The most economical way to explain this uses some deeper results from the Strasbourg theory of stochastic processes. We could avoid the appeal to the deeper theory by building its special consequences into the martingale calculations for each particular application. That would always work for martingales that evolve by discrete jumps; the calculations would be similar to those in Section 1. The theory would be more self-contained, but it would disguise the unifying concept of the conditional variance process.
According to the Doob-Meyer decomposition (Theorem VII.12 of Dellacherie and Meyer (1982), applied to the supermartingale $-Z^2$), for each $L^2$-martingale $Z$, the process $Z^2$ has a unique representation as a sum $V + M$ of a martingale $M$ and an increasing, predictable, conditional variance process $V$ with $V(0) = 0$. Both $M$ and $V$ have cadlag sample paths. (Strictly speaking, for this decomposition we need the $\sigma$-fields $\{\mathcal{E}_t\}$ to satisfy the "usual conditions": $\mathcal{E}_0$ should contain all $\mathbb{P}$-negligible sets and each $\mathcal{E}_t$ should equal the intersection of the $\sigma$-fields $\mathcal{E}_s$ for $s > t$.)
The adjective "predictable" has the technical meaning that $V(\omega, t)$ is measurable with respect to the $\sigma$-field on $\Omega \otimes [0, \infty)$ generated by the class of all adapted, left-continuous processes. So $V$ behaves something like a process with left-continuous sample paths; its paths can be predicted a tiny instant into the future. We will need the predictability property only in Lemma 11.
If the martingale $Z$ changes only by jumps $\xi_1, \xi_2, \dots$ occurring at fixed times $t_1 < t_2 < \cdots$, and if $\mathcal{E}_t = \mathcal{E}_{t_k}$ for $t_k \le t < t_{k+1}$, then $V$ is just a sum of conditional variances:
(7) $V(t) = \mathbb{P}(\xi_1^2 \mid \mathcal{E}_{t_0}) + \dots + \mathbb{P}(\xi_k^2 \mid \mathcal{E}_{t_{k-1}})$
whenever $t_k \le t < t_{k+1}$. You can check directly that $Z^2 - V$ is a martingale and that there exists a sequence of left-continuous, adapted processes converging pointwise on $\Omega \otimes [0, \infty)$ to $V$. The value of $V$ at $t_k$ corresponds to what we would have written as $v_1 + \dots + v_k$ in Section 1. You might take this as your guiding example for the rest of the section if you wish to avoid all appeals to the Strasbourg theory.
The process $V$ carries information about the conditional variances of the increments of $Z$. If $s > t$,
(8) $\mathbb{P}([Z(s) - Z(t)]^2 \mid \mathcal{E}_t) = \mathbb{P}(Z(s)^2 \mid \mathcal{E}_t) - Z(t)^2 - 2Z(t)\,\mathbb{P}(Z(s) - Z(t) \mid \mathcal{E}_t)$
$$= \mathbb{P}(V(s) \mid \mathcal{E}_t) + \mathbb{P}(M(s) \mid \mathcal{E}_t) - V(t) - M(t)$$
$$= \mathbb{P}(V(s) - V(t) \mid \mathcal{E}_t).$$
For $s$ very close to $t$, the predictability of $V$ makes $V(s) - V(t)$ almost $\mathcal{E}_t$-measurable; the last conditional expectation almost equals $V(s) - V(t)$, in a sense that will receive a more rigorous meaning later.
By means of a simple Tchebychev inequality we get from $V$ a bound on the size of the increments of $Z$ in precisely the form required by the stopping time argument of Lemma V.7. The maximal inequality provided by that lemma lies at the heart of any proof for convergence in distribution in spaces of cadlag functions.
9 Lemma. Let $\{Z(t) : 0 \le t \le b\}$ be an $L^2$-martingale with conditional variance process $V$. If, for every $t$,
$$\mathbb{P}(V(b) - V(t) \mid \mathcal{E}_t) \le \delta^2/12 \quad \text{almost surely},$$
then
$$\mathbb{P}\Big\{\sup_{t \le b}|Z(t) - Z(0)| > \delta\Big\} \le 3\,\mathbb{P}\{|Z(b) - Z(0)| > \tfrac12\delta\}.$$
PROOF. With no loss of generality we may assume $Z(0) = 0$. Write $\mathbb{P}_t(\cdot)$ for expectation conditional on $\mathcal{E}_t$. Lemma V.7 invites us to check that
$$\mathbb{P}_t\{|Z(b) - Z(t)| \le \tfrac12|Z(t)|\} \ge \tfrac13 \quad \text{on } \{|Z(t)| > \delta\}.$$
We shall do this by bounding the conditional probability from below by
(10) $\tfrac23 - 4|Z(t)|^{-2}\,\mathbb{P}_t|Z(b) - Z(t)|^2$,
which is greater than $\tfrac23 - 4\delta^{-2}(\delta^2/12)$ on the set $\{|Z(t)| > \delta\}$. Start from the inequality
$$|Z(b)| \le |Z(b) - Z(t)| + |Z(t)| \le \tfrac32|Z(t)|\{|Z(b) - Z(t)| \le \tfrac12|Z(t)|\} + 3|Z(b) - Z(t)|\{|Z(b) - Z(t)| > \tfrac12|Z(t)|\}.$$
Keeping in mind that the absolute value of a martingale is a submartingale, take conditional expectations given $\mathcal{E}_t$.
$$|Z(t)| \le \mathbb{P}_t|Z(b)|$$
$$\le \tfrac32|Z(t)|\,\mathbb{P}_t\{|Z(b) - Z(t)| \le \tfrac12|Z(t)|\} + 3\,\mathbb{P}_t|Z(b) - Z(t)|\{|Z(b) - Z(t)| > \tfrac12|Z(t)|\}$$
$$\le \tfrac32|Z(t)|\,\mathbb{P}_t\{|Z(b) - Z(t)| \le \tfrac12|Z(t)|\} + 6|Z(t)|^{-1}\,\mathbb{P}_t|Z(b) - Z(t)|^2.$$
On {|Z(t)| > 0, and define a stopping time
$$\tau_n = \inf\{t > 0 : |V_n(t) - H(t)| > \varepsilon\}.$$
The hypothesis implies that $\tau_n \to \infty$ in probability: if $0 = s_0 < s_1 < \dots < s_k$ are chosen so that $H(s_i) - H(s_{i-1}) < \tfrac12\varepsilon$ for each $i$, then (by monotonicity of $V_n$ and $H$)
$$\mathbb{P}\{\tau_n < s_k\} \le \mathbb{P}\{|V_n(s_i) - H(s_i)| > \tfrac12\varepsilon \text{ for some } i\} \to 0.$$
From the definition of $\tau_n$,
(12) $|V_n(t) - H(t)| \le \varepsilon$ for $t < \tau_n$.
Only the time $t = \tau_n$ might cause trouble; $V_n$ might have a jump there. That is where predictability comes to the rescue. There exists an increasing sequence of stopping times $\{\tau_{nj}\}$ with $\tau_{nj} < \tau_n$ and $\tau_{nj} \uparrow \tau_n$ almost surely (Dellacherie and Meyer 1978, IV.69, IV.77), because predictability allows us to peer just a little distance into the future for $V_n$. Replace $\tau_n$ by $\tau_{n,j(n)}$ for some $j(n)$ such that $\tau_{n,j(n)} \to \infty$ in probability. This lops off the troublesome point $t = \tau_n$ in (12); the inequality holds for every $t$ in the range $[0, \tau_{n,j(n)}]$.
If the argument works for fixed $\varepsilon$ it must also work for a sequence $\{\varepsilon_n\}$ decreasing slowly enough. Write $\sigma_n(\varepsilon)$ for the stopping time $\tau_{n,j(n)}$ just identified. There exist integers $n(1) < n(2) < \cdots$ such that
$$\mathbb{P}\{\sigma_n(k^{-1}) \le k\} < k^{-1} \quad \text{for } n \ge n(k).$$
□
Stopping time tricks like the one used in this proof pop up all over the place in martingale limit theory. Sample path properties that hold with probability tending to one can often be made to hold with probability one by enforcing an appropriate stopping rule.
Something must be added to the convergence of conditional variance processes. Otherwise we could specialize the result to processes with independent increments, obtaining as a by-product the Central Limit Theorem for sums of independent random variables without having to impose anything like a Lindeberg condition. The extra something takes the form of a constraint on the maximum jump in the sample path. Define the jump functional $J_T$ on $D[0, \infty)$ by:
$$J_T(x) = \max\{|x(s) - x(s-)| : 0 \le s \le T\}.$$
It is both continuous and measurable (Problem 4). If $X_n \rightsquigarrow B_H$ then certainly $J_T(X_n) \rightsquigarrow J_T(B_H) = 0$; convergence in probability of $\{J_T(X_n)\}$ to zero is a necessary condition. The theorem assumes just a little bit more to get a sufficient condition.
13 Theorem. Let $\{X_n\}$ be a sequence of $L^2$-martingales with conditional variance processes $\{V_n\}$. Let $H$ be a continuous, increasing function on $[0, \infty)$ with $H(0) = 0$. Sufficient conditions for convergence in distribution of $\{X_n\}$, as random elements of $D[0, \infty)$, to the stretched-out brownian motion $B_H$ are:
(i) $X_n(0) \to 0$ in probability;
(ii) $V_n(t) \to H(t)$ in probability, for each fixed $t$;
(iii) $\mathbb{P}J_k(X_n)^2 \to 0$ for each fixed $k$, as $n \to \infty$.
PROOF. By virtue of Theorem V.23, we have only to prove convergence in distribution for the truncations of the processes to each compact interval $[0, T]$. The argument for the typical case $T = 1$ will suffice. According to Theorem V.3 we shall need to establish
(a) Fidi Convergence: the fidis of the $\{X_n\}$ converge to the fidis of $B_H$;
(b) The Grid Condition: to each $\varepsilon > 0$ and $\delta > 0$ there corresponds a grid $0 = t_0 < t_1 < \dots < t_m = 1$ such that
$$\limsup_n \mathbb{P}\Big\{\max_j \sup_{[t_j, t_{j+1})}|X_n(t) - X_n(t_j)| > \delta\Big\} < \varepsilon.$$
Theorem 1 will take care of (a); Lemma 9, applied to the $X_n$ processes stopped appropriately, will take care of (b).
By Lemma 11 there exist stopping times $\{\sigma_n\}$ and constants $\{\varepsilon_n\}$, with $\sigma_n \to \infty$ in probability and $\varepsilon_n \downarrow 0$, such that $|V_n(t \wedge \sigma_n) - H(t \wedge \sigma_n)| \le \varepsilon_n$ almost surely. We may assume that $X_n$ has at most one jump of size greater than $\varepsilon_n$ up to time $\sigma_n$. Formally, we would replace $\{\varepsilon_n\}$ by a more slowly decreasing sequence such that
$$\mathbb{P}\{J_{k(n)}(X_n) > \varepsilon_n\} \to 0 \quad \text{as } n \to \infty$$
for some slowly diverging sequence $\{k(n)\}$. Such sequences exist because of (iii). Then we would replace $\sigma_n$ by
$$\inf\{t \le \sigma_n : |X_n(t) - X_n(t-)| > \varepsilon_n\}.$$
These modifications would not disturb the other properties of $\{\sigma_n\}$ and $\{\varepsilon_n\}$.
The stopped martingale $X_n^*(t) = X_n(t \wedge \sigma_n)$ has conditional variance process $V_n^*(\cdot) = V_n(\cdot \wedge \sigma_n)$. It enjoys (i), (ii), and (iii) in strengthened form:
(i)' $X_n^*(0) = X_n(0) \to 0$ in probability;
(ii)' $|V_n^*(t) - H(t \wedge \sigma_n)| \le \varepsilon_n$ for all $t$;
(iii)' $X_n^*$ has at most one jump of size $> \varepsilon_n$ and $\mathbb{P}J_1(X_n^*)^2 \to 0$.
These will make it easy to prove that $X_n^* \rightsquigarrow B_H$. The required convergence of the truncation of $X_n$ to $[0, 1]$ will then follow because
$$\mathbb{P}\{X_n(t) = X_n^*(t) \text{ for } 0 \le t \le 1\} \ge \mathbb{P}\{\sigma_n \ge 1\} \to 1.$$
Simplify the notation by dropping the star.
Fidi Convergence
Let us prove only that $X_n(1) \rightsquigarrow N(0, H(1))$. Problem 6 extends the result to higher-dimensional fidis. Because of (i)', we can do this by breaking $X_n(1) - X_n(0)$ into a sum of martingale differences that satisfy the conditions of Theorem 1.
Focus for the moment on a fixed $X_n$ by setting $Z(t) = X_n(t) - X_n(0)$ and writing $V$ instead of $V_n$ for its conditional variance process. Break $Z(1)$ into
a sum of increments $Z(\tau_j) - Z(\tau_{j-1})$, with $0 = \tau_0 \le \tau_1 \le \cdots$ a sequence of stopping times defined inductively by
$$\tau_{j+1} = \inf\{t > \tau_j : |Z(t) - Z(\tau_j)| \ge \varepsilon_n\} \wedge 1 \wedge (\tau_j + \delta_n).$$
Choose $\{\delta_n\}$ so that $|H(t + \delta_n) - H(t)| \le \varepsilon_n$ for every $t$ in $[0, 1]$. Denote expectations given $\mathcal{E}_{\tau_j}$ by $\mathbb{P}_j(\cdot)$; write $\Delta_j f$ for the increment $f(\tau_j) - f(\tau_{j-1})$ of any function $f$ between successive stopping times. Then
$$Z(1) = \sum_{j=1}^{\infty}\Delta_j Z.$$
Along any particular sample path of $Z$ all except finitely many of these increments equal zero; there is no problem with convergence of the sum.
Check the conditions of Theorem 1 for the martingale differences $\{\Delta_j Z\}$. Strictly speaking the theorem applies only to triangular arrays with a finite number of variables in each row, but there may be infinitely many $\Delta_j Z$ increments. We would need to apply the theorem to a finite sum of $\Delta_j Z$, for $1 \le j \le j(n)$, with $j(n)$ chosen so that both
$$\sum_{j > j(n)}\Delta_j Z \quad \text{and} \quad \sum_{j > j(n)}\mathbb{P}_{j-1}(\Delta_j Z)^2$$
converge to zero in probability.
By property (8) of the conditional variance process,
$$\sum_j \mathbb{P}_{j-1}(\Delta_j Z)^2 = \sum_j \mathbb{P}_{j-1}(\Delta_j V).$$
Informally speaking, predictability of $V$ almost makes $\Delta_j V$ measurable with respect to $\mathcal{E}_{\tau_{j-1}}$; the last sum almost equals $\sum_j \Delta_j V$, which we know will converge in probability to $H(1)$ as $n \to \infty$. Formally, $\sum_j(\Delta_j V - \mathbb{P}_{j-1}\Delta_j V)$ is a sum of martingale differences with zero mean and variance less than
$$\sum_j \mathbb{P}(\Delta_j V)^2 \quad \text{because the cross-product terms vanish}$$
$$\le \sum_j \mathbb{P}[(2\varepsilon_n + \Delta_j H(\cdot \wedge \sigma_n))\Delta_j V] \quad \text{by (ii)'}$$
$$\le 3\varepsilon_n\,\mathbb{P}\Big(\sum_j \Delta_j V\Big) \quad \text{by the choice of } \{\delta_n\}$$
$$\le 3\varepsilon_n(\varepsilon_n + H(1)) \to 0 \quad \text{as } n \to \infty.$$
That takes care of condition (i) of Theorem 1.
For the Lindeberg condition it suffices to check the stronger $L^1$ convergence.
$$\mathbb{P}\sum_j \mathbb{P}_{j-1}(\Delta_j Z)^2\{|\Delta_j Z| > \varepsilon\} = \mathbb{P}\sum_j (\Delta_j Z)^2\{|\Delta_j Z| > \varepsilon\}$$
The inequality follows, by the definition of $\tau_j$, from:
$$|Z(\tau_j) - Z(\tau_{j-1})| \le |Z(\tau_j) - Z(\tau_j-)| + |Z(\tau_j-) - Z(\tau_{j-1})| \le |\text{jump at } \tau_j| + \varepsilon_n.$$
At most one increment $\Delta_j Z$ can exceed $2\varepsilon_n$ in absolute value, and that happens only if $Z$ has its one jump greater than $\varepsilon_n$ at $\tau_j$. An appeal to the second part of (iii)' completes the proof of the fidi convergence.
The Grid Condition
Choose $0 = t_0 < \dots < t_m = 1$ so that $H(t_{j+1}) - H(t_j) \le \delta^2/24$ for each $j$. For $n$ large enough to make $2\varepsilon_n \le \delta^2/24$, the strengthened condition (ii)' implies
(14) $\mathbb{P}(V_n(t_{j+1}) - V_n(t) \mid \mathcal{E}_t) \le \delta^2/12$ almost surely
if $t_j \le t < t_{j+1}$. Invoke Lemma 9.
(15) $\limsup_n \mathbb{P}\Big\{\max_j \sup_{[t_j, t_{j+1})}|X_n(t) - X_n(t_j)| > \delta\Big\}$
$$\le \sum_{j=0}^{m-1}\limsup_n 3\,\mathbb{P}\{|X_n(t_{j+1}) - X_n(t_j)| \ge \tfrac12\delta\}$$
$$\le \sum_{j=0}^{m-1} 3\,\mathbb{P}\{|B_H(t_{j+1}) - B_H(t_j)| \ge \tfrac12\delta\} \quad \text{by fidi convergence}$$
$$= \sum_{j=0}^{m-1} 3\,\mathbb{P}\{|N(0, H(t_{j+1}) - H(t_j))| \ge \tfrac12\delta\}$$
$$\le 48\delta^{-4}\sum_{j=0}^{m-1}[H(t_{j+1}) - H(t_j)]^2\,\mathbb{P}|N(0, 1)|^4$$
$$\le 48\delta^{-4}\max_j[H(t_{j+1}) - H(t_j)]\,H(1)\,\mathbb{P}|N(0, 1)|^4,$$
which is less than $\varepsilon$ if the grid points are close enough together. □
VIII.3. Estimation from Censored Data
The empirical distribution function based on an independent sample $\xi_1, \dots, \xi_n$ from $P$ is a natural estimator for the distribution function of $P$. How should one modify it when the observations are subject to censorship? One possibility, the Kaplan-Meier estimator, can be analyzed by the martingale methods from the previous section. What follows is a heuristic account. The notes to the chapter will point you towards more rigorous treatments, which draw on results from the theory of stochastic integration.
Consider the simplest model for censorship. Independent variables $c_1, \dots, c_n$, drawn from a censoring distribution $C$, cut short the natural lifetime $\xi_i$; we observe the value $y_i = \xi_i \wedge c_i$. Each $y_i$ has distribution $Q$, where $Q(s, \infty) = P(s, \infty)C(s, \infty)$. In addition, we can tell whether the $y_i$ represents natural death, $\xi_i < c_i$, or a case of death by censorship, $\xi_i \ge c_i$.
[Figure: the observations $y_1, y_2, \dots, y_9$ marked in increasing order along a line, with a point at $+\infty$ on the right.]
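The Kaplan-Meier measure for a configuration like the one pictured comes from passing the mass of each censored point, working from the left, in equal shares to the points on its right. A minimal sketch of that rule, in exact arithmetic, might look as follows. The function name and the censoring pattern are illustrative assumptions: the sketch takes $y_2$, $y_3$, and $y_9$ as censored and the other six points as natural deaths, and it assumes the observations are distinct and already sorted.

```python
from fractions import Fraction


def redistribute_to_the_right(censored):
    """Kaplan-Meier masses for observations sorted in increasing order.

    censored[i] is True when the i-th smallest observation was censored.
    Returns (masses, mass_at_infinity); ties are assumed not to occur.
    """
    n = len(censored)
    masses = [Fraction(1, n)] * n  # start from the empirical measure
    at_infinity = Fraction(0)
    for i, is_censored in enumerate(censored):
        if not is_censored:
            continue  # natural deaths keep whatever mass they hold
        successors = n - i - 1
        if successors == 0:
            at_infinity += masses[i]  # nobody left to inherit the mass
        else:
            share = masses[i] / successors
            for k in range(i + 1, n):
                masses[k] += share
        masses[i] = Fraction(0)
    return masses, at_infinity


# Hypothetical pattern for nine points: y2, y3, and y9 censored.
pattern = [False, True, True, False, False, False, False, False, True]
masses, at_infinity = redistribute_to_the_right(pattern)
print(masses, at_infinity)
```

Exact fractions are used so that the shares can be compared directly with a hand calculation.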
To construct the Kaplan-Meier empirical measure $K_n$, start from the usual empirical measure $Q_n$ for the observations $y_1, \dots, y_n$. Working from the left-hand end, distribute successively all the mass from each censored point equally amongst all the $\{y_i\}$ lying to its right. In the situation pictured, $y_1$ keeps its mass $\tfrac19$; then $y_2$ surrenders mass $\tfrac17 \times \tfrac19$ to each of its seven successors $y_3, \dots, y_9$; then $y_3$ surrenders $\tfrac16 \times (\tfrac19 + \tfrac17 \times \tfrac19)$ to each of $y_4, \dots, y_9$; and so on. At the last point, $y_9$, there are no more $\{y_i\}$ to inherit its mass, so dump it all down on a fictitious supersurvivor out at $+\infty$. If $y_9$ had not been censored, it would have kept all its inherited mass. In any case, $K_n$ will distribute its total mass of one amongst the naturally deceased $\{y_i\}$, with maybe a little bit on $+\infty$. Notice that $K_n[0, t] \le Q_n[0, t]$ for each $t$.
Make the analysis as simple as possible by assuming that both $P$ and $C$ are continuous distributions living on $[0, \infty)$, with neither concentrated on a finite interval. Write $\mathcal{E}_t$ for the $\sigma$-field corresponding to everything we learn up to time $t$ about which $\xi_i$ have died or been censored. Write $\mathbb{P}_t(\cdot)$ for $\mathbb{P}(\cdot \mid \mathcal{E}_t)$.
Calculate (to first-order terms) the conditional distribution of the increment $\Delta K_n = K_n(t, t + h]$ given $\mathcal{E}_t$, for tiny positive $h$. From $\mathcal{E}_t$ we learn the value $Q_n[0, t]$. Define $m$ to be $nQ_n[0, t]$. The remaining $n - m$ observations in $(t, \infty)$ are generated by choosing each $\xi_i$ from the conditional distribution $P(\cdot \mid \xi > t)$, then censoring it by a $c_i$ chosen from $C(\cdot \mid c > t)$. Write $\Delta P$ for $P(t, t + h]$ and $\Delta K_n$ for $K_n(t, t + h]$. To first order, each of the $n - m$ observations has conditional probability $\Delta P/P(t, \infty)$ of registering a natural death during the interval $(t, t + h]$. A single such observation would receive a fraction $(n - m)^{-1}$ of the $K_n$ measure for $(t, \infty]$. Thus, to first order,
$$\mathbb{P}_t\Delta K_n = (n - m)^{-1}K_n(t, \infty]\,(n - m)\,\Delta P/P(t, \infty).$$
This suggests that
$$\mathbb{P}_t\Delta K_n/K_n(t, \infty] = \Delta P/P(t, \infty) + \text{lower order terms},$$
which would lead us to believe that $\log K_n(t, \infty] - \log P(t, \infty)$ is a continuous-time martingale for each $n$.
An attempt to add rigor to the first-order analysis would reveal a few illegal divisions by zero. There is a positive probability that either $n - m$ or $K_n(t, \infty]$ could equal zero. A suitable stopping time can save us from
embarrassment. Let $\{\alpha_n\}$ be a sequence of positive numbers converging to zero. Make sure that $n\alpha_n$ is an integer. Define $\rho_n$ as the first $t$ for which $Q_n(t, \infty)$ equals $\alpha_n$. Then certainly
$$K_n(t \wedge \rho_n, \infty] \ge Q_n(t \wedge \rho_n, \infty) \ge \alpha_n > 0 \quad \text{for all } t.$$
The first-order analysis could be made rigorous enough to show that the process
$$X_n(t) = \log K_n(t \wedge \rho_n, \infty] - \log P(t \wedge \rho_n, \infty)$$
is a continuous-time martingale for each $n$.
On the set $\{\rho_n > t\}$, the increment in $V_n$, the conditional variance process of $X_n$, would be
$$\mathbb{P}_t(\Delta X_n)^2 = (n - m)^{-2}(n - m)\,\Delta P/P(t, \infty) + \text{smaller order terms}.$$
On $\{\rho_n \le t\}$ the increment would be zero. Recover $V_n$ as a limit of sums of conditional variances (see Problem 5).
$$V_n(t) = n^{-1}\int\{0 \le s \le t \wedge \rho_n\}\,Q_n(s, \infty)^{-1}P(s, \infty)^{-1}\,P(ds).$$
By definition of $\rho_n$,
$$V_n(t) \le (n\alpha_n)^{-1}\int_0^t P(s, \infty)^{-1}\,P(ds).$$
Thus $\{V_n\}$ converges in probability to zero uniformly over compact sets. Apply Lemma 9. For $b$ fixed and $n$ large enough,
IP{~~~ IXn(t) I > tb} :s:; 12O
For other distributions we have to work harder. We must maneuver the moment generating function of $Y_i$ into a tractable form that gives us some clue about which value of $t$ to choose.
2 Hoeffding's Inequality. Let $Y_1, Y_2, \dots, Y_n$ be independent random variables with zero means and bounded ranges: $a_i \le Y_i \le b_i$. For each $\eta > 0$,
$$\mathbb{P}\{Y_1 + \dots + Y_n \ge \eta\} \le \exp\Big[-2\eta^2\Big/\sum_{i=1}^{n}(b_i - a_i)^2\Big].$$
B. Exponential Inequalities
PROOF. Use convexity to bound the moment generating function of $Y_i$. Drop the subscript $i$ temporarily.
\[ e^{tY} \le e^{ta}(b - Y)/(b - a) + e^{tb}(Y - a)/(b - a). \]
Take expectations, remembering that $\mathbb{P}Y = 0$.
\[ \mathbb{P}e^{tY} \le e^{ta}\, b/(b - a) - e^{tb}\, a/(b - a). \]
Set $\alpha = 1 - \beta = -a/(b - a)$ and $u = t(b - a)$. Note: $\alpha > 0$ because $a < 0 < b$.
\[ \log \mathbb{P}e^{tY} \le \log(\beta e^{-\alpha u} + \alpha e^{\beta u}) = -\alpha u + \log(\beta + \alpha e^{u}). \]
Write $L(u)$ for this function of $u$. Differentiate twice.
\[ L'(u) = -\alpha + \alpha e^{u}/(\beta + \alpha e^{u}) = -\alpha + \alpha/(\alpha + \beta e^{-u}), \]
\[ L''(u) = \alpha\beta e^{-u}/(\alpha + \beta e^{-u})^2 = [\alpha/(\alpha + \beta e^{-u})][\beta e^{-u}/(\alpha + \beta e^{-u})] \le \tfrac14. \]
The inequality is a special case of: $x(1 - x) \le \tfrac14$ for $0 \le x \le 1$. Expand by Taylor's theorem.
\[ L(u) = L(0) + uL'(0) + \tfrac12 u^2 L''(u^*) \le \tfrac12 u^2 \cdot \tfrac14 = \tfrac18 t^2 (b - a)^2. \]
Apply the inequality to each $Y_i$, then use (1):
\[ \mathbb{P}\{Y_1 + \cdots + Y_n \ge \eta\} \le \exp\Big[-\eta t + \tfrac18 t^2 \sum_{i=1}^n (b_i - a_i)^2\Big]. \]
Set $t = 4\eta / \sum_i (b_i - a_i)^2$ to minimize the quadratic. □
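The bound can be compared with an exact tail probability in a case where the tail is computable. The sketch below is only an illustration, not part of the original argument: it assumes sign variables $Y_i = \pm 1$ with probability $\tfrac12$ each (so $a_i = -1$, $b_i = 1$), and the function names `hoeffding_bound` and `rademacher_tail` are hypothetical.

```python
import math

def hoeffding_bound(eta, ranges):
    """One-sided Hoeffding bound exp(-2 eta^2 / sum (b_i - a_i)^2)."""
    total = sum((b - a) ** 2 for a, b in ranges)
    return math.exp(-2.0 * eta ** 2 / total)

def rademacher_tail(n, eta):
    """Exact P{Y_1 + ... + Y_n >= eta} for independent signs Y_i = +1 or -1."""
    # The sum equals 2k - n, where k ~ Binomial(n, 1/2) counts the +1 values.
    favourable = sum(math.comb(n, k) for k in range(n + 1) if 2 * k - n >= eta)
    return favourable / 2 ** n

n = 20  # illustrative sample size
for eta in (2, 6, 10):
    exact = rademacher_tail(n, eta)
    bound = hoeffding_bound(eta, [(-1.0, 1.0)] * n)
    print(f"eta={eta}: exact tail={exact:.5f}, Hoeffding bound={bound:.5f}")
```

For these sign variables the bound reduces to $\exp(-\eta^2/2n)$, so the comparison shows how much is lost by using only the ranges of the $Y_i$.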
3 Corollary. Apply the same argument to $\{-Y_i\}$ then combine with the inequality for $\{Y_i\}$ to get a two-sided bound under the same conditions:
\[ \mathbb{P}\{|Y_1 + \cdots + Y_n| \ge \eta\} \le 2\exp\Big[-2\eta^2 \Big/ \sum_{i=1}^n (b_i - a_i)^2\Big]. \] □
In one special case the proof can be shortened slightly. If $Y_i$ takes only two values, $\pm a_i$, each with probability $\tfrac12$, then
\[ \mathbb{P}e^{tY_i} = \tfrac12[e^{ta_i} + e^{-ta_i}] = \sum_{k=0}^{\infty} (a_i t)^{2k}/(2k)! \le \exp(\tfrac12 a_i^2 t^2). \]
The rest of the proof is the same as before. We only need the Hoeffding Inequality for this special case.

4 Bennett's Inequality. Let $Y_1, \ldots, Y_n$ be independent random variables with zero means and bounded ranges: $|Y_i| \le M$. Write $\sigma_i^2$ for the variance of $Y_i$. Suppose $V \ge \sigma_1^2 + \cdots + \sigma_n^2$. Then for each $\eta > 0$,
\[ \mathbb{P}\{|Y_1 + \cdots + Y_n| > \eta\} \le 2\exp[-\tfrac12 \eta^2 V^{-1} B(M\eta V^{-1})], \]
where $B(\lambda) = 2\lambda^{-2}[(1 + \lambda)\log(1 + \lambda) - \lambda]$ for $\lambda > 0$.
PROOF. It suffices to establish the corresponding one-sided inequality. The two-sided inequality will then follow by combining it with the companion inequality for $\{-Y_i\}$. Bound the moment generating function of $Y_i$. Drop the subscript $i$ temporarily.
\[ \mathbb{P}e^{tY} = 1 + t\mathbb{P}Y + \sum_{k=2}^{\infty} (t^k/k!)\,\mathbb{P}(Y^2 Y^{k-2}) \]
\[ \le 1 + \sum_{k=2}^{\infty} (t^k/k!)\,\sigma^2 M^{k-2} \]
\[ = 1 + \sigma^2 g(t) \quad \text{where } g(t) = (e^{tM} - 1 - tM)/M^2 \]
\[ \le \exp[\sigma^2 g(t)]. \]
From (1) deduce $\mathbb{P}\{S \ge \eta\} \le \exp[Vg(t) - \eta t]$. Differentiate to find the minimizing value, $t = M^{-1}\log(1 + M\eta V^{-1})$, which is positive. □
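The last step of the proof can be checked numerically: substituting the minimizing $t$ into $Vg(t) - \eta t$ should reproduce the exponent $-\tfrac12\eta^2 V^{-1}B(M\eta V^{-1})$ from the statement of the inequality. The following is a minimal sketch under illustrative values of $\eta$, $V$, $M$ chosen here, with hypothetical function names; it is a sanity check, not a substitute for the calculus.

```python
import math

def g(t, M):
    """g(t) = (e^{tM} - 1 - tM) / M^2, as in the proof."""
    return (math.exp(t * M) - 1.0 - t * M) / M ** 2

def B(lam):
    """B(lambda) = 2 lambda^{-2} [(1 + lambda) log(1 + lambda) - lambda]."""
    return 2.0 * ((1.0 + lam) * math.log1p(lam) - lam) / lam ** 2

def exponent_at(t, eta, V, M):
    """Exponent V g(t) - eta t, before minimizing over t."""
    return V * g(t, M) - eta * t

def bennett_exponent(eta, V, M):
    """Exponent -(1/2) eta^2 V^{-1} B(M eta / V) of the one-sided bound."""
    return -0.5 * eta ** 2 / V * B(M * eta / V)

eta, V, M = 3.0, 4.0, 1.5                 # illustrative values only
t_star = math.log1p(M * eta / V) / M      # minimizing value from the proof
print(exponent_at(t_star, eta, V, M), bennett_exponent(eta, V, M))
```

The two printed numbers should agree up to rounding error, and moving `t` slightly away from `t_star` should only increase `exponent_at`.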
The function $B(\cdot)$ is well-behaved: continuous, decreasing, and $B(0+) = 1$. When $\lambda$ is large, $B(\lambda) \approx 2\lambda^{-1}\log\lambda$ in the sense that the ratio tends to one as $\lambda \to \infty$; the Bennett Inequality does not give a true exponential bound for $\eta$ large compared to $V/M$. For smaller $\eta$ it comes very close to the bound for normal tail probabilities. Problem 2 shows that $B(\lambda) \ge (1 + \tfrac13\lambda)^{-1}$ for all $\lambda > 0$. If we replace $B(\cdot)$ by this lower bound we get
\[ \mathbb{P}\{|S| \ge \eta\} \le 2\exp[-\tfrac12 \eta^2/(V + \tfrac13 M\eta)], \]
which is known as Bernstein's inequality.

NOTES
Feller (1968, Chapter VII) analyzed the tail probabilities of binomial and normal distributions: sharp results obtained by elementary methods. Bennett (1962) and Hoeffding (1963) derived and compared a number of inequalities on tail probabilities for sums of independent random variables. Dudley (1984) noted the simpler derivation of Hoeffding's Inequality when $Y_i$ takes only values $\pm a_i$. Bernstein's inequality apparently dates from the 1920's; it appeared as Problem X.14 in Uspensky's (1937) book.

PROBLEMS
[1] For independent $N(0, 1)$-distributed random variables $Y_1, Y_2, \ldots$, show that $(\max_{j \le n} Y_j)/(2\log n)^{1/2}$ converges in probability to one. [Write $M_n$ for the maximum. Show $\mathbb{P}\{M_n \le (2\gamma\log n)^{1/2}\} = [1 - \mathbb{P}\{N(0, 1) > (2\gamma\log n)^{1/2}\}]^n$, then use the exponential inequalities for normal tails.]
[2] For the function $B(\cdot)$ appearing in Bennett's Inequality prove that
\[ (1 + \tfrac13\lambda)B(\lambda) \ge 1 \quad \text{for all } \lambda > 0. \]
[Apply l'Hôpital's rule twice to reduce the left-hand side to
\[ (1 + \tfrac13\lambda^*)(1 + \lambda^*)^{-1} + \tfrac23\log(1 + \lambda^*) \]
for some $\lambda^*$ less than $\lambda$. Then use $\log(1 + \lambda^*) \ge \lambda^*/(1 + \lambda^*)$.]
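Before attempting the proof, the inequality of Problem [2] and its consequence (the Bernstein bound is never smaller than the Bennett bound) can be sanity-checked on a grid. This is a rough numerical sketch, not a proof; the grid of $\lambda$ values, the sample values of $\eta$, $V$, $M$, and the function names are all assumptions made for illustration.

```python
import math

def B(lam):
    """B(lambda) = 2 lambda^{-2} [(1 + lambda) log(1 + lambda) - lambda]."""
    return 2.0 * ((1.0 + lam) * math.log1p(lam) - lam) / lam ** 2

def bennett_bound(eta, V, M):
    """Two-sided Bennett bound 2 exp[-(1/2) eta^2 V^{-1} B(M eta / V)]."""
    return 2.0 * math.exp(-0.5 * eta ** 2 / V * B(M * eta / V))

def bernstein_bound(eta, V, M):
    """Two-sided Bernstein bound 2 exp[-(1/2) eta^2 / (V + M eta / 3)]."""
    return 2.0 * math.exp(-0.5 * eta ** 2 / (V + M * eta / 3.0))

# Geometric grid of lambda values from 0.01 up to roughly 170.
grid = [0.01 * 1.5 ** k for k in range(25)]
smallest = min((1.0 + lam / 3.0) * B(lam) for lam in grid)
print("min of (1 + lambda/3) B(lambda) on the grid:", smallest)

for eta in (0.5, 2.0, 8.0):
    print(eta, bennett_bound(eta, 4.0, 1.5), bernstein_bound(eta, 4.0, 1.5))
```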
[3] If a random variable $Y$ has zero mean, finite variance $\sigma^2$, and is bounded above by a constant $M$, then for $t > 0$,
\[ \mathbb{P}e^{tY} \le (\sigma^2 e^{tM} + M^2 e^{-t\sigma^2/M})/(\sigma^2 + M^2). \]
[Subject to the constraints $\mathbb{P}Y = 0$ and $\mathbb{P}Y^2 \le \sigma^2$, the value of $\mathbb{P}e^{tY}$ is maximized when $P_Y$ concentrates on the two values $M$ and $-\sigma^2/M$.] To prove this let $\phi(y)$ be the quadratic
\[ e^{-t\sigma^2/M} C^{-1}(M - y)[1 + (C^{-1} + t)(y + \sigma^2/M)] + e^{tM} C^{-2}(y + \sigma^2/M)^2, \]
where $C = M + \sigma^2/M$. Check that the coefficient of $y^2$ is strictly positive and that $\phi$ satisfies
\[ \phi(M) = e^{tM}, \qquad \phi(-\sigma^2/M) = e^{-t\sigma^2/M}, \qquad \phi'(-\sigma^2/M) = te^{-t\sigma^2/M}. \]
Show that $\phi(y) \ge e^{ty}$ for $y \le M$, with equality at $y = M$ and $y = -\sigma^2/M$. [The function $h(y) = e^{-ty}\phi(y)$ has a local minimum of 1 at $-\sigma^2/M$. Also $h(M) = 1$. If $h(y^*)$ were equal to 1 for some $y^*$ in the interval $(-\sigma^2/M, M)$, the quadratic $e^{ty}h'(y)$ would have three real roots: one at $-\sigma^2/M$, one in the interval $(-\sigma^2/M, y^*)$, and one in the interval $(y^*, M)$.] The distribution $P_Y$ concentrated at $M$ and $-\sigma^2/M$ achieves equality in $\mathbb{P}e^{tY} \le \mathbb{P}\phi(Y)$. [Bennett (1962, page 42).]
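[4] For the one-sided form of Bennett's Inequality one needs only zero means and $Y_i \le M$ for each $i$. Re-express the inequality from the previous problem as
\[ \mathbb{P}e^{tY} \le \exp[tM + \log f(1 + \sigma^2/M^2)], \]
where $f(y) = 1 - y^{-1} + y^{-1}e^{-tMy}$. Prove that $d^2/dy^2 \log f(y)$ equals
\[ -2y^{-3}e^{-tMy}[e^{tMy} - 1 - tMy - \tfrac12(tMy)^2]/f(y) - [f'(y)/f(y)]^2, \]
which is less than zero for $y \ge 1$. Deduce from a Taylor expansion to quadratic terms that
\[ \log f(y) \le \log f(1) + (y - 1)f'(1)/f(1) \quad \text{for } y \ge 1, \]
whence $\mathbb{P}e^{tY} \le \exp[\sigma^2(e^{tM} - 1 - tM)/M^2]$. Complete the argument as before. [Hoeffding (1963, page 24).]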
APPENDIX C
Measurability
The defining properties of σ-fields ensure that the usual countable operations in probability theory (countable unions and intersections, pointwise limits of sequences, and the like) cause no measurability difficulties. In Chapters II and VII, however, we needed to take suprema over uncountable families of measurable functions. The possibility of a nonmeasurable supremum was brushed aside by an assurance that a regularity condition, dubbed permissibility, would take care of everything. This appendix will supply the missing details.

The discussion will take as axiomatic certain properties of analytic sets. A complete treatment may be found in Sections III.1 to III.20, III.27 to III.33, and III.44 to 45, of Dellacherie and Meyer (1978). Square brackets containing the initials DM followed by a number will point to the section of that book where you can find the justification for any unproved assertions.

Suppose $M$ is a set equipped with a σ-field $\mathcal{M}$. The analytic ($\mathcal{M}$-analytic in DM terminology) subsets of $M$ form a class slightly larger than $\mathcal{M}$. Denote it by $\mathcal{A}(M)$. If $\mathcal{M}$ is complete for some probability measure $\mu$ (that is, $\mathcal{M}$ contains all the sets of zero $\mu$ measure) then $\mathcal{A}(M) = \mathcal{M}$ [DM 33]. For example, the analytic subsets generated by $\mathcal{B}[0, 1]$ contain that σ-field properly, but the σ-field of Lebesgue measurable subsets of $[0, 1]$ coincides with its analytic sets. You see from this example that we should be writing $\mathcal{A}(\mathcal{M})$ rather than $\mathcal{A}(M)$. The ambiguity is not serious when $M$ is equipped with only one σ-field. For metric spaces, we will always choose $\mathcal{M}$ to be the borel σ-field; for product spaces, it will always be the product σ-field.

We considered empirical processes indexed by a class of functions. Formally, $\xi_1, \xi_2, \ldots$ were measurable maps from a probability space $(\Omega, \mathcal{E}, \mathbb{P})$ into a set $S$ equipped with a σ-field $\mathcal{S}$. A class $\mathcal{F}$ of $\mathcal{S}/\mathcal{B}(\mathbb{R})$-measurable,
real-valued functions on $S$ was given. The empirical measure $P_n$ attached to each $f$ in $\mathcal{F}$ the real number
\[ P_n f = n^{-1} \sum_{i=1}^{n} f(\xi_i(\omega)). \]
We were assuming measurability of functions of $\omega$ such as $\sup_{\mathcal{F}} |P_n f - Pf|$. Let us now consider more carefully the dependence on $\omega$, which we emphasize by writing $P_n(\omega, \cdot)$ instead of $P_n(\cdot)$.
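For one concrete class the supremum is easy to compute, which may help fix ideas. The sketch below assumes, purely for illustration, that $\mathcal{F}$ consists of indicators of intervals $(-\infty, t]$ and that $P$ is the uniform distribution on $[0, 1]$, so that $Pf_t = t$; neither assumption comes from the text, and the function names are hypothetical. In this special case the supremum over the uncountable index set reduces to a maximum over the observations, so no measurability difficulty arises; the general theory of this appendix handles classes without such a reduction.

```python
def sup_deviation(sample):
    """sup over t of |P_n(-inf, t] - t| for observations in [0, 1].

    The empirical distribution function jumps only at the observations,
    so the supremum reduces to a maximum over the n order statistics.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - x, x - i / n)   # value at x, and limit from the left of x
        for i, x in enumerate(xs)
    )

def grid_deviation(sample, grid_size=10_000):
    """Same quantity with t restricted to a finite grid (a lower bound)."""
    n = len(sample)
    best = 0.0
    for k in range(grid_size + 1):
        t = k / grid_size
        p_n = sum(1 for x in sample if x <= t) / n   # P_n f_t
        best = max(best, abs(p_n - t))               # P f_t = t under uniform P
    return best

sample = [0.10, 0.40, 0.45, 0.90]   # fixed illustrative observations
print(sup_deviation(sample), grid_deviation(sample))
```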
PERMISSIBLE CLASSES
Suppose that the class $\mathcal{F}$ is indexed by a parameter $t$ that ranges over some set $T$. That is, $\mathcal{F} = \{f(\cdot, t): t \in T\}$. We lose no generality by assuming $\mathcal{F}$ so indexed; $T$ could be $\mathcal{F}$ itself. When more convenient, write $f_t$ instead of $f(\cdot, t)$. Assume $T$ is a separable metric space. The metric on $T$ will be important only insofar as it determines the borel σ-field $\mathcal{B}(T)$ on $T$.

1 Definition. Call the class $\mathcal{F}$ permissible if it can be indexed by a $T$ in such a way that
(i) the function $f(\cdot, \cdot)$ is $\mathcal{S} \otimes \mathcal{B}(T)$-measurable as a function from $S \otimes T$ into the real line;
(ii) $T$ is an analytic subset of a compact metric space $\bar{T}$ (from which it inherits its metric and borel σ-field). □

Some authors call a $T$ satisfying (ii) a Souslin measurable space [DM 16]. The usual sorts of class parametrized by borel subsets [DM 12] of a euclidean space are permissible. (Take $\bar{T}$ as the one-point compactification.) So are fancier classes such as all indicator functions of compact, convex subsets of euclidean space (Problem 2). Assume from now on that $\mathcal{F}$ is permissible and that $(\Omega, \mathcal{E}, \mathbb{P})$ is complete.

Here are the properties of analytic sets that make the definition of permissibility a good one for empirical process applications. For every measurable space $(M, \mathcal{M})$,
(a) $\mathcal{A}(M \otimes T)$ contains the product σ-field $\mathcal{M} \otimes \mathcal{B}(T)$;
(b) for each $H$ in $\mathcal{A}(M \otimes T)$, and in particular for each $\mathcal{M} \otimes \mathcal{B}(T)$-measurable set, the projection $\pi_M H$ of $H$ onto $M$ is in $\mathcal{A}(M)$ [DM 13, DM 9: the set $H$ is also in $\mathcal{A}(M \otimes \bar{T})$, because $T$ is analytic];
(c) for each $A$ in $\mathcal{A}(M)$ and each $\mathcal{E}/\mathcal{M}$-measurable map $\eta$ from $(\Omega, \mathcal{E}, \mathbb{P})$ into $M$, the set $\{\eta \in A\}$ is an analytic subset of $\Omega$ [DM 11]; hence $\{\eta \in A\}$ belongs to $\mathcal{E}$, because $(\Omega, \mathcal{E}, \mathbb{P})$ is complete.
From these properties we shall be able to deduce measurability for functions defined by certain uncountable operations.
MEASURABLE SUPREMA
Suppose $g(\cdot, \cdot)$ is an $\mathcal{M} \otimes \mathcal{B}(T)$-measurable real function on $M \otimes T$. Set $G(m) = \sup_t g(m, t)$. Then by (a), $\mathcal{A}(M \otimes T)$ contains the set $H_\alpha = \{(m, t): g(m, t) > \alpha\}$. The projection of $H_\alpha$ onto $M$ is an analytic set, by (b). It consists of all those $m$ for which $G(m) > \alpha$. Thus $\{G > \alpha\}$ belongs to $\mathcal{A}(M)$ for each real $\alpha$. If $\eta$ is a measurable map from a complete probability space $(\Omega, \mathcal{E}, \mathbb{P})$ into $M$ then, by (c), the set $\{\omega: G(\eta(\omega)) > \alpha\}$ is $\mathcal{E}$-measurable. That is, $\sup_t g(\eta(\omega), t)$ is an $\mathcal{E}$-measurable function of $\omega$.

If $\mathcal{F}$ is permissible and if $P|f_t| < \infty$ for each $t$, requirement (i) of Definition 1 plus Fubini's theorem make $Pf_t$ a measurable function of $t$. Apply the argument given above, with $M = S^n$, $\mathcal{M}$ = the product σ-field $\mathcal{S}^n$, $\eta$ = the vector $(\xi_1, \ldots, \xi_n)$, and
\[ g(s, t) = \Big| n^{-1} \sum_{i=1}^{n} [f(s_i, t) - Pf_t] \Big| \]
to deduce that $\sup_{\mathcal{F}} |P_n(\omega, f) - Pf|$ is a measurable function of $\omega$.

MEASURABLE CROSS-SECTIONS
The Symmetrization Lemma II.8 made use of another property of analytic sets. We had a stochastic process $\{Z_t: t \in T\}$; we assumed existence of a random $\tau$ for which, almost surely, $|Z_\tau| > \varepsilon$ whenever $\sup_t |Z_t| > \varepsilon$. For this we need a cross-section theorem [DM 44-45]. Under requirement (ii) of Definition 1, and for any complete probability space $(M, \mathcal{M}, \mu)$,
(d) if $H$ belongs to $\mathcal{A}(M \otimes T)$ there exists a measurable map $h$ from $M$ into $T \cup \{\infty\}$ (where $\infty$ is an ideal point added to $T$) such that: $(m, h(m))$ belongs to $H$ whenever $h(m) \ne \infty$; and $h(m) \ne \infty$ for $\mu$ almost all $m$ in the projection $\pi_M H$.
Call $h$ a measurable cross-section for $H$.
[Figure: a set $H$ in $M \otimes T$ and its projection $\pi_M H$ onto $M$.]
Write $Z(\omega, t)$ to emphasize the role of $Z$ as an $\mathcal{E} \otimes \mathcal{B}(T)$-measurable function on $\Omega \otimes T$. Let $\{\varepsilon_j\}$ be a strictly decreasing sequence of real numbers converging to $\varepsilon$, with $\varepsilon_1 = \infty$. Set
{w:
s~p IZ(w, t)1 ~ Sj};
Aj
=
Bj
= {(w, t): Sj+1 < IZ(w,t) I ~
Sj+l