Asymptotic Statistics
This book is an introduction to the field of asymptotic statistics. The treatment is both practical and mathematically rigorous. In addition to most of the standard topics of an asymptotics course, including likelihood inference, M-estimation, asymptotic efficiency, U-statistics, and rank procedures, the book also presents recent research topics such as semiparametric models, the bootstrap, and empirical processes and their applications. One of the unifying themes is the approximation by limit experiments. This entails mainly the local approximation of the classical i.i.d. set-up with smooth parameters by location experiments involving a single, normally distributed observation. Thus, even the standard subjects of asymptotic statistics are presented in a novel way. Suitable as a text for a graduate or Master's level statistics course, this book also gives researchers in statistics, probability, and their applications an overview of the latest research in asymptotic statistics. A.W. van der Vaart is Professor of Statistics in the Department of Mathematics and Computer Science at the Vrije Universiteit, Amsterdam.
Cambridge Series in Statistical and Probabilistic Mathematics
Editorial Board:
R. Gill, Department of Mathematics, Utrecht University
B.D. Ripley, Department of Statistics, University of Oxford
S. Ross, Department of Industrial Engineering, University of California, Berkeley
M. Stein, Department of Statistics, University of Chicago
D. Williams, School of Mathematical Sciences, University of Bath

This series of high-quality upper-division textbooks and expository monographs covers all aspects of stochastic applicable mathematics. The topics range from pure and applied statistics to probability theory, operations research, optimization, and mathematical programming. The books contain clear presentations of new developments in the field and also of the state of the art in classical methods. While emphasizing rigorous treatment of theoretical methods, the books also contain applications and discussions of new techniques made possible by advances in computational practice.

Already published
1. Bootstrap Methods and Their Application, by A.C. Davison and D.V. Hinkley
2. Markov Chains, by J. Norris
Asymptotic Statistics
A.W. VAN DER VAART
CAMBRIDGE UNIVERSITY PRESS
PUBLISHED BY THE PRESS SYNDICATE OF THE UNIVERSITY OF CAMBRIDGE The Pitt Building, Trumpington Street, Cambridge, United Kingdom
CAMBRIDGE UNIVERSITY PRESS
The Edinburgh Building, Cambridge CB2 2RU, UK
http://www.cup.cam.ac.uk
40 West 20th Street, New York, NY 10011-4211, USA
http://www.cup.org
10 Stamford Road, Oakleigh, Melbourne 3166, Australia Ruiz de Alarcon 13, 28014 Madrid, Spain
© Cambridge University Press 1998

This book is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 1998
First paperback edition 2000

Printed in the United States of America

Typeset in Times Roman 10/12.5 pt in LaTeX 2e [TB]
A catalog record for this book is available from the British Library
Library of Congress Cataloging in Publication data

Vaart, A.W. van der
Asymptotic statistics / A.W. van der Vaart.
p. cm. - (Cambridge series in statistical and probabilistic mathematics)
Includes bibliographical references.
1. Mathematical statistics - Asymptotic theory. I. Title. II. Series: Cambridge series in statistical and probabilistic mathematics.
QA276.V22 1998
519.5-dc21 98-15176

ISBN 0 521 49603 9 hardback
ISBN 0 521 78450 6 paperback
To Maryse and Marianne
Contents

Preface   page xiii
Notation   xv

1. Introduction   1
1.1. Approximate Statistical Procedures   1
1.2. Asymptotic Optimality Theory   2
1.3. Limitations   3
1.4. The Index n   4

2. Stochastic Convergence   5
2.1. Basic Theory   5
2.2. Stochastic o and O Symbols   12
*2.3. Characteristic Functions   13
*2.4. Almost-Sure Representations   17
*2.5. Convergence of Moments   17
*2.6. Convergence-Determining Classes   18
*2.7. Law of the Iterated Logarithm   19
*2.8. Lindeberg-Feller Theorem   20
*2.9. Convergence in Total Variation   22
Problems   24

3. Delta Method   25
3.1. Basic Result   25
3.2. Variance-Stabilizing Transformations   30
*3.3. Higher-Order Expansions   31
*3.4. Uniform Delta Method   32
*3.5. Moments   33
Problems   34

4. Moment Estimators   35
4.1. Method of Moments   35
*4.2. Exponential Families   37
Problems   40

5. M- and Z-Estimators   41
5.1. Introduction   41
5.2. Consistency   44
5.3. Asymptotic Normality   51
*5.4. Estimated Parameters   60
5.5. Maximum Likelihood Estimators   61
*5.6. Classical Conditions   67
*5.7. One-Step Estimators   71
*5.8. Rates of Convergence   75
*5.9. Argmax Theorem   79
Problems   83

6. Contiguity   85
6.1. Likelihood Ratios   85
6.2. Contiguity   87
Problems   91

7. Local Asymptotic Normality   92
7.1. Introduction   92
7.2. Expanding the Likelihood   93
7.3. Convergence to a Normal Experiment   97
7.4. Maximum Likelihood   100
*7.5. Limit Distributions under Alternatives   103
*7.6. Local Asymptotic Normality   103
Problems   106

8. Efficiency of Estimators   108
8.1. Asymptotic Concentration   108
8.2. Relative Efficiency   110
8.3. Lower Bound for Experiments   111
8.4. Estimating Normal Means   112
8.5. Convolution Theorem   115
8.6. Almost-Everywhere Convolution Theorem   115
*8.7. Local Asymptotic Minimax Theorem   117
*8.8. Shrinkage Estimators   119
*8.9. Achieving the Bound   120
*8.10. Large Deviations   122
Problems   123

9. Limits of Experiments   125
9.1. Introduction   125
9.2. Asymptotic Representation Theorem   126
9.3. Asymptotic Normality   127
9.4. Uniform Distribution   129
9.5. Pareto Distribution   130
9.6. Asymptotic Mixed Normality   131
9.7. Heuristics   136
Problems   137

10. Bayes Procedures   138
10.1. Introduction   138
10.2. Bernstein-von Mises Theorem   140
10.3. Point Estimators   146
*10.4. Consistency   149
Problems   152

11. Projections   153
11.1. Projections   153
11.2. Conditional Expectation   155
11.3. Projection onto Sums   157
*11.4. Hoeffding Decomposition   157
Problems   160

12. U-Statistics   161
12.1. One-Sample U-Statistics   161
12.2. Two-Sample U-Statistics   165
*12.3. Degenerate U-Statistics   167
Problems   171

13. Rank, Sign, and Permutation Statistics   173
13.1. Rank Statistics   173
13.2. Signed Rank Statistics   181
13.3. Rank Statistics for Independence   184
*13.4. Rank Statistics under Alternatives   184
13.5. Permutation Tests   188
*13.6. Rank Central Limit Theorem   190
Problems   190

14. Relative Efficiency of Tests   192
14.1. Asymptotic Power Functions   192
14.2. Consistency   199
14.3. Asymptotic Relative Efficiency   201
*14.4. Other Relative Efficiencies   202
*14.5. Rescaling Rates   211
Problems   213

15. Efficiency of Tests   215
15.1. Asymptotic Representation Theorem   215
15.2. Testing Normal Means   216
15.3. Local Asymptotic Normality   218
15.4. One-Sample Location   220
15.5. Two-Sample Problems   223
Problems   226

16. Likelihood Ratio Tests   227
16.1. Introduction   227
*16.2. Taylor Expansion   229
16.3. Using Local Asymptotic Normality   231
16.4. Asymptotic Power Functions   236
16.5. Bartlett Correction   238
*16.6. Bahadur Efficiency   238
Problems   241

17. Chi-Square Tests   242
17.1. Quadratic Forms in Normal Vectors   242
17.2. Pearson Statistic   242
17.3. Estimated Parameters   244
17.4. Testing Independence   247
*17.5. Goodness-of-Fit Tests   248
*17.6. Asymptotic Efficiency   251
Problems   253

18. Stochastic Convergence in Metric Spaces   255
18.1. Metric and Normed Spaces   255
18.2. Basic Properties   258
18.3. Bounded Stochastic Processes   260
Problems   263

19. Empirical Processes   265
19.1. Empirical Distribution Functions   265
19.2. Empirical Distributions   269
19.3. Goodness-of-Fit Statistics   277
19.4. Random Functions   279
19.5. Changing Classes   282
19.6. Maximal Inequalities   284
Problems   289

20. Functional Delta Method   291
20.1. von Mises Calculus   291
20.2. Hadamard-Differentiable Functions   296
20.3. Some Examples   298
Problems   303

21. Quantiles and Order Statistics   304
21.1. Weak Consistency   304
21.2. Asymptotic Normality   305
21.3. Median Absolute Deviation   310
21.4. Extreme Values   312
Problems   315

22. L-Statistics   316
22.1. Introduction   316
22.2. Hajek Projection   318
22.3. Delta Method   320
22.4. L-Estimators for Location   323
Problems   324

23. Bootstrap   326
23.1. Introduction   326
23.2. Consistency   329
23.3. Higher-Order Correctness   334
Problems   339

24. Nonparametric Density Estimation   341
24.1. Introduction   341
24.2. Kernel Estimators   341
24.3. Rate Optimality   346
24.4. Estimating a Unimodal Density   349
Problems   356

25. Semiparametric Models   358
25.1. Introduction   358
25.2. Banach and Hilbert Spaces   360
25.3. Tangent Spaces and Information   362
25.4. Efficient Score Functions   368
25.5. Score and Information Operators   371
25.6. Testing   384
*25.7. Efficiency and the Delta Method   386
25.8. Efficient Score Equations   391
25.9. General Estimating Equations   400
25.10. Maximum Likelihood Estimators   402
25.11. Approximately Least-Favorable Submodels   408
25.12. Likelihood Equations   419
Problems   431

References   433

Index   439
Preface
This book grew out of courses that I gave at various places, including a graduate course in the Statistics Department of Texas A&M University, Master's level courses for mathematics students specializing in statistics at the Vrije Universiteit Amsterdam, a course in the DEA program (graduate level) of Universite de Paris-sud, and courses in the Dutch AIO-netwerk (graduate level).

The mathematical level is mixed. Some parts I have used for second-year courses for mathematics students (but they find it tough), other parts I would only recommend for a graduate program. The text is written both for students who know about the technical details of measure theory and probability, but little about statistics, and vice versa. This requires brief explanations of statistical methodology, for instance of what a rank test or the bootstrap is about, and there are similar excursions to introduce mathematical details. Familiarity with (higher-dimensional) calculus is necessary in all of the manuscript. Metric and normed spaces are briefly introduced in Chapter 18, when these concepts become necessary for Chapters 19, 20, 21 and 22, but I do not expect that this would be enough as a first introduction. For Chapter 25 basic knowledge of Hilbert spaces is extremely helpful, although the bare essentials are summarized at the beginning. Measure theory is implicitly assumed in the whole manuscript but can at most places be avoided by skipping proofs, by ignoring the word "measurable" or with a bit of handwaving. Because we deal mostly with i.i.d. observations, the simplest limit theorems from probability theory suffice. These are derived in Chapter 2, but prior exposure is helpful.

Sections, results or proofs that are preceded by asterisks are either of secondary importance or are out of line with the natural order of the chapters. As the chart in Figure 0.1 shows, many of the chapters are independent from one another, and the book can be used for several different courses.
A unifying theme is approximation by a limit experiment. The full theory is not developed (another writing project is on its way), but the material is limited to the "weak topology" on experiments, which in 90% of the book is exemplified by the case of smooth parameters of the distribution of i.i.d. observations. For this situation the theory can be developed by relatively simple, direct arguments. Limit experiments are used to explain efficiency properties, but also why certain procedures asymptotically take a certain form.

A second major theme is the application of results on abstract empirical processes. These already have benefits for deriving the usual theorems on M-estimators for Euclidean parameters but are indispensable if discussing more involved situations, such as M-estimators with nuisance parameters, chi-square statistics with data-dependent cells, or semiparametric models. The general theory is summarized in about 30 pages, and it is the applications
Figure 0.1. Dependence chart. A solid arrow means that a chapter is a prerequisite for a next chapter. A dotted arrow means a natural continuation. Vertical or horizontal position has no independent meaning.
that we focus on. In a sense, it would have been better to place this material (Chapters 18 and 19) earlier in the book, but instead we start with material of more direct statistical relevance and of a less abstract character. A drawback is that a few (starred) proofs point ahead to later chapters.

Almost every chapter ends with a "Notes" section. These are meant to give a rough historical sketch, and to provide entries in the literature for further reading. They certainly do not give sufficient credit to the original contributions by many authors and are not meant to serve as references in this way.

Mathematical statistics obtains its relevance from applications. The subjects of this book have been chosen accordingly. On the other hand, this is a mathematician's book in that we have made some effort to present results in a nice way, without the (unnecessary) lists of "regularity conditions" that are sometimes found in statistics books. Occasionally, this means that the accompanying proof must be more involved. If this means that an idea could get lost, then an informal argument precedes the statement of a result. This does not mean that I have striven for the greatest possible generality. A simple, clean presentation was the main aim.

Leiden, September 1997
A.W. van der Vaart
Notation
$A^*$   adjoint operator
$B^*$   dual space
$C_b(T)$, $UC(T)$, $C(T)$   (bounded, uniformly) continuous functions on $T$
$\ell^\infty(T)$   bounded functions on $T$
$\mathcal{L}_r(Q)$, $L_r(Q)$   measurable functions whose $r$th powers are $Q$-integrable
$\|\cdot\|_{Q,r}$   norm of $L_r(Q)$
$\|z\|_\infty$, $\|z\|_T$   uniform norm
lin   linear span
$\mathbb{N}$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, $\mathbb{C}$   number fields and sets

A sufficient condition for $\phi(t) = E e^{itY}$ to be differentiable at zero is that $E|Y| < \infty$. In that case the dominated convergence theorem allows differentiation
under the expectation sign, and we obtain

$$\phi'(t) = \frac{d}{dt}\, E e^{itY} = E\, iY e^{itY}.$$

In particular, the derivative at zero is $\phi'(0) = i\,EY$ and hence $\bar Y_n \rightsquigarrow EY_1$. If $EY^2 < \infty$, then the Taylor expansion can be carried a step further and we can obtain a version of the central limit theorem.

2.17 Proposition (Central limit theorem). Let $Y_1, \ldots, Y_n$ be i.i.d. random variables with $EY_i = 0$ and $EY_i^2 = 1$. Then the sequence $\sqrt{n}\,\bar Y_n$ converges in distribution to the standard normal distribution.

Proof. A second differentiation under the expectation sign shows that $\phi''(0) = i^2 EY^2$. Because $\phi'(0) = i\,EY = 0$, we obtain

$$E e^{it\sqrt{n}\,\bar Y_n} = \phi^n\Bigl(\frac{t}{\sqrt n}\Bigr) = \Bigl(1 - \frac{t^2}{2n}\,EY^2 + o\Bigl(\frac1n\Bigr)\Bigr)^n \to e^{-\frac12 t^2 EY^2}.$$

The right side is the characteristic function of the normal distribution with mean zero and variance $EY^2$. The proposition follows from Levy's continuity theorem. ∎

The characteristic function $t \mapsto E e^{it^T X}$ of a vector $X$ is determined by the set of all characteristic functions $u \mapsto E e^{iu(t^T X)}$ of linear combinations $t^T X$ of the components of $X$. Therefore, Levy's continuity theorem implies that weak convergence of vectors is equivalent to weak convergence of linear combinations:

$$X_n \rightsquigarrow X \quad\text{if and only if}\quad t^T X_n \rightsquigarrow t^T X \text{ for all } t \in \mathbb{R}^k.$$

This is known as the Cramer-Wold device. It allows one to reduce higher-dimensional problems to the one-dimensional case.

2.18 Example (Multivariate central limit theorem). Let $Y_1, Y_2, \ldots$ be i.i.d. random vectors in $\mathbb{R}^k$ with mean vector $\mu = EY_1$ and covariance matrix $\Sigma = E(Y_1 - \mu)(Y_1 - \mu)^T$. Then

$$\frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \mu) \rightsquigarrow N(0, \Sigma).$$

(The sum is taken coordinatewise.) By the Cramer-Wold device, this can be proved by finding the limit distribution of the sequences of real variables

$$t^T\Bigl(\frac{1}{\sqrt n}\sum_{i=1}^n (Y_i - \mu)\Bigr) = \frac{1}{\sqrt n}\sum_{i=1}^n \bigl(t^T Y_i - t^T\mu\bigr).$$

Because the random variables $t^T Y_1 - t^T\mu, t^T Y_2 - t^T\mu, \ldots$ are i.i.d. with zero mean and variance $t^T \Sigma t$, this sequence is asymptotically $N(0, t^T \Sigma t)$-distributed by the univariate central limit theorem. This is exactly the distribution of $t^T X$ if $X$ possesses an $N_k(0, \Sigma)$ distribution. □
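The multivariate central limit theorem and the Cramer-Wold device are easy to check numerically. The following sketch is not from the book; the bivariate exponential construction, the matrix $A$, the sample sizes, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-Gaussian i.i.d. vectors Y_i = mu + A(E_i - 1), with E_i standard
# exponential coordinates, so EY_i = mu and Cov Y_i = A A^T =: Sigma.
mu = np.array([1.0, -2.0])
A = np.array([[1.0, 0.5], [0.0, 0.8]])
Sigma = A @ A.T

n, reps = 500, 2000
E = rng.exponential(size=(reps, n, 2))
Y = mu + (E - 1.0) @ A.T

# Normalized sums n^{-1/2} sum_i (Y_i - mu), one per replication.
S = (Y - mu).sum(axis=1) / np.sqrt(n)

# Their empirical covariance should approximate Sigma ...
print(np.round(np.cov(S.T), 2))

# ... and, by the Cramer-Wold device, every projection t^T S is approximately
# N(0, t^T Sigma t); standardized, it has mean near 0 and variance near 1.
t = np.array([2.0, -1.0])
proj = S @ t / np.sqrt(t @ Sigma @ t)
print(round(proj.mean(), 2), round(proj.std(), 2))
```

Checking a single projection for each fixed $t$ is exactly the reduction to the one-dimensional case described above.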
*2.4 Almost-Sure Representations
Convergence in distribution certainly does not imply convergence in probability or almost surely. However, the following theorem shows that a given sequence $X_n \rightsquigarrow X$ can always be replaced by a sequence $\tilde X_n$ that is, marginally, equal in distribution and converges almost surely. This construction is sometimes useful and has been put to good use by some authors, but we do not use it in this book.
2.19 Theorem (Almost-sure representations). Suppose that the sequence of random vectors $X_n$ converges in distribution to a random vector $X_0$. Then there exists a probability space $(\tilde\Omega, \tilde{\mathcal U}, \tilde P)$ and random vectors $\tilde X_n$ defined on it such that $\tilde X_n$ is equal in distribution to $X_n$ for every $n \ge 0$ and $\tilde X_n \to \tilde X_0$ almost surely.

Proof. For random variables we can simply define $\tilde X_n = F_n^{-1}(U)$ for $F_n$ the distribution function of $X_n$ and $U$ an arbitrary random variable with the uniform distribution on $[0, 1]$. (The "quantile transformation," see Section 21.1.) The simplest known construction for higher-dimensional vectors is more complicated. See, for example, Theorem 1.10.4 in [146], or [41]. ∎
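For real random variables the quantile-transformation construction in the proof can be made concrete. The sketch below is not from the book; the shifted-exponential example is an arbitrary choice. Here $X_n = 1/n + \mathrm{Exp}(1) \rightsquigarrow X_0 = \mathrm{Exp}(1)$, and defining every $\tilde X_n = F_n^{-1}(U)$ from one uniform variable $U$ turns convergence in distribution into pointwise, hence almost-sure, convergence.

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(size=100_000)   # one uniform variable per sample point

# X_n = 1/n + Exp(1) converges in distribution to X_0 = Exp(1).
# Quantile functions: F_n^{-1}(u) = 1/n - log(1-u), F_0^{-1}(u) = -log(1-u).
def quantile(n, u):
    shift = 0.0 if n == 0 else 1.0 / n
    return shift - np.log1p(-u)

X0 = quantile(0, U)

# tilde-X_n = F_n^{-1}(U) has the correct marginal law for every n, and since
# all variables are driven by the same U, the distance to X_0 is exactly the
# deterministic shift 1/n, which goes to zero for every realization of U.
for n in (5, 50, 500):
    print(n, np.max(np.abs(quantile(n, U) - X0)))
```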
*2.5 Convergence of Moments

By the portmanteau lemma, weak convergence $X_n \rightsquigarrow X$ implies that $Ef(X_n) \to Ef(X)$ for every continuous, bounded function $f$. The condition that $f$ be bounded is not superfluous: It is not difficult to find examples of a sequence $X_n \rightsquigarrow X$ and an unbounded, continuous function $f$ for which the convergence fails. In particular, in general convergence in distribution does not imply convergence $EX_n^p \to EX^p$ of moments. However, in many situations such convergence occurs, but it requires more effort to prove it. A sequence of random variables $Y_n$ is called asymptotically uniformly integrable if

$$\lim_{M\to\infty} \limsup_{n\to\infty} E|Y_n|\, 1\{|Y_n| > M\} = 0.$$

Uniform integrability is the missing link between convergence in distribution and convergence of moments.
2.20 Theorem. Let $f: \mathbb{R}^k \to \mathbb{R}$ be measurable and continuous at every point in a set $C$. Let $X_n \rightsquigarrow X$ where $X$ takes its values in $C$. Then $Ef(X_n) \to Ef(X)$ if and only if the sequence of random variables $f(X_n)$ is asymptotically uniformly integrable.
Proof. We give the proof only in the most interesting direction. (See, for example, [146] (p. 69) for the other direction.) Suppose that $Y_n = f(X_n)$ is asymptotically uniformly integrable. Then we show that $EY_n \to EY$ for $Y = f(X)$. Assume without loss of generality that $Y_n$ is nonnegative; otherwise argue the positive and negative parts separately. By the continuous mapping theorem, $Y_n \rightsquigarrow Y$. By the triangle inequality,

$$|EY_n - EY| \le |EY_n - EY_n \wedge M| + |EY_n \wedge M - EY \wedge M| + |EY \wedge M - EY|.$$

Because the function $y \mapsto y \wedge M$ is continuous and bounded on $[0, \infty)$, it follows that the middle term on the right converges to zero as $n \to \infty$. The first term is bounded above by
$EY_n 1\{Y_n > M\}$, and converges to zero as $n \to \infty$ followed by $M \to \infty$, by the uniform integrability. By the portmanteau lemma (iv), the third term is bounded by the liminf as $n \to \infty$ of the first and hence converges to zero as $M \uparrow \infty$. ∎

2.21 Example. Suppose $X_n$ is a sequence of random variables such that $X_n \rightsquigarrow X$ and $\limsup E|X_n|^p < \infty$ for some $p$. Then all moments of order strictly less than $p$ converge also: $EX_n^k \to EX^k$ for every $k < p$. By the preceding theorem, it suffices to prove that the sequence $X_n^k$ is asymptotically uniformly integrable. By Markov's inequality

$$E|X_n|^k\, 1\bigl\{|X_n|^k > M\bigr\} \le M^{1 - p/k}\, E|X_n|^p.$$

The limit superior, as $n \to \infty$ followed by $M \to \infty$, of the right side is zero if $k < p$. □
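A standard counterexample (not in the text; the specific distribution is our choice) shows why uniform integrability cannot be dropped: a variable equal to $n$ with probability $1/n$ and $0$ otherwise converges in distribution to $0$, yet its expectation equals $1$ for every $n$.

```python
import numpy as np

rng = np.random.default_rng(2)

# X_n = n with probability 1/n, else 0: X_n converges in distribution to 0,
# but EX_n = 1 for all n, because E X_n 1{X_n > M} = 1 whenever n > M --
# asymptotic uniform integrability fails, and so does moment convergence.
def sample(n, size):
    return np.where(rng.uniform(size=size) < 1.0 / n, float(n), 0.0)

for n in (10, 100, 1000):
    x = sample(n, 400_000)
    # P(X_n != 0) = 1/n -> 0, while the sample mean stays near 1.
    print(n, round((x != 0).mean(), 4), round(x.mean(), 2))
```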
The moment function $p \mapsto EX^p$ can be considered a transform of probability distributions, just as can the characteristic function. In general, it is not a true transform, in that it determines a distribution uniquely only under additional assumptions. If a limit distribution is uniquely determined by its moments, this transform can still be used to establish weak convergence.

2.22 Theorem. Let $X_n$ and $X$ be random variables such that $EX_n^p \to EX^p < \infty$ for every $p \in \mathbb{N}$. If the distribution of $X$ is uniquely determined by its moments, then $X_n \rightsquigarrow X$.

Proof. Because $EX_n^2 = O(1)$, the sequence $X_n$ is uniformly tight, by Markov's inequality. By Prohorov's theorem, each subsequence has a further subsequence that converges weakly to a limit $Y$. By the preceding example the moments of $Y$ are the limits of the moments of the subsequence. Thus the moments of $Y$ are identical to the moments of $X$. Because, by assumption, there is only one distribution with this set of moments, $X$ and $Y$ are equal in distribution. Conclude that every subsequence of $X_n$ has a further subsequence that converges in distribution to $X$. This implies that the whole sequence converges to $X$. ∎

2.23 Example. The normal distribution is uniquely determined by its moments. (See, for example, [123] or [133, p. 293].) Thus $EX_n^p \to 0$ for odd $p$ and $EX_n^p \to (p-1)(p-3)\cdots 1$ for even $p$ implies that $X_n \rightsquigarrow N(0, 1)$. The converse is false. □
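The moment convergence in Example 2.23 can be observed numerically. The sketch below is not from the book; the uniform summands and the sample sizes are arbitrary choices. Standardized sums of i.i.d. uniforms converge to $N(0,1)$, and their low-order moments approach the normal moments $0$ (odd $p$) and $(p-1)(p-3)\cdots 1$ (even $p$).

```python
import numpy as np

rng = np.random.default_rng(4)

# Standardized sums of i.i.d. uniform(-1, 1) variables converge in
# distribution to N(0, 1); Var(U_1) = (1 - (-1))^2 / 12 = 1/3.
reps, n = 20_000, 200
U = rng.uniform(-1.0, 1.0, size=(reps, n))
Xn = U.sum(axis=1) / np.sqrt(n / 3.0)

# Empirical moments E X_n^p for p = 1..4; normal limits are 0, 1, 0, 3.
moments = {p: np.mean(Xn**p) for p in (1, 2, 3, 4)}
print({p: round(m, 2) for p, m in moments.items()})
```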
*2.6 Convergence-Determining Classes

A class $\mathcal F$ of functions $f: \mathbb{R}^k \to \mathbb{R}$ is called convergence-determining if for every sequence of random vectors $X_n$ the convergence $X_n \rightsquigarrow X$ is equivalent to $Ef(X_n) \to Ef(X)$ for every $f \in \mathcal F$. By definition the set of all bounded continuous functions is convergence-determining, but so is the smaller set of all differentiable functions, and many other classes. The set of all indicator functions $1_{(-\infty, t]}$ would be convergence-determining if we would restrict the definition to limits $X$ with continuous distribution functions. We shall have occasion to use the following results. (For proofs see Corollary 1.4.5 and Theorem 1.12.2, for example, in [146].)
2.24 Lemma. On $\mathbb{R}^k = \mathbb{R}^l \times \mathbb{R}^m$ the set of functions $(x, y) \mapsto f(x) g(y)$, with $f$ and $g$ ranging over all bounded, continuous functions on $\mathbb{R}^l$ and $\mathbb{R}^m$, respectively, is convergence-determining.

2.25 Lemma. There exists a countable set $\mathcal F$ of continuous functions $f: \mathbb{R}^k \to [0, 1]$ that is convergence-determining and, moreover, $X_n \rightsquigarrow X$ implies that $Ef(X_n) \to Ef(X)$ uniformly in $f \in \mathcal F$.
*2.7 Law of the Iterated Logarithm

The law of the iterated logarithm is an intriguing result but appears to be of less interest to statisticians. It can be viewed as a refinement of the strong law of large numbers. If $Y_1, Y_2, \ldots$ are i.i.d. random variables with mean zero, then $Y_1 + \cdots + Y_n = o(n)$ almost surely by the strong law. The law of the iterated logarithm improves this order to $O(\sqrt{n \log\log n})$, and even gives the proportionality constant.
2.26 Proposition (Law of the iterated logarithm). Let $Y_1, Y_2, \ldots$ be i.i.d. random variables with mean zero and variance 1. Then

$$\limsup_{n\to\infty} \frac{Y_1 + \cdots + Y_n}{\sqrt{n \log\log n}} = \sqrt 2, \quad a.s.$$

Conversely, if this statement holds for both $Y_i$ and $-Y_i$, then the variables have mean zero and variance 1.

The law of the iterated logarithm gives an interesting illustration of the difference between almost sure and distributional statements. Under the conditions of the proposition, the sequence $n^{-1/2}(Y_1 + \cdots + Y_n)$ is asymptotically normally distributed by the central limit theorem. The limiting normal distribution is spread out over the whole real line. Apparently division by the factor $\sqrt{\log\log n}$ is exactly right to keep $n^{-1/2}(Y_1 + \cdots + Y_n)$ within a compact interval, eventually. A simple application of Slutsky's lemma gives

$$Z_n := \frac{Y_1 + \cdots + Y_n}{\sqrt{n \log\log n}} \xrightarrow{\;P\;} 0.$$

Thus $Z_n$ is with high probability contained in the interval $(-\varepsilon, \varepsilon)$ eventually, for any $\varepsilon > 0$. This appears to contradict the law of the iterated logarithm, which asserts that $Z_n$ reaches the interval $(\sqrt2 - \varepsilon, \sqrt2 + \varepsilon)$ infinitely often with probability one. The explanation is that the set of $\omega$ such that $Z_n(\omega)$ is in $(-\varepsilon, \varepsilon)$ or $(\sqrt2 - \varepsilon, \sqrt2 + \varepsilon)$ fluctuates with $n$. The convergence in probability shows that at any advanced time a very large fraction of $\omega$ have $Z_n(\omega) \in (-\varepsilon, \varepsilon)$. The law of the iterated logarithm shows that for each particular $\omega$ the sequence $Z_n(\omega)$ drops in and out of the interval $(\sqrt2 - \varepsilon, \sqrt2 + \varepsilon)$ infinitely often (and hence out of $(-\varepsilon, \varepsilon)$).

The implications for statistics can be illustrated by considering confidence statements. If $\mu$ and 1 are the true mean and variance of the sample $Y_1, Y_2, \ldots$, then the probability that

$$\bar Y_n - \frac{2}{\sqrt n} \le \mu \le \bar Y_n + \frac{2}{\sqrt n}$$

converges to $\Phi(2) - \Phi(-2) \approx 95\%$. Thus the given interval is an asymptotic confidence interval of level approximately 95%. (The confidence level is exactly $\Phi(2) - \Phi(-2)$ if the observations are normally distributed. This may be assumed in the following; the accuracy of the approximation is not an issue in this discussion.) The point $\mu = 0$ is contained in the interval if and only if the variable $Z_n$ satisfies

$$|Z_n| \le \frac{2}{\sqrt{\log\log n}}.$$

Assume that $\mu = 0$ is the true value of the mean, and consider the following argument. By the law of the iterated logarithm, we can be sure that $Z_n$ hits the interval $(\sqrt2 - \delta, \sqrt2 + \delta)$ infinitely often. The expression $2/\sqrt{\log\log n}$ is close to zero for large $n$. Thus we can be sure that the true value $\mu = 0$ is outside the confidence interval infinitely often.

How can we solve the paradox that the usual confidence interval is wrong infinitely often? There appears to be a conceptual problem if it is imagined that a statistician collects data in a sequential manner, computing a confidence interval for every $n$. However, although the frequentist interpretation of a confidence interval is open to the usual criticism, the paradox does not seem to arise within the frequentist framework. In fact, from a frequentist point of view the curious conclusion is reasonable. Imagine 100 statisticians, all of whom set 95% confidence intervals in the usual manner. They all receive one observation per day and update their confidence intervals daily. Then every day about five of them should have a false interval. It is only fair that as the days go by all of them take turns in being unlucky, and that the same five do not have it wrong all the time. This, indeed, happens according to the law of the iterated logarithm.

The paradox may be partly caused by the feeling that with a growing number of observations, the confidence intervals should become better. In contrast, the usual approach leads to errors with certainty. However, this is only true if the usual approach is applied naively in a sequential set-up. In practice one would do a genuine sequential analysis (including the use of a stopping rule) or change the confidence level with $n$. There is also another reason that the law of the iterated logarithm is of little practical consequence. The argument in the preceding paragraphs is based on the assumption that $2/\sqrt{\log\log n}$ is close to zero and is nonsensical if this quantity is larger than $\sqrt 2$. Thus the argument requires at least $n \ge 1619$, a respectable number of observations.
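The cutoff $n \ge 1619$ can be checked directly: $2/\sqrt{\log\log n} \le \sqrt 2$ exactly when $\log\log n \ge 2$, that is, $n \ge e^{e^2} \approx 1618.2$. A quick computation (not in the book) confirms the boundary:

```python
import math

# 2/sqrt(log log n) exceeds sqrt(2) exactly when log log n < 2,
# i.e. n < e^(e^2) ~ 1618.2, so the argument in the text needs n >= 1619.
threshold = math.exp(math.exp(2.0))
print(threshold)

def ratio(n):
    return 2.0 / math.sqrt(math.log(math.log(n)))

# The comparison flips between n = 1618 and n = 1619.
print(ratio(1618) > math.sqrt(2), ratio(1619) > math.sqrt(2))
```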
*2.8
Lindeberg-Feller Theorem
Central limit theorems are theorems concerning convergence in distribution of sums of random variables. There are versions for dependent observations and nonnormal limit distributions. The Lindeberg-Feller theorem is the simplest extension of the classical central limit theorem and is applicable to independent observations with finite variances. 2.27
Yn, kn
For each n let Yn , I , . . . be independent random vectors with finite variances such that Proposition (Lindeberg-Feller central limit theorem).
kn EI I Yn . i i 2 1{I I Yn. i l > 8 } � 0, L i =l k:Ln cov Yn, i � I; . i =l
every E >
0,
,
Then the sequence L�== 1 (Yn , i -
distribution.
21
Linde berg-Feller Theorem
2.8
EYn, i ) converges in distribution to a normal N (0, I:)
A result of this type is necessary to treat the asymptotics of, for instance, regression problems with fixed covariates . We illustrate this by the linear regression model. The application is straightforward but notationally a bit involved. Therefore, at other places in the manuscript we find it more convenient to assume that the covariates are a random sample, so that the ordinary central limit theorem applies. In the linear regression problem, we observe a vector Y = XfJ + for a known (n x p) matrix X of full rank, and an (unobserved) error vector with i.i.d. components with mean zero and variance The least squares estimator of fJ is
2.28
Example (Linear regression).
e
e
0' 2 .
e
This estimator is unbiased and has covariance matrix C5 2 (X T X) - 1 . If the error vector is normally distributed, then � is exactly normally distributed. Under reasonable conditions on the design matrix, the least squares estimator is asymptotically normally distributed for a large range of error distributions. Here we fix p and let n tend to infinity. This follows from the representation
(X T X) 1 / 2 (� -
an 1 , . . . , ann
n {J ) = cxT x ) - 1 12 xT e = I>ni ei, i=
1 are the columns of the (p x n) matrix (X T X) - 1 1 2 x r =: A. This sequence
where is asymptotically normal if the vectors satisfy the Lindeberg conditions . 1 2 T The norming matrix (X X) 1 has been chosen to ensure that the vectors in the display have covariance matrix I for every n. The remaining condition is
0' 2
an 1 e 1 , . . . , annen
n 2 :L i = 1 ) an t i Eef 1{1 1 an t l l etl } � 0. This can be simplified to other conditions in several ways. Because L I ani 1 2 = trace(AA T ) = p, it suffices that maxEef1{ l ani l l e; l } � 0, which is equivalent to si sn l an; l � 0. 1max Alternatively, the expectation Ee 2 1{a l e l } can be bounded by c k E i e l k + 2 a k and a second set of sufficient conditions is n k (k 2) . L) i = 1 ani I l � 0 ; Both sets of conditions are reasonable. Consider for instance the simple linear regression model = {30 + {J 1 x; + e;. Then )2 - 1 /2 ( 1 1 x 1 x2 x It is reasonable to assume that the sequences and x 2 are bounded. Then the first matrix > 8
> 8
>
8
>
Y;
x
x
22
Stochastic Convergence
on the right behaves like a fixed matrix, and the conditions for asymptotic normality simplify to
Every reasonable design satisfies these conditions. D
Convergence in Total Variation
*2.9
A sequence of random variables converges in total variation to a variable X if sup j P(Xn B
E
B) - P(X E B) \ -+ 0,
where the supremum is taken over all measurable sets B. In view of the portmanteau lemma, this type of convergence is stronger than convergence in distribution. Not only is it required that the sequence P(X E B) converges for every Borel set B , the convergence must also be uniform in B . Such strong convergence occurs less frequently and is often more than necessary, whence the concept is less useful. A simple sufficient condition for convergence in total variation is pointwise convergence of densities. If Xn and X have densities and p with respect to a measure f.L, then
n
Pn
sup j P(Xn B
E
B) - P(X E B) \
=
2 J I Pn - P I df.L.
�
Thus, convergence in total variation can be established by convergence theorems for inte grals from measure theory. The following proposition, which should be compared with the monotone and dominated convergence theorems, is most appropriate.
2.29 Proposition. Suppose that $f_n$ and $f$ are arbitrary measurable functions such that $f_n \to f$ $\mu$-almost everywhere (or in $\mu$-measure) and $\limsup \int |f_n|^p \, d\mu \le \int |f|^p \, d\mu < \infty$, for some $p \ge 1$ and measure $\mu$. Then $\int |f_n - f|^p \, d\mu \to 0$.

Proof. By the inequality $(a + b)^p \le 2^p a^p + 2^p b^p$, valid for every $a, b \ge 0$, and the assumption, $0 \le 2^p |f_n|^p + 2^p |f|^p - |f_n - f|^p \to 2^{p+1} |f|^p$ almost everywhere. By Fatou's lemma,

$\int 2^{p+1} |f|^p \, d\mu \le \liminf \int \bigl( 2^p |f_n|^p + 2^p |f|^p - |f_n - f|^p \bigr) \, d\mu \le 2^{p+1} \int |f|^p \, d\mu - \limsup \int |f_n - f|^p \, d\mu,$

by assumption. The proposition follows. $\blacksquare$
2.30 Corollary (Scheffé). Let $X_n$ and $X$ be random vectors with densities $p_n$ and $p$ with respect to a measure $\mu$. If $p_n \to p$ $\mu$-almost everywhere, then the sequence $X_n$ converges to $X$ in total variation.
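Scheffé's theorem can be illustrated numerically (an addition, not in the original text): the $t_\nu$-densities converge pointwise to the standard normal density as $\nu \to \infty$, so the total variation distance $\frac{1}{2}\int |p_\nu - p|\,d\mu$ must tend to zero. The integration range and grid below are arbitrary choices.

```python
import math

def tv_to_normal(nu, lo=-50.0, hi=50.0, steps=100_000):
    """(1/2) * integral of |t_nu density - standard normal density|."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        t_dens = c * (1 + x * x / nu) ** (-(nu + 1) / 2)
        n_dens = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        total += abs(t_dens - n_dens)
    return 0.5 * total * h

tvs = [tv_to_normal(nu) for nu in (1, 5, 50)]
assert tvs[0] > tvs[1] > tvs[2]   # total variation distance decreases
assert tvs[2] < 0.01              # already small at 50 degrees of freedom
```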
The central limit theorem is usually formulated in terms of convergence in distribution. Often it is valid in terms of the total variation distance, in the sense that

$\sup_B \Bigl| P(Y_1 + \cdots + Y_n \in B) - \int_B \frac{1}{\sqrt{2\pi n}\, \sigma} e^{-\frac{1}{2}(x - n\mu)^2 / (n\sigma^2)} \, dx \Bigr| \to 0.$

Here $\mu$ and $\sigma^2$ are the mean and variance of the $Y_i$, and the supremum is taken over all Borel sets. An integrable characteristic function, in addition to a finite second moment, suffices.
2.31 Theorem (Central limit theorem in total variation). Let $Y_1, Y_2, \ldots$ be i.i.d. random variables with finite second moment and characteristic function $\phi$ such that $\int |\phi(t)|^v \, dt < \infty$ for some $v \ge 1$. Then $Y_1 + \cdots + Y_n$ satisfies the central limit theorem in total variation.
Proof. It can be assumed without loss of generality that $\mathrm{E}Y_1 = 0$ and $\operatorname{var} Y_1 = 1$. By the inversion formula for characteristic functions (see [47, p. 509]), the density $p_n$ of $(Y_1 + \cdots + Y_n)/\sqrt n$ can be written

$p_n(x) = \frac{1}{2\pi} \int e^{-itx} \, \phi\Bigl(\frac{t}{\sqrt n}\Bigr)^n \, dt.$

By the central limit theorem and Lévy's continuity theorem, the integrand converges to $e^{-itx} \exp(-\frac{1}{2} t^2)$. It will be shown that the integral converges to

$\frac{1}{2\pi} \int e^{-itx} e^{-\frac{1}{2} t^2} \, dt = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}.$
Then an application of Scheffé's theorem concludes the proof.

The integral can be split into two parts. First, for every $\delta > 0$,

$\Bigl| \int_{|t| > \delta \sqrt n} e^{-itx} \, \phi\Bigl(\frac{t}{\sqrt n}\Bigr)^n \, dt \Bigr| \le \sqrt n \, \sup_{|t| > \delta} |\phi(t)|^{n - v} \int |\phi(t)|^v \, dt.$

Here $\sup_{|t| > \delta} |\phi(t)| < 1$ by the Riemann-Lebesgue lemma and because $\phi$ is the characteristic function of a nonlattice distribution (e.g., [47, pp. 501, 513]). Thus, the first part of the integral converges to zero geometrically fast. Second, a Taylor expansion yields that $\phi(t) = 1 - \frac{1}{2} t^2 + o(t^2)$ as $t \to 0$, so that there exists $\delta > 0$ such that $|\phi(t)| \le 1 - t^2/4$ for every $|t| < \delta$. It follows that, for every $|t| \le \delta \sqrt n$,

$\Bigl| \phi\Bigl(\frac{t}{\sqrt n}\Bigr)^n \Bigr| \le \Bigl(1 - \frac{t^2}{4n}\Bigr)^n \le e^{-t^2/4}.$
The proof can be concluded by applying the dominated convergence theorem to the remaining part of the integral. $\blacksquare$

Notes
The results of this chapter can be found in many introductions to probability theory. A standard reference for weak convergence theory is the first chapter of [11]. Another very readable introduction is [41]. The theory of this chapter is extended to random elements with values in general metric spaces in Chapter 18.
Stochastic Convergence
PROBLEMS

1. If $X_n$ possesses a $t$-distribution with $n$ degrees of freedom, then $X_n \rightsquigarrow N(0, 1)$ as $n \to \infty$. Show this.
2. Does it follow immediately from the result of the previous exercise that $\mathrm{E}X_n^p \to \mathrm{E}N(0, 1)^p$ for every $p \in \mathbb{N}$? Is this true?

3. If $X_n \rightsquigarrow N(0, 1)$ and $Y_n \xrightarrow{P} \sigma$, then $X_n Y_n \rightsquigarrow N(0, \sigma^2)$. Show this.
4. In what sense is a chi-square distribution with $n$ degrees of freedom approximately a normal distribution?
5. Find an example of sequences such that $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$, but the joint sequence $(X_n, Y_n)$ does not converge in law.

6. If $X_n$ and $Y_n$ are independent random vectors for every $n$, then $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ imply that $(X_n, Y_n) \rightsquigarrow (X, Y)$, where $X$ and $Y$ are independent. Show this.

7. If every $X_n$ and $X$ possess discrete distributions supported on the integers, then $X_n \rightsquigarrow X$ if and only if $P(X_n = x) \to P(X = x)$ for every integer $x$. Show this.

8. If $P(X_n = i/n) = 1/n$ for every $i = 1, 2, \ldots, n$, then $X_n \rightsquigarrow X$, but there exist Borel sets with $P(X_n \in B) = 1$ for every $n$, but $P(X \in B) = 0$. Show this.

9. If $P(X_n = x_n) = 1$ for numbers $x_n$ and $x_n \to x$, then $X_n \rightsquigarrow x$. Prove this (i) by considering distribution functions (ii) by using Theorem 2.7.
10. State the rule $o_P(1) + O_P(1) = O_P(1)$ in terms of random vectors and show its validity.

11. In what sense is it true that $o_P(1) = O_P(1)$? Is it true that $O_P(1) = o_P(1)$?

12. The rules given by Lemma 2.12 are not simple plug-in rules. (i) Give an example of a function $R$ with $R(h) = o(\|h\|)$ as $h \to 0$ and a sequence of random variables $X_n$ such that $R(X_n)$ is not equal to $o_P(X_n)$. (ii) Give an example of a function $R$ such that $R(h) = O(\|h\|)$ as $h \to 0$ and a sequence of random variables $X_n$ such that $X_n = O_P(1)$ but $R(X_n)$ is not equal to $O_P(X_n)$.

13. Find an example of a sequence of random variables such that $X_n \rightsquigarrow 0$, but $\mathrm{E}X_n \to \infty$.
14. Find an example of a sequence of random variables such that $X_n \xrightarrow{P} 0$, but $X_n$ does not converge almost surely.

15. Let $X_1, \ldots, X_n$ be i.i.d. with density $f_{\lambda, a}(x) = \lambda e^{-\lambda(x - a)} 1\{x \ge a\}$. Calculate the maximum likelihood estimator $(\hat\lambda_n, \hat a_n)$ of $(\lambda, a)$ and show that $(\hat\lambda_n, \hat a_n) \xrightarrow{P} (\lambda, a)$.

16. Let $X_1, \ldots, X_n$ be i.i.d. standard normal variables. Show that the vector $U = (X_1, \ldots, X_n)/N$, where $N^2 = \sum_{i=1}^n X_i^2$, is uniformly distributed over the unit sphere $S^{n-1}$ in $\mathbb{R}^n$, in the sense that $U$ and $OU$ are identically distributed for every orthogonal transformation $O$ of $\mathbb{R}^n$.

17. For each $n$, let $U_n$ be uniformly distributed over the unit sphere $S^{n-1}$ in $\mathbb{R}^n$. Show that the vectors $\sqrt n (U_{n,1}, U_{n,2})$ converge in distribution to a pair of independent standard normal variables.

18. If $\sqrt n (T_n - \theta)$ converges in distribution, then $T_n$ converges in probability to $\theta$. Show this.
19. If $\mathrm{E}X_n \to \mu$ and $\operatorname{var} X_n \to 0$, then $X_n \xrightarrow{P} \mu$. Show this.

20. If $\sum_{n=1}^\infty P(|X_n| > \varepsilon) < \infty$ for every $\varepsilon > 0$, then $X_n$ converges almost surely to zero. Show this.
21. Use characteristic functions to show that $\operatorname{binomial}(n, \lambda/n) \rightsquigarrow \operatorname{Poisson}(\lambda)$. Why does the central limit theorem not hold?

22. If $X_1, \ldots, X_n$ are i.i.d. standard Cauchy, then $\bar X_n$ is standard Cauchy. (i) Show this by using characteristic functions. (ii) Why does the weak law not hold?
23. Let $X_1, \ldots, X_n$ be i.i.d. with finite fourth moment. Find constants $a$, $b$, and $c_n$ such that the sequence $c_n(\bar X_n - a, \overline{X_n^2} - b)$ converges in distribution, and determine the limit law. Here $\bar X_n$ and $\overline{X_n^2}$ are the averages of the $X_i$ and the $X_i^2$, respectively.
3 Delta Method
The delta method consists of using a Taylor expansion to approximate a random vector of the form $\phi(T_n)$ by the polynomial $\phi(\theta) + \phi'(\theta)(T_n - \theta) + \cdots$ in $T_n - \theta$.

If the derivative $\phi'(\theta)$ vanishes, it may be possible to multiply $\phi(T_n) - \phi(\theta)$ by a higher rate and obtain a nondegenerate limit distribution. Looking at the Taylor expansion, we see that the linear term disappears if $\phi'(\theta) = 0$, and we expect that the quadratic term determines the limit behavior of $\phi(T_n)$.
3.7 Example. Suppose that $\sqrt n \bar X$ converges weakly to a standard normal distribution. Because the derivative of $x \mapsto \cos x$ is zero at $x = 0$, the standard delta method of the preceding section yields that $\sqrt n (\cos \bar X - \cos 0)$ converges weakly to 0. It should be concluded that $\sqrt n$ is not the right norming rate for the random sequence $\cos \bar X - 1$. A more informative statement is that $-2n(\cos \bar X - 1)$ converges in distribution to a chi-square distribution with one degree of freedom. The explanation is that

$\cos \bar X - \cos 0 = -(\bar X - 0) \sin 0 - \tfrac{1}{2} (\bar X - 0)^2 \cos 0 + \cdots = -\tfrac{1}{2} \bar X^2 + \cdots.$

That the remainder term is negligible after multiplication with $n$ can be shown along the same lines as the proof of Theorem 3.1. The sequence $n \bar X^2$ converges in law to a $\chi_1^2$ distribution by the continuous-mapping theorem; the sequence $-2n(\cos \bar X - 1)$ has the same limit, by Slutsky's lemma. $\Box$
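The claim of the example can be checked by a small simulation (an addition, not part of the book). Since $\bar X \sim N(0, 1/n)$ may be drawn directly, we verify that $-2n(\cos \bar X - 1)$ has approximately the mean (1) and median (about 0.4549) of a $\chi^2_1$ variable.

```python
import math
import random

random.seed(0)
n, reps = 400, 20_000
# Draw Xbar ~ N(0, 1/n) directly and form -2n(cos(Xbar) - 1).
vals = [-2 * n * (math.cos(random.gauss(0.0, 1.0) / math.sqrt(n)) - 1.0)
        for _ in range(reps)]

mean = sum(vals) / reps
frac_below = sum(v < 0.4549 for v in vals) / reps   # 0.4549 ~ chi^2_1 median
assert abs(mean - 1.0) < 0.06        # chi^2_1 has mean 1
assert abs(frac_below - 0.5) < 0.025
```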
A more complicated situation arises if the statistic $T_n$ is higher-dimensional with coordinates of different orders of magnitude. For instance, for a real-valued function $\phi$,

$\phi(T_n) - \phi(\theta) = \sum_i \frac{\partial \phi}{\partial \theta_i}(\theta)\, (T_{n,i} - \theta_i) + \frac{1}{2} \sum_{i,j} \frac{\partial^2 \phi}{\partial \theta_i \partial \theta_j}(\theta)\, (T_{n,i} - \theta_i)(T_{n,j} - \theta_j) + \cdots.$

If the sequences $T_{n,i} - \theta_i$ are of different order, then it may happen, for instance, that the linear part involving $T_{n,i} - \theta_i$ is of the same order as the quadratic part involving $(T_{n,j} - \theta_j)^2$. Thus, it is necessary to determine carefully the rate of all terms in the expansion, and to rearrange these in decreasing order of magnitude, before neglecting the "remainder."
*3.4 Uniform Delta Method
Sometimes we wish to prove the asymptotic normality of a sequence $\sqrt n (\phi(T_n) - \phi(\theta_n))$ for centering vectors $\theta_n$ changing with $n$, rather than a fixed vector. If $\sqrt n (\theta_n - \theta) \to h$ for certain vectors $\theta$ and $h$, then this can be handled easily by decomposing

$\sqrt n \bigl( \phi(T_n) - \phi(\theta_n) \bigr) = \sqrt n \bigl( \phi(T_n) - \phi(\theta) \bigr) - \sqrt n \bigl( \phi(\theta_n) - \phi(\theta) \bigr).$
Several applications of Slutsky's lemma and the delta method yield as limit in law the vector $\phi_\theta'(T + h) - \phi_\theta'(h) = \phi_\theta'(T)$, if $T$ is the limit in distribution of $\sqrt n (T_n - \theta_n)$. For $\theta_n \to \theta$ at a slower rate, this argument does not work. However, the same result is true under a slightly stronger differentiability assumption on $\phi$.
3.8 Theorem. Let $\phi: \mathbb{R}^k \mapsto \mathbb{R}^m$ be a map defined and continuously differentiable in a neighborhood of $\theta$. Let $T_n$ be random vectors taking their values in the domain of $\phi$. If $r_n(T_n - \theta_n) \rightsquigarrow T$ for vectors $\theta_n \to \theta$ and numbers $r_n \to \infty$, then $r_n(\phi(T_n) - \phi(\theta_n)) \rightsquigarrow \phi_\theta'(T)$. Moreover, the difference between $r_n(\phi(T_n) - \phi(\theta_n))$ and $\phi_\theta'(r_n(T_n - \theta_n))$ converges to zero in probability.
Proof. It suffices to prove the last assertion. Because convergence in probability to zero of vectors is equivalent to convergence to zero of the components separately, it is no loss of generality to assume that $\phi$ is real-valued. For $0 \le t \le 1$ and fixed $h$, define $g_n(t) = \phi(\theta_n + th)$. For sufficiently large $n$ and sufficiently small $h$, both $\theta_n$ and $\theta_n + h$ are in a ball around $\theta$ inside the neighborhood on which $\phi$ is differentiable. Then $g_n: [0, 1] \mapsto \mathbb{R}$ is continuously differentiable with derivative $g_n'(t) = \phi_{\theta_n + th}'(h)$. By the mean-value theorem, $g_n(1) - g_n(0) = g_n'(\xi)$ for some $0 \le \xi \le 1$. In other words,

$\phi(\theta_n + h) - \phi(\theta_n) = \phi_{\theta_n + \xi h}'(h) = \phi_\theta'(h) + R_n(h).$

By the continuity of the map $\theta \mapsto \phi_\theta'$, there exists for every $\varepsilon > 0$ a $\delta > 0$ such that $\|\phi_\varsigma'(h) - \phi_\theta'(h)\| < \varepsilon \|h\|$ for every $\|\varsigma - \theta\| < \delta$ and every $h$. For sufficiently large $n$ and $\|h\| < \delta/2$, the vectors $\theta_n + \xi h$ are within distance $\delta$ of $\theta$, so that the norm $\|R_n(h)\|$ of the right side of the preceding display is bounded by $\varepsilon \|h\|$. Thus, for any $\eta > 0$,

$P\Bigl( \bigl\| r_n\bigl(\phi(T_n) - \phi(\theta_n)\bigr) - \phi_\theta'\bigl(r_n(T_n - \theta_n)\bigr) \bigr\| > \eta \Bigr) \le P\bigl( \|T_n - \theta_n\| \ge \delta/2 \bigr) + P\bigl( \varepsilon \| r_n(T_n - \theta_n) \| > \eta \bigr).$

The first term converges to zero as $n \to \infty$. The second term can be made arbitrarily small by choosing $\varepsilon$ small. $\blacksquare$

*3.5 Moments
So far we have discussed the stability of convergence in distribution under transformations. We can pose the same problem regarding moments: Can an expansion for the moments of $\phi(T_n) - \phi(\theta)$ be derived from a similar expansion for the moments of $T_n - \theta$? In principle the answer is affirmative, but unlike in the distributional case, in which a simple derivative of $\phi$ is enough, global regularity conditions on $\phi$ are needed to argue that the remainder terms are negligible. One possible approach is to apply the distributional delta method first, thus yielding the qualitative asymptotic behavior. Next, the convergence of the moments of $\phi(T_n) - \phi(\theta)$ (or a remainder term) is a matter of uniform integrability, in view of Lemma 2.20. If $\phi$ is uniformly Lipschitz, then this uniform integrability follows from the corresponding uniform integrability of $T_n - \theta$. If $\phi$ has an unbounded derivative, then the connection between moments of $\phi(T_n) - \phi(\theta)$ and $T_n - \theta$ is harder to make, in general.
Notes
The delta method belongs to the folklore of statistics. It is not entirely trivial; proofs are sometimes based on the mean-value theorem and then require continuous differentiability in a neighborhood. A generalization to functions on infinite-dimensional spaces is discussed in Chapter 20.

PROBLEMS
1. Find the joint limit distribution of $(\sqrt n(\bar X - \mu), \sqrt n(S^2 - \sigma^2))$, if $\bar X$ and $S^2$ are based on a sample of size $n$ from a distribution with finite fourth moment. Under what condition on the underlying distribution are $\sqrt n(\bar X - \mu)$ and $\sqrt n(S^2 - \sigma^2)$ asymptotically independent?

2. Find the asymptotic distribution of $\sqrt n (r - \rho)$ if $r$ is the correlation coefficient of a sample of $n$ bivariate vectors with finite fourth moments. (This is quite a bit of work. It helps to assume that the mean and the variance are equal to 0 and 1, respectively.)
3. Investigate the asymptotic robustness of the level of the $t$-test for testing the mean that rejects $H_0: \mu \le 0$ if $\sqrt n \bar X / S$ is larger than the upper $\alpha$ quantile of the $t_{n-1}$ distribution.

4. Find the limit distribution of the sample kurtosis $k_n = n^{-1} \sum_{i=1}^n (X_i - \bar X)^4 / S^4 - 3$, and design an asymptotic level $\alpha$ test for normality based on $k_n$. (Warning: At least 500 observations are needed to make the normal approximation work in this case.)

5. Design an asymptotic level $\alpha$ test for normality based on the sample skewness and kurtosis jointly.

6. Let $X_1, \ldots, X_n$ be i.i.d. with expectation $\mu$ and variance 1. Find constants $a_n$ and $b_n$ such that $a_n(\bar X_n^2 - b_n)$ converges in distribution if $\mu = 0$ or $\mu \ne 0$.

7. Let $X_1, \ldots, X_n$ be a random sample from the Poisson distribution with mean $\theta$. Find a variance-stabilizing transformation for the sample mean, and construct a confidence interval for $\theta$ based on this.
8. Let $X_1, \ldots, X_n$ be i.i.d. with expectation 1 and finite variance. Find the limit distribution of $\sqrt n (\bar X_n^{-1} - 1)$. If the random variables are sampled from a density $f$ that is bounded and strictly positive in a neighborhood of zero, show that $\mathrm{E} |\bar X_n^{-1}| = \infty$ for every $n$. (The density of $\bar X_n$ is bounded away from zero in a neighborhood of zero for every $n$.)
4 Moment Estimators
The method of moments determines estimators by comparing sample and theoretical moments. Moment estimators are useful for their simplicity, although not always optimal. Maximum likelihood estimators for full exponential families are moment estimators, and their asymptotic normality can be proved by treating them as such.

4.1 Method of Moments
Let $X_1, \ldots, X_n$ be a sample from a distribution $P_\theta$ that depends on a parameter $\theta$, ranging over some set $\Theta$. The method of moments consists of estimating $\theta$ by the solution of a system of equations

$\frac{1}{n} \sum_{i=1}^n f_j(X_i) = \mathrm{E}_\theta f_j(X), \qquad j = 1, \ldots, k,$
for given functions $f_1, \ldots, f_k$. Thus the parameter is chosen such that the sample moments (on the left side) match the theoretical moments. If the parameter is $k$-dimensional one usually tries to match $k$ moments in this manner. The choices $f_j(x) = x^j$ lead to the method of moments in its simplest form.

Moment estimators are not necessarily the best estimators, but under reasonable conditions they have convergence rate $\sqrt n$ and are asymptotically normal. This is a consequence of the delta method. Write the given functions in the vector notation $f = (f_1, \ldots, f_k)$, and let $e: \Theta \mapsto \mathbb{R}^k$ be the vector-valued expectation $e(\theta) = P_\theta f$. Then the moment estimator $\hat\theta_n$ solves the system of equations

$\mathbb{P}_n f = \frac{1}{n} \sum_{i=1}^n f(X_i) = e(\theta) = P_\theta f.$

For existence of the moment estimator, it is necessary that the vector $\mathbb{P}_n f$ be in the range of the function $e$. If $e$ is one-to-one, then the moment estimator is uniquely determined as $\hat\theta_n = e^{-1}(\mathbb{P}_n f)$ and

$\hat\theta_n - \theta = e^{-1}(\mathbb{P}_n f) - e^{-1}(P_\theta f).$

If $\mathbb{P}_n f$ is asymptotically normal and $e^{-1}$ is differentiable, then the right side is asymptotically normal by the delta method.
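A minimal sketch of the method (not from the book): for the beta$(\alpha, \beta)$ family with $f_1(x) = x$ and $f_2(x) = x^2$, the map $e$ is invertible in closed form, so the moment estimator can be written down explicitly. The sample size and true parameters below are arbitrary choices.

```python
import random

def beta_moment_estimator(xs):
    """Solve e(alpha, beta) = (m1, m2) in closed form via mean and variance."""
    n = len(xs)
    m1 = sum(xs) / n
    v = sum(x * x for x in xs) / n - m1 * m1
    k = m1 * (1 - m1) / v - 1
    return m1 * k, (1 - m1) * k

random.seed(1)
sample = [random.betavariate(2.0, 5.0) for _ in range(50_000)]
a_hat, b_hat = beta_moment_estimator(sample)
assert abs(a_hat - 2.0) < 0.15   # close to the true alpha
assert abs(b_hat - 5.0) < 0.4    # close to the true beta
```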
The derivative of $e^{-1}$ at $e(\theta_0)$ is the inverse $e_{\theta_0}'^{-1}$ of the derivative of $e$ at $\theta_0$. Because the function $e^{-1}$ is often not explicit, it is convenient to ascertain its differentiability from the differentiability of $e$. This is possible by the inverse function theorem. According to this theorem a map that is (continuously) differentiable throughout an open set with nonsingular derivatives is locally one-to-one, is of full rank, and has a differentiable inverse. Thus we obtain the following theorem.
4.1 Theorem. Suppose that $e(\theta) = P_\theta f$ is one-to-one on an open set $\Theta \subset \mathbb{R}^k$ and continuously differentiable at $\theta_0$ with nonsingular derivative $e_{\theta_0}'$. Moreover, assume that $P_{\theta_0} \|f\|^2 < \infty$. Then moment estimators $\hat\theta_n$ exist with probability tending to one and satisfy

$\sqrt n (\hat\theta_n - \theta_0) \rightsquigarrow N\bigl(0, \; e_{\theta_0}'^{-1} \operatorname{Cov}_{\theta_0} f(X_1)\, (e_{\theta_0}'^{-1})^T \bigr).$
Proof. Continuous differentiability at $\theta_0$ presumes differentiability in a neighborhood, and the continuity of $\theta \mapsto e_\theta'$ and nonsingularity of $e_{\theta_0}'$ imply nonsingularity in a neighborhood. Therefore, by the inverse function theorem there exist open neighborhoods $U$ of $\theta_0$ and $V$ of $P_{\theta_0} f$ such that $e: U \mapsto V$ is a differentiable bijection with a differentiable inverse $e^{-1}: V \mapsto U$. Moment estimators $\hat\theta_n = e^{-1}(\mathbb{P}_n f)$ exist as soon as $\mathbb{P}_n f \in V$, which happens with probability tending to 1 by the law of large numbers. The central limit theorem guarantees asymptotic normality of the sequence $\sqrt n (\mathbb{P}_n f - P_{\theta_0} f)$. Next use Theorem 3.1 on the display preceding the statement of the theorem. $\blacksquare$
For completeness, the following two lemmas constitute, if combined, a proof of the inverse function theorem. If necessary the preceding theorem can be strengthened somewhat by applying the lemmas directly. Furthermore, the first lemma can be easily generalized to infinite-dimensional parameters, such as used in the semiparametric models discussed in Chapter 25.
4.2 Lemma. Let $\Theta \subset \mathbb{R}^k$ be arbitrary and let $e: \Theta \mapsto \mathbb{R}^k$ be one-to-one and differentiable at a point $\theta_0$ with a nonsingular derivative. Then the inverse $e^{-1}$ (defined on the range of $e$) is differentiable at $e(\theta_0)$ provided it is continuous at $e(\theta_0)$.

Proof. Write $\eta = e(\theta_0)$ and $\Delta h = e^{-1}(\eta + h) - e^{-1}(\eta)$. Because $e^{-1}$ is continuous at $\eta$, we have that $\Delta h \to 0$ as $h \to 0$. Thus

$h = e(\theta_0 + \Delta h) - e(\theta_0) = e_{\theta_0}'(\Delta h) + o(\|\Delta h\|)$

as $h \to 0$, where the last step follows from differentiability of $e$. The displayed equation can be rewritten as $e_{\theta_0}'(\Delta h) = h + o(\|\Delta h\|)$. By continuity of the inverse of $e_{\theta_0}'$, this implies that

$\Delta h = e_{\theta_0}'^{-1}(h) + o(\|\Delta h\|).$

In particular, $\|\Delta h\| \bigl(1 + o(1)\bigr) \le \|e_{\theta_0}'^{-1}(h)\| = O(\|h\|)$. Insert this in the displayed equation to obtain the desired result that $\Delta h = e_{\theta_0}'^{-1}(h) + o(\|h\|)$. $\blacksquare$
4.3 Lemma. Let $e: \Theta \mapsto \mathbb{R}^k$ be defined and differentiable in a neighborhood of a point $\theta_0$ and continuously differentiable at $\theta_0$ with a nonsingular derivative. Then $e$ maps every sufficiently small open neighborhood $U$ of $\theta_0$ onto an open set $V$ and $e^{-1}: V \mapsto U$ is well defined and continuous.

Proof. By assumption, $e_\theta' \to A^{-1} := e_{\theta_0}'$ as $\theta \to \theta_0$. Thus $\|I - A e_\theta'\| \le \frac{1}{2}$ for every $\theta$ in a sufficiently small neighborhood $U$ of $\theta_0$. Fix an arbitrary point $\eta_1 = e(\theta_1)$ from $V = e(U)$ (where $\theta_1 \in U$). Next find an $\varepsilon > 0$ such that $\operatorname{ball}(\theta_1, \varepsilon) \subset U$, and fix an arbitrary point $\eta$ with $\|\eta - \eta_1\| < \delta := \frac{1}{2} \|A\|^{-1} \varepsilon$. It will be shown that $\eta = e(\theta)$ for some point $\theta \in \operatorname{ball}(\theta_1, \varepsilon)$. Hence every $\eta \in \operatorname{ball}(\eta_1, \delta)$ has an original in $\operatorname{ball}(\theta_1, \varepsilon)$. If $e$ is one-to-one on $U$, so that the original is unique, then it follows that $V$ is open and that $e^{-1}$ is continuous at $\eta_1$.

Define a function $\phi(\theta) = \theta + A(\eta - e(\theta))$. Because the norm of the derivative $\phi_\theta' = I - A e_\theta'$ is bounded by $\frac{1}{2}$ throughout $U$, the map $\phi$ is a contraction on $U$. Furthermore, if $\|\theta - \theta_1\| \le \varepsilon$,

$\|\phi(\theta) - \theta_1\| \le \|\phi(\theta) - \phi(\theta_1)\| + \|A(\eta - \eta_1)\| \le \tfrac{1}{2} \|\theta - \theta_1\| + \|A\| \, \|\eta - \eta_1\| \le \varepsilon.$

Consequently, $\phi$ maps $\operatorname{ball}(\theta_1, \varepsilon)$ into itself. Because $\phi$ is a contraction, it has a fixed point $\theta \in \operatorname{ball}(\theta_1, \varepsilon)$: a point with $\phi(\theta) = \theta$. By definition of $\phi$ this satisfies $e(\theta) = \eta$. Any other $\tilde\theta$ with $e(\tilde\theta) = \eta$ is also a fixed point of $\phi$. In that case the difference $\tilde\theta - \theta = \phi(\tilde\theta) - \phi(\theta)$ has norm bounded by $\frac{1}{2} \|\tilde\theta - \theta\|$. This can only happen if $\tilde\theta = \theta$. Hence $e$ is one-to-one throughout $U$. $\blacksquare$
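The contraction used in the proof is constructive, so it can be run numerically. The map $e$ below is an illustrative choice (not from the book); $A$ is the inverse derivative of $e$ at $\theta_0$, and iterating $\phi(\theta) = \theta + A(\eta - e(\theta))$ recovers $e^{-1}(\eta)$.

```python
import math

def e(t):
    """An illustrative differentiable, locally invertible map (not from the book)."""
    return (t[0] + 0.1 * math.sin(t[1]), t[1] + 0.1 * t[0] ** 2)

theta0 = (1.0, 1.0)
# Derivative (Jacobian) of e at theta0 and its inverse A.
j = ((1.0, 0.1 * math.cos(theta0[1])), (0.2 * theta0[0], 1.0))
det = j[0][0] * j[1][1] - j[0][1] * j[1][0]
A = ((j[1][1] / det, -j[0][1] / det), (-j[1][0] / det, j[0][0] / det))

target = (1.05, 0.95)
eta = e(target)                      # we will recover `target` from eta
t = theta0
for _ in range(50):                  # iterate phi(theta) = theta + A(eta - e(theta))
    r = (eta[0] - e(t)[0], eta[1] - e(t)[1])
    t = (t[0] + A[0][0] * r[0] + A[0][1] * r[1],
         t[1] + A[1][0] * r[0] + A[1][1] * r[1])

assert abs(t[0] - target[0]) < 1e-10 and abs(t[1] - target[1]) < 1e-10
```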
4.4 Example. Let $X_1, \ldots, X_n$ be a random sample from the beta-distribution: The common density is equal to

$x \mapsto \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} 1\{0 < x < 1\}.$

\ldots there exists, for every $\delta > 0$, an open neighborhood $G_\delta$ of $\theta_0$ such that the map $\Psi: G_\delta \mapsto \operatorname{ball}(0, \delta)$ is a homeomorphism. The diameter of $G_\delta$ is bounded by a multiple of $\delta$, by the mean-value theorem and the fact that the norms of the derivatives $(\Psi_\theta')^{-1}$ of the inverse $\Psi^{-1}$ are bounded. Combining the preceding Taylor expansion with a similar expansion for the sample version $\Psi_n(\theta) = \mathbb{P}_n \psi_\theta$, we see

$\sup_{\theta \in G_\delta} \|\Psi_n(\theta) - \Psi(\theta)\| \le o_P(1) + \delta\, o_P(1) + \delta^2 O_P(1),$

where the $o_P(1)$ terms and the $O_P(1)$ term result from the law of large numbers, and are uniform in small $\delta$. Because $P\bigl(o_P(1) + \delta\, o_P(1) > \frac{1}{2}\delta\bigr) \to 0$ for every $\delta > 0$, there exists $\delta_n \downarrow 0$ such that $P\bigl(o_P(1) + \delta_n o_P(1) > \frac{1}{2}\delta_n\bigr) \to 0$. If $K_{n,\delta}$ is the event where the left side of the preceding display is bounded above by $\delta$, then $P(K_{n,\delta_n}) \to 1$ as $n \to \infty$. On the event $K_{n,\delta}$ the map $\eta \mapsto \eta - \Psi_n \circ \Psi^{-1}(\eta)$ maps $\operatorname{ball}(0, \delta)$ into itself, by the definitions of $G_\delta$ and $K_{n,\delta}$. Because the map is also continuous, it possesses a fixed point in $\operatorname{ball}(0, \delta)$, by Brouwer's fixed point theorem. This yields a zero of $\Psi_n$ in the set $G_\delta$, whence the first assertion of the theorem.

For the final assertion, first note that the Hessian $P \dot\psi_{\theta_0}$ of $\theta \mapsto P m_\theta$ at $\theta_0$ is negative definite, by assumption. A Taylor expansion as in the proof of Theorem 5.41 shows that $\mathbb{P}_n \dot\psi_{\hat\theta_n} - \mathbb{P}_n \dot\psi_{\theta_0} \xrightarrow{P} 0$ for every $\hat\theta_n \xrightarrow{P} \theta_0$. Hence the Hessian $\mathbb{P}_n \dot\psi_{\hat\theta_n}$ of $\theta \mapsto \mathbb{P}_n m_\theta$ at any consistent zero $\hat\theta_n$ converges in probability to the negative-definite matrix $P \dot\psi_{\theta_0}$ and is negative-definite with probability tending to 1. $\blacksquare$
The assertion of the theorem that there exists a consistent sequence of roots of the estimating equations is easily misunderstood. It does not guarantee the existence of an asymptotically consistent sequence of estimators. The only claim is that a clairvoyant statistician (with preknowledge of $\theta_0$) can choose a consistent sequence of roots. In reality, it may be impossible to choose the right solutions based only on the data (and knowledge of the model). In this sense the preceding theorem, a standard result in the literature, looks better than it is. The situation is not as bad as it seems. One interesting situation is if the solution of the estimating equation is unique for every $n$. Then our solutions must be the same as those of the clairvoyant statistician and hence the sequence of solutions is consistent.
M- and Z-Estimators
In general, the deficit can be repaired with the help of a preliminary sequence of estimators $\tilde\theta_n$. If the sequence $\tilde\theta_n$ is consistent, then it works to choose the root $\hat\theta_n$ of $\mathbb{P}_n \psi_\theta = 0$ that is closest to $\tilde\theta_n$. Because $\|\hat\theta_n - \tilde\theta_n\|$ is smaller than the distance $\|\hat\theta_n^* - \tilde\theta_n\|$ between the clairvoyant sequence $\hat\theta_n^*$ and $\tilde\theta_n$, both distances converge to zero in probability. Thus the sequence of closest roots is consistent.

The assertion of the theorem can also be used in a negative direction. The point $\theta_0$ in the theorem is required to be a zero of $\theta \mapsto P \psi_\theta$, but, apart from that, it may be arbitrary. Thus, the theorem implies at the same time that a malicious statistician can always choose a sequence of roots $\hat\theta_n$ that converges to any given zero. These may include other points besides the "true" value of $\theta$. Furthermore, inspection of the proof shows that the sequence of roots can also be chosen to jump back and forth between two (or more) zeros. If the function $\theta \mapsto P \psi_\theta$ has multiple roots, we must exercise care. We can be sure that certain roots of $\theta \mapsto \mathbb{P}_n \psi_\theta$ are bad estimators.

Part of the problem here is caused by using estimating equations, rather than maximization, to find estimators, which blurs the distinction between points of absolute maximum, local maximum, and even minimum. In the light of the results on consistency in section 5.2, we may expect the location of the point of absolute maximum of $\theta \mapsto \mathbb{P}_n m_\theta$ to converge to a point of absolute maximum of $\theta \mapsto P m_\theta$. As long as this is unique, the absolute maximizers of the criterion function are typically consistent.
5.43 Example (Weibull distribution). Let $X_1, \ldots, X_n$ be a sample from the Weibull distribution with density

$p_{\theta, \sigma}(x) = \frac{\theta}{\sigma} x^{\theta - 1} e^{-x^\theta / \sigma}, \qquad x > 0, \; \theta > 0, \; \sigma > 0.$

(Then $\sigma^{1/\theta}$ is a scale parameter.) The score function is given by the partial derivatives of the log density with respect to $\theta$ and $\sigma$:

$\dot\ell_{\theta, \sigma}(x) = \Bigl( \frac{1}{\theta} + \log x - \frac{x^\theta}{\sigma} \log x, \; -\frac{1}{\sigma} + \frac{x^\theta}{\sigma^2} \Bigr).$

The likelihood equations $\sum_i \dot\ell_{\theta, \sigma}(x_i) = 0$ reduce to

$\sigma = \frac{1}{n} \sum_{i=1}^n x_i^\theta; \qquad \frac{1}{\theta} + \frac{1}{n} \sum_{i=1}^n \log x_i - \frac{\sum_{i=1}^n x_i^\theta \log x_i}{\sum_{i=1}^n x_i^\theta} = 0.$

The second equation is strictly decreasing in $\theta$, from $\infty$ at $\theta = 0$ to $\overline{\log x} - \log x_{(n)}$ at $\theta = \infty$. Hence a solution exists, and is unique, unless all $x_i$ are equal. Provided the higher-order derivatives of the score function exist and can be dominated, the sequence of maximum likelihood estimators $(\hat\theta_n, \hat\sigma_n)$ is asymptotically normal by Theorems 5.41 and 5.42. There exist four different third-order derivatives, given by

$\frac{\partial^3 \ell_{\theta, \sigma}(x)}{\partial \theta^3} = \frac{2}{\theta^3} - \frac{x^\theta}{\sigma} \log^3 x,$

$\frac{\partial^3 \ell_{\theta, \sigma}(x)}{\partial \theta^2 \partial \sigma} = \frac{x^\theta}{\sigma^2} \log^2 x,$

$\frac{\partial^3 \ell_{\theta, \sigma}(x)}{\partial \theta \partial \sigma^2} = -\frac{2 x^\theta}{\sigma^3} \log x,$

$\frac{\partial^3 \ell_{\theta, \sigma}(x)}{\partial \sigma^3} = -\frac{2}{\sigma^3} + \frac{6 x^\theta}{\sigma^4}.$
For $\theta$ and $\sigma$ ranging over sufficiently small neighborhoods of $\theta_0$ and $\sigma_0$, these functions are dominated by a function of the form

$M(x) = A (1 + x^B) \bigl( 1 + |\log x| + \cdots + |\log x|^3 \bigr),$

for sufficiently large $A$ and $B$. Because the Weibull distribution has an exponentially small tail, the mixed moment $\mathrm{E}_{\theta_0, \sigma_0} X^p |\log X|^q$ is finite for every $p, q \ge 0$. Thus, all moments of $\dot\ell_\theta$ and $\ddot\ell_\theta$ exist and $M$ is integrable. $\Box$
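The monotonicity of the second likelihood equation makes the system easy to solve numerically: profile out $\sigma = n^{-1}\sum_i x_i^\theta$ and bisect on $\theta$. The sketch below (an addition, not part of the book) uses arbitrary illustrative data.

```python
import math

def profile_equation(theta, xs):
    """Second likelihood equation with sigma profiled out; strictly decreasing in theta."""
    s = sum(x ** theta for x in xs)
    s_log = sum(x ** theta * math.log(x) for x in xs)
    return 1.0 / theta + sum(math.log(x) for x in xs) / len(xs) - s_log / s

xs = [0.3, 0.8, 1.1, 1.7, 2.4, 0.6, 1.3, 0.9]   # illustrative data
lo, hi = 1e-6, 50.0
for _ in range(200):                             # bisection on theta
    mid = (lo + hi) / 2
    if profile_equation(mid, xs) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2
sigma_hat = sum(x ** theta_hat for x in xs) / len(xs)

assert abs(profile_equation(theta_hat, xs)) < 1e-8   # a root of the equation
assert 1.0 < theta_hat < 3.0 and sigma_hat > 0
```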
*5.7 One-Step Estimators

The method of Z-estimation as discussed so far has two disadvantages. First, it may be hard to find the roots of the estimating equations. Second, for the roots to be consistent, the estimating equation needs to behave well throughout the parameter set. For instance, existence of a second root close to the boundary of the parameter set may cause trouble. The one-step method overcomes these problems by building on and improving a preliminary estimator $\tilde\theta_n$. The idea is to solve the estimator from a linear approximation to the original estimating equation $\Psi_n(\theta) = 0$. Given a preliminary estimator $\tilde\theta_n$, the one-step estimator is the solution (in $\theta$) to

$\Psi_n(\tilde\theta_n) + \dot\Psi_n(\tilde\theta_n)(\theta - \tilde\theta_n) = 0.$

This corresponds to replacing $\Psi_n(\theta)$ by its tangent at $\tilde\theta_n$, and is known as the method of Newton-Raphson in numerical analysis. The solution $\theta = \hat\theta_n$ is

$\hat\theta_n = \tilde\theta_n - \dot\Psi_n(\tilde\theta_n)^{-1} \Psi_n(\tilde\theta_n).$

In numerical analysis this procedure is iterated a number of times, taking $\hat\theta_n$ as the new preliminary guess, and so on. Provided that the starting point $\tilde\theta_n$ is well chosen, the sequence of solutions converges to a root of $\Psi_n$. Our interest here goes in a different direction. We suppose that the preliminary estimator $\tilde\theta_n$ is already within range $n^{-1/2}$ of the true value of $\theta$. Then, as we shall see, just one iteration of the Newton-Raphson scheme produces an estimator $\hat\theta_n$ that is as good as the Z-estimator defined by $\Psi_n$. In fact, it is better in that its consistency is guaranteed, whereas the true Z-estimator may be inconsistent or not uniquely defined. In this way consistency and asymptotic normality are effectively separated, which is useful because these two aims require different properties of the estimating equations. Good initial estimators can be constructed by ad-hoc methods and take care of consistency. Next, these initial estimators can be improved by the one-step method. Thus, for instance, the good properties of maximum likelihood estimation can be retained, even in cases in which the consistency fails.

In this section we impose the following condition on the random criterion functions $\Psi_n$. For every constant $M$ and a given nonsingular matrix $\dot\Psi_0$,
$\sup_{\sqrt n \|\theta - \theta_0\| < M} \bigl\| \sqrt n \bigl( \Psi_n(\theta) - \Psi_n(\theta_0) \bigr) - \dot\Psi_0 \sqrt n (\theta - \theta_0) \bigr\| \xrightarrow{P} 0. \qquad (5.44)$
Condition (5.44) suggests that $\Psi_n$ is differentiable at $\theta_0$, with derivative tending to $\dot\Psi_0$, but this is not an assumption. We do not require that a derivative $\dot\Psi_n$ exists, and introduce a further refinement of the Newton-Raphson scheme by replacing $\dot\Psi_n(\tilde\theta_n)$ by arbitrary estimators. Given nonsingular, random matrices $\dot\Psi_{n,0}$ that converge in probability to $\dot\Psi_0$, define the one-step estimator

$\hat\theta_n = \tilde\theta_n - \dot\Psi_{n,0}^{-1} \Psi_n(\tilde\theta_n).$

Call an estimator sequence $\tilde\theta_n$ $\sqrt n$-consistent if the sequence $\sqrt n (\tilde\theta_n - \theta_0)$ is uniformly tight. The interpretation is that $\tilde\theta_n$ already determines the value $\theta_0$ within $n^{-1/2}$-range.
5.45 Theorem (One-step estimation). Let $\sqrt n \Psi_n(\theta_0) \rightsquigarrow Z$ and let (5.44) hold. Then the one-step estimator $\hat\theta_n$, for a given $\sqrt n$-consistent estimator sequence $\tilde\theta_n$ and estimators $\dot\Psi_{n,0} \xrightarrow{P} \dot\Psi_0$, satisfies

$\sqrt n (\hat\theta_n - \theta_0) \rightsquigarrow -\dot\Psi_0^{-1} Z.$

5.46 Addendum. For $\Psi_n(\theta) = \mathbb{P}_n \psi_\theta$ condition (5.44) is satisfied under the conditions of Theorem 5.21 with $\dot\Psi_0 = V_{\theta_0}$, and under the conditions of Theorem 5.41 with $\dot\Psi_0 = P \dot\psi_{\theta_0}$.

Proof. The standardized estimator $\dot\Psi_{n,0} \sqrt n (\hat\theta_n - \theta_0)$ equals

$\dot\Psi_{n,0} \sqrt n (\tilde\theta_n - \theta_0) - \sqrt n \Psi_n(\tilde\theta_n).$

By (5.44) the second term can be replaced by $-\sqrt n \Psi_n(\theta_0) - \dot\Psi_0 \sqrt n (\tilde\theta_n - \theta_0) + o_P(1)$. Thus the expression can be rewritten as

$(\dot\Psi_{n,0} - \dot\Psi_0) \sqrt n (\tilde\theta_n - \theta_0) - \sqrt n \Psi_n(\theta_0) + o_P(1).$

The first term converges to zero in probability, and the theorem follows after application of Slutsky's lemma. For a proof of the addendum, see the proofs of the corresponding theorems. $\blacksquare$

If the sequence $\sqrt n (\tilde\theta_n - \theta_0)$ converges in distribution, then it is certainly uniformly tight. Consequently, a sequence of one-step estimators is $\sqrt n$-consistent and can itself be used as preliminary estimator for a second iteration of the modified Newton-Raphson algorithm. Presumably, this would give a value closer to a root of $\Psi_n$. However, the limit distribution of this "two-step estimator" is the same, so that repeated iteration does not give asymptotic improvement. In practice a multistep method may nevertheless give better results.

We close this section with a discussion of the discretization trick. This device is mostly of theoretical value and has been introduced to relax condition (5.44) to the following. For every nonrandom sequence $\theta_n = \theta_0 + O(n^{-1/2})$,

$\sqrt n \bigl( \Psi_n(\theta_n) - \Psi_n(\theta_0) \bigr) - \dot\Psi_0 \sqrt n (\theta_n - \theta_0) \xrightarrow{P} 0. \qquad (5.47)$

This new condition is less stringent and much easier to check. It is sufficiently strong if the preliminary estimators $\tilde\theta_n$ are discretized on grids of mesh width $n^{-1/2}$. For instance, $\tilde\theta_n$ is suitably discretized if all its realizations are points of the grid $n^{-1/2} \mathbb{Z}^k$ (consisting of the points $n^{-1/2}(i_1, \ldots, i_k)$ for integers $i_1, \ldots, i_k$). This is easy to achieve, but perhaps unnatural. Any preliminary estimator sequence $\tilde\theta_n$ can be discretized by replacing its values
by the closest points of the grid. Because this changes each coordinate by at most $n^{-1/2}$, $\sqrt n$-consistency of $\tilde\theta_n$ is retained by discretization. Define a one-step estimator $\hat\theta_n$ as before, but now use a discretized version of the preliminary estimator.
5.48 Theorem (Discretized one-step estimation). Let $\sqrt n \Psi_n(\theta_0) \rightsquigarrow Z$ and let (5.47) hold. Then the one-step estimator $\hat\theta_n$, for a given $\sqrt n$-consistent, discretized estimator sequence $\tilde\theta_n$ and estimators $\dot\Psi_{n,0} \xrightarrow{P} \dot\Psi_0$, satisfies

$\sqrt n (\hat\theta_n - \theta_0) \rightsquigarrow -\dot\Psi_0^{-1} Z.$
5.49 Addendum. For $\Psi_n(\theta) = \mathbb{P}_n \psi_\theta$ and $\mathbb{P}_n$ the empirical measure of a random sample from a density $p_\theta$ that is differentiable in quadratic mean (5.38), condition (5.47) is satisfied, with $\dot\Psi_0 = -P_{\theta_0} \psi_{\theta_0} \dot\ell_{\theta_0}^T$, if, as $\theta \to \theta_0$,

$\int \bigl\| \psi_\theta \sqrt{p_\theta} - \psi_{\theta_0} \sqrt{p_{\theta_0}} \bigr\|^2 \, d\mu \to 0.$
Proof. The arguments of the previous proof apply, except that it must be shown that

$R(\tilde\theta_n) := \sqrt n \bigl( \Psi_n(\tilde\theta_n) - \Psi_n(\theta_0) \bigr) - \dot\Psi_0 \sqrt n (\tilde\theta_n - \theta_0)$

converges to zero in probability. Fix $\varepsilon > 0$. By the $\sqrt n$-consistency, there exists $M$ with $P(\sqrt n \|\tilde\theta_n - \theta_0\| > M) < \varepsilon$. If $\sqrt n \|\tilde\theta_n - \theta_0\| \le M$, then $\tilde\theta_n$ equals one of the values in the set $S_n = \{\theta \in n^{-1/2} \mathbb{Z}^k : \|\theta - \theta_0\| \le n^{-1/2} M\}$. For each $M$ and $n$ there are only finitely many elements in this set. Moreover, for fixed $M$ the number of elements is bounded independently of $n$. Thus

$P\bigl( \|R(\tilde\theta_n)\| > \varepsilon \bigr) \le \varepsilon + \sum_{\theta_n \in S_n} P\bigl( \|R(\theta_n)\| > \varepsilon \bigr).$

The maximum of the terms in the sum corresponds to a sequence of nonrandom vectors $\theta_n$ with $\theta_n = \theta_0 + O(n^{-1/2})$. It converges to zero by (5.47). Because the number of terms in the sum is bounded independently of $n$, the sum converges to zero. For a proof of the addendum, see proposition A.10 in [139]. $\blacksquare$

If the score function $\dot\ell_\theta$ of the model also satisfies the conditions of the addendum, then the estimators $\dot\Psi_{n,0} = -\mathbb{P}_n \psi_{\tilde\theta_n} \dot\ell_{\tilde\theta_n}^T$ are consistent for $\dot\Psi_0$. This shows that discretized one-step estimation can be carried through under very mild regularity conditions. Note that the addendum requires only continuity of $\theta \mapsto \psi_\theta$, whereas (5.47) appears to require differentiability.
5.50 Example (Cauchy distribution). Suppose $X_1, \ldots, X_n$ are a sample from the Cauchy location family $p_\theta(x) = \pi^{-1}\bigl(1 + (x - \theta)^2\bigr)^{-1}$. Then the score function is given by

$\dot\ell_\theta(x) = \frac{2(x - \theta)}{1 + (x - \theta)^2}.$
M- and Z -Estimators
74
Figure 5.4. Cauchy log likelihood function of a sample of 25 observations, showing three local maxima. The value of the absolute maximum is well separated from the other maxima, and its location is close to the true value zero of the parameter.
This function behaves like $1/x$ for $x \to \pm\infty$ and is bounded in between. The second moment of $\dot\ell_\theta(X_1)$ therefore exists, unlike the moments of the Cauchy distribution itself. Because the sample mean possesses the same (Cauchy) distribution as a single observation $X_1$, the sample mean is a very inefficient estimator. Instead we could use the median, or another M-estimator. However, the asymptotically best estimator should be based on maximum likelihood. We have

$$\dddot\ell_\theta(x) = \frac{4(x-\theta)\big((x-\theta)^2 - 3\big)}{\big(1 + (x-\theta)^2\big)^3}.$$

The tails of this function are of the order $1/x^3$, and the function is bounded in between. These bounds are uniform in $\theta$ varying over a compact interval. Thus the conditions of Theorems 5.41 and 5.42 are satisfied. Because the consistency follows from Example 5.16, the sequence of maximum likelihood estimators is asymptotically normal.

The Cauchy likelihood estimator has gained a bad reputation, because the likelihood equation $\sum_i \dot\ell_\theta(X_i) = 0$ typically has several roots. The number of roots behaves asymptotically as two times a Poisson$(1/\pi)$ variable plus 1. (See [126].) Therefore, the one-step method (or possibly a multi-step method) is often recommended, with, for instance, the median as the initial estimator. Perhaps a better solution is not to use the likelihood equations, but to determine the maximum likelihood estimator by, for instance, visual inspection of a graph of the likelihood function, as in Figure 5.4. This is particularly appropriate because the difficulty of multiple roots does not occur in the two-parameter location-scale model. In the model with density $\sigma^{-1} p\big((x-\theta)/\sigma\big)$, the maximum likelihood estimator for $(\theta, \sigma)$ is unique. (See [25].) □
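The one-step recipe with the median as initial estimator can be sketched numerically. This is a minimal illustration of my own (sample size, seed, and function names are not from the book): a single Newton-Raphson step on the Cauchy likelihood equation.

```python
import numpy as np

def one_step_cauchy(x):
    """One Newton-Raphson step on the Cauchy likelihood equation,
    started at the median (a sqrt(n)-consistent preliminary estimator)."""
    theta0 = np.median(x)
    u = x - theta0
    score = np.sum(2 * u / (1 + u**2))               # sum of score contributions
    dscore = np.sum(2 * (u**2 - 1) / (1 + u**2)**2)  # derivative of the score in theta
    return theta0 - score / dscore

rng = np.random.default_rng(0)
x = rng.standard_cauchy(1000)  # true location 0
est = one_step_cauchy(x)
print(est)  # close to 0
```

Iterating the step a few times would give the root of the likelihood equation nearest to the median; the asymptotic theory says a single step already attains efficiency.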
5.51 Example (Mixtures). Let $f$ and $g$ be given, positive probability densities on the real line. Consider estimating the parameter $\theta = (\mu, \nu, \sigma, \tau, p)$ based on a random sample from the mixture density

$$x \mapsto p\,\frac{1}{\sigma} f\Big(\frac{x-\mu}{\sigma}\Big) + (1-p)\,\frac{1}{\tau}\, g\Big(\frac{x-\nu}{\tau}\Big).$$

If $f$ and $g$ are sufficiently regular, then this is a smooth five-dimensional parametric model, and the standard theory should apply. Unfortunately, the supremum of the likelihood over the natural parameter space is $\infty$, and there exists no maximum likelihood estimator. This is seen, for instance, from the fact that the likelihood is bigger than

$$\frac{p}{\sigma}\, f\Big(\frac{x_1 - \mu}{\sigma}\Big)\, \prod_{i=2}^n \frac{1-p}{\tau}\, g\Big(\frac{x_i - \nu}{\tau}\Big).$$

If we set $\mu = x_1$ and next maximize over $\sigma > 0$, then we obtain the value $\infty$ whenever $p > 0$, irrespective of the values of $\nu$ and $\tau$.

A one-step estimator appears reasonable in this example. In view of the smoothness of the likelihood, the general theory yields the asymptotic efficiency of a one-step estimator if started with an initial $\sqrt{n}$-consistent estimator. Moment estimators could be appropriate initial estimators. □
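The degeneracy of the mixture likelihood is easy to check numerically. In the sketch below (my own construction; it assumes both component densities are standard normal), fixing $\mu$ at the first observation and letting $\sigma$ decrease drives the log likelihood to $+\infty$.

```python
import numpy as np

def phi(z):
    # standard normal density
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def log_lik(x, mu, nu, sigma, tau, p):
    # log likelihood of the two-component location-scale mixture
    dens = p * phi((x - mu) / sigma) / sigma + (1 - p) * phi((x - nu) / tau) / tau
    return float(np.sum(np.log(dens)))

rng = np.random.default_rng(1)
x = rng.normal(size=50)
mu = x[0]                      # center the first component at an observation
vals = [log_lik(x, mu, 0.0, s, 1.0, 0.5) for s in (1.0, 1e-20, 1e-40, 1e-60)]
print(vals)                    # grows without bound as sigma -> 0
```

The term $p\,f(0)/\sigma$ for the first observation diverges, while every other observation keeps a strictly positive density through the second component, so the sum of log densities tends to infinity.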
*5.8 Rates of Convergence
In this section we discuss some results that give the rate of convergence of M-estimators. These results are useful as intermediate steps in deriving a limit distribution, but are also of interest on their own. Applications include both classical estimators of "regular" parameters and estimators that converge at a slower than $\sqrt{n}$-rate. The main result is simple enough, but its conditions include a maximal inequality, for which results such as those in Chapter 19 are needed.

Let $\mathbb{P}_n$ be the empirical distribution of a random sample of size $n$ from a distribution $P$, and, for every $\theta$ in a metric space $\Theta$, let $x \mapsto m_\theta(x)$ be a measurable function. Let $\hat\theta_n$ (nearly) maximize the criterion function $\theta \mapsto \mathbb{P}_n m_\theta$. The criterion function may be viewed as the sum of the deterministic map $\theta \mapsto P m_\theta$ and the random fluctuations $\theta \mapsto \mathbb{P}_n m_\theta - P m_\theta$. The rate of convergence of $\hat\theta_n$ depends on the combined behavior of these maps. If the deterministic map changes rapidly as $\theta$ moves away from the point of maximum and the random fluctuations are small, then $\hat\theta_n$ has a high rate of convergence. For convenience of notation we measure the fluctuations in terms of the empirical process $\mathbb{G}_n m_\theta = \sqrt{n}(\mathbb{P}_n m_\theta - P m_\theta)$.
5.52 Theorem (Rate of convergence). Assume that, for fixed constants $C$ and $\alpha > \beta > 0$, for every $n$, and for every sufficiently small $\delta > 0$,

$$\sup_{d(\theta,\theta_0) < \delta} P(m_\theta - m_{\theta_0}) \le -C\,\delta^\alpha,$$

$$E^* \sup_{d(\theta,\theta_0) < \delta} \big|\mathbb{G}_n(m_\theta - m_{\theta_0})\big| \le C\,\delta^\beta.$$

If the sequence $\hat\theta_n$ is consistent for $\theta_0$ and satisfies $\mathbb{P}_n m_{\hat\theta_n} \ge \mathbb{P}_n m_{\theta_0} - O_P(r_n^{-\alpha})$ for $r_n = n^{1/(2\alpha - 2\beta)}$, then $r_n\, d(\hat\theta_n, \theta_0) = O_P^*(1)$.
Proof. Write the near-maximization property as $\mathbb{P}_n m_{\hat\theta_n} \ge \mathbb{P}_n m_{\theta_0} - R_n$, where $r_n^\alpha R_n = O_P(1)$. For each $n$, partition the parameter set into the "shells" $S_{j,n} = \{\theta : 2^{j-1} < r_n\, d(\theta, \theta_0) \le 2^j\}$, with $j$ ranging over the integers. If $r_n\, d(\hat\theta_n, \theta_0) > 2^M$ for a given integer $M$, then $\hat\theta_n$ lies in one of the shells $S_{j,n}$ with $j \ge M$, and then the supremum of $\theta \mapsto \mathbb{P}_n(m_\theta - m_{\theta_0})$ over this shell is at least $-R_n$. Thus, for every $\varepsilon > 0$ and $K > 0$,

$$P^*\big(r_n\, d(\hat\theta_n, \theta_0) > 2^M\big) \le \sum_{\substack{j \ge M \\ 2^j \le \varepsilon r_n}} P^*\Big(\sup_{\theta \in S_{j,n}} \mathbb{P}_n(m_\theta - m_{\theta_0}) \ge -K r_n^{-\alpha}\Big) + P^*\big(2\, d(\hat\theta_n, \theta_0) \ge \varepsilon\big) + P\big(r_n^\alpha R_n > K\big).$$

If the sequence $\hat\theta_n$ is consistent for $\theta_0$, then the second probability on the right converges to 0 as $n \to \infty$, for every fixed $\varepsilon > 0$. The third probability on the right can be made arbitrarily small by choice of $K$, uniformly in $n$. Choose $\varepsilon > 0$ small enough to ensure that the conditions of the theorem hold for every $\delta \le \varepsilon$. Then for every $j$ involved in the sum,

$$\sup_{\theta \in S_{j,n}} P(m_\theta - m_{\theta_0}) \le -C\, \frac{2^{(j-1)\alpha}}{r_n^\alpha}.$$

For $C\, 2^{(M-1)\alpha} \ge 2K$, the sum can be bounded, in terms of the empirical process $\mathbb{G}_n$, by

$$\sum_{j \ge M} P^*\Big(\sup_{\theta \in S_{j,n}} \big|\mathbb{G}_n(m_\theta - m_{\theta_0})\big| \ge \tfrac{1}{2}\, C\, 2^{(j-1)\alpha}\, \frac{\sqrt{n}}{r_n^\alpha}\Big) \le \sum_{j \ge M} \frac{2\, C\, (2^j/r_n)^\beta\, r_n^\alpha}{C\, 2^{(j-1)\alpha}\, \sqrt{n}} \lesssim \sum_{j \ge M} 2^{j(\beta - \alpha)},$$

by Markov's inequality and the definition of $r_n$, which gives $r_n^{\alpha - \beta} = \sqrt{n}$. The right side converges to zero for every $M = M_n \to \infty$. ∎
Consider the special case that the parameter $\theta$ is a Euclidean vector. If the map $\theta \mapsto P m_\theta$ is twice-differentiable at the point of maximum $\theta_0$, then its first derivative at $\theta_0$ vanishes, and a Taylor expansion of the limit criterion function takes the form

$$P m_\theta - P m_{\theta_0} = \tfrac{1}{2}(\theta - \theta_0)^T V (\theta - \theta_0) + o\big(\|\theta - \theta_0\|^2\big).$$

Then the first condition of the theorem holds with $\alpha = 2$, provided that the second-derivative matrix $V$ is nonsingular. The second condition of the theorem is a maximal inequality and is harder to verify. In "regular" cases it is valid with $\beta = 1$, and the theorem yields the "usual" rate of convergence $\sqrt{n}$. The theorem also applies to nonstandard situations and yields, for instance, the rate $n^{1/3}$ if $\alpha = 2$ and $\beta = \frac{1}{2}$. Lemmas 19.34, 19.36 and 19.38 and Corollary 19.35 are examples of maximal inequalities that can be appropriate for the present purpose. They give bounds in terms of the entropies of the classes of functions $\{m_\theta - m_{\theta_0} : d(\theta, \theta_0) < \delta\}$. A Lipschitz condition on the maps $\theta \mapsto m_\theta$ is one possibility to obtain simple estimates on these entropies and is applicable in many applications. The result of the following corollary is used earlier in this chapter.
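The two cases just mentioned ($\beta = 1$ gives $\sqrt{n}$, $\beta = \frac{1}{2}$ gives $n^{1/3}$) are both instances of the rate $r_n = n^{1/(2\alpha - 2\beta)}$ implied by the theorem. A trivial helper of my own (exact rational arithmetic, not from the book) computes the exponent:

```python
from fractions import Fraction

def rate_exponent(alpha, beta):
    # r_n = n^rho with rho = 1 / (2*(alpha - beta)), valid for alpha > beta
    return Fraction(1, 2) / (Fraction(alpha) - Fraction(beta))

regular = rate_exponent(2, 1)                  # alpha = 2, beta = 1
cube_root = rate_exponent(2, Fraction(1, 2))   # alpha = 2, beta = 1/2
print(regular, cube_root)  # 1/2 1/3
```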
5.53 Corollary. For each $\theta$ in an open subset of Euclidean space let $x \mapsto m_\theta(x)$ be a measurable function such that, for every $\theta_1$ and $\theta_2$ in a neighborhood of $\theta_0$ and a measurable function $\dot m$ with $P\dot m^2 < \infty$,

$$\big|m_{\theta_1}(x) - m_{\theta_2}(x)\big| \le \dot m(x)\,\|\theta_1 - \theta_2\|.$$

Furthermore, suppose that the map $\theta \mapsto P m_\theta$ admits a second-order Taylor expansion at the point of maximum $\theta_0$ with nonsingular second derivative. If $\mathbb{P}_n m_{\hat\theta_n} \ge \mathbb{P}_n m_{\theta_0} - O_P(n^{-1})$, then $\sqrt{n}(\hat\theta_n - \theta_0) = O_P(1)$, provided that $\hat\theta_n \stackrel{P}{\to} \theta_0$.

Proof. By assumption, the first condition of Theorem 5.52 is valid with $\alpha = 2$. To see that the second one is valid with $\beta = 1$, we apply Corollary 19.35 to the class of functions $\mathcal{F} = \{m_\theta - m_{\theta_0} : \|\theta - \theta_0\| < \delta\}$. This class has envelope function $F = \dot m\,\delta$, whence

$$E^* \sup_{\|\theta - \theta_0\| < \delta} \big|\mathbb{G}_n(m_\theta - m_{\theta_0})\big| \lesssim \int_0^{\|\dot m\|_{P,2}\,\delta} \sqrt{\log N_{[\,]}\big(\varepsilon, \mathcal{F}, L_2(P)\big)}\; d\varepsilon.$$

The bracketing entropy of the class $\mathcal{F}$ is estimated in Example 19.7. Inserting the upper bound obtained there into the integral, we obtain that the preceding display is bounded above by a multiple of

$$\int_0^{\|\dot m\|_{P,2}\,\delta} \sqrt{\log \frac{C\delta}{\varepsilon}}\; d\varepsilon.$$

Change the variables in the integral to see that this is a multiple of $\delta$. ∎
Rates of convergence different from $\sqrt{n}$ are quite common for M-estimators of infinite-dimensional parameters and may also be obtained through the application of Theorem 5.52. See Chapters 24 and 25 for examples. Rates slower than $\sqrt{n}$ may also arise for fairly simple parametric estimates.
5.54 Example (Modal interval). Suppose that we define an estimator $\hat\theta_n$ of location as the center of an interval of length 2 that contains the largest possible fraction of the observations. This is an M-estimator for the functions $m_\theta = 1_{[\theta-1,\theta+1]}$. For many underlying distributions the first condition of Theorem 5.52 holds with $\alpha = 2$. It suffices that the map $\theta \mapsto P m_\theta = P[\theta-1, \theta+1]$ is twice-differentiable and has a proper maximum at some point $\theta_0$. Using the maximal inequality Corollary 19.35 (or Lemma 19.38), we can show that the second condition is valid with $\beta = \frac{1}{2}$. Indeed, the bracketing numbers of the collection of intervals in the real line are of the order $1/\varepsilon^2$, and the envelope function of the class of functions $1_{[\theta-1,\theta+1]} - 1_{[\theta_0-1,\theta_0+1]}$, as $\theta$ ranges over $(\theta_0 - \delta, \theta_0 + \delta)$, is bounded by $1_{[\theta_0-1-\delta,\,\theta_0-1+\delta]} + 1_{[\theta_0+1-\delta,\,\theta_0+1+\delta]}$, whose squared $L_2$-norm is bounded by a multiple of $\|p\|_\infty\,\delta$. Thus Theorem 5.52 applies with $\alpha = 2$ and $\beta = \frac{1}{2}$ and yields the rate of convergence $n^{1/3}$.

The resulting location estimator is very robust against outliers. However, in view of its slow convergence rate, one should have good reasons to use it. The use of an interval of length 2 is somewhat awkward. Every other fixed length would give the same result. More interestingly, we can also replace the fixed-length interval by the smallest interval that contains a fixed fraction, for instance $1/2$, of the observations. This
still yields a rate of convergence of $n^{1/3}$. The intuitive reason for this is that the length of a "shorth" settles down at a $\sqrt{n}$-rate, and hence its randomness is asymptotically negligible relative to that of its center. □

The preceding theorem requires the consistency of $\hat\theta_n$ as a condition. This consistency is implied if the other conditions are valid for every $\delta > 0$, not just for small values of $\delta$. This can be seen from the proof, or from the more general theorem in the next section. Because the conditions are not natural for large values of $\delta$, it is usually better to argue the consistency by other means.
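The slow $n^{1/3}$ rate of the modal-interval estimator can be made visible in a small simulation. This sketch is my own (standard normal data; it uses the fact that an optimal length-2 window can always be taken to start at an observation):

```python
import numpy as np

def modal_interval_center(x):
    # Center of a length-2 interval containing the largest number of observations.
    # An optimal interval can always be taken of the form [x_i, x_i + 2].
    x = np.sort(x)
    counts = np.searchsorted(x, x + 2.0, side="right") - np.arange(len(x))
    return x[np.argmax(counts)] + 1.0

rng = np.random.default_rng(2)
spread = {}
for n in (100, 10_000):
    est = [modal_interval_center(rng.standard_normal(n)) for _ in range(30)]
    spread[n] = float(np.std(est))
print(spread)  # the spread shrinks roughly like n**(-1/3), much slower than n**(-1/2)
```

Increasing $n$ by a factor 100 shrinks the spread only by a factor of roughly $100^{1/3} \approx 4.6$, rather than the factor 10 a $\sqrt{n}$-consistent estimator would show.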
5.8.1 Nuisance Parameters
In Chapter 25 we need an extension of Theorem 5.52 that allows for a "smoothing" or "nuisance" parameter. We also take the opportunity to insert a number of other refinements, which are sometimes useful. Let $x \mapsto m_{\theta,\eta}(x)$ be measurable functions indexed by parameters $(\theta, \eta)$, and consider estimators $\hat\theta_n$ contained in a set $\Theta_n$ that, for a given $\hat\eta_n$ contained in a set $H_n$, maximize the map

$$\theta \mapsto \mathbb{P}_n m_{\theta, \hat\eta_n}.$$

The sets $\Theta_n$ and $H_n$ need not be metric spaces; instead we measure the discrepancies between $\hat\theta_n$ and $\theta_0$, and between $\hat\eta_n$ and a limiting value $\eta_0$, by nonnegative functions $\theta \mapsto d_\eta(\theta, \theta_0)$ and $\eta \mapsto d(\eta, \eta_0)$, which may be arbitrary.
5.55 Theorem. Assume that, for arbitrary functions $e_n : \Theta_n \times H_n \to \mathbb{R}$ and $\phi_n : (0, \infty) \to \mathbb{R}$ such that $\delta \mapsto \phi_n(\delta)/\delta^\beta$ is decreasing for some $\beta < 2$, every $(\theta, \eta) \in \Theta_n \times H_n$, and every $\delta > 0$,

$$P(m_{\theta,\eta} - m_{\theta_0,\eta}) + e_n(\theta, \eta) \lesssim -d_\eta^2(\theta, \theta_0) + d^2(\eta, \eta_0),$$

$$E^* \sup_{\substack{d_\eta(\theta,\theta_0) < \delta \\ (\theta,\eta) \in \Theta_n \times H_n}} \big|\mathbb{G}_n(m_{\theta,\eta} - m_{\theta_0,\eta}) - \sqrt{n}\, e_n(\theta, \eta)\big| \lesssim \phi_n(\delta).$$

Let $\delta_n > 0$ satisfy $\phi_n(\delta_n) \le \sqrt{n}\,\delta_n^2$ for every $n$. If $P(\hat\theta_n \in \Theta_n, \hat\eta_n \in H_n) \to 1$ and $\mathbb{P}_n m_{\hat\theta_n, \hat\eta_n} \ge \mathbb{P}_n m_{\theta_0, \hat\eta_n} - O_P(\delta_n^2)$, then $d_{\hat\eta_n}(\hat\theta_n, \theta_0) = O_P^*\big(\delta_n + d(\hat\eta_n, \eta_0)\big)$.
Proof. For simplicity assume that $\mathbb{P}_n m_{\hat\theta_n, \hat\eta_n} \ge \mathbb{P}_n m_{\theta_0, \hat\eta_n}$, without a tolerance term. For each $n \in \mathbb{N}$, $j \in \mathbb{Z}$ and $M > 0$, let $S_{n,j,M}$ be the set

$$\big\{(\theta, \eta) \in \Theta_n \times H_n :\ 2^{j-1}\delta_n < d_\eta(\theta, \theta_0) \le 2^j \delta_n,\ d(\eta, \eta_0) \le 2^{-M} d_\eta(\theta, \theta_0)\big\}.$$

Then the intersection of the events $\{(\hat\theta_n, \hat\eta_n) \in \Theta_n \times H_n\}$ and $\{d_{\hat\eta_n}(\hat\theta_n, \theta_0) \ge 2^M(\delta_n + d(\hat\eta_n, \eta_0))\}$ is contained in the union of the events $\{(\hat\theta_n, \hat\eta_n) \in S_{n,j,M}\}$ over $j \ge M$. By the definition of $\hat\theta_n$, the supremum of $\mathbb{P}_n(m_{\theta,\eta} - m_{\theta_0,\eta})$ over the set of parameters $(\theta, \eta) \in S_{n,j,M}$ is nonnegative on the event $\{(\hat\theta_n, \hat\eta_n) \in S_{n,j,M}\}$. Conclude that

$$P^*\Big(d_{\hat\eta_n}(\hat\theta_n, \theta_0) \ge 2^M\big(\delta_n + d(\hat\eta_n, \eta_0)\big),\ (\hat\theta_n, \hat\eta_n) \in \Theta_n \times H_n\Big) \le \sum_{j \ge M} P^*\Big(\sup_{(\theta,\eta) \in S_{n,j,M}} \mathbb{P}_n(m_{\theta,\eta} - m_{\theta_0,\eta}) \ge 0\Big).$$

For every $j$, every $(\theta, \eta) \in S_{n,j,M}$, and every sufficiently large $M$,

$$P(m_{\theta,\eta} - m_{\theta_0,\eta}) + e_n(\theta, \eta) \lesssim -d_\eta^2(\theta, \theta_0) + d^2(\eta, \eta_0) \le -(1 - 2^{-2M})\, d_\eta^2(\theta, \theta_0) \le -2^{2j-4}\,\delta_n^2.$$

From here on the proof is the same as the proof of Theorem 5.52, except that we use that $\phi_n(c\delta) \le c^\beta \phi_n(\delta)$ for every $c > 1$, by the assumption on $\phi_n$. ∎
*5.9 Argmax Theorem
The consistency of a sequence of M-estimators can be understood as the points of maximum $\hat\theta_n$ of the criterion functions $\theta \mapsto M_n(\theta)$ converging in probability to a point of maximum of the limit criterion function $\theta \mapsto M(\theta)$. So far we have made no attempt to understand the distributional limit properties of a sequence of M-estimators in a similar way. This is possible, but it is somewhat more complicated, and it is perhaps best studied after developing the theory of weak convergence of stochastic processes, as in Chapters 18 and 19.

Because the estimators $\hat\theta_n$ typically converge to constants, it is necessary to rescale them before studying distributional limit properties. Thus, we start by searching for a sequence of numbers $r_n \to \infty$ such that the sequence $\hat h_n = r_n(\hat\theta_n - \theta_0)$ is uniformly tight. The results of the preceding section should be useful. If $\hat\theta_n$ maximizes the function $\theta \mapsto M_n(\theta)$, then the rescaled estimators $\hat h_n$ are maximizers of the local criterion functions

$$h \mapsto M_n\Big(\theta_0 + \frac{h}{r_n}\Big) - M_n(\theta_0).$$
Suppose that these, if suitably normed, converge to a limit process $h \mapsto M(h)$. Then the general principle is that the sequence $\hat h_n$ converges in distribution to the maximizer $\hat h$ of this limit process. For simplicity of notation we shall write the local criterion functions as $h \mapsto M_n(h)$.

Let $\{M_n(h) : h \in H_n\}$ be arbitrary stochastic processes indexed by subsets $H_n$ of a given metric space. We wish to prove that the argmax functional is continuous: If $M_n \rightsquigarrow M$ and $H_n \to H$ in a suitable sense, then the (near) maximizers $\hat h_n$ of the random maps $h \mapsto M_n(h)$ converge in distribution to the maximizer $\hat h$ of the limit process $h \mapsto M(h)$. It is easy to find examples in which this is not true, but given the right definitions it is, under some conditions. Given a set $B$, set

$$M(B) = \sup_{h \in B} M(h).$$

Then convergence in distribution of the vectors $\big(M_n(A), M_n(B)\big)$ for given pairs of sets $A$ and $B$ is an appropriate form of convergence of $M_n$ to $M$. The following theorem gives some flexibility in the choice of the indexing sets. We implicitly either assume that the suprema $M_n(B)$ are measurable or understand the weak convergence in terms of outer probabilities, as in Chapter 18.

The result we are looking for is not likely to be true if the maximizer of the limit process is not well defined. Exactly as in Theorem 5.7, the maximum should be "well separated." Because in the present case the limit is a stochastic process, we require that every sample path $h \mapsto M(h)$ possesses a well-separated maximum (condition (5.57)).
5.56 Theorem (Argmax theorem). Let $M_n$ and $M$ be stochastic processes indexed by subsets $H_n$ and $H$ of a given metric space such that, for every pair of a closed set $F$ and a set $K$ in a given collection $\mathcal{K}$,

$$\big(M_n(F \cap K \cap H_n),\ M_n(K \cap H_n)\big) \rightsquigarrow \big(M(F \cap K \cap H),\ M(K \cap H)\big).$$

Furthermore, suppose that every sample path of the process $h \mapsto M(h)$ possesses a well-separated point of maximum $\hat h$, in the sense that, for every open set $G$ and every $K \in \mathcal{K}$,

$$M(\hat h) > M(G^c \cap K \cap H), \qquad \text{if } \hat h \in G, \text{ a.s.} \tag{5.57}$$

If $M_n(\hat h_n) \ge M_n(H_n) - o_P(1)$ and for every $\delta > 0$ there exists $K \in \mathcal{K}$ such that $\sup_n P(\hat h_n \notin K) < \delta$ and $P(\hat h \notin K) < \delta$, then $\hat h_n \rightsquigarrow \hat h$.
Proof. If $\hat h_n \in F \cap K$, then $M_n(F \cap K \cap H_n) \ge M_n(B) - o_P(1)$ for any set $B$. Hence, for every closed set $F$ and every $K \in \mathcal{K}$,

$$P(\hat h_n \in F \cap K) \le P\big(M_n(F \cap K \cap H_n) \ge M_n(K \cap H_n) - o_P(1)\big) \le P\big(M(F \cap K \cap H) \ge M(K \cap H)\big) + o(1),$$

by Slutsky's lemma and the portmanteau lemma. If $\hat h \in F^c$, then $M(F \cap K \cap H)$ is strictly smaller than $M(\hat h)$ by (5.57), and hence on the intersection with the event on the far right side $\hat h$ cannot be contained in $K \cap H$. It follows that

$$\limsup_n P(\hat h_n \in F \cap K) \le P(\hat h \in F) + P(\hat h \notin K \cap H).$$
By assumption we can choose $K$ such that the left and right sides change by less than $\delta$ if we replace $K$ by the whole space. Hence $\hat h_n \rightsquigarrow \hat h$ by the portmanteau lemma. ∎

The theorem works most smoothly if we can take $\mathcal{K}$ to consist only of the whole space. However, then we are close to assuming some sort of global uniform convergence of $M_n$ to $M$, and this may not hold or may be hard to prove. It is usually more economical, in terms of conditions, to show that the maximizers $\hat h_n$ are contained in certain sets $K$ with high probability. Then uniform convergence of $M_n$ to $M$ on $K$ is sufficient. The choice of compact sets $K$ corresponds to establishing the uniform tightness of the sequence $\hat h_n$ before applying the argmax theorem. If the sample paths of the processes $M_n$ are bounded on $K$ and $H_n = H$ for every $n$, then the weak convergence of the processes $M_n$, viewed as elements of the space $\ell^\infty(K)$, implies the convergence condition of the argmax theorem. This follows by the continuous-mapping theorem, because the map
$$z \mapsto \big(z(A \cap K),\ z(B \cap K)\big)$$

from $\ell^\infty(K)$ to $\mathbb{R}^2$ is continuous, for every pair of sets $A$ and $B$. The weak convergence in $\ell^\infty(K)$ remains sufficient if the sets $H_n$ depend on $n$ but converge in a suitable way. Write $H_n \to H$ if $H$ is the set of all limits $\lim h_n$ of converging sequences $h_n$ with $h_n \in H_n$ for every $n$ and, moreover, the limit $h = \lim_i h_{n_i}$ of every converging sequence $h_{n_i}$ with $h_{n_i} \in H_{n_i}$ for every $i$ is contained in $H$.
5.58 Corollary. Suppose that $M_n \rightsquigarrow M$ in $\ell^\infty(K)$ for every compact subset $K$ of $\mathbb{R}^k$, for a limit process $M$ with continuous sample paths that have unique points of maximum $\hat h$. If $H_n \to H$, $M_n(\hat h_n) \ge M_n(H_n) - o_P(1)$, and the sequence $\hat h_n$ is uniformly tight, then $\hat h_n \rightsquigarrow \hat h$.

Proof. The compactness of $K$ and the continuity of the sample paths $h \mapsto M(h)$ imply that the (unique) points of maximum $\hat h$ are automatically well separated in the sense of (5.57). Indeed, if this fails for a given open set $G \ni \hat h$ and $K$ (and a given $\omega$ in the underlying probability space), then there exists a sequence $h_m$ in $G^c \cap K \cap H$ such that $M(h_m) \to M(\hat h)$. If $K$ is compact, then this sequence can be chosen convergent. The limit $h_0$ must be in the closed set $G^c$ and hence cannot be $\hat h$. By the continuity of $M$ it also has the property that $M(h_0) = \lim M(h_m) = M(\hat h)$. This contradicts the assumption that $\hat h$ is a unique point of maximum.

If we can show that $\big(M_n(F \cap H_n), M_n(K \cap H_n)\big)$ converges to the corresponding limit for all compact sets $F \subset K$, then the theorem is a corollary of Theorem 5.56. If $H_n = H$ for every $n$, then this convergence is immediate from the weak convergence of $M_n$ to $M$ in $\ell^\infty(K)$, by the continuous-mapping theorem. For $H_n$ changing with $n$ this convergence may fail, and we need to refine the proof of Theorem 5.56. That proof goes through with minor changes if

$$\limsup_{n \to \infty} P\big(M_n(F \cap H_n) - M_n(K \cap H_n) \ge x\big) \le P\big(M(F \cap H) - M(K \cap H) \ge x\big),$$

for every $x$, every compact set $F$, and every large closed ball $K$. Define functions $g_n : \ell^\infty(K) \to \mathbb{R}$ by

$$g_n(z) = \sup_{h \in F \cap H_n} z(h) - \sup_{h \in K \cap H_n} z(h),$$

and $g$ similarly, but with $H$ replacing $H_n$. By an argument as in the proof of Theorem 18.11, the desired result follows if $\limsup g_n(z_n) \le g(z)$ for every sequence $z_n \to z$ in $\ell^\infty(K)$ with a continuous limit function $z$. (Then $\limsup P(g_n(M_n) \ge x) \le P(g(M) \ge x)$ for every $x$, for any weakly converging sequence $M_n \rightsquigarrow M$ with a limit with continuous sample paths.) This in turn follows if, for every precompact set $B \subset K$,

$$\sup_{h \in B \cap H} z(h) \le \liminf_{n \to \infty} \sup_{h \in \bar B \cap H_n} z_n(h), \qquad \limsup_{n \to \infty} \sup_{h \in B \cap H_n} z_n(h) \le \sup_{h \in \bar B \cap H} z(h).$$

To prove the second inequality, select $h_n \in B \cap H_n$ such that

$$\sup_{h \in B \cap H_n} z_n(h) = z_n(h_n) + o(1) = z(h_n) + o(1).$$

Because $B$ is precompact, every subsequence of $h_n$ has a converging subsequence. Because $H_n \to H$, the limit $h$ must be in $\bar B \cap H$. Because $z(h_n) \to z(h)$ along such a subsequence, the upper bound follows. To prove the first inequality, select for given $\delta > 0$ an element $h \in B \cap H$ such that

$$\sup_{h' \in B \cap H} z(h') \le z(h) + \delta.$$

Because $H_n \to H$, there exists $h_n \in H_n$ with $h_n \to h$. This sequence must be in $\bar B \subset K$ eventually, whence $z(h) = \lim z(h_n) = \lim z_n(h_n)$ is bounded above by $\liminf_n \sup_{h \in \bar B \cap H_n} z_n(h)$. ∎
The argmax theorem can also be used to prove consistency, by applying it to the original criterion functions $\theta \mapsto M_n(\theta)$. Then the limit process $\theta \mapsto M(\theta)$ is degenerate and has a fixed point of maximum $\theta_0$. Weak convergence becomes convergence in probability, and the theorem now gives conditions for the consistency $\hat\theta_n \stackrel{P}{\to} \theta_0$. Condition (5.57) reduces to the well-separation of $\theta_0$, and the convergence

$$\sup_{\theta \in F \cap K \cap \Theta_n} M_n(\theta) \stackrel{P}{\to} \sup_{\theta \in F \cap K \cap \Theta} M(\theta)$$

is, apart from allowing $\Theta_n$ to depend on $n$, weaker than the uniform convergence of $M_n$ to $M$.
Notes
In the section on consistency we have given two main results (uniform convergence and Wald's proof) that have proven their value over the years, but there is more to say on this subject. The two approaches can be unified by replacing the uniform convergence by "one-sided uniform convergence," which in the case of i.i.d. observations can be established under the conditions of Wald's theorem by a bracketing approach as in Example 19.8 (but then one-sided). Furthermore, the use of special properties, such as convexity of the $\psi$ or $m$ functions, is often helpful. Examples such as Lemma 5.10, or the treatment of maximum likelihood estimators in exponential families in Chapter 4, appear to indicate that no single approach can be satisfactory.

The study of the asymptotic properties of maximum likelihood estimators and other M-estimators has a long history. Fisher [48], [50] was a strong advocate of the method of maximum likelihood and noted its asymptotic optimality as early as the 1920s. What we have labelled the classical conditions correspond to the rigorous treatment given by Cramér [27] in his authoritative book. Huber initiated the systematic study of M-estimators, with the purpose of developing robust statistical procedures. His paper [78] contains important ideas that are precursors for the application of techniques from the theory of empirical processes by, among others, Pollard, as in [117], [118], and [120]. For one-dimensional parameters these empirical process methods can be avoided by using a maximal inequality based on the $L_2$-norm (see, e.g., Theorem 2.2.4 in [146]). Surprisingly, then a Lipschitz condition on the Hellinger distance (an integrated quantity) suffices; see, for example, [80] or [94]. For higher-dimensional parameters the results are also not the best possible, but I do not know of any simple better ones.
The books by Huber [79] and by Hampel, Ronchetti, Rousseeuw, and Stahel [73] are good sources for applications of M-estimators in robust statistics. These references also discuss the relative efficiency of the different M-estimators, which motivates, for instance, the use of Huber's $\psi$-function. In this chapter we have derived Huber's estimator as the solution of the problem of minimizing the asymptotic variance under the side condition of a uniformly bounded influence function. Originally Huber derived it as the solution to the problem of minimizing the maximum asymptotic variance $\sup_P \sigma_\psi^2(P)$, for $P$ ranging over a contamination neighborhood: $P = (1 - \varepsilon)\Phi + \varepsilon Q$, with $Q$ an arbitrary distribution.
$$\Omega_P = \{p > 0\}, \qquad \Omega_Q = \{q > 0\}.$$

See Figure 6.1. Because $P(\Omega_P^c) = \int_{p=0} p\, d\mu = 0$, the measure $P$ is supported on the set $\Omega_P$. Similarly, $Q$ is supported on $\Omega_Q$. The intersection $\Omega_P \cap \Omega_Q$ receives positive measure from both $P$ and $Q$, provided its measure under $\mu$ is positive. The measure $Q$ can be written as the sum $Q = Q^a + Q^\perp$ of the measures

$$Q^a(A) = Q\big(A \cap \{p > 0\}\big), \qquad Q^\perp(A) = Q\big(A \cap \{p = 0\}\big). \tag{6.1}$$

As proved in the next lemma, $Q^a \ll P$ and $Q^\perp \perp P$. Furthermore, for every measurable set $A$,

$$Q^a(A) = \int_A \frac{q}{p}\, dP.$$

The decomposition $Q = Q^a + Q^\perp$ is called the Lebesgue decomposition of $Q$ with respect to $P$. The measures $Q^a$ and $Q^\perp$ are called the absolutely continuous part and the orthogonal
Figure 6.1. Supports of the measures $P$ and $Q$: the regions $\{p > 0, q = 0\}$, $\{p > 0, q > 0\}$, $\{p = 0, q > 0\}$, and $\{p = q = 0\}$.
part (or singular part) of $Q$ with respect to $P$, respectively. In view of the preceding display, the function $q/p$ is a density of $Q^a$ with respect to $P$. It is denoted $dQ/dP$ (not $dQ^a/dP$), so that

$$\frac{dQ}{dP} = \frac{q}{p}, \qquad P\text{-a.s.}$$

As long as we are only interested in the properties of the quotient $q/p$ under $P$-probability, we may leave the quotient undefined for $p = 0$. The density $dQ/dP$ is only $P$-almost surely unique by definition. Even though we have used densities to define them, $dQ/dP$ and the Lebesgue decomposition are actually independent of the choice of densities and dominating measure. In statistics a more common name for a Radon-Nikodym density is likelihood ratio. We shall think of it as a random variable $dQ/dP : \Omega \to [0, \infty)$ and shall study its law under $P$.
6.2 Lemma. Let $P$ and $Q$ be probability measures with densities $p$ and $q$ with respect to a measure $\mu$. Then for the measures $Q^a$ and $Q^\perp$ defined in (6.1):
(i) $Q = Q^a + Q^\perp$, $Q^a \ll P$, $Q^\perp \perp P$;
(ii) $Q^a(A) = \int_A (q/p)\, dP$ for every measurable set $A$;
(iii) $Q \ll P$ if and only if $Q(p = 0) = 0$ if and only if $\int (q/p)\, dP = 1$.

Proof. The first statement of (i) is obvious from the definitions of $Q^a$ and $Q^\perp$. For the second, we note that $P(A)$ can be zero only if $p(x) = 0$ for $\mu$-almost all $x \in A$. In this case $\mu(A \cap \{p > 0\}) = 0$, whence $Q^a(A) = Q(A \cap \{p > 0\}) = 0$ by the absolute continuity of $Q$ with respect to $\mu$. The third statement of (i) follows from $P(p = 0) = 0$ and $Q^\perp(p > 0) = Q(\emptyset) = 0$. Statement (ii) follows from

$$Q^a(A) = \int_{A \cap \{p > 0\}} q\, d\mu = \int_{A \cap \{p > 0\}} \frac{q}{p}\, p\, d\mu = \int_A \frac{q}{p}\, dP.$$

For (iii) we note first that $Q \ll P$ if and only if $Q^\perp = 0$. By (6.1) the latter happens if and only if $Q(p = 0) = 0$. This yields the first "if and only if." For the second, we note that by (ii) the total mass of $Q^a$ is equal to $Q^a(\Omega) = \int (q/p)\, dP$. This is 1 if and only if $Q^a = Q$. ∎
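For a toy discrete case the lemma can be verified directly. The sketch below is my own (it takes $\mu$ to be counting measure on four points):

```python
import numpy as np

# mu = counting measure on {0, 1, 2, 3}; densities of P and Q with respect to mu:
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.3, 0.0, 0.7, 0.0])

Qa = np.where(p > 0, q, 0.0)      # Q^a(A) = Q(A ∩ {p > 0})
Qperp = np.where(p == 0, q, 0.0)  # Q^perp(A) = Q(A ∩ {p = 0})
lr = np.divide(q, p, out=np.zeros_like(q), where=p > 0)  # dQ/dP = q/p on {p > 0}

total_Qa = float(np.sum(lr * p))  # = ∫ (q/p) dP = Q^a(Omega)
print(Qa + Qperp)                 # recovers q: Q = Q^a + Q^perp
print(total_Qa)                   # 0.3 < 1, so Q is not absolutely continuous w.r.t. P
```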
It is not true in general that $\int f\, dQ = \int f\, (dQ/dP)\, dP$. For this to be true for every measurable function $f$, the measure $Q$ must be absolutely continuous with respect to $P$. On the other hand, for any $P$ and $Q$ and nonnegative $f$,

$$\int f\, \frac{dQ}{dP}\, dP \le \int f\, dQ.$$

This inequality is used freely in the following. The inequality may be strict, because dividing by zero is not permitted.

6.2 Contiguity
If a probability measure $Q$ is absolutely continuous with respect to a probability measure $P$, then the $Q$-law of a random vector $X : \Omega \to \mathbb{R}^k$ can be calculated from the $P$-law of the pair $(X, dQ/dP)$ through the formula

$$E_Q f(X) = E_P f(X)\, \frac{dQ}{dP}.$$

With $P^{X,V}$ equal to the law of the pair $(X, V) = (X, dQ/dP)$ under $P$, this relationship can also be expressed as

$$Q(X \in B) = E_P 1_B(X)\, \frac{dQ}{dP} = \int_{B \times \mathbb{R}} v\, dP^{X,V}(x, v).$$

The validity of these formulas depends essentially on the absolute continuity of $Q$ with respect to $P$, because a part of $Q$ that is orthogonal with respect to $P$ cannot be recovered from any $P$-law.

Consider an asymptotic version of the problem. Let $(\Omega_n, \mathcal{A}_n)$ be measurable spaces, each equipped with a pair of probability measures $P_n$ and $Q_n$. Under what conditions can a $Q_n$-limit law of random vectors $X_n : \Omega_n \to \mathbb{R}^k$ be obtained from suitable $P_n$-limit laws? In view of the above it is necessary that $Q_n$ be "asymptotically absolutely continuous" with respect to $P_n$ in a suitable sense. The right concept is contiguity.

Definition. The sequence $Q_n$ is contiguous with respect to the sequence $P_n$ if $P_n(A_n) \to 0$ implies $Q_n(A_n) \to 0$ for every sequence of measurable sets $A_n$. This is denoted $Q_n \triangleleft P_n$.

6.4 Lemma (Le Cam's first lemma). Let $P_n$ and $Q_n$ be sequences of probability measures on measurable spaces $(\Omega_n, \mathcal{A}_n)$. Then the following statements are equivalent:
(i) $Q_n \triangleleft P_n$;
(ii) if $dP_n/dQ_n \rightsquigarrow U$ along a subsequence under $Q_n$, then $P(U > 0) = 1$;
(iii) if $dQ_n/dP_n \rightsquigarrow V$ along a subsequence under $P_n$, then $EV = 1$.

Proof. (i) $\Rightarrow$ (ii). Suppose that $dP_n/dQ_n \rightsquigarrow U$ under $Q_n$ along a subsequence. By the portmanteau lemma, $g_n(\varepsilon) := Q_n(dP_n/dQ_n < \varepsilon) - P(U < \varepsilon)$ satisfies $\liminf_n g_n(\varepsilon) \ge 0$ for every fixed $\varepsilon > 0$. Then, for $\varepsilon_n \downarrow 0$ at a sufficiently slow rate, also $\liminf_n g_n(\varepsilon_n) \ge 0$. Thus,

$$P(U = 0) = \lim_n P(U < \varepsilon_n) \le \liminf_n Q_n\Big(\frac{dP_n}{dQ_n} < \varepsilon_n\Big).$$

On the other hand,

$$P_n\Big(\frac{dP_n}{dQ_n} < \varepsilon_n\Big) \le \varepsilon_n \to 0,$$

because on the set where the likelihood ratio is smaller than $\varepsilon_n$ the density of $P_n$ is at most $\varepsilon_n$ times that of $Q_n$. If $Q_n$ is contiguous with respect to $P_n$, then the $Q_n$-probability of the set on the left goes to zero also. But this is the probability on the right in the first display. Combination shows that $P(U = 0) = 0$.

(iii) $\Rightarrow$ (i). If $P_n(A_n) \to 0$, then the sequence $1_{\Omega_n - A_n}$ converges to 1 in $P_n$-probability. By Prohorov's theorem, every subsequence of $\{n\}$ has a further subsequence along which
$(dQ_n/dP_n, 1_{\Omega_n - A_n}) \rightsquigarrow (V, 1)$ under $P_n$, for some weak limit $V$. The function $(v, t) \mapsto vt$ is continuous and nonnegative on the set $[0, \infty) \times \{0, 1\}$. By the portmanteau lemma

$$\liminf_n Q_n(\Omega_n - A_n) \ge \liminf_n \int 1_{\Omega_n - A_n}\, \frac{dQ_n}{dP_n}\, dP_n \ge E\, 1 \cdot V.$$

Under (iii) the right side equals $EV = 1$. Then the left side is 1 as well, and the sequence $Q_n(A_n) = 1 - Q_n(\Omega_n - A_n)$ converges to zero.

(ii) $\Rightarrow$ (iii). The probability measures $\mu_n = \frac{1}{2}(P_n + Q_n)$ dominate both $P_n$ and $Q_n$, for every $n$. The sum of the densities of $P_n$ and $Q_n$ with respect to $\mu_n$ equals 2. Hence, each of the densities takes its values in the compact interval $[0, 2]$. By Prohorov's theorem every subsequence possesses a further subsequence along which

$$\frac{dP_n}{dQ_n} \rightsquigarrow U \text{ under } Q_n, \qquad \frac{dQ_n}{dP_n} \rightsquigarrow V \text{ under } P_n, \qquad W_n := \frac{dP_n}{d\mu_n} \rightsquigarrow W \text{ under } \mu_n,$$

for certain random variables $U$, $V$ and $W$. Every $W_n$ has expectation 1 under $\mu_n$. In view of the boundedness, the weak convergence of the sequence $W_n$ implies convergence of moments, and the limit variable has mean $EW = 1$ as well. For a given bounded, continuous function $f$, define a function $g : [0, 2] \to \mathbb{R}$ by $g(w) = f\big(w/(2-w)\big)(2-w)$ for $0 \le w < 2$ and $g(2) = 0$. Then $g$ is bounded and continuous. Because $dP_n/dQ_n = W_n/(2 - W_n)$ and $dQ_n/d\mu_n = 2 - W_n$, the portmanteau lemma yields

$$E_{\mu_n} g(W_n) = E_{\mu_n} f\Big(\frac{dP_n}{dQ_n}\Big)\, \frac{dQ_n}{d\mu_n} = E_{Q_n} f\Big(\frac{dP_n}{dQ_n}\Big) \to E f\Big(\frac{W}{2 - W}\Big)(2 - W),$$

where the integrand on the right side is understood to be $g(2) = 0$ if $W = 2$. By assumption, the left side converges to $Ef(U)$. Thus $Ef(U)$ equals the right side of the display for every continuous and bounded function $f$. Take a sequence of such functions with $1 \ge f_m \downarrow 1_{\{0\}}$, and conclude by the dominated-convergence theorem that

$$E\, 1_{\{0\}}(U) = E\, 1_{\{0\}}\Big(\frac{W}{2 - W}\Big)(2 - W) = 2\, P(W = 0).$$

By a similar argument, $Ef(V) = E f\big((2 - W)/W\big)\, W$ for every continuous and bounded function $f$, where the integrand on the right is understood to be zero if $W = 0$. Take a sequence $0 \le f_m(x) \uparrow x$ and conclude by the monotone convergence theorem that

$$EV = E\, \frac{2 - W}{W}\, W\, 1_{\{W > 0\}} = E (2 - W)\, 1_{\{W > 0\}} = 2\, P(W > 0) - 1.$$

Combination of the last two displays shows that $P(U = 0) + EV = 1$. ∎
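The role of a limit $V$ with $EV = 1$ can be illustrated numerically in the prototypical smooth case. This is my own example, not from the text: with $P_n = N(0,1)^n$ and $Q_n = N(h/\sqrt{n}, 1)^n$, the log likelihood ratio under $P_n$ is exactly $h\sqrt{n}\,\bar X_n - h^2/2 \sim N(-h^2/2, h^2)$, so $dQ_n/dP_n \rightsquigarrow V = e^{N(-h^2/2,\, h^2)}$ and $EV = 1$.

```python
import numpy as np

# Under P_n, sqrt(n) * mean(X) is exactly N(0, 1), so we can sample the
# log likelihood ratio log(dQ_n/dP_n) = h*sqrt(n)*mean(X) - h^2/2 directly.
rng = np.random.default_rng(3)
h, reps = 1.5, 200_000
loglr = h * rng.standard_normal(reps) - h**2 / 2

mean_loglr = float(loglr.mean())       # ≈ -h^2/2 = -1.125 (the "-sigma^2/2" mean)
mean_lr = float(np.exp(loglr).mean())  # ≈ E V = 1, as contiguity requires
print(mean_loglr, mean_lr)
```

The mean $-h^2/2$ of the limiting normal log likelihood ratio is exactly the relation $\mu = -\frac{1}{2}\sigma^2$ that characterizes mutual contiguity in the log-normal example below.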
6.5 Example (Asymptotic log normality). The following special case plays an important role in the asymptotic theory of smooth parametric models. Let $P_n$ and $Q_n$ be probability measures on arbitrary measurable spaces such that

$$\frac{dP_n}{dQ_n} \rightsquigarrow e^{N(\mu, \sigma^2)}$$

under $Q_n$. Then $Q_n \triangleleft P_n$, because the limit variable is positive almost surely; moreover, $P_n \triangleleft Q_n$ if and only if $E e^{N(\mu, \sigma^2)} = e^{\mu + \sigma^2/2} = 1$, that is, if and only if $\mu = -\frac{1}{2}\sigma^2$.

There exists a subsequence $\{n_k\}$ of $\{n\}$ along which this expression is attained as a limit.
We apply the preceding argument to this subsequence and find a further subsequence along which $T_n$ satisfies (8.2). For simplicity of notation, write this as $\{n'\}$ rather than with a double subscript. Because $\ell$ is nonnegative and lower semicontinuous, the portmanteau lemma gives, for every $h$,

$$\liminf_{n' \to \infty} E_{\theta + h/\sqrt{n'}}\, \ell\Big(\sqrt{n'}\Big(T_{n'} - \psi\Big(\theta + \frac{h}{\sqrt{n'}}\Big)\Big)\Big) \ge \int \ell\, dL_{\theta, h}.$$

See, for example, [146, Chapter 3.11] for the general result, which can be proved along the same lines, but using a compactification device to induce tightness.
Every rational vector $h$ is contained in the finite sets indexing the suprema for every sufficiently large $k$. Conclude that

$$R \ge \sup_{h \in \mathbb{Q}^k} \int \ell\, dL_{\theta, h} = \sup_{h \in \mathbb{Q}^k} E_h\, \ell(T - \dot\psi_\theta h).$$

The risk function in the supremum on the right is lower semicontinuous in $h$, by the continuity of the Gaussian location family and the lower semicontinuity of $\ell$. Thus the expression on the right does not change if $\mathbb{Q}^k$ is replaced by $\mathbb{R}^k$. This concludes the proof. ∎
*8.8 Shrinkage Estimators
The theorems of the preceding sections seem to prove in a variety of ways that the best possible limit distribution is the N(O, {pe ie- 1 {pe T )-distribution. At closer inspection, the situation is more complicated, and to a certain extent optimality remains a matter of taste, asymptotic optimality being no exception. The "optimal" normal limit is the distribution of the estimator {pe X in the normal limit experiment. Because this estimator has several optimality properties, many statisticians consider it best. Nevertheless, one might prefer a Bayes estimator or a shrinkage estimator. With a changed perception of what constitutes "best" in the limit experiment, the meaning of "asymptotically best" changes also. This becomes particularly clear in the example of shrinkage estimators. Example (Shrinkage estimator). Let X 1 , . . . , Xn be a sample from a multivariate normal distribution with mean fJ and covariance the identity matrix. The dimension k of the observations is assumed to be at least This is essential ! Consider the estimator
8.12
3.
Because X n converges in probability to the mean e , the second term in the definition of Tn is 0 p (n - 1 ) if fJ =j:. In that case ,Jfi(Tn - X n) converges in probability to zero, whence the estimator sequence Tn is regular at every e =j:. For e = hj ,Jn, the variable ,JriXn is distributed as a variable X with an N(h , I)-distribution, and for every n the standardized estimator ylri(Tn - hj ,Jfi) is distributed as T - h for
0.
0.
X . II X II 2 This is the Stein shrinkage estimator. Because the distribution of T - h depends on h, the sequence Tn is not regular at e The Stein estimator has the remarkable property that, for every h (see, e.g., [99, p. T(X) = X - (k - 2)
-
0. 300]), =
Eh ii T - h ll 2 < Eh ii X - h ll 2 = k. It follows that, in terms of joint quadratic loss £(x) = ll x 11 2 , the local limit distributions Lo,h of the sequence ,Jri(Tn - hj ylri) under e = hj ,Jn are all better than the N (O, I)-limit distribution of the best regular estimator sequence X n . 0 The example of shrinkage estimators shows that, depending on the optimality criterion, a normal N {pe Ie- 1 {pe T )-limit distribution need not be optimal. In this light, is it reasonable
(0,
to uphold that maximum likelihood estimators are asymptotically optimal? Perhaps not. On the other hand, the possibility of improvement over the $N(0, \dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T)$-limit is restricted in two important ways. First, improvement can be made only on a null set of parameters, by Theorem 8.9. Second, improvement is possible only for special loss functions, and improvement for one loss function necessarily implies worse performance for other loss functions. This follows from the next lemma. Suppose that we require the estimator sequence $T_n$ to be locally asymptotically minimax for a given loss function $\ell$ in the sense that

$$\sup_I\, \limsup_{n\to\infty}\, \sup_{h \in I}\, \mathrm{E}_{\theta + h/\sqrt n}\, \ell\Bigl(\sqrt n\bigl(T_n - \psi(\theta + h/\sqrt n)\bigr)\Bigr) \le \int \ell\, dN\bigl(0, \dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T\bigr),$$

with the first supremum taken over all finite subsets $I$ of $\mathbb{R}^k$.
This is a reasonable requirement, and few statisticians would challenge it. The following lemma shows that for one-dimensional parameters local asymptotic minimaxity for even a single loss function implies regularity. Thus, if it is required that all coordinates of a certain estimator sequence be locally asymptotically minimax for some loss function, then the best regular estimator sequence is optimal without competition.
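The strict risk improvement claimed in Example 8.12 is easy to check by simulation. The sketch below (assuming NumPy; the dimension $k = 5$, the point $h$, and the simulation size are arbitrary choices) estimates $\mathrm{E}_h\|T - h\|^2$ for the Stein estimator $T(X) = X - (k-2)X/\|X\|^2$ and compares it with the constant risk $k$ of the estimator $X$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_sim = 5, 200_000
h = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # arbitrary true mean vector

X = rng.normal(loc=h, scale=1.0, size=(n_sim, k))          # X ~ N(h, I_k)
T = X - (k - 2) * X / np.sum(X**2, axis=1, keepdims=True)  # Stein estimator

risk_stein = np.mean(np.sum((T - h) ** 2, axis=1))  # estimates E_h ||T - h||^2
risk_x = np.mean(np.sum((X - h) ** 2, axis=1))      # estimates E_h ||X - h||^2 = k
```

For $k \ge 3$ the estimated Stein risk falls strictly below $k$ for every $h$, with the largest gain near $h = 0$.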
8.13 Lemma. Assume that the experiment $(P_\theta : \theta \in \Theta)$ is differentiable in quadratic mean (7.1) at $\theta$ with nonsingular Fisher information matrix $I_\theta$. Let $\psi$ be a real-valued map that is differentiable at $\theta$. Then an estimator sequence $T_n$ in the experiments $(P_\theta^n : \theta \in \mathbb{R}^k)$ can be locally asymptotically minimax at $\theta$ for a bowl-shaped loss function $\ell$ such that $0 < \int x^2 \ell(x)\, dN(0, \dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T)(x) < \infty$ only if $T_n$ is best regular at $\theta$.
Proof. We only give the proof under the further assumption that the sequence $\sqrt n(T_n - \psi(\theta))$ is uniformly tight under $\theta$. Then by the same arguments as in the proof of Theorem 8.11, every subsequence of $\{n\}$ has a further subsequence along which the sequence $\sqrt n(T_n - \psi(\theta + h/\sqrt n))$ converges in distribution under $\theta + h/\sqrt n$ to the distribution $L_{\theta,h}$ of $T - \dot\psi_\theta h$ under $h$, for a randomized estimator $T$ based on an $N(h, I_\theta^{-1})$-distributed observation. Because $T_n$ is locally asymptotically minimax, it follows that

$$\sup_h\, \mathrm{E}_h\, \ell\bigl(T - \dot\psi_\theta h\bigr) \le \int \ell\, dN\bigl(0, \dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T\bigr).$$

Thus $T$ is a minimax estimator for $\dot\psi_\theta h$ in the limit experiment. By Proposition 8.6, $T = \dot\psi_\theta X$, whence $L_{\theta,h}$ is independent of $h$. ∎
*8.9 Achieving the Bound

If the convolution theorem is taken as the basis for asymptotic optimality, then an estimator sequence is best if it is asymptotically regular with a $N(0, \dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T)$-limit distribution. An estimator sequence has this property if and only if the estimator is asymptotically linear in the score function.

8.14 Lemma. Assume that the experiment $(P_\theta : \theta \in \Theta)$ is differentiable in quadratic mean (7.1) at $\theta$ with nonsingular Fisher information matrix $I_\theta$. Let $\psi$ be differentiable at $\theta$. Let $T_n$ be an estimator sequence in the experiments $(P_\theta^n : \theta \in \mathbb{R}^k)$ such that

$$\sqrt n\bigl(T_n - \psi(\theta)\bigr) = \frac{1}{\sqrt n} \sum_{i=1}^n \dot\psi_\theta I_\theta^{-1} \dot\ell_\theta(X_i) + o_{P_\theta}(1).$$
Then $T_n$ is the best regular estimator for $\psi(\theta)$ at $\theta$. Conversely, every best regular estimator sequence satisfies this expansion.

Proof. The sequence $\Delta_{n,\theta} = n^{-1/2} \sum \dot\ell_\theta(X_i)$ converges in distribution to a vector $\Delta_\theta$ with a $N(0, I_\theta)$-distribution. By Theorem 7.2 the sequence $\log dP^n_{\theta+h/\sqrt n}/dP^n_\theta$ is asymptotically equivalent to $h^T \Delta_{n,\theta} - \tfrac12 h^T I_\theta h$. If $T_n$ is asymptotically linear, then $\sqrt n(T_n - \psi(\theta))$ is asymptotically equivalent to $\dot\psi_\theta I_\theta^{-1}\Delta_{n,\theta}$. Apply Slutsky's lemma to find that

$$\Bigl(\sqrt n\bigl(T_n - \psi(\theta)\bigr),\, \log \frac{dP^n_{\theta+h/\sqrt n}}{dP^n_\theta}\Bigr) \rightsquigarrow \Bigl(\dot\psi_\theta I_\theta^{-1}\Delta_\theta,\, h^T \Delta_\theta - \tfrac12 h^T I_\theta h\Bigr) \sim N\left(\begin{pmatrix} 0 \\ -\tfrac12 h^T I_\theta h\end{pmatrix},\ \begin{pmatrix} \dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T & \dot\psi_\theta h \\ h^T \dot\psi_\theta^T & h^T I_\theta h \end{pmatrix}\right).$$

The limit distribution of the sequence $\sqrt n(T_n - \psi(\theta))$ under $\theta + h/\sqrt n$ follows by Le Cam's third lemma, Example 6.7, and is normal with mean $\dot\psi_\theta h$ and covariance matrix $\dot\psi_\theta I_\theta^{-1}\dot\psi_\theta^T$. Combining this with the differentiability of $\psi$, we obtain that $T_n$ is regular.

Next suppose that $S_n$ and $T_n$ are both best regular estimator sequences. By the same arguments as in the proof of Theorem 8.11 it can be shown that, at least along subsequences, the joint estimators $(S_n, T_n)$ for $(\psi(\theta), \psi(\theta))$ satisfy for every $h$

$$\Bigl(\sqrt n\bigl(S_n - \psi(\theta + h/\sqrt n)\bigr),\ \sqrt n\bigl(T_n - \psi(\theta + h/\sqrt n)\bigr)\Bigr) \rightsquigarrow \bigl(S - \dot\psi_\theta h,\ T - \dot\psi_\theta h\bigr),$$

for a randomized estimator $(S, T)$ in the normal limit experiment. Because $S_n$ and $T_n$ are best regular, the estimators $S$ and $T$ are best equivariant-in-law. Thus $S = T = \dot\psi_\theta X$ almost surely by Proposition 8.6, whence $\sqrt n(S_n - T_n)$ converges in distribution to $S - T = 0$. Thus every two best regular estimator sequences are asymptotically equivalent. The second assertion of the lemma follows on applying this to $T_n$ and the estimators

$$S_n = \psi(\theta) + \frac{1}{\sqrt n}\, \dot\psi_\theta I_\theta^{-1} \Delta_{n,\theta}.$$

Because the parameter $\theta$ is known in the local experiments $(P^n_{\theta+h/\sqrt n} : h \in \mathbb{R}^k)$, this indeed defines an estimator sequence within the present context. It is best regular by the first part of the lemma. ∎

Under regularity conditions, for instance those of Theorem 5.39, the maximum likelihood estimator $\hat\theta_n$ in a parametric model satisfies

$$\sqrt n\bigl(\hat\theta_n - \theta\bigr) = \frac{1}{\sqrt n}\sum_{i=1}^n I_\theta^{-1}\dot\ell_\theta(X_i) + o_{P_\theta}(1).$$

Then the maximum likelihood estimator is asymptotically optimal for estimating $\theta$ in terms of the convolution theorem. By the delta method, the estimator $\psi(\hat\theta_n)$ for $\psi(\theta)$ can be seen
to be asymptotically linear as in the preceding theorem, so that it is asymptotically regular and optimal as well. Actually, regular and asymptotically optimal estimators for $\theta$ exist in every parametric model $(P_\theta : \theta \in \Theta)$ that is differentiable in quadratic mean with nonsingular Fisher information throughout $\Theta$, provided the parameter $\theta$ is identifiable. This can be shown using the discretized one-step method discussed in section 5.7 (see [93]).
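The asymptotic linearity display of Lemma 8.14 can be illustrated numerically. The sketch below (assuming NumPy; the exponential model $p_\theta(x) = \theta e^{-\theta x}$ with score $\dot\ell_\theta(x) = 1/\theta - x$ and $I_\theta = \theta^{-2}$ is an arbitrary choice) compares $\sqrt n(\hat\theta_n - \theta)$ with the linear term $n^{-1/2}\sum I_\theta^{-1}\dot\ell_\theta(X_i)$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n = 2.0, 100_000
X = rng.exponential(scale=1.0 / theta, size=n)

mle = 1.0 / X.mean()                    # maximum likelihood estimator of theta
lhs = np.sqrt(n) * (mle - theta)        # sqrt(n) (theta_hat - theta)
rhs = theta**2 * np.sum(1.0 / theta - X) / np.sqrt(n)  # I^{-1} times normalized score sum
remainder = lhs - rhs                   # o_P(1) term; small for large n
```

The remainder shrinks at rate $n^{-1/2}$, while both sides are approximately $N(0, \theta^2)$ variables.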
*8.10
Large Deviations
Consistency of an estimator sequence $T_n$ entails that the probability of the event $d(T_n, \psi(\theta)) > \varepsilon$ tends to zero under $\theta$, for every $\varepsilon > 0$. This is a very weak requirement. One method to strengthen it is to make $\varepsilon$ dependent on $n$ and to require that the probabilities $P_\theta(d(T_n, \psi(\theta)) > \varepsilon_n)$ converge to 0, or are bounded away from 1, for a given sequence $\varepsilon_n \to 0$. The results of the preceding sections address this question and give very precise lower bounds for these probabilities using an "optimal" rate $\varepsilon_n = r_n^{-1}$, typically $n^{-1/2}$.
Another method of strengthening the consistency is to study the speed at which the probabilities $P_\theta(d(T_n, \psi(\theta)) > \varepsilon)$ converge to 0 for a fixed $\varepsilon > 0$. This method appears to be of less importance but is of some interest. Typically, the speed of convergence is exponential, and there is a precise lower bound for the exponential rate in terms of the Kullback-Leibler information. We consider the situation that $T_n$ is based on a random sample of size $n$ from a distribution $P_\theta$, indexed by a parameter $\theta$ ranging over an arbitrary set $\Theta$. We wish to estimate the value of a function $\psi: \Theta \mapsto \mathbb{D}$ that takes its values in a metric space.

8.15 Theorem. Suppose that the estimator sequence $T_n$ is consistent for $\psi(\theta)$ under every $\theta$. Then, for every $\varepsilon > 0$ and every $\theta_0$,

$$\limsup_{n\to\infty} -\frac1n \log P_{\theta_0}\bigl(d(T_n, \psi(\theta_0)) > \varepsilon\bigr) \le \inf_{\theta:\, d(\psi(\theta), \psi(\theta_0)) > \varepsilon} -P_\theta \log \frac{p_{\theta_0}}{p_\theta}.$$
Proof. If the right side is infinite, then there is nothing to prove. The Kullback-Leibler information $-P_\theta \log p_{\theta_0}/p_\theta$ can be finite only if $P_\theta \ll P_{\theta_0}$. Hence, it suffices to prove that $-P_\theta \log p_{\theta_0}/p_\theta$ is an upper bound for the left side for every $\theta$ such that $P_\theta \ll P_{\theta_0}$ and $d(\psi(\theta), \psi(\theta_0)) > \varepsilon$. The variable $\Lambda_n = n^{-1} \sum_{i=1}^n \log(p_\theta/p_{\theta_0})(X_i)$ is well defined (possibly $-\infty$). For every constant $M$,

$$P_{\theta_0}\bigl(d(T_n, \psi(\theta_0)) > \varepsilon\bigr) \ge P_{\theta_0}\bigl(d(T_n, \psi(\theta_0)) > \varepsilon,\ \Lambda_n < M\bigr) \ge \mathrm{E}_\theta\, 1\bigl\{d(T_n, \psi(\theta_0)) > \varepsilon,\ \Lambda_n < M\bigr\}\, e^{-n\Lambda_n} \ge e^{-nM}\, P_\theta\bigl(d(T_n, \psi(\theta_0)) > \varepsilon,\ \Lambda_n < M\bigr).$$
Take logarithms and multiply by $-1/n$ to conclude that

$$-\frac1n \log P_{\theta_0}\bigl(d(T_n, \psi(\theta_0)) > \varepsilon\bigr) \le M - \frac1n \log P_\theta\bigl(d(T_n, \psi(\theta_0)) > \varepsilon,\ \Lambda_n < M\bigr).$$

For every $M > P_\theta \log p_\theta/p_{\theta_0}$, we have that $P_\theta(\Lambda_n < M) \to 1$ by the law of large numbers. Furthermore, by the consistency of $T_n$ for $\psi(\theta)$, the probability $P_\theta\bigl(d(T_n, \psi(\theta_0)) > \varepsilon\bigr)$
converges to 1 for every $\theta$ such that $d(\psi(\theta), \psi(\theta_0)) > \varepsilon$. Conclude that the probability on the right side of the preceding display converges to 1, whence the lim sup of the left side is bounded by $M$. ∎
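Both sides of Theorem 8.15 can be computed exactly for $T_n = \bar X_n$ and $\psi(\theta) = \theta$ in the normal $N(\theta, 1)$ location model (an assumption made here for illustration): the tail probability is $2\Phi(-\varepsilon\sqrt n)$, so $-(1/n)\log P_{\theta_0}(|\bar X_n - \theta_0| > \varepsilon) \to \varepsilon^2/2$, which coincides with $\inf_{|\theta - \theta_0| > \varepsilon} -P_\theta \log(p_{\theta_0}/p_\theta) = \varepsilon^2/2$. A numeric sketch (assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import norm

eps = 0.5
kl_bound = eps**2 / 2  # inf over |theta - theta0| > eps of (theta - theta0)^2 / 2
rates = {}
for n in (10, 100, 1000, 10_000):
    # exact tail: P(|mean - theta0| > eps) = 2 Phi(-eps sqrt(n)); logcdf keeps accuracy
    log_tail = np.log(2) + norm.logcdf(-eps * np.sqrt(n))
    rates[n] = -log_tail / n
```

The computed rates decrease toward $\varepsilon^2/2 = 0.125$ as $n$ grows, so here the Kullback-Leibler bound is attained.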
Notes
Chapter 32 of the famous book by Cramér [27] gives a rigorous proof of what we now know as the Cramér-Rao inequality and next goes on to define the asymptotic efficiency of an estimator as the quotient of the inverse Fisher information and the asymptotic variance. Cramér defines an estimator as asymptotically efficient if its efficiency (the quotient mentioned previously) equals one. These definitions lead to the conclusion that the method of maximum likelihood produces asymptotically efficient estimators, as already conjectured by Fisher [48, 50] in the 1920s. That there is a conceptual hole in the definitions was clearly realized in 1951 when Hodges produced his example of a superefficient estimator. Not long after this, in 1953, Le Cam proved that superefficiency can occur only on a Lebesgue null set. Our present result, almost without regularity conditions, is based on later work by Le Cam (see [95]). The asymptotic convolution and minimax theorems were obtained in the present form by Hájek in [69] and [70] after initial work by many authors. Our present proofs follow the approach based on limit experiments, initiated by Le Cam in [95].
PROBLEMS

1. Calculate the asymptotic relative efficiency of the sample mean and the sample median for estimating $\theta$, based on a sample of size $n$ from the normal $N(\theta, 1)$-distribution.
2. As the previous problem, but now for the Laplace distribution (density $p(x) = \tfrac12 e^{-|x|}$).
3. Consider estimating the distribution function $P(X \le x)$ at a fixed point $x$ based on a sample $X_1, \ldots, X_n$ from the distribution of $X$. The "nonparametric" estimator is $n^{-1}\#(X_i \le x)$. If it is known that the true underlying distribution is normal $N(\theta, 1)$, another possible estimator is $\Phi(x - \bar X)$. Calculate the relative efficiency of these estimators.
4. Calculate the relative efficiency of the empirical $p$-quantile and the estimator $\Phi^{-1}(p)S_n + \bar X_n$ for estimating the $p$-th quantile of the distribution of a sample from the normal $N(\mu, \sigma^2)$-distribution.

5. Consider estimating the population variance by either the sample variance $S^2$ (which is unbiased) or else $n^{-1}\sum_{i=1}^n (X_i - \bar X)^2 = \bigl((n-1)/n\bigr)S^2$. Calculate the asymptotic relative efficiency.

6. Calculate the asymptotic relative efficiency of the sample standard deviation and the interquartile range (corrected for unbiasedness) for estimating the standard deviation based on a sample of size $n$ from the normal $N(\mu, \sigma^2)$-distribution.
e],
en )
7. Given a sample of size n from the uniform distribution on [0, the maximum X of the observations is biased downwards . Because Ee (e - X Ee X ( l ) • the bias can be removed by adding the minimum of the observations. Is X ( 1 ) + X a good estimator for e from an asymptotic point of view?
en ) ) = en )
8. Consider the Hodges estimator $S_n$ based on the mean of a sample from the $N(\theta, 1)$-distribution.
(i) Show that $\sqrt n(S_n - \theta_n)$ converges in probability to $-\infty$ if $\theta_n \to 0$ in such a way that $n^{1/4}\theta_n \to 0$ and $n^{1/2}\theta_n \to \infty$.
(ii) Show that $S_n$ is not regular at $\theta = 0$.
(iii) Show that sup

9.6 Theorem. If $X_1, \ldots, X_n$ are a sample from the uniform distribution on $[0, \theta]$, then the sequence of experiments $(P^n_{\theta - h/n} : h \in \mathbb{R})$ converges to the experiment consisting of observing one observation from the shifted exponential density $z \mapsto e^{-(z-h)/\theta}\, 1\{z > h\}/\theta$.†

Proof.
If $Z$ is distributed according to the given exponential density, then

$$\frac{dP_h^Z}{dP_{h_0}^Z}(Z) = \frac{e^{-(Z-h)/\theta}\, 1\{Z > h\}/\theta}{e^{-(Z-h_0)/\theta}\, 1\{Z > h_0\}/\theta} = e^{(h-h_0)/\theta}\, 1\{Z > h\},$$

almost surely under $h_0$, because the indicator $1\{Z > h_0\}$ in the denominator equals 1 almost surely if $h_0$ is the true parameter. The joint density of a random sample $X_1, \ldots, X_n$ from the uniform $[0, \theta]$ distribution can be written in the form $(1/\theta)^n 1\{X_{(n)} \le \theta\}$. The likelihood ratios take the form

$$\frac{dP^n_{\theta - h/n}}{dP^n_{\theta - h_0/n}}(X_1, \ldots, X_n) = \Bigl(\frac{\theta - h_0/n}{\theta - h/n}\Bigr)^n\, \frac{1\{X_{(n)} \le \theta - h/n\}}{1\{X_{(n)} \le \theta - h_0/n\}}.$$
Under the parameter $\theta - h_0/n$, the maximum of the observations is certainly bounded above by $\theta - h_0/n$ and the indicator in the denominator equals 1. Thus, with probability 1 under $\theta - h_0/n$, the likelihood ratio in the preceding display can be written

$$\bigl(e^{(h-h_0)/\theta} + o(1)\bigr)\, 1\bigl\{-n(X_{(n)} - \theta) \ge h\bigr\}.$$
By direct calculation, $-n(X_{(n)} - \theta) \rightsquigarrow Z$. By the continuous-mapping theorem and Slutsky's lemma, the sequence of likelihood processes converges under $\theta - h_0/n$ marginally in distribution to the likelihood process of the exponential experiment. ∎

Along the same lines it may be proved that in the case of uniform distributions with both endpoints unknown a limit experiment based on observation of two independent exponential variables pertains. These types of experiments are completely determined by the discontinuities of the underlying densities at their left and right endpoints. It can be shown more generally that exponential limit experiments are obtained for any densities that have jumps at one or both of their endpoints and are smooth in between. For densities with discontinuities in the middle, or weaker singularities, other limit experiments pertain.

The convergence to a limit experiment combined with the asymptotic representation theorem, Theorem 9.3, allows one to obtain asymptotic lower bounds for sequences of estimators, much as in the locally asymptotically normal case in Chapter 8. We give only one concrete statement.
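The convergence $-n(X_{(n)} - \theta) \rightsquigarrow Z$, with $Z$ exponential with mean $\theta$, is visible in a small simulation (assuming NumPy; $\theta = 2$, the sample size, and the number of replications are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 1000, 5000
X = rng.uniform(0.0, theta, size=(reps, n))
Zn = -n * (X.max(axis=1) - theta)   # reps draws of -n (X_(n) - theta)

mean_Zn = Zn.mean()                 # the limit Z has mean theta
tail_Zn = np.mean(Zn > theta)       # the limit has P(Z > theta) = e^{-1}
```

The empirical mean and tail probability of $Z_n$ are already close to those of the exponential limit at $n = 1000$.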
† Define $P_\theta$ arbitrarily for $\theta \le 0$.

Let $X_1, \ldots, X_n$ be a sample from the Pareto distribution with parameters $\alpha > 0$ and $\mu > 0$ and density

$$p_{\alpha,\mu}(x) = \frac{\alpha \mu^\alpha}{x^{\alpha+1}}\, 1\{x > \mu\}.$$

This density is smooth in $\alpha$, but it resembles a uniform distribution as discussed in the preceding section in its dependence on $\mu$. The limit experiment consists of a combination of a normal experiment and an exponential experiment. The likelihood ratio for a sample of size $n$ from the Pareto distributions with parameters $(\alpha + g/\sqrt n, \mu + h/n)$ and $(\alpha + g_0/\sqrt n, \mu + h_0/n)$, respectively, is equal to
Here, under the parameters $(\alpha + g_0/\sqrt n, \mu + h_0/n)$, the sequence

$$\Delta_n = \frac{1}{\sqrt n} \sum_{i=1}^n \Bigl(\frac{1}{\alpha} - \log \frac{X_i}{\mu}\Bigr)$$

converges weakly to a normal distribution with mean $g_0/\alpha^2$ and variance $1/\alpha^2$; and the sequence $Z_n = n(X_{(1)} - \mu)$ converges in distribution to the (shifted) exponential distribution with mean $\mu/\alpha + h_0$ and variance $(\mu/\alpha)^2$. The two sequences are asymptotically independent. Thus the likelihood is a product of a locally asymptotically normal and a "locally asymptotically exponential" factor. The local limit experiment consists of observing a pair $(\Delta, Z)$ of independent variables $\Delta$ and $Z$ with a $N(g, \alpha^2)$-distribution and an $\exp(\alpha/\mu) + h$-distribution, respectively.

The maximum likelihood estimators for the parameters $\alpha$ and $\mu$ are given by

$$\hat\alpha_n = \frac{n}{\sum_{i=1}^n \log(X_i/X_{(1)})}, \qquad \hat\mu_n = X_{(1)}.$$

The sequence $\sqrt n(\hat\alpha_n - \alpha)$ converges in distribution under the parameters $(\alpha + g/\sqrt n, \mu + h/n)$ to the variable $\Delta - g$. Because the distribution of $Z$ does not depend on $g$, and $\Delta$ follows a normal location model, the variable $\Delta$ can be considered an optimal estimator for $g$ based on the observation $(\Delta, Z)$. This optimality is carried over into the asymptotic optimality of the maximum likelihood estimator $\hat\alpha_n$. A precise formulation could be given in terms of a convolution or a minimax theorem. On the other hand, the maximum likelihood estimator for $\mu$ is asymptotically inefficient. Because the sequence $n(\hat\mu_n - \mu - h/n)$ converges in distribution to $Z - h$, the estimators $\hat\mu_n$ are asymptotically biased upwards.

9.6 Asymptotic Mixed Normality
The likelihood ratios of some models allow an approximation by a two-term Taylor expansion without the linear term being asymptotically normal and the quadratic term being deterministic. Then a generalization of local asymptotic normality is possible. In the most important example of this situation, the linear term is asymptotically distributed as a mixture of normal distributions.

A sequence of experiments $(P_{n,\theta} : \theta \in \Theta)$ indexed by an open subset $\Theta$ of $\mathbb{R}^d$ is called locally asymptotically mixed normal at $\theta$ if there exist matrices $r_{n,\theta} \to 0$ such that

$$\log \frac{dP_{n,\theta + r_{n,\theta}h_n}}{dP_{n,\theta}} = h^T \Delta_{n,\theta} - \frac12 h^T J_{n,\theta} h + o_{P_{n,\theta}}(1),$$

for every converging sequence $h_n \to h$, and random vectors $\Delta_{n,\theta}$ and random matrices $J_{n,\theta}$ such that $(\Delta_{n,\theta}, J_{n,\theta}) \rightsquigarrow (\Delta_\theta, J_\theta)$ for a random vector such that the conditional distribution of $\Delta_\theta$ given that $J_\theta = J$ is normal $N(0, J)$.

Locally asymptotically mixed normal is often abbreviated to LAMN. Locally asymptotically normal, or LAN, is the special case in which the matrix $J_\theta$ is deterministic. Sequences of experiments whose likelihood ratios allow a quadratic approximation as in the preceding display (but without the specific limit distribution of $(\Delta_{n,\theta}, J_{n,\theta})$) and that are
such that $P_{n,\theta + r_{n,\theta}h}$ and $P_{n,\theta}$ are mutually contiguous are called locally asymptotically quadratic, or LAQ. We note that LAQ or LAMN requires much more than the mere existence of two derivatives of the likelihood: There is no reason why, in general, the remainder would be negligible.

9.8 Theorem. Assume that the sequence of experiments $(P_{n,\theta} : \theta \in \Theta)$ is locally asymptotically mixed normal at $\theta$. Then the sequence of experiments $(P_{n,\theta + r_{n,\theta}h} : h \in \mathbb{R}^d)$ converges to the experiment consisting of observing a pair $(\Delta, J)$ such that $J$ is marginally distributed as $J_\theta$ for every $h$ and the conditional distribution of $\Delta$ given $J$ is normal $N(Jh, J)$.
Proof. Write $P_{\theta,h}$ for the distribution of $(\Delta, J)$ under $h$. Because the marginal distribution of $J$ does not depend on $h$ and the conditional distribution of $\Delta$ given $J$ is Gaussian,

$$\frac{dP_{\theta,h}}{dP_{\theta,0}}(\Delta, J) = e^{h^T \Delta - \frac12 h^T J h}.$$

By Slutsky's lemma and the assumptions, the sequence $dP_{n,\theta + r_{n,\theta}h}/dP_{n,\theta}$ converges under $\theta$ in distribution to $\exp\bigl(h^T \Delta_\theta - \tfrac12 h^T J_\theta h\bigr)$. Because the latter variable has mean one, it follows that the sequences of distributions $P_{n,\theta + r_{n,\theta}h}$ and $P_{n,\theta}$ are mutually contiguous. In particular, the probability under $\theta$ that $dP_{n,\theta + r_{n,\theta}h_0}$ is zero converges to zero for every $h$, so that

$$\log \frac{dP_{n,\theta + r_{n,\theta}h}}{dP_{n,\theta + r_{n,\theta}h_0}} = \log \frac{dP_{n,\theta + r_{n,\theta}h}}{dP_{n,\theta}} - \log \frac{dP_{n,\theta + r_{n,\theta}h_0}}{dP_{n,\theta}} + o_{P_{n,\theta}}(1).$$

Conclude that it suffices to show that the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ converges under $\theta + r_{n,\theta}h_0$ to the distribution of $(\Delta, J)$ under $h_0$. Using the general form of Le Cam's third lemma we obtain that the limit distribution of the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ under $\theta + r_{n,\theta}h$ takes the form

$$B \mapsto \mathrm{E}\, 1_B(\Delta_\theta, J_\theta)\, e^{h^T \Delta_\theta - \frac12 h^T J_\theta h}.$$

On noting that the distribution of $(\Delta, J)$ under $h = 0$ is the same as the distribution of $(\Delta_\theta, J_\theta)$, we see that this is equal to $\mathrm{E}_0\, 1_B(\Delta, J)\, dP_{\theta,h}/dP_{\theta,0}(\Delta, J) = P_h\bigl((\Delta, J) \in B\bigr)$. ∎
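The mean-one property invoked for contiguity holds for any mixed normal pair: if $\Delta \mid J \sim N(0, J)$, then $\mathrm{E}\exp(h\Delta - \tfrac12 h^2 J) = 1$ for every $h$, whatever the law of $J$. A Monte Carlo sketch in dimension one (assuming NumPy; the chi-square mixing law and $h = 0.3$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
m, h = 1_000_000, 0.3
J = rng.chisquare(df=3, size=m)           # random "information", J > 0
Delta = np.sqrt(J) * rng.normal(size=m)   # Delta | J ~ N(0, J)

lr = np.exp(h * Delta - 0.5 * h**2 * J)   # likelihood ratio variable
mean_lr = lr.mean()                       # should be close to 1
```

Conditionally on $J$ the expectation equals $e^{h^2 J/2 - h^2 J/2} = 1$ exactly, so the unconditional Monte Carlo mean hovers around 1.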
It is possible to develop a theory of asymptotic "lower bounds" for LAMN models, much as is done for LAN models in Chapter 8. Because conditionally on the ancillary statistic $J$, the limit experiment is a Gaussian shift experiment, the lower bounds take the form of mixtures of the lower bounds for the LAN case. We give only one example, leaving the details to the reader.
9.9 Corollary. Let $T_n$ be an estimator sequence in a LAMN sequence of experiments $(P_{n,\theta} : \theta \in \Theta)$ such that $r_{n,\theta}^{-1}\bigl(T_n - \psi(\theta + r_{n,\theta}h)\bigr)$ converges weakly under every $\theta + r_{n,\theta}h$ to a limit distribution $L_\theta$, for every $h$. Then there exist probability distributions $M_J$ (or rather a Markov kernel) such that $L_\theta = \mathrm{E}\bigl[N(0, \dot\psi_\theta J_\theta^{-1}\dot\psi_\theta^T) * M_{J_\theta}\bigr]$. In particular, $\mathrm{cov}_\theta L_\theta \ge \mathrm{E}\, \dot\psi_\theta J_\theta^{-1}\dot\psi_\theta^T$.
We include two examples to give some idea of the application of local asymptotic mixed normality. In both examples the sequence of models is LAMN rather than LAN due to an explosive growth of information, occurring at certain supercritical parameter values. The second derivative of the log likelihood, the information, remains random. In both examples there is also (approximate) Gaussianity present in every single observation. This appears to be typical, unlike the situation with LAN, in which the normality results from sums over (approximately) independent observations. In explosive models of this type the likelihood is dominated by a few observations, and normality cannot be brought in through (martingale) central limit theorems.

9.10 Example (Branching processes). In a Galton-Watson branching process the "$n$th generation" is formed by replacing each element of the $(n-1)$th generation by a random number of elements, independently from the rest of the population and from the preceding generations. This random number is distributed according to a fixed distribution, called the offspring distribution. Thus, conditionally on the size $X_{n-1}$ of the $(n-1)$th generation the size $X_n$ of the $n$th generation is distributed as the sum of $X_{n-1}$ i.i.d. copies of an offspring variable $Z$. Suppose that $X_0 = 1$, that we observe $(X_1, \ldots, X_n)$, and that the offspring distribution is known to belong to an exponential family of the form

$$P_\theta(Z = z) = a_z \theta^z c(\theta), \qquad z = 0, 1, 2, \ldots,$$
for given numbers $a_0, a_1, \ldots$. The natural parameter space is the set of all $\theta$ such that $c(\theta)^{-1} = \sum_z a_z \theta^z$ is finite (an interval). We shall concentrate on parameters in the interior of the natural parameter space such that $\mu(\theta) := \mathrm{E}_\theta Z > 1$. Set $\sigma^2(\theta) = \mathrm{var}_\theta Z$. The sequence $X_1, X_2, \ldots$ is a Markov chain with transition density

$$p_\theta(y \mid x) = P_\theta(X_n = y \mid X_{n-1} = x) = \Bigl(\sum_{z_1 + \cdots + z_x = y} a_{z_1} \cdots a_{z_x}\Bigr)\, \theta^y c(\theta)^x.$$
To obtain a two-term Taylor expansion of the log likelihood ratios, let $\ell_\theta(y \mid x)$ be the log transition density, and calculate that

$$\dot\ell_\theta(y \mid x) = \frac{y - x\mu(\theta)}{\theta}, \qquad \ddot\ell_\theta(y \mid x) = -\frac{y - x\mu(\theta)}{\theta^2} - \frac{x\,\dot\mu(\theta)}{\theta}.$$

(The fact that the score function of the model $\theta \mapsto P_\theta(Z = z)$ has mean zero yields the identity $\mu(\theta) = -\theta(\dot c/c)(\theta)$, as is usual for exponential families.) Thus, the Fisher information in the observation $(X_1, \ldots, X_n)$ equals (note that $\mathrm{E}_\theta(X_j \mid X_{j-1}) = X_{j-1}\mu(\theta)$)

$$-\mathrm{E}_\theta \sum_{j=1}^n \ddot\ell_\theta(X_j \mid X_{j-1}) = \frac{\dot\mu(\theta)}{\theta} \sum_{j=1}^n \mathrm{E}_\theta X_{j-1} = \frac{\dot\mu(\theta)}{\theta} \sum_{j=1}^n \mu(\theta)^{j-1}.$$

For $\mu(\theta) > 1$, this converges to infinity at a much faster rate than "usually." Because the total information in $(X_1, \ldots, X_n)$ is of the same order as the information in the last observation $X_n$, the model is "explosive" in terms of growth of information. The calculation suggests the rescaling rate $r_{n,\theta} = \mu(\theta)^{-n/2}$, which is roughly the inverse root of the information.
A Taylor expansion of the log likelihood ratio yields the existence of a point $\theta_n$ between $\theta$ and $\theta + r_{n,\theta}h$ such that

$$\log \prod_{j=1}^n \frac{p_{\theta + r_{n,\theta}h}}{p_\theta}(X_j \mid X_{j-1}) = h\, r_{n,\theta} \sum_{j=1}^n \dot\ell_\theta(X_j \mid X_{j-1}) + \frac12 h^2 r_{n,\theta}^2 \sum_{j=1}^n \ddot\ell_{\theta_n}(X_j \mid X_{j-1}).$$

This motivates the definitions

$$\Delta_{n,\theta} = \mu(\theta)^{-n/2} \sum_{j=1}^n \frac{X_j - X_{j-1}\mu(\theta)}{\theta}, \qquad J_{n,\theta} = \mu(\theta)^{-n}\, \frac{\dot\mu(\theta)}{\theta} \sum_{j=1}^n X_{j-1}.$$

Because $\mathrm{E}_\theta(X_n \mid X_{n-1}, \ldots, X_1) = X_{n-1}\mu(\theta)$, the sequence of random variables $\mu(\theta)^{-n} X_n$ is a martingale under $\theta$. Some algebra shows that its second moments are bounded as $n \to \infty$. Thus, by a martingale convergence theorem (e.g., Theorem 10.5.4 of [42]), there exists a random variable $V$ such that $\mu(\theta)^{-n} X_n \to V$ almost surely. By the Toeplitz lemma (Problem 9.6) and again some algebra, we obtain that, almost surely under $\theta$,

$$\frac{1}{\mu(\theta)^n} \sum_{j=1}^n X_j \to \frac{\mu(\theta)}{\mu(\theta) - 1}\, V, \qquad \frac{1}{\mu(\theta)^n} \sum_{j=1}^n X_{j-1} \to \frac{1}{\mu(\theta) - 1}\, V.$$

It follows that the point $\theta_n$ in the expansion of the log likelihood can be replaced by $\theta$ at the cost of adding a term that converges to zero in probability under $\theta$. Furthermore,

$$J_{n,\theta} \to \frac{\dot\mu(\theta)}{\theta\bigl(\mu(\theta) - 1\bigr)}\, V, \qquad P_\theta\text{-almost surely.}$$
It remains to derive the limit distribution of the sequence $\Delta_{n,\theta}$. If we write $X_j = \sum_{i=1}^{X_{j-1}} Z_{j,i}$ for independent copies $Z_{j,i}$ of the offspring variable $Z$, then

$$\Delta_{n,\theta} = \frac{\mu(\theta)^{-n/2}}{\theta} \sum_{i=1}^{v_n} \bigl(Z_i - \mu(\theta)\bigr),$$

for independent copies $Z_1, Z_2, \ldots$ of $Z$ and $v_n = \sum_{j=1}^n X_{j-1}$. Even though $Z_1, Z_2, \ldots$ and the total number $v_n$ of variables in the sum are dependent, a central limit theorem applies to the right side: Conditionally on the event $\{V > 0\}$ (on which $v_n \to \infty$), the sequence $v_n^{-1/2} \sum_{i=1}^{v_n} \bigl(Z_i - \mu(\theta)\bigr)$ converges in distribution to $\sigma(\theta)$ times a standard normal variable $G$.† Furthermore, if we define $G$ independent of $V$, conditionally on $\{V > 0\}$,

$$(\Delta_{n,\theta}, J_{n,\theta}) \rightsquigarrow \Bigl(\frac{\sigma(\theta)\, G\, \sqrt V}{\theta\sqrt{\mu(\theta) - 1}},\ \frac{\dot\mu(\theta)\, V}{\theta\bigl(\mu(\theta) - 1\bigr)}\Bigr). \qquad (9.11)$$

† See the appendix of [81] or, e.g., Theorem 3.5.1 and its proof in [146].
It is well known that the event $\{V = 0\}$ coincides with the event $\{\lim X_n = 0\}$ of extinction of the population. (This occurs with positive probability if and only if $a_0 > 0$.) Thus, on the set $\{V = 0\}$ the series $\sum_{j=1}^\infty X_j$ converges almost surely, whence $\Delta_{n,\theta} \to 0$. Interpreting zero as the product of a standard normal variable and zero, we see that (9.11) is again valid. Thus the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ converges also unconditionally to this limit. Finally, note that $\sigma^2(\theta)/\theta = \dot\mu(\theta)$, so that the limit distribution has the right form. The maximum likelihood estimator for $\mu(\theta)$ can be shown to be asymptotically efficient (see, e.g., [29] or [81]). □
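The martingale convergence $\mu(\theta)^{-n} X_n \to V$ that drives the example can be watched directly by simulating a supercritical Galton-Watson process. A sketch (assuming NumPy; Poisson offspring with mean $\mu = 1.5$ and the number of generations are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, generations = 1.5, 30
x = 1                               # X_0 = 1
w = []                              # normalized sizes W_n = mu^{-n} X_n
for n in range(1, generations + 1):
    # each of the x individuals gets a Poisson(mu) number of offspring
    x = int(rng.poisson(mu, size=x).sum()) if x > 0 else 0
    w.append(x / mu**n)
# On survival the path stabilizes near a random V > 0; on extinction it hits 0.
```

The increments of the normalized path shrink geometrically, in line with the bounded second moments of the martingale.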
9.12 Example (Gaussian AR). The canonical example of an LAMN sequence of experiments is obtained from an explosive autoregressive process of order one with Gaussian innovations. (The Gaussianity is essential.) Let $|\theta| > 1$ and $\varepsilon_1, \varepsilon_2, \ldots$ be an i.i.d. sequence of standard normal variables independent of a fixed variable $X_0$. We observe the vector $(X_0, X_1, \ldots, X_n)$ generated by the recursive formula $X_t = \theta X_{t-1} + \varepsilon_t$. The observations form a Markov chain with transition density $p(\cdot \mid x_{t-1})$ equal to the $N(\theta x_{t-1}, 1)$-density. Therefore, the log likelihood ratio process takes the form

$$\log \frac{dP_{n,\theta + r_{n,\theta}h}}{dP_{n,\theta}}(X_0, \ldots, X_n) = h\, r_{n,\theta} \sum_{t=1}^n (X_t - \theta X_{t-1}) X_{t-1} - \frac12 h^2 r_{n,\theta}^2 \sum_{t=1}^n X_{t-1}^2.$$

This has already the appropriate quadratic structure. To establish LAMN, it suffices to find the right rescaling rate and to establish the joint convergence of the linear and the quadratic term. The rescaling rate may be chosen proportional to the inverse root of the Fisher information and is taken $r_{n,\theta} = \theta^{-n}$. By repeated application of the defining autoregressive relationship, we see that

$$\theta^{-t} X_t = X_0 + \sum_{j=1}^t \theta^{-j} \varepsilon_j \to V := X_0 + \sum_{j=1}^\infty \theta^{-j} \varepsilon_j,$$
almost surely as well as in second mean. Given the variable $X_0$, the limit is normally distributed with mean $X_0$ and variance $(\theta^2 - 1)^{-1}$. An application of the Toeplitz lemma (Problem 9.6) yields

$$J_{n,\theta} = \theta^{-2n} \sum_{t=1}^n X_{t-1}^2 \to \frac{V^2}{\theta^2 - 1}, \qquad \text{almost surely.}$$

The linear term in the quadratic representation of the log likelihood can (under $\theta$) be rewritten as $\theta^{-n} \sum_{t=1}^n \varepsilon_t X_{t-1}\, h$ and satisfies, by the Cauchy-Schwarz inequality and the Toeplitz lemma,

$$\theta^{-n} \sum_{t=1}^n \varepsilon_t X_{t-1} - \theta^{-n} \sum_{t=1}^n \varepsilon_t \theta^{t-1} V \to 0, \qquad \text{in probability.}$$

It follows that the sequence of vectors $(\Delta_{n,\theta}, J_{n,\theta})$ has the same limit distribution as the sequence of vectors $\bigl(\theta^{-n} \sum_{t=1}^n \varepsilon_t \theta^{t-1} V,\ V^2/(\theta^2 - 1)\bigr)$. For every $n$ the vector $\bigl(\theta^{-n} \sum_{t=1}^n \varepsilon_t \theta^{t-1},\ V\bigr)$ possesses, conditionally on $X_0$, a bivariate-normal distribution. As $n \to \infty$ these distributions converge to a bivariate-normal distribution with mean $(0, X_0)$ and covariance matrix $I/(\theta^2 - 1)$. Conclude that the sequence $(\Delta_{n,\theta}, J_{n,\theta})$ converges in distribution as required by the LAMN criterion. □
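Both almost-sure limits in the example can be checked path by path. The sketch below (assuming NumPy; $\theta = 1.2$, $X_0 = 0.5$, and the horizon $T = 200$ are arbitrary choices) simulates one explosive path, compares $\theta^{-T} X_T$ with a long truncation of $V = X_0 + \sum_j \theta^{-j}\varepsilon_j$, and compares $\theta^{-2T}\sum_t X_{t-1}^2$ with $V^2/(\theta^2 - 1)$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, T = 1.2, 200
x0 = 0.5
eps = rng.normal(size=10 * T)

# V truncated far beyond the horizon T of the observed path
V = x0 + np.sum(theta ** -np.arange(1.0, 10 * T + 1) * eps)

x, sum_sq = x0, 0.0
for t in range(T):
    sum_sq += x**2            # accumulates sum of X_{t-1}^2
    x = theta * x + eps[t]    # X_t = theta X_{t-1} + eps_t

normalized_state = x / theta**T              # approximates V
normalized_info = sum_sq / theta ** (2 * T)  # approximates V^2 / (theta^2 - 1)
```

Because $\theta^{-T}X_T$ equals $X_0 + \sum_{j\le T}\theta^{-j}\varepsilon_j$ exactly, the discrepancy comes only from the geometrically small tail of the series defining $V$.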
9.7 Heuristics

The asymptotic representation theorem, Theorem 9.3, shows that every sequence of statistics in a converging sequence of experiments is matched by a statistic in the limit experiment. It is remarkable that this is true under the present definition of convergence of experiments, which involves only marginal convergence and is very weak. Under appropriate stronger forms of convergence more can be said about the nature of the matching procedure in the limit experiment. For instance, a sequence of maximum likelihood estimators converges to the maximum likelihood estimator in the limit experiment, or a sequence of likelihood ratio statistics converges to the likelihood ratio statistic in the limit experiment. We do not introduce such stronger convergence concepts in this section but only note the potential of this argument as a heuristic principle. See section 5.9 for rigorous results.

For the maximum likelihood estimator the heuristic argument takes the following form. If $\hat h_n$ maximizes the likelihood $h \mapsto dP_{n,h}$, then it also maximizes the likelihood ratio process $h \mapsto dP_{n,h}/dP_{n,h_0}$. The latter sequence of processes converges (marginally) in distribution to the likelihood ratio process $h \mapsto dP_h/dP_{h_0}$ of the limit experiment. It is reasonable to expect that the maximizer $\hat h_n$ converges in distribution to the maximizer of the process $h \mapsto dP_h/dP_{h_0}$, which is the maximum likelihood estimator for $h$ in the limit experiment. (Assume that this exists and is unique.) If the converging experiments are the local experiments corresponding to a given sequence of experiments with a parameter $\theta$, then the argument suggests that the sequence of local maximum likelihood estimators $\hat h_n = r_n(\hat\theta_n - \theta)$ converges, under $\theta$, in distribution to the maximum likelihood estimator in the local limit experiment, under $h = 0$.

Besides yielding the limit distribution of the maximum likelihood estimator, the argument also shows to what extent the estimator is asymptotically efficient. It is efficient, or inefficient, in the same sense as the maximum likelihood estimator is efficient or inefficient in the limit experiment. That maximum likelihood estimators are often asymptotically efficient is a consequence of the fact that often the limit experiment is Gaussian and the maximum likelihood estimator of a Gaussian location parameter is optimal in a certain sense. If the limit experiment is not Gaussian, there is no a priori reason to expect that the maximum likelihood estimators are asymptotically efficient.

A variety of examples shows that the conclusions of the preceding heuristic arguments are often but not universally valid. The reason for failures is that the convergence of experiments is not well suited to allow claims about maximum likelihood estimators. Such claims require stronger forms of convergence than marginal convergence only. For the case of experiments consisting of a random sample from a smooth parametric model, the argument is made precise in section 7.4. Next to the convergence of experiments, it is required only that the maximum likelihood estimator is consistent and that the log density is locally Lipschitz in the parameter. The preceding heuristic argument also extends to the other examples of convergence to limit experiments considered in this chapter. For instance, the maximum likelihood estimator based on a sample from the uniform distribution on $[0, \theta]$
is asymptotically inefficient, because it corresponds to the estimator $Z$ for $h$ (the maximum likelihood estimator) in the exponential limit experiment. The latter is biased upwards and inefficient for each of the usual loss functions.

Notes
This chapter presents a few examples from a large body of theory. The notion of a limit experiment was introduced by Le Cam in [95] . He defined convergence of experiments through convergence of all finite subexperiments relative to his deficiency distance, rather than through convergence of the likelihood ratio processes. This deficiency distance in troduces a "strong topology" next to the "weak topology" corresponding to convergence of experiments. For experiments with a finite parameter set, the two topologies coincide. There are many general results that can help to prove the convergence of experiments and to find the limits (also in the examples discussed in this chapter). See [82] , [89] , [96] , [97], [ 1 15], [ 1 38], [142] and [144] for more information and more examples. For nonlocal ap proximations in the strong topology see, for example, [96] or [1 10] . PROBLEMS 1. Let
X₁, …, Xn be an i.i.d. sample from the normal N(h/√n, 1) distribution, in which h ∈ ℝ. The corresponding sequence of experiments converges to a normal experiment by the general results. Can you see this directly?

2. If the nth experiment corresponds to the observation of a sample of size n from the uniform
[0, 1 − h/n] distribution, then the limit experiment corresponds to observation of a shifted exponential variable Z. The sequences −n(X₍ₙ₎ − 1) and √n(2X̄n − 1) both converge in distribution under every h. According to the representation theorem their sets of limit distributions are the distributions of randomized statistics based on Z. Find these randomized statistics explicitly. Any implications regarding the quality of X₍ₙ₎ and X̄n as estimators?
3. Let the nth experiment consist of one observation from the binomial distribution with parameters n and success probability h/n, with 0 < h < 1 unknown. Show that this sequence of experiments converges to the experiment consisting of observing a Poisson variable with mean h.
4. Let the nth experiment consist of observing an i.i.d. sample of size n from the uniform [−1 − h/n, 1 + h/n] distribution. Find the limit experiment.
5. Prove the asymptotic representation theorem for the case in which the nth experiment corresponds to an i.i.d. sample from the uniform [0, θ − h/n] distribution with h > 0, by mimicking the proof of this theorem for the locally asymptotically normal case.
6. (Toeplitz lemma.) If aₙ is a sequence of nonnegative numbers with Σ aₙ = ∞, and xₙ → x is an arbitrary converging sequence of numbers, then the sequence Σⱼ₌₁ⁿ aⱼxⱼ / Σⱼ₌₁ⁿ aⱼ converges to x as well. Show this.
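The statement of the Toeplitz lemma can be illustrated numerically (the particular sequences below are arbitrary choices satisfying its hypotheses):

```python
# Numerical illustration of the Toeplitz lemma: a_j >= 0 with divergent sum,
# x_j -> x = 2; the weighted averages converge to 2 as well.
a = [1.0 / (j ** 0.5) for j in range(1, 100001)]    # sum a_j diverges
x_seq = [2.0 + 1.0 / j for j in range(1, 100001)]   # converges to 2

num = den = 0.0
averages = []
for aj, xj in zip(a, x_seq):
    num += aj * xj
    den += aj
    averages.append(num / den)

print(round(averages[9], 3), round(averages[-1], 3))  # early vs late average
```
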
7. Derive a limit experiment in the case of Galton–Watson branching with μ(θ) = …

8. Derive a limit experiment in the case of a Gaussian AR(1) process with θ = …
Such tests certainly exist if there exist estimators Tn that are uniformly consistent, in that, for every ε > 0,

sup_θ P_θ(‖Tn − θ‖ ≥ ε) → 0.

In that case, we can define φn = 1{‖Tn − θ₀‖ ≥ ε/2}. Thus the condition of the Bernstein–von Mises theorem that certain tests exist can be replaced by the condition that uniformly consistent estimators exist. This is often the case. For instance, the next lemma shows that this is the case for a Euclidean sample space X provided, for F_θ the distribution functions corresponding to the P_θ,

inf_{‖θ − θ'‖ > ε} ‖F_θ − F_θ'‖∞ > 0.
For compact parameter sets, this is implied by identifiability and continuity of the maps θ ↦ F_θ. We generalize and formalize this in a second lemma, which shows that uniformity on compact subsets is always achievable if the model (P_θ : θ ∈ Θ) is differentiable in quadratic mean at every θ and the parameter θ is identifiable. A class of measurable functions F is a uniform Glivenko–Cantelli class (in probability) if, for every ε > 0,

sup_P P(‖ℙn − P‖_F > ε) → 0.

Here the supremum is taken over all probability measures P on the sample space, and ‖Q‖_F = sup_{f ∈ F} |Qf|. An example is the collection of indicators of all cells (−∞, t] in a Euclidean sample space.

10.4
Lemma. Suppose that there exists a uniform Glivenko–Cantelli class F such that, for every ε > 0,

inf_{d(θ, θ') > ε} ‖P_θ − P_θ'‖_F > 0.    (10.5)

Then there exists a sequence of estimators that is uniformly consistent on Θ for estimating θ.

10.6 Lemma. Suppose that Θ is σ-compact, P_θ ≠ P_θ' for every pair θ ≠ θ', and the maps θ ↦ P_θ are continuous for the total variation norm. Then there exists a sequence of estimators that is uniformly consistent on every compact subset of Θ.
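Before turning to the proofs, the uniform Glivenko–Cantelli property of the class of cells (−∞, t] can be illustrated numerically: for this class ‖ℙn − P‖_F is the Kolmogorov–Smirnov distance, and it becomes small at a rate not depending on P (the sampling distributions below are arbitrary choices).

```python
import math
import random

# Sketch: for the class of cells (-inf, t], ||P_n - P||_F equals the
# Kolmogorov-Smirnov distance sup_t |F_n(t) - F(t)|.  Its size at a fixed n
# is comparable across different sampling distributions P.
random.seed(1)

def ks_distance(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - cdf(x)), abs(i / n - cdf(x)))
               for i, x in enumerate(xs))

dists = {
    "uniform":     (lambda: random.random(),         lambda x: min(max(x, 0.0), 1.0)),
    "exponential": (lambda: random.expovariate(1.0), lambda x: 1 - math.exp(-max(x, 0.0))),
    "normal":      (lambda: random.gauss(0, 1),      lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))),
}
n = 2000
sups = {name: ks_distance([draw() for _ in range(n)], cdf)
        for name, (draw, cdf) in dists.items()}
print({k: round(v, 3) for k, v in sups.items()})
```
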
Proof. For the proof of the first lemma, define θ̂n to be a point of (near) minimum of the map θ ↦ ‖ℙn − P_θ‖_F. Then, by the triangle inequality and the definition of θ̂n, ‖P_θ̂n − P_θ‖_F ≤ 2‖ℙn − P_θ‖_F + 1/n, if the near minimum is chosen within distance 1/n of the true infimum. Fix ε > 0, and let δ be the positive number given in condition (10.5). Then

P_θ(d(θ̂n, θ) ≥ ε) ≤ P_θ(‖P_θ̂n − P_θ‖_F ≥ δ) ≤ P_θ(2‖ℙn − P_θ‖_F + 1/n ≥ δ).
By assumption, the right side converges to zero uniformly in θ. For the proof of the second lemma, first assume that Θ is compact. Then there exists a uniform Glivenko–Cantelli class that satisfies the condition of the first lemma. To see this, first find a sequence A₁, A₂, … of measurable sets that separates the points P_θ. Thus, for every pair θ, θ' ∈ Θ, if P_θ(Aᵢ) = P_θ'(Aᵢ) for every i, then θ = θ'. A separating collection exists by the identifiability of the parameter, and it can be taken to be countable by the continuity of the maps θ ↦ P_θ. (For a Euclidean sample space, we can use the cells (−∞, t] for t ranging over the vectors with rational coordinates. More generally, see the lemma below.) Let F be the collection of functions x ↦ i⁻¹ 1_{Aᵢ}(x). Then the map h: Θ ↦ ℓ∞(F) given by θ ↦ (P_θ f)_{f ∈ F} is continuous and one-to-one. By the compactness of Θ, the inverse h⁻¹: h(Θ) ↦ Θ is automatically uniformly continuous. Thus, for every ε > 0 there exists δ > 0 such that

‖h(θ) − h(θ')‖_F ≤ δ implies d(θ, θ') ≤ ε.
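The minimum-distance construction used in this proof can be sketched concretely for a normal location model (the model, grid, and sample size below are illustrative choices, not from the text): minimize the Kolmogorov–Smirnov distance between the empirical distribution and the model distributions.

```python
import math
import random

# Sketch of the minimum-distance estimator theta_hat = argmin ||P_n - P_theta||_F
# for the normal location model N(theta, 1), with F the cells (-inf, t],
# so the criterion is a KS distance.  Grid search over illustrative values.
random.seed(2)

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def ks(sample, theta):
    xs = sorted(sample)
    n = len(xs)
    best = 0.0
    for i, x in enumerate(xs):
        c = norm_cdf(x - theta)
        best = max(best, abs((i + 1) / n - c), abs(i / n - c))
    return best

theta0 = 0.7
sample = [random.gauss(theta0, 1) for _ in range(1000)]
grid = [k / 100 for k in range(-300, 301)]     # candidate thetas in [-3, 3]
theta_hat = min(grid, key=lambda t: ks(sample, t))
print(theta_hat)
```
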
This means that (10.5) is satisfied. The class F is also a uniform Glivenko–Cantelli class, because, by Chebyshev's inequality, for every element f = i⁻¹1_{Aᵢ} of F,

P(|ℙn f − Pf| > ε) ≤ var_P f / (nε²) ≤ 1/(4i²nε²),

uniformly in P. Because |ℙn f − Pf| ≤ 2/i < ε for every i > 2/ε, only finitely many elements of F contribute, and sup_P P(‖ℙn − P‖_F > ε) → 0 follows. This concludes the proof of the second lemma for compact Θ.
To remove the compactness condition, write Θ as the union of an increasing sequence of compact sets K₁ ⊂ K₂ ⊂ ···. For every fixed m, there exists a sequence of estimators T_{n,m} that is uniformly consistent on K_m, by the preceding argument. Thus, for every fixed m,

a_{n,m} := sup_{θ ∈ K_m} P_θ(d(T_{n,m}, θ) ≥ 1/m) → 0, as n → ∞.

Then there exists a sequence m_n → ∞ such that a_{n,m_n} → 0 as n → ∞. The sequence θ̂n = T_{n,m_n} satisfies the requirements. ∎

It is not hard to see, as a consequence of the second lemma, that if there exists a sequence of tests φn such that (10.2) holds for some ε > 0, then it holds for every ε > 0. In that case we can replace the given sequence φn by the minimum of φn and the tests 1{‖Tn − θ₀‖ ≥ ε/2} for a sequence of estimators Tn that is uniformly consistent on a sufficiently large subset of Θ.
10.7 Lemma. Let the set of probability measures P on a measurable space (X, A) be separable for the total variation norm. Then there exists a countable subset A₀ ⊂ A such that P₁ = P₂ on A₀ implies P₁ = P₂, for every P₁, P₂ ∈ P.
Proof. The set P can be identified with a subset of L₁(μ) for a suitable probability measure μ. For instance, μ can be taken a convex linear combination of a countable dense set. Let P₀ be a countable dense subset, and let A₀ be the set of all finite intersections of the sets p⁻¹(B) for p ranging over a choice of densities of the set P₀ ⊂ L₁(μ) and B ranging over a countable generator of the Borel sets in ℝ.

Then every density p ∈ P₀ is σ(A₀)-measurable by construction. A density of a measure P ∈ P − P₀ can be approximated in L₁(μ) by a sequence from P₀ and hence can be chosen σ(A₀)-measurable, without loss of generality. Because A₀ is intersection-stable (a π-system), two probability measures that agree on A₀ automatically agree on the σ-field σ(A₀) generated by A₀. Then they also give the same expectation to every σ(A₀)-measurable function f: X ↦ [0, 1]. If the measures have σ(A₀)-measurable densities, then they must agree on A, because P(A) = E_μ 1_A p = E_μ E_μ(1_A | σ(A₀)) p if p is σ(A₀)-measurable. ∎
10.3 Point Estimators

The Bernstein–von Mises theorem shows that the posterior laws converge in distribution to a Gaussian posterior law in total variation distance. As a consequence, any location functional that is suitably continuous relative to the total variation norm, applied to the sequence of posterior laws, converges to the same location functional applied to the limiting Gaussian posterior distribution. For most choices this means to X, or a N(0, I_{θ₀}⁻¹)-distribution.
In this section we consider more general Bayes point estimators that are defined as the minimizers of the posterior risk function relative to some loss function. For a given loss function ℓ: ℝᵏ ↦ [0, ∞), let Tn, for fixed X₁, …, Xn, minimize the posterior risk

t ↦ ∫ ℓ(√n(t − θ)) dΠn(θ | X₁, …, Xn).

It is not immediately clear that Tn can be chosen a measurable function of the observations; this is assumed. We suppose that the loss function ℓ satisfies, for every M > 0,

sup_{‖h‖ ≤ M} ℓ(h) ≤ inf_{‖h‖ ≥ 2M} ℓ(h),

with strict inequality for at least one M.† This is true, for instance, for loss functions of the form ℓ(h) = ℓ₀(‖h‖) for a nondecreasing function ℓ₀: [0, ∞) ↦ [0, ∞) that is not constant on (0, ∞). Furthermore, we suppose that ℓ grows at most polynomially: For some constant p ≥ 0,

ℓ(h) ≲ 1 + ‖h‖^p.

† The 2 is for convenience; any other number would do.
10.8 Theorem. Let the conditions of Theorem 10.1 hold, and let ℓ satisfy the conditions as listed, for a p such that ∫ ‖h‖^p dΠ(h) < ∞. Then the sequence √n(Tn − θ₀) converges under θ₀ in distribution to the minimizer of t ↦ ∫ ℓ(t − h) dN(X, I_{θ₀}⁻¹)(h), for X possessing the N(0, I_{θ₀}⁻¹)-distribution, provided that any two minimizers of this process coincide almost surely. In particular, for every nonzero, subconvex loss function it converges to X.

*Proof. We adopt the notation as listed in the first paragraph of the proof of Theorem 10.1.
The last assertion of the theorem is a consequence of Anderson's lemma, Lemma 8.5. The standardized estimator √n(Tn − θ₀) minimizes the function t ↦ Zn(t), where ℓt denotes the function h ↦ ℓ(t − h). The proof consists of three parts. First it is shown that integrals over the sets ‖h‖ ≥ Mn can be neglected for every Mn → ∞. Next, it is proved that the sequence √n(Tn − θ₀) is uniformly tight. Finally, it is shown that the stochastic processes t ↦ Zn(t) converge in distribution in the space ℓ∞(K), for every compact K, to the process

t ↦ Z(t) = ∫ ℓ(t − h) dN(X, I_{θ₀}⁻¹)(h).
The sample paths of this limit process are continuous in t, in view of the subexponential growth of ℓ and the smoothness of the normal density. Hence the theorem follows from the argmax theorem, Corollary 5.58. Let Cn be the ball of radius Mn for a given, arbitrary sequence Mn → ∞. We first show that, for every measurable function f that grows subpolynomially of order p,
∫_{‖h‖ ≥ Mn} f(h) dΠn(√n(θ − θ₀) ∈ dh | X₁, …, Xn) → 0 under θ₀.    (10.9)

To see this, we utilize the tests from the proof of Theorem 10.1; combined with (10.9) it follows that the probability that Zn has a minimizer in the set ‖t‖ ≥ 3Mn converges to zero. Because this is true for any Mn → ∞, it follows that the sequence √n(Tn − θ₀) is uniformly tight. Let C be the ball of fixed radius M around 0, and fix some compact set K ⊂ ℝᵏ. Define stochastic processes

Zn,M(t) = ∫ ℓ(t − h) 1_C(h) dΠn(h),  Wn,M(t) = N(Δn,θ₀, I_{θ₀}⁻¹)(ℓt 1_C),  WM(t) = N(X, I_{θ₀}⁻¹)(ℓt 1_C),

for Πn the posterior distribution of √n(θ − θ₀) given X₁, …, Xn.
The function h ↦ ℓ(t − h)1_C(h) is bounded, uniformly if t ranges over the compact K. Hence, by the Bernstein–von Mises theorem, Zn,M − Wn,M → 0 in probability in ℓ∞(K) as n → ∞, for every fixed M. Second, by the continuous-mapping theorem, Wn,M ⇝ WM in ℓ∞(K), as n → ∞, for fixed M. Next WM → Z in ℓ∞(K) as M → ∞, or equivalently C ↑ ℝᵏ. Conclude that there exists a sequence Mn → ∞ such that the processes Zn,Mn ⇝ Z in ℓ∞(K). Because, by (10.9), Zn(t) − Zn,Mn(t) → 0, we finally conclude that Zn ⇝ Z in ℓ∞(K). ∎

*10.4 Consistency
A sequence of posterior measures Πn(· | X₁, …, Xn) is called consistent under θ if under P_θ^∞-probability it converges in distribution to the measure δ_θ that is degenerate at θ, in probability; it is strongly consistent if this happens for almost every sequence X₁, X₂, …. Given that ordinarily consistent point estimators of θ exist, consistency of posterior measures is a modest requirement. If we could know θ with almost complete accuracy as n → ∞, then we would use a Bayes estimator only if this would also yield the true value with similar accuracy. Fortunately, posterior measures are usually consistent. The following famous theorem by Doob shows that under hardly any conditions we already have consistency under almost every parameter. Recall that Θ is assumed to be Euclidean and the maps θ ↦ P_θ(A) to be measurable for every measurable set A.

10.10 Theorem (Doob's consistency theorem). Suppose that the sample space (X, A) is a subset of Euclidean space with its Borel σ-field. Suppose that P_θ ≠ P_θ' whenever θ ≠ θ'. Then for every prior probability measure Π on Θ the sequence of posterior measures is consistent for Π-almost every θ.
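Before the proof, posterior consistency can be made concrete in the Bernoulli model with a conjugate beta prior (the prior parameters and true value below are arbitrary choices): the posterior is Beta(a + s, b + n − s), its mean tends to the truth and its spread to zero.

```python
from math import sqrt

# Sketch of posterior consistency in the Bernoulli model: with a Beta(a, b)
# prior and s successes in n trials, the posterior is Beta(a + s, b + n - s).
# Its mean approaches the true theta and its standard deviation shrinks to 0.
a, b, theta0 = 2.0, 3.0, 0.3     # illustrative prior parameters and truth

def posterior_moments(n, s):
    a_n, b_n = a + s, b + n - s
    mean = a_n / (a_n + b_n)
    var = a_n * b_n / ((a_n + b_n) ** 2 * (a_n + b_n + 1))
    return mean, var

# take the empirical frequency equal to theta0 (law of large numbers)
for n in [10, 100, 10000]:
    mean, var = posterior_moments(n, theta0 * n)
    print(n, round(mean, 3), round(sqrt(var), 4))
```
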
Proof. On an arbitrary probability space construct random vectors ϑ and X₁, X₂, … such that ϑ is marginally distributed according to Π and such that, given ϑ = θ, the vectors X₁, X₂, … are i.i.d. according to P_θ. Then the posterior distribution based on the first n observations is the conditional law of ϑ given X₁, …, Xn. Let Q be the distribution of (X₁, X₂, …, ϑ) on X^∞ × Θ. The main part of the proof consists of showing that there exists a measurable function h: X^∞ ↦ Θ with

ϑ = h(X₁, X₂, …),    Q-a.s.    (10.11)

Suppose that this is true. Then, for any bounded, measurable function f: Θ ↦ ℝ, by Doob's martingale convergence theorem,

E(f(ϑ) | X₁, …, Xn) → E(f(ϑ) | X₁, X₂, …) = f(h(X₁, X₂, …)),    Q-a.s.

By Lemma 2.25 there exists a countable collection F of bounded, continuous functions f that are determining for convergence in distribution. Because the countable union of the associated null sets on which the convergence of the preceding display fails is a null set, we have that the posterior laws converge in distribution to δ_ϑ, Q-a.s.
This statement refers to the marginal distribution of (X₁, X₂, …) under Q. We wish to translate it into a statement concerning the P_θ^∞-measures. Let C ⊂ X^∞ × Θ be the intersection of the sets on which the weak convergence holds and on which (10.11) is valid. By Fubini's theorem,

1 = Q(C) = ∫ P_θ^∞(C_θ) dΠ(θ),

where C_θ = {x: (x, θ) ∈ C} is the horizontal section of C at height θ. It follows that P_θ^∞(C_θ) = 1 for Π-almost every θ. For every θ such that P_θ^∞(C_θ) = 1, we have (x, θ) ∈ C for P_θ^∞-almost every sequence x₁, x₂, …, and hence the posterior laws converge in distribution to δ_θ, almost surely under P_θ^∞. This is the assertion of the theorem.
In order to establish (10.11), call a measurable function f: Θ ↦ ℝ accessible if there exists a sequence of measurable functions hn: X^∞ ↦ ℝ such that

∫ |hn(x) − f(θ)| ∧ 1 dQ(x, θ) → 0.

(Here we abuse notation in viewing hn also as a measurable function on X^∞ × Θ.) Then there also exists a (sub)sequence with hn(x) → f(θ) almost surely under Q, whence every accessible function f is almost everywhere equal to an A^∞ × {∅, Θ}-measurable function, that is, a measurable function of x = (x₁, x₂, …) alone. If we can show that the functions f(θ) = θᵢ are accessible, then (10.11) follows. We shall in fact show that every Borel measurable function is accessible.
By the strong law of large numbers, hn(x) = n⁻¹ Σᵢ₌₁ⁿ 1_A(xᵢ) → P_θ(A) almost surely under P_θ^∞, for every θ and measurable set A. Consequently, by the dominated convergence theorem,

∫ |hn(x) − P_θ(A)| ∧ 1 dQ(x, θ) → 0.

Thus, each of the functions θ ↦ P_θ(A) is accessible. Because the sample space (X, A) is Euclidean and the map θ ↦ P_θ is one-to-one, the coordinate maps θ ↦ θᵢ can be written as Borel measurable functions of countably many functions of the type θ ↦ P_θ(A), and hence are accessible as well. ∎
Notes

The Bernstein–von Mises theorem has that name because, as Le Cam and Yang [97] write, it was first discovered by Laplace. The theorem that is presented in this chapter is considerably more elegant than the results by these early authors, and also much better than the result in Le Cam [91], who revived the theorem in order to prove results on superefficiency. We adapted it from Le Cam [96] and Le Cam and Yang [97]. Ibragimov and Has'minskii [80] discuss the convergence of Bayes point estimators in greater generality, and also cover non-Gaussian limit experiments, but their discussion of the i.i.d. case as discussed in the present chapter is limited to bounded parameter sets and requires stronger assumptions. Our treatment uses some elements of their proof, but is heavily based on Le Cam's Bernstein–von Mises theorem. Inspection of the proof shows that the conditions on the loss function can be relaxed significantly, for instance allowing exponential growth. The potential null sets of inconsistency that Doob's theorem leaves open really exist in some situations, particularly if the parameter set is infinite-dimensional, and have attracted much attention. See [34], which is accompanied by evaluations of the phenomenon by many authors, including Bayesians.
PROBLEMS

1. Verify the conditions of the Bernstein–von Mises theorem for the experiment where P_θ is the Poisson measure of mean θ.
2. Let P_θ be the k-dimensional normal distribution with mean θ and covariance matrix the identity. Find the a posteriori law for the prior Π = N(τ, Λ) and some nonsingular matrix Λ. Can you see directly that the Bernstein–von Mises theorem is true in this case?
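For Problem 2, the one-dimensional conjugate computation can be sketched directly (k = 1 with a scalar prior variance, all numerical values illustrative): the posterior is normal, and n times its variance tends to 1, matching the N(x̄, 1/n) approximation of the Bernstein–von Mises theorem.

```python
# Sketch for Problem 2 with k = 1 and prior N(tau, lam), data N(theta, 1):
# the posterior is normal with precision 1/lam + n; as n grows it approaches
# the N(xbar, 1/n) law predicted by the Bernstein-von Mises theorem.
def posterior(tau, lam, n, xbar):
    prec = 1.0 / lam + n                       # posterior precision
    mean = (tau / lam + n * xbar) / prec       # posterior mean
    return mean, 1.0 / prec                    # (mean, variance)

tau, lam, xbar = 5.0, 2.0, 1.0                 # illustrative values
for n in [10, 1000]:
    mean, var = posterior(tau, lam, n, xbar)
    print(n, round(mean, 4), round(var * n, 4))   # var * n should approach 1
```
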
3. Let P_θ be the Bernoulli distribution with mean θ. Find the posterior distribution relative to the beta-prior measure, which has density proportional to θ^{α−1}(1 − θ)^{β−1} for given α, β > 0.
4. Suppose that, in the case of a one-dimensional parameter, we use the loss function ℓ(h) = 1_{(−1,2)}(h).
Find the limit distribution of the corresponding Bayes point estimator, assuming that the conditions of the B ernstein-von Mises theorem hold.
11 Projections
A projection of a random variable is defined as a closest element in a given set of functions. We can use projections to derive the asymptotic distribution of a sequence of variables by comparing these to projections of a simple form. Conditional expectations are special projections. The Hájek projection is a sum of independent variables; it is the leading term in the Hoeffding decomposition.
11.1 Projections
A common method to derive the limit distribution of a sequence of statistics Tn is to show that it is asymptotically equivalent to a sequence Sn of which the limit behavior is known. The basis of this method is Slutsky's lemma, which shows that the sequence Tn = Tn − Sn + Sn converges in distribution to S if both Tn − Sn → 0 in probability and Sn ⇝ S. How do we find a suitable sequence Sn? First, the variables Sn must be of a simple form, because the limit properties of the sequence Sn must be known. Second, Sn must be close enough. One solution is to search for the closest Sn of a certain predetermined form. In this chapter, "closest" is taken as closest in square expectation. Let T and {S: S ∈ S} be random variables (defined on the same probability space) with finite second moments. A random variable Ŝ is called a projection of T onto S (or L₂-projection) if Ŝ ∈ S and minimizes

S ↦ E(T − S)²,    S ∈ S.
Often S is a linear space in the sense that a₁S₁ + a₂S₂ is in S for every a₁, a₂ ∈ ℝ, whenever S₁, S₂ ∈ S. In this case Ŝ is the projection of T if and only if T − Ŝ is orthogonal to S for the inner product ⟨S₁, S₂⟩ = ES₁S₂. This is the content of the following theorem.

11.1 Theorem. Let S be a linear space of random variables with finite second moments. Then Ŝ is the projection of T onto S if and only if Ŝ ∈ S and

E(T − Ŝ)S = 0,    every S ∈ S.

Every two projections of T onto S are almost surely equal. If the linear space S contains the constant variables, then ET = EŜ and cov(T − Ŝ, S) = 0 for every S ∈ S.
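On a finite probability space the theorem reduces to linear algebra, which can be sketched as follows (the variables and weights below are arbitrary numerical examples): the projection onto the span of S₁, …, Sm solves the normal equations, and the residual is orthogonal to every spanning variable.

```python
import numpy as np

# Sketch of Theorem 11.1 on a finite probability space: random variables are
# vectors, E(XY) = sum_k p_k X_k Y_k, and the projection of T onto the span of
# S_1, S_2, S_3 is obtained from the normal equations E(T - S_hat) S_j = 0.
rng = np.random.default_rng(0)
p = rng.random(50); p /= p.sum()          # probability weights
T = rng.standard_normal(50)               # the variable to project
S = rng.standard_normal((3, 50))          # spanning variables S_1, S_2, S_3

G = (S * p) @ S.T                         # Gram matrix E(S_i S_j)
b = (S * p) @ T                           # right-hand side E(T S_j)
coef = np.linalg.solve(G, b)
S_hat = coef @ S                          # the projection

residual_inner = (S * p) @ (T - S_hat)    # E[(T - S_hat) S_j], each j
print(np.allclose(residual_inner, 0.0))
```
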
Figure 11.1. A variable T and its projection Ŝ on a linear space.
Proof. For any S and Ŝ in S,

E(T − S)² = E(T − Ŝ)² + 2E(T − Ŝ)(Ŝ − S) + E(Ŝ − S)².

If Ŝ satisfies the orthogonality condition, then the middle term is zero, and we conclude that E(T − S)² ≥ E(T − Ŝ)², with strict inequality unless E(Ŝ − S)² = 0. Thus, the orthogonality condition implies that Ŝ is a projection, and also that any two projections are almost surely equal. Conversely, suppose that Ŝ is a projection. Then, for any number a and any S ∈ S, the function a ↦ E(T − Ŝ − aS)² = E(T − Ŝ)² − 2aE(T − Ŝ)S + a²ES² possesses a minimum at a = 0. This is possible only if E(T − Ŝ)S = 0, which is the orthogonality condition. ∎
12.5 Example (Kendall's τ). Kendall's τ-statistic for a bivariate sample (X₁, Y₁), …, (Xn, Yn) is given by

τ = 4/(n(n − 1)) ΣΣ_{i<j} 1{(Yⱼ − Yᵢ)(Xⱼ − Xᵢ) > 0} − 1.

This statistic is a measure of dependence between X and Y and counts the number of concordant pairs (Xᵢ, Yᵢ) and (Xⱼ, Yⱼ) in the observations. Two pairs are concordant if the indicator in the definition of τ is equal to 1. Large values indicate positive dependence (or concordance), whereas small values indicate negative dependence. Under independence of X and Y and continuity of their distributions, the distribution of τ is centered about zero, and in the extreme cases that all or none of the pairs are concordant, τ is identically 1 or −1, respectively. The statistic ½(τ + 1) is a U-statistic of order 2 for the kernel

h((x₁, y₁), (x₂, y₂)) = 1{(y₂ − y₁)(x₂ − x₁) > 0},

with θ = Eh. Hence the sequence √n(τ + 1 − 2θ) is asymptotically normal with mean zero and variance 16ζ₁. With the notation F₁(x, y) = P(X < x, Y < y) and F̄₁(x, y) = P(X > x, Y > y), the projection of U − θ takes the form

Û = (2/n) Σᵢ₌₁ⁿ (F₁(Xᵢ, Yᵢ) + F̄₁(Xᵢ, Yᵢ) − θ).

If X and Y are independent and have continuous marginal distribution functions, then Eτ = 0 and the asymptotic variance 16ζ₁ can be calculated to be 4/9, independent of the marginal distributions.
Figure 12.1. Concordant and discordant pairs of points.
Then √n τ ⇝ N(0, 4/9), which leads to the following test for "independence": Reject independence if √(9n/4) |τ| > z_{α/2}. □
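The normal approximation for Kendall's τ under independence can be checked by Monte Carlo (sample size, replication count, and seed below are arbitrary choices): the scaled statistic √(9n/4) τ should have variance near one.

```python
import random
from itertools import combinations

# Monte Carlo sketch: under independence, sqrt(n)*tau is approximately
# N(0, 4/9), so sqrt(9n/4)*tau should have mean near 0 and variance near 1
# (slightly above 1 at moderate n because of finite-sample corrections).
random.seed(3)

def kendall_tau(xs, ys):
    n = len(xs)
    conc = sum((ys[j] - ys[i]) * (xs[j] - xs[i]) > 0
               for i, j in combinations(range(n), 2))
    return 4.0 * conc / (n * (n - 1)) - 1.0

n, reps = 40, 2000
stats = []
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    ys = [random.random() for _ in range(n)]
    stats.append((9 * n / 4) ** 0.5 * kendall_tau(xs, ys))

mean_hat = sum(stats) / reps
var_hat = sum(s * s for s in stats) / reps
print(round(mean_hat, 3), round(var_hat, 2))
```
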
12.2 Two-Sample U-statistics
Suppose the observations consist of two independent samples X₁, …, Xm and Y₁, …, Yn, i.i.d. within each sample, from possibly different distributions. Let h(x₁, …, x_r, y₁, …, y_s) be a known function that is permutation-symmetric in x₁, …, x_r and y₁, …, y_s separately. A two-sample U-statistic with kernel h has the form

U = 1/((m choose r)(n choose s)) Σ_α Σ_β h(X_{α₁}, …, X_{α_r}, Y_{β₁}, …, Y_{β_s}),

where α and β range over the collections of all subsets of r different elements from {1, 2, …, m} and of s different elements from {1, 2, …, n}, respectively. Clearly, U is an unbiased estimator of the parameter θ = Eh(X₁, …, X_r, Y₁, …, Y_s).
The sequence U_{m,n} can be shown to be asymptotically normal by the same arguments as for one-sample U-statistics. Here we let both m → ∞ and n → ∞, in such a way that the numbers of Xᵢ and Yⱼ are of the same order. Specifically, if N = m + n is the total number of observations, we assume that, as m, n → ∞,

m/N → λ,  n/N → 1 − λ,  0 < λ < 1.

To give an exact meaning to m, n → ∞, we may think of m = m_ν and n = n_ν indexed by a third index ν ∈ ℕ. Next, we let m_ν → ∞ and n_ν → ∞ as ν → ∞ in such a way that m_ν/N_ν → λ. The projection of U − θ onto the set of all functions of the form Σᵢ₌₁ᵐ kᵢ(Xᵢ) + Σⱼ₌₁ⁿ lⱼ(Yⱼ) is given by

Û = (r/m) Σᵢ₌₁ᵐ h₁,₀(Xᵢ) + (s/n) Σⱼ₌₁ⁿ h₀,₁(Yⱼ),
where the functions h₁,₀ and h₀,₁ are defined by

h₁,₀(x) = Eh(x, X₂, …, X_r, Y₁, …, Y_s) − θ,
h₀,₁(y) = Eh(X₁, …, X_r, y, Y₂, …, Y_s) − θ.
E( U I X; ) and E( U I Yj ) in the kernel fu nction. If the kernel is square-integrable, then the sequence {J is asymp tot i caJJy nonnal by the centml Ji mit theorem. The di fference between {J a nd U - 0 is asymptoticaJJy neg l ig i b le . 1 2.6 Theorem. I/ Eh2(X1 , • • • , X, , Y1 • • • • , Y.. ) < oo. then the sequence Jil(U 8 0 ) com·erges in probability· to zero. · consequellll)� the sequence Jil(U - 0) £·om·erges in di.t:tribution to the non1wl /aw with meall zero mu/ J.•ariance r2��,0/). + s2("o. d( l - .A.). ,....here. with the X; being i. i.d. l'ariable:r illdepelldent of the i.i.d. l'llritlbles r,. -
�'".J
=
cov (hCX� t
. • •
-
, X, Y� t • • • , Y.. ) .
h ( X. , • . • • Xr ,
x;+l • . . . . x; . r. . . . . . YJ, Y�+ . . . . . . r;>).
Proof. The argu m e nt is sim i l ar to the one given previously for one-sample U - st at i st ic s. The variances of U and it s proje cti o n are given by
var U
=
(;);(:)2 t.t. (;) (�) ('; ::;) (�) W (;=�) �··.d·
It can be checked fro m this that both the sequence Nvar D and the sequence Nvar U co nverge to the number r 2 �1 •0!). + s2�o. J /( I - l). • 12.7
Example (t.famr- U1titllej' statistic). Th e kcmcJ for the
lt (x, y) = I ( X ::;
parameter 0 = P(X ::; Y) is Y }, which is of order I i n both x and )'. The correspondi ng U - s tat i M i c is
The statistic mn U is known as the },famr- \\'ltimey statistic a nd is used to test for a di fference
i n location between the two sa mp l e s . A large value indicates that the Y1 arc '"stochastically larger" than the X, . If the X1 and r1 have c u m ul ati ve distribution functions F a nd G. resp ec t i vel y, then the p rojec t i on of U - 0 ca n be written
It is easy to obrain the limir distribur ion of the projections iJ (and hence of U) from this for mula. In particular. under the null h ypothesis that the poole d sam p le X 1 • • • • • Xm , r• . . . . . Y, is i.i.d. with conti nuous dislribu tion fu nction F = G. the sequence Jl2imz/N(U - 1 /2)
12.3
Dt!gmt!mtt" U-Swrisric.f
1 67
converges to a standard nonnal distribution. (The pammetcr equals 0 = 1 /2 and {0• 1 = {J.o = 1 / 1 2.) If no observations in the pooled sample arc tied. then mnU + 4n = 1 /(�). By the defining properties of the space 1/f l.�c·J· it follows that the kemel /r� is degenerate for c � 2. In fact. it is strongly degenerate in the sense that the conditional expectation of hc(X1 X�) given any strict su bset of the variabks X 1 • • • • • X(' is zero. In other words. the integral J h(:c. X2 . , Xc> d P(:c) with respect to any single argument vanishes. By the same reasoni ng. U,,,. is u ncorrclatcd with every measurable fu nction that depends on strictly fewer than ,. elements of X I X11 • V.'e shall show that the sequence n''/2 Un.(' converges in d istribution to a l i mit with variance c! FJz;(x 1 • • • • • X(') for every c � I . Then it follows that the sequence n('12 (U,, - 0) converges i n distribution for c equal to the smallest value such that /r(' ¢ 0. For c � 2 the limit distribution is not nonnal but is known as Gaussian chaos. Because the idea is simple, bu t th� statement of the theor�m (apparently) necessarily compl icated, first consider a special case: c = 3 and a "'product kernel .. of the. form .•
•
A U-statistic corresponding to a product kernel can be rewritten as a polynomial in sums of the observations. For ease of notation, let ℙn f = n⁻¹ Σᵢ₌₁ⁿ f(Xᵢ) (the empirical measure), and let Gn f = √n(ℙn − P)f (the empirical process), for P the distribution of the observations X₁, …, Xn. If the kernel h₃ is strongly degenerate, then each function fᵢ has mean zero and hence Gn fᵢ = √n ℙn fᵢ for every i. Then, with (i₁, i₂, i₃) ranging over all triplets of three different integers from {1, …, n} (taking position into account),

n^{3/2} U_{n,3} = (n^{3/2}/(n(n − 1)(n − 2))) Σ_{(i₁,i₂,i₃)} f₁(X_{i₁}) f₂(X_{i₂}) f₃(X_{i₃}).

By the law of large numbers, ℙn f → Pf almost surely for every f, while, by the central limit theorem, the marginal distributions of the stochastic processes f ↦ Gn f converge weakly to multivariate Gaussian laws. If {Gf: f ∈ L₂(X, A, P)} denotes a Gaussian process with mean zero and covariance function EGf Gg = Pfg − Pf Pg (a P-Brownian bridge process), then Gn ⇝ G. Consequently,

n^{3/2} U_{n,3} ⇝ Gf₁ Gf₂ Gf₃ − P(f₁f₂) Gf₃ − P(f₁f₃) Gf₂ − P(f₂f₃) Gf₁.

The limit is a polynomial of order 3 in the Gaussian vector (Gf₁, Gf₂, Gf₃). There is no similarly simple formula for the limit of a general sequence of degenerate U-statistics. However, any kernel can be written as an infinite linear combination of product kernels. Because a U-statistic is linear in its kernel, the limit of a general sequence of degenerate U-statistics is a linear combination of limits of the previous type. To carry through this program, it is convenient to employ a decomposition of a given kernel in terms of an orthonormal basis of product kernels. This is always possible. We assume that L₂(X, A, P) is separable, so that it has a countable basis.
12.8 Example (General kernel). If 1 = f₀, f₁, f₂, … is an orthonormal basis of L₂(X, A, P), then the functions f_{k₁} × ··· × f_{k_c}, with (k₁, …, k_c) ranging over the nonnegative integers, form an orthonormal basis of L₂(X^c, A^c, P^c). Any square-integrable kernel can be written in the form h_c(x₁, …, x_c) = Σ a(k₁, …, k_c) f_{k₁} × ··· × f_{k_c}, for a(k₁, …, k_c) = ⟨h_c, f_{k₁} × ··· × f_{k_c}⟩ the inner products of h_c with the basis functions. □
12.9 Example (Second-order kernel). In the case that c = 2, there is a choice that is specially adapted to our purposes. Because the kernel h₂ is symmetric and square-integrable by assumption, the integral operator K: L₂(X, A, P) ↦ L₂(X, A, P) defined by Kf(x) = ∫ h₂(x, y) f(y) dP(y) is self-adjoint and Hilbert–Schmidt. Therefore, it has at most countably many eigenvalues λ₀, λ₁, …, satisfying Σ λ_k² < ∞, and there exists an orthonormal basis of eigenfunctions f₀, f₁, …. (See, for instance, Theorem VI.16 in [124].) The kernel h₂ can be expressed relative to this basis as

h₂(x, y) = Σ_{k=0}^∞ λ_k f_k(x) f_k(y).

For a degenerate kernel h₂ the function 1 is an eigenfunction for the eigenvalue 0, and we can take f₀ = 1 without loss of generality.
The gain over the decomposition in the general case is that only product functions of the type f × f are needed. □

The (nonnormalized) Hermite polynomial H_j is a polynomial of degree j with leading term x^j such that ∫ H_i(x) H_j(x) φ(x) dx = 0 whenever i ≠ j. The Hermite polynomials of lowest degrees are H₀ = 1, H₁(x) = x, H₂(x) = x² − 1 and H₃(x) = x³ − 3x.

12.10 Theorem. Let h_c: X^c ↦ ℝ be a permutation-symmetric, measurable function of c arguments such that Eh_c²(X₁, …, X_c) < ∞ and Eh_c(X₁, …, X_{c−1}, x_c) = 0 for every x_c. Let 1 = f₀, f₁, f₂, … be an orthonormal basis of L₂(X, A, P). Then the sequence of U-statistics U_{n,c} with kernel h_c based on n observations from P satisfies

n^{c/2} U_{n,c} ⇝ Σ_k ⟨h_c, f_{k₁} × ··· × f_{k_c}⟩ Π_{i=1}^{d(k)} H_{a_i(k)}(G ψ_i(k)).

Here G is a P-Brownian bridge process, the functions ψ₁(k), …, ψ_{d(k)}(k) are the different elements in f_{k₁}, …, f_{k_c}, and a_i(k) is the number of times ψ_i(k) occurs among f_{k₁}, …, f_{k_c}. The variance of the limit variable is equal to c! Eh_c²(X₁, …, X_c).
Proof.
cov(Un ' eh, Un ' c g)
=
c! P c hg. n (n - 1) · · · (n - c + 1)
This means that the map h 1--+ n e f2 �cf Un,eh is close to being an isometry between L 2 (P c ) and L 2 (P n ) . Consequently, the series Lk ( he, fk 1 x · · · x fk) Un,efk1 x · · · x fke con verges in L 2 (P n ) and equals Un, c h c = Un,e · Furthermore, if it can be shown that the finite-dimensional distributions of the sequence of processes { Un,efk1 x · · · x fke : k E r�:n converge weakly to the corresponding finite-dimensional distributions of the process { n:�k{ Hai (k) (!Gl/fi (k)) : k E then the partial sums of the series converge, and the proof can be concluded by approximation arguments. There exists a polynomial Pn,e of degree c, with random coefficients, such that
P�F},
(See the example for c = 3 and problem 12. 1 3). The only term of degree c in this polynomial is equal to !Gn fk1 !Gn fkz · · · !Gn fke . The coefficients of the polynomials Pn,e converge in probability to constants. Conclude that the sequence n e 2 c ! Un,e fk1 x · · · x ike converges in distribution to Pc (!G fkp . . . , !G fkJ for a polynomial Pc of degree c with leading term, and only term of degree c, equal to !G fk1 !G fk2 !G ike . This convergence is simultaneous in sets of finitely many k. It suffices to establish the representation of this limit in terms of Hermite polynomials. This could be achieved directly by algebraic and combinatorial arguments, but then the occurrence of the Hermite polynomials would remain somewhat mysterious. Alternatively,
the representation can be derived from the definition of the Hermite polynomials and covariance calculations. By the degeneracy of the kernel f_{k_1} × ⋯ × f_{k_c}, the U-statistic U_{n,c} f_{k_1} × ⋯ × f_{k_c} is orthogonal to all measurable functions of c − 1 or fewer elements of X_1, …, X_n, and their linear combinations. This includes the functions Π_i (𝔾_n g_i)^{a_i} for arbitrary functions g_i and nonnegative integers a_i with Σ a_i < c. Taking limits, we conclude that P_c(𝔾f_{k_1}, …, 𝔾f_{k_c}) must be orthogonal to every polynomial in 𝔾f_{k_1}, …, 𝔾f_{k_c} of degree less than c. By the orthonormality of the basis f_i, the variables 𝔾f_i are independent standard normal variables. Because the Hermite polynomials form a basis for the polynomials in one variable, their (tensor) products form a basis for the polynomials of more than one argument. The polynomial P_c can be written as a linear combination of elements from this basis. By the orthogonality, the coefficients of base elements of degree < c vanish. From the base elements of degree c only the product as in the theorem can occur, as follows from consideration of the leading term of P_c. ∎
12.11 Example. For c = 2 and a basis f_1, f_2, … we obtain a limit of the form Σ … The differential equation λf″ + f = 0 has solutions cos(ax) and sin(ax) for a² = λ⁻¹. Reinserting these in the original equation, or utilizing the relation ∫ f(s) ds = 0, we find that the nonzero eigenvalues are equal to j⁻²π⁻² for j ∈ ℕ, with eigenfunctions √2 cos(πjx). Thus, the Cramér–von Mises statistic converges in distribution to 1/6 + Σ_{j=1}^∞ j⁻²π⁻²(Z_j² − 1), for an i.i.d. sequence of standard normal variables Z_1, Z_2, …. For another derivation of the limit distribution, see Chapter 19. □
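The limit can be illustrated numerically. The following sketch (plain Python; the helper name and the simulation sizes are ours, not from the text) simulates the Cramér–von Mises statistic n∫(F_n − F)² dF for uniform samples and compares its average with 1/6, the mean of the limit distribution:

```python
import random

def cramer_von_mises(xs):
    # Closed-form of n * integral (F_n - F)^2 dF for the uniform distribution,
    # via the standard computing formula with order statistics.
    n = len(xs)
    s = sorted(xs)
    return 1.0 / (12 * n) + sum(((2 * i - 1) / (2 * n) - x) ** 2
                                for i, x in enumerate(s, 1))

random.seed(0)
reps, n = 2000, 200
stats = [cramer_von_mises([random.random() for _ in range(n)]) for _ in range(reps)]
mean = sum(stats) / reps
print(round(mean, 3))  # close to 1/6 ~ 0.167
```

The centering terms Z_j² − 1 in the limit have mean zero, so the average of the simulated statistics settles near 1/6.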
9. Show that the U-statistic U with kernel h(x_1, x_2) = 1{x_1 + x_2 > 0}, the signed rank statistic W⁺, and the positive-sign statistic S = Σ_{i=1}^n 1{X_i > 0} are related by W⁺ = (n(n − 1)/2)·U + S in the case that there are no tied observations.

10. A V-statistic of order 2 is of the form n⁻² Σ_{i=1}^n Σ_{j=1}^n h(X_i, X_j), where h(x, y) is symmetric in x and y. Assume that Eh²(X_1, X_1) < ∞ and Eh²(X_1, X_2) < ∞. Obtain the asymptotic distribution of a V-statistic from the corresponding result for a U-statistic.
1 1 . Define a V -statistic of general order r and give conditions for its asymptotic normality.
12. Derive the asymptotic distribution of n(S² − μ_2) in the case that μ_4 = μ_2², by using the delta method (see Example 12.12). Does it make a difference whether we divide by n or n − 1?
13. For any (n × c) matrix (a_ij) we have

Σ a_{i_1,1} a_{i_2,2} ⋯ a_{i_c,c} = Σ_𝓑 Π_{B∈𝓑} (−1)^{|B|−1} (|B|−1)! Σ_{i=1}^n Π_{j∈B} a_{ij}.

Here the sum on the left ranges over all ordered subsets (i_1, …, i_c) of different integers from {1, …, n}, and the sum on the right ranges over all partitions 𝓑 of {1, …, c} into nonempty sets (see Example [131]).
14. Given a sequence of i.i.d. random variables X 1 , X 2 , . . . , let An be the a -field generated by all functions of (X 1 , X 2 , . . . ) that are symmetric in their first n arguments . Prove that a sequence Un of U -statistics with a fixed kernel h of order r is a reverse martingale (for n :=:: r) with respect to the filtration Ar ::) Ar + 1 ::) · · · .
15. (Strong law.) If E|h(X_1, …, X_r)| < ∞, then the sequence U_n of U-statistics with kernel h converges almost surely to Eh(X_1, …, X_r). (For r > 1 the condition is not necessary, but a simple necessary and sufficient condition appears to be unknown.) Prove this. (Use the preceding problem, the martingale convergence theorem, and the Hewitt–Savage 0-1 law.)
13 Rank, Sign, and Permutation Statistics
Statistics that depend on the observations only through their ranks can be used to test hypotheses on departures from the null hypothesis that the observations are identically distributed. Such rank statistics are attractive, because they are distribution-free under the null hypothesis and need not be less efficient, asymptotically. In the case of a sample from a symmetric distribution, statistics based on the ranks of the absolute values and the signs of the observations have a similar property. Rank statistics are a special example of permutation statistics.
13.1
Rank Statistics
The order statistics X_{N(1)} ≤ X_{N(2)} ≤ ⋯ ≤ X_{N(N)} of a set of real-valued observations X_1, …, X_N are the values of the observations positioned in increasing order. The rank R_{Ni} of X_i among X_1, …, X_N is its position number in the order statistics. More precisely, if X_1, …, X_N are all different, then R_{Ni} is defined by the equation

X_i = X_{N(R_{Ni})}.

If X_i is tied with some other observations, this definition is invalid. Then the rank R_{Ni} is defined as the average of all indices j such that X_i = X_{N(j)} (sometimes called the midrank), or alternatively as Σ_{j=1}^N 1{X_j ≤ X_i} (which is something like an up rank). In this section it is assumed that the random variables X_1, …, X_N have continuous distribution functions, so that ties in the observations occur with probability zero. We shall neglect the latter null set. The ranks and order statistics are written with double subscripts, because N varies and we shall consider order statistics of samples of different sizes. The vectors of order statistics and ranks are abbreviated to X_{N(·)} and R_N, respectively.

A rank statistic is any function of the ranks. A linear rank statistic is a rank statistic of the special form Σ_{i=1}^N a_N(i, R_{Ni}) for a given (N × N) matrix (a_N(i, j)). In this chapter we shall be concerned with the subclass of simple linear rank statistics, which take the form

Σ_{i=1}^N c_{Ni} a_{N,R_{Ni}}.

Here (c_{N1}, …, c_{NN}) and (a_{N1}, …, a_{NN}) are given vectors in ℝ^N and are called the coefficients and scores, respectively. The class of simple linear rank statistics is sufficiently large
to contain interesting statistics for testing a variety of hypotheses. In particular, we shall see that it contains all "locally most powerful" rank statistics, which in another chapter are shown to be asymptotically efficient within the class of all tests.

Some elementary properties of ranks and order statistics are gathered in the following lemma.
13.1 Lemma. Let X_1, …, X_N be a random sample from a continuous distribution function F with density f. Then
(i) the vectors X_{N(·)} and R_N are independent;
(ii) the vector X_{N(·)} has density N! Π_{i=1}^N f(x_i) on the set x_1 < ⋯ < x_N;
(iii) the variable X_{N(i)} has density (N!/((i−1)!(N−i)!)) F(x)^{i−1}(1 − F(x))^{N−i} f(x); for F the uniform distribution on [0, 1], it has mean i/(N + 1) and variance i(N − i + 1)/((N + 1)²(N + 2));
(iv) the vector R_N is uniformly distributed on the set of all N! permutations of 1, 2, …, N;
(v) for any statistic T and permutation r = (r_1, …, r_N) of 1, 2, …, N,

E(T(X_1, …, X_N) | R_N = r) = E T(X_{N(r_1)}, …, X_{N(r_N)});

(vi) for every simple linear rank statistic T = Σ_{i=1}^N c_{Ni} a_{N,R_{Ni}},

ET = N c̄_N ā_N,  var T = (N − 1)⁻¹ Σ_{i=1}^N (c_{Ni} − c̄_N)² Σ_{i=1}^N (a_{Ni} − ā_N)².

Proof. Statements (i) through (iv) are well known and elementary. For the proof of (v), it is helpful to write T(X_1, …, X_N) as a function of the ranks and the order statistics; next, we apply (i). For the proof of statement (vi), we use that the distributions of the variables R_{Ni} and the vectors (R_{Ni}, R_{Nj}) for i ≠ j are uniform on the sets {1, …, N} and {(i, j) ∈ {1, …, N}² : i ≠ j}, respectively. Furthermore, a double sum of the form Σ_{i≠j} (b_i − b̄)(b_j − b̄) is equal to −Σ_i (b_i − b̄)². ∎
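For a small sample, both formulas in part (vi) can be checked exactly by enumerating all N! equally likely rank vectors guaranteed by part (iv). The following sketch (illustrative Python; the function names are ours) does this for N = 4:

```python
from itertools import permutations

def rank_stat_moments(c, a):
    # Exact mean and variance of T = sum_i c_i * a_{R_i}, with the rank vector
    # uniform over all permutations (Lemma 13.1(iv)).
    vals = [sum(ci * a[r] for ci, r in zip(c, perm))
            for perm in permutations(range(len(c)))]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, var

c, a = [0, 0, 1, 1], [1, 2, 3, 4]
N = len(c)
cbar, abar = sum(c) / N, sum(a) / N
mean, var = rank_stat_moments(c, a)
# Lemma 13.1(vi): ET = N cbar abar, var T = (N-1)^{-1} sum(c-cbar)^2 sum(a-abar)^2
fmean = N * cbar * abar
fvar = sum((ci - cbar) ** 2 for ci in c) * sum((ai - abar) ** 2 for ai in a) / (N - 1)
print(mean, fmean)                    # 5.0 5.0
print(round(var, 6), round(fvar, 6))  # 1.666667 1.666667
```

The enumeration reproduces the closed-form mean and variance exactly, which is the content of part (vi).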
It follows that rank statistics are distribution-free over the set of all models in which the observations are independent and identically distributed. On the one hand, this makes them statistically useless in situations in which the observations are, indeed, a random sample from some distribution. On the other hand, it makes them of great interest to detect certain differences in distribution between the observations, such as in the two-sample problem. If the null hypothesis is taken to assert that the observations are identically distributed, then the critical values for a rank test can be chosen in such a way that the probability of an error of the first kind is equal to a given level α, for any probability distribution in the null hypothesis. Somewhat surprisingly, this gain is not necessarily counteracted by a loss in asymptotic efficiency, as we shall see in Chapter 14.
13.2 Example (Two-sample location problem). Suppose that the total set of observations consists of two independent random samples, inconsistently with the preceding notation written as X_1, …, X_m and Y_1, …, Y_n. Set N = m + n and let R_N be the rank vector of the pooled sample X_1, …, X_m, Y_1, …, Y_n.
We are interested in testing the null hypothesis that the two samples are identically distributed (according to a continuous distribution) against the alternative that the distribution of the second sample is stochastically larger than the distribution of the first sample. Even without a more precise description of the alternative hypothesis, we can discuss a collection of useful rank statistics. If the Y_j are a sample from a stochastically larger distribution, then the ranks of the Y_j in the pooled sample should be relatively large. Thus, any measure of the size of the ranks R_{N,m+1}, …, R_{NN} can be used as a test statistic. It will be distribution-free under the null hypothesis. The most popular choice in this problem is the Wilcoxon statistic

W = Σ_{i=m+1}^N R_{Ni}.
This is a simple linear rank statistic with coefficients c = (0, …, 0, 1, …, 1) and scores a = (1, …, N). The null hypothesis is rejected for large values of the Wilcoxon statistic. (The Wilcoxon statistic is equivalent to the Mann–Whitney statistic U = Σ_{i,j} 1{X_i ≤ Y_j} in that W = U + ½n(n + 1).) There are many other reasonable choices of rank statistics, some of which are of special interest and have names. For instance, the van der Waerden statistic is defined as

Σ_{i=m+1}^N Φ⁻¹(R_{Ni}/(N + 1)).
Here Φ⁻¹ is the standard normal quantile function. We shall see ahead that this statistic is particularly attractive if it is believed that the underlying distribution of the observations is approximately normal. A general method to generate useful rank statistics is discussed below. □

A critical value for a test based on a (distribution-free) rank statistic can be found by simply tabulating its null distribution. For a large number of observations this is a bit tedious. In most cases it is also unnecessary, because there exist accurate asymptotic approximations. The remainder of this section is concerned with proving asymptotic normality of simple linear rank statistics under the null hypothesis. Apart from being useful for finding critical values, the theorem is used subsequently to study the asymptotic efficiency of rank tests.

Consider a rank statistic of the form T_N = Σ_{i=1}^N c_{Ni} a_{N,R_{Ni}}. For a sequence of this type to be asymptotically normal, some restrictions on the coefficients c and scores a are necessary. In most cases of interest, the scores are "generated" through a given function φ: [0, 1] → ℝ in one of two ways. Either

a_{Ni} = E φ(U_{N(i)}),   (13.3)

where U_{N(1)}, …, U_{N(N)} are the order statistics of a sample of size N from the uniform distribution on (0, 1); or

a_{Ni} = φ(i/(N + 1)).   (13.4)

For well-behaved functions φ, these definitions are closely related and almost identical, because i/(N + 1) = EU_{N(i)}. Scores of the first type correspond to the locally most
powerful rank tests that are discussed ahead; scores of the second type are attractive in view of their simplicity.

13.5 Theorem. Let R_N be the rank vector of an i.i.d. sample X_1, …, X_N from the continuous distribution function F. Let the scores a_N be generated according to (13.3) for a measurable function φ that is not constant almost everywhere and satisfies ∫_0^1 φ²(u) du < ∞. Define the variables

T_N = Σ_{i=1}^N c_{Ni} a_{N,R_{Ni}},   T̂_N = N c̄_N ā_N + Σ_{i=1}^N (c_{Ni} − c̄_N) φ(F(X_i)).
Then the sequences T_N and T̂_N are asymptotically equivalent in the sense that ET_N = ET̂_N and var(T_N − T̂_N)/var T_N → 0. The same is true if the scores are generated according to (13.4) for a function φ that is continuous almost everywhere, is nonconstant, and satisfies N⁻¹ Σ_{i=1}^N φ²(i/(N + 1)) → ∫_0^1 φ²(u) du < ∞.

Proof. Set U_i = F(X_i), and view the rank vector R_N as the ranks of the first N elements of the infinite sequence U_1, U_2, …. In view of statement (v) of Lemma 13.1, the definition (13.3) is equivalent to

a_{N,R_{Ni}} = E(φ(U_i) | R_N).
This immediately yields that the projection of T̂_N onto the set of all square-integrable functions of R_N is equal to T_N = E(T̂_N | R_N). It is straightforward to compute that

var T_N / var T̂_N = [(N − 1)⁻¹ Σ (c_{Ni} − c̄_N)² Σ (a_{Ni} − ā_N)²] / [Σ (c_{Ni} − c̄_N)² var φ(U_1)] = (N/(N − 1)) · var a_{N,R_{N1}} / var φ(U_1).

If it can be shown that the right side converges to 1, then the sequences T_N and T̂_N are asymptotically equivalent by the projection theorem, Theorem 11.2, and the proof for the scores (13.3) is complete. Using a martingale convergence theorem, we shall show the stronger statement

a_{N,R_{N1}} → φ(U_1) in second mean.   (13.6)

Because each rank vector R_{j−1} is a function of the next rank vector R_j (for one observation more), it follows that a_{N,R_{N1}} = E(φ(U_1) | R_1, …, R_N) almost surely. Because φ is square-integrable, a martingale convergence theorem (e.g., Theorem 10.5.4 in [42]) yields that the sequence a_{N,R_{N1}} converges in second mean and almost surely to E(φ(U_1) | R_1, R_2, …). If φ(U_1) is measurable with respect to the σ-field generated by R_1, R_2, …, then the conditional expectation reduces to φ(U_1) and (13.6) follows.

The projection of U_1 onto the set of measurable functions of R_{N1} equals the conditional expectation E(U_1 | R_{N1}) = R_{N1}/(N + 1). By a straightforward calculation, the sequence var(R_{N1}/(N + 1)) converges to 1/12 = var U_1. By the projection theorem, Theorem 11.2, it follows that R_{N1}/(N + 1) → U_1 in quadratic mean. Because R_{N1} is measurable in the σ-field generated by R_1, R_2, …, for every N, so must be its limit U_1. This concludes the proof that φ(U_1) is measurable with respect to the σ-field generated by R_1, R_2, …, and hence the proof of the theorem for the scores (13.3).
Next, consider the case that the scores are generated by (13.4). To avoid confusion, write b_{Ni} = φ(i/(N + 1)) and let the scores a_{Ni} be defined by (13.3) as before. We shall prove that the sequences of rank statistics S_N and T_N defined from the scores b_N and a_N, respectively, are asymptotically equivalent.

Because R_{N1}/(N + 1) converges in probability to U_1 and φ is continuous almost everywhere, it follows that φ(R_{N1}/(N + 1)) → φ(U_1) in probability. The assumption on φ is exactly that Eφ²(R_{N1}/(N + 1)) converges to Eφ²(U_1). By Proposition 2.29, we conclude that φ(R_{N1}/(N + 1)) → φ(U_1) in second mean. Combining this with (13.6), we obtain that

N⁻¹ Σ_{i=1}^N (a_{Ni} − b_{Ni})² = E(a_{N,R_{N1}} − φ(R_{N1}/(N + 1)))² → 0.
By the formula for the variance of a linear rank statistic, we obtain that

var(S_N − T_N)/var T_N = Σ_{i=1}^N (a_{Ni} − b_{Ni} − (ā_N − b̄_N))² / Σ_{i=1}^N (a_{Ni} − ā_N)² → 0,

because var a_{N,R_{N1}} → var φ(U_1) > 0. This implies that var S_N/var T_N → 1. The proof is complete. ∎
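The two score-generating schemes (13.3) and (13.4) can also be compared numerically; in the sketch below (illustrative code; the expectation in (13.3) is approximated by simulation) the generating function is the standard normal quantile function φ = Φ⁻¹ of the van der Waerden statistic:

```python
import random
from statistics import NormalDist

phi = NormalDist().inv_cdf  # van der Waerden score-generating function

def scores_quantile(N):
    # Definition (13.4): a_Ni = phi(i / (N + 1))
    return [phi(i / (N + 1)) for i in range(1, N + 1)]

def scores_expected(N, reps=2000, seed=2):
    # Definition (13.3): a_Ni = E phi(U_N(i)), approximated by Monte Carlo
    rng = random.Random(seed)
    total = [0.0] * N
    for _ in range(reps):
        u = sorted(rng.random() for _ in range(N))
        for i, ui in enumerate(u):
            total[i] += phi(ui)
    return [t / reps for t in total]

a1, a2 = scores_quantile(10), scores_expected(10)
diff = max(abs(x - y) for x, y in zip(a1, a2))
print(round(diff, 3))
```

The two score vectors nearly coincide at the middle ranks but differ visibly at the extreme ranks, where Φ⁻¹ is steep; this is one reason the two schemes are treated by separate conditions in Theorem 13.5.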
Under the conditions of the preceding theorem, the sequence of rank statistics Σ c_{Ni} a_{N,R_{Ni}} is asymptotically equivalent to a sum of independent variables. This sum is asymptotically normal under the Lindeberg–Feller condition, given in Proposition 2.27. In the present case, because the variables φ(F(X_i)) are independent and identically distributed, this is implied by

max_{1≤i≤N} (c_{Ni} − c̄_N)² / Σ_{i=1}^N (c_{Ni} − c̄_N)² → 0.   (13.7)

This is satisfied by the most important choices of vectors of coefficients.
13.8 Corollary. If the vector of coefficients c_N satisfies (13.7), and the scores are generated according to (13.3) for a measurable, nonconstant, square-integrable function φ, then the sequence of standardized rank statistics (T_N − ET_N)/sd T_N converges weakly to an N(0, 1)-distribution. The same is true if the scores are generated by (13.4) for a function φ that is continuous almost everywhere, is nonconstant, and satisfies N⁻¹ Σ_{i=1}^N φ²(i/(N + 1)) → ∫_0^1 φ²(u) du.
13.9 Example (Monotone score-generating functions). Any nondecreasing, nonconstant function φ satisfies the conditions imposed on score-generating functions of the type (13.4) in the preceding theorem and corollary. The same is true for every φ that is of bounded variation, because any such φ is a difference of two monotone functions. To see this, we recall from the preceding proof that it is always true that R_{N1}/(N + 1) → U_1, almost surely. Furthermore, for monotone φ,

E φ²(R_{N1}/(N + 1)) = N⁻¹ Σ_{i=1}^N φ²(i/(N + 1)) ≤ ((N + 1)/N) Σ_{i=1}^N ∫_{i/(N+1)}^{(i+1)/(N+1)} φ²(u) du.

The right side converges to ∫ φ²(u) du. Because φ is continuous almost everywhere, it follows by Proposition 2.29 that φ(R_{N1}/(N + 1)) → φ(U_1) in quadratic mean. □
13.10 Example (Two-sample problem). In a two-sample problem, in which the first m observations constitute the first sample and the remaining n = N − m observations the second, the coefficients are usually chosen to be

c_{Ni} = 0 for i = 1, …, m,   c_{Ni} = 1 for i = m + 1, …, m + n.

In this case c̄_N = n/N and Σ_{i=1}^N (c_{Ni} − c̄_N)² = mn/N. The Lindeberg condition is satisfied provided both m → ∞ and n → ∞. □
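That both m → ∞ and n → ∞ are needed can be seen by evaluating the Lindeberg ratio in (13.7) for the two-sample coefficients; the following sketch (illustrative helper name) computes it in closed form:

```python
def lindeberg_ratio(m, n):
    # max_i (c_i - cbar)^2 / sum_i (c_i - cbar)^2 for two-sample coefficients
    # c = (0, ..., 0, 1, ..., 1): cbar = n/N and sum (c - cbar)^2 = mn/N.
    N = m + n
    cbar = n / N
    num = max(cbar ** 2, (1 - cbar) ** 2)
    den = m * n / N
    return num / den

for m, n in [(5, 5), (50, 50), (500, 500), (500, 5)]:
    print(m, n, round(lindeberg_ratio(m, n), 4))
```

For balanced samples the ratio tends to zero, but with n fixed at 5 it stays bounded away from zero however large m becomes, so (13.7) fails.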
13.11 Example (Wilcoxon test). The function φ(u) = u generates the scores a_{Ni} = i/(N + 1). Combined with "two-sample coefficients," it yields a multiple of the Wilcoxon statistic. According to the preceding theorem, the sequence of Wilcoxon statistics W_N = Σ_{i=m+1}^N R_{Ni}/(N + 1) is asymptotically equivalent to

Ŵ_N = n/2 + (m/N) Σ_{i=m+1}^N F(X_i) − (n/N) Σ_{i=1}^m F(X_i).

The expectations and variances of these statistics are given by EW_N = EŴ_N = n/2, var W_N = mn/(12(N + 1)), and var Ŵ_N = mn/(12N). □
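These moment formulas are easy to check by simulation under the null hypothesis; the sketch below (illustrative Python; names are ours) draws both samples from a common continuous distribution:

```python
import random

def wilcoxon(x_sample, y_sample):
    # W = sum of the pooled-sample ranks of the second sample, divided by N + 1.
    pooled = x_sample + y_sample
    N = len(pooled)
    order = sorted(range(N), key=lambda i: pooled[i])
    rank = {idx: pos + 1 for pos, idx in enumerate(order)}
    return sum(rank[i] for i in range(len(x_sample), N)) / (N + 1)

random.seed(1)
m, n = 30, 20
N = m + n
reps = 4000
ws = [wilcoxon([random.random() for _ in range(m)],
               [random.random() for _ in range(n)]) for _ in range(reps)]
mean = sum(ws) / reps
var = sum((w - mean) ** 2 for w in ws) / reps
print(round(mean, 2), round(var, 2))  # approx n/2 = 10 and mn/(12(N+1)) ~ 0.98
```

With continuous observations ties occur with probability zero, so the simple ranking above suffices.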
13.12 Example (Median test). The median test is a two-sample rank test with scores of the form a_{Ni} = φ(i/(N + 1)) generated by the function φ(u) = 1_{(0,1/2]}(u). The corresponding test statistic is

Σ_{i=m+1}^N 1{R_{Ni} ≤ (N + 1)/2}.
This counts the number of Y_j less than the median of the pooled sample. Large values of this test statistic indicate that the distribution of the second sample is stochastically smaller than the distribution of the first sample. □

The examples of rank statistics discussed so far have a direct intuitive meaning as statistics measuring a difference in location. It is not always obvious to find a rank statistic appropriate for testing certain hypotheses. Which rank statistics measure a difference in scale, for instance? A general method of generating rank statistics for a specific situation is as follows. Suppose that it is required to test the null hypothesis that X_1, …, X_N are i.i.d. versus the alternative that X_1, …, X_N are independent, with X_i having a distribution with density f_{c_{Ni}θ} for a given one-dimensional parametric model θ ↦ f_θ. According to the Neyman–Pearson lemma, the most powerful rank test for testing H_0: θ = 0 against a simple alternative θ rejects the null hypothesis for large values of the quotient

P_θ(R_N = r)/P_0(R_N = r) = N! P_θ(R_N = r).

Equivalently, the null hypothesis is rejected for large values of P_θ(R_N = r). This test depends on the alternative θ, but this dependence disappears if we restrict ourselves to
alternatives θ that are sufficiently close to 0. Indeed, under regularity conditions, with ḟ_θ the derivative of the map θ ↦ f_θ,

P_θ(R_N = r) − P_0(R_N = r) = ∫_{{R_N = r}} (Π_{i=1}^N f_{c_{Ni}θ}(x_i) − Π_{i=1}^N f_0(x_i)) dx_1 ⋯ dx_N
= θ (1/N!) Σ_{i=1}^N c_{Ni} E_0((ḟ_0/f_0)(X_i) | R_N = r) + o(θ).

By Lemma 13.1(v), the conditional expectation in this display equals E_0 (ḟ_0/f_0)(X_{N(r_i)}). Thus the locally most powerful rank test rejects the null hypothesis for large values of the simple linear rank statistic T_N = Σ_{i=1}^N c_{Ni} a_{N,R_{Ni}}. The scores for the locally most powerful rank test are given by

a_{Ni} = E_0 (ḟ_0/f_0)(X_{N(i)}).

For instance, for f equal to the standard normal density this leads to the rank statistic Σ_{i=m+1}^N a_{N,R_{Ni}}. The same test is found for f equal to a normal density with a different mean or variance. This follows by direct calculation, or alternatively from the fact that rank statistics are location and scale invariant. The latter implies that the probabilities P_{μ,σ,θ}(R_N = r) of the rank vector of a sample of independent variables X_1, …, X_N, with X_i distributed according to the density e^{−θ} f(e^{−θ}(x − μ)/σ)/σ, do not depend on (μ, σ). Thus the procedure to generate locally most powerful scores yields the same result for any (μ, σ). □
13.14 Example (Two-sample location). In order to find locally most powerful tests for location, we choose f_θ(x) = f(x − θ) for a fixed density f, and the coefficients c equal to the two-sample coefficients. Then the first m observations have density f(x) and the last n = N − m observations have density f(x − θ). The scores for a locally most powerful rank test are

a_{Ni} = −E (f′/f)(X_{N(i)}).

For the standard normal density −(f′/f)(x) = x, so that the scores are the expected values of normal order statistics; this leads to a variation of the van der Waerden statistic. The Wilcoxon statistic corresponds to the logistic density. □

13.15 Example (Log rank test). The cumulative hazard function corresponding to a continuous distribution function F is the function Λ = −log(1 − F). This is an important modeling tool in survival analysis. Suppose that we wish to test the null hypothesis that two samples with cumulative hazard functions Λ_X and Λ_Y are identically distributed against the alternative that they are not. The hypothesis of proportional hazards postulates that Λ_Y = θΛ_X for a constant θ, meaning that the second sample is a factor θ more "at risk" at any time. If we wish to have large power against alternatives that satisfy this postulate, then it makes sense to use the locally most powerful scores corresponding to a family defined by Λ_θ = θΛ_1. The corresponding family of cumulative distribution functions F_θ satisfies 1 − F_θ = (1 − F_1)^θ and is known as the family of Lehmann alternatives. The locally most powerful scores for this family correspond to the generating function
φ(u) = −(∂/∂θ) log(−(∂/∂x)(1 − F_θ))(x) |_{θ=1, x=F_1^{-1}(u)} = −1 − log(1 − u).

It is fortunate that the score-generating function does not depend on the baseline hazard function Λ_1. The resulting test is known as the log rank test. The test is related to the Savage test, which uses the scores

a_{N,i} = Σ_{j=N−i+1}^N 1/j ≈ −log(1 − i/(N + 1)).
The log rank test is a very popular test in survival analysis, where it usually needs to be extended to the situation that the observations are censored. □

13.16 Example (More-sample problem). Suppose the problem is to test the hypothesis that k independent random samples X_1, …, X_{N_1}, X_{N_1+1}, …, X_{N_2}, …, X_{N_{k−1}+1}, …, X_{N_k} are identical in distribution. Let N = N_k be the total number of observations, and let R_N be the rank vector of the pooled sample X_1, …, X_N. Given scores a_N, inference can be based on the rank statistics

T_{N1} = Σ_{i=1}^{N_1} a_{N,R_{Ni}},  T_{N2} = Σ_{i=N_1+1}^{N_2} a_{N,R_{Ni}},  …,  T_{Nk} = Σ_{i=N_{k−1}+1}^{N_k} a_{N,R_{Ni}}.
The testing procedure can consist of several two-sample tests, comparing pairs of (pooled) subsamples, or be based on an overall statistic. One possibility for an overall statistic is the chi-square
statistic. For n_j = N_j − N_{j−1} equal to the number of observations in the jth sample, define

C_N = Σ_{j=1}^k (T_{Nj} − n_j ā_N)² / (n_j var φ(U_1)).

If the scores are generated by (13.3) or (13.4) and all sample sizes n_j tend to infinity, then every sequence T_{Nj} is asymptotically normal under the null hypothesis, under the conditions of Theorem 13.5. In fact, because the approximations T̂_{Nj} are jointly asymptotically normal by the multivariate central limit theorem, the vector T_N = (T_{N1}, …, T_{Nk}) is asymptotically normal as well. By elementary calculations, if n_j/N → λ_j,

(T_N − ET_N)/(√N sd φ(U_1)) ⇝ N_k(0, Λ),

where the covariance matrix Λ has diagonal elements λ_j(1 − λ_j) and off-diagonal (i, j)-elements −λ_iλ_j. This limit distribution is similar to the limit distribution of a sequence of multinomial vectors. Analogously to the situation in the case of Pearson's chi-square tests for a multinomial distribution (see Chapter 17), the sequence C_N converges in distribution to a chi-square distribution with k − 1 degrees of freedom.

There are many reasonable choices of scores. The most popular choice is based on φ(u) = u and leads to the Kruskal–Wallis test. Its test statistic is usually written in the form

(12/(N(N + 1))) Σ_{j=1}^k n_j (R̄_j − (N + 1)/2)².
This test statistic measures the distance of the average scores of the k samples to the average score (N + 1 )/2 of the pooled sample. An alternative is to use locally asymptotically powerful scores for a family of distribu tions of interest. Also, choosing the same score generating function for all subsamples is convenient, but not necessary, provided the chi-square statistic is modified. D
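The chi-square limit can be illustrated by simulation. The sketch below (illustrative Python; the statistic is computed in the standard Kruskal–Wallis form) estimates the mean of the statistic under the null hypothesis, which should be close to k − 1 = 2:

```python
import random

def kruskal_wallis(samples):
    # (12 / (N(N+1))) * sum_j n_j (Rbar_j - (N+1)/2)^2 for the pooled ranks.
    pooled = [x for s in samples for x in s]
    N = len(pooled)
    order = sorted(range(N), key=lambda i: pooled[i])
    rank = [0] * N
    for pos, idx in enumerate(order):
        rank[idx] = pos + 1
    stat, start = 0.0, 0
    for s in samples:
        nj = len(s)
        rbar = sum(rank[start:start + nj]) / nj
        stat += nj * (rbar - (N + 1) / 2) ** 2
        start += nj
    return 12 * stat / (N * (N + 1))

random.seed(3)
reps, sizes = 3000, (25, 25, 25)
stats = [kruskal_wallis([[random.random() for _ in range(n)] for n in sizes])
         for _ in range(reps)]
mean = sum(stats) / reps
print(round(mean, 2))  # close to k - 1 = 2, the mean of the chi-square limit
```

Under the null hypothesis the exact mean of this statistic is k − 1 for every N, so the simulated average should be close to 2 regardless of the sample sizes.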
13.2
Signed Rank Statistics
The sign of a number x, denoted sign(x), is defined to be −1, 0, or 1 if x < 0, x = 0, or x > 0, respectively. The absolute rank R⁺_{Ni} of an observation X_i in a sample X_1, …, X_N is defined as the rank of |X_i| in the sample of absolute values |X_1|, …, |X_N|. A simple linear signed rank statistic has the form

Σ_{i=1}^N a_{N,R⁺_{Ni}} sign(X_i).
The ordinary ranks of a sample can always be derived from the combined set of absolute ranks and signs. Thus, the vectors of absolute ranks and signs are together statistically more informative than the ordinary ranks. The difference is dramatic if testing the location of a symmetric density of a given form, in which case the class of signed rank statistics contains asymptotically efficient test statistics in great generality.
1 82
Rank, Sign, and Permutation Statistics
The main attraction of signed rank statistics is their simplicity, particularly their being distribution-free over the set of all symmetric distributions. Write l X I , Rt, and signN (X) for the vectors of absolute values, absolute ranks, and signs.
13.17 Lemma. Let X_1, …, X_N be a random sample from a continuous distribution that is symmetric about zero. Then
(i) the vectors (|X|, R⁺_N) and sign_N(X) are independent;
(ii) the vector R⁺_N is uniformly distributed over the set of all N! permutations of 1, …, N;
(iii) the vector sign_N(X) is uniformly distributed over {−1, 1}^N;
(iv) for any signed rank statistic, var Σ_{i=1}^N a_{N,R⁺_{Ni}} sign(X_i) = Σ_{i=1}^N a²_{Ni}.
=
i.
=
Let X 1 , . . . , XN be a random sample from a continuous distribution that is symmetric about zero. Let the scores aN be generated according to ( for a measurable function ¢ such that f01 ¢2 ( u) d u For p+ the distribution function of I X 1 l , define N N + '""" TN = i 1 aN sign(Xi) , TN = L ¢ ( F ( I Xi l ) ) sign(Xd . i 1 1 var (TN Then the sequences TN and TN are asymptoticall y equi v alent in the sense that N - 1 12 TN is asymptotically normal with mean zero TN) 0. Consequentl y , the sequence N 1 and variance J0 ¢ 2 (u) d u. The same is true if the scores are generated -accordi ng2 to 1 for a function ¢ that is continuous almost everywhere and satisfies N L;: 1 ¢ (i j ( N + 1 2 1) ) fo ¢ (u) du Because the vectors signN (X) and ( l X I , Rt) are independent and E signN (X) = 0, the means of both TN and TN are zero. Furthermore, by the independence and the 13.18
Theorem.
< oo.
L =
'
---+
---+
13. 3)
R+
Nt
=
(13.4)
< oo.
Proof
orthogonality of the signs,
13.4
1 83
Rank Statistics for Independence
The expectation on the right side is exactly the expression in ( 1 3.6), evaluated for the special choice U1 = p + ( I X 1 1 ) . This can be shown to converge to zero as in the proof of Theorem 1 3 .5. •
Wilcoxon signed rank statlstzc u.
The Large = 2.:= � 1 R t i sign(Xi ) is obtained from the score-generating function ¢ ( u ) = values of this statistic indicate that large absolute values I Xi I tend to go together with pos itive Xi . Thus large values of the Wilcoxon statistic suggest that the location of the Xi is larger than zero. Under the null hypothesis that X 1 , . . . , X N are i.i.d. and symmetrically distributed about zero, the sequence is asymptotically normal N(O, 1 13). The variance of is equal to N(2 N + 1 ) (N + 1 ) 1 6. The signed rank statistic is asymptotically equivalent to the U -statistic with kernel h (x 1 , x 2 ) = 1 {x 1 + x2 > 0}. (See problem 12.9.) This connection yields the limit distri bution also under nonsymmetric distributions. 0 13.19
WN
Example (Wilcoxon signed rank statistic).
N -312 WN
WN
Signed rank statistics that are locally most powerful can be obtained in a similar fashion as locally most powerful rank statistics were obtained in the previous section. Let f be a symmetric density, and let X 1 , , XN be a random sample from the density f(- -e). Then, under regularity conditions, .
Pe ( signN (X)
= = =
.
•
s, R t N r ) - Po ( signN (X) s, R t =
=
j ' ( I Xi l ) { signN (x) -e Eo L sign(Xi ) f i= 1 N 1 f' -e -N-, L iEo - ( I Xi l ) I R t i f 2 N . i =1
s (
r)
=
=
s, R t r } o (e ) ) o (e ) . ri =
=
+
+
In the second equality it is used that f' If (x) is equal to sign(x) f' If ( l x I) by the skew symmetry of f' If. It follows that for testing · f against f ( -e) are obtained from the scores
locally most poweiful signed rank statistics
These scores are of the form (1 3.3) with score-generating function ¢ = - (f' 1f) (F + ) - 1 , whence locally most powerful rank statistics are asymptotically linear by Theorem 1 3 . 1 8. By the symmetry of F, we have (F + ) - 1 = p- 1 ( + 1) 12) . o
(u)
(u
Example. The Laplace density has score function f' lf (x) = sign(x) = 1 , for x :::= 0. This leads to the locally most powerful scores 1 . The corresponding test statistic is the TN = 2.:= � 1 sign(Xi ) . Is it surprising that this simple statistic possesses an optimality property? It is shown to be asymptotically optimal for testing Ho : e = 0 in Chapter 1 5 . 0 13.20
sign statistic
13.21
aNi
=
Example. The locally most powerful score for the normal distribution are N i + 1) 12) . These are appropriately known as the normal (absolute) scores. 0
E
p,(O). 14.15
Lemma.
=
Let p, and u be functions of e such that ( 14.4) holds for every sequence en h / Jn. Suppose that p, is differentiable and that u is continuous at zero, with p,' (0) > 0 and u (0) > 0. Suppose that the tests that reject the null hypothesis for large values of Tn possess nondecreasing power functions e 1--+ nn (e). Then this family of tests is consistent against every alternative e > 0. Moreover, ifnn (0) -+ then JTn (en) -+ or JTn (en) -+ 1 when .J1i en -+ 0 or .J1i en -+ 00, respectively. 14.16
Lemma.
=
a,
a
Proofs. For the first lemma, suppose that the tests reject the null hypothesis if T_n exceeds the critical value c_n. By assumption, the probability under θ = 0 that T_n is outside the interval (μ(0) − ε, μ(0) + ε) converges to zero as n → ∞, for every fixed ε > 0. If the asymptotic level lim P_0(T_n > c_n) is positive, then it follows that c_n < μ(0) + ε eventually. On the other hand, under θ the probability that T_n is in (μ(θ) − ε, μ(θ) + ε) converges to 1. For sufficiently small ε and μ(θ) > μ(0), this interval is to the right of μ(0) + ε. Thus for sufficiently large n, the power P_θ(T_n > c_n) can be bounded below by P_θ(T_n ∈ (μ(θ) − ε, μ(θ) + ε)) → 1.

For the proof of the second lemma, first note that by Theorem 14.7 the sequence of local power functions π_n(h/√n) converges to π(h) = 1 − Φ(z_α − hμ′(0)/σ(0)), for every h, if the asymptotic level is α. If √n θ_n → 0, then eventually θ_n < h/√n for every given h > 0. By the monotonicity of the power functions, π_n(θ_n) ≤ π_n(h/√n) for sufficiently large n. Thus lim sup π_n(θ_n) ≤ π(h) for every h > 0. For h ↓ 0 the right side converges to π(0) = α. Combination with the inequality π_n(θ_n) ≥ π_n(0) → α gives π_n(θ_n) → α. The case that √n θ_n → ∞ can be handled similarly. Finally, the power π_n(θ) at a fixed alternative θ is bounded below by π_n(θ_n) eventually, for every sequence θ_n ↓ 0. Thus π_n(θ) → 1, and the sequence of tests is consistent at θ. ∎
The following examples show that the t-test and the Mann–Whitney test are both consistent against large sets of alternatives, albeit not exactly the same sets. They are both tests to compare the locations of two samples, but the pertaining definitions of "location" are not the same. The t-test can be considered a test to detect a difference in mean; the Mann–Whitney test is designed to find a difference of P(X ≤ Y) from its value 1/2 under the null hypothesis. This evaluation is justified by the following examples and is further underscored by the consideration of asymptotic efficiency in nonparametric models. It is shown in Section 25.6 that the tests are asymptotically efficient for testing the parameters EY − EX or P(X ≤ Y) if the underlying distributions F and G are completely unknown.

14.17 Example (t-test). The two-sample t-statistic (Ȳ − X̄)/S converges in probability to E(Y − X)/σ, where σ² is the limit in probability of S². If the null hypothesis postulates that EY = EX, then the test that rejects the null hypothesis for large values of the t-statistic is consistent against every alternative for which EY > EX. □
14.18 Example (Mann–Whitney test). The Mann–Whitney statistic U converges in probability to P(X ≤ Y), by the two-sample U-statistic theorem. The probability P(X ≤ Y) is equal to 1/2 if the two samples are equal in distribution and possess a continuous distribution function. If the null hypothesis postulates that P(X ≤ Y) = 1/2, then the test that rejects for large values of U is consistent against any alternative for which P(X ≤ Y) > 1/2. □
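The convergence of U to P(X ≤ Y) is visible in a simulation. A hedged sketch for two normal samples with a location shift δ, for which P(X ≤ Y) = Φ(δ/√2); the sample sizes and the shift are arbitrary choices:

```python
import math, random

def mann_whitney_u(xs, ys):
    # U/(m*n): fraction of pairs with x <= y; converges to P(X <= Y)
    count = sum(1 for x in xs for y in ys if x <= y)
    return count / (len(xs) * len(ys))

random.seed(1)
m = n = 1500
delta = 1.0                                    # location shift of the second sample
xs = [random.gauss(0.0, 1.0) for _ in range(m)]
ys = [random.gauss(delta, 1.0) for _ in range(n)]

u = mann_whitney_u(xs, ys)
# exact value for normal samples: Phi(delta / sqrt(2))
p_exact = 0.5 * (1.0 + math.erf(delta / 2.0))
print(round(u, 3), round(p_exact, 3))
```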
14.3 Asymptotic Relative Efficiency
Sequences of tests can be ranked in quality by comparing their asymptotic power functions. For the test statistics we have considered so far, this comparison only involves the "slopes" of the tests. The concept of relative efficiency yields a method to quantify the interpretation of the slopes.

Consider a sequence of testing problems consisting of testing a null hypothesis H_0: θ = 0 versus the alternative H_1: θ = θ_ν. We use the parameter ν to describe the asymptotics; thus ν → ∞. We require a priori that our tests attain asymptotically level α and power γ ∈ (α, 1). Usually we can meet this requirement by choosing an appropriate number of observations at "time" ν. A larger number of observations allows smaller level and higher power. If π_n is the power function of a test if n observations are available, then we define n_ν to be the minimal number of observations such that both

π_{n_ν}(0) ≤ α   and   π_{n_ν}(θ_ν) ≥ γ.
If two sequences of tests are available, then we prefer the sequence for which the numbers n_ν are smallest. Suppose that n_{ν,1} and n_{ν,2} observations are needed for two given sequences of tests. Then, if it exists, the limit

lim_{ν→∞} n_{ν,2}/n_{ν,1}

is called the (asymptotic) relative efficiency or Pitman efficiency of the first with respect to the second sequence of tests. A relative efficiency larger than 1 indicates that fewer observations are needed with the first sequence of tests, which may then be considered the better one. In principle, the relative efficiency may depend on α, γ, and the sequence of alternatives θ_ν. The concept is mostly of interest if the relative efficiency is the same for all possible choices of these parameters. This is often the case. In particular, in the situations considered previously, the relative efficiency turns out to be the square of the quotient of the slopes.
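As an illustration of "the square of the quotient of the slopes," the following sketch computes the Pitman relative efficiency of the sign test with respect to the test based on the sample mean, in the normal location model N(θ, 1). Here μ_1(θ) = E_θ sign(X_1) = 2Φ(θ) − 1 with σ_1(0) = 1, and μ_2(θ) = θ with σ_2(0) = 1, so the efficiency should come out as 2/π:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mu_sign(theta):
    # mean of the sign statistic under N(theta, 1)
    return 2.0 * Phi(theta) - 1.0

h = 1e-6
slope_sign = (mu_sign(h) - mu_sign(-h)) / (2 * h)   # mu_1'(0), numerically
slope_mean = 1.0                                     # mu_2'(0)

are = (slope_sign / 1.0) ** 2 / (slope_mean / 1.0) ** 2
print(are, 2 / math.pi)
```

The numerical derivative recovers μ_1′(0) = 2φ(0) = √(2/π), giving the classical efficiency 2/π ≈ 0.64 of the sign test at the normal distribution.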
14.19 Theorem. Consider statistical models (P_{n,θ}: θ ≥ 0) such that ‖P_{n,θ} − P_{n,0}‖ → 0 as θ → 0, for every n. Let T_{n,1} and T_{n,2} be sequences of statistics that satisfy (14.4) for every sequence θ_n ↓ 0 and functions μ_i and σ_i such that μ_i is differentiable at zero and σ_i is continuous at zero, with μ_i′(0) > 0 and σ_i(0) > 0. Then the relative efficiency of the tests that reject the null hypothesis H_0: θ = 0 for large values of T_{n,i} is equal to

( μ_1′(0)/σ_1(0) )² / ( μ_2′(0)/σ_2(0) )²,

for every sequence of alternatives θ_ν ↓ 0, independently of α > 0 and γ ∈ (α, 1). If the power functions of the tests based on T_{n,i} are nondecreasing for every n, then the assumption of asymptotic normality of T_{n,i} can be relaxed to asymptotic normality under every sequence θ_n = O(1/√n) only.
Proof. Fix α and γ as in the introduction and, given alternatives θ_ν ↓ 0, let n_{ν,i} observations be used with each of the two tests. The assumption that ‖P_{n,θ_ν} − P_{n,0}‖ → 0 as ν → ∞ for each fixed n forces n_{ν,i} → ∞. Indeed, the sum of the probabilities of the errors of the first and second kind of the test with critical region K_n equals

∫_{K_n} dP_{n,0} + ∫_{K_n^c} dP_{n,θ_ν} = 1 + ∫_{K_n} (p_{n,0} − p_{n,θ_ν}) dμ_n.

This sum is minimized for the critical region K_n = {p_{n,0} − p_{n,θ_ν} < 0}, and then equals 1 − ½‖P_{n,θ_ν} − P_{n,0}‖. By assumption, this converges to 1 as ν → ∞, uniformly in every finite set of n. Thus, for every bounded sequence n = n_ν and any sequence of tests, the sum of the error probabilities is asymptotically bounded below by 1 and cannot be bounded above by α + 1 − γ < 1, as required.

Now that we have ascertained that n_{ν,i} → ∞ as ν → ∞, we can use the asymptotic normality of the test statistics T_{n,i}. The convergence to a continuous distribution implies that the asymptotic level and power attained for the minimal numbers of observations (minimal for obtaining at most level α and at least power γ) are exactly α and γ. In order to obtain asymptotic level α, the tests must reject H_0 if √n_ν (T_{n_ν,i} − μ_i(0)) > σ_i(0) z_α + o(1). The powers of these tests are equal to

π_{n_ν,i}(θ_ν) = 1 − Φ( z_α − √n_ν θ_ν μ_i′(0)/σ_i(0) ) + o(1).

The requirement that these powers converge to γ forces √n_{ν,i} θ_ν μ_i′(0)/σ_i(0) → z_α − z_γ, whence the quotient n_{ν,2}/n_{ν,1} converges to the square of the quotient of the slopes μ_i′(0)/σ_i(0), as claimed. ∎

Bahadur's approach to relative efficiency fixes the alternative θ and the power γ and compares instead the exponential rates α_ν at which the levels of the two tests can decrease to zero. It is governed by two conditions: (14.20), which requires that −(2/n) log P_{n,0}(T_n ≥ t) → e(t) for every t and a function e, and (14.21), which requires that T_n → μ(θ) in probability under every θ. The quantity e(μ(θ)) is the Bahadur slope of T_n at θ.

14.22 Theorem. Let T_{n,1} and T_{n,2} be sequences of statistics that satisfy (14.20) and (14.21) for functions e_i and μ_i such that e_i is continuous at μ_i(θ). Then the Bahadur relative efficiency of the first relative to the second sequence of tests at the alternative θ is equal to e_1(μ_1(θ))/e_2(μ_2(θ)), for every sequence of levels α_ν ↓ 0 and every power γ ∈ (1 − sup_n P_{n,θ}(p_{n,0} = 0), 1).
Proof. For simplicity of notation, we drop the index i ∈ {1, 2} and write n_ν for the minimal number of observations needed to obtain level α_ν and power γ with the test statistics T_n.
The sample sizes n_ν necessarily converge to ∞ as ν → ∞. If not, then there would exist a fixed value n and a (sub)sequence of tests with levels tending to 0 and powers at least γ. However, for any fixed n and any sequence of measurable sets K_m with P_{n,0}(K_m) → 0 as m → ∞, the probabilities P_{n,θ}(K_m) = P_{n,θ}(K_m ∩ {p_{n,0} = 0}) + o(1) are eventually strictly smaller than γ, by assumption.

The most powerful level-α_ν test that rejects for large values of T_n has critical region {T_n ≥ c_n} or {T_n > c_n} for c_n = inf{c: P_0(T_n ≥ c) ≤ α_ν}, where we use ≥ if P_0(T_n ≥ c_n) ≤ α_ν and > otherwise. Equivalently, with the notation L_n = P_0(T_n ≥ t)|_{t=T_n}, this is the test with critical region {L_n ≤ α_ν}. By the definition of n_ν we conclude that

P_{n,θ}( −(2/n) log L_n ≥ −(2/n) log α_ν )  { ≥ γ for n = n_ν,   < γ for n = n_ν − 1.

By (14.20) and (14.21), the random variable inside the probability converges in probability to the number e(μ(θ)) as n → ∞. Thus, the probability converges to 0 or 1 if −(2/n) log α_ν is asymptotically strictly bigger or smaller than e(μ(θ)), respectively. Conclude that

lim sup_{ν→∞} −(2/n_ν) log α_ν ≤ e(μ(θ)),   lim inf_{ν→∞} −(2/(n_ν − 1)) log α_ν ≥ e(μ(θ)).

Combined, this yields the asymptotic equivalence n_ν ~ −2 log α_ν / e(μ(θ)). Applying this for both n_{ν,1} and n_{ν,2} and taking the quotient, we obtain the theorem. ∎
Bahadur and Pitman efficiencies do not always yield the same ordering of sequences of tests. In numerical comparisons, the Pitman efficiencies appear to be more relevant for moderate sample sizes. This is explained by their method of calculation. By the preceding theorem, Bahadur efficiencies follow from a large deviations result under the null hypothesis and a law of large numbers under the alternative. A law of large numbers is of less accuracy than a distributional limit result. Furthermore, large deviation results, while mathematically interesting, often yield poor approximations for the probabilities of interest. For instance, condition (14.20) shows that P_0(T_n ≥ t) = exp(−½ n e(t)) exp o(n). Nothing guarantees that the term exp o(n) is close to 1. On the other hand, often the Bahadur efficiencies as a function of θ are more informative than Pitman efficiencies. The Pitman slopes are obtained under the condition that the sequence √n(T_n − μ(θ)) is asymptotically normal with mean zero and variance σ²(θ). Suppose, for the present argument, that T_n is normally distributed for every finite n, with the parameters μ(θ) and σ(θ). Then, because 1 − Φ(t) = e^{−½t²(1+o(1))} as t → ∞, the Bahadur slopes can be computed directly from the normal tail.
… > 0, and hence also for t = 0. Finally, we remove the restriction that K(u) is finite for every u, by a truncation argument. For a fixed, large M, let Y_1^M, Y_2^M, ... be distributed as the variables Y_1, Y_2, ... given that |Y_i| ≤ M for every i; that is, they are i.i.d. according to the conditional distribution of Y_1 given |Y_1| ≤ M. Then, with u ↦ K_M(u) = log E e^{uY_1} 1{|Y_1| ≤ M},

lim inf (1/n) log P( Ȳ_n ≥ 0 ) ≥ lim inf (1/n) log ( P( Ȳ_n^M ≥ 0 ) P( |Y_i| ≤ M )^n ) ≥ inf_{u≥0} K_M(u),

by the preceding argument applied to the truncated variables. Let s be the limit of the right side as M → ∞, and let A_M be the set {u ≥ 0: K_M(u) ≤ s}. Then the sets A_M are nonempty and compact for sufficiently large M (as soon as K_M(u) → ∞ as u → ±∞), with A_1 ⊃ A_2 ⊃ ⋯, whence ∩_M A_M is nonempty as well. Because K_M converges pointwise to K as M → ∞, any point u_1 ∈ ∩_M A_M satisfies K(u_1) = lim K_M(u_1) ≤ s. Conclude that s is bigger than the right side of the proposition (with t = 0). ∎
14.24 Example (Sign statistic). The cumulant generating function of a variable Y that takes the values −1 and 1, each with probability ½, is equal to K(u) = log cosh u. Its derivative is K′(u) = tanh u, and hence the infimum of K(u) − tu over u ∈ ℝ is attained at u = arctanh t. By the Cramér–Chernoff theorem, for 0 < t < 1,

−(2/n) log P( Ȳ_n ≥ t ) → e(t) := −2 log cosh arctanh t + 2t arctanh t.

We can apply this result to find the Bahadur slope of the sign statistic T_n = n^{-1} Σ_{i=1}^n sign(X_i). If the null distribution of the random variables X_1, ..., X_n is continuous and symmetric about zero, then (14.20) is valid with e(t) as in the preceding display and with μ(θ) = E_θ sign(X_1). Figure 14.2 shows the slopes of the sign statistic and the sample mean for testing the location of the Laplace distribution. The local optimality of the sign statistic is reflected in the Bahadur slopes, but for detecting large differences of location the mean is better than the sign statistic. However, it should be noted that the power of the sign test in this range is so close to 1 that improvement may be irrelevant; for example, the power is 0.999 at level 0.007 for n = 25 at θ = 2. □
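The closed form for e(t) can be double-checked numerically, both against the variational formula e(t) = 2 sup_u (ut − K(u)) and against the exact binomial tail probability for finite n. A sketch (the search bounds and the choice n = 5000 are arbitrary):

```python
import math

def K(u):
    # cumulant generating function of Y = +/-1 with probability 1/2 each
    return math.log(math.cosh(u))

def slope_numeric(t, lo=0.0, hi=20.0, iters=200):
    # e(t) = 2 * sup_u (u*t - K(u)), by ternary search on a concave function
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if m1 * t - K(m1) < m2 * t - K(m2):
            lo = m1
        else:
            hi = m2
    u = 0.5 * (lo + hi)
    return 2.0 * (u * t - K(u))

def slope_closed_form(t):
    return -2.0 * math.log(math.cosh(math.atanh(t))) + 2.0 * t * math.atanh(t)

t = 0.5
e_num, e_cf = slope_numeric(t), slope_closed_form(t)
print(e_num, e_cf)

# Exact binomial tail: P(mean >= t) = P(#{+1} >= n(1+t)/2); -(2/n) log P -> e(t).
n = 5000
k0 = math.ceil(n * (1 + t) / 2)
logterms = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2.0) for k in range(k0, n + 1)]
m = max(logterms)
logp = m + math.log(sum(math.exp(v - m) for v in logterms))
bahadur_binom = -2.0 * logp / n
print(bahadur_binom)
```

The finite-n value already agrees with e(t) to within a correction of order (log n)/n.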
Figure 14.2. Bahadur slopes of the sign statistic (solid line) and the sample mean (dotted line) for testing that a random sample from the Laplace distribution has mean zero versus the alternative that the mean is θ, as a function of θ.
14.25 Example (Student statistic). Suppose that X_1, ..., X_n are a random sample from a normal distribution with mean μ and variance σ². We shall consider σ known and compare the slopes of the sample mean X̄_n and the Student statistic X̄_n/S_n for testing H_0: μ = 0. The cumulant generating function of the normal distribution is equal to K(u) = uμ + ½u²σ². By the Cramér–Chernoff theorem, for every t > 0,

−(2/n) log P_0( X̄_n ≥ t ) → e(t) := t²/σ².

Thus, the Bahadur slope of the sample mean is equal to μ²/σ², for every μ > 0.

Under the null hypothesis, the statistic √n X̄_n/S_n possesses the t-distribution with (n−1) degrees of freedom. Thus, for a random sample Z_0, Z_1, ... of standard normal variables, for every t > 0,

P_0( √(n/(n−1)) X̄_n/S_n ≥ t ) = P( Z_0 ≥ t √( Σ_{i=1}^{n−1} Z_i² ) ) = ½ P( Z_0² − t² Σ_{i=1}^{n−1} Z_i² ≥ 0 ).

This probability is not of the same form as in the Cramér–Chernoff theorem, but it concerns almost an average, and we can obtain the large deviation probabilities from the cumulant generating function in an analogous way. The cumulant generating function of the square of a standard normal variable is equal to u ↦ −½ log(1 − 2u), and hence the cumulant generating function of the variable Z_0² − t² Σ_{i=1}^{n−1} Z_i² is equal to

K_n(u) = −½ log(1 − 2u) − ½(n−1) log(1 + 2t²u).

This function is nicely differentiable and, by straightforward calculus, its minimum value can be found to be

inf_u K_n(u) = −½ log( (t² + 1)/(t² n) ) − ½(n−1) log( (n−1)(t² + 1)/n ).

The minimum is achieved on [0, ∞) for t² ≥ (n−1)^{-1}. This expression divided by n is the analogue of inf_u (K(u) − ut) in the Cramér–Chernoff theorem. By an extension of this theorem, for every t > 0,

−(2/n) log P_0( √(n/(n−1)) X̄_n/S_n ≥ t ) → e(t) := log(t² + 1).

Thus, the Bahadur slope of the Student statistic is equal to log(1 + μ²/σ²). For μ/σ close to zero, the Bahadur slopes of the sample mean and the Student statistic are close, but for large μ/σ the slope of the sample mean is much bigger. This suggests that the loss in efficiency incurred by unnecessarily estimating the standard deviation σ can be substantial. This suggestion appears to be unrealistic and also contradicts the fact that the Pitman efficiencies of the two sequences of statistics are equal. □

14.26 Example (Neyman–Pearson statistic). The sequence of Neyman–Pearson statistics Π_{i=1}^n (p_θ/p_{θ_0})(X_i) has Bahadur slope −2P_θ log(p_{θ_0}/p_θ). This is twice the Kullback–Leibler divergence of the measures P_{θ_0} and P_θ and shows an important connection between large deviations and the Kullback–Leibler divergence.

In regular cases this result is a consequence of the Cramér–Chernoff theorem. The variable Y = log(p_θ/p_{θ_0})(X_1) has cumulant generating function K(u) = log ∫ p_θ^u p_{θ_0}^{1−u} dμ under P_{θ_0}. The function K(u) is finite for 0 ≤ u ≤ 1 and, at least by formal calculus, K′(1) = P_θ log(p_θ/p_{θ_0}) = μ(θ), where μ(θ) is the asymptotic mean of the sequence n^{-1} Σ log(p_θ/p_{θ_0})(X_i). Thus the infimum of the function u ↦ K(u) − uμ(θ) is attained at u = 1, and the Bahadur slope is given by

e(μ(θ)) = −2( K(1) − μ(θ) ) = 2 P_θ log (p_θ/p_{θ_0}).

In Section 16.6 we obtain this result by a direct, and rigorous, argument. □
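Both slopes derived above can be spot-checked numerically; all constants below are arbitrary choices. The first part minimizes K_n by ternary search and compares −(2/n) inf_u K_n(u) with log(t² + 1); the second computes K(1) for the normal shift model N(θ, 1) versus N(0, 1), for which μ(θ) = θ²/2, so the slope should equal twice the Kullback–Leibler divergence, θ²:

```python
import math

# (1) Student statistic: -(2/n) * inf_u K_n(u) should approach log(t^2 + 1).
def K_n(u, t, n):
    # cgf of Z0^2 - t^2 * (Z_1^2 + ... + Z_{n-1}^2), finite for 0 <= u < 1/2
    return -0.5 * math.log(1 - 2 * u) - 0.5 * (n - 1) * math.log(1 + 2 * t * t * u)

def inf_K_n(t, n, iters=300):
    lo, hi = 0.0, 0.5 - 1e-12          # ternary search for the minimum (K_n is convex)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if K_n(m1, t, n) > K_n(m2, t, n):
            lo = m1
        else:
            hi = m2
    return K_n(0.5 * (lo + hi), t, n)

t, n = 1.0, 200000
student_slope = -2.0 * inf_K_n(t, n) / n
print(student_slope, math.log(t * t + 1))

# (2) Neyman-Pearson statistic: K(1) by the trapezoidal rule, slope = -2(K(1) - mu).
def chernoff_K(u, theta, grid=20000, lo=-12.0, hi=12.0):
    h = (hi - lo) / grid
    total = 0.0
    for k in range(grid + 1):
        x = lo + k * h
        val = math.exp(u * (-0.5 * (x - theta) ** 2) + (1 - u) * (-0.5 * x * x))
        val /= math.sqrt(2 * math.pi)
        total += val if 0 < k < grid else 0.5 * val
    return math.log(total * h)

theta = 1.5
mu = 0.5 * theta ** 2                  # P_theta log(p_theta/p_0) for the normal shift
np_slope = -2.0 * (chernoff_K(1.0, theta) - mu)
print(np_slope, theta ** 2)
```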
For statistics that are not means, the Cramér–Chernoff theorem is not applicable, and we need other methods to compute the Bahadur efficiencies. An important approach applies to functions of means and is based on more general versions of Cramér's theorem. A first generalization asserts that, for certain sets B, not necessarily of the form [t, ∞),

(1/n) log P( Ȳ_n ∈ B ) → −inf_{y∈B} I(y),   I(y) = sup_u ( uy − K(u) ).

For a given statistic of the form φ(Ȳ_n), the large deviation probabilities of interest P(φ(Ȳ_n) ≥ t) can be written in the form P(Ȳ_n ∈ B_t) for the inverse images B_t = φ^{-1}[t, ∞). If B_t is an eligible set in the preceding display, then the desired large deviations result follows, although we shall still have to evaluate the repeated "inf sup" on the right side. Now, according to Cramér's theorem, the display is valid for every set B such that the right side does not change if B is replaced by its interior or its closure. In particular, if φ is continuous, then B_t is closed and its interior contains the set φ^{-1}(t, ∞). Then we obtain a large deviations result if the difference set φ^{-1}{t} is "small" in that it does not play a role when evaluating the right side of the display.

Transforming a univariate mean Ȳ_n into a statistic φ(Ȳ_n) can be of interest (for example, to study the two-sided test statistics |Ȳ_n|), but the real promise of this approach is in its applications to multivariate and infinite-dimensional means. Cramér's theorem has been generalized to these situations. General large deviation theorems can best be formulated
as separate upper and lower bounds. A sequence of random maps X_n: Ω → 𝔻 from a probability space (Ω, 𝒰, P) into a topological space 𝔻 is said to satisfy the large deviation principle with rate function I if, for every closed set F and for every open set G,

lim sup_{n→∞} (1/n) log P*( X_n ∈ F ) ≤ −inf_{y∈F} I(y),
lim inf_{n→∞} (1/n) log P*( X_n ∈ G ) ≥ −inf_{y∈G} I(y).

The rate function I: 𝔻 → [0, ∞] is assumed to be lower semicontinuous and is called a good rate function if the sublevel sets {y: I(y) ≤ M} are compact, for every M ∈ ℝ. The inner and outer probabilities that X_n belongs to a general set B are sandwiched between the probabilities that it belongs to the interior int B and the closure cl B. Thus, we obtain a large deviation result with equality for every set B such that inf{I(y): y ∈ int B} = inf{I(y): y ∈ cl B}. An implication for the slopes of test statistics of the form φ(X_n) is as follows.

14.27 Lemma. Suppose that φ: 𝔻 → ℝ is continuous at every y such that I(y) < ∞, and suppose that inf{I(y): φ(y) > t} = inf{I(y): φ(y) ≥ t}. If the sequence X_n satisfies the large deviation principle with the rate function I under P_0, then T_n = φ(X_n) satisfies (14.20) with e(t) = 2 inf{I(y): φ(y) ≥ t}. Furthermore, if I is a good rate function, then e is continuous at t.
Proof. Define sets A_t = φ^{-1}(t, ∞) and B_t = φ^{-1}[t, ∞), and let 𝔻_0 be the set where I is finite. By the continuity of φ, cl B_t ∩ 𝔻_0 = B_t ∩ 𝔻_0 and int B_t ∩ 𝔻_0 ⊃ A_t ∩ 𝔻_0. (If y ∉ int B_t, then there is a net y_n ∈ B_t^c with y_n → y; if also y ∈ 𝔻_0, then φ(y) = lim φ(y_n) ≤ t and hence y ∉ A_t.) Consequently, the infimum of I over int B_t is at most the infimum over A_t, which is the infimum over B_t by assumption, and the latter is also the infimum over cl B_t. Condition (14.20) follows upon applying the large deviation principle to int B_t and cl B_t.

The function e is nondecreasing. The condition on the pair (I, φ) is exactly that e is right-continuous, because e(t+) = 2 inf{I(y): φ(y) > t}. To prove the left-continuity of e, let t_m ↑ t. Then e(t_m) ↑ a for some a ≤ e(t). If a = ∞, then e(t) = ∞ and e is left-continuous. If a < ∞, then there exists a sequence y_m with φ(y_m) ≥ t_m and 2I(y_m) ≤ a + 1/m. By the goodness of I, this sequence has a converging subnet y_{m′} → y. Then 2I(y) ≤ lim inf 2I(y_{m′}) ≤ a by the lower semicontinuity of I, and φ(y) ≥ t by the continuity of φ. Thus e(t) ≤ 2I(y) ≤ a. ∎
Empirical distributions can be viewed as means (of Dirac measures), and are therefore potential candidates for a large deviation theorem. Cramér's theorem for empirical distributions is known as Sanov's theorem. Let 𝕃_1(𝒳, 𝒜) be the set of all probability measures on the measurable space (𝒳, 𝒜), which we assume to be a complete, separable metric space with its Borel σ-field. The τ-topology on 𝕃_1(𝒳, 𝒜) is defined as the weak topology generated by the collection of all maps Q ↦ Qf, for f ranging over the set of all bounded, measurable functions f: 𝒳 → ℝ.† Let ℙ_n be the empirical measure of a random sample of size n from a fixed measure P.

14.28 Theorem (Sanov's theorem). The sequence ℙ_n, viewed as maps into 𝕃_1(𝒳, 𝒜), satisfies the large deviation principle relative to the τ-topology, with the good rate function I(Q) = −Q log(p/q).

† For a proof of this theorem, see [31], [32], or [65].
For 𝒳 equal to the real line, 𝕃_1(𝒳, 𝒜) can be identified with the set of cumulative distribution functions. The τ-topology is stronger than the topology obtained from the uniform norm on the distribution functions. This follows from the fact that if both F_n(x) → F(x) and F_n{x} → F{x} for every x ∈ ℝ, then ‖F_n − F‖_∞ → 0 (see Problem 19.9). Thus any function φ that is continuous with respect to the uniform norm is also continuous with respect to the τ-topology, and we obtain a large collection of functions to which we can apply the preceding lemma. Trimmed means are just one example.

14.29 Example (Trimmed means). Let 𝔽_n be the empirical distribution function of a random sample of size n from the distribution function F, and let 𝔽_n^{-1} be the corresponding quantile function. The function φ(𝔽_n) = (1 − 2α)^{-1} ∫_α^{1−α} 𝔽_n^{-1}(s) ds yields a version of the α-trimmed mean (see Chapter 22). We assume that 0 < α < ½ and (partly for simplicity) that the null distribution F_0 is continuous. If we show that the conditions of Lemma 14.27 are fulfilled, then we can conclude, by Sanov's theorem,

−(2/n) log P_{F_0}( φ(𝔽_n) ≥ t ) → e(t) := 2 inf_{G: φ(G) ≥ t} ( −G log(f_0/g) ).

Because 𝔽_n → F uniformly, almost surely, by the Glivenko–Cantelli theorem (Theorem 19.1), and φ is continuous, φ(𝔽_n) → φ(F) almost surely, and the Bahadur slope of the α-trimmed mean at an alternative F is equal to e(φ(F)).

Finally, we show that φ is continuous with respect to the uniform topology and that the function t ↦ inf{−G log(f_0/g): φ(G) ≥ t} is right-continuous at t. The map φ is even continuous with respect to the weak topology on the set of distribution functions: If a sequence of measures G_m converges weakly to a measure G, then the corresponding quantile functions G_m^{-1} converge weakly to the quantile function G^{-1} (see Lemma 21.2), and hence φ(G_m) → φ(G) by the dominated convergence theorem.

The function t ↦ inf{−G log(f_0/g): φ(G) ≥ t} is right-continuous at t if for every G with φ(G) = t there exists a sequence G_m with φ(G_m) > t and G_m log(f_0/g_m) → G log(f_0/g). If G log(f_0/g) = −∞, then this is easy, for we can choose any fixed G_m that is singular with respect to F_0 and has a trimmed mean bigger than t. Thus, we may assume that G|log(f_0/g)| < ∞, that G ≪ F_0, and hence that G is continuous. Then there exists a point c such that α < G(c) < 1 − α. Define

dG_m(x) = (1/(1+ε_m)) dG(x) if x ≤ c,   dG_m(x) = ((1 + 1/m)/(1+ε_m)) dG(x) if x > c.

Then G_m is a probability distribution for suitably chosen ε_m > 0, and, by dominated convergence, G_m log(f_0/g_m) → G log(f_0/g) as m → ∞. Because G_m(x) ≤ G(x) for all x, with strict inequality (at least) for all x ≤ c such that G(x) > 0, we have that G_m^{-1}(s) ≥ G^{-1}(s) for all s, with strict inequality for all s ∈ (0, G(c)]. Hence the trimmed mean φ(G_m) is strictly bigger than the trimmed mean φ(G), for every m. □
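The functional φ can be computed exactly for an empirical distribution, since 𝔽_n^{-1} is a step function. A sketch (the data are arbitrary illustrative numbers):

```python
def trimmed_mean(xs, alpha):
    # phi(F_n) = (1 - 2*alpha)^(-1) * integral of the empirical quantile
    # function F_n^{-1}(s) over s in [alpha, 1 - alpha]
    n = len(xs)
    order = sorted(xs)
    total = 0.0
    for i in range(1, n + 1):
        # F_n^{-1}(s) = order[i-1] for s in ((i-1)/n, i/n]
        lo = max((i - 1) / n, alpha)
        hi = min(i / n, 1 - alpha)
        if hi > lo:
            total += order[i - 1] * (hi - lo)
    return total / (1 - 2 * alpha)

data = list(range(1, 11))          # symmetric around 5.5
print(trimmed_mean(data, 0.25))    # -> 5.5, by symmetry
```

Partially covered cells of the quantile function receive proportional weight, which is what distinguishes this exact version of φ from the naive "drop the extreme observations and average" recipe.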
*14.5 Rescaling Rates
The asymptotic power functions considered earlier in this chapter are the limits of "local power functions" of the form h ↦ π_n(h/√n). The rescaling rate √n is typical for testing smooth parameters of the model. In this section we take a closer look at the rescaling rate and discuss some nonregular situations.

Suppose that in a given sequence of models (𝒳_n, 𝒜_n, P_{n,θ}: θ ∈ Θ) it is desired to test the null hypothesis H_0: θ = θ_0 versus the alternatives H_1: θ = θ_n. For probability measures P and Q, define the total variation distance ‖P − Q‖ as the L_1-distance ∫|p − q| dμ between two densities of P and Q.

14.30 Lemma. The power function π_n of any test in (𝒳_n, 𝒜_n, P_{n,θ}: θ ∈ Θ) satisfies

π_n(θ) − π_n(θ_0) ≤ ½ ‖P_{n,θ} − P_{n,θ_0}‖.

For any θ and θ_0 there exists a test whose power function attains equality.
Proof. If π_n is the power function of the test φ_n, then the difference on the left side can be written as ∫ φ_n (p_{n,θ} − p_{n,θ_0}) dμ_n. This expression is maximized for the test function φ_n = 1{p_{n,θ} > p_{n,θ_0}}. Next, for any pair of probability densities p and q we have ∫_{q>p} (q − p) dμ = ½ ∫|p − q| dμ, since ∫(p − q) dμ = 0. ∎
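For distributions on a finite sample space the lemma and its proof can be checked directly: the test that rejects when q(x) > p(x) attains a power difference of exactly ½‖P − Q‖. A sketch (the two distributions are arbitrary):

```python
def tv_distance(p, q):
    # total variation distance ||P - Q|| = sum of |p(x) - q(x)| over a finite space
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def best_power_gap(p, q):
    # power difference of the test rejecting when q(x) > p(x):
    # sum over {q > p} of (q - p), which equals ||P - Q|| / 2
    keys = set(p) | set(q)
    return sum(q.get(k, 0.0) - p.get(k, 0.0)
               for k in keys if q.get(k, 0.0) > p.get(k, 0.0))

p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.3, "c": 0.5}
print(tv_distance(p, q), best_power_gap(p, q))
```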
This lemma implies that, for any sequence of alternatives θ_n:
(i) If ‖P_{n,θ_n} − P_{n,θ_0}‖ → 2, then there exists a sequence of tests with power π_n(θ_n) tending to 1 and size π_n(θ_0) tending to 0 (a perfect sequence of tests).
(ii) If ‖P_{n,θ_n} − P_{n,θ_0}‖ → 0, then the power of any sequence of tests is asymptotically less than the level (every sequence of tests is worthless).
(iii) If ‖P_{n,θ_n} − P_{n,θ_0}‖ is bounded away from 0 and 2, then there exists no perfect sequence of tests, but not every test is worthless.
The rescaling rate h/√n used in earlier sections corresponds to the third possibility. These examples concern models with independent observations. Because the total variation distance between product measures cannot be easily expressed in the distances of the individual factors, we translate the results into the Hellinger distance and next study the implications for product experiments.

The Hellinger distance H(P, Q) between two probability measures is the L_2-distance between the square roots of the corresponding densities. Thus, its square H²(P, Q) is equal to ∫(√p − √q)² dμ. The distance is convenient if considering product measures. First, the Hellinger distance can be expressed in the Hellinger affinity A(P, Q) = ∫√p√q dμ, through the formula

H²(P, Q) = 2 − 2A(P, Q).

Next, by Fubini's theorem, the affinity of two product measures is the product of the affinities. Thus we arrive at the formula

H²(Pⁿ, Qⁿ) = 2 − 2( 1 − ½H²(P, Q) )ⁿ.
14.31 Lemma. Given a statistical model (P_θ: θ ≥ θ_0), set P_{n,θ} = P_θⁿ. Then the possibilities (i), (ii), and (iii) arise when nH²(P_{θ_n}, P_{θ_0}) converges to ∞, converges to 0, or is bounded away from 0 and ∞, respectively. In particular, if H²(P_θ, P_{θ_0}) = O(|θ − θ_0|^α) as θ → θ_0, then the possibilities (i), (ii), and (iii) are valid when n^{1/α}|θ_n − θ_0| converges to ∞, converges to 0, or is bounded away from 0 and ∞, respectively.

Proof. The possibilities (i), (ii), and (iii) can equivalently be described by replacing the total variation distance ‖P_{θ_n}ⁿ − P_{θ_0}ⁿ‖ by the squared Hellinger distance H²(P_{θ_n}ⁿ, P_{θ_0}ⁿ). This follows from the inequalities, valid for any probability measures P and Q,

H²(P, Q) ≤ ‖P − Q‖ ≤ ( 2 − A²(P, Q) ) ∧ 2H(P, Q).

The inequality on the left is immediate from the inequality |√p − √q|² ≤ |p − q|, valid for any nonnegative numbers p and q. For the inequality on the right, first note that pq = (p ∨ q)(p ∧ q) ≤ (p + q)(p ∧ q), whence A²(P, Q) ≤ 2∫(p ∧ q) dμ, by the Cauchy–Schwarz inequality. Now ∫(p ∧ q) dμ is equal to 1 − ½‖P − Q‖, as can be seen by splitting the domains of both integrals in the sets p < q and p ≥ q. This shows that ‖P − Q‖ ≤ 2 − A²(P, Q). That ‖P − Q‖ ≤ 2H(P, Q) is a direct consequence of the Cauchy–Schwarz inequality. We now express the Hellinger distance of the product measures in the Hellinger distance of P_{θ_n} and P_{θ_0} and manipulate the nth power function to conclude the proof. ∎

14.32 Example (Smooth models). If the model (𝒳, 𝒜, P_θ: θ ∈ Θ) is differentiable in quadratic mean at θ_0, then H²(P_θ, P_{θ_0}) = O(|θ − θ_0|²). The intermediate rate of convergence (case (iii)) is √n. □
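The multiplicativity of the affinity, and hence the product formula H²(Pⁿ, Qⁿ) = 2 − 2(1 − ½H²(P, Q))ⁿ, is easy to confirm on a finite sample space, where the product measure can be built explicitly. A sketch (the two distributions and the sample size are arbitrary):

```python
import math
from itertools import product

def affinity(p, q):
    keys = set(p) | set(q)
    return sum(math.sqrt(p.get(k, 0.0) * q.get(k, 0.0)) for k in keys)

def hellinger_sq(p, q):
    keys = set(p) | set(q)
    return sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
               for k in keys)

def product_measure(p, n):
    # distribution of an i.i.d. sample of size n, as a dict on tuples
    out = {}
    for tup in product(sorted(p), repeat=n):
        val = 1.0
        for k in tup:
            val *= p[k]
        out[tup] = val
    return out

p = {"a": 0.6, "b": 0.4}
q = {"a": 0.3, "b": 0.7}
n = 4
h2 = hellinger_sq(p, q)
h2n = hellinger_sq(product_measure(p, n), product_measure(q, n))
print(h2n, 2 - 2 * (1 - h2 / 2) ** n)
```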
14.33 Example (Uniform law). If P_θ is the uniform measure on [0, θ], then H²(P_θ, P_{θ_0}) = O(|θ − θ_0|). The intermediate rate of convergence is n. In this case we would study asymptotic power functions defined as the limits of local power functions of the form h ↦ π_n(θ_0 + h/n). For instance, the level α tests that reject the null hypothesis H_0: θ = θ_0 for large values of the maximum X_(n) of the observations have power functions

π_n(θ_0 + h/n) = P_{θ_0+h/n}( X_(n) ≥ θ_0(1 − α)^{1/n} ) → 1 − (1 − α)e^{−h/θ_0}.

Relative to this rescaling rate, the level α tests that reject the null hypothesis for large values of the mean X̄_n have asymptotic power function α (no power). □
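The convergence of the power functions in the preceding display can be checked exactly, since the finite-n power has a closed form. A sketch (θ_0 = 1, α = 0.05, and h = 2 are arbitrary choices):

```python
import math

def power(n, h, theta0=1.0, alpha=0.05):
    # pi_n(theta0 + h/n) for the test rejecting when X_(n) >= theta0*(1-alpha)^(1/n)
    c = theta0 * (1 - alpha) ** (1.0 / n)
    theta = theta0 + h / n
    return 1.0 - (c / theta) ** n if c <= theta else 0.0

h = 2.0
limit = 1.0 - (1 - 0.05) * math.exp(-h / 1.0)
print(power(10**6, h), limit)
```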
14.34 Example (Triangular law). Let P_θ be the probability distribution with density x ↦ (1 − |x − θ|)⁺ on the real line. Some clever integrations show that H²(P_θ, P_0) = ½θ² log(1/θ) + O(θ²) as θ ↓ 0. (It appears easiest to compute the affinity first.) This leads to the intermediate rate of convergence √(n log n). □
The preceding lemmas concern testing a given simple null hypothesis against a simple alternative hypothesis. In many cases the rate obtained from considering simple hypotheses does not depend on the hypotheses and is also globally attainable at every parameter in the parameter space. If not, then the global problems have to be taken into account from the beginning. One possibility is discussed within the context of density estimation in Section 24.3.
Lemma 14.31 gives rescaling rates for problems with independent observations. In models with dependent observations quite different rates may pertain.

14.35 Example (Branching). Consider the Galton–Watson branching process, discussed in Example 9.10. If the offspring distribution has mean μ(θ) larger than 1, then the parameter is estimable at the exponential rate μ(θ)ⁿ. This is also the right rescaling rate for defining asymptotic power functions. □
Notes

Apparently, E.J.G. Pitman introduced the efficiencies that are named for him in an unpublished set of lecture notes in 1949. A published proof of a slightly more general result can be found in [109]. Cramér [26] was interested in more precise approximations to probabilities of large deviations than are presented in this chapter and obtained the theorem under the condition that the moment-generating function is finite on ℝ. Chernoff [20] proved the theorem as presented here, by a different argument. Chernoff used it to study the minimum weighted sums of error probabilities of tests that reject for large values of a mean, and showed that, for any 0 < π < 1,

(1/n) log inf_t ( π P_0(Ȳ_n > t) + (1 − π) P_1(Ȳ_n ≤ t) ) → inf_{E_0Y_1 < t < E_1Y_1} [ inf_u ( K_0(u) − ut ) ∨ inf_u ( K_1(u) − ut ) ].

15.4 Theorem. Let the sequence of experiments (P_{n,θ}: θ ∈ Θ), Θ ⊂ ℝᵏ, be locally asymptotically normal at θ_0 with nonsingular Fisher information I_{θ_0} and norming rate r_n, and let the function ψ be differentiable at θ_0 with ψ(θ_0) = 0. Then the power functions π_n of any sequence of tests of asymptotic level α for H_0: ψ(θ) ≤ 0 satisfy, for every h such that ψ̇_{θ_0}h > 0,

lim sup_{n→∞} π_n(θ_0 + h/r_n) ≤ 1 − Φ( z_α − ψ̇_{θ_0}h / ( ψ̇_{θ_0} I_{θ_0}^{-1} ψ̇_{θ_0}ᵀ )^{1/2} ).

15.5 Addendum. The upper bound is attained, for every h, by the sequence of tests that reject H_0: ψ(θ) ≤ 0 for large values of

T_n = ψ̇_{θ_0} I_{θ_0}^{-1} Δ_{n,θ_0} / ( ψ̇_{θ_0} I_{θ_0}^{-1} ψ̇_{θ_0}ᵀ )^{1/2}.
Proofs. Fix h_1 such that ψ̇_{θ_0}h_1 > 0, and a subsequence of {n} along which the lim sup π_n(θ_0 + h_1/r_n) is taken. There exists a further subsequence along which π_n(θ_0 + r_n^{-1}h) converges to a limit π(h) for every h ∈ ℝᵏ (see the proof of Theorem 15.1). The function h ↦ π(h) is a power function in the Gaussian limit experiment. For ψ̇_{θ_0}h < 0, we have ψ(θ_0 + r_n^{-1}h) = r_n^{-1}( ψ̇_{θ_0}h + o(1) ) < 0 eventually, whence π(h) = lim π_n(θ_0 + r_n^{-1}h) ≤ α. By continuity, the inequality π(h) ≤ α extends to all h such that ψ̇_{θ_0}h ≤ 0. Thus, π is of level α for testing H_0: ψ̇_{θ_0}h ≤ 0 versus H_1: ψ̇_{θ_0}h > 0. Its power function is bounded above by the power function of the uniformly most powerful test, which is given by Proposition 15.2. This concludes the proof of the theorem.

The asymptotic optimality of the sequence T_n follows by contiguity arguments. We start by noting that the sequence (Δ_{n,θ_0}, Δ_{n,θ_0}) converges under θ_0 in distribution to a (degenerate) normal vector (Δ, Δ). By Slutsky's lemma and local asymptotic normality, the sequences T_n and the local log likelihood ratios converge jointly in distribution under θ_0. By Le Cam's third lemma, the sequence Δ_{n,θ_0} converges in distribution under θ_0 + r_n^{-1}h to a N(I_{θ_0}h, I_{θ_0})-distribution. Thus, the sequence T_n converges under θ_0 + r_n^{-1}h in distribution to a normal distribution with mean ψ̇_{θ_0}h/( ψ̇_{θ_0} I_{θ_0}^{-1} ψ̇_{θ_0}ᵀ )^{1/2} and variance 1. ∎
Efficiency of Tests
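The asymptotic power bound of Theorem 15.4 is straightforward to evaluate numerically. The sketch below is an illustration, not part of the text: it computes $1-\Phi(z_\alpha - \dot\psi_{\theta_0}h/(\dot\psi_{\theta_0}I_{\theta_0}^{-1}\dot\psi_{\theta_0}^T)^{1/2})$ for scalar inputs, with the level-0.05 critical value $z_\alpha\approx 1.6449$ hard-coded as an assumption.

```python
import math

def Phi(x):
    # standard normal distribution function, written via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_bound(psi_dot, info_inv, h, z_alpha=1.6449):
    # upper envelope of Theorem 15.4 for a scalar parameter:
    # 1 - Phi(z_alpha - psi_dot * h / sqrt(psi_dot * info_inv * psi_dot))
    shift = psi_dot * h / math.sqrt(psi_dot * info_inv * psi_dot)
    return 1.0 - Phi(z_alpha - shift)

# at h = 0 the bound reduces to the level alpha
print(power_bound(psi_dot=1.0, info_inv=1.0, h=0.0))  # ≈ 0.05
```

At the boundary $h=0$ the envelope equals the level $\alpha$, and it increases monotonically in the drift $\dot\psi_{\theta_0}h$.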
The point $\theta_0$ in the preceding theorem is on the boundary of the null and the alternative hypotheses. If the dimension $k$ is larger than 1, then this boundary is typically $(k-1)$-dimensional, and there are many possible values for $\theta_0$. The upper bound is valid at every possible choice. If $k = 1$, the boundary point $\theta_0$ is typically unique and hence known, and we could use $T_n = I_{\theta_0}^{-1/2}\Delta_{n,\theta_0}$ to construct an optimal sequence of tests for the problem $H_0:\theta = \theta_0$. These are known as score tests. Another possibility is to base a test on an estimator sequence. Not surprisingly, efficient estimators yield efficient tests.
15.6 Example (Wald tests). Let $X_1,\ldots,X_n$ be a random sample in an experiment $(P_\theta:\theta\in\Theta)$ that is differentiable in quadratic mean with nonsingular Fisher information. Then the sequence of local experiments $(P^n_{\theta+h/\sqrt n}: h\in\mathbb{R}^k)$ is locally asymptotically normal with $r_n = \sqrt n$, $I_\theta$ the Fisher information matrix, and

$$\Delta_{n,\theta} = \frac{1}{\sqrt n}\sum_{i=1}^n \dot\ell_\theta(X_i).$$
A sequence of estimators $\hat\theta_n$ is asymptotically efficient for estimating $\theta$ if (see Chapter 8)

$$\sqrt n(\hat\theta_n - \theta) = \frac{1}{\sqrt n}\sum_{i=1}^n I_\theta^{-1}\dot\ell_\theta(X_i) + o_{P_\theta}(1).$$
Under regularity conditions, the maximum likelihood estimator qualifies. Suppose that $\theta\mapsto I_\theta$ is continuous, and that $\psi$ is continuously differentiable with nonzero gradient. Then the sequence of tests that reject $H_0:\psi(\theta)\le 0$ if

$$\frac{\sqrt n\,\psi(\hat\theta_n)}{\big(\dot\psi_{\hat\theta_n}I_{\hat\theta_n}^{-1}\dot\psi_{\hat\theta_n}^T\big)^{1/2}} \ge z_\alpha$$
is asymptotically optimal at every point $\theta_0$ on the boundary of $H_0$. Furthermore, this sequence of tests is consistent at every $\theta$ with $\psi(\theta) > 0$. These assertions follow from the preceding theorem, upon using the delta method and Slutsky's lemma. The resulting tests are called Wald tests if $\hat\theta_n$ is the maximum likelihood estimator. □
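As a concrete, hypothetical instance of the Wald test (an illustration, not from the text), take a Bernoulli($\theta$) sample and $\psi(\theta) = \theta - 1/2$, so that $H_0:\theta\le 1/2$. The maximum likelihood estimator is the sample mean, $I_\theta^{-1} = \theta(1-\theta)$ and $\dot\psi = 1$, so the test statistic reduces to $\sqrt n(\hat\theta_n - 1/2)/\sqrt{\hat\theta_n(1-\hat\theta_n)}$.

```python
import math

def wald_statistic(successes, n, theta0=0.5):
    # sqrt(n) * psi(theta_hat) / sqrt(psi_dot I^{-1} psi_dot) for psi(theta) = theta - theta0;
    # for the Bernoulli model, I_theta^{-1} = theta (1 - theta) and psi_dot = 1
    theta_hat = successes / n
    return math.sqrt(n) * (theta_hat - theta0) / math.sqrt(theta_hat * (1 - theta_hat))

# 60 successes in 100 trials: theta_hat = 0.6
t = wald_statistic(60, 100)
print(round(t, 4))  # ≈ 2.0412, which exceeds z_0.05 ≈ 1.6449, so H_0 is rejected
```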
15.4 One-Sample Location
Let $X_1,\ldots,X_n$ be a sample from a density $f(x-\theta)$, where $f$ is symmetric about zero and has finite Fisher information for location $I_f$. It is required to test $H_0:\theta = 0$ versus $H_1:\theta > 0$. The density $f$ may be known or (partially) unknown. For instance, it may be known to belong to the normal scale family. For fixed $f$, the sequence of experiments $\big(\prod_{i=1}^n f(x_i-\theta):\theta\in\mathbb{R}\big)$ is locally asymptotically normal at $\theta = 0$ with $\Delta_{n,0} = -n^{-1/2}\sum_{i=1}^n (f'/f)(X_i)$, norming rate $\sqrt n$, and Fisher information $I_f$. By the results of the preceding section, the best asymptotic level $\alpha$ power function (for known $f$) is

$$1 - \Phi\big(z_\alpha - h\sqrt{I_f}\big).$$
This function is an upper bound for $\limsup\pi_n(h/\sqrt n)$, for every $h > 0$, for every sequence of level $\alpha$ power functions. Suppose that $T_n$ are statistics with

$$T_n = \frac{1}{\sqrt n}\sum_{i=1}^n\frac{-(f'/f)(X_i)}{\sqrt{I_f}} + o_{P_0}(1). \qquad (15.7)$$

Then, according to the second assertion of Theorem 15.4, the sequence of tests that reject the null hypothesis if $T_n\ge z_\alpha$ attains the bound and hence is asymptotically optimal. We shall discuss several ways of constructing test statistics with this property. If the shape of the distribution is completely known, then the test statistics $T_n$ can simply be taken equal to the right side of (15.7), without the remainder term, and we obtain the score test. It is more realistic to assume that the underlying distribution is only known up to scale. If the underlying density takes the form $f(x) = f_0(x/\sigma)/\sigma$ for a known density $f_0$ that is symmetric about zero, but for an unknown scale parameter $\sigma$, then

$$\frac{f'}{f}(x) = \frac{1}{\sigma}\,\frac{f_0'}{f_0}\Big(\frac{x}{\sigma}\Big).$$
15.8 Example (t-test). The standard normal density possesses score function $-(f_0'/f_0)(x) = x$ and Fisher information $I_{f_0} = 1$. Consequently, if the underlying distribution is normal, then the optimal test statistics should satisfy $T_n = \sqrt n\,\overline X_n/\sigma + o_{P_0}(1)$. The $t$-statistics $\sqrt n\,\overline X_n/S_n$ fulfill this requirement. This is not surprising, because in the case of normally distributed observations the $t$-test is uniformly most powerful for every finite $n$ and hence is certainly asymptotically optimal. □
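In code, the $t$-statistic of the preceding example is a one-liner (a small illustration, not from the text):

```python
import math

def t_statistic(xs):
    # sqrt(n) * sample mean / sample standard deviation (the t-statistic of Example 15.8)
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # S_n^2 with divisor n - 1
    return math.sqrt(n) * mean / math.sqrt(var)

print(t_statistic([1.0, 2.0, 3.0]))  # mean 2, S_n = 1, so the value is 2 * sqrt(3) ≈ 3.4641
```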
The $t$-statistic in the preceding example simply replaces the unknown standard deviation $\sigma$ by an estimate. This approach can be followed for most scale families. Under some regularity conditions, the statistics

$$T_n = \frac{1}{\sqrt{nI_{f_0}}}\sum_{i=1}^n -\frac{f_0'}{f_0}\Big(\frac{X_i}{\hat\sigma_n}\Big)$$

should yield asymptotically optimal tests, given a consistent sequence of scale estimators $\hat\sigma_n$. Rather than using score-type tests, we could use a test based on an efficient estimator for the unknown symmetry point and efficient estimators for possible nuisance parameters, such as the scale, for instance the maximum likelihood estimators. This method is indicated in general in Example 15.6 and leads to the Wald test. Perhaps the most attractive approach is to use signed rank statistics. We summarize some definitions and conclusions from Chapter 13. Let $R_{n1}^+,\ldots,R_{nn}^+$ be the ranks of the absolute values $|X_1|,\ldots,|X_n|$ in the ordered sample of absolute values. A linear signed rank statistic takes the form

$$T_n = \frac{1}{\sqrt n}\sum_{i=1}^n a_{nR_{ni}^+}\operatorname{sign}(X_i)$$

for given numbers $a_{n1},\ldots,a_{nn}$, which are called the scores of the statistic. Particular examples are the Wilcoxon signed rank statistic, which has scores $a_{ni} = i$, and the sign statistic, which corresponds to scores $a_{ni} = 1$. In general, the scores can be chosen to weigh
the influence of the different observations. A convenient method of generating scores is through a fixed function $\phi:[0,1]\mapsto\mathbb{R}$, by $a_{ni} = \mathrm{E}\,\phi(U_{n(i)})$. (Here $U_{n(1)},\ldots,U_{n(n)}$ are the order statistics of a random sample of size $n$ from the uniform distribution on $[0,1]$.) Under the condition that $\int_0^1\phi^2(u)\,du < \infty$, Theorem 13.18 shows that, under the null hypothesis, and with $F^+(x) = 2F(x)-1$ denoting the distribution function of $|X_1|$,

$$T_n = \frac{1}{\sqrt n}\sum_{i=1}^n \phi\big(F^+(|X_i|)\big)\operatorname{sign}(X_i) + o_P(1).$$
Because the score-generating function $\phi$ can be chosen freely, this allows the construction of an asymptotically optimal rank statistic for any given shape $f$. The choice

$$\phi(u) = -\frac{1}{\sqrt{I_f}}\,\frac{f'}{f}\big((F^+)^{-1}(u)\big) \qquad (15.9)$$

yields the locally most powerful scores, as discussed in Chapter 13. Because $(f'/f)(|x|)\operatorname{sign}(x) = (f'/f)(x)$ by the symmetry of $f$, it follows that the signed rank statistics $T_n$ satisfy (15.7). Thus, the locally most powerful scores yield asymptotically optimal signed rank tests. This surprising result, that the class of signed rank statistics contains asymptotically efficient tests for every given (symmetric) shape of the underlying distribution, is sometimes expressed by saying that the signs and absolute ranks are "asymptotically sufficient" for testing the location of a symmetry point.
15.10 Corollary. Let $T_n$ be the simple linear signed rank statistic with scores $a_{ni} = \mathrm{E}\,\phi(U_{n(i)})$ generated by the function $\phi$ defined in (15.9). Then $T_n$ satisfies (15.7), and hence the sequence of tests that reject $H_0:\theta = 0$ if $T_n\ge z_\alpha$ is asymptotically optimal at $\theta = 0$.
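A linear signed rank statistic is easy to compute directly. The sketch below is illustrative (not from the text) and uses the normalization $n^{-1/2}$ from the display above; it evaluates both the Wilcoxon scores $a_{ni} = i$ and the sign scores $a_{ni} = 1$ on a small fixed sample.

```python
import math

def linear_signed_rank(xs, scores):
    # n^{-1/2} * sum_i a_{n, R_i^+} sign(X_i), with R_i^+ the rank of |X_i|
    n = len(xs)
    order = sorted(range(n), key=lambda i: abs(xs[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    sgn = lambda x: (x > 0) - (x < 0)
    return sum(scores[ranks[i] - 1] * sgn(xs[i]) for i in range(n)) / math.sqrt(n)

xs = [0.5, -1.2, 2.0, 0.1]
wilcoxon = linear_signed_rank(xs, scores=[1, 2, 3, 4])   # Wilcoxon scores a_{ni} = i
sign_stat = linear_signed_rank(xs, scores=[1, 1, 1, 1])  # sign scores a_{ni} = 1
print(wilcoxon, sign_stat)  # 2.0 and 1.0 for this sample
```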
Signed rank statistics were originally constructed because of their attractive property of being distribution free under the null hypothesis. Apparently, this can be achieved without losing (asymptotic) power. Thus, rank tests are strong competitors of classical parametric tests. Note also that signed rank statistics automatically adapt to the unknown scale: Even though the definition of the optimal scores appears to depend on $f$, they are actually identical for every member of a scale family $f(x) = f_0(x/\sigma)/\sigma$ (since $(F^+)^{-1}(u) = \sigma(F_0^+)^{-1}(u)$). Thus, no auxiliary estimate for $\sigma$ is necessary for their definition.
15.11 Example (Laplace). The sign statistic $T_n = n^{-1/2}\sum_{i=1}^n\operatorname{sign}(X_i)$ satisfies (15.7) for $f$ equal to the Laplace density. Thus the sign test is asymptotically optimal for testing location in the Laplace scale family. □
The standard normal density has score function for location $-(f_0'/f_0)(x) = x$ and Fisher information $I_{f_0} = 1$. The optimal signed rank statistic for the normal scale family has score-generating function $\phi(u) = \Phi^{-1}\big((u+1)/2\big)$, the normal scores.
$$\limsup_{m,n\to\infty}\pi_{m,n}\Big(\mu+\frac{h}{\sqrt N},\ \mu+\frac{g}{\sqrt N}\Big) \le \cdots$$
$(I_\vartheta^{-1})_{>l,>l}$ is the asymptotic covariance matrix of the sequence $\sqrt n\,\hat\vartheta_{n,>l}$, whence we obtain an asymptotic chi-square distribution with $k-l$ degrees of freedom, by the same argument as before. We close this section by relating the likelihood ratio statistic to two other test statistics. Under the simple null hypothesis $\Theta_0 = \{\theta_0\}$, the likelihood ratio statistic is asymptotically equivalent to both the maximum likelihood statistic (or Wald statistic) and the score statistic. These are given by

$$n(\hat\theta_n-\theta_0)^T I_{\hat\theta_n}(\hat\theta_n-\theta_0) \qquad\text{and}\qquad \frac1n\Big(\sum_{i=1}^n\dot\ell_{\theta_0}(X_i)\Big)^T I_{\theta_0}^{-1}\Big(\sum_{i=1}^n\dot\ell_{\theta_0}(X_i)\Big).$$

The Wald statistic is a natural statistic, but it is often criticized for necessarily yielding ellipsoidal confidence sets, even if the data are not symmetric. The score statistic has the advantage that calculation of the supremum of the likelihood is unnecessary, but it appears to perform less well for smaller values of $n$. In the case of a composite hypothesis, a Wald statistic is given in (16.4), and a score statistic can be obtained by substituting the approximation $\sqrt n\,\hat\vartheta_{n,>l}\approx(I_\vartheta^{-1})_{>l,>l}\,n^{-1/2}\sum_i\dot\ell_{\vartheta,>l}(X_i)$ in (16.4). (This approximation is obtainable from linearizing $\sum(\dot\ell_{\hat\vartheta_n}-\dot\ell_\vartheta)$.) In both cases we also replace the unknown parameter $\vartheta$ by an estimator. □
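For a simple null hypothesis the three statistics can be compared numerically. The following sketch (an illustration with a Bernoulli model, not from the text) computes the likelihood ratio, Wald, and score statistics for $H_0:\theta = 1/2$ and shows that they nearly coincide at moderate $n$.

```python
import math

def lr_wald_score(m, n, theta0):
    # Bernoulli(theta): score l-dot = (x - theta)/(theta(1-theta)), I_theta = 1/(theta(1-theta))
    th = m / n  # maximum likelihood estimator
    lr = 2 * (m * math.log(th / theta0) + (n - m) * math.log((1 - th) / (1 - theta0)))
    wald = n * (th - theta0) ** 2 / (th * (1 - th))              # n (th - theta0)^2 I_th
    score = (m - n * theta0) ** 2 / (n * theta0 * (1 - theta0))  # (1/n)(sum l-dot)^2 I_theta0^{-1}
    return lr, wald, score

# 55 successes in 100 trials, null value theta0 = 0.5
lr, wald, score = lr_wald_score(55, 100, 0.5)
print(round(lr, 3), round(wald, 3), round(score, 3))  # all close to 1
```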
16.3 Using Local Asymptotic Normality
An insightful derivation of the asymptotic distribution of the likelihood ratio statistic is based on convergence of experiments. This approach is possible for general experiments, but this section is restricted to the case of local asymptotic normality. The approach applies also in the case that the (local) parameter spaces are not linear. Introducing the local parameter spaces $H_n = \sqrt n(\Theta-\vartheta)$ and $H_{n,0} = \sqrt n(\Theta_0-\vartheta)$, we can write the likelihood ratio statistic in the form

$$\Lambda_n = 2\sup_{h\in H_n}\log\prod_{i=1}^n\frac{p_{\vartheta+h/\sqrt n}}{p_\vartheta}(X_i) - 2\sup_{h\in H_{n,0}}\log\prod_{i=1}^n\frac{p_{\vartheta+h/\sqrt n}}{p_\vartheta}(X_i).$$
Likelihood Ratio Tests
In Chapter 7 it is seen that, for large $n$, the rescaled likelihood ratio process in this display is similar to the likelihood ratio process of the normal experiment $\big(N(h, I_\vartheta^{-1}): h\in\mathbb{R}^k\big)$. This suggests that, if the sets $H_n$ and $H_{n,0}$ converge in a suitable sense to sets $H$ and $H_0$, the sequence $\Lambda_n$ converges in distribution to the random variable $\Lambda$ obtained by substituting the normal likelihood ratios, given by

$$\Lambda = 2\sup_{h\in H}\log\frac{dN(h, I_\vartheta^{-1})}{dN(0, I_\vartheta^{-1})}(X) - 2\sup_{h\in H_0}\log\frac{dN(h, I_\vartheta^{-1})}{dN(0, I_\vartheta^{-1})}(X).$$
This is exactly the likelihood ratio statistic for testing the null hypothesis $H_0: h\in H_0$ versus the alternative $H_1: h\in H - H_0$ based on the observation $X$ in the normal experiment. Because the latter experiment is simple, this heuristic is useful not only to derive the asymptotic distribution of the sequence $\Lambda_n$, but also to understand the asymptotic quality of the corresponding sequence of tests. The likelihood ratio statistic for the normal experiment is

$$\Lambda = \inf_{h\in H_0}(X-h)^T I_\vartheta(X-h) - \inf_{h\in H}(X-h)^T I_\vartheta(X-h) = \big\|I_\vartheta^{1/2}X - I_\vartheta^{1/2}H_0\big\|^2 - \big\|I_\vartheta^{1/2}X - I_\vartheta^{1/2}H\big\|^2. \qquad (16.5)$$
The distribution of the sequence $\Lambda_n$ under $\vartheta$ corresponds to the distribution of $\Lambda$ under $h = 0$. Under $h = 0$ the vector $I_\vartheta^{1/2}X$ possesses a standard normal distribution. The following lemma shows that the squared distance of a standard normal variable to a linear subspace is chi square-distributed and hence explains the chi-square limit when $H_0$ is a linear space.
16.6 Lemma. Let $X$ be a $k$-dimensional random vector with a standard normal distribution and let $H_0$ be an $l$-dimensional linear subspace of $\mathbb{R}^k$. Then $\|X - H_0\|^2$ is chi square-distributed with $k-l$ degrees of freedom.
Proof. Take an orthonormal base of $\mathbb{R}^k$ such that the first $l$ elements span $H_0$. By Pythagoras' theorem, the squared distance of a vector $z$ to the space $H_0$ equals the sum of squares $\sum_{i>l}z_i^2$ of its last $k-l$ coordinates with respect to this basis. A change of base corresponds to an orthogonal transformation of the coordinates. Because the standard normal distribution is invariant under orthogonal transformations, the coordinates of $X$ with respect to any orthonormal base are independent standard normal variables. Thus $\|X - H_0\|^2 = \sum_{i>l}X_i^2$ is chi square-distributed. ∎
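The geometric step in this proof can be checked in coordinates: if $H_0$ is spanned by the first $l$ basis vectors, the squared distance to $H_0$ is just the sum of squares of the remaining coordinates. A small illustrative sketch (not from the text):

```python
def sq_dist_to_coordinate_subspace(x, l):
    # squared distance of x to the span of the first l coordinate vectors:
    # the projection keeps the first l coordinates, so the residual is the remaining k - l
    return sum(c * c for c in x[l:])

# z = (1, 2, 3), H_0 = span(e_1): residual (0, 2, 3), squared distance 4 + 9 = 13
print(sq_dist_to_coordinate_subspace([1.0, 2.0, 3.0], 1))
```

For standard normal coordinates, the $k-l$ squared residual coordinates are independent $N(0,1)^2$ terms, which is exactly the chi-square distribution with $k-l$ degrees of freedom.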
If $\vartheta$ is an inner point of $\Theta$, then the set $H$ is the full space $\mathbb{R}^k$ and the second term on the right of (16.5) is zero. Thus, if the local null parameter spaces $\sqrt n(\Theta_0-\vartheta)$ converge to a linear subspace of dimension $l$, then the asymptotic null distribution of the likelihood ratio statistic is chi-square with $k-l$ degrees of freedom. The following theorem makes the preceding informal derivation rigorous under the same mild conditions employed to obtain the asymptotic normality of the maximum likelihood estimator in Chapter 5. It uses the following notion of convergence of sets. Write $H_n\to H$ if $H$ is the set of all limits $\lim h_n$ of converging sequences $h_n$ with $h_n\in H_n$ for every $n$ and, moreover, the limit $h = \lim_i h_{n_i}$ of every converging sequence $h_{n_i}$ with $h_{n_i}\in H_{n_i}$ for every $i$ is contained in $H$.
16.7 Theorem. Let the model $(P_\theta:\theta\in\Theta)$ be differentiable in quadratic mean at $\vartheta$ with nonsingular Fisher information matrix, and suppose that for every $\theta_1$ and $\theta_2$ in a neighborhood of $\vartheta$ and for a measurable function $\dot\ell$ such that $P_\vartheta\dot\ell^2 < \infty$,

$$\big|\log p_{\theta_1}(x) - \log p_{\theta_2}(x)\big| \le \dot\ell(x)\,\|\theta_1-\theta_2\|.$$
If the maximum likelihood estimators $\hat\vartheta_{n,0}$ and $\hat\vartheta_n$ are consistent under $\vartheta$ and the sets $H_{n,0}$ and $H_n$ converge to sets $H_0$ and $H$, then the sequence of likelihood ratio statistics $\Lambda_n$ converges under $\vartheta+h/\sqrt n$ in distribution to $\Lambda$ given in (16.5), for $X$ normally distributed with mean $h$ and covariance matrix $I_\vartheta^{-1}$.

*Proof. Let $\mathbb{G}_n = \sqrt n(\mathbb{P}_n - P_\vartheta)$ be the empirical process, and define stochastic processes $Z_n$ by

$$Z_n(h) = n\mathbb{P}_n\log\frac{p_{\vartheta+h/\sqrt n}}{p_\vartheta} - h^T\mathbb{G}_n\dot\ell_\vartheta + \tfrac12 h^T I_\vartheta h.$$
The differentiability of the model implies that $Z_n(h)\xrightarrow{P}0$ for every $h$. In the proof of Theorem 7.12 this is strengthened to the uniform convergence

$$\sup_{\|h\|\le M}\big|Z_n(h)\big| \xrightarrow{P} 0, \qquad\text{every } M.$$

Furthermore, it follows from this proof that both $\hat\vartheta_{n,0}$ and $\hat\vartheta_n$ are $\sqrt n$-consistent under $\vartheta$. (These statements can also be proved by elementary arguments, but under stronger regularity conditions.) The preceding display is also valid for every sequence $M_n$ that increases to $\infty$ sufficiently slowly. Fix such a sequence. By the $\sqrt n$-consistency, the estimators $\hat\vartheta_{n,0}$ and $\hat\vartheta_n$ are contained in the ball of radius $M_n/\sqrt n$ around $\vartheta$ with probability tending to 1. Thus, the limit distribution of $\Lambda_n$ does not change if we replace the sets $H_n$ and $H_{n,0}$ in its definition by the sets $H_n\cap\mathrm{ball}(0, M_n)$ and $H_{n,0}\cap\mathrm{ball}(0, M_n)$. These "truncated" sequences of sets still converge to $H$ and $H_0$, respectively. Now, by the uniform convergence to zero of the processes $Z_n(h)$ on $H_n$ and $H_{n,0}$, and simple algebra,

$$\Lambda_n = 2\sup_{h\in H_n}\big(h^T\mathbb{G}_n\dot\ell_\vartheta - \tfrac12 h^T I_\vartheta h\big) - 2\sup_{h\in H_{n,0}}\big(h^T\mathbb{G}_n\dot\ell_\vartheta - \tfrac12 h^T I_\vartheta h\big) + o_P(1)$$

by Lemma 7.13(ii) and (iii). The theorem follows by the continuous-mapping theorem. ∎
16.8 Example (Generalized linear models). In a generalized linear model a typical observation $(X, Y)$, consisting of a "covariate vector" $X$ and a "response" $Y$, possesses a density of the form $\cdots$ (It may be more natural to model the covariates as (observed) constants, but to fit the model into our i.i.d. setup, we consider them to be a random sample from a density $p_X$.) Thus, given
X, the variable $Y$ follows an exponential family density.

$\cdots\ge Q\log(q/p)$, the proof of the first assertion is complete. To prove that the likelihood ratio statistic attains equality, it suffices to prove that its slope is bigger than the upper bound. Write $\Lambda_n$ for the log likelihood ratio statistic, and write $\sup_P$ and $\sup_Q$ for suprema over the null and alternative hypotheses. Because $(1/n)\Lambda_n$ is bounded above by $\sup_Q\mathbb{P}_n\log(q/p)$, we have, by Markov's inequality,

$$P_P\Big(\frac1n\Lambda_n\ge 2t\Big) \le P_P\Big(\prod_{i=1}^n\sqrt{\frac qp}(X_i)\ge e^{nt}\Big) \le e^{-nt}\,\mathrm{E}_P\prod_{i=1}^n\sqrt{\frac qp}(X_i).$$

The expectation on the right side is the $n$th power of the integral $\int\sqrt{q/p}\,dP = \int_{p>0}\sqrt{pq}\,d\mu\le 1$. Take logarithms left and right and multiply with $-(2/n)$ to find that

$$-\frac2n\log P_P\Big(\frac1n\Lambda_n\ge 2t\Big) \ge 2t.$$

Because this is valid uniformly in $t$ and $P$, we can take the infimum over $P$ on the left side; next evaluate the left and right sides at $2t = (1/n)\Lambda_n$. By the law of large numbers, $\mathbb{P}_n\log(q/p)\to Q\log(q/p)$ almost surely under $Q$, and this remains valid if we first add the infimum over the (finite) set $\mathcal{P}_0$ on both sides. Thus, the limit inferior of the sequence $(1/n)\Lambda_n\ge\inf_P\mathbb{P}_n\log(q/p)$ is bounded below by $\inf_P Q\log(q/p)$ almost surely under $Q$, where we interpret $Q\log(q/p)$ as $\infty$ if $Q(p = 0) > 0$. Insert this lower bound in the preceding display to conclude that the Bahadur slope of the likelihood ratio statistics is bounded below by $2\inf_P Q\log(q/p)$. ∎
Notes
The classical references on the asymptotic null distribution of likelihood ratio statistics are papers by Chernoff [21] and Wilks [150]. Our main theorem appears to be better than Chernoff's, who uses the "classical regularity conditions" and a different notion of approximation of sets, but is not essentially different. Wilks' treatment would not be acceptable to present-day referees but maybe is not so different either. He appears to be saying that we can replace the original likelihood by the likelihood for having observed only the maximum likelihood estimator (the error is asymptotically negligible), next refers to work by Doob to infer that this is a Gaussian likelihood, and continues to compute the likelihood ratio statistic for a Gaussian likelihood, which is easy, as we have seen. The approach using a Taylor expansion and the asymptotic distributions of both likelihood estimators is one way to make the argument rigorous, but it seems to hide the original intuition. Bahadur [3] presented the efficiency of the likelihood ratio statistic at the fifth Berkeley symposium. Kallenberg [84] shows that the likelihood ratio statistic remains asymptotically optimal in the setting in which both the desired level and the alternative tend to zero, at least in exponential families. As the proof of Theorem 16.12 shows, the composite nature of the alternative hypothesis "disappears" elegantly by taking $(1/n)\log$ of the error probabilities; too elegantly, perhaps, to attach much value to this type of optimality?
Problems
1. Let $(X_1, Y_1),\ldots,(X_n, Y_n)$ be a sample from the bivariate normal distribution with mean vector $(\mu,\nu)$ and covariance matrix the diagonal matrix with entries $\sigma^2$ and $\tau^2$. Calculate (or characterize) the likelihood ratio statistic for testing $H_0:\mu = \nu$ versus $H_1:\mu\ne\nu$.
2. Let $N$ be a $kr$-dimensional multinomial variable written as a $(k\times r)$ matrix $(N_{ij})$. Calculate the likelihood ratio statistic for testing the null hypothesis of independence $H_0: p_{ij} = p_{i\cdot}p_{\cdot j}$ for every $i$ and $j$. Here the dot denotes summation over all columns and rows, respectively. What is the limit distribution under the null hypothesis?
3. Calculate the likelihood ratio statistic for testing $H_0:\mu = \nu$ based on independent samples of size $n$ from multivariate normal distributions $N_r(\mu,\Sigma)$ and $N_r(\nu,\Sigma)$. The matrix $\Sigma$ is unknown. What is the limit distribution under the null hypothesis?
4. Calculate the likelihood ratio statistic for testing $H_0:\mu_1 = \cdots = \mu_k$ based on $k$ independent samples of size $n$ from $N(\mu_i,\sigma^2)$-distributions. What is the asymptotic distribution under the null hypothesis?
5. Show that $(I_\vartheta^{-1})_{>l,>l}$ is the inverse of the matrix $I_{\vartheta,>l,>l} - I_{\vartheta,>l,\le l}\,I_{\vartheta,\le l,\le l}^{-1}\,I_{\vartheta,\le l,>l}$.

6. Study the asymptotic distribution of the sequence $\Lambda_n$ if the true parameter is contained in both the null and alternative hypotheses.
7. Study the asymptotic distribution of the likelihood ratio statistics for testing the hypothesis $H_0:\sigma = -\tau$ based on a sample of size $n$ from the uniform distribution on $[\sigma,\tau]$. Does the asymptotic distribution correspond to a likelihood ratio statistic in a limit experiment?
17 Chi-Square Tests
The chi-square statistic for testing hypotheses concerning multinomial distributions derives its name from the asymptotic approximation to its distribution. Two important applications are the testing of independence in a two-way classification and the testing of goodness-of-fit. In the second application the multinomial distribution is created artificially by grouping the data, and the asymptotic chi-square approximation may be lost if the original data are used to estimate nuisance parameters.
17.1 Quadratic Forms in Normal Vectors
The chi-square distribution with $k$ degrees of freedom is (by definition) the distribution of $\sum_{i=1}^k Z_i^2$ for i.i.d. $N(0,1)$-distributed variables $Z_1,\ldots,Z_k$. The sum of squares is the squared norm $\|Z\|^2$ of the standard normal vector $Z = (Z_1,\ldots,Z_k)$. The following lemma gives a characterization of the distribution of the norm of a general zero-mean normal vector.
17.1 Lemma. If the vector $X$ is $N_k(0,\Sigma)$-distributed, then $\|X\|^2$ is distributed as $\sum_{i=1}^k\lambda_i Z_i^2$ for i.i.d. $N(0,1)$-distributed variables $Z_1,\ldots,Z_k$ and $\lambda_1,\ldots,\lambda_k$ the eigenvalues of $\Sigma$.
Proof. There exists an orthogonal matrix $O$ such that $O\Sigma O^T = \operatorname{diag}(\lambda_i)$. Then the vector $OX$ is $N_k(0,\operatorname{diag}(\lambda_i))$-distributed, which is the same as the distribution of the vector $(\sqrt{\lambda_1}Z_1,\ldots,\sqrt{\lambda_k}Z_k)$. Now $\|X\|^2 = \|OX\|^2$ has the same distribution as $\sum(\sqrt{\lambda_i}Z_i)^2$. ∎
The distribution of a quadratic form of the type $\sum_{i=1}^k\lambda_i Z_i^2$ is complicated in general. However, in the case that every $\lambda_i$ is either 0 or 1, it reduces to a chi-square distribution. If this is not naturally the case in an application, then a statistic is often transformed to achieve this desirable situation. The definition of the Pearson statistic illustrates this.
17.2 Pearson Statistic
Suppose that we observe a vector $X_n = (X_{n,1},\ldots,X_{n,k})$ with the multinomial distribution corresponding to $n$ trials and $k$ classes having probabilities $p = (p_1,\ldots,p_k)$. The Pearson statistic for testing the null hypothesis $H_0: p = a$ is given by

$$C_n(a) = \sum_{i=1}^k\frac{(X_{n,i}-na_i)^2}{na_i}.$$
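In code the Pearson statistic is a one-liner; the sketch below (illustrative, not from the text) evaluates $C_n(a)$ for observed counts and null probabilities $a$.

```python
def pearson_statistic(counts, a):
    # C_n(a) = sum_i (X_{n,i} - n a_i)^2 / (n a_i)
    n = sum(counts)
    return sum((x - n * ai) ** 2 / (n * ai) for x, ai in zip(counts, a))

# 100 observations over 3 equiprobable cells: expected count 100/3 per cell
print(pearson_statistic([35, 30, 35], [1/3, 1/3, 1/3]))  # ≈ 0.5
```

A value of 0.5 is unremarkable against the $\chi^2_2$ reference distribution, so these counts are consistent with $H_0$.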
We shall show that the sequence $C_n(a)$ converges in distribution to a chi-square distribution if the null hypothesis is true. The practical relevance is that we can use the chi-square table to find critical values for the test. The proof shows why Pearson divided the squares by $na_i$ and did not propose the simpler statistic $\|X_n - na\|^2$.

17.2 Theorem. If the vectors $X_n$ are multinomially distributed with parameters $n$ and $a = (a_1,\ldots,a_k) > 0$, then the sequence $C_n(a)$ converges under $a$ in distribution to the $\chi^2_{k-1}$-distribution.
Proof. The vector $X_n$ can be thought of as the sum of $n$ independent multinomial vectors $Y_1,\ldots,Y_n$ with parameters 1 and $a = (a_1,\ldots,a_k)$. Then $\operatorname{Cov}Y_1 = \operatorname{diag}(a) - aa^T$. By the multivariate central limit theorem, the sequence $n^{-1/2}(X_n-na)$ converges in distribution to the $N_k(0,\operatorname{Cov}Y_1)$-distribution. Consequently, with $\sqrt a$ the vector with coordinates $\sqrt{a_i}$,

$$\Big(\frac{X_{n,1}-na_1}{\sqrt{na_1}},\ldots,\frac{X_{n,k}-na_k}{\sqrt{na_k}}\Big) \rightsquigarrow N\big(0,\ I-\sqrt a\sqrt a^T\big).$$

Because $\sum a_i = 1$, the matrix $I-\sqrt a\sqrt a^T$ has eigenvalue 0, of multiplicity 1 (with eigenspace spanned by $\sqrt a$), and eigenvalue 1, of multiplicity $k-1$ (with eigenspace equal to the orthocomplement of $\sqrt a$). An application of the continuous-mapping theorem and next Lemma 17.1 conclude the proof. ∎
The number of degrees of freedom in the chi-squared approximation for Pearson's statistic is the number of cells of the multinomial vector that have positive probability. However, the quality of the approximation also depends on the size of the cell probabilities $a_i$. For instance, if 1001 cells have null probabilities $10^{-23},\ldots,10^{-23}, 1-10^{-20}$, then it is clear that for moderate values of $n$ all cells except one are empty, and a huge value of $n$ is necessary to make a $\chi^2_{1000}$-approximation work. As a rule of thumb, it is often advised to choose the partitioning sets such that each number $na_i$ is at least 5. This criterion depends on the (possibly unknown) null distribution and is not the same as saying that the number of observations in each cell must satisfy an absolute lower bound, which could be very unlikely if the null hypothesis is false. The rule of thumb means to protect the level. The Pearson statistic is oddly asymmetric in the observed and the true frequencies (which is motivated by the form of the asymptotic covariance matrix). One method to symmetrize
the statistic leads to the Hellinger statistic

$$4\sum_{i=1}^k\big(\sqrt{X_{n,i}}-\sqrt{na_i}\,\big)^2.$$

Up to a multiplicative constant this is the Hellinger distance between the discrete probability distributions on $\{1,\ldots,k\}$ with probability vectors $a$ and $X_n/n$, respectively. Because $X_n/n - a\xrightarrow{P}0$, the Hellinger statistic is asymptotically equivalent to the Pearson statistic.
17.3 Estimated Parameters
Chi-square tests are used quite often, but usually to test more complicated hypotheses. If the null hypothesis of interest is composite, then the parameter $a$ is unknown and cannot be used in the definition of a test statistic. A natural extension is to replace the parameter by an estimate $\hat a_n$ and use the statistic

$$C_n(\hat a_n) = \sum_{i=1}^k\frac{(X_{n,i}-n\hat a_{n,i})^2}{n\hat a_{n,i}}.$$
The estimator $\hat a_n$ is constructed to be a good estimator if the null hypothesis is true. The asymptotic distribution of this modified Pearson statistic is not necessarily chi-square but depends on the estimators $\hat a_n$ being used. Most often the estimators are asymptotically normal, and the statistics

$$\frac{X_{n,i}-n\hat a_{n,i}}{\sqrt{na_i}} = \frac{X_{n,i}-na_i}{\sqrt{na_i}} - \frac{\sqrt n(\hat a_{n,i}-a_i)}{\sqrt{a_i}}$$

are asymptotically normal as well. Then the modified chi-square statistic is asymptotically distributed as a quadratic form in a multivariate-normal vector. In general, the eigenvalues determining this form are not restricted to 0 or 1, and their values may depend on the unknown parameter. Then the critical value cannot be taken from a table of the chi-square distribution. There are two popular possibilities to avoid this problem. First, the Pearson statistic is a certain quadratic form in the observations that is motivated by the asymptotic covariance matrix of a multinomial vector. If the parameter $a$ is estimated, the asymptotic covariance matrix changes in form, and it is natural to change the quadratic form in such a way that the resulting statistic is again chi-square distributed. This idea leads to the Rao-Robson-Nikulin modification of the Pearson statistic, of which we discuss an example in section 17.5. Second, we can retain the form of the Pearson statistic but use special estimators $\hat a_n$. In particular, the maximum likelihood estimator based on the multinomial vector $X_n$, or the minimum-chi square estimator $\tilde a_n$ defined by, with $\mathcal P_0$ being the null hypothesis,

$$\sum_{i=1}^k\frac{(X_{n,i}-n\tilde a_{n,i})^2}{n\tilde a_{n,i}} = \inf_{p\in\mathcal P_0}\sum_{i=1}^k\frac{(X_{n,i}-np_i)^2}{np_i}.$$
The right side of this display is the "minimum-chi square distance" of the observed frequencies to the null hypothesis and is an intuitively reasonable test statistic. The null hypothesis
is rejected if the distance of the observed frequency vector $X_n/n$ to the set $\mathcal P_0$ is large. A disadvantage is greater computational complexity. These two modifications, using the minimum-chi square estimator or the maximum likelihood estimator based on $X_n$, may seem natural but are artificial in some applications. For instance, in goodness-of-fit testing, the multinomial vector is formed by grouping the "raw data," and it is more natural to base the estimators on the raw data rather than on the grouped data. On the other hand, using the maximum likelihood or minimum-chi square estimator based on $X_n$ has the advantage of a remarkably simple limit theory: If the null hypothesis is "locally linear," then the modified Pearson statistic is again asymptotically chi-square distributed, but with the number of degrees of freedom reduced by the (local) dimension of the estimated parameter. This interesting asymptotic result is most easily explained in terms of the minimum-chi square statistic, as the loss of degrees of freedom corresponds to a projection (i.e., a minimum distance) of the limiting normal vector. We shall first show that the two types of modifications are asymptotically equivalent and are asymptotically equivalent to the likelihood ratio statistic as well. The likelihood ratio statistic for testing the null hypothesis $H_0: p\in\mathcal P_0$ is given by (see Example 16.1)

$$L_n(p) = 2\sum_{i=1}^k X_{n,i}\log\frac{X_{n,i}}{np_i}.$$

17.3 Lemma. Let $\mathcal P_0$ be a closed subset of the unit simplex, and let $\hat a_n$ be the maximum likelihood estimator of $a$ under the null hypothesis $H_0: a\in\mathcal P_0$ (based on $X_n$). Then

$$\inf_{p\in\mathcal P_0}\sum_{i=1}^k\frac{(X_{n,i}-np_i)^2}{np_i} = C_n(\hat a_n) + o_P(1) = L_n(\hat a_n) + o_P(1).$$
Proof. Let $\tilde a_n$ be the minimum-chi square estimator of $a$ under the null hypothesis. Both sequences of estimators $\tilde a_n$ and $\hat a_n$ are $\sqrt n$-consistent. For the maximum likelihood estimator this follows from Corollary 5.53. The minimum-chi square estimator satisfies by its definition

$$\sum_{i=1}^k\frac{(X_{n,i}-n\tilde a_{n,i})^2}{n\tilde a_{n,i}} \le \sum_{i=1}^k\frac{(X_{n,i}-na_i)^2}{na_i} = O_P(1).$$

This implies that each term in the sum on the left is $O_P(1)$, whence $n|\tilde a_{n,i}-a_i|^2 = O_P(\tilde a_{n,i}) + O_P(|X_{n,i}-na_i|^2/n)$ and hence the $\sqrt n$-consistency. Next, the two-term Taylor expansion $\log(1+x) = x - \tfrac12 x^2 + o(x^2)$ combined with Lemma 2.12 yields, for any $\sqrt n$-consistent estimator sequence $\hat p_n$,

$$L_n(\hat p_n) = -2\sum_{i=1}^k X_{n,i}\log\frac{n\hat p_{n,i}}{X_{n,i}} = -2\sum_{i=1}^k X_{n,i}\Big(\frac{n\hat p_{n,i}}{X_{n,i}}-1\Big) + \sum_{i=1}^k X_{n,i}\Big(\frac{n\hat p_{n,i}}{X_{n,i}}-1\Big)^2 + o_P(1) = \sum_{i=1}^k\frac{(X_{n,i}-n\hat p_{n,i})^2}{X_{n,i}} + o_P(1).$$

(The first term on the right vanishes, because both $\sum X_{n,i}$ and $\sum n\hat p_{n,i}$ equal $n$.) In the last expression we can also replace $X_{n,i}$ in the denominator by $n\hat p_{n,i}$, so that we find the relation $L_n(\hat p_n) = C_n(\hat p_n) + o_P(1)$ between the likelihood ratio and the Pearson statistic, for every $\sqrt n$-consistent estimator sequence $\hat p_n$. By the definitions of $\hat a_n$ and $\tilde a_n$, we conclude that, up to $o_P(1)$-terms, $C_n(\tilde a_n)\le C_n(\hat a_n) = L_n(\hat a_n)\le L_n(\tilde a_n) = C_n(\tilde a_n)$. The lemma follows. ∎

The asymptotic behavior of likelihood ratio statistics is discussed in general in Chapter 16. In view of the preceding lemma, we can now refer to this chapter to obtain the asymptotic distribution of the chi-square statistics. Alternatively, a direct study of the minimum-chi square statistic gives additional insight (and a more elementary proof). As in Chapter 16, say that a sequence of sets $H_n$ converges to a set $H$ if $H$ is the set of all limits $\lim h_n$ of converging sequences $h_n$ with $h_n\in H_n$ for every $n$ and, moreover, the limit $h = \lim_i h_{n_i}$ of every converging subsequence $h_{n_i}$ with $h_{n_i}\in H_{n_i}$ for every $i$ is contained in $H$.
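The asymptotic equivalence of the Pearson and likelihood ratio statistics asserted in Lemma 17.3 is already visible at moderate sample sizes. The sketch below (an illustration, not from the text) compares $L_n(p)$ and $C_n(p)$ at a fixed probability vector $p$.

```python
import math

def pearson(counts, p):
    # C_n(p) = sum_i (X_{n,i} - n p_i)^2 / (n p_i)
    n = sum(counts)
    return sum((x - n * pi) ** 2 / (n * pi) for x, pi in zip(counts, p))

def likelihood_ratio(counts, p):
    # L_n(p) = 2 sum_i X_{n,i} log(X_{n,i} / (n p_i))
    n = sum(counts)
    return 2 * sum(x * math.log(x / (n * pi)) for x, pi in zip(counts, p))

counts, p = [35, 30, 35], [1/3, 1/3, 1/3]
print(pearson(counts, p), likelihood_ratio(counts, p))  # ≈ 0.5 and ≈ 0.509
```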
17.4 Theorem. Let $\mathcal P_0$ be a subset of the unit simplex such that the sequence of sets $\sqrt n(\mathcal P_0-a)$ converges to a set $H$ (in $\mathbb{R}^k$), and suppose that $a > 0$. Then, under $a$,

$$\inf_{p\in\mathcal P_0}\sum_{i=1}^k\frac{(X_{n,i}-np_i)^2}{np_i} \rightsquigarrow \Big\|X - \frac{1}{\sqrt a}H\Big\|^2,$$

for a vector $X$ with the $N(0, I-\sqrt a\sqrt a^T)$-distribution. Here $(1/\sqrt a)H$ is the set of vectors $(h_1/\sqrt{a_1},\ldots,h_k/\sqrt{a_k})$ as $h$ ranges over $H$.

17.5 Corollary. Let $\mathcal P_0$ be a subset of the unit simplex such that the sequence of sets $\sqrt n(\mathcal P_0-a)$ converges to a linear subspace of dimension $l$ (of $\mathbb{R}^k$), and let $a > 0$. Then both the sequence of minimum-chi square statistics and the sequence of modified Pearson statistics $C_n(\hat a_n)$ converge in distribution to the chi-square distribution with $k-1-l$ degrees of freedom.
Proof. Because the minimum-chi square estimator $\tilde a_n$ (relative to $\mathcal P_0$) is $\sqrt n$-consistent, the asymptotic distribution of the minimum-chi square statistic is not changed if we replace $n\tilde a_{n,i}$ in its denominator by the true value $na_i$. Next, we decompose,

$$\frac{X_{n,i}-np_i}{\sqrt{na_i}} = \frac{X_{n,i}-na_i}{\sqrt{na_i}} - \frac{\sqrt n(p_i-a_i)}{\sqrt{a_i}}.$$

The first vector on the right converges in distribution to $X$. The (modified) minimum-chi square statistics are the distances of these vectors to the sets $H_n = \sqrt n(\mathcal P_0-a)/\sqrt a$, which converge to the set $H/\sqrt a$. The theorem now follows from Lemma 7.13. The vector $X$ is distributed as $Z-\Pi_a Z$ for $\Pi_a$ the projection onto the linear space spanned by the vector $\sqrt a$ and $Z$ a $k$-dimensional standard normal vector. Because every element of $H$ is the limit of a multiple of differences of probability vectors, $1^T h = 0$ for every $h\in H$. Therefore, the space $(1/\sqrt a)H$ is orthogonal to the vector $\sqrt a$, and $\Pi\Pi_a = 0$ for $\Pi$ the projection onto the space $(1/\sqrt a)H$. The distance of $X$ to the space $(1/\sqrt a)H$ is equal to the norm of $X-\Pi X$, which is distributed as the norm of $Z-\Pi_a Z-\Pi Z$. The latter projection is multivariate normally distributed with mean zero and covariance matrix the projection matrix $I-\Pi_a-\Pi$ with $k-l-1$ eigenvalues 1. The corollary follows from Lemma 17.1 or 16.6. ∎
17.6 Example (Parametric model). If the null hypothesis is a parametric family P₀ = {p_θ : θ ∈ Θ} indexed by a subset Θ of ℝˡ with l ≤ k and the maps θ ↦ p_θ from Θ into the unit simplex are differentiable and of full rank, then √n(P₀ − p_θ) → ṗ_θ(ℝˡ) for every θ ∈ Θ (see Example 16.11). Then the chi-square statistics Cₙ(p_θ̂ₙ) are asymptotically χ²_{k−l−1}-distributed. This situation is common in testing the goodness-of-fit of parametric families, as discussed in Section 17.5 and Example 16.1. □
17.4 Testing Independence
Suppose that each element of a population can be classified according to two characteristics, having k and r levels, respectively. The full information concerning the classification can be given by a (k × r) table of the form given in Table 17.1.

Often the full information is not available, but we do know the classification X_{n,ij} for a random sample of size n from the population. The matrix X_{n,ij}, which can also be written in the form of a (k × r) table, is multinomially distributed with parameters n and probabilities p_{ij} = N_{ij}/N. The null hypothesis of independence asserts that the two categories are independent: H₀ : p_{ij} = a_i b_j for (unknown) probability vectors a_i and b_j. The maximum likelihood estimators for the parameters a and b (under the null hypothesis) are â_i = X_{n,i·}/n and b̂_j = X_{n,·j}/n. With these estimators the modified Pearson statistic takes the form

    Cₙ(â ⊗ b̂) = Σ_{i=1}^k Σ_{j=1}^r (X_{n,ij} − n â_i b̂_j)² / (n â_i b̂_j).
The null hypothesis is a (k + r − 2)-dimensional submanifold of the unit simplex in ℝᵏʳ. In a shrinking neighborhood of a parameter in its interior this manifold looks like its tangent space, a linear space of dimension k + r − 2. Thus, the sequence Cₙ(âₙ ⊗ b̂ₙ) is asymptotically chi square-distributed with kr − 1 − (k + r − 2) = (k − 1)(r − 1) degrees of freedom.
Table 17.1. Classification of a population of N elements according to two categories, N_ij elements having value i on the first category and value j on the second. The borders give the sums over each row and column, respectively.

    N_11   N_12   ...   N_1r   | N_1.
    N_21   N_22   ...   N_2r   | N_2.
     ...    ...          ...   |  ...
    N_k1   N_k2   ...   N_kr   | N_k.
    ---------------------------------
    N._1   N._2   ...   N._r   | N
Chi-Square Tests
17.7 Corollary. If the (k × r) matrices Xₙ are multinomially distributed with parameters n and p_{ij} = a_i b_j > 0, then the sequence Cₙ(âₙ ⊗ b̂ₙ) converges in distribution to the χ²_{(k−1)(r−1)}-distribution.
Proof. The map (a₁, …, a_{k−1}, b₁, …, b_{r−1}) ↦ a × b from ℝ^{k+r−2} into ℝ^{kr} is continuously differentiable and of full rank. The true values (a₁, …, a_{k−1}, b₁, …, b_{r−1}) are interior to the domain of this map. Thus the sequence of sets √n(P₀ − a × b) converges to a (k + r − 2)-dimensional linear subspace of ℝ^{kr}. ∎
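The statistic Cₙ(âₙ ⊗ b̂ₙ) and its degrees of freedom can be computed directly from a contingency table. The sketch below is one way to do it; the 2 × 3 table is invented purely for illustration.

```python
import numpy as np

def independence_chisq(table):
    """Modified Pearson statistic for independence in a k x r table.

    Returns sum_ij (X_ij - n a_i b_j)^2 / (n a_i b_j), with a_i and b_j
    the row and column marginal relative frequencies (the MLEs under
    the null hypothesis), together with the degrees of freedom
    (k - 1)(r - 1).
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    a = table.sum(axis=1) / n        # row marginals, MLE of a
    b = table.sum(axis=0) / n        # column marginals, MLE of b
    expected = n * np.outer(a, b)
    stat = ((table - expected) ** 2 / expected).sum()
    k, r = table.shape
    return stat, (k - 1) * (r - 1)

# A hypothetical 2 x 3 classification of 200 individuals.
table = [[30, 40, 30],
         [25, 45, 30]]
stat, df = independence_chisq(table)
print(stat, df)   # compare stat with chi-square(df) critical values
```

For this table the statistic is small relative to χ²₂ critical values, so independence would not be rejected.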
*17.5 Goodness-of-Fit Tests
Chi-square tests are often applied to test goodness-of-fit. Given a random sample X₁, …, Xₙ from a distribution P, we wish to test the null hypothesis H₀ : P ∈ P₀ that P belongs to a given class P₀ of probability measures. There are many possible test statistics for this problem, and a particular statistic might be selected to attain high power against certain alternatives. Testing goodness-of-fit typically focuses on no particular alternative. Then chi-square statistics are intuitively reasonable.

The data can be reduced to a multinomial vector by "grouping." We choose a partition 𝒳 = ∪ⱼ 𝒳ⱼ of the sample space into finitely many sets and base the test only on the observed numbers of observations falling into each of the sets 𝒳ⱼ. For ease of notation, we express these numbers in terms of the empirical measure of the data. For a given set A we denote by ℙₙ(A) = n⁻¹ #(1 ≤ i ≤ n : Xᵢ ∈ A) the fraction of observations that fall into A. Then the vector n(ℙₙ(𝒳₁), …, ℙₙ(𝒳_k)) possesses a multinomial distribution, and the corresponding modified chi-square statistic is given by

    Σ_{j=1}^k (nℙₙ(𝒳ⱼ) − nP̂(𝒳ⱼ))² / (nP̂(𝒳ⱼ)).
Here P̂(𝒳ⱼ) is an estimate of P(𝒳ⱼ) under the null hypothesis and can take a variety of forms. Theorem 17.4 applies but is restricted to the case that the estimates P̂(𝒳ⱼ) are based on the frequencies n(ℙₙ(𝒳₁), …, ℙₙ(𝒳_k)) only. In the present situation it is more natural to base the estimates on the original observations X₁, …, Xₙ. Usually, this results in a non-chi-square limit distribution. For instance, Table 17.2 shows the "errors" in the level of a chi-square test for testing normality, if the unknown mean and variance are estimated by the sample mean and the sample variance but the critical value is chosen from the chi-square distribution. The size of the errors depends on the number of cells, the errors being small if there are many cells and few estimated parameters.

17.8 Example (Parametric model). Consider testing the null hypothesis that the true distribution belongs to a regular parametric model {P_θ : θ ∈ Θ}. It appears natural to estimate the unknown parameter θ by an estimator θ̂ₙ that is asymptotically efficient under the null hypothesis and is based on the original sample X₁, …, Xₙ, for instance the maximum likelihood estimator. If 𝔾ₙ = √n(ℙₙ − P_θ) denotes the empirical process, then efficiency entails the approximation √n(θ̂ₙ − θ) = I_θ⁻¹ 𝔾ₙ ℓ̇_θ + o_P(1). Applying the delta method to
Table 17.2. True levels of the chi-square test for normality using χ²_{k−3}-quantiles as critical values but estimating unknown mean and variance by sample mean and sample variance. Chi-square statistic based on partitions of [−10, 10] into k = 5, 10, or 20 equiprobable cells under the standard normal law.

              α = 0.20   α = 0.10   α = 0.05   α = 0.01
    k = 5       0.30       0.15       0.08       0.02
    k = 10      0.22       0.11       0.06       0.01
    k = 20      0.21       0.10       0.05       0.01

Note: Values based on 2000 simulations of standard normal samples of size 100.
the variables √n(P_θ̂ₙ(𝒳ⱼ) − P_θ(𝒳ⱼ)) and using Slutsky's lemma, we find

    √n (ℙₙ(𝒳ⱼ) − P_θ̂ₙ(𝒳ⱼ)) / √(P_θ̂ₙ(𝒳ⱼ)) = (𝔾ₙ 1_{𝒳ⱼ} − Ṗ_θ(𝒳ⱼ)ᵀ I_θ⁻¹ 𝔾ₙ ℓ̇_θ) / √(a_{θ,j}) + o_P(1),

for a_{θ,j} = P_θ(𝒳ⱼ). (The map θ ↦ P_θ(A) has derivative Ṗ_θ(A) = P_θ 1_A ℓ̇_θ.) The sequence of vectors (𝔾ₙ 1_{𝒳ⱼ}, 𝔾ₙ ℓ̇_θ) converges in distribution to a multivariate-normal distribution. Some matrix manipulations show that the vectors in the preceding display are asymptotically distributed as a Gaussian vector X with mean zero and covariance matrix

    Cov_θ X = I − √a_θ √a_θᵀ − C_θᵀ I_θ⁻¹ C_θ,   (C_θ)_{ij} = P_θ 1_{𝒳ⱼ} ℓ̇_{θ,i} / √(a_{θ,j}).

In general, the covariance matrix of X is not a projection matrix, and the variable ‖X‖² does not possess a chi-square distribution. Because P_θ ℓ̇_θ = 0, we have that C_θ √a_θ = 0 and hence the covariance matrix of X can be rewritten as the product (I − √a_θ √a_θᵀ)(I − C_θᵀ I_θ⁻¹ C_θ). Here the first matrix is the projection onto the orthocomplement of the vector √a_θ and the second matrix is a positive-definite transformation that leaves √a_θ invariant, thus acting only on the orthocomplement (√a_θ)⊥. This geometric picture shows that Cov_θ X has the same system of eigenvectors as the matrix I − C_θᵀ I_θ⁻¹ C_θ, and also the same eigenvalues, except for the eigenvalue corresponding to the eigenvector √a_θ, which is 0 for Cov_θ X and 1 for I − C_θᵀ I_θ⁻¹ C_θ. Because both matrices C_θᵀ I_θ⁻¹ C_θ and I − C_θᵀ I_θ⁻¹ C_θ are nonnegative-definite, the eigenvalues are contained in [0, 1]. One eigenvalue (corresponding to the eigenvector √a_θ) is 0, dim N(C_θ) − 1 eigenvalues (corresponding to the eigenspace N(C_θ) ∩ (√a_θ)⊥) are 1, but the other eigenvalues may be contained in (0, 1) and then typically depend on θ. By Lemma 17.1, the variable ‖X‖² is distributed as

    Σ_{i=1}^{dim N(C_θ)−1} Zᵢ² + Σ_{i=dim N(C_θ)}^{k−1} cᵢ Zᵢ²,

for independent standard normal variables Zᵢ and constants cᵢ ∈ (0, 1). This means that it is stochastically "between" the chi-square distributions with dim N(C_θ) − 1 and k − 1 degrees of freedom. The inconvenience that this distribution is not standard and depends on θ can be remedied by not using efficient estimators θ̂ₙ or, alternatively, by not using the Pearson statistic.
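This Chernoff-Lehmann phenomenon is easy to reproduce by simulation, in the spirit of Table 17.2. The sketch below (the cell boundaries and the χ²₂ critical value are standard tabulated constants; the sample and replication sizes are arbitrary choices) tests normality with k = 5 fixed cells, estimates the mean and variance from the raw sample, and rejects at the nominal 5% level using the χ²_{k−3} quantile. The realized level comes out inflated.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Boundaries of k = 5 equiprobable cells under the standard normal law
# (its 0.2, 0.4, 0.6, 0.8 quantiles) and the 0.95 quantile of the
# chi-square distribution with k - 3 = 2 degrees of freedom.
inner = np.array([-0.8416, -0.2533, 0.2533, 0.8416])
crit = 5.991

n, reps, k = 100, 2000, 5
rejections = 0
for _ in range(reps):
    x = rng.standard_normal(n)
    mu, sigma = x.mean(), x.std(ddof=1)
    # Estimated null probabilities of the fixed cells under N(mu, sigma^2).
    edges = [-math.inf, *inner, math.inf]
    p = np.diff([norm_cdf((c - mu) / sigma) for c in edges])
    counts = np.bincount(np.searchsorted(inner, x), minlength=k)
    stat = ((counts - n * p) ** 2 / (n * p)).sum()
    rejections += stat > crit

print(rejections / reps)  # realized level, noticeably above the nominal 0.05
```

The output should be close to the 0.08 entry of Table 17.2 for k = 5, α = 0.05.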
The square root of the matrix I − C_θᵀ I_θ⁻¹ C_θ is the positive-definite matrix with the same eigenvectors, but with the square roots of the eigenvalues. Thus, it also leaves the vector √a_θ invariant and acts only on the orthocomplement (√a_θ)⊥. It follows that this square root commutes with the matrix I − √a_θ √a_θᵀ and hence

    (I − C_θᵀ I_θ⁻¹ C_θ)^{−1/2} ( √n (ℙₙ(𝒳ⱼ) − P_θ̂ₙ(𝒳ⱼ)) / √(P_θ̂ₙ(𝒳ⱼ)) )ⱼ ⇝ N(0, I − √a_θ √a_θᵀ).

(We assume that the matrix I − C_θᵀ I_θ⁻¹ C_θ is nonsingular, which is typically the case; see Problem 17.6.) By the continuous-mapping theorem, the squared norm of the left side is asymptotically chi square-distributed with k − 1 degrees of freedom. This squared norm is the Rao-Robson-Nikulin statistic. □

It is tempting to choose the partitioning sets 𝒳ⱼ dependent on the observed data X₁, …, Xₙ, for instance to ensure that all cells have positive probability under the null hypothesis. This is permissible under some conditions: The choice of a "random partition" typically does not change the distributional properties of the chi-square statistic. Consider partitioning sets 𝒳̂ⱼ = 𝒳ⱼ(X₁, …, Xₙ) that possibly depend on the data, and a further modified Pearson statistic of the type

    Σ_{j=1}^k (nℙₙ(𝒳̂ⱼ) − nP_θ̂ₙ(𝒳̂ⱼ))² / (nP_θ̂ₙ(𝒳̂ⱼ)).
If the random partitions settle down to a fixed partition eventually, then this statistic is asymptotically equivalent to the statistic for which the partition had been set equal to the limit partition in advance. We discuss this for the case that the null hypothesis is a model {P_θ : θ ∈ Θ} indexed by a subset Θ of a normed space. We use the language of Donsker classes as discussed in Chapter 19.
17.9 Theorem. Suppose that the sets 𝒳̂ⱼ belong to a P_{θ₀}-Donsker class 𝒞 of sets and that P_{θ₀}(𝒳̂ⱼ △ 𝒳ⱼ) → 0 in probability under P_{θ₀}, for given nonrandom sets 𝒳ⱼ such that P_{θ₀}(𝒳ⱼ) > 0. Furthermore, assume that √n‖θ̂ − θ₀‖ = O_P(1), and suppose that the map θ ↦ P_θ from Θ into ℓ^∞(𝒞) is differentiable at θ₀ with derivative Ṗ_{θ₀} such that Ṗ_{θ₀}(𝒳̂ⱼ) − Ṗ_{θ₀}(𝒳ⱼ) → 0 in probability for every j. Then

    Σⱼ (nℙₙ(𝒳̂ⱼ) − nP_θ̂(𝒳̂ⱼ))² / (nP_θ̂(𝒳̂ⱼ)) − Σⱼ (nℙₙ(𝒳ⱼ) − nP_θ̂(𝒳ⱼ))² / (nP_θ̂(𝒳ⱼ)) → 0 in probability.
Proof. Let 𝔾ₙ = √n(ℙₙ − P_{θ₀}) be the empirical process and define ℍₙ = √n(P_θ̂ − P_{θ₀}). Then √n(ℙₙ(𝒳̂ⱼ) − P_θ̂(𝒳̂ⱼ)) = (𝔾ₙ − ℍₙ)(𝒳̂ⱼ), and similarly with 𝒳ⱼ replacing 𝒳̂ⱼ. The condition that the sets 𝒳̂ⱼ belong to a Donsker class, combined with the continuity condition P_{θ₀}(𝒳̂ⱼ △ 𝒳ⱼ) → 0 in probability, implies that 𝔾ₙ(𝒳̂ⱼ) − 𝔾ₙ(𝒳ⱼ) → 0 in probability (see Lemma 19.24). The differentiability of the map θ ↦ P_θ implies that

    sup_{C∈𝒞} |P_θ(C) − P_{θ₀}(C) − Ṗ_{θ₀}(C)(θ − θ₀)| = o(‖θ − θ₀‖).

Together with the continuity Ṗ_{θ₀}(𝒳̂ⱼ) − Ṗ_{θ₀}(𝒳ⱼ) → 0 in probability and the √n-consistency of θ̂, this
shows that ℍₙ(𝒳̂ⱼ) − ℍₙ(𝒳ⱼ) → 0 in probability. In particular, because P_{θ₀}(𝒳̂ⱼ) → P_{θ₀}(𝒳ⱼ) in probability, both P_θ̂(𝒳̂ⱼ) and P_θ̂(𝒳ⱼ) converge in probability to P_{θ₀}(𝒳ⱼ) > 0. The theorem follows. ∎
The conditions on the random partitions that are imposed in the preceding theorem are mild. An interesting choice is a partition in sets 𝒳ⱼ(θ̂) such that P_θ(𝒳ⱼ(θ)) = aⱼ is independent of θ. The corresponding modified Pearson statistic is known as the Watson-Roy statistic and takes the form

    Σ_{j=1}^k (nℙₙ(𝒳ⱼ(θ̂)) − naⱼ)² / (naⱼ).
Here the null probabilities have been reduced to fixed values again, but the cell frequencies are "doubly random." If the model is smooth and the parameter estimators and the sets 𝒳ⱼ(θ) are not too wild, then this statistic has the same null limit distribution as the modified Pearson statistic with a fixed partition.

17.10 Example (Location-scale). Consider testing a null hypothesis that the true underlying measure of the observations belongs to a location-scale family {F₀((· − μ)/σ) : μ ∈ ℝ, σ > 0}, given a fixed distribution F₀ on ℝ. It is reasonable to choose a partition in sets 𝒳ⱼ = μ̂ + σ̂ (c_{j−1}, cⱼ], for a fixed partition −∞ = c₀ < c₁ < ··· < c_k = ∞ and estimators μ̂ and σ̂ of the location and scale parameter. The partition could, for instance, be chosen equal to cⱼ = F₀⁻¹(j/k), although, in general, the partition should depend on the type of deviation from the null hypothesis that one wants to detect. If we use the same location and scale estimators to "estimate" the null probabilities F₀((𝒳ⱼ − μ̂)/σ̂) of the random cells 𝒳ⱼ = μ̂ + σ̂ (c_{j−1}, cⱼ], then the estimators cancel, and we find the fixed null probabilities F₀(cⱼ) − F₀(c_{j−1}). □
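The cancellation in this example can be checked numerically. In the sketch below (standard normal quartiles as the fixed partition, invented location and scale values; all choices are illustrative, not from the text) the estimated null probabilities of the random cells reduce to the fixed values 1/k, so the Watson-Roy statistic needs only the doubly random cell counts.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Fixed cuts c_j = F0^{-1}(j/k) for k = 4 and F0 the standard normal
# law: the quartiles of N(0, 1) (tabulated values).
k = 4
cuts = np.array([-0.6745, 0.0, 0.6745])

x = rng.standard_normal(200) * 2.0 + 5.0     # sample with mu = 5, sigma = 2
mu, sigma = x.mean(), x.std(ddof=1)

# Random cells mu + sigma * (c_{j-1}, c_j]; estimating their null
# probabilities F0((boundary - mu)/sigma) makes the estimators cancel,
# leaving the fixed values F0(c_j) - F0(c_{j-1}) = 1/k.
random_cuts = mu + sigma * cuts
p_hat = np.diff([0.0, *[norm_cdf((c - mu) / sigma) for c in random_cuts], 1.0])
print(p_hat)  # each entry equals 1/k = 0.25

counts = np.bincount(np.searchsorted(random_cuts, x), minlength=k)
watson_roy = ((counts - len(x) / k) ** 2 / (len(x) / k)).sum()
print(watson_roy)
```

Note that `p_hat` does not depend on the estimated μ̂ and σ̂ at all, which is exactly the cancellation described above.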
*17.6 Asymptotic Efficiency
The asymptotic null distributions of various versions of the Pearson statistic enable us to set critical values but by themselves do not give information on the asymptotic power of the tests. Are these tests, which appear to be mostly motivated by their asymptotic null distribution, sufficiently powerful?

The asymptotic power can be measured in various ways. Probably the most important method is to consider local limiting power functions, as in Chapter 14. For the likelihood ratio test these are obtained in Chapter 16. Because, in the local experiments, chi-square statistics are asymptotically equivalent to the likelihood ratio statistics (see Theorem 17.4), the results obtained there also apply to the present problem, and we shall not repeat the discussion.

A second method to evaluate the asymptotic power is by Bahadur efficiencies. For this nonlocal criterion, chi-square tests and likelihood ratio tests are not equivalent, the second being better and, in fact, optimal (see Theorem 16.12). We shall compute the slopes of the Pearson and likelihood ratio tests for testing the simple hypothesis H₀ : p = a. A multinomial vector Xₙ with parameters n and p = (p₁, …, p_k) can be thought of as n times the empirical measure ℙₙ of a random sample of size n from the distribution P̄ on the set {1, …, k} defined by P̄{i} = pᵢ. Thus we can view both the
Pearson and the likelihood ratio statistics as functions of an empirical measure and next can apply Sanov's theorem to compute the desired limits of large deviations probabilities. Define maps C and K by

    C(p, a) = Σ_{i=1}^k (pᵢ − aᵢ)² / aᵢ,   K(p, a) = −P̄ log (a/p) = Σ_{i=1}^k pᵢ log (pᵢ/aᵢ).

Then the Pearson and likelihood ratio statistics are equivalent to C(ℙₙ, a) and K(ℙₙ, a), respectively. Under the assumption that a > 0, both maps are continuous in p on the k-dimensional unit simplex. Furthermore, for t in the interior of the ranges of C and K, the sets B_t = {p : C(p, a) ≥ t} and B̃_t = {p : K(p, a) ≥ t} are equal to the closures of their interiors. Two applications of Sanov's theorem yield

    (1/n) log P_a(C(ℙₙ, a) ≥ t) → − inf_{p∈B_t} K(p, a),
    (1/n) log P_a(K(ℙₙ, a) ≥ t) → − inf_{p∈B̃_t} K(p, a) = −t.
We take the function e(t) of (14.20) equal to minus two times the right sides. Because ℙₙ{i} → pᵢ by the law of large numbers, whence C(ℙₙ, a) → C(p, a) and K(ℙₙ, a) → K(p, a) almost surely, the Bahadur slopes of the Pearson and likelihood ratio tests at the alternative H₁ : p = q are given by

    2 inf_{p : C(p,a) ≥ C(q,a)} K(p, a)   and   2K(q, a).
It is clear from these expressions that the likelihood ratio test has a bigger slope. This is in agreement with the fact that the likelihood ratio test is asymptotically Bahadur optimal in any smooth parametric model. Figure 17.1 shows the difference of the slopes in one particular case. The difference is small in a neighborhood of the null hypothesis a, in agreement with the fact that the Pitman efficiency is equal to 1, but can be substantial for alternatives away from a.

Notes
Pearson introduced his statistic in 1900 in [112]. The modification with estimated parameters, using the multinomial frequencies, was considered by Fisher [49], who corrected the mistaken belief that estimating the parameters does not change the limit distribution. Chernoff and Lehmann [22] showed that using maximum likelihood estimators based on the original data for the parameter in a goodness-of-fit statistic destroys the asymptotic chi-square distribution. They note that the errors in the level are small in the case of testing a Poisson distribution and somewhat larger when testing normality.
Figure 17.1. The difference of the Bahadur slopes of the likelihood ratio and Pearson tests for testing H₀ : p = (1/3, 1/3, 1/3) based on a multinomial vector with parameters n and p = (p₁, p₂, p₃), as a function of (p₁, p₂).
The choice of the partition in chi-square goodness-of-fit tests is an important issue that we have not discussed. Several authors have studied the optimal number of cells in the partition. This number depends, of course, on the alternative for which one desires large power. The conclusions of these studies are not easily summarized. For alternatives p such that the likelihood ratio p/p_{θ₀} with respect to the null distribution is "wild," the number of cells k should tend to infinity with n. Then the chi-square approximation of the null distribution needs to be modified. Normal approximations are used, because a chi-square distribution with a large number of degrees of freedom is approximately a normal distribution. See [40], [60], and [86] for results and further references.
PROBLEMS
1. Let N = (N_ij) be a multinomial matrix with success probabilities p_ij. Design a test statistic for the null hypothesis of symmetry H₀ : p_ij = p_ji and derive its asymptotic null distribution.
2. Derive the limit distribution of the chi-square goodness-of-fit statistic for testing normality if using the sample mean and sample variance as estimators for the unknown mean and variance. Use two or three cells to keep the calculations simple. Show that the limit distribution is not chi-square.

3. Suppose that X_m and Y_n are independent multinomial vectors with parameters (m, a₁, …, a_k) and (n, b₁, …, b_k), respectively. Under the null hypothesis H₀ : a = b, a natural estimator of the unknown probability vector a = b is ĉ = (m + n)⁻¹(X_m + Y_n), and a natural test statistic is given by

    Σ_{i=1}^k (X_{m,i} − m ĉᵢ)² / (m ĉᵢ) + Σ_{i=1}^k (Y_{n,i} − n ĉᵢ)² / (n ĉᵢ).

Show that ĉ is the maximum likelihood estimator and show that the sequence of test statistics is asymptotically chi square-distributed if m, n → ∞.
4. A matrix Σ⁻ is called a generalized inverse of a matrix Σ if x = Σ⁻y solves the equation Σx = y for every y in the range of Σ. Suppose that Y is N_k(0, Σ)-distributed for a matrix Σ of rank r. Show that
(i) Σ⁻Y is the same for every generalized inverse Σ⁻, with probability one;
(ii) YᵀΣ⁻Y possesses a chi-square distribution with r degrees of freedom;
(iii) if YᵀCY possesses a chi-square distribution with r degrees of freedom and C is a nonnegative-definite symmetric matrix, then C is a generalized inverse of Σ.
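For part (ii), the Moore-Penrose pseudo-inverse is one particular generalized inverse, which makes the claim easy to check by simulation; the rank-2 covariance matrix below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# A rank-2 covariance matrix on R^3.
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
Sigma = A @ A.T
r = np.linalg.matrix_rank(Sigma)

# Y ~ N_3(0, Sigma); with the pseudo-inverse as generalized inverse,
# Y' Sigma^- Y should be chi-square with r degrees of freedom.
pinv = np.linalg.pinv(Sigma)
samples = A @ rng.standard_normal((2, 20000))      # columns are draws of Y
quad = np.einsum('in,ij,jn->n', samples, pinv, samples)

print(r, quad.mean(), quad.var())  # mean ~ r, variance ~ 2r
```

Here the quadratic form reduces algebraically to the squared norm of a 2-dimensional standard normal vector, so its sample mean and variance should be close to 2 and 4.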
5. Find the limit distribution of the Dzhaparidze-Nikulin statistic
6. Show that the matrix I − C_θᵀ I_θ⁻¹ C_θ in Example 17.8 is nonsingular unless the empirical estimator (ℙₙ(𝒳₁), …, ℙₙ(𝒳_k)) is asymptotically efficient. (The estimator (P_θ̂(𝒳₁), …, P_θ̂(𝒳_k)) is asymptotically efficient and has asymptotic covariance matrix diag(√a_θ) C_θᵀ I_θ⁻¹ C_θ diag(√a_θ); the empirical estimator has asymptotic covariance matrix diag(√a_θ)(I − √a_θ √a_θᵀ) diag(√a_θ).)
18 Stochastic Convergence in Metric Spaces
This chapter extends the concepts of convergence in distribution, in probability, and almost surely from Euclidean spaces to more abstract metric spaces. We are particularly interested in developing the theory for random functions, or stochastic processes, viewed as elements of the metric space of all bounded functions.

18.1 Metric and Normed Spaces
In this section we recall some basic topological concepts and introduce a number of examples of metric spaces. A metric space is a set 𝔻 equipped with a metric. A metric or distance function is a map d : 𝔻 × 𝔻 → [0, ∞) with the properties
(i) d(x, y) = d(y, x);
(ii) d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality);
(iii) d(x, y) = 0 if and only if x = y.
A semimetric satisfies (i) and (ii), but not necessarily (iii). An open ball is a set of the form {y : d(x, y) < r}. A subset of a metric space is open if and only if it is the union of open balls; it is closed if and only if its complement is open. A sequence xₙ converges to x if and only if d(xₙ, x) → 0; this is denoted by xₙ → x. The closure Ā of a set A ⊂ 𝔻 consists of all points that are the limit of a sequence in A; it is the smallest closed set containing A. The interior Å is the collection of all points x such that x ∈ G ⊂ A for some open set G; it is the largest open set contained in A. A function f : 𝔻 → 𝔼 between two metric spaces is continuous at a point x if and only if f(xₙ) → f(x) for every sequence xₙ → x; it is continuous at every x if and only if the inverse image f⁻¹(G) of every open set G ⊂ 𝔼 is open in 𝔻. A subset of a metric space is dense if and only if its closure is the whole space. A metric space is separable if and only if it has a countable dense subset. A subset K of a metric space is compact if and only if it is closed and every sequence in K has a converging subsequence. A subset K is totally bounded if and only if for every δ > 0 it can be covered by finitely many balls of radius δ. A semimetric space is complete if every Cauchy sequence, a sequence such that d(xₙ, x_m) → 0 as n, m → ∞, has a limit. A subset of a complete semimetric space is compact if and only if it is totally bounded and closed. A normed space 𝔻 is a vector space equipped with a norm. A norm is a map ‖·‖ : 𝔻 → [0, ∞) such that, for every x, y in 𝔻, and α ∈ ℝ,
(i) ‖x + y‖ ≤ ‖x‖ + ‖y‖ (triangle inequality);
(ii) ‖αx‖ = |α| ‖x‖;
(iii) ‖x‖ = 0 if and only if x = 0.
A seminorm satisfies (i) and (ii), but not necessarily (iii). Given a norm, a metric can be defined by d(x, y) = ‖x − y‖.

18.1 Definition. The Borel σ-field on a metric space 𝔻 is the smallest σ-field that contains the open sets (and then also the closed sets). A function defined relative to (one or two) metric spaces is called Borel-measurable if it is measurable relative to the Borel σ-field(s). A Borel-measurable map X : Ω → 𝔻 defined on a probability space (Ω, 𝒰, P) is referred to as a random element with values in 𝔻.
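The metric and norm axioms above are easy to spot-check numerically. The following sketch (grid size and random draws are arbitrary choices) verifies them for the metric d(x, y) = ‖x − y‖ induced by the uniform norm on functions sampled on a finite grid.

```python
import numpy as np

rng = np.random.default_rng(4)

def unif_norm(z):
    """Uniform norm ||z|| = sup_t |z(t)| for a function sampled on a grid."""
    return np.abs(z).max()

def d(x, y):
    """Metric induced by the norm: d(x, y) = ||x - y||."""
    return unif_norm(x - y)

# Spot-check the axioms on randomly sampled "functions" (vectors
# indexed by a grid of 50 points).
x, y, z = rng.standard_normal((3, 50))

assert d(x, y) == d(y, x)                        # symmetry
assert d(x, z) <= d(x, y) + d(y, z) + 1e-12      # triangle inequality
assert d(x, x) == 0.0                            # d(x, x) = 0
assert unif_norm(2.5 * x) == 2.5 * unif_norm(x)  # homogeneity of the norm
print("axioms hold on this sample")
```

Of course such checks on finitely many points are no proof, but they are a quick sanity check when implementing a distance for use in computations.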
For Euclidean spaces, Borel measurability is just the usual measurability. Borel measurability is probably the natural concept to use with metric spaces. It combines well with the topological structure, particularly if the metric space is separable. For instance, continuous maps are Borel-measurable.

18.2 Lemma. A continuous map between metric spaces is Borel-measurable.
Proof. A map g : 𝔻 → 𝔼 is continuous if and only if the inverse image g⁻¹(G) of every open set G ⊂ 𝔼 is open in 𝔻. In particular, for every open G the set g⁻¹(G) is a Borel set in 𝔻. By definition, the open sets in 𝔼 generate the Borel σ-field. Thus, the inverse image of a generator of the Borel sets in 𝔼 is contained in the Borel σ-field in 𝔻. Because the inverse image g⁻¹(𝒢) of a generator 𝒢 of a σ-field ℬ generates the σ-field g⁻¹(ℬ), it follows that the inverse image of every Borel set is a Borel set. ∎
18.3 Example (Euclidean spaces). The Euclidean space ℝᵏ is a normed space with respect to the Euclidean norm (whose square is ‖x‖² = Σ_{i=1}^k xᵢ²), but also with respect to many other norms, for instance ‖x‖ = maxᵢ |xᵢ|, all of which are equivalent. By the Heine-Borel theorem a subset of ℝᵏ is compact if and only if it is closed and bounded. A Euclidean space is separable, with, for instance, the vectors with rational coordinates as a countable dense subset. The Borel σ-field is the usual σ-field, generated by the intervals of the type (−∞, x]. □
18.4 Example (Extended real line). The extended real line ℝ̄ = [−∞, ∞] is the set consisting of all real numbers and the additional elements −∞ and ∞. It is a metric space with respect to d(x, y) = |…
… for every δ > 0, the index set T can be partitioned into finitely many sets T₁, …, T_k such that (asymptotically) the variation of the sample paths t ↦ X_{n,t} is less than δ on every one of the sets Tⱼ, with large probability. Then the behavior of the process can be described, within a small error margin, by the behavior of the marginal vectors (X_{n,t₁}, …, X_{n,t_k}) for arbitrary fixed points tᵢ ∈ Tᵢ. If these marginals converge, then the processes converge.
18.14 Theorem. A sequence of arbitrary maps Xₙ : Ωₙ → ℓ^∞(T) converges weakly to a tight random element if and only if both of the following conditions hold:
(i) The sequence (X_{n,t₁}, …, X_{n,t_k}) converges in distribution in ℝᵏ for every finite set of points t₁, …, t_k in T;
(ii) for every ε, η > 0 there exists a partition of T into finitely many sets T₁, …, T_k such that

    lim sup_{n→∞} P*( supⱼ sup_{s,t∈Tⱼ} |X_{n,s} − X_{n,t}| ≥ ε ) ≤ η.
Proof
m
=
=
m
=
=
·
=
=
•
•
•
·
· .
portmanteau lemma and assumption (ii), for every finite set S ⊂ T₀,

    P( max_{s,t∈S : ρ_m(s,t)=0} |X_s − X_t| ≥ 2⁻ᵐ ) ≤ 2⁻ᵐ.

By the monotone convergence theorem this remains true if S is replaced by T₀. If ρ(s, t) < 2⁻ᵐ, then ρ_m(s, t) < 1 and hence s and t belong to the same partitioning set Tⱼᵐ. Consequently, the event in the preceding display with S = T₀ contains the event in the following display, and

    P( sup_{ρ(s,t) < 2⁻ᵐ} |X_s − X_t| ≥ 2⁻ᵐ ) ≤ 2⁻ᵐ.
This sums to a finite number over m ∈ ℕ. Hence, by the Borel-Cantelli lemma, for almost all ω, |X_s(ω) − X_t(ω)| ≤ 2⁻ᵐ for all ρ(s, t) < 2⁻ᵐ and all sufficiently large m. This implies that almost all sample paths of {X_t : t ∈ T₀} are contained in UC(T₀, ρ). Extend the process by continuity to a process {X_t : t ∈ T} with almost all sample paths in UC(T, ρ).

Define π_m : T → T as the map that maps every partitioning set Tⱼᵐ onto the point tⱼᵐ ∈ Tⱼᵐ. Then, by the uniform continuity of X and the fact that the ρ-diameter of Tⱼᵐ is smaller than 2⁻ᵐ, X ∘ π_m ⇝ X in ℓ^∞(T) as m → ∞ (even almost surely). The processes {Xₙ ∘ π_m(t) : t ∈ T} are essentially k_m-dimensional vectors. By (i), Xₙ ∘ π_m ⇝ X ∘ π_m in ℓ^∞(T) as n → ∞, for every fixed m. Consequently, for every Lipschitz function f : ℓ^∞(T) → [0, 1], as n → ∞ followed by m → ∞,

    |E*f(Xₙ) − Ef(X)| ≤ |E*f(Xₙ) − E*f(Xₙ ∘ π_m)| + o(1).

Conclude that, for every ε > 0,

    |E*f(Xₙ) − Ef(X)| ≤ ‖f‖_lip ε + P*( ‖Xₙ − Xₙ ∘ π_m‖_T > ε ) + o(1).

For ε = 2⁻ᵐ this is bounded by ‖f‖_lip 2⁻ᵐ + 2⁻ᵐ + o(1), by the construction of the partitions. The proof is complete. ∎
In the course of the proof of the preceding theorem a semimetric ρ is constructed such that the weak limit X has uniformly ρ-continuous sample paths, and such that (T, ρ) is totally bounded. This is surprising: even though we are discussing stochastic processes with values in the very large space ℓ^∞(T), the limit is concentrated on a much smaller space of continuous functions. Actually, this is a consequence of imposing the condition (ii), which can be shown to be equivalent to asymptotic tightness. It can be shown, more generally, that every tight random element X in ℓ^∞(T) necessarily concentrates on UC(T, ρ) for some semimetric ρ (depending on X) that makes T totally bounded. In view of this connection between the partitioning condition (ii), continuity, and tightness, we shall sometimes refer to this condition as the condition of asymptotic tightness or asymptotic equicontinuity.
We record the existence of the semimetric for later reference and note that, for a Gaussian limit process, this can always be taken equal to the "intrinsic" standard deviation semimetric.

18.15 Lemma. Under the conditions (i) and (ii) of the preceding theorem there exists a semimetric ρ on T for which T is totally bounded, and such that the weak limit X of the sequence Xₙ can be constructed to have almost all sample paths in UC(T, ρ). Furthermore, if the weak limit X is zero-mean Gaussian, then this semimetric can be taken equal to ρ(s, t) = sd(X_s − X_t).
Proof. A semimetric ρ is constructed explicitly in the proof of the preceding theorem. It suffices to prove the statement concerning Gaussian limits X.

Let ρ be the semimetric obtained in the proof of the theorem and let ρ₂ be the standard deviation semimetric. Because every uniformly ρ-continuous function has a unique continuous extension to the ρ-completion of T, which is compact, it is no loss of generality to assume that T is ρ-compact. Furthermore, assume that every sample path of X is ρ-continuous.

An arbitrary sequence tₙ in T has a ρ-converging subsequence tₙ′ → t. By the ρ-continuity of the sample paths, X_{tₙ′} → X_t almost surely. Because every X_{tₙ′} − X_t is Gaussian, this implies convergence of means and variances, whence ρ₂(tₙ′, t)² = E(X_{tₙ′} − X_t)² → 0 by Proposition 2.29. Thus tₙ′ → t also for ρ₂ and hence T is ρ₂-compact.

Suppose that a sample path t ↦ X_t(ω) is not ρ₂-continuous. Then there exist an ε > 0 and a t ∈ T such that ρ₂(tₙ, t) → 0, but |X_{tₙ}(ω) − X_t(ω)| ≥ ε for every n. By the ρ-compactness and continuity, there exists a subsequence such that ρ(tₙ′, s) → 0 and X_{tₙ′}(ω) → X_s(ω) for some s. By the argument of the preceding paragraph, ρ₂(tₙ′, s) → 0, so that ρ₂(s, t) = 0 and |X_s(ω) − X_t(ω)| ≥ ε. Conclude that the path t ↦ X_t(ω) can only fail to be ρ₂-continuous for ω for which there exist s, t ∈ T with ρ₂(s, t) = 0, but X_s(ω) ≠ X_t(ω). Let N be the set of ω for which there do exist such s, t. Take a countable, ρ-dense subset A of {(s, t) ∈ T × T : ρ₂(s, t) = 0}. Because t ↦ X_t(ω) is ρ-continuous, N is also the set of all ω such that there exist (s, t) ∈ A with X_s(ω) ≠ X_t(ω). From the definition of ρ₂, it is clear that for every fixed (s, t), the set of ω such that X_s(ω) ≠ X_t(ω) is a null set. Conclude that N is a null set. Hence, almost all paths of X are ρ₂-continuous. ∎
Notes
The theory in this chapter was developed in increasing generality over the course of many years. Work by Donsker around 1950 on the approximation of the empirical process and the partial sum process by the Brownian bridge and Brownian motion processes was an important motivation. The first type of approximation is discussed in Chapter 19. For further details and references concerning the material in this chapter, see, for example, [76] or [146].

PROBLEMS

1. (i) Show that a compact set is totally bounded.
   (ii) Show that a compact set is separable.
2. Show that a function f : 𝔻 → 𝔼 is continuous at every x ∈ 𝔻 if and only if f⁻¹(G) is open in 𝔻 for every open G ⊂ 𝔼.
3. (Projection σ-field.) Show that the σ-field generated by the coordinate projections z ↦ z(t) on C[a, b] is equal to the Borel σ-field generated by the uniform norm. (First, show that the space C[a, b] is separable. Next show that every open set in a separable metric space is a countable union of open balls. Next, it suffices to prove that every open ball is measurable for the projection σ-field.)
4. Show that D[a, b] is not separable for the uniform norm.
5. Show that every function in D[a, b] is bounded.

6. Let h be an arbitrary element of D[−∞, ∞] and let ε > 0. Show that there exists a grid u₀ = −∞ < u₁ < ··· < u_m = ∞ such that h varies at most ε on every interval [u_i, u_{i+1}). Here "varies at most ε" means that |h(u) − h(v)| is less than ε for every u, v in the interval. (Make sure that all points at which h jumps more than ε are grid points.)

7. Suppose that H_n and H₀ are subsets of a semimetric space H such that H_n → H₀ in the sense that: (i) every h ∈ H₀ is the limit of a sequence h_n ∈ H_n; (ii) if a subsequence h_{n_j} converges to a limit h, then h ∈ H₀. Suppose that A_n are stochastic processes indexed by H that converge in distribution in the space ℓ∞(H) to a stochastic process A that has uniformly continuous sample paths. Show that sup_{h ∈ H_n} A_n(h) ⇝ sup_{h ∈ H₀} A(h).
19 Empirical Processes
The empirical distribution of a random sample is the uniform discrete measure on the observations. In this chapter, we study the convergence of this measure and in particular the convergence of the corresponding distribution function. This leads to laws of large numbers and central limit theorems that are uniform in classes of functions. We also discuss a number of applications of these results.

19.1 Empirical Distribution Functions
Let X₁, …, X_n be a random sample from a distribution function F on the real line. The empirical distribution function is defined as

    F_n(t) = (1/n) Σ_{i=1}^n 1{X_i ≤ t}.

It is the natural estimator for the underlying distribution F if this is completely unknown. Because nF_n(t) is binomially distributed with mean nF(t), this estimator is unbiased. By the law of large numbers it is also consistent: F_n(t) → F(t) almost surely, for every t. By the central limit theorem it is asymptotically normal:

    √n(F_n(t) − F(t)) ⇝ N(0, F(t)(1 − F(t))).

In this chapter we improve on these results by considering t ↦ F_n(t) as a random function, rather than as a real-valued estimator for each t separately. This is of interest on its own account but also provides a useful starting tool for the asymptotic analysis of other statistics, such as quantiles, rank statistics, or trimmed means. The Glivenko-Cantelli theorem extends the law of large numbers and gives uniform convergence. The uniform distance

    ‖F_n − F‖∞ = sup_t |F_n(t) − F(t)|

is known as the Kolmogorov-Smirnov statistic.
19.1 Theorem (Glivenko-Cantelli). If X₁, X₂, … are i.i.d. random variables with distribution function F, then ‖F_n − F‖∞ → 0 almost surely.
Proof. By the strong law of large numbers, both F_n(t) → F(t) and F_n(t−) → F(t−) almost surely, for every t. Given a fixed ε > 0, there exists a partition −∞ = t₀ < t₁ < ··· < t_k = ∞ such that F(t_i−) − F(t_{i−1}) < ε for every i. (Points at which F jumps more than ε are points of the partition.) Now, for t_{i−1} ≤ t < t_i,

    F_n(t) − F(t) ≤ F_n(t_i−) − F(t_i−) + ε,
    F_n(t) − F(t) ≥ F_n(t_{i−1}) − F(t_{i−1}) − ε.

The convergence of F_n(t) and F_n(t−) for every fixed t is certainly uniform for t in the finite set {t₁, …, t_{k−1}}. Conclude that lim sup ‖F_n − F‖∞ ≤ ε, almost surely. This is true for every ε > 0 and hence the limit superior is zero. ∎
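The uniform convergence in the Glivenko-Cantelli theorem is easy to observe numerically. The following sketch (illustrative code, not part of the original text; the function name is ours) computes ‖F_n − F‖∞ for uniform samples of increasing size, exploiting that the supremum is attained just before or at the jump points of F_n.

```python
import random

def ks_distance(sample, F):
    """sup_t |F_n(t) - F(t)| for a continuous distribution function F.

    For continuous F the supremum over t is attained just before or at a
    jump of the empirical distribution function, i.e. at an order statistic.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(
        max((i + 1) / n - F(x), F(x) - i / n)
        for i, x in enumerate(xs)
    )

random.seed(0)
for n in (50, 500, 5000):
    sample = [random.random() for _ in range(n)]
    # true distribution function of Uniform(0,1) is F(t) = t
    print(n, round(ks_distance(sample, lambda t: t), 4))
```

With a fixed seed the printed distances shrink roughly at the 1/√n rate that the functional central limit theorem below quantifies.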
The extension of the central limit theorem to a "uniform" or "functional" central limit theorem is more involved. A first step is to prove the joint weak convergence of finitely many coordinates. By the multivariate central limit theorem, for every t₁, …, t_k,

    (√n(F_n − F)(t₁), …, √n(F_n − F)(t_k)) ⇝ (G_F(t₁), …, G_F(t_k)),

where the vector on the right has a multivariate-normal distribution, with mean zero and covariances

(19.2)    E G_F(t_i) G_F(t_j) = F(t_i ∧ t_j) − F(t_i) F(t_j).

This suggests that the sequence of empirical processes √n(F_n − F), viewed as random functions, converges in distribution to a Gaussian process G_F with zero mean and covariance functions as in the preceding display. According to an extension of Donsker's theorem, this is true in the sense of weak convergence of these processes in the Skorohod space D[−∞, ∞] equipped with the uniform norm. The limit process G_F is known as an F-Brownian bridge process, and as a standard (or uniform) Brownian bridge if F is the uniform distribution λ on [0, 1]. From the form of the covariance function it is clear that the F-Brownian bridge is obtainable as G_λ ∘ F from a standard bridge G_λ. The name "bridge" results from the fact that the sample paths of the process are zero (one says "tied down") at the endpoints −∞ and ∞. This is a consequence of the fact that the difference of two distribution functions is zero at these points.
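A standard Brownian bridge is easy to simulate on a grid, which makes the covariance structure above concrete. The sketch below (illustrative, not from the original text) builds a Brownian motion W from independent Gaussian increments and ties it down via B(t) = W(t) − tW(1); the empirical covariance at a pair of grid points then approximates s(1 − t) for s ≤ t, the uniform case of (19.2).

```python
import math
import random

def brownian_bridge(grid, rng):
    """One path of a standard Brownian bridge on an increasing grid ending at 1,
    via B(t) = W(t) - t*W(1), with W built from independent Gaussian increments."""
    w, t_prev, walk = 0.0, 0.0, []
    for t in grid:
        w += rng.gauss(0.0, math.sqrt(t - t_prev))
        t_prev = t
        walk.append(w)
    w1 = walk[-1]  # W(1)
    return [wt - t * w1 for wt, t in zip(walk, grid)]

rng = random.Random(4)
grid = [i / 10 for i in range(1, 11)]
paths = [brownian_bridge(grid, rng) for _ in range(4000)]

# Empirical covariance at (s, t) = (0.3, 0.6); the theory gives s(1 - t) = 0.12.
cov = sum(p[2] * p[5] for p in paths) / len(paths)
print(round(cov, 3))
```

Composing such a path with any continuous F, i.e. forming G_λ ∘ F, produces a process with the covariance (19.2).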
19.3 Theorem (Donsker). If X₁, X₂, … are i.i.d. random variables with distribution function F, then the sequence of empirical processes √n(F_n − F) converges in distribution in the space D[−∞, ∞] to a tight random element G_F, whose marginal distributions are zero-mean normal with covariance function (19.2).
Proof. The proof of this theorem is long. Because there is little to be gained by considering the special case of cells in the real line, we deduce the theorem from a more general result in the next section. ∎
Figure 19.1 shows some realizations of the uniform empirical process. The roughness of the sample path for n = 5000 is remarkable, and typical. It is carried over onto the limit
Figure 19.1. Three realizations of the uniform empirical process, of 50 (top), 500 (middle), and 5000 (bottom) observations, respectively.
process, for it can be shown that, for every t,

    0 < lim inf_{h→0} |G_λ(t + h) − G_λ(t)| / √|h log log h| ≤ lim sup_{h→0} |G_λ(t + h) − G_λ(t)| / √|h log log h| < ∞,   a.s.

Thus, the increments of the sample paths of a standard Brownian bridge are close to being of the order √|h|. This means that the sample paths are continuous, but nowhere differentiable.
A related process is the Brownian motion process, which can be defined by Z_λ(t) = G_λ(t) + tZ for a standard normal variable Z independent of G_λ. The addition of tZ "liberates" the sample paths at t = 1 but retains the "tie" at t = 0. The Brownian motion process has the same modulus of continuity as the Brownian bridge and is considered an appropriate model for the physical Brownian movement of particles in a gas. The three coordinates of a particle starting at the origin at time 0 would be taken equal to three independent Brownian motions.

The one-dimensional empirical process and its limits have been studied extensively.† For instance, the Glivenko-Cantelli theorem can be strengthened to a law of the iterated logarithm,

    lim sup_{n→∞} √(n / (2 log log n)) ‖F_n − F‖∞ ≤ 1/2,   a.s.,

with equality if F takes on the value 1/2. This can be further strengthened to Strassen's theorem:

    √(n / (2 log log n)) (F_n − F) ⇝ H ∘ F,   a.s.

Here H ∘ F is the class of all functions h ∘ F if h : [0, 1] → ℝ ranges over the set of absolutely continuous functions‡ with h(0) = h(1) = 0 and ∫₀¹ h′(s)² ds ≤ 1. The notation h_n ⇝ H means that the sequence h_n is relatively compact with respect to the uniform norm, with the collection of all limit points being exactly equal to H. Strassen's theorem gives a fairly precise idea of the fluctuations of the empirical process √n(F_n − F) when converging in law to G_F.

The preceding results show that the uniform distance of F_n to F is maximally of the order √(log log n / n) as n → ∞. It is also known that

    lim inf_{n→∞} √(2n log log n) ‖F_n − F‖∞ = π/2,   a.s.

Thus the uniform distance is asymptotically (along the sequence) at least of the order 1/√(n log log n). A famous theorem, the DKW inequality after Dvoretsky, Kiefer, and Wolfowitz, gives a bound on the tail probabilities of ‖F_n − F‖∞: For every x,

    P(√n ‖F_n − F‖∞ > x) ≤ 2e^{−2x²}.

The original DKW inequality did not specify the leading constant 2, which cannot be improved. In this form the inequality was found as recently as 1990 (see [103]).

The central limit theorem can be strengthened through strong approximations. These give a special construction of the empirical process and Brownian bridges, on the same probability space, that are close not only in a distributional sense but also in a pointwise sense. One such result asserts that there exists a probability space carrying i.i.d. random variables X₁, X₂, … with law F and a sequence of Brownian bridges G_{F,n} such that

    lim sup_{n→∞} (√n / (log n)²) ‖√n(F_n − F) − G_{F,n}‖∞ < ∞,   a.s.

† See [134] for the following and many other results on the univariate empirical process.
‡ A function is absolutely continuous if it is the primitive function ∫ g(s) ds of an integrable function g. Then it is almost-everywhere differentiable with derivative g.
Because, by construction, every G_{F,n} is equal in law to G_F, this implies that √n(F_n − F) ⇝ G_F (Donsker's theorem), but it implies a lot more. Apparently, the distance between the sequence √n(F_n − F) and its limit is of the order O((log n)²/√n). After the method of proof and the country of origin, results of this type are also known as Hungarian embeddings. Another construction yields the estimate, for fixed constants a, b, and c and every x > 0,

    P(√n ‖√n(F_n − F) − G_{F,n}‖∞ > a log n + x) ≤ b e^{−cx}.
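The DKW bound above is simple enough to check by simulation. The following sketch (illustrative code, not from the original text) estimates the tail probability P(√n ‖F_n − F‖∞ > x) for uniform samples and compares it with 2e^{−2x²}.

```python
import math
import random

def ks_stat(sample):
    """Kolmogorov-Smirnov statistic sup_t |F_n(t) - t| for Uniform(0,1) data."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def dkw_tail(n, reps, x, seed=1):
    """Monte Carlo estimate of P(sqrt(n) * D_n > x) together with the DKW bound."""
    rng = random.Random(seed)
    exceed = sum(
        math.sqrt(n) * ks_stat([rng.random() for _ in range(n)]) > x
        for _ in range(reps)
    )
    return exceed / reps, 2 * math.exp(-2 * x * x)

emp, bound = dkw_tail(n=100, reps=2000, x=1.0)
print(f"empirical tail {emp:.3f}, DKW bound {bound:.3f}")
```

At x = 1 the bound 2e^{−2} ≈ 0.271 is nearly attained in the limit, reflecting that the constant 2 cannot be improved.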
19.2 Empirical Distributions
Let X₁, …, X_n be a random sample from a probability distribution P on a measurable space (𝒳, 𝒜). The empirical distribution is the discrete uniform measure on the observations. We denote it by ℙ_n = n⁻¹ Σ_{i=1}^n δ_{X_i}, where δ_x is the probability distribution that is degenerate at x. Given a measurable function f : 𝒳 → ℝ, we write ℙ_n f for the expectation of f under the empirical measure, and Pf for the expectation under P. Thus

    ℙ_n f = (1/n) Σ_{i=1}^n f(X_i),    Pf = ∫ f dP.

Actually, this chapter is concerned with these maps rather than with ℙ_n as a measure. By the law of large numbers, the sequence ℙ_n f converges almost surely to Pf, for every f such that Pf is defined. The abstract Glivenko-Cantelli theorems make this result uniform in f ranging over a class of functions. A class ℱ of measurable functions f : 𝒳 → ℝ is called P-Glivenko-Cantelli if

    ‖ℙ_n f − Pf‖_ℱ = sup_{f ∈ ℱ} |ℙ_n f − Pf| → 0,   outer almost surely.

The empirical process evaluated at f is defined as G_n f = √n(ℙ_n f − Pf). By the multivariate central limit theorem, given any finite set of measurable functions f_i with Pf_i² < ∞,

    (G_n f₁, …, G_n f_k) ⇝ (G_P f₁, …, G_P f_k),

where the vector on the right possesses a multivariate-normal distribution with mean zero and covariances

    E G_P f G_P g = Pfg − Pf Pg.

The abstract Donsker theorems make this result "uniform" in classes of functions. A class ℱ of measurable functions f : 𝒳 → ℝ is called P-Donsker if the sequence of processes {G_n f : f ∈ ℱ} converges in distribution to a tight limit process in the space ℓ∞(ℱ). Then the limit process is a Gaussian process G_P with zero mean and covariance function as given in the preceding display, and is known as a P-Brownian bridge. Of course, the Donsker property includes the requirement that the sample paths f ↦ G_n f are uniformly bounded for every n and every realization of X₁, …, X_n. This is the case, for instance, if the class ℱ
has a finite and integrable envelope function F: a function such that |f(x)| ≤ F(x) < ∞, for every x and f. It is not required that the function x ↦ F(x) be uniformly bounded.

For convenience of terminology, we define a class ℱ of vector-valued functions f : 𝒳 → ℝᵏ to be Glivenko-Cantelli or Donsker if each of the classes of coordinates f_i : 𝒳 → ℝ with f = (f₁, …, f_k) ranging over ℱ (i = 1, 2, …, k) is Glivenko-Cantelli or Donsker. It can be shown that this is equivalent to the union of the k coordinate classes being Glivenko-Cantelli or Donsker.

Whether a class of functions is Glivenko-Cantelli or Donsker depends on the "size" of the class. A finite class of integrable functions is always Glivenko-Cantelli, and a finite class of square-integrable functions is always Donsker. On the other hand, the class of all square-integrable functions is Glivenko-Cantelli, or Donsker, only in trivial cases. A relatively simple way to measure the size of a class ℱ is in terms of entropy. We shall mainly consider the bracketing entropy relative to the L_r(P)-norm

    ‖f‖_{P,r} = (P|f|^r)^{1/r}.

Given two functions l and u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u. An ε-bracket in L_r(P) is a bracket [l, u] with P(u − l)^r < ε^r. The bracketing number N_[](ε, ℱ, L_r(P)) is the minimum number of ε-brackets needed to cover ℱ. (The bracketing functions l and u must have finite L_r(P)-norms but need not belong to ℱ.) The entropy with bracketing is the logarithm of the bracketing number. A simple condition for a class to be P-Glivenko-Cantelli is that the bracketing numbers in L₁(P) are finite for every ε > 0. The proof is a straightforward generalization of the proof of the classical Glivenko-Cantelli theorem, Theorem 19.1, and is omitted.
19.4 Theorem (Glivenko-Cantelli). Every class ℱ of measurable functions such that N_[](ε, ℱ, L₁(P)) < ∞ for every ε > 0 is P-Glivenko-Cantelli.
For most classes of interest, the bracketing numbers N_[](ε, ℱ, L_r(P)) grow to infinity as ε ↓ 0. A sufficient condition for a class to be Donsker is that they do not grow too fast. The speed can be measured in terms of the bracketing integral

    J_[](δ, ℱ, L₂(P)) = ∫₀^δ √(log N_[](ε, ℱ, L₂(P))) dε.

If this integral is finite-valued, then the class ℱ is P-Donsker. The integrand in the integral is a decreasing function of ε. Hence, the convergence of the integral depends only on the size of the bracketing numbers for ε ↓ 0. Because ∫₀¹ ε^{−r} dε converges for r < 1 and diverges for r ≥ 1, the integral condition roughly requires that the entropies grow of slower order than (1/ε)².

19.5 Theorem (Donsker). Every class ℱ of measurable functions with J_[](1, ℱ, L₂(P)) < ∞ is P-Donsker.
Proof. Let 𝒢 be the collection of all differences f − g if f and g range over ℱ. With a given set of ε-brackets [l_i, u_i] over ℱ we can construct 2ε-brackets over 𝒢 by taking differences [l_i − u_j, u_i − l_j] of upper and lower bounds. Therefore, the bracketing numbers N_[](ε, 𝒢, L₂(P)) are bounded by the squares of the bracketing numbers N_[](ε/2, ℱ, L₂(P)). Taking a logarithm turns the square into a multiplicative factor 2, and hence the entropy integrals of ℱ and 𝒢 are proportional.

For a given, small δ > 0, choose a minimal number of brackets of size δ that cover ℱ, and use them to form a partition of ℱ = ∪_i ℱ_i in sets of diameters smaller than δ. The subset of 𝒢 consisting of differences f − g of functions f and g belonging to the same partitioning set consists of functions of L₂(P)-norm smaller than δ. Hence, by Lemma 19.34 ahead, there exists a finite number a(δ) such that

    E* sup_i sup_{f,g ∈ ℱ_i} |G_n(f − g)| ≲ J_[](δ, ℱ, L₂(P)) + √n PF1{F > a(δ)√n}.

Here the envelope function F can be taken equal to the supremum of the absolute values of the upper and lower bounds of finitely many brackets that cover ℱ, for instance a minimal set of brackets of size 1. This F is square-integrable. The second term on the right is bounded by a(δ)⁻¹PF²1{F > a(δ)√n} and hence converges to zero as n → ∞ for every fixed δ. The integral converges to zero as δ → 0. The theorem follows from Theorem 18.14, in view of Markov's inequality. ∎
19.6 Example (Distribution function). If ℱ is equal to the collection of all indicator functions of the form f_t = 1_{(−∞,t]}, with t ranging over ℝ, then the empirical process G_n f_t is the classical empirical process √n(F_n(t) − F(t)). The preceding theorems reduce to the classical theorems by Glivenko-Cantelli and Donsker.

We can see this by bounding the bracketing numbers of the set of indicator functions f_t. Consider brackets of the form [1_{(−∞,t_{i−1}]}, 1_{(−∞,t_i)}] for a grid of points −∞ = t₀ < t₁ < ··· < t_k = ∞ with the property F(t_i−) − F(t_{i−1}) < ε for each i. These brackets have L₁(F)-size ε. Their total number k can be chosen smaller than 2/ε. Because Ff² ≤ Ff for every 0 ≤ f ≤ 1, the L₂(F)-size of the brackets is bounded by √ε. Thus N_[](√ε, ℱ, L₂(F)) ≤ 2/ε, whence the bracketing numbers are of the polynomial order (1/ε)². This means that this class of functions is very small, because a function of the type log(1/ε) satisfies the entropy condition of Theorem 19.5 easily. □
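The grid construction in the preceding example can be made concrete. The sketch below (illustrative, not from the original text) places the grid points at quantiles of a continuous F, so that the consecutive F-increments, and hence the L₁(F)-sizes of the brackets, are at most ε, with roughly 1/ε brackets in total.

```python
import math

def cell_bracket_grid(eps, F_inv):
    """Grid t_0 < ... < t_k with F-increments <= eps, for a continuous F given
    through its quantile function F_inv.  The brackets for the cells are then
    [1_(-inf, t_{i-1}], 1_(-inf, t_i)), each of L1(F)-size at most eps."""
    k = math.ceil(1 / eps)
    return [F_inv(min(i * eps, 1.0)) for i in range(k + 1)]

# Uniform F on [0, 1]: the quantile function is the identity.
grid = cell_bracket_grid(0.25, lambda u: u)
sizes = [b - a for a, b in zip(grid, grid[1:])]  # L1(F)-sizes of the brackets
print(grid, max(sizes))
```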
19.7 Example (Parametric class). Let ℱ = {f_θ : θ ∈ Θ} be a collection of measurable functions indexed by a bounded subset Θ ⊂ ℝᵈ. Suppose that there exists a measurable function m such that

    |f_{θ₁}(x) − f_{θ₂}(x)| ≤ m(x) ‖θ₁ − θ₂‖.

If P|m|^r < ∞, then there exists a constant K, depending on Θ and d only, such that the bracketing numbers satisfy

    N_[](ε‖m‖_{P,r}, ℱ, L_r(P)) ≤ K (diam Θ / ε)^d,   every 0 < ε < diam Θ.

Thus the entropy is of the order log(1/ε). Hence the bracketing entropy integral certainly converges, and the class of functions ℱ is Donsker.

To establish the upper bound we use brackets of the type [f_θ − εm, f_θ + εm] for θ ranging over a suitably chosen subset of Θ. These brackets have L_r(P)-size 2ε‖m‖_{P,r}. If θ ranges over a grid of meshwidth ε over Θ, then the brackets cover ℱ, because by the Lipschitz condition, f_{θ₁} − εm ≤ f_{θ₂} ≤ f_{θ₁} + εm if ‖θ₁ − θ₂‖ ≤ ε. Thus, we need as many brackets as we need balls of radius ε/2 to cover Θ.
The size of Θ in every fixed dimension is at most diam Θ. We can cover Θ with fewer than (diam Θ/ε)^d cubes of size ε. The circumscribed balls have radius a multiple of ε and also cover Θ. If we replace the centers of these balls by their projections into Θ, then the balls of twice the radius still cover Θ. □

19.8 Example (Pointwise compact class). The parametric class in Example 19.7 is certainly Glivenko-Cantelli, but for this a much weaker continuity condition also suffices. Let ℱ = {f_θ : θ ∈ Θ} be a collection of measurable functions with integrable envelope function F indexed by a compact metric space Θ such that the map θ ↦ f_θ(x) is continuous for every x. Then the L₁-bracketing numbers of ℱ are finite and hence ℱ is Glivenko-Cantelli.

We can construct the brackets in the obvious way in the form [f_B, f^B], where B is an open ball, and f_B and f^B are the infimum and supremum of f_θ for θ ∈ B, respectively. Given a sequence of balls B_m with common center a given θ and radii decreasing to 0, we have f^{B_m} − f_{B_m} ↓ f_θ − f_θ = 0 by the continuity, pointwise in x and hence also in L₁ by the dominated-convergence theorem and the integrability of the envelope. Thus, given ε > 0, for every θ there exists an open ball B around θ such that the bracket [f_B, f^B] has size at most ε. By the compactness of Θ, the collection of balls constructed in this way has a finite subcover. The corresponding brackets cover ℱ. This construction shows that the bracketing numbers are finite, but it gives no control on their sizes. □
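To see the uniform law of large numbers for such parametric classes in action, the sketch below (illustrative, not from the original text) takes f_θ(x) = |x − θ| with θ ∈ [0, 1] and P the uniform distribution on [0, 1], for which Pf_θ = θ² − θ + 1/2, and evaluates sup_θ |ℙ_n f_θ − Pf_θ| over a fine grid. This class is Lipschitz in θ with m(x) = 1, so it is covered by Example 19.7.

```python
import random

def sup_deviation(sample, thetas):
    """sup over theta of |P_n f_theta - P f_theta| for f_theta(x) = |x - theta|,
    with P the uniform distribution on [0, 1], so P f_theta = theta^2 - theta + 1/2."""
    n = len(sample)
    return max(
        abs(sum(abs(x - th) for x in sample) / n - (th * th - th + 0.5))
        for th in thetas
    )

random.seed(2)
thetas = [i / 100 for i in range(101)]
for n in (100, 10000):
    sample = [random.random() for _ in range(n)]
    print(n, round(sup_deviation(sample, thetas), 4))
```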
19.9 Example (Smooth functions). Let ℝᵈ = ∪_j I_j be a partition in cubes of volume 1 and let ℱ be the class of all functions f : ℝᵈ → ℝ whose partial derivatives up to order α exist and are uniformly bounded by constants M_j on each of the cubes I_j. (The condition includes bounds on the "zero-th derivative," which is f itself.) Then the bracketing numbers of ℱ satisfy, for every V ≥ d/α and every probability measure P,

    log N_[](ε, ℱ, L_r(P)) ≤ K (1/ε)^V ( Σ_j (M_j^{Vr/(V+r)} P(I_j)^{V/(V+r)}) )^{(V+r)/V}.

The constant K depends on V, r, and d only. If the series on the right converges for r = 2 and some d/α ≤ V < 2, then the bracketing entropy integral of the class ℱ converges and hence the class is P-Donsker.† This requires sufficient smoothness α > d/2 and sufficiently small tail probabilities P(I_j) relative to the uniform bounds M_j. If the functions f have compact support (equivalently, M_j = 0 for all large j), then smoothness of order α > d/2 suffices. □
19.10 Example (Sobolev classes). Let ℱ be the set of all functions f : [0, 1] → ℝ such that ‖f‖∞ ≤ 1 and the (k − 1)-th derivative is absolutely continuous with ∫(f^{(k)})²(x) dx ≤ 1, for some fixed k ∈ ℕ. Then there exists a constant K such that, for every ε > 0,‡

    log N_[](ε, ℱ, ‖·‖∞) ≤ K (1/ε)^{1/k}.

Thus, the class ℱ is Donsker for every k ≥ 1 and every P. □
† The upper bound and this sufficient condition can be slightly improved. For this and a proof of the upper bound, see, e.g., [146, Corollary 2.7.4].
‡ See [16].
19.11 Example (Bounded variation). Let ℱ be the collection of all monotone functions f : ℝ → [−1, 1], or, bigger, the set of all functions that are of variation bounded by 1. These are the differences of pairs of monotonely increasing functions that together increase at most 1. Then there exists a constant K such that, for every r ≥ 1 and probability measure P,†

    log N_[](ε, ℱ, L_r(P)) ≤ K (1/ε).
Thus, this class of functions is P-Donsker for every P. □

19.12 Example (Weighted distribution function). Let w : (0, 1) → ℝ⁺ be a fixed, continuous function. The weighted empirical process of a sample of real-valued observations is the process

    t ↦ √n(F_n − F)(t) w(F(t))

(defined to be zero if F(t) = 0 or F(t) = 1). For a bounded function w, the map z ↦ z · (w ∘ F) is continuous from ℓ∞[−∞, ∞] into ℓ∞[−∞, ∞], and hence the weak convergence of the weighted empirical process follows from the convergence of the ordinary empirical process and the continuous-mapping theorem. Of more interest are weight functions that are unbounded at 0 or 1, which can be used to rescale the empirical process at its two extremes −∞ and ∞. Because the difference (F_n − F)(t) converges to 0 as t → ±∞, the sample paths of the weighted process may be bounded even for unbounded w, and the rescaling increases our knowledge of the behavior at the two extremes. A simple condition for the weak convergence of the weighted empirical process in ℓ∞(−∞, ∞) is that the weight function w is monotone around 0 and 1 and satisfies ∫₀¹ w²(s) ds < ∞. The square-integrability is almost necessary, because the convergence is known to fail for w(t) = 1/√(t(1 − t)). The Chibisov-O'Reilly theorem gives necessary and sufficient conditions but is more complicated.

We shall give the proof for the case that w is unbounded at only one endpoint and decreases from w(0) = ∞ to w(1) = 0. Furthermore, we assume that F is the uniform measure on [0, 1]. (The general case can be treated in the same way, or by the quantile transformation.) Then the function v(s) = w²(s) with domain [0, 1] has an inverse v⁻¹(t) = w⁻¹(√t) with domain [0, ∞]. A picture of the graphs shows that ∫₀^∞ w⁻¹(√t) dt = ∫₀¹ w²(t) dt, which is finite by assumption. Thus, given an ε > 0, we can choose partitions 0 = s₀ < s₁ < ··· < s_k = 1 and 0 = t₀ < t₁ < ··· < t_l = ∞ such that, for every i and j,

    ∫_{s_{i−1}}^{s_i} w²(s) ds < ε²,    ∫_{t_{j−1}}^{t_j} w⁻¹(√t) dt < ε².

This corresponds to slicing the area under w² both horizontally and vertically in pieces of size ε². Let the partition 0 = u₀ < u₁ < ··· < u_m = 1 be the partition consisting of all points s_i and all points w⁻¹(√t_j). Then, for every i,

    w²(u_{i−1}) u_i − w²(u_i) u_{i−1} ≤ 2ε².

† See, e.g., [146, Theorem 2.7.5].
It follows that the brackets have L₁(λ)-size 2ε². Their square roots are brackets for the functions of interest x ↦ w(t)1_{[0,t]}(x), and have L₂(λ)-size √2 ε, because P|√u − √l|² ≤ P|u − l|. Because the number m of points in the partitions can be chosen of the order (1/ε)² for small ε, the bracketing integral of the class of functions x ↦ w(t)1_{[0,t]}(x) converges easily. □

The conditions given by the preceding theorems are not necessary, but the theorems cover many examples. Simple necessary and sufficient conditions are not known and may not exist. An alternative set of relatively simple conditions is based on "uniform covering numbers." The covering number N(ε, ℱ, L₂(Q)) is the minimal number of L₂(Q)-balls of radius ε needed to cover the set ℱ. The entropy is the logarithm of the covering number. The following theorems show that the bracketing numbers in the preceding Glivenko-Cantelli and Donsker theorems can be replaced by the uniform covering numbers

    sup_Q N(ε‖F‖_{Q,r}, ℱ, L_r(Q)).

Here the supremum is taken over all probability measures Q for which the class ℱ is not identically zero (and hence ‖F‖_{Q,r}^r = QF^r > 0). The uniform covering numbers are relative to a given envelope function F. This is fortunate, because the covering numbers under different measures Q typically are more stable if standardized by the norm ‖F‖_{Q,r} of the envelope function. In comparison, in the case of bracketing numbers we consider a single distribution P, and standardization by an envelope does not make much of a difference. The uniform entropy integral is defined as

    J(δ, ℱ, L₂) = ∫₀^δ √(log sup_Q N(ε‖F‖_{Q,2}, ℱ, L₂(Q))) dε.
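Covering numbers of a simple class under an empirical measure can be bounded numerically. The sketch below (illustrative, not from the original text) uses a greedy procedure, which yields an upper bound on N(ε, ℱ, L₁(Q_n)), for the cells (−∞, t] evaluated on n support points; the count grows only linearly in n, in line with the polynomial bounds for VC classes discussed next.

```python
def greedy_covering_number(value_vectors, eps):
    """Greedy upper bound on the covering number N(eps, F, L1(Q_n)), with each
    function represented by its vector of values on the n support points of Q_n."""
    n = len(value_vectors[0])
    centers = []
    for v in value_vectors:
        if all(sum(abs(a - b) for a, b in zip(v, c)) / n > eps for c in centers):
            centers.append(v)
    return len(centers)

# Cells (-inf, t]: on n = 20 equally likely points there are only n + 1 distinct
# value vectors (0 ones, 1 one, ..., 20 ones), however many thresholds t we try.
pts = [i / 20 for i in range(20)]
cells = [[1.0 if x <= (t - 1) / 20 else 0.0 for x in pts] for t in range(22)]
print(greedy_covering_number(cells, 0.05))
```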
19.13 Theorem (Glivenko-Cantelli). Let ℱ be a suitably measurable class of measurable functions with sup_Q N(ε‖F‖_{Q,1}, ℱ, L₁(Q)) < ∞ for every ε > 0. If P*F < ∞, then ℱ is P-Glivenko-Cantelli.

19.14 Theorem (Donsker). Let ℱ be a suitably measurable class of measurable functions with J(1, ℱ, L₂) < ∞. If P*F² < ∞, then ℱ is P-Donsker.

The condition that the class ℱ be "suitably measurable" is satisfied in most examples but cannot be omitted. We do not give a general definition here but note that it suffices that there exists a countable collection 𝒢 of functions such that each f is the pointwise limit of a sequence g_m in 𝒢.† An important class of examples for which good estimates on the uniform covering numbers are known are the so-called Vapnik-Červonenkis classes, or VC classes, which are defined through combinatorial properties and include many well-known examples.

† See, for example, [117], [120], or [146] for proofs of the preceding theorems and other unproven results in this section.
Figure 19.2. The subgraph of a function.
Say that a collection 𝒞 of subsets of the sample space 𝒳 picks out a certain subset A of the finite set {x₁, …, x_n} ⊂ 𝒳 if it can be written as A = {x₁, …, x_n} ∩ C for some C ∈ 𝒞. The collection 𝒞 is said to shatter {x₁, …, x_n} if 𝒞 picks out each of its 2ⁿ subsets. The VC index V(𝒞) of 𝒞 is the smallest n for which no set of size n is shattered by 𝒞. A collection 𝒞 of measurable sets is called a VC class if its index V(𝒞) is finite. More generally, we can define VC classes of functions. A collection ℱ is a VC class of functions if the collection of all subgraphs {(x, t) : f(x) < t}, if f ranges over ℱ, forms a VC class of sets in 𝒳 × ℝ (Figure 19.2). It is not difficult to see that a collection of sets 𝒞 is a VC class of sets if and only if the collection of corresponding indicator functions 1_C is a VC class of functions. Thus, it suffices to consider VC classes of functions.

By definition, a VC class of sets picks out strictly less than 2ⁿ subsets from any set of n ≥ V(𝒞) elements. The surprising fact, known as Sauer's lemma, is that such a class can necessarily pick out only a polynomial number O(n^{V(𝒞)−1}) of subsets, well below the 2ⁿ − 1 that the definition appears to allow. Now, the number of subsets picked out by a collection 𝒞 is closely related to the covering numbers of the class of indicator functions {1_C : C ∈ 𝒞} in L₁(Q) for discrete, empirical-type measures Q. By a clever argument, Sauer's lemma can be used to bound the uniform covering (or entropy) numbers for this class.
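The combinatorial definitions can be explored directly in code. The sketch below (illustrative, not from the original text) represents a collection of sets by membership predicates and checks which subsets of a finite point set are picked out; for the cells (−∞, t] it confirms the VC index 2 computed in Example 19.16 below, and a linear (Sauer-type) count of picked-out subsets.

```python
def picks_out(collection, points):
    """The subsets of `points` picked out by `collection` (membership predicates)."""
    return {frozenset(x for x in points if C(x)) for C in collection}

def shatters(collection, points):
    return len(picks_out(collection, points)) == 2 ** len(points)

# Cells (-inf, t], represented over a grid of thresholds t.
cells = [lambda x, t=t: x <= t for t in [i / 10 for i in range(-20, 21)]]

print(shatters(cells, [0.3]))                    # one-point sets are shattered
print(shatters(cells, [0.3, 0.7]))               # {0.7} alone cannot be picked out
print(len(picks_out(cells, [0.1, 0.2, 0.35])))   # only n + 1 subsets from n points
```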
19.15 Lemma. There exists a universal constant K such that for any VC class ℱ of functions, any r ≥ 1 and 0 < ε < 1,

    sup_Q N(ε‖F‖_{Q,r}, ℱ, L_r(Q)) ≤ K V(ℱ) (16e)^{V(ℱ)} (1/ε)^{r(V(ℱ)−1)}.

Consequently, VC classes are examples of polynomial classes in the sense that their covering numbers are bounded by a polynomial in 1/ε. They are relatively small. The
upper bound shows that VC classes satisfy the entropy conditions for the Glivenko-Cantelli theorem and Donsker theorem discussed previously (with much to spare). Thus, they are P-Glivenko-Cantelli and P-Donsker under the moment conditions P*F < ∞ and P*F² < ∞ on their envelope function, if they are "suitably measurable." (The VC property does not imply the measurability.)

19.16 Example (Cells). The collection of all cells (−∞, t] in the real line is a VC class of index V(𝒞) = 2. This follows, because every one-point set {x₁} is shattered, but no two-point set {x₁, x₂} is shattered: If x₁ < x₂, then the cells (−∞, t] cannot pick out {x₂}. □

19.17 Example (Vector spaces). Let ℱ be the set of all linear combinations Σ λ_i f_i of a given, finite set of functions f₁, …, f_k on 𝒳. Then ℱ is a VC class and hence has a finite uniform entropy integral. Furthermore, the same is true for the class of all sets {f > c} if f ranges over ℱ and c over ℝ. For instance, we can construct ℱ to be the set of all polynomials of degree less than some number, by taking basis functions 1, x, x², … on ℝ, and monomials more generally. For polynomials of degree up to 2 the collection of sets {f > 0} contains already all half-spaces and all ellipsoids. Thus, for instance, the collection of all ellipsoids is Glivenko-Cantelli and Donsker for any P.

To prove that ℱ is a VC class, consider any collection of n = k + 2 points (x₁, t₁), …, (x_n, t_n) in 𝒳 × ℝ. We shall show this set is not shattered by ℱ, whence V(ℱ) ≤ n. By assumption, the vectors (f(x₁) − t₁, …, f(x_n) − t_n)ᵀ are contained in a (k + 1)-dimensional subspace of ℝⁿ. Any vector a that is orthogonal to this subspace satisfies

    Σ_{i : a_i > 0} a_i (f(x_i) − t_i) = Σ_{i : a_i < 0} (−a_i)(f(x_i) − t_i).

There is at least one a_i > 0 (if necessary, replace a by −a), and then the set {(x_i, t_i) : a_i > 0} is nonempty and is not picked out by the subgraphs of ℱ. If it were, then it would be of the form {(x_i, t_i) : t_i < f(x_i)} for some f, but then the left side of the display would be strictly positive and the right side nonpositive. □

A number of operations allow to build new VC classes or Donsker classes out of known VC classes or Donsker classes.

19.18 Example (Stability properties). The class of all complements Cᶜ, all intersections C ∩ D, all unions C ∪ D, and all Cartesian products C × D of sets C and D that range over VC classes 𝒞 and 𝒟 is VC. The class of all suprema f ∨ g and infima f ∧ g of functions f and g that range over VC classes ℱ and 𝒢 is VC.

The proof that the collection of all intersections is VC is easy upon using Sauer's lemma, according to which a VC class can pick out only a polynomial number of subsets. From n given points 𝒞 can pick out at most O(n^{V(𝒞)}) subsets. From each of these subsets 𝒟 can pick out at most O(n^{V(𝒟)}) further subsets. A subset picked out by C ∩ D is equal to the subset picked out by C intersected with D. Thus we get all subsets by following the
two-step procedure, and hence the class of all intersections C ∩ D can pick out at most O(n^{V(𝒞)+V(𝒟)}) subsets. For large n this is well below 2ⁿ, whence this class cannot pick out all subsets. That the set of all complements is VC is an immediate consequence of the definition. Next the result for the unions follows by combination, because C ∪ D = (Cᶜ ∩ Dᶜ)ᶜ. The results for functions are consequences of the results for sets, because the subgraphs of suprema and infima are the intersections and unions of the subgraphs, respectively. □
19.19 Example (Uniform entropy). If ℱ and 𝒢 possess a finite uniform entropy integral, relative to envelope functions F and G, then so does the class ℱ𝒢 of all functions x ↦ f(x)g(x), relative to the envelope function FG. More generally, suppose that φ : ℝ² → ℝ is a function such that, for given functions L_f and L_g and every x,

    |φ(f₁(x), g₁(x)) − φ(f₂(x), g₂(x))| ≤ L_f(x)|f₁ − f₂|(x) + L_g(x)|g₁ − g₂|(x).

Then the class of all functions φ(f, g) − φ(f₀, g₀) has a finite uniform entropy integral relative to the envelope function L_f F + L_g G, whenever ℱ and 𝒢 have finite uniform entropy integrals relative to the envelopes F and G. □
19.20 Example (Lipschitz transformations). For any fixed Lipschitz function φ: ℝ² ↦ ℝ, the class of all functions of the form φ(f, g) is Donsker, if f and g range over Donsker classes ℱ and 𝒢 with integrable envelope functions. For example, the class of all sums f + g, all minima f ∧ g, and all maxima f ∨ g are Donsker. If the classes ℱ and 𝒢 are uniformly bounded, then also the products fg form a Donsker class, and if the functions f are uniformly bounded away from zero, then the functions 1/f form a Donsker class. □
19.3
Goodness-of-Fit Statistics
An important application of the empirical distribution is the testing of goodness-of-fit. Because the empirical distribution 𝔽_n is always a reasonable estimator for the underlying distribution P of the observations, any measure of the discrepancy between 𝔽_n and P can be used as a test statistic for testing the hypothesis that the true underlying distribution is P. Some popular global measures of discrepancy for real-valued observations are

√n ‖𝔽_n − F‖_∞   (Kolmogorov–Smirnov),
n ∫ (𝔽_n − F)² dF   (Cramér–von Mises).
These statistics, as well as many others, are continuous functions of the empirical process. The continuous-mapping theorem and Theorem 19.3 immediately imply the following result.

19.21 Corollary. If X₁, X₂, … are i.i.d. random variables with distribution function F, then the sequences of Kolmogorov–Smirnov statistics and Cramér–von Mises statistics converge in distribution to ‖G_F‖_∞ and ∫ G_F² dF, respectively. The distributions of these limits are the same for every continuous distribution function F.
Empirical Processes
Proof. The maps z ↦ ‖z‖_∞ and z ↦ ∫ z²(t) dF(t) from D[−∞, ∞] into ℝ are continuous with respect to the supremum norm. Consequently, the first assertion follows from the continuous-mapping theorem. The second assertion follows by the change of variables F(t) ↦ u in the representation G_F = G ∘ F of the Brownian bridge. Alternatively, use the quantile transformation to see that the Kolmogorov–Smirnov and Cramér–von Mises statistics are distribution-free for every fixed n. ∎
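The distribution-freeness noted at the end of the proof is visible in a small Monte Carlo experiment. The sketch below (stdlib Python only; the sample sizes, distributions, and seed are arbitrary illustrative choices, not from the text) simulates √n‖𝔽_n − F‖_∞ under two different continuous distributions and compares the Monte Carlo means, which should agree.

```python
import math, random

def ks_stat(sample, cdf):
    """sqrt(n) * sup_t |F_n(t) - F(t)|, evaluated at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        u = cdf(x)
        d = max(d, abs((i + 1) / n - u), abs(i / n - u))
    return math.sqrt(n) * d

rng = random.Random(0)
n, reps = 200, 400
unif = [ks_stat([rng.random() for _ in range(n)], lambda t: t)
        for _ in range(reps)]
expo = [ks_stat([rng.expovariate(1.0) for _ in range(n)],
                lambda t: 1 - math.exp(-t)) for _ in range(reps)]
# distribution-freeness: the two Monte Carlo samples should look alike,
# both with mean near E||G||_inf, which is about 0.87
print(sum(unif) / reps, sum(expo) / reps)
```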
It is probably practically more relevant to test the goodness-of-fit of composite null hypotheses, for instance the hypothesis that the underlying distribution P of a random sample is normal, that is, that it belongs to the normal location-scale family. To test the null hypothesis that P belongs to a certain family {P_θ : θ ∈ Θ}, it is natural to use a measure of the discrepancy between 𝔽_n and P_θ̂, for a reasonable estimator θ̂ of θ. For instance, a modified Kolmogorov–Smirnov statistic for testing normality is

sup_t √n |𝔽_n(t) − Φ((t − X̄)/σ̂)|.
For many goodness-of-fit statistics of this type, the limit distribution follows from the limit distribution of √n(𝔽_n − P_θ̂). This is not a Brownian bridge, but also contains a "drift," due to θ̂. Informally, if θ ↦ P_θ has a derivative Ṗ_θ in an appropriate sense, then

√n(𝔽_n − P_θ̂) = √n(𝔽_n − P_θ) − √n(P_θ̂ − P_θ)
             ≈ √n(𝔽_n − P_θ) − √n(θ̂ − θ)ᵀ Ṗ_θ.    (19.22)

By the continuous-mapping theorem, the limit distribution of the last approximation can be derived from the limit distribution of the sequence (√n(𝔽_n − P_θ), √n(θ̂ − θ)). The first component converges in distribution to a Brownian bridge. Its joint behavior with √n(θ̂ − θ) can most easily be obtained if the latter sequence is asymptotically linear. Assume that

√n(θ̂_n − θ) = (1/√n) Σ_{i=1}^n ψ_θ(X_i) + o_P(1),

for "influence functions" ψ_θ with P_θ ψ_θ = 0 and P_θ ‖ψ_θ‖² < ∞.
19.23 Theorem. Let X₁, …, X_n be a random sample from a distribution P_θ indexed by θ ∈ ℝᵏ. Let ℱ be a P_θ-Donsker class of measurable functions and let θ̂_n be estimators that are asymptotically linear with influence function ψ_θ. Assume that the map θ ↦ P_θ from ℝᵏ to ℓ∞(ℱ) is Fréchet differentiable at θ.† Then the sequence √n(𝔽_n − P_θ̂_n) converges under θ in distribution in ℓ∞(ℱ) to the process f ↦ G_{P_θ} f − (G_{P_θ} ψ_θ)ᵀ Ṗ_θ f.
Proof. In view of the differentiability of the map θ ↦ P_θ and Lemma 2.12,

‖P_θ̂_n − P_θ − (θ̂_n − θ)ᵀ Ṗ_θ‖_ℱ = o_P(‖θ̂_n − θ‖).

This justifies the approximation (19.22). The class 𝒢 obtained by adding the k components of ψ_θ to ℱ is Donsker. (The union of two Donsker classes is Donsker, in general. In

† This means that there exists a map Ṗ_θ : ℱ ↦ ℝᵏ such that ‖P_{θ+h} − P_θ − hᵀ Ṗ_θ‖_ℱ = o(‖h‖) as h → 0; see Chapter 20.
the present case, the result also follows directly from Theorem 18.14.) The variables (√n(𝔽_n − P_θ), n^(−1/2) Σ_i ψ_θ(X_i)) are obtained from the empirical process seen as an element of ℓ∞(𝒢) by a continuous map. Finally, apply Slutsky's lemma. ∎
The preceding theorem implies, for instance, that the sequences of modified Kolmogorov–Smirnov statistics √n ‖𝔽_n − F_θ̂_n‖_∞ converge in distribution to the supremum of a certain Gaussian process. The distribution of the limit may depend on the model θ ↦ F_θ, the estimators θ̂_n, and even on the parameter value θ. Typically, this distribution is not known in closed form but has to be approximated numerically or by simulation. On the other hand, the limit distribution of the true Kolmogorov–Smirnov statistic under a continuous distribution can be derived from properties of the Brownian bridge, and is given by†

P(‖G‖_∞ > x) = 2 Σ_{j=1}^∞ (−1)^(j+1) e^(−2j²x²).

With the Donsker theorem in hand, the route via the Brownian bridge is probably the most convenient. In the 1940s Smirnov obtained the right side as the limit of an explicit expression for the distribution function of the Kolmogorov–Smirnov statistic.
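The series is easy to evaluate, since it converges extremely fast thanks to the e^(−2j²x²) factor. A minimal sketch (the truncation at 100 terms is far more than needed):

```python
import math

def ks_survival(x, terms=100):
    """P(||G||_inf > x) = 2 * sum_{j>=1} (-1)^(j+1) * exp(-2 j^2 x^2)."""
    return 2 * sum((-1) ** (j + 1) * math.exp(-2 * j * j * x * x)
                   for j in range(1, terms + 1))

# the classical 5% critical value of the Kolmogorov-Smirnov test
print(ks_survival(1.358))   # about 0.05
```

Inverting this function numerically reproduces the classical Kolmogorov–Smirnov critical values.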
19.4 Random Functions
The language of Glivenko–Cantelli classes, Donsker classes, and entropy appears to be convenient to state the "regularity conditions" needed in the asymptotic analysis of many statistical procedures. For instance, in the analysis of Z- and M-estimators, the theory of empirical processes is a powerful tool to control remainder terms. In this section we consider the key element in this application: controlling random sequences of the form Σ_{i=1}^n f_{n,θ̂_n}(X_i) for functions that change with n and depend on an estimated parameter.

If a class ℱ of functions is P-Glivenko–Cantelli, then the difference ‖ℙ_n f − Pf‖ converges to zero uniformly in f varying over ℱ, almost surely. Then it is immediate that also ℙ_n f̂_n − P f̂_n → 0 almost surely for every sequence of random functions f̂_n that are contained in ℱ. If f̂_n converges almost surely to a function f₀ and the sequence is dominated (or uniformly integrable), so that P f̂_n → P f₀ almost surely, then it follows that ℙ_n f̂_n → P f₀ almost surely. Here by "random functions" we mean measurable functions x ↦ f̂_n(x; ω) that, for every fixed x, are real maps defined on the same probability space as the observations X₁(ω), …, X_n(ω). In many examples the function x ↦ f̂_n(x; X₁, …, X_n) is a function of the observations, for every fixed x. The notations ℙ_n f̂_n and P f̂_n are abbreviations for the expectations of the functions x ↦ f̂_n(x; ω) with ω fixed.

A similar principle applies to Donsker classes of functions. For a Donsker class ℱ, the empirical process G_n converges in distribution to a Brownian bridge process G_P "uniformly in f ∈ ℱ." In view of Lemma 18.15, the limiting process has uniformly continuous sample paths with respect to the variance semimetric. The uniform convergence combined with the continuity yields the weak convergence G_n f̂_n ⇝ G_P f₀ for every sequence f̂_n of random functions that are contained in ℱ and that converges in the variance semimetric to a function f₀.

† See, for instance, [42, Chapter 12], or [134].
19.24 Lemma. Suppose that ℱ is a P-Donsker class of measurable functions and f̂_n is a sequence of random functions that take their values in ℱ such that ∫ (f̂_n(x) − f₀(x))² dP(x) converges in probability to 0 for some f₀ ∈ L₂(P). Then G_n(f̂_n − f₀) ⇝ 0 and hence G_n f̂_n ⇝ G_P f₀.

Proof. Assume without loss of generality that f₀ is contained in ℱ. Define a function g: ℓ∞(ℱ) × ℱ ↦ ℝ by g(z, f) = z(f) − z(f₀). The set ℱ is a semimetric space relative to the L₂(P)-metric. The function g is continuous with respect to the product semimetric at every point (z, f) such that f ↦ z(f) is continuous. Indeed, if (z_n, f_n) → (z, f) in the space ℓ∞(ℱ) × ℱ, then z_n(f_n) = z(f_n) + o(1) → z(f) if z is continuous at f. By assumption, f̂_n → f₀ as maps in the metric space ℱ. Because ℱ is Donsker, G_n ⇝ G_P in the space ℓ∞(ℱ), and it follows that (G_n, f̂_n) ⇝ (G_P, f₀) in the space ℓ∞(ℱ) × ℱ. By Lemma 18.15, almost all sample paths of G_P are continuous on ℱ. Thus the function g is continuous at almost every point (G_P, f₀). By the continuous-mapping theorem, G_n(f̂_n − f₀) = g(G_n, f̂_n) ⇝ g(G_P, f₀) = 0. The lemma follows, because convergence in distribution and convergence in probability are the same for a degenerate limit. ∎
The preceding lemma can also be proved by reference to an almost sure representation for the converging sequence G_n ⇝ G_P. Such a representation, a generalization of Theorem 2.19, exists. However, the correct handling of measurability issues makes its application involved.
19.25 Example (Mean absolute deviation). The mean absolute deviation of a random sample X₁, …, X_n is the scale estimator

M_n = (1/n) Σ_{i=1}^n |X_i − X̄_n|.

The absolute value bars make the derivation of its asymptotic distribution surprisingly difficult. (Try and do it by elementary means.) Denote the distribution function of the observations by F, and assume for simplicity of notation that they have mean equal to zero. We shall write 𝔽_n|x − θ| for the stochastic process θ ↦ n^(−1) Σ_{i=1}^n |X_i − θ|, and use the notations G_n|x − θ| and F|x − θ| in a similar way. If F x² < ∞, then the set of functions x ↦ |x − θ|, with θ ranging over a compact, such as [−1, 1], is Donsker, by Example 19.7. Because, by the triangle inequality, F(|x − X̄_n| − |x|)² ≤ |X̄_n|² → 0, the preceding lemma shows that G_n|x − X̄_n| − G_n|x| → 0 in probability. This can be rewritten as

√n(M_n − F|x|) = √n(F|x − θ| − F|x|)|_{θ=X̄_n} + G_n|x| + o_P(1).

If the map θ ↦ F|x − θ| is differentiable at 0, then, with the derivative written in the form 2F(0) − 1, the first term on the right is asymptotically equivalent to (2F(0) − 1) G_n x, by the delta method. Thus, the mean absolute deviation is asymptotically normal with mean zero and asymptotic variance equal to the variance of (2F(0) − 1) X₁ + |X₁|. If the mean and median of the observations are equal (i.e., F(0) = ½), then the first term is 0 and hence the centering of the absolute values at the sample mean has the same effect as centering at the true mean. In this case not knowing the true mean does not hurt the scale estimator. In comparison, for the sample variance this is true for any F. □

Perhaps the most important application of the preceding lemma is to the theory of Z-estimators. In Theorem 5.21 we imposed a pointwise Lipschitz condition on the maps θ ↦ ψ_θ to ensure the convergence (5.22):
G_n ψ_θ̂_n − G_n ψ_θ₀ → 0 in probability.
In view of Example 19.7, this is now seen to be a consequence of the preceding lemma. The display is valid if the class of functions {ψ_θ : ‖θ − θ₀‖ < δ} is Donsker for some δ > 0 and ψ_θ̂_n → ψ_θ₀ in quadratic mean. Imposing a Lipschitz condition is just one method to ensure these conditions, and hence Theorem 5.21 can be extended considerably. In particular, in its generalized form the theorem covers the sample median, corresponding to the choice ψ_θ(x) = sign(x − θ). The sign functions can be bracketed just as the indicator functions of cells considered in Example 19.6 and thus form a Donsker class.

For the treatment of semiparametric models (see Chapter 25), it is useful to extend the results on Z-estimators to the case of infinite-dimensional parameters. A differentiability or Lipschitz condition on the maps θ ↦ ψ_θ would preclude most applications of interest. However, if we use the language of Donsker classes, the extension is straightforward and useful. If the parameter θ ranges over a subset of an infinite-dimensional normed space, then we use an infinite number of estimating equations, which we label by some set H and assume to be sums. Thus the estimator θ̂_n (nearly) solves an equation ℙ_n ψ_{θ,h} = 0 for every h ∈ H. We assume that, for every fixed x and θ, the map h ↦ ψ_{θ,h}(x), which we denote by ψ_θ(x), is uniformly bounded, and the same for the map h ↦ P ψ_{θ,h}, which we denote by P ψ_θ.
19.26 Theorem. For each θ in a subset Θ of a normed space and every h in an arbitrary set H, let x ↦ ψ_{θ,h}(x) be a measurable function such that the class {ψ_{θ,h} : ‖θ − θ₀‖ < δ, h ∈ H} is P-Donsker for some δ > 0, with finite envelope function. Assume that, as a map into ℓ∞(H), the map θ ↦ P ψ_θ is Fréchet-differentiable at a zero θ₀, with a derivative V: lin Θ ↦ ℓ∞(H) that has a continuous inverse on its range. Furthermore, assume that ‖P(ψ_{θ,h} − ψ_{θ₀,h})²‖_H → 0 as θ → θ₀. If ‖ℙ_n ψ_θ̂_n‖_H = o_P(n^(−1/2)) and θ̂_n → θ₀ in probability, then

V √n(θ̂_n − θ₀) = −G_n ψ_θ₀ + o_P(1).
Proof. This follows the same lines as the proof of Theorem 5.21. The only novel aspect is that a uniform version of Lemma 19.24 is needed to ensure that G_n(ψ_θ̂_n − ψ_θ₀) converges to zero in probability in ℓ∞(H). This is proved along the same lines. Assume without loss of generality that θ̂_n takes its values in Θ₀ = {θ ∈ Θ : ‖θ − θ₀‖ < δ}, and define a map g: ℓ∞(Θ₀ × H) × Θ₀ ↦ ℓ∞(H) by g(z, θ)h = z(θ, h) − z(θ₀, h). This map is continuous at every point (z, θ₀) such that ‖z(θ, h) − z(θ₀, h)‖_H → 0 as θ → θ₀. The sequence (G_n ψ_θ, θ̂_n) converges in distribution in the space ℓ∞(Θ₀ × H) × Θ₀ to a pair (G ψ_θ, θ₀). As θ → θ₀, we have that sup_h P(ψ_{θ,h} − ψ_{θ₀,h})² → 0 by assumption, and thus ‖G ψ_θ − G ψ_θ₀‖_H → 0 almost surely, by the uniform continuity of the sample paths of the Brownian bridge. Thus, we can apply the continuous-mapping theorem and conclude that g(G_n ψ_θ, θ̂_n) ⇝ g(G ψ_θ, θ₀) = 0, which is the desired result. ∎
19.5 Changing Classes
The Glivenko–Cantelli and Donsker theorems concern the empirical process for different n, but each time with the same indexing class ℱ. This is sufficient for a large number of applications, but in other cases it may be necessary to allow the class ℱ to change with n. For instance, the range of the random function f̂_n in Lemma 19.24 might be different for every n. We encounter one such situation in the treatment of M-estimators and the likelihood ratio statistic in Chapters 5 and 16, in which the random functions of interest, √n(m_θ̂_n − m_θ₀) − √n(θ̂_n − θ₀)ᵀ ṁ_θ₀, are obtained by rescaling a given class of functions. It turns out that the convergence of random variables such as G_n f̂_n does not require the ranges ℱ_n of the functions f̂_n to be constant but depends only on the sizes of the ranges to stabilize. The nature of the functions inside the classes could change completely from n to n + 1 (apart from a Lindeberg condition).

Directly or indirectly, all the results in this chapter are based on the maximal inequalities obtained in Section 19.6. The most general results can be obtained by applying these inequalities, which are valid for every fixed n, directly. The conditions for convergence of quantities such as G_n f̂_n are then framed in terms of (random) entropy numbers. In this section we give an intermediate treatment, starting with an extension of the Donsker theorems, Theorems 19.5 and 19.14, to the weak convergence of the empirical process indexed by classes that change with n.

Let ℱ_n be a sequence of classes of measurable functions f_{n,t}: 𝒳 ↦ ℝ indexed by a parameter t, which belongs to a common index set T. Then we can consider the weak convergence of the stochastic processes t ↦ G_n f_{n,t} as elements of ℓ∞(T), assuming that the sample paths are bounded. By Theorem 18.14 weak convergence is equivalent to marginal convergence and asymptotic tightness.
The marginal convergence to a Gaussian process follows under the conditions of the Lindeberg theorem, Proposition 2.27. Sufficient conditions for tightness can be given in terms of the entropies of the classes ℱ_n. We shall assume that there exists a semimetric ρ that makes T into a totally bounded space and that relates to the L₂-metric in that

sup_{ρ(s,t)<δ_n} P(f_{n,s} − f_{n,t})² → 0,   for every δ_n ↓ 0.    (19.27)
Furthermore, we suppose that the classes ℱ_n possess envelope functions F_n that satisfy the Lindeberg condition

P F_n² = O(1),   P F_n² 1{F_n > ε√n} → 0,   for every ε > 0.

Then the central limit theorem holds under an entropy condition. As before, we can use either bracketing or uniform entropy.
19.28 Theorem. Let ℱ_n = {f_{n,t} : t ∈ T} be a class of measurable functions indexed by a totally bounded semimetric space (T, ρ) satisfying (19.27) and with envelope functions that satisfy the Lindeberg condition. If J_[](δ_n, ℱ_n, L₂(P)) → 0 for every δ_n ↓ 0, or alternatively, every ℱ_n is suitably measurable and J(δ_n, ℱ_n, L₂) → 0 for every δ_n ↓ 0, then the sequence {G_n f_{n,t} : t ∈ T} converges in distribution to a tight Gaussian process, provided the sequence of covariance functions P f_{n,s} f_{n,t} − P f_{n,s} P f_{n,t} converges pointwise on T × T.
Proof. Under bracketing, the proof is similar to the proof of Theorem 19.5; we omit the proof under uniform entropy. For every given δ > 0 we can use the semimetric ρ and condition (19.27) to partition T into finitely many sets T₁, …, T_k such that, for every sufficiently large n,

sup_i sup_{s,t ∈ T_i} P(f_{n,s} − f_{n,t})² < δ².

(This is the only role for the totally bounded semimetric ρ; alternatively, we could assume the existence of partitions as in this display directly.) Next we apply Lemma 19.34 to obtain the bound

E* sup_i sup_{s,t ∈ T_i} |G_n(f_{n,s} − f_{n,t})| ≲ J_[](δ, ℱ_n, L₂(P)) + √n P F_n 1{F_n > √n a_n(δ)}.

Here a_n(δ) is the number given in Lemma 19.34 evaluated for the class of functions ℱ_n − ℱ_n, and F_n is its envelope, but the corresponding number and envelope of the class ℱ_n differ only by constants. Because J_[](δ_n, ℱ_n, L₂(P)) → 0 for every δ_n ↓ 0, we must have that J_[](δ, ℱ_n, L₂(P)) = O(1) for every δ > 0, and hence a_n(δ) is bounded away from zero. Then the second term in the preceding display converges to zero for every fixed δ > 0, by the Lindeberg condition. The first term can be made arbitrarily small as n → ∞ by choosing δ small, by assumption. ∎
19.29 Example (Local empirical measure). Consider the functions f_{n,t} = r_n 1_{(a, a+tδ_n]} for t ranging over a compact in ℝ, say [0, 1], a fixed number a, and sequences δ_n ↓ 0 and r_n → ∞. This leads to a multiple of the local empirical measure, ℙ_n f_{n,t} = r_n (1/n) #(X_i ∈ (a, a + tδ_n]), which counts the fraction of observations falling into the shrinking intervals (a, a + tδ_n].

Assume that the distribution of the observations is continuous with density p. Then

P f_{n,t}² = r_n² P(a, a + tδ_n] = r_n² p(a) tδ_n + o(r_n² δ_n).

Thus, we obtain an interesting limit only if r_n² δ_n → 1. From now on, set r_n² δ_n = 1. Then the variance of every G_n f_{n,t} converges to a nonzero limit. Because the envelope function is F_n = f_{n,1}, the Lindeberg condition reduces to r_n² P(a, a + δ_n] 1{r_n > ε√n} → 0, which is true provided nδ_n → ∞. This requires that we do not localize too much. If the intervals become too small, then catching an observation becomes a rare event and the problem is not within the domain of normal convergence.

The bracketing numbers of the cells 1_{(a, a+tδ_n]} with t ∈ [0, 1] are of the order O(δ_n/ε²). Multiplication with r_n changes this in O(1/ε²). Thus Theorem 19.28 applies easily, and we conclude that the sequence of processes t ↦ G_n f_{n,t} converges in distribution to a Gaussian process, for every δ_n ↓ 0 such that nδ_n → ∞. The limit process is not a Brownian bridge, but a Brownian motion process, as follows by computing the limit covariance of (G_n f_{n,s}, G_n f_{n,t}). Asymptotically the local empirical process "does not know" that it is tied down at its extremes. In fact, it is an interesting exercise to check that two different local empirical processes (localized at two different numbers a and b) converge jointly to two independent Brownian motions. □
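The Brownian-motion limit in this example can be seen in simulation. The sketch below (stdlib Python; the uniform observations, the point a = 0.2, and the particular rates are arbitrary choices subject to r_n² δ_n = 1 and nδ_n → ∞) estimates the covariance of (G_n f_{n,s}, G_n f_{n,t}), which should be near p(a) min(s, t) rather than the Brownian-bridge covariance.

```python
import math, random

rng = random.Random(2)
a, n, delta = 0.2, 2000, 0.025    # n * delta = 50, mimicking n * delta_n -> infinity
r = 1 / math.sqrt(delta)          # r_n chosen so that r_n^2 * delta_n = 1
s, t, reps = 0.5, 1.0, 600

def local_process(sample):
    """(G_n f_{n,s}, G_n f_{n,t}) for f_{n,t} = r * 1{(a, a + t*delta]}, P = U(0,1)."""
    c_s = c_t = 0
    for x in sample:
        if a < x <= a + t * delta:
            c_t += 1
            if x <= a + s * delta:
                c_s += 1
    g = lambda c, u: math.sqrt(n) * r * (c / n - u * delta)
    return g(c_s, s), g(c_t, t)

pairs = [local_process([rng.random() for _ in range(n)]) for _ in range(reps)]
ms = sum(p[0] for p in pairs) / reps
mt = sum(p[1] for p in pairs) / reps
cov = sum((p[0] - ms) * (p[1] - mt) for p in pairs) / reps
print(cov)   # near p(a) * min(s, t) = 0.5: Brownian motion, not a bridge
```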
In the treatment of M-estimators and the likelihood ratio statistic in Chapters 5 and 16, we encountered random functions resulting from rescaling a given class of functions. Given functions x ↦ m_θ(x) indexed by a Euclidean parameter θ, we needed conditions that ensure that, for a given sequence r_n → ∞ and any random sequence ĥ_n = O_P(1),

G_n(r_n(m_{θ₀+ĥ_n/r_n} − m_θ₀) − ĥ_nᵀ ṁ_θ₀) → 0 in probability.    (19.30)

We shall prove this under a Lipschitz condition, but it should be clear from the following proof and the preceding theorem that there are other possibilities.
19.31 Lemma. For each θ in an open subset of Euclidean space, let x ↦ m_θ(x) be a measurable function such that the map θ ↦ m_θ(x) is differentiable at θ₀ for almost every x (or in probability) with derivative ṁ_θ₀(x) and such that, for every θ₁ and θ₂ in a neighborhood of θ₀, and for a measurable function ṁ with P ṁ² < ∞,

|m_θ₁(x) − m_θ₂(x)| ≤ ṁ(x) ‖θ₁ − θ₂‖.

Then (19.30) is valid for every random sequence ĥ_n that is bounded in probability.

Proof. The random variables G_n(r_n(m_{θ₀+h/r_n} − m_θ₀) − hᵀ ṁ_θ₀) have mean zero and their variance converges to 0, by the differentiability of the maps θ ↦ m_θ and the Lipschitz condition, which allows application of the dominated-convergence theorem. In other words, this sequence, seen as stochastic processes indexed by h, converges marginally in distribution to zero. Because the sequence ĥ_n is bounded in probability, it suffices to strengthen this to uniform convergence in ‖h‖ ≤ 1. This follows if the sequence of processes converges weakly in the space ℓ∞({h : ‖h‖ ≤ 1}), because taking a supremum is a continuous operation and, by the marginal convergence, the weak limit is then necessarily zero. By Theorem 18.14 we can confine ourselves to proving asymptotic tightness (i.e., condition (ii) of this theorem). Because the linear processes h ↦ hᵀ G_n ṁ_θ₀ are trivially tight, we may concentrate on the processes h ↦ G_n r_n(m_{θ₀+h/r_n} − m_θ₀), the empirical process indexed by the classes of functions r_n ℳ_{1/r_n}, for ℳ_δ = {m_θ − m_θ₀ : ‖θ − θ₀‖ ≤ δ}. By Example 19.7, the bracketing numbers of the classes of functions ℳ_δ satisfy

N_[](εδ ‖ṁ‖_{P,2}, ℳ_δ, L₂(P)) ≤ C (1/ε)^d,   0 < ε < 1.

The constant C is independent of ε and δ. The function M_δ = δ ṁ is an envelope function of ℳ_δ. The left side also gives the bracketing numbers of the rescaled classes ℳ_δ/δ relative to the envelope functions M_δ/δ = ṁ. Thus, we compute

J_[](δ_n, ℳ_δ/δ, L₂(P)) ≲ ∫₀^{δ_n} √(log C + d log(1/ε)) dε.

The right side converges to zero as δ_n ↓ 0, uniformly in δ. The envelope functions M_δ/δ = ṁ also satisfy the Lindeberg condition. The lemma follows from Theorem 19.28. ∎
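The next passage continues the proof of Bernstein's inequality, Lemma 19.32, whose statement falls outside this excerpt; the form assumed here is P(|G_n f| > x) ≤ 2 exp(−¼ x² / (Pf² + x‖f‖_∞/√n)). Taking that statement as given, a quick sanity check (stdlib Python; the Bernoulli choice of f, the sample size, and the seed are arbitrary) compares a Monte Carlo tail probability with the bound.

```python
import math, random

rng = random.Random(4)
n, reps, x = 100, 3000, 1.5
p = 0.3                    # f = indicator of success, so Pf^2 = p, ||f||_inf = 1

hits = 0
for _ in range(reps):
    count = sum(1 for _ in range(n) if rng.random() < p)
    if abs(math.sqrt(n) * (count / n - p)) > x:   # |G_n f| = sqrt(n)|phat - p|
        hits += 1
tail = hits / reps

bound = 2 * math.exp(-0.25 * x * x / (p + x / math.sqrt(n)))
print(tail, bound)         # the Monte Carlo tail lies below the Bernstein bound
```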
by Fubini's theorem and next developing the exponential function in its power series. The term for k = 1 vanishes because P(f − Pf) = 0, so that a factor 1/n can be moved outside the sum. This yields

P(G_n f > x) ≤ e^(−λx) (1 + (1/n) Σ_{k=2}^∞ (λ^k/k!) n^(1−k/2) P|f − Pf|^k)^n.

We apply this inequality with the choice

λ = x / (2A₁ + A₂ x/√n),   A₁ = Pf²,   A₂ = 2‖f‖_∞.

Next, with A₁ and A₂ defined as in the preceding display, we insert the bound P|f − Pf|^k ≤ A₁ A₂^(k−2), and use the inequalities Σ_{k=2}^∞ a^(k−2)/k! ≤ 1 for 0 ≤ a ≤ 1 and (1 + a)^n ≤ e^(na). The right side of this inequality is then bounded by exp(−λx/2), which is the exponential in the lemma. ∎

19.33
Lemma. For any finite class ℱ of bounded, measurable, square-integrable functions, with |ℱ| elements,

E_P ‖G_n‖_ℱ ≲ max_f (‖f‖_∞/√n) log(1 + |ℱ|) + max_f ‖f‖_{P,2} √(log(1 + |ℱ|)).

Proof. Define a = 24‖f‖_∞/√n and b = 24 Pf². For x ≥ b/a and x ≤ b/a, the exponent in Bernstein's inequality is bounded above by −3x/a and −3x²/b, respectively.†

† The constant 1/4 can be replaced by 1/2 (which is the best possible constant) by a more precise argument.
For the truncated variables A_f = G_n f 1{|G_n f| > b/a} and B_f = G_n f 1{|G_n f| ≤ b/a}, Bernstein's inequality yields the bounds, for all x > 0,

P(|A_f| > x) ≤ 2 exp(−3x/a),   P(|B_f| > x) ≤ 2 exp(−3x²/b).

Combining the first inequality with Fubini's theorem, we obtain, with ψ₁(x) = e^x − 1,

E ψ₁(|A_f|/a) = E ∫₀^{|A_f|/a} e^x dx ≤ ∫₀^∞ P(|A_f| > xa) e^x dx ≤ 1.

By a similar argument we find that E ψ₂(|B_f|/√b) ≤ 1, for ψ₂(x) = e^(x²) − 1. Because the function ψ₁ is convex and nonnegative, we next obtain, by Jensen's inequality,

ψ₁(E max_f |A_f|/a) ≤ E ψ₁(max_f |A_f|/a) ≤ Σ_f E ψ₁(|A_f|/a) ≤ |ℱ|.

Because ψ₁^(−1)(u) = log(1 + u) is increasing, we can apply it across the display, and find a bound on E max_f |A_f| that yields the first term on the right side of the lemma. An analogous inequality is valid for max_f |B_f|/√b, but with ψ₂ instead of ψ₁. An application of the triangle inequality concludes the proof. ∎
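The finite-class inequality just proved can be probed numerically. The sketch below (stdlib Python; the class of half-line indicators, the sizes, and the seed are illustrative choices, and the lemma's unspecified constant is simply taken to be 1) estimates E max_f |G_n f| for the class {1{x ≤ j/m}} under uniform observations and compares it with the two-term bound.

```python
import math, random
from bisect import bisect_right

rng = random.Random(3)
n, m, reps = 200, 30, 200        # class F = {1{x <= j/m} : j = 1, ..., m}

emax = 0.0
for _ in range(reps):
    xs = sorted(rng.random() for _ in range(n))
    # |G_n f_j| = sqrt(n) |F_n(j/m) - j/m| for the j-th indicator
    best = max(abs(bisect_right(xs, j / m) / n - j / m) for j in range(1, m + 1))
    emax += math.sqrt(n) * best / reps

sup_norm = 1.0                                             # max_f ||f||_infinity
l2_norm = max(math.sqrt(j / m) for j in range(1, m + 1))   # max_f ||f||_{P,2}
bound = (sup_norm / math.sqrt(n)) * math.log(1 + m) + l2_norm * math.sqrt(math.log(1 + m))
print(emax, bound)   # the expected maximum sits below the bound even with constant 1
```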
19.34 Lemma. For any class ℱ of measurable functions f: 𝒳 ↦ ℝ such that Pf² < δ² for every f, we have, with a(δ) = δ/√(log N_[](δ, ℱ, L₂(P))) and F an envelope function,

E*_P ‖G_n‖_ℱ ≲ J_[](δ, ℱ, L₂(P)) + √n P* F 1{F > √n a(δ)}.
Proof. Because |G_n f| ≤ G_n g + 2√n Pg for every pair of functions |f| ≤ g, we obtain, for F an envelope function of ℱ,

E* ‖G_n f 1{F > √n a(δ)}‖_ℱ ≤ 2√n P F 1{F > √n a(δ)}.

The right side is twice the second term in the bound of the lemma. It suffices to bound E* ‖G_n f 1{F ≤ √n a(δ)}‖_ℱ by a multiple of the first term. The bracketing numbers of the class of functions f 1{F ≤ a(δ)√n}, if f ranges over ℱ, are smaller than the bracketing numbers of the class ℱ. Thus, to simplify the notation, we can assume that every f ∈ ℱ is bounded by √n a(δ).

Fix an integer q₀ such that 4δ ≤ 2^(−q₀) ≤ 8δ. There exists a nested sequence of partitions ℱ = ∪_{i=1}^{N_q} ℱ_{qi} of ℱ, indexed by the integers q ≥ q₀, into N_q disjoint subsets, and measurable functions Δ_{qi} ≤ 2F, such that

sup_{f,g ∈ ℱ_{qi}} |f − g| ≤ Δ_{qi},   P Δ_{qi}² ≤ 2^(−2q),   Σ_{q ≥ q₀} 2^(−q) √(log N_q) ≲ J_[](δ, ℱ, L₂(P)).

To see this, first cover ℱ with minimal numbers of L₂(P)-brackets of size 2^(−q) and replace these by as many disjoint sets, each of them equal to a bracket minus "previous" brackets. This gives partitions that satisfy the conditions, with Δ_{qi} equal to the difference
19.6 Maximal Inequalities
of the upper and lower brackets. If this sequence of partitions does not yet consist of successive refinements, then replace the partition at stage q by the set of all intersections of the form ∩_{p=q₀}^q ℱ_{p,i_p}. This gives partitions into N_{q₀} ⋯ N_q sets. Using the inequality (log Π_p N_p)^(1/2) ≤ Σ_p (log N_p)^(1/2) and rearranging sums, we see that the displayed conditions are still satisfied.

Choose for each q ≥ q₀ a fixed element f_{qi} from each partitioning set ℱ_{qi}, and set

π_q f = f_{qi},   Δ_q f = Δ_{qi},   if f ∈ ℱ_{qi}.

Then π_q f and Δ_q f run through a set of N_q functions if f runs through ℱ. Define for each fixed n and q ≥ q₀ numbers and indicator functions

a_q = 2^(−q)/√(log N_{q+1}),
A_{q−1} f = 1{Δ_{q₀} f ≤ √n a_{q₀}, …, Δ_{q−1} f ≤ √n a_{q−1}},
B_q f = 1{Δ_{q₀} f ≤ √n a_{q₀}, …, Δ_{q−1} f ≤ √n a_{q−1}, Δ_q f > √n a_q}.

Then A_q f and B_q f are constant in f on each of the partitioning sets ℱ_{qi} at level q, because the partitions are nested. Our construction of partitions and choice of q₀ also ensure that 2a(δ) ≤ a_{q₀}, whence A_{q₀} f ≡ 1. Now decompose, pointwise in x (which is suppressed in the notation),
f − π_{q₀} f = Σ_{q=q₀+1}^∞ (f − π_q f) B_q f + Σ_{q=q₀+1}^∞ (π_q f − π_{q−1} f) A_{q−1} f.

The idea here is to write the left side as the sum of f − π_{q₁} f and the telescopic sum Σ_{q=q₀+1}^{q₁} (π_q f − π_{q−1} f), for the largest q₁ = q₁(f, x) such that each of the bounds Δ_q f on the "links" π_q f − π_{q−1} f in the "chain" is uniformly bounded by √n a_q (with q₁ possibly infinite). We note that either all B_q f are 0 or there is a unique q₁ > q₀ with B_{q₁} f = 1. In the first case A_q f = 1 for every q; in the second case A_q f = 1 for q₀ ≤ q < q₁ and A_q f = 0 for q ≥ q₁.

Next we apply the empirical process G_n to both series on the right separately, take absolute values, and next take suprema over f ∈ ℱ. We shall bound the means of the resulting two variables. First, because the partitions are nested, Δ_q f B_q f ≤ Δ_{q−1} f B_q f ≤ √n a_{q−1}, and trivially P(Δ_q f)² B_q f ≤ 2^(−2q). Because |G_n f| ≤ G_n g + 2√n Pg for every pair of functions |f| ≤ g, we obtain, by the triangle inequality and next Lemma 19.33,

E* ‖Σ_{q=q₀+1}^∞ G_n (f − π_q f) B_q f‖_ℱ ≤ Σ_{q=q₀+1}^∞ E* ‖G_n Δ_q f B_q f‖_ℱ + Σ_{q=q₀+1}^∞ 2√n ‖P Δ_q f B_q f‖_ℱ
   ≲ Σ_{q=q₀+1}^∞ [a_{q−1} log N_q + 2^(−q) √(log N_q) + 2^(−2q)/a_q].

In view of the definition of a_q, the series on the right can be bounded by a multiple of the series Σ_{q=q₀+1}^∞ 2^(−q) √(log N_q).
Second, there are at most N_q functions π_q f − π_{q−1} f and at most N_{q−1} indicator functions A_{q−1} f. Because the partitions are nested, the function |π_q f − π_{q−1} f| A_{q−1} f is bounded by Δ_{q−1} f A_{q−1} f ≤ √n a_{q−1}. The L₂(P)-norm of π_q f − π_{q−1} f is bounded by 2^(−q+1). Apply Lemma 19.33 to find

E* ‖Σ_{q=q₀+1}^∞ G_n (π_q f − π_{q−1} f) A_{q−1} f‖_ℱ ≲ Σ_{q=q₀+1}^∞ [a_{q−1} log N_q + 2^(−q) √(log N_q)].

Again this is bounded above by a multiple of the series Σ_{q=q₀+1}^∞ 2^(−q) √(log N_q). To conclude the proof it suffices to consider the terms π_{q₀} f. Because ‖π_{q₀} f‖_∞ ≤ √n a(δ) ≤ √n a_{q₀} and P(π_{q₀} f)² ≤ δ² by assumption, another application of Lemma 19.33 yields

E* ‖G_n π_{q₀} f‖_ℱ ≲ a(δ) log(1 + N_{q₀}) + δ √(log(1 + N_{q₀})).

This is bounded by a multiple of the first term in the bound of the lemma. The remaining term √n P* F 1{F > √n a(δ)} is bounded above by a multiple of ‖F‖_{P,2}, by Markov's inequality. ∎
The second term in the maximal inequality of Lemma 19.34 results from a crude majorization in the first step of its proof. This bound can be improved by taking special properties of the class of functions ℱ into account, or by using different norms to measure the brackets. The following lemmas, which are used in Chapter 25, exemplify this.† The first uses the L₂(P)-norm but is limited to uniformly bounded classes; the second uses a stronger norm, which we call the "Bernstein norm," as it relates to a strengthening of Bernstein's inequality. Actually, this is not a true norm, but it can be used in the same way to measure the size of brackets. It is defined by

‖f‖²_{P,B} = 2P(e^|f| − 1 − |f|).

19.36 Lemma. For any class ℱ of measurable functions f: 𝒳 ↦ ℝ such that ‖f‖_∞ ≤ M and Pf² < δ² for every f,

† For a proof of the following lemmas and further results, see Lemmas 3.4.2 and 3.4.3. Also see [14], [15], and [51].
0} as f ranges over ℱ;
(ii) The collection of functions x ↦ f(x) + g(x) as f ranges over ℱ and g is fixed;
(iii) The collection of functions x ↦ f(x)g(x) as f ranges over ℱ and g is fixed.
8. Show that a collection of sets is a VC class of sets if and only if the corresponding class of
indicator functions is a VC class of functions.
9. Let F_n and F be distribution functions on the real line. Show that:
(i) If F_n(x) → F(x) for every x and F is continuous, then ‖F_n − F‖_∞ → 0.
(ii) If F_n(x) → F(x) and F_n{x} → F{x} for every x, then ‖F_n − F‖_∞ → 0.
10. Find the asymptotic distribution of the mean absolute deviation from the median.
20 Functional Delta Method
The delta method was introduced in Chapter 3 as an easy way to turn the weak convergence of a sequence of random vectors r_n(T_n − θ) into the weak convergence of transformations of the type r_n(φ(T_n) − φ(θ)). It is useful to apply a similar technique in combination with the more powerful convergence of stochastic processes. In this chapter we consider the delta method at two levels. The first section is of a heuristic character and limited to the case that T_n is the empirical distribution. The second section establishes the delta method rigorously and in general, completely parallel to the delta method for ℝᵏ, for Hadamard-differentiable maps between normed spaces.
20.1 von Mises Calculus
Let ℙ_n be the empirical distribution of a random sample X₁, …, X_n from a distribution P. Many statistics can be written in the form φ(ℙ_n), where φ is a function that maps every distribution of interest into some space, which for simplicity is taken equal to the real line. Because the observations can be regained from ℙ_n completely (unless there are ties), any statistic can be expressed in the empirical distribution. The special structure assumed here is that the statistic can be written as a fixed function φ of ℙ_n, independent of n, a strong assumption. Because ℙ_n converges to P as n tends to infinity, we may hope to find the asymptotic behavior of φ(ℙ_n) − φ(P) through a differential analysis of φ in a neighborhood of P. A first-order analysis would have the form

φ(ℙ_n) − φ(P) = φ'_P(ℙ_n − P) + ⋯,

where φ'_P is a "derivative" and the remainder is hopefully negligible. The simplest approach towards defining a derivative is to consider the function t ↦ φ(P + tH) for a fixed perturbation H and as a function of the real-valued argument t. If φ takes its values in ℝ, then this function is just a function from the reals to the reals. Assume that the ordinary derivatives of the map t ↦ φ(P + tH) at t = 0 exist for k = 1, 2, …, m. Denoting them by φ^{(k)}_P(H), we obtain, by Taylor's theorem,

φ(P + tH) − φ(P) = t φ'_P(H) + (t²/2!) φ''_P(H) + ⋯ + (t^m/m!) φ^{(m)}_P(H) + o(t^m).
Substituting t = 1/√n and H = 𝔾_n, for 𝔾_n = √n(ℙ_n − P) the empirical process of the observations, we obtain the von Mises expansion

φ(ℙ_n) − φ(P) = (1/√n) φ'_P(𝔾_n) + ⋯ + (1/m!) (1/√n)^m φ^{(m)}_P(𝔾_n) + ⋯.
Actually, because the empirical process 𝔾_n is dependent on n, it is not a legal choice for H under the assumed type of differentiability: There is no guarantee that the remainder is small. However, we make this our working hypothesis. This is reasonable, because the remainder has one factor 1/√n more, and the empirical process 𝔾_n shares at least one property with a fixed H: It is "bounded." Then the asymptotic distribution of φ(ℙ_n) − φ(P) should be determined by the first nonzero term in the expansion, which is usually the first-order term (1/√n) φ'_P(𝔾_n). A method to make our wishful thinking rigorous is discussed in the next section. Even in cases in which it is hard to make the differentiation operation rigorous, the von Mises expansion still has heuristic value. It may suggest the type of limiting behavior of φ(ℙ_n) − φ(P), which can next be further investigated by ad-hoc methods. We discuss this in more detail for the case that m = 1. A first derivative typically gives a linear approximation to the original function. If, indeed, the map H ↦ φ'_P(H) is linear, then, writing ℙ_n as the linear combination ℙ_n = n^{−1} Σ δ_{X_i} of the Dirac measures at the observations, we obtain
φ(ℙ_n) − φ(P) ≈ φ'_P(ℙ_n − P) = (1/n) Σ_{i=1}^n φ'_P(δ_{X_i} − P).   (20.1)

Thus, the difference φ(ℙ_n) − φ(P) behaves as an average of the independent random variables φ'_P(δ_{X_i} − P). If these variables have zero means and finite second moments, then a normal limit distribution of √n(φ(ℙ_n) − φ(P)) may be expected. Here the zero mean ought to be automatic, because we may expect that

∫ φ'_P(δ_x − P) dP(x) = φ'_P(∫ (δ_x − P) dP(x)) = φ'_P(0) = 0.
The interchange of order of integration and application of φ'_P is motivated by linearity (and continuity) of this derivative operator. The function x ↦ φ'_P(δ_x − P) is known as the influence function of the function φ. It can be computed as the ordinary derivative

φ'_P(δ_x − P) = (d/dt)|_{t=0} φ((1 − t)P + tδ_x).

The name "influence function" originated in the development of robust statistics. The function measures the change in the value φ(P) if an infinitesimally small part of P is replaced by a pointmass at x. In robust statistics, functions and estimators with an unbounded influence function are suspect, because a small fraction of the observations would have too much influence on the estimator if their values were equal to an x where the influence function is large. In many examples the derivative takes the form of an "expectation operator" φ'_P(H) = ∫ φ_P dH, for some function φ_P with ∫ φ_P dP = 0, at least for a subset of H. Then the influence function is precisely the function φ_P.
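The displayed derivative can also be approximated numerically by a difference quotient in t. The following sketch (the helper names are illustrative, not from the text) perturbs an empirical distribution toward a point mass and compares the result with the known influence function of the mean, x − ∫ s dP(s):

```python
import numpy as np

def influence_finite_diff(phi, sample, x, t=1e-6):
    """Difference quotient of t -> phi((1 - t)P + t*delta_x) at t = 0,
    with P the empirical distribution of `sample`.
    `phi` maps (support points, weights) to a real number."""
    n = len(sample)
    points = np.append(sample, x)
    base = np.append(np.full(n, 1.0 / n), 0.0)       # weights of P
    tilt = np.append(np.full(n, (1.0 - t) / n), t)   # weights of (1 - t)P + t*delta_x
    return (phi(points, tilt) - phi(points, base)) / t

def mean_functional(points, weights):
    # phi(P) = integral of s dP(s)
    return float(np.sum(points * weights))

rng = np.random.default_rng(0)
sample = rng.normal(size=100)
x = 2.5
approx = influence_finite_diff(mean_functional, sample, x)
exact = x - sample.mean()   # influence function of the mean
print(abs(approx - exact))  # tiny: the mean functional is linear in P
```

Because the mean functional is linear in P, the difference quotient agrees with the influence function up to floating-point rounding; for nonlinear functionals the same routine gives a first-order approximation.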
20.2 Example (Mean). The sample mean is obtained as φ(ℙ_n) from the mean function φ(P) = ∫ s dP(s). The influence function is

φ'_P(δ_x − P) = (d/dt)|_{t=0} ∫ s d[(1 − t)P + tδ_x](s) = x − ∫ s dP(s).

In this case, the approximation (20.1) is an identity, because the function is linear already. If the sample space is a Euclidean space, then the influence function is unbounded and hence the sample mean is not robust. □

20.3 Example (Wilcoxon). Let (X₁, Y₁), …, (X_n, Y_n) be a random sample from a bivariate distribution. Write 𝔽_n and 𝔾_n for the empirical distribution functions of the X_i and the Y_j, respectively, and consider the Mann-Whitney statistic

T_n = ∫ 𝔽_n d𝔾_n = (1/n²) Σ_{i=1}^n Σ_{j=1}^n 1{X_i ≤ Y_j}.

This statistic corresponds to the function φ(F, G) = ∫ F dG, which can be viewed as a function of two distribution functions, or also as a function of a bivariate distribution function with marginals F and G. (We have assumed that the sample sizes of the two samples are equal, to fit the example into the previous discussion, which, for simplicity, is restricted to i.i.d. observations.) The influence function is

φ'_{(F,G)}(δ_{x,y} − P) = (d/dt)|_{t=0} ∫ [(1 − t)F + tδ_x] d[(1 − t)G + tδ_y] = F(y) + 1 − G_−(x) − 2 ∫ F dG.

The last step follows on multiplying out the two terms between square brackets: The function that is to be differentiated is simply a parabola in t. For this case (20.1) reads

∫ 𝔽_n d𝔾_n − ∫ F dG ≈ (1/n) Σ_{i=1}^n (F(Y_i) + 1 − G_−(X_i) − 2 ∫ F dG).

From the two-sample U-statistic theorem, Theorem 12.6, it is known that the difference between the two sides of the approximation sign is actually o_P(1/√n). Thus, the heuristic calculus leads to the correct answer. In the next section an alternative proof of the asymptotic normality of the Mann-Whitney statistic is obtained by making this heuristic approach rigorous. □
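The first-order approximation is easy to check by simulation. A sketch, assuming for concreteness that both samples are uniform on [0, 1], so that F = G is the identity and ∫ F dG = 1/2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(size=n)   # sample with distribution F (uniform: F(t) = t)
y = rng.uniform(size=n)   # sample with distribution G

# Mann-Whitney statistic T_n = integral of F_n dG_n = n^{-2} sum_i sum_j 1{X_i <= Y_j}.
t_n = np.mean(x[:, None] <= y[None, :])

# Average of the influence values F(Y_i) + 1 - G(X_i) - 2 * (1/2).
linear = np.mean(y + 1.0 - x - 1.0)

# The two sides of (20.1) should agree up to o_P(1/sqrt(n)).
print(t_n - 0.5, linear)
```

The centered statistic and the linear term typically differ by an amount of order 1/n here, much smaller than their common fluctuation of order 1/√n.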
20.4 Example (Z-functions). For every θ in an open subset of ℝ^k, let x ↦ ψ_θ(x) be a given, measurable map into ℝ^k. The corresponding Z-function assigns to a probability measure P a zero φ(P) of the map θ ↦ Pψ_θ. (Consider only P for which a unique zero exists.) If applied to the empirical distribution, this yields a Z-estimator φ(ℙ_n). Differentiating with respect to t across the identity

(P + tδ_x) ψ_{φ(P + tδ_x)} = 0,

and assuming that the derivatives exist, we find

0 = ψ_{φ(P)}(x) + ((∂/∂θ) Pψ_θ)|_{θ=φ(P)} [(d/dt) φ(P + tδ_x)]|_{t=0},
δ on each interval [u_{i−1}, u_i). Such a grid exists for every element of D[−∞, ∞] (problem 18.6). Then

|∫ φ'(F₁)h₁ d(F_{2t} − F₂)| ≤ δ (∫ d|F_{2t}| + ∫ d|F₂|) + Σ_{i=1}^{m+1} |(φ'(F₁)h₁)(u_{i−1})| |F_{2t}[u_{i−1}, u_i) − F₂[u_{i−1}, u_i)|.

The first term is bounded by δ O(1), in which δ can be made arbitrarily small by the choice of the partition. For each fixed partition, the second term converges to zero as t → 0. Hence the left side converges to zero as t → 0. This proves the first assertion. The second assertion follows similarly. ∎
20.11 Example (Wilcoxon). Let 𝔽_m and 𝔾_n be the empirical distribution functions of two independent random samples X₁, …, X_m and Y₁, …, Y_n from distribution functions F and G, respectively. As usual, consider both m and n as indexed by a parameter ν, let N = m + n, and assume that m/N → λ ∈ (0, 1) as ν → ∞. By Donsker's theorem and Slutsky's lemma,

(√m (𝔽_m − F), √n (𝔾_n − G)) ⇝ (𝔾_F, 𝔾_G)

in the space D[−∞, ∞] × D[−∞, ∞], for a pair of independent Brownian bridges 𝔾_F and 𝔾_G. The preceding lemma together with the delta method imply that

√N (∫ 𝔽_m d𝔾_n − ∫ F dG) ⇝ −∫ (𝔾_G/√(1 − λ)) dF + ∫ (𝔾_F/√λ) dG.
The random variable on the right is a continuous, linear function applied to Gaussian processes. In analogy to the theorem that a linear transformation of a multivariate Gaussian vector has a Gaussian distribution, it can be shown that a continuous, linear transformation of a tight Gaussian process is normally distributed. That the present variable is normally distributed can be more easily seen by applying the delta method in its stronger form, which implies that the limit variable is the limit in distribution of the sequence
This can be rewritten as the difference of two sums of independent random variables, and next we can apply the central limit theorem for real variables. □

20.12 Example (Two-sample rank statistics). Let ℍ_N be the empirical distribution function of a sample X₁, …, X_m, Y₁, …, Y_n obtained by "pooling" two independent random samples from distributions F and G, respectively. Let R_{N1}, …, R_{NN} be the ranks of the pooled sample and let 𝔾_n be the empirical distribution function of the second sample. If no observations are tied, then Nℍ_N(Y_j) is the rank of Y_j in the pooled sample. Thus,

∫ Nℍ_N d𝔾_n = (1/n) Σ_{j=1}^n Nℍ_N(Y_j)

is a two-sample rank statistic. This can be shown to be asymptotically normal by the preceding lemma. Because Nℍ_N = m𝔽_m + n𝔾_n, the asymptotic normality of the pair (ℍ_N, 𝔾_n) can be obtained from the asymptotic normality of the pair (𝔽_m, 𝔾_n), which is discussed in the preceding example. □
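The identity Nℍ_N(Y_j) = (rank of Y_j) in the untied case can be verified directly; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 4
x = rng.normal(size=m)          # first sample
y = rng.normal(size=n) + 1.0    # second sample
pooled = np.concatenate([x, y])
N = m + n

def H_N(t):
    # empirical distribution function of the pooled sample
    return np.mean(pooled <= t)

# With continuous data there are no ties, and N * H_N(Y_j) is the rank of Y_j.
rank = {v: r + 1 for r, v in enumerate(np.sort(pooled))}
assert all(round(N * H_N(yj)) == rank[yj] for yj in y)
print("N * H_N(Y_j) equals the rank of Y_j for all j")
```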
The cumulative hazard function corresponding to a cumulative distribution function F on [0, ∞] is defined as

Λ_F(t) = ∫_{[0,t]} dF/(1 − F_−).

In particular, if F has a density f, then Λ_F has a density λ_F = f/(1 − F). If F is the distribution function of the "survival time" of a person or object, then dΛ_F(t) can be interpreted as the probability of "instant death at time t given survival until t." The hazard function is an important modeling tool in survival analysis. The correspondence between distribution functions and hazard functions is one-to-one. The cumulative distribution function can be explicitly recovered from the cumulative hazard function as the product integral of Λ_F (see the proof of Lemma 25.74),

1 − F(t) = ∏_{0<s≤t} (1 − Λ_F{s}) e^{−Λ_F^c(t)},   (20.13)

where Λ_F{s} denotes the jump of Λ_F at s and Λ_F^c its continuous part.
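For a purely discrete distribution the continuous part of Λ_F vanishes, and the product integral reduces to 1 − F(t) = ∏_{s≤t}(1 − Λ_F{s}), which can be checked directly; a sketch:

```python
import numpy as np

# A discrete survival-time distribution with point masses p on times 1..5.
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
F = np.cumsum(p)        # distribution function at the support points
F_minus = F - p         # left limits F(t-)

# Hazard jumps: Lambda{t} = F{t} / (1 - F(t-)).
dLambda = p / (1.0 - F_minus)

# Product-integral inversion: 1 - F(t) = prod over s <= t of (1 - Lambda{s}).
F_recovered = 1.0 - np.cumprod(1.0 - dLambda)

print(np.allclose(F, F_recovered))  # True
```

Note that the last hazard jump equals 1, reflecting that "death" is certain at the end of the support.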
21.13 Theorem (Extremal types). If b_n^{−1}(X_{n(n)} − a_n) ⇝ G for a nondegenerate cumulative distribution function G, then G belongs to the location-scale family of a distribution of one of the following forms:
(i) e^{−e^{−x}} with support ℝ;
(ii) e^{−(1/x^α)} with support [0, ∞) and α > 0;
(iii) e^{−(−x)^α} with support (−∞, 0] and α > 0.
21.14 Example (Uniform). If the distribution has finite support [0, 1] with 1 − F(t) = (1 − t)^α, then n(1 − F)(1 + n^{−1/α}x) → (−x)^α for every x ≤ 0. In view of Lemma 21.12, the sequence n^{1/α}(X_{n(n)} − 1) converges in distribution to a limit of type (iii). The uniform distribution is the special case with α = 1, for which the limit distribution is the negative of an exponential distribution. □
† For a proof of the following theorem, see [66] or Theorem 1.4.2 in [90].
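For the uniform case (α = 1) the convergence is easy to see in simulation; a sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1000, 5000

# n(X_(n) - 1) for uniform samples should be approximately -Exp(1) distributed.
maxima = rng.uniform(size=(reps, n)).max(axis=1)
z = n * (maxima - 1.0)

# Limit distribution function: P(Z <= z) = e^z for z <= 0 (type (iii), alpha = 1).
print(np.mean(z <= -1.0), np.exp(-1.0))  # both close to 0.368
print(z.mean())                          # close to -1
```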
21.4 Extreme Values
21.15 Example (Pareto). The survival distribution of the Pareto distribution satisfies 1 − F(t) = (μ/t)^α for t ≥ μ. Thus n(1 − F)(n^{1/α}μx) → 1/x^α for every x > 0. In view of Lemma 21.12, the sequence n^{−1/α}X_{n(n)}/μ converges in distribution to a limit of type (ii). □
21.16 Example (Normal). For the normal distribution the calculations are similar, but more delicate. We choose

a_n = √(2 log n) − (log log n + log 4π)/(2√(2 log n)),   b_n = 1/√(2 log n).

Using Mills' ratio, which asserts that
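The delicacy alluded to here is visible numerically: comparing the exact distribution Φ(a_n + b_n x)^n of the normalized maximum with its Gumbel limit e^{−e^{−x}} shows an error that shrinks only at a logarithmic rate in n. A sketch, evaluated at x = 0:

```python
import math

def phi_cdf(t):
    # standard normal distribution function via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def gumbel(x):
    return math.exp(-math.exp(-x))

diffs = {}
for n in (10**4, 10**8):
    r = math.sqrt(2.0 * math.log(n))
    a_n = r - (math.log(math.log(n)) + math.log(4.0 * math.pi)) / (2.0 * r)
    b_n = 1.0 / r
    # exact distribution of b_n^{-1}(max of n standard normals - a_n) at x = 0
    diffs[n] = phi_cdf(a_n) ** n - gumbel(0.0)

print(diffs)  # the error decreases, but very slowly
```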
Proof. 0, by choosing x close to 0. In case (ii), we can bound the jump at u_n by (1 − F)(xu_n) − (1 − F)(u_n) for every x < 1, which is of the order (1 − F)(u_n)(1/x^α − 1) ≤ (1/n)(1/x^α − 1). In case (iii) we argue similarly. We conclude the proof by applying Lemma 21.12. For instance, in case (i) we have n(1 − F)(u_n + g(u_n)x) = n(1 − F)(u_n)e^{−x} → e^{−x} for every x, and the result follows. The argument under the assumptions (ii) or (iii) is similar. ∎
0. We conclude that √n E X_{n1} → 0. Because we also have that E X²_{n1} → Pg² and E X²_{n1} 1{|X_{n1}| ≥ ε√n} → 0 for every ε > 0, the Lindeberg-Feller theorem yields that F_n(u | V_n) → Φ(u). This implies F_n(u | V_n) ⇝ Φ(u) by Theorem 18.11 or a direct argument. ∎

By taking linear combinations, we readily see from the preceding lemma that the empirical process 𝔾_n and b_n^{−1}(X_{n(n)} − a_n), if they converge, are asymptotically independent as well. This independence carries over onto statistics whose asymptotic distribution can
be derived from the empirical process by the delta method, including central order statistics X_{n(k_n)} with k_n/n = p + O(1/√n), because these are asymptotically equivalent to averages.
Notes

For more results concerning the empirical quantile function, the books [28] and [134] are good starting points. For results on extreme order statistics, see [66] or the book [90].
PROBLEMS

1. Suppose that F_n → F uniformly. Does this imply that F_n^{−1} → F^{−1} uniformly or pointwise? Give a counterexample.

2. Show that the asymptotic lengths of the two types of asymptotic confidence intervals for a quantile, discussed in Example 21.8, are within o_P(1/√n). Assume that the asymptotic variance of the sample quantile (involving 1/f∘F^{−1}(p)) can be estimated consistently.

3. Find the limit distribution of the median absolute deviation from the mean, med_{1≤i≤n} |X_i − X̄_n|.

4. Find the limit distribution of the pth quantile of the absolute deviation from the median.

5. Prove that X̄_n and X_{n(n−1)} are asymptotically independent.
22 L-Statistics
In this chapter we prove the asymptotic normality of linear combinations of order statistics, particularly those used for robust estimation or testing, such as trimmed means. We present two methods: The projection method presumes knowledge of Chapter 11 only; the second method is based on the functional delta method of Chapter 20.
22.1 Introduction
Let X_{n(1)}, …, X_{n(n)} be the order statistics of a sample of real-valued random variables. A linear combination of (transformed) order statistics, or L-statistic, is a statistic of the form

Σ_{i=1}^n c_{ni} a(X_{n(i)}).
The coefficients c_{ni} are a triangular array of constants and a is some fixed function. This "score function" can without much loss of generality be taken equal to the identity function, for an L-statistic with monotone function a can be viewed as a linear combination of the order statistics of the variables a(X₁), …, a(X_n), and an L-statistic with a function a of bounded variation can be dealt with similarly, by splitting the L-statistic into two parts.

22.1 Example (Trimmed and Winsorized means). The simplest example of an L-statistic is the sample mean. More interesting are the α-trimmed means†

(1/(n − 2⌊αn⌋)) Σ_{i=⌊αn⌋+1}^{n−⌊αn⌋} X_{n(i)},

and the α-Winsorized means

(1/n) [⌊αn⌋ X_{n(⌊αn⌋)} + Σ_{i=⌊αn⌋+1}^{n−⌊αn⌋} X_{n(i)} + ⌊αn⌋ X_{n(n−⌊αn⌋+1)}].

† The notation ⌊x⌋ is used for the greatest integer that is less than or equal to x. Also ⌈x⌉ denotes the smallest integer greater than or equal to x. For a natural number n and a real number 0 ≤ x ≤ n one has ⌊n − x⌋ = n − ⌈x⌉ and ⌈n − x⌉ = n − ⌊x⌋.
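Both estimators are easy to compute directly from the order statistics. A sketch, assuming the convention of the display above, in which the ⌊αn⌋ smallest (largest) observations are replaced by X_{n(⌊αn⌋)} (by X_{n(n−⌊αn⌋+1)}):

```python
import numpy as np

def trimmed_mean(x, alpha):
    # average of X_(floor(an)+1), ..., X_(n - floor(an))
    x = np.sort(x)
    n = len(x)
    k = int(np.floor(alpha * n))
    return x[k:n - k].mean()

def winsorized_mean(x, alpha):
    # the k = floor(an) smallest values are replaced by X_(k),
    # the k largest by X_(n-k+1), and the result is averaged
    x = np.sort(x)
    n = len(x)
    k = int(np.floor(alpha * n))
    w = x.copy()
    w[:k] = x[k - 1]      # X_(k) in the 1-based notation of the display
    w[n - k:] = x[n - k]  # X_(n-k+1)
    return w.mean()

data = np.array([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 90.0])
print(trimmed_mean(data, 0.2))     # 4.5: the outliers are discarded
print(winsorized_mean(data, 0.2))  # 4.5: the outliers are pulled in
```

Unlike the sample mean of these data, neither estimator is dragged away by the two gross outliers.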
Suppose that a : ℝ → ℝ is right-continuous and nondecreasing with a(0) = 0, and b : ℝ → ℝ is right-continuous, nondecreasing, and bounded. Prove that

∫ a db = ∫_{(0,∞)} (b(∞) − b_−) da + ∫_{(−∞,0]} (b(−∞) − b_−) da.

If a is also bounded, then the right-hand side can be written more succinctly as ab|_{−∞}^{∞} − ∫ b_− da. (Substitute a(x) = ∫_{(0,x]} da for x > 0 and a(x) = −∫_{(x,0]} da for x ≤ 0 into the left side of the equation, and use Fubini's theorem separately on the integral over the positive and negative part of the real line.)
23 Bootstrap
This chapter investigates the asymptotic properties of bootstrap estimators for distributions and confidence intervals. The consistency of the bootstrap for the sample mean implies the consistency for many other statistics by the delta method. A similar result is valid with the empirical process.
23.1 Introduction
In most estimation problems it is important to give an indication of the precision of a given estimate. A simple method is to provide an estimate of the bias and variance of the estimator; more accurate is a confidence interval for the parameter. In this chapter we concentrate on bootstrap confidence intervals and, more generally, discuss the bootstrap as a method of estimating the distribution of a given statistic.

Let θ̂ be an estimator of some parameter θ attached to the distribution P of the observations. The distribution of the difference θ̂ − θ contains all the information needed for assessing the precision of θ̂. In particular, if ξ_α is the upper α-quantile of the distribution of (θ̂ − θ)/σ̂, then P(θ̂ − ξ_β σ̂ ≤ θ ≤ θ̂ − ξ_{1−α} σ̂ | P) ≥ 1 − β − α. Here σ̂ may be arbitrary, but it is typically an estimate of the standard deviation of θ̂. It follows that the interval [θ̂ − ξ_β σ̂, θ̂ − ξ_{1−α} σ̂] is a confidence interval of level 1 − β − α. Unfortunately, in most situations the quantiles and the distribution of θ̂ − θ depend on the unknown distribution P of the observations and cannot be used to assess the performance of θ̂. They must be replaced by estimators. If the sequence (θ̂ − θ)/σ̂ tends in distribution to a standard normal variable, then the normal N(0, σ̂²)-distribution can be used as an estimator of the distribution of θ̂ − θ, and we can substitute the standard normal quantiles z_α for the quantiles ξ_α. The weak convergence implies that the interval [θ̂ − z_β σ̂, θ̂ − z_{1−α} σ̂] is a confidence interval of asymptotic level 1 − α − β.

Bootstrap procedures yield an alternative. They are based on an estimate P̂ of the underlying distribution P of the observations. The distribution of (θ̂ − θ)/σ̂ under P can, in principle, be written as a function of P. The bootstrap estimator for this distribution is the "plug-in" estimator obtained by substituting P̂ for P in this function. Bootstrap estimators
for quantiles, and next confidence intervals, are obtained from the bootstrap estimator for the distribution.

The following type of notation is customary. Let θ̂* and σ̂* be computed from (hypothetic) observations obtained according to P̂ in the same way θ̂ and σ̂ are computed from the true observations with distribution P. If θ̂ is related to P̂ in the same way θ is related to P, then the bootstrap estimator for the distribution of (θ̂ − θ)/σ̂ under P is the distribution of (θ̂* − θ̂)/σ̂* under P̂. The latter is evaluated given the original observations, that is, for a fixed realization of P̂. A bootstrap estimator for a quantile ξ_α of (θ̂ − θ)/σ̂ is a quantile of the distribution of (θ̂* − θ̂)/σ̂* under P̂. This is the smallest value x = ξ̂_α that satisfies the inequality

P((θ̂* − θ̂)/σ̂* ≤ x | P̂) ≥ 1 − α.   (23.1)
The notation P(· | P̂) indicates that the distribution of (θ̂*, σ̂*) must be evaluated assuming that the observations are sampled according to P̂ given the original observations. In particular, in the preceding display θ̂ is to be considered nonrandom. The left side of the preceding display is a function of the original observations, whence the same is true for ξ̂_α. If P̂ is close to the true underlying distribution P, then the bootstrap quantiles should be close to the true quantiles, whence it should be true that

P((θ̂ − θ)/σ̂ ≤ ξ̂_α | P) ≈ 1 − α.
In this chapter we show that this approximation is valid in an asymptotic sense: The probability on the left converges to 1 − α as the number of observations tends to infinity. Thus, the bootstrap confidence interval

[θ̂ − ξ̂_β σ̂, θ̂ − ξ̂_{1−α} σ̂] = {θ : ξ̂_{1−α} ≤ (θ̂ − θ)/σ̂ ≤ ξ̂_β}
possesses asymptotic confidence level 1 − α − β. The statistic σ̂ is typically chosen equal to an estimator of the (asymptotic) standard deviation of θ̂. The resulting bootstrap method is known as the percentile t-method, in view of the fact that it is based on estimating quantiles of the "studentized" statistic (θ̂ − θ)/σ̂. (The notion of a t-statistic is used here in an abstract manner to denote a centered statistic divided by a scale estimate; in general, there is no relationship with Student's t-distribution from normal theory.) A simpler method is to choose σ̂ independent of the data. If we choose σ̂ = σ̂* = 1, then the bootstrap quantiles ξ̂_α are the quantiles of the centered statistic θ̂* − θ̂. This is known as the percentile method. Both methods yield asymptotically correct confidence levels, although the percentile t-method is generally more accurate. A third method, Efron's percentile method, proposes the confidence interval [ε̂_{1−β}, ε̂_α] for ε̂_α equal to the upper α-quantile of θ̂*: the smallest value x = ε̂_α such that
P(θ̂* ≤ x | P̂) ≥ 1 − α.

Thus, ε̂_α results from "bootstrapping" θ̂, while ξ̂_α is the product of bootstrapping (θ̂ − θ)/σ̂. These quantiles are related, and Efron's percentile interval can be reexpressed in the quantiles ξ̂_α of θ̂* − θ̂ (employed by the percentile method with σ̂ = 1) as

[ε̂_{1−β}, ε̂_α] = [θ̂ + ξ̂_{1−β}, θ̂ + ξ̂_α].
The logical justification for this interval is less strong than for the intervals based on bootstrapping θ̂ − θ, but it appears to work well. The two types of intervals coincide in the case that the conditional distribution of θ̂* − θ̂ is symmetric about zero. We shall see that the difference is asymptotically negligible if θ̂* − θ̂ converges to a normal distribution. Efron's percentile interval is the only one among the three intervals that is invariant under monotone transformations. For instance, if setting a confidence interval for the correlation coefficient, the sample correlation coefficient might be transformed by Fisher's transformation before carrying out the bootstrap scheme. Next, the confidence interval for the transformed correlation can be transformed back into a confidence interval for the correlation coefficient. This operation would have no effect on Efron's percentile interval but can improve the other intervals considerably, in view of the skewness of the statistic. In this sense Efron's method automatically "finds" useful (stabilizing) transformations. The fact that it does not become better through transformations of course does not imply that it is good, but the invariance appears desirable.

Several of the elements of the bootstrap scheme are still unspecified. The missing probability α + β can be distributed over the two tails of the confidence interval in several ways. In many situations equal-tailed confidence intervals, corresponding to the choice α = β, are reasonable. In general, these do not have θ̂ exactly as the midpoint of the interval. An alternative is the interval

[θ̂ − ξ̂'_α σ̂, θ̂ + ξ̂'_α σ̂],
with ξ̂'_α equal to the upper α-quantile of |θ̂* − θ̂|/σ̂*. A further possibility is to choose α and β under the side condition that the difference ξ̂_β − ξ̂_{1−α}, which is proportional to the length of the confidence interval, is minimal.

More interesting is the choice of the estimator P̂ for the underlying distribution. If the original observations are a random sample X₁, …, X_n from a probability distribution P, then one candidate is the empirical distribution ℙ_n = n^{−1} Σ δ_{X_i} of the observations, leading to the empirical bootstrap. Generating a random sample from the empirical distribution amounts to resampling with replacement from the set {X₁, …, X_n} of original observations. The name "bootstrap" derives from this resampling procedure, which might be surprising at first, because the observations are "sampled twice." If we view the bootstrap as a nonparametric plug-in estimator, we see that there is nothing peculiar about resampling. We shall be mostly concerned with the empirical bootstrap, even though there are many other possibilities. If the observations are thought to follow a specified parametric model, then it is more reasonable to set P̂ equal to P_θ̂ for a given estimator θ̂. This is what one would have done in the first place, but it is called the parametric bootstrap within the present context. That the bootstrapping methodology is far from obvious is clear from the fact that the literature also considers the exchangeable, the Bayesian, the smoothed, and the wild bootstrap, as well as several schemes for bootstrap corrections. Even "resampling" can be carried out differently, for instance, by sampling fewer than n variables, or without replacement.

It is almost never possible to calculate the bootstrap quantiles ξ̂_α numerically. In practice, these estimators are approximated by a simulation procedure. A large number of independent bootstrap samples X₁*, …, X_n* are generated according to the estimated distribution P̂. Each sample gives rise to a bootstrap value (θ̂* − θ̂)/σ̂* of the standardized statistic. Finally, the bootstrap quantiles ξ̂_α are estimated by the empirical quantiles of these bootstrap
values. This simulation scheme always produces an additional (random) error in the coverage probability of the resulting confidence interval. In principle, by using a sufficiently large number of bootstrap samples, possibly combined with an efficient method of simulation, this error can be made arbitrarily small. Therefore the additional error is usually ignored in the theory of the bootstrap procedure. This chapter follows this custom and concerns the "exact" distribution and quantiles of (θ̂* − θ̂)/σ̂*, without taking a simulation error into account.
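The full simulation scheme, resample, studentize, and read off empirical upper quantiles, fits in a few lines. A sketch for the sample mean (the function name and tuning constants are illustrative):

```python
import numpy as np

def percentile_t_interval(x, alpha=0.05, beta=0.05, B=2000, seed=None):
    """Percentile-t interval for the mean: bootstrap the studentized statistic
    and invert its empirical upper quantiles."""
    rng = np.random.default_rng(seed)
    n = len(x)
    theta_hat = x.mean()
    sigma_hat = x.std(ddof=1) / np.sqrt(n)
    boot = rng.choice(x, size=(B, n), replace=True)        # empirical bootstrap
    sigma_star = boot.std(axis=1, ddof=1) / np.sqrt(n)
    t_star = (boot.mean(axis=1) - theta_hat) / sigma_star  # (theta* - theta)/sigma*
    xi_beta = np.quantile(t_star, 1.0 - beta)  # upper beta-quantile
    xi_1ma = np.quantile(t_star, alpha)        # upper (1 - alpha)-quantile
    return theta_hat - xi_beta * sigma_hat, theta_hat - xi_1ma * sigma_hat

rng = np.random.default_rng(4)
x = rng.exponential(size=200)        # skewed data, true mean 1
lo, hi = percentile_t_interval(x, seed=5)
print(round(lo, 3), round(hi, 3))    # an approximate 90% interval for the mean
```

Because the data are skewed, the resulting interval is typically not symmetric about the sample mean, which is exactly the behavior the percentile t-method is designed to capture.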
23.2 Consistency
A confidence interval [θ̂_{n,1}, θ̂_{n,2}] is (conservatively) asymptotically consistent at level 1 − α − β if, for every possible P,

lim inf_{n→∞} P(θ̂_{n,1} ≤ θ ≤ θ̂_{n,2} | P) ≥ 1 − α − β.

The consistency of a bootstrap confidence interval is closely related to the consistency of the bootstrap estimator of the distribution of (θ̂_n − θ)/σ̂_n. The latter is best defined relative to a metric on the collection of possible laws of the estimator. Call the bootstrap estimator for the distribution consistent relative to the Kolmogorov-Smirnov distance if

sup_x | P((θ̂_n − θ)/σ̂_n ≤ x | P) − P((θ̂_n* − θ̂_n)/σ̂_n* ≤ x | P̂_n) | →^P 0.
It is not a great loss of generality to assume that the sequence (θ̂_n − θ)/σ̂_n converges in distribution to a continuous distribution function F (in our examples
Because this function is skew-symmetric about the point θ, the bias condition in (25.58) is satisfied, with a bias of zero. Because the efficient score function can be written in the form

ℓ̃_{θ,η}(x) = −(g'/g)(|x − θ|) sign(x − θ),

the consistency condition in (25.58) reduces to consistency of k̂_n for the function g'/g in that

∫ (k̂_n − g'/g)²(s) g(s) ds →^P 0.   (25.62)

Estimators k̂_n can be constructed by several methods, a simple one being the kernel method of density estimation. For a fixed twice continuously differentiable probability density w with compact support, a bandwidth parameter σ_n, and further positive tuning parameters α_n, β_n, and γ_n, set

ĝ_n(s) = (1/(nσ_n)) Σ_{i=1}^n w((s − T_i)/σ_n),

and let k̂_n be a suitably truncated version of ĝ_n'/ĝ_n, using the truncation parameters α_n, β_n, and γ_n.   (25.63)

Then (25.58) is satisfied provided α_n ↑ ∞, β_n ↓ 0, γ_n ↓ 0, and σ_n ↓ 0 at appropriate speeds. The proof is technical and is given in the next lemma. This particular construction shows that efficient estimators for θ exist under minimal conditions. It is not necessarily recommended for use in practice. However, any good initial estimator θ̂_n and any method of density or curve estimation may be substituted and will lead to a reasonable estimator for θ, which is theoretically efficient under some regularity conditions.
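A minimal version of this kernel construction is easy to code. The sketch below simplifies in two ways: it uses a Gaussian kernel w instead of a compactly supported one, and it omits the truncation. For an exponential density g the score g'/g is identically −1 away from zero, which provides a check:

```python
import numpy as np

def kernel_density_and_derivative(t_obs, s, sigma):
    """Kernel estimates of g(s) and g'(s), using a Gaussian kernel."""
    u = (s[:, None] - t_obs[None, :]) / sigma
    w = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    g_hat = w.mean(axis=1) / sigma
    g_prime_hat = (-u * w).mean(axis=1) / sigma**2  # derivative of the kernel sum
    return g_hat, g_prime_hat

rng = np.random.default_rng(6)
t_obs = rng.exponential(size=20000)      # g(s) = e^{-s} on [0, infinity)
s = np.linspace(1.0, 2.0, 4)
g_hat, g_prime_hat = kernel_density_and_derivative(t_obs, s, sigma=0.2)

# Untruncated score estimate; the truncation in (25.63) would guard against
# small values of the estimated density.
k_n = g_prime_hat / g_hat
print(k_n)  # each entry should be near -1
```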
Semiparametric Models
25.64 Lemma. Let T₁, …, T_n be a random sample from a density g that is supported and absolutely continuous on [0, ∞) and satisfies ∫ (g'/√g)²(s) ds < ∞. Then k̂_n given by (25.63) for a probability density w that is twice continuously differentiable and supported on [−1, 1] satisfies (25.62), if α_n ↑ ∞, γ_n → 0, β_n → 0, and σ_n → 0 in such a way that σ_n ≤ γ_n, α_n²σ_n/β_n² → 0, and nσ_n⁴β_n² → ∞.
0. Combining this with the preceding display, we conclude that ĝ_n(s) →^P g(s). If g' is sufficiently smooth, then the analogous statement is true for ĝ_n'(s). Under only the condition of finite Fisher information for location, this may fail, but we still have that ĝ_n'(s) − g_n'(s) →^P 0 for every s; furthermore, g_n' 1[γ, ∞) → g' in L₁, because

∫_γ^∞ |g_n' − g'|(s) ds ≤ ∫∫ |g'(s − σy) − g'(s)| ds w(y) dy → 0,

by the L₁-continuity theorem on the inner integral, and next the dominated-convergence theorem on the outer integral. The expectation of the integral in (25.62) restricted to the complement of the set B̂_n is equal to

∫ (g'/g)²(s) g(s) P(|ĝ_n'|(s) > α ĝ_n(s) or ĝ_n(s) < β or s < γ) ds,

and the probability P(|ĝ_n'|(s) > α) is bounded above by 1{|g_n'|(s) > α/2} + o(1), and the Lebesgue measure of the set {s : |g_n'|(s) > α/2} converges to zero, because g_n' → g' in L₁. On the set B̂_n the integrand in (25.62) is the square of the function (ĝ_n'/ĝ_n − g'/g) g^{1/2}. This function can be decomposed as

((ĝ_n' − g_n')/ĝ_n) g^{1/2} + ((g_n' − g')/ĝ_n) g^{1/2} + g' (1/ĝ_n − 1/g) g^{1/2}.
0 such that

Then B*_{θ₀,η₀} B_{θ₀,η₀} : C_β(Z) → C_β(Z) is continuously invertible for every β < α.
Proof. By its strict positive-definiteness in the Hilbert-space sense, the operator B₀*B₀ : ℓ∞(Z) → ℓ∞(Z) is certainly one-to-one in that B₀*B₀h = 0 implies that h = 0 almost surely under η₀. On reinserting this we find that −h = C₀*C₀h = C₀*0 = 0 everywhere. Thus B₀*B₀ is also one-to-one in a pointwise sense. If it can be shown that C₀*C₀ : C_β(Z) → C_β(Z) is compact, then B₀*B₀ is onto and continuously invertible, by Lemma 25.93.

It follows from the Lipschitz condition on the partial derivatives that C₀h(z) is differentiable for every bounded function h : X → ℝ and its partial derivatives can be found by differentiating under the integral sign:

(∂/∂z_i) C₀h(z) = ∫ h(x) (∂/∂z_i) p₀(x | z) dμ(x).
The two conditions of the lemma imply that this function has Lipschitz norm of order α bounded by K‖h‖∞. Let h_n be a uniformly bounded sequence in ℓ∞(X). Then the partial derivatives of the sequence C₀h_n are uniformly bounded and have uniformly bounded Lipschitz norms of order α. Because Z is totally bounded, it follows by a strengthening of the Arzelà-Ascoli theorem that the sequences of partial derivatives are precompact with respect to the Lipschitz norm of order β for every β < α. Thus there exists a subsequence along which the partial derivatives converge in the Lipschitz norm of order β. By the Arzelà-Ascoli theorem there exists a further subsequence such that the functions C₀h_n(z) converge uniformly to a limit. If both a sequence of functions itself and their continuous partial derivatives converge uniformly to limits, then the limit of the functions must have the limits of the sequences of partial derivatives as its partial derivatives. We conclude that C₀h_n converges in the ‖·‖_{1+β}-norm, whence C₀ : ℓ∞(X) → C_β(Z) is compact. Then the operator C₀*C₀ is certainly compact as an operator from C_β(Z) into itself. ∎