Probability Theory
S. R. S. Varadhan
Courant Institute of Mathematical Sciences
New York University
August 31, 2000
Contents

1 Measure Theory
1.1 Introduction
1.2 Construction of Measures
1.3 Integration
1.4 Transformations
1.5 Product Spaces
1.6 Distributions and Expectations

2 Weak Convergence
2.1 Characteristic Functions
2.2 Moment Generating Functions
2.3 Weak Convergence

3 Independent Sums
3.1 Independence and Convolution
3.2 Weak Law of Large Numbers
3.3 Strong Limit Theorems
3.4 Series of Independent Random variables
3.5 Strong Law of Large Numbers
3.6 Central Limit Theorem
3.7 Accompanying Laws
3.8 Infinitely Divisible Distributions
3.9 Laws of the iterated logarithm

4 Dependent Random Variables
4.1 Conditioning
4.2 Conditional Expectation
4.3 Conditional Probability
4.4 Markov Chains
4.5 Stopping Times and Renewal Times
4.6 Countable State Space

5 Martingales
5.1 Definitions and properties
5.2 Martingale Convergence Theorems
5.3 Doob Decomposition Theorem
5.4 Stopping Times
5.5 Upcrossing Inequality
5.6 Martingale Transforms, Option Pricing
5.7 Martingales and Markov Chains

6 Stationary Stochastic Processes
6.1 Ergodic Theorems
6.2 Structure of Stationary Measures
6.3 Stationary Markov Processes
6.4 Mixing properties of Markov Processes
6.5 Central Limit Theorem for Martingales
6.6 Stationary Gaussian Processes

7 Dynamic Programming and Filtering
7.1 Optimal Control
7.2 Optimal Stopping
7.3 Filtering
Preface

These notes are based on a first year graduate course on Probability and Limit theorems given at Courant Institute of Mathematical Sciences. Originally written during 1997-98, they have been revised during academic year 1998-99 as well as in the Fall of 1999. I want to express my appreciation to those who pointed out several typos and offered suggestions for improvement. I want to mention in particular the detailed comments from Professor Charles Newman and Mr Enrique Loubet. Chuck used the notes while teaching the course in 98-99 and Enrique helped me as TA when I taught out of these notes again in the Fall of 99. These notes cover about three fourths of the course, essentially discrete time processes. Hopefully there will appear a companion volume some time in the near future that will cover continuous time processes. A small amount of measure theory is included. While it is not meant to be complete, it is my hope that it will be useful.
Chapter 1
Measure Theory

1.1 Introduction.
In its early development, probability theory rested more on intuition than on mathematical axioms. In 1933, A. N. Kolmogorov [4] provided an axiomatic basis for probability theory and it is now the universally accepted model. There are certain ‘non commutative’ versions that have their origins in quantum mechanics, see for instance K. R. Parthasarathy [5], that are generalizations of the Kolmogorov Model. We shall however use exclusively Kolmogorov’s framework.

The basic intuition in probability theory is the notion of randomness. There are experiments whose results are not predictable and can be determined only after performing them and observing the outcome. The simplest familiar examples are the tossing of a fair coin and the throwing of a balanced die. In the first experiment the result could be either a head or a tail, and the throwing of a die could result in a score of any integer from 1 through 6. These are experiments with only a finite number of alternate outcomes. It is not difficult to imagine experiments that have countably or even uncountably many alternatives as possible outcomes.

Abstractly then, there is a space Ω of all possible outcomes and each individual outcome is represented as a point ω in that space Ω. Subsets of Ω are called events and each of them corresponds to a collection of outcomes. If the outcome ω is in the subset A, then the event A is said to have occurred. For example in the case of a die the set A = {1, 3, 5} ⊂ Ω corresponds to the event ‘an odd number shows up’. With this terminology it is clear that
union of sets corresponds to ‘or’, intersection to ‘and’, and complementation to ‘negation’.

One would expect that probabilities should be associated with each outcome and that there should be a ‘probability function’ f(ω) which is the probability that ω occurs. In the case of coin tossing we may expect Ω = {H, T} and
$$ f(T) = f(H) = \frac{1}{2}. $$
Or in the case of a die
$$ f(1) = f(2) = \cdots = f(6) = \frac{1}{6}. $$
Since ‘probability’ is normalized so that certainty corresponds to a probability of 1, one expects
$$ \sum_{\omega \in \Omega} f(\omega) = 1. \tag{1.1} $$
If Ω is uncountable this is a mess. There is no reasonable way of adding up an uncountable set of numbers each of which is 0. This suggests that it may not be possible to start with probabilities associated with individual outcomes and build a meaningful theory. The next best thing is to start with the notion that probabilities are already defined for events. In such a case, P(A) is defined for a class ℬ of subsets A ⊂ Ω. The question that arises naturally is what should ℬ be and what properties should P(·) defined on ℬ have? It is natural to demand that the class ℬ of sets for which probabilities are to be defined satisfy the following properties: The whole space Ω and the empty set Φ are in ℬ. For any two sets A and B in ℬ, the sets A ∪ B and A ∩ B are again in ℬ. If A ∈ ℬ, then the complement A^c is again in ℬ. Any class of sets satisfying these properties is called a field.

Definition 1.1. A ‘probability’ or more precisely ‘a finitely additive probability measure’ is a nonnegative set function P(·) defined for sets A ∈ ℬ that satisfies the following properties:
$$ P(A) \ge 0 \quad \text{for all } A \in \mathcal{B}, \tag{1.2} $$
$$ P(\Omega) = 1 \quad \text{and} \quad P(\Phi) = 0. \tag{1.3} $$
If A ∈ ℬ and B ∈ ℬ are disjoint then
$$ P(A \cup B) = P(A) + P(B). \tag{1.4} $$
In particular
$$ P(A^c) = 1 - P(A) \tag{1.5} $$
for all A ∈ ℬ.

A condition which is somewhat more technical, but important from a mathematical viewpoint, is that of countable additivity. The class ℬ, in addition to being a field, is assumed to be closed under countable union (or equivalently, countable intersection); i.e. if A_n ∈ ℬ for every n, then A = ∪_n A_n ∈ ℬ. Such a class is called a σ-field. The ‘probability’ itself is presumed to be defined on a σ-field ℬ.

Definition 1.2. A set function P defined on a σ-field is called a ‘countably additive probability measure’ if in addition to satisfying equations (1.2), (1.3) and (1.4), it satisfies the following countable additivity property: for any sequence of pairwise disjoint sets A_n with A = ∪_n A_n
$$ P(A) = \sum_n P(A_n). \tag{1.6} $$
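For a finite space such as the die, properties (1.2) to (1.5) can be checked by brute force. The following is a minimal sketch, assuming the balanced die with f(ω) = 1/6 and taking the field to be all subsets of Ω; the names `prob`, `all_events` and `is_finitely_additive` are illustrative and not part of the notes.

```python
from fractions import Fraction
from itertools import combinations

# Sample space for one throw of a balanced die (illustrative choice).
OMEGA = frozenset(range(1, 7))


def f(outcome):
    """Probability function of a single outcome: f(w) = 1/6."""
    return Fraction(1, 6)


def prob(event):
    """P(A) = sum of f(w) over the outcomes w in A."""
    return sum((f(w) for w in event), Fraction(0))


def all_events(omega):
    """All subsets of omega; on a finite space this class is a field."""
    items = sorted(omega)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_finitely_additive(events):
    """Check (1.5) for every event and (1.4) for every disjoint pair."""
    for a in events:
        if prob(OMEGA - a) != 1 - prob(a):
            return False
        for b in events:
            if not (a & b) and prob(a | b) != prob(a) + prob(b):
                return False
    return True


ODD = frozenset({1, 3, 5})  # the event 'an odd number shows up'
```

Exact rational arithmetic (`Fraction`) is used so that the equalities are tested without rounding error. On a finite space countable additivity reduces to finite additivity, so this check says nothing about the uncountable case discussed above.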
Exercise 1.1. The limit of an increasing (or decreasing) sequence A_n of sets is defined as its union ∪_n A_n (or the intersection ∩_n A_n). A monotone class is defined as a class that is closed under monotone limits of increasing or decreasing sequences of sets. Show that a field ℬ is a σ-field if and only if it is a monotone class.

Exercise 1.2. Show that a finitely additive probability measure P(·) defined on a σ-field ℬ is countably additive, i.e. satisfies equation (1.6), if and only if it satisfies either of the following two equivalent conditions.

If A_n is any nonincreasing sequence of sets in ℬ and A = lim_n A_n = ∩_n A_n then
$$ P(A) = \lim_{n} P(A_n). $$
If A_n is any nondecreasing sequence of sets in ℬ and A = lim_n A_n = ∪_n A_n then
$$ P(A) = \lim_{n} P(A_n). $$
Exercise 1.3. If A, B ∈ ℬ, and P is a finitely additive probability measure show that P(A ∪ B) = P(A) + P(B) − P(A ∩ B). How does this generalize to $P(\cup_{j=1}^{n} A_j)$?

Exercise 1.4. If P is a finitely additive measure on a field ℱ and A, B ∈ ℱ, then |P(A) − P(B)| ≤ P(A∆B) where A∆B is the symmetric difference (A ∩ B^c) ∪ (A^c ∩ B). In particular if B ⊂ A,
$$ 0 \le P(A) - P(B) \le P(A \cap B^c) \le P(B^c). $$

Exercise 1.5. If P is a countably additive probability measure, show that for any sequence A_n ∈ ℬ, $P(\cup_{n=1}^{\infty} A_n) \le \sum_{n=1}^{\infty} P(A_n)$.

Although we would like our ‘probability’ to be a countably additive probability measure on a σ-field ℬ of subsets of a space Ω, it is not clear that there are plenty of such things. As a first small step show the following.

Exercise 1.6. If {ω_n : n ≥ 1} are distinct points in Ω and p_n ≥ 0 are numbers with $\sum_n p_n = 1$ then
$$ P(A) = \sum_{n : \omega_n \in A} p_n $$
defines a countably additive probability measure on the σ-field of all subsets of Ω. (This is still cheating because the measure P lives on a countable set.)
Definition 1.3. A probability measure P on a field ℱ is said to be countably additive on ℱ, if for any sequence A_n ∈ ℱ with A_n ↓ Φ, we have P(A_n) ↓ 0.

Exercise 1.7. Given any class ℱ of subsets of Ω there is a unique σ-field ℬ such that it is the smallest σ-field that contains ℱ.

Definition 1.4. The σ-field in the above exercise is called the σ-field generated by ℱ.
1.2 Construction of Measures
The following theorem is important for the construction of countably additive probability measures. A detailed proof of this theorem, as well as other results on measure and integration, can be found in [7], [3] or in any one of the many texts on real variables. In an effort to be complete we will sketch the standard proof.

Theorem 1.1. (Caratheodory Extension Theorem). Any countably additive probability measure P on a field ℱ extends uniquely as a countably additive probability measure to the σ-field ℬ generated by ℱ.

Proof. The proof proceeds along the following steps:

Step 1. Define an object P* called the outer measure for all sets A.
$$ P^*(A) = \inf_{\cup_j A_j \supset A} \sum_j P(A_j) \tag{1.7} $$
where the infimum is taken over all countable collections {A_j} of sets from ℱ that cover A. Without loss of generality we can assume that {A_j} are disjoint. (Replace A_j by $(\cap_{i=1}^{j-1} A_i^c) \cap A_j$.)

Step 2. Show that P* has the following properties:

1. P* is countably sub-additive, i.e.
$$ P^*(\cup_j A_j) \le \sum_j P^*(A_j). $$

2. For A ∈ ℱ, P*(A) ≤ P(A). (Trivial)

3. For A ∈ ℱ, P*(A) ≥ P(A). (Need to use the countable additivity of P on ℱ)

Step 3. Define a set E to be measurable if
$$ P^*(A) \ge P^*(A \cap E) + P^*(A \cap E^c) $$
holds for all sets A, and establish the following properties for the class ℳ of measurable sets. The class of measurable sets ℳ is a σ-field and P* is a countably additive measure on it.
Step 4. Finally show that ℳ ⊃ ℱ. This implies that ℳ ⊃ ℬ and P* is an extension of P from ℱ to ℬ.

Uniqueness is quite simple. Let P_1 and P_2 be two countably additive probability measures on a σ-field ℬ that agree on a field ℱ ⊂ ℬ. Let us define 𝒜 = {A : P_1(A) = P_2(A)}. Then 𝒜 is a monotone class, i.e. if A_n ∈ 𝒜 is increasing (decreasing) then ∪_n A_n (∩_n A_n) ∈ 𝒜. Uniqueness will follow from the following fact left as an exercise.

Exercise 1.8. The smallest monotone class generated by a field is the same as the σ-field generated by the field.

It now follows that 𝒜 must contain the σ-field generated by ℱ and that proves uniqueness.

The extension theorem does not quite solve the problem of constructing countably additive probability measures. It reduces it to constructing them on fields. The following theorem is important in the theory of Lebesgue integrals and is very useful for the construction of countably additive probability measures on the real line. The proof will again be only sketched. The natural σ-field on which to define a probability measure on the line is the Borel σ-field. This is defined as the smallest σ-field containing all intervals and includes in particular all open sets.

Let us consider the class of subsets of the real numbers, ℐ = {I_{a,b} : −∞ ≤ a < b ≤ ∞} where I_{a,b} = {x : a < x ≤ b} if b < ∞, and I_{a,∞} = {x : a < x < ∞}. In other words ℐ is the collection of intervals that are left-open and right-closed. The class of sets that are finite disjoint unions of members of ℐ is a field ℱ, if the empty set is added to the class. If we are given a function F(x) on the real line which is nondecreasing, continuous from the right and satisfies
$$ \lim_{x \to -\infty} F(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F(x) = 1, $$
we can define a finitely additive probability measure P by first defining P(I_{a,b}) = F(b) − F(a) for intervals and then extending it to ℱ by defining it as the sum for disjoint unions from ℐ. Let us note that the Borel σ-field ℬ on the real line is the σ-field generated by ℱ.
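The passage from F to P on finite disjoint unions of left-open, right-closed intervals can be sketched in a few lines. The logistic function used for F below is only an illustrative assumption (any nondecreasing right continuous F with limits 0 and 1 would do), and the helper names are hypothetical.

```python
import math


def F(x):
    """Illustrative distribution function: the logistic function,
    written with tanh so that it is also defined at -inf and +inf."""
    return 0.5 * (1.0 + math.tanh(x / 2.0))


def prob_interval(a, b):
    """P(I_{a,b}) = F(b) - F(a) for the interval (a, b]."""
    return F(b) - F(a)


def prob_union(intervals):
    """P of a finite disjoint union of intervals (a, b], given as pairs.

    Raises ValueError if two of the intervals overlap, since the sum
    is only the right definition for disjoint unions.
    """
    ordered = sorted(intervals)
    for (_, b1), (a2, _) in zip(ordered, ordered[1:]):
        if a2 < b1:
            raise ValueError("intervals overlap")
    return sum(prob_interval(a, b) for a, b in ordered)
```

Splitting an interval at an interior point does not change the value, which is the finite additivity that the construction relies on; countable additivity is the content of the theorem that follows.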
Theorem 1.2. (Lebesgue). P is countably additive on ℱ if and only if F(x) is a right continuous function of x. Therefore for each right continuous nondecreasing function F(x) with F(−∞) = 0 and F(∞) = 1 there is a unique probability measure P on the Borel subsets of the line, such that F(x) = P(I_{−∞,x}). Conversely every countably additive probability measure P on the Borel subsets of the line comes from some F. The correspondence between P and F is one-to-one.

Proof. The only difficult part is to establish the countable additivity of P on ℱ from the right continuity of F(·). Let A_j ∈ ℱ and A_j ↓ Φ, the empty set. Let us assume P(A_j) ≥ δ > 0 for all j and then establish a contradiction.

Step 1. We take a large interval [−ℓ, ℓ] and replace A_j by B_j = A_j ∩ [−ℓ, ℓ]. Since |P(A_j) − P(B_j)| ≤ 1 − F(ℓ) + F(−ℓ), we can make the choice of ℓ large enough that P(B_j) ≥ δ/2. In other words we can assume without loss of generality that P(A_j) ≥ δ/2 and A_j ⊂ [−ℓ, ℓ] for some fixed ℓ < ∞.

Step 2. If
$$ A_j = \cup_{i=1}^{k_j} I_{a_{j,i},\, b_{j,i}} $$
use the right continuity of F to replace A_j by B_j which is again a union of left open right closed intervals with the same right end points, but with left end points moved ever so slightly to the right. Achieve this in such a way that
$$ P(A_j - B_j) \le \frac{\delta}{10 \cdot 2^j} $$
for all j.

Step 3. Define C_j to be the closure of B_j, obtained by adding to it the left end points of the intervals making up B_j. Let $E_j = \cap_{i=1}^{j} B_i$ and $D_j = \cap_{i=1}^{j} C_i$. Then, (i) the sequence D_j of sets is decreasing, (ii) each D_j is a closed bounded set, (iii) since A_j ⊃ D_j and A_j ↓ Φ, it follows that D_j ↓ Φ. Because D_j ⊃ E_j and
$$ P(E_j) \ge \frac{\delta}{2} - \sum_j P(A_j - B_j) \ge \frac{4\delta}{10}, $$
each D_j is nonempty and this violates the finite intersection property that every decreasing sequence of bounded nonempty closed sets on the real line has a nonempty intersection, i.e. has at least one common point.

The rest of the proof is left as an exercise.

The function F is called the distribution function corresponding to the probability measure P.
Example 1.1. Suppose x_1, x_2, · · · , x_n, · · · is a sequence of points and we have probabilities p_n at these points. Then for the discrete measure
$$ P(A) = \sum_{n : x_n \in A} p_n $$
we have the distribution function
$$ F(x) = \sum_{n : x_n \le x} p_n $$
that only increases by jumps, the jump at x_n being p_n. The points {x_n} themselves can be discrete like integers or dense like the rationals.

Example 1.2. If f(x) is a nonnegative integrable function with integral 1, i.e. $\int_{-\infty}^{\infty} f(y)\,dy = 1$, then $F(x) = \int_{-\infty}^{x} f(y)\,dy$ is a distribution function which is continuous. In this case f is the density of the measure P and can be calculated as f(x) = F′(x).

There are (messy) examples of F that are continuous, but do not come from any density. More on this later.

Exercise 1.9. Let us try to construct the Lebesgue measure on the rationals Q ⊂ [0, 1]. We would like to have P[I_{a,b}] = b − a for all rational 0 ≤ a ≤ b ≤ 1. Show that it is impossible by showing that P[{q}] = 0 for the set {q} containing the single rational q while P[Q] = P[∪_{q∈Q} {q}] = 1. Where does the earlier proof break down?

Once we have a countably additive probability measure P on a space (Ω, Σ), we will call the triple (Ω, Σ, P) a probability space.
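The discrete measure of Example 1.1 and its distribution function are easy to make concrete. The sketch below assumes three hypothetical atoms with weights 1/2, 1/3 and 1/6; the names `ATOMS`, `measure`, `dist` and `jump` are illustrative only.

```python
from fractions import Fraction

# Hypothetical atoms x_n with weights p_n summing to 1.
ATOMS = {
    Fraction(0): Fraction(1, 2),
    Fraction(1): Fraction(1, 3),
    Fraction(5, 2): Fraction(1, 6),
}


def measure(indicator):
    """P(A) = sum of p_n over atoms x_n in A; A is passed as a predicate."""
    return sum((p for x, p in ATOMS.items() if indicator(x)), Fraction(0))


def dist(x):
    """F(x) = P((-inf, x]) = sum of p_n over atoms with x_n <= x."""
    return measure(lambda t: t <= x)


def jump(x):
    """Size of the jump of F at x, which is the mass P({x})."""
    return measure(lambda t: t == x)
```

With these definitions `dist(b) - dist(a)` agrees with the mass of the interval (a, b], which is the relation P(I_{a,b}) = F(b) − F(a) used earlier in this section.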
1.3 Integration
An important notion is that of a random variable or a measurable function.

Definition 1.5. A random variable or measurable function is a map f : Ω → R, i.e. a real valued function f(ω) on Ω such that for every Borel set B ⊂ R, f^{-1}(B) = {ω : f(ω) ∈ B} is a measurable subset of Ω, i.e. f^{-1}(B) ∈ Σ.
Exercise 1.10. It is enough to check the requirement for sets B ⊂ R that are intervals or even just sets of the form (−∞, x] for −∞ < x < ∞.

A function that is measurable and satisfies |f(ω)| ≤ M for all ω ∈ Ω for some finite M is called a bounded measurable function. The following statements are the essential steps in developing an integration theory. Details can be found in any book on real variables.

1. If A ∈ Σ, the indicator function 1_A, defined as
$$ \mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases} $$
is bounded and measurable.

2. Sums, products, limits, compositions and reasonable elementary operations like min and max performed on measurable functions lead to measurable functions.

3. If {A_j : 1 ≤ j ≤ n} is a finite disjoint partition of Ω into measurable sets, the function $f(\omega) = \sum_j c_j \mathbf{1}_{A_j}(\omega)$ is a measurable function and is referred to as a ‘simple’ function.

4. Any bounded measurable function f is a uniform limit of simple functions. To see this, if f is bounded by M, divide [−M, M] into n subintervals I_j of length 2M/n with midpoints c_j. Let
$$ A_j = f^{-1}(I_j) = \{\omega : f(\omega) \in I_j\} \quad \text{and} \quad f_n = \sum_{j=1}^{n} c_j \mathbf{1}_{A_j}. $$
Clearly f_n is simple, $\sup_\omega |f_n(\omega) - f(\omega)| \le \frac{M}{n}$, and we are done.

5. For simple functions $f = \sum_j c_j \mathbf{1}_{A_j}$ the integral $\int f(\omega)\,dP$ is defined to be $\sum_j c_j P(A_j)$. It enjoys the following properties:

(a) If f and g are simple, so is any linear combination af + bg for real constants a and b and
$$ \int (af + bg)\,dP = a \int f\,dP + b \int g\,dP. $$

(b) If f is simple so is |f| and $\left|\int f\,dP\right| \le \int |f|\,dP \le \sup_\omega |f(\omega)|$.

6. If f_n is a sequence of simple functions converging to f uniformly, then $a_n = \int f_n\,dP$ is a Cauchy sequence of real numbers and therefore has a limit a as n → ∞. The integral $\int f\,dP$ of f is defined to be this limit a. One can verify that a depends only on f and not on the sequence f_n chosen to approximate f.

7. Now the integral is defined for all bounded measurable functions and enjoys the following properties.

(a) If f and g are bounded measurable functions and a, b are real constants then the linear combination af + bg is again a bounded measurable function, and
$$ \int (af + bg)\,dP = a \int f\,dP + b \int g\,dP. $$

(b) If f is a bounded measurable function so is |f| and $\left|\int f\,dP\right| \le \int |f|\,dP \le \sup_\omega |f(\omega)|$.

(c) In fact a slightly stronger inequality is true. For any bounded measurable f,
$$ \int |f|\,dP \le P(\{\omega : |f(\omega)| > 0\}) \, \sup_\omega |f(\omega)|. $$

(d) If f is a bounded measurable function and A is a measurable set one defines
$$ \int_A f(\omega)\,dP = \int \mathbf{1}_A(\omega) f(\omega)\,dP $$
and we can write for any measurable set A,
$$ \int f\,dP = \int_A f\,dP + \int_{A^c} f\,dP. $$
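Steps 4 and 5 of the list above can be tried out numerically. The sketch below assumes a finite sample space with point masses, so that every function is measurable and the integral is a finite sum; the subintervals I_j are taken half-open, and the names `simple_approx` and `integral_simple` are illustrative.

```python
import math
from collections import defaultdict


def simple_approx(f, M, n):
    """Return f_n = sum_j c_j 1_{A_j} for a function f bounded by M.

    [-M, M] is cut into n subintervals I_j of length 2M/n with
    midpoints c_j, and A_j = f^{-1}(I_j), as in step 4.
    """
    width = 2.0 * M / n

    def f_n(w):
        # Index of the subinterval containing f(w), clamped so that
        # the right end point M falls into the last subinterval.
        j = int((f(w) + M) // width)
        j = max(0, min(j, n - 1))
        return -M + (j + 0.5) * width  # the midpoint c_j

    return f_n


def integral_simple(g, omega, P):
    """Integral of a simple function g as sum_j c_j P(A_j) (step 5).

    Points of omega are grouped by the value c_j that g takes on them,
    so mass[c_j] accumulates P(A_j).
    """
    mass = defaultdict(float)
    for w in omega:
        mass[g(w)] += P[w]
    return sum(c * p for c, p in mass.items())
```

For example, with `omega` a grid of points in [0, 1] carrying equal mass and `f(w) = math.sin(7 * w)`, the approximation `simple_approx(f, 1.0, 50)` stays within M/n = 1/50 of f at every point, and the two integrals differ by at most the same amount, in line with property 5(b).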
In addition to uniform convergence there are other weaker notions of convergence.
Definition 1.6. A sequence f_n of functions is said to converge to a function f everywhere or pointwise if
$$ \lim_{n \to \infty} f_n(\omega) = f(\omega) $$
for every ω ∈ Ω.

In dealing with sequences of functions on a space that has a measure defined on it, often it does not matter if the sequence fails to converge on a set of points that is insignificant. For example if we are dealing with the Lebesgue measure on the interval [0, 1] and f_n(x) = x^n then f_n(x) → 0 for all x except x = 1. A single point, being an interval of length 0, should be insignificant for the Lebesgue measure.

Definition 1.7. A sequence f_n of measurable functions is said to converge to a measurable function f almost everywhere or almost surely (usually abbreviated as a.e.) if there exists a measurable set N with P(N) = 0 such that
$$ \lim_{n \to \infty} f_n(\omega) = f(\omega) $$
for every ω ∈ N^c.

Note that almost everywhere convergence is always relative to a probability measure. Another notion of convergence is the following:

Definition 1.8. A sequence f_n of measurable functions is said to converge to a measurable function f ‘in measure’ or ‘in probability’ if
$$ \lim_{n \to \infty} P[\omega : |f_n(\omega) - f(\omega)| \ge \epsilon] = 0 $$
for every ε > 0.

Let us examine these notions in the context of indicator functions of sets f_n(ω) = 1_{A_n}(ω). As soon as A ≠ B, sup_ω |1_A(ω) − 1_B(ω)| = 1, so that uniform convergence never really takes place. On the other hand one can verify that 1_{A_n}(ω) → 1_A(ω) for every ω if and only if the two sets
$$ \limsup_n A_n = \cap_n \cup_{m \ge n} A_m $$
and
$$ \liminf_n A_n = \cup_n \cap_{m \ge n} A_m $$
both coincide with A. Finally 1_{A_n}(ω) → 1_A(ω) in measure if and only if
$$ \lim_{n \to \infty} P(A_n \Delta A) = 0 $$
where for any two sets A and B the symmetric difference A∆B is defined as A∆B = (A ∩ B^c) ∪ (A^c ∩ B) = (A ∪ B) ∩ (A ∩ B)^c. It is the set of points that belong to either set but not to both. For instance 1_{A_n} → 0 in measure if and only if P(A_n) → 0.

Exercise 1.11. There is a difference between almost everywhere convergence and convergence in measure. The first is really stronger. Consider the interval [0, 1] and divide it successively into 2, 3, 4, · · · parts and enumerate the intervals in succession. That is, I_1 = [0, 1/2], I_2 = [1/2, 1], I_3 = [0, 1/3], I_4 = [1/3, 2/3], I_5 = [2/3, 1], and so on. If f_n(x) = 1_{I_n}(x) it is easy to check that f_n tends to 0 in measure but not almost everywhere.

Exercise 1.12. But the following statement is true. If f_n → f as n → ∞ in measure, then there is a subsequence f_{n_j} such that f_{n_j} → f almost everywhere as j → ∞.

Exercise 1.13. If {A_n} is a sequence of measurable sets, then in order that lim sup_{n→∞} A_n = Φ, it is necessary and sufficient that
$$ \lim_{n \to \infty} P[\cup_{m=n}^{\infty} A_m] = 0. $$
In particular it is sufficient that $\sum_n P[A_n] < \infty$. Is it necessary?
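The enumeration of intervals in Exercise 1.11 can be generated explicitly, which makes both halves of the claim visible: the lengths shrink to 0, yet every point keeps being covered. This is a small sketch with illustrative names `interval` and `hits`; it is an experiment, not a proof.

```python
from fractions import Fraction


def interval(n):
    """End points (a, b) of I_n for n >= 1, in the order
    [0,1/2], [1/2,1], [0,1/3], [1/3,2/3], [2/3,1], [0,1/4], ..."""
    k = 2  # the current block divides [0, 1] into k parts
    while n > k:
        n -= k
        k += 1
    return Fraction(n - 1, k), Fraction(n, k)


def hits(x, last):
    """Indices n <= last with f_n(x) = 1, i.e. with x in I_n."""
    found = []
    for n in range(1, last + 1):
        a, b = interval(n)
        if a <= x <= b:
            found.append(n)
    return found
```

The length of I_n is 1/k for the block containing n, so P(I_n) → 0 and f_n → 0 in measure. On the other hand each block of k intervals covers [0, 1], so `hits(x, last)` keeps growing with `last` for every x, and f_n(x) does not converge to 0 at any point.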
Lemma 1.3. If f_n → f almost everywhere then f_n → f in measure.

Proof. f_n → f outside N is equivalent to
$$ \cap_n \cup_{m \ge n} [\omega : |f_m(\omega) - f(\omega)| \ge \epsilon] \subset N $$
for every ε > 0. In particular by countable additivity
$$ P[\omega : |f_n(\omega) - f(\omega)| \ge \epsilon] \le P[\cup_{m \ge n} [\omega : |f_m(\omega) - f(\omega)| \ge \epsilon]] \to 0 $$
as n → ∞ and we are done.
Exercise 1.14. Countable additivity is important for this result. On a finitely additive probability space it could be that f_n → f everywhere and still f_n ↛ f in measure. In fact show that if every sequence f_n → 0 that converges everywhere also converges in probability, then the measure is countably additive.

Theorem 1.4. (Bounded Convergence Theorem). If the sequence {f_n} of measurable functions is uniformly bounded and if f_n → f in measure as n → ∞, then $\lim_{n \to \infty} \int f_n\,dP = \int f\,dP$.

Proof. Since
$$ \left| \int f_n\,dP - \int f\,dP \right| = \left| \int (f_n - f)\,dP \right| \le \int |f_n - f|\,dP $$
we need only prove that if f_n → 0 in measure and |f_n| ≤ M then $\int |f_n|\,dP \to 0$. To see this
$$ \int |f_n|\,dP = \int_{|f_n| \le \epsilon} |f_n|\,dP + \int_{|f_n| > \epsilon} |f_n|\,dP \le \epsilon + M P[\omega : |f_n(\omega)| > \epsilon] $$
and taking limits
$$ \limsup_{n \to \infty} \int |f_n|\,dP \le \epsilon $$
and since ε > 0 is arbitrary we are done.

The bounded convergence theorem is the essence of countable additivity. Let us look at the example of f_n(x) = x^n on 0 ≤ x ≤ 1 with Lebesgue measure. Clearly f_n(x) → 0 a.e. and therefore in measure. While the convergence is not uniform, 0 ≤ x^n ≤ 1 for all n and x and so the bounded convergence theorem applies. In fact
$$ \int_0^1 x^n\,dx = \frac{1}{n+1} \to 0. $$
However if we replace x^n by nx^n, f_n(x) still goes to 0 a.e., but the sequence is no longer uniformly bounded and the integral does not go to 0.

We now proceed to define integrals of nonnegative measurable functions.
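The two examples x^n and nx^n discussed after the bounded convergence theorem can be checked numerically. The sketch below uses a plain midpoint Riemann sum as a stand-in for the Lebesgue integral on [0, 1], which is adequate for these continuous integrands; `midpoint_integral` is an illustrative helper, not something defined in the notes.

```python
def midpoint_integral(f, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def integral_power(n):
    """Approximate integral of x^n over [0, 1]; exact value 1/(n+1)."""
    return midpoint_integral(lambda x: x ** n, 0.0, 1.0)


def integral_scaled_power(n):
    """Approximate integral of n*x^n over [0, 1]; exact value n/(n+1)."""
    return midpoint_integral(lambda x: n * x ** n, 0.0, 1.0)
```

As n grows, `integral_power(n)` decreases to 0 while `integral_scaled_power(n)` approaches 1, even though both integrands tend to 0 at every x < 1. The difference is that sup_x nx^n = n is not bounded in n.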
Definition 1.9. If f is a nonnegative measurable function we define
$$ \int f\,dP = \sup \left\{ \int g\,dP : g \text{ bounded},\ 0 \le g \le f \right\}. $$

An important result is

Theorem 1.5. (Fatou’s Lemma). If for each n ≥ 1, f_n ≥ 0 is measurable and f_n → f in measure as n → ∞ then
$$ \int f\,dP \le \liminf_{n \to \infty} \int f_n\,dP. $$

Proof. Suppose g is bounded and satisfies 0 ≤ g ≤ f. Then the sequence h_n = f_n ∧ g is uniformly bounded and h_n → h = f ∧ g = g in measure. Therefore, by the bounded convergence theorem,
$$ \int g\,dP = \lim_{n \to \infty} \int h_n\,dP. $$
Since $\int h_n\,dP \le \int f_n\,dP$ for every n it follows that
$$ \int g\,dP \le \liminf_{n \to \infty} \int f_n\,dP. $$
As g satisfying 0 ≤ g ≤ f is arbitrary, we are done.

Corollary 1.6. (Monotone Convergence Theorem). If for a sequence {f_n} of nonnegative functions, we have f_n ↑ f monotonically then
$$ \int f_n\,dP \to \int f\,dP \quad \text{as } n \to \infty. $$

Proof. Obviously $\int f_n\,dP \le \int f\,dP$ and the other half follows from Fatou’s lemma.
Now we try to define integrals of arbitrary measurable functions. A nonnegative measurable function f is said to be integrable if $\int f\,dP < \infty$. A measurable function f is said to be integrable if |f| is integrable and we define $\int f\,dP = \int f^+\,dP - \int f^-\,dP$ where f^+ = f ∨ 0 and f^- = −(f ∧ 0) are the positive and negative parts of f. The integral has the following properties.

1. It is linear. If f and g are integrable so is af + bg for any two real constants a and b, and $\int (af + bg)\,dP = a \int f\,dP + b \int g\,dP$.

2. $\left|\int f\,dP\right| \le \int |f|\,dP$ for every integrable f.

3. If f = 0 except on a set N of measure 0, then f is integrable and $\int f\,dP = 0$. In particular if f = g almost everywhere then $\int f\,dP = \int g\,dP$.

Theorem 1.7. (Jensen’s Inequality.) If φ(x) is a convex function of x and f(ω) and φ(f(ω)) are integrable then
$$ \int \phi(f(\omega))\,dP \ge \phi\left( \int f(\omega)\,dP \right). \tag{1.8} $$

Proof. We have seen the inequality already for φ(x) = |x|. The proof is quite simple. We note that any convex function φ can be represented as the supremum of a collection of affine linear functions.
$$ \phi(x) = \sup_{(a,b) \in E} \{ax + b\}. \tag{1.9} $$
It is clear that if (a, b) ∈ E, then af(ω) + b ≤ φ(f(ω)) and on integration this yields am + b ≤ E[φ(f(ω))] where m = E[f(ω)]. Since this is true for every (a, b) ∈ E, in view of the representation (1.9), our theorem follows.

Another important theorem is

Theorem 1.8. (The Dominated Convergence Theorem) If for some sequence {f_n} of measurable functions we have f_n → f in measure and |f_n(ω)| ≤ g(ω) for all n and ω for some integrable function g, then $\int f_n\,dP \to \int f\,dP$ as n → ∞.
Proof. g + f_n and g − f_n are nonnegative and converge in measure to g + f and g − f respectively. By Fatou’s lemma
$$ \liminf_{n \to \infty} \int (g + f_n)\,dP \ge \int (g + f)\,dP. $$
Since $\int g\,dP$ is finite we can subtract it from both sides and get
$$ \liminf_{n \to \infty} \int f_n\,dP \ge \int f\,dP. $$
Working the same way with g − f_n yields
$$ \limsup_{n \to \infty} \int f_n\,dP \le \int f\,dP $$
and we are done.

Exercise 1.15. Take the unit interval with the Lebesgue measure and define $f_n(x) = n^{\alpha} \mathbf{1}_{[0, \frac{1}{n}]}(x)$. Clearly f_n(x) → 0 for x ≠ 0. On the other hand $\int f_n(x)\,dx = n^{\alpha - 1} \to 0$ if and only if α < 1. What is g(x) = sup_n f_n(x) and when is g integrable?

If h(ω) = f(ω) + ig(ω) is a complex valued measurable function with real and imaginary parts f(ω) and g(ω) that are integrable we define
$$ \int h(\omega)\,dP = \int f(\omega)\,dP + i \int g(\omega)\,dP. $$

Exercise 1.16. Show that for any complex function h(ω) = f(ω) + ig(ω) with measurable f and g, |h(ω)| is integrable if and only if |f| and |g| are integrable, and we then have
$$ \left| \int h(\omega)\,dP \right| \le \int |h(\omega)|\,dP. $$
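Returning to Jensen’s inequality (1.8), the gap between the two sides can be computed exactly on a finite probability space, where the integral is a finite sum. The sketch below assumes the balanced die as the space; `integral` and `jensen_gap` are illustrative names.

```python
from fractions import Fraction

# Balanced die: P({w}) = 1/6 for w = 1, ..., 6 (illustrative space).
P = {w: Fraction(1, 6) for w in range(1, 7)}


def integral(f, P):
    """Integral of f with respect to P on a finite space."""
    return sum((f(w) * p for w, p in P.items()), Fraction(0))


def jensen_gap(phi, f, P):
    """Left side minus right side of (1.8); nonnegative for convex phi."""
    return integral(lambda w: phi(f(w)), P) - phi(integral(f, P))
```

With f(ω) = ω − 3, the choice φ(x) = x² gives a gap equal to the variance of f, the choice φ(x) = |x| gives a strictly positive gap, and an affine φ gives gap 0, which matches the role of the affine functions in the representation (1.9).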
1.4 Transformations

A measurable space (Ω, ℬ) is a set Ω together with a σ-field ℬ of subsets of Ω.
Definition 1.10. Given two measurable spaces (Ω_1, ℬ_1) and (Ω_2, ℬ_2), a mapping or a transformation T : Ω_1 → Ω_2, i.e. a function ω_2 = T(ω_1) that assigns to each point ω_1 ∈ Ω_1 a point ω_2 = T(ω_1) ∈ Ω_2, is said to be measurable if for every measurable set A ∈ ℬ_2, the inverse image
$$ T^{-1}(A) = \{\omega_1 : T(\omega_1) \in A\} \in \mathcal{B}_1. $$

Exercise 1.17. Show that, in the above definition, it is enough to verify the property for A ∈ 𝒜 where 𝒜 is any class of sets that generates the σ-field ℬ_2.

If T is a measurable map from (Ω_1, ℬ_1) into (Ω_2, ℬ_2) and P is a probability measure on (Ω_1, ℬ_1), the induced probability measure Q on (Ω_2, ℬ_2) is defined by
$$ Q(A) = P(T^{-1}(A)) \quad \text{for } A \in \mathcal{B}_2. \tag{1.10} $$

Exercise 1.18. Verify that Q indeed does define a probability measure on (Ω_2, ℬ_2).

Q is called the induced measure and is denoted by Q = PT^{-1}.

Theorem 1.9. If f : Ω_2 → R is a real valued measurable function on Ω_2, then g(ω_1) = f(T(ω_1)) is a measurable real valued function on (Ω_1, ℬ_1). Moreover g is integrable with respect to P if and only if f is integrable with respect to Q, and
$$ \int_{\Omega_2} f(\omega_2)\,dQ = \int_{\Omega_1} g(\omega_1)\,dP \tag{1.11} $$

Proof. If f(ω_2) = 1_A(ω_2) is the indicator function of a set A ∈ ℬ_2, the claim in equation (1.11) is in fact the definition of measurability and the induced measure. We see, by linearity, that the claim extends easily from indicator functions to simple functions. By uniform limits, the claim can now be extended to bounded measurable functions. Monotone limits then extend it to nonnegative functions. By considering the positive and negative parts separately we are done.
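On a finite space the induced measure (1.10) and the change of variables formula (1.11) can be verified directly. The sketch below assumes two independent balanced dice as (Ω_1, P) and takes T to be the sum of the two scores; `pushforward` and `integral` are illustrative names.

```python
from collections import defaultdict
from fractions import Fraction


def pushforward(P, T):
    """Induced measure Q = P T^{-1} on a finite space.

    Q is determined by its point masses Q({y}) = P({w : T(w) = y}).
    """
    Q = defaultdict(Fraction)  # Fraction() is 0
    for w, p in P.items():
        Q[T(w)] += p
    return dict(Q)


def integral(f, measure):
    """Integral of f with respect to a finite measure of point masses."""
    return sum((f(w) * p for w, p in measure.items()), Fraction(0))


# Omega_1: ordered pairs of scores, each pair with probability 1/36.
P = {(i, j): Fraction(1, 36) for i in range(1, 7) for j in range(1, 7)}


def T(w):
    """The transformation: total score of the two dice."""
    return w[0] + w[1]


Q = pushforward(P, T)
```

Here `integral(f, Q)` and `integral(lambda w: f(T(w)), P)` are the two sides of (1.11); for f(y) = y² both equal 329/6.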
A measurable transformation is just a generalization of the concept of a random variable introduced in section 1.3. We can either think of a random variable as a special case of a measurable transformation where the target space is the real line, or think of a measurable transformation as a random variable with values in an arbitrary target space. The induced measure Q = PT^{-1} is called the distribution of the random variable T under P. In particular, if T takes real values, Q is a probability distribution on R.

Exercise 1.19. When T is real valued show that
$$ \int T(\omega)\,dP = \int x\,dQ. $$

When F = (f_1, f_2, · · · , f_n) takes values in R^n, the induced distribution Q on R^n is called the joint distribution of the n random variables f_1, f_2, · · · , f_n.

Exercise 1.20. If T_1 is a measurable map from (Ω_1, ℬ_1) into (Ω_2, ℬ_2) and T_2 is a measurable map from (Ω_2, ℬ_2) into (Ω_3, ℬ_3), then show that T = T_2 ◦ T_1 is a measurable map from (Ω_1, ℬ_1) into (Ω_3, ℬ_3). If P is a probability measure on (Ω_1, ℬ_1), then on (Ω_3, ℬ_3), the two measures PT^{-1} and (PT_1^{-1})T_2^{-1} are identical.
1.5 Product Spaces
Given two sets Ω_1 and Ω_2 the Cartesian product Ω = Ω_1 × Ω_2 is the set of pairs (ω_1, ω_2) with ω_1 ∈ Ω_1 and ω_2 ∈ Ω_2. If Ω_1 and Ω_2 come with σ-fields ℬ_1 and ℬ_2 respectively, we can define a natural σ-field ℬ on Ω as the σ-field generated by sets (measurable rectangles) of the form A_1 × A_2 with A_1 ∈ ℬ_1 and A_2 ∈ ℬ_2. This σ-field will be called the product σ-field.

Exercise 1.21. Show that sets that are finite disjoint unions of measurable rectangles constitute a field ℱ.

Definition 1.11. The product σ-field ℬ is the σ-field generated by the field ℱ.
Given two probability measures $P_1$ and $P_2$ on $(\Omega_1, \mathcal{B}_1)$ and $(\Omega_2, \mathcal{B}_2)$ respectively, we try to define on the product space $(\Omega, \mathcal{B})$ a probability measure $P$ by defining for a measurable rectangle $A = A_1 \times A_2$
$$P(A_1 \times A_2) = P_1(A_1) \times P_2(A_2)$$
and extending it to the field $\mathcal{F}$ of finite disjoint unions of measurable rectangles as the obvious sum.

Exercise 1.22. If $E \in \mathcal{F}$ has two representations as disjoint finite unions of measurable rectangles
$$E = \cup_i (A_1^i \times A_2^i) = \cup_j (B_1^j \times B_2^j)$$
then
$$\sum_i P_1(A_1^i) \times P_2(A_2^i) = \sum_j P_1(B_1^j) \times P_2(B_2^j),$$
so that $P(E)$ is well defined. $P$ is a finitely additive probability measure on $\mathcal{F}$.

Lemma 1.10. The measure $P$ is countably additive on the field $\mathcal{F}$.

Proof. For any set $E \in \mathcal{F}$ let us define the section $E_{\omega_2}$ as
$$E_{\omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in E\}. \tag{1.12}$$
Then $P_1(E_{\omega_2})$ is a measurable function of $\omega_2$ (it is in fact a simple function) and
$$P(E) = \int_{\Omega_2} P_1(E_{\omega_2})\,dP_2. \tag{1.13}$$
Now let $E_n \in \mathcal{F}$ with $E_n \downarrow \Phi$, the empty set. Then it is easy to verify that $E_{n,\omega_2}$ defined by
$$E_{n,\omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in E_n\}$$
satisfies $E_{n,\omega_2} \downarrow \Phi$ for each $\omega_2 \in \Omega_2$. From the countable additivity of $P_1$ we conclude that $P_1(E_{n,\omega_2}) \to 0$ for each $\omega_2 \in \Omega_2$, and since $0 \le P_1(E_{n,\omega_2}) \le 1$ for $n \ge 1$, it follows from equation (1.13) and the bounded convergence theorem that
$$P(E_n) = \int_{\Omega_2} P_1(E_{n,\omega_2})\,dP_2 \to 0,$$
establishing the countable additivity of $P$ on $\mathcal{F}$.
By an application of the Caratheodory extension theorem we conclude that $P$ extends uniquely as a countably additive measure to the σ-field $\mathcal{B}$ (product σ-field) generated by $\mathcal{F}$. We will call this the Product Measure $P$.

Corollary 1.11. For any $A \in \mathcal{B}$ if we denote by $A_{\omega_1}$ and $A_{\omega_2}$ the respective sections
$$A_{\omega_1} = \{\omega_2 : (\omega_1, \omega_2) \in A\} \quad\text{and}\quad A_{\omega_2} = \{\omega_1 : (\omega_1, \omega_2) \in A\}$$
then the functions $P_1(A_{\omega_2})$ and $P_2(A_{\omega_1})$ are measurable and
$$P(A) = \int P_1(A_{\omega_2})\,dP_2 = \int P_2(A_{\omega_1})\,dP_1.$$
In particular for a measurable set $A$, $P(A) = 0$ if and only if for almost all $\omega_1$ with respect to $P_1$, the sections $A_{\omega_1}$ have measure 0, or equivalently for almost all $\omega_2$ with respect to $P_2$, the sections $A_{\omega_2}$ have measure 0.

Proof. The assertion is clearly valid if $A$ is a rectangle of the form $A_1 \times A_2$ with $A_1 \in \mathcal{B}_1$ and $A_2 \in \mathcal{B}_2$. If $A \in \mathcal{F}$, then it is a finite disjoint union of such rectangles and the assertion is extended to such a set by simple addition. Clearly, by the monotone convergence theorem, the class of sets for which the assertion is valid is a monotone class, and since it contains the field $\mathcal{F}$ it also contains the σ-field $\mathcal{B}$ generated by the field $\mathcal{F}$.

Warning. It is possible that a set $A$ may not be measurable with respect to the product σ-field, but nevertheless the sections $A_{\omega_1}$ and $A_{\omega_2}$ are all measurable, $P_2(A_{\omega_1})$ and $P_1(A_{\omega_2})$ are measurable functions, but
$$\int P_1(A_{\omega_2})\,dP_2 \ne \int P_2(A_{\omega_1})\,dP_1.$$
In fact there is a rather nasty example where $P_1(A_{\omega_2})$ is identically 1 whereas $P_2(A_{\omega_1})$ is identically 0.

The next result concerns the equality of the double integral (i.e. the integral with respect to the product measure) and the repeated integrals in any order.
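On a finite product space the identity in Corollary 1.11 can be checked by direct summation. The following is only a sketch under that assumption: the two small probability spaces `P1`, `P2` and the set `A` are hypothetical choices made for illustration, and exact rational arithmetic is used so that the three values can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical finite probability spaces (illustrative values only).
P1 = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}
P2 = {0: Fraction(1, 4), 1: Fraction(3, 4)}

# A subset of the product space that is not a measurable rectangle.
A = {("a", 0), ("b", 1), ("c", 0), ("c", 1)}


def product_measure(event):
    """P(event): sum of P1({w1}) * P2({w2}) over the points of the event."""
    return sum(P1[w1] * P2[w2] for (w1, w2) in event)


def integrate_sections_over_omega2(event):
    """Integral over Omega_2 of P1(A_{w2}), with A_{w2} = {w1 : (w1, w2) in A}."""
    total = Fraction(0)
    for w2, p2 in P2.items():
        section_mass = sum(P1[w1] for w1 in P1 if (w1, w2) in event)
        total += section_mass * p2
    return total


def integrate_sections_over_omega1(event):
    """Integral over Omega_1 of P2(A_{w1}), with A_{w1} = {w2 : (w1, w2) in A}."""
    total = Fraction(0)
    for w1, p1 in P1.items():
        section_mass = sum(P2[w2] for w2 in P2 if (w1, w2) in event)
        total += section_mass * p1
    return total


direct = product_measure(A)
via_omega2 = integrate_sections_over_omega2(A)
via_omega1 = integrate_sections_over_omega1(A)
print(direct, via_omega2, via_omega1)
```

In the finite case every subset is measurable, so the pathology described in the Warning cannot occur here; the sketch only illustrates the bookkeeping with sections.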
Theorem 1.12. (Fubini's Theorem). Let $f(\omega) = f(\omega_1, \omega_2)$ be a measurable function of $\omega$ on $(\Omega, \mathcal{B})$. Then $f$ can be considered as a function of $\omega_2$ for each fixed $\omega_1$ or the other way around. The functions $g_{\omega_1}(\cdot)$ and $h_{\omega_2}(\cdot)$ defined respectively on $\Omega_2$ and $\Omega_1$ by
$$g_{\omega_1}(\omega_2) = h_{\omega_2}(\omega_1) = f(\omega_1, \omega_2)$$
are measurable for each $\omega_1$ and $\omega_2$. If $f$ is integrable then the functions $g_{\omega_1}(\omega_2)$ and $h_{\omega_2}(\omega_1)$ are integrable for almost all $\omega_1$ and $\omega_2$ respectively. Their integrals
$$G(\omega_1) = \int_{\Omega_2} g_{\omega_1}(\omega_2)\,dP_2 \quad\text{and}\quad H(\omega_2) = \int_{\Omega_1} h_{\omega_2}(\omega_1)\,dP_1$$
are measurable, finite almost everywhere and integrable with respect to $P_1$ and $P_2$ respectively. Finally
$$\int f(\omega_1, \omega_2)\,dP = \int G(\omega_1)\,dP_1 = \int H(\omega_2)\,dP_2.$$
Conversely, for a nonnegative measurable function $f$, if either $G$ or $H$, which are always measurable, has a finite integral, so does the other, and $f$ is integrable with its integral being equal to either of the repeated integrals, namely the integrals of $G$ and $H$.
Proof. The proof follows the standard pattern. It is a restatement of the earlier corollary if $f$ is the indicator function of a measurable set $A$. By linearity it is true for simple functions, and by passing to uniform limits it is true for bounded measurable functions $f$. By monotone limits it is true for nonnegative functions, and finally by taking the positive and negative parts separately it is true for any arbitrary integrable function $f$.

Warning. The following could happen: $f$ is a measurable function that takes both positive and negative values and is not integrable, both the repeated integrals exist, and they are unequal. The example is not hard.
Exercise 1.23. Construct a measurable function $f(x, y)$ which is not integrable, on the product $[0, 1] \times [0, 1]$ of two copies of the unit interval with Lebesgue measure, such that the repeated integrals make sense and are unequal, i.e.
$$\int_0^1 dx \int_0^1 f(x, y)\,dy \ne \int_0^1 dy \int_0^1 f(x, y)\,dx.$$

1.6 Distributions and Expectations
Let us recall that a triplet $(\Omega, \mathcal{B}, P)$ is a Probability Space, if $\Omega$ is a set, $\mathcal{B}$ is a σ-field of subsets of $\Omega$ and $P$ is a (countably additive) probability measure on $\mathcal{B}$. A random variable $X$ is a real valued measurable function on $(\Omega, \mathcal{B})$. Given such a function $X$ it induces a probability distribution $\alpha$ on the Borel subsets of the line, $\alpha = PX^{-1}$. The distribution function $F(x)$ corresponding to $\alpha$ is obviously
$$F(x) = \alpha((-\infty, x]) = P[\,\omega : X(\omega) \le x\,].$$
The measure $\alpha$ is called the distribution of $X$ and $F(x)$ is called the distribution function of $X$. If $g$ is a measurable function of the real variable $x$, then $Y(\omega) = g(X(\omega))$ is again a random variable and its distribution $\beta = PY^{-1}$ can be obtained as $\beta = \alpha g^{-1}$ from $\alpha$. The Expectation or mean of a random variable is defined if it is integrable and
$$E[X] = E^P[X] = \int X(\omega)\,dP.$$
By the change of variables formula (Exercise 1.19) it can be obtained directly from $\alpha$ as
$$E[X] = \int x\,d\alpha.$$
Here we are taking advantage of the fact that on the real line $x$ is a very special real valued function. The value of the integral in this context is referred to as the expectation or mean of $\alpha$. Of course it exists if and only if
$$\int |x|\,d\alpha < \infty$$
and
$$\left|\int x\,d\alpha\right| \le \int |x|\,d\alpha.$$
Similarly
$$E(g(X)) = \int g(X(\omega))\,dP = \int g(x)\,d\alpha$$
and anything concerning $X$ can be calculated from $\alpha$. The statement "$X$ is a random variable with distribution $\alpha$" has to be interpreted in the sense that somewhere in the background there is a Probability Space and a random variable $X$ on it, which has $\alpha$ for its distribution. Usually only $\alpha$ matters and the underlying $(\Omega, \mathcal{B}, P)$ never emerges from the background, and in a pinch we can always say $\Omega$ is the real line, $\mathcal{B}$ are the Borel sets, $P$ is nothing but $\alpha$ and the random variable is $X(x) = x$. Some other related quantities are
$$\mathrm{Var}(X) = \sigma^2(X) = E[X^2] - [E[X]]^2. \tag{1.14}$$
$\mathrm{Var}(X)$ is called the variance of $X$.

Exercise 1.24. Show that if it is defined, $\mathrm{Var}(X)$ is always nonnegative and $\mathrm{Var}(X) = 0$ if and only if for some value $a$, which is necessarily equal to $E[X]$, $P[X = a] = 1$.

Somewhat more generally we can consider a measurable mapping $X = (X_1, \cdots, X_n)$ of a probability space $(\Omega, \mathcal{B}, P)$ into $\mathbb{R}^n$ as a vector of $n$ random variables $X_1(\omega), X_2(\omega), \cdots, X_n(\omega)$. These are called random vectors or vector valued random variables, and the induced distribution $\alpha = PX^{-1}$ on $\mathbb{R}^n$ is called the distribution of $X$ or the joint distribution of $(X_1, \cdots, X_n)$. If we denote by $\pi_i$ the coordinate maps $(x_1, \cdots, x_n) \to x_i$ from $\mathbb{R}^n \to \mathbb{R}$, then
$$\alpha_i = \alpha \pi_i^{-1} = P X_i^{-1}$$
are called the marginals of $\alpha$. The covariance between two random variables $X$ and $Y$ is defined as
$$\mathrm{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E[XY] - E[X]E[Y]. \tag{1.15}$$
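As a small illustration of the definitions of expectation, variance and covariance, the two expressions for each of the last two can be compared on a finite joint distribution. This is a sketch only: the joint distribution `joint` of a pair $(X, Y)$ is a made-up example, and exact fractions are used so the identities hold without rounding.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, Y): four equally likely points.
joint = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 1): Fraction(1, 4),
    (2, 3): Fraction(1, 4),
}


def expect(g):
    """E[g(X, Y)] computed directly from the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# Variance: centered second moment versus E[X^2] - (E[X])^2.
var_centered = expect(lambda x, y: (x - mean_x) ** 2)
var_moments = expect(lambda x, y: x * x) - mean_x ** 2

# Covariance: centered product versus E[XY] - E[X]E[Y].
cov_centered = expect(lambda x, y: (x - mean_x) * (y - mean_y))
cov_moments = expect(lambda x, y: x * y) - mean_x * mean_y

print(var_centered, var_moments, cov_centered, cov_moments)
```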
Exercise 1.25. If X1 , · · · , Xn are n random variables the matrix Ci,j = Cov(Xi , Xj ) is called the covariance matrix. Show that it is a symmetric positive semidefinite matrix. Is every positive semi-definite matrix the covariance matrix of some random vector?
Exercise 1.26. The Riemann-Stieltjes integral uses the distribution function directly to define $\int_{-\infty}^{\infty} g(x)\,dF(x)$, where $g$ is a bounded continuous function and $F$ is a distribution function. It is defined as the limit as $N \to \infty$ of sums
$$\sum_{j=0}^{N} g(x_j)\,[F(a^N_{j+1}) - F(a^N_j)]$$
where $-\infty < a^N_0 < a^N_1 < \cdots < a^N_N < a^N_{N+1} < \infty$ is a partition of the finite interval $[a^N_0, a^N_{N+1}]$ and the limit is taken in such a way that $a^N_0 \to -\infty$, $a^N_{N+1} \to +\infty$ and the oscillation of $g$ in any $[a^N_j, a^N_{j+1}]$ goes to 0. Show that if $P$ is the measure corresponding to $F$ then
$$\int_{-\infty}^{\infty} g(x)\,dF(x) = \int_{\mathbb{R}} g(x)\,dP.$$
Chapter 2

Weak Convergence

2.1 Characteristic Functions
If $\alpha$ is a probability distribution on the line, its characteristic function is defined by
$$\phi(t) = \int \exp[\,i t x\,]\,d\alpha. \tag{2.1}$$
The above definition makes sense. We write the integrand $e^{itx}$ as $\cos tx + i \sin tx$ and integrate each part to see that $|\phi(t)| \le 1$ for all real $t$.

Exercise 2.1. Calculate the characteristic functions for the following distributions:

1. $\alpha$ is the degenerate distribution $\delta_a$ with probability one at the point $a$.

2. $\alpha$ is the binomial distribution with probabilities
$$p_k = \mathrm{Prob}[X = k] = \binom{n}{k} p^k (1 - p)^{n-k}$$
for $0 \le k \le n$.
Theorem 2.1. The characteristic function $\phi(t)$ of any probability distribution is a uniformly continuous function of $t$ that is positive definite, i.e. for any $n$ complex numbers $\xi_1, \cdots, \xi_n$ and real numbers $t_1, \cdots, t_n$
$$\sum_{i,j=1}^{n} \phi(t_i - t_j)\,\xi_i \bar{\xi}_j \ge 0.$$

Proof. Let us note that
$$\sum_{i,j=1}^{n} \phi(t_i - t_j)\,\xi_i \bar{\xi}_j = \sum_{i,j=1}^{n} \xi_i \bar{\xi}_j \int \exp[\,i(t_i - t_j)x\,]\,d\alpha = \int \Big|\sum_{j=1}^{n} \xi_j \exp[\,i t_j x\,]\Big|^2 d\alpha \ge 0.$$
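The identity used in the proof can be verified numerically for a discrete distribution. The sketch below assumes a hypothetical three-atom distribution `atoms` and arbitrary illustrative choices of the points `ts` and complex coefficients `xis`; it evaluates the quadratic form and the integral of the squared modulus separately and compares them.

```python
import cmath

# Hypothetical discrete distribution: pairs (x, alpha{x}) with masses summing to 1.
atoms = [(-1.0, 0.2), (0.5, 0.5), (3.0, 0.3)]


def phi(t):
    """Characteristic function of the discrete distribution above."""
    return sum(p * cmath.exp(1j * t * x) for x, p in atoms)


# Illustrative real points t_j and complex coefficients xi_j.
ts = [-1.3, 0.0, 0.7, 2.1]
xis = [1 + 2j, -0.5j, 0.3 + 0j, -1 + 1j]

# Left side: sum over i, j of phi(t_i - t_j) * xi_i * conj(xi_j).
quad_form = sum(
    phi(ti - tj) * xi * xj.conjugate()
    for ti, xi in zip(ts, xis)
    for tj, xj in zip(ts, xis)
)

# Right side: integral of |sum_j xi_j exp(i t_j x)|^2 with respect to alpha.
integral = sum(
    p * abs(sum(xi * cmath.exp(1j * ti * x) for ti, xi in zip(ts, xis))) ** 2
    for x, p in atoms
)

print(quad_form, integral)
```

Up to rounding, the quadratic form is real and agrees with the manifestly nonnegative integral.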
To prove uniform continuity we see that
$$|\phi(t) - \phi(s)| \le \int |\exp[\,i(t - s)x\,] - 1|\,d\alpha$$
which tends to 0 by the bounded convergence theorem if $|t - s| \to 0$.

The characteristic function of course carries some information about the distribution $\alpha$. In particular if $\int |x|\,d\alpha < \infty$, then $\phi(\cdot)$ is continuously differentiable and $\phi'(0) = i \int x\,d\alpha$.

Exercise 2.2. Prove it!

Warning: The converse need not be true. $\phi(\cdot)$ can be continuously differentiable but $\int |x|\,d\alpha$ could be $\infty$.

Exercise 2.3. Construct a counterexample along the following lines. Take a discrete distribution, symmetric around 0, with
$$\alpha\{n\} = \alpha\{-n\} = p(n) \simeq \frac{1}{n^2 \log n}.$$
Then show that
$$\sum_n \frac{1 - \cos nt}{n^2 \log n}$$
is a continuously differentiable function of $t$.
Exercise 2.4. The story with higher moments $m_r = \int x^r\,d\alpha$ is similar. If any of them, say $m_r$, exists, then $\phi(\cdot)$ is $r$ times continuously differentiable and $\phi^{(r)}(0) = i^r m_r$. The converse is false for odd $r$, but true for even $r$ by an application of Fatou's lemma.
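The next question is how to recover the distribution function $F(x)$ from $\phi(t)$. If we go back to the Fourier inversion formula, see for instance [2], we can 'guess', using the fundamental theorem of calculus and Fubini's theorem, that
$$F'(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-i t x]\,\phi(t)\,dt$$
and therefore
$$\begin{aligned}
F(b) - F(a) &= \frac{1}{2\pi} \int_a^b dx \int_{-\infty}^{\infty} \exp[-i t x]\,\phi(t)\,dt \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t)\,dt \int_a^b \exp[-i t x]\,dx \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi(t)\,\frac{\exp[-i t b] - \exp[-i t a]}{-it}\,dt \\
&= \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \phi(t)\,\frac{\exp[-i t b] - \exp[-i t a]}{-it}\,dt.
\end{aligned}$$
We will in fact prove the final relation, which is a principal value integral, provided $a$ and $b$ are points of continuity of $F$. We compute the right hand side as
$$\begin{aligned}
\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} &\frac{\exp[-i t b] - \exp[-i t a]}{-it}\,dt \int \exp[\,i t x\,]\,d\alpha \\
&= \lim_{T \to \infty} \frac{1}{2\pi} \int d\alpha \int_{-T}^{T} \frac{\exp[\,i t (x - b)\,] - \exp[\,i t (x - a)\,]}{-it}\,dt \\
&= \lim_{T \to \infty} \frac{1}{2\pi} \int d\alpha \int_{-T}^{T} \frac{\sin t(x - a) - \sin t(x - b)}{t}\,dt \\
&= \frac{1}{2} \int [\mathrm{sign}(x - a) - \mathrm{sign}(x - b)]\,d\alpha \\
&= F(b) - F(a)
\end{aligned}$$
provided $a$ and $b$ are continuity points. We have applied Fubini's theorem and the bounded convergence theorem to take the limit as $T \to \infty$. Note that the Dirichlet integral
$$u(T, z) = \int_0^T \frac{\sin tz}{t}\,dt$$
satisfies $\sup_{T, z} |u(T, z)| \le C$ and
$$\lim_{T \to \infty} u(T, z) = \begin{cases} \pi/2 & \text{if } z > 0 \\ -\pi/2 & \text{if } z < 0 \\ 0 & \text{if } z = 0. \end{cases}$$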
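The inversion formula for $F(b) - F(a)$ can be tried out numerically. The following is a minimal sketch, assuming the standard normal distribution (whose characteristic function is $\exp[-t^2/2]$) as the test case; the truncation level `T` and the number of quadrature steps are arbitrary illustrative choices, and a midpoint rule is used so that the removable singularity at $t = 0$ is never evaluated.

```python
import cmath
import math


def phi(t):
    """Characteristic function of the standard normal distribution."""
    return math.exp(-0.5 * t * t)


def increment(a, b, T=12.0, steps=24000):
    """Approximate F(b) - F(a) by the truncated inversion integral over [-T, T]."""
    h = 2.0 * T / steps
    total = 0.0
    for k in range(steps):
        t = -T + (k + 0.5) * h  # midpoints, so t is never exactly 0
        kernel = (cmath.exp(-1j * t * b) - cmath.exp(-1j * t * a)) / (-1j * t)
        total += (phi(t) * kernel).real  # imaginary parts cancel by symmetry
    return total * h / (2.0 * math.pi)


approx = increment(-1.0, 1.0)
exact = math.erf(1.0 / math.sqrt(2.0))  # P[-1 <= X <= 1] for a standard normal
print(approx, exact)
```

Because $\phi$ decays rapidly here, truncating at a moderate `T` is harmless; for a slowly decaying characteristic function the principal value limit would converge much more slowly.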
As a consequence we conclude that the distribution function and hence $\alpha$ is determined uniquely by the characteristic function.

Exercise 2.5. Prove that if two distribution functions agree on the set of points at which they are both continuous, they agree everywhere.

Besides those in Exercise 2.1, some additional examples of probability distributions and the corresponding characteristic functions are given below.

1. The Poisson distribution of 'rare events', with rate $\lambda$, has probabilities $P[X = r] = e^{-\lambda} \frac{\lambda^r}{r!}$ for $r \ge 0$. Its characteristic function is
$$\phi(t) = \exp[\lambda(e^{it} - 1)].$$

2. The geometric distribution, the distribution of the number of unsuccessful attempts preceding a success, has $P[X = r] = p q^r$ for $r \ge 0$. Its characteristic function is
$$\phi(t) = p(1 - q e^{it})^{-1}.$$

3. The negative binomial distribution, the probability distribution of the number of accumulated failures before $k$ successes, with $P[X = r] = \binom{k + r - 1}{r} p^k q^r$, has the characteristic function
$$\phi(t) = p^k (1 - q e^{it})^{-k}.$$

We now turn to some common continuous distributions, in fact given by 'densities' $f(x)$, i.e. the distribution functions are given by $F(x) = \int_{-\infty}^{x} f(y)\,dy$.

1. The 'uniform' distribution with density $f(x) = \frac{1}{b - a}$, $a \le x \le b$, has characteristic function
$$\phi(t) = \frac{e^{itb} - e^{ita}}{it(b - a)}.$$
In particular for the case of a symmetric interval $[-a, a]$,
$$\phi(t) = \frac{\sin at}{at}.$$
2. The gamma distribution with density $f(x) = \frac{c^p}{\Gamma(p)} e^{-cx} x^{p-1}$, $x \ge 0$, has the characteristic function
$$\phi(t) = \Big(1 - \frac{it}{c}\Big)^{-p},$$
where $c > 0$ is any constant. A special case of the gamma distribution is the exponential distribution, which corresponds to $c = p = 1$ with density $f(x) = e^{-x}$ for $x \ge 0$. Its characteristic function is given by
$$\phi(t) = [1 - it]^{-1}.$$

3. The two sided exponential with density $f(x) = \frac{1}{2} e^{-|x|}$ has characteristic function
$$\phi(t) = \frac{1}{1 + t^2}.$$

4. The Cauchy distribution with density $f(x) = \frac{1}{\pi} \frac{1}{1 + x^2}$ has the characteristic function
$$\phi(t) = e^{-|t|}.$$

5. The normal or Gaussian distribution with mean $\mu$ and variance $\sigma^2$, which has a density of $\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$, has the characteristic function given by
$$\phi(t) = e^{it\mu - \frac{\sigma^2 t^2}{2}}.$$
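The closed forms for the discrete examples can be sanity-checked by summing the defining series directly. The sketch below does this for the Poisson and geometric distributions; the parameter values and the truncation lengths `terms` are illustrative assumptions, chosen so that the neglected tails are far below floating point precision.

```python
import cmath
import math


def poisson_cf_series(t, lam, terms=200):
    """Sum of P[X = r] * exp(i t r) for the Poisson distribution, truncated."""
    total = 0j
    weight = math.exp(-lam)  # P[X = 0]
    for r in range(terms):
        total += weight * cmath.exp(1j * t * r)
        weight *= lam / (r + 1)  # P[X = r + 1] obtained from P[X = r]
    return total


def geometric_cf_series(t, p, terms=2000):
    """Sum of p * q^r * exp(i t r) for the geometric distribution, truncated."""
    q = 1.0 - p
    return sum(p * q ** r * cmath.exp(1j * t * r) for r in range(terms))


t, lam, p = 0.7, 3.0, 0.3
poisson_closed = cmath.exp(lam * (cmath.exp(1j * t) - 1.0))
geometric_closed = p / (1.0 - (1.0 - p) * cmath.exp(1j * t))

poisson_error = abs(poisson_cf_series(t, lam) - poisson_closed)
geometric_error = abs(geometric_cf_series(t, p) - geometric_closed)
print(poisson_error, geometric_error)
```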
In general if $X$ is a random variable which has distribution $\alpha$ and a characteristic function $\phi(t)$, the distribution $\beta$ of $aX + b$ can be written as $\beta(A) = \alpha[\,x : ax + b \in A\,]$ and its characteristic function $\psi(t)$ can be expressed as $\psi(t) = e^{itb} \phi(at)$. In particular the characteristic function of $-X$ is $\phi(-t) = \overline{\phi(t)}$. Therefore the distribution of $X$ is symmetric around $x = 0$ if and only if $\phi(t)$ is real for all $t$.
2.2 Moment Generating Functions
If $\alpha$ is a probability distribution on $\mathbb{R}$, for any integer $k \ge 1$, the moment $m_k$ of $\alpha$ is defined as
$$m_k = \int x^k\,d\alpha. \tag{2.2}$$
Or equivalently the $k$-th moment of a random variable $X$ is
$$m_k = E[X^k]. \tag{2.3}$$
By convention one takes $m_0 = 1$ even if $P[X = 0] > 0$. We should note that if $k$ is odd, in order for $m_k$ to be defined we must have $E[|X|^k] = \int |x|^k\,d\alpha < \infty$. Given a distribution $\alpha$, either all the moments exist, or they exist only for $0 \le k \le k_0$ for some $k_0$. It could happen that $k_0 = 0$, as is the case with the Cauchy distribution. If we know all the moments of a distribution $\alpha$, we know the expectations $\int p(x)\,d\alpha$ for every polynomial $p(\cdot)$. Since polynomials $p(\cdot)$ can be used to approximate (by the Stone-Weierstrass theorem) any continuous function, one might hope that, from the moments, one can recover the distribution $\alpha$. This is not as straightforward as one would hope. If we take a bounded continuous function, like $\sin x$, we can find a sequence of polynomials $p_n(x)$ that converges to $\sin x$. But to conclude that
$$\int \sin x\,d\alpha = \lim_{n \to \infty} \int p_n(x)\,d\alpha$$
we need to control the contribution to the integral from large values of $x$, which is the role of the dominated convergence theorem. If we define $p^*(x) = \sup_n |p_n(x)|$ it would be a big help if $\int p^*(x)\,d\alpha$ were finite. But the degrees of the polynomials $p_n$ have to increase indefinitely with $n$ because $\sin x$ is a transcendental function. Therefore $p^*(\cdot)$ must grow faster than a polynomial at $\infty$ and the condition $\int p^*(x)\,d\alpha < \infty$ may not hold.

In general, it is not true that moments determine the distribution. If we look at it through characteristic functions, it is the problem of trying to recover the function $\phi(t)$ from a knowledge of all of its derivatives at $t = 0$. The Taylor series at $t = 0$ may not yield the function. Of course we have more information in our hands, like positive definiteness etc. But still it is likely that moments do not in general determine $\alpha$. In fact here is how to construct an example.
We need nonnegative numbers $\{a_n\}, \{b_n\} : n \ge 0$, such that
$$\sum_n a_n e^{kn} = \sum_n b_n e^{kn} = m_k$$
for every $k \ge 0$. We can then replace them by $\{\frac{a_n}{m_0}\}, \{\frac{b_n}{m_0}\} : n \ge 0$ so that $\sum_k a_k = \sum_k b_k = 1$, and the two probability distributions
$$P[X = e^n] = a_n, \qquad P[X = e^n] = b_n$$
will have all their moments equal. Once we can find $\{c_n\}$ such that
$$\sum_n c_n e^{nz} = 0 \quad\text{for } z = 0, 1, \cdots$$
we can take $a_n = \max(c_n, 0)$ and $b_n = \max(-c_n, 0)$ and we will have our example. The goal then is to construct $\{c_n\}$ such that $\sum_n c_n z^n = 0$ for $z = 1, e, e^2, \cdots$. Borrowing from ideas in the theory of a complex variable (see the Weierstrass factorization theorem, [1]) we define
$$C(z) = \prod_{n=0}^{\infty} \Big(1 - \frac{z}{e^n}\Big)$$
and expand $C(z) = \sum_n c_n z^n$. Since $C(z)$ is an entire function, the coefficients $c_n$ satisfy $\sum_n |c_n| e^{kn} < \infty$ for every $k$.

There is in fact a positive result as well. If $\alpha$ is such that the moments $m_k = \int x^k\,d\alpha$ do not grow too fast, then $\alpha$ is determined by $m_k$.

Theorem 2.2. Let $m_k$ be such that $\sum_k m_{2k} \frac{a^{2k}}{(2k)!} < \infty$ for some $a > 0$. Then there is at most one distribution $\alpha$ such that $\int x^k\,d\alpha = m_k$.

Proof. We want to determine the characteristic function $\phi(t)$ of $\alpha$. First we note that if $\alpha$ has moments $m_k$ satisfying our assumption, then
$$\int \cosh(ax)\,d\alpha = \sum_k \frac{a^{2k}}{(2k)!} m_{2k} < \infty$$
by the monotone convergence theorem. In particular
$$\psi(u + it) = \int e^{(u + it)x}\,d\alpha$$
is well defined as an analytic function of $z = u + it$ in the strip $|u| < a$. From the theory of functions of a complex variable we know that the function $\psi(\cdot)$ is uniquely determined in the strip by its derivatives at 0, i.e. $\{m_k\}$. In particular $\phi(t) = \psi(0 + it)$ is determined as well.
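The growth condition of Theorem 2.2 can be examined on a concrete case. The sketch below assumes the standard normal distribution, whose even moments are $m_{2k} = (2k - 1)!! = 1 \cdot 3 \cdots (2k - 1)$, so that the series $\sum_k m_{2k} a^{2k} / (2k)!$ reduces to $\sum_k (a^2/2)^k / k! = e^{a^2/2}$, which is finite for every $a$. The value of `a` and the number of terms are illustrative choices.

```python
import math


def normal_even_moment(k):
    """(2k - 1)!!, the 2k-th moment of the standard normal distribution."""
    out = 1
    for j in range(1, 2 * k, 2):
        out *= j
    return out


def partial_sum(a, terms=60):
    """Partial sum of m_{2k} * a^(2k) / (2k)! over 0 <= k < terms."""
    return sum(
        normal_even_moment(k) * a ** (2 * k) / math.factorial(2 * k)
        for k in range(terms)
    )


a = 1.5
series_value = partial_sum(a)
cosh_integral = math.exp(a * a / 2.0)  # the integral of cosh(a x) under N(0, 1)
print(series_value, cosh_integral)
```

So the normal distribution satisfies the hypothesis for every $a > 0$ and is therefore determined by its moments.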
2.3 Weak Convergence
One of the basic ideas in establishing Limit Theorems is the notion of weak convergence of a sequence of probability distributions on the line $\mathbb{R}$. Since the role of a probability measure is to assign probabilities to sets, we should expect that if two probability measures are to be close, then they should assign, for a given set, probabilities that are nearly equal. This suggests the definition
$$d(P_1, P_2) = \sup_{A \in \mathcal{B}} |P_1(A) - P_2(A)|$$
as the distance between two probability measures $P_1$ and $P_2$ on a measurable space $(\Omega, \mathcal{B})$. This is too strong. If we take $P_1$ and $P_2$ to be degenerate distributions with probability 1 concentrated at two points $x_1$ and $x_2$ on the line, one can see that, as soon as $x_1 \ne x_2$, $d(P_1, P_2) = 1$, and the above metric is not sensitive to how close the two points $x_1$ and $x_2$ are. It only cares that they are unequal. The problem is not because of the supremum. We can take $A$ to be an interval $[a, b]$ that includes $x_1$ but omits $x_2$ and $|P_1(A) - P_2(A)| = 1$. On the other hand if the end points of the interval are kept away from $x_1$ or $x_2$ the situation is not that bad. This leads to the following definition.

Definition 2.1. A sequence $\alpha_n$ of probability distributions on $\mathbb{R}$ is said to converge weakly to a probability distribution $\alpha$ if
$$\lim_{n \to \infty} \alpha_n[I] = \alpha[I]$$
for any interval $I = [a, b]$ such that the single point sets $a$ and $b$ have probability 0 under $\alpha$.

One can state this equivalently in terms of the distribution functions $F_n(x)$ and $F(x)$ corresponding to the measures $\alpha_n$ and $\alpha$ respectively.

Definition 2.2. A sequence $\alpha_n$ of probability measures on the real line $\mathbb{R}$ with distribution functions $F_n(x)$ is said to converge weakly to a limiting probability measure $\alpha$ with distribution function $F(x)$ (in symbols $\alpha_n \Rightarrow \alpha$ or $F_n \Rightarrow F$) if
$$\lim_{n \to \infty} F_n(x) = F(x)$$
for every $x$ that is a continuity point of $F$.
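The role of continuity points can be seen on the simplest example, point masses at $1/n$ converging weakly to the point mass at 0. The following sketch (the sample points and the large index `n_large` are illustrative choices) shows that the distribution functions agree in the limit at every $x \ne 0$, while at the discontinuity $x = 0$ of the limit they never do.

```python
def F_n(x, n):
    """Distribution function of the point mass at 1/n."""
    return 1.0 if x >= 1.0 / n else 0.0


def F(x):
    """Distribution function of the point mass at 0."""
    return 1.0 if x >= 0.0 else 0.0


n_large = 10 ** 6
continuity_points = (-0.5, 0.01, 2.0)  # any x != 0 is a continuity point of F

agree_at_continuity_points = all(F_n(x, n_large) == F(x) for x in continuity_points)
value_at_zero = (F_n(0.0, n_large), F(0.0))
print(agree_at_continuity_points, value_at_zero)
```

For every $n$ the set $A = \{0\}$ has probability 0 under the point mass at $1/n$ and probability 1 under the limit, so the distance $d$ defined with the supremum over all sets stays equal to 1 along the whole sequence.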
Exercise 2.6. Prove the equivalence of the two definitions.

Remark 2.1. One says that a sequence $X_n$ of random variables converges in law or in distribution to $X$ if the distributions $\alpha_n$ of $X_n$ converge weakly to the distribution $\alpha$ of $X$.

There are equivalent formulations in terms of expectations and characteristic functions.

Theorem 2.3. (Lévy-Cramér continuity theorem) The following are equivalent.

(a) $\alpha_n \Rightarrow \alpha$ or $F_n \Rightarrow F$.

(b) For every bounded continuous function $f(x)$ on $\mathbb{R}$
$$\lim_{n \to \infty} \int_{\mathbb{R}} f(x)\,d\alpha_n = \int_{\mathbb{R}} f(x)\,d\alpha.$$

(c) If $\phi_n(t)$ and $\phi(t)$ are respectively the characteristic functions of $\alpha_n$ and $\alpha$, for every real $t$,
$$\lim_{n \to \infty} \phi_n(t) = \phi(t).$$
Proof. We first prove (a ⇒ b). Let $\epsilon > 0$ be arbitrary. Find continuity points $a$ and $b$ of $F$ such that $a < b$, $F(a) \le \epsilon$ and $1 - F(b) \le \epsilon$. Since $F_n(a)$ and $F_n(b)$ converge to $F(a)$ and $F(b)$, for $n$ large enough, $F_n(a) \le 2\epsilon$ and $1 - F_n(b) \le 2\epsilon$. Divide the interval $[a, b]$ into a finite number $N = N_\delta$ of small subintervals $I_j = (a_j, a_{j+1}]$, $1 \le j \le N$, with $a = a_1 < a_2 < \cdots < a_{N+1} = b$ such that all the end points $\{a_j\}$ are points of continuity of $F$ and the oscillation of the continuous function $f$ in each $I_j$ is less than a preassigned number $\delta$. Since any continuous function $f$ is uniformly continuous in the closed bounded (compact) interval $[a, b]$, this is always possible for any given $\delta > 0$. Let $h(x) = \sum_{j=1}^{N} \chi_{I_j}(x) f(a_j)$ be the simple function equal to $f(a_j)$ on $I_j$ and 0 outside $\cup_j I_j = (a, b]$. We have $|f(x) - h(x)| \le \delta$ on $(a, b]$. If $f(x)$ is bounded by $M$, then
$$\Big|\int f(x)\,d\alpha_n - \sum_{j=1}^{N} f(a_j)[F_n(a_{j+1}) - F_n(a_j)]\Big| \le \delta + 4M\epsilon \tag{2.4}$$
and
$$\Big|\int f(x)\,d\alpha - \sum_{j=1}^{N} f(a_j)[F(a_{j+1}) - F(a_j)]\Big| \le \delta + 2M\epsilon. \tag{2.5}$$
Since $\lim_{n \to \infty} F_n(a_j) = F(a_j)$ for every $1 \le j \le N$, we conclude from equations (2.4), (2.5) and the triangle inequality that
$$\limsup_{n \to \infty} \Big|\int f(x)\,d\alpha_n - \int f(x)\,d\alpha\Big| \le 2\delta + 6M\epsilon.$$
Since $\epsilon$ and $\delta$ are arbitrary small numbers we are done. Because we can make the choice of $f(x) = \exp[\,i t x\,] = \cos tx + i \sin tx$, which for every $t$ is a bounded and continuous function, (b ⇒ c) is trivial. (c ⇒ a) is the hardest. It is carried out in several steps. Actually we will prove a stronger version as a separate theorem.

Theorem 2.4. For each $n \ge 1$, let $\phi_n(t)$ be the characteristic function of a probability distribution $\alpha_n$. Assume that $\lim_{n \to \infty} \phi_n(t) = \phi(t)$ exists for each $t$ and $\phi(t)$ is continuous at $t = 0$. Then $\phi(t)$ is the characteristic function of some probability distribution $\alpha$ and $\alpha_n \Rightarrow \alpha$.

Proof. Step 1. Let $r_1, r_2, \cdots$ be an enumeration of the rational numbers. For each $j$ consider the sequence $\{F_n(r_j) : n \ge 1\}$ where $F_n$ is the distribution function corresponding to $\phi_n(\cdot)$. It is a sequence bounded by 1 and we can extract a subsequence that converges. By the diagonalization process we can choose a subsequence $G_k = F_{n_k}$ such that
$$\lim_{k \to \infty} G_k(r) = b_r$$
exists for every rational number $r$. From the monotonicity of $F_n$ in $x$ we conclude that if $r_1 < r_2$, then $b_{r_1} \le b_{r_2}$.

Step 2. From the skeleton $b_r$ we reconstruct a right continuous monotone function $G(x)$. We define
$$G(x) = \inf_{r > x} b_r.$$
Clearly if $x_1 < x_2$, then $G(x_1) \le G(x_2)$ and therefore $G$ is nondecreasing. If $x_n \downarrow x$, any $r > x$ satisfies $r > x_n$ for sufficiently large $n$. This allows us to conclude that $G(x) = \inf_n G(x_n)$ for any sequence $x_n \downarrow x$, proving that $G(x)$ is right continuous.

Step 3. Next we show that at any continuity point $x$ of $G$
$$\lim_{n \to \infty} G_n(x) = G(x).$$
Let $r > x$ be a rational number. Then $G_n(x) \le G_n(r)$ and $G_n(r) \to b_r$ as $n \to \infty$. Hence
$$\limsup_{n \to \infty} G_n(x) \le b_r.$$
This is true for every rational $r > x$, and therefore taking the infimum over $r > x$
$$\limsup_{n \to \infty} G_n(x) \le G(x).$$
Suppose now that we have $y < x$. Find a rational $r$ such that $y < r < x$. Then
$$\liminf_{n \to \infty} G_n(x) \ge \liminf_{n \to \infty} G_n(r) = b_r \ge G(y).$$
As this is true for every $y < x$,
$$\liminf_{n \to \infty} G_n(x) \ge \sup_{y < x} G(y) = G(x - 0) = G(x).$$

It is a well known principle in Fourier analysis that the regularity of $\phi(t)$ at $t = 0$ is related to the decay rate of the tail probabilities.

Exercise 2.8. Compute $\int |x|^p\,d\alpha$ in terms of the characteristic function $\phi(t)$ for $p$ in the range $0 < p < 2$. Hint: Look at the formula
$$\int_{-\infty}^{\infty} \frac{1 - \cos tx}{|t|^{p+1}}\,dt = C_p |x|^p$$
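and use Fubini's theorem.

We have the following result on the behavior of $\alpha_n(A)$ for certain sets whenever $\alpha_n \Rightarrow \alpha$.

Theorem 2.6. Let $\alpha_n \Rightarrow \alpha$ on $\mathbb{R}$. If $C \subset \mathbb{R}$ is a closed set then
$$\limsup_{n \to \infty} \alpha_n(C) \le \alpha(C)$$
while for open sets $G \subset \mathbb{R}$
$$\liminf_{n \to \infty} \alpha_n(G) \ge \alpha(G).$$
If $A \subset \mathbb{R}$ is a continuity set of $\alpha$, i.e. $\alpha(\partial A) = \alpha(\bar{A} - A^o) = 0$, then
$$\lim_{n \to \infty} \alpha_n(A) = \alpha(A).$$

Proof. The function $d(x, C) = \inf_{y \in C} |x - y|$ is continuous and equals 0 precisely on $C$. Therefore
$$f(x) = \frac{1}{1 + d(x, C)}$$
is a continuous function bounded by 1, that is equal to 1 precisely on $C$, and
$$f_k(x) = [f(x)]^k \downarrow \chi_C(x)$$
as $k \to \infty$. For every $k \ge 1$, we have
$$\lim_{n \to \infty} \int f_k(x)\,d\alpha_n = \int f_k(x)\,d\alpha$$
and therefore
$$\limsup_{n \to \infty} \alpha_n(C) \le \lim_{n \to \infty} \int f_k(x)\,d\alpha_n = \int f_k(x)\,d\alpha.$$
Letting $k \to \infty$ we get
$$\limsup_{n \to \infty} \alpha_n(C) \le \alpha(C).$$
Taking complements we conclude that for any open set $G \subset \mathbb{R}$
$$\liminf_{n \to \infty} \alpha_n(G) \ge \alpha(G).$$
Combining the two parts, if $A \subset \mathbb{R}$ is a continuity set of $\alpha$, i.e. $\alpha(\partial A) = \alpha(\bar{A} - A^o) = 0$, then
$$\lim_{n \to \infty} \alpha_n(A) = \alpha(A).$$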
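The approximation of the indicator of a closed set by continuous functions, used in the proof, is easy to see numerically. The sketch below assumes the particular closed set $C = [0, 1]$ (an illustrative choice) and evaluates $f_k(x) = [1 + d(x, C)]^{-k}$ at a point inside and a point outside $C$.

```python
def dist_to_interval(x, lo=0.0, hi=1.0):
    """d(x, C) for the closed interval C = [lo, hi]."""
    return max(lo - x, 0.0, x - hi)


def f_k(x, k):
    """The k-th power of the continuous function 1 / (1 + d(x, C))."""
    return (1.0 / (1.0 + dist_to_interval(x))) ** k


inside = [f_k(0.5, k) for k in (1, 10, 100)]    # stays equal to 1 on C
outside = [f_k(1.5, k) for k in (1, 10, 200)]   # decreases to 0 off C
print(inside, outside)
```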
We are now ready to prove the converse of Theorem 2.1, which is the hard part of a theorem of Bochner that characterizes the characteristic functions of probability distributions as continuous positive definite functions on $\mathbb{R}$ normalized to be 1 at 0.

Theorem 2.7. (Bochner's Theorem). If $\phi(t)$ is a positive definite function which is continuous at $t = 0$ and is normalized so that $\phi(0) = 1$, then $\phi$ is the characteristic function of some probability distribution on $\mathbb{R}$.

Proof. The proof depends on constructing approximations $\phi_n(t)$ which are in fact characteristic functions and satisfy $\phi_n(t) \to \phi(t)$ as $n \to \infty$. Then we can apply the preceding theorem, and the probability measures corresponding to $\phi_n$ will have a weak limit which will have $\phi$ for its characteristic function.
Step 1. Let us establish a few elementary properties of positive definite functions.

1) If $\phi(t)$ is a positive definite function so is $\phi(t) \exp[\,i t a\,]$ for any real $a$. The proof is elementary and requires just direct verification.

2) If $\phi_j(t)$ are positive definite for each $j$ then so is any linear combination $\phi(t) = \sum_j w_j \phi_j(t)$ with nonnegative weights $w_j$. If each $\phi_j(t)$ is normalized with $\phi_j(0) = 1$ and $\sum_j w_j = 1$, then of course $\phi(0) = 1$ as well.

3) If $\phi$ is positive definite then $\phi$ satisfies $\phi(0) \ge 0$, $\phi(-t) = \overline{\phi(t)}$ and $|\phi(t)| \le \phi(0)$ for all $t$. We use the fact that the matrix $\{\phi(t_i - t_j) : 1 \le i, j \le n\}$ is Hermitian positive definite for any $n$ real numbers $t_1, \cdots, t_n$. The first assertion follows from the positivity of $\phi(0)|z|^2$, the second is a consequence of the Hermitian property, and if we take $n = 2$ with $t_1 = t$ and $t_2 = 0$, as a consequence of the positive definiteness of the $2 \times 2$ matrix we get $|\phi(t)|^2 \le |\phi(0)|^2$.

4) For any $s, t$ we have $|\phi(t) - \phi(s)|^2 \le 4\phi(0)|\phi(0) - \phi(t - s)|$. We use the positive definiteness of the $3 \times 3$ matrix
$$\begin{pmatrix} 1 & \phi(t - s) & \phi(t) \\ \overline{\phi(t - s)} & 1 & \phi(s) \\ \overline{\phi(t)} & \overline{\phi(s)} & 1 \end{pmatrix}$$
which is $\{\phi(t_i - t_j)\}$ with $t_1 = t$, $t_2 = s$ and $t_3 = 0$. In particular the determinant has to be nonnegative:
$$\begin{aligned}
0 &\le 1 + \phi(s)\phi(t - s)\overline{\phi(t)} + \overline{\phi(s)}\,\overline{\phi(t - s)}\,\phi(t) - |\phi(s)|^2 - |\phi(t)|^2 - |\phi(t - s)|^2 \\
&= 1 - |\phi(s) - \phi(t)|^2 - |\phi(t - s)|^2 - \overline{\phi(t)}\phi(s)(1 - \phi(t - s)) - \phi(t)\overline{\phi(s)}(1 - \overline{\phi(t - s)}) \\
&\le 1 - |\phi(s) - \phi(t)|^2 - |\phi(t - s)|^2 + 2|1 - \phi(t - s)|.
\end{aligned}$$
Or
$$|\phi(s) - \phi(t)|^2 \le 1 - |\phi(s - t)|^2 + 2|1 - \phi(t - s)| \le 4|1 - \phi(s - t)|.$$
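The inequality in 4) can be spot-checked on a known characteristic function. The sketch below uses the Poisson characteristic function $\exp[\lambda(e^{it} - 1)]$, which is positive definite with $\phi(0) = 1$; the rate `lam` and the grid of points are illustrative assumptions, and a numerical check on a grid is of course no substitute for the proof.

```python
import cmath


def phi(t, lam=2.0):
    """Characteristic function of the Poisson distribution with rate lam."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1.0))


grid = [k * 0.25 for k in range(-20, 21)]

# Largest value of |phi(t) - phi(s)|^2 - 4 |1 - phi(t - s)| over the grid;
# the inequality in 4) says this can never be positive.
worst_gap = max(
    abs(phi(t) - phi(s)) ** 2 - 4.0 * abs(1.0 - phi(t - s))
    for t in grid
    for s in grid
)
print(worst_gap)
```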
5) It now follows from 4) that if a positive definite function is continuous at $t = 0$, it is continuous everywhere (in fact uniformly continuous).

Step 2. First we show that if $\phi(t)$ is a positive definite function which is continuous on $\mathbb{R}$ and is absolutely integrable, then
$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp[-i t x]\,\phi(t)\,dt \ge 0$$
is a continuous function and
$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$
Moreover the function
$$F(x) = \int_{-\infty}^{x} f(y)\,dy$$
defines a distribution function with characteristic function
$$\phi(t) = \int_{-\infty}^{\infty} \exp[\,i t x\,] f(x)\,dx. \tag{2.9}$$
If $\phi$ is integrable on $(-\infty, \infty)$, then $f(x)$ is clearly bounded and continuous. To see that it is nonnegative we write
\begin{align}
f(x) &= \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \Big(1 - \frac{|t|}{T}\Big) e^{-i t x} \phi(t)\,dt \tag{2.10} \\
&= \lim_{T \to \infty} \frac{1}{2\pi T} \int_0^T \int_0^T e^{-i(t - s)x} \phi(t - s)\,dt\,ds \tag{2.11} \\
&= \lim_{T \to \infty} \frac{1}{2\pi T} \int_0^T \int_0^T e^{-i t x} e^{i s x} \phi(t - s)\,dt\,ds \tag{2.12} \\
&\ge 0. \notag
\end{align}
We can use the dominated convergence theorem to prove equation (2.10), a change of variables to show equation (2.11) and finally a Riemann sum approximation to the integral and the positive definiteness of $\phi$ to show that the quantity in (2.12) is nonnegative. It remains to show the relation (2.9). Let us define
$$f_\sigma(x) = f(x) \exp\Big[-\frac{\sigma^2 x^2}{2}\Big]$$
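and calculate for $t \in \mathbb{R}$, using Fubini's theorem
\begin{align}
\int_{-\infty}^{\infty} e^{i t x} f_\sigma(x)\,dx &= \int_{-\infty}^{\infty} e^{i t x} f(x) \exp\Big[-\frac{\sigma^2 x^2}{2}\Big]\,dx \notag \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{i t x} \phi(s) e^{-i s x} \exp\Big[-\frac{\sigma^2 x^2}{2}\Big]\,ds\,dx \notag \\
&= \int_{-\infty}^{\infty} \phi(s) \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{(t - s)^2}{2\sigma^2}\Big]\,ds. \tag{2.13}
\end{align}
If we take $t = 0$ in equation (2.13), we get
$$\int_{-\infty}^{\infty} f_\sigma(x)\,dx = \int_{-\infty}^{\infty} \phi(s) \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{s^2}{2\sigma^2}\Big]\,ds \le 1. \tag{2.14}$$
Now we let $\sigma \to 0$. Since $f_\sigma \ge 0$ and tends to $f$ as $\sigma \to 0$, from Fatou's lemma and equation (2.14) it follows that $f$ is integrable and in fact $\int_{-\infty}^{\infty} f(x)\,dx \le 1$. Now we let $\sigma \to 0$ in equation (2.13). Since $f_\sigma(x) e^{itx}$ is dominated by the integrable function $f$, there is no problem with the left hand side. On the other hand the limit as $\sigma \to 0$ is easily calculated on the right hand side of equation (2.13):
$$\begin{aligned}
\int_{-\infty}^{\infty} e^{i t x} f(x)\,dx &= \lim_{\sigma \to 0} \int_{-\infty}^{\infty} \phi(s) \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{(s - t)^2}{2\sigma^2}\Big]\,ds \\
&= \lim_{\sigma \to 0} \int_{-\infty}^{\infty} \phi(t + \sigma s) \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{s^2}{2}\Big]\,ds \\
&= \phi(t)
\end{aligned}$$
proving equation (2.9).

Step 3. If $\phi(t)$ is a positive definite function which is continuous, so is $\phi(t) \exp[\,i t y\,]$ for every $y$, and, for $\sigma > 0$, so is the convex combination
$$\phi_\sigma(t) = \int_{-\infty}^{\infty} \phi(t) \exp[\,i t y\,] \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{y^2}{2\sigma^2}\Big]\,dy = \phi(t) \exp\Big[-\frac{\sigma^2 t^2}{2}\Big].$$
The previous step is applicable to $\phi_\sigma(t)$, which is clearly integrable on $\mathbb{R}$, and by letting $\sigma \to 0$ we conclude by Theorem 2.3 that $\phi$ is a characteristic function as well.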
Remark 2.2. There is a Fourier Series analog involving distributions on a finite interval, say $S = [0, 2\pi)$. The right end point is omitted on purpose, because the distribution should be thought of as being on $[0, 2\pi]$ with 0 and $2\pi$ identified. If $\alpha$ is a distribution on $S$ the characteristic function is defined as
$$\phi(n) = \int e^{i n x}\,d\alpha$$
for integral values $n \in \mathbb{Z}$. There is a uniqueness theorem, and a Bochner type theorem involving an analogous definition of positive definiteness. The proof is nearly the same.

Exercise 2.9. If $\alpha_n \Rightarrow \alpha$ it is not always true that $\int x\,d\alpha_n \to \int x\,d\alpha$, because while $x$ is a continuous function it is not bounded. Construct a simple counterexample. On the positive side, let $f(x)$ be a continuous function that is not necessarily bounded. Assume that there exists a positive continuous function $g(x)$ satisfying
$$\lim_{|x| \to \infty} \frac{|f(x)|}{g(x)} = 0$$
and
$$\sup_n \int g(x)\,d\alpha_n \le C < \infty.$$
Then show that
$$\lim_{n \to \infty} \int f(x)\,d\alpha_n = \int f(x)\,d\alpha.$$
In particular if $\int |x|^k\,d\alpha_n$ remains bounded, then $\int x^j\,d\alpha_n \to \int x^j\,d\alpha$ for $1 \le j \le k - 1$.

Exercise 2.10. On the other hand if $\alpha_n \Rightarrow \alpha$ and $g : \mathbb{R} \to \mathbb{R}$ is a continuous function then the distribution $\beta_n$ of $g$ under $\alpha_n$, defined as
$$\beta_n[A] = \alpha_n[\,x : g(x) \in A\,],$$
converges weakly to $\beta$, the corresponding distribution of $g$ under $\alpha$.

Exercise 2.11. If $g_n(x)$ is a sequence of continuous functions such that
$$\sup_{n, x} |g_n(x)| \le C < \infty \quad\text{and}\quad \lim_{n \to \infty} g_n(x) = g(x)$$
uniformly on every bounded interval, then whenever $\alpha_n \Rightarrow \alpha$ it follows that
$$\lim_{n \to \infty} \int g_n(x)\,d\alpha_n = \int g(x)\,d\alpha.$$
Can you construct an example to show that, even if $g_n$, $g$ are continuous, just the pointwise convergence $\lim_{n \to \infty} g_n(x) = g(x)$ is not enough?

Exercise 2.12. If a sequence $\{f_n(\omega)\}$ of random variables on a measure space is such that $f_n \to f$ in measure, then show that the sequence of distributions $\alpha_n$ of $f_n$ on $\mathbb{R}$ converges weakly to the distribution $\alpha$ of $f$. Give an example to show that the converse is not true in general. However, if $f$ is equal to a constant $c$ with probability 1, or equivalently $\alpha$ is degenerate at some point $c$, then $\alpha_n \Rightarrow \alpha = \delta_c$ implies the convergence in probability of $f_n$ to the constant function $c$.
Chapter 3

Independent Sums

3.1 Independence and Convolution
One of the central ideas in probability is the notion of independence. In intuitive terms two events are independent if they have no influence on each other. The formal definition is

Definition 3.1. Two events $A$ and $B$ are said to be independent if
$$P[A \cap B] = P[A]P[B].$$

Exercise 3.1. If $A$ and $B$ are independent prove that so are $A^c$ and $B$.

Definition 3.2. Two random variables $X$ and $Y$ are independent if the events $X \in A$ and $Y \in B$ are independent for any two Borel sets $A$ and $B$ on the line, i.e.
$$P[X \in A, Y \in B] = P[X \in A]P[Y \in B]$$
for all Borel sets $A$ and $B$.

There is a natural extension to a finite or even an infinite collection of random variables.
Definition 3.3. A finite collection $\{X_j : 1 \le j \le n\}$ of random variables is said to be independent if for any $n$ Borel sets $A_1, \ldots, A_n$ on the line
$$P\Big[\bigcap_{1\le j\le n} [X_j \in A_j]\Big] = \prod_{1\le j\le n} P[X_j \in A_j].$$
Definition 3.4. An infinite collection of random variables is said to be independent if every finite subcollection is independent.

Lemma 3.1. Two random variables $X, Y$ defined on $(\Omega, \Sigma, P)$ are independent if and only if the measure induced on $\mathbb{R}^2$ by $(X, Y)$ is the product measure $\alpha \times \beta$, where $\alpha$ and $\beta$ are the distributions on $\mathbb{R}$ induced by $X$ and $Y$ respectively.

Proof. Left as an exercise.

The important thing to note is that if $X$ and $Y$ are independent and one knows their distributions $\alpha$ and $\beta$, then their joint distribution is automatically determined as the product measure.

If $X$ and $Y$ are independent random variables having $\alpha$ and $\beta$ for their distributions, the distribution of the sum $Z = X + Y$ is determined as follows. First we construct the product measure $\alpha \times \beta$ on $\mathbb{R} \times \mathbb{R}$ and then consider the induced distribution of the function $f(x, y) = x + y$. This distribution, called the convolution of $\alpha$ and $\beta$, is denoted by $\alpha * \beta$. An elementary calculation using Fubini's theorem provides the following identities.
$$(\alpha * \beta)(A) = \int \alpha(A - x)\, d\beta = \int \beta(A - x)\, d\alpha \tag{3.1}$$
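For distributions with finitely many atoms, identity (3.1) becomes a finite sum and can be checked by hand or by machine. The following is a minimal sketch under that assumption; the helper names `convolve` and `char_fn` and the fair-die example are illustrative choices, not part of the text. It also checks numerically that characteristic functions multiply under convolution.

```python
import cmath


def convolve(alpha, beta):
    """Return alpha * beta for finite distributions given as {point: mass} dicts.

    Mirrors (3.1): the mass of {z} is the sum over y of alpha({z - y}) * beta({y}).
    """
    support = {x + y for x in alpha for y in beta}
    return {z: sum(alpha.get(z - y, 0.0) * q for y, q in beta.items()) for z in support}


def char_fn(alpha, t):
    """Characteristic function: the integral of exp(i t x) against alpha."""
    return sum(p * cmath.exp(1j * t * x) for x, p in alpha.items())


die = {k: 1 / 6 for k in range(1, 7)}   # hypothetical example: one fair die
two_dice = convolve(die, die)            # distribution of the sum of two independent dice

print(two_dice[7])                                            # mass at 7
print(abs(char_fn(two_dice, 0.7) - char_fn(die, 0.7) ** 2))   # should be near 0
```

Representing a distribution as a dictionary keeps the sketch close to the measure-theoretic formula; for distributions with densities one would replace the sums by numerical integration.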
In terms of characteristic functions, we can express the characteristic function of the convolution as
$$\int \exp[\,i t x\,]\, d(\alpha * \beta) = \int\!\!\int \exp[\,i t (x + y)\,]\, d\alpha\, d\beta = \int \exp[\,i t x\,]\, d\alpha \int \exp[\,i t y\,]\, d\beta$$
or equivalently
$$\phi_{\alpha * \beta}(t) = \phi_\alpha(t)\,\phi_\beta(t) \tag{3.2}$$
which provides a direct way of calculating the distributions of sums of independent random variables by the use of characteristic functions.

Exercise 3.2. If $X$ and $Y$ are independent show that for any two measurable functions $f$ and $g$, $f(X)$ and $g(Y)$ are independent.

Exercise 3.3. Use Fubini's theorem to show that if $X$ and $Y$ are independent and if $f$ and $g$ are measurable functions with both $E[|f(X)|]$ and $E[|g(Y)|]$ finite then
$$E[f(X)g(Y)] = E[f(X)]\,E[g(Y)].$$

Exercise 3.4. Show that if $X$ and $Y$ are any two random variables then $E(X + Y) = E(X) + E(Y)$. If $X$ and $Y$ are two independent random variables then show that
$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$
where
$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2.$$

If $X_1, X_2, \cdots, X_n$ are $n$ independent random variables, then the distribution of their sum $S_n = X_1 + X_2 + \cdots + X_n$ can be computed in terms of the distributions of the summands. If $\alpha_j$ is the distribution of $X_j$, then the distribution $\mu_n$ of $S_n$ is given by the convolution $\mu_n = \alpha_1 * \alpha_2 * \cdots * \alpha_n$ that can be calculated inductively by $\mu_{j+1} = \mu_j * \alpha_{j+1}$. In terms of their characteristic functions $\psi_n(t) = \phi_1(t)\phi_2(t) \cdots \phi_n(t)$. The first two moments of $S_n$ are computed easily.
$$E(S_n) = E(X_1) + E(X_2) + \cdots + E(X_n)$$
and
$$\mathrm{Var}(S_n) = E\big[(S_n - E(S_n))^2\big] = \sum_j E\big[(X_j - E(X_j))^2\big] + 2 \sum_{1 \le i < j \le n} \cdots$$

[…] $> 0$.
Proof. Use Chebychev's inequality to estimate
$$P\Big[\Big|\frac{S_n}{n} - m\Big| \ge \delta\Big] \le \frac{1}{\delta^2}\,\mathrm{Var}\Big(\frac{S_n}{n}\Big) = \frac{\sigma^2}{n\delta^2}.$$
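The Chebychev estimate in this proof can be compared with an exact computation in a simple case. The sketch below assumes fair coin tosses, so that $m = 1/2$ and $\sigma^2 = 1/4$; the function names `exact_tail` and `chebyshev_bound` are illustrative. Exact rational arithmetic is used for the comparison $|k/n - 1/2| \ge \delta$ to avoid rounding at the boundary.

```python
from fractions import Fraction
from math import comb


def exact_tail(n, delta):
    """Exact P[|S_n/n - 1/2| >= delta] when S_n is Binomial(n, 1/2)."""
    half = Fraction(1, 2)
    count = sum(comb(n, k) for k in range(n + 1) if abs(Fraction(k, n) - half) >= delta)
    return count / 2**n


def chebyshev_bound(n, delta, variance=Fraction(1, 4)):
    """The bound sigma^2 / (n delta^2); variance 1/4 is that of one fair coin toss."""
    return float(variance / (n * delta**2))


delta = Fraction(1, 10)
for n in (10, 100, 1000):
    # The exact tail probability never exceeds the Chebychev bound.
    print(n, exact_tail(n, delta), chebyshev_bound(n, delta))
```

The bound decays like $1/n$ while the exact probability decays much faster, which is consistent with the geometric decay asked for in Exercise 3.7.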
Actually it is enough to assume that $E|X_i| < \infty$; the existence of the second moment is not needed. We will provide two proofs of the statement.

Theorem 3.3. If $X_1, X_2, \cdots, X_n$ are independent and identically distributed with a finite first moment and $E(X_i) = m$, then $\frac{X_1 + X_2 + \cdots + X_n}{n}$ converges to $m$ in probability as $n \to \infty$.

Proof 1. Let $C$ be a large constant and let us define $X_i^C$ as the truncated random variable $X_i^C = X_i$ if $|X_i| \le C$ and $X_i^C = 0$ otherwise. Let $Y_i^C = X_i - X_i^C$ so that $X_i = X_i^C + Y_i^C$. Then
$$\frac{1}{n}\sum_{1\le i\le n} X_i = \frac{1}{n}\sum_{1\le i\le n} X_i^C + \frac{1}{n}\sum_{1\le i\le n} Y_i^C = \xi_n^C + \eta_n^C.$$
If we denote by $a_C = E(X_i^C)$ and $b_C = E(Y_i^C)$ we always have $m = a_C + b_C$. Consider the quantity
$$\begin{aligned}
\delta_n &= E\Big[\Big|\frac{1}{n}\sum_{1\le i\le n} X_i - m\Big|\Big] \\
&= E\big[|\xi_n^C + \eta_n^C - m|\big] \\
&\le E\big[|\xi_n^C - a_C|\big] + E\big[|\eta_n^C - b_C|\big] \\
&\le \Big(E\big[|\xi_n^C - a_C|^2\big]\Big)^{1/2} + 2E\big[|Y_i^C|\big].
\end{aligned} \tag{3.8}$$
The truncated random variables $X_i^C$ are bounded and independent, so Theorem 3.2 is applicable and, as $n \to \infty$, the first of the two terms in (3.8) tends to 0. Therefore taking the limsup as $n \to \infty$, for any $0 < C < \infty$,
$$\limsup_{n\to\infty} \delta_n \le 2E\big[|Y_i^C|\big].$$
If we now let the cutoff level $C$ go to infinity, by the integrability of $X_i$, $E\big[|Y_i^C|\big] \to 0$ as $C \to \infty$ and we are done. The final step of establishing that
for any sequence $Y_n$ of random variables, $E[|Y_n|] \to 0$ implies that $Y_n \to 0$ in probability, is left as an exercise and is not very different from Chebychev's inequality.

Proof 2. We can use characteristic functions. If we denote the characteristic function of $X_i$ by $\phi(t)$, then the characteristic function of $\frac{1}{n}\sum_{1\le i\le n} X_i$ is given by $\psi_n(t) = [\phi(\frac{t}{n})]^n$. The existence of the first moment assures us that $\phi(t)$ is differentiable at $t = 0$ with a derivative equal to $im$ where $m = E(X_i)$. Therefore by Taylor expansion
$$\phi\Big(\frac{t}{n}\Big) = 1 + \frac{i m t}{n} + o\Big(\frac{1}{n}\Big).$$
Whenever $n a_n \to z$ it follows that $(1 + a_n)^n \to e^z$. Therefore,
$$\lim_{n\to\infty} \psi_n(t) = \exp[\,i m t\,]$$
which is the characteristic function of the distribution degenerate at $m$. Hence the distribution of $\frac{S_n}{n}$ tends to the degenerate distribution at the point $m$. The weak law of large numbers is thereby established.

Exercise 3.5. If the underlying distribution is a Cauchy distribution with density $\frac{1}{\pi(1+x^2)}$ and characteristic function $\phi(t) = e^{-|t|}$, prove that the weak law does not hold.

Exercise 3.6. The weak law may hold sometimes even if the mean does not exist. If we dampen the tails of the Cauchy ever so slightly with a density $f(x) = \frac{c}{(1+x^2)\log(1+x^2)}$, show that the weak law of large numbers holds.

Exercise 3.7. In the case of the Binomial distribution with $p = \frac{1}{2}$, use Stirling's formula
$$n! \simeq \sqrt{2\pi}\, e^{-n}\, n^{n+\frac{1}{2}}$$
to estimate the probability
$$\sum_{r \ge nx} \binom{n}{r} \frac{1}{2^n}$$
and show that it decays geometrically in $n$. Can you calculate the geometric ratio
$$\rho(x) = \lim_{n\to\infty} \Big[\sum_{r \ge nx} \binom{n}{r} \frac{1}{2^n}\Big]^{\frac{1}{n}}$$
explicitly as a function of $x$ for $x > \frac{1}{2}$?
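The characteristic function argument of Proof 2 above can be watched numerically. The sketch below assumes, purely for illustration, that the $X_i$ are exponential with mean $m = 1$, whose characteristic function is $1/(1 - it)$; the names `phi_exponential` and `psi` are not from the text.

```python
import cmath


def phi_exponential(t):
    """Characteristic function of the exponential distribution with mean 1."""
    return 1 / (1 - 1j * t)


def psi(n, t, phi=phi_exponential):
    """Characteristic function of S_n / n for i.i.d. summands with characteristic function phi."""
    return phi(t / n) ** n


t, m = 1.5, 1.0
for n in (10, 1000, 100000):
    # Distance between psi_n(t) and exp(i m t), the characteristic function of the point mass at m.
    print(n, abs(psi(n, t) - cmath.exp(1j * m * t)))
```

The printed distances shrink roughly like $t^2/(2n)$, reflecting the $o(1/n)$ term in the Taylor expansion.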
3.3 Strong Limit Theorems
The weak law of large numbers is really a result concerning the behavior of
$$\frac{S_n}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where $X_1, X_2, \cdots, X_n, \ldots$ is a sequence of independent and identically distributed random variables on some probability space $(\Omega, \mathcal{B}, P)$. Under the assumption that $X_i$ are integrable with an integral equal to $m$, the weak law asserts that as $n \to \infty$, $\frac{S_n}{n} \to m$ in probability. Since almost everywhere convergence is generally stronger than convergence in probability one may ask if
$$P\Big[\omega : \lim_{n\to\infty} \frac{S_n(\omega)}{n} = m\Big] = 1.$$
This is called the Strong Law of Large Numbers. Strong laws are statements that hold for almost all $\omega$.

Let us look at functions of the form $f_n = \chi_{A_n}$. It is easy to verify that $f_n \to 0$ in probability if and only if $P(A_n) \to 0$. On the other hand

Lemma 3.4. (Borel-Cantelli lemma). If
$$\sum_n P(A_n) < \infty$$
then
$$P\Big[\omega : \lim_{n\to\infty} \chi_{A_n}(\omega) = 0\Big] = 1.$$
If the events $A_n$ are mutually independent the converse is also true.

Remark 3.1. Note that the complementary event
$$\Big\{\omega : \limsup_{n\to\infty} \chi_{A_n}(\omega) = 1\Big\}$$
is the same as $\cap_{n=1}^{\infty} \cup_{j=n}^{\infty} A_j$, or the event that infinitely many of the events $\{A_j\}$ occur.

The conclusion of the next exercise will be used in the proof.
Exercise 3.8. Prove the following variant of the monotone convergence theorem. If $f_n(\omega) \ge 0$ are measurable functions, the set $E = \{\omega : S(\omega) = \sum_n f_n(\omega) < \infty\}$ is measurable and $S(\omega)$ is a measurable function on $E$. If each $f_n$ is integrable and $\sum_n E[f_n] < \infty$ then $P[E] = 1$, $S(\omega)$ is integrable and $E[S(\omega)] = \sum_n E[f_n(\omega)]$.

Proof (of Lemma 3.4). By the previous exercise if $\sum_n P(A_n) < \infty$, then $\sum_n \chi_{A_n}(\omega) = S(\omega)$ is finite almost everywhere and
$$E(S(\omega)) = \sum_n P(A_n) < \infty.$$
If an infinite series has a finite sum then the $n$-th term must go to 0, thereby proving the direct part. To prove the converse we need to show that if $\sum_n P(A_n) = \infty$, then $\lim_{m\to\infty} P(\cup_{n=m}^{\infty} A_n) > 0$. We can use independence and the continuity of probability under monotone limits, to calculate for every $m$,
$$\begin{aligned}
P(\cup_{n=m}^{\infty} A_n) &= 1 - P(\cap_{n=m}^{\infty} A_n^c) \\
&= 1 - \prod_{n=m}^{\infty} (1 - P(A_n)) \quad \text{(by independence)} \\
&\ge 1 - e^{-\sum_{n=m}^{\infty} P(A_n)} = 1
\end{aligned}$$
and we are done. We have used the inequality $1 - x \le e^{-x}$ familiar in the study of infinite products.

Another digression that we want to make into measure theory at this point is to discuss Kolmogorov's consistency theorem. How do we know that there are probability spaces that admit a sequence of independent identically distributed random variables with specified distributions? By the construction of product measures that we outlined earlier we can construct a measure on $\mathbb{R}^n$ for every $n$ which is the joint distribution of the first $n$ random variables. Let us denote by $P_n$ this probability measure on $\mathbb{R}^n$. They are consistent in the sense that if we project in the natural way from $\mathbb{R}^{n+1} \to \mathbb{R}^n$, $P_{n+1}$ projects to $P_n$. Such a family is called a consistent family of finite dimensional distributions. We look at the space $\Omega = \mathbb{R}^{\infty}$ consisting of all real sequences $\omega = \{x_n : n \ge 1\}$ with a natural $\sigma$-field $\Sigma$ generated by the field $\mathcal{F}$ of finite dimensional cylinder sets of the form $B = \{\omega : (x_1, \cdots, x_n) \in A\}$ where $A$ varies over Borel sets in $\mathbb{R}^n$ and $n$ varies over positive integers.
Theorem 3.5. (Kolmogorov's Consistency Theorem). Given a consistent family of finite dimensional distributions $P_n$, there exists a unique $P$ on $(\Omega, \Sigma)$ such that for every $n$, under the natural projection $\pi_n(\omega) = (x_1, \cdots, x_n)$, the induced measure $P\pi_n^{-1} = P_n$ on $\mathbb{R}^n$.

Proof. The consistency is just what is required to be able to define $P$ on $\mathcal{F}$ by $P(B) = P_n(A)$. Once we have $P$ defined on the field $\mathcal{F}$, we have to prove the countable additivity of $P$ on $\mathcal{F}$. The rest is then routine. Let $B_n \in \mathcal{F}$ and $B_n \downarrow \Phi$, the empty set. If possible let $P(B_n) \ge \delta$ for all $n$ and for some $\delta > 0$. Then $B_n = \pi_{k_n}^{-1} A_{k_n}$ for some $k_n$ and without loss of generality we assume that $k_n = n$, so that $B_n = \pi_n^{-1} A_n$ for some Borel set $A_n \subset \mathbb{R}^n$. According to Exercise 3.9 below, we can find a closed bounded subset $K_n \subset A_n$ such that
$$P_n(A_n - K_n) \le \frac{\delta}{2^{n+1}}$$
and define $C_n = \pi_n^{-1} K_n$ and $D_n = \cap_{j=1}^{n} C_j = \pi_n^{-1} F_n$ for some closed bounded set $F_n \subset K_n \subset \mathbb{R}^n$. Then
$$P(D_n) \ge \delta - \sum_{j=1}^{n} \frac{\delta}{2^{j+1}} \ge \frac{\delta}{2}.$$
$D_n \subset B_n$, $D_n \downarrow \Phi$ and each $D_n$ is nonempty. If we take $\omega^{(n)} = \{x_j^n : j \ge 1\}$ to be an arbitrary point from $D_n$, by our construction $(x_1^n, \cdots, x_m^n) \in F_m$ for $n \ge m$. We can definitely choose a subsequence (diagonalization) such that $x_j^{n_k}$ converges for each $j$ producing a limit $\omega = (x_1, \cdots, x_m, \cdots)$ and, for every $m$, we will have $(x_1, \cdots, x_m) \in F_m$. This implies that $\omega \in D_m$ for every $m$, contradicting $D_n \downarrow \Phi$. We are done.

Exercise 3.9. We have used the fact that given any Borel set $A \subset \mathbb{R}^n$, and a probability measure $\alpha$ on $\mathbb{R}^n$, for any $\epsilon > 0$, there exists a closed bounded subset $K_\epsilon \subset A$ such that $\alpha(A - K_\epsilon) \le \epsilon$. Prove it by showing that the class of sets $A$ with the above property is a monotone class that contains finite disjoint unions of measurable rectangles and therefore contains the Borel $\sigma$-field. To prove the last fact, establish it first for $n = 1$. To handle $n = 1$, repeat the same argument starting from finite disjoint unions of right-closed left-open intervals. Use the countable additivity to verify this directly.
Remark 3.2. Kolmogorov's consistency theorem remains valid if we replace $\mathbb{R}$ by an arbitrary complete separable metric space $X$, with its Borel $\sigma$-field. However it is not valid in complete generality. See [8]. See Remark 4.7 in this context.

The following is a strong version of the Law of Large Numbers.

Theorem 3.6. If $X_1, \cdots, X_n, \cdots$ is a sequence of independent identically distributed random variables with $E|X_i|^4 = C < \infty$, then
$$\lim_{n\to\infty} \frac{S_n}{n} = \lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n} = E(X_1)$$
with probability 1.

Proof. We can assume without loss of generality that $E[X_i] = 0$. Just take $Y_i = X_i - E[X_i]$. A simple calculation shows
$$E[(S_n)^4] = nE[(X_1)^4] + 3n(n-1)E[(X_1)^2]^2 \le nC + 3n^2\sigma^4$$
and by applying a Chebychev type inequality using fourth moments,
$$P\Big[\Big|\frac{S_n}{n}\Big| \ge \delta\Big] = P[\,|S_n| \ge n\delta\,] \le \frac{nC + 3n^2\sigma^4}{n^4\delta^4}.$$
We see that
$$\sum_{n=1}^{\infty} P\Big[\Big|\frac{S_n}{n}\Big| \ge \delta\Big] < \infty$$
and we can now apply the Borel-Cantelli Lemma.
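The fourth moment identity used in this proof can be verified exactly in the simplest case. The sketch below assumes $X_i = \pm 1$ with probability $\frac{1}{2}$ each, so that $E[X_1^4] = E[X_1^2] = 1$ and $S_n = 2k - n$ when $k$ of the signs are positive; the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def fourth_moment(n):
    """Exact E[S_n^4] when each X_i is +1 or -1 with probability 1/2."""
    total = sum(comb(n, k) * (2 * k - n) ** 4 for k in range(n + 1))
    return Fraction(total, 2**n)


def formula(n, m4=1, sigma2=1):
    """n E[X^4] + 3 n (n - 1) (E[X^2])^2, with both moments equal to 1 for fair signs."""
    return n * m4 + 3 * n * (n - 1) * sigma2**2


for n in (1, 2, 3, 10):
    print(n, fourth_moment(n), formula(n))
```

Since the right side grows like $3n^2$, dividing by $n^4\delta^4$ gives terms of order $n^{-2}$, which is what makes the series in the proof summable.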
3.4 Series of Independent Random Variables
We wish to investigate conditions under which an infinite series with independent summands
$$S = \sum_{j=1}^{\infty} X_j$$
converges with probability 1. The basic steps are the following inequalities due to Kolmogorov and Lévy that control the behaviour of sums of independent random variables. They both deal with the problem of estimating
$$T_n(\omega) = \sup_{1\le k\le n} |S_k(\omega)| = \sup_{1\le k\le n} \Big|\sum_{j=1}^{k} X_j(\omega)\Big|$$
where $X_1, \cdots, X_n$ are $n$ independent random variables.

Lemma 3.7. (Kolmogorov's Inequality). Assume that $E X_i = 0$ and $\mathrm{Var}(X_i) = \sigma_i^2 < \infty$ and let $s_n^2 = \sum_{j=1}^{n} \sigma_j^2$. Then
$$P\{T_n(\omega) \ge \ell\} \le \frac{s_n^2}{\ell^2}. \tag{3.9}$$
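Inequality (3.9) can be checked by brute force in a small case. The sketch below assumes $n$ fair $\pm 1$ steps, so that $s_n^2 = n$, and enumerates all $2^n$ sign patterns; the function name `max_tail` and the choice $n = 12$ are illustrative.

```python
from itertools import product


def max_tail(n, level):
    """Exact P[T_n >= level], T_n = max_k |S_k|, for n independent fair +-1 steps."""
    hits = 0
    for steps in product((-1, 1), repeat=n):
        partial, running_max = 0, 0
        for x in steps:
            partial += x
            running_max = max(running_max, abs(partial))
        if running_max >= level:
            hits += 1
    return hits / 2**n


n = 12   # s_n^2 = n because each step has variance 1
for level in (3, 4, 5, 6):
    # Left: exact probability for the running maximum. Right: the bound s_n^2 / level^2.
    print(level, max_tail(n, level), n / level**2)
```

The enumeration is exponential in $n$ and is only meant as a sanity check on small examples.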
Proof. The important point here is that the estimate depends only on $s_n^2$ and not on the number of summands. In fact the Chebychev bound on $S_n$ is
$$P\{|S_n| \ge \ell\} \le \frac{s_n^2}{\ell^2}$$
and the supremum does not cost anything. Let us define the events $E_k = \{|S_1| < \ell, \cdots, |S_{k-1}| < \ell, |S_k| \ge \ell\}$ and then $\{T_n \ge \ell\} = \cup_{k=1}^{n} E_k$ is a disjoint union of $E_k$. If we use the independence of $S_n - S_k$ and $S_k \chi_{E_k}$ that only depends on $X_1, \cdots, X_k$
$$\begin{aligned}
P\{E_k\} &\le \frac{1}{\ell^2} \int_{E_k} S_k^2\, dP \\
&\le \frac{1}{\ell^2} \int_{E_k} \big[S_k^2 + (S_n - S_k)^2\big]\, dP \\
&= \frac{1}{\ell^2} \int_{E_k} \big[S_k^2 + 2S_k(S_n - S_k) + (S_n - S_k)^2\big]\, dP \\
&= \frac{1}{\ell^2} \int_{E_k} S_n^2\, dP.
\end{aligned}$$
Summing over $k$ from 1 to $n$
$$P\{T_n \ge \ell\} \le \frac{1}{\ell^2} \int_{T_n \ge \ell} S_n^2\, dP \le \frac{s_n^2}{\ell^2},$$
establishing (3.9).

Lemma 3.8. (Lévy's Inequality). Assume that
$$P\Big\{|X_i + \cdots + X_n| \ge \frac{\ell}{2}\Big\} \le \delta$$
for all $1 \le i \le n$. Then
$$P\{T_n \ge \ell\} \le \frac{\delta}{1 - \delta}. \tag{3.10}$$
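Proof. Let $E_k$ be as in the previous lemma.
$$\begin{aligned}
P\Big\{(T_n \ge \ell) \cap \Big(|S_n| \le \frac{\ell}{2}\Big)\Big\} &= \sum_{k=1}^{n} P\Big\{E_k \cap \Big(|S_n| \le \frac{\ell}{2}\Big)\Big\} \\
&\le \sum_{k=1}^{n} P\Big\{E_k \cap \Big(|S_n - S_k| \ge \frac{\ell}{2}\Big)\Big\} \\
&= \sum_{k=1}^{n} P\Big\{|S_n - S_k| \ge \frac{\ell}{2}\Big\}\, P(E_k) \\
&\le \delta \sum_{k=1}^{n} P(E_k) \\
&= \delta P\{T_n \ge \ell\}.
\end{aligned}$$
On the other hand,
$$P\Big\{(T_n \ge \ell) \cap \Big(|S_n| > \frac{\ell}{2}\Big)\Big\} \le P\Big\{|S_n| > \frac{\ell}{2}\Big\} \le \delta.$$
Adding the two,
$$P\{T_n \ge \ell\} \le \delta P\{T_n \ge \ell\} + \delta$$
or
$$P\{T_n \ge \ell\} \le \frac{\delta}{1 - \delta}$$
proving (3.10).

We are now ready to prove

Theorem 3.9. (Lévy's Theorem). If $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of independent random variables, then the following are equivalent.

(i) The distribution $\alpha_n$ of $S_n = X_1 + \cdots + X_n$ converges weakly to a probability distribution $\alpha$ on $\mathbb{R}$.

(ii) The random variable $S_n = X_1 + \cdots + X_n$ converges in probability to a limit $S(\omega)$.

(iii) The random variable $S_n = X_1 + \cdots + X_n$ converges with probability 1 to a limit $S(\omega)$.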
Proof. The implications (iii) ⇒ (ii) ⇒ (i) are trivial. We will establish (i) ⇒ (ii) ⇒ (iii).

(i) ⇒ (ii). The characteristic functions $\phi_j(t)$ of $X_j$ are such that
$$\phi(t) = \prod_{j=1}^{\infty} \phi_j(t)$$
is a convergent infinite product. Since the limit $\phi(t)$ is continuous at $t = 0$ and $\phi(0) = 1$ it is nonzero in some interval $|t| \le T$ around 0. Therefore for $|t| \le T$,
$$\lim_{\substack{n\to\infty \\ m\to\infty}} \prod_{j=m+1}^{n} \phi_j(t) = 1.$$
By Exercise 3.10 below, this implies that for all $t$,
$$\lim_{\substack{n\to\infty \\ m\to\infty}} \prod_{j=m+1}^{n} \phi_j(t) = 1$$
and consequently, the distribution of $S_n - S_m$ converges to the distribution degenerate at 0. This implies the convergence in probability to 0 of $S_n - S_m$ as $m, n \to \infty$. Therefore for each $\delta > 0$,
$$\lim_{\substack{n\to\infty \\ m\to\infty}} P\{|S_n - S_m| \ge \delta\} = 0$$
establishing (ii).

(ii) ⇒ (iii). To establish (iii), because of Exercise 3.11 below, we need only show that for every $\delta > 0$
$$\lim_{\substack{n\to\infty \\ m\to\infty}} P\Big\{\sup_{m < k \le n} |S_k - S_m| \ge \delta\Big\} = 0$$

[…] $> 0$,
$$\lim_{\substack{n\to\infty \\ m\to\infty}} P\{|S_n - S_m| \ge \delta\} = 0$$
then there is a random variable $S$ such that $S_n \to S$ in probability, i.e. for each $\delta > 0$,
$$\lim_{n\to\infty} P\{|S_n - S| \ge \delta\} = 0.$$

Exercise 3.12. Prove that if a sequence $S_n$ of random variables satisfies
$$\lim_{\substack{n\to\infty \\ m\to\infty}} P\Big\{\sup_{m < k \le n} |S_k - S_m| \ge \delta\Big\} = 0$$
for every $\delta > 0$ then there is a limiting random variable $S(\omega)$ such that
$$P\Big[\lim_{n\to\infty} S_n(\omega) = S(\omega)\Big] = 1.$$
Exercise 3.13. Prove that whenever $X_n \to X$ in probability the distribution $\alpha_n$ of $X_n$ converges weakly to the distribution $\alpha$ of $X$.

Now it is straightforward to find sufficient conditions for the convergence of an infinite series of independent random variables.

Theorem 3.10. (Kolmogorov's one series Theorem). Let a sequence $\{X_i\}$ of independent random variables, each of which has finite mean and variance, satisfy $E(X_i) = 0$ and $\sum_{i=1}^{\infty} \mathrm{Var}(X_i) < \infty$. Then
$$S(\omega) = \sum_{i=1}^{\infty} X_i(\omega)$$
converges with probability 1.

Proof. By a direct application of Kolmogorov's inequality
$$\lim_{\substack{n\to\infty \\ m\to\infty}} P\Big\{\sup_{m < k \le n} |S_k - S_m| \ge \delta\Big\} \le \cdots$$

[…] $> 0$,

(i) $\sum_i P\{|X_i| > C\}$ converges.

(ii) If $Y_i$ is defined to equal $X_i$ if $|X_i| \le C$, and 0 otherwise, $\sum_i E(Y_i)$ converges.

(iii) With $Y_i$ as in (ii), $\sum_i \mathrm{Var}(Y_i)$ converges.

Proof. Let us now prove the converse. If $\sum_i X_i$ converges for a sequence of independent random variables, we must necessarily have $|X_n| \le C$ eventually with probability 1. By the Borel-Cantelli Lemma the first series must converge. This means that in order to prove the necessity we can assume without loss of generality that $|X_i|$ are all bounded, say by 1. We may also assume that $E(X_i) = 0$ for each $i$. Otherwise let us take independent random variables $X_i'$ that have the same distribution as $X_i$. Then $\sum_i X_i$ as well as $\sum_i X_i'$ converge with probability 1 and therefore so does $\sum_i (X_i - X_i')$. The random variables $Z_i = X_i - X_i'$ are independent and bounded by 2. They have mean 0. If we can show $\sum_i \mathrm{Var}(Z_i)$ is convergent, since $\mathrm{Var}(Z_i) = 2\mathrm{Var}(X_i)$ we would have proved the convergence of the third series. Now it is elementary to conclude that since both $\sum_i X_i$ as well as $\sum_i (X_i - E(X_i))$ converge, the series $\sum_i E(X_i)$ must be convergent as well. So all we need is the following lemma to complete the proof of necessity.

Lemma 3.13. If $\sum_i X_i$ is convergent for a series of independent random variables with mean 0 that are individually bounded by $C$, then $\sum_i \mathrm{Var}(X_i)$ is convergent.
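Proof. Let $F_n = \{\omega : |S_1| \le \ell, |S_2| \le \ell, \cdots, |S_n| \le \ell\}$ where $S_k = X_1 + \cdots + X_k$. If the series converges with probability 1, we must have, for some $\ell$ and $\delta > 0$, $P(F_n) \ge \delta$ for all $n$. We have
$$\begin{aligned}
\int_{F_{n-1}} S_n^2\, dP &= \int_{F_{n-1}} [S_{n-1} + X_n]^2\, dP \\
&= \int_{F_{n-1}} [S_{n-1}^2 + 2S_{n-1}X_n + X_n^2]\, dP \\
&= \int_{F_{n-1}} S_{n-1}^2\, dP + \sigma_n^2 P(F_{n-1}) \\
&\ge \int_{F_{n-1}} S_{n-1}^2\, dP + \delta\sigma_n^2
\end{aligned}$$
and on the other hand,
$$\begin{aligned}
\int_{F_{n-1}} S_n^2\, dP &= \int_{F_n} S_n^2\, dP + \int_{F_{n-1} \cap F_n^c} S_n^2\, dP \\
&\le \int_{F_n} S_n^2\, dP + P(F_{n-1} \cap F_n^c)\,(\ell + C)^2
\end{aligned}$$
providing us with the estimate
$$\delta\sigma_n^2 \le \int_{F_n} S_n^2\, dP - \int_{F_{n-1}} S_{n-1}^2\, dP + P(F_{n-1} \cap F_n^c)\,(\ell + C)^2.$$
Since $F_{n-1} \cap F_n^c$ are disjoint and $|S_n| \le \ell$ on $F_n$,
$$\sum_{j=1}^{\infty} \sigma_j^2 \le \frac{1}{\delta}\big[\ell^2 + (\ell + C)^2\big].$$
This concludes the proof.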
3.5 Strong Law of Large Numbers
We saw earlier in Theorem 3.6 that if $\{X_i\}$ is a sequence of i.i.d. (independent identically distributed) random variables with zero mean and a finite fourth moment then $\frac{X_1 + \cdots + X_n}{n} \to 0$ with probability 1. We will now prove the same result assuming only that $E|X_i| < \infty$ and $E(X_i) = 0$.

Theorem 3.14. If $\{X_i\}$ is a sequence of i.i.d. random variables with mean 0,
$$\lim_{n\to\infty} \frac{X_1 + \cdots + X_n}{n} = 0$$
with probability 1.

Proof. We define
$$Y_n = \begin{cases} X_n & \text{if } |X_n| \le n \\ 0 & \text{if } |X_n| > n \end{cases}$$
$a_n = P[X_n \ne Y_n]$, $b_n = E[Y_n]$ and $c_n = \mathrm{Var}(Y_n)$. First we note that (see Exercise 3.14 below)
$$\sum_n a_n = \sum_n P[\,|X_1| > n\,] \le E|X_1| < \infty,$$
$$\lim_{n\to\infty} b_n = 0$$
and
$$\sum_n \frac{c_n}{n^2} \le \sum_n \frac{E[Y_n^2]}{n^2} = \sum_n \int_{|x| \le n} \frac{x^2}{n^2}\, d\alpha = \int x^2 \Big(\sum_{n \ge |x|} \frac{1}{n^2}\Big)\, d\alpha \le C \int |x|\, d\alpha < \infty$$
where $\alpha$ is the common distribution of $X_i$. From the three series theorem and the Borel-Cantelli Lemma, we conclude that $\sum_n \frac{Y_n - b_n}{n}$ as well as $\sum_n \frac{X_n - b_n}{n}$ converge almost surely. It is elementary to verify that for any series $\sum_n \frac{x_n}{n}$ that converges, $\frac{x_1 + \cdots + x_n}{n} \to 0$ as $n \to \infty$. We therefore conclude that
$$P\Big[\lim_{n\to\infty} \Big(\frac{X_1 + \cdots + X_n}{n} - \frac{b_1 + \cdots + b_n}{n}\Big) = 0\Big] = 1.$$
Since $b_n \to 0$ as $n \to \infty$, the theorem is proved.

Exercise 3.14. Let $X$ be a nonnegative random variable. Then
$$E[X] - 1 \le \sum_{n=1}^{\infty} P[X \ge n] \le E[X].$$
In particular $E[X] < \infty$ if and only if $\sum_n P[X \ge n] < \infty$.
Exercise 3.15. If for a sequence of i.i.d. random variables $X_1, \cdots, X_n, \cdots$, the strong law of large numbers holds with some limit, i.e.
$$P\Big[\lim_{n\to\infty} \frac{S_n}{n} = \xi\Big] = 1$$
for some random variable $\xi$, which may or may not be a constant with probability 1, then show that necessarily $E|X_i| < \infty$. Consequently $\xi = E(X_i)$ with probability 1.

One may ask why the limit cannot be a proper random variable. There is a general theorem that forbids it, called Kolmogorov's Zero-One law. Let us look at the space $\Omega$ of real sequences $\{x_n : n \ge 1\}$. We have the $\sigma$-field $\mathcal{B}$, the product $\sigma$-field on $\Omega$. In addition we have the sub $\sigma$-fields $\mathcal{B}_n$ generated by $\{x_j : j \ge n\}$. The $\mathcal{B}_n$ decrease with $n$, and $\mathcal{B}_\infty = \cap_n \mathcal{B}_n$, which is also a $\sigma$-field, is called the tail $\sigma$-field. The typical set in $\mathcal{B}_\infty$ is a set depending only on the tail behavior of the sequence. For example the sets $\{\omega : x_n \text{ is bounded}\}$, $\{\omega : \limsup_n x_n = 1\}$ are in $\mathcal{B}_\infty$ whereas $\{\omega : \sup_n |x_n| = 1\}$ is not.
Theorem 3.15. (Kolmogorov's Zero-One Law). If $A \in \mathcal{B}_\infty$ and $P$ is any product measure (not necessarily with identical components) then $P(A) = 0$ or $1$.

Proof. The proof depends on showing that $A$ is independent of itself so that $P(A) = P(A \cap A) = P(A)P(A) = [P(A)]^2$ and therefore equals 0 or 1. The proof is elementary. Since $A \in \mathcal{B}_\infty \subset \mathcal{B}_{n+1}$ and $P$ is a product measure, $A$ is independent of $\mathcal{B}^n$, the $\sigma$-field generated by $\{x_j : 1 \le j \le n\}$. It is therefore independent of sets in the field $\mathcal{F} = \cup_n \mathcal{B}^n$. The class of sets that are independent of $A$ is a monotone class. Since it contains the field $\mathcal{F}$ it contains the $\sigma$-field $\mathcal{B}$ generated by $\mathcal{F}$. In particular since $A \in \mathcal{B}$, $A$ is independent of itself.

Corollary 3.16. Any random variable measurable with respect to the tail $\sigma$-field $\mathcal{B}_\infty$ is equal with probability 1 to a constant relative to any given product measure.

Proof. Left as an exercise.

Warning. For different product measures the constants can be different.

Exercise 3.16. How can that happen?
3.6 Central Limit Theorem.
We saw before that for any sequence of independent identically distributed random variables $X_1, \cdots, X_n, \cdots$ the sum $S_n = X_1 + \cdots + X_n$ has the property that
$$\lim_{n\to\infty} \frac{S_n}{n} = 0$$
in probability provided the expectation exists and equals 0. If we assume that the variance of the random variables is finite and equals $\sigma^2 > 0$, then we have

Theorem 3.17. The distribution of $\frac{S_n}{\sqrt{n}}$ converges as $n \to \infty$ to the normal distribution with density
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{x^2}{2\sigma^2}\Big]. \tag{3.11}$$
Proof. If we denote by $\phi(t)$ the characteristic function of any $X_i$ then the characteristic function of $\frac{S_n}{\sqrt{n}}$ is given by
$$\psi_n(t) = \Big[\phi\Big(\frac{t}{\sqrt{n}}\Big)\Big]^n.$$
We can use the expansion
$$\phi(t) = 1 - \frac{\sigma^2 t^2}{2} + o(t^2)$$
to conclude that
$$\phi\Big(\frac{t}{\sqrt{n}}\Big) = 1 - \frac{\sigma^2 t^2}{2n} + o\Big(\frac{1}{n}\Big)$$
and it then follows that
$$\lim_{n\to\infty} \psi_n(t) = \psi(t) = \exp\Big[-\frac{\sigma^2 t^2}{2}\Big].$$
Since $\psi(t)$ is the characteristic function of the normal distribution with density $p(x)$ given by equation (3.11), we are done.

Exercise 3.17. A more direct proof is possible in some special cases. For instance if each $X_i = \pm 1$ with probability $\frac{1}{2}$, $S_n$ can take the values $n - 2k$ with $0 \le k \le n$,
$$P[S_n = 2k - n] = \frac{1}{2^n} \binom{n}{k}$$
and
$$P\Big[a \le \frac{S_n}{\sqrt{n}} \le b\Big] = \frac{1}{2^n} \sum_{k\,:\, a\sqrt{n} \le 2k - n \le b\sqrt{n}} \binom{n}{k}.$$
Use Stirling's formula to prove directly that
$$\lim_{n\to\infty} P\Big[a \le \frac{S_n}{\sqrt{n}} \le b\Big] = \int_a^b \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx.$$
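The limit in Exercise 3.17 can be observed numerically before it is proved. The sketch below evaluates the binomial sum exactly with integer arithmetic and compares it with the normal integral written through the error function; the names `exact_prob` and `normal_prob` and the choice $n = 10000$ are illustrative.

```python
from math import comb, erf, sqrt


def exact_prob(n, a, b):
    """Exact P[a <= S_n / sqrt(n) <= b] for S_n a sum of n fair +-1 variables."""
    root = sqrt(n)
    total = sum(comb(n, k) for k in range(n + 1) if a * root <= 2 * k - n <= b * root)
    return total / 2**n


def normal_prob(a, b):
    """Integral of the standard normal density over [a, b], via the error function."""
    def cdf(x):
        return 0.5 * (1 + erf(x / sqrt(2)))
    return cdf(b) - cdf(a)


print(exact_prob(10000, -1, 1), normal_prob(-1, 1))
```

For finite $n$ the two numbers differ by a small lattice effect, since $S_n/\sqrt{n}$ only takes values spaced $2/\sqrt{n}$ apart.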
Actually for the proof of the central limit theorem we do not need the random variables $\{X_j\}$ to have identical distributions. Let us suppose that they all have zero means and that the variance of $X_j$ is $\sigma_j^2$. Define $s_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$. Assume $s_n^2 \to \infty$ as $n \to \infty$. Then $Y_n = \frac{S_n}{s_n}$ has zero mean and unit variance. It is not unreasonable to expect that
$$\lim_{n\to\infty} P[Y_n \le a] = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx$$
under certain mild conditions.

Theorem 3.18. (Lindeberg's theorem). If we denote by $\alpha_i$ the distribution of $X_i$, the condition (known as Lindeberg's condition)
$$\lim_{n\to\infty} \frac{1}{s_n^2} \sum_{i=1}^{n} \int_{|x| \ge \epsilon s_n} x^2\, d\alpha_i = 0$$
for each $\epsilon > 0$ is sufficient for the central limit theorem to hold.

Proof. The first step in proving this limit theorem as well as other limit theorems that we will prove is to rewrite
$$Y_n = X_{n,1} + X_{n,2} + \cdots + X_{n,k_n} + A_n$$
where $X_{n,j}$ are $k_n$ mutually independent random variables and $A_n$ is a constant. In our case $k_n = n$, $A_n = 0$, and $X_{n,j} = \frac{X_j}{s_n}$ for $1 \le j \le n$. We denote by
$$\phi_{n,j}(t) = E[e^{i t X_{n,j}}] = \int e^{i t x}\, d\alpha_{n,j} = \int e^{i t \frac{x}{s_n}}\, d\alpha_j = \phi_j\Big(\frac{t}{s_n}\Big)$$
where $\alpha_{n,j}$ is the distribution of $X_{n,j}$. The functions $\phi_j$ and $\phi_{n,j}$ are the characteristic functions of $\alpha_j$ and $\alpha_{n,j}$ respectively. If we denote by $\mu_n$ the distribution of $Y_n$, its characteristic function $\hat\mu_n(t)$ is given by
$$\hat\mu_n(t) = \prod_{j=1}^{n} \phi_{n,j}(t)$$
and our goal is to show that
$$\lim_{n\to\infty} \hat\mu_n(t) = \exp\Big[-\frac{t^2}{2}\Big].$$
This will be carried out in several steps. First, we define
$$\psi_{n,j}(t) = \exp[\phi_{n,j}(t) - 1]$$
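and
$$\psi_n(t) = \prod_{j=1}^{n} \psi_{n,j}(t).$$
We show that for each finite $T$,
$$\lim_{n\to\infty} \sup_{|t| \le T} \sup_{1 \le j \le n} |\phi_{n,j}(t) - 1| = 0$$
and
$$\sup_n \sup_{|t| \le T} \sum_{j=1}^{n} |\phi_{n,j}(t) - 1| < \infty.$$
This would imply that
$$\begin{aligned}
\lim_{n\to\infty} \sup_{|t| \le T} \big|\log \hat\mu_n(t) - \log \psi_n(t)\big|
&\le \lim_{n\to\infty} \sup_{|t| \le T} \sum_{j=1}^{n} \big|\log \phi_{n,j}(t) - [\phi_{n,j}(t) - 1]\big| \\
&\le \lim_{n\to\infty} \sup_{|t| \le T} C \sum_{j=1}^{n} |\phi_{n,j}(t) - 1|^2 \\
&\le C \lim_{n\to\infty} \Big\{\sup_{|t| \le T} \sup_{1 \le j \le n} |\phi_{n,j}(t) - 1|\Big\} \Big\{\sup_{|t| \le T} \sum_{j=1}^{n} |\phi_{n,j}(t) - 1|\Big\} \\
&= 0
\end{aligned}$$
by the expansion
$$\log r = \log(1 + (r - 1)) = r - 1 + O\big((r - 1)^2\big).$$
The proof can then be completed by showing
$$\lim_{n\to\infty} \sup_{|t| \le T} \Big|\log \psi_n(t) + \frac{t^2}{2}\Big| = \lim_{n\to\infty} \sup_{|t| \le T} \Big|\sum_{j=1}^{n} (\phi_{n,j}(t) - 1) + \frac{t^2}{2}\Big| = 0.$$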
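We see that
$$\begin{aligned}
\sup_{|t| \le T} \big|\phi_{n,j}(t) - 1\big| &= \sup_{|t| \le T} \Big|\int \big(\exp[\,i t x\,] - 1\big)\, d\alpha_{n,j}\Big| \\
&= \sup_{|t| \le T} \Big|\int \Big(\exp\Big[\,i t \frac{x}{s_n}\Big] - 1\Big)\, d\alpha_j\Big| \\
&= \sup_{|t| \le T} \Big|\int \Big(\exp\Big[\,i t \frac{x}{s_n}\Big] - 1 - i t \frac{x}{s_n}\Big)\, d\alpha_j\Big| \\
&\le C_T \int \frac{x^2}{s_n^2}\, d\alpha_j \\
&= C_T \int_{|x| < \epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j + C_T \int_{|x| \ge \epsilon s_n} \frac{x^2}{s_n^2}\, d\alpha_j
\end{aligned}$$

[…] $\epsilon > 0$ is arbitrary, we have
$$\lim_{n\to\infty} \sup_{1 \le j \le k_n} \sup_{|t| \le T} \big|\phi_{n,j}(t) - 1\big| = 0.$$
Next we observe that there is a bound,
$$\sum_{j=1}^{n} \sup_{|t| \le T} \big|\phi_{n,j}(t) - 1\big| \le C_T \sum_{j=1}^{n} \int \frac{x^2}{s_n^2}\, d\alpha_j \le C_T \frac{1}{s_n^2} \sum_{j=1}^{n} \sigma_j^2 = C_T$$
uniformly in $n$. Finally for each $\epsilon > 0$,
$$\begin{aligned}
\lim_{n\to\infty} \sup_{|t| \le T} \Big|\sum_{j=1}^{n} (\phi_{n,j}(t) - 1) + \frac{t^2}{2}\Big|
&= \lim_{n\to\infty} \sup_{|t| \le T} \Big|\sum_{j=1}^{n} \Big[\phi_{n,j}(t) - 1 + \frac{\sigma_j^2 t^2}{2 s_n^2}\Big]\Big| \\
&= \lim_{n\to\infty} \sup_{|t| \le T} \Big|\sum_{j=1}^{n} \int \Big(\exp\Big[\,i t \frac{x}{s_n}\Big] - 1 - i t \frac{x}{s_n} + \frac{t^2 x^2}{2 s_n^2}\Big)\, d\alpha_j\Big| \\
&\le \lim_{n\to\infty} \sup_{|t| \le T} \sum_{j=1}^{n} \Big|\int_{|x| < \epsilon s_n} \Big(\exp\Big[\,i t \frac{x}{s_n}\Big] - 1 - i t \frac{x}{s_n} + \frac{t^2 x^2}{2 s_n^2}\Big)\, d\alpha_j\Big| + \cdots
\end{aligned}$$

[…] $> 0$
$$\lim_{n\to\infty} \frac{1}{s_n^{2+\delta}} \sum_{j=1}^{n} \int |x|^{2+\delta}\, d\alpha_j = 0.$$
Prove that Lyapunov's condition implies Lindeberg's condition.

Exercise 3.19. Consider the case of mutually independent random variables $\{X_j\}$, where $X_j = \pm a_j$ with probability $\frac{1}{2}$. What do Lyapunov's and Lindeberg's conditions demand of $\{a_j\}$? Can you find a sequence $\{a_j\}$ that does not satisfy Lyapunov's condition for any $\delta > 0$ but satisfies Lindeberg's condition? Try to find a sequence $\{a_j\}$ such that the central limit theorem is not valid.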
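For the symmetric two-point variables of Exercise 3.19, the quantity in Lindeberg's condition reduces to a finite sum, because $\int_{|x| \ge \epsilon s_n} x^2\, d\alpha_j$ equals $a_j^2$ when $|a_j| \ge \epsilon s_n$ and 0 otherwise. The sketch below computes it for a given finite list of amplitudes; the function name `lindeberg_ratio` and the two test sequences are illustrative assumptions, and the exercise itself still requires an argument for general sequences.

```python
from math import sqrt


def lindeberg_ratio(a, eps):
    """(1/s_n^2) * sum of a_j^2 over those j with |a_j| >= eps * s_n, for X_j = +-a_j."""
    s2 = sum(x * x for x in a)          # s_n^2 = a_1^2 + ... + a_n^2
    threshold = eps * sqrt(s2)
    return sum(x * x for x in a if abs(x) >= threshold) / s2


# Equal amplitudes: no single term is comparable to s_n, so the ratio vanishes.
print(lindeberg_ratio([1.0] * 1000, 0.1))

# Geometrically growing amplitudes: the last term carries most of the variance.
print(lindeberg_ratio([2.0**j for j in range(1, 21)], 0.5))
```

Evaluating the ratio along a growing prefix of a sequence gives a quick indication of whether the condition can hold, but not a proof.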
3.7 Accompanying Laws.
As we stated in the previous section, we want to study the behavior of the sum of a large number of independent random variables. We have $k_n$ independent random variables $\{X_{n,j} : 1 \le j \le k_n\}$ with respective distributions $\{\alpha_{n,j}\}$. We are interested in the distribution $\mu_n$ of $Z_n = \sum_{j=1}^{k_n} X_{n,j}$. One important assumption that we will make on the random variables $\{X_{n,j}\}$ is that no single one is significant. More precisely for every $\delta > 0$,
$$\lim_{n\to\infty} \sup_{1 \le j \le k_n} P[\,|X_{n,j}| \ge \delta\,] = \lim_{n\to\infty} \sup_{1 \le j \le k_n} \alpha_{n,j}[\,|x| \ge \delta\,] = 0. \tag{3.15}$$
The condition is referred to as uniform infinitesimality. The following construction will play a major role. If $\alpha$ is a probability distribution on the line and $\phi(t)$ is its characteristic function, for any real number $a > 0$, $\psi_a(t) = \exp[a(\phi(t) - 1)]$ is again a characteristic function. In fact,
if we denote by $\alpha^k$ the $k$-fold convolution of $\alpha$ with itself, $\psi_a$ is seen to be the characteristic function of the probability distribution
$$e^{-a} \sum_{j=0}^{\infty} \frac{a^j}{j!}\, \alpha^j$$
which is a convex combination of the $\alpha^j$ with weights $e^{-a}\frac{a^j}{j!}$. We use the construction mostly with $a = 1$. If we denote the probability distribution with characteristic function $\psi_a(t)$ by $e_a(\alpha)$ one checks easily that $e_{a+b}(\alpha) = e_a(\alpha) * e_b(\alpha)$. In particular $e_a(\alpha) = e_{\frac{a}{n}}(\alpha)^n$. Probability distributions $\beta$ that can be written for each $n \ge 1$ as the $n$-fold convolution $\beta_n^n$ of some probability distribution $\beta_n$ are called infinitely divisible. In particular for every $a \ge 0$ and $\alpha$, $e_a(\alpha)$ is an infinitely divisible probability distribution. These are called compound Poisson distributions. In the special case $\alpha = \delta_1$, the degenerate distribution at 1, we get for $e_a(\delta_1)$ the usual Poisson distribution with parameter $a$. We can interpret $e_a(\alpha)$ as the distribution of the sum of a random number of independent random variables with common distribution $\alpha$. The random number $n$ of summands has a distribution which is Poisson with parameter $a$ and is independent of the random variables involved in the sum.

In order to study the distribution $\mu_n$ of $Z_n$ it will be more convenient to replace $\alpha_{n,j}$ by an infinitely divisible distribution $\beta_{n,j}$. This is done as follows. We define
$$a_{n,j} = \int_{|x| \le 1} x\, d\alpha_{n,j},$$
$\alpha'_{n,j}$ as the translate of $\alpha_{n,j}$ by $-a_{n,j}$, i.e.
$$\alpha'_{n,j} = \alpha_{n,j} * \delta_{-a_{n,j}},$$
$$\beta'_{n,j} = e_1(\alpha'_{n,j}), \qquad \beta_{n,j} = \beta'_{n,j} * \delta_{a_{n,j}}$$
and finally
$$\lambda_n = \prod_{j=1}^{k_n} \beta_{n,j}$$
where the product is a convolution product.
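The compound Poisson construction $e_a(\alpha)$ described above can be sketched concretely when $\alpha$ lives on the nonnegative integers, so that a distribution is a list of masses and convolution is a finite sum. The names `convolve` and `compound_poisson`, the truncation parameters `size` and `terms`, and the example $\alpha$ are assumptions made for illustration only.

```python
from math import exp, factorial


def convolve(p, q, size):
    """First `size` masses of the convolution of two distributions on {0, 1, 2, ...}."""
    out = [0.0] * size
    for i, pi in enumerate(p):
        if pi == 0.0:
            continue
        for j, qj in enumerate(q):
            if i + j < size:
                out[i + j] += pi * qj
    return out


def compound_poisson(alpha, a, size=30, terms=80):
    """First `size` masses of e_a(alpha) = exp(-a) * sum_j (a^j / j!) alpha^j.

    The series is cut off after `terms` summands, which is harmless for moderate a.
    """
    result = [0.0] * size
    power = [1.0] + [0.0] * (size - 1)   # alpha^0 is the point mass at 0
    weight = exp(-a)                      # exp(-a) a^j / j! for j = 0
    for j in range(terms):
        for k in range(size):
            result[k] += weight * power[k]
        power = convolve(power, alpha, size)
        weight *= a / (j + 1)
    return result


poisson = compound_poisson([0.0, 1.0], 2.0)   # alpha = point mass at 1 gives Poisson(2)
print(poisson[:4])

alpha = [0.2, 0.5, 0.3]                        # hypothetical distribution on {0, 1, 2}
lhs = compound_poisson(alpha, 1.5)
rhs = convolve(compound_poisson(alpha, 1.0), compound_poisson(alpha, 0.5), 30)
print(max(abs(x - y) for x, y in zip(lhs, rhs)))   # semigroup property, near 0
```

Because all masses sit on nonnegative integers, truncating each list to its first `size` entries does not affect those entries of any convolution.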
A main tool in this subject is the following theorem. We assume always that the uniform infinitesimality condition (3.15) holds. In terms of notation, we will find it more convenient to denote by $\hat\mu$ the characteristic function of the probability distribution $\mu$.
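Theorem 3.19. (Accompanying Laws.) In order that, for some constants $A_n$, the distribution $\mu_n * \delta_{A_n}$ of $Z_n + A_n$ may converge to the limit $\mu$ it is necessary and sufficient that, for the same constants $A_n$, the distribution $\lambda_n * \delta_{A_n}$ converges to the same limit $\mu$.

Proof. First we note that, for any $\delta > 0$,
$$\begin{aligned}
\limsup_{n\to\infty} \sup_{1 \le j \le k_n} |a_{n,j}| &= \limsup_{n\to\infty} \sup_{1 \le j \le k_n} \Big|\int_{|x| \le 1} x\, d\alpha_{n,j}\Big| \\
&\le \limsup_{n\to\infty} \sup_{1 \le j \le k_n} \Big|\int_{|x| \le \delta} x\, d\alpha_{n,j}\Big| + \limsup_{n\to\infty} \sup_{1 \le j \le k_n} \Big|\int_{\delta < |x| \le 1} x\, d\alpha_{n,j}\Big|
\end{aligned}$$

[…] $> 1\,]$

and estimate $|a'_{n,j}|$ by
$$|a'_{n,j}| \le C\alpha_{n,j}\Big[\,|x| \ge \frac{3}{4}\,\Big] \le C\alpha'_{n,j}\Big[\,|x| \ge \frac{1}{2}\,\Big].$$
In other words we may assume without loss of generality that $\alpha_{n,j}$ satisfy the bound
$$|a_{n,j}| \le C\alpha_{n,j}\Big[\,|x| \ge \frac{1}{2}\,\Big] \tag{3.16}$$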
and forget all about the change from $\alpha_{n,j}$ to $\alpha'_{n,j}$. We will drop the primes and stay with just $\alpha_{n,j}$. Then, just as in the proof of the Lindeberg theorem, we proceed to estimate
$$\begin{aligned}
\lim_{n\to\infty} \sup_{|t| \le T} \big|\log \hat\lambda_n(t) - \log \hat\mu_n(t)\big|
&\le \lim_{n\to\infty} \sup_{|t| \le T} \Big|\sum_{j=1}^{k_n} \big[\log \hat\alpha_{n,j}(t) - (\hat\alpha_{n,j}(t) - 1)\big]\Big| \\
&\le \lim_{n\to\infty} \sup_{|t| \le T} \sum_{j=1}^{k_n} \big|\log \hat\alpha_{n,j}(t) - (\hat\alpha_{n,j}(t) - 1)\big| \\
&\le \lim_{n\to\infty} \sup_{|t| \le T} C \sum_{j=1}^{k_n} |\hat\alpha_{n,j}(t) - 1|^2 \\
&= 0
\end{aligned}$$
provided we prove that if either $\lambda_n$ or $\mu_n$ has a limit after translation by some constants $A_n$, then
$$\sup_n \sup_{|t| \le T} \sum_{j=1}^{k_n} \big|\hat\alpha_{n,j}(t) - 1\big| \le C < \infty. \tag{3.17}$$
Let us first suppose that $\lambda_n$ has a weak limit as $n \to \infty$ after translation by $A_n$. The characteristic functions
$$\exp\Big[\sum_{j=1}^{k_n} (\hat\alpha_{n,j}(t) - 1) + i t A_n\Big] = \exp[f_n(t)]$$
have a limit, which is again a characteristic function. Since the limiting characteristic function is continuous and equals 1 at $t = 0$, and the convergence is uniform near 0, on some small interval $|t| \le T_0$ we have the bound
$$\sup_n \sup_{|t| \le T_0} \big[1 - \mathrm{Re}\, f_n(t)\big] \le C$$
or equivalently
$$\sup_n \sup_{|t| \le T_0} \sum_{j=1}^{k_n} \int (1 - \cos t x)\, d\alpha_{n,j} \le C$$
and from the subadditivity property $(1 - \cos 2tx) \le 4(1 - \cos tx)$ this bound extends to an arbitrary interval $|t| \le T$,
$$\sup_n \sup_{|t| \le T} \sum_{j=1}^{k_n} \int (1 - \cos t x)\, d\alpha_{n,j} \le C_T.$$
If we integrate the inequality with respect to $t$ over the interval $[-T, T]$ and divide by $2T$, we get
$$\sup_n \sum_{j=1}^{k_n} \int \Big(1 - \frac{\sin T x}{T x}\Big)\, d\alpha_{n,j} \le C_T$$
from which we can conclude that
$$\sup_n \sum_{j=1}^{k_n} \alpha_{n,j}[\,|x| \ge \delta\,] \le C_\delta < \infty$$
for every $\delta > 0$ by choosing $T = \frac{2}{\delta}$. Moreover using the inequality $(1 - \cos x) \ge c\,x^2$ valid near 0 for a suitable choice of $c$ we get the estimate
$$\sup_n \sum_{j=1}^{k_n} \int_{|x| \le 1} x^2\, d\alpha_{n,j} \le C < \infty.$$
Now it is straightforward to estimate, for $t \in [-T, T]$,
\begin{align*}
|\hat\alpha_{n,j}(t) - 1| &= \Big| \int [\exp(i t x) - 1]\, d\alpha_{n,j} \Big| \\
&= \Big| \int_{|x| \le 1} [\exp(i t x) - 1]\, d\alpha_{n,j} + \int_{|x| > 1} [\exp(i t x) - 1]\, d\alpha_{n,j} \Big| \\
&\le \Big| \int_{|x| \le 1} [\exp(i t x) - 1 - i t x]\, d\alpha_{n,j} \Big| + \Big| \int_{|x| > 1} [\exp(i t x) - 1]\, d\alpha_{n,j} \Big| + T |a_{n,j}| \\
&\le C_1 \int_{|x| \le 1} x^2\, d\alpha_{n,j} + C_2\, \alpha_{n,j}[x : |x| \ge \tfrac12]
\end{align*}
which proves the bound of equation (3.17).

Now we need to establish the same bound under the assumption that $\mu_n$ has a limit after suitable translations. For any probability measure $\alpha$ we define $\bar\alpha$ by $\bar\alpha(A) = \alpha(-A)$ for all Borel sets $A$. The distribution $\alpha * \bar\alpha$ is denoted by $|\alpha|^2$. The characteristic functions of $\bar\alpha$ and $|\alpha|^2$ are respectively $\overline{\hat\alpha(t)}$ and $|\hat\alpha(t)|^2$ where $\hat\alpha(t)$ is the characteristic function of $\alpha$. An elementary but important fact is $|\alpha * \delta_A|^2 = |\alpha|^2$ for any translation by $A$. If $\mu_n$ has a limit so does $|\mu_n|^2$. We conclude that the limit
\[ \lim_{n \to \infty} |\hat\mu_n(t)|^2 = \lim_{n \to \infty} \prod_{j=1}^{k_n} |\hat\alpha_{n,j}(t)|^2 \]
exists and defines a characteristic function which is continuous at 0 with a value of 1. Moreover because of uniform infinitesimality,
\[ \lim_{n \to \infty} \inf_{|t| \le T} |\hat\alpha_{n,j}(t)| = 1. \]
It is easy to conclude that there is a $T_0 > 0$ such that
\[ \sup_n \sup_{|t| \le T_0} \sum_{j=1}^{k_n} [1 - |\hat\alpha_{n,j}(t)|^2] \le C_0 < \infty \]
and by subadditivity for any finite $T$,
\[ \sup_n \sup_{|t| \le T} \sum_{j=1}^{k_n} [1 - |\hat\alpha_{n,j}(t)|^2] \le C_T < \infty \]
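providing us with the estimates
\[ \sup_n \sum_{j=1}^{k_n} |\alpha_{n,j}|^2 [\, |x| \ge \delta \,] \le C_\delta < \infty \tag{3.18} \]
for any $\delta > 0$, and
\[ \sup_n \sum_{j=1}^{k_n} \iint_{|x - y| \le 2} (x - y)^2\, d\alpha_{n,j}(x)\, d\alpha_{n,j}(y) \le C < \infty. \tag{3.19} \]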
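We now show that estimates (3.18) and (3.19) imply (3.17).
\begin{align*}
|\alpha_{n,j}|^2 [\, x : |x| \ge \tfrac{\delta}{2} \,] &\ge \int_{|y| \le \frac{\delta}{2}} \alpha_{n,j}[\, x : |x - y| \ge \tfrac{\delta}{2} \,]\, d\alpha_{n,j}(y) \\
&\ge \alpha_{n,j}[\, x : |x| \ge \delta \,]\; \alpha_{n,j}[\, x : |x| \le \tfrac{\delta}{2} \,] \\
&\ge \tfrac12\, \alpha_{n,j}[\, x : |x| \ge \delta \,]
\end{align*}
by uniform infinitesimality. Therefore (3.18) implies that for every $\delta > 0$,
\[ \sup_n \sum_{j=1}^{k_n} \alpha_{n,j}[\, x : |x| \ge \delta \,] \le C_\delta < \infty. \]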
We now turn to exploiting (3.19). We start with the inequality
\[ \iint_{|x - y| \le 2} (x - y)^2\, d\alpha_{n,j}(x)\, d\alpha_{n,j}(y) \ge \alpha_{n,j}[\, y : |y| \le 1 \,] \inf_{|y| \le 1} \int_{|x| \le 1} (x - y)^2\, d\alpha_{n,j}(x). \tag{3.20} \]
The first term on the right can be assumed to be at least $\frac12$ by uniform infinitesimality. For the second term, when $|y| \le 1$,
\begin{align*}
\int_{|x| \le 1} (x - y)^2\, d\alpha_{n,j}(x) &\ge \int_{|x| \le 1} x^2\, d\alpha_{n,j}(x) - 2 y \int_{|x| \le 1} x\, d\alpha_{n,j}(x) \\
&\ge \int_{|x| \le 1} x^2\, d\alpha_{n,j}(x) - 2 \Big| \int_{|x| \le 1} x\, d\alpha_{n,j}(x) \Big| \\
&\ge \int_{|x| \le 1} x^2\, d\alpha_{n,j}(x) - C \alpha_{n,j}[\, x : |x| \ge \tfrac12 \,].
\end{align*}
The last step is a consequence of estimate (3.16) that we showed we could always assume:
\[ \Big| \int_{|x| \le 1} x\, d\alpha_{n,j}(x) \Big| \le C \alpha_{n,j}[\, x : |x| \ge \tfrac12 \,]. \]
Because of estimate (3.20) we can now assert
\[ \sup_n \sum_{j=1}^{k_n} \int_{|x| \le 1} x^2\, d\alpha_{n,j} \le C < \infty. \tag{3.21} \]
One can now derive (3.17) from (3.20) and (3.21) as in the earlier part.

Exercise 3.20. Let $k_n = n^2$ and $\alpha_{n,j} = \delta_{\frac1n}$ for $1 \le j \le n^2$. Then $\mu_n = \delta_n$; show that without centering $\lambda_n * \delta_{-n}$ converges to a different limit.
3.8
Infinitely Divisible Distributions.
In the study of limit theorems for sums of independent random variables infinitely divisible distributions play a very important role.

Definition 3.5. A distribution $\mu$ is said to be infinitely divisible if for every positive integer $n$, $\mu$ can be written as the $n$-fold convolution $(\lambda_n *)^n$ of some other probability distribution $\lambda_n$.

Exercise 3.21. Show that the normal distribution with density
\[ p(x) = \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big] \]
is infinitely divisible.

Exercise 3.22. Show that for any $\lambda \ge 0$, the Poisson distribution with parameter $\lambda$
\[ p_\lambda(n) = \frac{e^{-\lambda} \lambda^n}{n!} \quad \text{for } n \ge 0 \]
is infinitely divisible.

Exercise 3.23. Show that any probability distribution supported on a finite set $\{x_1, \ldots, x_k\}$ with $\mu[\{x_j\}] = p_j$ and $p_j \ge 0$, $\sum_{j=1}^{k} p_j = 1$ is infinitely divisible if and only if it is degenerate, i.e. $\mu[\{x_j\}] = 1$ for some $j$.

Exercise 3.24. Show that for any nonnegative finite measure $\alpha$ with total mass $a$, the distribution
\[ e(F) = e^{-a} \sum_{j=0}^{\infty} \frac{(\alpha *)^j}{j!} \]
with characteristic function
\[ \widehat{e(F)}(t) = \exp\Big[\int (e^{itx} - 1)\, d\alpha\Big] \]
is an infinitely divisible distribution.

Exercise 3.25. Show that the convolution of any two infinitely divisible distributions is again infinitely divisible. In particular if $\mu$ is infinitely divisible so is any translate $\mu * \delta_a$ for any real $a$.

We saw in the last section that the asymptotic behavior of $\mu_n * \delta_{A_n}$ can be investigated by means of the asymptotic behavior of $\lambda_n * \delta_{A_n}$, and the characteristic function $\hat\lambda_n$ of $\lambda_n$ has a very special form
\begin{align*}
\hat\lambda_n(t) &= \prod_{j=1}^{k_n} \exp[\, \hat\beta_{n,j}(t) - 1 + i t a_{n,j} \,] \\
&= \exp\Big[ \sum_{j=1}^{k_n} \int [\, e^{i t x} - 1 \,]\, d\beta_{n,j} + i t \sum_{j=1}^{k_n} a_{n,j} \Big] \\
&= \exp\Big[ \int [\, e^{i t x} - 1 \,]\, dM_n + i t a_n \Big] \\
&= \exp\Big[ \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_n + i t \Big[ \int \theta(x)\, dM_n + a_n \Big] \Big] \\
&= \exp\Big[ \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_n + i t b_n \Big]. \tag{3.22}
\end{align*}
We can make any reasonable choice for $\theta(\cdot)$ and we will need it to be a bounded continuous function with $|\theta(x) - x| \le C |x|^3$ near 0. Possible choices are $\theta(x) = \frac{x}{1 + x^2}$, or $\theta(x) = x$ for $|x| \le 1$ and $\mathrm{sign}(x)$ for $|x| \ge 1$. We now investigate when such things will have a weak limit. Convoluting with $\delta_{A_n}$ only changes $b_n$ to $b_n + A_n$.
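The requirement on $\theta$ is easy to check for the two choices just mentioned. The following sketch is only an illustration; the function names and the grids are ours. It computes the largest value of $|\theta(x) - x| / |x|^3$ on a grid in $[-1, 1]$, which should be at most 1 for the rational choice (since $x - \frac{x}{1+x^2} = \frac{x^3}{1+x^2}$) and exactly 0 for the truncation.

```python
import math

def theta_rational(x):
    """theta(x) = x / (1 + x^2), a bounded smooth choice."""
    return x / (1.0 + x * x)

def theta_truncated(x):
    """theta(x) = x for |x| <= 1 and sign(x) otherwise."""
    return max(-1.0, min(1.0, x))

# Grid in [-1, 1] without the origin, where the ratio below is undefined.
grid = [k / 1000.0 for k in range(-1000, 1001) if k != 0]
ratio_rational = max(abs(theta_rational(x) - x) / abs(x) ** 3 for x in grid)
ratio_truncated = max(abs(theta_truncated(x) - x) / abs(x) ** 3 for x in grid)
```

Either choice therefore satisfies $|\theta(x) - x| \le C|x|^3$ near 0 with $C = 1$.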
First we note that
\[ \hat\mu(t) = \exp\Big[ \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM + i t a \Big] \]
is a characteristic function for any measure $M$ with finite total mass. In fact it is the characteristic function of an infinitely divisible probability distribution. It is not necessary that $M$ be a finite measure for $\mu$ to make sense. $M$ could be infinite, but in such a way that it is finite on $\{x : |x| \ge \delta\}$ for every $\delta > 0$, and near 0 it integrates $x^2$, i.e.,
\[ M[x : |x| \ge \delta] < \infty \quad \text{for all } \delta > 0, \tag{3.23} \]
\[ \int_{|x| \le 1} x^2\, dM < \infty. \tag{3.24} \]
To see this we remark that
\[ \hat\mu_\delta(t) = \exp\Big[ \int_{|x| \ge \delta} [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM + i t a \Big] \]
is a characteristic function for each $\delta > 0$ and because $|e^{i t x} - 1 - i t x| \le C_T x^2$ for $|t| \le T$, $\hat\mu_\delta(t) \to \hat\mu(t)$ uniformly on bounded intervals, where $\hat\mu(t)$ is given by the integral
\[ \hat\mu(t) = \exp\Big[ \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM + i t a \Big] \]
which converges absolutely and defines a characteristic function. Let us call measures that satisfy (3.23) and (3.24), conditions that can be expressed together in the form
\[ \int \frac{x^2}{1 + x^2}\, dM < \infty, \tag{3.25} \]
admissible Lévy measures. Since the same argument applies to $\frac{M}{n}$ and $\frac{a}{n}$ instead of $M$ and $a$, for any admissible Lévy measure $M$ and real number $a$, $\hat\mu(t)$ is in fact an infinitely divisible characteristic function. As the normal distribution is also an infinitely divisible probability distribution, we arrive at the following

Theorem 3.20. For every admissible Lévy measure $M$, $\sigma^2 \ge 0$ and real $a$
\[ \hat\mu(t) = \exp\Big[ \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM + i t a - \frac{\sigma^2 t^2}{2} \Big] \]
is the characteristic function of an infinitely divisible distribution $\mu$.

We will denote this distribution $\mu$ by $\mu = e(M, \sigma^2, a)$. The main theorem of this section is

Theorem 3.21. In order that $\mu_n = e(M_n, \sigma_n^2, a_n)$ may converge to a limit $\mu$ it is necessary and sufficient that $\mu = e(M, \sigma^2, a)$ and the following three conditions (3.26), (3.27) and (3.28) are satisfied. For every bounded continuous function $f$ that vanishes in some neighborhood of 0,
\[ \lim_{n \to \infty} \int f(x)\, dM_n = \int f(x)\, dM. \tag{3.26} \]
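For some (and therefore for every) $\ell > 0$ such that $\pm\ell$ are continuity points for $M$, i.e., $M\{\pm\ell\} = 0$,
\[ \lim_{n \to \infty} \Big[ \sigma_n^2 + \int_{-\ell}^{\ell} x^2\, dM_n \Big] = \Big[ \sigma^2 + \int_{-\ell}^{\ell} x^2\, dM \Big]. \tag{3.27} \]
\[ a_n \to a \quad \text{as } n \to \infty. \tag{3.28} \]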
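Proof. Let us prove the sufficiency first. Condition (3.26) implies that for every $\ell$ such that $\pm\ell$ are continuity points of $M$
\[ \lim_{n \to \infty} \int_{|x| \ge \ell} [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_n = \int_{|x| \ge \ell} [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM \]
and because of condition (3.27), it is enough to show that
\[ \lim_{\ell \to 0} \limsup_{n \to \infty} \Big| \int_{-\ell}^{\ell} \Big[ e^{i t x} - 1 - i t \theta(x) + \frac{t^2 x^2}{2} \Big] dM_n - \int_{-\ell}^{\ell} \Big[ e^{i t x} - 1 - i t \theta(x) + \frac{t^2 x^2}{2} \Big] dM \Big| = 0 \]
in order to conclude that
\[ \lim_{n \to \infty} \Big[ -\frac{\sigma_n^2 t^2}{2} + \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_n \Big] = \Big[ -\frac{\sigma^2 t^2}{2} + \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM \Big]. \]
This follows from the estimates
\[ \Big| e^{i t x} - 1 - i t \theta(x) + \frac{t^2 x^2}{2} \Big| \le C_T |x|^3 \]
and
\[ \int_{-\ell}^{\ell} |x|^3\, dM_n \le \ell \int_{-\ell}^{\ell} |x|^2\, dM_n. \]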
Condition (3.28) takes care of the terms involving $a_n$. We now turn to proving the necessity. If $\mu_n$ has a weak limit $\mu$ then the absolute values of the characteristic functions $|\hat\mu_n(t)|$ are all uniformly close to 1 near 0. Since
\[ |\hat\mu_n(t)| = \exp\Big[ -\int (1 - \cos t x)\, dM_n - \frac{\sigma_n^2 t^2}{2} \Big] \]
taking logarithms we conclude that
\[ \lim_{t \to 0} \sup_n \Big[ \frac{\sigma_n^2 t^2}{2} + \int (1 - \cos t x)\, dM_n \Big] = 0. \]
This implies (3.29), (3.30) and (3.31) below. For each $\ell > 0$,
\[ \sup_n M_n\{x : |x| \ge \ell\} < \infty \tag{3.29} \]
\[ \lim_{A \to \infty} \sup_n M_n\{x : |x| \ge A\} = 0. \tag{3.30} \]
For every $0 \le \ell < \infty$,
\[ \sup_n \Big[ \sigma_n^2 + \int_{-\ell}^{\ell} |x|^2\, dM_n \Big] < \infty. \tag{3.31} \]
We can choose a subsequence of $M_n$ (which we will denote by $M_n$ as well) that 'converges' in the sense that it satisfies conditions (3.26) and (3.27) of the Theorem. Then $e(M_n, \sigma_n^2, 0)$ converges weakly to $e(M, \sigma^2, 0)$. It is not hard to see that for any sequence of probability distributions $\alpha_n$, if both $\alpha_n$ and $\alpha_n * \delta_{a_n}$ converge to limits $\alpha$ and $\beta$ respectively, then necessarily $\beta = \alpha * \delta_a$ for some $a$ and $a_n \to a$ as $n \to \infty$. In order to complete the proof of necessity we need only establish the uniqueness of the representation, which is done in the next lemma.

Lemma 3.22. (Uniqueness). Suppose $\mu = e(M_1, \sigma_1^2, a_1) = e(M_2, \sigma_2^2, a_2)$. Then $M_1 = M_2$, $\sigma_1^2 = \sigma_2^2$ and $a_1 = a_2$.

Proof. Since $\hat\mu(t)$ never vanishes, by taking logarithms we have
\begin{align*}
\psi(t) &= -\frac{\sigma_1^2 t^2}{2} + \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_1 + i t a_1 \\
&= -\frac{\sigma_2^2 t^2}{2} + \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_2 + i t a_2. \tag{3.32}
\end{align*}
We can verify that for any admissible Lévy measure $M$
\[ \lim_{t \to \infty} \frac{1}{t^2} \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM = 0. \]
Consequently
\[ \lim_{t \to \infty} \frac{\psi(t)}{t^2} = -\frac{\sigma_1^2}{2} = -\frac{\sigma_2^2}{2} \]
leaving us with
\[ \psi(t) = \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_1 + i t a_1 = \int [\, e^{i t x} - 1 - i t \theta(x) \,]\, dM_2 + i t a_2 \]
for a different $\psi$. If we calculate
\[ H(s, t) = \frac{\psi(t + s) + \psi(t - s)}{2} - \psi(t) \]
we get
\[ \int e^{i t x} (1 - \cos s x)\, dM_1 = \int e^{i t x} (1 - \cos s x)\, dM_2 \]
for all $t$ and $s$. Since we can and do assume that $M\{0\} = 0$ for any admissible Lévy measure $M$ we have $M_1 = M_2$. If we know that $\sigma_1^2 = \sigma_2^2$ and $M_1 = M_2$ it is easy to see that $a_1$ must equal $a_2$.

Finally

Corollary 3.23. (Lévy-Khintchine representation) Any infinitely divisible distribution admits a representation $\mu = e(M, \sigma^2, a)$ for some admissible Lévy measure $M$, $\sigma^2 \ge 0$ and real number $a$.

Proof. We can write $\mu = \mu_n^{*n} = \mu_n * \mu_n * \cdots * \mu_n$ with $n$ terms. If we show that $\mu_n \Rightarrow \delta_0$ then the sequence is uniformly infinitesimal and by the earlier theorem on accompanying laws $\mu$ will be the limit of some $\lambda_n = e(M_n, 0, a_n)$ and therefore has to be of the form $e(M, \sigma^2, a)$ for some choice of admissible Lévy measure $M$, $\sigma^2 \ge 0$ and real $a$. In a neighborhood around 0, $\hat\mu(t)$ is close to 1 and it is easy to check that
\[ \hat\mu_n(t) = [\hat\mu(t)]^{\frac1n} \to 1 \]
as $n \to \infty$ in that neighborhood. As we saw before this implies that $\mu_n \Rightarrow \delta_0$.
Applications.

1. Convergence to the Poisson Distribution. Let $\{X_{n,j} : 1 \le j \le k_n\}$ be $k_n$ independent random variables taking the values 0 or 1 with probabilities $1 - p_{n,j}$ and $p_{n,j}$ respectively. We assume that
\[ \lim_{n \to \infty} \sup_{1 \le j \le k_n} p_{n,j} = 0 \]
which is the uniform infinitesimality condition. We are interested in the limiting distribution of $S_n = \sum_{j=1}^{k_n} X_{n,j}$ as $n \to \infty$. Since we have to center by the mean we can pick any level, say $\frac12$, for truncation. Then the truncated means are all 0. The accompanying laws are given by $e(M_n, 0, a_n)$ with $M_n = (\sum_j p_{n,j})\, \delta_1$ and $a_n = (\sum_j p_{n,j})\, \theta(1)$. It is clear that a limit exists if and only if $\lambda_n = \sum_j p_{n,j}$ has a limit $\lambda$ as $n \to \infty$ and the limit in such a case is the Poisson distribution with parameter $\lambda$.

2. Convergence to the normal distribution. If the limit of the sums $S_n = \sum_{j=1}^{k_n} X_{n,j}$ of $k_n$ uniformly infinitesimal mutually independent random variables exists, then the limit is Normal if and only if $M \equiv 0$. If $a_{n,j}$ is the centering needed, this is equivalent to
\[ \lim_{n \to \infty} \sum_j P[\, |X_{n,j} - a_{n,j}| \ge \epsilon \,] = 0 \]
for all $\epsilon > 0$. Since $\lim_{n \to \infty} \sup_j |a_{n,j}| = 0$, this is equivalent to
\[ \lim_{n \to \infty} \sum_j P[\, |X_{n,j}| \ge \epsilon \,] = 0 \]
for each $\epsilon > 0$.
3. The limiting variance and the mean are given by
\[ \sigma^2 = \lim_{n \to \infty} \sum_j E\big[\, [X_{n,j} - a_{n,j}]^2 : |X_{n,j} - a_{n,j}| \le 1 \,\big] \]
and
\[ a = \lim_{n \to \infty} \sum_j a_{n,j} \]
where
\[ a_{n,j} = \int_{|x| \le 1} x\, d\alpha_{n,j}. \]
Suppose that $E[X_{n,j}] = 0$ for all $1 \le j \le k_n$ and $n$. Assume that $\sigma_n^2 = \sum_j E\{[X_{n,j}]^2\}$ and $\sigma^2 = \lim_{n \to \infty} \sigma_n^2$ exists. What do we need in order to make sure that the limiting distribution is normal with mean 0 and variance $\sigma^2$? Let $\alpha_{n,j}$ be the distribution of $X_{n,j}$.
\[ |a_{n,j}|^2 = \Big| \int_{|x| \le 1} x\, d\alpha_{n,j} \Big|^2 = \Big| \int_{|x| > 1} x\, d\alpha_{n,j} \Big|^2 \le \alpha_{n,j}[\, |x| > 1 \,] \int |x|^2\, d\alpha_{n,j} \]
and
\begin{align*}
\sum_{j=1}^{k_n} |a_{n,j}|^2 &\le \Big[ \sum_{1 \le j \le k_n} \int |x|^2\, d\alpha_{n,j} \Big] \sup_{1 \le j \le k_n} \alpha_{n,j}[\, |x| > 1 \,] \\
&\le \sigma_n^2 \sup_{1 \le j \le k_n} \alpha_{n,j}[\, |x| > 1 \,] \\
&\to 0.
\end{align*}
Because $\sum_{j=1}^{k_n} |a_{n,j}|^2 \to 0$ as $n \to \infty$ we must have
\[ \sigma^2 = \lim_{n \to \infty} \sum_j \int_{|x| \le \ell} |x|^2\, d\alpha_{n,j} \]
for every $\ell > 0$ or equivalently
\[ \lim_{n \to \infty} \sum_j \int_{|x| > \ell} |x|^2\, d\alpha_{n,j} = 0 \]
for every $\ell$, establishing the necessity as well as sufficiency in Lindeberg's Theorem. A simple calculation shows that
\[ \sum_j |a_{n,j}| \le \sum_j \int_{|x| > 1} |x|\, d\alpha_{n,j} \le \sum_j \int_{|x| > 1} |x|^2\, d\alpha_{n,j} \to 0 \]
establishing that the limiting Normal distribution has mean 0.

Exercise 3.26. What happens in the Poisson limit theorem (application 1) if $\lambda_n = \sum_j p_{n,j} \to \infty$ as $n \to \infty$? Can you show that the distribution of $\frac{S_n - \lambda_n}{\sqrt{\lambda_n}}$ converges to the standard Normal distribution?
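The Poisson limit in application 1 can be observed numerically. The sketch below is only an illustration under our own choices: equal probabilities $p_{n,j} = \lambda/n$ with $\lambda = 2$ and $k_n = n$, and the total variation distance as a measure of closeness (neither is prescribed by the text). It computes the exact law of $S_n$ by repeated convolution and compares it with the Poisson distribution with parameter $\lambda_n = \sum_j p_{n,j}$.

```python
import math

def bernoulli_sum_pmf(ps):
    """Exact distribution of a sum of independent 0/1 variables with P[X_j = 1] = ps[j]."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)
            nxt[k + 1] += mass * p
        pmf = nxt
    return pmf

def poisson_pmf(lam, k):
    """e^{-lam} lam^k / k!, computed through logarithms to avoid overflow for large k."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def total_variation_to_poisson(ps):
    lam = sum(ps)
    pmf = bernoulli_sum_pmf(ps)
    inside = sum(abs(q - poisson_pmf(lam, k)) for k, q in enumerate(pmf))
    # Poisson mass beyond the support {0, ..., len(ps)} of the Bernoulli sum.
    outside = 1.0 - sum(poisson_pmf(lam, k) for k in range(len(pmf)))
    return 0.5 * (inside + outside)

lam = 2.0
distances = {n: total_variation_to_poisson([lam / n] * n) for n in (10, 100, 1000)}
```

The distances shrink as $n$ grows, in line with $\sup_j p_{n,j} \to 0$ and $\lambda_n \to \lambda$. Unequal $p_{n,j}$ can be tried by passing a different list to `total_variation_to_poisson`.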
3.9
Laws of the iterated logarithm.
When we are dealing with a sequence of independent identically distributed random variables $X_1, \cdots, X_n, \cdots$ with mean 0 and variance 1, we have a strong law of large numbers asserting that
\[ P\Big[ \lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = 0 \Big] = 1 \]
and a central limit theorem asserting that
\[ P\Big[ \frac{X_1 + \cdots + X_n}{\sqrt n} \le a \Big] \to \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx. \]
It is a reasonable question to ask if the random variables $\frac{X_1 + \cdots + X_n}{\sqrt n}$ themselves converge to some limiting random variable $Y$ that is distributed according to the standard normal distribution. The answer is no and is not hard to show.
Lemma 3.24. For any sequence $n_j$ of numbers $\to \infty$,
\[ P\Big[ \limsup_{j \to \infty} \frac{X_1 + \cdots + X_{n_j}}{\sqrt{n_j}} = +\infty \Big] = 1. \]
Proof. Let us define
\[ Z = \limsup_{j \to \infty} \frac{X_1 + \cdots + X_{n_j}}{\sqrt{n_j}} \]
which can be $+\infty$. Because the normal distribution has an infinitely long tail, i.e. the probability of exceeding any given value is positive, we must have
\[ P[Z \ge a] > 0 \]
for any $a$. But $Z$ is a random variable that does not depend on the particular values of $X_1, \cdots, X_n$, and the event $\{Z \ge a\}$ is therefore a set in the tail σ-field. By Kolmogorov's zero-one law $P[Z \ge a]$ must be either 0 or 1. Since it cannot be 0 it must be 1.

Since we know that $\frac{X_1 + \cdots + X_n}{n} \to 0$ with probability 1 as $n \to \infty$, the question arises as to the rate at which this happens. The law of the iterated logarithm provides an answer.
Theorem 3.25. For any sequence $X_1, \cdots, X_n, \cdots$ of independent identically distributed random variables with mean 0 and variance 1,
\[ P\Big[ \limsup_{n \to \infty} \frac{X_1 + \cdots + X_n}{\sqrt{n \log\log n}} = \sqrt 2 \Big] = 1. \]
We will not prove this theorem in the most general case which assumes only the existence of two moments. We will assume instead that $E[|X|^{2+\alpha}] < \infty$ for some $\alpha > 0$. We shall first reduce the proof to an estimate on the tail behavior of the distributions of $\frac{S_n}{\sqrt n}$ by a careful application of the Borel-Cantelli Lemma. This estimate is obvious if $X_1, \cdots, X_n, \cdots$ are themselves normally distributed and we will show how to extend it to a large class of distributions that satisfy the additional moment condition. It is clear that we are interested in showing that for $\lambda > \sqrt 2$,
\[ P\big[ S_n \ge \lambda \sqrt{n \log\log n} \ \text{infinitely often} \big] = 0. \]
It would be sufficient because of the Borel-Cantelli lemma to show that for any $\lambda > \sqrt 2$,
\[ \sum_n P\big[ S_n \ge \lambda \sqrt{n \log\log n} \big] < \infty. \]
This however is too strong. The condition of the Borel-Cantelli lemma is not necessary in this context because of the strong dependence between the partial sums $S_n$. The function $\phi(n) = \sqrt{n \log\log n}$ is clearly well defined and non-decreasing for $n \ge 3$ and it is sufficient for our purposes to show that for any $\lambda > \sqrt 2$ we can find some sequence $k_n \uparrow \infty$ of integers such that
\[ \sum_n P\Big[ \sup_{k_{n-1} \le j \le k_n} S_j \ge \lambda\, \phi(k_{n-1}) \Big] < \infty. \tag{3.33} \]
This will establish that with probability 1,
\[ \limsup_{n \to \infty} \frac{\sup_{k_{n-1} \le j \le k_n} S_j}{\phi(k_{n-1})} \le \lambda \]
or by the monotonicity of $\phi$,
\[ \limsup_{n \to \infty} \frac{S_n}{\phi(n)} \le \lambda \]
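with probability 1. Since $\lambda > \sqrt 2$ is arbitrary the upper bound in the law of the iterated logarithm will follow. Each term in the sum of (3.33) can be estimated as in Lévy's inequality,
\[ P\Big[ \sup_{k_{n-1} \le j \le k_n} S_j \ge \lambda\, \phi(k_{n-1}) \Big] \le 2\, P\big[ S_{k_n} \ge (\lambda - \sigma)\, \phi(k_{n-1}) \big] \]
with $0 < \sigma < \lambda$, provided
\[ \sup_{1 \le j \le k_n - k_{n-1}} P\big[ |S_j| \ge \sigma \phi(k_{n-1}) \big] \le \frac12. \]
Our choice of $k_n$ will be $k_n = [\rho^n]$ for some $\rho > 1$ and therefore
\[ \lim_{n \to \infty} \frac{\phi(k_{n-1})}{\sqrt{k_n}} = \infty \]
and by Chebychev's inequality, for any fixed $\sigma > 0$,
\begin{align*}
\sup_{1 \le j \le k_n} P\big[ |S_j| \ge \sigma \phi(k_{n-1}) \big] &\le \frac{E[S_{k_n}^2]}{[\sigma \phi(k_{n-1})]^2} \\
&= \frac{k_n}{[\sigma \phi(k_{n-1})]^2} \\
&= \frac{k_n}{\sigma^2\, k_{n-1} \log\log k_{n-1}} \\
&= o(1) \quad \text{as } n \to \infty. \tag{3.34}
\end{align*}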
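The behavior of the geometric subsequence $k_n = [\rho^n]$ can be seen in a few lines. The sketch below is illustrative only: the value $\rho = 1.5$, the indices $n$ and the choice $\sigma = 1$ are ours. It evaluates the ratio $\phi(k_{n-1})/\phi(k_n)$, which approaches $1/\sqrt\rho$, and the Chebychev bound $k_n / \phi(k_{n-1})^2$ from (3.34), which tends to 0 only at the slow rate $1/\log\log k_{n-1}$.

```python
import math

def phi(n):
    """phi(n) = sqrt(n log log n), defined for n >= 3."""
    return math.sqrt(n * math.log(math.log(n)))

def k(n, rho):
    """Geometric subsequence k_n = [rho^n] (integer part)."""
    return int(rho ** n)

rho = 1.5
indices = (20, 40, 80)
# Ratio phi(k_{n-1}) / phi(k_n), expected to approach 1 / sqrt(rho).
ratios = [phi(k(n - 1, rho)) / phi(k(n, rho)) for n in indices]
# Chebychev bound of (3.34) with sigma = 1: k_n / phi(k_{n-1})^2.
chebyshev = [k(n, rho) / phi(k(n - 1, rho)) ** 2 for n in indices]
```

Because the Chebychev bound decays so slowly, it is used only to verify the hypothesis of Lévy's inequality and not to sum the series in (3.33).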
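By choosing $\sigma$ small enough so that $\lambda - \sigma > \sqrt 2$ it is sufficient to show that for any $\lambda' > \sqrt 2$,
\[ \sum_n P\big[ S_{k_n} \ge \lambda' \phi(k_{n-1}) \big] < \infty. \]
By picking $\rho$ sufficiently close to 1 (so that $\frac{\lambda'}{\sqrt\rho} > \sqrt 2$), because $\frac{\phi(k_{n-1})}{\phi(k_n)} \to \frac{1}{\sqrt\rho}$ we can reduce this to the convergence of
\[ \sum_n P\big[ S_{k_n} \ge \lambda\, \phi(k_n) \big] < \infty \tag{3.35} \]
for all $\lambda > \sqrt 2$. If we use the estimate $P[X \ge a] \le \exp[-\frac{a^2}{2}]$ that is valid for the standard normal distribution, we can verify (3.35):
\[ \sum_n \exp\Big[ -\frac{\lambda^2 (\phi(k_n))^2}{2 k_n} \Big] < \infty \]
if $\lambda^2 > 2$. To prove the lower bound we select again a subsequence, $k_n = [\rho^n]$ with some $\rho > 1$, and look at $Y_n = S_{k_{n+1}} - S_{k_n}$, which are now independent random variables. The tail probability of the Normal distribution has the lower bound
\begin{align*}
P[X \ge a] &= \frac{1}{\sqrt{2\pi}} \int_a^\infty \exp\Big[-\frac{x^2}{2}\Big]\, dx \\
&\ge \frac{1}{\sqrt{2\pi}} \int_a^\infty \exp\Big[-\frac{x^2}{2} - x\Big] (x + 1)\, dx \\
&\ge \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{(a + 1)^2}{2}\Big].
\end{align*}
If we assume Normal like tail probabilities we can conclude that
\[ \sum_n P\big[ Y_n \ge \lambda \phi(k_{n+1}) \big] \ge \sum_n \exp\Big[ -\frac12 \Big[ 1 + \frac{\lambda \phi(k_{n+1})}{\sqrt{\rho^{n+1} - \rho^n}} \Big]^2 \Big] = +\infty \]
provided $\frac{\lambda^2 \rho}{2(\rho - 1)} < 1$ and conclude by the Borel-Cantelli lemma, that $Y_n = S_{k_{n+1}} - S_{k_n}$ exceeds $\lambda \phi(k_{n+1})$ infinitely often for such $\lambda$. On the other hand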
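from the upper bound we already have (replacing $X_i$ by $-X_i$)
\[ P\Big[ \limsup_n \frac{-S_{k_n}}{\phi(k_{n+1})} \le \frac{\sqrt 2}{\sqrt\rho} \Big] = 1. \]
Consequently
\[ P\Big[ \limsup_n \frac{S_{k_{n+1}}}{\phi(k_{n+1})} \ge \sqrt{\frac{2(\rho - 1)}{\rho}} - \frac{\sqrt 2}{\sqrt\rho} \Big] = 1 \]
and therefore,
\[ P\Big[ \limsup_n \frac{S_n}{\phi(n)} \ge \sqrt{\frac{2(\rho - 1)}{\rho}} - \frac{\sqrt 2}{\sqrt\rho} \Big] = 1. \]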
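The two Gaussian tail estimates used in the upper and lower bounds, $P[X \ge a] \le \exp[-\frac{a^2}{2}]$ and $P[X \ge a] \ge \frac{1}{\sqrt{2\pi}} \exp[-\frac{(a+1)^2}{2}]$, can be compared with the exact tail. The following sketch is only a numerical illustration for a few values $a \ge 0$ of our choosing; the function names are ours.

```python
import math

def gaussian_tail(a):
    """P[X >= a] for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def upper_bound(a):
    """The bound exp(-a^2/2) used for the upper half of the theorem."""
    return math.exp(-a * a / 2.0)

def lower_bound(a):
    """The bound exp(-(a+1)^2/2) / sqrt(2 pi) used for the lower half."""
    return math.exp(-(a + 1.0) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

# For each a, record whether the exact tail lies between the two bounds.
checks = [(a, lower_bound(a) <= gaussian_tail(a) <= upper_bound(a))
          for a in [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]]
```

Both bounds have the form $\exp[-\frac{a^2}{2}(1 + o(1))]$ for large $a$, which is all that the two Borel-Cantelli arguments require.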
We now take $\rho$ arbitrarily large and we are done.

We saw that the law of the iterated logarithm depends on two things.

(i). For any $a > 0$ and $p < \frac{a^2}{2}$ an upper bound for the probability
\[ P\big[ S_n \ge a \sqrt{n \log\log n} \big] \le C_p [\log n]^{-p} \]
with some constant $C_p$.

(ii). For any $a > 0$ and $p > \frac{a^2}{2}$ a lower bound for the probability
\[ P\big[ S_n \ge a \sqrt{n \log\log n} \big] \ge C_p [\log n]^{-p} \]
with some, possibly different, constant $C_p$.

Both inequalities can be obtained from a uniform rate of convergence in the central limit theorem,
\[ \sup_a \Big| P\Big\{ \frac{S_n}{\sqrt n} \ge a \Big\} - \int_a^\infty \frac{1}{\sqrt{2\pi}} \exp\Big[-\frac{x^2}{2}\Big]\, dx \Big| \le C n^{-\delta} \tag{3.36} \]
for some $\delta > 0$. Such an error estimate is provided in the following theorem.

Theorem 3.26. (Berry-Esseen theorem). Assume that the i.i.d. sequence $\{X_j\}$ with mean zero and variance one satisfies an additional moment condition $E|X|^{2+\alpha} < \infty$ for some $\alpha > 0$. Then for some $\delta > 0$ the estimate (3.36) holds.
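Proof. The proof will be carried out after two lemmas.

Lemma 3.27. Let $-\infty < a < b < \infty$ be given and let $0 < h < \frac{b - a}{2}$ be a small positive number. Consider the function $f_{a,b,h}(x)$ defined as
\[ f_{a,b,h}(x) = \begin{cases} 0 & \text{for } -\infty < x \le a - h \\ \frac{x - a + h}{2h} & \text{for } a - h \le x \le a + h \\ 1 & \text{for } a + h \le x \le b - h \\ 1 - \frac{x - b + h}{2h} & \text{for } b - h \le x \le b + h \\ 0 & \text{for } b + h \le x < \infty. \end{cases} \]
For any probability distribution $\mu$ with characteristic function $\hat\mu(t)$
\[ \int_{-\infty}^{\infty} f_{a,b,h}(x)\, d\mu(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat\mu(y)\, \frac{e^{-i a y} - e^{-i b y}}{i y}\, \frac{\sin h y}{h y}\, dy. \]
Proof. This is essentially the Fourier inversion formula. Note that
\[ f_{a,b,h}(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i x y}\, \frac{e^{-i a y} - e^{-i b y}}{i y}\, \frac{\sin h y}{h y}\, dy. \]
We can start with the double integral
\[ \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{i x y}\, \frac{e^{-i a y} - e^{-i b y}}{i y}\, \frac{\sin h y}{h y}\, dy\, d\mu(x) \]
and apply Fubini's theorem to obtain the lemma.

Lemma 3.28. Let $\lambda, \mu$ be two probability measures with zero mean having $\hat\lambda(\cdot), \hat\mu(\cdot)$ for respective characteristic functions. Then
\[ \int_{-\infty}^{\infty} f_{a,h}(x)\, d(\lambda - \mu)(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} [\hat\lambda(y) - \hat\mu(y)]\, \frac{e^{-i a y}}{i y}\, \frac{\sin h y}{h y}\, dy \]
where $f_{a,h}(x) = f_{a,\infty,h}(x)$ is given by
\[ f_{a,h}(x) = \begin{cases} 0 & \text{for } -\infty < x \le a - h \\ \frac{x - a + h}{2h} & \text{for } a - h \le x \le a + h \\ 1 & \text{for } a + h \le x < \infty. \end{cases} \]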
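As a sanity check on Lemma 3.27, both sides of the identity can be evaluated numerically when $\mu$ is the standard normal distribution, for which $\hat\mu(y) = \exp[-\frac{y^2}{2}]$. The sketch below is an illustration under our own choices: the values of $a$, $b$, $h$, the midpoint rule, and the truncation of the $y$-integral at $|y| \le 12$ are assumptions, not part of the text. The imaginary part of the integrand is odd in $y$ and integrates to zero, so only the real part $\frac{\sin by - \sin ay}{y}$ of $\frac{e^{-iay} - e^{-iby}}{iy}$ is integrated.

```python
import math

def f_trapezoid(x, a, b, h):
    """f_{a,b,h}: 0 outside [a-h, b+h], 1 on [a+h, b-h], linear in between."""
    if x <= a - h or x >= b + h:
        return 0.0
    if x <= a + h:
        return (x - a + h) / (2.0 * h)
    if x <= b - h:
        return 1.0
    return 1.0 - (x - b + h) / (2.0 * h)

def midpoint(g, lo, hi, steps):
    """Midpoint rule; with a symmetric range and an even number of steps it never evaluates g at 0."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (k + 0.5) * width) for k in range(steps))

def left_side(a, b, h):
    """Integral of f_{a,b,h} against the standard normal density."""
    def density(x):
        return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return midpoint(lambda x: f_trapezoid(x, a, b, h) * density(x), a - h, b + h, 20000)

def right_side(a, b, h, cutoff=12.0):
    """Fourier side of the lemma for mu = N(0,1), real part only, truncated at |y| <= cutoff."""
    def integrand(y):
        return (math.exp(-y * y / 2.0)
                * (math.sin(b * y) - math.sin(a * y)) / y
                * math.sin(h * y) / (h * y))
    return midpoint(integrand, -cutoff, cutoff, 20000) / (2.0 * math.pi)

lhs = left_side(-0.5, 1.0, 0.25)
rhs = right_side(-0.5, 1.0, 0.25)
```

The two numbers agree to within the quadrature error, and both are close to $P[-0.5 \le X \le 1]$ because $h$ is small.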
Proof. We just let $b \to \infty$ in the previous lemma. Since $|\hat\lambda(y) - \hat\mu(y)| = o(|y|)$, there is no problem in applying the Riemann-Lebesgue Lemma.

We now proceed with the proof of the theorem.
\[ \lambda[[a, \infty)] \le \int f_{a-h,h}(x)\, d\lambda(x) \le \lambda[[a - 2h, \infty)] \]
and
\[ \mu[[a, \infty)] \le \int f_{a-h,h}(x)\, d\mu(x) \le \mu[[a - 2h, \infty)]. \]
Therefore if we assume that $\mu$ has a density bounded by $C$,
\[ \lambda[[a, \infty)] - \mu[[a, \infty)] \le 2hC + \int f_{a-h,h}(x)\, d(\lambda - \mu)(x). \]
Since we get a similar bound in the other direction as well,
\begin{align*}
\sup_a |\lambda[[a, \infty)] - \mu[[a, \infty)]| &\le \sup_a \Big| \int f_{a-h,h}(x)\, d(\lambda - \mu)(x) \Big| + 2hC \\
&\le \frac{1}{2\pi} \int_{-\infty}^{\infty} |\hat\lambda(y) - \hat\mu(y)|\, \frac{|\sin h y|}{h y^2}\, dy + 2hC. \tag{3.37}
\end{align*}
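Now we return to the proof of the theorem. We take $\lambda$ to be the distribution of $\frac{S_n}{\sqrt n}$ having as its characteristic function $\hat\lambda_n(y) = [\phi(\frac{y}{\sqrt n})]^n$ where $\phi(y)$ is the characteristic function of the common distribution of the $\{X_i\}$ and has the expansion
\[ \phi(y) = 1 - \frac{y^2}{2} + O(|y|^{2+\alpha}) \]
for some $\alpha > 0$. We therefore get, for some choice of $\alpha > 0$,
\[ \Big| \hat\lambda_n(y) - \exp\Big[-\frac{y^2}{2}\Big] \Big| \le C\, \frac{|y|^{2+\alpha}}{n^{\alpha}} \quad \text{if } |y| \le n^{\frac{\alpha}{2+\alpha}}. \]
Therefore for $\theta = \frac{\alpha}{2+\alpha}$
\begin{align*}
\int_{-\infty}^{\infty} \Big| \hat\lambda_n(y) - \exp\Big[-\frac{y^2}{2}\Big] \Big|\, \frac{|\sin h y|}{h y^2}\, dy
&= \int_{|y| \le n^\theta} \Big| \hat\lambda_n(y) - \exp\Big[-\frac{y^2}{2}\Big] \Big|\, \frac{|\sin h y|}{h y^2}\, dy \\
&\quad + \int_{|y| \ge n^\theta} \Big| \hat\lambda_n(y) - \exp\Big[-\frac{y^2}{2}\Big] \Big|\, \frac{|\sin h y|}{h y^2}\, dy \\
&\le \frac{C}{h} \Big[ \int_{|y| \le n^\theta} \frac{|y|^{\alpha}}{n^{\alpha}}\, dy + \int_{|y| \ge n^\theta} \frac{dy}{|y|^2} \Big] \\
&\le C\, \frac{n^{(\alpha+1)\theta - \alpha} + n^{-\theta}}{h} \\
&= \frac{C}{h\, n^{\frac{\alpha}{\alpha+2}}}.
\end{align*}
Substituting this bound in (3.37) we get
\[ \sup_a |\lambda_n[[a, \infty)] - \mu[[a, \infty)]| \le C_1 h + \frac{C}{h\, n^{\frac{\alpha}{2+\alpha}}}. \]
By picking $h = n^{-\frac{\alpha}{2(2+\alpha)}}$ we get
\[ \sup_a |\lambda_n[[a, \infty)] - \mu[[a, \infty)]| \le C\, n^{-\frac{\alpha}{2(2+\alpha)}} \]
and we are done.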
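The quantity on the left of (3.36) can be computed exactly in a simple case. The sketch below is only an illustration: we take the $X_j$ to be fair $\pm 1$ signs (mean 0, variance 1, all moments finite), which is our choice and not an example from the text. Since $a \mapsto P[S_n/\sqrt n \ge a]$ is a step function and the Gaussian tail is monotone, the supremum over $a$ is attained at the atoms of $S_n/\sqrt n$ or just to the right of them.

```python
import math

def gaussian_tail(a):
    """P[X >= a] for a standard normal X."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def sup_distance(n):
    """sup over a of |P[S_n/sqrt(n) >= a] - P[N(0,1) >= a]| for S_n a sum of n fair signs."""
    # S_n = 2k - n when exactly k of the signs are +1; k is Binomial(n, 1/2).
    pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
    worst = 0.0
    tail = 1.0  # P[S_n >= current atom], starting from the smallest atom
    for k in range(n + 1):
        atom = (2 * k - n) / math.sqrt(n)
        target = gaussian_tail(atom)
        worst = max(worst, abs(tail - target))  # a equal to the atom
        tail -= pmf[k]
        worst = max(worst, abs(tail - target))  # a just above the atom
    return worst

errors = {n: sup_distance(n) for n in (10, 100, 1000)}
```

In this example the error behaves like a constant times $n^{-\frac12}$, which is better than the exponent $\delta = \frac{\alpha}{2(2+\alpha)}$ produced by the argument above; any $\delta > 0$ suffices for the law of the iterated logarithm.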
Chapter 4
Dependent Random Variables
4.1
Conditioning
One of the key concepts in probability theory is the notion of conditional probability and conditional expectation. Suppose that we have a probability space $(\Omega, \mathcal{F}, P)$ consisting of a space $\Omega$, a σ-field $\mathcal{F}$ of subsets of $\Omega$ and a probability measure $P$ on the σ-field $\mathcal{F}$. If we have a set $A \in \mathcal{F}$ of positive measure then conditioning with respect to $A$ means we restrict ourselves to the set $A$. $\Omega$ gets replaced by $A$, and the σ-field $\mathcal{F}$ by the σ-field $\mathcal{F}_A$ of subsets of $A$ that are in $\mathcal{F}$. For $B \subset A$ we define
\[ P_A(B) = \frac{P(B)}{P(A)}. \]
We could achieve the same thing by defining for arbitrary $B \in \mathcal{F}$
\[ P_A(B) = \frac{P(A \cap B)}{P(A)} \]
in which case $P_A(\cdot)$ is a measure defined on $\mathcal{F}$ as well but one that is concentrated on $A$ and assigns 0 probability to $A^c$. The definition of conditional probability is
\[ P(B|A) = \frac{P(A \cap B)}{P(A)}. \]
Similarly the conditional expectation of an integrable function $f(\omega)$ given a set $A \in \mathcal{F}$ of positive probability is defined to be
\[ E\{f|A\} = \frac{\int_A f(\omega)\, dP}{P(A)}. \]
In particular if we take $f = \chi_B$ for some $B \in \mathcal{F}$ we recover the definition of conditional probability. In general if we know $P(B|A)$ and $P(A)$ we can recover $P(A \cap B) = P(A) P(B|A)$ but we cannot recover $P(B)$. But if we know $P(B|A)$ as well as $P(B|A^c)$ along with $P(A)$ and $P(A^c) = 1 - P(A)$ then
\[ P(B) = P(A \cap B) + P(A^c \cap B) = P(A) P(B|A) + P(A^c) P(B|A^c). \]
More generally if $\mathcal{P}$ is a partition of $\Omega$ into a finite or even a countable number of disjoint measurable sets $A_1, \cdots, A_j, \cdots$
\[ P(B) = \sum_j P(A_j) P(B|A_j). \]
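These formulas are easy to check on a finite probability space. The sketch below uses a hypothetical example of our own, two fair dice with exact rational arithmetic, and helper names (`prob`, `cond_prob`, `cond_expect`) that are ours. It verifies $P(B) = \sum_j P(A_j) P(B|A_j)$ for the partition by the value of the first die, and evaluates one conditional expectation $E\{f|A\}$.

```python
from fractions import Fraction

# Hypothetical example: two fair dice, omega = (first, second).
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}

def prob(event):
    """P(event), where event is a predicate on sample points."""
    return sum(P[w] for w in omega if event(w))

def cond_prob(event, given):
    """P(B|A) = P(A and B) / P(A), defined when P(A) > 0."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

def cond_expect(f, given):
    """E{f|A} = (integral of f over A) / P(A)."""
    return sum(f(w) * P[w] for w in omega if given(w)) / prob(given)

B = lambda w: w[0] + w[1] >= 10
# Partition A_1, ..., A_6 with A_i = {first die shows i}.
partition = [lambda w, i=i: w[0] == i for i in range(1, 7)]

total = sum(prob(A) * cond_prob(B, A) for A in partition)
```

Here `total` reproduces `prob(B)` exactly, although several of the individual conditional probabilities are 0.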
If $\xi$ is a random variable taking distinct values $\{a_j\}$ on $\{A_j\}$ then
\[ P(B|\xi = a_j) = P(B|A_j) \]
or more generally
\[ P(B|\xi = a) = \frac{P(B \cap \{\xi = a\})}{P(\xi = a)} \]
provided $P(\xi = a) > 0$. One of our goals is to seek a definition that makes sense when $P(\xi = a) = 0$. This involves dividing 0 by 0 and should involve differentiation of some kind. In the countable case we may think of $P(B|\xi = a_j)$ as a function $f_B(\xi)$ which is equal to $P(B|A_j)$ on $\xi = a_j$. We can rewrite our definition of
\[ f_B(a_j) = P(B|\xi = a_j) \]
as
\[ \int_{\xi = a_j} f_B(\xi)\, dP = P(B \cap \{\xi = a_j\}) \quad \text{for each } j \]
or summing over any arbitrary collection of $j$'s
\[ \int_{\xi \in E} f_B(\xi)\, dP = P(B \cap \{\xi \in E\}). \]
Sets of the form $\xi \in E$ form a sub σ-field $\Sigma \subset \mathcal{F}$ and we can rewrite the definition as
\[ \int_A f_B(\xi)\, dP = P(B \cap A) \]
for all $A \in \Sigma$. Of course in this case $A \in \Sigma$ if and only if $A$ is a union of the atoms $\xi = a$ of the partition over a finite or countable subcollection of the possible values of $a$. Similar considerations apply to the conditional expectation of a random variable $G$ given $\xi$. The equation becomes
\[ \int_A g(\xi)\, dP = \int_A G(\omega)\, dP \]
or we can rewrite this as
\[ \int_A g(\omega)\, dP = \int_A G(\omega)\, dP \]
for all $A \in \Sigma$ and instead of demanding that $g$ be a function of $\xi$ we demand that $g$ be $\Sigma$ measurable which is the same thing. Now the random variable $\xi$ is out of the picture and rightly so. What is important is the information we have if we know $\xi$ and that is the same if we replace $\xi$ by a one-to-one function of itself. The σ-field $\Sigma$ abstracts that information nicely. So it turns out that the proper notion of conditioning involves a sub σ-field $\Sigma \subset \mathcal{F}$. If $G$ is an integrable function and $\Sigma \subset \mathcal{F}$ is given we will seek another integrable function $g$ that is $\Sigma$ measurable and satisfies
\[ \int_A g(\omega)\, dP = \int_A G(\omega)\, dP \]
for all $A \in \Sigma$. We will prove existence and uniqueness of such a $g$ and call it the conditional expectation of $G$ given $\Sigma$ and denote it by $g = E[G|\Sigma]$.

The way to prove the above result will take us on a detour. A signed measure on a measurable space $(\Omega, \mathcal{F})$ is a set function $\lambda(\cdot)$ defined for $A \in \mathcal{F}$ which is countably additive but not necessarily nonnegative. Countable additivity is again in any of the following two equivalent senses.
\[ \lambda(\cup A_n) = \sum \lambda(A_n) \]
for any countable collection of disjoint sets in $\mathcal{F}$, or
\[ \lim_{n \to \infty} \lambda(A_n) = \lambda(A) \]
whenever $A_n \downarrow A$ or $A_n \uparrow A$. Examples of such $\lambda$ can be constructed by taking the difference $\mu_1 - \mu_2$ of two nonnegative measures $\mu_1$ and $\mu_2$.
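Before following the detour, it may help to see the defining property of $g = E[G|\Sigma]$ in the countable case discussed above, where no general existence theorem is needed. The sketch below is a hypothetical finite example of our own (six sample points, non-uniform weights, $\xi(\omega) = \omega \bmod 3$, $G(\omega) = \omega^2$). On each atom $\{\xi = a\}$ it sets $g$ equal to the $P$-weighted average of $G$, and then checks $\int_A g\, dP = \int_A G\, dP$ for every $A \in \Sigma$, that is for every union of atoms.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite probability space with non-uniform weights summing to 1.
omega = list(range(6))
P = dict(zip(omega, [Fraction(w, 12) for w in (1, 2, 3, 3, 2, 1)]))
G = {w: w * w for w in omega}   # the random variable being conditioned
xi = {w: w % 3 for w in omega}  # the conditioning variable; its level sets are the atoms

def conditional_expectation(G, xi):
    """g = E[G | sigma(xi)]: on each atom {xi = a}, g is the P-weighted average of G."""
    g = {}
    for a in set(xi.values()):
        atom = [w for w in omega if xi[w] == a]
        mass = sum(P[w] for w in atom)
        average = sum(G[w] * P[w] for w in atom) / mass
        for w in atom:
            g[w] = average
    return g

g = conditional_expectation(G, xi)
atoms = [[w for w in omega if xi[w] == a] for a in sorted(set(xi.values()))]

# Every set in Sigma is a union of atoms; check the defining identity on all of them.
identity_holds = all(
    sum(g[w] * P[w] for atom in union for w in atom)
    == sum(G[w] * P[w] for atom in union for w in atom)
    for r in range(len(atoms) + 1)
    for union in combinations(atoms, r)
)
```

The function `g` depends on $\omega$ only through the atom containing it, which is what $\Sigma$ measurability means here.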
Definition 4.1. A set $A \in \mathcal{F}$ is totally positive (totally negative) for $\lambda$ if for every subset $B \in \mathcal{F}$ with $B \subset A$, $\lambda(B) \ge 0$ ($\le 0$).

Remark 4.1. A measurable subset of a totally positive set is totally positive. Any countable union of totally positive subsets is again totally positive.

Lemma 4.1. If $\lambda$ is a countably additive signed measure on $(\Omega, \mathcal{F})$,
\[ \sup_{A \in \mathcal{F}} |\lambda(A)| < \infty. \]
Proof. The key idea in the proof is that, since $\lambda(\Omega)$ is a finite number, if $\lambda(A)$ is large so is $\lambda(A^c)$ with an opposite sign. In fact, it is not hard to see that $||\lambda(A)| - |\lambda(A^c)|| \le |\lambda(\Omega)|$ for all $A \in \mathcal{F}$. Another fact is that if $\sup_{B \subset A} |\lambda(B)|$ and $\sup_{B \subset A^c} |\lambda(B)|$ are finite, so is $\sup_B |\lambda(B)|$. Now let us complete the proof. Given a subset $A \in \mathcal{F}$ with $\sup_{B \subset A} |\lambda(B)| = \infty$, and any positive number $N$, there is a subset $A_1 \in \mathcal{F}$ with $A_1 \subset A$ such that $|\lambda(A_1)| \ge N$ and $\sup_{B \subset A_1} |\lambda(B)| = \infty$. This is obvious because if we pick a set $E \subset A$ with $|\lambda(E)|$ very large so will $\lambda(E^c)$ be. At least one of the two sets $E, E^c$ will have the second property and we can call it $A_1$. If we proceed by induction we have a sequence $A_n$ that is $\downarrow$ and $|\lambda(A_n)| \to \infty$ that contradicts countable additivity.

Lemma 4.2. Given a subset $A \in \mathcal{F}$ with $\lambda(A) = \ell > 0$ there is a subset $\bar A \subset A$ that is totally positive with $\lambda(\bar A) \ge \ell$.

Proof. Let us define $m = \inf_{B \subset A} \lambda(B)$. Since the empty set is included, $m \le 0$. If $m = 0$ then $A$ is totally positive and we are done. So let us assume that $m < 0$. By the previous lemma $m > -\infty$. Let us find $B_1 \subset A$ such that $\lambda(B_1) \le \frac{m}{2}$. Then for $A_1 = A - B_1$ we have $A_1 \subset A$, $\lambda(A_1) \ge \ell$ and $\inf_{B \subset A_1} \lambda(B) \ge \frac{m}{2}$. By induction we can find $A_k$ with $A \supset A_1 \supset \cdots \supset A_k \cdots$, $\lambda(A_k) \ge \ell$ for every $k$ and $\inf_{B \subset A_k} \lambda(B) \ge \frac{m}{2^k}$. Clearly if we define $\bar A = \cap A_k$ which is the decreasing limit, $\bar A$ works.

Theorem 4.3. (Hahn-Jordan Decomposition). Given a countably additive signed measure $\lambda$ on $(\Omega, \mathcal{F})$ it can be written always as $\lambda = \mu^+ - \mu^-$, the difference of two nonnegative measures. Moreover $\mu^+$ and $\mu^-$ may be chosen to be orthogonal, i.e., there are disjoint sets $\Omega_+, \Omega_- \in \mathcal{F}$ such that $\mu^+(\Omega_-) = \mu^-(\Omega_+) = 0$. In fact $\Omega_+$ and $\Omega_-$ can be taken to be subsets of $\Omega$ that are respectively totally positive and totally negative for $\lambda$. $\mu^\pm$ then become just the restrictions of $\lambda$ to $\Omega_\pm$.
Proof. Totally positive sets are closed under countable unions, disjoint or not. Let us define $m_+ = \sup_A \lambda(A)$. If $m_+ = 0$ then $\lambda(A) \le 0$ for all $A$ and we can take $\Omega_+ = \emptyset$ and $\Omega_- = \Omega$ which works. Assume that $m_+ > 0$. There exist sets $A_n$ with $\lambda(A_n) \ge m_+ - \frac1n$ and therefore totally positive subsets $\bar A_n$ of $A_n$ with $\lambda(\bar A_n) \ge m_+ - \frac1n$. Clearly $\Omega_+ = \cup_n \bar A_n$ is totally positive and $\lambda(\Omega_+) = m_+$. It is easy to see that $\Omega_- = \Omega - \Omega_+$ is totally negative. $\mu^\pm$ can be taken to be the restriction of $\lambda$ to $\Omega_\pm$.

Remark 4.2. If $\lambda = \mu^+ - \mu^-$ with $\mu^+$ and $\mu^-$ orthogonal to each other, then they have to be the restrictions of $\lambda$ to the totally positive and totally negative sets for $\lambda$ and such a representation for $\lambda$ is unique. It is clear that in general the representation is not unique because we can add a common $\mu$ to both $\mu^+$ and $\mu^-$ and the $\mu$ will cancel when we compute $\lambda = \mu^+ - \mu^-$.

Remark 4.3. If $\mu$ is a nonnegative measure and we define $\lambda$ by
\[ \lambda(A) = \int_A f(\omega)\, d\mu = \int \chi_A(\omega) f(\omega)\, d\mu \]
where $f$ is an integrable function, then $\lambda$ is a countably additive signed measure and $\Omega_+ = \{\omega : f(\omega) > 0\}$ and $\Omega_- = \{\omega : f(\omega) < 0\}$. If we define $f^\pm(\omega)$ as the positive and negative parts of $f$, then
\[ \mu^\pm(A) = \int_A f^\pm(\omega)\, d\mu. \]
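Remark 4.3 can be made concrete on a finite set, where every subset can be enumerated. The sketch below is a hypothetical example of our own (four points of equal $\mu$-mass and an integer-valued density $f$). It checks that $\lambda = \mu^+ - \mu^-$ on every subset, that $\{f > 0\}$ is totally positive and $\{f < 0\}$ totally negative, and that $\sup_A \lambda(A)$ is attained at $\Omega_+$, as in the proof of Theorem 4.3.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical finite measure space and density.
omega = ["a", "b", "c", "d"]
mu = {w: Fraction(1, 4) for w in omega}
f = {"a": 2, "b": -3, "c": 0, "d": 1}

def lam(A):
    """Signed measure lambda(A) = integral over A of f d(mu)."""
    return sum(f[w] * mu[w] for w in A)

def mu_plus(A):
    """Integral over A of the positive part of f."""
    return sum(max(f[w], 0) * mu[w] for w in A)

def mu_minus(A):
    """Integral over A of the negative part of f."""
    return sum(max(-f[w], 0) * mu[w] for w in A)

def subsets(points):
    """All subsets of a finite list of points, as tuples."""
    return chain.from_iterable(combinations(points, r) for r in range(len(points) + 1))

omega_plus = [w for w in omega if f[w] > 0]
omega_minus = [w for w in omega if f[w] < 0]

decomposition_ok = all(lam(A) == mu_plus(A) - mu_minus(A) for A in subsets(omega))
totally_positive = all(lam(B) >= 0 for B in subsets(omega_plus))
totally_negative = all(lam(B) <= 0 for B in subsets(omega_minus))
```

The point `"c"`, where $f$ vanishes, belongs to neither set; it could be added to either one without affecting the decomposition.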
The signed measure λ that was constructed in the preceeding remark enjoys a very special relationship to µ. For any set A with µ(A) = 0, λ(A) = 0 because the integrand χA (ω)f (ω) is 0 for µ-almost all ω and for all practical purposes is a function that vanishes identically. Definition 4.2. A signed measure λ is said to be absolutely continuous with respect to a nonnegative measure µ, λ a and we will try to construct f from the sets Ω(a) by the definition f (ω) = [sup a ∈ Q : ω ∈ Ω(a)]. The plan is to check that the function f (ω) defined above works. Since λa is getting more negative as a increases, Ω(a) is ↓ as a ↑. There is trouble with sets of measure 0 for every comparison between two rationals a1 and a2 . Collect all such troublesome sets (only a countable number and throw them away). In other words we may assume without loss of generality that Ω(a1 ) ⊂ Ω(a2 ) whenever a1 > a2 . Clearly {ω : f (ω) > x} = {ω : ω ∈ Ω(y) for some rational y > x} = ∪ y>x Ω(y) y∈Q
and this makes $f$ measurable. If $A \subset \cap_a \Omega(a)$ then $\lambda(A) - a\mu(A) \ge 0$ for all $a$. If $\mu(A) > 0$, $\lambda(A)$ has to be infinite which is not possible. Therefore $\mu(A)$ has to be zero and by absolute continuity $\lambda(A) = 0$ as well. On the other hand if $A \cap \Omega(a) = \emptyset$ for all $a$, then $\lambda(A) - a\mu(A) \le 0$ for all $a$ and again if $\mu(A) > 0$, $\lambda(A) = -\infty$ which is not possible either. Therefore $\mu(A)$, and by absolute continuity, $\lambda(A)$ are zero. This proves that $f(\omega)$ is finite almost everywhere with respect to both $\lambda$ and $\mu$. Let us take two real numbers $a < b$ and consider $E_{a,b} = \{\omega : a \le f(\omega) \le b\}$. It is clear that the set $E_{a,b}$ is in $\Omega(a')$ and $\Omega^c(b')$ for any $a' < a$ and $b' > b$. Therefore for any set $A \subset E_{a,b}$, by letting $a'$ and $b'$ tend to $a$ and $b$,
\[ a\, \mu(A) \le \lambda(A) \le b\, \mu(A). \]
Now we are essentially done. Let us take a grid $\{n h\}$ and consider $E_n = \{\omega : nh \le f(\omega) < (n + 1)h\}$ for $-\infty < n < \infty$. Then for any $A \in \mathcal{F}$ and each $n$,
\begin{align*}
\lambda(A \cap E_n) - h \mu(A \cap E_n) &\le n h\, \mu(A \cap E_n) \le \int_{A \cap E_n} f(\omega)\, d\mu \\
&\le (n + 1) h\, \mu(A \cap E_n) \le \lambda(A \cap E_n) + h\, \mu(A \cap E_n).
\end{align*}
Summing over $n$ we have
\[ \lambda(A) - h\, \mu(A) \le \int_A f(\omega)\, d\mu \le \lambda(A) + h\, \mu(A) \]
proving the integrability of $f$ and, if we let $h \to 0$, establishing
\[ \lambda(A) = \int_A f(\omega)\, d\mu \]
for all $A \in \mathcal{F}$.

Remark 4.4. (Uniqueness). If we have two choices of $f$, say $f_1$ and $f_2$, their difference $g = f_1 - f_2$ satisfies
\[ \int_A g(\omega)\, d\mu = 0 \]
for all $A \in \mathcal{F}$. If we take $A_\epsilon = \{\omega : g(\omega) \ge \epsilon\}$, then $0 \ge \epsilon\, \mu(A_\epsilon)$ and this implies $\mu(A_\epsilon) = 0$ for all $\epsilon > 0$ or $g(\omega) \le 0$, almost everywhere with respect to $\mu$. A similar argument establishes $g(\omega) \ge 0$ almost everywhere with respect to $\mu$. Therefore $g = 0$ a.e. $\mu$, proving uniqueness.

Exercise 4.1. If $f$ and $g$ are two integrable functions, measurable with respect to a σ-field $\mathcal{B}$ and if $\int_A f(\omega)\, dP = \int_A g(\omega)\, dP$ for all sets $A \in \mathcal{B}_0$, a field that generates the σ-field $\mathcal{B}$, then $f = g$ a.e. $P$.

Exercise 4.2. If $\lambda(A) \ge 0$ for all $A \in \mathcal{F}$, prove that $f(\omega) \ge 0$ almost everywhere.

Exercise 4.3. If $\Omega$ is a countable set and $\mu(\{\omega\}) > 0$ for each single point set prove that any measure $\lambda$ is absolutely continuous with respect to $\mu$ and calculate the Radon-Nikodym derivative.
Exercise 4.4. Let F (x) be a distribution function on the line with F (0) = 0 and F (1) = 1 so that the probability measure α corresponding to it lives on the interval [0, 1]. If F (x) satisfies a Lipschitz condition |F (x) − F (y)| ≤ A|x − y| then prove that α 1, ouside which F (ω, x) = 1. If we collect all these null sets, of which there are only countably many, and take their union, we get a null set N ∈ Σ such that for ω ∈ / N, we have have a family F (ω, x) defined for rational x that satisfies F (ω, x) ≤ F (ω, y) if x < y are rational F (ω, x) = 0 for rational x < 0 F (ω, x) = 1 for rational x > 1 Z P (A ∩ [0, x]) = F (ω, x) dP for all A ∈ Σ. A
For $\omega \notin N$ and real $y$ we can define
\[ G(\omega, y) = \lim_{x \downarrow y,\ x\ \text{rational}} F(\omega, x). \]
For $\omega \notin N$, $G$ is a right continuous nondecreasing function (distribution function) with $G(\omega, y) = 0$ for $y < 0$ and $G(\omega, y) = 1$ for $y \ge 1$. There is then a probability measure $\hat Q(\omega, B)$ on the Borel subsets of $[0, 1]$ such that $\hat Q(\omega, [0, y]) = G(\omega, y)$ for all $y$. $\hat Q$ is our candidate for regular conditional probability. Clearly $\hat Q(\omega, I)$ is $\Sigma$ measurable for all intervals $I$ and by standard arguments will continue to be $\Sigma$ measurable for all Borel sets $B \in \mathcal{F}$. If we check that
\[ P(A \cap [0, x]) = \int_A G(\omega, x)\, dP \quad \text{for all } A \in \Sigma \]
4.3. CONDITIONAL PROBABILITY for all 0 ≤ x ≤ 1 then
115
Z
P (A ∩ I) =
ˆ Q(ω, I) dP
for all A ∈ Σ
A
for all intervals I and by standard arguments this will extend to finite disjoint unions of half open intervals that constitute a field and finally to the σ-field F generated by that field. To verify that for all real y, Z P (A ∩ [0, y]) = G(ω, y) dP for all A ∈ Σ A
we start from

$$P(A \cap [0, x]) = \int_A F(\omega, x)\,dP \quad \text{for all } A \in \Sigma$$

valid for rational $x$ and let $x \downarrow y$ through rationals. From the countable additivity of $P$ the left hand side converges to $P(A \cap [0, y])$ and by the bounded convergence theorem, the right hand side converges to $\int_A G(\omega, y)\,dP$ and we are done.

Finally from the uniqueness of the conditional expectation, if $A \in \Sigma$,

$$\hat{Q}(\omega, A) = \chi_A(\omega)$$

provided $\omega \notin N_A$, which is a null set that depends on $A$. We can take a countable set $\Sigma_0$ of generators $A$ that forms a field and get a single null set $N$ such that if $\omega \notin N$,

$$\hat{Q}(\omega, A) = \chi_A(\omega)$$

for all $A \in \Sigma_0$. Since both sides are countably additive measures in $A$ and as they agree on $\Sigma_0$ they have to agree on $\Sigma$ as well.

Exercise 4.10. (Disintegration Theorem.) Let $\mu$ be a probability measure on the plane $R^2$ with a marginal distribution $\alpha$ for the first coordinate. In other words $\alpha$ is such that, for any $f$ that is a bounded measurable function of $x$,

$$\int_{R^2} f(x)\,d\mu = \int_R f(x)\,d\alpha.$$

Show that there exist measures $\beta_x$ depending measurably on $x$ such that $\beta_x[\{x\} \times R] = 1$, i.e. $\beta_x$ is supported on the vertical line $\{(x, y) : y \in R\}$, and $\mu = \int_R \beta_x\,d\alpha$. The converse is of course easier. Given $\alpha$ and $\beta_x$ we can construct a unique $\mu$ such that $\mu$ disintegrates as expected.
4.4 Markov Chains
One of the ways of generating a sequence of dependent random variables is to think of a system evolving in time. We have time points that are discrete, say $T = 0, 1, \cdots, N, \cdots$. The state of the system is described by a point $x$ in the state space $\mathcal{X}$ of the system. The state space $\mathcal{X}$ comes with a natural σ-field of subsets $\mathcal{F}$. At time 0 the system is in a random state and its distribution is specified by a probability distribution $\mu_0$ on $(\mathcal{X}, \mathcal{F})$. At successive times $T = 1, 2, \cdots$, the system changes its state, and given the past history $(x_0, \cdots, x_{k-1})$ of the states of the system at times $T = 0, \cdots, k-1$, the probability that the system finds itself at time $k$ in a subset $A \in \mathcal{F}$ is given by $\pi_k(x_0, \cdots, x_{k-1}; A)$. For each $(x_0, \cdots, x_{k-1})$, $\pi_k$ defines a probability measure on $(\mathcal{X}, \mathcal{F})$ and for each $A \in \mathcal{F}$, $\pi_k(x_0, \cdots, x_{k-1}; A)$ is assumed to be a measurable function of $(x_0, \cdots, x_{k-1})$ on the space $(\mathcal{X}^k, \mathcal{F}^k)$, which is the product of $k$ copies of the space $(\mathcal{X}, \mathcal{F})$ with itself.

We can inductively define measures $\mu_k$ on $(\mathcal{X}^{k+1}, \mathcal{F}^{k+1})$ that describe the probability distribution of the entire history $(x_0, \cdots, x_k)$ of the system through time $k$. To go from $\mu_{k-1}$ to $\mu_k$ we think of $(\mathcal{X}^{k+1}, \mathcal{F}^{k+1})$ as the product of $(\mathcal{X}^k, \mathcal{F}^k)$ with $(\mathcal{X}, \mathcal{F})$ and construct on $(\mathcal{X}^{k+1}, \mathcal{F}^{k+1})$ a probability measure with marginal $\mu_{k-1}$ on $(\mathcal{X}^k, \mathcal{F}^k)$ and conditionals $\pi_k(x_0, \cdots, x_{k-1}; \cdot)$ on the fibers $(x_0, \cdots, x_{k-1}) \times \mathcal{X}$. This will define $\mu_k$ and the induction can proceed. We may stop at some finite terminal time $N$ or go on indefinitely. If we do go on indefinitely, we will have a consistent family of finite dimensional distributions $\{\mu_k\}$ on $(\mathcal{X}^{k+1}, \mathcal{F}^{k+1})$ and we may try to use Kolmogorov's theorem to construct a probability measure $P$ on the space $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ of sequences $\{x_j : j \ge 0\}$ representing the total evolution of the system for all times.

Remark 4.7. Kolmogorov's theorem requires some assumptions on $(\mathcal{X}, \mathcal{F})$ that are satisfied if $\mathcal{X}$ is a complete separable metric space and $\mathcal{F}$ are the Borel sets. However, in the present context, there is a result known as Tulcea's theorem (see [8]) that proves the existence of a $P$ on $(\mathcal{X}^\infty, \mathcal{F}^\infty)$ for any choice of $(\mathcal{X}, \mathcal{F})$, exploiting the fact that the consistent family of finite dimensional distributions $\mu_k$ arise from well defined successive regular conditional probability distributions.

An important subclass is generated when the transition probability depends on the past history only through the current state. In other words

$$\pi_k(x_0, \cdots, x_{k-1}; \cdot) = \pi_{k-1,k}(x_{k-1}; \cdot).$$

In such a case the process is called a Markov Process with transition probabilities $\pi_{k-1,k}(\cdot, \cdot)$. An even smaller subclass arises when we demand that $\pi_{k-1,k}(\cdot, \cdot)$ be the same for different values of $k$. A single transition probability $\pi(x, A)$ and the initial distribution $\mu_0$ determine the entire process, i.e. the measure $P$ on $(\mathcal{X}^\infty, \mathcal{F}^\infty)$. Such processes are called time-homogeneous Markov Processes or Markov Processes with stationary transition probabilities.

Chapman-Kolmogorov Equations. If we have the transition probabilities $\pi_{k,k+1}$ of transition from time $k$ to $k+1$ of a Markov Chain it is possible to obtain directly the transition probabilities from time $k$ to $k + \ell$ for any $\ell \ge 2$. We do it by induction on $\ell$. Define

$$\pi_{k,k+\ell+1}(x, A) = \int_{\mathcal{X}} \pi_{k,k+\ell}(x, dy)\,\pi_{k+\ell,k+\ell+1}(y, A) \tag{4.5}$$

or equivalently, in a more direct fashion

$$\pi_{k,k+\ell+1}(x, A) = \int_{\mathcal{X}} \cdots \int_{\mathcal{X}} \pi_{k,k+1}(x, dy_{k+1}) \cdots \pi_{k+\ell,k+\ell+1}(y_{k+\ell}, A)$$
Theorem 4.8. The transition probabilities $\pi_{k,m}(\cdot, \cdot)$ satisfy the relations

$$\pi_{k,n}(x, A) = \int_{\mathcal{X}} \pi_{k,m}(x, dy)\,\pi_{m,n}(y, A) \tag{4.6}$$

for any $k < m < n$ and for the Markov Process defined by the one step transition probabilities $\pi_{k,k+1}(\cdot, \cdot)$, for any $n > m$

$$P[x_n \in A \mid \Sigma_m] = \pi_{m,n}(x_m, A) \quad \text{a.e.}$$

where $\Sigma_m$ is the σ-field of past history up to time $m$ generated by the coordinates $x_0, x_1, \cdots, x_m$.

Proof. The identity is basically algebra. The multiple integral can be carried out by iteration in any order and after enough variables are integrated we get our identity. To prove that the conditional probabilities are given by the right formula we need to establish

$$P[\{x_n \in A\} \cap B] = \int_B \pi_{m,n}(x_m, A)\,dP$$

for all $B \in \Sigma_m$ and $A \in \mathcal{F}$. We write

$$\begin{aligned}
P[\{x_n \in A\} \cap B] &= \int_{\{x_n \in A\} \cap B} dP \\
&= \int \cdots \int_{\{x_n \in A\} \cap B} d\mu(x_0)\,\pi_{0,1}(x_0, dx_1) \cdots \pi_{m-1,m}(x_{m-1}, dx_m)\,\pi_{m,m+1}(x_m, dx_{m+1}) \cdots \pi_{n-1,n}(x_{n-1}, dx_n) \\
&= \int \cdots \int_B d\mu(x_0)\,\pi_{0,1}(x_0, dx_1) \cdots \pi_{m-1,m}(x_{m-1}, dx_m)\,\pi_{m,m+1}(x_m, dx_{m+1}) \cdots \pi_{n-1,n}(x_{n-1}, A) \\
&= \int \cdots \int_B d\mu(x_0)\,\pi_{0,1}(x_0, dx_1) \cdots \pi_{m-1,m}(x_{m-1}, dx_m)\,\pi_{m,n}(x_m, A) \\
&= \int_B \pi_{m,n}(x_m, A)\,dP
\end{aligned}$$
and we are done.

Remark 4.8. If the chain has stationary transition probabilities then the transition probabilities $\pi_{m,n}(x, dy)$ from time $m$ to time $n$ depend only on the difference $k = n - m$ and are given by what are usually called the $k$ step transition probabilities. They are defined inductively by

$$\pi^{(k+1)}(x, A) = \int_{\mathcal{X}} \pi^{(k)}(x, dy)\,\pi(y, A)$$

and satisfy the Chapman-Kolmogorov equations

$$\pi^{(k+\ell)}(x, A) = \int_{\mathcal{X}} \pi^{(k)}(x, dy)\,\pi^{(\ell)}(y, A) = \int_{\mathcal{X}} \pi^{(\ell)}(x, dy)\,\pi^{(k)}(y, A)$$
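On a finite state space the integrals in the Chapman-Kolmogorov equations become matrix products, so the identity can be checked by direct computation. The following is a minimal sketch under that assumption; the 3-state matrix `PI` and the helper names `mat_mul` and `n_step` are hypothetical choices made for illustration and do not come from the text. Exact rational arithmetic is used so that the two sides can be compared with `==`.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(pi, n):
    """n-step matrix pi^(n), built from pi^(0) = identity by pi^(k+1) = pi^(k) pi."""
    size = len(pi)
    power = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        power = mat_mul(power, pi)
    return power


# Hypothetical 3-state stochastic matrix (each row sums to 1).
PI = [
    [Fraction(1, 2), Fraction(1, 2), Fraction(0)],
    [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
    [Fraction(0), Fraction(1, 3), Fraction(2, 3)],
]

k, l = 3, 4
lhs = n_step(PI, k + l)
via_k_then_l = mat_mul(n_step(PI, k), n_step(PI, l))
via_l_then_k = mat_mul(n_step(PI, l), n_step(PI, k))

# pi^(k+l) = pi^(k) pi^(l) = pi^(l) pi^(k), and pi^(k+l) is again stochastic.
chapman_kolmogorov_holds = lhs == via_k_then_l == via_l_then_k
rows_sum_to_one = all(sum(row) == 1 for row in lhs)
```

Both flags evaluate to `True` for this matrix; the same check can be repeated with any other stochastic matrix and any pair of step counts.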
Suppose we have a probability measure $P$ on the product space $X \times Y \times Z$ with the product σ-field. The Markov property in this context refers to equality

$$E^P[g(z) \mid \Sigma_{x,y}] = E^P[g(z) \mid \Sigma_y] \quad \text{a.e. } P \tag{4.7}$$

for bounded measurable functions $g$ on $Z$, where we have used $\Sigma_{x,y}$ to denote the σ-field generated by projection on to $X \times Y$ and $\Sigma_y$ the corresponding σ-field generated by projection on to $Y$. The Markov property in the reverse direction is the similar condition for bounded measurable functions $f$ on $X$.

$$E^P[f(x) \mid \Sigma_{y,z}] = E^P[f(x) \mid \Sigma_y] \quad \text{a.e. } P \tag{4.8}$$

They look different. But they are both equivalent to the symmetric condition

$$E^P[f(x)g(z) \mid \Sigma_y] = E^P[f(x) \mid \Sigma_y]\,E^P[g(z) \mid \Sigma_y] \quad \text{a.e. } P \tag{4.9}$$
which says that given the present, the past and future are conditionally independent. In view of the symmetry it is sufficient to prove the following:

Theorem 4.9. For any $P$ on $(X \times Y \times Z)$ the relations (4.7) and (4.9) are equivalent.

Proof. Let us fix $f$ and $g$. Let us denote the common value in (4.7) by $\hat{g}(y)$. Then

$$\begin{aligned}
E^P[f(x)g(z) \mid \Sigma_y] &= E^P[\,E^P[f(x)g(z) \mid \Sigma_{x,y}] \mid \Sigma_y] && \text{a.e. } P \\
&= E^P[f(x)\,E^P[g(z) \mid \Sigma_{x,y}] \mid \Sigma_y] && \text{a.e. } P \\
&= E^P[f(x)\hat{g}(y) \mid \Sigma_y] && \text{a.e. } P \text{ (by (4.7))} \\
&= E^P[f(x) \mid \Sigma_y]\,\hat{g}(y) && \text{a.e. } P \\
&= E^P[f(x) \mid \Sigma_y]\,E^P[g(z) \mid \Sigma_y] && \text{a.e. } P
\end{aligned}$$

which is (4.9). Conversely, we assume (4.9) and denote by $\bar{g}(x, y)$ and $\hat{g}(y)$ the expressions on the left and right side of (4.7). Let $b(y)$ be a bounded measurable function on $Y$.

$$\begin{aligned}
E^P[f(x)b(y)\bar{g}(x, y)] &= E^P[f(x)b(y)g(z)] \\
&= E^P[b(y)\,E^P[f(x)g(z) \mid \Sigma_y]] \\
&= E^P[b(y)\,E^P[f(x) \mid \Sigma_y]\,E^P[g(z) \mid \Sigma_y]] \\
&= E^P[b(y)\,E^P[f(x) \mid \Sigma_y]\,\hat{g}(y)] \\
&= E^P[f(x)b(y)\hat{g}(y)].
\end{aligned}$$

Since $f$ and $b$ are arbitrary this implies that $\bar{g}(x, y) = \hat{g}(y)$ a.e. $P$.

Let us look at some examples.
1. Suppose we have an urn containing a certain number of balls (nonzero), some red and others green. A ball is drawn at random and its color is noted. Then it is returned to the urn along with an extra ball of the same color. Then a new ball is drawn at random and the process continues ad infinitum. The current state of the system can be characterized by two integers $r, g$ such that $r + g \ge 1$. The initial state of the system is some $r_0, g_0$ with $r_0 + g_0 \ge 1$. The system can go from $(r, g)$ to either $(r + 1, g)$ with probability $\frac{r}{r+g}$ or to $(r, g + 1)$ with probability $\frac{g}{r+g}$. This is clearly an example of a Markov Chain with stationary transition probabilities.

2. Consider a queue for service in a store. Suppose at each of the times $1, 2, \cdots$, a random number of new customers arrive and join the queue. If the queue is nonempty at some time, then exactly one customer will be served and will leave the queue at the next time point. The distribution of the number of new arrivals is specified by $\{p_j : j \ge 0\}$ where $p_j$ is the probability that exactly $j$ new customers arrive at a given time. The numbers of new arrivals at distinct times are assumed to be independent. The queue length is a Markov Chain on the state space $\mathcal{X} = \{0, 1, \cdots\}$ of nonnegative integers. The transition probabilities $\pi(i, j)$ are given by $\pi(0, j) = p_j$ because there is no service and nobody in the queue to begin with and all the new arrivals join the queue. On the other hand $\pi(i, j) = p_{j-i+1}$ if $j + 1 \ge i \ge 1$ because one person leaves the queue after being served.

3. Consider a reservoir into which water flows. The amount of additional water flowing into the reservoir on any given day is random, and has a distribution $\alpha$ on $[0, \infty)$. The demand is also random for any given day, with a probability distribution $\beta$ on $[0, \infty)$. We may also assume that the inflows and demands on successive days are random variables $\xi_n$ and $\eta_n$ that have $\alpha$ and $\beta$ for their common distributions and are all mutually independent. We may wish to assume a percentage loss due to evaporation. In any case the storage levels at successive days satisfy a recurrence relation

$$S_{n+1} = [(1 - p)S_n + \xi_n - \eta_n]^+$$

where $p$ is the loss and we have put the condition that the outflow is the demand unless the stored amount is less than the demand, in which case the outflow is the available quantity. The current amount in storage is a Markov Process with stationary transition probabilities.

4. Let $X_1, \cdots, X_n, \cdots$ be a sequence of independent random variables with a common distribution $\alpha$. Let $S_n = Y + X_1 + \cdots + X_n$ for $n \ge 1$ with $S_0 = Y$ where $Y$ is a random variable independent of $X_1, \ldots, X_n, \ldots$ with distribution $\mu$. Then $S_n$ is a Markov chain on $R$ with one step transition probability $\pi(x, A) = \alpha(A - x)$ and initial distribution $\mu$. The $n$ step transition probability is $\alpha^n(A - x)$ where $\alpha^n$ is the $n$-fold convolution of $\alpha$. This is often referred to as a random walk.

The last two examples can be described by models of the type

$$x_n = f(x_{n-1}, \xi_n)$$

where $x_n$ is the current state and $\xi_n$ is some random external disturbance. The $\xi_n$ are assumed to be independent and identically distributed. They could have two components like inflow and demand. The new state is a deterministic function of the old state and the noise.

Exercise 4.11. Verify that the first two examples can be cast in the above form. In fact there is no loss of generality in assuming that $\xi_j$ are mutually independent random variables having as common distribution the uniform distribution on the interval $[0, 1]$.

Given a Markov Chain with stationary transition probabilities $\pi(x, dy)$ on a state space $(\mathcal{X}, \mathcal{F})$, the behavior of $\pi^{(n)}(x, dy)$ for large $n$ is an important and natural question. In the best situation of independent random variables $\pi^{(n)}(x, A) = \mu(A)$ are independent of $x$ as well as $n$. Hopefully after a long time the Chain will 'forget' its origins and $\pi^{(n)}(x, \cdot) \to \mu(\cdot)$, in some suitable sense, for some $\mu$ that does not depend on $x$. If that happens, then from the relation

$$\pi^{(n+1)}(x, A) = \int \pi^{(n)}(x, dy)\,\pi(y, A),$$

we conclude

$$\mu(A) = \int \pi(y, A)\,d\mu(y) \quad \text{for all } A \in \mathcal{F}.$$
Measures that satisfy the above property, abbreviated as $\mu\pi = \mu$, are called invariant measures for the Markov Chain. If we start with the initial distribution $\mu$ which is invariant then the probability measure $P$ has $\mu$ as marginal at every time. In fact $P$ is stationary, i.e. invariant with respect to time translation, and can be extended to a stationary process where time runs from $-\infty$ to $+\infty$.
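For a finite state space the relation $\mu\pi = \mu$ is a linear system, and when the chain does forget its starting point an invariant distribution can be approximated by pushing any initial distribution forward repeatedly. The sketch below assumes a hypothetical 3-state matrix `PI` chosen for illustration (irreducible and aperiodic, so that the iteration settles down); the function name `step` is likewise an assumption, not notation from the text.

```python
def step(mu, pi):
    """One step on distributions: (mu pi)(y) = sum_x mu(x) pi(x, y)."""
    n = len(pi)
    return [sum(mu[x] * pi[x][y] for x in range(n)) for y in range(n)]


# Hypothetical 3-state stochastic matrix (each row sums to 1).
PI = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 1.0 / 3.0, 2.0 / 3.0],
]

# Start from the point mass at state 0 and iterate mu <- mu pi.
mu = [1.0, 0.0, 0.0]
for _ in range(200):
    mu = step(mu, PI)

# Invariance check: one more step should leave mu (numerically) unchanged.
residual = max(abs(a - b) for a, b in zip(step(mu, PI), mu))
```

For this particular matrix the invariant distribution can also be found by hand as $(2/9, 4/9, 3/9)$, and the iterates agree with it to within rounding error. Starting from a different point mass leads to the same limit.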
4.5 Stopping Times and Renewal Times
One of the important notions in the analysis of Markov Chains is the idea of stopping times and renewal times. Let $\tau(\omega) : \Omega \to \{n : n \ge 0\}$ be a random variable defined on the set $\Omega = \mathcal{X}^\infty$ such that for every $n \ge 0$ the set $\{\omega : \tau(\omega) = n\}$ (or equivalently for each $n \ge 0$ the set $\{\omega : \tau(\omega) \le n\}$) is measurable with respect to the σ-field $\mathcal{F}_n$ generated by $X_j : 0 \le j \le n$. It is not necessary that $\tau(\omega) < \infty$ for every $\omega$. Such random variables $\tau$ are called stopping times. Examples of stopping times are constant times $n \ge 0$, the first visit to a state $x$, or the second visit to a state $x$. The important thing is that in order to decide if $\tau \le n$, i.e. to know if whatever is supposed to happen did happen at or before time $n$, the chain need be observed only up to time $n$. Examples of $\tau$ that are not stopping times are easy to find. The last time a site is visited is not a stopping time, nor is the first time such that at the next time one is in a state $x$.

An important fact is that the Markov property extends to stopping times. Just as we have σ-fields $\mathcal{F}_n$ associated with constant times, we do have a σ-field $\mathcal{F}_\tau$ associated to any stopping time. This is the information we have when we observe the chain up to time $\tau$. Formally

$$\mathcal{F}_\tau = \{A : A \in \mathcal{F}^\infty \text{ and } A \cap \{\tau \le n\} \in \mathcal{F}_n \text{ for each } n\}.$$

One can check from the definition that $\tau$ is $\mathcal{F}_\tau$ measurable and so is $X_\tau$ on the set $\tau < \infty$. If $\tau$ is the time of first visit to $y$ then $\tau$ is a stopping time and the event that the chain visits a state $z$ before visiting $y$ is $\mathcal{F}_\tau$ measurable.

Lemma 4.10. (Strong Markov Property.) At any stopping time $\tau$ the Markov property holds in the sense that the conditional distribution of $X_{\tau+1}, \cdots, X_{\tau+n}, \cdots$ conditioned on $\mathcal{F}_\tau$ is the same as the original chain starting from the state $x = X_\tau$ on the set $\tau < \infty$. In other words

$$P_x\{X_{\tau+1} \in A_1, \cdots, X_{\tau+n} \in A_n \mid \mathcal{F}_\tau\} = \int_{A_1} \cdots \int_{A_n} \pi(X_\tau, dx_1) \cdots \pi(x_{n-1}, dx_n)$$

a.e. on $\{\tau < \infty\}$.

Proof. Let $A \in \mathcal{F}_\tau$ be given with $A \subset \{\tau < \infty\}$. Then

$$\begin{aligned}
P_x\{A \cap \{X_{\tau+1} \in A_1, \cdots, X_{\tau+n} \in A_n\}\}
&= \sum_k P_x\{A \cap \{\tau = k\} \cap \{X_{k+1} \in A_1, \cdots, X_{k+n} \in A_n\}\} \\
&= \sum_k \int_{A \cap \{\tau = k\}} \int_{A_1} \cdots \int_{A_n} \pi(X_k, dx_{k+1}) \cdots \pi(x_{k+n-1}, dx_{k+n})\,dP_x \\
&= \int_A \int_{A_1} \cdots \int_{A_n} \pi(X_\tau, dx_1) \cdots \pi(x_{n-1}, dx_n)\,dP_x
\end{aligned}$$

We have used the fact that if $A \in \mathcal{F}_\tau$ then $A \cap \{\tau = k\} \in \mathcal{F}_k$ for every $k \ge 0$.

Remark 4.9. If $X_\tau = y$ a.e. with respect to $P_x$ on the set $\tau < \infty$, then at time $\tau$, when it is finite, the process starts afresh with no memory of the past and will have conditionally the same probabilities in the future as $P_y$. At such times the process renews itself and these times are called renewal times.
4.6 Countable State Space

From the point of view of analysis a particularly simple situation is when the state space $\mathcal{X}$ is a countable set. It can be taken as the integers $\{x : x \ge 1\}$. Many applications fall in this category and an understanding of what happens in this situation will tell us what to expect in general. The one step transition probability is a matrix $\pi(x, y)$ with nonnegative entries such that $\sum_y \pi(x, y) = 1$ for each $x$. Such matrices are called stochastic matrices. The $n$ step transition matrix is just the $n$-th power of the matrix, defined inductively by

$$\pi^{(n+1)}(x, y) = \sum_z \pi^{(n)}(x, z)\,\pi(z, y)$$

To be consistent one defines $\pi^{(0)}(x, y) = \delta_{x,y}$ which is 1 if $x = y$ and 0 otherwise. The problem is to analyse the behaviour for large $n$ of $\pi^{(n)}(x, y)$. A state $x$ is said to communicate with a state $y$ if $\pi^{(n)}(x, y) > 0$ for some $n \ge 1$. We will assume for simplicity that every state communicates with every other state. Such Markov Chains are called irreducible. Let us first limit ourselves to the study of irreducible chains. Given an irreducible Markov chain with transition probabilities $\pi(x, y)$ we define $f_n(x)$ as the probability of returning to $x$ for the first time at the $n$-th step assuming that the chain starts from the state $x$. Using the convention that $P_x$ refers to the measure on sequences for the chain starting from $x$ and $\{X_j\}$ are the successive positions of the chain,

$$f_n(x) = P_x\{X_j \ne x \text{ for } 1 \le j \le n - 1 \text{ and } X_n = x\} = \sum_{y_1 \ne x, \cdots, y_{n-1} \ne x} \pi(x, y_1)\,\pi(y_1, y_2) \cdots \pi(y_{n-1}, x)$$

Since $f_n(x)$ are probabilities of disjoint events, $\sum_n f_n(x) \le 1$. The state $x$ is called transient if $\sum_n f_n(x) < 1$ and recurrent if $\sum_n f_n(x) = 1$. The recurrent case is divided into two situations. If we denote by $\tau_x = \inf\{n \ge 1 : X_n = x\}$ the time of first visit to $x$, then recurrence is $P_x\{\tau_x < \infty\} = 1$. A recurrent state $x$ is called positive recurrent if

$$E^{P_x}\{\tau_x\} = \sum_{n \ge 1} n f_n(x) < \infty$$

and null recurrent if

$$E^{P_x}\{\tau_x\} = \sum_{n \ge 1} n f_n(x) = \infty$$
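The definition of $f_n(x)$ as a sum over paths that avoid $x$ translates directly into an iteration: propagate the distribution of the chain one step at a time while discarding the mass that has already returned to $x$. The following is a minimal sketch for a finite chain; the 3-state matrix `PI` and the function name `first_return_probs` are hypothetical, and the sums are truncated at 500 terms on the assumption that the remaining tail is negligible for this matrix.

```python
def first_return_probs(pi, x, n_max):
    """f_n(x) for n = 1..n_max: probability of first return to x at step n."""
    size = len(pi)
    # v[y] = P_x{X_1 != x, ..., X_{n-1} != x, X_n = y}; for n = 1 this is pi(x, y).
    v = list(pi[x])
    probs = []
    for _ in range(n_max):
        probs.append(v[x])
        v[x] = 0.0  # discard paths that have already returned to x
        v = [sum(v[z] * pi[z][y] for z in range(size)) for y in range(size)]
    return probs


# Hypothetical 3-state stochastic matrix (each row sums to 1).
PI = [
    [0.5, 0.5, 0.0],
    [0.25, 0.5, 0.25],
    [0.0, 1.0 / 3.0, 2.0 / 3.0],
]

f = first_return_probs(PI, 0, 500)
total = sum(f)                                                # sum_n f_n(0)
mean_return_time = sum(n * fn for n, fn in enumerate(f, start=1))  # sum_n n f_n(0)
```

For this matrix `total` is numerically 1 and `mean_return_time` is finite (about 4.5), so state 0 is positive recurrent in the sense just defined.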
Lemma 4.11. If for a (not necessarily irreducible) chain starting from $x$, the probability of ever visiting $y$ is positive then so is the probability of visiting $y$ before returning to $x$.

Proof. Assume that for the chain starting from $x$ the probability of visiting $y$ before returning to $x$ is zero. But when it returns to $x$ it starts afresh and so will not visit $y$ until it returns again. This reasoning can be repeated and so the chain will have to visit $x$ infinitely often before visiting $y$. But this will use up all the time and so it cannot visit $y$ at all.

Lemma 4.12. For an irreducible chain all states $x$ are of the same type.

Proof. Let $x$ be recurrent and $y$ be given. Since the chain is irreducible, for some $k$, $\pi^{(k)}(x, y) > 0$. By the previous lemma, for the chain starting from $x$, there is a positive probability of visiting $y$ before returning to $x$. After each successive return to $x$, the chain starts afresh and there is a fixed positive probability of visiting $y$ before the next return to $x$. Since there are infinitely many returns to $x$, $y$ will be visited infinitely many times as well. In other words $y$ is also a recurrent state.

We now prove that if $x$ is positive recurrent then so is $y$. We saw already that the probability $p = P_x\{\tau_y < \tau_x\}$ of visiting $y$ before returning to $x$ is positive. Clearly

$$E^{P_x}\{\tau_x\} \ge P_x\{\tau_y < \tau_x\}\,E^{P_y}\{\tau_x\}$$

and therefore
$$E^{P_y}\{\tau_x\} \le \frac{1}{p}\,E^{P_x}\{\tau_x\} < \infty.$$

On the other hand we can write $E^{P_x}\{\tau_y\} \le \int \tau_x\,dP_x + \int \tau_y\,dP_x$ τy 0} then $q_{-k} = a$ for $k \in S$. We can then deduce from equation 4.11 that $q_{-k} = a$ for $k = k_1 + k_2$ with $k_1, k_2 \in S$. By repeating the same reasoning $q_{-k} = a$ for $k = k_1 + k_2 + \cdots + k_\ell$. By lemma 3.6, because the greatest common factor of the integers in $S$ is 1, there is a $k_0$ such that for $k \ge k_0$ we have $q_{-k} = a$. We now apply the relation 4.11 again to conclude that $q_j = a$ for all positive as well as negative $j$.

Step 3: If we add up equation 4.10 for $n = 1, \cdots, N$ we get

$$p_1 + p_2 + \cdots + p_N = (f_1 + f_2 + \cdots + f_N) + (f_1 + f_2 + \cdots + f_{N-1})\,p_1 + \cdots + (f_1 + f_2 + \cdots + f_{N-k})\,p_k + \cdots + f_1\,p_{N-1}$$

If we denote by $T_j = \sum_{i=j}^{\infty} f_i$, we have $T_1 = 1$ and $\sum_{j=1}^{\infty} T_j = m$. We can now rewrite

$$\sum_{j=1}^{N} T_j\,p_{N-j+1} = \sum_{j=1}^{N} f_j$$

Step 4: Because $p_{N-j} \to a$ for every $j$ along the subsequence $N = n_k$, if $\sum_j T_j = m < \infty$, we can deduce from the dominated convergence theorem that $m\,a = 1$ and we conclude that

$$\limsup_{n \to \infty} p_n = \frac{1}{m}$$
If $\sum_j T_j = \infty$, by Fatou's Lemma $a = 0$. Exactly the same argument applies to liminf and we conclude that

$$\liminf_{n \to \infty} p_n = \frac{1}{m}$$

This concludes the proof of the renewal theorem. We now turn to

Proof. (of Theorem 4.15). If we take a fixed $x \in \mathcal{X}$ and consider $f_n = P_x\{\tau_x = n\}$, then $f_n$ and $p_n = \pi^{(n)}(x, x)$ are related by (1) and $m = E^{P_x}\{\tau_x\}$. In order to apply the renewal theorem we need to establish that the greatest common divisor of $S = \{n : f_n > 0\}$ is 1. In general if $f_n > 0$ so is $p_n$. So the greatest common divisor of $S$ is always at least as large as that of $\{n : p_n > 0\}$. That does not help us because the greatest common divisor of $\{n : p_n > 0\}$ is 1. On the other hand if $f_n = 0$ unless $n = k\,d$ for some $k$, the relation 4.10 can be used inductively to conclude that the same is true of $p_n$. Hence both sets have the same greatest common divisor. We can now conclude that

$$\lim_{n \to \infty} \pi^{(n)}(x, x) = q(x) = \frac{1}{m(x)}$$

On the other hand if $f_n(x, y) = P_x\{\tau_y = n\}$, then

$$\pi^{(n)}(x, y) = \sum_{k=1}^{n} f_k(x, y)\,\pi^{(n-k)}(y, y)$$

and recurrence implies $\sum_{k=1}^{\infty} f_k(x, y) = 1$ for all $x$ and $y$. Therefore

$$\lim_{n \to \infty} \pi^{(n)}(x, y) = q(y) = \frac{1}{m(y)}$$

and is independent of $x$, the starting point. In order to complete the proof we have to establish that

$$Q = \sum_y q(y) = 1$$

It is clear by Fatou's lemma that

$$\sum_y q(y) = Q \le 1$$
By letting $n \to \infty$ in the Chapman-Kolmogorov equation

$$\pi^{(n+1)}(x, y) = \sum_z \pi^{(n)}(x, z)\,\pi(z, y)$$

and using Fatou's lemma we get

$$q(y) \ge \sum_z \pi(z, y)\,q(z)$$

Summing with respect to $y$ we obtain

$$Q \ge \sum_{z,y} \pi(z, y)\,q(z) = Q$$

and equality holds in this relation. Therefore

$$q(y) = \sum_z \pi(z, y)\,q(z)$$

for every $y$, or $q(\cdot)$ is an invariant measure. By iteration

$$q(y) = \sum_z \pi^{(n)}(z, y)\,q(z)$$

and if we let $n \to \infty$ again an application of the bounded convergence theorem yields

$$q(y) = Q\,q(y)$$

implying $Q = 1$ and we are done.

Let us now consider an irreducible Markov Chain with one step transition probability $\pi(x, y)$ that is periodic with period $d > 1$. Let us choose and fix a reference point $x_0 \in \mathcal{X}$. For each $x \in \mathcal{X}$ let $D_{x_0,x} = \{n : \pi^{(n)}(x_0, x) > 0\}$.

Lemma 4.17. If $n_1, n_2 \in D_{x_0,x}$ then $d$ divides $n_1 - n_2$.

Proof. Since the chain is irreducible there is an $m$ such that $\pi^{(m)}(x, x_0) > 0$. By the Chapman-Kolmogorov equations $\pi^{(m+n_i)}(x_0, x_0) > 0$ for $i = 1, 2$. Therefore $m + n_i \in D_{x_0} = D_{x_0,x_0}$ for $i = 1, 2$. This implies that $d$ divides both $m + n_1$ as well as $m + n_2$. Thus $d$ divides $n_1 - n_2$.
The residues modulo $d$ of all the integers in $D_{x_0,x}$ are the same and equal some number $r(x)$, satisfying $0 \le r(x) \le d - 1$. By definition $r(x_0) = 0$. Let us define $\mathcal{X}_j = \{x : r(x) = j\}$. Then $\{\mathcal{X}_j : 0 \le j \le d - 1\}$ is a partition of $\mathcal{X}$ into disjoint sets with $x_0 \in \mathcal{X}_0$.

Lemma 4.18. If $x \in \mathcal{X}$, then $\pi^{(n)}(x, y) = 0$ unless $r(x) + n = r(y)$ mod $d$.

Proof. Suppose that $x \in \mathcal{X}$ and $\pi(x, y) > 0$. Then if $m \in D_{x_0,x}$ then $(m + 1) \in D_{x_0,y}$. Therefore $r(x) + 1 = r(y)$ modulo $d$. The proof can be completed by induction. The chain marches through $\{\mathcal{X}_j\}$ in a cyclical way from a state in $\mathcal{X}_j$ to one in $\mathcal{X}_{j+1}$.

Theorem 4.19. Let $\mathcal{X}$ be irreducible and positive recurrent with period $d$. Then

$$\lim_{\substack{n \to \infty \\ n + r(x) = r(y) \text{ modulo } d}} \pi^{(n)}(x, y) = \frac{d}{m(y)}$$

Of course

$$\pi^{(n)}(x, y) = 0$$

unless $n + r(x) = r(y)$ modulo $d$.

Proof. If we replace $\pi$ by $\tilde{\pi}$ where $\tilde{\pi}(x, y) = \pi^{(d)}(x, y)$, then $\tilde{\pi}(x, y) = 0$ unless both $x$ and $y$ are in the same $\mathcal{X}_j$. The restriction of $\tilde{\pi}$ to each $\mathcal{X}_j$ defines an irreducible aperiodic Markov chain. Since each time step under $\tilde{\pi}$ is actually $d$ units of time we can apply the earlier results and we will get for $x, y \in \mathcal{X}_j$ for some $j$,

$$\lim_{k \to \infty} \pi^{(k d)}(x, y) = \frac{d}{m(y)}$$

We note that

$$\pi^{(n)}(x, y) = \sum_{1 \le m \le n} f_m(x, y)\,\pi^{(n-m)}(y, y)$$

$$f_m(x, y) = P_x\{\tau_y = m\} = 0 \text{ unless } r(x) + m = r(y) \text{ modulo } d$$

$$\pi^{(n-m)}(y, y) = 0 \text{ unless } n - m = 0 \text{ modulo } d$$

$$\sum_m f_m(x, y) = 1.$$

The theorem now follows.
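The periodic limit can be observed numerically on a small example. The sketch below assumes a hypothetical chain, not taken from the text: a walk on a cycle of 4 states that moves one step clockwise with probability $1/3$ and one step counterclockwise with probability $2/3$. It has period $d = 2$ with cyclic classes $\{0, 2\}$ and $\{1, 3\}$, and since the matrix is doubly stochastic the uniform distribution is invariant, which gives $m(y) = 4$ for every state.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(pi, n):
    """n-step matrix pi^(n), built from pi^(0) = identity by pi^(k+1) = pi^(k) pi."""
    size = len(pi)
    power = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        power = mat_mul(power, pi)
    return power


# Hypothetical walk on a 4-cycle: i -> i+1 with probability p, i -> i-1 with 1 - p.
p = 1.0 / 3.0
PI = [
    [0.0, p, 0.0, 1.0 - p],
    [1.0 - p, 0.0, p, 0.0],
    [0.0, 1.0 - p, 0.0, p],
    [p, 0.0, 1.0 - p, 0.0],
]

d = 2       # period of the chain
m_y = 4.0   # mean return time of every state (uniform invariant distribution)

even = n_step(PI, 60)  # n = 0 modulo d
odd = n_step(PI, 61)   # n = 1 modulo d
```

Along even times `even[0][0]` and `even[0][2]` are numerically `d / m_y = 0.5` while `even[0][1]` vanishes; along odd times the roles of the two cyclic classes are exchanged.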
Suppose now we have a chain that is not irreducible. Let us collect all the transient states and call the set $\mathcal{X}_{tr}$. The complement consists of all the recurrent states and will be denoted by $\mathcal{X}_{re}$.

Lemma 4.20. If $x \in \mathcal{X}_{re}$ and $y \in \mathcal{X}_{tr}$, then $\pi(x, y) = 0$.

Proof. If $x$ is a recurrent state, and $\pi(x, y) > 0$, the chain will return to $x$ infinitely often and each time there is a positive probability of visiting $y$. By the renewal property these are independent events and so $y$ will be recurrent too.

The set of recurrent states $\mathcal{X}_{re}$ can be divided into one or more equivalence classes according to the following procedure. Two recurrent states $x$ and $y$ are in the same equivalence class if $f(x, y) = P_x\{\tau_y < \infty\}$, the probability of ever visiting $y$ starting from $x$, is positive. Because of recurrence if $f(x, y) > 0$ then $f(x, y) = f(y, x) = 1$. The restriction of the chain to a single equivalence class is irreducible and possibly periodic. Different equivalence classes could have different periods, some could be positive recurrent and others null recurrent. We can combine all our observations into the following theorem.

Theorem 4.21. If $y$ is transient then $\sum_n \pi^{(n)}(x, y) < \infty$ for all $x$. If $y$ is null recurrent (belongs to an equivalence class that is null recurrent) then $\pi^{(n)}(x, y) \to 0$ for all $x$, but $\sum_n \pi^{(n)}(x, y) = \infty$ if $x$ is in the same equivalence class or $x \in \mathcal{X}_{tr}$ with $f(x, y) > 0$. In all other cases $\pi^{(n)}(x, y) = 0$ for all $n \ge 1$. If $y$ is positive recurrent and belongs to an equivalence class with period $d$ with $m(y) = E^{P_y}\{\tau_y\}$, then for a nontransient $x$, $\pi^{(n)}(x, y) = 0$ unless $x$ is in the same equivalence class and $r(x) + n = r(y)$ modulo $d$. In such a case,

$$\lim_{\substack{n \to \infty \\ r(x) + n = r(y) \text{ modulo } d}} \pi^{(n)}(x, y) = \frac{d}{m(y)}.$$

If $x$ is transient then

$$\lim_{\substack{n \to \infty \\ n = r \text{ modulo } d}} \pi^{(n)}(x, y) = f(r, x, y)\,\frac{d}{m(y)}$$

where $f(r, x, y) = P_x\{X_{kd+r} = y \text{ for some } k \ge 0\}$.
Proof. The only statement that needs an explanation is the last one. The chain starting from a transient state $x$ may at some time get into a positive recurrent equivalence class $\mathcal{X}_j$ with period $d$. If it does, it never leaves that class and so gets absorbed in that class. The probability of this is $f(x, y)$ where $y$ can be any state in $\mathcal{X}_j$. However if the period $d$ is greater than 1, there will be cyclical subclasses $C_1, \cdots, C_d$ of $\mathcal{X}_j$. Depending on which subclass the chain enters and when, the phase of its future is determined. There are $d$ such possible phases. For instance, if the subclasses are ordered in the correct way, getting into $C_1$ at time $n$ is the same as getting into $C_2$ at time $n + 1$ and so on. $f(r, x, y)$ is the probability of getting into the equivalence class in a phase that visits the cyclical subclass containing $y$ at times $n$ that are equal to $r$ modulo $d$.

Example 4.1. (Simple Random Walk). If $\mathcal{X} = Z^d$, the integral lattice in $R^d$, a random walk is a Markov chain with transition probability $\pi(x, y) = p(y - x)$ where $\{p(z)\}$ specifies the probability distribution of a single step. We will assume for simplicity that $p(z) = 0$ except when $z \in F$ where $F$ consists of the $2d$ neighbors of 0 and $p(z) = \frac{1}{2d}$ for each $z \in F$. For $\xi \in R^d$ the characteristic function $\hat{p}(\xi)$ of $p(\cdot)$ is given by $\frac{1}{d}(\cos \xi_1 + \cos \xi_2 + \cdots + \cos \xi_d)$. The chain is easily seen to be irreducible, but periodic of period 2. Return to the starting point is possible only after an even number of steps.

$$\pi^{(2n)}(0, 0) = \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} [\hat{p}(\xi)]^{2n}\,d\xi \simeq \frac{C}{n^{d/2}}.$$

To see this asymptotic behavior let us first note that the integration can be restricted to the set where $|\hat{p}(\xi)| \ge 1 - \delta$, or near the 2 points $(0, 0, \cdots, 0)$ and $(\pi, \pi, \cdots, \pi)$ where $|\hat{p}(\xi)| = 1$. Since the behaviour is similar at both points let us concentrate near the origin.

$$\frac{1}{d} \sum_{j=1}^{d} \cos \xi_j \le 1 - c \sum_j \xi_j^2 \le \exp\Big[-c \sum_j \xi_j^2\Big]$$

for some $c > 0$ and

$$\Big[\frac{1}{d} \sum_{j=1}^{d} \cos \xi_j\Big]^{2n} \le \exp\Big[-2\,n\,c \sum_j \xi_j^2\Big]$$

and with a change of variables the upper bound is clear. We have a similar lower bound as well. The random walk is recurrent if $d = 1$ or 2 but transient if $d \ge 3$.

Exercise 4.12. If the distribution $p(\cdot)$ is arbitrary, determine when the chain is irreducible and when it is irreducible and aperiodic.

Exercise 4.13. If $\sum_z z\,p(z) = m \ne 0$ conclude that the chain is transient by an application of the strong law of large numbers.

Exercise 4.14. If $\sum_z z\,p(z) = m = 0$, and if the covariance matrix given by $\sum_z z_i z_j\,p(z) = \sigma_{i,j}$ is nondegenerate, show that the transience or recurrence is determined by the dimension as in the case of the nearest neighbor random walk.

Exercise 4.15. Can you make sense of the formal calculation

$$\begin{aligned}
\sum_n \pi^{(n)}(0, 0) &= \sum_n \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} [\hat{p}(\xi)]^n\,d\xi \\
&= \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \frac{1}{1 - \hat{p}(\xi)}\,d\xi \\
&= \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \text{Real Part}\Big[\frac{1}{1 - \hat{p}(\xi)}\Big]\,d\xi
\end{aligned}$$

to conclude that a necessary and sufficient condition for transience or recurrence is the convergence or divergence of the integral

$$\int_{T^d} \text{Real Part}\Big[\frac{1}{1 - \hat{p}(\xi)}\Big]\,d\xi$$

with an integrand

$$\text{Real Part}\Big[\frac{1}{1 - \hat{p}(\xi)}\Big]$$

that is seen to be nonnegative? Hint: Consider instead the sum

$$\begin{aligned}
\sum_{n=0}^{\infty} \rho^n \pi^{(n)}(0, 0) &= \sum_n \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \rho^n [\hat{p}(\xi)]^n\,d\xi \\
&= \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \frac{1}{1 - \rho\,\hat{p}(\xi)}\,d\xi \\
&= \Big(\frac{1}{2\pi}\Big)^d \int_{T^d} \text{Real Part}\Big[\frac{1}{1 - \rho\,\hat{p}(\xi)}\Big]\,d\xi
\end{aligned}$$
for $0 < \rho < 1$ and let $\rho \to 1$.

Example 4.2. (The Queue Problem). In the example of customers arriving, except in the trivial cases of $p_0 = 0$ or $p_0 + p_1 = 1$ the chain is irreducible and aperiodic. Since the service rate is at most 1, if the arrival rate $m = \sum_j j\,p_j > 1$ then the queue will get longer and by an application of the law of large numbers it is seen that the queue length will become infinite as time progresses. This is the transient behavior of the queue. If $m < 1$, one can expect the situation to be stable and there should be an asymptotic distribution for the queue length. If $m = 1$, it is the borderline case and one should probably expect this to be the null recurrent case. The actual proofs are not hard. In time $n$ the actual number of customers served is at most $n$ because the queue may sometimes be empty. If $\{\xi_i : i \ge 1\}$ are the numbers of new customers arriving at time $i$ and $X_0$ is the initial number in the queue, then the number $X_n$ in the queue at time $n$ satisfies $X_n \ge X_0 + (\sum_{i=1}^{n} \xi_i) - n$ and if $m > 1$, it follows from the law of large numbers that $\lim_{n \to \infty} X_n = +\infty$, thereby establishing transience. To prove positive recurrence when $m < 1$ it is sufficient to prove that the equations

$$\sum_x q(x)\,\pi(x, y) = q(y)$$

have a nontrivial nonnegative solution such that $\sum_x q(x) < \infty$. We shall proceed to show that this is indeed the case. Since the equation is linear we can always normalize the solution so that $\sum_x q(x) = 1$. By iteration

$$\sum_x q(x)\,\pi^{(n)}(x, y) = q(y)$$

for every $n$. If $\lim_{n \to \infty} \pi^{(n)}(x, y) = 0$ for every $x$ and $y$, because $\sum_x q(x) = 1 < \infty$, by the bounded convergence theorem the left hand side tends to 0 as $n \to \infty$. Therefore $q \equiv 0$ and is trivial. This rules out the transient and the null recurrent case. In our case $\pi(0, y) = p_y$ and $\pi(x, y) = p_{y-x+1}$ if $y \ge x - 1$ and $x \ge 1$. In all other cases $\pi(x, y) = 0$. The equations for $\{q_x = q(x)\}$ are then

$$q_0\,p_y + \sum_{x=1}^{y+1} q_x\,p_{y-x+1} = q_y \quad \text{for } y \ge 0. \tag{4.13}$$
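Equation (4.13) can be solved forward in $y$: for each $y$ the only unknown is $q_{y+1}$, which appears with coefficient $p_0$. The sketch below does this for a hypothetical arrival law $p = (0.5, 0.3, 0.2)$ with $m = 0.7 < 1$; the function name `queue_invariant`, the truncation at 100 terms and the choice of $p$ are assumptions made for illustration. It requires $p_0 > 0$.

```python
def queue_invariant(p, n_terms):
    """Solve (4.13) forward for q_0, ..., q_{n_terms-1} starting from q_0 = 1, then normalize.

    p is the arrival law as a list [p_0, p_1, ...]; p_0 must be positive.
    """
    def arrival(j):
        return p[j] if 0 <= j < len(p) else 0.0

    q = [1.0]
    for y in range(n_terms - 1):
        # (4.13): q_0 p_y + sum_{x=1}^{y+1} q_x p_{y-x+1} = q_y.
        # Everything except the x = y + 1 term, q_{y+1} p_0, is already known.
        known = q[0] * arrival(y) + sum(q[x] * arrival(y - x + 1)
                                        for x in range(1, y + 1))
        q.append((q[y] - known) / arrival(0))
    total = sum(q)
    return [value / total for value in q]


# Hypothetical arrival distribution with mean m = 0.3 + 2 * 0.2 = 0.7 < 1.
arrivals = [0.5, 0.3, 0.2]
m = sum(j * pj for j, pj in enumerate(arrivals))
q = queue_invariant(arrivals, 100)
```

For this arrival law the normalized solution has `q[0]` numerically equal to `1 - m = 0.3`, and the later terms decay geometrically, so the truncation has a negligible effect on the normalization.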
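Multiplying equation 4.13 by $z^y$ and summing over $y$ from 0 to $\infty$, we get

$$q_0\,P(z) + \frac{1}{z}\,P(z)\,[Q(z) - q_0] = Q(z)$$

where $P(z)$ and $Q(z)$ are the generating functions

$$P(z) = \sum_{x=0}^{\infty} p_x z^x, \qquad Q(z) = \sum_{x=0}^{\infty} q_x z^x.$$

We can solve for $Q$ to get

$$\begin{aligned}
\frac{Q(z)}{q_0} &= P(z)\,\Big[1 - \frac{P(z) - 1}{z - 1}\Big]^{-1} \\
&= P(z) \sum_{k=0}^{\infty} \Big[\frac{P(z) - 1}{z - 1}\Big]^k \\
&= P(z) \sum_{k=0}^{\infty} \Big[\sum_{j=1}^{\infty} p_j\,(1 + z + \cdots + z^{j-1})\Big]^k
\end{aligned}$$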
is a power series in $z$ with nonnegative coefficients. If $m < 1$, we can let $z \to 1$ to get

$$\frac{Q(1)}{q_0} = \sum_{k=0}^{\infty} \Big[\sum_{j=1}^{\infty} j\,p_j\Big]^k = \sum_{k=0}^{\infty} m^k = \frac{1}{1 - m}$$

0, $U(\lambda, y) = E_y\{\exp[-\lambda \tau_{x_0}]\}$. Clearly $U(x) = U(\lambda, x)$ satisfies

$$U(x) = e^{-\lambda} \sum_y \pi(x, y)\,U(y) \quad \text{for } x \ne x_0 \tag{4.14}$$

and $U(x_0) = 1$. One would hope that if we solve these equations then we have our $U$. This requires uniqueness. Since our $U$ is bounded, in fact by 1, it is sufficient to prove uniqueness within the class of bounded solutions
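of equation (4.14). We will now establish that any bounded solution $U$ of equation (4.14) with $U(x_0) = 1$ is given by
\[ U(y) = U(\lambda, y) = E_y\{\exp[-\lambda \tau_{x_0}]\}. \]
Let us define $E_n = \{X_1 \ne x_0, X_2 \ne x_0, \cdots, X_{n-1} \ne x_0, X_n = x_0\}$. Then we will prove, by induction, that for any solution $U$ of equation (4.14) with $U(x_0) = 1$,
\[ U(y) = \sum_{j=1}^{n} e^{-\lambda j} P_y\{E_j\} + e^{-\lambda n} \int_{\tau_{x_0} > n} U(X_n)\, dP_y. \tag{4.15} \]
By letting $n \to \infty$ we would obtain
\[ U(y) = \sum_{j=1}^{\infty} e^{-\lambda j} P_y\{E_j\} = E^{P_y}\{e^{-\lambda \tau_{x_0}}\} \]
because $U$ is bounded and $\lambda > 0$. For the induction step, note that $X_n \ne x_0$ on the set $\{\tau_{x_0} > n\}$, so that (4.14) can be applied at $X_n$:
\begin{align*}
\int_{\tau_{x_0} > n} U(X_n)\, dP_y
&= e^{-\lambda} \int_{\tau_{x_0} > n} \Big[\sum_z \pi(X_n, z)\, U(z)\Big]\, dP_y \\
&= e^{-\lambda} P_y\{E_{n+1}\} + e^{-\lambda} \int_{\tau_{x_0} > n} \Big[\sum_{z \ne x_0} \pi(X_n, z)\, U(z)\Big]\, dP_y \\
&= e^{-\lambda} P_y\{E_{n+1}\} + e^{-\lambda} \int_{\tau_{x_0} > n+1} U(X_{n+1})\, dP_y
\end{align*}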
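completing the induction argument. In our case, if we take $x_0 = 0$ and try $U_\sigma(x) = e^{-\sigma x}$ with $\sigma > 0$, for $x \ge 1$
\begin{align*}
\sum_y \pi(x, y)\, U_\sigma(y) &= \sum_{y \ge x-1} e^{-\sigma y}\, p_{y-x+1} \\
&= \sum_{y \ge 0} e^{-\sigma (x+y-1)}\, p_y \\
&= e^{-\sigma x} e^{\sigma} \sum_{y \ge 0} e^{-\sigma y}\, p_y = \psi(\sigma)\, U_\sigma(x)
\end{align*}
where
\[ \psi(\sigma) = e^{\sigma} \sum_{y \ge 0} e^{-\sigma y}\, p_y. \]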
Let us solve $e^{\lambda} = \psi(\sigma)$ for $\sigma$, which is the same as solving $\log \psi(\sigma) = \lambda$, for $\lambda > 0$ to get a solution $\sigma = \sigma(\lambda) > 0$. Then $U(\lambda, x) = e^{-\sigma(\lambda) x} = E^{P_x}\{e^{-\lambda \tau_0}\}$. We see now that recurrence is equivalent to $\sigma(\lambda) \to 0$ as $\lambda \downarrow 0$ and positive recurrence to $\sigma(\lambda)$ being differentiable at $\lambda = 0$. The function $\log \psi(\sigma)$ is convex and its slope at the origin is $1 - m$. If $m > 1$ it dips below 0 initially for $\sigma > 0$ and then comes back up to 0 for some positive $\sigma_0$ before turning positive for good. In that situation $\lim_{\lambda \downarrow 0} \sigma(\lambda) = \sigma_0 > 0$ and that is transience. If $m < 1$ then $\log \psi(\sigma)$ has a positive slope at the origin and $\sigma'(0) = \frac{1}{\psi'(0)} = \frac{1}{1-m} < \infty$. If $m = 1$, then $\log \psi$ has zero slope at the origin and $\sigma'(0) = \infty$. This concludes the discussion of this problem.

Example 4.3. (The Urn Problem.) We now turn to a discussion of the urn problem.
\[ \pi(p, q\,;\, p+1, q) = \frac{p}{p+q} \quad \text{and} \quad \pi(p, q\,;\, p, q+1) = \frac{q}{p+q} \]
and $\pi$ is zero otherwise. In this case the equation
\[ F(p, q) = \frac{p}{p+q}\, F(p+1, q) + \frac{q}{p+q}\, F(p, q+1) \quad \text{for all } p, q \]
which will play a role later, has lots of solutions. In particular, $F(p, q) = \frac{p}{p+q}$ is one, and for any $0 < x < 1$
\[ F_x(p, q) = \frac{1}{\beta(p, q)}\, x^{p-1} (1-x)^{q-1} \]
where
\[ \beta(p, q) = \frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)} \]
is a solution as well. The former is defined on $p + q > 0$ whereas the latter is defined only on $p > 0$, $q > 0$. Actually if $p$ or $q$ is initially 0 it remains so forever and there is nothing to study in that case. If $f$ is a continuous function on $[0, 1]$ then
\[ F_f(p, q) = \int_0^1 F_x(p, q)\, f(x)\, dx \]
is a solution, and if we want we can extend $F_f$ by making it $f(1)$ on $q = 0$ and $f(0)$ on $p = 0$. It is a simple exercise to verify
\[ \lim_{\substack{p, q \to \infty \\ \frac{p}{p+q} \to x}} F_f(p, q) = f(x) \]
for any continuous $f$ on $[0, 1]$. We will show that the ratio $\xi_n = \frac{p_n}{p_n + q_n}$, which is random, stabilizes asymptotically (i.e. has a limit) to a random variable $\xi$, and if we start from $p, q$ the distribution of $\xi$ is the Beta distribution on $[0, 1]$ with density
\[ F_x(p, q) = \frac{1}{\beta(p, q)}\, x^{p-1} (1-x)^{q-1}. \]
Suppose we have a Markov chain on some state space $X$ with transition probability $\pi(x, y)$ and $U(x)$ is a bounded function on $X$ that solves
\[ U(x) = \sum_y \pi(x, y)\, U(y). \]
Such functions are called (bounded) harmonic functions for the chain. Consider the random variables $\xi_n = U(X_n)$ for such a harmonic function. The $\xi_n$ are uniformly bounded by the bound for $U$. If we denote $\eta_n = \xi_n - \xi_{n-1}$, an elementary calculation reveals
\[ E^{P_x}\{\eta_{n+1}\} = E^{P_x}\{U(X_{n+1}) - U(X_n)\} = E^{P_x}\big\{E^{P_x}\{U(X_{n+1}) - U(X_n) \,|\, F_n\}\big\} \]
where $F_n$ is the σ-field generated by $X_0, \cdots, X_n$. But
\[ E^{P_x}\{U(X_{n+1}) - U(X_n) \,|\, F_n\} = \sum_y \pi(X_n, y)\,[U(y) - U(X_n)] = 0. \]
A similar calculation shows that
\[ E^{P_x}\{\eta_n \eta_m\} = 0 \]
for $m \ne n$. If we write
\[ U(X_n) = U(X_0) + \eta_1 + \eta_2 + \cdots + \eta_n \]
this is an orthogonal sum in $L_2[P_x]$, and because $U$ is bounded
\[ E^{P_x}\{|U(X_n)|^2\} = |U(x)|^2 + \sum_{i=1}^{n} E^{P_x}\{|\eta_i|^2\} \le C \]
is bounded in $n$. Therefore $\lim_{n\to\infty} U(X_n) = \xi$ exists in $L_2[P_x]$ and $E^{P_x}\{\xi\} = U(x)$. Actually the limit exists almost surely and we will show it when we discuss martingales later. In our example if we take $U(p, q) = \frac{p}{p+q}$, as remarked earlier, this is a harmonic function bounded by 1 and therefore
\[ \lim_{n\to\infty} \frac{p_n}{p_n + q_n} = \xi \]
exists in $L_2[P_x]$. Moreover if we take $U(p, q) = F_f(p, q)$ for some continuous $f$ on $[0, 1]$, because $F_f(p, q) \to f(x)$ as $p, q \to \infty$ and $\frac{p}{p+q} \to x$, $U(p_n, q_n)$ has a limit as $n \to \infty$ and this limit has to be $f(\xi)$. On the other hand
\begin{align*}
E^{P_{p,q}}\{U(p_n, q_n)\} &= U(p_0, q_0) = F_f(p_0, q_0) \\
&= \frac{1}{\beta(p_0, q_0)} \int_0^1 f(x)\, x^{p_0 - 1} (1-x)^{q_0 - 1}\, dx
\end{align*}
giving us
\[ E^{P_{p,q}}\{f(\xi)\} = \frac{1}{\beta(p, q)} \int_0^1 f(x)\, x^{p-1} (1-x)^{q-1}\, dx \]
thereby identifying the distribution of $\xi$ under $P_{p,q}$ as the Beta distribution with the right parameters.
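The harmonic function argument of Example 4.3 can be checked by exact computation for small cases. The sketch below is an illustration only: it assumes integer starting values $p, q \ge 1$, and the names `urn_distribution` and `expected_ratio` are hypothetical. It computes the exact law of the urn after $n$ steps with rational arithmetic and evaluates $E\{\xi_n\}$, which should equal $\frac{p}{p+q}$ for every $n$ because $U(p, q) = \frac{p}{p+q}$ is harmonic.

```python
from fractions import Fraction


def urn_distribution(p, q, n):
    """Exact law of k = number of steps that increased p, after n steps from (p, q)."""
    dist = {0: Fraction(1)}
    for step in range(n):
        new = {}
        for k, prob in dist.items():
            pk, qk = p + k, q + (step - k)   # current composition of the urn
            up = Fraction(pk, pk + qk)       # pi(p, q; p+1, q) = p / (p + q)
            new[k + 1] = new.get(k + 1, 0) + prob * up
            new[k] = new.get(k, 0) + prob * (1 - up)
        dist = new
    return dist


def expected_ratio(p, q, n):
    """E[p_n / (p_n + q_n)] when the chain starts from (p, q)."""
    dist = urn_distribution(p, q, n)
    return sum(prob * Fraction(p + k, p + q + n) for k, prob in dist.items())
```

Starting from $(2, 3)$, `expected_ratio(2, 3, 25)` returns exactly `Fraction(2, 5)`. The second moment of $\xi_n$ computed from `urn_distribution` approaches the second moment $\frac{p(p+1)}{(p+q)(p+q+1)}$ of the Beta distribution as $n$ grows, in line with the identification of the limit.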
Example 4.4. (Branching Process). Consider a population in which each individual member replaces itself at the beginning of each day by a random number of offspring. Every individual has the same probability distribution, but the numbers of offspring for different individuals are distributed independently of each other. The distribution of the number $N$ of offspring is given by $P[N = i] = p_i$ for $i \ge 0$. If there are $X_n$ individuals in the population on a given day, then the number of individuals $X_{n+1}$ present on the next day has the representation
\[ X_{n+1} = N_1 + N_2 + \cdots + N_{X_n} \]
as the sum of $X_n$ independent random variables each having the offspring distribution $\{p_i : i \ge 0\}$. $X_n$ is seen to be a Markov chain on the set of nonnegative integers. Note that if $X_n$ ever becomes zero, i.e. if every member on a given day produces no offspring, then the population remains extinct. If one uses generating functions, then the transition probabilities $\pi_{i,j}$ of the chain are given by
\[ \sum_j \pi_{i,j}\, z^j = \Big[\sum_j p_j z^j\Big]^i. \]
What is the long time behavior of the chain? Let us denote by $m$ the expected number of offspring of any individual, i.e.
\[ m = \sum_{i \ge 0} i\, p_i. \]
Then $E[X_{n+1} | F_n] = m X_n$.

1. If $m < 1$, then the population becomes extinct sooner or later. This is easy to see. Consider
\[ E\Big[\sum_{n \ge 0} X_n \,\Big|\, F_0\Big] = \sum_{n \ge 0} m^n X_0 = \frac{1}{1-m}\, X_0 < \infty. \]
By an application of Fubini's theorem, if $S = \sum_{n \ge 0} X_n$, then
\[ E[S \,|\, X_0 = i] = \frac{i}{1-m} < \infty \]
[...] $p_0 > 0$, and there is a positive probability $q(i) = q^i$ that the population becomes extinct when it starts with $i$ individuals. Here $q$ is the probability of the population becoming extinct when we start with $X_0 = 1$. If we have initially $i$ individuals, each of the $i$ family lines has to become extinct for the entire population to become extinct. The number $q$ must therefore be a solution of the equation
\[ q = P(q) \]
where $P(z)$ is the generating function
\[ P(z) = \sum_{i \ge 0} p_i z^i. \]
If we show that the equation $P(z) = z$ has only the solution $z = 1$ in $0 \le z \le 1$, then the population becomes extinct with probability 1 although $E[S] = \infty$ in this case. If $P(1) = 1$ and $P(a) = a$ for some $0 \le a < 1$ then by the mean value theorem applied to $P(z) - z$ we must have $P'(z) = 1$ for some $0 < z < 1$. But if $0 < z < 1$
\[ P'(z) = \sum_{i \ge 1} i\, z^{i-1} p_i < \sum_{i \ge 1} i\, p_i = 1 \]
a contradiction.

4. If $m > 1$ but $p_0 = 0$ the problem is trivial. There is no chance of the population becoming extinct. Let us assume that $p_0 > 0$. The equation $P(z) = z$ has another solution $z = q$ besides $z = 1$, in the range $0 < z < 1$. This is seen by considering the function $g(z) = P(z) - z$. We have $g(0) > 0$, $g(1) = 0$, $g'(1) > 0$, which implies another root. But $g(z)$ is convex and therefore there can be at most one more root. If we can rule out the possibility of the extinction probability being equal to 1, then this root $q$ must be the extinction probability when we start with a single individual at time 0. Let us denote by $q_n$ the probability of extinction within $n$ days. Then
\[ q_{n+1} = \sum_i p_i\, q_n^i = P(q_n) \]
and $q_1 < 1$. A simple consequence of the monotonicity of $P(z)$ and the inequalities $P(z) > z$ for $z < q$ and $P(z) < z$ for $z > q$ is that if
we start with any $a < 1$ and iterate $q_{n+1} = P(q_n)$ with $q_1 = a$, then $q_n \to q$. If the population does not become extinct, one can show that it has to grow indefinitely. This is best done using martingales and we will revisit this example later as Example 5.6.

Example 4.5. Let $X$ be the set of integers. Assume that transitions from $x$ are possible only to $x - 1$, $x$, and $x + 1$. The transition matrix $\pi(x, y)$ appears as a tridiagonal matrix with $\pi(x, y) = 0$ unless $|x - y| \le 1$. For simplicity let us assume that $\pi(x, x)$, $\pi(x, x-1)$ and $\pi(x, x+1)$ are positive for all $x$. The chain is then irreducible and aperiodic. Let us try to solve for $U(x) = P_x\{\tau_0 = \infty\}$ that satisfies the equation
\[ U(x) = \pi(x, x-1)\, U(x-1) + \pi(x, x)\, U(x) + \pi(x, x+1)\, U(x+1) \]
for $x \ne 0$ with $U(0) = 0$. The equations decouple into a set for $x > 0$ and a set for $x < 0$. If we denote $V(x) = U(x+1) - U(x)$ for $x \ge 0$, then we always have
\[ U(x) = \pi(x, x-1)\, U(x) + \pi(x, x)\, U(x) + \pi(x, x+1)\, U(x) \]
so that
\[ \pi(x, x-1)\, V(x-1) - \pi(x, x+1)\, V(x) = 0 \]
or
\[ \frac{V(x)}{V(x-1)} = \frac{\pi(x, x-1)}{\pi(x, x+1)} \]
and therefore
\[ V(x) = V(0) \prod_{i=1}^{x} \frac{\pi(i, i-1)}{\pi(i, i+1)} \]
and
\[ U(x) = V(0) \left[1 + \sum_{y=1}^{x-1} \prod_{i=1}^{y} \frac{\pi(i, i-1)}{\pi(i, i+1)}\right]. \]
If the chain is to be transient we must have, for some choice of $V(0)$, $0 \le U(x) \le 1$ for all $x > 0$, and this will be possible only if
\[ \sum_{y=1}^{\infty} \prod_{i=1}^{y} \frac{\pi(i, i-1)}{\pi(i, i+1)} < \infty \]
[...] $> 0$ for $x > 0$. There is a similar condition on the negative side
\[ \sum_{y=1}^{\infty} \prod_{i=1}^{y} \frac{\pi(-i, -i+1)}{\pi(-i, -i-1)} < \infty. \]
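The series on the positive side can be explored numerically. The sketch below computes its partial sums for user supplied transition probabilities; the function name `transience_partial_sum` and the two test chains in the usage note are assumptions made for illustration, and a finite partial sum can of course only suggest convergence or divergence, not prove it.

```python
def transience_partial_sum(down, up, n_terms):
    """Partial sum of sum_{y>=1} prod_{i=1}^{y} pi(i, i-1) / pi(i, i+1).

    down(i) stands for pi(i, i-1) and up(i) for pi(i, i+1), for i >= 1.
    """
    total = 0.0
    product = 1.0
    for y in range(1, n_terms + 1):
        product *= down(y) / up(y)   # extends the product up to i = y
        total += product
    return total
```

With a constant drift to the right, say `down = lambda i: 0.25` and `up = lambda i: 0.5`, the partial sums approach 1, which is consistent with transience. For the symmetric choice `down = up = lambda i: 1/3` every term equals 1 and the partial sums grow linearly, so the series on this side diverges.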
Transience needs at least one of the two series to converge. Actually the converse is also true. If, for instance, the series on the positive side converges then we get a function $U(x)$ with $0 \le U(x) \le 1$ and $U(0) = 0$ that satisfies
\[ U(x) = \pi(x, x-1)\, U(x-1) + \pi(x, x)\, U(x) + \pi(x, x+1)\, U(x+1) \]
and by iteration one can prove that for each $n$,
\[ U(x) = \int_{\tau_0 > n} U(X_n)\, dP_x \le P_x\{\tau_0 > n\} \]
so the existence of a nontrivial $U$ implies transience.

Exercise 4.16. Determine the conditions for positive recurrence in the previous example.

Exercise 4.17. We replace the set of integers by the set of nonnegative integers and assume that $\pi(0, y) = 0$ for $y \ge 2$. Such processes are called birth and death processes. Work out the conditions in that case.

Exercise 4.18. In the special case of a birth and death process with $\pi(0, 1) = \pi(0, 0) = \frac{1}{2}$, and for $x \ge 1$, $\pi(x, x) = \frac{1}{3}$, $\pi(x, x-1) = \frac{1}{3} + a_x$, $\pi(x, x+1) = \frac{1}{3} - a_x$ with $a_x = \frac{\lambda}{x^{\alpha}}$ for large $x$, find conditions on positive $\alpha$ and real $\lambda$ for the chain to be transient, null recurrent and positive recurrent.
Exercise 4.19. The notion of a Markov chain makes sense for a finite chain $X_0, \cdots, X_n$. Formulate it precisely. Show that if the chain $\{X_j : 0 \le j \le n\}$ is Markov so is the reversed chain $\{Y_j : 0 \le j \le n\}$ where $Y_j = X_{n-j}$ for $0 \le j \le n$. Can the transition probabilities of the reversed chain be determined by the transition probabilities of the forward chain? If the forward chain has stationary transition probabilities does the same hold true for the reversed chain? What if we assume that the chain has a finite invariant probability distribution and we initialize the chain to start with an initial distribution which is the invariant distribution?

Exercise 4.20. Consider the simple chain on nonnegative integers with the following transition probabilities. $\pi(0, x) = p_x$ for $x \ge 0$ with $\sum_{x=0}^{\infty} p_x = 1$. For $x > 0$, $\pi(x, x-1) = 1$ and $\pi(x, y) = 0$ for all other $y$. Determine conditions on $\{p_x\}$ in order that the chain may be transient, null recurrent or positive recurrent. Determine the invariant probability measure in the positive recurrent case.

Exercise 4.21. Show that any null recurrent equivalence class must necessarily contain an infinite number of states. In particular any Markov chain with a finite state space has only transient and positive recurrent states, and moreover the set of positive recurrent states must be nonempty.
Chapter 5
Martingales.

5.1 Definitions and properties

The theory of martingales plays a very important and useful role in the study of stochastic processes. A formal definition is given below.

Definition 5.1. Let $(\Omega, F, P)$ be a probability space. A martingale sequence of length $n$ is a chain $X_1, X_2, \cdots, X_n$ of random variables and corresponding sub σ-fields $F_1, F_2, \cdots, F_n$ that satisfy the following relations

(i) Each $X_i$ is an integrable random variable which is measurable with respect to the corresponding σ-field $F_i$.

(ii) The σ-fields $F_i$ are increasing, i.e. $F_i \subset F_{i+1}$ for every $i$.

(iii) For every $i \in \{1, 2, \cdots, n-1\}$, we have the relation
\[ X_i = E\{X_{i+1} | F_i\} \quad \text{a.e. } P. \]

Remark 5.1. We can have an infinite martingale sequence $\{(X_i, F_i) : i \ge 1\}$, which requires only that for every $n$, $\{(X_i, F_i) : 1 \le i \le n\}$ be a martingale sequence of length $n$. This is the same as conditions (i), (ii) and (iii) above except that they have to be true for every $i \ge 1$.
Remark 5.2. From the properties of conditional expectations we see that $E\{X_i\} = E\{X_{i+1}\}$ for every $i$, and therefore $E\{X_i\} = c$ for some $c$. We can define $F_0$ to be the trivial σ-field consisting of $\{\emptyset, \Omega\}$ and $X_0 = c$. Then $\{(X_i, F_i) : i \ge 0\}$ is a martingale sequence as well.

Remark 5.3. We can define $Y_{i+1} = X_{i+1} - X_i$ so that $X_j = c + \sum_{1 \le i \le j} Y_i$ and property (iii) reduces to
\[ E\{Y_{i+1} | F_i\} = 0 \quad \text{a.e. } P. \]
Such sequences are called martingale differences. If $Y_i$ is a sequence of independent random variables with mean 0, for each $i$ we can take $F_i$ to be the σ-field generated by the random variables $\{Y_j : 1 \le j \le i\}$, and $X_j = c + \sum_{1 \le i \le j} Y_i$ will be a martingale relative to the σ-fields $F_i$.

Remark 5.4. We can generate martingale sequences by the following procedure. Given any increasing family of σ-fields $\{F_j\}$, and any integrable random variable $X$ on $(\Omega, F, P)$, we take $X_i = E\{X | F_i\}$ and it is easy to check that $\{(X_i, F_i)\}$ is a martingale sequence. Of course every finite martingale sequence is generated this way, for we can always take $X$ to be $X_n$, the last one. For infinite sequences this raises an important question that we will answer later.

If one participates in a 'fair' gambling game, the asset $X_n$ of the player at time $n$ is supposed to be a martingale. One can take for $F_n$ the σ-field of all the results of the game through time $n$. The condition $E[X_{n+1} - X_n | F_n] = 0$ is the assertion that the game is neutral irrespective of past history.

A related notion is that of a super or sub-martingale. If, in the definition of a martingale, we replace the equality in (iii) by an inequality we get super or sub-martingales. For a sub-martingale we demand the relation

(iiia) for every $i$, $X_i \le E\{X_{i+1} | F_i\}$ a.e. $P$,

while for a super-martingale the relation is

(iiib) for every $i$, $X_i \ge E\{X_{i+1} | F_i\}$ a.e. $P$.
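The statement that a fair game is neutral irrespective of past history can be checked by brute force on a small example. In the sketch below the game is a sequence of fair coin tosses with outcomes $\pm 1$, and the player's stake may depend on the past; the particular staking rule (double after a loss, reset after a win) and the names `stake`, `asset` and `conditional_increment` are hypothetical choices made for illustration. The function `conditional_increment` evaluates $E[X_{n+1} - X_n | F_n]$ on the atom of $F_n$ determined by a given history.

```python
from fractions import Fraction
from itertools import product


def stake(history):
    """Hypothetical staking rule: double the stake after a loss, reset to 1 after a win."""
    s = 1
    for outcome in history:
        s = 2 * s if outcome == -1 else 1
    return s


def asset(history):
    """Net winnings X_n after the tosses in `history` (a tuple of +1 / -1)."""
    total = 0
    for k in range(len(history)):
        total += stake(history[:k]) * history[k]
    return total


def conditional_increment(history):
    """E[X_{n+1} - X_n | F_n] on the atom of F_n given by `history`."""
    return sum(
        Fraction(1, 2) * (asset(history + (o,)) - asset(history)) for o in (1, -1)
    )
```

Evaluating `conditional_increment(h)` for every history `h` of length at most 4, for example via `product((1, -1), repeat=n)`, gives 0 each time, whatever the past outcomes were.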
Lemma 5.1. If $\{(X_i, F_i)\}$ is a martingale and $\varphi$ is a convex (or concave) function of one variable such that $\varphi(X_n)$ is integrable for every $n$, then $\{(\varphi(X_n), F_n)\}$ is a sub (or super)-martingale.

Proof. An easy consequence of Jensen's inequality (4.2) for conditional expectations.

Remark 5.5. A particular case is $\varphi(x) = |x|^p$ with $1 \le p < \infty$. For any martingale $(X_n, F_n)$ and $1 \le p < \infty$, $(|X_n|^p, F_n)$ is a sub-martingale provided $E[|X_n|^p] < \infty$.

Theorem 5.2. (Doob's inequality.) Suppose $\{X_j\}$ is a martingale sequence of length $n$. Then
\[ P\Big\{\omega : \sup_{1 \le j \le n} |X_j| \ge \ell\Big\} \le \frac{1}{\ell} \int_{\{\sup_{1 \le j \le n} |X_j| \ge \ell\}} |X_n|\, dP \le \frac{1}{\ell} \int |X_n|\, dP. \tag{5.1} \]

Proof. Let us define $S(\omega) = \sup_{1 \le j \le n} |X_j(\omega)|$. Then
\[ \{\omega : S(\omega) \ge \ell\} = E = \cup_j E_j \]
is written as a disjoint union, where
\[ E_j = \{\omega : |X_1(\omega)| < \ell, \cdots, |X_{j-1}| < \ell, |X_j| \ge \ell\}. \]
We have
\[ P(E_j) \le \frac{1}{\ell} \int_{E_j} |X_j|\, dP \le \frac{1}{\ell} \int_{E_j} |X_n|\, dP. \tag{5.2} \]
The second inequality in (5.2) follows from the fact that $|x|$ is a convex function of $x$, and therefore $|X_j|$ is a sub-martingale. In particular $E\{|X_n| \,|\, F_j\} \ge |X_j|$ a.e. $P$ and $E_j \in F_j$. Summing up (5.2) over $j = 1, \cdots, n$ we obtain the theorem.

Remark 5.6. We could have started with
\[ P(E_j) \le \frac{1}{\ell^p} \int_{E_j} |X_j|^p\, dP \]
and obtained for $p \ge 1$
\[ P(E_j) \le \frac{1}{\ell^p} \int_{E_j} |X_n|^p\, dP. \tag{5.3} \]
Compare it with (3.9) for $p = 2$.
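Doob's inequality (5.1) can be verified exactly on a small example. The sketch below takes for $X_j$ the simple symmetric random walk $S_j$ (a martingale by Remark 5.3), enumerates all $2^n$ equally likely paths and returns the three quantities appearing in (5.1). The function name `doob_terms` and the choice of the random walk are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def doob_terms(n, level):
    """Return the three terms of (5.1) for the simple symmetric random walk of length n."""
    paths = list(product((1, -1), repeat=n))
    weight = Fraction(1, len(paths))      # each path has probability 2^{-n}
    prob = Fraction(0)                    # P{ sup_j |X_j| >= level }
    restricted = Fraction(0)              # integral of |X_n| over that event
    full = Fraction(0)                    # integral of |X_n| over everything
    for steps in paths:
        x = 0
        running_max = 0
        for s in steps:
            x += s
            running_max = max(running_max, abs(x))
        full += weight * abs(x)
        if running_max >= level:
            prob += weight
            restricted += weight * abs(x)
    return prob, restricted / level, full / level
```

For example `doob_terms(10, 3)` returns a triple `(lhs, mid, rhs)` with `lhs <= mid <= rhs`, as (5.1) asserts.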
This simple inequality has various implications. For example

Corollary 5.3. (Doob's Inequality.) Let $\{X_j : 1 \le j \le n\}$ be a martingale. Then if, as before, $S(\omega) = \sup_{1 \le j \le n} |X_j(\omega)|$ we have
\[ E[S^p] \le \Big(\frac{p}{p-1}\Big)^p E[\,|X_n|^p\,]. \]
The proof is a consequence of the following fairly general lemma.

Lemma 5.4. Suppose $X$ and $Y$ are two nonnegative random variables on a probability space such that for every $\ell \ge 0$,
\[ P\{Y \ge \ell\} \le \frac{1}{\ell} \int_{Y \ge \ell} X\, dP. \]
Then for every $p > 1$,
\[ \int Y^p\, dP \le \Big(\frac{p}{p-1}\Big)^p \int X^p\, dP. \]

Proof. Let us denote the tail probability by $T(\ell) = P\{Y \ge \ell\}$. Then with $\frac{1}{p} + \frac{1}{q} = 1$, i.e. $(p-1)q = p$,
\begin{align*}
\int Y^p\, dP &= -\int_0^{\infty} y^p\, dT(y) = p \int_0^{\infty} y^{p-1}\, T(y)\, dy && \text{(integrating by parts)} \\
&\le p \int_0^{\infty} y^{p-1}\, \frac{dy}{y} \int_{Y \ge y} X\, dP && \text{(by assumption)} \\
&= p \int X \Big[\int_0^{Y} y^{p-2}\, dy\Big]\, dP && \text{(by Fubini's Theorem)} \\
&= \frac{p}{p-1} \int X\, Y^{p-1}\, dP \\
&\le \frac{p}{p-1} \Big[\int X^p\, dP\Big]^{\frac{1}{p}} \Big[\int Y^{q(p-1)}\, dP\Big]^{\frac{1}{q}} && \text{(by Hölder's inequality)} \\
&\le \frac{p}{p-1} \Big[\int X^p\, dP\Big]^{\frac{1}{p}} \Big[\int Y^{p}\, dP\Big]^{1 - \frac{1}{p}}.
\end{align*}
This simplifies to
\[ \int Y^p\, dP \le \Big(\frac{p}{p-1}\Big)^p \int X^p\, dP \]
provided $\int Y^p\, dP$ is finite. In general, given $Y$, we can truncate it at level $N$ to get $Y_N = \min(Y, N)$, and for $0 < \ell \le N$,
\[ P\{Y_N \ge \ell\} = P\{Y \ge \ell\} \le \frac{1}{\ell} \int_{Y \ge \ell} X\, dP = \frac{1}{\ell} \int_{Y_N \ge \ell} X\, dP \]
with $P\{Y_N \ge \ell\} = 0$ for $\ell > N$. This gives us uniform bounds on $\int Y_N^p\, dP$ and we can pass to the limit. So we have the strong implication that the finiteness of $\int X^p\, dP$ implies the finiteness of $\int Y^p\, dP$.
Exercise 5.1. The result is false for $p = 1$. Construct a nonnegative martingale $X_n$ with $E[X_n] \equiv 1$ such that $\xi = \sup_n X_n$ is not integrable. Consider $\Omega = [0, 1]$, $F$ the Borel σ-field and $P$ the Lebesgue measure. Suppose we take $F_n$ to be the σ-field generated by intervals with end points of the form $\frac{j}{2^n}$ for some integer $j$. It corresponds to a partition with $2^n$ sets. Consider the random variables
\[ X_n(x) = \begin{cases} 2^n & \text{for } 0 \le x \le 2^{-n} \\ 0 & \text{for } 2^{-n} < x \le 1. \end{cases} \]
Check that it is a martingale and calculate $\int \xi(x)\, dx$. This is the 'winning' strategy of doubling one's bets until the losses are recouped.

Exercise 5.2. If $X_n$ is a martingale such that the differences $Y_n = X_n - X_{n-1}$ are all square integrable, show that for $n \ne m$, $E[Y_n Y_m] = 0$. Therefore
\[ E[X_n^2] = E[X_0^2] + \sum_{j=1}^{n} E[Y_j^2]. \]
If in addition $\sup_n E[X_n^2] < \infty$, then show that there is a random variable $X$ such that
\[ \lim_{n\to\infty} E[\,|X_n - X|^2\,] = 0. \]
5.2 Martingale Convergence Theorems.

If $F_n$ is an increasing family of σ-fields and $X_n$ is a martingale sequence with respect to $F_n$, one can always assume without loss of generality that the full σ-field $F$ is the smallest σ-field generated by $\cup_n F_n$. If for some $p \ge 1$, $X \in L_p$, and we define $X_n = E[X | F_n]$, then $X_n$ is a martingale and by Jensen's inequality, $\sup_n E[\,|X_n|^p\,] \le E[\,|X|^p\,]$. We would like to prove

Theorem 5.5. For $p \ge 1$, if $X \in L_p$, then $\lim_{n\to\infty} \|X_n - X\|_p = 0$.

Proof. Assume that $X$ is a bounded function. Then by the properties of conditional expectation $\sup_n \sup_\omega |X_n| < \infty$. In particular $E[X_n^2]$ is uniformly bounded. By Exercise 5.2, at the end of the last section, $\lim_{n\to\infty} X_n = Y$ exists in $L_2$. By the properties of conditional expectations, for $A \in F_m$,
\[ \int_A Y\, dP = \lim_{n\to\infty} \int_A X_n\, dP = \int_A X\, dP. \]
This is true for all $A \in F_m$ for any $m$. Since $F$ is generated by $\cup_m F_m$ the above relation is true for $A \in F$. As $X$ and $Y$ are $F$ measurable we conclude that $X = Y$ a.e. $P$. See Exercise 4.1. For a sequence of functions that are bounded uniformly in $n$ and $\omega$, convergence in the various $L_p$ are all equivalent, and therefore convergence in $L_2$ implies convergence in $L_p$ for any $1 \le p < \infty$. If now $X \in L_p$ for some $1 \le p < \infty$, we can approximate it by $X' \in L_\infty$ so that $\|X' - X\|_p < \epsilon$. Let us denote by $X'_n$ the conditional expectations $E[X' | F_n]$. By the properties of conditional expectations $\|X'_n - X_n\|_p \le \epsilon$ for all $n$, and as we saw earlier $\|X'_n - X'\|_p \to 0$ as $n \to \infty$. It now follows that
\[ \limsup_{\substack{n \to \infty \\ m \to \infty}} \|X_n - X_m\|_p \le 2\epsilon \]
and as $\epsilon > 0$ is arbitrary we are done.

In general, if we have a martingale $\{X_n\}$, we wish to know when it comes from a random variable $X \in L_p$ in the sense that $X_n = E[X | F_n]$.

Theorem 5.6. If for some $p > 1$, a martingale $\{X_n\}$ is bounded in $L_p$, in the sense that $\sup_n \|X_n\|_p < \infty$, then there is a random variable $X \in L_p$ such that $X_n = E[X | F_n]$ for $n \ge 1$. In particular $\|X_n - X\|_p \to 0$ as $n \to \infty$.
Proof. Suppose $\|X_n\|_p$ is uniformly bounded. For $p > 1$, since $L_p$ is the dual of $L_q$ with $\frac{1}{p} + \frac{1}{q} = 1$, bounded sets are weakly compact. See [7] or [3]. We can therefore choose a subsequence $X_{n_j}$ that converges weakly in $L_p$ to a limit in the weak topology. We call this limit $X$. Then consider $A \in F_n$ for some fixed $n$. The function $1_A(\cdot) \in L_q$, and
\[ \int_A X\, dP = \langle 1_A, X \rangle = \lim_{j\to\infty} \langle 1_A, X_{n_j} \rangle = \lim_{j\to\infty} \int_A X_{n_j}\, dP = \int_A X_n\, dP. \]
The last equality follows from the fact that $\{X_n\}$ is a martingale, $A \in F_n$ and $n_j > n$ eventually. It now follows that $X_n = E[X | F_n]$. We can now apply the preceding theorem.

Exercise 5.3. For $p = 1$ the result is false. Exercise 5.1 gives us at the same time a counterexample of an $L_1$ bounded martingale that does not converge in $L_1$ and so cannot be represented as $X_n = E[X | F_n]$.

We can show that the convergence in the preceding theorems is also valid almost everywhere.

Theorem 5.7. Let $X \in L_p$ for some $p \ge 1$. Then the martingale $X_n = E[X | F_n]$ converges to $X$ for almost all $\omega$ with respect to $P$.

Proof. From Hölder's inequality $\|X\|_1 \le \|X\|_p$. Clearly it is sufficient to prove the theorem for $p = 1$. Let us denote by $M \subset L_1$ the set of functions $X \in L_1$ for which the theorem is true. Clearly $M$ is a linear subset of $L_1$. We will prove that it is closed in $L_1$ and that it is dense in $L_1$. If we denote by $M_n$ the space of $F_n$ measurable functions in $L_1$, then $M_n$ is a closed subspace of $L_1$. By standard approximation theorems $\cup_n M_n$ is dense in $L_1$. Since it is obvious that $M \supset M_n$ for every $n$, it follows that $M$ is dense in $L_1$. Let $Y_j \in M \subset L_1$ and $Y_j \to X$ in $L_1$. Let us define $Y_{n,j} = E[Y_j | F_n]$. With $X_n = E[X | F_n]$, by Doob's inequality (5.1) and Jensen's inequality (4.2),
\begin{align*}
P\Big\{\sup_{1 \le n \le N} |X_n| \ge \ell\Big\} &\le \frac{1}{\ell} \int_{\{\omega : \sup_{1 \le n \le N} |X_n| \ge \ell\}} |X_N|\, dP \\
&\le \frac{1}{\ell}\, E[\,|X_N|\,] \\
&\le \frac{1}{\ell}\, E[\,|X|\,]
\end{align*}
and therefore $X_n$ is almost surely a bounded sequence. Since we know that $X_n \to X$ in $L_1$, it suffices to show that
\[ \limsup_n X_n - \liminf_n X_n = 0 \quad \text{a.e. } P. \]
If we write $X = Y_j + (X - Y_j)$, then $X_n = Y_{n,j} + (X_n - Y_{n,j})$, and
\begin{align*}
\limsup_n X_n - \liminf_n X_n &\le [\limsup_n Y_{n,j} - \liminf_n Y_{n,j}] \\
&\quad + [\limsup_n (X_n - Y_{n,j}) - \liminf_n (X_n - Y_{n,j})] \\
&= \limsup_n (X_n - Y_{n,j}) - \liminf_n (X_n - Y_{n,j}) \\
&\le 2 \sup_n |X_n - Y_{n,j}|.
\end{align*}
Here we have used the fact that $Y_j \in M$ for every $j$ and hence
\[ \limsup_n Y_{n,j} - \liminf_n Y_{n,j} = 0 \quad \text{a.e. } P. \]
Finally
\begin{align*}
P\Big\{\limsup_n X_n - \liminf_n X_n \ge \epsilon\Big\} &\le P\Big\{\sup_n |X_n - Y_{n,j}| \ge \frac{\epsilon}{2}\Big\} \\
&\le \frac{2}{\epsilon}\, E[\,|X - Y_j|\,] \\
&= 0
\end{align*}
since the left hand side is independent of $j$ and the term on the right on the second line tends to 0 as $j \to \infty$.

The only case where the situation is unclear is when $p = 1$. If $X_n$ is an $L_1$ bounded martingale, it is not clear that it comes from an $X$. If it did arise from an $X$, then $X_n$ would converge to it in $L_1$ and in particular would have to be uniformly integrable. The converse is also true.

Theorem 5.8. If $X_n$ is a uniformly integrable martingale then there is a random variable $X$ such that $X_n = E[X | F_n]$, and then of course $X_n \to X$ in $L_1$.
Proof. The uniform integrability of $X_n$ implies weak compactness in $L_1$, and if $X$ is any weak limit of $X_n$ [see [7]], it is not difficult to show, as in Theorem 5.6, that $X_n = E[X | F_n]$.

Remark 5.7. Note that for $p > 1$, a martingale $X_n$ that is bounded in $L_p$ is uniformly integrable in $L_p$, i.e. $|X_n|^p$ is uniformly integrable. This is false for $p = 1$. The $L_1$ bounded martingale that we constructed earlier in Exercise 5.1 as a counterexample is not convergent in $L_1$ and therefore cannot be uniformly integrable. We will defer the analysis of $L_1$ bounded martingales to the next section.

5.3 Doob Decomposition Theorem.

The simplest example of a submartingale is a sequence of functions that is nondecreasing in $n$ for every (almost all) $\omega$. In some sense the simplest example is also the most general. More precisely the decomposition theorem of Doob asserts the following.

Theorem 5.9. (Doob decomposition theorem.) If $\{X_n : n \ge 1\}$ is a sub-martingale on $(\Omega, F_n, P)$, then $X_n$ can be written as $X_n = Y_n + A_n$, with the following properties:

1. $(Y_n, F_n)$ is a martingale.

2. $A_{n+1} \ge A_n$ for almost all $\omega$ and for every $n \ge 1$.

3. $A_1 \equiv 0$.

4. For every $n \ge 2$, $A_n$ is $F_{n-1}$ measurable.

$X_n$ determines $Y_n$ and $A_n$ uniquely.

Proof. Let $X_n$ be any sequence of integrable functions such that $X_n$ is $F_n$ measurable and is represented as $X_n = Y_n + A_n$, with $Y_n$ and $A_n$ satisfying (1), (3) and (4). Then
\[ A_n - A_{n-1} = E[X_n - X_{n-1} | F_{n-1}] \tag{5.4} \]
are uniquely determined. Since $A_1 = 0$, all the $A_n$ are uniquely determined as well. Property (2) is then plainly equivalent to the submartingale property. To establish the representation, we define $A_n$ inductively by (5.4). It is routine to verify that $Y_n = X_n - A_n$ is a martingale, and the monotonicity of $A_n$ is a consequence of the submartingale property.

Remark 5.8. Actually, given any sequence of integrable functions $\{X_n : n \ge 1\}$ such that $X_n$ is $F_n$ measurable, equation (5.4) along with $A_1 = 0$ defines $F_{n-1}$ measurable functions $A_n$ that are integrable, such that $X_n = Y_n + A_n$ and $(Y_n, F_n)$ is a martingale. The decomposition is always unique. It is easy to verify from (5.4) that $\{A_n\}$ is increasing (or decreasing) if and only if $\{X_n\}$ is a sub- (or super-) martingale. Such a decomposition is called the semi-martingale decomposition.

Remark 5.9. It is the demand that $A_n$ be $F_{n-1}$ measurable that leads to uniqueness. If we have to deal with continuous time this will become a thorny issue.

We now return to the study of $L_1$ bounded martingales. A nonnegative martingale is clearly $L_1$ bounded because $E[\,|X_n|\,] = E[X_n] = E[X_1]$. One easy way to generate $L_1$ bounded martingales is to take the difference of two nonnegative martingales. We have the converse as well.

Theorem 5.10. Let $X_n$ be an $L_1$ bounded martingale. There are two nonnegative martingales $Y_n$ and $Z_n$ relative to the same σ-fields $F_n$, such that $X_n = Y_n - Z_n$.

Proof. For each $j$ and $n \ge j$, we define
\[ Y_{j,n} = E[\,|X_n|\, | F_j]. \]
By the submartingale property of $|X_n|$
\[ Y_{j,n+1} - Y_{j,n} = E[(|X_{n+1}| - |X_n|) | F_j] = E[E[(|X_{n+1}| - |X_n|) | F_n] | F_j] \ge 0 \]
almost surely. $Y_{j,n}$ is nonnegative and $E[Y_{j,n}] = E[\,|X_n|\,]$ is bounded in $n$. By the monotone convergence theorem, for each $j$ there exists some $Y_j$ in $L_1$ such that $Y_{j,n} \to Y_j$ in $L_1$ as $n \to \infty$. Since limits of martingales are again martingales, and $Y_{j,n}$ is a martingale in $j$ for $j \le n$, it follows that $Y_j$ is a martingale. Moreover
\[ Y_j + X_j = \lim_{n\to\infty} E[\,|X_n| + X_n\, | F_j] \ge 0 \]
and $X_j = (Y_j + X_j) - Y_j$ does it!

We can always assume that our nonnegative martingale has its expectation equal to 1 because we can always multiply by a suitable constant. Here is a way in which such martingales arise. Suppose we have a probability space $(\Omega, F, P)$ and an increasing family of sub σ-fields $F_n$ of $F$ that generate $F$. Suppose $Q$ is another probability measure on $(\Omega, F)$ which may or may not be absolutely continuous with respect to $P$ on $F$. Let us suppose however that $Q$ [...] $> 0$ and $\cos \lambda x \ge \cos \lambda R > 0$ for $x \in [-R, R]$, if $R$ is an integer, we can claim that
\[ E^{P_x}\big\{e^{\sigma [\tau_R \wedge N]}\big\} \le \frac{\cos \lambda x}{\cos \lambda R}. \]
Since the estimate is uniform we can let $N \to \infty$ to get the estimate
\[ E^{P_x}\big\{e^{\sigma \tau_R}\big\} \le \frac{\cos \lambda x}{\cos \lambda R}. \]

Exercise 5.19. Can you prove equality above? What is the range of validity of the equality? Is $E^{P_x}\{e^{\sigma \tau_R}\} < \infty$ for all $\sigma > 0$?

Example 5.2. Let us make life slightly more complicated by taking a Markov chain in $Z^d$ with transition probabilities
\[ \pi(x, y) = \begin{cases} \frac{1}{2d} + \delta(x, y) & \text{if } |x - y| = 1 \\ 0 & \text{if } |x - y| \ne 1 \end{cases} \]
so that we have slightly perturbed the random walk with perhaps even a possible bias. Exact calculations like in Example 5.1 are of course no longer possible. Let us try to estimate again the exit time from a ball of radius $R$. For $\sigma > 0$ consider the function
\[ F(x) = \exp\Big[\sigma \sum_{i=1}^{d} |x_i|\Big] \]
defined on $Z^d$. We can get an estimate of the form
\[ (\Pi F)(x_1, \cdots, x_d) \ge \theta\, F(x_1, \cdots, x_d) \]
for some choices of $\sigma > 0$ and $\theta > 1$ that may depend on $R$. Now proceed as in Example 5.1.
Example 5.3. We can use these methods to show that the random walk is transient in dimension $d \ge 3$. For $0 < \alpha < d - 2$ consider the function $V(x) = \frac{1}{|x|^{\alpha}}$ for $x \ne 0$ with $V(0) = 1$. An approximate calculation of $(\Pi V)(x)$ yields, for sufficiently large $|x|$ (i.e. $|x| \ge L$ for some $L$), the estimate
\[ (\Pi V)(x) - V(x) \le 0. \]
If we start initially from an $x$ with $|x| > L$ and take $\tau_L$ to be the first entrance time into the ball of radius $L$, one gets by the stopping theorem the inequality
\[ E^{P_x}\big\{V(x_{\tau_L \wedge N})\big\} \le V(x). \]
If $\tau_L \le N$, then $|x_{\tau_L}| \le L$. In any case $V(x_{\tau_L \wedge N}) \ge 0$. Therefore,
\[ P_x\{\tau_L \le N\} \le \frac{V(x)}{\inf_{|y| \le L} V(y)} \]
valid uniformly in $N$. Letting $N \to \infty$
\[ P_x\{\tau_L < \infty\} \le \frac{V(x)}{\inf_{|y| \le L} V(y)}. \]
If we let $|x| \to \infty$, keeping $L$ fixed, we see the transience. Note that recurrence implies that $P_x\{\tau_L < \infty\} = 1$ for all $x$. The proof of transience really only required a function $V$ defined for large $|x|$ that was strictly positive for each $x$, went to 0 as $|x| \to \infty$ and had the property $(\Pi V)(x) \le V(x)$ for large values of $|x|$.

Example 5.4. We will now show that the random walk is recurrent in $d = 2$. This is harder because the recurrence of random walk in $d = 2$ is right on the border. We want to construct a function $V(x) \to \infty$ as $|x| \to \infty$ that satisfies $(\Pi V)(x) \le V(x)$ for large $|x|$. If we succeed, then we can estimate by a stopping argument the probability that the chain starting from a point $x$ in the annulus $\ell < |x| < L$ exits at the outer circle before getting inside the inner circle:
\[ P_x\{\tau_L < \tau_\ell\} \le \frac{V(x)}{\inf_{|y| \ge L} V(y)}. \]
We also have for every $L$,
\[ P_x\{\tau_L < \infty\} = 1. \]
This proves that $P_x\{\tau_\ell < \infty\} = 1$, thereby proving recurrence. The natural candidate is $F(x) = \log |x|$ for $x \ne 0$. A computation yields
\[ (\Pi F)(x) - F(x) \le \frac{C}{|x|^4} \]
which does not quite make it. On the other hand if $U(x) = |x|^{-1}$, for large values of $|x|$,
\[ (\Pi U)(x) - U(x) \ge \frac{c}{|x|^3} \]
for some $c > 0$. The choice of $V(x) = F(x) - C\,U(x) = \log |x| - \frac{C}{|x|}$ works with any $C > 0$.
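The three claims of Example 5.4 can be spot-checked numerically for the simple random walk on $Z^2$, for which $(\Pi f)(x)$ is the average of $f$ over the four nearest neighbours. The sketch below is only an illustration at a few sample points, not a proof; the helper names `step_average`, `drift` and `make_V`, and the value $C = 1$ in the usage note, are assumptions.

```python
import math


def step_average(f, x, y):
    """(Pi f)(x, y) for the simple random walk on Z^2."""
    return 0.25 * (f(x + 1, y) + f(x - 1, y) + f(x, y + 1) + f(x, y - 1))


def drift(f, x, y):
    """(Pi f)(x, y) - f(x, y)."""
    return step_average(f, x, y) - f(x, y)


def F(x, y):
    """F = log|x|, defined away from the origin."""
    return math.log(math.hypot(x, y))


def U(x, y):
    """U = 1/|x|, defined away from the origin."""
    return 1.0 / math.hypot(x, y)


def make_V(C):
    """V = F - C U = log|x| - C/|x|."""
    return lambda x, y: F(x, y) - C * U(x, y)
```

At the point $(35, 35)$ one finds `drift(F, 35, 35) > 0`, so $\log |x|$ alone is not enough, while `drift(U, 35, 35) > 0` and `drift(make_V(1.0), 35, 35) < 0`, in agreement with the estimates above.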
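Example 5.5. We can use these methods for proving positive recurrence as well. Suppose $X$ is a countable set and we can find $V \ge 0$, a finite set $F$ and a constant $C \ge 0$ such that
\[ (\Pi V)(x) - V(x) \le \begin{cases} -1 & \text{for } x \notin F \\ C & \text{for } x \in F. \end{cases} \]
Let us let $U = \Pi V - V$, and we have
\begin{align*}
-V(x) &\le E^{P_x}\{V(x_n)\} - V(x) \\
&= E^{P_x}\Big\{\sum_{j=1}^{n} U(x_{j-1})\Big\} \\
&\le E^{P_x}\Big\{C \sum_{j=1}^{n} 1_F(x_{j-1}) - \sum_{j=1}^{n} 1_{F^c}(x_{j-1})\Big\} \\
&= -E^{P_x}\Big\{\sum_{j=1}^{n} [1 - (1 + C)\, 1_F(x_{j-1})]\Big\} \\
&= -n + (1 + C) \sum_{j=1}^{n} \sum_{y \in F} \pi^{(j-1)}(x, y) \\
&= -n + o(n) \quad \text{as } n \to \infty
\end{align*}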
if the process is not positive recurrent. This is a contradiction. For instance if $X = Z$, the integers, and we have a little bit of bias towards the origin in the random walk
\[ \pi(x, x+1) - \pi(x, x-1) \ge \frac{a}{|x|} \quad \text{if } x \le -\ell \]
\[ \pi(x, x-1) - \pi(x, x+1) \ge \frac{a}{|x|} \quad \text{if } x \ge \ell \]
with $V(x) = x^2$, for $x \ge \ell$
\begin{align*}
(\Pi V)(x) &\le \frac{1}{2} (x+1)^2 \Big(1 - \frac{a}{|x|}\Big) + \frac{1}{2} (x-1)^2 \Big(1 + \frac{a}{|x|}\Big) \\
&= x^2 + 1 - 2a.
\end{align*}
If $a > \frac{1}{2}$, we can multiply $V$ by a constant and it works.

Exercise 5.20. What happens when
\[ \pi(x, x+1) - \pi(x, x-1) = -\frac{1}{2x} \]
for $|x| \ge 10$? (See Exercise 4.16)

Example 5.6. Let us return to our example of a branching process, Example 4.4. We see from the relation
\[ E[X_{n+1} | F_n] = m X_n \]
that $\frac{X_n}{m^n}$ is a martingale. If $m < 1$ we saw before quite easily that the population becomes extinct. If $m = 1$, $X_n$ is a martingale. Since it is nonnegative it is $L_1$ bounded and must have an almost sure limit as $n \to \infty$. Since the population is an integer, this means that the size eventually stabilizes. The limit can only be 0 because the population cannot stabilize at any other size. If $m > 1$ there is a probability $0 < q < 1$ such that $P[X_n \to 0 | X_0 = 1] = q$. We can show that with probability $1 - q$, $X_n \to \infty$. To see this consider the function $u(x) = q^x$. In the notation of Example 4.4
\[ E[q^{X_{n+1}} | F_n] = \Big[\sum_j q^j p_j\Big]^{X_n} = [P(q)]^{X_n} = q^{X_n} \]
so that $q^{X_n}$ is a nonnegative martingale. It then has an almost sure limit, which can only be 0 or 1. If $q$ is the probability that it is 1, i.e. that $X_n \to 0$, then $1 - q$ is the probability that it is 0, i.e. that $X_n \to \infty$.
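The extinction probability used in Example 5.6 can be approximated by the iteration $q_{n+1} = P(q_n)$ from Example 4.4. The sketch below starts the iteration at 0 (no chance of being extinct before the first day) and stops when successive iterates agree to a tolerance; the function name `extinction_probability`, the tolerance, the iteration cap and the offspring laws in the usage note are assumptions made for illustration. Convergence is slow in the critical case $m = 1$, which the sketch does not treat specially.

```python
def extinction_probability(p, tol=1e-13, max_iter=100000):
    """Approximate the smallest root of P(z) = z in [0, 1] by iterating q -> P(q).

    p[i] is the probability of i offspring; q starts at 0, so the iterates are
    the probabilities of extinction within n days and increase to the root.
    """
    def P(z):
        return sum(pi * z ** i for i, pi in enumerate(p))

    q = 0.0
    for _ in range(max_iter):
        nxt = P(q)
        if abs(nxt - q) < tol:
            return nxt
        q = nxt
    return q
```

For the hypothetical offspring law `p = [0.25, 0.25, 0.5]` one has $m = 1.25$ and $P(z) = z$ has the roots $\frac{1}{2}$ and 1, and the iteration returns a value close to 0.5. For `p = [0.5, 0.3, 0.2]`, where $m = 0.7 < 1$, it returns a value close to 1.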
Chapter 6
Stationary Stochastic Processes.

6.1 Ergodic Theorems.

A stationary stochastic process is a collection $\{\xi_n : n \in Z\}$ of random variables with values in some space $(X, B)$ such that the joint distribution of $(\xi_{n_1}, \cdots, \xi_{n_k})$ is the same as that of $(\xi_{n_1 + n}, \cdots, \xi_{n_k + n})$ for every choice of $k \ge 1$ and $n, n_1, \cdots, n_k \in Z$. Assuming that the space $(X, B)$ is reasonable and Kolmogorov's consistency theorem applies, we can build a measure $P$ on the countable product space $\Omega$ of sequences $\{x_n : n \in Z\}$ with values in $X$, defined for sets in the product σ-field $F$. On the space $\Omega$ there is the natural shift defined by $(T\omega)(n) = x_{n+1}$ for $\omega$ with $\omega(n) = x_n$. The random variables $x_n(\omega) = \omega(n)$ are essentially equivalent to $\{\xi_n\}$. The stationarity of the process is reflected in the invariance of $P$ with respect to $T$, i.e. $P T^{-1} = P$. We can, without being specific, consider a space $\Omega$, a σ-field $F$, a one to one invertible measurable map $T$ from $\Omega \to \Omega$ with a measurable inverse $T^{-1}$, and finally a probability measure $P$ on $(\Omega, F)$ that is $T$-invariant, i.e. $P(T^{-1}A) = P(A)$ for every $A \in F$. One says that $P$ is an invariant measure for $T$ or $T$ is a measure preserving transformation for $P$. If we have a measurable map $\xi : (\Omega, F) \to (X, B)$, then it is easily seen that $\xi_n(\omega) = \xi(T^n \omega)$ defines a stationary stochastic process. The study of stationary stochastic processes is then more or less the same as the study of measure preserving (i.e. probability preserving) transformations. The basic transformation $T : \Omega \to \Omega$ induces a linear transformation $U$
on the space of functions defined on $\Omega$ by the rule $(Uf)(\omega) = f(T\omega)$. Because $T$ is measure preserving it is easy to see that
\[ \int_\Omega f(\omega)\,dP = \int_\Omega f(T\omega)\,dP = \int_\Omega (Uf)(\omega)\,dP \]
as well as
\[ \int_\Omega |f(\omega)|^p\,dP = \int_\Omega |f(T\omega)|^p\,dP = \int_\Omega |(Uf)(\omega)|^p\,dP. \]
In other words $U$ acts as an isometry (i.e. norm preserving linear transformation) on the various $L_p$ spaces for $1 \le p < \infty$ and in fact it is an isometry on $L_\infty$ as well. Moreover the transformation induced by $T^{-1}$ is the inverse of $U$ so that $U$ is also invertible. In particular $U$ is unitary (or orthogonal) on $L_2$. This means it preserves the inner product $\langle \cdot, \cdot \rangle$:
\[ \langle f, g \rangle = \int f(\omega) g(\omega)\,dP = \int f(T\omega) g(T\omega)\,dP = \langle Uf, Ug \rangle. \]
Of course our linear transformation $U$ is very special and satisfies $U1 = 1$ and $U(fg) = (Uf)(Ug)$.

A basic theorem known as the Ergodic theorem asserts that

Theorem 6.1. For any $f \in L_1(P)$ the limit
\[ \lim_{n \to \infty} \frac{f(\omega) + f(T\omega) + \cdots + f(T^{n-1}\omega)}{n} = g(\omega) \]
exists for almost all $\omega$ with respect to $P$ as well as in $L_1(P)$. Moreover if $f \in L_p$ for some $p$ satisfying $1 < p < \infty$ then the function $g \in L_p$ and the convergence takes place in that $L_p$. Moreover the limit $g(\omega)$ is given by the conditional expectation
\[ g(\omega) = E^P[f \mid \mathcal{I}] \]
where the σ-field $\mathcal{I}$, called the invariant σ-field, is defined as
\[ \mathcal{I} = \{A : TA = A\}. \]

Proof. First we prove the convergence in the various $L_p$ spaces. These are called mean ergodic theorems. The easiest situation to prove is when $p = 2$. Let us define
\[ H_0 = \{f : f \in H,\ Uf = f\} = \{f : f \in H,\ f(T\omega) = f(\omega)\}. \]
Since $H_0$ contains constants, it is a closed nontrivial subspace of $H = L_2(P)$, of dimension at least one. Since $U$ is unitary, $Uf = f$ if and only if $U^{-1}f = U^*f = f$ where $U^*$ is the adjoint of $U$. The orthogonal complement $H_0^\perp$ can be defined as
\[ H_0^\perp = \{g : \langle g, f \rangle = 0 \ \ \forall f : U^*f = f\} = \overline{\mathrm{Range}\,(I - U)H}. \]
Clearly if we let
\[ A_n f = \frac{f + Uf + \cdots + U^{n-1}f}{n} \]
then $\|A_n f\|_2 \le \|f\|_2$ for every $f \in H$ and $A_n f = f$ for every $n$ and $f \in H_0$. Therefore for $f \in H_0$, $A_n f \to f$ as $n \to \infty$. On the other hand if $f = (I - U)g$, then $A_n f = \frac{g - U^n g}{n}$ and $\|A_n f\|_2 \le \frac{2\|g\|_2}{n} \to 0$ as $n \to \infty$. Since $\|A_n\| \le 1$, it follows that $A_n f \to 0$ as $n \to \infty$ for every $f \in H_0^\perp = \overline{\mathrm{Range}\,(I - U)H}$. (See Exercise 6.1.) If we denote by $\pi$ the orthogonal projection from $H \to H_0$, we see that $A_n f \to \pi f$ as $n \to \infty$ for every $f \in H$, establishing the $L_2$ ergodic theorem.

There is an alternate characterization of $H_0$. Functions $f$ in $H_0$ are invariant under $T$, i.e. have the property that $f(T\omega) = f(\omega)$. For any invariant function $f$ the level sets $\{\omega : a < f(\omega) < b\}$ are invariant under $T$. We can therefore talk about invariant sets $\{A : A \in \mathcal{F},\ T^{-1}A = A\}$. Technically we should allow ourselves to differ by sets of measure zero and one defines $\mathcal{I} = \{A : P(A \,\Delta\, T^{-1}A) = 0\}$ as the σ-field of almost invariant sets. Every almost invariant set differs from an invariant set by a set of measure zero (see Exercise 6.2). Nothing is therefore lost by taking $\mathcal{I}$ to be the σ-field of invariant sets. We can identify the orthogonal projection $\pi$ as (see Exercise 4.8)
\[ \pi f = E^P[f \mid \mathcal{I}] \]
and as the conditional expectation operator, $\pi$ is well defined on $L_p$ as an operator of norm 1, for all $p$ in the range $1 \le p \le \infty$. If $f \in L_\infty$, then $\|A_n f\|_\infty \le \|f\|_\infty$ and by the bounded convergence theorem, for any $p$ satisfying $1 \le p < \infty$, we have $\|A_n f - \pi f\|_p \to 0$ as $n \to \infty$. Since $L_\infty$ is dense in $L_p$ and $\|A_n\| \le 1$ in all the $L_p$ spaces it is easily seen, by a simple approximation argument, that for each $p$ in $1 \le p < \infty$ and $f \in L_p$,
\[ \lim_{n \to \infty} \|A_n f - \pi f\|_p = 0 \]
proving the mean ergodic theorem in all the $L_p$ spaces.

We now concentrate on proving almost sure convergence of $A_n f$ to $\pi f$ for $f \in L_1(P)$. This part is often called the 'individual ergodic theorem' or
'Birkhoff's theorem'. This will be based on an analog of Doob's inequality for martingales. First we will establish an inequality called the maximal ergodic theorem.

Theorem 6.2. (Maximal Ergodic Theorem.) Let $f \in L_1(P)$ and for $n \ge 1$, let
\[ E_n^0 = \{\omega : \sup_{1 \le j \le n} [f(\omega) + f(T\omega) + \cdots + f(T^{j-1}\omega)] \ge 0\}. \]
Then
\[ \int_{E_n^0} f(\omega)\,dP \ge 0. \]

Proof. Let
\[ h_n(\omega) = \sup_{1 \le j \le n} [f(\omega) + f(T\omega) + \cdots + f(T^{j-1}\omega)] = f(\omega) + \max(0, h_{n-1}(T\omega)) = f(\omega) + h_{n-1}^+(T\omega) \]
where
\[ h_n^+(\omega) = \max(0, h_n(\omega)). \]
On $E_n^0$, $h_n(\omega) = h_n^+(\omega)$ and therefore
\[ f(\omega) = h_n(\omega) - h_{n-1}^+(T\omega) = h_n^+(\omega) - h_{n-1}^+(T\omega). \]
Consequently,
\begin{align*}
\int_{E_n^0} f(\omega)\,dP &= \int_{E_n^0} [h_n^+(\omega) - h_{n-1}^+(T\omega)]\,dP \\
&\ge \int_{E_n^0} [h_n^+(\omega) - h_n^+(T\omega)]\,dP \quad \text{(because } h_{n-1}^+(\omega) \le h_n^+(\omega)\text{)} \\
&= \int_{E_n^0} h_n^+(\omega)\,dP - \int_{T E_n^0} h_n^+(\omega)\,dP \quad \text{(because of the invariance of } P \text{ under } T\text{)} \\
&\ge 0.
\end{align*}
The last step follows from the fact that for any integrable function $h(\omega)$, $\int_E h(\omega)\,dP$ is the largest when we take for $E$ the set $E = \{\omega : h(\omega) \ge 0\}$.
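The maximal ergodic theorem can be checked by brute force on a toy measure preserving system. The sketch below assumes, purely for illustration, the cyclic shift $T\omega = \omega + 1 \pmod N$ on $\{0, \ldots, N-1\}$ with the uniform measure (which is invariant), and an arbitrary integer-valued `f`; the function name is hypothetical. It returns $N \int_{E_n^0} f\,dP$, which the theorem says is nonnegative for every $n$.

```python
def maximal_set_integral(f, n):
    """Sum of f over E_n^0 for the cyclic shift T(w) = w + 1 (mod N), N = len(f).

    Dividing by N gives the integral of f over E_n^0 under the uniform measure,
    which is invariant under T.
    """
    N = len(f)
    total = 0
    for w in range(N):
        partial, best = 0, None
        for j in range(n):
            partial += f[(w + j) % N]          # f(w) + f(Tw) + ... + f(T^j w)
            best = partial if best is None else max(best, partial)
        if best >= 0:                           # w belongs to E_n^0
            total += f[w]
    return total


f = [3, -1, -4, 1, -5, 2]    # arbitrary integer-valued test function
values = [maximal_set_integral(f, n) for n in range(1, 13)]
```

Integer arithmetic keeps the check exact. Note that `f` itself sums to a negative number, so the nonnegativity of every entry of `values` is not automatic.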
Now we establish the analog of Doob's inequality or maximal inequality, sometimes referred to as the weak type 1-1 inequality.

Lemma 6.3. For any $f \in L_1(P)$ and $\ell > 0$, denoting by $E_n$ the set
\[ E_n = \{\omega : \sup_{1 \le j \le n} |(A_j f)(\omega)| \ge \ell\} \]
we have
\[ P[E_n] \le \frac{1}{\ell} \int_{E_n} |f(\omega)|\,dP. \]
In particular
\[ P\Big[\omega : \sup_{j \ge 1} |(A_j f)(\omega)| \ge \ell\Big] \le \frac{1}{\ell} \int |f(\omega)|\,dP. \]

Proof. We can assume without loss of generality that $f \in L_1(P)$ is nonnegative. Apply the maximal ergodic theorem (Theorem 6.2) to $f - \ell$. If
\[ E_n = \Big\{\omega : \sup_{1 \le j \le n} \frac{f(\omega) + f(T\omega) + \cdots + f(T^{j-1}\omega)}{j} > \ell\Big\}, \]
then
\[ \int_{E_n} [f(\omega) - \ell]\,dP \ge 0 \]
or
\[ P[E_n] \le \frac{1}{\ell} \int_{E_n} f(\omega)\,dP. \]
We are done.

Given the lemma, the proof of the almost sure ergodic theorem follows along the same lines as the proof of the almost sure convergence in the martingale context. If $f \in H_0$ it is trivial. For $f = (I - U)g$ with $g \in L_\infty$ it is equally trivial because $\|A_n f\|_\infty \le \frac{2\|g\|_\infty}{n}$. So the almost sure convergence is valid for $f = f_1 + f_2$ with $f_1 \in H_0$ and $f_2 = (I - U)g$ with $g \in L_\infty$. But such functions are dense in $L_1(P)$. Once we have almost sure convergence for a dense set in $L_1(P)$, the almost sure convergence for every $f \in L_1(P)$ follows by routine approximation using Lemma 6.3. See the proof of Theorem 5.7.
Exercise 6.1. For any bounded linear transformation $A$ on a Hilbert space $H$, show that the closure of the range of $A$, i.e. $\overline{\mathrm{Range}\,A}$, is the orthogonal complement of the null space $\{f : A^*f = 0\}$ where $A^*$ is the adjoint of $A$.

Exercise 6.2. Show that any almost invariant set differs by a set of measure 0 from an invariant set, i.e. if $P(A \,\Delta\, T^{-1}A) = 0$ then there is a $B \in \mathcal{F}$ with $P(A \,\Delta\, B) = 0$ and $T^{-1}B = B$.

Although the ergodic theorem implies a strong law of large numbers for any stationary sequence of random variables, in particular a sequence of independent identically distributed random variables, it is not quite the end of the story. For the law of large numbers, we need to know that the limit $\pi f$ is a constant, which will then equal $\int f(\omega)\,dP$. To claim this, we need to know that the invariant σ-field is trivial or essentially consists of the whole space $\Omega$ and the empty set $\emptyset$. An invariant measure $P$ is said to be ergodic for the transformation $T$ if every $A \in \mathcal{I}$, i.e. every invariant set, has measure 0 or 1. Then every invariant function is almost surely a constant and $\pi f = E[f \mid \mathcal{I}] = \int f(\omega)\,dP$.

Theorem 6.4. Any product measure is ergodic for the shift.

Proof. Let $A$ be an invariant set. Then $A$ can be approximated by sets $A_n$ in the σ-field corresponding to the coordinates from $[-n, n]$. Since $A$ is invariant, $T^{\pm 2n}A_n$ will approximate $A$ just as well. This proves that $A$ actually belongs to the tail σ-field, the remote past as well as the remote future. Now we can use Kolmogorov's 0-1 law (Theorem 3.15) to assert that $P(A) = 0$ or 1.
6.2 Structure of Stationary Measures.
Given a space $(\Omega, \mathcal{F})$ and a measurable transformation $T$ with a measurable inverse $T^{-1}$, we can consider the space $M$ of all $T$-invariant probability measures on $(\Omega, \mathcal{F})$. The set $M$, which may be empty, is easily seen to be a convex set.

Exercise 6.3. Let $\Omega = \mathbb{Z}$, the integers, and for $n \in \mathbb{Z}$, let $T(n) = n + 1$. Show that $M$ is empty.

Theorem 6.5. A probability measure $P \in M$ is ergodic if and only if it is an extreme point of $M$.
Proof. A point of a convex set is extreme if it cannot be written as a nontrivial convex combination of two other points from that set. Suppose $P \in M$ is not extremal. Then $P$ can be written as a nontrivial convex combination of $P_1, P_2 \in M$, i.e. for some $0 < a < 1$ and $P_1 \ne P_2$, $P = aP_1 + (1 - a)P_2$. We claim that such a $P$ cannot be ergodic. If it were, by definition, $P(A) = 0$ or 1 for every $A \in \mathcal{I}$. Since $P(A)$ can be 0 or 1 only when $P_1(A) = P_2(A) = 0$ or $P_1(A) = P_2(A) = 1$, it follows that for every invariant set $A \in \mathcal{I}$, $P_1(A) = P_2(A)$. We now show that if two invariant measures $P_1$ and $P_2$ agree on $\mathcal{I}$, they agree on $\mathcal{F}$. Let $f(\omega)$ be any bounded $\mathcal{F}$-measurable function. Consider the function
\[ h(\omega) = \lim_{n \to \infty} \frac{1}{n} [f(\omega) + f(T\omega) + \cdots + f(T^{n-1}\omega)] \]
defined on the set $E$ where the limit exists. By the ergodic theorem $P_1(E) = P_2(E) = 1$ and $h$ is $\mathcal{I}$ measurable. Moreover, by the stationarity of $P_1$, $P_2$ and the bounded convergence theorem,
\[ E^{P_i}[f(\omega)] = \int_E h(\omega)\,dP_i \quad \text{for } i = 1, 2. \]
Since $P_1 = P_2$ on $\mathcal{I}$ and $h$ is $\mathcal{I}$ measurable and $P_i(E) = 1$ for $i = 1, 2$ we see that
\[ E^{P_1}[f(\omega)] = E^{P_2}[f(\omega)]. \]
Since $f$ is arbitrary this implies that $P_1 = P_2$ on $\mathcal{F}$, contradicting $P_1 \ne P_2$.

Conversely if $P$ is not ergodic, then there is an $A \in \mathcal{I}$ with $0 < P(A) < 1$ and we define
\[ P_1(E) = \frac{P(A \cap E)}{P(A)}; \qquad P_2(E) = \frac{P(A^c \cap E)}{P(A^c)}. \]
Since $A \in \mathcal{I}$ it follows that $P_i$ are stationary. Moreover $P = P(A)P_1 + P(A^c)P_2$ and hence $P$ is not extremal.

One of the questions in the theory of convex sets is the existence of sufficiently many extremal points, enough to recover the convex set by taking convex combinations. In particular one can ask if any point in the convex set can be obtained by taking a weighted average of the extremals. The next theorem answers the question in our context. We will assume that our space $(\Omega, \mathcal{F})$ is nice, i.e. is a complete separable metric space with its Borel sets.
Theorem 6.6. For any invariant measure $P$, there is a probability measure $\mu_P$ on the set $M_e$ of ergodic measures such that
\[ P = \int_{M_e} Q\,\mu_P(dQ). \]

Proof. If we denote by $P_\omega$ the regular conditional probability distribution of $P$ given $\mathcal{I}$, which exists (see Theorem 4.4) because $(\Omega, \mathcal{F})$ is nice, then
\[ P = \int_\Omega P_\omega\,P(d\omega). \]
We will complete the proof by showing that $P_\omega$ is an ergodic stationary probability measure for almost all $\omega$ with respect to $P$. We can then view $P_\omega$ as a map $\Omega \to M_e$ and $\mu_P$ will be the image of $P$ under the map. Our integral representation in terms of ergodic measures will just be an immediate consequence of the change of variables formula.

Lemma 6.7. For any stationary probability measure $P$, for almost all $\omega$ with respect to $P$, the regular conditional probability distribution $P_\omega$, of $P$ given $\mathcal{I}$, is stationary and ergodic.

Proof. Let us first prove stationarity. We need to prove that $P_\omega(A) = P_\omega(TA)$ a.e. We have to negotiate carefully through null sets. Since a measure on the Borel σ-field $\mathcal{F}$ of a complete separable metric space is determined by its values on a countable generating field $\mathcal{F}_0 \subset \mathcal{F}$, it is sufficient to prove that for each fixed $A \in \mathcal{F}_0$, $P_\omega(A) = P_\omega(TA)$ a.e. $P$. Since $P_\omega$ is $\mathcal{I}$ measurable all we need to show is that for any $E \in \mathcal{I}$,
\[ \int_E P_\omega(A)\,P(d\omega) = \int_E P_\omega(TA)\,P(d\omega) \]
or equivalently
\[ P(E \cap A) = P(E \cap TA). \]
This is obvious because $P$ is stationary and $E$ is invariant.

We now turn to ergodicity. Again there is a minefield of null sets to negotiate. It is a simple exercise to check that if, for some stationary measure $Q$, the ergodic theorem is valid with an almost surely constant limit for the indicator functions $1_A$ with $A \in \mathcal{F}_0$, then $Q$ is ergodic. This needs to be checked only for a countable collection of sets $\{A\}$. We need therefore only to
check that any invariant function is constant almost surely with respect to almost all $P_\omega$. Equivalently for any invariant set $E$, $P_\omega(E)$ must be shown almost surely to be equal to 0 or 1. But $P_\omega(E) = \chi_E(\omega)$ and is always 0 or 1. This completes the proof.

Exercise 6.4. Show that any two distinct ergodic invariant measures $P_1$ and $P_2$ are orthogonal on $\mathcal{I}$, i.e. there is an invariant set $E$ such that $P_1(E) = 1$ and $P_2(E) = 0$.

Exercise 6.5. Let $(\Omega, \mathcal{F}) = ([0, 1), \mathcal{B})$ and $Tx = x + a \pmod 1$. If $a$ is irrational there is just one invariant measure $P$, namely the uniform distribution on $[0, 1)$. This is seen by Fourier analysis. See Remark 2.2.
\[ \int e^{i 2 n \pi x}\,dP = \int e^{i 2 n \pi (Tx)}\,dP = \int e^{i 2 n \pi (x + a)}\,dP = e^{i 2 n \pi a} \int e^{i 2 n \pi x}\,dP \]
If $a$ is irrational, $e^{i 2 n \pi a} = 1$ if and only if $n = 0$. Therefore
\[ \int e^{i 2 n \pi x}\,dP = 0 \quad \text{for } n \ne 0 \]
which makes $P$ uniform. Now let $a = \frac{p}{q}$ be rational with $(p, q) = 1$, i.e. $p$ and $q$ are relatively prime. Then, for any $x$, the discrete distribution with probabilities $\frac{1}{q}$ at the points $\{x, x + a, x + 2a, \ldots, x + (q - 1)a\}$ is invariant and ergodic. We can denote this distribution by $P_x$. If we limit $x$ to the interval $0 \le x < \frac{1}{q}$ then $x$ is uniquely determined by $P_x$. Complete the example by determining all $T$ invariant probability distributions on $[0, 1)$ and find the integral representation in terms of the ergodic ones.
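The contrast between irrational and rational rotations in Exercise 6.5 can be seen numerically. The sketch below is an illustration only, with hypothetical function names and test functions: for an irrational $a$ the orbit averages of $\cos 2\pi x$ approach its integral against the uniform distribution (which is 0), while for $a = 1/4$ the orbit averages of $\cos 8\pi x$ depend on the starting point, as one expects when there are several ergodic measures $P_x$.

```python
import math


def orbit_average(f, x, a, n):
    """Average of f along the orbit x, Tx, ..., T^(n-1) x of T x = x + a (mod 1)."""
    return sum(f((x + k * a) % 1.0) for k in range(n)) / n


def first_harmonic(x):
    return math.cos(2 * math.pi * x)


def fourth_harmonic(x):
    return math.cos(2 * math.pi * 4 * x)


# Irrational rotation: averages approach the integral against the uniform law, 0.
a_irrational = math.sqrt(2) - 1
avg_irrational = orbit_average(first_harmonic, 0.3, a_irrational, 10_000)

# Rational rotation a = 1/4: the average depends on the starting point x,
# reflecting the different ergodic measures P_x.
avg_from_0 = orbit_average(fourth_harmonic, 0.0, 0.25, 4_000)
avg_from_eighth = orbit_average(fourth_harmonic, 0.125, 0.25, 4_000)
```

The function $\cos 8\pi x$ is invariant under the rotation by $1/4$, so its orbit average is simply its value at the starting point. This is the kind of nonconstant invariant function that ergodicity rules out.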
6.3 Stationary Markov Processes.
Let $\pi(x, dy)$ be a transition probability function on $(X, \mathcal{B})$, where $X$ is a state space and $\mathcal{B}$ is a σ-field of measurable subsets of $X$. A stochastic process with values in $X$ is a probability measure on the space $(\Omega, \mathcal{F})$, where $\Omega$ is the space of sequences $\{x_n : -\infty < n < \infty\}$ with values in $X$, and $\mathcal{F}$ is the product σ-field. The space $(\Omega, \mathcal{F})$ has some natural sub σ-fields. For any two integers $m \le n$, we have the sub σ-fields $\mathcal{F}_n^m = \sigma\{x_j : m \le j \le n\}$ corresponding to information about the process during the time interval $[m, n]$. In addition we have $\mathcal{F}_n = \mathcal{F}_n^{-\infty} = \sigma\{x_j : j \le n\}$ and $\mathcal{F}^m = \mathcal{F}_\infty^m = \sigma\{x_j : j \ge m\}$ that
correspond to the past and future. $P$ is a Markov process on $(\Omega, \mathcal{F})$ with transition probability $\pi(\cdot, \cdot)$, if for every $n$, $A \in \mathcal{B}$ and $P$-almost all $\omega$,
\[ P\{x_{n+1} \in A \mid \mathcal{F}_n\} = \pi(x_n, A). \]
Remark 6.1. Given a $\pi$, it is not always true that $P$ exists. A simple but illuminating example is to take $X = \{0, 1, \cdots, n, \cdots\}$ to be the nonnegative integers and define $\pi(x, x + 1) = 1$, so that all the process does is move one step to the right every time. Such a process, if it had started a long time back, will be found nowhere today! So it does not exist. On the other hand if we take $X$ to be the set of all integers then $P$ is seen to exist. In fact there are lots of them. What is true however is that given any initial distribution $\mu$ and initial time $m$, there exists a unique process $P$ on $(\Omega, \mathcal{F}^m)$, i.e. defined on the future σ-field from time $m$ on, that is Markov with transition probability $\pi$ and satisfies $P\{x_m \in A\} = \mu(A)$ for all $A \in \mathcal{B}$.

The shift $T$ acts naturally as a measurable invertible map on the product space $\Omega$ into itself and the notion of a stationary process makes sense. The following theorem connects stationarity and the Markov property.

Theorem 6.8. Let the transition probability $\pi$ be given. Let $P$ be a stationary Markov process with transition probability $\pi$. Then the one dimensional marginal distribution $\mu$, which is independent of time because of stationarity and given by
\[ \mu(A) = P\{x_n \in A\} \]
is $\pi$ invariant in the sense that
\[ \mu(A) = \int \pi(x, A)\,\mu(dx) \]
for every set $A \in \mathcal{B}$. Conversely given such a $\mu$, there is a unique stationary Markov process $P$ with marginals $\mu$ and transition probability $\pi$.

Exercise 6.6. Prove the above theorem. Use Remark 4.7.

Exercise 6.7. If $P$ is a stationary Markov process on a countable state space with transition probability $\pi$ and invariant marginal distribution $\mu$, show that the time reversal map that maps $\{x_n\}$ to $\{x_{-n}\}$ takes $P$ to another stationary Markov process $Q$, and express the transition probability $\hat{\pi}$ of $Q$, as explicitly as you can, in terms of $\pi$ and $\mu$.
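On a finite state space the invariance condition of Theorem 6.8 reads $\mu(y) = \sum_x \mu(x)\pi(x, y)$, and it can be explored numerically. The sketch below is a minimal illustration: the two-state matrix `P` is hypothetical, and iterating $\mu \mapsto \mu\Pi$ from the uniform law is only one convenient way to approximate an invariant $\mu$ (it is assumed here that the chain is irreducible and aperiodic, which holds for this example but not in general).

```python
def step(mu, P):
    """One application of mu -> mu P, i.e. mu'(y) = sum_x mu(x) * pi(x, y)."""
    n = len(P)
    return [sum(mu[x] * P[x][y] for x in range(n)) for y in range(n)]


def invariant_distribution(P, iterations=500):
    """Approximate an invariant mu by iterating mu -> mu P from the uniform law."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iterations):
        mu = step(mu, P)
    return mu


# Hypothetical two-state transition matrix; rows are pi(x, .).
P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = invariant_distribution(P)
# How far mu is from satisfying mu = mu P exactly.
residual = max(abs(a - b) for a, b in zip(mu, step(mu, P)))
```

For this matrix the balance equation $0.1\,\mu(0) = 0.5\,\mu(1)$ gives $\mu = (5/6, 1/6)$, which the iteration should reproduce to rounding error.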
Exercise 6.8. If $\mu$ is an invariant measure for $\pi$, show that the conditional expectation map $\Pi : f(\cdot) \to \int f(y)\,\pi(\cdot, dy)$ induces a contraction in $L_p(\mu)$ for any $p \in [1, \infty]$. We say that a Markov process is reversible if the time reversed process $Q$ of the previous exercise coincides with $P$. Show that $P$ corresponding to $\pi$ and $\mu$ is reversible if and only if the corresponding $\Pi$ in $L_2(\mu)$ is self-adjoint or symmetric.

Since a given transition probability $\pi$ may in general have several invariant measures $\mu$, there will be several stationary Markov processes with transition probability $\pi$. Let $\widetilde{M}$ be the set of invariant probability measures for the transition probabilities $\pi(x, dy)$, i.e.
\[ \widetilde{M} = \Big\{\mu : \mu(A) = \int_X \pi(x, A)\,d\mu(x) \ \text{for all } A \in \mathcal{B}\Big\}. \]
$\widetilde{M}$ is a convex set of probability measures and we denote by $\widetilde{M}_e$ its (possibly empty) set of extremals. For each $\mu \in \widetilde{M}$, we have the corresponding stationary Markov process $P_\mu$ and the map $\mu \to P_\mu$ is clearly linear. If we want $P_\mu$ to be an ergodic stationary process, then it must be an extremal in the space of all stationary processes. The extremality of $\mu \in \widetilde{M}$ is therefore a necessary condition for $P_\mu$ to be ergodic. That it is also sufficient is a little bit of a surprise. The following theorem is the key step in the proof. The remaining part is routine.

Theorem 6.9. Let $\mu$ be an invariant measure for $\pi$ and $P = P_\mu$ the corresponding stationary Markov process. Let $\mathcal{I}$ be the σ-field of shift invariant subsets on $\Omega$. To within sets of $P$ measure 0, $\mathcal{I} \subset \mathcal{F}_0^0$.

Proof. This theorem describes completely the structure of nontrivial sets in the σ-field $\mathcal{I}$ of invariant sets for a stationary Markov process with transition probability $\pi$ and marginal distribution $\mu$. Suppose that the state space can be partitioned nontrivially, i.e. with $0 < \mu(A) < 1$, into two sets $A$ and $A^c$ that satisfy $\pi(x, A) = 1$ a.e. $\mu$ on $A$ and $\pi(x, A^c) = 1$ a.e. $\mu$ on $A^c$. Then the event
\[ E = \{\omega : x_n \in A \ \text{for all } n \in \mathbb{Z}\} \]
provides a nontrivial set in $\mathcal{I}$. The theorem asserts the converse. The proof depends on the fact that an invariant set $E$ is in the remote past $\mathcal{F}_{-\infty}^{-\infty} = \cap_n \mathcal{F}_n^{-\infty}$ as well as in the remote future $\mathcal{F}_\infty^\infty = \cap_m \mathcal{F}_\infty^m$. See the proof of Theorem 6.4. For a Markov process the past and the future are
conditionally independent given the present. See Theorem 4.9. This implies that
\[ P[E \mid \mathcal{F}_0^0] = P[E \cap E \mid \mathcal{F}_0^0] = P[E \mid \mathcal{F}_0^0]\,P[E \mid \mathcal{F}_0^0] \]
and must therefore equal either 0 or 1. This in turn means that corresponding to any invariant set $E \in \mathcal{I}$, there exists $A \subset X$ that belongs to $\mathcal{B}$, such that $E = \{\omega : x_n \in A \ \text{for all } n \in \mathbb{Z}\}$ up to a set of $P$ measure 0. If the Markov process starts from $A$ or $A^c$, it does not ever leave it. That means $0 < \mu(A) < 1$ and
\[ \pi(x, A^c) = 0 \ \text{for } \mu \text{ a.e. } x \in A \quad \text{and} \quad \pi(x, A) = 0 \ \text{for } \mu \text{ a.e. } x \in A^c. \]
Remark 6.2. One way to generate Markov processes with multiple invariant measures is to start with two Markov processes with transition probabilities $\pi_i(x_i, dy_i)$ on $X_i$ and invariant measures $\mu_i$, and consider $X = X_1 \cup X_2$. Define
\[ \pi(x, A) = \begin{cases} \pi_1(x, A \cap X_1) & \text{if } x \in X_1 \\ \pi_2(x, A \cap X_2) & \text{if } x \in X_2 \end{cases} \]
Then any one of the two processes can be going on depending on which world we are in. Both $\mu_1$ and $\mu_2$ are invariant measures. We have combined two distinct possibilities into one. What we have shown is that when we have multiple invariant measures they essentially arise in this manner.

Remark 6.3. We can therefore look at the convex set of measures $\mu$ that are $\pi$ invariant, i.e. $\mu\Pi = \mu$. The extremals of this convex set are precisely the ones that correspond to ergodic stationary processes and they are called ergodic or extremal invariant measures. If the set of invariant probability measures is nonempty for some $\pi$, then there are enough extremals to recover an arbitrary invariant measure as an integral or weighted average of extremal ones.

Exercise 6.9. Show that any two distinct extremal invariant measures $\mu_1$ and $\mu_2$ for the same $\pi$ are orthogonal on $\mathcal{B}$.

Exercise 6.10. Consider the operator $\Pi$ on the $L_p(\mu)$ spaces corresponding to a given invariant measure. The dimension of the eigenspace $\{f : \Pi f = f\}$ that corresponds to the eigenvalue 1 determines the extremality of $\mu$. Clarify this statement.
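The construction in Remark 6.2 can be made concrete on a four-point state space. In the sketch below the two component chains, the matrix `P` and the helper names are hypothetical choices made for illustration. Exact rational arithmetic is used so that invariance can be tested with equality rather than with a tolerance.

```python
from fractions import Fraction as Fr


def is_invariant(mu, P):
    """Exact check of mu(y) = sum_x mu(x) * P[x][y] for every state y."""
    n = len(P)
    return all(sum(mu[x] * P[x][y] for x in range(n)) == mu[y] for y in range(n))


# Hypothetical chains on X1 = {0, 1} and X2 = {2, 3}, glued into one matrix.
P = [
    [Fr(1, 2), Fr(1, 2), Fr(0), Fr(0)],
    [Fr(1, 4), Fr(3, 4), Fr(0), Fr(0)],
    [Fr(0), Fr(0), Fr(0), Fr(1)],
    [Fr(0), Fr(0), Fr(1), Fr(0)],
]
mu1 = [Fr(1, 3), Fr(2, 3), Fr(0), Fr(0)]     # invariant for the first chain
mu2 = [Fr(0), Fr(0), Fr(1, 2), Fr(1, 2)]     # invariant for the second chain


def mixture(a):
    """Convex combination a * mu1 + (1 - a) * mu2."""
    return [a * u + (1 - a) * v for u, v in zip(mu1, mu2)]
```

Calling `is_invariant(mixture(Fr(1, 3)), P)` should return `True`, as should the check for any other weight in $[0, 1]$, while a measure that is not a mixture of `mu1` and `mu2` (for instance the uniform one) fails the check. In this example `mu1` and `mu2` play the role of the extremal invariant measures.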
Exercise 6.11. Let $P_x$ be the Markov process with stationary transition probability $\pi(x, dy)$ starting at time 0 from $x \in X$. Let $f$ be a bounded measurable function on $X$. Then for almost all $x$ with respect to any extremal invariant measure $\nu$,
\[ \lim_{n \to \infty} \frac{1}{n} [f(x_1) + \cdots + f(x_n)] = \int f(y)\,\nu(dy) \]
for almost all $\omega$ with respect to $P_x$.

Exercise 6.12. We saw in the earlier section that any stationary process is an integral over stationary ergodic processes. If we represent a stationary Markov process $P_\mu$ as the integral
\[ P_\mu = \int R\,Q(dR) \]
over stationary ergodic processes, show that the integral really involves only stationary Markov processes with transition probability $\pi$, so that the integral is really of the form
\[ P_\mu = \int_{\widetilde{M}_e} P_\nu\,Q(d\nu) \]
or equivalently
\[ \mu = \int_{\widetilde{M}_e} \nu\,Q(d\nu). \]
Exercise 6.13. If there is a reference measure $\alpha$ such that $\pi(x, dy)$ has a density $p(x, y)$ with respect to $\alpha$ for every $x$, then show that any invariant measure $\mu$ is absolutely continuous with respect to $\alpha$. In this case the eigenspace $\{f : \Pi f = f\}$ in $L_2(\mu)$ gives a complete picture of all the invariant measures.

The question of when there is at most one invariant measure for the Markov process with transition probability $\pi$ is a difficult one. If we have a density $p(x, y)$ with respect to a reference measure $\alpha$ and if for each $x$, $p(x, y) > 0$ for almost all $y$ with respect to $\alpha$, then there can be at most one invariant measure. We saw already that any invariant measure has a density with respect to $\alpha$. If there are at least two invariant measures, then there are at least two ergodic ones which are orthogonal. If we denote by $f_1$ and
$f_2$ their densities with respect to $\alpha$, by orthogonality we know that they are supported on disjoint invariant sets, $A_1$ and $A_2$. In particular $p(x, y) = 0$ for almost all $x$ on $A_1$ in the support of $f_1$ and almost all $y$ in $A_2$ with respect to $\alpha$. By our positivity assumption we must have $\alpha(A_2) = 0$, which is a contradiction.
6.4 Mixing properties of Markov Processes.
One of the questions that is important in the theory of Markov processes is the rapidity with which the memory of the initial state is lost. There is no unique way of assessing it and depending on the circumstances this could happen in many different ways at many different rates. Let $\pi^{(n)}(x, dy)$ be the $n$ step transition probability. The issue is how the measures $\pi^{(n)}(x, dy)$ depend less and less on $x$ as $n \to \infty$. Suppose we measure this dependence by
\[ \rho_n = \sup_{x, y \in X}\ \sup_{A \in \mathcal{B}} |\pi^{(n)}(x, A) - \pi^{(n)}(y, A)| \]
then the following is true.

Theorem 6.10. Either $\rho_n \equiv 1$ for all $n \ge 1$, or $\rho_n \le C\theta^n$ for some $0 \le \theta < 1$. [...] $> 0$, then it is elementary to show that $\rho_1 \le (1 - \delta)$.

Remark 6.5. If $\rho_n \to 0$, we can estimate
\[ |\pi^{(n)}(x, A) - \pi^{(n+m)}(x, A)| = \Big|\int [\pi^{(n)}(x, A) - \pi^{(n)}(y, A)]\,\pi^{(m)}(x, dy)\Big| \le \rho_n \]
and conclude from the estimate that
\[ \lim_{n \to \infty} \pi^{(n)}(x, A) = \mu(A) \]
exists. $\mu$ is seen to be an invariant probability measure.

Remark 6.6. In this context the invariant measure is unique. If $\beta$ is another invariant measure, then because
\[ \beta(A) = \int \pi^{(n)}(y, A)\,\beta(dy) \quad \text{for every } n \ge 1 \]
we have
\[ \beta(A) = \lim_{n \to \infty} \int \pi^{(n)}(y, A)\,\beta(dy) = \mu(A). \]
Remark 6.7. The stationary process $P_\mu$ has the property that if $E \in \mathcal{F}_m^{-\infty}$ and $F \in \mathcal{F}_\infty^n$ with a gap of $k = n - m > 0$ then
\begin{align*}
P_\mu[E \cap F] &= \int_E \int_X \pi^{(k)}(x_m(\omega), dx)\,P_x(T^{-n}F)\,P_\mu(d\omega) \\
P_\mu[E]\,P_\mu[F] &= \int_E \int_X \mu(dx)\,P_x(T^{-n}F)\,P_\mu(d\omega) \\
P_\mu[E \cap F] - P_\mu[E]\,P_\mu[F] &= \int_E \int_X P_x(T^{-n}F)\,[\pi^{(k)}(x_m(\omega), dx) - \mu(dx)]\,P_\mu(d\omega)
\end{align*}
from which it follows that
\[ |P_\mu[E \cap F] - P_\mu[E]\,P_\mu[F]| \le \rho_k\,P_\mu(E) \]
proving an asymptotic independence property for $P_\mu$.
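For a finite state space the coefficient $\rho_n$ can be computed directly, since $\sup_A |p(A) - q(A)| = \frac{1}{2}\sum_z |p(z) - q(z)|$ for two probability vectors $p, q$. The sketch below is an illustration under that finite-state assumption; the two-state matrix `P` and the helper names are hypothetical. It lists $\rho_1, \ldots, \rho_5$ so that the geometric decay can be inspected.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def rho(P, n):
    """rho_n = max over x, y of the total variation distance between
    the n-step laws pi^(n)(x, .) and pi^(n)(y, .)."""
    Pn = P
    for _ in range(n - 1):
        Pn = mat_mul(Pn, P)
    size = len(P)
    # On a finite space, sup_A |p(A) - q(A)| = (1/2) * sum_z |p(z) - q(z)|.
    return max(
        0.5 * sum(abs(Pn[x][z] - Pn[y][z]) for z in range(size))
        for x in range(size) for y in range(size)
    )


P = [[0.9, 0.1],
     [0.5, 0.5]]                          # hypothetical two-state chain
rhos = [rho(P, n) for n in range(1, 6)]   # rho_1, ..., rho_5
```

For this chain $\rho_1 = 0.4$ and the values decay like $0.4^n$, matching the second alternative of Theorem 6.10 with $\theta = 0.4$. A chain with two closed classes, such as the block matrix of Remark 6.2, would instead give $\rho_n = 1$ for all $n$.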
There are situations in which we know that an invariant probability measure $\mu$ exists for $\pi$ and we wish to establish that $\pi^{(n)}(x, A)$ converges to $\mu(A)$ uniformly in $A$ for each $x \in X$ but not necessarily uniformly over the starting points $x$. Uniformity in the starting point is very special. We will illustrate this by an example.

Example 6.1. The Ornstein-Uhlenbeck process is a Markov chain on the state space $X = \mathbb{R}$, the real line, with transition probability $\pi(x, dy)$ given by a Gaussian distribution with mean $\rho x$ and variance $\sigma^2$. It has a density $p(x, y)$ with respect to the Lebesgue measure so that $\pi(x, A) = \int_A p(x, y)\,dy$.
\[ p(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big[-\frac{(y - \rho x)^2}{2\sigma^2}\Big] \]
It arises from the 'auto-regressive' representation
\[ x_{n+1} = \rho x_n + \sigma \xi_{n+1} \]
where $\xi_1, \cdots, \xi_n, \cdots$ are independent standard Gaussians. The characteristic function $\phi(t)$ of any invariant measure satisfies, for every $n \ge 1$,
\[ \phi(t) = \phi(\rho t) \exp\Big[-\frac{\sigma^2 t^2}{2}\Big] = \phi(\rho^n t) \exp\Big[-\frac{(\sum_{j=0}^{n-1} \rho^{2j})\,\sigma^2 t^2}{2}\Big] \]
by induction on $n$. Therefore
\[ |\phi(t)| \le \exp\Big[-\frac{(\sum_{j=0}^{n-1} \rho^{2j})\,\sigma^2 t^2}{2}\Big] \]
and this cannot be a characteristic function unless $|\rho| < 1$ (otherwise by letting $n \to \infty$ we see that $\phi(t) = 0$ for $t \ne 0$ and therefore discontinuous at $t = 0$). If $|\rho| < 1$, by letting $n \to \infty$ and observing that $\phi(\rho^n t) \to \phi(0) = 1$,
\[ \phi(t) = \exp\Big[-\frac{\sigma^2 t^2}{2(1 - \rho^2)}\Big]. \]
The only possible invariant measure is the Gaussian with mean 0 and variance $\frac{\sigma^2}{1 - \rho^2}$. One can verify that this Gaussian is in fact an invariant measure. If $|\rho| < 1$ a direct computation shows that $\pi^{(n)}(x, dy)$ is a Gaussian with mean $\rho^n x$ and variance $\sigma_n^2 = \sum_{j=0}^{n-1} \rho^{2j} \sigma^2 \to (1 - \rho^2)^{-1} \sigma^2$ as $n \to \infty$. Clearly there is uniform convergence only over bounded sets of starting points $x$. This is typical.
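The mean and variance of $\pi^{(n)}(x, dy)$ quoted in Example 6.1 follow from the auto-regressive recursion, and the recursion is easy to run. The sketch below is illustrative only: the parameter values and the function name are hypothetical. It propagates the mean and variance step by step and compares the result with the closed form $\sigma^2 \sum_{j<n} \rho^{2j}$ and with the limiting variance $\sigma^2/(1 - \rho^2)$.

```python
def n_step_moments(x, rho, sigma, n):
    """Mean and variance of x_n given x_0 = x for x_{k+1} = rho*x_k + sigma*xi_{k+1}."""
    mean, var = x, 0.0
    for _ in range(n):
        mean = rho * mean                  # E[x_{k+1}] = rho * E[x_k]
        var = rho**2 * var + sigma**2      # independent noise adds sigma^2 each step
    return mean, var


rho, sigma = 0.8, 1.5                      # hypothetical parameters with |rho| < 1
limit_var = sigma**2 / (1 - rho**2)        # variance of the invariant Gaussian

mean_50, var_50 = n_step_moments(10.0, rho, sigma, 50)
closed_form_50 = sigma**2 * sum(rho ** (2 * j) for j in range(50))
```

The variance does not depend on the starting point, but the mean $\rho^n x$ does. For a fixed $n$ it can be made arbitrarily large by taking $x$ large, which is one way to see why the convergence is uniform only over bounded sets of starting points.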
6.5 Central Limit Theorem for Martingales.
If $\{\xi_n\}$ is an ergodic stationary sequence of random variables with mean zero then we know from the ergodic theorem that the mean $\frac{\xi_1 + \cdots + \xi_n}{n}$ converges to zero almost surely by the law of large numbers. We want to develop some methods for proving the central limit theorem, i.e. the convergence of the distribution of $\frac{\xi_1 + \cdots + \xi_n}{\sqrt{n}}$ to some Gaussian distribution with mean 0 and variance $\sigma^2$. Under the best of situations, since the covariance $\rho_k = E[X_n X_{n+k}]$ may not be 0 for all $k \ne 0$, if we assume that [...]
Remark 6.8. Notice that the condition basically prevents $f$ from vanishing on a set of positive measure or having very flat zeros.

The proof will use methods from the theory of functions of a complex variable.

Proof. Define
\[ g(\theta) = \sum_{n \ge 0} c_n \exp[i n \theta] \]
as the Fourier series of some $g \in L_2(S)$. Assume $c_n \ne 0$ for some $n > 0$. In fact we can assume without loss of generality that $c_0 \ne 0$ by removing a suitable factor of $e^{i k \theta}$ which will not affect $|g(\theta)|$. Then we will show that
\[ \frac{1}{2\pi} \int_S \log |g(\theta)|\,d\theta \ge \log |c_0|. \]
Consider the function
\[ G(z) = \sum_{n \ge 0} c_n z^n \]
as an analytic function in the disc $|z| < 1$. It has boundary values
\[ \lim_{r \to 1} G(re^{i\theta}) = g(\theta) \]
in $L_2(S)$. Since $G$ is an analytic function we know, from the theory of functions of a complex variable, that $\log |G(re^{i\theta})|$ is subharmonic and has the mean value property
\[ \frac{1}{2\pi} \int_S \log |G(re^{i\theta})|\,d\theta \ge \log |G(0)| = \log |c_0|. \]
Since $G(re^{i\theta})$ has a limit in $L_2(S)$, the positive part of $\log |G|$, which is dominated by $|G|$, is uniformly integrable. For the negative part we apply Fatou's lemma and derive our estimate.

Now for the converse. Let $f \in L_1(S)$. Assume $\int_S \log f(\theta)\,d\theta > -\infty$ or equivalently $\log f \in L_1(S)$. Define the Fourier coefficients
\[ a_n = \frac{1}{4\pi} \int_S \log f(\theta) \exp[i n \theta]\,d\theta. \]
Because $\log f$ is integrable, $\{a_n\}$ are uniformly bounded and the power series
\[ A(z) = \sum a_n z^n \]
is well defined for $|z| < 1$. We define
\[ G(z) = \exp[A(z)]. \]
We will show that
\[ \lim_{r \to 1} G(re^{i\theta}) = g(\theta) \]
exists in $L_2(S)$ and $f = |g|^2$, $g$ being the boundary value of an analytic function in the disc. The integral condition on $\log f$ is then the necessary and sufficient condition for writing $f = |g|^2$ with $g$ involving only nonnegative frequencies.
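\begin{align*}
|G(re^{i\theta})|^2 &= \exp\big[2\,\mathrm{Real\ Part}\ A(re^{i\theta})\big] \\
&= \exp\Big[2 \sum_{j=0}^{\infty} a_j r^j \cos j\theta\Big] \\
&= \exp\Big[2 \sum_{j=0}^{\infty} r^j \cos j\theta \Big[\frac{1}{4\pi} \int_S \log f(\varphi) \cos j\varphi\,d\varphi\Big]\Big] \\
&= \exp\Big[\frac{1}{2\pi} \int_S \log f(\varphi) \Big[\sum_{j=0}^{\infty} r^j \cos j\theta \cos j\varphi\Big]\,d\varphi\Big] \\
&= \exp\Big[\int_S \log f(\varphi)\,K(r, \theta, \varphi)\,d\varphi\Big] \\
&\le \int_S f(\varphi)\,K(r, \theta, \varphi)\,d\varphi
\end{align*}
Here $K$ is the Poisson kernel for the disc
\[ K(r, \theta, \varphi) = \frac{1}{2\pi} \sum_{j=0}^{\infty} r^j \cos j\theta \cos j\varphi \]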
is nonnegative and $\int_S K(r, \theta, \varphi)\,d\varphi = 1$. The last step is a consequence of Jensen's inequality. The function
\[ f_r(\theta) = \int_S f(\varphi)\,K(r, \theta, \varphi)\,d\varphi \]
converges to $f$ as $r \to 1$ in $L_1(S)$ by the properties of the Poisson kernel. It is therefore uniformly integrable. Since $|G(re^{i\theta})|^2$ is dominated by $f_r$ we get uniform integrability for $|G|^2$ as $r \to 1$. It is seen now that $G$ has a limit $g$ in $L_2(S)$ as $r \to 1$ and $f = |g|^2$.

One of the issues in the theory of time series is that of prediction. We have a stochastic process $\{X_n\}$ that we have observed for times $n \le -1$ and we want to predict $X_0$. The best predictor is $E^P[X_0 \mid \mathcal{F}_{-1}]$ or in the Gaussian linear context it is the computation of the projection of $X_0$ into $H_{-1}$. If we have a moving average representation, even a causal one, while it is true that
$X_j$ is spanned by $\{\xi_k : k \le j\}$, the converse may not be true. If the two spans were the same, then the best predictor for $X_0$ is just
\[ \hat{X}_0 = \sum_{j \ge 1} a_j \xi_{-j} \]
obtained by dropping one term in the original representation. In fact in answering Q2 the construction yielded a representation with this property. The quantity $|a_0|^2$ is then the prediction error. In any case it is a lower bound.

Q4. What is the value of the prediction error and how do we actually find the predictor?

The situation is somewhat muddled. Let us assume that we have a purely nondeterministic process, i.e. a process with a spectral density satisfying $\int_S \log f(\theta)\,d\theta > -\infty$. Then $f$ can be represented as
\[ f = |g|^2 \]
with $g \in H_2$, where by $H_2$ we denote the subspace of $L_2(S)$ consisting of boundary values of analytic functions in the disc $|z| < 1$, or equivalently functions $g \in L_2(S)$ with only nonnegative frequencies. For any such $g$, we have an analytic function
\[ G(z) = G(re^{i\theta}) = \sum_{n \ge 0} a_n r^n e^{i n \theta}. \]
For any choice of $g \in H_2$ with $f = |g|^2$, we have
\[ |G(0)|^2 = |a_0|^2 \le \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big]. \tag{6.1} \]
There is a choice of $g$ constructed in the proof of the theorem for which
\[ |G(0)|^2 = \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big]. \tag{6.2} \]
The prediction error $\sigma^2(f)$, that depends only on $f$ and not on the choice of $g$, also satisfies
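\[ \sigma^2(f) \ge |G(0)|^2 \tag{6.3} \]
for every choice of $g \in H_2$ with $f = |g|^2$. There is a choice of $g$ such that
\[ \sigma^2(f) = |G(0)|^2. \tag{6.4} \]
Therefore from (6.1) and (6.4)
\[ \sigma^2(f) \le \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big]. \tag{6.5} \]
On the other hand from (6.2) and (6.3)
\[ \sigma^2(f) \ge \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big]. \tag{6.6} \]
We do now have an exact formula
\[ \sigma^2(f) = \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big] \tag{6.7} \]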
for the prediction error. As for the predictor, it is not quite that simple. In principle it is a limit of linear combinations of $\{X_j : j \le 0\}$ and may not always have a simple concrete representation. But we can understand it a little better. Let us consider the spaces $H$ and $L_2(S; \mu)$ of square integrable functions on $S$ with respect to the spectral measure $\mu$. There is a natural isomorphism between the two Hilbert spaces, if we map
\[ \sum a_j X_j \longleftrightarrow \sum a_j e^{i j \theta}. \]
The problem then is the question of approximating $e^{i\theta}$ in $L_2(S; \mu)$ by linear combinations of $\{e^{i j \theta} : j \le 0\}$. We have already established that the error, which is nonzero in the purely nondeterministic case, i.e. when $d\mu = \frac{1}{2\pi} f(\theta)\,d\theta$ for some $f \in L_1(S)$ satisfying
\[ \int_S \log f(\theta)\,d\theta > -\infty, \]
is given by
\[ \sigma^2(f) = \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big]. \]
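The formula for $\sigma^2(f)$ is easy to evaluate numerically, which gives a feel for how (6.1) can be strict. The sketch below is an illustration resting on assumptions that are not in the text: the density is taken to be $f = |g|^2$ with $g(\theta) = 1 + b e^{i\theta}$ (a hypothetical first order moving average), the integral is approximated by an equally spaced Riemann sum, and the function names are invented for the example.

```python
import cmath
import math


def prediction_error(f, grid=4096):
    """exp of (1/(2*pi)) * integral over [0, 2*pi) of log f(theta) d(theta),
    approximated by an equally spaced Riemann sum."""
    total = sum(math.log(f(2 * math.pi * k / grid)) for k in range(grid))
    return math.exp(total / grid)


def ma1_density(b):
    """Spectral density f = |g|^2 with g(theta) = 1 + b * exp(i*theta)."""
    return lambda theta: abs(1 + b * cmath.exp(1j * theta)) ** 2


err_small = prediction_error(ma1_density(0.5))   # g has |a_0|^2 = 1
err_large = prediction_error(ma1_density(2.0))   # g still has |a_0|^2 = 1
```

With $b = 0.5$ the polynomial $1 + bz$ has no zero in the disc and the computed value is close to $|a_0|^2 = 1$, so this $g$ attains equality in (6.1). With $b = 2$ the same recipe gives a value close to 4, strictly larger than $|a_0|^2 = 1$, so this $g$ is not the special representation singled out by (6.2). Rewriting $|1 + 2e^{i\theta}|^2 = |2 + e^{i\theta}|^2$ gives a representation with $|a_0|^2 = 4$.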
We now want to find the best approximation. In order to get at the predictor we have to make a very special choice of the representation $f = |g|^2$. Simply demanding $g \in L_2(S)$ will not even give causal representations. Demanding $g \in H_2$ will always give us a causal representation, but there are too many of these. If we multiply $G(z)$ by an analytic function $V(z)$ that has boundary values $v(\theta)$ satisfying $|v(\theta)| = |V(e^{i\theta})| \equiv 1$ on $S$, then $gv$ is another choice. If we demand that
\[ |G(0)|^2 = \exp\Big[\frac{1}{2\pi} \int_S \log f(\theta)\,d\theta\Big] \tag{6.8} \]
there is at least one choice that will satisfy it. There is still ambiguity, albeit a trivial one among these, for we can always multiply $g$ by a complex number of modulus 1 and that will not change anything of consequence. We have the following theorem.

Theorem 6.13. The representation $f = |g|^2$ with $g \in H_2$, and satisfying (6.8), is unique to within a multiplicative constant of modulus 1. In other words if $f = |g_1|^2 = |g_2|^2$ with both $g_1$ and $g_2$ satisfying (6.8), then $g_1 = \alpha g_2$ on $S$, where $\alpha$ is a complex number of modulus 1.

Proof. Let $F(re^{i\theta}) = \log |G(re^{i\theta})|$. It is a subharmonic function and
\[ \lim_{r \to 1} F(re^{i\theta}) = \frac{1}{2} \log f(\theta). \]
Because
\[ \lim_{r \to 1} G(re^{i\theta}) = g(\theta) \]
in $L_2(S)$, the functions are uniformly integrable in $r$. The positive part of the logarithm $F$ is well controlled and therefore uniformly integrable. Fatou's lemma is applicable and we should always have
\[ \limsup_{r \to 1} \frac{1}{2\pi} \int_S F(re^{i\theta})\,d\theta \le \frac{1}{4\pi} \int_S \log f(\theta)\,d\theta. \]
But because $F$ is subharmonic its average value on a circle of radius $r$ around 0 is nondecreasing in $r$, and the lim sup is the same as the sup. Therefore
\[ F(0) \le \sup_{0 \le r < 1} \frac{1}{2\pi} \int_S F(re^{i\theta})\,d\theta = \limsup_{r \to 1} \frac{1}{2\pi} \int_S F(re^{i\theta})\,d\theta \le \frac{1}{4\pi} \int_S \log f(\theta)\,d\theta. \]

[...]

$> f(k, x)$, we have
\[ V(k, x) = \int V(k + 1, y)\,\pi(x, dy) \]
and this means $V(\bar{\tau} \wedge k, x_{\bar{\tau} \wedge k})$ is a martingale and this establishes the second claim.
CHAPTER 7. DYNAMIC PROGRAMMING AND FILTERING.
Example 7.1. (The Secretary Problem.) An interesting example is the following game. We have a lottery with $N$ tickets. Each ticket has a number on it. The numbers $a_1, \dots, a_N$ are distinct, but the player has no idea of what they are. The player draws a ticket at random and looks at the number. He can either keep the ticket or reject it. If he rejects it, he can draw another ticket from the remaining ones and again decide if he wants to keep it. The information available to him is the numbers on the tickets he has so far drawn and discarded as well as the number on the last ticket that he has drawn and is holding. If he decides to keep the ticket at any stage, then the game ends and that is his ticket. Of course if he continues on till the end, rejecting all of them, he is forced to keep the last one. The player wins only if the ticket he keeps is the one that has the largest number written on it. He cannot go back and claim a ticket that he has already rejected and he cannot pick a new one unless he rejects the one he is holding. Assuming that the draws are random at each stage, how can the player maximize the probability of winning? How small is this probability? It is clear that the strategy to pick the first or the last or any fixed draw has the probability of $\frac{1}{N}$ to win. It is not a priori clear that the probability $p_N$ of winning under the optimal strategy remains bounded away from 0 for large $N$. It seems unlikely that any strategy can pick the winner with significant probability for large values of $N$. Nevertheless the following simple strategy shows that
$$\liminf_{N \to \infty} p_N \ge \frac{1}{4}.$$
Let half the draws go by, no matter what, and then pick the first one which is the highest among the tickets drawn up to the time of the draw. If the second best has already been drawn and the best is still to come, this strategy will succeed. This has probability nearly $\frac{1}{4}$.
In fact the strategy works if the $k$ best tickets have not been seen during the first half, the $(k+1)$-th has been, and among the $k$ best the highest shows up first in the second half. The probability for this is about $\frac{1}{k\,2^{k+1}}$, and as these are disjoint events
$$\liminf_{N \to \infty} p_N \ge \sum_{k \ge 1} \frac{1}{k\,2^{k+1}} = \frac{1}{2}\log 2.$$
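The cutoff strategy described above is easy to simulate. The following is a minimal Monte Carlo sketch, not part of the original argument: the function names, the choice $N = 100$, the number of trials and the random seed are all assumptions made for illustration. It lets a fraction $x$ of the tickets go by and then keeps the first ticket that is the best seen so far; the estimates for $x = \frac12$ and $x = \frac1e$ can be compared with $\frac12 \log 2$ and $\frac1e$.

```python
import math
import random


def wins_with_cutoff(N, cutoff, rng):
    """Play one game: reject the first `cutoff` tickets, then keep the
    first ticket that beats everything seen so far."""
    ranks = list(range(N))          # rank N - 1 is the best ticket
    rng.shuffle(ranks)
    best_seen = max(ranks[:cutoff]) if cutoff > 0 else -1
    for r in ranks[cutoff:]:
        if r > best_seen:
            return r == N - 1       # the first record after the cutoff is kept
    return ranks[-1] == N - 1       # no record appeared: forced to keep the last


def estimate(N, x, trials, seed=0):
    rng = random.Random(seed)       # fixed seed, an arbitrary choice
    cutoff = int(N * x)
    wins = sum(wins_with_cutoff(N, cutoff, rng) for _ in range(trials))
    return wins / trials


half = estimate(100, 0.5, 20000)
inv_e = estimate(100, 1.0 / math.e, 20000)
print(half, 0.5 * math.log(2))
print(inv_e, 1.0 / math.e)
```

For finite $N$ the simulated frequencies differ slightly from the limiting values, both because of sampling noise and because the bounds in the text are asymptotic.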
If we decide to look at the first $Nx$ tickets rather than $\frac{N}{2}$, the lower bound becomes $x \log \frac{1}{x}$ and an optimization over $x$ leads to $x = \frac{1}{e}$ and the resulting
lower bound
$$\liminf_{N \to \infty} p_N \ge \frac{1}{e}.$$
We will now use the method of optimal stopping to decide on the best strategy for every $N$ and show that the procedure we described is about the best. Since the only thing that matters is the ordering of the numbers, the numbers themselves have no meaning. Consider a Markov chain with two states 0 and 1. The player is in state 1 if he is holding the largest ticket so far. Otherwise he is in state 0. If he is in state 1 and stops at stage $k$, i.e. when $k$ tickets have been drawn, the probability of his winning is easily calculated to be $\frac{k}{N}$. If he is in state 0, he has to go on and the probability of landing on 1 at the next step is calculated to be $\frac{1}{k+1}$. If he is at 1 and decides to play on, the probability is still $\frac{1}{k+1}$ for landing on 1 at the next stage. The problem reduces to optimal stopping for a sequence $X_1, X_2, \dots, X_N$ of independent random variables with $P\{X_i = 1\} = \frac{1}{i+1}$, $P\{X_i = 0\} = \frac{i}{i+1}$ and a reward function of $f(i, 1) = \frac{i}{N}$; $f(i, 0) = 0$. Let us define recursively the optimal probabilities
$$V(i, 0) = \frac{1}{i+1}\,V(i+1, 1) + \frac{i}{i+1}\,V(i+1, 0)$$
and
$$V(i, 1) = \max\Big[\frac{i}{N},\ \frac{1}{i+1}\,V(i+1, 1) + \frac{i}{i+1}\,V(i+1, 0)\Big] = \max\Big[\frac{i}{N},\ V(i, 0)\Big].$$
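The recursion for $V(i, 0)$ and $V(i, 1)$ can be run backwards from $i = N$. The sketch below is an illustration under stated assumptions: the terminal values $V(N, 1) = 1$ and $V(N, 0) = 0$ are taken from the reward function $f(N, 1) = \frac{N}{N}$, $f(N, 0) = 0$ together with the rule that the last ticket must be kept, and the optimal winning probability is read off as $V(1, 1)$ because the first ticket drawn is always the largest so far. The function names are hypothetical.

```python
from fractions import Fraction


def secretary_values(N):
    """Backward induction for V(i, 0) and V(i, 1), i = 1, ..., N.
    Index 0 of each list is unused so that V0[i] matches V(i, 0)."""
    V0 = [Fraction(0)] * (N + 1)
    V1 = [Fraction(0)] * (N + 1)
    V1[N] = Fraction(1)             # assumed terminal reward f(N, 1) = N / N
    V0[N] = Fraction(0)             # assumed terminal reward f(N, 0) = 0
    for i in range(N - 1, 0, -1):
        go_on = Fraction(1, i + 1) * V1[i + 1] + Fraction(i, i + 1) * V0[i + 1]
        V0[i] = go_on                        # in state 0 we must continue
        V1[i] = max(Fraction(i, N), go_on)   # in state 1 we stop or continue
    return V0, V1


def optimal_win_probability(N):
    _, V1 = secretary_values(N)
    return V1[1]


for N in (3, 4, 10, 100):
    print(N, float(optimal_win_probability(N)))
```

Exact rational arithmetic is used so that the comparison between $\frac{i}{N}$ and $V(i, 0)$ is not affected by rounding.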
It is clear what the optimal strategy is. We should always draw if we are in state 0, since we are sure to lose if we stop. If we are holding a ticket that is the largest so far, we should stop provided
$$\frac{i}{N} > V(i, 0)$$
and go on if
$$\frac{i}{N} < V(i, 0).$$
Either strategy is acceptable in case of equality. Since $V(i+1, 1) \ge V(i+1, 0)$ for all $i$, it follows that $V(i, 0) \ge V(i+1, 0)$. There is therefore a critical $k\,(= k_N)$ such that $\frac{i}{N} \ge V(i, 0)$ if $i \ge k$ and $\frac{i}{N} \le V(i, 0)$ if $i \le k$. The best strategy is to wait till $k$ tickets have been drawn, discarding every ticket,
and then pick the first one that is the best so far. The last question is the determination of $k = k_N$. For $i \ge k$,
$$V(i, 0) = \frac{1}{i+1}\cdot\frac{i+1}{N} + \frac{i}{i+1}\,V(i+1, 0) = \frac{1}{N} + \frac{i}{i+1}\,V(i+1, 0)$$
or
$$\frac{V(i, 0)}{i} - \frac{V(i+1, 0)}{i+1} = \frac{1}{N}\cdot\frac{1}{i}$$
telling us
$$V(i, 0) = \frac{i}{N}\sum_{j=i}^{N-1}\frac{1}{j}$$
so that
$$k_N = \inf\Big\{ i : \frac{1}{N}\sum_{j=i}^{N-1}\frac{1}{j} < \frac{1}{N} \Big\}.$$
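The critical index can be computed directly from this formula, since the condition is the same as $\sum_{j=i}^{N-1} \frac{1}{j} < 1$. The following sketch, with a hypothetical function name, accumulates the tail sum from $i = N - 1$ downwards and stops as soon as it reaches 1; it is meant only to illustrate how $k_N / N$ compares with $\frac{1}{e}$.

```python
from fractions import Fraction
import math


def critical_index(N):
    """Smallest i with sum_{j=i}^{N-1} 1/j < 1 (the empty sum at i = N is 0)."""
    tail = Fraction(0)
    k = N
    for i in range(N - 1, 0, -1):
        tail += Fraction(1, i)      # tail now equals sum_{j=i}^{N-1} 1/j
        if tail < 1:
            k = i
        else:
            break                   # the tail only grows as i decreases
    return k


for N in (10, 100, 1000):
    k = critical_index(N)
    print(N, k, k / N, 1.0 / math.e)
```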
Approximately $\log N - \log k_N = 1$ or $k_N = \frac{N}{e}$.

7.3 Filtering.
The problem in filtering is that there is an underlying stochastic process that we cannot observe. There is a related stochastic process 'driven' by the first one that we can observe, and we want to use our information to draw conclusions about the state of the unobserved process. A simple but extreme example is when the unobserved process does not move and remains at the same value. Then it becomes a parameter. The driven process may be a sequence of i.i.d. random variables with densities $f(\theta, x)$ where $\theta$ is the unobserved, unchanging underlying parameter. We have a sample of $n$ independent observations $X_1, \dots, X_n$ from the common distribution $f(\theta, x)$ and our goal is then nothing other than parameter estimation. We shall take a Bayesian approach. We have a prior distribution $\mu(d\theta)$ on the space of parameters $\Theta$ and this can be modified to an 'a posteriori' distribution after the sample is observed. We have the joint distribution
$$\prod_{i=1}^{n} f(\theta, x_i)\,dx_i\,\mu(d\theta)$$
and we calculate the conditional distribution $\mu_n(d\theta \,|\, x_1, \dots, x_n)$ of $\theta$ given $x_1, \dots, x_n$. This is our best informed guess about the nature of the unknown parameter. We can use this information as we see fit. If we have an additional observation $x_{n+1}$ we need not recalculate everything, but we can simply update by viewing $\mu_n$ as the new prior and calculating the posterior after a single observation $x_{n+1}$. We will just work out a single illustration of this known as the Kalman-Bucy filter. Suppose $\{x_n\}$, the unobserved process, is a Gaussian Markov chain
$$x_{n+1} = \rho x_n + \sigma \xi_{n+1}$$
with $0 < \rho < 1$, where the noise terms $\xi_n$ are i.i.d. normally distributed random variables with mean 0 and variance 1. The observed process $y_n$ is given by
$$y_n = x_n + \eta_n$$
where the $\{\eta_j\}$ are again independent standard Gaussians that are independent of the $\{\xi_j\}$ as well. If we start with an initial distribution for $x_0$, say one that is Gaussian with mean $m_0$ and variance $\sigma_0^2$, we can compute the joint distribution of $x_0$, $x_1$ and $y_1$ and then the conditional of $x_1$ given $y_1$. This becomes the new distribution of the state $x_1$ based on the observation $y_1$. This allows us to calculate recursively at every stage. Let us do this explicitly now. The distribution of $x_1, y_1$ is jointly normal with mean $(\rho m_0, \rho m_0)$, variances $(\rho^2\sigma_0^2 + \sigma^2,\ \rho^2\sigma_0^2 + \sigma^2 + 1)$ and covariance $(\rho^2\sigma_0^2 + \sigma^2)$. The posterior distribution of $x_1$ is again Normal with mean
$$m_1 = \rho m_0 + \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}\,(y_1 - \rho m_0) = \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}\,y_1 + \frac{1}{\rho^2\sigma_0^2 + \sigma^2 + 1}\,\rho m_0$$
and variance
$$\sigma_1^2 = (\rho^2\sigma_0^2 + \sigma^2)\Big(1 - \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}\Big) = \frac{\rho^2\sigma_0^2 + \sigma^2}{\rho^2\sigma_0^2 + \sigma^2 + 1}.$$
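The one-step update of the mean and variance can be written as a small function and applied repeatedly, each posterior serving as the prior for the next observation. This is a minimal sketch of the scalar recursion derived above, assuming unit observation noise as in the text; the function name, the parameter values and the observation list are made up for illustration.

```python
def kalman_step(m_prev, var_prev, y, rho, sigma):
    """One update for x_n = rho * x_{n-1} + sigma * xi_n, y_n = x_n + eta_n,
    with standard Gaussian xi_n and eta_n."""
    p = rho ** 2 * var_prev + sigma ** 2       # variance of x_n before seeing y_n
    gain = p / (p + 1.0)                       # weight given to the observation
    m = rho * m_prev + gain * (y - rho * m_prev)
    var = p / (p + 1.0)                        # posterior variance of x_n
    return m, var


# Hypothetical prior and observations, chosen only to exercise the recursion.
m, var = 0.0, 1.0
for y in [2.0, 1.5, 0.7]:
    m, var = kalman_step(m, var, y, rho=0.5, sigma=1.0)
    print(m, var)
```

Note that the variance update does not depend on the observations, so the sequence of variances can be computed in advance.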
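After a long time, while the recursion for $m_n$ remains the same,
$$m_n = \frac{\rho^2\sigma_{n-1}^2 + \sigma^2}{\rho^2\sigma_{n-1}^2 + \sigma^2 + 1}\,y_n + \frac{1}{\rho^2\sigma_{n-1}^2 + \sigma^2 + 1}\,\rho m_{n-1},$$
the variance $\sigma_n^2$ has an asymptotic value $\sigma_\infty^2$ given by the solution of
$$\sigma_\infty^2 = \frac{\rho^2\sigma_\infty^2 + \sigma^2}{\rho^2\sigma_\infty^2 + \sigma^2 + 1}.$$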
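The fixed-point equation for $\sigma_\infty^2$ is a quadratic: clearing the denominator gives $\rho^2 v^2 + (\sigma^2 + 1 - \rho^2)\,v - \sigma^2 = 0$ with $v = \sigma_\infty^2$, and the positive root is the relevant one. The sketch below compares that root with the value obtained by simply iterating the variance recursion; the function names, the starting variance and the parameter values are assumptions made for this illustration.

```python
import math


def iterate_variance(rho, sigma, var0=1.0, steps=200):
    """Iterate v -> (rho^2 v + sigma^2) / (rho^2 v + sigma^2 + 1)."""
    v = var0
    for _ in range(steps):
        p = rho ** 2 * v + sigma ** 2
        v = p / (p + 1.0)
    return v


def stationary_variance(rho, sigma):
    """Positive root of rho^2 v^2 + (sigma^2 + 1 - rho^2) v - sigma^2 = 0."""
    b = sigma ** 2 + 1.0 - rho ** 2
    disc = b * b + 4.0 * rho ** 2 * sigma ** 2
    return (-b + math.sqrt(disc)) / (2.0 * rho ** 2)


rho, sigma = 0.5, 1.0   # hypothetical parameter values with 0 < rho < 1
print(iterate_variance(rho, sigma), stationary_variance(rho, sigma))
```

The iteration converges quickly because the derivative of the map, $\rho^2 / (\rho^2 v + \sigma^2 + 1)^2$, is strictly less than 1.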