Probability with Martingales (Cambridge Mathematical Textbooks)

  • 72 474 1
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

Probability with Martingales (Cambridge Mathematical Textbooks)

Probability with Martingales David Williams Statistical Laboratory, DPMMS Cambridge University Th~ right of th~ Uniwrsi

2,772 482 6MB

Pages 265 Page size 346.56 x 528.6 pts Year 2011

Report DMCA / Copyright

DOWNLOAD FILE

Recommend Papers

File loading please wait...
Citation preview

Probability with Martingales David Williams Statistical Laboratory, DPMMS Cambridge University

Th~ right of th~ Uniwrsity 0/ Cambridg~ to print and s~1I all manner of books was granted by Henry VIII in 153~.

The Uniwrsity has printed and published continuously since /58

[0,1],

2

Chapter 0: A Branching-Process Example

(0.1) ..

Of course, f'(I) is here interpreted as lim f(B) - f(l) Bj! () - 1

= lim 1 - f(B) Btl

1 - () ,

since f(l) = 1. We assume that Jl

< 00.

Notes. The first application of branching-process theory was to the question of survival of family names; and in that context, animal = man, and child = son. In another context, 'animal' can be 'neutron', and 'child' of that neutron will signify a neutron released if and when the parent neutron crashes into a nucleus. Whether or not the associated branching process is supercritical can be a matter of real importance. \Ve can often find branching processes embedded in richer structures and can then use the results of this chapter to start the study of more interesting things. For superb accounts of branching processes, see Athreyaand Ney (1972), Harris (1963), Kendall (1966, 1975). 0.2. Size of nth generation, Zn To be a bit formal: suppose that we are given a doubly infinite sequence

(a) of independent identically distributed random variables (lID RVs) , each with the same distribution as X: p(x~m)

= k) = P(X =

k).

The idea is that for n E Z+ and r E N, the variable X~n+1) represents the number of children (who will be in the (n+ l)th generation) ofthe rth animal (if there is one) in the nth generation. The fundamental rule therefore is that if Zm signifies the size of the nth generation, then

(b)

Z n+l --

X(n+l) 1

+ ••• + X(n+l) Zn·

We assume that Zo 1, so that (b) gives a full recursive definition of the sequence (Zm : m E Z+) from the sequence (a). Our first task is

Chapter 0: A Branching-Process Example

.. (O.S)

to calculate the distribution function of Zn, or equivalently to find the generating function

(c)

0.3. Use of conditional expectations

The first main result is that for n E Z+ (and 0 E [0,1])

(a) so that for each n E Z+,

in

is the n-fold composition

in = i 0 f

(b)

0 ... 0



Note that the O-fold composition is by convention the identity map fo(O) = indeed, forced by - the fact that Zo = 1.

e, in agreement with -

To prove (a), we use - at the moment in intuitive fashion - the following very special case of the very useful Tower Property of Conditional Expectation:

(c)

E(U)

= EE(UIV);

to find the expectation of a random variable U, first find the conditional expectation E(UIV) of U given V, and then find the expectation of that. We prove the ultimate form of (c) at a later stage. We apply (c) with U =

Now, for k satisfies

E

eZn +

1

and V = Zn:

Z+, the conditional expectation of

OZn+l

given that Zn = k

(d) But Zn is constructed from variables X~r) with r S n, and so Zn is inden +1 ), ..• n+1 ). The conditional expectation given Zn = k pendent of in the right-hand term in (d) must therefore agree with the absolute expectation

xi

(e)

,xi

Chapter 0: A Branching-Process Example

-4

(o.S) ..

But the expression at (e) is a expectation of the product of independent random variables and as part of the family of 'Independence means multiply' results, we know that this expectation of a product may be rewritten as the product of expectations. Since (for every nand r) E(8x~n+l»)

= f(8),

we have proved that

and this is what it means to say that

= k, the conditional expectation is equal to the conditional expectation E(UIV = k) of U given that V = k. (Sounds reasonable!)] Property (c) now yields [If V takes only integer values, then when V E(UIV) of U given

~7

E8 Zn + 1

= Ef( 8)Zn ,

and, since

o

result (a) is proved. Independence and conditional expectations are two of the main topics in this course.

0.4. Extinction probability, 7r Let 7r n := P(Zn = 0). Then 7r n = fn(O), so that, by (0.3,b),

(a)

7r n +l

= f(7r n ).

Measure theory confirms our intuition about the extinction probability:

(b)

7r:= P(Zm = 0 for some m)

Because

=i lim7rn •

f is continuous, it follows from (a) that

(c)

7r=f(7r).

The function f is analytic on (0,1), and is non-decreasing and convex (of non-decreasing slope). Also, f(1) = 1 and f(O) = P(X = 0) > O. The slope 1'(1) of f at 1 is f-l = E(X). The celebrated pictures opposite now make the following Theorem obvious.

THEOREM If E(X) > 1, then the extinction probability 7r is the unique root of the equation 7r = f( 7r) which lies strictly between 0 and 1. If E(X) :s; 1, then 7r = 1.

.. (0·4)

Chapter 0: A Branching-Process Example

y = f(x)

~2r---------------~~-----r

~l~~------------~

y=x

o~----------~----~--------~

o

~l

=

Case 1: subcritical, J-l f' (1) < 1. Clearly, ~ The critical case J-l = 1 has a similar picture.

Case 2: supercritical, J-l = f'(I)

> 1.

Now,

11"

= 1.

< 1.

~2

6

Chapter 0: A Branching-Process Example

(0.5) ..

0.5. Pause for thought: measure

Now that we have finished revising what introductory courses on probability theory say about branching-process theory, let us think about why we must find a more precise language. To be sure, the claim at (OA,b) that

(a)

1['

=i lim 7rn

is intuitively plausible, but how could one prove it? We certainly cannot prove it at present because we have no means of stating with puremathematical precision what it is supposed to mean. Let us discuss this further. Back in Section 0.2, we said 'Suppose that we are given a doubly infinite sequence {X~m) : m, r E N} of independent identically distributed random variables each with the same distribution as X'. What does this mean? A random variable is a (certain kind of) function on a sample space n. We could follow elementary theory in taking n to be the set of all outcomes, in other words, taking to be the Cartesian product

n

the typical element w of

n being w

= (w~r)

: r EN,s EN),

and then setting X!r)(w) = w~r). Now n is an uncountable set, so that we are outside the 'combinatorial' context which makes sense of 7r n in the elementary theory. Moreover, if one assumes the Axiom of Choice, one can prove that it is impossible to assign to all subsets of n a probability satisfying the 'intuitively obvious' axioms and making the X's IID RVs with the correct common distribution. So, we have to know that the set of w corresponding to the event 'extinction occurs' is one to which one can uniquely assign a probability (which will then provide a definition of 7r). Even then, we have to prove (a). Example. Consider for a moment what is in some ways a bad attempt to construct a 'probability theory'. Let C be the class of subsets C of N for which the 'density'

p(C) := litm Uk : I S k S n; k E C} n

00

exists. Let C n := {I, 2, ... , n}. Then C n E C and C n i N in the sense that Cn ~ C n + 1 , Vn and also U C n = N. However, p(Cn) = 0, Vn, but peN) = 1.

.. (0.6)

Chapter 0: A Branching-Process Example

7

Hence the logic which will allow us correctly to deduce (a) from the fact that {Zn = O} i {extinction occurs} fails for the (N,C,p) set-up:

(~~,C,p)

is not 'a probability triple'.

0

There are problems. Measure theory resolves them, but provides a huge bonus in the form of much deeper results such as the Martingale Convergence Theorem which we now take a first look at - at an intuitive level, I hasten to add. 0.6. Our first martingale Recall from (0.2,b) that

Z n+1 -- X(n+l) 1

+ ... + X(n+l) Zn'

where the x.(n+l) variables are independent of the values ZI, Z2, . . , , Zn. It is clear from this that

a result which you will probably recognize as stating that the process Z = (Zn : n ~ 0) is a Markov chain. We therefore have

j

or, in a condensed and better notation,

(a) Of course, it is intuitively obvious that

(b) because each of the Zn animals in the nth generation has on average 11 children. We can confirm result (b) by differentiating the result

with respect to

e and setting e = 1.

Chapter 0: A Branching-Process Example

8

(0.6) ..

Now define

n 2:

(c)

o.

Then

E(Mn+lIZo, ZI,"" Zn) = M n, which exactly says that

(d) M is a martingale relative to the Z process. Given the history of Z up to stage n, the next value Mn+l of M is on average what it is now: AI is 'constant on average' in this very sophisticated sense of conditional expectation given 'past' and 'present'. The true statement

(e)

E(Mn)

= 1,

Vn

is of course infinitely cruder. A statement S is said to be true almost surely (a.s.) or with probability 1 if (surprise, surprise!)

peS is true) =1. Because our martingale J..,[ is non-negative (Aln 2: 0, Vn), the Martingale Convergence Theorem implies that it is almost surely true that

(f)

]..[00 :=

lim Mn

exists.

Note that if Moo > 0 for some outcome (which can happen with positive probability only when p. > 1), then the statement (a.s.) is a precise formulation of 'exponential growth'. A particularly fascinating question is: suppose that p. > 1; what is the behaviour of Z conditional on the value of Moo?

0.7. Convergence (or not) of expectations We know that Moo := lim Mn exists with probability 1, and that E(Mn) = 1, \In. We might be tempted to believe that E(Moo) = 1. However, we already know that if p. s: 1, then, almost surely, the process dies out and Mn IS eventually O. Hence

(a)

if p.

s: 1,

then Moo = 0 (a.s.) and 0= E(Moo) =l-lim E(Mn)

= 1.

.. (0.8)

Chapter 0: A Branching-Process Example

9

This is an excellent exan1ple to keep in mind when we come to study Fatou's Lemma, valid for any sequence (Yn ) of non-negative random variables: E(liminfYn) ~ liminfE(Yn ). What is 'going wrong' at (a) is that (when J.l ~ 1) for large n, the chances are that lvln will be large if Mn is not and, very roughly speaking, this large value times its small probability will keep E(Mn) at 1. See the concrete examples in Section 0.9.

°

Of course, it is very important to know when

(b)

limE(·)

= E(lim·),

and we do spend quite a considerable time studying this. The best general theorems are rarely good enough to get the best results for concrete problems, as is evidenced by the fact that E(Moo) = 1 if and only if both Jl.

(c)

> 1 and E(XlogX) < 00,

where X is the typical number of children. Of course 0 log 0 = O. If J.l > 1 and E(XlogX) = 00, then, even though the process may not die out, Moo = 0, a.s.

o.s.

Finding the distribution of Moo

Since Mn

-+

Moo (a.s.), it is obvious that for A > 0, exp( -AMn) -+ exp( ->.Moo)

(a.s.)

Now since each Mn ~ 0, the whole sequence (exp(->.Mn» is bounded in absolute value by the constant 1, independently of the outcome of our experiment. The Bounded Convergence Theorem says that we can now assert what we would wish:

(a)

Eexp(-AMoo)

= limEexp(->.Mn ).

Since Mn = Zn/ Jl.n and E(8 Zn ) = fn(8), we have

(b) so that, in principle (if very rarely in practice), we can calculate the left-hand side of (a). However, for a non-negative random variable Y, the distribution function y 1-+ P(Y ::; y) is completely determined by the map A 1-+ E exp( ->'Y)

on

(0,00).

Chapter 0: A Branching-Process Example

10

(0.8) ..

Hence, in principle, we can find the distribution of Moo. We have seen that the real problem is to calculate the function

L(>.)

:= E exp( ->.Moo).

Using (b), the fact that fn+l = ! 0 fn, and the continuity of L (another consequence of the Bounded Convergence Theorem), you can immediately establish the functional equation:

(c)

L(>'Il)

= !(L(>.)).

0.9. Concretfil example

This concrete example is just about the only one in which one can calculate everything explicitly, but, in the way of mathematics, it is useful in many contexts. We take the 'typical number of children' X to have a geometric distribution:

(a) where

o < p < 1,

q:= 1 - p.

Then, as you can easily check,

(b)

I!q8'

f(8) =

and 7r

To calculate f 0 f 0 ••• 0 of the upper half-plane. If

=

{P1/ q

P

if q if q

!, we use a G=

Il =~,

> p, :5 p.

device familiar from the geometry

(911 912) g21

g22

is a non-singular 2 x 2 matrix, define the fractional linear transformation:

(c)

.. (0.9)

11

Chapter 0: A Branching-Process Example

Then you can check that if H is another such matrix, then G(H(B»

= (GH)(B),

so that composition of fractional linear transformations corresponds to matrix multiplication. Suppose that p

we find that the

nth

f:.

q. Then, by the S-I AS = A method, for example, power of the matrix corresponding to I is

( 0 p)n = -q

1

(q - p)

(1 p) (pn 0) (q -p)

-I

1

q

0

-1

qn

1

'

so that

(d) If p, = q/p ~ 1, then limn In(B) process dies out. Suppose now that p,

=

1, corresponding to the fact that the

> 1. Then you can easily check that, for A > 0,

L(A): = Eexp(-AMoo) = limln(exp(-Ajp,n)) PA qA

+q +q -

P p

= 7re- A.O + lOO(l_7r?e-AXe-(I-1f)Xdx,

from which we deduce that and

P(Moo = 0) = P(x

7r,

< Moo < x + dx)

or, better,

P(Moo > x)

= (1 - 7r?e-(I-1f)xdx

= (l-7r)e-(I-1f)x

(x> 0),

(x > 0) ..

Suppose that p, < 1. In this case, it is interesting to ask: what is the distribution of Zn conditioned by Zn =1= O? We find that

where

Chapter 0: A Branching-Process Example

12 so 0

< an < 1 and an + /3..

= 1.

an

-+

As n

-+ 00,

1 - fJ,

/3n

(0.9)

we see that

-+

fJ,

so (this is justified) lim P(Zn = klZn =I- 0) = (1 - fJ)fJ k- 1

(e)

(k EN).

n-OO

Suppose that fJ = 1. You can show by induction that

f n (0)

= n - (n - 1)0 , (n + 1) - nO

and that corresponding to

P(Zn/n > xlZn =I- 0)

(f)

-+

e- x ,

x> O.

'The FatoD factor' We know that when fJ ::; 1, we have E(Mn) = 1, \In, but E(Moo) = we get some insight into this? First consider the case when fJ for large n,

o.

Can

< 1. Result (e) makes it plausible that

E(ZnIZn =I- 0) is roughly (1- fJ) E kfJk-l = 1/(1- fJ)· We know that

P(Zn =I- 0) = 1 - fn(O) is roughly (1 - fJ)fJ n , so we should have (roughly)

E(Mn) = E

(!= IZn

= (1 _

=I- 0) P(Zn =f 0)

~)fJn (1 -

#)fJ n = 1,

which might help explain how the 'balance' E(Mn) values times small probabilities.

= 1 is achieved by big

.. (0.9)

Chapter 0: A Branching-Process Example

Now consider the case when

P(Zn

~

= 1.

19

Then

=I 0) = l/(n + 1),

and, from (f), Zn/n conditioned by Zn =I 0 is roughly exponential with mean 1, so that Mn = Zn conditioned by Zn =I 0 is on average of size about n, the correct order of magnitude for balance. Warning. We have just been using for 'correct intuitive explanations' exactly the type of argument which might have misled us into thinking that E(.1\100) = 1 in the first place. But, of course, the result

is a matter of obvious fact.

PART A: FOUNDATIONS Chapter 1

Measure Spaces

1.0. Introductory remarks Topology is about open sets. The characterizing property of a continuous function I is that the inverse image 1-1 (G) of an open set G is open. Measure theory is about measurable sets. The characterizing property of a measurable function I is that the inverse image 1-1 (A) of any measurable set is measurable. In topology, one axiomatizes the notion of 'open set', insisting in particular that the union of any collection of open sets is open, and that the intersection of a finite collection of open sets is open. In measure theory, one axiomatizes the notion of 'measurable set', insisting that the union of a countable collection of measurable sets is measurable, and that the intersection of a countable collection of measurable sets is also measurable. Also, the complement of a measurable set must be measurable, and the whole space must be measurable. Thus the measurable sets form a u-algebra, a structure stable (or 'closed') under count ably many set operations. Without the insistence that 'only count ably many operations are allowed', measure theory would be self-contradictory - a point lost on certain philosophers of probability.

The probability that a point chosen at random on the surface of the unit sphere 52 in R3 falls into the subset F of 52 is just the area of F divided by the total area 411". What could be easier? However, Banach and Tarski showed (see Wagon (1985)) that if the Axiom of Choice is assumed, as it is throughout conventional mathematics, then there exists a subset F of the unit sphere 52 in R3 such that for

Chapter 1: Measure Spaces

.. (1.1) 3 :5 k < of F:

00

(and even for k

= (0), 52

15

is the disjoint union of k exact copies

;=1 where each T;(k) is a rotation. If F has an 'area', then that area must simultaneously be 411"/3,411"/4, ... ,0. The only conclusion is that the set F is non-measurable (not Lebesgue measurable): it is so complicated that one cannot assign an area to it. Banach and Tarski have not broken the Law of Conservation of Area: they have simply operated outside its jurisdiction. Remarks. (i) Because every rotation T has a fixed point x on S2 such that = x, it is not possible to find a subset A of 52 and a rotation T such that AU T(A) = 52 and An T(A) = 0. So, we could not have taken k = 2. (ii) Banach and Tarski even proved that given any two bounded subsets A and B of R3 each with non-empty interior, it is possible to decompose A into a certain finite number n of disjoint pieces A = U~1 A; and B into the same number n of disjoint pieces B = U7=1 B;, in such a way that, for each i, A; is Euclid-congruent to B;!!! So, we can disassemble A and rebuild it as B. (iii) Section A1.1 (optional!) in the appendix to this chapter gives an Axiom-of-Choice construction of a non-measurable subset of 51.

T( x)

This chapter introduces u-algebras, 1I"-systems, and measures and emphasizes monotone-convergence properties of measures. We shall see in later chapters that, although not all sets are measurable, it IS always the case for probability theory that enough sets are measurable.

1.1. Definitions of algebra, u-algebra Let 5 be a set. Algebra on 5 A collection ~o of subsets of 5 is called an algebra on 5 (or algebra of subsets of 5) if (i) 5 E ~o, (ii) F E ~o ::::} FC:= 5\F E ~o, (iii) F,G E ~o ::::} FuG E ~o. [Note that 0 = 5 c E ~o and

16

Chapter 1: Measure Spaces

(1.1) ..

Thus, an algebra on S is a family of subsets of S stable under finitely many set operations. Exercise (optional). Let C be the class of subsets C of N for which the 'density' lim m-1Hk: 15 k 5 m;k E C} mToo

exists. We might like to think of this density (if it exists) as 'the probability that a number chosen at random belongs to C'. But there are many reasons why this does not conform to a proper probability theory. (We saw one in Section 0.5.) For example, you should find elements F and Gin C for which FnG ~ C. Note on terminology ('algebra versus field'). An algebra in our sense is a true algebra in the algebraists' sense with n as product, and symmetric difference A6.B := (A U B)\(A n B) as 'sum', the underlying field of the algebra being the field with 2 elements. (This is why we prefer 'algebra of subsets' to 'field of subsets': there is no way that an algebra of subsets is a field in the algebraists' sense - unless 1:0 is trivial, that is, 1:0 = {S, 0}.) u-algebra on S A collection 1: of subsets of S is called a u-algebra on S (or u-algebra of subsets of S) if 1: is an algebra on S such that whenever Fn E 1: (n EN), then n

[Note that if 1: is a u-algebra on Sand Fn E 1: for n E N, then

n

n

Thus, a u-algebra on S is a family of subsets of S 'stable under any countable collection of set operations'. Note. Whereas it is usually possible to write in 'closed form' the typical element of many of the algebras of sets which we shall meet (see Section 1.8 below for a first example), it is usually impossible to write down the typical element of a u-algebra. This is the reason for our concentrating where possible on the much simpler '1r-systems'. Measurable space A pair (S,1:), where S is a set and 1: is a u-algebra on S, is called a measurable space. An element of 1: is called a 1:-measurable subset of S.

.. (1.2)

Chapter 1 : Measure Spaces

17

a(C), a-algebra generated by a class C of subsets Let C be a class of subsets of S. Then a(C), the a-algebra generated by C, is the smallest a-algebra I: on S such that C ~ I: . It is the intersection of all a-algebras on S which have C as a subclass. (Obviously, the class of all subsets of S is a a-algebra which extends C.)

1.2. Examples. Borel a-algebras, B(S), B = 8(R) Let S be a topological space.

8(S) B(S), the Borel a-algebra on S, is the a-algebra generated by the family of open subsets of S. With slight abuse of notation, B(S):= a(open sets). B:= B(R) It is standard shorthand that B := B(R). The a-algebra B is the most important of alIa-algebras. Every subset of R which you meet in everyday use is an element of B; and indeed it is difficult (but possible!) to find a subset of R constructed explicitly (without the Axiom of Choice) which is not in B. Elements of 8 can be quite complicated. However, the collection 7r(R) := {( -00, xl

:x

E R}

(not a standard notation) is very easy to understand, and it is often the case that all we need to know about B is that

(a)

B = a(7r(R)). Proof of (a). For each x in R, (-00, xl = nnEN( -00, x + n -1), so that as a countable intersection of open sets, the set (-00, xl is in B.

All that remains to be proved is that every open subset G of R is in a(7r(R)). But every such G is a countable union of open intervals, so we need only show that, for a, b E R with a < b, (a,b) E a(7r(R)).

But, for any u with u > a,

(a, uJ = (-00, uJ n (-00, ale E a(7r(R)), and since, for e

= Hb -

a),

(a, b) = U(a, b - en-I], n

we see that ( a, b) E a( 7r(R)), and the proof is complete.

o

(1.3) ..

Chapter 1: Measure Spaces

18

1.3. Definitions concerning set functions

Let S be a set, let Eo be an algebra on S, and let J.to be a non-negative set function

J.to : Eo

-+

[O,ooJ.

Additive Then J.to is called additive if J.to(0)

F nG = 0

= 0 and, for F, G E Eo,

J.to(F U G) = J.to(F)

=}

+ J.to(G).

Countably additive The map J.to is called countably additive (or a-additive) if J.t(0) = 0 and whenever (Fn : n E N) is a sequence of disjoint sets in Eo with union F = UFn in Eo (note that this is an assumption since Eo need not be a a-algebra), then n

Of course (why?), a count ably additive set function is additive. 1.4. Definition of measure space

Let (S, E) be a measurable space, so that E is a a-algebra on S. A map

J.t: E -+ [O,ooJ. is called a measure on (S, E) if J.t is countably additive. The triple (S, E, J.t) is then called a measure space. 1.5. Definitions concerning measures

Let (S, E, J.t) be a measure space. Then J.t (or indeed the measure space (S, E, J.t» is called finite if J.t(S) a-finite

< 00,

if there is a sequence (Sn : n E N) of elements of E such that

J.t(Sn)
m Xn is monotone non-increasing in m, so that the limit of the sequence Y';; exists in [-00, ooJ. The use of jlim or llim to signify monotone limits will be handy, as will Yn 1 Yoo to signify Yeo =llim Yn' (b) Analogously, liminfxn :=suP { inf xn} =ilim{ inf xn} E [-00,00]. m

n2:m

m

n~m

(c) We have Xn

converges in [-00,00]

{=:::}

limsupxn = liminfx n ,

and then limx n = limsupxn = liminfxn. ~(d)

Note that (i) if z > limsupxn, then x n < z eventually (that is, for all sufficiently large n) (ii) if z

< lim sup X n , then Xn > z infinitely often

(that is, for infinitely many n).

2.6. Definitions. lim sup En, (En, i.o.) The event (in the rigorous formulation: the truth set of the statement) 'number of heads/ number of tosses -> !' is built out of simple events such as 'the nth toss results in heads' in a rather complicated way. We need a systematic method of being able to handle complicated combinations of events. The idea of taking lim infs and lim sups of sets provides what is required.

It might be helpful to note the tautology that, if E is an event, then

E = {w: wEE}. Suppose now that (En: n E N) is a sequence of events. ~(a)

We define

(En' i.o.): = (En infinitely often) : = lim sup En :=

nU m

= {w : for every m,

En

n2:m

3n(w)

~

m such that w E En(w)}

= {w: wEEn for infinitely many n}. I

.. (2.8) ~(b)

Chapter 2: Events

27

(Reverse Fatou Lemma - needs FINITENESS of P) P(limsupEn) ~ limsupP(En).

Proof. Let G m := Un>m En. Then (look at the definition in (a)) Gm 1 G, where G:= lim sup En-:- By result (l.lO,b), P(G m ) 1 peG). But, clearly,

P(G m) ~ sup PeEn). n2: m

Hence, P( G)

~llim { sup P( En)} =: lim sup P( En). m

n~m

o

2.7. First Borel-Cantelli Lemma (BCl) Let (En : n E N) be a sequence of events such that Ln PeEn) < 00. Then P(limsupEn) = peEn, i.o.) = 0. Proof. With the notation of (2.6,b), we have, for each m,

peG) ~ P(Gm) ~ using (1.9,b) and (l.lO,a). Now let m

i

L

peEn),

00.

o

Notes. (i) An instructive proof by integration will be given later.

(ii) Many applications of the First Borel-Cantelli Lemma will be given within this course. Interesting applications require concepts of independence, random variables, etc .. 2.8. Definitions. liminfEn,

(En, ev)

Again suppose that (En : n E N) is a sequence of events. ~(a) We define

(En, ev) : = (En eventually) : = lim inf En :=

U n En

= {w: for some mew), wE En,Vn ~ mew)} = {w: wEEn for all large n}. (b) Note that (En, evy = (E~, i.o.). ~~(c)

(Fatou's Lemma for sets - true for ALL measure spaces) P(liminf En)

~

liminfP(E n ).

Exercise. Prove this in analogy with the proof of result (2.6,b), using (l.lO,a) rather than (1.10,b).

28

(2.9) ..

Chapter 2: Events

2.9. Exercise

For an event E, define the indicator function IE on

n via

I ( ) ._ {I, if wEE, E W.0, if w ~ E. Let (En: n E N) be a sequence of events. Prove that, for each w,

and establish the corresponding result for lim infs.

Chapter 3

Random Variables

Let

(S,~)

be a measurable space, so that

~

is a a-algebra on S.

3.1. Definitions. ~-measurable function, m~, (m~)+, b~ Suppose that h: S

-+

R. For A

~

R, define

h-1(A) := {s E S: h(s) E A}.

Then h is called ~-measurable if h- 1 : B So, here is a picture of a

~-measurable

-+ ~,

that is, h-1(A) E ~, VA E B.

function h:

We write m~ for the class of ~-measurable functions on S, and (m~)+ for the class of non-negative elements in mL We denote by b~ the class of bounded ~-measurable functions on S.

Note. Because lim sups of sequences even of finite-valued functions may be infinite, and for other reasons, it is convenient to extend these definitions to functions h taking values in [-00,00] in the obvious way: h is called ~-measurable if h- 1 : B[-oo,oo]-+ ~. Which of the various results stated for real-valued functions extend to functions with values in [-00, ooj, and what these extensions are, should be obvious. Borel function A function h from a topological space S to R is called Borel if h is B(S)measurable. The most important case is when S itself is R.

29

(9.2)..

Chapter 9: Random Variables

90

3.2. Elementary Propositions on measurability The map h- l preseMJes all set operations: h-I(Ua Aa) = Ua h-I(Aa ), h-I(AC) = (h-I(AW, Proof This is just definition chasing.

(a)

etc. 0

IfC ~ 8 and u(C) = 8, then h- l : C -+ E => hEmE. Proof. Let £ be the class of elements B in 8 such that h-I(B) E E. By result (a), £ is a u-algebra, and, by hypothesis, £ ;;2 C. 0 (c) If S is topological and h : S -+ R is continuous, then h is Borel. Proof. Take C to be the class of open subsets of R, and apply result (b). 0 ~(d) For any measurable space (S, E), a function h: S -+ R is E-measurable if {h S c} := {s E S: h(s) S c} E E (Vc E R).

~(b)

Proof Take C to be the class 7I'(R) of intervals of the form (-00, cl, cE R, and apply result (b). o Note. Obviously, similar results apply in which {h S c} is replaced by {h> c}, {h ~ c}, etc.

3.3. LEMMA. Sums and products of measurable functions are m.easurable ~

mE is an algebra over R, that is, if A E Rand h, hI, hz E mE, then

hI

+ hz E mE,

hlh2 E mE,

Ah E mE.

Example of proof Let c E R. Then for s E S, it is clear that hl(S)+h2(S) if and only if for some rational q, we have

>c

In other words,

{hl+hz>c}= U({h l >q}n{h2 >c-q}), qEQ

a countable union of elements of E.

o

.. (S.6)

Chapter S: Random Variables

31

3.4. Composition Lemma. If h E

m~

and f E mB, then

f

0

hE

m~.

Proof. Draw the picture:

S~R-LR

~C8C8 Note. There are obvious generalizations based on the definition (important in more advanced theory): if (St, ~d and (S2, ~2) are measurable spaces and h : SI -+ S2, then h is called ~d~2-measurable if h- l : ~2 -+ ~l' From this point of view, what we have called ~-measurable should read ~/B-measurable (or perhaps ~/8[-00, ooJ-measurable).

3.5. LEMMA on measurability of infs, lim infs of functions ~

Let (hn : n E N) be a sequence of elements of m~. Then

(i) inf hn, (ii) liminf hn' (iii) limsuph n are ~-measurable (into ([-00,00], 8[-00,00]), but we shall still write inf h n E m~ (for example)). Further, (iv) {s : limhn(s) exists in

R} E ~.

nn

Proof. (i) {inf h n ~ c} = {hn ~ c}. (ii) Let Ln(s) := inf{hr(s) : r ~ n}. Then Ln E m~, by part (i). But L(s):= liminf hn(s) =i limLn(s) = sup Ln(s), and {L:::; c} = nn{Ln:::; c} E~. (iii) This part is now obvious. (iv) This is also clear because the set on which limh n exists in R is

{lim sup h n < oo}

n {liminf h n > -oo} n g-l( {O}),

where g := lim sup h n -liminf hn'

o

3.6. Definition. Random variable ~Let

(n,F) be our (sample space, family of events). A random variable is an element of mF. Thus, X :

n -+ R,

X-I: B -+ F.

Chapter 9: Random Variables

92

(9.7) ..

3.7. Example. Coin tossing Let

n = {H, T}N, W = (Wt,W2, •• . ), Wn :F = a( {w : Wn

= W}

Let

E {H, T}. As in (2.3,b), we define

: n E N, W E {H, T}). ifw n = H, if Wn = T.

The definition of :F guarantees that each Xn is a random variable. By Lemma 3.3,

Sn := Xl

+ X 2 + ... + Xn =

number of heads in n tosses

is a random variable. Next, for p E [0,1], we have number of heads } A:= { w: -+p ={w:L+(w)=p}n{w:L-(w)=p}, number of tosses

where L+ := limsupn-ISn and L- is the corresponding lim info By Lemma

3.5, A E:F. ~~

Thus, we have taken an important step towards the Strong Law: the result is meaningful! It only remains to prove that it is true! 3.8. Definition. a-algebra generated by a collection of functions

on

n

This is an important idea, discussed further in Section 3.14. (Compare the weakest topology which makes every function in a given family continuous, etc.) In Example 3.7, we have a given set

n,

a family (Xn : n E N) of maps Xn :

n -+ R.

The best way to think of the a-algebra :F in that example is as

:F = a(Xn : n E N) in the sense now to be described. ~~Generally, if we have a collection

Y

(YI' : I E C) of maps YI' : n -+ R, then

:= a(YI' : I E C)

·.(3.10)

99

Chapter 3: Random Variables

is defined to be the smallest a-algebra Y on Q such that each map Y-y (, E C) is Y-measurable. Clearly, a(Y-y : 1 E C) = a( {w E Q: Y-y(w) E B} : 1 E C, BE 8). If X is a random variable for some (Q, F), then, of course, a(X)

~

F.

Remarks. (i) The idea introduced in this section is something which you will pick up gradually as you work through the course. Don't worry about it now; think about it, yes! (ii) Normally, 7r-systems come to our aid. For example, if (Xn : n E N) is a collection of functions on Q, and Xn denotes a(Xk : k ~ n), then the union U Xn is a 7r-system (indeed, an algebra) which generates a(Xn : n EN). 3.9. Definitions. Law, distribution function Suppose that X is a random variable carried by some probability triple (Q, F, P). We have

[0,1]

Define the law

ex

or indeed [0,1] of X by

ex:= ex

P +--

P +--

x- 1

F +--8, x- 1 a(X) +--8.

ex :

P oX-l, 8 -+ [0,1] . . Then (Exercise!) is a probability measure on (R,8). Since 7r(R) = {( -00, c] : c E R} is a 7r-system which generates 8, Uniqueness Lemma 1.6 shows that ex is determined by the function Fx : R -+ [0,1] defined as follows:

Fx(c) := ex( -00, c] = P(X ~ c) = P{w : X(w) ~ c}. The function Fx is called the distribution function of X. 3.10. Properties of distribution functions Suppose that F is the distribution function F = Fx of some random variable X. Then (a) F: R -+ [0,1], F j(that is, x ~ y => F(x) ~ F(y)), (b) limx_ooF(x) = 1, limx __ ooF(x) = 0, (c) F is right-continuous. Proof of (c). By using Lemma (1.10,b), we see that

P(X

~ x

+ n- l ) !

P(X

~ x),

and this fact together with the monotonicity of Fx shows that Fx is rightcontinuous. Exercise! Clear up any loose ends.

(S.ll) ..

Chapter S: Random Variables

3.11. Existence of random variable with given distribution function ~If

F has the properties (a,b,c) in Section 3.10, then, by analogy with Section 1.8 on the existence of Lebesgue measure, we can construct a unique probability measure C on (R, B) such that

Take (f!,F,P)

=

C(-oo,xJ = F(x), Vx. (R,B,C), X(w) = w. Then it is tautological that Fx(x) = F(x), Vx.

Note. The measure C just described is called the Lebesgue-Stieltjes measure associated with F. Its existence is proved in the next section.

3.12. Skorokhod representation of a random variable with prescribed distribution function Again let F : R --+ [0, 1J have properties (3.10,a,b,c). We can construct a random variable with distribution function F carried by

(f!,F, P) = ([0, IJ, B[O, 1], Leb) as follows. Define (the right-hand equalities, which you can prove, are there for clarification only) (al)

X+(w) := inf{z : F(z) > w} = sup{y: F(y) ::; w},

(a1)

X-(w):= inf{z: F(z)

~

w} = sup{y: F(y) < w}.

The following picture shows cases to watch out for.

1

1

W

~ I

--,

I

OL-~------L---------~-------

X-(F(x))

x

By definition of X-,

(w ::; F(c))

=}

(X-(w) ::; c).

X+(F(x))

Chapter 9: Random Variables

.. (9.12)

95

Now,

(z > X-(w))

=>

(F(z)

~

w),

so, by the right-continuity of F, F(X-(w» ~ w, and

Thus, (w :=:; F(c»

(X+(w) :=:; c),

so that F(c):=:; P(X+ :=:; c). Since X- :=:; X+, it is clear that

{X- =J X+} =

U{X- :=:; c < X+}. cEQ

But, for every c E R,

Since Q is countable, the result follows.

o

Remark. It is in fact true that every experiment you will meet in this (or any other) course can be modelled via the triple ([0,1),8[0,1]' Leb). (You will start to be convinced of this by the end of the next ch~pter.) However, this observation normally has only curiosity value. '

{3.13}..

Chapter 3: Random Variables

36

3.13. Generated u-algebras - a discussion Suppose that (n, F, P) is a model for some experiment, and that the experiment has been performed, so that (see Section 2.2) Tyche has made her choice of w. Let (Y.." : I E C) be a collection of random variables associated with our experiment, and suppose that someone reports to you the following information about the chosen point w: (*) the values Y..,,(w), that is, the observed values of the random variables Y"),(,EC). Then the intuitive significance of the u-algebra Y := u(Y")' : I E C) is that it consists precisely of those events F for which, for each and every w, you can decide whether or not F has occurred (that is, whether or not w E F) on the basis of the information (*); the information (*) is precisely equivalent to the following information: (**) the values IF(w) (F E Y). (a) Exercise. Prove that the u-algebra u(Y) generated by a single random variable Y is given by

u(Y)

= y-I(B):= ({w: Yew) E B}:

BE B),

and that u(Y) is generated by the 7r-system

7r(Y):= ({w: Y(w)::; x}: x E R) = y-l(rr(R».

o

The following results might help clarify things. Good advice: stop reading this section after (c)! Results (b) and ( c) are proved in the appendix to this chapter. (b) If Y : n -> R, then Z : n -> R is an u(Y)-measurable function if and only if there exists a Borel function I : R -> R such that Z = I(Y). (c) If YI, Y2 , ••• , Yn are functions from n to R, then a function Z : n -> R is u(Yi., Y2 , ••• , Yn)-measurable if and only if there exists a Borel function f on Rn such that Z = f(Yi.,}2, ... , Yn). We shall see in the appendix that the more correct measurability condition on I is that I be 'Bn-measurable'. (d) If (Y")' : I E C) is a collection (parametrized by the infinite set C) of functions from n to R, then Z : n -> R is u(Y")' : I E C)-measurable if and only if there exists a countable sequence (,i: i E N) of elements of C and a Borel function I on RN such that Z = f(Y")'" Y")'2' .. .). Warning - for the over-enthusiastic only. For uncountable C, B(RG) is much larger than the C-fold product measure space IT")'EC B(R). It is the latter rather than the former which gives the appropriate type of f in (d).

.. (9.14)

Chapter 9: Random Variables

97

3.14. The Monotone-Class Theorem In the same way that Uniqueness Lemma 1.6 allows us to deduce results about a-algebras from results about 7I"-systems, the following 'elementary' version of the Monotone-Class Theorem allows us to deduce results about general measurable functions from results about indicators of elements of 71"systems. Generally, we shall not use the theorem in the main text, preferring 'just to use bare hands'. However, for product measure in Chapter 8, it becomes indispensable.

THEOREM. ~

Let 1-1. be a claS8 of bounded functions from a set S into R satisfying the following conditions: (i) 1i is a vector space over R; (ii) the constant function 1 is an element of 1i; (iii) if (In) is a sequence of non-negative functions in 1-1. such that In i I where f is a bounded function on S, then f E 1i. Then il 1i contains the indicator function of every set in some 71"system I, then 1i contains every bounded a(I)-measurable function on S.

For proof, see the appendix to this chapter.

Chapter 4-

Independence

Let (n,.r, P) be a probability triple. 4.1. Definitions of independence

Note. We focus attention on the q-algebra formulation (and describe the more familiar forms of independence in terms of it) to acclimatize ourselves to thinking of IT-algebras as the natural means of summarizing information. Section 4.2 shows that the fancy q-algebra definitions agree with the ones from elementary courses. Independent q-algebras

gl, ~f2, ... of.r are called independent if, whenever Gi E (i E N) and iI, ... ,in are distinct, then

~Sub-q-algebras

(ii

n

P(Gi 1 n ... n Gin)

= II P(Gik)· k=l

Independent random variables ~Random

variables Xl,X 2 , ••• are called independent if the IT-algebras

are independent. Independent events ~ Events

E 1 , E 2 , • •• are called independent if the q-algebras E1 , E2 , • •• are independent, where En

is the q-algebra {0,E n ,n\En ,n}.

Since En = q(IEn ), it follows that events E 1 , E 2 , .. . are independent if and only if the random variables IE" IE., ... are independent.

98

Chapter

··(4·2)

4:

Independence

39

4.2. The 7r-system Lemma; and the more familiar definitions We know from elementary theory that events E 1 , E 2 , ••• are independent if and only if whenever n E N and i 1 , •• • , in are distinct, then n

P(Ei 1 n ... n E;n) =

II P(E;.), k=1

corresponding results involving complements of the Ej, etc., being consequences of this. We now use the Uniqueness Lemma 1.6 to obtain a significant generalization of this idea, allowing us to st udy independence via (manageable) 7r-systems rather than (awkward) a-algebras. Let us concentrate on the case of two a-algebras. ~~(a)

LEMMA. Suppose that 9 and 1i are sub-a-algebras of F, and that I and .:1 are 7r-systems with a(I)

= 9,

a(.:1)

= 1i.

Then 9 and 1i are independent if and only if I and.:1 are independent in that P(I n J) = P(I)P( J), I E I, J E.:1.

Proof Suppose that I and.:1 are independent. For fixed I in I, the measures (check that they are measures!)

H

I-t

P(I n H) and H

I-t

P(I)P(H)

on (n,1i) have the same total mass P(I), and agree on.:1. By Lemma 1.6, they therefore agree on a(.:1) = 1i. Hence,

P(I n H) = P(I)P(H),

I E I, HE 1i.

Thus, for fixed H in 1i, the measures

G I-t peG n H) and G

I-t

P(G)P(H)

on (n,9) have the same total mass P(H), and agree on I. They therefore . agree on a(I) = 9; and this is what we set out to prove. 0

Chapter 4: Independence

40

Suppose now that X and Y are two random variables on (n,:F, P) such that, whenever x, y E R, P(X ~ XjY ~ y) = P(X ~ x)P(Y ~ y).

(b)

Now, (b) says that the 7r-systems 7r(X) and 7r(Y) (see Section 3.13) are independent. Hence u(X) and u(Y) are independent: that is, X and Y are independent in the sense of Definition 4.1. In the same way, we can prove that random variables Xl, X 2, •.• ,Xn are independent if and only if n

P(Xk ~ Xk : 1 ~ k ~ n) =

II P(Xk ~ Xk), k=l

and all the familiar things from elementary theory. Command: Do Exercise E4.1 now. 4.3. Second Borel-Cantelli Lemma (BC2) ~~

If (En: n E N) is a sequence of independent events, then

L

peEn)

= 00 =>

peEn, Lo.)

= P(limsup En) =

Proof. First, we have

(limsupEn)C

= liminf E! = U

n

1.

E!.

m n~m

With Pn denoting peEn), we have

this equation being true if the condition {n ;::: m} is replaced by condition {r ;::: n ;::: m}, because of independence, and the limit as r T 00 being justified by the monotonicity of the two sides. For x ;::: 0,

1 - x ~ exp( -x), so that, since

E Pn =

00,

II (1 - Pn) ~ exp (- LPn) = o.

n~m

So, P [(lim sup EnYJ =

n~m

o.

Exercise. Prove that if 0 ~ Pn < 1 and S := EPn < 00, then IT(l- Pn) O. Hint. First show that if S < 1, then IT(1 - Pn) ;::: 1 - S.

0

>

41

Chapter 4: Independence

··(4·4) 4.4. Example

Let (Xn : n E N) be a sequence of independent random variables, each exponentially distributed with rate 1:

Then, for a > 0, P(Xn > alogn) = n-"', so that, using (BC1) and (BC2), (aO)

P(Xn > alogn for infinitely many n)

if a> 1, if a ::s 1.

= { 01

Now let L:= lim sup(Xn/log n). Then

P(L

~ 1) ~ P(Xn > logn, i.o.) = 1,

and, for kEN,

P(L > 1 + 2k- 1 ) ::s P (Xn > (1 + k- 1 ) log n, i.o.) = Uk {L > 1 + 2k- 1 } is P-null, and hence

= O.

Thus, {L > I}

L

= 1 almost surely.

Something to think about In the same way, we can prove the finer result (al)

P(Xn > logn + a log log n, i.o. )

=

{ 01

if a> 1, if a ::s 1,

or, even finer,

P(X 2) n (a > 1ogn+ 1og 1ogn+a 1og 1og 1ogn,

1.0.

) -_ {O1 if if aa

::s> 1;1,

or etc. By combining in an appropriate way (think about this!) the sequence of statements (aO),(al),(a2), ... with the statement that the union of a countable number of null sets is null while the intersection of a sequence of probability-l sets has probability 1, we can obviously make remarkably precise statements about the size of the big elements in the sequence (X n ). I have included in the appendix to this chapter the statement of a truly fantastic theorem about precise description of long-term behaviour: Sirassen's Law.

42

Chapter

4:

(4·4) ..

Independence

A number of exercises in Chapter E are now accessible to you.

4.5. A fundamental question for modelling Can we con~truct a ~equence (Xn : n E N) of independent random variables, Xn having prescribed distribution function Fn P We have to be able to answer Yes to this question - for example, to be able to construct a rigorous model for the branching-process model of Chapter 0, or indeed for Example 4.4 to make sense. Equation (0.2,b) makes it clear that a Yes answer to our question is all that is needed for a rigorous branching-process model.

The trick answer based on the existence of Lebesgue measure given in the next section does settle the question. A more satisfying answer is provided by the theory of product measure, a topic deferred to Chapter 8. 4.6. A coin-tossing model with applications Let (n, F, P) be ([0,1], B[O, 1], Leb). For wEn, expand w in binary:

w = 0.WIW2 ... (The existence of two different expansions of a dyadic rational is not going to cause any problems because the set 0 (say) of dyadic rationals in [0,1] has Lebesgue measure 0 - it is a countable set!) An an Exercise, you can prove that the sequence (~n : n EN), where ~n(w):= wn, is a sequence of independent variables each taking the values 0 or 1 with probability t for either possibility. Clearly, (~n : n E N) provides a model for coin tossing.

Now define

Yi(w):= 0.WIW3W6.·· , Y2(w):= 0.W2wSw9 ... ,

}'jew)

:=

0.W4 W SW 13· •• ,

and so on. We now need a bit of common sense. Since the sequence

has the same 'coin-tossing' properties as the full sequence (w n is clear that Y1 has the uniform distribution on [0,1]; and similarly for the other Y's.

:

n EN), it

··(4·8)

Chapter

4:

Independence

Since the sequences (1,3,6, ... ), (2,5,9, ... ), ... which give rise to Yb Y2 , .•• are disjoint, and therefore correspond to different sets of tosses of our 'coin', it is intuitively obvious that ~

Y I , Y 2 , • •• are independent random variables, each uniformly distributed on [0,1]. Now suppose that a sequence (Fn : n E N) of distribution functions is given. By the Skorokhod representation of Section 3.12, we can find functions gn on [0,1] such that

Xn := gn(Yn) has distribution function Fn. But because the V-variables are independent, the same is obviously true of the X -variables. ~

We have therefore succeeded in constructing a family (Xn : n E N) of independent random variables with prescribed distribution functions. Exercise. Satisfy yourself that you could if forced carry through these intuitive arguments rigorously. Obviously, this is again largely a case of utilizing the Uniqueness Lemma 1.6 in much the same way as we did in Section 4.2. 4.7. Notation: lID RVs Many of the most important problems in probability concern sequences of random variables (RVs) which are independent and identically distributed (lID). Thus, if (Xn) is a sequence of lID variables, then the Xn are independent and all have the same distribution function F (say):

P(Xn

~

x) = F(x),

Vn,Vx.

Of course, we now know that for any given distribution function F, we can construct a triple (n,F,p) carrying a sequence of lID RVs with common distribution function F. In particular, we can construct a rigorous model for our branching process. 4.8. Stochastic processes; Markov chains ~A

stochastic process Y parametrized by a set C is a collection

Y

= (Y,

: '"Y E C)

of random variables on some triple (n, F, P). The fundamental question about existence of a stochastic process with prescribed joint distributions is (to all intents and purposes) settled by the famous Daniell-Kolmogorov theorem, which is just beyond the scope of this course.

Chapter

44

4:

Independence

Our concern will be mainly with processes X = (Xn : n E Z+) indexed (or parametrized) by Z+. We think of Xn as the value of the process X at time n. For wEn, the map 11 1-+ Xn(W) is called the sample path of X corresponding to the sample point w. A very important example of a stochastic process is provided by a Markov chain. ~~Let

E be a finite or countable set. Let P = (Pij : i,j E E) be a 3tocha3tic Ex E matrix, so that for i,j E E, we have Pij ~ 0,

Let J-l be a probability measure on E, so that J-l is specified by the values = (Zn: n E Z+) on E with initial di3tribution J-l and I-step transition matrix P is meant a stochastic process Z such that, whenever n E Z+ and io,i 1 , .•. ,in E E,

J-li:= J-l({i}), (i E E). By a time-homogeneou3 Markov chain Z

Exercise. Give a construction of such a chain Z expressing Zn( w) explicitly in terms of the values at w of a suitable family of independent random variables. See the appendix to this chapter. 4.9. Monkey typing Shakespeare Many interesting events must have probability 0 or 1, and we often show that an event F has probability 0 or 1 by using some argument based on independence to show that P(F)2 = P(F). Here is a silly example, to which we apply a silly method, but one which both illustrates very clearly the use of the monotonicity properties of measures in Lemma 1.10 and has a lot of the flavour of the Kolmogorov 0-1 law. See the 'Easy exercise' towards the end of this section for an instantaneous solution to the problem. Let us agree that correctly typing WS, the Collected Works of Shakespeare, amounts to typing a particular sequence of N symbols on a typewriter. A monkey types symbols at random, one per unit time, producing an infinite sequence (Xn) of lID RVs with values in the set of all possible symbols. We agree that €

:= inf{P(Xl = x): x is a symbol}

> O.

Let H be the event that the monkey produces infinitely many copies of WS. Let H k be the event that the monkey will produce at least k copies of WS in

Chapter

··(4·9)

4:

45

Independence

e\J ("A+-

all, and let H m,k be the pr~bability. that it will produce at least k copies by time m. Finally, let H(m) be the event that the monkey produces infinitely many copies of WS over the time period [m + 1,00). Because the monkey's behaviour over [1, m] is independent of its behaviour over [m + 1,00), we have

But logic tells us that, for every m, H(m)

P(Hm,k

r

n H) =

= H!

Hence,

P(Hm,k)P(H).

r

r

But, as m 00, Hm,k Hk, and (Hm,k n H) (Hk obvious that Hk ;2 H. Hence, by Lemma 1.10(a),

n H) = H, it being

P(H) = P(Hk)P(H). However, as k

r 00, Hk ! H, and so, by Lemma 1.10(b), P(H)

= P(H)P(H),

whence P(H) = 0 or 1. The Kolmogorov 0-1 law produces a huge class of important events E for which we must have P(E) = 0 or P(E) = 1. Fortunately, it does not tell us which - and it therefore generates a lot of interesting problems! Easy exercise. Use the Second Borel-Cantelli Lemma to prove that P(H) = 1. Hint. Let El be the event that the monkey produces WS right away, that is, during time period [1, N]. Then P(El) 2': eN. Tricky exercise ( to which we shall return). If the monkey types only capital letters, and is on every occasion equally likely to type any of the 26, how long on average will it take him to produce the sequence 'ABRACADABRA' ?

The next three sections involve quite subtle topics which take time to assimilate. They are not strictly necessary for subsequent chapters. The Kolmogorov 0-1 law is used in one of our two proofs of the Strong Law for lID RVs, but by that stage a quick martingale proof (of the 0-1 law) will have been provided.

Note. Perhaps the otherwise-wonderful 'lEX makes its T too like I. Below, I use K instead of I to avoid the confusion. Script X, X, is too like Greek chi, X, too; but we have to live with that.

46

Chapter

4:

(4. 10) ..

Independence

4.10. Definition. Tail u-algebras ~~Let

X I ,X2 , ••• be random variables. Define Tn := u(Xn +I ,Xn + 2 , •••

(a)

T:=

),

n

Tn.

n

The u-algebra T is called the tail u-algebra of the sequence (X n

:

n EN).

Now, T contains many important events: for example,

(bl)

FI

(b2)

F2 :=

(b3)

) . Xl + X 2 + ... + Xk Fa:= ( hm -----=-k---- exists .

:=

(limXk exists) := {w : limXk(w) exists}, k

(2: Xk converges),

Also, there are many important variables which are in mT: for example,

(c) which may be

±oo,

of course.'

Exercise. Prove that FI ,F2 and F3 are are T-measurable, that the event H in the monkey problem is a tail event, and that the various events of probability 0 and 1 in Section 4.4 are tail events.

Hint - to be read only after you have already tried hard. Look at F3 for example. For each n, logic tells us that Fa is equal to the set (n) { l' Xn+l(w)+",+Xn+k(w) 't} F3 := W: If! k eXls s .

Now, X n + b X n + 2 , ... are all random variables on the triple (n, Tn, P). That Fin) E Tn now follows from Lemmas 3.3 and 3.5. 4.11. THEOREM. Kolmogorov's 0-1 Law ~~

Let (Xn : n E N) be a sequence of independent random variables, and let T be the tail u-algebra of (Xn : n EN). Then T is P-trivial: that is, (i) FE T =? P(F) = 0 or P(F) = 1, (ii) if is aT-measurable random variable, then, is almost deterministic in that for some constant c in [-00,00],

e

e

pee = c) = 1.

.. (-/.11) We allow

Chapter ~

=

4:

47

Independence

±oo at (ii) for obvious reasons.

Proof of (i). Let

Xn := u(X1 , ••• , Xn),

Tn := u(Xn +l,Xn+2 , •• • ).

Step 1: We claim that Xn and Tn are independent. Proof of claim. The class K of events of the form

{w: X;(w)

~

xi: 1 ~ k ~ n},

is a 7r-system which generates X n . The class

{W : Xj(w)

~

Xj : n + 1 ~ j

Xi

:r of sets of the form

n + r},

~

E R U {oo}

rEN,

xjERU{oo}

is a 7r-system which generates Tn. But the assumption that the sequence (Xk) is independent implies that K and:r are independent. Lemma 4.2(a) now clinches our claim. Step 2: Xn and T are independent. This is obvious because T

~

Tn.

Step !i: We claim that Xoo := u(Xn : n E N) and T are independent. Proof of claim. Because Xn ~ X n +1 , "In, the class Koo := U Xn is a 7rsystem (it is generally NOT a u-algebra!) which generates X(X). Moreover, Koo and T are independent, by Step 2. Lemma 4.2(a) again clinches things. Step

4.

Since T ~ X(X)' T is independent of T! Thus,

FE T and P(F)

= 0 or

=>

P(F)

= P(F n F) = P(F)P(F),

1.

o

PrClof of (ii). By part (i), for every x in R, P(~ ~ x) = 0 or 1. Let c := sup{x : P(( ~ x) = OJ. Then, if c = -00, it is clear that P(( = -00) = 1; and if c = 00, it is clear that P(~ = 00) = 1.

So, suppose that c is finite. Then P(( ~ c -lin) = 0, "In, so that

while, since P(( ~ c + lin) = 1, "In, we have

48

Chapter

Hence,

4:

(4. 11) ..

Independence

o

PC, = c) = 1.

Remarks. The examples in Section 4.10 show how striking this result is. For example, if XI, X 2 , ••• is a sequence of independent random variables, then either Xn converges) = 0

P(L:

or

peE Xn converges) = 1.

The Three Series Theorem (Theorem 12.5) completely settles the question of which possibility occurs. So, you can see that the 0-1 law poses numerous interesting questions. Example. In the branching-process example of Chapter 0, the variable

is measurable on the tail u-algebra of the sequence (Zn : n E N) but need not be almost deterministic. But then the variables (Zn : n E N) are not independent. 4.12. Exercise/Warning Let

Yo, VI, 12, ...

be independent random variables with P(Yn

= +1) = P(Yn = -1) = t,

"In.

For n E N, define Xn := YoYi ... Y n • Prove that the variables Xl, X 2 , ••• are independent. Define

y:= u(YI, Y2, .. .),

Tn := u(Xr

: r

> n).

Prove that

Hint. Prove that Yo E mC and that Yo is independent of R. Notes. The phenomenon illustrated by this example tripped up even Kolmogorov and Wiener. The very simple illustration given here was shown to me by Martin Barlow and Ed Perkins. Deciding when we can assert that (for Y a u-algebra and (Tn) a decreasing sequence of u-algebras )

is a tantalizing problem in many probabilistic contexts.

Chapter 5

Integration

5.0. Notation, etc. /-1(/) :=: I

I

dl-', /-1(/; A)

Let (S, E, p.) be a measure space. We are interested in defining for suitable elements I of mE the (Lebesgue) integral of I with respect to /-I, for which we shall use the alternative notations:

/-1(1) :=: IsI(s)/-I(ds) :=: Isld/-l. It is worth mentioning now that we shall also use the equivalent notations for A E E:

(with a true definition on the extreme right!) It should be clear that, for example,

1-'(/;1

~

x):= /-1(/; A), where A = {s E S: I(s)

~

x}.

Something else worth emphasizing now is that, of course, summation is a special type 01 integration. If (an: n E N) is a sequence of real numbers, then with S = N, E = peN), and /-I the measure on (S, E) with 1-'( {k}) = 1 for every k in N, then s 1-+ a. is /-I-integrable if and only if:E lanl < 00, and then

We begin by considering the integral of a function I in (mE)+, allowing such an I to take values in the extended hall-line [0,00].

50

Chapter 5: Integration

{5.1}..

5.1. Integrals of non-negative simple functions, SF+ If A is an element of ~, we define

The use of po rather than p signifies that we currently have only a naive integral defined for simple functions. An element / of (m~)+ is called simple, and we shall then write / E SF+, if / may be written as a finite sum

(a) where ak E [0,00] and Ak E

~.

We then define

(b)

(with 0.00:= 0 =: 00.0).

The first point to be checked is that /10(/) is well-defined; for / will have many different representations of the form (a), and we must ensure that they yield the same value of /10(/) in (b). Various desirable properties also need to be checked, namely (c), (d) and (e) now to be stated: (c)

if /,g E SF+ and /1(/

i- g) = 0 then /10(/) = /1o(g); SF+ and e ;::: 0 then / + 9 and e/ are in

(d) ('Linearity') if /,g E and /10(/ + g) = 1'0(/) + /1o(g),

SF+,

po(e!) = el'o(/);

(e)

(Monotonicity) if /,g E SF+ and / 5 g, then /10(/) 5/10(g);

(f)

if /,g E SF+ then /1\9 and / V 9 are in SF+.

Checking all the properties just mentioned is a little messy, but it involves no point of substance, and in particular no analysis. We skip this, and turn our attention to what matters: the Monotone-Convergence Theorem.

5.2. Definition of μ(f), f ∈ (mΣ)+

For f ∈ (mΣ)+ we define

(a)    μ(f) := sup{μ0(h) : h ∈ SF+, h ≤ f} ≤ ∞.

Clearly, for f ∈ SF+, we have μ(f) = μ0(f). The following result is important.


LEMMA

(b) If f ∈ (mΣ)+ and μ(f) = 0, then μ({f > 0}) = 0.

Proof. Obviously, {f > 0} = ↑lim {f > n⁻¹}. Hence, using (1.10,a), we see that if μ({f > 0}) > 0, then, for some n, μ({f > n⁻¹}) > 0, and then

μ(f) ≥ μ0(n⁻¹ I_{f > n⁻¹}) = n⁻¹ μ({f > n⁻¹}) > 0,

a contradiction. □

5.3. Monotone-Convergence Theorem (MON)

(a) If (f_n) is a sequence of elements of (mΣ)+ such that f_n ↑ f, then

μ(f_n) ↑ μ(f)    (≤ ∞);

or, in other notation,

∫_S f_n dμ ↑ ∫_S f dμ.

This theorem is really all there is to integration theory. We shall see that other key results such as the Fatou Lemma and the Dominated-Convergence Theorem follow trivially from it. The (MON) theorem is proved in the Appendix. Obviously, the theorem relates very closely to Lemma 1.10(a), the monotonicity result for measures. The proof of (MON) is not at all difficult, and may be read once you have looked at the following definition of α(r).

For r ∈ N, the r-th staircase function α(r) : [0, ∞] → [0, ∞] is defined by

α(r)(x) := 0 if x = 0;    (i − 1)2⁻ʳ if (i − 1)2⁻ʳ < x ≤ i2⁻ʳ ≤ r;    r if x > r.

Then f^(r) := α(r) ∘ f satisfies f^(r) ∈ SF+ and f^(r) ↑ f, so that, by (MON),

μ(f^(r)) ↑ μ(f).

We have made α(r) left-continuous so that if f_n ↑ f (n ∈ N), then α(r)(f_n) ↑ α(r)(f).
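A small numerical sketch (not in the text) of the staircase approximation: on an invented discrete measure space, μ(α(r) ∘ f) increases to μ(f) as r increases. The measure, the function f and the levels r shown are arbitrary choices.

```python
import math

def alpha(r, x):
    """Left-continuous staircase: value (i-1)/2^r on ((i-1)/2^r, i/2^r], capped at r."""
    if x == 0:
        return 0.0
    if x > r:
        return float(r)
    return (math.ceil(x * 2**r) - 1) / 2**r

# Toy measure space: points 1..10 with mu({k}) = 1/k**2, and f(k) = sqrt(k).
points = range(1, 11)
mu = {k: 1.0 / k**2 for k in points}
f = {k: math.sqrt(k) for k in points}

exact = sum(f[k] * mu[k] for k in points)                 # mu(f)
for r in (1, 2, 4, 8, 16):
    approx = sum(alpha(r, f[k]) * mu[k] for k in points)  # mu(alpha_r o f)
    print(r, approx, exact)       # the approximations increase towards mu(f)
```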

(5.9) ..

Chapter 5: Integration

52

Often, we need to apply convergence theorems such as (MON) where the hypothesis (f_n ↑ f in the case of (MON)) holds almost everywhere rather than everywhere. Let us see how such adjustments may be made.

(c) If f, g ∈ (mΣ)+ and f = g (a.e.), then μ(f) = μ(g).

Proof. Let f^(r) = α(r) ∘ f, g^(r) = α(r) ∘ g. Then f^(r) = g^(r) (a.e.) and so, by (5.1,c), μ(f^(r)) = μ(g^(r)). Now let r ↑ ∞, and use (MON). □

(d) If f ∈ (mΣ)+ and (f_n) is a sequence in (mΣ)+ such that, except on a μ-null set N, f_n ↑ f, then μ(f_n) ↑ μ(f).

Proof. We have μ(f) = μ(f I_{S\N}) and μ(f_n) = μ(f_n I_{S\N}). But f_n I_{S\N} ↑ f I_{S\N} everywhere. The result now follows from (MON). □

From now on, (MON) is understood to include this extension. We do not bother to spell out such extensions for the other convergence theorems, often stating results with 'almost everywhere' but proving them under the assumption that the exceptional null set is empty.

Note on the Riemann integral

If, for example, f is a non-negative Riemann integrable function on ([0,1], ℬ[0,1], Leb) with Riemann integral I, then there exists an increasing sequence (L_n) of elements of SF+ and a decreasing sequence (U_n) of elements of SF+ such that

L_n ↑ L ≤ f,    U_n ↓ U ≥ f,    μ(L_n) ↑ I,    μ(U_n) ↓ I.

If we define

f̃ := L if L = U,    f̃ := 0 otherwise,

then it is clear that f̃ is Borel measurable, while (since μ(L) = μ(U) = I) {f̃ ≠ f} is a subset of the Borel set {L ≠ U} which Lemma 5.2(b) shows to be of measure 0. So f is Lebesgue measurable (see Section A1.11) and the Riemann integral of f equals the integral of f associated with ([0,1], Leb[0,1], Leb), Leb[0,1] denoting the σ-algebra of Lebesgue measurable subsets of [0,1].

5.4. The Fatou Lemmas for functions

(a) (FATOU) For a sequence (f_n) in (mΣ)+,

μ(lim inf f_n) ≤ lim inf μ(f_n).


Proof. We have

(*)    lim inf_n f_n = ↑lim_k g_k,  where g_k := inf_{n≥k} f_n.

For n ≥ k, we have f_n ≥ g_k, so that μ(f_n) ≥ μ(g_k), whence

μ(g_k) ≤ inf_{n≥k} μ(f_n),

and on combining this with an application of (MON) to (*), we obtain

μ(lim inf_n f_n) = ↑lim_k μ(g_k) ≤ ↑lim_k inf_{n≥k} μ(f_n) =: lim inf_n μ(f_n). □
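The inequality in (FATOU) can be strict. A standard illustration, checked numerically below on an invented two-point measure space (not part of the text): the f_n alternate between two indicator functions, so lim inf f_n ≡ 0 while μ(f_n) = 1/2 for every n.

```python
import numpy as np

# Two-point space S = {0, 1} with mu({0}) = mu({1}) = 1/2.
mu = np.array([0.5, 0.5])

# f_n alternates between the indicator of {0} and the indicator of {1}.
def f(n):
    return np.array([1.0, 0.0]) if n % 2 == 0 else np.array([0.0, 1.0])

N = 50
fs = np.array([f(n) for n in range(N)])

liminf_f = fs.min(axis=0)               # pointwise lim inf (here: min over n suffices)
lhs = float(liminf_f @ mu)              # mu(lim inf f_n) = 0
rhs = min(float(fn @ mu) for fn in fs)  # lim inf_n mu(f_n) = 1/2
print(lhs, rhs)                         # 0.0 < 0.5 : the inequality is strict
```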

Reverse Fatou Lemma

(b) If (f_n) is a sequence in (mΣ)+ such that for some g in (mΣ)+ we have f_n ≤ g, ∀n, and μ(g) < ∞, then

μ(lim sup f_n) ≥ lim sup μ(f_n).

Proof. Apply (FATOU) to the sequence (g − f_n). □

5.5. 'Linearity'

For α, β ∈ R+ and f, g ∈ (mΣ)+,

μ(αf + βg) = αμ(f) + βμ(g)    (≤ ∞).

Proof. Approximate f and g from below by simple functions, apply (5.1,d) to the simple functions, and then use (MON). □

5.6. Positive and negative parts of f

For f ∈ mΣ, we write f = f⁺ − f⁻, where

f⁺(s) := max(f(s), 0),    f⁻(s) := max(−f(s), 0).

Then f⁺, f⁻ ∈ (mΣ)+, and |f| = f⁺ + f⁻.


5.7. Integrable function, ℒ¹(S, Σ, μ)

For f ∈ mΣ, we say that f is μ-integrable, and write

f ∈ ℒ¹(S, Σ, μ),

if

μ(|f|) = μ(f⁺) + μ(f⁻) < ∞,

and then we define

μ(f) := μ(f⁺) − μ(f⁻).

Note that, for f ∈ ℒ¹(S, Σ, μ),

|μ(f)| ≤ μ(|f|),

the familiar rule that the modulus of the integral is less than or equal to the integral of the modulus.

We write ℒ¹(S, Σ, μ)⁺ for the class of non-negative elements in ℒ¹(S, Σ, μ).

5.8. Linearity

For α, β ∈ R and f, g ∈ ℒ¹(S, Σ, μ), we have αf + βg ∈ ℒ¹(S, Σ, μ) and

μ(αf + βg) = αμ(f) + βμ(g).

Proof. This is a totally routine consequence of the result in Section 5.5. □

5.9. Dominated-Convergence Theorem (DOM)

Suppose that f_n, f ∈ mΣ, that f_n(s) → f(s) for every s in S, and that the sequence (f_n) is dominated by an element g of ℒ¹(S, Σ, μ)⁺:

|f_n(s)| ≤ g(s),    ∀s ∈ S, ∀n ∈ N,

where μ(g) < ∞. Then

f_n → f in ℒ¹(S, Σ, μ): that is, μ(|f_n − f|) → 0,

whence

μ(f_n) → μ(f).

Command: Do Exercise E5.1 now.


Proof. We have |f_n − f| ≤ 2g, where μ(2g) < ∞, so that, by the reverse Fatou Lemma 5.4(b),

lim sup μ(|f_n − f|) ≤ μ(lim sup |f_n − f|) = μ(0) = 0. □
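The domination hypothesis in (DOM) cannot be dropped. A classical illustration (not from the text): on ([0,1], Leb), the functions f_n := n I_{(0,1/n)} converge to 0 pointwise, yet μ(f_n) = 1 for every n; no integrable g dominates the whole sequence. A minimal sketch:

```python
import numpy as np

# f_n := n * I_{(0, 1/n)} on [0, 1]: f_n(x) -> 0 for every x, yet Leb(f_n) = 1 for all n.
def f(n, x):
    return np.where((x > 0) & (x < 1.0 / n), float(n), 0.0)

xs = np.array([0.3, 0.05, 0.001])        # a few fixed sample points
for n in (10, 100, 10_000, 1_000_000):
    integral = n * (1.0 / n)             # n * Leb((0, 1/n)), computed exactly
    print(n, f(n, xs), integral)         # pointwise values shrink to 0; the integral stays 1

# sup_n f_n(x) is of order 1/x near 0, which is not integrable over (0, 1],
# so no dominating g in L^1 exists and (DOM) is not contradicted.
```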

Chapter 6

Expectation

6.0. Introductory remarks

We work with a probability triple (Ω, ℱ, P), and write ℒ^r for ℒ^r(Ω, ℱ, P). Recall that a random variable (RV) is an element of mℱ, that is, an ℱ-measurable function from Ω to R. Expectation is just the integral relative to P. Jensen's inequality, which makes critical use of the fact that P(Ω) = 1, is very useful and powerful: it implies the Schwarz, Hölder, ... inequalities for general (S, Σ, μ). (See Section 6.13.) We study the geometry of the space ℒ²(Ω, ℱ, P) in some detail, with a view to several later applications.

6.1. Definition of expectation

For a random variable X ∈ ℒ¹ = ℒ¹(Ω, ℱ, P), we define the expectation E(X) of X by

E(X) := ∫_Ω X dP = ∫_Ω X(ω) P(dω).

We also define E(X) (≤ ∞) for X ∈ (mℱ)+. In short, E(X) = P(X).

That our present definitions agree with those in terms of probability density function (if it exists) etc. will be confirmed in Section 6.12.

6.2. Convergence theorems

Suppose that (X_n) is a sequence of RVs, that X is a RV, and that X_n → X almost surely: P(X_n → X) = 1. We rephrase the convergence theorems of Chapter 5 in our new notation:


(MON) if 0 ≤ X_n ↑ X, then E(X_n) ↑ E(X) ≤ ∞;

(FATOU) if X_n ≥ 0, then E(X) ≤ lim inf E(X_n);

(DOM) if |X_n(ω)| ≤ Y(ω) ∀(n, ω), where E(Y) < ∞, then

E(|X_n − X|) → 0,

so that

E(X_n) → E(X);

(SCHEFFÉ) if E(|X_n|) → E(|X|), then

E(|X_n − X|) → 0;

(BDD) if for some finite constant K, |X_n(ω)| ≤ K ∀(n, ω), then

E(|X_n − X|) → 0.

With the notation of Section 5.14, define

P := (f^p / μ(f^p)) μ,

so that P is a probability measure on (S, Σ). Define

u(s) := h(s)/f(s)^{p−1} if f(s) > 0;    u(s) := 0 if f(s) = 0.

The fact that P(u)^q ≤ P(u^q) now yields (a). □

Proof of (b). Using Hölder's inequality, we have

μ(|f + g|^p) ≤ μ(|f| |f + g|^{p−1}) + μ(|g| |f + g|^{p−1}) ≤ ‖f‖_p A + ‖g‖_p A,

where

A = ‖ |f + g|^{p−1} ‖_q = μ(|f + g|^p)^{1/q},

and (b) follows on rearranging. (The result is non-trivial only if f, g ∈ ℒ^p, and in that case, the finiteness of A follows from the vector-space property of ℒ^p.) □
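A quick numerical spot-check of Hölder's and Minkowski's inequalities for a finite weighted measure (not from the text); the exponents, weights and data below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3.0, 1.5                        # conjugate exponents: 1/p + 1/q = 1
f, g = rng.random(1000), rng.random(1000)
mu = rng.random(1000)                  # positive weights: a finite measure on 1000 points

def norm(h, r):
    return (np.sum(np.abs(h) ** r * mu)) ** (1.0 / r)

# Hoelder:   mu(|f g|) <= ||f||_p ||g||_q
print(np.sum(np.abs(f * g) * mu), norm(f, p) * norm(g, q))

# Minkowski: ||f + g||_p <= ||f||_p + ||g||_p
print(norm(f + g, p), norm(f, p) + norm(g, p))
```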

Chapter 7

An Easy Strong Law

7.1. 'Independence means multiply' - again!

THEOREM

Suppose that X and Y are independent RVs, and that X and Y are both in ℒ¹. Then XY ∈ ℒ¹ and

E(XY) = E(X)E(Y).

In particular, if X and Y are independent elements of ℒ², then

Cov(X, Y) = 0    and    Var(X + Y) = Var(X) + Var(Y).

Proof. Writing X = X⁺ − X⁻, etc., allows us to reduce the problem to the case when X ≥ 0 and Y ≥ 0. This we do.

But then, if α(r) is our familiar staircase function, then

α(r)(X) = ∑ a_i I_{A_i},    α(r)(Y) = ∑ b_j I_{B_j},

where the sums are over finite parameter sets, and where for each i and j, A_i (in σ(X)) is independent of B_j (in σ(Y)). Hence

E[α(r)(X) α(r)(Y)] = ∑∑ a_i b_j P(A_i ∩ B_j) = ∑∑ a_i b_j P(A_i) P(B_j) = E[α(r)(X)] E[α(r)(Y)].

Now let r ↑ ∞ and use (MON). □

Remark. Note especially that if X and Y are independent then X ∈ ℒ¹ and Y ∈ ℒ¹ imply that XY ∈ ℒ¹. This is not necessarily true when X and Y are not independent, and we need the inequalities of Schwarz, Hölder, etc. It is important that independence obviates the need for such inequalities.
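A Monte Carlo sketch of the product rule E(XY) = E(X)E(Y) for independent X and Y, together with a dependent pair for contrast; this is not part of the text, and the distributions, sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000

X = rng.exponential(scale=2.0, size=N)       # E(X) = 2
Y = rng.normal(loc=3.0, scale=1.0, size=N)   # E(Y) = 3, independent of X

print(np.mean(X * Y), np.mean(X) * np.mean(Y))   # both approximately 6

# For dependent variables the product rule generally fails:
Z = X                                            # fully dependent "copy" of X
print(np.mean(X * Z), np.mean(X) * np.mean(Z))   # E(X^2) = 8 versus E(X)^2 = 4
```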

7.2. Strong Law - first version

The following result covers many cases of importance. You should note that though it imposes a 'finite 4th moment' condition, it makes no assumption about identical distributions for the (X_n) sequence. It is remarkable that so fine a result has so simple a proof.

THEOREM

Suppose that X1, X2, ... are independent random variables, and that for some constant K in [0, ∞),

E(X_k) = 0,    E(X_k⁴) ≤ K,    ∀k.

Let S_n = X1 + X2 + ⋯ + X_n. Then

P(S_n/n → 0) = 1,

or again, S_n/n → 0 (a.s.).

Proof. We have

E(S_n⁴) = E[(X1 + X2 + ⋯ + X_n)⁴] = E( ∑_k X_k⁴ + 6 ∑∑_{i<j} X_i² X_j² ),

since, by independence and the condition E(X_k) = 0, the terms with a factor of odd degree (such as E(X_i³ X_j) or E(X_i X_j X_k X_l) with distinct indices) vanish. By independence and the inequality E(X_i²) E(X_j²) ≤ {E(X_i⁴) E(X_j⁴)}^{1/2} ≤ K, we obtain

E(S_n⁴) ≤ nK + 3n(n − 1)K ≤ 3Kn².

Hence E ∑_n (S_n/n)⁴ ≤ 3K ∑_n n⁻² < ∞, so that ∑_n (S_n/n)⁴ < ∞, a.s., whence (S_n/n)⁴ → 0, a.s., and S_n/n → 0, a.s. □

Corollary. If the condition E(X_k) = 0 in the theorem is replaced by E(X_k) = μ for some constant μ, then the theorem holds with n⁻¹S_n → μ (a.s.) as its conclusion.

Proof. It is obviously a case of applying the theorem to the sequence (Y_k), where Y_k := X_k − μ. But we need to know that

(a)    sup_k E(Y_k⁴) < ∞.

This is obvious from Minkowski's inequality

‖Y_k‖₄ ≤ ‖X_k‖₄ + |μ| ≤ K^{1/4} + |μ|

(the constant function μ1 on Ω having ℒ⁴ norm |μ|). But we can also prove (a) immediately by the elementary inequality (6.7,b). □
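A simulation sketch of the theorem's conclusion for bounded, independent but not identically distributed X_k with mean 0, so the fourth-moment condition holds automatically; this is not from the text, and all parameters (ranges, seed, sample sizes) are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000

# Independent, not identically distributed, bounded RVs with mean 0:
# X_k is uniform on [-a_k, a_k] with a_k in [1, 2], so E(X_k) = 0 and E(X_k^4) <= 16.
a = rng.uniform(1.0, 2.0, size=N)
X = rng.uniform(-a, a)

S = np.cumsum(X)
for m in (10, 100, 10_000, 100_000):
    print(m, S[m - 1] / m)        # S_n / n drifts towards 0 as n grows
```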

The next topics indicate a different use of variance.

7.3. Chebyshev's inequality

As you know, this says that for c ≥ 0 and X ∈ ℒ², with μ := E(X),

P(|X − μ| ≥ c) ≤ c⁻² Var(X);

and it is obvious.

Example. Consider a sequence (X_n) of IID RVs with values in {0, 1} with

p = P(X_n = 1) = 1 − P(X_n = 0).


Then E(X_n) = p and Var(X_n) = p(1 − p) ≤ 1/4. Thus (using Theorem 7.1)

S_n := X1 + X2 + ⋯ + X_n

has expectation np and variance np(1 − p) ≤ n/4, and we have E(n⁻¹S_n) = p, Var(n⁻¹S_n) = n⁻² Var(S_n) ≤ 1/(4n). Chebyshev's inequality yields

P(|n⁻¹S_n − p| > δ) ≤ 1/(4nδ²).

7.4. Weierstrass approximation theorem

If f is a continuous function on [0, 1] and ε > 0, then there exists a polynomial B such that

sup_{x∈[0,1]} |B(x) − f(x)| ≤ ε.

Proof. Let (X_k), S_n etc. be as in the Example in Section 7.3. You are well aware that

P(S_n = k) = C(n, k) p^k (1 − p)^{n−k}.

Hence

B_n(p) := E f(n⁻¹S_n) = ∑_k f(n⁻¹k) C(n, k) p^k (1 − p)^{n−k},

the 'B' being in deference to Bernstein. Now f is bounded on [0, 1]: |f(y)| ≤ K, ∀y ∈ [0, 1]. Also, f is uniformly continuous on [0, 1]: for our given ε > 0, there exists δ > 0 such that

(a)    |x − y| ≤ δ implies that |f(x) − f(y)| < ½ε.

Now, for p ∈ [0, 1],

|B_n(p) − f(p)| = |E{f(n⁻¹S_n) − f(p)}|.

Let us write Y_n := |f(n⁻¹S_n) − f(p)| and Z_n := |n⁻¹S_n − p|. Then Z_n ≤ δ implies that Y_n < ½ε, and we have

|B_n(p) − f(p)| ≤ E(Y_n) = E(Y_n; Z_n ≤ δ) + E(Y_n; Z_n > δ) ≤ ½ε P(Z_n ≤ δ) + 2K P(Z_n > δ) ≤ ½ε + 2K/(4nδ²).

Earlier, we chose a fixed δ at (a). We now choose n so that 2K/(4nδ²) < ½ε. Then |B_n(p) − f(p)| < ε, for all p in [0, 1]. □

Now do Exercise E7.1 on inverting Laplace transforms.
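The proof is constructive, and B_n can be computed directly from the displayed formula. The following sketch (not from the text) evaluates the Bernstein polynomial B_n(p) = ∑_k f(k/n) C(n,k) p^k (1−p)^{n−k} for an arbitrary test function and reports the sup-norm error on a grid; the test function, grid and values of n are invented for illustration.

```python
import numpy as np
from math import comb

def bernstein(f, n, p):
    """B_n(p) = E[f(S_n / n)] = sum_k f(k/n) * C(n, k) * p**k * (1 - p)**(n - k)."""
    k = np.arange(n + 1)
    weights = np.array([comb(n, int(j)) for j in k], dtype=float)
    return np.sum(f(k / n) * weights * p**k * (1 - p)**(n - k))

f = lambda x: abs(x - 0.5)            # a continuous (non-smooth) test function
ps = np.linspace(0.0, 1.0, 201)
for n in (10, 50, 200, 1000):
    err = max(abs(bernstein(f, n, p) - f(p)) for p in ps)
    print(n, err)                     # the sup-norm error decreases as n grows
```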

Chapter 8

Product Measure

8.0. Introduction and advice

One of this chapter's main lessons of practical importance is that an 'interchange of order of integration' result

∫_{S1} ( ∫_{S2} f(s1, s2) μ2(ds2) ) μ1(ds1) = ∫_{S2} ( ∫_{S1} f(s1, s2) μ1(ds1) ) μ2(ds2)

is always valid (both sides possibly being infinite) if f ≥ 0; and is valid for 'signed' f (with both repeated integrals finite) provided that one (then the other) of the repeated integrals of absolute values,

∫_{S1} ( ∫_{S2} |f(s1, s2)| μ2(ds2) ) μ1(ds1),

is finite. It is a good idea to read through the chapter to get the ideas, but you are strongly recommended to postpone serious study of the contents until a later stage. Except for the matter of infinite products, it is all a case of relentless use of either the standard machine or the Monotone-Class Theorem to prove intuitively obvious things made to look complicated by the notation. When you do begin a serious study, it is important to appreciate when the more subtle Monotone-Class Theorem has to be used instead of the standard machine.
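The proviso about absolute values matters. A classical counting-measure illustration (not from the text), with a(m, n) = +1 if n = m, −1 if n = m + 1, and 0 otherwise: the two repeated sums disagree, precisely because ∑ |a(m, n)| = ∞. The outer summation bound below is an arbitrary choice; each inner sum has only finitely many non-zero terms, so the printed values are exact.

```python
# a(m, n) = +1 if n == m, -1 if n == m + 1, 0 otherwise, for m, n = 1, 2, 3, ...
def a(m, n):
    return 1 if n == m else (-1 if n == m + 1 else 0)

def row_sum(m):      # sum over n of a(m, n): inner "integral" over the second index
    return a(m, m) + a(m, m + 1)                       # 1 - 1 = 0 for every m

def col_sum(n):      # sum over m of a(m, n): inner "integral" over the first index
    return a(n, n) + (a(n - 1, n) if n >= 2 else 0)    # 1 for n = 1, otherwise 0

print(sum(row_sum(m) for m in range(1, 1000)))   # 0
print(sum(col_sum(n) for n in range(1, 1000)))   # 1  -- the repeated sums differ
```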

8.1. Product measurable structure, Σ1 × Σ2

Let (S1, Σ1) and (S2, Σ2) be measurable spaces. Let S denote the Cartesian product S := S1 × S2. For i = 1, 2, let ρ_i denote the i-th coordinate map, so that

ρ1(s1, s2) = s1,    ρ2(s1, s2) = s2.


The fundamental definition of Σ = Σ1 × Σ2 is as the σ-algebra

Σ := σ(ρ1, ρ2)

generated by the coordinate maps.

Then X⁻¹ : Σ → ℱ, so that X is an (S, Σ)-valued random variable, and if P is a probability measure on Ω, we can talk about the law μ of X (equal to the joint law of X1 and X2) on (S, Σ): μ = P ∘ X⁻¹ on Σ. Suppose now that S1 and S2 are metrizable spaces and that Σi = ℬ(Si) (i = 1, 2). Then S is a metrizable space under the product topology. If S1 and S2 are separable, then Σ = ℬ(S), and there is no 'conflict'. However, if S1 and S2 are not separable, then ℬ(S) may be strictly larger than Σ, X need not be an (S, ℬ(S))-valued random variable, and the joint law of X1 and X2 need not exist on (S, ℬ(S)).

It is perhaps as well to be warned of such things. Note that the separability of R was used in proving that ℬ(Rⁿ) = ℬⁿ in Section 8.5.

PART B: MARTINGALE THEORY

Chapter 9

Conditional Expectation

9.1. A motivating example

Suppose that (Ω, ℱ, P) is a probability triple and that X and Z are random variables, X taking the distinct values x1, x2, ..., xm, Z taking the distinct values z1, z2, ..., zn.

Elementary conditional probability:

P(X = x_i | Z = z_j) := P(X = x_i; Z = z_j) / P(Z = z_j),

and elementary conditional expectation:

E(X | Z = z_j) = ∑_i x_i P(X = x_i | Z = z_j),

are familiar to you. The random variable Y = E(X|Z), the conditional expectation of X given Z, is defined as follows: (a)

if Z(ω) = z_j, then Y(ω) := E(X | Z = z_j) =: y_j (say).

It proves to be very advantageous to look at this idea in a new way. 'Reporting to us the value of Z(ω)' amounts to partitioning Ω into Z-atoms on which Z is constant:

Ω = {Z = z1} ∪ {Z = z2} ∪ ⋯ ∪ {Z = zn}.

The σ-algebra 𝒢 = σ(Z) generated by Z consists of sets {Z ∈ B}, B ∈ ℬ, and therefore consists precisely of the 2ⁿ possible unions of the n Z-atoms. It is clear from (a) that Y is constant on Z-atoms, or, to put it better,

(b)    Y is 𝒢-measurable.

89

(9.1) ..

Chapter 9: Conditional Expectation

84

Next, since Y takes the constant value Yj on the Z-atom {Z

r

J{z=~}

YdP = YjP(Z = Zj) = LXiP(X

= Zj}, we have:

= xdZ = Zj)P(Z = Zj)

j

= LXiP(X = Xii Z = Zj) = i

r

XdP.

J{z=Zj}

If we write G_j = {Z = z_j}, this says E(Y I_{G_j}) = E(X I_{G_j}). Since for every G in 𝒢, I_G is a sum of I_{G_j}'s, we have E(Y I_G) = E(X I_G), or

(c)

∫_G Y dP = ∫_G X dP,

∀G ∈ 𝒢.
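A small computational sketch (not from the text) of this construction on an invented finite probability space: Y = E(X|Z) is built by averaging X over each Z-atom, and properties (b) and (c) are then checked numerically. All data and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

# A small finite sample space Omega = {0, ..., N-1} with equal weights,
# an invented Z taking values in {0, 1, 2} and an invented X.
N = 12
P = np.full(N, 1.0 / N)
Z = rng.integers(0, 3, size=N)
X = rng.normal(size=N)

# Y = E(X | Z): on each Z-atom {Z = z}, Y is the P-weighted average of X.
Y = np.empty(N)
for z in np.unique(Z):
    atom = (Z == z)
    Y[atom] = np.sum(X[atom] * P[atom]) / np.sum(P[atom])

# (b) Y is sigma(Z)-measurable: by construction, Y is constant on every Z-atom.
# (c) Partial averaging: the integral of Y over each atom equals that of X.
for z in np.unique(Z):
    atom = (Z == z)
    print(z, np.sum(Y[atom] * P[atom]), np.sum(X[atom] * P[atom]))   # equal pairs
```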

Results (b) and (c) suggest the central definition of modern probability.

9.2. Fundamental Theorem and Definition (Kolmogorov, 1933)

Let (Ω, ℱ, P) be a triple, and X a random variable with E(|X|) < ∞. Let 𝒢 be a sub-σ-algebra of ℱ. Then there exists a random variable Y such that


(a)

Y is 𝒢-measurable,

(b)

E(|Y|)