INTRODUCTION TO BIOSTATISTICS
SECOND EDITION
Robert R. Sokal and F. James Rohlf
State University of New York at Stony Brook
DOVER PUBLICATIONS, INC.
Mineola, New York
Copyright © 1969, 1973, 1981, 1987 by Robert R. Sokal and F. James Rohlf. All rights reserved.
Bibliographical Note
This Dover edition, first published in 2009, is an unabridged republication of the work originally published in 1969 by W. H. Freeman and Company, New York. The authors have prepared a new Preface for this edition.
Library of Congress Cataloging-in-Publication Data
Sokal, Robert R.
Introduction to Biostatistics / Robert R. Sokal and F. James Rohlf. Dover ed.
p. cm.
Originally published: 2nd ed. New York : W. H. Freeman, 1969.
Includes bibliographical references and index.
ISBN-13: 978-0-486-46961-4
ISBN-10: 0-486-46961-1
1. Biometry. I. Rohlf, F. James, 1936- II. Title.
QH323.5.S633 2009
570.1'5195 dc22
2008048052
Manufactured in the United States of America
Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y. 11501
to Julie and Janice
Contents

PREFACE TO THE DOVER EDITION
PREFACE

1. INTRODUCTION
   1.1 Some definitions
   1.2 The development of biostatistics
   1.3 The statistical frame of mind

2. DATA IN BIOSTATISTICS
   2.1 Samples and populations
   2.2 Variables in biostatistics
   2.3 Accuracy and precision of data
   2.4 Derived variables
   2.5 Frequency distributions
   2.6 The handling of data

3. DESCRIPTIVE STATISTICS
   3.1 The arithmetic mean
   3.2 Other means
   3.3 The median
   3.4 The mode
   3.5 The range
   3.6 The standard deviation
   3.7 Sample statistics and parameters
   3.8 Practical methods for computing mean and standard deviation
   3.9 The coefficient of variation

4. INTRODUCTION TO PROBABILITY DISTRIBUTIONS: THE BINOMIAL AND POISSON DISTRIBUTIONS
   4.1 Probability, random sampling, and hypothesis testing
   4.2 The binomial distribution
   4.3 The Poisson distribution

5. THE NORMAL PROBABILITY DISTRIBUTION
   5.1 Frequency distributions of continuous variables
   5.2 Derivation of the normal distribution
   5.3 Properties of the normal distribution
   5.4 Applications of the normal distribution
   5.5 Departures from normality: Graphic methods

6. ESTIMATION AND HYPOTHESIS TESTING
   6.1 Distribution and variance of means
   6.2 Distribution and variance of other statistics
   6.3 Introduction to confidence limits
   6.4 Student's t distribution
   6.5 Confidence limits based on sample statistics
   6.6 The chi-square distribution
   6.7 Confidence limits for variances
   6.8 Introduction to hypothesis testing
   6.9 Tests of simple hypotheses employing the t distribution
   6.10 Testing the hypothesis $H_0\colon \sigma^2 = \sigma_0^2$

7. INTRODUCTION TO ANALYSIS OF VARIANCE
   7.1 The variances of samples and their means
   7.2 The F distribution
   7.3 The hypothesis $H_0\colon \sigma_1^2 = \sigma_2^2$
   7.4 Heterogeneity among sample means
   7.5 Partitioning the total sum of squares and degrees of freedom
   7.6 Model I anova
   7.7 Model II anova

8. SINGLE-CLASSIFICATION ANALYSIS OF VARIANCE
   8.1 Computational
   8.2 Equal
   8.3 Unequal
   8.4 Two groups
$\bar{Y}_i$     $n_i$
3.85      12
5.21      25
4.70       8

their weighted average will be

$$\bar{Y}_w = \frac{(12)(3.85) + (25)(5.21) + (8)(4.70)}{12 + 25 + 8} = \frac{214.05}{45} = 4.76$$
Note that in this example, computation of the weighted mean is exactly equivalent to adding up all the original measurements and dividing the sum by the total number of the measurements. Thus, the sample with 25 observations, having the highest mean, will influence the weighted average in proportion to its size.
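The weighted average above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `weighted_mean` is our own and not part of the text, and the three means and sample sizes are those of the example.

```python
# Sample means and sample sizes from the worked example above.
means = [3.85, 5.21, 4.70]
sizes = [12, 25, 8]


def weighted_mean(values, weights):
    """Return sum(w * v) / sum(w) for paired values and weights."""
    total_weight = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total_weight


weighted = weighted_mean(means, sizes)
print(round(weighted, 2))  # weighted average of the three sample means
```

Because each mean is multiplied by its sample size, the numerator is simply the total of all 45 original measurements, which is why the weighted mean equals the mean of the pooled data.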
3.2 Other means

We shall see in Chapters 10 and 11 that variables are sometimes transformed into their logarithms or reciprocals. If we calculate the means of such transformed variables and then change the means back into the original scale, these means will not be the same as if we had computed the arithmetic means of the original variables. The resulting means have received special names in statistics. The back-transformed mean of the logarithmically transformed variables is called the geometric mean. It is computed as follows:

$$GM_Y = \operatorname{antilog}\left(\frac{1}{n}\sum \log Y\right) \tag{3.3}$$
which indicates that the geometric mean $GM_Y$ is the antilogarithm of the mean of the logarithms of variable Y. Since addition of logarithms is equivalent to multiplication of their antilogarithms, there is another way of representing this quantity; it is

$$GM_Y = \sqrt[n]{Y_1 Y_2 Y_3 \cdots Y_n} \tag{3.4}$$
The geometric mean permits us to become familiar with another operator symbol: capital pi, $\Pi$, which may be read as "product." Just as $\Sigma$ symbolizes summation of the items that follow it, so $\Pi$ symbolizes the multiplication of the items that follow it. The subscripts and superscripts have exactly the same meaning as in the summation case. Thus, Expression (3.4) for the geometric mean can be rewritten more compactly as follows:

$$GM_Y = \sqrt[n]{\prod_{i=1}^{n} Y_i} \tag{3.4a}$$
The computation of the geometric mean by Expression (3.4a) is quite tedious. In practice, the geometric mean has to be computed by transforming the variates into logarithms.

The reciprocal of the arithmetic mean of reciprocals is called the harmonic mean. If we symbolize it by $H_Y$, the formula for the harmonic mean can be written in concise form (without subscripts and superscripts) as

$$\frac{1}{H_Y} = \frac{1}{n}\sum \frac{1}{Y}$$
You may wish to convince yourself that the geometric mean and the harmonic mean of the four oxygen percentages are 14.65% and 14.09%, respectively. Unless the individual items do not vary, the geometric mean is always less than the arithmetic mean, and the harmonic mean is always less than the geometric mean.

Some beginners in statistics have difficulty in accepting the fact that measures of location or central tendency other than the arithmetic mean are permissible or even desirable. They feel that the arithmetic mean is the "logical" average, and that any other mean would be a distortion. This whole problem relates to the proper scale of measurement for representing data; this scale is not always the linear scale familiar to everyone, but is sometimes by preference a logarithmic or reciprocal scale. If you have doubts about this question, we shall try to allay them in Chapter 10, where we discuss the reasons for transforming variables.
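The geometric and harmonic means quoted above can be verified with a short Python sketch. Note the assumption: the four oxygen percentages of Section 3.1 are not reproduced in this part of the text, so the values below are assumed; they were chosen because they reproduce the mean (15.325) and the extreme values (10.8 and 23.3) cited elsewhere in this chapter. The function names are ours.

```python
import math

# ASSUMED data: the four oxygen percentages of Section 3.1 are not listed
# in this excerpt. These values match the mean of 15.325 and the extremes
# 10.8 and 23.3 quoted in the chapter.
oxygen = [14.9, 10.8, 12.3, 23.3]


def arithmetic_mean(ys):
    return sum(ys) / len(ys)


def geometric_mean(ys):
    # Expression (3.3): antilogarithm of the mean of the logarithms.
    return math.exp(sum(math.log(y) for y in ys) / len(ys))


def harmonic_mean(ys):
    # Reciprocal of the arithmetic mean of the reciprocals.
    return len(ys) / sum(1.0 / y for y in ys)


am = arithmetic_mean(oxygen)
gm = geometric_mean(oxygen)
hm = harmonic_mean(oxygen)
print(round(am, 3), round(gm, 2), round(hm, 2))
```

Natural logarithms are used here; any base gives the same geometric mean as long as the antilogarithm is taken in the same base.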
3.3 The median

The median M is a statistic of location occasionally useful in biological research. It is defined as that value of the variable (in an ordered array) that has an equal number of items on either side of it. Thus, the median divides a frequency distribution into two halves. In the following sample of five measurements,

14, 15, 16, 19, 23

M = 16, since the third observation has an equal number of observations on both sides of it. We can visualize the median easily if we think of an array from largest to smallest, for example, a row of men lined up by their heights. The median individual will then be that man having an equal number of men on his right and left sides. His height will be the median height of the sample considered. This quantity is easily evaluated from a sample array with an odd number of individuals. When the number in the sample is even, the median is conventionally calculated as the midpoint between the (n/2)th and the [(n/2) + 1]th variate. Thus, for the sample of four measurements

14, 15, 16, 19

the median would be the midpoint between the second and third items, or 15.5.

Whenever any one value of a variate occurs more than once, problems may develop in locating the median. Computation of the median item becomes more involved because all the members of a given class in which the median item is located will have the same class mark. The median then is the (n/2)th variate in the frequency distribution. It is usually computed as that point between the class limits of the median class where the median individual would be located (assuming the individuals in the class were evenly distributed).

The median is just one of a family of statistics dividing a frequency distribution into equal areas.
It divides the distribution into two halves. The three quartiles cut the distribution at the 25, 50, and 75% points, that is, at points dividing the distribution into first, second, third, and fourth quarters by area (and frequencies). The second quartile is, of course, the median. (There are also quintiles, deciles, and percentiles, dividing the distribution into 5, 10, and 100 equal portions, respectively.)

Medians are most often used for distributions that do not conform to the standard probability models, so that nonparametric methods (see Chapter 10) must be used. Sometimes the median is a more representative measure of location than the arithmetic mean. Such instances almost always involve asymmetric distributions. An often quoted example from economics would be a suitable measure of location for the "typical" salary of an employee of a corporation. The very high salaries of the few senior executives would shift the arithmetic mean, the center of gravity, toward a completely unrepresentative value. The median, on the other hand, would be little affected by a few high salaries; it would give the particular point on the salary scale above which lie 50% of the salaries in the corporation, the other half being lower than this figure.

In biology an example of the preferred application of a median over the arithmetic mean may be in populations showing skewed distribution, such as weights. Thus a median weight of American males 50 years old may be a more meaningful statistic than the average weight. The median is also of importance in cases where it may be difficult or impossible to obtain and measure all the items of a sample. For example, suppose an animal behaviorist is studying the time it takes for a sample of animals to perform a certain behavioral step. The variable he is measuring is the time from the beginning of the experiment until each individual has performed. What he wants to obtain is an average time of performance. Such an average time, however, can be calculated only after records have been obtained on all the individuals. It may take a long time for the slowest animals to complete their performance, longer than the observer wishes to spend. (Some of them may never respond appropriately, making the computation of a mean impossible.) Therefore, a convenient statistic of location to describe these animals may be the median time of performance.
Thus, so long as the observer knows what the total sample size is, he need not have measurements for the right-hand tail of his distribution. Similar examples would be the responses to a drug or poison in a group of individuals (the median lethal or effective dose, LD50 or ED50) or the median time for a mutation to appear in a number of lines of a species.
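The two ways of locating the median described in this section can be sketched in Python. The first function handles a complete sample with an odd or even number of items; the second illustrates the behavioral example, where only the fastest animals have been timed but the total sample size is known. Both function names and the response times used at the end are hypothetical, introduced only for illustration.

```python
def median(values):
    """Median of a complete sample: the middle item of the ordered array,
    or the midpoint of the two middle items when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median_from_partial(observed, total_n):
    """Median when only the smallest observations were recorded.

    `observed` holds the recorded values; the remaining
    total_n - len(observed) items are known only to exceed all of them.
    """
    ordered = sorted(observed)
    mid = total_n // 2
    if len(ordered) < mid + 1:
        raise ValueError("too few observations to locate the median")
    if total_n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([14, 15, 16, 19, 23]))  # five measurements from the text
print(median([14, 15, 16, 19]))      # four measurements from the text

# Hypothetical response times: 7 animals tested, only the 4 fastest timed.
print(median_from_partial([12, 15, 19, 22], total_n=7))
```

With seven animals the median is the fourth ordered time, so the three unrecorded slow responders do not need to be measured at all.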
3.4 The mode

The mode refers to the value represented by the greatest number of individuals. When seen on a frequency distribution, the mode is the value of the variable at which the curve peaks. In grouped frequency distributions the mode as a point has little meaning. It usually suffices to identify the modal class. In biology, the mode does not have many applications.

Distributions having two peaks (equal or unequal in height) are called bimodal; those with more than two peaks are multimodal. In those rare distributions that are U-shaped, we refer to the low point at the middle of the distribution as an antimode.
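Identifying the modal class of a grouped frequency distribution takes one line of Python. The sketch below uses the birth-weight distribution tabulated in Box 3.2 later in this chapter; the variable names are ours.

```python
# Class marks (ounces) and frequencies from the birth-weight data of Box 3.2.
class_marks = [59.5 + 8 * i for i in range(15)]  # 59.5, 67.5, ..., 171.5
frequencies = [2, 6, 39, 385, 888, 1729, 2240, 2007,
               1233, 641, 201, 74, 14, 5, 1]

# The modal class is the class with the greatest frequency.
modal_index = max(range(len(frequencies)), key=lambda i: frequencies[i])
modal_class = class_marks[modal_index]
print(modal_class, frequencies[modal_index])
```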
In evaluating the relative merits of the arithmetic mean, the median, and the mode, a number of considerations have to be kept in mind. The mean is generally preferred in statistics, since it has a smaller standard error than other statistics of location (see Section 6.2), it is easier to work with mathematically, and it has an additional desirable property (explained in Section 6.1): it will tend to be normally distributed even if the original data are not. The mean is markedly affected by outlying observations; the median and mode are not. The mean is generally more sensitive to changes in the shape of a frequency distribution, and if it is desired to have a statistic reflecting such changes, the mean may be preferred.

In symmetrical, unimodal distributions the mean, the median, and the mode are all identical. A prime example of this is the well-known normal distribution of Chapter 5. In a typical asymmetrical distribution, such as the one shown in Figure 3.1, the relative positions of the mode, median, and mean are generally these: the mean is closest to the drawn-out tail of the distribution, the mode is farthest, and the median is between these. An easy way to remember this sequence is to recall that they occur in alphabetical order from the longer tail of the distribution.

FIGURE 3.1 An asymmetrical frequency distribution (skewed to the right) showing location of the mean, median, and mode. Percent butterfat in 120 samples of milk (from a Canadian cattle breeders' record book). [Figure not reproduced; horizontal axis: percent butterfat; n = 120.]
3.5 The range

We now turn to measures of dispersion. Figure 3.2 demonstrates that radically different-looking distributions may possess the identical arithmetic mean.

FIGURE 3.2 Three frequency distributions having identical means and sample sizes but differing in dispersion pattern. [Figure not reproduced.]

One simple measure of dispersion is the range, which is defined as the difference between the largest and the smallest items in a sample. Thus, the range of the four oxygen percentages listed earlier (Section 3.1) is

Range = 23.3 - 10.8 = 12.5%

and the range of the aphid femur lengths (Box 2.1) is

Range = 4.7 - 3.3 = 1.4 units of 0.1 mm

Since the range is a measure of the span of the variates along the scale of the variable, it is in the same units as the original measurements. The range is clearly affected by even a single outlying value and for this reason is only a rough estimate of the dispersion of all the items in the sample.
3.6 The standard deviation

We desire that a measure of dispersion take all items of a distribution into consideration, weighting each item by its distance from the center of the distribution. We shall now try to construct such a statistic. In Table 3.1 we show a sample of 15 blood neutrophil counts from patients with tumors. Column (1) shows the variates in the order in which they were reported. The computation of the mean is shown below the table. The mean neutrophil count turns out to be 7.713.

The distance of each variate from the mean is computed as the following deviation:

$$y = Y - \bar{Y}$$

Each individual deviation, or deviate, is by convention computed as the individual observation minus the mean, $Y - \bar{Y}$, rather than the reverse, $\bar{Y} - Y$. Deviates are symbolized by lowercase letters corresponding to the capital letters of the variables. Column (2) in Table 3.1 gives the deviates computed in this manner.

TABLE 3.1
The standard deviation. Long method, not recommended for hand or calculator computations but shown here to illustrate the meaning of the standard deviation. The data are blood neutrophil counts (divided by 1000) per microliter, in 15 patients with nonhematological tumors.

          (1)        (2)               (3)
          $Y$     $y = Y - \bar{Y}$     $y^2$
          4.9       -2.81            7.9148
          4.6       -3.11            9.6928
          5.5       -2.21            4.8988
          9.1        1.39            1.9228
         16.3        8.59           73.7308
         12.7        4.99           24.8668
          6.4       -1.31            1.7248
          7.1       -0.61            0.3762
          2.3       -5.41           29.3042
          3.6       -4.11           16.9195
         18.0       10.29          105.8155
          3.7       -4.01           16.1068
          7.3       -0.41            0.1708
          4.4       -3.31           10.9782
          9.8        2.09            4.3542
Total   115.7        0.05          308.7770

Mean $\bar{Y} = \dfrac{\sum Y}{n} = \dfrac{115.7}{15} = 7.713$

We now wish to calculate an average deviation that will sum all the deviates and divide them by the number of deviates in the sample. But note that when we sum our deviates, negative and positive deviates cancel out, as is shown by the sum at the bottom of column (2); this sum appears to be unequal to zero only because of a rounding error. Deviations from the arithmetic mean always sum to zero because the mean is the center of gravity. Consequently, an average based on the sum of deviations would also always equal zero. You are urged to study Appendix A1.1, which demonstrates that the sum of deviations around the mean of a sample is equal to zero.

Squaring the deviates gives us column (3) of Table 3.1 and enables us to reach a result other than zero. (Squaring the deviates also holds other mathematical advantages, which we shall take up in Sections 7.5 and 11.3.) The sum of the squared deviates (in this case, 308.7770) is a very important quantity in statistics. It is called the sum of squares and is identified symbolically as $\sum y^2$. Another common symbol for the sum of squares is SS.

The next step is to obtain the average of the n squared deviations. The resulting quantity is known as the variance, or the mean square:

$$\text{Variance} = \frac{\sum y^2}{n} = \frac{308.7770}{15} = 20.5851$$
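The long method of Table 3.1 can be retraced in Python as a check on the arithmetic. This is a sketch with variable names of our own choosing; it divides the sum of squares by n, exactly as in the text at this point, and also takes the square root of the result.

```python
import math

# Neutrophil counts (divided by 1000) per microliter, column (1) of Table 3.1.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
          2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

n = len(counts)
mean = sum(counts) / n                       # Y-bar
deviates = [y - mean for y in counts]        # column (2): y = Y - Y-bar
sum_of_squares = sum(d * d for d in deviates)  # total of column (3)

variance_n = sum_of_squares / n              # average squared deviation
sd_n = math.sqrt(variance_n)                 # back in the original units

print(round(mean, 3), round(sum_of_squares, 4))
print(round(variance_n, 4), round(sd_n, 3))
```

Because the code carries the mean at full precision rather than rounding it to 7.71, the deviates here sum to zero (apart from floating-point error), whereas column (2) of the table totals 0.05.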
The variance is a measure of fundamental importance in statistics, and we shall employ it throughout this book. At the moment, we need only remember that because of the squaring of the deviations, the variance is expressed in squared units. To undo the effect of the squaring, we now take the positive square root of the variance and obtain the standard deviation:

$$\text{Standard deviation} = +\sqrt{\frac{\sum y^2}{n}} = \sqrt{20.5851} = 4.537$$

Thus, standard deviation is again expressed in the original units of measurement, since it is a square root of the squared units of the variance.

An important note: The technique just learned and illustrated in Table 3.1 is not the simplest for direct computation of a variance and standard deviation. However, it is often used in computer programs, where accuracy of computations is an important consideration. Alternative and simpler computational methods are given in Section 3.8.

The observant reader may have noticed that we have avoided assigning any symbol to either the variance or the standard deviation. We shall explain why in the next section.

3.7 Sample statistics and parameters

Up to now we have calculated statistics from samples without giving too much thought to what these statistics represent. When correctly calculated, a mean and standard deviation will always be absolutely true measures of location and dispersion for the samples on which they are based. Thus, the true mean of the four oxygen percentage readings in Section 3.1 is 15.325%. The standard deviation of the 15 neutrophil counts is 4.537. However, only rarely in biology are we interested in measures of location and dispersion only as descriptive summaries of the samples we have studied. Almost always we are interested in the populations from which the samples have been taken. What we want to know is not the mean of the particular four oxygen percentages, but rather the true oxygen percentage of the universe of readings from which the four readings have been sampled. Similarly, we would like to know the true mean neutrophil count of the population of patients with nonhematological tumors, not merely the mean of the 15 individuals measured. When studying dispersion we generally wish to learn the true standard deviations of the populations and not those of the samples. These population statistics, however, are unknown and (generally speaking) are unknowable. Who would be able to collect all the patients with this particular disease and measure their neutrophil counts? Thus we need to use sample statistics as estimators of population statistics or parameters.

It is conventional in statistics to use Greek letters for population parameters and Roman letters for sample statistics. Thus, the sample mean $\bar{Y}$ estimates $\mu$, the parametric mean of the population. Similarly, a sample variance, symbolized by $s^2$, estimates a parametric variance, symbolized by $\sigma^2$. Such estimators should be unbiased. By this we mean that samples (regardless of the sample size) taken from a population with a known parameter should give sample statistics that, when averaged, will give the parametric value. An estimator that does not do so is called biased. The sample mean $\bar{Y}$ is an unbiased estimator of the parametric mean $\mu$.
However, the sample variance as computed in Section 3.6 is not unbiased. On the average, it will underestimate the magnitude of the population variance $\sigma^2$. To overcome this bias, mathematical statisticians have shown that when sums of squares are divided by $n - 1$ rather than by $n$ the resulting sample variances will be unbiased estimators of the population variance. For this reason, it is customary to compute variances by dividing the sum of squares by $n - 1$. The formula for the standard deviation is therefore customarily given as follows:

$$s = +\sqrt{\frac{\sum y^2}{n - 1}} \tag{3.6}$$

In the neutrophil-count data the standard deviation would thus be computed as

$$s = \sqrt{\frac{308.7770}{14}} = \sqrt{22.0555} = 4.696$$
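The claim that division by n underestimates the population variance, while division by n - 1 does not, can be illustrated without random numbers by enumerating every possible sample. The sketch below makes two assumptions that are ours, not the text's: a tiny hypothetical population of four values, and sampling with replacement so that all 16 ordered samples of size 2 are equally likely.

```python
from itertools import product

# Hypothetical population (an assumption for illustration only).
population = [1.0, 2.0, 3.0, 4.0]
N = len(population)
mu = sum(population) / N
# Parametric variance: the whole population is known, so divide by N.
sigma2 = sum((y - mu) ** 2 for y in population) / N

n = 2
# Every ordered sample of size n drawn with replacement (equally likely).
samples = list(product(population, repeat=n))


def sum_of_squares(sample):
    m = sum(sample) / len(sample)
    return sum((y - m) ** 2 for y in sample)


# Average of the two competing estimators over all possible samples.
avg_div_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Averaged over all samples, the estimator that divides by n - 1 reproduces the parametric variance, whereas the one that divides by n falls short of it by the factor (n - 1)/n.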
We note that this value, 4.696, is slightly larger than our previous estimate of 4.537. Of course, the greater the sample size, the less difference there will be between division by $n$ and by $n - 1$. However, regardless of sample size, it is good practice to divide a sum of squares by $n - 1$ when computing a variance or standard deviation. It may be assumed that when the symbol $s^2$ is encountered, it refers to a variance obtained by division of the sum of squares by the degrees of freedom, as the quantity $n - 1$ is generally referred to.

Division of the sum of squares by $n$ is appropriate only when the interest of the investigator is limited to the sample at hand and to its variance and standard deviation as descriptive statistics of the sample. This would be in contrast to using these as estimates of the population parameters. There are also the rare cases in which the investigator possesses data on the entire population; in such cases division by $n$ is perfectly justified, because then the investigator is not estimating a parameter but is in fact evaluating it. Thus the variance of the wing lengths of all adult whooping cranes would be a parametric value; similarly, if the heights of all winners of the Nobel Prize in physics had been measured, their variance would be a parameter since it would be based on the entire population.

3.8 Practical methods for computing mean and standard deviation

Three steps are necessary for computing the standard deviation: (1) find $\sum y^2$, the sum of squares; (2) divide by $n - 1$ to give the variance; and (3) take the square root of the variance to obtain the standard deviation. The procedure used to compute the sum of squares in Section 3.6 can be expressed by the following formula:

$$\sum y^2 = \sum (Y - \bar{Y})^2 \tag{3.7}$$

An equivalent formula is

$$\sum y^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n} \tag{3.8}$$
Let us see exactly what this formula represents. The first term on the right side of the equation, $\sum Y^2$, is the sum of all individual Y's, each squared, as follows:

$$\sum Y^2 = Y_1^2 + Y_2^2 + Y_3^2 + \cdots + Y_n^2$$
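As a numerical check on Expression (3.8), the sketch below computes the sum of squares of the neutrophil counts in two ways: from the squared deviations, as in Section 3.6, and from the sum of Y squared minus the quantity $(\sum Y)^2/n$. It then follows the three steps listed at the start of this section. The variable names are ours.

```python
import math

# Neutrophil counts from Table 3.1.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
          2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]
n = len(counts)

sum_y = sum(counts)                          # sum of Y
sum_y_squared = sum(y * y for y in counts)   # "sum of Y squared"
correction_term = sum_y ** 2 / n             # (sum of Y) squared, over n

# Step 1: sum of squares by the computational formula, Expression (3.8).
ss_computational = sum_y_squared - correction_term

# The same quantity from squared deviations about the mean (Section 3.6).
mean = sum_y / n
ss_definitional = sum((y - mean) ** 2 for y in counts)

# Step 2: divide by n - 1; step 3: take the square root.
variance = ss_computational / (n - 1)
s = math.sqrt(variance)

print(round(sum_y_squared, 2), round(ss_computational, 4))
print(round(variance, 3), round(s, 3))
```

The two sums of squares agree to within floating-point error, which is the identity proved in Appendix A1.2.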
When referred to by name, $\sum Y^2$ should be called the "sum of Y squared" and should be carefully distinguished from $\sum y^2$, "the sum of squares of Y." These names are unfortunate, but they are too well established to think of amending them. The other quantity in Expression (3.8) is $(\sum Y)^2/n$. It is often called the correction term (CT). The numerator of this term is the square of the sum of the Y's; that is, all the Y's are first summed, and this sum is then squared. In general, this quantity is different from $\sum Y^2$, which first squares the Y's and then sums them. These two terms are identical only if all the Y's are equal. If you are not certain about this, you can convince yourself of this fact by calculating these two quantities for a few numbers.

The disadvantage of Expression (3.8) is that the quantities $\sum Y^2$ and $(\sum Y)^2/n$ may both be quite large, so that accuracy may be lost in computing their difference unless one takes the precaution of carrying sufficient significant figures.

Why is Expression (3.8) identical with Expression (3.7)? The proof of this identity is very simple and is given in Appendix A1.2. You are urged to work through it to build up your confidence in handling statistical symbols and formulas.

It is sometimes possible to simplify computations by recoding variates into simpler form. We shall use the term additive coding for the addition or subtraction of a constant (since subtraction is only addition of a negative number). We shall similarly use multiplicative coding to refer to the multiplication or division by a constant (since division is multiplication by the reciprocal of the divisor). We shall use the term combination coding to mean the application of both additive and multiplicative coding to the same set of data. In Appendix A1.3 we examine the consequences of the three types of coding in the computation of means, variances, and standard deviations.

For the case of means, the formula for combination coding and decoding is the most generally applicable one. If the coded variable is $Y_c = D(Y + C)$, then

$$\bar{Y} = \frac{\bar{Y}_c}{D} - C$$
where C is an additive code and D is a multiplicative code.

On considering the effects of coding variates on the values of variances and standard deviations, we find that additive codes have no effect on the sums of squares, variances, or standard deviations. The mathematical proof is given in Appendix A1.3, but we can see this intuitively, because an additive code has no effect on the distance of an item from its mean. The distance from an item of 15 to its mean of 10 would be 5. If we were to code the variates by subtracting a constant of 10, the item would now be 5 and the mean zero. The difference between them would still be 5. Thus, if only additive coding is employed, the only statistic in need of decoding is the mean. But multiplicative coding does have an effect on sums of squares, variances, and standard deviations. The standard deviations have to be divided by the multiplicative code, just as had to be done for the mean. However, the sums of squares or variances have to be divided by the multiplicative codes squared, because they are squared terms, and the multiplicative factor becomes squared during the operations. In combination coding the additive code can be ignored.

When the data are unordered, the computation of the mean and standard deviation proceeds as in Box 3.1, which is based on the unordered neutrophil-count data shown in Table 3.1. We chose not to apply coding to these data, since it would not have simplified the computations appreciably. When the data are arrayed in a frequency distribution, the computations can be made much simpler.
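The effects of coding described above can be checked numerically. In the sketch below the codes C = -2.3 and D = 10 are hypothetical choices made only for illustration, and the decoding rule for the mean is the one that follows algebraically from $Y_c = D(Y + C)$. The standard-library function `statistics.stdev` divides the sum of squares by n - 1.

```python
import statistics

# Neutrophil counts from Table 3.1.
counts = [4.9, 4.6, 5.5, 9.1, 16.3, 12.7, 6.4, 7.1,
          2.3, 3.6, 18.0, 3.7, 7.3, 4.4, 9.8]

C = -2.3   # hypothetical additive code
D = 10.0   # hypothetical multiplicative code

# Combination coding: Yc = D(Y + C).
coded = [D * (y + C) for y in counts]

# Decoding: mean needs both codes; s needs only D; variance needs D squared.
mean_decoded = statistics.mean(coded) / D - C
s_decoded = statistics.stdev(coded) / D
variance_decoded = statistics.variance(coded) / D ** 2

# Additive coding alone leaves the standard deviation unchanged.
s_additive_only = statistics.stdev([y + C for y in counts])

print(round(mean_decoded, 3), round(s_decoded, 3), round(s_additive_only, 3))
```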
When computing the statistics, you can often avoid the need for manual entry of large numbers of individual variates if you first set up a frequency distribution. Sometimes the data will come to you already in the form of a frequency distribution, having been grouped by the researcher. The computation of $\bar{Y}$ and s from a frequency distribution is illustrated in Box 3.2. The data are the birth weights of male Chinese children, first encountered in Figure 2.3. The calculation is simplified by coding to remove the awkward class marks. This is done by subtracting 59.5, the lowest class mark of the array.
3.8 / PRACTICAL METHODS FOR COMPUTING MEAN AND STANDARD DEVIATION
41
BOX 3.1
Calculation of Ȳ and s from unordered data.
Neutrophil counts, unordered as shown in Table 3.1.

Computation

$n = 15$
$\sum Y = 115.7$
$\bar{Y} = \frac{1}{n}\sum Y = 7.713$
$\sum Y^2 = 1201.21$
$\sum y^2 = \sum Y^2 - \frac{(\sum Y)^2}{n} = 308.7773$
$s^2 = \frac{\sum y^2}{n - 1} = \frac{308.7773}{14} = 22.056$
$s = \sqrt{22.056} = 4.696$
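The arithmetic of Box 3.1 can be retraced from its three summary quantities alone. This is a small sketch rather than the authors' procedure; the variable names are illustrative, and the individual neutrophil counts of Table 3.1 are not needed because only n, the sum of Y, and the sum of Y squared enter the formulas.

```python
import math

n = 15            # sample size from Box 3.1
sum_y = 115.7     # sum of the variates
sum_y2 = 1201.21  # sum of the squared variates

mean_y = sum_y / n
# Sum of squares of deviations, from the machine formula
# sum(Y^2) - (sum(Y))^2 / n.
ss = sum_y2 - sum_y**2 / n
var_y = ss / (n - 1)
s = math.sqrt(var_y)

print(round(mean_y, 3), round(ss, 4), round(var_y, 3), round(s, 3))
```

The rounded values should match those given in the box.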
The resulting class marks are values such as 0, 8, 16, 24, 32, and so on. They are then divided by 8, which changes them to 0, 1, 2, 3, 4, and so on, which is the desired format. The details of the computation can be learned from the box.

When checking the results of calculations, it is frequently useful to have an approximate method for estimating statistics so that gross errors in computation can be detected. A simple method for estimating the mean is to average the largest and smallest observation to obtain the so-called midrange. For the neutrophil counts of Table 3.1, this value is (2.3 + 18.0)/2 = 10.15 (not a very good estimate). Standard deviations can be estimated from ranges by appropriate division of the range, as follows:
For samples of    Divide the range by
      10                  3
      30                  4
     100                  5
     500                  6
    1000                  6
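These two rough checks are easy to script. The sketch below is illustrative only: the function names and the dictionary `RANGE_DIVISORS` are hypothetical, and the divisor is looked up by tabulated sample size rather than interpolated. For the neutrophil counts (n = 15) the text uses the divisor tabulated for samples of 30, and the example follows that choice.

```python
import math

# Divisors from the table above, keyed by tabulated sample size.
RANGE_DIVISORS = {10: 3, 30: 4, 100: 5, 500: 6, 1000: 6}

def midrange(smallest, largest):
    """Rough estimate of the mean: average of the two extremes."""
    return (smallest + largest) / 2

def sd_from_range(smallest, largest, divisor):
    """Rough estimate of the standard deviation: range / divisor."""
    return (largest - smallest) / divisor

# Neutrophil counts of Table 3.1: smallest 2.3, largest 18.0.
est_mean = midrange(2.3, 18.0)
est_sd = sd_from_range(2.3, 18.0, RANGE_DIVISORS[30])
print(est_mean, est_sd)
```

The estimates are meant only for catching gross errors; they can be compared with Ȳ = 7.713 and s = 4.696 from Box 3.1.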
BOX 3.2
Calculation of Ȳ, s, and V from a frequency distribution.
Birth weights of male Chinese in ounces.

(1)            (2)
Class mark                Coded class mark
Y              f          Yc
 59.5             2        0
 67.5             6        1
 75.5            39        2
 83.5           385        3
 91.5           888        4
 99.5          1729        5
107.5          2240        6
115.5          2007        7
123.5          1233        8
131.5           641        9
139.5           201       10
147.5            74       11
155.5            14       12
163.5             5       13
171.5             1       14
               9465 = n

Source: Millis and Seng (1954).

Computation

$\sum fY_c = 59{,}629$
$\bar{Y}_c = 6.300$
$\sum fY_c^2 = 402{,}987$
$CT = \frac{(\sum fY_c)^2}{n} = 375{,}659.550$
$\sum fy_c^2 = \sum fY_c^2 - CT = 27{,}327.450$
$s_c^2 = \frac{\sum fy_c^2}{n - 1} = 2.888$
$s_c = 1.6991$

Coding and decoding

Code: $Y_c = \frac{Y - 59.5}{8}$
To decode $\bar{Y}_c$: $\bar{Y} = 8\bar{Y}_c + 59.5 = 50.4 + 59.5 = 109.9$ oz
To decode $s_c$: $s = 8s_c = 13.593$ oz
$V = \frac{s}{\bar{Y}} \times 100 = \frac{13.593}{109.9} \times 100 = 12.369\%$
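The steps of Box 3.2 can be followed in code. This is a sketch under the coding shown in the box, Yc = (Y - 59.5)/8; the variable names are illustrative. Because the code carries full precision through every step, its results may differ from the hand-computed values in the box in the last printed digit or so.

```python
import math

# Class marks (oz) and frequencies from Box 3.2.
class_marks = [59.5 + 8 * i for i in range(15)]
freqs = [2, 6, 39, 385, 888, 1729, 2240, 2007, 1233, 641, 201, 74, 14, 5, 1]

# Coding: subtract the lowest class mark, then divide by the class interval.
coded = [(y - 59.5) / 8 for y in class_marks]

n = sum(freqs)
sum_fy = sum(f * yc for f, yc in zip(freqs, coded))
sum_fy2 = sum(f * yc**2 for f, yc in zip(freqs, coded))

ct = sum_fy**2 / n                  # correction term
ss_coded = sum_fy2 - ct             # coded sum of squares
s_coded = math.sqrt(ss_coded / (n - 1))
mean_coded = sum_fy / n

# Decoding: reverse the multiplicative code for both statistics,
# and the additive code for the mean only.
mean_y = 8 * mean_coded + 59.5
s = 8 * s_coded
v = 100 * s / mean_y

print(n, sum_fy, sum_fy2)
print(round(mean_y, 1), round(s, 2), round(v, 2))
```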
The range of the neutrophil counts is 15.7. When this value is divided by 4, we get an estimate for the standard deviation of 3.925, which compares with the calculated value of 4.696 in Box 3.1. However, when we estimate mean and standard deviation of the aphid femur lengths of Box 2.1 in this manner, we obtain 4.0 and 0.35, respectively. These are good estimates of the actual values of 4.004 and 0.3656, the sample mean and standard deviation.

3.9 The coefficient of variation

Having obtained the standard deviation as a measure of the amount of variation in the data, you may be led to ask, "Now what?" At this stage in our comprehension of statistical theory, nothing really useful comes of the computations we have carried out. However, the skills just learned are basic to all later statistical work. So far, the only use that we might have for the standard deviation is as an estimate of the amount of variation in a population. Thus, we may wish to compare the magnitudes of the standard deviations of similar populations and see whether population A is more or less variable than population B. When populations differ appreciably in their means, the direct comparison of their variances or standard deviations is less useful, since larger organisms usually vary more than smaller ones. For instance, the standard deviation of the tail lengths of elephants is obviously much greater than the entire tail length of a mouse. To compare the relative amounts of variation in populations having different means, the coefficient of variation, symbolized by V (or occasionally CV), has been developed.
This is simply the standard deviation expressed as a percentage of the mean. Its formula is

$$V = \frac{s \times 100}{\bar{Y}}$$
For example, the coefficient of variation of the birth weights in Box 3.2 is 12.37%, as shown at the bottom of that box. The coefficient of variation is independent of the unit of measurement and is expressed as a percentage. Coefficients of variation are used when one wishes to compare the variation of two populations without considering the magnitude of their means. (It is probably of little interest to discover whether the birth weights of the Chinese children are more or less variable than the femur lengths of the aphid stem mothers. However, we can calculate V for the latter as (0.3656 × 100)/4.004 = 9.13%, which would suggest that the birth weights are more variable.) Often, we shall wish to test whether a given biological sample is more variable for one character than for another. Thus, for a sample of rats, is body weight more variable than blood sugar content? A second, frequent type of comparison, especially in systematics, is among different populations for the same character. Thus, we may have measured wing length in samples of birds from several localities. We wish to know whether any one of these populations is more variable than the others. An answer to this question can be obtained by examining the coefficients of variation of wing length in these samples.
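The two coefficients of variation quoted in this section, and the claim that V does not depend on the unit of measurement, can be verified with a few lines. The function name and the ounce-to-gram factor (about 28.35 g per ounce) are assumptions of this sketch, not part of the text.

```python
import math

def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * s / mean

# Birth weights (Box 3.2) and aphid femur lengths (Box 2.1).
v_birth = coefficient_of_variation(13.593, 109.9)
v_aphid = coefficient_of_variation(0.3656, 4.004)

# Changing units multiplies s and the mean by the same factor,
# so V is unchanged. The conversion factor is approximate.
OZ_TO_G = 28.3495
v_birth_grams = coefficient_of_variation(13.593 * OZ_TO_G, 109.9 * OZ_TO_G)

print(round(v_birth, 2), round(v_aphid, 2), round(v_birth_grams, 2))
```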
Exercises
3.1
Find Ȳ, s, V, and the median for the following data (mg of glycine per mg of creatinine in the urine of 37 chimpanzees; from Gartler, Firschein, and Dobzhansky, 1956). ANS. Ȳ = 0.115, s = 0.10404.
.008 .018 .056 .055 .135 .052 .077 .026 .025 .043
.080 .110 .110 .100 .070 .100 .050 .120 .011 .036
.060 .110 .350 .120 .100 .155 .370 .019 .100 .100
.116 .300 .440 .100 .300 .100 .133
3.2
Find the mean, standard deviation, and coefficient of variation for the pigeon data given in Exercise 2.4. Group the data into ten classes, recompute Ȳ and s, and compare them with the results obtained from ungrouped data. Compute the median for the grouped data.
3.3
The following are percentages of butterfat from 120 registered three-year-old Ayrshire cows selected at random from a Canadian stock record book. (a) Calculate Ȳ, s, and V directly from the data. (b) Group the data in a frequency distribution and again calculate Ȳ, s, and V. Compare the results with those of (a). How much precision has been lost by grouping? Also calculate the median.
4.32 4.24 4.29 4.00 3.96 3.74 4.48 3.89 4.20 4.02
4.10 4.00 4.33 4.23 4.16 4.67 4.33 3.88 3.74 3.81
4.81 4.25 4.28 4.15 4.49 4.67 4.60 4.00 4.71 4.03
4.29 4.05 4.11 4.38 4.46 3.96 4.42 4.27 3.97 4.24
3.72 4.82 4.09 4.38 4.32 4.38 4.06 3.97 4.31 4.16
4.08 3.97 3.70 4.30 4.17 3.97 4.20 4.51 4.24 3.86
4.36 4.18 4.05 4.05 3.56 3.94 3.89 4.17 4.06 3.93
3.82 3.89 4.20 4.58 3.70 4.07 3.99 4.33 3.58 3.89
4.38 4.14 3.47 4.66 3.92 4.60 3.97 4.91 4.38 3.91
4.12 4.52 4.10 4.09 4.34 3.98 4.09 3.86 4.88 4.58
4.22 3.95 4.35 4.09 4.28 4.42 3.66 3.77 3.66 4.20
3.83 3.87 5.00 3.99 3.91 4.10 4.40 4.70 4.41 4.24
3.4
What effect would adding a constant 5.2 to all observations have upon the numerical values of the following statistics: Ȳ, s, V, average deviation, median, mode, range? What would be the effect of adding 5.2 and then multiplying the sums by 8.0? Would it make any difference in the above statistics if we multiplied by 8.0 first and then added 5.2?
3.5
Estimate μ and σ using the midrange and the range (see Section 3.8) for the data in Exercises 3.1, 3.2, and 3.3. How well do these estimates agree with the estimates given by Ȳ and s? ANS. Estimates of μ and σ for Exercise 3.2 are 0.224 and 0.1014.
3.6
Show that the equation for the variance can also be written as

$$s^2 = \frac{\sum Y^2 - n\bar{Y}^2}{n - 1}$$

3.7
Using the striped bass age distribution given in Exercise 2.9, compute the following statistics: Ȳ, s², s, V, median, and mode. ANS. Ȳ = 3.043, s² = 1.2661, s = 1.125, V = 36.98%, median = 2.948, mode = 3.
3.8
Use a calculator and compare the results of using Equations 3.7 and 3.8 to compute s² for the following artificial data sets: (a) 1, 2, 3, 4, 5 (b) 9001, 9002, 9003, 9004, 9005 (c) 90001, 90002, 90003, 90004, 90005 (d) 900001, 900002, 900003, 900004, 900005. Compare your results with those of one or more computer programs. What is the correct answer? Explain your results.
CHAPTER 4

Introduction to Probability Distributions: The Binomial and Poisson Distributions
In Section 2.5 we first encountered frequency distributions. For example, Table 2.2 shows a distribution for a meristic, or discrete (discontinuous), variable, the number of sedge plants per quadrat. Examples of distributions for continuous variables are the femur lengths of aphids in Box 2.1 and the human birth weights in Box 3.2. Each of these distributions informs us about the absolute frequency of any given class and permits us to compute the relative frequencies of any class of variable. Thus, most of the quadrats contained either no sedges or one or two plants. In the 139.5-oz class of birth weights, we find only 201 out of the total of 9465 babies recorded; that is, approximately only 2.1% of the infants are in that birth weight class.

We realize, of course, that these frequency distributions are only samples from given populations. The birth weights, for example, represent a population of male Chinese infants from a given geographical area. But if we knew our sample to be representative of that population, we could make all sorts of predictions based upon the sample frequency distribution. For instance, we could say that approximately 2.1% of male Chinese babies born in this population should weigh between 135.5 and 143.5 oz at birth. Similarly, we might say that
the probability that the weight at birth of any one baby in this population will be in the 139.5-oz birth class is quite low. If all of the 9465 weights were mixed up in a hat and a single one pulled out, the probability that we would pull out one of the 201 in the 139.5-oz class would be very low indeed: only 2.1%. It would be much more probable that we would sample an infant of 107.5 or 115.5 oz, since the infants in these classes are represented by frequencies 2240 and 2007, respectively. Finally, if we were to sample from an unknown population of babies and find that the very first individual sampled had a birth weight of 170 oz, we would probably reject any hypothesis that the unknown population was the same as that sampled in Box 3.2. We would arrive at this conclusion because in the distribution in Box 3.2 only one out of almost 10,000 infants had a birth weight that high. Though it is possible that we could have sampled from the population of male Chinese babies and obtained a birth weight of 170 oz, the probability that the first individual sampled would have such a value is very low indeed. It seems much more reasonable to suppose that the unknown population from which we are sampling has a larger mean than the one sampled in Box 3.2.

We have used this empirical frequency distribution to make certain predictions (with what frequency a given event will occur) or to make judgments and decisions (is it likely that an infant of a given birth weight belongs to this population?).
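The relative frequencies used in this argument follow directly from the frequency distribution of Box 3.2. The short sketch below recomputes them; the dictionary and variable names are illustrative only.

```python
# Class marks (oz) and observed frequencies from Box 3.2.
class_marks = [59.5 + 8 * i for i in range(15)]
counts = [2, 6, 39, 385, 888, 1729, 2240, 2007, 1233, 641, 201, 74, 14, 5, 1]

n = sum(counts)
relative = {mark: f / n for mark, f in zip(class_marks, counts)}

# Empirical probability of drawing a weight in the 139.5-oz class.
p_139 = relative[139.5]

# Empirical probability of drawing from the 107.5- or 115.5-oz class.
p_modal = relative[107.5] + relative[115.5]

# Empirical probability of the heaviest class, which contains 170 oz.
p_heaviest = relative[171.5]

print(round(p_139, 4), round(p_modal, 4), p_heaviest)
```

The first value reproduces the 2.1% quoted above; the last is 1 in 9465, the basis for doubting that a 170-oz infant came from this population.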
In many cases in biology, however, we shall make such predictions not from empirical distributions, but on the basis of theoretical considerations that in our judgment are pertinent. We may feel that the data should be distributed in a certain way because of basic assumptions about the nature of the forces acting on the example at hand. If our actually observed data do not conform sufficiently to the values expected on the basis of these assumptions, we shall have serious doubts about our assumptions. This is a common use of frequency distributions in biology. The assumptions being tested generally lead to a theoretical frequency distribution known also as a probability distribution. This may be a simple two-valued distribution, such as the 3:1 ratio in a Mendelian cross; or it may be a more complicated function, as it would be if we were trying to predict the number of plants in a quadrat. If we find that the observed data do not fit the expectations on the basis of theory, we are often led to the discovery of some biological mechanism causing this deviation from expectation. The phenomena of linkage in genetics, of preferential mating between different phenotypes in animal behavior, of congregation of animals at certain favored places or, conversely, their territorial dispersion are cases in point. We shall thus make use of probability theory to test our assumptions about the laws of occurrence of certain biological phenomena. We should point out to the reader, however, that probability theory underlies the entire structure of statistics, since, owing to the nonmathematical orientation of this book, this may not be entirely obvious.
In this chapter we shall first discuss probability, in Section 4.1, but only to the extent necessary for comprehension of the sections that follow at the intended level of mathematical sophistication. Next, in Section 4.2, we shall take up the
binomial frequency distribution, which is not only important in certain types of studies, such as genetics, but also fundamental to an understanding of the various kinds of probability distributions to be discussed in this book. The Poisson distribution, which follows in Section 4.3, is of wide applicability in biology, especially for tests of randomness of occurrence of certain events. Both the binomial and Poisson distributions are discrete probability distributions. The most common continuous probability distribution is the normal frequency distribution, discussed in Chapter 5.
4.1 Probability, random sampling, and hypothesis testing

We shall start this discussion with an example that is not biometrical or biological in the strict sense. We have often found it pedagogically effective to introduce new concepts through situations thoroughly familiar to the student, even if the example is not relevant to the general subject matter of biostatistics.

Let us betake ourselves to Matchless University, a state institution somewhere between the Appalachians and the Rockies. Looking at its enrollment figures, we notice the following breakdown of the student body: 70% of the students are American undergraduates (AU) and 26% are American graduate students (AG); the remaining 4% are from abroad. Of these, 1% are foreign undergraduates (FU) and 3% are foreign graduate students (FG). In much of our work we shall use proportions rather than percentages as a useful convention. Thus the enrollment consists of 0.70 AU's, 0.26 AG's, 0.01 FU's, and 0.03 FG's. The total student body, corresponding to 100%, is therefore represented by the figure 1.0.

If we were to assemble all the students and sample 100 of them at random, we would intuitively expect that, on the average, 3 would be foreign graduate students. The actual outcome might vary. There might not be a single FG student among the 100 sampled, or there might be quite a few more than 3. The ratio of the number of foreign graduate students sampled divided by the total number of students sampled might therefore vary from zero to considerably greater than 0.03.
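The fluctuation of this ratio with sample size can be illustrated by simulation. The sketch below is an idealization, not the physical sampling discussed in the text: it assumes each student is drawn independently with the stated proportions (equivalent to sampling with replacement from a very large student body), and the function name and seed are arbitrary.

```python
import random

# Enrollment proportions at Matchless University.
PROPORTIONS = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def fg_ratio(sample_size, rng):
    """Fraction of foreign graduate students in one random sample."""
    draws = rng.choices(
        list(PROPORTIONS), weights=list(PROPORTIONS.values()), k=sample_size
    )
    return draws.count("FG") / sample_size

rng = random.Random(1)  # fixed seed so that a run can be repeated
for size in (100, 1000, 10_000, 100_000):
    print(size, fg_ratio(size, rng))
```

With small samples the printed ratio can stray well away from 0.03; with larger samples it should settle close to it.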
If we increased our sample size to 500 or 1000, it is less likely that the ratio would fluctuate widely around 0.03. The greater the sample taken, the closer the ratio of FG students sampled to the total students sampled will approach 0.03. In fact, the probability of sampling a foreign student can be defined as the limit as sample size keeps increasing of the ratio of foreign students to the total number of students sampled. Thus, we may formally summarize the situation by stating that the probability that a student at Matchless University will be a foreign graduate student is P[FG] = 0.03. Similarly, the probability of sampling a foreign undergraduate is P[FU] = 0.01, that of sampling an American undergraduate is P[AU] = 0.70, and that for American graduate students, P[AG] = 0.26.

Now let us imagine the following experiment: We try to sample a student at random from among the student body at Matchless University. This is not as easy a task as might be imagined. If we wanted to do this operation physically,
we would have to set up a collection or trapping station somewhere on campus. And to make certain that the sample was truly random with respect to the entire student population, we would have to know the ecology of students on campus very thoroughly. We should try to locate our trap at some station where each student had an equal probability of passing. Few, if any, such places can be found in a university. The student union facilities are likely to be frequented more by independent and foreign students, less by those living in organized houses and dormitories. Fewer foreign and graduate students might be found along fraternity row. Clearly, we would not wish to place our trap near the International Club or House, because our probability of sampling a foreign student would be greatly enhanced. In front of the bursar's window we might sample students paying tuition. But those on scholarships might not be found there. We do not know whether the proportion of scholarships among foreign or graduate students is the same as or different from that among the American or undergraduate students. Athletic events, political rallies, dances, and the like would all draw a differential spectrum of the student body; indeed, no easy solution seems in sight. The time of sampling is equally important, in the seasonal as well as the diurnal cycle. Those among the readers who are interested in sampling organisms from nature will already have perceived parallel problems in their work.
If we were to sample only students wearing turbans or saris, their probability of being foreign students would be almost 1. We could no longer speak of a random sample. In the familiar ecosystem of the university these violations of proper sampling procedure are obvious to all of us, but they are not nearly so obvious in real biological instances where we are unfamiliar with the true nature of the environment. How should we proceed to obtain a random sample of leaves from a tree, of insects from a field, or of mutations in a culture? In sampling at random, we are attempting to permit the frequencies of various events occurring in nature to be reproduced unalteredly in our records; that is, we hope that on the average the frequencies of these events in our sample will be the same as they are in the natural situation. Another way of saying this is that in a random sample every individual in the population being sampled has an equal probability of being included in the sample.

We might go about obtaining a random sample by using records representing the student body, such as the student directory, selecting a page from it at random and a name at random from the page. Or we could assign an arbitrary number to each student, write each on a chip or disk, put these in a large container, stir well, and then pull out a number.

Imagine now that we sample a single student physically by the trapping method, after carefully planning the placement of the trap in such a way as to make sampling random. What are the possible outcomes?
Clearly, the student could be either an AU, AG, FU or FG. The set of these four possible outcomes exhausts the possibilities of this experiment. This set, which we can represent as {AU, AG, FU, FG} is called the sample space. Any single trial of the experiment described above would result in only one of the four possible outcomes (elements)
in the set. A single element in a sample space is called a simple event. It is distinguished from an event, which is any subset of the sample space. Thus, in the sample space defined above {AU}, {AG}, {FU}, and {FG} are each simple events. The following sampling results are some of the possible events: {AU, AG, FU}, {AU, AG, FG}, {AG, FG}, {AU, FG}, . . . By the definition of "event," simple events as well as the entire sample space are also events. The meaning of these events should be clarified. Thus {AU, AG, FU} implies being either an American or an undergraduate, or both.

Given the sampling space described above, the event A = {AU, AG} encompasses all possible outcomes in the space yielding an American student. Similarly, the event B = {AG, FG} summarizes the possibilities for obtaining a graduate student. The intersection of events A and B, written A ∩ B, describes only those events that are shared by A and B. Clearly only AG qualifies, as can be seen below:

A = {AU, AG}
B = {AG, FG}
Thus, A ∩ B is that event in the sample space giving rise to the sampling of an American graduate student. When the intersection of two events is empty, as in B ∩ C, where C = {AU, FU}, events B and C are mutually exclusive. Thus there is no common element in these two events in the sampling space.

We may also define events that are unions of two other events in the sample space. Thus A ∪ B indicates that A or B or both A and B occur. As defined above, A ∪ B would describe all students who are either American students, graduate students, or American graduate students.

Why are we concerned with defining sample spaces and events? Because these concepts lead us to useful definitions and operations regarding the probability of various outcomes. If we can assign a number p, where 0 ≤ p ≤ 1, to each simple event in a sample space such that the sum of these p's over all simple events in the space equals unity, then the space becomes a (finite) probability space. In our example above, the following numbers were associated with the appropriate simple events in the sample space:

{AU, AG, FU, FG}
{0.70, 0.26, 0.01, 0.03}

Given this probability space, we are now able to make statements regarding the probability of given events. For example, what is the probability that a student sampled at random will be an American graduate student? Clearly, it is P[{AG}] = 0.26. What is the probability that a student is either American or a graduate student? In terms of the events defined earlier, this is

$$P[A \cup B] = P[\{\mathrm{AU}, \mathrm{AG}\}] + P[\{\mathrm{AG}, \mathrm{FG}\}] - P[\{\mathrm{AG}\}] = 0.96 + 0.29 - 0.26 = 0.99$$
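The events and probabilities defined in this section map naturally onto set operations. The following sketch uses Python sets for events; the helper `prob` and the dictionary `P` are illustrative names, not notation from the text.

```python
import math

# Probability assigned to each simple event in the sample space.
P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

def prob(event):
    """Probability of an event: sum over the simple events it contains."""
    return sum(P[outcome] for outcome in event)

A = {"AU", "AG"}   # American students
B = {"AG", "FG"}   # graduate students
C = {"AU", "FU"}   # undergraduates

print(A & B)       # intersection: American graduate students
print(B & C)       # empty set, so B and C are mutually exclusive

# Probability of the union, directly and by the addition rule.
p_union = prob(A | B)
p_rule = prob(A) + prob(B) - prob(A & B)
print(round(p_union, 2), round(p_rule, 2))
```

Both routes to P[A ∪ B] should give 0.99, up to floating-point rounding.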
We subtract P[{AG}] from the sum on the right-hand side of the equation because if we did not do so it would be included twice, once in P[A] and once in P[B], and would lead to the absurd result of a probability greater than 1.

Now let us assume that we have sampled our single student from the student body of Matchless University and that student turns out to be a foreign graduate student. What can we conclude from this? By chance alone, this result would happen 0.03, or 3%, of the time, which is not very frequently. The assumption that we have sampled at random should probably be rejected, since if we accept the hypothesis of random sampling, the outcome of the experiment is improbable. Please note that we said improbable, not impossible. It is obvious that we could have chanced upon an FG as the very first one to be sampled. However, it is not very likely. The probability is 0.97 that a single student sampled will be a non-FG. If we could be certain that our sampling method was random (as when drawing student numbers out of a container), we would have to decide that an improbable event has occurred. The decisions of this paragraph are all based on our definite knowledge that the proportion of students at Matchless University is indeed as specified by the probability space. If we were uncertain about this, we would be led to assume a higher proportion of foreign graduate students as a consequence of the outcome of our sampling experiment.

We shall now extend our experiment and sample two students rather than just one.
What are the possible outcomes of this sampling experiment? The new sampling space can best be depicted by a diagram (Figure 4.1) that shows the set of the 16 possible simple events as points in a lattice. The simple events are the following possible combinations. Ignoring which student was sampled first, they are (AU, AU), (AU, AG), (AU, FU), (AU, FG), (AG, AG), (AG, FU), (AG, FG), (FU, FU), (FU, FG), and (FG, FG).
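The two-student sample space can be enumerated mechanically. The sketch below assumes that the two draws are independent, so that the probability of an ordered pair is the product of the two single-student probabilities; that assumption, and all names in the code, belong to this illustration.

```python
import math
from itertools import product

P = {"AU": 0.70, "AG": 0.26, "FU": 0.01, "FG": 0.03}

# The 16 ordered simple events (first student, second student).
pairs = {(a, b): P[a] * P[b] for a, b in product(P, repeat=2)}

# Ignoring the order of sampling merges (a, b) with (b, a),
# leaving 10 distinct combinations.
unordered = {}
for (a, b), p in pairs.items():
    key = tuple(sorted((a, b)))
    unordered[key] = unordered.get(key, 0.0) + p

print(len(pairs), len(unordered))
print(round(pairs[("AU", "FG")], 4), round(unordered[("AU", "FG")], 4))
print(round(sum(pairs.values()), 10))
```

Under these assumptions, an AU followed by an FG has probability 0.0210, and the unordered combination (AU, FG) has twice that.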
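[Figure 4.1: Sample space for sampling two students from Matchless University, shown as a lattice of the 16 possible simple events with their probabilities.]

$$\frac{1}{e^{\mu}},\quad \frac{\mu}{1!\,e^{\mu}},\quad \frac{\mu^{2}}{2!\,e^{\mu}},\quad \frac{\mu^{3}}{3!\,e^{\mu}},\quad \frac{\mu^{4}}{4!\,e^{\mu}},\quad \ldots \tag{4.2}$$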
where the terms are the relative expected frequencies corresponding to the following counts of the rare event Y:

0, 1, 2, 3, 4, . . .
Thus, the first of these terms represents the relative expected frequency of samples containing no rare event; the second term, one rare event; the third term, two rare events; and so on. The denominator of each term contains $e^{\mu}$, where e is the base of the natural, or Napierian, logarithms, a constant whose value, accurate to 5 decimal places, is 2.71828. We recognize μ as the parametric mean of the distribution; it is a constant for any given problem. The exclamation mark after the coefficient in the denominator means "factorial," as explained in the previous section.

One way to learn more about the Poisson distribution is to apply it to an actual case. At the top of Box 4.1 is a well-known result from the early statistical literature based on the distribution of yeast cells in 400 squares of a hemacytometer, a counting chamber such as is used in making counts of blood cells and other microscopic objects suspended in liquid. Column (1) lists the number of yeast cells observed in each hemacytometer square, and column (2) gives the observed frequency, that is, the number of squares containing a given number of yeast cells. We note that 75 squares contained no yeast cells, but that most squares held either 1 or 2 cells. Only 17 squares contained 5 or more yeast cells.

Why would we expect this frequency distribution to be distributed in Poisson fashion? We have here a relatively rare event, the frequency of yeast cells per hemacytometer square, the mean of which has been calculated and found to be 1.8. That is, on the average there are 1.8 cells per square.
Relative to the amount of space provided in each square and the number of cells that could have come to rest in any one square, the actual number found is low indeed. We might also expect that the occurrence of individual yeast cells in a square is independent of the occurrence of other yeast cells. This is a commonly encountered class of application of the Poisson distribution.

The mean of the rare event is the only quantity that we need to know to calculate the relative expected frequencies of a Poisson distribution. Since we do
BOX 4.1
Calculation of expected Poisson frequencies.
Yeast cells in 400 squares of a hemacytometer: Ȳ = 1.8 cells per square; n = 400 squares sampled.

(1)                (2)            (3)                  (4)
Number of cells    Observed       Absolute expected    Deviation from
per square         frequencies    frequencies          expectation
Y                  f              f̂                    f − f̂
0                   75             66.1                +
1                  103            119.0                −
2                  121            107.1                +
3                   54             64.3                −
4                   30             28.9                +
5                   13             10.4                +
6                    2              3.1                −
7                    1              0.8                +
8                    0              0.2                −
9                    1              0.0                +
Total              400            399.9

Classes 5 through 9 are pooled in the original table: observed 17, expected 14.5, deviation +.

Source: "Student" (1907).

Computational steps

Flow of computation based on Expression (4.3) multiplied by n, since we wish to obtain absolute expected frequencies, f̂.

1. Find $e^{\bar{Y}}$ in a table of exponentials or compute it using an exponential key: $e^{\bar{Y}} = e^{1.8} = 6.0496$
2. $\hat{f}_0 = \frac{n}{e^{\bar{Y}}} = \frac{400}{6.0496} = 66.12$
3. $\hat{f}_1 = \hat{f}_0 \bar{Y} = 66.12(1.8) = 119.02$
4. $\hat{f}_2 = \hat{f}_1 \left(\frac{\bar{Y}}{2}\right) = 119.02\left(\frac{1.8}{2}\right) = 107.11$
5. $\hat{f}_3 = \hat{f}_2 \left(\frac{\bar{Y}}{3}\right) = 107.11\left(\frac{1.8}{3}\right) = 64.27$
6. $\hat{f}_4 = \hat{f}_3 \left(\frac{\bar{Y}}{4}\right) = 64.27\left(\frac{1.8}{4}\right) = 28.92$
7. $\hat{f}_5 = \hat{f}_4 \left(\frac{\bar{Y}}{5}\right) = 28.92\left(\frac{1.8}{5}\right) = 10.41$
8. $\hat{f}_6 = \hat{f}_5 \left(\frac{\bar{Y}}{6}\right) = 10.41\left(\frac{1.8}{6}\right) = 3.12$
9. $\hat{f}_7 = \hat{f}_6 \left(\frac{\bar{Y}}{7}\right) = 3.12\left(\frac{1.8}{7}\right) = 0.80$
10. $\hat{f}_8 = \hat{f}_7 \left(\frac{\bar{Y}}{8}\right) = 0.80\left(\frac{1.8}{8}\right) = 0.18$

Total: 399.95
$\hat{f}_9$ and beyond: 0.05

At step 3 enter Ȳ as a constant multiplier. Then multiply it by $n/e^{\bar{Y}}$ (quantity 2). At each subsequent step multiply the result of the previous step by Ȳ and then divide by the appropriate integer.
not know the parametric mean of the yeast cells in this problem, we employ an estimate (the sample mean) and calculate expected frequencies of a Poisson distribution with $\mu$ equal to the mean of the observed frequency distribution of Box 4.1. It is convenient for the purpose of computation to rewrite Expression (4.2) as a recursion formula as follows:

$$\hat{f}_i = \hat{f}_{i-1}\left(\frac{\bar{Y}}{i}\right) \quad \text{for } i = 1, 2, \ldots, \quad \text{where } \hat{f}_0 = e^{-\bar{Y}} \qquad (4.3)$$
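The recursion in Expression (4.3) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original text: the function names `poisson_expected_frequencies` and `poisson_term` are hypothetical, and the starting term is scaled by $n$ so that absolute rather than relative expected frequencies are produced, as in Box 4.1. The direct Poisson term $e^{-\bar{Y}}\bar{Y}^{Y}/Y!$ is included only to check that the recursion gives the same values.

```python
import math


def poisson_expected_frequencies(mean, n, max_count):
    """Absolute expected Poisson frequencies for counts 0..max_count.

    Starts from f_0 = n / e**mean and applies the recursion
    f_i = f_(i-1) * mean / i for i = 1, 2, ...
    """
    freqs = [n / math.exp(mean)]
    for i in range(1, max_count + 1):
        freqs.append(freqs[-1] * mean / i)
    return freqs


def poisson_term(mean, count):
    """Relative expected frequency of `count` events, computed directly."""
    return math.exp(-mean) * mean ** count / math.factorial(count)


# Yeast cell example: sample mean 1.8 cells per square, 400 squares.
expected = poisson_expected_frequencies(mean=1.8, n=400, max_count=8)
for count, f_hat in enumerate(expected):
    print(count, round(f_hat, 2))
```

Because each term is built from the one before it, a slip in an early term propagates through the rest of the chain; comparing a few terms against the direct formula, as the test values here do, is a cheap safeguard.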
Note first of all that the parametric mean $\mu$ has been replaced by the sample mean $\bar{Y}$. Each term developed by this recursion formula is mathematically exactly the same as its corresponding term in Expression (4.2). It is important to make no computational error, since in such a chain multiplication the correctness of each term depends on the accuracy of the term before it. Expression (4.3) yields relative expected frequencies. If, as is more usual, absolute expected frequencies are desired, simply set the first term $\hat{f}_0$ to $n/e^{\bar{Y}}$, where $n$ is the number of samples, and then proceed with the computational steps as before. The actual computation is illustrated in Box 4.1, and the expected frequencies so obtained are listed in column (3) of the frequency distribution.

What have we learned from this computation? When we compare the observed with the expected frequencies, we notice quite a good fit of our observed frequencies to a Poisson distribution of mean 1.8, although we have not as yet learned a statistical test for goodness of fit (this will be covered in Chapter 13). No clear pattern of deviations from expectation is shown. We cannot test a hypothesis about the mean, because the mean of the expected distribution was taken from the sample mean of the observed variates. As in the binomial distribution, clumping or aggregation would indicate that the probability that a second yeast cell will be found in a square is not independent of the presence of the first one, but is higher than the probability for the first cell. This would result in a clumping of the items in the classes at the tails of the distribution so that there would be some squares with larger numbers of cells than expected, others with fewer.

The biological interpretation of the dispersion pattern varies with the problem. The yeast cells seem to be randomly distributed in the counting chamber, indicating thorough mixing of the suspension. Red blood cells, on the other hand, will often stick together because of an electrical charge unless the proper suspension fluid is used. This so-called rouleaux effect would be indicated by clumping of the observed frequencies.

Note that in Box 4.1, as in the subsequent tables giving examples of the application of the Poisson distribution, we group the low frequencies at one tail of the curve, uniting them by means of a bracket. This tends to simplify the patterns of distribution somewhat. However, the main reason for this grouping is related to the G test for goodness of fit (of observed to expected frequencies), which is discussed in Section 13.2. For purposes of this test, no expected frequency $\hat{f}$ should be less than 5.

Before we turn to other examples, we need to learn a few more facts about the Poisson distribution. You probably noticed that in computing expected frequencies, we needed to know only one parameter, the mean of the distribution. By comparison, in the binomial distribution we needed two parameters, $p$ and $k$. Thus, the mean completely defines the shape of a given Poisson distribution.
From this it follows that the variance is some function of the mean. In a Poisson distribution, we have a very simple relationship between the two: $\mu = \sigma^2$, the variance being equal to the mean. The variance of the number of yeast cells per square based on the observed frequencies in Box 4.1 equals 1.965, not much larger than the mean of 1.8, indicating again that the yeast cells are distributed in Poisson fashion, hence randomly. This relationship between variance and mean suggests a rapid test of whether an observed frequency distribution is distributed in Poisson fashion even without fitting expected frequencies to the data. We simply compute a coefficient of dispersion

$$CD = \frac{s^2}{\bar{Y}}$$

This value will be near 1 in distributions that are essentially Poisson distributions, will be $> 1$ in clumped samples, and will be $< 1$ in cases of repulsion. In the yeast cell example, $CD = 1.965/1.8 = 1.092$.

The shapes of five Poisson distributions of different means are shown in Figure 4.3 as frequency polygons (a frequency polygon is formed by the line connecting successive midpoints in a bar diagram). We notice that for the low value of $\mu = 0.1$ the frequency polygon is extremely L-shaped, but with an increase in the value of $\mu$ the distributions become humped and eventually nearly symmetrical.

We conclude our study of the Poisson distribution with a consideration of two examples. The first example (Table 4.5) shows the distribution of a number
FIGURE 4.3
Frequency polygons of the Poisson distribution for various values of the mean. (Abscissa: number of rare events per sample.)
of accidents per woman from an accident record of 647 women working in a munitions factory during a five-week period. The sampling unit is one woman during this period. The rare event is the number of accidents that happened to a woman in this period. The coefficient of dispersion is 1.488, and this is clearly reflected in the observed frequencies, which are greater than expected in the tails and less than expected in the center. This relationship is easily seen in the deviations in the last column (observed minus expected frequencies) and shows a characteristic clumped pattern. The model assumes, of course, that the accidents are not fatal or very serious and thus do not remove the individual from further exposure. The noticeable clumping in these data probably arises
TABLE 4.5
Accidents in 5 weeks to 647 women working on high-explosive shells.

Number of          Observed       Poisson expected    Deviation from
accidents          frequencies    frequencies         expectation
per woman
Y                  f              $\hat{f}$           $f - \hat{f}$

0                  447            406.3               +
1                  132            189.0               -
2                   42             44.0               -
3                   21              6.8               +
4                    3              0.8               +
5+                   2              0.1               +

Total              647            647.0

$\bar{Y} = 0.4652$    $s^2 = 0.692$    $CD = 1.488$

Source: Greenwood and Yule (1920).
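The coefficient of dispersion quoted for the yeast counts of Box 4.1 and beneath Tables 4.5 and 4.6 can be checked directly from the observed frequencies. The following is a minimal sketch, not taken from the original text; `dispersion_summary` is a hypothetical helper name. For the accident data it assumes that the three highest classes hold 21, 3, and 2 women (the counts that reproduce the printed total of 647 and the printed mean of 0.4652) and that the open class "5+" means exactly 5 accidents.

```python
def dispersion_summary(observed):
    """Return (mean, variance, CD) for a frequency table.

    observed[y] is the number of sampling units that contain y items,
    so the list index plays the role of the variate Y.
    """
    n = sum(observed)
    total = sum(y * f for y, f in enumerate(observed))
    total_sq = sum(y * y * f for y, f in enumerate(observed))
    mean = total / n
    # Sample variance with n - 1 in the denominator.
    variance = (total_sq - total ** 2 / n) / (n - 1)
    return mean, variance, variance / mean


yeast = [75, 103, 121, 54, 30, 13, 2, 1, 0, 1]   # Box 4.1
accidents = [447, 132, 42, 21, 3, 2]             # Table 4.5; tail counts assumed, "5+" taken as 5
weevils = [61, 50, 1, 0, 0]                      # Table 4.6

for name, table in (("yeast", yeast), ("accidents", accidents), ("weevils", weevils)):
    mean, variance, cd = dispersion_summary(table)
    print(name, round(mean, 4), round(variance, 3), round(cd, 3))
```

A value near 1 (yeast) is what a Poisson distribution predicts, a value well above 1 (accidents) points to clumping, and a value well below 1 (weevils) points to repulsion. Small differences in the last digit from the printed values can arise because the printed CD was formed from already rounded statistics.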
TABLE 4.6
Azuki bean weevils (Callosobruchus chinensis) emerging from 112 Azuki beans (Phaseolus radiatus).

(1)                (2)            (3)                 (4)
Number of          Observed       Poisson expected    Deviation from
weevils emerging   frequencies    frequencies         expectation
per bean
Y                  f              $\hat{f}$           $f - \hat{f}$

0                   61             70.4               -
1                   50             32.7               +
2                    1              7.6               -
3                    0              1.2               -
4                    0              0.1               -
2-4 combined         1              8.9               -

Total              112            112.0

$\bar{Y} = 0.4643$    $s^2 = 0.269$    $CD = 0.579$

Source: Utida (1943).
either because some women are accident-prone or because some women have more dangerous jobs than others. Using only information on the distributions of accidents, one cannot distinguish between the two alternatives, which suggest very different changes that should be made to reduce the numbers of accidents.

The second example (Table 4.6) is extracted from an experimental study of the effects of different densities of the Azuki bean weevil. Larvae of these weevils enter the beans, feed and pupate inside them, and then emerge through an emergence hole. Thus the number of holes per bean is a good measure of the number of adults that have emerged. The rare event in this case is the presence of the weevil in the bean. We note that the distribution is strongly repulsed. There are many more beans containing one weevil than the Poisson distribution would predict. A statistical finding of this sort leads us to investigate the biology of the phenomenon. In this case it was found that the adult female weevils tended to deposit their eggs evenly rather than randomly over the available beans. This prevented the placing of too many eggs on any one bean and precluded heavy competition among the developing larvae on any one bean. A contributing factor was competition among remaining larvae feeding on the same bean, in which generally all but one were killed or driven out. Thus, it is easily understood how the above biological phenomena would give rise to a repulsed distribution.

Exercises

4.1
The two columns below give fertility of eggs of the CP strain of Drosophila melanogaster raised in 100 vials of 10 eggs each (data from R. R. Sokal). Find the expected frequencies on the assumption of independence of mortality for each egg in a vial. Use the observed mean. Calculate the expected variance and compare it with the observed variance. Interpret results, knowing that the eggs of each vial are siblings and that the different vials contain descendants from different parent pairs.

Number of eggs hatched    Number of vials
Y                         f

 0                         1
 1                         3
 2                         8
 3                        10
 4                         6
 5                        15
 6                        14
 7                        12
 8                        13
 9                         9
10                         9

ANS. $\sigma^2 = 2.417$, $s^2 = 6.636$. There is evidence that mortality rates are different for different vials.

4.2
In human beings the sex ratio of newborn infants is about 100 ♀♀ : 105 ♂♂. Were we to take 10,000 random samples of 6 newborn infants from the total population of such infants for one year, what would be the expected frequency of groups of 6 males, 5 males, 4 males, and so on?

4.3
The Army Medical Corps is concerned over the intestinal disease X. From previous experience it knows that soldiers suffering from the disease invariably harbor the pathogenic organism in their feces and that to all practical purposes every stool specimen from a diseased person contains the organism. However, the organisms are never abundant, and thus only 20% of all slides prepared by the standard procedure will contain some. (We assume that if an organism is present on a slide it will be seen.) How many slides should laboratory technicians be directed to prepare and examine per stool specimen, so that in case a specimen is positive, it will be erroneously diagnosed negative in fewer than 1% of the cases (on the average)? On the basis of your answer, would you recommend that the Corps attempt to improve its diagnostic methods? ANS. 21 slides.

4.4
Calculate Poisson expected frequencies for the frequency distribution given in Table 2.2 (number of plants of the sedge Carex flacca found in 500 quadrats).

4.5
A cross is made in a genetic experiment in Drosophila in which it is expected that $\frac{1}{4}$ of the progeny will have white eyes and $\frac{1}{2}$ will have the trait called "singed bristles." Assume that the two gene loci segregate independently. (a) What proportion of the progeny should exhibit both traits simultaneously? (b) If four flies are sampled at random, what is the probability that they will all be white-eyed? (c) What is the probability that none of the four flies will have either white eyes or "singed bristles"? (d) If two flies are sampled, what is the probability that at least one of the flies will have either white eyes or "singed bristles" or both traits? ANS. (a) $\frac{1}{8}$; (b) $(\frac{1}{4})^4$; (c) $[(1 - \frac{1}{4})(1 - \frac{1}{2})]^4$; (d) $1 - [(1 - \frac{1}{4})(1 - \frac{1}{2})]^2$.

4.6
Those readers who have had a semester or two of calculus may wish to try to prove that Expression (4.1) tends to Expression (4.2) as $k$ becomes indefinitely large (and $p$ becomes infinitesimal, so that $\mu = kp$ remains constant). HINT:

$$\left(1 + \frac{x}{n}\right)^n \to e^x \quad \text{as } n \to \infty$$

4.7
If the frequency of the gene A is $p$ and the frequency of the gene a is $q$, what are the expected frequencies of the zygotes AA, Aa, and aa (assuming a diploid zygote represents a random sample of size 2)? What would the expected frequency be for an autotetraploid (for a locus close to the centromere a zygote can be thought of as a random sample of size 4)? ANS. $P\{AA\} = p^2$, $P\{Aa\} = 2pq$, $P\{aa\} = q^2$, for a diploid; and $P\{AAAA\} = p^4$, $P\{AAAa\} = 4p^3q$, $P\{AAaa\} = 6p^2q^2$, $P\{Aaaa\} = 4pq^3$, $P\{aaaa\} = q^4$, for a tetraploid.

4.8
Summarize and compare the assumptions and parameters on which the binomial and Poisson distributions are based.

4.9
A population consists of three types of individuals, $A_1$, $A_2$, and $A_3$, with relative frequencies of 0.5, 0.2, and 0.3, respectively. (a) What is the probability of obtaining only individuals of type $A_1$ in samples of size 1, 2, 3, ..., $n$? (b) What would be the probabilities of obtaining only individuals that were not of type $A_1$ or $A_2$ in a sample of size $n$? (c) What is the probability of obtaining a sample containing at least one representation of each type in samples of size 1, 2, 3, 4, 5, ..., $n$? ANS. (a) $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots, 1/2^n$. (b) $(0.3)^n$. (c) 0, 0, 0.18, 0.36, 0.507; for $n$:

$$\sum_{i=1}^{n-2} \sum_{j=1}^{n-i-1} \frac{n!}{i!\,j!\,(n-i-j)!}\,(0.5)^i (0.2)^j (0.3)^{n-i-j}$$

4.10
If the average number of weed seeds found in a ¼-ounce sample of grass seed is 1.1429, what would you expect the frequency distribution of weed seeds to be in ninety-eight ¼-ounce samples? (Assume there is random distribution of the weed seeds.)
CHAPTER 5
The Normal Probability Distribution
The theoretical frequency distributions in Chapter 4 were discrete. Their variables assumed values that changed in integral steps (that is, they were meristic variables). Thus, the number of infected insects per sample could be 0 or 1 or 2 but never an intermediate value between these. Similarly, the number of yeast cells per hemacytometer square is a meristic variable and requires a discrete probability function to describe it. However, most variables encountered in biology either are continuous (such as the aphid femur lengths or the infant birth weights used as examples in Chapters 2 and 3) or can be treated as continuous variables for most practical purposes, even though they are inherently meristic (such as the neutrophil counts encountered in the same chapters). Chapter 5 will deal more extensively with the distributions of continuous variables.

Section 5.1 introduces frequency distributions of continuous variables. In Section 5.2 we show one way of deriving the most common such distribution, the normal probability distribution. Then we examine its properties in Section 5.3. A few applications of the normal distribution are illustrated in Section 5.4. A graphic technique for pointing out departures from normality and for estimating mean and standard deviation in approximately normal distributions is given in Section 5.5, as are some of the reasons for departure from normality in observed frequency distributions.
5.1 Frequency distributions of continuous variables

For continuous variables, the theoretical probability distribution, or probability density function, can be represented by a continuous curve, as shown in Figure 5.1. The ordinate of the curve gives the density for a given value of the variable shown along the abscissa. By density we mean the relative concentration of variates along the Y axis (as indicated in Figure 2.1). In order to compare the theoretical with the observed frequency distribution, it is necessary to divide the two into corresponding classes, as shown by the vertical lines in Figure 5.1. Probability density functions are defined so that the expected frequency of observations between two class limits (vertical lines) is given by the area between these limits under the curve. The total area under the curve is therefore equal to the sum of the expected frequencies (1.0 or $n$, depending on whether relative or absolute expected frequencies have been calculated).

When you form a frequency distribution of observations of a continuous variable, your choice of class limits is arbitrary, because all values of a variable are theoretically possible. In a continuous distribution, one cannot evaluate the probability that the variable will be exactly equal to a given value such as 3 or 3.5. One can only estimate the frequency of observations falling between two limits. This is so because the area of the curve corresponding to any point along the curve is an infinitesimal. Thus, to calculate expected frequencies for a continuous distribution, we have to calculate the area under the curve between the class limits. In Sections 5.3 and 5.4, we shall see how this is done for the normal frequency distribution.
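The idea that an expected frequency is an area between two class limits can be sketched numerically. The example below is an illustration only and is not taken from the original text: it assumes a normal curve (the distribution treated in the following sections) rather than the arbitrary curve of Figure 5.1, and the names `normal_cdf` and `expected_frequency` as well as the parameter values are hypothetical.

```python
import math


def normal_cdf(y, mu, sigma):
    """Area under a normal curve to the left of y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def expected_frequency(lower, upper, mu, sigma, n=1.0):
    """Expected frequency between two class limits.

    With n = 1.0 the result is a relative expected frequency;
    with n set to the sample size it is an absolute one.
    """
    return n * (normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma))


# Relative expected frequency within one standard deviation of the mean.
print(expected_frequency(-1.0, 1.0, mu=0.0, sigma=1.0))

# A single point has no width, so the area above it is zero.
print(expected_frequency(3.0, 3.0, mu=0.0, sigma=1.0))
```

Narrowing the class limits toward a single value shrinks the area toward zero, which is why only intervals, never exact values, receive a nonzero expected frequency.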
Continuous frequency distributions may start and terminate at finite points along the Y axis, as shown in Figure 5.1, or one or both ends of the curve may extend indefinitely, as will be seen later in Figures 5.3 and 6.11. The idea of an area under a curve when one or both ends go to infinity may trouble those of you not acquainted with calculus. Fortunately, however, this is not a great conceptual stumbling block, since in all the cases that we shall encounter, the tail of the curve will approach the Y axis rapidly enough that the portion of the area beyond a certain point will for all practical purposes be zero and the frequencies it represents will be infinitesimal.

FIGURE 5.1
A probability distribution of a continuous variable.

We may fit continuous frequency distributions to some sets of meristic data (for example, the number of teeth in an organism). In such cases, we have reason to believe that underlying biological variables that cause differences in numbers of the structure are really continuous, even though expressed as a discrete variable.

We shall now proceed to discuss the most important probability density function in statistics, the normal frequency distribution.
5.2 Derivation of the normal distribution

There are several ways of deriving the normal frequency distribution from elementary assumptions. Most of these require more mathematics than we expect of our readers. We shall therefore use a largely intuitive approach, which we have found of heuristic value. Some inherently meristic variables, such as counts of blood cells, range into the thousands. Such variables can, for practical purposes, be treated as though they were continuous.

Let us consider a binomial distribution of the familiar form $(p + q)^k$ in which $k$ becomes indefinitely large. What type of biological situation could give rise to such a binomial distribution? An example might be one in which many factors cooperate additively in producing a biological result. The following hypothetical case is possibly not too far removed from reality. The intensity of skin pigmentation in an animal will be due to the summation of many factors, some genetic, others environmental. As a simplifying assumption, let us state that every factor can occur in two states only: present or absent. When the factor is present, it contributes one unit of pigmentation to skin color, but it contributes nothing to pigmentation when it is absent. Each factor, regardless of its nature or origin, has the identical effect, and the effects are additive: if three out of five possible factors are present in an individual, the pigmentation intensity will be three units, or the sum of three contributions of one unit each. One final assumption: Each factor has an equal probability of being present or absent in a given individual.
Thus, $p = P[F] = 0.5$, the probability that the factor is present; while $q = P[f] = 0.5$, the probability that the factor is absent. With only one factor ($k = 1$), expansion of the binomial (p + … 3, the normal distribution will be closely approximated. Second, in a more realistic situation, factors would be permitted to occur in more than two states: one state making a large contribution, a second state a smaller contribution, and so forth. However, it can also be shown that the multinomial $(p + q + r + \cdots + z)^k$ approaches the normal frequency distribution as $k$ approaches infinity. Third, different factors may be present in different frequencies and may have different quantitative effects. As long as these are additive and independent, normality is still approached as $k$ approaches infinity.

Lifting these restrictions makes the assumptions leading to a normal distribution compatible with innumerable biological situations. It is therefore not surprising that so many biological variables are approximately normally distributed.
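The pigmentation model of this section can be explored numerically by expanding $(p + q)^k$ for increasing numbers of factors. The sketch below is illustrative only and not part of the original text; `binomial_classes` is a hypothetical helper name, and the choice of $k$ values is arbitrary. Each list gives the relative expected frequency of animals with 0, 1, ..., $k$ units of pigmentation when $p = q = 0.5$.

```python
import math


def binomial_classes(k, p=0.5):
    """Relative expected frequencies of 0..k factors present, from (p + q)**k."""
    q = 1.0 - p
    return [math.comb(k, y) * p ** y * q ** (k - y) for y in range(k + 1)]


# More factors give more pigmentation classes and a smoother, humped outline.
for k in (1, 2, 5, 10):
    classes = binomial_classes(k)
    print(k, [round(freq, 3) for freq in classes])
```

With one factor there are only two equally frequent classes; as $k$ grows, the classes multiply and the frequencies pile up symmetrically around the middle class, which is the bell-shaped outline that the normal curve describes in the limit.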
Let us summarize the conditions that tend to produce normal frequency distributions: (1) that there be many factors; (2) that these factors be independent in occurrence; (3) that the factors be independent in effect, that is, that their effects be additive; and (4) that they make equal contributions to the variance. The fourth condition we are not yet in a position to discuss; we mention it here only for completeness. It will be discussed in Chapter 7.
5.3 Properties of the normal distribution

Formally, the normal probability density function can be represented by the expression

$$Z = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{Y - \mu}{\sigma}\right)^2} \qquad (5.1)$$
Here $Z$ indicates the height of the ordinate of the curve, which represents the density of the items. It is the dependent variable in the expression, being a function of the variable $Y$. There are two constants in the equation: $\pi$, well known to be approximately 3.141,59, making $1/\sqrt{2\pi}$ approximately 0.398,94, and $e$, the base of the natural logarithms, whose value approximates 2.718,28.

There are two parameters in a normal probability density function. These are the parametric mean $\mu$ and the parametric standard deviation $\sigma$, which determine the location and shape of the distribution. Thus, there is not just one normal distribution, as might appear to the uninitiated who keep encountering the same bell-shaped image in textbooks. Rather, there are an infinity of such curves, since these parameters can assume an infinity of values. This is illustrated by the three normal curves in Figure 5.3, representing the same total frequencies.
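Expression (5.1) translates directly into a short function, which makes the roles of the two parameters easy to see. This is a minimal sketch rather than anything given in the original text; the name `normal_ordinate` and the parameter values used in the demonstration are assumptions chosen for illustration.

```python
import math


def normal_ordinate(y, mu, sigma):
    """Height Z of the normal curve at Y = y, following Expression (5.1)."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Changing mu moves the peak along the abscissa without changing its height.
print(normal_ordinate(4.0, mu=4.0, sigma=1.0), normal_ordinate(8.0, mu=8.0, sigma=1.0))

# Doubling sigma halves the height at the mean and spreads the curve out.
print(normal_ordinate(4.0, mu=4.0, sigma=1.0), normal_ordinate(4.0, mu=4.0, sigma=2.0))
```

At $Y = \mu$ the exponent is zero, so the peak height is $1/(\sigma\sqrt{2\pi})$; for $\sigma = 1$ this is the constant 0.398,94 mentioned above.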
FIGURE 5.3
Illustration of how changes in the two parameters of the normal distribution affect the shape and location of the normal probability density function. (A) $\mu = 4$, …
since that corresponds to an infinite distance from the mean. If you are interested in plotting all observations, you can plot, instead of cumulative frequencies $F$, the quantity $F - \frac{1}{2}$ expressed as a percentage of $n$.

Often it is desirable to compare observed frequency distributions with their expectations without resorting to cumulative frequency distributions. One method of doing so would be to superimpose a normal curve on the histogram of an observed frequency distribution. Fitting a normal distribution as a curve superimposed upon an observed frequency distribution in the form of a histogram is usually done only when graphic facilities (plotters) are available. Ordinates are computed by modifying Expression (5.1) to conform to a frequency distribution:

$$Z = \frac{ni}{s\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{Y - \mu}{s}\right)^2} \qquad (5.2)$$
In this expression $n$ is the sample size and $i$ is the class interval of the frequency distribution. If this needs to be done without a computer program, a table of ordinates of the normal curve is useful. In Figure 5.8A we show the frequency distribution of birth weights of male Chinese from Box 5.1 with the ordinates of the normal curve superimposed. There is an excess of observed frequencies at the right tail due to the skewness of the distribution.

You will probably find it difficult to compare the heights of bars against the arch of a curve. For this reason, John Tukey has suggested that the bars of the histograms be suspended from the curve. Their departures from expectation can then be easily observed against the straight-line abscissa of the graph. Such a hanging histogram is shown in Figure 5.8B for the birth weight data. The departure from normality is now much clearer.

Because important departures are frequently noted in the tails of a curve, it has been suggested that square roots of expected frequencies should be compared with the square roots of observed frequencies. Such a "hanging rootogram" is shown in Figure 5.8C for the Chinese birth weight data. Note the accentuation of the departure from normality. Finally, one can also use an analogous technique for comparing expected with observed histograms. Figure 5.8D shows the same data plotted in this manner. Square roots of frequencies are again shown. The excess of observed over expected frequencies in the right tail of the distribution is quite evident.
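The ordinates of Expression (5.2) and the quantities plotted in a hanging rootogram can be sketched as follows. This is a hedged illustration and not the authors' program: the function names are hypothetical, the observed frequencies are invented purely for demonstration (they are not the birth weight data of Box 5.1), and the sample mean and standard deviation computed from the grouped data stand in for the parameters of the fitted curve.

```python
import math


def expected_ordinate(y, mean, s, n, interval):
    """Ordinate of the fitted normal curve at class mark y, scaled by n * i."""
    z = (y - mean) / s
    return n * interval * math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))


def grouped_mean_sd(class_marks, observed):
    """Mean and standard deviation of a frequency distribution."""
    n = sum(observed)
    mean = sum(y * f for y, f in zip(class_marks, observed)) / n
    ss = sum(f * (y - mean) ** 2 for y, f in zip(class_marks, observed))
    return mean, math.sqrt(ss / (n - 1))


def hanging_rootogram(class_marks, observed, interval):
    """Rows of (class mark, observed, expected, bar bottom).

    Each bar hangs from sqrt(expected) and has length sqrt(observed),
    so its bottom sits at sqrt(expected) - sqrt(observed).
    """
    n = sum(observed)
    mean, s = grouped_mean_sd(class_marks, observed)
    rows = []
    for y, f in zip(class_marks, observed):
        f_hat = expected_ordinate(y, mean, s, n, interval)
        rows.append((y, f, f_hat, math.sqrt(f_hat) - math.sqrt(f)))
    return rows


# Invented frequencies with a heavier right tail, for illustration only.
marks = [60, 70, 80, 90, 100, 110, 120]
counts = [2, 9, 22, 30, 21, 11, 5]
for y, f, f_hat, bottom in hanging_rootogram(marks, counts, interval=10):
    print(y, f, round(f_hat, 1), round(bottom, 2))
```

A bar bottom below zero marks a class in which the observed frequency exceeds expectation, and one above zero marks a deficit; working with square roots keeps departures in the sparse tail classes from being dwarfed by those near the center.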
Exercises 5.1
U s i n g t h e i n f o r m a t i o n g i v e n in B o x 3.2, w h a t is t h e p r o b a b i l i t y o f o b t a i n i n g a n i n d i v i d u a l w i t h a n e g a t i v e b i r t h w e i g h t ? W h a t is t h i s p r o b a b i l i t y if w e a s s u m e t h a t b i r t h w e i g h t s a r e n o r m a l l y d i s t r i b u t e d ? A N S . T h e e m p i r i c a l e s t i m a t e is z e r o . If a n o r m a l d i s t r i b u t i o n c a n b e a s s u m e d , it is t h e p r o b a b i l i t y t h a t a s t a n d a r d n o r m a l d e v i a t e is less t h a n (0 - 1 0 9 . 9 ) / 1 3 . 5 9 3 = - 8 . 0 8 5 . T h i s v a l u e is b e y o n d t h e r a n g e of m o s t tables, a n d t h e p r o b a b i l i t y can be c o n s i d e r e d z e r o for practical purposes.
92
5.2 5.3
chapter 5 / the normal probability
distribution
C a r r y o u t t h e o p e r a t i o n s l i s t e d in E x e r c i s e 5.1 o n t h e t r a n s f o r m e d d a t a g e n e r a t e d i n E x e r c i s e 2.6. A s s u m e y o u k n o w t h a t t h e p e t a l l e n g t h of a p o p u l a t i o n of p l a n t s of s p e c i e s X is n o r m a l l y d i s t r i b u t e d w i t h a m e a n o f μ = 3.2 c m a n d a s t a n d a r d d e v i a t i o n o f σ = 1.8. W h a t p r o p o r t i o n o f t h e p o p u l a t i o n w o u l d b e e x p e c t e d t o h a v e a p e t a l l e n g t h ( a ) g r e a t e r t h a n 4 . 5 c m ? ( b ) G r e a t e r t h a n 1.78 c m ? (c) B e t w e e n 2 . 9 a n d 3.6 c m ? A N S . (a) = 0 . 2 3 5 3 , ( b ) = 0 . 7 8 4 5 , a n d (c) = 0 . 1 5 4 .
5.4
P e r f o r m a g r a p h i c a n a l y s i s o f t h e b u t t e r f a t d a t a g i v e n i n E x e r c i s e 3.3, u s i n g p r o b ability paper. In addition, plot the d a t a on probability p a p e r with the abscissa in l o g a r i t h m i c units. C o m p a r e t h e r e s u l t s of t h e t w o a n a l y s e s .
5.5
A s s u m e that traits A a n d Β are independent a n d normally distributed with p a r a m e t e r s μΛ = 2 8 . 6 , σΑ = 4 . 8 , μΒ = 16.2, a n d σΒ = 4.1. Y o u s a m p l e t w o i n d i v i d u a l s a t r a n d o m (a) W h a t is t h e p r o b a b i l i t y o f o b t a i n i n g s a m p l e s i n w h i c h b o t h i n d i v i d u a l s m e a s u r e l e s s t h a n 2 0 f o r t h e t w o t r a i t s ? (b) W h a t is t h e p r o b a b i l i t y t h a t a t l e a s t o n e o f t h e i n d i v i d u a l s is g r e a t e r t h a n 3 0 f o r t r a i t B ? A N S . (a) P{A < 20}P{B < 2 0 } = ( 0 . 3 6 5 4 ) ( 0 . 0 8 2 , 3 8 ) = 0 . 0 3 0 ; (b) 1 - (P{A < 3 0 } ) χ ( Ρ { Β < 30}) = 1 - (0.6147)(0.9960) = 0.3856.
5.6
P e r f o r m t h e f o l l o w i n g o p e r a t i o n s o n t h e d a t a o f E x e r c i s e 2.4. (a) If y o u h a v e not already d o n e so, m a k e a frequency distribution f r o m the d a t a a n d g r a p h the r e s u l t s i n t h e f o r m of a h i s t o g r a m , ( b ) C o m p u t e t h e e x p e c t e d f r e q u e n c i e s f o r e a c h o f t h e c l a s s e s b a s e d o n a n o r m a l d i s t r i b u t i o n w i t h μ = Ϋ a n d σ = s. (c) G r a p h t h e e x p e c t e d f r e q u e n c i e s in t h e f o r m o f a h i s t o g r a m a n d c o m p a r e t h e m w i t h t h e o b s e r v e d f r e q u e n c i e s , (d) C o m m e n t o n t h e d e g r e e of a g r e e m e n t b e t w e e n o b s e r v e d a n d expected frequencies.
5.7
Let us approximate the observed frequencies in Exercise 2.9 with a normal frequency distribution. Compare the observed frequencies with those expected when a normal distribution is assumed. Compare the two distributions by forming and superimposing the observed and the expected histograms and by using a hanging histogram. ANS. The expected frequencies for the age classes are: 17.9, 48.2, 72.0, 51.4, 17.5, 3.0. This is clear evidence for skewness in the observed distribution.
5.8
Perform a graphic analysis on the following measurements. Are they consistent with what one would expect in sampling from a normal distribution?
12.88   9.46  11.06  21.27   7.02
 9.72  10.25  15.81   5.60  14.20
 6.60  10.42   8.18  11.44   6.37
 6.26   5.40  11.09   7.92  12.53
 3.21   8.74   6.50   6.74   3.40
5.9
The following data are total lengths (in cm) of bass from a southern lake:
29.9  40.2  37.8  19.7  19.1  41.4  34.7  13.6  33.5  32.2
18.3  24.3  17.2  13.3  37.7  12.6  29.7  19.4  39.2  24.7
20.4  16.2  33.3  39.6  24.6  38.2  23.8  18.6  36.8  19.1
27.3  37.4  33.1  20.1  38.2  30.0  19.4  18.0  31.6  33.7
Compute the mean, the standard deviation, and the coefficient of variation. Make a histogram of the data. Do the data seem consistent with a normal distribution on the basis of a graphic analysis? If not, what type of departure is suggested? ANS. $\bar{Y} = 27.4475$, s = 8.9035, V = 32.438. There is a suggestion of bimodality.
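The three statistics requested for the bass lengths can be reproduced with a few lines of code. This is a minimal sketch using only the Python standard library; the variable names are our own, and the 40 values are those listed in the exercise.

```python
import math

# Total lengths (in cm) of the 40 bass listed in the exercise.
bass_lengths_cm = [
    29.9, 40.2, 37.8, 19.7, 19.1, 41.4, 34.7, 13.6, 33.5, 32.2,
    18.3, 24.3, 17.2, 13.3, 37.7, 12.6, 29.7, 19.4, 39.2, 24.7,
    20.4, 16.2, 33.3, 39.6, 24.6, 38.2, 23.8, 18.6, 36.8, 19.1,
    27.3, 37.4, 33.1, 20.1, 38.2, 30.0, 19.4, 18.0, 31.6, 33.7,
]

n = len(bass_lengths_cm)
mean = sum(bass_lengths_cm) / n
# Sum of squared deviations from the mean, divided by n - 1 degrees of freedom.
sum_of_squares = sum((y - mean) ** 2 for y in bass_lengths_cm)
s = math.sqrt(sum_of_squares / (n - 1))
cv = 100.0 * s / mean  # coefficient of variation, in percent

print(f"n = {n}, mean = {mean:.4f}, s = {s:.4f}, V = {cv:.3f}")
```

The histogram and the graphic analysis on probability paper are left to the reader, since the suggestion of bimodality is a matter of visual judgment.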
CHAPTER 6
Estimation and Hypothesis Testing
In this chapter we provide methods to answer two fundamental statistical questions that every biologist must ask repeatedly in the course of his or her work: (1) how reliable are the results I have obtained? and (2) how probable is it that the differences between observed results and those expected on the basis of a hypothesis have been produced by chance alone? The first question, about reliability, is answered through the setting of confidence limits to sample statistics. The second question leads into hypothesis testing. Both subjects belong to the field of statistical inference. The subject matter in this chapter is fundamental to an understanding of any of the subsequent chapters.
In Section 6.1 we consider the form of the distribution of means and their variance. In Section 6.2 we examine the distributions and variances of statistics other than the mean. This brings us to the general subject of standard errors, which are statistics measuring the reliability of an estimate. Confidence limits provide bounds to our estimates of population parameters. We develop the idea of a confidence limit in Section 6.3 and show its application to samples where the true standard deviation is known. However, one usually deals with small, more or less normally distributed samples with unknown standard deviations, in which case the t distribution must be used. We shall introduce the t distribution in Section 6.4. The application of t to the computation of confidence limits for statistics of small samples with unknown population standard deviations is shown in Section 6.5. Another important distribution, the chi-square distribution, is explained in Section 6.6. Then it is applied to setting confidence limits for the variance in Section 6.7. The theory of hypothesis testing is introduced in Section 6.8 and is applied in Section 6.9 to a variety of cases exhibiting the normal or t distributions. Finally, Section 6.10 illustrates hypothesis testing for variances by means of the chi-square distribution.
6.1 Distribution and variance of means
We commence our study of the distribution and variance of means with a sampling experiment.
Experiment 6.1 You were asked to retain from Experiment 5.1 the means of the seven samples of 5 housefly wing lengths and the seven similar means of milk yields. We can collect these means from every student in a class, possibly adding them to the sampling results of previous classes, and construct a frequency distribution of these means. For each variable we can also obtain the mean of the seven means, which is a mean of a sample of 35 items. Here again we shall make a frequency distribution of these means, although it takes a considerable number of samplers to accumulate a sufficient number of samples of 35 items for a meaningful frequency distribution.
In Table 6.1 we show a frequency distribution of 1400 means of samples of 5 housefly wing lengths. Consider columns (1) and (3) for the time being. Actually, these samples were obtained not by biostatistics classes but by a digital computer, enabling us to collect these values with little effort. Their mean and standard deviation are given at the foot of the table. These values are plotted on probability paper in Figure 6.1. Note that the distribution appears quite normal, as does that of the means based on 200 samples of 35 wing lengths shown in the same figure. This illustrates an important theorem: The means of samples from a normally distributed population are themselves normally distributed regardless of sample size n.
Thus, we note that the means of samples from the normally distributed housefly wing lengths are normally distributed whether they are based on 5 or 35 individual readings. Similarly obtained distributions of means of the heavily skewed milk yields, as shown in Figure 6.2, appear to be close to normal distributions. However, the means based on five milk yields do not agree with the normal nearly as well as do the means of 35 items. This illustrates another theorem of fundamental importance in statistics: As sample size increases, the means of samples drawn from a population of any distribution will approach the normal distribution. This theorem, when rigorously stated (about sampling from populations with finite variances), is known as the central limit theorem. The importance of this theorem is that if n is large enough, it permits us to use the normal distribution to make statistical inferences about means of populations in which the items are not at all normally distributed. The necessary size of n depends upon the distribution. (Skewed populations require larger sample sizes.)

TABLE 6.1
Frequency distribution of means of 1400 random samples of 5 housefly wing lengths. (Data from Table 5.1.) Class marks chosen to give intervals of $\frac{1}{2}\sigma_{\bar{Y}}$ to each side of the parametric mean μ.

(1) Class mark Y        (2) Class mark                     (3) f
(in mm × 10⁻¹)          (in $\sigma_{\bar{Y}}$ units)
39.832                  −3¼                                   1
40.704                  −2¾                                  11
41.576                  −2¼                                  19
42.448                  −1¾                                  64
43.320                  −1¼                                 128
44.192                  −¾                                  247
45.064                  −¼                                  226
μ = 45.5 →
45.936                  ¼                                   259
46.808                  ¾                                   231
47.680                  1¼                                  121
48.552                  1¾                                   61
49.424                  2¼                                   23
50.296                  2¾                                    6
51.168                  3¼                                    3
Total                                                      1400

$\bar{Y} = 45.480$     s = 1.778     $\sigma_{\bar{Y}} = 1.744$

The next fact of importance that we note is that the range of the means is considerably less than that of the original items. Thus, the wing-length means range from 39.4 to 51.6 in samples of 5 and from 43.9 to 47.4 in samples of 35, but the individual wing lengths range from 36 to 55. The milk-yield means range from 54.2 to 89.0 in samples of 5 and from 61.9 to 71.3 in samples of 35, but the individual milk yields range from 51 to 98. Not only do means show less scatter than the items upon which they are based (an easily understood phenomenon if you give some thought to it), but the range of the distribution of the means diminishes as the sample size upon which the means are based increases.
The differences in ranges are reflected in differences in the standard deviations of these distributions. If we calculate the standard deviations of the means
FIGURE 6.1
Graphic analysis of means of 1400 random samples of 5 housefly wing lengths (from Table 6.1) and of means of 200 random samples of 35 housefly wing lengths. [Two probability-paper panels, "Samples of 5" and "Samples of 35"; abscissa: housefly wing lengths in $\sigma_{\bar{Y}}$ units.]

[Figure 6.2: two probability-paper panels, "Samples of 5" and "Samples of 35"; abscissa: milk yields in $\sigma_{\bar{Y}}$ units; ordinate: cumulative percent in probability scale.]
$\sum^{n} w_i$ for the weighted mean. We shall state without proof that the variance of the weighted sum of independent items $\sum^{n} w_i Y_i$ is

$$\operatorname{Var}\Bigl(\sum^{n} w_i Y_i\Bigr) = \sum^{n} w_i^2 \sigma_i^2 \qquad (6.1)$$

where $\sigma_i^2$ is the variance of $Y_i$. It follows that

$$\sigma_{\bar{Y}}^2 = \frac{1}{\bigl(\sum^{n} w_i\bigr)^2} \sum^{n} w_i^2 \sigma_i^2$$

Since the weights $w_i$ in this case equal 1, $\sum^{n} w_i = n$, and we can rewrite the above expression as

$$\sigma_{\bar{Y}}^2 = \frac{1}{n^2} \sum^{n} \sigma_i^2$$

If we assume that the variances $\sigma_i^2$ are all equal to $\sigma^2$, the expected variance of the mean is

$$\sigma_{\bar{Y}}^2 = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \qquad (6.2)$$

and consequently, the expected standard deviation of means is

$$\sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \qquad (6.2a)$$

From this formula it is clear that the standard deviation of means is a function of the standard deviation of items as well as of sample size of means. The greater the sample size, the smaller will be the standard deviation of means. In fact, as sample size increases to a very large number, the standard deviation of means becomes vanishingly small. This makes good sense. Very large sample sizes, averaging many observations, should yield estimates of means closer to the population mean and less variable than those based on a few items.
When working with samples from a population, we do not, of course, know its parametric standard deviation σ, and we can obtain only a sample estimate s of the latter. Also, we would be unlikely to have numerous samples of size n from which to compute the standard deviation of means directly. Customarily, we therefore have to estimate the standard deviation of means from a single sample by using Expression (6.2a), substituting s for σ:

$$s_{\bar{Y}} = \frac{s}{\sqrt{n}} \qquad (6.3)$$

Thus, from the standard deviation of a single sample, we obtain an estimate of the standard deviation of means we would expect were we to obtain a collection of means based on equal-sized samples of n items from the same population. As we shall see, this estimate of the standard deviation of a mean is a very important and frequently used statistic.
Table 6.2 illustrates some estimates of the standard deviations of means that might be obtained from random samples of the two populations that we have been discussing.
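Readers with a computer at hand can check Expressions (6.2a) and (6.3) by simulation. The following is a minimal sketch rather than the procedure used to build the tables of this chapter: it assumes a normal population with the parametric values quoted for housefly wing lengths (μ = 45.5, σ = 3.90), draws samples with Python's standard `random` module, and uses function names that are our own.

```python
import math
import random
import statistics

MU, SIGMA = 45.5, 3.90  # parametric values quoted for housefly wing lengths

def sd_of_sample_means(n, n_samples, rng):
    """Draw n_samples samples of size n; return the standard deviation of their means."""
    means = [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(n_samples)
    ]
    return statistics.stdev(means)

def expected_sd_of_means(sigma, n):
    """Expression (6.2a): sigma divided by the square root of n."""
    return sigma / math.sqrt(n)

def estimated_standard_error(sample):
    """Expression (6.3): s divided by the square root of n, from a single sample."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

rng = random.Random(1)  # fixed seed so that a rerun gives the same numbers
for n in (5, 35):
    observed = sd_of_sample_means(n, 2000, rng)
    expected = expected_sd_of_means(SIGMA, n)
    print(f"n = {n:2d}: observed {observed:.3f}, expected {expected:.3f}")

# A single sample of 5 gives only a rough estimate of the standard error.
one_sample = [rng.gauss(MU, SIGMA) for _ in range(5)]
print(f"standard error estimated from one sample of 5: {estimated_standard_error(one_sample):.3f}")
```

The observed standard deviations of means should lie close to σ/√n for both sample sizes, while the estimate from a single small sample can be well off the mark, which is the point made in the discussion of Table 6.2.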
The means of 5 samples of wing lengths based on 5 individuals ranged from 43.6 to 46.8, their standard deviations from 1.095 to 4.827, and the estimate of standard deviation of the means from 0.490 to 2.159. Ranges for the other categories of samples in Table 6.2 similarly include the parametric values of these statistics. The estimates of the standard deviations of the means of the milk yields cluster around the expected value, since they are not dependent on normality of the variates. However, in a particular sample in which by chance the sample standard deviation is a poor estimate of the population standard deviation (as in the second sample of 5 milk yields), the estimate of the standard deviation of means is equally wide of the mark.
We should emphasize one point of difference between the standard deviation of items and the standard deviation of sample means. If we estimate a population standard deviation through the standard deviation of a sample, the magnitude of the estimate will not change as we increase our sample size. We may expect that the estimate will improve and will approach the true standard
TABLE 6.2
Means, standard deviations, and standard deviations of means (standard errors) of five random samples of 5 and 35 housefly wing lengths and Jersey cow milk yields, respectively. (Data from Table 5.1.) Parametric values for the statistics are given in the sixth line of each category.

            (1) $\bar{Y}$     (2) s          (3) $s_{\bar{Y}}$
Wing lengths
n = 5       45.8              1.095          0.490
            45.6              3.209          1.435
            43.6              4.827          2.159
            44.8              4.764          2.131
            46.8              1.095          0.490
            μ = 45.5          σ = 3.90       $\sigma_{\bar{Y}} = 1.744$
n = 35      45.37             3.812          0.644
            45.00             3.850          0.651
            45.74             3.576          0.604
            45.29             4.198          0.710
            45.91             3.958          0.669
            μ = 45.5          σ = 3.90       $\sigma_{\bar{Y}} = 0.659$
Milk yields
η
=
66.0 61.6 67.6 65.0
5
14.195
62.2 = 66.61
η = 35
2.775
6.205 4.278 16.072
100. In such cases we use the sample standard deviation for computing the standard error of the mean. However, when the samples are small (n < 100) and we lack knowledge of the parametric standard deviation, we must take into consideration the reliability of our sample standard deviation. To do so, we must make use of the so-called t or Student's distribution. We shall learn how to set confidence limits employing the t distribution in Section 6.5. Before that, however, we shall have to become familiar with this distribution in the next section.
6.4 Student's t distribution
The deviations $\bar{Y} - \mu$ of sample means from the parametric mean of a normal distribution are themselves normally distributed. If these deviations are divided by the parametric standard deviation, the resulting ratios, $(\bar{Y} - \mu)/\sigma_{\bar{Y}}$, are still normally distributed, with μ = 0 and σ = 1. Subtracting the constant μ from every $\bar{Y}_i$ is simply an additive code (Section 3.8) and will not change the form of the distribution of sample means, which is normal (Section 6.1). Dividing each deviation by the constant $\sigma_{\bar{Y}}$ reduces the variance to unity, but proportionately so for the entire distribution, so that its shape is not altered and a previously normal distribution remains so.
If, on the other hand, we calculate the variance $s_i^2$ of each of the samples and calculate the deviation for each mean $\bar{Y}_i$ as $(\bar{Y}_i - \mu)/s_{\bar{Y}_i}$, where $s_{\bar{Y}_i}$ stands for the estimate of the standard error of the mean of the ith sample, we will find the distribution of the deviations wider and more peaked than the normal distribution. This is illustrated in Figure 6.4, which shows the ratio $(\bar{Y}_i - \mu)/s_{\bar{Y}_i}$ for the 1400 samples of five housefly wing lengths of Table 6.1.
The new distribution ranges wider than the corresponding normal distribution, because the denominator is the sample standard error rather than the parametric standard error and will sometimes be smaller and sometimes greater than expected. This increased variation will be reflected in the greater variance of the ratio $(\bar{Y} - \mu)/s_{\bar{Y}}$. The expected distribution of this ratio is called the t distribution, also known as "Student's" distribution, named after W. S. Gosset, who first described it, publishing under the pseudonym "Student." The t distribution is a function with a complicated mathematical formula that need not be presented here.

FIGURE 6.4
Distribution of quantity $t_s = (\bar{Y} - \mu)/s_{\bar{Y}}$ along abscissa computed for 1400 samples of 5 housefly wing lengths presented as a histogram and as a cumulative frequency distribution. Right-hand ordinate represents frequencies for the histogram; left-hand ordinate is cumulative frequency in probability scale.

The t distribution shares with the normal the properties of being symmetric and of extending from negative to positive infinity. However, it differs from the normal in that it assumes different shapes depending on the number of degrees of freedom. By "degrees of freedom" we mean the quantity n − 1, where n is the sample size upon which a variance has been based. It will be remembered that n − 1 is the divisor in obtaining an unbiased estimate of the variance from a sum of squares. The number of degrees of freedom pertinent to a given Student's distribution is the same as the number of degrees of freedom of the standard deviation in the ratio $(\bar{Y} - \mu)/s_{\bar{Y}}$. Degrees of freedom (abbreviated df or sometimes ν) can range from 1 to infinity. A t distribution for df = 1 deviates most markedly from the normal. As the number of degrees of freedom increases, Student's distribution approaches the shape of the standard normal distribution (μ = 0, σ = 1) ever more closely, and in a graph the size of this page a t distribution of df = 30 is essentially indistinguishable from a normal distribution. At df = ∞, the t distribution is the normal distribution. Thus, we can think of the t distribution as the general case, considering the normal to be a special case of Student's distribution with df = ∞. Figure 6.5 shows t distributions for 1 and 2 degrees of freedom compared with a normal frequency distribution.
We were able to employ a single table for the areas of the normal curve by coding the argument in standard deviation units. However, since the t distributions differ in shape for differing degrees of freedom, it will be necessary to have a separate t table, corresponding in structure to the table of the areas of the normal curve, for each value of df. This would make for very cumbersome and elaborate sets of tables. Conventional t tables are therefore differently arranged. Table III shows degrees of freedom and probability as arguments and the corresponding values of t as functions. The probabilities indicate the percent of the area in both tails of the curve (to the right and left of the mean) beyond the indicated value of t. Thus, looking up the critical value of t at probability P = 0.05 and df = 5, we find t = 2.571 in Table III. Since this is a two-tailed table, the probability of 0.05 means that 0.025 of the area will fall to the left of a t value of −2.571 and 0.025 will fall to the right of t = +2.571. You will recall that the corresponding value for infinite degrees of freedom (for the normal curve) is 1.960. Only those probabilities generally used are shown in Table III.
You should become very familiar with looking up t values in this table. This is one of the most important tables to be consulted.
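Two-tailed critical values of t such as those just quoted can also be approximated numerically. The sketch below is an illustration under stated assumptions, not the way Table III was prepared: it integrates the standard density function of Student's t with Simpson's rule and searches for the critical value by bisection. All function names are our own.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=1000):
    """Cumulative probability P(T <= t), by Simpson's rule on the interval [0, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_critical(alpha, df):
    """Value of t leaving a proportion alpha in both tails combined (alpha/2 in each)."""
    target = 1 - alpha / 2
    hi = 1.0
    while t_cdf(hi, df) < target:  # widen the bracket until it contains the answer
        hi *= 2
    lo = 0.0
    for _ in range(60):            # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (1, 2, 5, 30, 1000):
    print(f"df = {df:4d}: t at P = 0.05 is {t_critical(0.05, df):.3f}")
```

Running the loop shows the critical value shrinking toward the normal-curve value of 1.960 as the degrees of freedom increase, which is the behavior described in the text.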
A fairly conventional symbolism is $t_{\alpha[\nu]}$, meaning the tabled t value for ν degrees of freedom and proportion α in both tails (α/2 in each tail), which is equivalent to the t value for the cumulative probability of 1 − (α/2). Try looking up some of these values to become familiar with the table. For example, convince yourself that $t_{0.05[7]}$, $t_{0.01[3]}$, $t_{0.02[10]}$, and $t_{0.05[\,\ldots\,]}$ ...

... P > 0.05." This means that the probability of such a deviation is between 0.05 and 0.10. Another way of saying this is that the value of $t_s$ is not significant (frequently abbreviated as ns). A convention often encountered is the use of asterisks after the computed value of the significance test, as in $t_s$ = 2.86**. The symbols generally represent the following probability ranges:
* = 0.05 > P > 0.01     ** = 0.01 > P > 0.001     *** = P < 0.001
However, since some authors occasionally imply other ranges by these asterisks, the meaning of the symbols has to be specified in each scientific report.
It might be argued that in a biological preparation the concern of the tester should not be whether the sample differs significantly from a standard, but whether it is significantly below the standard. This may be one of those biological preparations in which an excess of the active component is of no harm but a shortage would make the preparation ineffective at the conventional dosage. Then the test becomes one-tailed, performed in exactly the same manner except that the critical values of t for a one-tailed test are at half the probabilities of the two-tailed test. Thus 2.26, the former 0.05 value, becomes $t_{0.025[9]}$, and 1.83, the former 0.10 value, becomes $t_{0.05[9]}$, making our observed $t_s$ value of 2.12 "significant at the 5% level" or, more precisely stated, significant at 0.05 > P > 0.025. If we are prepared to accept a 5% significance level, we would consider the preparation significantly below the standard.
You may be surprised that the same example, employing the same data and significance tests, should lead to two different conclusions, and you may begin to wonder whether some of the things you hear about statistics and statisticians are not, after all, correct. The explanation lies in the fact that the two results are answers to different questions. If we test whether our sample is significantly different from the standard in either direction, we must conclude that it is not different enough for us to reject the null hypothesis. If, on the other hand, we exclude from consideration the fact that the true sample mean μ could be greater than the established standard $\mu_0$, the difference as found by us is clearly significant. It is obvious from this example that in any statistical test one must clearly state whether a one-tailed or a two-tailed test has been performed if the nature of the example is such that there could be any doubt about the matter.
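The two conclusions reached for the same $t_s$ can be made concrete by computing the tail areas directly. The following sketch assumes the values quoted in the text ($t_s$ = 2.12 with 9 degrees of freedom) and obtains the tail area by numerical integration of the standard t density; the function names are our own, and the method is an illustration rather than a substitute for the t table.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def upper_tail_area(t, df, steps=1000):
    """P(T > t) for t >= 0: one half minus the Simpson's-rule area from 0 to t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

T_S, DF = 2.12, 9  # observed value and degrees of freedom quoted in the text

p_one_tailed = upper_tail_area(T_S, DF)  # deviation in one stated direction only
p_two_tailed = 2 * p_one_tailed          # deviation in either direction

print(f"one-tailed P = {p_one_tailed:.4f}")
print(f"two-tailed P = {p_two_tailed:.4f}")
print("significant at 5%, one-tailed:", p_one_tailed < 0.05)
print("significant at 5%, two-tailed:", p_two_tailed < 0.05)
```

The one-tailed probability falls between 0.025 and 0.05 and the two-tailed probability between 0.05 and 0.10, so the same $t_s$ is significant at the 5% level only under the one-tailed alternative.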
We should also point out that such a difference in the outcome of the results is not necessarily typical. It is only because the outcome in this case is in a borderline area between clear significance and nonsignificance. Had the difference between sample and standard been 10.5 activity units, the sample would have been unquestionably significantly different from the standard by the one-tailed or the two-tailed test.
The promulgation of a standard mean is generally insufficient for the establishment of a rigid standard for a product. If the variance among the samples is sufficiently large, it will never be possible to establish a significant difference between the standard and the sample mean. This is an important point that should be quite clear to you. Remember that the standard error can be increased in two ways: by lowering sample size or by increasing the standard deviation of the replicates. Both of these are undesirable aspects of any experimental setup.
The test described above for the biological preparation leads us to a general test for the significance of any statistic, that is, for the significance of a deviation of any statistic from a parametric value, which is outlined in Box 6.4. Such a test applies whenever the statistics are expected to be normally distributed. When the standard error is estimated from the sample, the t distribution is used. However, since the normal distribution is just a special case $t_{[\infty]}$ of the t distribution, most statisticians uniformly apply the t distribution with the appro-
6.10 / testing the hypothesis $H_0\colon \sigma^2 = \sigma_0^2$
BOX 6.4
Testing the significance of a statistic, that is, the significance of a deviation of a sample statistic from a parametric value. For normally distributed statistics.
Computational steps
1. Compute $t_s$ as the following ratio:

$$t_s = \frac{St - St_p}{s_{St}}$$
chapter 7 / introduction to analysis of variance
table, and the second subscript changes with each row representing an individual item. Using this notation, we can compute the variance of sample 1 as

$$\frac{1}{n-1}\sum_{j=1}^{j=n}\bigl(Y_{1j} - \bar{Y}_1\bigr)^2$$

The variance within groups, which is the average variance of the samples, is computed as

$$\frac{1}{a(n-1)}\sum_{i=1}^{i=a}\sum_{j=1}^{j=n}\bigl(Y_{ij} - \bar{Y}_i\bigr)^2$$

Note the double summation. It means that we start with the first group, setting i = 1 (i being the index of the outer Σ). We sum the squared deviations of all items from the mean of the first group, changing index j of the inner Σ from 1 to n in the process. We then return to the outer summation, set i = 2, and sum the squared deviations for group 2 from j = 1 to j = n. This process is continued until i, the index of the outer Σ, is set to a. In other words, we sum all the squared deviations within one group first and add this sum to similar sums from all the other groups. The variance among groups is computed as

$$\frac{n}{a-1}\sum_{i=1}^{i=a}\bigl(\bar{Y}_i - \bar{\bar{Y}}\bigr)^2$$
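The double summation for the variance within groups and the single summation for the variance among groups translate directly into nested and simple loops. The sketch below is illustrative only: the three small groups are made-up numbers chosen so that the arithmetic can be followed by hand, not data from this book, and the function names are our own. Equal sample sizes n in all a groups are assumed, as in the formulas above.

```python
def variance_within_groups(groups):
    """Average variance of the samples: double sum of squared deviations / a(n - 1)."""
    a = len(groups)
    n = len(groups[0])
    total = 0.0
    for group in groups:                 # outer summation, i = 1 to a
        group_mean = sum(group) / n
        for y in group:                  # inner summation, j = 1 to n
            total += (y - group_mean) ** 2
    return total / (a * (n - 1))

def variance_among_groups(groups):
    """n times the variance of the a group means around the grand mean."""
    a = len(groups)
    n = len(groups[0])
    group_means = [sum(group) / n for group in groups]
    grand_mean = sum(group_means) / a    # valid because all groups have n items
    return n / (a - 1) * sum((m - grand_mean) ** 2 for m in group_means)

# Hypothetical data: a = 3 groups of n = 3 items each.
example_groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]

within = variance_within_groups(example_groups)
among = variance_among_groups(example_groups)
print(f"variance within groups = {within:.3f}")
print(f"variance among groups  = {among:.3f}")
```

In this made-up example each group has a variance of 1, so the variance within groups is 1, while the widely separated group means (2, 3, and 7) give a variance among groups of 21.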
Now that we have two independent estimates of the population variance, what shall we do with them? We might wish to find out whether they do in fact estimate the same parameter. To test this hypothesis, we need a statistical test that will evaluate the probability that the two sample variances are from the same population. Such a test employs the F distribution, which is taken up next.
7.2 The F distribution
Let us devise yet another sampling experiment. This is quite a tedious one without the use of computers, so we will not ask you to carry it out. Assume that you are sampling at random from a normally distributed population, such as the housefly wing lengths with mean μ and variance $\sigma^2$. The sampling procedure consists of first sampling $n_1$ items and calculating their variance $s_1^2$, followed by sampling $n_2$ items and calculating their variance $s_2^2$. Sample sizes $n_1$ and $n_2$ may or may not be equal to each other, but are fixed for any one sampling experiment. Thus, for example, we might always sample 8 wing lengths for the first sample ($n_1$) and 6 wing lengths for the second sample ($n_2$). After each pair of values ($s_1^2$ and $s_2^2$) has been obtained, we calculate

$$F_s = \frac{s_1^2}{s_2^2}$$

This will be a ratio near 1, because these variances are estimates of the same quantity. Its actual value will depend on the relative magnitudes of variances
we use a one-tailed F test, as illustrated by Figure 7.2.
We can now test the two variances obtained in the sampling experiment of Section 7.1 and Table 7.1. The variance among groups based on 7 means was 21.180, and the variance within 7 groups of 5 individuals was 16.029. Our null hypothesis is that the two variances estimate the same parametric variance; the alternative hypothesis in an anova is always that the parametric variance estimated by the variance among groups is greater than that estimated by the variance within groups. The reason for this restrictive alternative hypothesis, which leads to a one-tailed test, will be explained in Section 7.4. We calculate the variance ratio $F_s = s_1^2/s_2^2 = 21.181/16.029 = 1.32$. Before we can inspect the F table, we have to know the appropriate degrees of freedom for this variance ratio. We shall learn simple formulas for degrees of freedom in an anova later, but at the moment let us reason it out for ourselves. The upper variance (among groups) was based on the variance of 7 means; hence it should have a − 1 = 6 degrees of freedom. The lower variance was based on an average of 7 variances, each of them based on 5 individuals yielding 4 degrees of freedom per variance: a(n − 1) = 7 × 4 = 28 degrees of freedom. Thus, the upper variance has 6, the lower variance 28 degrees of freedom. If we check Table V for $\nu_1 = 6$, $\nu_2 = 24$, the closest arguments in the table, we find that $F_{0.05[6,24]} = 2.51$. For F = 1.32, corresponding to the $F_s$ value actually obtained, α is clearly > 0.05. Thus, we may expect more than 5% of all variance ratios of samples based on 6 and 28 degrees of freedom, respectively, to have $F_s$ values greater than 1.32. We have no evidence to reject the null hypothesis and conclude that the two sample variances estimate the same parametric variance. This corresponds, of course, to what we knew anyway from our sampling experiment. Since the seven samples were taken from the same population, the estimate using the variance of their means is expected to yield another estimate of the parametric variance of housefly wing length.

FIGURE 7.2
Frequency curve of the F distribution for 6 and 24 degrees of freedom, respectively. A one-tailed

Whenever the alternative hypothesis is that the two parametric variances are unequal (rather than the restrictive hypothesis $H_1\colon \sigma_1^2 > \sigma_2^2$), the sample variance $s_1^2$ can be smaller as well as greater than $s_2^2$. This leads to a two-tailed test, and in such cases a 5% type I error means that rejection regions of 2½% will occur at each tail of the curve. In such a case it is necessary to obtain F values for α > 0.5 (that is, in the left half of the F distribution). Since these values are rarely tabulated, they can be obtained by using the simple relationship

$$F_{\alpha[\nu_1,\nu_2]} = \frac{1}{F_{(1-\alpha)[\nu_2,\nu_1]}}$$

For example, $F_{0.05[5,24]} = 2.62$. If we wish to obtain $F_{0.95[5,24]}$ (the F value to the right of which lies 95% of the area of the F distribution with 5 and 24 degrees of freedom, respectively), we first have to find $F_{0.05[24,5]} = 4.53$. Then $F_{0.95[5,24]}$ is the reciprocal of 4.53, which equals 0.221. Thus 95% of an F distribution with 5 and 24 degrees of freedom lies to the right of 0.221.
There is an important relationship between the F distribution and the $\chi^2$ distribution. You may remember that the ratio $X^2 = \sum y^2/\sigma^2$ was distributed as a $\chi^2$ with n − 1 degrees of freedom. If you divide the numerator of this expression by n − 1, you obtain the ratio $F_s = s^2/\sigma^2$, which is a variance ratio with an expected distribution of $F_{[n-1,\infty]}$. The upper degrees of freedom are n − 1 (the degrees of freedom of the sum of squares or sample variance). The lower degrees of freedom are infinite, because only on the basis of an infinite number of items can we obtain the true, parametric variance of a population. Therefore, by dividing a value of $X^2$ by n − 1 degrees of freedom, we obtain an $F_s$ value with n − 1 and ∞ df, respectively. In general, $\chi^2_{[\nu]}/\nu = F_{[\nu,\infty]}$. We can convince ourselves of this by inspecting the F and $\chi^2$ tables. From the $\chi^2$ table (Table IV) we find that $\chi^2_{0.05[10]} = 18.307$. Dividing this value by 10 df, we obtain 1.8307.
Thus, the two statistics of significance are closely related and, lacking a $\chi^2$ table, we could make do with an F table alone, using the values of $\nu F_{[\nu,\infty]}$ in place of $\chi^2_{[\nu]}$. Before we return to analysis of variance, we shall first apply our newly won knowledge of the F distribution to testing a hypothesis about two sample variances.
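The two table relationships used above can be checked with a few lines of arithmetic. The following is only a sketch: the constants are the tabled values quoted in the text, and the function names `lower_tail_f` and `f_from_chi2` are hypothetical helpers introduced for illustration.

```python
# Tabled values quoted in the text (Tables IV and V of the book).
F_05_24_5 = 4.53      # F_{0.05[24,5]}
CHI2_05_10 = 18.307   # chi-square_{0.05[10]}


def lower_tail_f(upper_tail_f_swapped_df):
    """F_{(1-alpha)[v1,v2]} as the reciprocal of F_{alpha[v2,v1]}."""
    return 1.0 / upper_tail_f_swapped_df


def f_from_chi2(chi2_value, df):
    """F_{alpha[v,inf]} obtained as chi-square_{alpha[v]} divided by v."""
    return chi2_value / df


f_95_5_24 = lower_tail_f(F_05_24_5)         # left-tail F for 5 and 24 df
f_05_10_inf = f_from_chi2(CHI2_05_10, 10)   # F for 10 and infinite df

print(round(f_95_5_24, 3))
print(round(f_05_10_inf, 4))
```

Both helpers only rearrange tabled numbers; they do not compute quantiles of either distribution.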
BOX 7.1
Testing the significance of differences between two variances.

Survival in days of the cockroach Blattella vaga when kept without food or water.

Females: $n_1 = 10$
Males: $n_2 = 10$

$H_0$: where $\alpha$ is the type I error accepted and $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$ are the degrees of freedom for the upper and lower variance, respectively. Whether we look up
Then we open the parentheses inside the square brackets, so that the second $\alpha_i$ changes sign and the $\alpha_i$'s cancel out, leaving the expression exactly as before.
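TABLE 7.4
Data of Table 7.3 arranged in the manner of Table 7.2.

$$
\begin{array}{l|ccccccc}
\text{Item} \backslash \text{Group} & 1 & 2 & 3 & \cdots & i & \cdots & a \\ \hline
1 & Y_{11} + \alpha_1 & Y_{21} + \alpha_2 & Y_{31} + \alpha_3 & \cdots & Y_{i1} + \alpha_i & \cdots & Y_{a1} + \alpha_a \\
2 & Y_{12} + \alpha_1 & Y_{22} + \alpha_2 & Y_{32} + \alpha_3 & \cdots & Y_{i2} + \alpha_i & \cdots & Y_{a2} + \alpha_a \\
3 & Y_{13} + \alpha_1 & Y_{23} + \alpha_2 & Y_{33} + \alpha_3 & \cdots & Y_{i3} + \alpha_i & \cdots & Y_{a3} + \alpha_a \\
\vdots & \vdots & \vdots & \vdots & & \vdots & & \vdots \\
j & Y_{1j} + \alpha_1 & Y_{2j} + \alpha_2 & Y_{3j} + \alpha_3 & \cdots & Y_{ij} + \alpha_i & \cdots & Y_{aj} + \alpha_a \\
\vdots & \vdots & \vdots & \vdots & & \vdots & & \vdots \\
n & Y_{1n} + \alpha_1 & Y_{2n} + \alpha_2 & Y_{3n} + \alpha_3 & \cdots & Y_{in} + \alpha_i & \cdots & Y_{an} + \alpha_a \\ \hline
\text{Sums} & \sum^{n} Y_1 + n\alpha_1 & \sum^{n} Y_2 + n\alpha_2 & \sum^{n} Y_3 + n\alpha_3 & \cdots & \sum^{n} Y_i + n\alpha_i & \cdots & \sum^{n} Y_a + n\alpha_a \\
\text{Means} & \bar{Y}_1 + \alpha_1 & \bar{Y}_2 + \alpha_2 & \bar{Y}_3 + \alpha_3 & \cdots & \bar{Y}_i + \alpha_i & \cdots & \bar{Y}_a + \alpha_a
\end{array}
$$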
The total sum of squares represents

$$SS_{total} = \sum^{a}\sum^{n}\left(Y - \bar{\bar{Y}}\right)^2 = \sum^{a}\sum^{n} Y^2 - \frac{1}{an}\left(\sum^{a}\sum^{n} Y\right)^2$$
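We now copy the formulas for these sums of squares, slightly rearranged as follows:

$$SS_{groups} = \frac{1}{n}\sum^{a}\left(\sum^{n} Y\right)^2 - \frac{1}{an}\left(\sum^{a}\sum^{n} Y\right)^2$$

$$SS_{within} = -\frac{1}{n}\sum^{a}\left(\sum^{n} Y\right)^2 + \sum^{a}\sum^{n} Y^2$$

$$SS_{total} = \sum^{a}\sum^{n} Y^2 - \frac{1}{an}\left(\sum^{a}\sum^{n} Y\right)^2$$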
Adding the expression for $SS_{groups}$ to that for $SS_{within}$, we obtain a quantity that is identical to the one we have just developed as $SS_{total}$. This demonstration explains why the sums of squares are additive.

We shall not go through any derivation, but simply state that the degrees of freedom pertaining to the sums of squares are also additive. The total degrees of freedom are split up into the degrees of freedom corresponding to variation among groups and those of variation of items within groups.

Before we continue, let us review the meaning of the three mean squares in the anova. The total MS is a statistic of dispersion of the 35 ($an$) items around their mean, the grand mean 45.34. It describes the variance in the entire sample due to all the sundry causes and estimates $\sigma^2$ when there are no added treatment effects or variance components among groups. The within-group MS, also known as the individual or intragroup or error mean square, gives the average dispersion of the 5 ($n$) items in each group around the group means. If the $a$ groups are random samples from a common homogeneous population, the within-group MS should estimate $\sigma^2$. The MS among groups is based on the variance of group means, which describes the dispersion of the 7 ($a$) group means around the grand mean. If the groups are random samples from a homogeneous population, the expected variance of their mean will be $\sigma^2/n$. Therefore, in order to have all three variances of the same order of magnitude, we multiply the variance of means by $n$ to obtain the variance among groups. If there are no added treatment effects or variance components, the MS among groups is an estimate of $\sigma^2$. Otherwise, it is an estimate of

$$\sigma^2 + \frac{n}{a - 1}\sum^{a}\alpha^2 \qquad \text{or} \qquad \sigma^2 + n\sigma_A^2$$

depending on whether the anova at hand is Model I or II.

The additivity relations we have just learned are independent of the presence of added treatment or random effects. We could show this algebraically, but it is simpler to inspect Table 7.6, which summarizes the anova of Table 7.3 in which $\alpha_i$ or $A_i$ is added to each sample. The additivity relation still holds, although the values for group SS and the total SS are different from those of Table 7.5.
TABLE 7.6
Anova table for data in Table 7.3.

(1)                                                   (2)    (3)                  (4)
Source of variation                                   df     Sum of squares SS    Mean square MS
$\bar{Y} - \bar{\bar{Y}}$   Among groups              6      503.086              83.848
$Y - \bar{Y}$               Within groups             28     448.800              16.029
$Y - \bar{\bar{Y}}$         Total                     34     951.886              27.997
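The additivity of sums of squares and degrees of freedom in Table 7.6 can be verified directly from the tabled entries. This is a minimal sketch; the dictionary layout and variable names are illustrative choices, not anything prescribed by the text.

```python
# SS and df copied from Table 7.6.
anova = {
    "among groups": {"df": 6, "SS": 503.086},
    "within groups": {"df": 28, "SS": 448.800},
    "total": {"df": 34, "SS": 951.886},
}

# Each mean square is the sum of squares divided by its degrees of freedom.
for row in anova.values():
    row["MS"] = row["SS"] / row["df"]

ss_sum = anova["among groups"]["SS"] + anova["within groups"]["SS"]
df_sum = anova["among groups"]["df"] + anova["within groups"]["df"]
ms_sum = anova["among groups"]["MS"] + anova["within groups"]["MS"]

print("SS among + within:", round(ss_sum, 3))   # equals the total SS
print("df among + within:", df_sum)             # equals the total df
print("MS among + within:", round(ms_sum, 3))   # does not equal the total MS
```

Note that the mean squares themselves are not additive; only the sums of squares and the degrees of freedom are.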
Another way of looking at the partitioning of the variation is to study the deviation from means in a particular case. Referring to Table 7.1, we can look at the wing length of the first individual in the seventh group, which happens to be 41. Its deviation from its group mean is

$$Y_{71} - \bar{Y}_7 = 41 - 45.4 = -4.4$$

The deviation of the group mean from the grand mean is

$$\bar{Y}_7 - \bar{\bar{Y}} = 45.4 - 45.34 = 0.06$$

and the deviation of the individual wing length from the grand mean is

$$Y_{71} - \bar{\bar{Y}} = 41 - 45.34 = -4.34$$
Note that these deviations are additive. The deviation of the item from the group mean and that of the group mean from the grand mean add to the total deviation of the item from the grand mean. These deviations are stated algebraically as $(Y - \bar{Y}) + (\bar{Y} - \bar{\bar{Y}}) = (Y - \bar{\bar{Y}})$. Squaring and summing these deviations for $an$ items will result in

$$\sum^{a}\sum^{n}\left(Y - \bar{Y}\right)^2 + n\sum^{a}\left(\bar{Y} - \bar{\bar{Y}}\right)^2 = \sum^{an}\left(Y - \bar{\bar{Y}}\right)^2$$
Before squaring, the deviations were in the relationship $a + b = c$. After squaring, we would expect them to take the form $a^2 + b^2 + 2ab = c^2$. What happened to the cross-product term corresponding to $2ab$? This is

$$2\sum^{an}\left(Y - \bar{Y}\right)\left(\bar{Y} - \bar{\bar{Y}}\right) = 2\sum^{a}\left[\left(\bar{Y} - \bar{\bar{Y}}\right)\sum^{n}\left(Y - \bar{Y}\right)\right]$$

a covariance-type term that is always zero, since $\sum^{n}(Y - \bar{Y}) = 0$ for each of the $a$ groups (proof in Appendix A1.1).

We identify the deviations represented by each level of variation at the left margins of the tables giving the analysis of variance results (Tables 7.5 and 7.6). Note that the deviations add up correctly: the deviation among groups plus the deviation within groups equals the total deviation of items in the analysis of variance, $(\bar{Y} - \bar{\bar{Y}}) + (Y - \bar{Y}) = (Y - \bar{\bar{Y}})$.
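The partition of the total sum of squares, and the vanishing of the cross-product term, can be illustrated numerically. The sketch below uses a small hypothetical data set (three groups of four items, invented for illustration and not taken from the book's tables).

```python
# Hypothetical data: a = 3 groups of n = 4 items each (not from the book).
groups = [
    [41.0, 44.0, 48.0, 43.0],
    [48.0, 49.0, 49.0, 45.0],
    [40.0, 50.0, 44.0, 48.0],
]


def mean(values):
    return sum(values) / len(values)


all_items = [y for g in groups for y in g]
grand_mean = mean(all_items)
group_means = [mean(g) for g in groups]

# Sum of squared deviations of items from their own group mean.
ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)
# n times the squared deviations of group means from the grand mean.
ss_groups = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# Sum of squared deviations of items from the grand mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_items)
# Cross-product term 2 * sum (Y - group mean)(group mean - grand mean).
cross = 2 * sum(
    (y - m) * (m - grand_mean) for g, m in zip(groups, group_means) for y in g
)

print(ss_within, ss_groups, ss_total, cross)
```

Because the deviations within each group sum to zero, `cross` is zero apart from floating-point rounding, and `ss_within + ss_groups` reproduces `ss_total`.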
7.6 Model I anova

An important point to remember is that the basic setup of data, as well as the actual computation and significance test, in most cases is the same for both models. The purposes of analysis of variance differ for the two models. So do some of the supplementary tests and computations following the initial significance test.

Let us now try to resolve the variation found in an analysis of variance case. This will not only lead us to a more formal interpretation of anova but will also give us a deeper understanding of the nature of variation itself. For
purposes of discussion, we return to the housefly wing lengths of Table 7.3. We ask the question, What makes any given housefly wing length assume the value it does? The third wing length of the first sample of flies is recorded as 43 units. How can we explain such a reading?

If we knew nothing else about this individual housefly, our best guess of its wing length would be the grand mean of the population, which we know to be $\mu = 45.5$. However, we have additional information about this fly. It is a member of group 1, which has undergone a treatment shifting the mean of the group downward by 5 units. Therefore, $\alpha_1 = -5$, and we would expect our individual $Y_{13}$ (the third individual of group 1) to measure $45.5 - 5 = 40.5$ units. In fact, however, it is 43 units, which is 2.5 units above this latest expectation. To what can we ascribe this deviation? It is individual variation of the flies within a group because of the variance of individuals in the population ($\sigma^2 = 15.21$). All the genetic and environmental effects that make one housefly different from another housefly come into play to produce this variance.

By means of carefully designed experiments, we might learn something about the causation of this variance and attribute it to certain specific genetic or environmental factors. We might also be able to eliminate some of the variance. For instance, by using only full sibs (brothers and sisters) in any one culture jar, we would decrease the genetic variation in individuals, and undoubtedly the variance within groups would be smaller. However, it is hopeless to try to eliminate all variance completely. Even if we could remove all genetic variance, there would still be environmental variance. And even in the most improbable case in which we could remove both types of variance, measurement error would remain, so that we would never obtain exactly the same reading even on the same individual fly. The within-groups MS always remains as a residual, greater or smaller from experiment to experiment, part of the nature of things. This is why the within-groups variance is also called the error variance or error mean square. It is not an error in the sense of our making a mistake, but in the sense of a measure of the variation you have to contend with when trying to estimate significant differences among the groups. The error variance is composed of individual deviations for each individual, symbolized by $\epsilon_{ij}$, the random component of the $j$th individual variate in the $i$th group. In our case, $\epsilon_{13} = 2.5$, since the actual observed value is 2.5 units above its expectation of 40.5.

We shall now state this relationship more formally. In a Model I analysis of variance we assume that the differences among group means, if any, are due to the fixed treatment effects determined by the experimenter. The purpose of the analysis of variance is to estimate the true differences among the group means. Any single variate can be decomposed as follows:

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \tag{7.2}$$

where $i = 1, \ldots, a$; $j = 1, \ldots, n$; and $\epsilon_{ij}$ represents an independent, normally distributed variable with mean $\bar{\epsilon}_{ij} = 0$ and variance $\sigma_\epsilon^2 = \sigma^2$. Therefore, a given reading is composed of the grand mean $\mu$ of the population, a fixed deviation
of the mean of group $i$ from the grand mean $\mu$, and a random deviation $\epsilon_{ij}$ of the $j$th individual of group $i$ from its expectation, which is $(\mu + \alpha_i)$. Remember that both $\alpha_i$ and $\epsilon_{ij}$ can be positive as well as negative. The expected value (mean) of the $\epsilon_{ij}$'s is zero, and their variance is the parametric variance of the population, $\sigma^2$. For all the assumptions of the analysis of variance to hold, the distribution of $\epsilon_{ij}$ must be normal.

In a Model I anova we test for differences of the type

30): $df = n_1 + n_2 - 2$

For the present data, since sample sizes are equal, we choose Expression (8.3):

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{1}{n}\left(s_1^2 + s_2^2\right)}}$$

We are testing the null hypothesis that $\mu_1 - \mu_2 = 0$. Therefore we replace this quantity by zero in this example. Then

$$t_s = \frac{7.5143 - 7.5571}{\sqrt{(0.5047 + 0.4095)/7}} = \frac{-0.0428}{\sqrt{0.9142/7}} = \frac{-0.0428}{0.3614} = -0.1184$$

The degrees of freedom for this example are $2(n - 1) = 2 \times 6 = 12$. The critical value of $t_{0.05[12]} = 2.179$. Since the absolute value of our observed $t_s$ is less than the critical $t$ value, the means are found to be not significantly different, which is the same result as was obtained by the anova.

Confidence limits of the difference between two means

$$L_1 = (\bar{Y}_1 - \bar{Y}_2) - t_{\alpha[\nu]}\, s_{\bar{Y}_1 - \bar{Y}_2}$$

$$L_2 = (\bar{Y}_1 - \bar{Y}_2) + t_{\alpha[\nu]}\, s_{\bar{Y}_1 - \bar{Y}_2}$$

In this case $\bar{Y}_1 - \bar{Y}_2 = -0.0428$, $t_{0.05[12]} = 2.179$, and $s_{\bar{Y}_1 - \bar{Y}_2} = 0.3614$, as computed earlier for the denominator of the t test. Therefore

$$L_1 = -0.0428 - (2.179)(0.3614) = -0.8303$$

$$L_2 = -0.0428 + (2.179)(0.3614) = 0.7447$$

The 95% confidence limits contain the zero point (no difference), as was to be expected, since the difference $\bar{Y}_1 - \bar{Y}_2$ was found to be not significant.
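The t test and confidence limits of Box 8.2 can be reproduced from the summary statistics quoted there. This is a sketch under the box's own assumptions (equal sample sizes, Expression (8.3)); the critical value 2.179 is taken from the text rather than computed, and the variable names are illustrative.

```python
import math

# Summary statistics quoted in Box 8.2 (two series, n = 7 each).
mean_1, mean_2 = 7.5143, 7.5571
var_1, var_2 = 0.5047, 0.4095
n = 7
t_crit = 2.179  # tabled t_{0.05[12]}, as quoted in the box

diff = mean_1 - mean_2
# Standard error of the difference for equal sample sizes, Expression (8.3).
se_diff = math.sqrt((var_1 + var_2) / n)
t_s = diff / se_diff   # null hypothesis: mu_1 - mu_2 = 0
df = 2 * (n - 1)

# 95% confidence limits of the difference between the two means.
lower = diff - t_crit * se_diff
upper = diff + t_crit * se_diff

print(f"t_s = {t_s:.4f} with {df} df")
print(f"L1 = {lower:.4f}, L2 = {upper:.4f}")
```

The limits enclose zero exactly when the absolute value of `t_s` is below `t_crit`, which is the situation in this example.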
is very similar for the two series. It would surprise us, therefore, to find that they are significantly different. However, we shall carry out a test anyway. As you realize by now, one cannot tell from the magnitude of a difference whether it is significant. This depends on the magnitude of the error mean square, representing the variance within series.

The computations for the analysis of variance are not shown. They would be the same as in Box 8.1. With equal sample sizes and only two groups, there
is one further computational shortcut. Quantity 6, $SS_{groups}$, can be directly computed by the following simple formula:

$$SS_{groups} = \frac{\left(\sum^{n} Y_1 - \sum^{n} Y_2\right)^2}{2n} = \frac{(52.6 - 52.9)^2}{14} = 0.00643$$
There is only 1 degree of freedom between the two groups. The critical value of $F_{0.05[1,12]}$ is given underneath the anova table, but it is really not necessary to consult it. Inspection of the mean squares in the anova shows that $MS_{groups}$ is much smaller than $MS_{within}$; therefore the value of $F_s$ is far below unity, and there cannot possibly be an added component due to treatment effects between the series. In cases where $MS_{groups} < MS_{within}$, we do not usually bother to calculate $F_s$, because the analysis of variance could not possibly be significant.

There is another method of solving a Model I two-sample analysis of variance. This is a t test of the differences between two means. This t test is the traditional method of solving such a problem; it may already be familiar to you from previous acquaintance with statistical work. It has no real advantage in either ease of computation or understanding, and as you will see, it is mathematically equivalent to the anova in Box 8.2. It is presented here mainly for the sake of completeness. It would seem too much of a break with tradition not to have the t test in a biostatistics text.

In Section 6.4 we learned about the t distribution and saw that a t distribution of $n - 1$ degrees of freedom could be obtained from a distribution of the term $(\bar{Y}_i - \mu)/s_{\bar{Y}_i}$, where $s_{\bar{Y}_i}$ has $n - 1$ degrees of freedom and $\bar{Y}$ is normally distributed. The numerator of this term represents a deviation of a sample mean from a parametric mean, and the denominator represents a standard error for such a deviation. We now learn that the expression

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\left[\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\right]\left(\dfrac{n_1 + n_2}{n_1 n_2}\right)}} \tag{8.2}$$

is also distributed as t. Expression (8.2) looks complicated, but it really has the same structure as the simpler term for t. The numerator is a deviation, this time, not between a single sample mean and the parametric mean, but between a single difference between two sample means, $\bar{Y}_1$ and $\bar{Y}_2$, and the true difference between the means of the populations represented by these means. In a test of this sort our null hypothesis is that the two samples come from the same population; that is, they must have the same parametric mean. Thus, the difference $\mu_1 - \mu_2$ is assumed to be zero. We therefore test the deviation of the difference $\bar{Y}_1 - \bar{Y}_2$ from zero. The denominator of Expression (8.2) is a standard error, the standard error of the difference between two means $s_{\bar{Y}_1 - \bar{Y}_2}$. The left portion of the expression, which is in square brackets, is a weighted average of the variances of the two samples, $s_1^2$ and $s_2^2$, computed
in the manner of Section 7.1. The right term of the standard error is the computationally easier form of $(1/n_1) + (1/n_2)$, which is the factor by which the average variance within groups must be multiplied in order to convert it into a variance of the difference of means. The analogy with the multiplication of a sample variance $s^2$ by $1/n$ to transform it into a variance of a mean $s_{\bar{Y}}^2$ should be obvious.

The test as outlined here assumes equal variances in the two populations sampled. This is also an assumption of the analyses of variance carried out so far, although we have not stressed this. With only two variances, equality may be tested by the procedure in Box 7.1.

When sample sizes are equal in a two-sample test, Expression (8.2) simplifies to the expression

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{1}{n}\left(s_1^2 + s_2^2\right)}} \tag{8.3}$$

which is what is applied in the present example in Box 8.2. When the sample sizes are unequal but rather large, so that the differences between $n_i$ and $n_i - 1$ are relatively trivial, Expression (8.2) reduces to the simpler form

$$t_s = \frac{(\bar{Y}_1 - \bar{Y}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_2} + \dfrac{s_2^2}{n_1}}} \tag{8.4}$$
The simplification of Expression (8.2) to Expressions (8.3) and (8.4) is shown in Appendix A1.3. The pertinent degrees of freedom for Expressions (8.2) and (8.4) are $n_1 + n_2 - 2$, and for Expression (8.3) df is $2(n - 1)$.

The test of significance for differences between means using the t test is shown in Box 8.2. This is a two-tailed test because our alternative hypothesis is $H_1: \mu_1 \neq \mu_2$. The results of this test are identical to those of the anova in the same box: the two means are not significantly different. We can demonstrate this mathematical equivalence by squaring the value for $t_s$. The result should be identical to the $F_s$ value of the corresponding analysis of variance. Since $t_s = -0.1184$ in Box 8.2, $t_s^2 = 0.0140$. Within rounding error, this is equal to the $F_s$ obtained in the anova ($F_s = 0.0141$). Why is this so? We learned that $t_{[\nu]} = (\bar{Y} - \mu)/s_{\bar{Y}}$, where $\nu$ is the degrees of freedom of the variance of the mean $s_{\bar{Y}}^2$; therefore $t_{[\nu]}^2 = (\bar{Y} - \mu)^2/s_{\bar{Y}}^2$. However, this expression can be regarded as a variance ratio. The denominator is clearly a variance with $\nu$ degrees of freedom. The numerator is also a variance. It is a single deviation squared, which represents a sum of squares possessing 1 rather than zero degrees of freedom (since it is a deviation from the true mean $\mu$ rather than a sample mean). A sum of squares based on 1 degree of freedom is at the same time a variance. Thus, $t^2$ is a variance ratio, since $t_{[\nu]}^2 = F_{[1,\nu]}$, as we have seen. In Appendix A1.4 we demonstrate algebraically that the $t_s^2$ and the $F_s$ value obtained in Box 8.2 are identical quantities. Since $t$ approaches the normal distribution as $\nu \to \infty$, $F_{[1,\nu]}$ approaches the distribution of
the square of the normal deviate as $\nu \to \infty$. We also know (from Section 7.2) that $\chi^2_{[\nu_1]}/\nu_1 = F_{[\nu_1,\infty]}$. Therefore, when $\nu_1 = 1$ and $\nu_2 = \infty$, $\chi^2_{[1]} = F_{[1,\infty]} = t^2_{[\infty]}$ (this can be demonstrated from Tables IV, V, and III, respectively):

$$\chi^2_{0.05[1]} = 3.841$$

$$F_{0.05[1,\infty]} = 3.84$$

$$t_{0.05[\infty]} = 1.960 \qquad t^2_{0.05[\infty]} = 3.8416$$
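The equivalence of $t_s^2$ and $F_s$ for two groups of equal size can be checked numerically from the Box 8.2 summary statistics. The sketch below assumes the group sums are $n$ times the quoted means (52.6 and 52.9); working from the sums avoids the rounding in the printed means, so the two routes agree to floating-point precision rather than only "within rounding error".

```python
import math

n = 7
sum_1, sum_2 = 52.6, 52.9        # group sums, taken here as n times the quoted means
var_1, var_2 = 0.5047, 0.4095    # sample variances quoted in Box 8.2

# Anova route: SS_groups by the two-group shortcut; with 1 df it is also MS_groups.
ss_groups = (sum_1 - sum_2) ** 2 / (2 * n)
ms_groups = ss_groups / 1
# With equal sample sizes, MS_within is the simple average of the two variances.
ms_within = (var_1 + var_2) / 2
f_s = ms_groups / ms_within

# t-test route: Expression (8.3) with mu_1 - mu_2 = 0.
t_s = (sum_1 / n - sum_2 / n) / math.sqrt((var_1 + var_2) / n)

print(f"SS_groups = {ss_groups:.5f}")
print(f"F_s = {f_s:.4f}, t_s squared = {t_s ** 2:.4f}")
```

Algebraically, both quantities reduce to $n(\bar{Y}_1 - \bar{Y}_2)^2/(s_1^2 + s_2^2)$, which is why the printed values differ only in the last rounded digit.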
The t test for differences between two means is useful when we wish to set confidence limits to such a difference. Box 8.2 shows how to calculate 95% confidence limits to the difference between the series means in the Daphnia example. The appropriate standard error and degrees of freedom depend on whether Expression (8.2), (8.3), or (8.4) is chosen for $t_s$. It does not surprise us to find that the confidence limits of the difference in this case enclose the value of zero, ranging from $-0.8303$ to $+0.7447$. This must be so when a difference is found to be not significantly different from zero. We can interpret this by saying that we cannot exclude zero as the true value of the difference between the means of the two series.

Another instance when you might prefer to compute the t test for differences between two means rather than use analysis of variance is when you are lacking the original variates and have only published means and standard errors available for the statistical test. Such an example is furnished in Exercise 8.4.
8.5 Comparisons among means: Planned comparisons

We have seen that after the initial significance test, a Model II analysis of variance is completed by estimation of the added variance components. We usually complete a Model I anova of more than two groups by examining the data in greater detail, testing which means are different from which other ones or which groups of means are different from other such groups or from single means. Let us look again at the Model I anovas treated so far in this chapter. We can dispose right away of the two-sample case in Box 8.2, the average age of water fleas at beginning of reproduction. As you will recall, there was no significant difference in age between the two genetic series. But even if there had been such a difference, no further tests are possible. However, the data on length of pea sections given in Box 8.1 show a significant difference among the five treatments (based on 4 degrees of freedom). Although we know that the means are not all equal, we do not know which ones differ from which other ones. This leads us to the subject of tests among pairs and groups of means. Thus, for example, we might test the control against the 4 experimental treatments representing added sugars. The question to be tested would be, Does the addition of sugars have an effect on length of pea sections? We might also test for differences among the sugar treatments. A reasonable test might be pure sugars (glucose, fructose, and sucrose) versus the mixed sugar treatment (1%
An important point about such tests is that they are designed and chosen independently of the results of the experiment. They should be planned before the experiment has been carried out and the results obtained. Such comparisons are called planned or a priori comparisons. Such tests are applied regardless of the results of the preliminary overall anova. By contrast, after the experiment has been carried out, we might wish to compare certain means that we notice to be markedly different. For instance, sucrose, with a mean of 64.1, appears to have had less of a growth-inhibiting effect than fructose, with a mean of 58.2. We might therefore wish to test whether there is in fact a significant difference between the effects of fructose and sucrose. Such comparisons, which suggest themselves as a result of the completed experiment, are called unplanned or a posteriori comparisons. These tests are performed only if the preliminary overall anova is significant. They include tests of the comparisons between all possible pairs of means. When there are $a$ means, there can, of course, be $a(a - 1)/2$ possible comparisons between pairs of means. The reason we make this distinction between a priori and a posteriori comparisons is that the tests of significance appropriate for the two comparisons are different. A simple example will show why this is so.

Let us assume we have sampled from an approximately normal population of heights on men. We have computed their mean and standard deviation. If we sample two men at a time from this population, we can predict the difference between them on the basis of ordinary statistical theory. Some men will be very similar, others relatively very different. Their differences will be distributed normally with a mean of 0 and an expected variance of $2\sigma^2$, for reasons that will be learned in Section 12.2. Thus, if we obtain a large difference between two randomly sampled men, it will have to be a sufficient number of standard deviations greater than zero for us to reject our null hypothesis that the two men come from the specified population. If, on the other hand, we were to look at the heights of the men before sampling them and then take pairs of men who seemed to be very different from each other, it is obvious that we would repeatedly obtain differences within pairs of men that were several standard deviations apart. Such differences would be outliers in the expected frequency distribution of differences, and time and again we would reject our null hypothesis when in fact it was true. The men would be sampled from the same population, but because they were not being sampled at random but being inspected before being sampled, the probability distribution on which our hypothesis testing rested would no longer be valid. It is obvious that the tails in a large sample from a normal distribution will be anywhere from 5 to 7 standard deviations apart. If we deliberately take individuals from each tail and compare them, they will appear to be highly significantly different from each other, according to the methods described in the present section, even though they belong to the same population.
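The argument about heights can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: heights are standardized to mean 0 and $\sigma = 1$, the parametric $\sigma$ is treated as known so that a normal deviate of 1.96 can serve as the 5% critical value, and the sample size of 1000 and the seed are arbitrary.

```python
import math
import random

random.seed(1)   # fixed seed so the sketch is repeatable
SIGMA = 1.0      # heights expressed in standard deviation units
Z_CRIT = 1.96    # two-tailed 5% normal deviate

# The difference between two independent heights has variance 2 * sigma^2.
sd_diff = math.sqrt(2.0) * SIGMA
critical_diff = Z_CRIT * sd_diff

# (a) Pairs drawn at random: the null hypothesis is rejected about 5% of the time.
trials = 20000
rejections = 0
for _ in range(trials):
    first = random.gauss(0.0, SIGMA)
    second = random.gauss(0.0, SIGMA)
    if abs(first - second) > critical_diff:
        rejections += 1
random_pair_rate = rejections / trials

# (b) Pair chosen after inspection: tallest versus shortest man in a sample of 1000.
sample = [random.gauss(0.0, SIGMA) for _ in range(1000)]
extreme_diff = max(sample) - min(sample)

print(f"critical difference: {critical_diff:.2f} sd units")
print(f"rejection rate for random pairs: {random_pair_rate:.3f}")
print(f"tallest minus shortest: {extreme_diff:.2f} sd units")
```

The randomly drawn pairs respect the nominal 5% error rate, whereas the hand-picked extreme pair exceeds the critical difference even though every value comes from the same population.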
When we compare means differing greatly from each other as the result of some treatment in the analysis of variance, we are doing exactly the same thing as taking the tallest and the shortest men from the frequency distribution of
heights. If we wish to know whether these are significantly different from each other, we cannot use the ordinary probability distribution on which the analysis of variance rests, but we have to use special tests of significance. These unplanned tests will be discussed in the next section. The present section concerns itself with the carrying out of those comparisons planned before the execution of the experiment.

The general rule for making a planned comparison is extremely simple; it is related to the rule for obtaining the sum of squares for any set of groups (discussed at the end of Section 8.1). To compare $k$ groups of any size $n_i$, take the sum of each group, square it, divide the result by the sample size $n_i$, and sum the $k$ quotients so obtained. From the sum of these quotients, subtract a correction term, which you determine by taking the grand sum of all the groups in this comparison, squaring it, and dividing the result by the number of items in the grand sum. If the comparison includes all the groups in the anova, the correction term will be the main CT of the study. If, however, the comparison includes only some of the groups of the anova, the CT will be different, being restricted only to these groups.

These rules can best be learned by means of an example. Table 8.2 lists the means, group sums, and sample sizes of the experiment with the pea sections from Box 8.1. You will recall that there were highly significant differences among the groups. We now wish to test whether the mean of the control differs from that of the four treatments representing addition of sugar. There will thus be two groups, one the control group and the other the "sugars" groups, the latter with a sum of 2396 and a sample size of 40. We therefore compute

$$
\begin{aligned}
SS\ (\text{control versus sugars}) &= \frac{(701)^2}{10} + \frac{(593 + 582 + 580 + 641)^2}{40} - \frac{(701 + 593 + 582 + 580 + 641)^2}{50} \\
&= \frac{(701)^2}{10} + \frac{(2396)^2}{40} - \frac{(3097)^2}{50} = 832.32
\end{aligned}
$$
In this case the correction term is the same as for the anova, because it involves all the groups of the study. The result is a sum of squares for the comparison
TABLE 8.2
Means, group sums, and sample sizes from the data in Box 8.1. Length of pea sections grown in tissue culture (in ocular units).

                  Control   Glucose   Fructose   1% glucose + 1% fructose   Sucrose   $\sum$
$\bar{Y}$         70.1      59.3      58.2       58.0                       64.1      (61.94 = $\bar{\bar{Y}}$)
$\sum^{n} Y$      701       593       582        580                        641       3097
$n$               10        10        10         10                         10        50
between these two groups. Since a comparison between two groups has only 1 degree of freedom, the sum of squares is at the same time a mean square. This mean square is tested over the error mean square of the anova to give the following comparison:

$$F_s = \frac{MS\ (\text{control versus sugars})}{MS_{within}} = \frac{832.32}{5.46} = 152.44$$

$$F_{0.05[1,45]} = 4.05, \qquad F_{0.01[1,45]} = 7.23$$
This comparison is highly significant, showing that the additions of sugars have significantly retarded the growth of the pea sections.

Next we test whether the mixture of sugars is significantly different from the pure sugars. Using the same technique, we calculate

$$
\begin{aligned}
SS\ (\text{mixed sugars versus pure sugars}) &= \frac{(580)^2}{10} + \frac{(593 + 582 + 641)^2}{30} - \frac{(593 + 582 + 580 + 641)^2}{40} \\
&= \frac{(580)^2}{10} + \frac{(1816)^2}{30} - \frac{(2396)^2}{40} = 48.13
\end{aligned}
$$
Here the CT is different, since it is based on the sum of the sugars only. The appropriate test statistic is

$$F_s = \frac{MS\ (\text{mixed sugars versus pure sugars})}{MS_{within}} = \frac{48.13}{5.46} = 8.82$$

This is significant in view of the critical values of $F_{\alpha[1,45]}$ given in the preceding paragraph.

A final test is among the three sugars. This mean square has 2 degrees of freedom, since it is based on three means. Thus we compute

$$\frac{MS_{groups}}{MS_{within}} \geq F_{\alpha[a-1,\,a(n-1)]} \tag{8.5}$$

Since $MS_{groups}/MS_{within} = SS_{groups}/[(a - 1)\,MS_{within}]$, we can rewrite Expression (8.5) as

$$SS_{groups} \geq (a - 1)\,MS_{within}\,F_{\alpha[a-1,\,a(n-1)]} \tag{8.6}$$

For example, in Box 8.1, where the anova is significant, $SS_{groups} = 1077.32$. Substituting into Expression (8.6), we obtain

$$1077.32 > (5 - 1)(5.46)(2.58) = 56.35 \qquad \text{for } \alpha = 0.05$$
It is therefore possible to compute a critical SS value for a test of significance of an anova. Thus, another way of calculating overall significance would be to see whether the $SS_{groups}$ is greater than this critical SS. It is of interest to investigate why the $SS_{groups}$ is as large as it is and to test for the significance of the various contributions made to this SS by differences among the sample means. This was discussed in the previous section, where separate sums of squares were computed based on comparisons among means planned before the data were examined. A comparison was called significant if its $F_s$ ratio was $> F_{\alpha[k-1,\,a(n-1)]}$, where $k$ is the number of means being compared. We can now also state this in terms of sums of squares: An SS is significant if it is greater than $(k - 1)\,MS_{within}\,F_{\alpha[k-1,\,a(n-1)]}$.

The above tests were a priori comparisons. One procedure for testing a posteriori comparisons would be to set $k = a$ in this last formula, no matter
180
c h a p t e r 8 / single-classification analysis of
variance
how m a n y m e a n s we c o m p a r e ; thus the critical value of the SS will be larger t h a n in the previous m e t h o d , m a k i n g it m o r e difficult to d e m o n s t r a t e the significance of a s a m p l e SS. Setting k = a allows for the fact t h a t we c h o o s e for testing those differences between g r o u p m e a n s t h a t a p p e a r to be c o n t r i b u t i n g substantially to the significance of the overall a n o v a . F o r an example, let us r e t u r n to the effects of sugars on g r o w t h in pea sections (Box 8.1). We write d o w n the m e a n s in ascending o r d e r of m a g n i t u d e : 58.0 (glucose + fructose), 58.2 (fructose), 59.3 (glucose), 64.1 (sucrose), 70.1 (control). W e notice t h a t the first three t r e a t m e n t s have quite similar m e a n s a n d suspect t h a t they d o n o t differ significantly a m o n g themselves a n d hence d o n o t c o n t r i b u t e substantially to the significance of the SSgroups. T o test this, wc c o m p u t e the SS a m o n g these three m e a n s by the usual formula: 2 2 2 2 _ (593) + (582) __ + (580) _ (593 + 582 _ _+ 580)
-
102,677.3 - 102,667.5 = 9.8
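The "usual formula" for a sum of squares among a set of group totals is easy to express in code. The sketch below is a minimal illustration, assuming each group total is based on 10 observations (the denominators used in the worked example); the helper name `ss_among` is hypothetical.

```python
def ss_among(totals, sizes):
    """SS among groups: sum(T_i^2 / n_i) - (sum T_i)^2 / sum(n_i)."""
    grand_total = sum(totals)
    total_n = sum(sizes)
    return sum(t * t / n for t, n in zip(totals, sizes)) - grand_total ** 2 / total_n

# Totals for glucose, fructose, and glucose + fructose, 10 observations each
ss_three = ss_among([593, 582, 580], [10, 10, 10])
critical = 56.35  # critical SS from Expression (8.6)

print(round(ss_three, 1))   # SS among the three similar means
print(ss_three > critical)  # False: these means do not differ significantly
```

Because the function accepts unequal group sizes, the same call pattern also covers a comparison of one treatment total against the pooled total of several others.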
The differences among these means are not significant, because this SS is less than the critical SS (56.35) calculated above.

The sucrose mean looks suspiciously different from the means of the other sugars. To test this we compute

$$SS = \frac{(641)^2}{10} + \frac{(593 + 582 + 580)^2}{30} - \frac{(641 + 593 + 582 + 580)^2}{10 + 30} = 41{,}088.1 + 102{,}667.5 - 143{,}520.4 = 235.2$$
which is greater than the critical SS. We conclude, therefore, that sucrose retards growth significantly less than the other sugars tested. We may continue in this fashion, testing all the differences that look suspicious or even testing all possible sets of means, considering them 2, 3, 4, and 5 at a time. This latter approach may require a computer if there are more than 5 means to be compared, since there are very many possible tests that could be made. This procedure was proposed by Gabriel (1964), who called it a sum of squares simultaneous test procedure (SS-STP).

In the SS-STP and in the original anova, the chance of making any type I error at all is $\alpha$, the probability selected for the critical $F$ value from Table V. By "making any type I error at all" we mean making such an error in the overall test of significance of the anova and in any of the subsidiary comparisons among means or sets of means needed to complete the analysis of the experiment. This probability $\alpha$ therefore is an experimentwise error rate. Note that though the probability of any error at all is $\alpha$, the probability of error for any particular test of some subset, such as a test of the difference among three or between two means, will always be less than $\alpha$. Thus, for the test of each subset one is really using a significance level $\alpha'$, which may be much less than the experimentwise $\alpha$, and if there are many means in the anova, this actual error rate $\alpha'$ may be one-tenth, one one-hundredth, or even one one-thousandth of the experimentwise $\alpha$ (Gabriel, 1964). For this reason, the unplanned tests discussed above and the overall anova are not very sensitive to differences between individual means or differences within small subsets. Obviously, not many differences are going to be considered significant if $\alpha'$ is minute. This is the price we pay for not planning our comparisons before we examine the data: if we were to make planned tests, the error rate of each would be greater, hence less conservative.

The SS-STP procedure is only one of numerous techniques for multiple unplanned comparisons. It is the most conservative, since it allows a large number of possible comparisons. Differences shown to be significant by this method can be reliably reported as significant differences. However, more sensitive and powerful comparisons exist when the number of possible comparisons is circumscribed by the user. This is a complex subject, to which a more complete introduction is given in Sokal and Rohlf (1981), Section 9.7.

Exercises

8.1
The following is an example with easy numbers to help you become familiar with the analysis of variance. A plant ecologist wishes to test the hypothesis that the height of plant species X depends on the type of soil it grows in. He has measured the height of three plants in each of four plots representing different soil types, all four plots being contained in an area of two miles square. His results are tabulated below. (Height is given in centimeters.) Does your analysis support this hypothesis? ANS. Yes, since $F_s = 6.951$ is larger than ...
BOX 9.1 Continued

Preliminary computations

1. Grand total $= \sum^{a}\sum^{b}\sum^{n} Y = 461.74$

2. Sum of the squared observations $= \sum^{a}\sum^{b}\sum^{n} Y^2 = \cdots + (12.30)^2 = 5065.1530$

3. Sum of the squared subgroup (cell) totals, divided by the sample size of the subgroups

$$= \frac{\sum^{a}\sum^{b}\left(\sum^{n} Y\right)^2}{n} = \frac{(84.49)^2 + \cdots + (98.61)^2}{8} = 4663.6317$$

4. Sum of the squared column totals divided by the sample size of a column

$$= \frac{\sum^{a}\left(\sum^{b}\sum^{n} Y\right)^2}{bn} = \frac{(245.00)^2 + (216.74)^2}{(3 \times 8)} = 4458.3844$$

5. Sum of the squared row totals divided by the sample size of a row

$$= \frac{\sum^{b}\left(\sum^{a}\sum^{n} Y\right)^2}{an} = \frac{(143.92)^2 + (121.82)^2 + (196.00)^2}{(2 \times 8)} = 4623.0674$$

6. Grand total squared and divided by the total sample size = correction term

$$CT = \frac{\left(\sum^{a}\sum^{b}\sum^{n} Y\right)^2}{abn} = \frac{(\text{quantity } 1)^2}{abn} = \frac{(461.74)^2}{(2 \times 3 \times 8)} = 4441.7464$$

7. $SS_{\text{total}} = \sum^{a}\sum^{b}\sum^{n} Y^2 - CT$ = quantity 2 - quantity 6 = 5065.1530 - 4441.7464 = 623.4066

8. $SS_{\text{subgr}} = \dfrac{\sum^{a}\sum^{b}\left(\sum^{n} Y\right)^2}{n} - CT$ = quantity 3 - quantity 6 = 4663.6317 - 4441.7464 = 221.8853

9. $SS_A$ (SS of columns) $= \dfrac{\sum^{a}\left(\sum^{b}\sum^{n} Y\right)^2}{bn} - CT$ = quantity 4 - quantity 6 = 4458.3844 - 4441.7464 = 16.6380

10. $SS_B$ (SS of rows) $= \dfrac{\sum^{b}\left(\sum^{a}\sum^{n} Y\right)^2}{an} - CT$ = quantity 5 - quantity 6 = 4623.0674 - 4441.7464 = 181.3210

11. $SS_{A \times B}$ (interaction SS) $= SS_{\text{subgr}} - SS_A - SS_B$ = quantity 8 - quantity 9 - quantity 10 = 221.8853 - 16.6380 - 181.3210 = 23.9263

12. $SS_{\text{within}}$ (within subgroups; error SS) $= SS_{\text{total}} - SS_{\text{subgr}}$ = quantity 7 - quantity 8 = 623.4066 - 221.8853 = 401.5213

As a check on your computations, ascertain that the following relations hold for some of the above quantities: $2 \ge 3 \ge 4 \ge 6$; $3 \ge 5 \ge 6$.

Explicit formulas for these sums of squares suitable for computer programs are as follows:

9a. $SS_A = nb\sum^{a}(\bar{Y}_A - \bar{\bar{Y}})^2$

10a. $SS_B = na\sum^{b}(\bar{Y}_B - \bar{\bar{Y}})^2$

11a. $SS_{AB} = n\sum^{a}\sum^{b}(\bar{Y} - \bar{Y}_A - \bar{Y}_B + \bar{\bar{Y}})^2$

12a. $SS_{\text{within}} = \sum^{a}\sum^{b}\sum^{n}(Y - \bar{Y})^2$
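The chain of subtractions in steps 6 through 12 can be followed in code. The sketch below is illustrative only: it starts from the intermediate totals quoted in the box (the raw observations are not reproduced here), and all variable names are our own.

```python
# Design dimensions used in the box: a columns, b rows, n observations per cell
a, b, n = 2, 3, 8

grand_total = 461.74                   # quantity 1
sum_sq_obs = 5065.1530                 # quantity 2
q3 = 4663.6317                         # quantity 3 (squared cell totals / n)
col_totals = [245.00, 216.74]          # column totals; they add up to quantity 1
row_totals = [143.92, 121.82, 196.00]  # row totals; they add up to quantity 1

q4 = sum(t ** 2 for t in col_totals) / (b * n)  # quantity 4
q5 = sum(t ** 2 for t in row_totals) / (a * n)  # quantity 5
ct = grand_total ** 2 / (a * b * n)             # quantity 6, correction term

ss_total = sum_sq_obs - ct             # step 7
ss_subgr = q3 - ct                     # step 8
ss_a = q4 - ct                         # step 9
ss_b = q5 - ct                         # step 10
ss_ab = ss_subgr - ss_a - ss_b         # step 11
ss_within = ss_total - ss_subgr        # step 12

for name, value in [("total", ss_total), ("subgr", ss_subgr), ("A", ss_a),
                    ("B", ss_b), ("AxB", ss_ab), ("within", ss_within)]:
    print(f"SS_{name}: {value:.4f}")
```

The printed values may differ from the box in the last decimal place, since the box subtracts quantities that were already rounded to four decimals.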
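BOX 9.1 Continued

Now fill in the anova table.

[Anova table; only fragments survive in this copy. Column headings: Source of variation, df, SS, MS, Expected MS. First row: A (columns), $\bar{Y}_A - \bar{\bar{Y}}$, df $= a - 1$. A note after the table gives the corresponding expected mean squares for other models, beginning with the mixed model ($A$ fixed, $B$ random).]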
9 . 3 / TWO-WAY ANOVA WITHOUT REPLICATION
Row SS = 1006.9909
Total SS = 1228.5988

Individual    Age 1 (Y_i1)    Age 2 (Y_i2)    Row sum      D = Y_i2 - Y_i1 (difference)
 1            7.33            7.53            14.86        0.20
 2            7.49            7.70            15.19        0.21
 3            7.27            7.46            14.73        0.19
 4            7.93            8.21            16.14        0.28
 5            7.56            7.81            15.37        0.25
 6            7.81            8.01            15.82        0.20
 7            7.46            7.72            15.18        0.26
 8            6.94            7.13            14.07        0.19
 9            7.49            7.68            15.17        0.19
10            7.44            7.66            15.10        0.22
11            7.95            8.11            16.06        0.16
12            7.47            7.66            15.13        0.19
13            7.04            7.20            14.24        0.16
14            7.10            7.25            14.35        0.15
15            7.64            7.79            15.43        0.15
Sum           111.92          114.92          226.84       3.00
Sum of sq.    836.3300        881.8304        3435.6992    0.6216

Source: From a larger study by Newman and Meredith (1956).
Two-way anova without replication

Anova table

Source of variation             df    SS        MS              F_s
Ages (columns; factor A)         1    0.3000    0.3000          388.89**
Individuals (rows; factor B)    14    2.6367    0.188,34        (244.14)**
Remainder                       14    0.0108    0.000,771,43
Total                           29    2.9475

$F_{0.01[1,14]} = 8.86$

[The Expected MS column of this table and the critical value $F_{0.01[12,12]}$ are not legible in this copy.]
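The anova table above can be reproduced from the two columns of measurements. The following is a minimal sketch, not the book's own program; the list names `age1` and `age2` and the other variable names are assumptions made for illustration, and the data are the 15 paired values tabulated above.

```python
# Paired measurements on the same 15 individuals at two ages
age1 = [7.33, 7.49, 7.27, 7.93, 7.56, 7.81, 7.46, 6.94,
        7.49, 7.44, 7.95, 7.47, 7.04, 7.10, 7.64]
age2 = [7.53, 7.70, 7.46, 8.21, 7.81, 8.01, 7.72, 7.13,
        7.68, 7.66, 8.11, 7.66, 7.20, 7.25, 7.79]

a = 2          # columns (ages)
b = len(age1)  # rows (individuals)

grand_total = sum(age1) + sum(age2)
ct = grand_total ** 2 / (a * b)  # correction term

ss_total = sum(y * y for y in age1 + age2) - ct
ss_ages = (sum(age1) ** 2 + sum(age2) ** 2) / b - ct
ss_indiv = sum((y1 + y2) ** 2 for y1, y2 in zip(age1, age2)) / a - ct
ss_rem = ss_total - ss_ages - ss_indiv  # remainder (discrepance)

ms_ages = ss_ages / (a - 1)
ms_indiv = ss_indiv / (b - 1)
ms_rem = ss_rem / ((a - 1) * (b - 1))

f_ages = ms_ages / ms_rem
print(f"SS ages = {ss_ages:.4f}, SS individuals = {ss_indiv:.4f}, "
      f"SS remainder = {ss_rem:.4f}")
print(f"F_s for ages = {f_ages:.2f}")
```

With only two columns, this analysis is equivalent to a paired comparison, which is why the remainder mean square serves as the error term for the test of ages.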
$s^2_{\max}/s^2_{\min} = 9.0$ and compare it with $F_{\max\,\alpha[a,\nu]}$, critical values of which are found in Table VI. For $a = 6$ and $\nu = n - 1 = 9$, $F_{\max}$ is 7.80 and 12.1 at the 5% and 1% levels, respectively. We conclude that the variances of the six samples are significantly heterogeneous.

What may cause such heterogeneity? In this case, we suspect that some of the populations are inherently more variable than others. Some races or species are relatively uniform for one character, while others are quite variable for the same character. In an anova representing the results of an experiment, it may well be that one sample has been obtained under less standardized conditions than the others and hence has a greater variance. There are also many cases in which the heterogeneity of variances is a function of an improper choice of measurement scale. With some measurement scales, variances vary as functions of means. Thus, differences among means bring about heterogeneous variances. For example, in variables following the Poisson distribution the variance is in fact equal to the mean, and populations with greater means will therefore have greater variances. Such departures from the assumption of homoscedasticity can often be easily corrected by a suitable transformation, as discussed later in this chapter.

A rapid first inspection for heteroscedasticity is to check for correlation between the means and variances or between the means and the ranges of the samples. If the variances increase with the means (as in a Poisson distribution), the ratios $s^2/\bar{Y}$ or $s/\bar{Y} = V$ will be approximately constant for the samples. If means and variances are independent, these ratios will vary widely.

The consequences of moderate heterogeneity of variances are not too serious for the overall test of significance, but single degree of freedom comparisons may be far from accurate.

If transformation cannot cope with heteroscedasticity, nonparametric methods (Section 10.3) may have to be resorted to.

Normality.
We have assumed that the error terms $\epsilon_{ij}$ of the variates in each sample will be independent, that the variances of the error terms of the several samples will be equal, and, finally, that the error terms will be normally distributed. If there is serious question about the normality of the data, a graphic test, as illustrated in Section 5.5, might be applied to each sample separately.

The consequences of nonnormality of error are not too serious. Only very skewed distributions would have a marked effect on the significance level of the $F$ test or on the efficiency of the design. The best way to correct for lack of normality is to carry out a transformation that will make the data normally distributed, as explained in the next section. If no simple transformation is satisfactory, a nonparametric test, as carried out in Section 10.3, should be substituted for the analysis of variance.

Additivity. In two-way anova without replication it is necessary to assume that interaction is not present if one is to make tests of the main effects in a Model I anova. This assumption of no interaction in a two-way anova is sometimes also referred to as the assumption of additivity of the main effects. By this we mean that any single observed variate can be decomposed into additive components representing the treatment effects of a particular row and column as well as a random term special to it. If interaction is actually present, then the $F$ test will be very inefficient, and possibly misleading if the effect of the interaction is very large. A check of this assumption requires either more than a single observation per cell (so that an error mean square can be computed) or an independent estimate of the error mean square from previous comparable experiments.

Interaction can be due to a variety of causes. Most frequently it means that a given treatment combination, such as level 2 of factor A when combined with level 3 of factor B, makes a variate deviate from the expected value. Such a deviation is regarded as an inherent property of the natural system under study, as in examples of synergism or interference. Similar effects occur when a given replicate is quite aberrant, as may happen if an exceptional plot is included in an agricultural experiment, if a diseased individual is included in a physiological experiment, or if by mistake an individual from a different species is included in a biometric study. Finally, an interaction term will result if the effects of the two factors A and B on the response variable Y are multiplicative rather than additive. An example will make this clear.

In Table 10.1 we show the additive and multiplicative treatment effects in a hypothetical two-way anova. Let us assume that the expected population mean $\mu$ is zero. Then the mean of the sample subjected to treatment 1 of factor A and treatment 1 of factor B should be 2, by the conventional additive model. This is so because each factor at level 1 contributes unity to the mean.
Similarly, the expected subgroup mean subjected to level 3 for factor A and level 2 for factor B is 8, the respective contributions to the mean being 3 and 5. However, if the process is multiplicative rather than additive, as occurs in a variety of physicochemical and biological phenomena, the expected values will be quite different. For treatment $A_1B_1$, the expected value equals 1, which is the product of 1 and 1. For treatment $A_3B_2$, the expected value is 15, the product of 3 and 5. If we were to analyze multiplicative data of this sort by a conventional anova, we would find that the interaction sum of squares would be greatly augmented because of the nonadditivity of the treatment effects. In this case, there is a simple remedy. By transforming the variable into logarithms (Table 10.1), we are able to restore the additivity of the data. The third item in each cell gives the logarithm of the expected value, assuming multiplicative relations. Notice that the increments are strictly additive again ($SS_{A \times B} = 0$). As a matter of fact, on a logarithmic scale we could simply write $\alpha_1 = 0$, $\alpha_2 = 0.30$, $\alpha_3 = 0.48$, $\beta_1 = 0$, $\beta_2 = 0.70$. Here is a good illustration of how transformation of scale, discussed in detail in Section 10.2, helps us meet the assumptions of analysis of variance.

TABLE 10.1
Illustration of additive and multiplicative effects.

                          Factor A
Factor B      α1 = 1    α2 = 2    α3 = 3
β1 = 1        2         3         4         Additive effects
              1         2         3         Multiplicative effects
              0         0.30      0.48      Log of multiplicative effects
β2 = 5        6         7         8         Additive effects
              5         10        15        Multiplicative effects
              0.70      1.00      1.18      Log of multiplicative effects
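The claim that logarithms restore additivity in Table 10.1 can be verified numerically. The sketch below is an illustration under simple assumptions: it treats the expected cell values as the data, and the helper `interaction_ss` (our own name) sums the squared interaction deviations, cell value minus row mean minus column mean plus grand mean.

```python
import math

def interaction_ss(table):
    """Sum of squared interaction deviations for a two-way table of cell values."""
    rows, cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (rows * cols)
    row_means = [sum(row) / cols for row in table]
    col_means = [sum(table[i][j] for i in range(rows)) / rows for j in range(cols)]
    return sum((table[i][j] - row_means[i] - col_means[j] + grand) ** 2
               for i in range(rows) for j in range(cols))

alpha = [1, 2, 3]  # factor A effects from Table 10.1
beta = [1, 5]      # factor B effects from Table 10.1

additive = [[a + b for a in alpha] for b in beta]
multiplicative = [[a * b for a in alpha] for b in beta]
logged = [[math.log10(v) for v in row] for row in multiplicative]

print(interaction_ss(additive))        # zero: no interaction on the additive model
print(interaction_ss(multiplicative))  # clearly nonzero on the linear scale
print(interaction_ss(logged))          # essentially zero after the log transformation
```

The multiplicative table shows a substantial interaction term on the linear scale, while the logged table behaves like the additive one, up to floating-point rounding.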
10.2 Transformations

If the evidence indicates that the assumptions for an analysis of variance or for a t test cannot be maintained, two courses of action are open to us. We may carry out a different test not requiring the rejected assumptions, such as one of the distribution-free tests in lieu of anova, discussed in the next section. A second approach would be to transform the variable to be analyzed in such a manner that the resulting transformed variates meet the assumptions of the analysis.

Let us look at a simple example of what transformation will do. A single variate of the simplest kind of anova (completely randomized, single-classification, Model I) decomposes as follows: $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$. In this model the components are additive, with the error term $\epsilon_{ij}$ normally distributed. However, we might encounter a situation in which the components were multiplicative in effect, so that $Y_{ij} = \mu\alpha_i\epsilon_{ij}$, which is the product of the three terms. In such a case the assumptions of normality and of homoscedasticity would break down. In any one anova, the parametric mean $\mu$ is constant but the treatment effect $\alpha_i$ differs from group to group. Clearly, the scatter among the variates $Y_{ij}$ would double in a group in which $\alpha_i$ is twice as great as in another. Assume that $\mu = 1$, the smallest $\epsilon_{ij} = 1$, and the greatest, 3; then if $\alpha_i = 1$, the range of the $Y$'s will be 3 - 1 = 2. However, when $\alpha_i = 4$, the corresponding range will be four times as wide, from 4 × 1 = 4 to 4 × 3 = 12, a range of 8. Such data will be heteroscedastic. We can correct this situation simply by transforming our model into logarithms. We would therefore obtain $\log Y_{ij} = \log\mu + \log\alpha_i + \log\epsilon_{ij}$, which is additive and homoscedastic. The entire analysis of variance would then be carried out on the transformed variates.

At this point many of you will feel more or less uncomfortable about what we have done. Transformation seems too much like "data grinding." When you learn that often a statistical test may be made significant after transformation of a set of data, though it would not be so without such a transformation, you may feel even more suspicious. What is the justification for transforming the data? It takes some getting used to the idea, but there is really no scientific necessity to employ the common linear or arithmetic scale to which we are accustomed. You are probably aware that teaching of the "new math" in elementary schools has done much to dispel the naive notion that the decimal system of numbers is the only "natural" one. In a similar way, with some experience in science and in the handling of statistical data, you will appreciate the fact that the linear scale, so familiar to all of us from our earliest experience, occupies a similar position with relation to other scales of measurement as does the decimal system of numbers with respect to the binary and octal numbering systems and others. If a system is multiplicative on a linear scale, it may be much more convenient to think of it as an additive system on a logarithmic scale.

Another frequent transformation is the square root of a variable. The square root of the surface area of an organism is often a more appropriate measure of the fundamental biological variable subjected to physiological and evolutionary forces than is the area. This is reflected in the normal distribution of the square root of the variable as compared to the skewed distribution of areas. In many cases experience has taught us to express experimental variables not in linear scale but as logarithms, square roots, reciprocals, or angles. Thus, pH values are logarithms and dilution series in microbiological titrations are expressed as reciprocals. As soon as you are ready to accept the idea that the scale of measurement is arbitrary, you simply have to look at the distributions of transformed variates to decide which transformation most closely satisfies the assumptions of the analysis of variance before carrying out an anova.

A fortunate fact about transformations is that very often several departures from the assumptions of anova are simultaneously cured by the same transformation to a new scale. Thus, simply by making the data homoscedastic, we also make them approach normality and ensure additivity of the treatment effects.
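The numerical illustration of the multiplicative model given earlier in this section ($\mu = 1$, error factor between 1 and 3, treatment effect 1 or 4) can be replayed in a few lines. This is a sketch for illustration only; the function names are our own, and the error factor is represented simply by its smallest and greatest values.

```python
import math

MU = 1.0                     # parametric mean
EPS_MIN, EPS_MAX = 1.0, 3.0  # smallest and greatest error factors

def linear_range(alpha):
    """Range of Y = mu * alpha * eps on the original (linear) scale."""
    return MU * alpha * EPS_MAX - MU * alpha * EPS_MIN

def log_range(alpha):
    """Range of log Y = log mu + log alpha + log eps on the logarithmic scale."""
    return math.log10(MU * alpha * EPS_MAX) - math.log10(MU * alpha * EPS_MIN)

for alpha in (1, 4):
    print(alpha, linear_range(alpha), round(log_range(alpha), 4))
```

On the linear scale the range grows in proportion to the treatment effect, whereas on the logarithmic scale it is the same for both groups, which is what homoscedasticity requires.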
When a transformation is applied, tests of significance are performed on the transformed data, but estimates of means are usually given in the familiar untransformed scale. Since the transformations discussed in this chapter are nonlinear, confidence limits computed in the transformed scale and changed back to the original scale would be asymmetrical. Stating the standard error in the original scale would therefore be misleading. In reporting results of research with variables that require transformation, furnish means in the back-transformed scale followed by their (asymmetrical) confidence limits rather than by their standard errors.

An easy way to find out whether a given transformation will yield a distribution satisfying the assumptions of anova is to plot the cumulative distributions of the several samples on probability paper. By changing the scale of the second coordinate axis from linear to logarithmic, square root, or any other one, we can see whether a previously curved line, indicating skewness, straightens out to indicate normality (you may wish to refresh your memory on these graphic techniques studied in Section 5.5). We can look up upper class limits on transformed scales or employ a variety of available probability graph papers whose second axis is in logarithmic, angular, or other scale. Thus, we not only test whether the data become more normal through transformation, but we can also get an estimate of the standard deviation under transformation as measured by the slope of the fitted line. The assumption of homoscedasticity implies that the slopes for the several samples should be the same. If the slopes are very heterogeneous, homoscedasticity has not been achieved. Alternatively, we can examine goodness of fit tests for normality (see Chapter 13) for the samples under various transformations. That transformation yielding the best fit over all samples will be chosen for the anova. It is important that the transformation not be selected on the basis of giving the best anova results, since such a procedure would distort the significance level.

The logarithmic transformation. The most common transformation applied is conversion of all variates into logarithms, usually common logarithms. Whenever the mean is positively correlated with the variance (greater means are accompanied by greater variances), the logarithmic transformation is quite likely to remedy the situation and make the variance independent of the mean. Frequency distributions skewed to the right are often made more symmetrical by transformation to a logarithmic scale. We saw in the previous section and in Table 10.1 that logarithmic transformation is also called for when effects are multiplicative.

The square root transformation. We shall use a square root transformation as a detailed illustration of transformation of scale. When the data are counts, as of insects on a leaf or blood cells in a hemacytometer, we frequently find the square root transformation of value. You will remember that such distributions are likely to be Poisson-distributed rather than normally distributed and that in a Poisson distribution the variance is the same as the mean. Therefore, the mean and variance cannot be independent but will vary identically.
Transforming the variates to square roots will generally make the variances independent of the means. When the counts include zero values, it has been found desirable to code all variates by adding 0.5. The transformation then is $\sqrt{Y + \tfrac{1}{2}}$.

Table 10.2 shows an application of the square root transformation. The sample with the greater mean has a significantly greater variance prior to transformation. After transformation the variances are not significantly different. For reporting means the transformed means are squared again and confidence limits are reported in lieu of standard errors.

The arcsine transformation. This transformation (also known as the angular transformation) is especially appropriate to percentages and proportions. You may remember from Section 4.2 that the standard deviation of a binomial distribution is $\sigma = \sqrt{pq/k}$. Since $\mu = p$, the variance is a function of the mean. The transformation finds $\theta = \arcsin\sqrt{p}$, where $p$ is a proportion. The term "arcsin" is synonymous with inverse sine or $\sin^{-1}$, which stands for "the angle whose sine is" the given quantity. Thus, if we compute or look up $\arcsin\sqrt{0.431} = \arcsin 0.6565$, we find 41.03°, the angle whose sine is 0.6565. The arcsine transformation stretches out both tails of a distribution of percentages or proportions and compresses the middle. When the percentages in the original data fall between 30% and 70%, it is generally not necessary to apply the arcsine transformation.
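The arcsine transformation is simple to compute with standard library functions. The sketch below is illustrative; the function name `arcsine_transform` is our own, and angles are reported in degrees to match the worked value in the text.

```python
import math

def arcsine_transform(p):
    """Angular transformation: the angle (in degrees) whose sine is sqrt(p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a proportion between 0 and 1")
    return math.degrees(math.asin(math.sqrt(p)))

theta = arcsine_transform(0.431)
print(round(theta, 2))  # angle for the proportion 0.431

# Equal steps of 0.05 in p cover more degrees near the tails than near the middle
tail_step = arcsine_transform(0.05) - arcsine_transform(0.00)
middle_step = arcsine_transform(0.55) - arcsine_transform(0.50)
print(round(tail_step, 2), round(middle_step, 2))
```

The second comparison shows the stretching of the tails: the same difference in proportion spans a much larger angle near 0 than near 0.5.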
TABLE 10.2
An application of the square root transformation. The data represent the number of adult Drosophila emerging from single-pair cultures for two different medium formulations (medium A contained DDT).

(1)                (2)                (3)         (4)
Number of flies    Square root of     Medium A    Medium B
emerging           number of flies
Y                  √Y                 f           f
 0                 0.00               1           -
 1                 1.00               5           -
 2                 1.41               6           -
 3                 1.73               -           -
 4                 2.00               3           -
 5                 2.24               -           -
 6                 2.45               -           -
 7                 2.65               -           2
 8                 2.83               -           1
 9                 3.00               -           2
10                 3.16               -           3
11                 3.32               -           1
12                 3.46               -           1
13                 3.61               -           1
14                 3.74               -           1
15                 3.87               -           1
16                 4.00               -           2
Total                                 15          15

                                      Medium A    Medium B
Untransformed variable
  Mean of Y                           1.933       11.133
  s²                                  1.495       9.410
Square root transformation
  Mean of √Y                          1.299       3.307
  s² of √Y                            0.2634      0.2099

Tests of equality of variances

Untransformed: $F_s = s_2^2/s_1^2 = 9.410/1.495 = 6.294$**; $F_{0.025[14,14]} = 2.98$

Transformed: $F_s = 0.2634/0.2099 = 1.255$ ns

                                      Medium A    Medium B
Back-transformed (squared) means      1.687       10.937

95% confidence limits (transformed scale)
  Medium A: $L_1 = 1.299 - 2.145\sqrt{0.2634/15} = 1.015$; $L_2 = 1.583$
  Medium B: $L_1 = 3.307 - 2.145\sqrt{0.2099/15} = 3.053$; $L_2 = 3.561$

Back-transformed (squared) confidence limits
                                      Medium A    Medium B
  L1²                                 1.030       9.324
  L2²                                 2.507       12.681
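The variance comparison of Table 10.2 can be recomputed from the two frequency distributions. The following is a sketch rather than the authors' procedure in code form: the helper `mean_and_variance` is hypothetical, the frequencies are those of columns (3) and (4) of the table, and the plain square root (without the added 0.5) is used, as in column (2).

```python
import math

def mean_and_variance(values, freqs):
    """Mean and unbiased variance of a frequency distribution."""
    n = sum(freqs)
    total = sum(f * v for v, f in zip(values, freqs))
    total_sq = sum(f * v * v for v, f in zip(values, freqs))
    return total / n, (total_sq - total ** 2 / n) / (n - 1)

counts = list(range(17))                               # 0 to 16 flies
freq_a = [1, 5, 6, 0, 3] + [0] * 12                    # medium A
freq_b = [0] * 7 + [2, 1, 2, 3, 1, 1, 1, 1, 1, 2]      # medium B
roots = [math.sqrt(c) for c in counts]

mean_a, var_a = mean_and_variance(counts, freq_a)
mean_b, var_b = mean_and_variance(counts, freq_b)
root_mean_a, root_var_a = mean_and_variance(roots, freq_a)
root_mean_b, root_var_b = mean_and_variance(roots, freq_b)

f_untransformed = var_b / var_a          # larger variance in the numerator
f_transformed = root_var_a / root_var_b

print(f"untransformed: s2 = {var_a:.3f}, {var_b:.3f}, F_s = {f_untransformed:.3f}")
print(f"transformed:   s2 = {root_var_a:.4f}, {root_var_b:.4f}, F_s = {f_transformed:.3f}")
print(f"back-transformed means: {root_mean_a ** 2:.3f}, {root_mean_b ** 2:.3f}")
```

The untransformed ratio computed at full precision may differ slightly in the third decimal from the tabled 6.294, which was obtained from variances already rounded to three decimals.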
10.3 Nonparametric methods in lieu of anova

If none of the above transformations manage to make our data meet the assumptions of analysis of variance, we may resort to an analogous nonparametric method. These techniques are also called distribution-free methods, since they are not dependent on a given distribution (such as the normal in anova), but usually will work for a wide range of different distributions. They are called nonparametric methods because their null hypothesis is not concerned with specific parameters (such as the mean in analysis of variance) but only with the distribution of the variates. In recent years, nonparametric analysis of variance has become quite popular because it is simple to compute and permits freedom from worry about the distributional assumptions of an anova. Yet we should point out that in cases where those assumptions hold entirely or even approximately, the analysis of variance is generally the more efficient statistical test for detecting departures from the null hypothesis.

We shall discuss only nonparametric tests for two samples in this section. For a design that would give rise to a t test or anova with two classes, we employ the nonparametric Mann-Whitney U test (Box 10.1). The null hypothesis is that the two samples come from populations having the same distribution. The data in Box 10.1 are measurements of heart (ventricular) function in two groups of patients that have been allocated to their respective groups on the basis of other criteria of ventricular dysfunction. The Mann-Whitney U test as illustrated in Box 10.1 is a semigraphical test and is quite simple to apply. It will be especially convenient when the data are already graphed and there are not too many items in each sample.

Note that this method does not really require that each individual observation represent a precise measurement. So long as you can order the observations, you are able to perform these tests. Thus, for example, suppose you placed some meat out in the open and studied the arrival times of individuals of two species of blowflies. You could record exactly the time of arrival of each individual fly, starting from a point zero in time when the meat was set out. On the other hand, you might simply rank arrival times of the two species, noting that individual 1 of species B came first, 2 individuals from species A next, then 3 individuals of B, followed by the simultaneous arrival of one of each of the two species (a tie), and so forth. While such ranked or ordered data could not be analyzed by the parametric methods studied earlier, the techniques of Box 10.1 are entirely applicable.

The method of calculating the sample statistic $U_s$ for the Mann-Whitney test is straightforward, as shown in Box 10.1. It is desirable to obtain an intuitive understanding of the rationale behind this test. In the Mann-Whitney test we can conceive of two extreme situations: in one case the two samples overlap and coincide entirely; in the other they are quite separate. In the latter case, if we take the sample with the lower-valued variates, there will be no points of the contrasting sample below it; that is, we can go through every observation in the lower-valued sample without having any items of the higher-valued one below
BOX 10.1 Mann-Whitney U test for two samples, ranked observations, not paired.

A measure of heart function (left ventricle ejection fraction) measured in two samples of patients admitted to the hospital under suspicion of heart attack. The patients were classified on the basis of physical examinations during admission into different so-called Killip classes of ventricular dysfunction. We compare the left ventricle ejection fraction for patients classified as Killip classes I and III. The higher Killip class signifies patients with more severe symptoms. The findings were already graphed in the source publication, and step 1 illustrates that only a graph of the data is required for the Mann-Whitney U test. Designate the sample size of the larger sample as $n_1$ and that of the smaller sample as $n_2$. In this case, $n_1 = 29$, $n_2 = 8$. When the two samples are of equal size it does not matter which is designated as $n_1$.

1. Graph the two samples as shown below. Indicate the ties by placing dots at the same level.
[Graph: left ventricle ejection fraction (vertical axis, 0.1 to 0.8) plotted separately for Killip classes I and III (horizontal axis). Class I: 0.49 ± 0.13, n = 29. Class III: 0.28 ± 0.08, n = 8.]
2. For each observation in one sample (it is convenient to use the smaller sample), count the number of observations in the other sample which are lower in value (below it in this graph). Count ½ for each tied observation. For example, there are 1½ observations in class I below the first observation in class III. The half is introduced because of the variate in class I tied with the lowest variate in class III. There are 2½ observations below the tied second and third observations in class III. There are 3 observations below the fourth and fifth variates in class III, 4 observations below the sixth variate, and 6 and 7 observations, respectively, below the seventh and eighth variates in class III. The sum of these counts is C = 29½. The Mann-Whitney statistic $U_s$ is the greater of the two quantities C and ($n_1 n_2 - C$), in this case 29½ and [(29 × 8) − 29½] = 202½.
CHAPTER 10 / ASSUMPTIONS OF ANALYSIS OF VARIANCE
Box 10.1 Continued

Testing the significance of $U_s$

No tied variates in samples (or variates tied within samples only). When $n_1 \le 20$, compare $U_s$ with the critical value for $U_{\alpha[n_1, n_2]}$ in Table XI. The null hypothesis is rejected if the observed value is too large.

In cases where $n_1 > 20$, calculate the following quantity

$$t_s = \frac{U_s - n_1 n_2 / 2}{\sqrt{\dfrac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$$

which is approximately normally distributed. The denominator 12 is a constant. Look up the significance of $t_s$ in Table III against critical values of $t_{\alpha[\infty]}$ for a one-tailed or two-tailed test as required by the hypothesis. In our case this would yield

$$t_s = \frac{202.5 - (29)(8)/2}{\sqrt{\dfrac{(29)(8)(29 + 8 + 1)}{12}}} = \frac{86.5}{\sqrt{734.667}} = 3.191$$

A further complication arises from observations tied between the two groups. Our example is a case in point. There is no exact test. For sample sizes $n_1 < 20$, use Table XI, which will then be conservative. Larger sample sizes require a more elaborate formula. But it takes a substantial number of ties to affect the outcome of the test appreciably. Corrections for ties increase the $t_s$ value slightly; hence the uncorrected formula is more conservative. We may conclude that the two samples with a $t_s$ value of 3.191 by the uncorrected formula are significantly different at P < 0.01.
it. Conversely, all the points of the lower-valued sample would be below every point of the higher-valued one if we started out with the latter. Our total count would therefore be the total count of one sample multiplied by every observation in the second sample, which yields $n_1 n_2$. Thus, since we are told to take the greater of the two values, the sum of the counts C or $n_1 n_2 - C$, our result in this case would be $n_1 n_2$. On the other hand, if the two samples coincided completely, then for each point in one sample we would have those points below it plus a half point for the tied value representing that observation in the second sample which is at exactly the same level as the observation under consideration. A little experimentation will show this value to be $[n(n - 1)/2] + (n/2) = n^2/2$. Clearly, the range of possible U values must be between this and $n_1 n_2$, and the critical value must be somewhere within this range.

Our conclusion as a result of the tests in Box 10.1 is that the two admission classes characterized by physical examination differ in their ventricular dysfunction as measured by left ventricular ejection fraction. The sample characterized as more severely ill has a lower ejection fraction than the sample characterized as less severely ill.
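The counting rule of Box 10.1 and the two extreme situations just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are ours, the raw ejection fractions are not available (only the count C = 29½ reported in the box is used), and the small samples `low`, `high`, and `same` are hypothetical values made up to show the extremes.

```python
from math import sqrt

def mann_whitney_count(sample_a, sample_b):
    """For each value in sample_a, count the sample_b values below it,
    adding 1/2 for every tie (step 2 of Box 10.1)."""
    count = 0.0
    for a in sample_a:
        for b in sample_b:
            if b < a:
                count += 1.0
            elif b == a:
                count += 0.5
    return count

def mann_whitney_u(count, n1, n2):
    """U_s is the greater of C and n1*n2 - C."""
    return max(count, n1 * n2 - count)

def normal_approximation(u_s, n1, n2):
    """t_s for larger samples, without any correction for ties."""
    return (u_s - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Values reported in Box 10.1: C = 29.5, n1 = 29, n2 = 8.
u_s = mann_whitney_u(29.5, 29, 8)
t_s = normal_approximation(u_s, 29, 8)
print(u_s, round(t_s, 3))

# Hypothetical toy samples (not from the book) for the two extremes.
low, high = [1, 2, 3], [4, 5, 6]
same = [1, 2, 3]
u_separate = mann_whitney_u(mann_whitney_count(low, high), 3, 3)  # n1 * n2
c_coincide = mann_whitney_count(same, same)                       # n**2 / 2
print(u_separate, c_coincide)
```

With completely separate samples the statistic reaches $n_1 n_2$, and with coinciding samples the count equals $n^2/2$, which matches the range argued for in the text.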
The Mann-Whitney U test is based on ranks, and it measures differences in location. A nonparametric test that tests differences between two distributions is the Kolmogorov-Smirnov two-sample test. Its null hypothesis is identity in distribution for the two samples, and thus the test is sensitive to differences in location, dispersion, skewness, and so forth. This test is quite simple to carry out. It is based on the unsigned differences between the relative cumulative frequency distributions of the two samples. Expected critical values can be looked up in a table or evaluated approximately. Comparison between observed and expected values leads to decisions whether the maximum difference between the two cumulative frequency distributions is significant. Box 10.2 shows the application of the method to samples in which both $n_1$ and $n_2 < 25$. The example in this box features morphological measurements
BOX 10.2 Kolmogorov-Smirnov two-sample test, testing differences in distributions of two samples of continuous observations. (Both $n_1$ and $n_2 < 25$.)
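The quantity on which the Kolmogorov-Smirnov two-sample test rests, as described in the text, is the largest unsigned difference between the two relative cumulative frequency distributions. A minimal sketch of that statistic follows; the function name and the two short samples are hypothetical, and the critical values, which come from a table, are not reproduced here.

```python
def ks_two_sample_d(sample_1, sample_2):
    """Largest unsigned difference between the relative cumulative
    frequency distributions of two samples, checked at every observed value."""
    n1, n2 = len(sample_1), len(sample_2)
    d_max = 0.0
    for value in sorted(set(sample_1) | set(sample_2)):
        f1 = sum(1 for y in sample_1 if y <= value) / n1
        f2 = sum(1 for y in sample_2 if y <= value) / n2
        d_max = max(d_max, abs(f1 - f2))
    return d_max

# Hypothetical measurements, invented for illustration only.
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [3.5, 4.5, 5.5, 6.5]
print(ks_two_sample_d(sample_a, sample_b))
```

Because the statistic compares whole cumulative distributions, it responds to differences in dispersion or skewness as well as location, which is the sensitivity the text attributes to the test.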
Of course, $\sum \hat{y}^2$ corresponds to $\hat{y}$, $\sum d^2_{Y \cdot X}$ to $d_{Y \cdot X}$, and $\sum y^2$ to $y$.

Source of variation                                                  df       SS                                                      MS                   Expected MS
Explained (estimated Y from mean of Y), $\hat{Y} - \bar{Y}$          1        $\sum \hat{y}^2 = (\sum xy)^2 / \sum x^2$               $s^2_{\hat{Y}}$      $\sigma^2_{Y \cdot X} + \beta^2 \sum x^2$
Unexplained, error (observed Y from estimated Y), $Y - \hat{Y}$      n − 2    $\sum d^2_{Y \cdot X} = \sum y^2 - \sum \hat{y}^2$      $s^2_{Y \cdot X}$    $\sigma^2_{Y \cdot X}$
Total (observed Y from mean of Y), $Y - \bar{Y}$                     n − 1    $\sum y^2$                                              $s^2_Y$

The explained mean square, or mean square due to linear regression, measures the amount of variation in Y accounted for by variation of X. It is tested over the unexplained mean square, which measures the residual variation and is used as an error MS. The mean square due to linear regression, $s^2_{\hat{Y}}$, is based on one degree of freedom, and consequently (n − 2) df remain for the error MS since the total sum of squares possesses n − 1 degrees of freedom. The test is of the null hypothesis $H_0$: $\beta = 0$. When we carry out such an anova on the weight loss data of Box 11.1, we obtain the following results:
Source of variation                              df    SS         MS         $F_s$
Explained - due to linear regression              1    23.5145    23.5145    267.18**
Unexplained - error around regression line        7     0.6161     0.08801
Total                                             8    24.1306
The significance test is $F_s = s^2_{\hat{Y}} / s^2_{Y \cdot X}$. It is clear from the observed value of $F_s$ that a large and significant portion of the variance of Y has been explained by regression on X.
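The arithmetic behind the anova table can be retraced from the two sums of squares alone. The sketch below assumes n = 9, which follows from the 7 error degrees of freedom shown in the table; the function name is ours. The ratio computed from the unrounded error mean square may differ in the second decimal from the printed 267.18, which is consistent with dividing by the rounded value 0.08801.

```python
def regression_anova(ss_explained, ss_unexplained, n):
    """Mean squares and F ratio for a linear regression anova."""
    df_explained = 1          # one degree of freedom for the regression
    df_error = n - 2          # the remaining df of the n - 1 total
    ms_explained = ss_explained / df_explained
    ms_error = ss_unexplained / df_error
    return ms_explained, ms_error, ms_explained / ms_error

# Sums of squares from the table above; n = 9 is inferred from df = 7.
ms_reg, ms_err, f_s = regression_anova(23.5145, 0.6161, 9)
print(round(ms_reg, 4), round(ms_err, 5), round(f_s, 2))
```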
CHAPTER 12 / CORRELATION

This is the formula for the product-moment correlation coefficient $r_{Y_1 Y_2}$ between variables $Y_1$ and $Y_2$. We shall simplify the symbolism to

$$r_{12} = \frac{\sum y_1 y_2}{\sqrt{\sum y_1^2 \sum y_2^2}} \qquad (12.3)$$

If the deviations in the two variables were identical in every pair, so that $y_1 = y_2$, we would obtain $r_{12} = \sum y_1^2 / \sqrt{\sum y_1^2 \sum y_1^2} = \sum y_1^2 / \sum y_1^2 = 1$, which yields a perfect correlation of +1. If deviations in one variable were paired with opposite but equal deviations in the other, the correlation would be −1, because the sum of products in the numerator would be negative. Proof that the correlation coefficient is bounded by +1 and −1 will be given shortly.

If the variates follow a specified distribution, the bivariate normal distribution, the correlation coefficient $r_{jk}$ will estimate a parameter of that distribution symbolized by $\rho_{jk}$. Let us approach the distribution empirically. Suppose you have sampled a hundred items and measured two variables on each item, obtaining two samples of 100 variates in this manner. If you plot these 100 items on a graph in which the variables $Y_1$ and $Y_2$ are the coordinates, you will obtain a scattergram of points as in Figure 12.3A. Let us assume that both variables, $Y_1$ and $Y_2$, are normally distributed and that they are quite independent of each other, so that the fact that one individual happens to be greater than the mean in character $Y_1$ has no effect whatsoever on its value for variable $Y_2$. Thus this same individual may be greater or less than the mean for variable $Y_2$. If there is absolutely no relation between $Y_1$ and $Y_2$ and if the two variables are standardized to make their scales comparable, you will find that the outline of the scattergram is roughly circular. Of course, for a sample of 100 items, the circle will be only imperfectly outlined; but the larger the sample, the more clearly you will be able to discern a circle with the central area around the intersection $\bar{Y}_1$, $\bar{Y}_2$ heavily darkened because of the aggregation there of many points. If you keep sampling, you will have to superimpose new points upon previous points, and if you visualize these points in a physical sense, such as grains of sand, a mound peaked in a bell-shaped fashion will gradually accumulate. This is a three-dimensional realization of a normal distribution, shown in perspective in Figure 12.1.
Regarded from either coordinate axis, the mound will present a two-dimensional appearance, and its outline will be that of a normal distribution curve, the two perspectives giving the distributions of $Y_1$ and $Y_2$, respectively.

If we assume that the two variables $Y_1$ and $Y_2$ are not independent but are positively correlated to some degree, then if a given individual has a large value of $Y_1$, it is more likely than not to have a large value of $Y_2$ as well. Similarly, a small value of $Y_1$ will likely be associated with a small value of $Y_2$. Were you to sample items from such a population, the resulting scattergram (shown in Figure 12.3D) would become elongated in the form of an ellipse. This is so because those parts of the circle that formerly included individuals high for one variable and low for the other (and vice versa) are now scarcely represented. Continued sampling (with the sand grain model) yields a three-dimensional elliptic mound, shown in Figure 12.2. If correlation is perfect, all the data will fall along a single regression line (the identical line would describe the regression of $Y_1$ on $Y_2$ and of $Y_2$ on $Y_1$), and if we let them pile up in a physical model, they will result in a flat, essentially two-dimensional normal curve lying on this regression line.

FIGURE 12.1 Bivariate normal frequency distribution. The parametric correlation $\rho$ between variables $Y_1$ and $Y_2$ equals zero. The frequency distribution may be visualized as a bell-shaped mound.

12.2 / THE PRODUCT-MOMENT CORRELATION COEFFICIENT

FIGURE 12.2 Bivariate normal frequency distribution. The parametric correlation $\rho$ between variables $Y_1$ and $Y_2$ equals 0.9. The bell-shaped mound of Figure 12.1 has become elongated.

The circular or elliptical shape of the outline of the scattergram and of the resulting mound is clearly a function of the degree of correlation between the two variables, and this is the parameter $\rho_{jk}$ of the bivariate normal distribution. By analogy with Expression (12.2), the parameter $\rho_{jk}$ can be defined as

$$\rho_{jk} = \frac{\sigma_{jk}}{\sigma_j \sigma_k}$$
where $\sigma_{jk}$ is the parametric covariance of variables $Y_j$ and $Y_k$ and $\sigma_j$ and $\sigma_k$ are the parametric standard deviations of variables $Y_j$ and $Y_k$, as before. When two variables are distributed according to the bivariate normal, a sample correlation coefficient $r_{jk}$ estimates the parametric correlation coefficient $\rho_{jk}$. We can make some statements about the sampling distribution of $\rho_{jk}$ and set confidence limits to it.

Regrettably, the elliptical shape of scattergrams of correlated variables is not usually very clear unless either very large samples have been taken or the parametric correlation $\rho_{jk}$ is very high. To illustrate this point, we show in Figure 12.3 several graphs illustrating scattergrams resulting from samples of 100 items from bivariate normal populations with differing values of $\rho_{jk}$. Note that in the first graph (Figure 12.3A), with $\rho_{jk} = 0$, the circular distribution is only very vaguely outlined. A far greater sample is required to demonstrate the circular shape of the distribution more clearly. No substantial difference is noted in Figure 12.3B, based on $\rho_{jk} = 0.3$. Knowing that this depicts a positive correlation, one can visualize a positive slope in the scattergram; but without prior knowledge this would be difficult to detect visually. The next graph (Figure 12.3C, based on $\rho_{jk} = 0.5$) is somewhat clearer, but still does not exhibit an unequivocal trend. In general, correlation cannot be inferred from inspection of scattergrams based on samples from populations with $\rho_{jk}$ between −0.5 and +0.5 unless there are numerous sample points. This point is illustrated in the last graph (Figure 12.3G), also sampled from a population with $\rho_{jk} = 0.5$ but based on a sample of 500. Here, the positive slope and elliptical outline of the scattergram are quite evident. Figure 12.3D, based on $\rho_{jk} = 0.7$ and n = 100, shows the trend more clearly than the first three graphs. Note that the next graph (Figure 12.3E), based on the same magnitude of $\rho_{jk}$ but representing negative correlation, also shows the trend but is more strung out than Figure 12.3D. The difference in shape of the ellipse has no relation to the negative nature of the correlation; it is simply a function of sampling error, and the comparison of these two figures should give you some idea of the variability to be expected on random sampling from a bivariate normal distribution. Finally, Figure 12.3F, representing a correlation of $\rho_{jk} = 0.9$, shows a tight association between the variables and a reasonable approximation to an ellipse of points.

FIGURE 12.3 Random samples from bivariate normal distributions with varying values of the parametric correlation coefficient $\rho$. Sample sizes n = 100 in all graphs except G, which has n = 500. (A) $\rho = 0$. (B) $\rho = 0.3$. (C) $\rho = 0.5$. (D) $\rho = 0.7$. (E) $\rho = -0.7$. (F) $\rho = 0.9$. (G) $\rho = 0.5$.
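The sampling variability discussed for Figure 12.3 can be explored with a small simulation. The sketch below is not the procedure used to draw the figure; it assumes standardized variables and builds the second variate as $\rho z_1 + \sqrt{1 - \rho^2}\, z_2$ from two independent standard normal deviates, a common way to obtain a bivariate normal pair. The function names and the seed are arbitrary choices.

```python
import random
from math import sqrt

def sample_bivariate_normal(rho, n, rng):
    """Draw n pairs (Y1, Y2) of standard normal variates whose
    parametric correlation is rho."""
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        pairs.append((z1, rho * z1 + sqrt(1.0 - rho * rho) * z2))
    return pairs

def correlation(pairs):
    """Product-moment correlation coefficient of a list of pairs."""
    n = len(pairs)
    mean1 = sum(p[0] for p in pairs) / n
    mean2 = sum(p[1] for p in pairs) / n
    sp = sum((y1 - mean1) * (y2 - mean2) for y1, y2 in pairs)
    ss1 = sum((y1 - mean1) ** 2 for y1, _ in pairs)
    ss2 = sum((y2 - mean2) ** 2 for _, y2 in pairs)
    return sp / sqrt(ss1 * ss2)

rng = random.Random(12)  # arbitrary seed so that a run can be repeated
settings = [(0.0, 100), (0.3, 100), (0.5, 100), (0.7, 100),
            (-0.7, 100), (0.9, 100), (0.5, 500)]
for rho, n in settings:
    r = correlation(sample_bivariate_normal(rho, n, rng))
    print(f"rho = {rho:+.1f}, n = {n}: sample r = {r:+.3f}")
```

Rerunning with other seeds gives a feel for how far a sample r based on 100 items can stray from its parametric value.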
Now let us return to the expression for the sample correlation coefficient shown in Expression (12.3). Squaring this expression results in

$$r_{12}^2 = \frac{(\sum y_1 y_2)^2}{\sum y_1^2 \sum y_2^2} = \frac{(\sum y_1 y_2)^2}{\sum y_1^2} \cdot \frac{1}{\sum y_2^2}$$

Look at the left term of the last expression. It is the square of the sum of products of variables $Y_1$ and $Y_2$, divided by the sum of squares of $Y_1$. If this were a regression problem, this would be the formula for the explained sum of squares of variable $Y_2$ on variable $Y_1$, $\sum \hat{y}_2^2$. In the symbolism of Chapter 11, on regression, it would be $\sum \hat{y}^2 = (\sum xy)^2 / \sum x^2$. Thus, we can write

$$r_{12}^2 = \frac{\sum \hat{y}_2^2}{\sum y_2^2} \qquad (12.6)$$

The square of the correlation coefficient, therefore, is the ratio formed by the explained sum of squares of variable $Y_2$ divided by the total sum of squares of variable $Y_2$. Equivalently,

$$r_{12}^2 = \frac{\sum \hat{y}_1^2}{\sum y_1^2} \qquad (12.6a)$$
which can be derived just as easily. (Remember that since we are not really regressing one variable on the other, it is just as legitimate to have $Y_1$ explained by $Y_2$ as the other way around.) The ratio symbolized by Expressions (12.6) and (12.6a) is a proportion ranging from 0 to 1. This becomes obvious after a little contemplation of the meaning of this formula. The explained sum of squares of any variable must be smaller than its total sum of squares or, maximally, if all the variation of a variable has been explained, it can be as great as the total sum of squares, but certainly no greater. Minimally, it will be zero if none of the variation of the variable can be explained by the other variable with which the covariance has been computed. Thus, we obtain an important measure of the proportion of the variation of one variable determined by the variation of the other. This quantity, the square of the correlation coefficient, $r_{12}^2$, is called the coefficient of determination. It ranges from zero to 1 and must be positive regardless of whether the correlation coefficient is negative or positive. Incidentally, here is proof that the correlation coefficient cannot vary beyond −1 and +1. Since its square is the coefficient of determination and we have just shown that the bounds of the latter are zero to 1, it is obvious that the bounds of its square root will be ±1.

The coefficient of determination is useful also when one is considering the relative importance of correlations of different magnitudes. As can be seen by a reexamination of Figure 12.3, the rate at which the scatter diagrams go from a distribution with a circular outline to one resembling an ellipse seems to be more directly proportional to $r^2$ than to r itself. Thus, in Figure 12.3B, with $\rho^2 = 0.09$, it is difficult to detect the correlation visually. However, by the time we reach Figure 12.3D, with $\rho^2 = 0.49$, the presence of correlation is very apparent.

The coefficient of determination is a quantity that may be useful in regression analysis also. You will recall that in a regression we used anova to partition the total sum of squares into explained and unexplained sums of squares. Once such an analysis of variance has been carried out, one can obtain the ratio of the explained sums of squares over the total SS as a measure of the proportion of the total variation that has been explained by the regression. However, as already discussed in Section 12.1, it would not be meaningful to take the square root of such a coefficient of determination and consider it as an estimate of the parametric correlation of these variables.

We shall now take up a mathematical relation between the coefficients of correlation and regression. At the risk of being repetitious, we should stress again that though we can easily convert one coefficient into the other, this does not mean that the two types of coefficients can be used interchangeably on the same sort of data. One important relationship between the correlation coefficient and the regression coefficient can be derived as follows from Expression (12.3):

$$r_{12} = \frac{\sum y_1 y_2}{\sqrt{\sum y_1^2 \sum y_2^2}} = \frac{\sum y_1 y_2}{\sqrt{\sum y_1^2}\,\sqrt{\sum y_2^2}}$$
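Multiplying numerator and denominator of this expression by $\sqrt{\sum y_1^2}$, we obtain

$$r_{12} = \frac{\sum y_1 y_2 \sqrt{\sum y_1^2}}{\sqrt{\sum y_1^2}\,\sqrt{\sum y_1^2}\,\sqrt{\sum y_2^2}} = \frac{\sum y_1 y_2}{\sum y_1^2} \cdot \frac{\sqrt{\sum y_1^2}}{\sqrt{\sum y_2^2}}$$

Dividing numerator and denominator of the right term of this expression by $\sqrt{n - 1}$, we obtain

$$r_{12} = \frac{\sum y_1 y_2}{\sum y_1^2} \cdot \frac{\sqrt{\sum y_1^2 / (n - 1)}}{\sqrt{\sum y_2^2 / (n - 1)}} = b_{2 \cdot 1} \frac{s_1}{s_2} \qquad (12.7)$$

Similarly, we could demonstrate that

$$r_{12} = b_{1 \cdot 2} \frac{s_2}{s_1} \qquad (12.7a)$$

and hence

$$b_{1 \cdot 2} = r_{12} \frac{s_1}{s_2} \qquad (12.7b)$$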
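Expressions (12.6), (12.7), and (12.7a) can be checked numerically. The sketch below uses the crab data of Box 12.1 (gill weight and body weight); the variable names are ours, and the computation works from deviations rather than from the computational formulas of the box.

```python
from math import sqrt

# Crab data of Box 12.1: gill weight (mg) and body weight (g).
Y1 = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]
Y2 = [14.40, 15.20, 11.30, 2.50, 22.70, 14.90,
      1.41, 15.81, 4.19, 15.39, 17.25, 9.52]

n = len(Y1)
mean1, mean2 = sum(Y1) / n, sum(Y2) / n
ss1 = sum((y - mean1) ** 2 for y in Y1)                      # sum of y1^2
ss2 = sum((y - mean2) ** 2 for y in Y2)                      # sum of y2^2
sp = sum((a - mean1) * (b - mean2) for a, b in zip(Y1, Y2))  # sum of y1*y2

r12 = sp / sqrt(ss1 * ss2)          # Expression (12.3)
b21 = sp / ss1                      # regression coefficient of Y2 on Y1
b12 = sp / ss2                      # regression coefficient of Y1 on Y2
s1, s2 = sqrt(ss1 / (n - 1)), sqrt(ss2 / (n - 1))
explained_ss_y2 = sp ** 2 / ss1     # explained SS of Y2 on Y1

print(round(r12, 4))
print(round(b21 * s1 / s2, 4))                 # Expression (12.7)
print(round(b12 * s2 / s1, 4))                 # Expression (12.7a)
print(round(explained_ss_y2 / ss2, 4), round(r12 ** 2, 4))  # Expression (12.6)
```

All three routes lead to the same coefficient, which illustrates why r can be read as a standardized regression coefficient.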
In these expressions $b_{2 \cdot 1}$ is the regression coefficient for variable $Y_2$ on $Y_1$. We see, therefore, that the correlation coefficient is the regression slope multiplied by the ratio of the standard deviations of the variables. The correlation coefficient may thus be regarded as a standardized regression coefficient. If the two standard deviations are identical, both regression coefficients and the correlation coefficient will be identical in value.

Now that we know about the coefficient of correlation, some of the earlier work on paired comparisons (see Section 9.3) can be put into proper perspective. In Appendix A1.8 we show for the corresponding parametric expressions that the variance of a sum of two variables is

$$s^2_{(Y_1 + Y_2)} = s_1^2 + s_2^2 + 2 r_{12} s_1 s_2 \qquad (12.8)$$

where $s_1$ and $s_2$ are standard deviations of $Y_1$ and $Y_2$, respectively, and $r_{12}$ is the correlation coefficient between these variables. Similarly, for a difference between two variables, we obtain

$$s^2_{(Y_1 - Y_2)} = s_1^2 + s_2^2 - 2 r_{12} s_1 s_2 \qquad (12.9)$$
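Expressions (12.8) and (12.9) hold exactly for sample statistics as well, which a short numerical check makes plain. The paired values below are hypothetical, invented only for this illustration, and the helper names are ours.

```python
from math import sqrt

def variance(values):
    """Sample variance with n - 1 in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def correlation(a, b):
    """Product-moment correlation coefficient of two paired lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sp = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return sp / sqrt(ss_a * ss_b)

# Hypothetical paired measurements (not from the book).
y1 = [4.0, 7.0, 5.0, 9.0, 6.0, 8.0]
y2 = [3.0, 6.0, 6.0, 8.0, 5.0, 9.0]

s1, s2 = sqrt(variance(y1)), sqrt(variance(y2))
r12 = correlation(y1, y2)

var_sum = variance([a + b for a, b in zip(y1, y2)])
var_diff = variance([a - b for a, b in zip(y1, y2)])
rhs_sum = s1 ** 2 + s2 ** 2 + 2 * r12 * s1 * s2    # Expression (12.8)
rhs_diff = s1 ** 2 + s2 ** 2 - 2 * r12 * s1 * s2   # Expression (12.9)

print(var_sum, rhs_sum)
print(var_diff, rhs_diff)
```

With positively correlated pairs such as these, the variance of the differences comes out well below the sum of the two variances, which is the effect the paired-comparisons test exploits.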
What Expression (12.8) indicates is that if we make a new composite variable that is the sum of two other variables, the variance of this new variable will be the sum of the variances of the variables of which it is composed plus an added term, which is a function of the standard deviations of these two variables and of the correlation between them. It is shown in Appendix A1.8 that this added term is twice the covariance of $Y_1$ and $Y_2$. When the two variables
BOX 12.1 Computation of the product-moment correlation coefficient.

Relationships between gill weight and body weight in the crab Pachygrapsus crassipes. n = 12.

(1) $Y_1$                       (2) $Y_2$
Gill weight in milligrams       Body weight in grams
159                             14.40
179                             15.20
100                             11.30
45                               2.50
384                             22.70
230                             14.90
100                              1.41
320                             15.81
80                               4.19
220                             15.39
320                             17.25
210                              9.52

Source: Unpublished data by L. Miller.

Computation

1. $\sum Y_1 = 159 + \cdots + 210 = 2347$

2. $\sum Y_1^2 = 159^2 + \cdots + 210^2 = 583{,}403$

3. $\sum Y_2 = 14.40 + \cdots + 9.52 = 144.57$

4. $\sum Y_2^2 = (14.40)^2 + \cdots + (9.52)^2 = 2204.1853$

5. $\sum Y_1 Y_2 = 14.40(159) + \cdots + 9.52(210) = 34{,}837.10$

6. Sum of squares of $Y_1$ = $\sum y_1^2 = \sum Y_1^2 - \dfrac{(\sum Y_1)^2}{n}$ = quantity 2 − (quantity 1)²/n = $583{,}403 - \dfrac{(2347)^2}{12} = 124{,}368.9167$

7. Sum of squares of $Y_2$ = $\sum y_2^2 = \sum Y_2^2 - \dfrac{(\sum Y_2)^2}{n}$ = quantity 4 − (quantity 3)²/n = $2204.1853 - \dfrac{(144.57)^2}{12} = 462.4782$

8. Sum of products = $\sum y_1 y_2 = \sum Y_1 Y_2 - \dfrac{(\sum Y_1)(\sum Y_2)}{n}$ = quantity 5 − (quantity 1 × quantity 3)/n = $34{,}837.10 - \dfrac{(2347)(144.57)}{12} = 6561.6175$

9. Product-moment correlation coefficient (by Expression (12.3)):

$$r_{12} = \frac{\sum y_1 y_2}{\sqrt{\sum y_1^2 \sum y_2^2}} = \frac{\text{quantity 8}}{\sqrt{\text{quantity 6} \times \text{quantity 7}}} = \frac{6561.6175}{\sqrt{(124{,}368.9167)(462.4782)}} = \frac{6561.6175}{\sqrt{57{,}517{,}912.7314}} = \frac{6561.6175}{7584.0565} = 0.8652 \approx 0.87$$
being summed are uncorrelated, this added covariance term will be zero, and the variance of the sum will simply be the sum of variances of the two variables. This is the reason why, in an anova or in a t test of the difference between the two means, we had to assume the independence of the two variables to permit us to add their variances. Otherwise we would have had to allow for a covariance term. By contrast, in the paired-comparisons technique we expect correlation between the variables, since the members in each pair share a common experience. The paired-comparisons test automatically subtracts a covariance term, resulting in a smaller standard error and consequently in a larger value of $t_s$, since the numerator of the ratio remains the same. Thus, whenever correlation between two variables is positive, the variance of their differences will be considerably smaller than the sum of their variances; this is the reason why the paired-comparisons test has to be used in place of the t test for difference of means. These considerations are equally true for the corresponding analyses of variance, single-classification and two-way anova.

The computation of a product-moment correlation coefficient is quite simple. The basic quantities needed are the same six required for computation of the regression coefficient (Section 11.3). Box 12.1 illustrates how the coefficient should be computed. The example is based on a sample of 12 crabs in which gill weight $Y_1$ and body weight $Y_2$ have been recorded. We wish to know whether there is a correlation between the weight of the gill and that of the body, the latter representing a measure of overall size. The existence of a positive correlation might lead you to conclude that a bigger-bodied crab with its resulting greater amount of metabolism would require larger gills in order to provide the necessary oxygen. The computations are illustrated in Box 12.1. The correlation coefficient of 0.87 agrees with the clear slope and narrow elliptical outline of the scattergram for these data in Figure 12.4.

FIGURE 12.4 Scatter diagram for crab data of Box 12.1. Horizontal axis: body weight in grams.
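The nine steps of Box 12.1 translate directly into code. The sketch below follows the computational formulas of the box, with the variable names `q1` through `q8` chosen by us to mirror its "quantity" labels.

```python
from math import sqrt

# Crab data of Box 12.1.
gill_mg = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]
body_g = [14.40, 15.20, 11.30, 2.50, 22.70, 14.90,
          1.41, 15.81, 4.19, 15.39, 17.25, 9.52]
n = len(gill_mg)

q1 = sum(gill_mg)                                   # sum of Y1
q2 = sum(y * y for y in gill_mg)                    # sum of Y1 squared
q3 = sum(body_g)                                    # sum of Y2
q4 = sum(y * y for y in body_g)                     # sum of Y2 squared
q5 = sum(a * b for a, b in zip(gill_mg, body_g))    # sum of Y1 * Y2
q6 = q2 - q1 ** 2 / n                               # sum of squares of Y1
q7 = q4 - q3 ** 2 / n                               # sum of squares of Y2
q8 = q5 - q1 * q3 / n                               # sum of products
r12 = q8 / sqrt(q6 * q7)                            # Expression (12.3)

for label, value in [("SS Y1", q6), ("SS Y2", q7), ("SP", q8), ("r12", r12)]:
    print(f"{label}: {value:.4f}")
```

The computational formulas subtract two large numbers, so with data of much greater magnitude it may be safer to work from deviations about the means instead.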
12.3 Significance tests in correlation
The most common significance test is whether it is possible for a sample correlation coefficient to have come from a population with a parametric correlation coefficient of zero. The null hypothesis is therefore $H_0$: $\rho = 0$. This implies that the two variables are uncorrelated. If the sample comes from a bivariate normal distribution and $\rho = 0$, the standard error of the correlation coefficient is

$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$
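The standard error above leads to a simple test statistic. The sketch below assumes the usual t test of $H_0$: $\rho = 0$, in which $t_s = (r - 0)/s_r$ is compared with a t distribution on n − 2 degrees of freedom; the function name is ours. It is applied to the crab data of Box 12.1 (r = 0.8652, n = 12).

```python
from math import sqrt

def correlation_t_test(r, n):
    """Standard error of r under rho = 0 and the resulting t_s,
    to be compared with critical values of t for n - 2 df."""
    s_r = sqrt((1.0 - r * r) / (n - 2))
    return s_r, (r - 0.0) / s_r

# Crab data of Box 12.1: r = 0.8652 from n = 12 pairs.
s_r, t_s = correlation_t_test(0.8652, 12)
print(round(s_r, 4), round(t_s, 4))
```

The critical value itself still has to be looked up in a table of t for 10 degrees of freedom; the sketch does not compute probabilities.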
BOX 12.2

[...] $t_s = 5.4564 > t_{0.001[10]}$ [...]

For one-tailed tests the 0.10 and 0.02 values of t should be used for 5% and 1% significance tests, respectively. Such tests would apply if the alternative hypothesis were $H_1$: $\rho > 0$ or $H_1$: $\rho < 0$, rather than $H_1$: $\rho \ne 0$.

When n > 50, we can also make use of the z transformation described in the text. Since $\sigma_z = 1/\sqrt{n - 3}$, we test

$$t_s = \frac{z - 0}{1/\sqrt{n - 3}} = z\sqrt{n - 3}$$

Since z is normally distributed and we are using a parametric standard deviation, we compare $t_s$ with $t_{\alpha[\infty]}$ [...]

Since n > 50, we can set confidence limits to r using the z transformation. We first convert the sample r to z, set confidence limits to this z, and then transform these limits back to the r scale. We shall find 95% confidence limits for the above wing vein length data. For r = 0.837, z = 1.2111, $\alpha = 0.05$.

$$L_1 = z - \frac{t_{\alpha[\infty]}}{\sqrt{n - 3}} = 1.2111 - 0.0879 = 1.1232$$

$$L_2 = z + \frac{t_{\alpha[\infty]}}{\sqrt{n - 3}} = 1.2111 + 0.0879 = 1.2990$$

where $t_{0.05[\infty]} = 1.960$. We retransform these z values to the r scale by finding the corresponding arguments for the z function in Table X. $L_1 \approx 0.808$ and $L_2 \approx 0.862$ are the 95% confidence limits around r = 0.837.

Test of the difference between two correlation coefficients

For two correlation coefficients we may test $H_0$: $\rho_1 = \rho_2$ versus $H_1$: $\rho_1 \ne \rho_2$ as follows:

$$t_s = \frac{z_1 - z_2}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}}$$

Since $z_1 - z_2$ is normally distributed and we are using a parametric standard deviation, we compare $t_s$ with $t_{\alpha[\infty]}$ or employ Table II, "Areas of the normal curve."

For example, the correlation between body weight and wing length in Drosophila pseudoobscura was found by Sokoloff (1966) to be 0.552 in a sample of $n_1 = 39$ at the Grand Canyon and 0.665 in a sample of $n_2 = 20$ at Flagstaff, Arizona.

Grand Canyon: $z_1 = 0.6213$        Flagstaff: $z_2 = 0.8017$

$$t_s = \frac{0.6213 - 0.8017}{\sqrt{0.086601}} = \frac{-0.1804}{0.29428} = -0.6130$$

By linear interpolation in Table II, we find the probability that a value of $t_s$ will be between ±0.6130 to be about 2(0.22941) = 0.45882, so we clearly have no evidence on which to reject the null hypothesis.
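The confidence limits and the two-sample test of Box 12.2 can be retraced with the inverse hyperbolic tangent in place of Table X. This is a sketch under stated assumptions: the function names are ours, and the sample size n = 500 for the wing vein data is not given in the surviving text of the box but is inferred from the half-width 0.0879 = 1.960/√(n − 3). Results computed this way may differ in the last decimal from values read off the tables.

```python
from math import atanh, tanh, sqrt

def confidence_limits(r, n, t_crit=1.960):
    """Confidence limits for r: transform to z, set limits, transform back."""
    z = atanh(r)
    half_width = t_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)

def difference_test(r1, n1, r2, n2):
    """t_s for H0: rho_1 = rho_2, to be compared with the normal curve."""
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (atanh(r1) - atanh(r2)) / standard_error

# Wing vein data: r = 0.837; n = 500 is an inferred assumption (see above).
lower, upper = confidence_limits(0.837, 500)
print(round(lower, 4), round(upper, 4))

# Drosophila data: r = 0.552 (n = 39) versus r = 0.665 (n = 20).
t_s = difference_test(0.552, 39, 0.665, 20)
print(round(t_s, 3))
```

Note that the limits are not symmetrical about r = 0.837, as the text points out for limits retransformed from the z scale.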
W h e n ρ is close t o + 1.0, t h e d i s t r i b u t i o n of s a m p l e v a l u e s of r is m a r k e d l y a s y m m e t r i c a l , a n d , a l t h o u g h a s t a n d a r d e r r o r is a v a i l a b l e for r in such cases, it s h o u l d n o t be a p p l i e d unless the s a m p l e is very large (n > 500), a m o s t inf r e q u e n t case of little interest. T o o v e r c o m e this difficulty, we t r a n s f o r m r to a f u n c t i o n z, d e v e l o p e d by F i s h e r . T h e f o r m u l a for ζ is (12.10)
You m a y recognize this as ζ = t a n h ' r, the f o r m u l a for the inverse hyp e r b o l i c t a n g e n t of r. T h i s f u n c t i o n h a s been t a b u l a t e d in T a b l e X, w h e r e values of ζ c o r r e s p o n d i n g t o a b s o l u t e v a l u e s of r a r c given. I n s p e c t i o n of E x p r e s s i o n (12.10) will s h o w that w h e n r = 0, ζ will also e q u a l zero, since i ' n I e q u a l s zero. H o w e v e r , as r a p p r o a c h e s ± 1 , (1 + /•)/(! - r) a p p r o a c h e s / a n d 0; c o n s e q u e n t l y , ζ a p p r o a c h e s + infinity. T h e r e f o r e , s u b s t a n t i a l differences between r a n d ζ o c c u r at the higher v a l u e s for r. Thus, w h e n r is 0.115, ζ = 0.1 I 55. F o r r = - 0 . 5 3 1 , wc o b t a i n ζ = - 0 . 5 9 1 5 ; r = 0.972 yields ζ = 2.1273. N o t e byh o w m u c h ζ exceeds r in this last p a i r of values. By f i n d i n g a given value of ζ in T a b i c X, we can also o b t a i n the c o r r e s p o n d i n g value of r. Inverse i n t e r p o l a t i o n m a y be necessary. T h u s , ζ = 0.70 c o r r e s p o n d s t o r = 0.604, a n d a value of ζ = - 2 . 7 6 c o r r e s p o n d s t o r = —0.992. S o m e p o c k c t c a l c u l a t o r s h a v e built-in h y p e r b o l i c a n d inverse h y p e r b o l i c f u n c t i o n s . K e y s for such f u n c t i o n s w o u l d o b v i a t e the need for T a b l e X. T h e a d v a n t a g e of the ζ t r a n s f o r m a t i o n is t h a t while c o r r e l a t i o n coefficients arc d i s t r i b u t e d in s k e w e d f a s h i o n for v a l u e s of ρ φ 0. the values of ζ are a p -
284
CHAPTER 1 2
/
CORRELATION
ζ (zeta), following the usual convention. T h e expected variance of ζ is
This is a n a p p r o x i m a t i o n a d e q u a t e for s a m p l e sizes η > 50 a n d a tolerable a p p r o x i m a t i o n even w h e n η > 25. An interesting aspect of the variance of ζ evident f r o m Expression (12.11) is that it is i n d e p e n d e n t of the m a g n i t u d e of r, but is simply a f u n c t i o n of sample size n. As s h o w n in Box 12.2, for s a m p l e sizes greater t h a n 50 we c a n also use the ζ t r a n s f o r m a t i o n to test t h e significance of a sample r e m p l o y i n g the hypothesis H0: ρ = 0. In the second section of Box 12.2 we show the test of a null hypothesis t h a t ρ φ 0. W e m a y have a hypothesis that the true c o r r e l a t i o n between two variables is a given value ρ different f r o m zero. Such h y p o t h e s e s a b o u t the expected c o r r e l a t i o n between t w o variables are frequent in genetic w o r k , a n d we m a y wish t o test observed d a t a against such a hypothesis. Alt h o u g h there is n o a priori reason to a s s u m e that the true c o r r e l a t i o n between right a n d left sides of the bee wing vein lengths in Box 12.2 is 0.5, we s h o w the test of such a hypothesis to illustrate the m e t h o d . C o r r e s p o n d i n g to ρ = 0.5, there is ζ, the p a r a m e t r i c value of z. It is the ζ t r a n s f o r m a t i o n of p. W e n o t e that the probability that the s a m p l e r of 0.837 could have been sampled f r o m a p o p u l a t i o n with ρ = 0.5 is vanishingly small. Next, in Box 12.2 we see h o w to set confidence limits to a s a m p l e correlation coefficient r. This is d o n e by m e a n s of the ζ t r a n s f o r m a t i o n ; it will result in asymmetrical confidence limits when these are r e t r a n s f o r m e d t o the r scale, as when setting confidence limits with variables subjected to s q u a r e root or logarithmic t r a n s f o r m a t i o n s . 
A test for the significance of the difference between two sample correlation coefficients is the final example illustrated in Box 12.2. A standard error for the difference is computed and tested against a table of areas of the normal curve. In the example the correlation between body weight and wing length in two Drosophila populations was tested, and the difference in correlation coefficients between the two populations was found not significant. The formula given is an acceptable approximation when the smaller of the two samples is greater than 25. It is frequently used with even smaller sample sizes, as shown in our example in Box 12.2.
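A minimal sketch of such a test follows. It assumes the commonly used standard error for a difference between two z-transformed coefficients, the square root of 1/(n1 − 3) + 1/(n2 − 3); the correlation values and sample sizes are hypothetical and are not the Drosophila figures of Box 12.2.

```python
import math

def difference_test(r1, n1, r2, n2):
    """Normal deviate and two-tailed probability for H0: rho1 = rho2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    # Assumed standard error of the difference between two z values.
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    t_s = (z1 - z2) / standard_error
    # Two-tailed area under the normal curve beyond |t_s|.
    p_value = math.erfc(abs(t_s) / math.sqrt(2.0))
    return t_s, p_value

# Hypothetical inputs chosen only to exercise the function.
t_s, p_value = difference_test(0.55, 39, 0.63, 20)
print(t_s, p_value)
```

With the smaller sample below 25, as in this made-up call, the result should be read as a rough approximation, in line with the caution given in the text.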
12.4 Applications of correlation
The purpose of correlation analysis is to measure the intensity of association observed between any pair of variables and to test whether it is greater than could be expected by chance alone. Once established, such an association is likely to lead to reasoning about causal relationships between the variables. Students of statistics are told at an early stage not to confuse significant correlation with causation. We are also warned about so-called nonsense correlations, a well-known case being the positive correlation between the number of Baptist ministers and the per capita liquor consumption in cities with populations of over 10,000 in the United States. Individual cases of correlation must be carefully analyzed before inferences are drawn from them. It is useful to distinguish correlations in which one variable is the entire or, more likely, the partial cause of another from others in which the two correlated variables have a common cause and from more complicated situations involving both direct influence and common causes. The establishment of a significant correlation does not tell us which of many possible structural models is appropriate. Further analysis is needed to discriminate between the various models.

The traditional distinction of real versus nonsense or illusory correlation is of little use. In supposedly legitimate correlations, causal connections are known or at least believed to be clearly understood. In so-called illusory correlations, no reasonable connection between the variables can be found; or if one is demonstrated, it is of no real interest or may be shown to be an artifact of the sampling procedure. Thus, the correlation between Baptist ministers and liquor consumption is simply a consequence of city size. The larger the city, the more Baptist ministers it will contain on the average and the greater will be the liquor consumption. The correlation is of little interest to anyone studying either the distribution of Baptist ministers or the consumption of alcohol.
Some correlations have time as the common factor, and processes that change with time are frequently likely to be correlated, not because of any functional biological reasons but simply because the change with time in the two variables under consideration happens to be in the same direction. Thus, size of an insect population building up through the summer may be correlated with the height of some weeds, but this may simply be a function of the passage of time. There may be no ecological relation between the plant and the insects.

Another situation in which the correlation might be considered an artifact is when one of the variables is in part a mathematical function of the other. Thus, for example, if Y = Z/X and we compute the correlation of X with Y, the existing relation will tend to produce a negative correlation.

Perhaps the only correlations properly called nonsense or illusory are those assumed by popular belief or scientific intuition which, when tested by proper statistical methodology using adequate sample sizes, are found to be not significant. Thus, if we can show that there is no significant correlation between amount of saturated fats eaten and the degree of atherosclerosis, we can consider this to be an illusory correlation. Remember also that when testing significance of correlations at conventional levels of significance, you must allow for type I error, which will lead to your judging a certain percentage of correlations significant when in fact the parametric value of ρ = 0.

Correlation coefficients have a history of extensive use and application dating back to the English biometric school at the beginning of the twentieth century.
Recent years have seen somewhat less application of this technique as increasing segments of biological research have become experimental. In experiments in which one factor is varied and the response of another variable to the deliberate variation of the first is examined, the method of regression is more appropriate, as has already been discussed. However, large areas of biology and of other sciences remain where the experimental method is not suitable because variables cannot be brought under control of the investigator. There are many areas of medicine, ecology, systematics, evolution, and other fields in which experimental methods are difficult to apply. As yet, the weather cannot be controlled, nor can historical evolutionary factors be altered. Epidemiological variables are generally not subject to experimental manipulation. Nevertheless, we need an understanding of the scientific mechanisms underlying these phenomena as much as of those in biochemistry or experimental embryology. In such cases, correlation analysis serves as a first descriptive technique estimating the degrees of association among the variables involved.
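The artifact noted earlier in this section, in which one variable is in part a mathematical function of the other (Y = Z/X correlated with X), can be illustrated with a small simulation. This is a sketch under stated assumptions: X and Z are drawn independently from a uniform distribution on the interval 1 to 2, and the sample size, seed, and function names are hypothetical choices for the demonstration.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

def ratio_artifact_demo(n=2000, seed=1):
    """Return r(X, Z) and r(X, Z/X) for independently generated X and Z."""
    rng = random.Random(seed)
    x = [rng.uniform(1.0, 2.0) for _ in range(n)]
    z = [rng.uniform(1.0, 2.0) for _ in range(n)]
    y = [zi / xi for zi, xi in zip(z, x)]  # Y is partly a function of X
    return pearson_r(x, z), pearson_r(x, y)

r_xz, r_xy = ratio_artifact_demo()
print(r_xz, r_xy)
```

Although X and Z are generated without any relation to each other, the correlation of X with the ratio Y comes out clearly negative, because X appears in the denominator of Y.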
12.5 Kendall's coefficient of rank correlation
Occasionally data are known not to follow the bivariate normal distribution, yet we wish to test for the significance of association between the two variables. One method of analyzing such data is by ranking the variates and calculating a coefficient of rank correlation. This approach belongs to the general family of nonparametric methods we encountered in Chapter 10, where we learned methods for analyses of ranked variates paralleling anova. In other cases especially suited to ranking methods, we cannot measure the variable on an absolute scale, but only on an ordinal scale. This is typical of data in which we estimate relative performance, as in assigning positions in a class. We can say that A is the best student, B is the second-best student, C and D are equal to each other and next-best, and so on. If two instructors independently rank a group of students, we can then test whether the two sets of rankings are independent (which we would not expect if the judgments of the instructors are based on objective evidence).

Of greater biological and medical interest are the following examples. We might wish to correlate order of emergence in a sample of insects with a ranking in size, or order of germination in a sample of plants with rank order of flowering. An epidemiologist may wish to associate rank order of occurrence (by time) of an infectious disease within a community, on the one hand, with its severity as measured by an objective criterion, on the other. We present in Box 12.3 Kendall's coefficient of rank correlation, generally symbolized by τ (tau), although it is a sample statistic, not a parameter.
The formula for Kendall's coefficient of rank correlation is τ = N/[n(n − 1)], where n is the conventional sample size and N is a count of ranks, which can be obtained in a variety of ways. A second variable Y2, if it is perfectly correlated with the first variable Y1, should be in the same order as the Y1 variates. However, if the correlation is less than perfect, the order of the variates Y2 will not entirely correspond to that of Y1. The quantity N measures how well the second variable corresponds to the order of the first. It has a maximal value of n(n − 1) and a minimal value of −n(n − 1). The following small example will make this clear.
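As a rough computational illustration of the quantity N, the sketch below takes N to be twice the difference between the number of pairs of items ranked in the same order by both variables and the number ranked in opposite order. That is one way of obtaining a count with the stated maximum n(n − 1) and minimum −n(n − 1); it is an assumption made for this sketch and not necessarily the counting procedure of Box 12.3. Ties receive no correction here, and the rankings are made up.

```python
from itertools import combinations

def kendall_tau(y1, y2):
    """Return (N, tau) for two equal-length sequences of ranks, ignoring tie corrections."""
    n = len(y1)
    if n != len(y2) or n < 2:
        raise ValueError("need two sequences of equal length, n >= 2")
    score = 0
    for i, j in combinations(range(n), 2):
        product = (y1[i] - y1[j]) * (y2[i] - y2[j])
        if product > 0:
            score += 1   # pair ordered the same way by both variables
        elif product < 0:
            score -= 1   # pair ordered in opposite ways
    count_n = 2 * score  # assumed definition of N (see lead-in)
    return count_n, count_n / (n * (n - 1))

# Hypothetical rankings of five items by two variables.
print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```

In this made-up case two of the ten pairs are in reversed order, so N = 12 and τ = 12/20 = 0.6; identical rankings give τ = 1 and exactly reversed rankings give τ = −1.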
BOX 12.3
Kendall's coefficient of rank correlation, τ.
Computation of a rank correlation coefficient between the blood neutrophil
Deviation from expectation: + + + − − − − + + + +
Since expected frequencies $\hat{f}_i < 3$ for a = 13 classes should be avoided, we lump the classes at both tails with the adjacent classes to create classes of adequate size. Corresponding classes of observed frequencies $f_i$ should be lumped to match. The number of classes after lumping is a = 11. Compute G by Expression (13.4):

$$G = 2\sum_{i=1}^{a} f_i \ln\left(\frac{f_i}{\hat{f}_i}\right) = 2\left[52\ln\left(\frac{52}{\hat{f}_1}\right) + 181\ln\left(\frac{181}{\hat{f}_2}\right) + \cdots + 27\ln\left(\frac{27}{\hat{f}_{11}}\right)\right] = 94.871{,}55$$
Since there are a = 11 classes remaining, the degrees of freedom would be a − 1 = 10, if this were an example tested against expected frequencies based on an extrinsic hypothesis. However, because the expected frequencies are based on a binomial distribution with a mean $\hat{p}$ estimated from the sample proportion $p$, a further degree of freedom is removed, and the sample value of G is compared with a χ² distribution with a − 2 = 11 − 2 = 9 degrees of freedom. We applied Williams' correction to G, to obtain a better approximation to χ². In the formula computed below, ν symbolizes the pertinent degrees of freedom of the problem. We obtain
$$G_{adj} = 94.837{,}09 > \chi^2_{.001[9]} = 27.877$$
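The adjusted value can be checked numerically. The sketch below assumes the commonly cited form of Williams' correction, q = 1 + (a² − 1)/(6nν) with the adjusted statistic equal to G/q; the formula itself is not legible in this passage, so this form is an assumption. The total sample size n = 6115 is likewise assumed here, chosen because it reproduces the adjusted value quoted above.

```python
def williams_adjusted_g(g, a, n, nu):
    """Adjusted G, assuming Williams' correction q = 1 + (a^2 - 1) / (6 * n * nu)."""
    q = 1.0 + (a ** 2 - 1) / (6.0 * n * nu)
    return g / q

# G = 94.87155 with a = 11 classes and nu = 9 degrees of freedom, as given in the text;
# n = 6115 is an assumed total sample size (see lead-in).
g_adj = williams_adjusted_g(94.87155, a=11, n=6115, nu=9)
print(g_adj)
```

With a sample this large the correction factor is very close to 1, so the adjustment changes G only in the second decimal place and leaves the decisive rejection of the null hypothesis unaffected.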
The null hypothesis—that the sample data follow a binomial distribution—is therefore rejected decisively. Typically, the following degrees of freedom will pertain to G tests for goodness of fit with expected frequencies based on a hypothesis intrinsic to the sample data (a is the number of classes after lumping, if any):
Distribution    Parameters estimated from sample    df
Binomial        p̂                                   a − 2
Normal          μ, σ                                a − 3
Poisson         μ                                   a − 2
When the parameters for such distributions are estimated from hypotheses extrinsic to the sampled data, the degrees of freedom are uniformly a − 1.

2. Special case of frequencies divided in a = 2 classes: In an F2 cross in Drosophila, the following 176 progeny were obtained, of which 130 were wild-type flies and 46 ebony mutants. Assuming that the mutant is an autosomal recessive, one would expect a ratio of 3 wild-type flies to each mutant fly. To test whether the observed results are consistent with this 3:1 hypothesis, we set up the data as follows.

Flies           f           Hypothesis    f̂
Wild type       f1 = 130    p = 0.75      pn = 132.0
Ebony mutant    f2 = 46     q = 0.25      qn = 44.0
                n = 176                   176.0
Computing G from Expression (13.4), we obtain

$$G = 2\left[130\ln\left(\frac{130}{132}\right) + 46\ln\left(\frac{46}{44}\right)\right] = 0.120{,}02$$
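The two-class computation can be reproduced with a short function. This sketch takes Expression (13.4) in the form used in the calculation above, twice the sum over classes of f ln(f/f̂); the function name is a hypothetical choice, and the observed and expected frequencies are those of the Drosophila cross.

```python
import math

def g_statistic(observed, expected):
    """G = 2 * sum of f * ln(f / f_hat) over classes; empty classes contribute zero."""
    return 2.0 * sum(
        f * math.log(f / f_hat)
        for f, f_hat in zip(observed, expected)
        if f > 0
    )

observed = [130, 46]                    # wild type, ebony mutant
n = sum(observed)                       # 176 progeny
expected = [0.75 * n, 0.25 * n]         # 3:1 hypothesis gives 132.0 and 44.0
print(g_statistic(observed, expected))  # approximately 0.12, matching the value above
```

Since the hypothesis is extrinsic to the data, the statistic would be compared with a χ² distribution with a − 1 = 1 degree of freedom.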
304 CHAPTER 13 / ANALYSIS OF FREQUENCIES
BOX 13.1 Continued
Williams' correction for the two-cell case is