Statistics Explained
STEVE McKILLUP is an associate professor of biology in the School of Biological and Environmental Sciences at Central Queensland University, Rockhampton.
Statistics Explained is a reader-friendly introduction to experimental design and statistics for undergraduate students in the life sciences, particularly for those who do not have a strong mathematical background. Hypothesis testing and experimental design are discussed first. Statistical tests are then explained using pictorial examples and a minimum of formulae. This class-tested approach, along with a well-structured set of diagnostic tables, will give students the confidence to choose an appropriate test with which to analyse their own data sets. Presented in a lively and straightforward manner, Statistics Explained will give readers the depth and background necessary to proceed to more advanced texts and applications. It will therefore be essential reading for all bioscience undergraduates, and will serve as a useful refresher course for more advanced students.
Statistics Explained An Introductory Guide for Life Scientists
Designed by Hart McLeod
Steve McKillup
CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Tokyo, Mexico City
Cambridge University Press
The Edinburgh Building, Cambridge CB2 8RU, UK
www.cambridge.org
Information on this title: www.cambridge.org/9780521543163
© S. McKillup 2005
This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
First published 2006
7th printing 2011
Printed in the United Kingdom at the University Press, Cambridge
A catalogue record for this publication is available from the British Library
ISBN 978-0-521-83550-3 hardback
ISBN 978-0-521-54316-3 paperback
Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication, and does not guarantee that any content on such websites is, or will remain, accurate or appropriate. Information regarding prices, travel timetables and other factual information given in this work is correct at the time of first printing but Cambridge University Press does not guarantee the accuracy of such information thereafter.
Contents

Preface

1 Introduction
  1.1 Why do life scientists need to know about experimental design and statistics?
  1.2 What is this book designed to do?

2 ‘Doing science’ – hypotheses, experiments, and disproof
  2.1 Introduction
  2.2 Basic scientific method
  2.3 Making a decision about an hypothesis
  2.4 Why can’t an hypothesis or theory ever be proven?
  2.5 ‘Negative’ outcomes
  2.6 Null and alternate hypotheses
  2.7 Conclusion

3 Collecting and displaying data
  3.1 Introduction
  3.2 Variables, experimental units, and types of data
  3.3 Displaying data
  3.4 Displaying ordinal or nominal scale data
  3.5 Bivariate data
  3.6 Multivariate data
  3.7 Summary and conclusion

4 Introductory concepts of experimental design
  4.1 Introduction
  4.2 Sampling – mensurative experiments
  4.3 Manipulative experiments
  4.4 Sometimes you can only do an unreplicated experiment
  4.5 Realism
  4.6 A bit of common sense
  4.7 Designing a ‘good’ experiment
  4.8 Conclusion

5 Probability helps you make a decision about your results
  5.1 Introduction
  5.2 Statistical tests and significance levels
  5.3 What has this got to do with making a decision or statistical testing?
  5.4 Making the wrong decision
  5.5 Other probability levels
  5.6 How are probability values reported?
  5.7 All statistical tests do the same basic thing
  5.8 A very simple example – the chi-square test for goodness of fit
  5.9 What if you get a statistic with a probability of exactly 0.05?
  5.10 Statistical significance and biological significance
  5.11 Summary and conclusion

6 Working from samples – data, populations, and statistics
  6.1 Using a sample to infer the characteristics of a population
  6.2 Statistical tests
  6.3 The normal distribution
  6.4 Samples and populations
  6.5 Your sample mean may not be an accurate estimate of the population mean
  6.6 What do you do when you only have data from one sample?
  6.7 Why are the statistics that describe the normal distribution so important?
  6.8 Distributions that are not normal
  6.9 Other distributions
  6.10 Other statistics that describe a distribution
  6.11 Conclusion

7 Normal distributions – tests for comparing the means of one and two samples
  7.1 Introduction
  7.2 The 95% confidence interval and 95% confidence limits
  7.3 Using the Z statistic to compare a sample mean and population mean when population statistics are known
  7.4 Comparing a sample mean with an expected value
  7.5 Comparing the means of two related samples
  7.6 Comparing the means of two independent samples
  7.7 Are your data appropriate for a t test?
  7.8 Distinguishing between data that should be analysed by a paired sample test or a test for two independent samples
  7.9 Conclusion

8 Type 1 and Type 2 errors, power, and sample size
  8.1 Introduction
  8.2 Type 1 error
  8.3 Type 2 error
  8.4 The power of a test
  8.5 What sample size do you need to ensure the risk of Type 2 error is not too high?
  8.6 Type 1 error, Type 2 error, and the concept of biological risk
  8.7 Conclusion

9 Single factor analysis of variance
  9.1 Introduction
  9.2 Single factor analysis of variance
  9.3 An arithmetic/pictorial example
  9.4 Unequal sample sizes (unbalanced designs)
  9.5 An ANOVA does not tell you which particular treatments appear to be from different populations
  9.6 Fixed or random effects

10 Multiple comparisons after ANOVA
  10.1 Introduction
  10.2 Multiple comparison tests after a Model I ANOVA
  10.3 An a-posteriori Tukey comparison following a significant result for a single factor Model I ANOVA
  10.4 Other a-posteriori multiple comparison tests
  10.5 Planned comparisons

11 Two factor analysis of variance
  11.1 Introduction
  11.2 What does a two factor ANOVA do?
  11.3 How does a two factor ANOVA analyse these data?
  11.4 How does a two factor ANOVA separate out the effects of each factor and interaction?
  11.5 An example of a two factor analysis of variance
  11.6 Some essential cautions and important complications
  11.7 Unbalanced designs
  11.8 More complex designs

12 Important assumptions of analysis of variance: transformations and a test for equality of variances
  12.1 Introduction
  12.2 Homogeneity of variances
  12.3 Normally distributed data
  12.4 Independence
  12.5 Transformations
  12.6 Are transformations legitimate?
  12.7 Tests for heteroscedasticity

13 Two factor analysis of variance without replication, and nested analysis of variance
  13.1 Introduction
  13.2 Two factor ANOVA without replication
  13.3 A-posteriori comparison of means after a two factor ANOVA without replication
  13.4 Randomised blocks
  13.5 Nested ANOVA as a special case of a one factor ANOVA
  13.6 A pictorial explanation of a nested ANOVA
  13.7 A final comment on ANOVA – this book is only an introduction

14 Relationships between variables: linear correlation and linear regression
  14.1 Introduction
  14.2 Correlation contrasted with regression
  14.3 Linear correlation
  14.4 Calculation of the Pearson r statistic
  14.5 Is the value of r statistically significant?
  14.6 Assumptions of linear correlation
  14.7 Summary and conclusion

15 Simple linear regression
  15.1 Introduction
  15.2 Linear regression
  15.3 Calculation of the slope of the regression line
  15.4 Calculation of the intercept with the Y axis
  15.5 Testing the significance of the slope and the intercept of the regression line
  15.6 An example – mites that live in your hair follicles
  15.7 Predicting a value of Y from a value of X
  15.8 Predicting a value of X from a value of Y
  15.9 The danger of extrapolating beyond the range of data available
  15.10 Assumptions of linear regression analysis
  15.11 Further topics in regression

16 Non-parametric statistics
  16.1 Introduction
  16.2 The danger of assuming normality when a population is grossly non-normal
  16.3 The value of making a preliminary inspection of the data

17 Non-parametric tests for nominal scale data
  17.1 Introduction
  17.2 Comparing observed and expected frequencies – the chi-square test for goodness of fit
  17.3 Comparing proportions among two or more independent samples
  17.4 Bias when there is one degree of freedom
  17.5 Three-dimensional contingency tables
  17.6 Inappropriate use of tests for goodness of fit and heterogeneity
  17.7 Recommended tests for categorical data
  17.8 Comparing proportions among two or more related samples of nominal scale data

18 Non-parametric tests for ratio, interval, or ordinal scale data
  18.1 Introduction
  18.2 A non-parametric comparison between one sample and an expected distribution
  18.3 Non-parametric comparisons between two independent samples
  18.4 Non-parametric comparisons among more than two independent samples
  18.5 Non-parametric comparisons of two related samples
  18.6 Non-parametric comparisons among three or more related samples
  18.7 Analysing ratio, interval, or ordinal data that show gross differences in variance among treatments and cannot be satisfactorily transformed
  18.8 Non-parametric correlation analysis
  18.9 Other non-parametric tests

19 Choosing a test
  19.1 Introduction

20 Doing science responsibly and ethically
  20.1 Introduction
  20.2 Dealing fairly with other people’s work
  20.3 Doing the experiment
  20.4 Evaluating and reporting results
  20.5 Quality control in science

References
Index
Preface
If you mention ‘statistics’ or ‘biostatistics’ to life scientists, they often look nervous. Many fear or dislike mathematics, but an understanding of statistics and experimental design is essential for graduates, postgraduates, and researchers in the biological, biochemical, health, and human movement sciences.

Since this understanding is so important, life science students are usually made to take some compulsory undergraduate statistics courses. Nevertheless, I found that a lot of graduates (and postgraduates) were unsure about designing experiments and had difficulty knowing which statistical test to use (and which ones not to!) when analysing their results. Some even told me they had found statistics courses ‘boring, irrelevant and hard to understand’. It seemed there was a problem with the way many introductory biostatistics courses were presented, which was making students uninterested and preventing them from understanding the concepts needed to progress to higher-level courses and more complex statistical applications.

There seemed to be two major reasons for this problem, and as a student I encountered both. First, a lot of statistics textbooks take a mathematical approach and often launch into considerable detail and pages of daunting-looking formulae without any straightforward explanation about what statistical testing really does. Second, introductory biostatistics courses are often taught in a way that does not cater for life science students who may lack a strong mathematical background.

When I started teaching at Central Queensland University I thought there had to be a better way of introducing essential concepts of biostatistics and experimental design. It had to start from first principles and develop an understanding that could be applied to all statistical tests. It had to demystify what these tests actually did and explain them with a minimum of formulae and terminology. It had to relate statistical concepts to experimental design. And, finally, it had to build a strong understanding to help the student progress to more complex material. I tried this approach with my undergraduate classes and the response from a lot of students, including some postgraduates who sat in on the course, was, ‘Hey Steve, you should write an introductory stats book!’

Ward Cooper suggested I submit a proposal for this sort of book to Cambridge University Press. Ruth McKillup read, commented on, and reread several drafts, provided constant encouragement, and tolerated my absent-mindedness. My students, especially Steve Dunbar, Kevin Strychar, and Glenn Druery, encouraged me to start writing and my friends and colleagues, especially Dearne Mayer and Sandy Dalton, encouraged me to finish. Finally, I sincerely thank the anonymous reviewers of the initial proposal and the subsequent manuscript who, without exception, made most appropriate suggestions for improvement.
1 Introduction

1.1 Why do life scientists need to know about experimental design and statistics?
If you work on living things it is usually impossible to get data from every individual of the group or species in question. Imagine trying to measure the length of every anchovy in the Pacific Ocean, the haemoglobin count of every adult in the USA, the diameter of every pine tree in a plantation of 200 000, or the individual protein content of 10 000 prawns in a large aquaculture pond.

The total number of individuals of a particular species present in a defined area is often called the population. Since a researcher usually cannot measure every individual in the population (unless they are studying the few remaining members of an endangered species), they have to work with a carefully selected subset containing several individuals, often called experimental units, that they hope is a representative sample from which the characteristics of the population can be inferred. You can also think of a population as the total number of artificial experimental units possible (e.g. the 125 567 plots of 1 m² that would cover a coral reef) and your sample being the subset (e.g. 20 plots) you have chosen to work with.

The best way to get a representative sample is usually to choose a proportion of the population at random – without bias, with every possible experimental unit having an equal chance of being selected. The trouble with this approach is that there are often great differences among experimental units from the same population. Think of the people you have seen today – unless you met some identical twins (or triplets etc.), no two would have been the same. Even species that seem to be made up of similar looking individuals (like flies or cockroaches or snails) show great variability. This leads to several problems.
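Choosing experimental units at random can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plots are imagined to be numbered from 0 to 125 566, and the function name `choose_plots` and the seed value are hypothetical, not part of any published method.

```python
import random

# Assumption for illustration: the 125 567 one-square-metre plots on the
# reef have been numbered 0 to 125 566.
N_PLOTS = 125_567
SAMPLE_SIZE = 20


def choose_plots(n_plots, sample_size, seed=None):
    """Return plot numbers chosen at random, without replacement.

    Every plot has the same chance of being selected, which is what
    'without bias' means in the text above.
    """
    rng = random.Random(seed)  # a seed makes the choice repeatable
    return sorted(rng.sample(range(n_plots), sample_size))


plots = choose_plots(N_PLOTS, SAMPLE_SIZE, seed=1)
print(plots)
```

Running it again with a different seed gives a different set of 20 plots, which is exactly why two random samples from one population need not look alike.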
Figure 1.1 Even a random sample may not necessarily be a good representative of the population. Two samples have been taken at random from the same population. By chance, sample 1 contains a group of relatively large fish, while those in sample 2 are relatively small. (Panels: Population; Sample 1; Sample 2.)
First, even a random sample may not be a good representative of the population from which it has been taken (Figure 1.1). For example, you may choose students for an exercise experiment who are, by chance, far less (or far more) physically fit than the population of the college they represent; a batch of seed chosen at random may not represent the variability present in all seed of that species; and a sample of mosquitoes from a particular place may have very different insecticide resistance than the same species from elsewhere.
Figure 1.2 Samples selected at random from very different populations may not necessarily be different. Simply by chance sample 1 and sample 2 are similar. (Panels: Population 1; Sample 1; Population 2; Sample 2.)
Therefore, if you take a random sample from each of two similar populations, the samples may be different to each other simply by chance. On the basis of this you might mistakenly conclude that the two populations are very different. You need some way of knowing if the difference between samples is one you would expect by chance, or whether the populations really do seem to be different. Second, even if two populations are very different, samples from each may be similar, and give the misleading impression the populations are also similar (Figure 1.2).
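A small simulation may help show how different two random samples from the same population can be. The sketch below is not taken from real data: the population of fish lengths (mean 30 cm, standard deviation 5 cm), the sample size of six, and the seed are all invented for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Invented population: lengths (cm) of 10 000 fish.
population = [rng.gauss(30.0, 5.0) for _ in range(10_000)]


def sample_mean(pop, n):
    """Mean length of n fish drawn at random without replacement."""
    return statistics.mean(rng.sample(pop, n))


# Two samples of six fish from the SAME population.
mean_1 = sample_mean(population, 6)
mean_2 = sample_mean(population, 6)
print(f"sample 1 mean: {mean_1:.2f} cm, sample 2 mean: {mean_2:.2f} cm")

# Repeat the exercise many times to see how large a difference
# can arise by chance alone.
diffs = [
    abs(sample_mean(population, 6) - sample_mean(population, 6))
    for _ in range(1000)
]
print(f"typical difference: {statistics.median(diffs):.2f} cm")
print(f"largest difference: {max(diffs):.2f} cm")
```

Because both samples always come from one population, every difference printed here is due to chance. Deciding whether an observed difference is larger than chance would usually produce is the job of a statistical test.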
Figure 1.3 Two samples of fish were taken from the same population and deliberately matched so that six equal-sized individuals were initially present in each group. Fish in the treatment group were fed a vitamin supplement for 300 days, while those in the untreated control group were not. The supplement caused each fish in the treatment group to grow about 10% longer, but this difference is small compared with the variation in growth among individuals, which may obscure any effect of treatment. (Panels: Control group and Treatment group, each shown before the experiment and after 300 days.)
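The situation in Figure 1.3 can be imitated with a short simulation. The six matched groups and the 10% effect come from the caption; the starting lengths and the range of individual growth (each fish ending up between 1.2 and 2.0 times its starting length) are assumptions made up for this sketch.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Invented starting lengths (cm); the same six sizes occur in each group
# because the groups were deliberately matched.
initial_lengths = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]


def final_length(initial, supplement):
    """Length after 300 days for one fish.

    Assumption: natural growth varies widely among individuals,
    multiplying length by anything from 1.2 to 2.0.
    """
    length = initial * rng.uniform(1.2, 2.0)
    # The supplement makes a treated fish about 10% longer.
    return length * 1.10 if supplement else length


control = [final_length(x, supplement=False) for x in initial_lengths]
treatment = [final_length(x, supplement=True) for x in initial_lengths]

for name, group in (("control", control), ("treatment", treatment)):
    print(
        f"{name:9s} mean {statistics.mean(group):5.1f} cm, "
        f"range {min(group):5.1f} to {max(group):5.1f} cm"
    )
```

With only six fish per group the ranges of the two groups usually overlap a great deal, so the 10% effect that was built into the simulation is not obvious from the raw numbers.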
Finally, natural variation among individuals within a sample may obscure any effect of an experimental treatment (Figure 1.3). There is often so much variation within a sample (and a population) that an effect of treatment may be difficult or impossible to detect. For example, what would you conclude if you found that 50 people given a newly synthesised drug showed an average decrease in blood pressure, but when you looked more closely at the group you found that blood pressure remained unchanged for 25, decreased markedly for 15, and increased slightly for the remaining 10? Has the drug really had an effect? What if tomato plants treated with a new fertiliser yielded from 1.5 to 9 kg of fruit per plant,
compared with 1.5 to 7.5 kg per plant in an untreated group? Would you conclude there was a meaningful difference between these two groups?

These sorts of problems are usually unavoidable when you work with samples and mean that a researcher has to take every possible precaution to try to ensure their samples are likely to be representative and thus give a good estimate of conditions in the population. Researchers need to know how to sample. They also need a good understanding of experimental design, because a good design will take natural variation into account and also minimise additional unwanted variation introduced by the experimental procedure itself. They also need to take accurate and precise measurements to minimise other sources of error.

Finally, considering the variability among samples described above, the results of an experiment may not be clear-cut. So it is often difficult to make a decision about a difference between samples from different populations or different experimental treatments. Is it the sort of difference you would expect by chance, or are the populations really different? Is the experimental treatment having an effect? You need something to help you decide, and that is what statistical tests do, by calculating the probability of obtaining a particular difference among samples. Once you have the probability, the decision is up to you. So you need to understand how statistical tests work!

1.2 What is this book designed to do?
An understanding of experimental design and statistics is important, whether you are a biomedical scientist, ecologist, entomologist, genetic engineer, microbiologist, nursing professional, taxonomist, or human movement scientist, so most life science students are made to take a general introductory statistics course. Many of these courses take a detailed mathematical approach that a lot of life scientists find uninspiring. This book is an introduction that does not assume a strong mathematical background. Instead, it develops a conceptual understanding of how statistical tests actually work, using pictorial explanations where possible and a minimum of formulae. If you have read other texts, or have already done an introductory course, you may find that the way this material is presented is unusual, but I have found that non-statisticians find this approach very easy to
understand and sometimes even entertaining. If you have a background in statistics you may find some sections a little too explanatory, but at the same time they are likely to make sense. This book most certainly will not teach you everything about the subject areas, but it will help you decide what sort of statistical test to use and what the results mean. It will also help you understand and criticise the experimental designs of others. Most importantly, it will help you design and analyse your own experiments, understand more complex experimental designs, and move on to more advanced statistical courses.
2 ‘Doing science’ – hypotheses, experiments, and disproof

2.1 Introduction
Before starting on experimental design and statistics, it is important to be familiar with how science is done. This is a summary of a very conventional view of scientific method.

2.2 Basic scientific method
The essential features of the ‘hypothetico-deductive’ view of scientific method (see Popper, 1968) are that a person observes or samples the natural world and uses all the information available to make an intuitive, logical guess, called an hypothesis, about how the system functions. The person has no way of knowing if their hypothesis is correct – it may or may not apply. Predictions made from the hypothesis are tested, either by further sampling or by doing experiments. If the results are consistent with the predictions then the hypothesis is retained. If they are not, it is rejected, and a new hypothesis formulated (Figure 2.1). The initial hypothesis may come about as a result of observations, sampling, and/or reading the scientific literature.

Here is an example from ecological entomology. The Portuguese millipede Ommatoiulus moreleti was accidentally introduced into southern Australia from Portugal in the 1950s. This millipede lives in leaf litter and grows to about four centimetres long. In the absence of natural enemies from its country of origin (especially European hedgehogs, which eat a lot of millipedes), its numbers rapidly increased to plague proportions in South Australia. Although it causes very little damage to agricultural crops, O. moreleti is a serious ‘nuisance’ pest because it invades houses. In heavily infested areas of South Australia during the late 1980s it
Figure 2.1 The process of hypothesis formulation and testing. (Flowchart: Observations, previous work, ‘intuition’ → Hypothesis → Prediction from hypothesis → Test of prediction, which leads either to ‘Result consistent with prediction’ → ‘Hypothesis survives and is retained’, or to ‘Result not consistent with prediction’ → ‘Hypothesis is rejected’.)
used to be common to find over 1000 millipedes invading a moderate sized house in just one night. When you disturb one of these millipedes it ejects a smelly yellow defensive secretion. Once inside a house the millipedes would crawl across the floor, up the walls, and over the ceiling, where they fell into food and on to the faces and even into the open mouths of sleeping people. When accidentally crushed underfoot they stained carpets and floors, and smelt. The problem was so great that almost half a million dollars was spent on research to control this pest. While working on ways to reduce the nuisance caused by the Portuguese millipede I noticed that householders who reported severe problems had well-lit houses with large, uncurtained windows. In contrast, nearby neighbours whose houses were not so well lit, and who closed their curtains at night, reported far fewer millipedes inside. The numbers of O. moreleti per square metre were similar in the leaf litter around both types of houses. From these observations and very limited sampling of less than ten houses, I formulated the hypothesis, ‘Portuguese millipedes are attracted to visible light at night.’ I had no way of knowing whether this simple hypothesis was the reason for home invasions by millipedes, but it seemed logical from my observations.
Figure 2.2 Arrangement of a 2 × 5 grid of lit and unlit tiles across a field where millipedes were abundant. Filled squares indicate unlit tiles and open squares indicate lit tiles. (Columns are numbered 1 to 5.)
From this hypothesis it was straightforward to predict, ‘At night, in a field where Portuguese millipedes are abundant, more will be present on white tiles illuminated by visible light than on unlit white tiles.’ This prediction was tested by doing a simple and inexpensive manipulative field experiment with two treatments – lit tiles and a control treatment of unlit tiles. Since any difference in millipede numbers between one lit and one unlit tile might occur just by chance or because of some other unknown factor(s), the two treatments were replicated five times.

I set up ten identical white ceramic floor tiles in a two-row, five-column rectangular grid in a field where millipedes were abundant (Figure 2.2). For each column of two tiles, I tossed a coin to decide which of the pair was going to be lit. The other tile was left unlit. Having one lit tile in each column ensured that replicates of both the treatment and control were dispersed across the field rather than having all the treatment tiles clustered together, and was a precaution in case the number of millipedes per square metre varied across the field. The coin tossing eliminated any likelihood that I might subconsciously place the lit tile of each pair in an area where millipedes were more common.

I hammered a thin two-metre-long wooden stake vertically into the ground next to each tile. For every one of the lit tiles I attached a pocket torch to its stake and made sure the light shone on the tile. I started the experiment at dusk by turning on the torches. Three hours later I went back and counted the numbers of millipedes on all tiles. The tiles within each treatment were the experimental units (Chapter 1). From this experiment there were at least four possible outcomes:
1 No millipedes were present on the unlit tiles but lots were present on each of the lit tiles. This result is consistent with the hypothesis, which has survived this initial test and can be retained.
2 High and similar numbers of millipedes were present on both the lit and unlit tiles. This is not consistent with the hypothesis, which can probably be rejected since it seems light has no effect.
3 No (or very few) millipedes were present on any tiles. It is difficult to know if this has any bearing on the hypothesis – there may be a fault with the experiment (e.g. the tiles were themselves repellent or perhaps too slippery, or millipedes may not have been active that night). The hypothesis is neither rejected nor retained.
4 Lots of millipedes were present on the unlit tiles, but none were present on the lit ones. This is a most unexpected outcome that is not consistent with the hypothesis, which is extremely likely to be rejected.

These are the four simplest outcomes. A more complicated and much more likely one is that you find some millipedes on the tiles in both treatments, and that is what happened – see McKillup (1988). This sort of outcome is a problem, because you need to decide if light is having an effect on the millipedes, or whether the difference in numbers between lit and unlit treatments is simply happening by chance. Here statistical testing is extremely useful and necessary because it helps you decide whether a difference between treatments is meaningful.
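The coin-toss allocation of lit and unlit tiles used in this experiment can be sketched in Python. This is an illustration only, assuming a grid of two rows and five columns as in Figure 2.2; the function name `assign_tiles` and the seed are hypothetical.

```python
import random


def assign_tiles(n_columns=5, seed=None):
    """Allocate one lit and one unlit tile to each column of a 2-row grid.

    For every column a 'coin toss' decides which of the two tiles is lit,
    so the person setting up the experiment has no say in where the lit
    tiles go.
    """
    rng = random.Random(seed)
    grid = [[None] * n_columns for _ in range(2)]
    for col in range(n_columns):
        lit_row = rng.choice([0, 1])  # the coin toss for this column
        grid[lit_row][col] = "lit"
        grid[1 - lit_row][col] = "unlit"
    return grid


layout = assign_tiles(seed=3)
for row in layout:
    print(" ".join(f"{tile:5}" for tile in row))
```

Restricting the toss to within each column is what keeps lit and unlit tiles spread across the whole field. Tossing a coin for all ten tiles independently could, by chance, put most of the lit tiles at one end.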
2.3 Making a decision about an hypothesis
Once you have the result of the experimental test of an hypothesis, two things can happen: either the results of the experiment are consistent with the hypothesis, which is retained; or the results are inconsistent with the hypothesis, which may be rejected. If the hypothesis is rejected it is likely to be wrong and another will need to be proposed. If the hypothesis is retained, withstands further testing, and has some very widespread generality, it may progress to become a theory. But a
theory is only ever a very general hypothesis that has withstood repeated testing. There is always a possibility it may be disproven in the future.
2.4 Why can’t an hypothesis or theory ever be proven?
No hypothesis or theory can ever be proven – one day there may be evidence that rejects it and leads to a different explanation (which can include all the successful predictions of the previous hypothesis). Consequently we can only falsify or disprove hypotheses and theories – we can never ever prove them. Cases of disproof and a subsequent change in thinking are common. Here are two examples. Medical researchers used to believe that excess stomach acidity was responsible for the majority of gastric ulcers in humans. There was a radical change in thinking when many ulcers healed following antibiotic therapy designed to reduce numbers of the bacterium Helicobacter pylori in the stomach wall. There have been at least three theories of how the human kidney produces a concentrated solution of urine, and the latest may not necessarily be correct.
2.5 ‘Negative’ outcomes
People are often quite disappointed if the outcome of an experiment is not what they expected and their hypothesis is rejected. But there is nothing wrong with this – rejection of an hypothesis is still progress in the process of understanding how a system functions. Therefore, a ‘negative’ outcome that causes you to reject a cherished hypothesis is just as important as a ‘positive’ one that causes you to retain it. Unfortunately researchers tend to be very possessive and protective of their hypotheses, and there have been cases where results have been falsified in order to allow an hypothesis to survive. This does not advance our understanding of the world and is likely to be detected when other scientists repeat the experiments or do further experiments based on these false conclusions. There will be more about this in Chapter 20, which is about doing science responsibly and ethically.
2.6 Null and alternate hypotheses
It is scientific convention that when you test an hypothesis you state it as two hypotheses, which are essentially alternates. For example, the hypothesis, ‘Portuguese millipedes are attracted to visible light at night’, is usually stated in combination with, ‘Portuguese millipedes are not attracted to visible light at night’. The latter includes all cases not included in the first hypothesis (e.g. no response, or avoidance of visible light). These hypotheses are called the alternate and null hypotheses respectively. Importantly, the null hypothesis is always stated as the hypothesis of ‘no difference’ or ‘no effect’. So, looking at the two hypotheses above, the second ‘are not’ hypothesis is the null hypothesis and the first is the alternate hypothesis. This is a tedious but very important convention (because it clearly states the hypothesis and its alternative) and there will be several reminders in this book.

Box 2.1 Two other views about scientific method

Popper’s hypothetico-deductive philosophy of scientific method, where hypotheses are sequentially tested and always at risk of being rejected, is widely accepted. In reality, however, scientists may do things a little differently.

Kuhn (1970) argues that scientific enquiry does not necessarily proceed with the steady testing and survival or rejection of hypotheses. Instead, hypotheses with some generality and which have survived initial testing become well-established theories or ‘paradigms’, which are relatively immune to rejection even if subsequent testing may find evidence against them. A few negative results are used to refine the paradigm to make it continue to fit all available evidence. It is only when the negative evidence becomes overwhelming that the paradigm is rejected and replaced by a new one.

Lakatos (1978) also argues that a strict hypothetico-deductive process of scientific enquiry does not necessarily occur. Instead, fields of enquiry, called ‘research programmes’, are based on a set of ‘core’ theories that are rarely questioned or tested. The core is surrounded by a protective ‘belt’ of theories that are tested. A successful research programme is one that accumulates more and more theories that have survived testing within the belt, which provides increasing protection for the core. If, however, many of the belt theories are rejected, doubt will eventually be cast on the veracity of the core and of the research programme itself, which will be replaced by a more successful one.

These two views and the hypothetico-deductive view are not irreconcilable. In all cases observations and experiments provide evidence either for or against an hypothesis or theory. In the hypothetico-deductive view science proceeds by the orderly testing and survival or rejection of individual hypotheses, while the other two views reflect the complexity of theories required to describe a research area and emphasise that it would be foolish to reject a theory outright on the basis of limited negative evidence.
2.7
Conclusion
There are five components to an experiment: (1) formulating an hypothesis, (2) making a prediction from the hypothesis, (3) doing an experiment or sampling to test the prediction, (4) analysing the data, and (5) deciding whether to retain or reject the hypothesis. The description of scientific method given here is extremely simple and basic and there has been an enormous amount of philosophical debate about how science is done (see Box 2.1). For example, more than one hypothesis might explain a set of observations and it may be difficult to test these by progressively considering each one against its null. For further reading, Chalmers (1999) gives a very readable and clearly explained discussion of the process and philosophy of scientific discovery.
3
Collecting and displaying data
3.1
Introduction
One way of generating hypotheses is to collect data and look for patterns. Often, however, it is difficult to see any pattern from a set of data, which may just be a list of numbers. Graphs and descriptive statistics are very useful for summarising and displaying data in ways that may reveal patterns. This chapter describes the different types of data you are likely to encounter and discusses ways of displaying them.
3.2
Variables, experimental units, and types of data
The particular attributes you measure when you collect data are called variables (e.g. body temperature, the numbers of a particular species of beetle per broad bean pod, the amount of fungal damage per leaf, or the numbers of brown and albino mice). These data are collected from each experimental unit, which may be an individual (e.g. a human being or a whale) or a defined item (e.g. a square metre of the seabed, a leaf, or a lake). If you only measure one variable per experimental unit, the data set is univariate. Data for two variables per unit are bivariate, while data for three or more variables measured on the same experimental unit are multivariate.
Variables can be measured on four scales – ratio, interval, ordinal, or nominal.
A ratio scale describes a variable whose numerical values truly indicate the quantity being measured.
* There is a true zero point below which you cannot have any data (for example, if you are measuring the lengths of lizards, you cannot have a lizard of negative length).
* An increase of the same numerical amount indicates the same quantity across the range of measurements (for example, a 2 cm and a 40 cm lizard will have grown by the same amount, if they both increase in length by 10 cm).
* A particular ratio holds across the range of the variable (for example, a 40 cm lizard is 20 times longer than a 2 cm lizard and a 100 cm lizard is also 20 times longer than a 5 cm lizard).
An interval scale describes a variable that can be less than zero.
* The zero point is arbitrary (for example, temperature measured in degrees Celsius has a zero point at which water freezes), so negative values are possible. The true zero point for temperature, where there is a complete absence of heat, is zero kelvin (about –273 °C), so unlike the Celsius scale the kelvin scale is a ratio scale.
* An increase of the same numerical amount indicates the same quantity across the range of measurements (for example, a 2 °C increase indicates the same increase in heat whatever the starting temperature).
* Since the zero point is arbitrary, a particular ratio does not hold across the range of the variable (for example, the ratio of 6 °C compared with 1 °C is not the same as 60 °C with 10 °C. The two ratios in terms of the kelvin scale are 279 : 274 K and 333 : 283 K).
An ordinal scale applies to data where values are ranked – given a value that simply indicates their relative order. These ranks do not necessarily indicate constant differences. For example, five children of ages 2, 7, 9, 10, and 16 years have been aged on a ratio scale. If, however, you rank these ages in order from the youngest to the oldest (e.g. as ranks 1 to 5), the data have been reduced to an ordinal scale. Child 2 is not necessarily twice as old as child 1.
* An increase in the same numerical amount of ranks does not necessarily hold across the range of the variable.
A nominal scale applies to data where the values are classified according to an attribute. For example, if there are only two possible forms of coat colour in mice, then a sample of mice can be subdivided into the numbers within each of these two attributes.
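The arithmetic behind the ratio, interval, and ordinal examples above can be checked with a few lines of code. The following is only a minimal sketch in Python (my choice of language, not something the text prescribes); the helper `to_kelvin` is a hypothetical name, and it uses the rounded offset of 273 that the worked example uses rather than 273.15.

```python
def to_kelvin(celsius):
    """Convert degrees Celsius to kelvin using the rounded offset of 273."""
    return celsius + 273

# Ratio scale: equal ratios of lizard lengths mean the same relative size.
lizard_ratios = (40 / 2, 100 / 5)

# Interval scale: the Celsius numbers give the same apparent ratio (6.0) ...
celsius_ratios = (6 / 1, 60 / 10)
# ... but on the kelvin scale, which has a true zero, the two ratios differ.
kelvin_ratios = (to_kelvin(6) / to_kelvin(1), to_kelvin(60) / to_kelvin(10))

# Ordinal scale: ranking the five ages keeps their order but not their spacing.
ages = [2, 7, 9, 10, 16]
ranks = {age: rank for rank, age in enumerate(sorted(ages), start=1)}

print(lizard_ratios)
print(celsius_ratios)
print(kelvin_ratios)
print(ranks)
```

The Celsius figures produce the same number for both pairs of temperatures even though the amounts of heat involved are in quite different proportions, which is why a ratio calculated on an interval scale should not be interpreted.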
The first three categories described above can include either continuous or discrete data. Nominal scale data (since they are attributes) can only be discrete. Continuous data can have any value within a range. For example, any value of temperature is possible within the range from 10 °C to 20 °C, such as 15.3 °C or 17.82 °C. Discrete data are very different to continuous data, because they can only have fixed numerical values within a range. For example, the number of offspring produced increases from one fixed whole number to the next, because you cannot have a fraction of an offspring. It is important that you know what type of data you are dealing with, because this will be one of the factors that determines your choice of statistical test.
3.3
Displaying data
A list of data may reveal very little, but a pictorial summary is a way of exploring the data that might help you notice a pattern, which can help generate or test hypotheses.
3.3.1
Histograms
Here is a list of the number of visits made to a medical doctor during the previous six months by a sample of 60 students chosen at random from a first-year university biostatistics class of 600. These data are univariate, ratio scaled, and discrete:
1, 11, 2, 1, 10, 2, 1, 1, 1, 1, 12, 1, 6, 2, 1, 2, 2, 7, 1, 2, 1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 1, 4, 6, 9, 1, 2, 8, 1, 9, 1, 8, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 8, 1, 2, 1, 1, 1, 1, 7
It is difficult to see any pattern from this list of numbers, but you could summarise and display these data by drawing a histogram. To do this you separately count the number (the frequency) of cases for students who visited a medical doctor never, once, twice, three times, through to the maximum number of visits and plot these as a series of rectangles on a graph with the X axis showing the number of visits and the Y axis the number of students in each of these cases. Figure 3.1 shows a histogram for the data. This visual summary shows that the distribution is skewed to the right – most students make few visits to a medical doctor, but there is a long ‘tail’ (and perhaps even a separate group) who have made six or more visits.
Figure 3.1 The number of visits made to a medical doctor during the past six months for 60 students chosen at random from a first-year biostatistics class of 600. [Histogram: X axis, number of visits to a medical doctor (1 to 12); Y axis, number of students (0 to 40).]
Incidentally, looking at the graph you may be a little suspicious, since every student made at least one visit. When the class was asked about this, it was found that every student was required to undergo a routine medical examination during their first year at university, so these data are somewhat misleading in terms of indicating the health of the group. You may be tempted to draw a line joining the midpoints of the tops of each bar to indicate the shape of the distribution, but this implies that the data on the X axis are continuous, which is not the case since visits are discrete whole numbers.
3.3.2
Frequency polygons or line graphs
If the data are continuous it is appropriate to draw a line linking the midpoint of the tops of each bar in a histogram. Here is an example for some continuous data that can be summarised as a histogram or as a frequency polygon (often called a line graph). The time a person takes to respond to a stimulus is called their reaction time. This can be easily measured in the laboratory by getting them to press a button as soon as they see a light flash. The time elapsing between the instant of the flash and when the button is pressed is defined as the reaction time. A researcher suspected that an abnormally long reaction time might be a useful way of making an early diagnosis of certain neurological diseases, so they chose a group of 30 students at random from a first year biomedical science class and measured their reaction times in seconds. These data are shown below. Here too, nothing is very obvious from this list:
0.70, 0.50, 1.20, 0.80, 0.30, 0.34, 0.56, 0.41, 0.30, 1.20, 0.40, 0.64, 0.52, 0.38, 0.62, 0.47, 0.24, 0.55, 0.57, 0.61, 0.39, 0.55, 0.49, 0.41, 0.72, 0.71, 0.68, 0.49, 1.10, 0.59
Since the data are continuous, they are not as easy to summarise as the discrete data in Figure 3.1. To display a histogram for continuous data you need to subdivide the data into the frequency of cases within a series of intervals of equal width. First you need to look at the range of the data (here reaction time varies from a minimum of 0.24 through to a maximum of 1.20 seconds) and decide on an interval width that will give you an informative display of the data. Here the chosen width is 0.099. Therefore, starting from 0.20, this will give 11 intervals, the first of which is 0.20–0.29. The chosen interval width needs to be one that shows the shape of the distribution. There would be no point in choosing a width that included all the data in just two intervals because you would only have two bars on the histogram. Nor would there be any point in choosing more than 20 intervals because this would give a lot of bars containing only a few data, which would be unlikely to reveal the shape of the distribution.
Once you have decided on an appropriate interval size, you need to count the number of students with a response time that falls within each (Table 3.1) and plot these frequencies on the Y axis against the intervals (indicated by the midpoint of each interval) on the X axis. This has been done in Figure 3.2(a). Finally, the midpoints of the tops of each rectangle have been joined by a line to give a frequency polygon, or line graph (Figure 3.2(b)).
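The two kinds of tally described in Sections 3.3.1 and 3.3.2 can be sketched in code. This is a minimal illustration in Python rather than the procedure of any particular statistics package; `bin_counts` is a hypothetical helper, and it assumes the values are recorded to two decimal places and that intervals begin every 0.10 seconds from 0.20.

```python
from collections import Counter

# Discrete data: visits to a medical doctor by 60 students (Section 3.3.1).
visits = [1, 11, 2, 1, 10, 2, 1, 1, 1, 1, 12, 1, 6, 2, 1, 2, 2, 7, 1, 2,
          1, 1, 1, 1, 1, 3, 1, 2, 1, 2, 1, 4, 6, 9, 1, 2, 8, 1, 9, 1,
          8, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 8, 1, 2, 1, 1, 1, 1, 7]
visit_counts = Counter(visits)  # frequency of each whole number of visits

# Continuous data: reaction times in seconds for 30 students (Section 3.3.2).
reaction_times = [0.70, 0.50, 1.20, 0.80, 0.30, 0.34, 0.56, 0.41, 0.30, 1.20,
                  0.40, 0.64, 0.52, 0.38, 0.62, 0.47, 0.24, 0.55, 0.57, 0.61,
                  0.39, 0.55, 0.49, 0.41, 0.72, 0.71, 0.68, 0.49, 1.10, 0.59]

def bin_counts(values, start, step, n_bins):
    """Count how many values fall in each of n_bins equal intervals."""
    counts = [0] * n_bins
    for value in values:
        # Work in hundredths so that a value such as 0.30 is not
        # misplaced by floating-point rounding.
        index = (round(value * 100) - round(start * 100)) // round(step * 100)
        counts[index] += 1
    return counts

interval_counts = bin_counts(reaction_times, start=0.20, step=0.10, n_bins=11)

for n_visits in range(1, 13):
    print(n_visits, visit_counts[n_visits])
print(interval_counts)
```

The second list printed should correspond to the frequencies in Table 3.1, from the interval 0.20–0.29 through to 1.20–1.29.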
Most students have short reaction times, but there is a distinct group of three who took a relatively long time to respond and who may be of further interest to the researcher.
3.3.3
Cumulative graphs
Often it is useful to display data as a histogram of cumulative frequencies. This is a graph that displays the progressive total of cases (starting at zero or zero per cent and finishing at the sample size or 100%) on the Y axis against the increasing value of the variable on the X axis. Table 3.2 gives an example, using the data in Table 3.1. A cumulative frequency graph can never decrease. Figure 3.3 displays the data in Table 3.2 as a cumulative frequency histogram.
Table 3.1. Summary of the data for the reaction times in seconds of 30 students chosen at random from a first year biomedical class
Interval range    Number of students
0.20–0.29         1
0.30–0.39         5
0.40–0.49         6
0.50–0.59         7
0.60–0.69         4
0.70–0.79         3
0.80–0.89         1
0.90–0.99         0
1.00–1.09         0
1.10–1.19         1
1.20–1.29         2
Table 3.2. Summary of the data for the reaction time in seconds of 30 students chosen at random from a first year biomedical class as frequencies and cumulative frequencies
Interval range    Number of students    Cumulative total    Cumulative per cent
0.20–0.29         1                     1                   3.3
0.30–0.39         5                     6                   20
0.40–0.49         6                     12                  40
0.50–0.59         7                     19                  63.3
0.60–0.69         4                     23                  76.7
0.70–0.79         3                     26                  86.7
0.80–0.89         1                     27                  90
0.90–0.99         0                     27                  90
1.00–1.09         0                     27                  90
1.10–1.19         1                     28                  93.3
1.20–1.29         2                     30                  100
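The cumulative totals and percentages in Table 3.2 follow directly from the interval counts in Table 3.1. As a minimal sketch in Python (an illustrative choice of language), a running sum gives the totals and dividing by the sample size gives the percentages:

```python
from itertools import accumulate

# Number of students in each interval, from Table 3.1.
counts = [1, 5, 6, 7, 4, 3, 1, 0, 0, 1, 2]

# Progressive total of cases: each entry adds the next interval's count.
cumulative = list(accumulate(counts))

# The final cumulative total is the sample size (30 students).
sample_size = cumulative[-1]
percent = [round(100 * total / sample_size, 1) for total in cumulative]

print(cumulative)
print(percent)
```

Because counts cannot be negative, each cumulative total is at least as large as the one before it, which is why a cumulative frequency graph can never decrease.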
Figure 3.2 Data for the reaction time in seconds of 30 biomedical students selected at random, displayed as (a) a histogram and (b) a frequency polygon or line graph. The points on the frequency polygon (b) correspond to the midpoints of the bars on (a). [Both panels: X axis, reaction time (seconds), interval midpoints 0.25 to 1.25; Y axis, frequency (0 to 8).]
Although I have given the rather tedious manual procedures for constructing histograms, you will find that most statistical software packages have excellent graphics programs for displaying your data. These will automatically select an interval width, summarise the data, and plot the graph of your choice.
3.4
Displaying ordinal or nominal scale data
Figure 3.3 A cumulative frequency histogram for the reaction time of 30 students. [X axis: reaction time (seconds); Y axis: count (0 to 30).]
When you display data for nominal or ordinal scale variables you need to modify the form of the graph slightly because the categories are unlikely to be continuous, so the bars need to be separated to clearly indicate the lack of continuity. Here is an example for some nominal scale data. Table 3.3 gives the location of 596 basal cell carcinomas (a form of skin cancer that is most common on sun-exposed areas of the body) detected and removed from 400 males aged from 40 to 50 years treated during 12 months at a skin cancer clinic in Brisbane, Australia. The locations have been defined as (a) head, (b) neck and shoulders, (c) arms, (d) legs, (e) upper back, (f) lower back, (g) chest, and (h) lower abdomen. These can be displayed on a bar graph with the categories in any order along the X axis and the number of cases on the Y axis (Figure 3.4(a)). It often helps to rank the data in order of magnitude to aid interpretation (Figure 3.4(b)).
Table 3.3. The number of basal cell carcinomas detected and removed from eight locations on the body for 400 males aged from 40–50 years, during 12 months at a skin cancer clinic in Brisbane, Australia
Location                   Number of basal cell carcinomas
Head (H)                   211
Neck and shoulders (NS)    103
Arms (A)                   74
Legs (L)                   49
Upper back (UB)            94
Lower back (LB)            32
Chest (C)                  21
Lower abdomen (LA)         12
Figure 3.4 (a) The number of basal cell carcinomas detected and removed by location on the body during 12 months at a skin cancer clinic in Brisbane, Australia. (b) The same data but with the number of cases ranked in order from most to least. [Both panels: X axis, location of basal cell carcinoma; Y axis, number of cases (0 to 300). Panel (a) order: H, NS, A, L, UB, LB, C, LA. Panel (b) order: H, NS, UB, A, L, LB, C, LA.]
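The ranking used for Figure 3.4(b) amounts to sorting the categories by their counts. Here is a minimal sketch in Python (an illustrative choice), using the numbers in Table 3.3; the text-based bars, with one `#` per ten cases, are only a rough stand-in for a proper bar graph.

```python
# Number of basal cell carcinomas by location, from Table 3.3.
carcinomas = {"H": 211, "NS": 103, "A": 74, "L": 49,
              "UB": 94, "LB": 32, "C": 21, "LA": 12}

# Rank the categories from most to least cases.
ranked = sorted(carcinomas.items(), key=lambda item: item[1], reverse=True)
ranked_labels = [label for label, _ in ranked]

total_cases = sum(carcinomas.values())

# Each category is printed on its own line, which keeps the bars
# separate, as is appropriate for nominal scale data.
for label, cases in ranked:
    print(f"{label:>2} {'#' * (cases // 10)} {cases}")
print("Total:", total_cases)
```

Because the categories are nominal, any order along the X axis is valid; ranking by magnitude is simply a convenience for the reader.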
3.5
Bivariate data
Data where two variables have been measured on each experimental unit can often reveal patterns that may either suggest hypotheses, or be useful for testing them. Table 3.4 gives two lists of bivariate data for the number of dental caries (these are the holes that develop in decaying teeth) and the ages for 20 children between the ages of one and nine years from each of the cities of Uxford and Hambridge. Looking at these data, there is not anything that stands out apart from an increase in the number of caries with age. If you calculate descriptive statistics such as the average age and average number of dental caries for each of the two groups (Table 3.5), they are not very informative either.
Table 3.4. The number of dental caries and age of 20 children chosen at random from each of the two cities of Uxford and Hambridge
Uxford              Hambridge
Caries    Age       Caries    Age
1         3         10        9
1         2         1         5
4         4         12        9
4         3         1         2
5         6         1         2
6         5         11        9
2         3         2         3
9         9         14        9
4         5         2         6
2         1         8         9
7         8         1         1
3         4         4         7
9         8         1         1
11        9         1         5
1         2         7         8
1         4         1         7
3         7         1         6
1         1         1         4
1         1         2         6
6         5         1         2
Table 3.5. The average number of dental caries and age of 20 children chosen at random from each of the two cities of Uxford and Hambridge
Uxford                 Hambridge
Caries    Age          Caries    Age
4.05      4.5 years    4.1       5.5 years
Figure 3.5 The number of dental caries plotted against the age of 20 children chosen at random from each of the two cities of (a) Uxford and (b) Hambridge. [Both panels: X axis, age (years), 0 to 10; Y axis, number of dental caries, 0 to 12 in (a) and 0 to 16 in (b).]
(You have probably calculated the average for a set of data and this
procedure will be described in Chapter 6, but the average is the sum of all the values divided by the sample size.) Table 3.5 shows that the sample from Hambridge had slightly more dental caries on average than the one from Uxford, but this is not surprising since the Hambridge sample was an average of one year older.
If, however, you graph these data, patterns emerge. One way of displaying bivariate data is as a two-dimensional plot with increasing values of one variable on the horizontal (or X axis) and increasing values of the second variable on the vertical (or Y axis). Figure 3.5 shows both sets of data, with tooth decay (Y axis) plotted against child age (X axis) for each city. These graphs show that tooth decay increases with age, but the pattern differs between cities – in Uxford the increase is fairly steady, but in Hambridge it remains low in children up to age seven but then suddenly increases. This might suggest hypotheses about the reasons why, or stimulate further investigation (perhaps a child dental care program, or water fluoridation, has been in place in Hambridge for the past eight years compared with no action on decay in Uxford). Of course, there is always the possibility that the samples are different due to chance, so perhaps the first step in any further investigation would be to repeat the sampling using much larger numbers of children from each city.
Graphs of this type are frequently used and you will have seen them many times before in newspapers, reports, scientific articles, and on television.
3.6
Multivariate data
Often life scientists have data for three or more variables measured on the same experimental unit. For example, a biomedical scientist might have data for age, blood pressure, and serum cholesterol for each individual in a sample of 20 people, or a marine ecologist might have data for the numbers of several species of marine invertebrates present in samples from a polluted area. Results for three variables could be shown as three-dimensional graphs, but direct display is difficult for more than this number of variables. Some relatively new statistical techniques have made it possible to condense and summarise multivariate data in a two-dimensional display, but they are beyond the scope of this book.
3.7
Summary and conclusion
Graphs may reveal patterns in data sets that are not obvious from looking at lists or calculating descriptive statistics. Graphs can also provide an easily understood visual summary of a set of results. In later chapters there will be discussion of data displays such as boxplots and probability plots, which can be used to decide whether the data set is suitable for a particular analysis. Most modern statistical software packages have easy-to-use graphics options that produce high-quality graphs and figures. These packages are very useful for life scientists who are writing assignments, reports, or scientific publications.
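The dental caries example in Section 3.5 illustrates this point, and it can be reproduced with a short calculation. The following is a minimal sketch in Python (an illustrative choice of language) using the data in Table 3.4; the helper names `averages` and `mean_caries_by_age` are hypothetical. The overall averages are very similar for the two cities, but averaging the caries separately for each age shows the different patterns that the graphs reveal.

```python
from statistics import mean

# (caries, age) for each child, from Table 3.4.
uxford = [(1, 3), (1, 2), (4, 4), (4, 3), (5, 6), (6, 5), (2, 3), (9, 9),
          (4, 5), (2, 1), (7, 8), (3, 4), (9, 8), (11, 9), (1, 2), (1, 4),
          (3, 7), (1, 1), (1, 1), (6, 5)]
hambridge = [(10, 9), (1, 5), (12, 9), (1, 2), (1, 2), (11, 9), (2, 3),
             (14, 9), (2, 6), (8, 9), (1, 1), (4, 7), (1, 1), (1, 5),
             (7, 8), (1, 7), (1, 6), (1, 4), (2, 6), (1, 2)]

def averages(pairs):
    """Return (average caries, average age) for a list of (caries, age)."""
    return mean(c for c, _ in pairs), mean(a for _, a in pairs)

def mean_caries_by_age(pairs):
    """Return a dictionary of average caries for each age present."""
    by_age = {}
    for caries, age in pairs:
        by_age.setdefault(age, []).append(caries)
    return {age: mean(values) for age, values in sorted(by_age.items())}

print("Uxford averages:", averages(uxford))
print("Hambridge averages:", averages(hambridge))
print("Uxford by age:", mean_caries_by_age(uxford))
print("Hambridge by age:", mean_caries_by_age(hambridge))
```

With only 20 children per city, several ages are represented by one or two children, so the per-age averages are a rough guide to the pattern rather than reliable estimates.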
4
Introductory concepts of experimental design
4.1
Introduction
To generate hypotheses you often sample different groups or places (which is sometimes called a mensurative experiment because you usually measure something, such as height or weight, on each experimental unit) and explore these data for patterns or associations. To test hypotheses you may do mensurative experiments, or manipulative experiments where you change a condition and observe the effect of that change upon each experimental unit (like the experiment with millipedes and light described in Chapter 2). Often you may do several experiments of both types to test a particular hypothesis. The quality of your sampling and the design of your experiment can have an effect upon the outcome and determine whether your hypothesis is rejected or not. Therefore it is important to have an appropriate and properly designed experiment.
First, you should attempt to make your measurements as accurate and precise as possible so they are the best estimates of actual values. Accuracy is the closeness of a measured value to the true value. Precision is the ‘spread’ or variability of repeated measures of the same value. For example, a thermometer that consistently gives a reading corresponding to a true temperature (e.g. 20 °C) is both accurate and precise. Another that gives a reading consistently higher (e.g. +10 °C) than a true temperature is not accurate, but it is very precise. In contrast, a thermometer that gives a fluctuating reading within a wide range of values around a true temperature is not precise and will usually be inaccurate except when the reading occasionally happens to correspond to the true temperature.
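The distinction between accuracy and precision can be made concrete with a small calculation. The sketch below is in Python (an illustrative choice), and the three sets of thermometer readings are invented purely for illustration; they are not data from any experiment. Each reading's error is its difference from the true temperature, and the spread of repeated readings is summarised here by their standard deviation.

```python
from statistics import stdev

TRUE_TEMPERATURE = 20.0  # degrees Celsius, assumed known for this example

# Hypothetical repeated readings from three thermometers.
readings = {
    "accurate and precise": [20.0, 20.1, 19.9, 20.0, 20.0],
    "precise but not accurate": [30.0, 30.1, 29.9, 30.0, 30.0],
    "not precise": [14.0, 27.0, 20.0, 23.5, 16.5],
}

def largest_error(values, true_value):
    """Accuracy: the largest distance of any reading from the true value."""
    return max(abs(value - true_value) for value in values)

def spread(values):
    """Precision: the standard deviation of the repeated readings."""
    return stdev(values)

for name, values in readings.items():
    print(f"{name}: largest error = {largest_error(values, TRUE_TEMPERATURE):.1f}, "
          f"spread = {spread(values):.2f}")
```

In these made-up numbers the third thermometer reads exactly 20.0 once, but its other readings are several degrees out, which matches the description of an imprecise instrument that is only occasionally correct.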
Inaccurate and imprecise measurements or a poor or unrealistic sampling design can result in the generation of inappropriate hypotheses. Measurement errors or a poor experimental design can give a false or misleading outcome that may result in the incorrect retention or rejection of an hypothesis. The following is a discussion of some important essentials of sampling and experimental design.
4.2
Sampling – mensurative experiments
Mensurative experiments are often a good way of generating hypotheses or testing predictions from them. (An example of the latter is, ‘I think millipedes are attracted to light at night. So if I sample 500 well-lit houses and 500 that are not well lit, the first group should, on average, contain more millipedes than the second.’) You have to be careful when interpreting the results of mensurative experiments because you are sampling an existing condition, rather than manipulating conditions experimentally. There may be some other difference between your groups (e.g. well-lit houses may have a more ‘open plan’ design, which makes it easier for millipedes to get inside, and light may not be important at all).
4.2.1
Confusing a correlation with causality
Figure 4.1 Example of a positive correlation between the numbers of mice and the weight of wheat plants per square metre. [Axes: number of mice per m² and kilograms of wheat per m².]
Figure 4.2 The involvement of a third variable ‘Soil moisture’ that determines the ‘Number of mice’ and ‘Kilograms of wheat’ per square metre. Even though there is no causal relationship between the number of mice and weight of wheat, the two variables are positively correlated. [Diagram linking ‘Soil moisture’ to ‘Number of mice’ and to ‘Weight of wheat’.]
A correlation between two variables means they vary together. A positive correlation means that high values of one variable are associated with high values of the other, while a negative correlation means that high values of one variable are associated with low values of the other. For example, the graph in Figure 4.1 shows a positive correlation between the population density of mice per square metre and the weight of wheat plants in kilograms per square metre from different parts of a large field. Unfortunately a correlation is often mistakenly interpreted as indicating causality. It seems plausible that the amount of wheat might be the cause of differences in the numbers of mice (which may be eating the wheat or using it for shelter), but even if there is a very obvious correlation between any two variables it does not necessarily show that one is responsible for the other. The correlation may have occurred by chance, or a third unmeasured factor might determine the numbers of the two variables studied
(Figure 4.2). For example, soil moisture may determine both the number of mice and the weight of wheat. Therefore, although there is a causal relationship between soil moisture and each of the two variables, they are not causally related themselves.
4.2.2
The inadvertent inclusion of a third variable: sampling confounded in time
Occasionally researchers have no choice but to sample different populations of the same species, or different habitats, at different times. These results should be interpreted with great caution, since changes occurring over time may contribute to differences (or the lack of them) among
samples. The sampling is said to be confounded in that more than one variable may be having an effect on the results.
Here is an example. An ecologist hypothesised that the density of above-ground vegetation might affect the population density of earthworms, and therefore sampled several different areas for these two variables. The work was very time consuming because the earthworms had to be sampled by taking cores of soil and unfortunately the ecologist had no help. Therefore, areas of low vegetation density were sampled in January, low to moderate density in February, moderate density in March, and high density in April. The sampling showed a negative correlation between vegetation density and earthworm density. Unfortunately, however, the density of earthworms was the same in all areas, but decreased as the year progressed (and the ecologist did not know this). Therefore, the negative correlation between earthworm density and vegetation density was an artefact of the sampling of different places being confounded in time. This is an example of a common problem, and you are likely to find similar cases in many published scientific papers and reports.
4.2.3
The need for independent samples in mensurative experiments
Figure 4.3 Variation in the number of freshwater prawns per cubic metre of water at two different depths (10 m and 20 m) in Dark Lake, Wisconsin. (a) An unreplicated sample taken at only one place (*) would give a very misleading indication of changes in the population density of prawns with depth within the entire lake. (b) Several replicates taken at only one place (*) would still give a very misleading indication of conditions within the entire lake. (c) Several replicates taken at random across the lake would give a better indication within the entire lake. [Each panel shows prawn densities at six places across the lake; at 10 m: 10, 100, 25, 38, 34, 83; at 20 m: 10, 5, 16, 99, 2, 126. Asterisks mark where samples were taken.]
Frequently researchers sample the numbers, or population density, of a species in relation to an environmental gradient (such as depth in a lake), to see if there is any correlation between density of the species and the gradient of interest. There is an obvious need to replicate the sampling – that is, to independently estimate density more than once. For example, consider sampling Dark Lake, Wisconsin, to investigate the population density of freshwater prawns in relation to depth. If you only sampled at one place (Figure 4.3(a)) the results would not be a good indication of changes in the population density of prawns with depth in the lake. The sampling needs to be replicated, but there is little value in repeatedly sampling one small area (e.g. by taking several samples under ‘*’ in Figure 4.3(b)) since this still will not give an accurate indication of changes in population density with depth across the whole lake (although it may give a very accurate indication of conditions in that particular part of the lake). This sort of sampling is one aspect of what Hurlbert (1984) called
pseudoreplication, which is still a very common flaw in a lot of scientific research. The replicates are ‘pseudo’ – sham or unreal – because they are unlikely to truly describe what is occurring across the entire area being discussed (in this case the lake). A better design would be to sample at several places chosen at random within the lake as shown in Figure 4.3(c). This type of inappropriate sampling is very common. Here is another example. A researcher sampled a large coral reef by dropping a 1 m² square frame, subdivided into a grid of 100 equal-sized squares, at random in one place only and then took one sample from each of these smaller squares. Although these 100 replicates may very accurately describe conditions within the sampling frame they may not necessarily describe the remaining 9999 m² of the reef and would be pseudoreplicates if the results were
interpreted in this way. A more appropriate design would be to sample 100 replicates chosen at random across the whole reef.
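Choosing replicates at random across the whole reef can be sketched as follows. This is a minimal illustration in Python (an illustrative choice), not a field protocol. It assumes, purely for the example, that the reef can be treated as a grid of 100 by 100 plots of 1 m² each, which is consistent with the 10 000 m² implied above; the function name `random_plots` and the grid layout are hypothetical.

```python
import random

REEF_SIDE = 100  # assumed: reef treated as a 100 m x 100 m grid of 1 m² plots

def random_plots(n_plots, side, seed=None):
    """Choose n_plots different 1 m² plots at random from a side x side grid.

    Returns a list of (row, column) positions, each chosen without
    replacement so that no plot is sampled twice.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(side * side), n_plots)
    return [(plot // side, plot % side) for plot in chosen]

# One hundred replicates spread at random across the whole reef.
plots = random_plots(100, REEF_SIDE, seed=1)
print(plots[:5])
```

The `seed` argument is there only so that the same set of positions can be regenerated; in practice you would record whatever positions were chosen.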
4.2.4
The need to repeat the sampling on several occasions and elsewhere
In the example described above, the results of sampling Dark Lake can only confidently be discussed in relation to that particular lake on that day. Therefore, when interpreting results you need to be cautious. Sampling the same lake on several different occasions will strengthen the findings, and may be sufficient if you are only interested in that lake. Sampling more than one lake will make the results more able to be generalised. Inappropriate generalisation is another example of pseudoreplication since data from one location may not hold in the more general case. At the same time, however, even if your study is limited you can still make more general predictions from your findings provided these are clearly identified as predictions.
4.3
Manipulative experiments
4.3.1
Independent replicates
It is essential to have several independent replicates of any treatment used in an experiment. I mentioned this briefly when describing the millipedes and light experiment in Chapter 2 and said if there were only one lit and one unlit tile any difference between them could have simply been due to chance or some other unknown factor(s). As the number of randomly chosen independent replicates increases, so does the likelihood that any difference between the experimental group and the control group is a result of the experimental treatment. The following example is deliberately absurd because I will use it later in this chapter to discuss a lack of replication that is not so obvious. Imagine you were asked to test the hypothesis that vitamin C caused guinea pigs to grow more rapidly. You obtained two six-week-old guinea pigs of the same sex and weight, caged them separately, and offered one an unlimited amount of commercial rodent food plus 20 mg of vitamin C per day, while the other guinea pig was only offered an unlimited amount of commercial rodent food. The guinea pigs were re-weighed after three
months and the results were obvious – the guinea pig that received vitamin C was 40% heavier than the one that had not. This result is consistent with the hypothesis but there is an obvious flaw in the experiment – with only one guinea pig in each treatment, any differences between treatments may be due to differences between the guinea pigs, differences between the treatment cages, or both. (For example, the slow-growing guinea pig may, by chance, have been heavily infested with intestinal parasites.) There is a need to replicate this experiment and the replicates need to be truly independent – for example it is not sufficient to have ten ‘vitamin C’ guinea pigs together in one cage and ten control guinea pigs in another, because any differences between treatments may still be caused by some difference between the cages. There will be more about this shortly.
4.3.2
Control treatments
Control treatments are needed because they allow the experimenter to isolate the reason why something is occurring in an experiment by comparing two treatments that differ by only one factor. Frequently the need for a rigorous experimental design makes it necessary to have several different treatments, more than one of which can be considered controls. Here is an example. Herbivorous species of marine snails are often common in rock pools on the shore, where they eat algae that grow on the sides of the pools. Very occasionally these snails are seen being attacked and eaten by carnivorous species of intertidal snails, which also occur in the rock pools. An ecologist was surprised that such attacks occurred so infrequently and hypothesised that this was because the herbivorous snails showed ‘avoidance’ by climbing out of the water in response to water borne odours from their predators. The null hypothesis is, ‘herbivorous snails will not avoid their predators’ and the alternate hypothesis is, ‘herbivorous snails will avoid their predators’. One prediction that might distinguish between these hypotheses is that, ‘herbivorous snails will crawl out of their pool when a predatory snail is added’. This could be tested by dropping a predatory snail into a rock pool where some herbivorous snails are present and seeing how many crawled out during the next five minutes. Unfortunately, this experiment is not controlled. By adding a predator and waiting for five minutes, several things have happened to the
Table 4.1. Breakdown of three treatments into their effects upon herbivorous snails

Predator       Control for disturbance    Control for time
predator
disturbance    disturbance
time           time                       time
herbivorous snails in the pool. Certainly, you are adding a predator. But the pool is also being disturbed, simply by adding something (the predator) to it. Also, the experiment is not well controlled in terms of time, since five minutes have elapsed while the experiment is being done. Therefore, even if all the herbivorous snails crawled out of the pool, the experimenter could not confidently attribute this to the addition of the predator – the snails may have crawled out in response to disturbance, because the pool had warmed up in the sun, or many other reasons. One improvement to this experiment would be a control for the disturbance associated with adding a predator. A popular treatment to control for this is to include another pool into which a small stone about the size of the predator is dropped, as ‘something added to the pool’. Another important improvement would include a control pool to which nothing was added. At this stage, by incorporating the improvements, you would have three treatments. Table 4.1 lists what these treatments are doing to the snails. For such a simple hypothesis, ‘herbivorous snails will avoid their predators’, the experiment has already expanded to three treatments. But many ecologists are likely to say that even this design is not adequate, since the ‘predator’ treatment is the only one in which a snail has been added to a pool. Therefore, even if the snails all crawled out of the pools in the treatment to which the predator had been added but remained submerged in the other two treatments, the response may have been only a response to the addition of any living snail, rather than a predator. Ideally, a fourth treatment should be included, where an herbivorous snail is added, to control for this (Table 4.2). You may, at this point, be thinking that the above design is far too finicky. Nevertheless, experiments have to have appropriate controls so
Table 4.2. Breakdown of four treatments into their effects upon herbivorous snails

Predator       Control for snail    Control for disturbance    Control for time
predator       herbivore
disturbance    disturbance          disturbance
time           time                 time                       time
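The breakdown in Table 4.2 can also be written down as sets of factors, which makes it easy to check what any two treatments differ by. The following is only a minimal sketch: the dictionary keys and the helper name `difference` are illustrative choices, not part of the original experiment.

```python
# Each treatment from Table 4.2, written as the set of things done to a pool.
treatments = {
    "predator": {"predator", "disturbance", "time"},
    "control for snail": {"herbivore", "disturbance", "time"},
    "control for disturbance": {"disturbance", "time"},
    "control for time": {"time"},
}


def difference(a, b):
    """Return the factors present in one treatment but not in the other."""
    return treatments[a] ^ treatments[b]


# The last two controls differ only by the disturbance of adding something.
print(difference("control for disturbance", "control for time"))  # {'disturbance'}

# Adding a herbivorous snail differs from adding a stone only by the snail.
print(difference("control for snail", "control for disturbance"))  # {'herbivore'}

# The predator treatment and its closest control differ only in the kind of snail added.
print(difference("predator", "control for snail"))
```

Listing treatments this way is one possible aid to spotting a missing control: if two treatments you intend to compare differ by more than the factor of interest, another control is probably needed.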
that the effects of each potentially contributing factor can be isolated. Furthermore, the design would have to include replicates as well – you could not just do it once, using four pools, since any difference among treatments may result from some difference among the pools rather than the actual treatments applied. I have done this experiment (McKillup and McKillup, 1993) and included all the treatments listed in Table 4.2 with six replicates, using 24 pools altogether. It is often difficult to work out what control treatments you need to incorporate in a manipulative experiment. One way to clarify these is to list all of the things you are actually doing to an experimental treatment and make sure you have appropriate controls for each.

4.3.3 Other common types of manipulative experiments where treatments are confounded with time
Many experiments confound treatments with time. For example, experiments designed to evaluate the effects of drugs often measure some physiological variable (e.g. blood pressure) of the same group of experimental subjects before and after a treatment. Any change is attributed to the effect of the drug. Here, however, several different things have been done to the treatment group. I will use blood pressure as an example, but they apply to any ‘before and after’ experiment. First, time has elapsed, and blood pressure can change over a matter of minutes or hours in response to many factors, even room temperature. Second, the group has been given a drug, but studies have shown that administration of even an empty capsule or an injection of saline (these are called placebo treatments) can affect a person’s blood pressure.
Third, each person in the group has had their blood pressure measured twice. Many people are ‘white coat hypertensive’ – their blood pressure increases substantially at the sight of a physician approaching with the inflatable cuff and pressure gauge used to measure blood pressure. An improvement to this experiment could include a group that was treated in exactly the same way as the experimental group, except that the subjects were given an appropriate placebo. This would at least isolate the effect of the drug from the other ways in which both groups had been disturbed. Consequently, well-designed medical experiments often include ‘sham operations’ where the control subjects are operated on in the same way as the experimental subjects, except that they do not receive the experimental manipulation. For example, early experiments to investigate the function of the parathyroid glands, which are small patches of tissue present within the thyroid, included an experimental treatment where the parathyroids were completely removed from several dogs, while a control group of dogs had their thyroids exposed and cut, but the parathyroids were left in place.

4.3.4 Pseudoreplication
One of the nastiest pitfalls is appearing to have a replicated manipulative experimental design, which really is not replicated. This is another aspect of ‘pseudoreplication’ described by Hurlbert (1984) who invented the word – before then it was just called ‘bad design’. Here is an example that relates back to the discussion about the need for replicates. An aquacultural scientist hypothesised that a diet which included excess vitamin A would increase the growth rate of prawns. They were aware of the need to replicate their experiment, so they set up two treatment ponds, each containing 1000 prawns of the same species and of similar weight and age from the same hatchery. One pond was chosen at random and the 1000 prawns within it fed commercial prawn food plus vitamin A, while the 1000 prawns in the second pond were only fed commercial prawn food. After six months the prawns were harvested and weighed. The prawns that received vitamin A were twice as heavy, on average, as the ones that had not. The scientist was delighted – an experiment with 1000 replicates of each treatment had produced a result consistent with the hypothesis. Unfortunately, there are not 1000 truly independent replicates in each pond. All prawns receiving vitamin A were in pond 1 and all those receiving
only standard food were in pond 2. Therefore, any difference in growth may, or may not, have been due to the vitamin – it could equally well have been due to some other (perhaps unknown) difference between the two ponds. The experimental replicates are the ponds, not the prawns, so the experiment has no effective replication at all and is essentially the same as the absurd unreplicated guinea pig experiment described earlier in this chapter.

An improvement to the design would be to run each treatment in several ponds. For example, an experiment with five ponds in each treatment, each containing 200 prawns, has at least been replicated five times. But here too, it is still necessary to have truly independent replicates – you cannot subdivide each of two ponds into five enclosures and run one treatment in each pond. This is one case of apparent replication, and here are four more examples.

1 Even if you have several separate replicates of each treatment (say five treatment aquaria and five control aquaria), the arrangement of these can lead to a lack of independence. First, you may have your treatment aquaria all clumped together at one end of a laboratory bench and the control aquaria at the other. But there may be some known or unknown feature of the laboratory (e.g. light levels, ventilation, disturbance) that affects one group of aquaria differently to the other (Figure 4.4(a)).

2 Replicates placed alternately. If you decided to get around the clustering problem by placing treatments and controls alternately (i.e. by placing, from left to right, treatment 1, control 1; treatment 2, control 2; treatment 3 etc.), there can still be problems. Just by chance all the treatment aquaria (or all the controls) might be under regularly placed laboratory ceiling lights, next to windows, or subject to some other regular feature you are not even aware of (Figure 4.4(b)).

3 Often, because of a shortage of equipment, you may have to have all of your replicates of one temperature treatment in only one controlled temperature cabinet, and all replicates of another temperature in only one other. Unfortunately, if there is something peculiar to one cabinet, in addition to temperature, then either the experimental or control treatment may be affected. This pattern is called ‘isolative segregation’ (Figure 4.4(c)).

4 The final example is more subtle. Imagine you decided to test the hypothesis that, ‘Water with a high nitrogen content increases the
Figure 4.4 Three cases of apparent replication. (a) Clustering of replicates (T1 to T5 together, followed by C1 to C5) means that there is no independence among controls or treatments. (b) A regular arrangement of treatments and controls (T1, C1, T2, C2, T3, C3, T4, C4, T5, C5) may, by chance, correspond to some feature of the environment (here the very obvious ceiling lights) that might affect the results. (c) Clustering of temperature treatments within particular incubators (Incubator 1 at 20 °C, Incubator 2 at 30 °C).
growth of freshwater mussels.’ You set up five control aquaria and five experimental aquaria, which were placed on the bench in a completely randomised pattern, to get around examples 1 and 2 above. All tanks had to have water constantly flowing through them, so you set up one storage tank containing water high in nitrogen and one containing water low in nitrogen. Water from each storage tank was piped into five aquaria as shown in Figure 4.5. This looks fine, but unfortunately all of the five aquaria within each treatment are sharing the same water. All in the ‘high nitrogen’ treatment receive water high in nitrogen from Tank A and all aquaria in the control receive water low in nitrogen from Tank B, so any difference in mussel growth between treatments may be due either to the nitrogen or some other feature of the storage tanks. Really, therefore, this design is little better than the case of isolative segregation (example 3 above). Ideally, each aquarium should have its own separate and independent supply. Finally, the allocation of replicate tanks to treatments should be done
Figure 4.5 The positions of the treatment tanks are randomised, but all tanks within a treatment share water from one supply tank. (Tank A, high nitrogen, supplies aquaria T1 to T5; Tank B, low nitrogen, supplies aquaria C1 to C5. Order on the bench: T1, C1, C2, T2, T3, C3, T4, C4, C5, T5.)
using a method that removes any possibility of unintentional bias by the experimenter. (For example, the toss of a coin was used to allocate pairs of tiles to lit and unlit treatments in the experiment with millipedes and light described in Section 2.2.)
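A coin toss works well for pairs, and for a larger set of aquaria the same idea can be handled with a random shuffle. The sketch below is one possible way of doing this, not the method used in any experiment described here; the function name `allocate`, the numbering of the aquaria, and the fixed seed are all assumptions made for illustration.

```python
import random


def allocate(units, treatments, seed=None):
    """Randomly split units into equal-sized groups, one group per treatment."""
    if len(units) % len(treatments) != 0:
        raise ValueError("units must divide evenly among treatments")
    rng = random.Random(seed)  # a seed makes the allocation reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    size = len(shuffled) // len(treatments)
    return {
        name: sorted(shuffled[i * size:(i + 1) * size])
        for i, name in enumerate(treatments)
    }


# Ten aquaria, identified by their position on the bench (hypothetical labels).
aquaria = list(range(1, 11))
plan = allocate(aquaria, ["treatment", "control"], seed=1)
print(plan)
```

Because the shuffle decides which aquaria receive which treatment, the experimenter has no opportunity to favour particular positions, even unintentionally. Note that this only randomises the allocation; it does not fix a shared water supply of the kind shown in Figure 4.5.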
4.4 Sometimes you can only do an unreplicated experiment
Although replication is desirable in any experiment, there are some cases where it is not possible. For example, when doing large-scale mensurative or manipulative experiments on systems such as lakes or rivers there may be only one polluted lake or river available to study. Although you cannot attribute the reason for any difference, or the lack of it, to the treatment (e.g. a polluted versus a relatively unpolluted river), since you only have one replicate, the results are still useful. First, they are still evidence for or against your hypothesis and can be cautiously discussed in the light of the lack of replication. Second, it may be possible to achieve replication by analysing your results in conjunction with those from similar studies done elsewhere by other researchers. This is called a meta-analysis. Finally, the results of a large-scale but unreplicated experiment may suggest smaller-scale experiments that can be done with replication so that you can continue to test the hypothesis.
4.5 Realism
Even an apparently well-designed mensurative or manipulative experiment may still suffer from a lack of realism. Here are two examples. The first is a mensurative experiment on the incidence of testicular torsion. Testicular torsion can occur in males when the testicular artery supplying the testis with oxygenated blood becomes twisted. This can restrict or cut off the blood supply and thereby damage or kill the testis. Apparently this is an extremely painful condition and usually requires surgery to either restore blood flow or remove the damaged testis. Since the testes retract closer to the body as temperature decreases, a physician hypothesised that the likelihood of torsion would be greater during winter compared with summer. Their alternate hypothesis was, ‘Retraction of the testis during cold weather increases the incidence of testicular torsion.’ The null hypothesis was, ‘Retraction of the testis during cold weather does not increase the incidence of testicular torsion.’ The physician found that the incidence of testicular torsion was twice as high during winter compared with summer in a small town in Alaska. Unfortunately there were very few affected males (six altogether) in the sample, so this difference may have occurred simply by chance, making it impossible to distinguish between these hypotheses. Later, another researcher obtained data from a much larger sample of 96 affected males from hospital records in north Queensland, Australia. They found no difference in the incidence of testicular torsion between summer and winter, but this may not have been a realistic test of the hypothesis, because even Alaskan summers are considerably colder than north Queensland winters. Second, an experiment to investigate factors affecting the selection of breeding sites by the mosquito Anopheles farauti offered adult females a choice of salinities ranging from 0, 5, 10, 15, 20, 25, 30, and 35 parts per thousand. 
Eggs were laid in all but the two highest salinities (30 and 35 parts per thousand). The conclusion was that salinity significantly affects the choice of breeding sites by mosquitoes. Unfortunately the salinity in the habitat where the mosquitoes occurred never exceeded ten parts per thousand, again making the choice of treatments unrealistic.
4.6 A bit of common sense
By now, you may be quite daunted by the challenge of being able to design a good experiment. Provided, however, that you have appropriate controls and replicates, and have also thought about any obvious problems of pseudoreplication and realism, you are well on the way to a good design. Furthermore, the desire for a near-perfect design has to be balanced against financial constraints as well as space and time available to do the experiment. Often it is not possible to have more than two incubators, or as many replicates as you would like. It also depends on the type of life science you do. For example, many microbiologists working with organisms they grow on agar plates, where conditions can be strictly controlled, would never be concerned about clustering of replicates or isolative segregation because they are confident that conditions do not vary in different parts of their laboratory and that their incubators differ only in relation to temperature. Most of the time they may be right, but considerations about experimental design need to be borne in mind by all life scientists.

Also, you may not have the resources to do a large manipulative field experiment at more than one site. Although, strictly speaking, the results cannot be generalised to other sites, they may nevertheless apply, and careful interpretation and discussion of results can make more general predictions. For example, the ‘millipede and light’ experiment described in Chapter 2 was initially done during one night at one site. It was repeated on the following night at the same site in the presence of some colleagues (who were initially rather sceptical), and later at two other sites, as well as in the laboratory. All the results were consistent with the hypothesis, so I concluded, ‘Portuguese millipedes are attracted to visible light at night.’ Nevertheless, the hypothesis may not be correct or apply to all populations of O. moreleti, but, to date, there has been no evidence to the contrary.

4.7 Designing a ‘good’ experiment
Designing a well-controlled, appropriately replicated and realistic experiment has been described by some researchers as an ‘art’. It is not, but there are often several different ways to test the same hypothesis, and hence several different experiments that could be done. Consequently, it is
Figure 4.6 An example of the trade-off between the cost and ability to do an experiment. As the quality of the experimental design increases, so does the cost of the experiment (solid line), while the ability to do the experiment decreases (dashed line). Your design usually has to be a compromise between one that is practicable, affordable, and of sufficient rigour. (Horizontal axis: quality of the experimental design, from very poor to excellent. Vertical axes: cost of the experiment and ability to do the experiment.)
difficult to set a guide to designing experiments beyond an awareness of the general principles discussed in this chapter.

4.7.1 Good design versus the ability to do the experiment
It has often been said, ‘There is no such thing as a perfect experiment.’ One inherent problem is that, as a design gets better and better, the cost in time and equipment also increases, but the ability to actually do the experiment decreases (Figure 4.6). An absolutely perfect design may be impossible to carry out. Therefore, every researcher must choose a design that is ‘good enough’ but still practical. There are no rules for this – the decision on design is in the hands of the researcher, and will eventually be judged by the colleagues who examine any report from the work.

4.8 Conclusion
The above discussion only superficially covers some important aspects of experimental design. Considering how easy it is to make a mistake, you probably will not be surprised that a lot of published scientific papers have serious flaws in design or interpretation that could have been avoided.
Work with major problems in the design of experiments is still being done and, quite alarmingly, many researchers are not aware of these. As an example, after teaching the material in this chapter I often ask my students to find a published paper, review and criticise the experimental design, and then offer constructive suggestions for improvement. Many have later reported that it was far easier to find a flawed paper than they expected.
5 Probability helps you make a decision about your results
5.1 Introduction
Most science is comparative. Researchers often need to know if a particular experimental treatment has had an effect, or if there are differences among a particular variable measured at several different locations. For example, does a new drug affect blood pressure, does a diet high in vitamin C reduce the risk of liver cancer in humans, or is there a relationship between vegetation cover and the population density of rabbits? But when you make these sorts of comparisons, any differences among treatments or among areas sampled may be real or they may simply be the sort of variation that occurs by chance among samples from the same population. Here is an example using blood pressure. A biomedical scientist was interested in seeing if the newly synthesised drug ‘Arterolin B’ had any effect on blood pressure in humans. A group of six humans had their systolic blood pressure measured before and after administration of a dose of Arterolin B. The average systolic blood pressure was 118.33 mm Hg before and 128.83 mm Hg after being given the drug (Table 5.1). The average change in blood pressure from before to after administration of the drug is quite large (an increase of 10.5 mm Hg), but by looking at the data you can see there is a lot of variation among individuals – blood pressure went up in three cases, down in two, and stayed the same for the remaining person. Even so, the scientist might conclude that a dose of Arterolin B increases blood pressure. But there is a problem (apart from the poor experimental design that has no controls for time or the disturbing effect of having one’s blood pressure measured). How do you know that the effect of the drug is meaningful or significant? Perhaps this change occurred by chance and the drug had no effect. Somehow you need a way of helping you make a
Table 5.1. The systolic blood pressure in mm Hg for six people before and after being given the experimental drug Arterolin B

Person     Before    After
1          100       108
2          120       120
3          120       150
4          140       135
5           80       120
6          150       140
Average    118.33    128.83
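The averages and the pattern of individual changes quoted for Table 5.1 can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic; the variable names are illustrative, and the values are the ones listed in the table.

```python
from statistics import mean

# Systolic blood pressure (mm Hg) for the six people in Table 5.1.
before = [100, 120, 120, 140, 80, 150]
after = [108, 120, 150, 135, 120, 140]

# Change for each person (after minus before).
changes = [a - b for b, a in zip(before, after)]

print(round(mean(before), 2), round(mean(after), 2))  # 118.33 128.83
print(mean(changes))  # 10.5

went_up = sum(c > 0 for c in changes)
went_down = sum(c < 0 for c in changes)
unchanged = sum(c == 0 for c in changes)
print(went_up, went_down, unchanged)  # 3 2 1
```

The individual changes range from a fall of 10 mm Hg to a rise of 40 mm Hg, which is the variation among individuals that makes the average change of 10.5 mm Hg hard to interpret on its own.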
decision about your results. This led to the development of statistical tests and a commonly agreed upon level of statistical significance.

5.2 Statistical tests and significance levels
Statistical tests are just a way of working out the probability of obtaining the observed, or an even more extreme, difference among samples (or between an observed and expected value) if a specific hypothesis (usually the null of no difference) is true. Once the probability is known, the experimenter can make a decision about the difference, using criteria that are uniformly used and understood. Here is a very easy example where the probability of every possible outcome can be calculated. Imagine you have a large sack containing 5000 white and 5000 black beads that are otherwise identical. All of these beads are well mixed together. They are a population of 10 000 beads. You take one bead out at random, without looking in the sack. Since there are equal numbers of black and white, your probability of getting a black one is 50%, or ½, which is also your chance of getting a white one. The chance of getting either a black or white bead is the sum of these probabilities: (½ + ½) which is 1.0 (or 100%) since there are no other colours. (If you are unsure about probability, there is a short explanation of the concepts you will need for this book in Box 5.1.) Now consider what happens if you take out a sample of six beads in sequence, one after the other, without looking in the sack. Each bead is
Box 5.1 Basic concepts of probability

The probability of any event can only vary between 0 and 1 (which correspond to 0% and 100%). If an event is certain to occur, it has a probability of 1; while, if it is certain the event will not occur, it has a probability of 0. The probability of a particular event is the number of outcomes giving that event, divided by the total number of possible outcomes. For example, when you toss a coin there are only two possible outcomes – a head or a tail. These two events are mutually exclusive – you cannot get both. Consequently, the probability of a head is 1 divided by 2 = ½ (and thus the probability of a tail is also ½).

The addition rule

The probability of getting either a head or a tail is ½ + ½ = 1. This is an example of the addition rule: when several outcomes are mutually exclusive, the probability of getting any of these is the sum of their separate probabilities. (Therefore, the probability of getting either a 1, 2, 3, or 4 when rolling a six-sided die is 4/6.)

The multiplication rule

Independent events. To calculate the joint probability of two or more independent events (for example, a head followed by another head in two independent tosses of a coin) you simply multiply the independent probabilities together. Therefore, the probability of getting two heads with two tosses of a coin is ½ × ½ = ¼. The chance of a head and a tail with two tosses is ½, because there are two ways of obtaining this: HT or TH.

Related events. If the events are not independent (for example, the first event being a number in the range of 1–3 inclusive when rolling a six-sided die and the second event being that this is an even number), the multiplication rule also applies, but you have to multiply the probability of one event by the conditional probability of the second. When rolling a die the independent probability of a number from 1 to 3 is 3/6 = ½, and the independent probability of any even number is also ½ (the even numbers are 2, 4, or 6 divided by the six possible outcomes). If, however, you have already rolled a number from 1 to 3, the probability of that restricted set of outcomes being an even number is 1/3 (because ‘2’ is the only even number possible in this set of three outcomes). Therefore, the probability of both related events is ½ × 1/3 = 1/6. You can look at this the other way – the chance of an even number when rolling a die is ½ (you would get numbers 2, 4, or 6) and the probability of one of these numbers being in the range from 1 to 3 is 1/3 (the number 2 out of these three outcomes). Therefore the probability of both is again ½ × 1/3 = 1/6.
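The rules in Box 5.1 can be checked by simply counting outcomes. The sketch below does this with exact fractions; the helper `probability` is a hypothetical name introduced here, and it assumes every outcome in the list is equally likely (a fair coin and a fair die).

```python
from fractions import Fraction
from itertools import product


def probability(outcomes, event):
    """Outcomes giving the event, divided by the total number of outcomes."""
    outcomes = list(outcomes)
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))


die = range(1, 7)
two_tosses = list(product("HT", repeat=2))  # HH, HT, TH, TT

# Addition rule: a 1, 2, 3, or 4 on a six-sided die.
p_1_to_4 = probability(die, lambda n: n in (1, 2, 3, 4))

# Multiplication rule, independent events: two heads, and a head with a tail.
p_two_heads = probability(two_tosses, lambda t: t == ("H", "H"))
p_head_and_tail = probability(two_tosses, lambda t: set(t) == {"H", "T"})

# Multiplication rule, related events: a number from 1 to 3 that is also even.
p_low = probability(die, lambda n: n <= 3)
p_even_given_low = probability([n for n in die if n <= 3], lambda n: n % 2 == 0)
p_both = p_low * p_even_given_low

print(p_1_to_4, p_two_heads, p_head_and_tail, p_both)  # 2/3 1/4 1/2 1/6
```

The value 2/3 is simply 4/6 in lowest terms. Counting the single outcome ‘2’ directly among the six faces of the die gives the same 1/6 as multiplying ½ by the conditional probability 1/3.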
replaced after it is drawn and the contents of the sack remixed before taking out the next, so these are independent events. Here are all of the possible outcomes. You may get six black beads or six white ones (both outcomes are very unlikely); five black and one white, or one black and five white (which is more likely); four black and two white, or two black and four white (which is even more likely); or three black and three white (which is very likely because the proportion of beads in the sack is 1:1).

The probability of getting six black beads in sequence is the probability of getting one black one (½) multiplied by itself six times, which is ½ × ½ × ½ × ½ × ½ × ½ = 1/64. The probability of getting six white beads is also 1/64. The probability of five black and one white is greater because there are six ways of getting this combination (WBBBBB or BWBBBB or BBWBBB or BBBWBB or BBBBWB or BBBBBW) giving 6/64. There is the same probability (6/64) of getting five white and one black. The probability of four black and two white is even greater because there are 15 ways of getting this combination (WWBBBB, BWWBBB, BBWWBB, BBBWWB, BBBBWW, WBWBBB, WBBWBB, WBBBWB, WBBBBW, BWBWBB, BWBBWB, BWBBBW, BBWBWB, BBWBBW, BBBWBW) giving 15/64. There is the same probability (15/64) of getting four white and two black. Finally, the probability of three black and three white (there are 20 ways of getting this combination) is 20/64.

You can summarise all of the outcomes as a table of probabilities (Table 5.2). These probabilities are shown as a histogram in Figure 5.1. Note that the distribution is symmetrical with a peak corresponding to the cases where half the beads will be black and half white. (Incidentally, this is
Table 5.2. The probabilities of obtaining all possible combinations of black and white beads in samples of six from a large population where there are equal numbers of black and white beads

Number of black    Number of white    Probability of this outcome    Percentage of cases likely to give this result
6                  0                  1/64                           1.56
5                  1                  6/64                           9.38
4                  2                  15/64                          23.44
3                  3                  20/64                          31.25
2                  4                  15/64                          23.44
1                  5                  6/64                           9.38
0                  6                  1/64                           1.56
Total                                 64/64                          100%

Figure 5.1 The expected numbers of each possible mixture of colours when drawing six beads independently with replacement on 64 different occasions from a large population containing 50% black and 50% white beads. (Horizontal axis: number of black beads in a sample of 6. Vertical axis: expected number of each case in a sample of 64.)
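The entries in Table 5.2 come from counting the number of ways of getting each mixture and multiplying by the probability of any one sequence. The following sketch reproduces them; the function name `bead_table` is an illustrative choice, and `math.comb` (available in Python 3.8 and later) is used to count the ways rather than listing every sequence.

```python
from fractions import Fraction
from math import comb


def bead_table(n=6, p_black=Fraction(1, 2)):
    """Map each possible number of black beads in n draws to its probability."""
    return {
        black: comb(n, black) * p_black**black * (1 - p_black)**(n - black)
        for black in range(n, -1, -1)
    }


table = bead_table()
for black, prob in table.items():
    # Columns: number of black, number of white, probability in 64ths, percentage.
    print(black, 6 - black, f"{prob * 64}/64", f"{float(prob) * 100:.2f}%")

print(sum(table.values()))  # 1
```

Multiplying each probability by 64 gives the expected number of each case in 64 samples, which is what Figure 5.1 plots.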
an example of the binomial distribution, which will be discussed in Chapter 17.) Therefore, if you were given a sack containing 50% black and 50% white beads, from which you drew six, you would have a very high probability of drawing a sample that contains beads of both colours. It is very unlikely you
would get only six black or six white (the probability of each is 1/64, so the probability of either six black or six white is the sum of these which is only 2/64, or 0.0313 or 3.13%).

5.3 What has this got to do with making a decision or statistical testing?
The statistician Sir Ronald Fisher proposed that, if the probability of getting this or a more extreme difference between the expected outcome (the null hypothesis discussed in Chapter 2) and the actual outcome is less than 5%, then it is appropriate to conclude that the difference is statistically significant (Fisher, 1954). There is no biological reason for the choice of 5% (which is the same as 1/20 or 0.05). It is the probability that many researchers use as a standard ‘statistical significance level’. Using the example of the beads in the sack, if your null hypothesis specified that there were equal numbers of black and white beads in the population, you could do an experiment to test it by drawing out a sample of six beads as described above. If all six were black or all were white, the probability of either outcome (which in this case are the most extreme departures from the expected under the null hypothesis) is only 3.13% and would be considered statistically significant. A researcher would reject the null hypothesis and conclude that the sample did not come from a population containing equal numbers of black and white beads.

5.4 Making the wrong decision
If the proportions of black and white beads in the sack really were equal, then most of the time a sample of six beads would contain both colours. But, if the beads in the sample were all only black or all only white, a researcher would decide the sack (the population) did not contain 50% black and 50% white. Here they would have made the wrong decision, but this would not happen very often (the probability of either of these outcomes is 2/64). The unavoidable problem with using probability to help you make a decision is that there is always a chance of making a wrong decision and you have no way of telling when you have done this.
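The decision described above, and how often it would be wrong when the sack really does contain equal numbers, can be written as a short calculation. This is a sketch under the assumptions of the bead example (six beads drawn with replacement, and the null hypothesis rejected only when all six are the same colour); the names `ALPHA` and `p_all_one_colour` are introduced here for illustration.

```python
from fractions import Fraction

ALPHA = Fraction(5, 100)  # the conventional 5% significance level


def p_all_one_colour(p_black, n=6):
    """Probability that n beads drawn with replacement are all black or all white."""
    return p_black**n + (1 - p_black)**n


# Under the null hypothesis the sack is half black and half white.
p_extreme = p_all_one_colour(Fraction(1, 2))

print(p_extreme, float(p_extreme))  # 1/32 0.03125
print(p_extreme < ALPHA)  # True
```

The value 1/32 is 2/64 in lowest terms. It is below 5%, so a sample of six beads of one colour is treated as significant, and it is also the proportion of occasions on which a researcher sampling from a genuinely 50:50 sack would make the wrong decision.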
As described above, if a researcher got a sample of six of one colour, they would decide that the population (the contents of the bag) was not 50% black and 50% white when really it was. This mistake, where the null hypothesis of equal numbers is inappropriately rejected, is called a Type 1 error. There is another problem too. Sometimes an unknown population is different to the expected (e.g. it may contain 90% white beads and 10% black ones), but the sample taken (e.g. four white and two black) is not significantly different to the expected outcome predicted by the hypothesis of 50:50. In this case the researcher would decide the composition of the population was the one expected under the null hypothesis (50:50), even though it was not. This mistake, when the alternate hypothesis holds but is inappropriately rejected, is called a Type 2 error. Every time you do a statistical test you run the risk of a Type 1 or Type 2 error. There will be more discussion of these errors in Chapter 8, but they are unavoidably associated with using probability to help you make a decision.
5.5 Other probability levels
Sometimes, depending on the hypothesis being tested, a researcher may decide that the ‘less than 5%’ significance level (with its 5% chance of inappropriately rejecting the null hypothesis) is too risky. Here is a medical example. Malaria is caused by a parasitic protozoan that is carried by certain species of mosquito. When an infected mosquito bites a person the protozoans are injected into the person’s bloodstream, where they reproduce inside red blood cells. A small proportion of malarial infections progress to cerebral malaria, where the parasite infects cells in the person’s brain, causing severe inflammation and often death. A biomedical scientist was asked to test a new and extremely expensive drug that was hoped to reduce mortality in people suffering from cerebral malaria. A large experiment was done, where half of cerebral malaria cases chosen at random received the new drug and the other half did not. The survival of both groups over the next month was compared. The alternate hypothesis was, ‘There will be increased survival of the drug-treated group compared to the control.’ Here, the prohibitive cost of the drug meant that the manufacturer had to be very confident that it was of real use before recommending and
marketing it. Therefore, the risk of a Type 1 error (significantly greater survival in the experimental group compared with the control simply by chance) when using the 5% significance level might be considered too great. Instead, the researcher might decide to reduce the risk of Type 1 error by using the 1% or even 0.1% level and only recommend the drug if the reduction in mortality was so marked that it was significant at these levels.
Here is an example of the opposite case. Before releasing any new pharmaceutical product on the market it has to be assessed for side effects. There were concerns that the new sunscreen ‘Bayray Blockout 2020’ might cause an increase in pimples among frequent users. A pharmaceutical scientist ran an experiment using 200 high-school students during their summer holiday. Each was asked to apply Bayray Blockout 2020 to their left cheek and the best-selling but boringly named ‘Sensible Suncare’ to their right cheek every morning, and then spend the next hour sunbathing. After six weeks the number of pimples per square cm on each cheek was counted and compared. The alternate hypothesis was, ‘Bayray Blockout 2020 causes an increase in pimple numbers compared with Sensible Suncare.’ Here, an increase could be disastrous for sales, so the scientist decided on a significance level of 10% rather than the conventional 5%. Even though there was a 10% chance (double the usual risk) of a Type 1 error, the company could not take the chance that Bayray Blockout 2020 increased the incidence of pimples.
The most commonly used significance level is 5%, which is 0.05. If you decide to use a different level in an analysis, the decision needs to be made, justified, and clearly specified before the experiment is done. For a significant result the actual probability is also important. For example, a probability of 0.04 is not very much less than 0.05. In contrast, a probability of 0.002 is very much less than 0.05.
Therefore, even though both are significant, the result with the lower probability gives much stronger evidence for rejecting the null hypothesis.
5.6 How are probability values reported?
The symbol used for the chosen significance level (e.g. 0.05) is the Greek letter α (alpha). Often you will see the probability reported as P < 0.05 or P < 0.01 or P < 0.001. These mean, respectively, ‘The probability is less than 0.05’ or ‘The probability is less than 0.01’ or ‘The probability is less than 0.001.’
N.S. means ‘not significant,’ which is when the probability is 0.05 or more (P ≥ 0.05). Of course, as noted above, if you have specified a significance level of 0.05 and get a result with a probability of less than 0.001, this is far stronger evidence for your alternate hypothesis than a result with a probability of 0.04.
5.7 All statistical tests do the same basic thing
In the ‘beads from a sack’ example all of the possible outcomes were listed and the probability of each was calculated directly. Some statistical tests do this. Most, however, use a formula to produce a number called a statistic. The probability of getting each possible value of the statistic has been previously calculated, so you can use the formula to get the numerical value of the statistic, look up the probability of that value in a published set of statistical tables, and make your decision to retain the null hypothesis if it has a probability of ≥ 0.05 or reject it if it has a probability of