1,074 25 2MB
Pages 293 Page size 445.5 x 675 pts
Springer Undergraduate Mathematics Series Advisory Board M.A.J. Chaplain University of Dundee K. Erdmann University of Oxford A. MacIntyre Queen Mary, University of London E. S¨uli University of Oxford J.F. Toland University of Bath
For other titles published in this series, go to www.springer.com/series/3423
N.H. Bingham
•
John M. Fry
Regression Linear Models in Statistics
13
N.H. Bingham Imperial College, London UK [email protected]
John M. Fry University of East London UK [email protected]
Springer Undergraduate Mathematics Series ISSN 1615-2085 ISBN 978-1-84882-968-8 e-ISBN 978-1-84882-969-5 DOI 10.1007/978-1-84882-969-5 Springer London Dordrecht Heidelberg New York British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library Library of Congress Control Number: 2010935297 Mathematics Subject Classification (2010): 62J05, 62J10, 62J12, 97K70 c Springer-Verlag London Limited 2010 Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers. The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use. The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made. Cover design: Deblik Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
To James, Ruth and Tom Nick
To my parents Ingrid Fry and Martyn Fry John
Preface
The subject of regression, or of the linear model, is central to the subject of statistics. It concerns what can be said about some quantity of interest, which we may not be able to measure, starting from information about one or more other quantities, in which we may not be interested but which we can measure. We model our variable of interest as a linear combination of these variables (called covariates), together with some error. It turns out that this simple prescription is very flexible, very powerful and useful. If only because regression is inherently a subject in two or more dimensions, it is not the first topic one studies in statistics. So this book should not be the first book in statistics that the student uses. That said, the statistical prerequisites we assume are modest, and will be covered by any first course on the subject: ideas of sample, population, variation and randomness; the basics of parameter estimation, hypothesis testing, p–values, confidence intervals etc.; the standard distributions and their uses (normal, Student t, Fisher F and chisquare – though we develop what we need of F and chi-square for ourselves). Just as important as a first course in statistics is a first course in probability. Again, we need nothing beyond what is met in any first course on the subject: random variables; probability distribution and densities; standard examples of distributions; means, variances and moments; some prior exposure to momentgenerating functions and/or characteristic functions is useful but not essential (we include all we need here). Our needs are well served by John Haigh’s book Probability models in the SUMS series, Haigh (2002). Since the terms regression and linear model are largely synonymous in statistics, it is hardly surprising that we make extensive use of linear algebra and matrix theory. Again, our needs are well served within the SUMS series, in the two books by Blyth and Robertson, Basic linear algebra and Further linear algebra, Blyth and Robertson (2002a), (2002b). We make particular use of the vii
viii
Preface
material developed there on sums of orthogonal projections. It will be a pleasure for those familiar with this very attractive material from pure mathematics to see it being put to good use in statistics. Practical implementation of much of the material of this book requires computer assistance – that is, access to one of the many specialist statistical packages. Since we assume that the student has already taken a first course in statistics, for which this is also true, it is reasonable for us to assume here too that the student has some prior knowledge of and experience with a statistical package. As with any other modern student text on statistics, one is here faced with various choices. One does not want to tie the exposition too tightly to any one package; one cannot cover all packages, and shouldn’t try – but one wants to include some specifics, to give the text focus. We have relied here mainly on S-Plus/R .1 Most of the contents are standard undergraduate material. The boundary between higher-level undergraduate courses and Master’s level courses is not a sharp one, and this is reflected in our style of treatment. We have generally included complete proofs except in the last two chapters on more advanced material: Chapter 8, on Generalised Linear Models (GLMs), and Chapter 9, on special topics. One subject going well beyond what we cover – Time Series, with its extensive use of autoregressive models – is commonly taught at both undergraduate and Master’s level in the UK. We have included in the last chapter some material, on non-parametric regression, which – while no harder – is perhaps as yet more commonly taught at Master’s level in the UK. In accordance with the very sensible SUMS policy, we have included exercises at the end of each chapter (except the last), as well as worked examples. One then has to choose between making the book more student-friendly, by including solutions, or more lecturer-friendly, by not doing so. We have nailed our colours firmly to the mast here by including full solutions to all exercises. We hope that the book will nevertheless be useful to lecturers also (e.g., in inclusion of references and historical background). Rather than numbering equations, we have labelled important equations acronymically (thus the normal equations are (NE ), etc.), and included such equation labels in the index. Within proofs, we have occasionally used local numbering of equations: (∗), (a), (b) etc. In pure mathematics, it is generally agreed that the two most attractive subjects, at least at student level, are complex analysis and linear algebra. In statistics, it is likewise generally agreed that the most attractive part of the subject is 1
S+, S-PLUS, S+FinMetrics, S+EnvironmentalStats, S+SeqTrial, S+SpatialStats, S+Wavelets, S+ArrayAnalyzer, S-PLUS Graphlets, Graphlet, Trellis, and Trellis Graphics are either trademarks or registered trademarks of Insightful Corporation in the United States and/or other countries. Insightful Corporation1700 Westlake Avenue N, Suite 500Seattle, Washington 98109 USA.
Preface
ix
regression and the linear model. It is also extremely useful. This lovely combination of good mathematics and practical usefulness provides a counter-example, we feel, to the opinion of one of our distinguished colleagues. Mathematical statistics, Professor x opines, combines the worst aspects of mathematics with the worst aspects of statistics. We profoundly disagree, and we hope that the reader will disagree too. The book has been influenced by our experience of learning this material, and teaching it, at a number of universities over many years, in particular by the first author’s thirty years in the University of London and by the time both authors spent at the University of Sheffield. It is a pleasure to thank Charles Goldie and John Haigh for their very careful reading of the manuscript, and Karen Borthwick and her colleagues at Springer for their kind help throughout this project. We thank our families for their support and forbearance. NHB, JMF Imperial College, London and the University of East London, March 2010
Contents
1.
Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 The Method of Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 Correlation version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 Large-sample limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 The origins of regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Applications of regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5 The Bivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Maximum Likelihood and Least Squares . . . . . . . . . . . . . . . . . . . . . 1.7 Sums of Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.8 Two regressors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1 1 3 7 8 9 11 14 21 23 26 28
2.
The Analysis of Variance (ANOVA) . . . . . . . . . . . . . . . . . . . . . . . . 2.1 The Chi-Square Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Change of variable formula and Jacobians . . . . . . . . . . . . . . . . . . . 2.3 The Fisher F-distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Normal sample mean and sample variance . . . . . . . . . . . . . . . . . . . 2.6 One-Way Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Two-Way ANOVA; No Replications . . . . . . . . . . . . . . . . . . . . . . . . . 2.8 Two-Way ANOVA: Replications and Interaction . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
33 33 36 37 38 39 42 49 52 56
3.
Multiple Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 3.1 The Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
xi
xii
Contents
3.2 Solution of the Normal Equations . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Properties of Least-Squares Estimators . . . . . . . . . . . . . . . . . . . . . . 3.4 Sum-of-Squares Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Coefficient of determination . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5 Chi-Square Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.5.1 Idempotence, Trace and Rank . . . . . . . . . . . . . . . . . . . . . . . . 3.5.2 Quadratic forms in normal variates . . . . . . . . . . . . . . . . . . . 3.5.3 Sums of Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.6 Orthogonal Projections and Pythagoras’s Theorem . . . . . . . . . . . 3.7 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64 70 73 79 80 81 82 82 85 89 94
4.
Further Multilinear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.1 Polynomial Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.1.1 The Principle of Parsimony . . . . . . . . . . . . . . . . . . . . . . . . . . 102 4.1.2 Orthogonal polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.1.3 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 4.2 Analysis of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 4.3 The Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . 105 4.4 The Multinormal Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 4.4.1 Estimation for the multivariate normal . . . . . . . . . . . . . . . . 113 4.5 Conditioning and Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 4.6 Mean-square prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 4.7 Generalised least squares and weighted regression . . . . . . . . . . . . . 123 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.
Adding additional covariates and the Analysis of Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 5.1 Introducing further explanatory variables . . . . . . . . . . . . . . . . . . . . 129 5.1.1 Orthogonal parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.2 ANCOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 5.2.1 Nested Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.
Linear Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.1 Minimisation Under Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 6.2 Sum-of-Squares Decomposition and F-Test . . . . . . . . . . . . . . . . . . . 152 6.3 Applications: Sequential Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 6.3.1 Forward selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 6.3.2 Backward selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 6.3.3 Stepwise regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Contents
xiii
7.
Model Checking and Transformation of Data . . . . . . . . . . . . . . . 163 7.1 Deviations from Standard Assumptions . . . . . . . . . . . . . . . . . . . . . 163 7.2 Transformation of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 7.3 Variance-Stabilising Transformations . . . . . . . . . . . . . . . . . . . . . . . . 171 7.4 Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
8.
Generalised Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.2 Definitions and examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 8.2.1 Statistical testing and model comparisons . . . . . . . . . . . . . 185 8.2.2 Analysis of residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 8.2.3 Athletics times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 8.3 Binary models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 8.4 Count data, contingency tables and log-linear models . . . . . . . . . 193 8.5 Over-dispersion and the Negative Binomial Distribution . . . . . . . 197 8.5.1 Practical applications: Analysis of over-dispersed models in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
9.
Other topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 9.1 Mixed models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 9.1.1 Mixed models and Generalised Least Squares . . . . . . . . . . 206 9.2 Non-parametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 9.2.1 Kriging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 9.3 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 9.3.1 Optimality criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 9.3.2 Incomplete designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 9.4 Time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 9.4.1 Cointegration and spurious regression . . . . . . . . . . . . . . . . . 220 9.5 Survival analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 9.5.1 Proportional hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 9.6 p >> n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 Dramatis Personae: Who did what when . . . . . . . . . . . . . . . . . . . . . . . 269 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
1
Linear Regression
1.1 Introduction When we first meet Statistics, we encounter random quantities (random variables, in probability language, or variates, in statistical language) one at a time. This suffices for a first course. Soon however we need to handle more than one random quantity at a time. Already we have to think about how they are related to each other. Let us take the simplest case first, of two variables. Consider first the two extreme cases. At one extreme, the two variables may be independent (unrelated). For instance, one might result from laboratory data taken last week, the other might come from old trade statistics. The two are unrelated. Each is uninformative about the other. They are best looked at separately. What we have here are really two one-dimensional problems, rather than one two-dimensional problem, and it is best to consider matters in these terms. At the other extreme, the two variables may be essentially the same, in that each is completely informative about the other. For example, in the Centigrade (Celsius) temperature scale, the freezing point of water is 0o and the boiling point is 100o, while in the Fahrenheit scale, freezing point is 32o and boiling point is 212o (these bizarre choices are a result of Fahrenheit choosing as his origin of temperature the lowest temperature he could achieve in the laboratory, and recognising that the body is so sensitive to temperature that a hundredth of the freezing-boiling range as a unit is inconveniently large for everyday, N.H. Bingham and J.M. Fry, Regression: Linear Models in Statistics, Springer Undergraduate Mathematics Series, DOI 10.1007/978-1-84882-969-5 1, c Springer-Verlag London Limited 2010
1
2
1. Linear Regression
non-scientific use, unless one resorts to decimals). The transformation formulae are accordingly C = (F − 32) × 5/9,
F = C × 9/5 + 32.
While both scales remain in use, this is purely for convenience. To look at temperature in both Centigrade and Fahrenheit together for scientific purposes would be silly. Each is completely informative about the other. A plot of one against the other would lie exactly on a straight line. While apparently a two– dimensional problem, this would really be only one one-dimensional problem, and so best considered as such. We are left with the typical and important case: two–dimensional data, (x1 , y1 ), . . . , (xn , yn ) say, where each of the x and y variables is partially but not completely informative about the other. Usually, our interest is on one variable, y say, and we are interested in what knowledge of the other – x – tells us about y. We then call y the response variable, and x the explanatory variable. We know more about y knowing x than not knowing x; thus knowledge of x explains, or accounts for, part but not all of the variability we see in y. Another name for x is the predictor variable: we may wish to use x to predict y (the prediction will be an uncertain one, to be sure, but better than nothing: there is information content in x about y, and we want to use this information). A third name for x is the regressor, or regressor variable; we will turn to the reason for this name below. It accounts for why the whole subject is called regression. The first thing to do with any data set is to look at it. We subject it to exploratory data analysis (EDA); in particular, we plot the graph of the n data points (xi , yi ). We can do this by hand, or by using a statistical package: Minitab ,1 for instance, using the command Regression, or S-Plus/R by using the command lm (for linear model – see below). Suppose that what we observe is a scatter plot that seems roughly linear. That is, there seems to be a systematic component, which is linear (or roughly so – linear to a first approximation, say) and an error component, which we think of as perturbing this in a random or unpredictable way. Our job is to fit a line through the data – that is, to estimate the systematic linear component. For illustration, we recall the first case in which most of us meet such a task – experimental verification of Ohm’s Law (G. S. Ohm (1787-1854), in 1826). When electric current is passed through a conducting wire, the current (in amps) is proportional to the applied potential difference or voltage (in volts), the constant of proportionality being the inverse of the resistance of the wire 1
Minitab , Quality Companion by Minitab , Quality Trainer by Minitab , Quality. Analysis. Results and the Minitab logo are all registered trademarks of Minitab, Inc., in the United States and other countries.
1.2 The Method of Least Squares
3
(in ohms). One measures the current observed for a variety of voltages (the more the better). One then attempts to fit a line through the data, observing with dismay that, because of experimental error, no three of the data points are exactly collinear. A typical schoolboy solution is to use a perspex ruler and fit by eye. Clearly a more systematic procedure is needed. We note in passing that, as no current flows when no voltage is applied, one may restrict to lines through the origin (that is, lines with zero intercept) – by no means the typical case.
1.2 The Method of Least Squares The required general method – the Method of Least Squares – arose in a rather different context. We know from Newton’s Principia (Sir Isaac Newton (1642– 1727), in 1687) that planets, the Earth included, go round the sun in elliptical orbits, with the Sun at one focus of the ellipse. By cartesian geometry, we may represent the ellipse by an algebraic equation of the second degree. This equation, though quadratic in the variables, is linear in the coefficients. How many coefficients p we need depends on the choice of coordinate system – in the range from two to six. We may make as many astronomical observations of the planet whose orbit is to be determined as we wish – the more the better, n say, where n is large – much larger than p. This makes the system of equations for the coefficients grossly over-determined, except that all the observations are polluted by experimental error. We need to tap the information content of the large number n of readings to make the best estimate we can of the small number p of parameters. Write the equation of the ellipse as a1 x1 + a2 x2 + . . . = 0. Here the aj are the coefficients, to be found or estimated, and the xj are those of x2 , xy, y 2 , x, y, 1 that we need in the equation of the ellipse (we will always need 1, unless the ellipse degenerates to a point, which is not the case here). For the ith point, the left-hand side above will be 0 if the fit is exact, but i say (denoting the ith error) in view of the observational errors. We wish to keep the errors i small; we wish also to put positive and negative i on the same footing, which we may do by looking at the squared errors 2i . A measure of the discrepn ancy of the fit is the sum of these squared errors, i=1 2i . The Method of Least Squares is to choose the coefficients aj so as to minimise this sums of squares, n SS := 2i . i=1
As we shall see below, this may readily and conveniently be accomplished. The Method of Least Squares was discovered independently by two workers, both motivated by the above problem of fitting planetary orbits. It was first
4
1. Linear Regression
published by Legendre (A. M. Legendre (1752–1833), in 1805). It had also been discovered by Gauss (C. F. Gauss (1777–1855), in 1795); when Gauss published his work in 1809, it precipitated a priority dispute with Legendre. Let us see how to implement the method. We do this first in the simplest case, the fitting of a straight line y = a + bx by least squares through a data set (x1 , y1 ), . . . , (xn , yn ). Accordingly, we choose a, b so as to minimise the sum of squares n n 2i = (yi − a − bxi )2 . SS := i=1
i=1
Taking ∂SS/∂a = 0 and ∂SS/∂b = 0 gives n n ∂SS/∂a := −2 ei = −2 (yi − a − bxi ), i=1 i=1 n n ∂SS/∂b := −2 xi ei = −2 xi (yi − a − bxi ). i=1
i=1
To find the minimum, we equate both these to zero: n n (yi − a − bxi ) = 0 and xi (yi − a − bxi ) = 0. i=1
i=1
This gives two simultaneous linear equations in the two unknowns a, b, called the normal equations. Using the ‘bar’ notation x :=
1 n xi . i=1 n
Dividing both sides by n and rearranging, the normal equations are a + bx = y
and ax + bx2 = xy.
Multiply the first by x and subtract from the second: b=
xy − x.y x2 − (x)2
,
and then a = y − bx.
We will use this bar notation systematically. We call x := n1 ni=1 xi the sample mean, or average, of x1 , . . . , xn , and similarly for y. In this book (though not n all others!), the sample variance is defined as the average, n1 i=1 (xi − x)2 , of (xi − x)2 , written s2x or sxx . Then using linearity of average, or ‘bar’, s2x = sxx = (x − x)2 = x2 − 2x.x + x2 = (x2 ) − 2x.x + (x)2 = (x2 ) − (x)2 ,
1.2 The Method of Least Squares
5
since x.x = (x)2 . Similarly, the sample covariance of x and y is defined as the average of (x − x)(y − y), written sxy . So sxy
=
(x − x)(y − y) = xy − x.y − x.y + x.y
=
(xy) − x.y − x.y + x.y = (xy) − x.y.
Thus the slope b is given by the sample correlation coefficient b = sxy /sxx , the ratio of the sample covariance to the sample x-variance. Using the alternative ‘sum of squares’ notation n n (xi − x)2 , Sxy := (xi − x)(yi − y), Sxx := i=1
i=1
b = Sxy /Sxx ,
a = y − bx.
The line – the least-squares line that we have fitted – is y = a + bx with this a and b, or y − y = b(x − x), b = sxy /sxx = Sxy /Sxx . (SRL) It is called the sample regression line, for reasons which will emerge later. Notice that the line goes through the point (x, y) – the centroid, or centre of mass, of the scatter diagram (x1 , y1 ), . . . , (xn , yn ).
Note 1.1 We will see later that if we assume that the errors are independent and identically distributed (which we abbreviate to iid) and normal, N (0, σ 2 ) say, then these formulas for a and b also give the maximum likelihood estimates. Further, 100(1 − α)% confidence intervals in this case can be calculated from points a ˆ and ˆb as x2i , a = a ˆ ± tn−2 (1 − α/2)s nSxx b
tn−2 (1 − α/2)s √ = ˆb ± , Sxx
where tn−2 (1 − α/2) denotes the 1 − α/2 quantile of the Student t distribution with n − 2 degrees of freedom and s is given by s=
1 n−2
2 Sxy . Syy − Sxx
6
1. Linear Regression
Example 1.2
65 60 55
Height (Inches)
70
We fit the line of best fit to model y = Height (in inches) based on x = Age (in years) for the following data: x=(14, 13, 13, 14, 14, 12, 12, 15, 13, 12, 11, 14, 12, 15, 16, 12, 15, 11, 15), y=(69, 56.5, 65.3, 62.8, 63.5, 57.3, 59.8, 62.5, 62.5, 59.0, 51.3, 64.3, 56.3, 66.5, 72.0, 64.8, 67.0, 57.5, 66.5).
11
12
13 14 Age (Years)
15
16
Figure 1.1 Scatter plot of the data in Example 1.2 plus fitted straight line
One may also calculate Sxx and Sxy as Sxx = xi yi − nxy, x2i − nx2 . Sxy = Since that
¯ = 13.316, y¯ = 62.337, xi yi = 15883, x
b=
2 xi = 3409, n = 19, we have
15883 − 19(13.316)(62.337) = 2.787 (3 d.p.). 3409 − 19(13.3162)
Rearranging, we see that a becomes 62.33684 − 2.787156(13.31579) = 25.224. This model suggests that the children are growing by just under three inches
1.2 The Method of Least Squares
7
per year. A plot of the observed data and the fitted straight line is shown in Figure 1.1 and appears reasonable, although some deviation from the fitted straight line is observed.
1.2.1 Correlation version The sample correlation coefficient r = rxy is defined as sxy r = rxy := , sx sy the quotient of the sample covariance and the product of the sample standard deviations. Thus r is dimensionless, unlike the other quantities encountered so far. One has (see Exercise 1.1) −1 ≤ r ≤ 1, with equality if and only if (iff) all the points (x1 , y1 ), . . . , (xn , yn ) lie on a straight line. Using sxy = rxy sx sy and sxx = s2x , we may alternatively write the sample regression line as y − y = b(x − x),
b = rxy sy /sx .
(SRL)
Note also that the slope b has the same sign as the sample covariance and sample correlation coefficient. These will be approximately the population covariance and correlation coefficient for large n (see below), so will have slope near zero when y and x are uncorrelated – in particular, when they are independent, and will have positive (negative) slope when x, y are positively (negatively) correlated. We now have five parameters in play: two means, μx and μy , two variances σx2 and σy2 (or their square roots, the standard deviations σx and σy ), and one correlation, ρxy . The two means are measures of location, and serve to identify the point – (μx , μy ), or its sample counterpart, (x, y) – which serves as a natural choice of origin. The two variances (or standard deviations) are measures of scale, and serve as natural units of length along coordinate axes centred at this choice of origin. The correlation, which is dimensionless, serves as a measure of dependence, or linkage, or association, and indicates how closely y depends on x – that is, how informative x is about y. Note how differently these behave under affine transformations, x → ax + b. The mean transforms linearly: E(ax + b) = aEx + b; the variance transforms by var(ax + b) = a2 var(x); the correlation is unchanged – it is invariant under affine transformations.
8
1. Linear Regression
1.2.2 Large-sample limit When x1 , . . . , xn are independent copies of a random variable x, and x has mean Ex, the Law of Large Numbers says that x → Ex
(n → ∞).
See e.g. Haigh (2002), §6.3. There are in fact several versions of the Law of Large Numbers (LLN). The Weak LLN (or WLLN) gives convergence in probability (for which see e.g. Haigh (2002). The Strong LLN (or SLLN) gives convergence with probability one (or ‘almost surely’, or ‘a.s.’); see Haigh (2002) for a short proof under stronger moment assumptions (fourth moment finite), or Grimmett and Stirzaker (2001), §7.5 for a proof under the minimal condition – existence of the mean. While one should bear in mind that the SLLN holds only off some exceptional set of probability zero, we shall feel free to state the result as above, with this restriction understood. Note the content of the SLLN: thinking of a random variable as its mean plus an error, independent errors tend to cancel when one averages. This is essentially what makes Statistics work: the basic technique in Statistics is averaging. All this applies similarly with x replaced by y, x2 , y 2 , xy, when all these have means. Then 2 s2x = sxx = x2 − x2 → E x2 − (Ex) = var(x), the population variance – also written σx2 = σxx – and sxy = xy − x.y → E(xy) − Ex.Ey = cov(x, y), the population covariance – also written σxy . Thus as the sample size n increases, the sample regression line y − y = b(x − x),
b = sxy /sxx
tends to the line y − Ey = β(x − Ex),
β = σxy /σxx .
(P RL)
This – its population counterpart – is accordingly called the population regression line. Again, there is a version involving correlation, this time the population correlation coefficient σxy ρ = ρxy := : σx σy y − Ey = β(x − Ex),
β = ρxy σy /σx .
(P RL)
1.3 The origins of regression
9
Note 1.3 The following illustration is worth bearing in mind here. Imagine a school Physics teacher, with a class of twenty pupils; they are under time pressure revising for an exam, he is under time pressure marking. He divides the class into ten pairs, gives them an experiment to do over a double period, and withdraws to do his marking. Eighteen pupils gang up on the remaining two, the best two in the class, and threaten them into agreeing to do the experiment for them. This pair’s results are then stolen by the others, who to disguise what has happened change the last two significant figures, say. Unknown to all, the best pair’s instrument was dropped the previous day, and was reading way too high – so the first significant figures in their results, and hence all the others, were wrong. In this example, the insignificant ‘rounding errors’ in the last significant figures are independent and do cancel – but no significant figures are correct for any of the ten pairs, because of the strong dependence between the ten readings. Here the tenfold replication is only apparent rather than real, and is valueless. We shall see more serious examples of correlated errors in Time Series in §9.4, where high values tend to be succeeded by high values, and low values tend to be succeeded by low values.
1.3 The origins of regression The modern era in this area was inaugurated by Sir Francis Galton (1822–1911), in his book Hereditary genius – An enquiry into its laws and consequences of 1869, and his paper ‘Regression towards mediocrity in hereditary stature’ of 1886. Galton’s real interest was in intelligence, and how it is inherited. But intelligence, though vitally important and easily recognisable, is an elusive concept – human ability is infinitely variable (and certainly multi–dimensional!), and although numerical measurements of general ability exist (intelligence quotient, or IQ) and can be measured, they can serve only as a proxy for intelligence itself. Galton had a passion for measurement, and resolved to study something that could be easily measured; he chose human height. In a classic study, he measured the heights of 928 adults, born to 205 sets of parents. He took the average of the father’s and mother’s height (‘mid-parental height’) as the predictor variable x, and height of offspring as response variable y. (Because men are statistically taller than women, one needs to take the gender of the offspring into account. It is conceptually simpler to treat the sexes separately – and focus on sons, say – though Galton actually used an adjustment factor to compensate for women being shorter.) When he displayed his data in tabular form, Galton noticed that it showed elliptical contours – that is, that squares in the
10
1. Linear Regression
(x, y)-plane containing equal numbers of points seemed to lie approximately on ellipses. The explanation for this lies in the bivariate normal distribution; see §1.5 below. What is most relevant here is Galton’s interpretation of the sample and population regression lines (SRL) and (PRL). In (P RL), σx and σy are measures of variability in the parental and offspring generations. There is no reason to think that variability of height is changing (though mean height has visibly increased from the first author’s generation to his children). So (at least to a first approximation) we may take these as equal, when (P RL) simplifies to y − Ey = ρxy (x − Ex).
(P RL)
Hence Galton’s celebrated interpretation: for every inch of height above (or below) the average, the parents transmit to their children on average ρ inches, where ρ is the population correlation coefficient between parental height and offspring height. A further generation will introduce a further factor ρ, so the parents will transmit – again, on average – ρ2 inches to their grandchildren. This will become ρ3 inches for the great-grandchildren, and so on. Thus for every inch of height above (or below) the average, the parents transmit to their descendants after n generations on average ρn inches of height. Now 0 0, μi are real, −1 < ρ < 1. Since f is clearly non-negative, to show that f is a (probability density) function (in two dimensions), it suffices to show that f integrates to 1: ∞ ∞ f (x, y) dx dy = 1, or f = 1. −∞
−∞
Write f1 (x) :=
∞
f (x, y) dy, −∞
f2 (y) :=
∞
f (x, y) dx. −∞
∞ ∞ Then to show f = 1, we need to show −∞ f1 (x) dx = 1 (or −∞ f2 (y) dy = 1). Then f1 , f2 are densities, in one dimension. If f (x, y) = fX,Y (x, y) is the joint density of two random variables X, Y , then f1 (x) is the density fX (x) of X, f2 (y) the density fY (y) of Y (f1 , f2 , or fX , fY , are called the marginal densities of the joint density f , or fX,Y ). To perform the integrations, we have to complete the square. We have the algebraic identity (1 − ρ2 )Q ≡
y − μ 2
σ2
−ρ
x − μ 2 1
σ1
x − μ1 2 + 1 − ρ2 σ1
(reducing the number of occurrences of y to 1, as we intend to integrate out y first). Then (taking the terms free of y out through the y-integral) ∞ 1 exp − 21 (x − μ1 )2 /σ12 − 2 (y − cx )2 1 √ f1 (x) = exp dy, √ σ22 (1 − ρ2 ) σ1 2π −∞ σ2 2π 1 − ρ2 (∗) where σ2 cx := μ2 + ρ (x − μ1 ). σ1 The integral is 1 (‘normal density’). So exp − 21 (x − μ1 )2 /σ12 √ f1 (x) = , σ1 2π which integrates to 1 (‘normal density’), proving
16
1. Linear Regression
Fact 1. f (x, y) is a joint density function (two-dimensional), with marginal density functions f1 (x), f2 (y) (one-dimensional). So we can write f (x, y) = fX,Y (x, y),
f1 (x) = fX (x),
f2 (y) = fY (y).
Fact 2. X, Y are normal: X is N (μ1 , σ12 ), Y is N (μ2 , σ22 ). For, we showed f1 = fX to be the N (μ1 , σ12 ) density above, and similarly for Y by symmetry. Fact 3. EX = μ1 , EY = μ2 , var X = σ12 , var Y = σ22 . This identifies four out of the five parameters: two means μi , two variances σi2 . Next, recall the definition of conditional probability: P (A|B) := P (A ∩ B)/P (B). In the discrete case, if X, Y take possible values xi , yj with probabilities fX (xi ), fY (yj ), (X, Y ) takes possible values (xi , yj ) with corresponding probabilities fX,Y (xi , yj ): fX (xi ) = P (X = xi ) = Σj P (X = xi , Y = yj ) = Σj fX,Y (xi , yj ). Then the conditional distribution of Y given X = xi is fY |X (yj |xi ) =
fX,Y (xi , yj ) P (Y = yj , X = xi ) = , P (X = xi ) j fX,Y (xi , yj )
and similarly with X, Y interchanged. In the density case, we have to replace sums by integrals. Thus the conditional density of Y given X = x is (see e.g. Haigh (2002), Def. 4.19, p. 80) fY |X (y|x) :=
fX,Y (x, y) fX,Y (x, y) = ∞ . fX (x) f (x, y) dy −∞ X,Y
Returning to the bivariate normal: Fact 4. The conditional distribution of y given X = x is σ2 N μ2 + ρ (x − μ1 ), σ22 1 − ρ2 . σ1
Proof
Go back to completing the square (or, return to (∗) with and dy deleted): 2 2 exp − 21 (x − μ1 ) /σ12 exp − 12 (y − cx ) / σ22 1 − ρ2 √ f (x, y) = . . √ σ1 2π σ2 2π 1 − ρ2
1.5 The Bivariate Normal Distribution
17
The first factor is f1 (x), by Fact 1. So, fY |X (y|x) = f (x, y)/f1 (x) is the second factor:
1 −(y − cx )2 exp fY |X (y|x) = √ , 2σ22 (1 − ρ2 ) 2πσ2 1 − ρ2 where cx is the linear function of x given below (∗). This not only completes the proof of Fact 4 but gives Fact 5. The conditional mean E(Y |X = x) is linear in x: E(Y |X = x) = μ2 + ρ
σ2 (x − μ1 ). σ1
Note 1.5 1. This simplifies when X and Y are equally variable, σ1 = σ2 : E(Y |X = x) = μ2 + ρ(x − μ1 ) (recall EX = μ1 , EY = μ2 ). Recall that in Galton’s height example, this says: for every inch of mid-parental height above/below the average, x−μ1 , the parents pass on to their child, on average, ρ inches, and continuing in this way: on average, after n generations, each inch above/below average becomes on average ρn inches, and ρn → 0 as n → ∞, giving regression towards the mean. 2. This line is the population regression line (PRL), the population version of the sample regression line (SRL). 3. The relationship in Fact 5 can be generalised (§4.5): a population regression function – more briefly, a regression – is a conditional mean. This also gives Fact 6. The conditional variance of Y given X = x is var(Y |X = x) = σ22 1 − ρ2 . Recall (Fact 3) that the variability (= variance) of Y is varY = σ22 . By Fact 5, the variability remaining in Y when X is given (i.e., not accounted for by knowledge of X) is σ22 (1 − ρ2 ). Subtracting, the variability of Y which is accounted for by knowledge of X is σ22 ρ2 . That is, ρ2 is the proportion of the
18
1. Linear Regression
variability of Y accounted for by knowledge of X. So ρ is a measure of the strength of association between Y and X. Recall that the covariance is defined by cov(X, Y ) := =
E[(X − EX)(Y − EY )] = E[(X − μ1 )(Y − μ2 )], E(XY ) − (EX)(EY ),
and the correlation coefficient ρ, or ρ(X, Y ), defined by E[(X − μ1 )(Y − μ2 )] cov(X, Y ) √ = ρ = ρ(X, Y ) := √ σ1 σ2 varX varY is the usual measure of the strength of association between X and Y (−1 ≤ ρ ≤ 1; ρ = ±1 iff one of X, Y is a function of the other). That this is consistent with the use of the symbol ρ for a parameter in the density f (x, y) is shown by the fact below. Fact 7. If (X, Y )T is bivariate normal, the correlation coefficient of X, Y is ρ.
Proof
ρ(X, Y ) := E
X − μ1 σ1
Y − μ2 σ2
x − μ1 y − μ2 f (x, y)dxdy. = σ1 σ2
Substitute for f (x, y) = c exp(− 12 Q), and make the change of variables u := (x − μ1 )/σ1 , v := (y − μ2 )/σ2 : − u2 − 2ρuv + v 2 1 du dv. ρ(X, Y ) = uv exp 2(1 − ρ2 ) 2π 1 − ρ2 Completing the square as before, [u2 − 2ρuv + v 2 ] = (v − ρu)2 + (1 − ρ2 )u2 . So 2 1 u (v − ρu)2 1 ρ(X, Y ) = √ u exp − v exp − du. √ dv. 2 2(1 − ρ2 ) 2π 2π 1 − ρ2 Replace v in the inner integral by (v − ρu) + ρu, and calculate the two resulting integrals separately. The first is zero (‘normal mean’, or symmetry), the second is ρu (‘normal density’). So 2 u 1 du = ρ ρ(X, Y ) = √ .ρ u2 exp − 2 2π (‘normal variance’), as required. This completes the identification of all five parameters in the bivariate normal distribution: two means μi , two variances σi2 , one correlation ρ.
1.5 The Bivariate Normal Distribution
19
Note 1.6 1. The above holds for −1 < ρ < 1; always, −1 ≤ ρ ≤ 1, by the CauchySchwarz inequality (see e.g. Garling (2007) p.15, Haigh (2002) Ex 3.20 p.86, or Howie (2001) p.22 and Exercises 1.1-1.2). In the limiting cases ρ = ±1, one of X, Y is then a linear function of the other: Y = aX + b, say, as in the temperature example (Fahrenheit and Centigrade). The situation is not really two-dimensional: we can (and should) use only one of X and Y , reducing to a one-dimensional problem. 2. The slope of the regression line y = cx is ρσ2 /σ1 = (ρσ1 σ2 )/(σ12 ), which can be written as cov(X, Y )/varX = σ12 /σ11 , or σ12 /σ12 : the line is y − EY =
σ12 (x − EX). σ11
This is the population version (what else?!) of the sample regression line y−y =
sXY (x − x), sXX
familiar from linear regression. The case ρ = ±1 – apparently two-dimensional, but really one-dimensional – is singular; the case −1 < ρ < 1 (genuinely two-dimensional) is nonsingular, or (see below) full rank. We note in passing Fact 8. The bivariate normal law has elliptical contours. For, the contours are Q(x, y) = const, which are ellipses (as Galton found). Moment Generating Function (MGF). Recall (see e.g. Haigh (2002), §5.2) the definition of the moment generating function (MGF) of a random variable X. This is the function M (t),
or MX (t) := E exp{tX}
for t real, and such that the expectation (typically a summation or integration, which may be infinite) converges (absolutely). For X normal N (μ, σ 2 ), 1 1 M (t) = √ etx exp − (x − μ)2 /σ 2 dx. 2 σ 2π Change variable to u := (x − μ)/σ: 1 2 1 exp μt + σut − u M (t) = √ du. 2 2π
20
1. Linear Regression
Completing the square, 1 M (t) = eμt √ 2π
1 1 2 2 exp − (u − σt)2 du.e 2 σ t , 2
or MX (t) = exp(μt + 12 σ 2 t2 ) (recognising that the central term on the right is 1 – ‘normal density’) . So MX−μ (t) = exp( 12 σ 2 t2 ). Then (check) (0), var X = E[(X − μ)2 ] = MX−μ (0). μ = EX = MX
Similarly in the bivariate case: the MGF is MX,Y (t1 , t2 ) := E exp(t1 X + t2 Y ). In the bivariate normal case: M (t1 , t2 ) = =
E(exp(t1 X + t2 Y )) = exp(t1 x + t2 y)f (x, y) dx dy exp(t1 x)f1 (x) dx exp(t2 y)f (y|x) dy.
The inner integral is the MGF of Y |X = x, which is N (cx , σ22 , (1 − ρ2 )), so is exp(cx t2 + 12 σ22 (1 − ρ2 )t22 ). By Fact 5 σ2 cx t2 = [μ2 + ρ (x − μ1 )]t2 , σ1 so M (t1 , t2 ) is equal to σ2 1 σ2 exp t2 μ2 − t2 μ1 + σ22 1 − ρ2 t22 exp t1 + t2 ρ x f1 (x) dx. σ1 2 σ1 Since f1 (x) is N (μ1 , σ12 ), the inner integral is a normal MGF, which is thus exp(μ1 [t1 + t2 ρ
σ2 1 ] + σ12 [. . .]2 ). σ1 2
Combining the two terms and simplifying, we obtain Fact 9. The joint MGF is 1 2 2 MX,Y (t1 , t2 ) = M (t1 , t2 ) = exp μ1 t1 + μ2 t2 + σ1 t1 + 2ρσ1 σ2 t1 t2 + σ22 t22 . 2 Fact 10. X, Y are independent iff ρ = 0.
Proof For densities: X, Y are independent iff the joint density fX,Y (x, y) factorises as the product of the marginal densities fX (x).fY (y) (see e.g. Haigh (2002), Cor. 4.17). For MGFs: X, Y are independent iff the joint MGF MX,Y (t1 , t2 ) factorises as the product of the marginal MGFs MX (t1 ).MY (t2 ). From Fact 9, this occurs iff ρ = 0.
1.6 Maximum Likelihood and Least Squares
21
Note 1.7 1. X, Y independent implies X, Y uncorrelated (ρ = 0) in general (when the correlation exists). The converse is false in general, but true, by Fact 10, in the bivariate normal case. 2. Characteristic functions (CFs). The characteristic function, or CF, of X is φX (t) := E(eitX ). Compared to the MGF, this has the drawback of involving complex numbers, but the great advantage of always existing for t real. Indeed, |φX (t)| = E(eitX ) ≤E eitX = E1 = 1. By contrast, the expectation defining the MGF MX (t) may diverge for some real t (as we shall see in §2.1 with the chi-square distribution.) For background on CFs, see e.g. Grimmett and Stirzaker (2001) §5.7. For our purposes one may pass from MGF to CF by formally replacing t by it (though one actually needs analytic continuation – see e.g. Copson (1935), §4.6 – or Cauchy’s Theorem – see e.g. Copson (1935), §6.7, or Howie (2003), Example 9.19). Thus for the univariate normal distribution N (μ, σ 2 ) the CF is
1 2 2 φX (t) = exp iμt − σ t 2 and for the bivariate normal distribution the CF of X, Y is
1 2 2 2 φX,Y (t1 , t2 ) = exp iμ1 t1 + iμ2 t2 − σ t + 2ρσ1 σ2 t1 t2 + σ2 t2 . 2 1 1
1.6 Maximum Likelihood and Least Squares By Fact 4, the conditional distribution of y given X = x is N (μ2 + ρ
σ2 (x − μ1 ), σ1
σ22 (1 − ρ2 )).
Thus y is decomposed into two components, a linear trend in x – the systematic part – and a normal error, with mean zero and constant variance – the random part. Changing the notation, we can write this as y = a + bx + ,
∼ N (0, σ 2 ).
22
1. Linear Regression
With n values of the predictor variable x, we can similarly write i ∼ N (0, σ 2 ).
yi = a + bxi + i ,
To complete the specification of the model, we need to specify the dependence or correlation structure of the errors 1 , . . . , n . This can be done in various ways (see Chapter 4 for more on this). Here we restrict attention to the simplest and most important case, where the errors i are iid: yi = a + bxi + i ,
i
iid N (0, σ 2 ).
(∗)
This is the basic model for simple linear regression. Since each yi is now normally distributed, we can write down its density. Since the yi are independent, the joint density of y1 , . . . , yn factorises as the product of the marginal (separate) densities. This joint density, regarded as a function of the parameters, a, b and σ, is called the likelihood, L (one of many contributions by the great English statistician R. A. Fisher (1890-1962), later Sir Ronald Fisher, in 1912). Thus L
= =
n
1 σ n (2π)
1 2n
1 1 σ n (2π) 2 n
1 exp{− (yi − a − bxi )2 /σ 2 } 2 n 1 exp{− (yi − a − bxi )2 /σ 2 }. i=1 2 i=1
Fisher suggested choosing as our estimates of the parameters the values that maximise the likelihood. This is the Method of Maximum Likelihood; the resulting estimators are the maximum likelihood estimators or MLEs. Now maximising the likelihood L and maximising its logarithm := log L are the same, since the function log is increasing. Since 1 n 1 (yi − a − bxi )2 /σ 2 , := log L = − n log 2π − n log σ − i=1 2 2 so far as maximising with respect to a and b are concerned (leaving σ to one side for the moment), this is the same as minimising the sum of squares SS := n 2 i=1 (yi − a − bxi ) – just as in the Method of Least Squares. Summarising:
Theorem 1.8 For the normal model (∗), the Method of Least Squares and the Method of Maximum Likelihood are equivalent ways of estimating the parameters a and b.
1.7 Sums of Squares
23
It is interesting to note here that the Method of Least Squares of Legendre and Gauss belongs to the early nineteenth century, whereas Fisher’s Method of Maximum Likelihood belongs to the early twentieth century. For background on the history of statistics in that period, and an explanation of the ‘long pause’ between least squares and maximum likelihood, see Stigler (1986). There remains the estimation of the parameter σ, equivalently the variance σ 2 . Using maximum likelihood as above gives −n 1 n ∂/∂σ = + 3 (yi − a − bxi )2 = 0, i=1 σ σ or 1 n σ2 = (yi − a − bxi )2 . i=1 n At the maximum, a and b have their maximising values a ˆ, ˆb as above, and then the maximising value σ ˆ is given by 1 n 1 n (yi − a ˆ − ˆbxi )2 = (yi − yˆi )2 . σ ˆ2 = 1 1 n n Note that the sum of squares SS above involves unknown parameters, a and b. Because these are unknown, one cannot calculate this sum of squares numerically from the data. In the next section, we will meet other sums of squares, which can be calculated from the data – that is, which are functions of the data, or statistics. Rather than proliferate notation, we will again denote the largest of these sums of squares by SS; we will then break this down into a sum of smaller sums of squares (giving a sum of squares decomposition). In Chapters 3 and 4, we will meet multidimensional analogues of all this, which we will handle by matrix algebra. It turns out that all sums of squares will be expressible as quadratic forms in normal variates (since the parameters, while unknown, are constant, the distribution theory of sums of squares with and without unknown parameters is the same).
1.7 Sums of Squares Recall the sample regression line in the form y = y + b(x − x),
b = sxy /sxx = Sxy /Sxx .
(SRL)
We now ask how much of the variation in y is accounted for by knowledge of x – or, as one says, by regression. The data are yi . The fitted values are yˆi , the left-hand sides above with x on the right replaced by xi . Write yi − y = (yi − yˆi ) + (ˆ yi − y),
24
1. Linear Regression
square both sides and add. On the left, we get n (yi − y)2 , SS := i=1
the total sum of squares or sum of squares for short. On the right, we get three terms: (ˆ yi − y)2 , SSR := i
which we call the sum of squares for regression, SSE := (yi − yˆi )2 , i
the sum of squares for error (since this sum of squares measures the errors between the fitted values on the regression line and the data), and a cross term 1 (yi − yˆi )(ˆ yi − y) = n (yi − yˆi )(ˆ yi − y) = n.(y − yˆ)(y − y). i i n By (SRL), yˆi − y = b(xi − x) with b = Sxy /Sxx = Sxy /Sx2 , and yi − yˆ = (yi − y) − b(xi − x). So the right above is n times 1 b(xi − x)[(yi − y) − b(xi − x)] = bSxy − b2 Sx2 = b Sxy − bSx2 = 0, i n as b = Sxy /Sx2 . Combining, we have
Theorem 1.9 SS = SSR + SSE. In terms of the sample correlation coefficient r2 , this yields as a corollary
Theorem 1.10 r2 = SSR/SS,
1 − r2 = SSE/SS.
Proof It suffices to prove the first. 2 2 2 Sxy Sxy (ˆ yi − y)2 b (xi − x)2 b2 Sx2 Sx2 SSR = = = = . = = r2 , SS Sy2 Sx4 Sy2 Sx2 Sy2 (yi − y)2 (yi − y)2 as b = Sxy /Sx2 .
1.7 Sums of Squares
25
The interpretation is that r2 = SSR/SS is the proportion of variability in y accounted for by knowledge of x, that is, by regression (and 1 − r2 = SSE/SS is that unaccounted for by knowledge of x, that is, by error). This is just the sample version of what we encountered in §1.5 on the bivariate normal distribution, where (see below Fact 6 in §1.5) ρ2 has the interpretation of the proportion of variability in y accounted for by knowledge of x. Recall that r2 tends to ρ2 in the large-sample limit, by the Law of Large Numbers, so the population theory of §1.5 is the large-sample limit of the sample theory here.
Example 1.11 We wish to predict y, winning speeds (mph) in a car race, given the year x, by a linear regression. The data for years one to ten are y=(140.3, 143.1, 147.4, 151.4, 144.3, 151.2, 152.9, 156.9, 155.7, 157.7). The estimates for a and b now become a ˆ = 139.967 and ˆb = 1.841. Assuming normally distributed errors in our regression model means that we can now calculate confidence intervals for the parameters and express a level of uncertainty around these estimates. In this case the formulae for 95% confidence intervals give (135.928, 144.005) for a and (1.190, 2.491) for b. Distribution theory. Consider first the case b = 0, when the slope is zero, there is no linear trend, and the yi are identically distributed, N (a, σ 2 ). Then y and yi − y are also normally distributed, with zero mean. It is perhaps surprising, but true, that (yi − y)2 and y are independent; we prove this in §2.5 below. The distribution of the quadratic form (yi −y)2 involves the chi-square distribution; see §2.1 below. In this case, SSR and SSE are independent chi-square variates, and SS = SSR + SSE is an instance of chi-square decompositions, which we meet in §3.5. In the general case with the slope b non-zero, there is a linear trend, and a sloping regression line is more successful in explaining the data than a flat one. One quantifies this by using a ratio of sums of squares (ratio of independent chi-squares) that increases when the slope b is non-zero, so large values are evidence against zero slope. This statistic is an F-statistic (§2.3: F for Fisher). Such F-tests may be used to test a large variety of such linear hypotheses (Chapter 6). When b is non-zero, the yi − y are normally distributed as before, but with non-zero mean. Their sum of squares (yi − y)2 then has a non-central chisquare distribution. The theory of such distributions is omitted here, but can be found in, e.g., Kendall and Stuart (1979), Ch. 24.
26
1. Linear Regression
1.8 Two regressors Suppose now that we have two regressor variables, u and v say, for the response variable y. Several possible settings have been prefigured in the discussion above: 1. Height. Galton measured the father’s height u and the mother’s height v in each case, before averaging to form the mid-parental height x := (u+v)/2. What happens if we use u and v in place of x? 2. Predicting grain yields. Here y is the grain yield after the summer harvest. Because the price that the grain will fetch is determined by the balance of supply and demand, and demand is fairly inflexible while supply is unpredictable, being determined largely by the weather, it is of great economic and financial importance to be able to predict grain yields in advance. The two most important predictors are the amount of rainfall (in cm, u say) and sunshine (in hours, v say) during the spring growing season. Given this information at the end of spring, how can we use it to best predict yield in the summer harvest? Of course, the actual harvest is still subject to events in the future, most notably the possibility of torrential rain in the harvest season flattening the crops. Note that for the sizeable market in grain futures, such predictions are highly price-sensitive information. 3. House prices. In the example above, house prices y depended on earnings u and interest rates v. We would expect to be able to get better predictions using both these as predictors than using either on its own. 4. Athletics times. We saw that both age and distance can be used separately; one ought to be able to do better by using them together. 5. Timber. The economic value of a tree grown for timber depends on the volume of usable timber when the tree has been felled and taken to the sawmill. When choosing which trees to fell, it is important to be able to estimate this volume without needing to fell the tree. The usual predictor variables here are girth (in cm, say – measured by running a tape-measure round the trunk at some standard height – one metre, say – above the ground) and height (measured by use of a surveyor’s instrument and trigonometry).
With two regressors u and v and response variable y, given a sample of size n of points (u_1, v_1, y_1), ..., (u_n, v_n, y_n), we have to fit a least-squares plane – that is, we have to choose parameters a, b, c to minimise the sum of squares
\[ SS := \sum_{i=1}^n (y_i - c - a u_i - b v_i)^2. \]
Taking ∂SS/∂c = 0 gives
\[ \sum_{i=1}^n (y_i - c - a u_i - b v_i) = 0: \qquad c = \bar y - a\bar u - b\bar v. \]
We rewrite SS as
\[ SS = \sum_{i=1}^n \left[ (y_i - \bar y) - a(u_i - \bar u) - b(v_i - \bar v) \right]^2. \]
Then ∂SS/∂a = 0 and ∂SS/∂b = 0 give
\[ \sum_{i=1}^n (u_i - \bar u)\left[ (y_i - \bar y) - a(u_i - \bar u) - b(v_i - \bar v) \right] = 0, \]
\[ \sum_{i=1}^n (v_i - \bar v)\left[ (y_i - \bar y) - a(u_i - \bar u) - b(v_i - \bar v) \right] = 0. \]
Multiply out, divide by n to turn the sums into averages, and re-arrange using our earlier notation of sample variances and sample covariance: the above equations become
\[ a s_{uu} + b s_{uv} = s_{yu}, \qquad a s_{uv} + b s_{vv} = s_{yv}. \]
These are the normal equations for a and b. The determinant is
\[ s_{uu} s_{vv} - s_{uv}^2 = s_{uu} s_{vv}\left(1 - r_{uv}^2\right) \]
(since r_{uv} := s_{uv}/(s_u s_v)). This is non-zero iff r_{uv} ≠ ±1 – that is, iff the points (u_1, v_1), ..., (u_n, v_n) are not collinear – and this is the condition for the normal equations to have a unique solution. The extension to three or more regressors may be handled in just the same way: with p regressors we obtain p normal equations. The general case is best handled by the matrix methods of Chapter 3.
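As a quick illustration of these normal equations, the R sketch below (with simulated data of our own; u, v and y are not from the text) solves them directly and checks the answer against R's built-in least-squares fit. R's var and cov divide by n − 1 rather than n, but since the same factor appears on both sides of the equations the solution is unaffected.

    set.seed(1)
    u <- runif(20); v <- runif(20)
    y <- 1 + 2*u - 3*v + rnorm(20, sd = 0.1)
    # normal equations  a*suu + b*suv = syu,  a*suv + b*svv = syv
    ab <- solve(matrix(c(var(u), cov(u, v), cov(u, v), var(v)), 2, 2),
                c(cov(y, u), cov(y, v)))
    c0 <- mean(y) - ab[1]*mean(u) - ab[2]*mean(v)
    c(c0, ab)             # intercept c and slopes a, b
    coef(lm(y ~ u + v))   # the same values from lm()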
Note 1.12 As with the linear regression case, under the assumption of iid N (0, σ 2 ) errors these formulas for a and b also give the maximum likelihood estimates. Further,
100(1 − α)% confidence intervals can be returned routinely using standard software packages, and in this case can be calculated as
\[ c = \hat c \pm t_{n-3}(1-\alpha/2)\, s \sqrt{ \frac{\sum u_i^2 \sum v_i^2 - \left(\sum u_i v_i\right)^2}{ n\sum u_i^2 S_{vv} + n\sum u_i v_i\left[2n\bar u\bar v - \sum u_i v_i\right] - n^2 \bar u^2 \sum v_i^2 } }, \]
\[ a = \hat a \pm t_{n-3}(1-\alpha/2)\, s \sqrt{ \frac{S_{vv}}{ \sum u_i^2 S_{vv} + \sum u_i v_i\left[2n\bar u\bar v - \sum u_i v_i\right] - n \bar u^2 \sum v_i^2 } }, \]
\[ b = \hat b \pm t_{n-3}(1-\alpha/2)\, s \sqrt{ \frac{S_{uu}}{ \sum u_i^2 S_{vv} + \sum u_i v_i\left[2n\bar u\bar v - \sum u_i v_i\right] - n \bar u^2 \sum v_i^2 } }, \]
where
\[ s = \sqrt{ \frac{S_{yy} - \hat a S_{uy} - \hat b S_{vy}}{n-3} }; \]
see Exercise 3.10.
Note 1.13 (Joint confidence regions)

In the above, we restrict ourselves to confidence intervals for individual parameters, as is done in e.g. S-Plus/R. One can give confidence regions for two or more parameters together; for details we refer to Draper and Smith (1998), Ch. 5.
EXERCISES

1.1. By considering the quadratic
\[ Q(\lambda) := \frac{1}{n}\sum_{i=1}^n \left( \lambda(x_i - \bar x) + (y_i - \bar y) \right)^2, \]
show that the sample correlation coefficient r satisfies
(i) −1 ≤ r ≤ 1;
(ii) r = ±1 iff there is a linear relationship between x_i and y_i,
\[ a x_i + b y_i = c \qquad (i = 1, \ldots, n). \]

1.2. By considering the quadratic Q(λ) := E[(λ(x − Ex) + (y − Ey))²], show that the population correlation coefficient ρ satisfies
(i) −1 ≤ ρ ≤ 1;
(ii) ρ = ±1 iff there is a linear relationship between x and y, ax + by = c, with probability 1.
(These results are both instances of the Cauchy–Schwarz inequality for sums and integrals respectively.)

1.3. The effect of ageing on athletic performance. The data in Table 1.1 gives the first author's times for the marathon and half-marathon (in minutes).
(i) Fit the model log(time) = a + b log(age) and give estimates and 95% confidence intervals for a and b.
(ii) Compare your results with the runners' Rule of Thumb that, for ageing athletes, every year of age adds roughly half a minute to the half-marathon time and a full minute to the marathon time.

Age    Half-marathon    Age     Marathon
46     85.62            46.5    166.87
48     84.90            47.0    173.25
49     87.88            47.5    175.17
50     87.88            49.5    178.97
51     87.57            50.5    176.63
57     90.25            54.5    175.03
59     88.40            56.0    180.32
60     89.45            58.5    183.02
61     96.38            59.5    192.33
62     94.62            60.0    191.73

Table 1.1 Data for Exercise 1.3

1.4. Look at the data for Example 1.11 on car speeds. Plot the data along with the fitted regression line. Fit the model y = a + bx + cx² and test for the significance of a quadratic term. Predict the speeds for x = (−3, 13) and compare with the actual observations of 135.9 and 158.6 respectively. Which model seems to predict best out of sample? Do your results change much when you add these two observations to your sample?

1.5. Give the solution to the normal equations for the regression model with two regressors in §1.8.

1.6. Consider the data in Table 1.2 giving the first author's half-marathon times:
Age (x)    Time (y)     Age (x)    Time (y)
42         92.00        51         87.57
43         92.00        57         90.25
44         91.25        59         88.40
46         85.62        60         89.45
48         84.90        61         96.38
49         87.88        62         94.62
50         87.88        63         91.23

Table 1.2 Data for Exercise 1.6

(i) Fit the models y = a + bx and y = a + bx + cx². Does the extra quadratic term appear necessary?
(ii) Effect of club membership upon performance. Use the following proxy v = (0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) to gauge the effect of club membership. (v = 1 corresponds to being a member of a club.) Consider the model y = a + bx + cv. How does membership of a club appear to affect athletic performance?

1.7. The following data, y = (9.8, 11.0, 13.2, 15.1, 16.0), give the price index y in years one to five.
(i) Which of the models y = a + bt, y = Ae^{bt} fits the data best?
(ii) Does the quadratic model, y = a + bt + ct², offer a meaningful improvement over the simple linear regression model?

1.8. The following data in Table 1.3 give the US population in millions. Fit a suitable model and interpret your findings.

Year    Population     Year    Population
1790    3.93           1890    62.90
1800    5.31           1900    76.00
1810    7.24           1910    92.00
1820    9.64           1920    105.70
1830    12.90          1930    122.80
1840    17.10          1940    131.70
1850    23.20          1950    151.30
1860    31.40          1960    179.30
1870    39.80          1970    203.20
1880    50.20

Table 1.3 Data for Exercise 1.8.
1.9. One-dimensional change-of-variable formula. Let X be a continuous random variable with density f_X(x). Let Y = g(X) for some monotonic function g(·).
(i) Show that
\[ f_Y(x) = f_X\!\left( g^{-1}(x) \right) \left| \frac{d g^{-1}(x)}{dx} \right|. \]
(ii) Suppose X ∼ N(μ, σ²). Show that Y = e^X has probability density function
\[ f_Y(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left\{ -\frac{(\log x - \mu)^2}{2\sigma^2} \right\}. \]
[Note that this gives the log-normal distribution, important in the Black–Scholes model of mathematical finance.]

1.10. The following exercise motivates a discussion of Student's t distribution as a normal variance mixture (see Exercise 1.11). Let U ∼ χ²_r be a chi-squared distribution with r degrees of freedom (for which see §2.1), with density
\[ f_U(x) = \frac{ x^{\frac{1}{2}r - 1} e^{-\frac{1}{2}x} }{ 2^{\frac{1}{2}r}\, \Gamma\!\left(\frac{r}{2}\right) }. \]
(i) Show, using Exercise 1.9 or differentiation under the integral sign, that Y = r/U has density
\[ f_Y(x) = \frac{ r^{\frac{1}{2}r}\, x^{-1-\frac{1}{2}r}\, e^{-\frac{1}{2} r x^{-1}} }{ 2^{\frac{1}{2}r}\, \Gamma\!\left(\frac{r}{2}\right) }. \]
(ii) Show that if X ∼ Γ(a, b) with density
\[ f_X(x) = \frac{ x^{a-1} b^a e^{-bx} }{ \Gamma(a) }, \]
then Y = X^{-1} has density
\[ f_Y(x) = \frac{ b^a x^{-1-a} e^{-b/x} }{ \Gamma(a) }. \]
Deduce the value of
\[ \int_0^\infty x^{-1-a} e^{-b/x}\, dx. \]
1.11. Student's t distribution. A Student t distribution t(r) with r degrees of freedom can be constructed as follows: 1. Generate u from f_Y(·). 2. Generate x from N(0, u), where f_Y(·) is the probability density in Exercise 1.10 (ii). Show that
\[ f_{t(r)}(x) = \frac{ \Gamma\!\left(\frac{r}{2} + \frac{1}{2}\right) }{ \sqrt{\pi r}\, \Gamma\!\left(\frac{r}{2}\right) } \left( 1 + \frac{x^2}{r} \right)^{-\frac{1}{2}(r+1)}. \]
The Student t distribution often arises in connection with the chi-square distribution (see Chapter 2). If X ∼ N(0, 1) and Y ∼ χ²_r with X and Y independent, then
\[ \frac{X}{\sqrt{Y/r}} \sim t(r). \]
2
The Analysis of Variance (ANOVA)
While the linear regression of Chapter 1 goes back to the nineteenth century, the Analysis of Variance of this chapter dates from the twentieth century, in applied work by Fisher motivated by agricultural problems (see §2.6). We begin this chapter with some necessary preliminaries, on the special distributions of Statistics needed for small-sample theory: the chi-square distributions χ2 (n) (§2.1), the Fisher F -distributions F (m, n) (§2.3), and the independence of normal sample means and sample variances (§2.5). We shall generalise linear regression to multiple regression in Chapters 3 and 4 – which use the Analysis of Variance of this chapter – and unify regression and Analysis of Variance in Chapter 5 on Analysis of Covariance.
2.1 The Chi-Square Distribution

We now define the chi-square distribution with n degrees of freedom (df), χ²(n). This is the distribution of X_1² + ... + X_n², with the X_i iid N(0, 1). Recall (§1.5, Fact 9) the definition of the MGF, and also the definition of the Gamma function,
\[ \Gamma(t) := \int_0^\infty e^{-x} x^{t-1}\, dx \qquad (t > 0) \]
(the integral converges for t > 0). One may check (by integration by parts) that Γ (n + 1) = n! (n = 0, 1, 2, . . .), so the Gamma function provides a continuous extension to the factorial. It is also needed in Statistics, as it comes into the normalisation constants of the standard distributions of small-sample theory, as we see below.
Theorem 2.1

The chi-square distribution χ²(n) with n degrees of freedom has
(i) mean n and variance 2n,
(ii) MGF M(t) = 1/(1 − 2t)^{n/2} for t < 1/2,
(iii) density
\[ f(x) = \frac{1}{2^{\frac{1}{2}n}\, \Gamma\!\left(\frac{1}{2}n\right)}\, x^{\frac{1}{2}n - 1} \exp\left( -\frac{1}{2}x \right) \qquad (x > 0). \]
Proof
(i) For n = 1, the mean is 1, because a χ²(1) is the square of a standard normal, and a standard normal has mean 0 and variance 1. The variance is 2, because the fourth moment of a standard normal X is 3, and
\[ \mathrm{var}\,X^2 = E\left[X^4\right] - \left(E\left[X^2\right]\right)^2 = 3 - 1 = 2. \]
For general n, the mean is n because means add, and the variance is 2n because variances add over independent summands (Haigh (2002), Th 5.5, Cor 5.6).
(ii) For X standard normal, the MGF of its square X² is
\[ M(t) := \int e^{tx^2} \phi(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx^2} e^{-\frac{1}{2}x^2}\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(1-2t)x^2}\, dx. \]
So the integral converges only for t < 1/2; putting y := √(1 − 2t)·x gives
\[ M(t) = 1/\sqrt{1 - 2t} \qquad \left(t < \tfrac{1}{2}\right) \quad \text{for } X \sim N(0,1). \]
Now when X, Y are independent, the MGF of their sum is the product of their MGFs (see e.g. Haigh (2002), p.103). For e^{tX}, e^{tY} are independent, and the mean of an independent product is the product of the means. Combining these, the MGF of a χ²(n) is given by
\[ M(t) = 1/(1 - 2t)^{\frac{1}{2}n} \qquad \left(t < \tfrac{1}{2}\right) \quad \text{for } X \sim \chi^2(n). \]
(iii) First, f(.) is a density, as it is non-negative, and integrates to 1:
\[ \int f(x)\, dx = \frac{1}{2^{\frac{1}{2}n}\Gamma\!\left(\frac{1}{2}n\right)} \int_0^\infty x^{\frac{1}{2}n-1} \exp\left(-\tfrac{1}{2}x\right) dx = \frac{1}{\Gamma\!\left(\frac{1}{2}n\right)} \int_0^\infty u^{\frac{1}{2}n-1} \exp(-u)\, du \quad (u := \tfrac{1}{2}x) \; = 1, \]
by definition of the Gamma function. Its MGF is
\[ M(t) = \frac{1}{2^{\frac{1}{2}n}\Gamma\!\left(\frac{1}{2}n\right)} \int_0^\infty e^{tx} x^{\frac{1}{2}n-1} \exp\left(-\tfrac{1}{2}x\right) dx = \frac{1}{2^{\frac{1}{2}n}\Gamma\!\left(\frac{1}{2}n\right)} \int_0^\infty x^{\frac{1}{2}n-1} \exp\left(-\tfrac{1}{2}x(1-2t)\right) dx. \]
Substitute u := x(1 − 2t)/2 in the integral. One obtains
\[ M(t) = (1-2t)^{-\frac{1}{2}n}\, \frac{1}{\Gamma\!\left(\frac{1}{2}n\right)} \int_0^\infty u^{\frac{1}{2}n-1} e^{-u}\, du = (1-2t)^{-\frac{1}{2}n}, \]
by definition of the Gamma function.

Chi-square Addition Property. If X_1, X_2 are independent, χ²(n_1) and χ²(n_2), X_1 + X_2 is χ²(n_1 + n_2).
Proof
X_1 = U_1² + ... + U_{n_1}², X_2 = U_{n_1+1}² + ... + U_{n_1+n_2}², with U_i iid N(0, 1). So X_1 + X_2 = U_1² + ··· + U_{n_1+n_2}², so X_1 + X_2 is χ²(n_1 + n_2).

Chi-Square Subtraction Property. If X = X_1 + X_2, with X_1 and X_2 independent, and X ∼ χ²(n_1 + n_2), X_1 ∼ χ²(n_1), then X_2 ∼ χ²(n_2).

Proof
As X is the independent sum of X_1 and X_2, its MGF is the product of their MGFs. But X, X_1 have MGFs (1 − 2t)^{-(n_1+n_2)/2}, (1 − 2t)^{-n_1/2}. Dividing, X_2 has MGF (1 − 2t)^{-n_2/2}. So X_2 ∼ χ²(n_2).
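A two-line simulation in R (our own illustration, not from the text) makes the Addition Property visible: the simulated sum of independent χ²(3) and χ²(5) variables lies along the quantile plot of a χ²(8).

    set.seed(1)
    x1 <- rchisq(10000, df = 3); x2 <- rchisq(10000, df = 5)
    qqplot(qchisq(ppoints(10000), df = 8), x1 + x2,
           xlab = "chi-square(8) quantiles", ylab = "simulated x1 + x2")
    abline(0, 1)   # points close to this line support the Addition Property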
2.2 Change of variable formula and Jacobians

Recall from calculus of several variables the change of variable formula for multiple integrals. If in
\[ I := \int \cdots \int_A f(x_1, \ldots, x_n)\, dx_1 \ldots dx_n = \int_A f(x)\, dx \]
we make a one-to-one change of variables from x to y — x = x(y) or x_i = x_i(y_1, ..., y_n) (i = 1, ..., n) — let B be the region in y-space corresponding to the region A in x-space. Then
\[ I = \int_A f(x)\, dx = \int_B f(x(y)) \left| \frac{\partial x}{\partial y} \right| dy = \int_B f(x(y))\, |J|\, dy, \]
where J, the determinant of partial derivatives
\[ J := \frac{\partial x}{\partial y} = \frac{\partial(x_1, \cdots, x_n)}{\partial(y_1, \cdots, y_n)} := \det\left( \frac{\partial x_i}{\partial y_j} \right), \]
is the Jacobian of the transformation (after the great German mathematician C. G. J. Jacobi (1804–1851) in 1841 – see e.g. Dineen (2001), Ch. 14). Note that in one dimension, this just reduces to the usual rule for change of variables: dx = (dx/dy).dy. Also, if J is the Jacobian of the change of variables x → y above, the Jacobian ∂y/∂x of the inverse transformation y → x is J^{-1} (from the product theorem for determinants: det(AB) = det A.det B – see e.g. Blyth and Robertson (2002a), Th. 8.7).

Suppose now that X is a random n-vector with density f(x), and we wish to change from X to Y, where Y corresponds to X as y above corresponds to x: y = y(x) iff x = x(y). If Y has density g(y), then by above,
\[ P(X \in A) = \int_A f(x)\, dx = \int_B f(x(y)) \left| \frac{\partial x}{\partial y} \right| dy, \]
and also
\[ P(X \in A) = P(Y \in B) = \int_B g(y)\, dy. \]
Since these hold for all B, the integrands must be equal, giving
\[ g(y) = f(x(y))\, \left| \partial x / \partial y \right| \]
as the density g of Y. In particular, if the change of variables is linear:
\[ y = Ax + b, \quad x = A^{-1}y - A^{-1}b, \quad \partial y/\partial x = |A|, \quad \partial x/\partial y = |A^{-1}| = |A|^{-1}. \]
2.3 The Fisher F-distribution

Suppose we have two independent random variables U and V, chi-square distributed with degrees of freedom (df) m and n respectively. We divide each by its df, obtaining U/m and V/n. The distribution of the ratio
\[ F := \frac{U/m}{V/n} \]
will be important below. It is called the F-distribution with degrees of freedom (m, n), F(m, n). It is also known as the (Fisher) variance-ratio distribution. Before introducing its density, we define the Beta function,
\[ B(\alpha, \beta) := \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\, dx, \]
wherever the integral converges (α > 0 for convergence at 0, β > 0 for convergence at 1). By Euler's integral for the Beta function,
\[ B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]
(see e.g. Copson (1935), §9.3). One may then show that the density of F(m, n) is
\[ f(x) = \frac{ m^{\frac{1}{2}m}\, n^{\frac{1}{2}n} }{ B\!\left(\frac{1}{2}m, \frac{1}{2}n\right) } \cdot \frac{ x^{\frac{1}{2}(m-2)} }{ (mx+n)^{\frac{1}{2}(m+n)} } \qquad (m, n > 0, \; x > 0) \]
(see e.g. Kendall and Stuart (1977), §16.15, §11.10; the original form given by Fisher is slightly different).

There are two important features of this density. The first is that (to within a normalisation constant, which, like many of those in Statistics, involves ratios of Gamma functions) it behaves near zero like the power x^{(m-2)/2} and near infinity like the power x^{-n/2}, and is smooth and unimodal (has one peak). The second is that, like all the common and useful distributions in Statistics, its percentage points are tabulated. Of course, using tables of the F-distribution involves the complicating feature that one has two degrees of freedom (rather than one as with the chi-square or Student t-distributions), and that these must be taken in the correct order. It is sensible at this point for the reader to take some time to gain familiarity with use of tables of the F-distribution, using whichever standard set of statistical tables are to hand. Alternatively, all standard statistical packages will provide percentage points of F, t, χ², etc. on demand. Again, it is sensible to take the time to gain familiarity with the statistical package of your choice, including use of the online Help facility.

One can derive the density of the F distribution from those of the χ² distributions above. One needs the formula for the density of a quotient of random variables. The derivation is left as an exercise; see Exercise 2.1. For an introduction to calculations involving the F distribution see Exercise 2.2.
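In R, for instance, percentage points and tail probabilities of F are available directly; the sketch below (values chosen purely for illustration) shows the two calls one needs.

    qf(0.95, df1 = 3, df2 = 16)                     # upper 5% point of F(3, 16), about 3.24
    pf(2.5, df1 = 3, df2 = 16, lower.tail = FALSE)  # P(F > 2.5) for F(3, 16)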
2.4 Orthogonality

Recall that a square, non-singular (n × n) matrix A is orthogonal if its inverse is its transpose: A^{-1} = A^T. We now show that the property of being independent N(0, σ²) is preserved under an orthogonal transformation.
Theorem 2.2 (Orthogonality Theorem) If X = (X1 , . . . , Xn )T is an n-vector whose components are independent random variables, normally distributed with mean 0 and variance σ 2 , and we change variables from X to Y by Y := AX where the matrix A is orthogonal, then the components Yi of Y are again independent, normally distributed with mean 0 and variance σ 2 .
Proof
We use the Jacobian formula. If A = (a_ij), since ∂Y_i/∂X_j = a_ij, the Jacobian ∂Y/∂X = |A|. Since A is orthogonal, AA^T = AA^{-1} = I. Taking determinants, |A|.|A^T| = |A|.|A| = 1, so |A| = ±1, and the Jacobian has absolute value 1; similarly for |A^T|. Since length is preserved under an orthogonal transformation,
\[ \sum_1^n Y_i^2 = \sum_1^n X_i^2. \]
The joint density of (X_1, ..., X_n) is, by independence, the product of the marginal densities, namely
\[ f(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}x_i^2 \right) = \frac{1}{(2\pi)^{\frac{1}{2}n}} \exp\left( -\frac{1}{2}\sum_1^n x_i^2 \right). \]
From this and the Jacobian formula, we obtain the joint density of (Y_1, ..., Y_n) as
\[ f(y_1, \ldots, y_n) = \frac{1}{(2\pi)^{\frac{1}{2}n}} \exp\left( -\frac{1}{2}\sum_1^n y_i^2 \right) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2}y_i^2 \right). \]
But this is the joint density of n independent standard normals – and so (Y_1, ..., Y_n) are independent standard normal, as claimed.
Helmert's Transformation. There exists an orthogonal n × n matrix P with first row
\[ \frac{1}{\sqrt{n}}(1, \ldots, 1) \]
(there are many such! Robert Helmert (1843–1917) made use of one when he introduced the χ² distribution in 1876 – see Kendall and Stuart (1977), Example 11.1 – and it is convenient to use his name here for any of them.) For, take this vector, which spans a one-dimensional subspace; take n − 1 unit vectors not in this subspace and use the Gram–Schmidt orthogonalisation process (see e.g. Blyth and Robertson (2002b), Th. 1.4) to obtain a set of n orthonormal vectors.
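One such matrix is easy to write down in R using the built-in Helmert contrasts; the sketch below (for n = 4, a choice of ours) normalises the contrast columns and checks orthogonality.

    n <- 4
    H <- contr.helmert(n)                       # columns are orthogonal to (1,...,1)
    H <- sweep(H, 2, sqrt(colSums(H^2)), "/")   # normalise each column
    P <- rbind(rep(1, n)/sqrt(n), t(H))         # first row (1,...,1)/sqrt(n)
    round(P %*% t(P), 10)                       # the identity: P is orthogonal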
2.5 Normal sample mean and sample variance

For X_1, ..., X_n independent and identically distributed (iid) random variables, with mean μ and variance σ², write
\[ \bar X := \frac{1}{n}\sum_1^n X_i \]
for the sample mean and
\[ S^2 := \frac{1}{n}\sum_1^n (X_i - \bar X)^2 \]
for the sample variance.
Note 2.3

Many authors use 1/(n − 1) rather than 1/n in the definition of the sample variance. This gives S² as an unbiased estimator of the population variance σ². But our definition emphasizes the parallel between the bar, or average, for sample quantities and the expectation for the corresponding population quantities:
\[ \bar X = \frac{1}{n}\sum_1^n X_i \leftrightarrow EX, \qquad S^2 = \overline{(X - \bar X)^2} \leftrightarrow \sigma^2 = E\left[(X - EX)^2\right], \]
which is mathematically more convenient.
Theorem 2.4

If X_1, ..., X_n are iid N(μ, σ²),
(i) the sample mean X̄ and the sample variance S² are independent,
(ii) X̄ is N(μ, σ²/n),
(iii) nS²/σ² is χ²(n − 1).
Proof
(i) Put Z_i := (X_i − μ)/σ, Z := (Z_1, ..., Z_n)^T; then the Z_i are iid N(0, 1),
\[ \bar Z = (\bar X - \mu)/\sigma, \qquad nS^2/\sigma^2 = \sum_1^n (Z_i - \bar Z)^2. \]
Also, since
\[ \sum_1^n (Z_i - \bar Z)^2 = \sum_1^n Z_i^2 - 2\bar Z \sum_1^n Z_i + n\bar Z^2 = \sum_1^n Z_i^2 - 2\bar Z . n\bar Z + n\bar Z^2 = \sum_1^n Z_i^2 - n\bar Z^2: \]
\[ \sum_1^n Z_i^2 = \sum_1^n (Z_i - \bar Z)^2 + n\bar Z^2. \]
The terms on the right above are quadratic forms, with matrices A, B say, so we can write
\[ \sum_1^n Z_i^2 = Z^T A Z + Z^T B Z. \qquad (*) \]
Put W := PZ with P a Helmert transformation – P orthogonal with first row (1, ..., 1)/√n:
\[ W_1 = \frac{1}{\sqrt n}\sum_1^n Z_i = \sqrt n\, \bar Z; \qquad W_1^2 = n\bar Z^2 = Z^T B Z. \]
So
\[ \sum_2^n W_i^2 = \sum_1^n W_i^2 - W_1^2 = \sum_1^n Z_i^2 - Z^T B Z = Z^T A Z = \sum_1^n (Z_i - \bar Z)^2 = nS^2/\sigma^2. \]
But the W_i are independent (by the orthogonality of P), so W_1 is independent of W_2, ..., W_n. So W_1² is independent of Σ_2^n W_i². So nS²/σ² is independent of n(X̄ − μ)²/σ², so S² is independent of X̄, as claimed.
(ii) We have X̄ = (X_1 + ... + X_n)/n with X_i independent, N(μ, σ²), so with MGF exp(μt + ½σ²t²). So X_i/n has MGF exp(μt/n + ½σ²t²/n²), and X̄ has MGF
\[ \left[ \exp\left( \mu t/n + \tfrac{1}{2}\sigma^2 t^2/n^2 \right) \right]^n = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2/n \right). \]
So X̄ is N(μ, σ²/n).
(iii) In (∗), we have on the left Σ_1^n Z_i², which is the sum of the squares of n standard normals Z_i, so is χ²(n) with MGF (1 − 2t)^{-n/2}. On the right, we have
two independent terms. As Z̄ is N(0, 1/n), √n Z̄ is N(0, 1), so nZ̄² = Z^T BZ is χ²(1), with MGF (1 − 2t)^{-1/2}. Dividing (as in chi-square subtraction above), Z^T AZ = Σ_1^n (Z_i − Z̄)² has MGF (1 − 2t)^{-(n-1)/2}. So Z^T AZ = Σ_1^n (Z_i − Z̄)² is χ²(n − 1). So nS²/σ² is χ²(n − 1).
Note 2.5 1. This is a remarkable result. We quote (without proof) that this property actually characterises the normal distribution: if the sample mean and sample variance are independent, then the population distribution is normal (Geary’s Theorem: R. C. Geary (1896–1983) in 1936; see e.g. Kendall and Stuart (1977), Examples 11.9 and 12.7). 2. The fact that when we form the sample mean, the mean is unchanged, while the variance decreases by a factor of the sample size n, is true generally. The point of (ii) above is that normality is preserved. This holds more generally: it will emerge in Chapter 4 that normality is preserved under any linear operation.
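A simulation sketch in R (ours, with n = 5 and arbitrary μ, σ) illustrates the content of Theorem 2.4: across replications the sample mean and sample variance are uncorrelated, and nS²/σ² has the mean and variance of a χ²(n − 1).

    set.seed(1)
    n <- 5; mu <- 2; sigma <- 3
    x <- matrix(rnorm(10000 * n, mu, sigma), ncol = n)
    xbar <- rowMeans(x)
    s2 <- rowMeans((x - xbar)^2)       # sample variance with divisor n, as in the text
    cor(xbar, s2)                      # near 0, consistent with independence
    c(mean(n * s2 / sigma^2), var(n * s2 / sigma^2))   # close to n - 1 = 4 and 2(n - 1) = 8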
Theorem 2.6 (Fisher's Lemma)

Let X_1, ..., X_n be iid N(0, σ²). Let
\[ Y_i = \sum_{j=1}^n c_{ij} X_j \qquad (i = 1, \ldots, p, \quad p < n), \]
where the row-vectors (c_{i1}, ..., c_{in}) are orthonormal for i = 1, ..., p. If
\[ S^2 = \sum_1^n X_i^2 - \sum_1^p Y_i^2, \]
then (i) S² is independent of Y_1, ..., Y_p, (ii) S²/σ² is χ²(n − p).
Proof
Extend the p × n matrix (c_ij) to an n × n orthogonal matrix C = (c_ij) by Gram–Schmidt orthogonalisation. Then put Y := CX, so defining Y_1, ..., Y_p (again) and Y_{p+1}, ..., Y_n. As C is orthogonal, Y_1, ..., Y_n are iid N(0, σ²), and Σ_1^n Y_i² = Σ_1^n X_i². So
\[ S^2 = \sum_1^n Y_i^2 - \sum_1^p Y_i^2 = \sum_{p+1}^n Y_i^2 \]
is independent of Y_1, ..., Y_p, and S²/σ² is χ²(n − p).
2.6 One-Way Analysis of Variance

To compare two normal means, we use the Student t-test, familiar from your first course in Statistics. What about comparing r means for r > 2?

Analysis of Variance goes back to early work by Fisher in 1918 on mathematical genetics and was further developed by him at Rothamsted Experimental Station in Harpenden, Hertfordshire in the 1920s. The convenient acronym ANOVA was coined much later, by the American statistician John W. Tukey (1915–2000), the pioneer of exploratory data analysis (EDA) in Statistics (Tukey (1977)), and coiner of the terms hardware, software and bit from computer science.

Fisher's motivation (which arose directly from the agricultural field trials carried out at Rothamsted) was to compare yields of several varieties of crop, say – or (the version we will follow below) of one crop under several fertiliser treatments. He realised that if there was more variability between groups (of yields with different treatments), relative to the variability within groups (of yields with the same treatment), than one would expect if the treatments were the same, then this would be evidence against believing that they were the same. In other words, Fisher set out to compare means by analysing variability ('variance' – the term is due to Fisher – is simply a short form of 'variability').

We write μ_i for the mean yield of the ith variety, for i = 1, ..., r. For each i, we draw n_i independent readings X_ij. The X_ij are independent, and we assume that they are normal, all with the same unknown variance σ²:
\[ X_{ij} \sim N(\mu_i, \sigma^2) \qquad (j = 1, \ldots, n_i, \quad i = 1, \ldots, r). \]
We write
\[ n := \sum_1^r n_i \]
for the total sample size. With two suffices i and j in play, we use a bullet to indicate that the suffix in that position has been averaged out. Thus we write
\[ X_{i\bullet}, \ \text{or } \bar X_i, := \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij} \qquad (i = 1, \ldots, r) \]
for the ith group mean (the sample mean of the ith sample), and
\[ X_{\bullet\bullet}, \ \text{or } \bar X, := \frac{1}{n}\sum_{i=1}^r \sum_{j=1}^{n_i} X_{ij} = \frac{1}{n}\sum_{i=1}^r n_i X_{i\bullet} \]
for the grand mean, and
\[ S_i^2 := \frac{1}{n_i}\sum_{j=1}^{n_i} (X_{ij} - X_{i\bullet})^2 \]
for the ith sample variance. Define the total sum of squares
\[ SS := \sum_{i=1}^r \sum_{j=1}^{n_i} (X_{ij} - X_{\bullet\bullet})^2 = \sum_i \sum_j \left[ (X_{ij} - X_{i\bullet}) + (X_{i\bullet} - X_{\bullet\bullet}) \right]^2. \]
As
\[ \sum_j (X_{ij} - X_{i\bullet}) = 0 \]
(from the definition of X_{i•} as the average of the X_{ij} over j), if we expand the square above, the cross terms vanish, giving
\[ SS = \sum_i \sum_j (X_{ij} - X_{i\bullet})^2 + 2\sum_i \sum_j (X_{ij} - X_{i\bullet})(X_{i\bullet} - X_{\bullet\bullet}) + \sum_i \sum_j (X_{i\bullet} - X_{\bullet\bullet})^2 \]
\[ = \sum_i \sum_j (X_{ij} - X_{i\bullet})^2 + \sum_i \sum_j (X_{i\bullet} - X_{\bullet\bullet})^2 = \sum_i n_i S_i^2 + \sum_i n_i (X_{i\bullet} - X_{\bullet\bullet})^2. \]
The first term on the right measures the amount of variability within groups. The second measures the variability between groups. We call them the sum of squares for error (or within groups), SSE, also known as the residual sum of squares, and the sum of squares for treatments (or between groups), respectively:
\[ SS = SSE + SST, \quad \text{where} \quad SSE := \sum_i n_i S_i^2, \qquad SST := \sum_i n_i (X_{i\bullet} - X_{\bullet\bullet})^2. \]
Let H_0 be the null hypothesis of no treatment effect:
\[ H_0: \quad \mu_i = \mu \qquad (i = 1, \ldots, r). \]
If H_0 is true, we have merely one large sample of size n, drawn from the distribution N(μ, σ²), and so
\[ SS/\sigma^2 = \frac{1}{\sigma^2}\sum_i \sum_j (X_{ij} - X_{\bullet\bullet})^2 \sim \chi^2(n-1) \qquad \text{under } H_0. \]
In particular,
\[ E[SS/(n-1)] = \sigma^2 \qquad \text{under } H_0. \]
Whether or not H_0 is true,
\[ n_i S_i^2/\sigma^2 = \frac{1}{\sigma^2}\sum_j (X_{ij} - X_{i\bullet})^2 \sim \chi^2(n_i - 1). \]
So by the Chi-Square Addition Property
\[ SSE/\sigma^2 = \sum_i n_i S_i^2/\sigma^2 = \frac{1}{\sigma^2}\sum_i \sum_j (X_{ij} - X_{i\bullet})^2 \sim \chi^2(n-r), \]
since, as n = Σ_i n_i, Σ_{i=1}^r (n_i − 1) = n − r. In particular, E[SSE/(n − r)] = σ².

Next,
\[ SST := \sum_i n_i (X_{i\bullet} - X_{\bullet\bullet})^2, \quad \text{where} \quad X_{\bullet\bullet} = \frac{1}{n}\sum_i n_i X_{i\bullet}, \qquad SSE := \sum_i n_i S_i^2. \]
Now S_i² is independent of X_{i•}, as these are the sample variance and sample mean from the ith sample, whose independence was proved in Theorem 2.4. Also S_i² is independent of X_{j•} for j ≠ i, as they are formed from different independent samples. Combining, S_i² is independent of all the X_{j•}, so of their (weighted) average X_{••}, so of SST, a function of the X_{j•} and of X_{••}. So SSE = Σ_i n_i S_i² is also independent of SST.

We can now use the Chi-Square Subtraction Property. We have, under H_0, the independent sum
\[ SS/\sigma^2 = SSE/\sigma^2 +_{ind} SST/\sigma^2. \]
By above, the left-hand side is χ²(n − 1), while the first term on the right is χ²(n − r). So the second term on the right must be χ²(r − 1). This gives:
Theorem 2.7

Under the conditions above and the null hypothesis H_0 of no difference of treatment means, we have the sum-of-squares decomposition
\[ SS = SSE +_{ind} SST, \]
where SS/σ² ∼ χ²(n − 1), SSE/σ² ∼ χ²(n − r) and SST/σ² ∼ χ²(r − 1).
When we have a sum of squares, chi-square distributed, and we divide by its degrees of freedom, we will call the resulting ratio a mean sum of squares, and denote it by changing the SS in the name of the sum of squares to MS. Thus the mean sum of squares is
\[ MS := SS/\mathrm{df}(SS) = SS/(n-1), \]
and the mean sums of squares for treatment and for error are
\[ MST := SST/\mathrm{df}(SST) = SST/(r-1), \qquad MSE := SSE/\mathrm{df}(SSE) = SSE/(n-r). \]
By the above, SS = SST + SSE; whether or not H_0 is true,
\[ E[MSE] = E[SSE]/(n-r) = \sigma^2; \]
under H_0,
\[ E[MS] = E[SS]/(n-1) = \sigma^2, \quad \text{and so also} \quad E[MST] = E[SST]/(r-1) = \sigma^2. \]
Form the F-statistic
\[ F := MST/MSE. \]
Under H_0, this has distribution F(r − 1, n − r). Fisher realised that comparing the size of this F-statistic with percentage points of this F-distribution gives us a way of testing the truth or otherwise of H_0. Intuitively, if the treatments do differ, this will tend to inflate SST, hence MST, hence F = MST/MSE. To justify this intuition, we proceed as follows. Whether or not H_0 is true,
\[ SST = \sum_i n_i (X_{i\bullet} - X_{\bullet\bullet})^2 = \sum_i n_i X_{i\bullet}^2 - 2X_{\bullet\bullet}\sum_i n_i X_{i\bullet} + X_{\bullet\bullet}^2 \sum_i n_i = \sum_i n_i X_{i\bullet}^2 - nX_{\bullet\bullet}^2, \]
since Σ_i n_i X_{i•} = nX_{••} and Σ_i n_i = n. So
\[ E[SST] = \sum_i n_i E\left[X_{i\bullet}^2\right] - nE\left[X_{\bullet\bullet}^2\right] = \sum_i n_i \left[ \mathrm{var}(X_{i\bullet}) + (EX_{i\bullet})^2 \right] - n\left[ \mathrm{var}(X_{\bullet\bullet}) + (EX_{\bullet\bullet})^2 \right]. \]
But var(X_{i•}) = σ²/n_i, and
\[ \mathrm{var}(X_{\bullet\bullet}) = \mathrm{var}\left( \frac{1}{n}\sum_{i=1}^r n_i X_{i\bullet} \right) = \frac{1}{n^2}\sum_1^r n_i^2\, \mathrm{var}(X_{i\bullet}) = \frac{1}{n^2}\sum_1^r n_i^2 \sigma^2/n_i = \sigma^2/n \]
(as Σ_i n_i = n). So writing
\[ \mu := \frac{1}{n}\sum_i n_i \mu_i = EX_{\bullet\bullet} = \frac{1}{n}E\left[ \sum_i n_i X_{i\bullet} \right], \]
\[ E(SST) = \sum_i n_i \left( \frac{\sigma^2}{n_i} + \mu_i^2 \right) - n\left( \frac{\sigma^2}{n} + \mu^2 \right) = (r-1)\sigma^2 + \sum_i n_i \mu_i^2 - n\mu^2 = (r-1)\sigma^2 + \sum_i n_i (\mu_i - \mu)^2 \]
(as Σ_i n_i = n, nμ = Σ_i n_i μ_i). This gives the inequality
\[ E[SST] \geq (r-1)\sigma^2, \quad \text{with equality iff} \quad \mu_i = \mu \ (i = 1, \ldots, r), \quad \text{i.e. } H_0 \text{ is true}. \]
Thus when H0 is false, the mean of SST increases, so larger values of SST , so of M ST and of F = M ST /M SE, are evidence against H0 . It is thus appropriate to use a one-tailed F -test, rejecting H0 if the value F of our F -statistic is too big. How big is too big depends, of course, on our chosen significance level α, and hence on the tabulated value Ftab := Fα (r − 1, n − r), the upper α-point of the relevant F -distribution. We summarise:
Theorem 2.8

When the null hypothesis H_0 (that all the treatment means μ_1, ..., μ_r are equal) is true, the F-statistic F := MST/MSE = (SST/(r−1))/(SSE/(n−r)) has the F-distribution F(r − 1, n − r). When the null hypothesis is false, F increases. So large values of F are evidence against H_0, and we test H_0 using a one-tailed test, rejecting at significance level α if F is too big, that is, with critical region F > F_tab = F_α(r − 1, n − r).

Model Equations for One-Way ANOVA.
\[ X_{ij} = \mu_i + \epsilon_{ij} \qquad (i = 1, \ldots, r, \quad j = 1, \ldots, n_i), \qquad \epsilon_{ij} \ \text{iid} \ N(0, \sigma^2). \]
Here μi is the main effect for the ith treatment, the null hypothesis is H0 : μ1 = . . . = μr = μ, and the unknown variance σ 2 is a nuisance parameter. The point of forming the ratio in the F -statistic is to cancel this nuisance parameter σ 2 , just as in forming the ratio in the Student t-statistic in one’s first course in Statistics. We will return to nuisance parameters in §5.1.1 below.
Calculations. In any calculation involving variances, there is cancellation to be made, which is worthwhile and important numerically. This stems from the definition and ‘computing formula’ for the variance,
\[ \sigma^2 := E\left[ (X - EX)^2 \right] = E\left[X^2\right] - (EX)^2 \]
and its sample counterpart
\[ S^2 := \overline{(X - \bar X)^2} = \overline{X^2} - \bar X^2. \]
Writing T, T_i for the grand total and group totals, defined by
\[ T := \sum_i \sum_j X_{ij}, \qquad T_i := \sum_j X_{ij}, \]
so X_{••} = T/n, nX_{••}² = T²/n:
\[ SS = \sum_i \sum_j X_{ij}^2 - T^2/n, \qquad SST = \sum_i T_i^2/n_i - T^2/n, \qquad SSE = SS - SST = \sum_i \sum_j X_{ij}^2 - \sum_i T_i^2/n_i. \]
These formulae help to reduce rounding errors and are easiest to use if carrying out an Analysis of Variance by hand. It is customary, and convenient, to display the output of an Analysis of Variance by an ANOVA table, as shown in Table 2.1. (The term ‘Error’ can be used in place of ‘Residual’ in the ‘Source’ column.)
Source        df       SS     Mean Square            F
Treatments    r − 1    SST    MST = SST/(r − 1)      MST/MSE
Residual      n − r    SSE    MSE = SSE/(n − r)
Total         n − 1    SS

Table 2.1 One-way ANOVA table.
Example 2.9 We give an example which shows how to calculate the Analysis of Variance tables by hand. The data in Table 2.2 come from an agricultural experiment. We wish to test for different mean yields for the different fertilisers. We note that
Fertiliser    Yield
A             14.5, 12.0, 9.0, 6.5
B             13.5, 10.0, 9.0, 8.5
C             11.5, 11.0, 14.0, 10.0
D             13.0, 13.0, 13.5, 7.5
E             15.0, 12.0, 8.0, 7.0
F             12.5, 13.5, 14.0, 8.0

Table 2.2 Data for Example 2.9
we have six treatments so 6 − 1 = 5 degrees of freedom for treatments. The total number of degrees of freedom is the number of observations minus one, hence 23. This leaves 18 degrees of freedom for the within-treatments sum of squares. The total sum of squares can be calculated routinely as Σ(y_ij − ȳ)² = Σy_ij² − nȳ², which is often most efficiently calculated as Σy_ij² − (1/n)(Σy_ij)². This calculation gives SS = 3119.25 − (1/24)(266.5)² = 159.990. The easiest next step is to calculate SST, which means we can then obtain SSE by subtraction as above. The formula for SST is relatively simple and reads Σ_i T_i²/n_i − T²/n, where T_i denotes the sum of the observations corresponding to the ith treatment and T = Σ_ij y_ij. Here this gives SST = (1/4)(42² + 41² + 46.5² + 47² + 42² + 48²) − (1/24)(266.5)² = 11.802. Working through, the full ANOVA table is shown in Table 2.3.

Source                 df    Sum of Squares    Mean Square    F
Between fertilisers    5     11.802            2.360          0.287
Residual               18    148.188           8.233
Total                  23    159.990

Table 2.3 One-way ANOVA table for Example 2.9

This gives a non-significant p-value compared with F_{5,18}(0.95) = 2.773. R calculates the p-value to be 0.914. Alternatively, we may place bounds on the p-value by looking at statistical tables. In conclusion, we have no evidence for differences between the various types of fertiliser.

In the above example, the calculations were made more simple by having equal numbers of observations for each treatment. However, the same general procedure works when this no longer continues to be the case. For detailed worked examples with unequal sample sizes see Snedecor and Cochran (1989) §12.10.
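The same table can be produced in R with the aov command; the sketch below types in the data of Table 2.2 (the variable names are our own) and should reproduce the figures above.

    yield <- c(14.5, 12.0, 9.0, 6.5,   13.5, 10.0, 9.0, 8.5,
               11.5, 11.0, 14.0, 10.0, 13.0, 13.0, 13.5, 7.5,
               15.0, 12.0, 8.0, 7.0,   12.5, 13.5, 14.0, 8.0)
    fertiliser <- factor(rep(LETTERS[1:6], each = 4))
    summary(aov(yield ~ fertiliser))   # df 5 and 18, F about 0.287, p about 0.914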
S-Plus/R. We briefly describe implementation of one-way ANOVA in S-Plus/R. For background and details, see e.g. Crawley (2002), Ch. 15. Suppose we are studying the dependence of yield on treatment, as above. [Note that this requires that we set treatment to be a factor variable, taking discrete rather than continuous values, which can be achieved by setting treatment <- factor(treatment).]
(ii) P(X > 1) where X ∼ F_{1,16}, (iii) P(X < 4) where X ∼ F_{1,3}, (iv) P(X > 3.4) where X ∼ F_{19,4}, (v) P(ln X > −1.4) where X ∼ F_{10,4}.
Fat 1    Fat 2    Fat 3    Fat 4
164      178      175      155
172      191      193      166
168      197      178      149
177      182      171      164
156      185      163      170
195      177      176      168

Table 2.10 Data for Exercise 2.3.

2.3. Doughnut data. Doughnuts absorb fat during cooking. The following experiment was conceived to test whether the amount of fat absorbed depends on the type of fat used. Table 2.10 gives the amount of fat absorbed per batch of doughnuts. Produce the one-way Analysis of Variance table for these data. What is your conclusion?

2.4. The data in Table 2.11 come from an experiment where growth is measured and compared to the variable photoperiod which indicates the length of daily exposure to light. Produce the one-way ANOVA table for these data and determine whether or not growth is affected by the length of daily light exposure.

Very short    Short    Long    Very long
2             3        3       4
3             4        5       6
1             2        1       2
1             1        2       2
2             2        2       2
1             1        2       3

Table 2.11 Data for Exercise 2.4
2.5. Unpaired t-test with equal variances. Under the null hypothesis the statistic t defined as
\[ t = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\; \frac{\bar X_1 - \bar X_2 - (\mu_1 - \mu_2)}{s} \]
should follow a t distribution with n_1 + n_2 − 2 degrees of freedom, where n_1 and n_2 denote the number of observations from samples 1 and 2 and s is the pooled estimate given by
\[ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, \]
where
\[ s_1^2 = \frac{1}{n_1 - 1}\left( \sum x_1^2 - n_1 \bar x_1^2 \right), \qquad s_2^2 = \frac{1}{n_2 - 1}\left( \sum x_2^2 - n_2 \bar x_2^2 \right). \]
(i) Give the relevant statistic for a test of the hypothesis μ_1 = μ_2 and n_1 = n_2 = n.
(ii) Show that if n_1 = n_2 = n then one-way ANOVA recovers the same results as the unpaired t-test. [Hint. Show that the F-statistic satisfies F_{1,2(n−1)} = t²_{2(n−1)}.]

2.6. Let Y_1, Y_2 be iid N(0, 1). Give values of a and b such that
\[ a(Y_1 - Y_2)^2 + b(Y_1 + Y_2)^2 \sim \chi_2^2. \]

2.7. Let Y_1, Y_2, Y_3 be iid N(0, 1). Show that
\[ \frac{1}{3}\left[ (Y_1 - Y_2)^2 + (Y_2 - Y_3)^2 + (Y_3 - Y_1)^2 \right] \sim \chi_2^2. \]
Generalise the above result for a sample Y_1, Y_2, ..., Y_n of size n.

2.8. The data in Table 2.12 come from an experiment testing the number of failures out of 100 planted soyabean seeds, comparing four different seed treatments, with no treatment ('check'). Produce the two-way ANOVA table for this data and interpret the results. (We will return to this example in Chapter 8.)

Treatment      Rep 1    Rep 2    Rep 3    Rep 4    Rep 5
Check          8        10       12       13       11
Arasan         2        6        7        11       5
Spergon        4        10       9        8        10
Semesan, Jr    3        5        9        10       6
Fermate        9        7        5        5        3

Table 2.12 Data for Exercise 2.8
2.9. Photoperiod example revisited. When we add in knowledge of plant genotype the full data set is as shown in Table 2.13. Produce the two-way ANOVA table and revise any conclusions from Exercise 2.4 in the light of these new data as appropriate.
Genotype    Very short    Short    Long    Very long
A           2             3        3       4
B           3             4        5       6
C           1             2        1       2
D           1             1        2       2
E           2             2        2       2
F           1             1        2       3

Table 2.13 Data for Exercise 2.9
2.10. Two-way ANOVA with interactions. Three varieties of potato are planted on three plots at each of four locations. The yields in bushels are given in Table 2.14. Produce the ANOVA table for these data. Does the interaction term appear necessary? Describe your conclusions.

Variety    Location 1    Location 2    Location 3    Location 4
A          15, 19, 22    17, 10, 13    9, 12, 6      14, 8, 11
B          20, 24, 18    24, 18, 22    12, 15, 10    21, 16, 14
C          22, 17, 14    26, 19, 21    10, 5, 8      19, 15, 12

Table 2.14 Data for Exercise 2.10
2.11. Two-way ANOVA with interactions. The data in Table 2.15 give the gains in weight of male rats from diets with different sources and different levels of protein. Produce the two-way ANOVA table with interactions for these data. Test for the presence of interactions between source and level of protein and state any conclusions that you reach.

Source    High Protein                                       Low Protein
Beef      73, 102, 118, 104, 81, 107, 100, 87, 117, 111      90, 76, 90, 64, 86, 51, 72, 90, 95, 78
Cereal    98, 74, 56, 111, 95, 88, 82, 77, 86, 92            107, 95, 97, 80, 98, 74, 74, 67, 89, 58
Pork      94, 79, 96, 98, 102, 102, 108, 91, 120, 105        49, 82, 73, 86, 81, 97, 106, 70, 61, 82

Table 2.15 Data for Exercise 2.11
3
Multiple Regression
3.1 The Normal Equations

We saw in Chapter 1 how the model
\[ y_i = a + b x_i + \epsilon_i, \qquad \epsilon_i \ \text{iid} \ N(0, \sigma^2), \]
for simple linear regression occurs. We saw also that we may need to consider two or more regressors. We dealt with two regressors u and v, and could deal with three regressors u, v and w similarly. But in general we will need to be able to handle any number of regressors, and rather than rely on the finite resources of the alphabet it is better to switch to suffix notation, and use the language of vectors and matrices. For a random vector X, we will write EX for its mean vector (thus the mean of the ith coordinate X_i is E(X_i) = (EX)_i), and var(X) for its covariance matrix (whose (i, j) entry is cov(X_i, X_j)).

We will use p regressors, called x_1, ..., x_p, each with a corresponding parameter β_1, ..., β_p ('p for parameter'). In the equation above, regard a as short for a.1, with 1 as a regressor corresponding to a constant term (the intercept term in the context of linear regression). Then for one reading ('a sample of size 1') we have the model
\[ y = \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon, \qquad \epsilon \sim N(0, \sigma^2). \]
In the general case of a sample of size n, we need two suffices, giving the model equations
\[ y_i = \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \epsilon_i, \qquad \epsilon_i \ \text{iid} \ N(0, \sigma^2) \qquad (i = 1, \ldots, n). \]
Writing the typical term on the right as x_ij β_j, we recognise the form of a matrix product. Form y_1, ..., y_n into a column vector y, ε_1, ..., ε_n into a column vector ε, β_1, ..., β_p into a column vector β, and x_ij into a matrix X (thus y and ε are n × 1, β is p × 1 and X is n × p). Then our system of equations becomes one matrix equation, the model equation
\[ y = X\beta + \epsilon. \]
(M E)
This matrix equation, and its consequences, are the object of study in this chapter. Recall that, as in Chapter 1, n is the sample size – the larger the better – while p, the number of parameters, is small – as small as will suffice. We will have more to say on choice of p later. Typically, however, p will be at most five or six, while n could be some tens or hundreds. Thus we must expect n to be much larger than p, which we write as n >> p. In particular, the n×p matrix X has no hope of being invertible, as it is not even square (a common student howler).
Note 3.1 We pause to introduce the objects in the model equation (M E) by name. On the left is y, the data, or response vector. The last term is the error or error vector; β is the parameter or parameter vector. Matrix X is called the design matrix. Although its (i, j) entry arose above as the ith value of the jth regressor, for most purposes from now on xij is just a constant. Emphasis shifts from these constants to the parameters, βj .
Note 3.2
To underline this shift of emphasis, it is often useful to change notation and write A for X, when the model equation becomes
\[ y = A\beta + \epsilon. \]
(M E)
Lest this be thought a trivial matter, we mention that Design of Experiments (initiated by Fisher) is a subject in its own right, on which numerous books have been written, and to which we return in §9.3. We will feel free to use either notation as seems most convenient at the time. While X is the natural choice for straight regression problems, as in this chapter, it is less suitable in the general Linear Model, which includes related contexts such as Analysis of Variance (Chapter 2) and Analysis of Covariance (Chapter 5). Accordingly, we shall usually prefer A to X for use in developing theory.
We make a further notational change. As we shall be dealing from now on with vectors rather than scalars, there is no need to remind the reader of this by using boldface type. We may thus lighten the notation by using plain y in place of boldface y, etc.; thus we now have
\[ y = A\beta + \epsilon, \qquad (ME) \]
for use in this chapter (in Chapter 4 below, where we again use x as a scalar variable, we use boldface x for a vector variable). From the model equation
\[ y_i = \sum_{j=1}^p a_{ij}\beta_j + \epsilon_i, \qquad \epsilon_i \ \text{iid} \ N(0, \sigma^2), \]
the likelihood is
\[ L = \frac{1}{\sigma^n (2\pi)^{\frac{1}{2}n}} \prod_{i=1}^n \exp\left\{ -\frac{1}{2}\left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2 \Big/ \sigma^2 \right\} = \frac{1}{\sigma^n (2\pi)^{\frac{1}{2}n}} \exp\left\{ -\frac{1}{2}\sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2 \Big/ \sigma^2 \right\}, \]
and the log-likelihood is
\[ \ell := \log L = \mathrm{const} - n\log\sigma - \frac{1}{2}\sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2 \Big/ \sigma^2. \]
As before, we use Fisher's Method of Maximum Likelihood, and maximise with respect to β_r: ∂ℓ/∂β_r = 0 gives
\[ \sum_{i=1}^n a_{ir}\left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right) = 0 \qquad (r = 1, \ldots, p), \]
or
\[ \sum_{j=1}^p \sum_{i=1}^n a_{ir} a_{ij}\beta_j = \sum_{i=1}^n a_{ir} y_i. \]
Write C = (c_ij) for the p × p matrix C := A^T A (called the information matrix – see Definition 3.11 below), which we note is symmetric: C^T = C. Then
\[ c_{ij} = \sum_{k=1}^n (A^T)_{ik} A_{kj} = \sum_{k=1}^n a_{ki} a_{kj}. \]
So this says
\[ \sum_{j=1}^p c_{rj}\beta_j = \sum_{i=1}^n a_{ir} y_i = \sum_{i=1}^n (A^T)_{ri}\, y_i. \]
In matrix notation, this is
\[ (C\beta)_r = (A^T y)_r \qquad (r = 1, \ldots, p), \]
or combining,
\[ C\beta = A^T y, \qquad C := A^T A. \qquad (NE) \]
These are the normal equations, the analogues for the general case of the normal equations obtained in Chapter 1 for the cases of one and two regressors.
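Numerically, solving (NE) is a one-line matter in R; the sketch below uses simulated data of our own (n = 50, p = 3 including an intercept column) and checks the answer against lm.

    set.seed(1)
    n <- 50
    A <- cbind(1, rnorm(n), rnorm(n))        # design matrix with a constant column
    y <- drop(A %*% c(1, 2, -1)) + rnorm(n)
    C <- crossprod(A)                        # C = t(A) %*% A, the information matrix
    beta.hat <- solve(C, crossprod(A, y))    # solves C beta = t(A) y
    cbind(beta.hat, coef(lm(y ~ A - 1)))     # agrees with R's built-in least squares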
3.2 Solution of the Normal Equations

Our next task is to solve the normal equations for β. Before doing so, we need to check that there exists a unique solution, the condition for which is, from Linear Algebra, that the information matrix C := A^T A should be non-singular (see e.g. Blyth and Robertson (2002a), Ch. 4). This imposes an important condition on the design matrix A. Recall that the rank of a matrix is the maximal number of independent rows or columns. If this is as big as it could be given the size of the matrix, the matrix is said to have full rank, otherwise it has deficient rank. Since A is n × p with n >> p, A has full rank if its rank is p. Recall from Linear Algebra that a square matrix C is non-negative definite if x^T Cx ≥ 0 for all vectors x, while C is positive definite if
\[ x^T C x > 0 \qquad \forall x \neq 0 \]
(see e.g. Blyth and Robertson (2002b), Ch. 8). A positive definite matrix is non-singular, so invertible; a non-negative definite matrix need not be.
Lemma 3.3 If A (n × p, n > p) has full rank p, C := AT A is positive definite.
Proof As A has full rank, there is no vector x with Ax = 0 other than the zero vector (such an equation would give a non-trivial linear dependence relation between the columns of A). So (Ax)T Ax = xT AT Ax = xT Cx = 0
only for x = 0, and is > 0 otherwise. This says that C is positive definite, as required.
Note 3.4 The same proof shows that C := AT A is always non-negative definite, regardless of the rank of A.
Theorem 3.5
For A full rank, the normal equations have the unique solution
\[ \hat\beta = C^{-1}A^T y = (A^T A)^{-1} A^T y. \qquad (\hat\beta) \]
Proof In the full-rank case, C is positive definite by Lemma 3.3, so invertible, so we may solve the normal equations to obtain the solution above. From now on, we restrict attention to the full-rank case: the design matrix A, which is n×p, has full rank p.
Note 3.6 The distinction between the full- and deficient-rank cases is the same as that between the general and singular cases that we encountered in Chapter 1 in connection with the bivariate normal distribution. We will encounter it again later in Chapter 4, in connection with the multivariate normal distribution. In fact, this distinction bedevils the whole subject. Linear dependence causes rankdeficiency, in which case we should identify the linear dependence relation, use it to express some regressors (or columns of the design matrix) in terms of others, eliminate the redundant regressors or columns, and begin again in a lower dimension, where the problem will have full rank. What is worse is that nearlinear dependence – which when regressors are at all numerous is not uncommon – means that one is close to rank-deficiency, and this makes things numerically unstable. Remember that in practice, we work numerically, and when one is within rounding error of rank-deficiency, one is close to disaster. We shall return to this vexed matter later (§4.4), in connection with multicollinearity. We note in passing that Numerical Linear Algebra is a subject in its own right; for a monograph treatment, see e.g. Golub and Van Loan (1996).
Just as in Chapter 1, the functional form of the normal likelihood means that maximising the likelihood minimises the sum of squares
\[ SS := (y - A\beta)^T(y - A\beta) = \sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2. \]
Accordingly, we have as before the following theorem.
Theorem 3.7
The solutions (β̂) to the normal equations (NE) are both the maximum-likelihood estimators and the least-squares estimators of the parameters β.

There remains the task of estimating the remaining parameter σ. At the maximum, β = β̂. So taking ∂ℓ/∂σ = 0 in the log-likelihood
\[ \ell := \log L = \mathrm{const} - n\log\sigma - \frac{1}{2}\sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2 \Big/ \sigma^2 \]
gives, at the maximum,
\[ -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2 = 0. \]
At the maximum, β = β̂; rearranging, we have at the maximum that
\[ \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\hat\beta_j \right)^2. \]
This sum of squares is, by construction, the minimum value of the total sum of squares SS as the parameter β varies, the minimum being attained at the least-squares estimate β̂. This minimised sum of squares is called the sum of squares for error, SSE:
\[ SSE = \sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\hat\beta_j \right)^2 = \left( y - A\hat\beta \right)^T \left( y - A\hat\beta \right), \]
so-called because, as we shall see in Corollary 3.23 below, the unbiased estimator of the error variance σ² is σ̂² = SSE/(n − p).

We call ŷ := Aβ̂ the fitted values, and e := y − ŷ, the difference between the actual values (data) and fitted values, the residual vector. If e = (e_1, ..., e_n), the e_i are the residuals, and the sum of squares for error
\[ SSE = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n e_i^2 \]
is the sum of squared residuals.
Note 3.8
We pause to discuss unbiasedness and degrees of freedom (df). In a first course in Statistics, one finds the maximum-likelihood estimators (MLEs) μ̂, σ̂² of the parameters μ, σ² in a normal distribution N(μ, σ²). One finds
\[ \hat\mu = \bar x, \qquad \hat\sigma^2 = s_x^2 := \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 \]
(and the distributions are given by x̄ ∼ N(μ, σ²/n) and nσ̂²/σ² ∼ χ²(n − 1)). But this is a biased estimator of σ²; to get an unbiased estimator, one has to replace n in the denominator above by n − 1 (in distributional terms: the mean of a chi-square is its df). This is why many authors use n − 1 in place of n in the denominator when they define the sample variance (and we warned, when we used n in Chapter 1, that this was not universal!), giving what we will call the unbiased sample variance,
\[ s_u^2 := \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)^2. \]
The problem is that to estimate σ², one has first to estimate μ by x̄. Every time one has to estimate a parameter from the data, one loses a degree of freedom. In this one-dimensional problem, the df accordingly decreases from n to n − 1. Returning to the general case: here we have to estimate p parameters, β_1, ..., β_p. Accordingly, we lose p degrees of freedom, and to get an unbiased estimator we have to divide, not by n as above but by n − p, giving the estimator
\[ \hat\sigma^2 = \frac{1}{n-p}\, SSE. \]
Since n is much larger than p, the difference between this (unbiased) estimator and the previous (maximum-likelihood) version is not large, but it is worthwhile, and so we shall work with the unbiased version unless otherwise stated. We find its distribution in §3.4 below (and check it is unbiased – Corollary 3.23).
Note 3.9 (Degrees of Freedom) Recall that n is our sample size, that p is our number of parameters, and that n is much greater than p. The need to estimate p parameters, which reduces the degrees of freedom from n to n − p, thus effectively reduces the sample size by this amount. We can think of the degrees of freedom as a measure of the amount of information available to us. This interpretation is in the minds of statisticians when they prefer one procedure to another because it ‘makes more degrees of freedom available’ for
the task in hand. We should always keep the degrees of freedom of all relevant terms (typically, sums of squares, or quadratic forms in normal variates) in mind, and think of keeping this large as being desirable.

We rewrite our conclusions so far in matrix notation. The total sum of squares is
\[ SS := \sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\beta_j \right)^2 = (y - A\beta)^T(y - A\beta); \]
its minimum value with respect to variation in β is the sum of squares for error
\[ SSE = \sum_{i=1}^n \left( y_i - \sum_{j=1}^p a_{ij}\hat\beta_j \right)^2 = \left( y - A\hat\beta \right)^T \left( y - A\hat\beta \right), \]
where β̂ is the solution to the normal equations (NE). Note that SSE is a statistic – we can calculate it from the data y and β̂ = C^{-1}A^T y, unlike SS which contains unknown parameters β.

One feature is amply clear already. To carry through a regression analysis in practice, we must perform considerable matrix algebra – or, with actual data, numerical matrix algebra – involving in particular the inversion of the p × p matrix C := A^T A. With matrices of any size, the calculations may well be laborious to carry out by hand. In particular, matrix inversion to find C^{-1} will be unpleasant for matrices larger than 2 × 2, even though C – being symmetric and positive definite – has good properties. For matrices of any size, one needs computer assistance. The package MATLAB is specially designed with matrix operations in mind. General mathematics packages such as Mathematica or Maple have a matrix inversion facility; so too do a number of statistical packages – for example, the solve command in S-Plus/R.

(MATLAB, Simulink and Symbolic Math Toolbox are trademarks of The MathWorks, Inc., 3 Apple Hill Drive, Natick, MA, 01760-2098, USA, http://www.mathworks.com. Mathematica is a registered trademark of Wolfram Research, Inc., 100 Trade Center Drive, Champaign, IL 61820-7237, USA, http://www.wolfram.com. Maple is a trademark of Waterloo Maple Inc., 615 Kumpf Drive, Waterloo, Ontario, Canada N2V 1K8, http://www.maplesoft.com.)

QR Decomposition

The numerical solution of the normal equations ((NE) in §3.1, (β̂) in Theorem 3.5) is simplified if the design matrix A (which is n × p, and of full rank p) is given its QR decomposition A = QR, where Q is n × p and has orthonormal columns – so Q^T Q = I
– and R is p × p, upper triangular, and non-singular (has no zeros on the diagonal). This is always possible; see below. The normal equations A^T Aβ̂ = A^T y then become
\[ R^T Q^T Q R\hat\beta = R^T Q^T y, \]
or
\[ R^T R\hat\beta = R^T Q^T y, \]
as Q^T Q = I, or
\[ R\hat\beta = Q^T y, \]
as R, and so also R^T, is non-singular. This system of linear equations for β̂ has an upper triangular matrix R, and so may be solved simply by back-substitution, starting with the bottom equation and working upwards.

The QR decomposition is just the expression in matrix form of the process of Gram–Schmidt orthogonalisation, for which see e.g. Blyth and Robertson (2002b), Th. 1.4. Write A as a row of its columns, A = (a_1, ..., a_p); the n-vectors a_i are linearly independent as A has full rank p. Write q_1 := a_1/‖a_1‖, and for j = 2, ..., p,
\[ q_j := w_j / \|w_j\|, \qquad \text{where} \quad w_j := a_j - \sum_{k=1}^{j-1} (a_j^T q_k)\, q_k. \]
Then the q_j are orthonormal (are mutually orthogonal unit vectors), which span the column-space of A (Gram–Schmidt orthogonalisation is this process of passing from the a_j to the q_j). Each q_j is a linear combination of a_1, ..., a_j, and the construction ensures that, conversely, each a_j is a linear combination of q_1, ..., q_j. That is, there are scalars r_kj with
\[ a_j = \sum_{k=1}^j r_{kj}\, q_k \qquad (j = 1, \ldots, p). \]
Put r_kj = 0 for k > j. Then assembling the p columns a_j into the matrix A as above, this equation becomes A = QR, as required.
Note 3.10 Though useful as a theoretical tool, the Gram–Schmidt orthogonalisation process is not numerically stable. For numerical implementation, one needs a stable variant, the modified Gram-Schmidt process. For details, see Golub and Van Loan (1996), §5.2. They also give other forms of the QR decomposition (Householder, Givens, Hessenberg etc.).
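In R the stable QR route is built in; the sketch below (simulated A and y of our own, redefined here so that it runs on its own) solves Rβ = Q^T y by back-substitution and compares with qr.coef.

    set.seed(1)
    n <- 50
    A <- cbind(1, rnorm(n), rnorm(n))
    y <- drop(A %*% c(1, 2, -1)) + rnorm(n)
    QR <- qr(A)                                              # Householder QR of the design matrix
    beta.qr <- backsolve(qr.R(QR), crossprod(qr.Q(QR), y))   # back-substitution in R beta = t(Q) y
    cbind(beta.qr, qr.coef(QR, y))                           # the two answers coincide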
3.3 Properties of Least-Squares Estimators

We have assumed normal errors in our model equations, (ME) of §3.1. But (until we need to assume normal errors in §3.5.2), we may work more generally, and assume only
\[ Ey = A\beta, \qquad \mathrm{var}(y) = \sigma^2 I. \qquad (ME^*) \]
We must then restrict ourselves to the Method of Least Squares, as without distributional assumptions we have no likelihood function, so cannot use the Method of Maximum Likelihood.

Linearity. The least-squares estimator
\[ \hat\beta = C^{-1}A^T y \]
is linear in the data y.

Unbiasedness.
\[ E\hat\beta = C^{-1}A^T Ey = C^{-1}A^T A\beta = C^{-1}C\beta = \beta: \]
β̂ is an unbiased estimator of β.

Covariance matrix.
\[ \mathrm{var}(\hat\beta) = \mathrm{var}(C^{-1}A^T y) = C^{-1}A^T (\mathrm{var}(y)) (C^{-1}A^T)^T = C^{-1}A^T.\sigma^2 I.AC^{-1} \qquad (C = C^T) \]
\[ = \sigma^2 . C^{-1}A^T.AC^{-1} = \sigma^2 C^{-1} \qquad (C = A^T A). \]
We wish to keep the variances of our estimators of our p parameters βi small, and these are the diagonal elements of the covariance matrix above; similarly for the covariances (off-diagonal elements). The smaller the variances, the more precise our estimates, and the more information we have. This motivates the next definition.
Definition 3.11 The matrix C := AT A, with A the design matrix, is called the information matrix.
Note 3.12
1. The variance σ² in our errors ε_i (which we of course wish to keep small) is usually beyond our control. However, at least at the stage of design and planning of the experiment, the design matrix A may well be within our control; hence so will be the information matrix C := A^T A, which we wish to maximise (in some sense), and hence so will be C^{-1}, which we wish to minimise in some sense. We return to this in §9.3 in connection with Design of Experiments.
2. The term information matrix is due to Fisher. It is also used in the context of parameter estimation by the method of maximum likelihood. One has the likelihood L(θ), with θ a vector parameter, and the log-likelihood ℓ(θ) := log L(θ). The information matrix is the negative of the Hessian (matrix of second derivatives) of the log-likelihood: I(θ) := (I_ij(θ))_{i,j=1}^p, where
\[ I_{ij}(\theta) := -\frac{\partial^2 \ell}{\partial\theta_i\, \partial\theta_j}(\theta). \]
Under suitable regularity conditions, the maximum likelihood estimator θ̂ is asymptotically normal and unbiased, with variance matrix (nI(θ))^{-1}; see e.g. Rao (1973), 5a.3, or Cramér (1946), §33.3.
Theorem 3.13 (Gauss–Markov Theorem) Among all unbiased linear estimators β˜ = By of β, the least-squares estimator βˆ = C −1 AT y has the minimum variance in each component. That is βˆ is the BLUE.
Proof
By above, the covariance matrix of an arbitrary unbiased linear estimate β̃ = By and of the least-squares estimator β̂ are given by
\[ \mathrm{var}(\tilde\beta) = \sigma^2 BB^T \quad \text{and} \quad \mathrm{var}(\hat\beta) = \sigma^2 C^{-1}. \]
Their difference (which we wish to show is non-negative) is
\[ \mathrm{var}(\tilde\beta) - \mathrm{var}(\hat\beta) = \sigma^2\left[ BB^T - C^{-1} \right]. \]
Now using symmetry of C, C^{-1}, and BA = I (so A^T B^T = I) from above,
\[ (B - C^{-1}A^T)(B - C^{-1}A^T)^T = (B - C^{-1}A^T)(B^T - AC^{-1}). \]
Further,
\[ (B - C^{-1}A^T)(B^T - AC^{-1}) = BB^T - BAC^{-1} - C^{-1}A^T B^T + C^{-1}A^T A C^{-1} = BB^T - C^{-1} - C^{-1} + C^{-1} \qquad (C = A^T A) \]
\[ = BB^T - C^{-1}. \]
Combining,
\[ \mathrm{var}(\tilde\beta) - \mathrm{var}(\hat\beta) = \sigma^2 (B - C^{-1}A^T)(B - C^{-1}A^T)^T. \]
Now for a matrix M = (m_ij),
\[ (MM^T)_{ii} = \sum_k m_{ik}(M^T)_{ki} = \sum_k m_{ik}^2, \]
the sum of the squares of the elements on the ith row of matrix M. So the ith diagonal entry above is
\[ \mathrm{var}(\tilde\beta_i) = \mathrm{var}(\hat\beta_i) + \sigma^2\,(\text{sum of squares of elements on ith row of } B - C^{-1}A^T). \]
So var(β̃_i) ≥ var(β̂_i), and var(β̃_i) = var(β̂_i) iff B − C^{-1}A^T has ith row zero. So some β̃_i has greater variance than β̂_i unless B = C^{-1}A^T (i.e., unless all rows of B − C^{-1}A^T are zero) – that is, unless β̃ = By = C^{-1}A^T y = β̂, the least-squares estimator, as required.
One may summarise all this as: whether or not errors are assumed normal, LEAST SQUARES IS BEST.
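The theorem can also be checked numerically. In the hedged R sketch below (simulated data, invented names), another unbiased linear estimator is built by adding to C^{-1}A^T a matrix whose rows are orthogonal to the columns of A; its componentwise variances never fall below those of least squares.

set.seed(2)
n <- 30; p <- 3
A <- cbind(1, rnorm(n), rnorm(n))      # design matrix of full rank p
C <- t(A) %*% A
P <- A %*% solve(C) %*% t(A)           # projection ('hat') matrix
B.ls <- solve(C) %*% t(A)              # least-squares coefficient matrix
D <- matrix(rnorm(p * n), p, n) %*% (diag(n) - P)  # rows orthogonal to the columns of A
B <- B.ls + D                          # another linear estimator; still B %*% A = I
range(B %*% A - diag(p))               # ~ 0: both estimators are unbiased
sigma2 <- 1
diag(sigma2 * B %*% t(B))              # variances of the alternative estimator
diag(sigma2 * solve(C))                # variances of least squares: never larger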
Note 3.14
The Gauss–Markov theorem is in fact a misnomer. It is due to Gauss, in the early nineteenth century; it was treated in the book Markov (1912) by A. A. Markov (1856–1922). A misreading of Markov's book gave rise to the impression that he had rediscovered the result, and the name Gauss–Markov theorem has stuck (partly because it is useful!).
Estimability. A linear combination c^T β = Σ_{i=1}^p c_i β_i, with c = (c_1, ..., c_p)^T a known p-vector, is called estimable if it has an unbiased linear estimator b^T y = Σ_{i=1}^n b_i y_i, with b = (b_1, ..., b_n)^T a known n-vector. Then
E(b^T y) = b^T E(y) = b^T Aβ = c^T β.
This can hold identically in the unknown parameter β iff c^T = b^T A, that is, c is a linear combination (by the n-vector b) of the n rows (p-vectors) of the design matrix A. The concept is due to R. C. Bose (1901–1987) in 1944. In the full-rank case considered here, the rows of A span a space of full dimension p, and so all linear combinations are estimable. But in the defective-rank case with rank k < p, the estimable functions span a space of dimension k, and non-estimable linear combinations exist.

3.4 Sum-of-Squares Decompositions

We define the sum of squares for regression, SSR, by
SSR := (β̂ − β)^T C (β̂ − β).
Since this is a quadratic form with matrix C, which is positive definite, we have SSR ≥ 0, and SSR > 0 unless β̂ = β, that is, unless the least-squares estimator is exactly right (which will, of course, never happen in practice).
Theorem 3.15 (Sum-of-Squares Decomposition) SS = SSR + SSE.
(SSD)
Proof
Write
y − Aβ = (y − Aβ̂) + A(β̂ − β).
Now multiply the vector on each side by its transpose (that is, form the sum of squares of the coordinates of each vector). On the left, we obtain
SS = (y − Aβ)^T (y − Aβ),
the total sum of squares. On the right, we obtain three terms. The first squared term is
SSE = (y − Aβ̂)^T (y − Aβ̂),
the sum of squares for error. The second squared term is
(A(β̂ − β))^T A(β̂ − β) = (β̂ − β)^T A^T A (β̂ − β) = (β̂ − β)^T C (β̂ − β) = SSR,
the sum of squares for regression. The cross terms on the right are
(y − Aβ̂)^T A(β̂ − β)
and its transpose, which are the same as both are scalars. But
A^T (y − Aβ̂) = A^T y − A^T A β̂ = A^T y − Cβ̂ = 0,
by the normal equations (NE) of §3.1–3.2. Transposing, (y − Aβ̂)^T A = 0. So both cross terms vanish, giving SS = SSR + SSE, as required.
Corollary 3.16
We have that
SSE = min_β SS,
the minimum being attained at the least-squares estimator β̂ = C^{-1}A^T y.

Proof
SSR ≥ 0, and SSR = 0 iff β = β̂.
We now introduce the geometrical language of projections, to which we return in e.g. §3.5.3 and §3.6 below. The relevant mathematics comes from Linear Algebra; see the definition below. As we shall see, doing regression with p regressors amounts to an orthogonal projection on an appropriate p-dimensional subspace in n-dimensional space. The sum-of-squares decomposition involved can be visualised geometrically as an instance of Pythagoras’s Theorem, as in the familiar setting of plane or solid geometry.
Definition 3.17
Call a linear transformation P : V → V a projection onto V_1 along V_2 if V is the direct sum V = V_1 ⊕ V_2 and, writing x = x_1 + x_2 with x_i ∈ V_i, Px = x_1. Then (Blyth and Robertson (2002b), Ch. 2; Halmos (1979), §41)
V_1 = Im P = Ker(I − P), V_2 = Ker P = Im(I − P).
Recall that a square matrix M is idempotent if it is its own square: M^2 = M. Then (Halmos (1979), §41) M is idempotent iff it is a projection. For use throughout the rest of the book, with A the design matrix and C := A^T A the information matrix, we write
P := AC^{-1}A^T
('P for projection' – see below). We note that P is symmetric. Note also
Py = AC^{-1}A^T y = Aβ̂,
by the normal equations (NE).
Lemma 3.18 P and I − P are idempotent, and so are projections.
Proof
P^2 = AC^{-1}A^T · AC^{-1}A^T = AC^{-1}(A^T A)C^{-1}A^T = AC^{-1}A^T = P.
(I − P)^2 = I − 2P + P^2 = I − 2P + P = I − P.
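A quick numerical check (an illustration only, with simulated data and invented names) that P and I − P are idempotent and that Py recovers the fitted values of the regression:

set.seed(3)
n <- 20
x <- rnorm(n)
A <- cbind(1, x)
y <- 1 + 2*x + rnorm(n)
C <- t(A) %*% A
P <- A %*% solve(C) %*% t(A)
max(abs(P %*% P - P))                                        # ~ 0: P is idempotent
max(abs((diag(n) - P) %*% (diag(n) - P) - (diag(n) - P)))    # ~ 0: so is I - P
max(abs(P %*% y - fitted(lm(y ~ x))))                        # ~ 0: P y = A beta-hat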
We now rewrite the two terms SSR and SSE on the right in Theorem 3.15 in the language of projections. Note that the first expression for SSE below shows again that it is a statistic – a function of the data (not involving unknown parameters), and so can be calculated from the data.
Theorem 3.19 SSE = y T (I − P )y = (y − Aβ)T (I − P )(y − Aβ), SSR = (y − Aβ)T P (y − Aβ).
Proof
As SSE := (y − Aβ̂)^T (y − Aβ̂) and Aβ̂ = Py,
SSE = (y − Aβ̂)^T (y − Aβ̂) = (y − Py)^T (y − Py) = y^T (I − P)(I − P) y = y^T (I − P) y,
as I − P is a projection. For SSR, we have that
SSR := (β̂ − β)^T C (β̂ − β) = (β̂ − β)^T A^T A (β̂ − β).
But
β̂ − β = C^{-1}A^T y − β = C^{-1}A^T y − C^{-1}A^T Aβ = C^{-1}A^T (y − Aβ),
so
SSR = (y − Aβ)^T AC^{-1} · A^T A · C^{-1}A^T (y − Aβ)
    = (y − Aβ)^T AC^{-1}A^T (y − Aβ)   (A^T A = C)
    = (y − Aβ)^T P (y − Aβ),
as required. The second formula for SSE follows from this and (SSD) by subtraction.

Coefficient of Determination
The coefficient of determination is defined as R^2, where R is the (sample) correlation coefficient of the data and the fitted values, that is, of the pairs (y_i, ŷ_i):
R := Σ_i (y_i − ȳ)(ŷ_i − ŷ̄) / √( Σ_i (y_i − ȳ)^2 · Σ_i (ŷ_i − ŷ̄)^2 ),
where ŷ̄ denotes the mean of the fitted values ŷ_i. Thus −1 ≤ R ≤ 1, 0 ≤ R^2 ≤ 1, and R^2 is a measure of the goodness of fit of the fitted values to the data.

Theorem 3.20
R^2 = 1 − SSE / Σ_i (y_i − ȳ)^2.
For reasons of continuity, we postpone the proof to §3.4.1 below. Note that R^2 = 1 iff SSE = 0, that is, all the residuals are 0, and the fitted values are the exact values. As noted above, we will see in §3.6 that regression (estimating p parameters from n data points) amounts to a projection of the n-dimensional data space onto a p-dimensional hyperplane. So R^2 = 1 iff the data points lie in a p-dimensional hyperplane (generalising the situation of Chapter 1, where R^2 = 1 iff the data points lie on a line). In our full-rank (non-degenerate) case, this will not happen (see Chapter 4 for the theory of the relevant multivariate normal distribution), but the bigger R^2 is (or the smaller SSE is), the better the fit of our regression model to the data.
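As an illustrative check (simulated data, invented names; not from the text), both expressions for R^2 can be compared with the value reported by summary.lm:

set.seed(4)
x <- runif(40); y <- 1 + 2*x + rnorm(40, sd = 0.3)
fit <- lm(y ~ x)
SSE <- sum(resid(fit)^2)
R2.formula <- 1 - SSE / sum((y - mean(y))^2)       # Theorem 3.20
R2.corr <- cor(y, fitted(fit))^2                   # squared correlation of data and fitted values
c(R2.formula, R2.corr, summary(fit)$r.squared)     # all three agree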
Note 3.21
R^2 provides a useful summary of the proportion of the variation in a data set explained by a regression. However, as discussed in Chapters 5 and 11 of Draper and Smith (1998), high values of R^2 can be misleading. In particular, we note that the value of R^2 will tend to increase as additional terms are added to the model, irrespective of whether those terms are actually needed. An adjusted R^2 statistic which adds a penalty to complex models can be defined as
R_a^2 = 1 − (1 − R^2)(n − 1)/(n − p),
where n is the number of observations, p the number of parameters, and n − p the number of residual degrees of freedom; see Exercises 3.3, and §5.2 for a treatment of models penalised for complexity. We note a result for later use.
Proposition 3.22 (Trace Formula)
E(x^T A x) = trace(A var(x)) + (Ex)^T A (Ex).

Proof
x^T A x = Σ_{ij} a_{ij} x_i x_j, so by linearity of E,
E[x^T A x] = Σ_{ij} a_{ij} E[x_i x_j].
Now cov(x_i, x_j) = E(x_i x_j) − (Ex_i)(Ex_j), so
E[x^T A x] = Σ_{ij} a_{ij}[cov(x_i, x_j) + Ex_i · Ex_j] = Σ_{ij} a_{ij} cov(x_i, x_j) + Σ_{ij} a_{ij} Ex_i · Ex_j.
The second term on the right is (Ex)^T A (Ex). For the first, note that
trace(AB) = Σ_i (AB)_{ii} = Σ_{ij} a_{ij} b_{ji} = Σ_{ij} a_{ij} b_{ij},
if B is symmetric. But covariance matrices are symmetric, so the first term on the right is trace(A var(x)), as required.

Corollary 3.23
trace(P) = p, trace(I − P) = n − p, E(SSE) = (n − p)σ^2.
So σ̂^2 := SSE/(n − p) is an unbiased estimator of σ^2.
Proof By Theorem 3.19, SSE is a quadratic form in y − Aβ with matrix I − P = I − AC −1 AT . Now trace(I − P ) = trace(I − AC −1 AT ) = trace(I) − trace(AC −1 AT ). But trace(I) = n (as here I is the n × n identity matrix), and as trace(AB) = trace(BA) (see Exercise 3.12), trace(P ) = trace(AC −1 AT ) = trace(C −1 AT A) = trace(I) = p,
as here I is the p × p identity matrix. So trace(I − P ) = trace(I − AC −1 AT ) = n − p. Since Ey = Aβ and var(y) = σ 2 I, the Trace Formula gives E(SSE) = (n − p)σ 2 .
This last formula is analogous to the corresponding ANOVA formula E(SSE) = (n − r)σ 2 of §2.6. In §4.2 we shall bring the subjects of regression and ANOVA together.
3.4.1 Coefficient of determination

We now give the proof of Theorem 3.20, postponed in the above.

Proof
As at the beginning of Chapter 3, we may take our first regressor as 1, corresponding to the intercept term (this is not always present, but since R is translation-invariant, we may add an intercept term without changing R). The first of the normal equations then results from differentiating
Σ_i (y_i − β_1 − a_{2i}β_2 − ... − a_{pi}β_p)^2
with respect to β_1 and equating to zero, giving
Σ_i (y_i − β_1 − a_{2i}β_2 − ... − a_{pi}β_p) = 0.
At the minimising values β̂_j, this says
Σ_i (y_i − ŷ_i) = 0.   (a)
So ȳ = ŷ̄, and also
Σ_i (y_i − ŷ_i)(ŷ_i − ȳ) = Σ_i (y_i − ŷ_i)ŷ_i = (y − ŷ)^T ŷ = (y − Py)^T Py = y^T(I − P)Py = y^T(P − P^2)y,
so
Σ_i (y_i − ŷ_i)(ŷ_i − ȳ) = 0,   (b)
as P is a projection. So
Σ_i (y_i − ȳ)^2 = Σ_i [(y_i − ŷ_i) + (ŷ_i − ȳ)]^2 = Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2,   (c)
since the cross-term is 0, by (b). Also, in the definition of R,
Σ_i (y_i − ȳ)(ŷ_i − ŷ̄) = Σ_i (y_i − ȳ)(ŷ_i − ȳ)   (ȳ = ŷ̄, by (a))
= Σ_i [(y_i − ŷ_i) + (ŷ_i − ȳ)](ŷ_i − ȳ)
= Σ_i (ŷ_i − ȳ)^2   (by (b)).
So
R^2 = ( Σ_i (ŷ_i − ȳ)^2 )^2 / ( Σ_i (y_i − ȳ)^2 · Σ_i (ŷ_i − ȳ)^2 ) = Σ_i (ŷ_i − ȳ)^2 / Σ_i (y_i − ȳ)^2.
By (c),
R^2 = Σ_i (ŷ_i − ȳ)^2 / ( Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2 )
= 1 − Σ_i (y_i − ŷ_i)^2 / ( Σ_i (y_i − ŷ_i)^2 + Σ_i (ŷ_i − ȳ)^2 )
= 1 − SSE / Σ_i (y_i − ȳ)^2,
by (c) again and the definition of SSE.
3.5 Chi-Square Decomposition

Recall (Theorem 2.2) that if x = (x_1, ..., x_n)^T is N(0, I) – that is, if the x_i are iid N(0, 1) – and we change variables by an orthogonal transformation B to y := Bx, then also y ∼ N(0, I). Recall from Linear Algebra (e.g. Blyth and Robertson (2002a), Ch. 9) that λ is an eigenvalue of a matrix A with eigenvector x (≠ 0) if Ax = λx (x is normalised if x^T x = Σ_i x_i^2 = 1, as is always possible).
Recall also (see e.g. Blyth and Robertson (2002b), Corollary to Theorem 8.10) that if A is a real symmetric matrix, then A can be diagonalised by an orthogonal transformation B, to D say: B^T A B = D (see also Theorem 4.12 below, Spectral Decomposition), and that (see e.g. Blyth and Robertson (2002b), Ch. 9) if λ is an eigenvalue of A,
|D − λI| = |B^T A B − λI| = |B^T A B − λB^T B| = |B^T| |A − λI| |B| = 0,
so A and D have the same eigenvalues. Then a quadratic form in normal variables with matrix A is also a quadratic form in normal variables with matrix D, as
x^T A x = x^T B D B^T x = y^T D y,   y := B^T x.
3.5.1 Idempotence, Trace and Rank

Recall that a (square) matrix M is idempotent if M^2 = M.

Proposition 3.24
If B is idempotent, (i) its eigenvalues λ are 0 or 1, (ii) its trace is its rank.

Proof
(i) If λ is an eigenvalue of B, with eigenvector x, Bx = λx with x ≠ 0. Then
B^2 x = B(Bx) = B(λx) = λ(Bx) = λ(λx) = λ^2 x,
so λ^2 is an eigenvalue of B^2 (always true – this does not need idempotence). So λx = Bx = B^2 x = λ^2 x, and as x ≠ 0, λ = λ^2, λ(λ − 1) = 0: λ = 0 or 1.
(ii) trace(B) = sum of eigenvalues = number of non-zero eigenvalues = rank(B).
Corollary 3.25 rank(P ) = p,
rank(I − P ) = n − p.
Proof
This follows from Corollary 3.23 and Proposition 3.24. Thus n = p + (n − p) is an instance of the Rank–Nullity Theorem ('dim source = dim Ker + dim Im'; Blyth and Robertson (2002a), Theorem 6.4) applied to P and I − P.
3.5.2 Quadratic forms in normal variates

We will be interested in symmetric projection (so idempotent) matrices P. Because their eigenvalues are 0 and 1, we can diagonalise them by orthogonal transformations to a diagonal matrix of 0s and 1s. So if P has rank r, a quadratic form x^T P x can be reduced to a sum of r squares of normal variates. By relabelling variables, we can take the 1s to precede the 0s on the diagonal, giving
x^T P x = y_1^2 + ... + y_r^2,   y_i iid N(0, σ^2).
So x^T P x is σ^2 times a χ^2(r)-distributed random variable. To summarise:
Theorem 3.26 If P is a symmetric projection of rank r and the xi are independent N (0, σ 2 ), the quadratic form xT P x ∼ σ 2 χ2 (r).
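A hedged simulation check of Theorem 3.26 (an illustration only; the design matrix and parameter values below are invented):

set.seed(5)
n <- 10; sigma <- 2
A <- cbind(1, rnorm(n), rnorm(n))          # any full-rank n x 3 matrix
P <- A %*% solve(t(A) %*% A) %*% t(A)      # symmetric projection of rank r = 3
qf <- replicate(10000, {
  x <- rnorm(n, sd = sigma)
  drop(t(x) %*% P %*% x)
})
c(mean(qf), sigma^2 * 3)                   # E[chi^2(3)] = 3, so the mean is ~ sigma^2 * r
ks.test(qf / sigma^2, "pchisq", df = 3)    # large p-value: consistent with chi^2(3)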
3.5.3 Sums of Projections

As we shall see below, a sum-of-squares decomposition, which expresses a sum of squares (chi-square distributed) as a sum of independent sums of squares (also chi-square distributed), corresponds to a decomposition of the identity I as a sum of orthogonal projections. Thus Theorem 3.15 corresponds to I = P + (I − P), but in Chapter 2 we encountered decompositions with more than two summands (e.g., SS = SSB + SST + SSI has three). We turn now to the general case.
Suppose that P_1, ..., P_k are symmetric projection matrices with sum the identity:
I = P_1 + ... + P_k.
Take the trace of both sides: the n × n identity matrix I has trace n. Each P_i has trace its rank n_i, by Proposition 3.24, so n = n_1 + ... + n_k. Then squaring,
I = I^2 = Σ_i P_i^2 + Σ_{i≠j} P_i P_j.
Taking the trace,
n = Σ_i n_i + Σ_{i≠j} trace(P_i P_j).

4.4 The Multinormal Density

When Σ > 0 (so Σ^{-1} exists), X has a density. The link between the multinormal density below and the multinormal MGF above is due to the English statistician F. Y. Edgeworth (1845–1926).
Theorem 4.16 (Edgeworth's Theorem, 1893)
If μ is an n-vector and Σ > 0 a symmetric positive definite n × n matrix, then
(i) f(x) := (2π)^{-n/2} |Σ|^{-1/2} exp{ −(1/2)(x − μ)^T Σ^{-1} (x − μ) }
is an n-dimensional probability density function (of a random n-vector X, say);
(ii) X has MGF M(t) = exp{ t^T μ + (1/2) t^T Σ t };
(iii) X is multinormal N(μ, Σ).

Proof
Write Y := Σ^{-1/2} X (Σ^{-1/2} exists as Σ > 0, by above). Then Y has covariance matrix Σ^{-1/2} Σ (Σ^{-1/2})^T. Since Σ = Σ^T and Σ = Σ^{1/2} Σ^{1/2}, Y has covariance matrix I (the components Y_i of Y are uncorrelated).
Change variables as above, with y = Σ^{-1/2} x, x = Σ^{1/2} y. The Jacobian is (taking A = Σ^{-1/2})
J = ∂x/∂y = det(Σ^{1/2}) = (det Σ)^{1/2},
by the product theorem for determinants. Substituting,
exp{ −(1/2)(x − μ)^T Σ^{-1} (x − μ) }
is
exp{ −(1/2)(Σ^{1/2} y − Σ^{1/2} Σ^{-1/2} μ)^T Σ^{-1} (Σ^{1/2} y − Σ^{1/2} Σ^{-1/2} μ) },
or, writing ν := Σ^{-1/2} μ,
exp{ −(1/2)(y − ν)^T Σ^{1/2} Σ^{-1} Σ^{1/2} (y − ν) } = exp{ −(1/2)(y − ν)^T (y − ν) }.
So by the change of density formula, Y has density g(y) given by
g(y) = |Σ|^{1/2} (2π)^{-n/2} |Σ|^{-1/2} exp{ −(1/2)(y − ν)^T (y − ν) } = Π_{i=1}^n (2π)^{-1/2} exp{ −(1/2)(y_i − ν_i)^2 }.
This is the density of a multivariate vector Y ∼ N(ν, I) whose components are independent N(ν_i, 1), by Theorem 4.14.
(i) Taking A = B = R^n in the Jacobian formula,
∫_{R^n} f(x) dx = (2π)^{-n/2} |Σ|^{-1/2} ∫_{R^n} exp{ −(1/2)(x − μ)^T Σ^{-1}(x − μ) } dx
= (2π)^{-n/2} |Σ|^{-1/2} |Σ|^{1/2} ∫_{R^n} exp{ −(1/2)(y − ν)^T (y − ν) } dy
= ∫_{R^n} g(y) dy = 1.
So f(x) is a probability density (of X, say).
(ii) X = Σ^{1/2} Y is a linear transformation of Y, and Y is multivariate normal, so X is multivariate normal.
(iii) EX = Σ^{1/2} EY = Σ^{1/2} ν = Σ^{1/2} Σ^{-1/2} μ = μ, and cov X = Σ^{1/2} (cov Y)(Σ^{1/2})^T = Σ^{1/2} I Σ^{1/2} = Σ. So X is multinormal N(μ, Σ), and its MGF is
M(t) = exp{ t^T μ + (1/2) t^T Σ t }.

4.4.1 Estimation for the multivariate normal

Given a sample x_1, ..., x_n from the multivariate normal N_p(μ, Σ), Σ > 0, form the sample mean (vector)
x̄ := (1/n) Σ_{i=1}^n x_i,
as in the one-dimensional case, and the sample covariance matrix
S := (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T.
The likelihood for a sample of size 1 is
L = (2π)^{-p/2} |Σ|^{-1/2} exp{ −(1/2)(x − μ)^T Σ^{-1}(x − μ) },
so the likelihood for a sample of size n is
L = (2π)^{-np/2} |Σ|^{-n/2} exp{ −(1/2) Σ_{i=1}^n (x_i − μ)^T Σ^{-1}(x_i − μ) }.
Writing x_i − μ = (x_i − x̄) − (μ − x̄),
Σ_{i=1}^n (x_i − μ)^T Σ^{-1}(x_i − μ) = Σ_{i=1}^n (x_i − x̄)^T Σ^{-1}(x_i − x̄) + n(x̄ − μ)^T Σ^{-1}(x̄ − μ)
(the cross terms cancel as Σ_i (x_i − x̄) = 0). The summand in the first term on the right is a scalar, so is its own trace. Since trace(AB) = trace(BA) and trace(A + B) = trace(A) + trace(B) = trace(B + A),
Σ_{i=1}^n trace{ (x_i − x̄)^T Σ^{-1}(x_i − x̄) } = trace{ Σ^{-1} Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T } = trace{ Σ^{-1} · nS } = n trace{ Σ^{-1} S }.
Combining,
L = (2π)^{-np/2} |Σ|^{-n/2} exp{ −(1/2) n trace(Σ^{-1} S) − (1/2) n(x̄ − μ)^T Σ^{-1}(x̄ − μ) }.
This involves the data only through x and S. We expect the sample mean x to be informative about the population mean μ and the sample covariance matrix S to be informative about the population covariance matrix Σ. In fact x, S are fully informative about μ, Σ, in a sense that can be made precise using the theory of sufficient statistics (for which we must refer to a good book on statistical inference – see e.g. Casella and Berger (1990), Ch. 6) – another of Fisher’s contributions. These natural estimators are in fact the maximum likelihood estimators:
Theorem 4.17
For the multivariate normal N_p(μ, Σ), x̄ and S are the maximum likelihood estimators of μ, Σ.

Proof
Write V = (v_{ij}) := Σ^{-1}. By above, the likelihood is
L = const · |V|^{n/2} exp{ −(1/2) n trace(V S) − (1/2) n(x̄ − μ)^T V (x̄ − μ) },
so the log-likelihood is
ℓ = c + (1/2) n log |V| − (1/2) n trace(V S) − (1/2) n(x̄ − μ)^T V (x̄ − μ).
The MLE μ̂ for μ is x̄, as this reduces the last term (the only one involving μ) to its minimum value, 0.
Recall (see e.g. Blyth and Robertson (2002a), Ch. 8) that for a square matrix A = (a_{ij}), its determinant is
|A| = Σ_j a_{ij} A_{ij} for each i, or |A| = Σ_i a_{ij} A_{ij} for each j,
expanding by the ith row or jth column, where A_{ij} is the cofactor (signed minor) of a_{ij}. From either, ∂|A|/∂a_{ij} = A_{ij}, so
∂ log |A| / ∂a_{ij} = A_{ij}/|A| = (A^{-1})_{ij},
the (i, j) element of A^{-1}, recalling the formula for the matrix inverse. Also, if B is symmetric,
trace(AB) = Σ_i Σ_j a_{ij} b_{ji} = Σ_{i,j} a_{ij} b_{ij},
so ∂ trace(AB)/∂a_{ij} = b_{ij}. Using these, and writing S = (s_{ij}),
∂ log |V| / ∂v_{ij} = (V^{-1})_{ij} = (Σ)_{ij} = σ_{ij}   (V := Σ^{-1}),
∂ trace(V S)/∂v_{ij} = s_{ij}.
So
∂ℓ/∂v_{ij} = (1/2) n(σ_{ij} − s_{ij}),
which is 0 for all i and j iff Σ = S. This says that S is the MLE for Σ, as required.
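As an illustration (not from the text), the R sketch below simulates from a bivariate normal with invented parameter values and computes x̄ and the MLE S; note that R's cov() uses divisor n − 1, so it is rescaled to give the divisor-n estimator of the theorem.

set.seed(6)
n <- 500
mu <- c(1, -2)
Sigma <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
L <- chol(Sigma)                           # Sigma = t(L) %*% L, a matrix square root
X <- matrix(rnorm(n * 2), n, 2) %*% L + matrix(mu, n, 2, byrow = TRUE)
xbar <- colMeans(X)
S <- (n - 1) / n * cov(X)                  # rescale cov() to the maximum likelihood estimator
xbar; S                                    # close to mu and Sigma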
4.5 Conditioning and Regression

Recall from §1.5 that the conditional density of Y given X = x is
f_{Y|X}(y|x) := f_{X,Y}(x, y) / ∫ f_{X,Y}(x, y) dy.
Conditional means. The conditional mean of Y given X = x is E(Y|X = x), a function of x called the regression function (of Y on x). So, if we do not specify the value x, we get E(Y|X). This is random, because X is random (until we observe its value, x; then we get the regression function of x as above). As E(Y|X) is random, we can look at its mean and variance. For the next result, see e.g. Haigh (2002), Th. 4.24, or Williams (2001), §9.1.
Theorem 4.18 (Conditional Mean Formula) E[E(Y |X)] = EY.
Proof
EY = ∫ y f_Y(y) dy = ∫ y dy ∫ f_{X,Y}(x, y) dx
= ∫ y dy ∫ f_{Y|X}(y|x) f_X(x) dx   (definition of conditional density)
= ∫ f_X(x) dx ∫ y f_{Y|X}(y|x) dy,
interchanging the order of integration. The inner integral is E(Y|X = x). The outer integral takes the expectation of this over X, giving E[E(Y|X)]. Discrete case: similarly, with summation in place of integration.
Interpretation.
– EY takes the random variable Y, and averages out all the randomness to give a number, EY.
– E(Y|X) takes the random variable Y knowing X, and averages out all the randomness in Y NOT accounted for by knowledge of X.
– E[E(Y|X)] then averages out the remaining randomness, which IS accounted for by knowledge of X, to give EY as above.
Example 4.19 (Bivariate normal distribution)
N(μ_1, μ_2; σ_1^2, σ_2^2; ρ), or N(μ, Σ), with μ = (μ_1, μ_2)^T and
Σ = ( σ_1^2  ρσ_1σ_2 ; ρσ_1σ_2  σ_2^2 ) = ( σ_{11}^2  σ_{12} ; σ_{12}  σ_{22}^2 ).
By §1.5,
E(Y|X = x) = μ_2 + ρ(σ_2/σ_1)(x − μ_1),
so
E(Y|X) = μ_2 + ρ(σ_2/σ_1)(X − μ_1).
So
E[E(Y|X)] = μ_2 + ρ(σ_2/σ_1)(EX − μ_1) = μ_2 = EY,
as EX = μ_1.
As with the bivariate normal, we should keep some concrete instance in mind as a motivating example, e.g.: X = incoming score of a student [in medical school or university, say], Y = graduating score; X = child's height at 2 years (say), Y = child's eventual adult height; or X = mid-parental height, Y = child's adult height, as in Galton's study.
Conditional variances. Recall var X := E[(X − EX)^2]. Expanding the square,
var X = E[X^2 − 2X(EX) + (EX)^2] = E[X^2] − 2(EX)(EX) + (EX)^2 = E[X^2] − (EX)^2.
Conditional variances can be defined in the same way. Recall that E(Y|X) is constant when X is known (= x, say), so can be taken outside an expectation over X, E_X say. Then
var(Y|X) := E(Y^2|X) − [E(Y|X)]^2.
Take expectations of both sides over X:
E_X[var(Y|X)] = E_X[E(Y^2|X)] − E_X[(E(Y|X))^2].
Now E_X[E(Y^2|X)] = E(Y^2), by the Conditional Mean Formula, so the right-hand side is, adding and subtracting (EY)^2,
{E(Y^2) − (EY)^2} − {E_X[(E(Y|X))^2] − (EY)^2}.
The first term is var Y , by above. Since E(Y |X) has EX -mean EY , the second term is varX E(Y |X), the variance (over X) of the random variable E(Y |X) (random because X is). Combining, we have (Williams (2001), §9.1, or Haigh (2002) Ex 4.33):
Theorem 4.20 (Conditional Variance Formula) varY = EX var(Y |X) + varX E(Y |X). Interpretation. – varY = total variability in Y, – EX var(Y |X) = variability in Y not accounted for by knowledge of X, – varX E(Y |X) = variability in Y accounted for by knowledge of X.
Example 4.21 (The bivariate normal)
Y|X = x is N(μ_2 + ρ(σ_2/σ_1)(x − μ_1), σ_2^2(1 − ρ^2)), and var Y = σ_2^2. Here
E(Y|X = x) = μ_2 + ρ(σ_2/σ_1)(x − μ_1), E(Y|X) = μ_2 + ρ(σ_2/σ_1)(X − μ_1),
which has variance
var E(Y|X) = (ρσ_2/σ_1)^2 var X = (ρσ_2/σ_1)^2 σ_1^2 = ρ^2 σ_2^2,
while (as in Fact 6 of §1.5)
var(Y|X = x) = σ_2^2(1 − ρ^2) for all x, var(Y|X) = σ_2^2(1 − ρ^2), E_X var(Y|X) = σ_2^2(1 − ρ^2).
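A hedged simulation of this example (parameter values invented), checking that the two terms of the Conditional Variance Formula add up to var Y:

set.seed(7)
mu1 <- 0; mu2 <- 0; s1 <- 1; s2 <- 2; rho <- 0.6
n <- 1e5
X <- rnorm(n, mu1, s1)
Y <- mu2 + rho * (s2 / s1) * (X - mu1) + rnorm(n, 0, s2 * sqrt(1 - rho^2))
var(Y)                                     # ~ s2^2: total variability in Y
s2^2 * (1 - rho^2)                         # E_X var(Y|X): not accounted for by X
(rho * s2 / s1)^2 * s1^2                   # var_X E(Y|X): accounted for by X; the two add to s2^2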
Corollary 4.22 E(Y |X) has the same mean as Y and smaller variance (if anything) than Y .
Proof From the Conditional Mean Formula, E[E(Y |X)] = EY . Since var(Y |X) ≥ 0, EX var(Y |X) ≥ 0, so varE[Y |X] ≤ varY from the Conditional Variance Formula.
Note 4.23
This result has important applications in estimation theory. Suppose we are to estimate a parameter θ, and are considering a statistic X as a possible estimator (or basis for an estimator) of θ. We would naturally want X to contain all the information on θ contained within the entire sample. What (if anything) does this mean in precise terms? The answer lies in Fisher's concept of sufficiency ('data reduction'), which we met in §4.4.1. In the language of sufficiency, the Conditional Variance Formula is seen as (essentially) the Rao–Blackwell Theorem, a key result in the area.
Regression. In the bivariate normal, with X = mid-parent height, Y = child's height, E(Y|X = x) is linear in x (regression line). In a more detailed analysis, with U = father's height, V = mother's height, Y = child's height, one would expect E(Y|U = u, V = v) to be linear in u and v (regression plane), etc.
In an n-variate normal distribution N_n(μ, Σ), suppose that we partition X = (X_1, ..., X_n)^T into X_1 := (X_1, ..., X_r)^T and X_2 := (X_{r+1}, ..., X_n)^T. Let the corresponding partition of the mean vector and the covariance matrix be
μ = ( μ_1 ; μ_2 ), Σ = ( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ),
where EX_i = μ_i, Σ_{11} is the covariance matrix of X_1, Σ_{22} that of X_2, and Σ_{12} = Σ_{21}^T the covariance matrix of X_1 with X_2. For clarity, we restrict attention to the non-singular case, where Σ is positive definite.
Lemma 4.24 If Σ is positive definite, so is Σ11 .
Proof
x^T Σ x > 0 for all x ≠ 0, as Σ is positive definite. Take x = (x_1, 0)^T, where x_1 has the same number of components as the order of Σ_{11} (that is, in matrix language, so that the partition of x is conformable with those of μ and Σ above). Then x^T Σ x = x_1^T Σ_{11} x_1 > 0 for all x_1 ≠ 0. This says that Σ_{11} is positive definite, as required.

Theorem 4.25
The conditional distribution of X_2 given X_1 = x_1 is
X_2 | X_1 = x_1 ∼ N( μ_2 + Σ_{21}Σ_{11}^{-1}(x_1 − μ_1), Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12} ).

Corollary 4.26
The regression of X_2 on X_1 is linear:
E(X_2 | X_1 = x_1) = μ_2 + Σ_{21}Σ_{11}^{-1}(x_1 − μ_1).
Proof
Recall from Theorem 4.16 that AX, BX are independent iff AΣB^T = 0, or, as Σ is symmetric, BΣA^T = 0. Now X_1 = AX where A = (I  0), and
X_2 − Σ_{21}Σ_{11}^{-1}X_1 = (−Σ_{21}Σ_{11}^{-1}  I)(X_1 ; X_2) = BX, where B = (−Σ_{21}Σ_{11}^{-1}  I).
Now
BΣA^T = (−Σ_{21}Σ_{11}^{-1}  I)( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} )( I ; 0 )
= (−Σ_{21}Σ_{11}^{-1}  I)( Σ_{11} ; Σ_{21} )
= −Σ_{21}Σ_{11}^{-1}Σ_{11} + Σ_{21} = 0,
so X_1 and X_2 − Σ_{21}Σ_{11}^{-1}X_1 are independent. Since both are linear transformations of X, which is multinormal, both are multinormal. Also,
E(BX) = B EX = (−Σ_{21}Σ_{11}^{-1}  I)( μ_1 ; μ_2 ) = μ_2 − Σ_{21}Σ_{11}^{-1}μ_1.
To calculate the covariance matrix, introduce C := −Σ_{21}Σ_{11}^{-1}, so B = (C  I), and recall Σ_{12} = Σ_{21}^T, so C^T = −Σ_{11}^{-1}Σ_{12}:
var(BX) = BΣB^T = (C  I)( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} )( C^T ; I )
= (C  I)( Σ_{11}C^T + Σ_{12} ; Σ_{21}C^T + Σ_{22} )
= CΣ_{11}C^T + CΣ_{12} + Σ_{21}C^T + Σ_{22}
= Σ_{21}Σ_{11}^{-1}Σ_{11}Σ_{11}^{-1}Σ_{12} − Σ_{21}Σ_{11}^{-1}Σ_{12} − Σ_{21}Σ_{11}^{-1}Σ_{12} + Σ_{22}
= Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12}.
By independence, the conditional distribution of BX given X_1 = AX is the same as its marginal distribution, which by above is N(μ_2 − Σ_{21}Σ_{11}^{-1}μ_1, Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12}). So given X_1, X_2 − Σ_{21}Σ_{11}^{-1}X_1 is N(μ_2 − Σ_{21}Σ_{11}^{-1}μ_1, Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12}).
It remains to pass from the conditional distribution of X_2 − Σ_{21}Σ_{11}^{-1}X_1 given X_1 to that of X_2 given X_1. But given X_1, Σ_{21}Σ_{11}^{-1}X_1 is constant, so we can do this simply by adding Σ_{21}Σ_{11}^{-1}X_1. The result is again multinormal, with the same covariance matrix, but (conditional) mean μ_2 + Σ_{21}Σ_{11}^{-1}(X_1 − μ_1). That is, the conditional distribution of X_2 given X_1 is
N( μ_2 + Σ_{21}Σ_{11}^{-1}(X_1 − μ_1), Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12} ),
as required.
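As a sketch (not the book's code), the conditional mean and covariance of Theorem 4.25 can be computed directly from a partitioned Σ; the function name cond.mvn and the numerical values are invented for illustration, and the bivariate case is checked against Example 4.28 below.

cond.mvn <- function(mu, Sigma, idx1, x1) {
  # idx1: indices of the conditioning block X1; x1: its observed value
  idx2 <- setdiff(seq_along(mu), idx1)
  S11 <- Sigma[idx1, idx1, drop = FALSE]; S12 <- Sigma[idx1, idx2, drop = FALSE]
  S21 <- Sigma[idx2, idx1, drop = FALSE]; S22 <- Sigma[idx2, idx2, drop = FALSE]
  list(mean = mu[idx2] + S21 %*% solve(S11, x1 - mu[idx1]),
       var  = S22 - S21 %*% solve(S11, S12))     # the Schur complement
}
s1 <- 1; s2 <- 2; rho <- 0.5
cond.mvn(mu = c(0, 0),
         Sigma = matrix(c(s1^2, rho*s1*s2, rho*s1*s2, s2^2), 2, 2),
         idx1 = 1, x1 = 1.5)
# mean rho*(s2/s1)*1.5 and variance s2^2*(1 - rho^2), as in the bivariate example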
Note 4.27 −1 Here Σ22 − Σ21 Σ11 Σ12 is called the partial covariance matrix of X2 given X1 . In the language of Linear Algebra, it is called the Schur complement of Σ22 in Σ (Issai Schur (1875–1941) in 1905; see Zhang (2005)). We will meet the Schur complement again in §9.1 (see also Exercise 4.10).
Example 4.28 (Bivariate normal)
Here n = 2, r = s = 1:
Σ = ( σ_1^2  ρσ_1σ_2 ; ρσ_1σ_2  σ_2^2 ) = ( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ),
Σ_{21}Σ_{11}^{-1}(X_1 − μ_1) = (ρσ_1σ_2/σ_1^2)(X_1 − μ_1) = ρ(σ_2/σ_1)(X_1 − μ_1),
Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12} = σ_2^2 − ρσ_1σ_2 · σ_1^{-2} · ρσ_1σ_2 = σ_2^2(1 − ρ^2),
as before.
Note 4.29 The argument can be extended to cover the singular case as well as the nonsingular case, using generalised inverses of the relevant matrices. For details, see e.g. Rao (1973), §8a.2v, 522–523.
Note 4.30 The details of the matrix algebra are less important than the result: conditional distributions of multinormals are multinormal. To find out which multinormal,
we then only need to get the first and second moments – mean vector and covariance matrix – right.
Note 4.31 The result can actually be generalised well beyond the multivariate normal case. Recall (bivariate normal, Fact 8) that the bivariate normal has elliptical contours. The same is true in the multivariate normal case, by Edgeworth’s Theorem – the contours are Q(x) := (x − μ)T Σ −1 (x − μ) = constant. It turns out that this is the crucial property. Elliptically contoured distributions are much more general than the multivariate normal but share most of its nice properties, including having linear regression.
4.6 Mean-square prediction

Chapters 3 and 4 deal with linear prediction, but some aspects are more general. Suppose that y is to be predicted from a vector x, by some predictor f(x). One obvious candidate is the regression function M(x) := E[y|x] ('M for mean'). Then
E[(y − M(x))(M(x) − f(x))] = E[ E[(y − M(x))(M(x) − f(x)) | x] ],
by the Conditional Mean Formula. But given x, M(x) − f(x) is known, so can be taken through the inner expectation sign (like a constant). So the right-hand side is
E[ (M(x) − f(x)) E[(y − M(x)) | x] ].
But the inner expectation is 0, as M = E(y|x). So
E[(y − f)^2] = E[((y − M) + (M − f))^2] = E[(y − M)^2] + 2E[(y − M)(M − f)] + E[(M − f)^2] = E[(y − M)^2] + E[(M − f)^2],
by above. Interpreting the left as the mean-squared error – in brief, prediction error – when predicting y by f(x), this says:
(i) E[(y − M)^2] ≤ E[(y − f)^2]: M has prediction error at most that of f.
(ii) The regression function M(x) = E[y|x] minimises the prediction error over all predictors f.
Now
cov(y, f) = E[(f − Ef)(y − Ey)]   (definition of covariance)
= E[(f − Ef) E[(y − Ey)|x]]   (Conditional Mean Formula)
= E[(f − Ef)(M − EM)]   (definition of M)
= cov(M, f).
So
corr^2(f, y) = cov^2(f, y)/(var f · var y) = [cov^2(f, y)/(var f · var M)] · (var M / var y) = corr^2(M, f) · (var M / var y).
When the predictor f is M, one has by above cov(y, M) = cov(M, M) = var M. So
corr^2(y, M) = cov^2(y, M)/(var M · var y) = var M / var y.
Combining,
corr^2(f, y) = corr^2(f, M) · corr^2(M, y).
Since correlation coefficients lie in [−1, 1], and so their squares lie in [0, 1], this gives
corr^2(f, y) ≤ corr^2(M, y),
with equality iff f = M. This gives
Theorem 4.32 The regression function M (x) := E(y|x) has the maximum squared correlation with y over all predictors f (x) of y.
Note 4.33 1. One often uses the alternative notation ρ(·, ·) for the correlation corr(·, ·). One then interprets ρ2 = ρ2 (M, y) as a measure of how well the regression M explains the data y.
2. The simplest example of this is the bivariate normal distribution of §1.5. 3. This interpretation of ρ2 reinforces that it is the population counterpart of R2 and its analogous interpretation in Chapter 3. 4. Since corr2 (y, M )≤1, one sees again that var M ≤var y, as in the Conditional Variance Formula and the Rao–Blackwell Theorem, Theorem 4.20, Corollary 4.22 and Note 4.23. 5. This interpretation of regression as maximal correlation is another way of looking at regression in terms of projection, as in §3.6. For another treatment see Williams (2001), Ch. 8.
4.7 Generalised least squares and weighted regression

Suppose that we write down the model equation
y = Xβ + ε,   (GLS)
where it is assumed that ε ∼ N(0, σ^2 V), with V ≠ I in general. We take V of full rank; then V^{-1} exists, X^T V^{-1} X is of full rank, and (X^T V^{-1} X)^{-1} exists. (GLS) is the model equation for generalised least squares. If V is diagonal, (GLS) is known as weighted least squares. By Corollary 4.13 (Matrix square roots) we can find P non-singular and symmetric such that P^T P = P^2 = V.

Theorem 4.34 (Generalised Least Squares)
Under generalised least squares (GLS), the maximum likelihood estimate β̂ of β is
β̂ = (X^T V^{-1} X)^{-1} X^T V^{-1} y.
This is also the best linear unbiased estimator (BLUE).

Proof
Pre-multiply by P^{-1} to reduce the equation for generalised least squares to the equation for ordinary least squares:
P^{-1} y = P^{-1} Xβ + P^{-1} ε.   (OLS)
Now by Proposition 4.4(ii),
cov(P^{-1} ε) = P^{-1} cov(ε)(P^{-1})^T = P^{-1} σ^2 V P^{-1} = σ^2 P^{-1} P P P^{-1} = σ^2 I.
So (OLS) is now a regression problem for β within the framework of ordinary least squares. From Theorem 3.5 the maximum likelihood estimate of β can now be obtained from the normal equations as
β̂ = ((P^{-1}X)^T P^{-1}X)^{-1} (P^{-1}X)^T P^{-1} y = (X^T P^{-2} X)^{-1} X^T P^{-2} y = (X^T V^{-1} X)^{-1} X^T V^{-1} y,
since X^T V^{-1} X is non-singular. By Theorem 3.13 (Gauss–Markov Theorem), this is also the BLUE.
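A hedged illustration of the weighted (diagonal V) case, with simulated data and an assumed known variance pattern; lm's weights argument, with weights 1/v_i, fits the same estimator.

set.seed(9)
n <- 100
x <- runif(n)
v <- exp(2 * x)                            # assumed known variance factors: V = diag(v)
y <- 1 + 2 * x + rnorm(n, sd = sqrt(v))
X <- cbind(1, x)
Vinv <- diag(1 / v)
beta.gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)   # (X'V^{-1}X)^{-1} X'V^{-1} y
beta.gls
coef(lm(y ~ x, weights = 1 / v))           # weighted least squares gives the same estimate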
Note 4.35 By §3.3 the ordinary least squares estimator βˆ = (X T X)−1 X T y is unbiased but by above is no longer the Best Linear Unbiased Estimator (BLUE).
Note 4.36 Theorem 4.34 is the key to a more general setting of mixed models (§9.1), where the BLUE is replaced by the best linear unbiased predictor (BLUP).
Note 4.37 In practice, if we do not assume that V = I then the form that V should take instead is often unclear even if V is assumed diagonal as in weighted least squares. A pragmatic solution is first to perform the analysis of the data assuming V = I and then to use the residuals of this model to provide an estimate Vˆ of V for use in a second stage analysis if this is deemed necessary. There appear to be no hard and fast ways of estimating V , and doing so in practice clearly depends on the precise experimental context. As an illustration, Draper and Smith (1998), Ch. 9, give an example of weighted regression assuming a quadratic relationship between a predictor and the squared residuals. See also Carroll and Ruppert (1988).
EXERCISES

4.1. Polynomial regression. The data in Table 4.2 give the percentage of divorces caused by adultery per year of marriage. Investigate whether the rate of divorces caused by adultery is constant, and further whether or not a quadratic model in time is justified. Interpret your findings.

Year  1     2     3     4     5     6     7
%     3.51  9.50  8.91  9.35  8.18  6.43  5.31
Year  8     9     10    15    20    25    30
%     5.07  3.65  3.80  2.83  1.51  1.27  0.49

Table 4.2 Data for Exercise 4.1.

4.2. Corner-point constraints and one-way ANOVA. Formulate the regression model with k treatment groups as
A = [ 1_{n_1}  0_{n_1}  0_{n_1}  ...  0_{n_1}
      1_{n_2}  1_{n_2}  0_{n_2}  ...  0_{n_2}
      1_{n_3}  0_{n_3}  1_{n_3}  ...  0_{n_3}
      ...      ...      ...      ...  ...
      1_{n_k}  0_{n_k}  0_{n_k}  ...  1_{n_k} ],
A^T A = [ n_1 + n_2 + ... + n_k  n_2  ...  n_k
          n_2                    n_2  ...  0
          ...                    ...  ...  ...
          n_k                    0    ...  n_k ],
where n_j denotes the number of observations in treatment group j, 1_{n_j} is an associated n_j column vector of 1s, and y_j denotes a column vector of observations corresponding to treatment group j.
(i) Show that
A^T y = ( n_1 ȳ_1 + n_2 ȳ_2 + ... + n_k ȳ_k ; n_2 ȳ_2 ; ... ; n_k ȳ_k ).
(ii) In the case of two treatment groups calculate β̂ and calculate the fitted values for an observation in each treatment group.
(iii) Show that
M = (A^T A)^{-1} = [ 1/n_1    −1/n_1           ...  −1/n_1
                     −1/n_1   1/n_2 + 1/n_1    ...  1/n_1
                     ...      ...              ...  ...
                     −1/n_1   1/n_1            ...  1/n_k + 1/n_1 ].
Calculate β̂, give the fitted values for an observation in treatment group j and interpret the results.
4.3. Fit the model in Example 2.9 using a regression approach.
4.4. Fit the model in Example 2.11 using a regression approach.
4.5. Define Y_0 ∼ N(0, σ_0^2), Y_i = Y_{i−1} + ε_i, where the ε_i are iid N(0, σ^2). What is the joint distribution of (i) Y_1, Y_2, Y_3; (ii) Y_1, ..., Y_n?
4.6. Let Y ∼ N_3(μ, Σ) with
Σ = ( 1  a  0 ; a  1  b ; 0  b  1 ).
Under what conditions are Y_1 + Y_2 + Y_3 and Y_1 − Y_2 − Y_3 independent?
4.7. Mean-square prediction. Let Y ∼ U(−a, b), a, b > 0, X = Y^2.
(i) Calculate E(Y^n).
(ii) Find the best mean-square predictors of X given Y and of Y given X.
(iii) Find the best linear predictors of X given Y and of Y given X.
4.8. If the mean μ_0 in the multivariate normal distribution is known, show that the MLE of Σ is
Σ̂ = (1/n) Σ_{i=1}^n (x_i − μ_0)(x_i − μ_0)^T = S + (x̄ − μ_0)(x̄ − μ_0)^T.
[Hint: Define the precision matrix Λ = Σ^{-1} and use the differential rule ∂/∂A ln|A| = (A^{-1})^T.]
4.9. Background results for Exercise 4.11.
(i) Let X ∼ N(μ, Σ). Show that f_X(x) ∝ exp{ x^T A x + x^T b }, where A = −(1/2)Σ^{-1} and b = Σ^{-1}μ.
(ii) Let X and Y be two continuous random variables. Show that the conditional density f_{X|Y}(x|y) can be expressed as K f_{X,Y}(x, y) where K is constant with respect to x.
4.10. Inverse of a partitioned matrix. Show that the following formula holds for the inverse of a partitioned matrix:
( A  B ; C  D )^{-1} = ( M  −MBD^{-1} ; −D^{-1}CM  D^{-1} + D^{-1}CMBD^{-1} ),
where M = (A − BD^{-1}C)^{-1}. See e.g. Healy (1956), §3.4.
4.11. Alternative derivation of conditional distributions in the multivariate normal family. Let X ∼ N(μ, Σ) and introduce the partition
x = ( x_A ; x_B ), μ = ( μ_A ; μ_B ), Σ = ( Σ_{AA}  Σ_{AB} ; Σ_{BA}  Σ_{BB} ).
Using Exercise 4.9 show that the conditional distribution of x_A | x_B is multivariate normal with
μ_{A|B} = μ_A + Σ_{AB}Σ_{BB}^{-1}(x_B − μ_B),
Σ_{A|B} = Σ_{AA} − Σ_{AB}Σ_{BB}^{-1}Σ_{BA}.
5
Adding additional covariates and the Analysis of Covariance
5.1 Introducing further explanatory variables

Suppose that having fitted the regression model
y = Xβ + ε,   (M_0)
we wish to introduce q additional explanatory variables into our model. The augmented regression model, M_A say, becomes
y = Xβ + Zγ + ε.   (M_A)
We rewrite this as
y = Xβ + Zγ + ε = (X, Z)(β, γ)^T + ε = Wδ + ε,
say, where
W := (X, Z), δ := ( β ; γ ).
Here X is n × p and assumed to be of rank p, Z is n × q of rank q, and the columns of Z are linearly independent of the columns of X. This final assumption means that there is a sense in which the q additional explanatory variables are adding
genuinely new information to that already contained in the pre-existing X matrix. The least squares estimator δˆ can be calculated directly, by solving the normal equations as discussed in Chapter 3, to give δˆ = (W T W )−1 W T y. However, in terms of practical implementation, the amount of computation can be significantly reduced by using the estimate βˆ obtained when fitting the model (M0 ). We illustrate this method with an application to Analysis of Covariance, or ANCOVA for short. The results are also of interest as they motivate formal F -tests for comparison of nested models in Chapter 6.
Note 5.1 ANCOVA is an important subject in its own right and is presented here to illustrate further the elegance and generality of the general linear model as presented in Chapters 3 and 4. It allows one to combine, in a natural way, quantitative variables with qualitative variables as used in Analysis of Variance in Chapter 2. The subject was introduced by Fisher in 1932 (in §49.1 of the fourth and later editions of his book, Fisher (1958)). We proceed with the following lemma (where P is the projection or hat matrix, P = X(X T X)−1 X T or P = A(AT A)−1 AT = AC −1 AT in our previous notation).
Lemma 5.2 If R = I − P = I − X(X T X)−1 X T , then Z T RZ is positive definite.
Proof
Suppose x^T Z^T R Z x = 0 for some vector x. We have
x^T Z^T R Z x = x^T Z^T R^T R Z x = (RZx)^T (RZx) = 0,
since R is symmetric and idempotent (Lemma 3.18). It follows that RZx = 0, which we write as Zx = PZx = Xy, say, for some vector y. This implies x = 0 since, by assumption, the columns of Z are linearly independent of the columns of X and Z has full rank q. Since the quadratic form vanishes only at x = 0, it follows that Z^T R Z is positive definite.
Theorem 5.3
Let R_A = I − W(W^T W)^{-1} W^T, L = (X^T X)^{-1} X^T Z and
δ̂ = ( β̂_A ; γ̂_A ).
Then
(i) γ̂_A = (Z^T R Z)^{-1} Z^T R y,
(ii) β̂_A = (X^T X)^{-1} X^T (y − Zγ̂_A) = β̂ − Lγ̂_A,
(iii) the sum of squares for error of the augmented model is given by
y^T R_A y = (y − Zγ̂_A)^T R (y − Zγ̂_A) = y^T R y − γ̂_A^T Z^T R y.

Proof
(i) We write the systematic component in the model equation (M_A) as
Xβ + Zγ = Xβ + PZγ + (I − P)Zγ
= X(β + (X^T X)^{-1} X^T Zγ) + RZγ
= (X  RZ)( α ; γ )
= Vλ,
say, where α = β + (X^T X)^{-1} X^T Zγ. Suppose Vλ = 0 for some λ. This gives Xβ + Zγ = 0, whence β = γ = 0 by linear independence of the columns of X and Z, and so λ = 0. Hence V has full rank p + q, since its null space is of dimension 0. From the definition R = I − X(X^T X)^{-1} X^T, one has X^T R = RX = 0. From Theorem 3.5, the normal equations can be solved to give
λ̂ = (V^T V)^{-1} V^T y = ( X^T X  X^T RZ ; Z^T RX  Z^T RZ )^{-1} ( X^T ; Z^T R ) y.
As X^T R = RX = 0, this product is
λ̂ = ( X^T X  0 ; 0  Z^T RZ )^{-1} ( X^T ; Z^T R ) y = ( (X^T X)^{-1} X^T y ; (Z^T RZ)^{-1} Z^T R y ).
We can read off from the bottom row of this matrix
γ̂_A = (Z^T RZ)^{-1} Z^T R y.
(ii) From the top row of the same matrix,
α̂ = (X^T X)^{-1} X^T y = β̂,
since β̂ = (X^T X)^{-1} X^T y. Since we defined α = β + (X^T X)^{-1} X^T Zγ, it follows that our parameter estimates for the augmented model must satisfy
α̂ = β̂_A + (X^T X)^{-1} X^T Z γ̂_A = β̂,
and the result follows.
(iii) We have that
R_A y = y − Xβ̂_A − Zγ̂_A
= y − X(X^T X)^{-1} X^T (y − Zγ̂_A) − Zγ̂_A   (by (ii) and (NE))
= (I − X(X^T X)^{-1} X^T)(y − Zγ̂_A)
= R(y − Zγ̂_A)
= Ry − RZ(Z^T RZ)^{-1} Z^T R y   (by (i)).
So, by the above,
y^T R_A y = y^T R y − y^T RZ(Z^T RZ)^{-1} Z^T R y = y^T R y − γ̂_A^T Z^T R y.
Since the matrices R_A and R are symmetric and idempotent (Lemma 3.18), the result can also be written as
y^T R_A y = y^T R_A^T R_A y = (y − Zγ̂_A)^T R^T R (y − Zγ̂_A) = (y − Zγ̂_A)^T R (y − Zγ̂_A).

Sum of squares decomposition. We may rewrite (iii) as
SSE = SSE_A + γ̂_A^T Z^T R y.
That is, the sum of squares attributable to the new explanatory variables Z is γ̂_A^T Z^T R y. Linear hypothesis tests and an Analysis of Variance formulation based on a decomposition of sums of squares are discussed at length in Chapter 6. The result above gives a practical way of performing these tests for models which
are constructed in a sequential manner. In particular, the result proves useful when fitting Analysis of Covariance models (§5.2–5.3).
One Extra Variable. The case with only one additional explanatory variable is worth special mention. In this case the matrix Z is simply a column vector, x_{(p)} say. We have that Z^T R Z = x_{(p)}^T R x_{(p)} is a scalar, and the above formulae simplify to give
γ̂_A = x_{(p)}^T R y / x_{(p)}^T R x_{(p)},
β̂_A = β̂ − (X^T X)^{-1} X^T x_{(p)} γ̂_A,
y^T R_A y = y^T R y − γ̂_A x_{(p)}^T R y.
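As an illustrative check (simulated data, invented names; not from the text), the one-extra-variable formulae can be compared with a direct fit of the augmented model:

set.seed(10)
n <- 60
x1 <- rnorm(n); x2 <- rnorm(n); z <- rnorm(n)       # z plays the role of the extra column x_(p)
y <- 1 + x1 - 0.5 * x2 + 0.3 * z + rnorm(n)
X <- cbind(1, x1, x2)
R <- diag(n) - X %*% solve(t(X) %*% X) %*% t(X)     # R = I - P for the smaller model
gammaA <- drop(t(z) %*% R %*% y) / drop(t(z) %*% R %*% z)
betaA <- solve(t(X) %*% X, t(X) %*% y) - solve(t(X) %*% X, t(X) %*% z) * gammaA
c(betaA, gammaA)
coef(lm(y ~ x1 + x2 + z))                           # the augmented fit gives the same estimates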
5.1.1 Orthogonal parameters

From Theorem 5.3(ii), the difference in our estimates of β in our two models, (M_0) and (M_A), is Lγ̂_A, where
L := (X^T X)^{-1} X^T Z.
Now L = 0 iff
X^T Z = 0   (orth)
(recall X is n × p, Z is n × q, so X^T Z is p × q, the matrix product being conformable). This is an orthogonality relation, not between vectors as usual but between matrices. When it holds, our estimates β̂ and β̂_A of β in the original and augmented models (M_0) and (M_A) are the same. That is, if we are considering extending our model from (M_0) to (M_A), that is in extending our parameter from β to δ, we do not have to waste the work already done in estimating β, only to estimate the new parameter γ. This is useful and important conceptually and theoretically. It is also important computationally and in calculations done by hand, as was the case before the development of statistical packages for use on computers. As our interest is in the parameters (β, γ, δ) rather than the design matrices (X, Z, W), we view the orthogonality relation in terms of them, as follows:

Definition 5.4
In the above notation, the parameters β, γ are orthogonal (or β, γ are orthogonal parameters) if
X^T Z = 0.   (orth)
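A hedged sketch (simulated data, invented names) of this definition: when X^T Z = 0, the estimates of β are unchanged by adding Z.

set.seed(11)
n <- 40
x <- rnorm(n)
X <- cbind(1, x)
Z <- resid(lm(rnorm(n) ~ X - 1))           # residuals are orthogonal to the columns of X
max(abs(t(X) %*% Z))                       # ~ 0: (orth) holds
y <- 1 + 2 * x + rnorm(n)
coef(lm(y ~ x))                            # model (M0)
coef(lm(y ~ x + Z))                        # augmented model (MA): same intercept and slope for x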
Note 5.5
1. We have met such orthogonality before, in the context of polynomial regression (§4.1) and orthogonal polynomials (§4.1.2).
2. Even with computer packages, orthogonality is still an advantage from the point of view of numerical stability, as well as computational efficiency (this is why the default option in S-Plus uses orthogonal polynomials – see §4.1.3). Numerical stability is very important in regression, to combat one of the standing dangers – multicollinearity (see §7.4).
3. Orthogonal polynomials are useful in Statistics beyond regression. In statistical models with several parameters, it often happens that we are interested in some but not all of the parameters needed to specify the model. In this case, the (vector) parameter we are interested in – β, say – is (naturally) called the parameter of interest, or interest parameter, while the complementary parameter we are not interested in – γ, say – is called the nuisance parameter. The simplest classical case is the normal model N(μ, σ^2). If we are interested in the mean μ only, and not the variance σ^2, then σ is a nuisance parameter. The point of the Student t-statistic
t := √(n − 1)(X̄ − μ)/S ∼ t(n − 1),
familiar from one's first course in Statistics, is that it cancels out σ:
√n(X̄ − μ)/σ ∼ N(0, 1), nS^2/σ^2 ∼ χ^2(n − 1), X̄ and S independent.
The tasks of estimating μ with σ known and with σ unknown are fundamentally different (and this is reflected in the difference between the normal and the t distributions).
Again, it may happen that with two parameters, θ_1 and θ_2 say, we have two statistics S_1 and S_2, such that while S_2 is uninformative about θ_1 on its own, (S_1, S_2) is more informative about θ_1 than S_1 alone is. One then says that the statistic S_2 is ancillary for inference about θ_1. Ancillarity (the concept is again due to Fisher) is best studied in conjunction with sufficiency, which we met briefly in §4.4.1 and §4.5. With such issues in mind, one may seek to find the simplest, or most tractable, way to formulate the problem. It can be very helpful to reparametrise, so as to work with orthogonal parameters. The relevant theory here is due to Cox and Reid (1987) (D. R. Cox (1924–) and Nancy Reid (1952–)). Loosely speaking, orthogonal parameters allow one to separate a statistical model into its component parts.
5.2 ANCOVA

Recall that in regression (Chapters 1, 3, and 4) we have continuous (quantitative) variables, whilst in ANOVA (Chapter 2) we have categorical (qualitative) variables. For questions involving both qualitative and quantitative variables, we need to combine the methods of regression and ANOVA. This hybrid approach is Analysis of Covariance (ANCOVA).
Example 5.6 Suppose we want to compare two treatments A, B for reducing high blood pressure. Now blood pressure y is known to increase with age x (as the arteries deteriorate, by becoming less flexible, or partially blocked with fatty deposits, etc.). So we need to include age as a quantitative variable, called a covariate or concomitant variable, while we look at the treatments (qualitative variable), the variable of interest. Suppose first that we inspect the data (EDA). See Figure 5.1, where x is age in years, y is blood pressure (in suitable units), the circles are those with treatment A and the triangles are those with treatment B. This suggests the model
y_i = β_{0A} + β_1 x_i + ε_i for Treatment A; y_i = β_{0B} + β_1 x_i + ε_i for Treatment B.
This is the full model (of parallel-lines type in this example): there is a common slope, that is, increase in age has the same effect for each treatment. Here the parameter of interest is the treatment effect, or treatment difference, β_{0A} − β_{0B}, and the hypothesis of interest is that this is zero: H_0: β_{0A} = β_{0B}.
Now see what happens if we ignore age as a covariate. In effect, this projects the plot above onto the y-axis. See Figure 5.2. The effect is much less clear! Rewrite the model as (μ_i := Ey_i; Eε_i = 0 as usual)
μ_i = β_0 + β_1 x_i for Treatment A; μ_i = β_0 + β_1 x_i + β_2 for Treatment B,
and test H_0: β_2 = 0. The full model is: β_2 unrestricted. The reduced model is: β_2 = 0. Thus we are testing a linear hypothesis β_2 = 0 here.
Figure 5.1 EDA plot suggests model with two different intercepts
We can put the quantitative variable x and the qualitative variable treatment on the same footing by introducing an indicator (or Boolean) variable,
z_i := 0 if the ith patient has Treatment A, 1 if the ith patient has Treatment B.
Then
– Full model: μ_i = β_0 + β_1 x_i + β_2 z_i,
– Reduced model: μ_i = β_0 + β_1 x_i,
– Hypothesis: H_0: β_2 = 0.
As with regression and ANOVA, we might expect to test hypotheses using an F-test ('variance-ratio test'), with large values of the F-statistic significant against the null hypothesis. This happens with ANCOVA also; we come to the distribution theory later.

Figure 5.2 Ignoring the covariate blurs the ease of interpretation

Interactions. The effect above is additive – one treatment simply shifts the regression line vertically relative to the other – see Figure 5.1. But things may be more complicated. For one of the treatments, say, there may be a decreasing treatment effect – the treatment effect may decrease with age, giving rise to non-parallel lines. The two lines may converge with age (when the treatment that seems better for younger patients begins to lose its advantage), may cross (when one treatment is better for younger patients, the other for older patients), or diverge with age (when the better treatment for younger patients looks better still for older ones). See Figure 5.3. The full model now has four parameters (two general lines, so two slopes and two intercepts):
μ_i = β_0 + β_1 x_i + β_2 z_i + β_3 z_i x_i
(general lines),
the interaction term in β3 giving rise to separate slopes. The first thing to do is to test whether we need two separate slopes, by testing H0 : β3 = 0. If we do not, we simplify the model accordingly, back to μi = β0 + β1 xi + β2 zi
(parallel lines).
Figure 5.3 Top panel: Interaction term leads to convergence and then crossover for increasing x. Bottom panel: Interaction term leads to divergence of treatment effects.
We can then test for treatment effect, by testing H0 : β2 = 0. If the treatment (β2 z) term is not significant, we can reduce again, to μi = β0 + β1 xi
(common line).
We could, for completeness, then test for an age effect, by testing H 0 : β1 = 0 (though usually we would not do this – we know blood pressure does increase with age). The final, minimal model is μi = β0 . These four models – with one, two, three and four parameters – are nested models. Each is successively a sub-model of the ‘one above’, with one more parameter. Equally, we have nested hypotheses β3 = 0, β2 (= β3 ) = 0, β1 (= β2 = β3 ) = 0.
Note 5.7 In the medical context above, we are interested in treatments (which is the better?). But we are only able to test for a treatment effect if there is no interaction. Otherwise, it is not a question of the better treatment, but of which treatment is better for whom.
5.2.1 Nested Models

Update. Using a full model, we may wish to simplify it by deleting non-significant terms. Some computer packages allow one to do this by using a special command. In S-Plus/R the relevant command is update. F-tests for nested models may simply be performed as follows: m1.lm

Models can be compared using this method by plotting Cp against p. Suitable candidate models should lie close to the line Cp = p. Note, however, that by definition Cp = p for the full model.
Non-additive or non-Gaussian errors. These may be handled using Generalised Linear Models (see Chapter 8). Generalised Linear Models can be fitted in S-Plus and R using the command glm. For background and details, see McCullagh and Nelder (1989).
Correlated Errors. These are always very dangerous in Statistics! Independent errors tend to cancel. This is the substance of the Law of Large Numbers (LLN), that says x̄ → Ex (n → ∞) – sample means tend to population means as sample size increases. Similarly for sample variances and other sample quantities. This is basically why Statistics works. One does not even need to have independent errors: weakly dependent errors (which may be defined precisely, in a variety of ways) exhibit similar cancellation behaviour. By contrast, strongly dependent errors need not cancel. Here, increasing the sample size merely replicates existing readings, and if these are way off this does not help us (as in Note 1.3). Correlated errors may have some special structure – e.g., in time or in space. Accordingly, one would then have to use special methods to reflect this – Time Series or Spatial Statistics; see Chapter 9. Correlated errors may be detected using the Durbin–Watson test or, more crudely, using a runs test (see Draper and Smith (1998), Ch. 7).
7.2 Transformation of Data

If the residual plot 'funnels out' one may try a transformation of data, such as y → log y or y → √y (see Figure 7.2). If on the other hand the residual plot 'funnels in' one may instead try y → y^2, etc. (see Figure 7.3). Is there a general procedure? One such approach was provided in a famous paper, Box and Cox (1964). Box and Cox proposed a one-parameter family of
Figure 7.2 Plot showing ‘funnelling out’ of residuals
power transformations that included a logarithmic transformation as a special case. With λ as parameter, this is
y → (y^λ − 1)/λ if λ ≠ 0, y → log y if λ = 0.
Note that this is an indeterminate form at λ = 0, but since
(y^λ − 1)/λ = (e^{λ log y} − 1)/λ and d/dλ (e^{λ log y} − 1) = log y · e^{λ log y}, which equals log y at λ = 0,
L'Hospital's Rule gives
(y^λ − 1)/λ → log y   (λ → 0).
So we may define (y^λ − 1)/λ as log y for λ = 0, to include λ = 0 with λ ≠ 0 above. One may – indeed, should – proceed adaptively by allowing the data to suggest which value of λ might be suitable. This is done in S-Plus by the command boxcox.
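A minimal sketch (not the book's code) of the Box–Cox family as an R function, treating λ = 0 as the logarithmic limit; the function name box.cox and the example values are invented for illustration.

box.cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda   # log is the lambda = 0 limit
}
y <- c(0.5, 1, 2, 5, 10)
box.cox(y, 1/3)                    # a cube-root-type transform
box.cox(y, 0)                      # log y
box.cox(y, 1e-8)                   # close to log y, as the limit suggests
# In practice one lets the data choose lambda, e.g. via MASS::boxcox as in Example 7.3 below.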
Figure 7.3 Plot showing ‘funnelling in’ of residuals
Example 7.3 (Timber Example)
The value of timber yielded by a tree is the response variable. This is measured only when the tree is cut down and sawn up. To help the forestry worker decide which trees to fell, the predictor variables used are girth ('circumference' – though the tree trunks are not perfect circles) and height. These can be easily measured without interfering with the tree – girth by use of a tape measure (at some fixed height above the ground), height by use of a surveying instrument and trigonometry. Venables and Ripley (2002) contains a data library MASS, which includes a data set timber:
attach(timber)
names(timber)
[1] "volume" "girth" "height"
boxcox(volume ~ girth + height)
Dimensional Analysis. The data-driven choice of Box–Cox parameter λ seems to be close to 1/3. This is predictable on dimensional grounds: volume is in cubic metres, girth and height in metres (or centimetres). It thus always pays to be aware of units. There is a whole subject of Dimensional Analysis devoted to such things (see e.g. Focken (1953)). A background in Physics is valuable here.
7.3 Variance-Stabilising Transformations

In the exploratory data analysis (EDA), the scatter plot may suggest that the variance is not constant throughout the range of values of the predictor variable(s). But the theory of the Linear Model assumes constant variance. Where this standing assumption seems to be violated, we may seek a systematic way to stabilise the variance – to make it constant (or roughly so), as the theory requires. If the response variable is y, we do this by seeking a suitable function g (sufficiently smooth – say, twice continuously differentiable), and then transforming our data by y → g(y).
Suppose y has mean μ: Ey = μ. Taylor expand g(y) about y = μ:
g(y) = g(μ) + (y − μ)g′(μ) + (1/2)(y − μ)^2 g″(μ) + ...
Suppose the bulk of the response values y are fairly closely bunched around the mean μ. Then, approximately, we can treat y − μ as small; then (y − μ)^2 is negligible (at least to a first approximation, which is all we are attempting here). Then
g(y) ∼ g(μ) + (y − μ)g′(μ).
Take expectations: as Ey = μ, the linear term goes out, giving Eg(y) ∼ g(μ). So
g(y) − g(μ) ∼ g(y) − Eg(y) ∼ g′(μ)(y − μ).
Square both sides:
[g(y) − g(μ)]^2 ∼ [g′(μ)]^2 (y − μ)^2.
Take expectations: as Ey = μ and Eg(y) ∼ g(μ), this says
var(g(y)) ∼ [g′(μ)]^2 var(y).

Regression. So if
E(y_i | x_i) = μ_i,   var(y_i | x_i) = σ_i^2,
we use EDA to try to find some link between the means μ_i and the variances σ_i^2. Suppose we try σ_i^2 = H(μ_i), or σ^2 = H(μ). Then, by the above,
var(g(y)) ∼ [g′(μ)]^2 σ^2 = [g′(μ)]^2 H(μ).
We want constant variance, c^2 say. So we want
[g′(μ)]^2 H(μ) = c^2,   g′(μ) = c/√H(μ),   g(y) = c ∫ dy/√H(y).
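As an illustration under the assumption H(μ) = μ (Poisson-type data, so the recipe gives g(y) proportional to √y), the following hedged R simulation with invented mean values shows the raw variance growing with the mean while the variance of √y stays roughly constant (about 1/4):

set.seed(13)
mus <- c(5, 20, 50, 100)
raw <- sapply(mus, function(m) var(rpois(10000, m)))          # variance grows with the mean
stab <- sapply(mus, function(m) var(sqrt(rpois(10000, m))))   # roughly constant, about 0.25
rbind(mean = mus, var.y = raw, var.sqrt.y = stab)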
Note 7.4 The idea of variance-stabilising transformations (like so much else in Statistics!) goes back to Fisher. He found the density of the sample correlation coefficient r2 in the bivariate normal distribution – a complicated function involving the population correlation coefficient ρ2 , simplifying somewhat in the case ρ = 0 (see e.g. Kendall and Stuart (1977), §16.27, 28). But Fisher’s z transformation of 1921 (Kendall and Stuart (1977), §16.33) 1+r 1+ρ 1 1 r = tanh z, z = log , ρ = tanh ζ, ζ = log 2 1−r 2 1−ρ gives z approximately normal, with variance almost independent of ρ: z ∼ N (0, 1/(n − 1)). Taylor’s Power Law. The following empirical law was proposed by R. L. Taylor in 1961 (Taylor (1961)): log variance against log mean is roughly linear with slope γ between 1 and 2. Both these extreme cases can occur. An example of slope 1 is the Poisson distribution, where the mean and the variance are the same. An example of slope 2 occurs with a Gamma-distributed error structure, important in Generalised Linear Models (Chapter 8). With H(μ) = μγ above, this gives variance v = σ 2 = H(μ) = μγ . Transform to g(y) = c
dy =c H(y)
dy y
1 2γ
1 1− 1 γ = c y 1− 2 γ − y0 2 .
7.3 Variance-Stabilising Transformations
173
This is of Box–Cox type, with 1 λ = 1 − γ. 2 Taylor’s suggested range 1 ≤ γ ≤ 2 gives 1 1 0≤1− γ ≤ . 2 2 Note that this range includes the logarithmic transformation (Box–Cox, λ = 0), and the cube–root transformation (λ = 1/3) in the timber example. Partly for dimensional reasons as above, common choices for λ include λ = −1/2, 0, 1/3, 1/2, (1), 3/2 (if λ = 1 we do not need to transform). An empirical choice of λ (e.g. by Box–Cox as above) close to one of these may suggest choosing λ as this value, and/or a theoretical examination with dimensional considerations in mind. Delta Method. A similar method applies to reparametrisation. Suppose we choose a parameter θ. If the true value is θ0 and the maximum-likelihood esˆ then under suitable regularity conditions a central limit theorem timator is θ, (CLT) will hold: √ (n → ∞). n θˆ − θ0 /σ → N (0, 1) Now suppose that one wishes to change parameter, and work instead with φ, where φ := g(θ). Then the same method (Taylor expansion about the mean) enables one to transfer this CLT for our estimate of θ to a CLT for our estimate of φ: √ n φˆ − φ0 / (g (θ0 ) σ) → N (0, 1) (n → ∞).
Example 7.5 (Variance and standard deviation) It is convenient to be able to change at will from using variance σ 2 as a parameter to using standard deviation σ. Mathematically the change is trivial, and it is also trivial computationally (given a calculator). Using the delta-method, it is statistically straightforward to transfer the results of a maximum-likelihood estimation from one to the other.
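A minimal numerical sketch of Example 7.5 (the numbers below are invented for illustration, not taken from the text): given an ML estimate of σ² and its standard error, the delta method with g(θ) = √θ transfers the standard error to the σ scale.

## Delta-method transfer from variance to standard deviation (illustrative numbers)
var.hat    <- 4.2    # hypothetical ML estimate of sigma^2
se.var.hat <- 0.8    # hypothetical standard error of that estimate
sd.hat     <- sqrt(var.hat)                      # estimate of sigma
se.sd.hat  <- se.var.hat / (2 * sqrt(var.hat))   # |g'(theta)| * se, with g(theta) = sqrt(theta)
c(sd = sd.hat, se = se.sd.hat)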
7.4 Multicollinearity

Recall the distribution theory of the bivariate normal distribution (§1.5). If we are regressing y on x, but y is (exactly) a linear function of x, then ρ = ±1, the bivariate normal density does not exist, and the two-dimensional setting is wrong – the situation is really one-dimensional. Similar remarks apply for the multivariate normal distribution (§4.3). When we assume the covariance matrix Σ is non-singular, the density exists and is given by Edgeworth's Theorem; when Σ is singular, the density does not exist. The situation is similar again in the context of Multiple Regression in Chapter 3. There, we assumed that the design matrix A (n × p, with n >> p) has full rank p. A will have defective rank (< p) if there are linear relationships between regressors.

In all these cases, we have a general situation which is non-degenerate, but which contains a special situation which is degenerate. The right way to handle this is to identify the degeneracy and its cause. By reformulating the problem in a suitably lower dimension, we can change the situation which is degenerate in the higher-dimensional setting into one which is non-degenerate if handled in its natural dimension. To summarise: to escape degeneracy, one needs to identify the linear dependence relationship which causes it. One can then eliminate dependent variables, begin again with only linearly independent variables, and avoid degeneracy.

The problem remains that in Statistics we are handling data, and data are uncertain. Not only do they contain sampling error, but having sampled our data we have to round them (to the number of decimal places or significant figures we – or the default option of our computer package – choose to work to). We may well be in the general situation, where things are non-degenerate, and there are no non-trivial linear dependence relations. Nevertheless, there may be approximate linear dependence relations. If so, then rounding error may lead us close to degeneracy (or even to it): our problem is then numerically unstable. This phenomenon is known as multicollinearity.

Multiple Regression is inherently prone to problems of this kind. One reason is that the more regressors we have, the more ways there are for some of them to be at least approximately linearly dependent on others. This will then cause the problems mentioned above. Our best defence against multicollinearity is to be alert to the danger, and in particular to watch for possible approximate linear dependence relations between regressors. If we can identify such, we have made two important gains:
(i) we can avoid the numerical instability associated with multicollinearity, and reduce the dimension and thus the computational complexity;
(ii) we have identified important structural information about the problem by identifying an approximate link between regressors.
The problem of multicollinearity in fact bedevils the whole subject of Multiple Regression, and is surprisingly common. It is one reason why the subject is ‘an art as well as a science’. It is also a reason why automated computer procedures such as the S-Plus commands step and update produce different outcomes depending on the order in which variables are declared in the model.
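Two simple numerical checks can help one watch for approximate linear dependence before (or after) fitting; the sketch below is ours, assuming a hypothetical data frame dat with response y and regressors x1, ..., x4.

## Quick diagnostics for approximate linear dependence (hypothetical data frame `dat`)
X <- model.matrix(y ~ x1 + x2 + x3 + x4, data = dat)[, -1]   # drop the intercept column
cor(X)             # pairwise correlations: values near +/- 1 are warning signs
kappa(scale(X))    # condition number: large values (say 30 or more) signal near-degeneracy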
Example 7.6 (Concrete example) The following example is due to Woods et al. (1932). It is a very good illustration of multicollinearity and how to handle it. In a study of the production of concrete, the response variable Y is the amount of heat (calories per gram) released while the concrete sets. There are four regressors X1 , . . . , X4 representing the percentages (by weight rounded to the nearest integer) of the chemically relevant constituents from which the concrete is made. The data are shown in Table 7.1 below.
n     Y      X1   X2   X3   X4
1    78.5     7   26    6   60
2    74.3     1   29   15   52
3   104.3    11   56    8   20
4    87.6    11   31    8   47
5    95.9     7   52    6   33
6   109.2    11   55    9   22
7   102.7     3   71   17    6
8    72.5     1   31   22   44
9    93.1     2   54   18   22
10  115.9    21   47    4   26
11   83.8     1   40   23   34
12  113.3    11   66    9   12
13  109.9    10   68    8   12

Table 7.1 Data for concrete example
Here the Xi are not exact percentages, due to rounding error and the presence of between 1% and 5% of other chemically relevant compounds. However, X1 , X2 , X3 , X4 are rounded percentages and so sum to near 100 (cf. the mixture models of Exercise 6.10). So, strong (negative) correlations are anticipated, and we expect that we will not need all of X1 , . . . , X4 in our chosen model. In this simple example we can fit models using all possible combinations of variables
and the results are shown in Table 7.2. Here we cycle through, using as an intuitive guide the proportion of the variability in the data explained by each model, as defined by the R² statistic (see Chapter 3).

Model          100R²
X1             53.29
X2             66.85
X3             28.61
X4             67.59
X1 X2          97.98
X1 X3          54.68
X1 X4          97.28
X2 X3          84.93
X2 X4          68.18
X3 X4          93.69
X1 X2 X3       98.32
X1 X2 X4       98.32
X1 X3 X4       98.2
X2 X3 X4       97.33
X1 X2 X3 X4    98.32

Table 7.2 All-subsets regression for Example 7.6

The multicollinearity is well illustrated by the fact that omitting either X3 or X4 from the full model does not seem to have much of an effect. Further, the models with just one term do not appear sufficient. Here the t-tests generated as standard output in many computer software packages, in this case R¹ using the summary.lm command, prove illuminating. When fitting the full model X1 X2 X3 X4 we obtain the output in Table 7.3 below:

Coefficient   Estimate   Standard Error   t-value   p-value
Intercept       58.683           68.501     0.857     0.417
X1               1.584            0.728     2.176     0.061
X2               0.552            0.708     0.780     0.458
X3               0.134            0.738     0.182     0.860
X4              -0.107            0.693    -0.154     0.882

Table 7.3 R output for Example 7.6

So despite the high value of R², tests for individual model components in the model are non-significant. This in itself suggests possible multicollinearity. Looking at Table 7.2, model selection appears to come down to a choice between the best two-term model X1 X2 and the best three-term models X1 X2 X3 and X1 X2 X4. When testing X1 X2 X3 against X1 X2 we get a t-statistic of 0.209 for X3, suggesting that X3 can be safely excluded from the model. A similar analysis for X1 X2 X4 gives a p-value of 0.211, suggesting that X4 can also be safely omitted from the model. Thus X1 X2 appears to be the best model, and the multicollinearity inherent in the problem suggests that a model half the size of the full model will suffice.

¹ R: A language and environment for statistical computing. © 2009 R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org
In larger problems one might suggest using stepwise regression or backward selection starting with the full model, rather than the all-subsets regression approach we considered here.

Regression Diagnostics. A regression analysis is likely to involve an iterative process in which a range of plausible alternative models are examined and compared, before our final model is chosen. This process of model checking involves, in particular, looking at unusual or suspicious data points, deficiencies in model fit, etc. This whole process of model examination and criticism is known as Regression Diagnostics. For reasons of space, we must refer for background and detail to one of the specialist monographs on the subject, e.g. Atkinson (1985), Atkinson and Riani (2000).
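Before turning to the exercises, here is a sketch of how the core of the Example 7.6 analysis might be reproduced in R; the data frame name and layout are our own, and the commands are illustrative rather than the authors' original code.

## Example 7.6 sketch: data of Table 7.1, full-model fit and one nested comparison
concrete <- data.frame(
  Y  = c(78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.9),
  X1 = c(7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10),
  X2 = c(26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68),
  X3 = c(6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8),
  X4 = c(60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12))
full <- lm(Y ~ X1 + X2 + X3 + X4, data = concrete)
summary(full)    # individual t-tests as in Table 7.3, together with R^2
# Does X3 add anything to X1 + X2?  (Equivalent to the single-term test discussed above.)
anova(lm(Y ~ X1 + X2, data = concrete), lm(Y ~ X1 + X2 + X3, data = concrete))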
EXERCISES

7.1. Revisit the concrete example using:
(i) stepwise selection starting with the full model;
(ii) backward selection starting with the full model;
(iii) forward selection from the null constant model.

7.2. Square root transformation for count data. Counts of rare events are often thought to be approximately Poisson distributed. The transformation √Y, or √(Y + 1) if some counts are small, is often thought to be effective in modelling count data. The data in Table 7.4 give a count of the number of poppy plants in oats.
(i) Fit an Analysis of Variance model using the raw data. Does a plot of residuals against fitted values suggest a transformation?
(ii) Interpret the model in (i).
(iii) Re-fit the model in (i)–(ii) using a square-root transformation. How do your findings change?
Treatment     A     B     C     D     E
Block 1     438   538    77    17    18
Block 2     442   422    61    31    26
Block 3     319   377   157    87    77
Block 4     380   315    52    16    20

Table 7.4 Data for Exercise 7.2
7.3. Arc sine transformation for proportions. If we denote the empirical proportions by p̂, we replace p̂ by introducing the transformation

y = sin⁻¹(√p̂).

In this angular scale, proportions near zero or one are spread out to increase their variance and make the assumption of homogeneous errors more realistic. (With small values of n, n < 50, the suggestion is to replace zero or one by 1/(4n) or 1 − 1/(4n) respectively.) The data in Table 7.5 give the percentage of unusable ears of corn.
(i) Fit an Analysis of Variance model using the raw data. Does a plot of residuals against fitted values suggest a transformation?
(ii) Interpret the model in (i).
(iii) Re-fit the model in (i)–(ii) using the suggested transformation. How do your findings change?

Block           1      2      3      4      5      6
Treatment A   42.4   34.4   24.1   39.5   55.5   49.1
Treatment B   33.3   33.3    5.0   26.3   30.2   28.6
Treatment C    8.5   21.9    6.2   16.0   13.5   15.4
Treatment D   16.6   19.3   16.6    2.1   11.1   11.1

Table 7.5 Data for Exercise 7.3
7.4. The data in Table 7.6 give the numbers of four kinds of plankton caught in different hauls.
(i) Fit an Analysis of Variance model using the raw data. Does a plot of residuals against fitted values suggest a transformation of the response?
(ii) Calculate the mean and range (max(y) − min(y)) for each species and repeat using the logged response. Comment.
(iii) Fit an Analysis of Variance model using both raw and logged numbers, and interpret the results.

7.5. Repeat Exercise 7.4 using
(i) the square-root transformation of Exercise 7.2;
(ii) Taylor's power law.

7.6. The delta method: approximation formulae for moments of transformed random variables. Suppose the random vector U satisfies E(U) = μ, var(U) = ΣU, and V = f(U) for some smooth function f. Let F be the matrix of derivatives defined by

Fij(u) = (∂v/∂u)ij = (∂f/∂u)ij = ∂fi/∂uj.
Haul   Type I   Type II   Type III   Type IV
1         895     1520      43300      11000
2         540     1610      32800       8600
3        1020     1900      28800       8260
4         470     1350      34600       9830
5         428      980      27800       7600
6         620     1710      32800       9650
7         760     1930      28100       8900
8         537     1960      18900       6060
9         845     1840      31400      10200
10       1050     2410      39500      15500
11        387     1520      29000       9250
12        497     1685      22300       7900

Table 7.6 Data for Exercise 7.4
We wish to construct simple estimates for the mean and variance of V. Set

V ≈ f(μ) + F(μ)(u − μ).

Taking expectations then gives E(V) ≈ f(μ).
(i) Show that ΣV ≈ F(μ) ΣU F(μ)^T.
(ii) Let U ∼ Po(μ) and V = √U. Give approximate expressions for the mean and variance of V.
(iii) Repeat (ii) for V = log(U + 1). What happens if μ >> 1?

7.7. Show, using the delta method, how you might obtain parameter estimates and estimated standard errors for the power-law model y = αx^β.

7.8. Analysis using graphics in S-Plus/R. Re-examine the plots shown in Figures 7.2 and 7.3. The R code which produced these plots is shown below. What is the effect of the commands xaxt/yaxt="n"? Use ?par to see other options. Experiment and produce your own examples to show funnelling out and funnelling in of residuals. Code for funnels out/in plot

The number of failures before the rth success has the negative binomial distribution in the form just obtained (the binomial coefficient counts the number of ways of distributing the n failures over the first n + r − 1 trials; for each such way, these n failures and r − 1 successes happen with probability q^n p^{r−1}; the (n + r)th trial is a success with probability p). So the number of failures before the rth success
(i) has the negative binomial distribution (which it is customary and convenient to parametrise as NB(r, p) in this case);
(ii) is the sum of r independent copies of geometric random variables with distribution G(p);
(iii) so has mean rq/p and variance rq/p² (agreeing with the above with r = ν, p = r/(τ + r), q = τ/(τ + r)).

The Federalist. The Federalist Papers were a series of essays on constitutional matters, published in 1787–1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of New York State to ratify the U.S. Constitution. Authorship of a number of these papers, published anonymously, was later disputed between Hamilton and Madison. Their authorship has since been settled by a classic statistical study, based on the use of the negative binomial distribution for over-dispersed count data (usage of key indicator words – 'whilst' and 'while' – proved decisive); see Mosteller and Wallace (1984).
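The moment formulae just quoted are easy to check by simulation; R's rnbinom uses the same "failures before the rth success" convention, so the following sketch (our own, with arbitrary parameter values) should reproduce mean rq/p and variance rq/p² up to sampling error.

## Simulation check of the NB(r, p) mean and variance
set.seed(1)
r <- 5; p <- 0.4; q <- 1 - p
x <- rnbinom(1e5, size = r, prob = p)
c(sample.mean = mean(x), theory.mean = r * q / p)
c(sample.var  = var(x),  theory.var  = r * q / p^2)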
8.5.1 Practical applications: Analysis of over-dispersed models in R

For binomial and Poisson families, the theory of Generalised Linear Models specifies that the dispersion parameter φ = 1. Over-dispersion can be very common in practical applications and is typically characterised by the residual deviance differing significantly from its asymptotic expected value, given by the residual degrees of freedom (Venables and Ripley (2002)). Note, however, that this theory is only asymptotic. We may crudely interpret over-dispersion as saying that the data vary more than they would if the underlying model really were a Poisson or binomial sample. A solution is to multiply the variance functions by a dispersion parameter φ, which then has to be estimated rather than simply assumed to be fixed at 1. Here we skip technical details, except to say that this is possible using a quasi-likelihood approach and can easily be implemented in R using the Generalised Linear Model families quasipoisson and quasibinomial. We illustrate the procedure with an application to over-dispersed Poisson data.
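A generic quasi-Poisson fit might look as follows; the data frame poppies and its columns are hypothetical, and the sketch is ours rather than the commands used in the text.

## Poisson versus quasi-Poisson fits (hypothetical data frame `poppies`)
m.pois  <- glm(count ~ block + treatment, family = poisson,      data = poppies)
m.quasi <- glm(count ~ block + treatment, family = quasipoisson, data = poppies)
summary(m.pois)    # residual deviance far above its degrees of freedom
summary(m.quasi)   # same coefficients; standard errors inflated by the square root of the estimated phi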
Example 8.7

We wish to fit an appropriate Generalised Linear Model to the count data of Exercise 7.2. Fitting the model with both blocks and treatments gives a residual deviance of 242.46 on 12 df, giving a clear indication of over-dispersion. A quasi-Poisson model can be fitted with the following commands: m1.glm

P(T ≤ t + h | T > t) = h f(t)/F̄(t), to first order in h. We call the coefficient of h on the right the hazard function, h(t). Thus

h(t) = f(t) / ∫_t^∞ f(u) du = −D ∫_t^∞ f / ∫_t^∞ f,

and integrating one has

log ∫_t^∞ f = −∫_0^t h:    ∫_t^∞ f(u) du = exp{ −∫_0^t h(u) du }

(since ∫_0^∞ f = 1, giving the constant of integration).
Example 9.10

1. The exponential distribution. If F is the exponential distribution with parameter λ, E(λ) say, then f(t) = λe^{−λt}, F̄(t) = e^{−λt}, and h(t) = λ is constant. This property of constant hazard rate captures the lack-of-memory property of the exponential distributions (for which see e.g. the sources cited in §8.4), or the lack-of-ageing property: given that an individual has survived to date, its further survival time has the same distribution as that of a new individual. This is suitable for modelling the lifetimes of certain components (lightbulbs, etc.) that fail without warning, but of course not suitable for modelling lifetimes of biological populations, which show ageing.

2. The Weibull distribution. Here

f(t) = λν(λt)^{ν−1} exp{−(λt)^ν},

with λ, ν positive parameters; this reduces to the exponential E(λ) for ν = 1.

3. The Gompertz–Makeham distribution. This is a three-parameter family, with hazard function

h(t) = λ + ae^{bt}.

This includes the exponential case with a = b = 0, and allows one to model a baseline hazard (the constant term λ), with in addition a hazard growing
exponentially with time (which can be used to model the winnowing effect of ageing in biological populations). In medical statistics, one may be studying survival times in patients with a particular illness. One’s data is then subject to censoring, in which patients may die from other causes, discontinue treatment, leave the area covered by the study, etc.
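The three hazards of Example 9.10 are easily visualised; the parameter values below are our own illustrative choices, and the Weibull hazard is computed from R's built-in (shape, scale) parametrisation as f(t)/F̄(t), avoiding any commitment to a particular algebraic form.

## Hazard functions: constant (exponential), increasing (Weibull), Gompertz-Makeham
t <- seq(0.01, 10, by = 0.01)
h.exp     <- rep(0.5, length(t))                       # lambda
h.weibull <- dweibull(t, shape = 2, scale = 4) /
             (1 - pweibull(t, shape = 2, scale = 4))   # f(t) / survivor(t)
h.gm      <- 0.1 + 0.02 * exp(0.4 * t)                 # lambda + a * exp(b t)
matplot(t, cbind(h.exp, h.weibull, h.gm), type = "l", lty = 1,
        xlab = "t", ylab = "hazard h(t)")
legend("topleft", c("exponential", "Weibull", "Gompertz-Makeham"), col = 1:3, lty = 1)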
9.5.1 Proportional hazards

One is often interested in the effect of covariates on survival probabilities. For example, many cancers are age-related, so the patient's age is an obvious covariate. Many forms of cancer are affected by diet, or lifestyle factors. Thus the link between smoking and lung cancer is now well known, and similarly for exposure to asbestos. One's chances of contracting certain cancers (of the mouth, throat, oesophagus etc.) are affected by alcohol consumption. Breast cancer rates are linked to diet (western women, whose diets are rich in dairy products, are more prone to the disease than oriental women, whose diets are rich in rice and fish). Consumption of red meat is linked to cancer of the bowel, etc., and so is lack of fibre.

Thus in studying survival rates for a particular cancer, one may identify a suitable set of covariates z relevant to this cancer. One may seek to use a linear combination β^T z of such covariates with coefficients β, as in the multiple regression of Chapters 3 and 4. One might also superimpose this effect on some baseline hazard, modelled non-parametrically. One is led to model the hazard function by

h(t; z) = g(β^T z) h0(t),

where the function g contains the parametric part β^T z and the baseline hazard h0 the non-parametric part. This is the Cox proportional hazards model (D. R. Cox in 1972). The name arises because if one compares the hazards for two individuals with covariates z1, z2, one obtains

h(t; z1)/h(t; z2) = g(β^T z1)/g(β^T z2),

as the baseline hazard term cancels. The most common choices of g are:
(i) Log-linear: g(x) = e^x (if g(x) = e^{ax}, one can absorb the constant a into β);
(ii) Linear: g(x) = 1 + x;
(iii) Logistic: g(x) = log(1 + x).
We confine ourselves here to the log-linear case, the commonest and most important. Here the hazard ratio is

h(t; z1)/h(t; z2) = exp{β^T(z1 − z2)}.

Estimation of β by maximum likelihood must be done numerically (we omit the non-parametric estimation of h0). For a sample of n individuals, with covariate vectors z1, . . . , zn, the data consist of the point events occurring – the identities (or covariate values) and times of death or censoring of non-surviving individuals; see e.g. Venables and Ripley (2002), §13.3 for use of S-Plus here, and for theoretical background see e.g. Cox and Oakes (1984).
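In R the log-linear Cox model is fitted by coxph in the survival package; the sketch below is illustrative only, with a hypothetical data frame patients containing a survival time, a censoring indicator and some covariates.

## Cox proportional hazards fit, log-linear form (hypothetical data frame `patients`)
library(survival)
fit <- coxph(Surv(time, status) ~ age + smoker + alcohol, data = patients)
summary(fit)    # the exp(coef) column gives estimated hazard ratios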
9.6 p >> n

We have constantly emphasised that the number p of parameters is to be kept small, to give an economical description of the data in accordance with the Principle of Parsimony, while the sample size n is much larger – the larger the better, as there is then more information. However, practical problems in areas such as bioinformatics have given rise to a new situation, in which this is reversed, and one now has p much larger than n. This happens with, for example, data arising from microarrays. Here p is the number of entries in a large array or matrix, and p being large enables many biomolecular probes to be carried out at the same time, so speeding up the experiment. But now new and efficient variable-selection algorithms are needed. Recent developments include the LASSO (least absolute shrinkage and selection operator) and LARS (least angle regression). One seeks to use such techniques to eliminate most of the parameters, and reduce to a case with p

Solutions

0, there would be distinct real roots and a sign change in between). So ('b² − 4ac ≤ 0'):

s²_xy ≤ s_xx s_yy = s²_x s²_y,   r² := (s_xy/(s_x s_y))² ≤ 1.

So −1 ≤ r ≤ +1, as required. The extremal cases r = ±1, or r² = 1, have discriminant 0; that is, Q(λ) has a repeated real root, λ0 say. But then Q(λ0) is the sum of squares of the λ0(xi − x̄) + (yi − ȳ), which is zero. So each term is 0:

λ0(xi − x̄) + (yi − ȳ) = 0    (i = 1, . . ., n).

That is, all the points (xi, yi) (i = 1, . . ., n) lie on a straight line through the centroid (x̄, ȳ) with slope −λ0.
1.2 Similarly

Q(λ) = E[λ²(x − Ex)² + 2λ(x − Ex)(y − Ey) + (y − Ey)²]
     = λ²E[(x − Ex)²] + 2λE[(x − Ex)(y − Ey)] + E[(y − Ey)²]
     = λ²σ²_x + 2λσ_xy + σ²_y.

(i) As before, Q(λ) ≥ 0 for all λ, so the discriminant is ≤ 0, i.e.

σ²_xy ≤ σ²_x σ²_y,   ρ² := (σ_xy/(σ_x σ_y))² ≤ 1,   −1 ≤ ρ ≤ +1.

The extreme cases ρ = ±1 occur iff Q(λ) has a repeated real root λ0. Then

Q(λ0) = E[(λ0(x − Ex) + (y − Ey))²] = 0.

So the random variable λ0(x − Ex) + (y − Ey) is zero (a.s. – except possibly on some set of probability 0). So all values of (x, y) lie on a straight line through the centroid (Ex, Ey) of slope −λ0, a.s.

1.3 (i) Half-marathon: a = 3.310 (2.656, 3.964), b = 0.296 (0.132, 0.460). Marathon: a = 3.690 (2.990, 4.396), b = 0.378 (0.202, 0.554).
(ii) Compare the rule with the model y = e^a t^b and consider, for example, dy/dt. Should obtain a reasonable level of agreement.

1.4 A plot gives little evidence of curvature and there does not seem to be much added benefit in fitting the quadratic term. Testing the hypothesis c = 0 gives a p-value of 0.675. The predicted values are 134.44 and 163.89 for the linear model and 131.15 and 161.42 for the quadratic model.

1.5 The condition in the text becomes

(Suu, Suv; Suv, Svv)(a, b)^T = (Syu, Syv)^T.

We can write down the solution for (a, b)^T as

(a, b)^T = (Suu, Suv; Suv, Svv)⁻¹ (Syu, Syv)^T = [1/(Suu Svv − Suv²)] (Svv, −Suv; −Suv, Suu)(Syu, Syv)^T,

giving

a = (Svv Syu − Suv Syv)/(Suu Svv − Suv²),   b = (Suu Syv − Suv Syu)/(Suu Svv − Suv²).
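Whatever the context of the exercise, the displayed equations are the centred normal equations for regressing y on u and v with an intercept, so the closed form can be checked numerically against lm; the sketch below (simulated data, our own) does this.

## Numerical check of the closed-form solution in 1.5
set.seed(42)
u <- rnorm(50); v <- rnorm(50); y <- 1 + 2*u - 3*v + rnorm(50)
Suu <- sum((u - mean(u))^2);                Svv <- sum((v - mean(v))^2)
Suv <- sum((u - mean(u)) * (v - mean(v)))
Syu <- sum((y - mean(y)) * (u - mean(u)));  Syv <- sum((y - mean(y)) * (v - mean(v)))
a <- (Svv*Syu - Suv*Syv) / (Suu*Svv - Suv^2)
b <- (Suu*Syv - Suv*Syu) / (Suu*Svv - Suv^2)
c(a, b)
coef(lm(y ~ u + v))[c("u", "v")]   # should agree with a and b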
1.6 (i) A simple plot suggests that a quadratic model might fit the data well (leaving aside, for the moment, the question of interpretation). An increase in R2 , equivalently a large reduction in the residual sum of squares, suggests the quadratic model offers a meaningful improvement over the simple model y = a + bx. A t-test for c = 0 gives a p-value of 0.007. (ii) t-tests give p-values of 0.001 (in both cases) that b and c are equal to zero. The model has an R2 of 0.68, suggesting that this simple model explains a reasonable amount, around 70%, of the variability in the data. The estimate gives c = −7.673, suggesting that club membership has improved the half-marathon times by around seven and a half minutes. 1.7 (i) The residual sums of squares are 0.463 and 0.852, suggesting that the linear regression model is more appropriate. (ii) A t-test gives a p-value of 0.647, suggesting that the quadratic term is not needed. (Note also the very small number of observations.) 1.8 A simple plot suggests a faster-than-linear growth in population. Sensible suggestions are fitting an exponential model using log(y) = a + bt, or a quadratic model y = a + bt + ct2 . A simple plot of the resulting fits suggests the quadratic model is better, with all the terms in this model highly significant. 1.9 (i) Without loss of generality assume g(·) is a monotone increasing function. We have that FY (x) = P(g(X)≤x) = P(X≤g −1 (x)). It follows that fY (x)
= (d/dx) ∫_{−∞}^{g⁻¹(x)} f_X(u) du = f_X(g⁻¹(x)) dg⁻¹(x)/dx.

(ii) P(Y ≤ x) = P(e^X ≤ x) = P(X ≤ log x),

f_Y(x) = (d/dx) ∫_{−∞}^{log x} (1/(√(2π) σ)) e^{−(y−μ)²/(2σ²)} dy
       = (1/(√(2π) σ)) x⁻¹ exp{−(log x − μ)²/(2σ²)}.
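The closed form in 1.9(ii) is the lognormal density and can be checked against R's dlnorm; a short numerical sketch (our own):

## Check of the density in 1.9(ii) against dlnorm
mu <- 0.3; sigma <- 1.2; x <- c(0.5, 1, 2, 5)
f.derived <- exp(-(log(x) - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma * x)
f.builtin <- dlnorm(x, meanlog = mu, sdlog = sigma)
cbind(f.derived, f.builtin)   # the two columns should agree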
1.10 (i) P(Y ≤ x) = P(r/U ≤ x) = P(U ≥ r/x). We have that

f_Y(x) = (d/dx) ∫_{r/x}^∞ u^{r/2−1} e^{−u/2} / (2^{r/2} Γ(r/2)) du
       = [(r/x)^{r/2−1} e^{−r/(2x)} / (2^{r/2} Γ(r/2))] · (r/x²)
       = r^{r/2} x^{−1−r/2} e^{−r/(2x)} / (2^{r/2} Γ(r/2)).

(ii) P(Y ≤ x) = P(X ≥ 1/x), and this gives

f_Y(x) = (d/dx) ∫_{1/x}^∞ u^{a−1} b^a e^{−bu} / Γ(a) du
       = [(1/x)^{a−1} b^a e^{−b/x} / Γ(a)] · (1/x²)
       = b^a x^{−1−a} e^{−b/x} / Γ(a).

Since the above expression is a probability density, and therefore integrates to one, this gives

∫_0^∞ x^{−1−a} e^{−b/x} dx = Γ(a)/b^a.

1.11 We have that f(x, u) = f_Y(u) φ(x | 0, u) and f_{t(r)}(x) = ∫_0^∞ f(x, u) du, where φ(·) denotes the probability density of N(0, u). Writing this out explicitly gives

f_{t(r)}(x) = ∫_0^∞ [r^{r/2} u^{−1−r/2} e^{−r/(2u)} / (2^{r/2} Γ(r/2))] · [e^{−x²/(2u)} / √(2πu)] du
            = [r^{r/2} / (2^{r/2} Γ(r/2) √(2π))] ∫_0^∞ u^{−3/2−r/2} e^{−(r/2 + x²/2)/u} du
            = [r^{r/2} / (2^{r/2} Γ(r/2) √(2π))] Γ(r/2 + 1/2) / (r/2 + x²/2)^{r/2+1/2}
            = [Γ(r/2 + 1/2) / (√(πr) Γ(r/2))] (1 + x²/r)^{−(r+1)/2}.
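The final expression in 1.11 is the Student t density with r degrees of freedom; a quick numerical comparison with R's dt (our own sketch) confirms the algebra.

## Check of the density in 1.11 against dt
r <- 5; x <- c(-2, 0, 1, 3)
f.derived <- gamma((r + 1)/2) / (sqrt(pi * r) * gamma(r/2)) * (1 + x^2/r)^(-(r + 1)/2)
f.builtin <- dt(x, df = r)
cbind(f.derived, f.builtin)   # the two columns should agree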
Chapter 2

2.1 (i)

∫_0^z h(u) du = P(Z ≤ z) = P(X/Y ≤ z) = ∫∫_{x/y ≤ z} f(x, y) dx dy = ∫_0^∞ dy ∫_0^{yz} dx f(x, y).

Differentiate both sides w.r.t. z:

h(z) = ∫_0^∞ dy y f(yz, y)    (z > 0),

as required (assuming enough smoothness to differentiate under the integral sign, as we do here).

(ii) ∫_0^x f_{X/c}(u) du = P(X/c ≤ x) = P(X ≤ cx) = ∫_0^{cx} f_X(u) du. Differentiate w.r.t. x: f_{X/c}(x) = c f_X(cx), as required.

(iii) As χ²(n) has density

e^{−x/2} x^{n/2−1} / (2^{n/2} Γ(n/2)),

χ²(n)/n has density, by (ii),

n e^{−nx/2} (nx)^{n/2−1} / (2^{n/2} Γ(n/2)).

So F(m, n) := (χ²(m)/m) / (χ²(n)/n) (independent quotient) has density, by (i),

h(z) = ∫_0^∞ y [m e^{−myz/2} (myz)^{m/2−1} / (2^{m/2} Γ(m/2))] [n e^{−ny/2} (ny)^{n/2−1} / (2^{n/2} Γ(n/2))] dy
     = [m^{m/2} n^{n/2} z^{m/2−1} / (2^{(m+n)/2} Γ(m/2) Γ(n/2))] ∫_0^∞ e^{−(n+mz)y/2} y^{(m+n)/2−1} dy.

Put (n + mz)y/2 = u in the integral, which becomes

[2^{(m+n)/2} / (n + mz)^{(m+n)/2}] ∫_0^∞ e^{−u} u^{(m+n)/2−1} du = 2^{(m+n)/2} Γ((m+n)/2) / (n + mz)^{(m+n)/2}.
Combining,

h(z) = m^{m/2} n^{n/2} [Γ((m+n)/2) / (Γ(m/2) Γ(n/2))] z^{m/2−1} / (n + mz)^{(m+n)/2},

as required.

2.2 (i) 0.726. (ii) 0.332. (iii) 0.861. (iv) 0.122. (v) 0.967.

2.3 The ANOVA table obtained is shown in Table 1. The significant p-value obtained (p = 0.007) gives strong evidence that the absorption levels vary between the different types of fats. The mean levels of fat absorbed are: Fat 1 172g, Fat 2 185g, Fat 3 176g, Fat 4 162g. There is some suggestion that doughnuts absorb relatively high amounts of Fat 2, and relatively small amounts of Fat 4.
Source          df   Sum of Squares   Mean Square       F
Between fats     3           1636.5         545.5   5.406
Residual        20           2018.0         100.9
Total           23           3654.5

Table 1 One-way ANOVA table for Exercise 2.3
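A table such as Table 1 might be produced along the following lines; the doughnut data themselves are not reproduced here, so the data frame and column names below are assumed.

## One-way ANOVA sketch for Exercise 2.3 (hypothetical data frame `doughnuts`)
## `fat` is assumed to be a factor with four levels
summary(aov(absorbed ~ fat, data = doughnuts))     # ANOVA table as in Table 1
tapply(doughnuts$absorbed, doughnuts$fat, mean)    # group means quoted above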
2.4 The one-way ANOVA table is shown in Table 2. The p-value obtained, p = 0.255, suggests that the length of daily light exposure does not affect growth.

Source         df   Sum of Squares   Mean Square       F
Photoperiod     3            7.125         2.375   1.462
Residual       20           32.5           1.625
Total          23           39.625

Table 2 One-way ANOVA table for Exercise 2.4
2.5 (i) The statistic becomes

t = √n (X̄₁ − X̄₂) / (√2 s),
where s² is the pooled variance estimate given by

s² = (s₁² + s₂²)/2.

(ii) The total sum of squares SS can be calculated as

Σ_{i,j} X²_{ij} − (n/2)(X̄₁ + X̄₂)² = Σ_j X²_{1j} + Σ_j X²_{2j} − (n/2)(X̄₁² + 2X̄₁X̄₂ + X̄₂²).

Similarly,

SSE = (Σ_j X²_{1j} − nX̄₁²) + (Σ_j X²_{2j} − nX̄₂²).

This leaves the treatments sum of squares to be calculated as

SST = (n/2)(X̄₁² − 2X̄₁X̄₂ + X̄₂²) = (n/2)(X̄₁ − X̄₂)²,

on 1 degree of freedom, since there are two treatments. Further, since by subtraction we have 2(n − 1) residual degrees of freedom, the F statistic can be constructed as

F = [(n/2)(X̄₁ − X̄₂)² / 1] / [2(n − 1)s² / (2(n − 1))] = n(X̄₁ − X̄₂)² / (2s²),

and can be tested against F_{1,2(n−1)}. We see from (i) that F is the square of the usual t statistic.

2.6 By definition Y₁² + Y₂² ∼ χ²₂. Set

a(Y₁ − Y₂)² + b(Y₁ + Y₂)² = Y₁² + Y₂².

It follows that

aY₁² + bY₁² = Y₁²,   aY₂² + bY₂² = Y₂²,   −2aY₁Y₂ + 2bY₁Y₂ = 0.

Hence a = b = 1/2.

2.7 By Theorem 2.4,

Y₁² + Y₂² + Y₃² − (Y₁ + Y₂ + Y₃)²/3 ∼ χ²₂.

The result follows since the LHS can be written as

(1/3)[2Y₁² + 2Y₂² + 2Y₃² − 2(Y₁Y₂ + Y₁Y₃ + Y₂Y₃)],
or equivalently as

(1/3)[(Y₁ − Y₂)² + (Y₁ − Y₃)² + (Y₂ − Y₃)²].

Continuing, we may again apply Theorem 2.4 to obtain

Σ_{i=1}^n Yᵢ² − (Σ_{i=1}^n Yᵢ)²/n ∼ χ²_{n−1}.

The LHS can be written as

((n − 1)/n) Σ_{i=1}^n Yᵢ² − (2/n) Σ_{i<j} YᵢYⱼ = (1/n) Σ_{i<j} (Yᵢ − Yⱼ)²,