Applied Multivariate Statistical Analysis
FIFTH EDITION

RICHARD A. JOHNSON
University of Wisconsin-Madison

DEAN W. WICHERN
Texas A&M University

PRENTICE HALL, Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Johnson, Richard Arnold.
Applied multivariate statistical analysis / Richard A. Johnson.--5th ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-092553-5
1. Multivariate analysis. I. Wichern, Dean W. II. Title.
QA278 .J63 2002
519.5'35--dc21
2001036199
Acquisitions Editor: Quincy McDonald
Editor-in-Chief: Sally Yagan
Vice President/Director Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Assistant Managing Editor: Bayani DeLeon
Production Editor: Steven S. Pawlowski
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Editorial Assistant/Supplements Editor: Joanne Wendelken
Managing Editor, Audio/Video Assets: Grace Hazeldine
Art Director: Jayne Conte
Cover Designer: Bruce Kenselaar
Illustrator: Marita Froimson
© 2002, 1998, 1992, 1988, 1982 by Prentice-Hall, Inc.
Upper Saddle River, NJ 07458

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2

ISBN 0-13-092553-5

Pearson Education LTD., London
Pearson Education Australia PTY, Limited, Sydney
Pearson Education Singapore, Pte. Ltd
Pearson Education North Asia Ltd, Hong Kong
Pearson Education Canada, Ltd., Toronto
Pearson Education de Mexico, S.A. de C.V.
Pearson Education-Japan, Tokyo
Pearson Education Malaysia, Pte. Ltd
To the memory of my mother and my father.
R. A. J.

To Dorothy, Michael, and Andrew.
D. W. W.
Contents

PREFACE

1 ASPECTS OF MULTIVARIATE ANALYSIS
1.1 Introduction
1.2 Applications of Multivariate Techniques
1.3 The Organization of Data
    Arrays; Descriptive Statistics; Graphical Techniques
1.4 Data Displays and Pictorial Representations
    Linking Multiple Two-Dimensional Scatter Plots; Graphs of Growth Curves; Stars; Chernoff Faces
1.5 Distance
1.6 Final Comments
Exercises
References

2 MATRIX ALGEBRA AND RANDOM VECTORS
2.1 Introduction
2.2 Some Basics of Matrix and Vector Algebra
    Vectors; Matrices
2.3 Positive Definite Matrices
2.4 A Square-Root Matrix
2.5 Random Vectors and Matrices
2.6 Mean Vectors and Covariance Matrices
    Partitioning the Covariance Matrix; The Mean Vector and Covariance Matrix for Linear Combinations of Random Variables; Partitioning the Sample Mean Vector and Covariance Matrix
2.7 Matrix Inequalities and Maximization
Supplement 2A: Vectors and Matrices: Basic Concepts
    Vectors; Matrices
Exercises
References

3 SAMPLE GEOMETRY AND RANDOM SAMPLING
3.1 Introduction
3.2 The Geometry of the Sample
3.3 Random Samples and the Expected Values of the Sample Mean and Covariance Matrix
3.4 Generalized Variance
    Situations in which the Generalized Sample Variance Is Zero; Generalized Variance Determined by $|R|$ and Its Geometrical Interpretation; Another Generalization of Variance
3.5 Sample Mean, Covariance, and Correlation As Matrix Operations
3.6 Sample Values of Linear Combinations of Variables
Exercises
References

4 THE MULTIVARIATE NORMAL DISTRIBUTION
4.1 Introduction
4.2 The Multivariate Normal Density and Its Properties
    Additional Properties of the Multivariate Normal Distribution
4.3 Sampling from a Multivariate Normal Distribution and Maximum Likelihood Estimation
    The Multivariate Normal Likelihood; Maximum Likelihood Estimation of $\mu$ and $\Sigma$; Sufficient Statistics
4.4 The Sampling Distribution of $\bar{X}$ and $S$
    Properties of the Wishart Distribution
4.5 Large-Sample Behavior of $\bar{X}$ and $S$
4.6 Assessing the Assumption of Normality
    Evaluating the Normality of the Univariate Marginal Distributions; Evaluating Bivariate Normality
4.7 Detecting Outliers and Cleaning Data
    Steps for Detecting Outliers
4.8 Transformations To Near Normality
    Transforming Multivariate Observations
Exercises
References

5 INFERENCES ABOUT A MEAN VECTOR
5.1 Introduction
5.2 The Plausibility of $\mu_0$ as a Value for a Normal Population Mean
5.3 Hotelling's $T^2$ and Likelihood Ratio Tests
    General Likelihood Ratio Method
5.4 Confidence Regions and Simultaneous Comparisons of Component Means
    Simultaneous Confidence Statements; A Comparison of Simultaneous Confidence Intervals with One-at-a-Time Intervals; The Bonferroni Method of Multiple Comparisons
5.5 Large Sample Inferences about a Population Mean Vector
5.6 Multivariate Quality Control Charts
    Charts for Monitoring a Sample of Individual Multivariate Observations for Stability; Control Regions for Future Individual Observations; Control Ellipse for Future Observations; $T^2$ Chart for Future Observations; Control Charts Based on Subsample Means; Control Regions for Future Subsample Observations
5.7 Inferences about Mean Vectors when Some Observations Are Missing
5.8 Difficulties Due to Time Dependence in Multivariate Observations
Supplement 5A: Simultaneous Confidence Intervals and Ellipses as Shadows of the p-Dimensional Ellipsoids
Exercises
References

6 COMPARISONS OF SEVERAL MULTIVARIATE MEANS
6.1 Introduction
6.2 Paired Comparisons and a Repeated Measures Design
    Paired Comparisons; A Repeated Measures Design for Comparing Treatments
6.3 Comparing Mean Vectors from Two Populations
    Assumptions Concerning the Structure of the Data; Further Assumptions when $n_1$ and $n_2$ Are Small; Simultaneous Confidence Intervals; The Two-Sample Situation when $\Sigma_1 \neq \Sigma_2$
6.4 Comparing Several Multivariate Population Means (One-Way MANOVA)
    Assumptions about the Structure of the Data for One-way MANOVA; A Summary of Univariate ANOVA; Multivariate Analysis of Variance (MANOVA)
6.5 Simultaneous Confidence Intervals for Treatment Effects
6.6 Two-Way Multivariate Analysis of Variance
    Univariate Two-Way Fixed-Effects Model with Interaction; Multivariate Two-Way Fixed-Effects Model with Interaction
6.7 Profile Analysis
6.8 Repeated Measures Designs and Growth Curves
6.9 Perspectives and a Strategy for Analyzing Multivariate Models
Exercises
References

7 MULTIVARIATE LINEAR REGRESSION MODELS
7.1 Introduction
7.2 The Classical Linear Regression Model
7.3 Least Squares Estimation
    Sum-of-Squares Decomposition; Geometry of Least Squares; Sampling Properties of Classical Least Squares Estimators
7.4 Inferences About the Regression Model
    Inferences Concerning the Regression Parameters; Likelihood Ratio Tests for the Regression Parameters
7.5 Inferences from the Estimated Regression Function
    Estimating the Regression Function at $z_0$; Forecasting a New Observation at $z_0$
7.6 Model Checking and Other Aspects of Regression
    Does the Model Fit?; Leverage and Influence; Additional Problems in Linear Regression
7.7 Multivariate Multiple Regression
    Likelihood Ratio Tests for Regression Parameters; Other Multivariate Test Statistics; Predictions from Multivariate Multiple Regressions
7.8 The Concept of Linear Regression
    Prediction of Several Variables; Partial Correlation Coefficient
7.9 Comparing the Two Formulations of the Regression Model
    Mean Corrected Form of the Regression Model; Relating the Formulations
7.10 Multiple Regression Models with Time Dependent Errors
Supplement 7A: The Distribution of the Likelihood Ratio for the Multivariate Multiple Regression Model
Exercises
References

8 PRINCIPAL COMPONENTS
8.1 Introduction
8.2 Population Principal Components
    Principal Components Obtained from Standardized Variables; Principal Components for Covariance Matrices with Special Structures
8.3 Summarizing Sample Variation by Principal Components
    The Number of Principal Components; Interpretation of the Sample Principal Components; Standardizing the Sample Principal Components
8.4 Graphing the Principal Components
8.5 Large Sample Inferences
    Large Sample Properties of $\hat{\lambda}_i$ and $\hat{e}_i$; Testing for the Equal Correlation Structure
8.6 Monitoring Quality with Principal Components
    Checking a Given Set of Measurements for Stability; Controlling Future Values
Supplement 8A: The Geometry of the Sample Principal Component Approximation
    The p-Dimensional Geometrical Interpretation; The n-Dimensional Geometrical Interpretation
Exercises
References

9 FACTOR ANALYSIS AND INFERENCE FOR STRUCTURED COVARIANCE MATRICES
9.1 Introduction
9.2 The Orthogonal Factor Model
9.3 Methods of Estimation
    The Principal Component (and Principal Factor) Method; A Modified Approach: the Principal Factor Solution; The Maximum Likelihood Method; A Large Sample Test for the Number of Common Factors
9.4 Factor Rotation
    Oblique Rotations
9.5 Factor Scores
    The Weighted Least Squares Method; The Regression Method
9.6 Perspectives and a Strategy for Factor Analysis
9.7 Structural Equation Models
    The LISREL Model; Construction of a Path Diagram; Covariance Structure; Estimation; Model-Fitting Strategy
Supplement 9A: Some Computational Details for Maximum Likelihood Estimation
    Recommended Computational Scheme; Maximum Likelihood Estimators of $\rho = L_z L_z' + \Psi_z$
Exercises
References

10 CANONICAL CORRELATION ANALYSIS
10.1 Introduction
10.2 Canonical Variates and Canonical Correlations
10.3 Interpreting the Population Canonical Variables
    Identifying the Canonical Variables; Canonical Correlations as Generalizations of Other Correlation Coefficients; The First r Canonical Variables as a Summary of Variability; A Geometrical Interpretation of the Population Canonical Correlation Analysis
10.4 The Sample Canonical Variates and Sample Canonical Correlations
10.5 Additional Sample Descriptive Measures
    Matrices of Errors of Approximations; Proportions of Explained Sample Variance
10.6 Large Sample Inferences
Exercises
References

11 DISCRIMINATION AND CLASSIFICATION
11.1 Introduction
11.2 Separation and Classification for Two Populations
11.3 Classification with Two Multivariate Normal Populations
    Classification of Normal Populations When $\Sigma_1 = \Sigma_2 = \Sigma$; Scaling; Classification of Normal Populations When $\Sigma_1 \neq \Sigma_2$
11.4 Evaluating Classification Functions
11.5 Fisher's Discriminant Function: Separation of Populations
11.6 Classification with Several Populations
    The Minimum Expected Cost of Misclassification Method; Classification with Normal Populations
11.7 Fisher's Method for Discriminating among Several Populations
    Using Fisher's Discriminants to Classify Objects
11.8 Final Comments
    Including Qualitative Variables; Classification Trees; Neural Networks; Selection of Variables; Testing for Group Differences; Graphics; Practical Considerations Regarding Multivariate Normality
Exercises
References

12 CLUSTERING, DISTANCE METHODS, AND ORDINATION
12.1 Introduction
12.2 Similarity Measures
    Distances and Similarity Coefficients for Pairs of Items; Similarities and Association Measures for Pairs of Variables; Concluding Comments on Similarity
12.3 Hierarchical Clustering Methods
    Single Linkage; Complete Linkage; Average Linkage; Ward's Hierarchical Clustering Method; Final Comments: Hierarchical Procedures
12.4 Nonhierarchical Clustering Methods
    K-means Method; Final Comments: Nonhierarchical Procedures
12.5 Multidimensional Scaling
    The Basic Algorithm
12.6 Correspondence Analysis
    Algebraic Development of Correspondence Analysis; Inertia; Interpretation in Two Dimensions; Final Comments
12.7 Biplots for Viewing Sampling Units and Variables
    Constructing Biplots
12.8 Procrustes Analysis: A Method for Comparing Configurations
    Constructing the Procrustes Measure of Agreement
Supplement 12A: Data Mining
    Introduction; The Data Mining Process; Model Assessment
Exercises
References

APPENDIX
DATA INDEX
SUBJECT INDEX
Preface
INTENDED AUDIENCE

This book originally grew out of our lecture notes for an "Applied Multivariate Analysis" course offered jointly by the Statistics Department and the School of Business at the University of Wisconsin-Madison. Applied Multivariate Statistical Analysis, Fifth Edition, is concerned with statistical methods for describing and analyzing multivariate data. Data analysis, while interesting with one variable, becomes truly fascinating and challenging when several variables are involved. Researchers in the biological, physical, and social sciences frequently collect measurements on several variables. Modern computer packages readily provide the numerical results to rather complex statistical analyses. We have tried to provide readers with the supporting knowledge necessary for making proper interpretations, selecting appropriate techniques, and understanding their strengths and weaknesses. We hope our discussions will meet the needs of experimental scientists, in a wide variety of subject matter areas, as a readable introduction to the statistical analysis of multivariate observations.

LEVEL

Our aim is to present the concepts and methods of multivariate analysis at a level that is readily understandable by readers who have taken two or more statistics courses. We emphasize the applications of multivariate methods and, consequently, have attempted to make the mathematics as palatable as possible. We avoid the use of calculus. On the other hand, the concepts of a matrix and of matrix manipulations are important. We do not assume the reader is familiar with matrix algebra. Rather, we introduce matrices as they appear naturally in our discussions, and we then show how they simplify the presentation of multivariate models and techniques.

The introductory account of matrix algebra, in Chapter 2, highlights the more important matrix algebra results as they apply to multivariate analysis. The Chapter 2 supplement provides a summary of matrix algebra results for those with little or no previous exposure to the subject. This supplementary material helps make the book self-contained and is used to complete proofs. The proofs may be ignored on the first reading. In this way we hope to make the book accessible to a wide audience.

In our attempt to make the study of multivariate analysis appealing to a large audience of both practitioners and theoreticians, we have had to sacrifice a consistency of level. Some sections are harder than others. In particular, we have summarized a voluminous amount of material on regression in Chapter 7. The resulting presentation is rather succinct and difficult the first time through. We hope instructors will be able to compensate for the unevenness in level by judiciously choosing those sections, and subsections, appropriate for their students and by toning them down if necessary.

ORGANIZATION AND APPROACH

The methodological "tools" of multivariate analysis are contained in Chapters 5 through 12. These chapters represent the heart of the book, but they cannot be assimilated without much of the material in the introductory Chapters 1 through 4. Even those readers with a good knowledge of matrix algebra or those willing to accept the mathematical results on faith should, at the very least, peruse Chapter 3, "Sample Geometry," and Chapter 4, "Multivariate Normal Distribution."

Our approach in the methodological chapters is to keep the discussion direct and uncluttered. Typically, we start with a formulation of the population models, delineate the corresponding sample results, and liberally illustrate everything with examples. The examples are of two types: those that are simple and whose calculations can be easily done by hand, and those that rely on real-world data and computer software. These will provide an opportunity to (1) duplicate our analyses, (2) carry out the analyses dictated by exercises, or (3) analyze the data using methods other than the ones we have used or suggested.

The division of the methodological chapters (5 through 12) into three units allows instructors some flexibility in tailoring a course to their needs. Possible sequences for a one-semester (two quarter) course are indicated schematically. Each instructor will undoubtedly omit certain sections from some chapters to cover a broader collection of topics than is indicated by these two choices.

[Diagram: two possible course sequences. (1) Getting Started (Chapters 1-4), then Inference About Means (Chapters 5-7), then Analysis of Covariance Structure (Chapters 8-10). (2) Getting Started (Chapters 1-4), then Classification and Grouping (Chapters 11 and 12), then Analysis of Covariance Structure (Chapters 8-10).]
For most students, we would suggest a quick pass through the first four chapters (concentrating primarily on the material in Chapter 1; Sections 2.1, 2.2, 2.3, 2.5, 2.6, and 3.6; and the "assessing normality" material in Chapter 4) followed by a selection of methodological topics. For example, one might discuss the comparison of mean vectors, principal components, factor analysis, discriminant analysis and clustering. The discussions could feature the many "worked out" examples included in these sections of the text. Instructors may rely on diagrams and verbal descriptions to teach the corresponding theoretical developments. If the students have uniformly strong mathematical backgrounds, much of the book can successfully be covered in one term.

We have found individual data-analysis projects useful for integrating material from several of the methods chapters. Here, our rather complete treatments of multivariate analysis of variance (MANOVA), regression analysis, factor analysis, canonical correlation, discriminant analysis, and so forth are helpful, even though they may not be specifically covered in lectures.

CHANGES TO THE FIFTH EDITION

New material. Users of the previous editions will notice that we have added several exercises and data sets, some new graphics, and have expanded the discussion of the dimensionality of multivariate data, growth curves and classification and regression trees (CART). In addition, the algebraic development of correspondence analysis has been redone and a new section on data mining has been added to Chapter 12. We put the data mining material in Chapter 12 since much of data mining, as it is now applied in business, has a classification and/or grouping objective. As always, we have tried to improve the exposition in several places.

Data CD. Recognizing the importance of modern statistical packages in the analysis of multivariate data, we have added numerous real-data sets. The full data sets used in the book are saved as ASCII files on the CD-ROM that is packaged with each copy of the book. This format will allow easy interface with existing statistical software packages and provide more convenient hands-on data analysis opportunities.

Instructors Solutions Manual. An Instructors Solutions Manual (ISBN 0-13-092555-1) containing complete solutions to most of the exercises in the book is available free upon adoption from Prentice Hall.

For information on additional for-sale supplements that may be used with the book or additional titles of interest, please visit the Prentice Hall Web site at www.prenhall.com.

ACKNOWLEDGMENTS

We thank our many colleagues who helped improve the applied aspect of the book by contributing their own data sets for examples and exercises. A number of individuals helped guide this revision, and we are grateful for their suggestions: Steve Coad, University of Michigan; Richard Kiltie, University of Florida; Sam Kotz, George Mason University; Shyamal Peddada, University of Virginia; K. Sivakumar, University of Illinois at Chicago; Eric Smith, Virginia Tech; and Stanley Wasserman, University of Illinois at Urbana-Champaign. We also acknowledge the feedback of the students we have taught these past 30 years in our applied multivariate analysis courses. Their comments and suggestions are largely responsible for the present iteration of this work. We would also like to give special thanks to Wai Kwong Cheang for his help with the calculations for many of the examples.

We must thank Dianne Hall for her valuable work on the CD-ROM and Solutions Manual, Steve Verrill for computing assistance throughout, and Alison Pollack for implementing a Chernoff faces program. We are indebted to Cliff Gilman for his assistance with the multidimensional scaling examples discussed in Chapter 12. Jacquelyn Forer did most of the typing of the original draft manuscript, and we appreciate her expertise and willingness to endure the cajoling of authors faced with publication deadlines. Finally, we would like to thank Quincy McDonald, Joanne Wendelken, Steven Scott Pawlowski, Pat Daly, Linda Behrens, Alan Fischer, and the rest of the Prentice Hall staff for their help with this project.
R. A. Johnson
rich@stat.wisc.edu
D. W. Wichern
[email protected]
Applied Multivariate Statistical Analysis
CHAPTER 1

Aspects of Multivariate Analysis

1.1 INTRODUCTION

Scientific inquiry is an iterative learning process. Objectives pertaining to the explanation of a social or physical phenomenon must be specified and then tested by gathering and analyzing data. In turn, an analysis of the data gathered by experimentation or observation will usually suggest a modified explanation of the phenomenon. Throughout this iterative learning process, variables are often added or deleted from the study. Thus, the complexities of most phenomena require an investigator to collect observations on many different variables.

This book is concerned with statistical methods designed to elicit information from these kinds of data sets. Because the data include simultaneous measurements on many variables, this body of methodology is called multivariate analysis.

The need to understand the relationships between many variables makes multivariate analysis an inherently difficult subject. Often, the human mind is overwhelmed by the sheer bulk of the data. Additionally, more mathematics is required to derive multivariate statistical techniques for making inferences than in a univariate setting. We have chosen to provide explanations based upon algebraic concepts and to avoid the derivations of statistical results that require the calculus of many variables. Our objective is to introduce several useful multivariate techniques in a clear manner, making heavy use of illustrative examples and a minimum of mathematics. Nonetheless, some mathematical sophistication and a desire to think quantitatively will be required.

Most of our emphasis will be on the analysis of measurements obtained without actively controlling or manipulating any of the variables on which the measurements are made. Only in Chapters 6 and 7 shall we treat a few experimental plans (designs) for generating data that prescribe the active manipulation of important variables. Although the experimental design is ordinarily the most important part of a scientific investigation, it is frequently impossible to control the generation of appropriate data in certain disciplines. (This is true, for example, in business, economics, ecology, geology, and sociology.) You should consult [7] and [8] for detailed accounts of design principles that, fortunately, also apply to multivariate situations.

It will become increasingly clear that many multivariate methods are based upon an underlying probability model known as the multivariate normal distribution. Other methods are ad hoc in nature and are justified by logical or commonsense arguments. Regardless of their origin, multivariate techniques must, invariably, be implemented on a computer. Recent advances in computer technology have been accompanied by the development of rather sophisticated statistical software packages, making the implementation step easier.

Multivariate analysis is a "mixed bag." It is difficult to establish a classification scheme for multivariate techniques that both is widely accepted and indicates the appropriateness of the techniques. One classification distinguishes techniques designed to study interdependent relationships from those designed to study dependent relationships. Another classifies techniques according to the number of populations and the number of sets of variables being studied. Chapters in this text are divided into sections according to inference about treatment means, inference about covariance structure, and techniques for sorting or grouping. This should not, however, be considered an attempt to place each method into a slot. Rather, the choice of methods and the types of analyses employed are largely determined by the objectives of the investigation. In Section 1.2, we list a smaller number of practical problems designed to illustrate the connection between the choice of a statistical method and the objectives of the study. These problems, plus the examples in the text, should provide you with an appreciation for the applicability of multivariate techniques across different fields.

The objectives of scientific investigations to which multivariate methods most naturally lend themselves include the following:
1. Data reduction or structural simplification. The phenomenon being studied is represented as simply as possible without sacrificing valuable information. It is hoped that this will make interpretation easier.
2. Sorting and grouping. Groups of "similar" objects or variables are created, based upon measured characteristics. Alternatively, rules for classifying objects into well-defined groups may be required.
3. Investigation of the dependence among variables. The nature of the relationships among variables is of interest. Are all the variables mutually independent or are one or more variables dependent on the others? If so, how?
4. Prediction. Relationships between variables must be determined for the purpose of predicting the values of one or more variables on the basis of observations on the other variables.
5. Hypothesis construction and testing. Specific statistical hypotheses, formulated in terms of the parameters of multivariate populations, are tested. This may be done to validate assumptions or to reinforce prior convictions.

We conclude this brief overview of multivariate analysis with a quotation from F. H. C. Marriott [19], page 89. The statement was made in a discussion of cluster analysis, but we feel it is appropriate for a broader range of methods. You should keep it in mind whenever you attempt or read about a data analysis. It allows one to maintain a proper perspective and not be overwhelmed by the elegance of some of the theory:

    If the results disagree with informed opinion, do not admit a simple logical interpretation, and do not show up clearly in a graphical presentation, they are probably wrong. There is no magic about numerical methods, and many ways in which they can break down. They are a valuable aid to the interpretation of data, not sausage machines automatically transforming bodies of numbers into packets of scientific fact.
1.2 APPLICATIONS OF MULTIVARIATE TECHNIQUES

The published applications of multivariate methods have increased tremendously in recent years. It is now difficult to cover the variety of real-world applications of these methods with brief discussions, as we did in earlier editions of this book. However, in order to give some indication of the usefulness of multivariate techniques, we offer the following short descriptions of the results of studies from several disciplines. These descriptions are organized according to the categories of objectives given in the previous section. Of course, many of our examples are multifaceted and could be placed in more than one category.

Data reduction or simplification

• Using data on several variables related to cancer patient responses to radiotherapy, a simple measure of patient response to radiotherapy was constructed. (See Exercise 1.15.)
• Track records from many nations were used to develop an index of performance for both male and female athletes. (See [10] and [22].)
• Multispectral image data collected by a high-altitude scanner were reduced to a form that could be viewed as images (pictures) of a shoreline in two dimensions. (See [23].)
• Data on several variables relating to yield and protein content were used to create an index to select parents of subsequent generations of improved bean plants. (See [14].)
• A matrix of tactic similarities was developed from aggregate data derived from professional mediators. From this matrix the number of dimensions by which professional mediators judge the tactics they use in resolving disputes was determined. (See [21].)
Sorting and grouping

• Data on several variables related to computer use were employed to create clusters of categories of computer jobs that allow a better determination of existing (or planned) computer utilization. (See [2].)
• Measurements of several physiological variables were used to develop a screening procedure that discriminates alcoholics from nonalcoholics. (See [26].)
• Data related to responses to visual stimuli were used to develop a rule for separating people suffering from a multiple-sclerosis-caused visual pathology from those not suffering from the disease. (See Exercise 1.14.)
• The U. S. Internal Revenue Service uses data collected from tax returns to sort taxpayers into two groups: those that will be audited and those that will not. (See [31].)
Investigation of the dependence among variables

• Data on several variables were used to identify factors that were responsible for client success in hiring external consultants. (See [13].)
• Measurements of variables related to innovation, on the one hand, and variables related to the business environment and business organization, on the other hand, were used to discover why some firms are product innovators and some firms are not. (See [5].)
• Data on variables representing the outcomes of the 10 decathlon events in the Olympics were used to determine the physical factors responsible for success in the decathlon. (See [17].)
• The associations between measures of risk-taking propensity and measures of socioeconomic characteristics for top-level business executives were used to assess the relation between risk-taking behavior and performance. (See [18].)
Prediction

• The associations between test scores and several high school performance variables and several college performance variables were used to develop predictors of success in college. (See [11].)
• Data on several variables related to the size distribution of sediments were used to develop rules for predicting different depositional environments. (See [9] and [20].)
• Measurements on several accounting and financial variables were used to develop a method for identifying potentially insolvent property-liability insurers. (See [28].)
• Data on several variables for chickweed plants were used to develop a method for predicting the species of a new plant. (See [4].)
Hypotheses testing
• Several pollution-related variables were measured to determine whether levels for a large metropolitan area were roughly constant throughout the week, or whether there was a noticeable difference between weekdays and weekends. (See Exercise 1.6.)
• Experimental data on several variables were used to see whether the nature of the instructions makes any difference in perceived risks, as quantified by test scores. (See [27].)
• Data on many variables were used to investigate the differences in structure of American occupations to determine the support for one of two competing sociological theories. (See [16] and [25].)
• Data on several variables were used to determine whether different types of firms in newly industrialized countries exhibited different patterns of innovation. (See [15].)
The preceding descriptions offer glimpses into the use of multivariate methods in widely diverse fields.

1.3 THE ORGANIZATION OF DATA

Throughout this text, we are going to be concerned with analyzing measurements made on several variables or characteristics. These measurements (commonly called data) must frequently be arranged and displayed in various ways. For example, graphs and tabular arrangements are important aids in data analysis. Summary numbers, which quantitatively portray certain features of the data, are also necessary to any description. We now introduce the preliminary concepts underlying these first steps of data organization.

Arrays

Multivariate data arise whenever an investigator, seeking to understand a social or physical phenomenon, selects a number $p \ge 1$ of variables or characters to record. The values of these variables are all recorded for each distinct item, individual, or experimental unit.
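We will use the notation $x_{jk}$ to indicate the particular value of the $k$th variable that is observed on the $j$th item, or trial. That is,

$$x_{jk} = \text{measurement of the } k\text{th variable on the } j\text{th item}$$

Consequently, $n$ measurements on $p$ variables can be displayed as follows:

$$\begin{array}{lcccccc}
 & \text{Variable 1} & \text{Variable 2} & \cdots & \text{Variable } k & \cdots & \text{Variable } p \\
\text{Item 1:} & x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
\text{Item 2:} & x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } j\text{:} & x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & \vdots & & \vdots & & \vdots \\
\text{Item } n\text{:} & x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{array}$$

Or we can display these data as a rectangular array, called $\mathbf{X}$, of $n$ rows and $p$ columns:

$$\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1k} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2k} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{j1} & x_{j2} & \cdots & x_{jk} & \cdots & x_{jp} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}$$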
The array $\mathbf{X}$, then, contains the data consisting of all of the observations on all of the variables.

Example 1.1 (A data array)

A selection of four receipts from a university bookstore was obtained in order to investigate the nature of book sales. Each receipt provided, among other things, the number of books sold and the total amount of each sale. Let the first variable be total dollar sales and the second variable be number of books sold. Then we can regard the corresponding numbers on the receipts as four measurements on two variables. Suppose the data, in tabular form, are

Variable 1 (dollar sales):      42   52   48   58
Variable 2 (number of books):    4    5    4    3

Using the notation just introduced, we have

$$x_{11} = 42 \quad x_{21} = 52 \quad x_{31} = 48 \quad x_{41} = 58$$
$$x_{12} = 4 \quad x_{22} = 5 \quad x_{32} = 4 \quad x_{42} = 3$$

and the data array $\mathbf{X}$ is

$$\mathbf{X} = \begin{bmatrix} 42 & 4 \\ 52 & 5 \\ 48 & 4 \\ 58 & 3 \end{bmatrix}$$

with four rows and two columns. •
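The data array of Example 1.1 maps directly onto a list of rows. The following is a minimal sketch, assuming the array is stored as nested Python lists; the helper name `entry` is illustrative and only converts the text's 1-based subscripts to Python's 0-based indexing.

```python
# Data array X for the bookstore receipts of Example 1.1.
# Rows are items (receipts); columns are variables
# (first column: dollar sales, second column: number of books).
X = [
    [42, 4],
    [52, 5],
    [48, 4],
    [58, 3],
]

n = len(X)      # number of items (rows)
p = len(X[0])   # number of variables (columns)


def entry(X, j, k):
    """Return x_jk using the text's 1-based item index j and variable index k."""
    return X[j - 1][k - 1]


print(n, p)            # 4 2
print(entry(X, 2, 1))  # 52, the dollar sales on the second receipt
print(entry(X, 4, 2))  # 3, the number of books on the fourth receipt
```

Storing one item per row keeps the code aligned with the convention that the first subscript of $x_{jk}$ indexes the item and the second indexes the variable.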
Considering data in the form of arrays facilitates the exposition of the subject matter and allows numerical calculations to be performed in an orderly and efficient manner. The efficiency is twofold, as gains are attained in both (1) describing numerical calculations as operations on arrays and (2) the implementation of the calculations on computers, which now use many languages and statistical packages to perform array operations. We consider the manipulation of arrays of numbers in Chapter 2. At this point, we are concerned only with their value as devices for displaying data.

Descriptive Statistics

A large data set is bulky, and its very mass poses a serious obstacle to any attempt to visually extract pertinent information. Much of the information contained in the data can be assessed by calculating certain summary numbers, known as descriptive statistics. For example, the arithmetic average, or sample mean, is a descriptive statistic that provides a measure of location, that is, a "central value" for a set of numbers. And the average of the squares of the distances of all of the numbers from the mean provides a measure of the spread, or variation, in the numbers.

We shall rely most heavily on descriptive statistics that measure location, variation, and linear association. The formal definitions of these quantities follow.
Let $x_{11}, x_{21}, \ldots, x_{n1}$ be $n$ measurements on the first variable. Then the arithmetic average of these measurements is

$$\bar{x}_1 = \frac{1}{n}\sum_{j=1}^{n} x_{j1}$$

If the $n$ measurements represent a subset of the full set of measurements that might have been observed, then $\bar{x}_1$ is also called the sample mean for the first variable. We adopt this terminology because the bulk of this book is devoted to procedures designed for analyzing samples of measurements from larger collections.

The sample mean can be computed from the $n$ measurements on each of the $p$ variables, so that, in general, there will be $p$ sample means:

$$\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk} \qquad k = 1, 2, \ldots, p \tag{1-1}$$

A measure of spread is provided by the sample variance, defined for $n$ measurements on the first variable as

$$s_1^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2$$

where $\bar{x}_1$ is the sample mean of the $x_{j1}$'s. In general, for $p$ variables, we have

$$s_k^2 = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-2}$$

Two comments are in order. First, many authors define the sample variance with a divisor of $n - 1$ rather than $n$. Later we shall see that there are theoretical reasons for doing this, and it is particularly appropriate if the number of measurements, $n$, is small. The two versions of the sample variance will always be differentiated by displaying the appropriate expression.

Second, although the $s^2$ notation is traditionally used to indicate the sample variance, we shall eventually consider an array of quantities in which the sample variances lie along the main diagonal. In this situation, it is convenient to use double subscripts on the variances in order to indicate their positions in the array. Therefore, we introduce the notation $s_{kk}$ to denote the same variance computed from measurements on the $k$th variable, and we have the notational identities

$$s_k^2 = s_{kk} = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-3}$$
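As a small illustration of definitions (1-1) through (1-3), the sketch below computes the sample mean and both versions of the sample variance for the dollar sales of Example 1.1. The function names and the `divisor_n` flag are illustrative choices, not notation from the text.

```python
def sample_mean(values):
    """Arithmetic average of the measurements, as in (1-1)."""
    return sum(values) / len(values)


def sample_variance(values, divisor_n=True):
    """Sample variance as in (1-2); divisor n by default, n - 1 otherwise."""
    n = len(values)
    mean = sample_mean(values)
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return sum_of_squares / n if divisor_n else sum_of_squares / (n - 1)


dollar_sales = [42, 52, 48, 58]  # variable 1 of Example 1.1

mean_sales = sample_mean(dollar_sales)
var_n = sample_variance(dollar_sales)                           # divisor n
var_n_minus_1 = sample_variance(dollar_sales, divisor_n=False)  # divisor n - 1

print(mean_sales)     # 50.0
print(var_n)          # 34.0
print(var_n_minus_1)  # 136/3, about 45.33
```

With only four measurements the two divisors give noticeably different values (34 versus about 45.3), which is why the text always displays the expression being used.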
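The square root of the sample variance, $\sqrt{s_{kk}}$, is known as the sample standard deviation. This measure of variation is in the same units as the observations.

Consider $n$ pairs of measurements on each of variables 1 and 2:

$$\begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix}, \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix}, \ldots, \begin{bmatrix} x_{n1} \\ x_{n2} \end{bmatrix}$$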
That is, $x_{j1}$ and $x_{j2}$ are observed on the $j$th experimental item ($j = 1, 2, \ldots, n$). A measure of linear association between the measurements of variables 1 and 2 is provided by the sample covariance

$$s_{12} = \frac{1}{n}\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)$$

or the average product of the deviations from their respective means. If large values for one variable are observed in conjunction with large values for the other variable, and the small values also occur together, $s_{12}$ will be positive. If large values from one variable occur with small values for the other variable, $s_{12}$ will be negative. If there is no particular association between the values for the two variables, $s_{12}$ will be approximately zero.

The sample covariance

$$s_{ik} = \frac{1}{n}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-4}$$

measures the association between the $i$th and $k$th variables. We note that the covariance reduces to the sample variance when $i = k$. Moreover, $s_{ik} = s_{ki}$ for all $i$ and $k$.

The final descriptive statistic considered here is the sample correlation coefficient (or Pearson's product-moment correlation coefficient; see [3]). This measure of the linear association between two variables does not depend on the units of measurement. The sample correlation coefficient for the $i$th and $k$th variables is defined as

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\displaystyle\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\,\sqrt{\displaystyle\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}} \tag{1-5}$$

for $i = 1, 2, \ldots, p$ and $k = 1, 2, \ldots, p$. Note $r_{ik} = r_{ki}$ for all $i$ and $k$.
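The definitions (1-4) and (1-5), together with the remark that the correlation does not depend on the units of measurement, can be checked numerically. The sketch below uses the receipts of Example 1.1 and re-expresses dollar sales in cents; the function names are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(x, y):
    """Sample covariance with divisor n, as in (1-4)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def sample_corr(x, y):
    """Sample correlation coefficient, as in (1-5)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))


dollars = [42, 52, 48, 58]
books = [4, 5, 4, 3]
cents = [100 * d for d in dollars]  # same sales in different units

s12 = sample_cov(dollars, books)        # covariance in dollar units
s12_cents = sample_cov(cents, books)    # covariance scales with the units
r12 = sample_corr(dollars, books)
r12_cents = sample_corr(cents, books)   # correlation is unchanged

print(s12, s12_cents)
print(round(r12, 2), round(r12_cents, 2))
```

Here `sample_cov(x, x)` returns the sample variance, which mirrors the observation that the covariance reduces to the variance when $i = k$.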
The sample correlation coefficient is a standardized version of the sample covariance, where the product of the square roots of the sample variances provides the standardization. Notice that $r_{ik}$ has the same value whether $n$ or $n - 1$ is chosen as the common divisor for $s_{ii}$, $s_{kk}$, and $s_{ik}$.

The sample correlation coefficient $r_{ik}$ can also be viewed as a sample covariance. Suppose the original values $x_{ji}$ and $x_{jk}$ are replaced by standardized values $(x_{ji} - \bar{x}_i)/\sqrt{s_{ii}}$ and $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The standardized values are commensurable because both sets are centered at zero and expressed in standard deviation units. The sample correlation coefficient is just the sample covariance of the standardized observations.

Although the signs of the sample correlation and the sample covariance are the same, the correlation is ordinarily easier to interpret because its magnitude is bounded. To summarize, the sample correlation $r$ has the following properties:

1. The value of $r$ must be between $-1$ and $+1$.
2. Here $r$ measures the strength of the linear association. If $r = 0$, this implies a lack of linear association between the components. Otherwise, the sign of $r$ indicates the direction of the association: $r < 0$ implies a tendency for one value in the pair to be larger than its average when the other is smaller than its average; and $r > 0$ implies a tendency for one value of the pair to be large when the other value is large and also for both values to be small together.
3. The value of $r_{ik}$ remains unchanged if the measurements of the $i$th variable are changed to $y_{ji} = a x_{ji} + b$, $j = 1, 2, \ldots, n$, and the values of the $k$th variable are changed to $y_{jk} = c x_{jk} + d$, $j = 1, 2, \ldots, n$, provided that the constants $a$ and $c$ have the same sign.

The quantities $s_{ik}$ and $r_{ik}$ do not, in general, convey all there is to know about the association between two variables. Nonlinear associations can exist that are not revealed by these descriptive statistics. Covariance and correlation provide measures of linear association, or association along a line. Their values are less informative for other kinds of association. On the other hand, these quantities can be very sensitive to "wild" observations ("outliers") and may indicate association when, in fact, little exists. In spite of these shortcomings, covariance and correlation coefficients are routinely calculated and analyzed. They provide cogent numerical summaries of association when the data do not exhibit obvious nonlinear patterns of association and when wild observations are not present.

Suspect observations must be accounted for by correcting obvious recording mistakes and by taking actions consistent with the identified causes. The values of $s_{ik}$ and $r_{ik}$ should be quoted both with and without these observations.

The sum of squares of the deviations from the mean and the sum of cross-product deviations are often of interest themselves. These quantities are

$$w_{kk} = \sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2 \qquad k = 1, 2, \ldots, p \tag{1-6}$$

and

$$w_{ik} = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \qquad i = 1, 2, \ldots, p, \quad k = 1, 2, \ldots, p \tag{1-7}$$
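Comparing (1-6) and (1-7) with (1-4) and (1-5) shows how these sums relate to the earlier statistics, and why the choice of divisor does not affect the correlation. The following short derivation is a sketch that uses the four bookstore receipts of Example 1.1 (whose means are 50 and 4) as a numerical check.

$$s_{ik} = \frac{w_{ik}}{n}, \qquad r_{ik} = \frac{w_{ik}/n}{\sqrt{w_{ii}/n}\,\sqrt{w_{kk}/n}} = \frac{w_{ik}}{\sqrt{w_{ii}}\,\sqrt{w_{kk}}}$$

$$w_{11} = (-8)^2 + 2^2 + (-2)^2 + 8^2 = 136, \qquad w_{22} = 0^2 + 1^2 + 0^2 + (-1)^2 = 2$$

$$w_{12} = (-8)(0) + (2)(1) + (-2)(0) + (8)(-1) = -6$$

$$r_{12} = \frac{-6}{\sqrt{136}\,\sqrt{2}} = \frac{-6}{\sqrt{272}} \approx -.36$$

The divisor cancels in the ratio, so replacing $n$ by $n - 1$ throughout leaves $r_{12}$ unchanged.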
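The descriptive statistics computed from $n$ measurements on $p$ variables can also be organized into arrays:

$$\text{Sample means} \qquad \bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}$$

$$\text{Sample variances and covariances} \qquad \mathbf{S}_n = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{bmatrix}$$

$$\text{Sample correlations} \qquad \mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}$$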
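One way to assemble the mean array, the variance-covariance array with divisor $n$, and the correlation array from a data array is sketched below. It assumes the data are stored as a list of rows and uses the receipts of Example 1.1 as input; the function names are illustrative.

```python
import math


def mean_array(X):
    """Sample means, one per variable (column) of the data array."""
    n, p = len(X), len(X[0])
    return [sum(row[k] for row in X) / n for k in range(p)]


def covariance_array(X):
    """p x p array of sample variances and covariances with divisor n."""
    n, p = len(X), len(X[0])
    xbar = mean_array(X)
    return [
        [sum((row[i] - xbar[i]) * (row[k] - xbar[k]) for row in X) / n
         for k in range(p)]
        for i in range(p)
    ]


def correlation_array(S):
    """p x p array of sample correlations computed from a covariance array."""
    p = len(S)
    return [
        [S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)]
        for i in range(p)
    ]


X = [[42, 4], [52, 5], [48, 4], [58, 3]]  # receipts of Example 1.1

xbar = mean_array(X)
S_n = covariance_array(X)
R = correlation_array(S_n)

print(xbar)
print(S_n)
print([[round(r, 2) for r in row] for row in R])
```

Both `S_n` and `R` come out symmetric, and `R` has ones on its diagonal, because each variable's covariance with itself is its variance.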
The sample mean array is denoted by $\bar{\mathbf{x}}$, the sample variance and covariance array by the capital letter $\mathbf{S}_n$, and the sample correlation array by $\mathbf{R}$. The subscript $n$ on the array $\mathbf{S}_n$ is a mnemonic device used to remind you that $n$ is employed as a divisor for the elements $s_{ik}$. The size of all of the arrays is determined by the number of variables, $p$.

The arrays $\mathbf{S}_n$ and $\mathbf{R}$ consist of $p$ rows and $p$ columns. The array $\bar{\mathbf{x}}$ is a single column with $p$ rows. The first subscript on an entry in arrays $\mathbf{S}_n$ and $\mathbf{R}$ indicates the row; the second subscript indicates the column. Since $s_{ik} = s_{ki}$ and $r_{ik} = r_{ki}$ for all $i$ and $k$, the entries in symmetric positions about the main northwest-southeast diagonals in arrays $\mathbf{S}_n$ and $\mathbf{R}$ are the same, and the arrays are said to be symmetric.

Example 1.2 (The arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$ for bivariate data)
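Consider the data introduced in Example 1.1. Each receipt yields a pair of measurements, total dollar sales and number of books sold. Find the arrays $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$.

Since there are four receipts, we have a total of four measurements (observations) on each variable. The sample means are

$$\bar{x}_1 = \frac{1}{4}\sum_{j=1}^{4} x_{j1} = \frac{1}{4}(42 + 52 + 48 + 58) = 50$$

$$\bar{x}_2 = \frac{1}{4}\sum_{j=1}^{4} x_{j2} = \frac{1}{4}(4 + 5 + 4 + 3) = 4$$

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} 50 \\ 4 \end{bmatrix}$$

The sample variances and covariances are

$$s_{11} = \frac{1}{4}\sum_{j=1}^{4} (x_{j1} - \bar{x}_1)^2 = \frac{1}{4}\left((42 - 50)^2 + (52 - 50)^2 + (48 - 50)^2 + (58 - 50)^2\right) = 34$$

$$s_{22} = \frac{1}{4}\sum_{j=1}^{4} (x_{j2} - \bar{x}_2)^2 = \frac{1}{4}\left((4 - 4)^2 + (5 - 4)^2 + (4 - 4)^2 + (3 - 4)^2\right) = .5$$

$$s_{12} = \frac{1}{4}\sum_{j=1}^{4} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2) = \frac{1}{4}\left((42 - 50)(4 - 4) + (52 - 50)(5 - 4) + (48 - 50)(4 - 4) + (58 - 50)(3 - 4)\right) = -1.5$$

$$s_{21} = s_{12}$$

and

$$\mathbf{S}_n = \begin{bmatrix} 34 & -1.5 \\ -1.5 & .5 \end{bmatrix}$$

The sample correlation is

$$r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} = \frac{-1.5}{\sqrt{34}\sqrt{.5}} = -.36$$

$$r_{21} = r_{12}$$

so

$$\mathbf{R} = \begin{bmatrix} 1 & -.36 \\ -.36 & 1 \end{bmatrix}$$ •

Graphical Techniques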
Plots are important, but frequently neglected, aids in data analysis. Although it is impossible to simultaneously plot all the measurements made on several variables and study the configurations, plots of individual variables and plots of pairs of variables can still be very informative. Sophisticated computer programs and display equipment allow one the luxury of visually examining data in one, two, or three dimensions with relative ease. On the other hand, many valuable insights can be obtained from the data by constructing plots with paper and pencil. Simple, yet elegant and effective, methods for displaying data are available in [29]. It is good statistical practice to plot pairs of variables and visually inspect the pattern of association. Consider, then, the following seven pairs of measurements on two variables:

Variable 1 ($x_1$):   3   4     2   6   8    2   5
Variable 2 ($x_2$):   5   5.5   4   7   10   5   7.5

These data are plotted as seven points in two dimensions (each axis representing a variable) in Figure 1.1. The coordinates of the points are determined by the paired measurements: $(3, 5), (4, 5.5), \ldots, (5, 7.5)$. The resulting two-dimensional plot is known as a scatter diagram or scatter plot.
Figure 1.1 A scatter plot and marginal dot diagrams.
Figure 1.2 Scatter plot and dot diagrams for rearranged data.
Also shown in Figure 1.1 are separate plots of the observed values of variable 1 and the observed values of variable 2, respectively. These plots are called (marginal) dot diagrams. They can be obtained from the original observations or by projecting the points in the scatter diagram onto each coordinate axis.

The information contained in the single-variable dot diagrams can be used to calculate the sample means $\bar{x}_1$ and $\bar{x}_2$ and the sample variances $s_{11}$ and $s_{22}$. (See Exercise 1.1.) The scatter diagram indicates the orientation of the points, and their coordinates can be used to calculate the sample covariance $s_{12}$. In the scatter diagram of Figure 1.1, large values of $x_1$ occur with large values of $x_2$ and small values of $x_1$ with small values of $x_2$. Hence, $s_{12}$ will be positive.

Dot diagrams and scatter plots contain different kinds of information. The information in the marginal dot diagrams is not sufficient for constructing the scatter plot. As an illustration, suppose the data preceding Figure 1.1 had been paired differently, so that the measurements on the variables $x_1$ and $x_2$ were as follows:

Variable 1 ($x_1$):   5   4     6   2   2    8   3
Variable 2 ($x_2$):   5   5.5   4   7   10   5   7.5

(We have simply rearranged the values of variable 1.) The scatter and dot diagrams for the "new" data are shown in Figure 1.2. Comparing Figures 1.1 and 1.2, we find that the marginal dot diagrams are the same, but that the scatter diagrams are decidedly different. In Figure 1.2, large values of $x_1$ are paired with small values of $x_2$ and small values of $x_1$ with large values of $x_2$. Consequently, the descriptive statistics for the individual variables $\bar{x}_1$, $\bar{x}_2$, $s_{11}$, and $s_{22}$ remain unchanged, but the sample covariance $s_{12}$, which measures the association between pairs of variables, will now be negative.

The different orientations of the data in Figures 1.1 and 1.2 are not discernible from the marginal dot diagrams alone. At the same time, the fact that the marginal dot diagrams are the same in the two cases is not immediately apparent from the scatter plots. The two types of graphical procedures complement one another; they are not competitors.

The next two examples further illustrate the information that can be conveyed by a graphic display.
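The comparison of the two pairings can also be made numerically. The sketch below, with illustrative function and variable names, uses the seven pairs plotted in Figure 1.1 and the rearranged pairs of Figure 1.2 to confirm that the marginal summaries agree while the covariance changes sign.

```python
def mean(values):
    return sum(values) / len(values)


def sample_cov(x, y):
    """Sample covariance with divisor n; sample_cov(x, x) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


x2 = [5, 5.5, 4, 7, 10, 5, 7.5]
x1_original = [3, 4, 2, 6, 8, 2, 5]    # pairing shown in Figure 1.1
x1_rearranged = [5, 4, 6, 2, 2, 8, 3]  # pairing shown in Figure 1.2

# The marginal dot diagrams coincide because the same values are present.
same_marginal = sorted(x1_original) == sorted(x1_rearranged)

s11_original = sample_cov(x1_original, x1_original)
s11_rearranged = sample_cov(x1_rearranged, x1_rearranged)
s12_original = sample_cov(x1_original, x2)
s12_rearranged = sample_cov(x1_rearranged, x2)

print(same_marginal)
print(round(s11_original, 3), round(s11_rearranged, 3))
print(round(s12_original, 3), round(s12_rearranged, 3))
```

Only the covariance depends on how the values are paired, which is the information that the marginal dot diagrams cannot show.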
Example 1.3 (The effect of unusual observations on sample correlations)

Some financial data representing jobs and productivity for the 16 largest publishing firms appeared in an article in Forbes magazine on April 30, 1990. The data for the pair of variables $x_1$ = employees (jobs) and $x_2$ = profits per employee (productivity) are graphed in Figure 1.3. We have labeled two "unusual" observations. Dun & Bradstreet is the largest firm in terms of number of employees, but is "typical" in terms of profits per employee. Time Warner has a "typical" number of employees, but comparatively small (negative) profits per employee.

The sample correlation coefficient computed from the values of $x_1$ and $x_2$ is

$r_{12} = -.39$ for all 16 firms
$r_{12} = -.56$ for all firms but Dun & Bradstreet
$r_{12} = -.39$ for all firms but Time Warner
$r_{12} = -.50$ for all firms but Dun & Bradstreet and Time Warner

It is clear that atypical observations can have a considerable effect on the sample correlation coefficient. •

Figure 1.3 Profits per employee and number of employees for 16 publishing firms.

Example 1.4 (A scatter plot for baseball data)

In a July 17, 1978, article on money in sports, Sports Illustrated magazine provided data on $x_1$ = player payroll for National League East baseball teams. We have added data on $x_2$ = won-lost percentage for 1977. The results are given in Table 1.1.

The scatter plot in Figure 1.4 supports the claim that a championship team can be bought. Of course, this cause-effect relationship cannot be substantiated, because the experiment did not include a random assignment of payrolls. Thus, statistics cannot answer the question: Could the Mets have won with $4 million to spend on player salaries? •
TABLE 1.1 1977 SALARY AND FINAL RECORD FOR THE NATIONAL LEAGUE EAST

Team                      x1 = player payroll    x2 = won-lost percentage
Philadelphia Phillies     3,497,900              .623
Pittsburgh Pirates        2,485,475              .593
St. Louis Cardinals       1,782,875              .512
Chicago Cubs              1,725,450              .500
Montreal Expos            1,645,575              .463
New York Mets             1,469,800              .395

Figure 1.4 Salaries and won-lost percentage from Table 1.1. (Horizontal axis: player payroll in millions of dollars; vertical axis: won-lost percentage.)
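The strength of the linear association visible in Figure 1.4 can be summarized with the sample correlation coefficient (1-5). The sketch below computes it for the six teams in Table 1.1; expressing payroll in millions of dollars is a choice made here for readability and does not affect the correlation.

```python
import math

# Table 1.1: player payroll (dollars) and won-lost percentage, 1977.
payroll = [3497900, 2485475, 1782875, 1725450, 1645575, 1469800]
won_lost = [.623, .593, .512, .500, .463, .395]

# Rescale payroll to millions of dollars, matching the axis of Figure 1.4.
payroll_millions = [dollars / 1e6 for dollars in payroll]


def mean(values):
    return sum(values) / len(values)


def sample_corr(x, y):
    """Sample correlation coefficient, as in (1-5)."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


r = sample_corr(payroll_millions, won_lost)
print(round(r, 2))
```

A large positive value is consistent with the upward pattern in the scatter plot, although, as the example notes, it does not establish a cause-effect relationship.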
To construct the scatter plot in Figure 1.4, we have regarded the six paired observations in Table 1.1 as the coordinates of six points in two-dimensional space. The figure allows us to examine visually the grouping of teams with respect to the variables total payroll and won-lost percentage.

Example 1.5 (Multiple scatter plots for paper strength measurements)

Paper is manufactured in continuous sheets several feet wide. Because of the orientation of fibers within the paper, it has a different strength when measured in the direction produced by the machine than when measured across, or at right angles to, the machine direction. Table 1.2 shows the measured values of

$x_1$ = density (grams/cubic centimeter)
$x_2$ = strength (pounds) in the machine direction
$x_3$ = strength (pounds) in the cross direction

A novel graphic presentation of these data appears in Figure 1.5, page 16. The scatter plots are arranged as the off-diagonal elements of a covariance array and box plots as the diagonal elements. The latter are on a different scale with this software, so we use only the overall shape to provide information on symmetry and possible outliers for each individual characteristic. The scatter plots can be inspected for patterns and unusual observations. In Figure 1.5, there is one unusual observation: the density of specimen 25. Some of the scatter plots have patterns suggesting that there are two separate clumps of observations.

These scatter plot arrays are further pursued in our discussion of new software graphics in the next section. •

TABLE 1.2 PAPER-QUALITY MEASUREMENTS

                         Strength
Specimen   Density   Machine direction   Cross direction
1          .801      121.41              70.42
2          .824      127.70              72.47
3          .841      129.20              78.20
4          .816      131.80              74.89
5          .840      135.10              71.21
6          .842      131.50              78.39
7          .820      126.70              69.02
8          .802      115.10              73.10
9          .828      130.80              79.28
10         .819      124.60              76.48
11         .826      118.31              70.25
12         .802      114.20              72.88
13         .810      120.30              68.23
14         .802      115.70              68.12
15         .832      117.51              71.62
16         .796      109.81              53.10
17         .759      109.10              50.85
18         .770      115.10              51.68
19         .759      118.31              50.60
20         .772      112.60              53.51
21         .806      116.20              56.53
22         .803      118.00              70.70
23         .845      131.00              74.35
24         .822      125.70              68.29
26         .816      125.80              70.64
27         .836      125.50              76.33
28         .815      127.80              76.75
29         .822      130.50              80.33
30         .822      127.90              75.68
31         .843      123.90              78.54
32         .824      124.10              71.91
33         .788      120.80              68.22
34         .782      107.40              54.42
35         .795      120.70              70.41
36         .805      121.91              73.68
37         .836      122.31              74.93
38         .788      110.60              53.52
39         .772      103.51              48.93
40         .776      110.71              53.67
41         .758      113.80              52.42

Source: Data courtesy of SONOCO Products Company.

Figure 1.5 Scatter plots and boxplots of paper-quality data from Table 1.2.

In the general multiresponse situation, $p$ variables are simultaneously recorded on $n$ items. Scatter plots should be made for pairs of important variables and, if the task is not too great to warrant the effort, for all pairs.

Limited as we are to a three-dimensional world, we cannot always picture an entire set of data. However, two further geometric representations of the data provide an important conceptual framework for viewing multivariable statistical methods. In cases where it is possible to capture the essence of the data in three dimensions, these representations can actually be graphed.
n Points in p Dimensions (p-Dimensional Scatter Plot). Consider the natural extension of the scatter plot to $p$ dimensions, where the $p$ measurements

$$[x_{j1}, x_{j2}, \ldots, x_{jp}]$$

on the $j$th item represent the coordinates of a point in $p$-dimensional space. The coordinate axes are taken to correspond to the variables, so that the $j$th point is $x_{j1}$ units along the first axis, $x_{j2}$ units along the second, $\ldots$, $x_{jp}$ units along the $p$th axis. The resulting plot with $n$ points not only will exhibit the overall pattern of variability, but also will show similarities (and differences) among the $n$ items. Groupings of items will manifest themselves in this representation.

The next example illustrates a three-dimensional scatter plot.

Example 1.6 (Looking for lower-dimensional structure)

A zoologist obtained measurements on $n = 25$ lizards known scientifically as Cophosaurus texanus. The weight, or mass, is given in grams while the snout-vent length (SVL) and hind limb span (HLS) are given in millimeters. The data are displayed in Table 1.3.

TABLE 1.3 LIZARD SIZE DATA

Lizard   Mass     SVL    HLS
1        5.526    59.0   113.5
2        10.401   75.0   142.0
3        9.213    69.0   124.0
4        8.953    67.5   125.0
5        7.063    62.0   129.5
6        6.610    62.0   123.0
7        11.273   74.0   140.0
8        2.447    47.0   97.0
9        15.493   86.5   162.0
10       9.004    69.0   126.5
11       8.199    70.5   136.0
12       6.601    64.5   116.0
13       7.622    67.5   135.0
14       10.067   73.0   136.5
15       10.091   73.0   135.5
16       10.888   77.0   139.0
17       7.610    61.5   118.0
18       7.733    66.5   133.5
19       12.015   79.5   150.0
20       10.049   74.0   137.0
21       5.149    59.5   116.0
22       9.158    68.0   123.0
23       12.132   75.0   141.0
24       6.978    66.5   117.0
25       6.890    63.0   117.0

Source: Data courtesy of Kevin E. Bonine.
Although there are three size measurements, we can ask whether or not most of the variation is primarily restricted to two dimensions or even to one dimension.

To help answer questions regarding reduced dimensionality, we construct the three-dimensional scatter plot in Figure 1.6. Clearly most of the variation is scatter about a one-dimensional straight line. Knowing the position on a line along the major axes of the cloud of points would be almost as good as knowing the three measurements Mass, SVL, and HLS.

However, this kind of analysis can be misleading if one variable has a much larger variance than the others. Consequently, we first calculate the standardized values, $z_{jk} = (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$, so the variables contribute equally to the variation in the scatter plot. Figure 1.7 gives the three-dimensional scatter plot for the standardized variables. Most of the variation can be explained by a single variable determined by a line through the cloud of points. •

Figure 1.6 3D scatter plot of lizard data from Table 1.3.

Figure 1.7 3D scatter plot of standardized lizard data.

A three-dimensional scatter plot can often reveal group structure.

Example 1.7 (Looking for group structure in three dimensions)
Referring to Example 1.6, it is interesting to see if male and female lizards occupy different parts of the three-dimensional space containing the size data. The gender, by row, for the lizard data in Table 1.3 are

f m f f m f m f m f m f m m m m f m m m f f m f f

Figure 1.8 repeats the scatter plot for the original variables but with males marked by solid circles and females by open circles. Clearly, males are typically larger than females. •
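The visual impression in Example 1.7 can be cross-checked by averaging each size variable separately for the two groups. The sketch below copies the measurements from Table 1.3 and the gender labels listed above; the function name `group_mean` is illustrative.

```python
# Table 1.3, in lizard order 1 through 25.
mass = [5.526, 10.401, 9.213, 8.953, 7.063, 6.610, 11.273, 2.447, 15.493,
        9.004, 8.199, 6.601, 7.622, 10.067, 10.091, 10.888, 7.610, 7.733,
        12.015, 10.049, 5.149, 9.158, 12.132, 6.978, 6.890]
svl = [59.0, 75.0, 69.0, 67.5, 62.0, 62.0, 74.0, 47.0, 86.5, 69.0, 70.5,
       64.5, 67.5, 73.0, 73.0, 77.0, 61.5, 66.5, 79.5, 74.0, 59.5, 68.0,
       75.0, 66.5, 63.0]
hls = [113.5, 142.0, 124.0, 125.0, 129.5, 123.0, 140.0, 97.0, 162.0, 126.5,
       136.0, 116.0, 135.0, 136.5, 135.5, 139.0, 118.0, 133.5, 150.0, 137.0,
       116.0, 123.0, 141.0, 117.0, 117.0]

# Gender by row, as listed in Example 1.7.
gender = "f m f f m f m f m f m f m m m m f m m m f f m f f".split()


def group_mean(values, labels, label):
    """Sample mean of the values whose label matches the requested group."""
    selected = [v for v, g in zip(values, labels) if g == label]
    return sum(selected) / len(selected)


for name, values in [("Mass", mass), ("SVL", svl), ("HLS", hls)]:
    male = group_mean(values, gender, "m")
    female = group_mean(values, gender, "f")
    print(name, round(male, 2), round(female, 2))
```

For all three variables the male mean exceeds the female mean, in line with the separation seen in the three-dimensional plot.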
Figure 1.8 3D scatter plot of male and female lizards.
p Points in n Dimensions. The $n$ observations of the $p$ variables can also be regarded as $p$ points in $n$-dimensional space. Each column of $\mathbf{X}$ determines one of the points. The $i$th column,

$$\begin{bmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{ni} \end{bmatrix}$$

consisting of all $n$ measurements on the $i$th variable, determines the $i$th point.

In Chapter 3, we show how the closeness of points in $n$ dimensions can be related to measures of association between the corresponding variables.
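The two geometric views correspond to reading the same data array by rows or by columns. A minimal sketch, using the four receipts of Example 1.1 and illustrative variable names:

```python
# Data array of Example 1.1: n = 4 items (rows), p = 2 variables (columns).
X = [[42, 4], [52, 5], [48, 4], [58, 3]]

# n points in p dimensions: each row gives the coordinates of one item.
row_points = [tuple(row) for row in X]

# p points in n dimensions: each column gives the coordinates of one variable.
column_points = list(zip(*X))

print(row_points)     # [(42, 4), (52, 5), (48, 4), (58, 3)]
print(column_points)  # [(42, 52, 48, 58), (4, 5, 4, 3)]
```

Here `zip(*X)` transposes the list of rows, so the second representation has one point per variable with as many coordinates as there are items.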
1.4 DATA DISPLAYS AND PICTORIAL REPRESENTATIONS

The rapid development of powerful personal computers and workstations has led to a proliferation of sophisticated statistical software for data analysis and graphics. It is often possible, for example, to sit at one's desk and examine the nature of multidimensional data with clever computer-generated pictures. These pictures are valuable aids in understanding data and often prevent many false starts and subsequent inferential problems.

As we shall see in Chapters 8 and 12, there are several techniques that seek to represent $p$-dimensional observations in few dimensions such that the original distances (or similarities) between pairs of observations are (nearly) preserved. In general, if multidimensional observations can be represented in two dimensions, then outliers, relationships, and distinguishable groupings can often be discerned by eye. We shall discuss and illustrate several methods for displaying multivariate data in two dimensions. One good source for more discussion of graphical methods is [12].
Linking Multiple Two-Dimensional Scatter Plots

One of the more exciting new graphical procedures involves electronically connecting many two-dimensional scatter plots.

Example 1.8 (Linked scatter plots and brushing)

To illustrate linked two-dimensional scatter plots, we refer to the paper-quality data in Table 1.2. These data represent measurements on the variables $x_1$ = density, $x_2$ = strength in the machine direction, and $x_3$ = strength in the cross direction. Figure 1.9 shows two-dimensional scatter plots for pairs of these variables organized as a $3 \times 3$ array. For example, the picture in the upper left-hand corner of the figure is a scatter plot of the pairs of observations $(x_1, x_3)$. That is, the $x_1$ values are plotted along the horizontal axis, and the $x_3$ values are plotted along the vertical axis. The lower right-hand corner of the figure contains a scatter plot of the observations $(x_3, x_1)$. That is, the axes are reversed. Corresponding interpretations hold for the other scatter plots in the figure. Notice that the variables and their three-digit ranges are indicated in the boxes along the SW-NE diagonal.

Figure 1.9 Scatter plots for the paper-quality data of Table 1.2.

The operation of marking (selecting) the obvious outlier in the $(x_1, x_3)$ scatter plot of Figure 1.9 creates Figure 1.10(a), where the outlier is labeled as specimen 25 and the same data point is highlighted in all the scatter plots. Specimen 25 also appears to be an outlier in the $(x_1, x_2)$ scatter plot but not in the $(x_2, x_3)$ scatter plot. The operation of deleting this specimen leads to the modified scatter plots of Figure 1.10(b).

Figure 1.10 Modified scatter plots for the paper-quality data with outlier (25) (a) selected and (b) deleted.

From Figure 1.10, we notice that some points in, for example, the $(x_2, x_3)$ scatter plot seem to be disconnected from the others. Selecting these points, using the (dashed) rectangle (see page 23), highlights the selected points in all of the other scatter plots and leads to the display in Figure 1.11(a). Further checking revealed that specimens 16-21, specimen 34, and specimens 38-41 were actually specimens from an older roll of paper that was included in order to have enough plies in the cardboard being manufactured. Deleting the outlier and the cases corresponding to the older paper and adjusting the ranges of the remaining observations leads to the scatter plots in Figure 1.11(b).

The operation of highlighting points corresponding to a selected range of one of the variables is called brushing. Brushing could begin with a rectangle, as in Figure 1.11(a), but then the brush could be moved to provide a sequence of highlighted points. The process can be stopped at any time to provide a snapshot of the current situation. •

Scatter plots like those in Example 1.8 are extremely useful aids in data analysis. Another important new graphical technique uses software that allows the data analyst to view high-dimensional data as slices of various three-dimensional perspectives. This can be done dynamically and continuously until informative views are obtained. A comprehensive discussion of dynamic graphical methods is available in [1]. A strategy for on-line multivariate exploratory graphical analysis, motivated by the need for a routine procedure for searching for structure in multivariate data, is given in [32].

Example 1.9 (Rotated plots in three dimensions)
Four different measurements of lumber stiffness are given in Table 4.3, page 187. In Example 4.14, specimen (board) 16 and possibly specimen (board) 9 are identified as unusual observations. Figures 1.12(a), (b), and (c) contain perspectives of the stiffness data in the $x_1, x_2, x_3$ space. These views were obtained by continually rotating and turning the three-dimensional coordinate axes. Spinning the coordinate axes allows one to get a better understanding of the three-dimensional aspects of the data. Figure 1.12(d) gives one picture of the stiffness data in $x_2, x_3, x_4$ space. Notice that Figures 1.12(a) and (d) visually confirm specimens 9 and 16 as outliers. Specimen 9 is very large in all three coordinates. A counterclockwise-like rotation of the axes in Figure 1.12(a) produces Figure 1.12(b), and the two unusual observations are masked in this view. A further spinning of the $x_2, x_3$ axes gives Figure 1.12(c); one of the outliers (16) is now hidden. Additional insights can sometimes be gleaned from visual inspection of the slowly spinning data. It is this dynamic aspect that statisticians are just beginning to understand and exploit. •

Plots like those in Figure 1.12 allow one to identify readily observations that do not conform to the rest of the data and that may heavily influence inferences based on standard data-generating models.
Figure 1.11 Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled. [Panels: Density ($x_1$), Machine ($x_2$), Cross ($x_3$).]
Figure 1.12 Three-dimensional perspectives for the lumber stiffness data: (a) outliers clear; (b) outliers masked; (c) specimen 9 large; (d) good view of $x_2, x_3, x_4$ space.
Graphs of Growth Curves

When the height of a young child is measured at each birthday, the points can be plotted and then connected by lines to produce a graph. This is an example of a growth curve. In general, repeated measurements of the same characteristic on the same unit or subject can give rise to a growth curve if an increasing, or decreasing, or even an increasing followed by a decreasing pattern is expected.

Example 1.10 (Arrays of growth curves)
The Alaska Fish and Game Department monitors grizzly bears with the goal of maintaining a healthy population. Bears are shot with a dart to induce sleep and weighed on a scale hanging from a tripod. Measurements of length are taken with a steel tape. Table 1.4 gives the weights (wt) in kilograms and lengths (lngth) in centimeters of seven female bears at 2, 3, 4, and 5 years of age.

First, for each bear, we plot the weights versus the ages and then connect the weights at successive years by straight lines. This gives an approximation to the growth curve for weight. Figure 1.13 shows the growth curves for all seven bears. The noticeable exception to a common pattern is the curve for bear 5. Is this an outlier or just natural variation in the population? In the field, bears are weighed on a scale that reads pounds. Further inspection revealed that, in this case, an assistant later failed to convert the field readings to kilograms when creating the electronic database. The correct weights are (45, 66, 84, 112) kilograms.

TABLE 1.4 FEMALE BEAR DATA

Bear   Wt2   Wt3   Wt4   Wt5   Lngth2   Lngth3   Lngth4   Lngth5
1       48    59    95    82     141      157      168      183
2       59    68   102   102     140      168      174      170
3       61    77    93   107     145      162      172      177
4       54    43   104   104     146      159      176      171
5      100   145   185   247     150      158      168      175
6       68    82    95   118     142      140      178      189
7       68    95   109   111     139      171      176      175

Source: Data courtesy of H. Roberts.

Figure 1.13 Combined growth curves for weight for seven female grizzly bears. [Horizontal axis: Year, 2.0 to 5.0.]

Because it can be difficult to inspect visually the individual growth curves in a combined plot, the individual curves should be replotted in an array where
similarities and differences are easily observed. Figure 1.14 gives the array of seven curves for weight. Some growth curves look linear and others quadratic. Figure 1.15 gives a growth curve array for length. One bear seemed to get shorter from 2 to 3 years old, but the researcher knows that the steel tape measurement of length can be thrown off by the bear's posture when sedated. •

We now turn to two popular pictorial representations of multivariate data in two dimensions: stars and Chernoff faces.

Stars

Suppose each data unit consists of nonnegative observations on $p \ge 2$ variables. In two dimensions, we can construct circles of a fixed (reference) radius with $p$ equally spaced rays emanating from the center of the circle. The lengths of the rays represent the values of the variables. The ends of the rays can be connected with straight lines to form a star. Each star represents a multivariate observation, and the stars can be grouped according to their (subjective) similarities.
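The construction of a single star can be sketched in a few lines. The following is only an illustration, not the procedure used to draw the book's figures: the function name `star_vertices`, the choice to start the first ray at 12 o'clock and proceed clockwise, and the sample observation are all assumptions made here.

```python
import math

def star_vertices(values, center=(0.0, 0.0)):
    """Return the (x, y) end points of the p rays of one star.

    values: nonnegative observations on p >= 2 variables (the ray lengths).
    Assumed convention: ray 0 points to 12 o'clock, later rays go clockwise.
    """
    p = len(values)
    if p < 2 or any(v < 0 for v in values):
        raise ValueError("need p >= 2 nonnegative values")
    cx, cy = center
    points = []
    for k, v in enumerate(values):
        # p equally spaced rays: step the angle by 2*pi/p for each variable.
        angle = math.pi / 2 - 2 * math.pi * k / p
        points.append((cx + v * math.cos(angle), cy + v * math.sin(angle)))
    return points

# Hypothetical observation on p = 4 variables.
verts = star_vertices([1.0, 2.0, 0.5, 3.0])
```

Joining consecutive entries of `verts` (and the last back to the first) with straight lines gives the star outline.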
Figure 1.14 Individual growth curves for weight for female grizzly bears. [Seven panels, Bear 1 to Bear 7: weight versus year.]

Figure 1.15 Individual growth curves for length for female grizzly bears. [Seven panels, Bear 1 to Bear 7: length versus year.]
It is often helpful, when constructing the stars, to standardize the observations. In this case some of the observations will be negative. The observations can then be reexpressed so that the center of the circle represents the smallest standardized observation within the entire data set.

Example 1.11 (Utility data as stars)
Stars representing the first 5 of the 22 public utility firms in Table 12.5, page 683, are shown in Figure 1.16. There are eight variables; consequently, the stars are distorted octagons.

Figure 1.16 Stars for the first five public utilities: Arizona Public Service (1), Boston Edison Co. (2), Central Louisiana Electric Co. (3), Commonwealth Edison Co. (4), and Consolidated Edison Co. (NY) (5).
The observations on all variables were standardized. Among the first five utilities, the smallest standardized observation for any variable was -1.6. Treating this value as zero, the variables are plotted on identical scales along eight equiangular rays originating from the center of the circle. The variables are ordered in a clockwise direction, beginning in the 12 o'clock position.

At first glance, none of these utilities appears to be similar to any other. However, because of the way the stars are constructed, each variable gets equal weight in the visual impression. If we concentrate on the variables 6 (sales in kilowatt-hour [kWh] use per year) and 8 (total fuel costs in cents per kWh), then Boston Edison and Consolidated Edison are similar (small variable 6, large variable 8), and Arizona Public Service, Central Louisiana Electric, and Commonwealth Edison are similar (moderate variable 6, moderate variable 8). •

Chernoff Faces

People react to faces. Chernoff [6] suggested representing p-dimensional observations as a two-dimensional face whose characteristics (face shape, mouth curvature, nose length, eye size, pupil position, and so forth) are determined by the measurements on the p variables.

As originally designed, Chernoff faces can handle up to 18 variables. The assignment of variables to facial features is done by the experimenter, and different choices produce different results. Some iteration is usually necessary before satisfactory representations are achieved.

Chernoff faces appear to be most useful for verifying (1) an initial grouping suggested by subject-matter knowledge and intuition or (2) final groupings produced by clustering algorithms.

Example 1.12 (Utility data as Chernoff faces)
From the data in Table 12.5, the 22 public utility companies were represented as Chernoff faces. We have the following correspondences:

Variable                                      Facial characteristic
X1: Fixed-charge coverage                     Half-height of face
X2: Rate of return on capital                 Face width
X3: Cost per kW capacity in place             Position of center of mouth
X4: Annual load factor                        Slant of eyes
X5: Peak kWh demand growth from 1974          Eccentricity (height/width) of eyes
X6: Sales (kWh use per year)                  Half-length of eye
X7: Percent nuclear                           Curvature of mouth
X8: Total fuel costs (cents per kWh)          Length of nose
Figure 1.17 Chernoff faces for 22 public utilities. [Faces arranged in seven groups, Cluster 1 to Cluster 7.]

The Chernoff faces are shown in Figure 1.17. We have subjectively grouped "similar" faces into seven clusters. If a smaller number of clusters is desired, we might combine clusters 5, 6, and 7 and, perhaps, clusters 2 and 3 to
obtain four or five clusters. For our assignment of variables to facial features, the firms group largely according to geographical location. •

Constructing Chernoff faces is a task that must be done with the aid of a computer. The data are ordinarily standardized within the computer program as part of the process for determining the locations, sizes, and orientations of the facial characteristics.

With some training, we can use Chernoff faces to communicate similarities or dissimilarities, as the next example indicates.

Example 1.13 (Using Chernoff faces to show changes over time)
Figure 1.18 illustrates an additional use of Chernoff faces. (See [24].) In the figure, the faces are used to track the financial well-being of a company over time. As indicated, each facial feature represents a single financial indicator, and the longitudinal changes in these indicators are thus evident at a glance. •

Figure 1.18 Chernoff faces over time. [Faces for 1975 through 1979 along a time axis; labeled features include liquidity and profitability.]
Chernoff faces have also been used to display differences in multivariate observations in two dimensions. For example, the two-dimensional coordinate axes might represent latitude and longitude (geographical location), and the faces might represent multivariate measurements on several U.S. cities. Additional examples of this kind are discussed in [30].

There are several ingenious ways to picture multivariate data in two dimensions. We have described some of them. Further advances are possible and will almost certainly take advantage of improved computer graphics.
1.5 DISTANCE

Although they may at first appear formidable, most multivariate techniques are based upon the simple concept of distance. Straight-line, or Euclidean, distance should be familiar. If we consider the point $P = (x_1, x_2)$ in the plane, the straight-line distance, $d(O, P)$, from $P$ to the origin $O = (0, 0)$ is, according to the Pythagorean theorem,

\[ d(O, P) = \sqrt{x_1^2 + x_2^2} \tag{1-9} \]

The situation is illustrated in Figure 1.19. In general, if the point $P$ has $p$ coordinates so that $P = (x_1, x_2, \ldots, x_p)$, the straight-line distance from $P$ to the origin $O = (0, 0, \ldots, 0)$ is

\[ d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} \tag{1-10} \]

Figure 1.19 Distance given by the Pythagorean theorem.
(See Chapter 2.) All points $(x_1, x_2, \ldots, x_p)$ that lie a constant squared distance, such as $c^2$, from the origin satisfy the equation

\[ d^2(O, P) = x_1^2 + x_2^2 + \cdots + x_p^2 = c^2 \tag{1-11} \]

Because this is the equation of a hypersphere (a circle if $p = 2$), points equidistant from the origin lie on a hypersphere.

The straight-line distance between two arbitrary points $P$ and $Q$ with coordinates $P = (x_1, x_2, \ldots, x_p)$ and $Q = (y_1, y_2, \ldots, y_p)$ is given by

\[ d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} \tag{1-12} \]
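As a small numerical companion to the Euclidean distance between two points, the sketch below evaluates the formula directly. The function name `euclidean_distance` and the sample points are illustrative choices made here, not part of the text.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between points p and q with equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Distance to the origin in the plane: sqrt(3^2 + 4^2).
d_origin = euclidean_distance((3.0, 4.0), (0.0, 0.0))
# Distance between two points with p = 3 coordinates: sqrt(9 + 16 + 0).
d_pq = euclidean_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0))
```

Every coordinate enters with the same weight, which is the property the following discussion sets out to relax.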
Straight-line, or Euclidean, distance is unsatisfactory for most statistical purposes. This is because each coordinate contributes equally to the calculation of Euclidean distance. When the coordinates represent measurements that are subject to random fluctuations of differing magnitudes, it is often desirable to weight coordinates subject to a great deal of variability less heavily than those that are not highly variable. This suggests a different measure of distance.

Our purpose now is to develop a "statistical" distance that accounts for differences in variation and, in due course, the presence of correlation. Because our choice will depend upon the sample variances and covariances, at this point we use the term statistical distance to distinguish it from ordinary Euclidean distance. It is statistical distance that is fundamental to multivariate analysis.

To begin, we take as fixed the set of observations graphed as the p-dimensional scatter plot. From these, we shall construct a measure of distance from the origin to a point $P = (x_1, x_2, \ldots, x_p)$. In our arguments, the coordinates $(x_1, x_2, \ldots, x_p)$ of $P$ can vary to produce different locations for the point. The data that determine distance will, however, remain fixed.

To illustrate, suppose we have $n$ pairs of measurements on two variables each having mean zero. Call the variables $x_1$ and $x_2$, and assume that the $x_1$ measurements vary independently of the $x_2$ measurements.¹ In addition, assume that the variability in the $x_1$ measurements is larger than the variability in the $x_2$ measurements. A scatter plot of the data would look something like the one pictured in Figure 1.20.
Figure 1.20 A scatter plot with greater variability in the $x_1$ direction than in the $x_2$ direction.

¹ At this point, "independently" means that the $x_2$ measurements cannot be predicted with any accuracy from the $x_1$ measurements, and vice versa.
Glancing at Figure 1.20, we see that values which are a given deviation from the origin in the $x_1$ direction are not as "surprising" or "unusual" as are values equidistant from the origin in the $x_2$ direction. This is because the inherent variability in the $x_1$ direction is greater than the variability in the $x_2$ direction. Consequently, large $x_1$ coordinates (in absolute value) are not as unexpected as large $x_2$ coordinates. It seems reasonable, then, to weight an $x_2$ coordinate more heavily than an $x_1$ coordinate of the same value when computing the "distance" to the origin.

One way to proceed is to divide each coordinate by the sample standard deviation. Therefore, upon division by the standard deviations, we have the "standardized" coordinates $x_1^* = x_1/\sqrt{s_{11}}$ and $x_2^* = x_2/\sqrt{s_{22}}$. The standardized coordinates are now on an equal footing with one another. After taking the differences in variability into account, we determine distance using the standard Euclidean formula.

Thus, a statistical distance of the point $P = (x_1, x_2)$ from the origin $O = (0, 0)$ can be computed from its standardized coordinates $x_1^* = x_1/\sqrt{s_{11}}$ and $x_2^* = x_2/\sqrt{s_{22}}$ as

\[ d(O, P) = \sqrt{(x_1^*)^2 + (x_2^*)^2} = \sqrt{\left(\frac{x_1}{\sqrt{s_{11}}}\right)^2 + \left(\frac{x_2}{\sqrt{s_{22}}}\right)^2} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}} \tag{1-13} \]
Comparing (1-13) with (1-9), we see that the difference between the two expressions is due to the weights $k_1 = 1/s_{11}$ and $k_2 = 1/s_{22}$ attached to $x_1^2$ and $x_2^2$ in (1-13). Note that if the sample variances are the same, $k_1 = k_2$, and $x_1^2$ and $x_2^2$ will receive the same weight. In cases where the weights are the same, it is convenient to ignore the common divisor and use the usual Euclidean distance formula. In other words, if the variability in the $x_1$ direction is the same as the variability in the $x_2$ direction, and the $x_1$ values vary independently of the $x_2$ values, Euclidean distance is appropriate.

Using (1-13), we see that all points which have coordinates $(x_1, x_2)$ and are a constant squared distance $c^2$ from the origin must satisfy

\[ \frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2 \tag{1-14} \]
Equation (1-14) is the equation of an ellipse centered at the origin whose major and minor axes coincide with the coordinate axes. That is, the statistical distance in (1-13) has an ellipse as the locus of all points a constant distance from the origin. This general case is shown in Figure 1.21.

Figure 1.21 The ellipse of constant statistical distance $d^2(O, P) = x_1^2/s_{11} + x_2^2/s_{22} = c^2$. [The ellipse crosses the $x_1$ axis at $\pm c\sqrt{s_{11}}$ and the $x_2$ axis at $\pm c\sqrt{s_{22}}$.]

Example 1.14 (Calculating a statistical distance)
A set of paired measurements $(x_1, x_2)$ on two variables yields $\bar{x}_1 = \bar{x}_2 = 0$, $s_{11} = 4$, and $s_{22} = 1$. Suppose the $x_1$ measurements are unrelated to the $x_2$ measurements; that is, measurements within a pair vary independently of one another. Since the sample variances are unequal, we measure the square of the distance of an arbitrary point $P = (x_1, x_2)$ to the origin $O = (0, 0)$ by

\[ d^2(O, P) = \frac{x_1^2}{4} + \frac{x_2^2}{1} \]

All points $(x_1, x_2)$ that are a constant distance 1 from the origin satisfy the equation

\[ \frac{x_1^2}{4} + \frac{x_2^2}{1} = 1 \]

The coordinates of some points a unit distance from the origin are presented in the following table:

Coordinates: $(x_1, x_2)$      Distance: $x_1^2/4 + x_2^2/1 = 1$
$(0, 1)$                       $0^2/4 + 1^2/1 = 1$
$(0, -1)$                      $0^2/4 + (-1)^2/1 = 1$
$(2, 0)$                       $2^2/4 + 0^2/1 = 1$
$(1, \sqrt{3}/2)$              $1^2/4 + (\sqrt{3}/2)^2/1 = 1$
A plot of the equation $x_1^2/4 + x_2^2/1 = 1$ is an ellipse centered at $(0, 0)$ whose major axis lies along the $x_1$ coordinate axis and whose minor axis lies along the $x_2$ coordinate axis. The half-lengths of these major and minor axes are $\sqrt{4} = 2$ and $\sqrt{1} = 1$, respectively. The ellipse of unit distance is plotted in Figure 1.22. All points on the ellipse are regarded as being the same statistical distance from the origin, in this case a distance of 1. •

Figure 1.22 Ellipse of unit distance, $x_1^2/4 + x_2^2/1 = 1$.
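The table in Example 1.14 can be checked numerically. The sketch below is only an illustration under the example's assumptions (independent coordinates, $s_{11} = 4$, $s_{22} = 1$); the function name `statistical_distance_2d` is introduced here and does not come from the text.

```python
import math

def statistical_distance_2d(x1, x2, s11, s22):
    """Statistical distance from (x1, x2) to the origin for two
    independently varying coordinates with sample variances s11, s22."""
    return math.sqrt(x1 ** 2 / s11 + x2 ** 2 / s22)

s11, s22 = 4.0, 1.0  # sample variances from Example 1.14
points = [(0.0, 1.0), (0.0, -1.0), (2.0, 0.0), (1.0, math.sqrt(3) / 2)]
distances = [statistical_distance_2d(x1, x2, s11, s22) for x1, x2 in points]
```

Up to floating-point rounding, every entry of `distances` equals 1, so all four points lie on the ellipse of unit statistical distance even though their Euclidean distances from the origin differ.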
The expression in (1-13) can be generalized to accommodate the calculation of statistical distance from an arbitrary point $P = (x_1, x_2)$ to any fixed point $Q = (y_1, y_2)$. If we assume that the coordinate variables vary independently of one another, the distance from $P$ to $Q$ is given by

\[ d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}}} \tag{1-15} \]
The extension of this statistical distance to more than two dimensions is straightforward. Let the points $P$ and $Q$ have $p$ coordinates such that $P = (x_1, x_2, \ldots, x_p)$ and $Q = (y_1, y_2, \ldots, y_p)$. Suppose $Q$ is a fixed point [it may be the origin $O = (0, 0, \ldots, 0)$] and the coordinate variables vary independently of one another. Let $s_{11}, s_{22}, \ldots, s_{pp}$ be sample variances constructed from $n$ measurements on $x_1, x_2, \ldots, x_p$, respectively. Then the statistical distance from $P$ to $Q$ is

\[ d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}} \tag{1-16} \]
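The p-dimensional statistical distance for independent coordinates can be sketched as follows. The function name `statistical_distance`, the two points, and the variances are hypothetical values chosen for illustration; they are not data from the text.

```python
import math

def statistical_distance(p, q, variances):
    """Statistical distance between points p and q when the coordinates
    vary independently; variances holds s11, s22, ..., spp."""
    if not (len(p) == len(q) == len(variances)):
        raise ValueError("p, q and variances must have the same length")
    if any(s <= 0 for s in variances):
        raise ValueError("sample variances must be positive")
    # Each squared deviation is divided by the variance of its coordinate.
    return math.sqrt(sum((x - y) ** 2 / s
                         for x, y, s in zip(p, q, variances)))

# Hypothetical point, fixed point, and sample variances (p = 3).
P = (3.0, 1.0, 5.0)
Q = (1.0, 0.0, 2.0)
d_stat = statistical_distance(P, Q, (4.0, 1.0, 9.0))   # sqrt(1 + 1 + 1)
d_equal = statistical_distance(P, Q, (1.0, 1.0, 1.0))  # sqrt(4 + 1 + 9)
```

With all variances equal to 1 the function returns the ordinary Euclidean distance, so `d_equal` is the straight-line distance between `P` and `Q`.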
All points $P$ that are a constant squared distance from $Q$ lie on a hyperellipsoid centered at $Q$ whose major and minor axes are parallel to the coordinate axes. We note the following:

1. The distance of $P$ to the origin $O$ is obtained by setting $y_1 = y_2 = \cdots = y_p = 0$ in (1-16).
2. If $s_{11} = s_{22} = \cdots = s_{pp}$, the Euclidean distance formula in (1-12) is appropriate.

The distance in (1-16) still does not include most of the important cases we shall encounter, because of the assumption of independent coordinates. The scatter plot in Figure 1.23 depicts a two-dimensional situation in which the $x_1$ measurements do not vary independently of the $x_2$ measurements. In fact, the coordinates of the pairs $(x_1, x_2)$ exhibit a tendency to be large or small together, and the sample correlation coefficient is positive. Moreover, the variability in the $x_2$ direction is larger than the variability in the $x_1$ direction.

Figure 1.23 A scatter plot for positively correlated measurements and a rotated coordinate system.
What is a meaningful measure of distance when the variability in the $x_1$ direction is different from the variability in the $x_2$ direction and the variables $x_1$ and $x_2$ are correlated? Actually, we can use what we have already introduced, provided that we look at things in the right way. From Figure 1.23, we see that if we rotate the original coordinate system through the angle $\theta$ while keeping the scatter fixed and label the rotated axes $\tilde{x}_1$ and $\tilde{x}_2$, the scatter in terms of the new axes looks very much like that in Figure 1.20. (You may wish to turn the book to place the $\tilde{x}_1$ and $\tilde{x}_2$ axes in their customary positions.) This suggests that we calculate the sample variances using the $\tilde{x}_1$ and $\tilde{x}_2$ coordinates and measure distance as in Equation (1-13). That is, with reference to the $\tilde{x}_1$ and $\tilde{x}_2$ axes, we define the distance from the point $P = (\tilde{x}_1, \tilde{x}_2)$ to the origin $O = (0, 0)$ as

\[ d(O, P) = \sqrt{\frac{\tilde{x}_1^2}{\tilde{s}_{11}} + \frac{\tilde{x}_2^2}{\tilde{s}_{22}}} \tag{1-17} \]
where $\tilde{s}_{11}$ and $\tilde{s}_{22}$ denote the sample variances computed with the $\tilde{x}_1$ and $\tilde{x}_2$ measurements.

The relation between the original coordinates $(x_1, x_2)$ and the rotated coordinates $(\tilde{x}_1, \tilde{x}_2)$ is provided by

\[ \begin{aligned} \tilde{x}_1 &= x_1 \cos(\theta) + x_2 \sin(\theta) \\ \tilde{x}_2 &= -x_1 \sin(\theta) + x_2 \cos(\theta) \end{aligned} \tag{1-18} \]

Given the relations in (1-18), we can formally substitute for $\tilde{x}_1$ and $\tilde{x}_2$ in (1-17) and express the distance in terms of the original coordinates.

After some straightforward algebraic manipulations, the distance from $P = (\tilde{x}_1, \tilde{x}_2)$ to the origin $O = (0, 0)$ can be written in terms of the original coordinates $x_1$ and $x_2$ of $P$ as

\[ d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2} \tag{1-19} \]
where the $a$'s are numbers such that the distance is nonnegative for all possible values of $x_1$ and $x_2$. Here $a_{11}$, $a_{12}$, and $a_{22}$ are determined by the angle $\theta$, and $s_{11}$, $s_{12}$, and $s_{22}$ calculated from the original data.² The particular forms for $a_{11}$, $a_{12}$, and $a_{22}$ are not important at this point. What is important is the appearance of the cross-product term $2a_{12}x_1x_2$ necessitated by the nonzero correlation $r_{12}$.

Equation (1-19) can be compared with (1-13). The expression in (1-13) can be regarded as a special case of (1-19) with $a_{11} = 1/s_{11}$, $a_{22} = 1/s_{22}$, and $a_{12} = 0$.

² Specifically,

\[ a_{11} = \frac{\cos^2(\theta)}{\cos^2(\theta)s_{11} + 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{22}} + \frac{\sin^2(\theta)}{\cos^2(\theta)s_{22} - 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{11}} \]

\[ a_{22} = \frac{\sin^2(\theta)}{\cos^2(\theta)s_{11} + 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{22}} + \frac{\cos^2(\theta)}{\cos^2(\theta)s_{22} - 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{11}} \]

and

\[ a_{12} = \frac{\cos(\theta)\sin(\theta)}{\cos^2(\theta)s_{11} + 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{22}} - \frac{\sin(\theta)\cos(\theta)}{\cos^2(\theta)s_{22} - 2\sin(\theta)\cos(\theta)s_{12} + \sin^2(\theta)s_{11}} \]
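The algebra connecting the rotated-coordinate distance to the quadratic form in the original coordinates can be checked numerically. The sketch below is an illustration only: the function names are introduced here, and the angle and the values of $s_{11}$, $s_{12}$, $s_{22}$ are hypothetical rather than computed from any data set, so the angle is simply a given input and not one derived from a scatter plot.

```python
import math

def rotated_variances(theta, s11, s12, s22):
    """Sample variances of the rotated coordinates, written in terms of
    the original s11, s12, s22 (the two denominators in footnote 2)."""
    c, s = math.cos(theta), math.sin(theta)
    v1 = c * c * s11 + 2 * s * c * s12 + s * s * s22
    v2 = c * c * s22 - 2 * s * c * s12 + s * s * s11
    return v1, v2

def quadratic_coefficients(theta, s11, s12, s22):
    """Coefficients a11, a12, a22 as given in footnote 2."""
    c, s = math.cos(theta), math.sin(theta)
    v1, v2 = rotated_variances(theta, s11, s12, s22)
    a11 = c * c / v1 + s * s / v2
    a22 = s * s / v1 + c * c / v2
    a12 = c * s / v1 - s * c / v2
    return a11, a12, a22

def distance_rotated(x1, x2, theta, s11, s12, s22):
    """Rotate the point, then scale each rotated coordinate by its variance."""
    c, s = math.cos(theta), math.sin(theta)
    v1, v2 = rotated_variances(theta, s11, s12, s22)
    t1 = x1 * c + x2 * s
    t2 = -x1 * s + x2 * c
    return math.sqrt(t1 * t1 / v1 + t2 * t2 / v2)

def distance_quadratic(x1, x2, theta, s11, s12, s22):
    """Same distance written with the cross-product term 2*a12*x1*x2."""
    a11, a12, a22 = quadratic_coefficients(theta, s11, s12, s22)
    return math.sqrt(a11 * x1 * x1 + 2 * a12 * x1 * x2 + a22 * x2 * x2)

# Hypothetical angle, variances, covariance, and point.
theta, s11, s12, s22 = math.radians(30), 2.0, 1.0, 3.0
d_rot = distance_rotated(1.5, -0.5, theta, s11, s12, s22)
d_quad = distance_quadratic(1.5, -0.5, theta, s11, s12, s22)
# With no rotation and zero covariance the weights reduce to 1/s11, 0, 1/s22.
special = quadratic_coefficients(0.0, 4.0, 0.0, 1.0)
```

The two distances agree up to rounding, and `special` reproduces the weights of the uncorrelated case.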
Figure 1.24 Ellipse of points a constant distance from the point Q.
In general, the statistical distance of the point $P = (x_1, x_2)$ from the fixed point $Q = (y_1, y_2)$ for situations in which the variables are correlated has the general form

\[ d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2} \tag{1-20} \]

and can always be computed once $a_{11}$, $a_{12}$, and $a_{22}$ are known. In addition, the coordinates of all points $P = (x_1, x_2)$ that are a constant squared distance $c^2$ from $Q$ satisfy

\[ a_{11}(x_1 - y_1)^2 + 2a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2 = c^2 \tag{1-21} \]

By definition, this is the equation of an ellipse centered at $Q$. The graph of such an equation is displayed in Figure 1.24. The major (long) and minor (short) axes are indicated. They are parallel to the $\tilde{x}_1$ and $\tilde{x}_2$ axes. For the choice of $a_{11}$, $a_{12}$, and $a_{22}$ in footnote 2, the $\tilde{x}_1$ and $\tilde{x}_2$ axes are at an angle $\theta$ with respect to the $x_1$ and $x_2$ axes.

The generalization of the distance formulas of (1-19) and (1-20) to $p$ dimensions is straightforward. Let $P = (x_1, x_2, \ldots, x_p)$ be a point whose coordinates represent variables that are correlated and subject to inherent variability. Let $O = (0, 0, \ldots, 0)$ denote the origin, and let $Q = (y_1, y_2, \ldots, y_p)$ be a specified fixed point. Then the distances from $P$ to $O$ and from $P$ to $Q$ have the general forms
\[ d(O, P) = \sqrt{a_{11}x_1^2 + a_{22}x_2^2 + \cdots + a_{pp}x_p^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + \cdots + 2a_{p-1,p}x_{p-1}x_p} \tag{1-22} \]

and

\[ d(P, Q) = \sqrt{\begin{aligned} &a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2 \\ &+ 2a_{12}(x_1 - y_1)(x_2 - y_2) + 2a_{13}(x_1 - y_1)(x_3 - y_3) + \cdots + 2a_{p-1,p}(x_{p-1} - y_{p-1})(x_p - y_p) \end{aligned}} \tag{1-23} \]

where the $a$'s are numbers such that the distances are always nonnegative.³

³ The algebraic expressions for the squares of the distances in (1-22) and (1-23) are known as quadratic forms and, in particular, positive definite quadratic forms. It is possible to display these quadratic forms in a simpler manner using matrix algebra; we shall do so in Section 2.3 of Chapter 2.
We note that the distances in (1-22) and (1-23) are completely determined by the coefficients (weights) $a_{ik}$, $i = 1, 2, \ldots, p$, $k = 1, 2, \ldots, p$. These coefficients can be set out in the rectangular array

\[ \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{12} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1p} & a_{2p} & \cdots & a_{pp} \end{bmatrix} \tag{1-24} \]

where the $a_{ik}$'s with $i \ne k$ are displayed twice, since they are multiplied by 2 in the distance formulas. Consequently, the entries in this array specify the distance functions. The $a_{ik}$'s cannot be arbitrary numbers; they must be such that the computed distance is nonnegative for every pair of points. (See Exercise 1.10.)

Contours of constant distances computed from (1-22) and (1-23) are hyperellipsoids. A hyperellipsoid resembles a football when $p = 3$; it is impossible to visualize in more than three dimensions.

The need to consider statistical rather than Euclidean distance is illustrated heuristically in Figure 1.25. Figure 1.25 depicts a cluster of points whose center of gravity (sample mean) is indicated by the point $Q$. Consider the Euclidean distances from the point $Q$ to the point $P$ and the origin $O$. The Euclidean distance from $Q$ to $P$ is larger than the Euclidean distance from $Q$ to $O$. However, $P$ appears to be more like the points in the cluster than does the origin. If we take into account the variability of the points in the cluster and measure distance by the statistical distance in (1-20), then $Q$ will be closer to $P$ than to $O$. This result seems reasonable given the nature of the scatter.
Figure 1.25 A cluster of points relative to a point P and the origin.
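The situation pictured in Figure 1.25 can be imitated with a small calculation. The sketch below is hedged throughout: the function name `quadratic_distance`, the weights $a_{ik}$, and the coordinates of $Q$, $P$, and $O$ are hypothetical values chosen so that the effect is visible, and none of them are read from the figure.

```python
import math

def quadratic_distance(p, q, a):
    """Distance between p and q defined by a symmetric array of weights a.

    Summing a[i][k] over all i and k counts each off-diagonal weight
    twice, which gives the factor 2 on the cross-product terms.
    """
    diff = [x - y for x, y in zip(p, q)]
    n = len(diff)
    total = sum(a[i][k] * diff[i] * diff[k]
                for i in range(n) for k in range(n))
    if total < 0:
        raise ValueError("weights do not give a nonnegative squared distance")
    return math.sqrt(total)

identity = [[1.0, 0.0], [0.0, 1.0]]    # reduces to Euclidean distance
weights = [[5.0, -4.5], [-4.5, 5.0]]   # hypothetical a_ik for a tilted cluster
Q, P, O = (2.0, 0.0), (5.0, 3.0), (0.0, 0.0)

euclid_qp = quadratic_distance(Q, P, identity)
euclid_qo = quadratic_distance(Q, O, identity)
stat_qp = quadratic_distance(Q, P, weights)
stat_qo = quadratic_distance(Q, O, weights)
```

With these assumed numbers `euclid_qp` exceeds `euclid_qo`, while `stat_qp` is smaller than `stat_qo`: measured statistically, $Q$ is closer to $P$ than to $O$.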
Other measures of distance can be advanced. (See Exercise 1.12.) At times, it is useful to consider distances that are not related to circles or ellipses. Any distance measure $d(P, Q)$ between two points $P$ and $Q$ is valid provided that it satisfies the following properties, where $R$ is any other intermediate point:

\[ \begin{aligned} d(P, Q) &= d(Q, P) \\ d(P, Q) &> 0 \text{ if } P \ne Q \\ d(P, Q) &= 0 \text{ if } P = Q \\ d(P, Q) &\le d(P, R) + d(R, Q) \quad \text{(triangle inequality)} \end{aligned} \]
0. However, a negative value of $c$ creates a vector with a direction opposite that of $\mathbf{x}$. From (2-2) it is clear that $\mathbf{x}$ is expanded if $|c| > 1$ and contracted if $0 < |c| < 1$. [Recall Figure 2.2(a).] Choosing $c = L_x^{-1}$, we obtain the unit vector $L_x^{-1}\mathbf{x}$, which has length 1 and lies in the direction of $\mathbf{x}$.
Figure 2.3 Length of $\mathbf{x} = \sqrt{x_1^2 + x_2^2}$.

Figure 2.4 The angle $\theta$ between $\mathbf{x}' = [x_1, x_2]$ and $\mathbf{y}' = [y_1, y_2]$.
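A second geometrical concept is angle. Consider two vectors in a plane and the angle $\theta$ between them, as in Figure 2.4. From the figure, $\theta$ can be represented as the difference between the angles $\theta_1$ and $\theta_2$ formed by the two vectors and the first coordinate axis. Since, by definition,

\[ \cos(\theta_1) = \frac{x_1}{L_x}, \quad \cos(\theta_2) = \frac{y_1}{L_y}, \quad \sin(\theta_1) = \frac{x_2}{L_x}, \quad \sin(\theta_2) = \frac{y_2}{L_y} \]

and

\[ \cos(\theta) = \cos(\theta_2 - \theta_1) = \cos(\theta_2)\cos(\theta_1) + \sin(\theta_2)\sin(\theta_1) \]

the angle $\theta$ between the two vectors $\mathbf{x}' = [x_1, x_2]$ and $\mathbf{y}' = [y_1, y_2]$ is specified by

\[ \cos(\theta) = \cos(\theta_2 - \theta_1) = \left(\frac{y_1}{L_y}\right)\left(\frac{x_1}{L_x}\right) + \left(\frac{y_2}{L_y}\right)\left(\frac{x_2}{L_x}\right) = \frac{x_1y_1 + x_2y_2}{L_xL_y} \tag{2-3} \]

We find it convenient to introduce the inner product of two vectors. For $n = 2$ dimensions, the inner product of $\mathbf{x}$ and $\mathbf{y}$ is

\[ \mathbf{x}'\mathbf{y} = x_1y_1 + x_2y_2 \]

With this definition and Equation (2-3),

\[ L_x = \sqrt{\mathbf{x}'\mathbf{x}}, \qquad \cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{L_xL_y} = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\sqrt{\mathbf{y}'\mathbf{y}}} \]

Since $\cos(90°) = \cos(270°) = 0$ and $\cos(\theta) = 0$ only if $\mathbf{x}'\mathbf{y} = 0$, $\mathbf{x}$ and $\mathbf{y}$ are perpendicular when $\mathbf{x}'\mathbf{y} = 0$.

For an arbitrary number of dimensions $n$, we define the inner product of $\mathbf{x}$ and $\mathbf{y}$ as

\[ \mathbf{x}'\mathbf{y} = x_1y_1 + x_2y_2 + \cdots + x_ny_n \tag{2-4} \]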
The inner product is denoted by either $\mathbf{x}'\mathbf{y}$ or $\mathbf{y}'\mathbf{x}$. Using the inner product, we have the natural extension of length and angle to vectors of $n$ components:

\[ L_x = \text{length of } \mathbf{x} = \sqrt{\mathbf{x}'\mathbf{x}} \tag{2-5} \]

\[ \cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{L_xL_y} = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\sqrt{\mathbf{y}'\mathbf{y}}} \tag{2-6} \]

Since, again, $\cos(\theta) = 0$ only if $\mathbf{x}'\mathbf{y} = 0$, we say that $\mathbf{x}$ and $\mathbf{y}$ are perpendicular when $\mathbf{x}'\mathbf{y} = 0$.

Example 2.1 (Calculating lengths of vectors and the angle between them)
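Given the vectors $\mathbf{x}' = [1, 3, 2]$ and $\mathbf{y}' = [-2, 1, -1]$, find $3\mathbf{x}$ and $\mathbf{x} + \mathbf{y}$. Next, determine the length of $\mathbf{x}$, the length of $\mathbf{y}$, and the angle between $\mathbf{x}$ and $\mathbf{y}$. Also, check that the length of $3\mathbf{x}$ is three times the length of $\mathbf{x}$.

First,

\[ 3\mathbf{x} = [3, 9, 6]', \qquad \mathbf{x} + \mathbf{y} = [1 - 2, 3 + 1, 2 - 1]' = [-1, 4, 1]' \]

Next, $\mathbf{x}'\mathbf{x} = 1^2 + 3^2 + 2^2 = 14$, $\mathbf{y}'\mathbf{y} = (-2)^2 + 1^2 + (-1)^2 = 6$, and $\mathbf{x}'\mathbf{y} = 1(-2) + 3(1) + 2(-1) = -1$. Therefore,

\[ L_x = \sqrt{\mathbf{x}'\mathbf{x}} = \sqrt{14} = 3.742, \qquad L_y = \sqrt{\mathbf{y}'\mathbf{y}} = \sqrt{6} = 2.449 \]

and

\[ \cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{L_xL_y} = \frac{-1}{3.742 \times 2.449} = -.109 \]

so $\theta = 96.3°$. Finally,

\[ L_{3x} = \sqrt{3^2 + 9^2 + 6^2} = \sqrt{126} \quad \text{and} \quad 3L_x = 3\sqrt{14} = \sqrt{126} \]

showing $L_{3x} = 3L_x$. •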
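The arithmetic of Example 2.1 can be reproduced with a few helper functions. This is a minimal sketch; the helper names `inner`, `length`, and `angle_degrees` are introduced here for illustration.

```python
import math

def inner(x, y):
    """Inner product x'y of two vectors with the same number of components."""
    return sum(a * b for a, b in zip(x, y))

def length(x):
    """Length of x, the square root of x'x."""
    return math.sqrt(inner(x, x))

def angle_degrees(x, y):
    """Angle between x and y, from cos(theta) = x'y / (Lx * Ly)."""
    return math.degrees(math.acos(inner(x, y) / (length(x) * length(y))))

x = [1, 3, 2]
y = [-2, 1, -1]
three_x = [3 * v for v in x]              # 3x
x_plus_y = [a + b for a, b in zip(x, y)]  # x + y
theta = angle_degrees(x, y)
```

Here `length(three_x)` and `3 * length(x)` agree up to rounding, which is the final check made in the example.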
A pair of vectors $\mathbf{x}$ and $\mathbf{y}$ of the same dimension is said to be linearly dependent if there exist constants $c_1$ and $c_2$, both not zero, such that

\[ c_1\mathbf{x} + c_2\mathbf{y} = \mathbf{0} \]

A set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is said to be linearly dependent if there exist constants $c_1, c_2, \ldots, c_k$, not all zero, such that

\[ c_1\mathbf{x}_1 + c_2\mathbf{x}_2 + \cdots + c_k\mathbf{x}_k = \mathbf{0} \tag{2-7} \]

Linear dependence implies that at least one vector in the set can be written as a linear combination of the other vectors. Vectors of the same dimension that are not linearly dependent are said to be linearly independent.
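Example 2.2 (Identifying linearly independent vectors)

Consider the set of vectors

\[ \mathbf{x}_1 = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}, \qquad \mathbf{x}_2 = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix}, \qquad \mathbf{x}_3 = \begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix} \]

Setting

\[ c_1\mathbf{x}_1 + c_2\mathbf{x}_2 + c_3\mathbf{x}_3 = \mathbf{0} \]

implies that

\[ \begin{aligned} c_1 + c_2 + c_3 &= 0 \\ 2c_1 - 2c_3 &= 0 \\ c_1 - c_2 + c_3 &= 0 \end{aligned} \]

with the unique solution $c_1 = c_2 = c_3 = 0$. As we cannot find three constants $c_1$, $c_2$, and $c_3$, not all zero, such that $c_1\mathbf{x}_1 + c_2\mathbf{x}_2 + c_3\mathbf{x}_3 = \mathbf{0}$, the vectors $\mathbf{x}_1$, $\mathbf{x}_2$, and $\mathbf{x}_3$ are linearly independent. •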
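One way to confirm that the three equations in $c_1$, $c_2$, $c_3$ have only the zero solution is to evaluate the determinant of their coefficients. This check relies on a standard fact that this section has not introduced (a square homogeneous system has only the zero solution when its coefficient determinant is nonzero), and the helper `det3` is a name made up for this sketch.

```python
def det3(m):
    """Determinant of a 3 x 3 array by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Coefficients of c1, c2, c3 in the three equations, one row per equation.
coefficients = [
    [1, 1, 1],    # c1 + c2 + c3 = 0
    [2, 0, -2],   # 2*c1 - 2*c3 = 0
    [1, -1, 1],   # c1 - c2 + c3 = 0
]
determinant = det3(coefficients)
linearly_independent = determinant != 0
```

A nonzero `determinant` means no constants other than $c_1 = c_2 = c_3 = 0$ satisfy the system, in agreement with the conclusion of Example 2.2.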
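The projection (or shadow) of a vector $\mathbf{x}$ on a vector $\mathbf{y}$ is

$$\text{Projection of } \mathbf{x} \text{ on } \mathbf{y} = \frac{(\mathbf{x}'\mathbf{y})}{\mathbf{y}'\mathbf{y}}\,\mathbf{y} = \frac{(\mathbf{x}'\mathbf{y})}{L_{\mathbf{y}}}\,\frac{1}{L_{\mathbf{y}}}\,\mathbf{y} \tag{2-8}$$

where the vector $L_{\mathbf{y}}^{-1}\mathbf{y}$ has unit length. The length of the projection is

$$\text{Length of projection} = \frac{|\mathbf{x}'\mathbf{y}|}{L_{\mathbf{y}}} = L_{\mathbf{x}}\left|\frac{\mathbf{x}'\mathbf{y}}{L_{\mathbf{x}} L_{\mathbf{y}}}\right| = L_{\mathbf{x}}\,|\cos(\theta)| \tag{2-9}$$

where $\theta$ is the angle between $\mathbf{x}$ and $\mathbf{y}$. (See Figure 2.5.)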
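As an illustration of the projection formula, the sketch below (assuming NumPy, and reusing the vectors of Example 2.1 purely as sample input) computes the projection of x on y and confirms that its length equals $L_{\mathbf{x}}|\cos(\theta)|$.

```python
import numpy as np

# Sample vectors (those of Example 2.1)
x = np.array([1.0, 3.0, 2.0])
y = np.array([-2.0, 1.0, -1.0])

# Projection of x on y: (x'y / y'y) y
projection = (x @ y) / (y @ y) * y

# Length of the projection computed two ways
length_direct = abs(x @ y) / np.sqrt(y @ y)
cos_theta = (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))
length_via_angle = np.sqrt(x @ x) * abs(cos_theta)

# The residual x - projection should be perpendicular to y
residual = x - projection

print(projection, length_direct)
```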
Matrices

A matrix is any rectangular array of real numbers. We denote an arbitrary array of $n$ rows and $p$ columns by

$$\underset{(n \times p)}{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{bmatrix}$$

Many of the vector concepts just introduced have direct generalizations to matrices.

Figure 2.5 The projection of x on y.
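The transpose operation $\mathbf{A}'$ of a matrix changes the columns into rows, so that the first column of $\mathbf{A}$ becomes the first row of $\mathbf{A}'$, the second column becomes the second row, and so forth.

Example 2.3 (The transpose of a matrix)

If

$$\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}$$

then

$$\underset{(3 \times 2)}{\mathbf{A}'} = \begin{bmatrix} 3 & 1 \\ -1 & 5 \\ 2 & 4 \end{bmatrix}$$

•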
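A matrix may also be multiplied by a constant $c$. The product $c\mathbf{A}$ is the matrix that results from multiplying each element of $\mathbf{A}$ by $c$. Thus

$$\underset{(n \times p)}{c\mathbf{A}} = \begin{bmatrix} ca_{11} & ca_{12} & \cdots & ca_{1p} \\ ca_{21} & ca_{22} & \cdots & ca_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ ca_{n1} & ca_{n2} & \cdots & ca_{np} \end{bmatrix}$$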
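Two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same dimensions can be added. The sum $\mathbf{A} + \mathbf{B}$ has $(i, j)$th entry $a_{ij} + b_{ij}$.

Example 2.4 (The sum of two matrices and multiplication of a matrix by a constant)

If

$$\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 0 & 3 & 1 \\ 1 & -1 & 1 \end{bmatrix} \quad\text{and}\quad \underset{(2 \times 3)}{\mathbf{B}} = \begin{bmatrix} 1 & -2 & -3 \\ 2 & 5 & 1 \end{bmatrix}$$

then

$$\underset{(2 \times 3)}{4\mathbf{A}} = \begin{bmatrix} 0 & 12 & 4 \\ 4 & -4 & 4 \end{bmatrix}$$

and

$$\underset{(2 \times 3)}{\mathbf{A}} + \underset{(2 \times 3)}{\mathbf{B}} = \begin{bmatrix} 0 + 1 & 3 - 2 & 1 - 3 \\ 1 + 2 & -1 + 5 & 1 + 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & -2 \\ 3 & 4 & 2 \end{bmatrix}$$

•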
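It is also possible to define the multiplication of two matrices if the dimensions of the matrices conform in the following manner: When $\mathbf{A}$ is $(n \times k)$ and $\mathbf{B}$ is $(k \times p)$, so that the number of elements in a row of $\mathbf{A}$ is the same as the number of elements in a column of $\mathbf{B}$, we can form the matrix product $\mathbf{AB}$. An element of the new matrix $\mathbf{AB}$ is formed by taking the inner product of each row of $\mathbf{A}$ with each column of $\mathbf{B}$.

The matrix product $\mathbf{AB}$ is

$\underset{(n \times k)}{\mathbf{A}}\;\underset{(k \times p)}{\mathbf{B}}$ = the $(n \times p)$ matrix whose entry in the $i$th row and $j$th column is the inner product of the $i$th row of $\mathbf{A}$ and the $j$th column of $\mathbf{B}$

or

$$(i, j) \text{ entry of } \mathbf{AB} = a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{ik}b_{kj} = \sum_{\ell=1}^{k} a_{i\ell} b_{\ell j} \tag{2-10}$$

When $k = 4$, we have four products to add for each entry in the matrix $\mathbf{AB}$. Thus,

$$\underset{(n \times 4)}{\mathbf{A}}\;\underset{(4 \times p)}{\mathbf{B}} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ \vdots & \vdots & \vdots & \vdots \\ a_{i1} & a_{i2} & a_{i3} & a_{i4} \\ \vdots & \vdots & \vdots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & a_{n4} \end{bmatrix} \begin{bmatrix} b_{11} & \cdots & b_{1j} & \cdots & b_{1p} \\ b_{21} & \cdots & b_{2j} & \cdots & b_{2p} \\ b_{31} & \cdots & b_{3j} & \cdots & b_{3p} \\ b_{41} & \cdots & b_{4j} & \cdots & b_{4p} \end{bmatrix} = \begin{bmatrix} & \vdots & \\ \cdots & (a_{i1}b_{1j} + a_{i2}b_{2j} + a_{i3}b_{3j} + a_{i4}b_{4j}) & \cdots \\ & \vdots & \end{bmatrix}$$

where the displayed entry lies in row $i$ and column $j$ of the product.

Example 2.5 (Matrix multiplication)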
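If

$$\mathbf{A} = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} -2 \\ 7 \\ 9 \end{bmatrix}, \quad\text{and}\quad \mathbf{C} = \begin{bmatrix} 2 & 0 \\ 1 & -1 \end{bmatrix}$$

then

$$\underset{(2 \times 3)}{\mathbf{A}}\;\underset{(3 \times 1)}{\mathbf{B}} = \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix} \begin{bmatrix} -2 \\ 7 \\ 9 \end{bmatrix} = \begin{bmatrix} 3(-2) + (-1)(7) + 2(9) \\ 1(-2) + 5(7) + 4(9) \end{bmatrix} = \underset{(2 \times 1)}{\begin{bmatrix} 5 \\ 69 \end{bmatrix}}$$

and

$$\underset{(2 \times 2)}{\mathbf{C}}\;\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 2 & 0 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 3 & -1 & 2 \\ 1 & 5 & 4 \end{bmatrix} = \begin{bmatrix} 2(3) + 0(1) & 2(-1) + 0(5) & 2(2) + 0(4) \\ 1(3) - 1(1) & 1(-1) - 1(5) & 1(2) - 1(4) \end{bmatrix} = \underset{(2 \times 3)}{\begin{bmatrix} 6 & -2 & 4 \\ 2 & -6 & -2 \end{bmatrix}}$$

•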
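The products in Example 2.5 can be reproduced by computer. The sketch below assumes NumPy; it forms the products with the `@` operator and also builds one entry by hand from the sum in (2-10) to show that the two agree. The helper name `entry` is hypothetical.

```python
import numpy as np

# Matrices read off from the arithmetic in Example 2.5
A = np.array([[3, -1, 2],
              [1,  5, 4]])
B = np.array([[-2],
              [ 7],
              [ 9]])
C = np.array([[2,  0],
              [1, -1]])

AB = A @ B   # (2 x 3)(3 x 1) -> (2 x 1)
CA = C @ A   # (2 x 2)(2 x 3) -> (2 x 3)

def entry(M, N, i, j):
    """(i, j) entry of MN as the inner product of row i of M and column j of N."""
    return sum(M[i, l] * N[l, j] for l in range(M.shape[1]))

print(AB.ravel(), CA)
```

Note that `B @ A` is not defined here, since a (3 x 1) matrix cannot be multiplied by a (2 x 3) matrix; NumPy raises an error for nonconformable dimensions.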
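When a matrix $\mathbf{B}$ consists of a single column, it is customary to use the lowercase $\mathbf{b}$ vector notation.

Example 2.6 (Some typical products and their dimensions)

Let

$$\mathbf{A} = \begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix} \qquad \mathbf{b} = \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix} \qquad \mathbf{c} = \begin{bmatrix} 5 \\ 8 \\ -4 \end{bmatrix} \qquad \mathbf{d} = \begin{bmatrix} 2 \\ 9 \end{bmatrix}$$

Then $\mathbf{Ab}$, $\mathbf{bc}'$, $\mathbf{b}'\mathbf{c}$, and $\mathbf{d}'\mathbf{Ab}$ are typical products.

$$\mathbf{Ab} = \begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix} \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix} = \begin{bmatrix} 31 \\ -4 \end{bmatrix}$$

The product $\mathbf{Ab}$ is a vector with dimension equal to the number of rows of $\mathbf{A}$.

$$\mathbf{b}'\mathbf{c} = \begin{bmatrix} 7 & -3 & 6 \end{bmatrix} \begin{bmatrix} 5 \\ 8 \\ -4 \end{bmatrix} = [-13]$$

The product $\mathbf{b}'\mathbf{c}$ is a $1 \times 1$ vector or a single number, here $-13$.

$$\mathbf{bc}' = \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix} \begin{bmatrix} 5 & 8 & -4 \end{bmatrix} = \begin{bmatrix} 35 & 56 & -28 \\ -15 & -24 & 12 \\ 30 & 48 & -24 \end{bmatrix}$$

The product $\mathbf{bc}'$ is a matrix whose row dimension equals the dimension of $\mathbf{b}$ and whose column dimension equals that of $\mathbf{c}$. This product is unlike $\mathbf{b}'\mathbf{c}$, which is a single number.

$$\mathbf{d}'\mathbf{Ab} = \begin{bmatrix} 2 & 9 \end{bmatrix} \begin{bmatrix} 1 & -2 & 3 \\ 2 & 4 & -1 \end{bmatrix} \begin{bmatrix} 7 \\ -3 \\ 6 \end{bmatrix} = [26]$$

The product $\mathbf{d}'\mathbf{Ab}$ is a $1 \times 1$ vector or a single number, here 26.
•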
Square matrices will be of special importance in our development of statistical methods. A square matrix is said to be symmetric if $\mathbf{A} = \mathbf{A}'$ or $a_{ij} = a_{ji}$ for all $i$ and $j$.

Example 2.7 (A symmetric matrix)
The matrix
is symmetric; the matrix
is not symmetric.
•
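When two square matrices $\mathbf{A}$ and $\mathbf{B}$ are of the same dimension, both products $\mathbf{AB}$ and $\mathbf{BA}$ are defined, although they need not be equal. (See Supplement 2A.) If we let $\mathbf{I}$ denote the square matrix with ones on the diagonal and zeros elsewhere, it follows from the definition of matrix multiplication that the $(i, j)$th entry of $\mathbf{AI}$ is $a_{i1} \times 0 + \cdots + a_{i,j-1} \times 0 + a_{ij} \times 1 + a_{i,j+1} \times 0 + \cdots + a_{ik} \times 0 = a_{ij}$, so $\mathbf{AI} = \mathbf{A}$. Similarly, $\mathbf{IA} = \mathbf{A}$, so

$$\underset{(k \times k)}{\mathbf{I}}\;\underset{(k \times k)}{\mathbf{A}} = \underset{(k \times k)}{\mathbf{A}}\;\underset{(k \times k)}{\mathbf{I}} = \underset{(k \times k)}{\mathbf{A}} \quad\text{for any } \underset{(k \times k)}{\mathbf{A}} \tag{2-11}$$

The matrix $\mathbf{I}$ acts like 1 in ordinary multiplication $(1 \cdot a = a \cdot 1 = a)$, so it is called the identity matrix.

The fundamental scalar relation about the existence of an inverse number $a^{-1}$ such that $a^{-1}a = aa^{-1} = 1$ if $a \neq 0$ has the following matrix algebra extension: If there exists a matrix $\mathbf{B}$ such that

$$\underset{(k \times k)}{\mathbf{B}}\;\underset{(k \times k)}{\mathbf{A}} = \underset{(k \times k)}{\mathbf{A}}\;\underset{(k \times k)}{\mathbf{B}} = \underset{(k \times k)}{\mathbf{I}}$$

then $\mathbf{B}$ is called the inverse of $\mathbf{A}$ and is denoted by $\mathbf{A}^{-1}$.

The technical condition that an inverse exists is that the $k$ columns $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ of $\mathbf{A}$ are linearly independent. That is, the existence of $\mathbf{A}^{-1}$ is equivalent to

$$c_1\mathbf{a}_1 + c_2\mathbf{a}_2 + \cdots + c_k\mathbf{a}_k = \mathbf{0} \quad\text{only if}\quad c_1 = \cdots = c_k = 0 \tag{2-12}$$

(See Result 2A.9 in Supplement 2A.)

Example 2.8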
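(The existence of a matrix inverse)

For

$$\mathbf{A} = \begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}$$

you may verify that

$$\begin{bmatrix} -.2 & .4 \\ .8 & -.6 \end{bmatrix} \begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix} = \begin{bmatrix} (-.2)3 + (.4)4 & (-.2)2 + (.4)1 \\ (.8)3 + (-.6)4 & (.8)2 + (-.6)1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

so

$$\begin{bmatrix} -.2 & .4 \\ .8 & -.6 \end{bmatrix}$$

is $\mathbf{A}^{-1}$. We note that

$$c_1 \begin{bmatrix} 3 \\ 4 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

implies that $c_1 = c_2 = 0$, so the columns of $\mathbf{A}$ are linearly independent. This confirms the condition stated in (2-12). •

A method for computing an inverse, when one exists, is given in Supplement 2A. The routine, but lengthy, calculations are usually relegated to a computer, especially when the dimension is greater than three. Even so, you must be forewarned that if the column sum in (2-12) is nearly $\mathbf{0}$ for some constants $c_1, \ldots, c_k$, then the computer may produce incorrect inverses due to extreme errors in rounding. It is always good to check the products $\mathbf{AA}^{-1}$ and $\mathbf{A}^{-1}\mathbf{A}$ for equality with $\mathbf{I}$ when $\mathbf{A}^{-1}$ is produced by a computer package. (See Exercise 2.10.)

Diagonal matrices have inverses that are easy to compute. For example,

$$\begin{bmatrix} a_{11} & 0 & 0 & 0 & 0 \\ 0 & a_{22} & 0 & 0 & 0 \\ 0 & 0 & a_{33} & 0 & 0 \\ 0 & 0 & 0 & a_{44} & 0 \\ 0 & 0 & 0 & 0 & a_{55} \end{bmatrix} \quad\text{has inverse}\quad \begin{bmatrix} \frac{1}{a_{11}} & 0 & 0 & 0 & 0 \\ 0 & \frac{1}{a_{22}} & 0 & 0 & 0 \\ 0 & 0 & \frac{1}{a_{33}} & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{a_{44}} & 0 \\ 0 & 0 & 0 & 0 & \frac{1}{a_{55}} \end{bmatrix}$$

if all the $a_{ii} \neq 0$.

Another special class of square matrices with which we shall become familiar are the orthogonal matrices, characterized by

$$\mathbf{QQ}' = \mathbf{Q}'\mathbf{Q} = \mathbf{I} \quad\text{or}\quad \mathbf{Q}' = \mathbf{Q}^{-1} \tag{2-13}$$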
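The name derives from the property that if $\mathbf{Q}$ has $i$th row $\mathbf{q}_i'$, then $\mathbf{QQ}' = \mathbf{I}$ implies that $\mathbf{q}_i'\mathbf{q}_i = 1$ and $\mathbf{q}_i'\mathbf{q}_j = 0$ for $i \neq j$, so the rows have unit length and are mutually perpendicular (orthogonal). According to the condition $\mathbf{Q}'\mathbf{Q} = \mathbf{I}$, the columns have the same property.

We conclude our brief introduction to the elements of matrix algebra by introducing a concept fundamental to multivariate statistical analysis. A square matrix $\mathbf{A}$ is said to have an eigenvalue $\lambda$, with corresponding eigenvector $\mathbf{x} \neq \mathbf{0}$, if

$$\mathbf{Ax} = \lambda\mathbf{x} \tag{2-14}$$

Ordinarily, we normalize $\mathbf{x}$ so that it has length unity; that is, $1 = \mathbf{x}'\mathbf{x}$. It is convenient to denote normalized eigenvectors by $\mathbf{e}$, and we do so in what follows. Sparing you the details of the derivation (see [1]), we state the following basic result:

Result. Let $\mathbf{A}$ be a $k \times k$ square symmetric matrix. Then $\mathbf{A}$ has $k$ pairs of eigenvalues and eigenvectors, namely,

$$\lambda_1, \mathbf{e}_1 \quad \lambda_2, \mathbf{e}_2 \quad \ldots \quad \lambda_k, \mathbf{e}_k \tag{2-15}$$

The eigenvectors can be chosen to satisfy $1 = \mathbf{e}_1'\mathbf{e}_1 = \cdots = \mathbf{e}_k'\mathbf{e}_k$ and be mutually perpendicular. The eigenvectors are unique unless two or more eigenvalues are equal.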
Example 2.9 (Verifying eigenvalues and eigenvectors)

Let

$$\mathbf{A} = \begin{bmatrix} 1 & -5 \\ -5 & 1 \end{bmatrix}$$

Then, since

$$\begin{bmatrix} 1 & -5 \\ -5 & 1 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix} = 6 \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix}$$

$\lambda_1 = 6$ is an eigenvalue, and

$$\mathbf{e}_1' = \left[\frac{1}{\sqrt{2}},\; -\frac{1}{\sqrt{2}}\right]$$

is its corresponding normalized eigenvector. You may wish to show that a second eigenvalue-eigenvector pair is $\lambda_2 = -4$, $\mathbf{e}_2' = [1/\sqrt{2},\; 1/\sqrt{2}]$. •

A method for calculating the $\lambda$'s and $\mathbf{e}$'s is described in Supplement 2A. It is instructive to do a few sample calculations to understand the technique. We usually rely on a computer when the dimension of the square matrix is greater than two or three.

2.3 POSITIVE DEFINITE MATRICES
The study of the variation and interrelationships in multivariate data is often based upon distances and the assumption that the data are multivariate normally distributed. Squared distances (see Chapter 1) and the multivariate normal density can be expressed in terms of matrix products called quadratic forms (see Chapter 4). Consequently, it should not be surprising that quadratic forms play a central role in multivariate analysis. In this section, we consider quadratic forms that are always nonnegative and the associated positive definite matrices.

Results involving quadratic forms and symmetric matrices are, in many cases, a direct consequence of an expansion for symmetric matrices known as the spectral decomposition. The spectral decomposition of a $k \times k$ symmetric matrix $\mathbf{A}$ is given by¹

$$\underset{(k \times k)}{\mathbf{A}} = \lambda_1\,\mathbf{e}_1\mathbf{e}_1' + \lambda_2\,\mathbf{e}_2\mathbf{e}_2' + \cdots + \lambda_k\,\mathbf{e}_k\mathbf{e}_k' \tag{2-16}$$

where $\lambda_1, \lambda_2, \ldots, \lambda_k$ are the eigenvalues of $\mathbf{A}$ and $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k$ are the associated normalized eigenvectors. (See also Result 2A.14 in Supplement 2A.) Thus, $\mathbf{e}_i'\mathbf{e}_i = 1$ for $i = 1, 2, \ldots, k$, and $\mathbf{e}_i'\mathbf{e}_j = 0$ for $i \neq j$.
Example 2.10 (The spectral decomposition of a matrix)

Consider the symmetric matrix

$$\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}$$

The eigenvalues obtained from the characteristic equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$ are $\lambda_1 = 9$, $\lambda_2 = 9$, and $\lambda_3 = 18$ (Definition 2A.30). The corresponding eigenvectors $\mathbf{e}_1$, $\mathbf{e}_2$, and $\mathbf{e}_3$ are the (normalized) solutions of the equations $\mathbf{Ae}_i = \lambda_i\mathbf{e}_i$ for $i = 1, 2, 3$. Thus, $\mathbf{Ae}_1 = \lambda_1\mathbf{e}_1$ gives

$$\begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix} \begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix} = 9 \begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix}$$

or

$$\begin{aligned} 13e_{11} - 4e_{21} + 2e_{31} &= 9e_{11} \\ -4e_{11} + 13e_{21} - 2e_{31} &= 9e_{21} \\ 2e_{11} - 2e_{21} + 10e_{31} &= 9e_{31} \end{aligned}$$

Moving the terms on the right of the equals sign to the left yields three homogeneous equations in three unknowns, but two of the equations are redundant. Selecting one of the equations and arbitrarily setting $e_{11} = 1$ and $e_{21} = 1$, we find that $e_{31} = 0$. Consequently, the normalized eigenvector is $\mathbf{e}_1' = [1/\sqrt{1^2 + 1^2 + 0^2},\; 1/\sqrt{1^2 + 1^2 + 0^2},\; 0/\sqrt{1^2 + 1^2 + 0^2}] = [1/\sqrt{2},\; 1/\sqrt{2},\; 0]$, since the sum of the squares of its elements is unity. You may verify that $\mathbf{e}_2' = [1/\sqrt{18},\; -1/\sqrt{18},\; -4/\sqrt{18}]$ is also an eigenvector for $9 = \lambda_2$, and $\mathbf{e}_3' = [2/3,\; -2/3,\; 1/3]$ is the normalized eigenvector corresponding to the eigenvalue $\lambda_3 = 18$. Moreover, $\mathbf{e}_i'\mathbf{e}_j = 0$ for $i \neq j$.

¹ A proof of Equation (2-16) is beyond the scope of this book. The interested reader will find a proof in [6], Chapter 8.
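The spectral decomposition of $\mathbf{A}$ is then

$$\mathbf{A} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' + \lambda_3\mathbf{e}_3\mathbf{e}_3'$$

or

$$\begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix} = 9 \begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \\ 0 \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \end{bmatrix} + 9 \begin{bmatrix} \frac{1}{\sqrt{18}} \\ \frac{-1}{\sqrt{18}} \\ \frac{-4}{\sqrt{18}} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{18}} & \frac{-1}{\sqrt{18}} & \frac{-4}{\sqrt{18}} \end{bmatrix} + 18 \begin{bmatrix} \frac{2}{3} \\ -\frac{2}{3} \\ \frac{1}{3} \end{bmatrix} \begin{bmatrix} \frac{2}{3} & -\frac{2}{3} & \frac{1}{3} \end{bmatrix}$$

$$= 9 \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 \\ 0 & 0 & 0 \end{bmatrix} + 9 \begin{bmatrix} \frac{1}{18} & -\frac{1}{18} & -\frac{4}{18} \\ -\frac{1}{18} & \frac{1}{18} & \frac{4}{18} \\ -\frac{4}{18} & \frac{4}{18} & \frac{16}{18} \end{bmatrix} + 18 \begin{bmatrix} \frac{4}{9} & -\frac{4}{9} & \frac{2}{9} \\ -\frac{4}{9} & \frac{4}{9} & -\frac{2}{9} \\ \frac{2}{9} & -\frac{2}{9} & \frac{1}{9} \end{bmatrix}$$

as you may readily verify.
•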
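The spectral decomposition of Example 2.10 can be verified numerically. The sketch below assumes NumPy and uses `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. Because the eigenvalue 9 is repeated, the two eigenvectors returned for it need not match the particular $\mathbf{e}_1$ and $\mathbf{e}_2$ chosen in the example; any orthonormal pair spanning the same plane rebuilds the same matrix.

```python
import numpy as np

# Symmetric matrix of Example 2.10
A = np.array([[13.0, -4.0,  2.0],
              [-4.0, 13.0, -2.0],
              [ 2.0, -2.0, 10.0]])

# Eigenvalues (ascending) and orthonormal eigenvectors (columns of E)
eigenvalues, E = np.linalg.eigh(A)

# Rebuild A as the sum of lambda_i * e_i e_i'
A_rebuilt = sum(lam * np.outer(E[:, i], E[:, i])
                for i, lam in enumerate(eigenvalues))

# The eigenvector for lambda = 18 is unique up to sign
e3 = E[:, 2]

print(np.round(eigenvalues, 6))
```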
The spectral decomposition is an important analytical tool. With it, we are very easily able to demonstrate certain statistical results. The first of these is a matrix explanation of distance, which we now develop.

Because $\mathbf{x}'\mathbf{Ax}$ has only squared terms $x_i^2$ and product terms $x_i x_k$, it is called a quadratic form. When a $k \times k$ symmetric matrix $\mathbf{A}$ is such that

$$0 \le \mathbf{x}'\mathbf{Ax} \tag{2-17}$$

for all $\mathbf{x}' = [x_1, x_2, \ldots, x_k]$, both the matrix $\mathbf{A}$ and the quadratic form are said to be nonnegative definite. If equality holds in (2-17) only for the vector $\mathbf{x}' = [0, 0, \ldots, 0]$, then $\mathbf{A}$ or the quadratic form is said to be positive definite. In other words, $\mathbf{A}$ is positive definite if

$$0 < \mathbf{x}'\mathbf{Ax} \tag{2-18}$$

for all vectors $\mathbf{x} \neq \mathbf{0}$.
Example 2.11 (A positive definite matrix and quadratic form)

Show that the matrix for the following quadratic form is positive definite:

$$3x_1^2 + 2x_2^2 - 2\sqrt{2}\,x_1 x_2$$

To illustrate the general approach, we first write the quadratic form in matrix notation as

$$\begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 3 & -\sqrt{2} \\ -\sqrt{2} & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \mathbf{x}'\mathbf{Ax}$$

By Definition 2A.30, the eigenvalues of $\mathbf{A}$ are the solutions of the equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$, or $(3 - \lambda)(2 - \lambda) - 2 = 0$. The solutions are $\lambda_1 = 4$ and $\lambda_2 = 1$. Using the spectral decomposition in (2-16), we can write

$$\underset{(2 \times 2)}{\mathbf{A}} = \lambda_1\,\mathbf{e}_1\mathbf{e}_1' + \lambda_2\,\mathbf{e}_2\mathbf{e}_2' = 4\,\mathbf{e}_1\mathbf{e}_1' + \mathbf{e}_2\mathbf{e}_2'$$

where $\mathbf{e}_1$ and $\mathbf{e}_2$ are the normalized and orthogonal eigenvectors associated with the eigenvalues $\lambda_1 = 4$ and $\lambda_2 = 1$, respectively. Because 4 and 1 are scalars, premultiplication and postmultiplication of $\mathbf{A}$ by $\mathbf{x}'$ and $\mathbf{x}$, respectively, where $\mathbf{x}' = [x_1, x_2]$ is any nonzero vector, give

$$\mathbf{x}'\mathbf{Ax} = 4\,\mathbf{x}'\mathbf{e}_1\mathbf{e}_1'\mathbf{x} + \mathbf{x}'\mathbf{e}_2\mathbf{e}_2'\mathbf{x} = 4y_1^2 + y_2^2 \ge 0$$

with

$$y_1 = \mathbf{x}'\mathbf{e}_1 = \mathbf{e}_1'\mathbf{x} \quad\text{and}\quad y_2 = \mathbf{x}'\mathbf{e}_2 = \mathbf{e}_2'\mathbf{x}$$

We now show that $y_1$ and $y_2$ are not both zero and, consequently, that $\mathbf{x}'\mathbf{Ax} = 4y_1^2 + y_2^2 > 0$, or $\mathbf{A}$ is positive definite. From the definitions of $y_1$ and $y_2$, we have

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \mathbf{e}_1' \\ \mathbf{e}_2' \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

or

$$\underset{(2 \times 1)}{\mathbf{y}} = \underset{(2 \times 2)}{\mathbf{E}}\;\underset{(2 \times 1)}{\mathbf{x}}$$

Now $\mathbf{E}$ is an orthogonal matrix and hence has inverse $\mathbf{E}'$. Thus, $\mathbf{x} = \mathbf{E}'\mathbf{y}$. But $\mathbf{x}$ is a nonzero vector, and $\mathbf{0} \neq \mathbf{x} = \mathbf{E}'\mathbf{y}$ implies that $\mathbf{y} \neq \mathbf{0}$. •

Using the spectral decomposition, we can easily show that a $k \times k$ symmetric matrix $\mathbf{A}$ is a positive definite matrix if and only if every eigenvalue of $\mathbf{A}$ is positive. (See Exercise 2.17.) $\mathbf{A}$ is a nonnegative definite matrix if and only if all of its eigenvalues are greater than or equal to zero.

Assume for the moment that the $p$ elements $x_1, x_2, \ldots, x_p$ of a vector $\mathbf{x}$ are realizations of $p$ random variables $X_1, X_2, \ldots, X_p$. As we pointed out in Chapter 1, we can regard these elements as the coordinates of a point in $p$-dimensional space, and the "distance" of the point $[x_1, x_2, \ldots, x_p]'$ to the origin can, and in this case should, be interpreted in terms of standard deviation units. In this way, we can account for the inherent uncertainty (variability) in the observations. Points with the same associated "uncertainty" are regarded as being at the same distance from the origin.

If we use the distance formula introduced in Chapter 1 [see Equation (1-22)], the distance from the origin satisfies the general formula

$$(\text{distance})^2 = a_{11}x_1^2 + a_{22}x_2^2 + \cdots + a_{pp}x_p^2 + 2(a_{12}x_1x_2 + a_{13}x_1x_3 + \cdots + a_{p-1,p}\,x_{p-1}x_p)$$

provided that $(\text{distance})^2 > 0$ for all $[x_1, x_2, \ldots, x_p] \neq [0, 0, \ldots, 0]$. Setting $a_{ij} = a_{ji}$, $i \neq j$, $i = 1, 2, \ldots, p$, $j = 1, 2, \ldots, p$, we have

$$0 < (\text{distance})^2 = \begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pp} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{bmatrix}$$

or

$$0 < (\text{distance})^2 = \mathbf{x}'\mathbf{Ax} \quad\text{for } \mathbf{x} \neq \mathbf{0} \tag{2-19}$$

From (2-19), we see that the $p \times p$ symmetric matrix $\mathbf{A}$ is positive definite. In sum, distance is determined from a positive definite quadratic form $\mathbf{x}'\mathbf{Ax}$. Conversely, a positive definite quadratic form can be interpreted as a squared distance.
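The eigenvalue criterion for positive definiteness is easy to apply by computer. The sketch below assumes NumPy and uses the matrix of the quadratic form $3x_1^2 + 2x_2^2 - 2\sqrt{2}\,x_1x_2$ from Example 2.11. The random test vectors are only an illustrative spot check (with an arbitrary seed), not a proof; the eigenvalue test is the actual criterion.

```python
import numpy as np

# Matrix of the quadratic form 3*x1^2 + 2*x2^2 - 2*sqrt(2)*x1*x2
A = np.array([[3.0, -np.sqrt(2.0)],
              [-np.sqrt(2.0), 2.0]])

# A symmetric matrix is positive definite iff all eigenvalues are positive
eigenvalues = np.linalg.eigvalsh(A)          # ascending order
is_positive_definite = bool(np.all(eigenvalues > 0))

# Spot check: x'Ax evaluated at 1000 random nonzero vectors
rng = np.random.default_rng(0)
xs = rng.normal(size=(1000, 2))
quad_forms = np.einsum("ni,ij,nj->n", xs, A, xs)

print(np.round(eigenvalues, 6), is_positive_definite)
```

In practice a tolerance is often used in place of the strict comparison with zero, since computed eigenvalues of a nonnegative definite matrix can come out very slightly negative.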
Comment. Let the square of the distance from the point $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ to the origin be given by $\mathbf{x}'\mathbf{Ax}$, where $\mathbf{A}$ is a $p \times p$ symmetric positive definite matrix. Then the square of the distance from $\mathbf{x}$ to an arbitrary fixed point $\boldsymbol{\mu}' = [\mu_1, \mu_2, \ldots, \mu_p]$ is given by the general expression $(\mathbf{x} - \boldsymbol{\mu})'\mathbf{A}(\mathbf{x} - \boldsymbol{\mu})$.

Expressing distance as the square root of a positive definite quadratic form allows us to give a geometrical interpretation based on the eigenvalues and eigenvectors of the matrix $\mathbf{A}$. For example, suppose $p = 2$. Then the points $\mathbf{x}' = [x_1, x_2]$ of constant distance $c$ from the origin satisfy

$$\mathbf{x}'\mathbf{Ax} = a_{11}x_1^2 + a_{22}x_2^2 + 2a_{12}x_1x_2 = c^2$$

By the spectral decomposition, as in Example 2.11,

$$\mathbf{A} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' \quad\text{so}\quad \mathbf{x}'\mathbf{Ax} = \lambda_1(\mathbf{x}'\mathbf{e}_1)^2 + \lambda_2(\mathbf{x}'\mathbf{e}_2)^2$$

Now, $c^2 = \lambda_1 y_1^2 + \lambda_2 y_2^2$ is an ellipse in $y_1 = \mathbf{x}'\mathbf{e}_1$ and $y_2 = \mathbf{x}'\mathbf{e}_2$ because $\lambda_1, \lambda_2 > 0$ when $\mathbf{A}$ is positive definite. (See Exercise 2.17.) We easily verify that $\mathbf{x} = c\lambda_1^{-1/2}\mathbf{e}_1$ satisfies $\mathbf{x}'\mathbf{Ax} = \lambda_1(c\lambda_1^{-1/2}\mathbf{e}_1'\mathbf{e}_1)^2 = c^2$. Similarly, $\mathbf{x} = c\lambda_2^{-1/2}\mathbf{e}_2$ gives the appropriate distance in the $\mathbf{e}_2$ direction. Thus, the points at distance $c$ lie on an ellipse whose axes are given by the eigenvectors of $\mathbf{A}$ with lengths proportional to the reciprocals of the square roots of the eigenvalues. The constant of proportionality is $c$. The situation is illustrated in Figure 2.6.
Figure 2.6 Points a constant distance $c$ from the origin ($p = 2$, $1 \le \lambda_1 < \lambda_2$).

If $p > 2$, the points $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ a constant distance $c = \sqrt{\mathbf{x}'\mathbf{Ax}}$ from the origin lie on hyperellipsoids $c^2 = \lambda_1(\mathbf{x}'\mathbf{e}_1)^2 + \cdots + \lambda_p(\mathbf{x}'\mathbf{e}_p)^2$, whose axes are given by the eigenvectors of $\mathbf{A}$. The half-length in the direction $\mathbf{e}_i$ is equal to $c/\sqrt{\lambda_i}$, $i = 1, 2, \ldots, p$, where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of $\mathbf{A}$.

2.4 A SQUARE-ROOT MATRIX
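The spectral decomposition allows us to express the inverse of a square matrix in terms of its eigenvalues and eigenvectors, and this leads to a useful square-root matrix.

Let $\mathbf{A}$ be a $k \times k$ positive definite matrix with the spectral decomposition $\mathbf{A} = \sum_{i=1}^{k} \lambda_i \mathbf{e}_i\mathbf{e}_i'$. Let the normalized eigenvectors be the columns of another matrix $\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k]$. Then

$$\underset{(k \times k)}{\mathbf{A}} = \sum_{i=1}^{k} \lambda_i\,\mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}' \tag{2-20}$$

where $\mathbf{PP}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$ and $\boldsymbol{\Lambda}$ is the diagonal matrix

$$\underset{(k \times k)}{\boldsymbol{\Lambda}} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_k \end{bmatrix} \quad\text{with } \lambda_i > 0$$

Thus,

$$\mathbf{A}^{-1} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}' = \sum_{i=1}^{k} \frac{1}{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i' \tag{2-21}$$

since $(\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}')\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}' = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'(\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}') = \mathbf{PP}' = \mathbf{I}$.

Next, let $\boldsymbol{\Lambda}^{1/2}$ denote the diagonal matrix with $\sqrt{\lambda_i}$ as the $i$th diagonal element. The matrix $\sum_{i=1}^{k} \sqrt{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$ is called the square root of $\mathbf{A}$ and is denoted by $\mathbf{A}^{1/2}$.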
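The inverse and the square-root matrix can both be built from the eigenvalues and eigenvectors, as described above. The sketch below assumes NumPy and uses the positive definite matrix of Example 2.10 as sample input; it compares the eigen-based inverse with `numpy.linalg.inv` and checks that the square root multiplied by itself recovers the matrix.

```python
import numpy as np

# Positive definite matrix (the one in Example 2.10)
A = np.array([[13.0, -4.0,  2.0],
              [-4.0, 13.0, -2.0],
              [ 2.0, -2.0, 10.0]])

# Columns of P are normalized eigenvectors; lam holds the eigenvalues
lam, P = np.linalg.eigh(A)

# A^{-1} = P Lambda^{-1} P'   and   A^{1/2} = P Lambda^{1/2} P'
A_inv = P @ np.diag(1.0 / lam) @ P.T
A_half = P @ np.diag(np.sqrt(lam)) @ P.T

# Checks recommended in the text: A A^{-1} should equal I
identity_check = A @ A_inv

print(np.round(A_half @ A_half, 6))
```

The construction requires every eigenvalue to be positive; with a zero or negative eigenvalue the reciprocal or the real square root is not available.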
2.5 RANDOM VECTORS AND MATRICES

A random vector is a vector whose elements are random variables. Similarly, a random matrix is a matrix whose elements are random variables. The expected value of a random matrix (or vector) is the matrix (vector) consisting of the expected values of each of its elements. Specifically, let $\mathbf{X} = \{X_{ij}\}$ be an $n \times p$ random matrix. Then the expected value of $\mathbf{X}$, denoted by $E(\mathbf{X})$, is the $n \times p$ matrix of numbers (if they exist)

$$E(\mathbf{X}) = \begin{bmatrix} E(X_{11}) & E(X_{12}) & \cdots & E(X_{1p}) \\ E(X_{21}) & E(X_{22}) & \cdots & E(X_{2p}) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_{n1}) & E(X_{n2}) & \cdots & E(X_{np}) \end{bmatrix} \tag{2-23}$$

where, for each element of the matrix,²

$$E(X_{ij}) = \begin{cases} \displaystyle\int_{-\infty}^{\infty} x_{ij}\,f_{ij}(x_{ij})\,dx_{ij} & \text{if } X_{ij} \text{ is a continuous random variable with probability density function } f_{ij}(x_{ij}) \\[2ex] \displaystyle\sum_{\text{all } x_{ij}} x_{ij}\,p_{ij}(x_{ij}) & \text{if } X_{ij} \text{ is a discrete random variable with probability function } p_{ij}(x_{ij}) \end{cases}$$

² If you are unfamiliar with calculus, you should concentrate on the interpretation of the expected value and, eventually, variance. Our development is based primarily on the properties of expectation rather than its particular evaluation for continuous or discrete random variables.
Example 2.12 (Computing expected values for discrete random variables)

Suppose $p = 2$ and $n = 1$, and consider the random vector $\mathbf{X}' = [X_1, X_2]$. Let the discrete random variable $X_1$ have the following probability function:

x1         -1     0     1
p1(x1)     .3    .3    .4

Then $E(X_1) = \sum_{\text{all } x_1} x_1 p_1(x_1) = (-1)(.3) + (0)(.3) + (1)(.4) = .1$.

Similarly, let the discrete random variable $X_2$ have the probability function

x2          0     1
p2(x2)     .8    .2

Then $E(X_2) = \sum_{\text{all } x_2} x_2 p_2(x_2) = (0)(.8) + (1)(.2) = .2$.

Thus,

$$E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} .1 \\ .2 \end{bmatrix}$$

•
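The expected values in Example 2.12 are weighted sums and can be written as inner products. A minimal sketch, assuming NumPy; the array names are illustrative.

```python
import numpy as np

# Values and probabilities for X1 and X2 from Example 2.12
x1_values = np.array([-1.0, 0.0, 1.0])
p1 = np.array([0.3, 0.3, 0.4])
x2_values = np.array([0.0, 1.0])
p2 = np.array([0.8, 0.2])

# E(X) = sum over all values of x * p(x), computed as an inner product
E_X1 = x1_values @ p1
E_X2 = x2_values @ p2

# Expected value of the random vector X' = [X1, X2]
E_X = np.array([E_X1, E_X2])

print(np.round(E_X, 4))
```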
Two results involving the expectation of sums and products of matrices follow directly from the definition of the expected value of a random matrix and the univariate properties of expectation, $E(X_1 + Y_1) = E(X_1) + E(Y_1)$ and $E(cX_1) = cE(X_1)$. Let $\mathbf{X}$ and $\mathbf{Y}$ be random matrices of the same dimension, and let $\mathbf{A}$ and $\mathbf{B}$ be conformable matrices of constants. Then (see Exercise 2.40)

$$\begin{aligned} E(\mathbf{X} + \mathbf{Y}) &= E(\mathbf{X}) + E(\mathbf{Y}) \\ E(\mathbf{AXB}) &= \mathbf{A}\,E(\mathbf{X})\,\mathbf{B} \end{aligned} \tag{2-24}$$

2.6 MEAN VECTORS AND COVARIANCE MATRICES

Suppose $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ is a $p \times 1$ random vector. Then each element of $\mathbf{X}$ is a random variable with its own marginal probability distribution. (See Example 2.12.) The marginal means $\mu_i$ and variances $\sigma_i^2$ are defined as $\mu_i = E(X_i)$ and $\sigma_i^2 = E(X_i - \mu_i)^2$, $i = 1, 2, \ldots, p$, respectively. Specifically,

$$\mu_i = \begin{cases} \displaystyle\int_{-\infty}^{\infty} x_i\,f_i(x_i)\,dx_i & \text{if } X_i \text{ is a continuous random variable with probability density function } f_i(x_i) \\[2ex] \displaystyle\sum_{\text{all } x_i} x_i\,p_i(x_i) & \text{if } X_i \text{ is a discrete random variable with probability function } p_i(x_i) \end{cases}$$

$$\sigma_i^2 = \begin{cases} \displaystyle\int_{-\infty}^{\infty} (x_i - \mu_i)^2 f_i(x_i)\,dx_i & \text{if } X_i \text{ is a continuous random variable with probability density function } f_i(x_i) \\[2ex] \displaystyle\sum_{\text{all } x_i} (x_i - \mu_i)^2 p_i(x_i) & \text{if } X_i \text{ is a discrete random variable with probability function } p_i(x_i) \end{cases} \tag{2-25}$$
It will be convenient in later sections to denote the marginal variances by $\sigma_{ii}$ rather than the more traditional $\sigma_i^2$, and consequently, we shall adopt this notation.

The behavior of any pair of random variables, such as $X_i$ and $X_k$, is described by their joint probability function, and a measure of the linear association between them is provided by the covariance

$$\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k) = \begin{cases} \displaystyle\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} (x_i - \mu_i)(x_k - \mu_k)\,f_{ik}(x_i, x_k)\,dx_i\,dx_k & \text{if } X_i, X_k \text{ are continuous random variables with the joint density function } f_{ik}(x_i, x_k) \\[2ex] \displaystyle\sum_{\text{all } x_i}\sum_{\text{all } x_k} (x_i - \mu_i)(x_k - \mu_k)\,p_{ik}(x_i, x_k) & \text{if } X_i, X_k \text{ are discrete random variables with joint probability function } p_{ik}(x_i, x_k) \end{cases} \tag{2-26}$$

and $\mu_i$ and $\mu_k$, $i, k = 1, 2, \ldots, p$, are the marginal means. When $i = k$, the covariance becomes the marginal variance.

More generally, the collective behavior of the $p$ random variables $X_1, X_2, \ldots, X_p$ or, equivalently, the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$, is described by a joint probability density function $f(x_1, x_2, \ldots, x_p) = f(\mathbf{x})$. As we have already noted in this book, $f(\mathbf{x})$ will often be the multivariate normal density function. (See Chapter 4.)

If the joint probability $P[X_i \le x_i \text{ and } X_k \le x_k]$ can be written as the product of the corresponding marginal probabilities, so that

$$P[X_i \le x_i \text{ and } X_k \le x_k] = P[X_i \le x_i]\,P[X_k \le x_k] \tag{2-27}$$

for all pairs of values $x_i, x_k$, then $X_i$ and $X_k$ are said to be statistically independent. When $X_i$ and $X_k$ are continuous random variables with joint density $f_{ik}(x_i, x_k)$ and marginal densities $f_i(x_i)$ and $f_k(x_k)$, the independence condition becomes

$$f_{ik}(x_i, x_k) = f_i(x_i)\,f_k(x_k)$$

for all pairs $(x_i, x_k)$.

The $p$ continuous random variables $X_1, X_2, \ldots, X_p$ are mutually statistically independent if their joint density can be factored as

$$f_{12\cdots p}(x_1, x_2, \ldots, x_p) = f_1(x_1)\,f_2(x_2)\cdots f_p(x_p) \tag{2-28}$$

for all $p$-tuples $(x_1, x_2, \ldots, x_p)$.

Statistical independence has an important implication for covariance. The factorization in (2-28) implies that $\text{Cov}(X_i, X_k) = 0$. Thus,

$$\text{Cov}(X_i, X_k) = 0 \quad\text{if } X_i \text{ and } X_k \text{ are independent} \tag{2-29}$$
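The converse of (2-29) is not true in general; there are situations where $\text{Cov}(X_i, X_k) = 0$, but $X_i$ and $X_k$ are not independent. (See [2].)

The means and covariances of the $p \times 1$ random vector $\mathbf{X}$ can be set out as matrices. The expected value of each element is contained in the vector of means $\boldsymbol{\mu} = E(\mathbf{X})$, and the $p$ variances $\sigma_{ii}$ and the $p(p - 1)/2$ distinct covariances $\sigma_{ik}$ $(i < k)$ are contained in the symmetric variance-covariance matrix $\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'$. Specifically,

$$E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_p) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} = \boldsymbol{\mu} \tag{2-30}$$

and

$$\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' = E\left( \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_p - \mu_p \end{bmatrix} \begin{bmatrix} X_1 - \mu_1, & X_2 - \mu_2, & \ldots, & X_p - \mu_p \end{bmatrix} \right)$$

$$= E \begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\ \vdots & \vdots & \ddots & \vdots \\ (X_p - \mu_p)(X_1 - \mu_1) & (X_p - \mu_p)(X_2 - \mu_2) & \cdots & (X_p - \mu_p)^2 \end{bmatrix}$$

$$= \begin{bmatrix} E(X_1 - \mu_1)^2 & E(X_1 - \mu_1)(X_2 - \mu_2) & \cdots & E(X_1 - \mu_1)(X_p - \mu_p) \\ E(X_2 - \mu_2)(X_1 - \mu_1) & E(X_2 - \mu_2)^2 & \cdots & E(X_2 - \mu_2)(X_p - \mu_p) \\ \vdots & \vdots & \ddots & \vdots \\ E(X_p - \mu_p)(X_1 - \mu_1) & E(X_p - \mu_p)(X_2 - \mu_2) & \cdots & E(X_p - \mu_p)^2 \end{bmatrix}$$

or

$$\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix} \tag{2-31}$$

Example 2.13 (Computing the covariance matrix)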
Find the covariance matrix for the two random variables $X_1$ and $X_2$ introduced in Example 2.12 when their joint probability function $p_{12}(x_1, x_2)$ is represented by the entries in the body of the following table:

x1 \ x2       0      1     p1(x1)
  -1         .24    .06      .3
   0         .16    .14      .3
   1         .40    .00      .4
p2(x2)       .8     .2       1
We have already shown that $\mu_1 = E(X_1) = .1$ and $\mu_2 = E(X_2) = .2$. (See Example 2.12.) In addition,

$$\sigma_{11} = E(X_1 - \mu_1)^2 = \sum_{\text{all } x_1} (x_1 - .1)^2 p_1(x_1) = (-1 - .1)^2(.3) + (0 - .1)^2(.3) + (1 - .1)^2(.4) = .69$$

$$\sigma_{22} = E(X_2 - \mu_2)^2 = \sum_{\text{all } x_2} (x_2 - .2)^2 p_2(x_2) = (0 - .2)^2(.8) + (1 - .2)^2(.2) = .16$$

$$\sigma_{12} = E(X_1 - \mu_1)(X_2 - \mu_2) = \sum_{\text{all pairs } (x_1, x_2)} (x_1 - .1)(x_2 - .2)\,p_{12}(x_1, x_2)$$

$$= (-1 - .1)(0 - .2)(.24) + (-1 - .1)(1 - .2)(.06) + \cdots + (1 - .1)(1 - .2)(.00) = -.08$$

Consequently, with $\mathbf{X}' = [X_1, X_2]$,

$$\boldsymbol{\mu} = E(\mathbf{X}) = \begin{bmatrix} E(X_1) \\ E(X_2) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} .1 \\ .2 \end{bmatrix}$$

and

$$\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' = E \begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{bmatrix} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} .69 & -.08 \\ -.08 & .16 \end{bmatrix}$$

•
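The calculation in Example 2.13 can be reproduced from the joint probability table. The sketch below assumes NumPy; the table is stored as a 3 x 2 array whose rows correspond to $x_1 = -1, 0, 1$ and whose columns correspond to $x_2 = 0, 1$.

```python
import numpy as np

# Joint probability function p12(x1, x2) from Example 2.13
x1_values = np.array([-1.0, 0.0, 1.0])
x2_values = np.array([0.0, 1.0])
p12 = np.array([[0.24, 0.06],
                [0.16, 0.14],
                [0.40, 0.00]])

# Marginal probability functions are the row and column sums
p1 = p12.sum(axis=1)
p2 = p12.sum(axis=0)

# Marginal means
mu1 = x1_values @ p1
mu2 = x2_values @ p2

# Variances and covariance as probability-weighted sums of deviations
d1 = x1_values - mu1
d2 = x2_values - mu2
sigma11 = (d1 ** 2) @ p1
sigma22 = (d2 ** 2) @ p2
sigma12 = d1 @ p12 @ d2      # double sum over all pairs (x1, x2)

Sigma = np.array([[sigma11, sigma12],
                  [sigma12, sigma22]])

print(np.round(Sigma, 4))
```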
We note that the computation of means, variances, and covariances for discrete random variables involves summation (as in Examples 2.12 and 2.13), while analogous computations for continuous random variables involve integration.

Because $\sigma_{ik} = E(X_i - \mu_i)(X_k - \mu_k) = \sigma_{ki}$, it is convenient to write the matrix appearing in (2-31) as

$$\boldsymbol{\Sigma} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp} \end{bmatrix} \tag{2-32}$$

We shall refer to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as the population mean (vector) and population variance-covariance (matrix), respectively.

The multivariate normal distribution is completely specified once the mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$ are given (see Chapter 4), so it is not surprising that these quantities play an important role in many multivariate procedures.

It is frequently informative to separate the information contained in variances $\sigma_{ii}$ from that contained in measures of association and, in particular, the measure of association known as the population correlation coefficient $\rho_{ik}$. The correlation coefficient $\rho_{ik}$ is defined in terms of the covariance $\sigma_{ik}$ and variances $\sigma_{ii}$ and $\sigma_{kk}$ as

$$\rho_{ik} = \frac{\sigma_{ik}}{\sqrt{\sigma_{ii}}\,\sqrt{\sigma_{kk}}} \tag{2-33}$$

The correlation coefficient measures the amount of linear association between the random variables $X_i$ and $X_k$. (See, for example, [2].)
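Let the population correlation matrix be the $p \times p$ symmetric matrix

$$\boldsymbol{\rho} = \begin{bmatrix} \dfrac{\sigma_{11}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{11}}} & \dfrac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} & \cdots & \dfrac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}} \\ \dfrac{\sigma_{12}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} & \dfrac{\sigma_{22}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{22}}} & \cdots & \dfrac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}} \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\sigma_{1p}}{\sqrt{\sigma_{11}}\sqrt{\sigma_{pp}}} & \dfrac{\sigma_{2p}}{\sqrt{\sigma_{22}}\sqrt{\sigma_{pp}}} & \cdots & \dfrac{\sigma_{pp}}{\sqrt{\sigma_{pp}}\sqrt{\sigma_{pp}}} \end{bmatrix} = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{12} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{1p} & \rho_{2p} & \cdots & 1 \end{bmatrix} \tag{2-34}$$

and let the $p \times p$ standard deviation matrix be

$$\mathbf{V}^{1/2} = \begin{bmatrix} \sqrt{\sigma_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{\sigma_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\sigma_{pp}} \end{bmatrix} \tag{2-35}$$

Then it is easily verified (see Exercise 2.23) that

$$\mathbf{V}^{1/2}\boldsymbol{\rho}\mathbf{V}^{1/2} = \boldsymbol{\Sigma} \tag{2-36}$$

and

$$\boldsymbol{\rho} = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} \tag{2-37}$$

That is, $\boldsymbol{\Sigma}$ can be obtained from $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$, whereas $\boldsymbol{\rho}$ can be obtained from $\boldsymbol{\Sigma}$. Moreover, the expression of these relationships in terms of matrix operations allows the calculations to be conveniently implemented on a computer.

Example 2.14 (Computing the correlation matrix from the covariance matrix)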
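Suppose

$$\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}$$

Obtain $\mathbf{V}^{1/2}$ and $\boldsymbol{\rho}$.

Here

$$\mathbf{V}^{1/2} = \begin{bmatrix} \sqrt{\sigma_{11}} & 0 & 0 \\ 0 & \sqrt{\sigma_{22}} & 0 \\ 0 & 0 & \sqrt{\sigma_{33}} \end{bmatrix} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{bmatrix}$$

and

$$(\mathbf{V}^{1/2})^{-1} = \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix}$$

Consequently, from (2-37), the correlation matrix $\boldsymbol{\rho}$ is given by

$$(\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} = \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix} \begin{bmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{bmatrix} \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & \frac{1}{5} \end{bmatrix} = \begin{bmatrix} 1 & \frac{1}{6} & \frac{1}{5} \\ \frac{1}{6} & 1 & -\frac{1}{5} \\ \frac{1}{5} & -\frac{1}{5} & 1 \end{bmatrix}$$

•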
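The text notes that relations (2-36) and (2-37) are convenient to implement on a computer. The sketch below assumes NumPy and applies them to the covariance matrix of Example 2.14 (variances 4, 9, and 25; covariances 1, 2, and -3).

```python
import numpy as np

# Covariance matrix of Example 2.14
Sigma = np.array([[4.0,  1.0,  2.0],
                  [1.0,  9.0, -3.0],
                  [2.0, -3.0, 25.0]])

# Standard deviation matrix V^{1/2} and its inverse, both diagonal
std_devs = np.sqrt(np.diag(Sigma))
V_half = np.diag(std_devs)
V_half_inv = np.diag(1.0 / std_devs)

# (2-37): correlation matrix from the covariance matrix
rho = V_half_inv @ Sigma @ V_half_inv

# (2-36): recover the covariance matrix from V^{1/2} and rho
Sigma_back = V_half @ rho @ V_half

print(np.round(rho, 4))
```

The same result is obtained elementwise by dividing each $\sigma_{ik}$ by $\sqrt{\sigma_{ii}}\sqrt{\sigma_{kk}}$, that is, `Sigma / np.outer(std_devs, std_devs)`.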
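Partitioning the Covariance Matrix

Often, the characteristics measured on individual trials will fall naturally into two or more groups. As examples, consider measurements of variables representing consumption and income or variables representing personality traits and physical characteristics. One approach to handling these situations is to let the characteristics defining the distinct groups be subsets of the total collection of characteristics. If the total collection is represented by a $(p \times 1)$-dimensional random vector $\mathbf{X}$, the subsets can be regarded as components of $\mathbf{X}$ and can be sorted by partitioning $\mathbf{X}$.

In general, we can partition the $p$ characteristics contained in the $p \times 1$ random vector $\mathbf{X}$ into, for instance, two groups of size $q$ and $p - q$, respectively. For example, we can write

$$\mathbf{X} = \begin{bmatrix} X_1 \\ \vdots \\ X_q \\ \hline X_{q+1} \\ \vdots \\ X_p \end{bmatrix} \begin{matrix} \left.\vphantom{\begin{matrix} X_1 \\ \vdots \\ X_q \end{matrix}}\right\} q \\ \left.\vphantom{\begin{matrix} X_{q+1} \\ \vdots \\ X_p \end{matrix}}\right\} p - q \end{matrix} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \hline \mathbf{X}^{(2)} \end{bmatrix} \quad\text{and}\quad \boldsymbol{\mu} = E(\mathbf{X}) = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_q \\ \hline \mu_{q+1} \\ \vdots \\ \mu_p \end{bmatrix} = \begin{bmatrix} \boldsymbol{\mu}^{(1)} \\ \hline \boldsymbol{\mu}^{(2)} \end{bmatrix} \tag{2-38}$$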
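From the definitions of the transpose and matrix multiplication,

$$(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})' = \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \\ \vdots \\ X_q - \mu_q \end{bmatrix} \begin{bmatrix} X_{q+1} - \mu_{q+1}, & X_{q+2} - \mu_{q+2}, & \ldots, & X_p - \mu_p \end{bmatrix}$$

$$= \begin{bmatrix} (X_1 - \mu_1)(X_{q+1} - \mu_{q+1}) & (X_1 - \mu_1)(X_{q+2} - \mu_{q+2}) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\ (X_2 - \mu_2)(X_{q+1} - \mu_{q+1}) & (X_2 - \mu_2)(X_{q+2} - \mu_{q+2}) & \cdots & (X_2 - \mu_2)(X_p - \mu_p) \\ \vdots & \vdots & \ddots & \vdots \\ (X_q - \mu_q)(X_{q+1} - \mu_{q+1}) & (X_q - \mu_q)(X_{q+2} - \mu_{q+2}) & \cdots & (X_q - \mu_q)(X_p - \mu_p) \end{bmatrix}$$

Upon taking the expectation of the matrix $(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'$, we get

$$E(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})' = \begin{bmatrix} \sigma_{1,q+1} & \sigma_{1,q+2} & \cdots & \sigma_{1p} \\ \sigma_{2,q+1} & \sigma_{2,q+2} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{q,q+1} & \sigma_{q,q+2} & \cdots & \sigma_{qp} \end{bmatrix} = \boldsymbol{\Sigma}_{12} \tag{2-39}$$

which gives all the covariances, $\sigma_{ij}$, $i = 1, 2, \ldots, q$, $j = q + 1, q + 2, \ldots, p$, between a component of $\mathbf{X}^{(1)}$ and a component of $\mathbf{X}^{(2)}$. Note that the matrix $\boldsymbol{\Sigma}_{12}$ is not necessarily symmetric or even square.

Making use of the partitioning in Equation (2-38), we can easily demonstrate that

$$(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' = \begin{bmatrix} \underset{(q \times 1)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})}\,\underset{(1 \times q)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})'} & \underset{(q \times 1)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})}\,\underset{(1 \times (p-q))}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'} \\ \underset{((p-q) \times 1)}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})}\,\underset{(1 \times q)}{(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})'} & \underset{((p-q) \times 1)}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})}\,\underset{(1 \times (p-q))}{(\mathbf{X}^{(2)} - \boldsymbol{\mu}^{(2)})'} \end{bmatrix}$$

and consequently,

$$\underset{(p \times p)}{\boldsymbol{\Sigma}} = E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' = \left[\begin{array}{c|c} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \hline \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{array}\right] = \left[\begin{array}{ccc|ccc} \sigma_{11} & \cdots & \sigma_{1q} & \sigma_{1,q+1} & \cdots & \sigma_{1p} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{q1} & \cdots & \sigma_{qq} & \sigma_{q,q+1} & \cdots & \sigma_{qp} \\ \hline \sigma_{q+1,1} & \cdots & \sigma_{q+1,q} & \sigma_{q+1,q+1} & \cdots & \sigma_{q+1,p} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \cdots & \sigma_{pq} & \sigma_{p,q+1} & \cdots & \sigma_{pp} \end{array}\right] \tag{2-40}$$

where $\boldsymbol{\Sigma}_{11}$ is $q \times q$, $\boldsymbol{\Sigma}_{12}$ is $q \times (p - q)$, $\boldsymbol{\Sigma}_{21}$ is $(p - q) \times q$, and $\boldsymbol{\Sigma}_{22}$ is $(p - q) \times (p - q)$.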
Chapter 2
Matrix Algebra and Random Vectors
Note that I 1 2 = I21 . The covariance matrix of X (l ) is I 1 1 , that of X ( 2 ) is I22 , and that of elements from x (l ) and X (2 ) is I 1 2 ( or I 2 1 ). The Mean Vector a n d Cova riance Matrix for Linear Combi nations of Ra ndom Variables
Recall that if a single random variable, such as X1 , is multiplied by a constant c, then E(cX1 ) = cE(X1 ) = CJL 1 and If X2 is a second random variable and a and b are constants, then, using additional properties of expectation, we get Cov ( aX1 , bX2 ) = E(aX1  aJL1 ) (bX2  bJL2 ) = abE (X1  JL1 ) (X2  JL2 ) = abCov ( X1 , X2 ) = aba 1 2 Finally, for the linear combination aX1 + bX2 , we have E( aX1 + bX2 ) = aE(X1 ) + bE(X2 ) = aJL1 + bJL2 Var (aX1 + bX2 ) = E[ (aX1 + bX2 )  (aJL1 + bJL2 ) ] 2 I
I
I
I
= E[a( X1  JL1 ) + b(X2  JL2 ) ] 2 = E[a2 (X1  JL1 ) 2 + b2 (X2  JL2 ) 2 + 2ab(X1  JL 1 ) ( X2  JL2 ) J = a 2 Var (X1 ) + b2 Var ( X2 ) + 2abCov (X1 , X2 ) = a 2 a 1 1 + b2 a22 + 2aba 1 2 (241) With c' = [a, b], aX1 + bX2 can be written as [a b]
[ �: ]
=
c' X
Similarly, E( aX1 + bX2 ) = aJL1 + b JL2 can be expressed as
[a b] If we let
[:: ] = c'
p
be the variancecovariance matrix of X, Equation (241) becomes Var (aX1 + bX2 ) = Var ( c' X ) = c' I c since
(242)
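The identity $\text{Var}(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\Sigma}\mathbf{c}$ can be checked against a direct calculation. The sketch below assumes NumPy and reuses the joint probability table of Example 2.13; the coefficients $a = 2$ and $b = -1$ are hypothetical values chosen only for illustration.

```python
import numpy as np

# Joint distribution of (X1, X2) from Example 2.13
x1_values = np.array([-1.0, 0.0, 1.0])
x2_values = np.array([0.0, 1.0])
p12 = np.array([[0.24, 0.06],
                [0.16, 0.14],
                [0.40, 0.00]])

# Mean vector and covariance matrix found in Examples 2.12 and 2.13
mu = np.array([0.1, 0.2])
Sigma = np.array([[0.69, -0.08],
                  [-0.08, 0.16]])

# Hypothetical coefficients for the linear combination a*X1 + b*X2
c = np.array([2.0, -1.0])

# Matrix formulas: E(c'X) = c'mu and Var(c'X) = c' Sigma c
mean_formula = c @ mu
var_formula = c @ Sigma @ c

# Direct calculation: evaluate a*x1 + b*x2 on every pair and weight by p12
z = c[0] * x1_values[:, None] + c[1] * x2_values[None, :]
mean_direct = np.sum(z * p12)
var_direct = np.sum((z - mean_direct) ** 2 * p12)

print(round(var_formula, 4), round(var_direct, 4))
```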
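The preceding results can be extended to a linear combination of $p$ random variables:

The linear combination $\mathbf{c}'\mathbf{X} = c_1X_1 + \cdots + c_pX_p$ has

$$\text{mean} = E(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\mu} \qquad \text{variance} = \text{Var}(\mathbf{c}'\mathbf{X}) = \mathbf{c}'\boldsymbol{\Sigma}\mathbf{c} \tag{2-43}$$

where $\boldsymbol{\mu} = E(\mathbf{X})$ and $\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X})$.

In general, consider the $q$ linear combinations of the $p$ random variables $X_1, \ldots, X_p$:

$$\begin{aligned} Z_1 &= c_{11}X_1 + c_{12}X_2 + \cdots + c_{1p}X_p \\ Z_2 &= c_{21}X_1 + c_{22}X_2 + \cdots + c_{2p}X_p \\ &\;\;\vdots \\ Z_q &= c_{q1}X_1 + c_{q2}X_2 + \cdots + c_{qp}X_p \end{aligned}$$

or

$$\underset{(q \times 1)}{\mathbf{Z}} = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_q \end{bmatrix} = \underset{(q \times p)}{\begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ c_{q1} & c_{q2} & \cdots & c_{qp} \end{bmatrix}} \underset{(p \times 1)}{\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}} = \mathbf{CX} \tag{2-44}$$

The linear combinations $\mathbf{Z} = \mathbf{CX}$ have

$$\begin{aligned} \boldsymbol{\mu}_{\mathbf{Z}} &= E(\mathbf{Z}) = E(\mathbf{CX}) = \mathbf{C}\boldsymbol{\mu}_{\mathbf{X}} \\ \boldsymbol{\Sigma}_{\mathbf{Z}} &= \text{Cov}(\mathbf{Z}) = \text{Cov}(\mathbf{CX}) = \mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}' \end{aligned} \tag{2-45}$$

where $\boldsymbol{\mu}_{\mathbf{X}}$ and $\boldsymbol{\Sigma}_{\mathbf{X}}$ are the mean vector and variance-covariance matrix of $\mathbf{X}$, respectively. (See Exercise 2.28 for the computation of the off-diagonal terms in $\mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'$.) We shall rely heavily on the result in (2-45) in our discussions of principal components and factor analysis in Chapters 8 and 9.

Example 2.15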
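(Means and covariances of linear combinations)

Let $\mathbf{X}' = [X_1, X_2]$ be a random vector with mean vector $\boldsymbol{\mu}_{\mathbf{X}}' = [\mu_1, \mu_2]$ and variance-covariance matrix

$$\boldsymbol{\Sigma}_{\mathbf{X}} = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}$$

Find the mean vector and covariance matrix for the linear combinations

$$\begin{aligned} Z_1 &= X_1 - X_2 \\ Z_2 &= X_1 + X_2 \end{aligned}$$

or

$$\mathbf{Z} = \begin{bmatrix} Z_1 \\ Z_2 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = \mathbf{CX}$$

in terms of $\boldsymbol{\mu}_{\mathbf{X}}$ and $\boldsymbol{\Sigma}_{\mathbf{X}}$. Here

$$\boldsymbol{\mu}_{\mathbf{Z}} = E(\mathbf{Z}) = \mathbf{C}\boldsymbol{\mu}_{\mathbf{X}} = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} \mu_1 - \mu_2 \\ \mu_1 + \mu_2 \end{bmatrix}$$

and

$$\boldsymbol{\Sigma}_{\mathbf{Z}} = \text{Cov}(\mathbf{Z}) = \mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}' = \begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} \sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{11} - \sigma_{22} \\ \sigma_{11} - \sigma_{22} & \sigma_{11} + 2\sigma_{12} + \sigma_{22} \end{bmatrix}$$

Note that if $\sigma_{11} = \sigma_{22}$, that is, if $X_1$ and $X_2$ have equal variances, the off-diagonal terms in $\boldsymbol{\Sigma}_{\mathbf{Z}}$ vanish. This demonstrates the well-known result that the sum and difference of two random variables with identical variances are uncorrelated. •

Partitioning the Sample Mean Vector and Covariance Matrix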
Many of the matrix results in this section have been expressed in terms of population means and variances ( covariances ). The results in (236), (237), (238), and (240) also hold if the population quantities are replaced by their appropriately defined sample counterparts. Let x' = [ p ] be the vector of sample averages constructed from and let n observations on p variables
.X1 , x2 , , x X1, X2 , X , P • • •
• • •
,
be the corresponding sample variancecovariance matrix.
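The sample mean vector and the covariance matrix can be partitioned in order to distinguish quantities corresponding to groups of variables. Thus,

\[
\underset{(p \times 1)}{\bar{\mathbf{x}}} =
\begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_q \\ \bar{x}_{q+1} \\ \vdots \\ \bar{x}_p \end{bmatrix}
= \begin{bmatrix} \bar{\mathbf{x}}^{(1)} \\ \bar{\mathbf{x}}^{(2)} \end{bmatrix}
\tag{2-46}
\]

and

\[
\underset{(p \times p)}{\mathbf{S}_n} =
\left[\begin{array}{ccc|ccc}
s_{11} & \cdots & s_{1q} & s_{1,q+1} & \cdots & s_{1p} \\
\vdots & & \vdots & \vdots & & \vdots \\
s_{q1} & \cdots & s_{qq} & s_{q,q+1} & \cdots & s_{qp} \\ \hline
s_{q+1,1} & \cdots & s_{q+1,q} & s_{q+1,q+1} & \cdots & s_{q+1,p} \\
\vdots & & \vdots & \vdots & & \vdots \\
s_{p1} & \cdots & s_{pq} & s_{p,q+1} & \cdots & s_{pp}
\end{array}\right]
=
\begin{bmatrix} \mathbf{S}_{11} & \mathbf{S}_{12} \\ \mathbf{S}_{21} & \mathbf{S}_{22} \end{bmatrix}
\tag{2-47}
\]

Here $\mathbf{S}_{11}$ is $q \times q$, $\mathbf{S}_{12}$ is $q \times (p - q)$, $\mathbf{S}_{21}$ is $(p - q) \times q$, and $\mathbf{S}_{22}$ is $(p - q) \times (p - q)$. The vectors $\bar{\mathbf{x}}^{(1)}$ and $\bar{\mathbf{x}}^{(2)}$ are the sample mean vectors constructed from observations $\mathbf{x}^{(1)} = [x_1, \ldots, x_q]'$ and $\mathbf{x}^{(2)} = [x_{q+1}, \ldots, x_p]'$, respectively; $\mathbf{S}_{11}$ is the sample covariance matrix computed from observations $\mathbf{x}^{(1)}$; $\mathbf{S}_{22}$ is the sample covariance matrix computed from observations $\mathbf{x}^{(2)}$; and $\mathbf{S}_{12} = \mathbf{S}_{21}'$ is the sample covariance matrix for elements of $\mathbf{x}^{(1)}$ and elements of $\mathbf{x}^{(2)}$.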
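As a minimal sketch of the partitioning, the function below splits a mean vector and covariance matrix after the first $q$ variables. The function name `partition` and the summary statistics are hypothetical, chosen only for illustration.

```python
def partition(xbar, S, q):
    """Split xbar and S after the first q variables into the blocks described above."""
    x1, x2 = xbar[:q], xbar[q:]
    S11 = [row[:q] for row in S[:q]]   # q x q
    S12 = [row[q:] for row in S[:q]]   # q x (p - q)
    S21 = [row[:q] for row in S[q:]]   # (p - q) x q
    S22 = [row[q:] for row in S[q:]]   # (p - q) x (p - q)
    return (x1, x2), (S11, S12, S21, S22)

# Hypothetical summary statistics for p = 3 variables.
xbar = [2.0, 5.0, 1.0]
S = [[4.0, 1.0, 0.5],
     [1.0, 9.0, 2.0],
     [0.5, 2.0, 1.0]]

(x1, x2), (S11, S12, S21, S22) = partition(xbar, S, q=2)
```

Because the full matrix is symmetric, the off-diagonal blocks returned here are transposes of one another.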
2.7 MATRIX INEQUALITIES AND MAXIMIZATION
Maximization principles play an important role in several multivariate techniques. Linear discriminant analysis, for example, is concerned with allocating observations to predetermined groups. The allocation rule is often a linear function of measurements that maximizes the separation between groups relative to their within-group variability. As another example, principal components are linear combinations of measurements with maximum variability. The matrix inequalities presented in this section will easily allow us to derive certain maximization results, which will be referenced in later chapters.
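Cauchy-Schwarz Inequality. Let $\mathbf{b}$ and $\mathbf{d}$ be any two $p \times 1$ vectors. Then

\[
(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})
\tag{2-48}
\]

with equality if and only if $\mathbf{b} = c\,\mathbf{d}$ (or $\mathbf{d} = c\,\mathbf{b}$) for some constant $c$.

Proof. The inequality is obvious if either $\mathbf{b} = \mathbf{0}$ or $\mathbf{d} = \mathbf{0}$. Excluding this possibility, consider the vector $\mathbf{b} - x\mathbf{d}$, where $x$ is an arbitrary scalar. Since the length of $\mathbf{b} - x\mathbf{d}$ is positive for $\mathbf{b} - x\mathbf{d} \ne \mathbf{0}$, in this case

\[
0 < (\mathbf{b} - x\mathbf{d})'(\mathbf{b} - x\mathbf{d})
= \mathbf{b}'\mathbf{b} - x\,\mathbf{d}'\mathbf{b} - \mathbf{b}'(x\mathbf{d}) + x^2\,\mathbf{d}'\mathbf{d}
= \mathbf{b}'\mathbf{b} - 2x(\mathbf{b}'\mathbf{d}) + x^2(\mathbf{d}'\mathbf{d})
\]

The last expression is quadratic in $x$. If we complete the square by adding and subtracting the scalar $(\mathbf{b}'\mathbf{d})^2/\mathbf{d}'\mathbf{d}$, we get

\[
0 < \mathbf{b}'\mathbf{b} - \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}} + \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}} - 2x(\mathbf{b}'\mathbf{d}) + x^2(\mathbf{d}'\mathbf{d})
= \mathbf{b}'\mathbf{b} - \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}} + (\mathbf{d}'\mathbf{d})\left(x - \frac{\mathbf{b}'\mathbf{d}}{\mathbf{d}'\mathbf{d}}\right)^2
\]

The term in brackets is zero if we choose $x = \mathbf{b}'\mathbf{d}/\mathbf{d}'\mathbf{d}$, so we conclude that

\[
0 < \mathbf{b}'\mathbf{b} - \frac{(\mathbf{b}'\mathbf{d})^2}{\mathbf{d}'\mathbf{d}}
\]

or $(\mathbf{b}'\mathbf{d})^2 < (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})$ if $\mathbf{b}$ is not a scalar multiple of $\mathbf{d}$, that is, if $\mathbf{b} \ne x\mathbf{d}$ for every $x$. Note that if $\mathbf{b} = c\,\mathbf{d}$, then $0 = (\mathbf{b} - c\,\mathbf{d})'(\mathbf{b} - c\,\mathbf{d})$, and the same argument produces $(\mathbf{b}'\mathbf{d})^2 = (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})$. ■

A simple, but important, extension of the Cauchy-Schwarz inequality follows directly.

Extended Cauchy-Schwarz Inequality. Let $\mathbf{b}$ $(p \times 1)$ and $\mathbf{d}$ $(p \times 1)$ be any two vectors, and let $\mathbf{B}$ $(p \times p)$ be a positive definite matrix. Then

\[
(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{B}\mathbf{b})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d})
\tag{2-49}
\]

with equality if and only if $\mathbf{b} = c\,\mathbf{B}^{-1}\mathbf{d}$ (or $\mathbf{d} = c\,\mathbf{B}\mathbf{b}$) for some constant $c$.

Proof. The inequality is obvious when $\mathbf{b} = \mathbf{0}$ or $\mathbf{d} = \mathbf{0}$. For cases other than these, consider the square-root matrix $\mathbf{B}^{1/2}$ defined in terms of its eigenvalues $\lambda_i$ and the normalized eigenvectors $\mathbf{e}_i$ as $\mathbf{B}^{1/2} = \sum_{i=1}^{p} \sqrt{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i'$. If we set [see also (2-22)]

\[
\mathbf{B}^{-1/2} = \sum_{i=1}^{p} \frac{1}{\sqrt{\lambda_i}}\,\mathbf{e}_i\mathbf{e}_i'
\]

it follows that

\[
\mathbf{b}'\mathbf{d} = \mathbf{b}'\mathbf{I}\mathbf{d} = \mathbf{b}'\mathbf{B}^{1/2}\mathbf{B}^{-1/2}\mathbf{d} = (\mathbf{B}^{1/2}\mathbf{b})'(\mathbf{B}^{-1/2}\mathbf{d})
\]

and the proof is completed by applying the Cauchy-Schwarz inequality to the vectors $(\mathbf{B}^{1/2}\mathbf{b})$ and $(\mathbf{B}^{-1/2}\mathbf{d})$. ■

The extended Cauchy-Schwarz inequality gives rise to the following maximization result.

Maximization Lemma. Let $\mathbf{B}$ $(p \times p)$ be positive definite and $\mathbf{d}$ $(p \times 1)$ be a given vector. Then, for an arbitrary nonzero vector $\mathbf{x}$ $(p \times 1)$,

\[
\max_{\mathbf{x} \ne \mathbf{0}} \frac{(\mathbf{x}'\mathbf{d})^2}{\mathbf{x}'\mathbf{B}\mathbf{x}} = \mathbf{d}'\mathbf{B}^{-1}\mathbf{d}
\tag{2-50}
\]

with the maximum attained when $\mathbf{x} = c\,\mathbf{B}^{-1}\mathbf{d}$ for any constant $c \ne 0$.

Proof. By the extended Cauchy-Schwarz inequality, $(\mathbf{x}'\mathbf{d})^2 \le (\mathbf{x}'\mathbf{B}\mathbf{x})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d})$. Because $\mathbf{x} \ne \mathbf{0}$ and $\mathbf{B}$ is positive definite, $\mathbf{x}'\mathbf{B}\mathbf{x} > 0$. Dividing both sides of the inequality by the positive scalar $\mathbf{x}'\mathbf{B}\mathbf{x}$ yields the upper bound

\[
\frac{(\mathbf{x}'\mathbf{d})^2}{\mathbf{x}'\mathbf{B}\mathbf{x}} \le \mathbf{d}'\mathbf{B}^{-1}\mathbf{d}
\]

Taking the maximum over $\mathbf{x}$ gives Equation (2-50) because the bound is attained for $\mathbf{x} = c\,\mathbf{B}^{-1}\mathbf{d}$. ■

A final maximization result will provide us with an interpretation of eigenvalues.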
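Maximization of Quadratic Forms for Points on the Unit Sphere. Let $\mathbf{B}$ $(p \times p)$ be a positive definite matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0$ and associated normalized eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$. Then

\[
\begin{aligned}
\max_{\mathbf{x} \ne \mathbf{0}} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} &= \lambda_1 \qquad (\text{attained when } \mathbf{x} = \mathbf{e}_1) \\
\min_{\mathbf{x} \ne \mathbf{0}} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} &= \lambda_p \qquad (\text{attained when } \mathbf{x} = \mathbf{e}_p)
\end{aligned}
\tag{2-51}
\]

Moreover,

\[
\max_{\mathbf{x} \perp \mathbf{e}_1, \ldots, \mathbf{e}_k} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} = \lambda_{k+1} \qquad (\text{attained when } \mathbf{x} = \mathbf{e}_{k+1},\; k = 1, 2, \ldots, p - 1)
\tag{2-52}
\]

where the symbol $\perp$ is read "is perpendicular to."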
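Proof. Let $\mathbf{P}$ $(p \times p)$ be the orthogonal matrix whose columns are the eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$ and $\boldsymbol{\Lambda}$ be the diagonal matrix with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ along the main diagonal. Let $\mathbf{B}^{1/2} = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$ [see (2-22)] and $\mathbf{y} = \mathbf{P}'\mathbf{x}$. Consequently, $\mathbf{x} \ne \mathbf{0}$ implies $\mathbf{y} \ne \mathbf{0}$. Thus,

\[
\frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}}
= \frac{\mathbf{x}'\mathbf{B}^{1/2}\mathbf{B}^{1/2}\mathbf{x}}{\mathbf{x}'\mathbf{P}\mathbf{P}'\mathbf{x}}
= \frac{\mathbf{x}'\mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'\mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'\mathbf{x}}{\mathbf{y}'\mathbf{y}}
= \frac{\mathbf{y}'\boldsymbol{\Lambda}\mathbf{y}}{\mathbf{y}'\mathbf{y}}
= \frac{\sum_{i=1}^{p} \lambda_i y_i^2}{\sum_{i=1}^{p} y_i^2}
\le \lambda_1 \frac{\sum_{i=1}^{p} y_i^2}{\sum_{i=1}^{p} y_i^2} = \lambda_1
\tag{2-53}
\]

where $\mathbf{P}\mathbf{P}' = \mathbf{I}$ $(p \times p)$. Setting $\mathbf{x} = \mathbf{e}_1$ gives

\[
\mathbf{y} = \mathbf{P}'\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\]

since

\[
\mathbf{e}_k'\mathbf{e}_1 = \begin{cases} 1, & k = 1 \\ 0, & k \ne 1 \end{cases}
\]

For this choice of $\mathbf{x}$, we have $\mathbf{y}'\boldsymbol{\Lambda}\mathbf{y}/\mathbf{y}'\mathbf{y} = \lambda_1/1 = \lambda_1$, or

\[
\mathbf{e}_1'\mathbf{B}\mathbf{e}_1 = \lambda_1
\tag{2-54}
\]

A similar argument produces the second part of (2-51).

Now, $\mathbf{x} = \mathbf{P}\mathbf{y} = y_1\mathbf{e}_1 + y_2\mathbf{e}_2 + \cdots + y_p\mathbf{e}_p$, so $\mathbf{x} \perp \mathbf{e}_1, \ldots, \mathbf{e}_k$ implies

\[
0 = \mathbf{e}_i'\mathbf{x} = y_1\mathbf{e}_i'\mathbf{e}_1 + y_2\mathbf{e}_i'\mathbf{e}_2 + \cdots + y_p\mathbf{e}_i'\mathbf{e}_p = y_i, \qquad i \le k
\]

Therefore, for $\mathbf{x}$ perpendicular to the first $k$ eigenvectors $\mathbf{e}_i$, the left-hand side of the inequality in (2-53) becomes

\[
\frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} = \frac{\sum_{i=k+1}^{p} \lambda_i y_i^2}{\sum_{i=k+1}^{p} y_i^2}
\]

Taking $y_{k+1} = 1$, $y_{k+2} = \cdots = y_p = 0$ gives the asserted maximum. ■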
For a fixed $\mathbf{x}_0 \ne \mathbf{0}$, $\mathbf{x}_0'\mathbf{B}\mathbf{x}_0/\mathbf{x}_0'\mathbf{x}_0$ has the same value as $\mathbf{x}'\mathbf{B}\mathbf{x}$, where $\mathbf{x}' = \mathbf{x}_0'/\sqrt{\mathbf{x}_0'\mathbf{x}_0}$ is of unit length. Consequently, Equation (2-51) says that the largest eigenvalue, $\lambda_1$, is the maximum value of the quadratic form $\mathbf{x}'\mathbf{B}\mathbf{x}$ for all points $\mathbf{x}$ whose distance from the origin is unity. Similarly, $\lambda_p$ is the smallest value of the quadratic form for all points $\mathbf{x}$ one unit from the origin. The largest and smallest eigenvalues thus represent extreme values of $\mathbf{x}'\mathbf{B}\mathbf{x}$ for points on the unit sphere. The "intermediate" eigenvalues of the $p \times p$ positive definite matrix $\mathbf{B}$ also have an interpretation as extreme values when $\mathbf{x}$ is further restricted to be perpendicular to the earlier choices.
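The two maximization results of this section can be checked numerically in the plane. The following is only a sketch under stated assumptions: the positive definite matrix is a $2 \times 2$ matrix with eigenvalues 3 and 2, the vector `d` is hypothetical, and the maxima are approximated by scanning a grid of unit vectors rather than computed exactly.

```python
import math

B = [[2.2, 0.4], [0.4, 2.8]]   # positive definite; its eigenvalues are 3 and 2
d = [1.0, 2.0]                 # hypothetical vector for the maximization lemma

def dot(x, y):
    return x[0] * y[0] + x[1] * y[1]

def quad(x, M, y):
    """Return x' M y for 2-vectors x, y and a 2 x 2 matrix M."""
    return sum(x[i] * M[i][j] * y[j] for i in range(2) for j in range(2))

# Inverse of a 2 x 2 matrix from its determinant.
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
B_inv = [[B[1][1] / det, -B[0][1] / det],
         [-B[1][0] / det, B[0][0] / det]]

bound = quad(d, B_inv, d)                        # d' B^{-1} d
x_star = [dot(B_inv[0], d), dot(B_inv[1], d)]    # x = B^{-1} d (taking c = 1)
ratio_at_x_star = dot(x_star, d) ** 2 / quad(x_star, B, x_star)

# Scan unit vectors x = (cos t, sin t); both ratios are unchanged by rescaling x.
grid = [(math.cos(math.pi * k / 3600), math.sin(math.pi * k / 3600))
        for k in range(3600)]
max_ratio = max(dot(x, d) ** 2 / quad(x, B, x) for x in grid)
rayleigh = [quad(x, B, x) / dot(x, x) for x in grid]
largest, smallest = max(rayleigh), min(rayleigh)
```

On this grid the ratio $(\mathbf{x}'\mathbf{d})^2/\mathbf{x}'\mathbf{B}\mathbf{x}$ never exceeds $\mathbf{d}'\mathbf{B}^{-1}\mathbf{d}$ and equals it at $\mathbf{x} = \mathbf{B}^{-1}\mathbf{d}$, while the extremes of $\mathbf{x}'\mathbf{B}\mathbf{x}/\mathbf{x}'\mathbf{x}$ sit close to the eigenvalues 3 and 2.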
SUPPLEMENT 2A
Vectors an d Matrices: Basic Concepts
Vectors
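Many concepts, such as a person's health, intellectual abilities, or personality, cannot be adequately quantified as a single number. Rather, several different measurements $x_1, x_2, \ldots, x_m$ are required.

Definition 2A.1. An $m$-tuple of real numbers $(x_1, x_2, \ldots, x_i, \ldots, x_m)$ arranged in a column is called a vector and is denoted by a boldfaced, lowercase letter. Examples of vectors are

\[
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}, \qquad
\mathbf{a} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \qquad
\mathbf{b} = \begin{bmatrix} 1 \\ -1 \\ 1 \\ -1 \end{bmatrix}, \qquad
\mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix}
\]

Vectors are said to be equal if their corresponding entries are the same.

Definition 2A.2 (Scalar Multiplication). Let $c$ be an arbitrary scalar. Then the product $c\mathbf{x}$ is a vector with $i$th entry $cx_i$.

To illustrate scalar multiplication, take $c_1 = 5$ and $c_2 = -1.2$. Then

\[
c_1\mathbf{y} = 5\begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \\ -10 \end{bmatrix}
\qquad \text{and} \qquad
c_2\mathbf{y} = (-1.2)\begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix} = \begin{bmatrix} -1.2 \\ -2.4 \\ 2.4 \end{bmatrix}
\]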
Definition 2A.3 (Vector Addition). The sum of two vectors $\mathbf{x}$ and $\mathbf{y}$, each having the same number of entries, is that vector $\mathbf{z} = \mathbf{x} + \mathbf{y}$ with $i$th entry $z_i = x_i + y_i$. Thus,

\[
\mathbf{x} + \mathbf{y} =
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} +
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} =
\begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_m + y_m \end{bmatrix} = \mathbf{z}
\]

Taking the zero vector, $\mathbf{0}$, to be the $m$-tuple $(0, 0, \ldots, 0)$ and the vector $-\mathbf{x}$ to be the $m$-tuple $(-x_1, -x_2, \ldots, -x_m)$, the two operations of scalar multiplication and vector addition can be combined in a useful manner.

Definition 2A.4. The space of all real $m$-tuples, with scalar multiplication and vector addition as just defined, is called a vector space.
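Definition 2A.5. The vector $\mathbf{y} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \cdots + a_k\mathbf{x}_k$ is a linear combination of the vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$. The set of all linear combinations of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is called their linear span.

Definition 2A.6. A set of vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$ is said to be linearly dependent if there exist $k$ numbers $(a_1, a_2, \ldots, a_k)$, not all zero, such that

\[
a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + \cdots + a_k\mathbf{x}_k = \mathbf{0}
\]

Otherwise the set of vectors is said to be linearly independent.

If one of the vectors, for example, $\mathbf{x}_i$, is $\mathbf{0}$, the set is linearly dependent. (Let $a_i$ be the only nonzero coefficient in Definition 2A.6.)

The familiar vectors with a one as an entry and zeros elsewhere are linearly independent. For $m = 4$,

\[
\mathbf{x}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\mathbf{x}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\mathbf{x}_4 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]

so

\[
\mathbf{0} = a_1\mathbf{x}_1 + a_2\mathbf{x}_2 + a_3\mathbf{x}_3 + a_4\mathbf{x}_4 =
\begin{bmatrix}
a_1 \cdot 1 + a_2 \cdot 0 + a_3 \cdot 0 + a_4 \cdot 0 \\
a_1 \cdot 0 + a_2 \cdot 1 + a_3 \cdot 0 + a_4 \cdot 0 \\
a_1 \cdot 0 + a_2 \cdot 0 + a_3 \cdot 1 + a_4 \cdot 0 \\
a_1 \cdot 0 + a_2 \cdot 0 + a_3 \cdot 0 + a_4 \cdot 1
\end{bmatrix}
= \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix}
\]

implies that $a_1 = a_2 = a_3 = a_4 = 0$.

As another example, let $k = 3$ and $m = 3$, and let

\[
\mathbf{x}_1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}, \quad
\mathbf{x}_3 = \begin{bmatrix} 0 \\ 1 \\ -1 \end{bmatrix}
\]

Then

\[
2\mathbf{x}_1 - \mathbf{x}_2 + 3\mathbf{x}_3 = \mathbf{0}
\]

Thus, $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ are a linearly dependent set of vectors, since any one can be written as a linear combination of the others (for example, $\mathbf{x}_2 = 2\mathbf{x}_1 + 3\mathbf{x}_3$).

Definition 2A.7. Any set of $m$ linearly independent vectors is called a basis for the vector space of all $m$-tuples of real numbers.

Result 2A.1. Every vector can be expressed as a unique linear combination of a fixed basis. ■

With $m = 4$, the usual choice of a basis is

\[
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]

These four vectors were shown to be linearly independent. Any vector $\mathbf{x}$ can be uniquely expressed as

\[
x_1\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} +
x_2\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} +
x_3\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} +
x_4\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
= \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \mathbf{x}
\]

A vector consisting of $m$ elements may be regarded geometrically as a point in $m$-dimensional space. For example, with $m = 2$, the vector $\mathbf{x}$ may be regarded as representing the point in the plane with coordinates $x_1$ and $x_2$. Vectors have the geometrical properties of length and direction.

[Figure: the vector $\mathbf{x} = [x_1, x_2]'$ plotted as a point in the plane, with coordinate axes 1 and 2.]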
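Definition 2A.8. The length of a vector of $m$ elements emanating from the origin is given by the Pythagorean formula:

\[
\text{length of } \mathbf{x} = L_{\mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2}
\]

Definition 2A.9. The angle $\theta$ between two vectors $\mathbf{x}$ and $\mathbf{y}$, both having $m$ entries, is defined from

\[
\cos(\theta) = \frac{x_1y_1 + x_2y_2 + \cdots + x_my_m}{L_{\mathbf{x}}L_{\mathbf{y}}}
\]

where $L_{\mathbf{x}}$ = length of $\mathbf{x}$ and $L_{\mathbf{y}}$ = length of $\mathbf{y}$, $x_1, x_2, \ldots, x_m$ are the elements of $\mathbf{x}$, and $y_1, y_2, \ldots, y_m$ are the elements of $\mathbf{y}$.

Let

\[
\mathbf{x} = \begin{bmatrix} -1 \\ 5 \\ 2 \\ -2 \end{bmatrix}
\qquad \text{and} \qquad
\mathbf{y} = \begin{bmatrix} 4 \\ -3 \\ 0 \\ 1 \end{bmatrix}
\]

Then the length of $\mathbf{x}$, the length of $\mathbf{y}$, and the cosine of the angle between the two vectors are

\[
\begin{aligned}
\text{length of } \mathbf{x} &= \sqrt{(-1)^2 + 5^2 + 2^2 + (-2)^2} = \sqrt{34} = 5.83 \\
\text{length of } \mathbf{y} &= \sqrt{4^2 + (-3)^2 + 0^2 + 1^2} = \sqrt{26} = 5.10
\end{aligned}
\]

and

\[
\begin{aligned}
\cos(\theta) &= \frac{1}{L_{\mathbf{x}}}\,\frac{1}{L_{\mathbf{y}}}\,[x_1y_1 + x_2y_2 + x_3y_3 + x_4y_4] \\
&= \frac{1}{\sqrt{34}}\,\frac{1}{\sqrt{26}}\,[(-1)4 + 5(-3) + 2(0) + (-2)1] \\
&= \frac{1}{5.83 \times 5.10}\,[-21] = -.706
\end{aligned}
\]

Consequently, $\theta = 135°$.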
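The arithmetic in this example is easy to reproduce. The short sketch below recomputes the two lengths and the angle from Definitions 2A.8 and 2A.9; the helper names `length` and `cos_angle` are illustrative only.

```python
import math

x = [-1, 5, 2, -2]
y = [4, -3, 0, 1]

def length(v):
    """Pythagorean length of a vector (Definition 2A.8)."""
    return math.sqrt(sum(a * a for a in v))

def cos_angle(u, v):
    """Cosine of the angle between u and v (Definition 2A.9)."""
    return sum(a * b for a, b in zip(u, v)) / (length(u) * length(v))

L_x, L_y = length(x), length(y)
cos_theta = cos_angle(x, y)
theta_degrees = math.degrees(math.acos(cos_theta))
```

The computed angle is slightly under 135 degrees; the value in the text is rounded to the nearest degree.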
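Definition 2A.10. The inner (or dot) product of two vectors $\mathbf{x}$ and $\mathbf{y}$ with the same number of entries is defined as the sum of component products:

\[
x_1y_1 + x_2y_2 + \cdots + x_my_m
\]

We use the notation $\mathbf{x}'\mathbf{y}$ or $\mathbf{y}'\mathbf{x}$ to denote this inner product.

With the $\mathbf{x}'\mathbf{y}$ notation, we may express the length of a vector and the cosine of the angle between two vectors as

\[
L_{\mathbf{x}} = \text{length of } \mathbf{x} = \sqrt{x_1^2 + x_2^2 + \cdots + x_m^2} = \sqrt{\mathbf{x}'\mathbf{x}},
\qquad
\cos(\theta) = \frac{\mathbf{x}'\mathbf{y}}{\sqrt{\mathbf{x}'\mathbf{x}}\,\sqrt{\mathbf{y}'\mathbf{y}}}
\]

Definition 2A.11. When the angle between two vectors $\mathbf{x}$, $\mathbf{y}$ is $\theta = 90°$ or $270°$, we say that $\mathbf{x}$ and $\mathbf{y}$ are perpendicular. Since $\cos(\theta) = 0$ only if $\theta = 90°$ or $270°$, the condition becomes

\[
\mathbf{x} \text{ and } \mathbf{y} \text{ are perpendicular if } \mathbf{x}'\mathbf{y} = 0
\]

We write $\mathbf{x} \perp \mathbf{y}$.

The basis vectors

\[
\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]

are mutually perpendicular. Also, each has length unity. The same construction holds for any number of entries $m$.

Result 2A.2.
(a) $\mathbf{z}$ is perpendicular to every vector if and only if $\mathbf{z} = \mathbf{0}$.
(b) If $\mathbf{z}$ is perpendicular to each vector $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, then $\mathbf{z}$ is perpendicular to their linear span.
(c) Mutually perpendicular vectors are linearly independent. ■

Definition 2A.12. The projection (or shadow) of a vector $\mathbf{x}$ on a vector $\mathbf{y}$ is

\[
\text{projection of } \mathbf{x} \text{ on } \mathbf{y} = \frac{(\mathbf{x}'\mathbf{y})}{L_{\mathbf{y}}^2}\,\mathbf{y}
\]

If $\mathbf{y}$ has unit length so that $L_{\mathbf{y}} = 1$,

\[
\text{projection of } \mathbf{x} \text{ on } \mathbf{y} = (\mathbf{x}'\mathbf{y})\mathbf{y}
\]

Result 2A.3 (Gram-Schmidt Process). Given linearly independent vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, there exist mutually perpendicular vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k$ with the same linear span. These may be constructed sequentially by setting

\[
\begin{aligned}
\mathbf{u}_1 &= \mathbf{x}_1 \\
\mathbf{u}_2 &= \mathbf{x}_2 - \frac{(\mathbf{x}_2'\mathbf{u}_1)}{\mathbf{u}_1'\mathbf{u}_1}\,\mathbf{u}_1 \\
&\;\;\vdots \\
\mathbf{u}_k &= \mathbf{x}_k - \frac{(\mathbf{x}_k'\mathbf{u}_1)}{\mathbf{u}_1'\mathbf{u}_1}\,\mathbf{u}_1 - \cdots - \frac{(\mathbf{x}_k'\mathbf{u}_{k-1})}{\mathbf{u}_{k-1}'\mathbf{u}_{k-1}}\,\mathbf{u}_{k-1}
\end{aligned}
\]

We can also convert the $\mathbf{u}$'s to unit length by setting $\mathbf{z}_j = \mathbf{u}_j/\sqrt{\mathbf{u}_j'\mathbf{u}_j}$. In this construction, $(\mathbf{x}_k'\mathbf{z}_j)\mathbf{z}_j$ is the projection of $\mathbf{x}_k$ on $\mathbf{z}_j$ and $\sum_{j=1}^{k-1} (\mathbf{x}_k'\mathbf{z}_j)\mathbf{z}_j$ is the projection of $\mathbf{x}_k$ on the linear span of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{k-1}$. ■

For example, to construct perpendicular vectors from

\[
\mathbf{x}_1 = \begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix}
\qquad \text{and} \qquad
\mathbf{x}_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ -1 \end{bmatrix}
\]

we take

\[
\mathbf{u}_1 = \mathbf{x}_1 = \begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix}
\]

so

\[
\mathbf{u}_1'\mathbf{u}_1 = 4^2 + 0^2 + 0^2 + 2^2 = 20
\]

and

\[
\mathbf{x}_2'\mathbf{u}_1 = 3(4) + 1(0) + 0(0) - 1(2) = 10
\]

Thus,

\[
\mathbf{u}_2 = \begin{bmatrix} 3 \\ 1 \\ 0 \\ -1 \end{bmatrix} - \frac{10}{20}\begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 0 \\ -2 \end{bmatrix}
\qquad \text{and} \qquad
\mathbf{z}_1 = \frac{1}{\sqrt{20}}\begin{bmatrix} 4 \\ 0 \\ 0 \\ 2 \end{bmatrix}, \quad
\mathbf{z}_2 = \frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 1 \\ 0 \\ -2 \end{bmatrix}
\]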
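The sequential construction in Result 2A.3 translates directly into a short routine. The sketch below is one way to write it (the function name `gram_schmidt` is illustrative); exact fractions are used so that the perpendicular vectors of the example are reproduced without rounding.

```python
from fractions import Fraction
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors):
    """Return mutually perpendicular u_1, ..., u_k with the same linear span.

    Each u_k is x_k minus its projections on the earlier u_j.
    The input vectors are assumed to be linearly independent.
    """
    us = []
    for x in vectors:
        u = [Fraction(a) for a in x]
        for prev in us:
            coef = Fraction(dot(x, prev)) / dot(prev, prev)
            u = [a - coef * b for a, b in zip(u, prev)]
        us.append(u)
    return us

x1 = [4, 0, 0, 2]
x2 = [3, 1, 0, -1]
u1, u2 = gram_schmidt([x1, x2])

# Rescale to unit length: z_j = u_j / sqrt(u_j' u_j).
z1 = [float(a) / math.sqrt(dot(u1, u1)) for a in u1]
z2 = [float(a) / math.sqrt(dot(u2, u2)) for a in u2]
```

Here `u2` comes out as $[1, 1, 0, -2]'$, with squared length 6, in agreement with the worked example.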
Matrices
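Definition 2A.13. An $m \times k$ matrix, generally denoted by a boldface uppercase letter such as $\mathbf{A}$, $\mathbf{R}$, $\boldsymbol{\Sigma}$, and so forth, is a rectangular array of elements having $m$ rows and $k$ columns.

Examples of matrices are

\[
\mathbf{A} = \begin{bmatrix} -7 & 2 \\ 0 & 1 \\ 3 & 4 \end{bmatrix}, \qquad
\mathbf{B} = \begin{bmatrix} x & 3 & 0 \\ 4 & -2 & 1/x \end{bmatrix}, \qquad
\mathbf{I} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]

\[
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & .7 & -.3 \\ .7 & 2 & 1 \\ -.3 & 1 & 8 \end{bmatrix}, \qquad
\mathbf{E} = [e_1]
\]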
In our work, the matrix elements will be real numbers or functions taking on values in the real numbers.

Definition 2A.14. The dimension (abbreviated dim) of an $m \times k$ matrix is the ordered pair $(m, k)$; $m$ is the row dimension and $k$ is the column dimension. The dimension of a matrix is frequently indicated in parentheses below the letter representing the matrix. Thus, the $m \times k$ matrix $\mathbf{A}$ is denoted by $\underset{(m \times k)}{\mathbf{A}}$.

In the preceding examples, the dimension of the matrix $\boldsymbol{\Sigma}$ is $3 \times 3$, and this information can be conveyed by writing $\underset{(3 \times 3)}{\boldsymbol{\Sigma}}$.

An $m \times k$ matrix, say, $\mathbf{A}$, of arbitrary constants can be written

\[
\underset{(m \times k)}{\mathbf{A}} =
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1k} \\
a_{21} & a_{22} & \cdots & a_{2k} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mk}
\end{bmatrix}
\]

or more compactly as $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$, where the index $i$ refers to the row and the index $j$ refers to the column.

An $m \times 1$ matrix is referred to as a column vector. A $1 \times k$ matrix is referred to as a row vector. Since matrices can be considered as vectors side by side, it is natural to define multiplication by a scalar and the addition of two matrices with the same dimensions.

Definition 2A.15. Two matrices $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$ and $\underset{(m \times k)}{\mathbf{B}} = \{b_{ij}\}$ are said to be equal, written $\mathbf{A} = \mathbf{B}$, if $a_{ij} = b_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$. That is, two matrices are equal if
(a) Their dimensionality is the same.
(b) Every corresponding element is the same.
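Definition 2A.16 (Matrix Addition). Let the matrices $\mathbf{A}$ and $\mathbf{B}$ both be of dimension $m \times k$ with arbitrary elements $a_{ij}$ and $b_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$, respectively. The sum of the matrices $\mathbf{A}$ and $\mathbf{B}$ is an $m \times k$ matrix $\mathbf{C}$, written $\mathbf{C} = \mathbf{A} + \mathbf{B}$, such that the arbitrary element of $\mathbf{C}$ is given by

\[
c_{ij} = a_{ij} + b_{ij}, \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, k
\]

Note that the addition of matrices is defined only for matrices of the same dimension. For example,

\[
\underset{\mathbf{A}}{\begin{bmatrix} 0 & 3 & 1 \\ 1 & -1 & 1 \end{bmatrix}} +
\underset{\mathbf{B}}{\begin{bmatrix} 1 & -3 & 1 \\ 2 & 5 & 0 \end{bmatrix}} =
\underset{\mathbf{C}}{\begin{bmatrix} 1 & 0 & 2 \\ 3 & 4 & 1 \end{bmatrix}}
\]

Definition 2A.17 (Scalar Multiplication). Let $c$ be an arbitrary scalar and $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$. Then $c\mathbf{A} = \mathbf{A}c = \underset{(m \times k)}{\mathbf{B}} = \{b_{ij}\}$, where $b_{ij} = ca_{ij} = a_{ij}c$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$.

Multiplication of a matrix by a scalar produces a new matrix whose elements are the elements of the original matrix, each multiplied by the scalar. For example, if $c = 2$,

\[
\underset{c\mathbf{A}}{2\begin{bmatrix} 3 & -4 \\ 2 & 6 \\ 0 & 5 \end{bmatrix}} =
\underset{\mathbf{A}c}{\begin{bmatrix} 3 & -4 \\ 2 & 6 \\ 0 & 5 \end{bmatrix}2} =
\underset{\mathbf{B}}{\begin{bmatrix} 6 & -8 \\ 4 & 12 \\ 0 & 10 \end{bmatrix}}
\]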
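Definition 2A.18 (Matrix Subtraction). Let $\underset{(m \times k)}{\mathbf{A}} = \{a_{ij}\}$ and $\underset{(m \times k)}{\mathbf{B}} = \{b_{ij}\}$ be two matrices of equal dimension. Then the difference between $\mathbf{A}$ and $\mathbf{B}$, written $\mathbf{A} - \mathbf{B}$, is an $m \times k$ matrix $\mathbf{C} = \{c_{ij}\}$ given by

\[
\mathbf{C} = \mathbf{A} - \mathbf{B} = \mathbf{A} + (-1)\mathbf{B}
\]

That is, $c_{ij} = a_{ij} + (-1)b_{ij} = a_{ij} - b_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$.

Definition 2A.19. Consider the $m \times k$ matrix $\mathbf{A}$ with arbitrary elements $a_{ij}$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, k$. The transpose of the matrix $\mathbf{A}$, denoted by $\mathbf{A}'$, is the $k \times m$ matrix with elements $a_{ji}$, $j = 1, 2, \ldots, k$, $i = 1, 2, \ldots, m$. That is, the transpose of the matrix $\mathbf{A}$ is obtained from $\mathbf{A}$ by interchanging the rows and columns.

As an example, if

\[
\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 2 & 1 & 3 \\ 7 & -4 & 6 \end{bmatrix},
\qquad \text{then} \qquad
\underset{(3 \times 2)}{\mathbf{A}'} = \begin{bmatrix} 2 & 7 \\ 1 & -4 \\ 3 & 6 \end{bmatrix}
\]

Result 2A.4. For all matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ (of equal dimension) and scalars $c$ and $d$, the following hold:
(a) $(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$
(b) $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$
(c) $c(\mathbf{A} + \mathbf{B}) = c\mathbf{A} + c\mathbf{B}$
(d) $(c + d)\mathbf{A} = c\mathbf{A} + d\mathbf{A}$
(e) $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$ (That is, the transpose of the sum is equal to the sum of the transposes.)
(f) $(cd)\mathbf{A} = c(d\mathbf{A})$
(g) $(c\mathbf{A})' = c\mathbf{A}'$ ■

Definition 2A.20. If an arbitrary matrix $\mathbf{A}$ has the same number of rows and columns, then $\mathbf{A}$ is called a square matrix. The matrices $\boldsymbol{\Sigma}$, $\mathbf{I}$, and $\mathbf{E}$ given after Definition 2A.13 are square matrices.

Definition 2A.21. Let $\mathbf{A}$ be a $k \times k$ (square) matrix. Then $\mathbf{A}$ is said to be symmetric if $\mathbf{A} = \mathbf{A}'$. That is, $\mathbf{A}$ is symmetric if $a_{ij} = a_{ji}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, k$.

Examples of symmetric matrices are

\[
\underset{(3 \times 3)}{\mathbf{I}} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad
\underset{(4 \times 4)}{\mathbf{B}} = \begin{bmatrix} a & c & e & f \\ c & b & g & d \\ e & g & c & a \\ f & d & a & d \end{bmatrix}
\]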
Definition 2A.22. The $k \times k$ identity matrix, denoted by $\underset{(k \times k)}{\mathbf{I}}$, is the square matrix with ones on the main (NW-SE) diagonal and zeros elsewhere. The $3 \times 3$ identity matrix is shown before this definition.

Definition 2A.23 (Matrix Multiplication). The product $\mathbf{A}\mathbf{B}$ of an $m \times n$ matrix $\mathbf{A} = \{a_{ij}\}$ and an $n \times k$ matrix $\mathbf{B} = \{b_{ij}\}$ is the $m \times k$ matrix $\mathbf{C}$ whose elements are

\[
c_{ij} = \sum_{\ell=1}^{n} a_{i\ell}b_{\ell j}, \qquad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, k
\]
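Note that for the product $\mathbf{A}\mathbf{B}$ to be defined, the column dimension of $\mathbf{A}$ must equal the row dimension of $\mathbf{B}$. If that is so, then the row dimension of $\mathbf{A}\mathbf{B}$ equals the row dimension of $\mathbf{A}$, and the column dimension of $\mathbf{A}\mathbf{B}$ equals the column dimension of $\mathbf{B}$.

For example, let

\[
\underset{(2 \times 3)}{\mathbf{A}} = \begin{bmatrix} 3 & -1 & 2 \\ 4 & 0 & 5 \end{bmatrix}
\qquad \text{and} \qquad
\underset{(3 \times 2)}{\mathbf{B}} = \begin{bmatrix} 3 & 4 \\ 6 & -2 \\ 4 & 3 \end{bmatrix}
\]

Then

\[
\underset{(2 \times 3)}{\begin{bmatrix} 3 & -1 & 2 \\ 4 & 0 & 5 \end{bmatrix}}
\underset{(3 \times 2)}{\begin{bmatrix} 3 & 4 \\ 6 & -2 \\ 4 & 3 \end{bmatrix}}
= \underset{(2 \times 2)}{\begin{bmatrix} 11 & 20 \\ 32 & 31 \end{bmatrix}}
= \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix}
\]

where

\[
\begin{aligned}
c_{11} &= (3)(3) + (-1)(6) + (2)(4) = 11 \\
c_{12} &= (3)(4) + (-1)(-2) + (2)(3) = 20 \\
c_{21} &= (4)(3) + (0)(6) + (5)(4) = 32 \\
c_{22} &= (4)(4) + (0)(-2) + (5)(3) = 31
\end{aligned}
\]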
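The row-by-column rule of Definition 2A.23 can be written as a short function. This is a minimal sketch using plain lists of rows; the name `matmul` is illustrative, and it raises an error when the dimensions do not conform.

```python
def matmul(A, B):
    """Product AB of an m x n matrix A and an n x k matrix B (lists of rows)."""
    if len(A[0]) != len(B):
        raise ValueError("column dimension of A must equal row dimension of B")
    n = len(B)
    return [[sum(A[i][l] * B[l][j] for l in range(n))
             for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[3, -1, 2],
     [4, 0, 5]]          # 2 x 3
B = [[3, 4],
     [6, -2],
     [4, 3]]             # 3 x 2

AB = matmul(A, B)        # 2 x 2, as in the worked example
BA = matmul(B, A)        # also defined here, but it is 3 x 3
```

The two products have different dimensions, so they cannot be equal even though both are defined.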
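As an additional example, consider the product of two vectors. Let

\[
\mathbf{x} = \begin{bmatrix} 1 \\ 0 \\ -2 \\ 3 \end{bmatrix}
\qquad \text{and} \qquad
\mathbf{y} = \begin{bmatrix} 2 \\ -3 \\ -1 \\ -8 \end{bmatrix}
\]

Then $\mathbf{x}' = [1 \;\; 0 \;\; {-2} \;\; 3]$ and

\[
\mathbf{x}'\mathbf{y} = [1 \;\; 0 \;\; {-2} \;\; 3]\begin{bmatrix} 2 \\ -3 \\ -1 \\ -8 \end{bmatrix} = [-20]
= [2 \;\; {-3} \;\; {-1} \;\; {-8}]\begin{bmatrix} 1 \\ 0 \\ -2 \\ 3 \end{bmatrix} = \mathbf{y}'\mathbf{x}
\]

Note that the product $\mathbf{x}\mathbf{y}$ is undefined, since $\mathbf{x}$ is a $4 \times 1$ matrix and $\mathbf{y}$ is a $4 \times 1$ matrix, so the column dim of $\mathbf{x}$, 1, is unequal to the row dim of $\mathbf{y}$, 4. If $\mathbf{x}$ and $\mathbf{y}$ are vectors of the same dimension, such as $n \times 1$, both of the products $\mathbf{x}'\mathbf{y}$ and $\mathbf{x}\mathbf{y}'$ are defined. In particular, $\mathbf{y}'\mathbf{x} = \mathbf{x}'\mathbf{y} = x_1y_1 + x_2y_2 + \cdots + x_ny_n$, and $\mathbf{x}\mathbf{y}'$ is an $n \times n$ matrix with $i, j$th element $x_iy_j$.

Result 2A.5. For all matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ (of dimensions such that the indicated products are defined) and a scalar $c$,
(a) $c(\mathbf{A}\mathbf{B}) = (c\mathbf{A})\mathbf{B}$
(b) $\mathbf{A}(\mathbf{B}\mathbf{C}) = (\mathbf{A}\mathbf{B})\mathbf{C}$
(c) $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$
(d) $(\mathbf{B} + \mathbf{C})\mathbf{A} = \mathbf{B}\mathbf{A} + \mathbf{C}\mathbf{A}$
(e) $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$

More generally, for any $\mathbf{x}_j$ such that $\mathbf{A}\mathbf{x}_j$ is defined,
(f) $\sum_{j=1}^{n} \mathbf{A}\mathbf{x}_j = \mathbf{A}\sum_{j=1}^{n} \mathbf{x}_j$ ■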
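There are several important differences between the algebra of matrices and the algebra of real numbers. Two of these differences are as follows:

1. Matrix multiplication is, in general, not commutative. That is, in general, $\mathbf{A}\mathbf{B} \ne \mathbf{B}\mathbf{A}$. Several examples will illustrate the failure of the commutative law (for matrices).

\[
\begin{bmatrix} 3 & -1 \\ 4 & 7 \end{bmatrix}\begin{bmatrix} 0 \\ 2 \end{bmatrix} = \begin{bmatrix} -2 \\ 14 \end{bmatrix}
\]

but

\[
\begin{bmatrix} 0 \\ 2 \end{bmatrix}\begin{bmatrix} 3 & -1 \\ 4 & 7 \end{bmatrix}
\]

is not defined.

\[
\begin{bmatrix} 1 & 0 & 1 \\ 2 & -3 & 6 \end{bmatrix}\begin{bmatrix} 7 & 6 \\ -3 & 1 \\ 2 & 4 \end{bmatrix} = \begin{bmatrix} 9 & 10 \\ 35 & 33 \end{bmatrix}
\]

but

\[
\begin{bmatrix} 7 & 6 \\ -3 & 1 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} 1 & 0 & 1 \\ 2 & -3 & 6 \end{bmatrix} = \begin{bmatrix} 19 & -18 & 43 \\ -1 & -3 & 3 \\ 10 & -12 & 26 \end{bmatrix}
\]

Also,

\[
\begin{bmatrix} 4 & -1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 2 & 1 \\ -3 & 4 \end{bmatrix} = \begin{bmatrix} 11 & 0 \\ -3 & 4 \end{bmatrix}
\]

but

\[
\begin{bmatrix} 2 & 1 \\ -3 & 4 \end{bmatrix}\begin{bmatrix} 4 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 8 & -1 \\ -12 & 7 \end{bmatrix}
\]

2. Let $\mathbf{0}$ denote the zero matrix, that is, the matrix with zero for every element. In the algebra of real numbers, if the product of two numbers, $ab$, is zero, then $a = 0$ or $b = 0$. In matrix algebra, however, the product of two nonzero matrices may be the zero matrix. Hence,

\[
\underset{(m \times n)}{\mathbf{A}}\;\underset{(n \times k)}{\mathbf{B}} = \underset{(m \times k)}{\mathbf{0}}
\]

does not imply that $\mathbf{A} = \mathbf{0}$ or $\mathbf{B} = \mathbf{0}$. For example,

\[
\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}
\]

It is true, however, that if either $\underset{(m \times n)}{\mathbf{A}} = \underset{(m \times n)}{\mathbf{0}}$ or $\underset{(n \times k)}{\mathbf{B}} = \underset{(n \times k)}{\mathbf{0}}$, then $\underset{(m \times n)}{\mathbf{A}}\;\underset{(n \times k)}{\mathbf{B}} = \underset{(m \times k)}{\mathbf{0}}$.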
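Definition 2A.24. The determinant of the square $k \times k$ matrix $\mathbf{A} = \{a_{ij}\}$, denoted by $|\mathbf{A}|$, is the scalar

\[
\begin{aligned}
|\mathbf{A}| &= a_{11} && \text{if } k = 1 \\
|\mathbf{A}| &= \sum_{j=1}^{k} a_{1j}|\mathbf{A}_{1j}|(-1)^{1+j} && \text{if } k > 1
\end{aligned}
\]

where $\mathbf{A}_{1j}$ is the $(k - 1) \times (k - 1)$ matrix obtained by deleting the first row and $j$th column of $\mathbf{A}$. Also, $|\mathbf{A}| = \sum_{j=1}^{k} a_{ij}|\mathbf{A}_{ij}|(-1)^{i+j}$, with the $i$th row in place of the first row.

Examples of determinants (evaluated using Definition 2A.24) are

\[
\begin{vmatrix} 1 & 3 \\ 6 & 4 \end{vmatrix} = 1|4|(-1)^2 + 3|6|(-1)^3 = 1(4) + 3(6)(-1) = -14
\]

In general,

\[
\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22}(-1)^2 + a_{12}a_{21}(-1)^3 = a_{11}a_{22} - a_{12}a_{21}
\]

\[
\begin{vmatrix} 3 & 1 & 6 \\ 7 & 4 & 5 \\ 2 & -7 & 1 \end{vmatrix}
= 3\begin{vmatrix} 4 & 5 \\ -7 & 1 \end{vmatrix}(-1)^2 + 1\begin{vmatrix} 7 & 5 \\ 2 & 1 \end{vmatrix}(-1)^3 + 6\begin{vmatrix} 7 & 4 \\ 2 & -7 \end{vmatrix}(-1)^4
= 3(39) - 1(-3) + 6(-57) = -222
\]

\[
\begin{vmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{vmatrix}
= 1\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}(-1)^2 + 0\begin{vmatrix} 0 & 0 \\ 0 & 1 \end{vmatrix}(-1)^3 + 0\begin{vmatrix} 0 & 1 \\ 0 & 0 \end{vmatrix}(-1)^4 = 1(1) = 1
\]

If $\mathbf{I}$ is the $k \times k$ identity matrix, $|\mathbf{I}| = 1$.

\[
\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix}
= a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{21}a_{32}a_{13} - a_{31}a_{22}a_{13} - a_{21}a_{12}a_{33} - a_{32}a_{23}a_{11}
\]

The determinant of any $3 \times 3$ matrix can be computed by summing the products of elements along the solid lines and subtracting the products along the dashed lines in the following diagram. This procedure is not valid for matrices of higher dimension, but in general, Definition 2A.24 can be employed to evaluate these determinants.

[Figure: diagram of the $3 \times 3$ array with solid and dashed lines marking the products that are added and subtracted.]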
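Definition 2A.24 is recursive, so it can be sketched directly as a recursive function. The names `minor` and `det` are illustrative; expansion along the first row is adequate for small examples, although it is far too slow for large matrices.

```python
def minor(A, i, j):
    """Matrix obtained from A by deleting row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]

def det(A):
    """Determinant by expansion along the first row, as in Definition 2A.24."""
    k = len(A)
    if k == 1:
        return A[0][0]
    # With 0-based j, the sign (-1)**j equals (-1)^(1+j) for the 1-based column index.
    return sum(A[0][j] * det(minor(A, 0, j)) * (-1) ** j for j in range(k))

d2 = det([[1, 3],
          [6, 4]])
d3 = det([[3, 1, 6],
          [7, 4, 5],
          [2, -7, 1]])
d_identity = det([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]])
```

The three values agree with the worked examples: $-14$, $-222$, and $1$.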
We next want to state a result that describes some properties of the determinant. However, we must first introduce some notions related to matrix inverses.

Definition 2A.25. The row rank of a matrix is the maximum number of linearly independent rows, considered as vectors (that is, row vectors). The column rank of a matrix is the rank of its set of columns, considered as vectors.

For example, let the matrix

\[
\mathbf{A} = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 5 & -1 \\ 0 & 1 & -1 \end{bmatrix}
\]

The rows of $\mathbf{A}$, written as vectors, were shown to be linearly dependent after Definition 2A.6. Note that the column rank of $\mathbf{A}$ is also 2, since

\[
-2\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} + \begin{bmatrix} 1 \\ 5 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\]

but columns 1 and 2 are linearly independent. This is no coincidence, as the following result indicates.

Result 2A.6. The row rank and the column rank of a matrix are equal. ■

Thus, the rank of a matrix is either the row rank or the column rank.

Definition 2A.26. A square matrix $\underset{(k \times k)}{\mathbf{A}}$ is nonsingular if $\underset{(k \times k)}{\mathbf{A}}\;\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$ implies that $\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$. If a matrix fails to be nonsingular, it is called singular. Equivalently, a square matrix is nonsingular if its rank is equal to the number of rows (or columns) it has.

Note that $\mathbf{A}\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_k\mathbf{a}_k$, where $\mathbf{a}_i$ is the $i$th column of $\mathbf{A}$, so that the condition of nonsingularity is just the statement that the columns of $\mathbf{A}$ are linearly independent.

Result 2A.7. Let $\mathbf{A}$ be a nonsingular square matrix of dimension $k \times k$. Then there is a unique $k \times k$ matrix $\mathbf{B}$ such that

\[
\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I}
\]

where $\mathbf{I}$ is the $k \times k$ identity matrix. ■

Definition 2A.27. The $\mathbf{B}$ such that $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{I}$ is called the inverse of $\mathbf{A}$ and is denoted by $\mathbf{A}^{-1}$. In fact, if $\mathbf{B}\mathbf{A} = \mathbf{I}$ or $\mathbf{A}\mathbf{B} = \mathbf{I}$, then $\mathbf{B} = \mathbf{A}^{-1}$, and both products must equal $\mathbf{I}$.
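For example,

\[
\mathbf{A} = \begin{bmatrix} 2 & 3 \\ 1 & 5 \end{bmatrix}
\qquad \text{has} \qquad
\mathbf{A}^{-1} = \begin{bmatrix} \tfrac{5}{7} & -\tfrac{3}{7} \\ -\tfrac{1}{7} & \tfrac{2}{7} \end{bmatrix}
\]

since

\[
\begin{bmatrix} 2 & 3 \\ 1 & 5 \end{bmatrix}\begin{bmatrix} \tfrac{5}{7} & -\tfrac{3}{7} \\ -\tfrac{1}{7} & \tfrac{2}{7} \end{bmatrix}
= \begin{bmatrix} \tfrac{5}{7} & -\tfrac{3}{7} \\ -\tfrac{1}{7} & \tfrac{2}{7} \end{bmatrix}\begin{bmatrix} 2 & 3 \\ 1 & 5 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
\]

Result 2A.8.
(a) The inverse of any $2 \times 2$ matrix

\[
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\]

is given by

\[
\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}\begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}
\]

(b) The inverse of any $3 \times 3$ matrix

\[
\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
\]

is given by

\[
\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}
\begin{bmatrix}
\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} & -\begin{vmatrix} a_{12} & a_{13} \\ a_{32} & a_{33} \end{vmatrix} & \begin{vmatrix} a_{12} & a_{13} \\ a_{22} & a_{23} \end{vmatrix} \\[2ex]
-\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} & \begin{vmatrix} a_{11} & a_{13} \\ a_{31} & a_{33} \end{vmatrix} & -\begin{vmatrix} a_{11} & a_{13} \\ a_{21} & a_{23} \end{vmatrix} \\[2ex]
\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix} & -\begin{vmatrix} a_{11} & a_{12} \\ a_{31} & a_{32} \end{vmatrix} & \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix}
\end{bmatrix}
\]

In both (a) and (b), it is clear that $|\mathbf{A}| \ne 0$ if the inverse is to exist.
(c) In general, $\mathbf{A}^{-1}$ has $j, i$th entry $[|\mathbf{A}_{ij}|/|\mathbf{A}|](-1)^{i+j}$, where $\mathbf{A}_{ij}$ is the matrix obtained from $\mathbf{A}$ by deleting the $i$th row and $j$th column. ■

Result 2A.9. For a square matrix $\mathbf{A}$ of dimension $k \times k$, the following are equivalent:
(a) $\underset{(k \times k)}{\mathbf{A}}\;\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$ implies $\underset{(k \times 1)}{\mathbf{x}} = \underset{(k \times 1)}{\mathbf{0}}$ ($\mathbf{A}$ is nonsingular).
(b) $|\mathbf{A}| \ne 0$.
(c) There exists a matrix $\mathbf{A}^{-1}$ such that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \underset{(k \times k)}{\mathbf{I}}$. ■

Result 2A.10. Let $\mathbf{A}$ and $\mathbf{B}$ be square matrices of the same dimension, and let the indicated inverses exist. Then the following hold:
(a) $(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}$
(b) $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$ ■

The determinant has the following properties.

Result 2A.11. Let $\mathbf{A}$ and $\mathbf{B}$ be $k \times k$ square matrices.
(a) $|\mathbf{A}| = |\mathbf{A}'|$
(b) If each element of a row (column) of $\mathbf{A}$ is zero, then $|\mathbf{A}| = 0$
(c) If any two rows (columns) of $\mathbf{A}$ are identical, then $|\mathbf{A}| = 0$
(d) If $\mathbf{A}$ is nonsingular, then $|\mathbf{A}| = 1/|\mathbf{A}^{-1}|$; that is, $|\mathbf{A}||\mathbf{A}^{-1}| = 1$.
(e) $|\mathbf{A}\mathbf{B}| = |\mathbf{A}||\mathbf{B}|$
(f) $|c\mathbf{A}| = c^k|\mathbf{A}|$, where $c$ is a scalar. ■

You are referred to [6] for proofs of parts of Results 2A.9 and 2A.11. Some of these proofs are rather complex and beyond the scope of this book.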
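The cofactor formula for the inverse in Result 2A.8(c) and several determinant properties in Result 2A.11 can be spot-checked on small integer matrices. The sketch below uses exact fractions; the helper names and the second matrix `B` are assumptions made for the illustration, and cofactor expansion is practical only for small $k$.

```python
from fractions import Fraction

def minor(A, i, j):
    """Delete row i and column j (0-based indices)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]

def det(A):
    """Determinant by first-row expansion; the empty matrix is given determinant 1."""
    k = len(A)
    if k == 0:
        return 1
    if k == 1:
        return A[0][0]
    return sum(A[0][j] * det(minor(A, 0, j)) * (-1) ** j for j in range(k))

def inverse(A):
    """Inverse whose (j, i) entry is |A_ij| (-1)^(i+j) / |A|; integer entries assumed."""
    k, d = len(A), det(A)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(det(minor(A, i, j)) * (-1) ** (i + j), d)
             for i in range(k)] for j in range(k)]

def matmul(A, B):
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2, 3], [1, 5]]
B = [[4, -1], [0, 1]]          # hypothetical second matrix
A_inv = inverse(A)

product = matmul(A, A_inv)                      # should be the identity
det_AB = det(matmul(A, B))                      # compare with |A||B|
det_3A = det([[3 * a for a in row] for row in A])   # compare with 3^2 |A|
```

Running this reproduces the inverse with entries $5/7$, $-3/7$, $-1/7$, $2/7$ and confirms $|\mathbf{A}\mathbf{B}| = |\mathbf{A}||\mathbf{B}|$ and $|c\mathbf{A}| = c^k|\mathbf{A}|$ for this pair.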
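Definition 2A.28. Let $\mathbf{A} = \{a_{ij}\}$ be a $k \times k$ square matrix. The trace of the matrix $\mathbf{A}$, written $\operatorname{tr}(\mathbf{A})$, is the sum of the diagonal elements; that is, $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{k} a_{ii}$.

Result 2A.12. Let $\mathbf{A}$ and $\mathbf{B}$ be $k \times k$ matrices and $c$ be a scalar.
(a) $\operatorname{tr}(c\mathbf{A}) = c\operatorname{tr}(\mathbf{A})$
(b) $\operatorname{tr}(\mathbf{A} \pm \mathbf{B}) = \operatorname{tr}(\mathbf{A}) \pm \operatorname{tr}(\mathbf{B})$
(c) $\operatorname{tr}(\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{B}\mathbf{A})$
(d) $\operatorname{tr}(\mathbf{B}^{-1}\mathbf{A}\mathbf{B}) = \operatorname{tr}(\mathbf{A})$
(e) $\operatorname{tr}(\mathbf{A}\mathbf{A}') = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}^2$ ■

Definition 2A.29. A square matrix $\mathbf{A}$ is said to be orthogonal if its rows, considered as vectors, are mutually perpendicular and have unit lengths; that is, $\mathbf{A}\mathbf{A}' = \mathbf{I}$.

Result 2A.13. A matrix $\mathbf{A}$ is orthogonal if and only if $\mathbf{A}^{-1} = \mathbf{A}'$. For an orthogonal matrix, $\mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A} = \mathbf{I}$, so the columns are also mutually perpendicular and have unit lengths. ■

An example of an orthogonal matrix is

\[
\mathbf{A} = \begin{bmatrix}
-\tfrac12 & \tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & -\tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & -\tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & \tfrac12 & -\tfrac12
\end{bmatrix}
\]

Clearly, $\mathbf{A} = \mathbf{A}'$, so $\mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A} = \mathbf{A}\mathbf{A}$. We verify that $\mathbf{A}\mathbf{A} = \mathbf{I} = \mathbf{A}\mathbf{A}' = \mathbf{A}'\mathbf{A}$, or

\[
\underset{\mathbf{A}}{\begin{bmatrix}
-\tfrac12 & \tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & -\tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & -\tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & \tfrac12 & -\tfrac12
\end{bmatrix}}
\underset{\mathbf{A}}{\begin{bmatrix}
-\tfrac12 & \tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & -\tfrac12 & \tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & -\tfrac12 & \tfrac12 \\
\tfrac12 & \tfrac12 & \tfrac12 & -\tfrac12
\end{bmatrix}}
= \underset{\mathbf{I}}{\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}}
\]

so $\mathbf{A}' = \mathbf{A}^{-1}$, and $\mathbf{A}$ must be an orthogonal matrix.

Square matrices are best understood in terms of quantities called eigenvalues and eigenvectors.

Definition 2A.30. Let $\mathbf{A}$ be a $k \times k$ square matrix and $\mathbf{I}$ be the $k \times k$ identity matrix. Then the scalars $\lambda_1, \lambda_2, \ldots, \lambda_k$ satisfying the polynomial equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$ are called the eigenvalues (or characteristic roots) of a matrix $\mathbf{A}$. The equation $|\mathbf{A} - \lambda\mathbf{I}| = 0$ (as a function of $\lambda$) is called the characteristic equation.

For example, let

\[
\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}
\]

Then

\[
|\mathbf{A} - \lambda\mathbf{I}| = \left|\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix} - \lambda\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\right|
= \begin{vmatrix} 1 - \lambda & 0 \\ 1 & 3 - \lambda \end{vmatrix} = (1 - \lambda)(3 - \lambda) = 0
\]

implies that there are two roots, $\lambda_1 = 1$ and $\lambda_2 = 3$. The eigenvalues of $\mathbf{A}$ are 3 and 1. Let

\[
\mathbf{A} = \begin{bmatrix} 13 & -4 & 2 \\ -4 & 13 & -2 \\ 2 & -2 & 10 \end{bmatrix}
\]

Then the equation

\[
|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 13 - \lambda & -4 & 2 \\ -4 & 13 - \lambda & -2 \\ 2 & -2 & 10 - \lambda \end{vmatrix}
= -\lambda^3 + 36\lambda^2 - 405\lambda + 1458 = 0
\]

has three roots: $\lambda_1 = 9$, $\lambda_2 = 9$, and $\lambda_3 = 18$; that is, 9, 9, and 18 are the eigenvalues of $\mathbf{A}$.

Definition 2A.31. Let $\mathbf{A}$ be a square matrix of dimension $k \times k$ and let $\lambda$ be an eigenvalue of $\mathbf{A}$. If $\underset{(k \times 1)}{\mathbf{x}}$ is a nonzero vector ($\underset{(k \times 1)}{\mathbf{x}} \ne \underset{(k \times 1)}{\mathbf{0}}$) such that

\[
\mathbf{A}\mathbf{x} = \lambda\mathbf{x}
\]

then $\mathbf{x}$ is said to be an eigenvector (characteristic vector) of the matrix $\mathbf{A}$ associated with the eigenvalue $\lambda$.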
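An equivalent condition for $\lambda$ to be a solution of the eigenvalue-eigenvector equation is $|\mathbf{A} - \lambda\mathbf{I}| = 0$. This follows because the statement that $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$ for some $\lambda$ and $\mathbf{x} \ne \mathbf{0}$ implies that

\[
\mathbf{0} = (\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = x_1\operatorname{col}_1(\mathbf{A} - \lambda\mathbf{I}) + \cdots + x_k\operatorname{col}_k(\mathbf{A} - \lambda\mathbf{I})
\]

That is, the columns of $\mathbf{A} - \lambda\mathbf{I}$ are linearly dependent so, by Result 2A.9(b), $|\mathbf{A} - \lambda\mathbf{I}| = 0$, as asserted. Following Definition 2A.30, we have shown that the eigenvalues of

\[
\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}
\]

are $\lambda_1 = 1$ and $\lambda_2 = 3$. The eigenvectors associated with these eigenvalues can be determined by solving the following equations:

\[
\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 1\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\qquad \text{and} \qquad
\begin{bmatrix} 1 & 0 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 3\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
\]

From the first expression,

\[
\begin{aligned}
x_1 &= x_1 \\
x_1 + 3x_2 &= x_2
\end{aligned}
\]

or $x_1 = -2x_2$. There are many solutions for $x_1$ and $x_2$. Setting $x_2 = 1$ (arbitrarily) gives $x_1 = -2$, and hence,

\[
\mathbf{x} = \begin{bmatrix} -2 \\ 1 \end{bmatrix}
\]

is an eigenvector corresponding to the eigenvalue 1. From the second expression,

\[
\begin{aligned}
x_1 &= 3x_1 \\
x_1 + 3x_2 &= 3x_2
\end{aligned}
\]

implies that $x_1 = 0$ and $x_2 = 1$ (arbitrarily), and hence,

\[
\mathbf{x} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}
\]

is an eigenvector corresponding to the eigenvalue 3. It is usual practice to determine an eigenvector so that it has length unity. That is, if $\mathbf{A}\mathbf{x} = \lambda\mathbf{x}$, we take $\mathbf{e} = \mathbf{x}/\sqrt{\mathbf{x}'\mathbf{x}}$ as the eigenvector corresponding to $\lambda$. For example, the eigenvector for $\lambda_1 = 1$ is $\mathbf{e}_1' = [-2/\sqrt{5},\; 1/\sqrt{5}]$.

Definition 2A.32. A quadratic form $Q(\mathbf{x})$ in the $k$ variables $x_1, x_2, \ldots, x_k$ is $Q(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x}$, where $\mathbf{x}' = [x_1, x_2, \ldots, x_k]$ and $\mathbf{A}$ is a $k \times k$ symmetric matrix.

Note that a quadratic form can be written as $Q(\mathbf{x}) = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}x_ix_j$. For example,

\[
Q(\mathbf{x}) = [x_1 \;\; x_2]\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = x_1^2 + 2x_1x_2 + x_2^2
\]

\[
Q(\mathbf{x}) = [x_1 \;\; x_2 \;\; x_3]\begin{bmatrix} 1 & 3 & 0 \\ 3 & -1 & -2 \\ 0 & -2 & 2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = x_1^2 + 6x_1x_2 - x_2^2 - 4x_2x_3 + 2x_3^2
\]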
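Any symmetric square matrix can be reconstructed from its eigenvalues and eigenvectors. The particular expression reveals the relative importance of each pair according to the relative size of the eigenvalue and the direction of the eigenvector.

Result 2A.14. The Spectral Decomposition. Let $\mathbf{A}$ be a $k \times k$ symmetric matrix. Then $\mathbf{A}$ can be expressed in terms of its $k$ eigenvalue-eigenvector pairs $(\lambda_i, \mathbf{e}_i)$ as

\[
\mathbf{A} = \sum_{i=1}^{k} \lambda_i\mathbf{e}_i\mathbf{e}_i'
\]
■

For example, let

\[
\mathbf{A} = \begin{bmatrix} 2.2 & .4 \\ .4 & 2.8 \end{bmatrix}
\]

Then

\[
|\mathbf{A} - \lambda\mathbf{I}| = \lambda^2 - 5\lambda + 6.16 - .16 = (\lambda - 3)(\lambda - 2)
\]

so $\mathbf{A}$ has eigenvalues $\lambda_1 = 3$ and $\lambda_2 = 2$. The corresponding eigenvectors are $\mathbf{e}_1' = [1/\sqrt{5},\; 2/\sqrt{5}]$ and $\mathbf{e}_2' = [2/\sqrt{5},\; -1/\sqrt{5}]$, respectively. Consequently,

\[
\begin{aligned}
\mathbf{A} = \begin{bmatrix} 2.2 & .4 \\ .4 & 2.8 \end{bmatrix}
&= 3\begin{bmatrix} \tfrac{1}{\sqrt5} \\[0.5ex] \tfrac{2}{\sqrt5} \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt5} & \tfrac{2}{\sqrt5} \end{bmatrix}
+ 2\begin{bmatrix} \tfrac{2}{\sqrt5} \\[0.5ex] -\tfrac{1}{\sqrt5} \end{bmatrix}\begin{bmatrix} \tfrac{2}{\sqrt5} & -\tfrac{1}{\sqrt5} \end{bmatrix} \\
&= \begin{bmatrix} .6 & 1.2 \\ 1.2 & 2.4 \end{bmatrix} + \begin{bmatrix} 1.6 & -.8 \\ -.8 & .4 \end{bmatrix}
\end{aligned}
\]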
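The spectral decomposition in this example can be verified with a few lines of arithmetic. The sketch below rebuilds the matrix from the two eigenvalue-eigenvector pairs and also checks the defining relation $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$; the variable names are illustrative and the comparison is made up to floating-point rounding.

```python
import math

A = [[2.2, 0.4],
     [0.4, 2.8]]
r5 = math.sqrt(5)
pairs = [(3.0, [1 / r5, 2 / r5]),     # (lambda_1, e_1)
         (2.0, [2 / r5, -1 / r5])]    # (lambda_2, e_2)

def outer(u, v):
    """Outer product u v' as a list of rows."""
    return [[a * b for b in v] for a in u]

# Sum of lambda_i * e_i e_i' over the eigenvalue-eigenvector pairs.
rebuilt = [[sum(lam * outer(e, e)[i][j] for lam, e in pairs) for j in range(2)]
           for i in range(2)]

# Residuals of the eigenvalue equation, A e - lambda e, for each pair.
residuals = [[sum(A[i][j] * e[j] for j in range(2)) - lam * e[i] for i in range(2)]
             for lam, e in pairs]
```

Both the reconstruction error and the residuals are zero up to rounding, which is what Result 2A.14 and Definition 2A.31 require.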
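The ideas that lead to the spectral decomposition can be extended to provide a decomposition for a rectangular, rather than a square, matrix. If $\mathbf{A}$ is a rectangular matrix, then the vectors in the expansion of $\mathbf{A}$ are the eigenvectors of the square matrices $\mathbf{A}\mathbf{A}'$ and $\mathbf{A}'\mathbf{A}$.

Result 2A.15. Singular-Value Decomposition. Let $\mathbf{A}$ be an $m \times k$ matrix of real numbers. Then there exist an $m \times m$ orthogonal matrix $\mathbf{U}$ and a $k \times k$ orthogonal matrix $\mathbf{V}$ such that

\[
\mathbf{A} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'
\]

where the $m \times k$ matrix $\boldsymbol{\Lambda}$ has $(i, i)$ entry $\lambda_i \ge 0$ for $i = 1, 2, \ldots, \min(m, k)$ and the other entries are zero. The positive constants $\lambda_i$ are called the singular values of $\mathbf{A}$. ■

The singular-value decomposition can also be expressed as a matrix expansion that depends on the rank $r$ of $\mathbf{A}$. Specifically, there exist $r$ positive constants $\lambda_1, \lambda_2, \ldots, \lambda_r$, $r$ orthogonal $m \times 1$ unit vectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r$, and $r$ orthogonal $k \times 1$ unit vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r$, such that

\[
\mathbf{A} = \sum_{i=1}^{r} \lambda_i\mathbf{u}_i\mathbf{v}_i' = \mathbf{U}_r\boldsymbol{\Lambda}_r\mathbf{V}_r'
\]

where $\mathbf{U}_r = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r]$, $\mathbf{V}_r = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r]$, and $\boldsymbol{\Lambda}_r$ is an $r \times r$ diagonal matrix with diagonal entries $\lambda_i$.

Here $\mathbf{A}\mathbf{A}'$ has eigenvalue-eigenvector pairs $(\lambda_i^2, \mathbf{u}_i)$, so

\[
\mathbf{A}\mathbf{A}'\mathbf{u}_i = \lambda_i^2\mathbf{u}_i
\]

with $\lambda_1^2, \lambda_2^2, \ldots, \lambda_r^2 > 0 = \lambda_{r+1}^2, \lambda_{r+2}^2, \ldots, \lambda_m^2$ (for $m > k$). Then $\mathbf{v}_i = \lambda_i^{-1}\mathbf{A}'\mathbf{u}_i$. Alternatively, the $\mathbf{v}_i$ are the eigenvectors of $\mathbf{A}'\mathbf{A}$ with the same nonzero eigenvalues $\lambda_i^2$.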
1 02
Chapter 2
M atrix Algebra and Ra ndom Vectors
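The matrix expansion for the singular-value decomposition written in terms of the full dimensional matrices $\mathbf{U}$, $\mathbf{V}$, $\boldsymbol{\Lambda}$ is

$$\underset{(m \times k)}{\mathbf{A}} = \underset{(m \times m)}{\mathbf{U}}\ \underset{(m \times k)}{\boldsymbol{\Lambda}}\ \underset{(k \times k)}{\mathbf{V}'}$$

where $\mathbf{U}$ has $m$ orthogonal eigenvectors of $\mathbf{A}\mathbf{A}'$ as its columns, $\mathbf{V}$ has $k$ orthogonal eigenvectors of $\mathbf{A}'\mathbf{A}$ as its columns, and $\boldsymbol{\Lambda}$ is specified in Result 2A.15.

For example, let

$$\mathbf{A} = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix}$$

Then

$$\mathbf{A}\mathbf{A}' = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix} \begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 11 & 1 \\ 1 & 11 \end{bmatrix}$$

You may verify that the eigenvalues $\gamma = \lambda^2$ of $\mathbf{A}\mathbf{A}'$ satisfy the equation $\gamma^2 - 22\gamma + 120 = (\gamma - 12)(\gamma - 10) = 0$, and consequently, the eigenvalues are $\gamma_1 = \lambda_1^2 = 12$ and $\gamma_2 = \lambda_2^2 = 10$. The corresponding eigenvectors are

$$\mathbf{u}_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} \quad \text{and} \quad \mathbf{u}_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$$

respectively. Also,

$$\mathbf{A}'\mathbf{A} = \begin{bmatrix} 3 & -1 \\ 1 & 3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix}$$

so $|\mathbf{A}'\mathbf{A} - \gamma\mathbf{I}| = -\gamma^3 + 22\gamma^2 - 120\gamma = -\gamma(\gamma - 12)(\gamma - 10)$, and the eigenvalues are $\gamma_1 = \lambda_1^2 = 12$, $\gamma_2 = \lambda_2^2 = 10$, and $\gamma_3 = \lambda_3^2 = 0$. The nonzero eigenvalues are the same as those of $\mathbf{A}\mathbf{A}'$. A computer calculation gives the eigenvectors

$$\mathbf{v}_1 = \frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}, \quad \mathbf{v}_2 = \frac{1}{\sqrt{5}}\begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix}, \quad \text{and} \quad \mathbf{v}_3 = \frac{1}{\sqrt{30}}\begin{bmatrix} 1 \\ 2 \\ -5 \end{bmatrix}$$

Eigenvectors $\mathbf{v}_1$ and $\mathbf{v}_2$ can be verified by checking:

$$\mathbf{A}'\mathbf{A}\mathbf{v}_1 = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix} \frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} = 12\,\frac{1}{\sqrt{6}}\begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix} = \lambda_1^2\mathbf{v}_1$$

$$\mathbf{A}'\mathbf{A}\mathbf{v}_2 = \begin{bmatrix} 10 & 0 & 2 \\ 0 & 10 & 4 \\ 2 & 4 & 2 \end{bmatrix} \frac{1}{\sqrt{5}}\begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = 10\,\frac{1}{\sqrt{5}}\begin{bmatrix} 2 \\ -1 \\ 0 \end{bmatrix} = \lambda_2^2\mathbf{v}_2$$

Taking $\lambda_1 = \sqrt{12}$ and $\lambda_2 = \sqrt{10}$, we find that the singular-value decomposition of $\mathbf{A}$ is

$$\mathbf{A} = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 3 & 1 \end{bmatrix} = \sqrt{12}\begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt{6}} & \tfrac{2}{\sqrt{6}} & \tfrac{1}{\sqrt{6}} \end{bmatrix} + \sqrt{10}\begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} \tfrac{2}{\sqrt{5}} & -\tfrac{1}{\sqrt{5}} & 0 \end{bmatrix}$$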
The equality may be checked by carrying out the operations on the right-hand side.

The singular-value decomposition is closely connected to a result concerning the approximation of a rectangular matrix by a lower-dimensional matrix, due to Eckart and Young ([3]). If an $m \times k$ matrix $\mathbf{A}$ is approximated by $\mathbf{B}$, having the same dimension but lower rank, the sum of squared differences is

$$\sum_{i=1}^{m}\sum_{j=1}^{k} (a_{ij} - b_{ij})^2 = \operatorname{tr}[(\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B})']$$
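Result 2A.16. Let $\mathbf{A}$ be an $m \times k$ matrix of real numbers with $m \ge k$ and singular value decomposition $\mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'$. Let $s < k = \operatorname{rank}(\mathbf{A})$. Then

$$\mathbf{B} = \sum_{i=1}^{s} \lambda_i \mathbf{u}_i \mathbf{v}_i'$$

is the rank-$s$ least squares approximation to $\mathbf{A}$. It minimizes $\operatorname{tr}[(\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B})']$ over all $m \times k$ matrices $\mathbf{B}$ having rank no greater than $s$. The minimum value, or error of approximation, is $\sum_{i=s+1}^{k} \lambda_i^2$. •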
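To establish this result, we use $\mathbf{U}\mathbf{U}' = \mathbf{I}_m$ and $\mathbf{V}\mathbf{V}' = \mathbf{I}_k$ to write the sum of squares as

$$\begin{aligned}
\operatorname{tr}[(\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B})'] &= \operatorname{tr}[\mathbf{U}\mathbf{U}'(\mathbf{A} - \mathbf{B})\mathbf{V}\mathbf{V}'(\mathbf{A} - \mathbf{B})'] \\
&= \operatorname{tr}[\mathbf{U}'(\mathbf{A} - \mathbf{B})\mathbf{V}\mathbf{V}'(\mathbf{A} - \mathbf{B})'\mathbf{U}] \\
&= \operatorname{tr}[(\boldsymbol{\Lambda} - \mathbf{C})(\boldsymbol{\Lambda} - \mathbf{C})'] = \sum_{i=1}^{m}\sum_{j=1}^{k} (\lambda_{ij} - c_{ij})^2 \\
&= \sum_{i=1}^{m} (\lambda_i - c_{ii})^2 + \mathop{\sum\sum}_{i \ne j} c_{ij}^2
\end{aligned}$$

where $\mathbf{C} = \mathbf{U}'\mathbf{B}\mathbf{V}$. Clearly, the minimum occurs when $c_{ij} = 0$ for $i \ne j$ and $c_{ii} = \lambda_i$ for the $s$ largest singular values. The other $c_{ii} = 0$. That is, $\mathbf{U}'\mathbf{B}\mathbf{V} = \boldsymbol{\Lambda}_s$ or $\mathbf{B} = \sum_{i=1}^{s} \lambda_i \mathbf{u}_i \mathbf{v}_i'$.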
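The expansion and Result 2A.16 can be checked numerically on the $2 \times 3$ example above. The following is a minimal sketch in plain Python, assuming the singular values $\sqrt{12}$, $\sqrt{10}$ and the unit vectors $\mathbf{u}_i$, $\mathbf{v}_i$ quoted in that example; the helper names `rank_s_approximation` and `squared_error` are illustrative only. It rebuilds $\mathbf{A}$ from both terms of the expansion and then measures the error of the rank-1 approximation, which by Result 2A.16 should equal $\lambda_2^2 = 10$.

```python
import math

# Matrix from the worked singular-value decomposition example.
A = [[3.0, 1.0, 1.0],
     [-1.0, 3.0, 1.0]]

# Singular values and unit vectors quoted in that example.
lam = [math.sqrt(12.0), math.sqrt(10.0)]
u = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
     [1 / math.sqrt(2), -1 / math.sqrt(2)]]
v = [[1 / math.sqrt(6), 2 / math.sqrt(6), 1 / math.sqrt(6)],
     [2 / math.sqrt(5), -1 / math.sqrt(5), 0.0]]


def rank_s_approximation(s):
    """Return sum_{i <= s} lam_i * u_i * v_i' as a 2 x 3 list of lists."""
    B = [[0.0] * 3 for _ in range(2)]
    for i in range(s):
        for r in range(2):
            for c in range(3):
                B[r][c] += lam[i] * u[i][r] * v[i][c]
    return B


def squared_error(M, N):
    """tr[(M - N)(M - N)'], the sum of squared entrywise differences."""
    return sum((M[r][c] - N[r][c]) ** 2 for r in range(2) for c in range(3))


full = rank_s_approximation(2)      # both terms: should reproduce A
rank_one = rank_s_approximation(1)  # leading term only
err_full = squared_error(A, full)
err_rank_one = squared_error(A, rank_one)
print(err_full, err_rank_one)
```

Up to floating-point rounding, `err_full` is zero and `err_rank_one` matches the discarded squared singular value.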
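EXERCISES

2.1. Let $\mathbf{x}' = [5, 1, 3]$ and $\mathbf{y}' = [-1, 3, 1]$.
(a) Graph the two vectors.
(b) Find (i) the length of $\mathbf{x}$, (ii) the angle between $\mathbf{x}$ and $\mathbf{y}$, and (iii) the projection of $\mathbf{y}$ on $\mathbf{x}$.
(c) Since $\bar{x} = 3$ and $\bar{y} = 1$, graph $[5 - 3, 1 - 3, 3 - 3] = [2, -2, 0]$ and $[-1 - 1, 3 - 1, 1 - 1] = [-2, 2, 0]$.

2.2. Given the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ [matrices not legible in the source], perform the indicated multiplications.
(a) $5\mathbf{A}$
(b) $\mathbf{B}\mathbf{A}$
(c) $\mathbf{A}'\mathbf{B}'$
(d) $\mathbf{C}'\mathbf{B}$
(e) Is $\mathbf{A}\mathbf{B}$ defined?

2.3. Verify the following properties of the transpose when $\mathbf{A}$ ($2 \times 2$), $\mathbf{B}$ ($2 \times 3$), and $\mathbf{C}$ ($2 \times 2$) are the given matrices [matrices not legible in the source].
(a) $(\mathbf{A}')' = \mathbf{A}$
(b) $(\mathbf{C}')^{-1} = (\mathbf{C}^{-1})'$
(c) $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$
(d) For general $\mathbf{A}$ ($m \times k$) and $\mathbf{B}$ ($k \times \ell$), $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$.

2.4. When $\mathbf{A}^{-1}$ and $\mathbf{B}^{-1}$ exist, prove each of the following.
(a) $(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'$
(b) $(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$
Hint: Part a can be proved by noting that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$, $\mathbf{I} = \mathbf{I}'$, and $(\mathbf{A}\mathbf{A}^{-1})' = (\mathbf{A}^{-1})'\mathbf{A}'$. Part b follows from $(\mathbf{B}^{-1}\mathbf{A}^{-1})\mathbf{A}\mathbf{B} = \mathbf{B}^{-1}(\mathbf{A}^{-1}\mathbf{A})\mathbf{B} = \mathbf{B}^{-1}\mathbf{B} = \mathbf{I}$.

2.5. Check that

$$\mathbf{Q} = \begin{bmatrix} \tfrac{5}{13} & \tfrac{12}{13} \\ -\tfrac{12}{13} & \tfrac{5}{13} \end{bmatrix}$$

is an orthogonal matrix.

2.6. Let

$$\mathbf{A} = \begin{bmatrix} 9 & -2 \\ -2 & 6 \end{bmatrix}$$

(a) Is $\mathbf{A}$ symmetric?
(b) Show that $\mathbf{A}$ is positive definite.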
2.7. Let $\mathbf{A}$ be as given in Exercise 2.6.
(a) Determine the eigenvalues and eigenvectors of $\mathbf{A}$.
(b) Write the spectral decomposition of $\mathbf{A}$.
(c) Find $\mathbf{A}^{-1}$.
(d) Find the eigenvalues and eigenvectors of $\mathbf{A}^{-1}$.

2.8. Given the matrix

$$\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 2 & -2 \end{bmatrix}$$

find the eigenvalues $\lambda_1$ and $\lambda_2$ and the associated normalized eigenvectors $\mathbf{e}_1$ and $\mathbf{e}_2$. Determine the spectral decomposition (2-16) of $\mathbf{A}$.

2.9. Let $\mathbf{A}$ be as in Exercise 2.8.
(a) Find $\mathbf{A}^{-1}$.
(b) Compute the eigenvalues and eigenvectors of $\mathbf{A}^{-1}$.
(c) Write the spectral decomposition of $\mathbf{A}^{-1}$, and compare it with that of $\mathbf{A}$ from Exercise 2.8.

2.10. Consider the matrices

$$\mathbf{A} = \begin{bmatrix} 4 & 4.001 \\ 4.001 & 4.002 \end{bmatrix} \quad \text{and} \quad \mathbf{B} = \begin{bmatrix} 4 & 4.001 \\ 4.001 & 4.002001 \end{bmatrix}$$

These matrices are identical except for a small difference in the (2, 2) position. Moreover, the columns of $\mathbf{A}$ (and $\mathbf{B}$) are nearly linearly dependent. Show that $\mathbf{A}^{-1} \doteq (-3)\mathbf{B}^{-1}$. Consequently, small changes, perhaps caused by rounding, can give substantially different inverses.

2.11. Show that the determinant of the $p \times p$ diagonal matrix $\mathbf{A} = \{a_{ij}\}$ with $a_{ij} = 0$, $i \ne j$, is given by the product of the diagonal elements; thus, $|\mathbf{A}| = a_{11}a_{22} \cdots a_{pp}$.
Hint: By Definition 2A.24, $|\mathbf{A}| = a_{11}A_{11} + 0 + \cdots + 0$. Repeat for the submatrix $\mathbf{A}_{11}$ obtained by deleting the first row and first column of $\mathbf{A}$.

2.12. Show that the determinant of a square symmetric $p \times p$ matrix $\mathbf{A}$ can be expressed as the product of its eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$; that is, $|\mathbf{A}| = \prod_{i=1}^{p} \lambda_i$.
Hint: From (2-16) and (2-20), $\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ with $\mathbf{P}'\mathbf{P} = \mathbf{I}$. From Result 2A.11(e), $|\mathbf{A}| = |\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'| = |\mathbf{P}||\boldsymbol{\Lambda}\mathbf{P}'| = |\mathbf{P}||\boldsymbol{\Lambda}||\mathbf{P}'| = |\boldsymbol{\Lambda}||\mathbf{I}|$, since $|\mathbf{I}| = |\mathbf{P}'\mathbf{P}| = |\mathbf{P}'||\mathbf{P}|$. Apply Exercise 2.11.

2.13. Show that $|\mathbf{Q}| = +1$ or $-1$ if $\mathbf{Q}$ is a $p \times p$ orthogonal matrix.
Hint: $|\mathbf{Q}\mathbf{Q}'| = |\mathbf{I}|$. Also, from Result 2A.11, $|\mathbf{Q}||\mathbf{Q}'| = |\mathbf{Q}|^2$. Thus, $|\mathbf{Q}|^2 = |\mathbf{I}|$. Now use Exercise 2.11.

2.14. Show that $\mathbf{Q}'\mathbf{A}\mathbf{Q}$ and $\mathbf{A}$ have the same eigenvalues if $\mathbf{Q}$ is orthogonal.
Hint: Let $\lambda$ be an eigenvalue of $\mathbf{A}$. Then $0 = |\mathbf{A} - \lambda\mathbf{I}|$. By Exercise 2.13 and Result 2A.11(e), we can write $0 = |\mathbf{Q}'||\mathbf{A} - \lambda\mathbf{I}||\mathbf{Q}| = |\mathbf{Q}'\mathbf{A}\mathbf{Q} - \lambda\mathbf{I}|$, since $\mathbf{Q}'\mathbf{Q} = \mathbf{I}$.

2.15. A quadratic form $\mathbf{x}'\mathbf{A}\mathbf{x}$ is said to be positive definite if the matrix $\mathbf{A}$ is positive definite. Is the quadratic form $3x_1^2 + 3x_2^2 - 2x_1x_2$ positive definite?
2.16. Consider an arbitrary $n \times p$ matrix $\mathbf{A}$. Then $\mathbf{A}'\mathbf{A}$ is a symmetric $p \times p$ matrix. Show that $\mathbf{A}'\mathbf{A}$ is necessarily nonnegative definite.
Hint: Set $\mathbf{y} = \mathbf{A}\mathbf{x}$ so that $\mathbf{y}'\mathbf{y} = \mathbf{x}'\mathbf{A}'\mathbf{A}\mathbf{x}$.

2.17. Prove that every eigenvalue of a $k \times k$ positive definite matrix $\mathbf{A}$ is positive.
Hint: Consider the definition of an eigenvalue, where $\mathbf{A}\mathbf{e} = \lambda\mathbf{e}$. Multiply on the left by $\mathbf{e}'$ so that $\mathbf{e}'\mathbf{A}\mathbf{e} = \lambda\mathbf{e}'\mathbf{e}$.

2.18. Consider the sets of points $(x_1, x_2)$ whose "distances" from the origin are given by $c^2$ = [quadratic form not legible in the source] for $c^2 = 1$ and for $c^2 = 4$. Determine the major and minor axes of the ellipses of constant distances and their associated lengths. Sketch the ellipses of constant distances and comment on their positions. What will happen as $c^2$ increases?

2.19. Let

$$\underset{(m \times m)}{\mathbf{A}^{1/2}} = \sum_{i=1}^{m} \sqrt{\lambda_i}\,\mathbf{e}_i\mathbf{e}_i' = \mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$$

where $\mathbf{P}\mathbf{P}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$. (The $\lambda_i$'s and the $\mathbf{e}_i$'s are the eigenvalues and associated normalized eigenvectors of the matrix $\mathbf{A}$.) Show Properties (1)-(4) of the square-root matrix in (2-22).

2.20. Determine the square-root matrix $\mathbf{A}^{1/2}$, using the matrix $\mathbf{A}$ in Exercise 2.3. Also, determine $\mathbf{A}^{-1/2}$, and show that $\mathbf{A}^{1/2}\mathbf{A}^{-1/2} = \mathbf{A}^{-1/2}\mathbf{A}^{1/2} = \mathbf{I}$.

2.21. (See Result 2A.15) Using the matrix $\mathbf{A}$ [matrix not legible in the source]:
(a) Calculate $\mathbf{A}'\mathbf{A}$ and obtain its eigenvalues and eigenvectors.
(b) Calculate $\mathbf{A}\mathbf{A}'$ and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of $\mathbf{A}$.

2.22. (See Result 2A.15) Using the matrix $\mathbf{A}$ [matrix not legible in the source]:
(a) Calculate $\mathbf{A}\mathbf{A}'$ and obtain its eigenvalues and eigenvectors.
(b) Calculate $\mathbf{A}'\mathbf{A}$ and obtain its eigenvalues and eigenvectors. Check that the nonzero eigenvalues are the same as those in part a.
(c) Obtain the singular-value decomposition of $\mathbf{A}$.

2.23. Verify the relationships $\mathbf{V}^{1/2}\boldsymbol{\rho}\mathbf{V}^{1/2} = \boldsymbol{\Sigma}$ and $\boldsymbol{\rho} = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1}$, where $\boldsymbol{\Sigma}$ is the $p \times p$ population covariance matrix [Equation (2-32)], $\boldsymbol{\rho}$ is the $p \times p$ population correlation matrix [Equation (2-34)], and $\mathbf{V}^{1/2}$ is the population standard deviation matrix [Equation (2-35)].
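2.24. Let $\mathbf{X}$ have covariance matrix $\boldsymbol{\Sigma}$ [matrix not legible in the source]. Find
(a) $\boldsymbol{\Sigma}^{-1}$
(b) The eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$.
(c) The eigenvalues and eigenvectors of $\boldsymbol{\Sigma}^{-1}$.

2.25. Let $\mathbf{X}$ have covariance matrix $\boldsymbol{\Sigma}$ [matrix not legible in the source].
(a) Determine $\boldsymbol{\rho}$ and $\mathbf{V}^{1/2}$.
(b) Multiply your matrices to check the relation $\mathbf{V}^{1/2}\boldsymbol{\rho}\mathbf{V}^{1/2} = \boldsymbol{\Sigma}$.

2.26. Use $\boldsymbol{\Sigma}$ as given in Exercise 2.25.
(a) Find $\rho_{13}$.
(b) Find the correlation between $X_1$ and $\tfrac{1}{2}X_2 + \tfrac{1}{2}X_3$.

2.27. Derive expressions for the mean and variances of the following linear combinations in terms of the means and covariances of the random variables $X_1$, $X_2$, and $X_3$.
(a) $X_1 - 2X_2$
(b) $-X_1 + 3X_2$
(c) $X_1 + X_2 + X_3$
(e) $X_1 + 2X_2 - X_3$
(f) $3X_1 - 4X_2$ if $X_1$ and $X_2$ are independent random variables.

2.28. Show that

$$\operatorname{Cov}(c_{11}X_1 + c_{12}X_2 + \cdots + c_{1p}X_p,\ c_{21}X_1 + c_{22}X_2 + \cdots + c_{2p}X_p) = \mathbf{c}_1'\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{c}_2$$

where $\mathbf{c}_1' = [c_{11}, c_{12}, \ldots, c_{1p}]$ and $\mathbf{c}_2' = [c_{21}, c_{22}, \ldots, c_{2p}]$. This verifies the off-diagonal elements of $\mathbf{C}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{C}'$ in (2-45) or diagonal elements if $\mathbf{c}_1 = \mathbf{c}_2$.
Hint: By (2-43), $Z_1 - E(Z_1) = c_{11}(X_1 - \mu_1) + \cdots + c_{1p}(X_p - \mu_p)$ and $Z_2 - E(Z_2) = c_{21}(X_1 - \mu_1) + \cdots + c_{2p}(X_p - \mu_p)$. So $\operatorname{Cov}(Z_1, Z_2) = E[(Z_1 - E(Z_1))(Z_2 - E(Z_2))] = E[(c_{11}(X_1 - \mu_1) + \cdots + c_{1p}(X_p - \mu_p))(c_{21}(X_1 - \mu_1) + c_{22}(X_2 - \mu_2) + \cdots + c_{2p}(X_p - \mu_p))]$. The product

$$\begin{aligned}
&(c_{11}(X_1 - \mu_1) + c_{12}(X_2 - \mu_2) + \cdots + c_{1p}(X_p - \mu_p))(c_{21}(X_1 - \mu_1) + c_{22}(X_2 - \mu_2) + \cdots + c_{2p}(X_p - \mu_p)) \\
&\quad = \left(\sum_{\ell=1}^{p} c_{1\ell}(X_\ell - \mu_\ell)\right)\left(\sum_{m=1}^{p} c_{2m}(X_m - \mu_m)\right) = \sum_{\ell=1}^{p}\sum_{m=1}^{p} c_{1\ell}c_{2m}(X_\ell - \mu_\ell)(X_m - \mu_m)
\end{aligned}$$

has expected value

$$\sum_{\ell=1}^{p}\sum_{m=1}^{p} c_{1\ell}c_{2m}\sigma_{\ell m} = [c_{11}, \ldots, c_{1p}]\,\boldsymbol{\Sigma}\,[c_{21}, \ldots, c_{2p}]'$$

Verify the last step by the definition of matrix multiplication. The same steps hold for all elements.

2.29. Consider the arbitrary random vector $\mathbf{X}' = [X_1, X_2, X_3, X_4, X_5]$ with mean vector $\boldsymbol{\mu}' = [\mu_1, \mu_2, \mu_3, \mu_4, \mu_5]$. Partition $\mathbf{X}$ into

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix}$$

where

$$\mathbf{X}^{(1)} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \quad \text{and} \quad \mathbf{X}^{(2)} = \begin{bmatrix} X_3 \\ X_4 \\ X_5 \end{bmatrix}$$

Let $\boldsymbol{\Sigma}$ be the covariance matrix of $\mathbf{X}$ with general element $\sigma_{ik}$. Partition $\boldsymbol{\Sigma}$ into the covariance matrices of $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ and the covariance matrix of an element of $\mathbf{X}^{(1)}$ and an element of $\mathbf{X}^{(2)}$.

2.30. You are given the random vector $\mathbf{X}' = [X_1, X_2, X_3, X_4]$ with mean vector $\boldsymbol{\mu}_{\mathbf{X}}' = [4, 3, 2, 1]$ and variance-covariance matrix $\boldsymbol{\Sigma}_{\mathbf{X}}$ [matrix not legible in the source]. Partition $\mathbf{X}$ as

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix}$$

Let $\mathbf{A} = [1 \;\; 2]$ and $\mathbf{B}$ [matrix not legible in the source], and consider the linear combinations $\mathbf{A}\mathbf{X}^{(1)}$ and $\mathbf{B}\mathbf{X}^{(2)}$. Find
(a) $E(\mathbf{X}^{(1)})$
(b) $E(\mathbf{A}\mathbf{X}^{(1)})$
(c) $\operatorname{Cov}(\mathbf{X}^{(1)})$
(d) $\operatorname{Cov}(\mathbf{A}\mathbf{X}^{(1)})$
(e) $E(\mathbf{X}^{(2)})$
(f) $E(\mathbf{B}\mathbf{X}^{(2)})$
(g) $\operatorname{Cov}(\mathbf{X}^{(2)})$
(h) $\operatorname{Cov}(\mathbf{B}\mathbf{X}^{(2)})$
(i) $\operatorname{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})$
(j) $\operatorname{Cov}(\mathbf{A}\mathbf{X}^{(1)}, \mathbf{B}\mathbf{X}^{(2)})$

2.31. Repeat Exercise 2.30, but with $\mathbf{A}$ and $\mathbf{B}$ replaced by the given matrices [matrices not legible in the source].

2.32. You are given the random vector $\mathbf{X}' = [X_1, X_2, \ldots, X_5]$ with mean vector $\boldsymbol{\mu}_{\mathbf{X}}'$ [entries not legible in the source] and variance-covariance matrix $\boldsymbol{\Sigma}_{\mathbf{X}}$ [matrix not legible in the source]. Partition $\mathbf{X}$ as

$$\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \\ X_5 \end{bmatrix} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix}$$

Let $\mathbf{A}$ and $\mathbf{B}$ be the given matrices [matrices not legible in the source], and consider the linear combinations $\mathbf{A}\mathbf{X}^{(1)}$ and $\mathbf{B}\mathbf{X}^{(2)}$. Find
(a) $E(\mathbf{X}^{(1)})$
(b) $E(\mathbf{A}\mathbf{X}^{(1)})$
(c) $\operatorname{Cov}(\mathbf{X}^{(1)})$
(d) $\operatorname{Cov}(\mathbf{A}\mathbf{X}^{(1)})$
(e) $E(\mathbf{X}^{(2)})$
(f) $E(\mathbf{B}\mathbf{X}^{(2)})$
(g) $\operatorname{Cov}(\mathbf{X}^{(2)})$
(h) $\operatorname{Cov}(\mathbf{B}\mathbf{X}^{(2)})$
(i) $\operatorname{Cov}(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})$
(j) $\operatorname{Cov}(\mathbf{A}\mathbf{X}^{(1)}, \mathbf{B}\mathbf{X}^{(2)})$

2.33. Repeat Exercise 2.32, but with $\mathbf{X}$ partitioned differently [partition not legible in the source] and with $\mathbf{A}$ and $\mathbf{B}$ replaced by the given matrices [matrices not legible in the source].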
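2.34. Consider the vectors $\mathbf{b}' = [2, -1, 4, 0]$ and $\mathbf{d}' = [-1, 3, 2, 1]$. Verify the Cauchy-Schwarz inequality $(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{b})(\mathbf{d}'\mathbf{d})$.

2.35. Using the vectors $\mathbf{b}' = [-4, 3]$ and $\mathbf{d}' = [1, 1]$, verify the extended Cauchy-Schwarz inequality $(\mathbf{b}'\mathbf{d})^2 \le (\mathbf{b}'\mathbf{B}\mathbf{b})(\mathbf{d}'\mathbf{B}^{-1}\mathbf{d})$ for the given matrix $\mathbf{B}$ [matrix not legible in the source].

2.36. Find the maximum and minimum values of the quadratic form $4x_1^2 + 4x_2^2 + 6x_1x_2$ for all points $\mathbf{x}' = [x_1, x_2]$ such that $\mathbf{x}'\mathbf{x} = 1$.

2.37. With $\mathbf{A}$ as given in Exercise 2.6, find the maximum value of $\mathbf{x}'\mathbf{A}\mathbf{x}$ for $\mathbf{x}'\mathbf{x} = 1$.

2.38. Find the maximum and minimum values of the ratio $\mathbf{x}'\mathbf{A}\mathbf{x}/\mathbf{x}'\mathbf{x}$ for any nonzero vectors $\mathbf{x}' = [x_1, x_2, x_3]$ for the given matrix $\mathbf{A}$ [matrix not legible in the source].

2.39. Show that $\underset{(r \times s)}{\mathbf{A}}\ \underset{(s \times t)}{\mathbf{B}}\ \underset{(t \times v)}{\mathbf{C}}$ has $(i, j)$th entry $\sum_{\ell=1}^{s}\sum_{k=1}^{t} a_{i\ell}b_{\ell k}c_{kj}$.
Hint: $\mathbf{B}\mathbf{C}$ has $(\ell, j)$th entry $\sum_{k=1}^{t} b_{\ell k}c_{kj} = d_{\ell j}$. So $\mathbf{A}(\mathbf{B}\mathbf{C})$ has $(i, j)$th element

$$a_{i1}d_{1j} + a_{i2}d_{2j} + \cdots + a_{is}d_{sj} = \sum_{\ell=1}^{s} a_{i\ell}\left(\sum_{k=1}^{t} b_{\ell k}c_{kj}\right) = \sum_{\ell=1}^{s}\sum_{k=1}^{t} a_{i\ell}b_{\ell k}c_{kj}$$

2.40. Verify (2-24): $E(\mathbf{X} + \mathbf{Y}) = E(\mathbf{X}) + E(\mathbf{Y})$ and $E(\mathbf{A}\mathbf{X}\mathbf{B}) = \mathbf{A}E(\mathbf{X})\mathbf{B}$.
Hint: $\mathbf{X} + \mathbf{Y}$ has $X_{ij} + Y_{ij}$ as its $(i, j)$th element. Now, $E(X_{ij} + Y_{ij}) = E(X_{ij}) + E(Y_{ij})$ by a univariate property of expectation, and this last quantity is the $(i, j)$th element of $E(\mathbf{X}) + E(\mathbf{Y})$. Next (see Exercise 2.39), $\mathbf{A}\mathbf{X}\mathbf{B}$ has $(i, j)$th entry $\sum_{\ell}\sum_{k} a_{i\ell}X_{\ell k}b_{kj}$, and by the additive property of expectation,

$$E\left(\sum_{\ell}\sum_{k} a_{i\ell}X_{\ell k}b_{kj}\right) = \sum_{\ell}\sum_{k} a_{i\ell}E(X_{\ell k})b_{kj}$$

which is the $(i, j)$th element of $\mathbf{A}E(\mathbf{X})\mathbf{B}$.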
REFERENCES

1. Bellman, R. Introduction to Matrix Analysis (2d ed.). New York: McGraw-Hill, 1970.
2. Bhattacharyya, G. K., and R. A. Johnson. Statistical Concepts and Methods. New York: John Wiley, 1977.
3. Eckart, C., and G. Young. "The Approximation of One Matrix by Another of Lower Rank." Psychometrika, 1 (1936), 211-218.
4. Graybill, F. A. Introduction to Matrices with Applications in Statistics. Belmont, CA: Wadsworth, 1969.
5. Halmos, P. R. Finite-Dimensional Vector Spaces (2d ed.). Princeton, NJ: D. Van Nostrand, 1958.
6. Noble, B., and J. W. Daniel. Applied Linear Algebra (3d ed.). Englewood Cliffs, NJ: Prentice Hall, 1988.
CHAPTER 3

Sample Geometry and Random Sampling

3.1 INTRODUCTION

With the vector concepts introduced in the previous chapter, we can now delve deeper into the geometrical interpretations of the descriptive statistics $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and $\mathbf{R}$; we do so in Section 3.2. Many of our explanations use the representation of the columns of $\mathbf{X}$ as $p$ vectors in $n$ dimensions. In Section 3.3 we introduce the assumption that the observations constitute a random sample. Simply stated, random sampling implies that (1) measurements taken on different items (or trials) are unrelated to one another and (2) the joint distribution of all $p$ variables remains the same for all items. Ultimately, it is this structure of the random sample that justifies a particular choice of distance and dictates the geometry for the $n$-dimensional representation of the data. Furthermore, when data can be treated as a random sample, statistical inferences are based on a solid foundation.

Returning to geometric interpretations in Section 3.4, we introduce a single number, called generalized variance, to describe variability. This generalization of variance is an integral part of the comparison of multivariate means. In later sections we use matrix algebra to provide concise expressions for the matrix products and sums that allow us to calculate $\bar{\mathbf{x}}$ and $\mathbf{S}_n$ directly from the data matrix $\mathbf{X}$. The connection between $\bar{\mathbf{x}}$, $\mathbf{S}_n$, and the means and covariances for linear combinations of variables is also clearly delineated, using the notion of matrix products.
3.2 THE GEOMETRY OF THE SAMPLE

A single multivariate observation is the collection of measurements on $p$ different variables taken on the same item or trial. As in Chapter 1, if $n$ observations have been obtained, the entire data set can be placed in an $n \times p$ array (matrix):

$$\underset{(n \times p)}{\mathbf{X}} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

Each row of $\mathbf{X}$ represents a multivariate observation. Since the entire set of measurements is often one particular realization of what might have been observed, we say that the data are a sample of size $n$ from a $p$-variate "population." The sample then consists of $n$ measurements, each of which has $p$ components.

As we have seen, the data can be plotted in two different ways. For the $p$-dimensional scatter plot, the rows of $\mathbf{X}$ represent $n$ points in $p$-dimensional space. We can write

$$\underset{(n \times p)}{\mathbf{X}} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix} \begin{matrix} \leftarrow \text{1st (multivariate) observation} \\ \\ \\ \leftarrow n\text{th (multivariate) observation} \end{matrix} \tag{3-1}$$
The row vector $\mathbf{x}_j'$, representing the $j$th observation, contains the coordinates of a point.

The scatter plot of $n$ points in $p$-dimensional space provides information on the locations and variability of the points. If the points are regarded as solid spheres, the sample mean vector $\bar{\mathbf{x}}$, given by (1-8), is the center of balance. Variability occurs in more than one direction, and it is quantified by the sample variance-covariance matrix $\mathbf{S}_n$. A single numerical measure of variability is provided by the determinant of the sample variance-covariance matrix. When $p$ is greater than 3, this scatter plot representation cannot actually be graphed. Yet the consideration of the data as $n$ points in $p$ dimensions provides insights that are not readily available from algebraic expressions. Moreover, the concepts illustrated for $p = 2$ or $p = 3$ remain valid for the other cases.

Example 3.1 (Computing the mean vector)

Compute the mean vector $\bar{\mathbf{x}}$ from the data matrix

$$\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}$$

Plot the $n = 3$ data points in $p = 2$ space, and locate $\bar{\mathbf{x}}$ on the resulting diagram.

The first point, $\mathbf{x}_1$, has coordinates $\mathbf{x}_1' = [4, 1]$. Similarly, the remaining two points are $\mathbf{x}_2' = [-1, 3]$ and $\mathbf{x}_3' = [3, 5]$. Finally,

$$\bar{\mathbf{x}} = \begin{bmatrix} \dfrac{4 - 1 + 3}{3} \\[2mm] \dfrac{1 + 3 + 5}{3} \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$$

Figure 3.1 A plot of the data matrix X as n = 3 points in p = 2 space.

Figure 3.1 shows that $\bar{\mathbf{x}}$ is the balance point (center of gravity) of the scatter plot. •

The alternative geometrical representation is constructed by considering the data as $p$ vectors in $n$-dimensional space. Here we take the elements of the columns of the data matrix to be the coordinates of the vectors. Let
$$\underset{(n \times p)}{\mathbf{X}} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} = [\mathbf{y}_1 \mid \mathbf{y}_2 \mid \cdots \mid \mathbf{y}_p] \tag{3-2}$$

Then the coordinates of the first point $\mathbf{y}_1' = [x_{11}, x_{21}, \ldots, x_{n1}]$ are the $n$ measurements on the first variable. In general, the $i$th point $\mathbf{y}_i' = [x_{1i}, x_{2i}, \ldots, x_{ni}]$ is determined by the $n$-tuple of all measurements on the $i$th variable. In this geometrical representation, we depict $\mathbf{y}_1, \ldots, \mathbf{y}_p$ as vectors rather than points, as in the $p$-dimensional scatter plot. We shall be manipulating these quantities shortly using the algebra of vectors discussed in Chapter 2.

Example 3.2 (Data as p vectors in n dimensions)

Plot the following data as $p = 2$ vectors in $n = 3$ space:

$$\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}$$

Here $\mathbf{y}_1' = [4, -1, 3]$ and $\mathbf{y}_2' = [1, 3, 5]$. These vectors are shown in Figure 3.2. •

Figure 3.2 A plot of the data matrix X as p = 2 vectors in n = 3 space.

Many of the algebraic expressions we shall encounter in multivariate analysis can be related to the geometrical notions of length, angle, and volume. This is important because geometrical representations ordinarily facilitate understanding and lead to further insights.
Unfortunately, we are limited to visualizing objects in three dimensions, and consequently, the $n$-dimensional representation of the data matrix $\mathbf{X}$ may not seem like a particularly useful device for $n > 3$. It turns out, however, that geometrical relationships and the associated statistical concepts depicted for any three vectors remain valid regardless of their dimension. This follows because three vectors, even if $n$ dimensional, can span no more than a three-dimensional space, just as two vectors with any number of components must lie in a plane. By selecting an appropriate three-dimensional perspective, that is, a portion of the $n$-dimensional space containing the three vectors of interest, a view is obtained that preserves both lengths and angles. Thus, it is possible, with the right choice of axes, to illustrate certain algebraic statistical concepts in terms of only two or three vectors of any dimension $n$. Since the specific choice of axes is not relevant to the geometry, we shall always label the coordinate axes 1, 2, and 3.

It is possible to give a geometrical interpretation of the process of finding a sample mean. We start by defining the $n \times 1$ vector $\mathbf{1}_n' = [1, 1, \ldots, 1]$. (To simplify the notation, the subscript $n$ will be dropped when the dimension of the vector $\mathbf{1}_n$ is clear from the context.) The vector $\mathbf{1}$ forms equal angles with each of the $n$ coordinate axes, so the vector $(1/\sqrt{n})\mathbf{1}$ has unit length in the equal-angle direction. Consider the vector $\mathbf{y}_i' = [x_{1i}, x_{2i}, \ldots, x_{ni}]$. The projection of $\mathbf{y}_i$ on the unit vector $(1/\sqrt{n})\mathbf{1}$ is, by (2-8),

$$\mathbf{y}_i'\left(\frac{1}{\sqrt{n}}\mathbf{1}\right)\frac{1}{\sqrt{n}}\mathbf{1} = \frac{x_{1i} + x_{2i} + \cdots + x_{ni}}{n}\,\mathbf{1} = \bar{x}_i\mathbf{1} \tag{3-3}$$

That is, the sample mean $\bar{x}_i = (x_{1i} + x_{2i} + \cdots + x_{ni})/n = \mathbf{y}_i'\mathbf{1}/n$ corresponds to the multiple of $\mathbf{1}$ required to give the projection of $\mathbf{y}_i$ onto the line determined by $\mathbf{1}$.

Further, for each $\mathbf{y}_i$, we have the decomposition

[Diagram: $\mathbf{y}_i$ resolved into the component $\bar{x}_i\mathbf{1}$ along $\mathbf{1}$ and the component $\mathbf{y}_i - \bar{x}_i\mathbf{1}$]

where $\bar{x}_i\mathbf{1}$ is perpendicular to $\mathbf{y}_i - \bar{x}_i\mathbf{1}$. The deviation, or mean corrected, vector is

$$\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1} = \begin{bmatrix} x_{1i} - \bar{x}_i \\ x_{2i} - \bar{x}_i \\ \vdots \\ x_{ni} - \bar{x}_i \end{bmatrix} \tag{3-4}$$

Figure 3.3 The decomposition of $\mathbf{y}_i$ into a mean component $\bar{x}_i\mathbf{1}$ and a deviation component $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$, $i = 1, 2, 3$.
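The elements of $\mathbf{d}_i$ are the deviations of the measurements on the $i$th variable from their sample mean. Decomposition of the $\mathbf{y}_i$ vectors into mean components and deviation from the mean components is shown in Figure 3.3 for $p = 3$ and $n = 3$.

Example 3.3 (Decomposing a vector into its mean and deviation components)

Let us carry out the decomposition of $\mathbf{y}_i$ into $\bar{x}_i\mathbf{1}$ and $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$, $i = 1, 2$, for the data given in Example 3.2:

$$\mathbf{X} = \begin{bmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{bmatrix}$$

Here $\bar{x}_1 = (4 - 1 + 3)/3 = 2$ and $\bar{x}_2 = (1 + 3 + 5)/3 = 3$, so

$$\bar{x}_1\mathbf{1} = 2\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}, \qquad \bar{x}_2\mathbf{1} = 3\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix}$$

Consequently,

$$\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1} = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix} - \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix}$$

and

$$\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1} = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} - \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} = \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}$$

We note that $\bar{x}_1\mathbf{1}$ and $\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1}$ are perpendicular, because

$$(\bar{x}_1\mathbf{1})'(\mathbf{y}_1 - \bar{x}_1\mathbf{1}) = [2 \;\; 2 \;\; 2]\begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} = 4 - 6 + 2 = 0$$

A similar result holds for $\bar{x}_2\mathbf{1}$ and $\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1}$. The decomposition is

$$\mathbf{y}_1 = \begin{bmatrix} 4 \\ -1 \\ 3 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} + \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix}, \qquad \mathbf{y}_2 = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \\ 3 \end{bmatrix} + \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}$$
•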
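The decomposition in Example 3.3 can be reproduced in a few lines. This is a minimal sketch in plain Python using the $3 \times 2$ data matrix of Examples 3.1 to 3.3; the variable names are illustrative. It forms the columns $\mathbf{y}_i$, the sample means $\bar{x}_i = \mathbf{y}_i'\mathbf{1}/n$, the mean components $\bar{x}_i\mathbf{1}$, and the deviation vectors $\mathbf{d}_i$, and then evaluates the inner products $(\bar{x}_i\mathbf{1})'\mathbf{d}_i$ that should vanish.

```python
# Data matrix from Examples 3.1-3.3: n = 3 rows (items), p = 2 columns (variables).
X = [[4, 1],
     [-1, 3],
     [3, 5]]
n = len(X)
p = len(X[0])

# Column vectors y_1, ..., y_p, each with n components.
columns = [[X[j][i] for j in range(n)] for i in range(p)]

# Sample means: xbar_i = y_i' 1 / n.
means = [sum(y) / n for y in columns]

# Mean components xbar_i * 1 and deviation vectors d_i = y_i - xbar_i * 1.
mean_components = [[m] * n for m in means]
deviations = [[y_j - m for y_j in y] for y, m in zip(columns, means)]

# Inner products (xbar_i 1)' d_i; perpendicularity means each one is zero.
inner = [sum(a * b for a, b in zip(mc, d))
         for mc, d in zip(mean_components, deviations)]
print(means, deviations, inner)
```

The same loop works for any $n \times p$ list of rows, since nothing in it depends on $n = 3$ or $p = 2$.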
For the time being, we are interested in the deviation (or residual) vectors $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$. A plot of the deviation vectors of Figure 3.3 is given in Figure 3.4.

Figure 3.4 The deviation vectors $\mathbf{d}_i$ from Figure 3.3.

We have translated the deviation vectors to the origin without changing their lengths or orientations. Now consider the squared lengths of the deviation vectors. Using (2-5) and (3-4), we obtain

$$L_{\mathbf{d}_i}^2 = \mathbf{d}_i'\mathbf{d}_i = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2 \tag{3-5}$$

(Length of deviation vector)$^2$ = sum of squared deviations
From (1-3), we see that the squared length is proportional to the variance of the measurements on the $i$th variable. Equivalently, the length is proportional to the standard deviation. Longer vectors represent more variability than shorter vectors.

For any two deviation vectors $\mathbf{d}_i$ and $\mathbf{d}_k$,

$$\mathbf{d}_i'\mathbf{d}_k = \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \tag{3-6}$$

Let $\theta_{ik}$ denote the angle formed by the vectors $\mathbf{d}_i$ and $\mathbf{d}_k$. From (2-6), we get

$$\mathbf{d}_i'\mathbf{d}_k = L_{\mathbf{d}_i}L_{\mathbf{d}_k}\cos(\theta_{ik})$$

or, using (3-5) and (3-6), we obtain

$$\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) = \sqrt{\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)^2}\ \sqrt{\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2}\ \cos(\theta_{ik})$$

so that [see (1-5)]

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \cos(\theta_{ik}) \tag{3-7}$$

The cosine of the angle is the sample correlation coefficient. Thus, if the two deviation vectors have nearly the same orientation, the sample correlation will be close to 1. If the two vectors are nearly perpendicular, the sample correlation will be approximately zero. If the two vectors are oriented in nearly opposite directions, the sample correlation will be close to $-1$.

Example 3.4
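(Calculating $\mathbf{S}_n$ and $\mathbf{R}$ from deviation vectors)

Given the deviation vectors in Example 3.3, let us compute the sample variance-covariance matrix $\mathbf{S}_n$ and sample correlation matrix $\mathbf{R}$ using the geometrical concepts just introduced. From Example 3.3,

$$\mathbf{d}_1 = \begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} \quad \text{and} \quad \mathbf{d}_2 = \begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix}$$

These vectors, translated to the origin, are shown in Figure 3.5.

Figure 3.5 The deviation vectors $\mathbf{d}_1$ and $\mathbf{d}_2$.

Now,

$$\mathbf{d}_1'\mathbf{d}_1 = [2 \;\; {-3} \;\; 1]\begin{bmatrix} 2 \\ -3 \\ 1 \end{bmatrix} = 14 = 3s_{11}$$

or $s_{11} = \tfrac{14}{3}$. Also,

$$\mathbf{d}_2'\mathbf{d}_2 = [-2 \;\; 0 \;\; 2]\begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix} = 8 = 3s_{22}$$

or $s_{22} = \tfrac{8}{3}$. Finally,

$$\mathbf{d}_1'\mathbf{d}_2 = [2 \;\; {-3} \;\; 1]\begin{bmatrix} -2 \\ 0 \\ 2 \end{bmatrix} = -2 = 3s_{12}$$

or $s_{12} = -\tfrac{2}{3}$. Consequently,

$$r_{12} = \frac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} = \frac{-\tfrac{2}{3}}{\sqrt{\tfrac{14}{3}}\sqrt{\tfrac{8}{3}}} = -.189$$

and

$$\mathbf{S}_n = \begin{bmatrix} \tfrac{14}{3} & -\tfrac{2}{3} \\ -\tfrac{2}{3} & \tfrac{8}{3} \end{bmatrix}, \qquad \mathbf{R} = \begin{bmatrix} 1 & -.189 \\ -.189 & 1 \end{bmatrix}$$
•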
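The arithmetic of Example 3.4 can be checked directly from the deviation vectors. The sketch below, in plain Python, assumes the vectors $\mathbf{d}_1' = [2, -3, 1]$ and $\mathbf{d}_2' = [-2, 0, 2]$ from Example 3.3 and the divisor $n$ used for $\mathbf{S}_n$; the helper `dot` is illustrative. It computes $s_{11}$, $s_{22}$, $s_{12}$ from inner products, and computes the correlation two ways: as $s_{12}/(\sqrt{s_{11}}\sqrt{s_{22}})$ and as the cosine of the angle between $\mathbf{d}_1$ and $\mathbf{d}_2$, as in (3-7).

```python
import math

# Deviation vectors from Example 3.3.
d1 = [2.0, -3.0, 1.0]
d2 = [-2.0, 0.0, 2.0]
n = len(d1)


def dot(a, b):
    """Inner product a'b of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Entries of S_n: squared lengths and the inner product, each divided by n.
s11 = dot(d1, d1) / n
s22 = dot(d2, d2) / n
s12 = dot(d1, d2) / n

# Correlation from the covariance entries, and as the cosine of the angle.
r12 = s12 / (math.sqrt(s11) * math.sqrt(s22))
cos_theta = dot(d1, d2) / (math.sqrt(dot(d1, d1)) * math.sqrt(dot(d2, d2)))

S_n = [[s11, s12], [s12, s22]]
R = [[1.0, r12], [r12, 1.0]]
print(S_n, R, cos_theta)
```

The two routes to the correlation agree because the factor $1/n$ cancels between numerator and denominator.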
The concepts of length, angle, and projection have provided us with a geometrical interpretation of the sample. We summarize as follows:

$^1$ The square of the length and the inner product are $(n - 1)s_{ii}$ and $(n - 1)s_{ik}$, respectively, when the divisor $n - 1$ is used in the definitions of the sample variance and covariance.
3.3 RANDOM SAMPLES AND THE EXPECTED VALUES OF THE SAMPLE MEAN AND COVARIANCE MATRIX
In order to study the sampling variability of statistics like $\bar{\mathbf{x}}$ and $\mathbf{S}_n$ with the ultimate aim of making inferences, we need to make assumptions about the variables whose observed values constitute the data set $\mathbf{X}$.

Suppose, then, that the data have not yet been observed, but we intend to collect $n$ sets of measurements on $p$ variables. Before the measurements are made, their values cannot, in general, be predicted exactly. Consequently, we treat them as random variables. In this context, let the $(j, k)$th entry in the data matrix be the random variable $X_{jk}$. Each set of measurements $\mathbf{X}_j$ on $p$ variables is a random vector, and we have the random matrix

$$\underset{(n \times p)}{\mathbf{X}} = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1' \\ \mathbf{X}_2' \\ \vdots \\ \mathbf{X}_n' \end{bmatrix} \tag{3-8}$$

A random sample can now be defined. If the row vectors $\mathbf{X}_1', \mathbf{X}_2', \ldots, \mathbf{X}_n'$ in (3-8) represent independent observations from a common joint distribution with density function $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_p)$, then $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ are said to form a random sample from $f(\mathbf{x})$. Mathematically, $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ form a random sample if their joint density function is given by the product $f(\mathbf{x}_1)f(\mathbf{x}_2) \cdots f(\mathbf{x}_n)$, where $f(\mathbf{x}_j) = f(x_{j1}, x_{j2}, \ldots, x_{jp})$ is the density function for the $j$th row vector.

Two points connected with the definition of random sample merit special attention:

1. The measurements of the $p$ variables in a single trial, such as $\mathbf{X}_j' = [X_{j1}, X_{j2}, \ldots, X_{jp}]$, will usually be correlated. Indeed, we expect this to be the case. The measurements from different trials must, however, be independent.
2. The independence of measurements from trial to trial may not hold when the variables are likely to drift over time, as with sets of $p$ stock prices or $p$ economic indicators. Violations of the tentative assumption of independence can have a serious impact on the quality of statistical inferences.

The following examples illustrate these remarks.

Example 3.5
(Selecting a random sample)

As a preliminary step in designing a permit system for utilizing a wilderness canoe area without overcrowding, a natural-resource manager took a survey of users. The total wilderness area was divided into subregions, and respondents were asked to give information on the regions visited, lengths of stay, and other variables. The method followed was to select persons randomly (perhaps using a random number table) from all those who entered the wilderness area during a particular week. All persons were equally likely to be in the sample, so the more popular entrances were represented by larger proportions of canoeists.

Here one would expect the sample observations to conform closely to the criterion for a random sample from the population of users or potential users. On the other hand, if one of the samplers had waited at a campsite far in the interior of the area and interviewed only canoeists who reached that spot, successive measurements would not be independent. For instance, lengths of stay would tend to be large. •

Example 3.6 (A nonrandom sample)

Because of concerns with future solid-waste disposal, a study was conducted of the gross weight of solid waste generated per year in the United States ("Characteristics of Municipal Solid Wastes in the United States, 1960-2000," Franklin Associates, Ltd.). Estimated amounts attributed to $x_1$ = paper and paperboard waste and $x_2$ = plastic waste, in millions of tons, are given for selected years in Table 3.1. Should these measurements on $\mathbf{X}' = [X_1, X_2]$ be treated as a random sample of size $n = 6$? No! In fact, both variables are increasing over time. A drift like this would be very rare if the year-to-year values were independent observations from the same distribution.

TABLE 3.1 SOLID WASTE

Year             1960   1965   1970   1975   1980   1985
x1 (paper)       29.8   37.9   43.9   42.6   53.9   61.7
x2 (plastics)      .4    1.4    3.0    4.4    7.6    9.8
•
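The drift described in Example 3.6 can be made concrete with a small calculation. The sketch below is only an informal diagnostic, not a procedure from the text: it assumes the Table 3.1 values as transcribed here and computes the ordinary sample correlation between calendar year and each waste series with an illustrative helper `correlation`. Values near 1 reflect a strong time trend, which is the feature that makes independent, identically distributed sampling implausible for these data.

```python
import math

# Table 3.1 values (millions of tons), as transcribed from the example.
years = [1960, 1965, 1970, 1975, 1980, 1985]
paper = [29.8, 37.9, 43.9, 42.6, 53.9, 61.7]
plastics = [0.4, 1.4, 3.0, 4.4, 7.6, 9.8]


def correlation(a, b):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    s_ab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    s_aa = sum((x - a_bar) ** 2 for x in a)
    s_bb = sum((y - b_bar) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Correlation of each series with time; a strong trend gives values near 1.
r_paper = correlation(years, paper)
r_plastics = correlation(years, plastics)
print(round(r_paper, 2), round(r_plastics, 2))
```

A high correlation with time does not by itself prove dependence, but it is the kind of pattern that should prompt a closer look before treating the rows as a random sample.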
As we have argued heuristically in Chapter 1, the notion of statistical independence has important implications for measuring distance. Euclidean distance appears appropriate if the components of a vector are independent and have the same variances. Suppose we consider the location of the $k$th column $\mathbf{Y}_k' = [X_{1k}, X_{2k}, \ldots, X_{nk}]$ of $\mathbf{X}$, regarded as a point in $n$ dimensions. The location of this point is determined by the joint probability distribution $f(\mathbf{y}_k) = f(x_{1k}, x_{2k}, \ldots, x_{nk})$. When the measurements $X_{1k}, X_{2k}, \ldots, X_{nk}$ are a random sample, $f(\mathbf{y}_k) = f(x_{1k}, x_{2k}, \ldots, x_{nk}) = f_k(x_{1k})f_k(x_{2k}) \cdots f_k(x_{nk})$ and, consequently, each coordinate $x_{jk}$ contributes equally to the location through the identical marginal distributions $f_k(x_{jk})$.

If the $n$ components are not independent or the marginal distributions are not identical, the influence of individual measurements (coordinates) on location is asymmetrical. We would then be led to consider a distance function in which the coordinates were weighted unequally, as in the "statistical" distances or quadratic forms introduced in Chapters 1 and 2.

Certain conclusions can be reached concerning the sampling distributions of $\bar{\mathbf{X}}$ and $\mathbf{S}_n$ without making further assumptions regarding the form of the underlying joint distribution of the variables. In particular, we can see how $\bar{\mathbf{X}}$ and $\mathbf{S}_n$ fare as point estimators of the corresponding population mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.
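Result 3.1. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a joint distribution that has mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Then $\bar{\mathbf{X}}$ is an unbiased estimator of $\boldsymbol{\mu}$, and its covariance matrix is $\frac{1}{n}\boldsymbol{\Sigma}$. That is,

$$\begin{aligned}
E(\bar{\mathbf{X}}) &= \boldsymbol{\mu} && \text{(population mean vector)} \\
\operatorname{Cov}(\bar{\mathbf{X}}) &= \frac{1}{n}\boldsymbol{\Sigma} && \text{(population variance-covariance matrix divided by sample size)}
\end{aligned} \tag{3-9}$$

For the covariance matrix $\mathbf{S}_n$,

$$E(\mathbf{S}_n) = \frac{n - 1}{n}\boldsymbol{\Sigma} = \boldsymbol{\Sigma} - \frac{1}{n}\boldsymbol{\Sigma}$$

Thus,

$$E\left(\frac{n}{n - 1}\mathbf{S}_n\right) = \boldsymbol{\Sigma} \tag{3-10}$$

so $[n/(n - 1)]\mathbf{S}_n$ is an unbiased estimator of $\boldsymbol{\Sigma}$, while $\mathbf{S}_n$ is a biased estimator with (bias) $= E(\mathbf{S}_n) - \boldsymbol{\Sigma} = -(1/n)\boldsymbol{\Sigma}$.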
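Proof. Now, $\bar{\mathbf{X}} = (\mathbf{X}_1 + \mathbf{X}_2 + \cdots + \mathbf{X}_n)/n$. The repeated use of the properties of expectation in (2-24) for two vectors gives

$$\begin{aligned}
E(\bar{\mathbf{X}}) &= E\left(\frac{1}{n}\mathbf{X}_1 + \frac{1}{n}\mathbf{X}_2 + \cdots + \frac{1}{n}\mathbf{X}_n\right) \\
&= E\left(\frac{1}{n}\mathbf{X}_1\right) + E\left(\frac{1}{n}\mathbf{X}_2\right) + \cdots + E\left(\frac{1}{n}\mathbf{X}_n\right) \\
&= \frac{1}{n}E(\mathbf{X}_1) + \frac{1}{n}E(\mathbf{X}_2) + \cdots + \frac{1}{n}E(\mathbf{X}_n) \\
&= \frac{1}{n}\boldsymbol{\mu} + \frac{1}{n}\boldsymbol{\mu} + \cdots + \frac{1}{n}\boldsymbol{\mu} = \boldsymbol{\mu}
\end{aligned}$$

Next,

$$(\bar{\mathbf{X}} - \boldsymbol{\mu})(\bar{\mathbf{X}} - \boldsymbol{\mu})' = \left(\frac{1}{n}\sum_{j=1}^{n} (\mathbf{X}_j - \boldsymbol{\mu})\right)\left(\frac{1}{n}\sum_{\ell=1}^{n} (\mathbf{X}_\ell - \boldsymbol{\mu})\right)' = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{\ell=1}^{n} (\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_\ell - \boldsymbol{\mu})'$$

so

$$\operatorname{Cov}(\bar{\mathbf{X}}) = E(\bar{\mathbf{X}} - \boldsymbol{\mu})(\bar{\mathbf{X}} - \boldsymbol{\mu})' = \frac{1}{n^2}\left(\sum_{j=1}^{n}\sum_{\ell=1}^{n} E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_\ell - \boldsymbol{\mu})'\right)$$

For $j \ne \ell$, each entry in $E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_\ell - \boldsymbol{\mu})'$ is zero because the entry is the covariance between a component of $\mathbf{X}_j$ and a component of $\mathbf{X}_\ell$, and these are independent. [See Exercise 3.17 and (2-29).] Therefore,

$$\operatorname{Cov}(\bar{\mathbf{X}}) = \frac{1}{n^2}\left(\sum_{j=1}^{n} E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_j - \boldsymbol{\mu})'\right)$$

Since $\boldsymbol{\Sigma} = E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_j - \boldsymbol{\mu})'$ is the common population covariance matrix for each $\mathbf{X}_j$, we have

$$\operatorname{Cov}(\bar{\mathbf{X}}) = \frac{1}{n^2}\left(\sum_{j=1}^{n} E(\mathbf{X}_j - \boldsymbol{\mu})(\mathbf{X}_j - \boldsymbol{\mu})'\right) = \frac{1}{n^2}\underbrace{(\boldsymbol{\Sigma} + \boldsymbol{\Sigma} + \cdots + \boldsymbol{\Sigma})}_{n \text{ terms}} = \frac{1}{n^2}(n\boldsymbol{\Sigma}) = \left(\frac{1}{n}\right)\boldsymbol{\Sigma}$$

To obtain the expected value of $\mathbf{S}_n$, we first note that $(X_{ji} - \bar{X}_i)(X_{jk} - \bar{X}_k)$ is the $(i, k)$th element of $(\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'$. The matrix representing sums of squares and cross products can then be written as

$$\begin{aligned}
\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})' &= \sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})\mathbf{X}_j' + \left(\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})\right)(-\bar{\mathbf{X}})' \\
&= \sum_{j=1}^{n} \mathbf{X}_j\mathbf{X}_j' - n\bar{\mathbf{X}}\bar{\mathbf{X}}'
\end{aligned}$$

since $\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}}) = \mathbf{0}$ and $n\bar{\mathbf{X}}' = \sum_{j=1}^{n} \mathbf{X}_j'$. Therefore, its expected value is

$$E\left(\sum_{j=1}^{n} \mathbf{X}_j\mathbf{X}_j' - n\bar{\mathbf{X}}\bar{\mathbf{X}}'\right) = \sum_{j=1}^{n} E(\mathbf{X}_j\mathbf{X}_j') - nE(\bar{\mathbf{X}}\bar{\mathbf{X}}')$$

For any random vector $\mathbf{V}$ with $E(\mathbf{V}) = \boldsymbol{\mu}_V$ and $\operatorname{Cov}(\mathbf{V}) = \boldsymbol{\Sigma}_V$, we have $E(\mathbf{V}\mathbf{V}') = \boldsymbol{\Sigma}_V + \boldsymbol{\mu}_V\boldsymbol{\mu}_V'$. (See Exercise 3.16.) Consequently,

$$E(\mathbf{X}_j\mathbf{X}_j') = \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}' \quad \text{and} \quad E(\bar{\mathbf{X}}\bar{\mathbf{X}}') = \frac{1}{n}\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'$$

Using these results, we obtain

$$\sum_{j=1}^{n} E(\mathbf{X}_j\mathbf{X}_j') - nE(\bar{\mathbf{X}}\bar{\mathbf{X}}') = n\boldsymbol{\Sigma} + n\boldsymbol{\mu}\boldsymbol{\mu}' - n\left(\frac{1}{n}\boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}'\right) = (n - 1)\boldsymbol{\Sigma}$$

and thus, since $\mathbf{S}_n = (1/n)\left(\sum_{j=1}^{n} \mathbf{X}_j\mathbf{X}_j' - n\bar{\mathbf{X}}\bar{\mathbf{X}}'\right)$, it follows immediately that

$$E(\mathbf{S}_n) = \frac{n - 1}{n}\boldsymbol{\Sigma}$$
•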
Chapter 3
Sample Geometry and Random Sa m p l i n g
n Result 3.1 shows that the (i, k)th entry, (n  1 ) 1 � ( Xji  Xi ) ( Xjk  Xk ) , of j =1 [ n/ ( n  1 ) ]S n is an unbiased estimator of oik . However, the individual sample standard deviations � , calculated with either n or n  1 as a divisor, are not unbiased estimators of the corresponding population quantities � . Moreover, the correla tion coefficients rik are not unbiased estimators of the population quantities Pik . However, the bias E ( �)  � ' or E ( rik )  Pik ' can usually be ignored if the sample size n is moderately large. Consideration of bias motivates a slightly modified definition of the sample variancecovariance matrix. Result 3.1 provides us with an unbiased estimator S of I:
n 1 Here S, without a subscript, has (i, k)th entry ( n  1 ) � ( Xji  Xi ) ( Xjk  Xk ) · j=l This definition of sample covariance is commonly used in many multivariate test statistics. Therefore, it will replace Sn as the sample covariance matrix in most of the ma terial throughout the rest of this book. 3.4
G E N ERALIZED VARIANCE
With a single variable, the sample variance is often used to describe the amount of variation in the measurements on that variable. When $p$ variables are observed on each unit, the variation is described by the sample variance-covariance matrix

$$\mathbf{S} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{12} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix} = \left\{ s_{ik} = \frac{1}{n - 1}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k) \right\}$$

The sample covariance matrix contains $p$ variances and $\tfrac{1}{2}p(p - 1)$ potentially different covariances. Sometimes it is desirable to assign a single numerical value for the variation expressed by $\mathbf{S}$. One choice for a value is the determinant of $\mathbf{S}$, which reduces to the usual sample variance of a single characteristic when $p = 1$. This determinant$^2$ is called the generalized sample variance:

$$\text{Generalized sample variance} = |\mathbf{S}| \tag{3-12}$$

$^2$ Definition 2A.24 defines "determinant" and indicates one method for calculating the value of a determinant.
Example 3.7 (Calculating a generalized variance)
Employees ($x_1$) and profits per employee ($x_2$) for the 16 largest publishing firms in the United States are shown in Figure 1.3. The sample covariance matrix, obtained from the data in the April 30, 1990, Forbes magazine article, is
$$\mathbf{S} = \begin{bmatrix} 252.04 & -68.43 \\ -68.43 & 123.67 \end{bmatrix}$$
Evaluate the generalized variance. In this case, we compute
$$|\mathbf{S}| = (252.04)(123.67) - (-68.43)(-68.43) = 26{,}487$$
•
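The arithmetic in Example 3.7 is easy to reproduce. The following minimal sketch simply copies the covariance matrix from the example; the helper name `det2` is an illustrative choice, not something defined in the text.

```python
# Sample covariance matrix from Example 3.7 (employees, profits per employee).
S = [[252.04, -68.43],
     [-68.43, 123.67]]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


generalized_variance = det2(S)
print(round(generalized_variance))  # 26487
```

Because the off-diagonal entry enters the determinant only through its square, the sign of the covariance has no effect on the generalized variance.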
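The generalized sample variance provides one way of writing the information on all variances and covariances as a single number. Of course, when $p > 1$, some information about the sample is lost in the process. A geometrical interpretation of $|\mathbf{S}|$ will help us appreciate its strengths and weaknesses as a descriptive summary.

Consider the area generated within the plane by two deviation vectors $\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1}$. Let $L_{\mathbf{d}_1}$ be the length of $\mathbf{d}_1$ and $L_{\mathbf{d}_2}$ the length of $\mathbf{d}_2$. By elementary geometry, we have the diagram

[Diagram: the region spanned by $\mathbf{d}_1$ and $\mathbf{d}_2$, with angle $\theta$ between them and height $= L_{\mathbf{d}_1}\sin(\theta)$]

and the area of the trapezoid is $|L_{\mathbf{d}_1}\sin(\theta)|\,L_{\mathbf{d}_2}$. Since $\cos^2(\theta) + \sin^2(\theta) = 1$, we can express this area as
$$\text{Area} = L_{\mathbf{d}_1}L_{\mathbf{d}_2}\sqrt{1 - \cos^2(\theta)}$$
From (3-5) and (3-7),
$$L_{\mathbf{d}_1} = \sqrt{\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2} = \sqrt{(n - 1)s_{11}}, \qquad L_{\mathbf{d}_2} = \sqrt{\sum_{j=1}^{n} (x_{j2} - \bar{x}_2)^2} = \sqrt{(n - 1)s_{22}}$$
and
$$\cos(\theta) = r_{12}$$
Therefore,
$$\text{Area} = (n - 1)\sqrt{s_{11}}\sqrt{s_{22}}\sqrt{1 - r_{12}^2} = (n - 1)\sqrt{s_{11}s_{22}(1 - r_{12}^2)} \qquad (3\text{-}13)$$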
Figure 3.6 (a) "Large" generalized sample variance for p = 3. (b) "Small" generalized sample variance for p = 3.
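Also,
$$|\mathbf{S}| = \begin{vmatrix} s_{11} & s_{12} \\ s_{12} & s_{22} \end{vmatrix} = \begin{vmatrix} s_{11} & \sqrt{s_{11}}\sqrt{s_{22}}\,r_{12} \\ \sqrt{s_{11}}\sqrt{s_{22}}\,r_{12} & s_{22} \end{vmatrix} = s_{11}s_{22} - s_{11}s_{22}r_{12}^2 = s_{11}s_{22}(1 - r_{12}^2) \qquad (3\text{-}14)$$
If we compare (3-14) with (3-13), we see that
$$|\mathbf{S}| = (\text{area})^2/(n - 1)^2$$
Assuming now that $|\mathbf{S}| = (n - 1)^{-(p-1)}(\text{volume})^2$ holds for the volume generated in $n$ space by the $p - 1$ deviation vectors $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_{p-1}$, we can establish the following general result for $p$ deviation vectors by induction (see [1], p. 260):
$$\text{Generalized sample variance} = |\mathbf{S}| = (n - 1)^{-p}(\text{volume})^2 \qquad (3\text{-}15)$$
Equation (3-15) says that the generalized sample variance, for a fixed set of data, is proportional to the square of the volume generated by the $p$ deviation vectors³ $\mathbf{d}_1 = \mathbf{y}_1 - \bar{x}_1\mathbf{1}$, $\mathbf{d}_2 = \mathbf{y}_2 - \bar{x}_2\mathbf{1}, \ldots, \mathbf{d}_p = \mathbf{y}_p - \bar{x}_p\mathbf{1}$. Figures 3.6(a) and (b) show trapezoidal regions, generated by $p = 3$ residual vectors, corresponding to "large" and "small" generalized variances.

³ If generalized variance is defined in terms of the sample covariance matrix $\mathbf{S}_n = [(n - 1)/n]\mathbf{S}$, then, using Result 2A.11, $|\mathbf{S}_n| = |[(n - 1)/n]\mathbf{I}_p\mathbf{S}| = |[(n - 1)/n]\mathbf{I}_p|\,|\mathbf{S}| = [(n - 1)/n]^p|\mathbf{S}|$. Consequently, using (3-15), we can also write the following: Generalized sample variance $= |\mathbf{S}_n| = n^{-p}(\text{volume})^2$.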
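For $p = 2$, the relation between $|\mathbf{S}|$ and the squared area can be verified directly. The sketch below uses small illustrative data (the two measurement vectors are hypothetical, chosen only so the numbers stay simple) and relies on the identity $L_{\mathbf{d}_1}^2L_{\mathbf{d}_2}^2\sin^2(\theta) = (\mathbf{d}_1'\mathbf{d}_1)(\mathbf{d}_2'\mathbf{d}_2) - (\mathbf{d}_1'\mathbf{d}_2)^2$ for the squared area.

```python
from fractions import Fraction

# Illustrative data: n = 5 observations on p = 2 variables.
y1 = [1, 4, 2, 5, 3]
y2 = [10, 16, 12, 13, 14]
n, p = len(y1), 2


def deviations(y):
    """Deviation vector d = y - ybar * 1."""
    mean = Fraction(sum(y), len(y))
    return [v - mean for v in y]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


d1, d2 = deviations(y1), deviations(y2)

# Squared area of the region spanned by d1 and d2:
# L1^2 * L2^2 * sin^2(theta) = (d1.d1)(d2.d2) - (d1.d2)^2
area_sq = dot(d1, d1) * dot(d2, d2) - dot(d1, d2) ** 2

# Entries of the sample covariance matrix S (divisor n - 1).
s11 = dot(d1, d1) / (n - 1)
s22 = dot(d2, d2) / (n - 1)
s12 = dot(d1, d2) / (n - 1)
det_S = s11 * s22 - s12 ** 2

print(det_S, area_sq / (n - 1) ** p)  # 25/4 25/4
```

The two printed values agree, as (3-15) requires with $p = 2$.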
For a fixed sample size, it is clear from the geometry that volume, or $|\mathbf{S}|$, will increase when the length of any $\mathbf{d}_i = \mathbf{y}_i - \bar{x}_i\mathbf{1}$ (or $\sqrt{s_{ii}}$) is increased. In addition, volume will increase if the residual vectors of fixed length are moved until they are at right angles to one another, as in Figure 3.6(a). On the other hand, the volume, or $|\mathbf{S}|$, will be small if just one of the $s_{ii}$ is small or one of the deviation vectors lies nearly in the (hyper)plane formed by the others, or both. In the second case, the trapezoid has very little height above the plane. This is the situation in Figure 3.6(b), where $\mathbf{d}_3$ lies nearly in the plane formed by $\mathbf{d}_1$ and $\mathbf{d}_2$.

Generalized variance also has interpretations in the $p$-space scatter plot representation of the data. The most intuitive interpretation concerns the spread of the scatter about the sample mean point $\bar{\mathbf{x}}' = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p]$. Consider the measure of distance given in the comment below (2-19), with $\bar{\mathbf{x}}$ playing the role of the fixed point $\boldsymbol{\mu}$ and $\mathbf{S}^{-1}$ playing the role of $\mathbf{A}$. With these choices, the coordinates $\mathbf{x}' = [x_1, x_2, \ldots, x_p]$ of the points a constant distance $c$ from $\bar{\mathbf{x}}$ satisfy
$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = c^2 \qquad (3\text{-}16)$$
[When $p = 1$, $(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = (x_1 - \bar{x}_1)^2/s_{11}$ is the squared distance from $x_1$ to $\bar{x}_1$ in standard deviation units.]

Equation (3-16) defines a hyperellipsoid (an ellipse if $p = 2$) centered at $\bar{\mathbf{x}}$. It can be shown using integral calculus that the volume of this hyperellipsoid is related to $|\mathbf{S}|$. In particular,
$$\text{Volume of } \{\mathbf{x}: (\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le c^2\} = k_p|\mathbf{S}|^{1/2}c^p \qquad (3\text{-}17)$$
or
$$(\text{Volume of ellipsoid})^2 = (\text{constant})\,(\text{generalized sample variance})$$
where the constant $k_p$ is rather formidable.⁴ A large volume corresponds to a large generalized variance.

Although the generalized variance has some intuitively pleasing geometrical interpretations, it suffers from a basic weakness as a descriptive summary of the sample covariance matrix $\mathbf{S}$, as the following example shows.

Example 3.8
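(Interpreting the generalized variance)

Figure 3.7 gives three scatter plots with very different patterns of correlation. All three data sets have $\bar{\mathbf{x}}' = [2, 1]$, and the covariance matrices are
$$\mathbf{S} = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix},\ r = .8 \qquad \mathbf{S} = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix},\ r = 0 \qquad \mathbf{S} = \begin{bmatrix} 5 & -4 \\ -4 & 5 \end{bmatrix},\ r = -.8$$

⁴ For those who are curious, $k_p = 2\pi^{p/2}/\bigl(p\,\Gamma(p/2)\bigr)$, where $\Gamma(z)$ denotes the gamma function evaluated at $z$.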
Figure 3.7 Scatter plots with three different orientations. [Panels (a), (b), and (c); axes $x_1$ and $x_2$.]
Each covariance matrix $\mathbf{S}$ contains the information on the variability of the component variables and also the information required to calculate the correlation coefficient. In this sense, $\mathbf{S}$ captures the orientation and size of the pattern of scatter.

The eigenvalues and eigenvectors extracted from $\mathbf{S}$ further describe the pattern in the scatter plot. For
$$\mathbf{S} = \begin{bmatrix} 5 & 4 \\ 4 & 5 \end{bmatrix}, \quad\text{the eigenvalues satisfy}\quad 0 = (\lambda - 5)^2 - 4^2 = (\lambda - 9)(\lambda - 1)$$
and we determine the eigenvalue-eigenvector pairs $\lambda_1 = 9$, $\mathbf{e}_1' = [1/\sqrt{2}, 1/\sqrt{2}]$ and $\lambda_2 = 1$, $\mathbf{e}_2' = [1/\sqrt{2}, -1/\sqrt{2}]$.

The mean-centered ellipse, with center $\bar{\mathbf{x}}' = [2, 1]$ for all three cases, is
$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le c^2$$
To describe this ellipse, as in Section 2.3, with $\mathbf{A} = \mathbf{S}^{-1}$, we notice that if $(\lambda, \mathbf{e})$ is an eigenvalue-eigenvector pair for $\mathbf{S}$, then $(\lambda^{-1}, \mathbf{e})$ is an eigenvalue-eigenvector pair for $\mathbf{S}^{-1}$. That is, if $\mathbf{S}\mathbf{e} = \lambda\mathbf{e}$, then multiplying on the left by $\mathbf{S}^{-1}$ gives $\mathbf{S}^{-1}\mathbf{S}\mathbf{e} = \lambda\mathbf{S}^{-1}\mathbf{e}$, or $\mathbf{S}^{-1}\mathbf{e} = \lambda^{-1}\mathbf{e}$. Therefore, using the eigenvalues from $\mathbf{S}$, we know that the ellipse extends $c\sqrt{\lambda_i}$ in the direction of $\mathbf{e}_i$ from $\bar{\mathbf{x}}$.

In $p = 2$ dimensions, the choice $c^2 = 5.99$ will produce an ellipse that contains approximately 95 percent of the observations. The vectors $3\sqrt{5.99}\,\mathbf{e}_1$ and $\sqrt{5.99}\,\mathbf{e}_2$ are drawn in Figure 3.8(a). Notice how the directions are the natural axes for the ellipse, and observe that the lengths of these scaled eigenvectors are comparable to the size of the pattern in each direction.

Next, for
$$\mathbf{S} = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}, \quad\text{the eigenvalues satisfy}\quad 0 = (\lambda - 3)^2$$
and we arbitrarily choose the eigenvectors so that $\lambda_1 = 3$, $\mathbf{e}_1' = [1, 0]$ and $\lambda_2 = 3$, $\mathbf{e}_2' = [0, 1]$. The vectors $\sqrt{3}\sqrt{5.99}\,\mathbf{e}_1$ and $\sqrt{3}\sqrt{5.99}\,\mathbf{e}_2$ are drawn in Figure 3.8(b).

Finally, for
$$\mathbf{S} = \begin{bmatrix} 5 & -4 \\ -4 & 5 \end{bmatrix}, \quad\text{the eigenvalues satisfy}\quad 0 = (\lambda - 5)^2 - (-4)^2 = (\lambda - 9)(\lambda - 1)$$
and we determine the eigenvalue-eigenvector pairs $\lambda_1 = 9$, $\mathbf{e}_1' = [1/\sqrt{2}, -1/\sqrt{2}]$ and $\lambda_2 = 1$, $\mathbf{e}_2' = [1/\sqrt{2}, 1/\sqrt{2}]$. The scaled eigenvectors $3\sqrt{5.99}\,\mathbf{e}_1$ and $\sqrt{5.99}\,\mathbf{e}_2$ are drawn in Figure 3.8(c).

In two dimensions, we can often sketch the axes of the mean-centered ellipse by eye. However, the eigenvector approach also works for high dimensions where the data cannot be examined visually.

Note: Here the generalized variance $|\mathbf{S}|$ gives the same value, $|\mathbf{S}| = 9$, for all three patterns. But generalized variance does not contain any information on the orientation of the patterns. Generalized variance is easier to interpret when the two or more samples (patterns) being compared have nearly the same orientations.

Notice that our three patterns of scatter appear to cover approximately the same area. The ellipses that summarize the variability
$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le c^2$$
do have exactly the same area [see (3-17)], since all have $|\mathbf{S}| = 9$.
•
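The eigenvalue calculations in Example 3.8 can be reproduced with the closed-form solution for a symmetric 2 x 2 matrix. This is a sketch under the assumption that a closed form is acceptable for $p = 2$; the function name `eig_sym_2x2` and the dictionary labels are illustrative. For larger $p$ a numerical eigenvalue routine would be needed instead.

```python
import math


def eig_sym_2x2(s):
    """Eigenvalues (largest first) of a symmetric 2 x 2 matrix [[a, b], [b, d]]."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    centre = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return centre + radius, centre - radius


# The three covariance matrices of Example 3.8.
matrices = {
    "r = 0.8": [[5, 4], [4, 5]],
    "r = 0": [[3, 0], [0, 3]],
    "r = -0.8": [[5, -4], [-4, 5]],
}
c = math.sqrt(5.99)  # c^2 = 5.99 is the value used in the example

results = {}
for label, s in matrices.items():
    lam1, lam2 = eig_sym_2x2(s)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    # Half-lengths of the ellipse axes: c * sqrt(lambda_i)
    half_axes = (c * math.sqrt(lam1), c * math.sqrt(lam2))
    results[label] = (lam1, lam2, det, half_axes)
    print(label, results[label])
```

All three matrices give a determinant of 9, while the eigenvalues (9 and 1 versus 3 and 3) separate the elongated patterns from the circular one. The sign of the correlation shows up only in the eigenvectors, which this sketch does not compute.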
As Example 3.8 demonstrates, different correlation structures are not detected by $|\mathbf{S}|$. The situation for $p > 2$ can be even more obscure. Consequently, it is often desirable to provide more than the single number $|\mathbf{S}|$ as a summary of $\mathbf{S}$. From Exercise 2.12, $|\mathbf{S}|$ can be expressed as the product $\lambda_1\lambda_2\cdots\lambda_p$ of the eigenvalues of $\mathbf{S}$. Moreover, the mean-centered ellipsoid based on $\mathbf{S}^{-1}$ [see (3-16)] has axes whose lengths are proportional to the square roots of the $\lambda_i$'s (see Section 2.3). These eigenvalues then provide information on the variability in all directions in the $p$-space representation of the data. It is useful, therefore, to report their individual values, as well as their product. We shall pursue this topic later when we discuss principal components.

Figure 3.8 Axes of the mean-centered 95 percent ellipses for the scatter plots in Figure 3.7. [Panels (a), (b), and (c); axes $x_1$ and $x_2$.]

Situations in which the Generalized Sample Variance Is Zero
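The generalized sample variance will be zero in certain situations. A generalized variance of zero is indicative of extreme degeneracy, in the sense that at least one column of the matrix of deviations,
$$\begin{bmatrix} \mathbf{x}_1' - \bar{\mathbf{x}}' \\ \mathbf{x}_2' - \bar{\mathbf{x}}' \\ \vdots \\ \mathbf{x}_n' - \bar{\mathbf{x}}' \end{bmatrix} = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\ \vdots & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p \end{bmatrix} = \underset{(n \times p)}{\mathbf{X}} - \underset{(n \times 1)}{\mathbf{1}}\,\underset{(1 \times p)}{\bar{\mathbf{x}}'} \qquad (3\text{-}18)$$
can be expressed as a linear combination of the other columns. As we have shown geometrically, this is a case where one of the deviation vectors, for instance, $\mathbf{d}_i' = [x_{1i} - \bar{x}_i, \ldots, x_{ni} - \bar{x}_i]$, lies in the (hyper)plane generated by $\mathbf{d}_1, \ldots, \mathbf{d}_{i-1}, \mathbf{d}_{i+1}, \ldots, \mathbf{d}_p$.

Result 3.2. The generalized variance is zero when, and only when, at least one deviation vector lies in the (hyper)plane formed by all linear combinations of the others, that is, when the columns of the matrix of deviations in (3-18) are linearly dependent.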
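Proof. If the columns of the deviation matrix $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ are linearly dependent, there is a linear combination of the columns such that
$$\mathbf{0} = a_1\,\mathrm{col}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}') + \cdots + a_p\,\mathrm{col}_p(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}') = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} \quad\text{for some } \mathbf{a} \ne \mathbf{0}$$
But then, as you may verify, $(n - 1)\mathbf{S} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ and
$$(n - 1)\mathbf{S}\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = \mathbf{0}$$
so the same $\mathbf{a}$ corresponds to a linear dependency, $a_1\,\mathrm{col}_1(\mathbf{S}) + \cdots + a_p\,\mathrm{col}_p(\mathbf{S}) = \mathbf{S}\mathbf{a} = \mathbf{0}$, in the columns of $\mathbf{S}$. So, by Result 2A.9, $|\mathbf{S}| = 0$.

In the other direction, if $|\mathbf{S}| = 0$, then there is some linear combination $\mathbf{S}\mathbf{a}$ of the columns of $\mathbf{S}$ such that $\mathbf{S}\mathbf{a} = \mathbf{0}$. That is, $\mathbf{0} = (n - 1)\mathbf{S}\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a}$. Premultiplying by $\mathbf{a}'$ yields
$$0 = \mathbf{a}'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = L^2_{(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a}}$$
and, for the length to equal zero, we must have $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = \mathbf{0}$. Thus, the columns of $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ are linearly dependent. •

Example 3.9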
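(A case where the generalized variance is zero)

Show that $|\mathbf{S}| = 0$ for
$$\underset{(3 \times 3)}{\mathbf{X}} = \begin{bmatrix} 1 & 2 & 5 \\ 4 & 1 & 6 \\ 4 & 0 & 4 \end{bmatrix}$$
and determine the degeneracy.

Here $\bar{\mathbf{x}}' = [3, 1, 5]$, so
$$\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}' = \begin{bmatrix} 1 - 3 & 2 - 1 & 5 - 5 \\ 4 - 3 & 1 - 1 & 6 - 5 \\ 4 - 3 & 0 - 1 & 4 - 5 \end{bmatrix} = \begin{bmatrix} -2 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & -1 & -1 \end{bmatrix}$$
The deviation (column) vectors are $\mathbf{d}_1' = [-2, 1, 1]$, $\mathbf{d}_2' = [1, 0, -1]$, and $\mathbf{d}_3' = [0, 1, -1]$. Since $\mathbf{d}_3 = \mathbf{d}_1 + 2\mathbf{d}_2$, there is column degeneracy. (Note that there is row degeneracy also.) This means that one of the deviation vectors, for example, $\mathbf{d}_3$, lies in the plane generated by the other two residual vectors. Consequently, the three-dimensional volume is zero. This case is illustrated in Figure 3.9 and may be verified algebraically by showing that $|\mathbf{S}| = 0$. We have
$$\underset{(3 \times 3)}{\mathbf{S}} = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}$$
and from Definition 2A.24,
$$|\mathbf{S}| = 3\begin{vmatrix} 1 & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 \end{vmatrix}(-1)^2 + \Bigl(-\tfrac{3}{2}\Bigr)\begin{vmatrix} -\tfrac{3}{2} & \tfrac{1}{2} \\ 0 & 1 \end{vmatrix}(-1)^3 + (0)\begin{vmatrix} -\tfrac{3}{2} & 1 \\ 0 & \tfrac{1}{2} \end{vmatrix}(-1)^4$$
$$= 3\Bigl(1 - \tfrac{1}{4}\Bigr) + \Bigl(\tfrac{3}{2}\Bigr)\Bigl(-\tfrac{3}{2} - 0\Bigr) + 0 = \tfrac{9}{4} - \tfrac{9}{4} = 0$$
•

Figure 3.9 A case where the three-dimensional volume is zero ($|\mathbf{S}| = 0$).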
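The calculations in Example 3.9 can be retraced step by step. The sketch below uses the data matrix of the example and exact fractions; the helper names `dot` and `det3` are illustrative and not part of the text.

```python
from fractions import Fraction

# Data matrix of Example 3.9 (n = 3 observations on p = 3 variables).
X = [[1, 2, 5],
     [4, 1, 6],
     [4, 0, 4]]
n, p = len(X), len(X[0])

means = [Fraction(sum(row[k] for row in X), n) for k in range(p)]
# Deviation vectors d_k = y_k - xbar_k * 1, one per variable (column of X).
d = [[row[k] - means[k] for row in X] for k in range(p)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# (n - 1) S has (i, k)th entry d_i' d_k.
S = [[dot(d[i], d[k]) / (n - 1) for k in range(p)] for i in range(p)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


print(means)      # sample means 3, 1, 5
print(det3(S))    # 0
print(d[2] == [a + 2 * b for a, b in zip(d[0], d[1])])  # True: d3 = d1 + 2 d2
```

Working in exact arithmetic matters here: with floating-point numbers a singular covariance matrix often yields a tiny nonzero determinant rather than exactly zero.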
When large data sets are sent and received electronically, investigators are sometimes unpleasantly surprised to find a case of zero generalized variance, so that $\mathbf{S}$ does not have an inverse. We have encountered several such cases, with their associated difficulties, before the situation was unmasked. A singular covariance matrix occurs when, for instance, the data are test scores and the investigator has included variables that are sums of the others. For example, an algebra score and a geometry score could be combined to give a total math score, or class midterm and final exam scores summed to give total points. Once, the total weight of a number of chemicals was included along with that of each component.

This common practice of creating new variables that are sums of the original variables and then including them in the data set has caused enough lost time that we emphasize the necessity of being alert to avoid these consequences.

Example 3.10 (Creating new variables that lead to a zero generalized variance)
Consider the data matrix
$$\mathbf{X} = \begin{bmatrix} 1 & 9 & 10 \\ 4 & 12 & 16 \\ 2 & 10 & 12 \\ 5 & 8 & 13 \\ 3 & 11 & 14 \end{bmatrix}$$
where the third column is the sum of the first two columns. These data could be the number of successful phone solicitations per day by a part-time and a full-time employee, respectively, so the third column is the total number of successful solicitations per day.

Show that the generalized variance $|\mathbf{S}| = 0$, and determine the nature of the dependency in the data.

We find that the mean corrected data matrix, with entries $x_{jk} - \bar{x}_k$, is
$$\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}' = \begin{bmatrix} -2 & -1 & -3 \\ 1 & 2 & 3 \\ -1 & 0 & -1 \\ 2 & -2 & 0 \\ 0 & 1 & 1 \end{bmatrix}$$
The resulting covariance matrix is
$$\mathbf{S} = \begin{bmatrix} 2.5 & 0 & 2.5 \\ 0 & 2.5 & 2.5 \\ 2.5 & 2.5 & 5.0 \end{bmatrix}$$
We verify that, in this case, the generalized variance
$$|\mathbf{S}| = 2.5^2 \times 5 + 0 + 0 - 2.5^3 - 2.5^3 - 0 = 0$$
In general, if the three columns of the data matrix $\mathbf{X}$ satisfy a linear constraint $a_1x_{j1} + a_2x_{j2} + a_3x_{j3} = c$, a constant for all $j$, then $a_1\bar{x}_1 + a_2\bar{x}_2 + a_3\bar{x}_3 = c$, so that
$$a_1(x_{j1} - \bar{x}_1) + a_2(x_{j2} - \bar{x}_2) + a_3(x_{j3} - \bar{x}_3) = 0$$
for all $j$. That is,
$$(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = \mathbf{0}$$
and the columns of the mean corrected data matrix are linearly dependent. Thus, the inclusion of the third variable, which is linearly related to the first two, has led to the case of a zero generalized variance.

Whenever the columns of the mean corrected data matrix are linearly dependent,
$$(n - 1)\mathbf{S}\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'\mathbf{0} = \mathbf{0}$$
and $\mathbf{S}\mathbf{a} = \mathbf{0}$ establishes the linear dependency of the columns of $\mathbf{S}$. Hence, $|\mathbf{S}| = 0$.
Since $\mathbf{S}\mathbf{a} = \mathbf{0} = 0\,\mathbf{a}$, we see that $\mathbf{a}$ is a scaled eigenvector of $\mathbf{S}$ associated with an eigenvalue of zero. This gives rise to an important diagnostic: If we are unaware of any extra variables that are linear combinations of the others, we can find them by calculating the eigenvectors of $\mathbf{S}$ and identifying the one associated with a zero eigenvalue. That is, if we were unaware of the dependency in this example, a computer calculation would find an eigenvector proportional to $\mathbf{a}' = [1, 1, -1]$, since
$$\mathbf{S}\mathbf{a} = \begin{bmatrix} 2.5 & 0 & 2.5 \\ 0 & 2.5 & 2.5 \\ 2.5 & 2.5 & 5.0 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = 0\begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}$$
The coefficients reveal that
$$1(x_{j1} - \bar{x}_1) + 1(x_{j2} - \bar{x}_2) + (-1)(x_{j3} - \bar{x}_3) = 0 \quad\text{for all } j$$
In addition, the sum of the first two variables minus the third is a constant $c$ for all $n$ units. Here the third variable is actually the sum of the first two variables, so the columns of the original data matrix satisfy a linear constraint with $c = 0$. Because we have the special case $c = 0$, the constraint establishes the fact that the columns of the data matrix are linearly dependent. •

Let us summarize the important equivalent conditions for a generalized variance to be zero that we discussed in the preceding example. Whenever a nonzero vector $\mathbf{a}$ satisfies one of the following three conditions, it satisfies all of them:

(1) $\mathbf{S}\mathbf{a} = \mathbf{0}$: $\mathbf{a}$ is a scaled eigenvector of $\mathbf{S}$ with eigenvalue 0.
(2) $\mathbf{a}'(\mathbf{x}_j - \bar{\mathbf{x}}) = 0$ for all $j$: The linear combination of the mean corrected data, using $\mathbf{a}$, is zero.
(3) $\mathbf{a}'\mathbf{x}_j = c$ for all $j$ ($c = \mathbf{a}'\bar{\mathbf{x}}$): The linear combination of the original data, using $\mathbf{a}$, is a constant.
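The three equivalent conditions can be checked on the data of Example 3.10. In the sketch below the null vector is found with a cross product of two rows of $\mathbf{S}$, which is a shortcut that works only for a 3 x 3 matrix of rank 2; this is a substitute chosen here for brevity, not the eigenvector computation the text describes. The helper name `cross` is illustrative.

```python
from fractions import Fraction

# Data matrix of Example 3.10; the third column is the sum of the first two.
X = [[1, 9, 10],
     [4, 12, 16],
     [2, 10, 12],
     [5, 8, 13],
     [3, 11, 14]]
n, p = len(X), len(X[0])

xbar = [Fraction(sum(row[k] for row in X), n) for k in range(p)]
dev = [[row[k] - xbar[k] for k in range(p)] for row in X]  # rows x_j - xbar
S = [[sum(r[i] * r[k] for r in dev) / (n - 1) for k in range(p)] for i in range(p)]


def cross(u, v):
    """Cross product of two 3-vectors; the result is orthogonal to both."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


# For a 3 x 3 matrix of rank 2, a vector orthogonal to two independent rows
# spans the null space, so it is an eigenvector for the eigenvalue 0.
a = cross(S[0], S[1])
a = [v / a[0] for v in a]  # rescale so the first coefficient is 1

Sa = [sum(S[i][k] * a[k] for k in range(p)) for i in range(p)]
centered = [sum(a[k] * r[k] for k in range(p)) for r in dev]
original = [sum(a[k] * row[k] for k in range(p)) for row in X]

print(a)         # coefficients 1, 1, -1
print(Sa)        # condition (1): all zeros
print(centered)  # condition (2): all zeros
print(original)  # condition (3): the same constant c for every unit
```

With real data held in floating point, the zero eigenvalue would typically appear as a very small number, so in practice one looks for an eigenvalue that is negligible relative to the others.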
We showed that if condition (3) is satisfied, that is, if the values for one variable can be expressed in terms of the others, then the generalized variance is zero because $\mathbf{S}$ has a zero eigenvalue. In the other direction, if condition (1) holds, then the eigenvector $\mathbf{a}$ gives coefficients for the linear dependency of the mean corrected data.

In any statistical analysis, $|\mathbf{S}| = 0$ means that the measurements on some variables should be removed from the study as far as the mathematical computations are concerned. The corresponding reduced data matrix will then lead to a covariance matrix of full rank and a nonzero generalized variance. The question of which measurements to remove in degenerate cases is not easy to answer. When there is a choice, one should retain measurements on a (presumed) causal variable instead of those on a secondary characteristic. We shall return to this subject in our discussion of principal components.

At this point, we settle for delineating some simple conditions for $\mathbf{S}$ to be of full rank or of reduced rank.
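Result 3.3. If $n \le p$, that is, (sample size) $\le$ (number of variables), then $|\mathbf{S}| = 0$ for all samples.

Proof. We must show that the rank of $\mathbf{S}$ is less than or equal to $p - 1$ and then apply Result 2A.9.

For any fixed sample, the $n$ row vectors in (3-18) sum to the zero vector. The existence of this linear combination means that the rank of $\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'$ is less than or equal to $n - 1$, which, in turn, is less than or equal to $p - 1$ because $n \le p$. Since
$$(n - 1)\underset{(p \times p)}{\mathbf{S}} = \underset{(p \times n)}{(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'}\;\underset{(n \times p)}{(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')}$$
the $k$th column of $\mathbf{S}$, $\mathrm{col}_k(\mathbf{S})$, can be written as a linear combination of the transposed rows of $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$. In particular,
$$(n - 1)\,\mathrm{col}_k(\mathbf{S}) = (\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'\,\mathrm{col}_k(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}') = (x_{1k} - \bar{x}_k)\,\mathrm{row}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')' + \cdots + (x_{nk} - \bar{x}_k)\,\mathrm{row}_n(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$$
Since the row vectors of $(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')$ sum to the zero vector, we can write, for example, $\mathrm{row}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$ as the negative of the sum of the remaining row vectors. After substituting for $\mathrm{row}_1(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$ in the preceding equation, we can express $\mathrm{col}_k(\mathbf{S})$ as a linear combination of the at most $n - 1$ linearly independent row vectors $\mathrm{row}_2(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')', \ldots, \mathrm{row}_n(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')'$. The rank of $\mathbf{S}$ is therefore less than or equal to $n - 1$, which, as noted at the beginning of the proof, is less than or equal to $p - 1$, and $\mathbf{S}$ is singular. This implies, from Result 2A.9, that $|\mathbf{S}| = 0$. •

Result 3.4. Let the $p \times 1$ vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, where $\mathbf{x}_j'$ is the $j$th row of the data matrix $\mathbf{X}$, be realizations of the independent random vectors $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. Then

1. If the linear combination $\mathbf{a}'\mathbf{X}_j$ has positive variance for each constant vector $\mathbf{a} \ne \mathbf{0}$, then, provided that $p < n$, $\mathbf{S}$ has full rank with probability 1 and $|\mathbf{S}| > 0$.
2. If, with probability 1, $\mathbf{a}'\mathbf{X}_j$ is a constant (for example, $c$) for all $j$, then $|\mathbf{S}| = 0$.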
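Proof. (Part 2). If $\mathbf{a}'\mathbf{X}_j = a_1X_{j1} + a_2X_{j2} + \cdots + a_pX_{jp} = c$ with probability 1, $\mathbf{a}'\mathbf{x}_j = c$ for all $j$, and the sample mean of this linear combination is
$$c = \sum_{j=1}^{n} (a_1x_{j1} + a_2x_{j2} + \cdots + a_px_{jp})/n = a_1\bar{x}_1 + a_2\bar{x}_2 + \cdots + a_p\bar{x}_p = \mathbf{a}'\bar{\mathbf{x}}$$
Then
$$(\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}')\mathbf{a} = a_1\begin{bmatrix} x_{11} - \bar{x}_1 \\ \vdots \\ x_{n1} - \bar{x}_1 \end{bmatrix} + \cdots + a_p\begin{bmatrix} x_{1p} - \bar{x}_p \\ \vdots \\ x_{np} - \bar{x}_p \end{bmatrix} = \begin{bmatrix} \mathbf{a}'\mathbf{x}_1 - \mathbf{a}'\bar{\mathbf{x}} \\ \vdots \\ \mathbf{a}'\mathbf{x}_n - \mathbf{a}'\bar{\mathbf{x}} \end{bmatrix} = \begin{bmatrix} c - c \\ \vdots \\ c - c \end{bmatrix} = \mathbf{0}$$
indicating linear dependence; the conclusion follows from Result 3.2.

The proof of Part (1) is difficult and can be found in [2].
•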
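Result 3.3 can be illustrated with a small numerical experiment. The data below are arbitrary illustrative values (not from the text): with $n = p = 3$ the determinant is exactly zero whatever the data, and adding a fourth observation that is not coplanar with the first three produces a positive determinant. The helper names `sample_cov` and `det3` are illustrative.

```python
from fractions import Fraction


def sample_cov(X):
    """Sample covariance matrix S (divisor n - 1) with exact rational entries."""
    n, p = len(X), len(X[0])
    xbar = [Fraction(sum(row[k] for row in X), n) for k in range(p)]
    dev = [[row[k] - xbar[k] for k in range(p)] for row in X]
    return [[sum(r[i] * r[k] for r in dev) / (n - 1) for k in range(p)]
            for i in range(p)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Arbitrary illustrative data with p = 3 variables.
X_small = [[2, 7, 1], [8, 2, 8], [1, 8, 2]]   # n = 3, so n <= p
X_large = X_small + [[8, 4, 5]]               # n = 4 > p

det_small = det3(sample_cov(X_small))
det_large = det3(sample_cov(X_large))
print(det_small)      # 0, as Result 3.3 requires when n <= p
print(det_large > 0)  # True for this particular fourth observation
```

Having $n > p$ does not by itself guarantee a nonzero determinant; a linear constraint among the variables, as in Example 3.10, still forces $|\mathbf{S}| = 0$.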
Generalized Variance Determined by $|\mathbf{R}|$ and Its Geometrical Interpretation
The generalized sample variance is unduly affected by the variability of measurements on a single variable. For example, suppose some $s_{ii}$ is either large or quite small. Then, geometrically, the corresponding deviation vector $\mathbf{d}_i = (\mathbf{y}_i - \bar{x}_i\mathbf{1})$ will be very long or very short and will therefore clearly be an important factor in determining volume. Consequently, it is sometimes useful to scale all the deviation vectors so that they have the same length.

Scaling the residual vectors is equivalent to replacing each original observation $x_{jk}$ by its standardized value $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$. The sample covariance matrix of the standardized variables is then $\mathbf{R}$, the sample correlation matrix of the original variables. (See Exercise 3.13.) We define
$$\text{Generalized sample variance of the standardized variables} = |\mathbf{R}|$$
Since the resulting vectors
$$[(x_{1k} - \bar{x}_k)/\sqrt{s_{kk}},\ (x_{2k} - \bar{x}_k)/\sqrt{s_{kk}},\ \ldots,\ (x_{nk} - \bar{x}_k)/\sqrt{s_{kk}}] = (\mathbf{y}_k - \bar{x}_k\mathbf{1})'/\sqrt{s_{kk}}$$
all have length $\sqrt{n - 1}$, the generalized sample variance of the standardized variables will be large when these vectors are nearly perpendicular and will be small when two or more of these vectors are in almost the same direction. Employing the argument leading to (3-7), we readily find that the cosine of the angle $\theta_{ik}$ between $(\mathbf{y}_i - \bar{x}_i\mathbf{1})/\sqrt{s_{ii}}$ and $(\mathbf{y}_k - \bar{x}_k\mathbf{1})/\sqrt{s_{kk}}$ is the sample correlation coefficient $r_{ik}$. Therefore, we can make the statement that $|\mathbf{R}|$ is large when all the $r_{ik}$ are nearly zero and it is small when one or more of the $r_{ik}$ are nearly $+1$ or $-1$.

In sum, we have the following result: Let
$$\frac{(\mathbf{y}_i - \bar{x}_i\mathbf{1})}{\sqrt{s_{ii}}} = \begin{bmatrix} \dfrac{x_{1i} - \bar{x}_i}{\sqrt{s_{ii}}} \\ \dfrac{x_{2i} - \bar{x}_i}{\sqrt{s_{ii}}} \\ \vdots \\ \dfrac{x_{ni} - \bar{x}_i}{\sqrt{s_{ii}}} \end{bmatrix}, \qquad i = 1, 2, \ldots, p$$
be the deviation vectors of the standardized variables. These deviation vectors lie in the direction of $\mathbf{d}_i$, but have a squared length of $n - 1$. The volume generated in $p$-space by the deviation vectors can be related to the generalized sample variance. The same steps that lead to (3-15) produce
$$\text{Generalized sample variance of the standardized variables} = |\mathbf{R}| = (n - 1)^{-p}(\text{volume})^2 \qquad (3\text{-}20)$$
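The effect of standardizing can be seen on a small data set. The sketch below uses hypothetical data with $n = 5$ and $p = 2$ (chosen for illustration only) and checks three statements from the discussion above: each standardized deviation vector has squared length $n - 1$, the covariance matrix of the standardized values equals $\mathbf{R}$, and for $p = 2$ the determinant $|\mathbf{R}|$ equals $1 - r_{12}^2$.

```python
import math

# Illustrative data: n = 5 observations on p = 2 variables.
X = [[1, 10], [4, 16], [2, 12], [5, 13], [3, 14]]
n, p = len(X), len(X[0])

xbar = [sum(row[k] for row in X) / n for k in range(p)]
dev = [[row[k] - xbar[k] for k in range(p)] for row in X]
S = [[sum(r[i] * r[k] for r in dev) / (n - 1) for k in range(p)] for i in range(p)]

# Standardized observations (x_jk - xbar_k) / sqrt(s_kk); their mean is zero.
Z = [[r[k] / math.sqrt(S[k][k]) for k in range(p)] for r in dev]

# Squared length of each standardized deviation vector (should equal n - 1).
sq_lengths = [sum(z[k] ** 2 for z in Z) for k in range(p)]

# Covariance matrix of the standardized data versus R computed from S.
cov_Z = [[sum(z[i] * z[k] for z in Z) / (n - 1) for k in range(p)] for i in range(p)]
R = [[S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)] for i in range(p)]

det_R = R[0][0] * R[1][1] - R[0][1] * R[1][0]
print(sq_lengths)   # both approximately n - 1 = 4
print(cov_Z)        # matches R up to rounding error
print(R)
print(det_R)        # approximately 1 - r12^2
```

Floating-point arithmetic is used because the standardization involves square roots, so the comparisons hold only up to rounding error.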
Figure 3.10 The volume generated by equal-length deviation vectors of the standardized variables. [Panels (a) and (b).]
The volume generated by deviation vectors of the standardized variables is illustrated in Figure 3.10 for the two sets of deviation vectors graphed in Figure 3.6. A comparison of Figures 3.10 and 3.6 reveals that the influence of the $\mathbf{d}_2$ vector (large variability in $x_2$) on the squared volume $|\mathbf{S}|$ is much greater than its influence on the squared volume $|\mathbf{R}|$.

The quantities $|\mathbf{S}|$ and $|\mathbf{R}|$ are connected by the relationship
$$|\mathbf{S}| = (s_{11}s_{22}\cdots s_{pp})|\mathbf{R}| \qquad (3\text{-}21)$$
so
$$(n - 1)^p|\mathbf{S}| = (n - 1)^p(s_{11}s_{22}\cdots s_{pp})|\mathbf{R}| \qquad (3\text{-}22)$$
[The proof of (3-21) is left to the reader as Exercise 3.12.]

Interpreting (3-22) in terms of volumes, we see from (3-15) and (3-20) that the squared volume $(n - 1)^p|\mathbf{S}|$ is proportional to the squared volume $(n - 1)^p|\mathbf{R}|$. The constant of proportionality is the product of the variances, which, in turn, is proportional to the product of the squares of the lengths $(n - 1)s_{ii}$ of the $\mathbf{d}_i$. Equation (3-21) shows, algebraically, how a change in the measurement scale of $X_1$, for example, will alter the relationship between the generalized variances. Since $|\mathbf{R}|$ is based on standardized measurements, it is unaffected by the change in scale. However, the relative value of $|\mathbf{S}|$ will be changed whenever the multiplicative factor $s_{11}$ changes.

Example 3.11
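(Illustrating the relation between $|\mathbf{S}|$ and $|\mathbf{R}|$)

Let us illustrate the relationship in (3-21) for the generalized variances $|\mathbf{S}|$ and $|\mathbf{R}|$ when $p = 3$. Suppose
$$\underset{(3 \times 3)}{\mathbf{S}} = \begin{bmatrix} 4 & 3 & 1 \\ 3 & 9 & 2 \\ 1 & 2 & 1 \end{bmatrix}$$
Then $s_{11} = 4$, $s_{22} = 9$, and $s_{33} = 1$. Moreover,
$$\mathbf{R} = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & 1 & \tfrac{2}{3} \\ \tfrac{1}{2} & \tfrac{2}{3} & 1 \end{bmatrix}$$
Using Definition 2A.24, we obtain
$$|\mathbf{S}| = 4\begin{vmatrix} 9 & 2 \\ 2 & 1 \end{vmatrix}(-1)^2 + 3\begin{vmatrix} 3 & 2 \\ 1 & 1 \end{vmatrix}(-1)^3 + 1\begin{vmatrix} 3 & 9 \\ 1 & 2 \end{vmatrix}(-1)^4 = 4(9 - 4) - 3(3 - 2) + 1(6 - 9) = 14$$
$$|\mathbf{R}| = 1\begin{vmatrix} 1 & \tfrac{2}{3} \\ \tfrac{2}{3} & 1 \end{vmatrix}(-1)^2 + \tfrac{1}{2}\begin{vmatrix} \tfrac{1}{2} & \tfrac{2}{3} \\ \tfrac{1}{2} & 1 \end{vmatrix}(-1)^3 + \tfrac{1}{2}\begin{vmatrix} \tfrac{1}{2} & 1 \\ \tfrac{1}{2} & \tfrac{2}{3} \end{vmatrix}(-1)^4 = \Bigl(1 - \tfrac{4}{9}\Bigr) - \Bigl(\tfrac{1}{2}\Bigr)\Bigl(\tfrac{1}{2} - \tfrac{1}{3}\Bigr) + \Bigl(\tfrac{1}{2}\Bigr)\Bigl(\tfrac{1}{3} - \tfrac{1}{2}\Bigr) = \tfrac{7}{18}$$
It then follows that
$$14 = |\mathbf{S}| = s_{11}s_{22}s_{33}|\mathbf{R}| = (4)(9)(1)\bigl(\tfrac{7}{18}\bigr) = 14 \quad\text{(check)}$$
•

Another Generalization of Variance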
We conclude this discussion by mentioning another generalization of variance. Specifically, we define the total sample variance as the sum of the diagonal elements of the sample variance-covariance matrix $\mathbf{S}$. Thus,
$$\text{Total sample variance} = s_{11} + s_{22} + \cdots + s_{pp}$$
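Example 3.12 (Calculating the total sample variance)

Calculate the total sample variance for the variance-covariance matrices $\mathbf{S}$ in Examples 3.7 and 3.9.

From Example 3.7,
$$\mathbf{S} = \begin{bmatrix} 252.04 & -68.43 \\ -68.43 & 123.67 \end{bmatrix}$$
and
$$\text{Total sample variance} = s_{11} + s_{22} = 252.04 + 123.67 = 375.71$$
From Example 3.9,
$$\mathbf{S} = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}$$
and
$$\text{Total sample variance} = s_{11} + s_{22} + s_{33} = 3 + 1 + 1 = 5$$
•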
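Both the relationship (3-21) and the total sample variance are simple to check numerically. The sketch below uses the covariance matrix of Example 3.11 for the first and the matrix of Example 3.9 for the second; the helper name `det3` and the variable names are illustrative.

```python
import math


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Covariance matrix of Example 3.11.
S = [[4, 3, 1],
     [3, 9, 2],
     [1, 2, 1]]
p = len(S)

# Correlation matrix: r_ik = s_ik / (sqrt(s_ii) * sqrt(s_kk)).
R = [[S[i][k] / math.sqrt(S[i][i] * S[k][k]) for k in range(p)] for i in range(p)]

det_S = det3(S)
det_R = det3(R)
product_of_variances = math.prod(S[i][i] for i in range(p))

print(det_S)                         # 14
print(det_R)                         # approximately 7/18
print(product_of_variances * det_R)  # approximately 14, as (3-21) states

# Covariance matrix of Example 3.9: total sample variance versus |S|.
S_39 = [[3, -1.5, 0], [-1.5, 1, 0.5], [0, 0.5, 1]]
total_39 = sum(S_39[i][i] for i in range(3))
print(total_39, det3(S_39))  # total sample variance 5, generalized variance 0
```

The second matrix shows how differently the two summaries behave: the total sample variance is 5 even though the generalized variance is zero, because the trace ignores the linear dependency among the variables.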
Geometrically, the total sample variance is the sum of the squared lengths of the $p$ deviation vectors $\mathbf{d}_1 = (\mathbf{y}_1 - \bar{x}_1\mathbf{1}), \ldots, \mathbf{d}_p = (\mathbf{y}_p - \bar{x}_p\mathbf{1})$, divided by $n - 1$. The total sample variance criterion pays no attention to the orientation (correlation structure) of the residual vectors. For instance, it assigns the same values to both sets of residual vectors (a) and (b) in Figure 3.6.

3.5 SAMPLE MEAN, COVARIANCE, AND CORRELATION AS MATRIX OPERATIONS
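We have developed geometrical representations of the data matrix $\mathbf{X}$ and the derived descriptive statistics $\bar{\mathbf{x}}$ and $\mathbf{S}$. In addition, it is possible to link algebraically the calculation of $\bar{\mathbf{x}}$ and $\mathbf{S}$ directly to $\mathbf{X}$ using matrix operations. The resulting expressions, which depict the relation between $\bar{\mathbf{x}}$, $\mathbf{S}$, and the full data set $\mathbf{X}$ concisely, are easily programmed on electronic computers.

We have it that $\bar{x}_i = (x_{1i}\cdot 1 + x_{2i}\cdot 1 + \cdots + x_{ni}\cdot 1)/n = \mathbf{y}_i'\mathbf{1}/n$. Therefore,
$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} = \begin{bmatrix} \mathbf{y}_1'\mathbf{1}/n \\ \mathbf{y}_2'\mathbf{1}/n \\ \vdots \\ \mathbf{y}_p'\mathbf{1}/n \end{bmatrix} = \frac{1}{n}\begin{bmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1p} & x_{2p} & \cdots & x_{np} \end{bmatrix}\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
or
$$\bar{\mathbf{x}} = \frac{1}{n}\mathbf{X}'\mathbf{1} \qquad (3\text{-}24)$$
That is, $\bar{\mathbf{x}}$ is calculated from the transposed data matrix by postmultiplying by the vector $\mathbf{1}$ and then multiplying the result by the constant $1/n$.

Next, we create an $n \times p$ matrix of means by transposing both sides of (3-24) and premultiplying by $\mathbf{1}$; that is,
$$\mathbf{1}\bar{\mathbf{x}}' = \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{X} = \begin{bmatrix} \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \\ \vdots & \vdots & & \vdots \\ \bar{x}_1 & \bar{x}_2 & \cdots & \bar{x}_p \end{bmatrix} \qquad (3\text{-}25)$$
Subtracting this result from $\mathbf{X}$ produces the $n \times p$ matrix of deviations (residuals)
$$\mathbf{X} - \frac{1}{n}\mathbf{1}\mathbf{1}'\mathbf{X} = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1p} - \bar{x}_p \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2p} - \bar{x}_p \\ \vdots & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{np} - \bar{x}_p \end{bmatrix} \qquad (3\text{-}26)$$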
Chapter 3
Sa m p l e Geometry and Random Sa m p l i n g
Now, the matrix ( n  1 ) S representing sums of squares and cross products is just the transpose of the matrix (326) times the matrix itself, or
X 1 1  X 1 X2 1  X 1 X1 2  X2 X22  X2 (n  1 )S =
X
X1 1  X1 X1 2  X2 X2 1  X1 X22  X2
(
= Xsince
(I
! ll'X ) ' ( X  ! ll' X ) = X ' ( 1  ! 11 ' ) X
)(
)
'  _!_ 11 ' I  _!_ 11 ' = I  _!_ 11 '  _!_ 11 ' + __!_2 11 ' 11 ' = I  _!_ 11 ' n n n n n n
To summarize, the matrix expressions relating x and S to the data set X are
(
)
1 (327) X ' I  _!_ 11 ' X n n1 The result for S n is similar, except that 1/n replaces 1/ ( n  1 ) as the first factor. The relations in (327) show clearly how matrix operations on the data matrix X lead to x and S. Once S is computed, it can be related to the sample correlation matrix R. The resulting expression can also be "inverted" to relate R to S. We first define the p X p 1 1 1 2 2 nsample standard deviation matrix D 1 and compute its inverse, ( D 1 ) = 112 . Let s=
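$$\underset{(p \times p)}{\mathbf{D}^{1/2}} = \begin{bmatrix} \sqrt{s_{11}} & 0 & \cdots & 0 \\ 0 & \sqrt{s_{22}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{s_{pp}} \end{bmatrix}$$
Then
$$\underset{(p \times p)}{\mathbf{D}^{-1/2}} = \begin{bmatrix} \dfrac{1}{\sqrt{s_{11}}} & 0 & \cdots & 0 \\ 0 & \dfrac{1}{\sqrt{s_{22}}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \dfrac{1}{\sqrt{s_{pp}}} \end{bmatrix} \qquad (3\text{-}28)$$
Since
$$\mathbf{S} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ \vdots & \vdots & & \vdots \\ s_{1p} & s_{2p} & \cdots & s_{pp} \end{bmatrix}$$
and
$$\mathbf{R} = \begin{bmatrix} \dfrac{s_{11}}{\sqrt{s_{11}}\sqrt{s_{11}}} & \dfrac{s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \cdots & \dfrac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} \\ \vdots & \vdots & & \vdots \\ \dfrac{s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} & \dfrac{s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} & \cdots & \dfrac{s_{pp}}{\sqrt{s_{pp}}\sqrt{s_{pp}}} \end{bmatrix} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ \vdots & \vdots & & \vdots \\ r_{1p} & r_{2p} & \cdots & 1 \end{bmatrix}$$
we have
$$\mathbf{R} = \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2} \qquad (3\text{-}29)$$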
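The matrix expressions above translate almost line for line into code. The following sketch applies (3-24), (3-27), and (3-29) to the data matrix of Example 3.9 using plain Python lists; the helpers `matmul` and `transpose` are illustrative stand-ins for whatever matrix library is actually available.

```python
from fractions import Fraction
import math

# Data matrix of Example 3.9.
X = [[1, 2, 5],
     [4, 1, 6],
     [4, 0, 4]]
n, p = len(X), len(X[0])


def matmul(A, B):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    """Transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


ones = [[1] for _ in range(n)]  # the n x 1 vector 1

# (3-24): xbar = (1/n) X' 1
xbar = [[v[0] * Fraction(1, n)] for v in matmul(transpose(X), ones)]

# (3-27): S = (1/(n - 1)) X' (I - (1/n) 1 1') X
C = [[(1 if i == j else 0) - Fraction(1, n) for j in range(n)] for i in range(n)]
S = [[v / (n - 1) for v in row] for row in matmul(matmul(transpose(X), C), X)]

# (3-29): R = D^{-1/2} S D^{-1/2}, with D^{-1/2} diagonal, entries 1/sqrt(s_ii)
D_inv_half = [[1 / math.sqrt(S[i][i]) if i == j else 0.0 for j in range(p)]
              for i in range(p)]
R = matmul(matmul(D_inv_half, S), D_inv_half)

print(xbar)  # sample means 3, 1, 5
print(S)
print(R)     # unit diagonal, correlations off the diagonal
```

Forming the $n \times n$ centering matrix explicitly is convenient for small examples, but for large $n$ it is usually cheaper to subtract the column means from $\mathbf{X}$ directly.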
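Postmultiplying and premultiplying both sides of (3-29) by $\mathbf{D}^{1/2}$ and noting that $\mathbf{D}^{-1/2}\mathbf{D}^{1/2} = \mathbf{D}^{1/2}\mathbf{D}^{-1/2} = \mathbf{I}$ gives
$$\mathbf{S} = \mathbf{D}^{1/2}\mathbf{R}\mathbf{D}^{1/2} \qquad (3\text{-}30)$$
That is, $\mathbf{R}$ can be obtained from the information in $\mathbf{S}$, whereas $\mathbf{S}$ can be obtained from $\mathbf{D}^{1/2}$ and $\mathbf{R}$. Equations (3-29) and (3-30) are sample analogs of (2-36) and (2-37).

3.6 SAMPLE VALUES OF LINEAR COMBINATIONS OF VARIABLES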
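We have introduced linear combinations of $p$ variables in Section 2.6. In many multivariate procedures, we are led naturally to consider a linear combination of the form
$$\mathbf{c}'\mathbf{X} = c_1X_1 + c_2X_2 + \cdots + c_pX_p$$
whose observed value on the $j$th trial is
$$\mathbf{c}'\mathbf{x}_j = c_1x_{j1} + c_2x_{j2} + \cdots + c_px_{jp}, \qquad j = 1, 2, \ldots, n \qquad (3\text{-}31)$$
The $n$ derived observations in (3-31) have
$$\text{Sample mean} = \frac{(\mathbf{c}'\mathbf{x}_1 + \mathbf{c}'\mathbf{x}_2 + \cdots + \mathbf{c}'\mathbf{x}_n)}{n} = \mathbf{c}'(\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_n)\frac{1}{n} = \mathbf{c}'\bar{\mathbf{x}} \qquad (3\text{-}32)$$
Since $(\mathbf{c}'\mathbf{x}_j - \mathbf{c}'\bar{\mathbf{x}})^2 = (\mathbf{c}'(\mathbf{x}_j - \bar{\mathbf{x}}))^2 = \mathbf{c}'(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{c}$, we have
$$\text{Sample variance} = \frac{(\mathbf{c}'\mathbf{x}_1 - \mathbf{c}'\bar{\mathbf{x}})^2 + (\mathbf{c}'\mathbf{x}_2 - \mathbf{c}'\bar{\mathbf{x}})^2 + \cdots + (\mathbf{c}'\mathbf{x}_n - \mathbf{c}'\bar{\mathbf{x}})^2}{n - 1}$$
$$= \frac{\mathbf{c}'(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})'\mathbf{c} + \mathbf{c}'(\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})'\mathbf{c} + \cdots + \mathbf{c}'(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'\mathbf{c}}{n - 1}$$
$$= \mathbf{c}'\left[\frac{(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})' + (\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})' + \cdots + (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'}{n - 1}\right]\mathbf{c}$$
or
$$\text{Sample variance of } \mathbf{c}'\mathbf{X} = \mathbf{c}'\mathbf{S}\mathbf{c} \qquad (3\text{-}33)$$
Equations (3-32) and (3-33) are sample analogs of (2-43). They correspond to substituting the sample quantities $\bar{\mathbf{x}}$ and $\mathbf{S}$ for the "population" quantities $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively, in (2-43).

Now consider a second linear combination
$$\mathbf{b}'\mathbf{X} = b_1X_1 + b_2X_2 + \cdots + b_pX_p$$
whose observed value on the $j$th trial is
$$\mathbf{b}'\mathbf{x}_j = b_1x_{j1} + b_2x_{j2} + \cdots + b_px_{jp}, \qquad j = 1, 2, \ldots, n \qquad (3\text{-}34)$$
It follows from (3-32) and (3-33) that the sample mean and variance of these derived observations are
$$\text{Sample mean of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\bar{\mathbf{x}}$$
$$\text{Sample variance of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{b}$$
Moreover, the sample covariance computed from pairs of observations on $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$ is
$$\text{Sample covariance} = \frac{(\mathbf{b}'\mathbf{x}_1 - \mathbf{b}'\bar{\mathbf{x}})(\mathbf{c}'\mathbf{x}_1 - \mathbf{c}'\bar{\mathbf{x}}) + (\mathbf{b}'\mathbf{x}_2 - \mathbf{b}'\bar{\mathbf{x}})(\mathbf{c}'\mathbf{x}_2 - \mathbf{c}'\bar{\mathbf{x}}) + \cdots + (\mathbf{b}'\mathbf{x}_n - \mathbf{b}'\bar{\mathbf{x}})(\mathbf{c}'\mathbf{x}_n - \mathbf{c}'\bar{\mathbf{x}})}{n - 1}$$
$$= \frac{\mathbf{b}'(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})'\mathbf{c} + \mathbf{b}'(\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})'\mathbf{c} + \cdots + \mathbf{b}'(\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'\mathbf{c}}{n - 1}$$
$$= \mathbf{b}'\left[\frac{(\mathbf{x}_1 - \bar{\mathbf{x}})(\mathbf{x}_1 - \bar{\mathbf{x}})' + (\mathbf{x}_2 - \bar{\mathbf{x}})(\mathbf{x}_2 - \bar{\mathbf{x}})' + \cdots + (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})'}{n - 1}\right]\mathbf{c}$$
or
$$\text{Sample covariance of } \mathbf{b}'\mathbf{X} \text{ and } \mathbf{c}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{c} \qquad (3\text{-}35)$$
In sum, we have the following result.

Result 3.5. The linear combinations
$$\mathbf{b}'\mathbf{X} = b_1X_1 + b_2X_2 + \cdots + b_pX_p$$
$$\mathbf{c}'\mathbf{X} = c_1X_1 + c_2X_2 + \cdots + c_pX_p$$
have sample means, variances, and covariances that are related to $\bar{\mathbf{x}}$ and $\mathbf{S}$ by
$$\begin{aligned} \text{Sample mean of } \mathbf{b}'\mathbf{X} &= \mathbf{b}'\bar{\mathbf{x}} \\ \text{Sample mean of } \mathbf{c}'\mathbf{X} &= \mathbf{c}'\bar{\mathbf{x}} \\ \text{Sample variance of } \mathbf{b}'\mathbf{X} &= \mathbf{b}'\mathbf{S}\mathbf{b} \\ \text{Sample variance of } \mathbf{c}'\mathbf{X} &= \mathbf{c}'\mathbf{S}\mathbf{c} \\ \text{Sample covariance of } \mathbf{b}'\mathbf{X} \text{ and } \mathbf{c}'\mathbf{X} &= \mathbf{b}'\mathbf{S}\mathbf{c} \end{aligned} \qquad (3\text{-}36)$$
•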
Example 3.1 3
Sa m p l e Va l ues of Linear Com b i n ations of Va riables
1 43
{Means and covaria nces fo r l i near combi nations)
We shall consider two linear combinations and their derived values for the n = 3 observations given in Example 3.9 as
[�:] [ �:]
Consider the two linear combinations
b 'X = [2 2  1 ] and
c'X = [ 1  1 3 ]
= 2Xl + 2X2  X3
= X1  X2 + 3X3
The means, variances, and covariance will first be evaluated directly and then be evaluated by (336). Observations on these linear combinations are obtained by replacing X1 , X2 , and X3 with their observed values. For example, the n = 3 observa tions on b'X are
b'x 1 b'x2 b 'x 3
= 2x 1 1 + 2x 1 2  x 1 3 = 2( 1 ) + 2(2)  (5) = 1 = 2x2 1 + 2x22  x23 = 2(4) + 2 ( 1 )  (6) = 4 = 2x3 1 + 2x32  x33 = 2(4) + 2(0)  (4) = 4
The sample mean and variance of these values are, respectively, ( 1 + 4 + 4) == 3 Sample mean = 3 ( 1  3 )2 + ( 4  3 )2 + (4  3 )2 . Sample variance = == 3 3_1 In a similar manner, the n
c'x 1 c'x2 c'x3
==
3 observations on c'X are
1x 1 1  1x 12 + 3x 1 3 == 1 ( 1 )  1 (2) + 3 ( 5 ) = 14 = 1 (4)  1 ( 1 ) + 3 ( 6 ) == 21 == 1 (4)  1 ( 0) + 3(4) == 16
==
and ( 1 4 + 2 1 + 16 ) Samp1e mean == _______  17 3 2 + (16  17 ) 2 (14 17) 2 + (2117) Sample variance == _ 3 1
==
13
1 44
Chapter 3
Sa m p l e Geometry and Ra ndom Sa m p l i ng
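Moreover, the sample covariance, computed from the pairs of observations $(\mathbf{b}'\mathbf{x}_1, \mathbf{c}'\mathbf{x}_1)$, $(\mathbf{b}'\mathbf{x}_2, \mathbf{c}'\mathbf{x}_2)$, and $(\mathbf{b}'\mathbf{x}_3, \mathbf{c}'\mathbf{x}_3)$, is
$$\text{Sample covariance} = \frac{(1 - 3)(14 - 17) + (4 - 3)(21 - 17) + (4 - 3)(16 - 17)}{3 - 1} = \frac{9}{2}$$
Alternatively, we use the sample mean vector $\bar{\mathbf{x}}$ and sample covariance matrix $\mathbf{S}$ derived from the original data matrix $\mathbf{X}$ to calculate the sample means, variances, and covariances for the linear combinations. Thus, if only the descriptive statistics are of interest, we do not even need to calculate the observations $\mathbf{b}'\mathbf{x}_j$ and $\mathbf{c}'\mathbf{x}_j$.

From Example 3.9,
$$\bar{\mathbf{x}} = \begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} \quad\text{and}\quad \mathbf{S} = \begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}$$
Consequently, using (3-36), we find that the two sample means for the derived observations are
$$\text{Sample mean of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\bar{\mathbf{x}} = [2 \quad 2 \quad -1]\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} = 3 \quad\text{(check)}$$
$$\text{Sample mean of } \mathbf{c}'\mathbf{X} = \mathbf{c}'\bar{\mathbf{x}} = [1 \quad -1 \quad 3]\begin{bmatrix} 3 \\ 1 \\ 5 \end{bmatrix} = 17 \quad\text{(check)}$$
Using (3-36), we also have
$$\text{Sample variance of } \mathbf{b}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{b} = [2 \quad 2 \quad -1]\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 2 \\ 2 \\ -1 \end{bmatrix} = [2 \quad 2 \quad -1]\begin{bmatrix} 3 \\ -\tfrac{3}{2} \\ 0 \end{bmatrix} = 3 \quad\text{(check)}$$
$$\text{Sample variance of } \mathbf{c}'\mathbf{X} = \mathbf{c}'\mathbf{S}\mathbf{c} = [1 \quad -1 \quad 3]\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \\ 3 \end{bmatrix} = [1 \quad -1 \quad 3]\begin{bmatrix} \tfrac{9}{2} \\ -1 \\ \tfrac{5}{2} \end{bmatrix} = 13 \quad\text{(check)}$$
$$\text{Sample covariance of } \mathbf{b}'\mathbf{X} \text{ and } \mathbf{c}'\mathbf{X} = \mathbf{b}'\mathbf{S}\mathbf{c} = [2 \quad 2 \quad -1]\begin{bmatrix} 3 & -\tfrac{3}{2} & 0 \\ -\tfrac{3}{2} & 1 & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \\ 3 \end{bmatrix} = [2 \quad 2 \quad -1]\begin{bmatrix} \tfrac{9}{2} \\ -1 \\ \tfrac{5}{2} \end{bmatrix} = \frac{9}{2} \quad\text{(check)}$$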
As indicated, these last results check with the corresponding sample quan tities computed directly from the observations on the linear combinations. • The sample mean and covariance relations in Result 3.5 pertain to any number of linear combinations. Consider the q linear combinations
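\[
a_{i1}X_1 + a_{i2}X_2 + \cdots + a_{ip}X_p, \qquad i = 1, 2, \ldots, q \tag{3-37}
\]
These can be expressed in matrix notation as
\[
\begin{bmatrix}
a_{11}X_1 + a_{12}X_2 + \cdots + a_{1p}X_p \\
a_{21}X_1 + a_{22}X_2 + \cdots + a_{2p}X_p \\
\vdots \\
a_{q1}X_1 + a_{q2}X_2 + \cdots + a_{qp}X_p
\end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{21} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
a_{q1} & a_{q2} & \cdots & a_{qp}
\end{bmatrix}
\begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}
= \mathbf{A}\mathbf{X} \tag{3-38}
\]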
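Taking the $i$th row of $\mathbf{A}$, $\mathbf{a}_i'$, to be $\mathbf{b}'$ and the $k$th row of $\mathbf{A}$, $\mathbf{a}_k'$, to be $\mathbf{c}'$, we see that Equations (3-36) imply that the $i$th row of $\mathbf{A}\mathbf{X}$ has sample mean $\mathbf{a}_i'\bar{\mathbf{x}}$ and the $i$th and $k$th rows of $\mathbf{A}\mathbf{X}$ have sample covariance $\mathbf{a}_i'\mathbf{S}\mathbf{a}_k$. Note that $\mathbf{a}_i'\mathbf{S}\mathbf{a}_k$ is the $(i, k)$th element of $\mathbf{A}\mathbf{S}\mathbf{A}'$.

Result 3.6. The $q$ linear combinations $\mathbf{A}\mathbf{X}$ in (3-38) have sample mean vector $\mathbf{A}\bar{\mathbf{x}}$ and sample covariance matrix $\mathbf{A}\mathbf{S}\mathbf{A}'$. ■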
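As a numerical companion to Result 3.6, the following sketch recomputes the quantities of Example 3.13 in two ways: from the derived observations, and from $\mathbf{A}\bar{\mathbf{x}}$ and $\mathbf{A}\mathbf{S}\mathbf{A}'$. It uses only the Python standard library with exact fractions; the helper names (`mean_vector`, `cov_matrix`, `matmul`, `transpose`) are illustrative choices, not part of the text.

```python
from fractions import Fraction

# Data matrix of Example 3.9 / 3.13: each row is one of the n = 3 observations.
X = [[1, 2, 5],
     [4, 1, 6],
     [4, 0, 4]]
# Rows of A are the coefficient vectors b' and c' of Example 3.13.
A = [[2, 2, -1],
     [1, -1, 3]]

def mean_vector(data):
    """Sample mean of each column."""
    n = len(data)
    return [Fraction(sum(col), n) for col in zip(*data)]

def cov_matrix(data):
    """Sample covariance matrix with divisor n - 1."""
    n = len(data)
    m = mean_vector(data)
    p = len(m)
    return [[sum((row[i] - m[i]) * (row[k] - m[k]) for row in data) / Fraction(n - 1)
             for k in range(p)] for i in range(p)]

def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

x_bar = mean_vector(X)
S = cov_matrix(X)

# Route 1: form the derived observations A x_j, then summarize them directly.
derived = [[sum(a * x for a, x in zip(coeffs, obs)) for coeffs in A] for obs in X]
direct_mean = mean_vector(derived)
direct_cov = cov_matrix(derived)

# Route 2: Result 3.6, using only x_bar and S.
formula_mean = [sum(a * m for a, m in zip(coeffs, x_bar)) for coeffs in A]
formula_cov = matmul(matmul(A, S), transpose(A))
```

Both routes give the same mean vector and the same 2 x 2 covariance matrix, whose diagonal holds the two sample variances and whose off-diagonal entry is the sample covariance of the two combinations.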
EXERCISES
3.1. Given the data matrix
(a) Graph the scatter plot in p = 2 dimensions. Locate the sample mean on your diagram.
(b) Sketch the n = 3-dimensional representation of the data, and plot the deviation vectors $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{y}_2 - \bar{x}_2\mathbf{1}$.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate the lengths of these vectors and the cosine of the angle between them. Relate these quantities to $\mathbf{S}_n$ and $\mathbf{R}$.
3.2. Given the data matrix
(a) Graph the scatter plot in p = 2 dimensions, and locate the sample mean on your diagram.
(b) Sketch the n = 3-space representation of the data, and plot the deviation vectors $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{y}_2 - \bar{x}_2\mathbf{1}$.
(c) Sketch the deviation vectors in (b) emanating from the origin. Calculate their lengths and the cosine of the angle between them. Relate these quantities to $\mathbf{S}_n$ and $\mathbf{R}$.
3.3. Perform the decomposition of $\mathbf{y}_1$ into $\bar{x}_1\mathbf{1}$ and $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ using the first column of the data matrix in Example 3.9.
3.4. Use the six observations on the variable $X_1$, in units of millions, from Table 1.1.
(a) Find the projection on $\mathbf{1}' = [1, 1, 1, 1, 1, 1]$.
(b) Calculate the deviation vector $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$. Relate its length to the sample standard deviation.
(c) Graph (to scale) the triangle formed by $\mathbf{y}_1$, $\bar{x}_1\mathbf{1}$, and $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$. Identify the length of each component in your graph.
(d) Repeat Parts a-c for the variable $X_2$ in Table 1.1.
(e) Graph (to scale) the two deviation vectors $\mathbf{y}_1 - \bar{x}_1\mathbf{1}$ and $\mathbf{y}_2 - \bar{x}_2\mathbf{1}$. Calculate the value of the angle between them.
3.5. Calculate the generalized sample variance $|\mathbf{S}|$ for (a) the data matrix $\mathbf{X}$ in Exercise 3.1 and (b) the data matrix $\mathbf{X}$ in Exercise 3.2.
3.6. Consider the data matrix X= 3 5 2
(a) Calculate the matrix of deviations (residuals), $\mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'$. Is this matrix of full rank? Explain.
(b) Determine $\mathbf{S}$ and calculate the generalized sample variance $|\mathbf{S}|$. Interpret the latter geometrically.
(c) Using the results in (b), calculate the total sample variance. [See (3-23).]
3.7. Sketch the solid ellipsoids $(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) \le 1$ [see (3-16)] for the three matrices
[
s=
[ OJ
�!
[
5 4

�]

]
4 5 '
[f
(Note that these matrices have the same generalized variance I S I · ) 3.8. Given 1 0 S = 0 1 0 0 0 1
1
and S =

 2
(a) Calculate the total sample variance for each $\mathbf{S}$. Compare the results.
(b) Calculate the generalized sample variance for each $\mathbf{S}$, and compare the results. Comment on the discrepancies, if any, found between Parts a and b.
3.9. The following data matrix contains data on test scores, with $x_1$ = score on first test, $x_2$ = score on second test, and $x_3$ = total score on the two tests:
\[
\mathbf{X} = \begin{bmatrix} 12 & 17 & 29 \\ 18 & 20 & 38 \\ 14 & 16 & 30 \\ 20 & 18 & 38 \\ 16 & 19 & 35 \end{bmatrix}
\]
(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an $\mathbf{a}' = [a_1, a_2, a_3]$ vector that establishes the linear dependence.
(b) Obtain the sample covariance matrix $\mathbf{S}$, and verify that the generalized variance is zero. Also, show that $\mathbf{S}\mathbf{a} = \mathbf{0}$, so $\mathbf{a}$ can be rescaled to be an eigenvector corresponding to eigenvalue zero.
(c) Verify that the third column of the data matrix is the sum of the first two columns. That is, show that there is linear dependence, with $a_1 = 1$, $a_2 = 1$, and $a_3 = -1$.
3.10. When the generalized variance is zero, it is the columns of the mean corrected data matrix $\mathbf{X}_c = \mathbf{X} - \mathbf{1}\bar{\mathbf{x}}'$ that are linearly dependent, not necessarily those of the data matrix itself. Given the data
\[
\begin{bmatrix} 3 & 1 & 0 \\ 6 & 4 & 6 \\ 4 & 2 & 2 \\ 7 & 0 & 3 \\ 5 & 3 & 4 \end{bmatrix}
\]
(a) Obtain the mean corrected data matrix, and verify that the columns are linearly dependent. Specify an $\mathbf{a}' = [a_1, a_2, a_3]$ vector that establishes the dependence.
(b) Obtain the sample covariance matrix $\mathbf{S}$, and verify that the generalized variance is zero.
(c) Show that the columns of the data matrix are linearly independent in this case.
3.11. Use the sample covariance obtained in Example 3.7 to verify (3-29) and (3-30), which state that $\mathbf{R} = \mathbf{D}^{-1/2}\mathbf{S}\mathbf{D}^{-1/2}$ and $\mathbf{D}^{1/2}\mathbf{R}\mathbf{D}^{1/2} = \mathbf{S}$.
3.12. Show that $|\mathbf{S}| = (s_{11}s_{22} \cdots s_{pp})|\mathbf{R}|$. Hint: From Equation (3-30), $\mathbf{S} = \mathbf{D}^{1/2}\mathbf{R}\mathbf{D}^{1/2}$. Taking determinants gives $|\mathbf{S}| = |\mathbf{D}^{1/2}||\mathbf{R}||\mathbf{D}^{1/2}|$. (See Result 2A.11.) Now examine $|\mathbf{D}^{1/2}|$.
3.13. Given a data matrix $\mathbf{X}$ and the resulting sample correlation matrix $\mathbf{R}$, consider the standardized observations $(x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$, $k = 1, 2, \ldots, p$, $j = 1, 2, \ldots, n$. Show that these standardized quantities have sample covariance matrix $\mathbf{R}$.
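3.14. Consider the data matrix $\mathbf{X}$ in Exercise 3.1. We have n = 3 observations on p = 2 variables $X_1$ and $X_2$. Form the linear combinations
\[
\mathbf{c}'\mathbf{X} = \begin{bmatrix} -1 & 2 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = -X_1 + 2X_2, \qquad
\mathbf{b}'\mathbf{X} = \begin{bmatrix} 2 & 3 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = 2X_1 + 3X_2
\]
(a) Evaluate the sample means, variances, and covariance of $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$ from first principles. That is, calculate the observed values of $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$, and then use the sample mean, variance, and covariance formulas.
(b) Calculate the sample means, variances, and covariance of $\mathbf{b}'\mathbf{X}$ and $\mathbf{c}'\mathbf{X}$ using (3-36). Compare the results in (a) and (b).
3.15. Repeat Exercise 3.14 using the data matrix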
and the linear combinations
X= [ � � � ] = [ �:] = 3] [ �: ] == =
b'X and c'X
8 3 3
[1 1 1]
[1 2
3.16. Let $\mathbf{V}$ be a vector random variable with mean vector $E(\mathbf{V}) = \boldsymbol{\mu}_V$ and covariance matrix $E(\mathbf{V} - \boldsymbol{\mu}_V)(\mathbf{V} - \boldsymbol{\mu}_V)' = \boldsymbol{\Sigma}_V$. Show that $E(\mathbf{V}\mathbf{V}') = \boldsymbol{\Sigma}_V + \boldsymbol{\mu}_V\boldsymbol{\mu}_V'$.
3.17. Show that, if $\mathbf{X}$ $(p \times 1)$ and $\mathbf{Z}$ $(q \times 1)$ are independent, then each component of $\mathbf{X}$ is independent of each component of $\mathbf{Z}$.
Hint: $P[X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p \text{ and } Z_1 \le z_1, \ldots, Z_q \le z_q] = P[X_1 \le x_1, X_2 \le x_2, \ldots, X_p \le x_p] \cdot P[Z_1 \le z_1, \ldots, Z_q \le z_q]$ by independence. Let $x_2, \ldots, x_p$ and $z_2, \ldots, z_q$ tend to infinity, to obtain
\[
P[X_1 \le x_1 \text{ and } Z_1 \le z_1] = P[X_1 \le x_1] \cdot P[Z_1 \le z_1]
\]
for all $x_1$, $z_1$. So $X_1$ and $Z_1$ are independent. Repeat for other pairs.

REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Eaton, M., and M. Perlman. "The Non-Singularity of Generalized Sample Covariance Matrices." Annals of Statistics, 1 (1973), 710-717.
CHAPTER
4
The Multivariate Normal Distribution
4.1 INTRODUCTION
A generalization of the familiar bell-shaped normal density to several dimensions plays a fundamental role in multivariate analysis. In fact, most of the techniques encountered in this book are based on the assumption that the data were generated from a multivariate normal distribution. While real data are never exactly multivariate normal, the normal density is often a useful approximation to the "true" population distribution.

One advantage of the multivariate normal distribution stems from the fact that it is mathematically tractable and "nice" results can be obtained. This is frequently not the case for other data-generating distributions. Of course, mathematical attractiveness per se is of little use to the practitioner. It turns out, however, that normal distributions are useful in practice for two reasons: First, the normal distribution serves as a bona fide population model in some instances; second, the sampling distributions of many multivariate statistics are approximately normal, regardless of the form of the parent population, because of a central limit effect.

To summarize, many real-world problems fall naturally within the framework of normal theory. The importance of the normal distribution rests on its dual role as both population model for certain natural phenomena and approximate sampling distribution for many statistics.

4.2 THE MULTIVARIATE NORMAL DENSITY AND ITS PROPERTIES
The multivariate normal density is a generalization of the univariate normal density to $p \ge 2$ dimensions. Recall that the univariate normal distribution, with mean $\mu$ and variance $\sigma^2$, has the probability density function
\[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(x - \mu)/\sigma]^2/2}, \qquad -\infty < x < \infty
\]
0
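since each term $\lambda_i^{-1}(\mathbf{x}'\mathbf{e}_i)^2$ is nonnegative. In addition, $\mathbf{x}'\mathbf{e}_i = 0$ for all $i$ only if $\mathbf{x} = \mathbf{0}$. So $\mathbf{x} \ne \mathbf{0}$ implies that $\sum_{i=1}^{p} (1/\lambda_i)(\mathbf{x}'\mathbf{e}_i)^2 > 0$, and it follows that $\boldsymbol{\Sigma}^{-1}$ is positive definite. ■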
The following summarizes these concepts:
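A contour of constant density for a bivariate normal distribution with $\sigma_{11} = \sigma_{22}$ is obtained in the following example.

Example 4.2 (Contours of the bivariate normal density)
We shall obtain the axes of constant probability density contours for a bivariate normal distribution when $\sigma_{11} = \sigma_{22}$. From (4-7), these axes are given by the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}$. Here $|\boldsymbol{\Sigma} - \lambda\mathbf{I}| = 0$ becomes
\[
0 = \begin{vmatrix} \sigma_{11} - \lambda & \sigma_{12} \\ \sigma_{12} & \sigma_{11} - \lambda \end{vmatrix}
= (\sigma_{11} - \lambda)^2 - \sigma_{12}^2
= (\lambda - \sigma_{11} - \sigma_{12})(\lambda - \sigma_{11} + \sigma_{12})
\]
Consequently, the eigenvalues are $\lambda_1 = \sigma_{11} + \sigma_{12}$ and $\lambda_2 = \sigma_{11} - \sigma_{12}$. The eigenvector $\mathbf{e}_1$ is determined from
\[
\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{11} \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}
= (\sigma_{11} + \sigma_{12}) \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}
\]
or
\[
\begin{aligned}
\sigma_{11}e_1 + \sigma_{12}e_2 &= (\sigma_{11} + \sigma_{12})e_1 \\
\sigma_{12}e_1 + \sigma_{11}e_2 &= (\sigma_{11} + \sigma_{12})e_2
\end{aligned}
\]
These equations imply that $e_1 = e_2$, and after normalization, the first eigenvalue-eigenvector pair is
\[
\lambda_1 = \sigma_{11} + \sigma_{12}, \qquad \mathbf{e}_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}
\]
Similarly, $\lambda_2 = \sigma_{11} - \sigma_{12}$ yields the eigenvector $\mathbf{e}_2' = [1/\sqrt{2}, -1/\sqrt{2}]$.

When the covariance $\sigma_{12}$ (or correlation $\rho_{12}$) is positive, $\lambda_1 = \sigma_{11} + \sigma_{12}$ is the largest eigenvalue, and its associated eigenvector $\mathbf{e}_1' = [1/\sqrt{2}, 1/\sqrt{2}]$ lies along the 45° line through the point $\boldsymbol{\mu}' = [\mu_1, \mu_2]$. This is true for any positive value of the covariance (correlation). Since the axes of the constant-density ellipses are given by $\pm c\sqrt{\lambda_1}\,\mathbf{e}_1$ and $\pm c\sqrt{\lambda_2}\,\mathbf{e}_2$ [see (4-7)], and the eigenvectors each have length unity, the major axis will be associated with the largest eigenvalue. For positively correlated normal random variables, then, the major axis of the constant-density ellipses will be along the 45° line through $\boldsymbol{\mu}$. (See Figure 4.3.)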
Figure 4.3 A constant-density contour for a bivariate normal distribution with $\sigma_{11} = \sigma_{22}$ and $\sigma_{12} > 0$ (or $\rho_{12} > 0$).
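When the covariance (correlation) is negative, $\lambda_2 = \sigma_{11} - \sigma_{12}$ will be the largest eigenvalue, and the major axes of the constant-density ellipses will lie along a line at right angles to the 45° line through $\boldsymbol{\mu}$. (These results are true only for $\sigma_{11} = \sigma_{22}$.)

To summarize, the axes of the ellipses of constant density for a bivariate normal distribution with $\sigma_{11} = \sigma_{22}$ are determined by
\[
\pm c\sqrt{\sigma_{11} + \sigma_{12}} \begin{bmatrix} \dfrac{1}{\sqrt{2}} \\[2mm] \dfrac{1}{\sqrt{2}} \end{bmatrix}
\quad \text{and} \quad
\pm c\sqrt{\sigma_{11} - \sigma_{12}} \begin{bmatrix} \dfrac{1}{\sqrt{2}} \\[2mm] -\dfrac{1}{\sqrt{2}} \end{bmatrix}
\]
■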
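The eigenvalue-eigenvector pairs of Example 4.2 can be checked numerically. The sketch below, which uses only the Python standard library, verifies $\boldsymbol{\Sigma}\mathbf{e} = \lambda\mathbf{e}$ for both pairs and reports which axis is the major one. The numerical values of `s11` and `s12`, the constant `c`, and the function names are illustrative assumptions.

```python
import math

def equal_variance_eigen(s11, s12):
    """Closed-form eigenpairs of [[s11, s12], [s12, s11]] from Example 4.2."""
    r = 1 / math.sqrt(2)
    return [(s11 + s12, (r, r)), (s11 - s12, (r, -r))]

def apply(sigma, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return tuple(sum(a * b for a, b in zip(row, v)) for row in sigma)

s11, s12 = 2.0, 0.75          # hypothetical variance and covariance
sigma = [[s11, s12], [s12, s11]]
pairs = equal_variance_eigen(s11, s12)

# Check the defining relation Sigma e = lambda e for each pair.
checks = []
for lam, e in pairs:
    lhs = apply(sigma, e)
    checks.append(all(math.isclose(l, lam * ei) for l, ei in zip(lhs, e)))

# The major axis belongs to the largest eigenvalue; half-lengths are c * sqrt(lambda).
major = max(pairs, key=lambda pair: pair[0])
c = 1.0                       # hypothetical contour constant
half_lengths = [c * math.sqrt(lam) for lam, _ in pairs]

# With a negative covariance the roles of the two axes swap.
major_negative = max(equal_variance_eigen(s11, -s12), key=lambda pair: pair[0])
```

For the positive covariance used here, `major` points along the 45° direction; with the sign of `s12` reversed, `major_negative` points along the perpendicular direction, as the example states.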
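We show in Result 4.7 that the choice $c^2 = \chi_p^2(\alpha)$, where $\chi_p^2(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with $p$ degrees of freedom, leads to contours that contain $(1 - \alpha) \times 100\%$ of the probability. Specifically, the following is true for a $p$-dimensional normal distribution: the solid ellipsoid of $\mathbf{x}$ values satisfying
\[
(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \le \chi_p^2(\alpha)
\]
has probability $1 - \alpha$.

The constant-density contours containing 50% and 90% of the probability under the bivariate normal surfaces in Figure 4.2 are pictured in Figure 4.4.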
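The probability content of these contours can be illustrated by simulation for $p = 2$. The sketch below is a rough check under stated assumptions, not part of the text: it uses the fact that for 2 degrees of freedom the upper percentile has the closed form $\chi_2^2(\alpha) = -2\ln\alpha$, draws bivariate normal points with a hand-coded Cholesky factor, and counts how many fall inside the ellipse. The mean vector, covariance matrix, seed, and function names are all hypothetical.

```python
import math
import random

def chi2_upper_2df(alpha):
    """Upper (100*alpha)th percentile of chi-square with 2 d.f.: P[chi2 > c2] = exp(-c2/2)."""
    return -2.0 * math.log(alpha)

def coverage(mu, sigma, alpha, n_draws=20000, seed=0):
    """Fraction of simulated N_2(mu, sigma) points inside the (1 - alpha) contour."""
    rng = random.Random(seed)
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    # Cholesky factor of sigma, written out for the 2 x 2 case.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 * l21)
    c2 = chi2_upper_2df(alpha)
    inside = 0
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = mu[0] + l11 * z1
        x2 = mu[1] + l21 * z1 + l22 * z2
        d1, d2 = x1 - mu[0], x2 - mu[1]
        # Squared statistical distance (x - mu)' Sigma^{-1} (x - mu) for p = 2.
        dist2 = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
        inside += dist2 <= c2
    return inside / n_draws

mu = [1.0, -2.0]                        # hypothetical mean vector
sigma = [[2.0, 0.75], [0.75, 2.0]]      # hypothetical covariance matrix
cover_50 = coverage(mu, sigma, alpha=0.50)
cover_90 = coverage(mu, sigma, alpha=0.10)
```

With enough draws, `cover_50` and `cover_90` should land close to 0.50 and 0.90; the agreement is only approximate because it is a Monte Carlo estimate.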
Figure 4.4 The 50% and 90% contours for the bivariate normal distributions in Figure 4.2.
The $p$-variate normal density in (4-4) has a maximum value when the squared distance in (4-3) is zero, that is, when $\mathbf{x} = \boldsymbol{\mu}$. Thus, $\boldsymbol{\mu}$ is the point of maximum density, or mode, as well as the expected value of $\mathbf{X}$, or mean. The fact that $\boldsymbol{\mu}$ is the mean of the multivariate normal distribution follows from the symmetry exhibited by the constant-density contours: These contours are centered, or balanced, at $\boldsymbol{\mu}$.
Additional Properties of the Multivariate Normal Distribution
Certain properties of the normal distribution will be needed repeatedly in our explanations of statistical models and methods. These properties make it possible to manipulate normal distributions easily and, as we suggested in Section 4.1, are partly responsible for the popularity of the normal distribution. The key properties, which we shall soon discuss in some mathematical detail, can be stated rather simply.

The following are true for a random vector $\mathbf{X}$ having a multivariate normal distribution:
1. Linear combinations of the components of $\mathbf{X}$ are normally distributed.
2. All subsets of the components of $\mathbf{X}$ have a (multivariate) normal distribution.
3. Zero covariance implies that the corresponding components are independently distributed.
4. The conditional distributions of the components are (multivariate) normal.

These statements are reproduced mathematically in the results that follow. Many of these results are illustrated with examples. The proofs that are included should help improve your understanding of matrix manipulations and also lead you to an appreciation for the manner in which the results successively build on themselves.

Result 4.2 can be taken as a working definition of the normal distribution. With this in hand, the subsequent properties are almost immediate. Our partial proof of Result 4.2 indicates how the linear combination definition of a normal density relates to the multivariate density in (4-4).

Result 4.2. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then any linear combination of variables $\mathbf{a}'\mathbf{X} = a_1X_1 + a_2X_2 + \cdots + a_pX_p$ is distributed as $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$. Also, if $\mathbf{a}'\mathbf{X}$ is distributed as $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$ for every $\mathbf{a}$, then $\mathbf{X}$ must be $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

Proof. The expected value and variance of $\mathbf{a}'\mathbf{X}$ follow from (2-43). Proving that $\mathbf{a}'\mathbf{X}$ is normally distributed if $\mathbf{X}$ is multivariate normal is more difficult. You can find a proof in [1]. The second part of Result 4.2 is also demonstrated in [1]. ■

Example 4.3
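(The distribution of a linear combination of the components of a normal random vector)
Consider the linear combination $\mathbf{a}'\mathbf{X}$ of a multivariate normal random vector determined by the choice $\mathbf{a}' = [1, 0, \ldots, 0]$. Since
\[
\mathbf{a}'\mathbf{X} = [1, 0, \ldots, 0] \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix} = X_1
\quad \text{and} \quad
\mathbf{a}'\boldsymbol{\mu} = [1, 0, \ldots, 0] \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} = \mu_1
\]
we have
\[
\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = [1, 0, \ldots, 0]
\begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp}
\end{bmatrix}
\begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \sigma_{11}
\]
and it follows from Result 4.2 that $X_1$ is distributed as $N(\mu_1, \sigma_{11})$. More generally, the marginal distribution of any component $X_i$ of $\mathbf{X}$ is $N(\mu_i, \sigma_{ii})$. ■

The next result considers several linear combinations of a multivariate normal vector $\mathbf{X}$.

Result 4.3. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the $q$ linear combinations
\[
\underset{(q \times p)}{\mathbf{A}}\ \underset{(p \times 1)}{\mathbf{X}} =
\begin{bmatrix}
a_{11}X_1 + \cdots + a_{1p}X_p \\
a_{21}X_1 + \cdots + a_{2p}X_p \\
\vdots \\
a_{q1}X_1 + \cdots + a_{qp}X_p
\end{bmatrix}
\]
are distributed as $N_q(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$. Also, $\mathbf{X} + \mathbf{d}$, where $\mathbf{d}$ is a $(p \times 1)$ vector of constants, is distributed as $N_p(\boldsymbol{\mu} + \mathbf{d}, \boldsymbol{\Sigma})$.

Proof. The expected value $E(\mathbf{A}\mathbf{X})$ and the covariance matrix of $\mathbf{A}\mathbf{X}$ follow from (2-45). Any linear combination $\mathbf{b}'(\mathbf{A}\mathbf{X})$ is a linear combination of $\mathbf{X}$, of the form $\mathbf{a}'\mathbf{X}$ with $\mathbf{a} = \mathbf{A}'\mathbf{b}$. Thus, the conclusion concerning $\mathbf{A}\mathbf{X}$ follows directly from Result 4.2.

The second part of the result can be obtained by considering $\mathbf{a}'(\mathbf{X} + \mathbf{d}) = \mathbf{a}'\mathbf{X} + (\mathbf{a}'\mathbf{d})$, where $\mathbf{a}'\mathbf{X}$ is distributed as $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$. It is known from the univariate case that adding a constant $\mathbf{a}'\mathbf{d}$ to the random variable $\mathbf{a}'\mathbf{X}$ leaves the variance unchanged and translates the mean to $\mathbf{a}'\boldsymbol{\mu} + \mathbf{a}'\mathbf{d} = \mathbf{a}'(\boldsymbol{\mu} + \mathbf{d})$. Since $\mathbf{a}$ was arbitrary, $\mathbf{X} + \mathbf{d}$ is distributed as $N_p(\boldsymbol{\mu} + \mathbf{d}, \boldsymbol{\Sigma})$. ■

Example 4.4 (The distribution of two linear combinations of the components of a normal random vector)
For $\mathbf{X}$ distributed as $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, find the distribution of
\[
\begin{bmatrix} X_1 - X_2 \\ X_2 - X_3 \end{bmatrix}
= \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} = \mathbf{A}\mathbf{X}
\]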
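By Result 4.3, the distribution of $\mathbf{A}\mathbf{X}$ is multivariate normal with mean
\[
\mathbf{A}\boldsymbol{\mu} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix}
= \begin{bmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \end{bmatrix}
\]
and covariance matrix
\[
\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' =
\begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}
\begin{bmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix}
= \begin{bmatrix}
\sigma_{11} - 2\sigma_{12} + \sigma_{22} & \sigma_{12} + \sigma_{23} - \sigma_{22} - \sigma_{13} \\
\sigma_{12} + \sigma_{23} - \sigma_{22} - \sigma_{13} & \sigma_{22} - 2\sigma_{23} + \sigma_{33}
\end{bmatrix}
\]
Alternatively, the mean vector $\mathbf{A}\boldsymbol{\mu}$ and covariance matrix $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ may be verified by direct calculation of the means and covariances of the two random variables $Y_1 = X_1 - X_2$ and $Y_2 = X_2 - X_3$. ■

We have mentioned that all subsets of a multivariate normal random vector $\mathbf{X}$ are themselves normally distributed. We state this property formally as Result 4.4.

Result 4.4. All subsets of $\mathbf{X}$ are normally distributed. If we respectively partition $\mathbf{X}$, its mean vector $\boldsymbol{\mu}$, and its covariance matrix $\boldsymbol{\Sigma}$ as
\[
\underset{(p \times 1)}{\mathbf{X}} = \begin{bmatrix} \underset{(q \times 1)}{\mathbf{X}_1} \\ \underset{((p-q) \times 1)}{\mathbf{X}_2} \end{bmatrix},
\qquad
\underset{(p \times 1)}{\boldsymbol{\mu}} = \begin{bmatrix} \underset{(q \times 1)}{\boldsymbol{\mu}_1} \\ \underset{((p-q) \times 1)}{\boldsymbol{\mu}_2} \end{bmatrix}
\]
and
\[
\underset{(p \times p)}{\boldsymbol{\Sigma}} = \begin{bmatrix}
\underset{(q \times q)}{\boldsymbol{\Sigma}_{11}} & \underset{(q \times (p-q))}{\boldsymbol{\Sigma}_{12}} \\
\underset{((p-q) \times q)}{\boldsymbol{\Sigma}_{21}} & \underset{((p-q) \times (p-q))}{\boldsymbol{\Sigma}_{22}}
\end{bmatrix}
\]
then $\mathbf{X}_1$ is distributed as $N_q(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$.

Proof. Set $\underset{(q \times p)}{\mathbf{A}} = \big[\, \underset{(q \times q)}{\mathbf{I}} \;\big|\; \underset{(q \times (p-q))}{\mathbf{0}} \,\big]$ in Result 4.3, and the conclusion follows. ■

To apply Result 4.4 to an arbitrary subset of the components of $\mathbf{X}$, we simply relabel the subset of interest as $\mathbf{X}_1$ and select the corresponding component means and covariances as $\boldsymbol{\mu}_1$ and $\boldsymbol{\Sigma}_{11}$, respectively.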
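Results 4.3 and 4.4 reduce to small matrix computations once numerical parameters are given. The following sketch evaluates $\mathbf{A}\boldsymbol{\mu}$ and $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ for the matrix $\mathbf{A}$ of Example 4.4, compares the result with the closed-form entries, and then picks out a marginal distribution by selecting entries. The values of `mu` and `sigma` are hypothetical, chosen only for illustration, and the helper names are not from the text.

```python
def matmul(P, Q):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Hypothetical parameters of a trivariate normal vector X.
mu = [[3], [1], [2]]
sigma = [[4, 1, 0],
         [1, 3, 0],
         [0, 0, 2]]

# Y1 = X1 - X2 and Y2 = X2 - X3, as in Example 4.4.
A = [[1, -1, 0],
     [0, 1, -1]]

# Result 4.3: AX is N_2(A mu, A Sigma A').
mean_Y = matmul(A, mu)
cov_Y = matmul(matmul(A, sigma), transpose(A))

# Closed-form entries of A Sigma A' for this particular A.
s = sigma
off_diag = s[0][1] + s[1][2] - s[1][1] - s[0][2]
closed_form = [[s[0][0] - 2 * s[0][1] + s[1][1], off_diag],
               [off_diag, s[1][1] - 2 * s[1][2] + s[2][2]]]

# Result 4.4: the marginal distribution of (X1, X3) is read off by selecting
# the matching entries of mu and sigma (zero-based indices 0 and 2).
keep = [0, 2]
sub_mean = [mu[i][0] for i in keep]
sub_cov = [[sigma[i][k] for k in keep] for i in keep]
```

The selection step at the end is the computational counterpart of choosing $\mathbf{A} = [\mathbf{I} \mid \mathbf{0}]$ after relabeling the components.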
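Example 4.5 (The distribution of a subset of a normal random vector)
If $\mathbf{X}$ is distributed as $N_5(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, find the distribution of $\begin{bmatrix} X_2 \\ X_4 \end{bmatrix}$. We set
\[
\mathbf{X}_1 = \begin{bmatrix} X_2 \\ X_4 \end{bmatrix}, \qquad
\boldsymbol{\mu}_1 = \begin{bmatrix} \mu_2 \\ \mu_4 \end{bmatrix}, \qquad
\boldsymbol{\Sigma}_{11} = \begin{bmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{bmatrix}
\]
and note that with this assignment, $\mathbf{X}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ can respectively be rearranged and partitioned as
\[
\mathbf{X} = \begin{bmatrix} X_2 \\ X_4 \\ \hline X_1 \\ X_3 \\ X_5 \end{bmatrix}, \qquad
\boldsymbol{\mu} = \begin{bmatrix} \mu_2 \\ \mu_4 \\ \hline \mu_1 \\ \mu_3 \\ \mu_5 \end{bmatrix}, \qquad
\boldsymbol{\Sigma} = \left[\begin{array}{cc|ccc}
\sigma_{22} & \sigma_{24} & \sigma_{12} & \sigma_{23} & \sigma_{25} \\
\sigma_{24} & \sigma_{44} & \sigma_{14} & \sigma_{34} & \sigma_{45} \\ \hline
\sigma_{12} & \sigma_{14} & \sigma_{11} & \sigma_{13} & \sigma_{15} \\
\sigma_{23} & \sigma_{34} & \sigma_{13} & \sigma_{33} & \sigma_{35} \\
\sigma_{25} & \sigma_{45} & \sigma_{15} & \sigma_{35} & \sigma_{55}
\end{array}\right]
\]
or
\[
\mathbf{X} = \begin{bmatrix} \underset{(2 \times 1)}{\mathbf{X}_1} \\ \underset{(3 \times 1)}{\mathbf{X}_2} \end{bmatrix}, \qquad
\boldsymbol{\mu} = \begin{bmatrix} \underset{(2 \times 1)}{\boldsymbol{\mu}_1} \\ \underset{(3 \times 1)}{\boldsymbol{\mu}_2} \end{bmatrix}, \qquad
\boldsymbol{\Sigma} = \begin{bmatrix} \underset{(2 \times 2)}{\boldsymbol{\Sigma}_{11}} & \underset{(2 \times 3)}{\boldsymbol{\Sigma}_{12}} \\ \underset{(3 \times 2)}{\boldsymbol{\Sigma}_{21}} & \underset{(3 \times 3)}{\boldsymbol{\Sigma}_{22}} \end{bmatrix}
\]
Thus, from Result 4.4, for $\mathbf{X}_1 = \begin{bmatrix} X_2 \\ X_4 \end{bmatrix}$ we have the distribution
\[
N_2(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}) = N_2\left( \begin{bmatrix} \mu_2 \\ \mu_4 \end{bmatrix}, \begin{bmatrix} \sigma_{22} & \sigma_{24} \\ \sigma_{24} & \sigma_{44} \end{bmatrix} \right)
\]
It is clear from this example that the normal distribution for any subset can be expressed by simply selecting the appropriate means and covariances from the original $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. The formal process of relabeling and partitioning is unnecessary. ■

We are now in a position to state that zero correlation between normal random variables or sets of normal random variables is equivalent to statistical independence.

Result 4.5.
(a) If $\mathbf{X}_1$ $(q_1 \times 1)$ and $\mathbf{X}_2$ $(q_2 \times 1)$ are independent, then $\text{Cov}(\mathbf{X}_1, \mathbf{X}_2) = \mathbf{0}$, a $q_1 \times q_2$ matrix of zeros.
(b) If $\begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ is $N_{q_1 + q_2}\left( \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \right)$, then $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent if and only if $\boldsymbol{\Sigma}_{12} = \mathbf{0}$.
(c) If $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent and are distributed as $N_{q_1}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ and $N_{q_2}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$, respectively, then $\begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ has the multivariate normal distribution
\[
N_{q_1 + q_2}\left( \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22} \end{bmatrix} \right)
\]
Proof. (See Exercise 4.14 for partial proofs based upon factoring the density function when $\boldsymbol{\Sigma}_{12} = \mathbf{0}$.) ■

Example 4.6 (The equivalence of zero covariance and independence for normal variables)
Let $\mathbf{X}$ $(3 \times 1)$ be $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with
\[
\boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 3 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\]
Are $X_1$ and $X_2$ independent? What about $(X_1, X_2)$ and $X_3$?

Since $X_1$ and $X_2$ have covariance $\sigma_{12} = 1$, they are not independent. However, partitioning $\mathbf{X}$ and $\boldsymbol{\Sigma}$ as
\[
\mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \hline X_3 \end{bmatrix}, \qquad
\boldsymbol{\Sigma} = \left[\begin{array}{cc|c} 4 & 1 & 0 \\ 1 & 3 & 0 \\ \hline 0 & 0 & 2 \end{array}\right]
= \begin{bmatrix} \underset{(2 \times 2)}{\boldsymbol{\Sigma}_{11}} & \underset{(2 \times 1)}{\boldsymbol{\Sigma}_{12}} \\ \underset{(1 \times 2)}{\boldsymbol{\Sigma}_{21}} & \underset{(1 \times 1)}{\boldsymbol{\Sigma}_{22}} \end{bmatrix}
\]
we see that $\mathbf{X}_1 = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$ and $X_3$ have covariance matrix $\boldsymbol{\Sigma}_{12} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Therefore, $(X_1, X_2)$ and $X_3$ are independent by Result 4.5. This implies $X_3$ is independent of $X_1$ and also of $X_2$. ■

We pointed out in our discussion of the bivariate normal distribution that $\rho_{12} = 0$ (zero correlation) implied independence because the joint density function [see (4-6)] could then be written as the product of the marginal (normal) densities of $X_1$ and $X_2$. This fact, which we encouraged you to verify directly, is simply a special case of Result 4.5 with $q_1 = q_2 = 1$.

Result 4.6. Let $\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}$ be distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}$, $\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}$, and $|\boldsymbol{\Sigma}_{22}| > 0$. Then the conditional distribution of $\mathbf{X}_1$, given that $\mathbf{X}_2 = \mathbf{x}_2$, is normal and has
\[
\text{Mean} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)
\]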
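and
\[
\text{Covariance} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}
\]
Note that the covariance does not depend on the value $\mathbf{x}_2$ of the conditioning variable.

Proof. We shall give an indirect proof. (See Exercise 4.13, which uses the densities directly.) Take
\[
\underset{(p \times p)}{\mathbf{A}} = \begin{bmatrix} \underset{(q \times q)}{\mathbf{I}} & \underset{(q \times (p-q))}{-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}} \\ \underset{((p-q) \times q)}{\mathbf{0}} & \underset{((p-q) \times (p-q))}{\mathbf{I}} \end{bmatrix}
\]
so
\[
\mathbf{A}(\mathbf{X} - \boldsymbol{\mu}) = \mathbf{A}\begin{bmatrix} \mathbf{X}_1 - \boldsymbol{\mu}_1 \\ \mathbf{X}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
= \begin{bmatrix} \mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2) \\ \mathbf{X}_2 - \boldsymbol{\mu}_2 \end{bmatrix}
\]
is jointly normal with covariance matrix $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ given by
\[
\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0} & \mathbf{I} \end{bmatrix}
\begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}
\begin{bmatrix} \mathbf{I} & \mathbf{0}' \\ (-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1})' & \mathbf{I} \end{bmatrix}
= \begin{bmatrix} \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{0}' \\ \mathbf{0} & \boldsymbol{\Sigma}_{22} \end{bmatrix}
\]
Since $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ and $\mathbf{X}_2 - \boldsymbol{\mu}_2$ have zero covariance, they are independent. Moreover, the quantity $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ has distribution $N_q(\mathbf{0}, \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$. Given that $\mathbf{X}_2 = \mathbf{x}_2$, $\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ is a constant. Because $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ and $\mathbf{X}_2 - \boldsymbol{\mu}_2$ are independent, the conditional distribution of $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ is the same as the unconditional distribution of $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$. Since $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2)$ is $N_q(\mathbf{0}, \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$, so is the random vector $\mathbf{X}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$ when $\mathbf{X}_2$ has the particular value $\mathbf{x}_2$. Equivalently, given that $\mathbf{X}_2 = \mathbf{x}_2$, $\mathbf{X}_1$ is distributed as $N_q(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2), \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})$. ■

Example 4.7 (The conditional density of a bivariate normal distribution)
The conditional density of $X_1$, given that $X_2 = x_2$ for any bivariate distribution, is defined by
\[
f(x_1 \mid x_2) = \{\text{conditional density of } X_1 \text{ given that } X_2 = x_2\} = \frac{f(x_1, x_2)}{f(x_2)}
\]
where $f(x_2)$ is the marginal distribution of $X_2$. If $f(x_1, x_2)$ is the bivariate normal density, show that $f(x_1 \mid x_2)$ is
\[
N\!\left( \mu_1 + \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2),\ \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}} \right)
\]
Here $\sigma_{11} - \sigma_{12}^2/\sigma_{22} = \sigma_{11}(1 - \rho_{12}^2)$. The two terms involving $x_1 - \mu_1$ in the exponent of the bivariate normal density [see Equation (4-6)] become, apart from the multiplicative constant $-1/2(1 - \rho_{12}^2)$,
\[
\frac{(x_1 - \mu_1)^2}{\sigma_{11}} - 2\rho_{12}\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}}
= \frac{1}{\sigma_{11}}\left[ x_1 - \mu_1 - \rho_{12}\frac{\sqrt{\sigma_{11}}}{\sqrt{\sigma_{22}}}(x_2 - \mu_2) \right]^2 - \frac{\rho_{12}^2}{\sigma_{22}}(x_2 - \mu_2)^2
\]
Because $\rho_{12} = \sigma_{12}/\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}$, or $\rho_{12}\sqrt{\sigma_{11}}/\sqrt{\sigma_{22}} = \sigma_{12}/\sigma_{22}$, the complete exponent is
\[
\begin{aligned}
&-\frac{1}{2(1 - \rho_{12}^2)}\left[ \frac{(x_1 - \mu_1)^2}{\sigma_{11}} - 2\rho_{12}\frac{(x_1 - \mu_1)(x_2 - \mu_2)}{\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}} + \frac{(x_2 - \mu_2)^2}{\sigma_{22}} \right] \\
&\quad = -\frac{1}{2\sigma_{11}(1 - \rho_{12}^2)}\left( x_1 - \mu_1 - \frac{\sigma_{12}}{\sigma_{22}}(x_2 - \mu_2) \right)^2 - \frac{1}{2}\frac{(x_2 - \mu_2)^2}{\sigma_{22}}
\end{aligned}
\]
The constant term $2\pi\sqrt{\sigma_{11}\sigma_{22}(1 - \rho_{12}^2)}$ also factors as
\[
\sqrt{2\pi}\sqrt{\sigma_{22}} \times \sqrt{2\pi}\sqrt{\sigma_{11}(1 - \rho_{12}^2)}
\]
Dividing the joint density of $X_1$ and $X_2$ by the marginal density
\[
f(x_2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{22}}}\, e^{-(x_2 - \mu_2)^2/2\sigma_{22}}
\]
and canceling terms yields the conditional density
\[
f(x_1 \mid x_2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_{11}(1 - \rho_{12}^2)}}\, e^{-[x_1 - \mu_1 - (\sigma_{12}/\sigma_{22})(x_2 - \mu_2)]^2 / 2\sigma_{11}(1 - \rho_{12}^2)}, \qquad -\infty < x_1 < \infty
\]
Thus, with our customary notation, the conditional distribution of $X_1$ given that $X_2 = x_2$ is $N(\mu_1 + (\sigma_{12}/\sigma_{22})(x_2 - \mu_2),\ \sigma_{11}(1 - \rho_{12}^2))$. Now, $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} = \sigma_{11} - \sigma_{12}^2/\sigma_{22} = \sigma_{11}(1 - \rho_{12}^2)$ and $\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} = \sigma_{12}/\sigma_{22}$, agreeing with Result 4.6, which we obtained by an indirect method. ■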
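The conclusion of Example 4.7 can be checked numerically: the ratio $f(x_1, x_2)/f(x_2)$ should equal a univariate normal density with mean $\mu_1 + (\sigma_{12}/\sigma_{22})(x_2 - \mu_2)$ and variance $\sigma_{11} - \sigma_{12}^2/\sigma_{22}$. The sketch below does this at one point using only the standard library. The parameter values, the evaluation point, and the function names are hypothetical choices made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Univariate normal density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bivariate_pdf(x1, x2, mu1, mu2, s11, s12, s22):
    """Bivariate normal density written in terms of sigma_11, sigma_12, sigma_22."""
    rho2 = s12 * s12 / (s11 * s22)
    u, v = x1 - mu1, x2 - mu2
    # rho * u * v / (sqrt(s11) * sqrt(s22)) equals s12 * u * v / (s11 * s22).
    q = (u * u / s11 - 2 * s12 * u * v / (s11 * s22) + v * v / s22) / (1 - rho2)
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(s11 * s22 * (1 - rho2)))

def conditional_normal(mu1, mu2, s11, s12, s22, x2):
    """Mean and variance of X1 given X2 = x2 (Result 4.6 with q = 1, p = 2)."""
    mean = mu1 + (s12 / s22) * (x2 - mu2)
    var = s11 - s12 * s12 / s22
    return mean, var

# Hypothetical parameters: rho_12 = 3 / (2 * 3) = 0.5.
mu1, mu2, s11, s12, s22 = 1.0, 2.0, 4.0, 3.0, 9.0
x1, x2 = 0.3, 5.0

cond_mean, cond_var = conditional_normal(mu1, mu2, s11, s12, s22, x2)
ratio = bivariate_pdf(x1, x2, mu1, mu2, s11, s12, s22) / normal_pdf(x2, mu2, s22)
direct = normal_pdf(x1, cond_mean, cond_var)
```

Note that `cond_var` does not involve `x2`, which is the scalar version of the remark that the conditional covariance does not depend on the value of the conditioning variable.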
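For the multivariate normal situation, it is worth emphasizing the following:
1. All conditional distributions are (multivariate) normal.
2. The conditional mean is of the form
\[
\begin{gathered}
\mu_1 + \beta_{1,q+1}(x_{q+1} - \mu_{q+1}) + \cdots + \beta_{1,p}(x_p - \mu_p) \\
\vdots \\
\mu_q + \beta_{q,q+1}(x_{q+1} - \mu_{q+1}) + \cdots + \beta_{q,p}(x_p - \mu_p)
\end{gathered}
\tag{4-9}
\]
where the $\beta$'s are defined by
\[
\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} = \begin{bmatrix}
\beta_{1,q+1} & \beta_{1,q+2} & \cdots & \beta_{1,p} \\
\beta_{2,q+1} & \beta_{2,q+2} & \cdots & \beta_{2,p} \\
\vdots & \vdots & \ddots & \vdots \\
\beta_{q,q+1} & \beta_{q,q+2} & \cdots & \beta_{q,p}
\end{bmatrix}
\]
3. The conditional covariance, $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$, does not depend upon the value(s) of the conditioning variable(s).

We conclude this section by presenting two final properties of multivariate normal random vectors. One has to do with the probability content of the ellipsoids of constant density. The other discusses the distribution of another form of linear combinations.

The chi-square distribution determines the variability of the sample variance $s^2 = s_{11}$ for samples from a univariate normal population. It also plays a basic role in the multivariate case.

Result 4.7. Let $\mathbf{X}$ be distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $|\boldsymbol{\Sigma}| > 0$. Then
(a) $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ is distributed as $\chi_p^2$, where $\chi_p^2$ denotes the chi-square distribution with $p$ degrees of freedom.
(b) The $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution assigns probability $1 - \alpha$ to the solid ellipsoid $\{\mathbf{x}: (\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \le \chi_p^2(\alpha)\}$, where $\chi_p^2(\alpha)$ denotes the upper $(100\alpha)$th percentile of the $\chi_p^2$ distribution.

Proof. We know that $\chi_p^2$ is defined as the distribution of the sum $Z_1^2 + Z_2^2 + \cdots + Z_p^2$, where $Z_1, Z_2, \ldots, Z_p$ are independent $N(0, 1)$ random variables. Next, by the spectral decomposition [see Equations (2-16) and (2-21) with $\mathbf{A} = \boldsymbol{\Sigma}$, and see Result 4.1], $\boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{p} \frac{1}{\lambda_i}\mathbf{e}_i\mathbf{e}_i'$, where $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$, so $\boldsymbol{\Sigma}^{-1}\mathbf{e}_i = (1/\lambda_i)\mathbf{e}_i$. Consequently,
\[
(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})
= \sum_{i=1}^{p} (1/\lambda_i)(\mathbf{X} - \boldsymbol{\mu})'\mathbf{e}_i\mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})
= \sum_{i=1}^{p} (1/\lambda_i)\big(\mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})\big)^2
= \sum_{i=1}^{p} \big[(1/\sqrt{\lambda_i})\,\mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})\big]^2
= \sum_{i=1}^{p} Z_i^2,
\]
for instance. Now, we can write $\mathbf{Z} = \mathbf{A}(\mathbf{X} - \boldsymbol{\mu})$, where
\[
\underset{(p \times 1)}{\mathbf{Z}} = \begin{bmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_p \end{bmatrix}, \qquad
\underset{(p \times p)}{\mathbf{A}} = \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1' \\ \frac{1}{\sqrt{\lambda_2}}\mathbf{e}_2' \\ \vdots \\ \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p' \end{bmatrix}
\]
and $\mathbf{X} - \boldsymbol{\mu}$ is distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$. Therefore, by Result 4.3, $\mathbf{Z} = \mathbf{A}(\mathbf{X} - \boldsymbol{\mu})$ is distributed as $N_p(\mathbf{0}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$, where
\[
\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' =
\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1' \\ \vdots \\ \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p' \end{bmatrix}
\left[ \sum_{i=1}^{p} \lambda_i\mathbf{e}_i\mathbf{e}_i' \right]
\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1 & \cdots & \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p \end{bmatrix}
= \begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1' \\ \vdots \\ \sqrt{\lambda_p}\,\mathbf{e}_p' \end{bmatrix}
\begin{bmatrix} \frac{1}{\sqrt{\lambda_1}}\mathbf{e}_1 & \cdots & \frac{1}{\sqrt{\lambda_p}}\mathbf{e}_p \end{bmatrix}
= \mathbf{I}
\]
By Result 4.5, $Z_1, Z_2, \ldots, Z_p$ are independent standard normal variables, and we conclude that $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has a $\chi_p^2$-distribution.

For Part b, we note that $P[(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le c^2]$ is the probability assigned to the ellipsoid $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le c^2$ by the density $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. But from Part a, $P[(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) \le \chi_p^2(\alpha)] = 1 - \alpha$, and Part b holds. ■

Result 4.7 provides an interpretation of a squared statistical distance. When $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ is the squared statistical distance from $\mathbf{X}$ to the population mean vector $\boldsymbol{\mu}$. If one component has a much larger variance than another, it will contribute less to the squared distance. Moreover, two highly correlated random variables will contribute less than two variables that are nearly uncorrelated. Essentially, the use of the inverse of the covariance matrix (1) standardizes all of the variables and (2) eliminates the effects of correlation. From the proof of Result 4.7,
\[
(\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu}) = Z_1^2 + Z_2^2 + \cdots + Z_p^2
\]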
X
l b e tr ( ll B )/2
0 TJi B 112 I  1 B 1 12
p
( B 1 12 I  1 B 112 ) = :L TJi i =1 2.12.
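\[
|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}| = |\mathbf{B}^{1/2}||\boldsymbol{\Sigma}^{-1}||\mathbf{B}^{1/2}| = |\boldsymbol{\Sigma}^{-1}||\mathbf{B}^{1/2}||\mathbf{B}^{1/2}| = |\boldsymbol{\Sigma}^{-1}||\mathbf{B}| = \frac{1}{|\boldsymbol{\Sigma}|}|\mathbf{B}|
\]
or
\[
\frac{1}{|\boldsymbol{\Sigma}|} = \frac{|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}|}{|\mathbf{B}|} = \frac{\prod_{i=1}^{p} \eta_i}{|\mathbf{B}|}
\]
Combining the results for the trace and the determinant yields
\[
\frac{1}{|\boldsymbol{\Sigma}|^b}\, e^{-\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B})/2}
= \frac{\left( \prod_{i=1}^{p} \eta_i \right)^b}{|\mathbf{B}|^b}\, e^{-\sum_{i=1}^{p} \eta_i/2}
= \frac{1}{|\mathbf{B}|^b} \prod_{i=1}^{p} \eta_i^b e^{-\eta_i/2}
\]
But the function $\eta^b e^{-\eta/2}$ has a maximum, with respect to $\eta$, of $(2b)^b e^{-b}$, occurring at $\eta = 2b$. The choice $\eta_i = 2b$, for each $i$, therefore gives
\[
\frac{1}{|\boldsymbol{\Sigma}|^b}\, e^{-\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{B})/2} \le \frac{1}{|\mathbf{B}|^b}(2b)^{pb} e^{-bp}
\]
The upper bound is uniquely attained when $\boldsymbol{\Sigma} = (1/2b)\mathbf{B}$, since, for this choice,
\[
\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2} = \mathbf{B}^{1/2}(2b)\mathbf{B}^{-1}\mathbf{B}^{1/2} = (2b)\,\mathbf{I}
\]
and
\[
\text{tr}[\boldsymbol{\Sigma}^{-1}\mathbf{B}] = \text{tr}[\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}] = \text{tr}[(2b)\mathbf{I}] = 2bp
\]
Moreover,
\[
\frac{1}{|\boldsymbol{\Sigma}|} = \frac{|\mathbf{B}^{1/2}\boldsymbol{\Sigma}^{-1}\mathbf{B}^{1/2}|}{|\mathbf{B}|} = \frac{|(2b)\mathbf{I}|}{|\mathbf{B}|} = \frac{(2b)^p}{|\mathbf{B}|}
\]
Straightforward substitution for $\text{tr}[\boldsymbol{\Sigma}^{-1}\mathbf{B}]$ and $1/|\boldsymbol{\Sigma}|^b$ yields the bound asserted. ■

The maximum likelihood estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are those values, denoted by $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$, that maximize the function $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ in (4-16). The estimates $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ will depend on the observed values $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ through the summary statistics $\bar{\mathbf{x}}$ and $\mathbf{S}$.

Result 4.11. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from a normal population with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Then
\[
\hat{\boldsymbol{\mu}} = \bar{\mathbf{X}} \quad \text{and} \quad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})' = \frac{(n - 1)}{n}\mathbf{S}
\]
are the maximum likelihood estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively. Their observed values, $\bar{\mathbf{x}}$ and $(1/n)\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$, are called the maximum likelihood estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.

Proof. The exponent in the likelihood function [see Equation (4-16)], apart from the multiplicative factor $-\tfrac{1}{2}$, is [see (4-17)]
\[
\text{tr}\left[ \boldsymbol{\Sigma}^{-1}\left( \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \right) \right] + n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu})
\]
By Result 4.1, $\boldsymbol{\Sigma}^{-1}$ is positive definite, so the distance $(\bar{\mathbf{x}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) > 0$ unless $\boldsymbol{\mu} = \bar{\mathbf{x}}$. Thus, the likelihood is maximized with respect to $\boldsymbol{\mu}$ at $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$. It remains to maximize the likelihood over $\boldsymbol{\Sigma}$. By Result 4.10 with $b = n/2$ and $\mathbf{B} = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$, the maximum occurs at $\hat{\boldsymbol{\Sigma}} = (1/n)\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})'$, as stated. The maximum likelihood estimators are random quantities. They are obtained by replacing the observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ in the expressions for $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ with the corresponding random vectors, $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. ■

The maximum of the likelihood is
\[
L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \frac{1}{(2\pi)^{np/2}}\, e^{-np/2}\, \frac{1}{|\hat{\boldsymbol{\Sigma}}|^{n/2}} \tag{4-18}
\]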
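or, since $|\hat{\boldsymbol{\Sigma}}| = [(n - 1)/n]^p |\mathbf{S}|$,
\[
L(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}}) = \text{constant} \times (\text{generalized variance})^{-n/2} \tag{4-19}
\]
The generalized variance determines the "peakedness" of the likelihood function and, consequently, is a natural measure of variability when the parent population is multivariate normal.

Maximum likelihood estimators possess an invariance property. Let $\hat{\boldsymbol{\theta}}$ be the maximum likelihood estimator of $\boldsymbol{\theta}$, and consider estimating the parameter $h(\boldsymbol{\theta})$, which is a function of $\boldsymbol{\theta}$. Then the maximum likelihood estimate of
\[
h(\boldsymbol{\theta}) \ \ (\text{a function of } \boldsymbol{\theta}) \quad \text{is given by} \quad h(\hat{\boldsymbol{\theta}}) \ \ (\text{same function of } \hat{\boldsymbol{\theta}}) \tag{4-20}
\]
(See [1] and [14].) For example,
1. The maximum likelihood estimator of $\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$ is $\hat{\boldsymbol{\mu}}'\hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol{\mu}}$, where $\hat{\boldsymbol{\mu}} = \bar{\mathbf{X}}$ and $\hat{\boldsymbol{\Sigma}} = ((n - 1)/n)\mathbf{S}$ are the maximum likelihood estimators of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, respectively.
2. The maximum likelihood estimator of $\sqrt{\sigma_{ii}}$ is $\sqrt{\hat{\sigma}_{ii}}$, where
\[
\hat{\sigma}_{ii} = \frac{1}{n}\sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2
\]
is the maximum likelihood estimator of $\sigma_{ii} = \text{Var}(X_i)$.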
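The relation between the maximum likelihood estimate $\hat{\boldsymbol{\Sigma}}$ and the sample covariance matrix $\mathbf{S}$, and the invariance property, can be seen on a small data set. The sketch below reuses the 3 x 3 data matrix of Example 3.9 purely as illustrative numbers, treating its rows as if they were a sample from a normal population (an assumption made here, not a claim of the text). Variable names are hypothetical.

```python
from fractions import Fraction
import math

# Illustrative data: rows are n = 3 observations on p = 3 variables.
X = [[1, 2, 5],
     [4, 1, 6],
     [4, 0, 4]]
n, p = len(X), len(X[0])

# Maximum likelihood estimate of the mean vector: the sample mean.
mu_hat = [Fraction(sum(col), n) for col in zip(*X)]

# Sum of squares and cross products matrix, sum_j (x_j - xbar)(x_j - xbar)'.
W = [[sum((row[i] - mu_hat[i]) * (row[k] - mu_hat[k]) for row in X)
      for k in range(p)] for i in range(p)]

S = [[w / (n - 1) for w in row] for row in W]        # divisor n - 1
sigma_hat = [[w / n for w in row] for row in W]      # divisor n (maximum likelihood)

# Invariance: the MLE of sqrt(sigma_11) is the square root of the MLE of sigma_11.
sd_hat_1 = math.sqrt(sigma_hat[0][0])
```

Each entry of `sigma_hat` equals $(n - 1)/n$ times the corresponding entry of `S`, so the two estimates differ noticeably only for small $n$.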
Sufficient Statistics

From expression (4-15), the joint density depends on the whole set of observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ only through the sample mean $\bar{\mathbf{x}}$ and the sum-of-squares-and-cross-products matrix $\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' = (n - 1)\mathbf{S}$. We express this fact by saying that $\bar{\mathbf{x}}$ and $(n - 1)\mathbf{S}$ (or $\mathbf{S}$) are sufficient statistics.

The importance of sufficient statistics for normal populations is that all of the information about $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ in the data matrix $\mathbf{X}$ is contained in $\bar{\mathbf{x}}$ and $\mathbf{S}$, regardless of the sample size $n$. This generally is not true for nonnormal populations. Since many multivariate techniques begin with sample means and covariances, it is prudent to check on the adequacy of the multivariate normal assumption. (See Section 4.6.) If the data cannot be regarded as multivariate normal, techniques that depend solely on $\bar{\mathbf{x}}$ and $\mathbf{S}$ may be ignoring other useful sample information.
4.4 THE SAMPLING DISTRIBUTION OF $\bar{\mathbf{X}}$ AND $\mathbf{S}$

The tentative assumption that $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ constitute a random sample from a normal population with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ completely determines the sampling distributions of $\bar{\mathbf{X}}$ and $\mathbf{S}$. Here we present the results on the sampling distributions of $\bar{\mathbf{X}}$ and $\mathbf{S}$ by drawing a parallel with the familiar univariate conclusions.

In the univariate case $(p = 1)$, we know that $\bar{X}$ is normal with mean $\mu$ = (population mean) and variance
\[
\frac{1}{n}\sigma^2 = \frac{\text{population variance}}{\text{sample size}}
\]
The result for the multivariate case $(p \ge 2)$ is analogous in that $\bar{\mathbf{X}}$ has a normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $(1/n)\boldsymbol{\Sigma}$.

For the sample variance, recall that $(n - 1)s^2 = \sum_{j=1}^{n} (X_j - \bar{X})^2$ is distributed as $\sigma^2$ times a chi-square variable having $n - 1$ degrees of freedom (d.f.). In turn, this chi-square is the distribution of a sum of squares of independent standard normal random variables. That is, $(n - 1)s^2$ is distributed as $\sigma^2(Z_1^2 + \cdots + Z_{n-1}^2) = (\sigma Z_1)^2 + \cdots + (\sigma Z_{n-1})^2$. The individual terms $\sigma Z_i$ are independently distributed as $N(0, \sigma^2)$. It is this latter form that is suitably generalized to the basic sampling distribution for the sample covariance matrix.
The samplafteirngitsdidistsrciboverutioenr;ofit tihs edefsaimplnedeascovartheisauncem ofmatinrdependent ix is called prtheoducts of multivariate normal random vectWiorss.harSpecit disftirciabluty,ion with d.f. distribution of l y di s t r i b ut e d as whereWethesummarare eachize thiendependent sampling distribution results as fol ows:
Wishart
distribution,
Wm( I I ) = = ·
( 422)
m
m
� z j zJ j =l Np ( 0, I) .
Zj
Becaus e i s unknown, t h e di s t r i b ut i o n of cannot be us e d di r e c t l y t o make idinfsetrriebncesutionaboutof doesHowever , proonvides iThindependent inftoormconsatiotnruabout andstic tfhoer not depend s al l o ws us ct a s t a t i makinForg intfheereprncesesentabout, we recorasdwesomeshalflusretheerinrChapt e r e s u l t s f r o m mul t i v ar i a bl e di s t r i b ut i o n tithseordefyi.nTheitionfoasl oawisunmg profothpere itniedependent s of the Wiprshoarductt disst,ributionPraroeofders cainvedbediforundectlyinfrom I
S
X
S
IL ·
IL ·
/L,
I,
5.
[1 ].
Z j Zj .
Properties of the Wishart Distribution
1. If $\mathbf{A}_1$ is distributed as $W_{m_1}(\mathbf{A}_1 \mid \boldsymbol{\Sigma})$ independently of $\mathbf{A}_2$, which is distributed as $W_{m_2}(\mathbf{A}_2 \mid \boldsymbol{\Sigma})$, then $\mathbf{A}_1 + \mathbf{A}_2$ is distributed as $W_{m_1 + m_2}(\mathbf{A}_1 + \mathbf{A}_2 \mid \boldsymbol{\Sigma})$. That is, the degrees of freedom add.
2. If $\mathbf{A}$ is distributed as $W_m(\mathbf{A} \mid \boldsymbol{\Sigma})$, then $\mathbf{C}\mathbf{A}\mathbf{C}'$ is distributed as $W_m(\mathbf{C}\mathbf{A}\mathbf{C}' \mid \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$. (4-24)
Although we do not have any particular need for the probability density function of the Wishart distribution, it may be of some interest to see its rather complicated form. The density does not exist unless the sample size $n$ is greater than the number of variables $p$. When it does exist, its value at the positive definite matrix $\mathbf{A}$ is
$$w_{n-1}(\mathbf{A} \mid \boldsymbol{\Sigma}) = \frac{|\mathbf{A}|^{(n-p-2)/2}\, e^{-\operatorname{tr}[\mathbf{A}\boldsymbol{\Sigma}^{-1}]/2}}{2^{p(n-1)/2}\, \pi^{p(p-1)/4}\, |\boldsymbol{\Sigma}|^{(n-1)/2} \prod_{i=1}^{p} \Gamma\!\left(\tfrac{1}{2}(n-i)\right)}, \quad \mathbf{A} \text{ positive definite} \tag{4-25}$$
where $\Gamma(\cdot)$ is the gamma function. (See [1].)
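The defining sum in (4-22) and the additivity of degrees of freedom in property 1 can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the helper names, the random seed, and the 2 x 2 lower-triangular factor `chol` (so that $\boldsymbol{\Sigma}$ = `chol` times its transpose) are hypothetical choices, not part of the text. It builds Wishart-type matrices as sums of outer products and checks that pooling the vectors gives the same matrix as adding the two separate sums.

```python
import random

def outer(z):
    """Return the p x p outer product z z' of a vector z."""
    return [[zi * zk for zk in z] for zi in z]

def wishart_sum(vectors):
    """Sum of outer products Z_j Z_j' over the supplied vectors, as in (4-22)."""
    p = len(vectors[0])
    total = [[0.0] * p for _ in range(p)]
    for z in vectors:
        zz = outer(z)
        for i in range(p):
            for k in range(p):
                total[i][k] += zz[i][k]
    return total

def draw_normal_vectors(m, chol, rng):
    """Draw m vectors from N_p(0, Sigma), where Sigma = chol chol'
    and chol is lower triangular (an assumed, hypothetical factor)."""
    p = len(chol)
    draws = []
    for _ in range(m):
        e = [rng.gauss(0.0, 1.0) for _ in range(p)]
        draws.append([sum(chol[i][k] * e[k] for k in range(i + 1)) for i in range(p)])
    return draws

rng = random.Random(0)
chol = [[2.0, 0.0], [0.5, 1.0]]              # hypothetical; Sigma = [[4, 1], [1, 1.25]]
z_first = draw_normal_vectors(5, chol, rng)   # m1 = 5 vectors
z_second = draw_normal_vectors(7, chol, rng)  # m2 = 7 vectors

a1 = wishart_sum(z_first)                     # one draw with 5 d.f.
a2 = wishart_sum(z_second)                    # one draw with 7 d.f.
a_pooled = wishart_sum(z_first + z_second)    # sum of 12 outer products
```

Because `a_pooled` is a sum of 5 + 7 = 12 independent outer products, it is itself a draw with 12 d.f., and it agrees with `a1 + a2` entry by entry up to rounding.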
4.5 LARGE-SAMPLE BEHAVIOR OF $\bar{\mathbf{X}}$ AND $\mathbf{S}$
Suppose the quantity $X$ is determined by a large number of independent causes $V_1, V_2, \ldots, V_n$, where the random variables $V_i$ representing the causes have approximately the same variability. If $X$ is the sum
$$X = V_1 + V_2 + \cdots + V_n$$
then the central limit theorem applies, and we conclude that $X$ has a distribution that is nearly normal. This is true for virtually any parent distribution of the $V_i$'s, provided that $n$ is large enough.
The univariate central limit theorem also tells us that the sampling distribution of the sample mean, $\bar{X}$, for a large sample size is nearly normal, whatever the form of the underlying population distribution. A similar result holds for many other important univariate statistics.
It turns out that certain multivariate statistics, like $\bar{\mathbf{X}}$ and $\mathbf{S}$, have large-sample properties analogous to their univariate counterparts. As the sample size is increased without bound, certain regularities govern the sampling variation in $\bar{\mathbf{X}}$ and $\mathbf{S}$, irrespective of the form of the parent population. Therefore, the conclusions presented in this section do not require multivariate normal populations. The only requirements are that the parent population, whatever its form, have a mean $\boldsymbol{\mu}$ and a finite covariance $\boldsymbol{\Sigma}$.
Result 4.12 (Law of large numbers). Let $Y_1, Y_2, \ldots, Y_n$ be independent observations from a population with mean $E(Y_i) = \mu$. Then
$$\bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n}$$
converges in probability to $\mu$ as $n$ increases without bound. That is, for any prescribed accuracy $\varepsilon > 0$, $P[-\varepsilon < \bar{Y} - \mu < \varepsilon]$ approaches unity as $n \to \infty$.
Proof. See [9]. ■
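As a direct consequence of the law of large numbers, which says that each $\bar{X}_i$ converges in probability to $\mu_i$, $i = 1, 2, \ldots, p$,
$$\bar{\mathbf{X}} \text{ converges in probability to } \boldsymbol{\mu} \tag{4-26}$$
Also, each sample covariance $s_{ik}$ converges in probability to $\sigma_{ik}$, $i, k = 1, 2, \ldots, p$, and
$$\mathbf{S} \ (\text{or } \hat{\boldsymbol{\Sigma}} = \mathbf{S}_n) \text{ converges in probability to } \boldsymbol{\Sigma} \tag{4-27}$$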
Statement (4-27) follows from writing
$$(n-1)s_{ik} = \sum_{j=1}^{n} (X_{ji} - \bar{X}_i)(X_{jk} - \bar{X}_k)$$
$$= \sum_{j=1}^{n} (X_{ji} - \mu_i + \mu_i - \bar{X}_i)(X_{jk} - \mu_k + \mu_k - \bar{X}_k)$$
$$= \sum_{j=1}^{n} (X_{ji} - \mu_i)(X_{jk} - \mu_k) - n(\bar{X}_i - \mu_i)(\bar{X}_k - \mu_k)$$
Letting $Y_j = (X_{ji} - \mu_i)(X_{jk} - \mu_k)$, with $E(Y_j) = \sigma_{ik}$, we see that the first term in $s_{ik}$ converges to $\sigma_{ik}$ and the second term converges to zero, by applying the law of large numbers.
The practical interpretation of statements (4-26) and (4-27) is that, with high probability, $\bar{\mathbf{X}}$ will be close to $\boldsymbol{\mu}$ and $\mathbf{S}$ will be close to $\boldsymbol{\Sigma}$ whenever the sample size is large. The statement concerning $\bar{\mathbf{X}}$ is made even more precise by a multivariate version of the central limit theorem.
Result 4.13 (The central limit theorem). Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be independent observations from any population with mean $\boldsymbol{\mu}$ and finite covariance $\boldsymbol{\Sigma}$. Then
$$\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}) \text{ has an approximate } N_p(\mathbf{0}, \boldsymbol{\Sigma}) \text{ distribution}$$
for large sample sizes. Here $n$ should also be large relative to $p$.
Proof. See [1]. ■
The approximation provided by the central limit theorem applies to discrete, as well as continuous, multivariate populations. Mathematically, the limit is exact, and the approach to normality is often fairly rapid. Moreover, from the results in Section 4.4, we know that $\bar{\mathbf{X}}$ is exactly normally distributed when the underlying population is normal. Thus, we would expect the central limit theorem approximation to be quite good for moderate $n$ when the parent population is nearly normal.
As we have seen, when $n$ is large, $\mathbf{S}$ is close to $\boldsymbol{\Sigma}$ with high probability. Consequently, replacing $\boldsymbol{\Sigma}$ by $\mathbf{S}$ in the approximating normal distribution for $\bar{\mathbf{X}}$ will have a negligible effect on subsequent probability calculations.
Result 4.7 can be used to show that $n(\bar{\mathbf{X}} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\bar{\mathbf{X}} - \boldsymbol{\mu})$ has a $\chi_p^2$ distribution when $\bar{\mathbf{X}}$ is distributed as $N_p(\boldsymbol{\mu}, (1/n)\boldsymbol{\Sigma})$ or, equivalently, when $\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu})$ has an $N_p(\mathbf{0}, \boldsymbol{\Sigma})$ distribution. The $\chi_p^2$ distribution is approximately the sampling distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\bar{\mathbf{X}} - \boldsymbol{\mu})$ when $\bar{\mathbf{X}}$ is approximately normally distributed. Replacing $\boldsymbol{\Sigma}^{-1}$ by $\mathbf{S}^{-1}$ does not seriously affect this approximation for $n$ large and much greater than $p$.
We summarize the major conclusions of this section as follows:
Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be independent observations from a population with mean $\boldsymbol{\mu}$ and finite (nonsingular) covariance $\boldsymbol{\Sigma}$. Then
$$\sqrt{n}\,(\bar{\mathbf{X}} - \boldsymbol{\mu}) \text{ is approximately } N_p(\mathbf{0}, \boldsymbol{\Sigma})$$
and
$$n(\bar{\mathbf{X}} - \boldsymbol{\mu})' \mathbf{S}^{-1} (\bar{\mathbf{X}} - \boldsymbol{\mu}) \text{ is approximately } \chi_p^2 \tag{4-28}$$
for $n - p$ large.
In the next three sections, we consider ways of verifying the assumption of normality and methods for transforming nonnormal observations into observations that are approximately normal.
4.6 ASSESSING THE ASSUMPTION OF NORMALITY
As we have pointed out, most of the statistical techniques discussed in subsequent chapters assume that each vector observation $\mathbf{X}_j$ comes from a multivariate normal distribution. On the other hand, in situations where the sample size is large and the techniques depend solely on the behavior of $\bar{\mathbf{X}}$, or distances involving $\bar{\mathbf{X}}$ of the form $n(\bar{\mathbf{X}} - \boldsymbol{\mu})' \mathbf{S}^{-1} (\bar{\mathbf{X}} - \boldsymbol{\mu})$, the assumption of normality for the individual observations is less crucial. But to some degree, the quality of inferences made by these methods depends on how closely the true parent population resembles the multivariate normal form. It is imperative, then, that procedures exist for detecting cases where the data exhibit moderate to extreme departures from what is expected under multivariate normality.
We want to answer this question: Do the observations $\mathbf{X}_j$ appear to violate the assumption that they came from a normal population? Based on the properties of normal distributions, we know that all linear combinations of normal variables are normal and the contours of the multivariate normal density are ellipsoids. Therefore, we address these questions:
1. Do the marginal distributions of the elements of $\mathbf{X}$ appear to be normal? What about a few linear combinations of the components $X_i$?
2. Do the scatter plots of pairs of observations on different characteristics give the elliptical appearance expected from normal populations?
3. Are there any "wild" observations that should be checked for accuracy?
It will become clear that our investigations of normality will concentrate on the behavior of the observations in one or two dimensions (for example, marginal distributions and scatter plots). As might be expected, it has proved difficult to construct a "good" overall test of joint normality in more than two dimensions because of the large number of things that can go wrong. To some extent, we must pay a price for concentrating on univariate and bivariate examinations of normality: We can never be sure that we have not missed some feature that is revealed only in higher dimensions. (It is possible, for example, to construct a nonnormal bivariate distribution with normal marginals. [See Exercise 4.8.]) Yet many types of nonnormality are often reflected in the marginal distributions and scatter plots. Moreover, for most practical work, one-dimensional and two-dimensional investigations are ordinarily sufficient. Fortunately, pathological data sets that are normal in lower dimensional representations, but nonnormal in higher dimensions, are not frequently encountered in practice.
Evaluating the Normality of the Univariate Marginal Distributions
Dot diagrams for smaller $n$ and histograms for $n > 25$ or so help reveal situations where one tail of a univariate distribution is much longer than the other. If the histogram for a variable $X_i$ appears reasonably symmetric, we can check further by counting the number of observations in certain intervals. A univariate normal distribution assigns probability .683 to the interval $(\mu_i - \sqrt{\sigma_{ii}},\ \mu_i + \sqrt{\sigma_{ii}})$ and probability .954 to the interval $(\mu_i - 2\sqrt{\sigma_{ii}},\ \mu_i + 2\sqrt{\sigma_{ii}})$. Consequently, with a large sample size $n$, we expect the observed proportion $\hat{p}_{i1}$ of the observations lying in the interval $(\bar{x}_i - \sqrt{s_{ii}},\ \bar{x}_i + \sqrt{s_{ii}})$ to be about .683. Similarly, the observed proportion $\hat{p}_{i2}$ of the observations in $(\bar{x}_i - 2\sqrt{s_{ii}},\ \bar{x}_i + 2\sqrt{s_{ii}})$ should be about .954. Using the normal approximation to the sampling distribution of $\hat{p}_i$ (see [9]), we observe that either
$$|\hat{p}_{i1} - .683| > 3\sqrt{\frac{(.683)(.317)}{n}} = \frac{1.396}{\sqrt{n}}$$
or
$$|\hat{p}_{i2} - .954| > 3\sqrt{\frac{(.954)(.046)}{n}} = \frac{.628}{\sqrt{n}} \tag{4-29}$$
would indicate departures from an assumed normal distribution for the $i$th characteristic. When the observed proportions are too small, parent distributions with thicker tails than the normal are suggested.
Plots are always useful devices in any data analysis. Special plots called Q-Q plots can be used to assess the assumption of normality. These plots can be made for the marginal distributions of the sample observations on each variable. They are, in effect, plots of the sample quantile versus the quantile one would expect to observe if the observations actually were normally distributed. When the points lie very nearly along a straight line, the normality assumption remains tenable. Normality is suspect if the points deviate from a straight line. Moreover, the pattern of the deviations can provide clues about the nature of the nonnormality. Once the reasons for the nonnormality are identified, corrective action is often possible. (See Section 4.8.)
To simplify notation, let $x_1, x_2, \ldots, x_n$ represent $n$ observations on any single characteristic $X_i$. Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ represent these observations after they are ordered according to magnitude. For example, $x_{(2)}$ is the second smallest observation and $x_{(n)}$ is the largest observation. The $x_{(j)}$'s are the sample quantiles. When the $x_{(j)}$ are distinct, exactly $j$ observations are less than or equal to $x_{(j)}$. (This is theoretically always true when the observations are of the continuous type, which we usually assume.) The proportion $j/n$ of the sample at or to the left of $x_{(j)}$ is often approximated by $(j - \tfrac{1}{2})/n$ for analytical convenience.¹
For a standard normal distribution, the quantiles $q_{(j)}$ are defined by the relation
$$P[Z \le q_{(j)}] = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = p_{(j)} = \frac{j - \tfrac{1}{2}}{n} \tag{4-30}$$
(See Table 1 in the appendix.) Here $p_{(j)}$ is the probability of getting a value less than or equal to $q_{(j)}$ in a single drawing from a standard normal population.
The idea is to look at the pairs of quantiles $(q_{(j)}, x_{(j)})$ with the same associated cumulative probability $(j - \tfrac{1}{2})/n$. If the data arise from a normal population, the pairs $(q_{(j)}, x_{(j)})$ will be approximately linearly related, since $\sigma q_{(j)} + \mu$ is nearly the expected sample quantile.²
Example 4.9
(Constructing a Q-Q plot)
A sample of $n = 10$ observations gives the values in the following table:

Ordered observations    Probability levels    Standard normal quantiles
$x_{(j)}$               $(j - 1/2)/n$         $q_{(j)}$
-1.00                   .05                   -1.645
 -.10                   .15                   -1.036
  .16                   .25                    -.674
  .41                   .35                    -.385
  .62                   .45                    -.125
  .80                   .55                     .125
 1.26                   .65                     .385
 1.54                   .75                     .674
 1.71                   .85                    1.036
 2.30                   .95                    1.645
Here, for example,
$$P[Z \le .385] = \int_{-\infty}^{.385} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = .65$$
[See (4-30).]
Let us now construct the Q-Q plot and comment on its appearance. The Q-Q plot for the foregoing data, which is a plot of the ordered data $x_{(j)}$ against the normal quantiles $q_{(j)}$, is shown in Figure 4.5. The pairs of points $(q_{(j)}, x_{(j)})$ lie very nearly along a straight line, and we would not reject the notion that these data are normally distributed, particularly with a sample size as small as $n = 10$. ■

Figure 4.5 A Q-Q plot for the data in Example 4.9.

¹ The $\tfrac{1}{2}$ in the numerator of $(j - \tfrac{1}{2})/n$ is a "continuity" correction. Some authors (see [5] and [10]) have suggested replacing $(j - \tfrac{1}{2})/n$ by $(j - \tfrac{3}{8})/(n + \tfrac{1}{4})$.
² A better procedure is to plot $(m_{(j)}, x_{(j)})$, where $m_{(j)} = E(z_{(j)})$ is the expected value of the $j$th order statistic in a sample of size $n$ from a standard normal distribution. (See [12] for further discussion.)

The calculations required for Q-Q plots are easily programmed for electronic computers. Many statistical programs available commercially are capable of producing such plots.
The steps leading to a Q-Q plot are as follows:
1. Order the original observations to get $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and their corresponding probability values $(1 - \tfrac{1}{2})/n, (2 - \tfrac{1}{2})/n, \ldots, (n - \tfrac{1}{2})/n$;
2. Calculate the standard normal quantiles $q_{(1)}, q_{(2)}, \ldots, q_{(n)}$; and
3. Plot the pairs of observations $(q_{(1)}, x_{(1)}), (q_{(2)}, x_{(2)}), \ldots, (q_{(n)}, x_{(n)})$, and examine the "straightness" of the outcome.
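The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `qq_pairs` is hypothetical, and the standard library's `statistics.NormalDist` stands in for Table 1 in the appendix. The data are the ten observations of Example 4.9.

```python
from statistics import NormalDist

def qq_pairs(observations):
    """Return the pairs (q_(j), x_(j)) used in a normal Q-Q plot."""
    ordered = sorted(observations)                 # step 1: order the data
    n = len(ordered)
    standard_normal = NormalDist()                 # mean 0, standard deviation 1
    pairs = []
    for j, x in enumerate(ordered, start=1):
        level = (j - 0.5) / n                      # probability level (j - 1/2)/n
        q = standard_normal.inv_cdf(level)         # step 2: standard normal quantile
        pairs.append((q, x))
    return pairs                                   # step 3: plot these pairs

example_4_9 = [-1.00, -0.10, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30]
pairs = qq_pairs(example_4_9)
for q, x in pairs:
    print(f"{q:7.3f} {x:6.2f}")
```

The quantiles produced this way agree with the tabled values of Example 4.9 to within rounding. Any plotting tool can then be used to graph the second coordinate against the first.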
Q-Q plots are not particularly informative unless the sample size is moderate to large, for instance, $n \ge 20$. There can be quite a bit of variability in the straightness of the Q-Q plot for small samples, even when the observations are known to come from a normal population.
Example 4.10 (A Q-Q plot for radiation data)
The quality-control department of a manufacturer of microwave ovens is required by the federal government to monitor the amount of radiation emitted when the doors of the ovens are closed. Observations of the radiation emitted through closed doors of $n = 42$ randomly selected ovens were made. The data are listed in Table 4.1.
In order to determine the probability of exceeding a prespecified tolerance level, a probability distribution for the radiation emitted was needed. Can we regard the observations here as being normally distributed?
A computer was used to assemble the pairs $(q_{(j)}, x_{(j)})$ and construct the Q-Q plot, pictured in Figure 4.6. It appears from the plot that the data as a whole are not normally distributed. The points indicated by the circled locations in the figure are outliers: values that are too large relative to the rest of the observations.
TABLE 4.1 RADIATION DATA (DOOR CLOSED)

Oven no.  Radiation    Oven no.  Radiation    Oven no.  Radiation
    1       .15           16       .10           31       .10
    2       .09           17       .02           32       .20
    3       .18           18       .10           33       .11
    4       .10           19       .01           34       .30
    5       .05           20       .40           35       .02
    6       .12           21       .10           36       .20
    7       .08           22       .05           37       .20
    8       .05           23       .03           38       .30
    9       .08           24       .05           39       .30
   10       .10           25       .15           40       .40
   11       .07           26       .10           41       .30
   12       .02           27       .15           42       .05
   13       .01           28       .09
   14       .10           29       .08
   15       .10           30       .18

Source: Data courtesy of J. D. Cryer.
Figure 4.6 A Q-Q plot of the radiation data (door closed) from Example 4.10. (The integers in the plot indicate the number of points occupying the same location.)
For the radiation data, several observations are equal. When this occurs, those observations with like values are associated with the same normal quantile. This quantile is calculated using the average of the quantiles the tied observations would have if they all differed slightly. ■
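The treatment of ties described above can be sketched as follows. This is a hedged illustration: the function name `tie_averaged_pairs` is hypothetical, `statistics.NormalDist` is used in place of a printed normal table, and the data are the 42 radiation readings of Table 4.1. Each run of equal observations receives the average of the quantiles the run would have received had the values differed slightly.

```python
from statistics import NormalDist

def tie_averaged_pairs(observations):
    """Return (quantile, observation) pairs; tied observations share the
    average of the standard normal quantiles assigned to their positions."""
    ordered = sorted(observations)
    n = len(ordered)
    standard_normal = NormalDist()
    raw = [standard_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    averaged = []
    start = 0
    while start < n:
        end = start
        # Extend the run while the next ordered value equals the current one.
        while end + 1 < n and ordered[end + 1] == ordered[start]:
            end += 1
        run_length = end - start + 1
        mean_q = sum(raw[start:end + 1]) / run_length
        averaged.extend([mean_q] * run_length)
        start = end + 1
    return list(zip(averaged, ordered))

# Radiation data (door closed) from Table 4.1, ovens 1 through 42.
radiation = [
    .15, .09, .18, .10, .05, .12, .08, .05, .08, .10, .07, .02, .01, .10, .10,
    .10, .02, .10, .01, .40, .10, .05, .03, .05, .15, .10, .15, .09, .08, .18,
    .10, .20, .11, .30, .02, .20, .20, .30, .30, .40, .30, .05,
]
radiation_pairs = tie_averaged_pairs(radiation)
```

With this convention the two readings of .40 are plotted at the same horizontal position, as are the two readings of .01.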
The straightness of the Q-Q plot can be measured by calculating the correlation coefficient of the points in the plot. The correlation coefficient for the Q-Q plot is defined by
$$r_Q = \frac{\sum_{j=1}^{n} (x_{(j)} - \bar{x})(q_{(j)} - \bar{q})}{\sqrt{\sum_{j=1}^{n} (x_{(j)} - \bar{x})^2}\ \sqrt{\sum_{j=1}^{n} (q_{(j)} - \bar{q})^2}} \tag{4-31}$$
and a powerful test of normality can be based on it. (See [5], [10], and [11].) Formally, we reject the hypothesis of normality at level of significance $\alpha$ if $r_Q$ falls below the appropriate value in Table 4.2.
TABLE 4.2
CRITICAL POINTS FOR THE Q-Q PLOT CORRELATION COEFFICIENT TEST FOR NORMALITY

Sample size        Significance levels $\alpha$
    $n$          .01        .05        .10
      5         .8299      .8788      .9032
     10         .8801      .9198      .9351
     15         .9126      .9389      .9503
     20         .9269      .9508      .9604
     25         .9410      .9591      .9665
     30         .9479      .9652      .9715
     35         .9538      .9682      .9740
     40         .9599      .9726      .9771
     45         .9632      .9749      .9792
     50         .9671      .9768      .9809
     55         .9695      .9787      .9822
     60         .9720      .9801      .9836
     75         .9771      .9838      .9866
    100         .9822      .9873      .9895
    150         .9879      .9913      .9928
    200         .9905      .9931      .9942
    300         .9935      .9953      .9960

Example 4.11 (A correlation coefficient test for normality)
Let us calculate the correlation coefficient $r_Q$ from the Q-Q plot of Example 4.9 (see Figure 4.5) and test for normality.
Using the information from Example 4.9, we have $\bar{x} = .770$ and
$$\sum_{j=1}^{10} (x_{(j)} - \bar{x})\, q_{(j)} = 8.584, \quad \sum_{j=1}^{10} (x_{(j)} - \bar{x})^2 = 8.472, \quad \text{and} \quad \sum_{j=1}^{10} q_{(j)}^2 = 8.795$$
Since always, $\bar{q} = 0$,
$$r_Q = \frac{8.584}{\sqrt{8.472}\,\sqrt{8.795}} = .994$$
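The arithmetic behind $r_Q$ can be checked with a short Python sketch. The function name `qq_correlation` is hypothetical, and the quantiles come from `statistics.NormalDist` rather than the rounded table entries, so the intermediate sums differ slightly from those printed above while the coefficient agrees to three decimals.

```python
import math
from statistics import NormalDist

def qq_correlation(observations):
    """Correlation coefficient r_Q of the points in a normal Q-Q plot, as in (4-31)."""
    xs = sorted(observations)
    n = len(xs)
    standard_normal = NormalDist()
    qs = [standard_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]
    x_bar = sum(xs) / n
    q_bar = sum(qs) / n            # zero up to rounding, by symmetry of the levels
    numerator = sum((x - x_bar) * (q - q_bar) for x, q in zip(xs, qs))
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_q = sum((q - q_bar) ** 2 for q in qs)
    return numerator / (math.sqrt(sum_sq_x) * math.sqrt(sum_sq_q))

example_4_9 = [-1.00, -0.10, 0.16, 0.41, 0.62, 0.80, 1.26, 1.54, 1.71, 2.30]
r_q = qq_correlation(example_4_9)
print(round(r_q, 3))
```

The result can then be compared with the Table 4.2 entry for the chosen $n$ and $\alpha$.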
A test of normality at the 10% level of significance is provided by referring $r_Q = .994$ to the entry in Table 4.2 corresponding to $n = 10$ and $\alpha = .10$. This entry is .9351. Since $r_Q > .9351$, we do not reject the hypothesis of normality. ■
Instead of $r_Q$, some software packages evaluate the original statistic proposed by Shapiro and Wilk [11]. Its correlation form corresponds to replacing $q_{(j)}$ by a function of the expected value of standard normal-order statistics and their covariances. We prefer $r_Q$ because it corresponds directly to the points in the normal-scores plot. For large sample sizes, the two statistics are nearly the same (see [12]), so either can be used to judge lack of fit.
Linear combinations of more than one characteristic can be investigated. Many statisticians suggest plotting
$$\hat{\mathbf{e}}_1' \mathbf{x}_j \quad \text{where} \quad \mathbf{S}\hat{\mathbf{e}}_1 = \hat{\lambda}_1 \hat{\mathbf{e}}_1$$
in which $\hat{\lambda}_1$ is the largest eigenvalue of $\mathbf{S}$. Here $\mathbf{x}_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$ is the $j$th observation on the $p$ variables $X_1, X_2, \ldots, X_p$. The linear combination $\hat{\mathbf{e}}_p' \mathbf{x}_j$ corresponding to the smallest eigenvalue is also frequently singled out for inspection. (See Chapter 8 and [6] for further details.)
Evaluating Bivariate Normality
We would like to check on the assumption of normality for all distributions of $2, 3, \ldots, p$ dimensions. However, as we have pointed out, for practical work it is usually sufficient to investigate the univariate and bivariate distributions. We considered univariate marginal distributions earlier. It is now of interest to examine the bivariate case.
In Chapter 1, we described scatter plots for pairs of characteristics. If the observations were generated from a multivariate normal distribution, each bivariate distribution would be normal, and the contours of constant density would be ellipses. The scatter plot should conform to this structure by exhibiting an overall pattern that is nearly elliptical.
Moreover, by Result 4.7, the set of bivariate outcomes $\mathbf{x}$ such that
$$(\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \le \chi_2^2(.5)$$
has probability .5. Thus, we should expect roughly the same percentage, 50%, of sample observations to lie in the ellipse given by
$$\{\text{all } \mathbf{x} \text{ such that } (\mathbf{x} - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x} - \bar{\mathbf{x}}) \le \chi_2^2(.5)\}$$
where we have replaced $\boldsymbol{\mu}$ by its estimate $\bar{\mathbf{x}}$ and $\boldsymbol{\Sigma}^{-1}$ by its estimate $\mathbf{S}^{-1}$. If not, the normality assumption is suspect.
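Example 4.12 (Checking bivariate normality)
Although not a random sample, data consisting of the pairs of observations ($x_1$ = sales, $x_2$ = profits) for the 10 largest U.S. industrial corporations are listed in Exercise 1.4. These data give
$$\bar{\mathbf{x}} = \begin{bmatrix} 62{,}309 \\ 2927 \end{bmatrix}, \quad \mathbf{S} = \begin{bmatrix} 10{,}005.20 & 255.76 \\ 255.76 & 14.30 \end{bmatrix} \times 10^{5}$$
so
$$\mathbf{S}^{-1} = \frac{1}{77{,}661.18} \begin{bmatrix} 14.30 & -255.76 \\ -255.76 & 10{,}005.20 \end{bmatrix} \times 10^{-5} = \begin{bmatrix} .000184 & -.003293 \\ -.003293 & .128831 \end{bmatrix} \times 10^{-5}$$
From Table 3 in the appendix, $\chi_2^2(.5) = 1.39$. Thus, any observation $\mathbf{x}' = [x_1, x_2]$ satisfying
$$\begin{bmatrix} x_1 - 62{,}309 \\ x_2 - 2927 \end{bmatrix}' \begin{bmatrix} .000184 & -.003293 \\ -.003293 & .128831 \end{bmatrix} \begin{bmatrix} x_1 - 62{,}309 \\ x_2 - 2927 \end{bmatrix} \times 10^{-5} \le 1.39$$
is on or inside the estimated 50% contour. Otherwise the observation is outside this contour. The first pair of observations in Exercise 1.4 is $[x_1, x_2]' = [126{,}974,\ 4224]$. In this case
$$\begin{bmatrix} 126{,}974 - 62{,}309 \\ 4224 - 2927 \end{bmatrix}' \begin{bmatrix} .000184 & -.003293 \\ -.003293 & .128831 \end{bmatrix} \begin{bmatrix} 126{,}974 - 62{,}309 \\ 4224 - 2927 \end{bmatrix} \times 10^{-5} = 4.34 > 1.39$$
and this point falls outside the 50% contour. The remaining nine points have generalized distances from $\bar{\mathbf{x}}$ of 1.20, .59, .83, 1.88, 1.01, 1.02, 5.33, .81, and .97, respectively. Since seven of these distances are less than 1.39, a proportion, .70, of the data falls within the 50% contour. If the observations were normally distributed, we would expect about half, or 5, of them to be within this contour. This difference in proportions would ordinarily provide evidence against bivariate normality; however, a sample size of 10 is too small to reach this conclusion. (See also Example 4.13.) ■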
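The generalized distance used in Example 4.12 can be reproduced with a small Python sketch. The function name `squared_distance_2d` is hypothetical, and the explicit 2 x 2 inverse is a convenience for the bivariate case only; for larger $p$ a linear algebra library would be the natural choice. The mean vector, covariance matrix, and first observation are the values quoted in the example.

```python
def squared_distance_2d(x, mean, cov):
    """Squared generalized distance (x - mean)' cov^{-1} (x - mean) for p = 2."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2 x 2 matrix: swap the diagonal, negate the off-diagonal, divide by det.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))

x_bar = [62309.0, 2927.0]
s = [[10005.20e5, 255.76e5],
     [255.76e5, 14.30e5]]
first_pair = [126974.0, 4224.0]

d2_first = squared_distance_2d(first_pair, x_bar, s)
print(round(d2_first, 2))
inside_50_percent_contour = d2_first <= 1.39   # 1.39 is the chi-square(2) median
```

Here the distance is about 4.34, so the first corporation lies outside the estimated 50% contour, in agreement with the example.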
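A somewhat more formal method for judging the joint normality of a data set is based on the squared generalized distances
$$d_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_j - \bar{\mathbf{x}}), \quad j = 1, 2, \ldots, n \tag{4-32}$$
where $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ are the sample observations. The procedure is not limited to the bivariate case; it can be used for all $p \ge 2$.
When the parent population is multivariate normal and both $n$ and $n - p$ are greater than 25 or 30, each of the squared distances $d_1^2, d_2^2, \ldots, d_n^2$ should behave like a chi-square random variable. [See Result 4.7 and Equations (4-26) and (4-27).] Although these distances are not independent or exactly chi-square distributed, it is helpful to plot them as if they were. The resulting plot is called a chi-square plot or gamma plot, because the chi-square distribution is a special case of the more general gamma distribution. (See [6].)
To construct the chi-square plot,
1. Order the squared distances in (4-32) from smallest to largest as $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$.
2. Graph the pairs $(q_{c,p}((j - \tfrac{1}{2})/n),\ d_{(j)}^2)$, where $q_{c,p}((j - \tfrac{1}{2})/n)$ is the $100(j - \tfrac{1}{2})/n$ quantile of the chi-square distribution with $p$ degrees of freedom.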
Quantiles are specified in terms of proportions, whereas percentiles are specified in terms of percentages.
The quantiles $q_{c,p}((j - \tfrac{1}{2})/n)$ are related to the upper percentiles of a chi-squared distribution. In particular, $q_{c,p}((j - \tfrac{1}{2})/n) = \chi_p^2((n - j + \tfrac{1}{2})/n)$.
The plot should resemble a straight line through the origin having slope 1. A systematic curved pattern suggests lack of normality. One or two points far above the line indicate large distances, or outlying observations, that merit further attention.
Example 4.13 (Constructing a chi-square plot)
Let us construct a chi-square plot of the generalized distances given in Example 4.12. The ordered distances and the corresponding chi-square percentiles for $p = 2$ and $n = 10$ are listed in the following table:

 $j$    $d_{(j)}^2$    $q_{c,2}((j - 1/2)/10)$
  1        .59              .10
  2        .81              .33
  3        .83              .58
  4        .97              .86
  5       1.01             1.20
  6       1.02             1.60
  7       1.20             2.10
  8       1.88             2.77
  9       4.34             3.79
 10       5.33             5.99
Figure 4.7 A chi-square plot of the ordered distances in Example 4.13. (Horizontal axis: $q_{c,2}((j - \tfrac{1}{2})/10)$; vertical axis: $d_{(j)}^2$.)
A graph of the pairs $(q_{c,2}((j - \tfrac{1}{2})/10),\ d_{(j)}^2)$ is shown in Figure 4.7. The points in Figure 4.7 do not lie along the line with slope 1. The smallest distances appear to be too large and the middle distances appear to be too small, relative to the distances expected from bivariate normal populations for samples of size 10. These data do not appear to be bivariate normal; however, the sample size is small, and it is difficult to reach a definitive conclusion. If further analysis of the data were required, it might be reasonable to transform them to observations more nearly bivariate normal. Appropriate transformations are discussed in Section 4.8. ■
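The chi-square quantiles tabulated in Example 4.13 can be reproduced without tables in the special case $p = 2$, because a chi-square variable with 2 degrees of freedom has distribution function $1 - e^{-x/2}$, so its quantile at probability level $a$ is $-2\ln(1 - a)$. The Python sketch below relies on that closed form; it is an assumption-laden illustration (the function names are hypothetical), and for general $p$ a statistical library's chi-square quantile function would be needed instead.

```python
import math

def chi_square_quantile_2df(level):
    """Quantile of the chi-square distribution with 2 d.f. (closed form, p = 2 only)."""
    return -2.0 * math.log(1.0 - level)

def chi_square_plot_pairs_2df(squared_distances):
    """Return the pairs (q_{c,2}((j - 1/2)/n), d^2_(j)) for a bivariate chi-square plot."""
    ordered = sorted(squared_distances)
    n = len(ordered)
    return [(chi_square_quantile_2df((j - 0.5) / n), d2)
            for j, d2 in enumerate(ordered, start=1)]

# Squared generalized distances from Example 4.12.
distances = [4.34, 1.20, 0.59, 0.83, 1.88, 1.01, 1.02, 5.33, 0.81, 0.97]
plot_pairs = chi_square_plot_pairs_2df(distances)
for q, d2 in plot_pairs:
    print(f"{q:5.2f} {d2:5.2f}")

# Proportion of distances inside the estimated 50% contour.
median_cutoff = chi_square_quantile_2df(0.5)
fraction_inside = sum(1 for d2 in distances if d2 <= median_cutoff) / len(distances)
```

Plotting the second coordinate against the first gives the chi-square plot; `fraction_inside` gives the proportion of points within the 50% contour discussed in Example 4.12.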
In addition to inspecting univariate plots and scatter plots, we should check multivariate normality by constructing a chi-squared or $d^2$ plot. Figure 4.8 contains $d^2$ plots based on two computer-generated samples of 30 four-variate normal random vectors. As expected, the plots have a straight-line pattern, but the top two or three ordered squared distances are quite variable.
The next example contains a real data set comparable to the simulated data set that produced the plots in Figure 4.8.
Figure 4.8 Chi-square plots for two simulated four-variate normal data sets with $n = 30$. (Horizontal axes: $q_{c,4}((j - \tfrac{1}{2})/30)$; vertical axes: $d_{(j)}^2$.)
Example 4.14 (Evaluating multivariate normality for a four-variable data set)
The data in Table 4.3 were obtained by taking four different measures of stiffness, $x_1$, $x_2$, $x_3$, and $x_4$, of each of $n = 30$ boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances $d_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_j - \bar{\mathbf{x}})$ are also presented in the table.

TABLE 4.3 FOUR MEASUREMENTS OF STIFFNESS

Observation no.   $x_1$   $x_2$   $x_3$   $x_4$    $d^2$
       1          1889    1651    1561    1778      .60
       2          2403    2048    2087    2197     5.48
       3          2119    1700    1815    2222     7.62
       4          1645    1627    1110    1533     5.21
       5          1976    1916    1614    1883     1.40
       6          1712    1712    1439    1546     2.22
       7          1943    1685    1271    1671     4.99
       8          2104    1820    1717    1874     1.49
       9          2983    2794    2412    2581    12.26
      10          1745    1600    1384    1508      .77
      11          1710    1591    1518    1667     1.93
      12          2046    1907    1627    1898      .46
      13          1840    1841    1595    1741     2.70
      14          1867    1685    1493    1678      .13
      15          1859    1649    1389    1714     1.08
      16          1954    2149    1180    1281    16.85
      17          1325    1170    1002    1176     3.50
      18          1419    1371    1252    1308     3.99
      19          1828    1634    1602    1755     1.36
      20          1725    1594    1313    1646     1.46
      21          2276    2189    1547    2111     9.90
      22          1899    1614    1422    1477     5.06
      23          1633    1513    1290    1516      .80
      24          2061    1867    1646    2037     2.54
      25          1856    1493    1356    1533     4.58
      26          1727    1412    1238    1469     3.40
      27          2168    1896    1701    1834     2.38
      28          1655    1675    1414    1597     3.00
      29          2326    2301    2065    2234     6.28
      30          1490    1382    1214    1284     2.58

Source: Data courtesy of William Galligan.
Figure 4.9 A chi-square plot for the data in Example 4.14. (Horizontal axis: $q_{c,4}((j - \tfrac{1}{2})/30)$; vertical axis: $d_{(j)}^2$.)
The marginal distributions appear quite normal (see Exercise 4.33), with the possible exception of specimen (board) 9.
To further evaluate multivariate normality, we constructed the chi-square plot shown in Figure 4.9. The two specimens with the largest squared distances are clearly removed from the straight-line pattern. Together, with the next largest point or two, they make the plot appear curved at the upper end. We will return to a discussion of this plot in Example 4.15. ■
We have discussed some rather simple techniques for checking the multivariate normality assumption. Specifically, we advocate calculating the $d_j^2$, $j = 1, 2, \ldots, n$ [see Equation (4-32)] and comparing the results with $\chi^2$ quantiles. For example, $p$-variate normality is indicated if
1. Roughly half of the $d_j^2$ are less than or equal to $q_{c,p}(.50)$.
2. A plot of the ordered squared distances $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$ versus $q_{c,p}((1 - \tfrac{1}{2})/n), q_{c,p}((2 - \tfrac{1}{2})/n), \ldots, q_{c,p}((n - \tfrac{1}{2})/n)$, respectively, is nearly a straight line having slope 1 and that passes through the origin.
(See [6] for a more complete exposition of methods for assessing normality.)
We close this section by noting that all measures of goodness of fit suffer the same serious drawback. When the sample size is small, only the most aberrant behavior will be identified as lack of fit. On the other hand, very large samples invariably produce statistically significant lack of fit. Yet the departure from the specified distribution may be very small and technically unimportant to the inferential conclusions.
4.7 DETECTING OUTLIERS AND CLEANING DATA
Most data sets contain one or a few unusual observations that do not seem to belong to the pattern of variability produced by the other observations. With data on a single characteristic, unusual observations are those that are either very large or very small relative to the others. The situation can be more complicated with multivariate data. Before we address the issue of identifying these outliers, we must emphasize that not all outliers are wrong numbers. They may, justifiably, be part of the group and may lead to a better understanding of the phenomena being studied.
Outliers are best detected visually whenever this is possible. When the number of observations $n$ is large, dot plots are not feasible. When the number of characteristics $p$ is large, the large number of scatter plots $p(p - 1)/2$ may prevent viewing them all. Even so, we suggest first visually inspecting the data whenever possible.
What should we look for? For a single random variable, the problem is one dimensional, and we look for observations that are far from the others. For instance, the dot diagram

[Dot diagram of the observations along the $x$ axis]

reveals a single large observation.
In the bivariate case, the situation is more complicated. Figure 4.10 shows a situation with two unusual observations.
The data point circled in the upper right corner of the figure is removed from the pattern, and its second coordinate is large relative to the rest of the $x_2$ measurements, as shown by the vertical dot diagram. The second outlier, also circled, is far from the elliptical pattern of the rest of the points, but, separately, each of its components has a typical value. This outlier cannot be detected by inspecting the marginal dot diagrams.
In higher dimensions, there can be outliers that cannot be detected from the univariate plots or even the bivariate scatter plots. Here a large value of $(\mathbf{x}_j - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_j - \bar{\mathbf{x}})$ will suggest an unusual observation, even though it cannot be seen visually.
Figure 4.10 Two outliers; one univariate and one bivariate. (Scatter plot of $x_2$ against $x_1$ with marginal dot diagrams.)
Steps for Detecting Outliers
1. Make a dot plot for each variable.
2. Make a scatter plot for each pair of variables.
3. Calculate the standardized values $z_{jk} = (x_{jk} - \bar{x}_k)/\sqrt{s_{kk}}$ for $j = 1, 2, \ldots, n$ and each column $k = 1, 2, \ldots, p$. Examine these standardized values for large or small values.
4. Calculate the generalized squared distances $(\mathbf{x}_j - \bar{\mathbf{x}})' \mathbf{S}^{-1} (\mathbf{x}_j - \bar{\mathbf{x}})$. Examine these distances for unusually large values. In a chi-square plot, these would be the points farthest from the origin.
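Step 3 can be sketched in Python as follows. The function names, the cutoff argument, and the three-row data set `toy_rows` are hypothetical and serve only as an illustration. The second function estimates how many of the $n \times p$ standardized values would be expected to fall beyond a cutoff if every column were exactly normal, which helps in judging what counts as "large."

```python
import math
from statistics import NormalDist

def standardized_values(rows):
    """Return z_jk = (x_jk - xbar_k) / sqrt(s_kk), using the divisor n - 1 for s_kk."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(p)]
    sds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in rows) / (n - 1))
           for k in range(p)]
    return [[(row[k] - means[k]) / sds[k] for k in range(p)] for row in rows]

def expected_extreme_count(n, p, cutoff=3.0):
    """Expected number of the n * p standardized values with |z| > cutoff
    when each column is exactly normal."""
    tail_probability = 2.0 * (1.0 - NormalDist().cdf(cutoff))
    return n * p * tail_probability

# Hypothetical toy data: three observations on two variables.
toy_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
z = standardized_values(toy_rows)

# With n = 100 and p = 5 there are 500 standardized values.
expected_beyond_3 = expected_extreme_count(100, 5)
```

For $n = 100$ and $p = 5$ the expected count beyond $\pm 3$ is between 1 and 2, so an isolated standardized value slightly above 3 is not by itself evidence of an error.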
In step 3, "large" must be interpreted relative to the sample size and number of variables. There are $n \times p$ standardized values. When $n = 100$ and $p = 5$, there are 500 values. You expect 1 or 2 of these to exceed 3 or be less than $-3$, even if the data came from a multivariate distribution that is exactly normal. As a guideline, 3.5 might be considered large for moderate sample sizes.
In step 4, "large" is measured by an appropriate percentile of the chi-square distribution with $p$ degrees of freedom. If the sample size is $n = 100$, we would expect 5 observations to have values of $d_j^2$ that exceed the upper fifth percentile of the chi-square distribution. A more extreme percentile must serve to determine observations that do not fit the pattern of the remaining data.
The data we presented in Table 4.3 concerning lumber have already been cleaned up somewhat. Similar data sets from the same study also contained data on $x_5$ = tensile strength. Nine observation vectors, out of the total of 112, are given as rows in the following table, along with their standardized values.

$x_1$   $x_2$   $x_3$   $x_4$   $x_5$     $z_1$    $z_2$    $z_3$    $z_4$    $z_5$
1631    1528    1452    1559    1602       .06     -.15      .05      .28     -.12
1770    1677    1707    1738    1785       .64      .43     1.07      .94      .60
1376    1190     723    1285    2791     -1.01    -1.47    -2.87     -.73     4.57*
1705    1577    1332    1703    1664       .37      .04     -.43      .81      .13
1643    1535    1510    1494    1582       .11     -.12      .28      .04     -.20
1567    1510    1301    1405    1553      -.21     -.22     -.56     -.28     -.31
1528    1591    1714    1685    1698      -.38      .10     1.10      .75      .26
1803    1826    1748    2746    1764       .78     1.01     1.23     4.66*     .52
1587    1554    1352    1554    1551      -.13     -.05     -.35      .26     -.32

* Circled in the original as an extreme standardized value.

The standardized values are based on the sample mean and variance, calculated from all 112 observations. There are two extreme standardized values. Both are too large with standardized values over 4.5. During their investigation, the researchers recorded measurements by hand in a logbook and then performed calculations that produced the values given in the table. When they checked their records regarding the values pinpointed by this analysis, errors were discovered. The value $x_5 = 2791$ was corrected to 1241, and $x_4 = 2746$ was corrected to 1670. Incorrect readings on an individual variable are quickly detected by locating a large leading digit for the standardized value.
The next example returns to the data on lumber discussed in Example 4.14.
Example 4.15 (Detecting outliers in the data on lumber)
Table 4.4 on page 192 contains the data in Table 4.3, along with the standard ized observations. These data consist of four different measures of stiffness x 1 , x2 , x3 , and x4 , on each of n = 30 boards. Recall that the first measurement involves sending a shock wave down the board, the second measurement is de termined while vibrating the board, and the last two measurements are obtained from static tests. The standardized measurements are
TABLE 4.4  FOUR MEASUREMENTS OF STIFFNESS WITH STANDARDIZED VALUES

Obs.   x1    x2    x3    x4     z1    z2    z3    z4       d2
  1   1889  1651  1561  1778   -.1   -.3    .2    .2      .60
  2   2403  2048  2087  2197   1.5    .9   1.9   1.5     5.48
  3   2119  1700  1815  2222    .7   -.2   1.0   1.5     7.62
  4   1645  1627  1110  1533   -.8   -.4  -1.3   -.6     5.21
  5   1976  1916  1614  1883    .2    .5    .3    .5     1.40
  6   1712  1712  1439  1546   -.6   -.1   -.2   -.6     2.22
  7   1943  1685  1271  1671    .1   -.2   -.8   -.2     4.99
  8   2104  1820  1717  1874    .6    .2    .7    .5     1.49
  9   2983  2794  2412  2581   3.3   3.3   3.0   2.7  [circled]
 10   1745  1600  1384  1508   -.5   -.5   -.4   -.7      .77
 11   1710  1591  1518  1667   -.6   -.5    .0   -.2     1.93
 12   2046  1907  1627  1898    .4    .5    .4    .5      .46
 13   1840  1841  1595  1741   -.2    .3    .3    .0     2.70
 14   1867  1685  1493  1678   -.1   -.2   -.1   -.1      .13
 15   1859  1649  1389  1714   -.1   -.3   -.4   -.0     1.08
 16   1954  2149  1180  1281    .1   1.3  -1.1  -1.4  [circled]
 17   1325  1170  1002  1176  -1.8  -1.8  -1.7  -1.7     3.50
 18   1419  1371  1252  1308  -1.5  -1.2   -.8  -1.3     3.99
 19   1828  1634  1602  1755   -.2   -.4    .3    .1     1.36
 20   1725  1594  1313  1646   -.6   -.5   -.6   -.2     1.46
 21   2276  2189  1547  2111   1.1   1.4    .1   1.2     9.90
 22   1899  1614  1422  1477   -.0   -.4   -.3   -.8     5.06
 23   1633  1513  1290  1516   -.8   -.7   -.7   -.6      .80
 24   2061  1867  1646  2037    .5    .4    .5   1.0     2.54
 25   1856  1493  1356  1533   -.2   -.8   -.5   -.6     4.58
 26   1727  1412  1238  1469   -.6  -1.1   -.9   -.8     3.40
 27   2168  1896  1701  1834    .8    .5    .6    .3     2.38
 28   1655  1675  1414  1597   -.8   -.2   -.3   -.4     3.00
 29   2326  2301  2065  2234   1.3   1.7   1.8   1.6     6.28
 30   1490  1382  1214  1284  -1.3  -1.2  -1.0  -1.4     2.58

The two circled $d^2$ values (specimens 9 and 16) are not legible in the source.

$$z_{jk} = \frac{x_{jk} - \bar{x}_k}{\sqrt{s_{kk}}}, \qquad k = 1, 2, 3, 4; \quad j = 1, 2, \ldots, 30$$
and the squares of the distances are $d_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$. The last column in Table 4.4 reveals that specimen 16 is a multivariate outlier, since $\chi_4^2(.005) = 14.86$; yet all of the individual measurements are well within their respective univariate scatters. Specimen 9 also has a large $d^2$ value. The two specimens (9 and 16) with large squared distances stand out as clearly different from the rest of the pattern in Figure 4.9. Once these two points are removed, the remaining pattern conforms to the expected straight-line relation. Scatter plots for the lumber stiffness measurements are given in Figure 4.11 on page 193. The solid dots in these figures correspond to specimens
[Figure 4.11: Scatter plots for the lumber stiffness measurements; selected specimens are plotted as solid dots.]
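The univariate screening step of Example 4.15 can be reproduced for a single column of Table 4.4. The following is a minimal sketch for the first stiffness measurement $x_1$ only; the cutoff of 3 and the variable names are illustrative choices, not part of the example.

```python
import math

# First stiffness measurement x1 for the n = 30 boards listed in Table 4.4.
x1 = [1889, 2403, 2119, 1645, 1976, 1712, 1943, 2104, 2983, 1745,
      1710, 2046, 1840, 1867, 1859, 1954, 1325, 1419, 1828, 1725,
      2276, 1899, 1633, 2061, 1856, 1727, 2168, 1655, 2326, 1490]

n = len(x1)
mean = sum(x1) / n
# Sample variance s_11 with divisor n - 1.
var = sum((v - mean) ** 2 for v in x1) / (n - 1)

# Standardized values z_j1 = (x_j1 - mean) / sqrt(s_11).
z1 = [(v - mean) / math.sqrt(var) for v in x1]

# Specimen numbers (1-based) whose standardized value exceeds 3 in absolute value.
flagged = [j + 1 for j, z in enumerate(z1) if abs(z) > 3.0]

print(round(mean, 1), round(z1[8], 1), round(z1[15], 1), flagged)
```

On this column only specimen 9 is flagged. Specimen 16 has an unremarkable standardized value here, which is consistent with the remark in the text that it is a multivariate outlier even though its individual measurements lie within their univariate scatters.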
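1] = .3174.

4.9. Refer to Exercise 4.8, but modify the construction by replacing the break point 1 by c so that

$$X_2 = \begin{cases} -X_1 & \text{if } -c \le X_1 \le c \\ X_1 & \text{elsewhere} \end{cases}$$

Show that c can be chosen so that $\text{Cov}(X_1, X_2) = 0$, but that the two random variables are not independent.
Hint: For c = 0, evaluate $\text{Cov}(X_1, X_2) = E[X_1(X_1)]$. For c very large, evaluate $\text{Cov}(X_1, X_2) \doteq E[X_1(-X_1)]$.

4.10. Show each of the following.
(a) $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}||\mathbf{B}|$
(b) $\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = |\mathbf{A}||\mathbf{B}|$ for $|\mathbf{A}| \neq 0$
Hint:
(a) $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} \begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}$. Expanding the determinant $\begin{vmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix}$ by the first row (see Definition 2A.24) gives 1 times a determinant of the same form, with the order of $\mathbf{I}$ reduced by one. This procedure is repeated until $1 \times |\mathbf{B}|$ is obtained. Similarly, expanding the determinant $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$ by the last row gives $\begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} = |\mathbf{A}|$.
(b) $\begin{vmatrix} \mathbf{A} & \mathbf{C} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} = \begin{vmatrix} \mathbf{A} & \mathbf{0} \\ \mathbf{0}' & \mathbf{B} \end{vmatrix} \begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$. But expanding the determinant $\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix}$ by the last row gives $\begin{vmatrix} \mathbf{I} & \mathbf{A}^{-1}\mathbf{C} \\ \mathbf{0}' & \mathbf{I} \end{vmatrix} = 1$. Now use the result in Part a.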
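4.11. Show that, if $\mathbf{A}$ is square,

$$|\mathbf{A}| = |\mathbf{A}_{22}||\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| \quad \text{for } |\mathbf{A}_{22}| \neq 0$$
$$\phantom{|\mathbf{A}|} = |\mathbf{A}_{11}||\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}| \quad \text{for } |\mathbf{A}_{11}| \neq 0$$

Hint: Partition $\mathbf{A}$ and verify that

$$\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22} \end{bmatrix}$$

Take determinants on both sides of this equality. Use Exercise 4.10 for the first and third determinants on the left and for the determinant on the right. The second equality for $|\mathbf{A}|$ follows by considering

$$\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{21}\mathbf{A}_{11}^{-1} & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12} \end{bmatrix}$$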
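4.12. Show that, for $\mathbf{A}$ symmetric,

$$\mathbf{A}^{-1} = \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix} \begin{bmatrix} (\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1} & \mathbf{0} \\ \mathbf{0}' & \mathbf{A}_{22}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}$$

Thus, $(\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21})^{-1}$ is the upper left-hand block of $\mathbf{A}^{-1}$.
Hint: Premultiply the expression in the hint to Exercise 4.11 by $\begin{bmatrix} \mathbf{I} & -\mathbf{A}_{12}\mathbf{A}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix}^{-1}$ and postmultiply by $\begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\mathbf{A}_{22}^{-1}\mathbf{A}_{21} & \mathbf{I} \end{bmatrix}^{-1}$. Take inverses of the resulting expression.

4.13. Show the following if $|\boldsymbol{\Sigma}| \neq 0$.
(a) Check that $|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{22}||\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}|$. (Note that $|\boldsymbol{\Sigma}|$ can be factored into the product of contributions from the marginal and conditional distributions.)
(b) Check that

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = [\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)]'(\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}[\mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)] + (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

(Thus, the joint density exponent can be written as the sum of two terms corresponding to contributions from the conditional and marginal distributions.)
(c) Given the results in Parts a and b, identify the marginal distribution of $\mathbf{X}_2$ and the conditional distribution of $\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2$.
Hint:
(a) Apply Exercise 4.11.
(b) Note from Exercise 4.12 that we can write $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ as

$$\begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}' \begin{bmatrix} \mathbf{I} & \mathbf{0} \\ -\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{I} \end{bmatrix} \begin{bmatrix} (\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}$$

If we group the product so that

$$\begin{bmatrix} \mathbf{I} & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{0}' & \mathbf{I} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2) \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix}$$

the result follows.

4.14. If $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $|\boldsymbol{\Sigma}| \neq 0$, show that the joint density can be written as the product of marginal densities for $\mathbf{X}_1$ $(q \times 1)$ and $\mathbf{X}_2$ $((p - q) \times 1)$ if $\boldsymbol{\Sigma}_{12} = \mathbf{0}$ $(q \times (p - q))$.
Hint: Show by block multiplication that $\begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix}$ is the inverse of $\boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22} \end{bmatrix}$. Then write

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = [(\mathbf{x}_1 - \boldsymbol{\mu}_1)', (\mathbf{x}_2 - \boldsymbol{\mu}_2)'] \begin{bmatrix} \boldsymbol{\Sigma}_{11}^{-1} & \mathbf{0} \\ \mathbf{0}' & \boldsymbol{\Sigma}_{22}^{-1} \end{bmatrix} \begin{bmatrix} \mathbf{x}_1 - \boldsymbol{\mu}_1 \\ \mathbf{x}_2 - \boldsymbol{\mu}_2 \end{bmatrix} = (\mathbf{x}_1 - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}_{11}^{-1}(\mathbf{x}_1 - \boldsymbol{\mu}_1) + (\mathbf{x}_2 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)$$

Note that $|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{11}||\boldsymbol{\Sigma}_{22}|$ from Exercise 4.10(a). Now factor the joint density.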
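4.15. Show that $\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}} - \boldsymbol{\mu})'$ and $\sum_{j=1}^{n} (\bar{\mathbf{x}} - \boldsymbol{\mu})(\mathbf{x}_j - \bar{\mathbf{x}})'$ are both $p \times p$ matrices of zeros. Here $\mathbf{x}_j' = [x_{j1}, x_{j2}, \ldots, x_{jp}]$, $j = 1, 2, \ldots, n$, and $\bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j$.

4.16. Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, and $\mathbf{X}_4$ be independent $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ random vectors.
(a) Find the marginal distributions for each of the random vectors
$$\mathbf{V}_1 = \tfrac{1}{4}\mathbf{X}_1 - \tfrac{1}{4}\mathbf{X}_2 + \tfrac{1}{4}\mathbf{X}_3 - \tfrac{1}{4}\mathbf{X}_4 \quad \text{and} \quad \mathbf{V}_2 = \tfrac{1}{4}\mathbf{X}_1 + \tfrac{1}{4}\mathbf{X}_2 - \tfrac{1}{4}\mathbf{X}_3 - \tfrac{1}{4}\mathbf{X}_4$$
(b) Find the joint density of the random vectors $\mathbf{V}_1$ and $\mathbf{V}_2$ defined in (a).

4.17. Let $\mathbf{X}_1$, $\mathbf{X}_2$, $\mathbf{X}_3$, $\mathbf{X}_4$, and $\mathbf{X}_5$ be independent and identically distributed random vectors with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Find the mean vector and covariance matrices for each of the two linear combinations of random vectors
$$\tfrac{1}{5}\mathbf{X}_1 + \tfrac{1}{5}\mathbf{X}_2 + \tfrac{1}{5}\mathbf{X}_3 + \tfrac{1}{5}\mathbf{X}_4 + \tfrac{1}{5}\mathbf{X}_5$$
and
$$\mathbf{X}_1 - \mathbf{X}_2 + \mathbf{X}_3 - \mathbf{X}_4 + \mathbf{X}_5$$
in terms of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Also, obtain the covariance between the two linear combinations of random vectors.

4.18. Find the maximum likelihood estimates of the $2 \times 1$ mean vector $\boldsymbol{\mu}$ and the $2 \times 2$ covariance matrix $\boldsymbol{\Sigma}$ based on the random sample
$$\mathbf{X} = \begin{bmatrix} 3 & 6 \\ 4 & 4 \\ 5 & 7 \\ 4 & 7 \end{bmatrix}$$
from a bivariate normal population.

4.19. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{20}$ be a random sample of size n = 20 from an $N_6(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Specify each of the following completely.
(a) The distribution of $(\mathbf{X}_1 - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X}_1 - \boldsymbol{\mu})$
(b) The distributions of $\bar{\mathbf{X}}$ and $\sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu})$
(c) The distribution of $(n - 1)\mathbf{S}$

4.20. For the random variables $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{20}$ in Exercise 4.19, specify the distribution of $\mathbf{B}(19\mathbf{S})\mathbf{B}'$ in each case.
(a) $\mathbf{B}$ = [matrix entries not legible in the source]
(b) $\mathbf{B}$ = [matrix entries not legible in the source]

4.21. Let $\mathbf{X}_1, \ldots, \mathbf{X}_{60}$ be a random sample of size 60 from a four-variate normal distribution having mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$. Specify each of the following completely.
(a) The distribution of $\bar{\mathbf{X}}$
(b) The distribution of $(\mathbf{X}_1 - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X}_1 - \boldsymbol{\mu})$
(c) The distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$
(d) The approximate distribution of $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$

4.22. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_{75}$ be a random sample from a population distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. What is the approximate distribution of each of the following?
(a) $\bar{\mathbf{X}}$
(b) $n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu})$

4.23. Consider the annual rates of return (including dividends) on the Dow-Jones industrial average for the years 1963-1972. These data, multiplied by 100, are 20.6, 18.7, 14.2, -15.7, 19.0, 7.7, 11.6, 8.8, 9.8, and 18.2. Use these 10 observations to complete the following.
(a) Construct a Q-Q plot. Do the data seem to be normally distributed? Explain.
(b) Carry out a test of normality based on the correlation coefficient $r_Q$. [See (4-31).] Let the significance level be $\alpha = .10$.

4.24. Exercise 1.4 contains data on three variables for the 10 largest industrial corporations as of April 1990. For the sales ($x_1$) and profits ($x_2$) data:
(a) Construct Q-Q plots. Do these data appear to be normally distributed? Explain.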
(b) Carry out a test of normality based on the correlation coefficient $r_Q$. [See (4-31).] Set the significance level at $\alpha = .10$. Do the results of these tests corroborate the results in Part a?

4.25. Refer to the data for the 10 largest industrial corporations in Exercise 1.4. Construct a chi-square plot using all three variables. The chi-square quantiles are
0.3518  0.7978  1.2125  1.6416  2.1095  2.6430  3.2831  4.1083  5.3170  7.8147

4.26. Exercise 1.2 gives the age $x_1$, measured in years, as well as the selling price $x_2$, measured in thousands of dollars, for n = 10 used cars. These data are reproduced as follows:

x1     3     5     5     7     7     7     8     9    10    11
x2  2.30  1.90  1.00   .70   .30  1.00  1.05   .45   .70   .30

(a) Use the results of Exercise 1.2 to calculate the squared statistical distances $(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$, $j = 1, 2, \ldots, 10$, where $\mathbf{x}_j' = [x_{j1}, x_{j2}]$.
(b) Using the distances in Part a, determine the proportion of the observations falling within the estimated 50% probability contour of a bivariate normal distribution.
(c) Order the distances in Part a and construct a chi-square plot.
(d) Given the results in Parts b and c, are these data approximately bivariate normal? Explain.

4.27. Consider the radiation data (with door closed) in Example 4.10. Construct a Q-Q plot for the natural logarithms of these data. [Note that the natural logarithm transformation corresponds to the value $\lambda = 0$ in (4-34).] Do the natural logarithms appear to be normally distributed? Compare your results with Figure 4.13. Does the choice $\lambda = \frac{1}{4}$ or $\lambda = 0$ make much difference in this case?

The following exercises may require a computer.

4.28. Consider the air-pollution data given in Table 1.5. Construct a Q-Q plot for the solar radiation measurements and carry out a test for normality based on the correlation coefficient $r_Q$ [see (4-31)]. Let $\alpha = .05$ and use the entry corresponding to n = 40 in Table 4.2.

4.29. Given the air-pollution data in Table 1.5, examine the pairs $X_5 = \text{NO}_2$ and $X_6 = \text{O}_3$ for bivariate normality.
(a) Calculate statistical distances $(\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}})$, $j = 1, 2, \ldots, 42$, where $\mathbf{x}_j' = [x_{j5}, x_{j6}]$.
(b) Determine the proportion of observations $\mathbf{x}_j' = [x_{j5}, x_{j6}]$, $j = 1, 2, \ldots, 42$, falling within the approximate 50% probability contour of a bivariate normal distribution.
(c) Construct a chi-square plot of the ordered distances in Part a.

4.30. Consider the used-car data in Exercise 4.26.
(a) Determine the power transformation $\hat{\lambda}_1$ that makes the $x_1$ values approximately normal. Construct a Q-Q plot for the transformed data.
(b) Determine the power transformation $\hat{\lambda}_2$ that makes the $x_2$ values approximately normal. Construct a Q-Q plot for the transformed data.
(c) Determine the power transformations $\hat{\boldsymbol{\lambda}}' = [\hat{\lambda}_1, \hat{\lambda}_2]$ that make the $[x_1, x_2]$ values jointly normal using (4-40). Compare the results with those obtained in Parts a and b.
4.31. Examine the marginal normality of the observations on variables $X_1, X_2, \ldots, X_5$ for the multiple-sclerosis data in Table 1.6. Treat the non-multiple-sclerosis and multiple-sclerosis groups separately. Use whatever methodology, including transformations, you feel is appropriate.

4.32. Examine the marginal normality of the observations on variables $X_1, X_2, \ldots, X_6$ for the radiotherapy data in Table 1.7. Use whatever methodology, including transformations, you feel is appropriate.

4.33. Examine the marginal and bivariate normality of the observations on variables $X_1$, $X_2$, $X_3$, and $X_4$ for the data in Table 4.3.

4.34. Examine the data on bone mineral content in Table 1.8 for marginal and bivariate normality.

4.35. Examine the data on paper-quality measurements in Table 1.2 for marginal and multivariate normality.

4.36. Examine the data on women's national track records in Table 1.9 for marginal and multivariate normality.

4.37. Refer to Exercise 1.18. Convert the women's track records in Table 1.9 to speeds measured in meters per second. Examine the data on speeds for marginal and multivariate normality.

4.38. Examine the data on bulls in Table 1.10 for marginal and multivariate normality. Consider only the variables YrHgt, FtFrBody, PrctFFB, BkFat, SaleHt, and SaleWt.
REFERENCES

1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Andrews, D. F., R. Gnanadesikan, and J. L. Warner. "Transformations of Multivariate Data." Biometrics, 27, no. 4 (1971), 825-840.
3. Box, G. E. P., and D. R. Cox. "An Analysis of Transformations" (with discussion). Journal of the Royal Statistical Society (B), 26, no. 2 (1964), 211-252.
4. Daniel, C., and F. S. Wood. Fitting Equations to Data (2d ed.). New York: John Wiley, 1980.
5. Filliben, J. J. "The Probability Plot Correlation Coefficient Test for Normality." Technometrics, 17, no. 1 (1975), 111-117.
6. Gnanadesikan, R. Methods for Statistical Data Analysis of Multivariate Observations. New York: John Wiley, 1977.
7. Hawkins, D. M. Identification of Outliers. London, UK: Chapman and Hall, 1980.
8. Hernandez, F., and R. A. Johnson. "The Large-Sample Behavior of Transformations to Normality." Journal of the American Statistical Association, 75, no. 372 (1980), 855-861.
9. Hogg, R. V., and A. T. Craig. Introduction to Mathematical Statistics (4th ed.). New York: Macmillan, 1978.
10. Looney, S. W., and T. R. Gulledge, Jr. "Use of the Correlation Coefficient with Normal Probability Plots." The American Statistician, 39, no. 1 (1985), 75-79.
11. Shapiro, S. S., and M. B. Wilk. "An Analysis of Variance Test for Normality (Complete Samples)." Biometrika, 52, no. 4 (1965), 591-611.
12. Verrill, S., and R. A. Johnson. "Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality." Journal of the American Statistical Association, 83, no. 404 (1988), 1192-1197.
13. Yeo, I., and R. A. Johnson. "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87, no. 4 (2000), 954-959.
14. Zehna, P. "Invariance of Maximum Likelihood Estimators." Annals of Mathematical Statistics, 37, no. 3 (1966), 744.
CHAPTER 5

Inferences about a Mean Vector

5.1 INTRODUCTION
This chapter is the first of the methodological sections of the book. We shall now use the concepts and results set forth in Chapters 1 through 4 to develop techniques for analyzing data. A large part of any analysis is concerned with inference, that is, reaching valid conclusions concerning a population on the basis of information from a sample. At this point, we shall concentrate on inferences about a population mean vector and its component parts. Although we introduce statistical inference through initial discussions of tests of hypotheses, our ultimate aim is to present a full statistical analysis of the component means based on simultaneous confidence statements.

One of the central messages of multivariate analysis is that p correlated variables must be analyzed jointly. This principle is exemplified by the methods presented in this chapter.
5.2 THE PLAUSIBILITY OF $\boldsymbol{\mu}_0$ AS A VALUE FOR A NORMAL POPULATION MEAN
Let us start by recalling the univariate theory for determining whether a specific value $\mu_0$ is a plausible value for the population mean $\mu$. From the point of view of hypothesis testing, this problem can be formulated as a test of the competing hypotheses

$$H_0: \mu = \mu_0 \quad \text{and} \quad H_1: \mu \neq \mu_0$$

Here $H_0$ is the null hypothesis and $H_1$ is the (two-sided) alternative hypothesis. If $X_1, X_2, \ldots, X_n$ denote a random sample from a normal population, the appropriate test statistic is

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \quad \text{where } \bar{X} = \frac{1}{n}\sum_{j=1}^{n} X_j \text{ and } s^2 = \frac{1}{n-1}\sum_{j=1}^{n} (X_j - \bar{X})^2$$

This test statistic has a Student's t-distribution with n - 1 degrees of freedom (d.f.). We reject $H_0$, that $\mu_0$ is a plausible value of $\mu$, if the observed $|t|$ exceeds a specified percentage point of a t-distribution with n - 1 d.f.

Rejecting $H_0$ when $|t|$ is large is equivalent to rejecting $H_0$ if its square,

$$t^2 = \frac{(\bar{X} - \mu_0)^2}{s^2/n} = n(\bar{X} - \mu_0)(s^2)^{-1}(\bar{X} - \mu_0) \tag{5-1}$$

is large. The variable $t^2$ in (5-1) is the square of the distance from the sample mean $\bar{X}$ to the test value $\mu_0$. The units of distance are expressed in terms of $s/\sqrt{n}$, or estimated standard deviations of $\bar{X}$. Once $\bar{X}$ and $s^2$ are observed, the test becomes: Reject $H_0$ in favor of $H_1$, at significance level $\alpha$, if

$$n(\bar{x} - \mu_0)(s^2)^{-1}(\bar{x} - \mu_0) > t_{n-1}^2(\alpha/2) \tag{5-2}$$

where $t_{n-1}(\alpha/2)$ denotes the upper $100(\alpha/2)$th percentile of the t-distribution with n - 1 d.f.

If $H_0$ is not rejected, we conclude that $\mu_0$ is a plausible value for the normal population mean. Are there other values of $\mu$ which are also consistent with the data? The answer is yes! In fact, there is always a set of plausible values for a normal population mean. From the well-known correspondence between acceptance regions for tests of $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$ and confidence intervals for $\mu$, we have

$$\{\text{Do not reject } H_0: \mu = \mu_0 \text{ at level } \alpha\} \quad \text{or} \quad \left| \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \right| \le t_{n-1}(\alpha/2)$$

is equivalent to

$$\left\{ \mu_0 \text{ lies in the } 100(1 - \alpha)\% \text{ confidence interval } \bar{x} \pm t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \right\}$$

or

$$\bar{x} - t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \le \mu_0 \le \bar{x} + t_{n-1}(\alpha/2)\frac{s}{\sqrt{n}} \tag{5-3}$$
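The correspondence between the level $\alpha$ test and the confidence interval can be illustrated numerically. The sketch below uses a small hypothetical sample (not from the text) and takes the percentile $t_4(.025) = 2.776$ from a standard t table; the function names are ours.

```python
import math

def t_summary(sample, mu0):
    """Return n, the sample mean, s^2, and t^2 = n (xbar - mu0)^2 / s^2."""
    n = len(sample)
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    t2 = n * (xbar - mu0) ** 2 / s2
    return n, xbar, s2, t2

def not_rejected(sample, mu0, t_crit):
    """True when t^2 does not exceed the squared critical value (H0 retained)."""
    _, _, _, t2 = t_summary(sample, mu0)
    return t2 <= t_crit ** 2

def in_interval(sample, mu0, t_crit):
    """True when mu0 lies in xbar +/- t_crit * s / sqrt(n)."""
    n, xbar, s2, _ = t_summary(sample, mu0)
    half_width = t_crit * math.sqrt(s2 / n)
    return xbar - half_width <= mu0 <= xbar + half_width

sample = [4.1, 5.3, 4.8, 5.9, 5.4]   # hypothetical data, n = 5
T_CRIT = 2.776                        # t_4(.025), upper 2.5th percentile with 4 d.f.

for mu0 in (4.0, 4.5, 5.0, 5.5, 6.0, 6.5):
    print(mu0, not_rejected(sample, mu0, T_CRIT), in_interval(sample, mu0, T_CRIT))
```

For every trial value of $\mu_0$ the two answers agree: the values retained by the test are exactly those inside the 95% interval.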
The confidence interval consists of all those values $\mu_0$ that would not be rejected by the level $\alpha$ test of $H_0: \mu = \mu_0$.

Before the sample is selected, the $100(1 - \alpha)\%$ confidence interval in (5-3) is a random interval because the endpoints depend upon the random variables $\bar{X}$ and $s$. The probability that the interval contains $\mu$ is $1 - \alpha$; among large numbers of such independent intervals, approximately $100(1 - \alpha)\%$ of them will contain $\mu$.

Consider now the problem of determining whether a given $p \times 1$ vector $\boldsymbol{\mu}_0$ is a plausible value for the mean of a multivariate normal distribution. We shall proceed by analogy to the univariate development just presented.

A natural generalization of the squared distance in (5-1) is its multivariate analog
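$$T^2 = (\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\left(\frac{1}{n}\mathbf{S}\right)^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0) \tag{5-4}$$

where

$$\underset{(p \times 1)}{\bar{\mathbf{X}}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}_j, \qquad \underset{(p \times p)}{\mathbf{S}} = \frac{1}{n-1}\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})', \qquad \text{and} \qquad \underset{(p \times 1)}{\boldsymbol{\mu}_0} = \begin{bmatrix} \mu_{10} \\ \mu_{20} \\ \vdots \\ \mu_{p0} \end{bmatrix}$$

The statistic $T^2$ is called Hotelling's $T^2$ in honor of Harold Hotelling, a pioneer in multivariate analysis, who first obtained its sampling distribution. Here $(1/n)\mathbf{S}$ is the estimated covariance matrix of $\bar{\mathbf{X}}$. (See Result 3.1.)

If the observed statistical distance $T^2$ is too large, that is, if $\bar{\mathbf{x}}$ is "too far" from $\boldsymbol{\mu}_0$, the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is rejected. It turns out that special tables of $T^2$ percentage points are not required for formal tests of hypotheses. This is true because

$$T^2 \text{ is distributed as } \frac{(n-1)p}{(n-p)} F_{p,n-p} \tag{5-5}$$

where $F_{p,n-p}$ denotes a random variable with an F-distribution with p and n - p d.f.

To summarize, we have the following:

Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Then with $\bar{\mathbf{X}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{X}_j$ and $\mathbf{S} = \frac{1}{n-1}\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'$,

$$\alpha = P\left[ T^2 > \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha) \right] = P\left[ n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) > \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha) \right] \tag{5-6}$$

whatever the true $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Here $F_{p,n-p}(\alpha)$ is the upper $(100\alpha)$th percentile of the $F_{p,n-p}$ distribution.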
Statement (5-6) leads immediately to a test of the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$. At the $\alpha$ level of significance, we reject $H_0$ in favor of $H_1$ if the observed

$$T^2 = n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) > \frac{(n-1)p}{(n-p)} F_{p,n-p}(\alpha) \tag{5-7}$$

It is informative to discuss the nature of the $T^2$ distribution briefly and its correspondence with the univariate test statistic. In Section 4.4, we described the manner in which the Wishart distribution generalizes the chi-square distribution. We can write

$$T^2 = \sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\left( \frac{\sum_{j=1}^{n} (\mathbf{X}_j - \bar{\mathbf{X}})(\mathbf{X}_j - \bar{\mathbf{X}})'}{n-1} \right)^{-1} \sqrt{n}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)$$

which combines a normal, $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, random vector and a Wishart, $W_{p,n-1}(\boldsymbol{\Sigma})$, random matrix in the form

$$T^2_{p,n-1} = \begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix}' \begin{pmatrix} \dfrac{\text{Wishart random matrix}}{\text{d.f.}} \end{pmatrix}^{-1} \begin{pmatrix} \text{multivariate normal} \\ \text{random vector} \end{pmatrix} = N_p(\mathbf{0}, \boldsymbol{\Sigma})'\left[ \frac{1}{n-1} W_{p,n-1}(\boldsymbol{\Sigma}) \right]^{-1} N_p(\mathbf{0}, \boldsymbol{\Sigma}) \tag{5-8}$$

This is analogous to

$$t^2 = \sqrt{n}(\bar{X} - \mu_0)(s^2)^{-1}\sqrt{n}(\bar{X} - \mu_0)$$

or

$$t^2_{n-1} = \begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix} \begin{pmatrix} \dfrac{\text{(scaled) chi-square random variable}}{\text{d.f.}} \end{pmatrix}^{-1} \begin{pmatrix} \text{normal} \\ \text{random variable} \end{pmatrix}$$

for the univariate case. Since the multivariate normal and Wishart random variables are independently distributed [see (4-23)], their joint density function is the product of the marginal normal and Wishart distributions. Using calculus, the distribution (5-5) of $T^2$ as given previously can be derived from this joint distribution and the representation (5-8).

It is rare, in multivariate situations, to be content with a test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, where all of the mean vector components are specified under the null hypothesis. Ordinarily, it is preferable to find regions of $\boldsymbol{\mu}$ values that are plausible in light of the observed data. We shall return to this issue in Section 5.4.
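Example 5.1 (Evaluating $T^2$)

Let the data matrix for a random sample of size n = 3 from a bivariate normal population be

$$\mathbf{X} = \begin{bmatrix} 6 & 9 \\ 10 & 6 \\ 8 & 3 \end{bmatrix}$$

Evaluate the observed $T^2$ for $\boldsymbol{\mu}_0' = [9, 5]$. What is the sampling distribution of $T^2$ in this case? We find

$$\bar{\mathbf{x}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \end{bmatrix} = \begin{bmatrix} \dfrac{6 + 10 + 8}{3} \\[2mm] \dfrac{9 + 6 + 3}{3} \end{bmatrix} = \begin{bmatrix} 8 \\ 6 \end{bmatrix}$$

and

$$s_{11} = \frac{(6 - 8)^2 + (10 - 8)^2 + (8 - 8)^2}{2} = 4$$
$$s_{12} = \frac{(6 - 8)(9 - 6) + (10 - 8)(6 - 6) + (8 - 8)(3 - 6)}{2} = -3$$
$$s_{22} = \frac{(9 - 6)^2 + (6 - 6)^2 + (3 - 6)^2}{2} = 9$$

so

$$\mathbf{S} = \begin{bmatrix} 4 & -3 \\ -3 & 9 \end{bmatrix}$$

Thus,

$$\mathbf{S}^{-1} = \frac{1}{(4)(9) - (-3)(-3)} \begin{bmatrix} 9 & 3 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} \frac{1}{3} & \frac{1}{9} \\ \frac{1}{9} & \frac{4}{27} \end{bmatrix}$$

and, from (5-4),

$$T^2 = 3[8 - 9, \; 6 - 5] \begin{bmatrix} \frac{1}{3} & \frac{1}{9} \\ \frac{1}{9} & \frac{4}{27} \end{bmatrix} \begin{bmatrix} 8 - 9 \\ 6 - 5 \end{bmatrix} = 3[-1, \; 1] \begin{bmatrix} -\frac{2}{9} \\ \frac{1}{27} \end{bmatrix} = \frac{7}{9}$$

Before the sample is selected, $T^2$ has the distribution of a

$$\frac{(3 - 1)2}{(3 - 2)} F_{2,3-2} = 4F_{2,1}$$

random variable. ■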
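The arithmetic of Example 5.1 can be checked exactly. The sketch below is a minimal illustration for the bivariate case only; it uses the closed-form inverse of a 2 × 2 matrix and exact rational arithmetic, and the function name `hotelling_t2_bivariate` is ours.

```python
from fractions import Fraction

def hotelling_t2_bivariate(rows, mu0):
    """T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0) for p = 2, in exact arithmetic."""
    n = len(rows)
    xbar = [Fraction(sum(r[k] for r in rows), n) for k in range(2)]

    def s(a, b):
        # Sample covariance s_ab with divisor n - 1.
        return sum((r[a] - xbar[a]) * (r[b] - xbar[b]) for r in rows) / (n - 1)

    s11, s12, s22 = s(0, 0), s(0, 1), s(1, 1)
    det = s11 * s22 - s12 * s12
    d1, d2 = xbar[0] - mu0[0], xbar[1] - mu0[1]
    # Quadratic form d' S^{-1} d written out with the 2 x 2 adjugate of S.
    quad = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return n * quad

rows = [[6, 9], [10, 6], [8, 3]]   # data matrix of Example 5.1
mu0 = [9, 5]
n, p = len(rows), 2

t2 = hotelling_t2_bivariate(rows, mu0)
scale = Fraction((n - 1) * p, n - p)   # multiplier of F_{p, n-p} in (5-5)

print(t2, scale)
```

The result agrees with the hand calculation, $T^2 = 7/9$, and the multiplier of $F_{2,1}$ is 4.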
The next example illustrates a test of the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ using data collected as part of a search for new diagnostic techniques at the University of Wisconsin Medical School.

Example 5.2 (Testing a multivariate mean vector with $T^2$)

Perspiration from 20 healthy females was analyzed. Three components, $X_1$ = sweat rate, $X_2$ = sodium content, and $X_3$ = potassium content, were measured, and the results, which we call the sweat data, are presented in Table 5.1.

Test the hypothesis $H_0: \boldsymbol{\mu}' = [4, 50, 10]$ against $H_1: \boldsymbol{\mu}' \neq [4, 50, 10]$ at level of significance $\alpha = .10$.

Computer calculations provide

$$\bar{\mathbf{x}} = \begin{bmatrix} 4.640 \\ 45.400 \\ 9.965 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} 2.879 & 10.010 & -1.810 \\ 10.010 & 199.788 & -5.640 \\ -1.810 & -5.640 & 3.628 \end{bmatrix}$$

and

$$\mathbf{S}^{-1} = \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix}$$

We evaluate

$$T^2 = 20[4.640 - 4, \; 45.400 - 50, \; 9.965 - 10] \begin{bmatrix} .586 & -.022 & .258 \\ -.022 & .006 & -.002 \\ .258 & -.002 & .402 \end{bmatrix} \begin{bmatrix} 4.640 - 4 \\ 45.400 - 50 \\ 9.965 - 10 \end{bmatrix}$$

$$= 20[.640, \; -4.600, \; -.035] \begin{bmatrix} .467 \\ -.042 \\ .160 \end{bmatrix} = 9.74$$
TABLE 5.1  SWEAT DATA

Individual   X1 (Sweat rate)   X2 (Sodium)   X3 (Potassium)
     1             3.7             48.5            9.3
     2             5.7             65.1            8.0
     3             3.8             47.2           10.9
     4             3.2             53.2           12.0
     5             3.1             55.5            9.7
     6             4.6             36.1            7.9
     7             2.4             24.8           14.0
     8             7.2             33.1            7.6
     9             6.7             47.4            8.5
    10             5.4             54.1           11.3
    11             3.9             36.9           12.7
    12             4.5             58.8           12.3
    13             3.5             27.8            9.8
    14             4.5             40.2            8.4
    15             1.5             13.5           10.1
    16             8.5             56.4            7.1
    17             4.5             71.6            8.2
    18             6.5             52.8           10.9
    19             4.1             44.1           11.2
    20             5.5             40.9            9.4

Source: Courtesy of Dr. Gerald Bargman.

Comparing the observed $T^2 = 9.74$ with the critical value

$$\frac{(n-1)p}{(n-p)} F_{p,n-p}(.10) = \frac{19(3)}{17} F_{3,17}(.10) = 3.353(2.44) = 8.18$$

we see that $T^2 = 9.74 > 8.18$, and consequently, we reject $H_0$ at the 10% level of significance.

We note that $H_0$ will be rejected if one or more of the component means, or some combination of means, differs too much from the hypothesized values [4, 50, 10]. At this point, we have no idea which of these hypothesized values may not be supported by the data.

We have assumed that the sweat data are multivariate normal. The Q-Q plots constructed from the marginal distributions of $X_1$, $X_2$, and $X_3$ all approximate straight lines. Moreover, scatter plots for pairs of observations have approximate elliptical shapes, and we conclude that the normality assumption was reasonable in this case. (See Exercise 5.4.) ■
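The test in Example 5.2 can be retraced from the summary quantities printed in the example. The sketch below uses the rounded $\bar{\mathbf{x}}$ and $\mathbf{S}^{-1}$ as given, so its $T^2$ differs slightly from the 9.74 obtained from unrounded computer output; the percentile $F_{3,17}(.10) = 2.44$ is the value quoted in the example, and the variable names are ours.

```python
n, p = 20, 3

# Rounded summary statistics as printed in Example 5.2.
xbar = [4.640, 45.400, 9.965]
mu0 = [4.0, 50.0, 10.0]
S_inv = [[0.586, -0.022, 0.258],
         [-0.022, 0.006, -0.002],
         [0.258, -0.002, 0.402]]
F_CRIT = 2.44   # F_{3,17}(.10), as quoted in the example

# d = xbar - mu0, then T^2 = n d' S^{-1} d.
d = [xbar[i] - mu0[i] for i in range(p)]
S_inv_d = [sum(S_inv[i][k] * d[k] for k in range(p)) for i in range(p)]
t2 = n * sum(d[i] * S_inv_d[i] for i in range(p))

# Critical value (n - 1) p / (n - p) * F_{p, n-p}(alpha) from (5-7).
critical = (n - 1) * p / (n - p) * F_CRIT

print(round(t2, 2), round(critical, 2), t2 > critical)
```

With the rounded inputs, $T^2$ comes out a little below 9.74 but still well above the critical value of about 8.18, so the conclusion to reject $H_0$ at the 10% level is unchanged.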
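One feature of the $T^2$-statistic is that it is invariant (unchanged) under changes in the units of measurements for $\mathbf{X}$ of the form

$$\underset{(p \times 1)}{\mathbf{Y}} = \underset{(p \times p)}{\mathbf{C}} \underset{(p \times 1)}{\mathbf{X}} + \underset{(p \times 1)}{\mathbf{d}}, \qquad \mathbf{C} \text{ nonsingular} \tag{5-9}$$

A transformation of the observations of this kind arises when a constant $b_i$ is subtracted from the ith variable to form $X_i - b_i$ and the result is multiplied by a constant $a_i > 0$ to get $a_i(X_i - b_i)$. Premultiplication of the centered and scaled quantities $a_i(X_i - b_i)$ by any nonsingular matrix will yield Equation (5-9). As an example, the operations involved in changing $X_i$ to $a_i(X_i - b_i)$ correspond exactly to the process of converting temperature from a Fahrenheit to a Celsius reading.

Given observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ and the transformation in (5-9), it immediately follows from Result 3.6 that

$$\bar{\mathbf{y}} = \mathbf{C}\bar{\mathbf{x}} + \mathbf{d} \quad \text{and} \quad \mathbf{S}_y = \frac{1}{n-1}\sum_{j=1}^{n} (\mathbf{y}_j - \bar{\mathbf{y}})(\mathbf{y}_j - \bar{\mathbf{y}})' = \mathbf{C}\mathbf{S}\mathbf{C}'$$

Moreover, by (2-24) and (2-45),

$$\boldsymbol{\mu}_Y = E(\mathbf{Y}) = E(\mathbf{C}\mathbf{X} + \mathbf{d}) = E(\mathbf{C}\mathbf{X}) + E(\mathbf{d}) = \mathbf{C}\boldsymbol{\mu} + \mathbf{d}$$

Therefore, $T^2$ computed with the y's and a hypothesized value $\boldsymbol{\mu}_{Y,0} = \mathbf{C}\boldsymbol{\mu}_0 + \mathbf{d}$ is

$$\begin{aligned} T^2 &= n(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0})'\mathbf{S}_y^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_{Y,0}) \\ &= n(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0))'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)) \\ &= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \\ &= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{C}'(\mathbf{C}')^{-1}\mathbf{S}^{-1}\mathbf{C}^{-1}\mathbf{C}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \\ &= n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \end{aligned}$$

The last expression is recognized as the value of $T^2$ computed with the x's.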
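The invariance of $T^2$ under $\mathbf{Y} = \mathbf{C}\mathbf{X} + \mathbf{d}$ can be confirmed on the small data set of Example 5.1. In the sketch below the matrix `C` and vector `d` are arbitrary illustrative choices (any nonsingular `C` would do), and the helper names are ours; exact rational arithmetic makes the two values comparable without rounding error.

```python
from fractions import Fraction

def t2_bivariate(rows, mu0):
    """Hotelling's T^2 for p = 2 in exact arithmetic."""
    n = len(rows)
    xbar = [Fraction(sum(r[k] for r in rows), n) for k in range(2)]

    def s(a, b):
        return sum((r[a] - xbar[a]) * (r[b] - xbar[b]) for r in rows) / (n - 1)

    s11, s12, s22 = s(0, 0), s(0, 1), s(1, 1)
    det = s11 * s22 - s12 * s12
    d1, d2 = xbar[0] - mu0[0], xbar[1] - mu0[1]
    return n * (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det

def affine(C, d, x):
    """Return C x + d for a 2 x 2 matrix C and 2-vectors d and x."""
    return [C[0][0] * x[0] + C[0][1] * x[1] + d[0],
            C[1][0] * x[0] + C[1][1] * x[1] + d[1]]

rows = [[6, 9], [10, 6], [8, 3]]   # data of Example 5.1
mu0 = [9, 5]

C = [[2, 1], [1, 3]]               # illustrative nonsingular matrix (determinant 5)
d = [-4, 7]                        # illustrative shift

y_rows = [affine(C, d, x) for x in rows]
mu_y0 = affine(C, d, mu0)          # hypothesized mean on the new scale, C mu0 + d

t2_x = t2_bivariate(rows, mu0)
t2_y = t2_bivariate(y_rows, mu_y0)
print(t2_x, t2_y)
```

Both computations give 7/9, as the algebra leading to the last display predicts.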
5.3 HOTELLING'S $T^2$ AND LIKELIHOOD RATIO TESTS
We introduced the $T^2$-statistic by analogy with the univariate squared distance $t^2$. There is a general principle for constructing test procedures called the likelihood ratio method, and the $T^2$-statistic can be derived as the likelihood ratio test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$. The general theory of likelihood ratio tests is beyond the scope of this book. (See [3] for a treatment of the topic.) Likelihood ratio tests have several optimal properties for reasonably large samples, and they are particularly convenient for hypotheses formulated in terms of multivariate normal parameters.

We know from (4-18) that the maximum of the multivariate normal likelihood as $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are varied over their possible values is given by

$$\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}|^{n/2}} e^{-np/2} \tag{5-10}$$

where

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \quad \text{and} \quad \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n} \mathbf{x}_j$$

are the maximum likelihood estimates. Recall that $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\Sigma}}$ are those choices for $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ that best explain the observed values of the random sample.
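Under the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$, the normal likelihood specializes to

$$L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\boldsymbol{\Sigma}|^{n/2}} \exp\left( -\frac{1}{2}\sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0) \right)$$

The mean $\boldsymbol{\mu}_0$ is now fixed, but $\boldsymbol{\Sigma}$ can be varied to find the value that is "most likely" to have led, with $\boldsymbol{\mu}_0$ fixed, to the observed sample. This value is obtained by maximizing $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ with respect to $\boldsymbol{\Sigma}$.

Following the steps in (4-13), the exponent in $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ may be written as

$$-\frac{1}{2}\sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0) = -\frac{1}{2}\sum_{j=1}^{n} \text{tr}\left[ \boldsymbol{\Sigma}^{-1}(\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' \right] = -\frac{1}{2}\text{tr}\left[ \boldsymbol{\Sigma}^{-1}\left( \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' \right) \right]$$

Applying Result 4.10 with $\mathbf{B} = \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'$ and $b = n/2$, we have

$$\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{np/2}|\hat{\boldsymbol{\Sigma}}_0|^{n/2}} e^{-np/2} \tag{5-11}$$

with

$$\hat{\boldsymbol{\Sigma}}_0 = \frac{1}{n}\sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)'$$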
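To determine whether $\boldsymbol{\mu}_0$ is a plausible value of $\boldsymbol{\mu}$, the maximum of $L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})$ is compared with the unrestricted maximum of $L(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The resulting ratio is called the likelihood ratio statistic. Using Equations (5-10) and (5-11), we get

$$\text{Likelihood ratio} = \Lambda = \frac{\max_{\boldsymbol{\Sigma}} L(\boldsymbol{\mu}_0, \boldsymbol{\Sigma})}{\max_{\boldsymbol{\mu}, \boldsymbol{\Sigma}} L(\boldsymbol{\mu}, \boldsymbol{\Sigma})} = \left( \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|} \right)^{n/2} \tag{5-12}$$

The equivalent statistic $\Lambda^{2/n} = |\hat{\boldsymbol{\Sigma}}| / |\hat{\boldsymbol{\Sigma}}_0|$ is called Wilks' lambda. If the observed value of this likelihood ratio is too small, the hypothesis $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ is unlikely to be true and is, therefore, rejected. Specifically, the likelihood ratio test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ against $H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$ rejects $H_0$ if

$$\Lambda = \left( \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|} \right)^{n/2} = \left( \frac{\left| \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \right|}{\left| \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' \right|} \right)^{n/2} < c_\alpha \tag{5-13}$$

where $c_\alpha$ is the lower $(100\alpha)$th percentile of the distribution of $\Lambda$. (Note that the likelihood ratio test statistic is a power of the ratio of generalized variances.) Fortunately, because of the following relation between $T^2$ and $\Lambda$, we do not need the distribution of the latter to carry out the test.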
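Result 5.1. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from an $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ population. Then the test in (5-7) based on $T^2$ is equivalent to the likelihood ratio test of $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1: \boldsymbol{\mu} \neq \boldsymbol{\mu}_0$ because

$$\Lambda^{2/n} = \left( 1 + \frac{T^2}{(n-1)} \right)^{-1}$$

Proof. Let the $(p + 1) \times (p + 1)$ matrix

$$\mathbf{A} = \begin{bmatrix} \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' & \sqrt{n}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \\ \sqrt{n}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)' & -1 \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}$$

By Exercise 4.11, $|\mathbf{A}| = |\mathbf{A}_{22}||\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| = |\mathbf{A}_{11}||\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}|$, from which we obtain

$$(-1)\left| \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)' \right| = \left| \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \right| \left| -1 - n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\left( \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \right)^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0) \right|$$

Since, by (4-14),

$$\sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\mathbf{x}_j - \bar{\mathbf{x}} + \bar{\mathbf{x}} - \boldsymbol{\mu}_0)' = \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' + n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'$$

the foregoing equality involving determinants can be written

$$(-1)\left| \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' \right| = \left| \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \right| (-1)\left( 1 + \frac{T^2}{(n-1)} \right)$$

or

$$|n\hat{\boldsymbol{\Sigma}}_0| = |n\hat{\boldsymbol{\Sigma}}|\left( 1 + \frac{T^2}{(n-1)} \right)$$

Thus,

$$\Lambda^{2/n} = \frac{|\hat{\boldsymbol{\Sigma}}|}{|\hat{\boldsymbol{\Sigma}}_0|} = \left( 1 + \frac{T^2}{(n-1)} \right)^{-1} \tag{5-14}$$

Here $H_0$ is rejected for small values of $\Lambda^{2/n}$ or, equivalently, large values of $T^2$. The critical values of $T^2$ are determined by (5-6). ■

Incidentally, relation (5-14) shows that $T^2$ may be calculated from two determinants, thus avoiding the computation of $\mathbf{S}^{-1}$. Solving (5-14) for $T^2$, we have

$$T^2 = \frac{(n-1)|\hat{\boldsymbol{\Sigma}}_0|}{|\hat{\boldsymbol{\Sigma}}|} - (n-1) = \frac{(n-1)\left| \sum_{j=1}^{n} (\mathbf{x}_j - \boldsymbol{\mu}_0)(\mathbf{x}_j - \boldsymbol{\mu}_0)' \right|}{\left| \sum_{j=1}^{n} (\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \right|} - (n-1) \tag{5-15}$$

Likelihood ratio tests are common in multivariate analysis. Their optimal large sample properties hold in very general contexts, as we shall indicate shortly. They are well suited for the testing situations considered in this book. Likelihood ratio methods yield test statistics that reduce to the familiar F- and t-statistics in univariate situations.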
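Relations (5-14) and (5-15) can be verified on the data of Example 5.1, where $T^2 = 7/9$ was obtained through $\mathbf{S}^{-1}$. The sketch below computes $T^2$ from the two determinants instead; it is restricted to p = 2, uses exact rational arithmetic, and the helper names `scatter` and `det2` are ours.

```python
from fractions import Fraction

def scatter(rows, center):
    """2 x 2 matrix sum_j (x_j - c)(x_j - c)' about the point c."""
    a = sum((r[0] - center[0]) ** 2 for r in rows)
    b = sum((r[0] - center[0]) * (r[1] - center[1]) for r in rows)
    c = sum((r[1] - center[1]) ** 2 for r in rows)
    return [[a, b], [b, c]]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

rows = [[6, 9], [10, 6], [8, 3]]   # data of Example 5.1
mu0 = [9, 5]
n = len(rows)
xbar = [Fraction(sum(r[k] for r in rows), n) for k in range(2)]

det_about_mean = Fraction(det2(scatter(rows, xbar)))   # |sum (x_j - xbar)(x_j - xbar)'|
det_about_mu0 = Fraction(det2(scatter(rows, mu0)))     # |sum (x_j - mu0)(x_j - mu0)'|

# Wilks' lambda: the common factor n^p cancels in the ratio of the determinants.
wilks = det_about_mean / det_about_mu0

# Equation (5-15): T^2 from the two determinants, no matrix inverse needed.
t2 = (n - 1) * det_about_mu0 / det_about_mean - (n - 1)

print(det_about_mean, det_about_mu0, wilks, t2)
```

The determinants are 108 and 150, so $\Lambda^{2/n} = 18/25$ and $T^2 = 2(150/108) - 2 = 7/9$, matching Example 5.1 and satisfying (5-14).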

General Likelihood Ratio Method

We shall now consider the general likelihood ratio method. Let $\boldsymbol{\theta}$ be a vector consisting of all the unknown population parameters, and let $L(\boldsymbol{\theta})$ be the likelihood function obtained by evaluating the joint density of $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ at their observed values $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. The parameter vector $\boldsymbol{\theta}$ takes its value in the parameter set $\boldsymbol{\Theta}$. For example, in the p-dimensional multivariate normal case, $\boldsymbol{\theta}' = [\mu_1, \ldots, \mu_p, \sigma_{11}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{p-1,p}, \sigma_{pp}]$ and $\boldsymbol{\Theta}$ consists of the p-dimensional space, where $-\infty < \mu_1 < \infty, \ldots, -\infty < \mu_p < \infty$, combined with the $[p(p + 1)/2]$-dimensional space of variances and covariances such that $\boldsymbol{\Sigma}$ is positive definite. Therefore, $\boldsymbol{\Theta}$ has dimension $\nu = p + p(p + 1)/2$. Under the null hypothesis $H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$, $\boldsymbol{\theta}$ is restricted to lie in a subset $\boldsymbol{\Theta}_0$ of $\boldsymbol{\Theta}$. For the multivariate normal situation with $\boldsymbol{\mu} = \boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}$ unspecified, $\boldsymbol{\Theta}_0 = \{\mu_1 = \mu_{10}, \mu_2 = \mu_{20}, \ldots, \mu_p = \mu_{p0};\ \sigma_{11}, \ldots, \sigma_{1p}, \sigma_{22}, \ldots, \sigma_{2p}, \ldots, \sigma_{p-1,p}, \sigma_{pp}$ with $\boldsymbol{\Sigma}$ positive definite$\}$, so $\boldsymbol{\Theta}_0$ has dimension $\nu_0 = 0 + p(p + 1)/2 = p(p + 1)/2$.

A likelihood ratio test of $H_0: \boldsymbol{\theta} \in \boldsymbol{\Theta}_0$ rejects $H_0$ in favor of $H_1: \boldsymbol{\theta} \notin \boldsymbol{\Theta}_0$ if

$$\Lambda = \frac{\max_{\boldsymbol{\theta} \in \boldsymbol{\Theta}_0} L(\boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \boldsymbol{\Theta}} L(\boldsymbol{\theta})} < c \tag{5-16}$$

where c is a suitably chosen constant. Intuitively, we reject $H_0$ if the maximum of the likelihood obtained by allowing $\boldsymbol{\theta}$ to vary over the set $\boldsymbol{\Theta}_0$ is much smaller than the maximum of the likelihood obtained by varying $\boldsymbol{\theta}$ over all values in $\boldsymbol{\Theta}$. When the maximum in the numerator of expression (5-16) is much smaller than the maximum in the denominator, $\boldsymbol{\Theta}_0$ does not contain plausible values for $\boldsymbol{\theta}$.

In each application of the likelihood ratio method, we must obtain the sampling distribution of the likelihood-ratio test statistic $\Lambda$. Then c can be selected to produce a test with a specified significance level $\alpha$. However, when the sample size is large and certain regularity conditions are satisfied, the sampling distribution of $-2 \ln \Lambda$ is well approximated by a chi-square distribution. This attractive feature accounts, in part, for the popularity of likelihood ratio procedures.
Chapter 5 Inferences about a Mean Vector
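Result 5.2. When the sample size $n$ is large, under the null hypothesis $H_0$,

$$-2\ln\Lambda = -2\ln\left(\frac{\max_{\boldsymbol{\theta} \in \Theta_0} L(\boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Theta} L(\boldsymbol{\theta})}\right)$$

is, approximately, a $\chi^2_{\nu - \nu_0}$ random variable. Here the degrees of freedom are $\nu - \nu_0 = (\text{dimension of } \Theta) - (\text{dimension of } \Theta_0)$.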
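The dimension count behind Result 5.2 can be checked mechanically for the multivariate normal mean test described above. The sketch below is only an illustration; the function name `lrt_dimensions` is hypothetical and not part of the text.

```python
def lrt_dimensions(p):
    """Parameter-space dimensions for testing H0: mu = mu0 in N_p(mu, Sigma).

    Returns (nu, nu0, nu - nu0), where nu - nu0 is the chi-square
    degrees of freedom used in the large-sample approximation.
    """
    n_cov = p * (p + 1) // 2   # distinct variances and covariances in Sigma
    nu = p + n_cov             # full parameter set: p means plus Sigma
    nu0 = n_cov                # under H0 the means are fixed, Sigma is free
    return nu, nu0, nu - nu0


for p in (1, 2, 4):
    print(p, lrt_dimensions(p))
```

For this particular hypothesis the degrees of freedom always equal $p$, the number of mean components fixed by $H_0$.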
Statistical tests are compared on the basis of their power, which is defined as the curve or surface whose height is $P[\text{test rejects } H_0 \mid \boldsymbol{\theta}]$, evaluated at each parameter vector $\boldsymbol{\theta}$. Power measures the ability of a test to reject $H_0$ when it is not true. In the rare situation where $\boldsymbol{\theta} = \boldsymbol{\theta}_0$ is completely specified under $H_0$ and the alternative $H_1$ consists of the single specified value $\boldsymbol{\theta} = \boldsymbol{\theta}_1$, the likelihood ratio test has the highest power among all tests with the same significance level $\alpha = P[\text{test rejects } H_0 \mid \boldsymbol{\theta} = \boldsymbol{\theta}_0]$. In many single-parameter cases ($\theta$ has one component), the likelihood ratio test is uniformly most powerful against all alternatives to one side of $H_0\colon \theta = \theta_0$. In other cases, this property holds approximately for large samples.

We shall not give the technical details required for discussing the optimal properties of likelihood ratio tests in the multivariate situation. The general import of these properties, for our purposes, is that they have the highest possible (average) power when the sample size is large.
5.4 CONFIDENCE REGIONS AND SIMULTANEOUS COMPARISONS OF COMPONENT MEANS
To obtain our primary method for making inferences from a sample, we need to extend the concept of a univariate confidence interval to a multivariate confidence region. Let $\boldsymbol{\theta}$ be a vector of unknown population parameters and $\Theta$ be the set of all possible values of $\boldsymbol{\theta}$. A confidence region is a region of likely $\boldsymbol{\theta}$ values. This region is determined by the data, and for the moment, we shall denote it by $R(\mathbf{X})$, where $\mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n]'$ is the data matrix.

The region $R(\mathbf{X})$ is said to be a $100(1-\alpha)\%$ confidence region if, before the sample is selected,

$$P[R(\mathbf{X}) \text{ will cover the true } \boldsymbol{\theta}] = 1 - \alpha \qquad (5\text{-}17)$$

This probability is calculated under the true, but unknown, value of $\boldsymbol{\theta}$.

The confidence region for the mean $\boldsymbol{\mu}$ of a $p$-dimensional normal population is available from (5-6). Before the sample is selected,

$$P\left[n(\bar{\mathbf{X}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}) \le \frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)\right] = 1 - \alpha$$

whatever the values of the unknown $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. In words, $\bar{\mathbf{X}}$ will be within $[(n-1)pF_{p,n-p}(\alpha)/(n-p)]^{1/2}$ of $\boldsymbol{\mu}$, with probability $1-\alpha$, provided that distance is defined in terms of $n\mathbf{S}^{-1}$. For a particular sample, $\bar{\mathbf{x}}$ and $\mathbf{S}$ can be computed, and the inequality $n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le (n-1)pF_{p,n-p}(\alpha)/(n-p)$ will define a region $R(\mathbf{X})$ within the space of all possible parameter values. In this case, the region will be an ellipsoid centered at $\bar{\mathbf{x}}$. This ellipsoid is the $100(1-\alpha)\%$ confidence region for $\boldsymbol{\mu}$.
To determine whether any $\boldsymbol{\mu}_0$ falls within the confidence region (is a plausible value for $\boldsymbol{\mu}$), we need to compute the generalized squared distance $n(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}_0)$ and compare it with $[p(n-1)/(n-p)]F_{p,n-p}(\alpha)$. If the squared distance is larger than $[p(n-1)/(n-p)]F_{p,n-p}(\alpha)$, $\boldsymbol{\mu}_0$ is not in the confidence region. Since this is analogous to testing $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1\colon \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$ [see (5-7)], we see that the confidence region of (5-18) consists of all $\boldsymbol{\mu}_0$ vectors for which the $T^2$-test would not reject $H_0$ in favor of $H_1$ at significance level $\alpha$.

For $p \ge 4$, we cannot graph the joint confidence region for $\boldsymbol{\mu}$. However, we can calculate the axes of the confidence ellipsoid and their relative lengths. These are determined from the eigenvalues $\lambda_i$ and eigenvectors $\mathbf{e}_i$ of $\mathbf{S}$. As in (4-7), the directions and lengths of the axes of

$$n(\bar{\mathbf{x}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \boldsymbol{\mu}) \le c^2 = \frac{p(n-1)}{(n-p)}F_{p,n-p}(\alpha)$$

are determined by going

$$\sqrt{\lambda_i}\,c/\sqrt{n} = \sqrt{\lambda_i}\sqrt{p(n-1)F_{p,n-p}(\alpha)/n(n-p)}$$

units along the eigenvectors $\mathbf{e}_i$. Beginning at the center $\bar{\mathbf{x}}$, the axes of the confidence ellipsoid are

$$\pm\sqrt{\lambda_i}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)}\;\mathbf{e}_i \quad \text{where } \mathbf{S}\mathbf{e}_i = \lambda_i\mathbf{e}_i,\ i = 1, 2, \ldots, p \qquad (5\text{-}19)$$
The ratios of the $\lambda_i$'s will help identify relative amounts of elongation along pairs of axes.

Example 5.3 (Constructing a confidence ellipse for $\boldsymbol{\mu}$)

Data for radiation from microwave ovens were introduced in Examples 4.10 and 4.17. Let

$x_1 = \sqrt[4]{\text{measured radiation with door closed}}$

and

$x_2 = \sqrt[4]{\text{measured radiation with door open}}$
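For the $n = 42$ pairs of transformed observations, we find that

$$\bar{\mathbf{x}} = \begin{bmatrix} .564 \\ .603 \end{bmatrix}, \quad \mathbf{S} = \begin{bmatrix} .0144 & .0117 \\ .0117 & .0146 \end{bmatrix}, \quad \mathbf{S}^{-1} = \begin{bmatrix} 203.018 & -163.391 \\ -163.391 & 200.228 \end{bmatrix}$$

The eigenvalue and eigenvector pairs for $\mathbf{S}$ are

$\lambda_1 = .026, \quad \mathbf{e}_1' = [.704, .710]$
$\lambda_2 = .002, \quad \mathbf{e}_2' = [-.710, .704]$

The 95% confidence ellipse for $\boldsymbol{\mu}$ consists of all values $(\mu_1, \mu_2)$ satisfying

$$42\,[.564 - \mu_1,\ .603 - \mu_2]\begin{bmatrix} 203.018 & -163.391 \\ -163.391 & 200.228 \end{bmatrix}\begin{bmatrix} .564 - \mu_1 \\ .603 - \mu_2 \end{bmatrix} \le \frac{2(41)}{40}F_{2,40}(.05)$$

or, since $F_{2,40}(.05) = 3.23$,

$$42(203.018)(.564 - \mu_1)^2 + 42(200.228)(.603 - \mu_2)^2 - 84(163.391)(.564 - \mu_1)(.603 - \mu_2) \le 6.62$$

To see whether $\boldsymbol{\mu}' = [.562, .589]$ is in the confidence region, we compute

$$42(203.018)(.564 - .562)^2 + 42(200.228)(.603 - .589)^2 - 84(163.391)(.564 - .562)(.603 - .589) = 1.30 \le 6.62$$

so $\boldsymbol{\mu}' = [.562, .589]$ is in the region. Equivalently, a test of $H_0\colon \boldsymbol{\mu} = [.562, .589]'$ would not be rejected in favor of $H_1\colon \boldsymbol{\mu} \ne [.562, .589]'$ at the $\alpha = .05$ level of significance.

The joint confidence ellipsoid is plotted in Figure 5.1. The center is at $\bar{\mathbf{x}}' = [.564, .603]$, and the half-lengths of the major and minor axes are given by

$$\sqrt{\lambda_1}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)} = \sqrt{.026}\sqrt{\frac{2(41)}{42(40)}(3.23)} = .064$$

and

$$\sqrt{\lambda_2}\sqrt{\frac{p(n-1)}{n(n-p)}F_{p,n-p}(\alpha)} = \sqrt{.002}\sqrt{\frac{2(41)}{42(40)}(3.23)} = .018$$

respectively. The axes lie along $\mathbf{e}_1' = [.704, .710]$ and $\mathbf{e}_2' = [-.710, .704]$.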
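The arithmetic of Example 5.3 can be retraced with a few lines of code. The sketch below is an illustration only: it takes $\bar{\mathbf{x}}$, $\mathbf{S}^{-1}$, the eigenvalues, and the critical value $F_{2,40}(.05) = 3.23$ exactly as quoted in the example, and the helper name `scaled_sq_distance` is hypothetical.

```python
import math

n, p = 42, 2
xbar = (0.564, 0.603)
S_inv = ((203.018, -163.391),
         (-163.391, 200.228))      # inverse sample covariance quoted in the example
F_CRIT = 3.23                      # F_{2,40}(.05), taken from the text


def scaled_sq_distance(mu):
    """n (xbar - mu)' S^{-1} (xbar - mu) for a candidate mean vector mu."""
    d0, d1 = xbar[0] - mu[0], xbar[1] - mu[1]
    quad = (S_inv[0][0] * d0 * d0
            + 2 * S_inv[0][1] * d0 * d1
            + S_inv[1][1] * d1 * d1)
    return n * quad


c2 = p * (n - 1) / (n - p) * F_CRIT            # right-hand side of the ellipse inequality
dist = scaled_sq_distance((0.562, 0.589))
inside = dist <= c2                            # True means mu0 is a plausible value

# Half-lengths of the ellipse axes, sqrt(lambda_i) * c / sqrt(n), as in (5-19).
eigenvalues = (0.026, 0.002)
half_lengths = [math.sqrt(lam) * math.sqrt(c2 / n) for lam in eigenvalues]

print(round(dist, 2), round(c2, 2), inside, [round(h, 3) for h in half_lengths])
```

Because the check uses the rounded $\mathbf{S}^{-1}$ printed in the example rather than an inverse recomputed from the rounded $\mathbf{S}$, it follows the example's own arithmetic.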
[Plot omitted: individual values against Observation Number, with LCL = −2071.]

Figure 5.7 The $\bar{X}$-chart for $x_2$ = extraordinary event hours.
Was there a special cause of the single point for extraordinary event overtime that is outside the upper control limit in Figure 5.7? During this period, the United States bombed a foreign capital, and students at Madison were protesting. A majority of the extraordinary overtime was used in that four-week period. Although, by its very definition, extraordinary overtime occurs only when special events occur and is therefore unpredictable, it still has a certain stability.

$T^2$-Chart. A $T^2$-chart can be applied to a large number of characteristics. Unlike the ellipse format, it is not limited to two variables. Moreover, the points are displayed in time order rather than as a scatter plot, and this makes patterns and trends visible.

For the $j$th point, we calculate the $T^2$-statistic

$$T_j^2 = (\mathbf{x}_j - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x}_j - \bar{\mathbf{x}}) \qquad (5\text{-}33)$$

We then plot the $T^2$-values on a time axis. The lower control limit is zero, and we use the upper control limit

$$\text{UCL} = \chi_p^2(.05)$$

or, sometimes, $\chi_p^2(.01)$.

There is no centerline in the $T^2$-chart. Notice that the $T^2$-statistic is the same as the quantity $d_j^2$ used to test normality in Section 4.6.

Example 5.10 (A $T^2$-chart for overtime hours)

Using the police department data in Example 5.8, we construct a $T^2$-plot based on the two variables $X_1$ = legal appearances hours and $X_2$ = extraordinary event hours. $T^2$-charts with more than two variables are considered in Exercise 5.26. We take $\alpha = .01$ to be consistent with the ellipse format chart in Example 5.9.

The $T^2$-chart in Figure 5.8 reveals that the pair (legal appearances, extraordinary event) hours for period 11 is out of control. Further investigation, as in Example 5.9, confirms that this is due to the large value of extraordinary event overtime during that period.
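The $T^2$-chart calculation in (5-33) can be sketched for two variables as follows. The data below are hypothetical (they are not the police department data), the function name `t2_statistics` is an assumption, and the closed-form limit $-2\ln\alpha$ is valid only for a chi-square distribution with 2 degrees of freedom.

```python
import math


def t2_statistics(points):
    """Return T_j^2 = (x_j - xbar)' S^{-1} (x_j - xbar) for bivariate points."""
    n = len(points)
    m1 = sum(x for x, _ in points) / n
    m2 = sum(y for _, y in points) / n
    s11 = sum((x - m1) ** 2 for x, _ in points) / (n - 1)
    s22 = sum((y - m2) ** 2 for _, y in points) / (n - 1)
    s12 = sum((x - m1) * (y - m2) for x, y in points) / (n - 1)
    det = s11 * s22 - s12 ** 2
    values = []
    for x, y in points:
        d1, d2 = x - m1, y - m2
        # Quadratic form with the 2 x 2 inverse written out explicitly.
        values.append((s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det)
    return values


# Hypothetical data: ten ordinary periods followed by one unusual period.
periods = [(10, 20), (11, 21), (9, 19), (10, 21), (11, 20),
           (9, 20), (10, 19), (11, 19), (9, 21), (10, 20), (20, 10)]

t2 = t2_statistics(periods)
ucl = -2 * math.log(0.05)    # chi-square upper 5% point for p = 2 variables
flagged = [j + 1 for j, value in enumerate(t2) if value > ucl]
print(flagged)
```

A useful sanity check on any such computation is that the $T_j^2$ values always sum to $(n-1)p$ when $\bar{\mathbf{x}}$ and $\mathbf{S}$ are computed from the same $n$ observations.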
[Plot omitted: $T^2$ (vertical axis) against Period (horizontal axis).]

Figure 5.8 The $T^2$-chart for legal appearances hours and extraordinary event hours, $\alpha = .01$.
When the multivariate $T^2$-chart signals that the $j$th unit is out of control, it should be determined which variables are responsible. A modified region based on Bonferroni intervals is frequently chosen for this purpose. The $k$th variable is out of control if $x_{jk}$ does not lie in the interval

$$\left(\bar{x}_k - t_{n-1}(.005/p)\sqrt{s_{kk}},\ \ \bar{x}_k + t_{n-1}(.005/p)\sqrt{s_{kk}}\right)$$

where $p$ is the total number of measured variables.

Example 5.11 (Control of robotic welders: more than $T^2$ needed)

The assembly of a driveshaft for an automobile requires the circle welding of tube yokes to a tube. The inputs to the automated welding machines must be controlled to be within certain operating limits where a machine produces welds of good quality. In order to control the process, one process engineer measured four critical variables:

$X_1$ = Voltage (volts)
$X_2$ = Current (amps)
$X_3$ = Feed speed (in/min)
$X_4$ = (inert) Gas flow (cfm)

Table 5.9 gives the values of these variables at 5-second intervals.
Section 5.6 Multivariate Quality Control Charts

TABLE 5.9 WELDER DATA

Case   Voltage (X1)   Current (X2)   Feed speed (X3)   Gas flow (X4)
  1       23.0           276            289.6             51.0
  2       22.0           281            289.0             51.7
  3       22.8           270            288.2             51.3
  4       22.1           278            288.0             52.3
  5       22.5           275            288.0             53.0
  6       22.2           273            288.0             51.0
  7       22.0           275            290.0             53.0
  8       22.1           268            289.0             54.0
  9       22.5           277            289.0             52.0
 10       22.5           278            289.0             52.0
 11       22.3           269            287.0             54.0
 12       21.8           274            287.6             52.0
 13       22.3           270            288.4             51.0
 14       22.2           273            290.2             51.3
 15       22.1           274            286.0             51.0
 16       22.1           277            287.0             52.0
 17       21.8           277            287.0             51.0
 18       22.6           276            290.0             51.0
 19       22.3           278            287.0             51.7
 20       23.0           266            289.1             51.0
 21       22.9           271            288.3             51.0
 22       21.3           274            289.0             52.0
 23       21.8           280            290.0             52.0
 24       22.0           268            288.3             51.0
 25       22.8           269            288.7             52.0
 26       22.0           264            290.0             51.0
 27       22.5           273            288.6             52.0
 28       22.2           269            288.2             52.0
 29       22.6           273            286.0             52.0
 30       21.7           283            290.0             52.7
 31       21.9           273            288.7             55.3
 32       22.3           264            287.0             52.0
 33       22.2           263            288.0             52.0
 34       22.3           266            288.6             51.7
 35       22.0           263            288.0             51.7
 36       22.8           272            289.0             52.3
 37       22.0           277            287.7             53.3
 38       22.7           272            289.0             52.0
 39       22.6           274            287.2             52.7
 40       22.7           270            290.0             51.0

Source: Data courtesy of Mark Abbotoy.
The normal assumption is reasonable for most variables, but we take the natural logarithm of gas flow. In addition, there is no appreciable serial correlation for successive observations on each variable.

A $T^2$-chart for the four welding variables is given in Figure 5.9. The dotted line is the 95% limit and the solid line is the 99% limit. Using the 99% limit, no points are out of control, but case 31 is outside the 95% limit.

What do the quality control ellipses (ellipse format charts) show for two variables? Most of the variables are in control. However, the 99% quality ellipse for gas flow and voltage, shown in Figure 5.10, reveals that case 31 is out of control and this is due to an unusually large volume of gas flow. The univariate $\bar{X}$-chart for ln(gas flow), in Figure 5.11, shows that this point is outside the three sigma limits. It appears that gas flow was reset at the target for case 32. All the other univariate $\bar{X}$-charts have all points within their three sigma control limits.

[Plot omitted: $T^2$ (vertical axis) against Case (horizontal axis), with the 95% limit drawn as a dotted line and the 99% limit as a solid line.]

Figure 5.9 The $T^2$-chart for the welding data with 95% and 99% limits.
[Plot omitted: ln(gas flow) (vertical axis, 3.85 to 4.05) against Voltage (horizontal axis, 20.5 to 24.0).]

Figure 5.10 The 99% quality control ellipse for ln(gas flow) and voltage.
[Figure 5.11, the $\bar{X}$-chart for ln(gas flow): plot omitted; UCL = 4.005.]

[Plot omitted: Extraordinary Event Overtime (vertical axis) against Appearances Overtime (horizontal axis, 1500 to 5500).]

Figure 5.12 The 95% control ellipse for future legal appearances and extraordinary event overtime.
in time order. Set LCL = 0, and take

$$\text{UCL} = \frac{(n-1)p}{(n-p)}F_{p,n-p}(.05)$$

Points above the upper control limit represent potential special cause variation and suggest that the process in question should be examined to determine whether immediate corrective action is warranted.

Control Charts Based on Subsample Means

It is assumed that each random vector of observations from the process is independently distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$. We proceed differently when the sampling procedure specifies that $m > 1$ units be selected, at the same time, from the process. From the first sample, we determine its sample mean $\bar{\mathbf{X}}_1$ and covariance matrix $\mathbf{S}_1$. When the population is normal, these two random quantities are independent.

For a general subsample mean $\bar{\mathbf{X}}_j$, $\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}}$ has a normal distribution with mean $\mathbf{0}$ and

$$\text{Cov}(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}}) = \left(1 - \frac{1}{n}\right)^2\text{Cov}(\bar{\mathbf{X}}_j) + \frac{n-1}{n^2}\text{Cov}(\bar{\mathbf{X}}_1) = \frac{(n-1)}{nm}\boldsymbol{\Sigma}$$

where

$$\bar{\bar{\mathbf{X}}} = \frac{1}{n}\sum_{j=1}^{n}\bar{\mathbf{X}}_j$$

As will be described in Section 6.4, the sample covariances from the $n$ subsamples can be combined to give a single estimate (called $\mathbf{S}_{\text{pooled}}$ in Chapter 6) of the common covariance $\boldsymbol{\Sigma}$. This pooled estimate is

$$\mathbf{S} = \frac{1}{n}(\mathbf{S}_1 + \mathbf{S}_2 + \cdots + \mathbf{S}_n)$$

Here $(nm - n)\mathbf{S}$ is independent of each $\bar{\mathbf{X}}_j$ and, therefore, of their mean $\bar{\bar{\mathbf{X}}}$. Further, $(nm - n)\mathbf{S}$ is distributed as a Wishart random matrix with $nm - n$ degrees of freedom. Notice that we are estimating $\boldsymbol{\Sigma}$ internally from the data collected in any given period. These estimators are combined to give a single estimator with a large number of degrees of freedom. Consequently,

$$T^2 = \frac{nm}{n-1}(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})$$

is distributed as

$$\frac{(nm - n)p}{(nm - n - p + 1)}F_{p,\,nm-n-p+1} \qquad (5\text{-}35)$$

Ellipse Format Chart. In an analogous fashion to our discussion on individual multivariate observations, the ellipse format chart for pairs of subsample means is

$$(\bar{\mathbf{x}} - \bar{\bar{\mathbf{x}}})'\mathbf{S}^{-1}(\bar{\mathbf{x}} - \bar{\bar{\mathbf{x}}}) \le \frac{(n-1)(m-1)2}{m(nm - n - 1)}F_{2,\,nm-n-1}(.05) \qquad (5\text{-}36)$$

although the right-hand side is usually approximated as $\chi_2^2(.05)/m$.

Subsamples corresponding to points outside of the control ellipse should be carefully checked for changes in the behavior of the quality characteristics being measured. The interested reader is referred to the chapter references for additional discussion.

$T^2$-Chart. To construct a $T^2$-chart with subsample data and $p$ characteristics, we plot the quantity

$$T_j^2 = m(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})$$

for $j = 1, 2, \ldots, n$, where the

$$\text{UCL} = \frac{(n-1)(m-1)p}{(nm - n - p + 1)}F_{p,\,nm-n-p+1}(.05)$$

The UCL is often approximated as $\chi_p^2(.05)$ when $n$ is large.

Values of $T_j^2$ that exceed the UCL correspond to potentially out-of-control or special cause variation, which should be checked.
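The constants in the subsample-mean charts follow from the covariance of $\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}}$ and the $F$ distribution in (5-35). The short derivation below is a sketch that uses only $\text{Cov}(\bar{\mathbf{X}}_i) = \boldsymbol{\Sigma}/m$ and the independence of the $n$ subsample means; it is offered as a check on the formulas, not as part of the original development.

```latex
\begin{aligned}
\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}}
  &= \Bigl(1 - \tfrac{1}{n}\Bigr)\bar{\mathbf{X}}_j
     - \tfrac{1}{n}\sum_{i \ne j}\bar{\mathbf{X}}_i, \\
\operatorname{Cov}\bigl(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}}\bigr)
  &= \Bigl(1 - \tfrac{1}{n}\Bigr)^{2}\frac{\boldsymbol{\Sigma}}{m}
     + \frac{n-1}{n^{2}}\,\frac{\boldsymbol{\Sigma}}{m}
   = \frac{(n-1)^{2} + (n-1)}{n^{2}m}\,\boldsymbol{\Sigma}
   = \frac{n-1}{nm}\,\boldsymbol{\Sigma}, \\
m\,(\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}
   (\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})
  &= \frac{n-1}{n}\cdot\frac{nm}{n-1}\,
     (\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}})'\mathbf{S}^{-1}
     (\bar{\mathbf{X}}_j - \bar{\bar{\mathbf{X}}}), \\
\text{UCL}
  &= \frac{n-1}{n}\cdot\frac{(nm-n)\,p}{nm-n-p+1}\,F_{p,\,nm-n-p+1}(.05)
   = \frac{(n-1)(m-1)\,p}{nm-n-p+1}\,F_{p,\,nm-n-p+1}(.05).
\end{aligned}
```

The last step uses $nm - n = n(m-1)$; setting $p = 2$ and dividing by $m$ gives the right-hand side of (5-36).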
0 implies that treatment 2 is larger, on average, than treatment 1. In general, inferences about $\boldsymbol{\delta}$ can be made using Result 6.1.
Section 6.2 Paired Comparisons and a Repeated Measures Design

Example 6.1 (Checking for a mean difference with paired observations)
Municipal wastewater treatment plants are required by law to monitor their discharges into rivers and streams on a regular basis. Concern about the reliability of data from one of these self-monitoring programs led to a study in which samples of effluent were divided and sent to two laboratories for testing. One-half of each sample was sent to the Wisconsin State Laboratory of Hygiene, and one-half was sent to a private commercial laboratory routinely used in the monitoring program. Measurements of biochemical oxygen demand (BOD) and suspended solids (SS) were obtained, for $n = 11$ sample splits, from the two laboratories. The data are displayed in Table 6.1.

TABLE 6.1 EFFLUENT DATA

             Commercial lab               State lab of hygiene
Sample j   x_1j1 (BOD)   x_1j2 (SS)     x_2j1 (BOD)   x_2j2 (SS)
   1            6            27             25            15
   2            6            23             28            13
   3           18            64             36            22
   4            8            44             35            29
   5           11            30             15            31
   6           34            75             44            64
   7           28            26             42            30
   8           71           124             54            64
   9           43            54             34            56
  10           33            30             29            20
  11           20            14             39            21

Source: Data courtesy of S. Weber.
Chapter 6 Comparisons of Several Multivariate Means
Do the two laboratories' chemical analyses agree? If differences exist, what is their nature?

The $T^2$-statistic for testing $H_0\colon \boldsymbol{\delta}' = [\delta_1, \delta_2] = [0, 0]$ is constructed from the differences of paired observations:

$d_{j1} = x_{1j1} - x_{2j1}$:   −19  −22  −18  −27   −4  −10  −14   17    9    4  −19
$d_{j2} = x_{1j2} - x_{2j2}$:    12   10   42   15   −1   11   −4   60   −2   10   −7

Here

$$\bar{\mathbf{d}} = \begin{bmatrix} \bar{d}_1 \\ \bar{d}_2 \end{bmatrix} = \begin{bmatrix} -9.36 \\ 13.27 \end{bmatrix}, \quad \mathbf{S}_d = \begin{bmatrix} 199.26 & 88.38 \\ 88.38 & 418.61 \end{bmatrix}$$

and

$$T^2 = 11\,[-9.36,\ 13.27]\begin{bmatrix} .0055 & -.0012 \\ -.0012 & .0026 \end{bmatrix}\begin{bmatrix} -9.36 \\ 13.27 \end{bmatrix} = 13.6$$

Taking $\alpha = .05$, we find that $[p(n-1)/(n-p)]F_{p,n-p}(.05) = [2(10)/9]F_{2,9}(.05) = 9.47$. Since $T^2 = 13.6 > 9.47$, we reject $H_0$ and conclude that there is a nonzero mean difference between the measurements of the two laboratories. It appears, from inspection of the data, that the commercial lab tends to produce lower BOD measurements and higher SS measurements than the State Lab of Hygiene. The 95% simultaneous confidence intervals for the mean differences $\delta_1$ and $\delta_2$ can be computed using (6-10). These intervals are

$$\delta_1\colon\ \bar{d}_1 \pm \sqrt{\frac{(n-1)p}{(n-p)}F_{p,n-p}(\alpha)}\sqrt{\frac{s_{d_1}^2}{n}} = -9.36 \pm \sqrt{9.47}\sqrt{\frac{199.26}{11}} \quad \text{or} \quad (-22.46,\ 3.74)$$

$$\delta_2\colon\ 13.27 \pm \sqrt{9.47}\sqrt{\frac{418.61}{11}} \quad \text{or} \quad (-5.71,\ 32.25)$$

The 95% simultaneous confidence intervals include zero, yet the hypothesis $H_0\colon \boldsymbol{\delta} = \mathbf{0}$ was rejected at the 5% level. What are we to conclude?

The evidence points toward real differences. The point $\boldsymbol{\delta} = \mathbf{0}$ falls outside the 95% confidence region for $\boldsymbol{\delta}$ (see Exercise 6.1), and this result is consistent with the $T^2$-test. The 95% simultaneous confidence coefficient applies to the entire set of intervals that could be constructed for all possible linear combinations of the form $a_1\delta_1 + a_2\delta_2$. The particular intervals corresponding to the choices $(a_1 = 1, a_2 = 0)$ and $(a_1 = 0, a_2 = 1)$ contain zero. Other choices of $a_1$ and $a_2$ will produce simultaneous intervals that do not contain zero. (If the hypothesis $H_0\colon \boldsymbol{\delta} = \mathbf{0}$ were not rejected, then all simultaneous intervals would include zero.) The Bonferroni simultaneous intervals also cover zero. (See Exercise 6.2.)
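The paired-comparison calculation of Example 6.1 can be reproduced directly from Table 6.1. The sketch below is illustrative: the variable names are hypothetical, and the critical value 9.47 is taken from the example rather than computed. Carrying full precision gives entries of $\mathbf{S}_d$ that may differ slightly in the second decimal place from the rounded values printed above, while $T^2$ still rounds to 13.6.

```python
import math

# (BOD, SS) measurements for the 11 sample splits in Table 6.1.
commercial = [(6, 27), (6, 23), (18, 64), (8, 44), (11, 30), (34, 75),
              (28, 26), (71, 124), (43, 54), (33, 30), (20, 14)]
state = [(25, 15), (28, 13), (36, 22), (35, 29), (15, 31), (44, 64),
         (42, 30), (54, 64), (34, 56), (29, 20), (39, 21)]

n = len(commercial)
d = [(c[0] - s[0], c[1] - s[1]) for c, s in zip(commercial, state)]

# Mean vector and covariance matrix of the paired differences.
dbar = (sum(x for x, _ in d) / n, sum(y for _, y in d) / n)
s11 = sum((x - dbar[0]) ** 2 for x, _ in d) / (n - 1)
s22 = sum((y - dbar[1]) ** 2 for _, y in d) / (n - 1)
s12 = sum((x - dbar[0]) * (y - dbar[1]) for x, y in d) / (n - 1)

# T^2 = n * dbar' S_d^{-1} dbar, with the 2 x 2 inverse written out.
det = s11 * s22 - s12 ** 2
t2 = n * (s22 * dbar[0] ** 2 - 2 * s12 * dbar[0] * dbar[1]
          + s11 * dbar[1] ** 2) / det

CRIT = 9.47    # [p(n-1)/(n-p)] F_{2,9}(.05), quoted in the example
reject = t2 > CRIT

# 95% simultaneous (T^2) intervals for the two component differences.
half1 = math.sqrt(CRIT * s11 / n)
half2 = math.sqrt(CRIT * s22 / n)
interval1 = (dbar[0] - half1, dbar[0] + half1)
interval2 = (dbar[1] - half2, dbar[1] + half2)

print(round(t2, 1), reject, interval1, interval2)
```

The output shows the feature discussed in the example: the joint test rejects $H_0$ even though each componentwise simultaneous interval contains zero.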
Our analysis assumed a normal distribution for the $\mathbf{D}_j$. In fact, the situation is further complicated by the presence of one or, possibly, two outliers. (See Exercise 6.3.) These data can be transformed to data more nearly normal, but with such a small sample, it is difficult to remove the effects of the outlier(s). (See Exercise 6.4.)

The numerical results of this example illustrate an unusual circumstance that can occur when making inferences. •

The experimenter in Example 6.1 actually divided a sample by first shaking it and then pouring it rapidly back and forth into two bottles for chemical analysis. This was prudent because a simple division of the sample into two pieces obtained by pouring the top half into one bottle and the remainder into another bottle might result in more suspended solids in the lower half due to settling. The two laboratories would then not be working with the same, or even like, experimental units, and the conclusions would not pertain to laboratory competence, measuring techniques, and so forth.

Whenever an investigator can control the assignment of treatments to experimental units, an appropriate pairing of units and a randomized assignment of treatments can enhance the statistical analysis. Differences, if any, between supposedly identical units must be identified and most-alike units paired. Further, a random assignment of treatment 1 to one unit and treatment 2 to the other unit will help eliminate the systematic effects of uncontrolled sources of variation. Randomization can be implemented by flipping a coin to determine whether the first unit in a pair receives treatment 1 (heads) or treatment 2 (tails). The remaining treatment is then assigned to the other unit. A separate independent randomization is conducted for each pair. One can conceive of the process as follows:

Experimental Design for Paired Comparisons

[Diagram omitted: like pairs of experimental units, numbered 1, 2, 3, ..., n; within each pair, treatments 1 and 2 are assigned at random.]
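We conclude our discussion of paired comparisons by noting that $\bar{\mathbf{d}}$ and $\mathbf{S}_d$, and hence $T^2$, may be calculated from the full-sample quantities $\bar{\mathbf{x}}$ and $\mathbf{S}$. Here $\bar{\mathbf{x}}$ is the $2p \times 1$ vector of sample averages for the $p$ variables on the two treatments given by

$$\bar{\mathbf{x}}' = [\bar{x}_{11}, \bar{x}_{12}, \ldots, \bar{x}_{1p}, \bar{x}_{21}, \bar{x}_{22}, \ldots, \bar{x}_{2p}] \qquad (6\text{-}11)$$

and $\mathbf{S}$ is the $2p \times 2p$ matrix of sample variances and covariances arranged as

$$\mathbf{S} = \begin{bmatrix} \underset{(p \times p)}{\mathbf{S}_{11}} & \underset{(p \times p)}{\mathbf{S}_{12}} \\ \underset{(p \times p)}{\mathbf{S}_{21}} & \underset{(p \times p)}{\mathbf{S}_{22}} \end{bmatrix} \qquad (6\text{-}12)$$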
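The matrix $\mathbf{S}_{11}$ contains the sample variances and covariances for the $p$ variables on treatment 1. Similarly, $\mathbf{S}_{22}$ contains the sample variances and covariances computed for the $p$ variables on treatment 2. Finally, $\mathbf{S}_{12} = \mathbf{S}_{21}'$ are the matrices of sample covariances computed from observations on pairs of treatment 1 and treatment 2 variables.

Defining the matrix

$$\underset{(p \times 2p)}{\mathbf{C}} = \begin{bmatrix} 1 & 0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 & 0 & \cdots & -1 \end{bmatrix} \qquad (6\text{-}13)$$

in which the first $-1$ of row 1 sits in the $(p+1)$st column, we can verify (see Exercise 6.9) that

$$\mathbf{d}_j = \mathbf{C}\mathbf{x}_j,\quad j = 1, 2, \ldots, n; \qquad \bar{\mathbf{d}} = \mathbf{C}\bar{\mathbf{x}} \quad \text{and} \quad \mathbf{S}_d = \mathbf{C}\mathbf{S}\mathbf{C}' \qquad (6\text{-}14)$$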
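Thus,

$$T^2 = n\bar{\mathbf{x}}'\mathbf{C}'[\mathbf{C}\mathbf{S}\mathbf{C}']^{-1}\mathbf{C}\bar{\mathbf{x}} \qquad (6\text{-}15)$$

and it is not necessary first to calculate the differences $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_n$. On the other hand, it is wise to calculate these differences in order to check normality and the assumption of a random sample.

Each row $\mathbf{c}_i'$ of the matrix $\mathbf{C}$ in (6-13) is a contrast vector, because its elements sum to zero. Attention is usually centered on contrasts when comparing treatments. Each contrast is perpendicular to the vector $\mathbf{1}' = [1, 1, \ldots, 1]$ since $\mathbf{c}_i'\mathbf{1} = 0$. The component $\mathbf{1}'\mathbf{x}_j$, representing the overall treatment sum, is ignored by the test statistic $T^2$ presented in this section.

A Repeated Measures Design for Comparing Treatments

Another generalization of the univariate paired $t$-statistic arises in situations where $q$ treatments are compared with respect to a single response variable. Each subject or experimental unit receives each treatment once over successive periods of time. The $j$th observation is

$$\mathbf{X}_j = \begin{bmatrix} X_{j1} \\ X_{j2} \\ \vdots \\ X_{jq} \end{bmatrix}, \quad j = 1, 2, \ldots, n$$

where $X_{ji}$ is the response to the $i$th treatment on the $j$th unit. The name repeated measures stems from the fact that all treatments are administered to each unit.

For comparative purposes, we consider contrasts of the components of $\boldsymbol{\mu} = E(\mathbf{X}_j)$. These could be

$$\begin{bmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \vdots \\ \mu_1 - \mu_q \end{bmatrix} = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 1 & 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & -1 \end{bmatrix}\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{bmatrix} = \mathbf{C}_1\boldsymbol{\mu}$$

or

$$\begin{bmatrix} \mu_2 - \mu_1 \\ \mu_3 - \mu_2 \\ \vdots \\ \mu_q - \mu_{q-1} \end{bmatrix} = \begin{bmatrix} -1 & 1 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix}\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{bmatrix} = \mathbf{C}_2\boldsymbol{\mu}$$

Both $\mathbf{C}_1$ and $\mathbf{C}_2$ are called contrast matrices, because their $q - 1$ rows are linearly independent and each is a contrast vector. The nature of the design eliminates much of the influence of unit-to-unit variation on treatment comparisons. Of course, the experimenter should randomize the order in which the treatments are presented to each subject.

When the treatment means are equal, $\mathbf{C}_1\boldsymbol{\mu} = \mathbf{C}_2\boldsymbol{\mu} = \mathbf{0}$. In general, the hypothesis that there are no differences in treatments (equal treatment means) becomes $\mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ for any choice of the contrast matrix $\mathbf{C}$. Consequently, based on the contrasts $\mathbf{C}\mathbf{x}_j$ in the observations, we have means $\mathbf{C}\bar{\mathbf{x}}$ and covariance matrix $\mathbf{C}\mathbf{S}\mathbf{C}'$, and we test $\mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ using the $T^2$-statistic

$$T^2 = n(\mathbf{C}\bar{\mathbf{x}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}\mathbf{C}\bar{\mathbf{x}}$$

It can be shown that $T^2$ does not depend on the particular choice of $\mathbf{C}$.¹

¹ Any pair of contrast matrices $\mathbf{C}_1$ and $\mathbf{C}_2$ must be related by $\mathbf{C}_1 = \mathbf{B}\mathbf{C}_2$, with $\mathbf{B}$ nonsingular. This follows because each $\mathbf{C}$ has the largest possible number, $q - 1$, of linearly independent rows, all perpendicular to the vector $\mathbf{1}$. Then $(\mathbf{B}\mathbf{C}_2)'(\mathbf{B}\mathbf{C}_2\mathbf{S}\mathbf{C}_2'\mathbf{B}')^{-1}(\mathbf{B}\mathbf{C}_2) = \mathbf{C}_2'\mathbf{B}'(\mathbf{B}')^{-1}(\mathbf{C}_2\mathbf{S}\mathbf{C}_2')^{-1}\mathbf{B}^{-1}\mathbf{B}\mathbf{C}_2 = \mathbf{C}_2'(\mathbf{C}_2\mathbf{S}\mathbf{C}_2')^{-1}\mathbf{C}_2$, so $T^2$ computed with $\mathbf{C}_2$ or $\mathbf{C}_1 = \mathbf{B}\mathbf{C}_2$ gives the same result.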
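The claim that $T^2$ does not depend on the choice of contrast matrix can be checked numerically. The sketch below assumes $q = 3$ treatments, so that $\mathbf{C}\mathbf{S}\mathbf{C}'$ is $2 \times 2$ and can be inverted in closed form; the data and the function name `contrast_t2` are hypothetical.

```python
import math


def contrast_t2(rows, C):
    """T^2 = n (C xbar)' (C S C')^{-1} (C xbar) for a 2 x q contrast matrix C."""
    n, q = len(rows), len(rows[0])
    xbar = [sum(r[k] for r in rows) / n for k in range(q)]
    S = [[sum((r[a] - xbar[a]) * (r[b] - xbar[b]) for r in rows) / (n - 1)
          for b in range(q)] for a in range(q)]
    y = [sum(C[i][k] * xbar[k] for k in range(q)) for i in range(2)]       # C xbar
    CS = [[sum(C[i][k] * S[k][b] for k in range(q)) for b in range(q)]
          for i in range(2)]
    A = [[sum(CS[i][k] * C[j][k] for k in range(q)) for j in range(2)]     # C S C'
         for i in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    quad = (A[1][1] * y[0] ** 2 - 2 * A[0][1] * y[0] * y[1]
            + A[0][0] * y[1] ** 2) / det
    return n * quad


# Hypothetical repeated-measures data: 6 units, q = 3 treatments each.
rows = [(10, 12, 15), (11, 11, 14), (9, 13, 17),
        (12, 14, 15), (10, 10, 13), (11, 15, 18)]

C1 = [[1, -1, 0], [1, 0, -1]]     # mu1 - mu2, mu1 - mu3
C2 = [[-1, 1, 0], [0, -1, 1]]     # mu2 - mu1, mu3 - mu2

t2_c1 = contrast_t2(rows, C1)
t2_c2 = contrast_t2(rows, C2)
print(t2_c1, t2_c2, math.isclose(t2_c1, t2_c2, rel_tol=1e-9))
```

Here $\mathbf{C}_1 = \mathbf{B}\mathbf{C}_2$ with $\mathbf{B}$ having rows $[-1, 0]$ and $[-1, -1]$, which is nonsingular, so the two statistics agree up to floating-point error.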
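A confidence region for contrasts $\mathbf{C}\boldsymbol{\mu}$, with $\boldsymbol{\mu}$ the mean of a normal population, is determined by the set of all $\mathbf{C}\boldsymbol{\mu}$ such that

$$n(\mathbf{C}\bar{\mathbf{x}} - \mathbf{C}\boldsymbol{\mu})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{x}} - \mathbf{C}\boldsymbol{\mu}) \le \frac{(n-1)(q-1)}{(n-q+1)}F_{q-1,\,n-q+1}(\alpha) \qquad (6\text{-}17)$$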
6.5 SIMULTANEOUS CONFIDENCE INTERVALS FOR TREATMENT EFFECTS
When the hypothesis of equal treatment effects is rejected, those effects that led to the rejection of the hypothesis are of interest. For pairwise comparisons, the Bonferroni approach (see Section 5.4) can be used to construct simultaneous confidence intervals for the components of the differences $\boldsymbol{\tau}_k - \boldsymbol{\tau}_\ell$ (or $\boldsymbol{\mu}_k - \boldsymbol{\mu}_\ell$). These intervals are shorter than those obtained for all contrasts, and they require critical values only for the univariate $t$-statistic.

Let $\tau_{ki}$ be the $i$th component of $\boldsymbol{\tau}_k$. Since $\boldsymbol{\tau}_k$ is estimated by $\hat{\boldsymbol{\tau}}_k = \bar{\mathbf{x}}_k - \bar{\mathbf{x}}$,

$$\hat{\tau}_{ki} = \bar{x}_{ki} - \bar{x}_i \qquad (6\text{-}41)$$

and $\hat{\tau}_{ki} - \hat{\tau}_{\ell i} = \bar{x}_{ki} - \bar{x}_{\ell i}$ is the difference between two independent sample means. The two-sample $t$-based confidence interval is valid with an appropriately modified $\alpha$. Notice that

$$\text{Var}(\hat{\tau}_{ki} - \hat{\tau}_{\ell i}) = \text{Var}(\bar{X}_{ki} - \bar{X}_{\ell i}) = \left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)\sigma_{ii}$$

where $\sigma_{ii}$ is the $i$th diagonal element of $\boldsymbol{\Sigma}$. As suggested by (6-37), $\text{Var}(\bar{X}_{ki} - \bar{X}_{\ell i})$ is estimated by dividing the corresponding element of $\mathbf{W}$ by its degrees of freedom. That is,

$$\widehat{\text{Var}}(\bar{X}_{ki} - \bar{X}_{\ell i}) = \left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)\frac{w_{ii}}{n - g}$$

where $w_{ii}$ is the $i$th diagonal element of $\mathbf{W}$ and $n = n_1 + \cdots + n_g$.

It remains to apportion the error rate over the numerous confidence statements. Relation (5-28) still applies. There are $p$ variables and $g(g-1)/2$ pairwise differences, so each two-sample $t$-interval will employ the critical value $t_{n-g}(\alpha/2m)$, where

$$m = pg(g-1)/2 \qquad (6\text{-}42)$$

is the number of simultaneous confidence statements.

Result 6.5. Let $n = \sum_{k=1}^{g} n_k$. For the model in (6-34), with confidence at least $(1 - \alpha)$,

$$\tau_{ki} - \tau_{\ell i} \quad \text{belongs to} \quad \bar{x}_{ki} - \bar{x}_{\ell i} \pm t_{n-g}\!\left(\frac{\alpha}{pg(g-1)}\right)\sqrt{\frac{w_{ii}}{n-g}\left(\frac{1}{n_k} + \frac{1}{n_\ell}\right)}$$

for all components $i = 1, \ldots, p$ and all differences $\ell < k = 1, \ldots, g$. Here $w_{ii}$ is the $i$th diagonal element of $\mathbf{W}$.

We shall illustrate the construction of simultaneous interval estimates for the pairwise differences in treatment means using the nursing-home data introduced in Example 6.9.

Example 6.10

We saw in Example 6.9 that average costs for nursing homes differ, depending on the type of ownership. We can use Result 6.5 to estimate the magnitudes of the differences. A comparison of the variable $X_3$, costs of plant operation and maintenance labor, between privately owned nursing homes and government-owned nursing homes can be made by estimating $\tau_{13} - \tau_{33}$. Using (6-35) and the information in Example 6.9, we have

$$\hat{\boldsymbol{\tau}}_1 = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}) = \begin{bmatrix} -.070 \\ -.039 \\ -.020 \\ -.020 \end{bmatrix}, \quad \hat{\boldsymbol{\tau}}_3 = (\bar{\mathbf{x}}_3 - \bar{\mathbf{x}}) = \begin{bmatrix} .137 \\ .002 \\ .023 \\ .003 \end{bmatrix}$$

$$\mathbf{W} = \begin{bmatrix} 182.962 & & & \\ 4.408 & 8.200 & & \\ 1.695 & .633 & 1.484 & \\ 9.581 & 2.428 & .394 & 6.538 \end{bmatrix}$$

Consequently,

$$\hat{\tau}_{13} - \hat{\tau}_{33} = -.020 - .023 = -.043$$

and $n = 271 + 138 + 107 = 516$, so that

$$\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_3}\right)\frac{w_{33}}{n - g}} = \sqrt{\left(\frac{1}{271} + \frac{1}{107}\right)\frac{1.484}{516 - 3}} = .00614$$

Since $p = 4$ and $g = 3$, for 95% simultaneous confidence statements we require that $t_{513}(.05/4(3)2) = 2.87$. (See Appendix, Table 1.) The 95% simultaneous confidence statement is

$$\tau_{13} - \tau_{33} \quad \text{belongs to} \quad \hat{\tau}_{13} - \hat{\tau}_{33} \pm t_{513}(.00208)\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_3}\right)\frac{w_{33}}{n - g}} = -.043 \pm 2.87(.00614) = -.043 \pm .018, \quad \text{or} \quad (-.061,\ -.025)$$

We conclude that the average maintenance and labor cost for government-owned nursing homes is higher by .025 to .061 hour per patient day than for privately owned nursing homes. With the same 95% confidence, we can say that $\tau_{13} - \tau_{23}$ belongs to the interval $(-.058, -.026)$
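The Bonferroni interval of Result 6.5 for a single component difference can be sketched as follows. The numbers are those quoted in Example 6.10; the critical value $t_{513}(.00208) = 2.87$ is taken from the text rather than computed, and the function name `bonferroni_interval` is hypothetical.

```python
import math


def bonferroni_interval(diff, w_ii, n_k, n_l, n_total, g, t_crit):
    """Interval for tau_ki - tau_li: diff +/- t * sqrt(w_ii/(n-g) * (1/n_k + 1/n_l))."""
    se = math.sqrt((1.0 / n_k + 1.0 / n_l) * w_ii / (n_total - g))
    half_width = t_crit * se
    return se, (diff - half_width, diff + half_width)


p, g = 4, 3
m = p * g * (g - 1) // 2          # number of simultaneous statements, (6-42)
alpha_each = 0.05 / (2 * m)       # upper-tail area used for each t critical value

se, interval = bonferroni_interval(
    diff=-0.043,                  # tau-hat_13 - tau-hat_33 from the example
    w_ii=1.484,                   # w_33, third diagonal element of W
    n_k=271, n_l=107, n_total=516, g=g,
    t_crit=2.87,                  # t_513(.00208), quoted in the text
)
print(m, round(alpha_each, 5), round(se, 5), [round(v, 3) for v in interval])
```

With $p = 4$ and $g = 3$ there are $m = 12$ statements, which is where the tail area $.05/24 \approx .00208$ comes from.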
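... $\chi^2_{(g-1)(b-1)p}(\alpha)$ ...

... $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \cdots = \boldsymbol{\tau}_g = \mathbf{0}$ and $H_1$: at least one $\boldsymbol{\tau}_\ell \ne \mathbf{0}$ ...

$$\Lambda^* = \frac{|\text{SSP}_{\text{res}}|}{|\text{SSP}_{\text{fac1}} + \text{SSP}_{\text{res}}|}$$

... $> \chi^2_{(g-1)p}(\alpha)$ ...

⁵ The likelihood test procedures require that $p \le gb(n-1)$, so that $\text{SSP}_{\text{res}}$ will be positive definite (with probability 1).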
In a similar manner, factor 2 effects are tested by considering $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_b = \mathbf{0}$ and $H_1$: at least one $\boldsymbol{\beta}_k \ne \mathbf{0}$. Small values of

$$\Lambda^* = \frac{|\text{SSP}_{\text{res}}|}{|\text{SSP}_{\text{fac2}} + \text{SSP}_{\text{res}}|} \qquad (6\text{-}57)$$

are consistent with $H_1$. Once again, for large samples and using Bartlett's correction: Reject $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \cdots = \boldsymbol{\beta}_b = \mathbf{0}$ (no factor 2 effects) at level $\alpha$ if

$$-\left[gb(n-1) - \frac{p + 1 - (b-1)}{2}\right]\ln\Lambda^* > \chi^2_{(b-1)p}(\alpha) \qquad (6\text{-}58)$$

where $\Lambda^*$ is given by (6-57) and $\chi^2_{(b-1)p}(\alpha)$ is the upper $(100\alpha)$th percentile of a chi-square distribution with $(b-1)p$ degrees of freedom.

Simultaneous confidence intervals for contrasts in the model parameters can provide insights into the nature of the factor effects. Results comparable to Result 6.5 are available for the two-way model. When interaction effects are negligible, we may concentrate on contrasts in the factor 1 and factor 2 main effects. The Bonferroni approach applies to the components of the differences $\boldsymbol{\tau}_\ell - \boldsymbol{\tau}_m$ of the factor 1 effects and the components of $\boldsymbol{\beta}_k - \boldsymbol{\beta}_q$ of the factor 2 effects, respectively.

The $100(1-\alpha)\%$ simultaneous confidence intervals for $\tau_{\ell i} - \tau_{mi}$ are

$$\tau_{\ell i} - \tau_{mi} \quad \text{belongs to} \quad (\bar{x}_{\ell\cdot i} - \bar{x}_{m\cdot i}) \pm t_\nu\!\left(\frac{\alpha}{pg(g-1)}\right)\sqrt{\frac{E_{ii}}{\nu}\,\frac{2}{bn}} \qquad (6\text{-}59)$$

where $\nu = gb(n-1)$, $E_{ii}$ is the $i$th diagonal element of $\mathbf{E} = \text{SSP}_{\text{res}}$, and $\bar{x}_{\ell\cdot i} - \bar{x}_{m\cdot i}$ is the $i$th component of $\bar{\mathbf{x}}_{\ell\cdot} - \bar{\mathbf{x}}_{m\cdot}$.

Similarly, the $100(1-\alpha)\%$ simultaneous confidence intervals for $\beta_{ki} - \beta_{qi}$ are

$$\beta_{ki} - \beta_{qi} \quad \text{belongs to} \quad (\bar{x}_{\cdot ki} - \bar{x}_{\cdot qi}) \pm t_\nu\!\left(\frac{\alpha}{pb(b-1)}\right)\sqrt{\frac{E_{ii}}{\nu}\,\frac{2}{gn}} \qquad (6\text{-}60)$$

where $\nu$ and $E_{ii}$ are as just defined and $\bar{x}_{\cdot ki} - \bar{x}_{\cdot qi}$ is the $i$th component of $\bar{\mathbf{x}}_{\cdot k} - \bar{\mathbf{x}}_{\cdot q}$.

Comment. We have considered the multivariate two-way model with replications. That is, the model allows for $n$ replications of the responses at each combination of factor levels. This enables us to examine the "interaction" of the factors. If only one observation vector is available at each combination of factor levels, the two-way model does not allow for the possibility of a general interaction term $\boldsymbol{\gamma}_{\ell k}$. The corresponding MANOVA table includes only factor 1, factor 2, and residual sources of variation as components of the total variation. (See Exercise 6.13.)

Example 6.11 (A two-way multivariate analysis of variance of plastic film data)

The optimum conditions for extruding plastic film have been examined using a technique called Evolutionary Operation. (See [8].) In the course of the study that was done, three responses ($X_1$ = tear resistance, $X_2$ = gloss, and $X_3$ = opacity) were measured at two levels of the factors, rate of extrusion and amount of an additive.
Section 6.6 Two-Way Multivariate Analysis of Variance

The measurements were repeated $n = 5$ times at each combination of the factor levels. The data are displayed in Table 6.4.

TABLE 6.4 PLASTIC FILM DATA
$x_1$ = tear resistance, $x_2$ = gloss, and $x_3$ = opacity

                                        Factor 2: Amount of additive
                                     Low (1.0%)            High (1.5%)
                                    x1    x2    x3         x1    x2    x3
Factor 1: Change    Low (−10%)     6.5   9.5   4.4        6.9   9.1   5.7
in rate of                         6.2   9.9   6.4        7.2  10.0   2.0
extrusion                          5.8   9.6   3.0        6.9   9.9   3.9
                                   6.5   9.6   4.1        6.1   9.5   1.9
                                   6.5   9.2   0.8        6.3   9.4   5.7

                    High (10%)     6.7   9.1   2.8        7.1   9.2   8.4
                                   6.6   9.3   4.1        7.0   8.8   5.2
                                   7.2   8.3   3.8        7.2   9.7   6.9
                                   7.1   8.4   1.6        7.5  10.1   2.7
                                   6.8   8.5   3.4        7.6   9.2   1.9

The matrices of the appropriate sum of squares and cross products were calculated (see the SAS statistical software output in Panel 6.1), leading to the following MANOVA table:

Source of variation                        SSP (upper triangle)                  d.f.

Factor 1: change in rate of extrusion      [ 1.7405   −1.5045     .8555 ]
                                           [           1.3005    −.7395 ]          1
                                           [                      .4205 ]

Factor 2: amount of additive               [  .7605     .6825    1.9305 ]
                                           [            .6125    1.7325 ]          1
                                           [                     4.9005 ]

Interaction                                [  .0005     .0165     .0445 ]
                                           [            .5445    1.4685 ]          1
                                           [                     3.9605 ]

Residual                                   [ 1.7640     .0200   −3.0700 ]
                                           [           2.6280    −.5520 ]         16
                                           [                    64.9240 ]

Total (corrected)                          [ 4.2655    −.7855    −.2395 ]
                                           [           5.0855    1.9095 ]         19
                                           [                    74.2055 ]
PANEL 6.1 SAS ANALYSIS FOR EXAMPLE 6.11 USING PROC GLM.

PROGRAM COMMANDS

    title 'MANOVA';
    data film;
      infile 'T6-4.dat';
      input x1 x2 x3 factor1 factor2;
    proc glm data = film;
      class factor1 factor2;
      model x1 x2 x3 = factor1 factor2 factor1*factor2 /ss3;
      manova h = factor1 factor2 factor1*factor2 /printe;
      means factor1 factor2;
    run;
OUTPUT

General Linear Models Procedure
Class Level Information

Class      Levels   Values
FACTOR1      2      0 1
FACTOR2      2      0 1

Number of observations in data set = 20

Dependent Variable: X1

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3     2.50150000     0.83383333      7.56    0.0023
Error              16     1.76400000     0.11025000
Corrected Total    19     4.26550000

R-Square    C.V.        Root MSE    X1 Mean
0.586449    4.893724    0.332039    6.78500000

Source             DF   Type III SS     Mean Square   F Value   Pr > F
FACTOR1             1     1.74050000    1.74050000     15.79    0.0011
FACTOR2             1     0.76050000    0.76050000      6.90    0.0183
FACTOR1*FACTOR2     1     0.00050000    0.00050000      0.00    0.9471

Dependent Variable: X2

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3     2.45750000     0.81916667      4.99    0.0125
Error              16     2.62800000     0.16425000
Corrected Total    19     5.08550000

R-Square    C.V.        Root MSE    X2 Mean
0.483237    4.350807    0.405278    9.31500000

Source             DF   Type III SS     Mean Square   F Value   Pr > F
FACTOR1             1     1.30050000    1.30050000      7.92    0.0125
FACTOR2             1     0.61250000    0.61250000      3.73    0.0714
FACTOR1*FACTOR2     1     0.54450000    0.54450000      3.32    0.0874
Section 6.6
pANEL 6 . 1
TwoWay M u ltivariate Ana lysis of Va riance
315
(continued)
Pr > F 0.53 1 5
DF 3 16 19
S u m of S q u a res 9.281 50000 64. 92400000 74. 20550000
Mean S q u a re 3 . 09383333 4.0577 5000
RSq u a re 0. 1 2 5078
C.V. 51.19151
Root M S E 2 . 0 1 4386
DF
Type I l l 55
Mean Sq u a re
F Va l u e
Pr > F
0.42050000 4.90050000 3.96050000
0. 1 0 1 .21 0.98
0.7 5 1 7 0.288 1 0.3379
14 14 14
0 .0030 0 . 0030 0 .0030
14 14 14
0.0247 0.0247 0 .0247
Sou rce M ode l E rror Co rrected Tota l
Sou rce
X1 X2 X3
X2 0.02 2 . 628 0. 5 52
X1 1 .764 0 . 02  3.07
F Va l u e 0.76
X3 Mean 3.93 500000
X3 3 .07 0 . 5 5 2 64.924
M a n ova Test Criteria and Exact F Statistics for the H = Type I l l SS&CP Matrix for FACTOR 1 S=1 M = 0.5
Pilla i's Trace Hotel l i n g  Lawl ey Trace Roy's G reatest Root
0. 6 1 8 1 4 1 62 1 . 6 1 877 1 88 1 . 6 1 87 7 1 88
E = E rror SS&CP Matrix N=6
7 . 5 543 7 . 5 543 7 . 5 543
3 3 3
M a n ova Test Criteria a n d Exact F Statistics for the H = Type I l l SS&CP Matrix for FACTO R2 S=1 M = 0.5
P i l l a i 's Trace Hotel l i n g  Lawl ey Trace Roy's G reatest Root
0 .476965 1 0 0 . 9 1 1 9 1 83 2 0 . 9 1 1 9 1 832
4.2556 4.2556 4.2556
E = E rror SS&CP Matrix N=6
3 3 3
(continues on next page)
316
PANEL 6 . 1
Chapter 6
Comparisons of Severa l M u ltivariate Means
(continued) Manova Test Criteria a n d Exact F Statistics for the H = Type
Ill
SS&CP Matrix fo r FACTOR 1 * FACTOR2 S= 1 M = 0.5 N=6
0.22289424 0.286826 1 4 0 . 286826 1 4
P i l l a i's Trace Hotel l i n g  Lawley Trace Roy's G reatest Root Level of FACTO R 1 0
3 3 3
1 .3385 1 . 3385 1 . 3385
         X1         Mean SD 6.49000000 0.420 1 85 1 4 7. 08000000 0.3 224903 1
N 10 10 Level of FACTOR 1 0
Leve l of FACTOR2 0
E = E rror SS&CP Matrix
         X1         Mean SD 6 . 59000000 0.40674863 6. 98000000 0.47328638
Leve l of FACTOR2 0
0.3 0 1 8 0.30 1 8 0.30 1 8
         X2         Mean SD 0. 29832868 9. 57000000 0 . 57 580861 9. 06000000
         X3         SD Mean 1 . 8537949 1 3 . 79000000 2 . 1 82 1 49 8 1 4. 08000000
N 10 10
N 10 10
14 14 14
         X2         Mean SD 9. 1 4000000 0.560 1 587 1 0.42804465 9.49000000
         X3         SD Mean 1 . 5 5077042 3.44000000 2 . 30 1 23 1 55 4.43000000
N 10 10
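To test for interaction, we compute

$$\Lambda^* = \frac{|\text{SSP}_{\text{res}}|}{|\text{SSP}_{\text{int}} + \text{SSP}_{\text{res}}|} = \frac{275.7098}{354.7906} = .7771$$

For $(g - 1)(b - 1) = 1$,

$$F = \left(\frac{1 - \Lambda^*}{\Lambda^*}\right)\frac{(gb(n - 1) - p + 1)/2}{(|(g - 1)(b - 1) - p| + 1)/2}$$

has an exact $F$-distribution with $\nu_1 = |(g - 1)(b - 1) - p| + 1$ and $\nu_2 = gb(n - 1) - p + 1$ d.f. (See [1].) For our example,

$$F = \left(\frac{1 - .7771}{.7771}\right)\frac{(2(2)(4) - 3 + 1)/2}{(|1(1) - 3| + 1)/2} = 1.34$$

$$\nu_1 = (|1(1) - 3| + 1) = 3, \qquad \nu_2 = (2(2)(4) - 3 + 1) = 14$$

and $F_{3,14}(.05) = 3.34$. Since $F = 1.34 < F_{3,14}(.05) = 3.34$, we do not reject the hypothesis $H_0\colon \boldsymbol{\gamma}_{11} = \boldsymbol{\gamma}_{12} = \boldsymbol{\gamma}_{21} = \boldsymbol{\gamma}_{22} = \mathbf{0}$ (no interaction effects).

Note that the approximate chi-square statistic for this test is $-[2(2)(4) - (3 + 1 - 1(1))/2]\ln(.7771) = 3.66$, from (6-54). Since $\chi^2_3(.05) = 7.81$, we would reach the same conclusion as provided by the exact $F$-test.

To test for factor 1 and factor 2 effects (see pages 311 and 312), we calculate

$$\Lambda_1^* = \frac{|\text{SSP}_{\text{res}}|}{|\text{SSP}_{\text{fac1}} + \text{SSP}_{\text{res}}|} = \frac{275.7098}{722.0212} = .3819$$

and

$$\Lambda_2^* = \frac{|\text{SSP}_{\text{res}}|}{|\text{SSP}_{\text{fac2}} + \text{SSP}_{\text{res}}|} = \frac{275.7098}{527.1347} = .5230$$

For both $g - 1 = 1$ and $b - 1 = 1$,

$$F_1 = \left(\frac{1 - \Lambda_1^*}{\Lambda_1^*}\right)\frac{(gb(n - 1) - p + 1)/2}{(|(g - 1) - p| + 1)/2}
\quad\text{and}\quad
F_2 = \left(\frac{1 - \Lambda_2^*}{\Lambda_2^*}\right)\frac{(gb(n - 1) - p + 1)/2}{(|(b - 1) - p| + 1)/2}$$

have $F$-distributions with degrees of freedom $\nu_1 = |(g - 1) - p| + 1$, $\nu_2 = gb(n - 1) - p + 1$ and $\nu_1 = |(b - 1) - p| + 1$, $\nu_2 = gb(n - 1) - p + 1$, respectively. (See [1].) In our case,

$$F_1 = \left(\frac{1 - .3819}{.3819}\right)\frac{(16 - 3 + 1)/2}{(|1 - 3| + 1)/2} = 7.55$$

$$F_2 = \left(\frac{1 - .5230}{.5230}\right)\frac{(16 - 3 + 1)/2}{(|1 - 3| + 1)/2} = 4.26$$

and

$$\nu_1 = |1 - 3| + 1 = 3, \qquad \nu_2 = (16 - 3 + 1) = 14$$

From before, $F_{3,14}(.05) = 3.34$. We have $F_1 = 7.55 > F_{3,14}(.05) = 3.34$, and therefore, we reject $H_0\colon \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 = \mathbf{0}$ (no factor 1 effects) at the 5% level. Similarly, $F_2 = 4.26 > F_{3,14}(.05) = 3.34$, and we reject $H_0\colon \boldsymbol{\beta}_1 = \boldsymbol{\beta}_2 = \mathbf{0}$ (no factor 2 effects) at the 5% level. We conclude that both the change in rate of extrusion and the amount of additive affect the responses, and they do so in an additive manner.

The nature of the effects of factors 1 and 2 on the responses is explored in Exercise 6.15. In that exercise, simultaneous confidence intervals for contrasts in the components of $\boldsymbol{\tau}_\ell$ and $\boldsymbol{\beta}_k$ are considered. •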
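The Wilks' lambda values and exact $F$ statistics above follow mechanically from the SSP matrices. The sketch below is one way to check the arithmetic, assuming the SSP entries (with their signs) from the MANOVA table; `det3`, `wilks_lambda`, and `exact_f` are hypothetical helper names, and the $F$ conversion is the one quoted in the text for a single-degree-of-freedom effect.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def add(m1, m2):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


# SSP matrices from the MANOVA table (full symmetric form).
SSP_RES = [[1.7640, 0.0200, -3.0700], [0.0200, 2.6280, -0.5520], [-3.0700, -0.5520, 64.9240]]
SSP_FAC1 = [[1.7405, -1.5045, 0.8555], [-1.5045, 1.3005, -0.7395], [0.8555, -0.7395, 0.4205]]
SSP_FAC2 = [[0.7605, 0.6825, 1.9305], [0.6825, 0.6125, 1.7325], [1.9305, 1.7325, 4.9005]]
SSP_INT = [[0.0005, 0.0165, 0.0445], [0.0165, 0.5445, 1.4685], [0.0445, 1.4685, 3.9605]]


def wilks_lambda(ssp_effect, ssp_res):
    """Lambda* = |SSP_res| / |SSP_effect + SSP_res|."""
    return det3(ssp_res) / det3(add(ssp_effect, ssp_res))


def exact_f(lam, effect_df, error_df, p):
    """Exact F statistic and its d.f. when the effect has one degree of freedom."""
    nu1 = abs(effect_df - p) + 1
    nu2 = error_df - p + 1
    f_stat = (1 - lam) / lam * (nu2 / 2) / (nu1 / 2)
    return f_stat, nu1, nu2


ERROR_DF = 2 * 2 * (5 - 1)  # gb(n - 1) = 16
RESULTS = {}
for name, ssp in (("interaction", SSP_INT), ("factor1", SSP_FAC1), ("factor2", SSP_FAC2)):
    lam = wilks_lambda(ssp, SSP_RES)
    RESULTS[name] = (lam,) + exact_f(lam, 1, ERROR_DF, 3)
```

Each entry of `RESULTS` holds $(\Lambda^*, F, \nu_1, \nu_2)$, so the three tests can be compared against $F_{3,14}(.05) = 3.34$ in one pass.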
6.7 PROFILE ANALYSIS

Profile analysis pertains to situations in which a battery of $p$ treatments (tests, questions, and so forth) are administered to two or more groups of subjects. All responses must be expressed in similar units. Further, it is assumed that the responses for the different groups are independent of one another. Ordinarily, we might pose the question, are the population mean vectors the same? In profile analysis, the question of equality of mean vectors is divided into several specific possibilities.

Consider the population means $\boldsymbol{\mu}_1' = [\mu_{11}, \mu_{12}, \mu_{13}, \mu_{14}]$ representing the average responses to four treatments for the first group. A plot of these means, connected by straight lines, is shown in Figure 6.4. This broken-line graph is the profile for population 1.

Figure 6.4 The population profile, $p = 4$. (Mean response plotted against variable 1, 2, 3, 4.)

Profiles can be constructed for each population (group). We shall concentrate on two groups. Let $\boldsymbol{\mu}_1' = [\mu_{11}, \mu_{12}, \ldots, \mu_{1p}]$ and $\boldsymbol{\mu}_2' = [\mu_{21}, \mu_{22}, \ldots, \mu_{2p}]$ be the mean responses to $p$ treatments for populations 1 and 2, respectively. The hypothesis $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ implies that the treatments have the same (average) effect on the two populations. In terms of the population profiles, we can formulate the question of equality in a stepwise fashion.

1. Are the profiles parallel? Equivalently: Is $H_{01}\colon \mu_{1i} - \mu_{1\,i-1} = \mu_{2i} - \mu_{2\,i-1}$, $i = 2, 3, \ldots, p$, acceptable?
2. Assuming that the profiles are parallel, are the profiles coincident?⁶ Equivalently: Is $H_{02}\colon \mu_{1i} = \mu_{2i}$, $i = 1, 2, \ldots, p$, acceptable?
3. Assuming that the profiles are coincident, are the profiles level? That is, are all the means equal to the same constant? Equivalently: Is $H_{03}\colon \mu_{11} = \mu_{12} = \cdots = \mu_{1p} = \mu_{21} = \mu_{22} = \cdots = \mu_{2p}$ acceptable?

The null hypothesis in stage 1 can be written

$$H_{01}\colon \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$$

where $\mathbf{C}$ is the contrast matrix

⁶ The question, "Assuming that the profiles are parallel, are the profiles linear?" is considered in Exercise 6.12. The null hypothesis of parallel linear profiles can be written $H_0\colon (\mu_{1i} + \mu_{2i}) - (\mu_{1\,i-1} + \mu_{2\,i-1}) = (\mu_{1\,i-1} + \mu_{2\,i-1}) - (\mu_{1\,i-2} + \mu_{2\,i-2})$, $i = 3, \ldots, p$. Although this hypothesis may be of interest in a particular situation, in practice the question of whether two parallel profiles are the same (coincident), whatever their nature, is usually of greater interest.
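$$\underset{((p-1)\times p)}{\mathbf{C}} = \begin{bmatrix} -1 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & -1 & 1 & 0 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & -1 & 1 \end{bmatrix} \qquad (6\text{-}61)$$

For independent samples of sizes $n_1$ and $n_2$ from the two populations, the null hypothesis can be tested by constructing the transformed observations

$$\mathbf{C}\mathbf{x}_{1j}, \quad j = 1, 2, \ldots, n_1 \qquad\text{and}\qquad \mathbf{C}\mathbf{x}_{2j}, \quad j = 1, 2, \ldots, n_2$$

These have sample mean vectors $\mathbf{C}\bar{\mathbf{x}}_1$ and $\mathbf{C}\bar{\mathbf{x}}_2$, respectively, and pooled covariance matrix $\mathbf{C}\mathbf{S}_{\text{pooled}}\mathbf{C}'$. Since the two sets of transformed observations have $N_{p-1}(\mathbf{C}\boldsymbol{\mu}_1, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$ and $N_{p-1}(\mathbf{C}\boldsymbol{\mu}_2, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$ distributions, respectively, an application of Result 6.2 provides a test for parallel profiles.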
When the profiles are parallel, the first is either above the second ($\mu_{1i} > \mu_{2i}$, for all $i$), or vice versa. Under this condition, the profiles will be coincident only if the total heights $\mu_{11} + \mu_{12} + \cdots + \mu_{1p} = \mathbf{1}'\boldsymbol{\mu}_1$ and $\mu_{21} + \mu_{22} + \cdots + \mu_{2p} = \mathbf{1}'\boldsymbol{\mu}_2$ are equal. Therefore, the null hypothesis at stage 2 can be written in the equivalent form

$$H_{02}\colon \mathbf{1}'\boldsymbol{\mu}_1 = \mathbf{1}'\boldsymbol{\mu}_2$$

We can then test $H_{02}$ with the usual two-sample $t$-statistic based on the univariate observations $\mathbf{1}'\mathbf{x}_{1j}$, $j = 1, 2, \ldots, n_1$, and $\mathbf{1}'\mathbf{x}_{2j}$, $j = 1, 2, \ldots, n_2$.
For coincident profiles, $\mathbf{x}_{11}, \mathbf{x}_{12}, \ldots, \mathbf{x}_{1n_1}$ and $\mathbf{x}_{21}, \mathbf{x}_{22}, \ldots, \mathbf{x}_{2n_2}$ are all observations from the same normal population. The next step is to see whether all variables have the same mean, so that the common profile is level.

When $H_{01}$ and $H_{02}$ are tenable, the common mean vector $\boldsymbol{\mu}$ is estimated, using all $n_1 + n_2$ observations, by

$$\bar{\mathbf{x}} = \frac{1}{n_1 + n_2}\left(\sum_{j=1}^{n_1} \mathbf{x}_{1j} + \sum_{j=1}^{n_2} \mathbf{x}_{2j}\right) = \frac{n_1}{(n_1 + n_2)}\bar{\mathbf{x}}_1 + \frac{n_2}{(n_1 + n_2)}\bar{\mathbf{x}}_2$$

If the common profile is level, then $\mu_1 = \mu_2 = \cdots = \mu_p$, and the null hypothesis at stage 3 can be written as

$$H_{03}\colon \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$$

where $\mathbf{C}$ is given by (6-61). Consequently, we have the following test.
Example 6.12 (A profile analysis of love and marriage data)

As part of a larger study of love and marriage, E. Hatfield, a sociologist, surveyed adults with respect to their marriage "contributions" and "outcomes" and their levels of "passionate" and "companionate" love. Recently married males and females were asked to respond to the following questions, using the 8-point scale in the figure below.

    1    2    3    4    5    6    7    8

1. All things considered, how would you describe your contributions to the marriage?
2. All things considered, how would you describe your outcomes from the marriage?

Subjects were also asked to respond to the following questions, using the 5-point scale shown.

3. What is the level of passionate love that you feel for your partner?
4. What is the level of companionate love that you feel for your partner?

    None at all    Very little    Some    A great deal    Tremendous amount
         1              2           3          4                  5
Let

    $x_1$ = an 8-point scale response to Question 1
    $x_2$ = an 8-point scale response to Question 2
    $x_3$ = a 5-point scale response to Question 3
    $x_4$ = a 5-point scale response to Question 4

and the two populations be defined as

    Population 1 = married men
    Population 2 = married women

The population means are the average responses to the $p = 4$ questions for the populations of males and females. Assuming a common covariance matrix $\boldsymbol{\Sigma}$, it is of interest to see whether the profiles of males and females are the same.

A sample of $n_1 = 30$ males and $n_2 = 30$ females gave the sample mean vectors

$$\bar{\mathbf{x}}_1 = \begin{bmatrix} 6.833 \\ 7.033 \\ 3.967 \\ 4.700 \end{bmatrix} \text{ (males)}, \qquad \bar{\mathbf{x}}_2 = \begin{bmatrix} 6.633 \\ 7.000 \\ 4.000 \\ 4.533 \end{bmatrix} \text{ (females)}$$

and pooled covariance matrix

$$\mathbf{S}_{\text{pooled}} = \begin{bmatrix} .606 & .262 & .066 & .161 \\ .262 & .637 & .173 & .143 \\ .066 & .173 & .810 & .029 \\ .161 & .143 & .029 & .306 \end{bmatrix}$$

The sample mean vectors are plotted as sample profiles in Figure 6.5 on page 322.

Since the sample sizes are reasonably large, we shall use the normal theory methodology, even though the data, which are integers, are clearly nonnormal. To test for parallelism ($H_{01}\colon \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$), we compute

$$\mathbf{C}\mathbf{S}_{\text{pooled}}\mathbf{C}' = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{bmatrix} \mathbf{S}_{\text{pooled}} \begin{bmatrix} -1 & 0 & 0 \\ 1 & -1 & 0 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} .719 & -.268 & -.125 \\ -.268 & 1.101 & -.751 \\ -.125 & -.751 & 1.058 \end{bmatrix}$$

and

$$\mathbf{C}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = \begin{bmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{bmatrix} \begin{bmatrix} .200 \\ .033 \\ -.033 \\ .167 \end{bmatrix} = \begin{bmatrix} -.167 \\ -.066 \\ .200 \end{bmatrix}$$
Figure 6.5 Sample profiles for marriage-love responses. (Sample mean response $\bar{x}_{\ell i}$ plotted against variable $i$ = 1, 2, 3, 4. Key: x = males, o = females.)
Thus,

$$T^2 = [-.167, -.066, .200]\left(\frac{1}{30} + \frac{1}{30}\right)^{-1} \begin{bmatrix} .719 & -.268 & -.125 \\ -.268 & 1.101 & -.751 \\ -.125 & -.751 & 1.058 \end{bmatrix}^{-1} \begin{bmatrix} -.167 \\ -.066 \\ .200 \end{bmatrix} = 15(.067) = 1.005$$

Moreover, with $\alpha = .05$, $c^2 = [(30 + 30 - 2)(4 - 1)/(30 + 30 - 4)]F_{3,56}(.05) = 3.11(2.8) = 8.7$. Since $T^2 = 1.005 < 8.7$, we conclude that the hypothesis of parallel profiles for men and women is tenable. Given the plot in Figure 6.5, this finding is not surprising.

Assuming that the profiles are parallel, we can test for coincident profiles. To test $H_{02}\colon \mathbf{1}'\boldsymbol{\mu}_1 = \mathbf{1}'\boldsymbol{\mu}_2$ (profiles coincident), we need

    Sum of elements in $(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = \mathbf{1}'(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) = .367$
    Sum of elements in $\mathbf{S}_{\text{pooled}} = \mathbf{1}'\mathbf{S}_{\text{pooled}}\mathbf{1} = 4.027$

Using (6-63), we obtain

$$T^2 = \left(\frac{.367}{\sqrt{\left(\frac{1}{30} + \frac{1}{30}\right)4.027}}\right)^2 = .501$$

With $\alpha = .05$, $F_{1,58}(.05) = 4.0$, and $T^2 = .501 < F_{1,58}(.05) = 4.0$, we cannot reject the hypothesis that the profiles are coincident. That is, the responses of men and women to the four questions posed appear to be the same.

We could now test for level profiles; however, it does not make sense to carry out this test for our example, since Questions 1 and 2 were measured on a scale of 1-8, while Questions 3 and 4 were measured on a scale of 1-5. The incompatibility of these scales makes the test for level profiles meaningless and illustrates the need for similar measurements in order to carry out a complete profile analysis. •
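The two profile tests in the love and marriage example reduce to a few matrix operations on the summary statistics. The sketch below recomputes both $T^2$ values in plain Python from the reported mean vectors and pooled covariance matrix; the helper names (`contrast_matrix`, `solve`, `parallel_t2`, `coincident_t2`) are illustrative assumptions, and the percentile $F_{3,56}(.05) \approx 2.8$ is taken from the text rather than computed.

```python
def contrast_matrix(p):
    """(p-1) x p matrix of successive differences, as in (6-61)."""
    return [[(-1.0 if j == i else 1.0 if j == i + 1 else 0.0) for j in range(p)]
            for i in range(p - 1)]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def parallel_t2(xbar1, xbar2, s_pooled, n1, n2):
    """T^2 for H01: C mu1 = C mu2 (parallel profiles)."""
    c = contrast_matrix(len(xbar1))
    d = matvec(c, [a - b for a, b in zip(xbar1, xbar2)])
    c_t = [list(col) for col in zip(*c)]
    csc = matmul(matmul(c, s_pooled), c_t)
    scale = 1.0 / n1 + 1.0 / n2
    scaled = [[scale * v for v in row] for row in csc]
    return sum(di * yi for di, yi in zip(d, solve(scaled, d)))


def coincident_t2(xbar1, xbar2, s_pooled, n1, n2):
    """T^2 for H02: 1'mu1 = 1'mu2 (coincident profiles, given parallelism)."""
    total_diff = sum(a - b for a, b in zip(xbar1, xbar2))
    total_var = sum(sum(row) for row in s_pooled)
    return total_diff ** 2 / ((1.0 / n1 + 1.0 / n2) * total_var)


XBAR_MALES = [6.833, 7.033, 3.967, 4.700]
XBAR_FEMALES = [6.633, 7.000, 4.000, 4.533]
S_POOLED = [[0.606, 0.262, 0.066, 0.161],
            [0.262, 0.637, 0.173, 0.143],
            [0.066, 0.173, 0.810, 0.029],
            [0.161, 0.143, 0.029, 0.306]]
N1 = N2 = 30
F_3_56_05 = 2.8  # rounded percentile quoted in the text

T2_PARALLEL = parallel_t2(XBAR_MALES, XBAR_FEMALES, S_POOLED, N1, N2)
T2_COINCIDENT = coincident_t2(XBAR_MALES, XBAR_FEMALES, S_POOLED, N1, N2)
C2 = (N1 + N2 - 2) * (4 - 1) / (N1 + N2 - 4) * F_3_56_05
```

Solving the linear system rather than forming the inverse of $\mathbf{C}\mathbf{S}_{\text{pooled}}\mathbf{C}'$ keeps the sketch short; for a $3 \times 3$ matrix either route gives the same value to the precision reported.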
(Fitting a quadratic growth curve to calcium loss)

$$\Lambda^* = .7627$$

Since, with $\alpha = .01$,

$$-\left(N - \tfrac{1}{2}(p - q + g)\right)\ln\Lambda^* = -\left(31 - \tfrac{1}{2}(4 - 2 + 2)\right)\ln .7627 = 7.86 < \chi^2_{(4-2-1)2}(.01) = 9.21$$

we fail to reject the adequacy of the quadratic fit at $\alpha = .01$. Since the $p$-value is less than .05 there is, however, some evidence that the quadratic does not fit well. We could, without restricting to quadratic growth, test for parallel and coincident calcium loss using profile analysis. •

The Potthoff and Roy growth curve model holds for more general designs than one-way MANOVA. However, the $\hat{\boldsymbol{\beta}}_\ell$ are no longer given by (6-67) and the expression for its covariance matrix becomes more complicated than (6-68). We refer the reader to [12] for more examples and further tests.

There are many other modifications to the model treated here. They include the following:

(a) Dropping the restriction to polynomial growth. Use nonlinear parametric models or even nonparametric splines.
(b) Restricting the covariance matrix to a special form such as equally correlated responses on the same individual.
(c) Observing more than one response variable, over time, on the same individual. This results in a multivariate version of the growth curve model.
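$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{r+1} > 0$ are the eigenvalues of $\mathbf{Z}'\mathbf{Z}$ and $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{r+1}$ are the corresponding eigenvectors. If $\mathbf{Z}$ is of full rank,

$$(\mathbf{Z}'\mathbf{Z})^{-1} = \frac{1}{\lambda_1}\mathbf{e}_1\mathbf{e}_1' + \frac{1}{\lambda_2}\mathbf{e}_2\mathbf{e}_2' + \cdots + \frac{1}{\lambda_{r+1}}\mathbf{e}_{r+1}\mathbf{e}_{r+1}'$$

Consider $\mathbf{q}_i = \lambda_i^{-1/2}\mathbf{Z}\mathbf{e}_i$, which is a linear combination of the columns of $\mathbf{Z}$. Then

$$\mathbf{q}_i'\mathbf{q}_k = \lambda_i^{-1/2}\lambda_k^{-1/2}\mathbf{e}_i'\mathbf{Z}'\mathbf{Z}\mathbf{e}_k = \lambda_i^{-1/2}\lambda_k^{-1/2}\mathbf{e}_i'\lambda_k\mathbf{e}_k = 0 \text{ if } i \ne k \text{ or } 1 \text{ if } i = k$$

That is, the $r + 1$ vectors $\mathbf{q}_i$ are mutually perpendicular and have unit length. Their linear combinations span the space of all linear combinations of the columns of $\mathbf{Z}$. Moreover,

$$\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' = \sum_{i=1}^{r+1}\lambda_i^{-1}\mathbf{Z}\mathbf{e}_i\mathbf{e}_i'\mathbf{Z}' = \sum_{i=1}^{r+1}\mathbf{q}_i\mathbf{q}_i'$$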
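According to Result 2A.2 and Definition 2A.12, the projection of $\mathbf{y}$ on a linear combination of $\{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_{r+1}\}$ is

$$\sum_{i=1}^{r+1}(\mathbf{q}_i'\mathbf{y})\mathbf{q}_i = \left(\sum_{i=1}^{r+1}\mathbf{q}_i\mathbf{q}_i'\right)\mathbf{y} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \mathbf{Z}\hat{\boldsymbol{\beta}}$$

Thus, multiplication by $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$ projects a vector onto the space spanned by the columns of $\mathbf{Z}$.² Similarly, $[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$ is the matrix for the projection of $\mathbf{y}$ on the plane perpendicular to the plane spanned by the columns of $\mathbf{Z}$.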
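² If $\mathbf{Z}$ is not of full rank, we can use the generalized inverse $(\mathbf{Z}'\mathbf{Z})^{-} = \sum_{i=1}^{r_1+1}\lambda_i^{-1}\mathbf{e}_i\mathbf{e}_i'$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{r_1+1} > 0 = \lambda_{r_1+2} = \cdots = \lambda_{r+1}$, as described in Exercise 7.6. Then $\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-}\mathbf{Z}' = \sum_{i=1}^{r_1+1}\mathbf{q}_i\mathbf{q}_i'$ has rank $r_1 + 1$ and generates the unique projection of $\mathbf{y}$ on the space spanned by the linearly independent columns of $\mathbf{Z}$. This is true for any choice of the generalized inverse. (See [20].)

Sampling Properties of Classical Least Squares Estimators

The least squares estimator $\hat{\boldsymbol{\beta}}$ and the residuals $\hat{\boldsymbol{\varepsilon}}$ have the sampling properties detailed in the next result.

Result 7.2. Under the general linear regression model in (7-3), the least squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y}$ has

$$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \quad\text{and}\quad \text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}$$

The residuals $\hat{\boldsymbol{\varepsilon}}$ have the properties

$$E(\hat{\boldsymbol{\varepsilon}}) = \mathbf{0} \quad\text{and}\quad \text{Cov}(\hat{\boldsymbol{\varepsilon}}) = \sigma^2[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \sigma^2[\mathbf{I} - \mathbf{H}]$$

Also, $E(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}) = (n - r - 1)\sigma^2$, so defining

$$s^2 = \frac{\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}}{n - (r + 1)} = \frac{\mathbf{Y}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y}}{n - r - 1} = \frac{\mathbf{Y}'[\mathbf{I} - \mathbf{H}]\mathbf{Y}}{n - r - 1}$$

we have

$$E(s^2) = \sigma^2$$

Moreover, $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\varepsilon}}$ are uncorrelated.

Proof. Before the response $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ is observed, it is a random vector. Now,

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'(\mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \boldsymbol{\beta} + (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\boldsymbol{\varepsilon}$$

$$\hat{\boldsymbol{\varepsilon}} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Y} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'][\mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}] = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\boldsymbol{\varepsilon} \qquad (7\text{-}10)$$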
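since $[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{Z} = \mathbf{Z} - \mathbf{Z} = \mathbf{0}$. From (2-24) and (2-45),

$$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} + (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'E(\boldsymbol{\varepsilon}) = \boldsymbol{\beta}$$

$$\text{Cov}(\hat{\boldsymbol{\beta}}) = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\,\text{Cov}(\boldsymbol{\varepsilon})\,\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1} = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1} = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}$$

$$E(\hat{\boldsymbol{\varepsilon}}) = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']E(\boldsymbol{\varepsilon}) = \mathbf{0}$$

$$\text{Cov}(\hat{\boldsymbol{\varepsilon}}) = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\,\text{Cov}(\boldsymbol{\varepsilon})\,[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']' = \sigma^2[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$$

where the last equality follows from (7-6). Also,

$$\text{Cov}(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\varepsilon}}) = E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})\hat{\boldsymbol{\varepsilon}}'] = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}')[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \mathbf{0}$$

because $\mathbf{Z}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \mathbf{0}$. From (7-10), (7-6), and Result 4.9,

$$\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'][\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\boldsymbol{\varepsilon}$$

$$= \text{tr}\left[\boldsymbol{\varepsilon}'(\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}')\boldsymbol{\varepsilon}\right] = \text{tr}\left([\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\right)$$

Now, for an arbitrary $n \times n$ random matrix $\mathbf{W}$,

$$E(\text{tr}(\mathbf{W})) = E(W_{11} + W_{22} + \cdots + W_{nn}) = E(W_{11}) + E(W_{22}) + \cdots + E(W_{nn}) = \text{tr}[E(\mathbf{W})]$$

Thus, using Result 2A.12, we obtain

$$E(\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}) = \text{tr}\left([\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}')\right) = \sigma^2\,\text{tr}[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']$$

$$= \sigma^2\,\text{tr}(\mathbf{I}) - \sigma^2\,\text{tr}[\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \sigma^2 n - \sigma^2\,\text{tr}[(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Z}] = n\sigma^2 - \sigma^2\,\text{tr}\left[\underset{((r+1)\times(r+1))}{\mathbf{I}}\right] = \sigma^2(n - r - 1)$$

and the result for $s^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}/(n - r - 1)$ follows. •

The least squares estimator $\hat{\boldsymbol{\beta}}$ possesses a minimum variance property that was first established by Gauss. The following result concerns "best" estimators of linear parametric functions of the form $\mathbf{c}'\boldsymbol{\beta} = c_0\beta_0 + c_1\beta_1 + \cdots + c_r\beta_r$ for any $\mathbf{c}$.

Result 7.3 (Gauss'³ least squares theorem). Let $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$, $\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$, and $\mathbf{Z}$ has full rank $r + 1$. For any $\mathbf{c}$, the estimator

$$\mathbf{c}'\hat{\boldsymbol{\beta}} = c_0\hat{\beta}_0 + c_1\hat{\beta}_1 + \cdots + c_r\hat{\beta}_r$$

³ Much later, Markov proved a less general result, which misled many writers into attaching his name to this theorem.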
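of $\mathbf{c}'\boldsymbol{\beta}$ has the smallest possible variance among all linear estimators of the form

$$\mathbf{a}'\mathbf{Y} = a_1Y_1 + a_2Y_2 + \cdots + a_nY_n$$

that are unbiased for $\mathbf{c}'\boldsymbol{\beta}$.

Proof. For any fixed $\mathbf{c}$, let $\mathbf{a}'\mathbf{Y}$ be any unbiased estimator of $\mathbf{c}'\boldsymbol{\beta}$. Then $E(\mathbf{a}'\mathbf{Y}) = \mathbf{c}'\boldsymbol{\beta}$, whatever the value of $\boldsymbol{\beta}$. Also, by assumption, $E(\mathbf{a}'\mathbf{Y}) = E(\mathbf{a}'\mathbf{Z}\boldsymbol{\beta} + \mathbf{a}'\boldsymbol{\varepsilon}) = \mathbf{a}'\mathbf{Z}\boldsymbol{\beta}$. Equating the two expected value expressions yields $\mathbf{a}'\mathbf{Z}\boldsymbol{\beta} = \mathbf{c}'\boldsymbol{\beta}$ or $(\mathbf{c}' - \mathbf{a}'\mathbf{Z})\boldsymbol{\beta} = 0$ for all $\boldsymbol{\beta}$, including the choice $\boldsymbol{\beta} = (\mathbf{c}' - \mathbf{a}'\mathbf{Z})'$. This implies that $\mathbf{c}' = \mathbf{a}'\mathbf{Z}$ for any unbiased estimator.

Now, $\mathbf{c}'\hat{\boldsymbol{\beta}} = \mathbf{c}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} = \mathbf{a}^{*\prime}\mathbf{Y}$ with $\mathbf{a}^* = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{c}$. Moreover, from Result 7.2 $E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta}$, so $\mathbf{c}'\hat{\boldsymbol{\beta}} = \mathbf{a}^{*\prime}\mathbf{Y}$ is an unbiased estimator of $\mathbf{c}'\boldsymbol{\beta}$. Thus, for any $\mathbf{a}$ satisfying the unbiased requirement $\mathbf{c}' = \mathbf{a}'\mathbf{Z}$,

$$\text{Var}(\mathbf{a}'\mathbf{Y}) = \text{Var}(\mathbf{a}'\mathbf{Z}\boldsymbol{\beta} + \mathbf{a}'\boldsymbol{\varepsilon}) = \text{Var}(\mathbf{a}'\boldsymbol{\varepsilon}) = \mathbf{a}'\mathbf{I}\sigma^2\mathbf{a}$$

$$= \sigma^2(\mathbf{a} - \mathbf{a}^* + \mathbf{a}^*)'(\mathbf{a} - \mathbf{a}^* + \mathbf{a}^*) = \sigma^2[(\mathbf{a} - \mathbf{a}^*)'(\mathbf{a} - \mathbf{a}^*) + \mathbf{a}^{*\prime}\mathbf{a}^*]$$

since $(\mathbf{a} - \mathbf{a}^*)'\mathbf{a}^* = (\mathbf{a} - \mathbf{a}^*)'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{c} = 0$ from the condition $(\mathbf{a} - \mathbf{a}^*)'\mathbf{Z} = \mathbf{a}'\mathbf{Z} - \mathbf{a}^{*\prime}\mathbf{Z} = \mathbf{c}' - \mathbf{c}' = \mathbf{0}'$. Because $\mathbf{a}^*$ is fixed and $(\mathbf{a} - \mathbf{a}^*)'(\mathbf{a} - \mathbf{a}^*)$ is positive unless $\mathbf{a} = \mathbf{a}^*$, $\text{Var}(\mathbf{a}'\mathbf{Y})$ is minimized by the choice $\mathbf{a}^{*\prime}\mathbf{Y} = \mathbf{c}'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} = \mathbf{c}'\hat{\boldsymbol{\beta}}$. •

This powerful result states that substitution of $\hat{\boldsymbol{\beta}}$ for $\boldsymbol{\beta}$ leads to the best estimator of $\mathbf{c}'\boldsymbol{\beta}$ for any $\mathbf{c}$ of interest. In statistical terminology, the estimator $\mathbf{c}'\hat{\boldsymbol{\beta}}$ is called the best (minimum-variance) linear unbiased estimator (BLUE) of $\mathbf{c}'\boldsymbol{\beta}$.

7.4 INFERENCES ABOUT THE REGRESSION MODEL

We describe inferential procedures based on the classical linear regression model in (7-3) with the additional (tentative) assumption that the errors $\boldsymbol{\varepsilon}$ have a normal distribution. Methods for checking the general adequacy of the model are considered in Section 7.6.

Inferences Concerning the Regression Parameters

Before we can assess the importance of particular variables in the regression function

$$E(Y) = \beta_0 + \beta_1 z_1 + \cdots + \beta_r z_r \qquad (7\text{-}11)$$

we must determine the sampling distributions of $\hat{\boldsymbol{\beta}}$ and the residual sum of squares, $\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}$. To do so, we shall assume that the errors $\boldsymbol{\varepsilon}$ have a normal distribution.

Result 7.4. Let $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{Z}$ has full rank $r + 1$ and $\boldsymbol{\varepsilon}$ is distributed as $N_n(\mathbf{0}, \sigma^2\mathbf{I})$. Then the maximum likelihood estimator of $\boldsymbol{\beta}$ is the same as the least squares estimator $\hat{\boldsymbol{\beta}}$. Moreover,

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{Y} \text{ is distributed as } N_{r+1}(\boldsymbol{\beta}, \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1})$$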
Chapter 7
M u ltiva riate Linear Reg ress ion Models
and is distributed independent= ly ofisthdiesrterisbidutualedsas =o2X� rzp1 . Further, 2 2 where (J is the maximum likelihood estimator of o • Gi v en t h e dat a and t h e nor m al as s u mpt i o n f o r t h e er r o r s , t h e l i k el i 2 hood function for o2 is n 2/2a2 L ( lT ) = rr j=1 lTeeEl = (2/1T2a)2nf2lTn e n 2 n f 1T ( 2 ) lT ForButathfiisxedminvalimuiezato2io, tnhyie leikldelsithhoode leasis tmaxisquaresmizedestbyimmiatenimi=zin(gZ'(yZ)1Z/3)Z'y,'whi(y chZ/3does). 2 nothooddepend upon • Ther e f o r e , under t h e nor m al as s u mpt i o n, !he maxi m um l i k el i oand l e as t s q uar e s appr o aches pr o vi d e t h e s a me es t i m at o r Next , maxi m i z i n g 2 2 L(p, o ) over o [see (4 1 8) ] gives L ({3 , 02 ) = ( 21T ) nf2 ( (J2 ) n/2 en/2 where cl2 = (y  z{3 ) ' (y  z{3 ) (7 ) FroSpecim (7f1ic0)al,lwey, can expres {3 and as linear combinations of the normal variables = [[f=�z�i�;��;�J = [�J + [�=�i(j};:��ii,] = + ] [�Becaus Z is fiixaed,nceResmaturltic4.es3 werimpleiobtes thaeinjedoinitnnorResmualltit7.y2of. Agai{3 andn, usiTheing (7rmean tors ande covar 6), wevecget Cov ( [�]) = Cov ( = u{(���r��i=z(i, z)�iz, J Sidependent nce Cov (. p(S, ee Res= uflotr4.t5h.e) normal random vectors p and these vect1ors are in Next , l e t ( A , e) Z ' . Then, be any ei g enval u eei g envect o r pai r f o r I Z ( Z ' Z) 1 1 2 by (76), [I  Z (Z'Z1) Z' ] = [I  Z ( Z ' Z)1 Z2' ] so Ae = [I  Z( Z ' Z ) Z' ] e = [I  Z(Z ' Z ) 1Z' J e = A[I  Z( Z ' Z)1Z' J e = ThatResultis7., 2A),=andf0 orromResNow,ult t4.r 9[I,tr [IZ(ZZ(' ZZ) ' ZZ)'] 1Z= ' ] =A1 + A2(s+·ee· th+e Aproof e , wher n 1 A1  A2 val· · uesAofn arAieequaltheeione,genvalanduesof[ Z ' . J Cons e quent l y , e xact l y I Z( Z ' Z ) t h e r e s t ar e zer o . I t t h en f o l o ws f r o m t h e s p ec tral decomposition that (713) e
na2
Proof.
Y
e' e
f3 ,
1 V2iT 1
{3,
c' c
1
c
c
j 2 a2
( y  z{3 ) ( y  z{3 ) I
{J
f3 .
1
12
n
e
e.
e
a
Ae
e.
A
e)
0
1.
>
n
>
r
1
>
e ) A'
e,
n
r
1
A2e
of
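where $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{n-r-1}$ are the normalized eigenvectors associated with the eigenvalues $\lambda_1 = \lambda_2 = \cdots = \lambda_{n-r-1} = 1$. Let

$$\mathbf{V} = \begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_{n-r-1} \end{bmatrix} = \begin{bmatrix} \mathbf{e}_1' \\ \mathbf{e}_2' \\ \vdots \\ \mathbf{e}_{n-r-1}' \end{bmatrix}\boldsymbol{\varepsilon}$$

Then $\mathbf{V}$ is normal with mean vector $\mathbf{0}$ and

$$\text{Cov}(V_i, V_k) = \mathbf{e}_i'\sigma^2\mathbf{I}\,\mathbf{e}_k = \sigma^2\mathbf{e}_i'\mathbf{e}_k = \begin{cases}\sigma^2, & i = k \\ 0, & \text{otherwise}\end{cases}$$

That is, the $V_i$ are independent $N(0, \sigma^2)$ and by (7-10),

$$n\hat{\sigma}^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}'[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\boldsymbol{\varepsilon} = V_1^2 + V_2^2 + \cdots + V_{n-r-1}^2$$

is distributed $\sigma^2\chi^2_{n-r-1}$. •

A confidence ellipsoid for $\boldsymbol{\beta}$ is easily constructed. It is expressed in terms of the estimated covariance matrix $s^2(\mathbf{Z}'\mathbf{Z})^{-1}$, where $s^2 = \hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\varepsilon}}/(n - r - 1)$.

Result 7.5. Let $\mathbf{Y} = \mathbf{Z}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $\mathbf{Z}$ has full rank $r + 1$ and $\boldsymbol{\varepsilon}$ is $N_n(\mathbf{0}, \sigma^2\mathbf{I})$. Then a $100(1 - \alpha)\%$ confidence region for $\boldsymbol{\beta}$ is given by

$$(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{Z}'\mathbf{Z}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \ldots$$

Example 7.5 (Testing the importance of additional predictors using the extra sum-of-squares approach)

TABLE 7.2 RESTAURANT-SERVICE DATA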
The design matrix $\mathbf{Z}$ has one column for the constant, three for location, two for gender, and six for the interaction:

         constant   location   gender   interaction
    Z =     1        1 0 0      1 0     1 0 0 0 0 0     } 5 responses
            1        1 0 0      0 1     0 1 0 0 0 0     } 2 responses
            1        0 1 0      1 0     0 0 1 0 0 0     } 5 responses
            1        0 1 0      0 1     0 0 0 1 0 0     } 2 responses
            1        0 0 1      1 0     0 0 0 0 1 0     } 2 responses
            1        0 0 1      0 1     0 0 0 0 0 1     } 2 responses

(Each displayed row is repeated for the indicated number of responses, giving $n = 18$ rows.) The coefficient vector can be set out as

$$\boldsymbol{\beta}' = [\beta_0, \beta_1, \beta_2, \beta_3, \tau_1, \tau_2, \gamma_{11}, \gamma_{12}, \gamma_{21}, \gamma_{22}, \gamma_{31}, \gamma_{32}]$$

where the $\beta_i$'s ($i > 0$) represent the effects of the locations on the determination of service, the $\tau_i$'s represent the effects of gender on the service index, and the $\gamma_{ik}$'s represent the location-gender interaction effects.

The design matrix $\mathbf{Z}$ is not of full rank. (For instance, column 1 equals the sum of columns 2-4 or columns 5-6.) In fact, $\text{rank}(\mathbf{Z}) = 6$.

For the complete model, results from a computer program give

$$\text{SS}_{\text{res}}(\mathbf{Z}) = 2977.4$$

and $n - \text{rank}(\mathbf{Z}) = 18 - 6 = 12$.

The model without the interaction terms has the design matrix $\mathbf{Z}_1$ consisting of the first six columns of $\mathbf{Z}$. We find that

$$\text{SS}_{\text{res}}(\mathbf{Z}_1) = 3419.1$$

with $n - \text{rank}(\mathbf{Z}_1) = 18 - 4 = 14$. To test $H_0\colon \gamma_{11} = \gamma_{12} = \gamma_{21} = \gamma_{22} = \gamma_{31} = \gamma_{32} = 0$ (no location-gender interaction), we compute

$$F = \frac{(\text{SS}_{\text{res}}(\mathbf{Z}_1) - \text{SS}_{\text{res}}(\mathbf{Z}))/(6 - 4)}{s^2} = \frac{(\text{SS}_{\text{res}}(\mathbf{Z}_1) - \text{SS}_{\text{res}}(\mathbf{Z}))/2}{\text{SS}_{\text{res}}(\mathbf{Z})/12} = \frac{(3419.1 - 2977.4)/2}{2977.4/12} = .89$$
The $F$-ratio may be compared with an appropriate percentage point of an $F$-distribution with 2 and 12 d.f. This $F$-ratio is not significant for any reasonable significance level $\alpha$. Consequently, we conclude that the service index does not depend upon any location-gender interaction, and these terms can be dropped from the model.

Using the extra sum-of-squares approach, we may verify that there is no difference between locations (no location effect), but that gender is significant; that is, males and females do not give the same ratings to service.

In analysis-of-variance situations where the cell counts are unequal, the variation in the response attributable to different predictor variables and their interactions cannot usually be separated into independent amounts. To evaluate the relative influences of the predictors on the response in this case, it is necessary to fit the model with and without the terms in question and compute the appropriate $F$-test statistics. •
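The extra sum-of-squares comparison above is a single ratio once the two residual sums of squares and the two ranks are known. A minimal sketch follows; the function name `extra_ss_f` and its argument names are assumptions made for illustration, and the inputs are the values reported in the example.

```python
def extra_ss_f(ss_res_reduced, ss_res_full, rank_full, rank_reduced, n):
    """Extra sum-of-squares F statistic for dropping terms from a linear model.

    Returns (F, numerator d.f., denominator d.f.), where the numerator d.f. is
    the drop in rank and the denominator d.f. is n - rank of the full model.
    """
    df_num = rank_full - rank_reduced
    df_den = n - rank_full
    s2 = ss_res_full / df_den  # residual mean square of the full model
    f_stat = ((ss_res_reduced - ss_res_full) / df_num) / s2
    return f_stat, df_num, df_den


# Restaurant-service example: full model rank 6, no-interaction model rank 4, n = 18.
F_INTERACTION, DF_NUM, DF_DEN = extra_ss_f(3419.1, 2977.4, rank_full=6, rank_reduced=4, n=18)
```

Working with ranks rather than column counts matters here because the design matrix is not of full rank.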
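7.5 INFERENCES FROM THE ESTIMATED REGRESSION FUNCTION

Once an investigator is satisfied with the fitted regression model, it can be used to solve two prediction problems. Let $\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$ be selected values for the predictor variables. Then $\mathbf{z}_0$ and $\hat{\boldsymbol{\beta}}$ can be used (1) to estimate the regression function $\beta_0 + \beta_1 z_{01} + \cdots + \beta_r z_{0r}$ at $\mathbf{z}_0$ and (2) to estimate the value of the response $Y$ at $\mathbf{z}_0$.

Estimating the Regression Function at $\mathbf{z}_0$

Let $Y_0$ denote the value of the response when the predictor variables have values $\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$. According to the model in (7-3), the expected value of $Y_0$ is

$$E(Y_0 \mid \mathbf{z}_0) = \beta_0 + \beta_1 z_{01} + \cdots + \beta_r z_{0r} = \mathbf{z}_0'\boldsymbol{\beta}$$

Its least squares estimate is $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$.

Result 7.7. For the linear regression model in (7-3), $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is the unbiased linear estimator of $E(Y_0 \mid \mathbf{z}_0)$ with minimum variance, $\text{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2$. If the errors $\boldsymbol{\varepsilon}$ are normally distributed, then a $100(1 - \alpha)\%$ confidence interval for $E(Y_0 \mid \mathbf{z}_0) = \mathbf{z}_0'\boldsymbol{\beta}$ is provided by

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_{n-r-1}\left(\frac{\alpha}{2}\right)\sqrt{(\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)s^2} \qquad (7\text{-}18)$$

where $t_{n-r-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a $t$-distribution with $n - r - 1$ d.f.

Proof. For a fixed $\mathbf{z}_0$, $\mathbf{z}_0'\boldsymbol{\beta}$ is just a linear combination of the $\beta_i$'s, so Result 7.3 applies. Also, $\text{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'\,\text{Cov}(\hat{\boldsymbol{\beta}})\,\mathbf{z}_0 = \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2$ since $\text{Cov}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}$ by Result 7.2. Under the further assumption that $\boldsymbol{\varepsilon}$ is normally distributed, Result 7.4 asserts that $\hat{\boldsymbol{\beta}}$ is $N_{r+1}(\boldsymbol{\beta}, \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1})$ independently of $s^2/\sigma^2$, which is distributed as $\chi^2_{n-r-1}/(n - r - 1)$. Consequently, the linear combination $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is $N(\mathbf{z}_0'\boldsymbol{\beta}, \sigma^2\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$ and

$$\frac{(\mathbf{z}_0'\hat{\boldsymbol{\beta}} - \mathbf{z}_0'\boldsymbol{\beta})/\sqrt{\sigma^2\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0}}{\sqrt{s^2/\sigma^2}} = \frac{\mathbf{z}_0'\hat{\boldsymbol{\beta}} - \mathbf{z}_0'\boldsymbol{\beta}}{\sqrt{s^2(\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}}$$

is distributed as $t_{n-r-1}$. The confidence interval follows. •

Forecasting a New Observation at $\mathbf{z}_0$

Prediction of a new observation, such as $Y_0$, at $\mathbf{z}_0' = [1, z_{01}, \ldots, z_{0r}]$ is more uncertain than estimating the expected value of $Y_0$. According to the regression model of (7-3),

$$Y_0 = \mathbf{z}_0'\boldsymbol{\beta} + \varepsilon_0$$

or

    (new response $Y_0$) = (expected value of $Y_0$ at $\mathbf{z}_0$) + (new error)

where $\varepsilon_0$ is distributed as $N(0, \sigma^2)$ and is independent of $\boldsymbol{\varepsilon}$ and, hence, of $\hat{\boldsymbol{\beta}}$ and $s^2$. The errors $\boldsymbol{\varepsilon}$ influence the estimators $\hat{\boldsymbol{\beta}}$ and $s^2$ through the responses $\mathbf{Y}$, but $\varepsilon_0$ does not.

Result 7.8. Given the linear regression model of (7-3), a new observation $Y_0$ has the unbiased predictor

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} = \hat{\beta}_0 + \hat{\beta}_1 z_{01} + \cdots + \hat{\beta}_r z_{0r}$$

The variance of the forecast error $Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}$ is

$$\text{Var}(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \sigma^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$$

When the errors $\boldsymbol{\varepsilon}$ have a normal distribution, a $100(1 - \alpha)\%$ prediction interval for $Y_0$ is given by

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_{n-r-1}\left(\frac{\alpha}{2}\right)\sqrt{s^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}$$

where $t_{n-r-1}(\alpha/2)$ is the upper $100(\alpha/2)$th percentile of a $t$-distribution with $n - r - 1$ degrees of freedom.

Proof. We forecast $Y_0$ by $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$, which estimates $E(Y_0 \mid \mathbf{z}_0)$. By Result 7.7, $\mathbf{z}_0'\hat{\boldsymbol{\beta}}$ has $E(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'\boldsymbol{\beta}$ and $\text{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2$. The forecast error is then $Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}} = \mathbf{z}_0'\boldsymbol{\beta} + \varepsilon_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}} = \varepsilon_0 + \mathbf{z}_0'(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$. Thus, $E(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}) = E(\varepsilon_0) + E(\mathbf{z}_0'(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})) = 0$ so the predictor is unbiased. Since $\varepsilon_0$ and $\hat{\boldsymbol{\beta}}$ are independent,

$$\text{Var}(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \text{Var}(\varepsilon_0) + \text{Var}(\mathbf{z}_0'\hat{\boldsymbol{\beta}}) = \sigma^2 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0\sigma^2 = \sigma^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$$

If it is further assumed that $\boldsymbol{\varepsilon}$ has a normal distribution, then $\hat{\boldsymbol{\beta}}$ is normally distributed, and so is the linear combination $Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}$. Consequently, $(Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}})/\sqrt{\sigma^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}$ is distributed as $N(0, 1)$. Dividing this ratio by $\sqrt{s^2/\sigma^2}$, which is distributed as $\sqrt{\chi^2_{n-r-1}/(n - r - 1)}$, we obtain

$$\frac{Y_0 - \mathbf{z}_0'\hat{\boldsymbol{\beta}}}{\sqrt{s^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)}}$$

which is distributed as $t_{n-r-1}$. The prediction interval follows immediately. •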
The prediction interval for $Y_0$ is wider than the confidence interval for estimating the value of the regression function $E(Y_0 \mid \mathbf{z}_0) = \mathbf{z}_0'\boldsymbol{\beta}$. The additional uncertainty in forecasting $Y_0$, which is represented by the extra term $s^2$ in the expression $s^2(1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0)$, comes from the presence of the unknown error term $\varepsilon_0$.

Example 7.6 (Interval estimates for a mean response and a future response)

Companies considering the purchase of a computer must first assess their future needs in order to determine the proper equipment. A computer scientist collected data from seven similar company sites so that a forecast equation of computer-hardware requirements for inventory management could be developed. The data are given in Table 7.3 for

    $z_1$ = customer orders (in thousands)
    $z_2$ = add-delete item count (in thousands)
    $Y$ = CPU (central processing unit) time (in hours)

TABLE 7.3 COMPUTER DATA

    z1 (Orders)    z2 (Add-delete items)    Y (CPU time)
    123.5          2.108                    141.5
    146.1          9.213                    168.9
    133.9          1.905                    154.8
    128.5           .815                    146.5
    151.5          1.061                    172.8
    136.2          8.603                    160.1
     92.0          1.125                    108.5

Source: Data taken from H. P. Artis, Forecasting Computer Requirements: A Forecaster's Dilemma (Piscataway, NJ: Bell Laboratories, 1979).

Construct a 95% confidence interval for the mean CPU time, $E(Y_0 \mid \mathbf{z}_0) = \beta_0 + \beta_1 z_{01} + \beta_2 z_{02}$ at $\mathbf{z}_0' = [1, 130, 7.5]$. Also, find a 95% prediction interval for a new facility's CPU requirement corresponding to the same $\mathbf{z}_0$.

A computer program provides the estimated regression function

$$\hat{y} = 8.42 + 1.08z_1 + .42z_2$$

$$(\mathbf{Z}'\mathbf{Z})^{-1} = \begin{bmatrix} 8.17969 & & \\ -.06411 & .00052 & \\ .08831 & -.00107 & .01440 \end{bmatrix}$$

and $s = 1.204$. Consequently,

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} = 8.42 + 1.08(130) + .42(7.5) = 151.97$$

and $s\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = 1.204(.58928) = .71$. We have $t_4(.025) = 2.776$, so the 95% confidence interval for the mean CPU time at $\mathbf{z}_0$ is

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_4(.025)\,s\sqrt{\mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = 151.97 \pm 2.776(.71)$$

or (150.00, 153.94).

Since $s\sqrt{1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = (1.204)(1.16071) = 1.40$, a 95% prediction interval for the CPU time at a new facility with conditions $\mathbf{z}_0$ is

$$\mathbf{z}_0'\hat{\boldsymbol{\beta}} \pm t_4(.025)\,s\sqrt{1 + \mathbf{z}_0'(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{z}_0} = 151.97 \pm 2.776(1.40)$$

or (148.08, 155.86). •
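Both intervals in Example 7.6 depend on the data only through $\hat{\boldsymbol{\beta}}$, $(\mathbf{Z}'\mathbf{Z})^{-1}$, and $s$, so they can be rechecked from the reported summaries. The sketch below does this in plain Python; it assumes the rounded values printed in the example (including $t_4(.025) = 2.776$) rather than refitting the regression, and the names `quad_form` and `interval_estimates` are hypothetical.

```python
import math

# Rounded summaries reported in Example 7.6 (full symmetric form of (Z'Z)^{-1}).
ZTZ_INV = [[8.17969, -0.06411, 0.08831],
           [-0.06411, 0.00052, -0.00107],
           [0.08831, -0.00107, 0.01440]]
BETA_HAT = [8.42, 1.08, 0.42]
S = 1.204          # residual standard deviation
T_CRIT = 2.776     # t_4(.025), taken from the text


def quad_form(z, m):
    """Return z' m z for a vector z and square matrix m."""
    return sum(z[i] * m[i][k] * z[k] for i in range(len(z)) for k in range(len(z)))


def interval_estimates(z0, beta_hat=BETA_HAT, ztz_inv=ZTZ_INV, s=S, t_crit=T_CRIT):
    """Confidence interval for E(Y0 | z0) and prediction interval for Y0."""
    fit = sum(b * z for b, z in zip(beta_hat, z0))
    q = quad_form(z0, ztz_inv)
    ci_half = t_crit * s * math.sqrt(q)          # uncertainty in the fitted mean
    pi_half = t_crit * s * math.sqrt(1.0 + q)    # adds the new error term
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


FIT, CONF_INT, PRED_INT = interval_estimates([1.0, 130.0, 7.5])
```

Because the printed entries of $(\mathbf{Z}'\mathbf{Z})^{-1}$ are rounded, the endpoints agree with the example only to about two decimal places.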
7.6 MODEL CHECKING AND OTHER ASPECTS OF REGRESSION

Does the Model Fit?

Assuming that the model is "correct," we have used the estimated regression function to make inferences. Of course, it is imperative to examine the adequacy of the model before the estimated function becomes a permanent part of the decision-making apparatus.

All the sample information on lack of fit is contained in the residuals

$$\begin{aligned}
\hat{\varepsilon}_1 &= y_1 - \hat{\beta}_0 - \hat{\beta}_1 z_{11} - \cdots - \hat{\beta}_r z_{1r} \\
\hat{\varepsilon}_2 &= y_2 - \hat{\beta}_0 - \hat{\beta}_1 z_{21} - \cdots - \hat{\beta}_r z_{2r} \\
&\;\;\vdots \\
\hat{\varepsilon}_n &= y_n - \hat{\beta}_0 - \hat{\beta}_1 z_{n1} - \cdots - \hat{\beta}_r z_{nr}
\end{aligned}$$

or

$$\hat{\boldsymbol{\varepsilon}} = [\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}']\mathbf{y} = [\mathbf{I} - \mathbf{H}]\mathbf{y} \qquad (7\text{-}19)$$

If the model is valid, each residual $\hat{\varepsilon}_j$ is an estimate of the error $\varepsilon_j$, which is assumed to be a normal random variable with mean zero and variance $\sigma^2$. Although the residuals $\hat{\boldsymbol{\varepsilon}}$ have expected value $\mathbf{0}$, their covariance matrix $\sigma^2[\mathbf{I} - \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'] = \sigma^2[\mathbf{I} - \mathbf{H}]$ is not diagonal. Residuals have unequal variances and nonzero correlations. Fortunately, the correlations are often small and the variances are nearly equal.

Because the residuals $\hat{\boldsymbol{\varepsilon}}$ have covariance matrix $\sigma^2[\mathbf{I} - \mathbf{H}]$, the variances of the $\hat{\varepsilon}_j$ can vary greatly if the diagonal elements of $\mathbf{H}$, the leverages $h_{jj}$, are substantially different. Consequently, many statisticians prefer graphical diagnostics based on studentized residuals. Using the residual mean square $s^2$ as an estimate of $\sigma^2$, we have

$$\widehat{\text{Var}}(\hat{\varepsilon}_j) = s^2(1 - h_{jj}), \qquad j = 1, 2, \ldots, n \qquad (7\text{-}20)$$

and the studentized residuals are

$$\hat{\varepsilon}_j^* = \frac{\hat{\varepsilon}_j}{\sqrt{s^2(1 - h_{jj})}}, \qquad j = 1, 2, \ldots, n \qquad (7\text{-}21)$$

We expect the studentized residuals to look, approximately, like independent drawings from an $N(0, 1)$ distribution. Some software packages go one step further and studentize $\hat{\varepsilon}_j$ using the delete-one estimated variance $s^2(j)$, which is the residual mean square when the $j$th observation is dropped from the analysis.
378
Chapter 7
M u ltiva riate Linear Reg ress ion Models
Residuals should be plotted in various ways to detect possible anomalies. For general diagnostic purposes, the following are useful graphs:

1. Plot the residuals $\hat\varepsilon_j$ against the predicted values $\hat y_j = \hat\beta_0 + \hat\beta_1 z_{j1} + \cdots + \hat\beta_r z_{jr}$. Departures from the assumptions of the model are typically indicated by two types of phenomena:
   (a) A dependence of the residuals on the predicted value. This is illustrated in Figure 7.2(a). The numerical calculations are incorrect, or a $\beta_0$ term has been omitted from the model.
   (b) The variance is not constant. The pattern of residuals may be funnel shaped, as in Figure 7.2(b), so that there is large variability for large $\hat y$ and small variability for small $\hat y$. If this is the case, the variance of the error is not constant, and transformations or a weighted least squares approach (or both) are required. (See Exercise 7.3.)
   In Figure 7.2(d), the residuals form a horizontal band. This is ideal and indicates equal variances and no dependence on $\hat y$.
2. Plot the residuals $\hat\varepsilon_j$ against a predictor variable, such as $z_1$, or products of predictor variables, such as $z_1^2$ or $z_1 z_2$. A systematic pattern in these plots suggests the need for more terms in the model. This situation is illustrated in Figure 7.2(c).
3. Q-Q plots and histograms. Do the errors appear to be normally distributed? To answer this question, the residuals $\hat\varepsilon_j$ or $\hat\varepsilon_j^*$ can be examined using the techniques discussed in Section 4.6. The Q-Q plots, histograms, and dot diagrams help to detect the presence of unusual observations or severe departures from normality that may require special attention in the analysis. If n is large, minor departures from normality will not greatly affect inferences about $\beta$.
4. Plot the residuals versus time. The assumption of independence is crucial, but hard to check. If the data are naturally chronological, a plot of the residuals versus time may reveal a systematic pattern. (A plot of the positions of the residuals in space may also reveal associations among the errors.) For instance, residuals that increase over time indicate a strong positive dependence. A statistical test of independence can be constructed from the first autocorrelation,
$$r_1 = \frac{\sum_{j=2}^{n} \hat\varepsilon_j \hat\varepsilon_{j-1}}{\sum_{j=1}^{n} \hat\varepsilon_j^2} \qquad (7\text{-}22)$$
   of residuals from adjacent periods. A popular test based on the statistic
$$\sum_{j=2}^{n} (\hat\varepsilon_j - \hat\varepsilon_{j-1})^2 \Big/ \sum_{j=1}^{n} \hat\varepsilon_j^2 \doteq 2(1 - r_1)$$
   is called the Durbin-Watson test. (See [13] for a description of this test and tables of critical values.)

Figure 7.2 Residual plots. [Panels (a) through (d): residuals $\hat\varepsilon$ plotted against $\hat y$ in (a), (b), and (d), and against a predictor variable or product of predictor variables in (c).]

Example 7.7 (Residual plots)
Three residual plots for the computer data discussed in Example 7.6 are shown in Figure 7.3. The sample size n = 7 is really too small to allow definitive judgments; however, it appears as if the regression assumptions are tenable. ■

Figure 7.3 Residual plots for the computer data of Example 7.6. [Three panels: residuals $\hat\varepsilon$ plotted against $z_1$, against $z_2$, and against $\hat y$.]
If several observations of the response are available for the same values of the predictor variables, then a formal test for lack of fit can be carried out. (See [12] for a discussion of the pure-error lack-of-fit test.)

Leverage and Influence

Although a residual analysis is useful in assessing the fit of a model, departures from the regression model are often hidden by the fitting process. For example, there may be "outliers" in either the response or explanatory variables that can have a considerable effect on the analysis yet are not easily detected from an examination of residual plots. In fact, these outliers may determine the fit.

The leverage $h_{jj}$ is associated with the jth data point and measures, in the space of the explanatory variables, how far the jth observation is from the other n - 1 observations. For simple linear regression with one explanatory variable z,
$$h_{jj} = \frac{1}{n} + \frac{(z_j - \bar z)^2}{\sum_{j=1}^{n} (z_j - \bar z)^2}$$
The average leverage is $(r + 1)/n$. (See Exercise 7.8.)

For a data point with high leverage, $h_{jj}$ approaches 1 and the prediction at $z_j$ is almost solely determined by $y_j$, the rest of the data having little to say about the matter. This follows because (change in $\hat y_j$) = $h_{jj}$ (change in $y_j$), provided that other y values remain fixed.

Observations that significantly affect inferences drawn from the data are said to be influential. Methods for assessing influence are typically based on the change in the vector of parameter estimates, $\hat\beta$, when observations are deleted. Plots based upon leverage and influence statistics and their use in diagnostic checking of regression models are described in [2], [4], and [9]. These references are recommended for anyone involved in an analysis of regression models.

If, after the diagnostic checks, no serious violations of the assumptions are detected, we can make inferences about $\beta$ and the future Y values with some assurance that we will not be misled.

Additional Problems in Linear Regression
We shall briefly discuss several important aspects of regression that deserve and receive extensive treatments in texts devoted to regression analysis. (See [9], [10], [12], and [20].)

Selecting predictor variables from a large set. In practice, it is often difficult to formulate an appropriate regression function immediately. Which predictor variables should be included? What form should the regression function take?

When the list of possible predictor variables is very large, not all of the variables can be included in the regression function. Techniques and computer programs designed to select the "best" subset of predictors are now readily available. The good ones try all subsets: $z_1$ alone, $z_2$ alone, ..., $z_1$ and $z_2$, .... The best choice is decided by examining some criterion quantity like $R^2$. [See (7-9).] However, $R^2$ always increases with the inclusion of additional predictor variables. Although this problem can be circumvented by using the adjusted $R^2$, $\bar R^2 = 1 - (1 - R^2)(n - 1)/(n - r - 1)$, a better statistic for selecting variables seems to be Mallow's $C_p$ statistic (see [11]),
$$C_p = \frac{\text{residual sum of squares for subset model with } p \text{ parameters, including an intercept}}{\text{residual variance for full model}} - (n - 2p)$$
A plot of the pairs $(p, C_p)$, one for each subset of predictors, will indicate models that forecast the observed responses well. Good models typically have $(p, C_p)$ coordinates near the 45° line. In Figure 7.4, we have circled the point corresponding to the "best" subset of predictor variables.

Figure 7.4 $C_p$ plot for computer data from Example 7.6 with three predictor variables ($z_1$ = orders, $z_2$ = add-delete count, $z_3$ = number of items; see the example and original source). Numbers in parentheses correspond to predictor variables.

If the list of predictor variables is very long, cost considerations limit the number of models that can be examined. Another approach, called stepwise regression (see [2]), attempts to select important predictors without considering all the possibilities. The procedure can be described by listing the basic steps (algorithm) involved in the computations:

Step 1. All possible simple linear regressions are considered. The predictor variable that explains the largest significant proportion of the variation in Y (the variable that has the largest correlation with the response) is the first variable to enter the regression function.

Step 2. The next variable to enter is the one (out of those not yet included) that makes the largest significant contribution to the regression sum of squares. The significance of the contribution is determined by an F-test. (See Result 7.6.) The value of the F-statistic that must be exceeded before the contribution of a variable is deemed significant is often called the F to enter.

Step 3. Once an additional variable has been included in the equation, the individual contributions to the regression sum of squares of the other variables already in the equation are checked for significance using F-tests. If the F-statistic is less than the one (called the F to remove) corresponding to a prescribed significance level, the variable is deleted from the regression function.

Step 4. Steps 2 and 3 are repeated until all possible additions are nonsignificant and all possible deletions are significant. At this point the selection stops.

Because of the step-by-step procedure, there is no guarantee that this approach will select, for example, the best three variables for prediction. A second drawback is that the (automatic) selection methods are not capable of indicating when transformations of variables are useful.
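The all-subsets comparison by $C_p$ can be sketched in a few lines. The example below is an illustration only: it uses the two predictors available in Table 7.3 (the third predictor plotted in Figure 7.4 is not reproduced in the text), assumes NumPy is available, and uses illustrative names such as `rss` and `cp`. For the full model, $C_p = p$ by construction, so the interesting comparisons are among the smaller subsets.

```python
from itertools import combinations

import numpy as np

# Table 7.3: computer data (two predictors only)
predictors = {
    "z1": np.array([123.5, 146.1, 133.9, 128.5, 151.5, 136.2, 92.0]),
    "z2": np.array([2.108, 9.213, 1.905, 0.815, 1.061, 8.603, 1.125]),
}
y = np.array([141.5, 168.9, 154.8, 146.5, 172.8, 160.1, 108.5])
n = len(y)
ones = np.ones(n)


def rss(Z, y):
    """Residual sum of squares from the least squares fit of y on Z."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    e = y - Z @ beta
    return float(e @ e)


names = list(predictors)
Z_full = np.column_stack([ones] + [predictors[v] for v in names])
s2_full = rss(Z_full, y) / (n - Z_full.shape[1])   # residual variance, full model

cp = {}
for k in range(len(names) + 1):
    for subset in combinations(names, k):
        Z_sub = np.column_stack([ones] + [predictors[v] for v in subset])
        p = Z_sub.shape[1]                         # parameters incl. intercept
        cp[subset] = rss(Z_sub, y) / s2_full - (n - 2 * p)

for subset, value in cp.items():
    print(subset or ("intercept only",), "p =", len(subset) + 1, "Cp =", value)
```

Subsets whose $(p, C_p)$ pair lies near the line $C_p = p$ are the candidates singled out by the discussion above.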
Colinearity. If $Z$ is not of full rank, some linear combination, such as $Za$, must equal 0. In this situation, the columns are said to be colinear. This implies that $Z'Z$ does not have an inverse. For most regression analyses, it is unlikely that $Za = 0$ exactly. Yet, if linear combinations of the columns of $Z$ exist that are nearly 0, the calculation of $(Z'Z)^{-1}$ is numerically unstable. Typically, the diagonal entries of $(Z'Z)^{-1}$ will be large. This yields large estimated variances for the $\hat\beta_i$'s and it is then difficult to detect the "significant" regression coefficients $\hat\beta_i$. The problems caused by colinearity can be overcome somewhat by (1) deleting one of a pair of predictor variables that are strongly correlated or (2) relating the response Y to the principal components of the predictor variables; that is, the rows $z_j'$ of $Z$ are treated as a sample, and the first few principal components are calculated as is subsequently described in Section 8.3. The response Y is then regressed on these new predictor variables.
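The effect of near-colinearity on the diagonal of $(Z'Z)^{-1}$ can be seen on a small artificial design. The sketch below is hypothetical throughout: the data are made up for illustration, NumPy is assumed, and the helper name `diag_inv_gram` is not from the text. The second design differs from the first only in that its last column is nearly a copy of $z_1$.

```python
import numpy as np


def diag_inv_gram(Z):
    """Diagonal of (Z'Z)^{-1}; proportional to the variances of the beta_hat_i."""
    return np.diag(np.linalg.inv(Z.T @ Z))


n = 10
ones = np.ones(n)
z1 = np.arange(n, dtype=float)
w = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)    # alternating +1, -1 pattern

Z_good = np.column_stack([ones, z1, w])              # columns well separated
Z_bad = np.column_stack([ones, z1, z1 + 1e-3 * w])   # last column nearly equals z1

d_good = diag_inv_gram(Z_good)
d_bad = diag_inv_gram(Z_bad)
cond_good = np.linalg.cond(Z_good)
cond_bad = np.linalg.cond(Z_bad)

print("diag (Z'Z)^-1, well-conditioned design:", d_good)
print("diag (Z'Z)^-1, nearly colinear design: ", d_bad)
print("condition numbers:", cond_good, cond_bad)
```

The inflated diagonal entries for the nearly colinear design translate directly into large estimated variances $s^2 [(Z'Z)^{-1}]_{ii}$ for the affected coefficients.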
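Bias caused by a misspecified model. Suppose some important predictor variables are omitted from the proposed regression model. That is, suppose the true model has $Z = [Z_1 \mid Z_2]$ with rank r + 1 and
$$\underset{(n \times 1)}{Y} = \Big[\underset{(n \times (q+1))}{Z_1} \,\Big|\, \underset{(n \times (r-q))}{Z_2}\Big] \begin{bmatrix} \beta_{(1)} \\ \beta_{(2)} \end{bmatrix} + \underset{(n \times 1)}{\varepsilon} = Z_1\beta_{(1)} + Z_2\beta_{(2)} + \varepsilon \qquad (7\text{-}23)$$
where $\beta_{(1)}$ is $(q+1) \times 1$, $\beta_{(2)}$ is $(r-q) \times 1$, $E(\varepsilon) = 0$, and $\mathrm{Var}(\varepsilon) = \sigma^2 I$. However, the investigator unknowingly fits a model using only the first q predictors by minimizing the error sum of squares $(Y - Z_1\beta_{(1)})'(Y - Z_1\beta_{(1)})$. The least squares estimator of $\beta_{(1)}$ is $\hat\beta_{(1)} = (Z_1'Z_1)^{-1}Z_1'Y$. Then, unlike the situation when the model is correct,
$$E(\hat\beta_{(1)}) = (Z_1'Z_1)^{-1}Z_1'E(Y) = (Z_1'Z_1)^{-1}Z_1'\big(Z_1\beta_{(1)} + Z_2\beta_{(2)} + E(\varepsilon)\big) = \beta_{(1)} + (Z_1'Z_1)^{-1}Z_1'Z_2\beta_{(2)} \qquad (7\text{-}24)$$
That is, $\hat\beta_{(1)}$ is a biased estimator of $\beta_{(1)}$ unless the columns of $Z_1$ are perpendicular to those of $Z_2$ (that is, $Z_1'Z_2 = 0$). If important variables are missing from the model, the least squares estimates $\hat\beta_{(1)}$ may be misleading.

7.7 MULTIVARIATE MULTIPLE REGRESSION

In this section, we consider the problem of modeling the relationship between m responses $Y_1, Y_2, \ldots, Y_m$ and a single set of predictor variables $z_1, z_2, \ldots, z_r$. Each response is assumed to follow its own regression model, so that
$$Y_1 = \beta_{01} + \beta_{11}z_1 + \cdots + \beta_{r1}z_r + \varepsilon_1$$
$$Y_2 = \beta_{02} + \beta_{12}z_1 + \cdots + \beta_{r2}z_r + \varepsilon_2$$
$$\vdots$$
$$Y_m = \beta_{0m} + \beta_{1m}z_1 + \cdots + \beta_{rm}z_r + \varepsilon_m \qquad (7\text{-}25)$$
The error term $\varepsilon' = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m]$ has $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \Sigma$. Thus, the error terms associated with different responses may be correlated.

To establish notation conforming to the classical linear regression model, let $[z_{j0}, z_{j1}, \ldots, z_{jr}]$ denote the values of the predictor variables for the jth trial, let $Y_j' = [Y_{j1}, Y_{j2}, \ldots, Y_{jm}]$ be the responses, and let $\varepsilon_j' = [\varepsilon_{j1}, \varepsilon_{j2}, \ldots, \varepsilon_{jm}]$ be the errors. In matrix notation, the design matrix
$$\underset{(n \times (r+1))}{Z} = \begin{bmatrix} z_{10} & z_{11} & \cdots & z_{1r} \\ z_{20} & z_{21} & \cdots & z_{2r} \\ \vdots & \vdots & & \vdots \\ z_{n0} & z_{n1} & \cdots & z_{nr} \end{bmatrix}$$
is the same as that for the single-response regression model. [See (7-3).] The other matrix quantities have multivariate counterparts. Set
$$\underset{(n \times m)}{Y} = \begin{bmatrix} Y_{11} & Y_{12} & \cdots & Y_{1m} \\ Y_{21} & Y_{22} & \cdots & Y_{2m} \\ \vdots & \vdots & & \vdots \\ Y_{n1} & Y_{n2} & \cdots & Y_{nm} \end{bmatrix} = [Y_{(1)} \mid Y_{(2)} \mid \cdots \mid Y_{(m)}]$$
$$\underset{((r+1) \times m)}{\beta} = \begin{bmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0m} \\ \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \vdots & \vdots & & \vdots \\ \beta_{r1} & \beta_{r2} & \cdots & \beta_{rm} \end{bmatrix} = [\beta_{(1)} \mid \beta_{(2)} \mid \cdots \mid \beta_{(m)}]$$
$$\underset{(n \times m)}{\varepsilon} = \begin{bmatrix} \varepsilon_{11} & \varepsilon_{12} & \cdots & \varepsilon_{1m} \\ \varepsilon_{21} & \varepsilon_{22} & \cdots & \varepsilon_{2m} \\ \vdots & \vdots & & \vdots \\ \varepsilon_{n1} & \varepsilon_{n2} & \cdots & \varepsilon_{nm} \end{bmatrix} = [\varepsilon_{(1)} \mid \varepsilon_{(2)} \mid \cdots \mid \varepsilon_{(m)}] = \begin{bmatrix} \varepsilon_1' \\ \varepsilon_2' \\ \vdots \\ \varepsilon_n' \end{bmatrix}$$
The multivariate linear regression model is
$$\underset{(n \times m)}{Y} = \underset{(n \times (r+1))}{Z}\; \underset{((r+1) \times m)}{\beta} + \underset{(n \times m)}{\varepsilon} \qquad (7\text{-}26)$$
with $E(\varepsilon_{(i)}) = 0$ and $\mathrm{Cov}(\varepsilon_{(i)}, \varepsilon_{(k)}) = \sigma_{ik} I$, $i, k = 1, 2, \ldots, m$.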
Simply stated, the ith response $Y_{(i)}$ follows the linear regression model
$$Y_{(i)} = Z\beta_{(i)} + \varepsilon_{(i)}, \quad i = 1, 2, \ldots, m \qquad (7\text{-}27)$$
with $\mathrm{Cov}(\varepsilon_{(i)}) = \sigma_{ii} I$. However, the errors for different responses on the same trial can be correlated.

Given the outcomes $Y$ and the values of the predictor variables $Z$ with full column rank, we determine the least squares estimates $\hat\beta_{(i)}$ exclusively from the observations $Y_{(i)}$ on the ith response. In conformity with the single-response solution, we take
$$\hat\beta_{(i)} = (Z'Z)^{-1}Z'Y_{(i)} \qquad (7\text{-}28)$$
Collecting these univariate least squares estimates, we obtain
$$\hat\beta = [\hat\beta_{(1)} \mid \hat\beta_{(2)} \mid \cdots \mid \hat\beta_{(m)}] = (Z'Z)^{-1}Z'[Y_{(1)} \mid Y_{(2)} \mid \cdots \mid Y_{(m)}]$$
or
$$\hat\beta = (Z'Z)^{-1}Z'Y \qquad (7\text{-}29)$$

For any choice of parameters $B = [b_{(1)} \mid b_{(2)} \mid \cdots \mid b_{(m)}]$, the matrix of errors is $Y - ZB$. The error sum of squares and cross products matrix is
$$(Y - ZB)'(Y - ZB) = \begin{bmatrix} (Y_{(1)} - Zb_{(1)})'(Y_{(1)} - Zb_{(1)}) & \cdots & (Y_{(1)} - Zb_{(1)})'(Y_{(m)} - Zb_{(m)}) \\ \vdots & & \vdots \\ (Y_{(m)} - Zb_{(m)})'(Y_{(1)} - Zb_{(1)}) & \cdots & (Y_{(m)} - Zb_{(m)})'(Y_{(m)} - Zb_{(m)}) \end{bmatrix} \qquad (7\text{-}30)$$
The selection $b_{(i)} = \hat\beta_{(i)}$ minimizes the ith diagonal sum of squares $(Y_{(i)} - Zb_{(i)})'(Y_{(i)} - Zb_{(i)})$. Consequently, $\mathrm{tr}[(Y - ZB)'(Y - ZB)]$ is minimized by the choice $B = \hat\beta$. Also, the generalized variance $|(Y - ZB)'(Y - ZB)|$ is minimized by the least squares estimates $\hat\beta$. (See Exercise 7.11 for an additional generalized sum of squares property.)

Using the least squares estimates $\hat\beta$, we can form the matrices of
$$\text{Predicted values:} \quad \hat Y = Z\hat\beta = Z(Z'Z)^{-1}Z'Y$$
$$\text{Residuals:} \quad \hat\varepsilon = Y - \hat Y = [I - Z(Z'Z)^{-1}Z']\,Y \qquad (7\text{-}31)$$
The orthogonality conditions among the residuals, predicted values, and columns of $Z$, which hold in classical linear regression, hold in multivariate multiple regression. They follow from $Z'[I - Z(Z'Z)^{-1}Z'] = Z' - Z' = 0$. Specifically,
$$Z'\hat\varepsilon = Z'[I - Z(Z'Z)^{-1}Z']\,Y = 0 \qquad (7\text{-}32)$$
so the residuals $\hat\varepsilon_{(i)}$ are perpendicular to the columns of $Z$. Also,
$$\hat Y'\hat\varepsilon = \hat\beta' Z'[I - Z(Z'Z)^{-1}Z']\,Y = 0 \qquad (7\text{-}33)$$
confirming that the predicted values $\hat Y_{(i)}$ are perpendicular to all residual vectors $\hat\varepsilon_{(k)}$. Because $Y = \hat Y + \hat\varepsilon$,
$$Y'Y = (\hat Y + \hat\varepsilon)'(\hat Y + \hat\varepsilon) = \hat Y'\hat Y + \hat\varepsilon'\hat\varepsilon + 0 + 0'$$
or
$$Y'Y = \hat Y'\hat Y + \hat\varepsilon'\hat\varepsilon \qquad (7\text{-}34)$$
(total sum of squares and cross products) = (predicted sum of squares and cross products) + (residual (error) sum of squares and cross products)

The residual sum of squares and cross products can also be written as
$$\hat\varepsilon'\hat\varepsilon = Y'Y - \hat Y'\hat Y = Y'Y - \hat\beta' Z'Z\hat\beta \qquad (7\text{-}35)$$

Example 7.8 (Fitting a multivariate straight-line regression model)

To illustrate the calculations of $\hat\beta$, $\hat Y$, and $\hat\varepsilon$, we fit a straight-line regression model (see Panel 7.2 for SAS output),
$$Y_{j1} = \beta_{01} + \beta_{11}z_{j1} + \varepsilon_{j1}$$
$$Y_{j2} = \beta_{02} + \beta_{12}z_{j1} + \varepsilon_{j2}, \quad j = 1, 2, \ldots, 5$$
to two responses $Y_1$ and $Y_2$ using the data in Example 7.3. These data, augmented by observations on an additional response, are as follows:

z1     0    1    2    3    4
y1     1    4    3    8    9
y2    -1   -1    2    3    2

The design matrix $Z$ remains unchanged from the single-response problem. We find that
$$Z' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{bmatrix}, \qquad (Z'Z)^{-1} = \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix}$$
PANEL 7.2 SAS ANALYSIS FOR EXAMPLE 7.8 USING PROC GLM.

PROGRAM COMMANDS

    title 'Multivariate Regression Analysis';
    data mra;
    infile 'Example 7-8 data';
    input y1 y2 z1;
    proc glm data = mra;
    model y1 y2 = z1/ss3;
    manova h = z1/printe;

OUTPUT

General Linear Models Procedure

Dependent Variable: Y1

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1       40.00000000    40.00000000      20.00    0.0208
Error               3        6.00000000     2.00000000
Corrected Total     4       46.00000000

R-Square    C.V.        Root MSE    Y1 Mean
0.869565    28.28427    1.414214    5.00000000

Source    DF    Type III SS    Mean Square    F Value    Pr > F
Z1         1    40.00000000    40.00000000      20.00    0.0208

Parameter    Estimate    T for H0: Parameter = 0    Pr > |T|    Std Error of Estimate
INTERCEPT     1.000                  0.91             0.4286         1.09544512
Z1            2.000                  4.47             0.0208         0.44721360

Dependent Variable: Y2

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1       10.00000000    10.00000000       7.50    0.0714
Error               3        4.00000000     1.33333333
Corrected Total     4       14.00000000

R-Square    C.V.        Root MSE    Y2 Mean
0.714286    115.4701    1.154701    1.00000000

Source    DF    Type III SS    Mean Square    F Value    Pr > F
Z1         1    10.00000000    10.00000000       7.50    0.0714

Parameter    Estimate    T for H0: Parameter = 0    Pr > |T|    Std Error of Estimate
INTERCEPT    -1.000                 -1.12             0.3450         0.89442719
Z1            1.000                  2.74             0.0714         0.36514837

E = Error SS&CP Matrix
       Y1    Y2
Y1      6    -2
Y2     -2     4

Manova Test Criteria and Exact F Statistics for the Hypothesis of no Overall Z1 Effect
H = Type III SS&CP Matrix for Z1    E = Error SS&CP Matrix
S = 1    M = 0    N = 0

Statistic                 Value          F          Num DF    Den DF    Pr > F
Wilks' Lambda              0.06250000    15.0000         2         2    0.0625
Pillai's Trace             0.93750000    15.0000         2         2    0.0625
Hotelling-Lawley Trace    15.00000000    15.0000         2         2    0.0625
Roy's Greatest Root       15.00000000    15.0000         2         2    0.0625
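and
$$Z'y_{(2)} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \end{bmatrix} \begin{bmatrix} -1 \\ -1 \\ 2 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} 5 \\ 20 \end{bmatrix}$$
so
$$\hat\beta_{(2)} = (Z'Z)^{-1}Z'y_{(2)} = \begin{bmatrix} .6 & -.2 \\ -.2 & .1 \end{bmatrix} \begin{bmatrix} 5 \\ 20 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$
From Example 7.3,
$$\hat\beta_{(1)} = (Z'Z)^{-1}Z'y_{(1)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$
Hence,
$$\hat\beta = [\hat\beta_{(1)} \mid \hat\beta_{(2)}] = \begin{bmatrix} 1 & -1 \\ 2 & 1 \end{bmatrix} = (Z'Z)^{-1}Z'[y_{(1)} \mid y_{(2)}]$$
The fitted values are generated from $\hat y_1 = 1 + 2z_1$ and $\hat y_2 = -1 + z_1$. Collectively,
$$\hat Y = Z\hat\beta = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ 3 & 0 \\ 5 & 1 \\ 7 & 2 \\ 9 & 3 \end{bmatrix}$$
and
$$\hat\varepsilon = Y - \hat Y = \begin{bmatrix} 0 & 1 & -2 & 1 & 0 \\ 0 & -1 & 1 & 1 & -1 \end{bmatrix}'$$
Note that
$$\hat\varepsilon'\hat Y = \begin{bmatrix} 0 & 1 & -2 & 1 & 0 \\ 0 & -1 & 1 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 3 & 0 \\ 5 & 1 \\ 7 & 2 \\ 9 & 3 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$$
Since
$$Y'Y = \begin{bmatrix} 1 & 4 & 3 & 8 & 9 \\ -1 & -1 & 2 & 3 & 2 \end{bmatrix} \begin{bmatrix} 1 & -1 \\ 4 & -1 \\ 3 & 2 \\ 8 & 3 \\ 9 & 2 \end{bmatrix} = \begin{bmatrix} 171 & 43 \\ 43 & 19 \end{bmatrix}$$
$$\hat Y'\hat Y = \begin{bmatrix} 165 & 45 \\ 45 & 15 \end{bmatrix} \quad \text{and} \quad \hat\varepsilon'\hat\varepsilon = \begin{bmatrix} 6 & -2 \\ -2 & 4 \end{bmatrix}$$
the sum of squares and cross-products decomposition
$$Y'Y = \hat Y'\hat Y + \hat\varepsilon'\hat\varepsilon$$
is easily verified. ■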
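The matrix calculations of Example 7.8 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative. It applies (7-29) and (7-31) to the five observations of the example and confirms the orthogonality and decomposition in (7-33) and (7-34).

```python
import numpy as np

# Data of Example 7.8: one predictor z1 and two responses (y1, y2)
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([[1.0, -1.0],
              [4.0, -1.0],
              [3.0, 2.0],
              [8.0, 3.0],
              [9.0, 2.0]])

Z = np.column_stack([np.ones(len(z1)), z1])    # n x (r + 1) design matrix

# (7-29): every response shares the same (Z'Z)^{-1} Z'
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)   # (r + 1) x m
Y_hat = Z @ beta_hat                           # predicted values, (7-31)
resid = Y - Y_hat                              # residuals, (7-31)

total_sscp = Y.T @ Y
pred_sscp = Y_hat.T @ Y_hat
resid_sscp = resid.T @ resid

print("beta_hat:\n", beta_hat)
print("residual SS and CP matrix:\n", resid_sscp)
print("resid' Y_hat (should be zero):\n", resid.T @ Y_hat)
print("decomposition holds:", np.allclose(total_sscp, pred_sscp + resid_sscp))
```

Solving the normal equations once for all columns of $Y$ mirrors the point made above: the multivariate fit is just m single-response fits that share the same design matrix.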
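Result 7.9. For the least squares estimator $\hat\beta = [\hat\beta_{(1)} \mid \hat\beta_{(2)} \mid \cdots \mid \hat\beta_{(m)}]$ determined under the multivariate multiple regression model (7-26) with full rank$(Z) = r + 1 < n$,
$$E(\hat\beta_{(i)}) = \beta_{(i)} \quad \text{or} \quad E(\hat\beta) = \beta$$
and
$$\mathrm{Cov}(\hat\beta_{(i)}, \hat\beta_{(k)}) = \sigma_{ik}(Z'Z)^{-1}, \quad i, k = 1, 2, \ldots, m$$
The residuals $\hat\varepsilon = [\hat\varepsilon_{(1)} \mid \hat\varepsilon_{(2)} \mid \cdots \mid \hat\varepsilon_{(m)}] = Y - Z\hat\beta$ satisfy $E(\hat\varepsilon_{(i)}) = 0$ and $E(\hat\varepsilon_{(i)}'\hat\varepsilon_{(k)}) = (n - r - 1)\sigma_{ik}$, so
$$E(\hat\varepsilon) = 0 \quad \text{and} \quad E\Big(\frac{1}{n - r - 1}\,\hat\varepsilon'\hat\varepsilon\Big) = \Sigma$$
Also, $\hat\varepsilon$ and $\hat\beta$ are uncorrelated.

Proof. The ith response follows the multiple regression model
$$Y_{(i)} = Z\beta_{(i)} + \varepsilon_{(i)}, \quad E(\varepsilon_{(i)}) = 0, \quad \text{and} \quad E(\varepsilon_{(i)}\varepsilon_{(i)}') = \sigma_{ii} I$$
Also, as in (7-10),
$$\hat\beta_{(i)} - \beta_{(i)} = (Z'Z)^{-1}Z'Y_{(i)} - \beta_{(i)} = (Z'Z)^{-1}Z'\varepsilon_{(i)} \qquad (7\text{-}36)$$
and
$$\hat\varepsilon_{(i)} = Y_{(i)} - \hat Y_{(i)} = [I - Z(Z'Z)^{-1}Z']\,Y_{(i)} = [I - Z(Z'Z)^{-1}Z']\,\varepsilon_{(i)}$$
so $E(\hat\beta_{(i)}) = \beta_{(i)}$ and $E(\hat\varepsilon_{(i)}) = 0$. Next,
$$\mathrm{Cov}(\hat\beta_{(i)}, \hat\beta_{(k)}) = E(\hat\beta_{(i)} - \beta_{(i)})(\hat\beta_{(k)} - \beta_{(k)})' = (Z'Z)^{-1}Z'E(\varepsilon_{(i)}\varepsilon_{(k)}')Z(Z'Z)^{-1} = \sigma_{ik}(Z'Z)^{-1}$$
Using Result 4.9, with $U$ any random vector and $A$ a fixed matrix, we have that $E[U'AU] = E[\mathrm{tr}(AUU')] = \mathrm{tr}[AE(UU')]$. Consequently, from the proof of Result 7.2,
$$E(\hat\varepsilon_{(i)}'\hat\varepsilon_{(k)}) = E\big(\varepsilon_{(i)}'(I - Z(Z'Z)^{-1}Z')\varepsilon_{(k)}\big) = \mathrm{tr}[(I - Z(Z'Z)^{-1}Z')\sigma_{ik} I] = \sigma_{ik}\,\mathrm{tr}[I - Z(Z'Z)^{-1}Z'] = \sigma_{ik}(n - r - 1)$$
as in the proof of Result 7.2. Dividing each entry $\hat\varepsilon_{(i)}'\hat\varepsilon_{(k)}$ of $\hat\varepsilon'\hat\varepsilon$ by $n - r - 1$, we obtain the unbiased estimator of $\Sigma$. Finally,
$$\mathrm{Cov}(\hat\beta_{(i)}, \hat\varepsilon_{(k)}) = E[(Z'Z)^{-1}Z'\varepsilon_{(i)}\varepsilon_{(k)}'(I - Z(Z'Z)^{-1}Z')] = (Z'Z)^{-1}Z'\sigma_{ik} I\,(I - Z(Z'Z)^{-1}Z') = \sigma_{ik}\big((Z'Z)^{-1}Z' - (Z'Z)^{-1}Z'\big) = 0$$
so each element of $\hat\beta$ is uncorrelated with each element of $\hat\varepsilon$. ■

The mean vectors and covariance matrices determined in Result 7.9 enable us to obtain the sampling properties of the least squares predictors.

We first consider the problem of estimating the mean vector when the predictor variables have the values $z_0' = [1, z_{01}, \ldots, z_{0r}]$. The mean of the ith response variable is $z_0'\beta_{(i)}$, and this is estimated by $z_0'\hat\beta_{(i)}$, the ith component of the fitted regression relationship. Collectively,
$$z_0'\hat\beta = [z_0'\hat\beta_{(1)} \mid z_0'\hat\beta_{(2)} \mid \cdots \mid z_0'\hat\beta_{(m)}] \qquad (7\text{-}37)$$
is an unbiased estimator of $z_0'\beta$ since $E(z_0'\hat\beta_{(i)}) = z_0'E(\hat\beta_{(i)}) = z_0'\beta_{(i)}$ for each component. From the covariance matrix for $\hat\beta_{(i)}$ and $\hat\beta_{(k)}$, the estimation errors $z_0'\beta_{(i)} - z_0'\hat\beta_{(i)}$ have covariances
$$E[z_0'(\beta_{(i)} - \hat\beta_{(i)})(\beta_{(k)} - \hat\beta_{(k)})'z_0] = z_0'\big(E(\beta_{(i)} - \hat\beta_{(i)})(\beta_{(k)} - \hat\beta_{(k)})'\big)z_0 = \sigma_{ik}\,z_0'(Z'Z)^{-1}z_0 \qquad (7\text{-}38)$$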
The related problem is that of forecasting a new observation vector $Y_0' = [Y_{01}, Y_{02}, \ldots, Y_{0m}]$ at $z_0$. According to the regression model, $Y_{0i} = z_0'\beta_{(i)} + \varepsilon_{0i}$, where the "new" error $\varepsilon_0' = [\varepsilon_{01}, \varepsilon_{02}, \ldots, \varepsilon_{0m}]$ is independent of the errors $\varepsilon$ and satisfies $E(\varepsilon_{0i}) = 0$ and $E(\varepsilon_{0i}\varepsilon_{0k}) = \sigma_{ik}$. The forecast error for the ith component of $Y_0$ is
$$Y_{0i} - z_0'\hat\beta_{(i)} = Y_{0i} - z_0'\beta_{(i)} + z_0'\beta_{(i)} - z_0'\hat\beta_{(i)} = \varepsilon_{0i} - z_0'(\hat\beta_{(i)} - \beta_{(i)})$$
so $E(Y_{0i} - z_0'\hat\beta_{(i)}) = E(\varepsilon_{0i}) - z_0'E(\hat\beta_{(i)} - \beta_{(i)}) = 0$, indicating that $z_0'\hat\beta_{(i)}$ is an unbiased predictor of $Y_{0i}$. The forecast errors have covariances
$$E(Y_{0i} - z_0'\hat\beta_{(i)})(Y_{0k} - z_0'\hat\beta_{(k)}) = E\big(\varepsilon_{0i} - z_0'(\hat\beta_{(i)} - \beta_{(i)})\big)\big(\varepsilon_{0k} - z_0'(\hat\beta_{(k)} - \beta_{(k)})\big)$$
$$= E(\varepsilon_{0i}\varepsilon_{0k}) + z_0'E(\hat\beta_{(i)} - \beta_{(i)})(\hat\beta_{(k)} - \beta_{(k)})'z_0 - z_0'E\big((\hat\beta_{(i)} - \beta_{(i)})\varepsilon_{0k}\big) - E\big(\varepsilon_{0i}(\hat\beta_{(k)} - \beta_{(k)})'\big)z_0$$
$$= \sigma_{ik}\big(1 + z_0'(Z'Z)^{-1}z_0\big) \qquad (7\text{-}39)$$
Note that $E((\hat\beta_{(i)} - \beta_{(i)})\varepsilon_{0k}) = 0$ since $\hat\beta_{(i)} = (Z'Z)^{-1}Z'\varepsilon_{(i)} + \beta_{(i)}$ is independent of $\varepsilon_0$. A similar result holds for $E(\varepsilon_{0i}(\hat\beta_{(k)} - \beta_{(k)})')$.

Maximum likelihood estimators and their distributions can be obtained when the errors $\varepsilon$ have a normal distribution.

Result 7.10. Let the multivariate multiple regression model in (7-26) hold with full rank$(Z) = r + 1$, $n \ge (r + 1) + m$, and let the errors $\varepsilon$ have a normal distribution. Then
$$\hat\beta = (Z'Z)^{-1}Z'Y$$
is the maximum likelihood estimator of $\beta$, and $\hat\beta$ has a normal distribution with $E(\hat\beta) = \beta$ and $\mathrm{Cov}(\hat\beta_{(i)}, \hat\beta_{(k)}) = \sigma_{ik}(Z'Z)^{-1}$. Also, $\hat\beta$ is independent of the maximum likelihood estimator of the positive definite $\Sigma$ given by
$$\hat\Sigma = \frac{1}{n}\,\hat\varepsilon'\hat\varepsilon = \frac{1}{n}\,(Y - Z\hat\beta)'(Y - Z\hat\beta)$$
and $n\hat\Sigma$ is distributed as $W_{p,\,n-r-1}(\Sigma)$.

Proof. According to the regression model, the likelihood is determined from the data $Y = [Y_1, Y_2, \ldots, Y_n]'$ whose rows are independent, with $Y_j$ distributed as $N_m(\beta'z_j, \Sigma)$. We first note that $Y - Z\beta = [Y_1 - \beta'z_1, Y_2 - \beta'z_2, \ldots, Y_n - \beta'z_n]'$, so
$$(Y - Z\beta)'(Y - Z\beta) = \sum_{j=1}^{n} (Y_j - \beta'z_j)(Y_j - \beta'z_j)'$$
and
$$\sum_{j=1}^{n} (Y_j - \beta'z_j)'\Sigma^{-1}(Y_j - \beta'z_j) = \sum_{j=1}^{n} \mathrm{tr}[(Y_j - \beta'z_j)'\Sigma^{-1}(Y_j - \beta'z_j)] = \sum_{j=1}^{n} \mathrm{tr}[\Sigma^{-1}(Y_j - \beta'z_j)(Y_j - \beta'z_j)'] = \mathrm{tr}[\Sigma^{-1}(Y - Z\beta)'(Y - Z\beta)] \qquad (7\text{-}40)$$
Another preliminary calculation will enable us to express the likelihood in a simple form. Since $\hat\varepsilon = Y - Z\hat\beta$ satisfies $Z'\hat\varepsilon = 0$ [see (7-32)],
$$(Y - Z\beta)'(Y - Z\beta) = [Y - Z\hat\beta + Z(\hat\beta - \beta)]'[Y - Z\hat\beta + Z(\hat\beta - \beta)] = (Y - Z\hat\beta)'(Y - Z\hat\beta) + (\hat\beta - \beta)'Z'Z(\hat\beta - \beta) = \hat\varepsilon'\hat\varepsilon + (\hat\beta - \beta)'Z'Z(\hat\beta - \beta) \qquad (7\text{-}41)$$
Using (7-40) and (7-41), we obtain the likelihood
$$L(\beta, \Sigma) = \prod_{j=1}^{n} \frac{1}{(2\pi)^{m/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(y_j - \beta'z_j)'\Sigma^{-1}(y_j - \beta'z_j)}$$
$$= \frac{1}{(2\pi)^{mn/2}|\Sigma|^{n/2}}\, e^{-\frac{1}{2}\mathrm{tr}[\Sigma^{-1}(\hat\varepsilon'\hat\varepsilon + (\hat\beta - \beta)'Z'Z(\hat\beta - \beta))]}$$
$$= \frac{1}{(2\pi)^{mn/2}|\Sigma|^{n/2}}\, e^{-\frac{1}{2}\mathrm{tr}[\Sigma^{-1}\hat\varepsilon'\hat\varepsilon] - \frac{1}{2}\mathrm{tr}[Z(\hat\beta - \beta)\Sigma^{-1}(\hat\beta - \beta)'Z']}$$
The matrix $Z(\hat\beta - \beta)\Sigma^{-1}(\hat\beta - \beta)'Z'$ is of the form $A'A$, with $A = \Sigma^{-1/2}(\hat\beta - \beta)'Z'$, and, from Exercise 2.16, it is nonnegative definite. Therefore, its eigenvalues are nonnegative also. Since, by Result 4.9, $\mathrm{tr}[Z(\hat\beta - \beta)\Sigma^{-1}(\hat\beta - \beta)'Z']$ is the sum of its eigenvalues, this trace will equal its minimum value, zero, if $\beta = \hat\beta$. This choice is unique because $Z$ is of full rank and $\hat\beta_{(i)} - \beta_{(i)} \ne 0$ implies that $Z(\hat\beta_{(i)} - \beta_{(i)}) \ne 0$, in which case $\mathrm{tr}[Z(\hat\beta - \beta)\Sigma^{-1}(\hat\beta - \beta)'Z'] \ge c'\Sigma^{-1}c > 0$, where $c'$ is any nonzero row of $Z(\hat\beta - \beta)$. Applying Result 4.10 with $B = \hat\varepsilon'\hat\varepsilon$, $b = n/2$, and $p = m$, we find that $\hat\beta$ and $\hat\Sigma = n^{-1}\hat\varepsilon'\hat\varepsilon$ are the maximum likelihood estimators of $\beta$ and $\Sigma$, respectively, and
$$L(\hat\beta, \hat\Sigma) = \frac{1}{(2\pi)^{mn/2}|\hat\Sigma|^{n/2}}\, e^{-nm/2} = \frac{n^{mn/2}}{(2\pi)^{mn/2}|\hat\varepsilon'\hat\varepsilon|^{n/2}}\, e^{-nm/2} \qquad (7\text{-}42)$$
It remains to establish the distributional results. From (7-36), we know that $\hat\beta_{(i)}$ and $\hat\varepsilon_{(i)}$ are linear combinations of the elements of $\varepsilon$. Specifically,
$$\hat\beta_{(i)} = (Z'Z)^{-1}Z'\varepsilon_{(i)} + \beta_{(i)}$$
$$\hat\varepsilon_{(i)} = [I - Z(Z'Z)^{-1}Z']\,\varepsilon_{(i)}, \quad i = 1, 2, \ldots, m$$
Therefore, by Result 4.3, $\hat\beta_{(1)}, \hat\beta_{(2)}, \ldots, \hat\beta_{(m)}, \hat\varepsilon_{(1)}, \hat\varepsilon_{(2)}, \ldots, \hat\varepsilon_{(m)}$ are jointly normal. Their mean vectors and covariance matrices are given in Result 7.9. Since $\hat\varepsilon$ and $\hat\beta$ have a zero covariance matrix, by Result 4.5 they are independent. Further, as in (7-13),
$$[I - Z(Z'Z)^{-1}Z'] = \sum_{\ell=1}^{n-r-1} e_\ell e_\ell'$$
where $e_\ell'e_k = 0$, $\ell \ne k$, and $e_\ell'e_\ell = 1$. Set
$$V_\ell = \varepsilon'e_\ell = [\varepsilon_{(1)}'e_\ell, \varepsilon_{(2)}'e_\ell, \ldots, \varepsilon_{(m)}'e_\ell]' = e_{\ell 1}\varepsilon_1 + e_{\ell 2}\varepsilon_2 + \cdots + e_{\ell n}\varepsilon_n$$
Because $V_\ell$, $\ell = 1, 2, \ldots, n - r - 1$, are linear combinations of the elements of $\varepsilon$, they have a joint normal distribution with $E(V_\ell) = E(\varepsilon')e_\ell = 0$. Also, by Result 4.8, $V_\ell$ and $V_k$ have covariance matrix $(e_\ell'e_k)\Sigma = (0)\Sigma = 0$ if $\ell \ne k$. Consequently, the $V_\ell$ are independently distributed as $N_m(0, \Sigma)$. Finally,
$$\hat\varepsilon'\hat\varepsilon = \varepsilon'[I - Z(Z'Z)^{-1}Z']\,\varepsilon = \sum_{\ell=1}^{n-r-1} \varepsilon'e_\ell e_\ell'\varepsilon = \sum_{\ell=1}^{n-r-1} V_\ell V_\ell'$$
which has the $W_{p,\,n-r-1}(\Sigma)$ distribution, by (4-22). ■
Result 7.10 provides additional support for using least squares estimates. When the errors are normally distributed, $\hat\beta$ and $n^{-1}\hat\varepsilon'\hat\varepsilon$ are the maximum likelihood estimators of $\beta$ and $\Sigma$, respectively. Therefore, for large samples, they have nearly the smallest possible variances.

Comment. The multivariate multiple regression model poses no new computational problems. Least squares (maximum likelihood) estimates, $\hat\beta_{(i)} = (Z'Z)^{-1}Z'y_{(i)}$, are computed individually for each response variable. Note, however, that the model requires that the same predictor variables be used for all responses.

Once a multivariate multiple regression model has been fit to the data, it should be subjected to the diagnostic checks described in Section 7.6 for the single-response model. The residual vectors $[\hat\varepsilon_{j1}, \hat\varepsilon_{j2}, \ldots, \hat\varepsilon_{jm}]$ can be examined for normality or outliers using the techniques in Section 4.6.

The remainder of this section is devoted to brief discussions of inference for the normal theory multivariate multiple regression model. Extended accounts of these procedures appear in [1] and [22].

Likelihood Ratio Tests for Regression Parameters
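The multiresponse analog of (7-15), the hypothesis that the responses do not depend on $z_{q+1}, z_{q+2}, \ldots, z_r$, becomes
$$H_0: \beta_{(2)} = 0 \quad \text{where} \quad \beta = \begin{bmatrix} \underset{((q+1) \times m)}{\beta_{(1)}} \\ \underset{((r-q) \times m)}{\beta_{(2)}} \end{bmatrix} \qquad (7\text{-}43)$$
Setting $Z = \Big[\underset{(n \times (q+1))}{Z_1} \,\Big|\, \underset{(n \times (r-q))}{Z_2}\Big]$, we can write the general model as
$$E(Y) = Z\beta = [Z_1 \mid Z_2] \begin{bmatrix} \beta_{(1)} \\ \beta_{(2)} \end{bmatrix} = Z_1\beta_{(1)} + Z_2\beta_{(2)}$$
Under $H_0: \beta_{(2)} = 0$, $Y = Z_1\beta_{(1)} + \varepsilon$ and the likelihood ratio test of $H_0$ is based on the quantities involved in the

extra sum of squares and cross products
$$= (Y - Z_1\hat\beta_{(1)})'(Y - Z_1\hat\beta_{(1)}) - (Y - Z\hat\beta)'(Y - Z\hat\beta) = n(\hat\Sigma_1 - \hat\Sigma)$$
where $\hat\beta_{(1)} = (Z_1'Z_1)^{-1}Z_1'Y$ and $\hat\Sigma_1 = n^{-1}(Y - Z_1\hat\beta_{(1)})'(Y - Z_1\hat\beta_{(1)})$.

From (7-42), the likelihood ratio, $\Lambda$, can be expressed in terms of generalized variances:
$$\Lambda = \frac{\max_{\beta_{(1)}, \Sigma} L(\beta_{(1)}, \Sigma)}{\max_{\beta, \Sigma} L(\beta, \Sigma)} = \frac{L(\hat\beta_{(1)}, \hat\Sigma_1)}{L(\hat\beta, \hat\Sigma)} = \left(\frac{|\hat\Sigma|}{|\hat\Sigma_1|}\right)^{n/2} \qquad (7\text{-}44)$$
Equivalently, Wilks' lambda statistic
$$\Lambda^{2/n} = \frac{|\hat\Sigma|}{|\hat\Sigma_1|}$$
can be used.

Result 7.11. Let the multivariate multiple regression model of (7-26) hold with $Z$ of full rank r + 1 and $(r + 1) + m \le n$. Let the errors $\varepsilon$ be normally distributed. Under $H_0: \beta_{(2)} = 0$, $n\hat\Sigma$ is distributed as $W_{p,\,n-r-1}(\Sigma)$ independently of $n(\hat\Sigma_1 - \hat\Sigma)$ which, in turn, is distributed as $W_{p,\,r-q}(\Sigma)$. The likelihood ratio test of $H_0$ is equivalent to rejecting $H_0$ for large values of
$$-2\ln\Lambda = -n\ln\left(\frac{|\hat\Sigma|}{|\hat\Sigma_1|}\right) = -n\ln\frac{|n\hat\Sigma|}{|n\hat\Sigma + n(\hat\Sigma_1 - \hat\Sigma)|}$$
For n large,$^5$ the modified statistic
$$-\Big[n - r - 1 - \tfrac{1}{2}(m - r + q + 1)\Big]\ln\left(\frac{|\hat\Sigma|}{|\hat\Sigma_1|}\right)$$
has, to a close approximation, a chi-square distribution with $m(r - q)$ d.f.

Proof. (See Supplement 7A.) ■

If $Z$ is not of full rank, but has rank $r_1 + 1$, then $\hat\beta = (Z'Z)^{-}Z'Y$, where $(Z'Z)^{-}$ is the generalized inverse discussed in [19]. (See also Exercise 7.6.) The distributional conclusions stated in Result 7.11 remain the same, provided that r is replaced by $r_1$ and q + 1 by rank$(Z_1)$. However, not all hypotheses concerning $\beta$ can be tested due to the lack of uniqueness in the identification of $\beta$ caused by the linear dependencies among the columns of $Z$.

$^5$ Technically, both n - r and n - m should also be large to obtain a good chi-square approximation.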
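Wilks' lambda and the modified statistic of Result 7.11 can be illustrated with the data of Example 7.8, testing the hypothesis that neither response depends on $z_1$ (so q = 0 and $Z_1$ is the column of ones). This is a minimal sketch, assuming NumPy is available; the function name `sigma_hat` is illustrative. With n = 5 the chi-square approximation is not to be trusted; the point is only to show the arithmetic. The value of Wilks' lambda can be compared with the SAS output in Panel 7.2.

```python
import numpy as np

# Data of Example 7.8
z1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([[1.0, -1.0],
              [4.0, -1.0],
              [3.0, 2.0],
              [8.0, 3.0],
              [9.0, 2.0]])
n, m = Y.shape
r, q = 1, 0                       # full model has r = 1 predictor; H0 keeps q = 0

Z_full = np.column_stack([np.ones(n), z1])
Z_reduced = np.ones((n, 1))       # Z1: intercept only


def sigma_hat(Z, Y):
    """Maximum likelihood estimate of Sigma: residual SS and CP matrix over n."""
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    E = Y - Z @ B
    return E.T @ E / Y.shape[0]


Sigma = sigma_hat(Z_full, Y)          # under the full model
Sigma1 = sigma_hat(Z_reduced, Y)      # under H0: beta_(2) = 0

wilks = np.linalg.det(Sigma) / np.linalg.det(Sigma1)    # Lambda^(2/n)
modified_stat = -(n - r - 1 - 0.5 * (m - r + q + 1)) * np.log(wilks)
df = m * (r - q)

print("Wilks' lambda:", wilks)
print("modified statistic:", modified_stat, "with", df, "d.f.")
```

The modified statistic would be referred to a chi-square distribution with $m(r - q)$ degrees of freedom when n is large; for small samples an exact F statistic, as reported by SAS, is the safer reference.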
A2 > > Aq+ l > 0 Z 1 (Z1Z 1 )  1 Z1 q + 1 Z 1b 1. Z 1 (Z1Z 1 ) 1 Z1 , ···
1.
g e=
so
the
1, 2, . . . , q + 1,
comb i
Z1 . (216), h a ve 1 q+ (Z (Z' Z) 1 Z' ) Z = Z, Z 1 (Z1Z 1 )  1 Z1 = L g g . see €= 1 Zb = g Z (Z'Z)1Z' r+ 1 Z (Z' Z)  1 Z' = L g g . A = 1, €= 1 PZ = [ I  Z (Z' Z)  1 Z' ] Z = Z  Z = 0 g = Zbh e r + 1, P A = 0. e > r + 1, Z' g = 0, Pg = g . P nr1 n (216), P = L g g t'=r+2 n n n i = e' Pe = L (e' g e ) (e'g e ) ' = L €= r +2 €= r +2 Vjk) = E(g€e u ) e(k ) gj ) = oi kgegj = 0, f * e ' g£ I). = . . . , Ve i , (422), n i Wp, n  r  1 ( I). Pl gc = g ee > qq ++ 11 n P1 = L g g € . €=q+2 r+ 1 r+ 1 n(I 1  I) = e' (P1  P)e = L (e' g e ) (e' g e )' = L €=q+2 €=q+2 I). (422), n(I 1  i) Wp, r  q ( I) n(I 1  i ) ni ,
0. Then the ith principal compo ( A.2 , e 2 ) , . . . , (A. p , e p ) where A 1 > A2 nent is given by i = 1, 2, . . . ' p r: = e;x = ei l xl + ei 2 x2 + . . . + ei p xp , ( 8 4 ) With these choices, = A·l i = 1, 2, . . . ' p Var (Y)l = e��e l .... l Cov ( I: , Yk ) = e;Iek = 0 i#k (8 5) If some Ai are equal, the choices of the corresponding coefficient vectors, e z , an d hence I: , are not unique. . • .
>
· · ·
>
Proof. We know from (251), with B = I, that
a' Ia (attained when a = e 1 ) max , = A 1 a*O a a But e1 e 1 = 1 since the eigenvectors are normalized. Thus, e1Ie a'Ia max , = A1 = , 1 = e1Ie 1 = Var ( Yi ) a*O a a elel Similarly, using (252), we get
For the choice a = e k + l ' with ek +l ei = 0, for i = 1, 2, . . . , k and k = 1, 2, . . . , p ek +l i e k +l fek +l ek + l = ek + l i e k +l = Var (Yk +l )

1,
But ek + 1 (Ie k + l ) = "k + l ek +l e k +l = " k +l so Var (Yk + l ) = "k + l · It remains to show that e i perpendicular to e k (that is, e;e k = 0, i =I= k) gives Cov ( I: , Yk) = 0. Now, the eigenvectors of I are orthogonal if all the eigenvalues A 1 , A2 , . . , AP are distinct. If the eigenvalues are not all distinct, the eigenvectors corresponding to common eigen values may be chosen to be orthogonal. Therefore, for any two eigenvectors e i and e k , eie k = 0, i # k. Since Ie k = Ak e k , premultiplication by e; gives Cov ( I: , Yk ) = e;Ie k = eiA. k e k = A. k eie k = 0 for any i # k, and the proof is complete. • .
From Result 8.1, the principal components are uncorrelated and have variances equal to the eigf�nvalues of I.
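Result 8.2. Let $\mathbf{X}' = [X_1, X_2, \ldots, X_p]$ have covariance matrix $\boldsymbol{\Sigma}$, with eigenvalue-eigenvector pairs $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Let $Y_1 = \mathbf{e}_1'\mathbf{X}$, $Y_2 = \mathbf{e}_2'\mathbf{X}$, $\ldots$, $Y_p = \mathbf{e}_p'\mathbf{X}$ be the principal components. Then

$$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \sum_{i=1}^{p} \operatorname{Var}(X_i) = \lambda_1 + \lambda_2 + \cdots + \lambda_p = \sum_{i=1}^{p} \operatorname{Var}(Y_i)$$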
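Proof. From Definition 2A.28, $\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \operatorname{tr}(\boldsymbol{\Sigma})$. From (2-20) with $\mathbf{A} = \boldsymbol{\Sigma}$, we can write $\boldsymbol{\Sigma} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ where $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues and $\mathbf{P} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p]$ so that $\mathbf{P}\mathbf{P}' = \mathbf{P}'\mathbf{P} = \mathbf{I}$. Using Result 2A.12(c), we have

$$\operatorname{tr}(\boldsymbol{\Sigma}) = \operatorname{tr}(\mathbf{P}\boldsymbol{\Lambda}\mathbf{P}') = \operatorname{tr}(\boldsymbol{\Lambda}\mathbf{P}'\mathbf{P}) = \operatorname{tr}(\boldsymbol{\Lambda}) = \lambda_1 + \lambda_2 + \cdots + \lambda_p$$

Thus,

$$\sum_{i=1}^{p} \operatorname{Var}(X_i) = \operatorname{tr}(\boldsymbol{\Sigma}) = \operatorname{tr}(\boldsymbol{\Lambda}) = \sum_{i=1}^{p} \operatorname{Var}(Y_i)$$

■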
Result 8.2 says that

$$\text{Total population variance} = \sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = \lambda_1 + \lambda_2 + \cdots + \lambda_p \tag{8-6}$$

and consequently, the proportion of total variance due to (explained by) the kth principal component is

$$\begin{pmatrix} \text{Proportion of total population} \\ \text{variance due to kth principal component} \end{pmatrix} = \frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p}, \qquad k = 1, 2, \ldots, p \tag{8-7}$$
If most (for instance, 80 to 90%) of the total population variance, for large p, can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.

Each component of the coefficient vector $\mathbf{e}_i' = [e_{i1}, \ldots, e_{ik}, \ldots, e_{ip}]$ also merits inspection. The magnitude of $e_{ik}$ measures the importance of the kth variable to the ith principal component, irrespective of the other variables. In particular, $e_{ik}$ is proportional to the correlation coefficient between $Y_i$ and $X_k$.
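Result 8.3. If $Y_1 = \mathbf{e}_1'\mathbf{X}$, $Y_2 = \mathbf{e}_2'\mathbf{X}$, $\ldots$, $Y_p = \mathbf{e}_p'\mathbf{X}$ are the principal components obtained from the covariance matrix $\boldsymbol{\Sigma}$, then

$$\rho_{Y_i, X_k} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i, k = 1, 2, \ldots, p \tag{8-8}$$

are the correlation coefficients between the components $Y_i$ and the variables $X_k$. Here $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ are the eigenvalue-eigenvector pairs for $\boldsymbol{\Sigma}$.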
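Proof. Set $\mathbf{a}_k' = [0, \ldots, 0, 1, 0, \ldots, 0]$ so that $X_k = \mathbf{a}_k'\mathbf{X}$ and $\operatorname{Cov}(X_k, Y_i) = \operatorname{Cov}(\mathbf{a}_k'\mathbf{X}, \mathbf{e}_i'\mathbf{X}) = \mathbf{a}_k'\boldsymbol{\Sigma}\mathbf{e}_i$, according to (2-45). Since $\boldsymbol{\Sigma}\mathbf{e}_i = \lambda_i\mathbf{e}_i$, $\operatorname{Cov}(X_k, Y_i) = \mathbf{a}_k'\lambda_i\mathbf{e}_i = \lambda_i e_{ik}$. Then $\operatorname{Var}(Y_i) = \lambda_i$ [see (8-5)] and $\operatorname{Var}(X_k) = \sigma_{kk}$ yield

$$\rho_{Y_i, X_k} = \frac{\operatorname{Cov}(Y_i, X_k)}{\sqrt{\operatorname{Var}(Y_i)}\sqrt{\operatorname{Var}(X_k)}} = \frac{\lambda_i e_{ik}}{\sqrt{\lambda_i}\sqrt{\sigma_{kk}}} = \frac{e_{ik}\sqrt{\lambda_i}}{\sqrt{\sigma_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$

■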
Although the correlations of the variables with the principal components often help to interpret the components, they measure only the univariate contribution of an individual X to a component Y. That is, they do not indicate the importance of an X to a component Y in the presence of the other X's. For this reason, some statisticians (see, for example, Rencher [17]) recommend that only the coefficients $e_{ik}$, and not the correlations, be used to interpret the components. Although the coefficients and the correlations can lead to different rankings as measures of the importance of the variables to a given component, it is our experience that these rankings are often not appreciably different. In practice, variables with relatively large coefficients (in absolute value) tend to have relatively large correlations, so the two measures of importance, the first multivariate and the second univariate, frequently give similar results. We recommend that both the coefficients and the correlations be examined to help interpret the principal components.

The following hypothetical example illustrates the contents of Results 8.1, 8.2, and 8.3.

Example 8.1
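(Calculating the population principal components)

Suppose the random variables $X_1$, $X_2$ and $X_3$ have the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$$

It may be verified that the eigenvalue-eigenvector pairs are

$$\begin{aligned}
\lambda_1 &= 5.83, & \mathbf{e}_1' &= [.383,\ -.924,\ 0] \\
\lambda_2 &= 2.00, & \mathbf{e}_2' &= [0,\ 0,\ 1] \\
\lambda_3 &= 0.17, & \mathbf{e}_3' &= [.924,\ .383,\ 0]
\end{aligned}$$

Therefore, the principal components become

$$\begin{aligned}
Y_1 &= \mathbf{e}_1'\mathbf{X} = .383X_1 - .924X_2 \\
Y_2 &= \mathbf{e}_2'\mathbf{X} = X_3 \\
Y_3 &= \mathbf{e}_3'\mathbf{X} = .924X_1 + .383X_2
\end{aligned}$$

The variable $X_3$ is one of the principal components, because it is uncorrelated with the other two variables.

Equation (8-5) can be demonstrated from first principles. For example,

$$\begin{aligned}
\operatorname{Var}(Y_1) &= \operatorname{Var}(.383X_1 - .924X_2) \\
&= (.383)^2\operatorname{Var}(X_1) + (-.924)^2\operatorname{Var}(X_2) + 2(.383)(-.924)\operatorname{Cov}(X_1, X_2) \\
&= .147(1) + .854(5) - .708(-2) \\
&= 5.83 = \lambda_1 \\
\operatorname{Cov}(Y_1, Y_2) &= \operatorname{Cov}(.383X_1 - .924X_2,\ X_3) \\
&= .383\operatorname{Cov}(X_1, X_3) - .924\operatorname{Cov}(X_2, X_3) \\
&= .383(0) - .924(0) = 0
\end{aligned}$$

It is also readily apparent that

$$\sigma_{11} + \sigma_{22} + \sigma_{33} = 1 + 5 + 2 = \lambda_1 + \lambda_2 + \lambda_3 = 5.83 + 2.00 + .17$$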
validating Equation (8-6) for this example. The proportion of total variance accounted for by the first principal component is $\lambda_1/(\lambda_1 + \lambda_2 + \lambda_3) = 5.83/8 = .73$. Further, the first two components account for a proportion $(5.83 + 2)/8 = .98$ of the population variance. In this case, the components $Y_1$ and $Y_2$ could replace the original three variables with little loss of information.

Next, using (8-8), we obtain

$$\begin{aligned}
\rho_{Y_1, X_1} &= \frac{e_{11}\sqrt{\lambda_1}}{\sqrt{\sigma_{11}}} = \frac{.383\sqrt{5.83}}{\sqrt{1}} = .925 \\
\rho_{Y_1, X_2} &= \frac{e_{12}\sqrt{\lambda_1}}{\sqrt{\sigma_{22}}} = \frac{-.924\sqrt{5.83}}{\sqrt{5}} = -.998
\end{aligned}$$

Notice here that the variable $X_2$, with coefficient $-.924$, receives the greatest weight in the component $Y_1$. It also has the largest correlation (in absolute value) with $Y_1$. The correlation of $X_1$ with $Y_1$, .925, is almost as large as that for $X_2$, indicating that the variables are about equally important to the first principal component. The relative sizes of the coefficients of $X_1$ and $X_2$ suggest, however, that $X_2$ contributes more to the determination of $Y_1$ than does $X_1$. Since, in this case, both coefficients are reasonably large and they have opposite signs, we would argue that both variables aid in the interpretation of $Y_1$.

Finally,

$$\rho_{Y_2, X_1} = \rho_{Y_2, X_2} = 0 \quad \text{and} \quad \rho_{Y_2, X_3} = \frac{\sqrt{\lambda_2}}{\sqrt{\sigma_{33}}} = \frac{\sqrt{2}}{\sqrt{2}} = 1 \quad (\text{as it should})$$

The remaining correlations can be neglected, since the third component is unimportant. ■

It is informative to consider principal components derived from multivariate normal random variables. Suppose $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. We know from (4-7) that the density of $\mathbf{X}$ is constant on the $\boldsymbol{\mu}$-centered ellipsoids

$$(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$$

which have axes $\pm c\sqrt{\lambda_i}\,\mathbf{e}_i$, $i = 1, 2, \ldots, p$, where the $(\lambda_i, \mathbf{e}_i)$ are the eigenvalue-eigenvector pairs of $\boldsymbol{\Sigma}$. A point lying on the ith axis of the ellipsoid will have coordinates proportional to $\mathbf{e}_i' = [e_{i1}, e_{i2}, \ldots, e_{ip}]$ in the coordinate system that has origin $\boldsymbol{\mu}$ and axes that are parallel to the original axes $x_1, x_2, \ldots, x_p$. It will be convenient to set $\boldsymbol{\mu} = \mathbf{0}$ in the argument that follows.¹

From our discussion in Section 2.3 with $\mathbf{A} = \boldsymbol{\Sigma}^{-1}$, we can write

$$c^2 = \mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = \frac{1}{\lambda_1}(\mathbf{e}_1'\mathbf{x})^2 + \frac{1}{\lambda_2}(\mathbf{e}_2'\mathbf{x})^2 + \cdots + \frac{1}{\lambda_p}(\mathbf{e}_p'\mathbf{x})^2$$

where $\mathbf{e}_1'\mathbf{x}, \mathbf{e}_2'\mathbf{x}, \ldots, \mathbf{e}_p'\mathbf{x}$ are recognized as the principal components of $\mathbf{x}$. Setting $y_1 = \mathbf{e}_1'\mathbf{x}$, $y_2 = \mathbf{e}_2'\mathbf{x}$, $\ldots$, $y_p = \mathbf{e}_p'\mathbf{x}$, we have

$$c^2 = \frac{1}{\lambda_1}y_1^2 + \frac{1}{\lambda_2}y_2^2 + \cdots + \frac{1}{\lambda_p}y_p^2$$

and this equation defines an ellipsoid (since $\lambda_1, \lambda_2, \ldots, \lambda_p$ are positive) in a coordinate system with axes $y_1, y_2, \ldots, y_p$ lying in the directions of $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_p$, respectively. If $\lambda_1$ is the largest eigenvalue, then the major axis lies in the direction $\mathbf{e}_1$. The remaining minor axes lie in the directions defined by $\mathbf{e}_2, \ldots, \mathbf{e}_p$.

To summarize, the principal components $y_1 = \mathbf{e}_1'\mathbf{x}$, $y_2 = \mathbf{e}_2'\mathbf{x}$, $\ldots$, $y_p = \mathbf{e}_p'\mathbf{x}$ lie in the directions of the axes of a constant density ellipsoid. Therefore, any point on the ith ellipsoid axis has $\mathbf{x}$ coordinates proportional to $\mathbf{e}_i' = [e_{i1}, e_{i2}, \ldots, e_{ip}]$ and, necessarily, principal component coordinates of the form $[0, \ldots, 0, y_i, 0, \ldots, 0]$.

When $\boldsymbol{\mu} \ne \mathbf{0}$, it is the mean-centered principal component $y_i = \mathbf{e}_i'(\mathbf{x} - \boldsymbol{\mu})$ that has mean 0 and lies in the direction $\mathbf{e}_i$.

A constant density ellipse and the principal components for a bivariate normal random vector with $\boldsymbol{\mu} = \mathbf{0}$ and $\rho_{12} = .75$ are shown in Figure 8.1. We see that the principal components are obtained by rotating the original coordinate axes through an angle $\theta$ until they coincide with the axes of the constant density ellipse. This result holds for $p > 2$ dimensions as well.

Figure 8.1 The constant density ellipse $\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x} = c^2$ and the principal components $y_1$, $y_2$ for a bivariate normal random vector $\mathbf{X}$ having mean $\mathbf{0}$ ($\boldsymbol{\mu} = \mathbf{0}$, $\rho = .75$).

¹ This can be done without loss of generality because the normal random vector $\mathbf{X}$ can always be translated to the normal random vector $\mathbf{W} = \mathbf{X} - \boldsymbol{\mu}$ and $E(\mathbf{W}) = \mathbf{0}$. However, $\operatorname{Cov}(\mathbf{X}) = \operatorname{Cov}(\mathbf{W})$.
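Returning to Example 8.1, the arithmetic behind (8-5), (8-7), and (8-8) can be checked with a short script. The following is only a sketch in plain Python; the helper names (`quad_form`, `variance`, `covariance`, `proportion`, `correlation`) are ours, and the eigenpairs are the three-figure values quoted in the example, so agreement holds only up to rounding.

```python
import math

# Covariance matrix of Example 8.1
SIGMA = [[1.0, -2.0, 0.0],
         [-2.0, 5.0, 0.0],
         [0.0, 0.0, 2.0]]

# Eigenvalue-eigenvector pairs as quoted (rounded) in the example
PAIRS = [(5.83, [0.383, -0.924, 0.0]),
         (2.00, [0.0, 0.0, 1.0]),
         (0.17, [0.924, 0.383, 0.0])]


def quad_form(a, matrix, b):
    """Return a' M b for vectors and a matrix stored as plain lists."""
    return sum(a[i] * matrix[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))


def variance(i):
    """Var(Y_i) = e_i' Sigma e_i, which should be close to lambda_i (8-5)."""
    e = PAIRS[i][1]
    return quad_form(e, SIGMA, e)


def covariance(i, k):
    """Cov(Y_i, Y_k) = e_i' Sigma e_k, which should be close to 0 for i != k."""
    return quad_form(PAIRS[i][1], SIGMA, PAIRS[k][1])


def proportion(k):
    """Proportion of total population variance due to component k (8-7)."""
    total = sum(SIGMA[i][i] for i in range(len(SIGMA)))
    return PAIRS[k][0] / total


def correlation(i, k):
    """rho_{Y_i, X_k} = e_ik * sqrt(lambda_i) / sqrt(sigma_kk) (8-8)."""
    lam, e = PAIRS[i]
    return e[k] * math.sqrt(lam) / math.sqrt(SIGMA[k][k])


print("Var(Y1)       =", round(variance(0), 3))
print("Cov(Y1, Y2)   =", round(covariance(0, 1), 3))
print("proportion(1) =", round(proportion(0), 3))
print("rho(Y1, X1)   =", round(correlation(0, 0), 3))
print("rho(Y1, X2)   =", round(correlation(0, 1), 3))
```

Because the eigenvectors are entered to three figures, quantities such as $\operatorname{Cov}(Y_1, Y_3)$ come out near zero rather than exactly zero.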
Principal Components Obtained from Standardized Variables

Principal components may also be obtained for the standardized variables

$$\begin{aligned}
Z_1 &= \frac{(X_1 - \mu_1)}{\sqrt{\sigma_{11}}} \\
Z_2 &= \frac{(X_2 - \mu_2)}{\sqrt{\sigma_{22}}} \\
&\ \ \vdots \\
Z_p &= \frac{(X_p - \mu_p)}{\sqrt{\sigma_{pp}}}
\end{aligned} \tag{8-9}$$

In matrix notation,

$$\mathbf{Z} = (\mathbf{V}^{1/2})^{-1}(\mathbf{X} - \boldsymbol{\mu}) \tag{8-10}$$

where the diagonal standard deviation matrix $\mathbf{V}^{1/2}$ is defined in (2-35). Clearly, $E(\mathbf{Z}) = \mathbf{0}$ and

$$\operatorname{Cov}(\mathbf{Z}) = (\mathbf{V}^{1/2})^{-1}\boldsymbol{\Sigma}(\mathbf{V}^{1/2})^{-1} = \boldsymbol{\rho}$$

by (2-37). The principal components of $\mathbf{Z}$ may be obtained from the eigenvectors of the correlation matrix $\boldsymbol{\rho}$ of $\mathbf{X}$. All our previous results apply, with some simplifications, since the variance of each $Z_i$ is unity. We shall continue to use the notation $Y_i$ to refer to the ith principal component and $(\lambda_i, \mathbf{e}_i)$ for the eigenvalue-eigenvector pair from either $\boldsymbol{\rho}$ or $\boldsymbol{\Sigma}$. However, the $(\lambda_i, \mathbf{e}_i)$ derived from $\boldsymbol{\Sigma}$ are, in general, not the same as the ones derived from $\boldsymbol{\rho}$.
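Result 8.4. The ith principal component of the standardized variables $\mathbf{Z}' = [Z_1, Z_2, \ldots, Z_p]$ with $\operatorname{Cov}(\mathbf{Z}) = \boldsymbol{\rho}$, is given by

$$Y_i = \mathbf{e}_i'\mathbf{Z} = \mathbf{e}_i'(\mathbf{V}^{1/2})^{-1}(\mathbf{X} - \boldsymbol{\mu}), \qquad i = 1, 2, \ldots, p$$

Moreover,

$$\sum_{i=1}^{p} \operatorname{Var}(Y_i) = \sum_{i=1}^{p} \operatorname{Var}(Z_i) = p \tag{8-11}$$

and

$$\rho_{Y_i, Z_k} = e_{ik}\sqrt{\lambda_i}, \qquad i, k = 1, 2, \ldots, p$$

In this case, $(\lambda_1, \mathbf{e}_1), (\lambda_2, \mathbf{e}_2), \ldots, (\lambda_p, \mathbf{e}_p)$ are the eigenvalue-eigenvector pairs for $\boldsymbol{\rho}$, with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$.

Proof. Result 8.4 follows from Results 8.1, 8.2, and 8.3, with $Z_1, Z_2, \ldots, Z_p$ in place of $X_1, X_2, \ldots, X_p$ and $\boldsymbol{\rho}$ in place of $\boldsymbol{\Sigma}$. ■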
We see from (8-11) that the total (standardized variables) population variance is simply p, the sum of the diagonal elements of the matrix $\boldsymbol{\rho}$. Using (8-7) with $\mathbf{Z}$ in place of $\mathbf{X}$, we find that the proportion of total variance explained by the kth principal component of $\mathbf{Z}$ is

$$\begin{pmatrix} \text{Proportion of (standardized)} \\ \text{population variance due} \\ \text{to kth principal component} \end{pmatrix} = \frac{\lambda_k}{p}, \qquad k = 1, 2, \ldots, p \tag{8-12}$$

where the $\lambda_k$'s are the eigenvalues of $\boldsymbol{\rho}$.

Example 8.2 (Principal components obtained from covariance and correlation matrices are different)

Consider the covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 4 \\ 4 & 100 \end{bmatrix}$$
and the derived correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & .4 \\ .4 & 1 \end{bmatrix}$$

The eigenvalue-eigenvector pairs from $\boldsymbol{\Sigma}$ are

$$\begin{aligned}
\lambda_1 &= 100.16, & \mathbf{e}_1' &= [.040,\ .999] \\
\lambda_2 &= .84, & \mathbf{e}_2' &= [.999,\ -.040]
\end{aligned}$$

Similarly, the eigenvalue-eigenvector pairs from $\boldsymbol{\rho}$ are

$$\begin{aligned}
\lambda_1 &= 1 + \rho = 1.4, & \mathbf{e}_1' &= [.707,\ .707] \\
\lambda_2 &= 1 - \rho = .6, & \mathbf{e}_2' &= [.707,\ -.707]
\end{aligned}$$

The respective principal components become

$$\boldsymbol{\Sigma}: \qquad \begin{aligned}
Y_1 &= .040X_1 + .999X_2 \\
Y_2 &= .999X_1 - .040X_2
\end{aligned}$$

and

$$\boldsymbol{\rho}: \qquad \begin{aligned}
Y_1 &= .707Z_1 + .707Z_2 = .707\left(\frac{X_1 - \mu_1}{1}\right) + .707\left(\frac{X_2 - \mu_2}{10}\right) \\
&= .707(X_1 - \mu_1) + .0707(X_2 - \mu_2) \\
Y_2 &= .707Z_1 - .707Z_2 = .707\left(\frac{X_1 - \mu_1}{1}\right) - .707\left(\frac{X_2 - \mu_2}{10}\right) \\
&= .707(X_1 - \mu_1) - .0707(X_2 - \mu_2)
\end{aligned}$$

Because of its large variance, $X_2$ completely dominates the first principal component determined from $\boldsymbol{\Sigma}$. Moreover, this first principal component explains a proportion

$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{100.16}{101} = .992$$

of the total population variance.

When the variables $X_1$ and $X_2$ are standardized, however, the resulting variables contribute equally to the principal components determined from $\boldsymbol{\rho}$. Using Result 8.4, we obtain

$$\rho_{Y_1, Z_1} = e_{11}\sqrt{\lambda_1} = .707\sqrt{1.4} = .837$$

and

$$\rho_{Y_1, Z_2} = e_{12}\sqrt{\lambda_1} = .707\sqrt{1.4} = .837$$

In this case, the first principal component explains a proportion

$$\frac{\lambda_1}{p} = \frac{1.4}{2} = .7$$

of the total (standardized) population variance.

Most strikingly, we see that the relative importance of the variables to, for instance, the first principal component is greatly affected by the standardization.
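The eigenpairs quoted in Example 8.2 can be reproduced from the closed-form solution for a symmetric 2 × 2 matrix. The sketch below uses only the Python standard library; the function name `eig_sym_2x2` and the variable names are ours, and the formula assumes a nonzero off-diagonal element.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenpairs of [[a, b], [b, d]] with b != 0, largest eigenvalue first.

    Eigenvalues are (a + d)/2 +/- sqrt(((a - d)/2)^2 + b^2); for each
    eigenvalue lam, the vector (b, lam - a) solves (a - lam) v1 + b v2 = 0.
    """
    mid = (a + d) / 2.0
    half_gap = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    pairs = []
    for lam in (mid + half_gap, mid - half_gap):
        v1, v2 = b, lam - a
        norm = math.hypot(v1, v2)
        pairs.append((lam, (v1 / norm, v2 / norm)))  # unit-length eigenvector
    return pairs


# Covariance matrix of Example 8.2 and the correlation matrix derived from it
s11, s12, s22 = 1.0, 4.0, 100.0
r12 = s12 / math.sqrt(s11 * s22)

cov_pairs = eig_sym_2x2(s11, s12, s22)
cor_pairs = eig_sym_2x2(1.0, r12, 1.0)

# Weights of the first correlation-based component when written in terms of
# the centered original variables: e_1k / sqrt(sigma_kk)
weights_on_x = (cor_pairs[0][1][0] / math.sqrt(s11),
                cor_pairs[0][1][1] / math.sqrt(s22))

for label, pairs in (("Sigma", cov_pairs), ("rho", cor_pairs)):
    for lam, e in pairs:
        print(label, round(lam, 2), [round(x, 3) for x in e])
print("weights on X:", [round(w, 4) for w in weights_on_x])
```

An eigenvector is determined only up to sign; the choice $(b, \lambda - a)$ happens to match the signs used in the example.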
When the first principal component obtained from $\boldsymbol{\rho}$ is expressed in terms of $X_1$ and $X_2$, the relative magnitudes of the weights .707 and .0707 are in direct opposition to those of the weights .040 and .999 attached to these variables in the principal component obtained from $\boldsymbol{\Sigma}$. ■

The preceding example demonstrates that the principal components derived from $\boldsymbol{\Sigma}$ are different from those derived from $\boldsymbol{\rho}$. Furthermore, one set of principal components is not a simple function of the other. This suggests that the standardization is not inconsequential.

Variables should probably be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate. For example, if $X_1$ represents annual sales in the \$10,000 to \$350,000 range and $X_2$ is the ratio (net annual income)/(total assets) that falls in the .01 to .60 range, then the total variation will be due almost exclusively to dollar sales. In this case, we would expect a single (important) principal component with a heavy weighting of $X_1$. Alternatively, if both variables are standardized, their subsequent magnitudes will be of the same order, and $X_2$ (or $Z_2$) will play a larger role in the construction of the principal components. This behavior was observed in Example 8.2.

Principal Components for Covariance Matrices with Special Structures
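There are certain patterned covariance and correlation matrices whose principal components can be expressed in simple forms. Suppose $\boldsymbol{\Sigma}$ is the diagonal matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix} \tag{8-13}$$

Setting $\mathbf{e}_i' = [0, \ldots, 0, 1, 0, \ldots, 0]$, with 1 in the ith position, we observe that

$$\begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix} \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1\sigma_{ii} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad \text{or} \qquad \boldsymbol{\Sigma}\mathbf{e}_i = \sigma_{ii}\mathbf{e}_i$$

and we conclude that $(\sigma_{ii}, \mathbf{e}_i)$ is the ith eigenvalue-eigenvector pair. Since the linear combination $\mathbf{e}_i'\mathbf{X} = X_i$, the set of principal components is just the original set of uncorrelated random variables.

For a covariance matrix with the pattern of (8-13), nothing is gained by extracting the principal components. From another point of view, if $\mathbf{X}$ is distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the contours of constant density are ellipsoids whose axes already lie in the directions of maximum variation. Consequently, there is no need to rotate the coordinate system.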
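Standardization does not substantially alter the situation for the $\boldsymbol{\Sigma}$ in (8-13). In that case, $\boldsymbol{\rho} = \mathbf{I}$, the $p \times p$ identity matrix. Clearly, $\boldsymbol{\rho}\mathbf{e}_i = 1\mathbf{e}_i$, so the eigenvalue 1 has multiplicity p and $\mathbf{e}_i' = [0, \ldots, 0, 1, 0, \ldots, 0]$, $i = 1, 2, \ldots, p$, are convenient choices for the eigenvectors. Consequently, the principal components determined from $\boldsymbol{\rho}$ are also the original variables $Z_1, \ldots, Z_p$. Moreover, in this case of equal eigenvalues, the multivariate normal ellipsoids of constant density are spheroids.

Another patterned covariance matrix, which often describes the correspondence among certain biological variables such as the sizes of living things, has the general form

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma^2 & \rho\sigma^2 & \cdots & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \cdots & \rho\sigma^2 \\ \vdots & \vdots & \ddots & \vdots \\ \rho\sigma^2 & \rho\sigma^2 & \cdots & \sigma^2 \end{bmatrix} \tag{8-14}$$

The resulting correlation matrix

$$\boldsymbol{\rho} = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix} \tag{8-15}$$

is also the covariance matrix of the standardized variables. The matrix in (8-15) implies that the variables $X_1, X_2, \ldots, X_p$ are equally correlated.

It is not difficult to show (see Exercise 8.5) that the p eigenvalues of the correlation matrix (8-15) can be divided into two groups. When $\rho$ is positive, the largest is

$$\lambda_1 = 1 + (p - 1)\rho \tag{8-16}$$

with associated eigenvector

$$\mathbf{e}_1' = \left[\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}, \ldots, \frac{1}{\sqrt{p}}\right] \tag{8-17}$$

The remaining $p - 1$ eigenvalues are

$$\lambda_2 = \lambda_3 = \cdots = \lambda_p = 1 - \rho$$

and one choice for their eigenvectors is

$$\begin{aligned}
\mathbf{e}_2' &= \left[\frac{1}{\sqrt{1 \times 2}}, \frac{-1}{\sqrt{1 \times 2}}, 0, \ldots, 0\right] \\
\mathbf{e}_3' &= \left[\frac{1}{\sqrt{2 \times 3}}, \frac{1}{\sqrt{2 \times 3}}, \frac{-2}{\sqrt{2 \times 3}}, 0, \ldots, 0\right] \\
&\ \ \vdots \\
\mathbf{e}_i' &= \left[\frac{1}{\sqrt{(i-1)i}}, \ldots, \frac{1}{\sqrt{(i-1)i}}, \frac{-(i-1)}{\sqrt{(i-1)i}}, 0, \ldots, 0\right] \\
&\ \ \vdots \\
\mathbf{e}_p' &= \left[\frac{1}{\sqrt{(p-1)p}}, \ldots, \frac{1}{\sqrt{(p-1)p}}, \frac{-(p-1)}{\sqrt{(p-1)p}}\right]
\end{aligned}$$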
The first principal component

$$Y_1 = \mathbf{e}_1'\mathbf{Z} = \frac{1}{\sqrt{p}}\sum_{i=1}^{p} Z_i$$

is proportional to the sum of the p standardized variables. It might be regarded as an "index" with equal weights. This principal component explains a proportion

$$\frac{\lambda_1}{p} = \frac{1 + (p - 1)\rho}{p} = \rho + \frac{1 - \rho}{p} \tag{8-18}$$

of the total population variation. We see that $\lambda_1/p \doteq \rho$ for $\rho$ close to 1 or p large. For example, if $\rho = .80$ and $p = 5$, the first component explains 84% of the total variance. When $\rho$ is near 1, the last $p - 1$ components collectively contribute very little to the total variance and can often be neglected. In this special case, the first principal component for the original variables, $\mathbf{X}$, is the same. That is, $Y_1 = (1/\sqrt{p})[1, 1, \ldots, 1]\mathbf{X}$, a measure of total size, and it explains the same proportion (8-18) of total variance.

If the standardized variables $Z_1, Z_2, \ldots, Z_p$ have a multivariate normal distribution with a covariance matrix given by (8-15), then the ellipsoids of constant density are "cigar shaped," with the major axis proportional to the first principal component $Y_1 = (1/\sqrt{p})[1, 1, \ldots, 1]\mathbf{Z}$. This principal component is the projection of $\mathbf{Z}$ on the equiangular line $\mathbf{1}' = [1, 1, \ldots, 1]$. The minor axes (and remaining principal components) occur in spherically symmetric directions perpendicular to the major axis (and first principal component).
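The eigenstructure claimed for the equicorrelation matrix (8-15) is easy to confirm numerically for a particular p and ρ. The sketch below, in plain Python, builds the matrix, checks that each proposed vector satisfies $\boldsymbol{\rho}\mathbf{e} = \lambda\mathbf{e}$, and evaluates the proportion (8-18); the function names are ours, and a numerical check for one choice of p and ρ is of course not a proof.

```python
import math


def equicorrelation_matrix(p, rho):
    """p x p matrix with 1 on the diagonal and rho elsewhere, as in (8-15)."""
    return [[1.0 if i == k else rho for k in range(p)] for i in range(p)]


def first_eigenvector(p):
    """e_1 of (8-17): equal weights 1/sqrt(p)."""
    return [1.0 / math.sqrt(p)] * p


def contrast_eigenvector(p, i):
    """e_i for i = 2, ..., p: (i-1) equal entries, then -(i-1), then zeros."""
    scale = math.sqrt((i - 1) * i)
    return [1.0 / scale] * (i - 1) + [-(i - 1) / scale] + [0.0] * (p - i)


def matvec(matrix, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in matrix]


def eigen_residual(matrix, lam, e):
    """Largest absolute entry of (matrix e - lam e); near 0 for an eigenpair."""
    return max(abs(me - lam * ek) for me, ek in zip(matvec(matrix, e), e))


def proportion_first(p, rho):
    """Proportion of total variance due to the first component, (8-18)."""
    return (1.0 + (p - 1) * rho) / p


p, rho = 5, 0.80
R = equicorrelation_matrix(p, rho)
print("residual for e_1:", eigen_residual(R, 1 + (p - 1) * rho, first_eigenvector(p)))
for i in range(2, p + 1):
    print("residual for e_%d:" % i, eigen_residual(R, 1 - rho, contrast_eigenvector(p, i)))
print("proportion explained by Y_1:", proportion_first(p, rho))
```

Each contrast vector sums to zero, which is why multiplying by the equicorrelation matrix simply scales it by $1 - \rho$.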
8.3 SUMMARIZING SAMPLE VARIATION BY PRINCIPAL COMPONENTS

We now have the framework necessary to study the problem of summarizing the variation in n measurements on p variables with a few judiciously chosen linear combinations.

Suppose the data $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ represent n independent drawings from some p-dimensional population with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. These data yield the sample mean vector $\bar{\mathbf{x}}$, the sample covariance matrix $\mathbf{S}$, and the sample correlation matrix $\mathbf{R}$.

Our objective in this section will be to construct uncorrelated linear combinations of the measured characteristics that account for much of the variation in the sample. The uncorrelated combinations with the largest variances will be called the sample principal components.

Recall that the n values of any linear combination

$$\mathbf{a}_1'\mathbf{x}_j = a_{11}x_{j1} + a_{12}x_{j2} + \cdots + a_{1p}x_{jp}, \qquad j = 1, 2, \ldots, n$$

have sample mean $\mathbf{a}_1'\bar{\mathbf{x}}$ and sample variance $\mathbf{a}_1'\mathbf{S}\mathbf{a}_1$. Also, the pairs of values $(\mathbf{a}_1'\mathbf{x}_j, \mathbf{a}_2'\mathbf{x}_j)$, for two linear combinations, have sample covariance $\mathbf{a}_1'\mathbf{S}\mathbf{a}_2$ [see (3-36)].
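The sample principal components are defined as those linear combinations which have maximum sample variance. As with the population quantities, we restrict the coefficient vectors $\mathbf{a}_i$ to satisfy $\mathbf{a}_i'\mathbf{a}_i = 1$. Specifically,

First sample principal component = linear combination $\mathbf{a}_1'\mathbf{x}_j$ that maximizes the sample variance of $\mathbf{a}_1'\mathbf{x}_j$ subject to $\mathbf{a}_1'\mathbf{a}_1 = 1$

Second sample principal component = linear combination $\mathbf{a}_2'\mathbf{x}_j$ that maximizes the sample variance of $\mathbf{a}_2'\mathbf{x}_j$ subject to $\mathbf{a}_2'\mathbf{a}_2 = 1$ and zero sample covariance for the pairs $(\mathbf{a}_1'\mathbf{x}_j, \mathbf{a}_2'\mathbf{x}_j)$

At the ith step, we have

ith sample principal component = linear combination $\mathbf{a}_i'\mathbf{x}_j$ that maximizes the sample variance of $\mathbf{a}_i'\mathbf{x}_j$ subject to $\mathbf{a}_i'\mathbf{a}_i = 1$ and zero sample covariance for all pairs $(\mathbf{a}_i'\mathbf{x}_j, \mathbf{a}_k'\mathbf{x}_j)$, $k < i$

The first principal component maximizes $\mathbf{a}_1'\mathbf{S}\mathbf{a}_1$ or, equivalently,

$$\frac{\mathbf{a}_1'\mathbf{S}\mathbf{a}_1}{\mathbf{a}_1'\mathbf{a}_1} \tag{8-19}$$

By (2-51), the maximum is the largest eigenvalue $\hat{\lambda}_1$ attained for the choice $\mathbf{a}_1$ = eigenvector $\hat{\mathbf{e}}_1$ of $\mathbf{S}$. Successive choices of $\mathbf{a}_i$ maximize (8-19) subject to $0 = \mathbf{a}_i'\mathbf{S}\hat{\mathbf{e}}_k = \mathbf{a}_i'\hat{\lambda}_k\hat{\mathbf{e}}_k$, or $\mathbf{a}_i$ perpendicular to $\hat{\mathbf{e}}_k$. Thus, as in the proofs of Results 8.1-8.3, we obtain the following results concerning sample principal components:

If $\mathbf{S} = \{s_{ik}\}$ is the $p \times p$ sample covariance matrix with eigenvalue-eigenvector pairs $(\hat{\lambda}_1, \hat{\mathbf{e}}_1), (\hat{\lambda}_2, \hat{\mathbf{e}}_2), \ldots, (\hat{\lambda}_p, \hat{\mathbf{e}}_p)$, the ith sample principal component is given by

$$\hat{y}_i = \hat{\mathbf{e}}_i'\mathbf{x} = \hat{e}_{i1}x_1 + \hat{e}_{i2}x_2 + \cdots + \hat{e}_{ip}x_p, \qquad i = 1, 2, \ldots, p$$

where $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$ and $\mathbf{x}$ is any observation on the variables $X_1, X_2, \ldots, X_p$. Also,

$$\begin{aligned}
\text{Sample variance}(\hat{y}_k) &= \hat{\lambda}_k, & k &= 1, 2, \ldots, p \\
\text{Sample covariance}(\hat{y}_i, \hat{y}_k) &= 0, & i &\ne k
\end{aligned} \tag{8-20}$$

In addition,

$$\text{Total sample variance} = \sum_{i=1}^{p} s_{ii} = \hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p$$

and

$$r_{\hat{y}_i, x_k} = \frac{\hat{e}_{ik}\sqrt{\hat{\lambda}_i}}{\sqrt{s_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$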
We shall denote the sample principal components by $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_p$, irrespective of whether they are obtained from $\mathbf{S}$ or $\mathbf{R}$.² The components constructed from $\mathbf{S}$ and $\mathbf{R}$ are not the same, in general, but it will be clear from the context which matrix is being used, and the single notation $\hat{y}_i$ is convenient. It is also convenient to label the component coefficient vectors $\hat{\mathbf{e}}_i$ and the component variances $\hat{\lambda}_i$ for both situations.

The observations $\mathbf{x}_j$ are often "centered" by subtracting $\bar{\mathbf{x}}$. This has no effect on the sample covariance matrix $\mathbf{S}$ and gives the ith principal component

$$\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}}), \qquad i = 1, 2, \ldots, p \tag{8-21}$$

for any observation vector $\mathbf{x}$. If we consider the values of the ith component

$$\hat{y}_{ji} = \hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}), \qquad j = 1, 2, \ldots, n \tag{8-22}$$

generated by substituting each observation $\mathbf{x}_j$ for the arbitrary $\mathbf{x}$ in (8-21), then

$$\bar{\hat{y}}_i = \frac{1}{n}\sum_{j=1}^{n} \hat{\mathbf{e}}_i'(\mathbf{x}_j - \bar{\mathbf{x}}) = \frac{1}{n}\hat{\mathbf{e}}_i'\left(\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})\right) = \frac{1}{n}\hat{\mathbf{e}}_i'\mathbf{0} = 0 \tag{8-23}$$

That is, the sample mean of each principal component is zero. The sample variances are still given by the $\hat{\lambda}_i$'s, as in (8-20).

² Sample principal components can also be obtained from $\hat{\boldsymbol{\Sigma}} = \mathbf{S}_n$, the maximum likelihood estimate of the covariance matrix $\boldsymbol{\Sigma}$, if the $\mathbf{X}_j$ are normally distributed. (See Result 4.11.) In this case, provided that the eigenvalues of $\boldsymbol{\Sigma}$ are distinct, the sample principal components can be viewed as the maximum likelihood estimates of the corresponding population counterparts. (See [1].) We shall not consider $\hat{\boldsymbol{\Sigma}}$ because the assumption of normality is not required in this section. Also, $\hat{\boldsymbol{\Sigma}}$ has eigenvalues $[(n - 1)/n]\hat{\lambda}_i$ and corresponding eigenvectors $\hat{\mathbf{e}}_i$, where $(\hat{\lambda}_i, \hat{\mathbf{e}}_i)$ are the eigenvalue-eigenvector pairs for $\mathbf{S}$. Thus, both $\mathbf{S}$ and $\hat{\boldsymbol{\Sigma}}$ give the same sample principal components $\hat{\mathbf{e}}_i'\mathbf{x}$ [see (8-20)] and the same proportion of explained variance $\hat{\lambda}_i/(\hat{\lambda}_1 + \hat{\lambda}_2 + \cdots + \hat{\lambda}_p)$. Finally, both $\mathbf{S}$ and $\hat{\boldsymbol{\Sigma}}$ give the same sample correlation matrix $\mathbf{R}$, so if the variables are standardized, the choice of $\mathbf{S}$ or $\hat{\boldsymbol{\Sigma}}$ is irrelevant.

Example 8.3 (Summarizing sample variability with two sample principal components)

A census provided information, by tract, on five socioeconomic variables for the Madison, Wisconsin, area. The data from 14 tracts are listed in Table 8.5 in the exercises at the end of this chapter. These data produced the following summary statistics:

$$\bar{\mathbf{x}}' = [4.32,\ 14.01,\ 1.95,\ 2.17,\ 2.45]$$

whose entries are, in order, total population (thousands), median school years, total employment (thousands), health services employment (hundreds), and median home value (\$10,000s), and

$$\mathbf{S} = \begin{bmatrix}
4.308 & 1.683 & 1.803 & 2.155 & -.253 \\
1.683 & 1.768 & .588 & .177 & .176 \\
1.803 & .588 & .801 & 1.065 & -.158 \\
2.155 & .177 & 1.065 & 1.970 & -.357 \\
-.253 & .176 & -.158 & -.357 & .504
\end{bmatrix}$$

Can the sample variation be summarized by one or two principal components?
Chapter 8
Principal Components
We find the following: CO EFFICIE NTS FOR TH E PRI NCIPAL COM PO N E NTS (Correlation Coefficients i n Parentheses)
Variable Total population Median school years Total employment Health services employment Median home value "
Variance (Ai): Cumulative percentage of total variance
e l (ryl,xk )
e 2 ( ry2 , xk)
e3
e4
es
"
"
"
.781 (.99)
 .071 (  .04)
.004
.542
 .302
.306 (.61) .334 (.98)
 .764(  .76) .083 (.12)
 . 162 .015
 .545 .050
 .010 .937
.426 (.80)
.579 (.55)
.220
 .636
 .173
 .054(  .20)
 .262(  .49)
.962
 .051
.024
6.931
1.786
.390
.230
.01 4
74.1
93.2
97.4
99.9
100
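The cumulative percentages in the last row of the table, and the parenthesized correlations for the first component, follow directly from the eigenvalues, the first coefficient vector, and the diagonal of $\mathbf{S}$. The following sketch in plain Python recomputes them from the rounded values printed above (the function names are ours); because the inputs are rounded, agreement with the table should be expected only to about the last printed digit.

```python
import math

# Rounded values from Example 8.3
S_DIAGONAL = [4.308, 1.768, 0.801, 1.970, 0.504]   # s_11, ..., s_55
EIGENVALUES = [6.931, 1.786, 0.390, 0.230, 0.014]  # lambda_1-hat, ..., lambda_5-hat
E1 = [0.781, 0.306, 0.334, 0.426, -0.054]          # coefficients of the first component


def cumulative_percentages(eigenvalues):
    """Cumulative percentage of total sample variance for each component."""
    total = sum(eigenvalues)
    running = 0.0
    result = []
    for lam in eigenvalues:
        running += lam
        result.append(100.0 * running / total)
    return result


def correlations(e, lam, s_diag):
    """r_{y_i, x_k} = e_ik * sqrt(lambda_i) / sqrt(s_kk) for one component."""
    return [e_k * math.sqrt(lam) / math.sqrt(s_kk) for e_k, s_kk in zip(e, s_diag)]


print("total sample variance (trace of S):", round(sum(S_DIAGONAL), 3))
print("sum of eigenvalues:                ", round(sum(EIGENVALUES), 3))
print("cumulative %:", [round(c, 1) for c in cumulative_percentages(EIGENVALUES)])
print("r(y1, xk):   ", [round(r, 2) for r in correlations(E1, EIGENVALUES[0], S_DIAGONAL)])
```

The same `correlations` helper applied to the second column reproduces the second set of parenthesized values only approximately, since the printed coefficients carry three figures.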
The first principal component explains 74.1% of the total sample variance. The first two principal components, collectively, explain 93.2% of the total sample variance. Consequently, sample variation is summarized very well by two principal components and a reduction in the data from 14 observations on 5 variables to 14 observations on 2 principal components is reasonable.

Given the foregoing component coefficients, the first principal component appears to be essentially a weighted average of the first four variables. The second principal component appears to contrast health services employment with a weighted average of median school years and median home value. ■

As we said in our discussion of the population components, the component coefficients $\hat{e}_{ik}$ and the correlations $r_{\hat{y}_i, x_k}$ should both be examined to interpret the principal components. The correlations allow for differences in the variances of the original variables, but only measure the importance of an individual X without regard to the other X's making up the component. We notice in Example 8.3, however, that the correlation coefficients displayed in the table confirm the interpretation provided by the component coefficients.

The Number of Principal Components

There is always the question of how many components to retain. There is no definitive answer to this question. Things to consider include the amount of total sample variance explained, the relative sizes of the eigenvalues (the variances of the sample components), and the subject-matter interpretations of the components. In addition, as we discuss later, a component associated with an eigenvalue near zero and, hence, deemed unimportant, may indicate an unsuspected linear dependency in the data.
A useful visual aid to determining an appropriate number of principal components is a scree plot.³ With the eigenvalues ordered from largest to smallest, a scree plot is a plot of $\hat{\lambda}_i$ versus i, that is, the magnitude of an eigenvalue versus its number. To determine the appropriate number of components, we look for an elbow (bend) in the scree plot. The number of components is taken to be the point at which the remaining eigenvalues are relatively small and all about the same size. Figure 8.2 shows a scree plot for a situation with six principal components.

An elbow occurs in the plot in Figure 8.2 at about $i = 3$. That is, the eigenvalues after $\hat{\lambda}_2$ are all relatively small and about the same size. In this case, it appears, without any other evidence, that two (or perhaps three) sample principal components effectively summarize the total sample variance.

Figure 8.2 A scree plot.

³ Scree is the rock debris at the bottom of a cliff.

Example 8.4 (Summarizing sample variability with one sample principal component)

In a study of size and shape relationships for painted turtles, Jolicoeur and Mosimann [11] measured carapace length, width, and height. Their data, reproduced in Exercise 6.18, Table 6.9, suggest an analysis in terms of logarithms. (Jolicoeur [10] generally suggests a logarithmic transformation in studies of size-and-shape relationships.) Perform a principal component analysis.

The natural logarithms of the dimensions of 24 male turtles have sample mean vector $\bar{\mathbf{x}}' = [4.725, 4.478, 3.703]$ and covariance matrix

$$\mathbf{S} = 10^{-3}\begin{bmatrix} 11.072 & 8.019 & 8.160 \\ 8.019 & 6.417 & 6.005 \\ 8.160 & 6.005 & 6.773 \end{bmatrix}$$

A principal component analysis (see Panel 8.1 on page 443 for the output from the SAS statistical software package) yields the following summary:

COEFFICIENTS FOR PRINCIPAL COMPONENTS
(Correlation Coefficients in Parentheses)

Variable                     e1-hat (r_y1,xk)     e2-hat          e3-hat
ln (length)                    .683 (.99)         -.159           -.713
ln (width)                     .510 (.97)         -.594            .622
ln (height)                    .523 (.97)          .788            .324

Variance (lambda_i-hat):     23.30 x 10^-3       .60 x 10^-3     .36 x 10^-3
Cumulative percentage
of total variance            96.1                98.5            100
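The dominant eigenpair reported for the turtle data can be approximated without a statistical package. The sketch below uses power iteration in plain Python; this is our illustrative substitute, not the method used by the SAS procedure in Panel 8.1, and the names `power_iteration` and `matvec` are ours. It starts from the rounded entries of $\mathbf{S}$ given above, so the results agree with the table only to about three figures.

```python
import math

# Sample covariance matrix of the log-dimensions (Example 8.4)
S = [[11.072e-3, 8.019e-3, 8.160e-3],
     [8.019e-3, 6.417e-3, 6.005e-3],
     [8.160e-3, 6.005e-3, 6.773e-3]]


def matvec(matrix, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in matrix]


def power_iteration(matrix, iterations=100):
    """Approximate the largest eigenvalue and its unit-length eigenvector.

    Repeatedly multiplies a starting vector by the matrix and rescales it.
    The eigenvalue estimate is the Rayleigh quotient v' M v with v'v = 1.
    """
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(vi * wi for vi, wi in zip(v, matvec(matrix, v)))
    return lam, v


lam1, e1 = power_iteration(S)
total_variance = sum(S[i][i] for i in range(3))

print("lambda_1-hat ~", round(lam1, 5))
print("e_1-hat      ~", [round(x, 3) for x in e1])
print("proportion   ~", round(lam1 / total_variance, 3))
```

Convergence is fast here because the second eigenvalue is only a few percent of the first, which is the same feature that produces the sharp elbow in the scree plot.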
A scree plot is shown in Figure 8.3. The very distinct elbow in this plot occurs at $i = 2$. There is clearly one dominant principal component.

The first principal component, which explains 96% of the total variance, has an interesting subject-matter interpretation. Since

$$\hat{y}_1 = .683\ln(\text{length}) + .510\ln(\text{width}) + .523\ln(\text{height}) = \ln\left[(\text{length})^{.683}(\text{width})^{.510}(\text{height})^{.523}\right]$$

the first principal component may be viewed as the ln (volume) of a box with adjusted dimensions. For instance, the adjusted height is $(\text{height})^{.523}$, which accounts, in some sense, for the rounded shape of the carapace. ■

Figure 8.3 A scree plot for the turtle data.

PANEL 8.1 SAS ANALYSIS FOR EXAMPLE 8.4 USING PROC PRINCOMP.

PROGRAM COMMANDS

    title 'Principal Component Analysis';
    data turtle;
    infile 'E8-4.dat';
    input length width height;
    x1 = log(length); x2 = log(width); x3 = log(height);
    proc princomp cov data = turtle out = result;
    var x1 x2 x3;

OUTPUT

    Principal Components Analysis

    24 Observations
    3 Variables

    Simple Statistics
                      X1              X2              X3
    Mean     4.725443647     4.477573765     3.703185794
    StD      0.105223590     0.080104466     0.082296771

                      X1              X2              X3
    X1                      0.0080191419    0.0081596480
    X2                                      0.0060052707
    X3

    Total Variance = 0.024261488

               Difference     Proportion
    PRIN1        0.022705       0.960508
    PRIN2        0.000238       0.024661
    PRIN3                       0.014832

Interpretation of the Sample Principal Components
The sample principal components have several interpretations. First, suppose the underlying distribution of $\mathbf{X}$ is nearly $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then the sample principal components, $\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})$ are realizations of population principal components $Y_i = \mathbf{e}_i'(\mathbf{X} - \boldsymbol{\mu})$, which have an $N_p(\mathbf{0}, \boldsymbol{\Lambda})$ distribution. The diagonal matrix $\boldsymbol{\Lambda}$ has entries $\lambda_1, \lambda_2, \ldots, \lambda_p$ and $(\lambda_i, \mathbf{e}_i)$ are the eigenvalue-eigenvector pairs of $\boldsymbol{\Sigma}$.

Also, from the sample values $\mathbf{x}_j$, we can approximate $\boldsymbol{\mu}$ by $\bar{\mathbf{x}}$ and $\boldsymbol{\Sigma}$ by $\mathbf{S}$. If $\mathbf{S}$ is positive definite, the contour consisting of all $p \times 1$ vectors $\mathbf{x}$ satisfying

$$(\mathbf{x} - \bar{\mathbf{x}})'\mathbf{S}^{-1}(\mathbf{x} - \bar{\mathbf{x}}) = c^2 \tag{8-24}$$

estimates the constant density contour $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) = c^2$ of the underlying normal density. The approximate contours can be drawn on the scatter plot to indicate the normal distribution that generated the data. The normality assumption is useful for the inference procedures discussed in Section 8.5, but it is not required for the development of the properties of the sample principal components summarized in (8-20).

Even when the normal assumption is suspect and the scatter plot may depart somewhat from an elliptical pattern, we can still extract the eigenvalues from $\mathbf{S}$ and obtain the sample principal components. Geometrically, the data may be plotted as n points in p-space. The data can then be expressed in the new coordinates, which coincide with the axes of the contour of (8-24). Now, (8-24) defines a hyperellipsoid that is centered at $\bar{\mathbf{x}}$ and whose axes are given by the eigenvectors of $\mathbf{S}^{-1}$ or, equivalently, of $\mathbf{S}$. (See Section 2.3 and Result 4.1, with $\mathbf{S}$ in place of $\boldsymbol{\Sigma}$.) The lengths of these hyperellipsoid axes are proportional to $\sqrt{\hat{\lambda}_i}$, $i = 1, 2, \ldots, p$, where $\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \cdots \ge \hat{\lambda}_p \ge 0$ are the eigenvalues of $\mathbf{S}$.

Because $\hat{\mathbf{e}}_i$ has length 1, the absolute value of the ith principal component, $|\hat{y}_i| = |\hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})|$, gives the length of the projection of the vector $(\mathbf{x} - \bar{\mathbf{x}})$ on the unit vector $\hat{\mathbf{e}}_i$. [See (2-8) and (2-9).] Thus, the sample principal components $\hat{y}_i = \hat{\mathbf{e}}_i'(\mathbf{x} - \bar{\mathbf{x}})$, $i = 1, 2, \ldots, p$, lie along the axes of the hyperellipsoid, and their absolute values are the lengths of the projections of $\mathbf{x} - \bar{\mathbf{x}}$ in the directions of the axes $\hat{\mathbf{e}}_i$. Consequently, the sample principal components can be viewed as the result of translating the origin of the original coordinate system to $\bar{\mathbf{x}}$ and then rotating the coordinate axes until they pass through the scatter in the directions of maximum variance.

The geometrical interpretation of the sample principal components is illustrated in Figure 8.4 for $p = 2$. Figure 8.4(a) shows an ellipse of constant distance, centered at $\bar{\mathbf{x}}$, with $\hat{\lambda}_1 > \hat{\lambda}_2$.
The sample principal components are well determined. They lie along the axes of the ellipse in the perpendicular directions of maximum �ampl� varial!_ce. Fjgure 8.4(b) shows a constant distance ellipse, centered at x, with A 1 A2 . If A 1 == A.2 , the axes of the ellipse (circle) of constant distance are not uniquely determined and can lie in any two perpendicular directions, including the di rections of the original coordinate axes. Similarly, the sample principal components >
==
==
·
>
·
· ·
>
>
Section 8.3
(x

S u m marizing Sa m p l e Va riation by Principa l Components
x ) ' S  l (x

x) =
445
c2
•
(x
(a) Figure
5:, 1 > �2
8.4
(b) � l
�

x ) ' s  l (x

x)
=
c2
�2
Sam p l e principal components a n d e l l i pses of consta nt di stance.
can lie in any two perpendicular directions, including those of the original coordi nate axes. When the contours of constant distance are nearly circular or, equiva lently, when the eigenvalues of S are nearly equal, the sample variation is homogeneous in all directions. It is then not possible to represent the data well in fewer than p dimensions. If the last few eigenvalues Ai are sufficiently small such that the variation in the corresponding ei directions is negligible, the last few sample principal components can often be ignored, and the data can be adequately approximated by their repre sentations in the space of the retained components. (See Section 8.4.) Finally, Supplement 8A gives a further result concerning the role of the sample principal components when directly approximating the meancentered data xj  x . A
Standardizing the Sample Principal Components
Sample principal components are, in general, not invariant with respect to changes in scale. (See Exercises 8.6 and 8.7.) As we mentioned in the treatment of population components, variables measured on different scales or on a common scale with widely differing ranges are often standardized. For the sample, standardization is accomplished by constructing

$$z_j = \left[ \frac{x_{j1} - \bar{x}_1}{\sqrt{s_{11}}},\ \frac{x_{j2} - \bar{x}_2}{\sqrt{s_{22}}},\ \ldots,\ \frac{x_{jp} - \bar{x}_p}{\sqrt{s_{pp}}} \right]', \qquad j = 1, 2, \ldots, n \tag{8-25}$$
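The $n \times p$ data matrix of standardized observations

$$Z = \begin{bmatrix} z_1' \\ z_2' \\ \vdots \\ z_n' \end{bmatrix}
= \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1p} \\ z_{21} & z_{22} & \cdots & z_{2p} \\ \vdots & \vdots & & \vdots \\ z_{n1} & z_{n2} & \cdots & z_{np} \end{bmatrix}
= \begin{bmatrix}
\dfrac{x_{11} - \bar{x}_1}{\sqrt{s_{11}}} & \dfrac{x_{12} - \bar{x}_2}{\sqrt{s_{22}}} & \cdots & \dfrac{x_{1p} - \bar{x}_p}{\sqrt{s_{pp}}} \\
\dfrac{x_{21} - \bar{x}_1}{\sqrt{s_{11}}} & \dfrac{x_{22} - \bar{x}_2}{\sqrt{s_{22}}} & \cdots & \dfrac{x_{2p} - \bar{x}_p}{\sqrt{s_{pp}}} \\
\vdots & \vdots & & \vdots \\
\dfrac{x_{n1} - \bar{x}_1}{\sqrt{s_{11}}} & \dfrac{x_{n2} - \bar{x}_2}{\sqrt{s_{22}}} & \cdots & \dfrac{x_{np} - \bar{x}_p}{\sqrt{s_{pp}}}
\end{bmatrix} \tag{8-26}$$

yields the sample mean vector [see (3-24)]

$$\bar{z} = \frac{1}{n} Z'\mathbf{1}
= \frac{1}{n}\begin{bmatrix} \sum_{j=1}^{n} \dfrac{x_{j1} - \bar{x}_1}{\sqrt{s_{11}}} \\ \vdots \\ \sum_{j=1}^{n} \dfrac{x_{jp} - \bar{x}_p}{\sqrt{s_{pp}}} \end{bmatrix} = 0 \tag{8-27}$$

and sample covariance matrix [see (3-27)]

$$S_z = \frac{1}{n-1}\left(Z - \frac{1}{n}\mathbf{1}\mathbf{1}'Z\right)'\left(Z - \frac{1}{n}\mathbf{1}\mathbf{1}'Z\right)
= \frac{1}{n-1}(Z - \mathbf{1}\bar{z}')'(Z - \mathbf{1}\bar{z}')
= \frac{1}{n-1} Z'Z$$

$$= \frac{1}{n-1}\begin{bmatrix}
\dfrac{(n-1)s_{11}}{s_{11}} & \dfrac{(n-1)s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \cdots & \dfrac{(n-1)s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} \\
\dfrac{(n-1)s_{12}}{\sqrt{s_{11}}\sqrt{s_{22}}} & \dfrac{(n-1)s_{22}}{s_{22}} & \cdots & \dfrac{(n-1)s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} \\
\vdots & \vdots & & \vdots \\
\dfrac{(n-1)s_{1p}}{\sqrt{s_{11}}\sqrt{s_{pp}}} & \dfrac{(n-1)s_{2p}}{\sqrt{s_{22}}\sqrt{s_{pp}}} & \cdots & \dfrac{(n-1)s_{pp}}{s_{pp}}
\end{bmatrix} = R \tag{8-28}$$

The sample principal components of the standardized observations are given by (8-20), with the matrix R in place of S. Since the observations are already "centered" by construction, there is no need to write the components in the form of (8-21).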
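As a numerical check on (8-27) and (8-28), the sketch below standardizes a small simulated data set and confirms that the standardized observations have mean zero and covariance matrix equal to the sample correlation matrix R. The data, the seed, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 50 observations on p = 3 variables with very different scales
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([5.0, -2.0, 30.0])
X[:, 1] += 4.0 * X[:, 0]              # induce some correlation between x1 and x2

xbar = X.mean(axis=0)
root_s = X.std(axis=0, ddof=1)        # sqrt(s_ii), using the divisor n - 1
Z = (X - xbar) / root_s               # rows are the standardized observations z_j'

z_bar = Z.mean(axis=0)                # should be the zero vector, as in (8-27)
S_z = np.cov(Z, rowvar=False)         # covariance matrix of the standardized data
R = np.corrcoef(X, rowvar=False)      # sample correlation matrix of the original data
```

Here `S_z` and `R` agree to rounding error, so an eigenanalysis of either one gives the principal components of the standardized variables.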
Using (8-29), we see that the proportion of the total sample variance explained by the ith sample principal component is

$$\left(\begin{array}{c}\text{Proportion of (standardized) sample variance} \\ \text{due to } i\text{th sample principal component}\end{array}\right) = \frac{\hat\lambda_i}{p}, \qquad i = 1, 2, \ldots, p \tag{8-30}$$
A rule of thumb suggests retaining only those components whose variances $\hat\lambda_i$ are greater than unity or, equivalently, only those components which, individually, explain at least a proportion 1/p of the total variance. This rule does not have a great deal of theoretical support, however, and it should not be applied blindly. As we have mentioned, a scree plot is also useful for selecting the appropriate number of components.

Example 8.5 (Sample principal components from standardized data)
The weekly rates of return for five stocks (Allied Chemical, du Pont, Union Carbide, Exxon, and Texaco) listed on the New York Stock Exchange were determined for the period January 1975 through December 1976. The weekly rates of return are defined as (current Friday closing price - previous Friday closing price)/(previous Friday closing price), adjusted for stock splits and dividends. The data are listed in Table 8.4 in the Exercises. The observations in 100 successive weeks appear to be independently distributed, but the rates of return across stocks are correlated, since, as one expects, stocks tend to move together in response to general economic conditions.

Let $x_1, x_2, \ldots, x_5$ denote observed weekly rates of return for Allied Chemical, du Pont, Union Carbide, Exxon, and Texaco, respectively. Then

$$\bar{x}' = [.0054,\ .0048,\ .0057,\ .0063,\ .0037]$$
and

$$R = \begin{bmatrix}
1.000 & .577 & .509 & .387 & .462 \\
.577 & 1.000 & .599 & .389 & .322 \\
.509 & .599 & 1.000 & .436 & .426 \\
.387 & .389 & .436 & 1.000 & .523 \\
.462 & .322 & .426 & .523 & 1.000
\end{bmatrix}$$

We note that R is the covariance matrix of the standardized observations

$$z_1 = \frac{x_1 - \bar{x}_1}{\sqrt{s_{11}}},\quad z_2 = \frac{x_2 - \bar{x}_2}{\sqrt{s_{22}}},\quad \ldots,\quad z_5 = \frac{x_5 - \bar{x}_5}{\sqrt{s_{55}}}$$

The eigenvalues and corresponding normalized eigenvectors of R, determined by a computer, are

$\hat\lambda_1 = 2.857$, $\hat{e}_1' = [.464,\ .457,\ .470,\ .421,\ .421]$
$\hat\lambda_2 = .809$, $\hat{e}_2' = [.240,\ .509,\ .260,\ -.526,\ -.582]$
$\hat\lambda_3 = .540$, $\hat{e}_3' = [-.612,\ .178,\ .335,\ .541,\ -.435]$
$\hat\lambda_4 = .452$, $\hat{e}_4' = [.387,\ .206,\ -.662,\ .472,\ -.382]$
$\hat\lambda_5 = .343$, $\hat{e}_5' = [-.451,\ .676,\ -.400,\ -.176,\ .385]$

Using the standardized variables, we obtain the first two sample principal components:

$$\hat{y}_1 = \hat{e}_1'z = .464z_1 + .457z_2 + .470z_3 + .421z_4 + .421z_5$$
$$\hat{y}_2 = \hat{e}_2'z = .240z_1 + .509z_2 + .260z_3 - .526z_4 - .582z_5$$

These components, which account for

$$\left(\frac{\hat\lambda_1 + \hat\lambda_2}{p}\right)100\% = \left(\frac{2.857 + .809}{5}\right)100\% = 73\%$$

of the total (standardized) sample variance, have interesting interpretations. The first component is a roughly equally weighted sum, or "index," of the five stocks. This component might be called a general stock-market component, or simply a market component.

The second component represents a contrast between the chemical stocks (Allied Chemical, du Pont, and Union Carbide) and the oil stocks (Exxon and Texaco). It might be called an industry component. Thus, we see that most of the variation in these stock returns is due to market activity and uncorrelated industry activity. This interpretation of stock price behavior has also been suggested by King [12].

The remaining components are not easy to interpret and, collectively, represent variation that is probably specific to each stock. In any event, they do not explain much of the total sample variance.

This example provides a case where it seems sensible to retain a component ($\hat{y}_2$) associated with an eigenvalue less than unity. ■
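The calculations in Example 8.5 can be reproduced, at least approximately, from the rounded correlation matrix printed in the example. The sketch below is one way to do so with NumPy; because R is entered to three decimals, the computed eigenvalues may differ from the printed ones in the last digit, and the signs of computed eigenvectors are arbitrary.

```python
import numpy as np

# Correlation matrix R from Example 8.5 (entries as printed, rounded to 3 decimals)
R = np.array([
    [1.000, 0.577, 0.509, 0.387, 0.462],
    [0.577, 1.000, 0.599, 0.389, 0.322],
    [0.509, 0.599, 1.000, 0.436, 0.426],
    [0.387, 0.389, 0.436, 1.000, 0.523],
    [0.462, 0.322, 0.426, 0.523, 1.000],
])
p = R.shape[0]

eigvals, eigvecs = np.linalg.eigh(R)         # ascending order for symmetric input
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

proportion = eigvals / p                     # formula (8-30)
cumulative = np.cumsum(proportion)           # cumulative proportion explained
n_above_unity = int(np.sum(eigvals > 1.0))   # components kept by the "greater than unity" rule
```

For these data the rule of thumb would keep fewer components than the two retained in the example, which illustrates why the rule should not be applied mechanically.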
Example 8.6 (Components from a correlation matrix with a special structure)
Geneticists are often concerned with the inheritance of characteristics that can be measured several times during an animal's lifetime. Body weights (in grams) for n = 150 female mice were obtained immediately after the birth of their first four litters.⁴ The sample mean vector and sample correlation matrix were, respectively,

$$\bar{x}' = [39.88,\ 45.08,\ 48.11,\ 49.95]$$

and

$$R = \begin{bmatrix}
1.000 & .7501 & .6329 & .6363 \\
.7501 & 1.000 & .6925 & .7386 \\
.6329 & .6925 & 1.000 & .6625 \\
.6363 & .7386 & .6625 & 1.000
\end{bmatrix}$$

The eigenvalues of this matrix are

$$\hat\lambda_1 = 3.058,\quad \hat\lambda_2 = .382,\quad \hat\lambda_3 = .342,\quad \text{and}\quad \hat\lambda_4 = .217$$

(these sum to p = 4, as the eigenvalues of a correlation matrix must, up to rounding). We note that the first eigenvalue is nearly equal to $1 + (p - 1)\bar{r} = 1 + (4 - 1)(.6854) = 3.056$, where $\bar{r}$ is the arithmetic average of the off-diagonal elements of R. The remaining eigenvalues are small and about equal, although $\hat\lambda_4$ is somewhat smaller than $\hat\lambda_2$ and $\hat\lambda_3$. Thus, there is some evidence that the corresponding population correlation matrix $\rho$ may be of the "equal correlation" form of (8-15). This notion is explored further in Example 8.9.

The first principal component

$$\hat{y}_1 = \hat{e}_1'z = .49z_1 + .52z_2 + .49z_3 + .50z_4$$

accounts for $100(\hat\lambda_1/p)\% = 100(3.058/4)\% = 76\%$ of the total variance. Although the average postbirth weights increase over time, the variation in weights is fairly well explained by the first principal component with (nearly) equal coefficients. ■
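The comparison in Example 8.6 between the largest eigenvalue and $1 + (p - 1)\bar{r}$ can be sketched numerically as follows. The matrix entries are those printed in the example; the variable names are illustrative.

```python
import numpy as np

# Sample correlation matrix from Example 8.6 (postbirth weights of mice)
R = np.array([
    [1.0000, 0.7501, 0.6329, 0.6363],
    [0.7501, 1.0000, 0.6925, 0.7386],
    [0.6329, 0.6925, 1.0000, 0.6625],
    [0.6363, 0.7386, 0.6625, 1.0000],
])
p = R.shape[0]

off_diagonal = R[~np.eye(p, dtype=bool)]       # the p(p - 1) off-diagonal entries
r_bar = off_diagonal.mean()                    # arithmetic average correlation
equal_corr_value = 1.0 + (p - 1) * r_bar       # largest eigenvalue if all correlations equaled r_bar

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1] # eigenvalues of R, largest first
```

Since $1 + (p - 1)\bar{r} = \mathbf{1}'R\mathbf{1}/p$ is the value of the quadratic form at the equal-weight unit vector, it can never exceed the largest eigenvalue; the two are close exactly when the first eigenvector has nearly equal coefficients, as it does here.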
Comment. An unusually small value for the last eigenvalue from either the sample covariance or correlation matrix can indicate an unnoticed linear dependency in the data set. If this occurs, one (or more) of the variables is redundant and should be deleted. Consider a situation where $x_1$, $x_2$, and $x_3$ are subtest scores and the total score $x_4$ is the sum $x_1 + x_2 + x_3$. Then, although the linear combination $e'x = [1, 1, 1, -1]x = x_1 + x_2 + x_3 - x_4$ is always zero, rounding error in the computation of eigenvalues may lead to a small nonzero value. If the linear expression relating $x_4$ to $(x_1, x_2, x_3)$ was initially overlooked, the smallest eigenvalue-eigenvector pair should provide a clue to its existence.

Thus, although "large" eigenvalues and the corresponding eigenvectors are important in a principal component analysis, eigenvalues very close to zero should not be routinely ignored. The eigenvectors associated with these latter eigenvalues may point out linear dependencies in the data set that can cause interpretive and computational problems in a subsequent analysis.

⁴ Data courtesy of J. J. Rutledge.
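The subtest-score situation described in the Comment can be simulated directly. In the sketch below, the scores are hypothetical random numbers; the point is only that the smallest eigenvalue of S is zero to within rounding error and that its eigenvector is proportional to $[1, 1, 1, -1]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical subtest scores x1, x2, x3 for 40 examinees, and their total x4
subtests = rng.normal(loc=50.0, scale=10.0, size=(40, 3))
total = subtests.sum(axis=1, keepdims=True)
X = np.hstack([subtests, total])            # x4 = x1 + x2 + x3 exactly

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)        # ascending: first pair is the smallest

smallest_value = eigvals[0]                 # zero up to rounding error
smallest_vector = eigvecs[:, 0]
# Rescale so the first coefficient is 1; the sign of an eigenvector is arbitrary
direction = smallest_vector / smallest_vector[0]
```

Reading off `direction` recovers the overlooked relation $x_1 + x_2 + x_3 - x_4 = 0$.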
8.4 GRAPHING THE PRINCIPAL COMPONENTS
Plots of the principal components can reveal suspect observations, as well as provide checks on the assumption of normality. Since the principal components are linear combinations of the original variables, it is not unreasonable to expect them to be nearly normal. It is often necessary to verify that the first few principal components are approximately normally distributed when they are to be used as the input data for additional analyses.

The last principal components can help pinpoint suspect observations. Each observation can be expressed as a linear combination

$$x_j = (x_j'\hat{e}_1)\hat{e}_1 + (x_j'\hat{e}_2)\hat{e}_2 + \cdots + (x_j'\hat{e}_p)\hat{e}_p = \hat{y}_{j1}\hat{e}_1 + \hat{y}_{j2}\hat{e}_2 + \cdots + \hat{y}_{jp}\hat{e}_p$$

of the complete set of eigenvectors $\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_p$ of S. Thus, the magnitudes of the last principal components determine how well the first few fit the observations. That is, $\hat{y}_{j1}\hat{e}_1 + \hat{y}_{j2}\hat{e}_2 + \cdots + \hat{y}_{j,q-1}\hat{e}_{q-1}$ differs from $x_j$ by $\hat{y}_{jq}\hat{e}_q + \cdots + \hat{y}_{jp}\hat{e}_p$, the square of whose length is $\hat{y}_{jq}^2 + \cdots + \hat{y}_{jp}^2$. Suspect observations will often be such that at least one of the coordinates $\hat{y}_{jq}, \ldots, \hat{y}_{jp}$ contributing to this squared length will be large. (See Supplement 8A for more general approximation results.)

The following statements summarize these ideas.

1. To help check the normal assumption, construct scatter diagrams for pairs of the first few principal components. Also, make Q-Q plots from the sample values generated by each principal component.
2. Construct scatter diagrams and Q-Q plots for the last few principal components. These help identify suspect observations.
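The identity behind this diagnostic, that the squared error of the approximation by the first q - 1 components equals the sum of squares of the remaining components, can be verified with a short sketch. The data are simulated, one observation is deliberately perturbed, and the choice q = 3 is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data with p = 4; most of the variation lies in the first two directions
X = rng.multivariate_normal(np.zeros(4), np.diag([9.0, 4.0, 0.25, 0.04]), size=30)
X[0] += np.array([0.0, 0.0, 0.0, 5.0])     # perturb one observation in a low-variance direction

S = np.cov(X, rowvar=False)
eigvals, E = np.linalg.eigh(S)
E = E[:, np.argsort(eigvals)[::-1]]        # columns e_1, ..., e_p, largest eigenvalue first

Y = X @ E                                  # y_ji = x_j' e_i, as in the expansion in the text
q = 3                                      # approximate x_j by the first q - 1 = 2 terms
approx = Y[:, :q - 1] @ E[:, :q - 1].T
squared_error = ((X - approx) ** 2).sum(axis=1)
tail_sum_of_squares = (Y[:, q - 1:] ** 2).sum(axis=1)   # y_jq^2 + ... + y_jp^2

most_suspect = int(np.argmax(squared_error))   # index of the worst-fitting observation
```

Sorting the observations by `squared_error`, or plotting the last few columns of `Y`, is one way to carry out statement 2 above.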
Example 8.7 (Plotting the principal components for the turtle data)
We illustrate the plotting of principal components for the data on male turtles discussed in Example 8.4. The three sample principal components are

$$\hat{y}_1 = .683(x_1 - 4.725) + .510(x_2 - 4.478) + .523(x_3 - 3.703)$$
$$\hat{y}_2 = -.159(x_1 - 4.725) - .594(x_2 - 4.478) + .788(x_3 - 3.703)$$
$$\hat{y}_3 = -.713(x_1 - 4.725) + .622(x_2 - 4.478) + .324(x_3 - 3.703)$$

where $x_1 = \ln(\text{length})$, $x_2 = \ln(\text{width})$, and $x_3 = \ln(\text{height})$, respectively.

Figure 8.5 shows the Q-Q plot for $\hat{y}_2$, and Figure 8.6 shows the scatter plot of $(\hat{y}_1, \hat{y}_2)$. The observation for the first turtle is circled and lies in the lower right corner of the scatter plot and in the upper right corner of the Q-Q plot; it may be suspect. This point should have been checked for recording errors, or the turtle should have been examined for structural anomalies. Apart from the first turtle, the scatter plot appears to be reasonably elliptical. The plots for the other sets of principal components do not indicate any substantial departures from normality. ■
Figure 8.5 A Q-Q plot for the second principal component $\hat{y}_2$ from the data on male turtles.

Figure 8.6 Scatter plot of the principal components $\hat{y}_1$ and $\hat{y}_2$ of the data on male turtles.
The diagnostics involving principal components apply equally well to the checking of assumptions for a multivariate multiple regression model. In fact, having fit any model by any method of estimation, it is prudent to consider the

Residual vector = (vector of observations) - (vector of predicted (estimated) values)

or

$$\underset{(p \times 1)}{\hat\varepsilon_j} = \underset{(p \times 1)}{y_j} - \underset{(p \times 1)}{\hat\beta' z_j}, \qquad j = 1, 2, \ldots, n \tag{8-31}$$

for the multivariate linear model. Principal components, derived from the covariance matrix of the residuals,

$$\frac{1}{n - p}\sum_{j=1}^{n}(\hat\varepsilon_j - \bar{\hat\varepsilon}_j)(\hat\varepsilon_j - \bar{\hat\varepsilon}_j)' \tag{8-32}$$

can be scrutinized in the same manner as those determined from a random sample. You should be aware that there are linear dependencies among the residuals from a linear regression analysis, so the last eigenvalues will be zero, within rounding error.
8.5 LARGE SAMPLE INFERENCES
We have seen that the eigenvalues and eigenvectors of the covariance (correlation) matrix are the essence of a principal component analysis. The eigenvectors determine the directions of maximum variability, and the eigenvalues specify the variances. When the first few eigenvalues are much larger than the rest, most of the total variance can be "explained" in fewer than p dimensions.

In practice, decisions regarding the quality of the principal component approximation must be made on the basis of the eigenvalue-eigenvector pairs $(\hat\lambda_i, \hat{e}_i)$ extracted from S or R. Because of sampling variation, these eigenvalues and eigenvectors will differ from their underlying population counterparts. The sampling distributions of $\hat\lambda_i$ and $\hat{e}_i$ are difficult to derive and beyond the scope of this book. If you are interested, you can find some of these derivations for multivariate normal populations in [1], [2], and [5]. We shall simply summarize the pertinent large sample results.

Large Sample Properties of $\hat\lambda_i$ and $\hat{e}_i$
Currently available results concerning large sample confidence intervals for $\lambda_i$ and $e_i$ assume that the observations $X_1, X_2, \ldots, X_n$ are a random sample from a normal population. It must also be assumed that the (unknown) eigenvalues of $\Sigma$ are distinct and positive, so that $\lambda_1 > \lambda_2 > \cdots > \lambda_p > 0$. The one exception is the case where the number of equal eigenvalues is known. Usually the conclusions for distinct eigenvalues are applied, unless there is a strong reason to believe that $\Sigma$ has a special structure that yields equal eigenvalues. Even when the normal assumption is violated, the confidence intervals obtained in this manner still provide some indication of the uncertainty in $\hat\lambda_i$ and $\hat{e}_i$.

Anderson [2] and Girshick [5] have established the following large sample distribution theory for the eigenvalues $\hat\lambda' = [\hat\lambda_1, \ldots, \hat\lambda_p]$ and eigenvectors $\hat{e}_1, \ldots, \hat{e}_p$ of S:

1. Let $\Lambda$ be the diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_p$ of $\Sigma$; then $\sqrt{n}(\hat\lambda - \lambda)$ is approximately $N_p(0, 2\Lambda^2)$.
2. Let
$$E_i = \lambda_i \sum_{\substack{k=1 \\ k \ne i}}^{p} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\, e_k e_k'$$
then $\sqrt{n}(\hat{e}_i - e_i)$ is approximately $N_p(0, E_i)$.
3. Each $\hat\lambda_i$ is distributed independently of the elements of the associated $\hat{e}_i$.

Result 1 implies that, for n large, the $\hat\lambda_i$ are independently distributed. Moreover, $\hat\lambda_i$ has an approximate $N(\lambda_i, 2\lambda_i^2/n)$ distribution. Using this normal distribution, we obtain $P[\,|\hat\lambda_i - \lambda_i| \le z(\alpha/2)\lambda_i\sqrt{2/n}\,] = 1 - \alpha$. Solving the inequality for $\lambda_i$, a large sample $100(1 - \alpha)\%$ confidence interval for $\lambda_i$ is thus provided by

$$\frac{\hat\lambda_i}{1 + z(\alpha/2)\sqrt{2/n}} \le \lambda_i \le \frac{\hat\lambda_i}{1 - z(\alpha/2)\sqrt{2/n}} \tag{8-33}$$
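$\cdots > \lambda_5 > 0$. Since n = 100 is large, we can use (8-33) with i = 1 to construct a 95% confidence interval for $\lambda_1$. From Exercise 8.10, $\hat\lambda_1 = .0036$ and, in addition, $z(.025) = 1.96$. Therefore, with 95% confidence,

$$\frac{.0036}{1 + 1.96\sqrt{2/100}} \le \lambda_1 \le \frac{.0036}{1 - 1.96\sqrt{2/100}} \qquad \text{or} \qquad .0028 \le \lambda_1 \le .0050$$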
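The interval computation for $\lambda_1$ can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the normal percentile z is passed in by the user rather than looked up.

```python
import math

def eigenvalue_confidence_interval(lam_hat, n, z):
    """Large sample interval lam_hat / (1 +/- z * sqrt(2/n)) for one eigenvalue.

    lam_hat: sample eigenvalue; n: sample size; z: upper alpha/2 normal percentile.
    """
    half_width = z * math.sqrt(2.0 / n)
    if half_width >= 1.0:
        # The upper limit is undefined unless z * sqrt(2/n) < 1, i.e., n must be large
        raise ValueError("n is too small for this large-sample interval")
    return lam_hat / (1.0 + half_width), lam_hat / (1.0 - half_width)

# Values quoted in the text: lam_hat_1 = .0036, n = 100, z(.025) = 1.96
lower, upper = eigenvalue_confidence_interval(0.0036, 100, 1.96)
```

Note that the interval is not symmetric about $\hat\lambda_1$: the upper limit lies farther from the estimate than the lower limit does, because the standard deviation of $\hat\lambda_i$ is proportional to $\lambda_i$ itself.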
... $Z_1$, $\rho_{Y_1, Z_2}$, and $\rho_{Y_2, Z_1}$.

8.3. Let

$$\Sigma = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix}$$

Determine the principal components $Y_1$, $Y_2$, and $Y_3$. What can you say about the eigenvectors (and principal components) associated with eigenvalues that are not distinct?

8.4. Find the principal components and the proportion of the total population variance explained by each when the covariance matrix is

$$\Sigma = \begin{bmatrix} \sigma^2 & \sigma^2\rho & 0 \\ \sigma^2\rho & \sigma^2 & \sigma^2\rho \\ 0 & \sigma^2\rho & \sigma^2 \end{bmatrix}, \qquad -\frac{1}{\sqrt{2}} < \rho < \frac{1}{\sqrt{2}}$$

8.5. (a) Find the eigenvalues of the correlation matrix

$$\rho = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}$$

Are your results consistent with (8-16) and (8-17)?
(b) Verify the eigenvalue-eigenvector pairs for the $p \times p$ matrix $\rho$ given in (8-15).

8.6. Data on $x_1$ = sales and $x_2$ = profits for the 10 largest U.S. industrial corporations were listed in Exercise 1.4 of Chapter 1. From Example 4.12,

$$\bar{x} = \begin{bmatrix} 62{,}309 \\ 2{,}927 \end{bmatrix}, \qquad S = \begin{bmatrix} 10{,}005.20 & 255.76 \\ 255.76 & 14.30 \end{bmatrix} \times 10^5$$

(a) Determine the sample principal components and their variances for these data. (You may need the quadratic formula to solve for the eigenvalues of S.)
(b) Find the proportion of the total sample variance explained by $\hat{y}_1$.
(c) Sketch the constant density ellipse $(x - \bar{x})'S^{-1}(x - \bar{x}) = 1.4$, and indicate the principal components $\hat{y}_1$ and $\hat{y}_2$ on your graph.
(d) Compute the correlation coefficients $r_{\hat{y}_1, x_k}$, k = 1, 2. What interpretation, if any, can you give to the first principal component?

8.7. Convert the covariance matrix S in Exercise 8.6 to a sample correlation matrix R.
(a) Find the sample principal components $\hat{y}_1$, $\hat{y}_2$ and their variances.
(b) Compute the proportion of the total sample variance explained by $\hat{y}_1$.
(c) Compute the correlation coefficients $r_{\hat{y}_1, z_k}$, k = 1, 2. Interpret $\hat{y}_1$.
(d) Compare the components obtained in Part a with those obtained in Exercise 8.6(a). Given the original data displayed in Exercise 1.4, do you feel that it is better to determine principal components from the sample covariance matrix or sample correlation matrix? Explain.
8.8. Use the results in Example 8.5.
(a) Compute the correlations $r_{\hat{y}_i, z_k}$ for i = 1, 2 and k = 1, 2, ..., 5. Do these correlations reinforce the interpretations given to the first two components? Explain.
(b) Test the hypothesis

$$H_0\colon \rho = \rho_0 = \begin{bmatrix} 1 & \rho & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho & \rho \\ \rho & \rho & 1 & \rho & \rho \\ \rho & \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & \rho & 1 \end{bmatrix}$$

versus $H_1\colon \rho \ne \rho_0$ at the 5% level of significance. List any assumptions required in carrying out this test.

8.9. (A test that all variables are independent.)
(a) Consider the normal theory likelihood ratio test of $H_0$: $\Sigma$ is the diagonal matrix

$$\Sigma_0 = \begin{bmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{bmatrix}, \qquad \sigma_{ii} > 0$$

Show that the test is as follows: Reject $H_0$ if

$$\Lambda = \frac{|S|^{n/2}}{\prod_{i=1}^{p} s_{ii}^{n/2}} = |R|^{n/2} < c$$

For a large sample size, $-2\ln\Lambda$ is approximately $\chi^2_{p(p-1)/2}$. Bartlett [3] suggests that the test statistic $-2[1 - (2p + 11)/6n]\ln\Lambda$ be used in place of $-2\ln\Lambda$. This results in an improved chi-square approximation. The large sample $\alpha$ critical point is $\chi^2_{p(p-1)/2}(\alpha)$. Note that testing $\Sigma = \Sigma_0$ is the same as testing $\rho = I$.
(b) Show that the likelihood ratio test of $H_0\colon \Sigma = \sigma^2 I$ rejects $H_0$ if

$$\Lambda = \frac{|S|^{n/2}}{(\operatorname{tr}(S)/p)^{np/2}} = \left[\frac{\left(\prod_{i=1}^{p}\hat\lambda_i\right)^{1/p}}{\frac{1}{p}\sum_{i=1}^{p}\hat\lambda_i}\right]^{np/2} = \left[\frac{\text{geometric mean } \hat\lambda_i}{\text{arithmetic mean } \hat\lambda_i}\right]^{np/2} < c$$

For a large sample size, Bartlett [3] suggests that $-2[1 - (2p^2 + p + 2)/6pn]\ln\Lambda$ is approximately $\chi^2_{(p+2)(p-1)/2}$. Thus, the large sample $\alpha$ critical point is $\chi^2_{(p+2)(p-1)/2}(\alpha)$. This test is called a sphericity test, because the constant density contours are spheres when $\Sigma = \sigma^2 I$.
Hint:
(a) $\max_{\mu, \Sigma} L(\mu, \Sigma)$ is given by (5-10), and $\max L(\mu, \Sigma_0)$ is the product of the univariate likelihoods,

$$\max_{\mu_i, \sigma_{ii}} (2\pi)^{-n/2}\sigma_{ii}^{-n/2}\exp\left[-\sum_{j=1}^{n}(x_{ji} - \mu_i)^2/2\sigma_{ii}\right]$$

Hence, $\hat\mu_i = (1/n)\sum_{j=1}^{n} x_{ji}$ and $\hat\sigma_{ii} = (1/n)\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)^2$. The divisor n cancels in $\Lambda$, so S may be used.
(b) Verify

$$\hat\sigma^2 = \left[\sum_{j=1}^{n}(x_{j1} - \bar{x}_1)^2 + \cdots + \sum_{j=1}^{n}(x_{jp} - \bar{x}_p)^2\right]\Big/ np$$

under $H_0$. Again, the divisors n cancel in the statistic, so S may be used. Use Result 5.2 to calculate the chi-square degrees of freedom.
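For readers who want a numerical check on the two forms of the sphericity statistic in Exercise 8.9(b), the following sketch evaluates $\ln\Lambda$ both from the determinant and trace and from the geometric and arithmetic means of the eigenvalues, and then applies Bartlett's multiplying factor. It works on the log scale to avoid overflow; the function names and the simulated data are assumptions made for illustration, and no critical value is computed.

```python
import numpy as np

def log_sphericity_lambda(S, n):
    """Return ln(Lambda) for H0: Sigma = sigma^2 I, computed in two equivalent ways."""
    S = np.asarray(S, dtype=float)
    p = S.shape[0]
    _, logdet = np.linalg.slogdet(S)                 # ln|S|, computed stably
    via_determinant = 0.5 * n * logdet - 0.5 * n * p * np.log(np.trace(S) / p)
    lam = np.linalg.eigvalsh(S)
    # ln(geometric mean) is the mean of the logs of the eigenvalues
    via_means = 0.5 * n * p * (np.mean(np.log(lam)) - np.log(np.mean(lam)))
    return via_determinant, via_means

def bartlett_sphericity_statistic(log_lambda, n, p):
    """Bartlett-corrected statistic and its chi-square degrees of freedom."""
    factor = 1.0 - (2 * p**2 + p + 2) / (6.0 * p * n)
    df = (p + 2) * (p - 1) // 2
    return -2.0 * factor * log_lambda, df

# Hypothetical sample of n = 60 observations on p = 3 variables
rng = np.random.default_rng(4)
n, p = 60, 3
S = np.cov(rng.normal(size=(n, p)), rowvar=False)
via_determinant, via_means = log_sphericity_lambda(S, n)
statistic, df = bartlett_sphericity_statistic(via_determinant, n, p)
```

Because the geometric mean never exceeds the arithmetic mean, $\ln\Lambda \le 0$ and the statistic is nonnegative; it would be compared with the upper $\alpha$ percentile of a chi-square distribution with `df` degrees of freedom.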
The following exercises require the use of a computer.

8.10. The weekly rates of return for five stocks listed on the New York Stock Exchange are given in Table 8.4. (See the stock-price data on the CD-ROM.)

TABLE 8.4 STOCK-PRICE DATA (WEEKLY RATE OF RETURN)

Week   Allied Chemical   Du Pont     Union Carbide   Exxon       Texaco
  1        .000000        .000000       .000000       .039473     .000000
  2        .027027       -.044855      -.003030       .014466     .043478
  3        .122807        .060773       .088146       .086238     .078124
  4        .057031        .029948       .066808       .013513     .019512
  5        .063670       -.003793      -.039788      -.018644     .024154
  6        .003521        .050761       .082873       .074265     .049504
  7       -.045614       -.033007       .002551      -.009646    -.028301
  8        .058823        .041719       .081425      -.014610     .014563
  9        .000000       -.019417       .002353       .001647    -.028708
 10        .006944       -.025990       .007042      -.041118     .024630
  :           :              :             :             :           :
 91       -.044068        .020704      -.006224      -.018518     .004694
 92        .039007        .038540       .024988      -.028301     .032710
 93       -.039457       -.029297      -.065844      -.015837    -.045758
 94        .039568        .024145      -.006608       .028423     .009661
 95       -.031142       -.007941       .011080       .007537     .014634
 96        .000000       -.020080      -.006579       .029925    -.004807
 97        .021429        .049180       .006622      -.002421     .028985
 98        .045454        .046375       .074561       .014563     .018779
 99        .050167        .036380       .004082       .011961     .009216
100        .019108       -.033303       .008362       .033898     .004566

(a) Construct the sample covariance matrix S, and find the sample principal components in (8-20). (Note that the sample mean vector $\bar{x}$ is displayed in Example 8.5.)
(b) Determine the proportion of the total sample variance explained by the first three principal components. Interpret these components.
(c) Construct Bonferroni simultaneous 90% confidence intervals for the variances $\lambda_1$, $\lambda_2$, and $\lambda_3$ of the first three population components $Y_1$, $Y_2$, and $Y_3$.
(d) Given the results in Parts a-c, do you feel that the stock rates-of-return data can be summarized in fewer than five dimensions? Explain.

8.11. Consider the census-tract data listed in Table 8.5. Suppose the observations on $X_5$ = median value home were recorded in thousands, rather than ten thousands, of dollars; that is, multiply all the numbers listed in the sixth column of the table by 10.
(a) Construct the sample covariance matrix S for the census-tract data when $X_5$ = median value home is recorded in thousands of dollars. (Note that this covariance matrix can be obtained from the covariance matrix given in Example 8.3 by multiplying the off-diagonal elements in the fifth column and row by 10 and the diagonal element $s_{55}$ by 100. Why?)
(b) Obtain the eigenvalue-eigenvector pairs and the first two sample principal components for the covariance matrix in Part a.
(c) Compute the proportion of total variance explained by the first two principal components obtained in Part b. Calculate the correlation coefficients, $r_{\hat{y}_i, x_k}$, and interpret these components if possible. Compare your results with the results in Example 8.3. What can you say about the effects of this change in scale on the principal components?

TABLE 8.5 CENSUS-TRACT DATA

        Total         Median     Total         Health services   Median value
        population    school     employment    employment        home
Tract   (thousands)   years      (thousands)   (hundreds)        ($10,000s)
  1       5.935        14.2        2.265          2.27              2.91
  2       1.523        13.1         .597           .75              2.62
  3       2.599        12.7        1.237          1.11              1.72
  4       4.009        15.2        1.649           .81              3.02
  5       4.687        14.7        2.312          2.50              2.22
  6       8.044        15.6        3.641          4.51              2.36
  7       2.766        13.3        1.244          1.03              1.97
  8       6.538        17.0        2.618          2.39              1.85
  9       6.451        12.9        3.147          5.52              2.01
 10       3.314        12.2        1.606          2.18              1.82
 11       3.777        13.0        2.119          2.83              1.80
 12       1.530        13.8         .798           .84              4.25
 13       2.768        13.6        1.336          1.75              2.64
 14       6.585        14.9        2.763          1.91              3.17

Note: Observations from adjacent census tracts are likely to be correlated. That is, these 14 observations may not constitute a random sample.

8.12. Consider the air-pollution data listed in Table 1.5. Your job is to summarize these data in fewer than p = 7 dimensions if possible. Conduct a principal component analysis of the data using both the covariance matrix S and the correlation matrix R. What have you learned? Does it make any difference which matrix is chosen for analysis? Can the data be summarized in three or fewer dimensions? Can you interpret the principal components?

8.13. In the radiotherapy data listed in Table 1.7 (see also the radiotherapy data on the CD-ROM), the n = 98 observations on p = 6 variables represent patients' reactions to radiotherapy.
(a) Obtain the covariance and correlation matrices S and R for these data.
(b) Pick one of the matrices S or R (justify your choice), and determine the eigenvalues and eigenvectors. Prepare a table showing, in decreasing order of size, the percent that each eigenvalue contributes to the total sample variance.
(c) Given the results in Part b, decide on the number of important sample principal components. Is it possible to summarize the radiotherapy data with a single reaction-index component? Explain.
(d) Prepare a table of the correlation coefficients between each principal component you decide to retain and the original variables. If possible, interpret the components.

8.14. Perform a principal component analysis using the sample covariance matrix of the sweat data given in Example 5.2. Construct a Q-Q plot for each of the important principal components. Are there any suspect observations? Explain.

8.15. The four sample standard deviations for the postbirth weights discussed in Example 8.6 are

$$\sqrt{s_{11}} = 32.9909,\quad \sqrt{s_{22}} = 33.5918,\quad \sqrt{s_{33}} = 36.5534,\quad \text{and}\quad \sqrt{s_{44}} = 37.3517$$

Use these and the correlations given in Example 8.6 to construct the sample covariance matrix S. Perform a principal component analysis using S.

8.16. Over a period of five years in the 1990s, yearly samples of fishermen on 28 lakes in Wisconsin were asked to report the time they spent fishing and how many of each type of game fish they caught. Their responses were then converted to a catch rate per hour for

$x_1$ = Bluegill, $x_2$ = Black crappie, $x_3$ = Smallmouth bass, $x_4$ = Largemouth bass, $x_5$ = Walleye, $x_6$ = Northern pike

The estimated correlation matrix (courtesy of Jodi Barnet)

$$R = \begin{bmatrix}
1 & .4919 & .2636 & .4653 & -.2277 & .0652 \\
.4919 & 1 & .3127 & .3506 & -.1917 & .2045 \\
.2635 & .3127 & 1 & .4108 & .0647 & .2493 \\
.4653 & .3506 & .4108 & 1 & -.2249 & .2293 \\
-.2277 & -.1917 & .0647 & -.2249 & 1 & -.2144 \\
.0652 & .2045 & .2493 & .2293 & -.2144 & 1
\end{bmatrix}$$

is based on a sample of about 120. (There were a few missing values.)
Fish caught by the same fisherman live alongside each other, so the data should provide some evidence on how the fish group. The first four fish belong to the centrarchids, the most plentiful family. The walleye is the most popular fish to eat.
(a) Comment on the pattern of correlation within the centrarchid family $x_1$ through $x_4$. Does the walleye appear to group with the other fish?
(b) Perform a principal component analysis using only $x_1$ through $x_4$. Interpret your results.
(c) Perform a principal component analysis using all six variables. Interpret your results.

8.17. Using the data on bone mineral content in Table 1.8, perform a principal component analysis of S.

8.18. The data on national track records for women are listed in Table 1.9.
(a) Obtain the sample correlation matrix R for these data, and determine its eigenvalues and eigenvectors.
(b) Determine the first two principal components for the standardized variables. Prepare a table showing the correlations of the standardized variables with the components, and the cumulative percentage of the total (standardized) sample variance explained by the two components.
(c) Interpret the two principal components obtained in Part b. (Note that the first component is essentially a normalized unit vector and might measure the athletic excellence of a given nation. The second component might measure the relative strength of a nation at the various running distances.)
(d) Rank the nations based on their score on the first principal component. Does this ranking correspond with your intuitive notion of athletic excellence for the various countries?

8.19. Refer to Exercise 8.18. Convert the national track records for women in Table 1.9 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 3000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal components analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.18. Do your interpretations of the components differ? If the nations are ranked on the basis of their score on the first principal component, does the subsequent ranking differ from that in Exercise 8.18? Which analysis do you prefer? Why?

8.20. The data on national track records for men are listed in Table 8.6. (See also the data on national track records for men on the CD-ROM.) Repeat the principal component analysis outlined in Exercise 8.18 for the men. Are the results consistent with those obtained from the women's data?

8.21. Refer to Exercise 8.20. Convert the national track records for men in Table 8.6 to speeds measured in meters per second. Notice that the records for 800 m, 1500 m, 5000 m, 10,000 m, and the marathon are given in minutes. The marathon is 26.2 miles, or 42,195 meters, long. Perform a principal component analysis using the covariance matrix S of the speed data. Compare the results with the results in Exercise 8.20. Which analysis do you prefer? Why?
TABLE 8.6 NATIONAL TRACK RECORDS FOR MEN

                                         100 m   200 m   400 m   800 m   1500 m  5000 m  10,000 m  Marathon
Country                                  (s)     (s)     (s)     (min)   (min)   (min)   (min)     (min)
Argentina                                10.39   20.81   46.84   1.81    3.70    14.04   29.36     137.72
Australia                                10.31   20.06   44.84   1.74    3.57    13.28   27.66     128.30
Austria                                  10.44   20.81   46.82   1.79    3.60    13.26   27.72     135.90
Belgium                                  10.34   20.68   45.04   1.73    3.60    13.22   27.45     129.95
Bermuda                                  10.28   20.58   45.91   1.80    3.75    14.68   30.55     146.62
Brazil                                   10.22   20.43   45.21   1.73    3.66    13.62   28.62     133.13
Burma                                    10.64   21.52   48.30   1.80    3.85    14.45   30.28     139.95
Canada                                   10.17   20.22   45.68   1.76    3.63    13.55   28.09     130.15
Chile                                    10.34   20.80   46.20   1.79    3.71    13.61   29.30     134.03
China                                    10.51   21.04   47.30   1.81    3.73    13.90   29.13     133.53
Colombia                                 10.43   21.05   46.10   1.82    3.74    13.49   27.88     131.35
Cook Islands                             12.18   23.20   52.94   2.02    4.24    16.70   35.38     164.70
Costa Rica                               10.94   21.90   48.66   1.87    3.84    14.03   28.81     136.58
Czechoslovakia                           10.35   20.65   45.64   1.76    3.58    13.42   28.19     134.32
Denmark                                  10.56   20.52   45.89   1.78    3.61    13.50   28.11     130.78
Dominican Republic                       10.14   20.65   46.80   1.82    3.82    14.91   31.45     154.12
Finland                                  10.43   20.69   45.49   1.74    3.61    13.27   27.52     130.87
France                                   10.11   20.38   45.28   1.73    3.57    13.34   27.97     132.30
German Democratic Republic               10.12   20.33   44.87   1.73    3.56    13.17   27.42     129.92
Federal Republic of Germany              10.16   20.37   44.50   1.73    3.53    13.21   27.61     132.23
Great Britain and Northern Ireland       10.11   20.21   44.93   1.70    3.51    13.01   27.51     129.13
Greece                                   10.22   20.71   46.56   1.78    3.64    14.59   28.45     134.60
Guatemala                                10.98   21.82   48.40   1.89    3.80    14.16   30.11     139.33
Hungary                                  10.26   20.62   46.02   1.77    3.62    13.49   28.44     132.58
India                                    10.60   21.42   45.73   1.76    3.73    13.77   28.81     131.98
Indonesia                                10.59   21.49   47.80   1.84    3.92    14.73   30.79     148.83
Ireland                                  10.61   20.96   46.30   1.79    3.56    13.32   27.81     132.35
Israel                                   10.71   21.00   47.80   1.77    3.72    13.66   28.93     137.55
Italy                                    10.01   19.72   45.26   1.73    3.60    13.23   27.52     131.08
Japan                                    10.34   20.81   45.86   1.79    3.64    13.41   27.72     128.63
Kenya                                    10.46   20.66   44.92   1.73    3.55    13.10   27.38     129.75
Korea                                    10.34   20.89   46.90   1.79    3.77    13.96   29.23     136.25
Democratic People's Republic of Korea    10.91   21.94   47.30   1.85    3.77    14.13   29.67     130.87
Luxembourg                               10.35   20.77   47.40   1.82    3.67    13.64   29.08     141.27
Malaysia                                 10.40   20.92   46.30   1.82    3.80    14.64   31.01     154.10
Mauritius                                11.19   22.45   47.70   1.88    3.83    15.06   31.77     152.23
Mexico                                   10.42   21.30   46.10   1.80    3.65    13.46   27.95     129.20
Netherlands                              10.52   20.95   45.10   1.74    3.62    13.36   27.61     129.02
New Zealand                              10.51   20.88   46.10   1.74    3.54    13.21   27.70     128.98
Norway                                   10.55   21.16   46.71   1.76    3.62    13.34   27.69     131.48
Papua New Guinea                         10.96   21.78   47.90   1.90    4.01    14.72   31.36     148.22
Philippines                              10.78   21.64   46.24   1.81    3.83    14.74   30.64     145.27
Poland                                   10.16   20.24   45.36   1.76    3.60    13.29   27.89     131.58
Portugal                                 10.53   21.17   46.70   1.79    3.62    13.13   27.38     128.65
Rumania                                  10.41   20.98   45.87   1.76    3.64    13.25   27.67     132.50
Singapore                                10.38   21.28   47.40   1.88    3.89    15.11   31.32     157.77
Spain                                    10.42   20.77   45.98   1.76    3.55    13.31   27.73     131.57
Sweden                                   10.25   20.61   45.63   1.77    3.61    13.29   27.94     130.63
Switzerland                              10.37   20.46   45.78   1.78    3.55    13.22   27.91     131.20
Taipei                                   10.59   21.29   46.80   1.79    3.77    14.07   30.07     139.27
Thailand                                 10.39   21.09   47.91   1.83    3.84    15.23   32.65     149.90
Turkey                                   10.71   21.43   47.60   1.79    3.67    13.56   28.58     131.50
USA                                       9.93   19.75   43.86   1.73    3.53    13.20   27.43     128.22
USSR                                     10.07   20.00   44.60   1.75    3.59    13.20   27.53     130.55
Western Samoa                            10.82   21.86   49.00   2.02    4.24    16.28   34.71     161.83

Source: IAAF/ATFS Track and Field Statistics Handbook for the 1984 Los Angeles Olympics.
8.22. Consider the data on bulls in Table 1.10. Utilizing the seven variables YrHgt, FtFrBody, PrctFFB, Frame, BkFat, SaleHt, and SaleWt, perform a principal component analysis using the covariance matrix S and the correlation matrix R. Your analysis should include the following:
(a) Determine the appropriate number of components to effectively summarize the sample variability. Construct a scree plot to aid your determination.
(b) Interpret the sample principal components.
(c) Do you think it is possible to develop a "body size" or "body configuration" index from the data on the seven variables above? Explain.
(d) Using the values for the first two principal components, plot the data in a two-dimensional space with $\hat{y}_1$ along the vertical axis and $\hat{y}_2$ along the horizontal axis. Can you distinguish groups representing the three breeds of cattle? Are there any outliers?
(e) Construct a Q-Q plot using the first principal component. Interpret the plot.

8.23. A naturalist for the Alaska Fish and Game Department studies grizzly bears with the goal of maintaining a healthy population. Measurements on n = 61 bears provided the following summary statistics:

Variable              Weight   Body length   Neck    Girth   Head length   Head width
                      (kg)     (cm)          (cm)    (cm)    (cm)          (cm)
Sample mean $\bar{x}$ 95.52    164.38        55.69   93.39   31.13         17.98

Covariance matrix

$$S = \begin{bmatrix}
3266.46 & 1343.97 & 731.54 & 1175.50 & 162.68 & 238.37 \\
1343.97 & 721.91 & 324.25 & 537.35 & 80.17 & 117.73 \\
731.54 & 324.25 & 179.28 & 281.17 & 39.15 & 56.80 \\
1175.50 & 537.35 & 281.17 & 474.98 & 63.73 & 94.85 \\
162.68 & 80.17 & 39.15 & 63.73 & 9.95 & 13.88 \\
238.37 & 117.73 & 56.80 & 94.85 & 13.88 & 21.26
\end{bmatrix}$$

(a) Perform a principal component analysis using the covariance matrix. Can the data be effectively summarized in fewer than six dimensions?
(b) Perform a principal component analysis using the correlation matrix.
(c) Comment on the similarities and differences between the two analyses.

8.24. Refer to Example 8.10 and the data in Table 5.8, page 240. Add the variable $x_6$ = regular overtime hours whose values are (read across)
6187 7679
7336 8259
6988 6964 10954 9353 and redo Example 8. 10.
8425 6291
6778 4969
5922 4825
7307 6019
8.25. Refer to the police overtime hours data in Example 8.10. Construct an alternate control chart, based on the sum of squares $d_{Uj}^2$, to monitor the unexplained variation in the original observations summarized by the additional principal components.
REFERENCES
1. Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2d ed.). New York: John Wiley, 1984.
2. Anderson, T. W. "Asymptotic Theory for Principal Components Analysis." Annals of Mathematical Statistics, 34 (1963), 122-148.
3. Bartlett, M. S. "A Note on Multiplying Factors for Various Chi-Squared Approximations." Journal of the Royal Statistical Society (B), 16 (1954), 296-298.
4. Dawkins, B. "Multivariate Analysis of National Track Records." The American Statistician, 43 (1989), 110-115.
5. Girschick, M. A. "On the Sampling Theory of Roots of Determinantal Equations." Annals of Mathematical Statistics, 10 (1939), 203-224.
6. Hotelling, H. "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24 (1933), 417-441, 498-520.
7. Hotelling, H. "The Most Predictable Criterion." Journal of Educational Psychology, 26 (1935), 139-142.
8. Hotelling, H. "Simplified Calculation of Principal Components." Psychometrika, 1 (1936), 27-35.
9. Hotelling, H. "Relations between Two Sets of Variates." Biometrika, 28 (1936), 321-377.
10. Jolicoeur, P. "The Multivariate Generalization of the Allometry Equation." Biometrics, 19 (1963), 497-499.
11. Jolicoeur, P., and J. E. Mosimann. "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis." Growth, 24 (1960), 339-354.
12. King, B. "Market and Industry Factors in Stock Price Behavior." Journal of Business, 39 (1966), 139-190.
13. Kourti, T., and J. McGregor. "Multivariate SPC Methods for Process and Product Monitoring." Journal of Quality Technology, 28 (1996), 409-428.
14. Lawley, D. N. "On Testing a Set of Correlation Coefficients for Equality." Annals of Mathematical Statistics, 34 (1963), 149-151.
15. Maxwell, A. E. Multivariate Analysis in Behavioural Research. London: Chapman and Hall, 1977.
16. Rao, C. R. Linear Statistical Inference and Its Applications (2d ed.). New York: John Wiley, 1973.
17. Rencher, A. C. "Interpretation of Canonical Discriminant Functions, Canonical Variates and Principal Components." The American Statistician, 46 (1992), 217-225.
CHAPTER 9

Factor Analysis and Inference for Structured Covariance Matrices

9.1 INTRODUCTION

Factor analysis has provoked rather turbulent controversy throughout its history. Its modern beginnings lie in the early 20th-century attempts of Karl Pearson, Charles Spearman, and others to define and measure intelligence. Because of this early association with constructs such as intelligence, factor analysis was nurtured and developed primarily by scientists interested in psychometrics. Arguments over the psychological interpretations of several early studies and the lack of powerful computing facilities impeded its initial development as a statistical method. The advent of high-speed computers has generated a renewed interest in the theoretical and computational aspects of factor analysis. Most of the original techniques have been abandoned and early controversies resolved in the wake of recent developments. It is still true, however, that each application of the technique must be examined on its own merits to determine its success.

The essential purpose of factor analysis is to describe, if possible, the covariance relationships among many variables in terms of a few underlying, but unobservable, random quantities called factors. Basically, the factor model is motivated by the following argument: Suppose variables can be grouped by their correlations. That is, suppose all variables within a particular group are highly correlated among themselves, but have relatively small correlations with variables in a different group. Then it is conceivable that each group of variables represents a single underlying construct, or factor, that is responsible for the observed correlations. For example, correlations from the group of test scores in classics, French, English, mathematics, and music collected by Spearman suggested an underlying "intelligence" factor. A second group of variables, representing physical-fitness scores, if available, might correspond to another factor. It is this type of structure that factor analysis seeks to confirm.
Factor analysis can be considered an extension of principal component analysis. Both can be viewed as attempts to approximate the covariance matrix $\boldsymbol{\Sigma}$. However, the approximation based on the factor analysis model is more elaborate. The primary question in factor analysis is whether the data are consistent with a prescribed structure.

9.2 THE ORTHOGONAL FACTOR MODEL

The observable random vector $\mathbf{X}$, with $p$ components, has mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. The factor model postulates that $\mathbf{X}$ is linearly dependent upon a few unobservable random variables $F_1, F_2, \ldots, F_m$, called common factors, and $p$ additional sources of variation $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$, called errors or, sometimes, specific factors.¹ In particular, the factor analysis model is

\[
\begin{aligned}
X_1 - \mu_1 &= \ell_{11}F_1 + \ell_{12}F_2 + \cdots + \ell_{1m}F_m + \varepsilon_1 \\
X_2 - \mu_2 &= \ell_{21}F_1 + \ell_{22}F_2 + \cdots + \ell_{2m}F_m + \varepsilon_2 \\
&\;\;\vdots \\
X_p - \mu_p &= \ell_{p1}F_1 + \ell_{p2}F_2 + \cdots + \ell_{pm}F_m + \varepsilon_p
\end{aligned}
\tag{9-1}
\]

or, in matrix notation,

\[
\underset{(p \times 1)}{\mathbf{X} - \boldsymbol{\mu}} = \underset{(p \times m)}{\mathbf{L}}\;\underset{(m \times 1)}{\mathbf{F}} + \underset{(p \times 1)}{\boldsymbol{\varepsilon}}
\tag{9-2}
\]

The coefficient $\ell_{ij}$ is called the loading of the $i$th variable on the $j$th factor, so the matrix $\mathbf{L}$ is the matrix of factor loadings. Note that the $i$th specific factor $\varepsilon_i$ is associated only with the $i$th response $X_i$. The $p$ deviations $X_1 - \mu_1, X_2 - \mu_2, \ldots, X_p - \mu_p$ are expressed in terms of $p + m$ random variables $F_1, F_2, \ldots, F_m, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p$ which are unobservable. This distinguishes the factor model of (9-2) from the multivariate regression model in (7-26), in which the independent variables [whose position is occupied by $\mathbf{F}$ in (9-2)] can be observed.

With so many unobservable quantities, a direct verification of the factor model from observations on $X_1, X_2, \ldots, X_p$ is hopeless. However, with some additional assumptions about the random vectors $\mathbf{F}$ and $\boldsymbol{\varepsilon}$, the model in (9-2) implies certain covariance relationships, which can be checked. We assume that

\[
\begin{aligned}
E(\mathbf{F}) &= \underset{(m \times 1)}{\mathbf{0}}, &
\operatorname{Cov}(\mathbf{F}) &= E[\mathbf{F}\mathbf{F}'] = \underset{(m \times m)}{\mathbf{I}} \\
E(\boldsymbol{\varepsilon}) &= \underset{(p \times 1)}{\mathbf{0}}, &
\operatorname{Cov}(\boldsymbol{\varepsilon}) &= E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'] = \underset{(p \times p)}{\boldsymbol{\Psi}} =
\begin{bmatrix}
\psi_1 & 0 & \cdots & 0 \\
0 & \psi_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \psi_p
\end{bmatrix}
\end{aligned}
\tag{9-3}
\]

¹ As Maxwell [22] points out, in many investigations the $\varepsilon_i$ tend to be combinations of measurement error and factors that are uniquely associated with the individual variables.
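and that $\mathbf{F}$ and $\boldsymbol{\varepsilon}$ are independent, so

\[
\operatorname{Cov}(\boldsymbol{\varepsilon}, \mathbf{F}) = E(\boldsymbol{\varepsilon}\mathbf{F}') = \underset{(p \times m)}{\mathbf{0}}
\]

These assumptions and the relation in (9-2) constitute the orthogonal factor model.²

The orthogonal factor model implies a covariance structure for $\mathbf{X}$. From the model in (9-4),

\[
\begin{aligned}
(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})'
&= (\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})(\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})' \\
&= (\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})\bigl((\mathbf{L}\mathbf{F})' + \boldsymbol{\varepsilon}'\bigr) \\
&= \mathbf{L}\mathbf{F}(\mathbf{L}\mathbf{F})' + \boldsymbol{\varepsilon}(\mathbf{L}\mathbf{F})' + \mathbf{L}\mathbf{F}\boldsymbol{\varepsilon}' + \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'
\end{aligned}
\]

so that

\[
\begin{aligned}
\boldsymbol{\Sigma} = \operatorname{Cov}(\mathbf{X}) &= E(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})' \\
&= \mathbf{L}E(\mathbf{F}\mathbf{F}')\mathbf{L}' + E(\boldsymbol{\varepsilon}\mathbf{F}')\mathbf{L}' + \mathbf{L}E(\mathbf{F}\boldsymbol{\varepsilon}') + E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') \\
&= \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}
\end{aligned}
\]

according to (9-3); the two cross-product terms vanish because, by independence, $\operatorname{Cov}(\boldsymbol{\varepsilon}, \mathbf{F}) = E(\boldsymbol{\varepsilon}\mathbf{F}') = \mathbf{0}$. Also, by the model in (9-4), $(\mathbf{X} - \boldsymbol{\mu})\mathbf{F}' = (\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})\mathbf{F}' = \mathbf{L}\mathbf{F}\mathbf{F}' + \boldsymbol{\varepsilon}\mathbf{F}'$, so $\operatorname{Cov}(\mathbf{X}, \mathbf{F}) = E(\mathbf{X} - \boldsymbol{\mu})\mathbf{F}' = \mathbf{L}E(\mathbf{F}\mathbf{F}') + E(\boldsymbol{\varepsilon}\mathbf{F}') = \mathbf{L}$.

² Allowing the factors $\mathbf{F}$ to be correlated so that $\operatorname{Cov}(\mathbf{F})$ is not diagonal gives the oblique factor model. The oblique model presents some additional estimation difficulties and will not be discussed in this book. (See [20].)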
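The two covariance relations just derived, $\operatorname{Cov}(\mathbf{X}) = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ and $\operatorname{Cov}(\mathbf{X}, \mathbf{F}) = \mathbf{L}$, can be checked by simulation. The following is a minimal NumPy sketch, not part of the original text: the loading matrix, specific variances, sample size, and the choice of normal factors are all illustrative assumptions (the derivation itself needs only the moment conditions in (9-3) and independence).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for p = 4 variables and m = 2 factors (illustrative only).
L = np.array([[4.0, 1.0], [7.0, 2.0], [-1.0, 6.0], [1.0, 8.0]])
psi = np.array([2.0, 4.0, 1.0, 3.0])          # diagonal of Psi
mu = np.zeros(4)

n = 200_000
F = rng.standard_normal((n, 2))                   # E(F) = 0, Cov(F) = I
eps = rng.standard_normal((n, 4)) * np.sqrt(psi)  # E(eps) = 0, Cov(eps) = Psi
X = mu + F @ L.T + eps                            # each row is (mu + L F + eps)'

Sigma_model = L @ L.T + np.diag(psi)
Sigma_sample = np.cov(X, rowvar=False)
cov_XF = (X - X.mean(axis=0)).T @ (F - F.mean(axis=0)) / (n - 1)

max_err_sigma = np.abs(Sigma_sample - Sigma_model).max()
max_err_L = np.abs(cov_XF - L).max()
print("largest deviation from LL' + Psi:", max_err_sigma)
print("largest deviation of Cov(X, F) from L:", max_err_L)
```

Both deviations shrink as n grows, which is the sampling counterpart of the population identities above.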
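The model $\mathbf{X} - \boldsymbol{\mu} = \mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon}$ is linear in the common factors. If the $p$ responses $\mathbf{X}$ are, in fact, related to underlying factors, but the relationship is nonlinear, such as in $X_1 - \mu_1 = \ell_{11}F_1F_3 + \varepsilon_1$, $X_2 - \mu_2 = \ell_{21}F_2F_3 + \varepsilon_2$, and so forth, then the covariance structure $\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ given by (9-5) may not be adequate. The very important assumption of linearity is inherent in the formulation of the traditional factor model.

That portion of the variance of the $i$th variable contributed by the $m$ common factors is called the $i$th communality. That portion of $\operatorname{Var}(X_i) = \sigma_{ii}$ due to the specific factor is often called the uniqueness, or specific variance. Denoting the $i$th communality by $h_i^2$, we see from (9-5) that

\[
\underbrace{\sigma_{ii}}_{\operatorname{Var}(X_i)} = \underbrace{\ell_{i1}^2 + \ell_{i2}^2 + \cdots + \ell_{im}^2}_{\text{communality}} + \underbrace{\psi_i}_{\text{specific variance}}
\]

or

\[
h_i^2 = \ell_{i1}^2 + \ell_{i2}^2 + \cdots + \ell_{im}^2
\tag{9-6}
\]

and

\[
\sigma_{ii} = h_i^2 + \psi_i, \qquad i = 1, 2, \ldots, p
\]

The $i$th communality is the sum of squares of the loadings of the $i$th variable on the $m$ common factors.

Example 9.1 (Verifying the relation $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ for two factors)

Consider the covariance matrix

\[
\boldsymbol{\Sigma} = \begin{bmatrix}
19 & 30 & 2 & 12 \\
30 & 57 & 5 & 23 \\
2 & 5 & 38 & 47 \\
12 & 23 & 47 & 68
\end{bmatrix}
\]

The equality

\[
\begin{bmatrix}
19 & 30 & 2 & 12 \\
30 & 57 & 5 & 23 \\
2 & 5 & 38 & 47 \\
12 & 23 & 47 & 68
\end{bmatrix}
=
\begin{bmatrix}
4 & 1 \\
7 & 2 \\
-1 & 6 \\
1 & 8
\end{bmatrix}
\begin{bmatrix}
4 & 7 & -1 & 1 \\
1 & 2 & 6 & 8
\end{bmatrix}
+
\begin{bmatrix}
2 & 0 & 0 & 0 \\
0 & 4 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 3
\end{bmatrix}
\]

or

\[
\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}
\]

may be verified by matrix algebra. Therefore, $\boldsymbol{\Sigma}$ has the structure produced by an $m = 2$ orthogonal factor model. Since

\[
\mathbf{L} = \begin{bmatrix}
\ell_{11} & \ell_{12} \\
\ell_{21} & \ell_{22} \\
\ell_{31} & \ell_{32} \\
\ell_{41} & \ell_{42}
\end{bmatrix}
= \begin{bmatrix}
4 & 1 \\
7 & 2 \\
-1 & 6 \\
1 & 8
\end{bmatrix},
\qquad
\boldsymbol{\Psi} = \begin{bmatrix}
\psi_1 & 0 & 0 & 0 \\
0 & \psi_2 & 0 & 0 \\
0 & 0 & \psi_3 & 0 \\
0 & 0 & 0 & \psi_4
\end{bmatrix}
= \begin{bmatrix}
2 & 0 & 0 & 0 \\
0 & 4 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 3
\end{bmatrix}
\]

the communality of $X_1$ is, from (9-6),

\[
h_1^2 = \ell_{11}^2 + \ell_{12}^2 = 4^2 + 1^2 = 17
\]

and the variance of $X_1$ can be decomposed as

\[
\sigma_{11} = (\ell_{11}^2 + \ell_{12}^2) + \psi_1 = h_1^2 + \psi_1
\]

or

\[
\underbrace{19}_{\text{variance}} = \underbrace{4^2 + 1^2}_{\text{communality}} + \underbrace{2}_{\text{specific variance}} = 17 + 2
\]

A similar breakdown occurs for the other variables. ■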
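The matrix algebra in Example 9.1 is easy to check numerically. The sketch below is an illustration added here, not part of the original example; it takes the loading matrix with rows (4, 1), (7, 2), (-1, 6), (1, 8) and specific variances 2, 4, 1, 3, and confirms both the decomposition and the communalities of (9-6).

```python
import numpy as np

Sigma = np.array([[19, 30, 2, 12],
                  [30, 57, 5, 23],
                  [2, 5, 38, 47],
                  [12, 23, 47, 68]], dtype=float)
L = np.array([[4, 1], [7, 2], [-1, 6], [1, 8]], dtype=float)
Psi = np.diag([2.0, 4.0, 1.0, 3.0])

reconstructed = L @ L.T + Psi                 # LL' + Psi
communalities = (L ** 2).sum(axis=1)          # h_i^2 = sum over j of l_ij^2
variances = communalities + np.diag(Psi)      # sigma_ii = h_i^2 + psi_i

print(np.allclose(reconstructed, Sigma))
print(communalities)
print(variances)
```

The first communality is 17, so that 19 = 17 + 2 as in the text; the remaining rows give the "similar breakdown" for the other variables.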
The factor model assumes that the $p + p(p-1)/2 = p(p+1)/2$ variances and covariances for $\mathbf{X}$ can be reproduced from the $pm$ factor loadings $\ell_{ij}$ and the $p$ specific variances $\psi_i$. When $m = p$, any covariance matrix $\boldsymbol{\Sigma}$ can be reproduced exactly as $\mathbf{L}\mathbf{L}'$ [see (9-11)], so $\boldsymbol{\Psi}$ can be the zero matrix. However, it is when $m$ is small relative to $p$ that factor analysis is most useful. In this case, the factor model provides a "simple" explanation of the covariation in $\mathbf{X}$ with fewer parameters than the $p(p+1)/2$ parameters in $\boldsymbol{\Sigma}$. For example, if $\mathbf{X}$ contains $p = 12$ variables, and the factor model in (9-4) with $m = 2$ is appropriate, then the $p(p+1)/2 = 12(13)/2 = 78$ elements of $\boldsymbol{\Sigma}$ are described in terms of the $mp + p = 12(2) + 12 = 36$ parameters $\ell_{ij}$ and $\psi_i$ of the factor model.
Unfortunately for the factor analyst, most covariance matrices cannot be factored as $\mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$, where the number of factors $m$ is much less than $p$. The following example demonstrates one of the problems that can arise when attempting to determine the parameters $\ell_{ij}$ and $\psi_i$ from the variances and covariances of the observable variables.

Example 9.2 (Nonexistence of a proper solution)

Let $p = 3$ and $m = 1$, and suppose the random variables $X_1$, $X_2$, and $X_3$ have the positive definite covariance matrix

\[
\boldsymbol{\Sigma} = \begin{bmatrix}
1 & .9 & .7 \\
.9 & 1 & .4 \\
.7 & .4 & 1
\end{bmatrix}
\]

Using the factor model in (9-4), we obtain

\[
\begin{aligned}
X_1 - \mu_1 &= \ell_{11}F_1 + \varepsilon_1 \\
X_2 - \mu_2 &= \ell_{21}F_1 + \varepsilon_2 \\
X_3 - \mu_3 &= \ell_{31}F_1 + \varepsilon_3
\end{aligned}
\]

The covariance structure in (9-5) implies that

\[
\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}
\]

or

\[
\begin{aligned}
1 &= \ell_{11}^2 + \psi_1 & .90 &= \ell_{11}\ell_{21} & .70 &= \ell_{11}\ell_{31} \\
& & 1 &= \ell_{21}^2 + \psi_2 & .40 &= \ell_{21}\ell_{31} \\
& & & & 1 &= \ell_{31}^2 + \psi_3
\end{aligned}
\]

The pair of equations

\[
.70 = \ell_{11}\ell_{31}, \qquad .40 = \ell_{21}\ell_{31}
\]

implies that

\[
\ell_{21} = \left(\frac{.40}{.70}\right)\ell_{11}
\]

Substituting this result for $\ell_{21}$ in the equation $.90 = \ell_{11}\ell_{21}$ yields $\ell_{11}^2 = 1.575$, or $\ell_{11} = \pm 1.255$. Since $\operatorname{Var}(F_1) = 1$ (by assumption) and $\operatorname{Var}(X_1) = 1$, $\ell_{11} = \operatorname{Cov}(X_1, F_1) = \operatorname{Corr}(X_1, F_1)$. Now, a correlation coefficient cannot be greater than unity (in absolute value), so, from this point of view, $|\ell_{11}| = 1.255$ is too large. Also, the equation

\[
1 = \ell_{11}^2 + \psi_1, \quad \text{or} \quad \psi_1 = 1 - \ell_{11}^2
\]

gives

\[
\psi_1 = 1 - 1.575 = -.575
\]

which is unsatisfactory, since it gives a negative value for $\operatorname{Var}(\varepsilon_1) = \psi_1$.

Thus, for this example with $m = 1$, it is possible to get a unique numerical solution to the equations $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$. However, the solution is not consistent with the statistical interpretation of the coefficients, so it is not a proper solution. ■
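The arithmetic of Example 9.2 can be reproduced in a few lines. In the one-factor case with $p = 3$, the three off-diagonal equations determine the squared loadings in closed form, for instance $\ell_{11}^2 = \sigma_{12}\sigma_{13}/\sigma_{23}$. The sketch below (an added illustration; variable names are arbitrary) solves the equations and shows that the solution is improper even though the covariance matrix is positive definite.

```python
import numpy as np

Sigma = np.array([[1.0, 0.9, 0.7],
                  [0.9, 1.0, 0.4],
                  [0.7, 0.4, 1.0]])

# With m = 1: s12 = l1*l2, s13 = l1*l3, s23 = l2*l3, hence l1^2 = s12*s13/s23, etc.
s12, s13, s23 = Sigma[0, 1], Sigma[0, 2], Sigma[1, 2]
l_sq = np.array([s12 * s13 / s23, s12 * s23 / s13, s13 * s23 / s12])
loadings = np.sqrt(l_sq)                  # positive-sign solution
psi = np.diag(Sigma) - l_sq               # psi_i = sigma_ii - l_i1^2

is_pos_def = bool(np.all(np.linalg.eigvalsh(Sigma) > 0))

print("squared loadings:", l_sq)
print("specific variances:", psi)
print("Sigma positive definite:", is_pos_def)
```

The first squared loading exceeds 1 and the first specific variance is negative, which is exactly the impropriety described in the example.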
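When $m > 1$, there is always some inherent ambiguity associated with the factor model. To see this, let $\mathbf{T}$ be any $m \times m$ orthogonal matrix, so that $\mathbf{T}\mathbf{T}' = \mathbf{T}'\mathbf{T} = \mathbf{I}$. Then the expression in (9-2) can be written

\[
\mathbf{X} - \boldsymbol{\mu} = \mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon} = \mathbf{L}\mathbf{T}\mathbf{T}'\mathbf{F} + \boldsymbol{\varepsilon} = \mathbf{L}^*\mathbf{F}^* + \boldsymbol{\varepsilon}
\tag{9-7}
\]

where

\[
\mathbf{L}^* = \mathbf{L}\mathbf{T} \quad \text{and} \quad \mathbf{F}^* = \mathbf{T}'\mathbf{F}
\]

Since

\[
E(\mathbf{F}^*) = \mathbf{T}'E(\mathbf{F}) = \mathbf{0}
\]

and

\[
\operatorname{Cov}(\mathbf{F}^*) = \mathbf{T}'\operatorname{Cov}(\mathbf{F})\mathbf{T} = \mathbf{T}'\mathbf{T} = \underset{(m \times m)}{\mathbf{I}}
\]

it is impossible, on the basis of observations on $\mathbf{X}$, to distinguish the loadings $\mathbf{L}$ from the loadings $\mathbf{L}^*$. That is, the factors $\mathbf{F}$ and $\mathbf{F}^* = \mathbf{T}'\mathbf{F}$ have the same statistical properties, and even though the loadings $\mathbf{L}^*$ are, in general, different from the loadings $\mathbf{L}$, they both generate the same covariance matrix $\boldsymbol{\Sigma}$. That is,

\[
\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi} = \mathbf{L}\mathbf{T}\mathbf{T}'\mathbf{L}' + \boldsymbol{\Psi} = (\mathbf{L}^*)(\mathbf{L}^*)' + \boldsymbol{\Psi}
\tag{9-8}
\]

This ambiguity provides the rationale for "factor rotation," since orthogonal matrices correspond to rotations (and reflections) of the coordinate system for $\mathbf{X}$.

The analysis of the factor model proceeds by imposing conditions that allow one to uniquely estimate $\mathbf{L}$ and $\boldsymbol{\Psi}$. The loading matrix is then rotated (multiplied by an orthogonal matrix), where the rotation is determined by some "ease-of-interpretation" criterion. Once the loadings and specific variances are obtained, factors are identified, and estimated values for the factors themselves (called factor scores) are frequently constructed.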
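The rotational ambiguity in (9-7) and (9-8) can be seen directly with numbers. The following sketch is an added illustration: the loading matrix, specific variances, and the 30-degree rotation angle are arbitrary choices. It rotates the loadings by an orthogonal matrix and confirms that the implied covariance matrix and the communalities do not change, although the individual loadings do.

```python
import numpy as np

# Illustrative loadings and specific variances (p = 4, m = 2).
L = np.array([[4.0, 1.0], [7.0, 2.0], [-1.0, 6.0], [1.0, 8.0]])
Psi = np.diag([2.0, 4.0, 1.0, 3.0])

theta = np.deg2rad(30.0)                       # arbitrary rotation angle
T = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])  # orthogonal: T T' = T' T = I
L_star = L @ T                                 # rotated loadings L* = L T

is_orthogonal = np.allclose(T @ T.T, np.eye(2))
same_cov = np.allclose(L @ L.T + Psi, L_star @ L_star.T + Psi)
same_comm = np.allclose((L ** 2).sum(axis=1), (L_star ** 2).sum(axis=1))
loadings_differ = not np.allclose(L, L_star)

print(is_orthogonal, same_cov, same_comm, loadings_differ)
```

Because every orthogonal T gives an equally valid loading matrix, a rotation criterion is free to pick the T that makes the loadings easiest to interpret.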
9.3 METHODS OF ESTIMATION
Given observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ on $p$ generally correlated variables, factor analysis seeks to answer the question, Does the factor model of (9-4), with a small number of factors, adequately represent the data? In essence, we tackle this statistical model-building problem by trying to verify the covariance relationship in (9-5).

The sample covariance matrix $\mathbf{S}$ is an estimator of the unknown population covariance matrix $\boldsymbol{\Sigma}$. If the off-diagonal elements of $\mathbf{S}$ are small or those of the sample correlation matrix $\mathbf{R}$ essentially zero, the variables are not related, and a factor analysis will not prove useful. In these circumstances, the specific factors play the dominant role, whereas the major aim of factor analysis is to determine a few important common factors.

If $\boldsymbol{\Sigma}$ appears to deviate significantly from a diagonal matrix, then a factor model can be entertained, and the initial problem is one of estimating the factor loadings $\ell_{ij}$ and specific variances $\psi_i$. We shall consider two of the most popular methods of parameter estimation, the principal component (and the related principal factor) method and the maximum likelihood method. The solution from either method can be rotated in order to simplify the interpretation of factors, as described in Section 9.4. It is always prudent to try more than one method of solution; if the factor model is appropriate for the problem at hand, the solutions should be consistent with one another.

Current estimation and rotation methods require iterative calculations that must be done on a computer. Several computer programs are now available for this purpose.

The Principal Component (and Principal Factor) Method
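The spectral decomposition of (2-20) provides us with one factoring of the covariance matrix $\boldsymbol{\Sigma}$. Let $\boldsymbol{\Sigma}$ have eigenvalue-eigenvector pairs $(\lambda_i, \mathbf{e}_i)$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then

\[
\boldsymbol{\Sigma} = \lambda_1\mathbf{e}_1\mathbf{e}_1' + \lambda_2\mathbf{e}_2\mathbf{e}_2' + \cdots + \lambda_p\mathbf{e}_p\mathbf{e}_p'
= \begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1 & \sqrt{\lambda_2}\,\mathbf{e}_2 & \cdots & \sqrt{\lambda_p}\,\mathbf{e}_p \end{bmatrix}
\begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1' \\ \sqrt{\lambda_2}\,\mathbf{e}_2' \\ \vdots \\ \sqrt{\lambda_p}\,\mathbf{e}_p' \end{bmatrix}
\tag{9-10}
\]

This fits the prescribed covariance structure for the factor analysis model having as many factors as variables ($m = p$) and specific variances $\psi_i = 0$ for all $i$. The loading matrix has $j$th column given by $\sqrt{\lambda_j}\,\mathbf{e}_j$. That is, we can write

\[
\underset{(p \times p)}{\boldsymbol{\Sigma}} = \underset{(p \times p)}{\mathbf{L}}\;\underset{(p \times p)}{\mathbf{L}'} + \underset{(p \times p)}{\mathbf{0}} = \mathbf{L}\mathbf{L}'
\tag{9-11}
\]

Apart from the scale factor $\sqrt{\lambda_j}$, the factor loadings on the $j$th factor are the coefficients for the $j$th principal component of the population.

Although the factor analysis representation of $\boldsymbol{\Sigma}$ in (9-11) is exact, it is not particularly useful: It employs as many common factors as there are variables and does not allow for any variation in the specific factors $\boldsymbol{\varepsilon}$ in (9-4). We prefer models that explain the covariance structure in terms of just a few common factors. One approach, when the last $p - m$ eigenvalues are small, is to neglect the contribution of $\lambda_{m+1}\mathbf{e}_{m+1}\mathbf{e}_{m+1}' + \cdots + \lambda_p\mathbf{e}_p\mathbf{e}_p'$ to $\boldsymbol{\Sigma}$ in (9-10). Neglecting this contribution, we obtain the approximation

\[
\boldsymbol{\Sigma} \doteq
\begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1 & \sqrt{\lambda_2}\,\mathbf{e}_2 & \cdots & \sqrt{\lambda_m}\,\mathbf{e}_m \end{bmatrix}
\begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1' \\ \sqrt{\lambda_2}\,\mathbf{e}_2' \\ \vdots \\ \sqrt{\lambda_m}\,\mathbf{e}_m' \end{bmatrix}
= \underset{(p \times m)}{\mathbf{L}}\;\underset{(m \times p)}{\mathbf{L}'}
\tag{9-12}
\]

The approximate representation in (9-12) assumes that the specific factors $\boldsymbol{\varepsilon}$ in (9-4) are of minor importance and can also be ignored in the factoring of $\boldsymbol{\Sigma}$. If specific factors are included in the model, their variances may be taken to be the diagonal elements of $\boldsymbol{\Sigma} - \mathbf{L}\mathbf{L}'$, where $\mathbf{L}\mathbf{L}'$ is as defined in (9-12).

Allowing for specific factors, we find that the approximation becomes

\[
\boldsymbol{\Sigma} \doteq \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}
= \begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1 & \sqrt{\lambda_2}\,\mathbf{e}_2 & \cdots & \sqrt{\lambda_m}\,\mathbf{e}_m \end{bmatrix}
\begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1' \\ \sqrt{\lambda_2}\,\mathbf{e}_2' \\ \vdots \\ \sqrt{\lambda_m}\,\mathbf{e}_m' \end{bmatrix}
+ \begin{bmatrix}
\psi_1 & 0 & \cdots & 0 \\
0 & \psi_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \psi_p
\end{bmatrix}
\tag{9-13}
\]

where $\psi_i = \sigma_{ii} - \sum_{j=1}^{m} \ell_{ij}^2$ for $i = 1, 2, \ldots, p$.

To apply this approach to a data set $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, it is customary first to center the observations by subtracting the sample mean $\bar{\mathbf{x}}$. The centered observations

\[
\mathbf{x}_j - \bar{\mathbf{x}} =
\begin{bmatrix} x_{j1} \\ x_{j2} \\ \vdots \\ x_{jp} \end{bmatrix}
- \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix}
= \begin{bmatrix} x_{j1} - \bar{x}_1 \\ x_{j2} - \bar{x}_2 \\ \vdots \\ x_{jp} - \bar{x}_p \end{bmatrix},
\qquad j = 1, 2, \ldots, n
\tag{9-14}
\]

have the same sample covariance matrix $\mathbf{S}$ as the original observations.

In cases where the units of the variables are not commensurate, it is usually desirable to work with the standardized variables

\[
\mathbf{z}_j = \begin{bmatrix}
\dfrac{x_{j1} - \bar{x}_1}{\sqrt{s_{11}}} \\[2ex]
\dfrac{x_{j2} - \bar{x}_2}{\sqrt{s_{22}}} \\
\vdots \\
\dfrac{x_{jp} - \bar{x}_p}{\sqrt{s_{pp}}}
\end{bmatrix},
\qquad j = 1, 2, \ldots, n
\]

whose sample covariance matrix is the sample correlation matrix $\mathbf{R}$ of the observations $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. Standardization avoids the problems of having one variable with large variance unduly influencing the determination of factor loadings.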
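Both statements about centering and standardizing are simple to confirm on data. The sketch below is an added illustration using synthetic observations (the mixing matrix and scale factors are arbitrary assumptions chosen to give variables on very different scales); it checks that centering leaves S unchanged and that the covariance matrix of the standardized observations equals R.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: n = 50 observations on p = 3 correlated variables
# measured on non-commensurate scales (illustrative values only).
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])
scales = np.array([1.0, 100.0, 0.01])
X = (rng.standard_normal((50, 3)) @ mix) * scales

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                 # sample covariance (divisor n - 1)
centered = X - x_bar                        # as in (9-14)
Z = centered / np.sqrt(np.diag(S))          # standardized observations z_j

S_centered = np.cov(centered, rowvar=False)
S_z = np.cov(Z, rowvar=False)
R = np.corrcoef(X, rowvar=False)

print(np.allclose(S_centered, S), np.allclose(S_z, R))
```

Here the second variable has a variance many orders of magnitude larger than the third, so a factoring of S would be dominated by it, whereas a factoring of R treats the three variables on an equal footing.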
The representation in (9-13), when applied to the sample covariance matrix $\mathbf{S}$ or the sample correlation matrix $\mathbf{R}$, is known as the principal component solution. The name follows from the fact that the factor loadings are the scaled coefficients of the first few sample principal components. (See Chapter 8.)

For the principal component solution, the estimated loadings for a given factor do not change as the number of factors is increased. For example, if $m = 1$, $\tilde{\mathbf{L}} = \bigl[\sqrt{\hat{\lambda}_1}\,\hat{\mathbf{e}}_1\bigr]$, and if $m = 2$, $\tilde{\mathbf{L}} = \bigl[\sqrt{\hat{\lambda}_1}\,\hat{\mathbf{e}}_1 \;\; \sqrt{\hat{\lambda}_2}\,\hat{\mathbf{e}}_2\bigr]$, where $(\hat{\lambda}_1, \hat{\mathbf{e}}_1)$ and $(\hat{\lambda}_2, \hat{\mathbf{e}}_2)$ are the first two eigenvalue-eigenvector pairs for $\mathbf{S}$ (or $\mathbf{R}$).

By the definition of $\tilde{\psi}_i$, the diagonal elements of $\mathbf{S}$ are equal to the diagonal elements of $\tilde{\mathbf{L}}\tilde{\mathbf{L}}' + \tilde{\boldsymbol{\Psi}}$. However, the off-diagonal elements of $\mathbf{S}$ are not usually reproduced by $\tilde{\mathbf{L}}\tilde{\mathbf{L}}' + \tilde{\boldsymbol{\Psi}}$. How, then, do we select the number of factors $m$?

If the number of common factors is not determined by a priori considerations, such as by theory or the work of other researchers, the choice of $m$ can be based on the estimated eigenvalues in much the same manner as with principal components. Consider the residual matrix

\[
\mathbf{S} - (\tilde{\mathbf{L}}\tilde{\mathbf{L}}' + \tilde{\boldsymbol{\Psi}})
\tag{9-18}
\]

resulting from the approximation of $\mathbf{S}$ by the principal component solution. The diagonal elements are zero, and if the other elements are also small, we may subjectively take the $m$ factor model to be appropriate. Analytically, we have (see Exercise 9.5)

\[
\text{Sum of squared entries of } \bigl(\mathbf{S} - (\tilde{\mathbf{L}}\tilde{\mathbf{L}}' + \tilde{\boldsymbol{\Psi}})\bigr) \le \hat{\lambda}_{m+1}^2 + \cdots + \hat{\lambda}_p^2
\tag{9-19}
\]
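Consequently, a small value for the sum of the squares of the neglected eigenvalues implies a small value for the sum of the squared errors of approximation.

Ideally, the contributions of the first few factors to the sample variances of the variables should be large. The contribution to the sample variance $s_{ii}$ from the first common factor is $\tilde{\ell}_{i1}^2$. The contribution to the total sample variance, $s_{11} + s_{22} + \cdots + s_{pp} = \operatorname{tr}(\mathbf{S})$, from the first common factor is then

\[
\tilde{\ell}_{11}^2 + \tilde{\ell}_{21}^2 + \cdots + \tilde{\ell}_{p1}^2 = \bigl(\sqrt{\hat{\lambda}_1}\,\hat{\mathbf{e}}_1\bigr)'\bigl(\sqrt{\hat{\lambda}_1}\,\hat{\mathbf{e}}_1\bigr) = \hat{\lambda}_1
\]

since the eigenvector $\hat{\mathbf{e}}_1$ has unit length. In general,

\[
\begin{pmatrix} \text{Proportion of total} \\ \text{sample variance} \\ \text{due to } j\text{th factor} \end{pmatrix}
= \begin{cases}
\dfrac{\hat{\lambda}_j}{s_{11} + s_{22} + \cdots + s_{pp}} & \text{for a factor analysis of } \mathbf{S} \\[2ex]
\dfrac{\hat{\lambda}_j}{p} & \text{for a factor analysis of } \mathbf{R}
\end{cases}
\tag{9-20}
\]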
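The principal component solution, its residual matrix (9-18), the bound (9-19), and the proportions in (9-20) can all be computed from an eigendecomposition. The following is a minimal NumPy sketch under stated assumptions: the function name `pc_factor_solution` is hypothetical, the input matrix is an arbitrary positive definite example rather than data from the text, and the loadings are taken as the columns $\sqrt{\hat{\lambda}_j}\,\hat{\mathbf{e}}_j$ with specific variances $s_{ii} - \sum_j \tilde{\ell}_{ij}^2$, as described above.

```python
import numpy as np

def pc_factor_solution(S, m):
    """Principal component solution with m factors (hypothetical helper)."""
    eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])      # columns sqrt(lambda_j) * e_j
    psi = np.diag(S) - (L ** 2).sum(axis=1)        # s_ii minus communality
    return L, psi, eigvals

# Illustrative positive definite matrix standing in for S.
S = np.array([[19, 30, 2, 12],
              [30, 57, 5, 23],
              [2, 5, 38, 47],
              [12, 23, 47, 68]], dtype=float)
m = 2
L, psi, eigvals = pc_factor_solution(S, m)

residual = S - (L @ L.T + np.diag(psi))            # residual matrix (9-18)
sum_sq_resid = (residual ** 2).sum()
bound = (eigvals[m:] ** 2).sum()                   # right side of (9-19)
proportion = eigvals[:m] / np.trace(S)             # (9-20) for a factoring of S

print("residual diagonal:", np.diag(residual))
print("sum of squared residuals:", sum_sq_resid, "<= bound:", bound)
print("proportion of total variance by factor:", proportion)
```

For a correlation matrix, pass R in place of S; its trace is p, so the same line gives the second case of (9-20).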
Criterion (9-20) is frequently used as a heuristic device for determining the appropriate number of common factors. The number of common factors retained in the model is increased until a "suitable proportion" of the total sample variance has been explained.

Another convention, frequently encountered in packaged computer programs, is to set $m$ equal to the number of eigenvalues of $\mathbf{R}$ greater than one if the sample correlation matrix is factored, or equal to the number of positive eigenvalues of $\mathbf{S}$ if the sample covariance matrix is factored. These rules of thumb should not be applied indiscriminately. For example, $m = p$ if the rule for $\mathbf{S}$ is obeyed, since all the eigenvalues are expected to be positive for large sample sizes. The best approach is to retain few rather than many factors, assuming that they provide a satisfactory interpretation of the data and yield a satisfactory fit to $\mathbf{S}$ or $\mathbf{R}$.

Example 9.3 (Factor analysis of consumer-preference data)

In a consumer-preference study, a random sample of customers were asked to rate several attributes of a new product. The responses, on a 7-point semantic differential scale, were tabulated and the attribute correlation matrix constructed. The correlation matrix is presented next (entries circled in the original are marked with an asterisk):

Attribute (Variable)             1       2       3       4       5
Taste                     1    1.00     .02    .96*     .42     .01
Good buy for money        2     .02    1.00     .13     .71    .85*
Flavor                    3     .96     .13    1.00     .50     .11
Suitable for snack        4     .42     .71     .50    1.00    .79*
Provides lots of energy   5     .01     .85     .11     .79    1.00

It is clear from the marked entries in the correlation matrix that variables 1 and 3 and variables 2 and 5 form groups. Variable 4 is "closer" to the (2, 5) group than the (1, 3) group. Given these results and the small number of variables, we might expect that the apparent linear relationships between the variables can be explained in terms of, at most, two or three common factors.
The first two eigenvalues, $\hat{\lambda}_1 = 2.85$ and $\hat{\lambda}_2 = 1.81$, of $\mathbf{R}$ are the only eigenvalues greater than unity. Moreover, $m = 2$ common factors will account for a cumulative proportion

\[
\frac{\hat{\lambda}_1 + \hat{\lambda}_2}{p} = \frac{2.85 + 1.81}{5} = .93
\]

of the total (standardized) sample variance. The estimated factor loadings, communalities, and specific variances, obtained using (9-15), (9-16), and (9-17), are given in Table 9.1.

TABLE 9.1

                               Estimated factor loadings
                               $\tilde{\ell}_{ij} = \sqrt{\hat{\lambda}_j}\,\hat{e}_{ij}$      Communalities       Specific variances
Variable                       F1        F2          $\tilde{h}_i^2$     $\tilde{\psi}_i = 1 - \tilde{h}_i^2$
1. Taste                       .56       .82         .98                 .02
2. Good buy for money          .78      -.53         .88                 .12
3. Flavor                      .65       .75         .98                 .02
4. Suitable for snack          .94      -.10         .89                 .11
5. Provides lots of energy     .80      -.54         .93                 .07
Eigenvalues                    2.85      1.81
Cumulative proportion of
total (standardized)
sample variance                .571      .932
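The entries of Table 9.1 can be recomputed from the correlation matrix of this example. The sketch below is an added illustration, not the authors' program; it assumes the symmetric correlation matrix with $r_{13} = .96$, $r_{25} = .85$, and $r_{45} = .79$, and, because eigenvector signs are arbitrary, it flips each loading column so that its entries sum to a positive number (a convention chosen here only to make the output comparable with the table). Small differences from the printed values can arise from rounding.

```python
import numpy as np

R = np.array([
    [1.00, 0.02, 0.96, 0.42, 0.01],
    [0.02, 1.00, 0.13, 0.71, 0.85],
    [0.96, 0.13, 1.00, 0.50, 0.11],
    [0.42, 0.71, 0.50, 1.00, 0.79],
    [0.01, 0.85, 0.11, 0.79, 1.00],
])
p, m = R.shape[0], 2

eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

L = eigvecs[:, :m] * np.sqrt(eigvals[:m])          # principal component loadings
L = L * np.sign(L.sum(axis=0))                     # sign convention only
communalities = (L ** 2).sum(axis=1)
psi = 1.0 - communalities                          # specific variances for R
cum_proportion = np.cumsum(eigvals[:m]) / p

print("eigenvalues:", np.round(eigvals, 2))
print("loadings:\n", np.round(L, 2))
print("communalities:", np.round(communalities, 2))
print("specific variances:", np.round(psi, 2))
print("cumulative proportion:", np.round(cum_proportion, 3))
```

Only the first two eigenvalues exceed one, which is the "eigenvalues of R greater than one" convention mentioned earlier applied to these data.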
Now,

\[
\tilde{\mathbf{L}}\tilde{\mathbf{L}}' + \tilde{\boldsymbol{\Psi}} =
\begin{bmatrix}
.56 & .82 \\
.78 & -.53 \\
.65 & .75 \\
.94 & -.10 \\
.80 & -.54
\end{bmatrix}
\begin{bmatrix}
.56 & .78 & .65 & .94 & .80 \\
.82 & -.53 & .75 & -.10 & -.54
\end{bmatrix}
+ \begin{bmatrix}
.02 & 0 & 0 & 0 & 0 \\
0 & .12 & 0 & 0 & 0 \\
0 & 0 & .02 & 0 & 0 \\
0 & 0 & 0 & .11 & 0 \\
0 & 0 & 0 & 0 & .07
\end{bmatrix}
= \begin{bmatrix}
1.00 & .01 & .97 & .44 & .00 \\
 & 1.00 & .11 & .79 & .91 \\
 & & 1.00 & .53 & .11 \\
 & & & 1.00 & .81 \\
 & & & & 1.00
\end{bmatrix}
\]

nearly reproduces the correlation matrix $\mathbf{R}$. Thus, on a purely descriptive basis, we would judge a two-factor model with the factor loadings displayed in Table 9.1 as providing a good fit to the data. The communalities (.98, .88, .98, .89, .93) indicate that the two factors account for a large percentage of the sample variance of each variable.

We shall not interpret the factors at this point. As we noted in Section 9.2, the factors (and loadings) are unique up to an orthogonal rotation. A rotation of the factors often reveals a simple structure and aids interpretation. We shall consider this example again (see Example 9.9 and Panel 9.1) after factor rotation has been discussed. ■

Example 9.4 (Factor analysis of stock-price data)

Stock-price data consisting of $n = 100$ weekly rates of return on $p = 5$ stocks were introduced in Example 8.5. In that example, the first two sample principal components were obtained from $\mathbf{R}$. Taking $m = 1$ and $m = 2$, we can easily obtain principal component solutions to the orthogonal factor model. Specifically, the estimated factor loadings are the sample principal component coefficients (eigenvectors of $\mathbf{R}$), scaled by the square root of the corresponding eigenvalues. The estimated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor for the $m = 1$ and $m = 2$ factor solutions are available in Table 9.2. The communalities are given by (9-17). So, for example, with $m = 2$, $\tilde{h}_1^2 = \tilde{\ell}_{11}^2 + \tilde{\ell}_{12}^2 = (.783)^2 + (-.217)^2 = .66$.
TABLE 9.2

                        One-factor solution                      Two-factor solution
                        Estimated         Specific               Estimated factor      Specific
                        factor loadings   variances              loadings              variances
Variable                F1                $\tilde{\psi}_i = 1 - \tilde{h}_i^2$     F1        F2          $\tilde{\psi}_i = 1 - \tilde{h}_i^2$
1. Allied Chemical      .783              .39                    .783      -.217       .34
2. Du Pont              .773              .40                    .773      -.458       .19
3. Union Carbide        .794              .37                    .794      -.234       .31
4. Exxon                .713              .49                    .713       .472       .27
5. Texaco               .712              .49                    .712       .524       .22
Cumulative proportion
of total (standardized)
sample variance
explained               .571                                     .571       .733
The residual matrix corresponding to the solution for $m = 2$ factors is

\[
\mathbf{R} - \tilde{\mathbf{L}}\tilde{\mathbf{L}}' - \tilde{\boldsymbol{\Psi}} =
\begin{bmatrix}
0 & -.127 & -.164 & -.069 & .017 \\
-.127 & 0 & -.122 & .055 & .012 \\
-.164 & -.122 & 0 & -.019 & -.017 \\
-.069 & .055 & -.019 & 0 & -.232 \\
.017 & .012 & -.017 & -.232 & 0
\end{bmatrix}
\]

The proportion of the total variance explained by the two-factor solution is appreciably larger than that for the one-factor solution. However, for $m = 2$, $\tilde{\mathbf{L}}\tilde{\mathbf{L}}'$ produces numbers that are, in general, larger than the sample correlations. This is particularly true for $r_{45}$.

It seems fairly clear that the first factor, $F_1$, represents general economic conditions and might be called a market factor. All of the stocks load highly on this factor, and the loadings are about equal. The second factor contrasts the chemical stocks with the oil stocks. (The chemicals have relatively large negative loadings, and the oils have large positive loadings, on the factor.) Thus, $F_2$ seems to differentiate stocks in different industries and might be called an industry factor. To summarize, rates of return appear to be determined by general market conditions and activities that are unique to the different industries, as well as a residual or firm-specific factor. This is essentially the conclusion reached by an examination of the sample principal components in Example 8.5. ■
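The specific variances and cumulative proportions reported in Table 9.2 follow directly from the tabulated loadings. The sketch below is an added illustration; it takes the two-factor loadings as given in the table (with negative second-factor loadings for the three chemical stocks) and relies on the fact, noted earlier, that the principal component loadings on the first factor are the same for $m = 1$ and $m = 2$.

```python
import numpy as np

# Two-factor principal component loadings as reported in Table 9.2.
L2 = np.array([[0.783, -0.217],    # Allied Chemical
               [0.773, -0.458],    # Du Pont
               [0.794, -0.234],    # Union Carbide
               [0.713,  0.472],    # Exxon
               [0.712,  0.524]])   # Texaco
p = L2.shape[0]
L1 = L2[:, :1]                     # one-factor solution keeps the first column

psi_m1 = 1.0 - (L1 ** 2).sum(axis=1)          # specific variances, m = 1
psi_m2 = 1.0 - (L2 ** 2).sum(axis=1)          # specific variances, m = 2
cum_prop = np.cumsum((L2 ** 2).sum(axis=0)) / p

print("m = 1 specific variances:", np.round(psi_m1, 2))
print("m = 2 specific variances:", np.round(psi_m2, 2))
print("cumulative proportions:", np.round(cum_prop, 3))
```

Adding the second factor lowers every specific variance, most noticeably for Du Pont and the two oil stocks, which is the numerical counterpart of the "industry factor" interpretation.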
A Modified Approach - the Principal Factor Solution

A modification of the principal component approach is sometimes considered. We describe the reasoning in terms of a factor analysis of $\mathbf{R}$, although the procedure is also appropriate for $\mathbf{S}$. If the factor model $\boldsymbol{\rho} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ is correctly specified, the $m$ common factors should account for the off-diagonal elements of $\boldsymbol{\rho}$, as well as the communality portions of the diagonal elements

\[
\rho_{ii} = 1 = h_i^2 + \psi_i
\]

If the specific factor contribution $\psi_i$ is removed from the diagonal or, equivalently, the 1 replaced by $h_i^2$, the resulting matrix is $\boldsymbol{\rho} - \boldsymbol{\Psi} = \mathbf{L}\mathbf{L}'$.

Suppose, now, that initial estimates $\psi_i^*$ of the specific variances are available. Then replacing the $i$th diagonal element of $\mathbf{R}$ by $h_i^{*2} = 1 - \psi_i^*$, we obtain a "reduced" sample correlation matrix

\[
\mathbf{R}_r = \begin{bmatrix}
h_1^{*2} & r_{12} & \cdots & r_{1p} \\
r_{12} & h_2^{*2} & \cdots & r_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
r_{1p} & r_{2p} & \cdots & h_p^{*2}
\end{bmatrix}
\]
Now, apart from sampling variation, all of the elements of the reduced sample correlation matrix $\mathbf{R}_r$ should be accounted for by the $m$ common factors. In particular, $\mathbf{R}_r$ is factored as

\[
\mathbf{R}_r \doteq \mathbf{L}_r^*\mathbf{L}_r^{*\prime}
\tag{9-21}
\]

where $\mathbf{L}_r^* = \{\ell_{ij}^*\}$ are the estimated loadings.

The principal factor method of factor analysis employs the estimates

\[
\begin{aligned}
\mathbf{L}_r^* &= \begin{bmatrix} \sqrt{\hat{\lambda}_1^*}\,\hat{\mathbf{e}}_1^* & \sqrt{\hat{\lambda}_2^*}\,\hat{\mathbf{e}}_2^* & \cdots & \sqrt{\hat{\lambda}_m^*}\,\hat{\mathbf{e}}_m^* \end{bmatrix} \\
\psi_i^* &= 1 - \sum_{j=1}^{m} \ell_{ij}^{*2}
\end{aligned}
\tag{9-22}
\]

where $(\hat{\lambda}_i^*, \hat{\mathbf{e}}_i^*)$, $i = 1, 2, \ldots, m$ are the (largest) eigenvalue-eigenvector pairs determined from $\mathbf{R}_r$. In turn, the communalities would then be (re)estimated by

\[
\tilde{h}_i^{*2} = \sum_{j=1}^{m} \ell_{ij}^{*2}
\tag{9-23}
\]

The principal factor solution can be obtained iteratively, with the communality estimates of (9-23) becoming the initial estimates for the next stage.

In the spirit of the principal component solution, consideration of the estimated eigenvalues $\hat{\lambda}_1^*, \hat{\lambda}_2^*, \ldots, \hat{\lambda}_p^*$ helps determine the number of common factors to retain. An added complication is that now some of the eigenvalues may be negative, due to the use of initial communality estimates. Ideally, we should take the number of common factors equal to the rank of the reduced population matrix. Unfortunately, this rank is not always well determined from $\mathbf{R}_r$, and some judgment is necessary.

Although there are many choices for initial estimates of specific variances, the most popular choice, when one is working with a correlation matrix, is $\psi_i^* = 1/r^{ii}$, where $r^{ii}$ is the $i$th diagonal element of $\mathbf{R}^{-1}$. The initial communality estimates then become

\[
h_i^{*2} = 1 - \psi_i^* = 1 - \frac{1}{r^{ii}}
\tag{9-24}
\]

which is equal to the square of the multiple correlation coefficient between $X_i$ and the other $p - 1$ variables. The relation to the multiple correlation coefficient means that $h_i^{*2}$ can be calculated even when $\mathbf{R}$ is not of full rank. For factoring $\mathbf{S}$, the initial specific variance estimates use $s^{ii}$, the diagonal elements of $\mathbf{S}^{-1}$. Further discussion of these and other initial estimates is contained in [12].

Although the principal component method for $\mathbf{R}$ can be regarded as a principal factor method with initial communality estimates of unity, or specific variances equal to zero, the two are philosophically and geometrically different. (See [12].) In practice, however, the two frequently produce comparable factor loadings if the number of variables is large and the number of common factors is small.

We do not pursue the principal factor solution, since, to our minds, the solution methods that have the most to recommend them are the principal component method and the maximum likelihood method, which we discuss next.
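For readers who want to see the iteration in (9-21) through (9-24) spelled out, the following is a minimal sketch. It is an illustration under stated assumptions rather than a reference implementation: the function name `principal_factor`, the fixed iteration count, and the clipping of negative eigenvalues to zero are choices made here, and the test matrix is a synthetic correlation matrix that satisfies a one-factor model exactly, with hypothetical loadings .9, .8, .7, .6.

```python
import numpy as np

def principal_factor(R, m, n_iter=100):
    """Iterated principal factor solution for a correlation matrix (sketch)."""
    R = np.asarray(R, dtype=float)
    # Initial communalities (9-24): squared multiple correlations 1 - 1/r^{ii}.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        R_r = R.copy()
        np.fill_diagonal(R_r, h2)                  # reduced correlation matrix
        eigvals, eigvecs = np.linalg.eigh(R_r)
        order = np.argsort(eigvals)[::-1][:m]      # m largest eigenvalues
        lam = np.clip(eigvals[order], 0.0, None)   # guard against negative values
        L = eigvecs[:, order] * np.sqrt(lam)       # loadings as in (9-22)
        h2 = (L ** 2).sum(axis=1)                  # re-estimated communalities (9-23)
    return L, 1.0 - h2

# Synthetic correlation matrix obeying an exact one-factor model (hypothetical).
true_l = np.array([0.9, 0.8, 0.7, 0.6])
R = np.outer(true_l, true_l)
np.fill_diagonal(R, 1.0)

initial_h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
L, psi = principal_factor(R, m=1)
L = L * np.sign(L.sum(axis=0))                     # fix the arbitrary sign

print("initial communalities:", np.round(initial_h2, 3))
print("final loadings:", np.round(L[:, 0], 3))
print("final specific variances:", np.round(psi, 3))
```

On this synthetic matrix the initial communalities start below the true values .81, .64, .49, .36 and the iteration moves toward the generating loadings; with real data there is no such guarantee, and communalities exceeding one can occur.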
The Maximum Likelihood Method
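If the common factors $\mathbf{F}$ and the specific factors $\boldsymbol{\varepsilon}$ can be assumed to be normally distributed, then maximum likelihood estimates of the factor loadings and specific variances may be obtained. When $\mathbf{F}_j$ and $\boldsymbol{\varepsilon}_j$ are jointly normal, the observations $\mathbf{X}_j - \boldsymbol{\mu} = \mathbf{L}\mathbf{F}_j + \boldsymbol{\varepsilon}_j$ are then normal, and from (4-16), the likelihood is

$$
\begin{aligned}
L(\boldsymbol{\mu}, \boldsymbol{\Sigma})
&= (2\pi)^{-np/2}\,|\boldsymbol{\Sigma}|^{-n/2}
\exp\Big\{-\tfrac{1}{2}\,\mathrm{tr}\Big[\boldsymbol{\Sigma}^{-1}\Big(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})' + n(\bar{\mathbf{x}}-\boldsymbol{\mu})(\bar{\mathbf{x}}-\boldsymbol{\mu})'\Big)\Big]\Big\} \\
&= (2\pi)^{-(n-1)p/2}\,|\boldsymbol{\Sigma}|^{-(n-1)/2}
\exp\Big\{-\tfrac{1}{2}\,\mathrm{tr}\Big[\boldsymbol{\Sigma}^{-1}\Big(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})'\Big)\Big]\Big\} \\
&\quad\times (2\pi)^{-p/2}\,|\boldsymbol{\Sigma}|^{-1/2}
\exp\Big\{-\tfrac{n}{2}\,(\bar{\mathbf{x}}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})\Big\}
\end{aligned}
\tag{9-25}
$$

which depends on $\mathbf{L}$ and $\boldsymbol{\Psi}$ through $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$. This model is still not well defined, because of the multiplicity of choices for $\mathbf{L}$ made possible by orthogonal transformations. It is desirable to make $\mathbf{L}$ well defined by imposing the computationally convenient uniqueness condition

$$
\mathbf{L}'\boldsymbol{\Psi}^{-1}\mathbf{L} = \boldsymbol{\Delta}, \quad \text{a diagonal matrix}
\tag{9-26}
$$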
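The maximum likelihood estimates $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ must be obtained by numerical maximization of (9-25). Fortunately, efficient computer programs now exist that enable one to get these estimates rather easily.

We summarize some facts about maximum likelihood estimators and, for now, rely on a computer to perform the numerical details.

Result 9.1. Let $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ be a random sample from $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$ is the covariance matrix for the $m$ common factor model of (9-4). The maximum likelihood estimators $\hat{\mathbf{L}}$, $\hat{\boldsymbol{\Psi}}$, and $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ maximize (9-25) subject to $\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}\hat{\mathbf{L}}$ being diagonal.

The maximum likelihood estimates of the communalities are

$$
\hat{h}_i^2 = \hat{\ell}_{i1}^2 + \hat{\ell}_{i2}^2 + \cdots + \hat{\ell}_{im}^2 \quad \text{for } i = 1, 2, \ldots, p
\tag{9-27}
$$

so

$$
\begin{pmatrix}\text{Proportion of total sample}\\ \text{variance due to } j\text{th factor}\end{pmatrix}
= \frac{\hat{\ell}_{1j}^2 + \hat{\ell}_{2j}^2 + \cdots + \hat{\ell}_{pj}^2}{s_{11} + s_{22} + \cdots + s_{pp}}
\tag{9-28}
$$

Proof. By the invariance property of maximum likelihood estimates (see Section 4.3), functions of $\mathbf{L}$ and $\boldsymbol{\Psi}$ are estimated by the same functions of $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$. In particular, the communalities $h_i^2 = \ell_{i1}^2 + \cdots + \ell_{im}^2$ have maximum likelihood estimates $\hat{h}_i^2 = \hat{\ell}_{i1}^2 + \cdots + \hat{\ell}_{im}^2$.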
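The uniqueness condition (9-26) can be illustrated numerically. The following is a minimal sketch, not the algorithm used by any particular statistical package: given any loading matrix and positive specific variances, an orthogonal matrix $\mathbf{T}$ formed from the eigenvectors of $\mathbf{L}'\boldsymbol{\Psi}^{-1}\mathbf{L}$ produces rotated loadings that satisfy the condition while leaving $\mathbf{L}\mathbf{L}'$ unchanged. The function name `impose_uniqueness` and the numerical loadings are hypothetical and chosen only for illustration.

```python
import numpy as np

def impose_uniqueness(L, psi):
    """Rotate loadings so that L*' Psi^{-1} L* is diagonal, as in (9-26)."""
    L = np.asarray(L, dtype=float)
    psi = np.asarray(psi, dtype=float)
    M = L.T @ (L / psi[:, None])   # L' Psi^{-1} L, an m x m symmetric matrix
    _, T = np.linalg.eigh(M)       # columns of T are orthonormal eigenvectors
    return L @ T                   # rotated loadings L* = L T

# Hypothetical loadings (p = 4, m = 2); specific variances make unit variances.
L = np.array([[0.8, 0.3], [0.7, -0.2], [0.6, 0.5], [0.5, -0.4]])
psi = 1.0 - np.sum(L**2, axis=1)

L_star = impose_uniqueness(L, psi)
D = L_star.T @ (L_star / psi[:, None])   # should be diagonal
print(np.round(D, 6))
```

Because $\mathbf{T}$ is orthogonal, the fitted covariance structure is the same before and after the rotation; only the particular representative of the loadings changes.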
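If, as in (8-10), the variables are standardized so that $\mathbf{Z} = \mathbf{V}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$, then the covariance matrix $\boldsymbol{\rho}$ of $\mathbf{Z}$ has the representation

$$
\boldsymbol{\rho} = \mathbf{V}^{-1/2}\boldsymbol{\Sigma}\mathbf{V}^{-1/2}
= (\mathbf{V}^{-1/2}\mathbf{L})(\mathbf{V}^{-1/2}\mathbf{L})' + \mathbf{V}^{-1/2}\boldsymbol{\Psi}\mathbf{V}^{-1/2}
\tag{9-29}
$$

Thus, $\boldsymbol{\rho}$ has a factorization analogous to (9-5) with loading matrix $\mathbf{L}_z = \mathbf{V}^{-1/2}\mathbf{L}$ and specific variance matrix $\boldsymbol{\Psi}_z = \mathbf{V}^{-1/2}\boldsymbol{\Psi}\mathbf{V}^{-1/2}$. By the invariance property of maximum likelihood estimators, the maximum likelihood estimator of $\boldsymbol{\rho}$ is

$$
\hat{\boldsymbol{\rho}} = (\hat{\mathbf{V}}^{-1/2}\hat{\mathbf{L}})(\hat{\mathbf{V}}^{-1/2}\hat{\mathbf{L}})' + \hat{\mathbf{V}}^{-1/2}\hat{\boldsymbol{\Psi}}\hat{\mathbf{V}}^{-1/2}
= \hat{\mathbf{L}}_z\hat{\mathbf{L}}_z' + \hat{\boldsymbol{\Psi}}_z
\tag{9-30}
$$

where $\hat{\mathbf{V}}^{-1/2}$ and $\hat{\mathbf{L}}$ are the maximum likelihood estimators of $\mathbf{V}^{-1/2}$ and $\mathbf{L}$, respectively. (See Supplement 9A.)

As a consequence of the factorization of (9-30), whenever the maximum likelihood analysis pertains to the correlation matrix, we call

$$
\hat{h}_i^2 = \hat{\ell}_{i1}^2 + \hat{\ell}_{i2}^2 + \cdots + \hat{\ell}_{im}^2, \quad i = 1, 2, \ldots, p
\tag{9-31}
$$

the maximum likelihood estimates of the communalities, and we evaluate the importance of the factors on the basis of

$$
\begin{pmatrix}\text{Proportion of total (standardized)}\\ \text{sample variance due to } j\text{th factor}\end{pmatrix}
= \frac{\hat{\ell}_{1j}^2 + \hat{\ell}_{2j}^2 + \cdots + \hat{\ell}_{pj}^2}{p}
\tag{9-32}
$$

To avoid more tedious notations, the preceding $\hat{\ell}_{ij}$'s denote the elements of $\hat{\mathbf{L}}_z$.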
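Comment. Ordinarily, the observations are standardized, and a sample correlation matrix is factor analyzed. The sample correlation matrix $\mathbf{R}$ is inserted for $[(n-1)/n]\mathbf{S}$ in the likelihood function of (9-25), and the maximum likelihood estimates $\hat{\mathbf{L}}_z$ and $\hat{\boldsymbol{\Psi}}_z$ are obtained using a computer. Although the likelihood in (9-25) is appropriate for $\mathbf{S}$, not $\mathbf{R}$, surprisingly, this practice is equivalent to obtaining the maximum likelihood estimates $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ based on the sample covariance matrix $\mathbf{S}$, setting $\hat{\mathbf{L}}_z = \hat{\mathbf{V}}^{-1/2}\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}_z = \hat{\mathbf{V}}^{-1/2}\hat{\boldsymbol{\Psi}}\hat{\mathbf{V}}^{-1/2}$. Here $\hat{\mathbf{V}}^{-1/2}$ is the diagonal matrix with the reciprocal of the sample standard deviations (computed with the divisor $\sqrt{n}$) on the main diagonal.

Going in the other direction, given the estimated loadings $\hat{\mathbf{L}}_z$ and specific variances $\hat{\boldsymbol{\Psi}}_z$ obtained from $\mathbf{R}$, we find that the resulting maximum likelihood estimates for a factor analysis of the covariance matrix $[(n-1)/n]\mathbf{S}$ are $\hat{\mathbf{L}} = \hat{\mathbf{V}}^{1/2}\hat{\mathbf{L}}_z$ and $\hat{\boldsymbol{\Psi}} = \hat{\mathbf{V}}^{1/2}\hat{\boldsymbol{\Psi}}_z\hat{\mathbf{V}}^{1/2}$, or

$$
\hat{\ell}_{ij} = \hat{\ell}_{z,ij}\sqrt{\hat{\sigma}_{ii}} \quad\text{and}\quad \hat{\psi}_i = \hat{\psi}_{z,i}\,\hat{\sigma}_{ii}
$$

where $\hat{\sigma}_{ii}$ is the sample variance computed with divisor $n$. The distinction between divisors can be ignored with principal component solutions.

The equivalency between factoring $\mathbf{S}$ and $\mathbf{R}$ has apparently been confused in many published discussions of factor analysis. (See Supplement 9A.)

Example 9.5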
(Factor analysis of stock-price data using the maximum likelihood method)

The stock-price data of Examples 8.5 and 9.4 were reanalyzed assuming an $m = 2$ factor model and using the maximum likelihood method. The estimated factor loadings, communalities, specific variances, and proportion of total (standardized) sample variance explained by each factor are in Table 9.3. The corresponding figures for the $m = 2$ factor solution obtained by the principal component method (see Example 9.4) are also provided. The communalities corresponding to the maximum likelihood factoring of $\mathbf{R}$ are of the form [see (9-31)] $\hat{h}_i^2 = \hat{\ell}_{i1}^2 + \hat{\ell}_{i2}^2$. So, for example,

$$\hat{h}_1^2 = (.684)^2 + (.189)^2 = .50$$

TABLE 9.3

                      Maximum likelihood                  Principal components
                      Estimated factor    Specific        Estimated factor    Specific
                      loadings            variances       loadings            variances
Variable              F1       F2         1 - h_i^2       F1       F2         1 - h_i^2
1. Allied Chemical    .684     .189       .50             .783    -.217       .34
2. Du Pont            .694     .517       .25             .773    -.458       .19
3. Union Carbide      .681     .248       .47             .794    -.234       .31
4. Exxon              .621    -.073       .61             .713     .412       .27
5. Texaco             .792    -.442       .18             .712     .524       .22
Cumulative proportion
of total (standardized)
sample variance
explained             .485     .598                       .571     .733

The residual matrix is

$$
\mathbf{R} - \hat{\mathbf{L}}\hat{\mathbf{L}}' - \hat{\boldsymbol{\Psi}} =
\begin{bmatrix}
0 & .005 & -.004 & -.024 & -.004 \\
.005 & 0 & -.003 & -.004 & .000 \\
-.004 & -.003 & 0 & .031 & -.004 \\
-.024 & -.004 & .031 & 0 & -.000 \\
-.004 & .000 & -.004 & -.000 & 0
\end{bmatrix}
$$

The elements of $\mathbf{R} - \hat{\mathbf{L}}\hat{\mathbf{L}}' - \hat{\boldsymbol{\Psi}}$ are much smaller than those of the residual matrix corresponding to the principal component factoring of $\mathbf{R}$ presented in Example 9.4. On this basis, we prefer the maximum likelihood approach and typically feature it in subsequent examples.

The cumulative proportion of the total sample variance explained by the factors is larger for principal component factoring than for maximum likelihood factoring. It is not surprising that this criterion typically favors principal component factoring. Loadings obtained by a principal component factor analysis are related to the principal components, which have, by design, a variance optimizing property. [See the discussion preceding (8-19).]

Focusing attention on the maximum likelihood solution, we see that all variables have large positive loadings on $F_1$. We call this factor the market factor, as we did in the principal component solution. The interpretation of the second factor, however, is not as clear as it appeared to be in the principal component solution. The signs of the factor loadings are consistent with a contrast, or industry factor, but the magnitudes are small in some cases, and one might identify this factor as a comparison between Du Pont and Texaco.

The patterns of the initial factor loadings for the maximum likelihood solution are constrained by the uniqueness condition that $\hat{\mathbf{L}}'\hat{\boldsymbol{\Psi}}^{-1}\hat{\mathbf{L}}$ be a diagonal matrix. Therefore, useful factor patterns are often not revealed until the factors are rotated (see Section 9.4).
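The arithmetic of Example 9.5 can be checked with a few lines of NumPy. This is only a sketch under stated assumptions: the loadings are the rounded maximum likelihood values from Table 9.3, and the correlation matrix is the rounded $\mathbf{R}$ displayed for these data in Example 9.7, so the results agree with the text only up to rounding. The variable names are illustrative.

```python
import numpy as np

# Rounded ML loadings from Table 9.3 (Allied Chemical, Du Pont, Union Carbide, Exxon, Texaco).
L_ml = np.array([[.684, .189],
                 [.694, .517],
                 [.681, .248],
                 [.621, -.073],
                 [.792, -.442]])

# Rounded sample correlation matrix of the stock-price data (as shown in Example 9.7).
R = np.array([[1.000, .577, .509, .387, .462],
              [.577, 1.000, .599, .389, .322],
              [.509, .599, 1.000, .436, .426],
              [.387, .389, .436, 1.000, .523],
              [.462, .322, .426, .523, 1.000]])

h2 = np.sum(L_ml**2, axis=1)                  # communalities, as in (9-31)
psi = 1.0 - h2                                # specific variances
residual = R - L_ml @ L_ml.T - np.diag(psi)   # residual matrix R - LL' - Psi
cum_prop = np.cumsum(np.sum(L_ml**2, axis=0)) / R.shape[0]   # cumulative form of (9-32)

print(np.round(h2, 2), np.round(psi, 2))
print(np.round(residual, 3))
print(np.round(cum_prop, 3))
```

The diagonal of the residual matrix is zero by construction, since each specific variance is defined as one minus the corresponding communality.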
Example 9.6 (Factor analysis of Olympic decathlon data)

Linden [21] conducted a factor analytic study of Olympic decathlon scores since World War II. Altogether, 160 complete starts were made by 139 athletes.³ The scores for each of the 10 decathlon events were standardized, and a sample correlation matrix was factor analyzed by the methods of principal components and maximum likelihood. Linden reports that the distributions of standard scores were normal or approximately normal for each of the ten decathlon events. The sample correlation matrix $\mathbf{R}$, based on $n = 160$ starts, is

                   100-m  Long   Shot   High   400-m  110-m    Discus  Pole   Javelin  1500-m
                   run    jump   put    jump   run    hurdles          vault           run
100-m run          1.0    .59    .35    .34    .63    .40      .28     .20    .11      -.07
Long jump                 1.0    .42    .51    .49    .52      .31     .36    .21       .09
Shot put                         1.0    .38    .19    .36      .73     .24    .44      -.08
High jump                               1.0    .29    .46      .27     .39    .17       .18
400-m run                                      1.0    .34      .17     .23    .13       .39
110-m hurdles                                         1.0      .32     .33    .18       .00
Discus                                                         1.0     .24    .34      -.02
Pole vault                                                             1.0    .24       .17
Javelin                                                                       1.0      -.00
1500-m run                                                                              1.0

From a principal component factor analysis perspective, the first four eigenvalues, 3.78, 1.52, 1.11, .91, of $\mathbf{R}$ suggest a factor solution with $m = 3$ or $m = 4$. A subsequent interpretation of the factor loadings reinforces the choice $m = 4$.

The principal component and maximum likelihood solution methods were applied to Linden's correlation matrix and yielded the estimated factor loadings, communalities, and specific variance contributions in Table 9.4.⁴

³ Because of the potential correlation between successive scores by athletes who competed in more than one Olympic game, an analysis was also done using 139 scores representing different athletes. The score for an athlete who participated more than once was selected at random. The results were virtually identical to those based on all 160 scores.

⁴ The output of this table was produced by the BMDP statistical software package. The output from the SAS program is identical for the principal component solution and very similar for the maximum likelihood solution. For this example the solution to the likelihood equations produces a Heywood case. That is, the estimated loadings are such that some specific variances are negative. Consequently, the software package may not run unless the Heywood case option is selected. With that option, the program obtains a feasible solution by slightly adjusting the loadings so that all specific variance estimates are nonnegative. A Heywood case is suggested in this example by the .00 values for the specific variances for the shot put and the 1500-m run.

TABLE 9.4

                     Principal component                           Maximum likelihood
                     Estimated factor loadings     Specific        Estimated factor loadings     Specific
                                                   variances                                     variances
Variable             F1     F2     F3     F4       1 - h_i^2       F1     F2     F3     F4       1 - h_i^2
1. 100-m run         .691   .217  -.520  -.206     .16            -.090   .341   .830  -.169     .16
2. Long jump         .789   .184  -.193   .092     .30             .065   .433   .595   .275     .38
3. Shot put          .702  -.535   .047  -.175     .19            -.139   .990   .000   .000     .00
4. High jump         .674   .134   .139   .396     .35             .156   .406   .336   .445     .50
5. 400-m run         .620   .551  -.084  -.419     .13             .376   .245   .671  -.137     .33
6. 110-m hurdles     .687   .042  -.161   .345     .38            -.021   .361   .425   .388     .54
7. Discus            .621  -.521   .109  -.234     .28            -.063   .728   .030   .019     .46
8. Pole vault        .538   .087   .411   .440     .34             .155   .264   .229   .394     .70
9. Javelin           .434  -.439   .372  -.235     .43            -.026   .441  -.010   .098     .80
10. 1500-m run       .147   .596   .658  -.279     .11             .998   .059   .000   .000     .00
Cumulative proportion
of total variance
explained            .38    .53    .64    .73                      .12    .37    .55    .61
In this case, the two solution methods produced very different results. For the principal component factorization, all events except the 1500-meter run have large positive loadings on the first factor. This factor might be labeled general athletic ability. The remaining factors cannot be easily interpreted to our minds. Factor 2 appears to contrast running ability with throwing ability, or "arm strength." Factor 3 appears to contrast running endurance (1500-meter run) with running speed (100-meter run), although there is a relatively high pole-vault loading on this factor. Factor 4 is a mystery at this point.

For the maximum likelihood method, the 1500-meter run is the only variable with a large loading on the first factor. This factor might be called a running endurance factor. The second factor appears to be primarily a strength factor (discus and shot put load highly on this factor), and the third factor might be running speed, since the 100-meter and 400-meter runs load highly on this factor. Again, the fourth factor is not easily identified, although it may have something to do with jumping ability or leg strength. We shall return to an interpretation of the factors in Example 9.11 after a discussion of factor rotation.

The four-factor principal component solution accounts for much of the total (standardized) sample variance, although the estimated specific variances are large in some cases (for example, the javelin and hurdles). This suggests that some events might require unique or specific attributes not required for the other events. The four-factor maximum likelihood solution accounts for less of the total sample variance, but, as the following residual matrices indicate, the maximum likelihood estimates $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ do a better job of reproducing $\mathbf{R}$ than the principal component estimates $\tilde{\mathbf{L}}$ and $\tilde{\boldsymbol{\Psi}}$:

Principal component:

$$
\mathbf{R} - \tilde{\mathbf{L}}\tilde{\mathbf{L}}' - \tilde{\boldsymbol{\Psi}} =
\begin{bmatrix}
0 \\
-.075 & 0 \\
-.030 & -.010 & 0 \\
-.001 & -.056 & .042 & 0 \\
-.047 & -.077 & -.020 & -.024 & 0 \\
-.096 & -.092 & -.032 & -.122 & .022 & 0 \\
-.027 & -.041 & -.031 & -.001 & -.017 & .014 & 0 \\
.114 & -.042 & -.034 & -.215 & .067 & -.129 & .009 & 0 \\
.051 & .042 & -.158 & -.022 & .036 & .041 & -.254 & -.005 & 0 \\
-.016 & .017 & .056 & .020 & -.091 & .076 & .062 & -.109 & -.112 & 0
\end{bmatrix}
$$

Maximum likelihood:

$$
\mathbf{R} - \hat{\mathbf{L}}\hat{\mathbf{L}}' - \hat{\boldsymbol{\Psi}} =
\begin{bmatrix}
0 \\
.000 & 0 \\
.000 & .000 & 0 \\
.012 & .002 & .000 & 0 \\
.000 & -.002 & .000 & -.033 & 0 \\
-.012 & .006 & -.000 & .001 & .028 & 0 \\
.004 & -.025 & -.000 & -.034 & .002 & .036 & 0 \\
.000 & -.009 & -.000 & .006 & .008 & -.012 & .043 & 0 \\
-.018 & -.000 & -.000 & -.045 & .052 & -.013 & .016 & .091 & 0 \\
.000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & .000 & 0
\end{bmatrix}
$$
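A Large Sample Test for the Number of Common Factors

The assumption of a normal population leads directly to a test of the adequacy of the model. Suppose the $m$ common factor model holds. In this case $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$, and testing the adequacy of the $m$ common factor model is equivalent to testing

$$
H_0\colon \underset{(p\times p)}{\boldsymbol{\Sigma}} = \underset{(p\times m)}{\mathbf{L}}\;\underset{(m\times p)}{\mathbf{L}'} + \underset{(p\times p)}{\boldsymbol{\Psi}}
\tag{9-33}
$$

versus $H_1\colon \boldsymbol{\Sigma}$ any other positive definite matrix. When $\boldsymbol{\Sigma}$ does not have any special form, the maximum of the likelihood function [see (4-18) and Result 4.11 with $\hat{\boldsymbol{\Sigma}} = ((n-1)/n)\mathbf{S} = \mathbf{S}_n$] is proportional to

$$
|\mathbf{S}_n|^{-n/2}\, e^{-np/2}
\tag{9-34}
$$

Under $H_0$, $\boldsymbol{\Sigma}$ is restricted to have the form of (9-33). In this case, the maximum of the likelihood function [see (9-25) with $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}$ and $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}$, where $\hat{\mathbf{L}}$ and $\hat{\boldsymbol{\Psi}}$ are the maximum likelihood estimates of $\mathbf{L}$ and $\boldsymbol{\Psi}$, respectively] is proportional to

$$
|\hat{\boldsymbol{\Sigma}}|^{-n/2}\exp\Big(-\tfrac{1}{2}\,\mathrm{tr}\Big[\hat{\boldsymbol{\Sigma}}^{-1}\Big(\sum_{j=1}^{n}(\mathbf{x}_j-\bar{\mathbf{x}})(\mathbf{x}_j-\bar{\mathbf{x}})'\Big)\Big]\Big)
= |\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|^{-n/2}\exp\Big(-\tfrac{1}{2}\,n\,\mathrm{tr}\big[(\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}})^{-1}\mathbf{S}_n\big]\Big)
\tag{9-35}
$$

Using Result 5.2, (9-34), and (9-35), we find that the likelihood ratio statistic for testing $H_0$ is

$$
-2\ln\Lambda = -2\ln\left[\frac{\text{maximized likelihood under } H_0}{\text{maximized likelihood}}\right]
= -2\ln\left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\mathbf{S}_n|}\right)^{-n/2} + n\big[\mathrm{tr}(\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{S}_n) - p\big]
\tag{9-36}
$$

with degrees of freedom,

$$
v - v_0 = \tfrac{1}{2}p(p+1) - \big[p(m+1) - \tfrac{1}{2}m(m-1)\big] = \tfrac{1}{2}\big[(p-m)^2 - p - m\big]
\tag{9-37}
$$

Supplement 9A indicates that $\mathrm{tr}(\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{S}_n) - p = 0$ provided that $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}$ is the maximum likelihood estimate of $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$. Thus, we have

$$
-2\ln\Lambda = n\ln\left(\frac{|\hat{\boldsymbol{\Sigma}}|}{|\mathbf{S}_n|}\right)
\tag{9-38}
$$

Bartlett [4] has shown that the chi-square approximation to the sampling distribution of $-2\ln\Lambda$ can be improved by replacing $n$ in (9-38) with the multiplicative factor $(n - 1 - (2p + 4m + 5)/6)$.

Using Bartlett's correction,⁵ we reject $H_0$ at the $\alpha$ level of significance if

$$
\big(n - 1 - (2p + 4m + 5)/6\big)\ln\frac{|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|}{|\mathbf{S}_n|} > \chi^2_{[(p-m)^2 - p - m]/2}(\alpha)
\tag{9-39}
$$

provided that $n$ and $n - p$ are large. Since the number of degrees of freedom, $\tfrac{1}{2}[(p-m)^2 - p - m]$, must be positive, it follows that

$$
m < \tfrac{1}{2}\big(2p + 1 - \sqrt{8p + 1}\big)
\tag{9-40}
$$

in order to apply the test (9-39).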
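The quantities in (9-37), (9-39), and (9-40) are simple to compute. The sketch below uses only the Python standard library; the function names are illustrative and not taken from any package. The tail probability is computed only for one degree of freedom, where $P[\chi_1^2 > x] = \mathrm{erfc}(\sqrt{x/2})$, so a general implementation would need a chi-square routine from a statistics library.

```python
import math

def factor_test_df(p, m):
    """Degrees of freedom (1/2)[(p - m)^2 - p - m] from (9-37)."""
    return ((p - m) ** 2 - p - m) / 2

def max_testable_factors(p):
    """Largest integer m satisfying the strict inequality (9-40)."""
    bound = 0.5 * (2 * p + 1 - math.sqrt(8 * p + 1))
    return math.ceil(bound) - 1

def bartlett_statistic(n, p, m, det_ratio):
    """Left side of (9-39); det_ratio is |LL' + Psi| / |S_n|."""
    return (n - 1 - (2 * p + 4 * m + 5) / 6) * math.log(det_ratio)

def chi2_sf_one_df(x):
    """P[chi-square with 1 d.f. > x], valid for one degree of freedom only."""
    return math.erfc(math.sqrt(x / 2))

# Illustration with n = 100, p = 5, m = 2 and a determinant ratio of 1.0065.
stat = bartlett_statistic(n=100, p=5, m=2, det_ratio=1.0065)
df = factor_test_df(5, 2)
print(round(stat, 2), df, round(chi2_sf_one_df(stat), 2), max_testable_factors(5))
```

With $p = 5$ the bound in (9-40) is about 2.3, so the test can be applied for at most $m = 2$ common factors.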
Comment. In implementing the test in (9-39), we are testing for the adequacy of the $m$ common factor model by comparing the generalized variances $|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|$ and $|\mathbf{S}_n|$. If $n$ is large and $m$ is small relative to $p$, the hypothesis $H_0$ will usually be rejected, leading to a retention of more common factors. However, $\hat{\boldsymbol{\Sigma}} = \hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}$ may be close enough to $\mathbf{S}_n$ so that adding more factors does not provide additional insights, even though those factors are "significant." Some judgment must be exercised in the choice of $m$.

Example 9.7 (Testing for two common factors)

The two-factor maximum likelihood analysis of the stock-price data was presented in Example 9.5. The residual matrix there suggests that a two-factor solution may be adequate. Test the hypothesis $H_0\colon \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$, with $m = 2$, at level $\alpha = .05$.

⁵ Many factor analysts obtain an approximate maximum likelihood estimate by replacing $\mathbf{S}_n$ with the unbiased estimate $\mathbf{S} = [n/(n-1)]\mathbf{S}_n$ and then minimizing $\ln|\boldsymbol{\Sigma}| + \mathrm{tr}[\boldsymbol{\Sigma}^{-1}\mathbf{S}]$. The dual substitution of $\mathbf{S}$ and the approximate maximum likelihood estimator into the test statistic of (9-39) does not affect its large sample properties.
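The test statistic in (9-39) is based on the ratio of generalized variances

$$
\frac{|\hat{\boldsymbol{\Sigma}}|}{|\mathbf{S}_n|} = \frac{|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|}{|\mathbf{S}_n|}
$$

Let $\hat{\mathbf{V}}^{-1/2}$ be the diagonal matrix such that $\hat{\mathbf{V}}^{-1/2}\mathbf{S}_n\hat{\mathbf{V}}^{-1/2} = \mathbf{R}$. By the properties of determinants (see Result 2A.11),

$$
|\hat{\mathbf{V}}^{-1/2}|\,|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|\,|\hat{\mathbf{V}}^{-1/2}|
= |\hat{\mathbf{V}}^{-1/2}\hat{\mathbf{L}}\hat{\mathbf{L}}'\hat{\mathbf{V}}^{-1/2} + \hat{\mathbf{V}}^{-1/2}\hat{\boldsymbol{\Psi}}\hat{\mathbf{V}}^{-1/2}|
$$

and

$$
|\hat{\mathbf{V}}^{-1/2}|\,|\mathbf{S}_n|\,|\hat{\mathbf{V}}^{-1/2}| = |\hat{\mathbf{V}}^{-1/2}\mathbf{S}_n\hat{\mathbf{V}}^{-1/2}|
$$

Consequently,

$$
\frac{|\hat{\boldsymbol{\Sigma}}|}{|\mathbf{S}_n|}
= \frac{|\hat{\mathbf{V}}^{-1/2}|\,|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|\,|\hat{\mathbf{V}}^{-1/2}|}{|\hat{\mathbf{V}}^{-1/2}|\,|\mathbf{S}_n|\,|\hat{\mathbf{V}}^{-1/2}|}
= \frac{|\hat{\mathbf{V}}^{-1/2}\hat{\mathbf{L}}\hat{\mathbf{L}}'\hat{\mathbf{V}}^{-1/2} + \hat{\mathbf{V}}^{-1/2}\hat{\boldsymbol{\Psi}}\hat{\mathbf{V}}^{-1/2}|}{|\hat{\mathbf{V}}^{-1/2}\mathbf{S}_n\hat{\mathbf{V}}^{-1/2}|}
= \frac{|\hat{\mathbf{L}}_z\hat{\mathbf{L}}_z' + \hat{\boldsymbol{\Psi}}_z|}{|\mathbf{R}|}
\tag{9-41}
$$

by (9-30). From Example 9.5, we determine

$$
\frac{|\hat{\mathbf{L}}_z\hat{\mathbf{L}}_z' + \hat{\boldsymbol{\Psi}}_z|}{|\mathbf{R}|}
= \frac{\begin{vmatrix}
1.000 \\
.572 & 1.000 \\
.513 & .602 & 1.000 \\
.411 & .393 & .405 & 1.000 \\
.458 & .322 & .430 & .523 & 1.000
\end{vmatrix}}
{\begin{vmatrix}
1.000 \\
.577 & 1.000 \\
.509 & .599 & 1.000 \\
.387 & .389 & .436 & 1.000 \\
.462 & .322 & .426 & .523 & 1.000
\end{vmatrix}}
= \frac{.194414}{.193163} = 1.0065
$$

Using Bartlett's correction, we evaluate the test statistic in (9-39):

$$
\big[n - 1 - (2p + 4m + 5)/6\big]\ln\frac{|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|}{|\mathbf{S}_n|}
= \Big[100 - 1 - \frac{(10 + 8 + 5)}{6}\Big]\ln(1.0065) = .62
$$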
Since $\tfrac{1}{2}[(p-m)^2 - p - m] = \tfrac{1}{2}[(5-2)^2 - 5 - 2] = 1$, the 5% critical value $\chi_1^2(.05) = 3.84$ is not exceeded, and we fail to reject $H_0$. We conclude that the data do not contradict a two-factor model. In fact, the observed significance level, or $P$-value, $P[\chi_1^2 > .62] \approx .43$ implies that $H_0$ would not be rejected at any reasonable level.

Large sample variances and covariances for the maximum likelihood estimates $\hat{\ell}_{ij}$, $\hat{\psi}_i$ have been derived when these estimates have been determined from the sample covariance matrix $\mathbf{S}$. (See [20].) The expressions are, in general, quite complicated.

9.4 FACTOR ROTATION
As we indicated in Section 9.2, all factor loadings obtained from the initial loadings by an orthogonal transformation have the same ability to reproduce the covariance (or correlation) matrix. [See (9-8).] From matrix algebra, we know that an orthogonal transformation corresponds to a rigid rotation (or reflection) of the coordinate axes. For this reason, an orthogonal transformation of the factor loadings, as well as the implied orthogonal transformation of the factors, is called factor rotation.

If $\hat{\mathbf{L}}$ is the $p \times m$ matrix of estimated factor loadings obtained by any method (principal component, maximum likelihood, and so forth) then

$$
\hat{\mathbf{L}}^* = \hat{\mathbf{L}}\mathbf{T}, \quad \text{where } \mathbf{T}\mathbf{T}' = \mathbf{T}'\mathbf{T} = \mathbf{I}
\tag{9-42}
$$

is a $p \times m$ matrix of "rotated" loadings. Moreover, the estimated covariance (or correlation) matrix remains unchanged, since

$$
\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}} = \hat{\mathbf{L}}\mathbf{T}\mathbf{T}'\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}} = \hat{\mathbf{L}}^*\hat{\mathbf{L}}^{*\prime} + \hat{\boldsymbol{\Psi}}
\tag{9-43}
$$

Equation (9-43) indicates that the residual matrix, $\mathbf{S}_n - \hat{\mathbf{L}}\hat{\mathbf{L}}' - \hat{\boldsymbol{\Psi}} = \mathbf{S}_n - \hat{\mathbf{L}}^*\hat{\mathbf{L}}^{*\prime} - \hat{\boldsymbol{\Psi}}$, remains unchanged. Moreover, the specific variances $\hat{\psi}_i$, and hence the communalities $\hat{h}_i^2$, are unaltered. Thus, from a mathematical viewpoint, it is immaterial whether $\hat{\mathbf{L}}$ or $\hat{\mathbf{L}}^*$ is obtained.

Since the original loadings may not be readily interpretable, it is usual practice to rotate them until a "simpler structure" is achieved. The rationale is very much akin to sharpening the focus of a microscope in order to see the detail more clearly.

Ideally, we should like to see a pattern of loadings such that each variable loads highly on a single factor and has small to moderate loadings on the remaining factors. However, it is not always possible to get this simple structure, although the rotated loadings for the decathlon data discussed in Example 9.11 provide a nearly ideal pattern.

We shall concentrate on graphical and analytical methods for determining an orthogonal rotation to a simple structure.
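The invariance expressed by (9-42) and (9-43) is easy to confirm numerically. In the sketch below, the loadings are the rounded maximum likelihood values for the stock-price data from Table 9.3, the rotation angle of 30 degrees is an arbitrary choice for illustration, and the function name `rotate_loadings` is hypothetical.

```python
import numpy as np

def rotate_loadings(L, phi):
    """Return L* = L T for the 2 x 2 orthogonal matrix T built from the angle phi (radians)."""
    T = np.array([[np.cos(phi), np.sin(phi)],
                  [-np.sin(phi), np.cos(phi)]])
    return L @ T

# Rounded ML loadings from Table 9.3 (m = 2 factors, p = 5 variables).
L_hat = np.array([[.684, .189],
                  [.694, .517],
                  [.681, .248],
                  [.621, -.073],
                  [.792, -.442]])

L_star = rotate_loadings(L_hat, np.deg2rad(30.0))   # arbitrary illustrative angle

# LL' and the communalities are the same before and after the rotation.
same_fit = np.allclose(L_hat @ L_hat.T, L_star @ L_star.T)
same_h2 = np.allclose(np.sum(L_hat**2, axis=1), np.sum(L_star**2, axis=1))
print(same_fit, same_h2)
```

Only the individual loadings change under the rotation, which is why the choice among rotations is made on grounds of interpretability rather than fit.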
When $m = 2$, or the common factors are considered two at a time, the transformation to a simple structure can frequently be determined graphically. The uncorrelated common factors are regarded as unit vectors along perpendicular coordinate axes. A plot of the pairs of factor loadings $(\hat{\ell}_{i1}, \hat{\ell}_{i2})$ yields $p$ points, each point corresponding to a variable. The coordinate axes can then be visually rotated through an angle, call it
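... $k = 1, 2, \ldots, p$ (10-14), with $\rho_1^{*2} \ge \rho_2^{*2} \ge \cdots \ge \rho_p^{*2}$ ... $\boldsymbol{\rho}_{22}^{-1/2}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1/2}$).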
Comment. Notice that

$$
\begin{aligned}
\mathbf{a}_k'(\mathbf{X}^{(1)} - \boldsymbol{\mu}^{(1)})
&= a_{k1}(X_1^{(1)} - \mu_1^{(1)}) + a_{k2}(X_2^{(1)} - \mu_2^{(1)}) + \cdots + a_{kp}(X_p^{(1)} - \mu_p^{(1)}) \\
&= a_{k1}\sqrt{\sigma_{11}}\,\frac{X_1^{(1)} - \mu_1^{(1)}}{\sqrt{\sigma_{11}}} + \cdots + a_{kp}\sqrt{\sigma_{pp}}\,\frac{X_p^{(1)} - \mu_p^{(1)}}{\sqrt{\sigma_{pp}}}
\end{aligned}
$$

where $\mathrm{Var}(X_i^{(1)}) = \sigma_{ii}$, $i = 1, 2, \ldots, p$. Therefore, the canonical coefficients for the standardized variables, $Z_i^{(1)} = (X_i^{(1)} - \mu_i^{(1)})/\sqrt{\sigma_{ii}}$, are simply related to the canonical coefficients attached to the original variables $X_i^{(1)}$. Specifically, if $\mathbf{a}_k'$ is the coefficient vector for the $k$th canonical variate $U_k$, then $\mathbf{a}_k'\mathbf{V}_{11}^{1/2}$ is the coefficient vector for the $k$th canonical variate constructed from the standardized variables $\mathbf{Z}^{(1)}$. Here $\mathbf{V}_{11}^{1/2}$ is the diagonal matrix with $i$th diagonal element $\sqrt{\sigma_{ii}}$. Similarly, $\mathbf{b}_k'\mathbf{V}_{22}^{1/2}$ is the coefficient vector for the canonical variate constructed from the set of standardized variables $\mathbf{Z}^{(2)}$. In this case $\mathbf{V}_{22}^{1/2}$ is the diagonal matrix with $i$th diagonal element $\sqrt{\sigma_{ii}} = \sqrt{\mathrm{Var}(X_i^{(2)})}$. The canonical correlations are unchanged by the standardization. However, the choice of the coefficient vectors $\mathbf{a}_k$, $\mathbf{b}_k$ will not be unique if $\rho_k^{*2} = \rho_{k+1}^{*2}$.

The relationship between the canonical coefficients of the standardized variables and the canonical coefficients of the original variables follows from the special structure of the matrix [see also (10-16)] $\boldsymbol{\Sigma}_{11}^{-1/2}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1/2}$ (or $\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}$) and, in this book, is unique to canonical correlation analysis. For example, in principal component analysis, if $\mathbf{a}_k'$ is the coefficient vector for the $k$th principal component obtained from $\boldsymbol{\Sigma}$, then $\mathbf{a}_k'(\mathbf{X} - \boldsymbol{\mu}) = \mathbf{a}_k'\mathbf{V}^{1/2}\mathbf{Z}$, but we cannot infer that $\mathbf{a}_k'\mathbf{V}^{1/2}$ is the coefficient vector for the $k$th principal component derived from $\boldsymbol{\rho}$.
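Example 10.1 (Calculating canonical variates and canonical correlations for standardized variables)

Suppose $\mathbf{Z}^{(1)} = [Z_1^{(1)}, Z_2^{(1)}]'$ are standardized variables and $\mathbf{Z}^{(2)} = [Z_1^{(2)}, Z_2^{(2)}]'$ are also standardized variables. Let $\mathbf{Z} = [\mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}]'$ and

$$
\mathrm{Cov}(\mathbf{Z}) =
\begin{bmatrix} \boldsymbol{\rho}_{11} & \boldsymbol{\rho}_{12} \\ \boldsymbol{\rho}_{21} & \boldsymbol{\rho}_{22} \end{bmatrix}
=
\begin{bmatrix}
1.0 & .4 & .5 & .6 \\
.4 & 1.0 & .3 & .4 \\
.5 & .3 & 1.0 & .2 \\
.6 & .4 & .2 & 1.0
\end{bmatrix}
$$

Then

$$
\boldsymbol{\rho}_{11}^{-1/2} = \begin{bmatrix} 1.0681 & -.2229 \\ -.2229 & 1.0681 \end{bmatrix},
\qquad
\boldsymbol{\rho}_{22}^{-1} = \begin{bmatrix} 1.0417 & -.2083 \\ -.2083 & 1.0417 \end{bmatrix}
$$

and

$$
\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}
= \begin{bmatrix} .4371 & .2178 \\ .2178 & .1096 \end{bmatrix}
$$

The eigenvalues, $\rho_1^{*2}$, $\rho_2^{*2}$, of $\boldsymbol{\rho}_{11}^{-1/2}\boldsymbol{\rho}_{12}\boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\boldsymbol{\rho}_{11}^{-1/2}$ are obtained from

$$
0 = \begin{vmatrix} .4371 - \lambda & .2178 \\ .2178 & .1096 - \lambda \end{vmatrix}
= (.4371 - \lambda)(.1096 - \lambda) - (.2178)^2 = \lambda^2 - .5467\lambda + .0005
$$

yielding $\rho_1^{*2} = .5458$ and $\rho_2^{*2} = .0009$. The eigenvector $\mathbf{e}_1$ follows from the vector equation

$$
\begin{bmatrix} .4371 & .2178 \\ .2178 & .1096 \end{bmatrix}\mathbf{e}_1 = (.5458)\,\mathbf{e}_1
$$

Thus, $\mathbf{e}_1' = [.8947, .4466]$ and

$$
\mathbf{a}_1 = \boldsymbol{\rho}_{11}^{-1/2}\mathbf{e}_1 = \begin{bmatrix} .8561 \\ .2776 \end{bmatrix}
$$

$$
\mathbf{b}_1 \propto \boldsymbol{\rho}_{22}^{-1}\boldsymbol{\rho}_{21}\mathbf{a}_1
= \begin{bmatrix} .3959 & .2292 \\ .5209 & .3542 \end{bmatrix}\begin{bmatrix} .8561 \\ .2776 \end{bmatrix}
= \begin{bmatrix} .4026 \\ .5443 \end{bmatrix}
$$

We must scale $\mathbf{b}_1$ so that

$$
\mathrm{Var}(V_1) = \mathrm{Var}(\mathbf{b}_1'\mathbf{Z}^{(2)}) = \mathbf{b}_1'\boldsymbol{\rho}_{22}\mathbf{b}_1 = 1
$$

The vector $[.4026, .5443]'$ gives

$$
[.4026, .5443]\begin{bmatrix} 1.0 & .2 \\ .2 & 1.0 \end{bmatrix}\begin{bmatrix} .4026 \\ .5443 \end{bmatrix} = .5460
$$

Using $\sqrt{.5460} = .7389$, we take

$$
\mathbf{b}_1 = \frac{1}{.7389}\begin{bmatrix} .4026 \\ .5443 \end{bmatrix} = \begin{bmatrix} .5448 \\ .7366 \end{bmatrix}
$$

The first pair of canonical variates is

$$
\begin{aligned}
U_1 &= \mathbf{a}_1'\mathbf{Z}^{(1)} = .86Z_1^{(1)} + .28Z_2^{(1)} \\
V_1 &= \mathbf{b}_1'\mathbf{Z}^{(2)} = .54Z_1^{(2)} + .74Z_2^{(2)}
\end{aligned}
$$

and their canonical correlation is

$$
\rho_1^* = \sqrt{\rho_1^{*2}} = \sqrt{.5458} = .74
$$

This is the largest correlation possible between linear combinations of variables from the $\mathbf{Z}^{(1)}$ and $\mathbf{Z}^{(2)}$ sets.

The second canonical correlation, $\rho_2^* = \sqrt{.0009} = .03$, is very small, and consequently, the second pair of canonical variates, although uncorrelated with members of the first pair, conveys very little information about the association between sets. (The calculation of the second pair of canonical variates is considered in Exercise 10.5.)

We note that $U_1$ and $V_1$, apart from a scale change, are not much different from the pair

$$
\begin{aligned}
\tilde{U}_1 &= \mathbf{a}'\mathbf{Z}^{(1)} = [3, 1]\begin{bmatrix} Z_1^{(1)} \\ Z_2^{(1)} \end{bmatrix} = 3Z_1^{(1)} + Z_2^{(1)} \\
\tilde{V}_1 &= \mathbf{b}'\mathbf{Z}^{(2)} = [1, 1]\begin{bmatrix} Z_1^{(2)} \\ Z_2^{(2)} \end{bmatrix} = Z_1^{(2)} + Z_2^{(2)}
\end{aligned}
$$

For these variates,

$$
\begin{aligned}
\mathrm{Var}(\tilde{U}_1) &= \mathbf{a}'\boldsymbol{\rho}_{11}\mathbf{a} = 12.4 \\
\mathrm{Var}(\tilde{V}_1) &= \mathbf{b}'\boldsymbol{\rho}_{22}\mathbf{b} = 2.4 \\
\mathrm{Cov}(\tilde{U}_1, \tilde{V}_1) &= \mathbf{a}'\boldsymbol{\rho}_{12}\mathbf{b} = 4.0
\end{aligned}
$$

and

$$
\mathrm{Corr}(\tilde{U}_1, \tilde{V}_1) = \frac{4.0}{\sqrt{12.4}\,\sqrt{2.4}} = .73
$$

The correlation between the rather simple and, perhaps, easily interpretable linear combinations $\tilde{U}_1$, $\tilde{V}_1$ is almost the maximum value $\rho_1^* = .74$.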
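The procedure for obtaining the canonical variates presented in Result 10.1 has certain advantages. The symmetric matrices, whose eigenvectors determine the canonical coefficients, are readily handled by computer routines. Moreover, writing the coefficient vectors as $\mathbf{a}_k = \boldsymbol{\Sigma}_{11}^{-1/2}\mathbf{e}_k$ and $\mathbf{b}_k = \boldsymbol{\Sigma}_{22}^{-1/2}\mathbf{f}_k$ facilitates analytic descriptions and their geometric interpretations. To ease the computational burden, many people prefer to get the canonical correlations from the eigenvalue equation

$$
\big|\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} - \rho^{*2}\mathbf{I}\big| = 0
\tag{10-15}
$$

The coefficient vectors $\mathbf{a}$ and $\mathbf{b}$ follow directly from the eigenvector equations

$$
\begin{aligned}
\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\,\mathbf{a} &= \rho^{*2}\mathbf{a} \\
\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\,\mathbf{b} &= \rho^{*2}\mathbf{b}
\end{aligned}
\tag{10-16}
$$

The matrices $\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$ and $\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12}$ are, in general, not symmetric. (See Exercise 10.4 for more details.)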
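Both computational routes can be sketched in NumPy for the correlation matrix of Example 10.1. The code below is illustrative only: the helper names `inv_sqrt` and `canonical_analysis` are hypothetical, only the first pair of coefficient vectors is scaled, and eigenvector signs returned by the library are arbitrary, so coefficient vectors are compared in absolute value.

```python
import numpy as np

rho11 = np.array([[1.0, .4], [.4, 1.0]])
rho22 = np.array([[1.0, .2], [.2, 1.0]])
rho12 = np.array([[.5, .6], [.3, .4]])

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, P = np.linalg.eigh(S)
    return P @ np.diag(w ** -0.5) @ P.T

def canonical_analysis(S11, S12, S22):
    """Canonical correlations and first pair (a1, b1) via the symmetric matrix."""
    S11_ih = inv_sqrt(S11)
    M = S11_ih @ S12 @ np.linalg.inv(S22) @ S12.T @ S11_ih
    w, E = np.linalg.eigh(M)
    order = np.argsort(w)[::-1]              # largest eigenvalue first
    w, E = w[order], E[:, order]
    a1 = S11_ih @ E[:, 0]                    # a_1 = S11^{-1/2} e_1
    b1 = np.linalg.inv(S22) @ S12.T @ a1     # proportional to b_1
    b1 = b1 / np.sqrt(b1 @ S22 @ b1)         # scale so that Var(V_1) = 1
    return np.sqrt(np.clip(w, 0.0, None)), a1, b1

rho_star, a1, b1 = canonical_analysis(rho11, rho12, rho22)

# Nonsymmetric route: squared canonical correlations as eigenvalues of the matrix in (10-15).
N = np.linalg.inv(rho11) @ rho12 @ np.linalg.inv(rho22) @ rho12.T
rho_sq_alt = np.sort(np.real(np.linalg.eigvals(N)))[::-1]

print(np.round(rho_star, 2), np.round(np.abs(a1), 2), np.round(np.abs(b1), 2))
```

The two routes give the same eigenvalues because the symmetric and nonsymmetric matrices are similar; the symmetric form is simply more convenient for routines that assume symmetry.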
10.3 INTERPRETING THE POPULATION CANONICAL VARIABLES

Canonical variables are, in general, artificial. That is, they have no physical meaning. If the original variables $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are used, the canonical coefficients $\mathbf{a}$ and $\mathbf{b}$ have units proportional to those of the $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ sets. If the original variables are standardized to have zero means and unit variances, the canonical coefficients have no units of measurement, and they must be interpreted in terms of the standardized variables.

Result 10.1 gives the technical definitions of the canonical variables and canonical correlations. In this section, we concentrate on interpreting these quantities.

Identifying the Canonical Variables

Even though the canonical variables are artificial, they can often be "identified" in terms of the subject matter variables. Many times this identification is aided by computing the correlations between the canonical variates and the original variables. These correlations, however, must be interpreted with caution. They provide only univariate information, in the sense that they do not indicate how the original variables contribute jointly to the canonical analyses. (See, for example, [11].) For this reason, many investigators prefer to assess the contributions of the original variables directly from the standardized coefficients (10-13).

Let $\underset{(p\times p)}{\mathbf{A}} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p]'$ and $\underset{(q\times q)}{\mathbf{B}} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_q]'$, so that the vectors of canonical variables are

$$
\underset{(p\times 1)}{\mathbf{U}} = \mathbf{A}\mathbf{X}^{(1)}, \qquad \underset{(q\times 1)}{\mathbf{V}} = \mathbf{B}\mathbf{X}^{(2)}
\tag{10-17}
$$

where we are primarily interested in the first $p$ canonical variables in $\mathbf{V}$. Then

$$
\mathrm{Cov}(\mathbf{U}, \mathbf{X}^{(1)}) = \mathrm{Cov}(\mathbf{A}\mathbf{X}^{(1)}, \mathbf{X}^{(1)}) = \mathbf{A}\boldsymbol{\Sigma}_{11}
\tag{10-18}
$$