Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence
© 2000 by CRC Press LLC
Industrial Electronics Series Series Editor J. David Irwin, Auburn University
Titles Included in the Series
Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence
Evangelia Micheli-Tzanakou, Rutgers University
Handbook of Applied Computational Intelligence
Mary Lou Padgett, Auburn University
Nicholas Karayiannis, University of Houston
Lotfi A. Zadeh, University of California Berkeley
Handbook of Applied Neurocontrols
Mary Lou Padgett, Auburn University
Charles C. Jorgensen, NASA Ames Research Center
Paul Werbos, National Science Foundation
Handbook of Power Electronics
Tim L. Skvarenina, Purdue University
Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence
Evangelia Micheli-Tzanakou
Rutgers University
Piscataway, New Jersey
CRC Press Boca Raton London New York Washington, D.C.
Library of Congress Cataloging-in-Publication Data
Micheli-Tzanakou, Evangelia, 1942-
Supervised and unsupervised pattern recognition: feature extraction and computational intelligence / Evangelia Micheli-Tzanakou, editor/author
p. cm. -- (Industrial electronics series)
Includes bibliographical references and index.
ISBN 0-8493-2278-2
1. Pattern recognition systems. 2. Neural networks (Computer science) I. Title. II. Series.
TK7882.P3 M53 1999
006.4--dc21 99-043495 CIP

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

The consent of CRC Press does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press for such copying.

Direct all inquiries to CRC Press LLC, 2000 Corporate Blvd., N.W., Boca Raton, Florida 33431.

© 2000 by CRC Press LLC
No claim to original U.S. Government works
International Standard Book Number 0-8493-2278-2
Library of Congress Card Number 99-043495
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper
Dedication To my late mother for never being satisfied with my progress and for always pushing me to better things in life.
PREFACE
This volume describes the application of supervised and unsupervised pattern recognition schemes to the classification of various types of waveforms and images. An optimization routine, ALOPEX, is used to train the network while reducing the likelihood of convergence to local solutions. The chapters included in this volume bring together recent research of more than ten authors in the field of neural networks and pattern recognition. All of these contributions were carried out in the Neuroelectric and Neurocomputing Laboratories in the Department of Biomedical Engineering at Rutgers University. The chapters span a large variety of problems in signal and image processing, using mainly neural networks for classification and template matching. The inputs to the neural networks are features extracted from a signal or an image by sophisticated and proven state-of-the-art techniques from the fields of digital signal processing, computer vision, and image processing. In all examples and problems examined, the biological equivalents are used as prototypes, and/or simulations of those systems are performed, while systems that mimic the biological functions are built. Experimental and theoretical contributions are treated equally, and interchanges between the two are examined. Technological advances depend on a deep understanding of their biological counterparts, which is why experiments on both animals and humans are performed continuously in our laboratories in order to test our hypotheses in developing products that have technological applications.
The reasoning of most neural networks in their decision making cannot easily be extracted upon the completion of training. However, due to the linearity of the network nodes, the cluster prototypes of an unsupervised system can be reconstructed to illustrate the reasoning of the system. In these applications, this analysis hints at the usefulness of previously unused portions of the spectrum.
The book is divided into four parts.
The first part contains chapters that introduce the subjects of neural networks, classifiers, and feature extraction methods. The neural networks in this part are of the supervised learning type. The second part deals with unsupervised neural networks and fuzzy neural networks and their applications to handwritten character recognition, as well as recognition of normal and abnormal visual evoked potentials. The third part deals with advanced neural network architectures, such as modular designs and their applications to medicine, and with three-dimensional neural network architectures simulating brain functions. Finally, the fourth part discusses general applications and simulations in various fields. Most importantly, the establishment of a brain-to-computer link is discussed in some detail, and the findings from these human experiments are analyzed in a new light.
All chapters have been published, either in their final form or in a preliminary form, in conference proceedings and presentations. Most co-authors of these papers were students of the editor. Extensive editing has been done so that repetitions of algorithms, unless modified, are avoided. Instead, where commonality exists, parts have been placed into a new chapter (Chapter 4), and references to this chapter are made throughout.
As is obvious from the number of names on the chapters, many students have contributed to this compendium. I thank them from this position as well. Others contributed in different ways. Mrs. Marge Melton helped with her expert typing of parts of this book and with proofreading the manuscript. Mr. Steven Orbine helped in more than one way, whenever expert help was needed. Dr. G. Kontaxakis, Dr. P. Munoz, and Mr. Wei Lin helped with the manuscripts of Chapters 1 and 3. Finally, to all the current students of my laboratories, for their patience while this work was compiled, many thanks. I will be more visible, and more demanding, now. Dr. D. Irwin was instrumental in involving me in this book series, and I thank him from this position as well. I thank Ms. Nora Konopka for her patience in waiting and for reminding me of the deadlines, a job that was continued by Ms. Felicia Shapiro and Ms. Mimi Williams. I thank them as well.
Evangelia Micheli-Tzanakou, Ph.D.
Department of Biomedical Engineering
Rutgers University
Piscataway, NJ
Contributors
Ahmet Ademoglu, Ph.D., Assistant Professor, Institute of Biomedical Engineering, Bogazici University, Bebek, Istanbul, Turkey
Sergey Aleynikov, M.S., IDT, Hackensack, NJ
Jeremy Bricker, Ph.D. Candidate, Environmental Fluid Mechanics Laboratory, Department of Civil and Environmental Engineering, Stanford, CA
Tae-Soo Chon, Ph.D., Professor, Department of Biology, College of Natural Sciences, Pusan National University, Pusan, Korea
Woogon Chung, Ph.D., Assistant Professor, Department of Control and Instrumentation, Sung Kyun Kwan University, Kyung Gi-Do, South Korea
Lt. Col. Timothy Cooley, Ph.D., USAF Academy, Department of Mathematical Sciences, Colorado Springs, CO
Timothy J. Dasey, Ph.D., MIT Lincoln Labs, Weather Sensing Group, Lexington, MA
Cynthia Enderwick, M.S., Hewlett Packard, Palo Alto, CA
Faiq A. Fazal, M.S., Lucent Technologies, Murray Hill, NJ
Raymond Iezzi, M.D., Kresge Institute, Detroit, Michigan
Francis Phan, M.S., Harmonix Music Systems, Inc., Cambridge, MA
Seth Wolpert, Ph.D., Associate Professor, Pennsylvania State University, Harrisburg, Middletown, PA
Daniel Zahner, M.S., Data Scope Co., Paramus, NJ
Contents
Section I. Overviews of Neural Networks, Classifiers, and Feature Extraction Methods: Supervised Neural Networks
Chapter 1 Classifiers: An Overview
  1.1 Introduction
  1.2 Criteria for Optimal Classifier Design
  1.3 Categorizing the Classifiers
    1.3.1 Bayesian Optimal Classifiers
    1.3.2 Exemplar Classifiers
    1.3.3 Space Partition Methods
    1.3.4 Neural Networks
  1.4 Classifiers
    1.4.1 Bayesian Classifiers
      1.4.1.1 Minimum ECM Classifiers
      1.4.1.2 Multi-Class Optimal Classifiers
    1.4.2 Bayesian Classifiers with Multivariate Normal Populations
      1.4.2.1 Quadratic Discriminant Score
      1.4.2.2 Linear Discriminant Score
      1.4.2.3 Linear Discriminant Analysis and Classification
      1.4.2.4 Equivalence of LDF to Minimum TPM Classifier
    1.4.3 Learning Vector Quantizer (LVQ)
      1.4.3.1 Competitive Learning
      1.4.3.2 Self-Organizing Map
      1.4.3.3 Learning Vector Quantization
    1.4.4 Nearest Neighbor Rule
  1.5 Neural Networks (NN)
    1.5.1 Introduction
      1.5.1.1 Artificial Neural Networks
      1.5.1.2 Usage of Neural Networks
      1.5.1.3 Other Neural Networks
    1.5.2 Feed-Forward Neural Networks
    1.5.3 Error Backpropagation
      1.5.3.1 Madaline Rule III for Multilayer Network with Sigmoid Function
      1.5.3.2 A Comment on the Terminology ‘Backpropagation’
      1.5.3.3 Optimization Machines with Feed-Forward Multilayer Perceptrons
      1.5.3.4 Justification for Gradient Methods for Nonlinear Function Approximation
      1.5.3.5 Training Methods for Feed-Forward Networks
    1.5.4 Issues in Neural Networks
      1.5.4.1 Universal Approximation
    1.5.5 Enhancing Convergence Rate and Generalization of an Optimization Machine
      1.5.5.1 Suggestions for Improving the Convergence
      1.5.5.2 Quick Prop
      1.5.5.3 Kullback-Leibler Distance
      1.5.5.4 Weight Decay
      1.5.5.5 Regression Methods for Classification Purposes
    1.5.6 Two-Group Regression and Linear Discriminant Function
    1.5.7 Multi-Response Regression and Flexible Discriminant Analysis
      1.5.7.1 Powerful Nonparametric Regression Methods for Classification Problems
    1.5.8 Optimal Scoring (OS)
      1.5.8.1 Partially Minimized ASR
    1.5.9 Canonical Correlation Analysis
    1.5.10 Linear Discriminant Analysis
      1.5.10.1 LDA Revisited
    1.5.11 Translation of Optimal Scoring Dimensions into Discriminant Coordinates
    1.5.12 Linear Discriminant Analysis via Optimal Scoring
      1.5.12.1 LDA via OS
    1.5.13 Flexible Discriminant Analysis by Optimal Scoring
  1.6 Comparison of Experimental Results
  1.7 System Performance Assessment
    1.7.1 Classifier Evaluation
      1.7.1.1 Hold-Out Method
      1.7.1.2 K-Fold Cross-Validation
    1.7.2 Bootstrapping Method for Estimation
      1.7.2.1 Jackknife Estimation
      1.7.2.2 Bootstrap Method
  1.8 Analysis of Prediction Rates from Bootstrapping Assessment
  References
Chapter 2 Artificial Neural Networks: Definitions, Methods, Applications
  2.1 Introduction
  2.2 Definitions
  2.3 Training Algorithms
    2.3.1 Backpropagation Algorithm
    2.3.2 The ALOPEX Algorithm
    2.3.3 Multilayer Perceptron (MLP) Network Training with ALOPEX
  2.4 Some Applications
    2.4.1 Expert Systems and Neural Networks
    2.4.2 Applications in Mammography
    2.4.3 Chromosome and Genetic Sequences Classification
  References
Chapter 3 A System for Handwritten Digit Recognition
  3.1 Introduction
  3.2 Preprocessing of Handwritten Digit Images
    3.2.1 Optimal Size of the Mask for Dilation
    3.2.2 Bartlett Statistic
  3.3 Zernike Moments (ZM) for Characterization of Image Patterns
    3.3.1 Reconstruction by Zernike Moments
    3.3.2 Features from Zernike Moments
  3.4 Dimensionality Reduction
    3.4.1 Principal Component Analysis
    3.4.2 Discriminant Analysis
  3.5 Analysis of Prediction Error Rates from Bootstrapping Assessment
  3.6 Summary
  Acknowledgments
  References
Chapter 4 Other Types of Feature Extraction Methods
  4.1 Introduction
  4.2 Wavelets
    4.2.1 Discrete Wavelet Series
    4.2.2 Discrete Wavelet Transform (DWT)
    4.2.3 Spline Wavelet Transform
    4.2.4 The Discrete B-Spline Wavelet Transform
    4.2.5 Design of Quadratic Spline Wavelets
    4.2.6 The Fast Algorithm
  4.3 Invariant Moments
  4.4 Entropy
  4.5 Cepstrum Analysis
  4.6 Fractal Dimension
  4.7 SGLD Texture Features
  References
Section II. Unsupervised Neural Networks
Chapter 5 Fuzzy Neural Networks
  5.1 Introduction
  5.2 Pattern Recognition
    5.2.1 Theory and Applications
    5.2.2 Feature Extraction
    5.2.3 Clustering
  5.3 Optimization
    5.3.1 Theory and Objectives
    5.3.2 Background
    5.3.3 Modified ALOPEX Algorithm
  5.4 System Design
    5.4.1 Feature Extraction
      5.4.1.1 The Karhunen-Loève Expansion
      5.4.1.2 Application by a Neural Network
  5.5 Clustering
    5.5.1 The Fuzzy c-Means (FCM) Clustering Algorithm
  References
Chapter 6 Application to Handwritten Digits
  6.1 Introduction to Character Recognition
  6.2 Data Collection
    6.2.1 Preprocessing
    6.2.2 Noise Thresholding
    6.2.3 Center of Mass Adjustment
    6.2.4 Line Thinning
    6.2.5 Fixing to Size
    6.2.6 Rotation
    6.2.7 Reducing Resolution
    6.2.8 Blurring
  6.3 Results
  6.4 Discussion
  6.5 Summary
  References
Chapter 7 An Unsupervised Neural Network System for Visual Evoked Potentials
  7.1 Introduction
  7.2 Data Collection and Preprocessing
  7.3 System Design
  7.4 Results
  7.5 Discussion
  References
Section III. Advanced Neural Network Architectures/Modular Neural Networks
Chapter 8 Classification of Mammograms Using a Modular Neural Network
  8.1 Introduction
  8.2 Methods and System Overview
    8.2.1 Data Acquisition
    8.2.2 Feature Extraction by Transformation
  8.3 Modular Neural Networks
  8.4 Neural Network Training
  8.5 Classification Results
  8.6 The Process of Obtaining Results
  8.7 ALOPEX Parameters
  8.8 Generalization
  8.9 Conclusions
  Acknowledgments
  References
Chapter 9 Visual Ophthalmologist: An Automated System for Classification of Retinal Damage
  9.1 Introduction
  9.2 System Overview
    9.2.1 Image Processing
    9.2.2 Feature Extraction Methods
    9.2.3 Image Classification
  9.3 Modular Neural Networks
  9.4 Application to Ophthalmology
  9.5 Results
  9.6 Discussion
  References
Chapter 10 A Three-Dimensional Neural Network Architecture
  10.1 Introduction
  10.2 The Neural Network Architecture
  10.3 Simulations
    10.3.1 Visual Receptive Fields
    10.3.2 Modeling of Parkinson’s Disease
  10.4 Discussion
  References
Section IV. General Applications
Chapter 11 A Feature Extraction Algorithm Using Connectivity Strengths and Moment Invariants
  11.1 Introduction
  11.2 ALOPEX Algorithms
    11.2.1 Original Algorithm
    11.2.2 Reinforcement Rules
    11.2.3 A Generalized ALOPEX Algorithm
      11.2.3.1 Process I
      11.2.3.2 Process II
  11.3 Moment Invariants and ALOPEX
  11.4 Results and Discussion
  Acknowledgments
  References
Chapter 12 Multilayer Perceptrons with ALOPEX: 2D-Template Matching and VLSI Implementation
  12.1 Introduction
    12.1.1 Multilayer Perceptrons
  12.2 Multilayer Perceptron and Template Matching
  12.3 VLSI Implementation of ALOPEX
  References
Chapter 13 Implementing Neural Networks in Silicon
  13.1 Introduction
  13.2 The Living Neuron
  13.3 Neuromorphic Models
  13.4 Neurological Process Modeling
  References
Chapter 14 Speaker Identification through Wavelet Multiresolution Decomposition and ALOPEX
  14.1 Introduction
  14.2 Multiresolution Analysis through Wavelet Decomposition
  14.3 Pattern Recognition with ALOPEX
  14.4 Methods
    14.4.1 Data Acquisition
    14.4.2 Data Preprocessing
    14.4.3 Representing the Wavelet Coefficients for Template Matching
  14.5 Results
  14.6 Discussion
  Acknowledgments
  References
Chapter 15 Face Recognition in Alzheimer’s Disease: A Simulation
  15.1 Introduction
  15.2 Methods
  15.3 Results
  15.4 Discussion
  References
Chapter 16 Self-Learning Layered Neural Network
  16.1 Introduction
  16.2 Neocognitron and Pattern Classification
    16.2.1 Training Algorithm
  16.3 Objectives
  16.4 Methods
  16.5 Study A
    16.5.1 Network Description
    16.5.2 Results from Study A
  16.6 Study B
    16.6.1 Results from Study B
  16.7 Summary and Discussion
  References
Chapter 17 Biological and Machine Vision
  17.1 Introduction
  17.2 Distributed Representation
  17.3 The Model
  17.4 A Modified ALOPEX Algorithm
  17.5 Application to Template Matching
  17.6 Brain to Computer Link
    17.6.1 Global Receptive Fields in the Human Visual System
    17.6.2 The Black Box Approach
  17.7 Discussion
  References
Introduction—Why this Book? The potential for achieving a great deal of processing power by wiring together a large number of very simple and somewhat primitive devices has captured the imagination of scientists and engineers for many years. In recent years, the possibility of implementing such systems by means of electro-optical devices and in very large scale integrations has resulted in increased research activities. Artificial neural networks (ANNs) or simply Neural Networks (NNs) are made of interconnected devices called neurons (also called neurodes, nodes, neural units, or simply units). Loosely inspired by the makeup of the nervous system, these interconnected devices look at patterns of data and learn to classify them. NNs have been used in a wide variety of signal processing and pattern recognition applications and have been successfully applied in such diverse fields as speech processing, handwritten character recognition, time series prediction, data compression, feature extraction, and pattern recognition in general. Their attractiveness lies in the relative simplicity with which the networks can be designed for a specific problem along with their ability to perform nonlinear data processing. As the neuron is the building block of a brain, a neural unit is the building block of a neural network. Although the two are far from being the same, or performing the same functions, they still possess similarities that are remarkably important. NNs consist of a large number of interconnected units that give them the ability to process information in a highly parallel way. An artificial neuron sums all inputs to it and creates an output that carries information to other neurons. The strength by which two neurons influence each other is called a synaptic weight. 
In an NN all neurons are connected to all other neurons by synaptic weights that can have seemingly arbitrary values, but in reality, these weights show the effect of a stimulus on the neural network and the ability or lack of it to recognize that stimulus. All NNs have certain architectures and all consist of several layers of neuronal arrangements. The most widely used architecture is that of the perceptron first described in 1958 by Rosenblatt. A single node acts like an integrator of its weighted inputs. Once the result is found it is passed to other nodes via connections that are called synapses. Each node is characterized by a parameter that is called threshold or offset and by the kind of nonlinearity through which the sum of all the inputs is passed. Typical nonlinearities are the hardlimiter, the ramp (threshold logic element) and the widely used sigmoid. NNs are specified by their processing element characteristics, the network topology and the training or learning rules they follow in order to adapt the weights, Wi. Network topology falls into two broad classes: feedforward (nonrecursive) and feedback (recursive). Nonrecursive NNs offer the advantage of simplicity of implementation and analysis. For static mappings a nonrecursive network is all one needs to specify any static condition. Adding feedback expands the network’s range of
behavior since now its output depends upon both the current input and network states. But one has to pay a price: longer times for teaching the NN to recognize its inputs.
The most widely used training algorithm is the backpropagation algorithm, a learning scheme in which the error is backpropagated layer by layer and used to update the weights. The algorithm is a gradient descent method that minimizes the error between the desired outputs and the actual outputs calculated by the multilayer perceptron (MLP). Perceptrons trained with backpropagation are examples of supervised learning. In this type of learning the NN is trained on a training set consisting of vector pairs. One vector of each pair is used as input to the network; the other is used as the desired or target output. During training the weights of the NN are adjusted in such a way as to minimize the error between the target and the computed output of the network. This process might take a large number of iterations to converge, and some training algorithms (such as backpropagation) might converge to a local minimum instead of the global one. If the training process is successful, the network is capable of performing the desired mapping.
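The single node described earlier in this introduction (a weighted sum of inputs, offset by a threshold and passed through a nonlinearity) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the example weights, and the particular clipping range chosen for the ramp are assumptions made here, not definitions taken from the text.

```python
import math

def hard_limiter(s):
    """Step nonlinearity: 1 if the net input is non-negative, else 0."""
    return 1.0 if s >= 0.0 else 0.0

def ramp(s):
    """Threshold-logic element (assumed form): linear on [0, 1], clipped outside."""
    return min(1.0, max(0.0, s))

def sigmoid(s):
    """Logistic sigmoid, a smooth version of the hard limiter."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron_output(inputs, weights, threshold, nonlinearity):
    """Sum the weighted inputs, subtract the threshold, apply the nonlinearity."""
    net = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return nonlinearity(net)

# Hypothetical inputs, synaptic weights, and threshold.
x = [1.0, 0.5, -1.0]
w = [0.8, -0.4, 0.3]
theta = 0.2

for f in (hard_limiter, ramp, sigmoid):
    print(f.__name__, neuron_output(x, w, theta, f))
```

The same net input produces three different outputs, which shows that the choice of nonlinearity, together with the weights and the threshold, determines the behavior of the node.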
Section I Overviews of Neural Networks, Classifiers, and Feature Extraction Methods—Supervised Neural Networks
1 Classifiers: An Overview
Woogon Chung and Evangelia Micheli-Tzanakou
1.1 INTRODUCTION
One way to better understand a subject is to classify or categorize it among related subjects. Many classifiers result from different approaches to classification problems. The purpose of this article is to categorize the well-known classifiers in the literature according to how they learn to classify. Lippmann’s tutorial paper¹ described various classifiers as well as neural networks in detail, after his first discussion² on the general application of neural networks. Another general overview of this subject is found in a paper by Hush and Horne³ in which neural networks are reviewed in the broad dichotomy of stationary vs. dynamic networks. Weiss and Kulikowski’s book⁴ covers classification and prediction methods in general from the points of view of statistics, neural networks, machine learning, and expert systems.
The purpose of this article is not to give a tutorial on the well-developed networks and other classifiers but to introduce another branch in the growing classifier tree, that of nonparametric regression approaches to classification problems. Recently Hastie, Tibshirani, and Buja⁵ introduced Flexible Discriminant Analysis (FDA) in the applied statistics literature, following unpublished work by Breiman and Ihaka.⁶ Canonical Correlation Analysis (CCA) for two sets of variables is known to be equivalent, up to a scalar multiple, to Linear Discriminant Analysis (LDA). Optimal Scoring (OS) is an alternative to CCA, where the classical Singular Value Decomposition (SVD) is used to find the solutions. OS brings the flexibility obtained via nonparametric regression to discriminant analysis, hence the name Flexible Discriminant Analysis.
A number of recently developed multivariate regression methods are used for classification, in addition to other groups of classifiers, on a data set obtained from handwritten digit images. The software was contributed mainly by the authors or by active researchers in this area. The sources are described in later sections after the description of each classifier.
1.2 CRITERIA FOR OPTIMAL CLASSIFIER DESIGN
We start with a general description of the classification problem and then proceed to a discussion of simpler cases in which assumptions are made. Which criterion should be used is application specific. The Expected Cost for Misclassification (ECM) is applied to problems in which the cost of misclassification differs among the cases. For example, one may assign a higher cost to misdiagnosing a patient with a serious disease as healthy than to misdiagnosing a healthy person as unhealthy. If a meteorologist forecasts fine weather for the weekend but a heavy storm strikes the town, the cost of the misclassification will be much higher than if the opposite situation occurs.
Sometimes we do not care about the resulting cost of misclassification. For a pattern recognition system, the cost of misclassifying pattern ‘A’ as pattern ‘B’ may be considered the same as the cost of misclassifying pattern ‘B’ as pattern ‘A’. In this situation we can disregard the cost information or assign the same cost to all cases. An optimal classification procedure might then consider only the probability of misclassification (from the conditional distributions) and how likely each class is to occur (from the a priori probabilities). The criterion minimized by such a procedure is the Total Probability of Misclassification (TPM). The ECM, however, requires three kinds of information: the conditional distributions, the a priori probabilities, and the costs of misclassification.
In the simplest case, we also ignore the a priori probabilities or assume that they are all equal. In this case we only wish to reduce misclassification for all the classes without considering the class proportions of the given data. It should be noted, however, that it is relatively simple to estimate the a priori probabilities from the sample at hand by the frequency approximation. Thus the TPM, in which the class conditional distributions and the a priori probabilities are considered, is often the criterion of choice.
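The difference between the two criteria can be made concrete with a small numerical sketch. All numbers below (priors, misclassification probabilities, and costs for two hypothetical classifiers A and B) are invented for illustration, and the function names are assumptions; the point is only that the TPM and the ECM can rank the same two classifiers differently.

```python
def expected_cost(p_mis, priors, cost):
    """ECM: sum over true classes i and wrong decisions j of P_i * cost * P(j|i).

    p_mis[i][j] is the probability that a class-i observation is assigned to j,
    and cost[i][j] is the cost of that assignment.
    """
    total = 0.0
    for i, prior in enumerate(priors):
        for j in range(len(priors)):
            if j != i:
                total += prior * cost[i][j] * p_mis[i][j]
    return total

def total_probability(p_mis, priors):
    """TPM: the ECM with every misclassification cost set to one."""
    n = len(priors)
    unit_cost = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    return expected_cost(p_mis, priors, unit_cost)

# Hypothetical example: class 0 = serious disease, class 1 = healthy.
priors = [0.1, 0.9]
cost = [[0.0, 100.0],   # missing the disease is assumed to be very costly
        [1.0, 0.0]]     # a false alarm is assumed to be cheap
classifier_a = [[0.70, 0.30], [0.05, 0.95]]   # rows: true class, columns: decision
classifier_b = [[0.95, 0.05], [0.20, 0.80]]

for name, p in (("A", classifier_a), ("B", classifier_b)):
    print(name, "TPM =", total_probability(p, priors),
          "ECM =", expected_cost(p, priors, cost))
```

With these assumed numbers, classifier A has the lower TPM, while classifier B has the lower ECM because it rarely makes the expensive error.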
1.3 CATEGORIZING THE CLASSIFIERS
1.3.1 BAYESIAN OPTIMAL CLASSIFIERS
Bayesian classifiers are based on probabilistic information on the populations from which a sample of training data is drawn randomly. Random sampling is assumed; it is necessary for the sample to represent the underlying population probability function well. An optimal classifier would be one that minimizes the ECM criterion, which consists of three types of probabilistic information: the class conditional probabilities p_i(x), the a priori probabilities P_i, and the costs for misclassification C(i|j), i ≠ j, for i, j ∈ G.
Another criterion for an optimal Bayesian classifier ignores the costs of the different misclassifications, or uses the same cost for all of them. The probabilistic information used is then p_i(x) and P_i for i ∈ G. This minimum TPM classifier is the familiar Maximum A Posteriori (MAP) classifier, as will be shown in Section 1.4.1.
For the minimum ECM and TPM optimal classifiers, we need to estimate the class conditional densities for the different classes, which is usually difficult for q ≳ 2. This difficulty in density estimation is related to the curse of dimensionality, caused by the fact that a high-dimensional space is mostly empty.
A simplified Bayesian classifier can be obtained by assuming a normal distribution for the class conditional density functions. With the normal distribution assumption, the conditional density functions are parameterized by the mean vectors µ_i and the covariance matrices Σ_i for i ∈ G, where G is the set of class labels. Depending on the assumption made about the covariance matrices, we have a quadratic discriminant classifier or a linear discriminant classifier.
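As a rough illustration of the simplified Bayesian classifier, the following sketch fits a normal density to each class and assigns a new point to the class with the largest posterior. It is a minimal sketch under stated assumptions: the input is restricted to two dimensions so that the covariance matrix can be inverted in closed form, the priors are estimated by class frequencies, and the training points and function names are hypothetical.

```python
import math

def mean_and_cov(points):
    """Sample mean vector and 2x2 covariance matrix of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def log_gaussian(x, mean, cov):
    """Log of the bivariate normal density at x."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Mahalanobis distance from the closed-form inverse of a symmetric 2x2 matrix.
    maha = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * maha

def fit(training):
    """training maps each class label to its list of 2-D points."""
    total = sum(len(pts) for pts in training.values())
    model = {}
    for label, pts in training.items():
        mean, cov = mean_and_cov(pts)
        model[label] = (len(pts) / total, mean, cov)  # prior by frequency
    return model

def classify(x, model):
    """Choose the class with the largest log posterior, log P_i + log p_i(x)."""
    def log_posterior(label):
        prior, mean, cov = model[label]
        return math.log(prior) + log_gaussian(x, mean, cov)
    return max(model, key=log_posterior)

# Hypothetical training sample with two well-separated classes.
training = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)],
    "B": [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0), (4.5, 4.8)],
}
model = fit(training)
print(classify((0.2, 0.3), model), classify((4.6, 4.4), model))
```

Because each class keeps its own covariance matrix, the resulting decision boundary is quadratic in x. Replacing the class covariances by a single pooled covariance matrix would make the quadratic terms cancel and give a linear boundary.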
1.3.2 EXEMPLAR CLASSIFIERS
The simplest nonparametric classifier uses the label information of the training data to allocate the unknown input x. The idea is to find the distribution of the labels in a neighborhood of a new observation x in the training sample and pick the label that occurs most often. The best-known classifier in this group is the K-nearest neighbors (KNN) classifier. This classifier is justified either via nearest-neighbor density estimation or via nearest-neighbor nonparametric regression.⁷ Practical issues in KNN include the choice of a metric to measure the distance between the K nearest points and the unknown pattern point, and fast searches for neighbors. Advanced data structures such as K-D trees⁸ are suggested for faster searches at the expense of complications in training and adaptation.
Other examples are the feature-map classifier,⁹ Learning Vector Quantization (LVQ),¹⁰ the Adaptive Resonance Theory (ART) classifier,¹¹ and others that are found in the survey paper by Lippmann.¹ Vector Quantization (VQ)¹²,¹³ is another classical exemplar-finding algorithm that has been used in communications engineering for the purpose of data reduction for storage and transmission. The exemplar classifiers (except for the KNN classifier) cluster the training patterns via unsupervised learning, followed by supervised learning or label assignment.
A Radial Basis Function (RBF) network¹⁴ is also a combination of unsupervised and supervised learning. The basis function is radial and symmetric around the mean vector, which is the centroid of a cluster formed in the unsupervised learning stage, hence the name radial basis function. RBF networks are two-layer networks in which the first layer nodes represent radial functions (usually Gaussian). The second layer weights are used to combine the individual radial functions linearly, and the weights are adapted via a linear least squares algorithm during training by supervised learning. Figure 1.1 depicts the structure of the RBF networks.
FIGURE 1.1 RBF network. Two-layer network in which the first layer nodes are radial functions centered at different locations and the second layer nodes are linear.
The LMS algorithm,¹⁵ a simple modification of linear least squares, is usually used during training for the output layer weights. Any unsupervised clustering algorithm, such as the K-means algorithm (i.e., the LBG algorithm¹³) or the Self-Organizing Map,¹⁰ may be used in the first clustering stage.
The most common basis is a Gaussian kernel function of the form:
$$\theta_j = \exp\left( -\frac{(x - m_j)^t (x - m_j)}{2\sigma_j^2} \right), \qquad j = 1, 2, \ldots, n \tag{1.1}$$
where m_j is the mean vector of the jth cluster found from a clustering algorithm, and x is the input pattern vector. Here σ_j² is the normalization factor, which is a spread measure of the points in a cluster. The average squared distance of the points from the centroid is the common choice for the normalization factor:
$$\sigma_j^2 = \frac{1}{M_j} \sum_{x \in w_j} (x - m_j)^t (x - m_j) \tag{1.2}$$
where w_j is the set of the points in the jth cluster and M_j is the number of the points in the jth cluster. A generalization of the radial function utilizes the variance of an individual variable and covariance among the variables in the training sample. The Mahalanobis distance in the Gaussian kernel has the form:
$$\theta_j = \exp\left( -(x - m_j)^t\, \Sigma_j^{-1}\, (x - m_j) \right), \qquad j = 1, 2, \ldots, n \tag{1.3}$$
where Σ_j is the covariance matrix in the jth cluster. The localized distribution function is now ellipsoidal rather than a radial function. A more extensive study on the RBF networks can be found in Hush and Horne.³
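The two training stages of an RBF network can be sketched as follows. This is a minimal sketch, not the implementation used in the experiments: the clusters are assumed to be already available from an unsupervised stage, the centers and spreads follow Equations 1.1 and 1.2, and the output weights (plus a bias term, added here as an assumption) are adapted with the LMS rule. The data, learning rate, and number of epochs are hypothetical.

```python
import math

def centroid(points):
    """Mean vector m_j of the points in one cluster."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def sq_dist(x, m):
    """Squared Euclidean distance (x - m)^t (x - m)."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def spread(points, m):
    """Eq. (1.2): average squared distance of the cluster points from the centroid."""
    return sum(sq_dist(p, m) for p in points) / len(points)

def activations(x, centers, spreads):
    """Eq. (1.1): Gaussian kernel response of each first-layer node."""
    return [math.exp(-sq_dist(x, m) / (2.0 * s)) for m, s in zip(centers, spreads)]

def train_lms(samples, targets, centers, spreads, rate=0.1, epochs=2000):
    """Adapt the linear output weights (last entry is the bias) with the LMS rule."""
    weights = [0.0] * (len(centers) + 1)
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            phi = activations(x, centers, spreads) + [1.0]  # constant bias input
            error = t - sum(w * p for w, p in zip(weights, phi))
            weights = [w + rate * error * p for w, p in zip(weights, phi)]
    return weights

def predict(x, weights, centers, spreads):
    """Linear second-layer combination of the radial functions."""
    phi = activations(x, centers, spreads) + [1.0]
    return sum(w * p for w, p in zip(weights, phi))

# Hypothetical clusters, assumed to come from an earlier clustering stage.
clusters = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],
    [(3.0, 3.0), (3.2, 2.9), (2.9, 3.1)],
]
centers = [centroid(c) for c in clusters]
spreads = [spread(c, m) for c, m in zip(clusters, centers)]
samples = [p for c in clusters for p in c]
targets = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # supervised labels for the second stage
weights = train_lms(samples, targets, centers, spreads)
print([round(predict(x, weights, centers, spreads), 2) for x in samples])
```

Thresholding the output at 0.5 then assigns each training point to its own cluster label. Replacing `sq_dist` by a Mahalanobis distance with a per-cluster covariance matrix would give the ellipsoidal kernel of Equation 1.3.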
1.3.3 SPACE PARTITION METHODS
The input space X is recursively partitioned into children subspaces such that the class distributions of the subspaces become as pure as possible; the impurity of the class distribution in a subspace measures how strongly the classes are still mixed within it. There are a number of different schemes for estimating trees. Quinlan’s ID3¹⁶ is well known in the machine learning literature. The citations for some of its variants can be found in a review paper by Ripley.¹⁷ The best-known partitioning method is the Classification and Regression Tree (CART),¹⁸ which is used to build a binary tree partitioning the input space. At each split of the subspace, each variable is considered with a separating value, and the separating variable with the best separating value is chosen to split the subspace into two children subspaces. The main issue in the CART algorithm is how to ‘grow’ the tree to fit the given training data well and ‘prune’ it to avoid over-fitting, i.e., to improve the regularization.
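A single splitting step of such a tree can be sketched as below. The sketch assumes the Gini index as the impurity measure and midpoints between sorted values as candidate separating values; both are common choices but are assumptions here, since the text does not fix them. The data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure subspace)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(points, labels):
    """Try every variable and candidate value; return (impurity, variable, value)."""
    best = None
    n = len(points)
    for var in range(len(points[0])):
        values = sorted(set(p[var] for p in points))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [lab for p, lab in zip(points, labels) if p[var] <= thr]
            right = [lab for p, lab in zip(points, labels) if p[var] > thr]
            # Weighted average impurity of the two children subspaces.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, var, thr)
    return best

# Hypothetical two-variable sample; the classes separate along variable 0 only.
points = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (6.0, 2.0), (7.0, 5.5), (8.0, 1.5)]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(points, labels))
```

Growing a tree amounts to applying `best_split` recursively to each child subspace until a stopping rule is met; pruning then removes splits that do not improve performance on data not used for growing.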
1.3.4 NEURAL NETWORKS
Neural networks are popular, and there are numerous textbooks and journals devoted to the topic. Lippmann (1987)² is recommended for a general overview of neural networks for classification and (auto)associative memory applications. A statistician’s view on using neural networks for multivariate regression and classification purposes is found in extensive review papers by Ripley.¹⁹,¹⁷ A review of different learning algorithms, including their historical development, is given by Hinton.²⁰ In this chapter we are mainly interested in the multivariate regression and classification properties of neural networks, usually in the form of feed-forward multilayer perceptrons. Chapter 2 deals mainly with neural network architectures and algorithms.
1.4 CLASSIFIERS
1.4.1
BAYESIAN CLASSIFIERS
For simplicity we start with a two-class classification problem and then extend it to multi-class cases in a straightforward way. Three kinds of information are needed for an optimal classification design procedure in the Bayesian sense:

C(2|1), C(1|2)    cost of misclassification
P1, P2    a priori probabilities
p1(x), p2(x)    class conditional probability density functions

where $C(i|j)$ is the cost of misclassifying j as i. With this notation, the probability that an observation is misclassified as $w_2$ is the product of the probability that an observation from $w_1$ falls in $R_2$ and the probability that the observation comes from $w_1$:

$$P(\text{misclassified as } w_2) = P(X \in R_2 \mid w_1)\,P(w_1) = P(2|1)\,P_1 \tag{1.4}$$

where the region $R_2$ and $P(2|1)$ (i.e., the integral of $p_1(x)$ over the region $R_2$) are depicted in Figure 1.2. $R_i$, $i \in \{1,2\}$, is the optimum decision region for class i in the input space, chosen so that minimum error results. $P(i|j)$, $i \ne j \in \{1,2\}$, is the integral of the class conditional density $p_j(x)$ over the region of the other class, $R_i$, and thus measures the probability of error due to the regions and the conditional probability functions.
FIGURE 1.2 Misclassification probabilities and decision regions R1 and R2.
1.4.1.1
Minimum ECM Classifiers
When the criterion is to minimize the ECM (Expected Cost for Misclassification), the resulting optimal classifier is called a Minimum ECM classifier. The cost for correct classification is usually set to zero, and positive numbers are used for misclassification costs. The whole supporting region is the input space X, which is divided into two exclusive and exhaustive subregions: $X = R_1 \cup R_2$. By definition, the ECM can be written in terms of the region $R_1$ for class 1 as follows:

$$\begin{aligned}
\mathrm{ECM} &= C(2|1)\,P(2|1)\,P_1 + C(1|2)\,P(1|2)\,P_2 \\
&= C(2|1)\,P_1 \int_{R_2} p_1(x)\,dx + C(1|2)\,P_2 \int_{R_1} p_2(x)\,dx
\end{aligned} \tag{1.5}$$

$$\begin{aligned}
\mathrm{ECM} &= C(2|1)\,P_1 \left[1 - \int_{R_1} p_1(x)\,dx\right] + C(1|2)\,P_2 \int_{R_1} p_2(x)\,dx \\
&= \int_{R_1} \bigl(C(1|2)\,P_2\,p_2(x) - C(2|1)\,P_1\,p_1(x)\bigr)\,dx + C(2|1)\,P_1
\end{aligned} \tag{1.6}$$
with all the individual quantities being positive. Since the last term $C(2|1)P_1$ is a positive constant, the ECM is brought as close to zero as possible by making the integral in Equation 1.6 as negative as possible. Thus the ECM is minimized if the region $R_1$ includes those values x for which the integrand is not positive,

$$C(1|2)\,P_2\,p_2(x) - C(2|1)\,P_1\,p_1(x) \le 0 \tag{1.7}$$
and excludes those x for which this quantity is positive. That is, $R_1$, the decision region for class 1, must be the set of points x such that

$$C(1|2)\,P_2\,p_2(x) \le C(2|1)\,P_1\,p_1(x) \tag{1.8}$$

or

$$\frac{p_1(x)}{p_2(x)} \ge \frac{C(1|2)\,P_2}{C(2|1)\,P_1} \tag{1.9}$$
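As an illustration of the decision rule in Equation 1.8, the following sketch allocates a scalar observation to one of two classes. The univariate normal densities, the numerical values, and the function names are assumptions made only for this example.

```python
import math


def normal_pdf(x, mean, std):
    """Univariate normal density, used here only as an example class-conditional density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def min_ecm_class(x, p1, p2, prior1, prior2, cost_2_given_1, cost_1_given_2):
    """Return 1 if x falls in R1 according to Equation 1.8, otherwise 2."""
    # Equation 1.8: C(1|2) P2 p2(x) <= C(2|1) P1 p1(x) allocates x to class 1.
    if cost_1_given_2 * prior2 * p2(x) <= cost_2_given_1 * prior1 * p1(x):
        return 1
    return 2


def p1(x):
    return normal_pdf(x, 0.0, 1.0)


def p2(x):
    return normal_pdf(x, 2.0, 1.0)


# Equal costs and priors: the boundary lies midway between the two means (x = 1).
equal = [min_ecm_class(x, p1, p2, 0.5, 0.5, 1.0, 1.0) for x in (0.5, 1.5)]
# Making C(1|2) ten times larger shrinks R1, so x = 0.5 is now allocated to class 2.
costly = [min_ecm_class(x, p1, p2, 0.5, 0.5, 1.0, 10.0) for x in (0.5, 1.5)]
```

Raising the cost C(1|2) moves the boundary away from class 2, which is exactly the effect of the right-hand side of Equation 1.9.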
In Equations 1.8 and 1.9 the region is expressed as the set of solutions x of an inequality. The fractional form of Equation 1.9 for the region $R_1$ is the preferred format, since it reduces to a simple form (which will be shown) when the conditional distribution functions $p_i(x)$, $i = 1, 2$, are assumed to be normal with the same covariance matrix, as in simple Bayesian classifiers. Assuming the same cost for each misclassification reduces the ECM criterion to the Total Probability of Misclassification (TPM), and Equations 1.8 and 1.9 become

$$P_2\,p_2(x) \le P_1\,p_1(x) \tag{1.10}$$

or

$$\frac{p_1(x)}{p_2(x)} \ge \frac{P_2}{P_1} \tag{1.11}$$

By the Bayes theorem,

$$P(w_k \mid x) = \frac{P_k\,p_k(x)}{\sum_{i=1}^{2} P_i\,p_i(x)} \quad \text{for all } k \in \{1, 2\} \tag{1.12}$$

the corresponding decision rule (Equation 1.10) becomes the Maximum A Posteriori (MAP) criterion, that is, to allocate x into $w_1$ if

$$P(w_2 \mid x) \le P(w_1 \mid x) \tag{1.13}$$

1.4.1.2
Multi-Class Optimal Classifiers
The boundary regions of the minimum ECM optimal classifier for a multi-class classifier are obtained in a straightforward manner from Equation 1.6 by minimizing

$$\mathrm{ECM} = \sum_{i=1}^{J} P_i \sum_{\substack{k=1 \\ k \ne i}}^{J} P(k|i)\,C(k|i) \tag{1.14}$$

The probability of misclassification of $x \in w_i$ into $w_k$ is represented as

$$P(k|i) = \int_{R_k} p_i(x)\,dx \tag{1.15}$$
The optimal regions $\{R_i\}$ that minimize the ECM are the sets of points x for which the allocation of x to a group $w_k$, $k = 1, 2, \ldots, J$, results in the least cost. It can be shown that an equivalent form of Equation 1.14 can be represented without the integral term $P(k|i)$. The equivalent minimizing ECM′ is interpreted intuitively* as

Minimizing the ECM is equivalent to minimizing the a posteriori probabilities for the wrong classes, weighted by the corresponding costs.

That is, for a candidate allocation of x to class $w_i$, the equivalent ECM′ has the form

$$\mathrm{ECM}' = \sum_{\substack{k=1 \\ k \ne i}}^{J} P(w_k \mid x)\,C(i|k) = \sum_{\substack{k=1 \\ k \ne i}}^{J} \frac{P_k\,p_k(x)}{\sum_{j=1}^{J} P_j\,p_j(x)}\,C(i|k) \tag{1.16}$$

and since the denominator is a constant independent of the index k, this can be further simplified as

$$\mathrm{ECM}' = \sum_{\substack{k=1 \\ k \ne i}}^{J} P_k\,p_k(x)\,C(i|k) \tag{1.17}$$
In other words, the optimal minimum ECM classifier assigns x to the class $w_i$ for which Equation 1.17 is minimized. The minimum ECM (ECM′) classifier rule determines mutually exclusive and exhaustive classification regions $R_1, R_2, \ldots, R_J$ such that Equation 1.14 (Equation 1.17) is a minimum. If the cost is not important (or the same for all misclassifications), the minimum ECM rule becomes the minimum TPM rule. The resulting classifier is, again as in the two-class case, a MAP classifier. Assign the unknown x to $w_k$, where

$$k = \arg\min_{i \in G} \sum_{\substack{j=1 \\ j \ne i}}^{J} P_j\,p_j(x) \tag{1.18}$$

$$\phantom{k} = \arg\max_{i \in G} P_i\,p_i(x) \tag{1.19}$$

$$\phantom{k} = \arg\max_{i \in G} P(w_i \mid x) \tag{1.20}$$

and $G = \{1, 2, \ldots, J\}$ is the set of class indices.
The Bayesian classification rule, which is based on the conditional probability density functions for each class, $p_i(x)$, is the optimal classifier in the sense that it minimizes the cost of the probability of error.22 However, the class conditional probability density function $p_i(x)$ needs to be estimated. The density estimation is realizable and efficient if the dimensionality is low, such as 1 to 3 at most. The parametric Bayesian classification, even if it renders the optimal result in the sense that the probability of error is minimized, is difficult to realize in practice. Alternatively, we look for other simple approximations using a normality assumption on the class conditional distributions.

* The fact that ECM and ECM′ are equivalent is shown analytically in the text.21
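The multi-class rules of Equation 1.17 and Equation 1.19 can be sketched as below. The example assumes known univariate normal class densities and zero-based class indices; the cost table layout `cost[i][k]` for C(i|k) and the function names are hypothetical choices made for this sketch.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def min_ecm_assign(x, priors, densities, cost):
    """Assign x to the class i minimizing sum_{k != i} P_k p_k(x) C(i|k) (Equation 1.17).

    cost[i][k] holds C(i|k), the cost of allocating a class-k observation to class i.
    """
    J = len(priors)
    weighted = [priors[k] * densities[k](x) for k in range(J)]
    scores = [sum(weighted[k] * cost[i][k] for k in range(J) if k != i) for i in range(J)]
    return min(range(J), key=lambda i: scores[i])


def map_assign(x, priors, densities):
    """Assign x to the class maximizing P_i p_i(x) (Equation 1.19)."""
    return max(range(len(priors)), key=lambda i: priors[i] * densities[i](x))


means = [0.0, 2.0, 4.0]
densities = [lambda x, m=m: normal_pdf(x, m, 1.0) for m in means]
priors = [1 / 3, 1 / 3, 1 / 3]

# With equal costs the minimum ECM rule coincides with the MAP rule.
equal_cost = [[0 if i == k else 1 for k in range(3)] for i in range(3)]

# Make it expensive to allocate an observation of class 2 to class 1.
skewed = [row[:] for row in equal_cost]
skewed[1][2] = 20
```

With the skewed costs, a point such as x = 2.8, which the MAP rule gives to class 1, is allocated to class 2 instead.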
1.4.2
BAYESIAN CLASSIFIERS WITH MULTIVARIATE NORMAL POPULATIONS
If the conditional distribution of a given class is assumed to be p-dimensional multivariate normal,

$$p_i(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_i)^t\,\Sigma_i^{-1}\,(x - \mu_i)\right], \quad i = 1, 2, \ldots, J \tag{1.21}$$

with mean vectors $\mu_i$ and covariance matrices $\Sigma_i$, then the resulting Bayesian classifiers are easily realized.
1.4.2.1
Quadratic Discriminant Score
With the assumption of equal cost for all misclassifications added to the multivariate normality, we get a simple classification rule directly from Equation 1.19. The minimum TPM decision rule can then be expressed as follows: allocate x to the class $w_k$, where

$$k = \arg\max_{i \in G} \{\ln P_i\,p_i(x)\} = \arg\max_{i \in G} \{d_i^q(x)\} \tag{1.22}$$

where the quadratic discriminant score is defined as

$$d_i^q(x) = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(x - \mu_i)^t\,\Sigma_i^{-1}\,(x - \mu_i) + \ln P_i \tag{1.23}$$

and consists of contributions from the generalized variance $|\Sigma_i|$, the a priori probability $P_i$, and the squared distance from x to the population class mean $\mu_i$. Note that $d_i^q(x)$ is a quadratic form of the unknown x.
1.4.2.2
Linear Discriminant Score
If we further assume that the population covariance matrices $\Sigma_i$ are all the same, $\Sigma_i = \Sigma$, the terms of the quadratic discriminant score (Equation 1.23) that do not depend on the class i can be dropped, which simplifies it into the linear discriminant score:

$$d_i(x) = \mu_i^t\,\Sigma^{-1} x - \frac{1}{2}\mu_i^t\,\Sigma^{-1}\mu_i + \ln P_i \tag{1.24}$$
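The quadratic score of Equation 1.23 and the linear score of Equation 1.24 can be compared numerically. The sketch below is restricted to two-dimensional inputs so that the matrix inverse and determinant can be written out by hand; the helper names and the numerical values are assumptions for illustration.

```python
import math


def det2(S):
    """Determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    return a * d - b * c


def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quad_form(u, A, v):
    """Compute u^t A v for 2-vectors u, v and a 2x2 matrix A."""
    return sum(u[r] * A[r][c] * v[c] for r in range(2) for c in range(2))


def quadratic_score(x, mean, cov, prior):
    """Equation 1.23: -1/2 ln|S_i| - 1/2 (x - m_i)^t S_i^{-1} (x - m_i) + ln P_i."""
    diff = [x[0] - mean[0], x[1] - mean[1]]
    return -0.5 * math.log(det2(cov)) - 0.5 * quad_form(diff, inv2(cov), diff) + math.log(prior)


def linear_score(x, mean, cov, prior):
    """Equation 1.24 with a common covariance: m_i^t S^{-1} x - 1/2 m_i^t S^{-1} m_i + ln P_i."""
    S_inv = inv2(cov)
    return quad_form(mean, S_inv, x) - 0.5 * quad_form(mean, S_inv, mean) + math.log(prior)


means = [[0.0, 0.0], [3.0, 1.0]]
cov = [[2.0, 0.5], [0.5, 1.0]]      # shared by both classes in this example
priors = [0.6, 0.4]
x = [1.0, 1.5]

quad = [quadratic_score(x, m, cov, P) for m, P in zip(means, priors)]
lin = [linear_score(x, m, cov, P) for m, P in zip(means, priors)]
# With a shared covariance the two scores differ by the same amount for every class,
# so both select the same class.
gap = [q - s for q, s in zip(quad, lin)]
```

The constant gap between the two scores is the class-independent part of Equation 1.23, which is why dropping it does not change the allocation.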
Then the optimal minimum ECM classifier with the assumptions that

1. the class conditional density function $p_i(x)$ is multivariate normal,
2. we have equal misclassification costs (thus a minimum TPM classifier), and
3. we have equal covariance matrices $\Sigma_i = \Sigma$ for all classes,

reduces to the simplest form with a linear discriminant score as follows:

$$k = \arg\max_{i \in G} d_i(x) = \arg\max_{i \in G}\left\{\mu_i^t\,\Sigma^{-1} x - \frac{1}{2}\mu_i^t\,\Sigma^{-1}\mu_i + \ln P_i\right\} \tag{1.25}$$

where x is assigned to class $w_k$. As the name indicates, the linear discriminant score $d_i(x)$ for a class i, used in the special case of the minimum TPM classifier (Equation 1.25), is a linear function of the input x. The boundaries between the regions $R_1, R_2, \ldots, R_J$ are hyperplanes, e.g., lines in a two-dimensional and planes in a three-dimensional input space. However, the minimum TPM classifier with different covariances for the classes is given by the quadratic form of x as in Equation 1.22.
1.4.2.3
Linear Discriminant Analysis and Classification
Fisher’s discriminant function is basically for description purposes. With new lower-dimensional discriminant variables, multidimensional data may be visualized to find interesting structures; hence, linear discriminant analysis is exploratory. The objective of this section is to relate linear discriminant analysis to Bayesian optimal classifiers based on normal theory. The linear transform by which the discriminant variates are obtained is defined by the q × q matrix F in the transform:

$$x = Fy \tag{1.26}$$

where q is the dimensionality of vector x and the matrix F consists of s = min{q, J – 1} eigenvectors of $W^{-1}B$ whose corresponding eigenvalues are nonzero. This result is obtained by maximizing the between-class quadratic form in B relative to the within-class quadratic form in W. W and B are the sample versions of the pooled within-class and between-class covariance matrices, respectively, defined as
$$W = \frac{1}{N - J}\sum_{i=1}^{J}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)(y_{ij} - \bar{y}_i)^t$$

$$B = \frac{1}{J - 1}\sum_{i=1}^{J} n_i\,(\bar{y}_i - \bar{y})(\bar{y}_i - \bar{y})^t$$
where $N = \sum_{i=1}^{J} n_i$ is the size of the sample and J is the number of classes. In the transformed domain, or in the discriminant coordinate space (CRIMCOORD), the class mean vectors are given by

$$\mu_{i,x} = \left[\mu_{i,x_1}, \mu_{i,x_2}, \ldots, \mu_{i,x_s}\right]^t = F\mu_{i,y}$$

for $x \in w_i$, and by the definition of the LDA, cov(X) = I. Thus it is appropriate to use the Euclidean distance to measure the separation of the discriminant variates. The classification rule from the discriminants is now to allocate x into class $w_k$, where

$$k = \arg\min_{i \in G}\left\{\|x - \mu_{i,x}\|^2\right\} \tag{1.27}$$
Here the dimensionality of x is s ≤ min{q, J – 1}. The dimensionality of the transformed variables, i.e., the discriminant variates, becomes s, and the linear discriminant classification rule (Equation 1.27) needs only s variables. The reason why only s variables are needed follows. The sample pooled within covariance matrix W and the between covariance matrix B have full ranks, hence the (q × q) matrix $W^{-1}B$ has full rank. The number of nonzero eigenvalues cannot be greater than the full rank:

$$s \le q \tag{1.28}$$

and the class mean vectors span a multidimensional space with dimensionality

$$p \le J - 1 \tag{1.29}$$

which is obvious since by definition $\sum_{i=1}^{J}(\mu_i - \mu) = 0$. From Equation 1.28 and Equation 1.29 we can conclude that s = min{q, J – 1}. The remaining (q – s)-dimensional subspace is called the null space of the linear transformation represented by the matrix F and consists of all the vectors y that are mapped into 0 by the linear transformation of Equation 1.26.
1.4.2.4
Equivalence of LDF to Minimum TPM Classifier
It is interesting to observe the equivalence of the linear discriminant classification rule (Equation 1.27) with the minimum TPM classification rule, under the assumption that the covariances $\Sigma_i = \Sigma$ are the same for all classes $i \in G$. The argument of the minimization in Equation 1.27 becomes

$$\begin{aligned}
\|F(y - \mu_{i,y})\|^2 &= \|x - \mu_{i,x}\|^2 \\
&= (y - \mu_{i,y})^t\,\Sigma^{-1}\,(y - \mu_{i,y}) \\
&= -2\,d_i(y) + y^t\,\Sigma^{-1} y + 2\ln P_i
\end{aligned} \tag{1.30}$$

where the last equality is due to

$$d_i(y) - \frac{1}{2}y^t\,\Sigma^{-1} y = -\frac{1}{2}\left(\mu_{i,y}^t\,\Sigma^{-1}\mu_{i,y} - 2\mu_{i,y}^t\,\Sigma^{-1} y + y^t\,\Sigma^{-1} y\right) + \ln P_i \tag{1.31}$$

Since the term $y^t\Sigma^{-1}y$ does not depend on the class, the minimization of the squared distance in the Fisher’s discriminant variate domain is equivalent, for equal a priori probabilities $P_i$, to the maximization of the linear discriminant score $d_i(y)$, which results in the equivalence of the ‘linear discriminant classification rule’ to the ‘minimum TPM optimal classifier.’23 This is an interesting observation or justification of Fisher’s LDF. Even though the derivation of Fisher’s discriminant functions does not require the ‘multivariate normality’ assumption, the same classification rule is obtained from the minimum TPM criterion Bayesian classification rule in which normality is assumed.
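The relation between the squared distance and the linear discriminant score can be checked numerically. The sketch below assumes two-dimensional data, a common covariance matrix, and equal a priori probabilities; it verifies that the squared distance equals –2di(y) plus terms that are the same for all classes, so that the nearest class mean is also the class with the largest score. All names and values are illustrative.

```python
import math


def inv2(S):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def quad_form(u, A, v):
    """Compute u^t A v for 2-vectors u, v and a 2x2 matrix A."""
    return sum(u[r] * A[r][c] * v[c] for r in range(2) for c in range(2))


def linear_score(y, mean, S_inv, prior):
    """Linear discriminant score d_i(y) of Equation 1.24."""
    return quad_form(mean, S_inv, y) - 0.5 * quad_form(mean, S_inv, mean) + math.log(prior)


def mahalanobis_sq(y, mean, S_inv):
    """Squared distance (y - m_i)^t S^{-1} (y - m_i)."""
    diff = [y[0] - mean[0], y[1] - mean[1]]
    return quad_form(diff, S_inv, diff)


S_inv = inv2([[2.0, 0.5], [0.5, 1.0]])
means = [[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]]
prior = 1.0 / 3.0                    # equal a priori probabilities
y = [1.5, 2.0]

dists = [mahalanobis_sq(y, m, S_inv) for m in means]
scores = [linear_score(y, m, S_inv, prior) for m in means]

# Residual of the identity: distance = -2 d_i(y) + y^t S^{-1} y + 2 ln P_i.
common = quad_form(y, S_inv, y) + 2.0 * math.log(prior)
residuals = [d - (-2.0 * s + common) for d, s in zip(dists, scores)]

nearest = min(range(3), key=lambda i: dists[i])
best = max(range(3), key=lambda i: scores[i])
```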
1.4.3
LEARNING VECTOR QUANTIZER (LVQ)
Learning Vector Quantization (LVQ) is a combination of the self-organizing map and supervised learning.10 The self-organizing map is a typical competitive learning method and results in a number of new vectors, called codebook vectors, $m_l$, $l = 1, 2, \ldots, L$. The codebook vectors represent the input vector space with a small number of representative vectors (codebook M). It is a quantization of the given data set $\{x_i, g_i\}_{i=1}^{N}$ to get a quantized codebook $\{m_l, g_l\}_{l=1}^{L}$.
1.4.3.1
Competitive Learning
Given a training set $\{x_i, g_i\}_{i=1}^{N}$ and a randomly chosen codebook $\{m_l\}_{l=1}^{L}$ of size L, the input at time instance k, $x^{(k)}$, is compared to all the code vectors $m_l$ in order to find the closest one, $m_c$, by a distance measure such that

$$d\left(x^{(k)}, m_c\right) = \min_{l}\left\{d\left(m_l, x^{(k)}\right)\right\} \tag{1.32}$$
The L2-norm is a common choice, and competitive learning with this measure uses steepest descent gradient step optimization.10 Once the closest code vector $m_c$ is found, competitive learning (or the steepest descent gradient optimization) updates the closest code vector $m_c$ but does not change the other code vectors $m_l$, $l \ne c$:

$$m_c^{(k+1)} = m_c^{(k)} + \alpha(k)\left(x^{(k)} - m_c^{(k)}\right) \tag{1.33}$$

$$m_l^{(k+1)} = m_l^{(k)} \quad \text{for } l \ne c \tag{1.34}$$

with $\alpha(k)$ being a suitable constant, $0 < \alpha < 1$, or a monotonically decreasing sequence, $0 < \alpha(k) < 1$, with which the optimized LVQ (OLVQ, discussed later) is concerned.
1.4.3.2
Self-Organizing Map
This is an algorithm for finding a codebook M (or a set of feature-sensitive detectors) in the input space X. It is known that the internal representations of information in the brain are generally organized spatially, and the self-organizing map mimics the spatial organization of the cells10 in its structure. A self-organizing map enforces the logically inspired network connections, with “lateral inhibition,” in a general way by defining a neighborhood set $N_c$, a time-varying, monotonically decreasing set of code vectors:

$$N_c^{(k)} = \left\{ m_l^{(k)} \;\middle|\; d\left(m_l^{(k)}, m_c^{(k)}\right) \le r(k) \right\} \tag{1.35}$$
where r(k) represents the radius of $N_c^{(k)}$. Once the winning code vector (or cell) is found from Equation 1.32, all the code vectors in the neighborhood $N_c$, which is centered on the winning code vector $m_c$, are updated and the others remain untouched. It has been suggested10 that $N_c^{(k)}$ be very wide in the beginning and shrink monotonically with time, r(k) being a function of time k. Thus the updating has a form similar to simple competitive learning in Equation 1.33,

$$m_l^{(k+1)} = \begin{cases} m_l^{(k)} + \alpha(k)\left(x^{(k)} - m_l^{(k)}\right) & \text{if } m_l \in N_c^{(k)} \\ m_l^{(k)} & \text{if } m_l \notin N_c^{(k)} \end{cases} \tag{1.36}$$

where $\alpha(k)$ is a scalar-valued “adaptation gain,” $0 \le \alpha(k) \le 1$.
1.4.3.3
Learning Vector Quantization
If we now have a codebook that represents the input vector space X by a set of quantized vectors, i.e., a codebook M, then the Nearest Neighbor rule can be used for classification problems, provided that the codebook vectors $m_l$ have their labels in the space to which each codebook vector belongs. The labeling process is similar to the K-nearest neighbor rule, in which (a part of) the training data are used to find the majority label among the K closest patterns to a codebook vector $m_l$. Thus the LVQ, a form of supervised learning, follows the unsupervised learning, the self-organizing map, as shown in Figure 1.3:
FIGURE 1.3 Block diagram for a system of Self-organizing Map and Learning Vector Quantization.
The last two stages in the figure are called LVQ, and researchers10,24 have come up with different updating algorithms (LVQ1, LVQ2, LVQ3, OLVQ1) from different methods of updating the codebook vectors. The LVQ1 and its optimized version OLVQ1 are considered in the next sections.
1.4.3.3.1 LVQ1
This is similar to simple competitive learning (Equation 1.33), except that it includes pushing away any wrong closest codebook vector in addition to the pulling operations (Equation 1.33 and Equation 1.36). Let $L(x^{(k)})$ be an operation to get the label information; then the codebook updating rule LVQ1 has the form (Figure 1.4)

$$m_c^{(k+1)} = m_c^{(k)} + \alpha(k)\left(x^{(k)} - m_c^{(k)}\right) \quad \text{for } L\left(x^{(k)}\right) = L(m_c) \tag{1.37}$$

$$m_c^{(k+1)} = m_c^{(k)} - \alpha(k)\left(x^{(k)} - m_c^{(k)}\right) \quad \text{for } L\left(x^{(k)}\right) \ne L(m_c) \tag{1.38}$$

$$m_l^{(k+1)} = m_l^{(k)} \quad \text{for } l \ne c$$
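One LVQ1 step, i.e., the winner search of Equation 1.32 followed by the update of Equation 1.37 or Equation 1.38, can be sketched as follows. The squared L2 distance, the function names, and the deliberately large gain of 0.5 (chosen only to make the movement visible) are assumptions of this example.

```python
def squared_distance(a, b):
    """Squared L2 distance between two vectors given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def lvq1_step(codebook, labels, x, x_label, alpha):
    """Apply one LVQ1 update in place and return the index c of the winning code vector."""
    # Equation 1.32: the winner is the code vector closest to x.
    c = min(range(len(codebook)), key=lambda l: squared_distance(codebook[l], x))
    # Equation 1.37 pulls the winner toward x when the labels agree;
    # Equation 1.38 pushes it away when they disagree. Other code vectors are unchanged.
    sign = 1.0 if labels[c] == x_label else -1.0
    codebook[c] = [m + sign * alpha * (xi - m) for m, xi in zip(codebook[c], x)]
    return c


codebook = [[0.0, 0.0], [4.0, 4.0]]
labels = ["a", "b"]

# Same label as the winner: the winner moves halfway toward x.
winner_same = lvq1_step(codebook, labels, [1.0, 1.0], "a", alpha=0.5)
after_pull = [list(m) for m in codebook]

# Different label: the same winner is pushed away from x.
winner_diff = lvq1_step(codebook, labels, [1.0, 1.0], "b", alpha=0.5)
after_push = [list(m) for m in codebook]
```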
In Equation 1.37 and Equation 1.38, $0 < \alpha(k) < 1$ is a gain that decreases monotonically with time, as in competitive learning (Equation 1.33). The authors suggest a small starting value, e.g., $\alpha(0) = 0.01$ or 0.02.
1.4.3.3.2 Optimized LVQ1 (OLVQ1)
For fast convergence of the LVQ1 algorithm in Equation 1.37 and Equation 1.38, an optimized learning rate for the LVQ1 is suggested.24 The objective is to find an optimal learning rate $\alpha_l(k)$ for each codebook vector $m_l$, so that we have individually optimized learning rates:

$$m_c^{(k+1)} = m_c^{(k)} + \alpha_c(k)\left(x^{(k)} - m_c^{(k)}\right) \quad \text{for } L\left(x^{(k)}\right) = L(m_c) \tag{1.39}$$

$$m_c^{(k+1)} = m_c^{(k)} - \alpha_c(k)\left(x^{(k)} - m_c^{(k)}\right) \quad \text{for } L\left(x^{(k)}\right) \ne L(m_c) \tag{1.40}$$

$$m_l^{(k+1)} = m_l^{(k)} \quad \text{for } l \ne c$$

FIGURE 1.4 LVQ1 learning, or updating the initial codebook vectors a, b, c.
Equation 1.39 and Equation 1.40 can be stated with a new sign term s(k) = 1 or –1 for the right class and the wrong class, respectively, as follows:

$$m_c^{(k+1)} = \left[1 - s(k)\,\alpha_c(k)\right] m_c^{(k)} + s(k)\,\alpha_c(k)\,x^{(k)} \tag{1.41}$$
It can be seen from Equation 1.41 that $m_c^{(k+1)}$ depends directly on the current input $x^{(k)}$ through the last term and recursively on the earlier inputs through $m_c^{(k)}$. The argument on the learning rate10 is that:

Statistical accuracy of the learned codebook vectors mc(*) is optimal if the effects of the corrections made at different times are of equal weight.
The weight of the current input $x^{(k)}$ in Equation 1.41 is $\alpha_c(k)$, and the weight of the previous input $x^{(k-1)}$, which enters through $m_c^{(k)}$, is $(1 - s(k)\,\alpha_c(k)) \cdot \alpha_c(k-1)$. According to the argument, these weights are to be the same for two consecutive inputs $x^{(k)}$ and $x^{(k-1)}$:

$$\alpha_c(k) = \left[1 - s(k)\,\alpha_c(k)\right] \alpha_c(k-1) \tag{1.42}$$
If this condition is to hold for all k, then by induction the effects of all the earlier inputs $x^{(k')}$, $k' = 0, 1, \ldots, k$, should be the same. Therefore, according to the argument, the optimal values of the learning rate $\alpha_c(k)$ are determined by the recursion obtained from Equation 1.42 for the specific code vector $m_c$:

$$\alpha_c(k) = \frac{\alpha_c(k-1)}{1 + s(k)\,\alpha_c(k-1)} \tag{1.43}$$

with which the OLVQ1 is defined as in Equation 1.39 and Equation 1.40.
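The recursion of Equation 1.43 is easy to trace by hand or in code. The sketch below follows the learning rate of one code vector over three inputs; the starting value 0.3 and the upper cap `max_alpha` are assumptions of this example (the cap is not part of Equation 1.43 and is added only because the rate grows after wrong-class inputs).

```python
def olvq1_rate(prev_alpha, same_class, max_alpha=0.5):
    """Equation 1.43: alpha_c(k) = alpha_c(k-1) / (1 + s(k) alpha_c(k-1))."""
    s = 1.0 if same_class else -1.0
    alpha = prev_alpha / (1.0 + s * prev_alpha)
    # Assumption (not stated in the text): keep the rate below max_alpha < 1,
    # since it increases whenever the winner has the wrong class.
    return min(alpha, max_alpha)


# Two inputs of the right class followed by one of the wrong class.
signs = (True, True, False)
rates = [0.3]
for same_class in signs:
    rates.append(olvq1_rate(rates[-1], same_class))
```

The rate shrinks from 0.3 to 3/13 and 3/16 after the two correct inputs and returns to 3/13 after the wrong one, and each step satisfies the balance condition of Equation 1.42.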
1.4.4
NEAREST NEIGHBOR RULE
The Nearest Neighbor (NN) classifier, a nonparametric exemplar method, is the most natural classification method one can think of. Using the label information of the training sample, an unknown observation $x_0$ is compared with all the cases in the training sample. The N distances between the pattern vector $x_0$ and all the training patterns are calculated, and the label of the training pattern with the minimum distance is assigned to the incoming pattern $x_0$. That is, the NN rule allocates $x_0$ to $w_k$ if the closest exemplar $x_c$ has the label $k = L(x_c)$:

$$x_c = \arg\min_{x_i}\left\{d(x_0, x_i)\right\},\ i = 1, 2, \ldots, N; \qquad x_0 \in w_k,\ k = L(x_c) \tag{1.44}$$
The distance measure between the unknown and the training sample has a general quadratic form:

$$d(x_0, x_k) = (x_0 - x_k)^t\,M\,(x_0 - x_k) \tag{1.45}$$
With $M = \Sigma^{-1}$, the inverse of the covariance matrix of the sample, the result is the Mahalanobis distance. The Euclidean distance is obtained when M = I, i.e., the identity matrix. Another choice is a measure considering only the variances, for which M = Λ, where Λ is a diagonal matrix with its elements $(\lambda_i)^{1/2} = \mathrm{var}(x_i)$ and $x = (x_1, x_2, \ldots, x_p)^t$. The K-Nearest Neighbor (KNN) rule is the same as the NN rule except that the algorithm finds the K points in the training set nearest to the unknown observation x and assigns the unknown observation to the majority class among the K points. Recent VLSI technology advances have made memory cheaper than ever; thus, the KNN rule is becoming feasible. Some modified versions of the original KNN rule have been reported. These approaches interpolate between outputs of nearest neighbors stored during training to form complex nonlinear mapping functions.25,26 Much of the work with the modified KNN rules is in designing effective distance metrics.1 Some modified KNN rules are developed for parallel machine implementation, called the connectionist machine,27 as well as for serial computing.25
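The NN and KNN rules with the quadratic distance of Equation 1.45 can be sketched in a few lines. The example uses M = I (squared Euclidean distance); the function names and the toy training sample are assumptions for illustration, and any other matrix M could be passed in the same way.

```python
from collections import Counter


def quadratic_distance(x0, xk, M):
    """Equation 1.45: (x0 - xk)^t M (x0 - xk) for a square matrix M given as nested lists."""
    diff = [a - b for a, b in zip(x0, xk)]
    size = len(diff)
    return sum(diff[r] * M[r][c] * diff[c] for r in range(size) for c in range(size))


def knn_classify(x0, samples, labels, M, k=1):
    """Return the majority label among the k training patterns closest to x0."""
    order = sorted(range(len(samples)), key=lambda i: quadratic_distance(x0, samples[i], M))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


identity = [[1.0, 0.0], [0.0, 1.0]]          # M = I gives the squared Euclidean distance
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.2, 2.0], [4.0, 4.0], [5.0, 4.0]]
labels = ["a", "a", "a", "b", "b", "b"]
x0 = [1.5, 1.5]

nn_label = knn_classify(x0, samples, labels, identity, k=1)
knn_label = knn_classify(x0, samples, labels, identity, k=3)
```

Here the single nearest exemplar carries the label "b", while the majority among the three nearest exemplars is "a", which shows how the KNN rule smooths the decision of the NN rule.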
1.5 NEURAL NETWORKS (NN)
1.5.1
INTRODUCTION
Neural networks have been a much-publicized topic of research in recent years and are now beginning to be used in a wide range of subject areas. One of the strands of interest in neural networks is to explore possible models of biological computation. Human brains contain about $1.5 \times 10^{12}$ neurons of various types, with each receiving signals through 10 to $10^4$ synapses. The response of a neuron is known to take about 1 to 10 milliseconds.28 Yet we can recognize an old friend’s face and call him in about 0.1 seconds. This is a complex pattern recognition task which must be performed in a highly parallel way, since the recognition is done in about 100 ~ 1000 steps. This suggests that highly parallel systems can perform pattern recognition tasks more rapidly than current conventional sequential computers. As yet our VLSI technology, which is essentially a planar implementation with at most two- or three-layer cross-connections, is far from achieving these parallel connections, which require three-dimensional interconnections.
1.5.1.1
Artificial Neural Networks
Even though neural networks were originally intended to mimic a task-specific subsystem of a mammalian or human brain, recent research has mostly concentrated on Artificial Neural Networks, which are only vaguely related to the biological system. Neural networks are specified by the (1) net topology, (2) node characteristics, and (3) training or learning rules. Topological considerations of artificial neural networks for different purposes can be found in review papers.2,3 Since our interest in neural networks is in classification, only the feed-forward multilayer perceptron topology is considered, leaving the feedback connections to the references. The topology describes the connections with the number of layers and the units in each layer for feed-forward networks. Node functions are usually nonlinear in the middle layers but can be linear or nonlinear for output layer nodes. However, all of the units in the input layer are linear and have fan-out connections from the input to the next layer. Each output yj is weighted by wij and summed at the linear combiner represented by a small circle in Figure 1.5. The linear combiner thresholds its inputs before it sends them to the node function φj. The unit functions are (non-)linear, monotonically increasing, and bounded functions, as shown on the right of Figure 1.5.
1.5.1.2
Usage of Neural Networks
One use of a neural network is classification. For this purpose each input pattern is forced, adaptively, to output the pattern indicators that are part of the training data; the training set consists of the input covariate x and the corresponding class labels. Feed-forward networks, sometimes called multilayer perceptrons (MLP), are trained adaptively to transform a set of input signals, X, into a set of output signals, G. Feedback networks start with an initial activity state of a feedback system, and after state transitions have taken place, the asymptotic final state is identified as the outcome of the computation. One use of the feedback networks is the case of associative memories: on being presented with a pattern near a prototype X it should output pattern X′, and as autoassociative memory or contents-addressable memory by which the desired output is completed to become X. In all cases the network learns or is trained by the repeated presentation of patterns with known required outputs (or pattern indicators). Supervised neural networks find a mapping f : X → G for a given set of input and output pairs.

FIGURE 1.5 (I) The linear combiner output $x_j = \sum_{i=1}^{n} y_i w_{ij}$ is input to the node function φj to give the output yj. (II) Possible node functions. Hard limiter (a), threshold (b), and sigmoid (c) nonlinear functions.
1.5.1.3
Other Neural Networks
The other dichotomy of the neural networks family is unsupervised learning, that is clustering. The class information is not known or it is irrelevant; the networks find the groups of the similar input patterns. The neighboring code vectors in a neural network compete in their activities by means of mutual lateral interactions and develop adaptively into specific detectors of different signal patterns. Examples are the Self-Organizing Map10 and the Adaptive Resonance Theory (ART)11 networks. ART is different from other unsupervised learning networks in that it develops new clusters by itself; the network develops a new code vector if there exist sufficiently different patterns. Thus the ART is truly adaptive, whereas others require the number of clusters to be specified in advance.
1.5.2
FEED-FORWARD NETWORKS
In feed-forward networks the signal flows only in the forward direction; no feedback exists for any node. This is perhaps best seen graphically in Figure 1.6. This is the simplest topology and has been shown to be good enough for most practical classification problems.19

FIGURE 1.6 A generic feed-forward network with a single hidden layer. For the bias terms, constant inputs of 1 are shown; the weights of the constant inputs are the bias values, which are learned as training proceeds.

The general definition allows more than one hidden layer, and also allows ‘skip-layer’ connections from input to output. With this skip-layer, one can write a general expression for a network output yk with one hidden layer,

$$y_k = \phi_k\Bigl(b_k + \sum_{i \to k} w_{ik}\,x_i + \sum_{j \to k} w_{jk}\,\phi_j\Bigl(b_j + \sum_{i \to j} w_{ij}\,x_i\Bigr)\Bigr) \tag{1.46}$$
where bj and bk represent the thresholds for each unit in the hidden layer (index j) and in the output layer (index k). Since the threshold values bj, bk are to be adaptive, it is useful to treat each threshold as the weight of a constant input of value 1, as in Figure 1.6. The function φ() is almost inevitably taken to be a linear, sigmoidal ($\phi(x) = e^x / (1 + e^x)$), or threshold function ($\phi(x) = I(x > 0)$). Rumelhart, Hinton, and Williams29 showed that feed-forward multilayer perceptron networks can learn using gradient values obtained by an algorithm called Error Backpropagation.* This contribution is a remarkable advance since 1969, when Minsky and Papert30 claimed that the nonlinear boundary required for the XOR problem can be obtained by a multilayer perceptron. The learning method was unknown at the time.

* A comment on the terminology ‘backpropagation’ is given in section 1.5.3. There, the backpropagation is interpreted as a method to find the gradient values of a feed-forward multilayer perceptron network, not as a learning method. A pseudo-steepest descent method is the learning mechanism used in the network.

Since Rosenblatt (1959)31 introduced the one-layer, single perceptron learning method, called the perceptron convergence procedure, research on the single perceptron had been widely active until the counter-example of the XOR problem, which the single perceptron could not solve, was introduced. In multilayer network learning the usual objective or error function to be minimized has the form of a squared error:

$$E(w) = \sum_{p=1}^{P} \left\| t_p - f(x_p; w) \right\|^2 \tag{1.47}$$
that is to be minimized with respect to w, the weights in the network. Here p represents the pattern index, p = 1, 2,…, P, and tp is the target (or desired) value when xp is the input to the network. Clearly this minimization can be carried out by any number of unconstrained optimization algorithms; gradient methods or stochastic optimization are possible candidates. The updating of weights has the form of the steepest descent method:

$$w_{ij} \leftarrow w_{ij} - \eta\,\frac{\partial E}{\partial w_{ij}} \tag{1.48}$$

where the gradient value ∂E/∂wij is calculated for each pattern being presented; the error term E(w) in on-line learning is not the summation of the squared error over all the P patterns. Note that the gradient points in the direction of maximum increase of the error. In order to minimize the error, it is necessary to multiply the gradient vector by minus one (–1) and by a learning rate η. The updating method (Equation 1.48) has a constant learning rate η for all weights, independent of time. The original Method of Steepest Descent has a time-dependent parameter ηk; hence ηk needs to be calculated as the iterations progress.
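The network output of Equation 1.46 and the squared error of Equation 1.47 can be illustrated with the XOR problem mentioned above. In the sketch below the weights are chosen by hand rather than learned, threshold node functions are used, and the skip-layer weights are set to zero; the function names and the weight layout are assumptions of this example.

```python
def threshold(x):
    """Threshold node function, phi(x) = I(x > 0)."""
    return 1.0 if x > 0 else 0.0


def forward(x, hidden_w, hidden_b, out_w, out_b, skip_w, phi_hidden, phi_out):
    """Equation 1.46 for a single output unit k.

    hidden_w[j][i] is w_ij, out_w[j] is w_jk and skip_w[i] is the skip-layer weight w_ik.
    """
    hidden = [phi_hidden(b + sum(w * xi for w, xi in zip(ws, x)))
              for ws, b in zip(hidden_w, hidden_b)]
    total = out_b
    total += sum(w * xi for w, xi in zip(skip_w, x))      # skip-layer term
    total += sum(w * h for w, h in zip(out_w, hidden))    # hidden-layer term
    return phi_out(total)


def squared_error(patterns, targets, predict):
    """Equation 1.47 for scalar targets: sum over patterns of (t_p - f(x_p; w))^2."""
    return sum((t - predict(x)) ** 2 for x, t in zip(patterns, targets))


def xor_net(x):
    # Hand-chosen (not learned) weights: hidden unit 0 acts as OR, hidden unit 1 as AND.
    return forward(x, hidden_w=[[1.0, 1.0], [1.0, 1.0]], hidden_b=[-0.5, -1.5],
                   out_w=[1.0, -1.0], out_b=-0.5, skip_w=[0.0, 0.0],
                   phi_hidden=threshold, phi_out=threshold)


patterns = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 0.0]
outputs = [xor_net(x) for x in patterns]
error = squared_error(patterns, targets, xor_net)
```

A single hidden layer with two units reproduces XOR with zero error, which a single perceptron cannot do. Because the threshold function is not differentiable, such weights cannot be found by the gradient update of Equation 1.48, which is one reason sigmoidal units are used for learning.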
1.5.3
ERROR BACKPROPAGATION
The backpropagation was first discussed by Bryson and Ho (1960),32 later by Werbos (1974),33 and Parker,34 but was rediscovered and popularized later by Rumelhart, Hinton, and Williams (1986).29 Each pattern is presented to the network, and the input xj and output yj are calculated as in Figure 1.7. The gradient of the error function with respect to the weights is

$$\nabla E(t) = \left[\frac{\partial E(t)}{\partial w_1(t)}, \ldots, \frac{\partial E(t)}{\partial w_n(t)}\right]^T \tag{1.49}$$

where n is the number of weights, and t is the time index representing the instance of the input pattern presented to the network.
FIGURE 1.7 Error-backpropagation. The δj for weight wij is obtained, δk’s are then backward propagated via thicker weight lines wjk’s.
The indexing by t in Equation 1.49 is for ‘on-line’ learning, in which the gradient term of each weight does not accumulate. This is the simplified version of the gradient method that makes use of the gradient information of all training data. In other words, there are two ways to update the weights by Equation 1.49:

$$w_{ij}^{(p)} \leftarrow w_{ij}^{(p)} - \eta \left(\frac{\partial E}{\partial w_{ij}}\right)^{(p)} \qquad \text{temporal learning} \tag{1.50}$$

$$w_{ij} \leftarrow w_{ij} - \eta \sum_{p} \left(\frac{\partial E}{\partial w_{ij}}\right)^{(p)} \qquad \text{epoch learning} \tag{1.51}$$
One way is to sum over all the P patterns to get the sum of the derivatives in Equation 1.51, and the other way (Equation 1.50) is to update the weights for each input and output pair temporally, without summation of the derivatives. The temporal learning (Equation 1.50), also called on-line learning, is simple to implement in a VLSI chip because it does not require the summation logic and the storage for each weight, while the epoch learning in Equation 1.51 does. However, the temporal learning is an asymptotic approximation of the epoch learning, which is based on minimizing the objective function (Equation 1.47). With the help of Figure 1.7, the first derivative of E with respect to a specific weight wjk can be expanded by the chain rule:

$$\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial x_k}\frac{\partial x_k}{\partial w_{jk}} = \frac{\partial E}{\partial x_k}\,y_j = \phi_k'(x_k)\,\frac{\partial E}{\partial y_k}\,y_j \tag{1.52}$$

$$\phantom{\frac{\partial E}{\partial w_{jk}}} = \frac{\partial \phi_k(x_k)}{\partial x_k}\frac{\partial E}{\partial y_k}\,y_j = \delta_k\,y_j \tag{1.53}$$
For output units, ∂E/∂yk is readily available, i.e., 2(yk – tp), where yk and tp are the network output and the desired target value for input pattern xp. The derivative φ′k(xk) is straightforward for the linear and logistic nonlinear node functions; the hard limiter, on the other hand, is not differentiable. For the linear node function,

$$\phi'(x) = 1 \quad \text{with } y = \phi(x) = x$$

and for the logistic unit the first-order derivative becomes

$$\phi'(x) = \frac{e^x\,(1 + e^x) - (e^x)^2}{(1 + e^x)^2} \tag{1.54}$$

$$\phantom{\phi'(x)} = y\,(1 - y) \quad \text{when } y = \phi(x) = \frac{e^x}{1 + e^x} \tag{1.55}$$
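The closed form of Equation 1.55 can be checked against a numerical derivative. The sketch below compares y(1 – y) with a central difference at a few points; the step size and the test points are arbitrary choices for this illustration.

```python
import math


def logistic(x):
    """Logistic node function phi(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))


def logistic_derivative(x):
    """Equation 1.55: phi'(x) = y (1 - y) with y = phi(x)."""
    y = logistic(x)
    return y * (1.0 - y)


def central_difference(f, x, h=1e-6):
    """Numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# (x, analytic derivative, numerical derivative) for a few inputs.
checks = [(x, logistic_derivative(x), central_difference(logistic, x))
          for x in (-2.0, 0.0, 1.5)]
```

Because the derivative is expressed through the unit output y itself, no extra evaluation of the exponential is needed once the forward pass has been computed.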
The derivative can be written in the form

$$\frac{\partial E}{\partial w_{ij}} = \sum_{p} y_i^{p}\,\delta_j^{p} \tag{1.56}$$

which has become known as the generalized delta rule. The δ’s in the generalized delta rule (Equation 1.56) for output nodes therefore become

$$\delta_k = \begin{cases} 2\,y_k\,(1 - y_k)\,(y_k - t_p) & \text{for a logistic output unit} \\ 2\,(y_k - t_p) & \text{for a linear output unit} \end{cases} \tag{1.57}$$
The interesting point in the backpropagation algorithm is that the δ ’s can be computed from output to input through hidden layers across the network. δ ’s for the units in earlier layers can be obtained by summing the δ ’s in the higher layers. As shown in Figure 1.7, the δj are obtained as
$$\delta_j = \phi_j'(x_j)\,\frac{\partial E}{\partial y_j} = \phi_j'(x_j)\sum_{j \to k} w_{jk}\,\frac{\partial E}{\partial x_k} = \phi_j'(x_j)\sum_{j \to k} w_{jk}\,\delta_k \tag{1.58}$$
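Equations 1.52 through 1.58 can be checked numerically on a very small network. The sketch below is an illustration under stated assumptions: a 2-2-1 network of logistic units without bias terms, one training pattern, and arbitrary weight values; none of these come from the text. The gradients obtained from the δ’s are compared against central finite differences of the squared error.

```python
import math


def logistic(x):
    """Logistic node function, phi(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x_in, w_ij, w_jk):
    """Forward pass; returns hidden outputs y_j and network outputs y_k."""
    y_j = [logistic(sum(x_in[i] * w_ij[i][j] for i in range(len(x_in))))
           for j in range(len(w_ij[0]))]
    y_k = [logistic(sum(y_j[j] * w_jk[j][k] for j in range(len(y_j))))
           for k in range(len(w_jk[0]))]
    return y_j, y_k


def squared_error(x_in, t, w_ij, w_jk):
    """E = sum_k (y_k - t_k)^2 for one pattern."""
    _, y_k = forward(x_in, w_ij, w_jk)
    return sum((y - tk) ** 2 for y, tk in zip(y_k, t))


def backprop_gradients(x_in, t, w_ij, w_jk):
    """Gradients of E with respect to both weight layers via the deltas."""
    y_j, y_k = forward(x_in, w_ij, w_jk)
    # Output deltas for logistic output units (Equation 1.57).
    delta_k = [2.0 * y * (1.0 - y) * (y - tk) for y, tk in zip(y_k, t)]
    # Hidden deltas: weighted sum of the higher-layer deltas (Equation 1.58).
    delta_j = [y_j[j] * (1.0 - y_j[j])
               * sum(w_jk[j][k] * delta_k[k] for k in range(len(delta_k)))
               for j in range(len(y_j))]
    # dE/dw = (output of the sending unit) * (delta of the receiving unit).
    grad_jk = [[y_j[j] * dk for dk in delta_k] for j in range(len(y_j))]
    grad_ij = [[xi * dj for dj in delta_j] for xi in x_in]
    return grad_ij, grad_jk


# Hypothetical pattern and weights, chosen only for illustration.
x_in, t = [0.5, -1.0], [1.0]
w_ij = [[0.1, -0.4], [0.3, 0.2]]
w_jk = [[0.7], [-0.5]]
grad_ij, grad_jk = backprop_gradients(x_in, t, w_ij, w_jk)


def numeric_grad(i, j, h=1e-6):
    """Central-difference estimate of dE/dw_ij, used only as a check."""
    w_plus = [row[:] for row in w_ij]
    w_minus = [row[:] for row in w_ij]
    w_plus[i][j] += h
    w_minus[i][j] -= h
    return (squared_error(x_in, t, w_plus, w_jk)
            - squared_error(x_in, t, w_minus, w_jk)) / (2.0 * h)


max_diff = max(abs(grad_ij[i][j] - numeric_grad(i, j))
               for i in range(2) for j in range(2))
```

The hidden-layer gradients never require differentiating E directly with respect to the first-layer weights; they reuse `delta_k`, which is the backward flow the derivation describes.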
The δk’s are available from the output nodes. As the updating (or learning) progresses backwards, the previous (or higher) δk are weighted by the weights wjk and summed to give the δj’s. Since Equation 1.58 for δj contains only terms from higher-layer units, it can be calculated backwards from the output to the input of the network; hence the name backpropagation.
1.5.3.1 Madaline Rule III for Multilayer Network with Sigmoid Function
Widrow took an independent path in learning as early as the 1960s.35,36 After some 20 years of research in adaptive filtering, Widrow and colleagues returned to neural network research36 and extended Madaline I with the goal of developing a new technique that could adapt multiple layers of adaptive elements, using the simpler hard-limiting quantizer. The result was Madaline Rule II (or simply MRII), a multilayer linear combiner with a hard-limiting quantizer. Andes (1988, unpublished) modified MRII by replacing the hard-limiting quantizer in the Adaline (i.e., a single-layer linear combiner with a hard-limiting quantizer) with a sigmoid function, resulting in MRIII. It was proven later that MRIII is in essence equivalent to backpropagation. The important difference from the gradient-based backpropagation method is that the derivative of the sigmoid function is not required in this realization; thus an analog implementation becomes feasible with the MRIII multilayer learning rule.
1.5.3.2 A Comment on the Terminology ‘Backpropagation’
The term ‘backpropagation’ has been used differently from what it should mean. To get the partial derivatives of the error function (at the system output node) with respect to the weights of the units in layers below the output unit, the δ terms in the output unit are propagated backward, as in Equation 1.58. However, the network (actually the weights) learns (i.e., the weights are updated) using the Pseudo Steepest Descent method (Equation 1.48); it is pseudo because a constant learning rate is used, whereas the Steepest Descent method requires an optimal learning rate for each weight and time instance, i.e., ηij (k). The role of error backpropagation is to find the gradient values needed in the updating rule. Thus it is not a good idea to call backpropagation a learning method; the learning method is a simple version of the Steepest Descent method, which is one of the classical minimizer-finding algorithms. Backpropagation is an algorithm to find the gradient ∇E in a feed-forward multilayer perceptron network.
1.5.3.3 Optimization Machines with Feed-forward Multilayer Perceptrons
Optimization in multilayer perceptron structures can be easily realized by gradient-based optimization methods with the help of backpropagation. In the multilayer perceptron structure the functions can be minimized/maximized via any gradient-based unconstrained optimization algorithm, such as Newton’s method or the Steepest Descent method.
The optimization machine has the functional description depicted in Figure 1.8 and consists of two parts, gradient calculation and weight (or parameter) updating.
FIGURE 1.8 Functional diagram for an Optimization Machine.
The gradient ∇E of the multilayer perceptron network is obtained by error backpropagation. If this gradient is used in an on-line fashion with the constant learning rate η as in Equation 1.48, then this structure is the neural network used earlier.29 This on-line learning structure is attractive for VLSI implementation of the algorithm because it is temporal: no summation over all the patterns is required, and the weights are updated as each individual pattern is presented to the network. It requires little memory, but convergence is sometimes too slow. The other branch in Figure 1.8 shows unconstrained optimization of the nonlinear function. The Optimization Machine gets the gradient information as before, but various well-developed unconstrained optimization methods can be used for finding the optimizer. Unconstrained nonlinear minimization is divided basically into two categories, gradient methods and stochastic optimization. The gradient methods are deterministic and use the gradient information to find the direction toward the minimizer. Stochastic optimization methods such as ALOPEX are discussed in another section of this book as well as in References 37, 38, and 39. Comparisons of ALOPEX with backpropagation are shown in References 37 and 40.
1.5.3.4 Justification for Gradient Methods for Nonlinear Function Approximation
Getting stuck in local minimizers is a well-known problem for gradient methods. However, the number of weights (i.e., the dimensionality of the weight space in neural networks) is usually much larger than the dimensionality of the input space $X \subset R^p$ that we would like to search for optimization. The redundant degrees of freedom thus available for finding a better minimizer are a good justification for the gradient methods used in neural networks. Another justification for the gradient method in optimization may be the approximation of highly nonlinear functions by the Taylor expansion,28 where the first- and second-order approximation, i.e., a quadratic approximation to the nonlinear function, is used. A quadratic function in a covariate x has a unique minimum or maximum.
1.5.3.5 Training Methods for Feed-Forward Networks
There exist two basic ways to train feed-forward networks: gradient-based learning and stochastic learning. Training or learning is essentially an unconstrained optimization problem. Abundant algorithms in optimization can be applied to the function approximated by the network in a structured way defined by the network topology. Among the gradient-based methods, the most popular is the steepest descent/ascent method, with the Error Backpropagation algorithm used to get the required gradient of the error function to be minimized/maximized with respect to the weights in the network.29,41 Another method using the gradient information is Newton’s method, which is basically used for zero finding of a nonlinear function. The function optimization problem is the same as finding the zero of the first derivative of the function; hence, Newton’s method is valid. All the deterministic (as opposed to stochastic) minimization techniques are based on steepest descent, Newton’s method, or both. The objective function to be optimized is usually limited to a certain class in network optimization. The squared error $\|t - \hat{y}\|^2$ and the information theoretic measure, the Kullback-Leibler distance, are the objective functions used in feed-forward networks. This is due to the limitation in calculating the gradient values of the network by the Error Backpropagation algorithm. The recommended ‘method of optimization’ due to Broyden, Fletcher, Goldfarb, and Shanno (BFGS) is the well-known Hessian matrix update in Newton’s method of unconstrained optimization.42 It requires gradient values. For the optimization machine of Figure 1.8, the feed-forward network with backpropagation provides the gradients, and the Hessian approximation is obtained by the BFGS method.
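The remark that minimization is the same as finding a zero of the first derivative can be illustrated in one dimension. The sketch below is only an illustration: the function E(w) = e^w − 2w, the starting point, and the iteration count are assumptions, and exact second derivatives are used rather than a BFGS approximation.

```python
import math


def first_derivative(w):
    """E'(w) for the hypothetical function E(w) = exp(w) - 2 w."""
    return math.exp(w) - 2.0


def second_derivative(w):
    """E''(w) for the same function (the 1-D 'Hessian')."""
    return math.exp(w)


# Newton's method applied to E'(w) = 0, i.e., zero finding of the derivative.
w = 0.0
for _ in range(20):
    w -= first_derivative(w) / second_derivative(w)

w_exact = math.log(2.0)  # the minimizer, where exp(w) = 2
```

In a network setting the first derivatives would come from backpropagation, and a quasi-Newton scheme such as BFGS would replace `second_derivative` with an approximation built from successive gradients.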
The other branch of the minimization of an unconstrained nonlinear multivariate function is grouped under so-called ‘stochastic optimization.’ The representative algorithms are Simulated Annealing,43 Boltzmann Machine Learning,44 and ALgorithm Of Pattern EXtraction (ALOPEX).45,46 Simulated Annealing43 has been used successfully in combinatorial optimization problems, such as the traveling salesman problem, VLSI wiring, and VLSI placement problems. An application to feed-forward network learning has been reported47 with the weights constrained to be integers or discrete values rather than a continuum of the weight space. Boltzmann Machine learning by Hinton and Sejnowski44 is similar to Simulated Annealing except that the acceptance of randomly chosen weights is possible even when the energy state has decreased. In Simulated Annealing the weights yielding the decreased energy state are always accepted, but in the Boltzmann Machine, probability is used in accepting the increased energy states. Simulated Annealing and Boltzmann Machine Learning (a general form of the Hopfield Network48 for the associative memory application) are mainly for combinatorial optimization problems with binary states of the units and the weights. Extension from binary to M-ary states of the weights has been reported for classification problems47 in Simulated Annealing training of feed-forward perceptrons.
ALOPEX was originally used for construction of the visual receptive field but, with some modifications, was later applied to the learning of any type of network, not restricted to multilayer perceptrons. It is a random walk process in each parameter in which the direction of the constant jump is decided by the correlation between the weight changes and the energy changes.46 Since this chapter concentrates on gradient-based optimization methods and stochastic optimization is examined elsewhere in this book,37 the other important stream of stochastic methods is not covered further in this chapter.
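The one-sentence description of ALOPEX above (a constant-size jump in each parameter, whose direction depends on the correlation between the previous weight change and the previous energy change) can be turned into a rough sketch. Everything specific here is an assumption made for illustration: the quadratic energy, the logistic form of the jump probability, the temperature, and the step size. The ALOPEX formulations treated elsewhere in the book and in References 45 and 46 may differ in these details.

```python
import math
import random


def energy(w):
    """Hypothetical energy to minimize; its minimum is at w = (1, -2)."""
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


def alopex_like_search(n_steps=2000, jump=0.05, temperature=0.01, seed=0):
    """Correlation-guided random walk with constant jump size per parameter."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    e_prev = energy(w)
    initial = best = e_prev
    # The first jump has no history, so its direction is purely random.
    dw = [rng.choice((-jump, jump)) for _ in w]
    for _ in range(n_steps):
        w = [wi + di for wi, di in zip(w, dw)]
        e_now = energy(w)
        de = e_now - e_prev
        best = min(best, e_now)
        new_dw = []
        for di in dw:
            # Positive correlation means the last move in this direction went
            # together with an energy increase (or the opposite move with a
            # decrease), so a positive jump is made less likely.
            corr = di * de
            z = max(-50.0, min(50.0, corr / temperature))  # avoid overflow
            p_plus = 1.0 / (1.0 + math.exp(z))
            new_dw.append(jump if rng.random() < p_plus else -jump)
        dw = new_dw
        e_prev = e_now
    return initial, best, w


initial_err, best_err, w_final = alopex_like_search()
```

No gradient is computed anywhere: the only feedback is the scalar energy change, which is why such a scheme is not tied to a particular network structure.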
1.5.4 ISSUES IN NEURAL NETWORKS
1.5.4.1 Universal Approximation
In the introduction section of the article by Hornik, Stinchcombe, and White (1989)49 previous work about the approximation capability of multilayer perceptrons is summarized and is referenced here. More than 20 years ago, Minsky and Papert (1969)30 showed that simple two-layer (no hidden layers) networks cannot approximate nonlinearly separating functions (e.g., XOR problems) but that a multilayer neural network could do the job. Many results on the capability of the multilayer perceptron have been reported. Some theoretical analyses of the capability of the multilayer perceptron as a universal approximator are listed below and are extensively discussed in Reference 49. Kolmogorov (1957)50 tried to answer the question of Hilbert’s 13th problem, i.e., multivariate function approximation by a superposition of functions of one variable. The superposition theory sets the upper limit of the number of hidden units to 2n + 1 units, where n is the dimensionality of the multivariate function to be approximated. However, the functional units in the network are different for the different functions to be approximated, while one would like to find an adaptive method to approximate the function from the given training data at hand. Thus Kolmogorov’s superposition theory says nothing about the capability of a multilayer network or about which method is to be used. More general views were reported. Le Cun (1987)51 and Lapedes and Farber (1988)52 showed that monotone squashing functions can be used in two hidden layers to approximate functions. A Fourier series expansion of a function is realized by a single-layer network with cosine functions in the units by Gallant and White (1988).53
Further related results using sigmoidal (or logistic) units are shown by Hecht-Nielsen (1989).54 Hornik, Stinchcombe, and White (1989)49 presented a general approximation theory of one-hidden-layer networks using arbitrary squashing functions such as cosine, logistic, hyperbolic tangent, etc., provided that sufficiently many hidden units are available. However, the number of hidden units required to attain a given degree of approximation is not considered in Hornik, Stinchcombe, and White.49 The number of hidden units obviously depends on the characteristics of the training data set, i.e., the underlying function to be estimated. Intuitively, the more complicated the functions to be trained, the more hidden units are required.
For the number of hidden units, Baum and Haussler55 limit the size of general networks (not necessarily feed-forward multilayer perceptrons) by relating it to the size of the training sample. The authors showed analytically that if the size of the sample is N and we want to classify at least a fraction $1 - \epsilon/2$ of future observations correctly, then the size of the sample has a lower bound given by

$$N \geq O\!\left(\frac{W}{\epsilon}\,\log\frac{N}{\epsilon}\right)$$

where W is the number of weights and N, inside the logarithm, is the number of nodes in the network. This, however, does not apply to the interesting feed-forward neural networks, and the given bound is not useful for most applications. There seems to be no rule of thumb for the number of hidden units.19 The number of hidden units can usually be found by cross-validation or another resampling method. A usual starting value is suggested to be about the average of the number of input and output nodes.19 Failure in learning can be attributed49 to three main reasons:
• inadequate learning,
• inadequate number of hidden units, or
• presence of a stochastic rather than a deterministic relation between input and target in the training data, i.e., noisy training data.
1.5.5 ENHANCING CONVERGENCE RATE AND GENERALIZATION OF AN OPTIMIZATION MACHINE
While the steepest descent method used originally with the backpropagation algorithm (Equation 1.48) can be an efficient method for obtaining the weights that minimize an error measure, error surfaces frequently possess properties that make this procedure slow to converge. There are at least two reasons (correlated in a sense, as will be seen below) for this slow rate of convergence.56
1. The magnitude of a gradient may be such that modifying a weight by a constant proportion, η as in Equation 1.48, of that gradient will yield too little reduction in the error measure. There are two cases for this situation. When the error surface is fairly smooth (or nearly flat), the gradient magnitude is small, and consequently the convergence is too slow. The other situation involves the case where the error curve is too wiggly. Even a small change in the weight space may result in ‘overshooting,’ which may produce only a small reduction of the error measure. Oscillating over a local minimum can happen with this error function.
2. The second reason for the slow convergence is that the negative gradient may not point to the actual minimum, as is usually the case. Figure 1.9 shows an example of an error function of two parameters, with the elliptic curves representing the contours of the error function. With the given weight point w (t) at time t, the negative gradient does not point to the real minimum, which is represented by a bullet in the center of the inner contour. Given the negative gradient, the magnitude in the direction of the major axis x1 is too small, whereas the component in the minor direction x2 is too large.
FIGURE 1.9 Error surface with contours over a two-dimensional weight space.
1.5.5.1 Suggestions for Improving the Convergence
Jacobs (1988)56 summarized four heuristics proposed in the literature for increasing the rate of convergence:
1. Every parameter of the performance measure to be minimized should have its own individual learning rate, ηij.
2. Every learning rate should be allowed to vary over time, ηij (k).
3. When the derivative of a parameter possesses the same sign for several consecutive time steps, the learning rate for that parameter should be increased.
4. When the sign of the derivative of a parameter alternates for several consecutive time steps, the learning rate for that parameter should be decreased.
Note that from Figure 1.9, by providing different learning rates for each parameter dimension, the current point in the weight space is not modified in the direction of the negative gradient, but toward the real minimum.
Another cause for the slow convergence comes from the sigmoidal units φ (·) that are used to impose nonlinearity on the network. The derivative of the nonlinear unit function has been shown to be in the form of Equation 1.55. The logistic units may become ‘stuck’ at a saturated value, either 0 or 1, since φ′ (x) = y (1 – y) (Equation 1.55) gives a very small value for an output near 0 or 1:

$$\phi'(x) = y(1 - y) \approx 0 \quad \text{for } y \approx 0 \text{ or } 1 \tag{1.59}$$
Unfortunately, any saturating unit function is bounded, resulting in the property that near the saturation points the derivative vanishes. With nonlinear units, backpropagation learning, and the general objective function $E = \|t - y\|^2$, for which ∂E/∂w is proportional to y (1 – y), the convergence of a network is known to be slow, as discussed earlier. In the original work of Rumelhart, Hinton, and Williams29 a ‘momentum’ term was added; that is, exponential smoothing was applied to the correction term, so that

$$w_{ij} \leftarrow w_{ij} - \eta\,(1 - \alpha)\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij} \tag{1.60}$$

where Δwij denotes the previous correction to wij.
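The smoothed update of Equation 1.60 can be sketched on a one-parameter problem. The error function E(w) = (w − 3)², the learning rate, and the momentum coefficient below are assumptions chosen for illustration only.

```python
def grad_E(w):
    """Gradient of the hypothetical error E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


eta, alpha = 0.1, 0.5    # learning rate and momentum coefficient (assumed)
w, delta_w = 0.0, 0.0    # delta_w stores the previous correction
for _ in range(500):
    # Equation 1.60: exponentially smoothed correction, then the update.
    delta_w = -eta * (1.0 - alpha) * grad_E(w) + alpha * delta_w
    w += delta_w
```

Setting `alpha = 0.0` recovers the plain constant-rate update of Equation 1.48, so the momentum coefficient controls how much of the previous step is carried over.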
They also considered the ‘on-line’ version of Equation 1.60, that is

$$w_{ij} \leftarrow w_{ij} - \eta'\,y_i^{p}\,\delta_j^{p} + \alpha'\,\Delta w_{ij} \tag{1.61}$$
and updated the weights as each pattern was presented to the network.
1.5.5.2 Quickprop
Some other interesting ideas to speed up the convergence have been introduced. Quickprop57 uses a second-order method, based loosely on Newton’s method. Quickprop is based on two risky assumptions: (1) that the error vs. weight graph for each weight can be approximated by a parabola with one minimum value and (2) that the change in the slope of the error curve, as seen by each weight, is not affected by all the other weights that are changing at the same time. Everything else proceeds as in standard backpropagation, but for each weight wij a set of information from the previous update is retained to get a second-order approximation. The steps to follow are (1) find the error derivative Sij (t – 1) = ∂E (t – 1)/∂wij (t – 1) and (2) record the previous update ∆wij (t – 1) = wij (t) – wij (t – 1). The next step size along the found direction, according to the heuristics above, is then given by:

$$\Delta w(t) = \frac{S(t)}{S(t-1) - S(t)}\,\Delta w(t-1) \tag{1.62}$$

where S (t) and S (t – 1) are the current and previous values of ∂E/∂w. This is a crude approximation to the optimal minimum. In effect, the step size η for each parameter wij is adaptively adjusted by Equation 1.62. To get around the saturation pitfall of Equation 1.59, Fahlman (1989)57 also suggested using an offset so that the derivative entering the delta (as in Equation 1.57) is at least 0.1, i.e., φ′ (x) = 0.1 + y (1 – y).
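A single step of Equation 1.62 can be sketched for one weight. The parabolic error E(w) = (w − 3)² and the bootstrap gradient step are assumptions made for illustration; on an exactly parabolic error curve (assumption 1 above holding perfectly) the step lands on the minimum, while on a real error surface it is only an approximation.

```python
def slope(w):
    """S = dE/dw for the hypothetical parabolic error E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def quickprop_step(s_now, s_prev, dw_prev):
    """Equation 1.62: next weight change from current and previous slopes."""
    return s_now / (s_prev - s_now) * dw_prev


eta = 0.1                      # rate for the bootstrap step (assumed)
w_prev = 0.0
s_prev = slope(w_prev)         # S(t - 1)
dw_prev = -eta * s_prev        # one ordinary gradient step gives dw(t - 1)
w_now = w_prev + dw_prev
s_now = slope(w_now)           # S(t)
w_next = w_now + quickprop_step(s_now, s_prev, dw_prev)
```

A practical version would need a guard for `s_prev == s_now`, where the denominator vanishes; that safeguard is omitted here to keep the sketch close to the equation.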
1.5.5.3 Kullback-Leibler Distance
A more interesting treatment for the problem with the classical gradient descent method has been shown in the literature.58-60 A relative (or cross) entropy of target t with respect to output y is defined and interpreted as Maximum A Posteriori (MAP) estimation for the optimal minimum of the weight space,

$$E = \sum_{p}\sum_{k}\left[\,t_k^{p}\,\log\frac{t_k^{p}}{y_k^{p}} + \left(1 - t_k^{p}\right)\log\frac{1 - t_k^{p}}{1 - y_k^{p}}\right] \tag{1.63}$$
This entropy measure becomes the measure of ‘maximum likelihood’ if the targets tk are tk ∈ {0,1},20 and may be called the ‘Kullback-Leibler’ distance, one of the probabilistic distances. The interpretation of the output vectors with this distance measure is that the output vector represents the conditional probability of target t, given the input pattern x. A binary random variable Bk associated with the kth output unit describes the presence (Bk = 1) or absence (Bk = 0) of the kth output attribute. For a given input pattern xp, the activity $y_k^p$ reflects the conditional probabilities

$$P\{B_k = 1 \mid x^{p}\} = y_k^{p} \tag{1.64}$$

and

$$P\{B_k = 0 \mid x^{p}\} = 1 - y_k^{p}. \tag{1.65}$$
With this distance measure the δ value in the generalized delta rule (Equation 1.56) becomes simpler and linear in the error (t p – yk):

$$\delta_k = \phi_k'(x_k)\,\frac{\partial E}{\partial y_k} = y_k(1 - y_k)\left[\frac{-t_p}{y_k} + \frac{1 - t_p}{1 - y_k}\right] = y_k(1 - y_k)\,\frac{y_k - t_p}{y_k(1 - y_k)} = y_k - t_p \tag{1.66}$$
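Equation 1.66 can be checked numerically for a single logistic output unit, and compared with the squared-error delta of Equation 1.57 at a saturated output. The net input, the target values, and the finite-difference step below are assumptions made for illustration.

```python
import math


def logistic(x):
    """Logistic node function, phi(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


def relative_entropy(x, t):
    """Single-output term of Equation 1.63 as a function of the net input x."""
    y = logistic(x)
    return t * math.log(t / y) + (1.0 - t) * math.log((1.0 - t) / (1.0 - y))


# Check dE/dx = y - t (Equation 1.66) by central differences.
x, t, h = 0.3, 0.8, 1e-6
numeric_delta = (relative_entropy(x + h, t)
                 - relative_entropy(x - h, t)) / (2.0 * h)
analytic_delta = logistic(x) - t

# A saturated unit: output close to 1 while the target is 0.
y_sat, t_sat = logistic(10.0), 0.0
delta_squared = 2.0 * y_sat * (1.0 - y_sat) * (y_sat - t_sat)  # Equation 1.57
delta_entropy = y_sat - t_sat                                   # Equation 1.66
```

At the saturated point the squared-error delta is of order 1e-4 even though the output is completely wrong, whereas the entropy-based delta stays close to 1, which is the practical content of the cancellation in Equation 1.66.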
Thus the error signal propagates backwards towards the inner layers, and the pitfall problem (Equation 1.59) no longer exists for this distance measure.
1.5.5.4 Weight Decay
Another way to avoid saturation is to discourage large weights and hence large inputs,20 i.e., ones with large deviations from the data set used for training. One can modify the error function to obtain regularization effects by adding an extra term that penalizes overfitting. The discouragement of unusual inputs (e.g., outlier patterns) also works as robust learning. This generalization in learning is related to the bias-variance trade-off in scatter plot smoothing. The new error to be minimized is the sum of the squared error and a penalty on the squared weights:

$$E' = E + \lambda \sum_{ij} w_{ij}^{2} \tag{1.67}$$
where λ is the weight decay parameter. The weight update rule (Equation 1.48) with the penalty term becomes

$$w_{ij} \leftarrow w_{ij} - \eta \sum_{p} y_i^{p}\,\delta_j^{p} - 2\eta\lambda\,w_{ij} \tag{1.68}$$
This is the gradient (or steepest) descent learning method with a new error term. Two effects from the weight decay can be realized. One is the generalization obtained by the shrinkage effect of the weight decay. This shrinkage method is the same idea as ridge regression in statistics, which may be written in a modified linear regression form as:
$$\left(X^{t}X + \Lambda\right)\hat{\beta} = X^{t}Y \tag{1.69}$$
where Λ is a non-negative diagonal matrix. This is motivated by a prior on β, or as a penalty term, or as a device to avoid large parameter values in nearly collinear problems.19 It is also known that weight decay helps the numerical stability of optimization algorithms, especially in avoiding almost flat regions in iterative methods, such as in Equation 1.48. With the extra penalty term in Equation 1.68, the growth of every weight is equally discouraged; there is no discrimination of the weights by their hierarchical position in a multilayer network. With the help of Figure 1.7, the weights {wij} relate the system inputs yi = xi to xj, the input to the next layer units, but the weights {wjk} are between yj and xk. To give the same penalty to all the weights evenly (Equation 1.68), the input vector x to the system should have the same range as the yj’s. Thus it is more sensible that the system inputs have the same range as the intermediate values yj’s, which is done by scaling so that the input {xp} is in [0,1], approximately.
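The connection between the weight decay update (Equation 1.68) and ridge regression (Equation 1.69) can be illustrated with a single linear unit. The data, learning rate, and decay parameter below are assumptions made for illustration; with one weight, Λ reduces to the scalar λ.

```python
# Hypothetical one-weight linear unit y = w * x trained with weight decay.
xs = [1.0, 2.0, 3.0, 4.0]      # inputs x_p
ts = [2.1, 3.9, 6.2, 7.8]      # targets t_p
eta, lam = 0.01, 1.0           # learning rate and decay parameter (assumed)

w = 0.0
for _ in range(200):
    # sum_p y_i^p * delta^p with y_i = x_p and delta = 2 (y - t) for a
    # linear output unit (Equation 1.57).
    grad = sum(x * 2.0 * (w * x - t) for x, t in zip(xs, ts))
    w = w - eta * grad - 2.0 * eta * lam * w      # Equation 1.68

s_xx = sum(x * x for x in xs)
s_xt = sum(x * t for x, t in zip(xs, ts))
w_ridge = s_xt / (s_xx + lam)  # scalar form of Equation 1.69
w_ols = s_xt / s_xx            # ordinary least squares, no penalty
```

The iterated update settles on the ridge solution rather than the least squares one, and the ridge weight is smaller in magnitude, which is the shrinkage effect described above.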
For the decay parameter λ, Ripley28 suggested $\lambda \approx 10^{-4}$ to $10^{-2}$ for the sum of squares criterion (Equation 1.68), and 0.01 to 0.1 for the entropy measure criterion (Equation 1.63).
If regression and classification are to be considered in a unified frame, the distinguishing characteristic is in the interpretation and use of the response variable. Regression is a method of model fitting for given data point pairs. Regression has a continuous response variable, representing outputs of the estimating function fˆ (·), which is usually continuous in the region of the function f (·). One likes to find or estimate the underlying function that relates input and output pairs, $\{(x_i, y_i)\}_{i=1}^{N}$, for many reasons. Prediction for future observations x0, inference on the estimated function f, and interpretation of the function of covariate xi are the principal objectives. Neural networks are a new surge in this regression paradigm, although research for regression purposes is not as active as it is for classification problems. Classification is meant to analyze data from different groups and to represent the group data well so that future observations can be classified as correctly as possible. The response variable can be considered as a categorical variable taking its value from a finite set of class labels. The difference between regression and classification is whether the response variable is continuous in the region of the function or is a categorical variable, respectively.
1.5.5.5 Regression Methods for Classification Purposes
The recent success and popularity of neural networks motivated some applied statisticians to look for similar methodologies in the statistical literature and to develop methods that use the existing nonparametric regression techniques5 for classification. The classification problem is recast in the form of a regression problem. To establish a relationship between regression and classification, the two-class linear discriminant function is shown in Section 1.5.6 to be a scalar (not a constant) multiple of the least squares regression function. Generalization to multiple group settings is given in Section 1.5.7. A number of recently developed adaptive regression methods are studied: Classification And Regression Tree (CART),18 BRUTO,61 and Multivariate Adaptive Regression Splines (MARS).62 These are incorporated with a bridging tool, FDA (Flexible Discriminant Analysis),5 for classification purposes.
1.5.6 TWO-GROUP REGRESSION AND LINEAR DISCRIMINANT FUNCTION
The linear discriminant function for two-group classification was viewed alternatively by Fisher (1936) in a regression context (see pp. 212–213 of Anderson (1984)21). The linear projector $W^{-1}(\bar{x}^{(2)} - \bar{x}^{(1)})$ in the linear discriminant function $(\bar{x}^{(2)} - \bar{x}^{(1)})^{t}\,W^{-1}x$ is actually a scalar multiple of the linear regression function. A dummy variate is introduced for the two class response values. Let the two variables be

$$y_i^{(1)} = \frac{n_2}{n_1 + n_2}, \quad i = 1, 2, \ldots, n_1, \tag{1.70}$$

$$y_i^{(2)} = \frac{-n_1}{n_1 + n_2}, \quad i = 1, 2, \ldots, n_2. \tag{1.71}$$
The regression function $b^{t}x$ is obtained by minimizing the sum of squared residuals (SSR)

$$\sum_{j=1}^{2}\sum_{i=1}^{n_j}\left[\,y_i^{(j)} - b^{t}\left(x_i^{(j)} - \bar{x}\right)\right]^{2}$$
where $x_i^{(j)}$ is the ith observation from group j, j = 1,2, and $\bar{x}$ is the overall mean of the training data. The normal equations are obtained by taking the derivative of the SSR with respect to b, the newly defined unknown coefficients of the two-group regression, and setting it equal to zero:

$$\sum_{j=1}^{2}\sum_{i=1}^{n_j}\left(x_i^{(j)} - \bar{x}\right)\left(x_i^{(j)} - \bar{x}\right)^{t} b = \sum_{j=1}^{2}\sum_{i=1}^{n_j} y_i^{(j)}\left(x_i^{(j)} - \bar{x}\right) \tag{1.72}$$

$$= \frac{n_1 n_2}{n_1 + n_2}\left[\left(\bar{x}^{(1)} - \bar{x}\right) - \left(\bar{x}^{(2)} - \bar{x}\right)\right]$$

$$= \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right) \tag{1.73}$$
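The outer product in the LHS of Equation 1.72 is the total covariance of the predictor variables and can be decomposed into a combination of within-covariance and between-covariance matrices as

$$\sum_{j=1}^{2}\sum_{i=1}^{n_j}\left(x_i^{(j)} - \bar{x}\right)\left(x_i^{(j)} - \bar{x}\right)^{t} = \sum_{j=1}^{2}\sum_{i=1}^{n_j}\left(x_i^{(j)} - \bar{x}^{(j)}\right)\left(x_i^{(j)} - \bar{x}^{(j)}\right)^{t} + n_1\left(\bar{x}^{(1)} - \bar{x}\right)\left(\bar{x}^{(1)} - \bar{x}\right)^{t} + n_2\left(\bar{x}^{(2)} - \bar{x}\right)\left(\bar{x}^{(2)} - \bar{x}\right)^{t} \tag{1.74}$$

$$= \sum_{j=1}^{2}\sum_{i=1}^{n_j}\left(x_i^{(j)} - \bar{x}^{(j)}\right)\left(x_i^{(j)} - \bar{x}^{(j)}\right)^{t} + \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)^{t} \tag{1.75}$$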
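Thus Equation 1.72 is rewritten as

$$\frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right) = \left[\sum_{j=1}^{2}\sum_{i=1}^{n_j}\left(x_i^{(j)} - \bar{x}^{(j)}\right)\left(x_i^{(j)} - \bar{x}^{(j)}\right)^{t} + \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)^{t}\right] b \tag{1.76}$$

If we define the within-group SSP (sum of squares and products) as W:

$$W = \sum_{j=1}^{2}\sum_{i=1}^{n_j}\left(x_i^{(j)} - \bar{x}^{(j)}\right)\left(x_i^{(j)} - \bar{x}^{(j)}\right)^{t},$$

the normal Equation 1.76 has the form

$$Wb = \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right) - \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)^{t} b \tag{1.77}$$

$$= \left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)\left[\frac{n_1 n_2}{n_1 + n_2} - \frac{n_1 n_2}{n_1 + n_2}\left(\bar{x}^{(1)} - \bar{x}^{(2)}\right)^{t} b\right]. \tag{1.78}$$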
Since the whole bracket is a scalar, the solution b of Equation 1.78 is proportional to the projection vector $W^{-1}(\bar{x}^{(1)} - \bar{x}^{(2)})$ of the linear discriminant function.
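The proportionality stated above can be verified on a small example. The sketch below is an illustration under stated assumptions: two hypothetical groups of two-dimensional points, with 2 × 2 systems solved by Cramer's rule to keep the code self-contained. It solves the normal equations (Equation 1.72) for b and compares the result with the discriminant direction.

```python
def mean(points):
    """Componentwise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]


def scatter(points, center):
    """Sum of outer products (x - center)(x - center)^t as a 2x2 matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - center[0], p[1] - center[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s


def solve2(a, rhs):
    """Solve the 2x2 linear system a z = rhs by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(rhs[0] * a[1][1] - a[0][1] * rhs[1]) / det,
            (a[0][0] * rhs[1] - a[1][0] * rhs[0]) / det]


# Hypothetical training data for the two groups.
group1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
group2 = [(5.0, 5.0), (6.0, 7.0), (7.0, 6.0), (8.0, 8.0)]
n1, n2 = len(group1), len(group2)
m1, m2 = mean(group1), mean(group2)
m_all = mean(group1 + group2)

# Left-hand side of Equation 1.72: total SSP about the overall mean.
total = scatter(group1 + group2, m_all)
# Right-hand side of Equation 1.72 with the dummy responses (1.70), (1.71).
y1, y2 = n2 / (n1 + n2), -n1 / (n1 + n2)
rhs = [0.0, 0.0]
for pts, y in ((group1, y1), (group2, y2)):
    for p in pts:
        for d in range(2):
            rhs[d] += y * (p[d] - m_all[d])
b = solve2(total, rhs)

# Within-group SSP W and the discriminant direction W^{-1}(mean1 - mean2).
s1, s2 = scatter(group1, m1), scatter(group2, m2)
w_mat = [[s1[r][c] + s2[r][c] for c in range(2)] for r in range(2)]
lda = solve2(w_mat, [m1[0] - m2[0], m1[1] - m2[1]])

ratio0, ratio1 = b[0] / lda[0], b[1] / lda[1]
```

Both components of b have the same ratio to the corresponding components of the discriminant direction, and that common ratio is the scalar in the bracket of Equation 1.78.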
1.5.7 MULTI-RESPONSE REGRESSION AND FLEXIBLE DISCRIMINANT ANALYSIS
Multiresponse linear/nonlinear regression can also be used for classification. The simplest and most common way is to transform the categorical variable j ∈ {1,2,…,J} into an (N × J)-matrix Y such that an element yij has a value of 1 in the jth column if the observation is in class j. The multiresponse multivariate regression is then carried out on the predictors x. A new observation x0 is fitted with the J fits and is classified to the class having the largest fitted value, i.e., Yj. Since we cannot expect the regression fit yˆ k = fk (x0), the kth regression fit, to be in the region [0,1], the indicator matrix Y whose elements are either 0 or 1 is not a good way of introducing dummy response variables. Optimal Scoring, which will be studied in Section 1.5.8, transforms a categorical variable to the real line R, such that the transform is best regressed linearly on the predictor variables x.
1.5.7.1 Powerful Nonparametric Regression Methods for Classification Problems
Recently, Hastie, Buja, and Tibshirani63 introduced a new treatment of regression methods to be used for classification problems. They showed that discriminant analysis could be tackled via Optimal Canonical Correlation Analysis (CCA), especially its asymmetric version, Optimal Scoring (OS). The idea is based on the facts that CCA is equivalent to Linear Discriminant Analysis (LDA) and that OS leads to CCA, via various nonparametric regression methods. Multi-group linear discriminant analysis (Section 1.4.2) has been the traditional choice in classification and discriminant analysis. The robustness and the simplicity of LDA64 in implementation and interpretation are responsible for its popularity. Recently a group of applied statisticians found and developed ways of using regression techniques for classification applications. Breiman and Ihaka (1984)6 noticed that the regression approach to the classification problem can be extended from the two-group to a multi-group setting via scaling and ACE. This idea was adapted by Hastie, Tibshirani, and Buja and developed into Flexible Discriminant Analysis (FDA).5 The basic concept is that LDA, CCA, and OS are equivalent. One can find the discriminant variates via either CCA or OS. Since this equivalence is so critical, some space is devoted here to the understanding of this property. The generalization of LDA to nonlinear flexible discriminant analysis is due to the fact that an OS solution can be obtained by any linear/nonlinear regression method. This has the important consequence that we can simply use the tools of nonparametric regression to perform nonparametric discriminant analysis, which the authors termed Flexible Discriminant Analysis (FDA).
This section is a somewhat concise version of Section 3 of Hastie, Buja, and Tibshirani’s unpublished paper.63 It is known that discriminant variates are the same as the so-called ‘canonical variates,’ which result from an associated canonical correlation analysis (CCA), and often the latter term is used interchangeably with discriminant variates. Somewhat less known is that an asymmetric version of canonical correlation analysis, here called optimal scoring (OS) and well known in correspondence analysis, can also yield a set of dimensions which coincide with those of LDA and CCA. Each of the three techniques (OS, CCA, LDA) to be discussed has an associated criterion and constraints under which the criterion is to be optimized. The equivalence of LDA, CCA, and OS follows as each of them is briefly described.
1.5.8 OPTIMAL SCORING
Optimal scoring is used to turn categorical variables into quantitative ones by assigning scores to classes (groups, categories). Suppose θ : T → R is a function that assigns scores to the classes, such that the transformed class labels are optimally predicted by linear regression on X. This produces a one-dimensional separation between the classes. More generally, we can find K sets of independent scorings for the class labels, {θ1, θ2,…, θK}, and K corresponding linear maps ηk (X) = Xtβk, k = 1,2,…, K, chosen to be optimal for multiple regression in $R^K$. Thus the OS problem is to find the two sets of unknown functions that minimize a certain criterion. Let (xi,gi), i = 1,2,…, N, be the training sample; then the scores $\{\theta_k(g)\}_{k=1}^{K}$ and the maps $\{\beta_k\}_{k=1}^{K}$ are chosen to minimize the average squared residual (ASR):

$$\mathrm{ASR} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N}\left(\theta_k(g_i) - x_i^{t}\beta_k\right)^{2}$$
In the criterion ASR above, $\theta(g)$ assigns a real number, $\theta_j$, to the $j$th label of $g$, the categorical response variable. In matrix notation, with $Y$ the $N \times J$ indicator matrix of class membership and $\theta$ a $J$-vector of such scores, the $N$-vector $Y\theta$ is the vector of scored training data, which one may try to regress onto the $N \times p$ predictor matrix $H$. To simplify the notation, we proceed with a single solution only: the multiresponse multivariate regression can be thought of as simply $K$ copies of the single-response multivariate regression. Thus a single solution pair $(\theta, \beta)$ is used in the following instead of the series of solutions $(\theta_k, \beta_k)$, $k = 1, 2, \ldots, K$. Definition: The Optimal Scoring problem is defined by the criterion

$$\mathrm{ASR}(\theta, \beta_{OS}) = \min_{\beta}\frac{1}{N}\sum_{i=1}^{N}\left(\theta(g_i) - h(x_i)^t\beta\right)^2 \tag{1.79}$$

$$= \min_{\beta}\frac{1}{N}\left\|Y\theta - H\beta\right\|^2 \tag{1.80}$$
which is to be minimized (or made stationary) under the constraint $N^{-1}\|Y\theta\|^2 = 1$, imposed to obtain a unique solution for $\theta$. A unified view of the three equivalent techniques (OS, CCA, and LDA) is conveniently obtained by rewriting the ASR of Equation 1.80 as a quadratic form:

$$\mathrm{ASR}(\theta, \beta) = \theta^t\Sigma_{11}\theta - 2\theta^t\Sigma_{12}\beta + \beta^t\Sigma_{22}\beta \tag{1.81}$$

where the matrices $\Sigma$ are defined as:
• $\Sigma_{11} = \frac{1}{N}Y^tY$, a diagonal matrix with the class proportions $p_j = n_j/N$ on the diagonal,
• $\Sigma_{22} = \frac{1}{N}H^tH$, the total covariance matrix of the predictor variables,
• $\Sigma_{12} = \frac{1}{N}Y^tH$, $\Sigma_{21} = \Sigma_{12}^t$.
If all considered classes are in the sample, i.e., $n_j > 0$, $\Sigma_{11}$ is invertible.
1.5.8.1 Partially Minimized ASR
If we assume that the score vector $\theta$ is fixed, the minimizing $\beta$ for the OS problem is the least squares estimate

$$\beta_{OS} = (H^tH)^{-1}H^tY\theta = \Sigma_{22}^{-1}\Sigma_{21}\theta \tag{1.82}$$

The linear regression of $Y\theta$ onto the design matrix $H$ under the least squares criterion gives the following result. From Equation 1.80 and Equation 1.82:

$$\min_{\beta}\mathrm{ASR}(\theta, \beta) = \frac{1}{N}\|Y\theta\|^2 - \frac{1}{N}(Y\theta)^tS^tS\,Y\theta = 1 - \frac{1}{N}(Y\theta)^tS(Y\theta) \tag{1.83}$$

$$= 1 - \frac{1}{N}\theta^tY^tSY\theta = 1 - \theta^t\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\theta \tag{1.84}$$
where $S = H(H^tH)^{-1}H^t$ denotes the 'hat' or 'smoother' matrix of the predictor matrix $H$, which results from the least squares linear regression. The same expression for $\mathrm{ASR}(\theta, \beta)$ can be written in matrix form as

$$\begin{aligned}
\min_{\beta}\mathrm{ASR}(\theta, \beta) &= \frac{1}{N}\|Y\theta\|^2 - \frac{1}{N}(Y\theta)^tS^tS\,Y\theta \\
&= \frac{1}{N}\|Y\theta\|^2 - \frac{1}{N}(SY\theta)^t(SY\theta) \\
&= \frac{1}{N}(Y\theta - SY\theta)^t(Y\theta - SY\theta) \\
&= \frac{1}{N}\left\{(Y\theta)^t(I - S)(I - S)Y\theta\right\} \\
&= \frac{1}{N}\left\{\theta^tY^t(I - P_H)Y\theta\right\}
\end{aligned} \tag{1.85}$$

with a new notation for the projection matrix, $P_H$, based on the predictor design matrix $H$ for the least squares linear regression:

$$P_H = S = H(H^tH)^{-1}H^t$$
With $\theta$ assumed fixed, we have reached the partially minimized ASR, where the minimizing $\beta$ was obtained by least squares linear regression. Now we need to find the $\theta$ that transforms the indicator matrix into the scalings $Y\theta$ such that the linear regression yields the best fit to the new scalings. The question then is: given Equation 1.85, which $\theta$ gives the least possible ASR? Equation 1.85 is a quadratic form in the symmetric matrix $Y^t(I - P_H)Y$, and we look for the vector $\theta$ that makes this quadratic form smallest. Under the constraint on $\theta$, minimizing the quadratic form of $Y^t(I - P_H)Y$ is the same as maximizing that of $Y^t\hat{Y} = Y^tP_HY$, provided that the regression fit $\hat{Y} = P_HY$ is shrunk, which is a property of all linear smoothers [65]. The projection operator $P_H$ is a linear smoother. Therefore, the minimizing $\theta$ in Equation 1.85 is the eigenvector corresponding to the largest eigenvalue of $Y^t\hat{Y} = Y^tP_HY$. This is the point at which nonlinear nonparametric regressions come into play for classification applications of regression. Direct calculation of the projection matrix $P_H$ onto the expanded predictor space $h(x)$, i.e., the space spanned by the columns of $H$, is possible, but the fact that any regression can calculate the fitted values $\hat{Y}$ allows various linear and nonlinear regressions to be used.
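The partial minimization above can be checked numerically. The following NumPy sketch is only an illustration under stated assumptions: the toy data, the variable names, and the centring of $H$ (so that $\Sigma_{22}$ is a covariance matrix and the constant score vector is the trivial solution) are choices made here, not taken from the text. It compares the residual of the fitted regression with the closed form of Equation 1.84, and then finds the best $\theta$ as the leading eigenvector of $\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$ in the $\Sigma_{11}$ metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data (not the chapter's data): N samples, p predictors, J classes.
N, p, J = 90, 4, 3
g = np.arange(N) % J                          # class label of each sample
H = rng.normal(size=(N, p)) + 0.8 * g[:, None]
H -= H.mean(axis=0)                           # centre H so that S22 is a covariance
Y = np.eye(J)[g]                              # N x J indicator matrix

S11 = Y.T @ Y / N                             # diagonal matrix of class proportions
S22 = H.T @ H / N
S12 = Y.T @ H / N

def asr(theta, beta):
    """Average squared residual (1/N) * ||Y theta - H beta||^2 (Eq. 1.80)."""
    r = Y @ theta - H @ beta
    return float(r @ r) / N

def beta_os(theta):
    """Least-squares coefficients for a fixed score vector (Eq. 1.82)."""
    return np.linalg.solve(S22, S12.T @ theta)

# 1) For an arbitrary theta scaled to theta^t S11 theta = 1, the residual of the
#    fitted regression matches the closed form of Eq. 1.84.
theta = rng.normal(size=J)
theta /= np.sqrt(theta @ S11 @ theta)
asr_direct = asr(theta, beta_os(theta))
asr_closed = 1.0 - theta @ S12 @ np.linalg.solve(S22, S12.T @ theta)

# 2) The best theta is the leading eigenvector of S12 S22^-1 S21 in the S11 metric.
M = S12 @ np.linalg.solve(S22, S12.T)
w = 1.0 / np.sqrt(np.diag(S11))               # S11^(-1/2); S11 is diagonal
evals, evecs = np.linalg.eigh(w[:, None] * M * w[None, :])
theta_opt = w * evecs[:, -1]                  # eigh returns ascending eigenvalues
asr_min = asr(theta_opt, beta_os(theta_opt))

print(asr_direct, asr_closed, asr_min, 1.0 - evals[-1])
```

The last two printed values agree because the minimum ASR equals one minus the largest eigenvalue, and no other normalized score vector can do better than `asr_min`.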
1.5.9 CANONICAL CORRELATION ANALYSIS
Canonical Correlation Analysis (CCA) seeks to identify and quantify the associations between two sets of variables: the correlation between a linear combination of the first set and a linear combination of the second set is maximized. Definition: The canonical correlation problem is defined by the criterion

$$\mathrm{COR}(\theta_{CCA}, \beta_{CCA}) = \max_{\theta, \beta}\left\{\theta^t\Sigma_{12}\beta\right\} \tag{1.86}$$

which is to be maximized under the constraints

$$\theta^t\Sigma_{11}\theta = 1, \quad \text{and} \quad \beta^t\Sigma_{22}\beta = 1. \tag{1.87}$$
The $\Sigma$'s are the same as in the previous section on optimal scoring. The criteria of optimal scoring, $\mathrm{ASR}(\theta, \beta)$, and of canonical correlation analysis, $\mathrm{COR}(\theta, \beta)$, are related through Equation 1.81 and the two CCA constraints:

$$\mathrm{ASR} = 2 - 2\,\mathrm{COR}$$

which means that OS and CCA differ only in the additional constraint on $\beta$ in Equation 1.87. For a fixed $\theta$, the partial maximizer $\beta_{CCA}$ is therefore the partial minimizer $\beta_{OS}$ rescaled to satisfy the constraint of Equation 1.87:

$$\beta_{CCA} = \beta_{OS}\Big/\sqrt{\beta_{OS}^t\Sigma_{22}\beta_{OS}} \tag{1.88}$$
Expressing the maximizer $\beta_{CCA}$ in terms of the minimizer $\beta_{OS}$ (Equation 1.88) in the definition of the CCA (Equation 1.86) shows that the score vector $\theta$ is the same in OS and CCA:

$$\begin{aligned}
\max_{\substack{\theta^t\Sigma_{11}\theta = 1 \\ \beta^t\Sigma_{22}\beta = 1}}\mathrm{COR}(\theta, \beta) = \theta^t\Sigma_{12}\beta
&= \frac{\theta^t\Sigma_{12}\beta_{OS}}{\sqrt{\beta_{OS}^t\Sigma_{22}\beta_{OS}}} \\
&= \left(\frac{\theta^t\Sigma_{12}\beta_{OS}\,\beta_{OS}^t\Sigma_{21}\theta}{\beta_{OS}^t\Sigma_{22}\beta_{OS}}\right)^{1/2} \\
&= \left(\theta^t\Sigma_{12}\beta_{OS}\left(\beta_{OS}^t\Sigma_{22}\beta_{OS}\right)^{-1}\beta_{OS}^t\Sigma_{21}\theta\right)^{1/2} \\
&= \left(\theta^t\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\theta\right)^{1/2}
\end{aligned} \tag{1.89}$$

The last expression is maximized by the same $\theta$ that minimizes Equation 1.84, so the minimizer of the OS problem is also the maximizer of Equation 1.86. This identity of $\theta$, together with the relationship between the $\beta$'s (Equation 1.88), shows that OS is essentially the same as CCA apart from the constraint on $\beta_{CCA}$.
1.5.10 LINEAR DISCRIMINANT ANALYSIS
Linear Discriminant Analysis (LDA) is a standard tool for classification and dimension reduction. LDA is a special case of the Bayesian classifier of Section 1.4.2 in which the group conditional distributions are assumed to be multivariate normal with a common covariance matrix and different mean vectors for the different classes.
1.5.10.1 LDA Revisited
For multiclass data, the optimization problem is to find the $K \le J - 1$ linear combinations that separate the class means $m_j$ as much as possible in the $K$-dimensional subspace, subject to the constraint that the linear combinations are spherical, i.e., uncorrelated and with unit variance, with respect to $\Sigma_W$, the within-class covariance. The columns $u_k$ of the matrix $U$ of LDA vectors are the eigenvectors corresponding to the $K$ largest eigenvalues of the matrix $\Sigma_W^{-1}\Sigma_B$. The LDA procedure is first to sphere $x$ with respect to the common within-group covariance matrix, then to project these data onto the $J - 1$ dimensional subspace spanned by the $J$ group mean vectors $m_j$, and finally to assign the new vector of discriminant variates, $U^tx_0$, to the class with the closest centroid. Following the notation for the two sets of variables in Section 1.5.8, the matrix $M$ of mean vectors, $\Sigma_B$, and $\Sigma_W$ take the following simple forms, with $P_Y = Y(Y^tY)^{-1}Y^t$ the projector onto the column space of $Y$:
• $M = \Sigma_{11}^{-1}\Sigma_{12}$, a $J \times p$ matrix whose rows are the class means $m_j = \mathrm{avg}\{h_i;\ i \in \text{Class } j\}$: $M = (m_1, m_2, \ldots, m_J)^t$
• $\Sigma_B = \frac{1}{N}(P_YH)^t(P_YH) = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = M^t\Sigma_{11}M$
• $\Sigma_W = \frac{1}{N}\left[\left((I - P_Y)H\right)^t(I - P_Y)H\right] = \Sigma_{22} - \Sigma_B$
The matrix $M$ consists of rows of class mean vectors $m_j$. The between-class covariance $\Sigma_B$ is the covariance of $H$ regressed onto $Y$, or, equivalently, the class-weighted covariance of the class means. The within-class covariance is what remains after subtracting $\Sigma_B$ from the total covariance $\Sigma_{22}$. The linear discriminant problem maximizes the between-class variance under a constraint on the within-class variance. Definition: The criterion of the linear discriminant problem to be maximized is the between-class variance:

$$\mathrm{BVAR}(\beta_{LDA}) = \max_{\beta}\left\{\beta^t\Sigma_B\beta\right\} \tag{1.90}$$

with the constraint:

$$\mathrm{WVAR}(\beta_{LDA}) = \beta_{LDA}^t\Sigma_W\beta_{LDA} = 1 \tag{1.91}$$

1.5.11 TRANSLATION OF OPTIMAL SCORING DIMENSIONS INTO DISCRIMINANT COORDINATES
It is convenient to use CCA as a link between OS and LDA. CCA is a generalized singular value problem for $\Sigma_{12}$ with respect to the metrics given by $\Sigma_{11}$ and $\Sigma_{22}$. Recall that the maximization problem of Equation 1.86 uses generalized quadratic forms in its constraints, hence the name generalized singular value problem. The associated singular value decomposition (SVD), essentially a collection of stationary solutions of the CCA problem, takes the form:

$$\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1} = \Theta D_\alpha B^t \tag{1.92}$$

$$\Theta^t\Sigma_{11}\Theta = I_L \tag{1.93}$$

$$B^t\Sigma_{22}B = I_L \tag{1.94}$$
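where $L = \min(J, p)$, $\Theta$ is a $J \times L$ matrix whose columns $\theta_k$ are left-stationary vectors, $B$ is a $p \times L$ matrix whose columns $\beta_k$ are right-stationary vectors, and $D_\alpha$ is an $L \times L$ diagonal matrix with non-negative diagonal elements $\alpha_k$ sorted in descending order. A simple (non-generalized) SVD of the form $A = UDV^t$ has the following immediate consequences:

$$A = UDV^t, \quad AV = UD, \quad A^tU = VD, \quad U^tAV = D, \quad V^tA^tAV = D^2, \quad U^tAA^tU = D^2$$

These translate to the generalized SVD as follows. The left column is for the regular SVD and the right column for the generalized SVD; the last two rows are paired so that $U$ corresponds to $\Theta$ and $V$ to $B$.

$$\begin{array}{lll}
A = UDV^t & \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1} = \Theta D_\alpha B^t & (1.95) \\
AV = UD & \Sigma_{11}^{-1}\Sigma_{12} = \Theta D_\alpha B^t\Sigma_{22}, \quad \Sigma_{11}^{-1}\Sigma_{12}B = \Theta D_\alpha & (1.96) \\
A^tU = VD & \Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1} = BD_\alpha\Theta^t, \quad \Sigma_{22}^{-1}\Sigma_{21}\Theta = BD_\alpha\Theta^t\Sigma_{11}\Theta = BD_\alpha & (1.97) \\
U^tAV = D & \Theta^t\Sigma_{12}B = D_\alpha & (1.98) \\
U^tAA^tU = D^2 & \Theta^t\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Theta = D_\alpha^2 & (1.99) \\
V^tA^tAV = D^2 & B^t\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}B = D_\alpha^2 & (1.100)
\end{array}$$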
In particular, Equation 1.98 implies $\mathrm{COR}(\theta_k, \beta_k) = \alpha_k$. As noted before from Equation 1.84 and Equation 1.89, the stationary $\theta$ vectors of OS and CCA are the same, while the $\beta$ vectors of OS and CCA are related, according to Equation 1.82 and Equation 1.97, by

$$B_{OS} = BD_\alpha, \tag{1.101}$$

$B_{OS}$ being the matrix of OS-stationary column vectors $\beta_{OS,k}$. From Equation 1.84 and Equation 1.89 it follows that $\mathrm{ASR}(\theta_k, \beta_k) = 1 - \alpha_k^2$.
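The generalized SVD of Equations 1.92 to 1.94 can be computed from an ordinary SVD of $\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}$. The sketch below is a hedged illustration on hypothetical synthetic data; the helper `inv_sqrt`, the variable names, and the centring of $H$ are assumptions of the sketch. It checks the orthogonality constraints, Equation 1.98, Equation 1.101, and the relation $\mathrm{ASR}(\theta_k, \beta_{OS,k}) = 1 - \alpha_k^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: J classes, p centred predictors.
N, p, J = 120, 4, 3
g = np.arange(N) % J
H = rng.normal(size=(N, p)) + 0.7 * g[:, None]
H -= H.mean(axis=0)
Y = np.eye(J)[g]
S11, S22, S12 = Y.T @ Y / N, H.T @ H / N, Y.T @ H / N

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs / np.sqrt(vals)) @ vecs.T

# Ordinary SVD in the whitened coordinates, then map back to the original ones.
R11, R22 = inv_sqrt(S11), inv_sqrt(S22)
U, alpha, Vt = np.linalg.svd(R11 @ S12 @ R22, full_matrices=False)
Theta = R11 @ U                 # J x L, columns theta_k
B = R22 @ Vt.T                  # p x L, columns beta_k
D = np.diag(alpha)              # canonical correlations, descending

ok_decomposition = np.allclose(
    np.linalg.inv(S11) @ S12 @ np.linalg.inv(S22), Theta @ D @ B.T)   # Eq. 1.92
ok_theta = np.allclose(Theta.T @ S11 @ Theta, np.eye(len(alpha)))     # Eq. 1.93
ok_beta = np.allclose(B.T @ S22 @ B, np.eye(len(alpha)))              # Eq. 1.94
ok_cor = np.allclose(Theta.T @ S12 @ B, D)                            # Eq. 1.98

# OS coefficients for the same score vectors (Eq. 1.82) and their ASR (Eq. 1.81).
B_os = np.linalg.solve(S22, S12.T @ Theta)
ok_os = np.allclose(B_os, B @ D)                                      # Eq. 1.101
asr_k = np.array([
    t @ S11 @ t - 2.0 * t @ S12 @ b + b @ S22 @ b
    for t, b in zip(Theta.T, B_os.T)
])
ok_asr = np.allclose(asr_k, 1.0 - alpha ** 2)

print(alpha, ok_decomposition, ok_theta, ok_beta, ok_cor, ok_os, ok_asr)
```

Because $H$ is centred, the smallest canonical correlation belongs to the constant score vector and is numerically zero; only the leading $J - 1$ values carry class information.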
To link CCA and LDA, we rewrite Equation 1.100 using the expression $\Sigma_B = \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ as:

$$B^t\Sigma_BB = D_{\alpha^2} \tag{1.102}$$

and

$$B^t\Sigma_WB = B^t(\Sigma_{22} - \Sigma_B)B \tag{1.103}$$

$$= I_L - D_{\alpha^2} = D_{1 - \alpha^2} \tag{1.104}$$

These two equations (Equation 1.102 and Equation 1.104) show that $B$ diagonalizes both $\Sigma_B$ and $\Sigma_W$. If we define

$$B_{LDA} = BD_{(1 - \alpha^2)^{-1/2}}$$

we get a matrix whose columns $\beta_{LDA,k}$ are stationary solutions of the LDA problem:

$$B_{LDA}^t\Sigma_WB_{LDA} = I_L, \tag{1.105}$$

$$B_{LDA}^t\Sigma_BB_{LDA} = D_{\alpha^2/(1 - \alpha^2)} \tag{1.106}$$

Finally, the relation between the LDA and the OS solutions is given by

$$B_{LDA} = B_{OS}D_{[\alpha^2(1 - \alpha^2)]^{-1/2}} \tag{1.107}$$

1.5.12 LINEAR DISCRIMINANT ANALYSIS VIA OPTIMAL SCORING
The minimization criterion, the average squared residual (ASR), for multi-response Optimal Scoring has the form

$$\mathrm{ASR} = \frac{1}{N}\sum_{k=1}^{K}\sum_{i=1}^{N}\left(\theta_k(g_i) - x_i^t\beta_k\right)^2 = \frac{1}{N}\left\|Y\Theta - XB\right\|^2$$

with the constraint $N^{-1}\|Y\Theta\|^2 = 1$ for a unique solution $\Theta$. If $\Theta$ is fixed we get the transformed values

$$\Theta^*_{N \times K} = Y_{N \times J}\Theta_{J \times K}. \tag{1.108}$$
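With the notation for the projection matrix, $P_H$, and the smoothing operator $S$, based on the predictor design matrix $H$ for the least squares linear regression, $P_H = S = H(H^tH)^{-1}H^t$, the partially minimized ASR with $\Theta^*$ fixed becomes

$$\begin{aligned}
\mathrm{ASR}(\Theta, B) &= \frac{1}{N}\|Y\Theta\|^2 - \frac{1}{N}(Y\Theta)^tS^tS\,Y\Theta \\
&= \frac{1}{N}\|Y\Theta\|^2 - \frac{1}{N}(SY\Theta)^t(SY\Theta) \\
&= \frac{1}{N}(Y\Theta - SY\Theta)^t(Y\Theta - SY\Theta) \\
&= \frac{1}{N}\left\{(Y\Theta)^t(I - S)(I - S)Y\Theta\right\} \\
&= \frac{1}{N}\left\{\Theta^tY^t(I - P_H)Y\Theta\right\}
\end{aligned} \tag{1.109}$$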
If we constrain $\Theta^*$ to have zero mean and unit variance and to be uncorrelated:

$$\frac{1}{N}\sum_{i=1}^{N}\Theta^*_i = 0, \qquad \frac{1}{N}\Theta^{*t}\Theta^* = I_K$$

the minimizing $\Theta$ is obtained from Equation 1.109 as the eigenvectors of $Y^tP_HY$ corresponding to the $K$ largest eigenvalues, under the constraint $\Theta^tD_p\Theta = I_K$ with $D_p = Y^tY/N$. A direct approach to such optimal scores $\Theta$ would be to build the projection (or hat) matrix $P_X$ explicitly and carry out the eigenanalysis via Singular Value Decomposition:

$$P_X = X(X^tX)^{-1}X^t$$
$$Y^tP_XY = \Theta\Lambda\Theta^t$$

A more convenient approach avoids the explicit calculation of $P_X$ and takes advantage of the fact that $P_X$ computes the linear regression $\hat{Y} = P_XY$. An algorithmic approach to computing the usual canonical variates by OS thus provides an equivalent procedure for obtaining LDA by OS.
1.5.12.1 LDA via OS
Given the equivalence of OS and LDA in Equation 1.107, the algorithm for LDA via OS is:
1. Initialization: form $Y_{N \times J}$, the indicator matrix, whose entry $y_{ij}$ is 1 if the $i$th observation belongs to the $j$th group and 0 otherwise.
2. Linear multivariate regression: find the linear regression $\hat{Y} = P_XY = SY$ and, by linear least squares, set $B$ such that $\hat{Y} = XB$.
3. Optimal scores: find the eigenvector matrix $\Theta$ of the rank $K \le J$ matrix $Y^t\hat{Y}$ via SVD, $Y^t\hat{Y} = \Theta\Lambda\Theta^t$ with $\Theta^tD_p\Theta = I_J$.
4. Update the coefficient matrix $B$ obtained in step 2: $B \leftarrow B\Theta$.
The final coefficient matrix $B_{OS}$ is, up to a diagonal scale matrix, the same as the LDA coefficient matrix $B_{LDA}$ of Equation 1.107, $B_{LDA} = B_{OS}D$, where the diagonal matrix $D$ has the elements

$$d_{kk} = \left[\alpha_k^2\left(1 - \alpha_k^2\right)\right]^{-1/2}$$

and $\alpha_k$ is the $k$th element of the diagonal matrix $\Lambda$ in the spectral decomposition of the rank $K \le J$ matrix $Y^t\hat{Y}$ via SVD: $Y^t\hat{Y} = \Theta\Lambda\Theta^t$.
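The four steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `lda_via_os`, the synthetic data, and the centring of $X$ are assumptions. The sketch also fixes one particular normalization, taking $\alpha_k^2$ to be the eigenvalues of $Y^t\hat{Y}/N$ in the $D_p$ metric, which is what Equation 1.99 implies. The result is checked against Equations 1.105 and 1.106.

```python
import numpy as np

def lda_via_os(X, g, J):
    """LDA coefficients through optimal scoring (steps 1-4, then Eq. 1.107)."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)                      # centred predictors (assumed here)
    Y = np.eye(J)[g]                             # step 1: indicator matrix
    B, *_ = np.linalg.lstsq(Xc, Y, rcond=None)   # step 2: Y_hat = Xc @ B
    Y_hat = Xc @ B
    Dp = Y.mean(axis=0)                          # class proportions, diag of Y^t Y / N
    w = 1.0 / np.sqrt(Dp)
    M = Y.T @ Y_hat / N
    M = (M + M.T) / 2.0                          # symmetrise against round-off
    evals, V = np.linalg.eigh(w[:, None] * M * w[None, :])
    keep = np.argsort(evals)[::-1][: J - 1]      # drop the trivial constant score
    alpha2 = evals[keep]                         # squared canonical correlations
    Theta = w[:, None] * V[:, keep]              # step 3: Theta^t Dp Theta = I
    B_os = B @ Theta                             # step 4
    B_lda = B_os / np.sqrt(alpha2 * (1.0 - alpha2))   # Eq. 1.107
    return B_lda, alpha2

# Hypothetical data: three Gaussian classes with distinct means.
rng = np.random.default_rng(2)
N, p, J = 150, 4, 3
g = np.arange(N) % J
means = 2.0 * rng.normal(size=(J, p))
X = means[g] + rng.normal(size=(N, p))

B_lda, alpha2 = lda_via_os(X, g, J)

# Between- and within-class covariances from their definitions in Section 1.5.10.
Xc = X - X.mean(axis=0)
Y = np.eye(J)[g]
S11, S22, S12 = Y.T @ Y / N, Xc.T @ Xc / N, Y.T @ Xc / N
S_B = S12.T @ np.linalg.solve(S11, S12)
S_W = S22 - S_B

within = B_lda.T @ S_W @ B_lda       # should be the identity (Eq. 1.105)
between = B_lda.T @ S_B @ B_lda      # should be diag(alpha^2 / (1 - alpha^2)) (Eq. 1.106)
print(np.round(within, 6))
print(np.round(between, 6), alpha2 / (1.0 - alpha2))
```

Only a least squares fit and a small $J \times J$ eigenproblem are needed, which is why any regression routine that returns fitted values can be substituted in step 2.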
1.5.13 FLEXIBLE DISCRIMINANT ANALYSIS BY OPTIMAL SCORING
If we apply a nonparametric regression $\hat{Y} = S(\hat{\lambda})Y$ in step 2 above, we bring the flexibility of nonparametric regression into the classification problem. Here the smoothing parameter $\hat{\lambda}$ controls how closely the regression $\hat{Y}$ fits $Y$, and is thus the control parameter. The nonparametric multivariate regression $\hat{Y} = S(\hat{\lambda})Y$ comes into play in two ways [5]:
• the regularization property by bias-variance control is obtained, and
• model selection (i.e., variable selection) and interactions between variables may be exploited in the multivariate regression.
There exist many powerful nonparametric multivariate regression methods, and more are expected to be developed. The most recently developed are (1) Projection Pursuit Regression (PPR) [66], (2) Alternating Conditional Expectations (ACE) [67], (3) Additivity and Variance Stabilization (AVAS) [68], (4) Additive Model (AM) [63], (5) Multivariate Adaptive Regression Splines (MARS) [62], (6) π-method [69], (7) Interaction spline method [70], (8) Hinging-hyperplanes [71], and (9) Neural networks. The FDA by OS method is similar to the algorithmic LDA by OS of Section 1.5.12. The steps to follow are:
1. Initialize: choose an initial score matrix $\Theta_0$ satisfying the constraint $\Theta^t_{K \times J}D_p\Theta_{J \times K} = I_K$ and get the scored data $\Theta^*_0 = Y\Theta_0$. $\Theta_0$ may be obtained from a contrast matrix.*
2. Multivariate nonparametric regression: fit a multi-response, adaptive nonparametric regression of $\Theta^*_0$ on $X$ by one of the nonparametric regression methods listed above,

$$\hat{\Theta}^*_0 = S(\hat{\lambda})\Theta^*_0 = \eta(x)$$

where $\eta(x)$ is the vector of fitted regression functions.
3. Optimal scores: obtain the eigenvector matrix $\Phi$ of $\Theta^{*t}_0\hat{\Theta}^*_0$ and hence the optimal scores $\Theta_{J \times K} = \Theta_0\Phi$.
4. Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^t\eta(x)$.
Step 3 is worth noting in both procedures, since it distinguishes how the optimal scores are obtained. In the first procedure, LDA via OS, the indicator matrix $Y$ is regressed onto $X$. In the second procedure, FDA via OS, the transformed score data $\Theta^*_0 = Y\Theta_0$ are regressed onto $X$ by any of the various nonparametric regression methods, and the optimal scores are then updated as $\Theta = \Theta_0\Phi$. For a $J$-class problem, it is known from discriminant analysis that the vector of canonical variates or functions $\eta(x)$ has at most $K = J - 1$ components. If $\bar{\eta}_j = \sum_{g_i = j}\eta(x_i)/n_j$ denotes the fitted centroid of the $j$th class in this space of canonical variates, the discrimination rule has the form of a (weighted) nearest centroid rule:

$$x \in j = \arg\min_k\left\|D\left(\eta(x) - \bar{\eta}_k\right)\right\|^2 \tag{1.110}$$

$D$ is the diagonal matrix of scale factors that convert optimally scaled fits to discriminant analysis variables.

* The contrast matrix consists of the K – 1 linear combinations of a factor variable with K levels. It is an encoding of the factor variable such that the linear combinations of the levels are linearly independent. Examples are the Helmert and polynomial contrasts, among others (see Reference 72, Ch. 2).
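As a hedged illustration of the nearest centroid rule of Equation 1.110, the sketch below replaces the nonparametric regression by a fixed quadratic basis expansion, a simple stand-in chosen here and not one of the adaptive methods listed above. Because this smoother is linear, the sketch regresses the indicator matrix directly instead of starting from an initial score matrix $\Theta_0$, which is a shortcut assumed for this example. The function names, the ring-shaped toy data, and the use of training accuracy are all assumptions of the sketch.

```python
import numpy as np

def quad_basis(X):
    """Hypothetical basis h(x) for 2-D inputs: linear, squared and cross terms."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x1, x1 * x2, x2 * x2])

def fda_fit(X, g, J, basis):
    """Optimal scoring on the expanded predictors, scaled to discriminant variates."""
    H = basis(X)
    mu = H.mean(axis=0)
    Hc = H - mu                                   # centred basis functions
    N = len(g)
    Y = np.eye(J)[g]
    B, *_ = np.linalg.lstsq(Hc, Y, rcond=None)    # regression of Y on h(x)
    w = 1.0 / np.sqrt(Y.mean(axis=0))             # Dp^(-1/2)
    M = Y.T @ (Hc @ B) / N
    evals, V = np.linalg.eigh(w[:, None] * ((M + M.T) / 2.0) * w[None, :])
    keep = np.argsort(evals)[::-1][: J - 1]
    a2 = evals[keep]
    # Columns are B_OS scaled by D of Eq. 1.107, so distances are already weighted.
    coef = (B @ (w[:, None] * V[:, keep])) / np.sqrt(a2 * (1.0 - a2))
    eta = Hc @ coef
    centroids = np.vstack([eta[g == j].mean(axis=0) for j in range(J)])
    return {"mu": mu, "coef": coef, "centroids": centroids, "basis": basis}

def fda_predict(model, X):
    """Nearest centroid rule in the space of scaled canonical variates (Eq. 1.110)."""
    eta = (model["basis"](X) - model["mu"]) @ model["coef"]
    d2 = ((eta[:, None, :] - model["centroids"][None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Hypothetical data: two concentric rings, which no linear boundary separates.
rng = np.random.default_rng(3)
n = 100
g = np.repeat([0, 1], n)
radius = np.where(g == 0, 1.0, 3.0) + 0.1 * rng.normal(size=2 * n)
angle = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
X = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

acc_quad = (fda_predict(fda_fit(X, g, 2, quad_basis), X) == g).mean()
acc_lin = (fda_predict(fda_fit(X, g, 2, lambda Z: Z), X) == g).mean()
print(acc_quad, acc_lin)
```

Passing the identity basis reduces the procedure to LDA via OS, so the comparison shows what the more flexible regression contributes on data of this kind.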
1.6 COMPARISON OF EXPERIMENTAL RESULTS
In general, any pattern recognition system consists of two basic subsystems: feature extraction and classifier design. In this study, however, we are mainly interested in classifiers. There are many different classifiers, from the simple and powerful nonparametric KNN rule to the recently popularized neural networks, as well as the newly developed multivariate regression methods. Eleven classifiers, all explained in Section 1.1, are tested on the same data set, obtained by Zernike moments, a global feature extraction method [73–75]. A new branch in the growing tree of classifiers has been developed in applied statistics [5,6] and is by now popular. It is based on the fact that Optimal Scoring (OS) is equivalent to Linear Discriminant Analysis (LDA) (Equation 1.107) and that OS can be obtained by various regression techniques that are well researched in statistics (Section 1.5.11). The multivariate regression methods were used for classification, and the results proved to be competitive with the classical statistical methods. Table 1.1 describes the classifiers in a simple format with their control parameters and their learning and operation processes. Details on the classifiers are given in Section 1.1. The core of the software for the classifiers used in the study was obtained from contributed software, written mostly by the originators or by active researchers in the area. The archive package “classif” is a collection contributed by B. Ripley and is maintained in the [email protected] which is accessible by anonymous ftp. It can be found under the “S” directory of the maintainer. This “classif” library also contains LDA, OLVQ1, KNN, and others that we did not experiment with. Hastie and Tibshirani contributed the programs recently developed by themselves and A. Buja.
The package “fda” contains Flexible Discriminant Analysis (FDA), which is a way of using Optimal Scoring by nonparametric regression for classification problems. The library “fda” comes with POLYREG, BRUTO, and MARS; MARS and BRUTO are recently developed multivariate regression methods. MARS can also be obtained from the directory “general” of the same maintainer, [email protected]. CART and PPREG can also be found in the “S” directory of the same maintainer. These are also available as the functions “tree ()” and “ppreg ()” in the commercial package Splus.* The NNET neural networks written by Ripley differ from the original ones [29] in that he uses a modified Newton optimization algorithm with the BFGS algorithm (the most popular Hessian matrix update algorithm). The description is depicted in Figure 1.8. The NNET has been very reliable in experiments and yields better convergence to better minima than any other software that has been tested for the feed-forward multilayer neural network study with backpropagation.

* The commercial version of “S” [76], which was developed at AT&T Bell Labs. Splus is an extended version of “S” from Statistical Sciences, Inc., Seattle, WA, USA.

TABLE 1.1 The List of the Classifiers Used

Classifiers | Control Parameters | Learning | Operation
LDA | | $F$, $\mu_i$ | $\arg\min_{i \in G}\|F(x - \mu_i)\|^2$
OLVQ1 | Codebook | find $\{m_i\}_1^L$ | find $d_i(x, m_i)$
KNN | $k = 1, 3, 5$ | | $\arg\min_i\{d_i(x, x_i)\}$
NNET | $h = 15$, $\lambda = 0.005$ | Minimize $\sum(\hat{y}_i - t_i)^2 + \lambda\sum W^2$ | $\arg\max_j\{P(j|x)\}$
CART | | find $B_m(x) = \prod_{k=1}^{L_m}H[s_{km}(x_{v(k,m)} - t_{km})]$ | $\arg\max_m\{B_m(x)\}_1^M$
LREG | deg = 1 | $P_X = X(X'X)^{-1}X'$ | $\hat{y} = P_Xy$
POLY | deg = 2 | $P_H = H(H'H)^{-1}H'$ | $\hat{y} = P_Hy$
PPREG | min = 9, max = 15 | Minimize $\sum_i\left(y_i - \sum_m\beta_m\psi_m(\alpha_m^tx_i)\right)^2$ | $\hat{y} = \sum_m\beta_m\phi_m(\alpha_m^tx)$
BRUTO | cost = 2.5 | Backfitting | $\hat{y} = \sum_if_i(x_i)$
MARS | cost = 2, deg = 1 | | $\hat{y} = \sum_if_i(x_i)$
TURBO | $h = 15$ | Nnet: Minimize $\sum(\hat{\theta}(j) - \theta(j))^2 + \sum W^2$ | $\arg\min_j\|\theta(j) - \hat{\theta}(j)\|^2$
1.7 SYSTEM PERFORMANCE ASSESSMENT
In practice we are given a data set and required to design a system for a certain objective. The system is a realization of a function of an unknown input space D. If we knew all the necessary characteristics of the input space D, it would be fairly easy to design an optimal system for the objective, such as the Bayesian classification rule given the class conditional distributions and the a priori probabilities of the classes. We usually do not know, however, the underlying generating function that produced the sample at hand. Instead, we would like to infer the underlying generating function, i.e., the population distribution, from the sample. This is the inference problem. Let us say that the input space is fully described by a certain distribution function $F(\cdot)$. The system we are interested in can be represented as a functional $\theta$ of the population distribution $F$: $\theta(D) = \theta(F)$. The functional $\theta$ is known, but the distribution $F$ is not. $\theta$ could be any statistic, or something as complicated as an error rate in a classification problem. The distribution is usually estimated parametrically or nonparametrically, and the estimate $\hat{F}$ is supplied as the input argument to the functional $\theta(\cdot)$ in order to estimate its value at the real population distribution $F$. Thus we have an estimate of $\theta(F)$:

$$\theta(F) \approx \theta(\hat{F}).$$

With this estimation strategy the next question is how accurate $\hat{\theta} = \theta(\hat{F})$ is as an estimator of $\theta$.
1.7.1 CLASSIFIER EVALUATION
Once we have designed a classifier, we would like to know how accurately the system does its job, i.e., to quantify the quality of the system performance. For both regression and classification system design, the usual performance measure is the prediction error. In the context of regression, prediction error refers to the expected squared difference between the response value and its prediction from the model:

$$PE = E(y - \hat{y})^2 \tag{1.111}$$

The expectation refers to repeated sampling from the true underlying population distribution. Prediction error also arises in classification problems, where the response falls into one of J unordered classes. There the prediction error is commonly defined as the probability of an incorrect classification,

$$PE = \mathrm{Prob}(\hat{y} \ne y) \tag{1.112}$$

which is called the misclassification rate. How to assess the system performance is an important issue if the designed system is to be quantified in terms of a criterion such as the error rate.
1.7.1.1 Hold-Out Method
If the data set at hand is large, we may divide it into two parts, using one for training and holding out the other for testing, hence the name hold-out method. This is a popular method of assessing a system's performance. In most cases, however, the data are limited in size, and the hold-out method is then ad hoc in the choice of the subset held out for testing: the performance evaluation depends on how the data are split.
1.7.1.2 K-Fold Cross-Validation
A natural compromise is the so-called K-fold cross-validation method. The given data are divided evenly into K parts. Each part in turn is used to test the system designed with the remaining parts of the data, and the average of the results is called the K-fold cross-validation estimate of the true error rate. The extreme case is the leave-one-out method, in which one observation, $(y_k, x_k)$, is left out and the remaining N – 1 cases, $\{(y_i, x_i)\}_{i \ne k}$, are used for training. The prediction error PE (Equation 1.111) from the leave-one-out method is the average of the N errors

$$PE_{cv} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}^{-i}(x_i)\right)^2 \tag{1.113}$$
where $\hat{f}^{-i}(x_i)$ is the estimate of the response $f(x_i)$ from the system trained on the data from which the $i$th observation has been removed. In general, with $w_i$ denoting the index group (fold) into which the index $i$ falls, the cross-validation prediction error in regression has the form

$$PE_{cv} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}^{(-w_i)}(x_i)\right)^2$$

and in the classification setting, where the bracket counts 1 for a misclassified observation and 0 otherwise:

$$PE_{cv} = \frac{1}{N}\sum_{i=1}^{N}\left[y_i \ne \hat{y}^{(-w_i)}(x_i)\right]. \tag{1.114}$$
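The cross-validation estimate of Equation 1.114 can be sketched with the Python standard library alone. The classifier used here, a nearest class mean rule on a scalar feature, is a hypothetical stand-in for any of the classifiers of this chapter, and the toy data and function names are assumptions of the sketch. Setting K equal to N gives the leave-one-out method.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(xs, ys, k, fit, predict, seed=0):
    """K-fold cross-validation estimate of the misclassification rate (Eq. 1.114)."""
    n = len(xs)
    wrong = 0
    for test in k_fold_indices(n, k, seed):
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        wrong += sum(predict(model, xs[i]) != ys[i] for i in test)
    return wrong / n

# Hypothetical stand-in classifier: nearest class mean on a scalar feature.
def fit_nearest_mean(xs, ys):
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}

def predict_nearest_mean(means, x):
    return min(means, key=lambda y: abs(x - means[y]))

# Toy data: two well-separated classes plus one deliberately mislabeled point.
xs = [0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)] + [0.05]
ys = [0] * 10 + [1] * 10 + [1]

err_5fold = cv_error(xs, ys, 5, fit_nearest_mean, predict_nearest_mean)
err_loo = cv_error(xs, ys, len(xs), fit_nearest_mean, predict_nearest_mean)
print(err_5fold, err_loo)
```

In this toy example only the mislabeled point is ever misclassified, so both estimates equal 1/21; on real data the two estimates generally differ, and the K-fold value depends on how the folds are drawn.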
Besides cross-validation, some modifications of the apparent error, the sum of squared residuals (SSR)

$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{f}(x_i)\right)^2$$

have also been used for estimation [see Reference 77, Ch. 17], such as SSR/(N – p), SSR/(N – 2p), and $C_p = SSR/N + 2p\hat{\sigma}^2/N$. Leaving these modifications of SSR aside (since they are beyond the scope of our interest), we would like to use the Bootstrap estimate of prediction error, which is also used for the performance analysis of our classification system.
1.7.2 BOOTSTRAPPING METHOD FOR ESTIMATION
Bootstrapping is a method for the nonparametric estimation of statistical errors, namely the bias and the standard error of an estimator. The nonparametric techniques known to date are the Bootstrap, the Jackknife, and cross-validation. Nonparametric methods for assessing the accuracy of an estimator share some desirable features: they require very little in the way of modeling, assumptions, or analysis, and they can be applied in an automatic way to any statistic, no matter how complicated [78]. To see how they work, a simple statistic, the sample mean $\bar{X}$, is used to assess the accuracy of the estimate of the true mean µ. We regard the available data set as a random sample of size N from an unknown distribution F, in the sense that it represents the population F relatively well. As shown in Figure 1.10, a random sample is drawn from an unknown probability distribution F,

$$X_1, X_2, \ldots, X_N \sim F \tag{1.115}$$

FIGURE 1.10 Illustration of the Bootstrap sampling.

With a sample from F, we compute the sample average $\bar{x} = \sum_{i=1}^{N}x_i/N$ as an estimate of the expectation of F, $E_F(X)$. For this special statistic (the sample average), we can get more information about the estimator $\bar{x}$. Its accuracy is represented by the standard deviation of $\bar{x}$:

$$\sigma(F; N, \bar{x}) = \left\{\mathrm{var}(\bar{X})\right\}^{1/2} = \left\{\mathrm{var}(X)/N\right\}^{1/2} = \left\{\frac{\mu_2(F)}{N}\right\}^{1/2} \tag{1.116}$$

which is estimated from the sample by

$$\hat{\sigma}(F; N, \bar{x}) = \left\{\frac{1}{N(N - 1)}\sum_{i=1}^{N}(x_i - \bar{x})^2\right\}^{1/2} \tag{1.117}$$
where $\mu_2(F)$ is the second central moment of the population with distribution F. This standard error formula, based on the raw sample of Equation 1.115, does not extend to other statistics, such as the median, a correlation, or a prediction error. This is where computer-based methods, such as resampling techniques for accuracy estimation, come into play.
1.7.2.1 Jackknife Estimation
Let $\bar{x}_{(i)}$,* defined as

$$\bar{x}_{(i)} = \frac{1}{N - 1}\sum_{j \ne i}x_j = \frac{N\bar{x} - x_i}{N - 1} \tag{1.118}$$

be the sample average of the N – 1 points that remain when $x_i$ is deleted, for i = 1,2,…,N. Then the jackknife estimate of standard error is

$$\hat{\sigma}_J(F; N, \bar{x}) = \left\{\frac{N - 1}{N}\sum_{i=1}^{N}\left(\bar{x}_{(i)} - \bar{x}_{(\cdot)}\right)^2\right\}^{1/2} \tag{1.119}$$

Here $\bar{x}_{(\cdot)} = \sum_{i=1}^{N}\bar{x}_{(i)}/N$ is the average of the N values $\bar{x}_{(i)}$. Equation 1.119 can be proved to be equal to the standard error of the sample average in Equation 1.117 by substituting Equation 1.118 into Equation 1.119. The jackknife standard error of any statistic θ takes the form of Equation 1.119, which gives the accuracy of the estimator $\hat{\theta}_J$. The advantage is that Equation 1.119 applies to any statistic: $\bar{x}_{(i)}$ is replaced by $\hat{\theta}_{(i)} = \hat{\theta}(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_N)$ and $\bar{x}_{(\cdot)}$ by $\hat{\theta}_{(\cdot)} = \frac{1}{N}\sum_{i=1}^{N}\hat{\theta}_{(i)}$.

* Note the change of notation for the deletion statistic from the usual superscript with a negative sign, e.g., $\hat{f}^{-i}(x_i)$ in Equation 1.113.
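The jackknife estimate of Equation 1.119 is short enough to write out directly. The sketch below, using only the Python standard library, is an illustration with a made-up sample and assumed function names. It checks that the jackknife standard error of the sample mean coincides with Equation 1.117 and then applies the same function to the median, for which no such closed form is available.

```python
import math
import statistics

def jackknife_se(xs, stat):
    """Jackknife standard error of stat (Eq. 1.119 with x-bar replaced by stat)."""
    n = len(xs)
    deleted = [stat(xs[:i] + xs[i + 1:]) for i in range(n)]   # theta_(i)
    center = sum(deleted) / n                                 # theta_(.)
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in deleted))

def se_of_mean(xs):
    """Usual standard error of the sample mean (Eq. 1.117)."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (n * (n - 1)))

# Hypothetical sample, chosen only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]

se_jack_mean = jackknife_se(sample, statistics.mean)
se_formula = se_of_mean(sample)
se_jack_median = jackknife_se(sample, statistics.median)
print(se_jack_mean, se_formula, se_jack_median)
```

For the median of an even-sized sample the deleted values take only two distinct levels, which is one reason the jackknife is known to behave poorly for non-smooth statistics such as the median.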
1.7.2.2 Bootstrap Method
The bootstrap generalizes Equation 1.117 in an apparently different way. Any statistic θ(F), being a functional, requires the distribution F. In practice, however, F is not known and is difficult to estimate. In the bootstrap setting, an empirical distribution $\hat{F}$ is defined from the given sample by assigning equal probability mass 1/N to each of the values $x_i$, and a sample is drawn from the empirical distribution $\hat{F}$:

$$X_1^*, X_2^*, \ldots, X_N^* \sim \hat{F}$$

Each $x_i^*$ is drawn independently, with replacement and with equal probability, from the set $\{x_1, x_2, \ldots, x_N\}$. Then the standard error of the sample mean $\bar{X}^* = \sum_{i=1}^{N}X_i^*/N$ is given by

$$\sigma(\hat{F}; N, \bar{x}^*) = \left\{\frac{1}{N}\mu_2(\hat{F})\right\}^{1/2} = \left\{\frac{1}{N}\cdot\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2\right\}^{1/2} = \left\{\frac{1}{N^2}\sum_{i=1}^{N}(x_i - \bar{x})^2\right\}^{1/2} \tag{1.120}$$

where $\mu_2(\cdot)$ is the second-order central moment of a given distribution. Comparing this standard error of the bootstrap sample average with Equation 1.117, we note that they are almost the same: the two differ only in the divisor, $N^2$ against $N(N - 1)$. Thus the jackknife (Equation 1.119) and the bootstrap (Equation 1.120) standard errors of the sample average are equal or nearly equal to Equation 1.117; the sample average, as an estimate of the mean, is a special statistic for which such an explicit form exists. Formulas like Equation 1.117 do not exist for most statistics. This is where the computing-intensive jackknife and bootstrap estimates are used. It turns out* that we can always numerically evaluate the bootstrap estimate of standard error, $\hat{\sigma} = \sigma(\hat{F})$, without a simple expression like Equation 1.117.
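The numerical evaluation mentioned above amounts to resampling with replacement and taking the standard deviation of the replicated statistic. The following standard-library sketch is an illustration only: the sample, the number of replications B = 2000, and the seed are arbitrary choices made here. It compares the Monte Carlo bootstrap standard error of the mean with the plug-in formula of Equation 1.120 and also evaluates the median.

```python
import math
import random
import statistics

def bootstrap_se(xs, stat, B=2000, seed=0):
    """Monte Carlo bootstrap standard error of stat from B resamples of xs."""
    rng = random.Random(seed)
    n = len(xs)
    # Each replicate draws n values with replacement from the empirical distribution.
    reps = [stat(rng.choices(xs, k=n)) for _ in range(B)]
    return statistics.stdev(reps)

def plug_in_se_of_mean(xs):
    """Bootstrap standard error of the mean in closed form (Eq. 1.120)."""
    n = len(xs)
    m = sum(xs) / n
    return math.sqrt(sum((x - m) ** 2 for x in xs) / n ** 2)

# Hypothetical sample, chosen only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 6.1, 3.9, 4.4]

se_boot_mean = bootstrap_se(sample, statistics.mean)
se_plug_in = plug_in_se_of_mean(sample)
se_boot_median = bootstrap_se(sample, statistics.median)
print(se_boot_mean, se_plug_in, se_boot_median)
```

The Monte Carlo value for the mean fluctuates around the closed form and approaches it as B grows; for the median only the resampling route is available.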
1.8 ANALYSIS OF PREDICTION RATES FROM BOOTSTRAPPING ASSESSMENT The boxplots in Figure 1.11 represent the E632 estimator superimposed by the distribution of the B = 100 bootstrap sample errors, θˆ* (b)’s. The median value of the B error rates is replaced by the E632 estimate; thus the B bootstrap errors are shifted according to the E632 estimate. For ease of display and understanding the ˆ s, are plotted. system performance, the recognition rates, i.e., 1 – θ′
* The proof can be found in Reference 77.
© 2000 by CRC Press LLC
The generality issue of the designed system is related to its reliability in terms of standard error of the estimator for the prediction error. To make the analysis simpler, we assume the symmetry of the system performance of the classifiers in the boxplot figures. Then the standard error of the prediction rule by the mean (or median simply from the boxplots) is relatively approximated by the inter quartile range of the boxplots. The mean value is the bootstrap sample estimate θˆ of the true statistic θ = 1 – PE. The standard deviation implicity represents the reliability of the estimate, i.e., standard error of the estimate. From the result of the classifiers considered in this study, (Figure 1.11), the 95% confidence interval of the estimate θˆ = 0.955 is given by
$$\hat{\theta} - 1.645\,\hat{\sigma} \le \theta \le \hat{\theta} + 1.645\,\hat{\sigma}$$

$$0.941 \le \theta \le 0.96$$

where the multiplying factor 1.645 is the 95th percentile point of the standard normal variate, N(0,1).

The graphical display seems to reveal more for the comparison study of the classifiers and the different treatments of the data. The boxplot display of a batch is a very simple and useful way to show the distribution of the sample. The Inter Quartile Range (IQR), which is the difference between the upper quartile and the lower quartile, is considered a robust estimate of a scalar multiple of the dispersion. The height of the box is the IQR. The median of the batch is represented by the line in the box. The whiskers, drawn as dotted lines, extend to the most extreme points that lie within 1.5 times the IQR of the box. Outliers are represented by individual dots to signify their existence. The boxplot thus displays the distribution very simply but well enough, especially when many different batches are to be compared.

The correct recognition rates from 11 classifiers are displayed with boxplots for each data set obtained from the different treatments. Each boxplot shows the distribution of the recognition rate of the 100 systems designed from B = 100 bootstrap samples. The corresponding boxplots are shown in Figure 1.11. The results from the LDA and LREG (via linear regression) would have been the same, due to the equivalence of the LDA and OS (Equation 1.107 and Equation 1.110), if the same bootstrap samples had been used for both classifiers; the bootstrap samples used to train the two classifiers differed for no particular reason.* The optimization machine with the feed-forward neural network structures shows the best performance (Figure 1.11), as seen from the mean values of the estimated correct recognition rate. Note that we do not consider the KNN classifier as a learning mechanism, so it is not of concern here. It does not learn but operates directly on the stored exemplars; i.e., its computation in the operation phase is the largest of all the classifiers, which makes it inappropriate for real-time processing applications.
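The interval and the boxplot quantities described above can be computed from a list of bootstrap recognition rates. The following is a minimal sketch, not the procedure used in the study: the function name `bootstrap_summary` is hypothetical, the sample standard deviation of the bootstrap rates is assumed to play the role of σˆ, and the quartile convention is whatever `statistics.quantiles` uses by default.

```python
import statistics

def bootstrap_summary(rates, z=1.645):
    """Summarize recognition rates, one value per bootstrap-trained system.

    z is the standard normal factor quoted in the text; the sample standard
    deviation is used as the standard error estimate (an assumption).
    """
    theta_hat = statistics.mean(rates)
    sigma_hat = statistics.stdev(rates)
    # Three cut points: lower quartile, median, upper quartile.
    q1, median, q3 = statistics.quantiles(rates, n=4)
    return {
        "theta_hat": theta_hat,
        "sigma_hat": sigma_hat,
        "interval": (theta_hat - z * sigma_hat, theta_hat + z * sigma_hat),
        "median": median,
        "iqr": q3 - q1,  # height of the box in the boxplot
    }
```

Calling `bootstrap_summary` on the B = 100 recognition rates of one classifier would give the quantities that one box in Figure 1.11 displays, together with the normal-approximation interval.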
* If the different classifiers were trained with the same B bootstrap samples, then the classification by the linear regression method and the LDA would have been the same.
FIGURE 1.11 Boxplots for different classifiers for data set R. 100 bootstrap samples are used to assess each classifier.
2 Artificial Neural Networks: Definitions, Methods, Applications
Daniel A. Zahner and Evangelia Micheli-Tzanakou
2.1 INTRODUCTION
The potential of achieving a great deal of processing power by wiring together a large number of very simple and somewhat primitive devices has captured the imagination of scientists and engineers for many years. In recent years, the possibility of implementing such systems by means of electro-optical devices and in very large scale integration has resulted in increased research activity. Artificial neural networks (ANNs), or simply neural networks (NNs), are made of interconnected devices called neurons (also called neurodes, nodes, neural units, or simply units). Loosely inspired by the makeup of the nervous system, these interconnected devices look at patterns of data and learn to classify them. NNs have been used in a wide variety of signal processing and pattern recognition applications and have been successfully applied in such diverse fields as speech processing,1–4 handwritten character recognition,5–7 time series prediction,8–9 data compression,10 feature extraction,11 and pattern recognition in general.12 Their attractiveness lies in the relative simplicity with which the networks can be designed for a specific problem, along with their ability to perform nonlinear data processing.

As the neuron is the building block of a brain, a neural unit is the building block of a neural network. Although the two are far from being the same or from performing the same functions, they still possess similarities that are remarkably important. NNs consist of a large number of interconnected units that give them the ability to process information in a highly parallel way. The brain, as well, is a massively parallel machine, as has long been recognized. Just as each of the $10^{11}$ neurons of the human brain integrates incoming information from all other neurons directly or indirectly connected to it, an artificial neuron sums all inputs to it and creates an output that carries information to other neurons. The connection from one neuron's dendrites or cell body to another neuron's processes is called a synapse. The strength by which two neurons influence each other is called a synaptic weight. In a NN all neurons are connected to all other neurons by synaptic weights that can have seemingly arbitrary values, but in reality, these weights show the effect of a stimulus on the neural network and the ability or lack of it to recognize that stimulus.
In the biological brain, two types of processes exist, static and dynamic. Static brain conditions are those that do not involve any memory processing, while dynamic processes involve memory processing and changes through time. Similarly, NNs can be distinguished as static or dynamic, the former being those that do not involve any previous memory and only depend on current inputs, the latter having memory and being described by differential equations that express changes in the dynamics of the system through time. All NNs have certain architectures, and all consist of several layers of neuronal arrangements. The most widely used architecture is that of the perceptron first described in 1958 by Rosenblatt.13 In the sections that follow we will build on this architecture but not necessarily on the original assumptions of Rosenblatt, the validity of which has been disputed by others.27 Since there are many names in the literature that express the same thing and usually create a lot of confusion for the reader, we will define the terms to be used and use them throughout the chapter. Terminology is a big concern for those involved in the field and for organizations such as IEEE. A standards committee has been formed to address issues such as nomenclature and paradigms. In this book, whenever possible, we will try to conform to the terms and definitions already in existence. Some methods for training and testing of NNs will be described in detail, although many others will be left out due to lack of space, but references will be provided for the interested reader. A small number of applications will be given as examples, since many more are discussed in other chapters of this book, and it will be redundant to repeat them here.
2.2 DEFINITIONS
Neural Nets (NNs) go by many other names, such as connectionist models, neuromorphic systems, and parallel distributed systems, as well as artificial NNs, which distinguishes them from the biological ones. They contain many densely interconnected elements called neurons or nodes, which are nothing more than computational elements that are nonlinear in nature. A single node acts like an integrator of its weighted inputs. Once the result is found, it is passed to other nodes via connections that are called synapses. Each node is characterized by a parameter that is called the threshold or offset and by the kind of nonlinearity through which the sum of all the inputs is passed. Typical nonlinearities are the hardlimiter, the ramp (threshold logic element), and the widely used sigmoid.

The simplest NN is the single layer perceptron,13,14 which is a simple net that can decide whether an input belongs to one of two possible classes. Figure 2.1 is a schematic representation of a simple one-neuron perceptron, the output of which is passed through a nonlinearity called an activation function. This activation function can be of different types, the most popular being a sigmoidal logistic function. Figure 2.2 is a schematic representation of some activation functions, such as the hardlimiter (or step), the threshold logic (or ramp), a linear function, and a sigmoid. The neuron of Figure 2.1 receives many inputs, Ii, each weighted by a weight Wi (i = 1, 2, ..., N). These inputs are then summed. The sum is then passed through the activation function, f, and an output, y, is calculated only if a certain threshold is exceeded.
FIGURE 2.1 Artificial neuron.
FIGURE 2.2 Typical activation functions.
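The single neuron of Figure 2.1 and the activation functions of Figure 2.2 can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive model: the function names are hypothetical, the threshold is assumed to enter as an offset subtracted from the weighted sum, and the output ranges chosen for the hardlimiter (0 or 1) and the ramp (clipped to [0, 1]) are assumptions, since the text does not fix them.

```python
import math

def hard_limiter(x):
    """Step nonlinearity: 1 when the net input is non-negative, else 0 (assumed range)."""
    return 1.0 if x >= 0 else 0.0

def ramp(x):
    """Threshold logic element: linear in the middle, clipped to [0, 1] (assumed range)."""
    return min(1.0, max(0.0, x))

def linear(x):
    """Identity activation."""
    return x

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def neuron_output(inputs, weights, threshold=0.0, activation=sigmoid):
    """Weighted sum of the inputs, offset by the threshold, passed through f."""
    net = sum(w * i for w, i in zip(weights, inputs)) - threshold
    return activation(net)
```

For example, `neuron_output([1, 0], [0.5, 0.5], threshold=0.5, activation=hard_limiter)` has a net input of exactly zero, so under the convention assumed here the unit fires.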
Complex artificial neurons may include temporal dependencies and more complex mathematical operations than summation.15 While each node has a simple function, their combined behavior becomes remarkably complex when organized in a highly parallel manner. NNs are specified by their processing element characteristics, the network topology, and the training or learning rules they follow in order to adapt the weights, Wi. Network topology falls into two broad classes, feed-forward (nonrecursive) and feedback (recursive) NNs.16 Nonrecursive NNs offer the advantage of simplicity of implementation and analysis. For static mappings a nonrecursive network is all one needs to specify any static condition. Adding feedback expands the network’s range
of behavior since now its output depends upon both the current input and network states. But one has to pay a price, longer times for teaching the NN. Obviously the scheme of Figure 2.1 is quite simple and inadequate in solving problems. A multilayer perceptron (MLP) is the next choice. A number of inputs are now connected to a number of nodes at a second layer called the hidden layer. The outputs of the second layer may connect to a third layer and so on, until they connect to the output layer. In this representation, every input is connected to every node in the next layer and the outputs of one hidden layer are connected to the nodes of the next hidden layer and so on. More details on multilayer perceptrons can be found in Chapter 12. Artificial neural networks usually operate in one of two modes. Initially there exists a training phase where the interconnection strengths are adjusted until the network has a desired output. Only after training does the network become operational, i.e., capable of performing the task it was designed and trained to do. The training phase can be either supervised or unsupervised. In supervised learning, there exists information about the correct or desired output for each input training pattern presented.20 The original perceptron and backpropagation are examples of supervised learning. In this type of learning the NN is trained on a training set consisting of vector pairs. One of these vectors is used as input to the network; the other is used as the desired or target output. During training the weights of the NN are adjusted in such a way as to minimize the error between the target and the computed output of the network. This process might take a large number of iterations to converge, especially because some training algorithms (such as backpropagation) might converge to local minima instead of the global one. If the training process is successful, the network is capable of performing the desired mapping. 
In unsupervised learning, no a priori information exists, and training is based only on the properties of the patterns. Sometimes this is also called self-organization.20 Training depends on statistical regularities that the network extracts from the training set and represents as weight values. Applications of unsupervised learning have been limited. However, hybrid systems of unsupervised learning combined with other techniques produce useful results.21–23 Unsupervised learning is highly dependent on the training data, and information about the proper classification is often lacking.21 For this reason, most neural network training is supervised.
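The fully connected multilayer arrangement described in this section, in which every output of one layer feeds every node of the next, can be sketched as a forward pass. This is a minimal illustration assuming sigmoid nodes without thresholds; `layer_forward` and `mlp_forward` are hypothetical names, and the storage convention `weights[j][i]` (from input i to node j) is chosen only for convenience.

```python
import math

def sigmoid(x):
    """Logistic activation; adequate for the moderate net inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    """One fully connected layer.

    weights[j][i] is the weight from input i to node j of the next layer.
    """
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def mlp_forward(inputs, layers):
    """Pass the inputs through each layer in turn (hidden layers, then output)."""
    activations = list(inputs)
    for weights in layers:
        activations = layer_forward(activations, weights)
    return activations
```

With two inputs, three hidden nodes, and one output node, `layers` would hold a 3-by-2 weight list followed by a 1-by-3 weight list.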
2.3 TRAINING ALGORITHMS After McCulloch and Pitts24 demonstrated, in 1943, the computational power of neuron-like networks, much effort was given to developing networks that could learn. In 1949, Donald Hebb proposed the strengthening of connections between presynaptic and post-synaptic units when both were active simultaneously.25 This idea of modifying the connection weights as a method of learning is present in most learning models used today. The next major advancement in neural networks was by Frank Rosenblatt.13,14 In 1960, Widrow and Hoff proposed a model, called the Adaptive Linear Element (ADALINE), which learns by modifying variable connection strengths, minimizing the square of the error in successive iterations.26 This
error correction scheme is now known as the Least Mean Square (LMS) algorithm, and it has found widespread use in digital signal processing. There was great interest in neural network computation until Minsky and Papert published a book in 1969 criticizing the perceptron. This book contained a mathematical analysis of perceptron-like networks, pointing out many of their limitations. It was shown that the single layer perceptron was incapable of performing the XOR mapping, so the single layer perceptron was severely limited in its capabilities. For linear activation functions, multilayer networks were no different from single layer models. Minsky and Papert pointed out that multilayer networks with nonlinear activation functions could perform complex mappings. However, the lack of any training algorithm for multiple layer networks made their use impossible. It was not until the discovery of multilayer learning algorithms that interest in neural networks resurfaced. The most widely used training algorithm is the backpropagation algorithm, as already mentioned in the introduction. Another algorithm used for multilayer perceptron training is the ALOPEX algorithm. ALOPEX was originally used for visual receptive field mapping by Tzanakou and Harth in 1973,28–30 and has since been applied to a wide variety of optimization problems. These two algorithms are explained in detail below.
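Before turning to those two algorithms, the LMS error-correction idea mentioned above can be sketched for a single linear unit. This is an illustration, not Widrow and Hoff's original formulation: the function name `lms_train`, the step size, the number of passes, and the trailing bias weight are all assumptions made for the example.

```python
def lms_train(samples, eta=0.1, epochs=50):
    """Train one linear unit with the LMS (delta) rule.

    samples is a list of (inputs, target) pairs. The returned weight list
    has one entry per input plus a trailing bias weight (an assumption).
    """
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for inputs, target in samples:
            x = list(inputs) + [1.0]  # constant input for the bias weight
            y = sum(w * xi for w, xi in zip(weights, x))  # linear output
            err = target - y
            # Move each weight along the negative gradient of the squared error.
            weights = [w + eta * err * xi for w, xi in zip(weights, x)]
    return weights
```

On data generated by a linear rule, the weights approach that rule. On the XOR mapping no set of weights for a single linear unit can reproduce the targets, which is the limitation discussed above.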
2.3.1 BACKPROPAGATION ALGORITHM
The backpropagation algorithm is a learning scheme in which the error is backpropagated layer by layer and used to update the weights. The algorithm is a gradient descent method that minimizes the error between the desired outputs and the actual outputs calculated by the MLP. Let

$$E_p = \frac{1}{2} \sum_{i=1}^{N} (T_i - Y_i)^2 \tag{2.1}$$

be the error associated with template p. N is the number of output neurons in the MLP, $T_i$ is the target or desired output for neuron i, and $Y_i$ is the output of neuron i calculated by the MLP. Let $E = \sum_p E_p$ be the total measure of error. The gradient descent method updates an arbitrary weight, w, in the network by the following rule:

$$w(n+1) = w(n) + \Delta w(n) \tag{2.2}$$

where

$$\Delta w(n) = -\eta \frac{\partial E}{\partial w(n)} \tag{2.3}$$

where n denotes the iteration number and η is a scaling constant. Thus, the gradient descent method requires the calculation of the derivative $\partial E / \partial w(n)$ for each weight, w, in the network. For an arbitrary hidden layer neuron, its output, $H_j$, is a nonlinear function f of the weighted sum of all its inputs ($\mathrm{net}_j$).
$$H_j = f(\mathrm{net}_j) \tag{2.4}$$

where f is the activation function. The most commonly used activation function is the sigmoid function given by

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2.5}$$
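Using the chain rule, we can write

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot \frac{\partial \mathrm{net}_j}{\partial w_{ij}} \tag{2.6}$$

and since

$$\mathrm{net}_j = \sum_{i=1}^{n} w_{ij} I_i \tag{2.7}$$

we have

$$\frac{\partial \mathrm{net}_j}{\partial w_{ij}} = I_i \tag{2.8}$$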
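Thus Equation 2.6 becomes

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial \mathrm{net}_j} \cdot I_i \tag{2.9}$$

and, applying the chain rule over the m neurons k of the next layer that receive $H_j$,

$$\frac{\partial E}{\partial \mathrm{net}_j} = \sum_{k=1}^{m} \frac{\partial E}{\partial \mathrm{net}_k} \cdot \frac{\partial \mathrm{net}_k}{\partial H_j} \cdot \frac{\partial H_j}{\partial \mathrm{net}_j} \tag{2.10}$$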
recalling that

$$\mathrm{net}_k = \sum_{j=1}^{n} w_{jk} H_j \tag{2.11}$$

it follows that

$$\frac{\partial \mathrm{net}_k}{\partial H_j} = w_{jk} \tag{2.12}$$

also

$$\frac{\partial H_j}{\partial \mathrm{net}_j} = f'(\mathrm{net}_j) \tag{2.13}$$
Therefore,

$$\frac{\partial E}{\partial \mathrm{net}_j} = f'(\mathrm{net}_j) \cdot \sum_{k=1}^{m} \frac{\partial E}{\partial \mathrm{net}_k} \cdot w_{jk} \tag{2.14}$$
Assuming f to be the sigmoid function of Equation 2.5, then

$$f'(\mathrm{net}_j) = Y_j (1 - Y_j) \tag{2.15}$$

where $Y_j = f(\mathrm{net}_j)$ denotes the output of neuron j.
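Equation 2.14 gives the unique relation that allows the backpropagation of the error to all hidden layers. For the output layer

$$\frac{\partial E}{\partial \mathrm{net}_j} = \frac{\partial E}{\partial H_j} \cdot f'(\mathrm{net}_j) \tag{2.16}$$

$$\frac{\partial E}{\partial H_j} = -(T_j - Y_j) \tag{2.17}$$

where, for an output neuron, $H_j = Y_j$.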
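In summary, then, first the output $Y_j$ for all the neurons in the network is calculated. The error derivative needed for the gradient descent update rule of Equation 2.2 is calculated from

$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial \mathrm{net}} \cdot \frac{\partial \mathrm{net}}{\partial w} \tag{2.18}$$

If j is an output neuron, then

$$\frac{\partial E}{\partial \mathrm{net}_j} = -(T_j - Y_j) \cdot Y_j (1 - Y_j) \tag{2.19}$$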
If j is a hidden neuron, then the error derivative is backpropagated by using Equations 2.14 and 2.15. Substituting, we get

$$\frac{\partial E}{\partial \mathrm{net}_j} = Y_j (1 - Y_j) \cdot \sum_{k=1}^{m} \frac{\partial E}{\partial \mathrm{net}_k} \cdot w_{jk} \tag{2.20}$$
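The gradient computations of Equations 2.18 to 2.20, together with the update of Equation 2.2, can be sketched for a network with one hidden layer. This is a minimal illustration under stated assumptions, not a reference implementation: all function names are hypothetical, thresholds are omitted, a single training pair is used, and the storage convention `w[i][j]` (from unit i to unit j) mirrors the subscripts of the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    """w_hidden[i][j]: input i to hidden j; w_out[j][k]: hidden j to output k."""
    hidden = [sigmoid(sum(x * w_hidden[i][j] for i, x in enumerate(inputs)))
              for j in range(len(w_hidden[0]))]
    outputs = [sigmoid(sum(h * w_out[j][k] for j, h in enumerate(hidden)))
               for k in range(len(w_out[0]))]
    return hidden, outputs

def error(inputs, targets, w_hidden, w_out):
    """E_p of Equation 2.1 for one training pair."""
    _, outputs = forward(inputs, w_hidden, w_out)
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

def gradients(inputs, targets, w_hidden, w_out):
    """Return dE/dw for both weight layers."""
    hidden, outputs = forward(inputs, w_hidden, w_out)
    # Equation 2.19: dE/dnet for each output neuron.
    delta_out = [-(t - y) * y * (1.0 - y) for t, y in zip(targets, outputs)]
    # Equation 2.20: dE/dnet for each hidden neuron, backpropagated.
    delta_hidden = [h * (1.0 - h) * sum(d * w_out[j][k] for k, d in enumerate(delta_out))
                    for j, h in enumerate(hidden)]
    # Equations 2.9 and 2.18: dE/dw = dE/dnet times the signal entering the weight.
    grad_hidden = [[x * d for d in delta_hidden] for x in inputs]
    grad_out = [[h * d for d in delta_out] for h in hidden]
    return grad_hidden, grad_out

def train_step(inputs, targets, w_hidden, w_out, eta=0.25):
    """One gradient descent update, Equations 2.2 and 2.3."""
    grad_hidden, grad_out = gradients(inputs, targets, w_hidden, w_out)
    new_hidden = [[w - eta * g for w, g in zip(w_row, g_row)]
                  for w_row, g_row in zip(w_hidden, grad_hidden)]
    new_out = [[w - eta * g for w, g in zip(w_row, g_row)]
               for w_row, g_row in zip(w_out, grad_out)]
    return new_hidden, new_out
```

A useful check on any such implementation is to compare one entry of the analytic gradient with a finite-difference estimate obtained by perturbing the corresponding weight and re-evaluating `error`.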
Finally the weights are updated, as in Equation 2.2. There are many modifications to the basic algorithm that have been proposed to speed the convergence of the system. Convergence is defined as a reduction in the overall error below a minimum threshold. It is the point at which the network is said to be fully trained. One method31 used is the inclusion of a momentum term in the update equation such that

$$w(n+1) = w(n) - \eta \frac{\partial E}{\partial w(n)} + \alpha\, \Delta w(n-1) \tag{2.21}$$
η is the learning rate and is taken to be 0.25. α is a constant momentum term which determines the effect of past weight changes on the direction of current weight movements. Another approach used to speed the convergence of backpropagation is the introduction of random noise.32 It has been shown that while inaccuracies resulting from digital quantization are detrimental to the algorithm's convergence, analog perturbations actually help improve convergence time. One of these variations is the modification by Fahlman,33 called the quickprop, that uses second derivative information without calculating the Hessian needed in the straight backpropagation algorithm. It requires saving a copy of the previous gradient vector, as well as the previous weight change. Computation of the weight changes uses only information associated with the weight being updated:
$$\Delta w_{ij}(n) = \frac{\nabla w_{ij}(n)}{\nabla w_{ij}(n-1) - \nabla w_{ij}(n)} \, \Delta w_{ij}(n-1) \tag{2.22}$$
where ∇wij(n) is the gradient vector component associated with the weight wij at iteration n, and ∆wij(n) is the corresponding weight change. This algorithm assumes that the error surface is parabolic, concave upward around the minimum, and that the change in slope seen by each weight is independent of the changes in all other weights. There are obviously problems with these assumptions, but Fahlman suggests a “maximum growth factor” µ in order to limit the rate of increase of the step size, namely that if ∆wij(n) > µ ∆wij(n – 1) then ∆wij(n) = µ ∆wij(n – 1). Fahlman also applied a hyperbolic arctangent function to the output error associated with each neuron in the output layer. This function is almost linear for small errors, but it blows up for large positive or large negative errors. Quickprop is an attempt to reduce the number of iterations needed by straight backpropagation, and it succeeded in doing so by a factor of 5, but this factor is problem dependent. This method also required several trials before the parameters were set to acceptable values. Backpropagation has achieved widespread use as a training algorithm for neural networks. Its ability to train multilayer networks has led to a resurgence of interest in the field. Backpropagation has been used successfully in applications such as adaptive control of dynamical systems and in many general neural network applications. Dynamical systems require keeping track of time in ways that retain information about the past. In fact, the biological brain performs in an admirable way just because it has
access to and uses values of different variables from previous instances. Backpropagation through time is another extension of the original algorithm, proposed by Werbos in 1990,34 and has been previously applied in the “Truck Backer-Upper” by Nguyen and Widrow.35 In this problem a sequence of decisions must be made without an immediate indication of how effective these steps are. No indication of performance exists until the truck hits the wall. Backpropagation through time solves the problem, but it has its own inadequacies and performance difficulties. Despite its tremendous effect on neural networks, the backpropagation algorithm is not without its problems. Some of these problems have been discussed above. In addition, the complexity of the algorithm makes hardware implementations of it very difficult.
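The two speed-up variants discussed in this section, the momentum term and the quickprop step with its maximum growth factor, can be sketched for a single weight. This is an illustration under stated assumptions, not Fahlman's full procedure: the function names are hypothetical, the default µ = 1.75 is an assumed value, the growth limit is applied to the magnitude of the step, and the fallback used when the two gradients coincide is a choice made for this example.

```python
import math

def momentum_step(w, grad, prev_dw, alpha, eta=0.25):
    """Gradient step plus a fraction alpha of the previous weight change.

    Returns the new weight and the weight change that was applied.
    """
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

def quickprop_delta(grad, prev_grad, prev_dw, mu=1.75):
    """Weight change from the current and previous gradient of one weight.

    Fits a parabola through the two slopes and jumps toward its minimum,
    limiting the step to mu times the previous step in magnitude.
    """
    limit = mu * abs(prev_dw)
    denom = prev_grad - grad
    if denom == 0.0:
        # Slope unchanged: take the largest allowed step in the previous
        # direction (an assumption made for this sketch).
        return math.copysign(limit, prev_dw)
    dw = grad / denom * prev_dw
    if abs(dw) > limit:
        dw = math.copysign(limit, dw)
    return dw
```

On the one-dimensional parabola E = (w − 3)², moving from w = 0 to w = 1 gives slopes −6 and −4. The unconstrained quickprop step is then 2, which lands exactly on the minimum, while with µ = 1.75 the step is cut back to 1.75.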
2.3.2 THE ALOPEX ALGORITHM
The ALOPEX process is an optimization procedure that has been demonstrated successfully in a wide variety of applications. Originally developed for receptive field mapping in the visual pathway of frogs, ALOPEX's usefulness and its flexible form have increased the scope of its applications to a wide range of optimization problems. Since its development by Tzanakou and Harth in 1973,28 ALOPEX has been applied to real-time noise reduction,36 pattern recognition,37 adaptive control systems,38 and multilayer neural network training to name a few. Optimization procedures, in general, attempt to maximize or minimize a function F( ). The function F( ) is called the cost function, and its value depends on many parameters or variables. When the number of parameters is large, finding the set (x1, x2,… xN) that corresponds to the optimal (maximal or minimal) solution is exceedingly difficult. If N were small, then one could perform an exhaustive search of the entire parameter space, in order to find the “best” solution. As N increases, intelligent algorithms are needed to quickly locate the solution. Only an exhaustive search can guarantee that a global optimum is found; however, near-optimal solutions are acceptable because of the tremendous speed improvement over exhaustive search methods. Backpropagation, described earlier, being a gradient descent method often gets stuck in local extrema of the cost function. The local stopping points often represent unsatisfactory convergence points. Techniques have been developed to avoid the problem of local extrema, with simulated annealing39 being the most common. Simulated annealing incorporates random noise, which acts to dislodge the process from local extremes. Crucial to the convergence of the process is that the random noise be reduced as the system approaches the global optimum. If the noise is too large, the system will never converge and can be dislodged mistakenly from the global solution. 
ALOPEX is another process that incorporates a stochastic element to avoid local extrema in search of the global optimum of the cost function. The cost function or response is problem-dependent and is generally a function of a large number of parameters. ALOPEX iteratively updates all parameters simultaneously based on the cross-correlation of local changes, ∆Xi, and the global response change, ∆R, plus an additive noise. The cross-correlation term ∆Xi∆R helps the process move in a direction that improves the response. Table 2.1 shows how this can be used to find a global maximum of R.
TABLE 2.1
X      ∆X     R      ∆R     ∆X∆R
X↑     +      R↑     +      +
X↑     +      R↓     –      –
X↓     –      R↑     +      –
X↓     –      R↓     –      +
All parameters Xi are changed simultaneously at each iteration according to

$$X_i(n) = X_i(n-1) + \gamma\,\Delta X_i(n)\,\Delta R(n) + r_i(n) \tag{2.23}$$
The basic concept is that this cross-correlation provides a direction of movement for the next iteration. For example, take the case where Xi↓ and R↑. This means that the parameter Xi decreased in the previous iteration, and the response increased for that iteration. The product ∆Xi∆R is a negative number, and thus Xi would be decreased again in the next iteration. This makes perfect sense, since a decrease in Xi produced a higher response; if one is looking for the global maximum, then Xi should be decreased again. Once Xi is decreased and R also decreases, ∆Xi∆R becomes positive and Xi increases. These movements are only tendencies, since the process includes a random component that acts to move the parameters unpredictably. This stochastic element helps the algorithm avoid local extrema of the response at the expense of a slightly longer convergence or learning period.
The general ALOPEX updating Equation 2.23 is explained as follows. Xi(n) are the parameters to be updated, n is the iteration number, and R( ) is the cost function, of which the “best” solution in terms of Xi is sought. Gamma, γ, is a scaling constant, ri(n) is a random number from a Gaussian distribution whose mean and standard deviation are varied, and ∆Xi(n) and ∆R(n) are found by:

$$\Delta X_i(n) = X_i(n-1) - X_i(n-2) \tag{2.24}$$

$$\Delta R(n) = R(n-1) - R(n-2) \tag{2.25}$$
The calculation of R( ) is problem-dependent and can be easily modified to fit many applications. A detailed description of the response calculation can be found in other chapters. This flexibility was demonstrated in the early studies of Harth and Tzanakou.29 In mapping receptive fields, no a priori knowledge or assumptions were made about the calculation of the cost function; instead, a “response” was measured. By using action potentials as a measure of the response,28,29,40,41 receptive fields could be determined by using the ALOPEX process to iteratively modify the stimulus pattern until it produced the largest response.
It should be stated that, due to the stochastic nature of ALOPEX, efficient convergence depends on the proper control of both the additive noise and the gain factor γ. Initially all parameters Xi are random, and the additive noise has a Gaussian distribution with mean 0 and standard deviation σ, which is initially large. The standard deviation σ decreases as the process converges to ensure a stable stopping point. Conversely, γ increases with iterations. As the process converges, ∆R becomes smaller and smaller, and an increase in γ is needed to compensate for this. An additional constraint is a maximal change permitted for Xi in one iteration. This bounded step size prevents the algorithm from making drastic changes from one iteration to the next. Such drastic changes often lead to long periods of oscillation, during which the algorithm fails to converge.
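As an illustration only, the update rule of Equations 2.23 to 2.25, together with the noise and gain schedules just described, can be sketched in a few lines of Python. The function name alopex_maximize, the default values and geometric schedules for γ and σ, the clipping bound max_step, and the bookkeeping that remembers the best point visited are assumptions made for this sketch; they are not taken from the original studies.

```python
import random

def alopex_maximize(response, x0, n_iter=2000, gamma=0.5, gamma_growth=1.001,
                    sigma=0.1, sigma_decay=0.998, max_step=0.1, seed=0):
    """Search for a maximum of `response` with the ALOPEX update (Eq. 2.23)."""
    rng = random.Random(seed)
    x_old = list(x0)                                        # X(n-2)
    # A first random move, so that the initial differences are not all zero.
    x_cur = [x + rng.gauss(0.0, sigma) for x in x_old]      # X(n-1)
    r_old, r_cur = response(x_old), response(x_cur)
    # Remember the best point seen so far (bookkeeping added for this sketch).
    best_x, best_r = (x_cur, r_cur) if r_cur >= r_old else (x_old, r_old)
    for _ in range(n_iter):
        d_r = r_cur - r_old                                 # Eq. 2.25
        x_new = []
        for xc, xo in zip(x_cur, x_old):
            d_x = xc - xo                                   # Eq. 2.24
            step = gamma * d_x * d_r + rng.gauss(0.0, sigma)
            step = max(-max_step, min(max_step, step))      # bounded step size
            x_new.append(xc + step)                         # Eq. 2.23
        r_new = response(x_new)
        if r_new > best_r:
            best_x, best_r = list(x_new), r_new
        x_old, r_old, x_cur, r_cur = x_cur, r_cur, x_new, r_new
        sigma *= sigma_decay                                # noise shrinks with iterations
        gamma *= gamma_growth                               # gain grows with iterations
    return best_x, best_r

# Hypothetical usage: a smooth response with a single maximum at (1, -2, 0.5).
target = [1.0, -2.0, 0.5]
best_x, best_r = alopex_maximize(
    lambda x: -sum((a - b) ** 2 for a, b in zip(x, target)), [0.0, 0.0, 0.0])
```

Because every parameter is driven by the same scalar ∆R, the inner loop needs no gradient information; only the response itself has to be measurable.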
2.3.3 MULTILAYER PERCEPTRON (MLP) NETWORK TRAINING WITH ALOPEX
An MLP can also be trained for pattern recognition using ALOPEX. A response is calculated for the jth input pattern based on the observed and desired output:

$$R_j(n) = O_k^{\mathrm{des}} - \left(O_k^{\mathrm{obs}}(n) - O_k^{\mathrm{des}}\right)^2 \tag{2.26}$$

where $O_k^{\mathrm{obs}}$ and $O_k^{\mathrm{des}}$ are vectors corresponding to $O_k$ for all k. The total response for iteration n is the sum of all the individual template responses, Rj(n):

$$R(n) = \sum_{j=1}^{m} R_j(n) \tag{2.27}$$
In Equation 2.27, m is the number of templates used as inputs. ALOPEX iteratively updates the weights using both the global response information and local weight histories, according to the following:

$$W_{ij}(n) = r_i(n) + \gamma\,\Delta W_{ij}(n)\,\Delta R(n) + W_{ij}(n-1) \tag{2.28}$$

$$W_{jk}(n) = r_i(n) + \gamma\,\Delta W_{jk}(n)\,\Delta R(n) + W_{jk}(n-1) \tag{2.29}$$

where γ is an arbitrary scaling factor, ri(n) is an additive Gaussian noise, ∆W represents the local weight change, and ∆R represents the global response information. These values are calculated by:

$$\Delta W_{ij}(n) = W_{ij}(n-1) - W_{ij}(n-2) \tag{2.30}$$

$$\Delta W_{jk}(n) = W_{jk}(n-1) - W_{jk}(n-2) \tag{2.31}$$

$$\Delta R(n) = R(n-1) - R(n-2) \tag{2.32}$$
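The response and weight-update equations above can be sketched as follows. This is a minimal illustration, not the implementation used in the studies cited: the function names are hypothetical, Equation 2.26 is assumed to be summed over the output nodes k to give a scalar, and the noise is assumed to be drawn independently for every weight.

```python
import random

def template_response(desired, observed):
    """Eq. 2.26 for one template, summed over output nodes k (an assumption)."""
    return sum(d - (o - d) ** 2 for d, o in zip(desired, observed))

def total_response(desired_set, observed_set):
    """Eq. 2.27: sum of the template responses over all m templates."""
    return sum(template_response(d, o)
               for d, o in zip(desired_set, observed_set))

def alopex_weight_step(w_prev1, w_prev2, r_prev1, r_prev2, gamma, sigma, rng):
    """Eqs. 2.28-2.32 applied to one weight matrix stored as a list of rows.

    w_prev1 and w_prev2 hold W(n-1) and W(n-2); r_prev1 and r_prev2 hold
    R(n-1) and R(n-2). The same function serves the input-to-hidden and the
    hidden-to-output layer, since both use the same global response change.
    """
    d_r = r_prev1 - r_prev2                          # Eq. 2.32
    new_w = []
    for row1, row2 in zip(w_prev1, w_prev2):
        new_row = []
        for w1, w2 in zip(row1, row2):
            d_w = w1 - w2                            # Eq. 2.30 / 2.31
            noise = rng.gauss(0.0, sigma)            # additive Gaussian noise
            new_row.append(noise + gamma * d_w * d_r + w1)   # Eq. 2.28 / 2.29
        new_w.append(new_row)
    return new_w

# Hypothetical usage for a single 1 x 2 weight matrix.
rng = random.Random(0)
w_next = alopex_weight_step([[0.5, -0.2]], [[0.4, -0.1]],
                            r_prev1=2.0, r_prev2=1.0,
                            gamma=2.0, sigma=0.01, rng=rng)
```

Note that only the scalar R(n) has to be fed back to every weight, which is the property that the hardware discussion below relies on.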
Besides its applicability to a wide variety of optimization problems, the nature of the ALOPEX algorithm makes it suitable for VLSI implementation. ALOPEX is a biologically influenced optimization procedure that uses a single-valued global response feedback to guide weight movements toward their optimum. This single-valued feedback, as opposed to the extensive error propagation schemes of other neural network training algorithms, makes ALOPEX suitable for fast VLSI implementation. Recently, a digital VLSI approach to implementing the ALOPEX algorithm was undertaken by Pandya and Venugopal.66 Results of their study indicated that ALOPEX could be implemented using a Single Instruction Multiple Data (SIMD) architecture. A simulation of the design was carried out in software, and good convergence for a 4 × 4 processor array was demonstrated. In our laboratory, an analog VLSI chip was designed to implement the ALOPEX algorithm. An analog design was chosen in order to make full use of the algorithm's tolerance to noise. As discussed earlier, analog designs offer larger and faster implementations than digital designs. More details are given in Chapter 12.
2.4 SOME APPLICATIONS
2.4.1 EXPERT SYSTEMS AND NEURAL NETWORKS
Computer-based diagnosis is an increasingly used method that tries to improve the quality of health care. Systems that depend on artificial intelligence (AI), such as knowledge-based systems or expert systems, as well as hybrid systems that combine these with other techniques, like NNs, are coming into play. Systems of that sort have been developed extensively in the last ten years with the hope that medical diagnosis, and therefore medical care, will improve dramatically.
Hatzilygeroudis et al.42 are developing such a system with three main components: a user interface, a database management system, and an expert system for the diagnosis of bone diseases. Each rule of the knowledge representation part is an Adaline unit that has as inputs the conditions of the rule. Each condition is assigned a significance factor corresponding to the weight of the input to the Adaline unit, and each rule is assigned a number, called a bias factor, that corresponds to the weight of the bias input of the unit. The output is calculated as the weighted sum of the inputs filtered by a threshold function.
Hudson et al.43 developed a NN for symbolic processing. The network has four layers. A separate decision function is used for layer three and a threshold for each node in the same layer. If the value of the decision function exceeds the corresponding threshold value, a certain symbol is produced. If the value of the decision function does not exceed the threshold, then a different symbol is produced. The symbols so generated at adjacent nodes are combined at layer four according to a well-structured grammar. A grammar provides the rules by which these symbols are combined.44 The addition of a symbolic processing layer enhances the NN in a number of ways. It is, for instance, possible to supplement a network that is purely diagnostic with a level that recommends further actions, or to add additional connections or nodes in order to more closely simulate the nervous system.
With increasing network complexity, parameter variance increases, and the network prediction becomes less reliable. This difficulty can be overcome if some prior knowledge can be incorporated into the NN to bias it.45 In medical applications in particular, rules can either be given by experts or can be extracted from existing solutions to the problem. In many cases the network is required to make reasonable predictions before it has been exposed to sufficient training data, relying only on a priori knowledge. The better this knowledge is initially, the better the performance and the shorter the training.46,47
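The Adaline rule unit described above for the bone-disease system can be sketched as follows. The sketch assumes a bias input fixed at 1 and a hard step as the threshold function; the function name, the example rule, and all numerical values are hypothetical and are not taken from the cited system.

```python
def adaline_rule(conditions, significance, bias_factor, threshold=0.0):
    """Evaluate one rule: weighted sum of its conditions, then a hard threshold.

    conditions   -- truth values (0 or 1) of the rule's conditions
    significance -- one significance factor (input weight) per condition
    bias_factor  -- weight of the bias input, which is assumed to be 1
    """
    activation = bias_factor + sum(w * c for w, c in zip(significance, conditions))
    return 1 if activation > threshold else 0

# Hypothetical rule with three conditions; it fires when enough weighted
# evidence is present to overcome the negative bias factor.
fires = adaline_rule([1, 1, 0], [0.6, 0.5, 0.9], bias_factor=-1.0)
```

Writing each rule as a unit of this kind means that the significance and bias factors are ordinary weights, so they could in principle be adjusted by a training procedure rather than fixed by hand.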
2.4.2 APPLICATIONS IN MAMMOGRAPHY
One of the leading causes of death of women in America is breast cancer. Mammography has been proven to be an effective diagnostic procedure for early detection of breast cancer. An important sign in its detection is the identification on the mammograms of microcalcifications, especially when they form clusters. Chan et al.48 have developed a computer-aided diagnosis (CAD) scheme based on filtering and feature extraction methods. In order to reduce the false positives, Zhang et al.49 applied an artificial NN that is shift invariant. They evaluated the performance of the NN by the “jack-knife” method50 and receiver operating characteristic analysis.51,52 A shift-invariant NN is a feed-forward NN with local, spatially invariant interconnections similar to those of the neocognitron53 but without the lateral interconnections. BP was also used for training on individual microcalcifications, and a cross-validation technique was employed in order to avoid overtraining. In this technique the data set is divided into two sets, one used for training and the other for validating at predetermined intervals. The training of the network is terminated just before the performance of the network on the validating set decreases. The shift-invariant NN proved to be much better than previously used NNs, reducing false positive classifications by almost 55%.
In another study, Zheng et al.54 used a multistage NN for detection of microcalcification clusters with almost 100% success and only one false positive per image. The multistage NN consists of more than one NN connected in series. The first stage is called the “detail network,” whose inputs are the pixel values of the original image, while the second network, the “feature network,” receives as inputs the output from the first stage and a set of features extracted from the original image. This approach has a higher sensitivity of classification and a lower false positive detection rate than the previous reports.
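The stopping rule of the cross-validation technique described above can be sketched generically. The three callables passed to the function are hypothetical stand-ins for whatever training, validation, and weight-saving routines a particular network provides; nothing in the sketch is specific to the shift-invariant NN of the cited study.

```python
def train_with_early_stopping(train_one_interval, validation_score, snapshot,
                              max_intervals=100):
    """Train until the validation score drops; return the state saved just before.

    train_one_interval -- runs training on the training set for one interval
    validation_score   -- returns current performance on the validating set
    snapshot           -- returns a copy of the current network state
    """
    best_score = validation_score()
    best_state = snapshot()
    for _ in range(max_intervals):
        train_one_interval()
        score = validation_score()
        if score < best_score:          # validation performance has started to fall
            break
        best_score, best_state = score, snapshot()
    return best_state, best_score

# Hypothetical usage with a scripted sequence of validation scores.
scores = [0.50, 0.60, 0.70, 0.65, 0.90]
state = {"interval": 0}

def one_interval():
    state["interval"] += 1

stopped_at, score = train_with_early_stopping(
    one_interval, lambda: scores[state["interval"]], lambda: state["interval"],
    max_intervals=4)
```

The sketch stops at the first drop, as the text describes; it therefore never sees the later, higher score in the scripted sequence.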
Another approach was used by Floyd et al.55 where radiologists read the mammograms and came up with a list of eight findings, which were used as features for a NN. The results from biopsies were taken as the truth of diagnosis. For indeterminate cases, as classified by radiologists, the NN had a performance index of 0.86, which is quite high.
Downes56 used similar techniques to identify stellate lesions. He used texture quantification via fractal analysis methods instead of using the raw data. In mammograms, specific textures are usually indicative of malignancy. The method used for calculating the fractal dimension of digitized images was based upon the relationship between the fractal dimension and the power spectral density. Giger et al.57 aligned the mammograms of left and right breasts and used a subtraction technique to find initial candidate masses. Various features were then extracted and used in conjunction with NNs in order to reduce false positives resulting from bilateral subtraction. Receiver operating characteristic (ROC) analysis was applied to evaluate the output of the NN. The methods used were evaluated using pathologically confirmed cases. This scheme yielded a sensitivity of 95% at an average of 2.5 false positive detections per image.
2.4.3 CHROMOSOME AND GENETIC SEQUENCES CLASSIFICATION
Several clinical disorders are related to chromosome abnormalities, which are difficult to identify accurately; classifying the individual chromosomes is also difficult. Automated systems can greatly assist human experts in dealing with some of the problems involved. One way to deal with this problem is the use of NNs. Several studies have already been done toward enhancing the ability of an automated computerized system to perform chromosome identification.58 One such study by Sweeney and Musavi59 analyzed metaphase chromosome spreads employing probabilistic NNs (PNNs), which have been used as alternatives in various classification problems. First introduced by Specht,60,61 PNNs combine a kernel-based estimator of probability densities with the Bayes rule for the classification decision. The class whose estimate has the highest value is taken as the correct class. Thus, training a PNN means finding appropriate kernel functions, usually taken to be Gaussian densities, and the problem is therefore reduced to the selection of a scalar parameter, namely the standard deviation of the Gaussian. A way to improve the accuracy of a PNN for chromosome classification is to use the knowledge that there can be a maximum of only two chromosomes assigned to each class. This knowledge can be easily incorporated into the NN. The results obtained were similar to or better than those of the classical BP-trained NN.
A hybrid symbolic/NN machine learning algorithm was introduced by Noordewier et al.62 for the recognition of genetic sequences. The system uses a knowledge base of hierarchically structured rules to form an artificial NN in order to improve the knowledge base. They used this system in recognizing genes in DNA sequences. The learning curve of this system was compared to that of a randomly initialized, fully connected two-layer NN. The knowledge-based NN learned much faster than the other one, but the error of the randomly initialized NN was slightly lower (5.5 vs. 6.4%).
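The PNN decision rule described earlier in this section, a Gaussian kernel density estimate per class followed by the Bayes rule, can be sketched as follows. Equal prior probabilities for all classes are assumed, the constraint of at most two chromosomes per class is not modeled, and the function name and toy data are hypothetical.

```python
import math

def pnn_classify(x, training_set, sigma):
    """Return the class with the largest Gaussian-kernel density estimate at x.

    training_set -- dict mapping each class label to a list of feature vectors
    sigma        -- standard deviation of the Gaussian kernel, the single
                    scalar parameter that has to be selected
    """
    scores = {}
    for label, patterns in training_set.items():
        total = 0.0
        for p in patterns:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        # Density estimate up to a normalizing constant shared by all classes.
        scores[label] = total / len(patterns)
    # Bayes rule with equal priors: pick the largest estimate.
    return max(scores, key=scores.get), scores

# Hypothetical two-class example with two-dimensional features.
training = {"A": [[0.0, 0.0], [0.0, 1.0]], "B": [[5.0, 5.0], [6.0, 5.0]]}
label, _ = pnn_classify([0.2, 0.3], training, sigma=1.0)
```

No iterative training takes place: the training patterns are stored as they are, and only sigma has to be chosen.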
Methods have also been devised to investigate what the NN has learned, by automatically translating into symbolic rules a trained NN that was initialized by the knowledge-based method.63
Medial axis transform (MAT) based features have been used as inputs to a NN in studying human chromosome classification.64 Prenatal analysis, genetic syndrome diagnosis, and other applications make this research very important. Human chromosome classification based on NNs requires no a priori knowledge or even assumptions about the data. MAT is a widely used method for transformations of elongated objects and requires less storage and time while preserving the topological properties of the object. MAT also allows for a transformation from a 2D image to a 1D representation of it. The features so obtained are then fed as inputs to a two-layer feed-forward NN trained by BP, with almost perfect results in classifying chromosomes. An optimization of an MLP was also done.65
REFERENCES
1. Mueller, P. and Lazzaro, J., Real time speech recognition, in Neural Networks for Computing, Denker, J., Ed., American Inst. of Physics, New York, 1986, 321–326.
2. Bourlard, H. and Morgan, N., A continuous speech recognition system embedding a multilayer perceptron into HMM, in Advances in Neural Information Processing Systems, 2, Touretzky, D., Ed., Morgan Kaufmann, San Mateo, CA, 1990, 186.
3. Bridle, J. S. and Cox, S. J., RecNorm: simultaneous normalization and classification applied to speech recognition, in Advances in Neural Information Processing Systems, 3, Lippmann, R. P., Moody, J. E., and Touretzky, D. S., Eds., Morgan Kaufmann, San Mateo, CA, 1991, 234.
4. Lee, S. and Lippmann, R. P., Practical characteristics of neural networks and conventional classifiers on artificial speech problems, in Advances in Neural Information Processing Systems, 2, Touretzky, D. S., Ed., Morgan Kaufmann, San Mateo, CA, 1990, 168.
5. Fukushima, K., Neocognitron: a neural network model for a mechanism of visual pattern recognition, IEEE Trans. on Systems, Man, and Cybernetics, SMC-13(5), 1983, 826.
6. Dasey, T. J. and Micheli-Tzanakou, E., An unsupervised system for the classification of handwritten digits, comparison with backpropagation training, Handbook of Industrial Electronics, Irwin, D., Ed., 1994.
7. LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D., Backpropagation applied to handwritten zip code recognition, Neural Comput., 1(4), 541, 1989.
8. Hakim, N., Kaufman, J. J., Cerf, G., and Medows, H. E., A discrete time neural network model for system identification, Proc. of IJCNN, 90(Vol. 3), 593, 1990.
9. Hesh, D., Abdallah, C., and Horne, B., Recursive neural networks for signal processing and control, in Proc. First IEEE-SP Workshop on Neural Networks for Signal Processing, Princeton, NJ, 1991, 523.
10. Cottrell, G. W., Munro, P. N., and Zipser, D., Image compression by backpropagation, a demonstration of extensional programming, in Advances in Cognitive Science, Vol. 2, Ablex Publ., Norwood, NJ, 1989, 208.
11. Oja, E. and Lampinen, J., Unsupervised learning for feature extraction, in Computational Intelligence Imitating Life, Zurada, J. M., Marks, R. J., II, and Robinson, C. J., Eds., IEEE Press, New York, 1994.
12. Fogelman Soulie, F., Integrating neural networks for real world applications, in Computational Intelligence Imitating Life, Zurada, J. M., Marks, R. J., II, and Robinson, C. J., Eds., IEEE Press, New York, 1994.
13. Rosenblatt, F., The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev., 65, 386, 1958.
14. Rosenblatt, F., Principles of Neurodynamics, Spartan Books, New York, 1962.
15. Lippmann, R. P., An introduction to computing with neural nets, IEEE ASSP Mag., 4, 1987.
16. Moore, K., Artificial neural networks: weighing the different ways to systemize thinking, IEEE Potentials, 23, 1992.
17. Huang, S. and Huang, Y., Bounds on the number of hidden neurons in multilayer perceptrons, IEEE Trans. Neural Networks, 2(1), 47, 1991.
18. Kung, S. Y., Hwang, J., and Sun, S., Efficient modeling for multilayer feedforward neural nets, Proc. IEEE Conf. on Acoustics, Speech Signal Processing, New York, 1988, 2160.
19. Mirchandani, G., On hidden nodes for neural nets, IEEE Trans. Circuits and Systems, 36(5), 661, 1989.
20. Kohonen, T., Self-Organization and Associative Memory, Springer-Verlag, New York, 1988.
21. Hecht-Nielsen, R., Counterpropagation networks, Proc. of the IEEE First Int. Conf. on Neural Networks, Vol. 2, 1987, 19.
22. Dasey, T. J. and Micheli-Tzanakou, E., The unsupervised alternative to pattern recognition I: classification of handwritten digit, Proc. 3rd Workshop on Neural Networks, Auburn, AL, 1992, 228.
23. Dasey, T. J. and Micheli-Tzanakou, E., The unsupervised alternative to pattern recognition II: detection of multiple sclerosis with the visual evoked potential, Proc. 3rd Workshop on Neural Networks, Auburn, AL, 1992, 234.
24. McCulloch, W. S. and Pitts, W., A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys., 5, 115, 1943.
25. Hebb, D., The Organization of Behavior, Wiley, New York, 1949.
26. Widrow, B. and Lehr, M. A., 30 Years of adaptive neural networks: perceptron, Madaline, and backpropagation, Proc. IEEE, 78(9), 1415, 1990.
27. Minsky, M. and Papert, S., Perceptrons: An Introduction to Computational Geometry, M.I.T. Press, Cambridge, MA, 1969.
28. Tzanakou, E. and Harth, E., Determination of visual receptive fields by stochastic methods, Biophys. J., 15, (42a), 1973.
29. Harth, E. and Tzanakou, E., Alopex: a stochastic method for determining visual receptive fields, Vis. Res., 14, 1475, 1974.
30. Tzanakou, E., Michalak, R., and Harth, E., The ALOPEX process: visual receptive fields by response feedback, Biol. Cybern., 35, 161, 1979.
31. Rumelhart, D. E. and McClelland, J. L., Eds., Parallel Distributed Processing, M.I.T. Press, Cambridge, MA, 1986.
32. Holstrom, L. and Koistinen, P., Using additive noise in backpropagation training, IEEE Trans. Neural Networks, 3(1), 24, 1992.
33. Fahlman, S. E., Faster learning variations of backpropagation: an empirical study, in Proc. of the Connectionist Models Summer School, Touretzky, D., Hinton, G., and Sejnowski, T., Eds., Morgan Kaufmann, San Mateo, CA, 1988.
34. Werbos, P. J., Backpropagation through time: what it does and how to do it, Proc. IEEE, 78(10), 1550, 1990.
35. Nguyen, D. and Widrow, B., The truck backer-upper: an example of self-learning in neural networks, Proc. Int. Joint Conf. on Neural Networks, Vol. II, IEEE Press, New York, 1989, 357.
36. Ciaccio, E. and Tzanakou, E., The ALOPEX process: application to real-time reduction of motion artifact, Ann. Int. Conf. of IEEE EMBS, Vol. 12, no. 3, 1990, 1417.
37. Dasey, T. J. and Micheli-Tzanakou, E., A pattern recognition application of the Alopex process with hexagonal arrays, Int. Joint Conf. on Neural Networks, Vol. II, 1990, 119.
38. Venugopal, K., Pandya, A., and Sudhakar, R., ALOPEX algorithm for adaptive control of dynamical systems, Proc. of IJCNN, Vol. II, 1992, 875.
39. Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., Optimization by simulated annealing, Science, 220, 671, 1983.
40. Micheli-Tzanakou, E., Non-linear characteristics in the frog’s visual system, Biol. Cybern., 51, 53, 1984.
41. Micheli-Tzanakou, E., Visual receptive fields and clustering, Behav. Res. Meth. Instrument., 15(6), 553, 1983.
42. Hatzilygeroudis, I., Vassilakos, P. J., and Tsakalidis, A., An intelligent medical system for diagnosis of bone diseases, Proc. Int. Conf. on Med. Physics and Biom. Eng., Vol. 1, Cyprus, 1994, 148.
43. Hudson, D. L., Cohen, M. E., and Deedwania, P. C., A neural network for symbolic processing, Proc. 15th Ann. Int. Conf. of the IEEE/EMBS, Vol. 1, 1993, 248.
44. Hopcroft, J. E. and Ullman, J. D., Formal Languages and their Relation to Automata, Addison-Wesley, Reading, MA, 1969.
45. Roscheisen, M., Hofmann, R., and Tresp, V., Neural control for rolling mills: incorporating domain theories to overcome data deficiency, in Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, CA, 1992.
46. Towell, G. G., Shavlik, J. W., and Noordewier, M. O., Refinement of approximately correct domain theories by knowledge-based neural networks, Proc. 8th Nat. Conf. on Artif. Intelligence, 1990, 861.
47. Tresp, V., Hollatz, J., and Ahmad, S., Network structuring and training using rule-based knowledge, in Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, CA, 1994, 871.
48. Chan, H.-P., Doi, K., Vyborny, C. J. et al., Improvement in radiologists’ detection of clustered microcalcifications on mammograms. The potential of computer aided diagnosis, Inv. Radiol., 25, 1102, 1990.
49. Zhang, W., Giger, M. L., Nishihara, R. M., and Doi, K., Application of a shift-invariant artificial neural network for detection of breast carcinoma in digital mammograms, Proc. World Congr. on Neural Networks, Vol. I, 1994, 45.
50. Fukunaga, K., Introduction to Statistical Pattern Recognition, 2nd ed., Acad. Press, New York, 1990.
51. Metz, C. E., Current problems in ROC analysis, Proc. Chest Imaging Conf., Madison, WI, 1988, 315.
52. Metz, C. E., Some practical issues of experimental design and data analysis in radiological ROC studies, Inv. Radiol., 24, 234, 1989.
53. Fukushima, K., Miyake, S., and Ito, T., Neocognitron: a neural network model for a mechanism of visual pattern recognition, IEEE Trans. on Systems, Man, and Cybernetics, SMC-13, 1983, 826.
54. Zheng, B., Qian, W., and Clarke, L. P., Artificial neural network for pattern recognition in mammography, Proc. World Congr. on Neural Networks, Vol. I, 1994, 57.
55. Floyd, C. E., Jr., Yun, A. J., Lo, J. Y., Tourassi, G., Sullivan, D. C., and Kornguth, P. J., Prediction of breast cancer malignancy for difficult cases using an artificial neural network, Proc. World Congr. on Neural Networks, Vol. I, 1994, 127.
56. Downes, P., Neural network recognition of multiple mammographic lesions, Proc. World Congr. on Neural Networks, Vol. I, 1994, 133.
57. Giger, M. L., Lu, P., Huo, Z., and Zhang, W., Application of artificial neural networks to the task of merging feature data in computer-aided diagnosis schemes, Proc. World Congr. on Neural Networks, Vol. I, 1994, 43.
58. Piper, J., Granunn, E., Rutovitz, D., and Ruttledge, H., Automation of chromosome analysis, Signal Proc., 2(3), 109, 1990.
59. Sweeney, W. P., Jr. and Musavi, M. T., Application of neural networks for chromosome classification, Proc. 15th Annu. Int. Conf. of the IEEE/EMBS, Vol. 1, 1994, 239.
60. Specht, D., Probabilistic neural networks for classification, mapping, or associative memory, IEEE Int. Conf. on Neural Networks, San Diego, CA, 1988.
61. Specht, D., Probabilistic neural networks, Neural Net., 3, 109, 1990.
62. Noordewier, M. O., Towell, G. G., and Shavlik, J. W., Training knowledge-based neural networks to recognize genes in DNA sequences, Adv. Neural Info. Proc. Sys., 3, 530, 1993.
63. Towell, G. G., Craven, M., and Shavlik, J. W., Automated interpretation of knowledge-based neural networks, Technical Report, Univ. of Wisc. Computer Sci. Dept., Madison, WI, 1991.
64. Lerner, B., Rosenberg, B., Levistein, M., Guterman, H., Dinstein, I., and Romem, Y., Medial axis transform based features and a neural network of human chromosome classification, Proc. World Congr. on Neural Networks, Vol. 3, 1994, 173.
65. Lerner, B., Guterman, H., Dinstein, J., and Romem, Y., Learning curves and optimization of a multilayer perceptron neural network for chromosome classification, Proc. World Congr. on Neural Networks, Vol. 3, 1994, 248.
66. Pandya, A. S. and Venugopal, P., A Stochastic Parallel Algorithm for Supervised Learning in Neural Networks, IEEE Trans. Inf. Sys., E77-D No. 4, 1994, 376.
3 A System for Handwritten Digit Recognition
Woogon Chung and Evangelia Micheli-Tzanakou
3.1 INTRODUCTION
Visual pattern recognition has long been an interesting problem, from both the application and technical aspects. We hope to design a system that understands characters, words, and even sentences. Handwritten digit recognition is one of the most challenging problems. Its applications are extensive: automatic document processing, banking systems, etc. The writing style differs depending on the writer’s environment, and this causes difficulty in the system design, even though the fundamental assumption in written communication is that differences between characters are more significant than differences among instances of the same character.
Handwritten digit recognition has a long history, and many researchers have proposed different models.1–6 These are mostly model-based. The developed model is usually specific to the given data set, and its applicability to a different data set is rather restricted. These methods find local properties, or primitives (e.g., arcs, lines, starting/end points), and the rules that combine the individual properties, from the skeletonized images. The painstaking processes needed to find and tune the properties are among the difficulties of the resulting systems and a source of their variability.
A simple image pattern analysis (of Arabic numerals, for example) is carried out here to demonstrate that a simple model-free strategy, using global moments with proper statistical analysis, renders a quite acceptable result. The moment calculation for features is model-free, since no information about the data set other than the group label is required in order to design the pattern recognition system. All the groups of data are treated the same way to extract the global features, while the model-based methods are required to describe each different digit by a certain list of properties.
3.2 PREPROCESSING OF HANDWRITTEN DIGIT IMAGES
The images are passed through a sequence of preprocessing steps before the Zernike moments calculation, a global feature extraction method that will be described in Section 3.3. A block diagram for the sequence of preprocessing procedures and the intermediate results of digit images are shown in Figure 3.1.
FIGURE 3.1 (I) Sequence of the preprocessing. (II) Two original images (9,6) and their preprocessed results. Starting with the original images, the results of ‘smoothing,’ ‘contrast enhancement,’ ‘thresholding,’ ‘centering,’ ‘skeletonization,’ ‘dialization,’ and ‘size normalization’ are presented from left to right and top to bottom of the figure.
The objectives and the methods for each preprocessing are described in the following paragraphs. The major objective in the preprocessing stage of the pattern recognition system is getting unique features from the same group of patterns. Noise due to acquisition or transmission is reduced by a smoothing operation with neighboring pixel values which generally is low-pass filtering. Smoothing substitutes the value of the pixel in the center of a window with the average of pixels in the window. Such an operation has the effect of suppression of the distortions in the gray values caused by sensor noise or transmission errors. Edges in an object are typical changes in the gray levels. Thus, smoothing and edge detection are contradictory. In image analysis, however, one likes to smooth without distorting the edges. Median filtering, which is a nonlinear operation, is well known for noise removal while preserving the edges7 rendering solution to this contradiction. Since the binary noise (i.e., shot noise) is the noise type to be removed, we apply median filtering to our data. The pixels in the window (usually 3 × 3-matrix) are sorted, and a robust median value is chosen to replace the pixel value. Since the binary noise, like the shot noise, completely changes the gray level value, it is very unlikely to be the median value in the window. Thus, the median of the pixel values in the window is used to estimate its gray level value. Due to variations in the acquisition systems, e.g., cameras and scanners, reflection angle, etc., recorded pixel values are not exactly what objects really are. Thus the smoothed images are further processed for gray-scale modifications to enhance constrast. The contrast of an image in a given gray level range can be increased by stretching the range of gray levels in the image. The brightest and the darkest pixel values are found, and they are assigned to white and black, i.e., 255 and 0 in an 8bit representation. 
This is an affine transformation that maps the acquired values onto the full range of gray levels. Some benefits of this contrast enhancement (a linear contrast stretch, related to but simpler than histogram equalization) are
• the elimination of the irregular acquisition effects, and
• the enhancement of contrast.
The enhanced contrast not only helps in viewing but also gives more confidence in finding the threshold that separates an object from the background. Segmentation of an image into parts is an important stage in image analysis. It uses clustering of pixels by their values. An ideal clustering would result in a homogeneous distribution of pixels within each cluster, thus segmenting the image into parts by pixel value. In digit recognition we have only one object to be segmented from the background. For this purpose, simply taking the midpoint of the gray level range in the histogram as the threshold results in good binary images. Another preprocessing step compensates for the varied positions of the centroids of the digits, as seen in Figure 3.2. This translational variance of the images is interpreted as camera movement in a direction perpendicular to the optical axis. The centroid of an image f(x, y) is given by

$$ \bar{x} = \frac{M_{1,0}}{M_{0,0}}, \qquad \bar{y} = \frac{M_{0,1}}{M_{0,0}} $$

where

$$ M_{p,q} = \sum_{x} \sum_{y} x^{p} y^{q} f(x, y) $$

is the (p + q)-th order moment. The image is translated to the center of the frame by moving the centroid to that point.
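The preprocessing chain described above (median filtering, contrast stretching, midpoint thresholding, and centering on the centroid) can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the object is assumed to be brighter than the background, border pixels are handled by edge replication, and the translation uses a wrap-around shift that is only valid when the object stays inside the frame.

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter; borders are handled by edge replication (assumption)."""
    padded = np.pad(img, 1, mode="edge")
    rows, cols = img.shape
    # Stack the nine shifted copies of the image and take the median per pixel.
    windows = [padded[r:r + rows, c:c + cols] for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0)

def stretch_contrast(img):
    """Affine map sending the darkest pixel to 0 and the brightest to 255."""
    lo, hi = float(img.min()), float(img.max())
    if hi == lo:
        return np.zeros_like(img, dtype=float)
    return (img - lo) * 255.0 / (hi - lo)

def binarize(img):
    """Threshold at the midpoint of the gray-level range (object = 1)."""
    return (img > 0.5 * (img.min() + img.max())).astype(np.uint8)

def centroid(img):
    """Centroid (x_bar, y_bar) computed from the moments M00, M10, and M01."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    return (x * img).sum() / m00, (y * img).sum() / m00

def center_on_frame(img):
    """Translate the image so that its centroid sits at the frame center."""
    xc, yc = centroid(img)
    dx = int(round((img.shape[1] - 1) / 2 - xc))
    dy = int(round((img.shape[0] - 1) / 2 - yc))
    # np.roll wraps around; valid only if the object does not cross the border.
    return np.roll(img, (dy, dx), axis=(0, 1))
```

A typical call order would be `center_on_frame(binarize(stretch_contrast(median_filter3(raw))))`, where `raw` is a two-dimensional gray-level array.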
FIGURE 3.2 Some digits from the training data. Five people wrote digits on a grid of one-inch squares. We assume that the digits are well separated, that is, that interaction and occlusion problems have already been solved. Differences in size and stroke width among the writing styles are noticeable.
Depending upon the writing instruments and the writer’s habits, stroke widths are different, as can be seen in the sample digits of Figure 3.2. Skeletonization* is used in order to find an approximation to the medial axis of planar objects.
* Some other terms, like shrinking and thinning, appear in the literature and are used interchangeably.8,9
The basic requirements of skeletonization algorithms are end-point preservation and pixel connectivity.8,9 The algorithm used in our study is that of Zhang and Suen.9 The values (0 or 1) of the eight neighbors of a pixel are compared, and a decision is made as to whether to delete the center pixel. The eight neighbors are denoted (p2, p3, …, p9), as shown in Figure 3.3(a). Using the eight neighbor values, we test four conditions to decide whether to remove the center pixel, p1.
FIGURE 3.3 (I) (a) Neighboring pixels and (b) preventing end-points and middle points from deletion. (II) A series of skeletonized patterns next to the original pattern. Starting from the upper left, the original pattern and the results of the 1st, 2nd, 3rd, 4th, and 5th (last) passes are displayed. As the procedure goes on, it peels off the boundary points and an opposite corner point; then it does the same from the opposite direction. In the first peeling-off, all the N/W boundary points and a S/E corner point are deleted.
The algorithm works in two directions. The conditions for the first direction are

\begin{align}
2 \le B(p_1) \le 6 \tag{3.1} \\
A(p_1) = 1 \tag{3.2} \\
p_2 * p_4 * p_6 = 0 \tag{3.3} \\
p_4 * p_6 * p_8 = 0 \tag{3.4}
\end{align}

and the conditions for the second direction are

\begin{align}
2 \le B(p_1) \le 6 \\
A(p_1) = 1 \\
p_2 * p_4 * p_8 = 0 \\
p_2 * p_6 * p_8 = 0
\end{align}
where the first two conditions of the second set are the same as those of the first set. B(p1) is the sum of the eight neighboring pixels, that is, B(p1) = p2 + p3 + ⋯ + p9, and A(p1) is the number of (0, 1) patterns in the ordered sequence of neighboring pixels p2, p3, …, p9, p2 (Figure 3.3b). The conditions of Equation 3.3 and Equation 3.4 in the first set are satisfied when p4 = 0 or p6 = 0 or (p2 = 0 and p8 = 0). So a point p1 removed under the first set is an East or South boundary point or a North-West corner point. The conditions of Equation 3.1 and Equation 3.2 protect the end-points from being deleted (Figure 3.3): the first loop, at the left end-point, has B(p1) = 1, which does not meet the condition of Equation 3.1, and the second loop shows that A(p1) = 2, meaning that the middle point cannot be deleted. A set of skeletonized patterns and the original are also displayed in Figure 3.3. Note that the procedure alternates between the two directions as the algorithm passes through the two subiterations with their corresponding conditions. After segmentation by thresholding, the binary images are skeletonized to obtain invariance to the stroke width that results from different writing styles and writing instruments. For global moment calculation, however, a dilation process is desired. Pen path-width standardization by dilation proves to be important for that purpose. (This will be seen indirectly later in Figure 3.6, where the reconstruction of patterns is done progressively for some font images. In the reconstruction, the narrow strokes are less prominent than the wider parts of the fonts.) Another reason for path-width standardization is that the moment values obtained from skeletonized images (one pixel wide) are more vulnerable to perturbation by a small change in the location of the skeletonized pixels (Figure 3.4).
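The two-subiteration thinning rule of Equations 3.1 to 3.4 can be sketched in plain Python as follows. This is an illustrative sketch, not the code used in the study. It assumes the neighbor ordering of Zhang and Suen's paper (p2 is the pixel above p1 and the remaining neighbors follow clockwise), since Figure 3.3(a) is not reproduced here, and it assumes that the outermost rows and columns of the image are background. The function name is hypothetical.

```python
def zhang_suen_thin(image):
    """Thin a binary image (list of lists of 0/1) to a one-pixel-wide skeleton."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])

    def neighbors(r, c):
        # p2..p9, clockwise starting from the pixel above (assumed ordering).
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
                img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]

    def transitions(n):
        # A(p1): number of (0, 1) patterns in the cyclic sequence p2, ..., p9, p2.
        return sum(a == 0 and b == 1 for a, b in zip(n, n[1:] + n[:1]))

    changed = True
    while changed:
        changed = False
        for step in (0, 1):                      # the two subiterations
            to_delete = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbors(r, c)
                    p2, p4, p6, p8 = n[0], n[2], n[4], n[6]
                    # Conditions 3.1 and 3.2 are shared by both subiterations.
                    if not (2 <= sum(n) <= 6 and transitions(n) == 1):
                        continue
                    if step == 0:                # conditions 3.3 and 3.4
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:                        # second set of conditions
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if ok:
                        to_delete.append((r, c))
            # Deletions are applied only after the whole image has been scanned.
            for r, c in to_delete:
                img[r][c] = 0
            changed = changed or bool(to_delete)
    return img
```

Collecting the deletions first and applying them after the scan keeps each subiteration parallel, so the result does not depend on the scanning order.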
A certain stroke width for a given image size is therefore desired in order
• to stabilize the moment values against variation of the skeletonized patterns, and
• to build tighter clusters within a group and larger separations between the clusters of different classes.
Nonlinear morphological processing, as opposed to linear processing (e.g., convolution), achieves effects such as dilation, erosion, opening, closing, and boundary extraction.7,10,11 Let F be the set of all nonzero pixels of the image matrix and M the set of nonzero mask pixels. M_p denotes the mask shifted so that its reference point lies on the pixel p. Dilation is defined with a set operation as follows:

$$ F \oplus M = \{ p : M_p \cap F \neq \emptyset \} $$

that is, the dilation operation produces the points at which the mask M and the image F have at least one nonzero pixel in common. Erosion is defined as

$$ F \ominus M = \{ p : M_p \subseteq F \} $$
that is, the erosion produces the points for which the mask is a subset of the original image. These are equivalent to the regular binary operations for dilation and erosion, respectively:

$$ f'_{xy} = \bigvee_{k=-K}^{K} \bigvee_{l=-K}^{K} \left( M_{k,l} \wedge f_{x-k,\,y-l} \right) \tag{3.5} $$

and

$$ f'_{xy} = \overline{ \bigvee_{k=-K}^{K} \bigvee_{l=-K}^{K} \left( M_{k,l} \wedge \overline{f}_{x-k,\,y-l} \right) } \tag{3.6} $$

where ∨ and ∧ denote the logical [OR] and [AND] operations, respectively, and the overbar denotes logical inversion. The binary image f is convolved with a symmetric (2K + 1) × (2K + 1) mask M. The erosion has to be done as shown in Equation 3.6 since an all-zero mask M would have no meaning in a binary [AND] operation. In other words, the erosion operation is done by first dilating the background and then inverting the result to get the erosion effect.
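The OR/AND formulation of dilation, and erosion computed as the inverted dilation of the background, can be sketched with NumPy as follows. The function names are hypothetical, and the treatment of pixels outside the frame (taken as zero before each dilation) is an assumption that the text does not specify. The duality used in `erode` holds for symmetric masks, as assumed in the text.

```python
import numpy as np

def dilate(f, mask):
    """Binary dilation: OR over the mask of (M[k, l] AND f[x - k, y - l])."""
    K = mask.shape[0] // 2
    rows, cols = f.shape
    # Pixels outside the frame are treated as 0 (assumption).
    padded = np.pad(f.astype(bool), K, mode="constant", constant_values=False)
    out = np.zeros(f.shape, dtype=bool)
    for k in range(-K, K + 1):
        for l in range(-K, K + 1):
            if mask[k + K, l + K]:
                # shifted[x, y] equals f[x - k, y - l]
                shifted = padded[K - k:K - k + rows, K - l:K - l + cols]
                out |= shifted
    return out.astype(np.uint8)

def erode(f, mask):
    """Binary erosion: dilate the background, then invert the result."""
    background = 1 - f.astype(np.uint8)
    return 1 - dilate(background, mask)
```

For example, dilating a single pixel with a 3 × 3 mask of ones yields a 3 × 3 block, and eroding that block with the same mask recovers the single pixel.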
3.2.1 OPTIMAL SIZE OF THE MASK FOR DILATION
The intuition behind the dilation operation is justified via a simulation that finds an optimal dilation matrix size, 2K + 1. The strategy is as follows: given the size of the image frame, find the dilation matrix size 2K + 1 that gives the largest separation between group means (or the highest confidence in rejecting the null hypothesis of the MANOVA model) when comparing J population mean vectors. The MANOVA model and the modified Wilks' statistic (or Bartlett statistic)12 are used to measure the separation. Leaving the details to Reference 13, we introduce its definition as well as results from a simulation study.
3.2.2 BARTLETT STATISTIC
This is the modified Wilks' lambda statistic, given by

$$ \Lambda^{*} = \frac{|\mathrm{WSSP}|}{|\mathrm{BSSP} + \mathrm{WSSP}|} \tag{3.7} $$

where WSSP and BSSP are the "within" and "between" sums of squares and cross-products. A simple modification leads to the Bartlett statistic. If the null hypothesis (i.e., equal group means) is true and N = ∑_{j=1}^{J} n_j is large, the modified statistic is approximately chi-square distributed, so the null hypothesis is rejected at significance level α if

$$ -\left( N - 1 - \frac{p + J}{2} \right) \ln \left( \frac{|\mathrm{WSSP}|}{|\mathrm{BSSP} + \mathrm{WSSP}|} \right) > \chi^{2}_{p(J-1)}(\alpha) \tag{3.8} $$

where χ²_{p(J−1)}(α) is the upper (100α)th percentile of a chi-square distribution with p(J − 1) degrees of freedom, J is the number of classes, and p is the dimensionality of the covariate. The size of the digital images used in this study is about 128 × 128 because the approximation of moments by digital calculation requires high resolution. This fact has been partly studied for lower-order moment invariants,14 which require the image size to be larger than 60 × 60 pixels. For the higher order moments, a higher resolution may be required. With the image size fixed (129 × 129), Bartlett statistics (or modified Wilks' lambda Λ*) are calculated for different dilation matrix sizes, 2K + 1, as in Equation 3.5. For the simulation study, an image pattern 'A' is preprocessed in the same way except for the size of the dilation. The skeletonized image is dilated with dilation matrix sizes 2K + 1 = 1, 3, 5, 7, 9, 11, 13, 15. A set of Zernike moments is obtained for each dilation size, and the Bartlett statistics (Equation 3.8) are calculated and plotted against the size of the dilation matrix 2K + 1 (Figure 3.4). The null hypothesis (that is, that all the mean vectors are the same) is clearly rejected for all K values at the significance level α = 0.01.
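As a rough illustration of Equations 3.7 and 3.8, the following NumPy sketch computes Wilks' lambda and the Bartlett statistic from a list of per-class feature matrices. The function name and the input layout (one array of shape (n_j, p) per class) are assumptions made for this example. The critical value χ²_{p(J−1)}(α) is not computed here and would have to come from a chi-square table or a statistics library.

```python
import numpy as np

def bartlett_statistic(groups):
    """Return (Wilks' lambda, Bartlett statistic) for a list of (n_j, p) arrays."""
    all_data = np.vstack(groups)
    N, p = all_data.shape
    J = len(groups)
    grand_mean = all_data.mean(axis=0)
    wssp = np.zeros((p, p))          # "within" sums of squares and cross-products
    bssp = np.zeros((p, p))          # "between" sums of squares and cross-products
    for g in groups:
        mean_j = g.mean(axis=0)
        centered = g - mean_j
        wssp += centered.T @ centered
        diff = (mean_j - grand_mean)[:, None]
        bssp += len(g) * (diff @ diff.T)
    wilks = np.linalg.det(wssp) / np.linalg.det(bssp + wssp)     # Equation 3.7
    stat = -(N - 1 - (p + J) / 2.0) * np.log(wilks)              # left side of 3.8
    return wilks, stat
```

Well-separated group means drive Wilks' lambda toward zero and the Bartlett statistic toward large values, which is the behavior plotted against the dilation size in the simulation.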
FIGURE 3.4 Bartlett statistic against dilation matrix size. Dilation increases the statistic as K increases and starts decreasing after size 2K + 1 = 7 with the image frame of size 129 × 129.
From Figure 3.4, the statistic is highest for 2K + 1 = 7. In fact, the results using size 7 look the best (Figure 3.1) for an image of size around 129 × 129, which is the size we have chosen.
It is worth noting the assumption made on the statistics. The statistic of Equation 3.8 assumes that the error term follows the multinormal distribution ε_{ij} ~ N(0, Σ) in the one-way classification model X_{ij} = µ_X + µ_j + ε_{ij}, where i = 1, 2, …, n_j and j = 1, 2, …, J. µ_X is an overall mean and µ_j represents the jth treatment effect (or jth group mean), with ∑_{j=1}^{J} n_j µ_j = 0. Furthermore, the statistic does not necessarily measure the separation between multigroup mean vectors when J > 2. For example, with a scalar statistic in a two-dimensional, three-group setting, a large statistic may also result when two of the mean vectors are unacceptably close but the third is far from both. The ideal separation among the groups, however, is the case in which the three mean vectors are equidistant from one another. In this sense the Bartlett statistic gives little insight into how well the mean vectors are separated; however, it still gives some feeling for the separation. After translation invariance has been obtained by the translation standardization stage in Figure 3.1, size standardization follows. The radius of an image function f(x, y) can be defined15 as

$$ r = \left( \mu_{20} + \mu_{02} \right)^{1/2} \tag{3.9} $$
where µ20 and µ02 are the moments of order 2 after centralization and represent the variances in the x- and y-directions of the ellipsoidal approximation of the image. In the size standardization stage the desired radius r_s, after normalization, is fixed at 60% of one-half the smaller side of the image frame:

$$ r_s = 0.6 \cdot \min \{ n_{col}/2,\; n_{row}/2 \} \tag{3.10} $$

where ncol × nrow is the size of the image frame. All the object pixels are scaled in such a way that the radius of the scaled object becomes the prescribed value r_s. The 60% restriction can be thought of as a control parameter that keeps all the scaled objects inside the frame. This prevents the scaled objects from spilling outside the frame, and it corresponds to the coordinate normalization in the Zernike moment calculation, which will be treated in Section 3.3. It should be noted that the radius in Equation 3.9 is neither the principal axis length a nor the secondary principal axis length b of an ellipsoid approximation of the image function f(x, y), but that it is directly related to a and b; the area of an ellipse with parameters (a, b) is equal to πab. Digits such as '1' have a larger major principal axis but a smaller secondary principal axis, whereas the digits '0' and '4' give relatively equal principal and secondary principal axes a and b. The effect of the size normalization with the control constant 0.6 in Equation 3.10 is shown in Figure 3.1.
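The radius of Equation 3.9 and the target radius of Equation 3.10 can be illustrated with a short NumPy sketch. The function names are hypothetical. The sketch assumes that the central moments are normalized by M00, so that µ20 and µ02 are variances as stated in the text; it computes only the scale factor and leaves the actual resampling of the object pixels to the reader.

```python
import numpy as np

def image_radius(img):
    """Radius r = sqrt(mu20 + mu02), using central moments normalized by M00."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    mu20 = ((x - xc) ** 2 * img).sum() / m00
    mu02 = ((y - yc) ** 2 * img).sum() / m00
    return np.sqrt(mu20 + mu02)

def target_radius(nrow, ncol, fraction=0.6):
    """Desired radius r_s of Equation 3.10."""
    return fraction * min(ncol / 2.0, nrow / 2.0)

def scale_factor(img):
    """Factor by which object coordinates are scaled about the centroid."""
    return target_radius(*img.shape) / image_radius(img)
```

For a 129 × 129 frame the target radius is 0.6 × 64.5 = 38.7 pixels, so an object whose radius is 10 pixels would be enlarged by a factor of 3.87.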
3.3 ZERNIKE MOMENTS (ZM) FOR CHARACTERIZATION OF IMAGE PATTERNS
The complex Zernike moments of order n with repetition l are defined as

$$ A_{nl} = \frac{n+1}{\pi} \int_{0}^{2\pi} \int_{0}^{\infty} \left[ V_{nl}(r, \theta) \right]^{*} f(r\cos\theta,\, r\sin\theta)\; r\, dr\, d\theta \tag{3.11} $$

where n = 0, 1, 2, …, ∞ and l takes on positive and negative integer values such that

$$ n - |l| = \text{even}, \qquad |l| \le n. \tag{3.12} $$

The Zernike polynomials16 given by

$$ V_{nl}(r\cos\theta,\, r\sin\theta) = R_{nl}(r) \exp(il\theta) \tag{3.13} $$

are a complete set of complex-valued orthogonal functions on the unit disk x² + y² ≤ 1:

$$ \int_{0}^{2\pi} \int_{0}^{1} \left[ V_{nl}(r, \theta) \right]^{*} V_{mk}(r, \theta)\; r\, dr\, d\theta = \frac{\pi}{n+1}\, \delta_{mn} \delta_{kl} \tag{3.14} $$
In Figure 3.5 the luminance of the gray images represents the real part of the polynomials, which lies in [−1, 1]; 256 gray levels are assigned to the discretized values of the polynomials. The periodicity 2π/l of the angular factor in Equation 3.13 gives the polynomial image an l-fold symmetry. The real-valued radial polynomial shown in Figure 3.5 and appearing in Equation 3.13 satisfies the following condition:

$$ \int_{0}^{1} R_{nl}(r)\, R_{ml}(r)\; r\, dr = \frac{1}{2(n+1)}\, \delta_{mn} \tag{3.15} $$

and is defined as

$$ R_{nl}(r) = \sum_{s=0}^{(n-|l|)/2} \frac{(-1)^{s}\, (n-s)!}{s!\, \left( \frac{n+|l|}{2} - s \right)!\, \left( \frac{n-|l|}{2} - s \right)!}\; r^{\,n-2s} = \sum_{\substack{k=|l| \\ n-k\ \text{even}}}^{n} B_{n|l|k}\, r^{k} \tag{3.16} $$

where B_{n|l|k} is the new expression (obtained by the change of variable k = n − 2s) for the coefficient part of the radial polynomial:

$$ B_{n|l|k} = \frac{(-1)^{\frac{n-k}{2}} \left( \frac{n+k}{2} \right)!}{\left( \frac{n-k}{2} \right)!\, \left( \frac{k+|l|}{2} \right)!\, \left( \frac{k-|l|}{2} \right)!} $$

FIGURE 3.5 Radial and Zernike polynomials Rnl(r) for different orders for a given azimuthal repetition l. Two real parts of Zernike polynomials with (n, l) = (6, 4) and (n, l) = (9, 5) are also shown.
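The coefficient form of Equation 3.16 can be evaluated directly. The following plain-Python sketch (with hypothetical function names) builds the coefficients B_{n|l|k} and evaluates the radial polynomial from them. Each coefficient is a signed multinomial coefficient, so integer arithmetic is exact.

```python
from math import factorial

def radial_coefficients(n, l):
    """Coefficients B_{n|l|k} of R_nl(r) = sum_k B_{n|l|k} r^k, as a dict {k: B}."""
    l = abs(l)
    if l > n or (n - l) % 2:
        raise ValueError("need |l| <= n and n - |l| even")
    coeffs = {}
    for k in range(l, n + 1, 2):                 # k = |l|, |l| + 2, ..., n
        num = (-1) ** ((n - k) // 2) * factorial((n + k) // 2)
        den = (factorial((n - k) // 2) * factorial((k + l) // 2)
               * factorial((k - l) // 2))
        coeffs[k] = num // den                   # exact: den always divides num
    return coeffs

def radial_polynomial(n, l, r):
    """Evaluate R_nl(r) from its coefficients."""
    return sum(b * r ** k for k, b in radial_coefficients(n, l).items())
```

For instance, `radial_coefficients(4, 0)` gives the coefficients of 6r⁴ − 6r² + 1, and every radial polynomial evaluates to 1 at r = 1.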
The orthogonality of the Zernike polynomials enables a given f(x, y) to be expressed in terms of the polynomials

$$ f(x, y) = \sum_{n=0}^{\infty} \sum_{\substack{|l| \le n \\ n-|l|\ \text{even}}} A_{nl} V_{nl}(x, y) \tag{3.17} $$

where the Zernike moments A_{nl} are computed over the unit disk x² + y² ≤ 1:

$$ A_{nl} = \frac{n+1}{\pi} \iint_{x^{2}+y^{2} \le 1} \left[ V_{nl}(r, \theta) \right]^{*} f(x, y)\; dx\, dy = \left[ A_{n,-l} \right]^{*} $$

This is obtained simply from the orthogonality property of the Zernike polynomials in Equation 3.14. The second equality holds because f(x, y) is real and the radial polynomials satisfy R_{n,l} = R_{n,−l}. A_{nl} can be interpreted as the projection, correlation, or proximity of a given image onto each complex-valued polynomial. Thus the set of Zernike moments is the collection of the projections of a given image onto the set of Zernike polynomials with order n and azimuthal repetition l. In practice, we cannot carry the summation of Equation 3.17 to infinity. Instead a finite order N is used:

$$ \hat{f}(x, y) = \sum_{n=0}^{N} \sum_{\substack{|l| \le n \\ n-|l|\ \text{even}}} A_{nl} V_{nl}(x, y). \tag{3.18} $$
This approximation with finite order N is optimal among all representations of f(x, y) expressed by moments, owing to the orthogonality property. The Zernike moments can be represented by the regular geometric moments (GM) by expressing the terms r^k in Equation 3.16 and exp(−ilθ) in Equation 3.13 in terms of x and y:

$$
\begin{aligned}
r^{k} &= \left( x^{2} + y^{2} \right)^{k/2} \\
\exp(-i\theta) &= \left( x^{2} + y^{2} \right)^{-1/2} (x - iy) \\
\exp(-il\theta) &= \left[ \exp(-i\theta) \right]^{l} = (\cos\theta - i\sin\theta)^{l} \\
&= \left( x^{2} + y^{2} \right)^{-l/2} (x - iy)^{l} \\
&= \left( x^{2} + y^{2} \right)^{-l/2} \sum_{m=0}^{l} \binom{l}{m} (-i)^{m}\, x^{l-m} y^{m}
\end{aligned}
\tag{3.19}
$$

The resulting expression for A_{nl} is

$$
\begin{aligned}
A_{nl} &= \frac{n+1}{\pi} \iint R_{nl}(r) \exp(-il\theta)\, f(x, y)\; dx\, dy \\
&= \frac{n+1}{\pi} \sum_{\substack{k=|l| \\ n-k\ \text{even}}}^{n} \sum_{j=0}^{q} \sum_{m=0}^{|l|} w^{m} \binom{q}{j} \binom{|l|}{m} B_{n|l|k}\, M_{k-(2j+m),\, 2j+m}
\end{aligned}
\tag{3.20}
$$

where w = −i for l > 0 and w = +i for l ≤ 0, and q = ½(k − |l|).
3.3.1 RECONSTRUCTION BY ZERNIKE MOMENTS
In designing a pattern recognition system, one should be concerned with what constitutes the feature elements. What is the best set (if any at all) of the possible features for the classification purpose? How does one get it? A trade-off must be made between the representational ability and the complexity of the system that results from the selected set of global features. The order of the ZM to be included can be found by the reconstruction process. Owing to the orthogonality of the Zernike polynomials (Equation 3.14), we are able to reconstruct an image f̂(x, y) from the finite-order representation (Equation 3.18) of the original image f(x, y). To illustrate the reconstruction process and to find the optimal order to be used, we revisit Equation 3.18 and simplify it in terms of real-valued functions,17 using A_{n,−l} = A*_{nl} and V_{n,−l} = V*_{nl}:

$$
\begin{aligned}
\hat{f}(x, y) &= \sum_{n=0}^{N} \sum_{l<0} A_{nl} V_{nl}(\rho, \theta) + \sum_{n=0}^{N} \sum_{l \ge 0} A_{nl} V_{nl}(\rho, \theta) \\
&= \sum_{n=0}^{N} \sum_{l>0} A_{n,-l} V_{n,-l}(\rho, \theta) + \sum_{n=0}^{N} \sum_{l \ge 0} A_{nl} V_{nl}(\rho, \theta) \\
&= \sum_{n=0}^{N} \sum_{l>0} \left[ A_{nl}^{*} V_{nl}^{*}(\rho, \theta) + A_{nl} V_{nl}(\rho, \theta) \right] + \sum_{n=0}^{N} A_{n0} V_{n0}(\rho, \theta) \\
&= \sum_{n=0}^{N} \sum_{l>0} \left( C_{nl} \cos l\theta - S_{nl} \sin l\theta \right) R_{nl}(\rho) + \sum_{n=0}^{N} \frac{C_{n0}}{2} R_{n0}(\rho)
\end{aligned}
$$

with

$$ C_{nl} = 2\,\mathrm{Re}(A_{nl}) = \frac{2(n+1)}{\pi} \iint_{x^{2}+y^{2} \le 1} f(x, y)\, R_{nl}(\rho) \cos l\theta \; dx\, dy $$

$$ S_{nl} = 2\,\mathrm{Im}(A_{nl}) = \frac{-2(n+1)}{\pi} \iint_{x^{2}+y^{2} \le 1} f(x, y)\, R_{nl}(\rho) \sin l\theta \; dx\, dy $$

As in Section 3.3, the azimuthal index l is limited by the condition

$$ n - l = \text{even} \quad \text{and} \quad n \ge l \tag{3.21} $$
Two digits of times-bold 14 font were reconstructed from the ZM. The reconstruction is done up to a certain high order, say 15; orders up to 15 render a total of 72 moments:

$$ 72 = \sum_{n=0}^{15} \left( \left\lfloor \frac{n}{2} \right\rfloor + 1 \right) $$
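The count can be checked by enumerating the admissible index pairs (n, l) with l ≥ 0 under the constraint of Equation 3.21. The helper below is a hypothetical illustration in plain Python.

```python
def zernike_indices(max_order, min_order=0):
    """All (n, l) with min_order <= n <= max_order, 0 <= l <= n, n - l even."""
    return [(n, l)
            for n in range(min_order, max_order + 1)
            for l in range(n % 2, n + 1, 2)]

print(len(zernike_indices(15)))      # 72 moments for orders 0 to 15
print(len(zernike_indices(15, 2)))   # 70 moments for orders 2 to 15
```

Dropping orders 0 and 1 leaves 70 moments, which matches the dimensionality q = 70 used in Section 3.4.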
FIGURE 3.6 Reconstruction via ZM. The original image and the reconstruction by 1st to 15th orders of moment show the effects of the orders in the reconstruction.
Figure 3.6 shows the original image and its reconstruction by ZM. It is evident that lower order ZMs capture gross shape information and that finer structures are filled in by higher order moments. Each digit is shown in 16 small frames: the original at the top left, followed by its reconstructions for orders 1 to 15, arranged from left to right and top to bottom. Most of the digits are well reconstructed by orders of around 11 to 15, except the digit '4'. We conjecture that handwritten digits with various writing styles need orders up to 15 for the reconstruction to be close enough to the original images. The possibly redundant variables introduced by the higher moments will be removed via PCA (see Section 3.4). Order 15 was chosen as the cut-off point for our handwritten digit data through visual inspection of Figure 3.6. In this way we have resolved the question of how large the feature set should be.
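The reconstruction experiment can be imitated numerically. The following NumPy sketch approximates the moments A_{nl} by a sum over pixel centers mapped into the unit disk and rebuilds the image from the moments with l ≥ 0, using the fact that the terms for −l are complex conjugates. It is a sketch under stated assumptions, not the authors' procedure: the function names are hypothetical, the image is assumed square, pixels outside the unit disk are ignored, and the discrete sums only approximate the integrals.

```python
import numpy as np
from math import factorial

def radial(n, l, rho):
    """Radial polynomial R_nl evaluated on an array rho (Equation 3.16)."""
    l = abs(l)
    total = np.zeros_like(rho)
    for s in range((n - l) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s) * factorial((n + l) // 2 - s)
                * factorial((n - l) // 2 - s)))
        total += c * rho ** (n - 2 * s)
    return total

def unit_disk_grid(size):
    """Map pixel centers to [-1, 1] x [-1, 1]; return rho, theta, disk mask."""
    coords = (2 * np.arange(size) + 1 - size) / size
    x, y = np.meshgrid(coords, coords)
    rho = np.hypot(x, y)
    return rho, np.arctan2(y, x), rho <= 1.0

def zernike_moments(img, max_order):
    """Discrete approximation of A_nl for l >= 0, as a dict {(n, l): A_nl}."""
    size = img.shape[0]
    rho, theta, inside = unit_disk_grid(size)
    area = (2.0 / size) ** 2                     # dx dy of one pixel
    moments = {}
    for n in range(max_order + 1):
        for l in range(n % 2, n + 1, 2):
            kernel = radial(n, l, rho) * np.exp(-1j * l * theta)   # conj(V_nl)
            moments[(n, l)] = ((n + 1) / np.pi * area
                               * np.sum(img[inside] * kernel[inside]))
    return moments

def reconstruct(moments, size):
    """Finite-order reconstruction (Equation 3.18) from moments with l >= 0."""
    rho, theta, inside = unit_disk_grid(size)
    out = np.zeros((size, size))
    for (n, l), a in moments.items():
        term = radial(n, l, rho) * (a * np.exp(1j * l * theta)).real
        # For l > 0 the pair (l, -l) contributes twice the real part.
        out += term if l == 0 else 2 * term
    return out * inside
```

Comparing `reconstruct(zernike_moments(img, N), img.shape[0])` with `img` for increasing N gives a numerical counterpart to the visual inspection of Figure 3.6.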
3.3.2 FEATURES FROM ZERNIKE MOMENTS
The advantages of ZMs for pattern recognition have been reported in terms of noise immunity and discrimination power,18 and of image representation ability, noise insensitivity, and information relevance.19 These are considered the basic theoretical support for ZM. A simulation study that supports the theoretical work can be found in Reference 20. Applications to pattern recognition also favor ZMs over other moments.21 Functions of the Zernike moments, called Zernike Moment Invariants (ZMIs), are introduced in order to obtain rotational invariance from different orders m and azimuthal indices h for a given order n and index l. Teague22 introduced a form of rotational invariance

$$ ZMI_{n0} = A_{n0}; \qquad ZMI_{nl} = |A_{nl}|^{2} \tag{3.22} $$

$$ ZMI_{nz} = \left[ A_{nl}^{*} (A_{mh})^{p} \right] \pm \left[ A_{nl}^{*} (A_{mh})^{p} \right]^{*} \tag{3.23} $$

where the integers m, n, h, l and the positive integer p satisfy the constraints

$$ m = \text{any integer}, \qquad h \le l, \qquad p = \frac{l}{h} \ \text{with}\ l \bmod h = 0, \qquad z = p + l + h \ \text{for index} $$
The first two invariants in Equation 3.22 are called primary invariants and the third, in Equation 3.23, secondary invariants. The number of primary invariants for a given order n is ⌊n/2⌋ + 1, due to the constraint n − |l| = even in Equation 3.12 of the ZM definition. The secondary invariants are found by forcing the exponential term to be 1, so that it becomes independent of the angle θ:

$$
\begin{aligned}
\left[ A_{nl}^{*} (A_{mh})^{p} \right] + \left[ A_{nl}^{*} (A_{mh})^{p} \right]^{*}
&= R_{nl}(r) \exp(-jl\theta)\, R_{mh}^{p}(r) \exp(jph\theta) + R_{nl}(r) \exp(jl\theta)\, R_{mh}^{p}(r) \exp(-jph\theta) \\
&= R_{nl}(r)\, R_{mh}^{p}(r) \left[ \exp\{ j(ph - l)\theta \} + \exp\{ -j(ph - l)\theta \} \right] \\
&= 2\, R_{nl}(r)\, R_{mh}^{p}(r) \cos (ph - l)\theta
\end{aligned}
\tag{3.24}
$$

with the constraint on p, h, and l ensuring that the cos() term is one, so that the result, proportional to R_{nl}(r) R^p_{mh}(r), is independent of the angle θ. Since there is no restriction on the order m of the secondary invariant, we could have an infinite number of invariants by varying m while satisfying Equation 3.23. However, by the definition of functional independence of the invariants, only n + 1 invariants are functionally independent. The moment invariants are functionally independent if the invariants can be solved for the moments that form them.21 n + 1 is the number of independent moments from the definition of ZM (Equation 3.11) and its constraints on the indices (Equation 3.12). Another set of Zernike moment invariants has been introduced recently.21 The idea is the same as that of Teague's in Equation 3.22 and Equation 3.23, and is given by

$$ ZMI'_{n0} = A_{n0}; \qquad ZMI'_{nl} = |A_{nl}| \tag{3.25} $$

$$ ZMI'_{nz} = \left[ A_{mh}^{*} (A_{nl})^{p} \right] \pm \left[ A_{mh}^{*} (A_{nl})^{p} \right]^{*} \tag{3.26} $$

where

$$ m \le n, \qquad h \le l, \qquad p = \frac{h}{l} \ \text{with}\ 0 \le p \le 1, \qquad z = \frac{l}{h} \ \text{for index} $$
The difference between this formulation and the original one (Equation 3.22 and Equation 3.23) is that the modulus values are taken instead of their squares, and that the constraints on the indices lead to rational powers rather than integer powers. The first constraint, m ≤ n, ensures that only combinations of moments of orders lower than n are used to form secondary invariants. The factor p ranges between 0 and 1. This constraint tends to decrease the magnitudes of the secondary invariants since p decreases as l increases. This magnitude-decreasing property of the new invariants ZMI′ (Equation 3.26) is desirable and was not present in the original ZMI of Equation 3.23. The secondary parts of the ZMI and ZMI′ (Equation 3.23 and Equation 3.26) are the additional (n/2) rotationally invariant values that are obtained from the power multiplication of the higher order moments or lower order moments, respectively. As shown by the ZMI and ZMI′, rotational invariance is obtained in various ways by forcing the phase factor of the complex-valued ZMs to be one. Using only radial information means that all the points on a circle of radius r in the complex domain are treated as the same. In addition, in digit recognition, digits that conflict under a 180-degree rotation, such as 9 and 6, are not taken care of. Khotanzad and Hong17 used the modulus value of the complex-valued ZM, the primary invariant, to eliminate the rotational problem. Their argument is based on the fact that the ZM of a rotated image f^r(x, y), obtained by rotation through θ, differs only by a simple phase shift:

$$
\begin{aligned}
A_{nl}^{r} &= \frac{n+1}{\pi} \int_{\phi=0}^{2\pi} \int_{r=0}^{1} f(r, \phi - \theta)\, R_{nl}(r) \exp(-il\phi)\; r\, dr\, d\phi \\
&= \frac{n+1}{\pi} \int_{\phi^{*}=0}^{2\pi} \int_{r=0}^{1} f(r, \phi^{*})\, R_{nl}(r) \exp(-il(\phi^{*} + \theta))\; r\, dr\, d\phi^{*} \\
&= A_{nl} \exp(-il\theta)
\end{aligned}
\tag{3.27}
$$

where φ* = φ − θ. The original function f(x, y) and the rotated one f^r(x, y) thus have the same modulus value:

$$ |A_{nl}| = |A_{nl}^{r}| $$

As a remedy to the 180-degree rotation conflict, we have added skewness information to the modulus values of all the variables used (up to order 15). The skewness of a two-dimensional function f(x, y) is obtained for each variable x and y. The skewness for an image function f(x, y) is given by

$$ S_{x} = \frac{\mu_{3,0}}{\left( \mu_{2,0} \right)^{3/2}} \tag{3.28} $$

$$ S_{y} = \frac{\mu_{0,3}}{\left( \mu_{0,2} \right)^{3/2}} \tag{3.29} $$
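A small NumPy sketch of Equations 3.28 and 3.29 shows why the skewness values can separate patterns that differ by a 180-degree rotation: rotating the image by 180 degrees reverses the sign of both third-order central moments while leaving the second-order ones unchanged. The function name is hypothetical, and the central moments are assumed to be normalized by M00.

```python
import numpy as np

def skewness(img):
    """Return (S_x, S_y) from central moments normalized by M00 (assumption)."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    dx = x - (x * img).sum() / m00
    dy = y - (y * img).sum() / m00

    def mu(p, q):
        # Central moment of order (p, q), normalized by the zeroth moment.
        return (dx ** p * dy ** q * img).sum() / m00

    return mu(3, 0) / mu(2, 0) ** 1.5, mu(0, 3) / mu(0, 2) ** 1.5
```

Calling `skewness(img)` and `skewness(img[::-1, ::-1])` on an asymmetric pattern returns values of equal magnitude and opposite sign.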
Two new variables carrying the skewness information are added to the modulus values of the ZM of orders 2 to 15. The 0th and 1st orders are deleted since the image has been preprocessed to be size standardized and centered on the centroid. The new moment moduli with the skewness values added are now not only rotation invariant but also free of the 180-degree rotation conflict. Section 3.5 includes the results from both the modulus values of ZM, called 'V', and the modulus values of ZM with skewness information added, called 'V1'. An argument is developed here to justify the use of only the real components of the ZM. The 180-degree rotation conflict problem is taken care of by the third order moments µ0,3 and µ3,0 of Equation 3.28 and Equation 3.29. This skewness information is contained in the real part of the phase components of the lower orders of ZM (A3,l and A2,l). We call the real part of the ZM 'R'. The number of real parts of the ZM for a given order n is ⌊n/2⌋ + 1, and they are obtained with m = even from Equation 3.20. That is, the real part of ZM is given by

$$
\begin{aligned}
C_{nl} = 2\,\mathrm{Re}\left[ A_{nl} \right] &= \frac{2(n+1)}{\pi} \iint_{x^{2}+y^{2} \le 1} R_{nl}(r) \cos(l\theta)\, f(x, y)\; dx\, dy \\
&= \frac{2(n+1)}{\pi} \sum_{\substack{k=|l| \\ n-k\ \text{even}}}^{n} \sum_{j=0}^{q} \sum_{\substack{m=0 \\ m\ \text{even}}}^{|l|} (-1)^{m/2} \binom{q}{j} \binom{|l|}{m} B_{n|l|k}\, M_{k-(2j+m),\, 2j+m}
\end{aligned}
\tag{3.30}
$$
with q = ½(k − |l|). Rotational invariance through the modulus operation on ZM or through moment invariants has been successful with patterns that have no 180-degree rotation conflict, such as printed English alphabets, aerial views of four of the Great Lakes, aircraft recognition tasks, etc. The circular symmetry property of the Zernike polynomials seems to handle the rotational variance of the patterns well. The Zernike polynomial Vnl(r, θ) is circularly symmetric with period 2π/l (Equation 3.13) and has a wedge shape that reflects the rotational variance of patterns. If the patterns from a group vary only within a certain orientation range (as is the case with handwritten digits), the modulus operation or the ZMI pays too high a price for rotational invariance. The range of the modulus of the complex-valued ZM is only a distance, represented by a radius in the complex domain. The real part of ZM, however, has a range twice as large as that of the modulus value; it explains more than the radius does. The modulus value (called 'V') or the squared modulus of the complex-valued ZM is the primary part of the Zernike moment invariants [ZMI] (Equation 3.22 and Equation 3.25). The secondary part of the ZMI shown in Equation 3.23 and Equation 3.26 is not included in our features because the secondary invariants are simply power multiplications that add another (n/2) orientation-independent values. Instead, we have followed the strategy of including the primary invariants of all the moments that were selected by the reconstruction process when finding the finite number of moments for the given patterns.
3.4 DIMENSIONALITY REDUCTION
The subject of dimensionality reduction in pattern recognition is concerned with mathematical tools for reducing the size of the feature set. The most revealing facts about dimensionality reduction are discussed in Reference 23 and summarized below:
• Reduction of the complexity of the physical system is required by feasibility limitations of either a technical or an economic nature.
• It ensures the reliability of the decision-making procedure by removing redundant and irrelevant information, which has a detrimental effect on the classification process.
• More importantly, the dimensionality is strongly related to the size of the sample used for training: as the dimensionality increases, the size of the training set required grows exponentially. Neural networks, however, train well regardless of the dimensionality, except that the networks require more time to learn and may converge poorly.
Two stages are employed for this purpose: Principal Component Analysis (PCA) is followed by Discriminant Analysis (DA), both of which are eigenanalyses of covariance-type matrices. These eigenanalyses can be interpreted as finding p < q directions on which the projections of the data have some interesting property, such as large variance or large separation among the group means, under a set of constraints.
3.4.1 PRINCIPAL COMPONENT ANALYSIS
Principal component analysis of a multivariate random sample can be viewed as finding an axis that optimizes a criterion in a geometric sense. An illustration with the projection of simulated two-dimensional data points is shown in Figure 3.7. Pearson (1901) looked for a new axis for which the projection gives the least sum of squares of d_i. Hotelling (1933) was interested in finding a new axis on which the projected values have maximum variance (see Reference 24). Even though the two approaches start from opposite criteria, the resulting axis is the same: the axis that minimizes the sum of the squares of the d_i's is the same as the axis that maximizes the variance of the z_i's,

$$ \max_{P_Y} \frac{\sum_{i} z_{i}^{2}}{N-1} \iff \min_{P_Y} \sum_{i} d_{i}^{2} \tag{3.31} $$

where P_Y is the projection operator defined by a projection axis.
FIGURE 3.7 Illustration of the projection of a vector point y_i onto the principal axis. z_i represents the projected value of y_i onto the axis and d_i the error component of the projection. z_i² + d_i² = const confirms the equivalence of the two motivations for finding the optimal axis.
The idea of PCA is to find a rotational transformation (i.e., an orthogonal transformation) matrix R of size q × q such that the sample variances of the new rotated variables are in decreasing order of magnitude.24 Thus the first principal component is such that the projections of the given points onto it have maximum variance among all possible linear coordinates; the second principal component has maximum variance subject to being orthogonal to the first; and so on. The PCA is done on the sample version of the total covariance matrix T (of size q × q) of the handwritten data matrix Y (of size N × q), where the dimensionality is q = 70 and the sample size is N = ∑_{j=1}^{J} n_j = 1000, and where J = 10 is the number of classes. The lower curve of the plot in Figure 3.8 is called a scree plot and represents the variance information contained in the newly derived variates. The upper curve represents the accumulated version of the lower scree curve, that is, the total variance of the newly obtained variables from the first up to the corresponding variable index. The 95% and 99% levels of the accumulated variance are indicated by the two broken lines. A 99% explanation of the variance information is obtained with the first 35 newly obtained variables.
FIGURE 3.8 PCA on the sample total covariance matrix of the handwritten data set 'R.'
Since the dimensionality of the original data, q = 70, was too large, we reduced the dimensionality using PCA of the total covariance. With the new data set, which is supposed to be uncorrelated (or less correlated), we are ready to do further statistical treatment in order to find multidimensional outliers for robust analysis and to reduce the heteroscedasticity (and, as a by-product, to enhance the multinormality, if that is possible at all). The strategy we follow for such a large dimensionality is a two-step dimensionality reduction. First, principal component analysis is carried out on the total sample covariance matrix, T. Then discriminant analysis follows, in order to reduce the dimensionality even further to J − 1. Even though PCA is well known to be sensitive to outliers,25,26 we argue that the whole data set is preserved, as much as we want, in a lower dimensional space, provided that the explanation of the variance information is over, say, 99%. The whole data set, taken as a single batch from the different clusters of the different classes, is decorrelated via principal component analysis. The lower, p = 35 dimensional space is then processed by discriminant analysis for further dimensionality reduction.
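The PCA step, including the choice of the number of components from the accumulated variance, can be sketched with NumPy as follows. The function name and return values are assumptions made for this illustration. The data matrix is taken to have one pattern per row, as in the N × q matrix Y of the text.

```python
import numpy as np

def pca_reduce(Y, explained=0.99):
    """Project Y (N x q) onto the leading principal components.

    Returns the scores (N x p), the rotation R (p x q), and the accumulated
    fraction of variance for all q components.
    """
    centered = Y - Y.mean(axis=0)
    T = centered.T @ centered / (len(Y) - 1)     # sample total covariance
    eigvals, eigvecs = np.linalg.eigh(T)         # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Smallest p whose accumulated variance reaches the requested level.
    p = min(int(np.searchsorted(cumulative, explained)) + 1, Y.shape[1])
    R = eigvecs[:, :p].T
    return centered @ R.T, R, cumulative
```

Plotting the sorted eigenvalues and `cumulative` against the component index would give curves of the kind described for Figure 3.8.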
3.4.2 DISCRIMINANT ANALYSIS
Suppose that we wish to find a linear transformation matrix F that maximizes some distance criterion d defined over a sample of random vectors in a new transform space. Two interesting pairwise distance measures are the intraset and the interset distances.27 The intraset distance, or averaged within-class distance, between the kth variable of all pattern vectors in one class, averaged over all classes, is

$$ d_{W}^{(k)} = \frac{1}{2} \sum_{i=1}^{J} P(w_i)\, \frac{1}{n_i^{2}} \sum_{j=1}^{n_i} \sum_{l=1}^{n_i} f_k^{t} \left( y_{ij} - y_{il} \right) \left( y_{ij} - y_{il} \right)^{t} f_k \tag{3.32} $$
where n_i is the number of vectors y ∈ w_i and f_k is the kth column of the transformation matrix F. The interset distance, or between-class distance, of the kth direction in the new transform space is defined as

$$ d_{B}^{(k)} = \sum_{i=2}^{J} \sum_{h=1}^{i-1} P(w_i)\, P(w_h)\, \frac{1}{n_i n_h} \sum_{j=1}^{n_i} \sum_{l=1}^{n_h} f_k^{t} \left( y_{ij} - y_{hl} \right) \left( y_{ij} - y_{hl} \right)^{t} f_k \tag{3.33} $$

The first two summation indices run over the J(J − 1)/2 pairs of classes. These averaged distance measures are expressed in terms of the sample within-groups covariance matrix W and the between-groups matrix B,28 defined as:

$$ W = \frac{1}{N-J} \sum_{i=1}^{J} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right) \left( y_{ij} - \bar{y}_i \right)^{t} $$

$$ B = \frac{1}{J-1} \sum_{i=1}^{J} n_i \left( \bar{y}_i - \bar{y} \right) \left( \bar{y}_i - \bar{y} \right)^{t} $$

respectively, where N = ∑_{i=1}^{J} n_i. Using the definitions of W and B, the distance measures d_W^{(k)} (Equation 3.32) and d_B^{(k)} (Equation 3.33) can be written in terms of W and B as follows:

$$ d_{W}^{(k)} = f_k^{t} W f_k, \qquad d_{B}^{(k)} = f_k^{t} B f_k $$
Now, we are interested in maximizing the distance measure $d_B(k) = f_k^t B f_k$ with respect to the transformation vector $f_k$ subject to a constraint, e.g., holding the distance measure $d_W(k) = f_k^t W f_k$ constant. The constraints are usually chosen to be irrelevant for the maximization of $d_B(k)$ while guaranteeing a unique solution $f_k$, e.g., $f_k^t W f_k = 1$. The solution for this kind of optimization problem can be obtained by the method of Lagrange multipliers. Maximization of $d_B(k)$, subject to $d_W(k)$ constant, has the form

$$\max_{f_k} J = \left\{ d_B(k) - \lambda \left( d_W(k) - \text{const} \right) \right\} = \left\{ f_k^t B f_k - \lambda \left( f_k^t W f_k - \text{const} \right) \right\}$$
Setting the first derivative of J with respect to $f_k$ equal to zero yields

$$\frac{\partial J}{\partial f_k} = B f_k - \lambda W f_k = 0$$

$$(B - \lambda W) f_k = 0$$

If we premultiply the above by $W^{-1}$, it results in an eigenvalue problem, i.e.,

$$(W^{-1} B - \lambda I) f_k = 0 \tag{3.34}$$
The traditional disCRIMinant COORDinate system (or CRIMCOORD) is interpreted as finding functions that maximize the quadratic forms $d_B(k) = f_k^t B f_k$ with respect to $f_k$, subject to the constraint

$$d_W(k) = f_k^t W f_k = 1 \tag{3.35}$$

resulting in the solution of Equation 3.34. Two consecutive linear transformations by R (via PCA) followed by F (via DA) are represented by a linear transformation matrix FR of dimension (J − 1) × q, for example, 9 × 70 for our data set. Figure 3.9 shows two-dimensional projections of 30 randomly selected patterns from each group on the first five discriminant variates (CRIMCOORD) with corresponding digit representation. Remarkably, some distinction of the digits is clear from the figures, implying that the discriminant variates discriminate among the different groups.
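The eigenvalue problem of Equation 3.34 can be illustrated with a minimal sketch. It assumes two-dimensional feature vectors so that W can be inverted in closed form, and it finds only the leading discriminant direction by power iteration; the function names and the toy data are hypothetical and are not taken from the study.

```python
# Sketch of Equation 3.34 for 2-D feature vectors (pure Python, illustrative only).

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def add_outer(M, u, scale):
    """Add scale * u u^t to the matrix M in place."""
    for r in range(len(u)):
        for c in range(len(u)):
            M[r][c] += scale * u[r] * u[c]

def scatter_matrices(classes):
    """Return the within-groups matrix W and the between-groups matrix B."""
    J = len(classes)
    N = sum(len(c) for c in classes)
    p = len(classes[0][0])
    grand = mean([v for c in classes for v in c])
    W = [[0.0] * p for _ in range(p)]
    B = [[0.0] * p for _ in range(p)]
    for c in classes:
        m = mean(c)
        for v in c:
            add_outer(W, [v[d] - m[d] for d in range(p)], 1.0 / (N - J))
        add_outer(B, [m[d] - grand[d] for d in range(p)], len(c) / (J - 1))
    return W, B

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(len(v))) for r in range(len(M))]

def quad(M, v):
    """Quadratic form v^t M v."""
    return sum(v[r] * x for r, x in enumerate(matvec(M, v)))

def leading_discriminant(W, B, iterations=200):
    """Power iteration on W^{-1} B (2 x 2 case only); assumes B f never vanishes."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    W_inv = [[W[1][1] / det, -W[0][1] / det],
             [-W[1][0] / det, W[0][0] / det]]
    f = [1.0, 1.0]
    for _ in range(iterations):
        f = matvec(W_inv, matvec(B, f))
        norm = max(abs(x) for x in f)
        f = [x / norm for x in f]
    scale = quad(W, f) ** 0.5          # enforce the constraint f^t W f = 1
    f = [x / scale for x in f]
    lam = quad(B, f)                   # equals lambda because f^t W f = 1
    return f, lam

# Two toy classes separated along the first coordinate only.
class_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
class_b = [(4, 0), (6, 0), (4, 2), (6, 2)]
W, B = scatter_matrices([class_a, class_b])
f1, lam1 = leading_discriminant(W, B)
print(f1, lam1)
```

For this toy data the discriminant direction lies along the first coordinate, which is the only direction in which the class means differ. A full implementation would extract all J − 1 eigenvectors with a general eigen-solver.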
FIGURE 3.9 Two-dimensional projections of the handwritten data with the first five discriminant variates.
3.5 ANALYSIS OF PREDICTION ERROR RATES FROM BOOTSTRAPPING ASSESSMENT
Prediction error is usually a good measure of the performance of pattern recognition systems. In practice, a random sample, called training data, from an unknown population described by distribution F is given. Any statistic θ(F) requires the distribution F, but in practice F is not known and is difficult to estimate. In a bootstrap setting, an empirical distribution $\hat{F}$ is defined from the given sample by placing an equal probability mass 1/N on each of the values $x_i$. A bootstrap sample is a random sample from the empirical distribution,

$$X_1^*, X_2^*, \ldots, X_N^* \sim \hat{F}$$

Each $x_i^*$ is drawn independently with replacement and with equal probability from the sample, i.e., the training data:

$$X = \{x_i\}_1^N = \{(\upsilon_i, y_i)\}_1^N$$
Standard error and bias estimation using bootstrap resampling techniques can be found in the references.29,30 Here we introduce the algorithms for estimation of the standard error and the bias for prediction error estimation, leaving the technical details to the references above. The Monte Carlo bootstrapping algorithm proceeds in three steps:
1. using a random number generator, independently draw a large number 50 ≤ B ≤ 200 of bootstrap samples, $\{F^{*b}\}_{b=1}^{B}$,
2. for each bootstrap sample $F^{*b}$, evaluate the statistic of interest, $\hat{\theta}^*(b) = \hat{\theta}(F^{*b})$ for b ∈ {1, 2,…, B} from the training data X,
3. calculate the sample standard deviation of the $\hat{\theta}^*(b)$ values

$$\hat{\sigma}_B = \left\{ \frac{1}{B - 1} \sum_{b=1}^{B} \left[ \hat{\theta}^*(b) - \hat{\theta}^*(\cdot) \right]^2 \right\}^{1/2}, \qquad \hat{\theta}^*(\cdot) = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^*(b) \tag{3.36}$$
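The three steps above can be sketched as follows. This is a minimal illustration that uses the sample mean as the statistic of interest; the function names, the toy data, and the choice of statistic are assumptions made for the example.

```python
import math
import random

def bootstrap_standard_error(data, statistic, B=200, seed=0):
    """Monte Carlo bootstrap estimate of the standard error (Equation 3.36)."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(B):
        # Step 1: draw N values with replacement, each with probability 1/N.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        # Step 2: evaluate the statistic on the bootstrap sample.
        replicates.append(statistic(sample))
    # Step 3: sample standard deviation of the B replicates.
    center = sum(replicates) / B
    variance = sum((r - center) ** 2 for r in replicates) / (B - 1)
    return math.sqrt(variance), replicates

def sample_mean(values):
    return sum(values) / len(values)

data = list(range(1, 21))                      # hypothetical training values
se, reps = bootstrap_standard_error(data, sample_mean)
print("bootstrap standard error of the mean:", se)
```

Any other statistic, such as the error rate of a classifier trained on the bootstrap sample, can be passed in place of `sample_mean`.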
Standard errors are crude but useful measures of statistical accuracy.31 An approximate confidence interval for an unknown parameter θ is given by

$$\theta \in \hat{\theta} \pm \hat{\sigma} z^{(\alpha)} \tag{3.37}$$

where $z^{(\alpha)}$ is the 100 · α percentile point of a standard normal variate, e.g., $z^{(0.95)} = 1.64485$. The standard error approximation (Equation 3.37) for a confidence interval bears the assumption that
$$\frac{\hat{\theta} - \theta}{\hat{\sigma}} \sim N(0, 1)$$

The bias of an estimator $\hat{\theta}$ is the next to be considered. Bootstrap bias estimation is an estimation of the optimistic bias op resulting from using the same training data for prediction, e.g., via the resubstitution method. One way to estimate the system performance from the given sample is to correct the apparent error rate (or resubstitution error rate) by the estimate of the optimistic (or positive) bias. The optimistic bias is defined as

$$op(X; F) = \theta - \theta_{app}$$

where θ is the true error rate for the unknown distribution F and $\theta_{app}$ is the apparent error rate. Since we do not know the bias op(X; F), the bootstrap estimate of the bias, $op_{boot}$, is found instead, and the optimistic $\theta_{app}$ is corrected by adding the estimated bias

$$\hat{\theta} = \theta_{app} + op_{boot} \tag{3.38}$$
Let η(υ, X) be a decision rule based on the training set X and let $Q[y_i, \eta(\upsilon_i, X)]$ be an indicator of misclassification of $\upsilon_i$ by η(·):

$$Q[y_i, \eta(\upsilon_i, X)] = \begin{cases} 1 & \eta(\upsilon_i, X) \neq y_i, \\ 0 & \text{otherwise} \end{cases}$$
Thus Q[·] = 1 indicates the misclassification of a training observation by the system designed from the training data. The bootstrap procedure for estimating the bias, $op_{boot}$, follows:
1. Select 50 ≤ B ≤ 200 bootstrap samples from the empirical distribution $\hat{F}$.
2. From each bootstrap sample compute the bias $w_b$

$$w_b = \sum_{i=1}^{N} \left( \frac{1}{N} - P_i^{*b} \right) Q[y_i, \eta(\upsilon_i, F^{*b})]$$

with $P_i^{*b}$ indicating the proportion of the bootstrap sample on $x_i$, i.e.,

$$P_i^{*b} = \text{Cardinality of } \{ j \mid x_j^{*b} = x_i \} / N$$

and $\eta(\upsilon_i, F^{*b})$ being the prediction of $\upsilon_i$ from the system trained by $F^{*b}$.
3. Repeat step 2 to get $\{w_1, w_2, \ldots, w_B\}$. Then the bootstrap bias $op_{boot}$ is estimated by

$$op_{boot} = \frac{1}{B} \sum_{b=1}^{B} w_b$$
and thus, the bootstrap error estimate $\hat{\theta}$ (Equation 3.38) is obtained. E0 prediction error estimation is equivalent to counting the number of patterns that are not included in the bootstrap samples and normalizing the misclassification count of the samples32 by the total number of the training patterns not selected in the bootstrap samples. Thus, E0 uses the testing set, which is asymptotically 36.8% of the original training set, according to the argument that follows. In a typical bootstrap sample, about 63% of the original observations are likely to be chosen. This is easily seen since the probability that an observation does not belong to a bootstrap sample is

$$(1 - 1/N)^N \approx 1/e.$$

Thus an observation $x_i$ will be in the bootstrap sample with a probability of about 1 − 1/e = 63.2%. Let $A_b = \{ i \mid P_i^{*b} = 0 \}$ denote the index set of training patterns that do not appear in $F^{*b}$; then the prediction error $\theta_0$ estimated by the E0 estimator is defined by
$$\theta_0 = \frac{\sum_{b=1}^{B} \sum_{i \in A_b} Q[y_i, \eta(\upsilon_i, F^{*b})]}{\sum_{b=1}^{B} \text{Cardinality of } \{A_b\}}.$$
This E0 estimator is a form of cross-validation in that the testing data have not been used in training. The difference from cross-validation is that E0 separates the training and the testing data randomly, while cross-validation selects the testing patterns sequentially such that all the training patterns are used for testing. The testing patterns used in the apparent error rate obtained by the resubstitution method are too close to the training patterns (the distance is 'zero'), while the test patterns for the E0 estimator are 'too far' from the training set. Because the asymptotic probability that a pattern will not be included in a bootstrap sample is 0.368, the weighted average of $\theta_{app}$ and $\theta_0$ involves patterns at the 'right' distance from the training set in estimating the error rate:32

$$\theta_{632} = 0.368 \, \theta_{app} + 0.632 \, \theta_0 \tag{3.39}$$
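A minimal sketch of the E0 and E632 estimators may help to fix the bookkeeping. It assumes a toy one-dimensional nearest-class-mean rule in place of the neural network classifiers used in the study; the data, the decision rule, and all function names are hypothetical.

```python
import random

def train_nearest_mean(sample):
    """Toy decision rule eta: store the mean feature value of every class."""
    sums, counts = {}, {}
    for v, y in sample:
        sums[y] = sums.get(y, 0.0) + v
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def misclassified(rule, v, y):
    """Indicator Q: 1 if the rule assigns v to a class other than y."""
    predicted = min(rule, key=lambda label: abs(v - rule[label]))
    return int(predicted != y)

def e632(data, B=100, seed=0):
    rng = random.Random(seed)
    n = len(data)
    # Apparent (resubstitution) error: train and test on the same data.
    full_rule = train_nearest_mean(data)
    theta_app = sum(misclassified(full_rule, v, y) for v, y in data) / n
    errors = held_out = 0
    for _ in range(B):
        drawn = [rng.randrange(n) for _ in range(n)]
        rule = train_nearest_mean([data[i] for i in drawn])
        a_b = set(range(n)) - set(drawn)       # patterns absent from F*b
        errors += sum(misclassified(rule, *data[i]) for i in a_b)
        held_out += len(a_b)
    theta_0 = errors / held_out                # E0 estimate
    return {
        "app": theta_app,
        "e0": theta_0,
        "e632": 0.368 * theta_app + 0.632 * theta_0,   # Equation 3.39
        "held_out_fraction": held_out / (B * n),
    }

gen = random.Random(1)
data = ([(gen.gauss(0.0, 1.0), 0) for _ in range(20)]
        + [(gen.gauss(2.0, 1.0), 1) for _ in range(20)])
result = e632(data)
print(result)
```

The reported `held_out_fraction` should be close to (1 − 1/N)^N, i.e., roughly 0.37, which is the origin of the weights in Equation 3.39.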
The E632 estimator was shown to be optimal in terms of least variance and bias in a comparison study of various estimators,32,33 among cross-validation, ordinary bootstrap bias correction (Equation 3.38), and E632 (Equation 3.39). We used the E632 prediction error as a standard performance measure. (The bootstrap package bootstrap.funs* contains various resampling techniques and is available via anonymous ftp to [email protected].) The boxplots in Figure 3.10 represent the E632 estimator superimposed on the distribution of the B = 100 bootstrap sample errors, the $\hat{\theta}^*(b)$'s (Equation 3.36). The median value of the B error rates is replaced by the E632 estimate; thus the B bootstrap errors are shifted according to the E632 estimate. For ease of display and understanding of the system performance, the recognition rates, the $(1 - \hat{\theta})$'s, are plotted. The height of the box is the inter-quartile range, which is the difference between the upper quartile and the lower quartile, and is considered to be a robust estimate of a scalar multiple of the dispersion. The median of the batch is represented by the line in the box. The bars, represented by the vertical dotted lines, are extended up to the points at 1.5 times the inter-quartile range. Outliers are represented by the individual dots to signify their existence. The boxplots display the distribution very simply but well enough, especially when many different batches are to be compared.

The correct recognition rates from three-layer feed-forward neural networks with the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) algorithm (in reference 34) are displayed in the boxplots for each data set obtained from the different treatment in Section 3.3.2. The classifiers* used in this study can be obtained via anonymous ftp to [email protected]. Each boxplot shows the distribution of the recognition rate of 100 systems designed by 100 bootstrap samples.

* The bootstrap was contributed by Efron and Tibshirani.

FIGURE 3.10 Boxplots from nnet with hidden layer size = 15 for data sets of V, V1, R. E632 prediction error rate and B = 100 bootstrap samples are used.
3.6 SUMMARY
A simple model-free feature extraction by the two-dimensional Zernike polynomials was shown to yield a powerful pattern recognition system for handwritten digits (correct recognition rate ≳ 95% by the E632 prediction error measure). The images are preprocessed before ZM calculations take place, and the dimensionality of the feature vectors is reduced by PCA followed by DA. For the 180-degree rotation conflict data, addition of the skewness variables improves the performance of the system. Simply by taking the real (or imaginary) part of the complex-valued Zernike moments, one obtains more information than what is lost by the rotational invariance operation for the rotational variance of the patterns, which is inherent in handwritten digit data. The rotational variance of the patterns seems to be observed by the wedge type Zernike polynomials. The addition of skewness information (V1) to the modulus value (V) of the complex-valued ZM generally improves the correct recognition rate by 2–3%, while the real part (R) generally yields a 3–4% improvement over the modulus value (V). The wedge shape of the polynomial also possesses the important property that variation around the outer region of the patterns results in less variance than that from Cartesian coordinates, such as the regular moments and their invariants.
* The package nnet is contributed by Ripley.
ACKNOWLEDGMENTS The authors wish to thank Dr. R. Gnanadesikan for insightful discussions and Mr. G. Kontaxakis for help with the final version of the manuscript.
REFERENCES
1. Tappert, C. C., Suen, C. Y., and Wakahara, T., The state of the art in on-line handwriting recognition, IEEE Trans. Pattern Anal. Machine Intelligence, 12(8), 787, 1990.
2. Le Cun, Y., Jackel, L. D., Boser, B., Denker, J. S., Graf, H. P., Guyon, I., Henderson, D., Howard, R. E., and Hubbard, W., Handwritten digit recognition: applications of neural network chips and automatic learning, IEEE Commun. Mag., 41, November 1989.
3. Suen, C. Y., Berthod, M., and Mori, S., Automatic recognition of handprinted characters: the state of the art, Proc. IEEE, 68(4), 469, April 1980.
4. Mantas, J., An overview of character recognition, Patt. Recog., 19(6), 425, 1986.
5. Mitchell, B. T. and Gillies, A. M., A model-based computer vision system for recognizing handwritten zip codes, Mach. Vision Applic., 2, 231, 1989.
6. Wang, C. H. and Srihari, S. N., A framework for object recognition in a visually complex environment and its application to locating address blocks on mail pieces, Int. J. Comput. Vision, 2, 125, 1988.
7. Jähne, B., Digital Image Processing: Concepts, Algorithms and Scientific Applications, Springer-Verlag, Berlin, 1991.
8. Pavlidis, T., A thinning algorithm for discrete binary images, Comput. Graphics Image Process., 13, 142, 1980.
9. Zhang, T. Y. and Suen, C. Y., A fast parallel algorithm for thinning digital patterns, Commun. ACM, 27(3), 236, 1984.
10. Haralick, R. M. and Shapiro, L. G., Computer and Robot Vision, Vol. 1, Addison-Wesley, Reading, MA, 1992.
11. Serra, J., Image Analysis and Mathematical Morphology, Academic Press, New York, 1982.
12. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1988.
13. Chung, W., A Strategy for Visual Pattern Recognition, PhD thesis, Electrical and Computer Engineering, Rutgers University, The State University of New Jersey, 1994.
14. Teh, C-H. and Chin, R. T., On digital approximation of moment invariants, Comput. Vision, Graphics, Image Process., 33, 318, 1986.
15. Dudani, S. A., Breeding, K. J., and McGhee, R. B., Aircraft identification by moment invariants, IEEE Trans. Comput., 26(1), 39, January 1977.
16. Bhatia, A. B. and Wolf, E., On the circle polynomials of Zernike and related orthogonal sets, Proc. Camb. Phil. Soc., 50, 40, 1954.
17. Khotanzad, A. and Hong, Y. H., Invariant image recognition by Zernike moments, IEEE Trans. Patt. Anal. Mach. Intell., 12(5), 489, May 1990.
18. Abu-Mostafa, Y. S. and Psaltis, D., Recognitive aspects of moment invariants, IEEE Trans. Patt. Anal. Mach. Intell., 6(6), 698, November 1984.
19. Teh, C-H. and Chin, R. T., On image analysis by the methods of moments, IEEE Trans. Patt. Anal. Mach. Intell., 10(4), 496, July 1988.
20. Chung, W. and Micheli-Tzanakou, E., A simulation study for different moment sets, in Document Recognition, IS&T/SPIE 1994 International Symposium on Electronic Imaging, February 1994, 378.
21. Belkasim, S. O., Shridhar, M., and Ahmadi, M., Pattern recognition with moment invariants: a comparative study and new results, Patt. Recog., 24(12), 1117, 1991.
22. Teague, M. R., Image analysis via the general theory of moments, J. Opt. Soc. Am., 70(8), 920, August 1980.
23. Kittler, J., Feature selection and extraction, in Handbook of Pattern Recognition and Image Processing, Academic Press, New York, 1986, Chap. 3, 59.
24. Gnanadesikan, R., Methods for Statistical Data Analysis of Multivariate Observations, John Wiley & Sons, New York, 1977.
25. Ammann, L. P., Robust singular value decompositions: a new approach to projection pursuit, J. Am. Stat. Assoc., 88(422), 505, 1993.
26. Devlin, S., Gnanadesikan, R., and Kettenring, J., Robust estimation of dispersion matrices and principal components, J. Am. Stat. Assoc., 76, 354, 1981.
27. Kittler, J., Mathematical methods of feature selection in pattern recognition, Int. J. Man-Machine Stud., 7, 609, 1975.
28. Kittler, J. and Young, P. C., A new approach to feature selection based on the Karhunen-Loeve expansion, Patt. Recogn., 5, 335, 1973.
29. Efron, B. and Gong, G., A leisurely look at the bootstrap, the jackknife, and cross-validation, Am. Stat., 37(1), 36, February 1983.
30. Efron, B. and Tibshirani, R. J., An Introduction to the Bootstrap, Chapman & Hall, London, 1993.
31. Efron, B. and Tibshirani, R. J., Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy, Stat. Sci., 1(1), 54, 1986.
32. Efron, B., Estimating the error rate of a prediction rule: improvement on cross-validation, J. Am. Stat. Assoc., 78(382), 316, June 1983.
33. Jain, A. K., Dubes, R. C., and Chen, C. C., Bootstrap techniques for error estimation, IEEE Trans. Patt. Anal. Mach. Intell., 9(5), 628, September 1987.
34. Peressini, A. L., Sullivan, F. E., and Uhl, J. J., Jr., The Mathematics of Nonlinear Programming, Springer-Verlag, Berlin, 1988.
4 Other Types of Feature Extraction Methods
Evangelia Micheli-Tzanakou, Ahmet Ademoglu, and Cynthia Enderwick
4.1 INTRODUCTION
If a signal contains frequency components emerging and vanishing in certain time intervals, then localization in time as well as in frequency is required. The traditional method proposed for such an analysis is the Short Time Fourier Transform (STFT) or Gabor Transform.5 The STFT enables the time localization of a certain sinusoidal frequency but with the inherent limitation of Heisenberg's uncertainty principle, which states that resolution in time and frequency cannot be arbitrarily small, because their product is lower bounded by

$$\Delta t \, \Delta f \geq \frac{1}{4\pi} \tag{4.1}$$
In order to overcome the resolution limitation of the STFT, a decomposition of square integrable signals, $L^2(R)$, has recently been developed under the name of wavelets.1,10 These families of functions $h_{a,b}$,

$$h_{a,b}(t) = |a|^{-1/2} \, h\left( \frac{t - b}{a} \right), \quad a, b \in R, \; a \neq 0 \tag{4.2}$$
are generated from a single function h(t) by the operations of dilation and translation. The wavelet transform of a continuous signal x(t) can be defined as

$$CWT_x(b, a) = \left\langle x(t), \, |a|^{-1/2} h^*\left( \frac{t - b}{a} \right) \right\rangle = |a|^{-1/2} \int x(t) \, h^*\left( \frac{t - b}{a} \right) dt \tag{4.3}$$

where * represents complex conjugation and ⟨·,·⟩ represents the inner product. Equation 4.3 is interpreted as a multiresolution decomposition of the signal into a set of frequency channels having the same bandwidth on a logarithmic scale (i.e., constant Q or constant relative bandwidth frequency analysis by octave band filters). For the STFT, the phase space is uniformly sampled, whereas in the wavelet transform the sampling in frequency is logarithmic, which enables one to analyze higher frequencies in shorter windows and lower frequencies in longer windows in time.
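As a minimal numerical sketch of Equation 4.3, the integral can be approximated by a Riemann sum. The Mexican hat function is used here as the prototype h(t) purely for illustration (the chapter does not prescribe it), and the sampling step, scales, and test signal are likewise assumptions.

```python
import math

def mexican_hat(u):
    """Real, even prototype wavelet h(u) = (1 - u^2) exp(-u^2 / 2)."""
    return (1.0 - u * u) * math.exp(-0.5 * u * u)

def cwt_point(x, dt, b, a, wavelet=mexican_hat):
    """Riemann-sum approximation of Equation 4.3 at shift b and scale a.

    x[n] holds the samples x(n * dt). The wavelet is real here, so the
    complex conjugate in Equation 4.3 has no effect.
    """
    total = 0.0
    for n, sample in enumerate(x):
        total += sample * wavelet((n * dt - b) / a)
    return abs(a) ** -0.5 * total * dt

dt = 0.001
# Two seconds of a 5 Hz cosine.
signal = [math.cos(2 * math.pi * 5.0 * n * dt) for n in range(2001)]
matched = cwt_point(signal, dt, b=1.0, a=0.045)     # scale tuned near 5 Hz
mismatched = cwt_point(signal, dt, b=1.0, a=0.4)    # scale tuned well below 5 Hz
print(matched, mismatched)
```

The coefficient at the small scale responds strongly to the 5 Hz component while the coefficient at the large scale is nearly zero, which illustrates the shorter analysis windows used for higher frequencies.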
4.2 WAVELETS As mentioned in the introduction, wavelet analysis has practically become a ubiquitous tool in image and signal processing. Two basic properties, space and frequency localization and multiresolution analysis, make this a very attractive tool in signal and image analysis. Unlike the complex sine and cosine basis functions of the Fourier transform, the basis functions of the wavelet transform are localized in both space and frequency. Taking the wavelet transform of an image involves convolving a pair of filters, one high pass and one low pass, with the image. This is followed by decimation by two and repeated for as many octaves as desired. This algorithm is depicted in Figure 4.1, which shows a one-octave decomposition of an image into four components: low pass rows, low pass columns (LP-LP); high pass rows, low pass columns (HP-LP); low pass rows, high pass columns (LP-HP); and high pass rows, high pass columns (HP-HP). These will later be referred to as components 0 through 3, respectively. For the purposes of decomposing images, subsequent octaves were created by transforming the LP-LP component of the previous octave.
FIGURE 4.1 Wavelet transform algorithm — sub-band decomposition of one octave. HP = high-pass, LP = low-pass, ↓ 2 represents decimation by 2.
The result of the wavelet transform on a test image is shown in Figure 4.2. Figure 4.2a shows the original image Lena.bmp. Figure 4.2b shows the first octave wavelet decomposition using a 4-tap Daubechies filter bank1 (see also Press et al.15) enhanced to show the high-pass components. From left to right, top to bottom are LP-LP (component 0), HP-LP (component 1), LP-HP (component 2), and HP-HP (component 3). Notice that component 1 accentuates the vertically oriented details, component 2 accentuates the horizontally oriented details, and component 3 accentuates the 45° and 135° diagonal details.

FIGURE 4.2 The wavelet transform of Lena.bmp. Note that (b) has been enhanced to accentuate the detail coefficients (Reproduced by special permission of Playboy, © 1972).

While there exist many practical wavelet filters applicable to a wide variety of problems, the filter used in our analysis is the Daubechies 8-tap filter. These filter coefficients are

High-pass coefficients = {0.010597, 0.032883, –0.030841, –0.187035, 0.027984, 0.630881, –0.714847, 0.230378}
Low-pass coefficients = {0.230378, 0.714847, 0.630881, –0.027984, –0.187035, 0.030841, 0.032883, –0.010597}
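The filtering and decimation step can be sketched in one dimension with the 8-tap coefficients listed above. This is an illustrative sketch, not the implementation used for the figures: it assumes periodic extension at the signal boundaries, and the function name is hypothetical. For an image, the same operation would be applied along the rows and then along the columns to obtain the four components.

```python
import math

LOW_PASS = [0.230378, 0.714847, 0.630881, -0.027984,
            -0.187035, 0.030841, 0.032883, -0.010597]
HIGH_PASS = [0.010597, 0.032883, -0.030841, -0.187035,
             0.027984, 0.630881, -0.714847, 0.230378]

def analyze_one_octave(x):
    """Filter x with both kernels and keep every second output sample.

    Periodic extension is assumed at the boundaries; len(x) must be even.
    """
    n = len(x)
    approx, detail = [], []
    for k in range(n // 2):
        a = d = 0.0
        for m in range(len(LOW_PASS)):
            sample = x[(2 * k + m) % n]
            a += LOW_PASS[m] * sample
            d += HIGH_PASS[m] * sample
        approx.append(a)
        detail.append(d)
    return approx, detail

x = [math.sin(0.4 * k) + 0.1 * k for k in range(16)]
approx, detail = analyze_one_octave(x)
energy_in = sum(v * v for v in x)
energy_out = sum(v * v for v in approx) + sum(v * v for v in detail)
print(energy_in, energy_out)
```

Because the filter pair is orthonormal, the energy of the two half-length outputs matches the energy of the input up to the rounding of the tabulated coefficients. Further octaves are obtained by applying the function again to `approx`.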
4.2.1 DISCRETE WAVELET SERIES
Although various ways of discretizing time-scale parameters are possible, the conventional scheme is the so-called dyadic grid sampling, where time remains continuous but the time-scale parameters are sampled by choosing $a = 2^i$ and $b = k 2^i$, with $i, k \in Z$. The wavelets in this case are given by

$$h_{i,k}(t) = 2^{-i/2} \, h(2^{-i} t - k). \tag{4.4}$$
A wavelet series decomposes a signal x(t) onto a basis of continuous-time wavelets, the so-called synthesis wavelets $\alpha_{i,k}(t)$, as shown

$$x(t) = \sum_{i \in Z} \sum_{k \in Z} C_{i,k} \, \alpha_{i,k}(t) \tag{4.5}$$

The wavelet coefficients are defined as

$$C_{i,k} = \int x(t) \, h_{i,k}^*(t) \, dt \tag{4.6}$$
The signal decomposition may be done by using orthogonal wavelets,1 in which case the analysis and synthesis wavelets are identical.
4.2.2 DISCRETE WAVELET TRANSFORM (DWT)
The DWT is very close to a wavelet series, but in contrast, it applies to discrete-time signals x[n]. It achieves a multiresolution decomposition of x[n] on I octaves labeled by i = 1,…,I, given by

$$x[n] = \sum_{i=1}^{I} \sum_{k \in Z} a_{i,k} \, g_i[n - 2^i k] + \sum_{k \in Z} b_{I,k} \, h_I[n - 2^I k] \tag{4.7}$$
The DWT computes wavelet coefficients $a_{i,k}$ for i = 1,…,I and scaling coefficients $b_{I,k}$, which are given by

$$DWT\{x[n]; 2^i, k 2^i\} = a_{i,k} = \sum_{n} x[n] \, g_i^*[n - 2^i k] \tag{4.8}$$

and

$$b_{I,k} = \sum_{n} x[n] \, h_I^*[n - 2^I k] \tag{4.9}$$

where the $g_i[n - 2^i k]$ are the discrete wavelets and the $h_I[n - 2^I k]$ are the scaling sequences.
4.2.3 SPLINE WAVELET TRANSFORM
The attractiveness of the Gabor representation of a signal comes from its optimal time-frequency localization.5 However, the use of a fixed window size, redundancy, and nonorthogonality are the major limitations of the Gabor analysis. B-spline wavelets were shown by Unser et al.25,26 to have near optimal time-frequency localization. Although they are not orthogonal like the exponentially decaying Battle/Lemarie polynomial spline wavelets used by Mallat,10 they are semiorthogonal and have compact support. The B-Spline Wavelet Transform is used to construct a sequence of embedded polynomial spline function spaces $\{S_{(i)}^n, i \in Z\}$ of order n such that $S_{(i)}^n \supset S_{(i+1)}^n$ for $i \in Z$, where Z is the set of integers.26 $S_{(i)}^n$ is the subset of functions in $L^2(R)$ that are of class $C^{n-1}$, i.e., continuous functions with continuous derivatives up to order (n − 1), and are equal to a polynomial of degree n in the intervals $[k 2^i, (k+1) 2^i]$ with $k \in Z$. Hence

$$S_{(i)}^n = \left\{ \phi_{(i)}^n(x) = \sum_{k=-\infty}^{\infty} c_{(i)}(k) \, \beta_{2^i}^n(x - 2^i k) \right\} \tag{4.10}$$

where

$$\beta_{2^i}^n(x) = \beta^n(x / 2^i) \tag{4.11}$$
The basis function $\beta^n(x)$ is the B-spline of order n. For the spline basis functions of order n the wavelet sequence $q^n$ is

$$q^n(k - 1) = (-1)^k \left[ b^{2n+1} * p^n \right](k) \tag{4.12}$$

where $b^{2n+1}(k) = \beta^{2n+1}(k)$, and $p^n(k)$, the scaling sequence, is the binomial kernel

$$p^n(k) = \frac{1}{2^n} \binom{n+1}{k}, \quad k = 0, \ldots, n+1 \tag{4.13}$$
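The two sequences can be generated exactly with rational arithmetic. The sketch below assumes that $b^{2n+1}(k)$ denotes the centered B-spline of degree 2n + 1 sampled at the integers, and it lists the wavelet sequence in order of increasing k without fixing the index origin; the function names are hypothetical. For n = 2 the output can be compared with the p2 and q2 entries of Table 4.1.

```python
from fractions import Fraction
from math import comb, factorial

def binomial_kernel(n):
    """Scaling sequence p^n(k) = C(n+1, k) / 2^n for k = 0..n+1 (Equation 4.13)."""
    return [Fraction(comb(n + 1, k), 2 ** n) for k in range(n + 2)]

def sampled_bspline(m):
    """Samples of the centered B-spline of odd degree m at k = -(m-1)/2 .. (m-1)/2."""
    half = (m + 1) // 2
    samples = []
    for k in range(-(half - 1), half):
        total = sum((-1) ** j * comb(m + 1, j) * max(k + half - j, 0) ** m
                    for j in range(m + 2))
        samples.append(Fraction(total, factorial(m)))
    return samples

def convolve(a, b):
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def wavelet_sequence(n):
    """(-1)^k [b^{2n+1} * p^n](k) in order of increasing k (Equation 4.12)."""
    conv = convolve(sampled_bspline(2 * n + 1), binomial_kernel(n))
    first_k = -n                      # b^{2n+1} starts at k = -n, p^n at k = 0
    return [c if (first_k + i) % 2 == 0 else -c for i, c in enumerate(conv)]

p2 = binomial_kernel(2)
q2 = wavelet_sequence(2)
print([str(v) for v in p2])
print([str(v * 480) + "/480" for v in q2])
```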
When the wavelet and the scaling sequences are determined, the wavelet function $\beta_2^n(x)$ can be constructed from a scaling function $\beta^n(x)$ by solving a two-scale equation (a dilation equation)

$$\beta_2^n(x) = \beta^n(x/2) = \sum_{k=-\infty}^{+\infty} q^n(k) \, \beta^n(x - k) \tag{4.14}$$
Given a function $\phi^n(x)$, we can obtain the B-spline representation at the finest resolution level, which is defined as level (0), using

$$\phi_{(0)}^n(x) = \sum_{k=-\infty}^{\infty} c_{(0)}(k) \, \beta^n(x - k) \tag{4.15}$$
The essence of the wavelet transform is to decompose the above expression using basis functions that are expanded by a factor of two

$$\phi_{(0)}^n(x) = \sum_{k=-\infty}^{+\infty} d_{(1)}(k) \, \beta_2^n(x - 2k) + \sum_{k=-\infty}^{+\infty} c_{(1)}(k) \, \beta_2^n(x - 2k) \tag{4.16}$$
The B-spline wavelet is a polynomial with compact support with the property that $\beta_2^n \perp \beta_2^n(x - 2k)$, $k \in Z$.25 This means that the first term of the right-hand side of Equation 4.16 is the projection of $\phi_{(0)}^n$ on $S_{(1)}^n$ and the second term represents the residual error. The decomposition can be implemented iteratively up to a level I, which yields the wavelet representation

$$\phi_{(0)}^n(x) = \sum_{i=1}^{I} \sum_{k=-\infty}^{+\infty} d_{(i)}(k) \, \beta_{2^i}^n(x - 2^i k) + \sum_{k=-\infty}^{+\infty} c_{(I)}(k) \, \beta_{2^I}^n(x - 2^I k) \tag{4.17}$$

where

$$\beta_{2^i}^n(x) = \beta_2^n(x / 2^{i-1}) \tag{4.18}$$

The coefficients $\{d_{(1)}, \ldots, d_{(I)}\}$ are the so-called wavelet coefficients ordered from fine to coarse, while the sequence $\{c_{(I)}\}$ characterizes the lower resolution signal at level (I).
4.2.4 THE DISCRETE B-SPLINE WAVELET TRANSFORM
It is possible to take a wavelet function, which may be regarded as a band-pass filter, and represent it as a combination of a low-pass and a high-pass filter. In this case, the wavelet analysis becomes a multiresolution analysis. The low-pass and the high-pass filters for the nth order spline wavelet multiresolution decomposition may be computed as

$$h(k) = \frac{1}{2} \left[ (b^{2n+1})^{-1} \right]_{\uparrow 2} * b^{2n+1} * p^n(k) \tag{4.19a}$$

$$g(k+1) = \frac{1}{2} \left[ (b^{2n+1})^{-1} \right]_{\uparrow 2} * (-1)^k p^n(k) \tag{4.19b}$$

where ↑2 indicates up-sampling by 2.
4.2.5 DESIGN OF QUADRATIC SPLINE WAVELETS
For the analysis of the waveforms used in this study, the quadratic spline wavelets (n = 2) are designed particularly for their antisymmetric property, which conforms to the morphological character of the signal. The quadratic spline and wavelet functions are shown in Figures 4.3 and 4.4. The low-pass h(n) and the high-pass g(n) filter kernels for the quadratic spline wavelet are

$$h(k) = \frac{1}{2} \left[ (b^5)^{-1} \right]_{\uparrow 2}(k) * b^5(k) * p^2(k) \tag{4.20}$$

$$g(k+1) = \frac{1}{2} \left[ (b^5)^{-1} \right]_{\uparrow 2}(k) * (-1)^k p^2(k) \tag{4.21}$$

where

$$\left[ (b^5)^{-1} \right](k) = Z^{-1} \left\{ \frac{120}{z^2 + 26z + 66 + 26z^{-1} + z^{-2}} \right\} \tag{4.22}$$

and

$$p^2(k) = \frac{1}{2^2} Z^{-1} \left[ 1 + 3z^{-1} + 3z^{-2} + z^{-3} \right] \tag{4.23}$$
FIGURE 4.3 Quadratic B-spline functions.
FIGURE 4.4 Quadratic B-spline wavelet.
It may be shown that $[(b^5)^{-1}](k)$ can be expressed in factorized form by

$$\left[ (b^5)^{-1} \right](k) = Z^{-1} \left\{ \frac{120 \, \alpha_1 \alpha_2}{(1 - \alpha_1 z^{-1})(1 - \alpha_1 z)(1 - \alpha_2 z^{-1})(1 - \alpha_2 z)} \right\} \tag{4.24}$$
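where $\alpha_1 = -0.04309$ and $\alpha_2 = -0.43057$. Now $[(b^5)^{-1}](k)$ can be determined as

$$\left[ (b^5)^{-1} \right](k) = \frac{120 \, \alpha_1 \alpha_2}{(1 - \alpha_1^2)(1 - \alpha_2^2)(1 - \alpha_1 \alpha_2)(\alpha_1 - \alpha_2)} \left( \alpha_1 (1 - \alpha_2^2) \, \alpha_1^{|k|} - \alpha_2 (1 - \alpha_1^2) \, \alpha_2^{|k|} \right) \tag{4.25}$$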
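The closed form can be checked numerically. The sketch below evaluates the expression with the gain of 120 taken from Equation 4.22 and with the symmetric exponent |k|, and convolves the result with the samples {1, 26, 66, 26, 1}/120 of the quintic B-spline; the function names are hypothetical. The result should be close to a unit impulse, limited by the five-digit rounding of α1 and α2.

```python
ALPHA_1 = -0.04309
ALPHA_2 = -0.43057

# Samples b^5(k) of the quintic B-spline at k = -2..2.
B5 = {-2: 1 / 120, -1: 26 / 120, 0: 66 / 120, 1: 26 / 120, 2: 1 / 120}

def b5_inverse(k):
    """Closed-form inverse filter [(b^5)^-1](k), symmetric in k."""
    a1, a2 = ALPHA_1, ALPHA_2
    gain = 120.0 * a1 * a2 / ((1 - a1 ** 2) * (1 - a2 ** 2)
                              * (1 - a1 * a2) * (a1 - a2))
    return gain * (a1 * (1 - a2 ** 2) * a1 ** abs(k)
                   - a2 * (1 - a1 ** 2) * a2 ** abs(k))

def convolve_at(lag):
    """Value of (b^5 * (b^5)^-1) at the given lag."""
    return sum(weight * b5_inverse(lag - m) for m, weight in B5.items())

def denominator(z):
    """Polynomial form of the denominator of Equation 4.22, multiplied by z^2."""
    return z ** 4 + 26 * z ** 3 + 66 * z ** 2 + 26 * z + 1

for lag in range(-3, 4):
    print(lag, round(convolve_at(lag), 5))
```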
The coefficient values for these filter kernels h(n) and g(n) are given in Table 4.1 for the quadratic spline wavelets.
TABLE 4.1 Coefficients of the Truncated Decomposition Filters h, g (IIR) and Reconstruction Filters p2, q2 (FIR) for Quadratic Spline Filters.

   k      h(k)        g(k)
 -10    +0.00157    -0.00388
  -9    +0.01909    -0.03416
  -8    -0.00503    +0.00901
  -7    -0.04440    +0.07933
  -6    +0.01165    -0.02096
  -5    +0.10328    -0.18408
  -4    -0.02593    +0.04977
  -3    -0.24373    +0.42390
  -2    +0.03398    -0.14034
  -1    +0.65523    -0.90044
   0    +0.65523    +0.90044
   1    +0.03398    +0.14034
   2    -0.24373    -0.42390
   3    -0.02593    -0.04977
   4    +0.10328    +0.18408
   5    +0.01165    +0.02096
   6    -0.04440    -0.07933
   7    -0.00503    -0.00901
   8    +0.01909    +0.03416
   9    +0.00157    +0.00388

p2(k) (nonzero entries, in order of increasing k): +1/4, +3/4, +3/4, +1/4
q2(k) (nonzero entries, in order of increasing k): +1/480, -29/480, +147/480, -303/480, +303/480, -147/480, +29/480, -1/480
4.2.6 THE FAST ALGORITHM
The initial step is to find the B-spline coefficients c(k) at the resolution level 0. This is efficiently implemented using

$$c^+(k) = f(k) + b_1 c^+(k - 1), \quad (k = 2, \ldots, K) \tag{4.26a}$$

$$c^-(k) = f(k) + b_1 c^-(k + 1), \quad (k = K - 1, \ldots, 1) \tag{4.26b}$$

$$c(k) = b_0 \left( c^+(k) + c^-(k) - f(k) \right) \tag{4.26c}$$

where $b_0 = -8\alpha / (1 - \alpha^2)$ and $b_1 = \alpha = \sqrt{8} - 3$, with $c^+(1) = c^-(K) = 0$. The wavelet coefficients are then computed iteratively for I octaves by filtering and decimating by a factor of 2
$$c_{(i+1)}(k) = \left[ h * c_{(i)} \right]_{\downarrow 2}(k) \tag{4.27a}$$

$$d_{(i+1)}(k) = \left[ g * c_{(i)} \right]_{\downarrow 2}(k), \quad i = 0, 1, 2, \ldots, I - 1 \tag{4.27b}$$
where ↓2 indicates down-sampling by 2 and where h and g are the low-pass and the high-pass filter kernels for decomposition, respectively, computed as

$$h(k) = \frac{1}{2} \left[ (b^{2n+1})^{-1} \right]_{\uparrow 2} * b^{2n+1} * p^n(k) \tag{4.28a}$$

$$g(k) = \frac{1}{2} \left[ (b^{2n+1})^{-1} \right]_{\uparrow 2} * (-1)^k p^n(k) \tag{4.28b}$$
where ↑2 indicates up-sampling by 2. The basic computational block diagram for the I octave wavelet decomposition is given in Figure 4.5. The input to the block diagram is the signal at resolution level (0). H(z) and G(z) denote the z-transforms of the low-pass and high-pass filter kernels h and g, respectively. The discrete wavelet transform by the spline wavelets is performed by these two filters in the form of a multiresolution decomposition. At each resolution level, the output of the high-pass filtering yields the wavelet coefficients for that resolution (i), and the output of the low-pass filtering yields the coarser input to the next resolution level (i+1). To implement the change of scale, the discrete sequences are decimated by two after each filtering step. Various fast algorithms for the implementation of the recursive filtering and decimation schemes for wavelet decomposition are proposed in Reference 17.
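The initialization step of Equations 4.26a to 4.26c can be sketched as a causal and an anticausal recursion. The sketch assumes that the last step combines both recursions as $c = b_0 (c^+ + c^- - f)$, and it checks the result against the quadratic B-spline samples {1/8, 6/8, 1/8}, a standard property that is not stated in the text; indices are 0-based and the function name is hypothetical.

```python
import math

def bspline_coefficients(f):
    """Quadratic B-spline coefficients c(k) of the samples f(k).

    Uses the zero initial conditions of the text, so the values are
    only approximate within a few samples of either end.
    """
    alpha = math.sqrt(8.0) - 3.0
    b0 = -8.0 * alpha / (1.0 - alpha * alpha)
    b1 = alpha
    K = len(f)
    c_plus = [0.0] * K
    c_minus = [0.0] * K
    for k in range(1, K):                 # causal recursion (4.26a)
        c_plus[k] = f[k] + b1 * c_plus[k - 1]
    for k in range(K - 2, -1, -1):        # anticausal recursion (4.26b)
        c_minus[k] = f[k] + b1 * c_minus[k + 1]
    return [b0 * (c_plus[k] + c_minus[k] - f[k]) for k in range(K)]

f = [math.sin(0.3 * k) for k in range(40)]
c = bspline_coefficients(f)

# Re-sample the spline at the integers: f(k) = (c(k-1) + 6 c(k) + c(k+1)) / 8.
rebuilt = [(c[k - 1] + 6.0 * c[k] + c[k + 1]) / 8.0 for k in range(12, 28)]
print(max(abs(r - f[k]) for r, k in zip(rebuilt, range(12, 28))))
```

Away from the ends of the signal the coefficients reproduce the input samples, because the influence of the zero initial conditions decays as $\alpha^k$ with $|\alpha| \approx 0.17$.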
FIGURE 4.5 The basic computational block diagram of one octave wavelet decomposition algorithm.
The wavelet transform is a sequential band-pass filtering operation using filters with logarithmically ordered band-pass characteristics in frequency domain. In the time domain, this operation corresponds to projecting a signal onto a subspace spanned by the dilated and translated versions of prototype functions called wavelet and scaling functions. Dilation yields logarithmically ordered filters, and translation yields the information about where in time any frequency band activity in the signal occurs. In the multiresolution scheme, the number of wavelet coefficients halves from one scale to the next, which implies that longer time windows for lower frequencies and shorter time windows for higher frequencies are employed in the analysis. The wavelet coefficients can be used to investigate the amplitude and phase characteristics of oscillations in various frequency bands forming the waveform or the image under consideration. For the exploration of different oscillatory components in a waveform using multiresolution analysis, the waveforms are decomposed into six or more scales by the Quadratic Spline Wavelets. The decomposition filters are designed as described in Section 4.2 for the corresponding multiresolution approximation. As an example, if the sampling rate is 1KHz, each scale covers the frequency band:
Scale             Frequency band (Hz)
First scale       250–500
Second scale      125–250
Third scale       62.5–125
Fourth scale      31.3–62.5
Fifth scale       15.6–31.3
Sixth scale       7.8–15.6
Residual scale    0.0–7.8

4.3 INVARIANT MOMENTS
Moments have long been used in statistical theory and classical mechanics.28 Statisticians view moments as means, variances, skewness, and kurtosis of distributions, while classical mechanics students use moments to find centers of mass and moments of inertia.24 In imaging, moments have been used as feature vectors for classification11 as well as for image texture attributes and shape descriptors of objects.16,20 In the early 1960s, Hu developed seven invariant moments from algebraic moment theory.6 These seven moments are invariant under translation, rotation, and scaling. Perhaps the most important contribution of this work was the application of these seven invariant moments to the two-dimensional pattern recognition problem of character recognition. This was a crucial development, since many key problems in imaging and image recognition focus on recognizing an image even though it has been translated or rotated, or perhaps magnified by some means. Since that time other pattern recognition applications have included hand-printed characters, aircraft identification, chest X-rays and ship recognition, and more recently face recognition.11 Similar to the definition of moments in classical mechanics, the two-dimensional moments of order (p + q) of an image intensity distribution f(x,y) are defined as

$$m_{pq} = \iint x^p y^q f(x, y) \, dx \, dy \tag{4.29}$$

where p,q = 0,1,2,…. These moments in general are not invariant to any distortions,28 and, therefore, the central moments are defined as

$$\mu_{pq} = \iint (x - x')^p (y - y')^q f(x, y) \, d(x - x') \, d(y - y') \tag{4.30}$$
where m m x ′ = m10 and y′ = m01 . 00 00 The central moments are known to be invariant under translation, and by working through Equation 4.30, it can be shown that the first four orders of central moments can be expressed in terms of the ordinary moments defined in Equation 4.29 as
µ00 = m00 µ10 = µ01 = 0 µ11 = m11 −
m10 m01 m00
µ20 = m20 −
2 m10 m00
µ02 = m02 −
2 m01 m00
(4.31)
µ12 = m12 − m02 x ′ − 2 m11 y′ + 2 m10 y′ 2 µ21 = m21 − m20 y′ − 2 m11 x ′ + 2 m01 x ′ 2 µ03 = m03 − 3m02 y′ + 2 m01 y′ 2 µ30 = m30 − 3m20 x ′ + 2 m10 x ′ 2 Often it is desirable to normalize the moments with respect to size. This may be accomplished by using the area, µ00. The normalized central moments are
$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{r}}, \qquad \text{where } r = 1 + \frac{p + q}{2} \text{ for } (p + q) = 2, 3, \ldots \tag{4.32}$$

With these normalized moments the seven Hu invariants are found by:
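$$
\begin{aligned}
\phi_{1} &= \eta_{20} + \eta_{02} \\
\phi_{2} &= (\eta_{20} - \eta_{02})^{2} + 4\eta_{11}^{2} \\
\phi_{3} &= (\eta_{30} - 3\eta_{12})^{2} + (3\eta_{21} - \eta_{03})^{2} \\
\phi_{4} &= (\eta_{30} + \eta_{12})^{2} + (\eta_{21} + \eta_{03})^{2} \\
\phi_{5} &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^{2} - 3(\eta_{21} + \eta_{03})^{2}\right] \\
&\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right] \\
\phi_{6} &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right] \\
&\quad + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
\phi_{7} &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^{2} - 3(\eta_{21} + \eta_{03})^{2}\right] \\
&\quad - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^{2} - (\eta_{21} + \eta_{03})^{2}\right]
\end{aligned} \tag{4.33}
$$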
While these seven invariants are given in terms of the normalized moments, they may be calculated from the central moments as well. In that case the formulas are the same with µ substituted for η in Equation 4.33. One item to note is that these normalized moments assume that the image is represented by pixels, the values of which are all > 0. There is no problem in calculating the original and central moments even if some of the pixels are < 0. However, the normalized moments pose a different problem. If a substantial number of pixels have values < 0, then µ00 becomes negative. This causes a problem during the normalization, since µ00 raised to a nonintegral power becomes an imaginary number. Because this system calculates the moments of the wavelet coefficients there are situations where a substantial number of coefficients are negative. Because of this, when the normalized moments are calculated µ00 is treated somewhat like a vector. It is considered to have magnitude |µ00| and a direction which is (+) if µ00 > 0 and (–) if µ00 < 0. Therefore, the normalization of µpq is then:
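$$
\eta_{pq} =
\begin{cases}
\dfrac{\mu_{pq}}{\mu_{00}^{r}} & \text{if } \mu_{00} > 0 \\[2ex]
-\dfrac{\mu_{pq}}{|\mu_{00}|^{r}} & \text{if } \mu_{00} < 0
\end{cases} \tag{4.34}
$$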
Therefore, this technique actually calculates pseudo-moments to use as features.
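The chain from Equation 4.29 to Equation 4.34 can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the implementation used in this work: the integrals are replaced by sums over pixel indices, only ϕ1 and ϕ2 are computed, the case µ00 = 0 is not handled, and the function names (`raw_moment`, `central_moment`, `normalized_moment`, `hu_first_two`) are hypothetical.

```python
import numpy as np

def raw_moment(f, p, q):
    """Ordinary moment m_pq (Equation 4.29), integrals replaced by pixel sums."""
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    return np.sum((x ** p) * (y ** q) * f)

def central_moment(f, p, q):
    """Central moment mu_pq (Equation 4.30) about the centroid (x', y')."""
    m00 = raw_moment(f, 0, 0)
    xc = raw_moment(f, 1, 0) / m00
    yc = raw_moment(f, 0, 1) / m00
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    return np.sum(((x - xc) ** p) * ((y - yc) ** q) * f)

def normalized_moment(f, p, q):
    """Pseudo-moment eta_pq (Equation 4.34): divide by |mu_00|^r, keep the sign of mu_00."""
    mu00 = central_moment(f, 0, 0)          # assumed nonzero in this sketch
    r = 1 + (p + q) / 2
    eta = central_moment(f, p, q) / abs(mu00) ** r
    return eta if mu00 > 0 else -eta

def hu_first_two(f):
    """First two invariants of Equation 4.33."""
    n20 = normalized_moment(f, 2, 0)
    n02 = normalized_moment(f, 0, 2)
    n11 = normalized_moment(f, 1, 1)
    return n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2

# A small test pattern and a translated copy of it (no wrap-around occurs).
rng = np.random.default_rng(0)
image = np.zeros((32, 32))
image[5:12, 8:20] = rng.uniform(0.5, 1.0, size=(7, 12))
shifted = np.roll(image, shift=(7, 5), axis=(0, 1))

print(hu_first_two(image))
print(hu_first_two(shifted))   # agrees with the line above up to rounding
```

Because the sign convention of Equation 4.34 cancels the sign of µpq, negating every pixel value leaves these pseudo-moments unchanged in this sketch, which is one way to see why they remain usable on wavelet coefficients with many negative values.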
4.4 ENTROPY
Entropy is a quantity that is widely used in information theory and is based on probability theory.4,18 Consider first an event E that can occur when an experiment is performed. How surprised would one be to see that E actually does occur? The answer to that question depends upon the probability of E. Suppose, for instance, cards are being drawn with replacement one at a time from a full deck of playing cards. If E were defined as “the card being a heart,” it would not be too surprising if E occurred as P(E) = 0.25. (There are 52 cards in a full deck broken into four equal suits, hearts, spades, diamonds, and clubs.) However, if E were defined to be “the card being the ace of hearts,” then we might be rather surprised to see that occurring, as now P(E) = 1/52 ≈ 0.0192. It is, however, possible to quantify this concept of surprise.18 If this concept is extended to a random variable X which can take one of the values x1, x2, …, xn with probabilities p1, p2, …, pn, then the expected amount of surprise upon learning the value of X is

$$H(X) = -\sum_{i=1}^{n} p_{i} \log_{2} p_{i} \tag{4.35}$$
This quantity is the entropy of the random variable X. Note that if pi = 0, then 0 log2 0 is defined to be 0. Thus H(X) represents the average amount of surprise associated with the value of X. It also can be interpreted as representing the amount of uncertainty that exists in the value of X. In information theory, H(X) is considered to be the average amount of information received when the value of X is observed.4 It is from the information theory point of view that the entropy would be a valid data point for images. It is conjectured that perhaps normal outcomes would contain more information than abnormal, or vice-versa, and therefore, the entropy values would be of use for classification purposes. Interestingly enough, when viewed by a human observer, the distribution of entropy values of abnormal mammograms, for example, seemed no different than the distribution of the normal mammogram entropy values. However, when used as input to a neural network, they indeed added some discrimination.
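As a small worked illustration of Equation 4.35, the sketch below computes the entropy of a discrete distribution and the surprise of the two card events discussed above. It assumes that the surprise of a single event with probability p is measured as −log2 p, which is the per-outcome term implied by Equation 4.35; the function names are hypothetical.

```python
import math

def surprise(p):
    """Surprise, in bits, of an event with probability p (assumed to be -log2 p)."""
    return -math.log2(p)

def entropy(probabilities):
    """H(X) of Equation 4.35; terms with p = 0 contribute 0 by convention."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(surprise(0.25))        # drawing a heart
print(surprise(1 / 52))      # drawing the ace of hearts
print(entropy([0.25] * 4))   # suit of a drawn card: four equally likely values
print(entropy([1 / 52] * 52))  # identity of a drawn card
```

The rarer event carries the larger surprise, and a uniform distribution over more outcomes has the larger entropy.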
4.5 CEPSTRUM ANALYSIS

Naturally, a recorded signal, such as EEG, is mixed with noise. The relationship between the signal and noise is often simply considered to be an addition, i.e.,

$$x_{i} = s_{i} + n_{i} \tag{4.36}$$

where $x_{i}$ is the recorded signal, $s_{i}$ the pure signal, $n_{i}$ the noise, and i the time index.
In this additive case, Fourier spectrum analysis and signal filtering can be applied directly. However, if the relationship between the signal and noise is a convolution, which often occurs, i.e.,

$$x_{i} = s_{i} * n_{i} \tag{4.37}$$
where * means convolution, the signal and noise are no longer combined additively, so Fourier analysis and filtering cannot be used directly and deconvolution is needed. In general, the relationship between the signal and noise is a mixture of addition and convolution. After the discrete Fourier transform (DFT) is obtained, the convolution becomes a multiplication:

$$\hat{x}_{i} = \hat{s}_{i} \cdot \hat{n}_{i} \tag{4.38}$$

where $\hat{x}_{i}$ is the DFT of $x_{i}$, $\hat{s}_{i}$ the DFT of $s_{i}$, $\hat{n}_{i}$ the DFT of $n_{i}$, and · means multiplication. Then after the Log is applied, the signals become additive:

$$\tilde{x}_{i} = \tilde{s}_{i} + \tilde{n}_{i} \tag{4.39}$$

where $\tilde{x}_{i}$ is the Log of $\hat{x}_{i}$, $\tilde{s}_{i}$ the Log of $\hat{s}_{i}$, and $\tilde{n}_{i}$ the Log of $\hat{n}_{i}$. To return to the time domain, an inverse DFT (a linear transform, which maintains the addition relationship) is performed, and the output is the cepstrum $c_{i}$. Just the low part of the cepstrum, where the signal is assumed to be concentrated, is selected by cepstrum filtering through the cepstrum window. The output cepstrum will be the input to the ANN. If the real Log of the absolute values of the DFT spectrum is used, the output is the real cepstrum. If the complex Log of the complex values of the DFT spectrum is used, the output is the complex cepstrum.
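The steps of Equations 4.37 through 4.39 can be sketched with NumPy as follows. This is a minimal sketch of the real cepstrum only, under several assumptions: the convolution is circular (so that the DFT product relation holds exactly), the window is a plain truncation to the first `width` coefficients, and the small floor inside the logarithm is a numerical guard that is not part of the text. The function names are hypothetical.

```python
import numpy as np

def real_cepstrum(x):
    """DFT -> real Log of the magnitude spectrum -> inverse DFT."""
    spectrum = np.fft.fft(x)
    # Floor the magnitude to avoid log(0); an assumption of this sketch.
    log_magnitude = np.log(np.maximum(np.abs(spectrum), 1e-12))
    return np.fft.ifft(log_magnitude).real

def cepstrum_window(cepstrum, width):
    """Keep only the low part of the cepstrum (simple truncation)."""
    return cepstrum[:width]

rng = np.random.default_rng(1)
s = rng.standard_normal(64)          # stand-in for the pure signal
n = rng.standard_normal(64)          # stand-in for the convolved noise
# Circular convolution of s and n, computed through the DFT (Equation 4.38).
x = np.fft.ifft(np.fft.fft(s) * np.fft.fft(n)).real

c_x = real_cepstrum(x)
# In the cepstral domain the two components add (Equation 4.39).
print(np.max(np.abs(c_x - (real_cepstrum(s) + real_cepstrum(n)))))

features = cepstrum_window(c_x, 16)  # candidate input vector for the ANN
```

The printed difference is at the level of rounding error, which shows that a convolutional mixture in time becomes an additive one after the Log step.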
4.6 FRACTAL DIMENSION

Additional features extracted from images and signals focus on texture. The method used here is based on fractals. While fractal geometry has been around for over a century, it was Mandelbrot who coined the term ‘fractal’ and popularized this class of mathematical functions.14 Examples of fractals best explain what a fractal really is. Purely mathematical fractals include the Mandelbrot set and the von Koch snowflake. Fractals occurring in nature include clouds, trees, the coast of England, mountains, blood vasculature, cauliflower, and much more. What these all have in common is some degree of self-similarity; naturally occurring fractals are statistically self-similar. In other words, a whole cauliflower looks like half a cauliflower, which looks like one fourth of a cauliflower, and so on. This example using cauliflower can be found in Peitgen et al.,12 or try it for yourself! The measure used to describe the ‘fractal-ness’ of an object or image is the fractal dimension. The fractal dimension is the measure of self-similarity of an image; the basic idea is that an image with fractal properties will look the same at all scales. Many have used the fractal dimension to analyze and segment textures.
In a sense, the fractal dimension measures the roughness of an image. For example, computer generated mountains with low fractal dimension are smooth and rolling while those generated with high fractal dimension are rough and jagged.11,14 Fractals have been used in general purpose texture analysis, synthesis, and segmentation, particularly of natural scenes. Pentland14 segmented images containing aerial views, mountains, and a desert scene using the fractal dimension. Keller et al.7 used two characteristics related to the fractal dimension to distinguish silhouettes of trees from silhouettes of mountains. Keller et al.2 and Dubuisson and Dubes2 discuss the use of lacunarity (another fractal-based feature) in conjunction with fractal dimension to improve the segmentation of natural textures. They argue that the fractal dimension alone cannot discriminate all natural textures as well as a human observer but that other fractal-based features such as lacunarity can improve this segmentation. In general, the dimension of a set can be found by the equation

$$D = \frac{\log(N)}{\log(1/r)} \tag{4.40}$$
where D is the dimension and N is the number of parts comprising the set, each scaled down by a ratio r from the whole.12 For a two-dimensional square, N parts scaled by the ratio $r = 1/N^{1/2}$ result in $N r^{2} = 1$, or D = 2. A set is considered fractal if D is a noninteger value. For example, von Koch’s snowflake has four parts, each one-third the length of its parent, so D = log(4) / log(3) = 1.26.

There are many ways to calculate the fractal dimension of naturally occurring images. Most common in the literature are methods that estimate the fractal dimension based on statistical differences in pixel intensity. Voss27 and Keller et al.7 used a popular box-counting method similar to that of Sarkar and Chaudhuri,19 described in more detail next. Peleg et al.13 used a multiresolution method to measure the fractal dimension by observing the change in surface area of a covering blanket over the topological map of an image at different scales. Super and Bovik23 used the outputs from multiple Gabor filters and fit them to a fractal power-law curve to obtain the fractal dimension. This has the added capability of spatial localization. Pentland14 gathered second-order statistics at varying distances, resulting in a Gaussian-shaped distribution. A fractal dimension estimate was obtained by fitting the standard deviations over scale.

The method chosen here is that proposed by Sarkar and Chaudhuri.19 Starting with the basic equation D = log(N) / log(1/r), an image sized M × M, and G gray levels, the algorithm is as follows:

1. Divide the image into grids of size s × s, where M/2 > s > 1, such that r = s / M.
2. Imagine the two-dimensional image is a topological map in three dimensions. On each grid sized s × s can be built a column of boxes sized s × s × s’, where G/s’ = M/s, with indices starting with 1 for the bottom box.
3. Find the lowest and highest boxes intersected by the image in the current column of boxes and name them k and l, respectively.
4. Add up the differences (l – k + 1) for all areas s × s for the current scale r and call it N(r).
5. Do this for all scales, and the result will be a vector N(r) where 1/r = 2, 4, 8, …, M/2.
6. Plot log(N[r]) vs. log(1/r) and calculate the slope using a least-squares linear fit. This is the fractal dimension.
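The six steps above can be sketched as follows. This is a minimal sketch rather than the implementation used in this work, and it makes several assumptions: M is a power of two, the scales are taken as 1/r = 2, 4, …, M/2 as in step 5, box indices are obtained by integer division of the gray value by the box height s’, and the name `box_counting_dimension` is hypothetical.

```python
import numpy as np

def box_counting_dimension(image, gray_levels=256):
    """Estimate the fractal dimension of a square M x M image (M a power of two)."""
    img = np.asarray(image, dtype=float)
    M = img.shape[0]
    log_inv_r, log_n = [], []
    s = M // 2
    while s > 1:                                  # 1/r = M/s = 2, 4, ..., M/2
        box_height = gray_levels * s / M          # s' chosen so that G/s' = M/s
        # Split the image into (M/s) x (M/s) grid cells of s x s pixels each.
        cells = img.reshape(M // s, s, M // s, s)
        lowest = cells.min(axis=(1, 3))
        highest = cells.max(axis=(1, 3))
        k = np.floor(lowest / box_height) + 1     # index of lowest box hit (1-based)
        l = np.floor(highest / box_height) + 1    # index of highest box hit
        log_n.append(np.log(np.sum(l - k + 1)))   # N(r) for this scale
        log_inv_r.append(np.log(M / s))
        s //= 2
    slope, _ = np.polyfit(log_inv_r, log_n, 1)    # least-squares line fit
    return slope

flat = np.full((64, 64), 100.0)
noise = np.random.default_rng(0).integers(0, 256, size=(64, 64))
print(box_counting_dimension(flat))    # a perfectly flat surface has dimension 2
print(box_counting_dimension(noise))   # a rough surface gives a larger estimate
```

For a constant image every column of boxes contributes exactly one box, so N(r) = (1/r)² and the fitted slope is 2; rougher images intersect more boxes per column and push the estimate upward.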
FIGURE 4.6 Test images used to test fractal dimension algorithm, FD = fractal dimension.
Sarkar and Chaudhuri19 used random images with increasing standard deviation to test their algorithm. These results were recreated by the authors by first generating 8-bit test images of Gaussian noise with a mean of 128 and standard deviation ranging from 8 to 128 in increments of 8. Some examples are shown in Figure 4.6. The size of all test images was 128 × 128 pixels. The results are shown in Figure 4.7. As demonstrated, the fractal dimension increases with increasing noise standard deviation, similar to the results Sarkar and Chaudhuri show in their paper.19 Although it probably would have been more appropriate to generate actual fractal images with known dimension, such as fractal Brownian images, the calculation of the fractal dimension is only an estimation and can be thought of as a measure of the roughness of an image. Furthermore, when using the fractal dimension in texture analysis applications, the more salient property is that it increases monotonically with increasing true fractal dimension or roughness. Thus, by using these random images, the authors demonstrated an increase in fractal dimension with increased noise or roughness.

FIGURE 4.7 Fractal dimension as a function of standard deviation of test images.

Sarkar and Chaudhuri19 compare their algorithm with those of others, namely Keller et al.,7 Peleg et al.,13 and Pentland.14 In comparison, the methods of Pentland14 and Peleg et al.13 are accurate and cover the full dynamic range of fractal dimensions but are computationally expensive. On the other hand, the methods of Gagnepain and Roques-Carmes (1986) and Keller et al.7 are computationally efficient but do not cover the full dynamic range. The method of Sarkar and Chaudhuri demonstrated both qualities and was therefore chosen for implementation.
4.7 SGLD TEXTURE FEATURES

Described in detail in Haralick et al.,3 the gray-tone spatial-dependence matrix, as they call it, reveals the second order statistics of an image. Often applied to the segmentation and/or classification of textures, the SGLD matrix in Haralick et al.3 was used to classify aerial images. The matrix itself contains the statistics of pairs of pixel intensities located a distance d away from each other at an angle Θ. Note that d is not the Euclidean distance but represents the number of units away in pixels; a pixel’s eight nearest neighbors are one unit away. Thus, element (i, j) in the SGLD matrix Φd,Θ is the probability that a pixel valued i is a distance d away from a pixel valued j at an angle Θ. The matrix Φd,Θ for a two bit per pixel image (four gray levels) is explained in Table 4.2.

TABLE 4.2 Calculation of Matrix Φd,Θ for a Two Bit Per Pixel Image
(each entry "# pairs (i,j)" is the number of pixel pairs (i,j) with (d, Θ) separation)

         j = 0           j = 1           j = 2           j = 3
i = 0    # pairs (0,0)   # pairs (0,1)   # pairs (0,2)   # pairs (0,3)
i = 1    # pairs (1,0)   # pairs (1,1)   # pairs (1,2)   # pairs (1,3)
i = 2    # pairs (2,0)   # pairs (2,1)   # pairs (2,2)   # pairs (2,3)
i = 3    # pairs (3,0)   # pairs (3,1)   # pairs (3,2)   # pairs (3,3)

The SGLD matrix is best explained by an example from Haralick et al.3 Take an image I with four possible gray levels {0, 1, 2, 3},

$$
I = \begin{bmatrix}
0 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 \\
0 & 2 & 2 & 2 \\
2 & 2 & 3 & 3
\end{bmatrix}
$$
The four most typical SGLD matrices, ΦH (H = Horizontal), ΦRD (RD = Right Diagonal), ΦLD (LD = Left Diagonal), and ΦV (V = Vertical) are symmetric matrices with a default d = 1 and Θ = 0°/180°, 45°/225°, 135°/315°, and 90°/270°, respectively. They are calculated by first computing the SGLD for only one angle, then taking the transpose (representing the SGLD matrix of the angle added to 180°), and adding the transpose to the original matrix as demonstrated below for Θ = 0°/180°. This makes the SGLD matrix invariant to 180° rotations.
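$$
\Phi_{1,0^{\circ}} = \begin{bmatrix}
2 & 0 & 0 & 0 \\
2 & 2 & 0 & 0 \\
1 & 0 & 3 & 0 \\
0 & 0 & 1 & 1
\end{bmatrix}, \qquad
\Phi_{1,180^{\circ}} = \Phi_{1,0^{\circ}}^{T} = \begin{bmatrix}
2 & 2 & 1 & 0 \\
0 & 2 & 0 & 0 \\
0 & 0 & 3 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

and

$$
\Phi_{H} = \Phi_{1,0^{\circ}} + \Phi_{1,180^{\circ}} = \begin{bmatrix}
4 & 2 & 1 & 0 \\
2 & 4 & 0 & 0 \\
1 & 0 & 6 & 1 \\
0 & 0 & 1 & 2
\end{bmatrix}
$$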
The other three matrices, ΦRD, ΦLD, and ΦV are
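$$
\Phi_{RD} = \begin{bmatrix}
4 & 1 & 0 & 0 \\
1 & 2 & 2 & 0 \\
0 & 2 & 4 & 1 \\
0 & 0 & 1 & 0
\end{bmatrix}, \quad
\Phi_{LD} = \begin{bmatrix}
2 & 1 & 3 & 0 \\
1 & 2 & 1 & 0 \\
3 & 1 & 0 & 2 \\
0 & 0 & 2 & 0
\end{bmatrix}, \quad
\Phi_{V} = \begin{bmatrix}
6 & 0 & 2 & 0 \\
0 & 4 & 2 & 0 \\
2 & 2 & 2 & 2 \\
0 & 0 & 2 & 0
\end{bmatrix}
$$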
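The construction of a symmetric SGLD matrix for the example image can be sketched as follows. This is a minimal sketch with hypothetical function names, assuming that the angle Θ is expressed as a row and column offset to the neighboring pixel (for example, Θ = 0° is one column to the right and Θ = 90° is one row up) and that element (i, j) counts pairs in which the neighbor has value i and the reference pixel has value j.

```python
import numpy as np

def sgld(image, row_offset, col_offset, levels):
    """Count pairs: entry (i, j) is the number of pixels valued j whose
    neighbor at the given offset is valued i."""
    img = np.asarray(image)
    rows, cols = img.shape
    phi = np.zeros((levels, levels), dtype=int)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + row_offset, c + col_offset
            if 0 <= r2 < rows and 0 <= c2 < cols:
                phi[img[r2, c2], img[r, c]] += 1
    return phi

def symmetric_sgld(image, row_offset, col_offset, levels):
    """Add the transpose (the angle plus 180 degrees) to the one-angle matrix."""
    phi = sgld(image, row_offset, col_offset, levels)
    return phi + phi.T

I = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 2, 2, 2],
              [2, 2, 3, 3]])

phi_h = symmetric_sgld(I, 0, 1, 4)     # 0 / 180 degrees
phi_rd = symmetric_sgld(I, -1, 1, 4)   # 45 / 225 degrees
phi_ld = symmetric_sgld(I, -1, -1, 4)  # 135 / 315 degrees
phi_v = symmetric_sgld(I, -1, 0, 4)    # 90 / 270 degrees

print(phi_h)
print(phi_h.sum())   # number of pixel pairs counted, usable for normalization
```

Dividing a matrix by the sum of its entries turns the pair counts into relative frequencies.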
The above example is an oversimplified case. Typical gray-scale images have 256 gray levels; therefore, the resulting SGLD matrix would be 256 × 256, much larger than 4 × 4. In order to convert each matrix into a probability density function, each element of the above matrices must be normalized. This is done by dividing each element by the number of pixel pairs included in the calculation of the matrix. This value depends on d and Θ. This is best described in Euclidean terms. Let D be the Euclidean distance between pixel pairs. For example, if d = 1 and Θ = 45°, then D = √2. Then, let m be the horizontal projection of (D, Θ) where m = D cos(Θ), and let n be the vertical projection of (D, Θ) where n = D sin(Θ). Therefore, for an image sized Nx × Ny, each element must be divided by 2(Nx – m)(Ny – n). The factor of 2 is only necessary for symmetric matrices invariant to 180° rotations. Of course, if it is desired to reflect the exact relationship of pixel pairs to each other or, in other words, to detect 180° rotations, the symmetric matrix would not be used. For example, if only the relationship d = 1, Θ = 0° is desired to be detected, only the matrix Φ1,0° would be used instead of ΦH.

Statistical features are then computed from the above matrices (after normalization). The most common features used in the literature include the energy, entropy, correlation, local homogeneity, inertia, cluster shade, and cluster prominence. Often, these features are averaged over the four directions (or one averaged SGLD matrix would yield the same averaged features). Variance of the features from the four directions may also be an important feature, and in this case the four SGLD matrices will need to be preserved. The equations for these features are

$$\text{energy} = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} f(i, j)^{2} \tag{4.41}$$

$$\text{entropy} = -\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} f(i, j) \log[f(i, j)] \tag{4.42}$$

$$\text{correlation} = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} \frac{(i - \bar{x})(j - \bar{y}) f(i, j)}{\sigma^{2}} \tag{4.43}$$

$$\text{local homogeneity} = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} \frac{1}{1 + (i - j)^{2}} f(i, j) \tag{4.44}$$

$$\text{inertia} = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} (i - j)^{2} f(i, j) \tag{4.45}$$

$$\text{cluster shade} = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} \left((i - \bar{x}) + (j - \bar{y})\right)^{3} f(i, j) \tag{4.46}$$

$$\text{cluster prominence} = \sum_{i=0}^{G-1} \sum_{j=0}^{G-1} \left((i - \bar{x}) + (j - \bar{y})\right)^{4} f(i, j) \tag{4.47}$$

where f(i, j) is element (i, j) of the normalized SGLD matrix, G is the number of gray levels,

$$\bar{x} = \sum_{i=0}^{G-1} i \sum_{j=0}^{G-1} f(i, j) = \bar{y} = \sum_{j=0}^{G-1} j \sum_{i=0}^{G-1} f(i, j) \tag{4.48}$$

for symmetric matrices, and

$$\sigma^{2} = \sum_{i=0}^{G-1} (i - \bar{x})^{2} \sum_{j=0}^{G-1} f(i, j) = \sum_{j=0}^{G-1} (j - \bar{y})^{2} \sum_{i=0}^{G-1} f(i, j) \tag{4.49}$$

for symmetric matrices. Other less common features mentioned in Haralick et al.3 are the sum average, sum variance, sum entropy, difference variance, difference entropy, and information measures of correlation.
$$\text{sum average} = \overline{(x + y)} = \sum_{i=0}^{2(G-1)} i\, f_{x+y}(i) \tag{4.50}$$

$$\text{sum variance} = \sum_{i=0}^{2(G-1)} \left(i - \overline{(x + y)}\right)^{2} f_{x+y}(i) \tag{4.51}$$

$$\text{sum entropy} = -\sum_{i=0}^{2(G-1)} f_{x+y}(i) \log[f_{x+y}(i)] \tag{4.52}$$

$$\text{difference variance} = \sum_{i=0}^{G-1} \left(i - \overline{(x - y)}\right)^{2} f_{x-y}(i) \tag{4.53}$$

where

$$\overline{(x - y)} = \sum_{i=0}^{G-1} i\, f_{x-y}(i) \tag{4.54}$$

$$\text{difference entropy} = -\sum_{i=0}^{G-1} f_{x-y}(i) \log[f_{x-y}(i)] \tag{4.55}$$

Here, following Haralick et al.,3 $f_{x+y}(k)$ is the sum of all f(i, j) with i + j = k, and $f_{x-y}(k)$ is the sum of all f(i, j) with |i – j| = k.

The information measures of correlation are

$$\text{info12} = \frac{HXY - HXY1}{\max\{HX, HY\}} \tag{4.56}$$

$$\text{info13} = \left(1 - \exp[-2(HXY2 - HXY)]\right)^{1/2} \tag{4.57}$$

where HXY = entropy, HX = entropy of $f_{x}(i)$ = HY = entropy of $f_{y}(j)$,

$$f_{x}(i) = \sum_{j=0}^{G-1} f(i, j), \qquad f_{y}(j) = \sum_{i=0}^{G-1} f(i, j) \tag{4.58}$$

with $f_{x} = f_{y}$ due to symmetry, and

$$HXY1 = -\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} f(i, j) \log[f_{x}(i) f_{y}(j)] = HXY2 = -\sum_{i=0}^{G-1} \sum_{j=0}^{G-1} f_{x}(i) f_{y}(j) \log[f_{x}(i) f_{y}(j)] \tag{4.59}$$

The two expressions in Equation 4.59 are equal because each reduces to HX + HY.

Features can be averaged over the four directions (horizontal, vertical, right diagonal, and left diagonal) to calculate an average value. The variance between the features of the four directions is also calculated. Both these average values and variances can be used as input features in the classification of tissue ROIs.
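The more common features of Equations 4.41 through 4.45 can be computed from a normalized SGLD matrix as in the following sketch. It is a hedged illustration with hypothetical names, assuming a symmetric matrix whose entries sum to 1, natural logarithms for the entropy, the convention 0 log 0 = 0, and a nonzero σ² (the correlation is undefined for a perfectly uniform image).

```python
import numpy as np

def texture_features(f):
    """Energy, entropy, correlation, local homogeneity, and inertia of a
    normalized, symmetric SGLD matrix f (Equations 4.41 to 4.45)."""
    f = np.asarray(f, dtype=float)
    G = f.shape[0]
    i, j = np.mgrid[:G, :G]
    mean_x = np.sum(i * f)                      # Equation 4.48
    mean_y = np.sum(j * f)
    variance = np.sum((i - mean_x) ** 2 * f)    # Equation 4.49
    nonzero = f > 0                             # skip 0 log 0 terms
    return {
        "energy": np.sum(f ** 2),
        "entropy": -np.sum(f[nonzero] * np.log(f[nonzero])),
        "correlation": np.sum((i - mean_x) * (j - mean_y) * f) / variance,
        "local_homogeneity": np.sum(f / (1 + (i - j) ** 2)),
        "inertia": np.sum((i - j) ** 2 * f),
    }

# Horizontal SGLD matrix of the 4 x 4 example image, normalized by its 24 pairs.
phi_h = np.array([[4, 2, 1, 0],
                  [2, 4, 0, 0],
                  [1, 0, 6, 1],
                  [0, 0, 1, 2]])
features = texture_features(phi_h / phi_h.sum())
for name, value in features.items():
    print(name, value)
```

Averaging each entry of the returned dictionary over the four directional matrices, and taking the variance across them, gives the kind of input features described in the closing paragraph of this section.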
REFERENCES

1. Daubechies, I., Orthonormal bases of compactly supported wavelets, Comm. in Pure and Appl. Math., 41(7), 1988.
2. Dubuisson, M-P. and Dubes, R. C., Efficacy of fractal features in segmenting images of natural textures, Pattern Recognition Lett., 15, 419, 1994.
3. Haralick, R. M., Shanmugam, K., and Dinstein I., Textural features for image classification, IEEE Trans. Syst., Man Cybern., SMC-3(6), 610, November 1973.
4. Held, G., Data Compression, John Wiley & Sons, New York, 1987.
5. Gabor, D., Theory of communication, J. IEE (London), 93, 429, 1946.
6. Hu, M-K., Visual pattern recognition by moment invariants, IRE Trans. Inf. Theor., IT-8, 179, February 1962.
7. Keller, J. M., Crownover, R. M., and Chen, R. Y., Characteristics of natural scenes related to the fractal dimension, IEEE Trans. Pattern Anal. Mach. Intell., PAMI-9(5), 621, September 1987.
8. Laine, A., Schuler, S., Fan, J., and Huda, W., Mammographic feature enhancement by multiscale analysis, IEEE Trans. Med. Imaging, 13(4), 725, December 1994.
9. Laine, A., Fan, J., and Yang, W., Wavelets for contrast enhancement of digital mammography, IEEE Eng. Med. Biol., 536, September/October 1995.
10. Mallat, S., A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Mach. Intell., 11(7), 674, 1989.
11. Micheli-Tzanakou E., Uyeda E., Ray R., Sharma A., Ramanujan R., and Doug J., Comparison of neural network algorithms for face recognition, Simulation, 64(1), 15, July 1995.
12. Peitgen, H-O., Jurgens, H., and Saupe, D., Chaos and Fractals: New Frontiers of Science, Springer-Verlag, New York, 1992.
13. Peleg, S., Naor, J., Hartley, R., and Avnir, D., Multiple resolution texture analysis and classification, IEEE Trans. Patt. Anal. Mach. Intell., PAMI-6(4), 518, July 1984.
14. Pentland, A. P., Fractal-based description of natural scenes, IEEE Trans. Pattern Anal. Mach. Intell., PAMI-6(6), 661, November 1984.
15. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., Numerical Recipes in C, Cambridge University Press, Cambridge, 1995.
16. Reeves, A., Prokop, R., Andrews, S., and Kuhl, F., Three-dimensional shape analysis using moments and Fourier descriptors, IEEE Trans. Pattern Anal. Mach. Intell., 10(6), 937, Nov 1988.
17. Rioul, O., Duhamel, P., Fast algorithms for discrete and continuous wavelet transforms, IEEE Trans. Info. Theor., 38(2), 569, 1992.
18. Ross, S., A First Course in Probability, Macmillan Publishing Co. Inc., New York, 1976.
19. Sarkar, N. and Chaudhuri, B. B., An efficient approach to estimate fractal dimension of textural images, Pattern Recognition, 25(9), 1035, 1992.
20. Shen, L., Rangayyan, R., and Desautels, J., Application of shape analysis to mammographic calcifications, IEEE Trans. Med. Imaging, 13(2), 263, June 1994.
21. Solka, J. L., Priebe, C. E., and Rogers, G. W., An initial assessment of discriminant surface complexity for power law features, Simulation, 58(5), 311, May 1992.
22. Spreckelsen, M. V. and Bromm, B., Estimation of single-evoked cerebral potentials by means of parametric modeling and Kalman filtering, IEEE Trans. Biomed. Eng., 35(9), 691, 1988.
23. Super, B. J. and Bovik, A. C., Localized measurement of image fractal dimension using Gabor filters, J. Visual Commun. Image Represent., 2(2), 114, June 1991.
24. Thomas, G. and Finney, R., Calculus and Analytic Geometry, Addison-Wesley, Reading, MA, 1988.
25. Unser, M., Aldroubi, A., and Eden, M., On the asymptotic convergence of B-spline wavelets to Gabor functions, IEEE Trans. Inf. Theor., 38(2), 864, 1992.
26. Unser, M., Aldroubi, A., and Eden, M., Fast B-spline transforms for continuous image representation and interpolation, IEEE Trans. Pattern Anal. Mach. Intell., 13(3), 277, 1991.
27. Voss, R. F., Random fractals: characterization and measurement, Scaling Phenomena in Disordered Systems, Pynn, R. and Skjeltorp, A., Eds., Plenum Press, New York, 1986, 1.
28. Yu, F. T. S. and Li, Y., Application of moment invariants to neural computing for pattern recognition, Hybrid Image and Signal Processing II, Proc. SPIE, 1297, 307, 1990.
29. Levine, M., Vision in Man and Machine, McGraw-Hill, New York, 1985.
Section II Unsupervised Neural Networks
5 Fuzzy Neural Networks
Timothy J. Dasey and Evangelia Micheli-Tzanakou
5.1 INTRODUCTION

This chapter is divided into four components. In the first section, the concepts and background relevant to pattern recognition, some typical optimization techniques and ALOPEX, and a tutorial on the ideas and early works in neural networks are dealt with. The danger in this presentation is that these fields might be construed as disjoint problems. The truth is that a large amount of overlap exists between these conceptual divisions. Pattern recognition has benefited from the application of neural networks and optimization. Neural networks commonly use optimization routines to guide their training and have achieved many of their greatest successes in pattern recognition applications. These relationships should be kept in mind during the reading. The last section of this chapter includes a philosophical discussion explaining the rationale for the work.
5.2 PATTERN RECOGNITION

5.2.1 THEORY AND APPLICATIONS
To most individuals, a pattern recognition task involves an ability of the brain to assign labels to objects, sounds, feelings or ideas and discriminate one from another. Most of us are extremely adept at this processing task, while being unaware of the precise mechanism that provides us with this power. In fact, it is through the scientific field of pattern recognition, which relegates this task to machines, that the methods of our brain can be fully realized. Yet, there has never been a machine designed that has our capability to be a general-purpose pattern recognition machine. Regardless of the limitations, machines perform quite well in the grouping and labeling of patterns from certain problem sets. Machines excel when the recognition task is confined to a specific application. An extensive body of literature describes the recent attempts at relegating many pattern recognition tasks to machines as the explosive growth of information overworks the human classifiers.20,48,67 The use of automated pattern recognition machines has now touched nearly every field in an enormous variety of workplaces. The special nature of each pattern recognition task requires selection of the best approach.30 Heuristic approaches, which rely on the designer’s intuition and familiarity with the problem, are often sufficient to provide excellent solutions to many problems. Linguistic (syntactic) approaches are often useful when numerical
measurements are not sufficient to describe the problem. Many pattern recognition problems can be solved through several mathematically substantiated techniques, using statistical variability between patterns or certain pattern similarity measures.18 When confronting a typical pattern recognition task, three particular problems must be addressed by the designer. The first is the representation of the input data that the system will use in its classifications. When determined, these comprise the pattern vector x as

$$\mathbf{x} = (x_{1}, x_{2}, x_{3}, \ldots, x_{n}) \tag{5.1}$$
where n is the total number of parameters needed for analysis. In many mathematical pattern recognition problems, it is convenient to envision each parameter xi as describing an axis in n-dimensional space (n-space), where each pattern then comprises a point in that space, as in the two-parameter space depicted in Figure 5.1.

FIGURE 5.1 A two-parameter space. Each point in the space corresponds to an input pattern vector.
The second problem concerns the extraction of certain characteristic attributes from the pattern vectors and a reduction in the dimensionality (from n to m) of those vectors. This is usually termed the preprocessing and feature extraction problem. The attributes or features to be selected vary with the application but involve the selection of pattern attributes that can best be used for discrimination among the patterns. The feature extraction process can be thought of as an intermediate formulation of the more prominent goal of pattern recognition: the compression of large numbers of attributes to a small number of class determinants.64 The third problem involves the determination of optimum decision procedures, which are used for the identification and classification process. Many such
procedures involve the separation of n-space (or m-space) into clusters, much as Figure 5.1 includes pattern points which are grouped into three similar categories. Although mathematical formulations of pattern recognition methods have been available for several decades, many prominent problems still must be solved before great theoretical improvements can be made.36,38 One issue, that of properly estimating the classification performance of a machine, has been largely agreed upon.58 It is generally thought that two pattern populations are needed to ensure that a machine pattern recognizer can generalize. One group is reserved for the training of the machine (determination of the decisions involved in making a classification), and the other is used for post-training testing. This helps to ensure that the decisions formulated for the training group also apply to the similar but distinct non-training group.
5.2.2 FEATURE EXTRACTION
A feature of a given parameter set refers to an attribute described by one or more elements of the original pattern vector. In an application to imaging, the elements are the individual pixels, and a feature may be selected as a subset of the pixel intensity values. More commonly, a feature describes some combination of the original pixels, as in a Fourier expansion or a spatial filtering operation. The precise meanings of a preprocessing operation and a feature extraction process overlap, but in general a feature extraction operation involves the reduction in dimensionality of the pattern vector. The primary reason for such a transformation is to provide a set of measurements with more discriminatory information and less redundancy to the classifier. The precise choice of features is perhaps the most difficult task in pattern processing. In order to know the most successful set of features for a particular problem, the accuracy of the classifier must be known. Yet the classifier depends on the information from a feature extraction device, and thus cannot normally provide that information without a completely designed feature extractor. This enigma remains the primary reason for the difficulty in evaluating competing feature extraction methods. It has prompted many researchers to select features subjectively from an educated guess of what will be most important to the classifier. These techniques can be effective but are increasingly difficult to apply as the complexity of the patterns increases, and they are always subject to personal bias. Many common mathematical parameters are used as feature measurements.39 A set of n-space Euclidean distance measurements is a very common example. In other situations where the identity of the patterns is known, transformation matrices to minimize intraset pattern entropy65 or intraset pattern dispersion and functional approximation methods are commonly used.
Several orthogonal expansions are also used, including the Fourier expansion and the Karhunen-Loève (K-L) expansion.21 The K-L expansion offers certain optimal properties and will be reviewed in more detail in Section 5.3.1. Other common measurements, such as moment invariants,55 are used because of their constancy under many common pattern transformations. The number of different schemes for feature extraction, even in similar applications (such as the processing of handwritten characters), is typically enormous.
5.2.3 CLUSTERING
“Clustering operation” may have different connotations for different people, even in the pattern recognition community. This chapter will use a terminology commonly accepted by many scientists. A clustering operation involves the grouping of like patterns with one another without any knowledge of pattern identity beforehand (an unsupervised classification operation). Classification problems generally have this information. In the clustering problem, patterns must be separated solely on their specific attributes, whereas the classification problems have access to error signals which can be generated to guide the decision making of the machine. Continuing with the geometrical analogy outlined in Section 5.2.1, let us envision each pattern as a point in n-space, much as we see each star as a point in the sky. If we are asked to group the stars in the sky, what measurement do we use for this determination? The exact formulation of this answer often depends on the application.50,53 In some instances, the distances between patterns can be used to separate patterns. In other cases, the pattern density in regions of space is used to indicate locations where patterns are likely drawn from the same class; these techniques are known as histogram approaches.37 It should be clear already that no clustering operation can ever be guaranteed to operate without error. The successful operation of the clustering method relies on the separability of the data from the attributes used as the pattern vectors. If two or more classes overlap in n-space, they will never be perfectly separated. All clustering problems rely on the fidelity of these input data and generally are based on the separation of highly dense regions of patterns from one another. Each of these “modes” of the distribution is assigned a particular class label at a later time.
A large class of problems relies on a hierarchical grouping of pattern data.13 The procedures used usually suffer from a phenomenon called “chaining”, where small errors in grouping at the extremes of the tree are accentuated at later levels. Patterns in this scheme can be given a classification arbitrarily by choosing to “cut” at a particular level of the tree, but recent thinking is that there is significant information in leaving the class identity of patterns “fuzzy”. Fuzzy clustering refers to assigning grades of membership to patterns and is currently a widely touted method.16,23
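To make the idea of graded membership concrete, the following is a minimal sketch only. It assumes a fuzzy c-means-style membership formula, which is not taken from this chapter, and the function name `fuzzy_memberships` and the fuzziness exponent `m` are illustrative choices. Each pattern receives a grade between 0 and 1 for every cluster center, and the grades sum to 1.

```python
import math


def fuzzy_memberships(point, centers, m=2.0):
    """Grades of membership of one pattern in each cluster.

    Assumption: the fuzzy c-means form u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)),
    where d_i is the Euclidean distance from the pattern to center i.
    """
    dists = [math.dist(point, c) for c in centers]
    # A pattern sitting exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# A pattern one unit from the first center and three units from the second.
grades = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
print(grades)
```

A hard ("cut") classification can still be recovered from such grades by taking the cluster with the largest membership, so the fuzzy description carries strictly more information than the crisp one.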
5.3 OPTIMIZATION
5.3.1 THEORY AND OBJECTIVES
In many situations it is desirable to find the values of a set of parameters that best define the solution to a particular problem. As a rule, it is always possible to perform an exhaustive search over the entire parameter space, choosing the parameter values that bring the system closest to its desired operation. In most cases, a measure can be formalized to assess the degree of fit of the proposed solutions to the ideal. This measure is usually termed a cost, energy, Hamiltonian, or objective function. Although an exhaustive search through all allowable combinations of system parameters is always theoretically possible, it is generally not feasible for even a moderately high number of system parameters. In fact, the number of possible choices (N) explodes exponentially with the number of parameters (q) involved in the space as
$$N = n_1 n_2 n_3 \cdots n_q \tag{5.2}$$
where $n_j$ is the number of samples of parameter j, or as
$$N = n^q \tag{5.3}$$
when $n_1 = n_2 = \cdots = n_q = n$. As an example, if the system is composed of three parameters, and the search is conducted by sampling the parameters every 0.1 in the interval [0,1], then there are $11^3$ or 1331 search items. With the same sampling and ten parameters, the search list includes $11^{10}$ or about $2.6 \times 10^{10}$ items. It is obvious that this scheme is untenable for all but simple problems and is certainly impossible for the implementation of a dynamic system. It is for this reason that optimization procedures have long received attention. The goal is to find the optimal (or at least close to optimal) solution with a shorter search time than the exhaustive search method. One of the conceptual means of achieving this uses a hyper-dimensional geometrical visualization of the cost function as it varies with each of the (presumably uncoupled)* system parameters. This parameter space is usually widely variant, and the search over that space involves the extraction of a global minimum (or maximum, depending on the cost function used) from among all of the local minima. In truth, the global extremum is rarely consistently attainable for realistic situations in finite time, but usually a very close approximation is both achievable and sufficient. Two means for adjusting the exhaustive search technique readily come to mind. The first involves sampling the entire parameter space at a low resolution and finding the lowest (in the case of a minimization) region. Then that subregion can be sampled at a higher resolution, and so on ad infinitum. This procedure has occasionally been adapted but makes the major assumption that the global extremum is contained in a larger depression about it (and that the boundary of the global minimum is at least approximately funneling into that extremum). This is a gross oversimplification, and application of these methods can result in a solution far from the best choice.
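The growth of the exhaustive search list can be checked with a few lines of arithmetic. The sketch below is only an illustration of Equations 5.2 and 5.3; the helper names `grid_points_per_parameter` and `exhaustive_search_size` are hypothetical.

```python
import math


def grid_points_per_parameter(low, high, step):
    """Number of samples when one parameter is sampled every `step` on [low, high]."""
    return round((high - low) / step) + 1


def exhaustive_search_size(samples_per_parameter):
    """Equation 5.2: the product of the sample counts n_1 ... n_q."""
    return math.prod(samples_per_parameter)


n = grid_points_per_parameter(0.0, 1.0, 0.1)   # 11 samples per parameter
print(exhaustive_search_size([n] * 3))          # three parameters
print(exhaustive_search_size([n] * 10))         # ten parameters
```

Passing a list with unequal entries covers the general case of Equation 5.2, where each parameter is sampled at its own resolution.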
In many other schemes, referred to as gradient descent techniques, the effect of a parameter change on the cost function is calculated, and the parameter is adjusted so that it moves downhill toward a better solution. This results in a rapidly converging iterative procedure, but it is only fortuitous if the solution at which it arrives is a global, rather than a local, minimum. All good optimization routines work on the concept that a short-term deleterious move (moving uphill as well as downhill) is necessary to allow escape from local minima and arrival near, or preferably at, the global extremum.
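The dependence of gradient descent on its starting point can be seen on a one-parameter example. The sketch below assumes an arbitrary illustrative cost function with two minima (it does not come from this chapter) and a fixed step size; `gradient_descent` is a hypothetical helper name.

```python
def cost(x):
    """Illustrative cost with a shallow minimum near x = 1.13 and a deeper one near x = -1.30."""
    return x**4 - 3.0 * x**2 + x


def cost_gradient(x):
    """Analytic derivative of `cost`."""
    return 4.0 * x**3 - 6.0 * x + 1.0


def gradient_descent(x0, rate=0.01, steps=2000):
    """Move the parameter downhill along the local slope for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x -= rate * cost_gradient(x)
    return x


right = gradient_descent(2.0)    # settles in the shallow (local) minimum
left = gradient_descent(-2.0)    # settles in the deeper (global) minimum
print(right, cost(right))
print(left, cost(left))
```

Both runs converge quickly, but only the second start happens to lie in the basin of the global minimum, which is the fortuitous outcome described above.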
5.3.2 BACKGROUND
Much of the literature on optimization deals with the analogy between optimization and statistical mechanics. Perhaps the first to draw this comparison was the technique that has become known as the Metropolis algorithm.43 This system was originally written as a means for investigating such macromolecular properties as the states of
substances at the level of a set of N individual molecules. At any particular time, the potential energy of the system can be found as
$$E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} V(d_{ij}) \quad \forall\, i \neq j \tag{5.4}$$
where V is the potential between molecules and $d_{ij}$ the minimum distance between molecules i and j. The problem consists of optimizing the positions of the particles in space (in this case 2-D space) to arrive at the lowest potential energy of the system. Starting with random positions of the particles, each particle is moved a random amount in a random direction. The new energy of the configuration is checked. If the energy is decreased, the move is allowed. However, if the move results in a higher potential energy, the move is allowed with a probability
$$P(\Delta E) = e^{-\Delta E / kT} \tag{5.5}$$
which is the Boltzmann distribution. Notice that this is no longer a gradient descent technique, but rather there is always a finite probability that the system can move uphill, out of a local minimum. In this way, the equilibrium states of the set of molecules could be analyzed, and it was seen that the system settled in configurations which also conformed to a Boltzmann distribution. It turns out that this method is a simple modification of a Monte Carlo scheme, where instead of choosing configurations randomly and weighting those configurations with a Boltzmann factor, the configurations are chosen with a Boltzmann distribution (evident through the simulations) and weighted evenly.
The analogy between this statistical mechanics problem and optimization was explored even further with the introduction of the “simulated annealing” procedure.34 If we examine the Boltzmann update from the Metropolis algorithm, it is clear that the higher the temperature, the more likely that an uphill move will be accepted. Conversely, at zero temperature all uphill moves will be denied, and the system will fall to an energy minimum. To ensure a ground state configuration (without crystal imperfections) in a material, the system must be carefully annealed, a process where the substance is first melted and then slowly cooled, with extra time spent in the vicinity of the phase transition. With the analogy to the optimization problem, a “ground state” (global minimum) of the system may be found by starting off at a high value of temperature. This corresponds to melting the system so that uphill moves are nearly equiprobable to downhill moves, and the system randomly wanders in parameter space. By slowly lowering the temperature (and thus reducing the probability of uphill moves), the system can slowly settle into a minimum.
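The Metropolis acceptance rule of Equation 5.5, combined with a slowly falling temperature, can be sketched as follows. This is a minimal illustration under stated assumptions: the one-parameter energy function, the geometric cooling factor, the uniform move proposal, and the name `simulated_annealing` are all illustrative choices rather than details from the cited work, and the constant k is absorbed into the temperature.

```python
import math
import random


def energy(x):
    """Illustrative energy with one local and one global minimum."""
    return x**4 - 3.0 * x**2 + x


def simulated_annealing(energy_fn, x0, t0=2.0, cooling=0.995,
                        steps=5000, step_size=0.5, seed=0):
    """Metropolis moves under a geometric cooling schedule (an assumed schedule)."""
    rng = random.Random(seed)
    x, e = x0, energy_fn(x0)
    best_x, best_e = x, e
    t = t0
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy_fn(candidate)
        d_e = e_candidate - e
        # Downhill moves are always taken; uphill moves are taken with
        # probability exp(-dE / T), as in Equation 5.5.
        if d_e <= 0.0 or rng.random() < math.exp(-d_e / t):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
        t *= cooling   # lower the temperature slightly after every move
    return best_x, best_e


best_x, best_e = simulated_annealing(energy, x0=2.0)
print(best_x, best_e)
```

At the initial temperature most uphill moves are accepted and the parameter wanders freely; by the final iterations the acceptance probability for any uphill move is effectively zero and the procedure behaves like a downhill search.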
It has been shown that the simulated annealing procedure can find the global minimum under certain conditions with probability 1.0, but that finding it may take an inordinate amount of time. A calculation analogous to the specific heat of the system can be used to signal phase transitions in the optimization. The simulated annealing procedure has been applied to a wide variety of pattern recognition tasks.66 The primary emphasis of experimentation with the algorithm has been the adjustment of the cooling schedule, the process of lowering the temperature.51 The simulated annealing procedure has received great attention over the last decade but is burdened by the application-dependent optimum cooling method and the necessity of a large number of iterations for convergence.
Another method has been dubbed Mean Field Annealing.49,54 In this scheme, the cost function H(x), which may have many local minima and in other ways be “ugly”, is replaced by another function H(x,m) which “resembles” H(x) but has components that are much easier to minimize (they could be convex functions with only one minimum). In order to make the two functions resemble each other, the set of parameters $m_i$ must be estimated. To perform this, another technique is borrowed from statistical mechanics, the mean field approximation. The details are too intensive to consider in this synopsis, but by estimating each of the parameters $m_i$, the problem is reduced to a series of gradient descents at each value of temperature (note that this theory also utilizes a Boltzmann probability distribution).
Several researchers have noted that each of the above methods involves a local search about the current point in parameter space. That is, even with the capability to move uphill, all operations are still local. The odds of crossing a wide gap to a region of “better” minima are low, and thus more global search methods have been proposed. One commonly used technique is to run multiple trials on a given data set, saving the best result. Given a high number of random starts, the hope is that the global optimum will be among the optima identified. Galar22 proposed an optimization routine, similar to Eigen’s theory of macromolecular evolution,19 which was more capable of crossing wide gaps.
This method has many striking similarities, at least in concept if not in the method of application, to the ALOPEX process discussed in detail in the next section. Galar describes a two-term parameter update, one of which is a modified Markov chain and the other a random walk component. He claims that the resulting “biased random walk” is more capable of crossing wide gaps between local extrema than procedures like simulated annealing. Another recent approach has been named the dynamic tunneling algorithm.40 This routine uses gradient descent to go to a local minimum. At this time, the system “tunnels” through the surrounding hill (using an appropriately defined tunneling function) for the purpose of finding a point, other than the last minimum, which, when gradient descent is continued, will arrive at a point lower than the last minimum. The calculations are quite intensive, but the algorithm converges relatively often to the global minimum and may be more effective for problems with a high density of local minima.
5.3.3 MODIFIED ALOPEX ALGORITHM
The optimization routine ALOPEX (ALgorithms Of Pattern EXtraction) presents an alternative to the previously reviewed algorithms. It was originally applied to the measurement of the visual receptive fields of cells in the adult frog tectum.25,61 In the original application of the method, the cost function was referred to as the response function R.
The method normally updates the model parameters (the pixel intensity values in the original application) as
$$P_i(n+1) = P_i(n) + B_i(n) + r_i(n+1) \tag{5.6}$$
where $B_i(n)$ represents the influence of a term due to historical bias, and $r_i(n)$ is a random noise component. The bias term is calculated as
$$B_i(n) = B_i(n-1) + \gamma\, \Delta P_i(n)\, \Delta R(n) \tag{5.7}$$
where $\Delta P_i(n)$ represents the previous change in the ith parameter value $P_i(n)$ as
$$\Delta P_i(n) = P_i(n) - P_i(n-1) \tag{5.8}$$
and $\Delta R(n)$ indicates the similar change in the response function as
$$\Delta R(n) = R(n) - R(n-1) \tag{5.9}$$
The two terms in the modification of Equation 5.6 provide different influences on the optimization. The first term is a bias term which tends to move the parameter in the direction that has been successful in the past. It is actually an aggregation of the biases to that point in the simulation, where the direction of the latest addition to the bias is determined by the change in the response function due to the last move. The second term is a random number, generated for each parameter at each iteration, which provides the opportunity for the parameter to move against the direction of recent success. As mentioned earlier, this capability to move “uphill” is what provides a good optimizer with the ability to escape local extrema.* The term $r_i(n)$ in the ALOPEX update equation is typically implemented as a Gaussian random number with zero mean and standard deviation σ. The accumulation of the bias terms in Equation 5.7 must be controlled in order to prove helpful. Without this regulation, the magnitude of $B_i$ due to past iterations may overpower the relatively smaller change from the current iteration. In this scenario, the system has effectively gained “mass”, so that the “momentum” of the movement in one direction will not allow the system to stop quickly enough at the sites of the extrema. In all simulations in this chapter, the magnitude of $B_i$ is constrained to the limits [−a, a]. The first two iterations of the simulation supply random numbers for the forthcoming update statements. The responses are found for each, and the update Equations 5.6 through 5.9 are applied to all of the parameters. This process repeats itself until the simulation is finished.
* The form of Equation 5.6 is correct for the maximization of the response function R. For minimization, the sign of the bias term should be changed.
Note that the ALOPEX process provides for the simultaneous update of all parameters at once, which the simulated annealing algorithm does not. This generally makes the ALOPEX process more time-efficient. In addition, the magnitude of the random component does not depend on the amount by which that component raises or lowers the response (there is a dependence via the Boltzmann distribution in simulated annealing). This makes it easier for parameters to traverse wide gaps between the extrema. Even with the differences previously indicated, there are some analogies between the parameters of simulated annealing and those of ALOPEX. If the magnitude of the random component is much higher than that of the biasing component, then the parameters will be overwhelmingly driven by randomness, a situation analogous to the “melting” process in simulated annealing. Conversely, with no noise, the ALOPEX process simplifies to a gradient descent.* This indicates that the choices of γ and σ are critical for controlling the speed and accuracy of the convergence. The suspicion that the ALOPEX process could be run under similar conditions of “annealing” was confirmed by earlier60 and later work,14 which showed that slowly shrinking the magnitudes of both the noise and bias components in the update of the parameters could result in a great improvement in both the speed and accuracy of the optimization. In this work, the values of γ and σ were initially high and were lowered during the course of the simulation by the schedule
$$\gamma(n) = (\gamma_0 - \gamma_\infty)\, e^{-n/\tau} + \gamma_\infty \tag{5.10}$$
and
$$\sigma(n) = (\sigma_0 - \sigma_\infty)\, e^{-n/\tau} + \sigma_\infty \tag{5.11}$$
where τ was used to control the rate of the “cooling”, and the initial and final parameter values are user entered. Many other improvements have also been suggested, including a parallel implementation of the algorithm,42 averaging between multiple ALOPEX processes,60 and an interleaved formulation of the algorithm to work on multiple response functions.10 Recent work has used distributed ALOPEX processes working on overlapping “fields” of an image to enhance convergence speed.41 In a situation where the algorithm is used for noise removal or the correction of pattern imperfections and there exists a set of templates to guide the optimization, it has been shown42 that the response functions $R_j(n)$ from each of the m templates can be combined into a single response function as
$$R' = \sum_{j=1}^{m} \frac{R_j^2(n)}{\sum_{k=1}^{m} R_k(n)} \tag{5.12}$$
* Note that there is not complete freedom of movement of the parameters with no noise. This is due to the fact that only one response function is used.
A similar function, with the inversion of both the numerator and denominator of Equation 5.12, was used for minimizing a particular response function.15 The ALOPEX process has been applied successfully to many application areas since its introduction, in large part because of its general and flexible form. The ALOPEX process is interesting in that the pattern recognizer can be converted to a pattern extractor.44,17 Other applications include curve fitting to waveforms such as Visual Evoked Potentials,45,62 crystal growth,26 the traveling salesman problem,22 and pattern recognition applications.15 Using ALOPEX in perceptual tasks has also been addressed.28 Recent applications include the use of ALOPEX in reconstructing compressed images,60 reducing motion artifacts,11 and use of the VEP as a generator through ALOPEX of patterns of stimulation.31
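The update of Equations 5.6 through 5.9, together with the bias limit [−a, a] and the schedules of Equations 5.10 and 5.11, can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function names, all default parameter values, and the quadratic response used in the usage example are hypothetical, and the sketch additionally keeps track of the best parameter set visited.

```python
import math
import random


def schedule(n, start, end, tau):
    """Exponential decay from `start` toward `end` (form of Equations 5.10 and 5.11)."""
    return (start - end) * math.exp(-n / tau) + end


def alopex_maximize(response, p0, iters=2000, gamma0=1.0, gamma_inf=0.1,
                    sigma0=0.5, sigma_inf=0.01, tau=500.0, a=0.1, seed=0):
    """Maximize `response` over a list of parameters with an ALOPEX-style update."""
    rng = random.Random(seed)
    q = len(p0)
    # The first two parameter sets differ by a random move, which supplies
    # the initial changes in P and R needed by the update equations.
    p_prev = list(p0)
    r_prev = response(p_prev)
    p = [x + rng.gauss(0.0, sigma0) for x in p_prev]
    r = response(p)
    bias = [0.0] * q
    best_p, best_r = (list(p), r) if r > r_prev else (list(p_prev), r_prev)
    for n in range(iters):
        gamma = schedule(n, gamma0, gamma_inf, tau)
        sigma = schedule(n, sigma0, sigma_inf, tau)
        d_r = r - r_prev                                   # Equation 5.9
        new_p = []
        for i in range(q):
            d_p = p[i] - p_prev[i]                         # Equation 5.8
            bias[i] += gamma * d_p * d_r                   # Equation 5.7
            bias[i] = max(-a, min(a, bias[i]))             # keep B_i within [-a, a]
            new_p.append(p[i] + bias[i] + rng.gauss(0.0, sigma))  # Equation 5.6
        p_prev, r_prev = p, r
        p, r = new_p, response(new_p)
        if r > best_r:
            best_p, best_r = list(p), r
    return best_p, best_r


def demo_response(p):
    """Hypothetical response that peaks (at zero) when p equals the target."""
    target = [1.0, -2.0, 0.5]
    return -sum((x - t) ** 2 for x, t in zip(p, target))


best_p, best_r = alopex_maximize(demo_response, [0.0, 0.0, 0.0])
print(best_p, best_r)
```

All parameters are updated in the same iteration from a single scalar response, which is the property contrasted with simulated annealing in the text. For a minimization, the sign of the bias increment would be reversed.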
5.4 SYSTEM DESIGN
A pattern recognition system comprises four essential components, as labeled in Figure 5.2. The preprocessing module is an application dependent stage. The feature extracting and clustering modules are the trainable commodities in this scheme and comprise the bulk of the discussion of this chapter. As indicated by Figure 5.2, these modules are under the training control of the ALOPEX process, although the depiction of ALOPEX as a control external to the individual module is merely used as a convenience. A more accurate depiction would place an independent ALOPEX process within each stage. The final module is a labeling stage, in which the clusters formed in the previous stage are assigned an identity.
5.4.1 FEATURE EXTRACTION
A feature of an input pattern refers to any measurement from a set of pattern measurements that characterizes some attribute of that pattern. It was previously mentioned that a good feature extraction routine will compress the input space to a lower dimensionality while still maintaining a large portion of the information contained in the original pattern space. Although this is the most often-cited advantage of feature extraction, it is also true that an appropriate choice of features can help eliminate redundant and irrelevant information from the data set, thereby reducing the overhead for the classifier.56 Most feature extraction routines can significantly aid in the performance of a classifier. In unsupervised situations, the only information available to the feature extraction module is the statistical distribution of the patterns. In such a scenario, it is impossible to quantitatively analyze the effectiveness of a feature extraction routine in improving pattern classification. However, there are operators designed to maintain high information content in the features (as compared to the original measurement pattern space) with a minimum number of dimensions.
FIGURE 5.2 A component diagram of the pattern recognition system used in this research. The dotted lines indicate control signal input to the modules, whereas the solid lines denote transfer of data.
5.4.1.1 The Karhunen-Loève Expansion
Perhaps the most widely used feature extraction routine with some of these information-conserving properties is the Karhunen-Loève (K-L) expansion.9,32,63 If it is desired to represent the kth N-dimensional input pattern x with an M-dimensional feature pattern y, then an M × N matrix Φ can be chosen so that
$$\vec{y}^{\,k} = \Phi\, \vec{x}^{\,k} \tag{5.13}$$
The matrix Φ is actually a set of M orthonormal vectors (meaning they lie perpendicular to one another in N-dimensional space and have unit magnitude) constructed by taking the M eigenvectors corresponding to the M largest eigenvalues of the covariance matrix C constructed from the input patterns. This matrix C is formed by
$$C = \sum_{k=1}^{P} \left(\vec{x}^{\,k} - \vec{\mu}\right)\left(\vec{x}^{\,k} - \vec{\mu}\right)^{T} \tag{5.14}$$
where P is the number of patterns in the set and the vector $\vec{\mu}$ is the mean vector of the pattern set,
$$\vec{\mu} = E(\vec{x}) \tag{5.15}$$
It has been shown that the eigenvectors of this covariance matrix exhibit certain optimal properties as a feature extractor when they are ordered with their correspondence to the eigenvalues of the matrix from highest to lowest.9 One of these properties is that the mean square representation error is the minimum for any choice of M orthogonal vectors, meaning that the approximation error (ε) of reconstructing the original pattern space with only M features
$$\varepsilon = \sum_{k=1}^{P} \sum_{i=1}^{N} \left( x_i^k - \sum_{j=1}^{M} \Phi_{ij}\, y_j^k \right)^{2} \tag{5.16}$$
is the smallest for any choice of M vectors in Φ. This means the expansion answers a key requirement in the information compression problem of feature extraction. Its other optimum property is that this choice of vectors associates with the coefficients of the expansion a minimum measure of entropy or dispersion. The borrowed concept of entropy is often used in the pattern recognition field as a clustering measure,57,65 and so this minimum entropy property characterizes the K-L expansion as likely to contain clustering transformational properties. The crux of the theory is that the features contain the most information (without knowing the pattern identities) for the price and probably retain the existing pattern groups in the population.
An implementation of the K-L expansion as a feature extractor21,35,33 generally proceeds in the opposite direction to the above analysis, in the following steps:
1. Calculate the covariance matrix in Equation 5.14 using all available patterns and the mean vector from Equation 5.15.
2. Find the eigenvalues and eigenvectors of that covariance matrix.
3. Select the eigenvectors corresponding to the M largest eigenvalues and store them in the matrix Φ.
4. Find the feature values for each of the patterns via Equation 5.13.
The primary advantage in using the K-L expansion for selection of features is that it requires no previous knowledge of pattern labels and thus is perfectly suited to unsupervised tasks. Many people confuse the aforementioned optimum properties of the expansion with an assumption that the features generated by the expansion provide optimum performance of the classifier. This is certainly not the case. A feature extractor can never provide optimum classifier information without information from the classifier about its historical performance. Nevertheless, the K-L expansion is useful in situations where that information is not reliable or available.
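The four implementation steps can be sketched with NumPy as follows. This is a minimal illustration, assuming that Equation 5.14 is the sum of outer products of the mean-removed patterns and that the patterns are stored one per row; the function name `kl_expansion` and the use of `numpy.linalg.eigh` are choices made here, not part of the cited implementations.

```python
import numpy as np


def kl_expansion(X, M):
    """Return (Phi, Y, eigenvalues) for patterns X of shape (P, N) and M features."""
    mu = X.mean(axis=0)                        # mean vector, Equation 5.15
    centered = X - mu
    C = centered.T @ centered                  # step 1: N x N matrix of Equation 5.14
    eigvals, eigvecs = np.linalg.eigh(C)       # step 2: symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1][:M]      # step 3: indices of the M largest eigenvalues
    Phi = eigvecs[:, order].T                  # rows of Phi are the selected eigenvectors
    Y = X @ Phi.T                              # step 4: Equation 5.13 applied to every pattern
    return Phi, Y, eigvals[order]


# Usage with a small made-up pattern set (five patterns, three measurements each).
X = np.array([[2.0, 0.0, 1.0],
              [4.0, 1.0, 1.5],
              [6.0, 1.0, 0.5],
              [8.0, 3.0, 1.0],
              [10.0, 2.0, 2.0]])
Phi, Y, eigvals = kl_expansion(X, M=2)
print(Phi)
print(Y)
```

Because `eigh` returns orthonormal eigenvectors for a symmetric matrix, the rows of `Phi` satisfy the orthonormality requirement stated for Φ without any further normalization.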
The K-L expansion is also a linear operation, and considerable evidence suggests that other nonlinear features can often provide more useful information to the classifier.12 There is an additional point which should be mentioned here. It turns out that the primary eigenvector points in such a direction that the variance of the patterns in that feature space is maximum for all vectors. Subsequent eigenvectors find other locally maximum variance features in orthogonal directions. Furthermore, the eigenvalue corresponding to each eigenvector is exactly equivalent to the variance of the patterns projected onto that eigenvector.18 It is this realization that provides the impetus for the enactment of the K-L expansion onto a neural network, as described in the next section.
5.4.1.2 Application by a Neural Network
The linear projection of a pattern vector onto one of the K-L expansion vectors is a simple inner product operation, as denoted by Equation 5.13. Conveniently, this is the same operation that is most commonly given to units of an artificial neural system, as seen by the comparison with Equation 5.13 for all $Q_k$ = 0. In concept then, an artificial neuron can use its connection weightings to act as one of the eigenvectors contained in the matrix Φ in the last section. It only remains to consider the method of training a cell to retain that specific pattern of connections. After training, the output of the neuron is a real number corresponding to the feature value from the input pattern. A hint has already been given about the means for training a cell to retain the “maximum” eigenvector. It was mentioned that the primary K-L expansion vector had the property that the output features generated by it contained the maximum variance of any features generated by any other choice of vectors. In direct relation to the neural network, if the variance of the output of the cell for all training patterns is maximized during training, the connection vector retained after training corresponds to the primary K-L vector. The scheme for training any one cell follows the steps outlined below.
1. Set the connections to the cell to random values.
2. Calculate the output value of the cell for every training pattern.
3. Find the variance of the output over all patterns used in step 2.
4. Update the connection weights to the cell via an ALOPEX update equation.
5. Normalize the connection weights to the cell to a vector magnitude of 1.0.
6. Go to step 2 until a convergence criterion has been met.
In particular, the output O of the jth cell from the input of the kth N-dimensional pattern x is found as
$$O_j^k = \sum_{i=1}^{N} C_{ij}\, x_i^k \tag{5.17}$$
and the variance of the jth cell output ($V_j$) is calculated over all P patterns by
$$V_j = \frac{\sum_{k=1}^{P} \left(O_j^k - \mu_j\right)^2}{P}, \qquad \mu_j = E[O_j] \tag{5.18}$$
The ALOPEX update equation uses the changes in the variance ($\Delta V_j$) and connections ($\Delta C_{ij}$) from the current iteration (n) to the previous iteration (n−1) to change the connections by
$$C_{ij}(n+1) = C_{ij}(n) + \Delta B_{ij}(n) + r_{ij}(n) \tag{5.19}$$
where
$$\Delta B_{ij}(n) = \gamma\, \Delta V_j(n)\, \Delta C_{ij}(n) \tag{5.20}$$
[see Equations 5.6 through 5.9]. The term $r_{ij}(n)$ is a Gaussian random number with zero mean and standard deviation σ. The factors γ and σ are adjusted as in Equations 5.10 and 5.11. To illustrate the performance of the algorithm, a simple two-dimensional pattern space was constructed, and one neuron was trained on 60 patterns in this space using ALOPEX output variance maximization. The performance of the algorithm can be tested easily, since the “ideal” maximum is known (it can be found by the analysis of Section 5.4.1). The resultant vectors of the ALOPEX method and the K-L expansion are shown in Table 5.1. Also included in that table are two other commonly used methods for unsupervised neural network training. The first is a simple normalized Hebbian scheme, where the connections are changed during each iteration by
$$\Delta C_{ij} = \eta\, O_j\, x_i \tag{5.21}$$
where η is a gain factor. The second is a widely touted method employed by Oja,47 which is a variant of the Hebbian proposal and includes output feedback as
$$\Delta C_{ij} = \eta\, O_j \left( x_i - O_j\, C_{ij} \right) \tag{5.22}$$
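The six-step single-cell procedure with ALOPEX variance maximization can be sketched as below. This is a hedged illustration only: it follows Equations 5.17 through 5.20 literally for one cell, uses fixed values of γ and σ instead of a decaying schedule, clips the bias increment to [−a, a], runs a fixed number of iterations in place of a convergence criterion, and returns the best unit-length vector visited. The function name and all numerical settings are assumptions.

```python
import numpy as np


def train_variance_cell(X, iters=3000, gamma=5.0, sigma=0.02, a=0.05, seed=0):
    """Train one cell on patterns X (shape P x N) to maximize its output variance."""
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]

    def unit(v):
        return v / np.linalg.norm(v)           # step 5: normalize to magnitude 1.0

    def output_variance(c):
        return np.var(X @ c)                   # steps 2 and 3: Equations 5.17 and 5.18

    c_prev = unit(rng.normal(size=n_inputs))   # step 1: random initial connections
    v_prev = output_variance(c_prev)
    c = unit(c_prev + rng.normal(scale=sigma, size=n_inputs))
    v = output_variance(c)
    best_c, best_v = (c, v) if v > v_prev else (c_prev, v_prev)
    for _ in range(iters):                     # step 6: fixed iteration budget
        delta_b = gamma * (v - v_prev) * (c - c_prev)        # Equation 5.20
        delta_b = np.clip(delta_b, -a, a)
        noise = rng.normal(scale=sigma, size=n_inputs)
        c_new = unit(c + delta_b + noise)                    # step 4: Equation 5.19
        c_prev, v_prev = c, v
        c, v = c_new, output_variance(c_new)
        if v > best_v:
            best_c, best_v = c, v
    return best_c, best_v


# Usage on made-up two-dimensional patterns with a nonzero mean.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2)) * np.array([0.5, 2.0]) + np.array([1.0, 3.0])
weights, variance = train_variance_cell(X)
print(weights, variance)
```

Because the variance in Equation 5.18 is taken about the mean output, the cost is insensitive to any constant offset in the patterns, which is the property the comparison with the Hebbian schemes turns on.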
Table 5.1 shows some clear results, the first of which is that the variance maximization scheme with ALOPEX was able to mimic very accurately the optimal K-L vector. Second, it clearly points out a weakness in the Hebbian proposal. As further experiments will show, the Hebbian training29 (and the Oja training, which is nearly identical in most real world situations) cannot optimize to the “best” vector because the input patterns are not zero-mean centered (they do not share a center of mass at the origin of the coordinate system). Thus, the Hebbian mechanism is essentially unable to compensate for DC offsets, a situation that must be tolerated in nearly all real-world applications. Obviously this weakness can be compensated for by centering the data before it enters a Hebbian-trained module, but this requires additional post-training computation and is impractical in a highly connected system. Oja claims that this scheme will result in the cell retaining the principal component of the input pattern distribution.*
* The principal component is identical to the first K-L expansion vector when the center of mass of the patterns in data space is at the origin.
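The effect of a DC offset can be illustrated without running any training at all. Oja's rule is known to converge toward the principal eigenvector of the uncentered second-moment matrix of the inputs, whereas the K-L vector is the principal eigenvector of the covariance matrix. The sketch below compares the two on a made-up pattern set; the data, the offset, and the variable names are illustrative assumptions and do not reproduce the patterns of Figure 5.3.

```python
import numpy as np


def top_eigenvector(matrix):
    """Unit eigenvector of a symmetric matrix belonging to its largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(matrix)   # eigenvalues in ascending order
    return eigvecs[:, -1]


# Patterns spread along one direction, displaced from the origin by an offset.
direction = np.array([0.2, 0.98])
direction = direction / np.linalg.norm(direction)
offset = np.array([3.0, 3.0])
t = np.linspace(-1.0, 1.0, 21)
X = offset + np.outer(t, direction)

second_moment = X.T @ X / len(X)                # what a Hebbian-type rule responds to
centered = X - X.mean(axis=0)
covariance = centered.T @ centered / len(X)     # what the K-L expansion uses

hebbian_like = top_eigenvector(second_moment)
kl_vector = top_eigenvector(covariance)

offset_dir = offset / np.linalg.norm(offset)
print(abs(hebbian_like @ offset_dir))           # alignment with the offset direction
print(abs(kl_vector @ direction))               # alignment with the direction of spread
```

With an offset that is large compared with the spread of the patterns, the uncentered eigenvector points almost along the offset, while the covariance eigenvector recovers the direction of spread. Removing the mean before forming the second-moment matrix makes the two coincide, as the footnote states.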
TABLE 5.1 Converged Connection Vectors

Principal K-L Vector    ALOPEX Variance Maximization    Normalized Hebbian    Oja Scheme
0.2011                  0.1920                          0.6305                0.6299
0.9796                  0.9810                          0.7760                0.7770

Note: For the two-dimensional sample data space shown in Figure 5.3. The first row in the table refers to the first connection value in the neuron, while the second row denotes the second connection value.
Figure 5.3 shows the pattern space that was used for the training results in Table 5.1. Clearly the space is composed of three quite distinct clusters. However, not all choices of vectors allow all of those clusters to be clearly evident at the output of the feature cell. The illustrations of Figure 5.4 make it obvious why the K-L and ALOPEX-trained connections are superior choices to the Hebbian scheme. The diagrams in Figure 5.4 are histograms of the output values of the cell for each of the different vectors in Table 5.1. It is easy to see why the K-L expansion and ALOPEX-trained vectors are preferable to the Hebbian and Oja schemes: the increased range of the neuron output levels increases the information content of the output by allowing all three clusters to be evident on the output line. So the concept of variance maximization indirectly promotes the retention of cluster forming information. Note that this is not a proof of superiority of the methods, but it is a valid observation with an example training set. With the one-cell implementation described thus far there is still the question of how a network of cells is to optimize to other vectors in the K-L expansion. Since each cell is searching for the “optimal” vector, all cells in a network will arrive at or near the same vector when using the same input pattern set. There are several possible means for forcing other neurons to optimize to other K-L vectors when more than one feature output is necessary. The training routine can be altered so that the optimization is not a global search. This will force each cell to arrive at a local optimum that differs depending on the initial conditions, but this does not guarantee that redundant cells will not be formed, and it imposes a strong likelihood that some cells will arrive at local maxima which are not K-L expansion vectors. 
A second possibility is for an additional term to be added to the ALOPEX cost function which uses feedback connections between other feature extracting cells in the network to impose a constraint on the uniqueness of each cell. There are several problems with this choice. The insertion of inter-network feedback requires that additional care be taken to prevent instability, and even when stability of the output of the cells is guaranteed, computation time is drastically increased because of the need to wait until the outputs of the cells “settle”. Additionally, the weighting between the terms is most likely problem-dependent and is certainly optimally different for each cell. This possibility was examined in detail, though, and the interim results indicate the possibility of training cells to recognize features that were highly nonlinear and yet often retained some of the optimal aspects required by the problem. The work on this possibility was suspended when the training times became exorbitant.
FIGURE 5.3 The input pattern space used for the training of the vectors in Table 5.1. There are 60 patterns in the space, with 20 grouped in each cluster.
Instead, a more viable solution was obtained. The actual method used imposes the constraints on the ALOPEX optimization in a more implicit way. Instead of adding extra qualifying terms to the ALOPEX update equation, the input pattern space itself was altered to exclude the information that was already retained by other cells. In the resulting architecture (Figure 5.5), the cells are “chained” to one another in series by a weighted feedback of the output of the previous cell to the input of the current cell. The first cell receives the original pattern space unaltered and will perform exactly as the single cell simulations shown previously. Subsequent cells receive the original pattern space (x), minus the component of that space extracted by the previous cells as
$$\hat{x}_i = x_i - C_{i,j-1}\, O_{j-1} \tag{5.23}$$
for the ith input to the jth cell in the chain. The output of the jth cell is then found in the same way as before, only now using the modified input space as
$$O_j = \sum_{i=1}^{N} C_{ij}\, \hat{x}_i \tag{5.24}$$
FIGURE 5.4 The feature cell output value histograms for the space of Figure 5.3 for (a) the normalized Hebbian training and (b) the ALOPEX variance maximization scheme.
Using this method, only the first cell, which receives no interference from other cells in the network, can optimize to the principal vector, while all others are relegated to finding secondary yet potentially important vectors. When a network such as this is trained, it can be seen that the first cell optimizes to the first Karhunen-Loève eigenvector, and the other cells locate the remaining eigenvectors in order. Sanger recently published an architecture similar to this scheme,52 but his training relies on the gradient descent method of Oja.47
FIGURE 5.5 The feature extraction architecture for three feature cells. After the first cell, all inputs are modified by a subtracted weighted input from the outputs of the previous cells, as in Equation 5.23. The triangle symbol represents the neuron integrator, while the symbol C12 indicates the connection weight from the first input to the second cell.
To illustrate the performance of this architecture, let us use the pattern space employed in the one-neuron simulation of Figure 5.3. Since the first cell in the chain architecture above does not receive any influence from other cells, it will optimize in the same way as in Table 5.1. If we now compute the pattern space provided to the second cell according to Equation 5.23, we get a different pattern space (Figure 5.6). All information in the direction of the first cell vector of Table 5.1 has been removed from the pattern space given to the second cell. It is then a trivial matter for the second cell to optimize on this pattern space. Note that in this example we were extracting two features from what was originally a two-dimensional pattern space. This is not a compression scenario. In the more general case, where the number of features is less than the number of pattern dimensions, the input space would never compress to a line as it does in Figure 5.6. Rather, with each cell in the chain, there is essentially a virtual removal of a dimension from the pattern space in the direction of the previously optimized vectors (not a physical removal, since the feature extraction is still performed over the same number of dimensions). The application of this feature extraction method to a more complicated pattern space is shown in subsequent chapters, where it can be clearly seen that this architecture and training allow the network cells to converge on the K-L expansion vectors in order.
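As a rough illustration of Equations 5.23 and 5.24, the sketch below removes the component along a first-cell weight vector from a few two-dimensional patterns and checks that the first cell would respond with zero to what the second cell sees. ALOPEX training is not shown: the weight vector `c1` is a made-up unit-length stand-in for an already trained first cell, the patterns are invented, and the function names are hypothetical.

```python
import math

def cell_output(weights, x):
    """Equation 5.24 for one cell: O = sum_i C_i * x_i."""
    return sum(w * xi for w, xi in zip(weights, x))

def deflate(weights, x):
    """Equation 5.23: x_hat_i = x_i - C_i * O, using the previous cell's weights."""
    o = cell_output(weights, x)
    return [xi - w * o for w, xi in zip(weights, x)]

# Hypothetical unit-length weight vector of the first cell (not a trained result).
c1 = [math.cos(math.pi / 6), math.sin(math.pi / 6)]

# Made-up two-dimensional patterns.
patterns = [[2.0, 1.0], [-1.0, 0.5], [0.3, -1.2]]

# Pattern space handed to the second cell.
residuals = [deflate(c1, x) for x in patterns]

# Response of the first cell to the second cell's inputs (zero up to rounding).
leftover = [cell_output(c1, r) for r in residuals]
```

Because `c1` is assumed to have unit length, every residual is orthogonal to it, which is why a two-dimensional pattern space collapses onto a line for the second cell.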
FIGURE 5.6 The revised pattern space of Figure 5.3 as seen by the second cell in a feature extraction network chain.
5.5 CLUSTERING
5.5.1 THE FUZZY C-MEANS (FCM) CLUSTERING ALGORITHM
The concept of fuzzy logic was first introduced by Zadeh, whose classic paper has become the philosophical bible in the field.68 The concept is simple: set membership, and indeed reasoning of any sort, carries more information when there is a continuum of grades of membership. The reasoning is based on Zadeh’s Principle of Incompatibility, which maintains that high precision is incompatible with high complexity. The suggestion is that the complexity of a system and the precision with which it can be analyzed bear a roughly inverse relation to one another. He asserted that since real world ideas appear to be fuzzy in nature, there is reasonable cause for adapting this approach to machines. Since that time, applications to decision making, and to pattern clustering in particular, have been numerous.16
One of the first to apply fuzzy reasoning to pattern recognition was James Bezdek.2 The method that he and his colleagues introduced, the Fuzzy c-Means (FCM) clustering algorithm,3–5 has seen great popularity as a flexible and easily implemented method. The method itself is actually a spinoff of the venerable ISODATA algorithm.1 The ISODATA clustering method is one of a set of techniques that assume that the optimal cluster partitioning is described by the minimum (or maximum) of an objective function. For the ISODATA algorithm and others like it, which use a set of c prototype “centers” of clusters around which patterns are grouped by their resemblance to these centers, the most common choice of objective function J is of the form
$$J = \sum_{k=1}^{P} \sum_{i=1}^{c} u_{ik}\,d_{ik} \tag{5.25}$$
where $u_{ik}$ is the membership strength of pattern k in cluster i and $d_{ik}$ is the squared distance from pattern k to cluster center i in m-dimensional feature space. The ISODATA algorithm is normally used to generate hard partitions of the data. A hard partition is one in which each pattern is allocated entirely to one cluster or another, so that the membership strengths take on the values of zero or one. In most scenarios, the assignment of patterns to any one cluster prototype in exclusion of all others is a gross simplification of the complexity of the pattern space. Bezdek used the concept of fuzzy logic, where decisions are made through analog weightings, and applied it to this objective function J. In doing so, J was defined as
$$J = \sum_{k=1}^{P} \sum_{i=1}^{c} (u_{ik})^{q}\,d_{ik} \tag{5.26}$$
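A small worked example may help fix the notation of Equation 5.26. The sketch below evaluates J for a hard and a fuzzy partition of three patterns among two clusters; the squared distances and memberships are made-up numbers, and `fcm_objective` is a hypothetical helper name, not part of any FCM library.

```python
def fcm_objective(u, d, q):
    """Equation 5.26: J = sum_k sum_i (u[i][k] ** q) * d[i][k].

    u[i][k] is the membership of pattern k in cluster i;
    d[i][k] is the squared distance from pattern k to center i.
    """
    c, p = len(u), len(u[0])
    return sum((u[i][k] ** q) * d[i][k] for k in range(p) for i in range(c))

# Made-up squared distances for c = 2 clusters and P = 3 patterns.
d = [[1.0, 4.0, 9.0],
     [9.0, 4.0, 1.0]]

# Hard partition: each pattern belongs entirely to one cluster.
hard = [[1.0, 1.0, 0.0],
        [0.0, 0.0, 1.0]]

# Fuzzy partition: graded memberships, each column summing to one.
fuzzy = [[0.9, 0.5, 0.1],
         [0.1, 0.5, 0.9]]

j_hard = fcm_objective(hard, d, q=1)    # with q = 1 this is Equation 5.25
j_fuzzy = fcm_objective(fuzzy, d, q=2)
```

For these numbers the hard partition gives J = 6.0 and the fuzzy partition with q = 2 gives J = 3.8 (up to floating-point rounding).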
It is easiest to think of this objective function as the sum of the errors (the distances) incurred in representing the patterns by a set of c cluster centers, weighted by the memberships of the patterns in those clusters. The exponent q controls the sharpness of the decision boundaries: when q = 1, hard clusters are constructed, and as q → ∞, all patterns share the same membership in each cluster. Most importantly for the mathematical analysis of this function, the use of continuous memberships means that the decision space is continuous for all q > 1. It now becomes possible to examine the conditions for minimization of this function. It was demonstrated that J could be locally optimal for any one q only if
$$\vec{v}_i = \frac{\sum_{k=1}^{P} (u_{ik})^{q}\,\vec{x}_k}{\sum_{k=1}^{P} (u_{ik})^{q}}, \quad 1 \le i \le c \tag{5.27}$$
and, writing the condition in terms of the squared distances $d_{ik}$ (so that the exponent is $1/(q-1)$ rather than the $2/(q-1)$ that applies to unsquared distances),
$$u_{ik} = \left[ \sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{1/(q-1)} \right]^{-1}, \quad 1 \le k \le P, \; 1 \le i \le c \tag{5.28}$$
where $\vec{v}_i$ is the ith cluster center and $\vec{x}_k$ the kth pattern. By iterating through these conditions, Bezdek3 claimed that a local minimum of the function J would be achieved. It was later seen that this iteration could guarantee only stationary points and not necessarily local minima;69 nevertheless, the FCM method was found widely useful in practically achieving rapid (usually < 25 iterations) and “good” clusterings of data from many application areas.6,7,8,24
There are two factors in the use of the FCM procedure which still require discussion. One of these is the optimal selection of the parameter q. To date, there is no automated way of selecting the best value of q for any one pattern set, but most applications seem to find reasonable values lying somewhere between 1.2 and 4.0.3 The second factor is the way in which the distance $d_{ik}$ is calculated. In general, the distance can be calculated through the quadratic form
$$d_{ik} = \lVert \vec{x}_k - \vec{v}_i \rVert_A^{2} = (\vec{x}_k - \vec{v}_i)^{T} A\,(\vec{x}_k - \vec{v}_i) \tag{5.29}$$
which is termed the A-norm distance. If the matrix A is chosen to be the identity matrix, then the distance is the squared Euclidean distance from pattern $\vec{x}_k$ to center $\vec{v}_i$. This causes the FCM algorithm to form roughly hyperspherical clusters. This makes convergence simpler (since each cluster shape is identical) but is not optimal for most data sets with clusters of unequal shape. If the matrix A is chosen as $C_i^{-1}$, where $C_i$ is found as the fuzzy covariance matrix
$$C_i = \frac{\sum_{k=1}^{P} (u_{ik})^{q}\,(\vec{x}_k - \vec{v}_i)(\vec{x}_k - \vec{v}_i)^{T}}{\sum_{k=1}^{P} (u_{ik})^{q}} \tag{5.30}$$
then the axes of the cluster are effectively scaled according to the distribution of the data points within those clusters. This is a modification of the Mahalanobis distance measure for “hard” data sets.56 Other forms of the matrix A are also popular, including a diagonal matrix of the eigenvalues of the matrix C which Equation 5.30 calculates.5 Using $A = C^{-1}$, and with the same calculation of membership strengths as in Equation 5.28, clusters of essentially hyperellipsoidal shape can now be found. Furthermore, the cluster shapes can vary from one cluster to another, since each cluster has its own covariance matrix C. The incidence of local optima in the use of a variant A matrix such as this has been shown to rise drastically, affecting almost every problem, even with small data sets.3 To compensate for this, elaborate means of choosing initial conditions have been used, with unproven ability to guarantee global success.23
Finally, every clustering algorithm must develop a proven means for determining the optimal number of clusters in the data set and whether a converged set of clusters is a “good” clustering. This is usually termed the cluster validity problem, and there are at least as many opinions as to what is the best set of parameters to provide this information as there are clustering routines. The originators of the FCM routine usually use an entropy measure to characterize the effectiveness of the clustering operations, which is given by
$$H_c = -\frac{1}{P} \sum_{i=1}^{c} \sum_{k=1}^{P} u_{ik} \log_a (u_{ik}) \tag{5.31}$$
with $a \in (1, \infty)$, where $H_c = 0$ for hard partitions and $H_c = \log_a(c)$ for an entirely “blurred” (or indecisive) clustering. Another related parameter, termed the partition coefficient (F), is found by
$$F = \frac{1}{P} \sum_{i=1}^{c} \sum_{k=1}^{P} (u_{ik})^{2} \tag{5.32}$$
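To make the preceding quantities concrete, the sketch below alternates the center and membership conditions of Equations 5.27 and 5.28 with Euclidean distance (A = I in Equation 5.29) on a toy data set, then evaluates the partition entropy (Equation 5.31) and partition coefficient (Equation 5.32). It is a minimal illustration under stated assumptions, not the authors' implementation: the data, starting centers, fixed count of 25 iterations, and function names are all invented, the membership update uses the exponent 1/(q − 1) because the distances are squared, and a small floor on the distances is added to avoid division by zero.

```python
import math

def sq_dist(x, v):
    """Squared Euclidean distance (Equation 5.29 with A = I)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))

def memberships(data, centers, q):
    """Membership condition for squared distances (cf. Equation 5.28)."""
    u = [[0.0] * len(data) for _ in centers]
    for k, x in enumerate(data):
        d = [max(sq_dist(x, v), 1e-12) for v in centers]  # floor avoids d = 0
        for i in range(len(centers)):
            u[i][k] = 1.0 / sum((d[i] / dj) ** (1.0 / (q - 1.0)) for dj in d)
    return u

def update_centers(data, u, q):
    """Center condition of Equation 5.27."""
    dim = len(data[0])
    centers = []
    for row in u:
        w = [m ** q for m in row]
        total = sum(w)
        centers.append([sum(wk * x[n] for wk, x in zip(w, data)) / total
                        for n in range(dim)])
    return centers

def partition_entropy(u, base=math.e):
    """Equation 5.31, taking u * log(u) = 0 when u = 0."""
    p = len(u[0])
    return -sum(m * math.log(m, base) for row in u for m in row if m > 0) / p

def partition_coefficient(u):
    """Equation 5.32."""
    p = len(u[0])
    return sum(m * m for row in u for m in row) / p

# Made-up data: two well-separated groups of three points each.
data = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
        [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]]
centers = [[1.0, 1.0], [4.0, 4.0]]  # arbitrary starting centers
q = 2.0

for _ in range(25):
    u = memberships(data, centers, q)
    centers = update_centers(data, u, q)
u = memberships(data, centers, q)

h = partition_entropy(u)      # near 0 for a nearly hard partition
f = partition_coefficient(u)  # near 1 for a nearly hard partition
```

On data this cleanly separated the memberships come out nearly binary, so H is close to zero and F is close to one; for a completely blurred two-cluster partition (all memberships 0.5) the same functions return log(2) and 0.5.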
Both of these parameters rely on one of the major paradoxes of fuzzy logic. That is, although the pretext of fuzzy clustering is to incorporate more information by using analog decision criteria, heuristically the “best” clustering is one in which the resultant clusters are hard (have binary membership strengths). This idea is the basis behind the use of the entropy (H) and partition coefficient (F) measures and assumes that the optimal number of clusters is the choice for which H is minimized and F is maximized.
The FCM algorithms already fit many of the requirements for ALOPEX training. There is an explicit cost function which determines the “optimal” choice of clustering, a requirement for an optimization routine such as ALOPEX. Further, there is really only one set of independent parameters which must be varied in order to minimize the cost function of Equation 5.26: the set of cluster centers (c times m parameters in total). All other information for the determination of membership strengths results from the specification of cluster centers. The distances (Equation 5.29) are found in reference to the cluster centers, and the membership strengths are based entirely on distance information (Equation 5.28).
It remains only to justify the use of ALOPEX for this application. ALOPEX will reduce the likelihood of arriving at a locally optimum solution at the price of increased computation time. However, most situations demand accuracy, even at the cost of increased computation time. This is especially true when one considers that these clustering partitions usually need to be determined only once, after which they are used to make rapid decisions about new data. The danger of locally optimal solutions becomes especially apparent when clusters of nonhyperspherical shape are assumed. The distance measurements often become very local in certain directions from the cluster center.
The result is a much more localized cost function, which is therefore much more volatile. The resulting distances can often exceed the real number range of most software languages, and special care must be taken to ensure the stability of the algorithm as it iterates.
As mentioned before, all of the FCM family of algorithms share the danger of locally optimal solutions. Even with a Euclidean distance measurement, it is readily apparent that multiple runs of the FCM algorithm arrive at different solution points for pattern spaces of reasonable complexity. A more complete example of this behavior is shown in the context of the classification of handwritten characters. In order to incorporate a global optimizer into the fuzzy c-means family, ALOPEX is used to adjust the cluster centers iteratively in the steps outlined below.
1. Randomly choose initial cluster centers.
2. Find squared distances between patterns and centers.
3. Calculate membership strengths via Equation 5.28.
4. Find the current cost function J from Equation 5.26.
5. Use ALOPEX to update the centers based on the recent changes in the cost function and the centers.
6. Go to step 2 until a convergence criterion is met.
The performance of this routine, and that of the feature extractor, is illustrated in the context of two application domains. These are described in detail in Section 5.6.
Most pattern recognition schemes need to consider the assignment of labels or pattern identities to decision codes generated by the pattern recognition system. Most commonly, this consideration is important for supervised schemes, in which the pattern identities are known without ambiguity. In unsupervised methods, the notion of pattern labeling is somewhat self-defeating. That is, if the identities of the patterns used in the training were known before training, then a supervised method would have been more productive. If, however, the pattern labels are suppositions or decisions with an amount of uncertainty, it would be more useful to assign labeling based only on cluster membership strengths.
The classification of handwritten digits by this unsupervised system is an unusual task. As mentioned before, it is motivated by a desire to improve the algorithm without biasing the answer toward concurrence with the medical diagnoses. Paradoxically, the labels of each of the characters used in the study (which were never subjected to a segmentation process) are known unambiguously before training, but the knowledge is only used in the assignment of pattern labels after training is completed. This allows for a quantitative description of the performance of the algorithm. The labeling of such a scenario is performed in this research by analyzing the constituent memberships of each cluster into an array Ω,
$$\Omega_{ij} = \frac{\sum_{k=1}^{P} \{ u_{ik} : \forall k \in \text{PATTERN TYPE } j \}}{\sum_{l=1}^{c} \sum_{k=1}^{P} u_{lk}}, \quad 1 \le i \le c, \; 1 \le j \le R \tag{5.33}$$
where $\Omega_{ij}$ is the percentage of cluster i membership from pattern type j (for R pattern types in the simulation, i.e., 10 digits in the character recognition problem), and there were c clusters formed from the P training patterns. Then the degree ($\psi_{kj}$) to which pattern k belongs to pattern type j is calculated as
$$\psi_{kj} = \frac{\sum_{i=1}^{c} u_{ik}\,\Omega_{ij}}{\sum_{l=1}^{R} \sum_{i=1}^{c} u_{ik}\,\Omega_{il}}, \quad 1 \le k \le P, \; 1 \le j \le R \tag{5.34}$$
The label of pattern k ($L_k$) is then the pattern type for which the degree of membership is maximal:
$$L_k = \arg\max_{j} \{ \psi_{kj} \} \tag{5.35}$$
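The labeling rule of Equations 5.33 through 5.35 can be sketched in a few lines. The membership values and pattern types below are made up for illustration, the function names are hypothetical, and Equation 5.33 is implemented with the total membership over all clusters and patterns as its denominator; since Equation 5.34 renormalizes over the pattern types, any constant denominator gives the same labels.

```python
def cluster_composition(u, types, n_types):
    """Equation 5.33: omega[i][j] is the membership of cluster i contributed
    by patterns of type j, divided by the total membership."""
    total = sum(sum(row) for row in u)
    omega = [[0.0] * n_types for _ in u]
    for i, row in enumerate(u):
        for k, m in enumerate(row):
            omega[i][types[k]] += m / total
    return omega

def type_degrees(u, omega, k):
    """Equation 5.34: degrees psi[j] of pattern k, normalized over types."""
    n_types = len(omega[0])
    raw = [sum(u[i][k] * omega[i][j] for i in range(len(u)))
           for j in range(n_types)]
    norm = sum(raw)
    return [r / norm for r in raw]

def label(u, omega, k):
    """Equation 5.35: the pattern type with the largest degree."""
    psi = type_degrees(u, omega, k)
    return max(range(len(psi)), key=psi.__getitem__)

# Made-up memberships for c = 2 clusters and P = 4 patterns.
u = [[0.9, 0.8, 0.3, 0.1],
     [0.1, 0.2, 0.7, 0.9]]
types = [0, 0, 1, 1]  # known pattern types, consulted only after training

omega = cluster_composition(u, types, n_types=2)
labels = [label(u, omega, k) for k in range(len(types))]  # [0, 0, 1, 1]
```

Here cluster 0 is dominated by type 0 and cluster 1 by type 1, so each pattern is labeled with the type that dominates the cluster in which it has the larger membership.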
In this way the labeling is performed not only on the membership strengths of the patterns given by the clustering module but also on the specificity to a single pattern type demonstrated by the clusters. That is, the labeling of a pattern with a high membership strength to a cluster with a high population of more than one pattern type will downplay that cluster membership strength in favor of other clusters with more “pure” pattern types.
Most neural networks formulate their decisions in a highly intertwined and complicated way. Even if the network is purely feed-forward (as is the multilayer perceptron used in the backpropagation algorithm), there is usually only a limited idea of the criteria that the network used in making its decision. The neural network just presented is an intriguing exception to this category of systems. The primary finding of the clustering module is the set of “centers” around which the cluster boundaries are formed. Since the coordinates of each center reside in the same space as the feature vectors of the input patterns, the cluster center coordinates can be thought of as the feature values that would have been extracted if there were a corresponding input pattern. The primary question is whether, knowing these feature values, we can find out what the input pattern would have looked like. The answer is a resounding yes! The feature extracting neural network implements the K-L expansion, as we have already mentioned. Since the K-L expansion is really a linear expansion of the input pattern and since the K-L expansion is used both as a feature extractor and as a data compression method, the input pattern can be reconstructed from the feature vector with the knowledge of the K-L vectors that derived the features. This is essentially the same concept that was used to remove information from subsequent cells in the feature extraction network, applied to a different task.
Given an m-dimensional feature vector y and a desired representation of the input pattern x, we can reconstruct an approximation to the input pattern ($\hat{x}$) as
$$\hat{x}_i = \sum_{j=1}^{m} C_{ij}\,y_j \tag{5.36}$$
where $\hat{x}_i$ is the ith input of the vector $\hat{x}$ and $C_{ij}$ is the network connection strength from input i to feature extracting cell j. The vector $\hat{x}$ is an approximation, to m terms of the K-L expansion, of the original input pattern x. The realization that the cluster solutions (the centers) can be reconstructed into the corresponding input pattern (with hopefully a small error) allows the system to be used in an entirely new light. Not only can the system provide unsupervised classifications of a set of patterns, but through the reconstruction of the input pattern, a glimpse of the reasoning behind the decisions can be obtained. This is made more apparent when the reconstructions of specific applications are displayed, as is done in Chapter 6 with Figure 6.8.
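The reconstruction of Equation 5.36 amounts to a weighted sum of the feature cell weight vectors. The sketch below is a minimal illustration under simplifying assumptions: the weight matrix `C` is a hypothetical set of orthonormal vectors rather than trained K-L vectors, the features are taken as plain projections onto those vectors, and the function names are invented.

```python
def extract(C, x):
    """Features y_j = sum_i C[i][j] * x[i], one per feature cell."""
    m = len(C[0])
    return [sum(C[i][j] * x[i] for i in range(len(x))) for j in range(m)]

def reconstruct(C, y):
    """Equation 5.36: x_hat_i = sum_j C[i][j] * y[j]."""
    return [sum(cij * yj for cij, yj in zip(row, y)) for row in C]

# Hypothetical orthonormal weight vectors (the columns of C)
# for N = 3 inputs and m = 2 feature cells.
C = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]

x = [2.0, -1.0, 0.5]
y = extract(C, x)          # [2.0, -1.0]
x_hat = reconstruct(C, y)  # [2.0, -1.0, 0.0]

# Squared reconstruction error: the part of x outside the two retained vectors.
error = sum((a - b) ** 2 for a, b in zip(x, x_hat))  # 0.25
```

Passing the coordinates of a cluster center to `reconstruct` in place of `y` gives the input pattern that the center would correspond to, which is the use described in the text.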
REFERENCES
1. Ball, G. H. and Hall, D. J., A clustering technique for summarizing multivariate data, Behav. Sci., 12, 153, 1967.
2. Bezdek, J. C., Fuzzy Mathematics in Pattern Classification, PhD Diss., Cornell University, Ithaca, NY, August 1973.
3. Bezdek, J. C., Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981, 65.
4. Bezdek, J. C. and Dunn, J. C., Optimal fuzzy partitions: a heuristic for estimating the parameters in a mixture of normal distributions, IEEE Trans. Comput., August 1975.
5. Bezdek, J. C., Ehrlich, R., and Full, W., FCM: the fuzzy c-means clustering algorithm, Comput. Geosci., 10(2), 191, 1984.
6. Bezdek, J. C. and Fordon, W. A., Analysis of hypertensive patients by the use of the fuzzy ISODATA algorithm, Proc. JACC, 3, 349, 1978.
7. Bezdek, J. C., Trivedi, M., Ehrlich, R., and Full, W., Fuzzy clustering: a new approach for geostatistical analysis, Int. J. Sys., Meas. Decisions, 1982.
8. Cannon, R. L., Dave, J. V., Bezdek, J. C., and Trivedi, M. M., Segmentation of thematic mapper image data using fuzzy c-means clustering, Proc. 1985 IEEE Workshop Lang. Automation IEEE Comput. Soc., 93, 1985.
9. Chen, Y. T. and Fu, K. S., On the generalized Karhunen-Loève Expansion, IEEE Trans. Inf. Theor., 15, 518, 1967.
10. Chon, T. and Micheli-Tzanakou, E., A Probabilistic Approach to the ALOPEX Process using Moment Invariants of Images, Proc. Int. Joint Conf. on Neural Networks, Vol. II, 1989, 611.
11. Ciaccio, E. J. and Micheli-Tzanakou, E., The ALOPEX Process: Application to Real-Time Reduction of Motion Artifact, Proc. 12th Annu. Int. Conf. IEEE/EMBS, Vol. 12, 1990, 1417.
12. Cover, T. M., The best two independent measures are not the two best, IEEE Trans., SSC-6, 33, 1974.
13. Dante, H. M. and Sharma, V. V. S., Optimum Decision Tree Classifiers for Classification in Large Populations, Proc. IEEE Intl. Conf. on Cybernet. and Soc., 1985, 559.
14. Dasey, T. and Micheli-Tzanakou, E., A Pattern Recognition Application of the ALOPEX Process on Hexagonal Images, Proc. Int. Joint Conf. on Neural Networks, Vol. II, 1989[a], 119.
15. Dasey, T. and Micheli-Tzanakou, E., Efficiency Exploration of ALOPEX Based Recognition and Hexagonalized Images, Proc. of the Fifteenth Annu. Northeast Biomed. Eng. Conf., 1989[b], 177.
16. Davis, J. C. and Economou, C. E., A review of fuzzy clustering methods, Adv. Eng. Software, 6(4), 1984.
17. Deutsch, S. and Micheli-Tzanakou, E., Neuroelectric Systems, NYU Press, New York, 1987.
18. Devijver, P. A. and Kittler, J., Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ, 1982.
19. Eigen, M., Self-organization of matter and the evolution of biological macromolecules, Naturwissenschaften, 58, 465, 1971.
20. Fu, K. S., Application of Pattern Recognition, CRC Press, Boca Raton, FL, 1982.
21. Fukunaga, K. and Koontz, W. L. G., Application of the Karhunen-Loève expansion to feature selection and ordering, IEEE Trans. Comput., C-19(4), 311, 1970.
22. Galar, R., Evolutionary search with soft selection, Biol. Cybern., 60, 357, 1989.
23. Gath, I. and Geva, A. B., Unsupervised optimal fuzzy clustering, IEEE Trans. PAMI, 11(7), 773, 1989.
24. Granath, G., Application of fuzzy clustering and fuzzy classification to evaluate provenance of glacial till, Math. Geol., 16, 283, 1984.
25. Harth, E. and Tzanakou, E., A stochastic method for determining visual receptive fields, Vis. Res., 12, 1475, 1974.
26. Harth, E., Kalogeropoulos, T., and Pandya, A. S., ALOPEX: A Universal Optimization Network, Proc. Spec. Symp. on Maturing Technol. and Emerging Horizons in Biomed. Eng., 1988, 97.
27. Harth, E. and Pandya, A. S., Dynamics of the ALOPEX process: applications to optimization problems, in Biomathematics and Related Computational Problems, Ricciardi, L., Ed., Kluwer Acad. Publ., 1988, 459.
28. Harth, E., Pandya, A. S., and Unnikrishnan, K. P., Perception as an optimization process, in Proc. of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society Press, Washington, DC, 1986, 662.
29. Hebb, D., The Organization of Behavior: A Neurophysiological Theory, John Wiley & Sons, New York, 1949.
30. Highleyman, W. H., The design and analysis of pattern recognition experiments, Bell Syst. Tech. J., 41, 723, 1962.
31. Iezzi, R., Jr., Micheli-Tzanakou, E., and Cottaris, N., Effects of Pattern Convergence and Orthogonality on Visual Evoked Potentials, Proc. 12th Annu. Int. Conf. IEEE/EMBS, Vol. 12, 1990, 897.
32. Karhunen, K., Über lineare Methoden in der Wahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fennicae, Ser. A137 (trans. by I. Selin in On Linear Methods in Probability Theory, T-131, The RAND Corp., Santa Monica, CA, 1960), 1947.
33. Kirby, M. and Sirovich, L., Application of the Karhunen-Loève procedure for the characterization of human faces, IEEE Trans. PAMI, 12(1), 103, 1990.
34. Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., Optimization by simulated annealing, Sci., 220, 671, 1983.
35. Kittler, J. and Young, P. C., A new approach to feature selection based on Karhunen-Loève expansion, Patt. Recog., 5, 335, 1973.
36. Kohonen, T., Problems in practical pattern recognition, Neural Netw., 1(suppl.), 29, 1988.
37. Leboucher, G. and Lowitz, G. E., What a histogram can really tell the classifier, Patt. Recog., 10, 351, 1978.
38. Lerner, A., A crisis: in the theory of pattern recognition, in Frontiers of Pattern Recognition, Watanabe, S., Ed., Academic Press, New York, 1972, 367.
39. Levine, M. D., Feature extraction: a survey, Proc. IEEE, 57(8), 1391, 1969.
40. Levy, A. C. and Montalvo, A., The tunneling algorithm for the global minimization of functions, SIAM J. Sci. Stat. Comput., 6(1), 15, 1985.
41. Marsic, I. and Micheli-Tzanakou, E., Distributed Optimization with the ALOPEX Algorithms, Proc. of the 12th Int. Conf. IEEE/EMBS, Vol. 12, 1990, 1415.
42. Mellissaratos, L. and Micheli-Tzanakou, E., The Parallel Character of the Alopex Process, J. Med. Syst. 13(5), 243, 1990.
43. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., and Teller, A. H., Equation of state calculations by fast computing machines, J. Chem. Phys., 21(6), 1087, 1953.
44. Micheli-Tzanakou, E., Non-linear characteristics in the frog’s visual system, Biol. Cybern., 51, 53, 1984.
45. Micheli-Tzanakou, E. and O’Malley, K. G., Harmonic Content of Patterns and their Correlation to VEP waveforms, IEEE 17th Annu. Conf. of EMBS, 1985, 426.
46. Micheli-Tzanakou, E. and Dasey, T. J., Pattern Recognition with Neural Networks on Compressed Images, 6th IASTED Int. Conf. on Expert Systems and Neural Networks, 1990, 9.
47. Oja, E., A simplified neuron model as a principal component analyzer, J. Math. Biol., 1982.
48. Oja, E., Subspace Methods of Pattern Recognition, Research Studies Press, Ltd., Letchworth, 1983.
49. Peterson, C. and Hartman, E., Explorations of the mean field theory learning algorithm, Neural Net., 2, 475, 1989.
50. Romesburg, H. C., Cluster Analysis for Researchers, Lifetime Learning Publications, London, 1984.
51. Rutenbar, R. A., Simulated annealing algorithm: an overview, IEEE Circ. Devices Mag., 5(1), 19, January 1989.
52. Sanger, T. D., Optimal unsupervised learning in a single-layer linear feedforward neural network, Neural Net., 2, 459, 1989.
53. Scoltock, J., A survey of the literature of cluster analysis, Comput. J., 25(1), 130, 1982.
54. Snyder, W., Bilbro, G., and Van den Bout, D., New Techniques in Optimization: A Tutorial, Technical Report NETR-89-12, Center for Communication and Signal Processing, North Carolina State University, Raleigh, NC, November, 1989.
55. Teh, C.-H. and Chin, R. T., On digital approximation of moment invariants, Comp. Vis. Graphics Image Proc., 33, 318, 1986.
56. Tou, J. T. and Gonzalez, R. C., Automatic recognition of handwritten characters via feature extraction and multilevel decision, Int. J. Comput. Inf. Sci., 1, 43, 1972.
57. Tou, J. T. and Heydorn, R. P., Some approaches to optimum feature extraction, in Computer and Information Sciences - II, Tou, J. T., Ed., Academic Press, New York, 1967.
58. Toussaint, G. T., Bibliography on estimation of misclassification, IEEE Trans. Inf. Theory, IT-20(4), 472-479, 1974.
59. Tucker, W. T., Counterexamples to the convergence theorem for the fuzzy c-means clustering algorithms, in Analysis of Fuzzy Information, Vol. III - Applications in Engineering and Science, CRC Press, Boca Raton, FL, 1987, 109.
60. Tzanakou, E., Principles and Design of the ALOPEX Device: A Novel Method of Mapping Visual Receptive Fields, Ph.D. Diss., 1977, International Publication No. 77-30, 771.38/8, 1978.
61. Tzanakou, E., Michalak, R., and Harth, E., The Alopex process: visual receptive fields by response feedback, Biol. Cybern., 35, 161, 1979.
62. Wang, J.-Z. and Micheli-Tzanakou, E., The use of the ALOPEX process in extracting normal and abnormal visual evoked potentials, IEEE-EMBS Mag. Spec. Issue DSP, V9(1), 44, 1990.
63. Watanabe, S., Karhunen-Loève Expansion and Factor Analysis: Theoretical Remarks and Applications, Proc. of the 4th Conf. on Inf. Theory, Prague, 1965.
64. Watanabe, S., Pattern recognition as information compression, in Frontiers of Pattern Recognition, Watanabe, S., Ed., Academic Press, New York, 1972, 561.
65. Watanabe, S. and Kaminuma, T., Recent Developments of the Minimum Entropy Algorithm, 9th Int. Conf. Pattern Recognition, 1988, 536.
66. Xu, L., Some Application of Simulated Annealing to Pattern Recognition, Proc. Int. Conf. Pattern Recognition, Rome, 1988, 1040.
67. Young, T. T. and Calvert, T. W., Classification, Estimation, and Pattern Recognition, American Elsevier, New York, 1974.
68. Zadeh, L., Fuzzy sets, Inf. Control, 8, 338, 1965.
69. Tucker, N. D. and Evans, F. C., A Two-Step Strategy for Character Recognition Using Geometrical Moments, Proc. 2nd Int. Joint Conf. Pattern Recognition, 1974, 223.
6 Application to Handwritten Digits
Timothy J. Dasey and Evangelia Micheli-Tzanakou
6.1 INTRODUCTION TO CHARACTER RECOGNITION
Among the widely varied applications of pattern recognition techniques, perhaps none has been studied more intensively than the machine recognition of character data.1 The number of potentially profitable uses for such systems is nearly limitless, since so much of the information resident in today's industrial society is textual. This is one reason that an application of the methods developed in this book is devoted to the character recognition arena. Also, the nature of the task permits the experimenter a concrete success formulation since the correct classes can be determined unambiguously. This is a much more desirable environment for the development of a new method than the less clearly formulated class memberships of the medical data treated in Chapter 7.
The industrial applications of character recognition (CR) systems fall into several broad categories. One area is certainly that of data entry of handwritten information into conventional computer systems. Such arenas are typically constrained to data sets with limited character sets and constrained paper format (e.g., banking). This overlaps with the text entry area, which is more concerned with the input of typewritten characters into a word processing or publishing environment. These systems can only recognize characters of certain fonts, but with very high success (> 99.9%). Other character recognition systems use the deciphered information to control a process, as would happen in a post office branch with a CR system that sorts mail. A final application area deals with providing an interface for the visually impaired, which often involves both a recognition procedure and a translator into speech.
Any comparison of character recognition tasks must be approached cautiously, since the difficulty of the task is determined largely by the constraints imposed on the data and the information available to the machine. It is certainly much easier for a machine to recognize typewritten characters than handwritten characters, since the typewritten characters would usually follow more standard guidelines and be less variable. Similarly, a signature verification system would likely be more successful if the machine had access to the pen pressure, velocity, and acceleration information at the time of the writing, as well as the shape characteristics of the signature.
The recognition of handwritten characters is a subset of the much more extensive optical character recognition (OCR) problem. It deals with the recognition of single, unconnected hand-drawn characters of an alphabet. It must be differentiated from script recognition, which is concerned with the recognition of handwritten characters that may be connected and cursive. In this sense the developers of handwritten character recognition schemes do not need to concern themselves with the extremely challenging task of segmenting the characters.2 Still, handwritten character recognition is not as simple a task as it may appear, since some claim3 that even human beings can make up to 4% errors when reading certain characters in the absence of context. Errors in reading handprinted characters, in addition to deriving from the algorithm and scanning methods, can also arise because of variations in shape due to the habits, style, mood, health, and other conditions of the writer.4
The recognition of handwritten characters must consider at least two problems: the means of scanning the image and the method for its recognition. The choice of a scanning device is not considered in this discussion, but the methodology of the recognition has been categorized as5,6
(i) Point-by-point global comparison with stored images;7
(ii) Global transformations such as Karhunen-Loève,8,9 Fourier,10,11 Walsh,12 moments of inertia,13 and others;14,15
(iii) Extraction of the local properties, such as endpoints, line crossings, and angles;16,17,18
(iv) Use of curvature and stroke information for analysis;19,20,21 and
(v) Structural methods, including decomposition of the character into graphs or other constituent elements.22,23
Many techniques contain portions that overlap between these categories. Each technique must be assessed by its ability to “ignore” deformation of the image caused by noise, translation, rotations, style variations, and other distortions, as well as practical considerations of the implementation, such as speed and complexity. The work of Grimsdale et al. represents one of the earliest attempts at character recognition.24 In this scheme, each digitized pattern is analyzed for shape by a computer, which extracts heuristic features and compares them to feature values stored on the computer. A few years later the notion of “analysis-by-synthesis” was presented by Eden.25,26 He initially proposed that all Latin characters could be formulated by only 18 strokes, which in turn could be generated by a subset of four strokes, called segments. More generally, the concept was that handwritten characters are formed by a small, finite number of schematic features, which, when known, can be used for recognition of character data. Perhaps more than Latin character recognition, the study of the Chinese alphabet is a stringent test for any algorithm.27 One of the first attempts at this problem was made by Casey and Nagy at IBM.28 A step-by-step approach was used for this large character set, in which the first stage grouped similar characters, and then “group masks” and “individual masks” were employed to further specify the character. This
type of method, in which a hierarchical decision process is employed, is characteristic of several OCR schemes.29 Other researchers implemented a more mathematically formulated process for their systems. Tou and Gonzalez used a two-stage system, the first stage performing a series of measurements for subgroup separation and the second extracting a set of specialized features.30 Pavlidis and Ali used a “split-and-merge” algorithm to produce polygonal approximations of the characters which could provide enough information for decision making,31 while others used clustering procedures on the task.32 The review paper of Suen et al.33 discusses the efforts in the recognition of handprinted numerals. The best classifications of over 30 studies ranged from 85 to 99.79%, but direct comparison of methods is rarely feasible because of the large discrepancy in the experimental setups. Not surprisingly, the 85% success rate used a realistic data set collected from the U.S. Postal Service34 while the 99.79% accuracy was derived after training writers to write numerals in specified shapes and sizes.35 In addition to this widely varying data quality, the number of training patterns was different in each case, and some studies never reserved any patterns for testing of the system after training.

The arrival of neural network concepts into the pattern recognition field has spurred some wonderful successes and great disappointments in the character recognition field. Most studies use the backpropagation algorithm for the training and network architecture.36 The recent study at AT&T Bell Laboratories is one of the more notable projects, in which a network was trained on postal zip code numerals with a modified backpropagation algorithm.37 Using this challenging data set of 7291 training patterns, they were able to show 0.14% error on the training set and 5% on the 2007-pattern test set.
Fukushima's Neocognitron network38 also has demonstrated an invariance in recognizing characters of different rotations, sizes, and translations. Other works, such as the ART topologies, have very limited success in detecting any discrepancies among patterns.39 The testing of newly established algorithms often relies on a realistic data base for the development of the method. Several popular data sets are widely available for this purpose, the most popular of which are those created by Highleyman,40 Munson,41 and Suen.33 The Munson data set seems to be the most popular, chiefly because of its difficulty. A more recent large data set was created in which the optimal writing style for recognition was also examined.42 Still, many researchers choose to construct their own data sets, and this is the strategy used in this work.
6.2 DATA COLLECTION
Digits were collected from 13 subjects, who were instructed to write several of each of the digits on a clean sheet of paper. No limitations were imposed as to the style, size, clarity, thickness, or slant of the digits. The numbers were then digitized using a Hewlett Packard Scanjet II digital scanner using the Scanning Gallery software package on an IBM PC. Each of the digits was segmented by hand, scanned at 300 DPI with 16 gray levels, and saved in separate files. The Scanning Gallery program saved the files in the TIF binary file format, and a C program was necessary to convert these files to an ASCII file format, which could be read by subsequent programs. A total of 1500 digits were collected in this manner, approximately 150 of each of the 10 digit types. Only a few assumptions were imposed on the data. Among these are

1. The digits were to be clearly segmentable from one another. That is, a rectangular box could be drawn around each digit so that the entire content of one digit resided inside that box, and no portion of any other digit was contained in that region.
2. The background was relatively noiseless, so that there exists a clear threshold between background and digit intensities.
3. No character was rotated more than 45° from what is normally considered its upright position (the character was not upside down).

Figure 6.1 displays many of the digits collected with this process. It is clear that they were written without regard to neatness, and in fact some of the digits appear ambiguous to human classifiers. This variety was encouraged to provide a realistic environment for the training process.
6.2.1 PREPROCESSING
The networks used for feature extraction and classification are highly dependent on spatial overlap of digits of the same class for their success. The original digits were written without regard to this constraint, and so it was necessary to process the digits to alleviate differences in size, thickness, rotation, location, and intensity. This was not expected to destroy the recognition capabilities of either humans or machine, since the information content of the digits is largely contained in the form of the digits. In addition, the resolution of the digits was reduced to prevent prohibitive training time for subsequent modules. The preprocessing was conducted in the following steps: intensity thresholding to remove noise, center of mass adjustment, line thinning, simultaneous rotation to standard axis and translation to standard center of mass, size determination and fixation, reduction in resolution, and smoothing of digits as a form of anti-aliasing. This sequence is depicted in Figure 6.2 and the methods for these steps are described in the following paragraphs. The inputs to the preprocessing stages come from the digitized characters from the digital scanner, while the outputs of the preprocessing feed into the inputs of the feature extraction network.
6.2.2 NOISE THRESHOLDING
A threshold was applied to each pixel of the original digits, creating a binary image for further processing. The threshold served a dual purpose. First, it eliminated weak and extraneous information from the digit, thereby aiding a separation from the background. Second, it eliminated intensity variability from within the contour of the digit. Each pixel of the digit was checked against the threshold value. If it was lower, it was set to zero. If the pixel value was equal to or higher than the threshold, it was set to a maximal value (assigned to be 2). An effective threshold was found to be at a gray level of 4, and this value was used for the processing of all digits. A more flexible approach would have been to use an adaptive threshold, whereby the deciding value is based on the content of each digit by analyzing an intensity histogram. In part because of the controlled lighting conditions of the digital scanner and also due to the assumption of a clear separation of the digit from its background, it was felt that this additional computation was not necessary. This analysis was confirmed by the high quality of the digits after thresholding was applied.

FIGURE 6.1 Random samples of the original unprocessed characters used in this study.

FIGURE 6.2 The sequence of steps in the preprocessing of the handwritten digits.
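The thresholding rule of Section 6.2.2 can be illustrated with a minimal sketch. It assumes the digit is held as a NumPy array of gray levels in which larger values are darker ink; the function name `threshold_digit` and the sample array are hypothetical and are not part of the original C programs.

```python
import numpy as np

def threshold_digit(gray, threshold=4, on_value=2):
    """Binarize a gray-level digit.

    Pixels below `threshold` become 0; pixels at or above it become
    `on_value` (the text uses a threshold of 4 and an "on" value of 2).
    """
    gray = np.asarray(gray)
    return np.where(gray >= threshold, on_value, 0).astype(np.uint8)

# Hypothetical 2 x 3 patch with gray levels in the 0..15 range.
sample = np.array([[0, 3, 4],
                   [15, 2, 9]])
binary = threshold_digit(sample)
```

An adaptive variant would compute `threshold` from the histogram of each digit instead of fixing it at 4.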
6.2.3 CENTER OF MASS ADJUSTMENT
Particular problems were encountered when some digit types (nines, eights, sixes) had small loops or regions of high density of pen marks. In such instances, the center of mass of the digit was highly skewed toward that region, and overlap with similar digits was often small. This also often resulted in an abnormal rotation when that routine was applied. An adjustment was applied to each digit to expand small regions such as this and so move the center of mass toward the absolute center of the digit. In this method, the center of mass ($CM = [x_c \;\, y_c]^T$) was located and the digit split into quadrants about this point. Each quadrant was then mapped into its corresponding quadrant in absolute space (using the absolute center $AC = [x_a \;\, y_a]^T$) by scaling each of the regions as

$$x' = \frac{(x - x_c)(x_{max} - x_a)}{x_{max} - x_c} \qquad (6.1)$$

where the old x coordinate is mapped to the new location x′. The y coordinate is changed in the same way.
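The quadrant scaling of Equation 6.1 can be sketched for a single coordinate as follows. Two points here are assumptions rather than statements from the text: the scaled value is treated as an offset from the absolute center, and the region on the other side of the center of mass is given a mirror-image formula using the minimum coordinate. The function name `remap_coordinate` is hypothetical.

```python
def remap_coordinate(x, x_c, x_a, x_min, x_max):
    """Piecewise-linear remap of one coordinate so that x_c lands on x_a.

    x_c: center of mass, x_a: absolute center, x_min/x_max: digit extent.
    """
    if x == x_c:
        return float(x_a)
    if x > x_c:
        # Equation 6.1, read here as an offset from the absolute center.
        offset = (x - x_c) * (x_max - x_a) / (x_max - x_c)
    else:
        # Assumed mirror-image form for the opposite side of the center.
        offset = (x - x_c) * (x_a - x_min) / (x_c - x_min)
    return x_a + offset

# Hypothetical digit spanning 0..10 with its center of mass skewed to 8.
mapped = [remap_coordinate(x, x_c=8, x_a=5, x_min=0, x_max=10)
          for x in (0, 4, 8, 9, 10)]
```

Under these assumptions the extremes of the digit stay fixed while the dense region near the center of mass is stretched and the sparse region is compressed.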
6.2.4 LINE THINNING
A thinning routine was used on the binary level digits to reduce the effects of line thickness. The method used is familiar in the literature.43,18 Basically, the algorithm pares away all boundary points in the digit until it is left with only skeletal pixels, which must be kept in order to preserve the integrity of the digit contour. A pixel is considered to be a skeletal pixel if it is part of the digit (has a nonzero value), one of its four neighbors is zero valued, and it passes either of two conditions as described thoroughly in a previous article.18 Several passes of this procedure are necessary to reduce the image to one consisting of only skeletal pixels, since the above criteria will remove only boundary pixels with each pass through the digit.
6.2.5 FIXING TO SIZE
Prior to the use of the rotation routine, the image is fixed to a standard size (60 × 100 pixels). This is necessary to avoid errors in the calculation of the digit principal axes caused by distortions in portions of the digit. To perform this task, the corners of the digit are located and scaled to the new size. Pixels are mapped into the nearest pixel after the scaling factor has been applied. The operation is performed in the same way as Equation 6.6 (a),(b), where the x coordinate magnifier is 60.0 and the y magnifier is 100.0.
6.2.6 ROTATION
The rotation algorithm uses the coordinates of each of the nonzero valued image pixels to find a principal vector of the image. This vector specifies the angle of the principal axis of the digit in two-dimensional space, which can then be manipulated to create a transformation matrix which will rotate the image. Each of the digits is rotated and translated in this space to a standard location and primary axis. The center of mass of the digit ($\vec{M} = [m_x \;\, m_y]^T$) is located and used to find a correlation matrix for the digit, calculated as

$$C = \frac{1}{r} \sum_{i=1}^{r} \vec{P}\vec{P}^{T} - \vec{M}\vec{M}^{T} \qquad (6.2)$$

where the summation is over all r nonzero pixels in the image and the vector $\vec{P}$ is the coordinate vector of the pixel ($\vec{P} = [p_x \;\, p_y]^T$). An eigenvector matrix E is calculated from the 2 × 2 matrix C, encoding the angle ($\varphi$) through which the primary axis of the digit runs as

$$E = \begin{bmatrix} \cos(\varphi) & -\sin(\varphi) \\ \sin(\varphi) & \cos(\varphi) \end{bmatrix} \qquad (6.3)$$

Each digit was rotated so that the primary axis lies vertically (90°). Each pixel location $\vec{P}' = [p_x' \;\, p_y']^T$ of the rotated image is calculated from the original image as

$$\vec{P}' = E' \left( \vec{P} - \vec{M} \right)^{T} \qquad (6.4)$$

where

$$E' = \begin{bmatrix} \cos(90^\circ - \varphi) & -\sin(90^\circ - \varphi) \\ \sin(90^\circ - \varphi) & \cos(90^\circ - \varphi) \end{bmatrix} \qquad (6.5)$$

and the vector $\vec{P}$ contains the original pixel coordinates. The elements of $\vec{P}'$ are rounded to the nearest integer locations.
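A minimal NumPy sketch of the rotation step of Equations 6.2 through 6.5 is given below. It assumes pixel coordinates are taken as (column, row) pairs treated as column vectors, and that the principal axis is the eigenvector of C with the largest eigenvalue; the function name `rotate_to_vertical` and the test image are hypothetical.

```python
import numpy as np

def rotate_to_vertical(image):
    """Return rotated integer pixel coordinates and the principal angle."""
    ys, xs = np.nonzero(image)
    P = np.stack([xs, ys], axis=1).astype(float)   # r x 2 array of [px, py]
    M = P.mean(axis=0)                             # center of mass
    C = P.T @ P / len(P) - np.outer(M, M)          # Equation 6.2
    eigvals, eigvecs = np.linalg.eigh(C)           # C is symmetric
    principal = eigvecs[:, np.argmax(eigvals)]     # primary axis direction
    phi = np.arctan2(principal[1], principal[0])
    theta = np.pi / 2 - phi                        # rotate the axis to 90 degrees
    E_prime = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])   # Equation 6.5
    rotated = (E_prime @ (P - M).T).T              # Equation 6.4, column-vector form
    return np.rint(rotated).astype(int), phi

# Hypothetical test digit: a 45-degree diagonal stroke of 11 pixels.
image = np.zeros((11, 11))
image[np.arange(11), np.arange(11)] = 2
rotated, phi = rotate_to_vertical(image)
```

After rotation the diagonal stroke lies along the vertical axis, so every rotated x coordinate is zero. The sign of an eigenvector is arbitrary, which means the digit may come out flipped by 180°; the assumption in Section 6.2 that no character is rotated by more than 45° would be needed to resolve that ambiguity.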
6.2.7 REDUCING RESOLUTION
The corners of the rotated image (smallest rectangle which completely encloses the digit) are found and used to scale the digit to a new resolution of 16 × 16. This is performed by calculating a new pixel coordinate [x′ y′] by

$$x' = \mathrm{nint}\left( \frac{16\,x}{x_{max} - x_{min}} \right) + 1 \qquad (6.6a)$$

and

$$y' = \mathrm{nint}\left( \frac{16\,y}{y_{max} - y_{min}} \right) + 1 \qquad (6.6b)$$

where the nint( ) operation takes the nearest integer of the resultant division. A pixel in the new 16 × 16 digit is assigned a value of 2 if any of the positive valued pixels of the higher resolution are mapped into that location. An alternative is to assign an additional threshold to turn on a pixel in the lower resolution image if the number of original pixels mapping into that location exceeds the threshold. The resultant 16 × 16 images were considered generally to be of good enough character (by subjective analysis) to avoid this additional complication.
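The coordinate mapping of Equation 6.6 can be sketched as follows. Several details are assumptions: coordinates are measured from the corner of the bounding box, the "+1" of the equation is dropped because Python arrays are 0-based, indices are clamped to stay inside the 16 × 16 grid, and `np.rint` (which rounds halves to the nearest even integer) stands in for nint( ). The function name `reduce_resolution` is hypothetical.

```python
import numpy as np

def reduce_resolution(image, size=16, on_value=2):
    """Map the nonzero pixels of `image` onto a size x size grid."""
    ys, xs = np.nonzero(image)
    x_min, x_max = xs.min(), xs.max()
    y_min, y_max = ys.min(), ys.max()
    # Guard against a zero extent (a digit one pixel wide or tall).
    x_span = max(x_max - x_min, 1)
    y_span = max(y_max - y_min, 1)
    new_x = np.rint((xs - x_min) * size / x_span).astype(int)
    new_y = np.rint((ys - y_min) * size / y_span).astype(int)
    # Clamp so the far corner lands on the last row and column.
    new_x = np.minimum(new_x, size - 1)
    new_y = np.minimum(new_y, size - 1)
    small = np.zeros((size, size), dtype=np.uint8)
    # A target pixel is on if any source pixel maps into it.
    small[new_y, new_x] = on_value
    return small

# Hypothetical 60 x 100 digit with only its two opposite corners set.
digit = np.zeros((100, 60))
digit[0, 0] = digit[99, 59] = 2
small = reduce_resolution(digit)
```

The count-based alternative mentioned in the text could be obtained by accumulating hits with `np.add.at` and thresholding the counts.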
6.2.8 BLURRING
As was mentioned previously, one of the assumptions fundamental to the success of the subsequent neural network processors is that of a high degree of spatial overlap of similar digits. That is, because of the hard wiring of neural inputs to image locations, the neural networks are not position invariant. The aforementioned preprocessing steps can aid in creating an invariance but are by no means infallible in this task. To assist in the overlap of the digit contours of similar digits, a simplified smoothing operation was applied to the 16 × 16 images. This operation can also be thought of as an anti-aliasing operation. Basically, if a zero valued pixel has one or more of its four primary neighbors with a nonzero value, that pixel is turned on with a value of 1 (it should be remembered that the pixels on the contour were given values of 2).
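The smoothing rule just described can be sketched with NumPy as below. The function name `blur_digit` is hypothetical, and pixels outside the image are assumed to be zero.

```python
import numpy as np

def blur_digit(img):
    """Turn on (value 1) every zero pixel with a nonzero 4-neighbor."""
    img = np.asarray(img)
    padded = np.pad(img, 1)                  # zero border around the image
    neighbor_on = ((padded[:-2, 1:-1] > 0) |     # pixel above
                   (padded[2:, 1:-1] > 0) |      # pixel below
                   (padded[1:-1, :-2] > 0) |     # pixel to the left
                   (padded[1:-1, 2:] > 0))       # pixel to the right
    out = img.copy()
    out[(img == 0) & neighbor_on] = 1
    return out

# Hypothetical 3 x 3 image with a single contour pixel in the middle.
center = np.zeros((3, 3), dtype=np.uint8)
center[1, 1] = 2
blurred = blur_digit(center)
```

Contour pixels keep their value of 2, so the network input distinguishes the original stroke from its softened halo.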
6.3 RESULTS
The feature extraction routine (ALOPEX variance maximization of network node outputs using the architecture of Figure 5.5) was applied to each of the digits in the data set. A random 1000 digits were selected for the training of this module, and 32 features were extracted from each 256-dimensional input image.* Since the feature extraction module has an architecture reminiscent of a pipeline, it was more efficient computationally to allow each neuron in the module to complete training before any subsequent nodes were altered. Training appeared most efficient with ALOPEX parameters of $\gamma_0 = a_0 = 5.0 \times 10^{-3}$, $\sigma_0 = 7.5 \times 10^{-3}$, $\gamma_\infty = a_\infty = 5.0 \times 10^{-5}$, $\sigma_\infty = 7.5 \times 10^{-5}$, and $\tau = 1000$, and typically required between 8,000 and 12,000 iterations per node for good convergence, as seen in the response curve of Figure 6.3. The vast bulk of the processing time is due to the large number of patterns (1000) used in the training, since each pattern must be presented to the neuron during each iteration.

* Note that the neural networks in this study have global inputs and are not spatially interdependent. This means that the preprocessed 16 × 16 digits are viewed as a 256-dimensional vector by the networks.

FIGURE 6.3 A response curve vs. iteration number. A convergence can be observed in about 10,000 to 12,000 iterations.

The number of features to retain was calculated by plotting the eigenvalues in descending order as generated from the conventional K-L expansion, as in Figure 6.4. The magnitude of the eigenvalues is identical to the optimum variances of the cell outputs in the Feature Extraction (FE) network, and it is convenient to relate the magnitude of the eigenvalue with the amount of information the corresponding K-L vector carries. The number of features to extract (32) was subjectively obtained from Figure 6.4 as the point at which the information given by an additional eigenvector reduces to near zero. Another way of finding the optimum number of neurons in the feature extracting network is to set a threshold. If a neuron optimizes to an output variance below this threshold, the node is not logically added to the network, and the training simulation is stopped. In this way both the extent of the network and its connectivities can be adaptable in the training.

Figure 6.5 depicts the feature cell vectors as they would appear in image form. Each connectivity strength is given a corresponding intensity (relative to the strength of the connectivity) in the spatial position where the input to the connection arose. Very high intensities (white) indicate large positive connections, and large negative connections are shown as a low intensity value (black). Some of the feature “filters” have regions of high contrast, which remind us of features in the character data set. It is clear that the last few feature images are quite “noisy”, and this is consistent with their low information content. Moreover, it is very obvious that these “optimal” vectors would be very difficult to specify heuristically.

The clustering operation needed to incorporate some understanding of the number of clusters necessary to accurately describe the data. Since the ALOPEX optimization for the clustering operation was quite time consuming, the standard FCM algorithm with Euclidean distance measurements was used to find the cluster validity measures described in Section 5.4 for as few as 2 and as many as 40 clusters. These simulations typically required no more than 30 iterations for convergence. In Section 5.2.1 the determination of the number of clusters necessary for any given data space was discussed (the cluster validity problem), and the cluster validity measures F and H were introduced.
FIGURE 6.4 The eigenvalue as a function of the number of the K-L expansion vector. The eigenvalue can also be thought of as the optimum nodal output variance for the ALOPEX-trained feature cell.

FIGURE 6.5 The 32 feature cell connection vectors displayed in image form. The vectors are shown in descending order of their output cell variance as you view from left to right and top to bottom.

Figure 6.6 illustrates the change in the validity measure H for the converged clusterings of the FCM algorithm from 2 to 40 clusters (q = 1.2). It is hoped that a distinct minimum in the entropy (H) measure and a distinct maximum in the partition coefficient (F) measure will present themselves definitively around a certain value of c. This is not the case in the plot of Figure 6.6. In fact, the data space created by these digits appears rather homogeneous in nature, with few well-separated regions for simple cluster identification. This may be an artifact of the high number of patterns used in the cluster formation, which may “fill in” many of the less dense regions of feature space used for simple cluster identification. The notion that the data space is highly uniform is made more credible by the extremely low value of q which was necessary to form clusters. At q = 1.2, the clusters are formed with very sharp decision boundaries. When a more commonly used value (q = 2) was used, the FCM algorithm converged every cluster center to the same point, so that the class memberships were entirely fuzzy, and no distinguishing information was provided.
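The behavior of the validity measures and of the exponent q can be illustrated with a small sketch. It assumes the standard Euclidean FCM membership update and the usual Bezdek definitions of the partition coefficient and partition entropy, which may differ in normalization from the forms given in Chapter 5; the function names and the four-point data set are hypothetical.

```python
import numpy as np

def fcm_memberships(X, centers, q):
    """Fuzzy memberships (n x c) for fixed centers and fuzziness exponent q."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)                 # avoid division by zero
    inv = d ** (-2.0 / (q - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def partition_coefficient(U):
    """F: approaches 1 for hard clusters and 1/c for fully fuzzy ones."""
    return float(np.mean(np.sum(U ** 2, axis=1)))

def partition_entropy(U):
    """H: approaches 0 for hard clusters and ln(c) for fully fuzzy ones."""
    U = np.clip(U, 1e-12, 1.0)
    return float(-np.mean(np.sum(U * np.log(U), axis=1)))

# Hypothetical one-dimensional-like data with two obvious groups.
X = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
centers = np.array([[0.5, 0.0], [4.5, 0.0]])

U_sharp = fcm_memberships(X, centers, q=1.2)   # exponent used in the text
U_fuzzy = fcm_memberships(X, centers, q=2.0)   # more common choice
F_sharp, H_sharp = partition_coefficient(U_sharp), partition_entropy(U_sharp)
F_fuzzy, H_fuzzy = partition_coefficient(U_fuzzy), partition_entropy(U_fuzzy)
```

With the smaller exponent the memberships are nearly hard, so F is closer to 1 and H closer to 0 than at q = 2, which is consistent with the sharp decision boundaries reported at q = 1.2.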
FIGURE 6.6 The variation in the entropy (H) for choices for the number of clusters (c) from 2 to 40.
There are two regions in the curve of Figure 6.6 in which it is reasonably safe to assume that there are relatively “better” clusterings than for other c values. The first is for the value of c = 2, which the curve of Figure 6.6 shows as locally optimal. For the purposes of this study, the value of c = 2 had to be rejected simply because of the understanding that there are at least 10 clusters desired. This is because of the 10 digit types (zero through nine) used in the data set. The second region occurs for values of c > 30. In this region of the curve of Figure 6.6, there begins a “plateau” region, beyond which a mental extrapolation of the curves would anticipate little improvement for a much higher number of clusters.* The region from c = 30 to c = 40 is heuristically an acceptable region and still maintains an adequate number of average samples (25–33) per cluster. The heuristic basis for the credibility of this range of c values resides in the belief that each of the digit types can be written in, on average, 3 to 4 different styles. For example, a one can be written as a single vertical line, or with additions of an upper diagonal line alone or with an accompanying lower horizontal line.

It is interesting to note that there is indication that the FCM algorithm was not finding the globally optimal solution to the clustering problems it was presented with. One piece of evidence was that the cost function value of the converged solution was often higher than one of the intermediate solutions through which the simulation had passed. But by far the simplest way to detect locally optimal solutions is to run the program several times with the same parameter set and the same patterns. When this was performed at c = 30, the FCM algorithm obtained different solutions each time, as evidenced by discrepancies in the cost function value** and cluster membership distributions (the array F from section 5.3). This lends further credence to the use of an optimizer in the FCM routines.

The ALOPEX-trained FCM algorithm was trained on 30 clusters for the same 1000 training patterns. In order to reduce the computational overhead, the simulation was started by using the center coordinates converged upon by the standard FCM algorithm. The simulation typically required between 1,000 and 2,000 iterations for a “good” convergence and seemed to perform best with ALOPEX parameters of $\gamma_0 = a_0 = 0.2$, $\sigma_0 = 0.3$, $\gamma_\infty = a_\infty = 2.0 \times 10^{-3}$, $\sigma_\infty = 3.0 \times 10^{-3}$, and $\tau = 2500$. Primarily for computational reasons, the Euclidean distance measure was used in the ALOPEX-trained FCM algorithm.
The use of a non-Euclidean measure would have, for this application, resulted in exorbitant execution times, since the calculation of the covariance matrices results in substantial computational overhead. Another reason for not using the fuzzy-covariance matrices in the formulation of a non-Euclidean distance metric was that later simulations showed that such a selection seemed to result in an unusually hard membership assignment, which may be disadvantageous for medical applications in particular. Table 6.1 shows the classification results for the ALOPEX-modified FCM scheme with the labeling method described in Section 5.4. The total classification accuracy is 86.3% for the 1000 training digits, and 86.0% for the 500 post-training digits, as indicated in Table 6.2. Figure 6.7 depicts the cluster centers as images (using the method of Section 5.4), to give us a flavor for the aspects of the characters which each cluster emphasizes. As Figure 6.7 shows, most clusters have fields that are strongly reminiscent of one of the digit types, but there are a few clusters which are blends of portions of several types.
* This was partially confirmed with a simulation performed at c = 50, which indicated a continuation of this trend.
** When the FCM algorithm was run twice at c = 30, final cost function values of J = 52422.17 and J = 56321.15 were obtained.
TABLE 6.1 A Comparison of the Classification Results of the ALOPEX-Trained Network with the Actual Pattern Identities of the Digits Used in the Training. Of the 1000 Training Digits, 863 were Correctly Classified (86.3%).

Actual                     Assigned Digit Class
Digit      0    1    2    3    4    5    6    7    8    9
  0       96    0    0    0    0    0    3    0    1    0
  1        0   85    3    0    6    1    0    2    3    0
  2        0    2   88    0    1    0    0    6    2    1
  3        0    0    1   86    0    2    0    3    7    1
  4        0    0    0    0   84    6    1    0    0    9
  5        0    0    0    0    0   96    2    0    2    0
  6        1    1    0    0    2   14   81    0    1    0
  7        0    0    0    0   11    0    0   89    0    0
  8        0    1    0    5    1    1    1    0   90    1
  9        2    0    0    5   12    3    0    5    5   68
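The totals quoted for Table 6.1 can be checked with a short sketch. The matrix below copies the table entries, with rows taken as the actual digit and columns as the assigned class (an interpretation supported by each row summing to 100); the variable names are hypothetical.

```python
import numpy as np

# Entries of Table 6.1: rows = actual digit, columns = assigned class.
confusion = np.array([
    [96,  0,  0,  0,  0,  0,  3,  0,  1,  0],
    [ 0, 85,  3,  0,  6,  1,  0,  2,  3,  0],
    [ 0,  2, 88,  0,  1,  0,  0,  6,  2,  1],
    [ 0,  0,  1, 86,  0,  2,  0,  3,  7,  1],
    [ 0,  0,  0,  0, 84,  6,  1,  0,  0,  9],
    [ 0,  0,  0,  0,  0, 96,  2,  0,  2,  0],
    [ 1,  1,  0,  0,  2, 14, 81,  0,  1,  0],
    [ 0,  0,  0,  0, 11,  0,  0, 89,  0,  0],
    [ 0,  1,  0,  5,  1,  1,  1,  0, 90,  1],
    [ 2,  0,  0,  5, 12,  3,  0,  5,  5, 68],
])

correct = int(np.trace(confusion))            # digits on the diagonal
total = int(confusion.sum())
accuracy = correct / total                    # overall fraction correct
per_class = np.diag(confusion) / confusion.sum(axis=1)
worst_digit = int(np.argmin(per_class))       # hardest digit to classify
```

The diagonal sums to 863 of 1000, matching the reported 86.3%, and the per-class rates show that nines are the weakest class, most often assigned to the four cluster.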
Figure 6.8 shows the misclassified digits as they appeared in their original unprocessed form, grouped by the digit type with which they were incorrectly identified. Some of the misclassifications can be connected directly with preprocessing problems (e.g., improper rotations, noise retained in the image), while others are probably due to the strong overlap between certain characteristics of the digit types.

The data set was also tested with the backpropagation neural network training algorithm. This technique was described thoroughly in Chapter 2. A direct comparison of the backpropagation results with the ALOPEX-trained network developed in this study is not equitable, since the backpropagation algorithm is a supervised technique. However, since backpropagation is so widely used, and since it has been used in the specific application of character recognition, the results that it provides can give a calibration of the difficulty of the data set. These results can also demonstrate the degree of additional accuracy which can be extracted by knowing the pattern identities a priori. The training was conducted with a network comprising 256 input nodes, 100 hidden nodes, and 10 output nodes on 1000 input patterns (consisting of the same preprocessed character training set as was used in the ALOPEX-trained system). The desired low value of the output lines was set at 0.1 and the desired high value at 0.9. The network was trained for 300 epochs with values of η = 0.1 and α = 0.75. Upon the completion of training, the classification error was determined by assigning each pattern the class identity of the output node with the highest activity.
TABLE 6.2 A Comparison of the Classification Results of the ALOPEX Trained Network with the Actual Pattern Identities of the Digits Not Used in the Training. Of the 500 Digits, 430 were Correctly Classified (86.0%).

Actual                     Assigned Digit Class
Digit      0    1    2    3    4    5    6    7    8    9
  0       46    0    0    0    1    0    1    0    0    0
  1        0   41    3    0    3    0    1    1    1    0
  2        0    6   44    0    0    0    0    2    0    0
  3        0    0    0   41    0    2    0    2    4    1
  4        0    0    0    0   41    2    1    0    0    4
  5        0    0    0    0    0   49    1    0    0    0
  6        0    0    0    0    0    5   47    0    0    0
  7        0    0    1    0    2    0    0   46    0    1
  8        0    0    0    2    1    1    0    0   48    0
  9        2    0    0    3   12    1    0    1    2   27
FIGURE 6.7 The 30 cluster centers displayed in image form.
For the 1000 training patterns, the backpropagation network correctly classified all but two of them, for an accuracy of 99.8%. The 500 patterns not used in the training were classified with 93% accuracy, as shown in Table 6.3 below.
6.4 DISCUSSION
Our primary interest in this application is to be able to fine-tune the training algorithm so that it is of maximum efficiency and accuracy for subsequent medical applications. In this regard, the classification of handwritten digits tests the limits of the applicability of the method. This is because the large number of clusters, features, and patterns stress the algorithm to its maximum load. The computing times for all phases of the ALOPEX-trained algorithm were significant, but the accuracy of the optimization was nearly ideal in the feature extraction training. For the clustering module, the ALOPEX simulation for c = 30 provided a moderate improvement over the standard FCM algorithm. Clearly a much more substantial computational demand is caused by the use of a non-Euclidean distance metric, particularly when the calculation of a fuzzy covariance matrix is required. For a large cluster, large pattern set application such as this, the Euclidean metric becomes one of the only feasible possibilities.
TABLE 6.3 A Comparison of the Classification Results of the Backpropagation Trained Network with the Actual Pattern Identities of 500 Digits Not Used in the Training. Of the 500 Digits, 465 were Correctly Classified (93%).

Actual                     Assigned Digit Class
Digit      0    1    2    3    4    5    6    7    8    9
  0       46    0    0    0    1    0    1    0    0    0
  1        0   42    2    1    0    0    1    3    1    0
  2        0    5   46    1    0    0    0    0    0    0
  3        1    1    0   44    0    0    0    1    0    3
  4        0    0    0    0   47    0    1    0    0    0
  5        0    0    0    0    0   47    1    0    1    1
  6        0    0    0    0    0    0   51    0    1    0
  7        0    1    1    0    0    0    0   47    0    1
  8        0    0    0    1    0    1    0    0   49    1
  9        0    0    0    0    1    0    0    0    1   46
One of the largest problems appears to be the determination of the number of clusters necessary for an accurate depiction of the data space. Both of the cluster validity measures we used, along with about a half dozen others not included in this document, were not able to give us a definitive idea of the proper number of clusters. There is a natural tendency for all of the measures to drift toward their ideal values as the number of clusters increases, since when the number of clusters equals the number of patterns, we have a trivial but perfect set of clusters. Whether the plateau
region of the curves of Figure 6.6 is an artifact from this tendency is unknown, but since the number of clusters was still substantially lower than the number of patterns (30 O(260/m)
(8.4)
or m > 2600 Obviously, the system described in this paper achieved generalizations of 90% and better with far fewer training examples. This shows that these bounds are very conservative in practice. This observed contradiction between the theoretical lower bounds and the actual behavior of the neural network in practice has been a point of contention between theorists and practitioners.37 There may be many ways to explain the difference, one being that many times the simplifications that must be made to calculate the bounds on the number of training examples result in the bounds not reflecting reality.24 In the case of our system, the answer may lie in the fact that the studies were done on systems much different than this with a different learning algorithm. However, another very plausible reason is that the training examples for this system may contain a high percentage of “boundary samples”.1,24 Boundary samples, or border patterns in a two-class problem, are examples that lie very close to the boundary between the two classes of data. Ahmad and Tesauro showed a marked increase in generalization performance when a large percentage of boundary samples are chosen to be in the training set. In addition, with a high enough number of boundary samples, the total number of examples needed to provide good generalization decreases. In the mammography application it is difficult to determine which examples are boundary samples (which is why most theoretical work is done on the XOR problem or the majority problem). However, given the above discussion, it would appear that the training set contains a high number of boundary samples. As previously stated, the generalization performance is dependent on the size of the training set and the structure of the neural network. Given this, there are then two ways to attack the generalization problem. First, one could set the size of the network and then collect enough data to obtain the performance desired. 
Alternatively, the data could be collected and then the network sized to provide the best results given the data. Both methods have been approached in the above discussion. However, in practice, the second method, which is how this system evolved, is usually chosen. In many situations it would be impossible or very expensive to collect more data.
For this reason most researchers will change the network structure to accommodate the data. Section 8.4 discussed the resulting data normalization method that this system uses. However, it was rather surprising that the ln and scaling approach done twice performed that much better than just a single ln and scale. The answer lies in an analysis of the data. As Section 8.4 listed, the moment values lie between $-1 \times 10^{41}$ and $1 \times 10^{41}$, with the minimum in absolute value being $1 \times 10^{-39}$. Once the ln is taken, the values range from approximately –90 to 94. When these values are scaled onto the [–1, 1] interval, the values from –90 to 0 fall into [–1, –0.02], which is almost half the interval. These values of natural logarithms from –90 to 0 were the original data values that were in the [–1, 1] interval. That is where 85% of the original moment data lie. Therefore, the majority of the data now occupies half the interval instead of the $1/10^{41}$ of the interval that it did before. This results in the data becoming much more separable. To perform this operation again just further separates the data. Figure 8.7 shows this graphically. It is easy to see that the solid line, which is ln(|x|), provides a good separation on the interval [–1, 1]. However, the separation is even greater with ln(|ln(|x|)|), as that function has a greater range. Figure 8.8 shows the same thing, only the results are now scaled to lie on the [–1, 1] interval to exactly match the normalization of the data. While the same result can be seen, it is clearer in Figure 8.7.
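The arithmetic behind this argument can be reproduced with a short sketch. It assumes a plain linear rescaling of the logarithms onto [–1, 1] using the rounded limits –90 and 94 quoted in the text; the helper name `scale_to_unit` is hypothetical, and the handling of the sign of the moments in Section 8.4 is not modeled here.

```python
import math

def scale_to_unit(v, lo, hi):
    """Linearly map v from [lo, hi] onto [-1, 1]."""
    return 2.0 * (v - lo) / (hi - lo) - 1.0

# Natural logs of the smallest and largest moment magnitudes.
ln_min = math.log(1e-39)      # roughly -89.8
ln_max = math.log(1e41)       # roughly  94.4

# Where ln(1) = 0 lands after scaling with the rounded limits.
zero_scaled = scale_to_unit(0.0, -90.0, 94.0)

# Share of the output interval taken by magnitudes of at most 1.
share_after = (zero_scaled - (-1.0)) / 2.0
```

Magnitudes of at most 1 end up filling just under half of the scaled interval, whereas before the transformation they were confined to a sliver about $1/10^{41}$ of its width.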
FIGURE 8.7 Graph of ln(|x|) and ln(|ln(|x|)|). The solid line shows the graph of ln(|x|) and the dotted line the graph of ln(|ln(|x|)|). The interval between –2 and 2 is shown so that the effect on the [–1,1] interval can be more clearly seen.
FIGURE 8.8 Graph of ln(|x|) and ln(|ln(|x|)|) with scaling. This graph shows the same functions as Figure 8.7, except that the result of each function is now scaled to lie on the [–1,1] interval as were the input data.
This ability to allow the majority of the data to be more influential obviously increased the performance of the network. This says that the discriminating value came from this part of the data and not the few data points that were large in magnitude.
Table 8.2 shows that the same neural network structure with different ALOPEX parameters will produce different results. While this is not surprising given that the ALOPEX parameters control the amount of randomness in the system, for some cases it seems that the parameters are too sensitive to change. For example, comparing row 1 with row 2 shows that a very small change in the ALOPEX parameters led to a drop in the generalization performance. Rows 1 and 4 show an even greater performance drop for a slightly greater change in parameters. This seems to be an undesirable trait, and this situation warrants a closer look. Table 8.3 compares the initial ALOPEX parameters from row 1 and row 4. When viewed from the perspective of Table 8.3, things do not seem so out of the ordinary. One would expect that a 25% change in parameter values would give a change in performance. Additionally, Table 8.4 compares rows 1 and 2, which had a small change in performance. Table 8.4 shows that a smaller percentage change in parameters led to this smaller change in performance. This is what would be expected and desired.
The system was most sensitive to the starting point of the weights, which are randomly set. This behavior is tied to the topography of the error surface, which depends upon the cost or error function used and the number of training examples. While smaller networks have advantages in generalization and training times, as previously discussed, they tend to produce more rugged error surfaces with fewer good solutions.29 In this case global minimization methods or methods that use at least some global information tend to perform much better than local minimization methods such as backpropagation.29 ALOPEX uses both global and local information, and as such, it is a good choice in this instance.
TABLE 8.4 Comparison of Additional ALOPEX Parameters
Parameter     Row 1 Value   Row 2 Value   % Change
σ             0.120         0.100         –17
γ             0.220         0.200         –10
Max Change    0.175         0.170         –3
The behavior described above is exactly what was observed in this system. There were few good solutions, and the ones that existed were difficult to find. To correct this problem, additional training examples would need to be used. However, as explained earlier in this section, additional training examples may require a larger network in order to learn the larger training set. This, in turn, means longer training times. Still, it may be possible to add training examples and smooth out the error surface to alleviate some of the problem without increasing the network size. This is one of the issues for future work.
8.9 CONCLUSIONS
We have described a system that reads a digitized mammogram and classifies it as normal or abnormal. The results presented, even though they were generated on a small set of data, show great promise for a system of this type to be used in a clinical setting. With the maturation of digital mammography and the increase in processor speed of readily available and affordable personal computers, this system could easily be fielded on a mobile mammography van. Additional work on a larger database is ongoing.
ACKNOWLEDGMENTS
The authors would like to thank the United States Air Force and the Rutgers University Research Council for funding support. The mammograms were provided by the New Brunswick Radiology Group (Dr. Barry Zickerman). Our thanks are also extended to Dr. David August of the New Jersey Cancer Institute for his insightful comments.
REFERENCES
1. Ahmad, S. and Tesauro, G., Scaling and generalization in neural networks: a case study, in Advances in Neural Information Processing Systems I, Touretzky, D., Ed., Morgan Kaufmann Publishers, San Mateo, CA, 1989, 160.
2. Bankman, I., Christens-Barry, W., Kim, D., Weinberg, I., Gatewood, O., and Brody, W., Automated recognition of microcalcification clusters in mammograms, Biomedical Image Processing and Biomedical Visualization, SPIE vol. 1905, 731.
3. Baum, E. and Haussler, D., What size net gives valid generalization? Advances in Neural Information Processing Systems I, Touretzky, D., Ed., Morgan Kaufmann Publishers, San Mateo, CA, 1989, 81.
4. Cody, M., The fast wavelet transform, Dr. Dobb’s Journal, 16, April 1992.
5. Cohn, D. and Tesauro, G., Can neural networks do better than the Vapnik-Chervonenkis bounds? Advances in Neural Information Processing Systems 3, Touretzky, D., Ed., Morgan Kaufmann Publishers, San Mateo, CA, 1989, 911.
6. Cowley, G. and Ramo, J., Sharper focus on the breast, Newsweek, 64, May 10, 1993.
7. Daubechies, I., Ten Lectures on Wavelets, SIAM, Philadelphia, PA, 1992.
8. Davis, F., How to Get the Best Mammogram, Working Woman, 38, Oct. 1994.
9. Dhawan, A., Buelloni, G., and Gordon, R., Enhancement of mammographic features by optimal adaptive neighborhood image processing, IEEE Trans. Med. Imaging, M15(1), 8, Mar. 1986.
10. Dhawan, A. and Le Royer, E., Mammographic feature enhancement by computerized image processing, Comput. Meth. Prog. BioMed., 27, 23, 1988.
11. Dhawan, A., Chitre, Y., and Moskowitz, M., Artificial neural network based classification of mammographic microcalcifications using image structure features, Biomedical Image Processing and Biomedical Visualization, SPIE vol. 1905, 1993, 820.
12. Dodd, G. D., Mammography: state of the art, Cancer, 53, 652, 1984.
13. D’Orsi, C. and Kopans, D., Mammographic feature analysis, Sem. Roentgenol., XXVIII(3), 204, July 1993.
14. Elmore, J., Wells, C., Lee, C., Howard, D., and Feinstein, A., Variability in radiologists’ interpretations of mammograms, N. Engl. J. Med., 331(22), 1493, 1994.
15. Goldberg, M., Pivovarov, M., Mayo-Smith, W., Bhalla, M., Blickman, J., Bramson, R., Boland, G., Llewellyn, H., and Halpren, E., Application of wavelet compression to digitized radiographs, AJR, 163, 463, August 1994.
16. Haykin, S., Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Co., New York, 1994.
17. Hecht-Nielsen, R., Neurocomputing, Addison-Wesley, Reading, MA, 1990.
18. Hu, M., Visual pattern recognition by moment invariants, IRE Trans. Inf. Theor., no. 8, 179, Feb. 1962.
19. Isard, H., Other imaging techniques, Cancer, 53, 658, 1984.
20. Kim, Y., Choi, I., Lee, I., Yun, T., and Park, K., Wavelet transform image compression using human visual characteristics and a tree structure with a height attribute, Opt. Eng., 35(1), 204, Jan. 1996.
21. Kung, S., Digital Neural Networks, PTR Prentice Hall, Englewood Cliffs, NJ, 1993.
22. Laine, A., Schuler, S., Fan, J., and Huda, W., Mammographic feature enhancement by multiscale analysis, IEEE Trans. Med. Imaging, 13(4), 725, Dec. 1994.
23. Lippmann, R., An introduction to computing with neural networks, IEEE ASSP Magazine, 4-22, April 1987.
24. Mehrotra, K., Mohan, C., and Ranka, S., Bounds on the number of samples needed for neural learning, IEEE Trans. Neural Net., 2(6), 548, Nov. 1991.
25. Micheli-Tzanakou, E., Neural networks in biomedical signal processing, The Biomedical Engineering Handbook, Bronzino, J., Ed., CRC Press, Boca Raton, FL, 1995, ch. 60, 917.
26. Myers, L., Rogers, S., Kabrisky, M., and Burns, R., Image perception and enhancement for the visually impaired, IEEE Eng. Med. Biol., 594, Sept./Oct. 1995.
27. Qian, W., Clarke, L., Li, H., Clark, R., and Silbiger, M., Digital mammography: M-channel Quadrature Mirror Filters (QMFs) for microcalcification extraction, Comp. Med. Imaging Graph., 18(5), 301, Sept./Oct. 1994.
28. Samiy, A., Douglas, R., Jr., and Barondess, J., Textbook of Diagnostic Medicine, Lea and Febiger, Philadelphia, PA, 1987.
29. Shang, Y. and Wah, B., Global optimization for neural network training, Computer, 29(3), 45, Mar. 1996.
30. Shen, L., Rangayyan, R., and Desautels, J., Application of shape analysis to mammographic calcifications, IEEE Trans. Med. Imaging, 13(2), 263, June 1994.
31. Sheng, Y., Wavelet transform, in The Transforms and Applications Handbook, Poularikas, A., Ed., CRC Press, Boca Raton, FL, 1996, ch. 10, 747.
32. Sickles, E., Breast calcifications: mammographic evaluation, Radiology, 160, 289, 1986.
33. Strang, G., Wavelets, Am. Sci., 82, 250, May/June 1994.
34. Strickland, R. and Hahn, H., Wavelet transforms for detecting microcalcifications in mammograms, IEEE Trans. Med. Imaging, 15(2), 218, April 1996.
35. Vapnik, V. N. and Chervonenkis, A. Y., On the uniform convergence of relative frequencies of events to their probabilities, Theoret. Probability Appl., 17, 264, 1971.
36. Vyborny, C. and Giger, M., Computer vision and artificial intelligence in mammography, AJR, vol. 162, Mar. 1994, 699-708.
37. Wasserman, P., Advanced Methods in Neural Computing, Van Nostrand Reinhold, New York, 1993.
38. Wasserman, P., Neural Computing: Theory and Practice, Van Nostrand Reinhold, New York, 1989.
39. Yoshida, H., Zhang, W., Cai, W., Doi, K., Nishikawa, R., and Giger, M., Optimizing Wavelet Transform Based on Supervised Learning for Detection of Microcalcifications in Digital Mammograms, Proc. of IEEE Int. Conf. Image Proc., 152, 1995.
40. Zettler, W., Huffman, J., and Linden, D., Application of compactly supported wavelets to image compression, Image Processing Algorithms and Techniques, SPIE vol. 1244, 1990, 150.
41. Micheli-Tzanakou, E., Uyeda, E., Ray, R., Sharma, A., Ramanujan, R., and Doug, J., Comparison of neural network algorithms for face recognition, Simulation, 64(1), 15, July 1995.
42. Mallat, S., A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Patt. Anal. Mach. Intell., 11(7), 674, July 1989.
43. Held, G., Data Compression, John Wiley & Sons, New York, 1987.
44. Ross, S., A First Course in Probability, Macmillan Publishing, New York, 1976.
45. Hrycej, T., Modular Learning in Neural Networks, John Wiley & Sons, New York, 1992.
46. Rodriguez, C., Rementeria, S., Martin, J., Lafuente, A., Muguerza, J., and Perez, J., A modular neural network approach to fault diagnosis, IEEE Trans. Neural Net., 7(2), 326, Mar. 1996.
9
Visual Ophthalmologist: An Automated System for Classification of Retinal Damage
Sergey Aleynikov and Evangelia Micheli-Tzanakou
9.1 INTRODUCTION
There is a vast variety of eye-related diseases which leave visible artifacts on the retinal surface. This makes it very difficult to create any automated classification system, since the disease features vary widely. Retinal diseases are classified into two major groups: vascular diseases, caused by circulatory disturbances of the retinal vessels; and avascular diseases, where the rod and cone layers and the pigment epithelium are implicated. In this research an attempt was made to classify only diseases belonging to the first group, resulting from retinal hemorrhage.
9.2 SYSTEM OVERVIEW
The system, which was named “Visual Ophthalmologist” (VO), is designed for a 32-bit operating system, such as Windows NT or Windows 95, running on a PC-compatible computer. The flowchart of Figure 9.1 describes the main components of the system. The system consists of five general modules:
M1. Image Acquisition Module-Image Source
M2. Database Image Management Module
M3. Image Processing Module
M4. Feature Extraction Module
M5. Neural Network Classification Module
The system can function independently just as an image data management system (IDMS), or it can also be used for image classification. Module M1 consists of a Hewlett Packard 4cx desktop scanner with a transparency adapter to scan slides. This module can be substituted with any compatible scanner capable of providing optical resolution greater than 150 dpi. Instead of using a scanner, it is also possible to incorporate a digital camera into the system’s structure and use it for direct image acquisition. The images are scanned using proprietary software into the Windows Clipboard, and then pasted directly into a newly created record in the Database Image Management Module (M2).
FIGURE 9.1 System components flowchart.
The Database Image Management Module (DIMM) is a database client, which requests information from and sends information to a database server. A server processes requests from many clients simultaneously, coordinating accessing and updating of data. The advantage of this type of architecture is that the image server can be located either on a local computer or on a remote computer. The communication of the DIMM with the database server is done through an intermediate layer, called the Borland Database Engine (BDE). The patients’ data are organized in three independent databases linked to one another by a parent/child relationship in the following manner:
1. Patient database (this file contains general patient data: first and last name, social security number, date of birth, etc.),
2. Visit database (includes information on a patient’s visits, belonging to a patient in the ‘patient database’. This information may include: date of a visit, description, etc.),
3. Image database (consists of several fields, which include the patient’s images belonging to a specific visit in the ‘visit database’).
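The parent/child organization of the patient, visit, and image data can be illustrated with a small relational sketch. This is not the system's actual storage: SQLite (through Python's sqlite3 module) stands in for the BDE-backed databases, and every table name, column name, and helper function below is a hypothetical choice made only for illustration.

```python
import sqlite3

def create_schema(conn):
    """Create three linked tables: patient -> visit -> image (assumed layout)."""
    conn.execute("PRAGMA foreign_keys = ON")  # enforce parent/child links
    conn.executescript("""
        CREATE TABLE patient (
            patient_id    INTEGER PRIMARY KEY,
            first_name    TEXT,
            last_name     TEXT,
            date_of_birth TEXT
        );
        CREATE TABLE visit (
            visit_id    INTEGER PRIMARY KEY,
            patient_id  INTEGER NOT NULL REFERENCES patient(patient_id),
            visit_date  TEXT,
            description TEXT
        );
        CREATE TABLE image (
            image_id INTEGER PRIMARY KEY,
            visit_id INTEGER NOT NULL REFERENCES visit(visit_id),
            bitmap   BLOB
        );
    """)

def images_for_patient(conn, patient_id):
    """Follow the child links from a patient through visits to image ids."""
    rows = conn.execute(
        "SELECT image.image_id FROM image "
        "JOIN visit ON image.visit_id = visit.visit_id "
        "WHERE visit.patient_id = ? ORDER BY image.image_id",
        (patient_id,),
    ).fetchall()
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO patient (patient_id, last_name) VALUES (1, 'Doe')")
conn.execute("INSERT INTO visit (visit_id, patient_id, visit_date) "
             "VALUES (10, 1, '1997-03-01')")
conn.execute("INSERT INTO image (image_id, visit_id) VALUES (100, 10)")
conn.execute("INSERT INTO image (image_id, visit_id) VALUES (101, 10)")
print(images_for_patient(conn, 1))
```

The foreign keys play the role of the parent/child relationship: a visit cannot exist without its patient, and an image cannot exist without its visit.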
9.2.1 IMAGE PROCESSING
The image processing tools included in this module are histogram equalization and stretch, image compression based on a Gaussian Pyramid,2 image orientation, center of mass determination, and a set of convolution filters. This enables the user to acquire a more accurate and flexible classification.
9.2.2 FEATURE EXTRACTION METHODS
Module M4 in Figure 9.1 is a Feature Extraction Module. It processes selected records in the imaging database and outputs specific features extracted from the images. These features are saved in a file to be used in the image classification by module M5. The feature extraction is based on three independent methods to allow for a higher recognition rate. These methods are
1. Image Central and Invariant Moments,5
2. Image Power Spectrum, based on the F-CORE Decomposition,7 and
3. Multiresolution Wavelet Decomposition Approach.4,6
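As an illustration of the first method, the sketch below computes central moments of a small grayscale image using the standard textbook definition (moments taken about the intensity centroid). It is only a minimal example: the function name is hypothetical, and the text does not state which particular moments make up the features the system actually uses.

```python
def central_moment(image, p, q):
    """Central moment mu_pq of a 2-D grayscale image given as a list of rows.

    x indexes columns, y indexes rows; the moment is taken about the
    intensity centroid, which makes it invariant to translation.
    """
    total = sum(sum(row) for row in image)
    x_bar = sum(x * v for row in image for x, v in enumerate(row)) / total
    y_bar = sum(y * v for y, row in enumerate(image) for v in row) / total
    return sum(
        ((x - x_bar) ** p) * ((y - y_bar) ** q) * v
        for y, row in enumerate(image)
        for x, v in enumerate(row)
    )

# The same 2 x 2 blob placed at two different positions in a 4 x 4 frame.
image_a = [[1, 2, 0, 0],
           [3, 4, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]]
image_b = [[0, 0, 0, 0],
           [0, 0, 1, 2],
           [0, 0, 3, 4],
           [0, 0, 0, 0]]

# Second-order central moments agree because only the position changed.
print(central_moment(image_a, 2, 0), central_moment(image_b, 2, 0))
```

Because the moments are measured relative to the centroid, shifting the pattern inside the frame leaves them unchanged, which is the property that makes them useful as features.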
9.2.3 IMAGE CLASSIFICATION
Module M5 (Figure 9.1) of the system consists of a Modular Neural Network, which takes the features generated and saved to a file by the Feature Extraction Module (M4), and tests them against the information on which the neural network was trained. This module is implemented in a separate program, which can be run concurrently with the Visual Ophthalmologist. The reason for not incorporating this module directly into the Visual Ophthalmologist is that since it is a versatile program by itself, it can serve for finding solutions of many independent problems of recognition/classification, similar to the ones found by the Visual Ophthalmologist project. The neural network training in this module is based on the Algorithm of Pattern Extraction (ALOPEX). ALOPEX is an optimization technique developed by Tzanakou and Harth in 1973 (for a list of references the reader is advised to look into Reference 9) to optimize receptive field mapping in the visual system of frogs. It has been applied to a broad variety of applications due to the fact that it has better convergence compared to the traditional gradient methods. Some of its recent applications include face recognition8 and mammogram classification.3 ALOPEX serves to minimize/maximize a system’s global response R, which is a function of multiple parameters. As the parameter space of the response becomes large, it becomes more and more complicated to find an appropriate solution. A valuable characteristic feature of the ALOPEX algorithm is that at each iteration it considers both local and global effects on the response function. For the complete description of the algorithm, the reader is referred to one of the latest publications.
9.3 MODULAR NEURAL NETWORKS
The idea of building modular networks comes from the analogy with biological systems, in which a brain (as a common example) consists of a series of interconnected substructures, like the auditory and visual systems, which, in turn, are further structured into more functionally independent groups of neurons. Each level of signal processing performs its unique and independent purpose, such that the complexity of the output of each subsystem depends on the hierarchical level of that subsystem within the whole system. Modular neural networks can be used in a broad variety of applications. Each module does its unique function, providing some output to the modules in the next level. The usage of modular neural networks is most beneficial when there are cases of missing pieces of data.1 Since each module takes its input from several others, a missing connection between modules would not significantly alter that module’s output. With the introduction of object-oriented computer languages in the early 1990s, it has become relatively easy to implement parallelism in neural network processes, which extends the regular procedural approach to a new “biological-like” dimension. At a level of high abstraction the network should look like the “black box” object in Figure 9.2. It receives some input from templates stored in a file, propagates it through all modules, and provides some output in a meaningful format. A module is not aware of any type of processing that is going on in the rest of the network, though it knows which particular network it belongs to in order to provide correct references, stored in the second container, to the rest of the modules. A local error of a template is the summation of a function of the absolute differences between the desired and actual values of the module’s output nodes to a given template. We use different approaches for computing the local error $E_i'$:
$$E_i' = \left| Out_i^{desired} - Out_i^{observed} \right|$$
If $E_i'$ > threshold, then if the desired output $Out_i^{desired}$ is 1, we set $E_i'$ equal to $\exp(2 \cdot E_i') - 1$. (This is done because we would like the values on the diagonal of the output matrix to have an increased rate of convergence.) Otherwise $E_i'$ is expressed as $\exp(E_i') - 1$. The traditional training approach assumes that the local error is equal to $(E_i')^2$. However, we would like to make it more sensitive to a change of the argument, which is why $\exp(E_i') - 1$ has been chosen. In fact, to get a faster convergence of the outputs that are set to 1, we use an even more sensitive function, namely $\exp(2 \cdot E_i') - 1$.
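The local error rule described above can be sketched as follows. The text leaves one case open, so the sketch adopts one reading as an assumption: the steeper function exp(2·E) – 1 is used only when the error exceeds the threshold and the desired output is 1, and exp(E) – 1 is used in every other case. The threshold value of 0.5 and the function names are hypothetical.

```python
import math

def local_error(desired, observed, threshold=0.5):
    """Local error of one output node (assumed reading of the rule)."""
    diff = abs(desired - observed)
    if diff > threshold and desired == 1:
        # Steeper penalty for outputs that should be 1 but are far off.
        return math.exp(2.0 * diff) - 1.0
    return math.exp(diff) - 1.0

def template_error(desired_outputs, observed_outputs, threshold=0.5):
    """Sum of the node errors of one module for one template."""
    return sum(
        local_error(d, o, threshold)
        for d, o in zip(desired_outputs, observed_outputs)
    )

# One template with five output nodes; the correct class is the third node.
desired = [0, 0, 1, 0, 0]
observed = [0.1, 0.0, 0.2, 0.3, 0.0]
print(template_error(desired, observed))
```

Compared with the squared error, both exponential forms grow faster as the difference increases, which is the extra sensitivity the text is after.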
FIGURE 9.2 An example of a modular perceptron-based neural network.
9.4 APPLICATION TO OPHTHALMOLOGY
This section summarizes the application of the modular neural network algorithm described in the previous section to the problem of retinal image damage classification. Once the data were acquired and stored in the database and all features were generated using approaches discussed earlier, we built a neural network as shown in Figure 9.3 to classify the obtained features. As shown in the figure, the network consists of two levels of modules. The modules on the first level process the features generated by three feature extraction methods consecutively: a) moments, b) F-CORE, and c) wavelet histogram. The module on the second level serves as the classifier of the results generated by the first three modules. It combines the recognition of three different methods into a joint classification.
FIGURE 9.3 The configuration of a neural network used for retinal image classification.
The definition of the correct classification of the retinal hemorrhage is that the network should be able to tell whether or not any given image contains a hemorrhage. In order to test the accuracy of recognition, the original images were preclassified by degree of hemorrhage damage on a [1, ..., 5] range, where 1 means no or very little (< 5% of retinal surface) hemorrhage, and 5 means a very high degree of damage (> 80% of retinal surface). The architecture of each module is as follows: the moments processing module consists of three layers of neurons, containing respectively: 15→10→5 nodes. It contains 15*10 + 10*5 = 200 connections. The F-CORE processing module consists of three layers of neurons, containing respectively: 25→15→5 nodes. It contains 25*15 + 15*5 = 450 connections. The wavelet histogram processing module consists of three layers of neurons, containing respectively: 25→15→5 nodes. It contains 25*15 + 15*5 = 450 connections. Finally, the merging module merges the classification of the previous three modules into a combined form to provide the final classification of the degree of retinal damage. It consists of 15→10→5 input/hidden/output nodes, respectively. The number of neurons in the input layer of each module is equal to the number of features in the corresponding method. The number of output neurons is equal to 5 (the number of hemorrhage classes). A hidden layer contains a number of neurons equal to the average of the numbers of neurons in the input and output layers.
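The layer sizes and connection counts quoted above follow directly from the sizing rule (hidden layer equal to the average of input and output layer sizes, fully connected layers). The short sketch below reproduces that arithmetic; the function name and the module labels are hypothetical, and the total at the end is simply the sum of the four per-module counts.

```python
def module_layout(n_inputs, n_outputs=5):
    """Return (layer sizes, connection count) for one three-layer module."""
    n_hidden = (n_inputs + n_outputs) // 2          # average of input and output
    connections = n_inputs * n_hidden + n_hidden * n_outputs
    return (n_inputs, n_hidden, n_outputs), connections

# Number of input features per module, as given in the text.
modules = {
    "moments": 15,
    "f_core": 25,
    "wavelet_histogram": 25,
    "merging": 15,   # three first-level modules x five outputs each
}

total_connections = 0
for name, n_inputs in modules.items():
    layers, connections = module_layout(n_inputs)
    total_connections += connections
    print(name, layers, connections)

print("total connections:", total_connections)
```

The merging module's 15 inputs correspond to the five outputs of each of the three first-level modules.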
9.5 RESULTS
Since the Visual Ophthalmologist is a highly integrated system, the process of obtaining results is very simple. It consists of three steps: 1) selecting the image records containing images of interest, 2) running the feature extraction algorithms on the selected data, and 3) classifying results using the modular neural network. The system currently contains a database of 160 retinal images obtained from three different sources: a) black and white sheet slides, b) loose color slides, and c) an ophthalmological atlas. All images were preprocessed before they were input into the database. Preprocessing included
a) Image re-sizing using Gaussian compression to 256 × 256 pixels
b) Histogram stretch/equalization to enhance the quality of the image
All images stored in the databases are in the Windows Device Independent Bitmap format (bmp). We chose this format due to compatibility with most Windows imaging applications. All 160 images were used for testing of the system. Twenty-five images were chosen for training of the neural network. Therefore, the network training set contained 25 templates, each consisting of 15 + 25 + 25 = 65 features provided by the feature extraction module. The overall classification provided by the module of the second level in the network’s architecture resulted in the correct classification of 127 images out of 160, which is equal to 79.38%. For each image processed by the system, the following rule was used to provide the image classification. As discussed in the previous section, each image could belong to one of five classes ordered by the degree of hemorrhage from 1 to 5. The recognition by the network is considered correct if its most dominant output deviates by not more than a distance of two from the correct classification. The modular network was also trained to 95%, and classified correctly 127 out of 160 images, which signifies a recognition accuracy of 79.38%.
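The scoring rule and the reported recognition rate can be written out as a small sketch. It assumes that "most dominant output" means the output node with the largest value and that classes are numbered 1 to 5 in node order; the function names are hypothetical.

```python
def dominant_class(outputs):
    """1-based index of the largest network output (classes 1..5)."""
    return max(range(len(outputs)), key=lambda k: outputs[k]) + 1

def is_counted_correct(outputs, true_class, tolerance=2):
    """Apply the rule: dominant output within `tolerance` classes of the truth."""
    return abs(dominant_class(outputs) - true_class) <= tolerance

def recognition_rate(n_correct, n_total):
    """Percentage of images counted as correctly recognized."""
    return 100.0 * n_correct / n_total

# Hypothetical network outputs for one image whose true class is 5.
outputs = [0.1, 0.2, 0.9, 0.3, 0.0]
print(dominant_class(outputs), is_counted_correct(outputs, 5))

# Figures reported in the text: 127 of 160 test images.
print(recognition_rate(127, 160))
```

With a tolerance of two classes on a five-class scale, an image of class 5 is still counted as recognized when the network's strongest output is class 3.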
9.6 DISCUSSION
The results obtained in the process of application of the four outlined methods are mainly affected by the feature extraction criteria in each method, the normalization of data provided to a neural network, and finally the training parameters of each network. When a classical neural network is used for classification, the convergence is much slower and not as accurate. Although we chose a very small number of templates for training, we still achieved a high testing performance (approximately 80%). Once our database becomes larger, training will be done with a larger number of templates for each class, and the testing performance is expected to improve.
REFERENCES
1. Aleynikov, S. and Micheli-Tzanakou, E., Design and implementation of modular neural networks based on the ALOPEX algorithm, Virtual Intell., Proc. SPIE, 2878, 81, Nov. 1996.
2. Burt, P. J., The pyramid as a structure for efficient computation, in Multiresolution Image Processing and Analysis, Rosenfeld, A., Ed., Springer-Verlag, New York, 1984, 6.
3. Cooley, T. and Micheli-Tzanakou, E., A Modular Neural Network for Classifying Mammograms, Proc. of the Int. Conf. on Neural Networks, 1996, 1162.
4. Daubechies, I., Orthonormal bases of compactly supported wavelets, Comm. Pure Appl. Math., 41, 906, 1988.
5. Hu, M., Visual pattern recognition by moment invariants, IRE Trans. Inf. Theor., 8, 179, Feb. 1962.
6. Mallat, S., A theory of multiresolution signal decomposition: the wavelet representation, IEEE Trans. Patt. Anal. Mach. Intell., 11(7), 674, July 1989.
7. Micheli-Tzanakou, E. and Binge, G., F-CORE: A New Fourier based data Compression and Reconstruction Method, MEDICON ’89, 344, 1989.
8. Micheli-Tzanakou, E., Uyeda, E., Ray, R., Sharma, A., Ramanujan, R., and Doug, J., Comparison of neural network algorithms for face recognition, Simulation, 64(1), 15, July 1995.
9. Zahner, D. and Micheli-Tzanakou, E., Artificial neural networks: definitions, methods, applications, The Biomedical Engineering Handbook, Bronzino, J., Ed., CRC Press, Boca Raton, FL, 1995, ch. 184, 2699.
10
A Three-Dimensional Neural Network Architecture
Evangelia Micheli-Tzanakou, Timothy J. Dasey, and Jeremy Bricker
10.1 INTRODUCTION
The idea behind the presented architecture was to create a pattern recognition system using neural components. The brain was taken as a model, and although little is known about how pattern recognition is accomplished, much more is known about the cells that comprise the earliest levels of processing and analyzing the features of an environment most directly. By constructing cells with similar properties to the biological cells, we may gain an advantage in information conservation and proper utilization of neural architectures. The most important characteristic of these cells is their receptive field (RF). With this in mind, we could search for an adaptive mechanism that, by changing connective strengths, could give the desired RFs. Therefore, since we will know what information the algorithmic components are providing, when a method is found that provides the desired cell types, we may be able to trace back via the algorithm to see what information the neurons give.
10.2 THE NEURAL NETWORK ARCHITECTURE
The architecture chosen was that of a hierarchy of two-dimensional cell layers, each successive layer more removed from the environment (Figure 10.1). The first layer receives inputs from the external world and all other layers from the preceding layers. In addition, the cells may receive lateral connections from other neighboring cells within the same layer, depending on the particular choice of the architecture. The interlayer feed-forward connections are chosen so that a cell feeds its connections onto a neighborhood of cells in the lower layer. This neighborhood may have definite bounds so that all cells within it make connections, or it may have indefinite bounds in which the probability of a connection decreases as a Gaussian with distance. The component cells themselves choose their outputs based on a weighted sum of all inputs passed through a function σ, such as
$$O_i(t) = \sigma\left[ \alpha_i \cdot \sum_j C_{ij} \cdot O_j(t-1) \right] \tag{10.1}$$
where Oi(t) is the output of neuron i at time interval t, Cij are the connection strengths, bounded from [–β,β] where β is usually 1.0, and α is a constant. In the simulations, σ is usually a sigmoid of the form
$$\sigma(x) = 0.5 \cdot a \cdot \left( 1 + \tanh\left( b \cdot x \cdot c \right) \right) \tag{10.2}$$
where a, b, c are constants which fix the maximum value, steepness, and bias of the sigmoid, respectively. However, if we wish to allow the inhibitory components of the RF to be used by subsequent layers, then the sigmoid function must have a nonzero firing level for those negative inputs. This suggests the use of a spontaneous firing activity for all neurons. An additional requirement needed to keep the neurons useful and “responsive” is to keep each neuron from being pushed too far into the saturation level. If that occurs, input deviations will not be sensed well, if at all. Since each neuron receives several inputs, it is easy for this to occur. To prevent it from happening, α is usually chosen equal to the reciprocal of the number of connections to neuron i, so that the neuron simply passes a weighted average of the inputs through the sigmoid.
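A minimal sketch of the cell output rule of Equations 10.1 and 10.2 is given below. It follows the sigmoid in its printed form, with the argument b·x·c; since the text describes c as a bias, an additive shift may have been intended, and that would be a one-line change. The function names and the default constants (a = b = c = 1) are assumptions made for illustration.

```python
import math

def sigmoid(x, a=1.0, b=1.0, c=1.0):
    """Eq. 10.2 in its printed form: 0.5 * a * (1 + tanh(b * x * c))."""
    return 0.5 * a * (1.0 + math.tanh(b * x * c))

def neuron_output(weights, previous_outputs, alpha=None, a=1.0, b=1.0, c=1.0):
    """Eq. 10.1: sigmoid of alpha times the weighted sum of earlier outputs.

    weights:          connection strengths C_ij into cell i, each in [-1, 1]
    previous_outputs: outputs O_j(t - 1) of the cells feeding cell i
    alpha:            defaults to 1 / (number of connections), so the cell
                      sees a weighted average of its inputs
    """
    if alpha is None:
        alpha = 1.0 / len(weights)
    weighted_sum = sum(w * o for w, o in zip(weights, previous_outputs))
    return sigmoid(alpha * weighted_sum, a, b, c)

# With zero net input the cell still fires at half its maximum,
# which acts as a spontaneous firing level.
print(neuron_output([1.0, -1.0], [0.3, 0.3]))
# Net inhibition lowers the output below that level without reaching zero.
print(neuron_output([-1.0, -0.5, 0.2], [0.8, 0.6, 0.1]))
```

Scaling by the reciprocal of the number of connections keeps the argument of the sigmoid within the range of a single input, which helps keep the cell away from saturation.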
10.3 SIMULATIONS
A simulation usually consists of a sequence of presentations of random input patterns to the first layer and a learning rule imposed on the connections by analysis of the firings of the neurons. A random input was chosen so as to prevent the cells from being biased towards any specific environmental feature. Since neighboring inputs are uncorrelated, first layer cells that receive their influences are expected to have synapse patterns that would similarly wander aimlessly in the learning process. The first layer provides a spatial average of the overlying inputs. Since neighboring cells have the greatest overlap in their neighborhoods, they tend to have firing patterns which are most similar. This would cause the synapses of a layer 2 cell that originate from nearby cells to tend to be alike. The actual training of the connections can be done in different ways.
I) Synapses can be changed based on a variation of the Hebbian rule1 as follows:
$$\Delta C_{ij} = \delta \cdot O_i \cdot O_j \tag{10.3}$$
where δ is a small positive constant. Due to the correlation between neighboring layer 1 cells, the synapses to the cells in later layers would tend to want to be all alike without additional constraints. In order to guarantee both positive and negative synapses to every cell, an additional “resource” constraint is imposed, which takes the form of
$$\sum_j C_{ij} = 0 \tag{10.4}$$
The third restriction is a bounding of the connections to the interval [–1, 1]. A synapse is allowed the freedom to switch from positive to negative and vice versa. This is not expected to alter the main results but only to prevent many of the synapses from disappearing with zero strength. Convergence usually occurs within 1000-5000 iterations, although faster convergence can be achieved with larger δ. Usually the final state of the synapses is at either the excitatory or the inhibitory limits.
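The three ingredients of this first training method (the Hebbian change δ·O_i·O_j, the zero-sum resource constraint, and the bound on each synapse) can be combined in a short sketch. How the constraint is actually enforced is not stated in the text; subtracting the mean from all synapses of a cell is an assumption made here, as are the function name and the default δ. Clipping after the mean subtraction can leave a small residual sum, which the sketch does not correct.

```python
def hebbian_step(weights, post, pre, delta=0.01, bound=1.0):
    """One update of the synapses C_ij feeding a single cell i.

    weights: current C_ij for the cells j connected to cell i
    post:    output O_i of the receiving cell
    pre:     outputs O_j of the sending cells
    """
    # Hebbian change: each synapse grows by delta * O_i * O_j.
    updated = [c + delta * post * o_j for c, o_j in zip(weights, pre)]
    # Resource constraint (assumed implementation): remove the mean so
    # that the synapses of the cell sum to zero.
    mean = sum(updated) / len(updated)
    updated = [c - mean for c in updated]
    # Bound every synapse to the interval [-bound, bound].
    return [max(-bound, min(bound, c)) for c in updated]

weights = [0.1, -0.2, 0.1, 0.0]
new_weights = hebbian_step(weights, post=0.5, pre=[1.0, 0.0, 0.5, 0.2])
print(new_weights, sum(new_weights))
```

Repeating such steps tends to drive synapses toward the excitatory or inhibitory limit, in line with the behavior described in the text.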
FIGURE 10.1 A schematic representation of the neural architecture. (Figure labels: input image; spatial Mexican-hat filters; area of connection for each cell; self-organizing sets of neurons.)
10.3.1 VISUAL RECEPTIVE FIELDS
A network was created with three layers and 128 cells per layer. A square stimulus was assumed with 32×32 size. The maximum distance that these cells can affect is a radius of r, with minimum weight –1 and maximum weight values of +1. The network had a total of 7071 connections. In the training mode, the minimum stimulus value was assumed to be zero and the maximum equal to 10. No noise was imposed on the system. The results obtained show the emergence of cells with edge-type RFs in layer 2 (Figure 10.2a). The orientation of the edge appears to be totally arbitrary, even between neighboring cells. In layer 3, these edge cell RFs often conflict to give RFs which have oblong centers and surrounds of the opposite polarity, but many times these centers draw to the edges with further learning. Thus the final RFs often look like an elliptical center touching the outside of the field, mostly surrounded by a horseshoe-shaped region of opposite polarity (Figure 10.2b).
FIGURE 10.2 Receptive field characteristics for the neurons described in the text. (a) RF of layer 2. (b) RF of layer 3. Notice the center-surround organization of layer 2 and the elongated character of layer 3.
Figure 10.3 shows the results from a similar network, except that the minimum weight value is –0.5, i.e., the inhibitory effects are weaker. Notice that excitation spreads more and that the maximum amplitudes are much larger. Also notice that the layer 3 RF is much longer than the one in layer 2.
FIGURE 10.3 Receptive field organization for layer 2 (a) and layer 3 (b) when the inhibitory effects are less than in Figure 10.2. Compare the amplitudes and the spread of the RFs to those of Figure 10.2.
In the frequency domain, these RFs show finer tuning as we move to deeper layers of the system. Figures 10.4 and 10.5 show the power spectra of the RFs in Figures 10.2 and 10.3, respectively. Notice also that the edge effects are more obvious in the spectra of layer 3.
II) The wider the variance of the firing rate of the cells, the more information the cells can carry. With such a supposition we can use an optimization routine to find the values of the synapses to a cell such that the variance in the firing rate of the cell is maximized. The optimization system is a variation of the ALOPEX process.2 In this process two random connection patterns are presented, and the variance (V) of the cell output is estimated with a number of random input patterns. Since we want the pattern of connection strengths to affect the variance and not the strength of the connections themselves, the variance can be modified as
$$V_i = \frac{1}{N} \sum_j \left[ \left( O_{ij} - O_i^{\mathrm{ave}} \right) \Big/ \left( \sum_j C_{ij}^{*} \right) \right] \qquad (10.5)$$
The connections are then changed based on the relation between the last change in the connections and the last change in the variance, with an added noise term to keep the search from becoming trapped in local extrema, as follows:
$$C_{ij} = \beta \left( C_{ij}(t) - C_{ij}(t-1) \right) \left( V_i(t) - V_i(t-1) \right) + \text{noise term} \qquad (10.6)$$
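The variance-driven search of Equations 10.5 and 10.6 can be sketched for a single cell as below. This is a minimal illustration under stated assumptions, not the routine used to obtain the reported results: the right-hand side of Equation 10.6 is read here as an increment added to the current connections, the normalizer of Equation 10.5 is taken to be the sum of squared weights (so that rescaling all the weights leaves the score unchanged), and the input patterns, step size `beta`, noise level, and function names are hypothetical.

```python
import random


def normalized_variance(weights, patterns):
    """Variance of the cell output divided by the sum of squared weights.

    ASSUMPTION: the sum of squares is a stand-in for the normalizer of
    Eq. 10.5, chosen so that only the pattern of weights matters.
    """
    outputs = [sum(w * x for w, x in zip(weights, p)) for p in patterns]
    mean = sum(outputs) / len(outputs)
    var = sum((o - mean) ** 2 for o in outputs) / len(outputs)
    norm = sum(w * w for w in weights)
    return var / norm if norm > 0.0 else 0.0


def alopex_maximize(n_inputs=6, n_patterns=100, iterations=400,
                    beta=0.05, noise_sd=0.02, seed=0):
    rng = random.Random(seed)
    # Hypothetical inputs: a shared component plus independent jitter,
    # so that some weight directions carry more variance than others.
    patterns = []
    for _ in range(n_patterns):
        common = rng.gauss(0.0, 1.0)
        patterns.append([common + rng.gauss(0.0, 0.3) for _ in range(n_inputs)])
    # Two random connection patterns start the process.
    prev_w = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    curr_w = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
    prev_v = normalized_variance(prev_w, patterns)
    history = [prev_v]
    for _ in range(iterations):
        curr_v = normalized_variance(curr_w, patterns)
        history.append(curr_v)
        dv = curr_v - prev_v
        # Eq. 10.6, read as an increment: beta * dC * dV + noise.
        next_w = [c + beta * (c - p) * dv + rng.gauss(0.0, noise_sd)
                  for c, p in zip(curr_w, prev_w)]
        # Keep every connection inside the interval [-1, 1].
        next_w = [min(1.0, max(-1.0, w)) for w in next_w]
        prev_w, prev_v, curr_w = curr_w, curr_v, next_w
    return curr_w, history


final_weights, variance_history = alopex_maximize()
print(variance_history[0], max(variance_history))
```

Because the update is stochastic, the variance does not rise monotonically; the sketch therefore records the whole history rather than only the last value.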
Amazingly, with this modification, the same edge-sensitive cell RFs emerge after only about 100 iterations and remain the same until about 400 iterations. This shows that the combination of Hebb’s rule and ALOPEX is desirable. It might also mean that the way in which the architecture of the network is set up biases it toward neurons with edge detection capabilities. Work by others3 has indicated that certain forms of the Hebb rule can be used to perform principal component analysis, a variance maximization of sorts. In addition, both feed-forward and feedback connections are used, with the feedback having a wider connective neighborhood than the feed-forward connections. All connections are variable. If the inhibitory connections are spread over a much wider area, they tend to cancel the excitatory influence, making the Hebb changes ineffective. In future work we will include feed-forward connections of cells with a Gaussian distribution, with the inhibitory and excitatory connections having different spatial standard deviations. The present maximum number of synapses allowed does not let us obtain statistical significance for initial random strength generation.
III) Both feed-forward and feedback connections can be used, with the feedback having a wider connective neighborhood than the feed-forward connections.
IV) Lateral connections on each layer are allowed and used, thus adding an extra feature of similarity to the biological system.
FIGURE 10.4 Power spectrum of the RF in Figure 10.2. (a) layer 2, (b) layer 3. Notice the fine tuning in layer 3.
FIGURE 10.5 Power spectra of RFs in Figure 10.3. (a) layer 2, (b) layer 3. Compare with Figure 10.4. The edge effect is much more pronounced.
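Assuming the spectra in Figures 10.4 and 10.5 are the usual squared magnitude of the two-dimensional Fourier transform of an RF, they can be computed as in the sketch below. The 8×8 edge-type field is a made-up stand-in for illustration, not one of the RFs shown in the figures, and the direct double sum is used only to keep the example free of external libraries.

```python
import cmath


def power_spectrum_2d(field):
    """Squared magnitude of the 2-D discrete Fourier transform."""
    rows, cols = len(field), len(field[0])
    spectrum = []
    for u in range(rows):
        row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2.0 * cmath.pi * (u * y / rows + v * x / cols)
                    total += field[y][x] * cmath.exp(1j * angle)
            row.append(abs(total) ** 2)
        spectrum.append(row)
    return spectrum


# Hypothetical edge-type RF: excitatory left half, inhibitory right half.
edge_rf = [[1.0] * 4 + [-1.0] * 4 for _ in range(8)]
edge_spectrum = power_spectrum_2d(edge_rf)
print(edge_spectrum[0][:4])
```

For a vertical edge like this one, the power lies along the horizontal frequency axis, since the field does not vary from row to row.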
If each input signal value is thought of as a dimension in parameter space, any particular input will comprise a point in that space. The synapses of a neuron can then be thought of as describing a vector in the same space and the output of the neuron as the projection of the input point onto the synapse vector. If the choice of the synapses is initially random, chances are that the projections from many different inputs will lie close to one another, giving the neuron a response profile. Consider this to be the response profile of a neuron before optimization. In order to better distinguish between inputs, the synapses should be changed so that more of the neuron range can be utilized. An intriguing choice is for the neuron to perform a type of principal component analysis (PCA) (Karhunen-Loève feature extraction). Principal component analysis may be approximated by a search for the vector (described by the connection weight values) that maximizes the variance of the cell firing level. The choice of this property may serve to partition the input space into recognizable categories at the output. This analysis approximates the Karhunen-Loève search for the eigenvector of the maximum eigenvalue. For layers of neurons that have a large amount of information with near neighbors, the use of low-level lateral inhibition should prevent the system from settling on the same vector for each neuron, providing instead a graded topography to the layer. Depending on the partitioning of the input space, this processing mode of neurons could provide many different behaviors. If the input space has clusters, the neuron may provide classification. If, on the other hand, the inputs are “randomly” distributed in space, the neuron can choose any feature vector, but could be constrained by near-neighbor interactions in how it forms topographic maps.
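The link between variance maximization and the eigenvector of the maximum eigenvalue can be checked numerically. The sketch below uses power iteration on the input covariance matrix, which is a standard stand-in and not the ALOPEX search described in the text; the two-dimensional data set and all function names are hypothetical.

```python
import math


def covariance_matrix(points):
    n, dim = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    return [[sum((p[a] - means[a]) * (p[b] - means[b]) for p in points) / n
             for b in range(dim)] for a in range(dim)]


def projection_variance(weights, points):
    """Variance of the neuron output, i.e., of the input projected on weights."""
    outs = [sum(w * x for w, x in zip(weights, p)) for p in points]
    mean = sum(outs) / len(outs)
    return sum((o - mean) ** 2 for o in outs) / len(outs)


def principal_direction(points, steps=100):
    """Unit eigenvector of the largest eigenvalue, by power iteration."""
    cov = covariance_matrix(points)
    v = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(steps):
        v = [sum(row[k] * v[k] for k in range(len(v))) for row in cov]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


# Hypothetical inputs spread mainly along the diagonal of a 2-D space.
points = []
for i in range(-5, 6):
    s = 0.2 * ((i % 3) - 1)          # small deviation across the diagonal
    points.append((i + s, i - s))

direction = principal_direction(points)
print(direction, projection_variance(direction, points))
```

A synapse vector aligned with this direction spreads the projected inputs over a wider range than either coordinate axis does, which is the sense in which maximizing the output variance approximates the Karhunen-Loève search.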
10.3.2 MODELING OF PARKINSON’S DISEASE
We have created a network with eight neural layers in addition to a layer for stimuli. Each layer represents one physiological region of the brain or nervous system, as described by the model of DeLong et al.4 By means of a series of excitatory and inhibitory feed-forward and feedback synapses, the brain stem (layer 7) is relatively active. Another layer was added to represent the motor neurons of the extremities (head, legs, and arms). The connections from the brain stem to the extremities are assumed to be inhibitory. Thus, in the normal state, the high activity of layer 7 subjects layer 8 to a large degree of inhibition, making this layer rather inactive. The Parkinsonian case is simulated by cutting off connections stemming from the input layer. When this happens, layer 7 is excited less than in the normal case. Consequently, a smaller amount of activity exists with which to inhibit the extremities. The resulting unusually high level of activation in the motor neurons of the extremities represents the tremors present in patients suffering from Parkinson’s disease. A pallidotomy is then simulated in the Parkinsonian scenario by destroying groups of neurons in the Globus Pallidus Internum (or GPi, layer 4). As is evident from DeLong’s model (Figure 10.6), this action will reduce the total amount of activity present in the GPi, causing less inhibition to the layer following the GPi, followed by greater excitation of the cortex, greater excitation of the brain stem, and more inhibition to the motor neurons of the extremities, corresponding to a reduction in tremors. The pallidotomy brings the degrees of activation on layers between the GPi and the extremities back to levels akin to those observed in the non-Parkinsonian scenario. The program allows the effects of different types and locations of lesions to be observed. In general, a lesion is targeted on the location in which the highest degree of activity in the GPi is recorded.
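The chain of effects described above can be followed with a toy calculation in which each region is reduced to a single activity value. This is only a sketch of the sign logic, not the 200-neuron-per-layer network: the numeric constants, the rectified-linear units, and in particular the assumption that the input pathway has a net inhibitory effect on the GPi are all hypothetical choices made so that cutting the input lowers brain stem activity, as the text states.

```python
def relu(x):
    return max(0.0, x)


def simulate(input_connected=True, gpi_lesion_fraction=0.0):
    """Scalar activity of each region along the GPi-to-extremities chain."""
    stimulus = 1.0 if input_connected else 0.0
    # ASSUMPTION: the net effect of the input pathway on the GPi is inhibitory.
    gpi = relu(1.0 - 0.8 * stimulus) * (1.0 - gpi_lesion_fraction)
    relay = relu(1.0 - gpi)               # layer following the GPi: inhibited by it
    cortex = relay                        # excited by the relay layer
    brain_stem = cortex                   # layer 7: excited by the cortex
    extremities = relu(1.0 - brain_stem)  # layer 8: inhibited by the brain stem
    return {"gpi": gpi, "brain_stem": brain_stem, "extremities": extremities}


normal = simulate()
parkinsonian = simulate(input_connected=False)
pallidotomy = simulate(input_connected=False, gpi_lesion_fraction=0.8)
for name, state in [("normal", normal), ("parkinsonian", parkinsonian),
                    ("pallidotomy", pallidotomy)]:
    print(name, state)
```

In this toy, removing the input raises the activity of the extremities, and destroying a fraction of the GPi brings it back toward the normal level, mirroring the qualitative sequence described in the text.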
The network is helpful in predicting the consequences of lesioning off-target or at a location other than the point of highest activity. Lesioning at multiple locations or on different layers may also be simulated. The program created a network consisting of eight layers of neurons and one layer of input nodes. Each layer corresponds to one layer in DeLong’s model.4 On each layer are placed 200 neurons. This large quantity of neurons is necessary in order to visualize each layer well. The neurons are randomly scattered on each layer within the spatial bounds of -2