Digital Speech Processing, Synthesis, and Recognition
Signal Processing and Communications

Series Editor
K. J. Ray Liu, University of Maryland, College Park, Maryland

Editorial Board
Sadaoki Furui, Tokyo Institute of Technology
Yih-Fang Huang, University of Notre Dame
Aggelos K. Katsaggelos, Northwestern University
Mos Kaveh, University of Minnesota
P. K. Raja Rajasekaran, Texas Instruments
John A. Sorenson, Technical University of Denmark
1. Digital Signal Processing for Multimedia Systems, edited by Keshab K. Parhi and Takao Nishitani
2. Multimedia Systems, Standards, and Networks, edited by Atul Puri and Tsuhan Chen
3. Embedded Multiprocessors: Scheduling and Synchronization, Sundararajan Sriram and Shuvra S. Bhattacharyya
4. Signal Processing for Intelligent Sensor Systems, David C. Swanson
5. Compressed Video over Networks, edited by Ming-Ting Sun and Amy R. Reibman
6. Modulated Coding for Intersymbol Interference Channels, Xiang-Gen Xia
7. Digital Speech Processing, Synthesis, and Recognition: Second Edition, Revised and Expanded, Sadaoki Furui
Additional Volumes in Preparation

Modern Digital Halftoning, David L. Lau and Gonzalo R. Arce
Blind Equalization and Identification, Zhi Ding and Ye (Geoffrey) Li
Video Coding for Wireless Communications, King N. Ngan, Chi W. Yap, and Keng T. Tan
Digital Speech Processing, Synthesis, and Recognition
Second Edition, Revised and Expanded

Sadaoki Furui
Tokyo Institute of Technology
Tokyo, Japan

Marcel Dekker, Inc.
New York • Basel
Library of Congress Cataloging-in-Publication Data

Furui, Sadaoki.
Digital speech processing, synthesis, and recognition / Sadaoki Furui. 2nd ed., rev. and expanded.
p. cm. (Signal processing and communications; 7)
ISBN 0-8247-0452-5 (alk. paper)
1. Speech processing systems. I. Title. II. Series.
TK7882.S65 F87 2000
006.4'54 dc21
00-060197

This book is printed on acid-free paper.

Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Eastern Hemisphere Distribution
Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

World Wide Web
http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Copyright © 2001 by Marcel Dekker, Inc. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit)
10 9 8 7 6 5 4 3 2 1

PRINTED IN THE UNITED STATES OF AMERICA
Series Introduction
Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, and image, audio, and multimedia processing, and shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications; signal processing is everywhere in our lives. When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way. When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare DNA samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline. Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:
• Signal theory and analysis
• Statistical signal processing
• Speech and audio processing
• Image and video processing
• Multimedia signal processing and technology
• Signal processing for communications
• Signal processing architectures and VLSI design
I hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.

K. J. Ray Liu
Preface to the Second Edition
More than a decade has passed since the first edition of Digital Speech Processing, Synthesis, and Recognition was published. The book has been widely used throughout the world as both a textbook and a reference work. The clear need for such a book stems from the fact that speech is the most natural form of communication among humans and that it also plays an ever more salient role in human-machine communication. Realizing any such system of communication necessitates a clear and thorough understanding of the core technologies of speech processing. The field of speech processing, synthesis, and recognition has witnessed significant progress in this past decade, spurred by advances in signal processing, algorithms, architectures, and hardware. These advances include: (1) international standardization of various hybrid speech coding techniques, especially CELP, and its widespread use in many applications, such as cellular phones; (2) waveform unit concatenation-based speech synthesis; (3) large-vocabulary continuous-speech recognition based on a statistical pattern recognition paradigm, e.g., hidden Markov models (HMMs) and stochastic language models; (4) increased robustness of speech recognition systems against speech variation, such as speaker-to-speaker variability, noise, and channel distortion; and (5) speaker recognition methods using the HMM technology.
This second edition includes these significant advances and details important emerging technologies. The newly added sections include Robust and Flexible Speech Coding, Corpus-Based Speech Synthesis, Theory and Implementation of HMM, Large-Vocabulary Continuous-Speech Recognition, Speaker-Independent and Adaptive Recognition, and Robust Algorithms Against Noise and Channel Variations. In an effort to retain brevity, older technologies now rarely used in recent systems have been omitted. The basic technology parts of the book have also been rewritten for easier understanding. It is my hope that users of the first edition, as well as new readers seeking to explore both the fundamental and modern technologies in this increasingly vital field, will benefit from this second edition for many years to come.
Acknowledgments
I am grateful for permission from many organizations and authors to use their copyrighted material in original or adapted form:

• Figure 2.5 contains material which is copyright © Lawrence Erlbaum Associates, 1986. Used with permission. All rights reserved.
• Figure 2.6 contains material which is copyright © Dr. H. Sato, 1975. Reprinted with permission of copyright owner. All rights reserved.
• Figures 2.7, 3.8, 4.9, 7.1, 7.4, 7.6, and 7.7 contain material which respectively is copyright © 1952, 1980, 1967, 1972, 1980, 1987, and 1987 American Institute of Physics. Reproduced with permission. All rights reserved.
• Figures 2.8, 2.9, and 2.10 contain material which is copyright © Dr. H. Irii, 1987. Used with permission. All rights reserved.
• Figure 2.11 contains material which is copyright © Dr. S. Saito, 1958. Reprinted with permission of copyright owner. All rights reserved.
• Figure 3.5 contains material which is copyright © Dr. G. Fant, 1959. Reproduced with permission. All rights reserved.
• Figures 3.6, 3.7, 6.6, 6.33, 6.35, and 7.8 contain material which respectively is copyright © 1972, 1972, 1975, 1986, 1986, and 1986 AT&T. Used with permission. All rights reserved.
• Figures 4.4, 5.4, and 5.5 contain material which is copyright © Dr. Y. Tohkura, 1980. Reprinted with permission. All rights reserved.
• Figures 4.12, 6.1, 6.12, 6.13, 6.18, 6.19, 6.20, 6.24, 6.25, 6.26, 6.27, 6.32, 6.34, 7.9, 8.1, 8.5, 8.14, B.1, C.1, C.2, and C.3 contain material which respectively is copyright © 1966, 1983, 1986, 1986, 1981, 1982, 1981, 1983, 1983, 1983, 1980, 1982, 1982, 1988, 1996, 1978, 1981, 1984, 1987, 1987, and 1987 IEEE. Reproduced with permission. All rights reserved.
• Figures 5.2, 5.3, 5.9, 5.10, 5.11, and 5.18, as well as Tables 4.1, 4.2, 4.3, and 5.1, contain material which respectively is copyright © Dr. F. Itakura, 1970, 1970, 1971, 1971, 1973, 1981, 1978, 1981, 1978, and 1981. Used with permission of copyright owner. All rights reserved.
• Figure 5.19 contains material which is copyright © Dr. T. Nakajima, 1978. Reproduced with permission. All rights reserved.
• Figure 6.36 contains material which is copyright © Dr. T. Moriya, 1986. Used with permission of copyright owner. All rights reserved.
• Figures 6.28 and 6.29 contain material which is copyright © Mr. Y. Shiraki, 1986. Reprinted with permission of copyright owner. All rights reserved.
• Figure 6.38 contains material which is copyright © Mr. T. Watanabe, 1982. Used with permission. All rights reserved.
• Figure 7.5 contains material which is copyright © Dr. Y. Sagisaka, 1998. Reproduced with permission. All rights reserved.
• Table 8.5 contains material which is copyright © Dr. S. Nakagawa, 1983. Reprinted with permission. All rights reserved.
• Figures 8.12, 8.13, and 8.20 contain material which is copyright © Prentice Hall, 1993. Used with permission. All rights reserved.
• Figures 8.15, 8.16, and 8.21 contain material which is respectively copyright © 1996, 1996, and 1997 Kluwer Academic Publishers. Reproduced with permission. All rights reserved.
• Figures 8.22 and 8.23 contain material which is copyright © DARPA, 1999. Used with permission. All rights reserved.
Preface to the First Edition
Research in speech processing has recently witnessed remarkable progress. Such progress has ensured the wide use of speech recognizers and synthesizers in a great many fields, such as banking services and data input during quality control inspections. Although the level and range of applications remain somewhat restricted, this technological progress has transpired through an efficient and effective combination of the long and continuing history of speech research with the latest remarkable advances in digital signal processing (DSP) technologies. In particular, these DSP technologies, including fast Fourier transform, linear predictive coding, and cepstrum representation, have been developed principally to solve several of the more complicated problems in speech processing. The aim of this book is, therefore, to introduce the reader to the most fundamental and important speech processing technologies derived from the level of technological progress reached in speech production, coding, analysis, synthesis, and recognition, as well as in speaker recognition. Although the structure of this book is based on my book in Japanese entitled Digital Speech Processing (Tokai University Press, Tokyo, 1985), I have revised and updated almost all chapters in line with the latest progress. The present book also includes several important speech processing technologies developed in Japan, which, for the
most part, are somewhat unfamiliar to researchers from Western nations. Nevertheless, I have made every effort to remain as objective as possible in presenting the state of the art of speech processing. This book has been designed primarily to serve as a text for an advanced undergraduate- or first-year graduate-level course. It has also been designed as a reference book with the speech researcher in mind. The reader is expected to have an introductory understanding of linear systems and digital signal processing. Several people have had a significant impact, both directly and indirectly, on the material presented in this book. My biggest debt of gratitude goes to Drs. Shuzo Saito and Fumitada Itakura, both former heads of the Fourth Research Section of the Electrical Communications Laboratories (ECLs), Nippon Telegraph and Telephone Corporation (NTT). For many years they have provided me with invaluable insight into the conducting and reporting of my research. In addition, I had the privilege of working as a visiting researcher from 1978 to 1979 in AT&T Bell Laboratories' Acoustics Research Department under Dr. James L. Flanagan. During that period, I profited immeasurably from his views and opinions. Doctors Saito, Itakura, and Flanagan have not only had a profound effect on my personal life and professional career but have also had a direct influence in many ways on the information presented in this book. I also wish to thank the many members of NTT's ECLs for providing me with the necessary support and stimulating environment in which many of the ideas outlined in this book could be developed. Dr. Frank K. Soong of AT&T Bell Laboratories deserves a note of gratitude for his valuable comments and criticism on Chapter 6 during his stay at the ECLs as a visiting researcher. Additionally, I would like to extend my sincere thanks to Patrick Fulmer of Nexus International Corporation, Tokyo, for his careful technical review of the manuscript.
Finally, I would like to express my deep and endearing appreciation to my wife and family for their patience and for the time they sacrificed on my behalf throughout the book's preparation.

Sadaoki Furui
Contents

Series Introduction (K. J. Ray Liu) iii
Preface to the Second Edition v
Acknowledgments vii
Preface to the First Edition xi

1. INTRODUCTION 1

2. PRINCIPAL CHARACTERISTICS OF SPEECH 5
   2.1 Linguistic Information 5
   2.2 Speech and Hearing 7
   2.3 Speech Production Mechanism 9
   2.4 Acoustic Characteristics of Speech 14
   2.5 Statistical Characteristics of Speech 20
       2.5.1 Distribution of amplitude level 20
       2.5.2 Long-time averaged spectrum 23
       2.5.3 Variation in fundamental frequency 24
       2.5.4 Speech ratio 26

3. SPEECH PRODUCTION MODELS 27
   3.1 Acoustical Theory of Speech Production 27
   3.2 Linear Separable Equivalent Circuit Model 30
   3.3 Vocal Tract Transmission Model 32
       3.3.1 Progressing wave model 32
       3.3.2 Resonance model 38
   3.4 Vocal Cord Model 40
4. SPEECH ANALYSIS AND ANALYSIS-SYNTHESIS SYSTEMS 45
   4.1 Digitization 45
       4.1.1 Sampling 46
       4.1.2 Quantization and coding 47
       4.1.3 A/D and D/A conversion 51
   4.2 Spectral Analysis 52
       4.2.1 Spectral structure of speech 52
       4.2.2 Autocorrelation and Fourier transform 53
       4.2.3 Window function 57
       4.2.4 Sound spectrogram 60
   4.3 Cepstrum 62
       4.3.1 Cepstrum and its application 62
       4.3.2 Homomorphic analysis and LPC cepstrum 66
   4.4 Filter Bank and Zero-Crossing Analysis 70
       4.4.1 Digital filter bank 70
       4.4.2 Zero-crossing analysis 70
   4.5 Analysis-by-Synthesis 71
   4.6 Analysis-Synthesis Systems 73
       4.6.1 Analysis-synthesis system structure 73
       4.6.2 Examples of analysis-synthesis systems 73
   4.7 Pitch Extraction 78

5. LINEAR PREDICTIVE CODING (LPC) ANALYSIS 83
   5.1 Principles of LPC Analysis 83
   5.2 LPC Analysis Procedure 86
   5.3 Maximum Likelihood Spectral Estimation 89
       5.3.1 Formulation of maximum likelihood spectral estimation 89
       5.3.2 Physical meaning of maximum likelihood spectral estimation 93
   5.4 Source Parameter Estimation from Residual Signals 98
   5.5 Speech Analysis-Synthesis System by LPC 99
   5.6 PARCOR Analysis 102
       5.6.1 Formulation of PARCOR analysis 102
       5.6.2 Relationship between PARCOR and LPC coefficients 108
       5.6.3 PARCOR synthesis filter 109
       5.6.4 Vocal tract area estimation based on PARCOR analysis 110
   5.7 Line Spectrum Pair (LSP) Analysis 116
       5.7.1 Principle of LSP analysis 116
       5.7.2 Solution of LSP analysis 119
       5.7.3 LSP synthesis filter 122
       5.7.4 Coding of LSP parameters 126
       5.7.5 Composite sinusoidal model 126
       5.7.6 Mutual relationships between LPC parameters 127
   5.8 Pole-Zero Analysis 129

6. SPEECH CODING 133
   6.1 Principal Techniques for Speech Coding 133
       6.1.1 Reversible coding 133
       6.1.2 Irreversible coding and information rate distortion theory 134
       6.1.3 Waveform coding and analysis-synthesis systems 135
       6.1.4 Basic techniques for waveform coding methods 138
   6.2 Coding in Time Domain 141
       6.2.1 Pulse code modulation (PCM) 141
       6.2.2 Adaptive quantization 143
       6.2.3 Predictive coding 143
       6.2.4 Delta modulation 149
       6.2.5 Adaptive differential PCM (ADPCM) 151
       6.2.6 Adaptive predictive coding (APC) 153
       6.2.7 Noise shaping 156
   6.3 Coding in Frequency Domain 159
       6.3.1 Subband coding (SBC) 159
       6.3.2 Adaptive transform coding (ATC) 163
       6.3.3 APC with adaptive bit allocation (APC-AB) 166
       6.3.4 Time-domain harmonic scaling (TDHS) algorithm 168
   6.4 Vector Quantization 173
       6.4.1 Multipath search coding 173
       6.4.2 Principles of vector quantization 175
       6.4.3 Tree search and multistage processing 178
       6.4.4 Vector quantization for linear predictor parameters 180
       6.4.5 Matrix quantization and finite-state vector quantization 182
   6.5 Hybrid Coding 187
       6.5.1 Residual- or speech-excited linear predictive coding 187
       6.5.2 Multipulse-excited linear predictive coding (MPC) 189
       6.5.3 Code-excited linear predictive coding (CELP) 193
       6.5.4 Coding by phase equalization and variable-rate tree coding 196
   6.6 Evaluation and Standardization of Coding Methods 199
       6.6.1 Evaluation factors of speech coding systems 199
       6.6.2 Speech coding standards 203
   6.7 Robust and Flexible Speech Coding 211

7. SPEECH SYNTHESIS 213
   7.1 Principles of Speech Synthesis 213
   7.2 Synthesis Based on Waveform Coding 217
   7.3 Synthesis Based on Analysis-Synthesis Method 221
   7.4 Synthesis Based on Speech Production Mechanism 222
       7.4.1 Vocal tract analog method 223
       7.4.2 Terminal analog method 224
   7.5 Synthesis by Rule 226
       7.5.1 Principles of synthesis by rule 226
       7.5.2 Control of prosodic features 230
   7.6 Text-to-Speech Conversion 234
   7.7 Corpus-Based Speech Synthesis 237

8. SPEECH RECOGNITION 243
   8.1 Principles of Speech Recognition 243
       8.1.1 Advantages of speech recognition 245
       8.1.2 Difficulties in speech recognition 246
       8.1.3 Classification of speech recognition 248
   8.2 Speech Period Detection 249
   8.3 Spectral Distance Measures 249
       8.3.1 Distance measures used in speech recognition 251
       8.3.2 Distances based on nonparametric spectral analysis 252
       8.3.3 Distances based on LPC 258
       8.3.4 Peak-weighted distances based on LPC analysis 260
       8.3.5 Weighted cepstral distance 262
       8.3.6 Transitional cepstral distance
       8.3.7 Prosody 264
   8.4 Structure of Word Recognition Systems 264
   8.5 Dynamic Time Warping (DTW) 266
       8.5.1 DP matching 266
       8.5.2 Variations in DP matching 270
       8.5.3 Staggered array DP matching 272
   8.6 Word Recognition Using Phoneme Units 275
       8.6.1 Principal structure 275
       8.6.2 SPLIT method 277
   8.7 Theory and Implementation of HMM 278
       8.7.1 Fundamentals of HMM 278
       8.7.2 Three basic problems for HMMs 282
       8.7.3 Solution to Problem 1: probability evaluation 283
       8.7.4 Solution to Problem 2: optimal state sequence 286
       8.7.5 Solution to Problem 3: parameter estimation 288
       8.7.6 Continuous observation densities in HMMs 290
       8.7.7 Tied-mixture HMM 292
       8.7.8 MMI and MCE/GPD training of HMM 292
       8.7.9 HMM system for word recognition 293
   8.8 Connected Word Recognition 295
       8.8.1 Two-level DP matching and its modifications 295
       8.8.2 Word spotting 303
   8.9 Large-Vocabulary Continuous-Speech Recognition 306
       8.9.1 Three principal structural models 306
       8.9.2 Other system constructing factors 308
       8.9.3 Statistical theory of continuous speech recognition 311
       8.9.4 Statistical language modeling 312
       8.9.5 Typical structure of large-vocabulary continuous-speech recognition systems 314
       8.9.6 Methods for evaluating recognition systems 318
   8.10 Examples of Large-Vocabulary Continuous-Speech Recognition Systems 320
       8.10.1 DARPA speech recognition projects 323
       8.10.2 English speech recognition system at LIMSI Laboratory 324
       8.10.3 English speech recognition system at IBM Laboratory 325
       8.10.4 A Japanese speech recognition system 328
   8.11 Speaker-Independent and Adaptive Recognition 330
       8.11.1 Multi-template method 332
       8.11.2 Statistical method 333
       8.11.3 Speaker normalization method 334
       8.11.4 Speaker adaptation methods 335
       8.11.5 Unsupervised speaker adaptation method 336
   8.12 Robust Algorithms Against Noise and Channel Variations 339
       8.12.1 HMM composition/PMC 344
       8.12.2 Detection-based approach for spontaneous speech recognition 344

9. SPEAKER RECOGNITION 349
   9.1 Principles of Speaker Recognition 349
       9.1.1 Human and computer speaker recognition 349
       9.1.2 Individual characteristics 351
   9.2 Speaker Recognition Methods 352
       9.2.1 Classification of speaker recognition methods 352
       9.2.2 Structure of speaker recognition systems 354
       9.2.3 Relationship between error rate and number of speakers 358
       9.2.4 Intra-speaker variation and evaluation of feature parameters 360
       9.2.5 Likelihood (distance) normalization 364
   9.3 Examples of Speaker Recognition Systems 366
       9.3.1 Text-dependent speaker recognition systems 366
       9.3.2 Text-independent speaker recognition systems 368
       9.3.3 Text-prompted speaker recognition systems 373

10. FUTURE DIRECTIONS OF SPEECH INFORMATION PROCESSING 375
   10.1 Overview 375
   10.2 Analysis and Description of Dynamic Features 378
   10.3 Extraction and Normalization of Voice Individuality 379
   10.4 Adaptation to Environmental Variation 380
   10.5 Basic Units for Speech Processing 381
   10.6 Advanced Knowledge Processing 382
   10.7 Clarification of Speech Production Mechanism 383
   10.8 Clarification of Speech Perception Mechanism 384
   10.9 Evaluation Methods for Speech Processing Technologies 385
   10.10 Use of LSI for Speech Processing 386

APPENDICES
   A Convolution and z-Transform 387
       A.1 Convolution 387
       A.2 z-Transform 388
       A.3 Stability 391
   B Vector Quantization Algorithm 393
       B.1 VQ (Vector Quantization) Technique Formulation 393
       B.2 Lloyd's Algorithm (k-Means Algorithm) 394
       B.3 LBG Algorithm 395
   C Neural Nets 399

Bibliography 405
Index 437
Introduction
Speech communication is one of the basic and most essential capabilities possessed by human beings. Speech can be said to be the single most important method through which people can readily convey information without the need for any 'carry-along' tool. Although we passively receive more stimuli from outside through the eyes than through the ears, mutually communicating visually is almost totally ineffective compared to what is possible through speech communication. The speech wave itself conveys linguistic information, the speaker's vocal characteristics, and the speaker's emotion. Information exchange by speech clearly plays a very significant role in our lives. The acoustical and linguistic structures of speech have been confirmed to be intricately related to our intellectual ability, and are, moreover, closely intertwined with our cultural and social development. Interestingly, the most culturally developed areas in the world correspond to those areas in which the telephone network is the most highly developed. One evening in early 1875, Alexander Graham Bell was speaking with his assistant T. A. Watson (Fagen, 1975). He had just conceived the idea of a mechanism based on the structure of the human ear during the course of his research into fabricating a telegram machine for conveying music. He said, 'Watson, I have another idea I haven't told you about that I think will surprise you.
If I can get a mechanism which will make a current of electricity vary in its intensity as the air varies in density when a sound is passing through it, I can telegraph any sound, even the sound of speech.' This, as we know, became the central concept coming to fruition as the telephone in the following year. The invention of the telephone constitutes not only the most important epoch in the history of communications, but it also represents the first step in which speech began to be dealt with as an engineering target. The history of speech research actually started, however, long before the invention of the telephone. Initial speech research began with the development of mechanical speech synthesizers toward the end of the 18th century and into vocal vibration and hearing mechanisms in the mid-19th century. Before the invention of pulse code modulation (PCM) in 1938, however, the speech wave had been dealt with by analog processing techniques. The invention of PCM and the development of digital circuits and electronic computers have made possible the digital processing of speech and have brought about the remarkable progress in speech information processing, especially after 1960. The two most important papers to appear since 1960 were presented at the 6th International Congress on Acoustics held in Tokyo, Japan, in 1968: the paper on a speech analysis-synthesis system based on the maximum likelihood method presented by NTT's Electrical Communications Laboratories, and the paper on predictive coding presented by Bell Laboratories. These papers essentially produced the greatest thrust to progress in speech information processing technology; in other words, they opened the way to digital speech processing technology. Specifically, both papers deal with the information compression technique using the linear prediction of speech waves and are based on mathematical techniques for stochastic processes.
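The PCM principle mentioned above, uniform sampling followed by uniform quantization, can be sketched in a few lines of code. This is an illustrative sketch only, not taken from the book; the function names and the 8-bit, 8-kHz parameters are assumptions chosen to match telephone-band conventions. (PCM itself is treated in detail in Section 6.2.1.)

```python
import math

def pcm_encode(samples, n_bits=8, full_scale=1.0):
    """Uniform PCM: clip each sample, then map it to one of 2**n_bits
    integer codewords (codeword = round(sample / step))."""
    step = 2.0 * full_scale / (2 ** n_bits)   # quantization step size
    hi = full_scale - step                    # top of the clipping range
    return [round(max(-full_scale, min(hi, s)) / step) for s in samples]

def pcm_decode(codes, n_bits=8, full_scale=1.0):
    """Map integer codewords back to their reconstruction levels."""
    step = 2.0 * full_scale / (2 ** n_bits)
    return [c * step for c in codes]

# a 100-Hz tone sampled at 8 kHz, as in telephone-band speech
x = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(80)]
x_hat = pcm_decode(pcm_encode(x))

# quantization error is bounded by half a step (here 1/256)
print(max(abs(a - b) for a, b in zip(x, x_hat)) <= 1.0 / 256)  # → True
```

Each additional bit halves the step size, which is why the quantization noise of uniform PCM drops by about 6 dB per bit.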
These techniques gave rise to linear predictive coding (LPC), which has led to the creation of a new academic field. Various other complementary digital speech processing techniques have also been developed. In combination, these techniques have facilitated the realization of a wide range of systems operating on the principles of speech coding, speech
analysis-synthesis, speech synthesis, speech recognition, and speaker recognition. Books on speech information processing have already been published, and each has its own special features (Flanagan, 1972; Markel and Gray, 1976; Rabiner and Schafer, 1978; Saito and Nakata, 1985; Furui and Sondhi, 1992; Schroeder, 1999). The purpose of the present book is to explain the technologies essential to the speech researcher and to clarify and hopefully widen his or her understanding of speech by focusing on the most recent of the digital processing technologies. I hope that those readers planning to study and conduct research in the area of speech information processing will find this book useful as a reference or text. To those readers already extensively involved in speech research, I hope it will serve as a guidebook for sorting through the increasingly more sophisticated knowledge base forming around the technology and for gaining insight into expected future progress. I have tried to cite wherever possible the most important aspects of the speech information processing field, including the precise development of equations, by omitting what is now considered classic information. In such instances, I have recommended well-known reference books. Since understanding the intricate relationships between various aspects of digital speech processing technology is essential to speech researchers, I have attempted to maintain a sense of descriptive unity and to sufficiently describe the mutual relationships between the techniques involved. I have also tried to refer to as many notable papers as permissible to further broaden the reader's perspective. Due to space restrictions, however, several important research areas, such as noise reduction and echo cancellation, unfortunately could not be included in this book.
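As a small preview of the linear prediction idea mentioned above, the sketch below derives predictor coefficients from a signal's short-time autocorrelation using the Levinson-Durbin recursion. This is a standard textbook algorithm (treated in Chapter 5), not code reproduced from this book; the function name, test signal, and model order are illustrative assumptions.

```python
import math

def lpc_coefficients(x, order):
    """Levinson-Durbin recursion: find a[1..p] minimizing the error of the
    linear predictor x[n] ~ -(a[1]*x[n-1] + ... + a[p]*x[n-p]),
    starting from the signal's autocorrelation sequence."""
    n = len(x)
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]                                   # zeroth-order prediction error
    for i in range(1, order + 1):
        # reflection (PARCOR) coefficient for order i
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        a[1:i] = [a[j] + k * a[i - j] for j in range(1, i)]
        a[i] = k
        err *= 1.0 - k * k                       # updated prediction error
    return a, err

# A sampled sinusoid obeys x[n] = 2*cos(w)*x[n-1] - x[n-2], so a
# second-order predictor models it almost exactly.
w = 2 * math.pi * 0.05
x = [math.sin(w * n) for n in range(200)]
a, err = lpc_coefficients(x, 2)
print(a[1], a[2])  # a[1] close to -2*cos(w), a[2] close to 1, err near zero
```

For speech, the same recursion is applied frame by frame to a windowed segment, and the resulting coefficients compactly describe the short-time spectral envelope.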
Chapters 2, 3, and 4 explore the fundamental and principal elements of digital speech processing technology. Chapters 5 through 9 present the more important techniques as well as applications of LPC analysis, speech waveform coding, speech synthesis, speech recognition, and speaker recognition. The final chapter discusses future research problems. Several important concepts, terms, and mathematical relationships are precisely
explained in the appendixes. Since the design of this book relates the digital speech processing techniques to each other in developmental and precise terms as mentioned, the reader is urged to read each chapter of this book in the order presented.
Principal Characteristics of Speech
2.1 LINGUISTIC INFORMATION

The speech wave conveys several kinds of information, which consists principally of linguistic information that indicates the meaning the speaker wishes to impart, individual information representing who is speaking, and emotional information depicting the emotion of the speaker. Needless to say, the first informational type is the most important. Undeniably, the ability to acquire and produce language and to actually make and use tools are the two principal features that distinguish humans from other animals. Furthermore, language and cultural development are inseparable. Although written language is effective for exchanging knowledge and lasts longer than spoken language if properly preserved, the amount of information exchanged by speech is considerably larger. In more simplified terms, books, magazines, and the like are effective as one-way information transmission media, but are wholly unsuited to two-way communication. Human speech production begins with the initial conceptualization of an idea which the speaker wants to convey to a listener.
The speaker subsequently converts that idea into a linguistic structure by selecting the appropriate words or phrases which distinctly represent it, and then ordering them according to loose or rigid grammatical rules depending upon the speaker-listener relationship. Following these processes, the human brain produces motor nerve commands which move the various muscles of the vocal organs. This process is essentially divisible into two subprocesses: the physiological process involving nerves and muscles, and the physical process through which the speech wave is produced and propagated. The speech characteristics as physical phenomena are continuous, although language conveyed by speech is essentially composed of discretely coded units. A sentence is constructed using basic word units, with each word being composed of syllables, and each syllable being composed of phonemes, which, in turn, can be classified as vowels or consonants. Although the syllable itself is not well defined, one syllable is generally formed by the concatenation of one vowel and one to several consonants. The number of vowels and consonants varies, depending on the classification method and language involved. Roughly speaking, English has 12 vowels and 24 consonants, whereas Japanese has 5 vowels and 20 consonants. The number of phonemes in a language rarely exceeds 50. Since there are combination rules for building phonemes into syllables, the number of syllables in each language comprises only a fraction of all possible phoneme combinations. In contrast with the phoneme, which is the smallest speech unit from the linguistic or phonemic point of view, the physical unit of actual speech is referred to as the phone. The phoneme and phone are respectively indicated by phonemic and phonetic symbols, such as /a/ and [a]. As another example, the phones [ɛ] and [e], which correspond to the phonemes /ɛ/ and /e/ in French, correspond to the same phoneme /e/ in Japanese.
Although the number of words in each language is very large and new words are constantly added, the total number is much smaller than all of the syllable or phoneme combinations possible. It has been claimed that the number of frequently used words is
between 2000 and 3000, and that the number of words used by the average person lies between 5000 and 10,000. Stress and intonation also play critical roles in indicating the location of important words, in making interrogative sentences, and in conveying the emotion of the speaker.
2.2 SPEECH AND HEARING
Speech is uttered for the purpose of being, and on the assumption that it actually is, received and understood by the intended listeners. This obviously means that speech production is intrinsically related to hearing ability. The speech wave produced by the vocal organs is transmitted through the air to the ears of the listeners, as shown in Fig. 2.1. At the ear, it activates the hearing organs to produce nerve impulses which are transmitted to the listener's brain through the auditory nerve system. This permits the linguistic information which the speaker intends to convey to be readily understood by the listener.
"To_ I ker
,"""""
Listener
>
l""""""""" [Linguistic]
c Discrete-+
process
4
FIG.2.1 Speechchain.
[Physical]
[Phyriologicoi] (acoust ic) p ro ce ss
Continuous
I
I
L
J
" " " I " " " " "
]
Linguistic [process
--
+Discrete+
process [Physioiopicol
process
]
The same speech wave is naturally transmitted to the speaker's ears as well, allowing him to continuously control his vocal organs by receiving his own speech as feedback. The critical importance of this feedback mechanism is clearly apparent with people whose hearing has been disabled for more than a year or two. It is also evident in the fact that it is very hard to speak when our own speech is fed back to our ears with a certain amount of time delay (delayed feedback effect).

The intrinsic connection between speech production and hearing is called the speech chain (Denes and Pinson, 1963). In terms of production, the speech chain consists of the linguistic, physiological, and physical (acoustical) stages, the order of which is reversed for hearing.

The human hearing mechanism constitutes such a sophisticated capability that, at this point in time anyway, it cannot be closely imitated by artificial or computational means. One advantage of this hearing capability is selective listening, which permits the listener to hear only one voice even when several people are speaking simultaneously, and even when the voice a person wants to hear is spoken indistinctly, with a strong dialectal accent, or with strong voice individuality. On the other hand, the human hearing mechanism also exhibits some very limited capabilities. One example of its inherent disadvantages is that the ear cannot separate two tones that are similar in frequency or that have a very short time interval between them. Another negative aspect is that when two tones exist at the same time, one cannot be heard since it is masked by the other.

The sophisticated hearing capability noted is supported by the complex language understanding mechanism controlled by the brain, which employs various context information in executing the mental processes concerned. The interrelationships between these mechanisms thus allow people to effectively communicate with each other.
Although research into speech processing has thus far been undertaken without a detailed consideration of the concept of hearing, it is vital to connect any future speech research to the hearing mechanism, inclusive of the realm of language perception.
2.3 SPEECH PRODUCTION MECHANISM
The speech production process involves three subprocesses: source generation, articulation, and radiation. The human vocal organ complex consists of the lungs, trachea, larynx, pharynx, and nasal and oral cavities. Together these form a connected tube, as indicated in Fig. 2.2. The upper portion beginning with the larynx is called the vocal tract, which is changeable into various shapes by moving the jaw, tongue, lips, and other internal parts. The nasal cavity is separated from the pharynx and oral cavity by raising the velum, or soft palate.

FIG. 2.2 Schematic diagram of the human vocal mechanism (soft palate, vocal tract, pharynx, larynx, esophagus).

When the abdominal muscles force the diaphragm up, air is pushed up and out from the lungs, with the airflow passing through the trachea and glottis into the larynx. The glottis, or the gap between the left and right vocal cords, which is usually open during breathing, becomes narrower when the speaker intends to produce sound. The airflow through the glottis is then periodically interrupted by the opening and closing of the gap in accordance with the interaction between the airflow and the vocal cords. This intermittent flow, called the glottal source or the source of speech, can be simulated by asymmetrical triangular waves.

The mechanism of vocal cord vibration is actually very complicated. In principle, however, the Bernoulli effect associated with the airflow and the stability produced by the elasticity of the muscles draw the vocal cords toward each other. When the vocal cords are strongly strained and the pressure of the air rising from the lungs (subglottal air pressure) is high, the open-and-close period (that is, the vocal cord vibration period) becomes short and the pitch of the sound source becomes high. Conversely, the low-air-pressure condition produces a lower-pitched sound. This vocal cord vibration period is called the fundamental period, and its reciprocal is called the fundamental frequency. Accent and intonation result from temporal variation of the fundamental period.

The sound source, consisting of fundamental and harmonic components, is modified by the vocal tract to produce tonal qualities, such as /a/ and /o/, in vowel production. During vowel production, the vocal tract is maintained in a relatively stable configuration throughout the utterance.

Two other mechanisms are responsible for changing the airflow from the lungs into speech sound. These are the mechanisms underlying the production of two kinds of consonants: fricatives and plosives.
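The asymmetric triangular source wave described above can be sketched numerically. The following is a minimal illustration, not a model from the book: the 60/40 split between the opening and closing phases, and the default fundamental frequency and sampling rate, are arbitrary assumptions.

```python
import numpy as np

def glottal_source(f0=125.0, fs=8000, n_periods=3, open_frac=0.6):
    """Asymmetric triangular pulse train approximating the glottal source.

    f0        : fundamental frequency (reciprocal of the fundamental period)
    open_frac : fraction of each period spent in the slow opening phase
    """
    period = int(fs / f0)                  # fundamental period in samples
    rise = int(period * open_frac)         # gradual opening of the glottis
    fall = period - rise                   # faster closure
    one_period = np.concatenate([
        np.linspace(0.0, 1.0, rise, endpoint=False),   # airflow builds up
        np.linspace(1.0, 0.0, fall, endpoint=False),   # abrupt return to zero
    ])
    return np.tile(one_period, n_periods)

src = glottal_source()
```

Raising f0 shortens the period of the pulse train, which is exactly the pitch-raising mechanism attributed above to increased vocal cord tension and subglottal air pressure.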
Fricatives, such as /s/ and /f/, are noiselike sounds produced by turbulent flow which occurs when the airflow passes through a constriction in the vocal tract made by the tongue or lips. The tonal difference of each fricative corresponds to a fairly precisely located constriction and vocal tract shape. Plosives (stop consonants), such as /p/, /t/, and /k/, are impulsive sounds which occur with the sudden release of high-pressure air produced by checking the airflow in the vocal tract, again by using the tongue or lips. The tonal difference corresponds to the difference between the checking position and the vocal tract shape.

The production of these consonants is wholly independent of vocal cord vibration. Consonants which are accompanied by vocal cord vibration are known as voiced consonants, and those which are not accompanied by this vibration are called unvoiced consonants. The sounds emitted with vocal cord vibration are referred to as voiced sounds, and those without are named unvoiced sounds. Aspiration or whispering is produced when a turbulent flow is made at the glottis by slightly opening the vocal cords so that vocal cord vibration is not produced.

Semivowel, nasal, and affricate sounds are also included in the family of consonants. Semivowels are produced in a similar way as vowels, but their physical properties gradually change without a steady utterance period. Although semivowels are included in consonants, they are accompanied by neither turbulent airflow nor pulselike sound, since the vocal tract constriction is loose and vocal organ movement is relatively slow.

In the production of nasal sounds, the nasal cavity becomes an extended branch of the oral cavity, with the airflow being supplied to the nasal cavity by lowering the velum and arresting the airflow at some particular place in the oral cavity. When the nasal cavity forms a part of the vocal tract together with the oral cavity during vowel production, the vowel quality acquires nasalization and produces the nasalized vowel.

Affricates are produced by the succession of plosive and fricative sounds while maintaining a close constriction at the same position.

Adjusting the vocal tract shape to produce various linguistic sounds is called articulation, while the movement of each part in the vocal tract is known as articulatory movement.
The parts of the vocal tract used for articulation are called articulatory organs, and those which can actively move, such as the tongue, lips, and velum, are named articulators.
The difference between articulatory methods for producing fricatives, plosives, nasals, and so on, is termed the manner of articulation. The constriction place in the vocal tract produced by articulatory movement is designated as the place of articulation. Various tone qualities are produced by varying the vocal tract shape, which changes the transmission characteristics (that is, the resonance characteristics) of the vocal tract.

Speech sounds can be classified according to the combination of source and vocal tract (articulatory organ) resonance characteristics based on the production mechanism described above. The consonants and vowels of English are classified in Table 2.1 and Fig. 2.3, respectively.

TABLE 2.1 Classification of English consonants by source (voiced/unvoiced), manner of articulation (fricatives, affricates, plosives, semivowels, nasals), and place of articulation (labial, dental, alveolar, palatal, glottal).

The horizontal lines in Fig. 2.3 indicate the approximate location of the vocal tract constriction in the representation: the more to the left it is, the closer to the front (near the lips) is the constriction. The vertical lines indicate the degree of constriction, which corresponds to the jaw opening position; the lowest line in the figure indicates maximum jaw opening. These two conditions, in conjunction with lip rounding, represent the basic characteristics of vowel articulation. Each of the vowel pairs located side by side in the figure indicates a pair in which only the articulation of the lips is different: the left one does not involve lip rounding, whereas the right one is produced by rounding the lips.

FIG. 2.3 Vowel classification from approximate vocal organ representation (tongue hump position: front, central, back; degree of constriction: high to low).

This lip rounding rarely happens for vowels produced by extended jaw opening. The phoneme [ə] is called a neutral vowel, since the tongue and lips for producing this vowel are in the most neutral position; hence, the vocal tract shape is similar to a homogeneous tube having a constant cross section. Relatively simple vowel structures, such as that of the Japanese language, are constructed of those vowels located along the exterior of the figure. These exterior vowels consist of [i, e, ɛ, a, ɑ, ɒ, ɔ, o, u, ɯ]. This means that the back tongue vowels tend to feature lip rounding, while the front tongue vowels exhibit no such tendency.

Gliding monosyllabic speech sounds produced by varying the vocal tract smoothly between vowel or semivowel configurations are referred to as diphthongs. There are six diphthongs in American English, /ey/, /ow/, /ay/, /aw/, /oy/, and /ju/, but there are none in Japanese.

The articulated speech wave with linguistic information is radiated from the lips into the air and diffused. In nasalized sound, the speech wave is also radiated from the nostrils.
2.4 ACOUSTIC CHARACTERISTICS OF SPEECH
Figure 2.4 represents the speech wave, short-time averaged energy, short-time spectral variation (Furui, 1986), fundamental frequency (modified correlation functions; see Sec. 5.4), and sound spectrogram for the Japanese phrase /tʃo:seN naNbuni/, or 'in the southern part of Korea,' uttered by a male speaker. The sound spectrogram, the details of which will be described in Sec. 4.2.4, visually presents the light and dark time pattern of the frequency spectrum. The dark parts indicate the spectral components having high energy, and the vertical stripes correspond to the fundamental period.

This figure shows that the speech wave and spectrum vary as nonstationary processes in periods of 1/2 s or longer. In appropriately divided periods of 20-40 ms, however, the speech wave and spectrum can be regarded as having constant characteristics. The vertical lines in Fig. 2.4 indicate these boundaries. The segmentation was done automatically based on the amount of short-time spectral variation. During the periods of /tʃ/ or /s/ unvoiced consonant production, the speech waves show random waves with small amplitudes, and the spectra show random patterns. On the other hand, during the production periods of voiced sounds, such as those with /i/, /e/, /a/, /o/, /u/, /N/, the speech waves present periodic waves having large amplitudes, with the spectra indicating relatively global iterations of light and dark patterns. The dynamic range of the speech wave amplitude is so large that the amplitude difference between the unvoiced sounds having smaller amplitudes and the voiced sounds having larger amplitudes sometimes exceeds 30 dB.

The dominant frequency components which characterize the phonemes, corresponding to the resonant frequency components of the vowels, generally comprise three formants, which are called the first, second, and third formants, beginning with the lowest-frequency component. They are usually written as F1, F2, and F3.
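Frame-based short-time measures such as the averaged energy can be sketched as follows. This is an illustrative computation, not the book's procedure; the 25 ms frame and 10 ms shift are common but assumed values (the text itself only says 20-40 ms), and the test signal is synthetic.

```python
import numpy as np

def short_time_energy(x, fs, frame_ms=25, shift_ms=10):
    """Short-time averaged energy (in dB) over sliding frames."""
    flen = int(fs * frame_ms / 1000)       # frame length in samples
    shift = int(fs * shift_ms / 1000)      # frame shift in samples
    frames = [x[i:i + flen] for i in range(0, len(x) - flen + 1, shift)]
    return np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])

fs = 8000
t = np.arange(fs) / fs                        # 1 s of signal
x = np.sin(2 * np.pi * 200 * t) * (t < 0.5)   # "voiced" first half, silence after
e = short_time_energy(x, fs)
```

The resulting energy contour drops sharply at the voiced-to-silence boundary, which is the kind of cue the automatic segmentation described above exploits.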
FIG. 2.4 Speech wave, short-time averaged energy, short-time spectral variation, fundamental frequency, and sound spectrogram (from top to bottom) for the Japanese sentence /tʃo:seN naNbuni/.

Even for the same phoneme, however, these formant frequencies largely vary, depending on the speaker. Furthermore, the formant frequencies vary depending on the adjacent phonemes in continuously spoken utterances, such as those emitted during conversation.

The overlapping of phonetic features from phoneme to phoneme is termed coarticulation. Each phoneme can be considered as a target at which the vocal organs aim but never reach. As soon as the target has been approached nearly enough to be intelligible to the listener, the organs change their destinations and start to head for a new target. This is done to minimize the effort expended in speaking and makes for greater fluency. The phenomenon of coarticulation adds to the problems of speech synthesis and recognition. Since speech in which coarticulation does not occur sounds unnatural to our ears, for high-quality synthesis we must include an appropriate degree of coarticulation. In recognition, coarticulation means that the features of isolated phonemes are never found in connected syllables; hence, any recognition system based on identifying phonemes must necessarily correct for contextual influences.

Examples of the relationship between vocal tract shapes and vowel spectral envelopes are presented in Fig. 2.5 (Stevens et al., 1986). Fronting or backing of the tongue body while maintaining approximately the same tongue height causes a raising or lowering of F2, with the effect on the overall spectral shape accordingly produced as shown. As is clear, F2 approaches F1 for back vowels and F3 for front vowels. A further lowering of F2 can be achieved by rounding the lips, as illustrated in Fig. 2.5(c).

The basic acoustic characteristics of vowel formants can be characterized by F1 and F2. Figure 2.6 is a scatter diagram of formant frequencies of the isolatedly spoken five Japanese vowels on the F1-F2 plane, the horizontal and vertical axes of which correspond to the first- and second-formant frequencies, F1 and F2, respectively (Sato, 1975). This figure indicates the distributions for 30 male and 30 female speakers as well as the mean and standard deviation values for these speakers.
The five vowels are typically distributed in a triangular shape as shown in this figure, which is sometimes called the vowel triangle. For comparative purposes, Fig. 2.7 presents the scatter diagram of formant frequencies of 10 English vowels uttered by 76 speakers (33 adult males, 28 adult females, and 15 children) on the F1-F2 plane (Peterson and Barney, 1952).

FIG. 2.5 Examples of the relationship between vocal tract shapes and vowel spectral envelopes: (a) schematization of the mid-sagittal section of the vocal tract for a neutral vowel (solid contour), and for back and front tongue-body positions; (b) idealized spectral envelopes corresponding to the three tongue-body configurations in (a); (c) approximate effect of lip rounding on the spectral envelope for a back vowel.

FIG. 2.6 Scatter diagram of formant frequencies of five Japanese vowels uttered by 60 speakers (30 males and 30 females) in the F1-F2 plane.

FIG. 2.7 Scatter diagram of formant frequencies of 10 English vowels uttered by 76 speakers (33 adult males, 28 adult females, and 15 children) in the F1-F2 plane.

The distribution of the vowels extracted from continuous speech generally indicates an overlap between different vowels. The variation owing to the speakers and their ages, however, can be approximated by a parallel shift in the logarithmic frequency plane, in other words, by a proportional change in the linear frequency, which can be seen in the male and female voice comparison in Fig. 2.6. Hence, this overlapping of different vowels can be considerably reduced when the distribution is examined in a three-dimensional space formed by adding the third formant, which characterizes the individuality of the voice. The higher-order formants indicate a smaller variation depending on the vowels uttered. Therefore, a higher-order formant has a peculiar value for each speaker corresponding to his or her vocal tract length.

Although difficult, measuring formant bandwidths has been attempted by many researchers. The extracted values range from 30 to 120 Hz (mean 50 Hz) for F1, 30 to 200 Hz (mean 60 Hz) for F2, and 40 to 300 Hz (mean 115 Hz) for F3. Variation in bandwidth has little influence on the quality of the speech heard.
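The F1-F2 characterization of vowels suggests a toy classifier. The centroid values below are rough, illustrative ballpark numbers for a male voice and are not read off Fig. 2.6; the nearest-centroid rule and the use of a log-frequency distance are likewise assumptions, motivated by the observation in the text that inter-speaker variation is roughly a parallel shift on a logarithmic frequency scale.

```python
import numpy as np

# Rough, illustrative F1/F2 centroids (Hz) for the five Japanese vowels;
# these are assumed ballpark values, NOT measurements from Fig. 2.6.
CENTROIDS = {
    "i": (280, 2250), "e": (450, 1900), "a": (750, 1200),
    "o": (500, 850),  "u": (320, 1300),
}

def classify_vowel(f1, f2):
    """Nearest-centroid vowel classification in the F1-F2 plane,
    with distances taken on a log-frequency scale."""
    p = np.log([f1, f2])
    return min(CENTROIDS, key=lambda v: np.sum((p - np.log(CENTROIDS[v])) ** 2))

label = classify_vowel(300, 2100)   # a point near the /i/ corner of the triangle
```

Because distances are measured in log frequency, a proportional change of both formants moves a point by a constant offset, which makes the comparison somewhat less sensitive to overall speaker scaling than a linear-frequency distance would be.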
Consonants are classified by the periodicity of waves (voiced/unvoiced), frequency spectrum, duration, and temporal variation. The acoustic characteristics of the consonants largely vary as the result of coarticulation with vowels, since the consonants originally have no stable or steady-state period. Especially with rapid speech, articulation of the phoneme which follows, that is, tongue and lip movement toward the articulation place of the following phoneme, starts before completion of articulation of the phoneme being presently uttered.

Coarticulation sometimes affects phonemes located beyond adjacent phonemes. Furthermore, since various articulatory organs participate in actual speech production, and since each organ has its own time constant of movement, the acoustic phenomena resulting from these movements are highly complicated. Hence, it is very difficult to obtain a one-to-one correspondence between phonemic symbols and acoustic characteristics.

Under these circumstances, the focus has been on examining ways to specify each phoneme by combining relatively simple features instead of on determining the specific acoustic features of each phoneme (Jakobson et al., 1963). These features thus far formalized, which are called distinctive features, consist of the binary representation of nine descriptive pairs: vocalic/nonvocalic, consonantal/nonconsonantal, compact/diffuse, grave/acute, flat/plain, nasal/oral, tense/lax, continuant/interrupted, and strident/mellow. Since the selection of these features has been based mainly on auditory rather than articulatory characteristics, many of them are qualitative, having weak correspondence to physical characteristics. Therefore, considerable room still remains in their final clarification.
2.5 STATISTICAL CHARACTERISTICS OF SPEECH

2.5.1 Distribution of Amplitude Level
Figure 2.8 shows accumulated distributions of the speech amplitude level calculated for utterances by 80 speakers (4 speakers × 20 languages) having a duration of roughly 37 minutes (Irii et al., 1987). The horizontal axis, specifically the amplitude level, is normalized by the long-term effective value, or root mean square (rms) value. The vertical axis indicates the frequency of amplitude accumulated from large values, in other words, the frequency of amplitude values larger than the indicated value. These results clearly confirm that the dynamic range of speech amplitude exceeds 50 dB.

FIG. 2.8 Accumulated distribution of speech amplitude level calculated for utterances made by 80 speakers having a duration of roughly 37 min.

The difference between the amplitude level at which the accumulated value amounts to 1% and the long-term effective value is called the peak factor, because it relates to the sharpness of the wave. The speech and sinusoidal wave peak factors are about 12 dB and 3 dB, respectively, indicating that the speech wave is much higher in sharpness.

The derivative of the accumulated distribution curve corresponds to the amplitude density distribution function. The results derived from Fig. 2.8 are presented in Fig. 2.9 (Irii et al., 1987). The distribution can be approximated by an exponential distribution:

p(x) = (1 / (√2 σ)) exp(−√2 |x| / σ)

Here, σ is the effective value (σ² corresponds to the mean energy).

FIG. 2.9 Amplitude density distribution function derived from Fig. 2.8.

The distribution of the long-term effective speech level over many speakers is regarded as being the normal distribution for both males and females. The standard deviation for these distributions is roughly 3.8 dB, and the mean value for male voices is roughly 4.5 dB higher than that for female voices. The long-term effective value under the high-noise-level condition is usually raised according to that noise level.

FIG. 2.10 Long-time averaged speech spectrum calculated for utterances made by 80 speakers.
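The 1% peak factor can be checked numerically. The sketch below is illustrative, not from the book: it draws samples from a two-sided exponential (Laplacian) distribution as a crude stand-in for speech. Note that this simple model yields a peak factor near 10 dB rather than the 12 dB measured for real speech, one sign that the exponential form is only an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

def peak_factor_db(x, pct=1.0):
    """Peak factor: the level exceeded pct% of the time, relative to the rms value."""
    rms = np.sqrt(np.mean(x ** 2))
    level = np.percentile(np.abs(x), 100 - pct)
    return 20 * np.log10(level / rms)

# Laplacian (two-sided exponential) samples as a crude stand-in for speech
speech_like = rng.laplace(0.0, 1.0, 200_000)
# A pure sinusoid, sampled over many full periods
sine = np.sin(2 * np.pi * np.arange(100_000) / 100.0)

pf_speech = peak_factor_db(speech_like)   # near 10 dB for this simple model
pf_sine = peak_factor_db(sine)            # near the quoted 3 dB
```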
2.5.2 Long-Time Averaged Spectrum
Figure 2.10 shows the long-time averaged speech spectra extracted using 20 channels of one-third-octave bandpass filters which cover the 0-9 kHz frequency range (Irii et al., 1987). These results were also obtained using the utterances made by 80 speakers of 20 languages. As is clear, only a slight difference exists between male and female speakers, except for the low-frequency range, where the spectrum is affected by the variation in fundamental frequency. The difference is also noticeably very small between languages.

Based on these results, the typical speech spectrum shape is represented by the combination of a flat spectrum and a spectrum having a slope of −10 dB/octave (oct). The former is applied to the frequency range lower than 500 Hz, while the latter is applied to that higher than 500 Hz. Although the long-time averaged spectra calculated through the above-mentioned method demonstrate only slight differences between speakers, those calculated with high frequency resolution definitely feature individual differences (Furui, 1972).

2.5.3 Variation in Fundamental Frequency
Statistical analysis of temporal variation in fundamental frequency during conversational speech for every speaker indicates that the mean and standard deviation for female voices are roughly twice those for male voices, as shown in Fig. 2.11 (Saito et al., 1958). The fundamental frequency distributed over speakers on a logarithmic frequency scale can be approximated by two normal distribution functions which correspond to male and female voices, respectively, as shown in Fig. 2.12. The mean and standard deviation for male voices are 125 and 20.5 Hz, respectively, whereas those for female voices are two times larger. Intraspeaker variation is roughly 20% smaller than interspeaker variation.

Analysis of the temporal transition distribution in the fundamental frequency indicates that roughly 18% of these are ascending and roughly 50% are descending. Frequency analysis of the temporal pattern of the fundamental frequency, in which the silent period is smoothly connected, shows that the frequency of the temporal variation is less than 10 Hz. This implies that the speed of the temporal variation in the fundamental frequency is relatively slow.
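The two-population model of Fig. 2.12 can be sketched directly from the parameters quoted in the text (male mean 125 Hz, standard deviation 20.5 Hz, with female values twice as large); the sample sizes and the use of base-2 logarithms (octaves) are arbitrary choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameters quoted in the text: male mean 125 Hz, s.d. 20.5 Hz;
# female values are twice the male ones.
MALE_MEAN, MALE_SD = 125.0, 20.5
FEMALE_MEAN, FEMALE_SD = 2 * MALE_MEAN, 2 * MALE_SD

male_f0 = rng.normal(MALE_MEAN, MALE_SD, 10_000)
female_f0 = rng.normal(FEMALE_MEAN, FEMALE_SD, 10_000)

# Doubling both the mean and the s.d. is (approximately) a constant shift on a
# log-frequency axis, so the two populations form two similar bumps one octave apart.
log_shift = np.mean(np.log2(female_f0)) - np.mean(np.log2(male_f0))
```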
FIG. 2.11 Mean and standard deviation of temporal variation in fundamental frequency during conversational speech for various speakers.

FIG. 2.12 Fundamental frequency distribution over speakers.
2.5.4 Speech Ratio
Conversational speech includes speech as well as pause periods, and the proportion of actual speech periods is referred to as the speech ratio. In conversational speech, the speech ratio for each speaker changes, of course, as a function of the speech rate. An experiment in which the speech rate was increased and decreased by 30-40% indicated that the expansion or contraction at pause periods becomes 65-69%, although during the speech periods it is 13-19% (Saito, 1961). This means that the variation in the speech rate is mainly accomplished by changing the pause periods. Moreover, expansion or contraction during vowel periods is generally larger than that during consonant periods.
3 Speech Production Models
3.1 ACOUSTICAL THEORY OF SPEECH PRODUCTION
As described in Sec. 2.3, the speech wave production mechanism can be divided into three stages: sound source production, articulation by the vocal tract, and radiation from the lips and/or nostrils (Fant, 1960). These stages can be further characterized by electrical equivalent circuits based on the relationship between electrical and acoustical systems.

Specifically, sound sources are either voiced or unvoiced. A voiced sound source can be modeled by a generator of pulses or asymmetrical triangular waves which are repeated at every fundamental period. The peak value of the source wave corresponds to the loudness of the voice. An unvoiced sound source, on the other hand, can be modeled by a white noise generator, the mean energy of which corresponds to the loudness of the voice. Articulation can be modeled by the cascade or parallel connection of several single-resonance or antiresonance circuits, which can be realized through a multistage digital filter. Finally, radiation can be modeled as arising from a piston sound source attached to an infinite, plane baffle. Here, the radiation impedance is represented by an L-r cascade circuit, where r is the energy loss occurring through the radiation.

The speech production process can accordingly be characterized by combining these electrical equivalent circuits, as indicated in Table 3.1.

TABLE 3.1 Speech production process models: the vowel type (voiced source → vocal tract → radiation) has a resonance-only (all-pole) system function; the consonant type (unvoiced source → vocal tract → radiation) and the nasal type (nasals and nasalized vowels; voiced source → branched vocal tract → radiation) have resonance-antiresonance (pole-zero) system functions.

The resonance characteristics depend on the vocal tract shape only, and not on the location of the sound source, during both vowel-type and consonant-type production. Conversely, the antiresonance characteristics during consonant-type production depend primarily on the antiresonance characteristics of the vocal tract between the glottis and the sound source position. The resonance and antiresonance effects are usually canceled in the low-frequency range, since these locations almost exactly coincide.

Resonance characteristics for the branched vocal tract, such as those for nasal-type production, are conditioned by the oral cavity characteristics forward and backward from the velum and by the nasal tract characteristics from the velum to the nostrils. The antiresonance characteristics of nasalized consonants (nasal sounds) are determined by the forward characteristics of the oral cavity starting from the velum. On the other hand, the antiresonance characteristics of nasalized vowels depend on the nasal tract characteristics starting from the velum. Figure 3.1 exemplifies the spectral change caused by the nasalization of the vowel /a/.

FIG. 3.1 An example of spectral change caused by the nasalization of the vowel /a/. It is characterized by pole-zero pairs at 300-400 Hz and at around 2500 Hz. F1, F2, F3 are formants.
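The cascade of single-resonance circuits realized "through a multistage digital filter" can be sketched with second-order all-pole resonators. This is an illustrative sketch, not the book's formulation; the formant frequencies (700 and 1200 Hz), the bandwidths, and the sampling rate are assumed values chosen to be roughly in the ranges quoted for vowels in Chapter 2.

```python
import numpy as np

def resonator_coeffs(f_hz, bw_hz, fs):
    """Second-order all-pole resonator H(z) = 1 / (1 - 2r cos(t) z^-1 + r^2 z^-2)."""
    r = np.exp(-np.pi * bw_hz / fs)        # pole radius sets the bandwidth
    theta = 2 * np.pi * f_hz / fs          # pole angle sets the resonance frequency
    return 2 * r * np.cos(theta), -r * r   # feedback coefficients a1, a2

def run_cascade(x, formants, fs):
    """Filter x through a cascade of resonators, one per (freq, bandwidth) pair."""
    y = np.array(x, dtype=float)
    for f, bw in formants:
        a1, a2 = resonator_coeffs(f, bw, fs)
        out = np.zeros_like(y)
        for n in range(len(y)):            # direct-form recursion
            out[n] = y[n] + a1 * (out[n - 1] if n > 0 else 0.0) \
                          + a2 * (out[n - 2] if n > 1 else 0.0)
        y = out
    return y

fs = 8000
impulse = np.zeros(400)
impulse[0] = 1.0
# Illustrative /a/-like formants (F1, F2) with assumed bandwidths
y = run_cascade(impulse, [(700, 60), (1200, 80)], fs)
spectrum = np.abs(np.fft.rfft(y, 1024))
```

Exciting the same cascade with a triangular pulse train instead of a single impulse would correspond to the voiced, vowel-type row of Table 3.1.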
When the radiation characteristics are approximated by the above-mentioned model, the normalized radiation impedance for the unit plane's free vibration can be represented by

Z_r = (ka)² / 2 + j · 8ka / (3π)

where k is the wave number and a is the radius of the piston.
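The small-ka piston-in-baffle approximation, Zr = (ka)²/2 + j·8ka/(3π), can be evaluated directly. The sketch below is illustrative; the 9 mm lip-opening radius and the 343 m/s sound speed are assumed values, not figures from the book.

```python
import math

def radiation_impedance(freq_hz, radius_m, c=343.0):
    """Normalized piston-in-infinite-baffle radiation impedance
    (small-ka approximation): Zr = (ka)^2/2 + j*8ka/(3*pi)."""
    k = 2 * math.pi * freq_hz / c          # wave number
    ka = k * radius_m
    return complex(ka ** 2 / 2, 8 * ka / (3 * math.pi))

# At low frequencies the reactive (mass-like) part dominates the resistive part
z = radiation_impedance(500.0, 0.009)      # 9 mm lip-opening radius (assumed)
```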
Based on these definitions, the linear predictive model using the linear predictor filter F(z) and inverse filter A(z) can be block diagrammed as in Fig. 5.1. The LPC analysis, that is, the process of applying the linear predictive model to the speech wave, minimizes the output power σ² by adjusting the coefficients {a_i} of either the linear predictor filter or the inverse filter.

FIG. 5.1 Linear prediction model block diagram.

Based on the linear separable equivalent circuit model of the speech production mechanism (Sec. 3.2), the speech wave is regarded as the output of the vocal tract articulation equivalent filter excited by a vocal source impulse. The characteristics of the equivalent filter, which include the overall spectral characteristics of the vocal cords as well as the radiation characteristics, can be assumed to be passive and linear. The speech wave is then considered to be the impulse response of the equivalent filter, and, therefore, the equivalent filter characteristics can be theoretically obtained as the solution of the linear differential equation. Accordingly, the speech wave can be predicted, and the speech spectral characteristics can be extracted by the linear predictor coefficients.

Although linear predictive analysis is based on these assumptions, they actually vary and do not hold completely. This is because the vocal tract shape is temporally changing, slowly of course, and because the vocal source is not a single impulse but rather the iteration of impulses or triangular waves accompanied by noise sources.

5.2 LPC ANALYSIS PROCEDURE
Let us here consider the method for estimating the linear predictor coefficients {a_i} by applying the least mean square error method to Eq. (5.3). Specifically, let us determine the coefficients {a_i} (i = 1, ..., p) so that the squared sum of the error ε_t between the sample values x_t and the linearly predicted values x̂_t over a predetermined period [t_0, t_1] is minimized. The total squared error β is

β = Σ_{t=t_0}^{t_1} ε_t² = Σ_{t=t_0}^{t_1} ( Σ_{i=0}^{p} a_i x_{t−i} )²

where a_0 = 1. Defining

c_ij = Σ_{t=t_0}^{t_1} x_{t−i} x_{t−j}     (5.10)

β can then be equivalently written as

β = Σ_{i=0}^{p} Σ_{j=0}^{p} a_i a_j c_ij     (5.11)

Minimization of β is obtained by setting to zero the partial derivative of β with respect to a_j (j = 1, 2, ..., p) and solving. Therefore, from Eq. (5.11),

Σ_{i=1}^{p} a_i c_ij = −c_0j,   j = 1, 2, ..., p     (5.12)

The predictor coefficients {a_i} can be obtained by solving this set of p linear simultaneous equations. The known parameters c_ij (i = 0, 1, 2, ..., p; j = 1, 2, ..., p) are defined from the sample data by Eq. (5.10), which shows that the samples from t_0 − p to t_1 are essential to the solution.

For the actual solution based on a sequence of N speech samples, {x_t} = {x_0, x_1, ..., x_{N−1}}, two specific cases have been investigated in detail. These are referred to as the covariance method and the autocorrelation method.

The covariance method is defined by setting t_0 = p and t_1 = N − 1 so that the error is minimized only over the interval [p, N − 1], whereas all the N speech samples are used in calculating the covariance matrix elements c_ij (Atal and Hanauer, 1971). Accordingly, Eq. (5.12) is solved using

c_ij = Σ_{t=p}^{N−1} x_{t−i} x_{t−j}     (5.13)
Thecovariancemethoddrawsitsnamefromthefact that cij represents the row i, column j element of a covariance matrix. The autocorrelation methodis defined by setting to = -00 and t J = 00, and by letting xt = 0 for t < 0 and t 2 N (Markel, 1972). These limits allow cij to be simplified as
Chapter 5
$$c_{ij} = \sum_{t=-\infty}^{\infty} x_{t-i}\, x_{t-j} = r_{|i-j|} \qquad (5.14)$$

Thus, $a_i$ is obtained by solving

$$\sum_{i=1}^{p} a_i\, r_{|i-j|} = -r_j \qquad (j = 1, 2, \ldots, p) \qquad (5.15)$$

where

$$r_\tau = \sum_{t=0}^{N-1-\tau} x_t\, x_{t+\tau} \qquad (5.16)$$
Although the error $\varepsilon_t$ is minimized over an infinite interval, equivalent results are obtained by minimizing it only over $[0, N-1]$. This is because $x_t$ is truncated to zero for $t < 0$ and $t \geq N$ by being multiplied by a finite-length window, such as a Hamming window. The autocorrelation method is so named from the fact that, for the conditions stated, $c_{ij}$ reduces to the definition of the short-term autocorrelation $r_\tau$ at the delay $\tau = |i - j|$. Equation (5.15) can be expressed in matrix representation as
$$\begin{bmatrix} r_0 & r_1 & \cdots & r_{p-1} \\ r_1 & r_0 & \cdots & r_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & \cdots & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = - \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix} \qquad (5.17)$$
The $p \times p$ correlation matrix of the left-hand side has the form of a Toeplitz matrix, which is symmetrical and has the same values along the lines parallel to the diagonal. This type of equation is called a normal equation or a Yule-Walker equation. Since the positive definiteness of the correlation matrix is guaranteed by the definition of the correlation function, an inverse matrix exists for the correlation matrix. Solving the equation then permits $\{a_i\}$ to be obtained. On the other hand, the positive definiteness of the coefficient matrix is not necessarily guaranteed in the covariance method. The equations for the covariance and correlation methods can be efficiently solved by the Cholesky decomposition method and by Durbin's recursive solution method, respectively. Durbin's method is equivalent to the PARCOR (partial autocorrelation) coefficient extraction process, which will be presented later in Sec. 5.6. Although the covariance and autocorrelation methods give almost the same results when $\{x_t\}$ is long ($N \gg 1$) and stationary, their results differ when $\{x_t\}$ is short and has temporal variations. The numbers of multiplications and divisions in Durbin's method are $p^2$ and $p$, whereas the numbers of multiplications, divisions, and square root calculations in the Cholesky decomposition are $(p^3 + 9p^2 + 2p)/6$, $p$, and $p$. Assuming that $p = 10$, the former method is computationally three times more efficient than the latter. In linear system identification in modern control theory, the process exemplified by Eq. (5.1) is called the autoregressive (AR) process, in which $\varepsilon_t$ and $x_t$ are the system input and output, respectively. This system is also referred to as the all-pole model since it has an all-pole system function.
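A minimal sketch of Durbin's recursion for the Toeplitz system (5.17) follows; the sign convention matches Eq. (5.1), where the prediction error is $\varepsilon_t = x_t + \sum_{i=1}^{p} a_i x_{t-i}$ (variable names are illustrative, not from the text):

```python
def levinson_durbin(r, p):
    """Solve the Yule-Walker normal equations (5.17) by Durbin's recursion.

    r : autocorrelation values r[0..p]
    Returns predictor coefficients a[1..p], the PARCOR coefficients k[1..p]
    produced along the way, and the final prediction error power.
    """
    a = [0.0] * (p + 1)               # a[0] is implicitly 1
    k = [0.0] * (p + 1)
    err = r[0]                        # zeroth-order prediction error power
    for m in range(1, p + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k[m] = acc / err              # PARCOR coefficient k_m
        a_new = a[:]
        a_new[m] = -k[m]
        for i in range(1, m):         # order update of the predictor
            a_new[i] = a[i] - k[m] * a[m - i]
        a = a_new
        err *= 1.0 - k[m] * k[m]      # error power shrinks by (1 - k_m^2)
    return a[1:], k[1:], err

a, k, err = levinson_durbin([1.0, 0.5, 0.25], 2)
```

The recursion costs $O(p^2)$ operations, the count quoted above for Durbin's method, versus the $O(p^3)$ Cholesky factorization needed for the covariance method's non-Toeplitz matrix.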
5.3 MAXIMUM LIKELIHOOD SPECTRAL ESTIMATION

5.3.1 Formulation of Maximum Likelihood Spectral Estimation
Maximum likelihood estimation is the method used to estimate parameters which maximize the likelihood based on the observed
values. Here, the likelihood is the probability of occurrence of the actual observations (the speech samples) under the presumed parameter condition. The maximum likelihood method is better than any other estimation method in the sense that the variance of the estimated value is minimized when the sample size is sufficiently large. In order to accomplish maximum likelihood spectral estimation, let us make two assumptions for the speech wave (Itakura and Saito, 1968):

1. The sample value $x_t$ can be regarded as a sample derived from a stationary Gaussian process characterized by the power spectral density $\hat{f}(\lambda)$. (Here, $\lambda = \omega \Delta T$ is the normalized angular frequency; i.e., $\lambda = \pi$ corresponds to the frequency $W$.)

2. The spectral density $\hat{f}(\lambda)$ is represented by an all-pole polynomial spectral density function of the form
$$\hat{f}(\lambda) = \frac{\sigma^2}{2\pi} \cdot \frac{1}{A_0 + 2 \displaystyle\sum_{i=1}^{p} A_i \cos i\lambda} \qquad (5.18)$$

where $z_i$ is the root of

$$1 + \sum_{i=1}^{p} a_i\, z^{-i} = 0 \qquad (5.19)$$
and $A_i$ is defined as

$$A_i = \sum_{j=0}^{p-i} a_j\, a_{j+i} \qquad (a_0 = 1) \qquad (5.20)$$

Furthermore, $\sigma^2$ is the scaling factor for the magnitude of the spectral density, and $p$ is the number of poles necessary for approximating the actual spectral density. Here, a pair of conjugate poles is counted as two separate poles.

Although assumption 1 is easily accepted for unvoiced consonants, it is not readily so for voiced sounds having a pitch-harmonic structure. In actual speech, however, the glottal source usually features temporal variation and fluctuation, and, therefore, harmonic components are broadened in the spectral domain. Hence, assumption 1 can be accepted for the spectral envelope characteristics of both voiced and unvoiced sounds.

Assumption 2 corresponds to the AR process described in the previous section. That is, the signal $\{x_t\}$, exhibiting the spectral density of Eq. (5.18), satisfies the relationship of Eq. (5.1) in the time domain. This correspondence can be understood if one traces back from Eq. (5.8) to Eq. (5.1). Zeros are not included in the hypothesized spectral density for two reasons. First, the human auditory organs are sensitive to poles and insensitive to steep spectral valleys, such as those represented only by zeros (Matsuda, 1966). Second, removing zeros simplifies as well as facilitates the mathematical process and the parameter extraction procedure.

When $\{\varepsilon_t\}$ is Gaussian, the logarithmic likelihood $L(X\,|\,\bar{\omega})$ for the $N$-sample sequence $X = (x_0, x_1, \ldots, x_{N-1})$ can be approximated by
$$L(X\,|\,\bar{\omega}) \approx -\frac{N}{4\pi} \int_{-\pi}^{\pi} \left[ \log 2\pi \hat{f}(\lambda) + \frac{f(\lambda)}{\hat{f}(\lambda)} \right] d\lambda \qquad (5.21)$$

where $\bar{\omega}$ indicates the parameter set $(\sigma^2, a_1, a_2, \ldots, a_p)$ in Eq. (5.18). $f(\lambda)$ and $\hat{v}_\tau$, which respectively express the short-term spectral density (periodogram) and the short-term autocorrelation function for $\{x_t\}$, are defined as
$$f(\lambda) = \frac{1}{2\pi N} \left| \sum_{t=0}^{N-1} x_t\, e^{-jt\lambda} \right|^2 \qquad (5.22)$$

and

$$\hat{v}_\tau = \frac{1}{N} \sum_{t=0}^{N-1-|\tau|} x_t\, x_{t+|\tau|} \qquad (5.23)$$

$\hat{v}_\tau$ and $r_\tau$ in Eq. (5.14) are related by $\hat{v}_\tau = r_\tau/N$. Equation (5.21) shows that the logarithmic likelihood for a given $X$ can be approximately represented using only the first $p$ time-delay elements of the short-term autocorrelation function, $\{\hat{v}_\tau\}_{\tau=0}^{p}$. Let us maximize $L(X\,|\,\bar{\omega})$ with respect to $\sigma^2$ first. From $\partial L(X\,|\,\bar{\omega})/\partial \sigma^2 = 0$, we obtain
$$\sigma^2 = J(\alpha_1, \alpha_2, \ldots, \alpha_p) = \sum_{\tau=-p}^{p} A_\tau\, \hat{v}_\tau \qquad (5.24)$$

Then

$$\max_{\sigma^2} L(X\,|\,\bar{\omega}) = -\frac{N}{2} \left[ \log J(\alpha_1, \alpha_2, \ldots, \alpha_p) + 1 \right] + \text{const.} \qquad (5.25)$$

Therefore, the maximization of $L(X\,|\,\bar{\omega})$ with respect to $\{\alpha_i\}_{i=1}^{p}$ is attained by the minimization of $J(\alpha_1, \alpha_2, \ldots, \alpha_p)$. Since

$$J(\alpha_1, \alpha_2, \ldots, \alpha_p) = \sum_{i=0}^{p} \sum_{j=0}^{p} \alpha_i\, \alpha_j\, \hat{v}_{|i-j|} \qquad (\alpha_0 = 1) \qquad (5.26)$$
$\{\alpha_i\}_{i=1}^{p}$ can be derived by solving the linear simultaneous equations

$$\sum_{i=1}^{p} \alpha_i\, \hat{v}_{|i-j|} = -\hat{v}_j \qquad (j = 1, 2, \ldots, p) \qquad (5.27)$$

From Eqs. (5.24) and (5.26),

$$\sigma^2 = \sum_{\tau=0}^{p} \alpha_\tau\, \hat{v}_\tau \qquad (5.28)$$
Since Eqs. (5.27) and (5.15) are equivalent, the $\{\alpha_i\}_{i=1}^{p}$ values obtained by Eq. (5.27) are equal to the values derived by the autocorrelation method. This means that linear predictive analysis employing the autocorrelation method and maximum likelihood spectral estimation solve the same passive linear system (acoustic characteristics of the vocal tract, including the source and radiation characteristics) in the time domain and frequency domain, respectively. The maximum likelihood spectral estimation method is equivalent to the process of adjusting the coefficients to minimize the output power $\sigma^2$ when the input signal is passed through an adjustable $p$th-order inverse filter. Hence, this method is often referred to as the inverse filtering method (Markel, 1972).

5.3.2 Physical Meaning of Maximum Likelihood Spectral Estimation
The function $\hat{f}(\lambda)$ in Eq. (5.21) is restricted in that it takes on the form of Eq. (5.18). Without such a restriction, $\hat{f}(\lambda)$, which maximizes Eq. (5.21) under the condition of a given $f(\lambda)$, is equal to $f(\lambda)$ ($-\pi \leq \lambda \leq \pi$). The maximum value of $L$ is

$$L_{\max} = -\frac{N}{4\pi} \int_{-\pi}^{\pi} \left[ \log 2\pi f(\lambda) + 1 \right] d\lambda \qquad (5.29)$$
Therefore,

$$E_1(f\,|\,\hat{f}\,) = \frac{8\pi}{N} \left( L_{\max} - L(X\,|\,\bar{\omega}) \right) = \int_{-\pi}^{\pi} 2 \left[ \log \frac{\hat{f}(\lambda)}{f(\lambda)} + \frac{f(\lambda)}{\hat{f}(\lambda)} - 1 \right] d\lambda \qquad (5.30)$$

becomes zero only when $\hat{f}(\lambda) = f(\lambda)$ ($-\pi \leq \lambda \leq \pi$); otherwise it has a positive value. Accordingly, $E_1(f\,|\,\hat{f})$ can be regarded as a matching error measure when the short-term spectral density is substituted by a hypothetical spectral density $\hat{f}(\lambda)$. This means that the estimation of spectral information based on the maximum likelihood method corresponds to the spectrum matching which minimizes the matching error measure in the same way as the A-b-S method. If the integrand of $E_1$ is represented by a function of $d = \log\{\hat{f}(\lambda)/f(\lambda)\}$, it becomes

$$G_1(d) = 2\,(e^{-d} + d - 1) \qquad (5.31)$$
which is shown by the solid curve in Fig. 5.2. On the other hand, in the conventional A-b-S method, $G_2(d) = d^2$ has usually been used as an integrand for measuring the spectral matching error. In the region $|d| < 1$, $G_1(d)$ and $G_2(d)$ are almost the same. When $d > 1$ and $d < -1$, however, $G_1(d)$ respectively increases linearly and exponentially as a function of $d$. $G_2(d)$ has a symmetrical curve around $d = 0$, whereas $G_1(d)$ is unsymmetrical. This means that in spectral matching using the maximum likelihood method, the matching error for neglecting a local valley in $f(\lambda)$ is evaluated as being smaller than that for neglecting a local peak having the same shape. The nonuniform weighting in the maximum likelihood method is preferred over uniform weighting since the peaks play a dominant role in the perception of voiced speech.

The poles of the spectral envelope, $z_i$ ($i = 1, 2, \ldots, p$), can be obtained as roots of the equation
FIG. 5.2 Comparison of matching error measure in maximum likelihood method, $G_1(d)$, with that in analysis-by-synthesis (A-b-S) method, $G_2(d)$. $d = \log\{\hat{f}(\lambda)/f(\lambda)\}$; $\hat{f}(\lambda)$ = model spectrum; $f(\lambda)$ = short-term spectrum.
$$1 + \sum_{i=1}^{p} \alpha_i\, z^{-i} = 0 \qquad (5.32)$$

in which complex poles correspond to quadratic resonances. Their resonance frequencies and bandwidths are given by the equations

$$F_i = \frac{1}{2\pi \Delta T} \tan^{-1} \frac{\mathrm{Im}\, z_i}{\mathrm{Re}\, z_i}$$

and

$$B_i = -\frac{1}{\pi \Delta T} \ln |z_i| \qquad (5.33)$$

where $\Delta T$ is the sampling period. The formants can be extracted by selecting the poles whose bandwidth-to-frequency ratios are relatively small.

Figure 5.3 compares the short-term spectral densities and spectral envelopes estimated by the maximum likelihood method for the male and female vowel /a/ when the number of poles is varied between 6 and 12. It is evident that major peaks in the short-term spectrum can be almost completely represented by $\hat{f}(\lambda)$ when the speech wave is band-limited between 0 and 4 kHz and $p$ is set larger than or equal to 10.

FIG. 5.3 Comparison of (a) short-term spectra and (b) spectral envelopes obtained by the maximum likelihood method.

Figure 5.4 exemplifies the time function of spectral envelopes for the Japanese test sentence beginning /bakuoNga/, or 'A whir is . . .,' uttered by a male speaker (Tohkura, 1980). Here, the Hamming window length is 30 ms, the frame period is 5 ms, and $p$ is set at 12.

FIG. 5.4 Time function of spectral envelopes for the Japanese phrase /bakuoNga/ uttered by a male speaker.
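Equation (5.33) maps each complex pole to a formant candidate. A small sketch follows (hypothetical coefficients, 8-kHz sampling assumed; for $p = 2$ the pole pair is obtained in closed form instead of with a general polynomial root finder):

```python
import cmath
import math

def pole_to_formant(z, delta_t):
    """Resonance frequency and bandwidth of a complex pole z (cf. Eq. 5.33):
       F = arg(z) / (2*pi*delta_t),   B = -ln|z| / (pi*delta_t)."""
    F = cmath.phase(z) / (2.0 * math.pi * delta_t)
    B = -math.log(abs(z)) / (math.pi * delta_t)
    return F, B

# p = 2 toy case: A(z) = 1 + a1 z^-1 + a2 z^-2; poles are roots of z^2 + a1 z + a2 = 0.
a1, a2 = -1.2, 0.81                   # hypothetical coefficients giving a conjugate pole pair
z1 = (-a1 + cmath.sqrt(a1 * a1 - 4.0 * a2)) / 2.0
delta_t = 1.0 / 8000.0                # 8-kHz sampling assumed
F, B = pole_to_formant(z1, delta_t)   # candidate formant; keep it if B/F is small
```

For this pole pair the radius is $\sqrt{a_2} = 0.9$, giving a resonance near 1.1 kHz with a bandwidth of roughly 270 Hz; the small $B/F$ ratio marks it as a formant candidate.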
5.4 SOURCE PARAMETER ESTIMATION FROM RESIDUAL SIGNALS
Let us consider the spectral fine structure of the residual signal

$$\varepsilon_t = \sum_{i=0}^{p} \alpha_i\, x_{t-i} \qquad (\alpha_0 = 1) \qquad (5.34)$$

Since the fine structure is obtained by normalizing the short-term spectrum of input speech $f(\lambda)$ by the spectral envelope $\hat{f}(\lambda)$, it is almost flat along the frequency axis and exhibits a harmonic structure for periodic speech. Therefore, the autocorrelation function for the residual signal, called the modified autocorrelation function, produces large correlation values at the delays having the integer ratio of the fundamental period for voiced speech, whereas no specific correlation is demonstrated for unvoiced speech (Itakura and Saito, 1968). In this way, the vocal source parameters can be obtained using the modified autocorrelation function regardless of the spectral envelope shape. The modified autocorrelation function can be easily calculated by the Fourier transform of $f(\lambda)/\hat{f}(\lambda)$ as follows:

$$w_\tau = \frac{1}{\sigma^2} \sum_{s=-p}^{p} A_s \int_{-\pi}^{\pi} f(\lambda) \cos(\tau - s)\lambda \; d\lambda = \frac{1}{\sigma^2} \sum_{s=-p}^{p} A_s\, \hat{v}_{\tau-s} \qquad (5.35)$$

where $A_s$ is a correlation function of linear predictor coefficients as previously defined by Eq. (5.20). Equation (5.35) means that $w_\tau$ can be calculated by the convolution of the short-term autocorrelation function and $\{A_s\}_{s=-p}^{p}$ for input speech, followed by normalization using $\sigma^2$. $w_\tau$ can also be obtained by directly calculating the correlation function for $\varepsilon_t$ using Eq. (5.34).
Since actual speech often features intermediate characteristics between the periodic and the aperiodic, the source characteristic function $V(w_\tau)$ is defined so that it expresses not merely voiced or unvoiced sound but also the intermediate characteristics between these sounds.

In the course of pitch extraction, low-pass filtering is widely applied to speech waves or residual signals for improving the resolution of the extracted pitch period. Low-pass filtering is effective for removing the influence of high-order formants and for compensating for the insufficiency of the time resolution arising in the autocorrelation function. The latter effect is especially important for pitch extraction using the modified autocorrelation function. The double-period pitch error due to the time resolution insufficiency can be considerably reduced by employing low-pass filtering.

Figure 5.5 exemplifies waveforms, autocorrelation functions, and short-term spectra for speech waves, residual signals, and their low-passed signals for the vowel /a/ uttered by a male speaker (Tohkura, 1980). The cutoff frequency for the low-pass filter is 900 Hz. Comparison of the correlation functions for the speech waves and for the residual signals shows that the latter, specifically, the modified autocorrelation function, is more advantageous than the former correlation function. When the correlation function for the speech waves is used, formant-related components, which become large when the harmonic components of the fundamental frequency and the formant frequencies are close together, cause errors in maximum value selection. On the other hand, when the correlation function for the residual signals is used, peaks are observed only at the fundamental period and at its integer multiples and are not affected by formants.

5.5 SPEECH ANALYSIS-SYNTHESIS SYSTEM BY LPC
The original speech wave can be reproduced based on the relationship $x_t = \hat{x}_t + \varepsilon_t$, or $X(z) = E(z)/A(z)$, using the speech synthesis circuit indicated in Fig. 5.6 and the residual signal $\varepsilon_t$ as the sound source. For the purpose of reducing information, pulses
FIG. 5.5 Waveforms, autocorrelation functions, and short-term spectra for a speech wave, a residual signal, and their low-pass filtered signals for the vowel /a/ uttered by a male speaker.
and white noise are utilized as sound sources to drive the speech synthesis circuit instead of employing $\varepsilon_t$ itself. Pulses and white noise are controlled based on the source periodicity information extracted from $\varepsilon_t$. The control parameters of the speech synthesis circuit are thus the linear predictor coefficients $\{a_i\}_{i=1}^{p}$, the pulse amplitude $A_v$, and the fundamental period $T$ for the voiced source. $A_v$
FIG. 5.6 Speech synthesis circuit based on linear predictive analysis method.
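The circuit of Fig. 5.6 is a direct-form all-pole filter. A minimal sketch of its per-sample recursion (with the Eq. (5.1) sign convention, $x_t = \varepsilon_t - \sum_{i=1}^{p} a_i x_{t-i}$):

```python
def lpc_synthesize(excitation, a):
    """All-pole synthesis 1/A(z): x_t = e_t - sum_{i=1}^p a_i x_{t-i} (Fig. 5.6)."""
    p = len(a)
    x = []
    for t, e in enumerate(excitation):
        acc = e
        for i in range(1, p + 1):
            if t - i >= 0:
                acc -= a[i - 1] * x[t - i]   # feedback through the predictor taps
        x.append(acc)
    return x

# Driving a one-pole filter (a1 = -0.5, pole at z = 0.5) with a single pulse:
impulse_response = lpc_synthesize([1.0, 0.0, 0.0, 0.0], [-0.5])
```

The geometric decay of the impulse response reflects the pole radius; with the pole inside the unit circle the feedback loop is stable, which is exactly the property at issue in the next paragraphs.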
and $T$ are replaced with noise amplitude $A_n$ for the unvoiced source (Itakura and Saito, 1968).

The stability of the above-mentioned synthesis filter $1/A(z)$ must be carefully maintained since it has a feedback loop. Stability here, meaning that the output of the system for a finite input is itself finite, corresponds to the condition that the difference equation (5.1) has a stationary solution (see Appendix A.3). If the linear predictor coefficients are obtained through the correlation method of linear predictive analysis or through the maximum likelihood method, stability of the synthesis filter is theoretically guaranteed. The reason for this is that the spectral density function $\hat{f}(\lambda)$ always becomes positive definite when the short-term autocorrelation function $\{\hat{v}_\tau\}_{\tau=0}^{p}$ is a positive definite sequence. During actual parameter transmission or storage, however, stability is not always guaranteed because of the quantization error. In such situations, there is no practical, clear criterion for the range of $\{a_i\}_{i=1}^{p}$ which secures stability. This is one of the difficulties of using LPC or the maximum likelihood method in speech analysis-synthesis systems.

In order to minimize this problem, the spectral dynamic range, namely, the difference between the maximum and minimum values (peaks and valleys) in the spectrum, should be reduced as much as possible. Effective for this purpose is the application of a 6-dB/oct high-emphasis filter or a spectral equalizer adapted to the overall spectral inclination. The stability problem, however, has finally been solved theoretically as well as practically by the PARCOR analysis-synthesis method, as described in the following section.

5.6 PARCOR ANALYSIS

5.6.1 Formulation of PARCOR Analysis
The same two assumptions made for the maximum likelihood estimation (see Sec. 5.3.1) are also made for the speech wave. When the prediction errors for the linear prediction of $x_t$ and $x_{t-m}$ using the sampled values $\{x_{t-i}\}_{i=1}^{m-1}$ are written as

$$\varepsilon_{ft}^{(m-1)} = \sum_{i=0}^{m-1} a_i^{(m-1)}\, x_{t-i} \qquad \bigl(a_0^{(m-1)} = 1\bigr)$$

and

$$\varepsilon_{bt}^{(m-1)} = \sum_{i=1}^{m} b_i^{(m-1)}\, x_{t-i} \qquad \bigl(b_m^{(m-1)} = 1\bigr) \qquad (5.36)$$

the PARCOR (partial autocorrelation) coefficient $k_m$ between $x_t$ and $x_{t-m}$ is defined by

$$k_m = \frac{E\bigl\{\varepsilon_{ft}^{(m-1)}\, \varepsilon_{bt}^{(m-1)}\bigr\}}{\sqrt{E\bigl\{(\varepsilon_{ft}^{(m-1)})^2\bigr\}}\; \sqrt{E\bigl\{(\varepsilon_{bt}^{(m-1)})^2\bigr\}}} \qquad (5.37)$$

This equation means that the PARCOR coefficient is the correlation between the forward prediction error $\varepsilon_{ft}^{(m-1)}$ and the backward prediction error $\varepsilon_{bt}^{(m-1)}$ (Itakura and Saito, 1971). The definitional concept behind the PARCOR coefficient is presented in block diagram form in Fig. 5.7. Since the prediction errors, $\varepsilon_{ft}^{(m-1)}$ and $\varepsilon_{bt}^{(m-1)}$, are obtained after removing the linear effect of the $m-1$ sample values between $x_t$ and $x_{t-m}$ from these sample values, $k_m$ represents the pure or partial correlation between $x_t$ and $x_{t-m}$. When Eq. (5.36) is put into Eq. (5.37), the PARCOR coefficient sequence $k_m$ ($m = 1, 2, \ldots, p$) can be written as
$$k_m = \frac{\displaystyle\sum_{i=0}^{m-1} a_i^{(m-1)}\, v_{m-i}}{\displaystyle\sum_{i=0}^{m-1} a_i^{(m-1)}\, v_i} \qquad (5.38)$$

FIG. 5.7 Definition of PARCOR coefficients (the forward and backward prediction errors $\varepsilon_{ft}^{(m-1)}$ and $\varepsilon_{bt}^{(m-1)}$ feed a correlator whose output is $k_m$).
where $v_i$ is the short-term autocorrelation function for the speech wave. Although this autocorrelation function should be written as $\hat{v}_i$ in line with the notations made thus far, it is written as $v_i$ for simplicity's sake. $k_1$ is equal to $v_1/v_0$, i.e., to the first-order autocorrelation coefficient. This is also clear from the definition of $k_m$.

Using Eq. (5.38) and the fact that the prediction coefficients $\{a_i^{(m-1)}\}_{i=0}^{m-1}$ and $\{b_i^{(m-1)}\}_{i=1}^{m}$ constitute the solutions of the simultaneous equations

$$\sum_{i=0}^{m-1} a_i^{(m-1)}\, v_{|i-j|} = 0 \qquad (j = 1, 2, \ldots, m-1)$$

and

$$\sum_{i=1}^{m} b_i^{(m-1)}\, v_{|i-j|} = 0 \qquad (j = 1, 2, \ldots, m-1) \qquad (5.39)$$

the following recursive equations can be obtained ($m = 1, 2, \ldots, p$):

$$a_i^{(m)} = a_i^{(m-1)} - k_m\, b_i^{(m-1)}, \qquad b_i^{(m)} = b_{i-1}^{(m-1)} - k_m\, a_{i-1}^{(m-1)} \qquad (5.40)$$

Additionally, the following equation is obtained from Eq. (5.39):

$$b_i^{(m-1)} = a_{m-i}^{(m-1)} \qquad (i = 1, 2, \ldots, m) \qquad (5.41)$$
Based on these results, the PARCOR coefficients $\{k_m\}_{m=1}^{p}$ and linear predictor coefficients $\{a_m\}_{m=1}^{p}$ are obtained from $\{v_\tau\}_{\tau=0}^{p}$ through the flowchart in Fig. 5.8 by using Eqs. (5.38) and (5.40). This iterative method is equivalent to Durbin's recursive solution for simultaneous linear equations. The numbers of multiplications, summations, and divisions necessary for this computation are roughly $p(p+1)$, $p(p+1)$, and $p$, respectively. When these computations are done using a short word length, the truncation error in the computation accumulates as the analysis
FIG. 5.8 Flowchart for calculating $\{k_m\}_{m=1}^{p}$ and $\{a_m\}_{m=1}^{p}$ from $\{v_\tau\}_{\tau=0}^{p}$.
progresses.

In the iteration process, each $k_m$ ($m = 1, 2, \ldots, p$) is obtained one by one, whereas the $a_m$ values change at every iteration. Finally, the $a_m$ values are obtained as

$$a_m = a_m^{(p)} \qquad (1 \leq m \leq p) \qquad (5.42)$$

Since the normalized mean square error $\sigma^2$ is equal to $u_p/v_0$ by its definition, $\sigma^2$ can be calculated using PARCOR coefficients, instead of linear predictor coefficients, from

$$\sigma^2 = \prod_{m=1}^{p} (1 - k_m^2) \qquad (5.43)$$
This equation is obtained from Eq. (5.40).

In order to derive $\{k_m\}_{m=1}^{p}$ directly from the signal $\{x_t\}$, let us define the forward and backward prediction error operators $A_m(D)$ and $B_m(D)$ as

$$A_m(D) = \sum_{i=0}^{m} a_i^{(m)}\, D^i$$

and

$$B_m(D) = \sum_{i=1}^{m+1} b_i^{(m)}\, D^i \qquad (5.44)$$

where $D$ is the delay operator such that $D^i x_t = x_{t-i}$. Equations (5.36) can then be written as

$$\varepsilon_{ft}^{(m)} = A_m(D)\, x_t$$

and

$$\varepsilon_{bt}^{(m)} = B_m(D)\, x_t \qquad (5.45)$$
From Eq. (5.40), we can arrive at the recursive equations

$$A_m(D) = A_{m-1}(D) - k_m\, B_{m-1}(D)$$

and

$$B_m(D) = D\,\{B_{m-1}(D) - k_m\, A_{m-1}(D)\} \qquad (5.46)$$
Based on Eqs. (5.38), (5.45), and (5.46), the PARCOR coefficients $\{k_m\}$ can subsequently be produced directly from the speech wave using a cascade connection of variable-parameter digital filters (partial correlators), each of which includes a correlator, as indicated in Fig. 5.9(a). Since $E\{(\varepsilon_{ft}^{(m-1)})^2\} = E\{(\varepsilon_{bt}^{(m-1)})^2\}$, the correlator can be realized by the structure indicated in Fig. 5.9(b), which consists of square, addition, subtraction, and division circuits and low-pass filters.

The process of extracting PARCOR coefficients using the partial correlators involves successively extracting and removing the correlations between adjacent samples. This is an inverse filtering process which flattens the spectral envelope successively. Therefore, when the number of partial correlators $p$ is large enough, the correlation between adjacent samples, which corresponds to the overall spectral envelope information, is almost completely removed by passing the speech wave through the partial correlators. Consequently, the output of the final stage, namely, the residual signal, includes only the correlation between the distant samples which relates to the source (pitch) information. Hence, the source parameters can be extracted from the autocorrelation function for the residual signal, in other words, from the modified autocorrelation function.

The definition of the PARCOR coefficients confirms that $|k_m| \leq 1$ is always satisfied. Furthermore, if $|k_m| < 1$, the roots of $A_p(z) = 0$ have also been verified to exist inside the unit circle, and, therefore, the stability of the synthesis filter is guaranteed (Itakura and Saito, 1971).
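A compact sketch of the partial-correlator cascade: each stage estimates $k_m$ as the normalized cross-correlation of the forward and backward errors (Eq. (5.37)) and then updates both errors following Eq. (5.46). Batch (whole-frame) sums here replace the low-pass-filtered running averages of Fig. 5.9(b):

```python
def parcor_lattice(x, p):
    """Estimate PARCOR coefficients with a cascade of partial correlators (Fig. 5.9)."""
    f = list(x)                 # forward error, stage 0: the signal itself (A_0(D) = 1)
    b = [0.0] + list(x[:-1])    # backward error, stage 0: one-sample delay (B_0(D) = D)
    ks = []
    for _ in range(p):
        num = sum(ft * bt for ft, bt in zip(f, b))
        den = (sum(v * v for v in f) * sum(v * v for v in b)) ** 0.5
        k = num / den if den else 0.0        # Eq. (5.37); |k| <= 1 by Cauchy-Schwarz
        ks.append(k)
        f_new = [ft - k * bt for ft, bt in zip(f, b)]                # A_m = A_{m-1} - k B_{m-1}
        b_new = [0.0] + [bt - k * ft for ft, bt in zip(f, b)][:-1]   # B_m = D(B_{m-1} - k A_{m-1})
        f, b = f_new, b_new
    return ks

# A deterministic first-order exponential: one stage removes (nearly) all its correlation.
ks = parcor_lattice([0.9 ** t for t in range(200)], 2)
```

After the first stage flattens the envelope of this AR(1)-like signal, the second coefficient comes out near zero, illustrating the successive whitening the text describes.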
FIG. 5.9 (a) PARCOR coefficient extraction circuit constructed by cascade connection of partial autocorrelators and (b) construction of each partial autocorrelator.
5.6.2 Relationship between PARCOR and LPC Coefficients
If either one of the sets $\{k_m\}_{m=1}^{p}$ or $\{a_m\}_{m=1}^{p}$ is given, the other can be obtained by iterative computation. For example, when $\{k_m\}_{m=1}^{p}$ are given, $\{a_m\}_{m=1}^{p}$ are derived by iterative computations ($m = 1, 2, \ldots, p$) using a part of Durbin's solution:

$$a_m^{(m)} = -k_m, \qquad a_i^{(m)} = a_i^{(m-1)} - k_m\, a_{m-i}^{(m-1)} \quad (i = 1, 2, \ldots, m-1) \qquad (5.47)$$
On the other hand, $\{k_m\}_{m=1}^{p}$ can be drawn from $\{a_m\}_{m=1}^{p}$ using the iterative computations in the opposite direction ($m = p, p-1, \ldots, 2, 1$) as indicated below, where the initial condition is $a_i^{(p)} = a_i$ ($1 \leq i \leq p$):

$$k_m = -a_m^{(m)}, \qquad a_i^{(m-1)} = \frac{a_i^{(m)} + k_m\, a_{m-i}^{(m)}}{1 - k_m^2} \quad (i = 1, 2, \ldots, m-1) \qquad (5.48)$$
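Equations (5.47) and (5.48) can be sketched directly (0-based lists, so `a[i]` holds $a_{i+1}$). Because Eq. (5.48) recovers every $k_m$, it also yields the practical stability test for a quantized coefficient set: the synthesis filter is stable exactly when all $|k_m| < 1$:

```python
def parcor_to_lpc(k):
    """Eq. (5.47): build {a_i} from PARCOR coefficients (a_m^{(m)} = -k_m)."""
    a = []
    for m, km in enumerate(k, start=1):
        a = [a[i] - km * a[m - 2 - i] for i in range(m - 1)] + [-km]
    return a

def lpc_to_parcor(a):
    """Eq. (5.48): recover {k_m} from {a_i} by the backward recursion."""
    a = list(a)
    ks = []
    for m in range(len(a), 0, -1):
        km = -a[m - 1]                       # k_m = -a_m^{(m)}
        ks.append(km)
        if m > 1:                            # step the predictor down one order
            a = [(a[i] + km * a[m - 2 - i]) / (1.0 - km * km) for i in range(m - 1)]
    return ks[::-1]

def is_stable(a):
    """1/A(z) is stable iff every recovered PARCOR coefficient satisfies |k_m| < 1."""
    return all(abs(km) < 1.0 for km in lpc_to_parcor(a))
```

The two conversions are exact inverses of each other, so a coefficient set can be round-tripped to verify stability after quantization.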
5.6.3 PARCOR Synthesis Filter

A digital filter which synthesizes the speech waveform employing PARCOR coefficients can be realized by the inverse process of speech analysis incorporating partial autocorrelators. In other words, in the PARCOR synthesis process the correlation between sample values is successively restored to the residual signal; that is, resonance is added to the flat spectrum of the residual signal. More specifically, the synthesis filter has the characteristics $1/A_m(D)$, the inverse of the analysis filter. Reversing the signal propagation direction for $A$ in the recursive equation (5.46) produces the relationships

$$A_{m-1}(D) = A_m(D) + k_m\, B_{m-1}(D)$$

and

$$B_m(D) = D\,\{B_{m-1}(D) - k_m\, A_{m-1}(D)\} \qquad (5.49)$$
Let us assume that the synthesis filters having the transmission characteristics of $1/A_m(D)$ and $B_m(D)$ are already realized as shown within the solid rectangle in Fig. 5.10. In order to attain a synthesized output $y_t$ at the final output terminal $Q$, a signal $A_m(D)y_t$ must be input to terminal $a_m$. This permits a signal $B_m(D)y_t$ to appear at terminal $b_m$. Let us next construct a lattice filter based on Eq. (5.49), as indicated within the dashed rectangle, and connect it to the circuit within the solid rectangle. If these
FIG. 5.10 Principal construction features of synthesis filter using PARCOR coefficients.
combined circuits are viewed from terminal $a_{m+1}$, they exhibit an input-output relation of $1/A_{m+1}(D)$, since they produce output $y_t$ at terminal $Q$ for input signal $A_{m+1}(D)y_t$. At the same time, signal $B_{m+1}(D)y_t$ appears at terminal $b_{m+1}$. Therefore, the structure indicated in Fig. 5.10 realizes one section (stage) of the PARCOR synthesis filter. Several equivalent transformations exist for this lattice filter, as indicated in Fig. 5.11.

A structural example of a speech analysis-synthesis system using PARCOR coefficients is presented in Fig. 5.12. Here, partial autocorrelators are used for the analysis. For comparison, Fig. 5.13 offers an example in which a recursive computation-based method is employed for the same purpose.

When the synthesis parameters of the PARCOR analysis-synthesis system are renewed at time intervals (frame intervals) different from the analysis intervals, the speaking rate is modified without an accompanying change in the pitch (fundamental frequency).
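One stage per coefficient, per sample, following Fig. 5.10: the forward path adds $k_m$ times the backward error back in (Eq. (5.49)), and the backward errors are advanced by one sample for the next input. A sketch (per-sample state machine; variable names are illustrative):

```python
def lattice_synthesize(excitation, k):
    """PARCOR lattice synthesis filter (one section per k_m, cf. Fig. 5.10)."""
    p = len(k)
    bb = [0.0] * p                    # bb[m] holds b^{(m)}_t for the current sample
    out = []
    for e in excitation:
        f = e                         # f^{(p)}_t: the excitation enters the top stage
        fs = [0.0] * p
        for m in range(p, 0, -1):     # Eq. (5.49): f^{(m-1)} = f^{(m)} + k_m b^{(m-1)}
            f = f + k[m - 1] * bb[m - 1]
            fs[m - 1] = f             # remember f^{(m-1)}_t for the state update
        x = f                         # f^{(0)}_t is the synthesized sample
        # advance the backward errors: b^{(m)}_{t+1} = b^{(m-1)}_t - k_m f^{(m-1)}_t
        new_bb = [0.0] * p
        new_bb[0] = x                 # b^{(0)}_{t+1} = x_t   (B_0(D) = D)
        for m in range(1, p):
            new_bb[m] = bb[m - 1] - k[m - 1] * fs[m - 1]
        bb = new_bb
        out.append(x)
    return out

# For k = (0.5, 0.2) this matches the direct-form predictor a = (-0.4, -0.2).
y = lattice_synthesize([1.0, 0.0, 0.0], [0.5, 0.2])
```

Unlike the direct form, each multiplier here carries a $k_m$ that can be quantized and still checked individually against the $|k_m| < 1$ stability bound.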
5.6.4 Vocal Tract Area Estimation Based on PARCOR Analysis
The signal flow graph of Kelly’s speech synthesis model (Fig. 3.4(b) in Sec. 3.3.1) formally coincides with the speech synthesis digital
FIG. 5.11 Equivalent transformations for lattice-type digital filter.
filter used in PARCOR analysis-synthesis systems (Fig. 5.12). In other words, the PARCOR coefficient $k_m$ corresponds to the reflection coefficient $\kappa_m$. Also, the PARCOR lattice filter is regarded as an equivalent circuit for the vocal tract acoustic filter simulating the cascade connection of $p$ equal-length acoustic tubes having different areas. Since a relationship exists between the reflection coefficient and the area function, as described in Sec. 3.3.1, the area function can be expected to be estimated from the PARCOR coefficients. However, several problems exist with this assumption. From the speech production mechanism, the total system function $S(z)$ for the speech production system is represented by
FIG. 5.12 Structural example of a speech analysis-synthesis system using PARCOR coefficients (partial autocorrelators used for the analysis).

FIG. 5.13 Structural example of a speech analysis-synthesis system using PARCOR coefficients (recursive computation-based analysis).
the product of the system functions for source generation $G(z)$, vocal tract resonance $V(z)$, and radiation $R(z)$ as

$$S(z) = G(z)\, V(z)\, R(z) \qquad (5.50)$$
With PARCOR analysis, which is based on the linear separable equivalent circuit model described in Sec. 3.2, the vocal tract system function is obtained by assuming that the sound source consists of an impulse or random noise having a uniform spectral density. Therefore, the overall characteristics of $S(z)$, including the source and radiation characteristics, are derived instead of $V(z)$. Consequently, when the area function is calculated according to the formal coincidence between the PARCOR coefficient $k_m$ and the reflection coefficient $\kappa_m$, the result widely differs from the actual area function.

Properly estimating the area function thus requires the removal of the effects of $G(z)$ and $R(z)$ from the speech wave prior to the PARCOR analysis, which is called inverse filtering or spectral equalization. Two specific methods have been investigated for inverse filtering.

1. First-order differential processing

As is well known, the frequency characteristics of the sound source $G(z)$ and radiation $R(z)$ can be roughly approximated as $-12$ dB/oct and $6$ dB/oct, respectively. Based on this approximation, the sound source and radiation characteristics can be canceled by 6-dB/oct spectral emphasis (Wakita, 1973). This is actually done by analog differential processing of the input speech wave $\{x_t\}$ or by digital processing of digitized speech. The latter is accomplished by calculating $y_t = x_t - x_{t-1}$, which corresponds to the filter processing of $F(z) = 1 - z^{-1}$.

2. Adaptive inverse filtering

On the assumption that the overall vocal tract frequency characteristics are almost flat and have hardly any spectral tilt, the spectral tilt in the input signal is adaptively removed at every analysis frame using lower-order correlation coefficients (Nakajima et al., 1978). When the first-order inverse filter is applied, the first-order correlation
coefficient, that is, the first-order PARCOR coefficient ($k_1 = r_1 = v_1/v_0$), is used to construct the filter $F(z) = 1 - k_1 z^{-1}$. This is achieved by the computation of $y_t = x_t - k_1 x_{t-1}$ or by the convolution of the correlation coefficients. Using the convolution method, inverse filtering can easily be done even for second- or third-order critical-damping inverse filtering.

Appropriate boundary conditions at the lips and the glottis must also be established for properly estimating the area function. For this purpose, two cases have been considered for vowel-type speech production in which the sound source is located at the glottis and no connection exists with the nasal cavity.
Case 1
Lips: The vocal tract is open to a field having an infinite area such that the forward propagation wave is completely reflected, and the circuit is shorted (impedance $Z = \rho c / A_{l+1} = 0$, since $A_{l+1} = \infty$).

Glottis: The vocal tract is terminated by the characteristic impedance $\rho c / A_p$. The backward propagation wave flows out to the trachea without reflection and causes a loss. The input signal is supplied to the vocal tract through this characteristic impedance (Wakita, 1973; Nakajima et al., 1978).
Case 2
Lips: The vocal tract is terminated by the characteristic impedance $\rho c / A_1$. The forward propagation wave is emitted to the field without reflection and results in a loss.

Glottis: The vocal tract is completely closed (in other words, $\kappa_{p+1} = -1$) such that the backward propagation wave is completely reflected, and the input signal is supplied to the glottis as a constant flow source (Atal, 1970).

The vocal tract area ratio is successively determined from the lips in Case 1 and from the glottis in Case 2. Linear predictive analysis and PARCOR analysis correspond to Case 1. Comparing the results of
these two cases, which are usually quite different from each other, Case 1 seems to give the most reasonable results. For the final transformation from the area ratio to the area function, it is necessary to define the glottal area $A_p$ so that the final results become similar to the actual values determined through x-ray photography and other techniques. The relationship between the area function and PARCOR coefficients in Case 1 is shown in Fig. 5.14.

The vocal tract area function estimated from the actual speech wave based on the above method has been confirmed to globally coincide with the results observed by x-ray photographs. Figure 5.15 compares spectral envelopes and area functions (unit interval length is 1.4 cm) estimated by applying adaptive inverse filtering for the five Japanese vowels uttered by a male speaker (Nakajima et al., 1978).

If the vocal tract area could ever be estimated automatically and precisely from the speech wave, the estimation method achieving this would certainly become a fundamental speech analysis method. Furthermore, this method would be extremely useful for analyzing the speech production process and for improving speech recognition and synthesis systems. Several problems remain, however, in achieving the necessary precision of the estimated area function. These warrant further investigation into properly modeling the source characteristics in the estimation algorithm.
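The pipeline sketched above, flattening the source and radiation tilt and then reading the area ratios off the PARCOR coefficients, can be illustrated as follows. The $k_1$-adaptive filter is method 2 from the inverse-filtering discussion; the area-ratio direction $(1 - k_m)/(1 + k_m)$ is an assumed sign convention of this sketch, since the actual convention comes from Sec. 3.3.1:

```python
def adaptive_inverse_filter(x):
    """First-order adaptive inverse filtering (spectral equalization):
    y_t = x_t - k1 * x_{t-1}, with k1 = v1/v0 estimated from the frame itself.
    Using k1 = 1 instead gives the fixed 6-dB/oct difference y_t = x_t - x_{t-1}."""
    v0 = sum(v * v for v in x)
    v1 = sum(x[t] * x[t + 1] for t in range(len(x) - 1))
    k1 = v1 / v0 if v0 else 0.0
    return [x[0]] + [x[t] - k1 * x[t - 1] for t in range(1, len(x))], k1

def parcor_to_area(k, lip_area=1.0):
    """Successive area estimation from the lips (Case 1), assuming the convention
    A_{m+1}/A_m = (1 - k_m)/(1 + k_m); the opposite reflection-coefficient
    convention would flip the sign of each k_m."""
    areas = [lip_area]                        # assumed lip-end area (arbitrary scale)
    for km in k:
        areas.append(areas[-1] * (1.0 - km) / (1.0 + km))
    return areas

y, k1 = adaptive_inverse_filter([1.0, 1.0, 1.0, 1.0])   # toy frame
areas = parcor_to_area([1.0 / 3.0, 0.0])                 # toy PARCOR set
```

Only the area ratios are determined; fixing the absolute scale requires the assumed end-area, mirroring the need to define $A_p$ noted in the text.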
5.7 LINE SPECTRUM PAIR (LSP) ANALYSIS

5.7.1 Principle of LSP Analysis
Although the PARCOR analysis-synthesis method is superior to any other previously developed method, it has a lower bit rate limit of 2400 bps. If the bit rate falls below this value, the synthesized voice rapidly becomes unclear and unnatural. The LSP method was thus investigated to maintain voice quality at smaller bit rates (Itakura, 1975). The PARCOR coefficients are essentially
FIG. 5.14 Relationship between the area function and PARCOR coefficients (Case 1).
FIG. 5.15 Examples of spectral envelopes and estimated area functions for five vowels: (a) overall spectral envelope for inverse filtering (source and radiation characteristics); (b) spectral envelope after inverse filtering (vocal tract characteristics).

parameters operating in the time domain, as are the autocorrelation coefficients, whereas the LSPs are parameters functioning in the frequency domain. Therefore, the LSP parameters are advantageous in that the distortion they produce is smaller than that of the PARCOR coefficients even when they are roughly quantized and linearly interpolated.

As with PARCOR analysis, LSP analysis is based on the all-pole model. The polynomial expression for $z$, which is the
denominator of the all-pole model, satisfies the following recursive equations, as previously demonstrated in Eq. (5.46):
A_{p+1}(z) = A_p(z) - k_{p+1} B_p(z)        (5.50)

and

B_{p+1}(z) = z^{-1} [B_p(z) - k_{p+1} A_p(z)]        (5.51)

where A_0(z) = 1 and B_0(z) = z^{-1} (initial conditions). Let us assume that A_p(z) is given, and consider the two types of A_{p+1}(z), P(z) and Q(z), obtained under the conditions k_{p+1} = 1 and k_{p+1} = -1, respectively. The condition |k_{p+1}| = 1 corresponds to the case where the airflow is completely reflected at the glottis in the (pseudo) vocal tract model represented by the PARCOR coefficients. In other words, this condition corresponds to a completely open or completely closed termination. The actual boundary condition at the glottis is, however, an iteration of opening and closing as a function of vocal cord vibration. Since the boundary condition at the lips in the PARCOR analysis is a free field (k_0 = -1), as mentioned in the previous section, the present boundary condition sets the absolute values of the reflection coefficients to 1 at both ends of the vocal tract. This means that the vocal tract acoustic system becomes a lossless system which completely shuts in the energy. The Q value of every resonance mode in the acoustic tube thus becomes infinite, and a pair of delta-function-like resonance characteristics (a pair of line spectra) corresponding to each boundary condition at the glottis is obtained. The number of resonances is 2p.
5.7.2 Solution of LSP Analysis
Applying these two boundary conditions to the recursion, P(z) and Q(z) are written as

P(z) = A_p(z) - B_p(z)

and

Q(z) = A_p(z) + B_p(z)        (5.52)

Although P(z) and Q(z) are both (p+1)st-order polynomial expressions, P(z) has inversely symmetrical coefficients whereas Q(z) has symmetrical coefficients. Using Eq. (5.52), we get

A_p(z) = [P(z) + Q(z)] / 2        (5.53)

On the other hand, from the recursive equations of (5.51),

B_p(z) = z^{-1} [B_{p-1}(z) - k_p A_{p-1}(z)]

Continuing this transformation, we can derive the general equation

B_p(z) = z^{-(p+1)} A_p(z^{-1})        (5.55)

If p is assumed to be even, P(z) and Q(z) are factorized as

P(z) = (1 - z^{-1}) Π_{i=2,4,...,p} (1 - 2 z^{-1} cos ω_i + z^{-2})

and

Q(z) = (1 + z^{-1}) Π_{i=1,3,...,p-1} (1 - 2 z^{-1} cos ω_i + z^{-2})        (5.56)

The factors 1 - z^{-1} and 1 + z^{-1} are found by calculating P(1) and Q(-1) after putting Eq. (5.55) into Eq. (5.52). The coefficients {ω_i} which appear in the factorization of Eq. (5.56) are referred to as the LSP parameters. The {ω_i} are ordered as 0 < ω_1 < ω_2 < ... < ω_p < π.
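A small numerical sketch of the above derivation for p = 2: form P(z) and Q(z) from hypothetical LPC coefficients, divide out the fixed factors (1 - z^{-1}) and (1 + z^{-1}), and read each LSP frequency from the remaining quadratic factor 1 - 2 z^{-1} cos ω_i + z^{-2}.

```python
import math

def lsp_polynomials(a):
    """P(z) = A(z) - z^-(p+1) A(z^-1) and Q(z) = A(z) + z^-(p+1) A(z^-1),
    given LPC coefficients a = [1, a1, ..., ap]."""
    ext = a + [0.0]                      # pad A(z) to order p+1
    rev = ext[::-1]                      # coefficients of z^-(p+1) A(z^-1)
    return ([c - r for c, r in zip(ext, rev)],
            [c + r for c, r in zip(ext, rev)])

def deflate(poly, root):
    """Divide poly (in powers of z^-1) by (1 - root * z^-1);
    the remainder is zero by construction."""
    out = [poly[0]]
    for c in poly[1:-1]:
        out.append(c + root * out[-1])
    return out

# p = 2 example with a stable all-pole model A(z) = 1 - 1.2 z^-1 + 0.6 z^-2
P, Q = lsp_polynomials([1.0, -1.2, 0.6])
quad_p = deflate(P, 1.0)    # P(z) contains the factor (1 - z^-1) for even p
quad_q = deflate(Q, -1.0)   # Q(z) contains the factor (1 + z^-1) for even p
w2 = math.acos(-quad_p[1] / 2.0)   # from 1 - 2 cos(w) z^-1 + z^-2
w1 = math.acos(-quad_q[1] / 2.0)
print(w1, w2)   # ordered 0 < w1 < w2 < pi
```

For larger even p the remaining factors are higher-order cosine polynomials and the {ω_i} are usually found by numerical root search rather than in closed form.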
Chapter 6
Speech Coding
the long-term prediction based on the long-term periodicity of the source, and the short-term prediction based on the correlation between adjacent samples (Atal and Schroeder, 1984). This method can be regarded as a modification of MPC in which the multipulses are replaced with vector-quantized random pulse sequences. Since each code vector is a random noise vector, L kinds of N-sample vectors can be stored as a single (L + N)-sample noise wave instead of being stored separately. The different N-sample code vectors are then extracted from the single vector by shifting the starting position sample by sample. Each code vector is thus represented by the position in the (L + N)-sample vector from which the N-sample sequence is extracted. Selection of the optimum N-sample vector is performed so as to minimize the perceptually weighted sum of the squared error between the synthetic speech wave and the original speech wave, as shown in Fig. 6.35 (Atal and Rabiner, 1986). The same (L + N)-sample vector is stored in the decoder, and the N-sample vector at the position indicated by the transmitted
FIG. 6.35 Search procedure for determining the best excitation code in CELP.
signal is extracted from the (L + N)-sample vector as the excitation signal. High-quality speech with a mean SNR of roughly 15 dB was reported to be obtained under the conditions of N = 40 (5 ms) and a bit rate of 0.25 bit/sample (10 bits/40 samples) (Schroeder and Atal, 1985). MPC and CELP are analysis-by-synthesis coders, which are essentially waveform-approximating coders because they produce an output waveform that closely follows the original waveform. (The minimization of the mean square error in the perceptual space via perceptual weighting causes a slight modification to the waveform-approximation principle.) This eliminates the old vocoder problem of having to classify a speech segment as voiced or unvoiced. Such a decision can never be made flawlessly, and many speech segments have both voiced and unvoiced properties. Recent vocoders have also found ways to eliminate the need for making the voiced/unvoiced decision. The multiband excitation (MBE) (Griffin and Lim, 1988) and sinusoidal transform coders (STC) (McAulay and Quatieri, 1986), also known as harmonic coders, divide the spectrum into a set of harmonic bands. Individual bands can be declared voiced or unvoiced. This allows the coder to produce a mixed signal: partially voiced and partially unvoiced. Mixed-excitation LPC (MELP) (Supplee et al., 1997) and waveform interpolation (WI) (Kleijn and Haagen, 1994) produce excitation signals that are a combination of periodic and noise-like components. These modern vocoders produce excellent-quality speech compared to their predecessors, the channel vocoder and the LPC vocoder. However, they are still less robust than higher-bit-rate waveform coders. Moreover, they are more affected by background noise and cannot code music well.
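The overlapped-codebook construction described earlier for CELP can be sketched as follows. The synthesis filter and perceptual weighting are omitted; only the gain-optimized squared-error search over shift-by-one code vectors is shown, with all parameter values hypothetical.

```python
import random

# Toy sketch of the CELP overlapped codebook: L code vectors of N samples
# are stored as one (L+N)-sample random sequence, and code vector i is the
# N-sample slice starting at position i. The search picks the (index, gain)
# pair minimizing the squared error to a target vector.

random.seed(0)
L, N = 64, 8
noise = [random.gauss(0.0, 1.0) for _ in range(L + N)]

def codevector(i):
    return noise[i:i + N]          # shift-by-one extraction

def search(target):
    best = None
    for i in range(L):
        c = codevector(i)
        energy = sum(x * x for x in c)
        gain = sum(t * x for t, x in zip(target, c)) / energy  # optimal gain
        err = sum((t - gain * x) ** 2 for t, x in zip(target, c))
        if best is None or err < best[0]:
            best = (err, i, gain)
    return best

err, idx, gain = search([1.0] * N)
print(idx, gain)
```

The storage saving is the point: L + N samples are kept instead of L * N, because adjacent code vectors share all but one sample.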
6.5.4 Coding by Phase Equalization and Variable-Rate Tree Coding
Speech coding methods can be classified into waveform coding and analysis-synthesis, with the difference between them being whether or not the sound source is modeled. For example, the excitation
information is compressed by quantizing the LPC residual in APC, whereas it is modeled dichotomously using either a periodic pulse train or a random noise source in an LPC vocoder. The residual waveform representation in an LPC vocoder can be considered a process of both whitening the short-time power spectrum of the prediction residual and modifying the short-time phase into zero phase or random phase. A phase modification process utilizing human perceptual insensitivity to short-time phase change is highly effective for bit rate reduction. Also with waveform coding, if the LPC residual can be modified into a pulselike wave, speech energy will be temporally localized, and, hence, coding efficiency can be increased by time-domain bit allocation. This is similar to the effectiveness of energy localization in the frequency domain, which increases the prediction gain. For this purpose, a highly efficient speech coding method has been proposed combining phase equalization in the time domain with variable-rate (time-domain bit allocation) tree coding (Moriya and Honda, 1986). Figure 6.36 shows a block diagram of this system. The phase equalization is realized through the matched filter principle. The characteristics of the phase equalization filter are determined to
FIG. 6.36 Block diagram of coder based on phase equalization of prediction residual waveform and variable-rate tree coding.
minimize the mean square error between the pseudoperiodic pulse train and the filter output for the residual signal. The impulse response of the phase equalization filter can be approximated as a time-reversed residual waveform under the assumption that adjacent samples are uncorrelated. The output residual of this filter is approximately zero-phased over a short period, and it becomes an impulse-train-like signal. The matched filter principle implies that the phase equalization filter corresponds to the filter which maximizes the amplitude at the pulse position under the fixed-gain condition. Examples of phase-equalized waveforms, shown in Fig. 6.37, clearly indicate that the residual signal is modified into an impulse-train-like signal by phase equalization.
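The matched-filter interpretation above can be illustrated numerically: filtering a (hypothetical) residual segment with its time-reversed version yields the segment's autocorrelation, whose largest value falls at zero lag, so the output is impulse-like.

```python
# Filtering a residual segment with its time reverse (the approximate
# impulse response of the phase equalization filter) produces the
# autocorrelation of the segment, which peaks at the zero-lag position.

def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

residual = [0.2, -0.5, 1.0, 0.3, -0.4, 0.1]   # toy residual segment
equalizer = residual[::-1]                     # time-reversed residual
out = convolve(residual, equalizer)
peak = max(range(len(out)), key=lambda i: abs(out[i]))
print(peak, len(out) // 2)   # the peak falls at the center (zero lag)
```

By the Cauchy-Schwarz inequality the zero-lag autocorrelation dominates every other lag, which is why the output concentrates its energy into a single pulse-like peak.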
FIG. 6.37 Examples of phase-equalization processing for original speech and residual signal by a female speaker.
In this method, the phase-equalized residual signal is coded by variable-rate tree coding (VRTC). Variable-rate coding is effective for signals with temporally localized energy. Tree coding is a method in which a tree structure of signals is used to search for the optimum excitation source signal sequence minimizing the error between the input speech signal and the coded output (Anderson and Bodie, 1975; Fehn and Noll, 1982). The tree coder in this system is constructed from a code generator having a variable-rate tree structure and an all-pole prediction filter. Each code in the code sequence minimizing the error between the phase-equalized speech wave and the coded output over several sample values is successively determined using a method similar to the A-b-S procedure. The number of bits R(n) and the quantization step size Δ(n) for each branch of the tree are allocated according to the temporal localization of the residual energy. Decoding is performed by exciting the all-pole filter with a residual signal which has been phase-equalized and tree-coded. Since the decoded speech waveform is processed by phase equalization, it is generally different from the original waveform. The coding method based on phase equalization not only provides an efficient method of speech waveform representation, but also makes possible a unified modeling of the speech waveform in the same framework including both waveform coding and the analysis-synthesis method. The latter capability is similar to that possible in the multipulse coding (MPC) method.
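As a deliberately minimal sketch of the tree-coding idea (not the actual VRTC algorithm), the following one-bit-per-sample code generator feeds a first-order all-pole filter and chooses each branch greedily. A real variable-rate tree coder searches several samples deep and adapts R(n) and the step size Δ(n) to the local residual energy; both refinements are omitted here, and all parameter values are hypothetical.

```python
# Degenerate tree coder: each branch of a binary tree carries an
# excitation sample of +step or -step, decoded through a first-order
# all-pole filter y[n] = a*y[n-1] + excitation. The branch minimizing
# the error to the target is chosen greedily, one sample at a time.

def tree_code(target, a=0.9, step=0.5):
    codes, y_prev, out = [], 0.0, []
    for t in target:
        # evaluate both one-sample branches of the code tree
        cands = [(abs(t - (a * y_prev + s)), s) for s in (step, -step)]
        err, s = min(cands)
        codes.append(1 if s > 0 else 0)
        y_prev = a * y_prev + s
        out.append(y_prev)
    return codes, out

target = [0.5, 0.9, 1.2, 1.0, 0.4, -0.2]
codes, decoded = tree_code(target)
print(codes)
```

Extending the search depth beyond one sample, and making the bit allocation and step size follow the residual energy envelope, turns this toy into something closer to the VRTC scheme described above.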
6.6 EVALUATION AND STANDARDIZATION OF CODING METHODS

6.6.1 Evaluation Factors of Speech Coding Systems
Speech coding has found a diverse range of applications, such as cellular telephony, voice mail, multimedia messaging, digital answering machines, packet telephony, audio-visual teleconferencing, and, of course, many other applications in the Internet arena.
Evaluation factors for speech coding systems include bit rate (amount of information in coded speech), coded speech quality (including robustness against noise and coding errors), complexity of the coder and decoder (usually a coder is more complex than a decoder), and coding delay. The cost of coding systems generally increases with their complexity. For most applications, speech coders are implemented either on special-purpose devices (such as DSP chips) or on general-purpose computers (such as a PC for Internet telephony). In either case, the important quantities are the number (millions) of instructions per second needed to operate in real time and the amount of memory used. Coding delays can be objectionable in two-way telephone conversations, especially when they are added to the existing delays in the transmission network and combined with uncanceled echoes. The practical limit of round-trip delays for telephony is about 300 ms. One component of the delay is due to the algorithm and the other to the computation time. Individual-sample coders have the lowest delay, while coders that work on a block or frame of samples have greater delay. Techniques for evaluating the quality of coded speech can be divided into subjective and objective evaluation techniques. Subjective evaluation includes opinion tests, pair comparison (sometimes called A-B) tests, and intelligibility tests. The former two methods measure subjective quality, including naturalness and ease of listening, whereas the latter method measures how accurately phonetic information can be transmitted. In the opinion tests, quality is measured by subjective scores (usually on five levels: 5 excellent, 4 good, 3 fair, 2 poor, and 1 bad). The mean opinion score (MOS) is then calculated as the mean value over the many listeners.
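The MOS computation just described is simply an average of listener scores on the five-level scale; the scores below are hypothetical.

```python
# Mean opinion score (MOS) over the five-level opinion scale described
# above (5 excellent, 4 good, 3 fair, 2 poor, 1 bad).

def mean_opinion_score(scores):
    assert all(1 <= s <= 5 for s in scores)
    return sum(scores) / len(scores)

listener_scores = [4, 5, 3, 4, 4, 2, 5, 4]   # one utterance, many listeners
print(mean_opinion_score(listener_scores))    # -> 3.875
```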
Since the MOS indicates only the relative quality within a set of test utterances, the opinion-equivalent SNR value has also been proposed to ensure that the MOS is properly related to the objective measures (Richards, 1973). The opinion-equivalent SNR indicates the signal-to-amplitude-correlated noise ratio of the reference signal which results in the same MOS as that for each test utterance. Amplitude-correlated noise is white noise which has been modified by the speech signal
amplitude in order to give it the same characteristics as the quantization noise. The energy ratio of the original signal to the modified noise is called the signal-to-amplitude-correlated noise ratio. In the pair comparison test, each test utterance is compared with various other utterances, and the probability that the test utterance is judged to be better than the other utterances is calculated as the preference score. Intelligibility is measured using the correct identification scores for sentences, words, syllables, or phonemes (vowels and consonants). Analyzing the relationship between syllable and sentence intelligibility indicates that when the syllable identification (articulation) score exceeds 75%, the sentence intelligibility score approaches 100%. Intelligibility is often indicated by the AEN (articulation-equivalent loss), the calculation of which is based on the identification (articulation) score (Richards, 1973). The AEN is the difference in transmission losses between the system to be measured and the reference system when the phoneme identification scores for both systems are 80%. In the calculation, the reference system is adjusted to reproduce the acoustic transmission characteristics between two people facing each other at a distance of 1 m in a free field. Importantly, the AEN values are more stable than the raw identification scores. Although definitive evaluation of coding methods should be performed by human listeners, the subjective tests require a great deal of labor and time. Therefore, it is practical to build objective evaluation methods producing results which correspond well with the subjective evaluation results. Among the various objective measures proposed, the most fundamental is the SNR. Similar to this measure is the segmental SNR (SNR_seg), which is the SNR measured in dB over short periods, such as 30 ms, and averaged over a long speech interval.
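The segmental SNR can be sketched as below, using a toy sinusoid and a hypothetical frame length of 30 samples standing in for a 30-ms frame; frames with zero signal or error energy are skipped to keep the logarithm defined.

```python
import math

# Segmental SNR (SNRseg): the SNR in dB is computed over short frames and
# then averaged, so low-amplitude frames contribute as much as loud ones.

def snr_seg(original, coded, frame_len):
    snrs = []
    for i in range(0, len(original) - frame_len + 1, frame_len):
        sig = sum(x * x for x in original[i:i + frame_len])
        err = sum((x - y) ** 2
                  for x, y in zip(original[i:i + frame_len],
                                  coded[i:i + frame_len]))
        if sig > 0 and err > 0:
            snrs.append(10.0 * math.log10(sig / err))
    return sum(snrs) / len(snrs)

# Toy signal and a "coded" version with a small additive error
orig = [math.sin(0.1 * n) for n in range(240)]
coded = [x + 0.01 * ((-1) ** n) for n, x in enumerate(orig)]
print(round(snr_seg(orig, coded, 30), 1))
```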
The SNR_seg corresponds better with the subjective values than does the SNR, since the short-term SNRs of even relatively small-amplitude periods contribute to this value. In addition to this time-domain evaluation, spectral-domain evaluation methods have also been proposed. These methods are based on spectral distortion measured using various parameters
such as the spectrum, predictor coefficients, autocorrelation function, and cepstrum. The most typical method uses the cepstral distance measure, defined as

CD = Db [ 2 Σ_{i=1}^{p} ( c_i^{(in)} - c_i^{(out)} )^2 ]^{1/2}        (6.22)

where c_i^{(in)} and c_i^{(out)} are the cepstral or LPC cepstral coefficients for the input and output signals of the coder, and Db is the constant for transforming the distance value into a dB value (Db = 10/ln 10) (Kitawaki et al., 1982). Subjective evaluation results using the MOS for various coding methods verify that the CD corresponds better to the subjective measure than does the SNR_seg. The relationship between the CD and MOS is demonstrated in Fig. 6.38, in which the regression equation between them obtained from the experiments is indicated by a quadratic curve. The standard deviation of the evaluation values from the
FIG. 6.38 Relationship between CD and MOS for various coding methods (PCM, ADPCM, ADM, APC, ATC, APC-AB).
regression curve is 0.18. These results indicate that quality equivalent to that of 7-bit log PCM can be obtained by 32-kbps ADPCM or by 16-kbps APC-AB and ATC. The objective and subjective measures do not correspond well in several cases, such as in systems incorporating noise shaping. A universal objective measure which can be applied to all kinds of coding systems has not yet been established. Table 6.2 compares the trade-offs incurred in using representative types of speech coding algorithms. The algorithms must be evaluated based on a total measure constructed from an appropriately weighted combination of these factors. A broader range of coding methods, from high-quality coding to very-low-bit-rate coding, is now being investigated in order to meet the expected demands. Digital network telephony generally operates at 64 kbps, cellular systems run from 5.6 to 13 kbps, and secure telephony functions at 2.4 and 4.8 kbps. High-quality coding transmits not only speech but also wideband signals such as music at a rate of 64 kbps. Very-low-bit-rate coding under investigation fully utilizes the speech characteristics to transmit speech signals at 200 to 300 bps. The evaluation methods for these coding techniques, specifically the weighting factors for combining evaluation factors, must be determined depending on their bit rates and application purposes. A crucial future problem is how best to measure the individuality and naturalness of coded speech.
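The cepstral distance of Eq. (6.22) above can be sketched as follows, with hypothetical cepstral coefficient vectors for the coder input and output.

```python
import math

# Cepstral distance (CD) of Eq. (6.22): a Euclidean distance between
# cepstral coefficient vectors of the coder input and output, scaled by
# Db = 10/ln 10 to yield a dB value.

DB = 10.0 / math.log(10.0)

def cepstral_distance(c_in, c_out):
    return DB * math.sqrt(2.0 * sum((a - b) ** 2
                                    for a, b in zip(c_in, c_out)))

c_in = [1.2, -0.4, 0.25, -0.1]     # hypothetical input cepstrum
c_out = [1.1, -0.35, 0.20, -0.12]  # hypothetical output cepstrum
print(round(cepstral_distance(c_in, c_out), 3))
```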
6.6.2 Speech Coding Standards
For speech coding to be useful in telecommunication applications, it has to be standardized (i.e., it must conform to the same algorithm and bit format) to ensure universal interoperability. Speech-coding standards are established by various standards organizations: for example, ITU-T (International Telecommunication Union, Telecommunication Standardization Sector, formerly CCITT), TIA (Telecommunications Industry Association), RCR
TABLE 6.2 Comparison of trade-offs for representative types of speech coding algorithms.
(Research and Development Center for Radio Systems) in Japan, ETSI (European Telecommunications Standards Institute), and other government agencies (Childers et al., 1998). Figure 6.39 summarizes the trend in standardization at ITU-T, as well as gives examples of standardized coding for digital cellular phones. The figure also exemplifies analysis-synthesis systems. Since CELP can achieve relatively high coding quality at bit rates ranging from 4 to 16 kbps, CELP-based coders have been adopted in a wide range of recent standards. The LD-CELP (low-delay CELP), CS-ACELP (conjugate structure algebraic CELP), VSELP (vector sum excited linear prediction), and PSI-CELP (pitch synchronous innovation CELP) coders in the figure are CELP-based. The principal points of each are summarized as follows.

LD-CELP
LD-CELP was standardized by the ITU-T for use in integrated services digital networks (ISDN). Figure 6.40 shows the search procedure for determining the best excitation code in LD-CELP (Chen et al., 1990). The key feature of this coding system is its short system delay (2 ms), which is achieved by using a short block length for the speech and the backward prediction technique instead of the forward prediction used in conventional CELP. The order of prediction is around 50, covering the pitch period range, which is five times longer than that in conventional CELP.

CS-ACELP
The key features of the CS-ACELP system are its conjugate codebook structure in the excitation source generator and its shorter system delay (the round-trip delay is less than 32 ms) compared with conventional CELP (Kataoka et al., 1993). The conjugate structure reduces memory requirements and enhances robustness against transmission errors. The shorter system delay is achieved by using
backward prediction, similar to LD-CELP. The excitation source is efficiently represented by an algebraic coding structure. 8-kbps CS-ACELP has coded speech quality equivalent to 32-kbps ADPCM, and has been used in personal handyphone systems (PHS) in Japan.

VSELP
In VSELP, as shown in Fig. 6.41, the excitation source is generated by a linear combination of several fixed basis vectors; this enhances robustness against channel errors (Gerson and Jasiuk, 1990). Although a one-bit transmission error in the excitation source vector index produces a completely different vector in conventional CELP, only the inversion of one basis vector occurs in VSELP, and its effect is much smaller. In addition, an efficient multi-stage vector quantization technique is employed to speed up the codebook search. Complexity and memory requirements are significantly reduced by VSELP. VSELP has been standardized for the full-bit-rate (11.2 kbps in Japan and 13 kbps in North America, including error protection bits) system for digital cellular and portable telephone systems.

PSI-CELP
The PSI-CELP algorithm, shown in Fig. 6.42, has two important features: 1) the random excitation vectors in the excitation source generator are given pitch periodicity for voiced speech by pitch synchronization, and 2) the codebook has a two-channel conjugate structure (Miki et al., 1993). The pitch synchronization algorithm using an adaptive codebook reduces quantization noise without losing naturalness at low bit rates. In particular, this significantly improves voiced speech quality. The two-channel conjugate structure and a fixed codebook for transient speech signals have been proposed to reduce memory requirements and enhance robustness against channel errors. This conjugate structure is made
by selecting the best combination of code vectors from well-organized codebooks to minimize the distortion resulting from summing the two codebooks. PSI-CELP has been adopted as the standard in Japan for a half-rate (3.45 kbps for speech + error protection = 5.6 kbps) digital cellular mobile radio system. Its quality at the half bit rate nearly equals or exceeds that of VSELP at the full bit rate. However, the amount of processing and the codec system delay of the former are about twice those of the latter.
6.7 ROBUST AND FLEXIBLE SPEECH CODING
Most of the low-bit-rate speech coders designed in the past implicitly assume that the signal is generated by a speaker without much interference. These coders often demonstrate degradation in quality when used in an environment in which there is competing speech or background noise, including music. A recent research challenge is to make coders perform robustly under a wide range of conditions, including noisy automobile environments (Childers et al., 1998). From the application point of view, it is useful if a common coder performs well for both speech and music. Another challenge is the coder's resistance to transmission errors, which are particularly critical in cellular and packet communication applications. Methods that combine source and channel coding schemes or that conceal errors are important in enhancing the usefulness of the coding system. As packet networking becomes more and more prevalent, a new breed of speech coders is emerging. These coders need to take into account and negotiate for the available network resources (unlike the existing digital telephony hierarchy, in which a constant bit rate per channel is guaranteed) in order to determine the right coder to use. They also have to be able to deal with packet losses (severe at times). For this reason, the idea of embedded and scalable (in terms of bit rates) coders is being investigated with considerable interest (Elder, 1997).
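As one simple illustration of the error-concealment idea mentioned above, the toy sketch below repeats the most recently received frame with a decaying gain when a packet is lost; practical concealment algorithms instead use pitch-synchronous waveform repetition, and all values here are hypothetical.

```python
# Toy packet-loss concealment: a lost frame (None) is replaced by the
# previous frame attenuated by a fixed factor, so repeated losses fade
# out rather than repeating at full level.

def conceal(frames, attenuation=0.5):
    """frames: list of sample lists, with None marking a lost packet."""
    out, last = [], None
    for f in frames:
        if f is None:
            last = [attenuation * x for x in (last or [])]
            out.append(list(last))
        else:
            last = list(f)
            out.append(list(f))
    return out

received = [[0.5, -0.5], [0.8, -0.8], None, None, [0.1, 0.2]]
print(conceal(received))
```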
Speech Synthesis
7.1 PRINCIPLES OF SPEECH SYNTHESIS

Speech synthesis is a process which artificially produces speech for various applications, diminishing the dependence on using a person's recorded voice. Speech synthesis methods enable a machine to pass on instructions or information to the user through 'speaking.' The applications include information supply services over the telephone, such as banking and directory services; various reservation services; public announcements, such as those at train stations; reading out manuscripts for collation; reading e-mails, faxes, and web pages over the telephone; voice output in automatic translation systems; and special equipment for handicapped people, such as word processors with reading-out capability, book-reading aids for visually handicapped people, and speaking aids for vocally handicapped people. As already mentioned, progress in LSI/computer technology and LPC techniques has collectively helped to advance speech synthesis research. Moreover, information supply services are now available in a wider range of application fields. Speech synthesis
Chapter 7
research is closely related to research into deriving the basic units of information carried in speech waves and into the speech production mechanism. Voice response technology designed to convey messages via synthesized speech presents several advantages for information transmission:

Anybody can easily understand the message without training or intense concentration;

The message can be received even when the listener is involved in other activities, such as walking, handling an object, or looking at something;

The conventional telephone network can be used to realize easy, remote access to information; and

This form of messaging is essentially a paper-free communication form.

The last 'advantage' also means, however, that there is no hard copy of the messages, which makes them difficult to scan. Thus, synthesized speech is sometimes inappropriate for conveying a large amount of complicated information to many people. History's first speech synthesizer is said to have been constructed in 1779, more than 200 years ago. Figure 7.1 shows the structure of the speech synthesizer subsequently produced by von Kempelen in 1791 (Flanagan, 1972). This synthesizer, the first of its kind capable of producing both vowels and consonants, was intended to simulate the human articulatory organs. Sounds originating through the vibration of reeds were modulated by the resonance of a leather tube and radiated as a speech wave. Fricative sounds were produced through the 'S' and 'SH' whistles. This synthesizer is purported to have been able to produce words consisting of up to 19 consonants and 5 vowels. Early mechanically structured speech synthesizers, of course, could not generate high-quality synthesized speech, since it was difficult to continuously and rapidly change the vocal tract shape.
FIG. 7.1 Structure of the speech synthesizer produced by von Kempelen.
The first synthesizer incorporating an electric structure was made in 1922 by J. Q. Stewart. Two coupled resonant electric circuits were excited by a current interrupted at a rate analogous to the voice pitch. By carefully tuning the circuits, sustained vowels could be produced by this synthesizer. The first synthesizer which actually succeeded in generating continuous speech was the voder, constructed by H. Dudley in 1939. It produced continuous speech by controlling the fundamental period and band-pass filter characteristics, respectively, using a foot pedal and 10 finger keys. The voder, which later served as the prototype of the speech synthesizer for the vocoder introduced in Sec. 4.6.2, became a principal foundation block for recent speech synthesis research. The voder structure, based on the linear separable equivalent circuit model, is still used in present speech synthesizers. Present speech synthesis methods can be divided into three types:

1) Synthesis based on waveform coding, in which speech waves of recorded human voice, stored after waveform coding or immediately after recording, are used to produce desired messages;

2) Synthesis based on the analysis-synthesis method, in which speech waves of recorded human voice are transformed into parameter sequences by the analysis-synthesis method and stored, with a speech synthesizer being driven by concatenated parameters to produce messages; and

3) Synthesis by rule, in which speech is produced based on phonetic and linguistic rules from letter sequences or sequences of phoneme symbols and prosodic features.

The principles of these three methods and a comparison of their features are presented in Fig. 7.2 and Table 7.1, respectively. Synthesis systems based on the waveform coding method are simple and provide high-quality speech, but they also exhibit low versatility; that is, the messages can only be used in the form recorded. At the other extreme, synthesis-by-rule systems feature
FIG. 7.2 Basic principles of three speech synthesis methods.
great versatility but are also highly complex and, as yet, of limited quality. In practical cases, it is desirable to select the method most appropriate for the objectives, fully taking the performance and properties of each method into consideration. The details of each method will be discussed in the following.
7.2 SYNTHESIS BASED ON WAVEFORM CODING
As mentioned, synthesis based on waveform coding is the method by which short segmental units of human voice, typically words or
TABLE 7.1 Comparison of the features of the three speech synthesis methods.
phrases, are stored, and the desired sentence speech is synthesized by selecting and connecting the appropriate units. In this method, the quality of synthesized sentence speech is generally influenced by the continuity of acoustic features at the connections between units. Acoustic features include the spectral envelope, amplitude, fundamental frequency, and speaking rate. If large units such as phrases or sentences are stored and used, the quality (intelligibility and naturalness) of synthesized speech is better, although the variety of words or sentences which can be synthesized is restricted. On the other hand, when small units such as syllables or phonemes are used, a wide range of words and sentences can be synthesized, but the speech quality is largely degraded. In practical systems typically available at present, words and phrases are stored, and words are inserted or connected with phrases to produce the desired sentence speech. Since the pitch pattern of each word changes according to its position in differing sentences, it is necessary to store variations of the same words with rising, flat, and falling inflections. The inflection selected also depends on whether the sentence represents a question, statement, or exclamation. Two major problems exist in simply concatenating words to produce sentences (Klatt, 1987). First, a spoken sentence is very different from a sequence of words uttered in isolation. In a sentence, words can be as short as half their duration when spoken in isolation, making concatenated speech seem painfully slow. Second, the sentence stress pattern, rhythm, and intonation, which are dependent on syntactic and semantic factors, are disruptively unnatural when words are simply concatenated, even if several variations of the same word are stored. In order to resolve such problems, synthesis methods concatenating phoneme units have recently been widely employed.
The acceleration of computer processing and the reduction in memory prices are advancing these methods. In these methods, a large number of phoneme units or sub-phoneme (shorter than phoneme) units corresponding to allophones and pitch variations are stored, and the most appropriate units are selected based on rules and evaluation measures and are concatenated to synthesize speech. Several methods have been developed for overlapping and
adding pitch-length speech waves according to the pitch period of the speech being synthesized, as well as various methods of controlling prosodic features by iterating or thinning out the pitch waveforms. These methods can synthesize unrestricted sentences even though the units are stored as speech waveforms. Typical examples of such methods include TD-PSOLA and HNM, described in the following. In order to reduce memory size requirements, the units are sometimes compressed by waveform coding methods such as ADPCM rather than simply stored as analog or digital speech waves. Synthesis derived from the analysis-synthesis method, which will be discussed in Section 7.3, is considered to be an advanced form of this method from the viewpoint of its information reduction and controllability.

TD-PSOLA
The TD-PSOLA (Time Domain Pitch Synchronous OverLap Add) method (Moulines and Charpentier, 1990) is currently one of the most popular pitch-synchronous waveform concatenation methods. This method relies on the speech production model described by the sinusoidal framework. The 'analysis' part consists of extracting short-time analysis signals by multiplying the speech waveform by a sequence of time-translated analysis windows. The analysis windows are located around glottal closure instants, and their length is proportional to the local pitch period. During unvoiced frames the analysis time instants are set at a constant rate. During the 'synthesis' process, a mapping between the synthesis time instants and analysis time instants is determined according to the desired prosodic modifications. This process specifies which of the short-time analysis signals will be eliminated or duplicated in order to form the final synthetic signal.

HNM

The HNM (Harmonic plus Noise Model) method (Laroche et al., 1993) is based on a pitch-synchronous harmonic-plus-noise representation
of the speech signal. The spectrum is divided into two bands, with the low band represented solely by harmonically related sine waves having slowly varying amplitudes and frequencies:

h(t) = \sum_{k=1}^{K(t)} A_k(t) \cos(k\theta(t) + \phi_k(t))

with \theta(t) = \int_0^t \omega_0(l) dl. A_k(t) and \phi_k(t) are the amplitude and phase at time t of the kth harmonic, \omega_0(t) is the fundamental frequency, and K(t) is the time-varying number of harmonics included in the harmonic part. The frequency content of the high band is modeled by a time-varying AR model; its time-domain structure is represented by a piecewise linear energy-envelope function. The noise part, n(t), is therefore assumed to have been obtained by filtering a white Gaussian noise b(t) by a time-varying, normalized all-pole filter h(\tau, t) and multiplying the result by an energy envelope function w(t), such that

n(t) = w(t) [h(\tau, t) * b(t)]
A time-varying parameter referred to as the maximum voiced frequency determines the limit between the two bands. During unvoiced frames the maximum voiced frequency is set to zero. At synthesis time, HNM frames are concatenated and the prosody of the units is altered according to the desired prosody.
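The harmonic part of the model above can be sketched directly from its defining equation. The following minimal Python sketch (all names are illustrative, not from any HNM toolkit) accumulates \theta(t) as a discrete integral of the fundamental frequency and sums the K harmonics sample by sample:

```python
import math

def harmonic_part(f0, amps, fs, phases=None):
    """Sketch of the HNM harmonic (low-band) part:
    h(t) = sum_k A_k(t) cos(k*theta(t) + phi_k), theta(t) = integral of omega0.
    f0: per-sample fundamental frequency in Hz; amps: per-sample lists of
    harmonic amplitudes A_k; fs: sampling rate in Hz."""
    K = len(amps[0])
    phases = phases or [0.0] * K
    theta = 0.0
    out = []
    for t in range(len(f0)):
        theta += 2 * math.pi * f0[t] / fs  # discrete integral of omega0
        out.append(sum(amps[t][k - 1] * math.cos(k * theta + phases[k - 1])
                       for k in range(1, K + 1)))
    return out
```

With a constant 100-Hz fundamental at an 8-kHz sampling rate and a single unit-amplitude harmonic, the output is a cosine with an 80-sample period; the noise part n(t) would be generated separately and added in the high band.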
7.3 SYNTHESIS BASED ON ANALYSIS-SYNTHESIS METHOD
In synthesis derived from the analysis-synthesis method, words or phrases of human speech are analyzed based on the speech production model and stored as time sequences of feature parameters. Parameter sequences of appropriate units are connected
and supplied to a speech synthesizer to produce the desired spoken message. Since the units are stored as source and spectral envelope parameters, the amount of information is much less than with the previous method of storing waveforms, although the naturalness of the synthesized speech is slightly degraded. Additionally, this method is advantageous in that changing the speaking rate and smoothing the pitch and spectral change at connections can be performed by controlling the parameters. Channel vocoders and speech synthesizers based on LPC analysis methods, such as the LSP and PARCOR methods, or on cepstral analysis methods, are used for this purpose. Phoneme-based speech synthesis can also be implemented by the analysis-synthesis method, in which the feature parameter vector sequence of each allophone is stored or produced by a model. A method has recently been developed using HMMs (hidden Markov models) to model the feature parameter production process for each allophone. In this method, a parameter vector sequence consisting of cepstra and delta-cepstra for a desired sentence is automatically produced by a concatenation of allophone HMMs based on the likelihood maximization criterion. Since delta-cepstra are taken into account in the likelihood maximization process, a smooth parameter sequence is obtained (Tokuda et al., 1995).
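The smoothing effect of the delta constraints can be illustrated with a toy least-squares version of this idea. The sketch below is hypothetical (one-dimensional, unit variances; not the actual algorithm of Tokuda et al.): it finds a sequence whose static values track the per-frame static means while its frame-to-frame differences track the delta means, which is what forces the trajectory to be smooth.

```python
def generate_trajectory(static_means, delta_means, iters=2000, lr=0.05):
    """Toy 1-D parameter generation: minimize
    sum_t (c[t]-static_means[t])^2 + sum_{t>0} ((c[t]-c[t-1])-delta_means[t])^2
    by gradient descent (the ML solution under unit variances)."""
    n = len(static_means)
    c = list(static_means)  # initialize at the static means
    for _ in range(iters):
        grad = [2 * (c[t] - static_means[t]) for t in range(n)]
        for t in range(1, n):
            d = (c[t] - c[t - 1]) - delta_means[t]
            grad[t] += 2 * d
            grad[t - 1] -= 2 * d
        for t in range(n):
            c[t] -= lr * grad[t]
    return c
```

For static means [0, 0, 1, 1] and zero delta means, the abrupt 0-to-1 step is smoothed into a gradual transition, which is exactly the effect described above.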
7.4 SYNTHESIS BASED ON SPEECH PRODUCTION MECHANISM
Two methods are capable of producing speech by electroacoustically replicating the speech production mechanism. One is the vocal tract analog method, which simulates the acoustic wave propagation in the vocal tract. The other is the terminal analog method, which simulates the frequency spectrum structure, that is, the resonance and antiresonance characteristics, and thereby reproduces articulation as a result. Although in the early years these methods were realized by analog processing using analog computers or variable resonance circuits, most recent systems use digital
processing owing to advances in digital circuits and computers and to their ease of control.
7.4.1 Vocal Tract Analog Method
The vocal tract analog method is based on the principle described in Sec. 3.3. More specifically, the vocal tract is represented by a cascade connection of straight tubes with various cross-sectional areas, each of which has a short length Δx. The acoustic waves in the tubes are separated into forward and backward waves. Acoustic wave propagation in the vocal tract is represented by the integration of reflection and penetration of forward and backward waves at each boundary between adjacent tubes. The amount of reflection and penetration at each boundary is determined by the reflection coefficient, which indicates the amount of mismatch in acoustic impedance. The signal processing for speech synthesis based on this principle was detailed previously in Fig. 3.4. A method has also been investigated in which vocal tract characteristics are simulated by a cascade connection of π-type four-terminal circuits, each of which consists of L- and C-elements. The circuit is terminated by another circuit having a series of L- and R-elements, which is equivalent to the radiation impedance at the lips. The vocal tract model is excited by a pulse generator at the input terminal of the π-type circuit for voiced sounds, and by a white noise generator connected to a four-terminal circuit where turbulent noise is produced for consonants. Rather than remaining with the modeling of the vocal tract area function, it would be better to take the next, more difficult step and directly formulate a model based on the structure of the articulatory organs. In such a modeling system, which is called the articulatory model, the locations and shapes of the articulatory organs are used as control parameters for speech synthesis. In this method, synthesis rules are expected to be much clearer, since the articulatory movements of the organs can be directly described
and controlled. In an example speech synthesis system based on this method (Coker et al., 1978), the glottal area, the gap between the velum and pharynx, the tongue location, the shape of the tongue tip, the jaw opening, and the amount of narrowing and protruding of the lips are controlled to produce speech. The speech synthesizer based on the vocal tract analog method is considered to be particularly effective in synthesizing transitional sounds such as consonants, since it can precisely simulate the dynamic manner of articulation in the vocal tract. Additionally, this method is considered to be easily related to the phonetic information conveyed by the speech wave. High-quality synthesized speech has not yet been obtained, however, since the movement of the articulatory organs has not been sufficiently clarified to offer suitable control rules.

7.4.2 Terminal Analog Method
The terminal analog method simulates the speech production mechanism using an electrical structure consisting of the cascade or parallel connection of several resonance (formant) and antiresonance (antiformant) circuits. The resonance or antiresonance frequency and bandwidth of each circuit are variable. This method is also called the formant-type synthesis method. As indicated in Sec. 3.3.2 (resonance model), the complex frequency characteristics (Laplace transform) of a resonance (pole) circuit can be represented as
where s = -\sigma + j\omega
Digital simulation of this circuit can be represented through its z-transform
where
T is the sampling period, and Res[ ] indicates the residue. These equations imply that the digital simulation circuit can be represented as shown in Fig. 7.3(a). When the resonance frequency f_i = \omega_i/2\pi [Hz] and bandwidth b_i = \sigma_i/\pi [Hz] are given, the circuit parameters can be obtained. The antiresonance (zero) circuit indicated in Fig. 7.3(b) can easily be obtained from the resonance circuit, based on the inverse circuit relationships. Here, k_i = \omega_i^2/(\sigma_i^2 + \omega_i^2). The cascade connection of resonance and antiresonance circuits is advantageous in that the mutual amplitude ratios between formants and antiformants are automatically determined. This is feasible because the vocal tract transmission characteristics can be directly represented by this method. On the other hand, parallel connection is advantageous in that the final spectral shape can be precisely simulated. Such precise simulation is made possible by the fact that the amplitude of each formant and antiformant can be represented independently, even though this method does not directly indicate the vocal tract transmission characteristics. Therefore, cascade connection is suitable for vowel speech having a clear spectral structure, and parallel connection is best suited
FIG. 7.3 Digital simulation of resonance and antiresonance circuits: (a) resonance (pole) circuit; (b) antiresonance (zero) circuit.
for nasal and fricative sounds, which feature such complicated spectral structures that their pole and zero structures cannot be extracted easily. Figure 7.4 shows a typical example of the structure of a synthesizer constructed based on these considerations (Klatt, 1980).
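The resonance circuit of Fig. 7.3(a) can be sketched in a few lines. The difference-equation coefficients below follow the standard formant-resonator parameterization from center frequency and bandwidth (a simplified sketch; the function and variable names are illustrative only):

```python
import math

def resonator(x, f, bw, fs):
    """Second-order digital resonance (pole) circuit with resonance
    frequency f [Hz] and bandwidth bw [Hz]:
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    T = 1.0 / fs
    C = -math.exp(-2 * math.pi * bw * T)
    B = 2 * math.exp(-math.pi * bw * T) * math.cos(2 * math.pi * f * T)
    A = 1 - B - C  # normalizes the gain at zero frequency to unity
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = A * v + B * y1 + C * y2
        out.append(y)
        y1, y2 = y, y1
    return out
```

A cascade connection is simply repeated application, e.g. resonator(resonator(src, 500, 60, fs), 1500, 90, fs), while a parallel connection would sum the outputs of independently weighted resonators, matching the trade-off discussed above.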
7.5 SYNTHESIS BY RULE

7.5.1 Principles of Synthesis by Rule

Synthesis by rule is a method for producing any words or sentences based on sequences of phonetic/syllabic symbols or letters. In this
method, feature parameters for fundamental small units of speech, such as syllables, phonemes, or one-pitch-period speech, are stored and connected by rules. At the same time, prosodic features such as pitch and amplitude are also controlled by rules. The quality of the fundamental units for synthesis, as well as the control rules (control information and control mechanisms) for acoustic parameters, play crucially important roles in this method, and they must be based on the phonetic and linguistic characteristics of natural speech. Furthermore, to produce natural and distinct speech, temporal transitions of pitch, stress, and spectrum must be smooth, and other features such as pause locations and durations must be appropriate. Vocal tract analog, terminal analog, and LPC speech synthesizers used to be widely employed for speech production. As described in Section 7.2, waveform-based methods have recently become very popular. Feature parameters for fundamental units are extracted from natural speech or artificially created. When phonemes are taken as the fundamental units for speech production, the memory capacity can be greatly reduced, since the number of phonemes is generally between 30 and 50. However, the rules for connecting phonemes are so complicated that high-quality speech is hard to obtain. Therefore, units larger than phonemes, such as allophone (context-dependent phoneme) units, are frequently used. In the latter case, thousands or tens of thousands of units are necessary for synthesizing high-quality speech. For the Japanese language, 100 CV syllables (C is a consonant, V is a vowel) corresponding to symbols in the Japanese 'Kana' syllabary are often used as these units. CVC units have also been employed to obtain high-quality speech (Sato, 1984a). The number of CVC syllables appearing in Japanese is very large, being somewhere between 5000 and 6000. Thus, combinations of roughly 1000 CVC syllables frequently appearing in Japanese, along with roughly 200 CV/VC syllables, have been used to synthesize Japanese sentences.
Combinations of between 700 and 800 VCV units have also been attempted (Sato, 1978).
For example, the Japanese word 'sakura,' or cherry blossom, can be represented by the concatenation of these units as:

CV units: sa + ku + ra
CVC units: sak + kur + ra
VCV units: sa + aku + ura
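The three decompositions above can be reproduced with a toy sketch, assuming the word is given as a list of CV morae (the function names are invented for illustration):

```python
def cv_units(morae):
    # CV units are the morae themselves: sa + ku + ra
    return list(morae)

def cvc_units(morae):
    # each unit extends through the following consonant: sak + kur + ra
    return [m + (morae[i + 1][0] if i + 1 < len(morae) else '')
            for i, m in enumerate(morae)]

def vcv_units(morae):
    # units overlap at the vowel steady parts: sa + aku + ura
    return [morae[0]] + [prev[-1] + cur
                         for prev, cur in zip(morae, morae[1:])]
```

For instance, cvc_units(['sa', 'ku', 'ra']) gives ['sak', 'kur', 'ra'] and vcv_units(['sa', 'ku', 'ra']) gives ['sa', 'aku', 'ura'], matching the 'sakura' example.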
CVC units are connected at consonants, and VCV units at vowel steady parts. Each method presents its own advantages in ease of connection. In contrast, the English language has more than 3500 syllables, which expand to roughly 10,000 when allophones (phonological variations) are taken into consideration. Therefore, syllables are usually decomposed into smaller units, such as dyads or diphones (both have roughly 400 to 1000 units; Dixon and Maxey, 1968), or demisyllables (roughly 1000 units; Lovins et al., 1979). These units basically consist of individual phonemes and the transitions between neighboring phonemes. Although demisyllables are slightly larger than the other two units, all of these units are composed in such a way that they may be concatenated using simple rules. In phoneme-based systems (Klatt, 1987), synthesis begins by selecting targets for each control parameter for each phonetic segment. Targets are sometimes modified by rules that take into account features of neighboring segments. Transitions between targets are then computed according to rules that range in complexity from simple smoothing to a fairly complicated implementation of the locus theory. Most smoothing interactions involve segments adjacent to one another, but the rules also provide for articulatory/acoustic interaction effects that span more than the adjacent segment. Since these rules are still very difficult to build, synthesis methods concatenating context-dependent phoneme units are now widely used, as described in Secs. 7.2 and 7.3. Control parameters for intonation, accent, stress, pauses, and duration used to be manually input into the system in order to synthesize high-quality sentence speech. Because of the difficulty of
inputting these parameters, however, text-to-speech conversion, in which these control parameters are automatically produced from letter sequences, has been introduced. Such a system realizes the human ability of reading written texts, that is, converting unrestricted text to speech. This is essentially the ultimate goal of speech synthesis. Building such a text-to-speech conversion system, though, necessitates clarifying how people understand sentences using knowledge of syntax and semantics. To be totally effective, this process of understanding must then be converted into computer programs. The principles of text-to-speech conversion are described in Sec. 7.6.

7.5.2 Control of Prosodic Features
Among prosodic features, intonation and accent are the most important in improving the quality of synthesized speech. Fundamental frequency, loudness, and duration are related to these features. In the period of speech between pauses, that is, the period of speech uttered in one breath, pitch frequency is usually high at the onset and gradually decreases toward the end due to the decrease in subglottal pressure. This characteristic is called the basic intonation component. The pitch pattern of each sentence is produced by adding the accent components of the pitch pattern to this basic intonation component. The accent components are determined by the accent position for each word or syllable. Figure 7.5 shows an example of the pitch pattern production mechanism for a spoken Japanese sentence, in which the pitch pattern is expressed by the superposition of phrase components and accent components (Sagisaka, 1998). The accent component for each phrase is finally determined according to the syntactic relationships existing between phrases. In a successful speech synthesis system for English (Klatt, 1987), the pitch pattern is modeled in terms of impulse and step commands fed to a linear smoothing filter. A step rise is placed near the start of the first stressed vowel in accordance with the 'hat theory' of intonation. A step fall is placed near the start of the final stressed vowel. These rises and falls set off syntactic units. Stress is
also manifested in this rule system by causing an additional local rise on stressed vowels using the impulse commands. The amount of rise is greatest for the first stressed vowel of a syntactic unit, and smaller thereafter. Finally, small local influences of phonetic segments are added by positioning commands to simulate the rises for voiceless consonants and high vowels. A gradual declination line (the basic intonation component) is also included in the inputs to the smoothing filter. The top portion of Fig. 7.6 shows three typical clause-final intonation patterns, and the bottom portion exemplifies a pitch 'hat pattern' of rises and falls between the brim and top of the hat for a two-clause sentence. An example of the step and impulse commands for an English sentence, as well as the pitch pattern generated by these commands and the rules, are given in Fig. 7.7.
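The command-and-filter scheme can be sketched numerically. In the toy version below (parameter names and values are invented for illustration, not Klatt's actual rule values), step commands produce persistent rises and falls, impulse commands produce local accent rises, and both pass through a one-pole linear smoothing filter before being added to a declination line:

```python
import math

def pitch_contour(n, steps, impulses, f0_base=120.0, decline=-10.0, tau=30.0):
    """Klatt-style F0 generation sketch: steps and impulses map
    frame index -> command size in Hz; tau is the smoothing time
    constant in frames."""
    # build the raw command track
    cmd, level = [], 0.0
    for t in range(n):
        level += steps.get(t, 0.0)                # step: persistent rise/fall
        cmd.append(level + impulses.get(t, 0.0))  # impulse: local accent rise
    # one-pole low-pass as the linear smoothing filter, plus declination
    a = math.exp(-1.0 / tau)
    f0, y = [], 0.0
    for t in range(n):
        y = a * y + (1 - a) * cmd[t]
        f0.append(f0_base + decline * t / n + y)
    return f0
```

A step rise near the first stressed vowel and a step fall near the final stressed vowel create the brim-and-top 'hat pattern' of Fig. 7.6; an impulse between them adds a local stress rise.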
FIG. 7.6 Three typical clause-final intonation patterns (final fall, question rise, and fall-rise continuum) (top), and an example of a pitch "hat pattern" of rises and falls (bottom).
Duration control for each phoneme is also an important issue in synthesizing high-quality speech. The duration of each phoneme in continuous speech is determined by many factors, such as the characteristics peculiar to each phoneme, the influence of adjacent phonemes, and the number of phonemes as well as their locations in the word (Sagisaka and Tohkura, 1984). The duration of each phoneme also changes as a function of the sentence context. Specifically, the final vowel of the sentence is lengthened, as are stressed vowels and the consonants that precede them in the same syllable, whereas vowels before voiceless consonants are shortened (Klatt, 1987).
7.6 TEXT-TO-SPEECH CONVERSION

Text-to-speech conversion is an ambitious objective and continues to be the focus of intensive research. Once produced, a text-to-speech system would find a wide range of applications in a number of fields, ranging from accessing e-mail and various kinds of databases by voice over the telephone to reading machines for the blind. Figure 7.8 presents the chief elements of text-to-speech conversion (Crochiere and Flanagan, 1986). Input text often includes abbreviations, Roman numerals, dates, times, formulas, and punctuation marks. The system must be capable of first converting these into some reasonable, standard form and then translating them into a broad phonetic transcription. This is done by using a large pronouncing dictionary supplemented by appropriate letter-to-sound rules. In the MITalk-79 system, one of the major pioneering English text-to-speech conversion systems, 12,000 morphs, covering 98% of ordinary English sentences, are used as basic acoustic segments (Allen et al., 1979). Morphs, which are smaller than words, are minimum units of letter strings having linguistic meaning. They consist of stems, prefixes, and suffixes. The word 'changeable,' for example, is decomposed into the morphs 'change' and 'able.' The morph dictionary stores the
spelling and pronunciation for each morph, rules for connecting it with other morphs, and rules for syntax-dependent variations. Phoneme sequences for low-frequency words are produced by letter-to-sound rules instead of preparing morphs for them. This is based on the fact that irregular letter-to-sound conversions generally occur for frequent words, whereas the pronunciation of infrequent words tends to follow regular rules in English. The MITalk-79 system converts word strings into morph strings by a left-to-right recursive process using the morph dictionary. Each word is then transformed into a sequence of phonemes. Additionally, the stress in each word is decided according to the effects of prefixes, suffixes, word compounding, and the part of speech. Sentence-level prosodic features are added according to syntax and semantics analysis, and sentence speech is finally synthesized using the terminal analog speech synthesizer introduced in Sec. 7.4.2 (Fig. 7.4). The quality of the speech synthesized by the MITalk-79 system was evaluated by phoneme intelligibility in isolated words, word intelligibility in sentence speech, and sentence comprehensibility. Experimental results confirmed that the error rate in the phoneme intelligibility test was 6.9%, and that word intelligibility scores were 93.2% and 78.7% in normal sentences and meaningless sentences, respectively. The DECtalk system, the most successful commercialized text-to-speech conversion system, is based on refinements of the technology used in the MITalk-79 system (Klatt, 1987). Text-to-speech conversion systems for several other languages have also been investigated (Hirose et al., 1986). In a Japanese text-to-speech conversion system (Sato, 1984b), input text, which is written in a combination of Chinese characters (Kanji) and the Japanese Kana syllabary, is analyzed by depth-first searching for the longest match using a 58,000-word dictionary and a word transition table.
The transition table provides candidates for the following word. Compound and phrase accents and sentence prosodic characteristics are next determined by reconstructing phrases on the basis of local syntactic dependency analysis. A continuous speech signal is finally synthesized by concatenating CV speech units.
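The dictionary lookup at the heart of such a front end can be illustrated with a greedy longest-match segmenter. Note that the system above uses depth-first search with a word transition table; this stand-alone sketch omits both and simply prefers the longest dictionary entry at each position:

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest-match segmentation of an
    unsegmented string into dictionary words."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown symbol: emit it alone
            i += 1
    return words
```

For example, longest_match_segment('texttospeech', {'text', 'to', 'speech'}) returns ['text', 'to', 'speech'].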
7.7 CORPUS-BASED SPEECH SYNTHESIS
As described in Section 7.2, speech synthesis methods relying on a large number of short waveform units covering preceding and succeeding phonetic contexts and pitch are now widely used. The waveform units are usually made using a large speech database (corpus) and stored. The most appropriate units, those that have the closest phonetic context and pitch frequency to the desired speech and that yield the smallest concatenation distortion between adjacent units, are selected based on rules and evaluation measures and concatenated (Hirokawa et al., 1992). The units are either directly connected or interpolated at the boundary. If the number of units is large enough and the selection rule is appropriate, smooth synthesized speech can be obtained without applying interpolation. Instead of storing units of a unified length such as phonemes, methods using variable-length units according to the amount of data and the kinds of speech to be synthesized have also been investigated (Sagisaka, 1988). The major factors determining synthesized speech quality in these methods consist of: 1) the speech database, 2) methods for extracting the basic units, 3) evaluation measures for selecting the most appropriate units, and 4) efficient methods for searching the basic units.

COC Method
The COC (Context-Oriented Clustering) speech synthesis method was pioneering in using hierarchical decision-tree clustering in unit selection for speech synthesis. The method was first proposed for Japanese (Nakajima and Hamada, 1988) and was later extended to English (Nakajima, 1993). In this approach, all the instances of a given phoneme in a single-speaker continuous-speech
database are clustered into equivalence classes according to their preceding and succeeding phoneme contexts. The decision trees which perform the clustering are constructed automatically so as to maximize the acoustic similarity within the equivalence classes. Figure 7.9 shows an example of decision-tree clustering for the phoneme /a/. This approach is similar to that used in modern speech recognition systems to generate hidden Markov models in different phonetic contexts (see Subsection 8.9.5). In the synthesis systems, parameters or segments are then extracted from the database to represent each leaf in the tree. During synthesis, the trees are used to obtain the unit sequence required to produce the desired sentence. A key feature of this method is that the tree construction automatically determines which context effects are most important in terms of their effect upon the acoustic properties of the speech, and thus enables the automatic identification of a leaf containing the segments or parameters most suitable for synthesizing a given context, even when the required context is not seen in training. It was confirmed that, by concatenating the phoneme-context-dependent phoneme units, smooth speech can be synthesized. The COC method was extended to use a set of cross-word decision-tree state-clustered context-dependent hidden Markov models and to define a set of subphone units to be used in a concatenation synthesizer (Donovan and Woodland, 1999). During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, was generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. A method of using HMM likelihood scores for selecting the most appropriate basic units has also been investigated (Huang et al., 1996).

CHATR
CHATR is a corpus-based method for producing speech by selecting appropriate speech segments according to a labeling which annotates prosodic as well as phonemic influences on the
speech waveform (Black and Campbell, 1995; Deng and Campbell, 1997). The labeling of speech variation in the natural data has enabled a generic approach to synthesis which easily adapts to new languages and to new speakers with little change to the basic algorithm. Figure 7.10 summarizes the data flow in CHATR. It shows that processing (illustrated there in the form of pipes) occurs at two main stages: an initial (off-line) database analysis and encoding stage that provides index tables and prosodic knowledge bases, and a subsequent (online) synthesis stage for prosody prediction and unit selection. Waveform concatenation is currently the simplest part of CHATR, as the raw waveform segments to which the index points for the selected candidates are simply concatenated. Irrespective of the recent progress in speech synthesis, many research issues still remain, including:
1) Improvement of naturalness, especially that of prosody, in synthesized speech;
2) Control of speaking style, such as reading or dialogue style, and of speech quality; and
3) Improvement of the accuracy of text analysis.
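The selection principle shared by these corpus-based systems can be sketched as a Viterbi-style search over a candidate lattice: a target cost scores how well each stored unit matches the desired specification, and a concatenation cost scores the join between adjacent units. This is a generic illustration, not the actual CHATR algorithm, and all names are invented:

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Choose one candidate unit per target position so that the summed
    target and concatenation costs are minimal.
    candidates[t] lists the units available for position t; the sketch
    keeps full paths per candidate instead of back-pointers."""
    best = [(target_cost(targets[0], u), [u]) for u in candidates[0]]
    for t in range(1, len(targets)):
        new = []
        for u in candidates[t]:
            cost, path = min(
                ((pc + concat_cost(pp[-1], u), pp) for pc, pp in best),
                key=lambda cp: cp[0])
            new.append((cost + target_cost(targets[t], u), path + [u]))
        best = new
    return min(best, key=lambda cp: cp[0])[1]
```

With pitch targets [100, 200, 100] Hz, candidates [[95, 300], [210, 100], [105, 500]], an absolute-difference target cost, and a concatenation cost of 0.1 times the pitch jump, the search picks [95, 210, 105]: units close to the targets that also join smoothly.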
Speech Recognition
8.1 PRINCIPLES OF SPEECH RECOGNITION

8.1.1 Advantages of Speech Recognition
Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech wave using computers or electronic circuits. Linguistic information, the most important information in a speech wave, is also called phonetic information. In the broadest sense of the word, speech recognition includes speaker recognition, which involves extracting individual information indicating who is speaking. The term 'speech recognition' will be used from here on, however, to mean the recognition of linguistic information only. Automatic speech recognition methods have been investigated for many years, aimed principally at realizing transcription and human-computer interaction systems. The first technical paper on speech recognition was published in 1952; it described Bell Labs' spoken digit recognizer, Audrey (Davis et al., 1952). Research on speech recognition has since intensified, and speech recognizers for communicating with machines through speech have been constructed, although they remain of only limited use.
Conversation with machines can be actualized by the combination of a speech recognizer and a speech synthesizer. This combination is expected to be particularly efficient and effective for human-computer interaction, since errors can be confirmed by hearing and then corrected promptly. Interest is growing in viewing speech not just as a means for accessing information, but also as a source of information in itself. Important attributes that would make speech more useful in this respect include random access, sorting (e.g., by speaker, by topic, by urgency), scanning, and editing. Similar to speech synthesis, speech recognition features four specific advantages:

1) Speech input is easy to perform because it does not require a specialized skill, as do typing or pushbutton operations;
2) Speech can be used to input information three to four times faster than typewriters and eight to ten times faster than handwriting;
3) Information can be input even when the user is moving or doing other activities involving the hands, legs, eyes, or ears; and
4) Since a microphone or telephone can be used as an input terminal, inputting information is economical, with remote inputting capable of being accomplished over existing telephone networks and the Internet.

Regardless of these positive points, however, speech recognition also has the same disadvantages as speech synthesis. For instance, the input or conversation is not printed, and noise canceling or adaptation is necessary for use in a noisy environment. In typical speech recognition systems, the input speech is compared with stored units (models or reference templates) of phonemes or words, and the most likely (most similar) sequence of units is selected as the candidate sequence of phonemes or words of the input speech. Since speech waveforms are too complicated to compare directly,
and since the phase components, which vary according to transmission and recording systems, affect human speech perception little, the phase components are desirably removed from the speech wave. Thus, the short-time spectral density is usually extracted at short intervals and used for comparison with the units.
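The front-end computation described here can be sketched as framing plus a discrete Fourier transform whose phase is discarded. The sketch below uses a plain O(N^2) DFT for clarity (real systems use an FFT), and the window and hop sizes are arbitrary illustrative choices:

```python
import math

def short_time_spectra(x, frame_len=64, hop=32):
    """Short-time spectral densities: Hamming-window successive frames
    and keep only the DFT magnitudes (the phase components are discarded)."""
    win = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
           for i in range(frame_len)]
    spectra = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = [x[start + i] * win[i] for i in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            re = sum(v * math.cos(2 * math.pi * k * i / frame_len)
                     for i, v in enumerate(frame))
            im = -sum(v * math.sin(2 * math.pi * k * i / frame_len)
                      for i, v in enumerate(frame))
            mags.append(math.hypot(re, im))  # magnitude only, phase dropped
        spectra.append(mags)
    return spectra
```

Each row of the result is one short-time magnitude spectrum; a recognizer compares such sequences against its stored units.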
The difficulties in speech recognition can be summarizedas follows.
1) Coarticulationandreductionproblems Thespectrum of aphoneme in aword or sentence is influenced by neighboringphonemesasaconsequence of coarticulation.Suchaspectrum isvery different fromthose of isolated phonemes or syllables since the articulatory organs do not moveasmuch in continuous speech as in isolated utterances. Although this problem canbe avoided in the case of isolated word recognition by using words as units,how best to contend with this problem is very important in continuous-speech recognition. With continuous speech, the difficulty is compounded by elision, where thespeakerrunswordstogetherand‘swallows~most of the syllables. 2) Difficulties in segmentation Spectra continuously change from phoneme to phoneme due to their mutual interaction. Since the spectral sequence of speech can essentially be compared to a stringof handwritten letters, it is very difficult to precisely determine the phoneme boundaries which segmentthe time functionofspectralenvelopes.Although unvoiced consonants can be segmented relatively easily based on theamount of spectral variationandthe onset and offset of periodicity, attempting to segment asuccession of voiced sounds is particularly burdensome. Furthermore, it is almost impossible to segment a sentence of speech into words merely based on their acoustic features.
3) Individuality and other variation problems Acoustic features vary from speaker to speaker,even when the samewords areuttered,accordingto differences inmanner of speaking and articulatory organs. To complicate matters, different phonemesspoken by differentspeakersoften have thesame spectrum. Transmission systems or noise also affect the physical characteristics of speech. 4) Insufficient linguistic knowledge The physical features of speech do not always convey enough phoneticinformation in and of themselves. Sentence speech is usually uttered with a n unconscious use of linguistic knowledge, such as syntactic and semantic constraints, and is perceived in a similar way. The listener can usually predictthe next word according to several linguistic constraints, and incomplete phonetic information is compensatedfor by such linguistic knowledge. However, what we know about the linguistic structure of spoken utterances is much smaller than that of written languages, and it is verydifficult tomodelthemechanism of usinglinguistic constraints in human speech perception. 8.1.3 ClassificationofSpeechRecognition
Speech recognition can be classified into isolated word recognition, in which words uttered in isolationare recognized, and continuousspeech recognition,in which continuouslyuttered sentences are recognized. Continuous-speech recognition can be further classified intotranscriptionandunderstanding.Theformeraimsat recognizing each word correctly. The latter, also called conversational speech recognition, focuses on understanding the meaningof sentences rather than recognizing each word. 111 continuous-speech recognition, it is very importantto use sophisticated linguistic knowledge. Applyingrules of grammar, which govern the sequence of words in a sentence, is but one example of this. Speech recognition can also be classified from different points of view into speaker-independent recognition and speaker-dependent
recognition. The former system can recognize speech uttered by any speaker, whereas, in the latter case, reference templates/models must be modified every time the speaker changes. Although speaker-independent recognition is much more difficult than speaker-dependent recognition, it is of particular importance to develop speaker-independent recognition methods in order to broaden the range of possible uses. Various units of reference templates/models, from phonemes to words, have been studied. When words are used as units, the digitized input signal is compared with each of the system's stored units, i.e., statistical models or sequences of values corresponding to the spectral pattern of a word, until one is found that matches. Conversely, phoneme-based algorithms analyze the input into a string of sounds that they convert to words through a pronunciation-based dictionary. When words are used as units, word recognition can be expected to be highly accurate since the coarticulation problem within words can be avoided. A larger vocabulary requires a larger memory and more computation, however, making training troublesome. Additionally, word units cannot solve the coarticulation problem arising between words in continuous-speech recognition. Using phonemes as units, on the other hand, does not greatly increase memory size requirements, nor the amount of computation as a function of vocabulary size. Furthermore, training can be performed efficiently. Moreover, coarticulation within and between words can be adequately taken into consideration. Since coarticulation rules have not yet been established, however, context-dependent multiple-phoneme units are necessary. The most appropriate units for enabling recognition success depend on the type of recognition, that is, on whether it is isolated word recognition or continuous-speech recognition, and on the size of the vocabulary.
Along these lines, medium-size units between words and phonemes, such as CV syllables, VCV syllables, diphones, dyads, and demisyllables, have also been explored in order to overcome the disadvantages of using either words or phonemes.
With these subword (smaller-than-word) units, it is desirable to select more than one candidate in the unit recognition stage to form lattices and to transfer these candidates with their similarity values to the next stage of the recognition system. This method will help minimize the occurrence of serious errors at higher stages due to matching errors with these units and segmentation errors involved in the lower stages. In most of the current advanced continuous-speech recognition systems, the recognition process is performed top-down, that is, driven by linguistic knowledge, and the system predicts sentence hypotheses, each of which is represented as a sequence of words. Each sequence is then converted into a sequence of phoneme models, and the likelihood (probability) of producing the spectral sequence of input speech given the phoneme sequence is calculated. Thus, the matching and segmentation errors of phonemes are avoided (see Subsection 8.9.5).
8.2 SPEECH PERIOD DETECTION
Detection of the speech period is the first stage of speech recognition. This is a particularly important stage because it is difficult to detect the speech period correctly in noisy surroundings and because a detection error usually results in a serious recognition error. Consonants at the beginning or end of a speech period and low-energy vowels are especially difficult to detect. Additional noise, such as breath noise at the end of a speech period, must also be ignored. A speech period is usually detected by the fact that the short-time averaged energy level exceeds a threshold for longer than a predetermined period. The beginning point of a speech period is often determined as being a position a certain period prior to the position detected by the energy threshold. The energy level is often compared with two kinds of thresholds to make a reliable detection decision. In addition to the energy level, the zero-crossing number or the spectral difference between the
input signal and reference noise spectrum is often used for speech period detection. Along with stationary noise, which can be distinguished from the speech period using the above-mentioned methods, nonspeech sounds, such as coughing, the sound of turning pages, and even sounds uttered subconsciously when thinking or suddenly adjusting a sentence in midspeech, should be distinguishable from the actual speech. When the vocabulary is large and the system must work speaker-independently, it is very troublesome to distinguish between speech and nonspeech sounds. Because this distinction is itself considered to be a speech recognition process, it is almost impossible to develop a perfect algorithm for determining it. Research on word spotting, specifically, the automatic detection of predetermined words from arbitrary continuous-sentence speech, is expected to open the door to solving this problem. Besides speech period detection, the voiced/unvoiced decision is also important. Although ascertaining the presence of vocal cord vibration, that is, the existence of a periodic wave, is most reliable, this method requires a large amount of computation. Therefore, the energy ratio of high- to low-frequency ranges, such as the range higher than 3 kHz and that lower than 1 kHz, and similar measures are often used. When these methods are employed, it is necessary to normalize the effects of individuality and transmission characteristics to arrive at a reliable decision. Along these lines, a pattern recognition approach combining various parameters, such as autocorrelation coefficients, has also been attempted, as previously mentioned (see Sec. 4.7).
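The two-threshold, short-time-energy detection described above can be sketched as follows. This is a minimal illustration, not the book's algorithm verbatim: the frame length, the two threshold values, the minimum run length, and the lead-in margin are all illustrative assumptions.

```python
import numpy as np

def detect_speech_period(x, frame_len=160, low_thr=0.01, high_thr=0.05,
                         min_frames=3, lead_frames=2):
    """Two-threshold short-time-energy endpoint detection (a sketch).

    A frame becomes a speech candidate when its mean-square energy exceeds
    the higher threshold; the candidate is accepted only if it persists for
    at least `min_frames` frames.  The start point is then moved back while
    the energy stays above the lower threshold, then a small fixed lead-in
    is added, approximating low-energy consonant onsets.
    Returns (start_frame, end_frame) or None if no speech is found.
    """
    n_frames = len(x) // frame_len
    energy = np.array([np.mean(x[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    above = energy > high_thr
    # find the first run of at least min_frames high-energy frames
    start, run = None, 0
    for i, a in enumerate(above):
        run = run + 1 if a else 0
        if run >= min_frames:
            start = i - run + 1
            break
    if start is None:
        return None
    end = start
    for i in range(start, n_frames):   # last high-energy frame
        if above[i]:
            end = i
    # back off the start while energy stays above the lower threshold
    while start > 0 and energy[start - 1] > low_thr:
        start -= 1
    start = max(0, start - lead_frames)
    return start, end
```

On a synthetic signal of silence, a loud segment, and silence again, the detector returns the frame span of the loud segment extended by the lead-in margin.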
8.3 SPECTRAL DISTANCE MEASURES
8.3.1 Distance Measures Used in Speech Recognition
As previously described, in almost all speech recognition systems, short-time spectral distances or similarities between input speech and stored units (models or reference templates) are calculated as
the basis for the recognition decision. Spectral analysis is usually performed with one of five methods (see Sec. 4.2):
Using band-pass filter outputs for 10 to 30 channels,
Calculating the spectrum directly from the speech wave using the FFT,
Employing cepstral coefficients,
Utilizing an autocorrelation function, and
Deriving a spectral envelope from LPC analysis (maximum likelihood estimation).
Various distance (similarity) measures can be defined based on multivariate vectors representing short-time spectra which are obtained through these spectral analysis techniques. The distance measure d(x, y) between two vectors x and y should desirably satisfy the following conditions for effective use in speech recognition:
(a) Symmetry:

d(x, y) = d(y, x)    (8.1)

(b) Positive definiteness:

d(x, y) > 0,  x ≠ y
d(x, y) = 0,  x = y    (8.2)
If d(x, y) is a distance in the mathematical sense of the word, it should satisfy the triangle inequality. This condition is not necessary in speech recognition, however, and it is more important to formulate algorithms for calculating d(x, y) efficiently. Although the simple Euclidean distance is used in many cases for d(x, y), several modifications have also been attempted. Among these are weighted distances based on auditory sensitivity and distances in reduced multidimensional spaces obtained through statistical analyses such as discriminant analysis or principal component analysis. Formant frequencies, which are important features for representing speech characteristics, have rarely been used in recent spectral-distance-based speech recognition because they are very difficult to extract automatically.
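As one concrete instance of such a measure, a Euclidean distance between low-order cepstral coefficients (which Subsection 8.3.2 relates to the distance between smoothed logarithmic spectra) might be sketched as follows. The FFT size, the number of coefficients kept, and the small floor added before the logarithm are illustrative choices, not values from the text.

```python
import numpy as np

def low_order_cepstrum(x, n_coef=12, n_fft=256):
    """Real cepstrum from the log magnitude spectrum, truncated to the
    low-order coefficients that describe the smoothed spectral envelope."""
    spec = np.abs(np.fft.rfft(x, n_fft)) + 1e-10   # floor avoids log(0)
    cep = np.fft.irfft(np.log(spec))               # real cepstrum
    return cep[1:n_coef + 1]                       # drop c0 (energy term)

def cepstral_distance(x, y, n_coef=12):
    """Euclidean distance between low-order cepstra; corresponds to a
    distance between the smoothed log spectra of the two frames."""
    return float(np.linalg.norm(low_order_cepstrum(x, n_coef)
                                - low_order_cepstrum(y, n_coef)))
```

The distance is symmetric and vanishes only for identical smoothed spectra, satisfying conditions (8.1) and (8.2) above.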
8.3.2 Distances Based on Nonparametric Spectral Analysis
The following methods have been specifically investigated for obtaining spectral distances based on general spectral analysis techniques which do not incorporate modeling of speech production mechanisms.
1) Band-pass filter bank method
Band-pass filter banks have been used for many years and are still being employed because of the ease with which hardware for real-time analysis purposes can be realized. Center frequencies of the band-pass filters are usually set with equal spacing along the logarithmic frequency scale. Differences of logarithmic output for each band-pass filter between the reference and input speech are averaged (summed) over all frequency ranges, or their squared values are averaged, to produce the overall distance.
2) FFT method
Although it is possible to directly calculate the distance between spectra obtained by the FFT, spectral patterns smoothed by cepstral coefficients or window functions in the autocorrelation domain are usually used. This is because the spectral fine structure varies according to pitch, voice individuality, and many other factors. The spectral values obtained at equal intervals on a linear frequency axis are usually resampled with equal spacing on a logarithmic frequency scale, taking the auditory characteristics into consideration. Equal-space resampling on a Bark-scale or a Mel-scale frequency axis has also been introduced in an effort to simulate the auditory characteristics more precisely. The Bark scale, which is based on the auditory critical bandwidth, corresponds to the frequency scale on the basilar membrane in the peripheral auditory system. This scale is defined as

B = 13 arctan(0.76f) + 3.5 arctan[(f/7.5)²]    (8.3)

where B and f represent the Bark scale and frequency in kilohertz.
The Mel scale corresponds to the auditory sensation of tone height. The relationship between frequency f in kilohertz and the Mel scale is usually approximated by the equation

Mel = 1000 log₂(1 + f)    (8.4)

The Bark and Mel scales are nearly proportional to the logarithmic frequency scale in the frequency range above 1 kHz.
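Both scales can be evaluated directly from the formulas above. The small sketch below assumes the common Zwicker form for the Bark scale, Eq. (8.3), and the base-2 logarithmic approximation for the Mel scale, Eq. (8.4), with frequency arguments in kilohertz as in the text.

```python
import math

def hz_to_bark(f_khz):
    """Bark scale, Eq. (8.3): B = 13 arctan(0.76 f) + 3.5 arctan((f/7.5)^2),
    with f in kilohertz."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def hz_to_mel(f_khz):
    """Mel scale, Eq. (8.4): Mel = 1000 log2(1 + f), with f in kilohertz."""
    return 1000.0 * math.log2(1.0 + f_khz)
```

For example, 1 kHz maps to exactly 1000 Mel, and both scales grow monotonically with frequency.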
3) Cepstrum method
It is clear from the definition of cepstral coefficients that the Euclidean distance between vectors consisting of lower-order cepstral coefficients corresponds to the distance between smoothed logarithmic spectra. Mel frequency cepstral coefficients (MFCCs), transformed from the logarithmic spectrum resampled at Mel-scale frequencies as shown in Fig. 8.1, have also been used for this distance (Young, 1996). Δ and Δ² are transitional cepstral coefficients, which are described in Subsection 8.3.6.
4) Autocorrelation function method
The distance between vectors consisting of the autocorrelation function multiplied by the lag window corresponds to the distance between smoothed spectra.
8.3.3 Distances Based on LPC
Since LPC analysis has proven itself to be an excellent speech analysis method, as mentioned in Chap. 5, it is also widely used in speech recognition. Notations for various LPC analysis-related parameters are indicated in Table 8.1, where f(λ) and g(λ) represent spectral envelopes based on the LPC model for a reference template and input speech, respectively. These are given as

f(λ) = σ_f² / ( 2π | Σ_{i=0}^{p} a_i e^{−jiλ} |² ),  a₀ = 1
TABLE 8.1 Notations for LPC Analysis-Related Parameters (listed for the reference template and for the input speech: spectral envelope, f(λ) and g(λ); energy; autocorrelation coefficients; predictor coefficients; maximum likelihood parameter; normalized residual; and cepstral coefficients. Indices: i = 1, ..., p; j = −p, ..., p; p = order of the LPC model; n = −n₀, ..., n₀.)
and g(λ) is defined similarly from the input-speech parameters.
The following distance measures using LPC analysis-related parameters have been proposed for determining the distance between f(λ) and g(λ).
1. Maximum likelihood spectral distance (Itakura-Saito distance)
The maximum likelihood spectral distance was introduced as an evaluation function for spectral envelope estimation from the short-time spectral density using the maximum likelihood method. This distance is represented by the equation (see Sec. 5.3.2)

E = (1/2π) ∫_{−π}^{π} [ g(λ)/f(λ) + log f(λ) − log g(λ) − 1 ] dλ    (8.6)
This distance is also called the Itakura-Saito distance (distortion). As described in Sec. 5.3.2, by defining d(λ) = log f(λ) − log g(λ) and examining the relationship between this distance and the logarithmic spectral distance, we obtain the equation

E = (1/2π) ∫_{−π}^{π} [ e^{−d(λ)} + d(λ) − 1 ] dλ    (8.7)
When the integrand of this equation is expanded in a Taylor series in d(λ) around 0,

E ≈ (1/4π) ∫_{−π}^{π} d(λ)² dλ    (8.8)
is derived. This means that when |d(λ)| is small, the distance E is close to the squared logarithmic spectral distance. Equation (8.7) indicates that the integrand of this distance is proportional to d(λ) when d(λ) ≫ 0 and proportional to e^{−d(λ)} when d(λ) ≪ 0.

Recognition of an unknown word is performed by measurement of the observation sequence O = (o₁, o₂, ..., o_T) via a feature analysis; followed by calculation of model likelihoods for all possible models, P(O|λ_v), 1 ≤ v ≤ V; followed by selection of the word whose model likelihood is highest, specifically,
v* = argmax_{1 ≤ v ≤ V} P(O|λ_v)    (8.72)
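A minimal sketch of this decision rule for discrete-output HMMs, with the Viterbi score standing in for the likelihood; the model parameters in the usage example are toy values, not figures from the text.

```python
import numpy as np

def viterbi_log_likelihood(obs, log_pi, log_A, log_B):
    """Viterbi score: log-probability of the single best state path.
    obs: observation symbol indices; log_pi: initial state log-probs;
    log_A[i, j]: transition log-probs; log_B[j, k]: emission log-probs."""
    delta = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # best predecessor for each state, then emit the next symbol
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, o]
    return float(np.max(delta))

def recognize(obs, word_models):
    """Eq. (8.72): select the word whose model scores the input highest."""
    return max(word_models,
               key=lambda w: viterbi_log_likelihood(obs, *word_models[w]))
```

With two toy word models, one whose states favor symbol 0 and one favoring symbol 1, an input of repeated 0s is classified as the first word and repeated 1s as the second.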
The likelihood calculation step is generally performed using the Viterbi algorithm (i.e., the maximum likelihood path is used). The segmental k-means training procedure shown in Fig. 8.13 (Rabiner and Juang, 1993) is widely used to estimate parameter values; good initial estimates of the parameters of the b_j(o_t) densities are essential for rapid and proper convergence of the reestimation formulas. Following model initialization, the set of training observation sequences is segmented into states, based on the current model λ. This segmentation is achieved by finding the optimum state sequence via the Viterbi algorithm and then backtracking along the optimal path. The result of segmenting each of the training sequences is a maximum likelihood estimate of the set of observations that occur within each state according to the current model. Based on this segmentation, the model parameter set is updated. The resulting model is then compared to the previous model. If the model distance score exceeds a threshold, the old model is replaced by the new (reestimated) model, and the overall training loop is repeated. If model convergence is assumed, the final model parameters are saved.
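The segment-then-reestimate loop can be sketched for the simplest case of one mean vector per state, with a left-to-right squared-distance alignment standing in for the Viterbi segmentation step; the state topology, iteration count, and convergence handling are illustrative assumptions.

```python
import numpy as np

def align_states(frames, means):
    """Left-to-right DP alignment of frames to states: each frame is
    assigned a state, state index non-decreasing, minimizing total
    squared distance (a stand-in for the Viterbi segmentation step)."""
    T, S = len(frames), len(means)
    cost = np.array([[np.sum((f - m) ** 2) for m in means] for f in frames])
    acc = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    acc[0, 0] = cost[0, 0]              # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = acc[t - 1, s]
            move = acc[t - 1, s - 1] if s > 0 else np.inf
            if stay <= move:
                acc[t, s], back[t, s] = stay + cost[t, s], s
            else:
                acc[t, s], back[t, s] = move + cost[t, s], s - 1
    path = [S - 1]                      # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

def segmental_kmeans(frames, n_states, n_iter=5):
    """Segmental k-means sketch: segment the training frames into states
    under the current model, then re-estimate each state mean from the
    frames assigned to it; repeat for a fixed number of iterations."""
    # uniform initial segmentation provides the initial estimates
    means = [seg.mean(axis=0) for seg in np.array_split(frames, n_states)]
    path = align_states(frames, means)
    for _ in range(n_iter):
        path = align_states(frames, means)
        for s in range(n_states):
            seg = frames[[t for t, st in enumerate(path) if st == s]]
            if len(seg):
                means[s] = seg.mean(axis=0)
    return np.array(means), path
```

On clearly separated data the loop recovers the segment boundary and the per-state means in one pass.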
8.8 CONNECTED WORD RECOGNITION
8.8.1 Two-Level DP Matching and Its Modifications
The DP matching technique used in isolated word recognition can be expanded into a technique which is applicable to connected
word recognition (Ney and Aubert, 1996). The basic process involved in this expansion is to perform DP matching between input speech and all possible concatenations of reference word templates to ensure selecting the best sequence, namely, the one having the smallest accumulated distance. Several problems persist, however, in finding the optimal matching sequence of reference templates. One is that the number of words in the input speech is generally unknown. Another is that the locations, in time, of the boundaries between words are unknown. The boundaries are usually unclear because the end of one word may merge smoothly with the beginning of the next word. Still another is that the amount of calculation becomes too large when all possible sequences and input speech are exhaustively matched using the method described in Sec. 8.5. This is because the number of ways of concatenating X words selected from the N-word vocabulary is N^X. It is thus very important to create an efficient means for ascertaining the optimal sequence. Fortunately, several methods have been devised that optimally solve the matching problem without giving rise to an exponential growth in the amount of calculation as the vocabulary or length of the word sequence grows. Specifically worth mentioning are four principal methods having different computation algorithms but producing identical accumulated distance results.
1. Two-level DP matching
Since DP matching is performed on two levels in this method, it is called two-level DP matching (Sakoe, 1979). On the first level, semi-unconstrained endpoint DP matching is performed between every short period of input speech and each word reference template. The starting position of the warping function is shifted frame by frame in the input speech. The meaning of semi-unconstrained endpoint is that only the final position of the warping function is unconstrained. On the second level, the accumulated distance for the word sequence is calculated, again using the DP matching method, based on the results derived at the first level.
In exploring the method, let us assume that first-level DP matching has already been performed between partial periods of the input utterance starting from every position and each reference template. The word with the minimum distance from the input utterance between positions s and t is written as w(s, t), and its distance is written as D(s, t). w(s, t) and D(s, t) are obtained and stored for every partial period of input speech, more precisely, for every combination of s and t (1 ≤ s < t ≤ T, T = input speech length). These values are then used for second-level DP matching to obtain the word sequence minimizing the accumulated distance over the entire input speech. That is, the recognition result is the word sequence w(1, m₁), w(m₁ + 1, m₂), ..., w(m_k + 1, T) satisfying the following equation under the condition 1 ≤ m₁ < m₂ < ... < m_k < T:
D = min_{k; m₁, ..., m_k} [ D(1, m₁) + D(m₁ + 1, m₂) + ... + D(m_k + 1, T) ]    (8.73)

Since this equation can be rewritten into the recursive form
D₀ = 0
D_n = min_{1 ≤ m ≤ n} { D_{m−1} + D(m, n) }    (8.74)
it can be efficiently solved by the DP technique.
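Given the first-level table D(s, t) and the corresponding best words w(s, t), the recursion (8.74) can be sketched as follows; the frame spans and word labels in the usage example are hypothetical.

```python
import numpy as np

def second_level_dp(D, words):
    """Second level of two-level DP matching (Eq. 8.74).
    D: (T+1) x (T+1) array where D[m, n] is the best single-word distance
    over input positions m..n (1-based; unused entries are inf);
    words[m][n] is the word attaining it.
    Returns (total accumulated distance, recognized word sequence)."""
    T = D.shape[0] - 1
    acc = np.full(T + 1, np.inf)    # acc[n] corresponds to D_n
    acc[0] = 0.0
    back = [0] * (T + 1)
    for n in range(1, T + 1):
        for m in range(1, n + 1):
            c = acc[m - 1] + D[m, n]
            if c < acc[n]:
                acc[n], back[n] = c, m
    seq, n = [], T
    while n > 0:                    # backtrack the word boundaries
        m = back[n]
        seq.append(words[m][n])
        n = m - 1
    return float(acc[T]), seq[::-1]
```

In the usage below, two short words over spans (1, 2) and (3, 4) with total distance 2 beat a single word covering the whole input with distance 5.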
2. LB (level building) method
In the level building method, the number of connected words is first assumed to be one and is increased successively. Distances between input speech and connected-word-sequence candidates are calculated to select the optimum word sequence, namely, the best matching words, for each condition (level) of the number of connected words. Figure 8.14 illustrates the LB method (Myers and Rabiner, 1981).
FIG. 8.14 Illustration of warping path regions in four-level DTW matching using the LB method. (The figure marks the search regions for the longest and shortest references at each level.)
For the first level, specifically, for the first word in the sequence, unconstrained endpoint DP matching is performed between the input speech and each word reference template under the condition that the warping function must start from the
beginning position of the input speech. For the second and later levels, unconstrained endpoint DP matching is done using the optimum accumulated distances obtained at each previous level in the end area (m₁(l)–m₂(l)) in Fig. 8.14 as initial values. This procedure is repeated until the allowed maximum number of words (word string length) is reached. The word sequence with the smallest accumulated distance at the end of the input speech is finally selected as the recognition result. The LB method is particularly beneficial in that unconstrained endpoint DP matching can be performed at every level, whereas the first level of two-level DP matching consists of semi-unconstrained endpoint matching. Consequently, since the LB method can solve the optimization problem through one-level DP matching, the amount of computation it requires is less than that of two-level DP matching. The LB method in its original form is unsuited to frame-synchronous, real-time processing, however, since scanning and matching with reference templates must be performed throughout the input string of speech at every level until the number of levels equals the allowed maximum number of words. Frame-synchronous processing of the LB method has been realized by the clockwise DP method described below, using an additional memory for intermediate calculation results.
3. CW (clockwise) DP method
In contrast with the LB method, in which the assumed number of connected words (level) is increased successively and the best matching word string is selected for each level, the clockwise DP (CWDP) method performs this procedure through parallel matching synchronized to the input speech frame (Sakoe and Watari, 1981). This makes CWDP suitable for real-time processing. The number of parallel matchings corresponds to the allowable maximum number of words in the string. In the DP matching between a certain period of input speech and each word reference template, the result of optimum matching,
in particular, the optimum accumulated distance for the speech input before this period, is used as an initial condition for the recursive calculation. The repetition of the same spectral distance calculation occurring on every level of the LB method is removed in the CWDP method. Thus, CWDP requires fewer calculations than the LB method. Memory capacity increases in the CWDP method, however, since intermediate results of recursive calculations for DP matching must be stored for each figure number and for each word reference template.
4. OS (one-stage) DP method or O(n) (order n) DP method
As opposed to the two-level DP or CWDP methods, in which DP recursive calculations are performed for all the possible conditions on the number of figures at every frame, only the optimum condition is considered at every frame in the one-stage (OS) DP or order n DP method (Vintsyuk, 1971; Bridle and Brown, 1979; Nakagawa, 1983). Although investigated independently, the OS DP and O(n) DP methods are actually the same algorithm. Since this method does not involve the repetition of recursive DP calculations, it requires fewer calculations and a smaller memory. Specifically, the number of calculations necessary to calculate the distance between input and reference frames and for distance accumulation in this method does not depend on the number of figures in the input speech. If the length of the speech input and the mean length of the reference templates are both constant, the number of calculations is proportional only to the size of the vocabulary, n. For this reason, this method is called the O(n) DP method. Since the intermediate results for each stage of the figure are not maintained, it is impossible to obtain the recognition results when the number of figures is specified. For the same reason, automaton control is also impossible with this method. Table 8.2 compares the number of calculations and the memory size for each of the four methods described.
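The one-stage recursion might be sketched as follows, tracking only the optimal accumulated distance (the bookkeeping needed to backtrack the word sequence is omitted for brevity); the templates and feature values in the usage are toy data.

```python
import numpy as np

def one_stage_dp(inp, templates):
    """One-stage (order-n) DP sketch for connected word recognition.
    inp: input feature frames; templates: {word: array of frames}.
    For every input frame and every template frame, keep the best
    accumulated distance; a template-initial frame may also continue
    from the best word-final score of the previous input frame, which
    is where word concatenation happens."""
    D = {w: np.full(len(t), np.inf) for w, t in templates.items()}
    best_end = 0.0          # best word-final score before the first frame
    for x in inp:
        new_best = np.inf
        for w, t in templates.items():
            prev, cur = D[w], np.empty(len(t))
            for j in range(len(t)):
                stay = prev[j]                              # stretch within word
                diag = prev[j - 1] if j > 0 else best_end   # advance, or new word
                cur[j] = min(stay, diag) + np.sum((x - t[j]) ** 2)
            D[w] = cur
            new_best = min(new_best, cur[-1])
        best_end = new_best  # the optimal path must sit at some word end
    return float(best_end)
```

With two single-frame templates and an input that alternates between them, the concatenation 'a b a' matches exactly and the accumulated distance is zero.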
8.8.2 Word Spotting
The term word spotting describes a variety of speech recognition applications where it is necessary to spot utterances that are of interest to the system and to reject irrelevant sounds (Rose, 1996; Rohlicek, 1995). Irrelevant sounds can include out-of-domain speech utterances, background acoustic noise, and background speech. Word spotting techniques have been applied to a wide range of problems that can suffer from unexpected speech input. These include human-machine interactions where it is difficult to constrain users' utterances to be within the domain of the system. Most word spotting systems consist of a mechanism for generating hypothesized vocabulary words or phrases from a continuous utterance, along with some sort of hypothesis testing mechanism for verifying the word occurrence. Hypothesized keywords are generated by incorporating models of out-of-vocabulary utterances and nonspeech sounds that compete in a search procedure with models of the keywords. Hypothesis testing is performed by deriving measures of confidence for hypothesized words or phrases and applying a decision rule to this measure to disambiguate correctly detected words from false alarms. Word spotting was first attempted using a dynamic programming technique for template matching (Bridle, 1973). Nonlinear warping of the time scale of a stored reference template for a word was performed in order to minimize an accumulated distance from the input utterance. In this system, a distance was computed by performing a dynamic programming alignment for every reference template beginning at each time instant of a continuous running input utterance. Each dynamic programming path was treated as a hypothesized keyword occurrence, requiring a second-stage decision rule for disambiguating the correctly decoded keywords from false alarms. Recently, hidden Markov model (HMM)-based approaches have been used for word spotting.
The reference template and the distance are replaced by an HMM word model and the likelihood, respectively. In these systems, the likelihood for an acoustic background or "filler" speech model is used as part of a likelihood
ratio scoring procedure in a decision rule that is applied as a second stage to the word spotter. The filler speech model represents the alternate hypothesis, that is, out-of-vocabulary or 'non-keyword' speech. Figure 8.15 (Rose, 1996) shows the basic structure of an HMM-based word spotter, in which filler models compete with the models for keywords in a finite state network. The output of the system is a continuous stream of keywords and fillers, and the occurrence of a keyword in this output stream is interpreted as a hypothesized event that is to be verified by a second-stage decision rule. The specification of grammars for constraining and weighting the possible word transitions can be incorporated into the likelihood calculation. A variety of filler structures has been used successfully. They include:
A simple one-state HMM (a Gaussian mixture),
A network of unsupervised units such as an ergodic HMM, or a parallel loop of clustered sequences,
A parallel network loop of subnetworks corresponding to keyword pieces, phonetic models, or even models of whole words, such as the most common words, a single pooled 'other' word, and unsupervised clustering of the other words, and
An explicit network characterizing typical word sequences.
Word spotting performance measures are derived using the Neyman-Pearson hypothesis testing formulation. Given a T-length sequence of observation vectors Y_k = y_{1k}, ..., y_{Tk} corresponding to a possible occurrence of a keyword, a word spotter may generate a score S_k representing the degree of confidence for that keyword. The null hypothesis H₀ corresponds to the case where the input utterance is the correct keyword, and the alternate hypothesis H₁ corresponds to an imposter (false) utterance. A hypothesis test can be formulated by defining a decision rule S(·) such that
S(Y_k) = 0,  S_k > r  (accept H₀)
S(Y_k) = 1,  S_k ≤ r  (accept H₁)    (8.75)
where r is a constant decision threshold. We can define the type I error as rejecting H₀ when the keyword is in fact present and the type II error as accepting H₀ when the keyword is not present. Since there is a trade-off between the two types of error, usually a bound on the type I error is specified and the type II error is minimized within this constraint. Figure 8.16 (Rose, 1996) shows a simple looping network which consists of N keywords W_{k1}, ..., W_{kN} and M fillers W_{f1}, ..., W_{fM}. Word insertion penalties C_{ki} and C_{fj} can be associated with the ith keyword and jth filler, respectively, and they can be adjusted to effect a trade-off between type I and type II errors, similar to adjusting r in Equation (8.75). Suppose, for example, the network in Fig. 8.16 contained only a single keyword and a single filler. Then at each time t, the Viterbi algorithm propagates the path extending from keyword W_k, represented by HMM λ_k, or filler W_f, represented by HMM λ_f, to the network node according to (8.76). This corresponds to a decision rule at each time t: (8.77).
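The second-stage likelihood-ratio test of Eq. (8.75) can be sketched as below. The log-likelihood values and the threshold in the usage example are illustrative, and the penalty arguments stand in for the insertion penalties C_{ki} and C_{fj} of the looping network; this is a sketch of the decision rule only, not of the Viterbi propagation itself.

```python
def keyword_score(loglik_kw, loglik_fill, c_kw=0.0, c_fill=0.0):
    """Log likelihood ratio S_k for a hypothesized keyword occurrence,
    with word-insertion penalties biasing the keyword/filler trade-off
    (a more negative keyword penalty yields fewer, more confident
    detections, trading type II error against type I error)."""
    return (loglik_kw + c_kw) - (loglik_fill + c_fill)

def decide(loglik_kw, loglik_fill, r=0.0, **penalties):
    """Decision rule of Eq. (8.75): accept H0 (keyword present) when
    the score exceeds the constant threshold r, else accept H1."""
    return keyword_score(loglik_kw, loglik_fill, **penalties) > r
```

Raising r (or the keyword penalty) rejects weakly scoring hypotheses, illustrating the trade-off between the two error types described above.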
8.9 LARGE-VOCABULARY CONTINUOUS-SPEECH RECOGNITION
In large-vocabulary continuous-speech recognition, input speech is recognized using various kinds of information including a lexicon,
syntax, semantics, pragmatics, context, and prosodics. The lexicon indicates the phonemic structure of words; syntax expresses the grammatical structure; semantics defines the relationships between words as well as the attributes of each word; pragmatics expresses general knowledge concerning the present topics of conversation; context concerns contextual information, such as that obtained through human-machine conversation; and prosodics represents accent and intonation. Various algorithms and databases used in these processes are referred to as knowledge sources for continuous-speech recognition. The keys determining system performance lie with the kinds of knowledge sources used and how they are combined as quickly as possible to produce the most probable recognition. Specifically, the focus involves how best to control the process of searching through possibilities. There are three principal issues in solving these problems: the order in which these knowledge sources should be used, the direction of processing in the input speech period, and the procedures for evaluating and selecting the most probable hypotheses.
8.9.1 Three Principal Structural Models
There are three principal models for combining and using the knowledge sources: the hierarchy model, the blackboard model, and the network model.
1. Hierarchy model
The hierarchy model distributes knowledge sources in multiple hierarchical subsystems. Results of processes are transferred between adjacent subsystems in the bottom-up direction for task-independent processes and in the top-down direction for task-dependent processes. The fundamental structure of the hierarchy model is presented in Fig. 8.17(a). Acoustic features are extracted in the acoustic processor from input speech and converted into a phoneme sequence (lattice) by
FIG. 8.17 Three principal structural models of continuous-speech recognition: (a) hierarchy model; (b) blackboard model; (c) network model.
means of segmentation and phoneme recognition. In the next step, word or word-sequence candidates are produced from the phoneme sequence, which usually includes recognition errors. A word dictionary as well as phonological rules representing phoneme
modification rules associated with coarticulation are used for the word or word-sequence recognition. In the linguistic processor, a sentence is produced by removing incorrect candidate words according to linguistic knowledge such as syntax, semantics, and context information. On the other hand, restrictions on word candidates are provided in the top-down direction from the linguistic processor to the acoustic processor. Acoustic and linguistic processes are sometimes combined at a level below the word level. Actual acoustic and linguistic processors are further divided into multiple subsystems.
2. Blackboard model
In the blackboard model, as in the hierarchy model, the recognition system is divided into multiple subsystems. A special feature of this system, however, is that each subsystem gains access to a common database independently to verify various hypotheses, as shown in Fig. 8.17(b). The process in each subsystem can be performed in parallel without synchronization. The Hearsay II system is a successful example of the blackboard model (Lesser et al., 1975). Systems based on hierarchy and blackboard models are characterized by flexibility. This is because various knowledge sources are classified and systematically combined to achieve the recognition and understanding of sentences while preserving their independence.
Network model
The network model embeds all knowledge except the system control mechanism in one network, with every process beingperformed in this network, as shown in Fig. 8.17(c). Sentence recognition based on this model corresponds to the process of searching for a path in the network which matches the input speech. The process is thus similar to connected word recognition. Although the number of calculations
is relatively large, information loss on each level as well as information loss propagation can be prevented. In addition, all processes can be controlled homogeneously, and all knowledge sources can be handled uniformly. The Harpy system is a successful application of this model (Lowerre, 1976). The problem with the network model is that it is not as flexible in its application as the two previous models. Most of the recent large-vocabulary continuous-speech recognition systems have been built based on the network model.
8.9.2 Other System Constructing Factors
The directions in which processes proceed are exemplified by the left-to-right and island-driven methods. In the former method, input speech is successively processed from beginning to end. In the latter method, the most reliable candidate word is first detected in the input speech, which is then processed from this word to both ends. Although both methods have advantages and disadvantages, the left-to-right method is more frequently used. This is because important words tend to be nearer the beginning of sentences and the left-to-right method is much easier to control.
Quantitative evaluation and selection of hypotheses are carried out by a variety of tree search algorithms. The depth-first method processes the longest word string first, and if this search fails, the system backtracks to the previous node. In the breadth-first method, all word strings of the same length are processed in parallel, with the process proceeding from short to long strings. With the best-first method, the word string having the largest evaluation value is selected at every node. The stack algorithm (Bahl et al., 1983) is widely used to find the best path first. These methods differ only in their search orders, exhibiting no essential difference in search capability. Reducing the search cost while maintaining the search efficiency, however, is very important for practical applications.
The beam search method (Lowerre, 1976) is a modification of the breadth-first method, in which word strings with relatively
large evaluation values are selected and processed in parallel. New algorithms such as the tree-trellis algorithm (Soong and Huang, 1991), which combines a Viterbi forward search and an A* (Paul, 1991) backward search, are very efficient in generating N-best results (see Subsection 8.9.5). Various other trials have also been examined, including pruning until only reliable candidates remain.
Syntactic information, that is, syntactic rules and task-dependent knowledge, is usually represented using statistical language modeling or context-free grammar (CFG). When more sophisticated control is required, it is represented by generation rules (rewriting rules) or by an augmented transition network (ATN) in which semantic information is embedded.
Semantic information is represented in various ways. These include being represented by: (1) a combination of semantic markers which indicate fundamental concepts necessary for classifying the meaning of words; (2) embedding the restriction of semantic word classes in the syntactic description as described above; (3) a semantic net which indicates the semantic relationship between word classes using a graph with nodes and branches; and (4) a case frame in which all words, mainly verbs, are qualified by words or phrases in a semantic class which coexist with the word. Procedural knowledge representation, predicate logic, and a production system have also been used for semantic information representation.
8.9.3 Statistical Theory of Continuous-Speech Recognition
In the state-of-the-art approach, speech production as well as the recognition process is modeled through four stages: text generation,
FIG. 8.18 Structure of the state-of-the-art continuous speech recognition system. [The figure shows text generation and speech production forming the acoustic channel (transmission theory), and acoustic processing and linguistic decoding forming the speech recognition process.]
speech production, acoustic processing, and linguistic decoding, as shown in Fig. 8.18. A speaker is assumed to be a transducer that transforms into speech the text of thoughts he/she intends to communicate (information source). Based on information transmission theory, the sequence of processes is compared to an information transmission system, in which a word sequence W is converted into an acoustic observation sequence Y, with probability P(W, Y), through a noisy transmission channel, which is then decoded to an estimated sequence Ŵ. The goal of recognition is then to decode the word string, based on the acoustic observation sequence, so that the decoded string has the maximum a posteriori (MAP) probability (Rabiner and Juang, 1993; Young, 1996), i.e.,
Ŵ = argmax_W P(W | Y).    (8.78)

Using Bayes' rule, Eq. (8.78) can be written as

Ŵ = argmax_W [P(Y | W) P(W) / P(Y)].    (8.79)
Since P(Y) is independent of W, the MAP decoding rule of Eq. (8.79) is

Ŵ = argmax_W P(Y | W) P(W).    (8.80)

The first term in Eq. (8.80), P(Y | W), is generally called the acoustic model, as it estimates the probability of a sequence of acoustic observations conditioned on the word string. The second term, P(W), is generally called the language model since it describes the probability associated with a postulated sequence of words. Such language models can incorporate both syntactic and semantic constraints of the language and the recognition task. Often, when only syntactic constraints are used, the language model is called a grammar. When the language model is represented in a finite state network, it can be integrated into the acoustic model in a straightforward manner. HMMs and statistical language models are typically used as the acoustic and language models, respectively. Figure 8.19 diagrams the computation of the probability P(W | Y) of word sequence W given the parameterized acoustic signal Y. The likelihood of the acoustic data P(Y | W) is computed using a composite hidden Markov model representing W, constructed from simple HMM phoneme models joined in sequence according to word pronunciations stored in a dictionary.
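In the log domain, the MAP rule of Eq. (8.80) reduces to summing acoustic and language-model log-probabilities for each candidate word string and keeping the best. A minimal sketch, with made-up log-scores standing in for the HMM and N-gram computations:

```python
# Hypothetical candidate word strings with acoustic log P(Y|W)
# and language-model log P(W); all values are illustrative only.
hypotheses = {
    "recognize speech":   {"log_p_y_given_w": -120.0, "log_p_w": -8.5},
    "wreck a nice beach": {"log_p_y_given_w": -118.0, "log_p_w": -14.0},
}

def map_decode(hyps):
    """Eq. (8.80) in the log domain: argmax_W log P(Y|W) + log P(W)."""
    return max(hyps, key=lambda w: hyps[w]["log_p_y_given_w"] + hyps[w]["log_p_w"])

best = map_decode(hypotheses)
print(best)
```

Here the language model overrides a slightly better acoustic score, which is exactly the role Eq. (8.80) assigns to P(W).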
8.9.4 Statistical Language Modeling

The statistical language model P(W) for word sequences W is estimated from a given large text (training) corpus (Jelinek, 1997; Ney et al., 1997). Using the definition of conditional probabilities, we obtain the decomposition
P(W) = P(w1 w2 ... wk) = Π(i=1..k) P(wi | w1 ... wi-1).    (8.82)
For large-vocabulary speech recognition, these conditional probabilities are typically used in the following way. The dependence of the conditional probability of observing a word wi at a position i is assumed to be restricted to its immediate N-1 predecessor words wi-N+1 ... wi-1. The resulting model is that of a Markov chain and is referred to as an N-gram language model (N = 1: unigram; N = 2: bigram; and N = 3: trigram). The conditional probabilities P(wi | wi-N+1 ... wi-1) can be estimated by the simple relative frequency
P(wi | wi-N+1 ... wi-1) = C(wi-N+1 ... wi) / C(wi-N+1 ... wi-1),    (8.83)

in which C is the number of occurrences of the string in its argument in the given training corpus. In order for the estimate in Eq. (8.83) to be reliable, C has to be substantial in the given corpus. However, if the vocabulary size is 2000 and N = 4, the possible number of different word sequences is 16 trillion (2000^4), and, therefore, even if a considerably large training corpus is given, C = 0 for many possible word sequences. One way to circumvent this problem is to smooth the N-gram frequencies by using the deleted interpolation method (Jelinek, 1997). In the case of N = 3, the trigram model, the smoothing is done by interpolating trigram, bigram, and unigram values
P(wi | wi-2 wi-1) = λ1 f(wi | wi-2 wi-1) + λ2 f(wi | wi-1) + λ3 f(wi),
where f denotes relative frequency and the nonnegative weights satisfy λ1 + λ2 + λ3 = 1. The weights can be obtained by applying the principle of cross-validation and the EM algorithm. This method has a disadvantage in that it needs a huge number of computations if the vocabulary size is large.
In order to estimate the values of N-grams that do not occur in the training corpus from N-1-gram values, Katz's backoff smoothing (Katz, 1987; Ney et al., 1997), based on the Good-Turing estimation theory, is widely used. In this method, the number of occurrences of N-grams with few occurrences is further reduced, and the left-over probability is distributed among the unobserved N-grams in proportion to their N-1-gram probabilities. The N-gram reducing ratio is called the discounting ratio.
Even with these methods, it is practically almost impossible to obtain N-grams with N larger than 3 for a large vocabulary. Therefore, the word 4-grams are often approximated by class 4-grams using word classes (groups), such as part of speech, as units as follows:

P(wi | wi-3 wi-2 wi-1) ≈ P(wi | ci) P(ci | ci-3 ci-2 ci-1),
where ci indicates the ith word class. A method using word co-occurrences as statistics over a wider range than adjacent words has also been explored. Language model adaptation for specific tasks and users has also been investigated. Introducing statistics into conventional grammar, as well as bigrams and trigrams of phonemes instead of words, has also been tried. Statistical language modeling is a method that incorporates both syntactic and semantic information simultaneously.
One of the important issues in training statistical language models from Japanese text is that there is no spacing between words in the written form; indeed, no clear definition of words exists. Therefore, morphemes instead of words are used as units, and morphological analysis is applied to the training text to split sentences into morphemes and produce their bigrams and trigrams.
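The relative-frequency estimate of Eq. (8.83) and its smoothing by deleted interpolation can be sketched as follows. The tiny corpus and the fixed weights are purely illustrative; in practice the weights λ1, λ2, λ3 are trained by cross-validation with the EM algorithm.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# N-gram counts C(.) over the toy training corpus.
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def p_interp(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    """Interpolated trigram: l1*f(w|u,v) + l2*f(w|v) + l3*f(w)."""
    l1, l2, l3 = lambdas
    f3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    f2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    f1 = uni[w] / total
    return l1 * f3 + l2 * f2 + l3 * f1

# The trigram "on the cat" never occurs in the corpus, yet the
# bigram and unigram terms still give it probability mass.
p = p_interp("cat", "on", "the")
print(p)
```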
8.9.5 Typical Structure of Large-Vocabulary Continuous-Speech Recognition Systems
The structure of a typical large-vocabulary continuous-speech recognition system currently under study is shown in Fig. 8.20 (Rabiner and Juang, 1993). In this system, a speech wave is first converted into a time series of feature parameters, such as cepstra and delta-cepstra, in the feature extraction part. The system predicts a sentence hypothesis that is likely to be spoken by the user, based on the current topic, the meaning of words, and language grammar, and represents the sentence as a sequence of words. This sequence is then converted into a sequence of phoneme models which were created beforehand in a training stage. Each phoneme model is typically represented by an HMM. The likelihood (probability) of producing the time series of feature parameters from the sequence of the phoneme models is calculated, and combined with the linguistic likelihood of the hypothesized sequence to calculate the overall likelihood that the sentence was uttered by the speaker. The overall likelihood is calculated for other sentence hypotheses, and the sentence with the highest likelihood score is chosen as the recognition result. Thus, in most of the current advanced systems, the recognition process is performed top-down, that is, driven by linguistic knowledge. For state-of-the-art systems, stochastic N-grams are extensively used. The use of a context-free language in recognition is still limited, mainly due to the increase in computation and the difficulty in stochastic modeling.
In order to incorporate linguistic context within a speech subword unit, triphones and generalized triphones are now widely used. It has been shown that the recognition accuracy of a task can be increased when linguistic context dependency is properly incorporated to reduce the acoustic variability of the speech units being modeled. When triphones are used they result in a system that has too many parameters to train.
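The scale of the triphone parameter problem can be seen with some simple arithmetic. The phone-set size of 48 is borrowed from the LIMSI system of Subsection 8.10.2; the states-per-model and mixture sizes are typical round numbers assumed here for illustration.

```python
phones = 48                       # size of the base phone set
triphones = phones ** 3           # every left/right context pair
states_per_model = 3              # assumed states per triphone HMM
gaussians_per_state = 32          # assumed mixture components
dims = 39                         # feature vector dimension
params_per_gaussian = 2 * dims    # diagonal covariance: mean + variance

total_params = (triphones * states_per_model *
                gaussians_per_state * params_per_gaussian)
print(triphones, total_params)
```

Over 10^5 distinct triphones and on the order of 10^8 Gaussian parameters, most of which would receive little or no training data; hence the tying schemes discussed next.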
The problem of too many parameters and too little training data is crucial in the design of a statistical speech recognizer. Therefore, tied-mixture models and
state-tying have been proposed. Figure 8.21 (Knill and Young, 1997) shows a procedure for building tied-state Gaussian-mixture triphone HMMs. In this method, similar HMM states of the allophonic variants of each basic phone are tied together in order to maximize the amount of data available to train each state. The choice of which states to tie is made based on clustering using a phonetic decision tree, where phonetic questions, such as 'Is the left context a nasal?', are used to partition the present set into subsets in a way that maximizes the likelihood of the training data. The leaf nodes of each tree determine the sets of state tyings for each of the allophonic variants.
In fluent continuous speech it has also been shown that interword units take into account cross-word coarticulation and therefore provide more accurate modeling of speech units than intraword units. Word-dependent units have also been used to model poorly articulated speech sounds such as function words like a, the, in, etc.
Since a full search of the hypotheses is very expensive in terms of processing time and storage requirements, suboptimal search strategies are commonly used. As opposed to the traditional left-to-right, one-pass search strategies, multi-pass algorithms perform a search in a way that the first pass typically prepares partial theories and additional passes finalize the complete theory in a progressive manner. Multi-pass algorithms are usually designed to provide the N-best string hypotheses. To improve flexibility, simpler acoustic and language models are often used in the first pass as a rough match to introduce a word lattice. Detailed models and detailed matches are applied in later passes to combine partial theories into the recognized sentence.
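The rough-match-then-rescore idea can be sketched as a cheap first pass that keeps the N best strings and a detailed second pass restricted to the survivors. Hypothesis names and scores here are made up.

```python
import heapq

def first_pass(hypotheses, simple_score, n_best=3):
    """Rough match: rank all hypotheses with a cheap model, keep the N best."""
    return heapq.nlargest(n_best, hypotheses, key=simple_score)

def second_pass(short_list, detailed_score):
    """Detailed match applied only to the surviving candidates."""
    return max(short_list, key=detailed_score)

# Log-scores from a simple (e.g., bigram) and a detailed (e.g.,
# trigram) model; purely illustrative values.
simple = {"a": -10.0, "b": -11.0, "c": -12.0, "d": -30.0}
detailed = {"a": -9.0, "b": -7.5, "c": -8.0, "d": -5.0}

survivors = first_pass(simple, simple.get, n_best=3)
result = second_pass(survivors, detailed.get)
print(survivors, result)
```

Note that hypothesis "d", although best under the detailed model, was pruned in the first pass; this is the search-error risk inherent in any suboptimal multi-pass strategy.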
8.9.6 Methods for Evaluating Recognition Systems
Three measures for representing the syntactic complexity of recognition tasks have thus far been proposed to facilitate the evaluation of the difficulty of speech recognition tasks. The
average branching factor indicates the average number of words which can be predicted, that is, the words that can follow at each position of syntactic analysis (Goodman, 1976). Equivalent vocabulary size is a modification of the average branching factor in which the acoustic similarity between words is taken into consideration (Goodman, 1976). Finally, perplexity is defined by 2^H, where H is the entropy of a word string in sentence speech (Bahl et al., 1982). Entropy H is given by the equation

H = - lim(n→∞) (1/n) Σ P(w1 ... wn) log2 P(w1 ... wn),

where P(w1 ... wn) is the probability of observing the word sequence and the sum is taken over all word sequences of length n. However, the language model perplexity calculated by using a training text corpus does not necessarily indicate the uncertainty of the texts which appear in speech recognition, since the text database is limited in its size and it does not necessarily represent the whole natural language. Therefore, the following test-set perplexity PP, or log perplexity log PP, is frequently used for evaluating the difficulty of the recognition task:
log PP = - (1/N) log2 P(w1 ... wN).
This indicates the observation probability of the evaluation (recognition) text per word, measured using the trained language model. Although each measure offers its own benefits, a perfect measure has not yet been proposed.
The performance of recognition systems is usually measured by the following %correct or accuracy:
%correct = (N - sub - del) / N × 100,    (8.88)

accuracy = (N - sub - del - ins) / N × 100.    (8.89)
When words are used as measuring units, they are called word %correct and word accuracy. N is the number of words in the speech for evaluation, and sub, del and ins are the numbers of substitution errors, deletion errors and insertion errors, respectively. The accuracy, which includes insertion errors, is more strict than %correct, which does not. The number calculated by subtracting the accuracy from 100 is called the error rate. Actual systems should be evaluated by the combination of task difficulty and recognition performance.
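The sub, del and ins counts in Eqs. (8.88) and (8.89) are obtained by aligning the recognized word string to the reference with dynamic programming. A minimal sketch (the example sentences are invented):

```python
def align_counts(ref, hyp):
    """DP alignment of hypothesis to reference, returning the
    substitution, deletion and insertion counts of the best path."""
    # d[i][j] = (edit cost, sub, del, ins) for ref[:i] vs hyp[:j]
    d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        c = d[i - 1][0]
        d[i][0] = (c[0] + 1, c[1], c[2] + 1, c[3])          # all deletions
    for j in range(1, len(hyp) + 1):
        c = d[0][j - 1]
        d[0][j] = (c[0] + 1, c[1], c[2], c[3] + 1)          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c = d[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = c                                    # match
            else:
                diag = (c[0] + 1, c[1] + 1, c[2], c[3])     # substitution
            c = d[i - 1][j]
            up = (c[0] + 1, c[1], c[2] + 1, c[3])           # deletion
            c = d[i][j - 1]
            left = (c[0] + 1, c[1], c[2], c[3] + 1)         # insertion
            d[i][j] = min(diag, up, left)
    return d[len(ref)][len(hyp)][1:]

ref = "she had your dark suit".split()
hyp = "she had a your dark soup".split()
sub, dele, ins = align_counts(ref, hyp)
n = len(ref)
pct_correct = 100.0 * (n - sub - dele) / n        # Eq. (8.88)
accuracy = 100.0 * (n - sub - dele - ins) / n     # Eq. (8.89)
print(pct_correct, accuracy)
```

For a reference of length N = 5 with one substitution and one insertion, this gives word %correct = 80 and word accuracy = 60, illustrating how insertions make the accuracy the stricter measure.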
8.10 EXAMPLES OF LARGE-VOCABULARY CONTINUOUS-SPEECH RECOGNITION SYSTEMS

8.10.1 DARPA Speech Recognition Projects
Applications of speech recognition technology can be classified into the two main areas of transcription and human-computer dialogue systems. A series of DARPA projects have been a major driving force of the recent progress in research on large-vocabulary, continuous-speech recognition. Specifically, transcription of speech reading newspapers, such as North American business (NAB) newspapers including the Wall Street Journal (WSJ), and conversational speech recognition using an Air Travel Information System (ATIS) task were actively investigated. Recently, broadcast news (BN) transcription and natural conversational speech recognition using Switchboard and Call Home tasks have been investigated as
major DARPA programs. Research on human-computer dialogue systems, named the Communicator Program, has also started.
The broadcast news transcription technology has recently been integrated with information extraction and retrieval technology, and many application systems, such as automatic voice document indexing and retrieval systems, are under development. These systems integrate various diverse speech and language technologies including speech recognition, speaker change detection, speaker identification, name extraction, topic classification and information retrieval. In the human-computer interaction domain, a variety of experimental systems for information retrieval through spoken dialogue are investigated.

8.10.2 English Speech Recognition System at LIMSI Laboratory
The structure of a typical large-vocabulary continuous-speech recognition system developed at LIMSI Laboratory in France for recognizing English broadcast-news speech is outlined as follows (Gauvain et al., 1999). The system uses continuous density HMMs with Gaussian mixtures for acoustic modeling and backoff N-gram statistics estimated on large text corpora for language modeling. For acoustic modeling, 39 cepstral parameters, consisting of 12 cepstral coefficients and the log energy, along with their first and second order derivatives, are derived from a Mel frequency spectrum estimated on the 0-8 kHz band (0-3.5 kHz for telephone speech models) every 10 ms. The pronunciations are based on a 48-phone set (three of them are used for silence, filler words, and breath noises). Each cross-word context-dependent phone model is a tied-state left-to-right HMM with Gaussian mixture observation densities (about 32 components), where the tied states are obtained by means of a decision tree.
The acoustic models were trained on about 150 hours of Broadcast News data. Language models were trained on different data sets: BN transcripts, NAB newspapers and AP Wordstream
texts. The recognition vocabulary contains 65,122 words (72,788 phone transcriptions) and has a lexical coverage of over 99% on the evaluation test data. Prior to word decoding, a maximum likelihood partitioning algorithm using Gaussian mixture models (GMMs) segments the data into homogeneous regions and assigns gender, bandwidth and cluster labels to the speech segments. Details of the segmentation and labeling procedure are shown in Fig. 8.22. A criterion similar to the BIC (Bayesian Information Criterion) (Schwarz, 1978) or MDL (Minimum Description Length) (Rissanen, 1984) criterion is used to decide the number of segments.
The word decoding procedure is shown in Fig. 8.23. The cepstral coefficients are normalized on a segment cluster basis using cepstral mean normalization and variance normalization. Each resulting cepstral coefficient for each segment has a zero mean and unity variance. Prior to decoding, segments longer than 30 s are chopped into smaller pieces so as to limit the memory required for the trigram decoding pass. Word recognition is performed in three steps: 1) initial hypotheses generation, 2) word graph generation, and 3) final hypothesis generation, each with two passes. The initial hypotheses are used in cluster-based acoustic model adaptation using the MLLR technique prior to word graph generation and in all subsequent decoding passes. The final hypothesis is generated using a 4-gram interpolated with a category trigram model with 270 automatically generated word classes. The overall word transcription error on the November 1998 evaluation data was 13.6%.

8.10.3
English Speech Recognition System at IBM Laboratory
The IBM system uses acoustic models for sub-phonetic units with context-dependent tying (Chen et al., 1999). The instances of context-dependent sub-phone classes are identified by growing a decision tree from the available training data and specifying the terminal nodes of the tree as the relevant instances of these classes.
FIG. 8.22 Segmentation and labeling procedure of the LIMSI system: Viterbi segmentation with GMMs (speech/music/background) → chop into small segments → train a GMM for each segment → Viterbi segmentation and re-estimation → GMM clustering (repeated while the number of clusters decreases, until no change) → Viterbi segmentation with energy constraint → bandwidth and gender identification.
FIG. 8.23 Word decoding procedure of the LIMSI system: cepstral mean and variance normalization for each segment cluster → chop into segments smaller than 30 s → generate initial hypotheses → MLLR adaptation and word graph generation.
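The cepstral mean and variance normalization step shown in Fig. 8.23 maps each coefficient to zero mean and unit variance within a segment cluster; a minimal sketch on a toy two-dimensional segment:

```python
import math

def cmvn(frames):
    """Normalize each coefficient to zero mean and unit variance
    over the given segment (a list of feature vectors)."""
    dim, n = len(frames[0]), len(frames)
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]
    var = [sum((f[i] - mean[i]) ** 2 for f in frames) / n for i in range(dim)]
    std = [math.sqrt(v) if v > 0 else 1.0 for v in var]
    return [[(f[i] - mean[i]) / std[i] for i in range(dim)] for f in frames]

segment = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
norm = cmvn(segment)
# after normalization, each coefficient column of the segment
# has zero mean and unit variance
```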
The acoustic feature vectors that characterize the training data at the leaves are modeled by a mixture of Gaussian or Gaussian-like pdfs with diagonal covariance matrices. The HMM used to model
each leaf is a simple one-state model, with a self-loop and a forward transition. The total number of Gaussians is 289 k. The BIC is used as a model selection criterion in segmentation, clustering for unsupervised adaptation, and choosing the number of Gaussians in Gaussian mixture modeling. The IBM system shows almost the same word error rate as the LIMSI system.

8.10.4 A Japanese Speech Recognition System
A large-vocabulary continuous-speech recognition system for Japanese broadcast-news speech transcription has been developed at Tokyo Institute of Technology in Japan (Ohtsuki et al., 1999). This is part of a joint research with a broadcast company whose goal is the closed-captioning of TV programs. The broadcast-news manuscripts that were used for constructing the language models were taken from a period of roughly four years, and comprised approximately 500 k sentences and 22 M words. To calculate word N-gram language models, the broadcast-news manuscripts were segmented into words by using a morphological analyzer, since Japanese sentences are written without spaces between words. A word-frequency list was derived for the news manuscripts, and the 20 k most frequently used words were selected as vocabulary words. This 20 k vocabulary covers about 98% of the words in the broadcast-news manuscripts. Bigrams and trigrams were calculated, and unseen N-grams were estimated using Katz's back-off smoothing method. As shown in Fig. 8.24, a two-pass search algorithm was used, in which bigrams were utilized in the first pass and trigrams were employed in the second pass to rescore the N-best hypotheses obtained as the result of the first pass.
Japanese text is written with a mixture of three kinds of characters: Chinese characters (Kanji) and two kinds of Japanese characters (Hiragana and Katakana). Each Kanji has multiple readings, and correct readings can only be decided according to context. Therefore, a language model that depends on the readings of words was constructed in order to take into account the
FIG. 8.24 Two-pass search structure used in the Japanese broadcast-news transcription system: after acoustic analysis of the input speech, a beam-search decoder (first pass, using the trained acoustic models and bigram language model) produces N-best hypotheses with acoustic scores, which are then rescored (second pass, using the trained trigram language model) to give the recognition results.
frequency and context-dependency of the readings. Broadcast-news speech includes filled pauses at the beginning and in the middle of sentences, which cause recognition errors in language models that use news manuscripts written prior to broadcasting. To cope with this problem, filled-pause modeling was introduced into the language model.
After applying online, unsupervised, incremental speaker adaptation using the MLLR-MAP (see Subsection 8.11.4) and VFS (vector-field smoothing) (Ohkura et al., 1992) methods, a word error rate of 11.9%, on average over male and female speakers, was obtained for clean speech with no background noise.
Summarizing transcribed news speech is useful for retrieving or indexing broadcast news. A method has been investigated for extracting topic words from nouns in the speech recognition results on the basis of a significance measure. The extracted topic words were compared with 'true' topic words, which were given by three human subjects. The results showed that, when the top five topic words were chosen (recall = 13%), 87% of them were correct on average. Based on these topic words, summarizing sentences were created by reconstructing compound words and inserting verbs and postpositional particles.
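The word-frequency-based vocabulary selection used in this system (keep the 20 k most frequent words, then measure how much of the corpus they cover) can be sketched on a toy corpus:

```python
from collections import Counter

corpus = ("news news today weather news sports today "
          "weather news election").split()

def select_vocab(tokens, size):
    """Keep the `size` most frequent words and report corpus coverage."""
    counts = Counter(tokens)
    vocab = {w for w, _ in counts.most_common(size)}
    covered = sum(c for w, c in counts.items() if w in vocab)
    return vocab, covered / len(tokens)

vocab, coverage = select_vocab(corpus, size=3)
print(sorted(vocab), coverage)
```

On the real 22 M-word manuscript corpus the same computation gave about 98% coverage for the 20 k vocabulary.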
8.11 SPEAKER-INDEPENDENT AND ADAPTIVE RECOGNITION

Speaker-dependent variations in speech spectra are very complicated, and, as indicated in Subsection 8.1.2, there is no evidence that common physical features exist in the same words uttered by different speakers, even if they can be clearly recognized by humans. A statistical analysis of the relationship between phonetic and individual information revealed that there is significant interaction between them (Furui, 1978). It is thus very difficult for a system to accurately recognize spoken words or sentences uttered by many speakers, even if the vocabulary is as small as 10 digits. Only with a small vocabulary and no similar word pairs in the spectral domain can high accuracy be
achieved using a reference template or a model obtained by averaging the spectral patterns of many speakers for each word. Although looking for phonetic invariants, principally physical features commonly existing for all speakers for each phoneme, is important as basic research, it seems too ambitious an undertaking. The present effective methods for coping with the problem of speaker variability can be classified into two types. One constitutes methods in which reference templates or statistical word/subword models are designed so that the range of individual variation is covered by them for each word, whereas the ranges of different words do not overlap. The other includes those in which the recognition system is provided with a training mechanism for automatically adapting to each new speaker.
The need for effectively handling individual variations in speech has resulted in the latter type of method, that is, introducing normalization or adaptation mechanisms into a speech recognizer. Such a method is based on the voice characteristics of each speaker observed using utterances of a small number of words or short sentences. In the normalization method, spectral variation is normalized or removed from input speech, whereas in the adaptation method, the recognizer templates or models are adapted to each speaker. Normalization or adaptation mechanisms are essential for very-large-vocabulary speaker-independent word recognition. Since it is almost impossible to conduct training involving every word in a large vocabulary, training using a short string of speech serves as a useful and realistic way of coping with the individuality problem. Unsupervised (online) adaptation has also been attempted, wherein the recognition system is automatically adapted to the speaker through the repetition of the recognition process without the need for the utterances of predetermined words or sentences. Humans have also been found to possess a similar adaptation mechanism.
Specifically, although the first several words uttered by a speaker new to the listener may be unintelligible, the latter quickly becomes accustomed to the former's voice. Thus, the intelligibility of the speaker's voice increases, particularly after the listener hears several words and utterances (Kato and Kawahara, 1984).
This section will focus on: 1) the multi-template method, in which multiple templates are created for each vocabulary word by clustering individual variations; 2) the statistical method, in which individual variations are represented by the statistical parameters in HMMs; and 3) the speaker normalization and adaptation methods, in which speaker variability of input speech is automatically normalized or speaker-independent models are adapted to each new speaker.

8.11.1 Multi-template Method
A spoken word recognizer based on the multi-template method clusters the speech data uttered by many speakers, and the speech sample at the center of each cluster, or the mean value for the speech data associated with each cluster, is stored as a reference template. Several algorithms are used in combination for clustering (Rabiner et al., 1979a, b). In the recognition phase, distances (or similarities) between input speech and all reference templates of all vocabulary words are calculated based on DP matching, and the word with the smallest distance is selected as the word spoken. In order to increase the reliability, the KNN (K-nearest neighbor) method is often used for the decision. Here, K reference templates with the smallest distances from the input speech are selected from the multiple-reference template set for each word, with the mean value for these K templates being calculated for each word. The word with the smallest mean value is then selected as the recognition result. Experiments revealed that with 12 templates for each word, the recognition accuracy for K = 2 to 3 is higher than for K = 1. Speaker-independent connected-digit recognition experiments were performed combining the LB and multi-template methods.
This method is disadvantageous, however, in that when the number of reference templates for each word increases, the recognition task becomes equivalent to large-vocabulary word recognition, increasing the number of calculations and the memory
size. These problems have been resolved through the investigation of two methods based on the structure shown in Fig. 8.4(b), in which phoneme templates and a word dictionary are utilized. In the first trial, the same word dictionary was used for all speakers, and multiple sets of phoneme templates were prepared to cover variations in individual speakers (Nakatsu et al., 1983). For the second instance, the SPLIT method (see Subsection 8.6.2) was modified to use multiple word templates (pseudo-phoneme sequences) for each word to cover speaker variations, whereas the set of pseudo-phoneme templates remains common to all speakers (Sugamura and Furui, 1984). This method was found to be able to reduce the number of calculations and memory size to roughly one-tenth of the method using word-based templates, while maintaining recognition accuracy. In this method, pseudo-phonemes and multiple sequences in the word dictionary are produced by the same clustering algorithm.
A VQ-based preprocessor is combined with the modified SPLIT method for large-vocabulary speaker-independent isolated word recognition (Furui, 1987). Here, a speech wave is analyzed by time functions of instantaneous cepstral coefficients and short-time regression coefficients for both cepstral coefficients and logarithmic energy. Regression coefficients represent spectral dynamics in every short period, as described in Sec. 8.3.6. A universal VQ codebook for these time functions is constructed based on a multispeaker, multiword database. Next, a separate codebook is designed as a subset of the universal codebook for each word in the vocabulary. These word-specific codebooks are used for front-end processing to eliminate word candidates with large-distance (distortion) scores. The SPLIT method subsequently resolves the choice among the remaining word candidates.

8.11.2 Statistical Method
The HMM method described in Sec. 8.7 is capable of including spectral distribution and variation in transitional probability for
many speakers in the model as a result of statistical parameter estimation. It has been repeatedly shown that, given a large set of training speech, good statistical models can be constructed to achieve a high performance for many standardized speech recognition tasks. Recognition experiments demonstrated that this method can achieve better recognition accuracy than the multi-template method. The amount of computation required in the HMM method is much smaller than in the multi-template method (Rabiner et al., 1983). A trial was also conducted using HMMs at the word level in the LB method (Rabiner and Levinson, 1985).
It is still impossible, however, to accurately recognize the utterances of every speaker. A small percentage of people occasionally cause systems to produce exceptionally low recognition rates because of large mismatches between the models and the input speech. This is an example of the 'sheep and goats' phenomenon.

8.11.3 Speaker Normalization Method
The nature of the speech production mechanism suggests that the vocal cord spectrum and the effects of vocal tract length cause phoneme-independent physical individuality in voiced sounds. Furthermore, the former can be observed in the averaged overall spectrum, that is, in the overall pattern of the long-time average spectrum, and the latter can be seen in the linear expansion or contraction coefficient along the frequency axis for the speech spectrum. Based on this evidence, individuality normalization has been introduced for the phoneme-based word recognition system described in Subsection 8.6.1 (Furui, 1975). Experimental results show that although this method is effective, a gap exists between the recognition accuracies obtained using the method and those obtained after training utilizing all of the vocabulary words for each speaker. This means that a more complicated model is necessary to ensure complete representation of voice individuality.

Nonlinear warping of the spectrum along the frequency axis has also been attempted, using the DP technique to normalize voice individuality (Matsumoto and Wakita, 1986). Since excessive warping causes the loss of phonetic features, an appropriate limit must be set for the warping function.

8.11.4 Speaker Adaptation Methods
The main adaptation methods currently being investigated are: 1) Bayesian learning, 2) spectral mapping, 3) linear (piecewise-linear) transformation, and 4) speaker cluster selection. Important practical issues in using adaptation techniques include the specification of a priori parameters (information), the availability of supervision information, and the amount of adaptation data needed to achieve effective learning. Since it is unlikely that all the phoneme units will be observed enough times in a small adaptation set, especially in large-vocabulary continuous-speech recognition systems, only a small number of parameters can be effectively adapted. It is therefore desirable to introduce some parameter correlation or tying so that all model parameters can be adjusted at the same time in a consistent manner, even if some units are not included in the adaptation data.

The Bayesian learning framework offers a way to incorporate newly acquired application-specific data into existing models and to combine them in an optimal manner. It is therefore an efficient technique for handling the sparse training data problem typically found in model parameter adaptation. This framework has been used to derive MAP (maximum a posteriori) estimates of the parameters of speech models, including HMM parameters (Lee and Gauvain, 1996). The MCE/GPD method described in Subsection 8.7.8 has also been successfully combined with MAP speaker adaptation of HMM parameters (Lin et al., 1994; Matsui et al., 1995).

In the spectral mapping method, speaker-adaptive parameters are estimated from speaker-independent parameters based on mapping rules. The mapping rules are estimated from the relationship between speaker-independent and speaker-dependent parameters (Shikano et al., 1986). If a correlation structure between parameters can be established, and the correlation parameters can be estimated when training the general models, the parameters of unseen units can be adapted accordingly (Furui, 1980; Cox, 1995). To improve adaptation efficiency and effectiveness along this line, several techniques have been proposed, including probabilistic spectral mapping (Schwartz et al., 1987), cepstral normalization (Acero et al., 1990), and spectrum bias and shift transformation (Sankar and Lee, 1996).

In addition to clustering and smoothing, a second type of constraint can be given to the model parameters so that all the parameters are adjusted simultaneously according to a predetermined set of transformations, e.g., a transformation based on multiple regression analysis (Furui, 1980). Various methods have recently been proposed in which a linear (affine) transformation between the reference and adaptive speaker feature vectors is defined and then translated into a bias vector and a scaling matrix, which can be estimated using an EM algorithm (MLLR: maximum likelihood linear regression) (Leggetter and Woodland, 1995). The transform parameters can be estimated from adaptation data that form pairs with the training data.

In the speaker cluster selection method, it is assumed that speakers can be divided into clusters, within which the speakers are similar. From many sets of phoneme model clusters representing speaker variability, the most suitable set for the new speaker is automatically selected. This method is useful for choosing initial models, to which more sophisticated speaker adaptation techniques are applied.

8.11.5 Unsupervised Speaker Adaptation Methods
The most useful adaptation method is unsupervised online instantaneous adaptation. In this approach, adaptation is performed at run time on the input speech in an unsupervised manner. Therefore, the recognition system does not require training speech to estimate the speaker characteristics; it works as if it were a universal (speaker-independent) system. This method is especially useful when the speakers change frequently. The most important issue in this method is how to perform phoneme-dependent adaptation without knowing the correct model sequence for the input speech. This is especially difficult for speakers whose utterances are error prone when using universal (speaker-independent) models, that is, for speakers who definitely need adaptation. It is very useful if the online adaptation is performed incrementally, in which case the recognition system continuously adapts to new adaptation data without using previous training data (Matsuoka and Lee, 1993).

Hierarchical spectral clustering is an adaptive clustering technique that performs speaker adaptation in an automatic, self-organizing manner. The method was proposed for a matrix-quantization-based speech coding (Shiraki et al., 1990) and a VQ-based word-recognition system (Furui, 1989a, 1989b) in which each word is represented by a set of VQ index sequences. Speaker adaptation is achieved by adapting the codebook entries (spectral vectors) to a particular speaker while keeping the index sequence set intact. The key idea of this method is to hierarchically cluster the spectra in the new adaptation set in correspondence with those in the original VQ codebook. The correspondence between the centroid of a new cluster and the original code word is established by way of a deviation vector. Using deviation vectors, either code words or input frame spectra are shifted so that the corresponding centroids coincide. Continuity between adjacent clusters is maintained by determining the shifting vectors as the weighted sum of the deviation vectors of adjacent clusters. Adaptation is thus performed hierarchically from global to local individuality, as shown in Fig. 8.25.
In the figure, u_m and v_m indicate the centroid of the mth codebook element cluster and that of the corresponding training speech cluster, respectively, mu_m is the deviation vector between these two centroids, and c_i is a codebook element. The MLLR method has also been used as a constraint in unsupervised speaker adaptation (Cox and Bridle, 1989; Digalakis and Neumeyer, 1995).
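As a rough illustration, the codebook-shifting step can be sketched as follows. This is a minimal, single-level Python sketch, not the book's implementation: the actual method proceeds hierarchically from global to local clusters and interpolates deviation vectors of adjacent clusters to maintain continuity, whereas this sketch simply leaves unmatched code words unchanged. The function name is illustrative.

```python
import numpy as np

def adapt_codebook(codebook, adaptation_frames):
    """Shift VQ code words toward a new speaker's spectra.

    Each adaptation frame is assigned to its nearest code word; the
    deviation vector from a code word to the centroid of its assigned
    frames is then used to shift that code word.  Code words with no
    assigned frames are left unchanged (the full method would instead
    interpolate the deviation vectors of adjacent clusters).
    """
    codebook = np.asarray(codebook, dtype=float)
    frames = np.asarray(adaptation_frames, dtype=float)
    # Nearest-code-word assignment by Euclidean distance.
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    adapted = codebook.copy()
    for m in range(len(codebook)):
        cluster = frames[labels == m]
        if len(cluster):
            deviation = cluster.mean(axis=0) - codebook[m]  # v_m - u_m
            adapted[m] = codebook[m] + deviation
    return adapted
```

Because only the codebook entries move while the VQ index sequences representing each word stay intact, the word models themselves need no retraining.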
The N-best-based unsupervised adaptation method (Matsui and Furui, 1996) uses the N most likely word sequences in parallel and iteratively maximizes the joint likelihood for sentence hypotheses and model parameters. The N-best hypotheses are created for each input utterance by applying speaker-independent models; speaker adaptation based on constrained Bayesian learning is then applied to each hypothesis. Finally, the hypothesis with the highest likelihood is selected as the most likely sequence. Figure 8.26 shows the overall structure of such a recognition system. Conventional iterative maximization, which sequentially estimates hypotheses and model parameters, can only reach a local maximum, whereas the N-best-based method can find a global maximum if reasonable constraints on parameters are applied. Without reasonable constraints based on models of inter-speaker variability, an input utterance can be adapted to any hypothesis with resulting high likelihood. To reduce this problem, constraints should be placed on the transformation so that it maintains a reasonable geometrical shape.

Because inter-speaker variability often interacts with other variations, such as allophonic contextual dependency, intra-speaker speech variation, environmental noise, and channel distortion, it is important to create methods that can simultaneously cope with these other variations. Inter-speaker variability is generally more difficult to cope with than noise and channel variability, since the former is nonlinear whereas the latter can usually be modeled as a linear transformation in the time, spectral, or cepstral domain. Therefore, the algorithms proposed for speaker adaptation can generally be applied to noise and channel adaptation.
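The N-best selection loop just described can be sketched schematically in Python. The `adapt` and `likelihood` callables are placeholders for constrained Bayesian (MAP) adaptation and HMM scoring, which are not implemented here; only the hypothesis-selection logic is shown.

```python
def nbest_unsupervised_adapt(nbest_hypotheses, models, adapt, likelihood, utterance):
    """Adapt the models to each of the N-best hypotheses in turn and keep
    the hypothesis (and adapted models) with the highest joint likelihood.

    `adapt(models, utterance, hyp)` stands in for constrained Bayesian
    (MAP) adaptation; `likelihood(adapted, utterance, hyp)` stands in
    for scoring the utterance with the adapted models.
    """
    best_score, best_hyp, best_models = None, None, None
    for hyp in nbest_hypotheses:
        adapted = adapt(models, utterance, hyp)
        score = likelihood(adapted, utterance, hyp)
        if best_score is None or score > best_score:
            best_score, best_hyp, best_models = score, hyp, adapted
    return best_hyp, best_models
```

Evaluating all N hypotheses before committing is what lets the method escape the local maxima that sequential hypothesis/parameter estimation falls into.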
8.12 ROBUST ALGORITHMS AGAINST NOISE AND CHANNEL VARIATIONS
The performance of a speech recognizer is well known to degrade, often drastically, when acoustic as well as linguistic mismatches exist between the testing and training conditions. In addition to the speaker-to-speaker variability described in the previous section, the acoustic mismatches arise from signal discrepancies due to varying environmental and channel conditions, such as telephone, microphone, background noise, room acoustics, and bandwidth limitations of transmission lines, as shown in Fig. 8.27. When people speak in a noisy environment, not only does the loudness (energy) of their speech increase, but the pitch and frequency components also change. These speech variations are called the Lombard effect. The linguistic mismatches arise from different task constraints.

There has been a great deal of effort aimed at improving speech recognition and hence enhancing performance robustness under the above-mentioned mismatches. Figure 8.28 shows the main methods for reducing mismatches that have been investigated to resolve speech variation problems (Juang, 1991; Furui, 1992b, 1995c), along with the basic sequence of speech recognition processes. These methods can be classified into three levels: signal level, feature level, and model level. Since the speaker normalization and adaptation methods were described in the previous section, this section focuses on environmental and channel mismatch problems.

Several methods have been used to deal with additive noise: using special microphones, using auditory models for speech analysis and feature extraction, subtracting noise, using noise masking and adaptive models, using spectral distance measures that are robust against noise, and compensating for spectral deviation. Various methods have also been used to cope with the problems caused by the differences in characteristics between different kinds of microphones and transmission lines. A commonly used method is cepstral mean subtraction (CMS), also called cepstral mean normalization (CMN), in which the long-term cepstral mean is subtracted from the utterance. This method is very simple but very effective in various applications of speech and speaker recognition (Atal, 1974; Furui, 1981).
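Assuming cepstral feature vectors are already available, one per analysis frame, CMS/CMN amounts to a one-line normalization. The following is a minimal Python sketch, not tied to any particular front end:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """CMS/CMN: subtract the long-term cepstral mean of an utterance.

    `cepstra` has shape (num_frames, num_coeffs).  A fixed channel is
    multiplicative in the spectral domain and therefore additive in the
    cepstral domain, so removing the utterance-level mean removes the
    channel (together with any other stationary spectral tilt).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Note that the mean is taken over the whole utterance, so the method implicitly assumes the channel is stationary for its duration.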
FIG. 8.27 Main causes of acoustic variation in speech: the speaker (voice quality, pitch, gender, dialect, speaking style, stress/emotion, speaking rate, Lombard effect); noise (other speakers, background noise); distortion; the microphone (distortion, electrical noise, directional characteristics); and the task/context (man-machine dialogue, dictation, free conversation, interview, phonetic/prosodic context).
FIG. 8.28 Main methods for contending with voice variation in speech recognition: at the signal level, close-talking microphones, microphone arrays, noise subtraction, and adaptive/comb filtering; at the feature level, auditory models (EIH, SMC, PLP), spectral mapping, cepstral mean normalization, frequency weighting, weighted cepstral distance, and cepstrum projection measures; and at the model level, normalization/adaptation of the reference templates/models by noise addition, HMM (de)composition (PMC), model transformation (MLLR), and Bayesian adaptive learning, together with robust matching techniques such as word spotting and utterance verification.
8.12.1 HMM Composition/PMC
The HMM composition/parallel model combination (PMC) method creates a noise-added-speech HMM by combining HMMs that model speech and noise (Gales and Young, 1992; Martin et al., 1993). This method is closely related to the HMM decomposition proposed by Varga and Moore (1990, 1991). In HMM composition, the observation probabilities (means and covariances) for a noisy speech HMM are estimated by convolving the observation probabilities in the linear spectral domain. Figures 8.29 and 8.30 show the HMM composition process. Since a noise HMM can usually be trained by using input signals without speech, this method can be considered as an adaptation process in which speech HMMs are adapted on the basis of the noise model. This method can be applied not only to stationary noise but also to time-variant noise, such as another speaker's voice. The effectiveness of this method was confirmed by experiments using speech signals to which noise or other speech had been added. The experimental results showed that this method produces recognition rates similar to those of HMMs trained by using a large noise-added speech database. This method has more recently been extended to simultaneously cope with additive noise and convolutional (multiplicative) distortion (Gales and Young, 1993; Minami and Furui, 1995).

8.12.2 Detection-Based Approach for Spontaneous Speech Recognition
One of the most important remaining issues for speech recognition is how to create language models (rules) for spontaneous speech. When recognizing spontaneous speech in dialogs, it is necessary to deal with variations that are not encountered when recognizing speech that is read from texts. These variations include extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluencies, partial words, repairs, hesitations, and repetitions. It is
crucial to develop robust and flexible parsing algorithms that match the characteristics of spontaneous speech. How to extract contextual information, predict users' responses, and focus on key words are also very important issues. A paradigm shift from the present transcription-based approach to a detection-based approach will be important in resolving such problems.

A detection-based system consists of detectors, each of which aims at detecting the presence of a prescribed event, such as a phoneme, a word, a phrase, or a linguistic notion such as an expression of travel destination. The detector uses a model for the event and an anti-model that provides contrast to the event. It follows the Neyman-Pearson lemma in that the likelihood ratio is used as the test statistic against a threshold. Several simple implementations of this paradigm have shown promise in dealing with natural utterances containing many spontaneous speech phenomena (Kawahara et al., 1997). The following issues need to be addressed in this formulation: (1) How to train the models and anti-models? The idea of discriminative training can be applied, using the verification error as the optimization criterion. (2) How to choose detection units? Reasonable choices are words and key phrases. (3) How to include language models and event context/constraints, which can help raise the system performance in the integrated search after the detectors propose individual decisions?
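The likelihood-ratio test underlying such a detector can be sketched as follows. This is a toy Python sketch with univariate Gaussian event and anti-models; real detectors score frame sequences with HMMs or similar sequence models, but the accept/reject logic is the same.

```python
import math

def gaussian_loglik(x, mean, var):
    """Log density of a univariate Gaussian observation model."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def detect_event(frames, event_model, anti_model, threshold=0.0):
    """Neyman-Pearson style detector: accept the event if the
    log-likelihood ratio of the frames under the event model versus
    the anti-model meets the threshold.

    event_model and anti_model are (mean, variance) pairs.
    """
    llr = sum(gaussian_loglik(x, *event_model) - gaussian_loglik(x, *anti_model)
              for x in frames)
    return llr >= threshold
```

Raising the threshold trades missed detections for fewer false alarms, which is exactly the operating-point choice the Neyman-Pearson lemma formalizes.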
Speaker Recognition
9.1 PRINCIPLES OF SPEAKER RECOGNITION

9.1.1 Human and Computer Speaker Recognition
A technology closely related to speech recognition is speaker recognition, or the automatic recognition of a speaker (talker) through measurements of specifically individual characteristics arising in the speaker's voice signal (Doddington, 1985; Furui, 1986; Furui, 1996; Furui, 1997; O'Shaughnessy, 1986; Rosenberg and Soong, 1991). Speaker recognition research is especially closely intertwined with the principles underlying speaker-independent speech recognition technology. In the broadest sense of the word, speaker recognition research also involves investigating the clues humans use to recognize speakers, either by sound spectrogram (voice print) (Kersta, 1962; Tosi et al., 1972) or by hearing.

History notes that as early as 1660 a witness was recorded as having been able to identify a defendant by his voice at one of the trial sessions summoned to determine circumstances surrounding the death of Charles I (NRC, 1979). Speaker recognition did not become a subject of scientific inquiry until over two centuries later, however, when telephony made possible speaker recognition independent of distance, in conjunction with sound recording giving rise to speaker recognition independent of time. The use of sound spectrograms in the 1940s also incorporated the sensory capability of vision along with that of hearing in performing speaker recognition. Notably, it was not until 1966 that a court of law finally admitted speaker recognition testimony based on spectrograms of speech sounds.

In parallel with the aural and visual methods, automated methods of speaker recognition have continued to be developed, and are consequently yielding information strengthening the accuracy of the former methods. The automated methods have recently made remarkable progress, partly owing to the influential advances in computer and pattern recognition technologies. Due to its ever increasing importance, this chapter will focus exclusively on automatic speaker recognition technology.

The actual realization of speaker recognition systems makes use of voice as the key tool for verifying the identity of a speaker for application to an extensive array of customer-demand services. In the near future, these services will include banking transactions and shopping using the telephone network as well as the Internet, voice mail, database acquisition services including personal information access, reservation services, remote access of computers, and security control for protecting confidential areas of concern. Importantly, identity verification using voice is far more convenient than using cards, keys, or other artificial means of identification, and is much safer because a voice can neither be lost nor stolen. In addition, voice recognition does not require the use of hands. Accordingly, several systems are currently being planned for future applications in the rapidly accelerating information-intensive age into which we are entering. Under such circumstances, field trials combining speaker recognition with telephone cards and credit cards (ATM) are already under way.
Another important application of speaker recognition is its use for forensic purposes (Kunzel, 1994).

The principal disadvantage of using voice is that its physical characteristics are variable and easily modified by transmission and microphone characteristics as well as by background noise. If a system is capable of accepting wide variation in the customer's voice, for example, it might also unfortunately accept the voice of a different speaker if it is sufficiently similar. It is thus absolutely essential to use physical features which are stable and not easily mimicked or affected by transmission characteristics.

9.1.2 Individual Characteristics
Individual information includes voice quality, voice height, loudness, speed, tempo, intonation, accent, and the use of vocabulary. Various physical features interacting in a complicated manner produce these voice characteristics. They arise both from hereditary individual differences in articulatory organs, such as the length of the vocal tract and the vocal cord characteristics, and from acquired differences in the manner of speaking. Voice quality and height, which are the most important types of individual auditory information, are mainly related to the static and temporal characteristics of the spectral envelope and fundamental frequency (pitch).

The temporal characteristics, that is, the time functions of the spectral envelope, fundamental frequency, and energy, can be used for speaker recognition in a way similar to those used for speech recognition. However, several considerations and processes designed to emphasize stable individual characteristics are necessary in order to achieve high-performance speaker recognition. The statistical characteristics derived from the time functions of spectral features are also successfully used in speaker recognition. The use of statistical characteristics specifically reduces the dimensions of templates, and consequently cuts down the run-time computation as well as the memory size of reference templates. Similar recognition results on 40-frame words have been obtained either with standard DTW template matching or with a single distance measure involving a 20-dimensional vector employing the statistical features of fundamental frequency and LPC parameters (Furui, 1981a).
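As a rough illustration of the statistical-characteristics idea, a speaker can be summarized by the mean and variance of the feature time functions and compared with a variance-weighted distance. The following simplified Python sketch is illustrative only; the system cited above combined fundamental-frequency and LPC statistics, and the helper names here are not from the book.

```python
import numpy as np

def build_speaker_template(feature_frames):
    """Summarize a speaker's training frames by the mean and (diagonal)
    variance of the feature time functions -- far more compact than
    storing the frame sequence itself."""
    frames = np.asarray(feature_frames, dtype=float)
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6  # floor avoids divide-by-zero

def template_distance(feature_frames, template):
    """Variance-weighted squared distance between an input utterance's
    mean feature vector and a stored statistical template."""
    mean, var = template
    x = np.asarray(feature_frames, dtype=float).mean(axis=0)
    return float(np.sum((x - mean) ** 2 / var))
```

A single fixed-size vector comparison replaces frame-by-frame DTW alignment, which is where the computation and memory savings mentioned above come from.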
Since speaker recognition systems using temporal patterns of source characteristics only, such as pitch and energy, are not resistant to mimicked voices, they should desirably be combined with vocal tract characteristics, namely, with spectral envelope parameters, to build more robust systems (Rosenberg and Sambur, 1975).
9.2 SPEAKER RECOGNITION METHODS

9.2.1 Classification of Speaker Recognition Methods
Speaker recognition can be principally divided into speaker verification and speaker identification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of the utterances of the person whose identity is being claimed. Speaker identification is the process of determining from which of the registered speakers a given utterance comes. The speaker identification process is similar to the spoken word recognition process in that both determine which reference template is most similar to the input speech.

Speaker verification is applicable to various kinds of services which include the use of voice as the key to confirming the identity claim of a speaker. Speaker identification is used in criminal investigations, for example, to determine which of the suspects produced the voice recorded at the scene of the crime. Since the possibility always exists that the actual criminal is not one of the suspects, however, the identification decision must be made through the combined processes of speaker verification and speaker identification.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. The former require the speaker to issue a predetermined utterance, whereas the latter do not rely on a specific text being spoken. In general, because of the higher acoustic-phonetic variability of text-independent input, more training material is necessary to reliably characterize (model) a speaker than with text-dependent methods.

Although several text-dependent methods use features of special phonemes, such as nasals, most text-dependent systems allow words (key words, names, ID numbers, etc.) or sentences to be arbitrarily selected for each speaker. In the latter case, the differences in words or sentences between the speakers improve the accuracy of speaker recognition. When evaluating experimental systems, however, common key words or sentences are usually used for every speaker. Although key words can be fixed for each speaker in many applications of speaker verification, utterances of the same words cannot always be compared in criminal investigations. In such cases, a text-independent method is essential.

Difficulty in speaker recognition varies depending on whether or not the speakers intend for their identities to be verified. During speaker verification use, speakers are usually expected to cooperate without intentionally changing their speaking rate or manner. It is well known, however, and natural from their point of view, that speakers are most often uncooperative in criminal investigations, consequently compounding the difficulty in correctly recognizing their voices.

Both text-dependent and text-independent methods have one serious weakness: these systems can be easily beaten, because anyone who plays back the recorded voice of a registered speaker uttering key words or sentences into the microphone can be accepted as the registered speaker. To contend with this problem, some methods employ a small set of words, such as digits, as key words, and each user is prompted to utter a given sequence of key words that is randomly chosen each time the system is used (Higgins et al., 1991; Rosenberg et al., 1991). Yet even this method is not sufficiently reliable, since it can be beaten with advanced electronic recording equipment that can readily reproduce key words in any requested order. Therefore, to counter this problem, a text-prompted speaker recognition method has recently been proposed. (See Subsection 9.3.3.)
9.2.2 Structure of Speaker Recognition Systems
The common structure of speaker recognition systems is shown in Fig. 9.1. Feature parameters extracted from a speech wave are compared with the stored reference templates or models for each registered speaker. The recognition decision is made according to the distance (or similarity) values. For speaker verification, input utterances with distances to the reference template smaller than the threshold are accepted as being utterances of the registered speaker (customer), while input utterances with distances larger than the threshold are rejected as being those of a different speaker (impostor). With speaker identification, the registered speaker whose reference template is nearest to the input utterance among all of the registered speakers is selected as being the speaker of the input utterance.

The receiver operating characteristic (ROC) curve adopted from psychophysics is used for evaluating speaker verification systems. In speaker verification, two conditions concern the input utterance: s, the condition that the utterance belongs to the customer, and n, the opposite condition. Two decision conditions also exist: S, the condition that the utterance is accepted as being that of the customer, and N, the condition that the utterance is rejected. These conditions combine to make up the four conditional probabilities designated in Table 9.1. Specifically, P(S|s) is the probability of correct acceptance; P(S|n) is the probability of false acceptance (FA), namely, the probability of accepting impostors; P(N|s) is the probability of false rejection (FR), or the probability of mistakenly rejecting the real customer; and P(N|n) is the probability of correct rejection.

TABLE 9.1 Four Conditional Probabilities in Speaker Verification

                          Input utterance condition
    Decision condition    s (customer)    n (impostor)
    S (accept)            P(S|s)          P(S|n)
    N (reject)            P(N|s)          P(N|n)

Since the relationships

    P(S|s) + P(N|s) = 1

and

    P(S|n) + P(N|n) = 1
exist for the four probabilities, speaker verification systems can be evaluated using the two probabilities P(S|s) and P(S|n). If these two values are assigned to the vertical and horizontal axes, respectively, and the decision criterion (threshold) for accepting the speech as being that of the customer is varied, ROC curves such as those indicated in Fig. 9.2 are obtained. The figure exemplifies the curves for three systems: A, B, and D. Clearly, the performance of curve B is consistently superior to that of curve A, and D corresponds to the limiting case of purely chance performance.

On the other hand, the relationship between the decision criterion and the two kinds of errors is presented in Fig. 9.3. Position a in Figs. 9.2 and 9.3 corresponds to the case in which a strict decision criterion is employed, and position b corresponds to that wherein a lax criterion is used. To set the threshold at the desired level of customer rejection and impostor acceptance, it is necessary to know the distribution of customer and impostor scores as baseline data. The decision criterion in practical applications should be determined according to the effects of decision errors. This criterion can be determined based on the a priori probability of a match, P(s), on the cost values of the various decision results, and on the slope of the ROC curve. In experimental tests, the criterion is usually set a posteriori for each individual speaker in order to match up the two kinds of error rates, FR and FA, as indicated by c in Fig. 9.3.
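Matching the two error rates over baseline score distributions can be done by scanning candidate thresholds, as in the following simple Python sketch (it assumes higher scores are more customer-like; for distance scores the comparisons would be reversed):

```python
def equal_error_threshold(customer_scores, impostor_scores, candidates):
    """Return the candidate threshold at which the false rejection rate
    (customers scoring below the threshold) most nearly equals the
    false acceptance rate (impostors scoring at or above it), together
    with the two rates at that threshold."""
    best = None
    for t in candidates:
        fr = sum(s < t for s in customer_scores) / len(customer_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(fr - fa)
        if best is None or gap < best[0]:
            best = (gap, t, fr, fa)
    return best[1], best[2], best[3]
```

Sweeping the threshold over its whole range and plotting P(S|s) against P(S|n) at each point traces out exactly the ROC curve described above.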
FIG. 9.2 Receiver operating characteristic (ROC) curves; performance examples of three speaker verification systems: A, B, and D.
FIG. 9.3 Relationship between error rate and decision criterion (threshold) in speaker verification.
9.2.3 Relationship Between Error Rate and Number of Speakers
Let us assume that Z_N represents a population of N registered speakers, that X = (x_1, x_2, ..., x_n) is an n-dimensional feature vector representing the speech sample, and that P_i(X) is the probability density function of X for speaker i (i in Z_N). The chance probability density function of X within population Z_N can then be expressed as

    P_Z(X) = sum over i in Z_N of Pr[i] P_i(X)
where Pr[i] is the a priori chance probability of speaker i (Doddington, 1974). In the case of speaker verification, the region of X which should be accepted as the voice of customer i is

    P_i(X) / P_Z(X) > C_i
where C_i is chosen to effect the desired balance between FA and FR errors. With Z_N constructed using randomly selected speakers, and with the a priori probability independent of the speaker, Pr[i] = 1/N, P_Z(X) will approach a limiting density function independent of Z_N as N becomes large. Thus, Pr(FA) and Pr(FR) are relatively unaffected by the size of the population, N, when it is large. From a practical perspective, P_Z(X) is assumed to be constant, since it is generally difficult to estimate this value precisely, and

    P_i(X) > C_i'

is simply used as the acceptance region.
With speaker identification, the region of X which should be judged as the voice of speaker i is

    P_i(X) Pr[i] > P_j(X) Pr[j]   for all j != i (j in Z_N)

The probability of error for speaker i then becomes

    Pr(E|i) = 1 - integral over that region of P_i(X) dX

With Z_N constructed by randomly selected speakers, the equation

    E[Pr(correct|i)] = (P_nci)^(N-1)

can be obtained, where P_nci is the expected probability of not confusing speaker i with another speaker. Thus, the expected probability of correctly identifying a speaker decreases exponentially with the size of the population. This is a natural outcome of the fact that the distributions of infinitely many points cannot be separated in a finite parameter space. More specifically, when the population of speakers increases, the probability that the distributions of two or more speakers are very close increases. Therefore, the effectiveness of speaker identification systems must be evaluated according to their limits in population size. Figure 9.4 indicates this relationship between the size of the population and recognition error rates for speaker identification and verification (Furui, 1978). These results were obtained for a recognition system employing the statistical features of the spectral parameters derived from spoken words.
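The exponential decrease can be illustrated numerically. The following Python sketch assumes only the relationship stated above, namely that correct identification requires not confusing the speaker with any of the N - 1 competitors, with the pairwise non-confusion probabilities treated as independent:

```python
def correct_identification_probability(p_nc, population_size):
    """Expected probability of correctly identifying a speaker, given
    the probability p_nc of not confusing that speaker with any single
    other registered speaker (confusions assumed independent across
    the N - 1 competing speakers)."""
    return p_nc ** (population_size - 1)
```

Even with p_nc = 0.99, the correct-identification probability falls below one-half once the population reaches roughly 70 speakers, which is why identification accuracy must always be quoted relative to a population size.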
FIG. 9.4 Recognition error rates as a function of population size in speaker identification and verification (identification and verification curves for male and female speakers).
9.2.4 Intra-Speaker Variation and Evaluation of Feature Parameters
One of the most difficult problems in speaker recognition is the intra-speaker variation of feature parameters. The most significant factor affecting speaker recognition performance is variation in feature parameters from trial to trial (intersession variability, or variability over time). Variations arise from the speaker him/herself, from differences in recording and transmission conditions, and from noise. Speakers cannot repeat an utterance precisely the same way from trial to trial. It is well known that tokens of the same utterance recorded in one session correlate much more highly than tokens recorded in separate sessions.
It is important for speaker recognition systems to accommodate these variations, since they affect recognition accuracy more significantly than in the case of speech recognition for two major reasons. First, the reference template for each speaker, which is constructed using training utterances prior to the recognition, is repeatedly used later. Second, individual information in a speech wave is more detailed than phonetic information; that is, the interspeaker variation of physical parameters is much smaller than the interphoneme variation. A number of methods have been confirmed to be effective in reducing the effects of long-term variation in feature parameters and in obtaining good recognition performance after a long interval (Furui, 1981a). These include:

1. The application of spectral equalization, i.e., the passing of the speech signal through a first- or second-order critical damping inverse filter which represents the overall pattern of the time-averaged spectrum for a word or short sentence of speech. An effect similar to the spectral equalization can be achieved by cepstral mean subtraction (CMS) or cepstral mean normalization (CMN) (Atal, 1974; Furui, 1981b).
2. The selection of stable feature parameters based on statistical evaluation using speech utterances recorded over a long period.
3. The combination of feature parameters extracted from a variety of different words.
4. The construction of reference templates (models) and distance measures based on training utterances recorded over a long period.
5. The renewal of the reference template for each customer at the appropriate time interval.
6. Adaptation of the reference templates (models) as well as the verification threshold for each speaker.

The effectiveness of the spectral equalization process, the so-called 'blind equalization' method, was examined by means of speaker recognition experiments using statistical features extracted
from a spoken word. Results with and without spectral equalization were compared for both short-term and long-term training. The short-term training set comprised utterances recorded over a period of 10 days in three or four sessions at intervals of 2 or 3 days. The long-term training set consisted of utterances recorded over a 10-month period in four sessions at intervals of 3 months. The time interval between the last training utterance and the input utterance ranged from two or three days to five years. The speaker verification results obtained are exemplified in Fig. 9.5. Although these results clearly confirm that spectral equalization is effective in reducing errors as a function of the time
FIG. 9.5 Results of speaker verification using statistical features extracted from a spoken word, with or without spectral equalization, for short-term and long-term training (error rate versus interval in years).
interval for both short-term and long-term training, it is especially effective with short-term training. Concerning the speech production mechanism, the effectiveness of spectral equalization means that vocal tract characteristics are much more stable than the overall patterns of the vocal cord spectrum.

In the CMS (CMN) method, cepstral coefficients are averaged over the duration of an entire utterance, and the averaged values are subtracted from the cepstral coefficients of each frame. This method can compensate fairly well for additive variation in the log spectral domain. However, it unavoidably eliminates some text-dependent and speaker-specific features, so it is especially effective for text-dependent speaker recognition applications using sufficiently long utterances but is inappropriate for short utterances. It was shown that time derivatives of cepstral coefficients (delta cepstral coefficients) are resistant to linear channel mismatch between training and testing (Furui, 1981b; Soong and Rosenberg, 1986). In addition to the normalization methods in the parameter domain, those in the distance/similarity domain using the likelihood ratio or a posteriori probability have also been actively investigated (see Subsection 9.2.5). To adapt HMMs for noisy conditions, the HMM composition (PMC; parallel model combination) method has been successfully employed.

In selecting the most effective feature parameters, the following four parameter evaluation methods can be used:

1. Performing recognition experiments based on various combinations of parameters;
2. Measuring the F-ratio (inter- to intra-variance ratio) for each parameter (Furui, 1978);
3. Calculating the divergence, which is an expansion of the F-ratio into a multidimensional space (Atal, 1972); and
4. Using the knockout method based on recognition error rates (Sambur, 1975).

In order to effectively reduce the amount of information, that is, the number of parameters, feature parameter sets are sometimes
projected into a space constructed by discriminant analysis which maximizes the F-ratio.
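The CMS operation and the F-ratio criterion described above can be sketched in a few lines of code. This is a minimal illustration with assumed data shapes (frames as lists of cepstral values, per-speaker parameter measurements as dictionaries); it is not the book's implementation.

```python
def cms(frames):
    """Cepstral mean subtraction: remove the utterance-average cepstrum
    from every frame, compensating for stationary (additive in the log
    spectral domain) channel effects. frames: list of cepstral vectors."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]

def f_ratio(per_speaker_values):
    """F-ratio of one parameter: inter-speaker variance of the speaker
    means divided by the mean intra-speaker variance. Larger values mark
    parameters that separate speakers well.
    per_speaker_values: {speaker: [parameter values over sessions]}."""
    means = {s: sum(v) / len(v) for s, v in per_speaker_values.items()}
    grand = sum(means.values()) / len(means)
    inter = sum((m - grand) ** 2 for m in means.values()) / len(means)
    intra = sum(
        sum((x - means[s]) ** 2 for x in v) / len(v)
        for s, v in per_speaker_values.items()
    ) / len(per_speaker_values)
    return inter / intra
```

Ranking parameters by `f_ratio` and keeping the top few is one simple way to realize the feature selection that the text describes.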
9.2.5 Likelihood (Distance) Normalization
To contend with the intra-speaker feature parameter variation problems, Higgins et al. (1991) proposed a normalization method for distance (similarity or likelihood) values that uses the likelihood ratio:

log L(X) = log p(X | S) − log p(X | S̄)    (9.8)

The likelihood ratio is the ratio of the conditional probability of the observed measurements X of the utterance given that the claimed identity S is correct to the conditional probability of the observed measurements given that the speaker is an impostor (S̄). Generally, a positive value of log L indicates a valid claim, whereas a negative value indicates an impostor. The second term on the right-hand side of Eq. (9.8) is called the normalization term. The density at point X for all speakers other than the true speaker S can be dominated by the density for the nearest reference speaker, if we assume that the set of reference speakers is representative of all speakers. This means that the likelihood ratio normalization approximates the optimal scoring in Bayes' sense. This normalization method is unrealistic, however, because even if only the nearest reference speaker is used, conditional probabilities must be calculated for all of the reference speakers, which increases cost. Therefore, a set of speakers, known as 'cohort speakers,' has been chosen for calculating the normalization term of Eq. (9.8). Higgins et al. proposed using speakers that are representative of the population near the claimed speaker. An experiment in which the size of the cohort speaker set was varied from 1 to 5 showed that speaker verification performance increases as a function of the cohort size, and that the use of normalization significantly compensates for the degradation
obtained by comparing verification utterances recorded using an electret microphone with models constructed from training utterances recorded with a carbon button microphone (Rosenberg, 1992).

Matsui and Furui (1993, 1994b) proposed a normalization method based on a posteriori probability:

log L(X) = log p(X | S) − log Σ_i p(X | S_i)

where the summation is over a speaker set that includes the claimed speaker S. The difference between the normalization method based on the likelihood ratio and that based on a posteriori probability is whether or not the claimed speaker is included in the impostor speaker set for normalization; the cohort speaker set in the likelihood-ratio-based method does not include the claimed speaker, whereas the normalization term for the a posteriori probability-based method is calculated by using a set of speakers including the claimed speaker. Experimental results indicate that both normalization methods almost equally improve speaker separability and reduce the need for speaker-dependent or text-dependent thresholding, compared with scoring using only the model of the claimed speaker (Matsui and Furui, 1994b; Rosenberg, 1992).

The normalization method using the cohort speakers that are representative of the population near the claimed speaker is expected to increase the selectivity of the algorithm against voices similar to the claimed speaker. However, this method is seriously problematic in that it is vulnerable to illegal access by impostors of the opposite gender. Since the cohorts generally model only same-gender speakers, the probability of opposite-gender impostor speech is not well modeled and the likelihood ratio is based on the tails of distributions, which gives rise to unreliable values. Another way of choosing the cohort speaker set is to use speakers who are typical of the general population. Reynolds (1994) reported that a randomly selected, gender-balanced background speaker population outperformed a population near the claimed speaker.
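The two score-normalization schemes contrasted above can be sketched as follows. All function names and the use of a mean over cohort likelihoods are assumptions for illustration; the inputs are per-model log-likelihoods log p(X|model), which the text leaves to the underlying HMM or template matcher.

```python
import math

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cohort_normalized_score(claimant_ll, cohort_lls):
    """Likelihood-ratio style score (cf. Eq. 9.8): claimant log-likelihood
    minus the log mean likelihood of cohort speakers, where the cohort
    set excludes the claimant. Positive suggests a valid claim."""
    return claimant_ll - (log_sum_exp(cohort_lls) - math.log(len(cohort_lls)))

def posterior_normalized_score(claimant_ll, reference_lls):
    """A posteriori style score: the normalizing set includes the
    claimant, so the score is always <= 0 and less threshold-sensitive."""
    return claimant_ll - log_sum_exp(reference_lls + [claimant_ll])
```

A decision then reduces to comparing the normalized score against a single, largely speaker-independent threshold, which is the practical benefit both methods aim at.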
Carey et al. (1992) proposed a method in which the normalization term is approximated by the likelihood for a world model representing the population in general. This method has the advantage that the computational cost for calculating the normalization term is much smaller than the original method, since it does not need to sum the likelihood values for cohort speakers. Matsui and Furui (1994b) proposed a method based on tied-mixture HMMs in which the world model is formulated as a pooled mixture model representing the parameter distribution for all of the registered speakers. This model is created by averaging together the mixture-weighting factors of each reference speaker calculated using speaker-independent mixture distributions. Therefore the pooled model can be easily updated when a new speaker is added as a reference speaker. In addition, this method has been confirmed to give much better results than either of the original normalization methods.

Since these normalization methods do not take into account the absolute deviation between the claimed speaker's model and the input speech, they cannot differentiate highly dissimilar speakers. Higgins et al. (1991) reported that a multilayer network decision algorithm makes effective use of the relative and absolute scores obtained from the matching algorithm.
9.3 EXAMPLES OF SPEAKER RECOGNITION SYSTEMS

9.3.1 Text-Dependent Speaker Recognition Systems
Large-scale experiments have been performed for some time for text-dependent speaker recognition, which is more realistic than text-independent speaker recognition (Furui, 1981b; Zheng and Yuan, 1988; Naik et al., 1989; Rosenberg et al., 1991). They include experiments on a speaker verification system for telephone speech which was tested at Bell Laboratories using roughly 100 male and female speakers (Furui, 1981b). Figure 9.6 is a block diagram of the principal method. With this method, not only is the time series
FIG. 9.6 Block diagram indicating principal operation of speaker recognition method using time series of cepstral coefficients and their orthogonal polynomial coefficients (speech wave → LPC cepstrum → long-time average → normalization by average cepstrum → expansion by polynomial function → feature selection → decision → speaker identity).
brought into time registration with the stored reference functions, but a set of dynamic features (see Subsection 8.3.6) is also explicitly extracted and used for the recognition. Initially, 10 LPC cepstral coefficients are extracted every 10 ms from a short speech sentence. These cepstral coefficients are then averaged over the duration of the entire utterance. The averaged values are next subtracted from the cepstral coefficients of every frame (CMS method) to compensate for the frequency-response distortion introduced by the transmission system and to reduce long-term intraspeaker spectral variability. Time functions for the cepstral coefficients are subsequently expanded by an orthogonal polynomial representation over 90-ms intervals which are shifted every 10 ms. The first- and second-order polynomial coefficients (delta and delta-delta cepstral coefficients) are thus obtained as the representations of dynamic characteristics. From the normalized cepstral and polynomial coefficients, a set of 18 elements is
selected which are the most effective in separating the speaker's overall distance distribution. The time function of the set is brought into time registration with the reference template in order to calculate the distance between them. The overall distance is then compared with a threshold for the verification decision. The threshold and reference template are updated every two weeks by using the distribution of interspeaker distances.

Experimental results indicate that a high degree of verification accuracy can be obtained even if the reference and input utterances are transmitted on different telephone systems, such as on those using ADPCMs and LPC vocoders. An online experiment performed over a period of six months, using dialed-up telephone speech uttered by 60 male and 60 female speakers, also supports the effectiveness of this system.

An HMM can efficiently model the statistical variation in spectral features. Therefore, HMM-based methods can achieve significantly better recognition accuracies than DTW-based methods if enough training utterances for each speaker are available.
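The first-order orthogonal polynomial expansion of a cepstral time series described above is the familiar delta-cepstrum: a least-squares slope over a sliding window. The sketch below assumes the frame and window sizes quoted in the text (cepstra every 10 ms, a 90-ms window, i.e., 9 frames); the function name and the edge-clamping choice are illustrative, not the book's exact procedure.

```python
def delta_coefficients(series, half_width=4):
    """First-order orthogonal polynomial (delta) coefficient per frame.
    series: list of scalar cepstral values over time. With half_width=4
    the regression window spans 9 frames, i.e., 90 ms at a 10-ms shift."""
    k = half_width
    denom = sum(t * t for t in range(-k, k + 1))  # normalizer of the slope fit
    deltas = []
    for i in range(len(series)):
        num = 0.0
        for t in range(-k, k + 1):
            j = min(max(i + t, 0), len(series) - 1)  # clamp at utterance edges
            num += t * series[j]
        deltas.append(num / denom)
    return deltas
```

Applied per cepstral dimension, this yields the dynamic features; applying it again to the deltas gives the second-order (delta-delta) coefficients.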
9.3.2 Text-Independent Speaker Recognition Systems
In text-independent speaker recognition, the words or sentences used in recognition trials generally cannot be predicted. Since it is impossible to model or match speech events at the word or sentence level, the following three kinds of methods shown in Fig. 9.7 have been actively investigated (Furui, 1986).

(a) Long-term-statistics-based methods

As text-independent features, long-term sample statistics of various spectral features, such as the mean and variance of spectral features over a series of utterances, have been used (Furui et al., 1972; Markel et al., 1977; Markel and Davis, 1979) (Fig. 9.7(a)). However, long-term spectral averages are extreme condensations of the spectral characteristics of a speaker's utterances and, as such, lack the discriminating power included
in the sequences of short-term spectral features used as models in text-dependent methods. In one of the trials using the long-term averaged spectrum (Furui et al., 1972), the effect of session-to-session variability was reduced by introducing a weighted cepstral distance measure.

Studies on using statistical dynamic features have also been reported. Montacie et al. (1992) applied a multivariate autoregression (MAR) model to the time series of cepstral vectors to characterize speakers, and reported good speaker recognition results. Griffin et al. (1994) studied distance measures for the MAR-based method, and reported that when 10 sentences were used for training and one sentence was used for testing, identification and verification rates were almost the same as those obtained by an HMM-based method. It was also reported that the optimum order of the MAR model was 2 or 3, and that distance normalization using a posteriori probability was essential to obtain good results in speaker verification.

(b) VQ-based methods

A set of short-term training feature vectors of a speaker can be used directly to represent the essential characteristics of that speaker. However, such a direct representation is impractical when the number of training vectors is large, since the memory and amount of computation required become prohibitively large. Therefore, attempts have been made to find efficient ways of compressing the training data using vector quantization (VQ) techniques. In this method (Fig. 9.7(b)), VQ codebooks, consisting of a small number of representative feature vectors, are used as an efficient means of characterizing speaker-specific features (Li and Wrench Jr., 1983; Matsui and Furui, 1990, 1991; Rosenberg and Soong, 1987; Shikano, 1985; Soong et al., 1987). A speaker-specific codebook is generated by clustering the training feature vectors of each speaker.
In the recognition stage, an input utterance is vector-quantized by using the codebook of each reference speaker; the VQ distortion accumulated over the entire input utterance is used for making the recognition determination.
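The VQ recognition stage just described reduces to a few lines. This is a toy sketch with assumed data shapes (feature vectors as lists of floats, one pre-trained codebook per speaker); codebook training itself (e.g., by k-means clustering) is omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vq_distortion(vectors, codebook):
    """Accumulated distortion of an utterance against one codebook:
    each input vector is quantized to its nearest code vector."""
    return sum(min(squared_distance(v, c) for c in codebook) for v in vectors)

def identify(vectors, codebooks):
    """codebooks: {speaker: [code vectors]}. Returns the speaker whose
    codebook yields the smallest accumulated distortion."""
    return min(codebooks, key=lambda s: vq_distortion(vectors, codebooks[s]))
```

Because the distortion is summed over all frames regardless of their order, the method is inherently text-independent, which is exactly the property the text highlights.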
(c) Ergodic-HMM-based methods

The basic structure is the same as the VQ-based method (Fig. 9.7(b)), but in this method an ergodic HMM is used instead of a VQ codebook. Over a long timescale, the temporal variation in speech signal parameters is represented by stochastic Markovian transitions between states. Poritz (1982) proposed using a five-state ergodic HMM (i.e., all possible transitions between states are allowed) to classify speech segments into one of the broad phonetic categories corresponding to the HMM states. A linear predictive HMM was adopted to characterize the output probability function. He characterized the automatically obtained categories as strong voicing, silence, nasal/liquid, stop burst/post silence, and frication.

Tishby (1991) extended Poritz's work to the richer class of mixture autoregressive (AR) HMMs. In these models, the states are described as a linear combination (mixture) of AR sources. It was shown that the speaker recognition rates are strongly correlated with the total number of mixtures, irrespective of the number of states (Matsui and Furui, 1992). This means that the information on transitions between different states is ineffective for text-independent speaker recognition. The case of a single-state continuous ergodic HMM corresponds to the technique based on the maximum likelihood estimation of a Gaussian-mixture model representation investigated by Rose et al. (1990). Furthermore, the VQ-based method can be regarded as a special (degenerate) case of a single-state HMM with a distortion measure being used as the observation probability.

(d) Speech-recognition-based methods

The VQ- and HMM-based methods can be regarded as methods that use phoneme-class-dependent speaker characteristics in short-term spectral features through implicit phoneme-class recognition. In other words, phoneme classes and speakers are simultaneously recognized in these methods. On the other hand, in the speech-recognition-based methods (Fig. 9.7(c)), phonemes or phoneme classes are explicitly recognized, and then each phoneme(-class) segment in the input speech is compared with speaker models or templates corresponding to that phoneme(-class).
Savic et al. (1990) used a five-state ergodic linear predictive HMM for broad phonetic categorization. In their method, after frames that belong to particular phonetic categories have been identified, feature selection is performed. In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores for each category. The weights are chosen to reflect the effectiveness of particular categories of phonemes in discriminating between speakers and are adjusted to maximize the verification performance. Experimental results showed that verification accuracy can be considerably improved by this category-dependent weighted linear combination method. Broad phonetic categorization can also be implemented by a speaker-specific hierarchical classifier instead of by an HMM, and the effectiveness of this approach has also been confirmed (Eatock and Mason, 1990).

Rosenberg et al. have been testing a speaker verification system using 4-digit phrases under field conditions of a banking application (Rosenberg et al., 1991; Setlur and Jacobs, 1995). In this system, input speech is segmented into individual digits using a speaker-independent HMM. The frames within the word boundaries for a digit are compared with the corresponding speaker-specific HMM digit model and the Viterbi likelihood score is computed. This is done for each of the digits making up the input utterance. The verification score is defined to be the average normalized log-likelihood score over all the digits in the utterance.

Newman et al. (1996) used a large-vocabulary speech recognition system for speaker verification. A set of speaker-independent phoneme models were adapted to each speaker. The speaker verification consisted of two stages. First, speaker-independent speech recognition was run on each of the test utterances to obtain phoneme segmentation. In the second stage, the segments were scored against the adapted models for a
particular target speaker. The scores were normalized by those with speaker-independent models. The system was evaluated using the 1995 NIST-administered speaker verification database, which consists of data taken from the Switchboard corpus. The results showed that this method could not outperform Gaussian mixture models.

9.3.3 Text-Prompted Speaker Recognition Systems

How can we prevent speaker verification systems from being defeated by a recorded voice? Another problem is that people often do not like text-dependent systems because they do not like to utter their identification number, such as their social security number, within the hearing of other people. To contend with these problems, a text-prompted speaker recognition method has been proposed, in which the key sentences are completely changed every time (Matsui and Furui, 1993, 1994a). The system accepts the input utterance only when it determines that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method can not only accurately recognize speakers but can also reject an utterance whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played-back voice can be correctly rejected.

This method uses speaker-specific phoneme models as basic acoustic units. One of the major issues in this method is how to properly create these speaker-specific phoneme models when using training utterances of a limited size. The phoneme models are represented by Gaussian-mixture continuous HMMs or tied-mixture HMMs, and they are made by adapting speaker-independent phoneme models to each speaker's voice. Since the text of training utterances is known, these utterances can be modeled as the concatenation of phoneme models, and these models can be automatically adapted by an iterative algorithm.
In the recognition stage, the system concatenates the phoneme models of each registered speaker to create a sentence HMM, according to the prompted text. The likelihood of the input speech against the sentence model is then calculated and used for the speaker recognition determination. If the likelihood of both speaker and text is high enough, the speaker is accepted as the claimed speaker. Notably, experimental results gave a high speaker and text verification rate when the adaptation method for tied-mixture-based phoneme models and the likelihood normalization method described in Subsection 9.2.5 were used.
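The decision logic of the text-prompted scheme can be sketched at a very high level. Everything here is assumed for illustration: phonemes are single characters, per-phoneme scores stand in for the Viterbi alignment of the input against the concatenated sentence HMM, and the acceptance rule is a simple length-normalized threshold.

```python
def sentence_log_likelihood(phoneme_scores, prompted_text):
    """phoneme_scores: {phoneme: log-likelihood of the input segment under
    the claimed speaker's model for that phoneme}. Summing over the
    phonemes of the prompted text is a toy stand-in for scoring the input
    against the concatenated sentence HMM."""
    return sum(phoneme_scores[p] for p in prompted_text)

def verify(phoneme_scores, prompted_text, threshold):
    """Accept only if the length-normalized score of both the speaker and
    the prompted text is high enough; a mismatched text (or a recording of
    a different sentence) drives the score below the threshold."""
    score = sentence_log_likelihood(phoneme_scores, prompted_text) / len(prompted_text)
    return score >= threshold
```

In a real system the threshold itself would be normalized as described in Subsection 9.2.5 rather than fixed per speaker.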
10 Future Directions of Speech Information Processing
10.1 OVERVIEW
For the majority of humankind, speech understanding and production are involuntary processes quickly and effectively performed throughout our daily lives. A part of these human processes has already been synthetically reproduced, owing to the recent progress in speech signal processing, linguistic processing, computers, and LSI technologies. What we are actually capable of turning into practical, beneficial tools at present using these technologies, however, can be considered very restricted at best.

Figure 10.1 attempts to simplify to a certain degree the relationships between the various types of speech recognition, understanding, synthesis, and coding technologies. Several of these remain to be investigated. It is essential to ensure that speech information processing technologies play the ever-heightening, demand-stimulated role desired in facilitating the progress of the information communications societies toward which we are aspiring. This can only be achieved by enhancing our synthetic speech technologies to the point where they approach as closely as
FIG. 10.1 Principal speech information processing technologies and their relationships.
possible our inherent human abilities. Importantly, this necessitates our competently solving the broadest possible range of interrelated problems in the near future.

In an effort to graphically clarify the relationships between the elements of engineering and human speech information processing mechanisms, Fig. 10.2 details the relationships between speech information processing technologies and the scientific and technological areas serving as the foundational roots of speech research. As is evident in the figure, and as described elsewhere in this book, speech research is fundamentally and intrinsically supported by a wide range of sciences. The intensification of speech research continues to underscore an even greater interrelationship between scientific and technological interests.
FIG. 10.2 Speech information processing "tree," consisting of present and future speech information processing technologies (e.g., large-vocabulary, speaker-independent, continuous speech recognition; speaker recognition and normalization; speech synthesis by rule and text-to-speech) supported by scientific and technological areas serving as the foundations of speech research (e.g., acoustics, neural nets, artificial intelligence).
Although individual aspects of speech information processing research have thus far been performed independently for the most
part, they will encounter increased interaction until commonly shared problems become simultaneously investigated and solved. Only then can we expect to witness tremendous speech research progress, and hence fruition of widely applicable, beneficial techniques. Along these lines, this chapter summarizes what are considered to be the nine most important research topics, in particular those which intertwine a multiplicity of speech research areas. These topics must be rigorously pursued and investigated if we are to realize our information communications societies fully incorporating enhanced speech technologies.
10.2 ANALYSIS AND DESCRIPTION OF DYNAMIC FEATURES
Psychological and physiological research into the human speech perception mechanisms overwhelmingly reports that the dynamic features of the speech spectrum and the speech wave over time intervals between 2 to 3 ms and 20 to 50 ms play crucially important roles in phoneme perception. This holds true not only for consonants such as plosives but also for vowels in continuous speech. On the other hand, almost all speech analysis methods developed thus far, including Fourier spectral analysis and LPC analysis, assume the stationarity of the signals. Only a few methods have been investigated for representing transitional or dynamic features. Although such representation constitutes one of the most difficult problems facing the speech researcher today, the discovery of a good method is expected to produce a substantial impact on the course of speech research.

Coarticulation phenomena have usually been studied as variations or modifications of spectra resulting from the influence of adjacent phonemes or syllables. It is considered essential, however, that these phenomena be examined from the viewpoint of phonemic information existing in the dynamic characteristics. Additionally important to these relatively 'microdynamic' phonemic information-related features are the relatively
'macrodynamic' features covering the interval between 200 to 300 ms and 2 to 3 s. The latter dynamic features bear prosodic features of speech such as intonation and stress. And although they seem to be easily extracted using the time functions of pitch and energy, they are actually extremely difficult to correctly extract from a speech wave by automatic methods. Even if they were to be effectively extracted, it is still very difficult to relate these features to the perceptual prosodic information. Therefore, even in the speech recognition area, in which prosodic features are expected to play a substantial role, only a few trials utilizing them have succeeded to any notable degree.

In speech synthesis, control rules for prosodic features largely affect the intelligibility and naturalness of synthesized voice. Here also, although the significance of prosodic features is clearly as great as that of phonemic features, the perception and control mechanisms of prosodic features have not yet been clarified.
10.3 EXTRACTION AND NORMALIZATION OF VOICE INDIVIDUALITY
Although many kinds of speaker-independent speech recognizers have already been commercialized, a small fraction of people occasionally produce exceptionally low recognition rates with these systems. A similar phenomenon, which is called the 'sheep and goats phenomenon,' also occurs in speaker recognition.

The voice individuality problem in speech recognition has been handled to a certain extent through studies into automatic adaptation algorithms using a small number of training utterances and unsupervised adaptation algorithms. In the unsupervised algorithms, utterances for recognition are also used for training. These algorithms are currently capable of only restricted application, however, since the mechanism of producing voice individuality has not yet been sufficiently delineated. Accordingly, becoming increasingly more important will be research on speaker-independent speech recognition systems having an automatic
speaker adaptation mechanism based on unsupervised training algorithms requiring no additional training utterances.

Speaker adaptation or normalization algorithms in speech recognition, as well as speaker recognition algorithms, should be investigated using a common approach. This is because they are two sides of the same problem: how best to separate the speaker's information and the phonemic information in speech waves. This approach is essential to effectively formulate the unsupervised adaptation and text-independent speaker recognition algorithms.

In the speech synthesis area, several speech synthesizers have been commercialized in which voice quality can be selected from male, female, and infant voices. No system has been constructed, however, that can precisely select or control the synthesized voice quality. Research into the mechanism underlying voice quality, inclusive of voice individuality, is thus necessary to ensure that synthetic voice is capable of imitating a desired speaker's voice or to select any voice quality such as a hard or soft voice.

Even in speech coding (analysis-synthesis and waveform coding), the dependency of the coded speech quality on the individuality of the original speech increases with the advanced, high-compression-rate methods. Put another way, in these advanced methods, perceptual speech quality degradation of coded speech clearly depends on the original voice. It is of obvious importance, then, to elucidate the mechanism of voice dependency and to develop a method which decreases this dependency.
10.4 ADAPTATION TO ENVIRONMENTAL VARIATION
For speech recognition and speaker recognition systems to bring their capabilities into full play during actual speech situations, they must be able to minimize effectively, or hopefully eliminate, the influence of overlapped stationary noise as well as unstationary noise such as other speakers' voices. Present speech recognition systems have gone a long way toward resolving these problems by using a close-talking microphone and by instituting training
(adaptation) for each speaker's voice under the same noise characteristic environment. The environment naturally tends to vary, however, and the transmission characteristics of telephone sets and transmission lines also are not precisely controllable. This situation is becoming even more difficult because of the wide use of both cellular and cordless phones. Research is therefore necessary to ascertain mechanisms that will facilitate automatic adaptation to these variations. Also important for practical use is the development of a method capable of accurately recognizing a voice picked up by a microphone placed at a distance from the speaker.

Since the intersession (temporal) variability of the physical properties of an individual voice decreases the recognition accuracy of speaker recognition, a set of feature parameters must be extracted that remain stable over long periods, even if, for example, the speaker should be suffering from a cold or bronchial congestion. Furthermore, these parameters must be set up in such a way that they are extremely difficult to imitate.
10.5 BASIC UNITS FOR SPEECH PROCESSING

Recognizing continuous speech featuring an extensive vocabulary necessitates the exploration of a recognition algorithm that utilizes basic units smaller than words. This establishment of basic speech units represents one of the principal research foci fundamental not only to speech recognition but also to speaker recognition, text-to-speech conversion, and very-low-bit-rate speech coding. These basic speech units considered intrinsic to speech information processing should be studied from several perspectives:

1. Linguistic units (e.g., phonemes and syllables),
2. Articulatory units (e.g., positions and moving targets for the jaw and tongue),
3. Perceptual units (e.g., distinctive features, and targets and loci of formant movement),
4. Visual units (features used in spectrogram reading), and
5. Physical units (e.g., centroids in VQ and MQ).

These units do not necessarily correspond. Furthermore, although conventional units have usually been produced from the linguistic point of view, future units will be established based on the combination of physical and linguistic units. This establishment will take the visual, articulatory, and perceptual viewpoints into consideration.
10.6 ADVANCED KNOWLEDGE PROCESSING
One of the critical problems in speech understanding and text-to-speech conversion is how best to utilize and efficiently combine various kinds of knowledge sources, including our common sense concerning language usage. There is ample evidence that human speech understanding involves the integration of a great variety of knowledge sources, including knowledge of the world or context, knowledge of the speaker and/or topic, lexical frequency, previous uses of a word or a semantically related topic, facial expressions (in face-to-face communication), prosody, as well as the acoustic attributes of the words. Our future systems could do much better by integrating these knowledge sources.

The technological realization of these processes encompasses the use of the merits of artificial intelligence, particularly knowledge engineering systems, which provide methods capable of representing knowledge sources, including syntax and semantics, parallel and distributed processing methods for managing the knowledge sources, and tree-search methods. A key factor in actualizing high-performance speech understanding concerns the most potent way to combine the obscurely quantified acoustical information with the different types of symbolized knowledge sources.

The use of statistical language modeling is especially convenient in the linguistic processing stage in speech understanding. The methods produced from the results garnered from natural language
processing research, such as phrase-structured grammar and case-structured grammar, are not always useful in speech processing, however, since there is a vast difference between written and spoken language. An entirely new linguistic science must therefore be invented for speech processing based on the presently available technologies for natural language processing. Clearly, this novel science must also take the specific characteristics of conversational speech into consideration.
10.7 CLARIFICATION OF SPEECH PRODUCTION MECHANISM
A careful look into the dynamics of the diverse articulatory organs functioning during speech production, coupled with trials for elucidating the relationship between the articulatory mechanism and the acoustic characteristics of speech waves, exhibits considerable potential for producing key ideas fundamental to developing the new speech information processing technologies needed.

Recent investigation has shown that the actual sound source of speech production is neither a simple pulse train nor white noise, nor is it necessarily linearly separable from the vocal tract articulatory filter. This finding runs contrary to the production model now widely used. Furthermore, it is quite possible that simplification of the model is one of the primary factors causing the degradation of synthesized voice. Therefore, development of a new sound source model that precisely represents the actual source characteristics, as well as research on the mutual interaction between the sound source and the articulatory filter, would seem to be necessary to enhance the progress of speech synthesis.

Well-suited formulation of the rules governing movement of the articulatory organs holds the promise of producing a clear representation of the coarticulation phenomena, which are very difficult to properly delineate at the acoustic level. Consequently, a dynamic model of coarticulation is in the process of being established based on these rules. This research is also expected to lead to a solution of the problem of not being able to clearly discern
voice individuality and to produce techniques for segmenting the acoustic feature sequence into basic speech units.

The actual direction this research is assuming is divided into a physiological approach and an engineering approach. The former approach involves the direct observation of the speech production processes. For example, vocal cord movement is observed using a fiberscope, an ultrasonic pulse method, or an optoelectronic method. On the other hand, articulatory movement in the vocal tract can be observed by a scanning-type x-ray microbeam device, ultrasonic tomography, dynamic palatography, electromyography (EMG), or an electromagnetic articulograph (EMA) system. Although each of these methods has its own specially applicable features, none of them is capable of precisely observing the dynamics of the vocal organs. Accordingly, there will be a continuous need to improve on such devices and observation methods.

The engineering approach concerns the estimation of source and vocal tract information from the acoustic features based on speech production models. This approach, founded on the results of the physiological approach, is expected to produce key ideas for developing new speech processing technologies.
10.8 CLARIFICATION OF SPEECH PERCEPTION MECHANISM
As is well known, a mutual relationship exists between the speech production and speech perception mechanisms. Psychological and physiological research into human speech perception is anticipated to give rise to new principles for guiding more broad-ranging progress in speech information processing. Although observation and modeling of the movement of vocal systems, along with the physiological modeling of auditory peripheral systems, have recently made great progress, the mechanism of speech information processing in our own brain has hardly been investigated.

As described earlier, one of the most significant factors toward which speech perception research is being directed is the
mechanism involved in perceiving dynamic signals. Psychological experiments on human memory clearly showed that speech plays a far more important and essential role than vision in the human memory and thinking processes. Whereas models of separating acoustic sources have been researched in 'auditory scene analysis,' the mechanisms of how meanings of speech are understood and how speech is produced have not yet been elucidated.

It will be necessary to clarify the process by which human beings understand and produce spoken language, in order to obtain hints for constructing language models for our spoken language, which is very different from written language. It is necessary to be able to analyze context and accept ungrammatical sentences. Now is the time to start active research on clarifying the mechanism of speech information processing in the human brain so that epoch-making technological progress can be made based on the human model.
10.9 EVALUATION METHODS FOR SPEECH PROCESSING TECHNOLOGIES
Objective evaluation methods ensuring quantitative comparison between a broad range of techniques are essential to technological developments in the speech processing field. Establishing methods for evaluating the multifarious processes and systems employed here is, however, very difficult for a number of important reasons. One is that natural speech varies considerably in its linguistic properties, voice qualities, and other aspects as well. Another is that the efficiency of speech processing techniques often depends to a large extent on the characteristics of the input speech. Therefore, the following three principal problems must be solved before effectual evaluation methods can be established:
1. Task evaluation: creating a measure fully capable of evaluating the complexity and difficulty of the task (synthesis, recognition, or coding task) being processed;
2. Technique evaluation: formulating a method for evaluating the techniques both subjectively and objectively;
3. Database for evaluation: preparing a large-scale universal database for evaluating an extensive array of systems.
Crucial future problems include how to evaluate the performance of speech understanding and spoken dialogue systems, and how best to measure the individuality and naturalness of coded and synthesized speech.
10.10 LSI FOR SPEECH PROCESSING USE
Development and utilization of LSIs are indispensable to the actualization of diverse, well-suited speech processing devices and systems. LSI technology has, on occasion, had considerable impact on the speech technology trend. Those algorithms that are easily packaged in LSIs, for example, tend to become mainstream tools even if they require a relatively large number of elements and computation.

Speech processing algorithms can be realized through special-purpose LSIs and digital signal processors (DSPs). Although both avenues have advantages and disadvantages, the DSP approach generally seems to be more beneficial, because speech processing algorithms are becoming substantially more diversified and continue to incorporate rapid advancements. The actual production of fully functioning speech processing hardware necessitates the fabrication of DSP-LSIs, which include high-speed circuits and large memories capable of processing and storing sufficiently long word-length data in their design. Furthermore, the provision of appropriate developmental tools for constructing DSP-based systems using high-level computer languages is essential. It would be particularly beneficial if speech researchers were to assist in proposing the design policies behind the production of these LSIs and developmental devices.
Appendix A Convolution and z-Transform
A.1 CONVOLUTION

The convolution of x(n) and h(n), usually written x(n) * h(n), is defined as

y(n) = \sum_{k=-\infty}^{\infty} x(k) h(n - k)    (A.1)

If h(n) and x(n) are the impulse response of a linear system and its input, respectively, the system response can be expressed by the convolution (A.1).

The convolution operation features the following properties:

1. Commutativity: For any h and x,

x(n) * h(n) = h(n) * x(n)    (A.2)

2. Associativity:

[x(n) * h_1(n)] * h_2(n) = x(n) * [h_1(n) * h_2(n)]    (A.3)

3. Linearity: If parameters a and b are constants, then

h(n) * [a x_1(n) + b x_2(n)] = a[h(n) * x_1(n)] + b[h(n) * x_2(n)]    (A.4)

Generally,

h(n) * \sum_i x_i(n) = \sum_i [h(n) * x_i(n)]    (A.5)

which means that the convolution and summing operations are interchangeable.

4. Time reversal: If y(n) = x(n) * h(n), then

y(-n) = x(-n) * h(-n)    (A.6)

5. Cascade: If two systems, h_1 and h_2, are cascaded, then the overall impulse response of the combined system is the convolution of the individual impulse responses,

h(n) = h_1(n) * h_2(n)    (A.7)

and the overall impulse response is independent of the order in which the systems are connected.
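The convolution sum and its commutativity property can be checked numerically. The following is a minimal Python sketch (Python and the function name `convolve` are used here only for illustration; they are not part of the text):

```python
def convolve(x, h):
    """Direct-form discrete convolution: y(n) = sum_k x(k) h(n - k)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

x = [1.0, 2.0, 3.0]   # input sequence
h = [1.0, -1.0]       # impulse response (first difference)
print(convolve(x, h))   # [1.0, 1.0, 1.0, -3.0]
print(convolve(h, x))   # commutativity: the same result
```

Swapping the arguments leaves the output unchanged, illustrating property 1 above.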
A.2 Z-TRANSFORM

The direct z-transform of a time sequence x(n) is defined as

X(z) = \sum_{n=-\infty}^{\infty} x(n) z^{-n}    (A.8)

where z is a complex variable and X(z) is a complex function. The inverse transform is given by

x(n) = \frac{1}{2\pi j} \oint_C X(z) z^{n-1} dz    (A.9)

where the contour C must be in the convergence region of X(z).

The z-transform has the following elementary properties, in which Z[x(n)] represents the z-transform of x(n):

1. Linearity: Let x(n) and y(n) be any two functions and let X(z) and Y(z) be their respective z-transforms. Then for any constants a and b,

Z[a x(n) + b y(n)] = a X(z) + b Y(z)    (A.10)

2. Convolution: If w(n) = x(n) * y(n), then

W(z) = X(z) Y(z)    (A.11)

3. Shifting:

Z[x(n - k)] = z^{-k} X(z)    (A.12)

4. Differences:

Z[x(n + 1) - x(n)] = (z - 1) X(z)    (A.13)

Z[x(n) - x(n - 1)] = (1 - z^{-1}) X(z)    (A.14)

5. Exponential weighting:

Z[a^n x(n)] = X(a^{-1} z)    (A.15)

6. Linear weighting:

Z[n x(n)] = -z \frac{dX(z)}{dz}    (A.16)

7. Time reversal:

Z[x(-n)] = X(z^{-1})    (A.17)
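For finite-length sequences, X(z) is simply a polynomial in z^{-1} whose coefficients are the samples, so the convolution property (the transform of x(n) * y(n) is X(z)Y(z)) reduces to polynomial multiplication. A small Python illustration (the helper name `poly_mul` is our own):

```python
def poly_mul(a, b):
    """Multiply two polynomials in z^{-1} given by their coefficient lists.
    The result's coefficients are exactly the convolution a * b."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c

x = [1.0, 0.5]        # x(n): X(z) = 1 + 0.5 z^-1
y = [2.0, 0.0, 1.0]   # y(n): Y(z) = 2 + z^-2
print(poly_mul(x, y))   # [2.0, 1.0, 1.0, 0.5] = samples of w(n) = x(n) * y(n)
```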
The z-transforms for elementary functions are as follows.

1. Unit impulse: The unit impulse is defined as

\delta(n) = 1 (n = 0); 0 (otherwise)    (A.18)

If x(n) = \delta(n), then

X(z) = \sum_{n=-\infty}^{\infty} \delta(n) z^{-n} = 1    (A.19)

2. Delayed unit impulse: If x(n) = \delta(n - k),

X(z) = \sum_{n=-\infty}^{\infty} \delta(n - k) z^{-n} = z^{-k}    (A.20)

3. Unit step: The unit step function is defined as

u(n) = 1 (n \geq 0); 0 (otherwise)    (A.21)

If x(n) = u(n), then

X(z) = \sum_{n=0}^{\infty} z^{-n} = \frac{1}{1 - z^{-1}}, |z| > 1    (A.22)

4. Exponential: If x(n) = a^n u(n),

X(z) = \sum_{n=0}^{\infty} a^n z^{-n} = \frac{1}{1 - a z^{-1}}, |z| > |a|    (A.23)
A.3 STABILITY

A system is stable if a bounded (finite-amplitude) input x(n) always produces a bounded output y(n). That is, if |x(n)| < \infty for all n, then |y(n)| < \infty for all n. A necessary and sufficient condition for stability is that the impulse response be absolutely summable:

\sum_{n=-\infty}^{\infty} |h(n)| < \infty    (A.24)

Appendix B Vector Quantization Algorithm

B.1 VECTOR QUANTIZATION

A vector quantizer q maps a k-dimensional input vector x into one of a finite set of reproduction vectors,

q: x \to y_i \in Y = \{y_i\} (1 \leq i \leq K)    (B.1)

The set Y is referred to as the codebook, and {y_i} are code vectors or templates. The size K of the codebook is referred to as the number of levels.

To design such a codebook, the k-dimensional space of vector x is partitioned into K regions {C_i} (1 \leq i \leq K), with a vector y_i being associated with each region C_i. The quantizer then assigns the code vector y_i if x is in C_i. This is represented by

q(x) = y_i, if x \in C_i    (B.2)
When x is quantized as y, a quantization distortion measure or distance measure d(x, y) can be defined between x and y. The overall average distortion over a set of M training vectors {x(n)} is then represented by

D = \frac{1}{M} \sum_{n=1}^{M} d(x(n), q(x(n)))    (B.3)

A quantizer is said to be an optimal (minimum-distortion) quantizer if the overall distortion is minimized over all K-level quantizers. Two conditions are necessary for optimality. The first is that the quantizer be realized by using a minimum-distortion or nearest-neighbor selection rule,

q(x) = y_i  if  d(x, y_i) \leq d(x, y_j) for all j    (B.4)

The second is that each code vector y_i be chosen to minimize the average distortion in region C_i. Such a vector is called the centroid of region C_i. The centroid for a particular region depends on the definition of the distortion measure.
B.2 LLOYD'S ALGORITHM (K-MEANS ALGORITHM)

Lloyd's algorithm or the K-means algorithm is an iterative clustering (refining) algorithm for codebook design. The algorithm divides the set of training vectors {x(n)} into K clusters {C_i} in such a way that the two previously described conditions necessary for optimality are satisfied. The four steps of the algorithm are as follows.

Step 1: Initialization. Set m = 0 (m: iteration index). Choose a set of initial code vectors {y_i(0)} (1 \leq i \leq K) using an adequate method.

Step 2: Classification. Classify the set of training vectors {x(n)} (1 \leq n \leq M) into clusters {C_i(m)} based on the nearest-neighbor rule,

x(n) \in C_i(m)  if  d(x(n), y_i(m)) \leq d(x(n), y_j(m)) for all j    (B.5)
Step 3: Code vector updating. Set m \leftarrow m + 1. Update the code vector of every cluster by computing the centroid of the training vectors in each cluster as

y_i(m) = \frac{1}{\|C_i(m)\|} \sum_{x(n) \in C_i(m)} x(n)    (B.6)

where \|C_i(m)\| is the number of training vectors in cluster C_i(m). Calculate the overall distortion D(m) for all training vectors.

Step 4: Termination. If the decrease in the overall distortion D(m) at iteration m relative to D(m - 1) is below a certain threshold, stop; otherwise, go to step 2. (Any other reasonable termination criteria may be substituted.)

This algorithm systematically decreases the overall distortion by updating the codebook. The distortion sometimes converges, however, to a local optimum which may be significantly worse than the global optimum. Specifically, the algorithm tends to gravitate toward the local optimum nearest the initial codebook. A global optimum may be approximately achieved by repeating this algorithm for several types of initializations and choosing the codebook having the minimum overall distortion.
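The four steps above can be sketched in a few lines of Python (an illustrative implementation, not from the text; squared Euclidean error is assumed as the distortion measure d, and the initial code vectors are drawn at random from the training set):

```python
import random

def lloyd(train, K, iters=50, tol=1e-6):
    """Lloyd's (K-means) codebook design with squared-error distortion.
    train: list of k-dimensional vectors (lists of floats)."""
    # Step 1: initialize the code vectors with K distinct training vectors.
    code = [list(v) for v in random.sample(train, K)]
    prev_D = float("inf")
    for _ in range(iters):
        # Step 2: classify each training vector by the nearest-neighbor rule.
        clusters = [[] for _ in range(K)]
        D = 0.0
        for v in train:
            d, i = min((sum((a - b) ** 2 for a, b in zip(v, c)), i)
                       for i, c in enumerate(code))
            clusters[i].append(v)
            D += d
        D /= len(train)  # overall average distortion D(m)
        # Step 3: update each code vector to the centroid of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                code[i] = [sum(col) / len(cl) for col in zip(*cl)]
        # Step 4: stop when the distortion decrease falls below the threshold.
        if prev_D - D < tol:
            break
        prev_D = D
    return code
```

As the text notes, the result depends on the initialization, so in practice the design would be repeated from several random starts and the codebook with the lowest distortion kept.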
B.3 LBG ALGORITHM
Lloyd's algorithm assumes that the codebook has a fixed size. A codebook can begin small and be gradually expanded, however, until it reaches its final size. One alternative is to split an existing cluster into two smaller clusters and assign a codebook entry to each. The following steps describe this method for building an entire codebook (Gersho and Gray, 1992; Parsons, 1986).
396
Appendix B
Step 1: Create an initial cluster consisting of the entire training set. The initial codebook thus contains a single entry corresponding to the centroid of the entire set, as is depicted in Fig. B.1(a) for a two-dimensional input.
FIG. B.1 Splitting procedure. (a) Rate 0: The centroid of the entire training sequence. (b) Initial Rate 1: The single codeword is split to form an initial estimate of a two-word code. (c) Final Rate 1: The algorithm produces a good code with two words. The dotted line indicates the cluster boundary. (d) Initial Rate 2: The two words are split to form an initial estimate of a four-word code. (e) Final Rate 2: The algorithm is run to produce a final four-word code.
Step 2: Split this cluster into two subclusters, resulting in a codebook of twice the size (Fig. B.1(b), (c)).

Step 3: Repeat this cluster-splitting process until the codebook reaches the desired size (Fig. B.1(d), (e)).
Splitting can be done in a number of ways. Ideally, each cluster should be divided by a hyperplane which is normal (perpendicular) to the direction of maximum distortion. This ensures that the maximum distortions of the two new clusters will be smaller than that of the original. As the number of codebook entries increases, however, the computational expense rapidly becomes prohibitive. Some authors perturb the centroid to generate two different points. If the centroid is x, then an initial estimate of two new codes can be created by forming x + \Delta and x - \Delta, where \Delta is a small perturbation vector. The algorithm will then produce good codes (centroids).
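The perturbed-splitting design can be sketched as follows (again an illustrative Python version, not from the text, with squared-error distortion; the helper `refine` plays the role of Lloyd's algorithm run between splits):

```python
def centroid(vs):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vs) for col in zip(*vs)]

def refine(train, code, iters=20):
    """A few Lloyd iterations: nearest-neighbor classify, then recompute centroids."""
    for _ in range(iters):
        clusters = [[] for _ in code]
        for v in train:
            _, i = min((sum((a - b) ** 2 for a, b in zip(v, c)), i)
                       for i, c in enumerate(code))
            clusters[i].append(v)
        code = [centroid(cl) if cl else c for cl, c in zip(clusters, code)]
    return code

def lbg(train, K, delta=0.01):
    """LBG design: start from the global centroid and double the codebook
    by perturbed splitting (x + delta, x - delta) until it reaches size K."""
    code = [centroid(train)]                 # Step 1: rate-0 codebook
    while len(code) < K:
        # Steps 2-3: split every code vector into two perturbed copies,
        # then refine the doubled codebook with Lloyd's algorithm.
        code = [[x + s * delta for x in c] for c in code for s in (+1, -1)]
        code = refine(train, code)
    return code
```

This doubling scheme produces codebooks of size 1, 2, 4, ..., so K is most naturally a power of two.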
Appendix C Neural Nets
Neural net models are composed of many simple nonlinear computational nodes (elements) operating in parallel and arranged in patterns simulating biological neural nets (Lippmann, 1987). A node sums N weighted inputs and passes the result through a nonlinearity, as shown in Fig. C.1. The node is characterized by an internal threshold or offset \theta and by the type of nonlinearity (nonlinear transformation). Shown in Fig. C.1 are three types of nonlinearities: hard limiters, threshold logic elements, and sigmoidal nonlinearities.

Among various kinds of neural nets, multilayer perceptrons have been proven successful in dealing with many types of problems. The multilayer perceptrons are feedforward nets with one or more layers of nodes between the input and output nodes. These additional layers contain hidden nodes that are not directly connected to either the input or output nodes. A three-layer perceptron with two layers of hidden nodes is shown in Fig. C.2. The nonlinearity can be any of the three types shown in Fig. C.1. The decision rule involves selection of the class which corresponds to the output node having the largest output. In the formulas, x_j' and x_k'' are the outputs of nodes in the first and second hidden layers, \theta_j' and \theta_k'' are internal thresholds in those nodes, w_ij is the connection strength from the input to the first hidden layer, w_jk' is the connection strength between the first and the second layers,
FIG. C.1 Computational element or node which forms a weighted sum of N inputs and passes the result through a nonlinearity. Three representative nonlinearities are shown below: hard limiter, threshold logic, and sigmoid.
FIG. C.2 A three-layer perceptron with N continuous-valued inputs, M outputs, and two layers of hidden units.
[Figure: for each network structure the attainable decision regions are: single-layer, a half plane bounded by a hyperplane; two-layer, convex open or closed regions; three-layer, arbitrary regions (complexity limited by the number of nodes). Each structure is illustrated on the exclusive-OR problem, on classes with meshed regions, and on the most general region shapes.]

FIG. C.3 Types of decision regions that can be formed by single- and multilayer perceptrons with one and two layers of hidden units and two inputs. Shading indicates decision regions for class A. Smooth, closed contours bound input distributions for classes A and B. Nodes in all nets use hard-limiting nonlinearities.
and w_kl'' is the connection strength between the second and output layers.

The capabilities of multilayer perceptrons stem from the nonlinearities used within nodes. By way of example, the capabilities of perceptrons having one, two, and three layers that use hard-limiting nonlinearities are illustrated in Fig. C.3. A three-layer perceptron can form arbitrarily complex decision regions, and can separate the meshed classes as shown in the bottom of Fig. C.3. Generally, decision regions required by any classification algorithm can be generated by three-layer feedforward nets.

The multilayer feedforward perceptrons can be automatically trained to improve classification performance with the back-propagation training algorithm. This algorithm is an iterative
gradient algorithm designed to minimize the mean square error between the actual output of the net and the desired output. If the net is used as a classifier, all desired outputs are set to zero except for the one corresponding to the class from which the output originates; that desired output is 1. The algorithm propagates error terms required to adapt weights backward from nodes in the output layer to nodes in lower layers. The following outlines a back-propagation training algorithm which assumes a sigmoidal logistic nonlinearity for the function f(\alpha) in Fig. C.1.
Step 1: Weight and threshold initialization. Set all weights and node thresholds to small random values.

Step 2: Input and desired output presentation. Present an input vector x_0, x_1, ..., x_{N-1} (continuous values) and specify the desired outputs d_0, d_1, ..., d_{M-1}. Present samples from the training set cyclically until the weights stabilize.

Step 3: Actual output calculation. Use the sigmoidal nonlinearity and formulas as in Fig. C.2 to calculate the outputs y_0, y_1, ..., y_{M-1}.

Step 4: Weight adaptation. Use a recursive algorithm starting at the output nodes and working back to the first hidden layer. Adjust weights using

w_{ij}(t + 1) = w_{ij}(t) + \eta \delta_j x_i'    (C.2)

where w_{ij}(t) is the weight from hidden node i or from an input to node j at time t, x_i' is the output of node i or an input, \eta is the gain term, and \delta_j is an error term for node j. If node j is an output node,

\delta_j = y_j (1 - y_j)(d_j - y_j)    (C.3)

If node j is an internal hidden node,

\delta_j = x_j' (1 - x_j') \sum_k \delta_k w_{jk}    (C.4)

where k indicates all nodes in the layers above node j. Adapt internal node thresholds in a similar manner by assuming they are connection weights on links from imaginary inputs having a value of 1. Convergence is sometimes faster, and weight changes are smoothed, if a momentum term is added to Eq. (C.2) as

w_{ij}(t + 1) = w_{ij}(t) + \eta \delta_j x_i' + \alpha [w_{ij}(t) - w_{ij}(t - 1)]    (C.5)

where 0 < \alpha < 1.

Step 5: Repetition by returning to step 2. Repeat steps 2 to 4 until the weights and thresholds converge.

Neural nets typically provide a greater degree of robustness or fault tolerance than do conventional sequential computers. One difficulty noted with the back-propagation algorithm is that in many cases the number of training data presentations required for convergence is large (more than 100 passes through all the training data).
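Steps 1 to 5 can be sketched as a small Python program (an illustrative one-hidden-layer version, not from the text; thresholds are modeled as weights on a constant +1 input, as the text suggests, and no momentum term is used):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

class MLP:
    """One hidden layer of sigmoid units, trained with back-propagation
    (weight update w <- w + eta * delta_j * x_i, as in Eq. (C.2))."""
    def __init__(self, n_in, n_hid, n_out, eta=0.5):
        rnd = lambda: random.uniform(-0.5, 0.5)   # Step 1: small random weights
        self.w1 = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hid)]
        self.w2 = [[rnd() for _ in range(n_hid + 1)] for _ in range(n_out)]
        self.eta = eta

    def forward(self, x):                         # Step 3: actual outputs
        self.h = [sigmoid(sum(w * v for w, v in zip(ws, x + [1.0])))
                  for ws in self.w1]              # the +1.0 input models the threshold
        self.y = [sigmoid(sum(w * v for w, v in zip(ws, self.h + [1.0])))
                  for ws in self.w2]
        return self.y

    def train(self, x, d):                        # Step 4: one weight adaptation
        y = self.forward(x)
        # output-node error terms, Eq. (C.3): delta = y(1 - y)(d - y)
        d2 = [yj * (1 - yj) * (dj - yj) for yj, dj in zip(y, d)]
        # hidden-node error terms, Eq. (C.4): delta = h(1 - h) sum_k delta_k w_jk
        d1 = [hj * (1 - hj) * sum(dk * self.w2[k][j] for k, dk in enumerate(d2))
              for j, hj in enumerate(self.h)]
        for k, dk in enumerate(d2):
            for i, v in enumerate(self.h + [1.0]):
                self.w2[k][i] += self.eta * dk * v
        for j, dj in enumerate(d1):
            for i, v in enumerate(x + [1.0]):
                self.w1[j][i] += self.eta * dj * v

random.seed(1)
net = MLP(2, 4, 1)
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]   # the exclusive-OR problem
for _ in range(2000):                                # Steps 2 and 5: cycle the set
    for x, d in data:
        net.train(x, d)
for x, d in data:
    print(x, round(net.forward(x)[0], 3))
```

A single-layer perceptron cannot represent the exclusive-OR mapping, so this hidden layer is what makes the fit possible at all, in line with Fig. C.3.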
Bibliography
CHAPTER 1

Fagen, M. D., Ed. (1975) A History of Engineering and Science in the Bell System, Bell Telephone Laboratories, Inc., New Jersey, p. 6.

Flanagan, J. L. (1972) Speech Analysis Synthesis and Perception, 2nd Ed., Springer-Verlag, New York.

Furui, S. and Sondhi, M., Ed. (1992) Advances in Speech Signal Processing, Marcel Dekker, New York.

Markel, J. D. and Gray, Jr., A. H. (1976) Linear Prediction of Speech, Springer-Verlag, New York.

Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.

Saito, S. and Nakata, K. (1985) Fundamentals of Speech Signal Processing, Academic Press Japan, Tokyo.

Schroeder, M. R. (1999) Computer Speech, Springer-Verlag, Berlin.
CHAPTER 2

Denes, P. B. and Pinson, E. N. (1963) The Speech Chain, Bell Telephone Laboratories, Inc., New Jersey.

Furui, S., Itakura, F., and Saito, S. (1972) 'Talker recognition by the longtime averaged speech spectrum,' Trans. IECEJ, 55-A, 10, pp. 549-556.

Furui, S. (1986) 'On the role of spectral transition for speech perception,' J. Acoust. Soc. Amer., 80, 4, pp. 1016-1025.

Irii, H., Itoh, K., and Kitawaki, N. (1987) 'Multi-lingual speech database for speech quality measurements and its statistic characteristics,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S87-69.

Jakobson, R., Fant, G., and Halle, M. (1963) Preliminaries to Speech Analysis: The Distinctive Features and Their Correlates, MIT Press, Boston.

Peterson, G. E. and Barney, H. L. (1952) 'Control methods used in a study of the vowels,' J. Acoust. Soc. Amer., 24, 2, pp. 175-184.

Saito, S., Kato, K., and Teranishi, N. (1958) 'Statistical properties of fundamental frequencies of Japanese speech voices,' J. Acoust. Soc. Jap., 14, 2, pp. 111-116.

Saito, S. (1961) Fundamental Research on Transmission Quality of Japanese Phonemes, Ph.D Thesis, Nagoya Univ.

Sato, H. (1975) 'Acoustic cues of male and female voice quality,' Elec. Commun. Labs Tech. J., 24, 5, pp. 977-993.

Stevens, K. N., Keyser, S. J., and Kawasaki, H. (1986) 'Toward a phonetic and phonological theory of redundant features,' in Invariance and Variability in Speech Processes (eds. J. S. Perkel and D. H. Klatt), Lawrence Erlbaum Associates, New Jersey, pp. 426-449.
CHAPTER 3

Fant, G. (1959) 'The acoustics of speech,' Proc. 3rd Int. Cong. Acoust.: Sec. 3, pp. 188-201.

Fant, G. (1960) Acoustic Theory of Speech Production, Mouton's Co., Hague.

Flanagan, J. L. (1972) Speech Analysis Synthesis and Perception, 2nd Ed., Springer-Verlag, New York.

Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1975) 'Synthesis of speech from a dynamic model of the vocal cords and vocal tract,' Bell Systems Tech. J., 54, 3, pp. 485-506.

Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1980) 'Signal models for low bit-rate coding of speech,' J. Acoust. Soc. Amer., 68, 3, pp. 780-791.

Ishizaka, K. and Flanagan, J. L. (1972) 'Synthesis of voiced sounds from a two-mass model of the vocal cords,' Bell Systems Tech. J., 51, 6, pp. 1233-1268.

Kelly, Jr., J. L. and Lochbaum, C. (1962) 'Speech synthesis,' Proc. 4th Int. Cong. Acoust., G42, pp. 1-4.

Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.

Stevens, K. N. (1971) 'Airflow and turbulence noise for fricative and stop consonants: static considerations,' J. Acoust. Soc. Amer., 50, 4(Part 2), pp. 1180-1192.

Stevens, K. N. (1977) 'Physics of laryngeal behavior and larynx models,' Phonetica, 34, pp. 264-279.

CHAPTER 4

Atal, B. S. (1974) 'Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,' J. Acoust. Soc. Amer., 55, 6, pp. 1304-1312.
Atal, B. S. and Rabiner, L. R. (1976) 'A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24, 3, pp. 201-212.

Bell, C. G., Fujisaki, H., Heinz, J. M., Stevens, K. N., and House, A. S. (1961) 'Reduction of speech spectra by analysis-by-synthesis techniques,' J. Acoust. Soc. Amer., 33, 12, pp. 1725-1736.

Bogert, B. P., Healy, M. J. R., and Tukey, J. W. (1963) 'The quefrency alanysis of time series for echoes,' Proc. Symp. Time Series Analysis, Chap. 15, pp. 209-243.

Dudley, H. (1939) 'The vocoder,' Bell Labs Record, 18, 4, pp. 122-126.

Furui, S. (1981) 'Cepstral analysis technique for automatic speaker verification,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-272.

Gold, B. and Rader, C. M. (1967) 'The channel vocoder,' IEEE Trans. Audio, Electroacoust., AU-15, 4, pp. 148-161.

Imai, S. and Kitamura, T. (1978) 'Speech analysis synthesis system using the log magnitude approximation filter,' Trans. IECEJ, J61-A, 6, pp. 527-534.

Itakura, F. and Tohkura, Y. (1978) 'Feature extraction of speech signal and its application to data compression,' Joho-shori, 19, 7, pp. 644-656.

Itakura, F. (1981) 'Speech analysis-synthesis based on spectrum encoding,' J. Acoust. Soc. Jap., 37, 5, pp. 197-203.

Markel, J. D. (1972) 'The SIFT algorithm for fundamental frequency estimation,' IEEE Trans. Audio, Electroacoust., AU-20, 5, pp. 367-377.

Noll, A. M. (1964) 'Short-time spectrum and 'cepstrum' techniques for vocal-pitch detection,' J. Acoust. Soc. Amer., 36, 2, pp. 296-302.
Noll, A. M. (1967) 'Cepstrum pitch determination,' J. Acoust. Soc. Amer., 41, 2, pp. 293-309.

Oppenheim, A. V. and Schafer, R. W. (1968) 'Homomorphic analysis of speech,' IEEE Trans. Audio, Electroacoust., AU-16, 2, pp. 221-226.

Oppenheim, A. V. (1969) 'Speech analysis-synthesis system based on homomorphic filtering,' J. Acoust. Soc. Amer., 45, 2, pp. 458-465.

Oppenheim, A. V. and Schafer, R. W. (1975) Digital Signal Processing, Prentice-Hall, New Jersey.

Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.

Schroeder, M. R. (1966) 'Vocoders: analysis and synthesis of speech,' Proc. IEEE, 54, 5, pp. 720-734.

Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication, University of Illinois Press.

Smith, C. P. (1969) 'Perception of vocoder speech processed by pattern matching,' J. Acoust. Soc. Amer., 46, 6(Part 2), pp. 1562-1571.

Tohkura, Y. (1980) Speech Quality Improvement in PARCOR Speech Analysis-Synthesis Systems, Ph.D Thesis, Tokyo Univ.

CHAPTER 5

Atal, B. S. and Schroeder, M. R. (1968) 'Predictive coding of speech signals,' Proc. 6th Int. Cong. Acoust., C-5-4.

Atal, B. S. (1970) 'Determination of the vocal-tract shape directly from the speech wave,' J. Acoust. Soc. Amer., 47, 1(Part 1), 4K1, p. 64.

Atal, B. S. and Hanauer, S. L. (1971) 'Speech analysis and synthesis by linear prediction of the speech wave,' J. Acoust. Soc. Amer., 50, 2(Part 2), pp. 637-655.
Fukabayashi, T. and Suzuki, H. (1975) 'Speech analysis by linear pole-zero model,' Trans. IECEJ, J58-A, 5, pp. 270-277.

Ishizaki, S. (1977) 'Pole-zero model order identification in speech analysis,' Trans. IECEJ, J60-A, 4, pp. 423-424.

Itakura, F. and Saito, S. (1968) 'Analysis synthesis telephony based on the maximum likelihood method,' Proc. 6th Int. Cong. Acoust., C-5-5.

Itakura, F. and Saito, S. (1971) 'Digital filter techniques for speech analysis and synthesis,' Proc. 7th Int. Cong. Acoust., Budapest, 25-C-1.

Itakura, F. (1975) 'Line spectrum representation of linear predictor coefficients of speech signal,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S75-34.

Itakura, F. and Sugamura, N. (1979) 'LSP speech synthesizer, its principle and implementation,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S79-46.

Itakura, F. (1981) 'Speech analysis-synthesis based on spectrum encoding,' J. Acoust. Soc. Jap., 37, 5, pp. 197-203.

Markel, J. D. (1972) 'Digital inverse filtering - a new tool for formant trajectory estimation,' IEEE Trans. Audio, Electroacoust., AU-20, 2, pp. 129-137.

Markel, J. D. and Gray, Jr., A. H. (1976) Linear Prediction of Speech, Springer-Verlag, New York.

Matsuda, R. (1966) 'Effects of the fluctuation characteristics of input signal on the tonal differential limen of speech transmission system containing single dip in frequency-response,' Trans. IECEJ, 49, 10, pp. 1865-1871.

Morikawa, H. and Fujisaki, H. (1984) 'System identification of the speech production process based on a state-space representation,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32, 2, pp. 252-262.
Nakajima, T., Suzuki, T., Ohmura, H., Ishizaki, S., and Tanaka, K. (1978) 'Estimation of vocal tract area function by adaptive deconvolution and adaptive speech analysis system,' J. Acoust. Soc. Jap., 34, 3, pp. 157-166.

Oppenheim, A. V., Kopec, G. E., and Tribolet, J. M. (1976) 'Speech analysis by homomorphic prediction,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24, 4, pp. 327-332.

Sagayama, S. and Furui, S. (1977) 'Maximum likelihood estimation of speech spectrum by pole-zero modeling,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S76-56.

Sagayama, S. and Itakura, F. (1981) 'Composite sinusoidal modeling applied to spectral analysis of speech,' Trans. IECEJ, J64-A, 2, pp. 105-112.

Sugamura, N. and Itakura, F. (1981) 'Speech data compression by LSP speech analysis-synthesis technique,' Trans. IECEJ, J64-A, 8, pp. 599-606.

Tohkura, Y. (1980) Speech Quality Improvement in PARCOR Speech Analysis-Synthesis Systems, Ph.D Thesis, Tokyo Univ.

Wakita, H. (1973) 'Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms,' IEEE Trans. Audio, Electroacoust., AU-21, 5, pp. 417-427.

Wiener, N. (1966) Extrapolation, Interpolation and Smoothing of Stationary Time Series, MIT Press, Cambridge, Massachusetts.
CHAPTER 6
Abut, H., Gray, R. M., and Rebolledo, G. (1982) 'Vector quantization of speech and speech-like waveforms,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30, 3, pp. 423-435.
Anderson, J. B. and Bodie, J. B. (1975) 'Tree encoding of speech,' IEEE Trans. Information Theory, IT-21, 4, pp. 379-387.
Atal, B. S. and Schroeder, M. R. (1970) 'Adaptive predictive coding of speech signals,' Bell Systems Tech. J., 49, 8, pp. 1973-1986.
Atal, B. S. and Schroeder, M. R. (1979) 'Predictive coding of speech signals and subjective error criteria,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 3, pp. 247-254.
Atal, B. S. and Remde, J. R. (1982) 'A new model of LPC excitation for producing natural-sounding speech at low bit rates,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 614-617.
Atal, B. S. and Schroeder, M. R. (1984) 'Stochastic coding of speech signals at very low bit rates,' Proc. Int. Conf. Commun., Pt. 2, pp. 1610-1613.
Atal, B. S. and Rabiner, L. R. (1986) 'Speech research directions,' AT&T Tech. J., 65, 5, pp. 75-88.
Buzo, A., Gray, Jr., A. H., Gray, R. M., and Markel, J. D. (1980) 'Speech coding based upon vector quantization,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-28, 5, pp. 562-574.
Chen, J.-H., Melchner, M. J., Cox, R. V. and Bowker, D. O. (1990) 'Real-time implementation and performance of a 16 kb/s low-delay CELP speech coder,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 181-184.
Childers, D., Cox, R. V., DeMori, R., Furui, S., Juang, B.-H., Mariani, J. J., Price, P., Sagayama, S., Sondhi, M. M. and Weischedel, R. (1998) 'The past, present, and future of speech processing,' IEEE Signal Processing Magazine, May, pp. 24-48.
Crochiere, R. E., Webber, S. A., and Flanagan, J. L. (1976) 'Digital coding of speech in sub-bands,' Bell Systems Tech. J., 55, 8, pp. 1069-1085.
Crochiere, R. E., Cox, R. V., and Johnston, J. D. (1982) 'Real-time speech coding,' IEEE Trans. Commun., COM-30, 4, pp. 621-634.
Crochiere, R. E. and Flanagan, J. L. (1983) 'Current perspectives in digital speech,' IEEE Commun. Magazine, January, pp. 32-40.
Cummiskey, P., Jayant, N. S., and Flanagan, J. L. (1973) 'Adaptive quantization in differential PCM coding of speech,' Bell Systems Tech. J., 52, 7, pp. 1105-1118.
Cuperman, V. and Gersho, A. (1982) 'Adaptive differential vector coding of speech,' Conf. Rec., 1982 IEEE Global Commun. Conf., Miami, FL, pp. E6.6.1-E6.6.5.
David, Jr., E. E., Schroeder, M. R., Logan, B. F., and Prestigiacomo, A. J. (1962) 'Voice-excited vocoders for practical speech bandwidth reduction,' IRE Trans. Information Theory, IT-8, 5, pp. S101-S105.
Elder, B. (1997) 'Overview on the current development of MPEG-4 audio coding,' Proc. 4th Int. Workshop on Systems, Signals and Image Processing, Poznan.
Esteban, D. and Galand, C. (1977) 'Application of quadrature mirror filters to split band voice schemes,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Hartford, CT, pp. 191-195.
Farges, E. P. and Clements, M. A. (1986) 'Hidden Markov models applied to very low bit rate speech coding,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 433-436.
Fehn, H. G. and Noll, P. (1982) 'Multipath search coding of stationary signals with applications to speech,' IEEE Trans. Commun., COM-30, 4, pp. 687-701.
Flanagan, J. L., Schroeder, M. R., Atal, B. S., Crochiere, R. E., Jayant, N. S., and Tribolet, J. M. (1979) 'Speech coding,' IEEE Trans. Commun., COM-27, 4, pp. 710-737.
Foster, J., Gray, R. M., and Dunham, M. O. (1985) 'Finite-state vector quantization for waveform coding,' IEEE Trans. Information Theory, IT-31, 3, pp. 348-359.
Gersho, A. and Cuperman, V. (1983) 'Vector quantization: A pattern-matching technique for speech coding,' IEEE Commun. Magazine, December, pp. 15-21.
Gersho, A. and Gray, R. M. (1992) Vector Quantization and Signal Compression, Kluwer, Boston.
Gerson, I. A. and Jasiuk, M. A. (1990) 'Vector sum excited linear prediction (VSELP) speech coding at 8 kbps,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 461-464.
Griffin, D. and Lim, J. S. (1988) 'Multiband excitation vocoder,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-36, 8, pp. 1223-1235.
Honda, M. and Itakura, F. (1984) 'Bit allocation in time and frequency domains for predictive coding of speech,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32, 3, pp. 465-473.
Jayant, N. S. (1970) 'Adaptive delta modulation with a one-bit memory,' Bell Systems Tech. J., 49, 3, pp. 321-342.
Jayant, N. S. (1973) 'Adaptive quantization with a one-word memory,' Bell Systems Tech. J., 52, 7, pp. 1119-1144.
Jayant, N. S. (1974) 'Digital coding of speech waveforms: PCM, DPCM, and DM quantizers,' Proc. IEEE, 62, 5, pp. 611-632.
Jayant, N. S. and Noll, P. (1984) Digital Coding of Waveforms, Prentice-Hall, New Jersey.
Jayant, N. S. and Ramamoorthy, V. (1986) 'Adaptive postfiltering of 16 kb/s-ADPCM speech,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, 16.4, pp. 829-832.
Jelinek, F. and Anderson, J. B. (1971) 'Instrumentable tree encoding of information sources,' IEEE Trans. Information Theory, IT-17, 1, pp. 118-119.
Juang, B. H. and Gray, Jr., A. H. (1982) 'Multiple stage vector quantization for speech coding,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 597-600.
Juang, B. H. (1986) 'Design and performance of trellis vector quantizers for speech signals,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 437-440.
Kataoka, A., Moriya, T. and Hayashi, S. (1993) 'An 8-kbit/s speech coder based on conjugate structure CELP,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 592-595.
Kitawaki, N., Itoh, K., Honda, M., and Kakeki, K. (1982) 'Comparison of objective speech quality measures for voiceband codecs,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 1000-1003.
Kleijn, W. B. and Haagen, J. (1994) 'Transformation and decomposition of the speech signal for coding,' IEEE Signal Processing Lett., 1, 9, pp. 136-138.
Krasner, M. A. (1979) 'Digital encoding of speech and audio signals based on the perceptual requirement of the auditory system,' Lincoln Lab. Tech. Rep., 535.
Linde, Y., Buzo, A., and Gray, R. M. (1980) 'An algorithm for vector quantizer design,' IEEE Trans. Commun., COM-28, 1, pp. 84-95.
Lloyd, S. P. (1957) 'Least squares quantization in PCM,' Institute of Mathematical Statistics Meeting, Atlantic City, NJ, September; also (1982) IEEE Trans. Information Theory, IT-28, 2 (Part I), pp. 129-136.
Makhoul, J. and Berouti, M. (1979) 'Adaptive noise spectral shaping and entropy coding in predictive coding of speech,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 1, pp. 63-73.
Malah, D., Crochiere, R. E., and Cox, R. V. (1981) 'Performance of transform and subband coding systems combined with harmonic scaling of speech,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 273-283.
Max, J. (1960) 'Quantizing for minimum distortion,' IRE Trans. Information Theory, IT-6, 1, pp. 7-12.
McAulay, R. J. and Quatieri, T. F. (1986) 'Speech analysis/synthesis based on a sinusoidal representation,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34, pp. 744-754.
Miki, S., Mano, K., Ohmuro, H. and Moriya, T. (1993) 'Pitch synchronous innovation CELP (PSI-CELP),' Proc. Eurospeech, pp. 261-264.
Moriya, T. and Honda, M. (1986) 'Speech coder using phase equalization and vector quantization,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 1701-1704.
Noll, P. (1975) 'A comparative study of various schemes for speech encoding,' Bell Systems Tech. J., 54, 9, pp. 1597-1614.
Ozawa, K., Araseki, T., and Ono, S. (1982) 'Speech coding based on multi-pulse excitation method,' Trans. Committee on Communication Systems, IECEJ, CS82-161.
Ozawa, K. and Araseki, T. (1986) 'High quality multi-pulse speech coder with pitch prediction,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 1689-1692.
Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.
Richards, D. L. (1973) Telecommunication by Speech, Butterworths, London.
Roucos, S., Schwartz, R., and Makhoul, J. (1982a) 'Vector quantization for very-low-rate coding of speech,' Conf. Rec. 1982 IEEE Global Commun. Conf., Miami, FL, pp. E6.2.1-E6.2.5.
Roucos, S., Schwartz, R., and Makhoul, J. (1982b) 'Segment quantization for very-low-rate speech coding,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 1565-1568.
Schafer, R. W. and Rabiner, L. R. (1975) 'Digital representation of speech signals,' Proc. IEEE, 63, 1, pp. 662-677.
Schroeder, M. R. and Atal, B. S. (1982) 'Speech coding using efficient block codes,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 1668-1671.
Schroeder, M. R. and Atal, B. S. (1985) 'Code-excited linear prediction (CELP): high-quality speech at very low bit rates,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tampa, FL, pp. 937-940.
Shiraki, Y. and Honda, M. (1986) 'Very low bit rate speech coding based on joint segmentation and variable length segment quantizer,' Proc. Acoust. Soc. Amer. Meeting, J. Acoust. Soc. Amer., Suppl. 1, 79, p. S94.
Smith, C. D. (1969) 'Perception of vocoder speech processed by pattern matching,' J. Acoust. Soc. Amer., 46, 6 (Part 2), pp. 1562-1571.
Stewart, L. C., Gray, R. M., and Linde, Y. (1982) 'The design of trellis waveform coders,' IEEE Trans. Commun., COM-30, 4, pp. 702-710.
Supplee, L., Cohn, R., Collura, J. and McCree, A. (1997) 'MELP: The new Federal Standard at 2400 bps,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1591-1594.
Tribolet, J. M. and Crochiere, R. E. (1978) 'A vocoder-driven adaptation strategy for low bit-rate adaptive transform coding of speech,' Proc. Int. Conf. Digital Signal Processing, Florence, Italy, pp. 638-642.
Tribolet, J. M. and Crochiere, R. E. (1979) 'Frequency domain coding of speech,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 5, pp. 512-530.
Tribolet, J. M. and Crochiere, R. E. (1980) 'A modified adaptive transform coding scheme with post-processing enhancement,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Denver, Colorado, pp. 336-339.
Wong, D. Y., Juang, B. H., and Gray, Jr., A. H. (1982) 'An 800 bit/s vector quantization LPC vocoder,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30, 5, pp. 770-780.
Wong, D. Y., Juang, B. H., and Cheng, D. Y. (1983) 'Very low data rate speech compression with LPC vector and matrix quantization,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 65-68.
Zelinski, R. and Noll, P. (1977) 'Adaptive transform coding of speech signals,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-25, 4, pp. 299-309.
CHAPTER 7
Allen, J., Carlson, R., Granstrom, B., Hunnicutt, S., Klatt, D., and Pisoni, D. (1979) MITalk-79: Conversion of Unrestricted English Text to Speech, MIT.
Black, A. W. and Campbell, N. (1995) 'Optimizing selection of units from speech databases for concatenative synthesis,' Proc. Eurospeech, pp. 581-584.
Coker, C. H., Umeda, N., and Browman, C. P. (1978) 'Automatic synthesis from ordinary English text,' IEEE Trans. Audio, Electroacoust., AU-21, 3, pp. 293-298.
Crochiere, R. E. and Flanagan, J. L. (1986) 'Speech processing: An evolving technology,' AT&T Tech. J., 65, 5, pp. 2-11.
Ding, W. and Campbell, N. (1997) 'Optimizing unit selection with voice source and formants in the CHATR speech synthesis system,' Proc. Eurospeech, pp. 537-540.
Dixon, N. R. and Maxey, H. D. (1968) 'Terminal analog synthesis of continuous speech using the diphone method of segment assembly,' IEEE Trans. Audio, Electroacoust., AU-16, 1, pp. 40-50.
Donovan, R. E. and Woodland, P. C. (1999) 'A hidden Markov-model-based trainable speech synthesizer,' Computer Speech and Language, 13, pp. 223-241.
Flanagan, J. L. (1972) 'Voices of men and machines,' J. Acoust. Soc. Amer., 51, 5 (Part 1), pp. 1375-1387.
Hirokawa, T., Itoh, K. and Sato, H. (1992) 'High quality speech synthesis based on wavelet compilation of phoneme segments,' Proc. Int. Conf. Spoken Language Processing, pp. 567-570.
Hirose, K., Fujisaki, H., and Kawai, H. (1986) 'Generation of prosodic symbols for rule-synthesis of connected speech of Japanese,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, 45.4, pp. 2415-2418.
Huang, X., Acero, A., Adcock, J., Hon, H.-W., Goldsmith, J., Liu, J. and Plumpe, M. (1996) 'WHISTLER: A trainable text-to-speech system,' Proc. Int. Conf. Spoken Language Processing, pp. 2387-2390.
Klatt, D. H. (1980) 'Software for a cascade/parallel formant synthesizer,' J. Acoust. Soc. Amer., 67, 3, pp. 971-995.
Klatt, D. H. (1987) 'Review of text-to-speech conversion for English,' J. Acoust. Soc. Amer., 82, 3, pp. 737-793.
Laroche, J., Stylianou, Y. and Moulines, E. (1993) 'HNS: Speech modification based on a harmonic + noise model,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 550-553.
Lovins, J. B., Macchi, M. J., and Fujimura, O. (1979) 'A demisyllable inventory for speech synthesis,' 97th Meeting of Acoust. Soc. Amer., YY4.
Moulines, E. and Charpentier, F. (1990) 'Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,' Speech Communication, 9, pp. 453-467.
Nakajima, S. and Hamada, H. (1988) 'Automatic generation of synthesis units based on context oriented clustering,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 659-662.
Nakajima, S. (1993) 'English speech synthesis based on multi-layered context oriented clustering,' Proc. Eurospeech, pp. 1709-1712.
Sagisaka, Y. and Tohkura, Y. (1984) 'Phoneme duration control for speech synthesis by rule,' Trans. IECEJ, J67-A, 7, pp. 629-636.
Sagisaka, Y. (1988) 'Speech synthesis by rule using an optimal selection of non-uniform synthesis units,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 679-682.
Sagisaka, Y. (1998) 'Corpus-based speech synthesis,' J. Signal Processing, 2, 6, pp. 407-414.
Sato, H. (1978) 'Speech synthesis on the basis of PARCOR-VCV concatenation units,' Trans. IECEJ, J61-D, 11, pp. 858-865.
Sato, H. (1984a) 'Speech synthesis using CVC concatenation units and excitation waveform elements,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S83-69.
Sato, H. (1984b) 'Japanese text-to-speech conversion system,' Rev. of the Elec. Commun. Labs., 32, 2, pp. 179-187.
Tokuda, K., Masuko, T., Yamada, T., Kobayashi, T. and Imai, S. (1995) 'An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features,' Proc. Eurospeech, pp. 757-760.
CHAPTER 8
Acero, A. and Stern, R. M. (1990) 'Environmental robustness in automatic speech recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 849-852.
Atal, B. (1974) 'Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,' J. Acoust. Soc. Amer., 55, 6, pp. 1304-1312.
Bahl, L. R. and Jelinek, F. (1975) 'Decoding for channels with insertions, deletions, and substitutions, with applications to speech recognition,' IEEE Trans. Information Theory, IT-21, pp. 404-411.
Bahl, L. R., Brown, P. F., de Souza, P. V. and Mercer, R. L. (1986) 'Maximum mutual information estimation of hidden Markov model parameters for speech recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 49-52.
Baker, J. K. (1975) 'Stochastic modeling for automatic speech understanding,' in Speech Recognition (ed. D. R. Reddy), pp. 521-542.
Baum, L. E. (1972) 'An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process,' Inequalities, 3, pp. 1-8.
Bellman, R. (1957) Dynamic Programming, Princeton Univ. Press, New Jersey.
Bridle, J. S. (1973) 'An efficient elastic template method for detecting keywords in running speech,' Brit. Acoust. Soc. Meeting, pp. 1-4.
Bridle, J. S. and Brown, M. D. (1979) 'Connected word recognition using whole word templates,' Proc. Inst. Acoust. Autumn Conf., pp. 25-28.
Brown, P. F., Della Pietra, V. J., de Souza, P. V., Lai, J. C. and Mercer, R. L. (1992) 'Class-based n-gram models of natural language,' Computational Linguistics, 18, 4, pp. 467-479.
Chen, S. S., Eide, E. M., Gales, M. J. F., Gopinath, R. A., Kanevsky, D. and Olsen, P. (1999) 'Recent improvements to IBM's speech recognition system for automatic transcription of broadcast news,' Proc. DARPA Broadcast News Workshop, pp. 89-94.
Childers, D., Cox, R. V., DeMori, R., Furui, S., Juang, B.-H., Mariani, J. J., Price, P., Sagayama, S., Sondhi, M. M. and Weischedel, R. (1998) 'The past, present, and future of speech processing,' IEEE Signal Processing Magazine, May, pp. 24-48.
Cox, S. J. and Bridle, J. S. (1989) 'Unsupervised speaker adaptation by probabilistic fitting,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 294-297.
Cox, S. J. (1995) 'Predictive speaker adaptation in speech recognition,' Computer Speech and Language, 9, pp. 1-17.
Davis, K. H., Biddulph, R., and Balashek, S. (1952) 'Automatic recognition of spoken digits,' J. Acoust. Soc. Amer., 24, 6, pp. 637-642.
Digalakis, V. and Neumeyer, L. (1995) 'Speaker adaptation using combined transformation and Bayesian methods,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 680-683.
Furui, S. (1975) 'Learning and normalization of the talker differences in the recognition of spoken words,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S75-25.
Furui, S. (1978) Research on Individual Information in Speech Waves, Ph.D. Thesis, Tokyo University.
Furui, S. (1980) 'A training procedure for isolated word recognition systems,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-28, 2, pp. 129-136.
Furui, S. (1981) 'Cepstral analysis technique for automatic speaker verification,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-272.
Furui, S. (1986a) 'Speaker-independent isolated word recognition using dynamic features of speech spectrum,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34, 1, pp. 52-59.
Furui, S. (1986b) 'On the role of spectral transition for speech perception,' J. Acoust. Soc. Amer., 80, 4, pp. 1016-1025.
Furui, S. (1987) 'A VQ-based preprocessor using cepstral dynamic features for large vocabulary word recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, 27.2, pp. 1127-1130.
Furui, S. (1989a) 'Unsupervised speaker adaptation method based on hierarchical spectral clustering,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 286-289.
Furui, S. (1989b) 'Unsupervised speaker adaptation based on hierarchical spectral clustering,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-37, 12, pp. 1923-1930.
Furui, S. (1992) 'Toward robust speech recognition under adverse conditions,' Proc. ESCA Workshop on Speech Processing in Adverse Conditions, Cannes-Mandelieu, pp. 31-42.
Furui, S. (1995) 'Flexible speech recognition,' Proc. Eurospeech, pp. 1595-1603.
Furui, S. (1997) 'Recent advances in robust speech recognition,' Proc. ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, pp. 11-20.
Gales, M. J. F. and Young, S. J. (1992) 'An improved approach to the hidden Markov model decomposition of speech and noise,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 233-236.
Gales, M. J. F. and Young, S. J. (1993) 'Parallel model combination for speech recognition in noise,' Technical Report CUED/F-INFENG/TR135, Cambridge Univ.
Gauvain, J.-L., Lamel, L., Adda, G. and Jardino, M. (1999) 'The LIMSI 1998 Hub-4E transcription system,' Proc. DARPA Broadcast News Workshop, pp. 99-104.
Goodman, R. G. (1976) Analysis of Languages for Man-Machine Voice Communication, Ph.D. Thesis, Carnegie-Mellon University.
Gray, Jr., A. H. and Markel, J. D. (1976) 'Distance measures for speech processing,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24, 5, pp. 380-391.
Huang, X.-D. and Jack, M. A. (1989) 'Semi-continuous hidden Markov source models for speech signals,' Computer Speech and Language, 3, pp. 239-251.
Huang, X.-D., Ariki, Y. and Jack, M. A. (1990) Hidden Markov Models for Speech Recognition, Edinburgh Univ. Press, Edinburgh.
Itakura, F. (1975) 'Minimum prediction residual principle applied to speech recognition,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 1, pp. 67-72.
Jelinek, F. (1976) 'Continuous speech recognition by statistical methods,' Proc. IEEE, 64, 4, pp. 532-556.
Jelinek, F. (1997) Statistical Methods for Speech Recognition, MIT Press, Cambridge.
Juang, B.-H. (1991) 'Speech recognition in adverse environments,' Computer Speech and Language, 5, pp. 275-294.
Juang, B.-H. and Katagiri, S. (1992) 'Discriminative learning for minimum error classification,' IEEE Trans. Signal Processing, 40, 12, pp. 3043-3054.
Juang, B.-H., Chou, W. and Lee, C.-H. (1996) 'Statistical and discriminative methods for speech recognition,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 109-132.
Kato, K. and Kawahara, H. (1984) 'Adaptability to individual talkers in monosyllabic speech perception,' Trans. Committee on Hearing Research, Acoust. Soc. Jap., H84-3.
Katz, S. M. (1987) 'Estimation from sparse data for the language model component of a speech recognizer,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-35, 3, pp. 400-401.
Kawahara, T., Lee, C.-H. and Juang, B.-H. (1997) 'Combining key-phrase detection and subword-based verification for flexible speech understanding,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1303-1306.
Klatt, D. H. (1982) 'Prediction of perceived phonetic distance from critical-band spectra: A first step,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, S11.1, pp. 1278-1281.
Knill, K. and Young, S. (1997) 'Hidden Markov models in speech and language processing,' in Corpus-Based Methods in Language and Speech Processing (eds. S. Young and G. Bloothooft), Kluwer, Dordrecht, pp. 27-68.
Kohda, M., Hashimoto, S., and Saito, S. (1972) 'Spoken digit mechanical recognition system,' Trans. IECEJ, 55-D, 3, pp. 186-193.
Lee, C.-H. and Gauvain, J.-L. (1996) 'Bayesian adaptive learning and MAP estimation of HMM,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 83-107.
Leggetter, C. J. and Woodland, P. C. (1995) 'Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,' Computer Speech and Language, 9, pp. 171-185.
Lesser, V. R., Fennell, R. D., Erman, L. D., and Reddy, D. R. (1975) 'Organization of the Hearsay II speech understanding system,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 1, pp. 11-24.
Lin, C.-H., Chang, P.-C. and Wu, C.-H. (1994) 'An initial study on speaker adaptation for Mandarin syllable recognition with minimum error discriminative training,' Proc. Int. Conf. Spoken Language Processing, pp. 307-310.
Lowerre, B. T. (1976) The Harpy Speech Recognition System, Ph.D. Thesis, Computer Science Department, Carnegie-Mellon University.
Martin, F., Shikano, K. and Minami, Y. (1993) 'Recognition of noisy speech by composition of hidden Markov models,' Proc. Eurospeech, pp. 1031-1034.
Matsui, T. and Furui, S. (1995) 'A study of speaker adaptation based on minimum classification error training,' Proc. Eurospeech, pp. 81-84.
Matsui, T. and Furui, S. (1996) 'N-best-based instantaneous speaker adaptation method for speech recognition,' Proc. Int. Conf. Spoken Language Processing, pp. 973-976.
Matsumoto, H. and Wakita, H. (1986) 'Vowel normalization by frequency warped spectral matching,' Speech Communication, 5, 2, pp. 239-251.
Matsuoka, T. and Lee, C.-H. (1993) 'A study of on-line Bayesian adaptation for HMM-based speech recognition,' Proc. Eurospeech, pp. 815-818.
Minami, Y. and Furui, S. (1995) 'Universal adaptation method based on HMM composition,' Proc. ICA, pp. 105-108.
Myers, C. S. and Rabiner, L. R. (1981) 'Connected digit recognition using a level-building DTW algorithm,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 3, pp. 351-363.
Nakagawa, S. (1983) 'A connected spoken word or syllable recognition algorithm by pattern matching,' Trans. IECEJ, J66-D, 6, pp. 637-644.
Nakatsu, R., Nagashima, H., Kojima, J., and Ishii, N. (1983) 'A speech recognition method for telephone voice,' Trans. IECEJ, J66-D, 4, pp. 377-384.
Ney, H. and Aubert, X. (1996) 'Dynamic programming search strategies: From digit strings to large vocabulary word graphs,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 385-411.
Ney, H., Martin, S. and Wessel, F. (1997) 'Statistical language modeling using leaving-one-out,' in Corpus-Based Methods in Language and Speech Processing (eds. S. Young and G. Bloothooft), Kluwer, Dordrecht, pp. 174-207.
Normandin, Y. (1996) 'Maximum mutual information estimation of hidden Markov models,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 57-81.
Ohkura, K., Sugiyama, M. and Sagayama, S. (1992) 'Speaker adaptation based on transfer vector field smoothing with continuous mixture density HMMs,' Proc. Int. Conf. Spoken Language Processing, pp. 369-372.
Ohtsuki, K., Furui, S., Sakurai, N., Iwasaki, A. and Zhang, Z.-P. (1999) 'Recent advances in Japanese broadcast news transcription,' Proc. Eurospeech, pp. 671-674.
Paliwal, K. K. (1982) 'On the performance of the quefrency-weighted cepstral coefficients in vowel recognition,' Speech Communication, 1, 2, pp. 151-154.
Paul, D. (1991) 'Algorithms for an optimal A* search and linearizing the search in the stack decoder,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 693-696.
Rabiner, L. R., Levinson, S. E., Rosenberg, A. E., and Wilpon, J. G. (1979a) 'Speaker-independent recognition of isolated words using clustering techniques,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 4, pp. 336-349.
Rabiner, L. R. and Wilpon, J. G. (1979b) 'Speaker-independent isolated word recognition for a moderate size (54 word) vocabulary,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 6, pp. 583-587.
Rabiner, L. R., Levinson, S. E., and Sondhi, M. M. (1983) 'On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition,' Bell Systems Tech. J., 62, 4, pp. 1075-1105.
Rabiner, L. R. and Levinson, S. E. (1985) 'A speaker-independent, syntax-directed, connected word recognition system based on hidden Markov models and level building,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-33, 3, pp. 561-573.
Rabiner, L. R., Juang, B.-H., Levinson, S. E. and Sondhi, M. M. (1985) 'Recognition of isolated digits using hidden Markov models with continuous mixture densities,' AT&T Tech. J., 64, 6, pp. 1211-1234.
Rabiner, L. and Juang, B.-H. (1993) Fundamentals of Speech Recognition, Prentice Hall, New Jersey.
Rissanen, J. (1984) 'Universal coding, information, prediction and estimation,' IEEE Trans. Information Theory, 30, 4, pp. 629-636.
Rohlicek, J. R. (1995) 'Word spotting,' in Modern Methods of Speech Processing (eds. R. P. Ramachandran and R. Mammone), Kluwer, Boston, pp. 123-157.
Rose, R. C. (1996) 'Word spotting from continuous speech utterances,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 303-329.
Sakoe, H. and Chiba, S. (1971) 'Recognition of continuously spoken words based on time-normalization by dynamic programming,' J. Acoust. Soc. Jap., 27, 9, pp. 483-500.
Sakoe, H. and Chiba, S. (1978) 'Dynamic programming algorithm optimization for spoken word recognition,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26, 1, pp. 43-49.
Sakoe, H. (1979) 'Two-level DP-matching - A dynamic programming-based pattern matching algorithm for connected word recognition,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 6, pp. 588-595.
Sakoe, H. and Watari, M. (1981) 'Clockwise propagating DP-matching algorithm for word recognition,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S81-65.
Sankar, A. and Lee, C.-H. (1996) 'A maximum-likelihood approach to stochastic matching for robust speech recognition,' IEEE Trans. Speech and Audio Processing, 4, 3, pp. 190-202.
Schwartz, R., Chow, Y.-L. and Kubala, F. (1987) 'Rapid speaker adaptation using a probabilistic spectral mapping,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 633-636.
Schwarz, G. (1978) 'Estimating the dimension of a model,' The Annals of Statistics, 6, pp. 461-464.
Shikano, K. (1982) 'Spoken word recognition based upon vector quantization of input speech,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S82-60.
Shikano, K. and Aikawa, K. (1982) 'Staggered array DP matching,' Trans. Committee on Speech Research, Acoust. Soc. Jap., S82-15.
Shikano, K., Lee, K.-F. and Reddy, R. (1986) 'Speaker adaptation through vector quantization,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, 49.5, pp. 2643-2646.
Shiraki, Y. and Honda, M. (1990) 'Speaker adaptation algorithms based on piece-wise moving adaptive segment quantization method,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 657-660.
Slutsker, G. (1968) 'Non-linear method of analysis of speech signal,' Trudy N. I. I. R.
Soong, F. K. and Huang, E. F. (1991) 'A tree-trellis fast search for finding N-best sentence hypotheses,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 705-708.
Stern, R. M., Acero, A., Liu, F.-H. and Ohshima, Y. (1996) 'Signal processing for robust speech recognition,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 357-384.
Sugamura, N. and Furui, S. (1982) 'Large vocabulary word recognition using pseudo-phoneme templates,' Trans. IECEJ, J65-D, 8, pp. 1041-1048.
Sugamura, N., Shikano, K., and Furui, S. (1983) 'Isolated word recognition using phoneme-like templates,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Boston, MA, 16.3, pp. 723-726.
Sugamura, N. and Furui, S. (1984) 'Isolated word recognition using strings of phoneme-like templates (SPLIT),' J. Acoust. Soc. Japan (E), 5, 4, pp. 243-252.
Sugiyama, M. and Shikano, K. (1981) 'LPC peak weighted spectral matching measures,' Trans. IECEJ, J64-A, 5, pp. 409-416.
Sugiyama, M. and Shikano, K. (1982) 'Frequency weighted LPC spectral matching measures,' Trans. IECEJ, J65-A, 9, pp. 965-972.
Tohkura, Y. (1986) 'A weighted cepstral distance measure for speech recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, 14.17, pp. 761-764.
Varga, A. P. and Moore, R. K. (1990) 'Hidden Markov model decomposition of speech and noise,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 845-848.
Varga, A. P. and Moore, R. K. (1991) 'Simultaneous recognition of concurrent speech signals using hidden Markov model decomposition,' Proc. Eurospeech, pp. 1175-1178.
Velichko, V. and Zagoruyko, N. (1970) 'Automatic recognition of 200 words,' Int. J. Man-Machine Studies, 2, pp. 223-234.
Vintsyuk, T. K. (1968) 'Speech recognition by dynamic programming,' Kibernetika, 4, 1, pp. 81-88.
Vintsyuk, T. K. (1971) 'Element-wise recognition of continuous speech composed of words from a specified dictionary,' Kibernetika, 2, pp. 133-143.
Viterbi, A. J. (1967) 'Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,' IEEE Trans. Information Theory, IT-13, pp. 260-269.
Young, S. (1996) 'A review of large-vocabulary continuous-speech recognition,' IEEE Signal Processing Magazine, September, pp. 45-57.
CHAPTER 9
Atal, B. S. (1972) 'Automatic speaker recognition based on pitch contours,' J. Acoust. Soc. Amer., 52, 6 (Part 2), pp. 1687-1697.
Atal, B. S. (1974) 'Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,' J. Acoust. Soc. Amer., 55, 6, pp. 1304-1312.
Carey, M. and Parris, E. (1992) 'Speaker verification using connected words,' Proc. Institute of Acoustics, 14, 6, pp. 95-100.
Doddington, G. R. (1974) 'Speaker verification,' Rome Air Development Center, Tech. Rep., RADC 74-179.
Doddington, G. (1985) 'Speaker recognition - Identifying people by their voices,' Proc. IEEE, 73, 11, pp. 1651-1664.
Eatock, J. and Mason, J. (1990) 'Automatically focusing on good discriminating speech segments in speaker recognition,' Proc. Int. Conf. Spoken Language Processing, 5.2, pp. 133-136.
Furui, S., Itakura, F., and Saito, S. (1972) 'Talker recognition by longtime averaged speech spectrum,' Trans. IECEJ, 55-A, 10, pp. 549-556.
Furui, S. (1978) Research on Individuality Information in Speech Waves, Ph.D. Thesis, Tokyo University.
Furui, S. (1981a) 'Comparison of speaker recognition methods using statistical features and dynamic features,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 3, pp. 342-350.
Furui, S. (1981b) 'Cepstral analysis technique for automatic speaker verification,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-272.
Furui, S. (1986) 'Research on individuality features in speech waves and automatic speaker recognition techniques,' Speech Communication, 5, 2, pp. 183-197.
Furui, S. (1996) 'An overview of speaker recognition technology,' in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 31-56.
Furui, S. (1997) 'Recent advances in speaker recognition,' Pattern Recognition Letters, 18, pp. 859-872.
Griffin, C., Matsui, T. and Furui, S. (1994) 'Distance measures for text-independent speaker recognition based on MAR model,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Adelaide, 23.6, pp. 309-312.
Higgins, A., Bahler, L. and Porter, J. (1991) 'Speaker verification using randomized phrase prompting,' Digital Signal Processing, 1, pp. 89-106.
Kersta, L. G. (1962) 'Voiceprint identification,' Nature, 196, pp. 1253-1257.
Li, K. P. and Wrench, Jr., E. H. (1983) 'An approach to text-independent speaker recognition with short utterances,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Boston, MA, 12.9, pp. 555-558.
Kunzel, H. (1994) 'Current approaches to forensic speaker recognition,' ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 135-141.
Markel, J., Oshika, B. and Gray, A. (1977) 'Long-term feature averaging for speaker recognition,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-25, 4, pp. 330-337.
Markel, J. and Davis, S. (1979) 'Text-independent speaker recognition from a large linguistically unconstrained time-spaced data base,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 1, pp. 74-82.
Matsui, T. and Furui, S. (1990) 'Text-independent speaker recognition using vocal tract and pitch information,' Proc. Int. Conf. Spoken Language Processing, Kobe, 5.3, pp. 137-140.
Matsui, T. and Furui, S. (1991) 'A text-independent speaker recognition method robust against utterance variations,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S6.3, pp. 377-380.
Matsui, T. and Furui, S. (1992) 'Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, pp. II-157-160.
Matsui, T. and Furui, S. (1993) 'Concatenated phoneme models for text-variable speaker recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Minneapolis, pp. II-391-394.
Matsui, T. and Furui, S. (1994a) 'Speaker adaptation of tied-mixture-based phoneme models for text-prompted speaker recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Adelaide, 13.1.
Matsui, T. and Furui, S. (1994b) 'Similarity normalization method for speaker verification based on a posteriori probability,' ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 59-62.
Montacie, C., Deleglise, P., Bimbot, F. and Caraty, M.-J. (1992) 'Cinematic techniques for speech processing: Temporal decomposition and multivariate linear prediction,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, pp. I-153-156.
Naik, J., Netsch, M. and Doddington, G. (1989) 'Speaker verification over long distance telephone lines,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S10b.3, pp. 524-527.
National Research Council (1979) On the Theory and Practice of Voice Identification, Washington, D.C.
Newman, M., Gillick, L., Ito, Y., McAllaster, D. and Peskin, B. (1996) 'Speaker verification through large vocabulary continuous speech recognition,' Proc. Int. Conf. Spoken Language Processing, Philadelphia, pp. 2419-2422.
O'Shaughnessy, D. (1986) 'Speaker recognition,' IEEE ASSP Magazine, 3, 4, pp. 4-17.
Poritz, A. (1982) 'Linear predictive hidden Markov models and the speech signal,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S11.5, pp. 1291-1294.
Reynolds, D. (1994) 'Speaker identification and verification using Gaussian mixture speaker models,' ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 27-30.
Rose, R. and Reynolds, R. (1990) 'Text independent speaker identification using automatic acoustic segmentation,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S51.10, pp. 293-296.
Rosenberg, A. E. and Sambur, M. R. (1975) 'New techniques for automatic speaker verification,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 2, pp. 169-176.
Rosenberg, A. and Soong, F. (1987) 'Evaluation of a vector quantization talker recognition system in text independent and text dependent modes,' Computer Speech and Language, 22, pp. 143-157.
Rosenberg, A., Lee, C. and Gokcen, S. (1991) 'Connected word talker verification using whole word hidden Markov models,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Toronto, S6.4, pp. 381-384.
Rosenberg, A. and Soong, F. (1991) 'Recent research in automatic speaker recognition,' in Advances in Speech Signal Processing (eds. S. Furui and M. M. Sondhi), Marcel Dekker, New York, pp. 701-737.
Rosenberg, A. (1992) 'The use of cohort normalized scores for speaker verification,' Proc. Int. Conf. Spoken Language Processing, Banff, Th.sAM.4.2, pp. 599-602.
Sambur, M. R. (1975) 'Selection of acoustic features for speaker identification,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 2, pp. 176-182.
Savic, M. and Gupta, S. (1990) 'Variable parameter speaker verification system based on hidden Markov modeling,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S5.7, pp. 281-284.
Setlur, A. and Jacobs, T. (1995) 'Results of a speaker verification service trial using HMM models,' Proc. EUROSPEECH '95, Madrid, pp. 639-642.
Shikano, K. (1985) 'Text-independent speaker recognition experiments using codebooks in vector quantization,' J. Acoust. Soc. Am. (abstract), Suppl. 1, 77, S11.
Soong, F. K. and Rosenberg, A. E. (1986) 'On the use of instantaneous and transitional spectral information in speaker recognition,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 877-880.
Soong, F., Rosenberg, A. and Juang, B. (1987) 'A vector quantization approach to speaker recognition,' AT&T Technical Journal, 66, pp. 14-26.
Tishby, N. (1991) 'On the application of mixture AR hidden Markov models to text independent speaker recognition,' IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30, 3, pp. 563-570.
Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C., Nicol, J., and Nash, E. (1972) 'Experiment on voice identification,' J. Acoust. Soc. Amer., 51, 6 (Part 2), pp. 2030-2043.
Zheng, Y. and Yuan, B. (1988) 'Text-dependent speaker identification using circular hidden Markov models,' Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S13.3, pp. 580-582.
APPENDICES

Gersho, A. and Gray, R. M. (1992) Vector Quantization and Signal Compression, Kluwer, Boston.
Lippmann, R. P. (1987) 'An introduction to computing with neural nets,' IEEE ASSP Magazine, 4, 2, pp. 4-22.
Makhoul, J., Roucos, S., and Gish, H. (1985) 'Vector quantization,' Proc. IEEE, 73, 11, pp. 1551-1588.
Parsons, T. W. (1986) Voice and Speech Processing, McGraw-Hill, New York, pp. 274-275.
Index
A

%correct, 322
A* search, 312
Abdominal muscles, 10
Accent, 10
  component, 230
Accuracy, 322
Acoustic background, 303
Acoustic model, 314
Adaptation:
  backward (feedback), 143, 151
  to environmental variation, 380
  forward (feedforward), 143
  on line, 336
  instantaneous, 336
Adaptive bit allocation, 163
Adaptive delta modulation (ADM), 151
Adaptive differential PCM (ADPCM), 143, 148, 151, 158
Adaptive inverse filtering, 114
Adaptive PCM (APCM), 143
Adaptive prediction, 147
  backward, 149, 151
  forward, 149
Adaptive predictive coding (APC), 143, 149, 153
  with adaptive bit allocation (APC-AB), 166
Adaptive predictive DPCM (AP-DPCM), 149
Adaptive quantization, 138, 143
  backward, 151
Adaptive transform coding (ATC), 163
  with VQ (ATC-VQ), 179
Adaptive vector predictive coding (AVPC), 180
Adjustment window condition, 269
AEN (articulation equivalent loss), 201
Affine transformation, 336
Affricate sound, 11
Air Travel Information System (ATIS), 323
A-law, 142
Alexander Graham Bell, 1
Aliasing distortion, 47
Allophones, 219, 229
All-pole:
  model, 89
  polynomial spectral density function, 90
  spectrum, 68
  speech production system, 68
Allophonic variations, 320
Amplitude:
  density distribution function, 21
  level, 20
Analog-to-digital (A/D) conversion, 45, 51
Analysis-by-synthesis coder, 196
Analysis-by-synthesis (A-b-S) method, 42, 71, 190
Analysis-synthesis, 73, 135
Antiformant, 30, 127
Antiresonance, 30
  circuit, 27, 224
Anti-model, 347
A posteriori probability, 363, 365
Area function, 33, 111
AR process, 91
Arithmetic coding, 134
Articulation, 9, 11, 27, 30
  manner of, 12
  place of, 12
Articulation equivalent transmission loss (AEN), 201
Articulators, 11
Articulatory model, 223
Articulatory movement, 11
Articulatory organs, 11, 246
Articulatory units, 381
Artificial intelligence (AI), 382
Aspiration, 11
Auditory critical bandwidth, 251
Auditory nerve system, 7
Auditory scene analysis, 385
Audrey, 243
Augmented transition network (ATN), 312
Autocorrelation:
  function, 52, 53, 251, 252
  method, 87, 252
Automation control, 301
Autoregressive (AR) process, 89
Average branching factor, 322

B

Back-propagation training algorithm, 401
Backward prediction error, 102
Backward propagation wave, 33
Backward variable, 285
Bakis model, 279
Band-pass:
  filter (BPF), 70, 82, 250
  filter bank, 76, 159, 251
  lifters, 252
Bark-scale frequency axis, 251
Basilar membrane, 251
Baum-Welch algorithm, 282, 288
Bayes' rule, 313
Bayes' sense, 364
Bayesian learning, 335
Beam search method, 311
Bernoulli effect, 10
Best-first method, 311
BIC (Bayesian Information Criterion), 325, 328
Bigram, 316
Binary tree coding (BTC), 178
Blackboard model, 310
Blind equalization, 361
Bottom-up, 308
Boundary condition:
  at the lips and glottis, 115
  for the time warping function, 268
Breadth-first method, 311

C

Cascade connection, 225
Case frame, 312
Centroid, 176, 337, 394
Cepstral analysis, 79
Cepstral coefficient, 62, 77, 251
Cepstral distance (CD), 202
Cepstrum, 62
  method, 252
Cepstral mean:
  normalization (CMN), 325, 341, 361
  subtraction (CMS), 341, 361
CHATR, 238
Cholesky decomposition method, 89
City block distance, 269
Claimed speaker, 364
Class N-gram, 317
Clustering, 332
Clustering-based methods, 176
Cluster-splitting method (LBG algorithm), 176, 395
Coarticulation, 16, 245, 378
  dynamic model of, 383
Code:
  vectors, 176, 393
Codebook, 176, 281, 393
Codeword, 279
Code-excited linear predictive coding (CELP), 193
Coding, 45, 47, 199
  bit rate, 200
  delay, 200
  in frequency domain, 159
  methods, evaluation of, 199
  in time domain, 141
Cohort speakers, 364
Complexity of coder and decoder, 200
Composite sinusoidal model (CSM), 126
Concatenation synthesizer, 238
Connected word recognition, 295
Connection strength, 399
Consonant, 6
Context, 308
Context-dependent phoneme units, 229, 247
Context-free grammar (CFG), 312
Context-oriented-clustering (COC) method, 237
Continuous speech recognition, 246
Conversational speech recognition, 246
Convolution, 387
Convolutional (multiplicative) distortion, 344
Corpus, 314
Corpus-based speech synthesis, 237
Cosh measure, 256
Covariance method, 87
CS-ACELP, 205
Customer (registered speaker), 354
CVC syllable, 228, 247
CV syllable, 228, 247

D

DARPA speech recognition projects, 323
Database for evaluation, 386
Decision criterion (threshold), 356
DECtalk system, 236
Deemphasis, 51
Delayed decision encoding, 173
Delayed feedback effect, 8
Deleted interpolation method, 316
Delta-cepstrum, 262, 363
Delta-delta-cepstrum, 263
Delta modulation (DM or ΔM), 149
Demisyllable, 229, 297
Depth-first method, 311
Detection-based approach, 344
Devocalization, 266
Diaphragm, 10
Differential coding, 148
Differential PCM (DPCM), 145, 148
Differential quantization, 149
Digital filter bank, 70
Digital processing of speech,
Digital signal processors (DSPs), 386
Digital-to-analog (D/A) conversion, 51
Digitization, 45
Diphone, 229, 247
Diphthong, 13
Discounting ratio, 317
Discourse, 264
Discrete cosine transform (DCT), 163
Discrete Fourier transform (DFT), 57, 163
Discriminant analysis, 364
Discriminative training, 293, 347
Distance (similarity) measure, 176, 249
  based on LPC, 252
  based on nonparametric spectral analysis, 251
Distance normalization, 364
Distinctive features, 20
Distortion rate function, 135
Divergence, 363
Double-SPLIT method, 278
Dual z-transform, 68
Duration, 230, 234, 264
Durbin's recursive solution method, 89, 105, 108
Dyad, 229, 247
Dynamic characteristics, 367
Dynamic programming (DP):
  matching, 266, 277, 297
    asymmetrical, 270
    staggered array, 249, 272
    symmetrical, 270
    unconstrained endpoint, 270
    variations in, 270
  method, 287
    CW (clockwise), 300
    O(n) (order n), 301
    OS (one-stage), 301
  path, 270
Dynamic spectral features (spectral transition), 262, 367, 378
Dynamic time warping (DTW), 260, 266

E

Ears, 7
EM algorithm, 290
Energy level, 248
Entropy, 322
  coding, 133
Equivalent vocabulary size, 322
Error:
  deletion, 323
  insertion, 323
  rate, 323
  substitution, 323
Euclidean distance, 250
Evaluation:
  factors for speech coding systems, 199
  methods, objective, 200
  methods, subjective, 200
  for speech processing technologies, 385

F

False acceptance (FA), 354
False rejection (FR), 354
Fast Fourier transform (FFT), 57, 251
Feedforward nets, 399
FFT cepstrum, 69
Filler, 305
  speech model, 303
Filter bank, 70
Fine structure, 64
Finite state VQ (FSVQ), 182
First-order differential processing, 114
Fixed prediction, 147
F1-F2 plane, 16
Formant, 14, 127
  bandwidth, 19
  frequency, 14, 39
    extraction, 71
Formant-type speech synthesis method, 224
Forward-backward algorithm, 282, 283
Forward and backward waves, 223
Forward prediction error, 102
Forward propagation wave, 33
Forward-type AP-DPCM, 153
Forward variable, 283
Fourier transform, 53
  pair (Wiener-Khintchine theorem), 54
Frame, 60
  interval, 60
  length, 60
F-ratio (inter- to intravariance ratio), 363
Frequency resolution, 60
Frequency spectrum, 52
Fricative, 10
Full search coding (FSC), 178
Fundamental equations, 35
Fundamental frequency (pitch), 10, 24, 79, 230, 351
Fundamental period, 10

G
Gaussian, 291
  mixture, 305
  mixture model (GMM), 325, 371
Generation rules (rewriting rules), 3 12 Glottal area, 42 Glottal source, 10 Glottal volume velocity, 42 Glottis, 10 Good-Turing estimation theory, 317 Grammar, 3 14 Granular noise, 150
H Hamming window, 58 Hanning window, 58 Hard limiters, 399 Harmonic plus noise model (HNM), 220 Harpy system, 31 1 Hat theory of intonation, 230 Hearing, 7 Hearsay I1 system, 3 10 Hidden layers, 399 Hidden Markov model (HMM), 278 coding, 184 composition, 344, 363 continuous, 279, 290 decomposition, 344 discrete, 279 ergodic, 279, 305 based method, 371 evaluation problem, 282 hidden state sequence hidden state sequence uncovering problem, 283 left-to-right, 279 linear predictive, 37 1
[Hidden Markov model (HMM)] mixture autoregressive (AR), 37 1 MMI training of, 292 MCE/GPD training of, 292, 335 problems, procedures, semicontinuous, 292 system for word recognition, 293 theory and implementation of, 278 three basic algorithms for, 282 tied mixture, 292 training problem, 283 Hidden nodes, 399 Hierarchy model, 308 High-emphasis filter, 102 Homomorphic analysis, 66 Homomorphic filtering, 66 Homomorphic prediction, 129 Huffman coding, 133 Human-computer dialog systems, 323 Human-computer interaction, 243 Hybrid coding, 135,187 I IBM, 325 Impostor, 354 Individual characteristics, 349, 351 Individual differences: acquired, 35 1 hereditary, 35 1 Individuality, 246
Information:
  rate distortion theory, 134, 177
  transmission theory, 313
Initial state distribution, 281
Input and output nodes, 399
Integer band sampling, 162
Intelligibility test, 200
Internal thresholds, 399
Interpolation characteristics, 126
Inter-session (temporal) variability, 360
Intonation, 7, 10
  component, basic, 230
Intraspeaker variation, 360, 364
Inverse filter, 85, 255
  first- or second-order critical damping, 361
Inverse filtering method, 93, 114
Irreversible coding, 133
Island-driven method, 311
Isolated word recognition, 246
Itakura-Saito distance (distortion), 254
J

Jaw, 9

K

Karhunen-Loeve transform (KLT), 163
Katz's backoff smoothing, 317
Kelly's speech synthesis (production) model, 37, 110
K-means algorithm (Lloyd's algorithm), 176, 394
K-nearest neighbor (KNN) method, 332
Knockout method, 363
Knowledge processing, advanced, 382
Knowledge source, 308, 382

L

Lag window, 252
Language model, 314, 344
Large-vocabulary continuous speech recognition, 306
Larynx, 9
Lattice, 248
  filter, 109
  diagram, 285
LBG algorithm (cluster-splitting method), 176, 395
LD-CELP, 205
Left-to-right method, 311
Level building (LB) method, 298
Lexicon, 306
Lifter, 77, 261
Liftering, 65
Likelihood, 248, 282
  normalization, 364
  ratio, 347, 363, 364
LIMSI, 324
Linear delta modulation (LDM), 149
Linearly separable equivalent circuit, 30, 64, 73, 85
Linear PCM, 142
Linear prediction, 2, 83, 145
Linear predictive coding (LPC), 2, 78
  analysis, 68, 83, 250, 252
    procedure, 86
  methods:
    code-excited, 138
    multi-pulse-excited, 138
[Linear predictive coding (LPC)]
  residual-excited, 138, 187
  speech-excited, 138, 187
  parameters, mutual relationships between, 127
  speech synthesizer, 228
Linear predictor:
  coefficients, 84
  filter, 84
Linear transformation, 335
  based on multiple regression analysis, 336
Line spectrum pair (LSP), 116
  analysis, 116
    principle of, 116
    solution of, 119
  parameters, 121
    coding of, 126
  synthesis filter, 122
Linguistic constraints, 246
Linguistic information, 5, 243
Linguistic knowledge, 246
Linguistic science, new, 383
Linguistic units, 381
Lip rounding, 12
Lips, 9
Lloyd's algorithm (K-means algorithm), 176, 394
Local decoder, 145
Locus theory, 229
Log likelihood ratio distance, 255
Log PCM, 142
Lombard effect, 341
Long-term (pitch) prediction, 148, 153
Long-term averaged speech spectrum (LAS), 23, 370
Long-term-statistics-based method, 368
Loss:
  heat conduction, 32
  leaky, 32
  viscous, 32
Loudness, 230
LPC:
  cepstral coefficients, 257
  cepstral distance, 257
  cepstrum, 69
  correlation coefficients, 260
  correlation function, 127
LSI for speech processing use, 386
Lungs, 89
M

Markov:
  chains, 279
  sources, 279
Mass conservation equation, 32
Matched filter principle, 197
Matrix quantization (MQ), 138, 182, 337
Maximum a posteriori (MAP), 330
  decoding rule, 314
  estimates, 335
  probability, 313
Maximum likelihood (ML):
  estimation, 293
  method, 70, 254
  spectral distance, 254
  spectral estimation, 89
    formulation of, 89
    physical meaning of, 93
MDL (Minimum Description Length) criterion, 325
Mean opinion score (MOS), 200
Mel frequency cepstral coefficient (MFCC), 252
Mel-scale frequency axis, 251
Mimicked voice, 352
Minimum phase impulse response, 77
Minimum residual energy, 256
Mismatches:
  acoustic, 341
  linguistic, 341
MITalk-79 system, 234
Mixed excitation LPC (MELP), 196
Mixture, 290
M-L method, 173
MLLR (maximum likelihood linear regression) method, 325, 330
Models, 244
Modified autocorrelation function, 14, 98, 107
Modified correlation method, 79
Momentum equation, 32
Morph, 234
Morphemes, 317
Morphological analysis, 317
μ-law, 142
Multiband excitation (MBE), 196
Multilayer perceptrons, 399
Multipath search coding, 173
Multiple regression analysis, 336
Multi-pulse-excited LPC (MPC), 189
Multistage processing, 178
Multistage VQ, 179
Multitemplate method, 332
Multivariate autoregression (MAR), 370
Mutual information, 292
N

N-best:
  based adaptation, 339
  hypotheses, 339
  results, 312
N-gram language model, 316
Nasal, 11
  cavity, 9
Nasalization, 11
Nasalized vowel, 11
Nearest-neighbor selection rule, 394
Network model, 310
Neural net, 399
Neutral vowel, 13
Neyman-Pearson:
  hypothesis testing formulation, 305
  lemma, 347
Noise:
  additive, 341
  shaping, 138, 156
  source, 44
  threshold, 135
Nonlinear quantization, 138
Nonlinear warping of the spectrum, 335
Nonparametric analysis (NPA), 52
Nonuniform sampling, 266
Nonspeech sounds, 249
Normal equation, 89
Normalized residual energy, 256
Nyquist rate, 47

O
Objective evaluation, 200
Observation probability, 281
  distribution, 281
Opinion-equivalent SNR (SNRq), 200
Opinion tests, 200
Optimal (minimum-distortion) quantizer, 394
Oral cavity, 9
Orthogonal polynomial representation, 367
Out-of-vocabulary, 305, 344

P

Pair comparison (A-B test), 200
Parallel connection, 225
Parallel model combination (PMC), 344, 363
Parametric analysis (PA), 52
PARCOR (partial autocorrelation):
  analysis, 102
    formulation of, 102
  analysis-synthesis system, 110
  coefficient, 102
    extraction process, 89
  and LPC coefficients, relationship between, 108
  synthesis filter, 109
Partial correlator, 107
Peak factor, 21
Peak-weighted distance, 258
Perceiving dynamic signals, 385
Perceptually-based weighting, 192
Perceptual units, 381
Periodogram, 92
Perplexity, 322
  log, 322
  test-set, 322
Pharynx, 9
Phase equalization, 195
Phone, 6
Phoneme, 6, 247
  reference template, 275
Phoneme-based algorithm, 247
Phoneme-based system, 229
Phoneme-based word recognition, 275
Phoneme-like templates, 277
Phoneme context, 238
Phonemic symbol, 6
Phonetic decision tree, 320
Phonetic information, 246
Phonetic invariants, 331
Phonetic symbol, 6
Phonocode method, 184
Phrase component, 230
Physical units, 382
Pitch, 10, 264
  error, double-, 79
  error, half-, 79
  extraction, 78
    by correlation processing, 79
    by spectral processing, 79
    by waveform processing, 79
Pitch-synchronous waveform concatenation, 220
Pitch (long-term) prediction, 148, 153
π-type four-terminal circuits, 223
Plosive, 10
Pole-zero analysis, 127
  by maximum likelihood estimation, 130
Polynomial coefficients, 367
Polynomial expansion coefficients, lower order, 262
Positive definiteness, 250
Postfilter, adaptive noise-shaping, 158
Postfiltering, 158
Pragmatics, 264, 308
Preemphasis, 51
Predicate logic, 312
Prediction, 145
  error, 102
  operators, forward and backward, 106
  gain, 147
  residual, 141, 145, 256
Predictive coding, 141, 143
Procedural knowledge representation, 312
Production:
  model, 383
  system, 312
Progressing wave model, 32
Prosodic features, 379
  control of, 230
Prosodics, 308
Prosody, 264
Pseudophoneme, 277
PSI-CELP, 205
Pulse code modulation (PCM), 138, 141
Pulse generator, 27
Q

Quadrature mirror filter (QMF), 162
Quantization, 47
  distortion, 49, 177
  error, 49
  noise, 49
  step size, 47
Quantizing, 45
Quefrency, 64
Quefrency-weighted cepstral distance measure, 262
R

Radiation, 9, 27
Random learning, 176
Rate distortion function, 135
Receiver operating characteristic (ROC) curve, 354
Recognition:
  speaker, 349
  speech, 243
Rectangular window, 58
Reduction, 245
Reference template, 244, 264
Reflection coefficient, 35, 111, 223
Registered speaker (customer), 354
Regression coefficients, 262
Residual:
  energy, 255
  error, 84
  signal, 99, 107
Residual-excited LPC vocoder (RELP), 187
Resonance (formant), 30
  characteristics, 12
  circuit, 27, 224
  model, 38
Reversible coding, 133
Rewriting rules (generation rules), 312
Robust algorithms, 339
Robust and flexible speech coding, 211
S

Sampling, 45, 46
  frequency, 46
  period, 46
Scalar quantization, 177
Search:
  one-pass, 320
  multi-pass, 320
Segment quantization, 138
Segmental k-means training procedure, 295
Segmental SNR (SNRseg), 201
Segmentation, 245
Selective listening, 8
Semantic class, 312
Semantic information, 312
Semantic markers, 312
Semantic net, 312
Semantics, 264, 308
Semivowel, 11
Sentence, 6
  hypothesis, 248
Shannon-Fano coding, 133
Shannon's information source coding theory, 133
Shannon-Someya's sampling theorem, 46
Sheep and goats phenomenon, 334, 379
Short-term (spectral envelope) prediction, 148
Short-term spectrum, 52
Side information, 143, 156
Sigmoidal nonlinearities, 399
Signal-to-amplitude-correlated noise ratio, 200
Signal-to-quantization noise ratio (SNR), 50
  of a PCM signal, 142
Similarity matrix, 277
Similarity (distance) measure, 249
Simplified inverse filter tracking (SIFT) algorithm, 79
Single-path search coding, 175
Sinusoidal transform coder (STC), 196
Slope:
  constraint, 270
  overload distortion, 149
Smaller-than-word units, 248
Soft palate (velum), 10
Sound:
  pressure, 33
  source model, 383
  production, 27
  spectrogram (voice print), 14, 60, 70, 349
  spectrograph, 60
Source, 30
  generation, 9
  parameter, 78
    estimation, 98
    from residual signals, 98
Speaker:
  adaptation, 331, 335
    unsupervised, 336
  cluster selection, 335
  identification, 352
  normalization, 331, 334
  recognition, 349
    algorithms, text-independent, 380
    human and computer, 349
    methods, 352
    principles of, 349
    systems:
      examples of, 366
      structure of, 354
      text-dependent, 366
      text-independent, 368
      text-prompted, 373
  text-dependent, 352
  text-independent, 352
[Speaker:]
  text-prompted, 353
  verification, 352
Special-purpose LSIs, 386
Spectral analysis, 52
Spectral clustering, hierarchical, 337
Spectral distance measure, 249
Spectral distortion, 126
Spectral envelope, 52, 64, 351
  prediction, 148
Spectral equalization, 114, 361
Spectral equalizer, 102
Spectral fine structure, 52
Spectral parameters, statistical features of, 362
Spectral mapping, 335
Spectral similarity, 249
Speech:
  acoustic characteristics of, 14
  analysis-synthesis system by LPC, 99
  chain, 8
  coding, 133
    principal techniques for, 133
    voice dependency in, 380
  communication, 1
  corpus, 237
  database, 237
  information processing
    future directions of, 375
    technologies, 375
  perception mechanism, clarification of, 384
  period detection, 248
  principal characteristics of, 5
  processing
    basic units for, 381
    technologies, evaluation methods for, 385
[Speech:]
  production, 5, 27, 383
    mechanism, 9
      clarification of, 383
  ratio, 26
  recognition, 243
    advantages of, 243
    based method, 371
    classification of, 246
    continuous, 245
    conversational, 246
    difficulties in, 245
    principles of, 243
    speaker-adaptive, 330
    speaker-dependent, 246
    speaker-independent, 246, 330
  spectral structure of, 52
  statistical characteristics of, 20
  synthesis, 213
    based on analysis-synthesis method, 216, 221
    based on speech production mechanism, 222
    based on waveform coding, 216, 217
    by HMM, 222
    principles of, 213
  synthesizer
    by J. Q. Stewart, 216
    by von Kempelen, 214
  understanding, 246
SPLIT method, 277, 333
Spoken language, 385
Spontaneous speech recognition, 344
Stability, 101, 107, 121, 391
Stack algorithm, 311
Standardization of speech coding methods, 199, 203
State transition probability, 281
  distribution, 281
State-tying, 320
Stationary Gaussian process, 90
Statistical characteristics, 351
Statistical features, 359
Statistical language modeling, 312, 314
Stochastically excited LPC, 193
Stop consonant, 10
Stress, 7, 264
Sturm-Liouville derivative equation, 38
Subband coding (SBC), 143, 159
Subglottal air pressure, 10
Subjective evaluation, 200
Subword units, 248, 264
Supra-segmental attributes, 264
Syllable, 6
Symmetry, 250
Syntactic information, 312
Syntax, 264, 308
Synthesis by rule, 216, 226
  principles of, 226
Synthesized voice quality, 380
T

Talker recognition, 349
Task evaluation, 385
Technique evaluation, 386
Telephone, 1
Templates, 176
Temporal characteristics, 351
Temporal (inter-session) variability, 360, 381
Terminal analog method, 222, 224
Text-to-speech conversion, 231, 234
Threshold logic elements, 399
Tied-mixture models, 318
Tied-state Gaussian-mixture triphone models, 320
Time:
  and frequency division, 141
  resolution, 60
  warping function, 267
Time-averaged spectrum, 361
Time-domain harmonic scaling (TDHS) algorithm, 168
Time domain pitch synchronous overlap add (TD-PSOLA) method, 220
Toeplitz matrix, 89
Tokyo Institute of Technology, 328
Tongue, 9
Top-down, 248, 308
Trachea, 9
Training mechanism, 331
Transcription, 243, 246, 323
Transform coding, 141
Transitional cepstral coefficient, 252
Transitional cepstral distance, 262
Transitional distance measure, 263
Transitional features, 378
Transitional logarithmic energy, 263
Tree coding:
  variable rate (VTRC), 196
Tree search, 178, 311
  coding, 173
Tree-trellis algorithm, 312
Trellis:
  coding, 173, 184
  diagram, 285
Trigram, 316
Triphone, 318
Two-level DP matching, 295
Two-mass model, 40

U

Unigram, 316
Units of reference templates/models, 247
Universal coding, 134
Unsupervised (online) adaptation, 331
Unvoiced consonant, 11
Unvoiced sound, 11

V

Variable length coding, 133
VCV syllable, 247
VCV units, 228
Vector PCM (VPCM), 176
Vector quantization (VQ), 141, 173, 278, 279
  algorithm, 393
  based method, 370
  based word recognition, 337
  codebook, 337, 370
  for linear predictor parameters, 180
  principles of, 175
Vector-scalar quantization, 179
Velum (soft palate), 10
VFS (vector-field smoothing), 330
Visual units, 382
Viterbi algorithm, 282, 286,
Vocal cord, 10
  model, 40
  spectrum, 334, 363
  vibration waveform, 42
Vocal organ, 7
Vocal tract, 9
  analog method, 222, 223
  area, estimation based on PARCOR analysis, 110
  characteristics, 363
  length, 334
  transmission function, 38
  model, 32
Vocal vibration, 10
Vocoder, 73
  baseband, 187
  channel, 76
  correlation, 77
  formant, 77
  homomorphic, 77
  linear predictive, 78
  LSP, 78
  maximum likelihood, 78
  PARCOR, 78
  pattern matching, 77
  voice-excited, 187
Voder by H. Dudley, 216
Voiced consonant, 11
Voiced sound, 11
Voiced/unvoiced decision, 77, 81, 249
Voice-excited LPC vocoder (VELP), 187
Voice individuality, extraction and normalization of, 379
Voice print, 349
Volume velocity, 33
Vowel, 6, 10
  triangle, 16
VQ-based preprocessor, 333
VQ-based word recognition, 337
VSELP, 205

W

Waveform coding, 135
Waveform interpolation (WI), 196
Waveform-based method, 228
Webster's horn equation, 38
Weighted cepstral distance, 260, 370
Weighted distances based on auditory sensitivity, 250
Weighted likelihood ratio (WLR), 258
Weighted slope metric, 262
Whispering, 11
White noise generator, 27
Wiener-Khintchine theorem, 54
Window function, 57
Word, 6, 247
  dictionary, 264
  lattice, 320
  model, 264
  recognition, 247
    systems, structure of, 264
    using phoneme units, 275
  spotting, 249, 303
  template, 264
World model, 366

Y

Yule-Walker equation, 89

Z

Zero-crossing:
  analysis, 70
  number, 248
  rate, 71
Zero-phase impulse response, 77
Z-transform, 68, 387, 388