The VLSI handbook

Library of Congress Cataloging-in-Publication Data The VLSI handbook / edited by Wai-Kai Chen. p. cm. Includes bibliographical references and index. ISBN 0-8493-8593-8 (alk. paper) 1. Integrated circuits--Very large scale integration.

I. Chen, Wai-Kai, 1936–

TK7874.75.V573 1999 621.39’5—dc21

99-047682 CIP

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage or retrieval system, without prior permission in writing from the publisher.

All rights reserved. Authorization to photocopy items for internal or personal use, or the personal or internal use of specific clients, may be granted by CRC Press LLC, provided that $.50 per page photocopied is paid directly to Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA. The fee code for users of the Transactional Reporting Service is ISBN 0-8493-8593-8/00/$0.00+$.50. The fee is subject to change without notice. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

The consent of CRC Press LLC does not extend to copying for general distribution, for promotion, for creating new works, or for resale. Specific permission must be obtained in writing from CRC Press LLC for such copying.

Direct all inquiries to CRC Press LLC, 2000 Corporate Blvd., N.W., Boca Raton, Florida 33431.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation, without intent to infringe.

© 2000 by CRC Press LLC
No claim to original U.S. Government works
International Standard Book Number 0-8493-8593-8
Library of Congress Card Number 99-047682
Printed in the United States of America 1 2 3 4 5 6 7 8 9 0
Printed on acid-free paper

Preface

Purpose

The purpose of The VLSI Handbook is to provide in a single volume a comprehensive reference work covering the broad spectrum of VLSI technology. It is written and developed for the practicing electrical and computer engineers in industry, government, and academia. The goal is to provide the most up-to-date information in IC technology, devices and their models, circuit simulations, amplifiers, logic design, memory, registers and system timing, microprocessor and ASIC, test and testability, design automation, and design languages. The handbook is not an all-encompassing digest of everything taught within an electrical and computer engineering curriculum on VLSI technology. Rather, it is the engineer's first choice in looking for a solution. Therefore, full references to other sources of contributions are provided. The ideal reader is a B.S.-level engineer with a need for a one-source reference to keep abreast of new techniques and procedures as well as review standard practices.

Background

The handbook stresses the fundamental theory behind professional applications. In order to do so, it is reinforced with frequent examples. Extensive development of theory and details of proofs have been omitted. The reader is assumed to have a certain degree of sophistication and experience. However, brief reviews of theories, principles and mathematics of some subject areas are given. These reviews have been done concisely with perception. The handbook is not a textbook replacement, but rather a reinforcement and reminder of material learned as a student. Therefore, important advancements and traditional as well as innovative practices are included. Since the majority of professional electrical engineers graduated before powerful personal computers were widely available, many computational and design methods may be new to them. Therefore, computer and software use is thoroughly covered. Not only does the handbook use traditional references to cite sources for the contributions, it also contains all relevant sources of information and tools that would assist the engineer in performing his/her job. This may include sources of software, databases, standards, seminars, conferences, etc.

Organization

Over the years, the fundamentals of VLSI technology have evolved to include a wide range of topics and a broad range of practice. To encompass such a wide range of knowledge, the handbook focuses on the key concepts, models, and equations that enable the electrical or computer engineer to analyze, design and predict the behavior of very large-scale integrated circuits. While design formulas and tables are listed, emphasis is placed on the key concepts and theories underlying the applications.

The information is organized into 13 major sections, which encompass the field of VLSI technology. Each section is divided into chapters, each of which was written by a leading expert in the field to enlighten and refresh the knowledge of the mature engineer, and to educate the novice. Each chapter contains introductory material, leading to the appropriate applications, and references. The references provide a list of useful books and articles for further reading.

Locating Your Topic

Numerous avenues of access to information contained in the handbook are provided. A complete table of contents is presented at the front of the book. In addition, an individual table of contents precedes each of the thirteen sections. Finally, each chapter begins with its own table of contents. The reader is urged to look over these tables of contents to become familiar with the structure, organization, and content of the book. For example, see Section VIII: Microprocessor and ASIC, then Chapter 61: Microprocessor Design Verification, and then Section 61.8: Emulation. This tree-like structure enables the reader to move up the tree to locate information on the topic of interest. A combined subject and author index has been compiled to provide a means of accessing information. It can also be used to locate definitions; the page on which the definition appears for each key defining term is given in this index. The VLSI Handbook is designed to provide answers to most inquiries and direct inquirers to further sources and references. We trust that it will meet your needs.

Acknowledgments

The compilation of this book would not have been possible without the dedication and efforts of the Editorial Board of Advisors, the section editors, the publishers, and most of all the contributing authors. I wish to thank them all and also my wife, Shiao-Ling, for her patience and understanding.

Wai-Kai Chen Editor-in-Chief

Editor-in-Chief

Wai-Kai Chen, Professor and Head of the Department of Electrical Engineering and Computer Science at the University of Illinois at Chicago, teaches graduate and undergraduate courses in electrical engineering in the fields of circuits and systems. He received his B.S. and M.S. in electrical engineering at Ohio University, where he was later recognized as a Distinguished Professor. He earned his Ph.D. in electrical engineering at the University of Illinois at Urbana-Champaign. Professor Chen has extensive experience in education and industry and is very active professionally in the fields of circuits and systems. He has served as a visiting professor at Purdue University and the University of Hawaii at Manoa. He was Editor of the IEEE Transactions on Circuits and Systems, both Series I and II, and President of the IEEE Circuits and Systems Society. Currently, he is Editor-in-Chief of the Journal of Circuits, Systems and Computers and Editor of the Advanced Series in Electrical and Computer Engineering, Imperial College Press. He received the Lester R. Ford Award from the Mathematical Association of America, the Alexander von Humboldt Award from Germany, the Ohio University Alumni Medal of Merit for Distinguished Achievement in Engineering Education, the Senior University Scholar Award from the University of Illinois at Chicago, the Distinguished Alumnus Award from the University of Illinois at Urbana-Champaign, and the Society Meritorious Service Award and the Education Award from the IEEE Circuits and Systems Society. He also received more than a dozen honorary professor awards from major institutions in China.
A Fellow of the Institute of Electrical and Electronics Engineers and the American Association for the Advancement of Science, Professor Chen is widely known in the profession for his Applied Graph Theory (North-Holland), Theory and Design of Broadband Matching Networks (Pergamon Press), Active Network and Feedback Amplifier Theory (McGraw-Hill), Linear Networks and Systems (Brooks/Cole), Passive and Active Filters: Theory and Implementations (John Wiley), Theory of Nets (John Wiley), and The Circuits and Filters Handbook (Editor-in-Chief, CRC Press).

Advisory Board

Professor Steve M. Kang Department of Electrical and Computer Engineering University of Illinois at Urbana-Champaign Urbana, Illinois

Professor Saburo Muroga Computer Science Department University of Illinois at Urbana-Champaign Urbana, Illinois

Dr. Bing J. Sheu Avant! Corporation Fremont, California

Contributors

Ramachandra Achar Carleton University Ottawa, Ontario, Canada

James H. Aylor University of Virginia Charlottesville, Virginia

R. Jakob Baker University of Idaho Boise, Idaho

David L. Barton Intermetrics, Inc. Vienna, Virginia

Andrea Baschirotto Università di Pavia Pavia, Italy

Charles R. Baugh C. R. Baugh and Associates Bellevue, Washington

J. Bhasker Cadence Design Systems Allentown, Pennsylvania

David Blaauw Motorola, Inc. Austin, Texas

Marc Borremans Katholieke Universiteit Leuven Leuven-Heverlee, Belgium

Victor Boyadzhyan Jet Propulsion Laboratory Pasadena, California

Charles E. Chang Conexant Systems, Inc. Newbury Park, California

Wai-Kai Chen University of Illinois Chicago, Illinois

Kuo-Hsing Cheng Tamkang University Tamsui, Taipei Hsien, Taiwan

John Choma, Jr. University of Southern California Los Angeles, California

Amy Hsiu-Fen Chou National Tsing-Hua University Hsin-Chu, Taiwan

Moon Jung Chung Michigan State University East Lansing, Michigan

David J. Comer Brigham Young University Provo, Utah

Donald T. Comer Brigham Young University Provo, Utah

Daniel A. Connors University of Illinois Urbana, Illinois

Donald Cottrell Silicon Integration Initiative, Inc. (Si2, Inc.) Austin, Texas

John D. Cressler Auburn University Auburn, Alabama

Sorin Cristoloveanu Institut National Polytechnique de Grenoble Grenoble, France

Bram De Muer Katholieke Universiteit Leuven Leuven-Heverlee, Belgium

Geert A. De Veirman Silicon Systems, Inc. Tustin, California

Maria del Mar Hershenson Stanford University Stanford, California

Allen M. Dewey Duke University Durham, North Carolina

Abhijit Dharchoudhury Motorola, Inc. Austin, Texas

Donald B. Estreich Hewlett-Packard Company Santa Rosa, California

John W. Fattaruso Texas Instruments, Incorporated Dallas, Texas

Eby G. Friedman University of Rochester Rochester, New York

Thad Gabara Lucent Technologies Murray Hill, New Jersey

Stantanu Ganguly Intel Corp. Austin, Texas

Yosef Gavriel Virginia Polytechnic Institute and State University Blacksburg, Virginia

Jan V. Grahn Royal Institute of Technology Kista-Stockholm, Sweden

Rajesh K. Gupta University of California Irvine, California

Sumit Gupta University of California Irvine, California

Peter J. Hesketh The Georgia Institute of Technology Atlanta, Georgia

Karl Hess University of Illinois Urbana, Illinois

Charles Ching-Hsiang Hsu National Tsing-Hua University Hsinchu, Taiwan

Jen-Sheng Hwang National Science Council Hsinchu, Taiwan

Wen-mei Hwu University of Illinois Urbana, Illinois

Kazumi Inoh Toshiba Corporation Isogo-ku, Yokohama, Japan

Hidemi Ishiuchi Toshiba Corporation Isogo-ku, Yokohama, Japan

Mohammed Ismail The Ohio State University Columbus, Ohio

Hiroshi Iwai Toshiba Corporation Isogo-ku, Yokohama, Japan

Vikram Iyengar University of Illinois Urbana, Illinois

Johan Janssens Katholieke Universiteit Leuven Leuven-Heverlee, Belgium

Dimitri Kagaris Southern Illinois University Carbondale, Illinois

Steve M. Kang University of Illinois Urbana, Illinois

Nick Kanopoulos Atmel, Multimedia and Communications Morrisville, North Carolina

Tanay Karnik Intel Corporation Hillsboro, Oregon

Yasuhiro Katsumata Toshiba Corporation Isogo-ku, Yokohama, Japan

Pankaj Khandelwal University of Illinois Chicago, Illinois

John M. Khoury Lucent Technologies Murray Hill, New Jersey

Heechul Kim Hankuk University of Foreign Studies Yongin, Kyung Ki-Do, Korea

Hideki Kimijima Toshiba Corporation Isogo-ku, Yokohama, Japan

Isik C. Kizilyalli Lucent Bell Laboratories Orlando, Florida

Robert H. Klenke Virginia Commonwealth University Richmond, Virginia

Ivan S. Kourtev University of Pittsburgh Pittsburgh, Pennsylvania

Thomas H. Lee Stanford University Stanford, California

Harry W. Li University of Idaho Moscow, Idaho

Chi-Hung Lin The Ohio State University Columbus, Ohio

Frank Ruei-Ling Lin National Tsing-Hua University Hsin-Chu, Taiwan

John W. Lockwood Washington University St. Louis, Missouri

Stephen I. Long University of California Santa Barbara, California

Flavio Lorenzelli ST Microelectronics, Inc. San Diego, California

Ashraf Lotfi Lucent Technologies Murray Hill, New Jersey

Joseph W. Lyding University of Illinois Urbana, Illinois

Martin Margala University of Alberta Edmonton, Alberta, Canada

Samuel S. Martin Lucent Technologies Murray Hill, New Jersey

Erik A. McShane University of Illinois Chicago, Illinois

Shin-ichi Minato NTT Network Innovation Laboratories Yokosuka-shi, Japan

Sunderarajan S. Mohan Stanford University Stanford, California

Hisayo S. Momose Toshiba Corporation Isogo-ku, Yokohama, Japan

Eiji Morifuji Toshiba Corporation Isogo-ku, Yokohama, Japan

Toyota Morimoto Toshiba Corporation Isogo-ku, Yokohama, Japan

Saburo Muroga University of Illinois Urbana, Illinois

Akio Nakagawa Toshiba Corporation Saiwai-ku, Kawasaki, Japan

Yuichi Nakamura NEC Corporation Miyamae-ku, Kawasaki, Japan

Michel S. Nakhla Carleton University Ottawa, Ontario, Canada

Zainalabedin Navabi Northeastern University Boston, Massachusetts

Philip G. Neudeck NASA Glenn Research Center Cleveland, Ohio

Kwok Ng Lucent Technologies Murray Hill, New Jersey

Hideaki Nii Toshiba Corporation Isogo-ku, Yokohama, Japan

Tatsuya Ohguro Toshiba Corporation Isogo-ku, Yokohama, Japan

Mikael Östling Royal Institute of Technology Kista-Stockholm, Sweden

Alice C. Parker University of Southern California Los Angeles, California

Alison Payne Imperial College University of London London, England

Massoud Pedram University of Southern California Los Angeles, California

J. Gregory Rollins Antrim Design Systems Scotts Valley, California

Elizabeth M. Rudnick University of Illinois Urbana, Illinois

Kirad Samavati Stanford University Stanford, California

Rolf Schaumann Portland State University Portland, Oregon

Rick Shih-Jye Shen National Tsing-Hua University Hsinchu, Taiwan

Krishna Shenai University of Illinois Chicago, Illinois

Bing J. Sheu Avant! Corporation Los Angeles, California

Bang-Sup Song University of California La Jolla, California

Michiel Steyaert Katholieke Universiteit Leuven Leuven-Heverlee, Belgium

Haruyuki Tago Toshiba Semiconductor Company Saiwai-ku, Kawasaki-shi, Japan

Naofumi Takagi Nagoya University Nagoya, Japan

Donald C. Thelen Analog Interfaces Bozeman, Montana

Chris Toumazou Imperial College University of London London, England

Spyros Tragoudas Southern Illinois University Carbondale, Illinois

Yuh-Kuang Tseng Industrial Research and Technology Institute Chutung, Hsinchu, Taiwan

Meera Venkataraman Troika Networks, Inc. Calabasas Hills, California

Suhrid A. Wadekar IBM Corp. Hopewell Junction, New York

Chorng-kuang Wang National Taiwan University Taipei, Taiwan

R. F. Wassenaar University of Twente Enschede, The Netherlands

Louis A. Williams III Texas Instruments, Incorporated Dallas, Texas

Wayne Wolf Princeton University Princeton, New Jersey

Chung-Yu Wu National Chiao Tung University Hsinchu, Taiwan

Evans Ching-Song Yang National Tsing-Hua University Hsinchu, Taiwan

Kazuo Yano Hitachi Ltd. Kokubunji, Tokyo, Japan

Kung Yao University of California Los Angeles, California

Ko Yoshikawa NEC Corporation Fuchu, Tokyo, Japan

Kuniyoshi Yoshikawa Toshiba Corporation Isogo-ku, Yokohama, Japan

Takashi Yoshitomi Toshiba Corporation Isogo-ku, Yokohama, Japan

Min-shueh Yuan National Taiwan University Taipei, Taiwan

C. Patrick Yue Stanford University Stanford, California

Contents

SECTION I VLSI Technology

1 VLSI Technology: A System Perspective
Krishna Shenai and Erik A. McShane
1.1 Introduction
1.2 Contemporary VLSI Systems
1.3 Emerging VLSI Systems
1.4 Alternative Technologies

2 CMOS/BiCMOS Technology
Yasuhiro Katsumata, Tatsuya Ohguro, Kazumi Inoh, Eiji Morifuji, Takashi Yoshitomi, Hideki Kimijima, Hideaki Nii, Toyota Morimoto, Hisayo S. Momose, Kuniyoshi Yoshikawa, Hidemi Ishiuchi, and Hiroshi Iwai
2.1 Introduction
2.2 CMOS Technology
2.3 BiCMOS Technology
2.4 Future Technology
2.5 Summary

3 Bipolar Technology
Jan V. Grahn and Mikael Östling
3.1 Introduction
3.2 Bipolar Process Design
3.3 Conventional Bipolar Technology
3.4 High-Performance Bipolar Technology
3.5 Advanced Bipolar Technology

4 Silicon on Insulator Technology
Sorin Cristoloveanu
4.1 Introduction
4.2 Fabrication of SOI Wafers
4.3 Generic Advantages of SOI
4.4 SOI Devices
4.5 Fully-Depleted SOI Transistors
4.6 Partially Depleted SOI Transistors
4.7 Short-Channel Effects
4.8 SOI Challenges
4.9 Conclusion

5 SiGe Technology
John D. Cressler
5.1 Introduction
5.2 SiGe Strained Layer Epitaxy
5.3 The SiGe Heterojunction Bipolar Transistor (HBT)
5.4 The SiGe Heterojunction Field Effect Transistor (HFET)
5.5 Future Directions

6 SiC Technology
Philip G. Neudeck
6.1 Introduction
6.2 Fundamental SiC Material Properties
6.3 Applications and Benefits of SiC Electronics
6.4 SiC Semiconductor Crystal Growth
6.5 SiC Device Fundamentals
6.6 SiC Electronic Devices and Circuits
6.7 Further Recommended Reading

7 Passive Components
Ashraf Lotfi
7.1 Magnetic Components
7.2 Air Core Inductors
7.3 Resistors
7.4 Capacitors

8 Power IC Technologies
Akio Nakagawa
8.1 Introduction
8.2 Intelligent Power ICs
8.3 High-Voltage Technology
8.4 High-Voltage Metal Interconnection
8.5 High-Voltage SOI Technology
8.6 High-Voltage Output Devices
8.7 Sense and Protection Circuit
8.8 Examples of High-Voltage SOI Power ICs with LIGBT Outputs
8.9 SOI Power ICs for System Integration
8.10 High-Temperature Operation of SOI Power ICs

9 Noise in VLSI Technologies
Samuel S. Martin, Thad Gabara, and Kwok Ng
9.1 Introduction
9.2 Microscopic Noise
9.3 Device Noise
9.4 Chip Noise
9.5 Future Trends
9.6 Conclusions

10 Micromachining
Peter J. Hesketh
10.1 Introduction
10.2 Micromachining Processes
10.3 Bulk Micromachining of Silicon
10.4 Surface Micromachining
10.5 Advanced Processing
10.6 CMOS and MEMS Fabrication Process Integration
10.7 Wafer Bonding
10.8 Optical MEMS
10.9 Actuators for MEMS Optics
10.10 Electronics
10.11 Chemical Sensors

11 Microelectronics Packaging
Pankaj Khandelwal and Krishna Shenai
11.1 Introduction
11.2 Packaging Hierarchy
11.3 Package Parameters
11.4 Packaging Substrates
11.5 Package Types
11.6 Hermetic Packages
11.7 Die Attachment Techniques
11.8 Package Parasitics
11.9 Package Modeling
11.10 Packaging in Wireless Applications
11.11 Future Trends

12 Multichip Module Technologies
Victor Boyadzhyan and John Choma, Jr.
12.1 Introduction
12.2 Multi-Chip Module Technologies
12.3 Materials for HTCC Aluminum Packages
12.4 LTCC Substrates
12.5 Aluminum Nitride
12.6 Materials for Multi-layered AlN Packages
12.7 Thin Film Dielectrics
12.8 Carrier Substrates
12.9 Conductor Metallization
12.10 Choosing Substrate Technologies and Assembly Techniques
12.11 Assembly Techniques
12.12 Summary

13 Channel Hot Electron Degradation-Delay in MOS Transistors Due to Deuterium Anneal
Isik C. Kizilyalli, Karl Hess, and Joseph W. Lyding
13.1 Introduction
13.2 Post-Metal Forming Gas Anneals in Integrated Circuits
13.3 Impact of Hot Electron Effects on CMOS Development
13.4 The Hydrogen/Deuterium Isotope Effect and CMOS Manufacturing
13.5 Summary

SECTION II Devices and Their Models

14 Bipolar Junction Transistor (BJT) Circuits
David J. Comer and Donald T. Comer
14.1 Introduction
14.2 Physical Characteristics and Properties of the BJT
14.3 Basic Operation of the BJT
14.4 Use of the BJT as an Amplifier
14.5 Representing the Major BJT Effects by an Electronic Model
14.6 Other Physical Effects in the BJT
14.7 More Accurate BJT Models
14.8 Heterojunction Bipolar Junction Transistors
14.9 Integrated Circuit Biasing Using Current Mirrors
14.10 The Basic BJT Switch
14.11 High-Speed BJT Switching
14.12 Simple Logic Gates
14.13 Emitter-Coupled Logic

15 RF Passive IC Components
Thomas H. Lee, Maria del Mar Hershenson, Sunderarajan S. Mohan, Kirad Samavati, and C. Patrick Yue
15.1 Introduction
15.2 Fractal Capacitors
15.3 Spiral Inductors
15.4 On-Chip Transformers

SECTION III Circuit Simulations

16 Analog Circuit Simulation
J. Gregory Rollins
16.1 Introduction
16.2 Purpose of Simulation
16.3 Netlists
16.4 Formulation of the Circuit Equations
16.5 Modified Nodal Analysis
16.6 Active Device Models
16.7 Types of Analysis
16.8 Verilog-A
16.9 Fast Simulation Methods
16.10 Commercially Available Simulators

17 Interconnect Modeling and Simulation
Michel S. Nakhla and Ramachandra Achar
17.1 Introduction
17.2 Interconnect Models
17.3 Distributed Transmission Line Equations
17.4 Interconnect Simulation Issues
17.5 Interconnect Simulation Techniques

18 Power Simulation and Estimation in VLSI Circuits
Massoud Pedram
18.1 Introduction
18.2 Software-Level Power Estimation
18.3 Behavioral-Level Power Estimation
18.4 RT-Level Power Estimation
18.5 Gate-Level Power Estimation
18.6 Transistor-Level Power Estimation
18.7 Conclusion

SECTION IV Amplifiers

19 CMOS Amplifier Design
Harry W. Li, R. Jakob Baker, and Donald C. Thelen
19.1 Introduction
19.2 Biasing Circuits
19.3 Amplifiers

20 Bipolar Amplifier Design
Geert A. De Veirman
20.1 Introduction
20.2 Single-Transistor Amplifiers
20.3 Differential Amplifiers
20.4 Output Stages
20.5 Bias Reference
20.6 Operational Amplifiers
20.7 Conclusion

21 High-Frequency Amplifiers
Chris Toumazou and Alison Payne
21.1 Introduction
21.2 The Current Feedback Op-Amp
21.3 RF Low-Noise Amplifiers
21.4 Optical Low-Noise Preamplifiers
21.5 Fundamentals of RF Power Amplifier Design
21.6 Applications of High-Q Resonators in IF-Sampling Receiver Architectures
21.7 Log-Domain Processing

22 Operational Transconductance Amplifiers
R. F. Wassenaar, Mohammed Ismail, and Chi-Hung Lin
22.1 Introduction
22.2 Noise Behavior of the OTA
22.3 An OTA with an Improved Output Swing
22.4 OTAs with High Drive Capability
22.5 Common-Mode Feedback
22.6 Filter Applications with Low-Voltage OTAs

SECTION V Logic Design

23 Expressions of Logic Functions
Saburo Muroga
23.1 Introduction to Basic Logic Operations
23.2 Truth Tables
23.3 Karnaugh Maps
23.4 Binary Decision Diagrams

24 Basic Theory of Logic Functions
Saburo Muroga
24.1 Basic Theorems
24.2 Implication Relations and Prime Implicants

25 Simplification of Logic Expressions
Saburo Muroga
25.1 Minimal Sums
25.2 Derivation of Minimal Sums by Karnaugh Map
25.3 Derivation of Minimal Sums for a Single Function by Other Means
25.4 Prime Implicates, Irredundant Conjunctive Forms, and Minimal Products
25.5 Derivation of Minimal Products by Karnaugh Map

26 Binary Decision Diagrams
Shin-ichi Minato and Saburo Muroga
26.1 Basic Concepts
26.2 Construction of BDD Based on a Logic Expression
26.3 Data Structure
26.4 Ordering of Variable for Compact BDDs
26.5 Remarks

27 Logic Synthesis with AND and OR Gates in Two Levels
Saburo Muroga
27.1 Introduction
27.2 Design of Single-Output Minimal Networks with AND and OR Gates in Two Levels
27.3 Design of Multiple-Output Networks with AND and OR Gates in Two Levels

28 Sequential Networks with AND and OR Gates
Saburo Muroga
28.1 Introduction
28.2 Flip-Flops and Latches
28.3 Sequential Networks in Fundamental Mode
28.4 Malfunctions of Asynchronous Sequential Networks
28.5 Different Tables for the Description of Transitions of Sequential Networks
28.6 Steps for the Synthesis of Sequential Networks
28.7 Synthesis of Sequential Networks

29 Logic Synthesis with AND and OR Gates in Multi-levels
Yuichi Nakamura and Saburo Muroga
29.1 Logic Networks with AND and OR Gates in Multi-levels
29.2 General Division
29.3 Selection of Divisors
29.4 Limitation of Weak Division

30 Logic Properties of Transistor Circuits
Saburo Muroga
30.1 Basic Properties of Connecting Relays
30.2 Analysis of Relay-Contact Networks
30.3 Transistor Circuits

31 Logic Synthesis with NAND (or NOR) Gates in Multi-levels
Saburo Muroga
31.1 Logic Synthesis with NAND (or NOR) Gates
31.2 Design of NAND (or NOR) Networks in Double-Rail Input Logic by the Map-Factoring Method
31.3 Design of NAND (or NOR) Networks in Single-Rail Input Logic
31.4 Features of the Map-Factoring Method
31.5 Other Design Methods of Multi-level Networks with a Minimum Number of Gates

32 Logic Synthesis with a Minimum Number of Negative Gates
Saburo Muroga
32.1 Logic Design of MOS Networks
32.2 Algorithm DIMN

33 Logic Synthesizer with Optimizations in Two Phases
Ko Yoshikawa and Saburo Muroga

34 Logic Synthesizer by the Transduction Method
Saburo Muroga
34.1 Technology-Dependent Logic Optimization
34.2 Transduction Method for the Design of NOR Logic Networks
34.3 Various Transduction Methods
34.4 Design of Logic Networks with Negative Gates by the Transduction Method

35 Emitter-Coupled Logic
Saburo Muroga
35.1 Introduction
35.2 Standard ECL Logic Gates
35.3 Modification of Standard ECL Logic Gates with Wired Logic
35.4 ECL Series-Gating Circuits

36 CMOS
Saburo Muroga
36.1 CMOS (Complementary MOS)
36.2 Logic Design of CMOS Networks
36.3 Logic Design in Differential CMOS Logic
36.4 Layout of CMOS
36.5 Pseudo-nMOS
36.6 Dynamic CMOS

37 Pass Transistors
Kazuo Yano and Saburo Muroga
37.1 Introduction
37.2 Electronic Problems of Pass Transistors
37.3 Top-down Design of Logic Functions with Pass-Transistor Logic

38 Adders
Naofumi Takagi, Haruyuki Tago, Charles R. Baugh, and Saburo Muroga
38.1 Introduction
38.2 Addition in the Binary Number System
38.3 Serial Adder
38.4 Ripple Carry Adder
38.5 Carry Skip Adder
38.6 Carry Look-Ahead Adder
38.7 Carry Select Adder
38.8 Carry Save Adder

39 Multipliers
Naofumi Takagi, Charles R. Baugh, and Saburo Muroga
39.1 Introduction
39.2 Sequential Multiplier
39.3 Array Multiplier
39.4 Multiplier Based on Wallace Tree
39.5 Multiplier Based on a Redundant Binary Adder Tree

40 Dividers
Naofumi Takagi and Saburo Muroga
40.1 Introduction
40.2 Subtract-And-Shift Dividers
40.3 Higher Radix Subtract-And-Shift Dividers
40.4 Even Higher Radix Dividers with a Multiplier
40.5 Multiplicative Dividers

41 Full-Custom and Semi-Custom Design
Saburo Muroga
41.1 Introduction
41.2 Full-Custom Design Sequence of a Digital System

42 Programmable Logic Devices
Saburo Muroga
42.1 Introduction
42.2 PLAs and Variations
42.3 Logic Design with PLAs
42.4 Dynamic PLA
42.5 Advantages and Disadvantages of PLAs
42.6 Programmable Array Logic

43 Gate Arrays
Saburo Muroga
43.1 Mask-Programmable Gate Arrays
43.2 CMOS Gate Arrays
43.3 Advantages and Disadvantages of Gate Arrays

44 Field-Programmable Gate Arrays
Saburo Muroga
44.1 Introduction
44.2 Basic Structures of FPGAs
44.3 Various Field-Programmable Gate Arrays
44.4 Features of FPGAs

45 Cell-Library Design Approach
Saburo Muroga
45.1 Introduction
45.2 Polycell Design Approach
45.3 Hierarchical Design Approach

46 Comparison of Different Design Approaches
Saburo Muroga
46.1 Introduction
46.2 Design Approaches with Off-the-Shelf Packages
46.3 Full- and Semi-Custom Design Approaches
46.4 Comparison of All Different Design Approaches

SECTION VI Memory, Registers, and System Timing

47 System Timing
Ivan S. Kourtev and Eby G. Friedman
47.1 Introduction
47.2 Synchronous VLSI Systems
47.3 Synchronous Timing and Clock Distribution Networks
47.4 Timing Properties of Synchronous Storage Elements
47.5 A Final Note
Appendix

48 ROM/PROM/EPROM
Jen-Sheng Hwang
48.1 Introduction
48.2 ROM
48.3 PROM

49 SRAM
Yuh-Kuang Tseng
49.1 Read/Write Operation
49.2 Address Transition Detection (ATD) Circuit for Synchronous Internal Operation
49.3 Decoder and Word-Line Decoding Circuit
49.4 Sense Amplifier
49.5 Output Circuit

50 Embedded Memory
Chung-Yu Wu
50.1 Introduction
50.2 Merits and Challenges
50.3 Technology Integration and Applications
50.4 Design Methodology and Design Space
50.5 Testing and Yield
50.6 Design Examples

51 Flash Memories
Rick Shih-Jye Shen, Frank Ruei-Ling Lin, Amy Hsiu-Fen Chou, Evans Ching-Song Yang, and Charles Ching-Hsiang Hsu
51.1 Introduction
51.2 Review of Stacked-Gate Non-volatile Memory
51.3 Basic Flash Memory Device Structures
51.4 Device Operation
51.5 Variations of Device Structure
51.6 Flash Memory Array Structures
51.7 Evolution of Flash Memory Technology
51.8 Flash Memory System

52 Dynamic Random Access Memory
Kuo-Hsing Cheng
52.1 Introduction
52.2 Basic DRAM Architecture
52.3 DRAM Memory Cell
52.4 Read/Write Circuit
52.5 Synchronous (Clocked) DRAMs
52.6 Prefetch and Pipelined Architecture in SDRAMs
52.7 Gb SDRAM Bank Architecture
52.8 Multi-level DRAM
52.9 Concept of 2-bit DRAM Cell

53 Low-Power Memory Circuits
Martin Margala
53.1 Introduction
53.2 Read-Only Memory (ROM)
53.3 Flash Memory
53.4 Ferroelectric Memory (FeRAM)
53.5 Static Random-Access Memory (SRAM)
53.6 Dynamic Random-Access Memory (DRAM)
53.7 Conclusion

SECTION VII Analog Circuits

54 Nyquist-Rate ADC and DAC
Bang-Sup Song
54.1 Introduction
54.2 ADC Design Arts
54.3 ADC Architectures
54.4 ADC Design Considerations
54.5 DAC Design Arts
54.6 DAC Architectures
54.7 DAC Design Considerations

55 Oversampled Analog-to-Digital and Digital-to-Analog Converters
John W. Fattaruso and Louis A. Williams III
55.1 Introduction
55.2 Basic Theory of Operation
55.3 Alternative Sigma-Delta Architectures
55.4 Filtering for Sigma-Delta Modulators
55.5 Circuit Building Blocks
55.6 Practical Design Issues
55.7 Summary

56 RF Communication Circuits
Michiel Steyaert, Marc Borremans, Johan Janssens, and Bram De Muer
56.1 Introduction
56.2 Technology
56.3 The Receiver
56.4 The Synthesizer
56.5 The Transmitter
56.6 Towards Fully Integrated Transceivers
56.7 Conclusions

57 PLL Circuits
Min-shueh Yuan and Chorng-Kuang Wang
57.1 Introduction
57.2 PLL Techniques
57.3 Building Blocks of the PLL Circuit
57.4 PLL Applications

58 Continuous-Time Filters
John M. Khoury
58.1 Introduction
58.2 State-Variable Synthesis Techniques
58.3 Realization of VLSI Integrators
58.4 Filter Tuning Circuits
58.5 Conclusion

59 Switched-Capacitor Filters
Andrea Baschirotto
59.1 Introduction
59.2 Sampled-Data Analog Filters
59.3 The Principle of SC Technique
59.4 First-Order SC Stages
59.5 Second-Order SC Circuit
59.6 Implementation Aspects
59.7 Performance Limitations
59.8 Compensation Technique (Performance Improvements)
59.9 Advanced SC Filter Solutions

SECTION VIII Microprocessor and ASIC

60 Timing and Signal Integrity Analysis
Abhijit Dharchoudhury, David Blaauw, and Stantanu Ganguly
60.1 Introduction
60.2 Static Timing Analysis
60.3 Noise Analysis
60.4 Power Grid Analysis

61 Microprocessor Design Verification
Vikram Iyengar and Elizabeth M. Rudnick
61.1 Introduction
61.2 Design Verification Environment
61.3 Random and Biased-Random Instruction Generation
61.4 Correctness Checking
61.5 Coverage Metrics
61.6 Smart Simulation
61.7 Wide Simulation
61.8 Emulation
61.9 Conclusion

62 Microprocessor Layout Method
Tanay Karnik
62.1 Introduction
62.2 Layout Problem Description
62.3 Manufacturing
62.4 Chip Planning

63 Architecture
Daniel A. Connors and Wen-mei Hwu
63.1 Introduction
63.2 Types of Microprocessors
63.3 Major Components of a Microprocessor
63.4 Instruction Set Architecture
63.5 Instruction Level Parallelism
63.6 Industry Trends

64 ASIC Design
Sumit Gupta and Rajesh K. Gupta
64.1 Introduction
64.2 Design Styles
64.3 Steps in the Design Flow
64.4 Hierarchical Design
64.5 Design Representation and Abstraction Levels
64.6 System Specification
64.7 Specification Simulation and Verification
64.8 Architectural Design
64.9 Logic Synthesis
64.10 Physical Design
64.11 I/O Architecture and Pad Design
64.12 Tests After Manufacturing
64.13 High-Performance ASIC Design
64.14 Low Power Issues
64.15 Reuse of Semiconductor Blocks
64.16 Conclusion

65 Logic Synthesis for Field Programmable Gate Array (FPGA) Technology
John Lockwood
65.1 Introduction
65.2 FPGA Structures
65.3 Logic Synthesis
65.4 Look-up Table (LUT) Synthesis
65.5 Chortle
65.6 Two-Step Approaches
65.7 Conclusion

SECTION IX Test and Testability

66 Testability Concepts and DFT
Nick Kanopoulos
66.1 Introduction: Basic Concepts
66.2 Design for Testability

67 ATPG and BIST
Dimitri Kagaris
67.1 Automatic Test Pattern Generation
67.2 Built-In Self-Test

68 CAD Tools for BIST/DFT and Delay Faults
Spyros Tragoudas
68.1 Introduction
68.2 CAD for Stuck-at Faults
68.3 CAD for Path Delays

SECTION X Compound Semiconductor Digital Integrated Circuit Technology

69 Materials
Stephen I. Long
69.1 Introduction
69.2 Compound Semiconductor Materials
69.3 Why III-V Semiconductors?
69.4 Heterojunctions

70 Compound Semiconductor Devices for Digital Circuits
Donald B. Estreich
70.1 Introduction
70.2 Unifying Principle for Active Devices: Charge Control Principle
70.3 Comparing Unipolar and Bipolar Transistors
70.4 Typical Device Structures

71 Logic Design Principles and Examples
Stephen I. Long
71.1 Introduction
71.2 Static Logic Design
71.3 Transient Analysis and Design for Very-High-Speed Logic

72 Logic Design Examples
Charles E. Chang, Meera Venkataraman, and Stephen I. Long
72.1 Design of MESFET and HEMT Logic Circuits
72.2 HBT Logic Design Examples

SECTION XI Design Automation

73 Internet Based Micro-Electronic Design Automation (IMEDA) Framework
Moon Jung Chung and Heechul Kim
73.1 Introduction
73.2 Functional Requirements of Framework
73.3 IMEDA System
73.4 Formal Representation of Design Process
73.5 Execution Environment of the Framework
73.6 Implementation
73.7 Conclusion

74 System-Level Design
Alice C. Parker, Yosef Tirat-Gefen, and Suhrid A. Wadekar
74.1 Introduction
74.2 System Specification
74.3 System Partitioning
74.4 Scheduling and Allocating Tasks to Processing Modules
74.5 Allocating and Scheduling Storage Modules
74.6 Selecting Implementation and Packaging Styles for System Modules
74.7 The Interconnection Strategy
74.8 Word Length Determination
74.9 Predicting System Characteristics
74.10 A Survey of Research in System Design

75 Synthesis at the Register Transfer Level and the Behavioral Level
J. Bhasker
75.1 Introduction
75.2 The Two HDL’s
75.3 The Three Different Domains of Synthesis
75.4 RTL Synthesis
75.5 Modeling a Three-State Gate
75.6 An Example
75.7 Behavioral Synthesis
75.8 Conclusion

76 Performance Modeling and Analysis in VHDL
James H. Aylor and Robert H. Klenke
76.1 Introduction
76.2 The ADEPT Design Environment
76.3 A Simple Example of an ADEPT Performance Model
76.4 Mixed-Level Modeling
76.5 Conclusions

77 Embedded Computing Systems and Hardware/Software Co-Design
Wayne Wolf
77.1 Introduction
77.2 Uses of Microprocessors
77.3 Embedded System Architectures
77.4 Hardware/Software Co-Design

78 Design Automation Technology Roadmap
Donald Cottrell
78.1 Introduction
78.2 Design Automation: An Historical Perspective
78.3 The Future
78.4 Summary

SECTION XII Algorithms and Architects

79 Algorithms and Architectures for Multimedia and Beamforming Communications
Flavio Lorenzelli and Kung Yao
79.1 Introduction
79.2 Multimedia Support for General Purpose Computers
79.3 Beamforming Array Processing and Architecture

SECTION XIII Design Languages

80 Design Languages
David L. Barton
80.1 Introduction
80.2 Objects and Data Types
80.3 Standard Logic Types
80.4 Concurrent Statements
80.5 Sequential Statements
80.6 Simultaneous Statements
80.7 Modular Designs
80.8 Simulation
80.9 Test Benches

81 Hardware Description in Verilog: An Introductory Tutorial
Zainalabedin Navabi
81.1 Elements of Verilog
81.2 Basic Component Descriptions
81.3 A Complete Design
81.4 Controller Description
81.5 Gate and Switch Level Description
81.6 Test Bench Descriptions
81.7 Summary

Shenai, K., et al. “VLSI Technology: A System Perspective” The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


1 VLSI Technology: A System Perspective

Krishna Shenai and Erik A. McShane
University of Illinois at Chicago

1.1 Introduction
1.2 Contemporary VLSI Systems
Digital Systems • Analog Systems • Power Systems
1.3 Emerging VLSI Systems
Embedded Memory • Monolithic RFICs • Single-Chip Sensors and Detectors • MEMS
1.4 Alternative Technologies
Quantum Computing • DNA Computing • Molecular Computing

1.1 Introduction

The development of VLSI systems has historically progressed hand-in-hand with technology innovations. Often, fresh achievements in lithography, semiconductor devices, or metallization have led to the introduction of new products. Conversely, market demand for particular products or specifications has greatly influenced focused research into the technology capabilities necessary to deliver the product. As a result, many conventional VLSI systems have engendered highly specialized technologies for their support.

In contrast, a characteristic of emerging VLSI products is the integration of diverse systems, each of which previously required a unique technology, into a single technology platform. The driving force behind this trend is the demand in consumer and noncommercial sectors for compact, portable, wireless electronics products: the nascent “system-on-a-chip” era.1–4 Figure 1.1 illustrates some of the system components playing a role in this development.

FIGURE 1.1 These system components are representative of the essential building blocks in VLSI “systems-on-a-chip.”

Most of the achievements in dense systems integration have derived from scaling in silicon VLSI processes.5 As manufacturing has improved, it has become more cost-effective in many applications to replace a chip set with a monolithic IC: packaging costs are decreased, interconnect paths shrink, and power loss in I/O drivers is reduced. Further scaling to deep submicron dimensions will continue to widen the applications of VLSI system integration, but also will lead to additional complexities in reliability, interconnect, and lithography.6

This evolution is raising questions over the optimal level of integration: package level or chip level. Each has distinct advantages and some critical deficiencies for cost, reliability, and performance. Board-level interconnection of chip sets, although a mainstay of low-cost, high-volume manufacturing, cannot provide a suitably dense integration of high-performance, core VLSI systems. Package- and chip-level integration are more practical contenders for VLSI systems implementation because of their compact dimensions and short signal interconnects. They also offer a tradeoff between dense monolithic integration and application-specific technology optimization. The pace of the further evolution of VLSI systems is unclear at this time, although systems integration will continue to influence and be influenced by technology development.

The remainder of this chapter will trace the inter-relationship of technology and systems to date and then outline emerging and future VLSI systems and their technology requisites. Alternative technologies will also be introduced with a presentation of their potential impact on VLSI systems. Focused discussion of the specific VLSI technologies introduced will follow in later chapters. Given the level of systems integration afforded by available technology and the diverse signal-processing capabilities and applications supported, in this chapter a “VLSI system” is loosely defined as any complex system, primarily electronic in nature, based on semiconductor manufacturing with an extremely dense integration of minimal processing elements (e.g., transistors) and packaged as a single- or multi-chip module.

1.2 Contemporary VLSI Systems

VLSI systems can be crudely categorized by the nature of the signal processing they perform: analog, digital, or power. Included in analog are high-frequency systems, but they can be distinguished both by design methodology and their sensitivity to frequency-dependent characteristics in biasing and operation. Digital systems consist of logic circuits and memory, although it should be noted that most “digital” systems now also contain significant analog subsystems for data conversion and signal integrity. Power semiconductor devices have previously afforded only very low levels of integration considering their extreme current- and voltage-handling requirements (up to 1000 A and 10 kV) and resulting high temperatures. However, with the advent of hybrid technologies (integrating different materials on a single silicon substrate), partial insulating substrates (with dielectrically isolated regions for power semiconductor devices), and MCM packaging, integrated “smart” power electronics are appearing for medium power (up to 1 kW) applications. A relative newcomer to the VLSI arena is microelectromechanical systems (MEMS). As the name states, MEMS is not purely electronic in nature and is now frequently extended to also label systems that are based on optoelectronics, biochemistry, and electromagnetics.

Digital Systems

Introduction

The digital systems category comprises microprocessors, microcontrollers, specialized digital signal processors, and solid-state memory. As mentioned previously, these systems may also contain analog, power, RF, and MEMS subsystems; but in this section, discussion is restricted to digital electronics. Beginning with the introduction in 1971 of the first true microprocessor (the Intel 4004), digital logic ICs have offered increasing functionality afforded by a number of technology factors. Transistor miniaturization from the 10-micron dimensions common 30 years ago to state-of-the-art 0.25-micron lithography has boosted IC device counts to over 10 million transistors. To support subsystem interconnection, multilevel metallization stacks have evolved. And, to reduce static and switching power losses, low-power/low-voltage technologies have become standard. The following discussion of VLSI technology pertains to the key metrics in digital systems: power dissipation, signal delay, signal integrity, and memory integration.

Power Dissipation

The premier technology today for digital systems is CMOS, owing to its inherent low-power attributes and excellent scaling to deep submicron dimensions. Total power dissipation is expressed as

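$$
\begin{aligned}
P &= P_{\mathrm{dynamic}} + P_{\mathrm{static}} = \left( P_{\mathrm{switching}} + P_{\mathrm{short\text{-}circuit}} + P_{\mathrm{leakage}} \right) \\
  &= V_{DD}^{2} f \sum_{n} a_n c_n + V_{DD} \sum_{n} i_{sc_n} + V_{DD} \sum_{n} (1 - a_n)\, i_{leak_n}
\end{aligned}
\tag{1.1}
$$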

where $V_{DD}$ is the operating supply; $f$ is the clock frequency; and, per node, $a_n$ is the switching activity, $c_n$ is the switching capacitance, $i_{sc_n}$ is the short-circuit current, and $i_{leak_n}$ is the leakage current (subthreshold conduction and junction leakage). From this expression it is apparent that the most significant reduction in power dissipation can be accomplished by scaling the operating supply. However, as $V_{DD}$ is reduced to 1 V, the contribution of leakage current to overall power dissipation increases if transistor $V_T$ is scaled proportionally to $V_{DD}$. Subthreshold current in bulk CMOS, neglecting junction leakage and body effects, can be expressed as7

$$
I_{sub} = \frac{W}{L}\, I_0\, e^{\frac{V_{GS} - V_T}{n \phi_t}} \left( 1 - e^{-\frac{V_{DS}}{\phi_t}} \right)
\tag{1.2}
$$

where

$$
I_0 = k' (n - 1)\, \phi_t^{2}
\tag{1.3}
$$

$$
n = 1 + \frac{\gamma}{2 \sqrt{\phi_t}}
\tag{1.4}
$$

$$
k' = \mu C_{ox}'
\tag{1.5}
$$

$$
\gamma = \frac{\sqrt{2 q \varepsilon_S N_B}}{C_{ox}'}
\tag{1.6}
$$

$$
C_{ox}' = \frac{\varepsilon_{ox}}{t_{ox}}
\tag{1.7}
$$

$W$ and $L$ are channel width and length, respectively; $\phi_t$ is thermal voltage (approximately 0.0259 V at 300 K); $\mu$ is carrier mobility in the channel; $\varepsilon_{ox}$ is gate dielectric permittivity ($3.45 \times 10^{-13}$ F/cm for SiO2); $\varepsilon_S$ is semiconductor permittivity ($1.04 \times 10^{-12}$ F/cm for Si); $N_B$ is bulk doping; and $t_{ox}$ is gate dielectric thickness. The growing contribution of leakage is exacerbated if minimal-switching circuit techniques are employed or if sleep modes place the logic into idle states for long periods. Device scaling thus must consider the architecture and performance requirements. Figures 1.2 and 1.3 show the inverse normalized energy-delay product (EDP) contours for a hypothetical 0.25-micron device.8 The energy required per operation is

$$
E = \frac{P}{f}
\tag{1.8}
$$
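Equations 1.1 through 1.3 and 1.8 can be exercised numerically. The following is a minimal sketch, not code from the handbook: the function names, the per-node lists, and the example values are illustrative assumptions, and the subthreshold expression uses the reading of Eq. 1.3 in which the thermal voltage is squared.

```python
import math

def total_power(vdd, f, activity, cap, i_sc, i_leak):
    """Eq. 1.1: switching + short-circuit + leakage power, summed over nodes."""
    p_switching = vdd ** 2 * f * sum(a * c for a, c in zip(activity, cap))
    p_short = vdd * sum(i_sc)
    p_leak = vdd * sum((1.0 - a) * i for a, i in zip(activity, i_leak))
    return p_switching + p_short + p_leak

def energy_per_operation(power, f):
    """Eq. 1.8: energy per operation, E = P / f."""
    return power / f

def subthreshold_current(w, l, vgs, vds, vt, k_prime, n, phi_t=0.0259):
    """Eqs. 1.2 and 1.3, neglecting junction leakage and body effect."""
    i0 = k_prime * (n - 1.0) * phi_t ** 2
    gate_term = math.exp((vgs - vt) / (n * phi_t))
    drain_term = 1.0 - math.exp(-vds / phi_t)
    return (w / l) * i0 * gate_term * drain_term

# Hypothetical two-node circuit (all values are assumptions for illustration).
p = total_power(vdd=1.0, f=100e6,
                activity=[0.5, 0.1], cap=[10e-15, 20e-15],
                i_sc=[1e-9, 1e-9], i_leak=[1e-10, 1e-10])
print(p, energy_per_operation(p, 100e6))
```

Lowering `vdd` in this sketch shrinks the switching term quadratically while the other two terms fall only linearly, which is the scaling argument made in the text.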

Normalization is performed relative to the best obtained EDP for this technology. Fig. 1.2 shows data for an ideal device and Fig. 1.3 adds non-idealities by considering velocity saturation effects and uncertainty in VDD , VT , and temperature T. In the ideal device, the dashed lines indicate vectors of normalized constant performance relative to the performance obtained at the optimal EDP point. The switching frequency can be approximated by

FIGURE 1.2 Inverse normalized EDP contours for an ideal device (after Ref. 8). Dashed lines indicate vectors of constant performance. Arrow F shows direction of increasing performance.


FIGURE 1.3 Inverse normalized EDP contours for a non-ideal device considering velocity saturation and uncertainty in VDD , VT , and temperature T (after Ref. 8).

$$
f = \frac{1}{t_{rise}} = \frac{1}{t_{fall}} = \frac{I_{Dsat}}{C V_{DD}}
\tag{1.9}
$$

and the performance, $F$, when considered as proportional to $f$, can be expressed as

$$
F = \alpha\, \frac{(V_{DD} - V_T)^{2}}{V_{DD}}
\tag{1.10}
$$

where scaling factor $\alpha$ is applied to normalize performance. These plots illustrate the tradeoffs in optimizing system performance for low-power requirements and highest performance. Frequency can also be scaled to reduce power dissipation, but this is not considered here as it generally also degrades performance. Considering purely dynamic power losses ($C V^{2} f$), scaling the operating supply again yields the most significant reduction; but this scaling also affects the subthreshold leakage since $V_T$ must be scaled similarly to maintain comparable performance levels (see Eq. 1.10). In this respect, fully depleted SOI CMOS offers improved low-voltage, low-power characteristics as it has a steeper subthreshold slope than bulk CMOS. Subthreshold slope, $S$, is defined as

$$
S = \frac{dV_G}{d(\log I_D)}
\tag{1.11}
$$

This can be expressed (from Ref. 9) for bulk CMOS as

$$
S = \frac{\hat{k} T}{q} \ln(10) \left( 1 + \frac{C_D}{C_{ox}} \right)
\tag{1.12}
$$

and for fully depleted SOI CMOS (assuming negligible interface states and buried-oxide capacitance) as

$$
S = \frac{\hat{k} T}{q} \ln(10)
\tag{1.13}
$$

where $\hat{k}$ is Boltzmann’s constant ($1.38 \times 10^{-23}$ V·C/K), $C_D$ is depletion capacitance, and $C_{ox}$ is gate dielectric capacitance. Hence, for the same weak inversion gate bias, SOI CMOS can yield a leakage current several orders of magnitude less than in bulk CMOS. Additional power dissipation occurs in the extrinsic parasitics of the active devices and the interconnect. This contribution can be minimized by salicide (self-aligned silicide) processes that deposit a low sheet resistance layer on the source, drain, and gate surfaces.

Switching Frequency and Signal Integrity

After power dissipation, the signal delay (or maximum switching frequency) of a system is the most important figure-of-merit. This characteristic, as mentioned previously, provides a first-order approximation of system performance. It also affects the short-circuit contribution to power loss since a dc path between the supply rails exists during a switching event. Also, signal delay and slope determine the deviation of a logic signal pulse from an ideal step transition. Digital systems based on silicon bipolar and BiCMOS technologies still appear for high-speed applications, exploiting the higher small-signal gain and greater current drive of bipolar transistors over MOSFETs; but given the stringent power requirements of portable electronics, non-CMOS implementations are impractical. Emerging technologies such as silicon heterojunction bipolar and field-effect transistors (HBTs and HFETs) hold some promise of fast switching with reduced power dissipation, but the technology is too immature to be evaluated as yet. The switching rate of a capacitively loaded node in a logic circuit can be approximated by the time required for the capacitor to be fully charged or discharged, assuming that a constant current is available for charge transport (see Eq. 1.9). Neglecting channel-length modulation effects on saturation current, the switching frequency can be written as

$$
f = \frac{I_{Dsat}}{C V_{DD}}
  = \frac{\dfrac{k'}{2} \dfrac{W}{L} (V_{GS} - V_T)^{2}}{C V_{DD}}
  \propto \frac{(V_{DD} - V_T)^{2}}{C V_{DD}}
\tag{1.14}
$$
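Two of the relations above lend themselves to a quick numerical check: the subthreshold slope of Eqs. 1.12 and 1.13, and the switching-frequency estimate of Eq. 1.14. The sketch below is illustrative only; the function names and the device values are assumptions, and the gate drive is taken as $V_{GS} = V_{DD}$.

```python
import math

K_B = 1.38e-23    # Boltzmann's constant, V*C/K (value quoted in the text)
Q_E = 1.602e-19   # electron charge, C

def subthreshold_slope(temp_k, cd_over_cox=0.0):
    """Eq. 1.12 in volts per decade; cd_over_cox = 0 gives Eq. 1.13 (fully depleted SOI)."""
    return (K_B * temp_k / Q_E) * math.log(10.0) * (1.0 + cd_over_cox)

def switching_frequency(k_prime, w, l, vdd, vt, c_node):
    """Eq. 1.14 with V_GS = V_DD and channel-length modulation neglected."""
    i_dsat = 0.5 * k_prime * (w / l) * (vdd - vt) ** 2
    return i_dsat / (c_node * vdd)

# Assumed example: a bulk device with C_D / C_ox = 0.5 versus fully depleted SOI.
s_bulk = subthreshold_slope(300.0, cd_over_cox=0.5)
s_soi = subthreshold_slope(300.0)
print(s_bulk * 1e3, s_soi * 1e3)   # slopes in mV/decade

# Assumed example: k' = 100 uA/V^2, W/L = 10, 10 fF load.
print(switching_frequency(100e-6, 2.5e-6, 0.25e-6, vdd=1.0, vt=0.3, c_node=10e-15))
```

Because leakage varies exponentially with gate voltage divided by the slope, the smaller slope of the SOI case translates directly into the lower weak-inversion current noted in the text.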

Voltage scaling and its effects on power dissipation have already been discussed. Considering the capacitive contribution, a linear improvement to switching speed can be obtained by scaling node capacitance. Referring to Fig. 1.4 and neglecting interconnect capacitance, the node capacitance of a MOSFET can be expressed as

$$
C_{OUT} = C_{GD} + \kappa(V_{OL}, V_{OH})\,(C_{db})
\tag{1.15}
$$

$$
C_{GD} = x_{jl} C_{ox} W + \frac{1}{2} C_{ox} W L_{eff}
\tag{1.16}
$$

$$
C_{db} = C_{j0} W Y + 2 C_{jsw} (W + Y)
\tag{1.17}
$$

$$
C_{j0} = \sqrt{\frac{q \varepsilon_{Si}}{2 \left( \dfrac{1}{N_{sd}} + \dfrac{1}{N_{sub}} \right) \phi}}
\tag{1.18}
$$

$$
C_{jsw} \approx C_{j0}\, x_j
\tag{1.19}
$$

FIGURE 1.4 A MOSFET isometric cross-sectional view with critical dimensions identified.

The drain-to-body junction capacitance $C_{db}$ is bias dependent, and the scaling factor $\kappa$ is included to determine an average value over the output voltage swing. Source/drain diffusion capacitance has two components: the bottom areal capacitance $C_{j0}$ and the sidewall perimeter capacitance $C_{jsw}$. Although $C_{jsw}$ is a complex function of doping profile and should account for high-concentration channel-stop implants, an approximation is made that derives $C_{jsw}$ from $C_{j0}$ and the junction depth $x_j$ (Eq. 1.19). From Fig. 1.5, it is clear that SOI CMOS has greatly reduced device capacitances compared to bulk CMOS by the elimination of junction areal and perimeter capacitances. Another technique in SOI CMOS for improving switching delay involves dynamic threshold voltage control (DTMOS) by taking advantage of the parasitic lateral bipolar transistor inherent in the device structure.10
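The capacitance model of Eqs. 1.15 through 1.19 can be collected into a few small functions. This is a sketch under stated assumptions rather than a calibrated model: the function names are hypothetical, $\phi$ is treated as the junction potential, lengths are in cm to match the permittivity quoted earlier, and the example dopings and dimensions are invented for illustration.

```python
import math

Q_E = 1.602e-19     # electron charge, C
EPS_SI = 1.04e-12   # silicon permittivity, F/cm (value quoted in the text)

def cj0(n_sd, n_sub, phi):
    """Eq. 1.18: areal junction capacitance in F/cm^2 (dopings in cm^-3, phi in V)."""
    return math.sqrt(Q_E * EPS_SI / (2.0 * (1.0 / n_sd + 1.0 / n_sub) * phi))

def c_db(w, y, x_j, n_sd, n_sub, phi):
    """Eqs. 1.17 and 1.19: bottom-area term plus sidewall term (w, y, x_j in cm)."""
    c_area = cj0(n_sd, n_sub, phi)
    c_sidewall = c_area * x_j            # Eq. 1.19 approximation
    return c_area * w * y + 2.0 * c_sidewall * (w + y)

def c_gd(x_jl, c_ox, w, l_eff):
    """Eq. 1.16: overlap term plus half of the channel capacitance."""
    return x_jl * c_ox * w + 0.5 * c_ox * w * l_eff

def c_out(cgd, cdb, kappa):
    """Eq. 1.15: kappa averages the bias-dependent junction capacitance."""
    return cgd + kappa * cdb

# Assumed bulk device: W = 1 um, Y = 0.6 um, x_j = 0.1 um, t_ox = 5 nm.
c_ox = 3.45e-13 / 5e-7                   # F/cm^2, from Eq. 1.7
cdb = c_db(1e-4, 0.6e-4, 0.1e-4, n_sd=1e20, n_sub=1e17, phi=0.9)
cgd = c_gd(x_jl=0.05e-4, c_ox=c_ox, w=1e-4, l_eff=0.25e-4)
print(c_out(cgd, cdb, kappa=0.6))
```

Setting `cdb` to zero in the last call gives a rough picture of the thin-film SOI case, in which the junction areal and perimeter terms are largely eliminated.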

FIGURE 1.5 Cross-sectional views of a MOSFET: bulk and thin-film SOI.

To reduce interconnect resistance, copper interconnect has been introduced to replace traditional aluminum wires.11 Table 1.1 compares two critical parameters. The higher melting point of copper also reduces long-term interconnect degradation from electromigration, in which energetic carriers dislodge metal atoms, creating voids or nonuniformities. Interconnect capacitance relative to the substrate is determined by the dielectric constant, $\varepsilon_r$, and the signal velocity can be defined as

$$
v = \frac{c}{\sqrt{\varepsilon_r}}
\tag{1.20}
$$

Low-εr interlevel dielectrics are appearing to reduce this parasitic effect.

TABLE 1.1 Comparison of Interconnect Characteristics for Al and Cu

Material    Specific Resistance (µΩ-cm)    Melting Point (°C)
Al          2.66                           660
Cu          1.68                           1073
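To put the Table 1.1 resistivities and Eq. 1.20 in concrete terms, the sketch below compares the resistance of identical Al and Cu wires and evaluates the signal velocity for a given dielectric constant. The function names, the wire geometry, and the dielectric constants are assumptions chosen for illustration; only the two resistivities come from the table.

```python
import math

RHO_UOHM_CM = {"Al": 2.66, "Cu": 1.68}   # specific resistance from Table 1.1
C_LIGHT = 2.998e8                         # speed of light in vacuum, m/s

def wire_resistance(material, length_um, width_um, thickness_um):
    """R = rho * L / (W * t) for a rectangular wire, result in ohms."""
    rho_ohm_m = RHO_UOHM_CM[material] * 1e-8           # uOhm*cm -> Ohm*m
    area_m2 = (width_um * 1e-6) * (thickness_um * 1e-6)
    return rho_ohm_m * (length_um * 1e-6) / area_m2

def signal_velocity(eps_r):
    """Eq. 1.20: propagation velocity in a dielectric of relative permittivity eps_r."""
    return C_LIGHT / math.sqrt(eps_r)

# Assumed wire: 1 mm long, 0.5 um wide, 0.5 um thick.
r_al = wire_resistance("Al", 1000.0, 0.5, 0.5)
r_cu = wire_resistance("Cu", 1000.0, 0.5, 0.5)
print(r_al, r_cu, r_cu / r_al)

# Assumed dielectric constants: about 3.9 for SiO2 and 2.7 for a low-k film.
print(signal_velocity(3.9), signal_velocity(2.7))
```

For equal geometry the resistance ratio is simply the resistivity ratio, so this model credits copper with roughly a 37% reduction in wire resistance.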

Memory Scaling

The two most critical factors determining the commercial viability of RAM products are the total power dissipation and the chip area. For implementations in battery-operated portable electronics, the goal is a 0.9-V operating supply, the minimum voltage of a NiCd cell. RAM designs are addressing these objectives architecturally and technologically. SRAMs and DRAMs share many architectural features, including memory array partitioning, reduced voltage internal logic, and dynamic threshold voltage control. DRAM, with its higher memory density, is more attractive for embedded memory applications despite its higher power dissipation. Figure 1.6 shows a RAM block diagram that identifies the sources of power dissipation. The power equation as given by Itoh et al.12 is

$$
P = I_{DD} V_{DD}
\tag{1.21}
$$

$$
I_{DD} = m\, i_{act} + m (n - 1)\, i_{hld} + (n + m)\, C_{DE} V_{INT} f + C_{PT} V_{INT} f + I_{DCP}
\tag{1.22}
$$

where $i_{act}$ is the effective current in active cells, $i_{hld}$ is the holding current in inactive cells, $C_{DE}$ is the decoder output capacitance, $C_{PT}$ is the peripheral circuit capacitance, $V_{INT}$ is the internal voltage level, $I_{DCP}$ is the static current in the peripheral circuits, and $n$ and $m$ define the memory array dimensions. In present DRAMs, power loss is dominated by $i_{act}$, the charging current of an active subarray; but as $V_T$ is scaled along with the operating voltage, the subthreshold current begins to dominate. The trend in DRAM ICs (see Fig. 1.7) shows that the dc current will begin to dominate the total active current at about the 1-Gb range. Limiting this and other short-channel effects (SCE) is necessary then to improve power efficiency. Figure 1.8 shows trends in device parameters. A substrate doping of over $10^{18}$ cm$^{-3}$ is necessary to reduce SCE, but this has the disadvantage of also increasing junction leakage currents. To achieve reduced SCE at lower substrate dopings, shallow junctions (as thin as 15 nm) are formed.13 Bit storage capacitors must also be scaled to match device miniaturization but still retain adequate noise tolerance. Alpha-particle irradiation becomes less significant as devices are scaled, due to the reduced depletion region; but leakage currents still place a minimum requirement on bit charge. Figure 1.9 shows that required signal charge, $Q_S$, has reduced only slightly with increased memory capacity, but cell areas have shrunk considerably. High-permittivity (high-εr) dielectrics such as Ta2O5 and BST (BaxSr1–xTiO3) are required to provide these greater areal capacitances at reduced dimensions.14 Table 1.2 lists material properties for some of the common and emerging dielectrics. In addition to scaling the cell area, the capacitor aspect ratio also affects manufacturing: larger aspect ratios result in non-planar interlevel dielectric and large step height variation between memory arrays and peripheral circuitry.
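The RAM power model of Eqs. 1.21 and 1.22 is a sum of four contributions, which the sketch below keeps separate so that their relative weight can be inspected. The function names and every numerical value are assumptions for illustration and are not taken from Itoh et al.

```python
def dram_supply_current(m, n, i_act, i_hld, c_de, c_pt, v_int, f, i_dcp):
    """Eq. 1.22: total supply current of an m-by-n array plus decoders and periphery."""
    array = m * i_act + m * (n - 1) * i_hld      # active cells plus holding cells
    decoder = (n + m) * c_de * v_int * f         # row and column decoder switching
    periphery = c_pt * v_int * f                 # peripheral circuit switching
    return array + decoder + periphery + i_dcp   # plus static peripheral current

def dram_power(i_dd, v_dd):
    """Eq. 1.21: P = I_DD * V_DD."""
    return i_dd * v_dd

# Assumed subarray: 1024 x 1024 cells, 10 MHz cycle rate, 1.5 V internal supply.
i_dd = dram_supply_current(m=1024, n=1024, i_act=20e-6, i_hld=1e-12,
                           c_de=100e-15, c_pt=200e-12, v_int=1.5, f=10e6,
                           i_dcp=1e-3)
print(i_dd, dram_power(i_dd, v_dd=1.5))
```

Raising `i_hld` in this sketch, as happens when $V_T$ is scaled down, makes the `m * (n - 1) * i_hld` term grow with array size until it overtakes the active-cell current, which is the trend described for the 1-Gb generation.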

FIGURE 1.6 RAM block diagram indicating effective currents within each subsystem.

Analog Systems

Introduction

An analog system is any system that processes a signal’s magnitude and phase information by a linearized response of an active device to a small-signal input. Unlike digital systems, which exhibit a large output signal swing, analog systems rely on a sufficiently small signal gain that the linear approximation holds true across the entire spectrum of expected input signal frequencies. Errors in the linear model are introduced by random process variation, intrinsic device noise, ambient noise, and non-idealities in active and passive electronics. Minimizing the cumulative effects of these “noise” contributions is the fundamental objective of analog and RF design. Reflecting the multitude of permutations in input/output specifications and operating conditions, analog/RF design is supported by numerous VLSI technologies. Key among these are silicon MOST, BJT, and BiCMOS for low-frequency applications; silicon BJT for high-frequency, low-noise applications; and GaAs MESFET for high-frequency, high-efficiency amplifiers. Newcomers to the field include GaAs and SiGe heterojunction bipolar junction transistors (HBTs). The bandgap engineering employed in their fabrication results in devices with significantly higher $f_T$ and $f_{max}$ than in conventional devices, often at lower voltages.15–17 Finally, MEMS resonators and mechanical switches offer an alternative to active device implementations.

FIGURE 1.7 Contributions to total current in DRAMs (after Ref. 12).

FIGURE 1.8 Trends in DRAM device technology (after Ref. 13).

TABLE 1.2 Comparison of High-Permittivity Constant Materials for DRAM Cell Capacitors

Material    Dielectric Constant    Minimum Equivalent Oxide Thickness (nm)
NO          7                      3.5 to 4
Ta2O5       20–25                  2 to 3
BST         200–400                ?

FIGURE 1.9 Trends in DRAM cell characteristics (after Ref. 13).

The most familiar application of a high-frequency system is in wireless communications, in which a translation is performed between the high-frequency modulated carrier (RF signal) used for broadcasting and the low-frequency demodulated signal (baseband) suitable for audio or machine interpretation. Wireless ICs long relied on package-level integration and scaling to deliver compact size and improved efficiency. Also, low-cost commercial IC technologies previously could not deliver the necessary frequency range and noise characteristics. This capability is now changing with several candidate technologies at hand for monolithic IC integration. CMOS has the attractive advantage of being optimal for integration of low-power baseband processing.

Amplifiers

Amplifiers boost the amplitude or power of an analog signal to suppress noise or overcome losses and enable further processing. Typical characteristics include a low noise figure (NF), large (selectable) gain (G), good linearity, and high power-added efficiency (PAE). To accommodate the variety of signal frequencies and performance requirements, several amplifier categories have evolved. These include conventional single-ended, differential, and operational amplifiers at lower frequencies and, at higher frequencies, low-noise and RF power amplifiers.


A challenge in technology scaling is providing a suitable signal-to-noise ratio and adequate biasing at a reduced operating supply. For a fixed gain, reducing the operating supply implies a similar scaling of the input signal level, ultimately approaching the noise floor of the system and leading to greater susceptibility to internal and external noise sources. Large-signal amplifiers (e.g., RF power amplifiers) that exhibit a wide output swing face similar problems with linearity at a lower operating supply. A low-noise amplifier (LNA) is the first active circuit in a receiver. A common-source configuration of a MOSFET LNA is shown in Fig. 1.10. The input network is typically matched for lowest NF, and the output network is matched for maximum power transfer. Input impedance is matched to the source resistance, Rs, when18

$$
\omega_0^{2} (L_1 + L_2)\, C_{gs} = 1
\tag{1.23}
$$

$$
\frac{g_m L_1}{C_{gs}} = R_s
\tag{1.24}
$$

The gain from the input matching network to the transistor gate-source voltage is equal to Q, the quality factor

$$
Q = \frac{1}{g_m \omega_0 L_1}
\tag{1.25}
$$

where ω0 is the RF frequency. If only the device current noise is considered, then the LNA noise figure can be expressed as

$$
NF = 1 + \frac{2}{3} \frac{\omega_0 L_1}{Q R_s}
\tag{1.26}
$$

FIGURE 1.10 Common-source LNA circuit schematic.

It is observed that a larger quality factor yields a lower noise figure, but current industry practice selects an LNA Q of 2 to 3 since increasing Q also increases the sensitivity of the LNA gain to tolerances in the passive components. By combining Eqs. 1.24 and 1.25, the device input capacitance $C_{gs}$ can be defined as

$$
C_{gs} = \frac{1}{R_s Q \omega_0}
\tag{1.27}
$$
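The matching relations of Eqs. 1.23 through 1.27 can be chained into a small design calculation: choose $R_s$, $Q$, the operating frequency, and a transconductance, then solve for $C_{gs}$, $L_1$, $L_2$, and the noise factor. The sketch below is illustrative; the function and field names are hypothetical, and the 900-MHz, 50-Ω, 20-mS example is an assumed design point rather than one from the text.

```python
import math
from dataclasses import dataclass

@dataclass
class LnaMatch:
    c_gs: float          # gate-source capacitance, F
    l1: float            # source inductor, H
    l2: float            # gate inductor, H
    noise_factor: float  # linear noise factor (NF as a ratio, not in dB)

def design_lna_match(r_s, q, f0_hz, g_m):
    """Solve Eqs. 1.27, 1.24, 1.23, and 1.26 in that order."""
    w0 = 2.0 * math.pi * f0_hz
    c_gs = 1.0 / (r_s * q * w0)                    # Eq. 1.27
    l1 = r_s * c_gs / g_m                          # Eq. 1.24
    l2 = 1.0 / (w0 ** 2 * c_gs) - l1               # Eq. 1.23
    nf = 1.0 + (2.0 / 3.0) * w0 * l1 / (q * r_s)   # Eq. 1.26
    return LnaMatch(c_gs, l1, l2, nf)

# Assumed design point: 50-ohm source, Q = 2.5, 900 MHz, g_m = 20 mS.
match = design_lna_match(r_s=50.0, q=2.5, f0_hz=900e6, g_m=0.02)
print(match)
print(10.0 * math.log10(match.noise_factor), "dB")
```

Sweeping `q` in this sketch shows the noise factor falling as Q rises, while the required $L_2$ grows, which is one way to see why component tolerances become more critical at high Q.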

Assuming that the transistor is in the saturation region and that Miller feedback gain is –1, the contributions to the input capacitance are

$$
C_{gs} = \left( \frac{2}{3} C_{ox}' L_{eff} + C_{gso} + 2 C_{gdo} \right) W
\tag{1.28}
$$

where $C_{gso}$ and $C_{gdo}$ are, respectively, the gate-source and gate-drain overlap capacitances. The bias current (assuming a reasonable value for $g_m$) can then be obtained from

$$
I_{Dsat} = \frac{g_m^{2}\, L_{eff}}{2 W k'}
\tag{1.29}
$$
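Equations 1.28 and 1.29 complete the sizing flow: the target input capacitance fixes the device width, and the width together with the chosen $g_m$ fixes the bias current. The following sketch assumes hypothetical function names, lengths in cm, overlap capacitances per unit width, and invented process values; the last lines check the result against the square-law transconductance expression that underlies Eq. 1.29.

```python
import math

def device_width(c_gs, c_ox, l_eff, c_gso, c_gdo):
    """Eq. 1.28 solved for W (c_ox in F/cm^2, c_gso and c_gdo in F/cm, l_eff in cm)."""
    return c_gs / ((2.0 / 3.0) * c_ox * l_eff + c_gso + 2.0 * c_gdo)

def bias_current(g_m, w, l_eff, k_prime):
    """Eq. 1.29: saturation current needed for a given transconductance."""
    return g_m ** 2 * l_eff / (2.0 * w * k_prime)

# Assumed process: t_ox = 5 nm, L_eff = 0.25 um, 0.3 fF/um overlap, mobility 400 cm^2/V-s.
c_ox = 3.45e-13 / 5e-7          # F/cm^2, Eq. 1.7
k_prime = 400.0 * c_ox          # A/V^2, Eq. 1.5
l_eff = 0.25e-4                 # cm

w = device_width(c_gs=1.4e-12, c_ox=c_ox, l_eff=l_eff, c_gso=3e-12, c_gdo=3e-12)
i_d = bias_current(g_m=0.02, w=w, l_eff=l_eff, k_prime=k_prime)
print(w * 1e4, "um wide;", i_d * 1e3, "mA")

# Consistency check: g_m = sqrt(2 k' (W/L) I_D) should return the 20 mS target.
print(math.sqrt(2.0 * k_prime * (w / l_eff) * i_d))
```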

As device dimensions are reduced, the required biasing current drops. Since cutoff frequency, fT , also improves with smaller device dimensions, MOSFET performance in RF applications will continue to improve. Power amplifiers, the last active circuit in a transmitter, have less stringent noise figure requirements than an LNA since the input signal is generated locally in the transmitter chain. Instead, linearity and PAE are more critical, particularly for variable-envelope communications protocols. RF amplifiers typically operate in class AB mode to compromise between efficiency and linearity.19 Power-added efficiency is defined as

$$
PAE = \eta \left( 1 - \frac{1}{G} \right)
\tag{1.30}
$$

where η is the drain (collector) efficiency (usually about 40 to 75%) and G is the amplifier power gain. This balance is highly sensitive to the precision of matching networks. Technologies such as GaAs, with its high-resistivity substrate, and SOI, with its insulating buried oxide, are best suited for integrated RF power amplifiers since they permit fabrication of low-loss, on-chip matching networks. Interconnects and Passive Components Passive components in analog and RF design have the essential role of providing biasing, energy storage, and signal level translation. As device technology has permitted a greater monolithic integration of active devices, a similar trend has appeared in passive components. The quality of on-chip passives, however, has lagged behind that of high-precision discrete components. Two characteristics are required of VLSI interconnects for RFICs: low-loss and integration of high-quality factor passives (capacitors and inductors). As discussed previously, resistive losses increase the overall noise figure, lead to decreased efficiency, and degrade the performance of on-chip passive components. Interconnect and device resistance are minimized by saliciding the gate and source/drain surfaces and appropriately scaling the metallization dimensions. Substrate coupling losses, which also degrade quality factors of integrated passives and can introduce substrate noise, are controlled by selecting a high-resistivity substrate such as GaAs or shielding the substrate with an insulating layer such as in SOI. In forming capacitors on-chip, two structures are available, using either interconnect layers or the MOS gate capacitance. Metal-insulator-metal (MIM) and dual-poly capacitors both derive a capacitance from a thin interlevel dielectric (ILD) layer deposited between the conducting plates. MIM capacitors


offer a higher Q than dual-poly capacitors since, even with silicidation, resistance of poly layers is higher than in metal. Both types can suffer from imprecision due to non-planarity in the ILD thickness, which results from process non-uniformity across the wafer. MOS gate capacitance is less subject to variation caused by dielectric non-uniformity since the gate oxide formation is tightly controlled and occurs before any back-end processing. MOS capacitors, however, are usually dismissed for high-Q applications out of concern for the highly resistive well forming the bottom plate electrode. Recent work, however, has shown that salicided MOS capacitors biased into strong inversion will achieve a Q of over 100 for applications in the range 900 MHz to 2 GHz.20

Inductors are essential elements of RFICs for biasing and matching, and on-chip integration translates to lower system cost and reduced effects from package parasitics. However, inductors also require a large die area and exhibit significant coupling losses with the substrate. In addition to degrading the inductor Q, substrate coupling results in the inductor becoming a source of substrate noise. The Q of an inductor can be defined21 as

$$Q = \frac{\omega L_s}{R_s} \cdot \text{Substrate loss factor} \cdot \text{Self-resonance factor} \qquad (1.31)$$

where $L_s$ is the nominal inductance and $R_s$ is the series resistance. The substrate loss factor approaches unity as the substrate resistance goes to either zero or infinity. This implies that the Q factor is improved if the substrate is either short- or open-circuited. Suspended inductors achieve an open-circuited substrate by etching the bulk silicon from under the inductor structure.22 Another approach has been to short-circuit the substrate by inserting grounding planes (ground shields).21
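The behavior of the substrate loss factor at the two extremes can be checked numerically. The sketch below evaluates Eq. (1.31) under an assumed lumped-element model that is not given in the text: the substrate is represented by an oxide capacitance `c_ox` in series with a parallel substrate resistance `r_si` and capacitance `c_si`, reduced to an equivalent shunt resistance `r_p`. The model form, the function names, and all component values are illustrative assumptions.

```python
import math

def parallel_substrate_resistance(omega, c_ox, c_si, r_si):
    """Equivalent shunt resistance R_p of the oxide/substrate network (assumed model)."""
    return 1.0 / (omega**2 * c_ox**2 * r_si) + r_si * (c_ox + c_si)**2 / c_ox**2

def substrate_loss_factor(omega, l_s, r_s, r_p):
    """Assumed form: R_p / (R_p + [(omega*L_s/R_s)^2 + 1] * R_s); tends to 1 as R_p grows."""
    q0 = omega * l_s / r_s
    return r_p / (r_p + (q0**2 + 1.0) * r_s)

def inductor_q(omega, l_s, r_s, loss_factor, self_resonance_factor=1.0):
    """Eq. (1.31): Q = (omega*L_s/R_s) * substrate loss factor * self-resonance factor."""
    return (omega * l_s / r_s) * loss_factor * self_resonance_factor

# Hypothetical 5-nH spiral at 1 GHz; none of these values come from the text.
OMEGA = 2.0 * math.pi * 1e9   # rad/s
L_S = 5e-9                    # H
R_S = 5.0                     # ohm
C_OX = 0.2e-12                # F
C_SI = 0.05e-12               # F

def loss_at(r_si):
    """Substrate loss factor for a given substrate resistance r_si (ohm)."""
    r_p = parallel_substrate_resistance(OMEGA, C_OX, C_SI, r_si)
    return substrate_loss_factor(OMEGA, L_S, R_S, r_p)

if __name__ == "__main__":
    # Near-shorted, intermediate, and near-open substrate
    for r_si in (1e-3, 636.0, 1e9):
        factor = loss_at(r_si)
        print(f"R_si = {r_si:10.3g} ohm  loss factor = {factor:.4f}  "
              f"Q = {inductor_q(OMEGA, L_S, R_S, factor):.2f}")
```

With these assumed values the loss factor is close to unity for the near-shorted and near-open substrates and dips for the intermediate resistance, which is the trend that motivates both ground shields and suspended inductors.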

Power Systems

Introduction

Power processing systems are those devoted to the conditioning, regulation, conversion, and distribution of electrical power. Voltage and current are considered the inputs, and the system transforms the input characteristics to the form required by the load. The distinguishing feature of these systems is the specialized active device structure (e.g., rectifier, thyristor, power bipolar or MOS transistor, IGBT) required to withstand the electrothermal stresses imposed by the system.

The label power integrated circuits (PICs) refers to the monolithic fabrication of a power semiconductor device along with standard VLSI electronics. At the system level, this integration has been made possible by digital control techniques and the development of mixed-signal ICs comprising analog sensing and digital logic. On the technology side, development of the power MOSFET and IGBT led to greatly simplified drive circuits and decreased complexity in the on-chip electronics.

Three types of PIC are identified: smart power, high-voltage ICs (HVICs), and discrete modules. Discrete modules are those in which individual ICs for power devices and control are packaged in a single carrier. Integration is at the package level rather than at the IC. Smart power adds a monolithic integration of analog protection circuitry to a standard power semiconductor device. The level of integration is quite low, but the power semiconductor device ratings are not disturbed by the other electronics. HVICs are different in that they begin from a standard VLSI process and accommodate the power semiconductor device by manufacturing changes. HVICs are singled out for further discussion as they are the most suitable for VLSI integration. Although the power semiconductor device ratings cannot achieve the levels of a discrete device, HVICs are available for ratings with currents of 50 to 100 A and voltages up to 1000 V.
Two critical technical issues faced in developing HVICs are the electrical isolation of high-power and low-voltage electronics, and the development of high-Q passive components (e.g., capacitors, inductors, transformers). In the following discussion, the characteristics of power semiconductor devices will not be considered, only the issues relating to HVIC integration.


Electrical Isolation

Three types of electrical isolation are available as illustrated in Fig. 1.11.23 In junction isolation, a p+ implant is added to form protective diodes with the n– epitaxial regions. The diodes are reverse-biased by applying a large negative voltage (~ –1000 V) to the substrate. Problems with this isolation include temperature-dependent diode leakage currents and the possibility of a dynamic turn-on of the diode. Additional stress to the isolation regions and interlevel dielectric is introduced when high-voltage interconnect crosses the isolation implants. The applied electric field in this situation can result in premature failure of the device.

A self-isolation technique can be chosen if all the devices are MOSFETs. When all devices are placed in individual wells (a twin-tub process), all channel regions are naturally isolated since current flow is near the oxide–semiconductor interface. The power semiconductor device and signal transistors are

FIGURE 1.11 High-voltage IC isolation technologies: junction, self-, and dielectric-isolation.


fabricated simultaneously in junction and self-isolation, resulting in a compromise in performance.24 In practice, bulk isolation techniques are a combination of junction and self-isolation since many HVICs exhibit dynamic surges in substrate carriers, corresponding to power device switching, that may result in latchup of low-voltage devices.25

Dielectric isolation decouples the fabrication of power and signal devices by reserving the bulk semiconductor for high-voltage transistors and introducing an epitaxial semiconductor layer for low-voltage devices on a dielectric surface. Formation of the buried oxide can be accomplished either by partially etching an SOI wafer to yield an intermittent SOI substrate or by selectively growing oxide on regions intended for low-voltage devices, followed by epitaxial deposition of a silicon film. In addition to providing improved isolation of power semiconductor devices, the buried oxide enhances the performance of signal transistors by reducing parasitic capacitances and chip area.26

Interconnects and Passive Components

A key objective in applying VLSI technology to power electronics is the reduction in system mass and volume. Current device technologies adequately provide the monolithic cofabrication of low-voltage and high-voltage electronics, and capacitors and inductors of sufficiently high Q are available to integrate substantial peripheral circuitry. As switching frequencies are increased in switching converters, passive component values are reduced, further improving the integration of power electronics. These integration capabilities have resulted from similar requirements in digital and analog/RF electronics. Conventional VLSI technologies, however, have yet to reproduce the same integration of magnetic materials needed for transformers. Magnetics integration is hindered by the incompatibility of magnetic materials with standard VLSI processes and concerns over contamination of devices.
A transformer cross-section is shown in Fig. 1.12 where the primary and secondary are wound as lateral coils with connections between the upper and lower conductors provided by vias.27

FIGURE 1.12 Cross-sectional view of a VLSI process showing integration of magnetic layers for coil transformers.

1.3 Emerging VLSI Systems

Building on the successes in integrating comprehensive systems from the individual signal processing domains, many next-generation hybrid systems are appearing that integrate systems in a cross-platform manner. The feasibility of these systems depends on the ability of technology to either combine monolithically the unique elements of each system or to introduce a single technology standard that can support broad systems.

Embedded Memory

Computer processors today are bandwidth limited, with the memory interface between high-capacity external storage and the processing units unable to meet access rates. Multimedia applications such as


3-D graphics rendering and broadcast rate video will demand bandwidths from 1 to 10 GB/s. A conventional 4-Mb DRAM with 4096 sense amplifiers, a 150-ns cycle time, and a configuration of 1-M × 4-b achieves an internal bandwidth of 3.4 GB/s; however, as this data can be accessed at the I/O pins only in 4-bit segments, external bandwidth is reduced to 0.1% of available bandwidth. Although complex cache (SRAM) hierarchies have been devised to mask this latency, each additional cache level introduces additional complexity and has an asymptotic performance limit. Since the bandwidth bottleneck is introduced by off-chip (multiplexed) routing of signals, embedded memories are appearing to provide full-bandwidth memory accesses by eliminating I/O multiplexing.

Integration of DRAM and logic is non-trivial as the technology optimization of the former favors minimal bit area, a compact capacitor structure, and low leakage, but in the latter favors device performance. Embedded DRAM therefore permits two implementations: logic fabricated in a DRAM process or DRAM fabricated as a macro in an ASIC process.

Logic in DRAM Process

DRAM technologies do not require multilevel interconnect since their regular array structure translates to very uniform routing. Introducing logic to a DRAM process has been accomplished by merging DRAM front-end fabrication (to capture optimized capacitor and well structures) with an ASIC back-end process (to introduce triple- or quad-level metallization). Example systems include a 576-Kb DRAM and 8-way hypercube processor28 and a 1-Mb DRAM and 4-element pixel processor.29

DRAM Macro in ASIC Process

DRAM macros currently developed for instantiation in an ASIC process implement a substrate-plate-trench-capacitor (SPT) by modifying deep trench isolation (DTI). Access transistor isolation is accomplished with a triple-well process to extract stray carriers injected as substrate noise.
Fabrication of the stacked-capacitor (STC) structures typical of high-capacity DRAMs cannot be performed in standard ASIC processes without significant modification; this limitation forces additional tradeoffs in the optimal balance of embedded memory capacity and fabrication cost. Example systems include an 8-Mb DRAM and 70-K gate sea-of-gates IC.30
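The internal and external bandwidth figures quoted above for the 4-Mb DRAM follow from simple arithmetic. The sketch below reproduces them, assuming each sense amplifier delivers one bit per 150-ns cycle and that decimal units (1 GB = 10^9 bytes) are intended; the function name is illustrative.

```python
def bandwidth_bytes_per_s(bits_per_cycle, cycle_time_s):
    """Bandwidth in bytes/s when bits_per_cycle bits are delivered every cycle."""
    return bits_per_cycle / cycle_time_s / 8.0

CYCLE_TIME_S = 150e-9     # 150-ns cycle time (from the text)
SENSE_AMPLIFIERS = 4096   # bits available internally per cycle (one per sense amp, assumed)
IO_WIDTH_BITS = 4         # 1-M x 4-b organization: 4 bits reach the pins per cycle

internal = bandwidth_bytes_per_s(SENSE_AMPLIFIERS, CYCLE_TIME_S)
external = bandwidth_bytes_per_s(IO_WIDTH_BITS, CYCLE_TIME_S)
ratio = external / internal

print(f"internal bandwidth: {internal / 1e9:.2f} GB/s")   # 3.41 GB/s
print(f"external bandwidth: {external / 1e6:.2f} MB/s")   # 3.33 MB/s
print(f"external/internal:  {ratio * 100:.2f} %")         # 0.10 %
```

The ratio is just the I/O width divided by the number of sense amplifiers (4/4096), which is why removing the I/O multiplexing recovers roughly three orders of magnitude in bandwidth.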

Monolithic RFICs

Technologies for monolithic RFICs are proceeding in two directions: all-CMOS and silicon bipolar. All-CMOS has the attractive advantages of ready integration of baseband analog and digital signal processing, compatibility with standard analog/RF CMOS processes, and better characteristics for low-power/low-voltage operation.31 CMOS is predicted to continue to offer suitable RF performance at operating supplies below 1 V as long as the threshold voltage is scaled accordingly. Silicon bipolar implementations, however, outperform CMOS in several key areas, particularly that of the receiver LNA.

A frequency-hopped spread-spectrum transceiver has been designed in a 1-micron CMOS process with applications to low-power microcell communications.32,33 A microcell application was chosen since the maximum output power of 20 mW limits the transmission range.

RFICs in bipolar and BiCMOS typically focus on the specific receiver components most improved by non-CMOS implementations. A common RFIC architecture is an LNA front-end followed by a downmixer. Interstage filtering and matching, if required, are provided off-chip. In CMOS, a 1-GHz RFIC achieved a conversion gain of 20 dB, an NF of 3.2 dB, an IP3 of 8 dBm, with a current drain of 9 mA from a 3-V supply.34 A similar architecture in BiCMOS at 1 GHz achieved a conversion gain of 16 dB, an NF of 2.2 dB, an IP3 of –10 dBm, with a current drain of 13 mA from a 5-V supply.35
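To put the two 1-GHz front-ends on a common footing, the sketch below computes the DC power each one draws from its supply and converts the 20-mW transmitter output to dBm. Only the currents, voltages, and the 20-mW figure come from the text; the helper names are illustrative.

```python
import math

def dc_power_mw(current_ma, supply_v):
    """DC power in mW drawn from the supply (mA x V = mW)."""
    return current_ma * supply_v

def mw_to_dbm(power_mw):
    """Convert a power in milliwatts to dBm."""
    return 10.0 * math.log10(power_mw)

cmos_mw = dc_power_mw(9, 3)       # CMOS front-end: 9 mA from 3 V
bicmos_mw = dc_power_mw(13, 5)    # BiCMOS front-end: 13 mA from 5 V

print(f"CMOS front-end DC power:   {cmos_mw} mW")
print(f"BiCMOS front-end DC power: {bicmos_mw} mW")
print(f"20 mW output power = {mw_to_dbm(20):.1f} dBm")
```

On these figures the CMOS front-end dissipates less than half the DC power of the BiCMOS one, while the BiCMOS design retains the lower noise figure.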

Single-Chip Sensors and Detectors

One of the most active areas of system development is in the field of “camera-on-a-chip” in which VLSI-compatible photodetector arrays are monolithically fabricated with digital and analog peripheral circuitry.


The older charge-coupled device (CCD) technology is being replaced by active-pixel sensing (APS) technologies in new systems since CCD has a higher per-cell capacitance.36 APS also offers a monolithic integration of analog-to-digital converters and digital control logic. On-chip digitization of the sensor outputs eliminates the additional power loss and noise contributions incurred in buffering an analog signal for off-chip processing. Conventional CMOS technology is expected to permit co-fabrication of APS sensors down to feature dimensions as small as 0.25 microns,37 although leakage currents will become a concern for low-power operation. Other low-power techniques that degrade APS performance, such as silicidation and thin-film devices, can be controlled by slight process modifications. Silicide blocks eliminate the opaque, low-resistance layer; and thin-film devices, such as those found in fully depleted SOI, can be avoided by opening windows through the SOI buried oxide to the bulk silicon.

MEMS

Electrothermal properties of MEMS suspended or cantilevered layers are finding applications in a variety of electromechanical systems. For example, suspended layers have been developed for a number of purposes, including microphones,38 accelerometers, and pressure sensors. By sealing a fluid within a MEMS cavity, pressure sensors can also serve as infrared detectors and temperature sensors. Analog feedback electronics monitor the deflection of a MEMS layer caused by the influence of external stresses. The applied control voltage provides an analog readout of the relative magnitude of the external stresses.

MEMS micropumps have been developed which have potential applications in medical products (e.g., drug delivery) and automotive systems (e.g., fuel injection).39 Flow rates of about 50 µl/min at 1-Hz cycling have been achieved, but the overall area required is still large (approximately 1 cm^2), presently limiting integration. Beyond macro applications, microfluidics has been proposed as a technique for integrated cooling of high-power and high-temperature ICs in which coolant is circulated within the substrate mass.

1.4 Alternative Technologies

The discussion so far has assumed a VLSI system to be comprised primarily of transistors, with the exception of non-electronic MEMS. Although this implementation has evolved unchallenged for nearly 50 years, several innovative technologies have progressed to the state that they are attracting serious attention as future competitors. Some, such as quantum computing, are still fundamentally electronic in nature. Others, like biological, DNA, and molecular computing, use living cells as elemental functional units.

The chief advantage of these technologies is the extreme power efficiency of a computation. Quantum computing achieves ultra-low-power operation by reducing logic operations to the change in an electron spin or an atomic ionization state. Biological and DNA computing exploit the energy efficiency of living cells, the product of a billion years of evolution. A second attribute is extremely fast computation owing to greatly reduced signal path lengths (to a molecular or atomic scale) and massively parallel (MP) simultaneous operations.

Quantum Computing

Quantum computing (QC) is concerned with the probabilistic nature of quantum states, by which a single atom can be used to “store” and “compare” multiple values simultaneously. QC has two distinct implementations: an electronic one based on the wave-function interaction of fixed adjacent atoms40 and a biochemical one based on mobile molecular interactions within a fluid medium.41 In fixed systems, atom placement and stimulation are accomplished with atomic force microscopy (AFM) and nuclear magnetic resonance (NMR), but performance as a system also requires the ability to


individually select and operate on an atom.42 Molecular systems avoid this issue by using a fluid medium as a method of introducing initial conditions and isolating the computation from the measurement. Computational redundancy then statistically removes measurement error and incorrect results generated at the fluid boundaries.

Recent work in algorithms has demonstrated that QC can solve two categories of problems more efficiently than a classical method by taking advantage of wave function indeterminate states. In search and factorization problems (involving a random search of N items), a classical solution requires O(N) steps, but a QC algorithm requires only O(√N); and binary parity computations can be improved from N steps to N/2 in a QC algorithm.43

Practical implementation of QC algorithms to very large data sets is currently limited by instrumentation. For example, a biochemical QC system using NMR to change spin polarity has signal frequencies of less than 1 kHz. Despite this low frequency, the available parallelism is expected to factor (with the O(√N) algorithm) a 400-digit number in one year: greater than 3 × 10^186 MOPS (megaoperations per second).44
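The gap between O(N) and O(√N) grows quickly with problem size. The sketch below compares order-of-magnitude step counts for an unstructured search of N items; constant factors are ignored, and the function names are illustrative rather than taken from any quantum computing library.

```python
import math

def classical_search_steps(n_items):
    """Worst-case number of items examined by a classical linear search."""
    return n_items

def quantum_search_steps(n_items):
    """Order-of-magnitude step count for an O(sqrt(N)) search (constants ignored)."""
    return math.isqrt(n_items)

def speedup(n_items):
    """Ratio of classical to quantum step counts."""
    return classical_search_steps(n_items) / quantum_search_steps(n_items)

for exponent in (6, 12, 18):
    n = 10 ** exponent
    print(f"N = 10^{exponent}: classical {classical_search_steps(n):.1e} steps, "
          f"quantum about {quantum_search_steps(n):.1e} steps, "
          f"ratio about {speedup(n):.1e}")
```

Because the ratio itself scales as √N, the advantage matters most for very large search spaces.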

DNA Computing

In DNA computing, a problem set is encoded into DNA strands which then, by nucleotide matching properties, perform MP search and comparison operations. Most comparison techniques to date rely on conformal mapping, but some reports appearing in the literature indicate that more powerful DNA algorithms are possible with non-conformal mapping and secondary protein interaction.45 The challenge is in developing a formal language to describe DNA computing, compounded by accounting for these secondary and tertiary effects, including protein structure and amino acid chemical properties.46 Initial work in formal language theory has shown that DNA computers can be made equivalent to a Turing machine.47

DNA can also provide extremely dense data storage, requiring about a trillionth of the volume required for an equivalent electronic memory: 10^12 DNA strands, each 1000 units long, are equal to 1000 Tbits.48 DNA computations employ up to 10^20 DNA strands (over 11 million TB), well beyond the capacity of any conventional data storage.
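The storage figures above can be reproduced with a few lines of arithmetic. The sketch below assumes one bit per unit of a strand, which is what the 1000-Tbit figure implies. The "over 11 million TB" figure appears to correspond to counting one bit per strand and using binary terabytes (2^40 bytes); both readings are computed so the assumption is explicit. All names are illustrative.

```python
BITS_PER_UNIT = 1          # assumption implied by 10^12 strands x 1000 units = 1000 Tbit
BYTES_PER_TB = 2 ** 40     # binary terabyte, assumed for the "11 million TB" figure

def pool_bits(strands, units_per_strand, bits_per_unit=BITS_PER_UNIT):
    """Total bits encoded by a pool of equal-length strands."""
    return strands * units_per_strand * bits_per_unit

def bits_to_tb(bits):
    """Convert a bit count to (binary) terabytes."""
    return bits / 8 / BYTES_PER_TB

# Storage example: 10^12 strands, each 1000 units long
storage_bits = pool_bits(10 ** 12, 1000)
print(f"storage pool: {storage_bits / 10 ** 12:.0f} Tbit")

# Computation example: 10^20 strands
one_bit_per_strand_tb = bits_to_tb(pool_bits(10 ** 20, 1))
full_length_tb = bits_to_tb(pool_bits(10 ** 20, 1000))
print(f"10^20 strands, 1 bit each:     {one_bit_per_strand_tb:.3g} TB")
print(f"10^20 strands, 1000 bits each: {full_length_tb:.3g} TB")
```

If every unit of every strand is counted, the capacity is a thousand times larger than the quoted figure, so the quoted number is best read as a conservative lower bound under these assumptions.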

Molecular Computing

Molecular computers are a mixed-signal system for performing logic functions (with possible subpicosecond switching) and signal detection with an ability to evolve and adapt to new conditions. Possible implementations include modulation of electron, proton, or photon mobility; electronic-conformation interactions; and tissue membrane interactions.49 Table 1.3 lists some of the architectures and applications. “Digital” cell interactions, such as found in quantum or DNA computing, provide the logic implementation. Analog processing is introduced by the nonlinear characteristics of the cell interaction with respect to light, electricity, magnetism, chemistry, or other external stimulus. Molecular systems are also thought to emulate a neural network that can be capable of signal enhancement and noise removal.50

TABLE 1.3 Summary of Some Architectures and Applications Possible from a Molecular Computing System

Mechanisms and Architectures                                     Applications
Light-energy transducing proteins                                Biosensors
Light-energy transducing proteins (with controlled switching)    Organic memory storage
Optoelectronic transducing                                       Pattern recognition and processing
Evolutionary structures                                          Adaptive control

References

1. McShane, E., Trivedi, M., Xu, Y., Khandelwal, P., Mulay, A., and Shenai, K., “Low-Power Systems on a Chip (SOC),” IEEE Circuits and Devices Magazine, vol. 14, no. 5, pp. 35-42, June 1998.
2. Laes, E., “Submicron CMOS Technology - The Enabling Tool for System-on-a-Chip Integration,” Alcatel Telecommunications Review, no. 2, pp. 130-137, 1996.
3. Ackland, B., “The Role of VLSI in Multimedia,” IEEE J. of Solid-State Circuits, vol. 29, no. 4, pp. 381-388, Apr. 1994.
4. Kuroda, I. and Nishitani, T., “Multimedia Processors,” Proc. of the IEEE, vol. 86, no. 6, pp. 1203-1221, Jun. 1998.
5. Clemens, J. T., “Silicon Microelectronics Technology,” Bell Labs Technical Journal, vol. 2, no. 4, pp. 76-102, Fall 1997.
6. Asai, S. and Wada, Y., “Technology Challenges for Integration Near and Below 0.1 Micron [Review],” Proc. of the IEEE, vol. 85, no. 4, pp. 505-520, Apr. 1997.
7. Tsividis, Y., Mixed Analog-Digital VLSI Devices and Technology: An Introduction, McGraw-Hill, New York, 1996.
8. Gonzalez, R., Gordon, B. M., and Horowitz, M. A., “Supply and Threshold Voltage Scaling for Low Power CMOS,” IEEE J. of Solid-State Circuits, vol. 32, no. 8, pp. 1210-1216, Aug. 1997.
9. Colinge, J.-P., Silicon-on-Insulator Technology: Materials to VLSI, 2nd edition, Kluwer Academic Publishers, Boston, 1997.
10. Assaderaghi, F., Sinitsky, D., Parke, S. A., Bokor, J., Ko, P. K., and Hu, C. M., “Dynamic Threshold-Voltage MOSFET (DTMOS) for Ultra-Low Voltage VLSI,” IEEE Trans. on Electron Devices, vol. 44, no. 3, pp. 414-422, Mar. 1997.
11. Licata, T. J., Colgan, E. G., Harper, J. M. E., and Luce, S. E., “Interconnect Fabrication Processes and the Development of Low-Cost Wiring for CMOS Products,” IBM J. of Research & Development, vol. 39, no. 4, pp. 419-435, Jul. 1995.
12. Itoh, K., Sasaki, K., and Nakagome, Y., “Trends In Low-Power RAM Circuit Technologies,” Proc. of the IEEE, vol. 83, no. 4, pp. 524-543, Apr. 1995.
13. Itoh, K., Nakagome, Y., Kimura, S., and Watanabe, T., “Limitations and Challenges of Multigigabit DRAM Chip Design,” IEEE J. of Solid-State Circuits, vol. 32, no. 5, pp. 624-634, May 1997.
14. Kim, K., Hwang, C.-G., and Lee, J. G., “DRAM Technology Perspective for Gigabit Era,” IEEE Trans. on Electron Devices, vol. 45, no. 3, pp. 598-608, Mar. 1998.
15. Cressler, J. D., “SiGe HBT Technology: A New Contender for Si-Based RF and Microwave Circuit Applications,” IEEE Trans. on Microwave Theory & Techniques, vol. 46, no. 5 Part 2, pp. 572-589, May 1998.
16. Hafizi, M., “New Submicron HBT IC Technology Demonstrates Ultra-Fast, Low-Power Integrated Circuits,” IEEE Trans. on Electron Devices, vol. 45, no. 9, pp. 1862-1868, Sept. 1998.
17. Wang, N. L. L., “Transistor Technologies for RFICs in Wireless Applications,” Microwave Journal, vol. 41, no. 2, pp. 98-110, Feb. 1998.
18. Huang, Q. T., Piazza, F., Orsatti, P., and Ohguro, T., “The Impact of Scaling Down to Deep Submicron on CMOS RF Circuits,” IEEE J. of Solid-State Circuits, vol. 33, no. 7, pp. 1023-1036, Jul. 1998.
19. Larson, L. E., “Integrated Circuit Technology Options for RFICs - Present Status and Future Directions,” IEEE J. of Solid-State Circuits, vol. 33, no. 3, pp. 387-399, Mar. 1998.
20. Hung, C.-M., Ho, Y. C., Wu, I.-C., and O, K., “High-Q Capacitors Implemented in a CMOS Process for Low-Power Wireless Applications,” IEEE Trans. on Microwave Theory & Techniques, vol. 46, no. 5 Part 1, pp. 505-511, May 1998.
21. Yue, C. P. and Wong, S. S., “On-Chip Spiral Inductors with Patterned Ground Shields for Si-Based RF IC’s,” IEEE J. of Solid-State Circuits, vol. 33, no. 5, pp. 743-752, May 1998.
22. Chang, J. Y.-C., Abidi, A. A., and Gaitan, M., “Large Suspended Inductors on Silicon and Their Use in a 2-µm CMOS RF Amplifier,” IEEE Electron Device Lett., vol. 14, pp. 246-248, May 1993.


23. Mohan, N., Undeland, T. M., and Robbins, W. P., Power Electronics: Converters, Applications, and Design, 2nd edition, John Wiley & Sons, New York, 1996.
24. Tsui, P. G. Y., Gilbert, P. V., and Sun, S. W., “A Versatile Half-Micron Complementary BiCMOS Technology for Microprocessor-Based Smart Power Applications,” IEEE Trans. on Electron Devices, vol. 42, no. 3, pp. 564-570, Mar. 1995.
25. Chan, W. W. T., Sin, J. K. O., and Wong, S. S., “A Novel Crosstalk Isolation Structure for Bulk CMOS Power ICs,” IEEE Trans. on Electron Devices, vol. 45, no. 7, pp. 1580-1586, Jul. 1998.
26. Baliga, J., “Power Semiconductor Devices for Variable-Frequency Drives,” Proc. of the IEEE, vol. 82, no. 8, pp. 1112-1122, Aug. 1994.
27. Mino, M., Yachi, T., Tago, A., Yanagisawa, K., and Sakakibara, K., “Planar Microtransformer With Monolithically-Integrated Rectifier Diodes For Micro-Switching Converters,” IEEE Trans. on Magnetics, vol. 32, no. 2, pp. 291-296, Mar. 1996.
28. Sunaga, T., Miyatake, H., Kitamura, K., Kogge, P. M., and Retter, E., “A Parallel Processing Chip with Embedded DRAM Macros,” IEEE J. of Solid-State Circuits, vol. 31, no. 10, pp. 1556-1559, Oct. 1996.
29. Watanabe, T., Fujita, R., Yanagisawa, K., Tanaka, H., Ayukawa, K., Soga, M., Tanaka, Y., Sugie, Y., and Nakagome, Y., “A Modular Architecture for a 6.4-Gbyte/S, 8-Mb DRAM-Integrated Media Chip,” IEEE J. of Solid-State Circuits, vol. 32, no. 5, pp. 635-641, May 1997.
30. Miyano, S., Numata, K., Sato, K., Yabe, T., Wada, M., Haga, R., Enkaku, M., Shiochi, M., Kawashima, Y., Iwase, M., Ohgata, M., Kumagai, J., Yoshida, T., Sakurai, M., Kaki, S., Yanagiya, N., Shinya, H., Furuyama, T., Hansen, P., Hannah, M., Nagy, M., Nagarajan, A., and Rungsea, M., “A 1.6 Gbyte/sec Data Transfer Rate 8 Mb Embedded DRAM,” IEEE J. of Solid-State Circuits, vol. 30, no. 11, pp. 1281-1285, Nov. 1995.
31. Bang, S. H., Choi, J., Sheu, B. J., and Chang, R. C., “A Compact Low-Power VLSI Transceiver for Wireless Communication,” IEEE Trans. on Circuits & Systems I: Fundamental Theory & Applications, vol. 42, no. 11, pp. 933-945, Nov. 1995.
32. Rofougaran, A., Chang, J. G., Rael, J., Chang, J. Y.-C., Rofougaran, M., Chang, P. J., Djafari, M., Ku, M. K., Roth, E. W., Abidi, A. A., and Samueli, H., “A Single-Chip 900-MHz Spread-Spectrum Wireless Transceiver in 1-micron CMOS - Part I: Architecture and Transmitter Design,” IEEE J. of Solid-State Circuits, vol. 33, no. 4, pp. 515-534, Apr. 1998.
33. Rofougaran, A., Chang, G., Rael, J. J., Chang, J. Y.-C., Rofougaran, M., Chang, P. J., Djafari, M., Min, J., Roth, E. W., Abidi, A. A., and Samueli, H., “A Single-Chip 900-MHz Spread-Spectrum Wireless Transceiver in 1-micron CMOS - Part II: Receiver Design,” IEEE J. of Solid-State Circuits, vol. 33, no. 4, pp. 535-547, Apr. 1998.
34. Rofougaran, A., Chang, J. Y. C., Rofougaran, M., and Abidi, A. A., “A 1 GHz CMOS RF Front-End IC for a Direct-Conversion Wireless Receiver,” IEEE J. of Solid-State Circuits, vol. 31, no. 7, pp. 880-889, Jul. 1996.
35. Meyer, R. G. and Mack, W. D., “A 1-GHz BiCMOS RF Front-End IC,” IEEE J. of Solid-State Circuits, vol. 29, no. 3, pp. 350-355, Mar. 1994.
36. Mendis, S., Kemeny, S. E., and Fossum, E. R., “CMOS Active Pixel Image Sensor,” IEEE Trans. on Electron Devices, vol. 41, no. 3, pp. 452-453, Mar. 1994.
37. Fossum, E. R., “CMOS Image Sensors - Electronic Camera-On-a-Chip,” IEEE Trans. on Electron Devices, vol. 44, no. 10, pp. 1689-1698, Oct. 1997.
38. Pederson, M., Olthuis, W., and Bergveld, P., “High-Performance Condenser Microphone with Fully Integrated CMOS Amplifier and DC-DC Voltage Converter,” IEEE J. of Microelectromechanical Systems, vol. 7, no. 4, pp. 387-394, Dec. 1998.
39. Benard, W. L., Kahn, H., Heuer, A. H., and Huff, M. A., “Thin-Film Shape-Memory Alloy Actuated Micropumps,” IEEE J. of Microelectromechanical Systems, vol. 7, no. 2, pp. 245-251, Jun. 1998.
40. Brassard, G., Chuang, I., Lloyd, S., and Monroe, C., “Quantum Computing,” Proc. of the National Academy of Sciences of the USA, vol. 95, no. 19, pp. 11032-11033, Sept. 15, 1998.


41. Wallace, R., Price, H., and Breitbeil, F., “Toward a Charge-Transfer Model of Neuromolecular Computing,” Int'l J. of Quantum Chemistry, vol. 69, no. 1, pp. 3-10, Jul. 1998.
42. Scarani, V., “Quantum Computing,” American J. of Physics, vol. 66, no. 11, pp. 956-960, Nov. 1998.
43. Grover, L. K., “Quantum Computing - Beyond Factorization And Search,” Science, vol. 281, no. 5378, pp. 792-794, Aug. 1998.
44. Gershenfeld, N. and Chuang, I. L., “Quantum Computing with Molecules,” Scientific American, vol. 278, no. 6, pp. 66-71, Jun. 1998.
45. Conrad, M. and Zauner, K. P., “DNA as a Vehicle for the Self-Assembly Model of Computing,” Biosystems, vol. 45, no. 1, pp. 59-66, Jan. 1998.
46. Rocha, A. F., Rebello, M. P., and Miura, K., “Toward a Theory of Molecular Computing,” J. of Information Sciences, vol. 106, pp. 123-157, 1998.
47. Kari, L., Paun, G., Rozenberg, G., Salomaa, A., and Yu, S., “DNA Computing, Sticker Systems, and Universality,” Acta Informatica, vol. 35, no. 5, pp. 401-420, May 1998.
48. Forbes, N. A. and Lipton, R. J., “DNA Computing - A Possible Efficiency Boost for Specialized Problems,” Computers in Physics, vol. 12, no. 4, pp. 304-306, Jul.-Aug. 1998.
49. Kampfner, R. R., “Integrating Molecular and Digital Computing - An Information Systems Design Perspective,” Biosystems, vol. 35, no. 2-3, pp. 229-232, 1995.
50. Rambidi, N. G., “Practical Approach to Implementation of Neural Nets at the Molecular Level,” Biosystems, vol. 35, no. 2-3, pp. 195-198, 1995.


Katsumata, Y. “CMOS/BiCMOS Technology” The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


2 CMOS/BiCMOS Technology

Yasuhiro Katsumata, Tatsuya Ohguro, Kazumi Inoh, Eiji Morifuji, Takashi Yoshitomi, Hideki Kimijima, Hideaki Nii, Toyota Morimoto, Hisayo S. Momose, Kuniyoshi Yoshikawa, Hidemi Ishiuchi
Toshiba Corporation

Hiroshi Iwai
Tokyo Institute of Technology

2.1 Introduction
2.2 CMOS Technology
    Device Structure and Basic Fabrication Process Steps • Key Process Steps in Device Fabrication • Passive Device for Analog Operation • Embedded Memory Technology
2.3 BiCMOS Technology
2.4 Future Technology
    Ultra-Thin Gate Oxide MOSFET • Epitaxial Channel MOSFET • Raised Gate/Source/Drain Structure
2.5 Summary

2.1 Introduction

Silicon LSIs (large-scale integrated circuits) have progressed remarkably in the past 25 years. In particular, complementary metal-oxide-semiconductor (CMOS) technology has played a great role in the progress of LSIs. By downsizing2 MOS field-effect-transistors (FETs), the number of transistors in a chip increases, and the functionality of LSIs is improved. At the same time, the switching speed of MOSFETs and circuits increases and the operation speed of LSIs is improved.

On the other hand, system-on-chip technology has come into widespread use and, as a result, the LSI system requires several functions, such as logic, memory, and analog functions. Moreover, the LSI system sometimes needs an ultra-high-speed logic or an ultra-high-frequency analog function. In some cases, bipolar-CMOS (BiCMOS) technology is very useful.

The first part of this chapter focuses on CMOS technology as the major LSI process technology, including embedded memory technology. The second part describes BiCMOS technology; and finally, future process technology is introduced.

2.2 CMOS Technology

Device Structure and Basic Fabrication Process Steps

Complementary MOS (CMOS) was first proposed by Wanlass and Sah in 1963.1 Although the CMOS process is more complex than the NMOS process, it provides both n-channel (NMOS) and p-channel (PMOS) transistors on the same chip, and CMOS circuits can achieve lower power consumption. Consequently, the CMOS process has been widely used as an LSI fabrication process. Figure 2.1 shows the structure of a CMOS device. Each FET consists of a gate electrode, source, drain, and channel, and gate bias controls carrier flow from source to drain through the channel layer.


FIGURE 2.1 Structure of CMOS device: (a) cross-sectional view of CMOS, (b) plan view of CMOS.

Figure 2.2 shows the basic fabrication process flow. The first process step is the formation of p tub and n tub (twin tub) in the silicon substrate. Because CMOS has two types of FETs, NMOS is formed in the p tub and PMOS in the n tub. The isolation process is the formation of field oxide in order to separate each MOSFET active area in the same tub. After that, impurity is doped into the channel region in order to adjust the threshold voltage, Vth, for each type of FET. The gate insulator layer, usually silicon dioxide (SiO2), is grown by thermal oxidation, because the interface state density between SiO2 and the silicon substrate is small. Polysilicon is deposited as gate electrode material and the gate electrode is patterned by reactive ion etching (RIE). The gate length, Lg, is the critical dimension because Lg determines the performance of MOSFETs and it should be small in order to improve device performance. Impurity is doped in the source and drain regions of MOSFETs by ion implantation. In this process step, gate electrodes act as a self-aligned mask to cover channel layers. After that, thermal annealing is carried out in order to activate the impurity of diffused layers. In the case of high-speed LSI, the self-aligned silicide (salicide) process is applied for the gate electrode and source and drain diffused layers in order to reduce parasitic resistance. Finally, the metallization process is carried out in order to form interconnect layers.
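The sequence described above can be summarized as an ordered list. The sketch below is only a compact restatement of the steps named in the text; the `ProcessStep` type, the step labels, and the `step_index` helper are illustrative and do not correspond to any process design kit or tool.

```python
from typing import NamedTuple

class ProcessStep(NamedTuple):
    name: str
    purpose: str

# Ordered as described in the text for the basic CMOS flow
CMOS_FLOW = [
    ProcessStep("twin-tub formation", "p tub for NMOS, n tub for PMOS"),
    ProcessStep("isolation", "field oxide separates active areas within a tub"),
    ProcessStep("channel doping", "adjust Vth for each type of FET"),
    ProcessStep("gate oxidation", "thermally grown SiO2 gate insulator"),
    ProcessStep("gate electrode", "polysilicon deposition and RIE patterning (sets Lg)"),
    ProcessStep("source/drain implantation", "ion implantation self-aligned to the gate"),
    ProcessStep("thermal annealing", "activate the implanted impurities"),
    ProcessStep("salicide", "reduce parasitic resistance (high-speed LSIs)"),
    ProcessStep("metallization", "form interconnect layers"),
]

def step_index(name):
    """Position of a named step in the flow (raises ValueError if absent)."""
    return [step.name for step in CMOS_FLOW].index(name)

if __name__ == "__main__":
    for number, step in enumerate(CMOS_FLOW, start=1):
        print(f"{number}. {step.name}: {step.purpose}")
```

The ordering constraint worth noting is that the gate electrode must be patterned before the source/drain implantation, since the gate itself serves as the implant mask.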

FIGURE 2.2 Basic process flow of CMOS.
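The ordering constraints in this flow can be made explicit with a short sketch. The step names and the PREREQS table are labels invented for this illustration, not terms from any process specification; they encode only dependencies stated in the text, such as the source/drain implant relying on the patterned gate as its self-aligned mask.

```python
# Illustrative sketch: the basic CMOS flow as an ordered list of steps.
# Step names and the PREREQS table are hypothetical labels for this example.
FLOW = [
    "tub_formation",
    "isolation",
    "channel_doping",
    "gate_oxidation",
    "gate_poly_deposition",
    "gate_patterning",
    "source_drain_implant",
    "activation_anneal",
    "salicide",        # optional; the text applies it to high-speed LSI
    "metallization",
]

# Each step maps to the steps that must already have been completed.
PREREQS = {
    "isolation": ["tub_formation"],
    "gate_oxidation": ["channel_doping"],
    "gate_poly_deposition": ["gate_oxidation"],
    "gate_patterning": ["gate_poly_deposition"],
    "source_drain_implant": ["gate_patterning"],  # gate is the self-aligned mask
    "activation_anneal": ["source_drain_implant"],
    "salicide": ["activation_anneal"],
    "metallization": ["activation_anneal"],
}


def order_is_valid(flow, prereqs=PREREQS):
    """Return True if every step in `flow` comes after all of its prerequisites."""
    position = {step: i for i, step in enumerate(flow)}
    for step, needed in prereqs.items():
        if step not in position:
            continue  # step omitted from this flow (e.g., optional salicide)
        for dep in needed:
            if dep not in position or position[dep] > position[step]:
                return False
    return True


print(order_is_valid(FLOW))
```

Swapping the gate patterning and source/drain implant steps makes the check fail, which reflects why the self-aligned scheme fixes their order.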

Key Process Steps in Device Fabrication

Starting Material

Almost all silicon crystals for LSI applications are prepared by the Czochralski crystal growth method,2 because it is advantageous for the formation of large wafers. (100)-orientation wafers are usually used for MOS devices because their interface trap density is smaller than those of the (111) and (110) orientations.3 Light doping of the substrate is convenient for the diffusion of the tub and reduces the parasitic capacitance between the silicon substrate and the tub region. As a starting material, a lightly doped (~10¹⁵ atoms/cm³) p-type substrate is generally used.

Tub Formation

Figure 2.3 shows the tub structures, which are classified into six types: p tub, n tub, twin tub,4 triple tub, twin tub with buried p+ and n+ layers, and twin tub on p-epi/p+ substrate. In the p tub process, NMOS is formed in a p diffusion (p tub) in the n substrate, as shown in Fig. 2.3(a). The p tub is formed by implantation and diffusion into the n substrate at a concentration high enough to overcompensate the n substrate.

FIGURE 2.3 Tub structures of CMOS: (a) p tub; (b) n tub; (c) twin tub; (d) triple tub; (e) twin tub with buried p+ and n+ layers; and (f) twin tub on p-epi/p+ substrate.


The other approach is to use an n tub.5 As shown in Fig. 2.3(b), NMOS is formed in the p substrate. Figure 2.3(c) shows the twin-tub structure,4 which uses two separate tubs implanted into the silicon substrate. In this case, the doping profile in each tub region can be controlled independently, and thus neither type of device suffers from excess doping. In some cases, such as mixed-signal LSIs, a deep n tub layer is optionally formed, as shown in Fig. 2.3(d), to prevent crosstalk noise between digital and analog circuits. In this structure, both n and p tubs are electrically isolated from the substrate or other tubs on the substrate. To realize high packing density, the tub design rule should be shrunk; however, an undesirable mechanism, the well-known latch-up, might then occur. Latch-up (i.e., the flow of high current between VDD and VSS) is caused by parasitic lateral pnp bipolar (L-BJT) and vertical npn bipolar (V-BJT) transistor actions,6 as shown in Fig. 2.3(a), and it sometimes destroys the functions of LSIs. The collector of each of these bipolar transistors feeds the other's base, and together they make up a pnpn thyristor structure. To prevent latch-up, it is important to reduce the current gain, hFE, of these parasitic bipolar transistors, which calls for a higher doping concentration in the tub region. As a result, device performance might be degraded because of large junction capacitances. To solve this problem, several techniques have been proposed, such as a p+ or n+ buried layer under the p tub,7 as shown in Fig. 2.3(e), the use of high-dose, high-energy boron p tub implants,8,9 and shunt resistance for the emitter-base junctions of the parasitic bipolar transistors.7,10,11 It is also effective to provide many well contacts to stabilize the well potential and hence suppress latch-up.
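The regenerative feedback in the pnpn structure can be illustrated with a first-order check. A commonly used simplification is that latch-up can be sustained only if the product of the two parasitic current gains is at least 1; shunt resistances across the emitter-base junctions raise the gain actually required, and this sketch ignores them. The gain values below are invented for illustration and are not taken from this chapter.

```python
def loop_gain(beta_npn, beta_pnp):
    """Product of the current gains (hFE) of the two parasitic bipolar transistors."""
    return beta_npn * beta_pnp


def can_sustain_latchup(beta_npn, beta_pnp):
    """First-order criterion: regeneration needs a gain product of at least 1.

    Simplified model; emitter-base shunt resistances are ignored.
    """
    return loop_gain(beta_npn, beta_pnp) >= 1.0


# Hypothetical gains before and after raising the tub doping.
light_tub = can_sustain_latchup(beta_npn=50.0, beta_pnp=0.5)
heavy_tub = can_sustain_latchup(beta_npn=5.0, beta_pnp=0.1)
print(light_tub, heavy_tub)
```

In this picture, higher tub doping and buried layers both act by pushing the gain product below the threshold.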
Recently, a substrate with p epitaxial silicon on a p+ substrate can also be used to stabilize the potential for high-speed logic LSIs.12

Isolation

Local oxidation of silicon (LOCOS)13 is a widely used isolation process because it allows channel-stop layers to be formed self-aligned to the active transistor area. It also has the advantage of recessing about half of the field oxide below the silicon surface, which makes the surface more planar. Figure 2.4 shows the LOCOS isolation process. First, silicon nitride and pad oxide are etched to define the active transistor area. After the channel-stop implantation shown in Fig. 2.4(a), the field oxide is selectively grown, typically to a thickness of several hundred nanometers. A disadvantage of LOCOS is that nitrogen from the masking silicon nitride layer sometimes forms a very thin nitride layer in the active region, which often impedes the subsequent growth of gate oxide and thereby lowers the breakdown voltage of the gate oxide. To prevent this problem, after the masking silicon nitride is stripped, a sacrificial oxide is grown and then removed before the gate oxidation process.14,15

FIGURE 2.4 Process for local oxidation of silicon: (a) after silicon nitride/pad oxide etch and channel-stop implant; (b) after field oxidation, which produces an oxynitride film on nitride.
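The remark that about half of the field oxide ends up below the original silicon surface follows from the silicon consumed during thermal oxidation. The sketch below assumes the common textbook ratio of roughly 0.44 nm of silicon consumed per nm of oxide grown; that ratio and the 400-nm example thickness are assumptions for illustration, not values from this chapter, and the pad oxide and bird's beak are ignored.

```python
# Assumed textbook ratio: silicon consumed per unit thickness of thermal oxide.
SI_CONSUMED_PER_OXIDE = 0.44


def field_oxide_split(t_ox_nm, ratio=SI_CONSUMED_PER_OXIDE):
    """Split a grown oxide thickness into the parts below and above
    the original silicon surface (both in nm)."""
    below = ratio * t_ox_nm
    above = t_ox_nm - below
    return below, above


below_nm, above_nm = field_oxide_split(400.0)
print(f"{below_nm:.0f} nm below and {above_nm:.0f} nm above the original surface")
```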


In addition, the lateral spread of field oxide (bird's beak)14 poses a problem regarding reduction of the distance between active transistor areas in order to realize high packing density. This lateral spread is suppressed by increasing the thickness of silicon nitride and/or decreasing the thickness of pad oxide. However, there is a tradeoff with the generation of dislocation of silicon. Recently, shallow trench isolation (STI)16 has become a major isolation process for advanced CMOS devices. Figure 2.5 shows the process flow of STI. After digging the trench into the substrate by RIE as shown in Fig. 2.5(a), the trench is filled with insulator such as silicon dioxide as shown in Fig. 2.5(b). Finally, by planarization with chemical mechanical polishing (CMP),17 filling material on the active transistor area is removed, as shown in Fig. 2.5(c).

FIGURE 2.5 Process flow of STI: (a) trenches are formed by RIE; (b) filling by deposition of SiO2; and (c) planarization by CMP.

STI is a useful technique for downsizing not only the distance between active areas, but also the active region itself. However, a mechanical stress problem18 still remains, and several methods have been proposed19 to deal with it.

Channel Doping

In order to adjust the threshold voltage of MOSFETs, Vth, to that required by a circuit design, a channel doping process is usually required. The doping is carried out by ion implantation, usually through a thin dummy oxide film (10 to 30 nm) thermally grown on the substrate to protect the surface from contamination, as shown in Fig. 2.6. This dummy oxide film is removed prior to the gate oxidation. Figure 2.7 shows a typical CMOS structure with channel doping. In this case, n+ polysilicon gate electrodes are used for both n- and p-MOSFETs; thus, this type of CMOS is called single-gate CMOS. The role of the channel doping is to raise the threshold voltage of n-MOSFETs. It is desirable to keep the concentration of the p tub low in order to reduce the junction capacitance of the source and drain; thus, channel doping with a p-type impurity, boron, is required. Drain-to-source leakage current in short-channel MOSFETs flows in a deeper path, as shown in Fig. 2.8; this is called the short-channel effect. Thus, heavy doping of the deeper region is effective in suppressing the short-channel effect. This doping is called deep ion implantation.

FIGURE 2.6 Channel doping process step.

FIGURE 2.7 Schematic cross-section of single-gate CMOS structure.

FIGURE 2.8 Leakage current flow in short-channel MOSFET.
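A rough feel for the dose needed for threshold adjustment comes from treating the implanted dopants as a sheet of charge at the oxide/silicon interface, which gives a shift of about q·D/Cox, where D is the dose and Cox the gate oxide capacitance per unit area. The sheet-charge approximation, the 7-nm oxide, and the 1e12 cm⁻² dose are assumptions chosen for illustration; real implant profiles have finite depth, so the actual shift differs.

```python
Q_E = 1.602e-19      # elementary charge, C
EPS_0 = 8.854e-14    # vacuum permittivity, F/cm
K_SIO2 = 3.9         # relative permittivity of SiO2


def cox_f_per_cm2(t_ox_nm):
    """Gate oxide capacitance per unit area (F/cm^2) for a SiO2 thickness in nm."""
    return K_SIO2 * EPS_0 / (t_ox_nm * 1e-7)


def vth_shift_v(dose_per_cm2, t_ox_nm):
    """Threshold shift magnitude (V) in the sheet-charge approximation.

    Acceptors (e.g., boron) shift Vth in the positive direction;
    donors shift it the opposite way.
    """
    return Q_E * dose_per_cm2 / cox_f_per_cm2(t_ox_nm)


shift = vth_shift_v(dose_per_cm2=1e12, t_ox_nm=7.0)
print(f"Approximate Vth shift: {shift:.2f} V")
```

Because the shift scales with oxide thickness at a fixed dose, thinner gate oxides need a higher dose for the same adjustment.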

In the case of a p-MOSFET with an n+ polysilicon gate electrode, the threshold voltage becomes too high in the negative direction if there is no channel doping. In order to adjust the threshold voltage, an ultra-shallow p-doped region is formed by the channel implantation of boron. This p-doped layer is often called a counter-doped layer or buried-channel layer, and p-MOSFETs with this structure are called buried-channel MOSFETs. (On the other hand, MOSFETs without a buried-channel layer are called surface-channel MOSFETs. The n-MOSFETs in this case are surface-channel MOSFETs.) In the buried-channel case, the short-channel effect is more severe; thus, deep implantation of an n-type impurity such as arsenic or phosphorus is necessary to suppress it. In deep-submicron gate length CMOS, it is difficult to suppress the short-channel effect,20 and thus a p+-polysilicon electrode is used for p-MOSFETs, as shown in Fig. 2.9. For n-MOSFETs, an n+-polysilicon electrode is used. This type of CMOS is therefore called dual-gate CMOS. In the case of the p+-polysilicon p-MOSFET, the threshold voltage becomes close to 0 V because of the difference in work function between n- and p-polysilicon gate electrodes,21–23 and thus a buried layer is not required. Instead, channel doping with an n-type impurity such as arsenic is required to raise the threshold voltage slightly in the negative direction. Impurity redistribution during high-temperature LSI manufacturing processes sometimes broadens the channel profile, which aggravates the short-channel effect. In order to suppress the redistribution, a dopant with a lower diffusion constant, such as indium, is used instead of boron.

FIGURE 2.9 Schematic cross-section of dual-gate CMOS structure.

For the purpose of realizing a high-performance transistor, it is important to reduce junction capacitance. To this end, a localized diffused channel structure,24,25 as shown in Fig. 2.10, has been proposed. Since the channel layer exists only around the gate electrode, the junction capacitance of the source and drain is reduced significantly.

FIGURE 2.10 Localized channel structure.

Gate Insulator

The gate dielectric determines several important properties of MOSFETs. It therefore requires uniform thickness, low film defect density, low fixed charge and interface state density at the dielectric/silicon interface, small interface roughness, high reliability with respect to time-dependent dielectric breakdown (TDDB) and hot-carrier-induced degradation, and high resistance to boron penetration (explained in this section). As a consequence of MOSFET downsizing, the gate dielectric has become thinner. Generally, the thickness of the gate oxide is 7 to 8 nm for 0.4-µm gate length MOSFETs, and 5 to 6 nm for 0.25-µm gate length MOSFETs. Silicon dioxide (SiO2) is commonly used for gate dielectrics, and can be formed by several methods, such as dry O2 oxidation26 and wet or steam (H2O) oxidation.26 The steam is produced by the reaction of H2 and O2 in the furnace ambient. Recently, H2O oxidation has been widely used for gate oxidation because of its good controllability of oxide thickness and high reliability. In the dual-gate CMOS structure shown in Fig. 2.9, boron penetration from the p+ gate electrode to the channel region through the gate silicon dioxide, which is described in the following section, is a problem. To prevent this problem, oxynitride has been used as the gate dielectric material.27,28 In general, the oxynitride gate dielectric is formed by annealing in NH3, NO (or N2O) after silicon oxidation, or by direct oxynitridation of silicon in an NO (or N2O) ambient. Figure 2.11 shows the typical nitrogen profile of the oxynitride gate dielectric. Recently, remote plasma nitridation29,30 has been much studied, and it is reported that the oxynitride gate dielectric grown by the remote plasma method shows better quality and reliability than that grown by the silicon nitridation method.
FIGURE 2.11 Oxygen, nitrogen, and silicon concentration profile of oxynitride gate dielectrics measured by AES.

In the regime of sub-quarter-micron CMOS devices, gate oxide thickness is approaching the limit set by tunneling current flow, at around 3 nm. In order to prevent tunneling current, high-κ materials, such as Si3N4 (ref. 31) and Ta2O5 (ref. 32), have been proposed in place of silicon dioxide. In these cases, the gate insulator can be kept relatively thick, because a high-κ insulator realizes high gate capacitance, and thus better driving capability.

Gate Electrode

Heavily doped polysilicon has been widely used for gate electrodes because of its resistance to high-temperature LSI fabrication processing. In order to reduce the resistance of the gate electrode, which contributes significantly to RC delay time, silicides of refractory metals have been put on the polysilicon electrode.33,34 Polycide,34 the technique of combining a refractory metal silicide on top of doped polysilicon, has the advantage of preserving good electrical and physical properties at the interface between polysilicon and the gate oxide while, at the same time, significantly reducing the sheet resistance of the gate electrode. For doping the gate polysilicon, ion implantation is usually employed. In the case of heavy doping, dopant penetration from boron-doped polysilicon to the Si substrate channel region through the gate oxide occurs in the high-temperature LSI fabrication process, as shown in Fig. 2.12. (On the other hand, penetration of an n-type dopant [such as phosphorus or arsenic] usually does not occur.) When the doping of impurities in the polysilicon is not sufficient, depletion of the gate electrode occurs, as shown in Fig. 2.13, resulting in a significant decrease in the drive capability of the transistor, as shown in Fig. 2.14.35 There is a tradeoff between boron penetration and gate electrode depletion, and so thermal process optimization is required.36
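Both the high-κ argument and the gate-depletion effect described above can be expressed as an equivalent SiO2 thickness, since capacitance per unit area is proportional to permittivity divided by thickness. The following sketch assumes approximate relative permittivities (3.9 for SiO2, 11.7 for silicon, about 7.5 for Si3N4); the film thicknesses and the depletion width are hypothetical, and quantum-mechanical corrections are ignored.

```python
K_SIO2 = 3.9    # relative permittivity of SiO2
K_SI = 11.7     # relative permittivity of silicon (depleted polysilicon)
K_SI3N4 = 7.5   # approximate value, assumed for illustration


def eot_nm(thickness_nm, kappa):
    """Equivalent SiO2 thickness of a layer with relative permittivity kappa."""
    return thickness_nm * K_SIO2 / kappa


def electrical_tox_nm(t_ox_nm, poly_depletion_nm=0.0):
    """Effective oxide thickness of a SiO2 gate stack when a depleted layer
    in the polysilicon gate adds a series capacitance."""
    return t_ox_nm + eot_nm(poly_depletion_nm, K_SI)


# A 6-nm nitride film gives the capacitance of a much thinner oxide.
nitride_eot = eot_nm(6.0, K_SI3N4)
# A hypothetical 1.5-nm depleted layer makes a 3-nm oxide look thicker.
depleted_tox = electrical_tox_nm(3.0, poly_depletion_nm=1.5)
print(f"{nitride_eot:.2f} nm, {depleted_tox:.2f} nm")
```

The second result shows why insufficient gate doping costs drive capability: the depleted layer lowers the effective gate capacitance even though the oxide itself is unchanged.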

FIGURE 2.12 Dopant penetration from boron-doped polysilicon to silicon substrate channel region.


FIGURE 2.13 Depletion of gate electrode in the case that the doping of impurities in the gate electrode is not sufficient.

FIGURE 2.14 ID , gm – VG characteristics for various thermal conditions. In the case of 800°C/30 min, a significant decrease in drive capability of transistor occurs because of the depletion of the gate electrode.

Gate length is one of the most important dimensions defining MOSFET performance; thus, the lithography process for gate electrode patterning requires high-resolution technology. Among optical sources, the g-line (wavelength 436 nm) and the i-line (365 nm) of a mercury lamp were popular. Recently, a higher-resolution process, excimer laser lithography, has been used. For the excimer laser process, KrF (248 nm)37 and ArF (193 nm)38 have been proposed and developed. For a 0.25-µm gate length electrode, the KrF excimer laser process is widely used in the production of devices. In addition, electron-beam39–41 and X-ray42 lithography techniques are being studied for sub-0.1-µm gate electrodes. For the etching of gate polysilicon, an RIE process with high selectivity of polysilicon over SiO2 is required, because the gate dielectric beneath the polysilicon is a very thin film in recent devices.

Source/Drain Formation

Source and drain diffused layers are formed by the ion implantation process. As a consequence of transistor downsizing, a higher electric field has been observed at the drain edge (the interface of the channel region and drain), where reverse-biased pn junctions exist. As a result, carriers crossing these junctions are suddenly accelerated and become hot carriers, which creates a serious reliability problem for MOSFETs.43 In order to prevent the hot-carrier problem, the lightly doped drain (LDD) structure has been proposed.44 The LDD process flow is shown in Fig. 2.15. After gate electrode formation, ion implantation is carried out to make extension layers, and the gate electrode plays the role of a self-aligned mask that covers the channel layer, as shown in Fig. 2.15(b). In general, arsenic is doped for the n-type extension of NMOS, and BF2 for the p-type extension of PMOS. To prevent the short-channel effect, the impurity profile of the extension layers must be very shallow. Although a shallow extension can be realized by ion implantation with a low dose, the resistivity of the extension layers then becomes higher and, thus, MOSFET characteristics degrade. Hence, it is very difficult to meet these two requirements at the same time. Also, impurity diffusion in the extension affects the short-channel effect significantly. Thus, it is necessary to minimize the thermal process after forming the extension.

FIGURE 2.15 Process flow of LDD structure: (a) after gate electrode patterning; (b) extension implantation; (c) sidewall spacer formation; and (d) source/drain implantation.

An insulating film, such as Si3N4 or SiO2, is then deposited by chemical vapor deposition. Next, an etch-back RIE treatment is performed on the whole wafer; as a result, the insulating film remains only on the sides of the gate electrode, as shown in Fig. 2.15(c). This remaining film is called a sidewall spacer. The spacer works as a self-aligned mask for the deep source/drain n+ and p+ doping, as shown in Fig. 2.15(d). In general, arsenic is doped for the deep source/drain of the n-MOSFET, and BF2 for the p-MOSFET. In the dual-gate CMOS process, the gate polysilicon is also doped in this step to prevent gate electrode depletion. After that, in order to activate the doped impurities electrically and to recover from implantation damage, an annealing process, such as rapid thermal annealing (RTA), is carried out. According to the MOSFET scaling law, when gate length and other dimensions are shrunk by a factor k, the diffusion depth also needs to be scaled by 1/k. Hence, the diffusion depth of the extension is required to be especially shallow. Several methods have been proposed for forming an ultra-shallow junction. For example, implantation at very low accelerating voltage, the plasma doping method,45 and implantation of heavy molecules, such as B10H14 for the p-type extension,46 are being studied.

Salicide Technique

As the vertical dimension of transistors is reduced with device downscaling, sheet resistance increases, both in the diffused layers, such as the source and drain, and in the polysilicon films, such as the gate electrode.
This is becoming a serious problem in the high-speed operation of integrated circuits. Figure 2.16 shows the dependence of the propagation delay (tpd) of CMOS inverters on the scaling factor, k, or gate length.47 These results were obtained by simulations in which two cases were considered. The first is the case in which the source and drain contacts with the metal line were made at the edge of the diffused layers, as illustrated in the figure inset. In an actual LSI layout, it often happens that the metal contact to the source or drain can be made only to a portion of the diffused layers, since many other signal or power lines cross the diffused layers. The other case is that in which the source and drain contacts cover the entire area of the source and drain layers, thus reducing diffused line resistance. It is clear that without a technique to reduce the diffused line resistance, tpd values cannot keep falling as transistor size is reduced; they will saturate at gate lengths of around 0.25 µm. In order to solve this problem of the high resistance of shallow diffused layers and thin polysilicon films, self-aligned silicide (salicide) structures for the source, drain, and gate have been proposed, as shown in Fig. 2.17.48–50

FIGURE 2.16 Dependence of the propagation delay (tpd) of CMOS inverters on the scaling factor, k, or gate length.

FIGURE 2.17 A typical process flow and schematic cross-section of salicide process: (a) MOSFET formation; (b) metal deposition; (c) silicidation by thermal annealing; and (d) removal of unreacted metal.
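The link between shallower junctions and rising diffused-line resistance can be sketched with the uniform-layer approximation, in which sheet resistance is resistivity divided by layer depth and a line's resistance is the sheet resistance times its number of squares. The resistivities, depths, and line geometry below are hypothetical round numbers chosen only to show the trend; they are not data from Fig. 2.16.

```python
def sheet_resistance(resistivity_ohm_cm, depth_nm):
    """Sheet resistance (ohms per square) of a uniform layer of the given depth."""
    return resistivity_ohm_cm / (depth_nm * 1e-7)


def line_resistance(r_sheet, length_um, width_um):
    """Resistance of a line: sheet resistance times the number of squares."""
    return r_sheet * (length_um / width_um)


RHO_DIFFUSION = 1.0e-3   # assumed resistivity of a doped layer, ohm*cm
RHO_SILICIDE = 2.0e-5    # assumed resistivity of a silicide film, ohm*cm

# Scaling the junction depth by 1/k raises the sheet resistance by k.
for k in (1, 2, 4):
    rs = sheet_resistance(RHO_DIFFUSION, 200.0 / k)
    print(f"k={k}: {rs:.0f} ohm/sq, 5-square line = {line_resistance(rs, 5.0, 1.0):.0f} ohm")

# A thin silicide layer on top gives a much lower sheet resistance.
print(f"silicide: {sheet_resistance(RHO_SILICIDE, 50.0):.0f} ohm/sq")
```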


First, a metal film such as Ti or Co is deposited on the surface of the MOSFET after formation of the polysilicon gate electrode, gate sidewall, and source and drain diffused layers, as shown in Fig. 2.17(b). The film is then annealed by rapid thermal annealing (RTA) in an inert ambient. During the annealing process, the areas of metal film in direct contact with silicon (that is, on the source, drain, and gate electrode) are selectively converted to silicide, while the other areas remain metal, as shown in Fig. 2.17(c). The remaining metal can be etched off with an acid solution such as H2O2 + H2SO4, leaving the silicide self-aligned with the source, drain, and gate electrode, as shown in Fig. 2.17(d). When the salicide process first came into use, furnace annealing was the most popular heat-treatment process;48–50 however, RTA51–53 replaced furnace annealing early on, because it is difficult to prevent small amounts of oxidant from entering through the furnace opening, and these degrade the silicide film significantly, since silicide metals are easily oxidized. RTA reduces this oxidation problem significantly, resulting in less deterioration of the film and consequently of its resistance. At present, TiSi2 (refs. 51–53) is widely used as a silicide in LSI applications. However, in the case of ultra-small geometry MOSFETs for VLSIs, the use of TiSi2 is subject to several problems. When the TiSi2 is made thick, a large amount of silicon is consumed during silicidation, and this results in problems of junction leakage at the source or drain. Conversely, if a thin layer of TiSi2 is chosen, agglomeration of the film occurs54 at higher silicidation temperatures. CoSi2 (ref. 55), on the other hand, has a large silicidation temperature window for low sheet resistance; hence, it is expected to be widely used as a silicidation material for advanced VLSI applications.47

Interconnect and Metallization

Aluminum is widely used as a wiring metal.
However, in the case of downsized CMOS, electromigration (EM)56 and stress migration (SM)57 become serious problems. In order to prevent these problems, AlCu (typically ~0.5 wt % Cu)58 is a useful wiring material. In addition, an ultra-shallow junction for downsized CMOS sometimes needs a barrier metal,58 such as TiN, between the metal and silicon in order to prevent junction leakage current. Figure 2.18 shows a cross-sectional view of a multi-layer metallization structure. As a consequence of CMOS downscaling, the contact or via aspect ratio becomes larger and, as a result, filling of the contact or via is not sufficient. Hence, new filling techniques, such as W-plug,59,60 are widely used. In addition, considering both reliability and low resistivity, Cu is a useful wiring material.61 When Cu is used, metal thickness can be reduced while realizing the same interconnect resistance. The reduction of the metal thickness is useful for reducing the capacitance between dense interconnect wires, resulting in high-speed operation of the circuit. In order to reduce the RC delay of wires in CMOS LSI, not only the wiring material but also the interlayer material is important. In particular, low-κ material62 is widely studied.

FIGURE 2.18 Cross-sectional view of multi-layer metallization.

In the case of Cu wiring, the dual damascene process63 is being widely studied because it is difficult to realize fine Cu patterns by reactive ion etching. Figure 2.19 shows the process flow of Cu dual damascene metallization. After formation of transistors and contact holes as shown in Fig. 2.19(a), barrier metal, such as TiN, and Cu are deposited as shown in Fig. 2.19(b). After planarization by CMP, Cu and barrier metal remain only in the contact holes, as shown in Fig. 2.19(c). An insulator, such as silicon dioxide, is deposited, and the grooves for the first metal wires are formed by reactive ion etching, as shown in Fig. 2.19(d). After the deposition of barrier metal and Cu as shown in Fig. 2.19(e), Cu and barrier metal remain only in the wiring grooves after a planarization process such as CMP, as shown in Fig. 2.19(f).

FIGURE 2.19 Typical process flow of Cu dual damascene.
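The earlier remark that Cu allows a thinner wire for the same interconnect resistance follows from the resistance of a rectangular wire, which is resistivity times length divided by cross-sectional area. The sketch below uses bulk room-temperature resistivities for Al and Cu; thin-film values in real interconnects are higher, and the wire dimensions are hypothetical.

```python
RHO_AL = 2.65e-6   # bulk resistivity of aluminum, ohm*cm (thin films are higher)
RHO_CU = 1.68e-6   # bulk resistivity of copper, ohm*cm (thin films are higher)


def wire_resistance(rho_ohm_cm, length_um, width_um, thickness_um):
    """Resistance (ohms) of a rectangular wire; dimensions in micrometers."""
    um_to_cm = 1e-4
    area_cm2 = (width_um * um_to_cm) * (thickness_um * um_to_cm)
    return rho_ohm_cm * (length_um * um_to_cm) / area_cm2


def equal_resistance_thickness(t_ref_um, rho_ref, rho_new):
    """Thickness of the new metal that matches the reference wire's resistance
    at the same length and width."""
    return t_ref_um * rho_new / rho_ref


t_al = 0.5  # hypothetical Al thickness, um
t_cu = equal_resistance_thickness(t_al, RHO_AL, RHO_CU)
r_al = wire_resistance(RHO_AL, 1000.0, 0.5, t_al)
r_cu = wire_resistance(RHO_CU, 1000.0, 0.5, t_cu)
print(f"Cu thickness {t_cu:.3f} um gives {r_cu:.0f} ohm vs {r_al:.0f} ohm for Al")
```

Under these assumptions the Cu wire is roughly one third thinner, which in a simple parallel-plate picture reduces the sidewall capacitance to neighboring wires by about the same fraction.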

Passive Device for Analog Operation

System-on-chip technology has come into widespread use and, as a result, an LSI system sometimes requires analog functions. In this case, analog passive devices should be integrated,64 as shown in Fig. 2.20. Resistors and capacitors already have good performance, even for high-frequency applications. On the other hand, it is difficult to realize a high-quality inductor on a silicon chip because of loss in the Si substrate, whose resistivity is lower than that of a compound semiconductor substrate such as GaAs. The relatively high sheet resistance of the aluminum wire used for high-density LSI is another problem. Recently, inductor quality has been improved by using thicker Al or Cu wire65 and by optimizing the substrate structure.66

FIGURE 2.20 Various passive devices for analog application.
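The benefit of thicker or lower-resistivity wire for on-chip inductors can be seen from the simplest quality-factor model, Q = ωL/R, which counts only the series wire resistance. This model deliberately ignores the substrate loss discussed above, so it gives an upper bound rather than a realistic value; the inductance, resistance, and frequency below are hypothetical.

```python
import math


def q_factor(inductance_nh, series_resistance_ohm, freq_ghz):
    """Quality factor of an inductor limited only by series resistance.

    Simplified model: substrate loss and self-resonance are ignored.
    """
    omega = 2.0 * math.pi * freq_ghz * 1e9
    return omega * inductance_nh * 1e-9 / series_resistance_ohm


q_thin = q_factor(inductance_nh=5.0, series_resistance_ohm=10.0, freq_ghz=2.0)
# Doubling the metal thickness roughly halves the series resistance.
q_thick = q_factor(inductance_nh=5.0, series_resistance_ohm=5.0, freq_ghz=2.0)
print(f"Q = {q_thin:.1f} (thin wire), {q_thick:.1f} (thick wire)")
```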

Embedded Memory Technology

Embedded DRAM

There has been strong motivation to merge DRAM cell arrays and logic circuits into a single silicon chip. This approach makes it possible to realize high bandwidth between memory and logic, low power consumption, and a small chip footprint.67 In order to merge logic and DRAM into a single chip, it is necessary to establish process integration for the embedded DRAM. Figure 2.21 shows a typical structure of embedded DRAM. However, the logic process and the DRAM process are not compatible with each other, and there are many variations and options in constructing a consistent process integration for the embedded DRAM.

Trench Capacitor Cell versus Stacked Capacitor Cell

There are two types of DRAM cell structure: the stacked capacitor cell68–73 and the trench capacitor cell.74,75


FIGURE 2.21 Schematic cross-section of the embedded DRAM, including DRAM cells and logic MOSFETs.

In trench cell technology, the cell capacitor process is completed before gate oxidation. Therefore, cell capacitor formation adds no thermal process after MOSFET formation. Another advantage of the trench cell is that there is little height difference between the cell array region and the peripheral circuit region.76–79 In the stacked capacitor cell, the height difference leads to high-aspect-ratio contact holes and difficulty in the planarization process after cell formation. The MOSFET formation steps are followed by the stacked capacitor formation steps, which include high-temperature process steps such as storage node insulator (SiO2/Si3N4) formation and Si3N4 deposition for the self-aligned contact formation. The salicide process for the source and drain of the MOSFETs should be carefully designed to endure these high-temperature process steps. Recently, high-permittivity films for capacitor insulators, such as Ta2O5 and BST, have been developed for commodity DRAM and embedded DRAM. The process temperature for Ta2O5 and BST is lower than that for SiO2/Si3N4, which means that process compatibility is better with such high-permittivity films.80–82

MOSFET Structure

The MOSFET structure in DRAMs is different from that in logic ULSIs. In recent DRAMs, the gate is covered with Si3N4 for the self-aligned contact process steps in bit-line contact formation. It is therefore very difficult to apply the salicide process to the gate, source, and drain at the same time. One solution is to apply the salicide process to the source and drain only. A comparison of the MOSFET structures is shown in Fig. 2.22. Tsukamoto et al.68 proposed another approach, namely the use of the W bit-line layer as the local interconnect in the logic portion.

FIGURE 2.22 Typical MOSFET structures for DRAM, embedded DRAM, and logic.


Gate Oxide Thickness

Generally, DRAM gate oxide thickness is greater than that of logic ULSIs. This is because the maximum voltage of the transfer gate in the DRAM cells is higher than VCC, the power supply voltage. In the logic ULSI, the maximum gate voltage is equal to VCC in most cases. To keep up with the MOSFET performance in logic ULSIs, the oxide thickness of the embedded DRAMs needs to be scaled down further than in the DRAM case. To do so, a highly reliable gate oxide and/or a new circuit scheme for word-line biasing, such as applying negative voltage to the cell transfer gate, is required. Another approach is to use thick gate oxide in the DRAM cell and thin gate oxide in the logic.83

Fabrication Cost per Wafer

Conventional logic ULSIs do not need the process steps for DRAM cell formation, while most DRAMs use only two layers of aluminum. An embedded DRAM needs both the cell formation steps and the additional metal layers of the logic process, which raises its wafer cost. Embedded DRAM chips are used only if the market can absorb the additional wafer cost for reasons such as high bandwidth, lower power consumption, small footprint, flexible memory configuration, and lower chip assembly cost.

Next-Generation Embedded DRAM

Process technology for the embedded DRAM with 0.18-µm or 0.15-µm design rules will include a state-of-the-art DRAM cell array and high-performance MOSFETs in the logic circuit. The embedded DRAM could be a technology driver because it contains most of the key process steps for DRAM and logic ULSIs.

Embedded Flash Memory Technology84

Recently, the importance of embedded flash technology has been increasing, and logic chips with nonvolatile functions have become indispensable for meeting various market requirements. Key issues in the selection of an embedded flash cell85 are (1) tunnel-oxide reliability (damage-less program/erase (P/E) mechanism), (2) process and transistor compatibility with CMOS logic, (3) fast read with low Vcc, (4) low power (especially in P/E), (5) simple control circuits, (6) fast program speed, and (7) cell size. This ordering depends greatly on the target device specification and memory density and, in general, is different from that of high-density stand-alone memories. NOR-type flash is essential, and EEPROM functionality is also required on the same chip. Figure 2.23 shows the typical device structure of a NOR-type flash memory with logic device.86

FIGURE 2.23 Device structure schematic view of the NOR flash memories with dual-gate Ti-salicide.

Process Technology87

To realize high-performance embedded flash chips, at least three kinds of gate insulators are required beyond the 0.25-µm regime in order to form the flash tunnel oxide, CMOS gate oxide, high-voltage transistor gate oxide, and I/O transistor gate oxide. Flash cells are usually made by a stacked gate process. Therefore, it is difficult to keep the cost below 150% of that of pure logic devices. The two different approaches to realizing embedded flash chips are memory-based and logic-based, as shown in Fig. 2.24.


FIGURE 2.24 Process modules.

The memory-based approach is advantageous in that it exploits the established flash reliability and yield guaranteed by memory mass production lines, but it is disadvantageous for realizing high-performance CMOS transistors because of the additional thermal budget of the flash process. In contrast, the logic-based approach can use fully CMOS-compatible transistors as they are; however, because of the lack of dedicated mass production lines, great effort is required to establish flash cell reliability and performance. Historically, memory-based embedded flash chips have been adopted, but logic-based chips have become more important recently. In general, the number of additional masks required to embed a flash cell into logic chips ranges from 4 to 9. For high-density embedded flash chips, the one-transistor stacked-gate cell using channel hot electron programming and channel FN tunneling erasing will be mainstream. For medium- or low-density, high-speed embedded flash chips, two-transistor cells will be important when the low-power P/E method is used. From the reliability point of view, a p-channel cell using band-to-band tunneling-induced electron injection88 and channel FN tunneling ejection is promising, since page-programmable EEPROM can also be realized by this mechanism.85

2.3 BiCMOS Technology The development of BiCMOS technology began in the early 1980s. In general, bipolar devices are attractive because of their high speed, better gain, better driving capability, and low wide-band noise properties that allow high-quality analog performance. CMOS is particularly attractive for digital applications because of its low power and high packing density. Thus, the combination would not only lead to the replacement and improvement of existing ICs, but would also provide access to completely new circuits. Figure 2.25 shows a typical BiCMOS structure.89 Generally, BiCMOS has a vertical npn bipolar transistor, a lateral pnp transistor, and CMOS on the same chip. Furthermore, if additional mask steps are

FIGURE 2.25 Cross-sectional view of BiCMOS structure.


FIGURE 2.26 Typical process flow of BiCMOS device.

allowed, passive devices can be integrated, as described in the previous section. The main feature of the BiCMOS structure is the buried layer, which exists because bipolar processes require an epitaxial layer grown on a heavily doped n+ subcollector to reduce collector resistance. Figure 2.26 shows a typical process flow for BiCMOS. This is the simplest arrangement for incorporating bipolar devices and is a kind of low-cost BiCMOS. Here, the BiCMOS process is completed with the minimum number of additional process steps required to form the npn bipolar device, transforming the CMOS baseline process into a full BiCMOS technology. For this purpose, many process steps are merged: the p tub of the n-MOSFET also serves as isolation for the bipolar devices, the n tub of the p-MOSFET is used for the collector, the n+ source and drain are used for the emitter regions and collector contacts, and the p+ source and drain of the PMOS device are shared with the extrinsic base contacts. Recently, there have been two significant uses of BiCMOS technology. One is the high-performance MPU,90 which exploits the high driving capability of the bipolar transistor; the other is mixed-signal products that utilize the excellent analog performance of the bipolar transistor, as shown in Table 2.1.

TABLE 2.1 Recent BiCMOS Structures


For the high-performance MPU, merged processes were commonly used, and the mature versions of such MPU products have since been replaced by CMOS LSIs. This application has become less popular as supply voltages have been reduced. Mixed-signal BiCMOS requires high performance, especially high fT and fmax and a low noise figure. Hence, a double-polysilicon structure with a silicon91 or SiGe92 base and trench isolation is used. The fabrication cost of BiCMOS is a serious problem; thus, a low-cost mixed-signal BiCMOS process93 has also been proposed.

2.4 Future Technology

In this section, advanced technologies for realizing future downsized CMOS devices are introduced.

Ultra-Thin Gate Oxide MOSFET

From a performance point of view, an ultra-thin gate oxide in the direct-tunneling regime is desirable for future LSIs.94 In this section, its potential and feasibility are discussed. Figure 2.27 shows a TEM cross-section of a 1.5-nm gate oxide. Figure 2.28 shows Id-Vd characteristics for 1.5-nm gate oxide MOSFETs with various gate lengths. In the long-channel case, unusual electrical characteristics were observed because of the significant tunneling leakage current through the gate oxide. However, the characteristics become normal as the gate length is reduced, because the gate leakage current decreases in proportion to the gate length while the drain current increases in inverse proportion to the gate length.95,96 Recently, very high drive currents of 1.8 mA/µm and very high transconductances of more than 1.1 S/mm have been reported using a 1.3-nm gate oxide at a supply voltage of 1.5 V.97 Such MOSFETs also operate well at low power and high speed with a low supply voltage in the 0.5-V range.98
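
The gate-length trend described above can be illustrated with a first-order scaling sketch. It assumes only the two proportionalities stated in the text (gate leakage proportional to Lg, drain current inversely proportional to Lg at fixed gate width and bias); the function name and the 10-µm reference point are illustrative choices, not values taken from the cited work.

```python
def relative_leakage_ratio(lg_um: float, ref_lg_um: float = 10.0) -> float:
    """Return Ig/Id at gate length lg_um, normalized to its value at ref_lg_um.

    Assumes Ig is proportional to Lg and Id is proportional to 1/Lg, so the
    ratio Ig/Id scales as Lg**2. Second-order effects such as velocity
    saturation are ignored in this sketch.
    """
    if lg_um <= 0 or ref_lg_um <= 0:
        raise ValueError("gate lengths must be positive")
    return (lg_um / ref_lg_um) ** 2


if __name__ == "__main__":
    # Gate lengths of the four panels in Fig. 2.28.
    for lg in (10.0, 5.0, 1.0, 0.1):
        ratio = relative_leakage_ratio(lg)
        print(f"Lg = {lg:5.1f} um -> relative Ig/Id = {ratio:.1e}")
```

Under these assumptions, shortening the gate from 10 µm to 0.1 µm lowers the leakage-to-drive ratio by four orders of magnitude, which is consistent with the long-channel devices being the ones that show abnormal characteristics.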

FIGURE 2.27 TEM cross-section of a 1.5-nm gate oxide film. Uniform oxide of 1.5-nm thickness is observed.

Figure 2.29 shows the dependence of the cutoff frequency, fT, of 1.5-nm gate oxide MOSFETs on gate length.99 Very high cutoff frequencies of more than 150 GHz were obtained at gate lengths in the sub-0.1-µm regime due to the high transconductance. Further, it was confirmed that the high transconductance offers promise of a good noise figure. Therefore, MOSFETs with ultra-thin gate oxides beyond the direct-tunneling limit have the potential to enable extremely high-speed digital circuit operation as well as high RF performance in analog


FIGURE 2.28 Id-Vd characteristics of 1.5-nm gate oxide MOSFETs with several gate lengths: (a) Lg = 10 µm; (b) Lg = 5 µm; (c) Lg = 1.0 µm; and (d) Lg = 0.1 µm.

FIGURE 2.29 Dependence of cutoff frequency (fT) on gate length (Lg) of 1.5-nm gate oxide MOSFETs.


applications. Fortunately, the hot-carrier and TDDB reliability of these ultra-thin gate oxides seems to be good.95,96,100 Thus, ultra-thin gate oxides are likely to be used in LSIs for certain applications. In actual applications, even though the leakage current of a single transistor may be very small, the combined leakage of the huge number of transistors in a ULSI circuit poses problems, particularly for battery backup operation.101,102 There is, however, the possibility of using these direct-tunneling gate oxide MOSFETs only for the small number of switches in the critical path that determines operation speed. Also, use of a slightly thicker oxide of 2.0 or 2.5 nm would significantly reduce leakage current. The use of these direct-tunneling gate oxide MOSFETs in LSI devices with a smaller scale of integration is another possibility.
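
The trade-off described above, between per-transistor leakage and chip-level standby current, can be sketched with simple arithmetic. The transistor counts and per-transistor leakage currents below are hypothetical placeholders chosen only to show the bookkeeping; they are not measured values from the cited references.

```python
def total_gate_leakage(n_thin: int, i_thin_a: float,
                       n_thick: int = 0, i_thick_a: float = 0.0) -> float:
    """Sum gate leakage (in amperes) over two transistor populations.

    n_thin, i_thin_a   : count and per-transistor leakage of direct-tunneling
                         (ultra-thin oxide) MOSFETs.
    n_thick, i_thick_a : count and per-transistor leakage of thicker-oxide
                         MOSFETs.
    """
    if n_thin < 0 or n_thick < 0:
        raise ValueError("transistor counts must be non-negative")
    return n_thin * i_thin_a + n_thick * i_thick_a


# Hypothetical numbers, for illustration only.
N_TOTAL = 10_000_000   # transistors on the chip (assumed)
N_CRITICAL = 100_000   # transistors on speed-critical paths (assumed)
I_THIN = 1e-9          # A per ultra-thin-oxide transistor (assumed)
I_THICK = 1e-12        # A per thicker-oxide transistor (assumed)

# Case 1: every transistor uses the ultra-thin oxide.
all_thin = total_gate_leakage(N_TOTAL, I_THIN)
# Case 2: ultra-thin oxide only on critical paths, thicker oxide elsewhere.
mixed = total_gate_leakage(N_CRITICAL, I_THIN, N_TOTAL - N_CRITICAL, I_THICK)

if __name__ == "__main__":
    print(f"all ultra-thin oxide : {all_thin:.3e} A")
    print(f"critical path only   : {mixed:.3e} A")
```

With these placeholder values the mixed design leaks roughly two orders of magnitude less than the all-thin-oxide design; the actual benefit depends entirely on the real leakage currents and on how many transistors sit on critical paths.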

Epitaxial Channel MOSFET

As design rules are scaled down, the supply voltage decreases. To obtain high drivability at a low supply voltage, a lower Vth is required. However, the substrate concentration must be higher in order to suppress the short-channel effects. The ideal channel profile therefore has a low concentration at the channel surface, to realize a lower Vth, and a higher concentration around the extension regions, to suppress the short-channel effects. It is difficult to realize such a channel profile by ion implantation because the implanted profile is very broad. The requirement can instead be met by forming non-doped epitaxial Si on a doped Si substrate.103 Figure 2.30 shows the process flow of MOSFETs with epitaxial Si channel, n channel, and p channel. The problem with this structure is the quality of the epitaxial Si, but degradation of the quality can be suppressed by wet treatment to clean the Si surface and by a heating process before epitaxial growth. Zero Vth can be realized while suppressing the short-channel effect even when the gate length is 0.1 µm, and a 20% improvement in drivability can be obtained.

FIGURE 2.30 The process flow of MOSFETs with epitaxial Si channel, n channel, and p channel.

Raised Gate/Source/Drain Structure

To realize high drivability in MOSFETs, it is necessary to reduce the resistance under the gate sidewall. However, as the sidewall becomes thinner, the short-channel effects become harder to control because the deep source and drain regions move closer together. If the junction depth is instead made shallower, junction leakage degrades because the distance between the bottom of the silicide and the junction becomes shorter. The raised gate/source/drain structure104 has been proposed as one way to resolve these problems. The structure is shown in Fig. 2.31.


FIGURE 2.31 Structure of raised source/drain/gate FET: (a) schematic cross-section; (b) TEM photograph.

Even if the junction depth measured from the original Si substrate surface becomes shallower, the distance between the bottom of the silicide and the junction remains constant with this structure. Additionally, the gate resistance becomes lower because the top of the gate electrode is T-shaped. Thus, low gate, source, and drain resistance can be realized while the short-channel effects and junction leakage current degradation are suppressed.
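
The geometric argument can be made concrete with a small sketch. It assumes a simple one-dimensional picture, which is an interpretation and not taken from the cited work: silicidation consumes a fixed thickness of silicon measured down from the top surface, and the raised layer adds silicon above the original substrate surface. All thicknesses and both function names are hypothetical.

```python
def silicide_to_junction_nm(xj_nm: float, consumed_si_nm: float,
                            raised_si_nm: float = 0.0) -> float:
    """Distance from the silicide bottom to the junction, in nm.

    xj_nm          : junction depth measured from the original Si surface.
    consumed_si_nm : Si thickness consumed by silicidation, from the top surface.
    raised_si_nm   : thickness of raised Si above the original surface
                     (0 for a conventional structure).
    A non-positive result means the silicide reaches the junction.
    """
    if min(xj_nm, consumed_si_nm, raised_si_nm) < 0:
        raise ValueError("thicknesses must be non-negative")
    return xj_nm + raised_si_nm - consumed_si_nm


def raised_si_for_margin(xj_nm: float, consumed_si_nm: float,
                         target_margin_nm: float) -> float:
    """Raised-Si thickness needed to keep a target silicide-to-junction distance."""
    return max(0.0, target_margin_nm - xj_nm + consumed_si_nm)


if __name__ == "__main__":
    CONSUMED_NM = 40.0  # hypothetical Si consumption during silicidation
    TARGET_NM = 60.0    # hypothetical margin to be preserved
    for xj in (100.0, 60.0, 30.0):
        conventional = silicide_to_junction_nm(xj, CONSUMED_NM)
        raised = raised_si_for_margin(xj, CONSUMED_NM, TARGET_NM)
        with_raise = silicide_to_junction_nm(xj, CONSUMED_NM, raised)
        print(f"xj = {xj:5.1f} nm: conventional margin = {conventional:6.1f} nm, "
              f"raised Si = {raised:5.1f} nm, margin with raise = {with_raise:5.1f} nm")
```

In this picture the conventional margin shrinks one-for-one with the junction depth, whereas a suitably thick raised layer holds it at the target value while the junction below the original surface is made shallower.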

2.5 Summary

This chapter has described CMOS and BiCMOS technology. CMOS is the most important device structure for realizing the higher-performance devices that will be required for multimedia and other demanding applications. However, certain problems are preventing the downsizing of device dimensions. The chapter described not only conventional technology but also advanced technology that has been proposed with a view to overcoming these problems. BiCMOS technology is also important, especially for mixed-signal applications. However, CMOS device performance has already been demonstrated for RF applications; thus, analog CMOS circuit technology will be very important for bringing analog CMOS into production.

References

1. Wanlass, F. M. and Sah, C. T., “Nanowatt Logic Using Field Effect Metal-Oxide Semiconductor Triode,” IEEE Solid State Circuits Conf., p. 32, Philadelphia, 1963.
2. Rea, S. N., “Czochralski Silicon Pull Rate Limits,” Journal of Crystal Growth, vol. 54, p. 267, 1981.
3. Sze, S. M., Physics of Semiconductor Devices, 2nd ed., Wiley, New York, 1981.
4. Parrillo, L. C., Payne, R. S., Davis, R. E., Reutlinger, G. W., and Field, R. L., “Twin-Tub CMOS — A Technology for VLSI Circuits,” IEEE International Electron Device Meeting 1980, p. 752, Washington D.C., 1980.
5. Ohzone, T., Shimura, H., Tsuji, K., and Hirano, “Silicon-Gate n-Well CMOS Process by Full Ion-Implantation Technology,” IEEE Trans. Electron Devices, vol. ED-27, p. 1789, 1980.
6. Krambeck, R. H., Lee, C. M., and Law, H. F. S., “High-Speed Compact Circuits with CMOS,” IEEE Journal of Solid State Circuits, vol. SC-17, p. 614, 1982.
7. Ochoa, A., Dawes, W., and Estreich, “Latchup Control in CMOS Integrated Circuits,” IEEE Trans. Nuclear Science, vol. NS-26(6), p. 5065, 1979.


8. Rung, R. D., Dell’Oca, C. J., and Walker, L. G., “A Retrograde p-Well for Higher Density CMOS,” IEEE Trans. Electron Devices, vol. ED-28, p. 1115, 1981.
9. Combs, S. R., “Scaleable Retrograde p-Well CMOS Technology,” IEEE International Electron Device Meeting 1981, p. 346, Washington D.C., 1981.
10. Schroeder, J. E., Ochoa Jr., A., and Dressendorfer, P. V., “Latch-up Elimination in Bulk CMOS LSI Circuits,” IEEE Trans. Nuclear Science, vol. NS-27, p. 1735, 1980.
11. Sakai, Y., Hayashida, T., Hashimoto, N., Mimato, O., Musuhara, T., Nagasawa, K., Yasui, T., and Tanimura, N., “Advanced Hi-CMOS Device Technology,” IEEE International Electron Device Meeting 1981, p. 534, Washington D.C., 1981.
12. de Werdt, R., van Attekum, P., den Blanken, H., de Bruin, L., op den Buijsch, F., Burgmans, A., Doan, T., Godon, H., Grief, M., Jansen, W., Jonkers, A., Klaassen, F., Pitt, M., van der Plass, P., Stomeijer, A., Verhaar, R., and Weaver, J., “A 1M SRAM with Full CMOS Cells Fabricated in a 0.7 µm Technology,” IEEE International Electron Device Meeting 1987, p. 532, Washington D.C., 1987.
13. Appels, J. A., Kooi, E., Paffen, M. M., Schlorje, J. J. H., and Verkuylen, W. H. C. G., “Local Oxidation of Silicon and its Application in Semiconductor Technology,” Philips Research Report, vol. 25, p. 118, 1970.
14. Shankoff, T. A., Sheng, T. T., Haszko, S. E., Marcus, R. B., and Smith, T. E., “Bird’s Beak Configuration and Elimination of Gate Oxide Thinning Produced During Selective Oxidation,” Journal of Electrochemical Society, vol. 127, p. 216, 1980.
15. Nakajima, S., Kikuchi, K., Minegishi, K., Araki, T., Ikuta, K., and Oda, M., “1 µm 256K RAM Process Technology Using Molybdenum-Polysilicon Gate,” IEEE International Electron Device Meeting 1981, p. 663, Washington D.C., 1981.
16. Fuse, G., Ogawa, H., Tateiwa, K., Nakano, I., Odanaka, S., Fukumoto, M., Iwasaki, H., and Ohzone, T., “A Practical Trench Isolation Technology with a Novel Planarization Process,” IEEE International Electron Device Meeting 1987, p. 732, Washington D.C., 1987.
17. Perry, K. A., “Chemical Mechanical Polishing: The Impact of a New Technology on an Industry,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 2, Honolulu, 1998.
18. Kuroi, T., Uchida, T., Horita, K., Sakai, M., Itoh, Y., Inoue, Y., and Nishimura, T., “Stress Analysis of Shallow Trench Isolation for 256M DRAM and Beyond,” IEEE International Electron Device Meeting 1998, p. 141, San Francisco, CA, 1998.
19. Matsuda, S., Sato, T., Yoshimura, H., Sudo, A., Mizushima, I., Tsumashima, Y., and Toyoshima, Y., “Novel Corner Rounding Process for Shallow Trench Isolation utilizing MSTS (Micro-Structure Transformation of Silicon),” IEEE International Electron Device Meeting 1998, p. 137, San Francisco, CA, 1998.
20. Hu, G. J. and Bruce, R. H., “Design Trade-off between Surface and Buried-Channel FETs,” IEEE Trans. Electron Devices, vol. 32, p. 584, 1985.
21. Cham, K. M., Wenocur, D. W., Lin, J., Lau, C. K., and Hu, H.-S., “Submicrometer Thin Gate Oxide p-Channel Transistors with p+ Poly-silicon Gates for VLSI Applications,” IEEE Electron Device Letters, vol. EDL-7, p. 49, 1986.
22. Amm, D. T., Mingam, H., Delpech, P., and d’Ouville, T. T., “Surface Mobility in p+ and n+ Doped Polysilicon Gate PMOS Transistors,” IEEE Trans. Electron Devices, vol. 36, p. 963, 1989.
23. Toriumi, A., Mizuno, T., Iwase, M., Takahashi, M., Niiyama, H., Fukumoto, M., Inaba, S., Mori, I., and Yoshimi, M., “High Speed 0.1 µm CMOS Devices Operating at Room Temperature,” Extended Abstract of 1992 International Conference on Solid State Devices and Materials, p. 487, Tsukuba, Japan, 1992.
24. Oyamatsu, H., Kinugawa, M., and Kakumu, M., “Design Methodology of Deep Submicron CMOS Devices for 1V operation,” 1993 Symposium on VLSI Technology, Digest of Technical Papers, p. 89, Kyoto, Japan, 1993.
25. Takeuchi, K., Yamamoto, T., Tanabe, A., Matsuki, T., Kunio, T., Fukuma, M., Nakajima, K., Aizaki, H., Miyamoto, H., and Ikawa, E., “0.15 µm CMOS with High Reliability and Performance,” IEEE International Electron Device Meeting 1993, p. 883, Washington D.C., 1993.


26. Ligenza, J. R. and Spitzer, W. G., “The Mechanism for Silicon Oxidation in Steam and Oxygen,” Journal of Phys. Chem. Solids, vol. 14, p. 131, 1960.
27. Morimoto, T., Momose, H. S., Ozawa, Y., Yamabe, K., and Iwai, H., “Effects of Boron Penetration and Resultant Limitations in Ultra Thin Pure-Oxide and Nitrided-Oxide,” IEEE International Electron Device Meeting 1990, p. 429, Washington D.C., 1990.
28. Uchiyama, A., Fukuda, H., Hayashi, T., Iwabuchi, T., and Ohno, S., “High Performance Dual-Gate Sub-Halfmicron CMOSFETs with 6nm-thick Nitrided SiO2 Films in an N2O Ambient,” IEEE International Electron Device Meeting 1990, p. 425, Washington D.C., 1990.
29. Rodder, M., Chen, I.-C., Hattangaly, S., and Hu, J. C., “Scaling to a 1.0V-1.5V, sub 0.1µm Gate Length CMOS Technology: Perspective and Challenges,” Extended Abstract of 1998 International Conference on Solid State Devices and Materials, p. 158, Hiroshima, Japan, 1998.
30. Rodder, M., Hattangaly, S., Yu, N., Shiau, W., Nicollian, P., Laaksonen, T., Chao, C. P., Mehrotra, M., Lee, C., Murtaza, S., and Aur, A., “A 1.2V, 0.1µm Gate Length CMOS Technology: Design and Process Issues,” IEEE International Electron Device Meeting 1998, p. 623, San Francisco, 1998.
31. Khare, M., Guo, X., Wang, X. W., and Ma, T. P., “Ultra-Thin Silicon Nitride Gate Dielectric for Deep-Sub-Micron CMOS Devices,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 51, Kyoto, Japan, 1997.
32. Yagishita, A., Saito, T., Nakajima, K., Inumiya, S., Akasaka, Y., Ozawa, Y., Minamihaba, G., Yano, H., Hieda, K., Suguro, K., Arikado, T., and Okumura, K., “High Performance Metal Gate MOSFETs Fabricated by CMP for 0.1 µm Regime,” IEEE International Electron Device Meeting 1998, p. 785, San Francisco, CA, 1998.
33. Murarka, S. P., Fraser, D. B., Shinha, A. K., and Levinstein, H. J., “Refractory Silicides of Titanium and Tantalum for Low-Resistivity Gates and Interconnects,” IEEE Trans. Electron Devices, vol. ED-27, p. 1409, 1980.
34. Geipel, H. J., Jr., Hsieh, N., Ishaq, M. H., Koburger, C. W., and White, F. R., “Composite Silicide Gate Electrode — Interconnections for VLSI Device Technologies,” IEEE Trans. Electron Devices, vol. ED-27, p. 1417, 1980.
35. Hayashida, H., Toyoshima, Y., Suizu, Y., Mitsuhashi, K., Iwai, H., and Maeguchi, K., “Dopant Redistribution in Dual Gate W-polycide CMOS and its Improvement by RTA,” 1989 Symposium on VLSI Technology, Digest of Technical Papers, p. 29, Kyoto, Japan, 1989.
36. Uwasawa, K., Mogami, T., Kunio, T., and Fukuma, M., “Scaling Limitations of Gate Oxide in p+ Polysilicon Gate MOS Structure for Sub-Quarter Micron CMOS Devices,” IEEE International Electron Device Meeting 1993, p. 895, Washington D.C., 1993.
37. Ozaki, T., Azuma, T., Itoh, M., Kawamura, D., Tanaka, S., Ishibashi, Y., Shiratake, S., Kyoh, S., Kondoh, T., Inoue, S., Tsuchida, K., Kohyama, Y., and Onishi, Y., “A 0.15µm KrF Lithography for 1Gb DRAM Product Using High Printable Patterns and Thin Resist Process,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 84, Honolulu, 1998.
38. Hirukawa, S., Matsumoto, K., and Takemasa, K., “New Projection Optical System for Beyond 150 nm Patterning with KrF and ArF Sources,” Proceedings of 1998 International Symposium on Optical Science, Engineering, and Instrumentation, SPIE’s 1998 Annual Meeting, p. 414, 1998.
39. Toriumi, A. and Iwase, M., “Lower Submicrometer MOSFETs Fabricated by Direct EB Lithography,” Extended Abstract of the 19th Conference on Solid State Devices and Materials, p. 347, Tokyo, Japan, 1987.
40. Liddle, J. A. and Berger, S. D., “Choice of System Parameters for Projection Electron-Beam Lithography: Accelerating Voltage and Demagnification Factor,” Journal of Vacuum Science and Technology, vol. B10(6), p. 2776, 1992.
41. Nakajima, K., Yamashita, H., Kojima, Y., Tamura, T., Yamada, Y., Tokunaga, K., Ema, T., Kondoh, K., Onoda, N., and Nozue, H., “Improved 0.12µm EB Direct Writing for Gbit DRAM Fabrication,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 34, Honolulu, 1998.
42. Deguchi, K., Miyoshi, K., Ban, H., Kyuragi, H., Konaka, S., and Matsuda, T., “Application of X-ray Lithography with a Single-Layer Resist Process to Subquartermicron LSI Fabrication,” Journal of Vacuum Science and Technology, vol. B10(6), p. 3145, 1992.


43. Matsuoka, F., Iwai, H., Hayashida, H., Hama, K., Toyoshima, Y., and Maeguchi, K., “Analysis of Hot Carrier Induced Degradation Mode on pMOSFETs,” IEEE Trans. Electron Devices, vol. ED-37, p. 1487, 1990.
44. Ogura, S., Chang, P. J., Walker, W. W., Critchlow, D. L., and Shepard, J. F., “Design and Characteristics of the Lightly-Doped Drain-Source (LDD) Insulated Gate Field Effect Transistor,” IEEE Trans. Electron Devices, vol. ED-27, p. 1359, 1980.
45. Ha, J. M., Park, J. W., Kim, W. S., Kim, S. P., Song, W. S., Kim, H. S., Song, H. J., Fujihara, K., Lee, M. Y., Felch, S., Jeong, U., Groeckner, M., Kim, K. H., Kim, H. J., Cho, H. T., Kim, Y. K., Ko, D. H., and Lee, G. C., “High Performance pMOSFET with BF3 Plasma Doped Gate/Source/Drain and A/D Extension,” IEEE International Electron Device Meeting 1998, p. 639, San Francisco, 1998.
46. Goto, K., Matsuo, J., Sugii, T., Minakata, H., Yamada, I., and Hisatsugu, T., “Novel Shallow Junction Technology Using Decaborane (B10H14),” IEEE International Electron Device Meeting 1996, p. 435, San Francisco, 1996.
47. Ohguro, T., Nakamura, S., Saito, M., Ono, M., Harakawa, H., Morifuji, E., Yoshitomi, T., Morimoto, T., Momose, H. S., Katsumata, Y., and Iwai, H., “Ultra-shallow Junction and Salicide Technique for Advanced CMOS Devices,” Proceedings of the Sixth International Symposium on Ultralarge Scale Integration Science and Technology, Electrochemical Society, p. 275, May 1997.
48. Osburn, C. M., Tsai, M. Y., and Zirinsky, S., “Self-Aligned Silicide Conductors in FET Integrated Circuits,” IBM Technical Disclosure Bulletin, vol. 24, p. 1970, 1981.
49. Shibata, T., Hieda, K., Sato, M., Konaka, M., Dang, R. L. M., and Iizuka, H., “An Optimally Designed Process for Submicron MOSFETs,” IEEE International Electron Device Meeting 1981, p. 647, Washington D.C., 1981.
50. Ting, C. Y., Iyer, S. S., Osburn, C. M., Hu, G. J., and Schweighart, A. M., “The Use of TiSi2 in a Self-Aligned Silicide Technology,” Proceedings of 1st International Symposium on VLSI Science and Technology, Electrochemical Society Meeting, vol. 82(7), p. 224, 1982.
51. Haken, R. A., “Application of the Self-Aligned Titanium Silicide Process to Very Large Scale Integrated N-Metal-Oxide-Semiconductor and Complementary Metal-Oxide-Semiconductor Technologies,” Journal of Vacuum Science and Technology, vol. B3(6), p. 1657, 1985.
52. Kobayashi, N., Hashimoto, N., Ohyu, K., Kaga, T., and Iwata, S., “Comparison of TiSi2 and WSi2 for Sub-Micron CMOSs,” 1986 Symposium on VLSI Technology, Digest of Technical Papers, p. 49, 1986.
53. Ho, V. Q. and Poulin, D., “Formation of Self-Aligned TiSi2 for VLSI Contacts and Interconnects,” Journal of Vacuum Science and Technology, vol. A5, p. 1396, 1987.
54. Ting, C. H., d’Heurle, F. M., Iyer, S. S., and Fryer, P. M., “High Temperature Process Limitation on TiSi2,” Journal of Electrochemical Society, vol. 133(12), p. 2621, 1986.
55. Osburn, C. M., Tsai, M. Y., Roberts, S., Lucchese, C. J., and Ting, C. Y., “High Conductivity Diffusions and Gate Regions Using a Self-Aligned Silicide Technology,” Proceedings of 1st International Symposium on VLSI Science and Technology, Electrochemical Society, vol. 82-1, p. 213, 1982.
56. Kwok, T., “Effect of Metal Line Geometry on Electromigration Lifetime in Al-Cu Submicron Interconnects,” 26th Annual Proceedings of Reliability Physics 1988, p. 185, 1988.
57. Owada, N., Hinode, K., Horiuchi, M., Nishida, T., Nakata, K., and Mukai, K., “Stress Induced Slit-Like Void Formation in a Fine-Pattern Al-Si Interconnect during Aging Test,” 1985 Proceedings of the 2nd International IEEE VLSI Multilevel Interconnection Conference, p. 173, 1985.
58. Kikkawa, T., Aoki, H., Ikawa, E., and Drynan, J. M., “A Quarter-Micrometer Interconnection Technology Using a TiN/Al-Si-Cu/Al-Si-Cu/TiN/Ti Multilayer Structure,” IEEE Trans. Electron Devices, vol. ED-40, p. 296, 1993.
59. White, F., Hill, W., Eslinger, S., Payne, E., Cote, W., Chen, B., and Johnson, K., “Damascene Stud Local Interconnect in CMOS Technology,” IEEE International Electron Device Meeting 1992, p. 301, San Francisco, 1992.
60. Kobayashi, N., Suzuki, M., and Saitou, M., “Tungsten Plug Technology: Substituting Tungsten for Silicon Using Tungsten Hexafluoride,” Extended Abstract of 1988 International Conference on Solid State Devices and Materials, p. 85, 1988.


61. Cote, W., Costrini, G., Edelstein, D., Osborn, C., Poindexter, D., Sardesai, V., and Bronner, G., “An Evaluation of Cu Wiring in a Production 64Mb DRAM,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 24, Honolulu, 1998.
62. Loke, A. L. S., Wetzel, J., Ryu, C., Lee, W.-J., and Wong, S. S., “Copper Drift in Low-κ Polymer Dielectrics for ULSI Metallization,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 26, Honolulu, 1998.
63. Wada, J., Oikawa, Y., Katata, T., Nakamura, N., and Anand, M. B., “Low Resistance Dual Damascene Process by Al Reflow Using Nb Liner,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 48, Honolulu, 1998.
64. Momose, H. S., Fujimoto, R., Ohtaka, S., Morifuji, E., Ohguro, T., Yoshitomi, T., Kimijima, H., Nakamura, S., Morimoto, T., Katsumata, Y., Tanimoto, H., and Iwai, H., “RF Noise in 1.5 nm Gate Oxide MOSFETs and the Evaluation of the NMOS LNA Circuit Integrated on a Chip,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 96, Honolulu, 1998.
65. Burghartz, J. N., “Progress in RF Inductors on Silicon — Understanding Substrate Loss,” IEEE International Electron Device Meeting 1998, p. 523, San Francisco, 1998.
66. Yoshitomi, T., Sugawara, Y., Morifuji, E., Ohguro, T., Kimijima, H., Morimoto, T., Momose, H. S., Katsumata, Y., and Iwai, H., “On-Chip Inductors with Diffused Shield Using Channel-Stop Implant,” IEEE International Electron Device Meeting 1998, p. 540, San Francisco, 1998.
67. Borel, J., “Technologies for Multimedia Systems on a Chip,” International Solid State Circuit Conference, Digest of Technical Papers, p. 18, 1997.
68. Tsukamoto, M., Kuroda, H., and Okamoto, Y., “0.25µm W-polycide Dual Gate and Buried Metal on Diffusion Layer (BMD) Technology for DRAM-Embedded Logic Devices,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 23, 1997.
69. Itabashi, K., Tsuboi, S., Nakamura, H., Hashimoto, K., Futoh, W., Fukuda, K., Hanyu, I., Asai, S., Chijimatsu, T., Kawamura, E., Yao, T., Takagi, H., Ohta, Y., Karasawa, T., Iio, H., Onoda, M., Inoue, F., Nomura, H., Satoh, Y., Higashimoto, M., Matsumiya, M., Miyabo, T., Ikeda, T., Yamazaki, T., Miyajima, M., Watanabe, K., Kawamura, S., and Taguchi, M., “Fully Planarized Stacked Capacitor Cell with Deep and High Aspect Ratio Contact Hole for Giga-bit DRAM,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 21, 1997.
70. Kim, K. N., Lee, J. Y., Lee, K. H., Noh, B. H., Nam, S. W., Park, Y. S., Kim, Y. H., Kim, H. S., Kim, J. S., Park, J. K., Lee, K. P., Lee, K. Y., Moon, J. T., Choi, J. S., Park, J. W., and Lee, J. G., “Highly Manufacturable 1Gb SDRAM,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 10, 1997.
71. Kohyama, Y., Ozaki, T., Yoshida, S., Ishibashi, Y., Nitta, H., Inoue, S., Nakamura, K., Aoyama, T., Imai, K., and Hayasaka, N., “A Fully Printable, Self-Aligned and Planarized Stacked Capacitor DRAM Cell Technology for 1Gbit DRAM and Beyond,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 17, 1997.
72. Drynan, J. M., Nakajima, K., Akimoto, T., Saito, K., Suzuki, M., Kamiyama, S., and Takaishi, Y., “Cylindrical Full Metal Capacitor Technology for High-Speed Gigabit DRAMs,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 151, 1997.
73. Takehiro, S., Yamauchi, S., Yoshimura, M., and Onoda, H., “The Simplest Stacked BST Capacitor for the Future DRAMs Using a Novel Low Temperature Growth Enhanced Crystallization,” 1997 Symposium on VLSI Technology, Digest of Technical Papers, p. 153, 1997.
74. Nesbit, L., Alsmeier, J., Chen, B., DeBrosse, J., Fahey, P., Gall, M., Gambino, J., Gerhard, S., Ishiuchi, H., Kleinhenz, R., Mandelman, J., Mii, T., Morikado, M., Nitayama, A., Parke, S., Wong, H., and Bronner, G., “A 0.6µm² 256Mb Trench DRAM Cell with Self-Aligned BuriEd STrap (BEST),” IEEE International Electron Device Meeting, p. 627, Washington D.C., 1993.
75. Bronner, G., Aochi, H., Gall, M., Gambino, J., Gernhardt, S., Hammerl, E., Ho, H., Iba, J., Ishiuchi, H., Jaso, M., Kleinhenz, R., Mii, T., Narita, M., Nesbit, L., Neumueller, W., Nitayama, A., Ohiwa, T., Parke, S., Ryan, J., Sato, T., Takato, H., and Yoshikawa, S., “A Fully Planarized 0.25µm CMOS Technology for 256Mbit DRAM and Beyond,” 1995 Symposium on VLSI Technology, Digest of Technical Papers, p. 15, 1995.


76. Ishiuchi, H., Yoshida, Y., Takato, H., Tomioka, K., Matsuo, K., Momose, H., Sawada, S., Yamazaki, K., and Maeguchi, K., “Embedded DRAM Technologies,” IEEE International Electron Device Meeting, p. 33, Washington D.C., 1997.
77. Togo, M., Iwao, S., Nobusawa, H., Hamada, M., Yoshida, K., Yasuzato, N., and Tanigawa, T., “A Salicide-Bridged Trench Capacitor with a Double-Sacrificial-Si3N4-Sidewall (DSS) for High-Performance Logic-Embedded DRAMs,” IEEE International Electron Device Meeting, p. 37, Washington D.C., 1997.
78. Crowder, S., Stiffler, S., Parries, P., Bronner, G., Nesbit, L., Wille, W., Powell, M., Ray, A., Chen, B., and Davari, B., “Trade-offs in the Integration of High Performance Devices with Trench Capacitor DRAM,” IEEE International Electron Device Meeting, p. 45, Washington D.C., 1997.
79. Crowder, S., Hannon, R., Ho, H., Sinitsky, D., Wu, S., Winstel, K., Khan, B., Stiffler, S. R., and Iyer, S. S., “Integration of Trench DRAM into a High-Performance 0.18 µm Logic Technology with Copper BEOL,” IEEE International Electron Device Meeting, p. 1017, San Francisco, 1998.
80. Yoshida, M., Kumauchi, T., Kawakita, K., Ohashi, N., Enomoto, H., Umezawa, T., Yamamoto, N., Asano, I., and Tadaki, Y., “Low Temperature Metal-based Cell Integration Technology for Gigabit and Embedded DRAMs,” IEEE International Electron Device Meeting, p. 41, Washington D.C., 1997.
81. Nakamura, S., Kosugi, M., Shido, H., Kosemura, K., Satoh, A., Minakata, H., Tsunoda, H., Kobayashi, M., Kurahashi, T., Hatada, A., Suzuki, R., Fukuda, M., Kimura, T., Nakabayashi, M., Kojima, M., Nara, Y., Fukano, T., and Sasaki, N., “Embedded DRAM Technology Compatible to the 0.13 µm High-Speed Logics by Using Ru Pillars in Cell Capacitors and Peripheral Vias,” IEEE International Electron Device Meeting, p. 1029, San Francisco, 1998.
82. Drynan, J. M., Fukui, K., Hamada, M., Inoue, K., Ishigami, T., Kamiyama, S., Matsumoto, A., Nobusawa, H., Sugai, K., Takenaka, M., Yamaguchi, H., and Tanigawa, T., “Shared Tungsten Structures for FEOL/BEOL Compatibility in Logic-Friendly Merged DRAM,” IEEE International Electron Device Meeting, p. 849, San Francisco, 1998.
83. Togo, M., Noda, K., and Tanigawa, T., “Multiple-Thickness Gate Oxide and Dual-Gate Technologies for High-Performance Logic-Embedded DRAMs,” IEEE International Electron Device Meeting, p. 347, San Francisco, 1998.
84. Yoshikawa, K., “Embedded Flash Memories — Technology assessment and future —,” 1999 International Symposium on VLSI Technology, System, and Applications, p. 183, Taipei, 1999.
85. Yoshikawa, K., “Guide-lines on Flash Memory Cell Selection,” Extended Abstract of 1998 International Conference on Solid State Devices and Materials, p. 138, 1998.
86. Watanabe, H., Yamada, S., Tanimoto, M., Mitsui, M., Kitamura, S., Amemiya, K., Tanzawa, T., Sakagami, E., Kurata, M., Isobe, K., Takebuchi, M., Kanda, M., Mori, S., and Watanabe, T., “Novel 0.44µm² Ti-Salicide STI Cell Technology for High-Density NOR Flash Memories and High Performance Embedded Application,” IEEE International Electron Device Meeting 1998, p. 975, San Francisco, 1998.
87. Kuo, C., “Embedded Flash Memory Applications, Technology and Design,” 1995 IEDM Short Course: NVRAM Technology and Application, IEEE International Electron Device Meeting, Washington D.C., 1995.
88. Ohnakado, T., Mitsunaga, K., Nunoshita, M., Onoda, H., Sakakibara, K., Tsuji, N., Ajika, N., Hatanaka, M., and Miyoshi, H., “Novel Electron Injection Method Using Band-to-Band Tunneling Induced Hot Electron (BBHE) for Flash Memory with a p-channel Cell,” IEEE International Electron Device Meeting, p. 279, Washington D.C., 1995.
89. Iwai, H., Sasaki, G., Unno, Y., Niitsu, Y., Norishima, M., Sugimoto, Y., and Kannzaki, K., “0.8µm Bi-CMOS Technology with High fT Ion-Implanted Emitter Bipolar Transistor,” IEEE International Electron Device Meeting 1987, p. 28, Washington D.C., 1987.
90. Clark, L. T. and Taylor, G. F., “High Fan-in Circuit Design,” 1994 Bipolar/BiCMOS Circuits & Technology Meeting, p. 27, Minneapolis, MN, 1994.


91. Nii, H., Yoshino, C., Inoh, K., Itoh, N., Nakajima, H., Sugaya, H., Naruse, H., Katsumata, Y., and Iwai, H., “0.3 µm BiCMOS Technology for Mixed Analog/Digital Application System,” 1997 Bipolar/BiCMOS Circuits & Technology Meeting, p. 68, Minneapolis, MN, 1997.
92. Johnson, R. A., Zierak, M. J., Outama, K. B., Bahn, T. C., Joseph, A. J., Cordero, C. N., Malinowski, J., Bard, K. A., Weeks, T. W., Milliken, R. A., Medve, T. J., May, G. A., Chong, W., Walter, K. M., Tempest, S. L., Chau, B. B., Boenke, M., Nelson, M. W., and Harame, D. L., “1.8 million Transistor CMOS ASIC Fabricated in a SiGe BiCMOS Technology,” IEEE International Electron Device Meeting 1998, p. 217, San Francisco, 1998.
93. Chyan, Y.-F., Ivanov, T. G., Carroll, M. S., Nagy, W. J., Chen, A. S., and Lee, K. H., “A 50-GHz 0.25µm High-Energy Implanted BiCMOS (HEIBiC) Technology for Low-Power High-Integration Wireless-Communication System,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 92, Honolulu, 1998.
94. “The National Technology Roadmap for Semiconductors,” Semiconductor Industry Association, 1997.
95. Momose, H. S., Ono, M., Yoshitomi, T., Ohguro, T., Nakamura, S., Saito, M., and Iwai, H., “Tunneling Gate Oxide Approach to Ultra-High Current Drive in Small-Geometry MOSFETs,” IEEE International Electron Device Meeting, p. 593, San Francisco, 1994.
96. Momose, H. S., Ono, M., Yoshitomi, T., Ohguro, T., Nakamura, S., Saito, M., and Iwai, H., “1.5 nm Direct-Tunneling Gate Oxide Si MOSFETs,” IEEE Trans. Electron Devices, vol. ED-43, p. 1233, 1996.
97. Timp, G., Agarwal, A., Baumann, F. H., Boone, T., Buonanno, M., Cirelli, R., Donnelly, V., Foad, M., Grant, D., Green, M., Gossmann, H., Hillenius, S., Jackson, J., Jacobson, D., Kleiman, R., Kornblit, A., Klemens, F., Lee, J. T.-C., Mansfield, W., Moccio, S., Murrell, A., O’Malley, M., Rosamilia, J., Sapjeta, J., Silverman, P., Sorsch, T., Tai, W. W., Tennant, D., Vuong, H., and Weir, B., “Low Leakage, Ultra-Thin Gate Oxides for Extremely High Performance sub-100 nm nMOSFETs,” IEEE International Electron Device Meeting, p. 930, Washington D.C., 1997.
98. Momose, H. S., Ono, M., Yoshitomi, T., Ohguro, T., Nakamura, S., Saito, M., and Iwai, H., “Prospects for Low-Power, High-Speed MPUs Using 1.5 nm Direct-Tunneling Gate Oxide MOSFETs,” Journal of Solid-State Electronics, vol. 41, p. 707, 1997.
99. Momose, H. S., Morifuji, E., Yoshitomi, T., Ohguro, T., Saito, M., Morimoto, T., Katsumata, Y., and Iwai, H., “High-Frequency AC Characteristics of 1.5 nm Gate Oxide MOSFETs,” IEEE International Electron Device Meeting, p. 105, San Francisco, 1996.
100. Momose, H. S., Nakamura, S., Ohguro, T., Yoshitomi, T., Morifuji, E., Morimoto, T., Katsumata, Y., and Iwai, H., “Study of the Manufacturing Feasibility of 1.5 nm Direct-Tunneling Gate Oxide MOSFETs: Uniformity, Reliability, and Dopant Penetration of the Gate Oxide,” IEEE Trans. Electron Devices, vol. ED-45, p. 691, 1998.
101. Lo, S.-H., Buchanan, D. A., Taur, Y., and Wang, W., “Quantum-Mechanical Modeling of Electron Tunneling Current from the Inversion Layer of Ultra-Thin-Oxide nMOSFETs,” IEEE Electron Devices Letters, vol. EDL-18, p. 209, 1997.
102. Sorsch, T., Timp, W., Baumann, F. H., Bogart, K. H. A., Boone, T., Donnelly, V. M., Green, M., Evans-Lutterodt, K., Kim, C. Y., Moccio, S., Rosamilia, J., Sapjeta, J., Silverman, P., Weir, B., and Timp, G., “Ultra-Thin, 1.0–3.0 nm, Gate Oxides for High Performance sub-100 nm Technology,” 1998 Symposium on VLSI Technology, Digest of Technical Papers, p. 222, 1998.
103. Ohguro, T., Naruse, N., Sugaya, H., Morifuji, E., Nakamura, S., Yoshitomi, T., Morimoto, T., Momose, H. S., Katsumata, Y., and Iwai, H., “0.18 µm Low Voltage/Low Power RF CMOS with Zero Vth Analog MOSFETs made by Undoped Epitaxial Channel Technique,” IEEE International Electron Device Meeting, p. 837, Washington D.C., 1997.
104. Ohguro, T., Naruse, H., Sugaya, H., Kimijima, H., Morifuji, E., Yoshitomi, T., Morimoto, T., Momose, H. S., Katsumata, Y., and Iwai, H., “0.12 µm Raised Gate/Source/Drain Epitaxial Channel NMOS Technology,” IEEE International Electron Device Meeting 1998, p. 927, San Francisco, 1998.

© 2000 by CRC Press LLC

Grahn, J.V., Ostling, M.“Bipolar Technology” The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


3 Bipolar Technology

Jan V. Grahn
Mikael Östling
Royal Institute of Technology (KTH)

3.1 Introduction
3.2 Bipolar Process Design
    Figures-of-Merit • Process Optimization • Vertical Structure • Scaling Rules • Horizontal Layout
3.3 Conventional Bipolar Technology
    Junction-Isolated Transistors • Oxide-Isolated Transistors • Lateral pnp Transistors
3.4 High-Performance Bipolar Technology
    Polysilicon Emitter Contact • Advanced Device Isolation • Self-Aligned Structures
3.5 Advanced Bipolar Technology
    Implanted Base • Epitaxial Base • Future Trends

3.1 Introduction

The development of bipolar technology for integrated circuits went hand in hand with the steady improvement in semiconductor materials and discrete components during the 1950s and 1960s. Consequently, silicon bipolar technology formed the basis for the IC market during the 1970s. As circuit dimensions shrank, the MOSFET (or MOS) gradually took over as the major technological platform for silicon integrated circuits. The main reasons are the ease of miniaturization and the high yield of MOS compared to bipolar technology. However, during the same period of MOS growth, much progress was also achieved in bipolar technology.1,2 This is illustrated in Fig. 3.1, where the reported gate delay time for emitter-coupled logic (ECL) is plotted versus year.2,3 In 1984, the 100 ps/gate limit was broken and, since then, the speed performance has been improved by a factor of ten. The high speed and large versatility of the silicon bipolar transistor still make it an attractive choice for a variety of digital and analog applications.4

FIGURE 3.1 Reported gate delay time for bipolar ECL circuits vs. year.

Apart from high-speed performance, the bipolar transistor is recognized for its excellent analog properties. It features high linearity, superior low- and high-frequency noise behavior, and a very large transconductance.5 Such properties are highly desirable for many RF applications, for narrow-band as well as broad-band circuits.6 The high current drive capability per unit silicon area makes the bipolar transistor suitable for input/output stages in many IC designs (e.g., in fast SRAMs).

The disadvantage of bipolar technology is the low transistor density, combined with a large power dissipation. High-performance bipolar circuits are therefore normally fabricated at a modest integration level (MSI/LSI). By using BiCMOS design, the benefits of both MOS and bipolar technology are utilized.7 One example is mixed analog/digital systems where a high-performance bipolar process is integrated with high-density CMOS. This technology forms a vital part in several system-on-a-chip designs (e.g., for telecommunication circuits).

In this chapter, a brief overview of bipolar technology is given with an emphasis on the integrated silicon bipolar transistor. The information presented here is based on the assumption that the reader is familiar with bipolar device fundamentals and basic VLSI process technology. Bipolar transistors are treated in detail in the well-known textbooks by Ashburn8 and Roulston.9 The first part of this chapter outlines the general concepts in bipolar process design and optimization (Section 3.2). The second part presents the three generations of integrated devices representing state-of-the-art bipolar technologies for the 1970s, 1980s, and 1990s (Sections 3.3, 3.4, and 3.5, respectively). Finally, some future trends in bipolar technology are outlined.

3.2 Bipolar Process Design

The design of a bipolar process starts with the specification of the application target and its circuit technology (digital or analog). This leads to a number of requirements formulated in device parameters and associated figures-of-merit. These are mutually dependent and must therefore be traded off against each other, making the final bipolar process design a compromise between various conflicting device requirements.

Figures-of-Merit

In the digital bipolar process, the cutoff frequency (fT) is a well-known figure-of-merit for speed. The fT is defined for a common-emitter configuration with its output short-circuited, and is obtained by extrapolating the small-signal current gain to unity. From a circuit perspective, a more adequate figure-of-merit is the gate delay time (τd) measured for a ring-oscillator circuit containing an odd number of inverters.10 The τd can be expressed as a linear combination of the incoming time constants, each weighted by a factor determined by the circuit topology (e.g., ECL).10,11 Alternative expressions for τd calculations have been proposed.12 Besides speed, power dissipation can also be a critical issue in densely packed bipolar digital circuits, resulting in the power-delay product as a figure-of-merit.4

In the analog bipolar process, the dc properties of the transistor are of utmost importance. This involves minimum values of the common-emitter current gain (β), Gummel plot linearity (βmax/β), breakdown voltage (BVCEO), and Early voltage (VA). The product β × VA is often introduced as a figure-of-merit for the device dc characteristics.13 Rather than fT, the maximum oscillation frequency,

$f_{max} = \sqrt{f_T / (8\pi R_B C_{BC})}$

is preferred as a figure-of-merit in high-speed analog design, where RB and CBC denote the total base resistance and the base-collector capacitance, respectively.14 Alternative figures-of-merit for speed have been proposed in the literature.15,16 Analog bipolar circuits are often crucially dependent on a certain noise immunity, leading to the introduction of the corner frequency and noise figure as figures-of-merit for low-frequency and high-frequency noise properties, respectively.17
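As a rough numerical illustration of the speed figures-of-merit above, the following Python sketch evaluates fmax from fT, RB, and CBC, extracts a per-gate delay from a ring-oscillator frequency, and forms the power-delay product. The function names and the example values are hypothetical and not taken from the chapter. The relation f_osc = 1/(2 N τd) is the usual ring-oscillator approximation and is assumed here rather than quoted from the text.

```python
import math


def f_max(f_t_hz, r_b_ohm, c_bc_farad):
    """Maximum oscillation frequency: sqrt(fT / (8*pi*RB*CBC)), in Hz."""
    return math.sqrt(f_t_hz / (8.0 * math.pi * r_b_ohm * c_bc_farad))


def gate_delay_from_ring_oscillator(f_osc_hz, n_stages):
    """Per-gate delay tau_d, assuming f_osc = 1 / (2 * N * tau_d)."""
    if n_stages < 3 or n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters (>= 3)")
    return 1.0 / (2.0 * n_stages * f_osc_hz)


def power_delay_product(power_per_gate_w, tau_d_s):
    """Power-delay product in joules (energy per switching event)."""
    return power_per_gate_w * tau_d_s


if __name__ == "__main__":
    # Illustrative (assumed) device values: fT = 30 GHz, RB = 100 ohm, CBC = 10 fF.
    print("fmax [GHz]:", f_max(30e9, 100.0, 10e-15) / 1e9)
    # Illustrative (assumed) 21-stage ring oscillator running at 1 GHz.
    tau_d = gate_delay_from_ring_oscillator(1e9, 21)
    print("tau_d [ps]:", tau_d * 1e12)
    print("PDP [fJ] at 1 mW/gate:", power_delay_product(1e-3, tau_d) * 1e15)
```

Because fmax depends on the square root of fT/(RB CBC), halving the base resistance raises fmax by only about 41% in this expression, which is one reason the layout-related parasitics are optimized together rather than one at a time.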

Process Optimization

The optimization of the bipolar process is divided between the intrinsic and extrinsic device design. This corresponds to the vertical impurity profile and the horizontal layout of the transistor, respectively.10 See the example in Fig. 3.2, where the device cross-section is also included. The vertical profile and horizontal layout are primarily dictated by the given process and lithography constraints, respectively.

Figure 3.3 shows a simple flowchart of the bipolar design procedure. Starting from the specified dc parameters at a given operation point, the doping profiles can be derived. The horizontal layout must be adjusted to minimize the parasitics. A (speed) figure-of-merit can then be calculated. An implicit relation is thus obtained between the figure-of-merit and the processing parameters.11,18 In practice, several iterations must be performed in the optimization loop in order to find an acceptable compromise between the device parameters. This procedure is substantially eased by two-dimensional process simulations of the device fabrication19 as well as device simulations of the bipolar transistor.20,21 For optimization of a large number of device parameters, the strategy is based on screening out the unimportant factors, combined with a statistical approach (e.g., response surface methodology).22

Vertical Structure

The engineering of the vertical structure involves the design of the collector, base, and emitter impurity profiles. In this respect, fT is an adequate parameter to optimize. For a modern bipolar transistor with suppressed parasitics, the maximum fT is usually determined by the forward transit time of minority carriers through the intrinsic component. The most important fT tradeoff is against BVCEO, as stated by the Johnson limit for silicon transistors:23 the product fT × BVCEO cannot exceed 200 GHz-V (recently updated to 500 GHz-V).24

Collector Region

The vertical n-type collector of the bipolar device in Fig. 3.2 consists of two regions below the p-type base diffusion: a lightly or moderately doped n-type epitaxial (epi) layer, followed by a highly doped n+-subcollector. The thickness and doping level of the subcollector are non-critical parameters; a high arsenic or antimony doping density between 10^19 and 10^20 cm^-3 is representative, resulting in a sheet resistance of 20 to 40 Ω/sq. In contrast, the design of the epi-layer constitutes a fundamental topic in bipolar process optimization.

To first order, the collector doping in the epi-layer is determined by the operation point (more specifically, the collector current density) of the component (see Fig. 3.3). A normal condition is to have the operation point corresponding to maximum fT, which typically means a collector current density on the order of 2–4 × 10^4 A/cm^2. As will be recognized later, bipolar scaling results in increased collector current densities. Above a certain current, there will be a rapid roll-off in current gain as well as cutoff frequency. This is due to high-current effects, primarily the base push-out or Kirk effect, leading to a steep increase in the forward transit time.25 Since the critical current value is proportional to the collector doping,26 a minimum impurity concentration for the epi-layer is required to avoid fT degradation (typically around 10^17 cm^-3 for a high-speed device).

FIGURE 3.2 (a) Layout, (b) cross-section, and (c) simulated impurity profile through emitter window for an integrated bipolar transistor (E = emitter, B = base, C = collector).

FIGURE 3.3 Generic bipolar device optimization flowchart.

Usually, the epi-layer is doped only in the intrinsic structure by a selectively implanted collector (SIC) procedure.27 An example of such a doping profile is seen in Fig. 3.4. Such a collector design permits improved control over the base-collector junction; that is, a shorter base width as well as a suppressed Kirk effect. The high collector doping concentration, however, may be a concern for both CBC and BVCEO. The latter will therefore often set an upper limit on the collector doping. One way to reduce the electric field in the junction is to implement a lightly doped collector spacer layer between the heavily doped base and collector regions.28,29 A retrograde collector profile, with a low impurity concentration near the base-collector junction that then increases toward the subcollector, has also been reported to enhance fT.30,31

The thickness of the epi-layer varies widely between different device designs, extending several micrometers in depth for analog bipolar components, whereas a high-speed digital design typically has an epi-layer thickness around 1 µm or below, thus reducing the total collector resistance. As a result, the transistor breakdown voltage is sometimes determined by reach-through breakdown (i.e., full depletion penetration of the epi-collector). The thickness of the collector layer can therefore be used as a parameter in determining BVCEO, which in turn is traded off vs. fT.32
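The two collector constraints discussed above can be put into numbers with a small Python sketch. It rests on assumptions that are not stated in the chapter: the Kirk-effect onset is approximated by the common first-order expression J_K ≈ q · v_sat · N_C (ignoring the voltage-dependent term), with an assumed electron saturation velocity of 10^7 cm/s, and the Johnson limit is applied as a simple product bound. All function names are hypothetical.

```python
Q = 1.602e-19   # elementary charge in coulombs
V_SAT = 1.0e7   # cm/s, assumed electron saturation velocity in silicon


def johnson_bvceo_limit(f_t_ghz, limit_ghz_v=200.0):
    """Largest BVCEO (V) compatible with a given fT under fT * BVCEO <= limit."""
    return limit_ghz_v / f_t_ghz


def kirk_current_density(n_c_cm3, v_sat_cm_s=V_SAT):
    """First-order onset of base push-out, J_K ~ q * v_sat * N_C, in A/cm^2."""
    return Q * v_sat_cm_s * n_c_cm3


def min_collector_doping(j_c_a_cm2, v_sat_cm_s=V_SAT):
    """Collector doping (cm^-3) at which J_K equals the operating current density."""
    return j_c_a_cm2 / (Q * v_sat_cm_s)


if __name__ == "__main__":
    print("BVCEO limit at fT = 50 GHz [V]:", johnson_bvceo_limit(50.0))
    print("J_K at N_C = 1e17 cm^-3 [A/cm^2]:", kirk_current_density(1e17))
    print("N_C bound at J = 3e4 A/cm^2 [cm^-3]:", min_collector_doping(3e4))
```

Under these assumptions the bound obtained for 3 × 10^4 A/cm^2 falls below the typical 10^17 cm^-3 quoted in the text, so the sketch should be read only as an order-of-magnitude lower bound, not as a design rule.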

FIGURE 3.4 Simulated vertical impurity profile with and without SIC.

In cases where fmax is of interest, the collector design must be carefully taken into account. Compared to fT, the optimum fmax is found for thicker and more lightly doped collector epi-layers.33,34 The vertical collector design will therefore, to a large extent, determine the tradeoff between fT and fmax.

Base Region

The width and peak concentration of the base profile are two of the most fundamental parameters in vertical profile design. The base width WB is normally in the range 0.1 to 1 µm, whereas a typical base peak concentration lies between 10^17 and 10^18 cm^-3. The current gain of the transistor is determined by the ratio of the Gummel numbers of the emitter and the base. Usually, a current gain of at least 100 is required for analog bipolar transistors, whereas in digital applications, β around 20 is often acceptable. A normal base sheet resistance (or pinch resistance) for conventional bipolar processes is of the order of 100 Ω/sq., whereas the number for high-speed devices typically is in the interval 1 to 10 kΩ/sq. This is due to the small WB < 0.1 µm necessary for a short base transit time (∝ WB^2). On the other hand, too narrow a base will have a negative impact on fmax because of its RB dependence. As a result, fmax exhibits a maximum when plotted against WB.35

The base impurity concentration must be kept high enough to avoid punch-through at low collector voltages; that is, the condition in which the base-collector depletion layer penetrates across the neutral base. In other words, the base doping level is also dictated by the collector impurity concentration. Punch-through is the ultimate consequence of base width modulation, or the Early effect, manifested by a finite output resistance in the IC-VCE transistor characteristic.36 The associated VA or the product β × VA serves as an indicator of the linear properties of the bipolar transistor.
The VA is typically at a relatively high level (>30 V) for analog applications, whereas digital designs often accept a relatively low VA < 15 V.

Figure 3.5 demonstrates simulations of current gain versus base doping for various WB.37 It is clearly seen that the base doping interval permitting a high current gain while avoiding punch-through is pushed to high impurity concentrations for narrow base widths. In addition, Fig. 3.5 points to another limiting factor for high base doping levels above 5 × 10^18 cm^-3, namely, the onset of forward-biased tunneling currents in the emitter-base junction,38 leading to non-ideal base current characteristics.39 It is concluded that the allowable base doping interval will be very narrow for WB < 0.1 µm.

FIGURE 3.5 Simulated current gain vs. base doping density for different base widths (NC = 6 × 10^16 cm^-3, NE = 10^20 cm^-3, and emitter depth 0.20 µm) (after Ref. 37, copyright © 1990, IEEE).

The shape of the base profile has some influence on device performance. The final base profile is the result of an implantation and diffusion process and, normally, only the peak base concentration is given along with the base width. Nonetheless, there will be an impurity grading along the base profile (see Figs. 3.2 and 3.4), creating a built-in electric field and thereby adding a drift component for the minority carrier transport.40 Recent research has shown that for very narrow base transistors, the uniform doping profile is preferable when maximizing fT.41,42 This is also valid under high injection conditions in the base.43 Uniformly doped base profiles are common in advanced bipolar processes using epitaxial techniques for growing the intrinsic base.

During recent years, base profile design has largely been devoted to the implementation of narrow-bandgap SiGe in the base. The resulting emitter-base heterojunction allows a considerable enhancement in current gain, which can be traded off against increased base doping, thus substantially alleviating the problem of the elevated base sheet resistances typical of high-speed devices.44 Excellent dc as well as high-frequency properties can be achieved. The position of the Ge profile with respect to the boron profile has been discussed extensively in the literature.45–47 More details about SiGe heterojunction engineering are found in Chapter 5 by Cressler.

Emitter Region

The conventional metal-contacted emitter is characterized by an abrupt arsenic or phosphorus profile fabricated by direct diffusion or implantation into the base area (see Fig. 3.2).48 To keep the emitter efficiency close to unity (and thus the current gain high), the emitter junction cannot be made too shallow (~1 µm). The emitter doping level typically lies between 10^20 and 10^21 cm^-3, close to the solid solubility limit at the silicon surface, hence providing a low emitter resistance as well as the large emitter Gummel number required for keeping the current gain high. Bandgap narrowing, however, will be present in the emitter, causing a reduction in the effective emitter doping.49

When scaling bipolar devices, the emitter junction must be made shallower to ensure a low emitter-base capacitance. When the emitter depth becomes less than the minority carrier recombination length, the current gain will inevitably degrade.
This precludes the use of conventional emitters in a high-performance bipolar technology. Instead, polycrystalline (poly) silicon emitter technology is utilized. By diffusing impurity species from the polysilicon contact into the monocrystalline (mono) silicon, a very shallow junction (<0.2 µm) is formed; yet gain can be kept at a high level and even traded off against a higher base doping.50 A gain enhancement factor between 3 and 30 for the polysilicon compared to the monosilicon emitter has been reported (see also Section 3.4).51,52
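The statement in the Base Region discussion that fmax exhibits a maximum when plotted against WB can be reproduced with a deliberately simple toy model, sketched below in Python. Everything in it is an assumption made for illustration: a uniformly doped base with transit time WB^2/(2 Dn), all other delay terms in fT lumped into one constant, a base resistance that scales as 1/WB at fixed doping, and a fixed CBC. The parameter values are hypothetical and are not taken from Ref. 35.

```python
import math

D_N = 10.0          # cm^2/s, assumed electron diffusivity in the base
TAU_OTHER = 5e-12   # s, assumed sum of all delay terms other than base transit
R_B_REF = 200.0     # ohm, assumed base resistance at the reference base width
W_B_REF = 1e-5      # cm (0.1 um), reference base width belonging to R_B_REF
C_BC = 10e-15       # F, assumed base-collector capacitance


def base_transit_time(w_b_cm):
    """Uniform-base transit time, tau_B = WB^2 / (2 * Dn)."""
    return w_b_cm ** 2 / (2.0 * D_N)


def f_t(w_b_cm):
    """Cutoff frequency from 1 / (2*pi*fT) = tau_B + tau_other."""
    return 1.0 / (2.0 * math.pi * (base_transit_time(w_b_cm) + TAU_OTHER))


def f_max(w_b_cm):
    """fmax = sqrt(fT / (8*pi*RB*CBC)) with RB assumed proportional to 1/WB."""
    r_b = R_B_REF * W_B_REF / w_b_cm
    return math.sqrt(f_t(w_b_cm) / (8.0 * math.pi * r_b * C_BC))


def optimum_base_width():
    """Width maximizing fmax in this toy model: WB = sqrt(2 * Dn * tau_other)."""
    return math.sqrt(2.0 * D_N * TAU_OTHER)


if __name__ == "__main__":
    for w_um in (0.03, 0.05, 0.1, 0.2, 0.4):
        w_cm = w_um * 1e-4
        print(f"WB = {w_um:4.2f} um  fT = {f_t(w_cm) / 1e9:6.1f} GHz  "
              f"fmax = {f_max(w_cm) / 1e9:6.1f} GHz")
    print("toy-model optimum WB [um]:", optimum_base_width() * 1e4)
```

In this model fT rises monotonically as the base is thinned, while fmax peaks where the base transit time equals the remaining delay, which mirrors the qualitative tradeoff described in the text without claiming any quantitative accuracy.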


Scaling Rules

The principles for vertical design can be summarized in the bipolar scaling rules formulated by Solomon and Tang;53,54 see Table 3.1. Since the bipolar transistor is scaled under constant voltage, the current density increases with reduced device dimensions. At medium or high current densities, the vertical structure determines the speed. At low current densities, performance is normally limited by device parasitics. Eventually, tunnel currents or contact resistances constitute a final limit to further speed improvement based on the scaling rules. A solution is to use SiGe bandgap engineering to further enhance device performance without scaling.

TABLE 3.1 Bipolar Scaling Rules (Scaling factor λ < 1)

Parameter                 Scaling Factor
Voltage                   1
Base width WB             λ^0.8
Base doping NB            WB^-2
Current density J         λ^-2
Collector doping          J
Depletion capacitances    λ
Delay                     λ
Power                     1
Power-delay product       λ
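The entries of Table 3.1 can be evaluated for a concrete shrink with a few lines of Python. This is only a sketch of the table as reconstructed here: the function name and dictionary keys are hypothetical, and each returned number is the multiplicative factor applied to the corresponding parameter when lateral dimensions are scaled by λ.

```python
def bipolar_scaling_factors(lam):
    """Multiplicative scaling factors from Table 3.1 for a scaling factor 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("the scaling factor lambda must lie between 0 and 1")
    base_width = lam ** 0.8          # WB scales as lambda^0.8
    current_density = lam ** -2      # J scales as lambda^-2
    return {
        "voltage": 1.0,                          # constant-voltage scaling
        "base_width": base_width,
        "base_doping": base_width ** -2,         # NB scales as WB^-2
        "current_density": current_density,
        "collector_doping": current_density,     # collector doping follows J
        "depletion_capacitance": lam,
        "delay": lam,
        "power": 1.0,
        "power_delay_product": lam,
    }


if __name__ == "__main__":
    for name, factor in bipolar_scaling_factors(0.5).items():
        print(f"{name:24s} x {factor:.3f}")
```

For λ = 0.5, the current density and collector doping both rise by a factor of four while the delay halves, which illustrates why high-current effects and tunneling eventually limit scaling as noted above.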

Horizontal Layout

The horizontal layout is carried out in order to minimize the device parasitics. Figure 3.6 shows the essential parasitic resistances and capacitances for a schematic bipolar structure containing two base contacts. The various RC constants in Fig. 3.6 introduce time delays. For conventional bipolar transistors, such parasitics often limit device speed. In contrast, the self-alignment technology applied in advanced bipolar transistor fabrication allows for efficient suppression of the parasitics.

FIGURE 3.6 Schematic view of the parasitic elements in a bipolar transistor equipped with two base contacts. RE = emitter resistance, RBi = intrinsic base resistance, RBx = extrinsic base resistance, RC = collector resistance, CEB = emitter-base capacitance, CBCi = intrinsic base-collector capacitance, CBCx = extrinsic base-collector capacitance, and CCS = collector-substrate capacitance. Gray areas denote depletion regions. Contact resistances are not shown.

In horizontal layout, fmax serves as a first-order indicator in the extrinsic optimization procedure because of its dependence on CBC and (total) RB. These two parasitics are strongly connected to the geometrical layout of the device. The more advanced τd calculation takes all major parasitics into account under given load conditions, thus providing good insight into the various time delay contributions of a bipolar logic gate.55

From Fig. 3.6, it is seen that the collector resistance is divided into three parts. Apart from the epi-layer and buried layer previously discussed, the collector contact also adds a series resistance. Provided the epi-layer is not too thick, the transistor is equipped with a deep phosphorus plug from the collector contact down to the buried layer, thus reducing the total RC.

As illustrated in Fig. 3.6, the base resistance is divided into intrinsic (RBi) and extrinsic (RBx) components. The former is the pinched base resistance situated directly under the emitter diffusion, whereas the latter constitutes the base regions contacting the intrinsic base. The intrinsic part decreases with the current due to the lateral voltage drop in the base region.56 At high current densities, this causes current crowding at the emitter diffusion edges, which in turn results in an earlier onset of high-current effects in the transistor. The extrinsic base resistance is bias independent and must be kept as small as possible (e.g., by utilizing self-alignment architectures). By designing a device layout with two or more base contacts surrounding the emitter, the final RB is further reduced at the expense of chip area. Apart from enhancing fmax, the RB reduction is also beneficial for device noise performance.

The layout of the emitter is crucial since the effective emitter area defines the intrinsic device cross-section.57 The minimum emitter area, within the lithography constraints, is determined by the operational collector current and the critical current density where high-current effects start to occur.58 Eventually, a tradeoff must be made between the base resistance and device capacitances as a function of emitter geometry; this choice is largely dictated by the device application. Long, narrow emitter stripes, meaning a reduction in the base resistance, are frequently used. The emitter resistance is usually non-critical for conventional devices; however, for polysilicon emitters, the emitter resistance may become a concern in very small-geometry layouts.3

Of the various junction capacitances in Fig. 3.6, the collector-base capacitance is the most significant. This parasitic is also divided into intrinsic (CBCi) and extrinsic (CBCx) contributions. Similar to RBx, the CBCx is kept low by using self-aligned schemes. For example, the fabrication of a SIC causes an increase only in CBCi, whereas CBCx stays virtually unaffected. The collector-substrate capacitance CCS is one of the minor contributors to fT; the CCS originates from the depletion regions created in the epi-layer and under the buried layer. CCS will become significant at very high frequencies due to substrate coupling effects.59
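The effect of emitter geometry and of a second base contact on the intrinsic base resistance can be illustrated with the following Python sketch. It uses the common low-current distributed-resistance approximations for a rectangular emitter stripe, RBi ≈ R_sheet · W / (3 L) with one base contact and R_sheet · W / (12 L) with contacts on both sides. These expressions are a textbook approximation assumed here, not formulas given in the chapter, and they ignore the current crowding described above. The function name and example values are hypothetical.

```python
def intrinsic_base_resistance(r_sheet_ohm_sq, stripe_width, stripe_length,
                              n_base_contacts=1):
    """Low-current intrinsic base resistance of a rectangular emitter stripe.

    stripe_width is the emitter dimension between the base contacts and
    stripe_length the dimension along them (same length unit for both).
    """
    if n_base_contacts == 1:
        divisor = 3.0    # base current enters from one side only
    elif n_base_contacts == 2:
        divisor = 12.0   # base current enters from both sides
    else:
        raise ValueError("this sketch only covers one or two base contacts")
    return r_sheet_ohm_sq * stripe_width / (divisor * stripe_length)


if __name__ == "__main__":
    # Assumed example: 10 kohm/sq pinched base, 0.5 um x 5 um emitter stripe.
    single = intrinsic_base_resistance(10e3, 0.5, 5.0, n_base_contacts=1)
    double = intrinsic_base_resistance(10e3, 0.5, 5.0, n_base_contacts=2)
    print("one base contact  [ohm]:", single)
    print("two base contacts [ohm]:", double)
```

In this approximation a second base contact cuts RBi by a factor of four, and doubling the stripe length at fixed width halves it, consistent with the preference for long, narrow emitter stripes.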

3.3 Conventional Bipolar Technology

Conventional bipolar technology is based on the device designs developed during the 1960s and 1970s. Despite its age, the basic concept still constitutes a workhorse in many commercial analog processes where ultimate speed and high packing density are not of primary importance. In addition, a conventional bipolar component is often implemented in low-cost BiCMOS processes.

Junction-Isolated Transistors

The early planar transistor technology took advantage of a reverse-biased pn junction to provide the necessary isolation between components. One of the earliest junction-isolated transistors, the so-called triple-diffused process, is simply based on three ion implantations and subsequent diffusion.60 This device has been integrated into a standard CMOS process using one extra masking step.61 The triple-diffused bipolar process, however, suffers from a large collector resistance due to the absence of a subcollector, and the npn performance is therefore low.

By far the most common junction-isolated transistor is represented by the device cross-section of Fig. 3.7, the so-called buried-collector process.60 This device is based on the concept previously shown in Fig. 3.2, but with the addition of an n+-collector plug and isolation. The isolation is provided by the diffused p+-regions surrounding the transistor. The diffusion of the base and emitter impurities into the epi-layer allows relatively good control of the base width (more details of the fabrication are given in the next section on oxide-isolated transistors).

FIGURE 3.7 Cross-section of the buried-collector transistor with junction isolation and collector plug.

A somewhat different approach to the buried-collector process is the so-called collector-isolation diffusion.62 This process requires a p-type epi-layer after formation of the subcollector. An n+-diffusion serves both as isolation and as the collector plug to the buried layer. After an emitter diffusion, the p-type epi-layer will constitute the base of the final device. Compared to the buried-layer collector process, the collector-isolated device concept does not result in very accurate control over the final base width.

The main disadvantage of the junction-isolated transistor is the relatively large chip area occupied by the isolation region, thus precluding the use of such a device in any VLSI application. Furthermore, high-speed operation is ruled out because of the large parasitic capacitances associated with the junction isolation and the relatively deep diffusions involved. Indeed, many of the conventional junction-isolated processes were designed for doping from the gas phase at high temperatures.

Oxide-Isolated Transistors

Oxide isolation permits a considerable reduction in the lateral and vertical dimensions of the buried-layer collector process. The reason is that the base and collector contacts can be extended to the edge of the isolation region. More chip area can be saved by having the emitter walled against the oxide edge. The principal difference between the scaling of junction- and oxide-isolated transistors is visualized in Fig. 3.8.63 The device layouts are Schottky clamped; that is, the base contact extends over the collector region. This prevents the transistor from entering saturation during operation. In Fig. 3.8(b), the effective surface area of the emitter contact has been reduced by a so-called washed emitter approach: since the oxide formed on the emitter window during emitter diffusion has a much higher doping concentration than its surroundings, this particular oxide can be removed by a maskless wet etch. Hence, the emitter contact becomes self-aligned to the emitter diffusion area.

FIGURE 3.8 Device layout and cross-section demonstrating scaling of (a)-(b) junction-isolated and (c)-(d) oxide-isolated bipolar transistors (after Ref. 63, copyright © 1986, Wiley).

The process flow, including mask layouts, for an oxide-isolated bipolar transistor of the buried-layer collector type is shown in Fig. 3.9.64 After formation of the subcollector by arsenic implantation through an oxide mask in the p–-substrate, the upper collector layer is grown epitaxially on top (Fig. 3.9(a)). The device isolation is fabricated by a local oxidation of silicon (LOCOS) or recessed oxide (ROX) process (Figs. 3.9(b) to (d)). The isolation mask in Fig. 3.9(b) is aligned to the buried layer using the step in the silicon (Fig. 3.9(a)) originating from the enhanced oxidation rate of highly doped n+-silicon compared to the p–-substrate during activation of the buried layer. The ROX is thermally grown (Fig. 3.9(d)) after the boron field implantation (or chan-stop) (Fig. 3.9(c)). This p+-implant is necessary for suppressing a conducting channel otherwise present under the ROX.

The base is then formed by ion implantation of boron or BF2 through a screen oxide (Fig. 3.9(d)). In the simple device of Fig. 3.9, a single base implantation is used; in a more advanced bipolar process, the fabrication of the intrinsic and extrinsic base is divided into one low-dose and one high-dose implantation, respectively, adding one more mask to the total flow. After base formation, an emitter/base contact mask is patterned in a thermally grown oxide (Fig. 3.9(e)). The emitter is then implanted using a heavy-dose arsenic implant (Fig. 3.9(f)). An n+ contact is simultaneously formed in the collector window. After annealing, the device is ready for metallization and passivation.

Apart from the strong reduction in isolation capacitances, the replacement of a junction-isolated process with an oxide-isolated process also adds other high-speed features such as a thinner epitaxial layer and shallower emitter/base diffusions. A typical base width is a few thousand Å, and the resulting fT typically lies in the range of 1 to 10 GHz. The doping of the epitaxial layer is determined by the required breakdown voltage. Further speed enhancement of the oxide-isolated transistor is difficult due to the parasitic capacitances and resistances originating from contact areas and design-rule tolerances related to alignment accuracy.

Lateral pnp Transistors

The conventional npn flow permits the bipolar designer to simultaneously create a lateral pnp transistor. This is made by placing two base diffusions in close proximity to each other in the epi-layer, one of them (pnp-collector) surrounding the other (pnp-emitter) (see Fig. 3.10). In general, the lateral pnp device exhibits poor performance since the base width is determined by lithography constraints rather than vertical base control as in the npn device. In addition, there will be electron injection from the subcollector into the p-type emitter, thus reducing emitter efficiency.

FIGURE 3.9 Layout and cross-section of the fabrication sequence for an oxide-isolated buried-collector transistor (after Ref. 64, copyright © 1983, McGraw-Hill).

FIGURE 3.10 Schematic cross-section of the lateral pnp transistor.

3.4 High-Performance Bipolar Technology

The development of a high-performance bipolar technology for integrated circuits signified a large step forward, both with respect to speed and packing density of bipolar transistors. A representative device cross-section of a so-called double-poly transistor is depicted in Fig. 3.11. The important characteristics of this bipolar technology are the polysilicon emitter contact, the advanced device isolation, and the self-aligned structure. These three features are discussed here with an emphasis on self-alignment, where the two basic process flows are outlined: the single-poly and the double-poly transistor.

FIGURE 3.11 A double-poly self-aligned bipolar transistor with deep-trench isolation, polysilicon emitter and SIC. Metallization is not shown.

Polysilicon Emitter Contact

The polysilicon emitter contact is fabricated by a shallow diffusion of n-type species (usually arsenic) from an implanted n+-polysilicon layer into the silicon substrate (Ref. 65) (see emitter region in Fig. 3.11). The thin oxide sandwiched between the poly- and monosilicon is partially or fully broken up during contact formation. The mechanism behind the improved current gain is strongly related to the details of the interface between the polysilicon layer and the monosilicon substrate (Ref. 52). Hence, the cleaning procedure of the emitter window surface before polysilicon deposition must be carefully engineered for process robustness. Otherwise, the average current gain will exhibit unacceptable wafer-to-wafer variations.


The emitter window preparation and subsequent drive-in anneal conditions can also be used in tailoring the process with respect to gain and emitter resistance. From a fabrication point of view, there are further advantages when introducing polysilicon emitter technology. By implanting into the polysilicon rather than into single-crystalline material, the total defect generation as well as related anomalous diffusion effects are strongly suppressed in the internal transistor after the drive-in anneal. Moreover, the risk for spiking of aluminum during the metallization process, causing short-circuiting of the pn junction, is strongly reduced compared to the conventional contact formation. As a result, some of the yield problems associated with monosilicon emitter fabrication are, to a large extent, avoided when utilizing polysilicon emitter technology.

Advanced Device Isolation

With advanced device isolation, one usually refers to deep trenches combined with LOCOS or ROX, as seen in Fig. 3.11 (Ref. 66). The starting material before etching is then a double-epitaxial layer (n+-n) grown on a lowly doped p–-substrate. The deep trench must reach all the way through the double epi-layer, which requires a high-aspect-ratio reactive-ion etch. Hence, the trenches will define the extension of the buried-layer collector for the transistor. The main reason for introducing advanced isolation in bipolar technology is the need for a compact chip layout (Ref. 67). Quite naturally, the bipolar isolation technology has benefited from the trench capacitor development in the MOS memory area. The deep-trench isolation allows bipolar transistors to be designed at the packing density of VLSI. The fabrication of a deep-trench isolation includes deep-silicon etching, chan-stop p+-implantation, an oxide/nitride stack serving as isolation, intrinsic polysilicon fill-up, planarization, and cap oxidation (Ref. 66). The deep-trench isolation is combined with an ordinary LOCOS or ROX isolation, which is added before or after trench formation. The most advanced isolation schemes take advantage of shallow-trench isolation rather than ordinary LOCOS after the deep-trench process; in this way, a very planar surface with no lateral oxide encroachment (“bird’s beak”) is achieved after the planarization step. The concern regarding stress-induced crystal defects originating from trench etching requires careful attention so as not to seriously affect yield.

Self-Aligned Structures

Advanced bipolar transistors are based on self-aligned structures made possible by polysilicon emitter technology. As a result, the emitter-base alignment does not depend on the overlay accuracy of the lithography tool. The device contacts can be separated without affecting the active device area. It is also possible to create structures where the base is self-aligned to both the collector and the emitter, the so-called sidewall base-contact structure (SICOS) (Ref. 68). This process, however, has not been able to compete successfully with the self-aligned schemes discussed below. The self-aligned structures are divided into single-polysilicon (single-poly) and double-polysilicon (double-poly) architectures, as visualized in Fig. 3.12 (Ref. 69). The double-poly structure refers to the emitter polysilicon and base polysilicon electrode, whereas the single-poly only refers to the emitter polysilicon. From Fig. 3.12, it is seen that the double-poly approach benefits from a smaller active area than the single-poly, manifested in a reduced base-collector capacitance. Moreover, the double-poly transistor in general exhibits a lower base resistance. The double-poly transistor, however, is more complex to fabricate than the single-poly device. On the other hand, by applying inside spacer technology for the double-poly emitter structure, the lithography requirements are not as strict as in the single-poly case, where more conventional MOS design rules are used for definition of the emitter electrode.

Single-Poly Structure

The fabrication of a single-poly transistor has been presented in several versions, more or less similar to the traditional MOS flow. An example of a standard single-poly process is shown in Fig. 3.13 (Ref. 70). After


FIGURE 3.12 (a) Double-poly structure and (b) single-poly structure. Buried layer and collector contact are not shown (after Ref. 69, copyright © 1989, IEEE).

arsenic emitter implantation (Fig. 3.13(a)) and polysilicon patterning, a so-called base link is implanted using boron ions (Fig. 3.13(b)). Oxide is then deposited and anisotropically etched to form outside spacers, followed by the heavy extrinsic base implantation (Fig. 3.13(c)). Shallow junctions (including emitter diffusion) are formed by rapid thermal annealing (RTA). A salicide or polycide metallization completes the structure (Fig. 3.13(d)). The intrinsic base does not necessarily need to be formed prior to the extrinsic part. Li et al. (Ref. 71) have presented a reverse extrinsic-intrinsic base scheme based on a disposable emitter pedestal with spacers. This leads to improved control over the intrinsic base width and a lower surface topography compared to the process represented in Fig. 3.13. Another variation of the single-poly architecture is the so-called quasi-self-aligned process (see Fig. 3.14) (Ref. 72). A base oxide is formed by thermal oxidation in the active area and an emitter window is opened (Fig. 3.14(a)). Following intrinsic base implantation, the emitter polysilicon is deposited, implanted, and annealed. The polysilicon emitter pedestal is then etched out (Fig. 3.14(b)). The extrinsic base process, junction formation, and metallization are essentially the same as in the single-poly process shown in Fig. 3.13. Note that in Fig. 3.14, the emitter-base formation is self-aligned to the emitter window in the oxide, not to the emitter itself, which explains the term quasi-self-aligned. As a result, a higher total base resistance is obtained compared to the standard single-poly process. The boron implantation illustrated in Fig. 3.13(b) is an example of so-called base-link engineering, aimed at securing the electrical contact between the heavily doped p+-extrinsic base and the much lower doped intrinsic base.
Too weak a base link will result in high total base resistance, whereas too strong a base link may create a lateral emitter-base tunnel junction leading to non-ideal base current characteristics (Ref. 73). Furthermore, a poorly designed base link jeopardizes matching between individual transistors since the final current gain may vary substantially with the emitter width.


FIGURE 3.13 The single-poly, self-aligned process: (a) polyemitter implantation, (b) emitter etch and base link implantation, (c) oxide spacer formation and extrinsic base implantation, and (d) final device after junction formation and metallization.


FIGURE 3.14 The single-poly, quasi-self-aligned process: (a) polyemitter implantation, (b) final device.

Double-Poly Structure

The double-poly structure originates from the classical IBM structure presented in 1981 (Ref. 74). Most high-performance commercial processes today are based on double-poly technology. The number of variations is smaller than for the single-poly structure; they differ mainly in base-link engineering, spacer technology, and SIC formation. One example of a double-poly fabrication is presented in Fig. 3.15. After deposition of the base polysilicon and oxide stack, the emitter window is opened (Fig. 3.15(a)) and thermally oxidized. During this step, p+-impurities from the base polysilicon diffuse into the monosilicon, thus forming the extrinsic base. In addition, the oxidation repairs the crystal damage caused by the dry etch when opening the emitter window. A thin silicon nitride layer is then deposited and the intrinsic base is implanted using boron, followed by the fabrication of amorphous silicon spacers inside the emitter window (Fig. 3.15(b)). The nitride is exposed to a short dry etch, the spacers are removed, and the thin oxide is opened up by an HF dip. Deposition and implantation of the polysilicon emitter film are then carried out (Fig. 3.15(c)). The structure is patterned and completed by RTA emitter drive-in and metallization (Fig. 3.15(d)). The emitter will thus be fully self-aligned with respect to the base. Note that the inside spacer technology implies that the actual emitter width will be significantly less than the drawn emitter width. The definition of the polyemitter in the single- and double-poly process inevitably leads to some overetching into the epi-layer; see Figs. 3.13(b) and 3.15(a), respectively. The final recessed region will make control over base-link formation more awkward (Refs. 75, 76). In fact, the base link will depend both on the degree of overetch and on the implantation parameters (Ref. 77). This situation is of no concern for the quasi-self-aligned process, where the etch of the polysilicon emitter stops on the base oxide.
In a modification of the double-poly process, a more advanced base-link technology is proposed (Ref. 78). After extrinsic base drive-in and emitter window opening, BF2-implanted poly-spacers are formed inside the emitter window. The boron is out-diffused through the emitter oxide, thus forming the base link. The intrinsic base is subsequently formed by conventional implantation through the emitter window. New dielectric inside spacers are formed prior to polysilicon emitter deposition, followed by arsenic implantation and emitter drive-in.
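The remark above that inside spacers make the actual emitter narrower than the drawn emitter can be put in numbers with a simple geometric estimate: each of the two sidewalls of the emitter window carries a spacer, so the effective width is the drawn width minus twice the spacer width. The sketch below assumes ideal, symmetric spacers and ignores lateral out-diffusion; the function name and the example dimensions are hypothetical.

```python
def effective_emitter_width(drawn_width_um: float, spacer_width_um: float) -> float:
    """Emitter width left after an inside spacer forms on both window sidewalls."""
    effective = drawn_width_um - 2.0 * spacer_width_um
    if effective <= 0.0:
        raise ValueError("spacers close the emitter window completely")
    return effective


# Hypothetical dimensions, chosen only to illustrate the size of the effect.
drawn_um = 0.8
for spacer_um in (0.1, 0.2, 0.3):
    width = effective_emitter_width(drawn_um, spacer_um)
    print(f"drawn {drawn_um} um, spacer {spacer_um} um -> effective {width:.2f} um")
```

For a drawn width of 0.8 µm and 0.2 µm spacers, the effective emitter is 0.4 µm wide, which illustrates why the lithography requirements on the drawn emitter window can be relaxed.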


FIGURE 3.15 The double-poly, self-aligned process: (a) emitter window etch, (b) intrinsic base implantation through thin oxide/nitride stack followed by inside spacer formation, (c) polyemitter implantation, (d) final device after emitter drive-in and metallization.


Vertical pnp bipolar transistors based on the double-poly concept have also been demonstrated (Ref. 79). Either boron or BF2 is used for the polyemitter implantation. A pnp device with an fT of 35 GHz has been presented in a classical double-poly structure (Ref. 80).

3.5 Advanced Bipolar Technology

This section treats state-of-the-art bipolar technologies reported during the 1990s (but not necessarily put into production). Alongside the traditional down-scaling in design rules, efforts have focused on innovations in emitter and base electrode fabrication. A key issue has been the integration of an epitaxial Si or SiGe intrinsic base into the standard npn process flow. This section concludes with an outlook on the future trends in bipolar technology after the year 2000.

Implanted Base

Today’s most advanced commercial processes are specified with an fT around 30 GHz. The major developments are being carried out using double-poly technology, although new improvements have also been reported for single-poly architectures (Refs. 72, 81). For double-poly transistors, it was demonstrated relatively early that, by optimizing a very low intrinsic base implant energy below 10 keV, devices with an fT around 50 GHz can be fabricated (Ref. 82). The emitter out-diffusion is performed by a combined furnace anneal and RTA. In this way, the intrinsic base width is kept below 1000 Å, whereas the emitter depth is only around 250 Å. Since ion implantation is associated with a number of drawbacks such as channeling, shadowing effects, and crystal defects, it may be difficult to reach an fT above 50 to 60 GHz based on such a technology. As an alternative, the intrinsic base implantation has been replaced by rapid vapor-phase doping (RVD) using B2H6 gas around 900°C (Ref. 83). The in-diffused boron profile will form a thin and low-resistive base. Also, the emitter implantation can be removed by utilizing in situ doped emitter technology (e.g., AsH3 gas during polysilicon deposition) (Ref. 84). Two detrimental effects are then avoided, namely emitter perimeter depletion and the emitter plug effect (Ref. 85). The former effect causes a reduced doping concentration close to the emitter perimeter, whereas the latter implies the plugging of doping atoms in narrow emitter windows, causing shallower junctions compared to larger openings on the same chip. Arsenic came to replace phosphorus as the emitter impurity during the 1970s, mainly because of the emitter push effect plaguing phosphorus monosilicon emitters.
The phosphorus emitter has, however, experienced a renaissance in advanced bipolar transistors through the introduction of the so-called in situ phosphorus-doped polysilicon (IDP) emitter (Ref. 86). One motivation for using IDP technology is the reduction in final emitter resistance compared to the traditional As polyemitter, in particular for aggressively down-scaled devices with very narrow emitter windows. In addition, the emitter drive-in for an IDP emitter is carried out at a lower thermal budget than for the corresponding arsenic emitter because of the difference in diffusivity between the impurity atoms. Using IDP and RVD, very high fT values (above 60 GHz) have been realized (Ref. 83). It has been suggested that the residual stress of the IDP emitter and the interfacial oxide between the poly- and the monosilicon create a heteroemitter action for the device, thus explaining the high current gains of IDP bipolar transistors (Ref. 87).

Base electrode engineering in advanced devices has become an important field in reducing the total base resistance, thus improving fmax of the transistor. One straightforward method of lowering the base sheet resistance is to shunt the base polysilicon with an extended silicide across the total base electrode. This has recently been demonstrated in an fmax = 60 GHz double-poly process (Ref. 88). A still more effective concept is to integrate metal base electrodes (Ref. 89). This approach is combined with in situ boron-doped polysilicon base electrodes as well as an IDP emitter in a double-poly process (see Fig. 3.16). The tungsten electrodes are fully self-aligned using WF6-selective deposition. The technology, denoted SMI (self-aligned metal IDP), has been applied together with RVD base formation. The bipolar process was shown to produce fT and fmax figures of 100 GHz at a breakdown voltage of 2.5 V (Ref. 90).
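The link between total base resistance and fmax referred to above is often summarized by the first-order expression fmax ≈ sqrt(fT / (8π RB CBC)), where RB is the base resistance and CBC the base-collector capacitance. The sketch below uses this approximation to show how lowering RB, for example by silicide shunting or metal base electrodes, raises fmax at a fixed fT. The expression is a common textbook approximation rather than a result derived in this chapter, and all numerical values are assumed for illustration.

```python
import math


def fmax_estimate(ft_hz: float, base_resistance_ohm: float, cbc_farad: float) -> float:
    """First-order estimate: f_max = sqrt(f_T / (8 * pi * R_B * C_BC))."""
    return math.sqrt(ft_hz / (8.0 * math.pi * base_resistance_ohm * cbc_farad))


# Assumed device values (hypothetical, not taken from the cited processes).
FT_HZ = 50e9       # cutoff frequency
CBC_FARAD = 10e-15  # base-collector capacitance

for rb_ohm in (400.0, 200.0, 100.0):
    fmax = fmax_estimate(FT_HZ, rb_ohm, CBC_FARAD)
    print(f"R_B = {rb_ohm:5.0f} ohm -> f_max = {fmax / 1e9:.1f} GHz")
```

Because fmax scales with the inverse square root of RB in this approximation, halving the base resistance raises fmax by a factor of about 1.41 when fT and CBC are unchanged.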


FIGURE 3.16 The self-aligned metal IDP process using selective deposition of tungsten base electrodes (after Ref. 89, copyright © 1997, IEEE).

Epitaxial Base

By introducing epitaxial film growth techniques for intrinsic base formation, the base width is readily controlled on the order of some hundred angstroms. Both selective and non-selective epitaxial growth (SEG and NSEG, respectively) have been reported. One example of a SEG transistor flow is illustrated in Fig. 3.17 (Ref. 91). Not only the epitaxial base, but also the n–-collector is grown using SEG. The p+-poly overhangs warrant a strong base link between the SEG intrinsic base and the base electrode. This fT = 44 GHz process was capable of delivering divider circuits working at 25 GHz. A natural extension of the Si epitaxy is to apply the previously mentioned SiGe epitaxy, thus creating a heterojunction bipolar transistor (HBT). For example, the transistor process in Fig. 3.17 was later extended to a SiGe process with fT = 61 GHz and fmax = 74 GHz (Ref. 92). Apart from high speed, low base resistance is a trademark of many SiGe bipolar processes. For details of the SiGe HBT, the reader is referred to Chapter 5. Here, only some process-integration aspects are given of this very important technology for advanced silicon-based bipolar devices during the 1990s. While the first world records in terms of fT and fmax were broken for non-self-aligned structures or mesa HBTs, planar self-aligned process schemes taking advantage of the benefits of SiGe have


FIGURE 3.17 Process demonstrating selective epitaxial growth: (a) self-aligned formation of p+-poly overhangs, (b) selective epitaxial growth of the intrinsic base, (c) emitter fabrication (after Ref. 91, copyright © 1992, IEEE).

subsequently been demonstrated. In recent years, both single-poly and double-poly SiGe transistors exhibiting excellent dc and ac properties have been presented. This also includes the quasi-self-aligned approach, where a fully CMOS-compatible process featuring an fmax of 71 GHz has been shown (Ref. 93). A similar concept featuring NSEG, so-called differential epitaxy, has been applied in the design of an HBT with a single-crystalline emitter rather than a polysilicon emitter (Ref. 94). This process, however, requires a very low thermal budget because of the high boron and germanium content in the intrinsic base of the transistor. Excellent properties for RF applications are made possible by this approach, such as high fT and fmax around 50 GHz, combined with good dc properties and low noise figures (Ref. 95). In double-poly structures, the extrinsic base is usually deposited prior to SiGe base epitaxy, which is then carried out by SEG (as shown in Fig. 3.17). Several groups report τd below 20 ps using this approach. One example of the most advanced bipolar technologies is the SiGe double-poly structure shown in Fig. 3.18 (Ref. 96). This technology features the SMI base electrode technology, together with SEG of SiGe. The device is isolated by oxide-filled trenches. The reported τd was 7.7 ps and the fmax was 108 GHz, thus approaching the SiGe HBT mesa record of 160 GHz (Ref. 97). A device similar to the one in Fig. 3.18 was also reported to yield an fT of 130 GHz (Ref. 98).
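Gate delays such as the τd values quoted above are commonly extracted from ring-oscillator measurements, where an oscillator of N inverting stages runs at f = 1/(2 N τd). The sketch below applies this relation in both directions. The stage count is hypothetical, and the source does not state that the quoted delays were obtained in exactly this way.

```python
def ring_oscillator_frequency(gate_delay_s: float, n_stages: int) -> float:
    """Oscillation frequency in Hz of an N-stage ring: f = 1 / (2 * N * tau_d)."""
    if n_stages < 1:
        raise ValueError("a ring oscillator needs at least one stage")
    return 1.0 / (2.0 * n_stages * gate_delay_s)


def gate_delay_from_ring(frequency_hz: float, n_stages: int) -> float:
    """Per-stage delay in seconds recovered from a measured ring frequency."""
    if n_stages < 1:
        raise ValueError("a ring oscillator needs at least one stage")
    return 1.0 / (2.0 * n_stages * frequency_hz)


N_STAGES = 21  # hypothetical stage count
for tau_ps in (20.0, 7.7):
    f_osc = ring_oscillator_frequency(tau_ps * 1e-12, N_STAGES)
    print(f"tau_d = {tau_ps} ps, N = {N_STAGES} -> f_osc = {f_osc / 1e9:.2f} GHz")
```

Under these assumptions, a 21-stage ring built from 7.7 ps gates would oscillate at about 3.1 GHz, which shows why picosecond gate delays can be measured with comparatively low-frequency equipment.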


FIGURE 3.18 A state-of-the-art bipolar device featuring SMI electrodes, selectively grown epitaxial SiGe base, in situ doped polysilicon emitter/base and oxide-filled trenches (after Ref. 96, copyright © 1998, IEEE).

Future Trends

The shrinking of dimensions in bipolar devices, in particular for digital applications, will proceed one or two generations behind the MOS frontier, leading to a further reduction in τd. Several of the concepts reviewed above for advanced bipolar components are expected to be introduced in the next commercial high-performance processes; for example, in situ doped emitters and low-resistivity base electrodes. As in CMOS, the overall temperature budget in processing must be reduced. Evidently, bipolar technology will continue to benefit from the progress made in CMOS technology; for example, in isolation and back-end processing. CMOS compatibility will be a general requirement for the majority of bipolar process development because of the strong interest in mixed-signal BiCMOS processes. Advanced isolation technology combining deep and shallow trenches, perhaps on silicon-on-insulator (SOI) or high-resistivity substrates, marks one key trend in future bipolar transistors. Bipolar technology based on SOI substrates may well be accelerated by the current introduction of SOI into CMOS production. However, thermal effects for high-current-drive bipolar devices must be solved when using SOI. An interesting low-cost alternative insulating substrate for RF-bipolar technology is the silicon-on-anything concept recently presented (Ref. 99). In addition, copper metallization will be introduced for advanced bipolars, provided that intermetal dielectrics as well as passive components are developed to meet this additional advantage (Ref. 100). The epitaxial base constitutes another important trend, where both Si and SiGe are expected to enhance performance in the high-frequency domain, although the introduction may be delayed due to progress in ion-implanted technology. In this respect, SEG technology has yet to prove its manufacturability.
Future Si-based bipolar technology with fT and fmax greater than 100 GHz will continue to play an important role in small-density, high-performance designs. The most important applications are found in communication systems in the range 1 to 10 GHz (wireless telephony and local area networks) and 10 to 70 GHz (microwave and optical-fiber communication systems), where Si and/or SiGe bipolar technologies are expected to seriously challenge existing III-V technologies (Ref. 101).

Acknowledgments

We are grateful to G. Malm and M. Linder for carrying out the process simulations. The support from the Swedish High-Frequency Bipolar Technology Consortium is gratefully acknowledged.


References

1. Ning, T. H. and Tang, D. D., Bipolar Trends, Proc. IEEE, 74, 1669, 1986.
2. Nakamura, T. and Nishizawa, H., Recent progress in bipolar transistor technology, IEEE Trans. Electron Dev., 42, 390, 1995.
3. Warnock, J. D., Silicon bipolar device structures for digital applications: Technology trends and future directions, IEEE Trans. Electron Dev., 42, 377, 1995.
4. Wilson, G. R., Advances in bipolar VLSI, Proc. IEEE, 78, 1707, 1990.
5. Barber, H. D., Bipolar device technology challenge and opportunity, Can. J. Phys., 63, 683, 1985.
6. Baltus, P., Influence of process- and device parameters on the performance of portable rf communication circuits, Proceedings of the 24th European Solid State Device Research Conference, Hill, C. and Ashburn, P., Eds., 1994, 3.
7. Burghartz, J. N., BiCMOS process integration and device optimization: Basic concepts and new trends, Electrical Eng., 79, 313, 1996.
8. Ashburn, P., Design and Realization of Bipolar Transistors, Wiley, Chichester, 1988.
9. Roulston, D. J., Bipolar Semiconductor Devices, McGraw-Hill, New York, 1990.
10. Tang, D. D. and Solomon, P. M., Bipolar transistor design for optimized power-delay logic circuits, IEEE J. of Solid-State Circuits, SC-14, 679, 1979.
11. Chor, E.-F., Brunnschweiler, A., and Ashburn, P., A propagation-delay expression and its application to the optimization of polysilicon emitter ECL processes, IEEE J. of Solid-State Circuits, 23, 251, 1988.
12. Stork, J. M. C., Bipolar transistor scaling for minimum switching delay and energy dissipation, in 1988 Int. Electron Devices Meeting Tech. Dig., 1988, 550.
13. Prinz, E. J. and Sturm, J. C., Current gain-Early voltage products in heterojunction bipolar transistors with nonuniform base bandgaps, IEEE Electron. Dev. Lett., 12, 661, 1991.
14. Kurishima, K., An analytical expression of fmax for HBT’s, IEEE Trans. Electron Dev., 43, 2074, 1996.
15. Taylor, G. W. and Simmons, J. G., Figure of merit for integrated bipolar transistors, Solid State Electronics, 29, 941, 1986.
16. Hurkx, G. A. M., The relevance of fT and fmax for the speed of a bipolar CE amplifier stage, IEEE Trans. Electron Dev., 44, 775, 1997.
17. Larson, L. E., Silicon bipolar transistor design and modeling for microwave integrated circuit applications, in 1996 Bipolar Circuits Technol. Meeting Tech. Dig., 1996, 142.
18. Fang, W., Accurate analytical delay expressions for ECL and CML circuits and their applications to optimizing high-speed bipolar circuits, IEEE J. of Solid-State Circuits, 25, 572, 1990.
19. Silvaco International, 1997: VWF Interactive tools, Athena User’s Manual, 1997.
20. Silvaco International, 1997: VWF Interactive tools, Atlas User’s Manual, 1997.
21. Roulston, D. J., BIPOLE3 User’s Manual, BIPSIM, Inc., 1996.
22. Alvarez, A. R., Abdi, B. L., Young, D. L., Weed, H. D., Teplik, J., and Herald, E. R., Application of statistical design and response surface methods to computer-aided VLSI device design, IEEE Trans. Comp.-Aided Design, 7, 272, 1988.
23. Johnson, E. O., Physical limitations on frequency and power parameters of transistors, RCA Rev., 26, 163, 1965.
24. Ng, K. K., Frei, M. R., and King, C. A., Reevaluation of the ftBVCEO limit in Si bipolar transistors, IEEE Trans. Electron Dev., 45, 1854, 1998.
25. Kirk, C. T., A theory of transistor cut-off frequency falloff at high current densities, IRE Trans. Electron. Devices, ED9, 164, 1962.
26. Roulston, D. J., Bipolar Semiconductor Devices, McGraw-Hill, New York, 1990, p. 257.
27. Konaka, S., Amemiya, Y., Sakuma, K., and Sakai, T., A 20 ps/G Si bipolar IC using advanced SST with collector ion implantation, in 1987 Ext. Abstracts 19th Conf. Solid-State Dev. Mater., Tokyo, 1987, 331.


28. Lu, P.-F. and Chen, T.-C., Collector-base junction avalanche effects in advanced double-poly self-aligned bipolar transistors, IEEE Trans. Electron Dev., 36, 1182, 1989.
29. Tang, D. D. and Lu, P.-F., A reduced-field design concept for high-performance bipolar transistors, IEEE Electron. Dev. Lett., 10, 67, 1989.
30. Ugajin, M., Konaka, S., Yokohama, K., and Amemiya, Y., A simulation study of high-speed heteroemitter bipolar transistors, IEEE Trans. Electron Dev., 36, 1102, 1989.
31. Inou, K. et al., 52 GHz epitaxial base bipolar transistor with high Early voltage of 26.5 V with boxlike base and retrograded collector impurity profiles, in 1994 Bipolar Circuits Technol. Meeting Tech. Dig., 1994, 217.
32. Ikeda, T., Watanabe, A., Nishio, Y., Masuda, I., Tamba, N., Odaka, M., and Ogiue, K., High-speed BiCMOS technology with a buried twin well structure, IEEE Trans. Electron Dev., 34, 1304, 1987.
33. Kumar, M. J., Sadovnikov, A. D., and Roulston, D. J., Collector design tradeoffs for low voltage applications of advanced bipolar transistors, IEEE Trans. Electron Dev., 40, 1478, 1993.
34. Kumar, M. J. and Datta, K., Optimum collector width of VLSI bipolar transistors for maximum fmax at high current densities, IEEE Trans. Electron Dev., 44, 903, 1997.
35. Roulston, D. J. and Hébert, F., Optimization of maximum oscillation frequency of a bipolar transistor, Solid State Electronics, 30, 281, 1987.
36. Early, J. M., Effects of space-charge layer widening in junction transistors, Proc. IRE, 42, 1761, 1954.
37. Shafi, Z. A., Ashburn, P., and Parker, G., Predicted propagation delay of Si/SiGe heterojunction bipolar ECL circuits, IEEE J. of Solid-State Circuits, 25, 1268, 1990.
38. Stork, J. M. C. and Isaac, R. D., Tunneling in base-emitter junctions, IEEE Trans. Electron Dev., 30, 1527, 1983.
39. del Alamo, J. and Swanson, R. M., Forward-biased tunneling: A limitation to bipolar device scaling, IEEE Electron. Dev. Lett., 7, 629, 1986.
40. Roulston, D. J., Bipolar Semiconductor Devices, McGraw-Hill, New York, 1990, 220 ff.
41. Van Wijnen, P. J. and Gardner, R. D., A new approach to optimizing the base profile for high-speed bipolar transistors, IEEE Electron. Dev. Lett., 4, 149, 1990.
42. Suzuki, K., Optimum base doping profile for minimum base transit time, IEEE Trans. Electron Dev., 38, 2128, 1991.
43. Yuan, J. S., Effect of base profile on the base transit time of the bipolar transistor for all levels of injection, IEEE Trans. Electron Dev., 41, 212, 1994.
44. Ashburn, P. and Morgan, D. V., Heterojunction bipolar transistors, in Physics and Technology of Heterojunction Devices, Morgan, D. V. and Williams, R. H., Eds., Peter Peregrinus Ltd., London, 1991, chap. 6.
45. Harame, D. L., Stork, J. M. C., Meyerson, B. S., Hsu, K. Y.-J., Cotte, J., Jenkins, K. A., Cressler, J. D., Restle, P., Crabbé, E. F., Subbanna, S., Tice, T. E., Scharf, B. W., and Yasaitis, J. A., Optimization of SiGe HBT technology for high speed analog and mixed-signal applications, in 1993 Int. Electron Devices Meeting Tech. Dig., 1993, 71.
46. Prinz, E. J., Garone, P. M., Schwartz, P. V., Xiao, X., and Sturm, J. C., The effects of base dopant outdiffusion and undoped Si1–xGex junction spacer layers in Si/Si1–xGex/Si heterojunction bipolar transistors, IEEE Electron. Dev. Lett., 12, 42, 1991.
47. Hueting, R. J. E., Slotboom, J. W., Pruijmboom, A., de Boer, W. B., Timmering, C. E., and Cowern, N. E. B., On the optimization of SiGe-base bipolar transistors, IEEE Trans. Electron Dev., 43, 1518, 1996.
48. Kerr, J. A. and Berz, F., The effect of emitter doping gradient on fT in microwave bipolar transistors, IEEE Trans. Electron Dev., ED-22, 15, 1975.
49. Slotboom, J. W. and de Graaf, H. C., Measurement of bandgap narrowing in silicon bipolar transistors, Solid State Electronics, 19, 857, 1976.
50. Cuthbertson, A. and Ashburn, P., An investigation of the tradeoff between enhanced gain and base doping in polysilicon emitter bipolar transistors, IEEE Trans. Electron Dev., ED-32, 2399, 1985.


51. Ning, T. H. and Isaac, R. D., Effect of emitter contact on current gain of silicon bipolar devices, IEEE Trans. Electron Dev., ED-27, 2051, 1980.
52. Post, R. C., Ashburn, P., and Wolstenholme, G. R., Polysilicon emitters for bipolar transistors: A review and re-evaluation of theory and experiment, IEEE Trans. Electron Dev., 39, 1717, 1992.
53. Solomon, P. M. and Tang, D. D., Bipolar circuit scaling, in 1979 IEEE International Solid-State Circuits Conference Tech. Dig., 1979, 86.
54. Ning, T. H., Tang, D. D., and Solomon, P. M., Scaling properties of bipolar devices, in 1980 Int. Electron Devices Meeting Tech. Dig., 1980, 61.
55. Ashburn, P., Design and Realization of Bipolar Transistors, Wiley, Chichester, 1988, chap. 7.
56. Lary, J. E. and Anderson, R. L., Effective base resistance of bipolar transistors, IEEE Trans. Electron Dev., ED-32, 2503, 1985.
57. Rein, H.-M., Design considerations for very-high-speed Si-bipolar IC’s operating up to 50 Gb/s, IEEE J. of Solid-State Circuits, 8, 1076, 1996.
58. Schröter, M. and Walkey, D. J., Physical modeling of lateral scaling in bipolar transistors, IEEE J. of Solid-State Circuits, 31, 1484, 1996.
59. Pfost, M., Rein, H.-M., and Holzwarth, T., Modeling substrate effects in the design of high-speed Si-bipolar IC’s, IEEE J. of Solid-State Circuits, 31, 1493, 1996.
60. Lohstroh, J., Devices and circuits for bipolar (V)LSI, Proc. IEEE, 69, 812, 1981.
61. Wolf, S., Silicon Processing for the VLSI Era, Vol. 2, Lattice Press, Sunset Beach, 1990, 532-533.
62. ibid., p. 16-17.
63. Muller, R. S. and Kamins, T. I., Device Electronics for Integrated Circuits, 2nd ed., Wiley, New York, 1986, 307.
64. Parrillo, L. C., VLSI process integration, in VLSI Technology, Sze, S. M., Ed., McGraw-Hill, Singapore, 1983, 449 ff.
65. Ashburn, P., Polysilicon emitter technology, in 1989 Bipolar Circuits Technol. Meeting Tech. Dig., 1989, 90.
66. Li, G. P., Ning, T. H., Chuang, C. T., Ketchen, M. B., Tang, D. D., and Mauer, J., An advanced high-performance trench-isolated self-aligned bipolar technology, IEEE Trans. Electron Dev., ED-34, 2246, 1987.
67. Tang, D. D., Solomon, P. M., Isaac, R. D., and Burger, R. E., 1.25 µm deep-groove-isolated self-aligned bipolar circuits, IEEE J. of Solid-State Circuits, SC-17, 925, 1982.
68. Yano, K., Nakazato, K., Miyamoto, M., Aoki, M., and Shimohigashi, K., A high-current-gain low-temperature pseudo-HBT utilizing a sidewall base-contact structure (SICOS), IEEE Trans. Electron Dev., 10, 452, 1989.
69. Tang, D. D.-L., Chen, T.-C., Chuang, C. T., Cressler, J. D., Warnock, J., Li, G.-P., Polcari, M. R., Ketchen, M. B., and Ning, T. H., The design and electrical characteristics of high-performance single-poly ion-implanted bipolar transistors, IEEE Trans. Electron Dev., 36, 1703, 1989.
70. de Jong, J. L., Lane, R. H., de Groot, J. G., and Conner, G. W., Electron recombination at the silicided base contact of an advanced self-aligned polysilicon emitter, in 1988 Bipolar Circuits Technol. Meeting Tech. Dig., 1988, 202.
71. Li, G. P., Chen, T.-C., Chuang, C.-T., Stork, J. M. C., Tang, D. D., Ketchen, M. B., and Wang, L. K., Bipolar transistor with self-aligned lateral profile, IEEE Electron. Dev. Lett., EDL-8, 338, 1987.
72. Niel, S., Rozeau, O., Ailloud, L., Hernandez, C., Llinares, P., Guillermet, M., Kirtsch, J., Monroy, A., de Pontcharra, J., Auvert, G., Blanchard, B., Mouis, M., Vincent, G., and Chantre, A., A 54 GHz fmax implanted base 0.35 µm single-polysilicon bipolar transistor, in 1997 Int. Electron Devices Meeting Tech. Dig., 1997, 807.
73. Tang, D. D., Chen, T.-C., Chuang, C.-T., Li, G. P., Stork, J. M. C., Ketchen, M. B., Hackbarth, E., and Ning, T. H., Design considerations of high-performance narrow-emitter bipolar transistors, IEEE Electron. Dev. Lett., EDL-8, 174, 1987.

© 2000 by CRC Press LLC

74. Ning, T. H., Isaac, R. D., Solomon, P. M., Tang, D. D.-L., Yu, H.-N., Feth, G. C., and Wiedmann, S. K., Self-aligned bipolar transistors for high-performance and low-power-delay VLSI, IEEE Trans. Electron Dev., ED-28, 1010, 1981. 75. Chantre, A., Festes, G., Giroult-Matlakowski, G., and Nouailhat, An investigation of nonideal base currents in advanced self-aligned “etched-polysilicon” emitter bipolar transistors, IEEE Trans. Electron Dev., 38, 1354, 1991. 76. Sun, S. W., Denning, D., Hayden, J. D., Woo, M., Fitch, J. T., and Kaushik, V., A nonrecessed-base, self-aligned bipolar structure with selectively deposited polysilicon emitter, IEEE Trans. Electron Dev., 39, 1711, 1992. 77. Chuang, C.-T., Li, G. P., and Ning, T. H., Effect of off-axis implant on the characteristics of advanced self-aligned bipolar transistors, IEEE Electron. Dev. Lett., EDL-8, 321, 1987. 78. Hayden, J. D., Burnett, J. D., Pfiester, J. R., and Woo, M. P., A new technique for forming a shallow link base in a double polysilicon bipolar transistor, IEEE Trans. Electron Dev., 41, 63, 1994. 79. Maritan, C. M. and Tarr, N. G., Polysilicon emitter p-n-p transistors, IEEE Trans. Electron Dev., 36, 1139, 1989. 80. Warnock, J., Lu, P.-F., Cressler, J. D., Jenkins, K. A., and Sun, J. Y. C., 35 GHz/35 psec ECL pnp technology, in 1990 Int. Electron Devices Meeting Tech. Dig., 1990, 301. 81. Chantre, A., Gravier, T., Niel, S., Kirtsch, J., Granier, A., Grouillet, A., Guillermet, M., Maury, D., Pantel, R., Regolini, J. L., and Vincent, G., The design and fabrication of 0.35 µm single-polysilicon self-aligned bipolar transistors, Jpn. J. Appl. Phys., 37, 1781, 1998. 82. Warnock, J., Cressler, J. D., Jenkins, K. A., Chen, T.-C., Sun, J. Y.-C., and Tang, D. D., 50-GHz selfaligned silicon bipolar transistors with ion-implanted base profiles, IEEE Electron. Dev. Lett., 11, 475, 1990. 83. 
Uchino, T., Shiba, T., Kikuchi, T., Tamaki, Y., Watanabe, A., and Kiyota, Y., Very-high-speed silicon bipolar transistors with in situ doped polysilicon emitter and rapid vapor-phase doping base, IEEE Trans. Electron Dev., 42, 406, 1995. 84. Burghartz, J. N., Megdnis, A. C., Cressler, J. D., Sun, J. Y.-C., Stanis, C. L., Comfort, J. H., Jenkins, K. A., and Cardone, F., Novel in-situ doped polysilicon emitter process with buried diffusion source (BDS), IEEE Electron. Dev. Lett., 12, 679, 1991. 85. Burghartz, J. N., Sun, J. Y.-C., Stanis, C. L., Mader, S. R., and Warnock, J. D., Identification of perimeter depletion and emitter plug effects in deep-submicrometer, shallow-junction polysilicon emitter bipolar transistors, IEEE Trans. Electron Dev., 39, 1477, 1992. 86. Shiba, T., Uchino, T., Ohnishi, K., and Tamaki, Y., In situ phosphorus-doped polysilicon emitter technology for very high-speed small emitter bipolar transistors, IEEE Trans. Electron Dev., 43, 889, 1996. 87. Kondo, M., Shiba, T., and Tamaki, Y., Analysis of emitter efficiency enhancement induced by residual stress for in situ phosphorus-doped emitter transistors, IEEE Trans. Electron Dev., 44, 978, 1997. 88. Böck, J., Meister, T. F., Knapp, H., Aufinger, K., Wurzer. M., Gabl, R., Pohl, M., Boguth, S., Franosch, M., and Treitinger, L., 0.5 µm / 60 GHz fmax implanted base Si bipolar technology, in 1998 Bipolar Circuits Technol. Meeting Tech. Dig., 1998, 160. 89. Onai, T., Ohue, E., Tanabe, M., and Washio, K., 12-ps ECL using low-base-resistance Si bipolar transistor by self-aligned metal/IDP technology, IEEE Trans. Electron Dev., 44, 2207, 1997. 90. Kiyota, Y., Ohue, E., Washio, K., Tanabe, M., and Inade, T., Lamp-heated rapid vapor-phase doping technology for 100-GHz Si bipolar transistors, in 1996 Bipolar Circuits Technol. Meeting Tech. Dig., 1996, 173. 91. Meister, T. F., Stengl, R., Meul, H. W., Packan, P., Felder, A., Klose, H., Schreiter, R., Popp, J., Rein, H. 
M., and Treitinger, L., Sub-20 ps silicon bipolar technology using selective epitaxial growth, in 1992 Int. Electron Devices Meeting Tech. Dig., 1992, 401.

© 2000 by CRC Press LLC

92. Meister, T. F., Schäfer, H., Franosch, M., Molzer, W., Aufinger, K., Scheler, U., Walz, C., Stolz, M., Boguth, S., and Böck, J., SiGe base bipolar technology with 74 GHz fmax and 11 ps gate delay, in 1995 Int. Electron Devices Meeting Tech. Dig., 1995, 739. 93. Chantre, A., Marty, M., Regolini, J. L., Mouis, M., de Pontcharra, J., Dutartre D., Morin, C., Gloria, D., Jouan, S., Pantel, R., Laurens, M., and Monroy, A., A high performance low complexity SiGe HBT for BiCMOS integration, in 1998 Bipolar Circuits Technol. Meeting Tech. Dig., 1998, 93. 94. Schüppen, A., König, U., Gruhle, A., Kibbel, H., and Erben, U., The differential SiGe-HBT, Proceedings of the 24th European Solid State Device Research Conference, Hill, C. and Ashburn, P., Eds., 1994, 469. 95. Schüppen, A., Dietrich, H., Seiler, U., von der Ropp, H., and Erben, U., A SiGe RF technology for mobile communication systems, Microwave Engineering Europe, June 1998, 39. 96. Ohue, E., Oda, K., Hayami, R., and Washio, K., A 7.7 ps CML using selective-epitaxial SiGe HBTs, 1998 Bipolar Circuits Technol. Meeting Tech. Dig., 1998, 97. 97. Schüppen, A., Erben, U., Gruhle, A., Kibbel, H., Schumacher, H., and König, U., Enhanced SiGe heterojunction bipolar transistors with 160 GHz-fmax, 1995 Int. Electron Devices Meeting Tech. Dig., 1995, 743. 98. Oda, K., Ohue, E., Tanabe, M., Shimamoto, H., Onai, T., and Washio, K., 130-GHz fT SiGe HBT technology, in 1997 Int. Electron Devices Meeting Tech. Dig., 1997, 791. 99. Dekker, R., Baltus, P., van Deurzen, M., v.d. Einden, W., Maas, H., and Wagemans, A., An ultra low-power RF bipolar technology on glass, 1997 Int. Electron Devices Meeting Tech. Dig., 1997, 921. 100. Hashimoto, T., Kikuchi, T., Watanabe, K., Ohashi, N., Saito, N., Yamaguchi, H., Wada, S., Natsuaki, N., Kondo, M., Kondo, S., Homma, Y., Owada, N., and Ikeda, T., A 0.2 µm bipolar-CMOS technology on bonded SOI with copper metallization for ultra high-speed processors, 1998 Int. Electron Devices Meeting Tech. 
Dig., 1998, 209. 101. König, U., SiGe & GaAs as competitive technologies for RF-applications, 1998 Bipolar Circuits Technol. Meeting Tech. Dig., 1998, 87.

© 2000 by CRC Press LLC

Cristoloveanu, S. “Silicon on Insulator Technology” The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000


4 Silicon on Insulator Technology

Sorin Cristoloveanu
Institut National Polytechnique de Grenoble

4.1 Introduction
4.2 Fabrication of SOI Wafers
Silicon on Sapphire • ELO and ZMR • FIPOS • SIMOX • Wafer Bonding • UNIBOND
4.3 Generic Advantages of SOI
4.4 SOI Devices
CMOS Circuits • Bipolar Transistors • High-Voltage Devices • Innovative Devices
4.5 Fully-Depleted SOI Transistors
Threshold Voltage • Subthreshold Slope • Transconductance • Volume Inversion • Defect Coupling
4.6 Partially Depleted SOI Transistors
4.7 Short-Channel Effects
4.8 SOI Challenges
4.9 Conclusion

4.1 Introduction

Silicon on Insulator (SOI) technology (more specifically, silicon on sapphire) was originally invented for the niche of radiation-hard circuits. In the last 20 years, a variety of SOI structures have been conceived with the aim of dielectrically separating, using a buried oxide (Fig. 4.1(b)), the active device volume from the silicon substrate.1 Indeed, in an MOS transistor, only the very top region (0.1–0.2 µm thick, i.e., less than 0.1% of the total thickness) of the silicon wafer is useful for electron transport and device operation, whereas the substrate is responsible for detrimental, parasitic effects (Fig. 4.1(a)). More recently, the advent of new SOI materials (Unibond, ITOX) and the explosive growth of portable microelectronic devices have drawn considerable attention to SOI for the fabrication of low-power (LP), low-voltage (LV), and high-frequency (HF) CMOS circuits.

The aim of this chapter is to review the state of the art of SOI technologies, including the material synthesis (Section 4.2), the key advantages of SOI circuits (Section 4.3), the structure and performance of typical devices (Section 4.4), and the operation modes of fully depleted (Section 4.5) and partially depleted (Section 4.6) SOI MOSFETs. Section 4.7 is dedicated to short-channel effects. The main challenges that SOI faces in order to compete successfully with bulk Si in the commercial arena are critically discussed in Section 4.8.

4.2 Fabrication of SOI Wafers

Many techniques, more or less mature and effective, are available for the synthesis of SOI wafers.1


FIGURE 4.1 Basic architecture of MOS transistors in (a) bulk silicon and (b) SOI.

Silicon on Sapphire

Silicon on sapphire (SOS, Fig. 4.2(a1)) is the earliest member of the SOI family. The epitaxial growth of Si films on Al2O3 gives rise to small silicon islands that eventually coalesce. The interface transition region contains crystallographic defects due to the lattice mismatch and to Al contamination from the substrate. The electrical properties suffer from lateral stress, in-depth inhomogeneity of the SOS films, and the defective transition layer.2

FIGURE 4.2 SOI family: (a) SOS, ZMR, FIPOS, and wafer bonding, (b) SIMOX variants, (c) UNIBOND processing sequence.


SOS has recently undergone a significant face-lift: larger wafers and thinner films with higher crystal quality. This improvement is achieved by solid-phase epitaxial regrowth. Silicon ions are implanted to amorphize the film and erase the memory of the damaged lattice and interface. Annealing allows the epitaxial regrowth of the film, starting from the ‘seeding’ surface towards the Si–Al2O3 interface. The result is visible in terms of higher carrier mobility and lifetime; 100-nm thick SOS films of good quality have recently been grown on 6 in. wafers.3 Thanks to the ‘infinite’ thickness of the insulator, SOS looks promising for the integration of RF and radiation-hard circuits.

ELO and ZMR

The epitaxial lateral overgrowth (ELO) method consists of growing a single-crystal Si film on a seeded and, often, patterned oxide (Fig. 4.2(a2)). Since the epitaxial growth proceeds in both lateral and vertical directions, the ELO process requires a post-epitaxy thinning of the Si film. Alternatively, polysilicon can be deposited directly on SiO2; subsequently, zone melting recrystallization (ZMR) is achieved by scanning high-energy sources (lamps, lasers, beams, or strip heaters) across the wafer. The ZMR process can be seeded or unseeded; it is basically limited by the lateral extension of single-crystal regions, free from grain subboundaries and associated defects. ELO and ZMR are basic techniques for the integration of 3-D stacked circuits.

FIPOS

The FIPOS method (full isolation by porous oxidized silicon) makes use of the very large surface-to-volume ratio (10³ cm² per cm³) of porous silicon which is, thereafter, subject to selective oxidation (Fig. 4.2(a3)). The critical step is the conversion of selected p-type regions of the Si wafer into porous silicon, via anodic reaction. FIPOS may enlighten Si technology because there are prospects, at least from a conceptual viewpoint, for combining electroluminescent porous Si devices with fast SOI–CMOS circuits.

SIMOX

In the last decade, the dominant SOI technology was SIMOX (separation by implantation of oxygen). The buried oxide (BOX) is synthesized by internal oxidation during the deep implantation of oxygen ions into a Si wafer. Annealing at high temperature (1320°C, for 6 h) is necessary to recover a suitable crystalline quality of the film. High-current implanters (100 mA) have been conceived to produce 8 in. wafers with good thickness uniformity, low defect density (except threading dislocations: 10⁴–10⁶ cm⁻²), sharp Si–SiO2 interface, robust BOX, and high carrier mobility.4 The family of SIMOX structures is presented in Figure 4.2(b):

• Thin and thick Si films fabricated by adjusting the implant energy.
• Low-dose SIMOX: a dose of 4 × 10¹⁷ O⁺/cm² and an additional oxygen-rich anneal for enhanced BOX integrity (ITOX process) yield a 0.1-µm thick BOX (Fig. 4.2(b1)).
• Standard SIMOX obtained with 1.8 × 10¹⁸ O⁺/cm² implant dose, at 190 keV and 650°C; the thicknesses of the Si film and BOX are roughly 0.2 µm and 0.4 µm, respectively (Fig. 4.2(b2)).
• Double SIMOX (Fig. 4.2(b3)), where the Si layer sandwiched between the two oxides can serve for interconnects, wave guiding, additional gates, or electric shielding.
• Laterally-isolated single-transistor islands (Fig. 4.2(b4)), formed by implantation through a patterned oxide.
• Interrupted oxides (Fig. 4.2(b5)) which can be viewed as SOI regions integrated into a bulk Si wafer.
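The link between implant dose and BOX thickness quoted above can be checked with a simple mass balance. The sketch below is only an order-of-magnitude estimate: it assumes that every implanted oxygen atom ends up in stoichiometric SiO2 and that the buried oxide has a density of 2.2 g/cm³ (both assumptions, not values from the text). The function name is illustrative.

```python
# Order-of-magnitude estimate of buried-oxide (BOX) thickness from oxygen dose.
# Assumptions (not from the chapter): all implanted oxygen forms stoichiometric
# SiO2, and the oxide density is 2.2 g/cm^3.

AVOGADRO = 6.022e23        # 1/mol
SIO2_DENSITY = 2.2         # g/cm^3 (assumed)
SIO2_MOLAR_MASS = 60.08    # g/mol
OXYGEN_PER_SIO2 = 2        # oxygen atoms per SiO2 molecule


def box_thickness_um(dose_per_cm2: float) -> float:
    """Return the BOX thickness in micrometres for a given O+ dose (ions/cm^2)."""
    molecules_per_cm3 = SIO2_DENSITY / SIO2_MOLAR_MASS * AVOGADRO
    oxygen_per_cm3 = OXYGEN_PER_SIO2 * molecules_per_cm3
    thickness_cm = dose_per_cm2 / oxygen_per_cm3
    return thickness_cm * 1e4  # cm -> um


if __name__ == "__main__":
    for label, dose in (("standard SIMOX", 1.8e18), ("low-dose SIMOX", 4e17)):
        print(f"{label}: {box_thickness_um(dose):.2f} um")
```

Under these assumptions the estimate gives about 0.41 µm for the standard dose and about 0.09 µm for the low dose, in line with the 0.4 µm and 0.1 µm figures listed above.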


Wafer Bonding

Wafer bonding (WB) and etch-back stands as a rather mature SOI technology. An oxidized wafer is mated to another silicon wafer (Fig. 4.2(a4)). The challenge is to drastically thin down one side of the bonded structure in order to reach the targeted thickness of the silicon film. Etch-stop layers can be achieved by doping steps (P+/P-, P/N) or porous silicon (Eltran process).5 The advantage of wafer bonding is that it provides unlimited combinations of BOX and film thicknesses, whereas its weakness comes from the difficulty of producing ultra-thin films with good uniformity.

UNIBOND

A recent, revolutionary bonding-related process (UNIBOND) uses the deep implantation of hydrogen into an oxidized Si wafer (Fig. 4.2(c1)) to generate microcavities and thus circumvent the thinning problem.6 After bonding wafer A to a second wafer B and subsequent annealing to enhance the bonding strength (Fig. 4.2(c2)), the hydrogen-induced microcavities coalesce. The two wafers separate, not at the bonded interface but at a depth defined by the location of hydrogen microcavities. This mechanism, named Smart-cut, results in a rough SOI structure (Fig. 4.2(c4)). The process is completed by touch-polishing to erase the surface roughness.

The extraordinary potential of the Smart-cut approach comes from several distinct advantages: (1) the etch-back step is avoided, (2) the second wafer (Fig. 4.2(c3)) being recyclable, UNIBOND is a single-wafer process, (3) only conventional equipment is needed for mass production, (4) relatively inexpensive 12 in. wafers are manufacturable, and (5) the thickness of the silicon film and/or buried oxide can be adjusted to match most device configurations (ultra-thin CMOS or thick-film power transistors and sensors). The defect density in the film is very low, the electrical properties are excellent, and the BOX quality is comparable to that of the original thermal oxide. The Smart-cut process is adaptable to a variety of materials: SiC or III–V compounds on insulator, silicon on diamond, etc. Smart-cut can be used to transfer already fabricated bulk-Si CMOS circuits on glass or on other substrates.

4.3 Generic Advantages of SOI

SOI circuits consist of single-device islands dielectrically isolated from each other and from the underlying substrate (Fig. 4.1(b)). The lateral isolation offers a more compact design and a simpler technology than in bulk silicon; there is no need for wells or interdevice trenches. In addition, the vertical isolation renders latch-up mechanisms impossible. The source and drain regions extend down to the buried oxide; thus, the junction surface is minimized. This implies reduced leakage currents and junction capacitances, which further translates into improved speed, lower power, and a wider temperature range of operation. The limited extension of the drain and source regions makes SOI devices less affected by short-channel effects, which originate from ‘charge sharing’ between gate and junctions. Besides the outstanding tolerance of transient radiation effects, SOI MOSFETs experience a lower electric-field peak than in bulk Si and are potentially more immune to hot-carrier damage.

It is in the highly competitive domain of LV/LP circuits, operated with a one-battery supply (0.9–1.5 V), that SOI can express its entire potential. With such a supply, only a small gate-voltage swing is available to switch a transistor from the off-state to the on-state. SOI offers the possibility of achieving a quasi-ideal subthreshold slope (60 mV/decade at room temperature) and, hence, a threshold voltage shrunk below 0.3 V. Low leakage currents limit the static power dissipation, as compared to bulk Si, whereas the dynamic power dissipation is minimized by the combined effects of low parasitic capacitances and reduced supply voltage.

Two arguments can be given to outline unequivocally the advantage of SOI over bulk Si:

• Operation at similar voltage consistently shows about 30% increase in performance, whereas operation at similar low-power dissipation yields as much as 300% performance gain in SOI. It is believed, at least in the SOI community, that SOI circuits of generation (n) and bulk-Si circuits from the next generation (n + 1) perform comparably.
• Bulk Si technology does attempt to mimic a number of features that are natural in SOI: the double-gate configuration is reproduced by processing surrounded-gate vertical MOSFETs on bulk Si, full depletion is approached by tailoring a low-high step doping, and the dynamic-threshold operation is borrowed from SOI.

The problem for SOI is that such an enthusiastic list of merits did not perturb the fantastic progress and authority of bulk Si technology. There has been no room or need so far for an alternative technology such as SOI. However, the SOI community remains confident that the SOI advantages together with the predictable approach of bulk-Si limits will be enough for SOI to succeed soon.
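The connection made in this section between a quasi-ideal subthreshold slope and a low threshold voltage can be illustrated numerically. The sketch below assumes a purely exponential subthreshold current (one decade of current per swing S of gate voltage) and the ideal limit S = ln(10)·kT/q; the function names and the 90 mV/decade comparison value are illustrative assumptions, not data from the chapter.

```python
import math

BOLTZMANN = 1.380649e-23             # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def ideal_swing_mv_per_decade(temperature_k: float = 300.0) -> float:
    """Ideal subthreshold swing ln(10)*kT/q, in mV per decade of current."""
    thermal_voltage = BOLTZMANN * temperature_k / ELEMENTARY_CHARGE
    return math.log(10.0) * thermal_voltage * 1e3


def off_current_decades(threshold_v: float, swing_mv_per_decade: float) -> float:
    """Decades of current drop between V_G = V_T and V_G = 0.

    Assumes an ideal exponential subthreshold characteristic.
    """
    return threshold_v * 1e3 / swing_mv_per_decade


if __name__ == "__main__":
    s_ideal = ideal_swing_mv_per_decade()
    print(f"ideal swing at 300 K: {s_ideal:.1f} mV/decade")
    # 90 mV/decade is a hypothetical degraded swing used only for comparison.
    for swing in (s_ideal, 90.0):
        decades = off_current_decades(0.3, swing)
        print(f"V_T = 0.3 V, S = {swing:.1f} mV/dec -> {decades:.1f} decades")
```

With a swing close to 60 mV/decade, a 0.3 V threshold still leaves about five decades between the threshold current and the off-state current, whereas a degraded swing would require a higher threshold to reach the same off-current.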

4.4 SOI Devices

CMOS Circuits

High-performance SOI CMOS circuits, compatible with LV/LP and high-speed ULSI applications, have been repeatedly demonstrated on submicron devices. Quarter-micron ring oscillators showed delay times of 14 ps/stage at 1.5 V,7 and of 45 ps/stage at 1 V.8 A PLL operated at 2.5 V and 4 GHz dissipates only 19 mW.8 Microwave SOS MOSFETs, with T-gate configuration, had 66-GHz maximum frequency and low noise figure.3 More complex SOI circuits, with direct impact on mainstream microelectronics, have also been fabricated: 0.5 V–200 MHz microprocessor,9 4 Mb SRAM,10 16 Mb and 1 Gb DRAM,11 etc.1,12 Several companies (IBM, Motorola, Sharp) have announced the imminent commercial deployment of ‘SOI-enhanced’ PC processors and mobile communication devices.

CMOS SOI circuits show capability of successful operation at temperatures higher than 300°C: the leakage currents are much smaller and the threshold voltage is less temperature sensitive (≈0.5 mV/°C for fully depleted MOSFETs) than in bulk Si.13 In addition, many SOI circuits are radiation-hard, able to sustain doses above 10 Mrad.

Bipolar Transistors

As a consequence of the small film thickness, most bipolar transistors have a lateral configuration. The implementation of BiCMOS technology on SOI has resulted in devices with a cutoff frequency above 27 GHz.14 Hybrid MOS–bipolar transistors with increased current drive and transconductance are formed by connecting the gate to the floating body (or base); the MOSFET action dominates in strong inversion whereas, in weak inversion, the bipolar current prevails.12 Vertical bipolar transistors have been processed in thick-film SOI (wafer bonding or epitaxial growth over SIMOX). An elegant solution for thin-film SOI is to replace the buried collector by an inversion layer activated by the back gate.12

High-Voltage Devices

Lateral double-diffused MOSFETs (DMOS), with long drift region, were fabricated on SIMOX and showed 90 V/1.3 A capability.15 Vertical DMOS can be accommodated in thicker wafer-bonding SOI. The SIMOX process offers the possibility to synthesize locally a buried oxide (‘interrupted’ SIMOX, Fig. 4.2(b5)). Therefore, a vertical power device (DMOS, IGBT, UMOS, etc.), located in the bulk region of the wafer, can be controlled by a low-power CMOS/SOI circuit (Fig. 4.3(a)). A variant of this concept is the ‘mezzanine’ structure, which served for the fabrication of a 600 V/25 A smart-power device.16 Double SIMOX (Fig. 4.2(b3)) has also been used to combine a power MOSFET with a double-shielded high-voltage lateral CMOS and an intelligent low-voltage CMOS circuit.17


FIGURE 4.3 Examples of innovative SOI devices: (a) combined bipolar (or high power) bulk-Si transistor with low-voltage SOI CMOS circuits, (b) dual-gate transistors, (c) pressure sensor, and (d) gate-all-around (GAA) MOSFET.

Innovative Devices

Most innovative devices make use of special SOI features, including the possibility to (1) combine bulk Si and SOI on a single chip (Fig. 4.3(a)), (2) adjust the thickness of the Si overlay and buried oxide, and (3) implement additional gates in the buried oxide (Fig. 4.3(b)), by ELO process or by local oxidation of the sandwiched Si layer in double SIMOX (Fig. 4.2(b3)).

SOI is an ideal material for microsensors because the Si/BOX interface gives a perfect etch-stop mark, making it possible to fabricate very thin membranes (Fig. 4.3(c)). Transducers for detection of pressure, acceleration, gas flow, temperature, radiation, magnetic field, etc. have successfully been integrated on SOI.1,16

The feasibility of three-dimensional circuits has been demonstrated on ZMR structures. For example, an image-signal processor is organized in three levels: photodiode arrays in the upper SOI layer, fast A/D converters in the intermediate SOI layer, and arithmetic units and shift registers in the bottom bulk Si level.18

The gate-all-around (GAA) transistor of Fig. 4.3(d), based on the concept of volume inversion, is fabricated by etching a cavity into the BOX and wrapping the oxidized transistor body into a poly-Si gate.12 Similar devices include the Delta transistor19 and various double-gate MOSFETs.

The family of SOI devices also includes optical waveguides and modulators, microwave transistors integrated on high-resistivity SIMOX, twin-gate MOSFETs, and other exotic devices.1,12 They do not belong to science fiction: the devices have already been demonstrated in terms of technology and functionality, even if most people still do not believe that they can operate.

4.5 Fully-Depleted SOI Transistors

In SOI MOSFETs (Fig. 4.1(b)), inversion channels can be activated at both the front Si–SiO2 interface (via gate modulation $V_{G1}$) and the back Si–BOX interface (via substrate, back-gate bias $V_{G2}$).

Full depletion means that the depletion region covers the entire transistor body. The depletion charge is constant and cannot extend according to the gate bias. A better coupling develops between the gate bias and the inversion charge, leading to enhanced drain current. In addition, the front- and back-surface potentials become coupled. The coupling factor is roughly equal to the thickness ratio between gate oxide and buried oxide. The electrical characteristics of one channel vary remarkably with the bias applied to the opposite gate. Due to interface coupling, all front-gate measurements reflect the back-gate bias and the quality of the buried oxide and its interface.

Totally new ID(VG) relations apply to fully depleted SOI MOSFETs, whose complex behavior is controlled by both gate biases. The typical characteristics of the front-channel transistor are schematically illustrated in Fig. 4.4, for three distinct bias conditions of the back interface (inversion, depletion, and accumulation), and will be explained next.

FIGURE 4.4 Generic front-channel characteristics of a fully depleted n-channel SOI MOSFET for accumulation (A), depletion (D), and inversion (I) at the back interface: (a) $I_D(V_{G1})$ curves in strong inversion, (b) $\log I_D(V_{G1})$ curves in weak inversion, and (c) transconductance $g_m(V_{G1})$ curves.

Threshold Voltage

The lateral shift of $I_D(V_{G1})$ curves (Fig. 4.4(a)) is explained by the linear variation of the front-channel threshold voltage, $V_{T1}^{dep}$, with back-gate bias. This potential coupling causes $V_{T1}^{dep}$ to decrease linearly, with increasing $V_{G2}$, between two plateaus corresponding, respectively, to accumulation and inversion at the back interface20:

$$V_{T1}^{dep} = V_{T1}^{acc} - \frac{C_{si}\,C_{ox2}}{C_{ox1}\,(C_{ox2} + C_{si} + C_{it2})}\,\left(V_{G2} - V_{G2}^{acc}\right) \tag{4.1}$$

where $V_{T1}^{acc}$ is the threshold voltage when the back interface is accumulated

$$V_{T1}^{acc} = \Phi_{fb1} + \frac{C_{ox1} + C_{si} + C_{it1}}{C_{ox1}}\,2\Phi_F - \frac{Q_{si}}{2C_{ox1}} \tag{4.2}$$

and $V_{G2}^{acc}$ is given by

$$V_{G2}^{acc} = \Phi_{fb2} - \frac{C_{si}}{C_{ox2}}\,2\Phi_F - \frac{Q_{si}}{2C_{ox2}} \tag{4.3}$$

In the above equations, $C_{si}$, $C_{ox}$, and $C_{it}$ are the capacitances of the fully depleted film, oxide, and interface traps, respectively; $Q_{si}$ is the depletion charge, $\Phi_F$ is the Fermi potential, and $\Phi_{fb}$ is the flat-band potential. The subscripts 1 and 2 hold for the front- or the back-channel parameters and can be interchanged to account for the variation of the back-channel threshold voltage $V_{T2}$ with $V_{G1}$.

The difference between the two plateaus, $\Delta V_{T1} = (C_{si}/C_{ox1})\,2\Phi_F$, depends slightly on doping, whereas the slope does not. We must insist on the generality of Eqs. (4.1) to (4.3) as compared to the simple case of bulk Si MOSFETs (or partially depleted MOSFETs), where

$$V_{T1} = \Phi_{fb1} + \left(1 + \frac{C_{it1}}{C_{ox1}}\right) 2\Phi_F + \frac{\sqrt{4 q \varepsilon_{si} N_A \Phi_F}}{C_{ox1}} \tag{4.4}$$

The extension to p-channels or accumulation-mode SOI MOSFETs is also straightforward.1 In fully depleted MOSFETs, the threshold voltage decreases in thinner films (i.e., reduced depletion charge) until quantum effects arise and lead to the formation of a 2-D subband system. In ultra-thin films ($t_{si} \le 10$ nm), the separation between the ground state and the bottom of the conduction band increases with reducing thickness: a $V_T$ rebound is then observed.21
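The back-gate dependence described by Eqs. (4.1) to (4.3) can be sketched numerically. The example below is a minimal illustration under stated assumptions, not a calibrated model: the device dimensions, doping, flat-band potentials, and the sign convention $Q_{si} = -qN_At_{si}$ for an n-channel device are all assumed here, interface traps are neglected by default, and the lower plateau is obtained from the plateau difference $\Delta V_{T1}$ quoted in the text. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

EPS0 = 8.854e-14        # vacuum permittivity, F/cm
EPS_SI = 11.7 * EPS0    # silicon permittivity, F/cm
EPS_OX = 3.9 * EPS0     # SiO2 permittivity, F/cm
Q = 1.602e-19           # elementary charge, C
KT_Q = 0.02585          # thermal voltage near 300 K, V
N_I = 1.0e10            # intrinsic carrier density, cm^-3 (assumed)


@dataclass
class FdSoiDevice:
    """Illustrative n-channel device; every default value is an assumption."""
    t_ox1_cm: float = 5e-7    # 5 nm gate oxide
    t_ox2_cm: float = 4e-5    # 0.4 um buried oxide
    t_si_cm: float = 5e-6     # 50 nm silicon film
    n_a_cm3: float = 1e17     # p-type film doping
    phi_fb1: float = -0.9     # front flat-band potential, V
    phi_fb2: float = -0.9     # back flat-band potential, V
    c_it1: float = 0.0        # front interface-trap capacitance, F/cm^2
    c_it2: float = 0.0        # back interface-trap capacitance, F/cm^2


def front_threshold(dev: FdSoiDevice, v_g2: float) -> float:
    """Front-channel threshold voltage versus back-gate bias, Eqs. (4.1)-(4.3)."""
    c_ox1 = EPS_OX / dev.t_ox1_cm
    c_ox2 = EPS_OX / dev.t_ox2_cm
    c_si = EPS_SI / dev.t_si_cm
    q_si = -Q * dev.n_a_cm3 * dev.t_si_cm            # assumed sign convention
    two_phi_f = 2.0 * KT_Q * math.log(dev.n_a_cm3 / N_I)

    # Eq. (4.2): upper plateau, back interface accumulated.
    vt_acc = (dev.phi_fb1
              + (c_ox1 + c_si + dev.c_it1) / c_ox1 * two_phi_f
              - q_si / (2.0 * c_ox1))
    # Eq. (4.3): back-gate bias at the onset of back accumulation.
    vg2_acc = dev.phi_fb2 - c_si / c_ox2 * two_phi_f - q_si / (2.0 * c_ox2)
    # Eq. (4.1): linear region, back interface depleted.
    slope = c_si * c_ox2 / (c_ox1 * (c_ox2 + c_si + dev.c_it2))
    vt_dep = vt_acc - slope * (v_g2 - vg2_acc)
    # Lower plateau from the plateau difference (C_si / C_ox1) * 2*Phi_F.
    vt_inv = vt_acc - c_si / c_ox1 * two_phi_f
    return min(vt_acc, max(vt_inv, vt_dep))


if __name__ == "__main__":
    device = FdSoiDevice()
    for v_g2 in (-30.0, -10.0, 0.0, 10.0):
        print(f"V_G2 = {v_g2:6.1f} V -> V_T1 = {front_threshold(device, v_g2):.3f} V")
```

For the assumed 5 nm gate oxide and 0.4 µm buried oxide, the slope of the linear region comes out close to the oxide thickness ratio (about 0.012), which is the coupling factor mentioned at the beginning of this section.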

Subthreshold Slope

For depletion at the back interface, the subthreshold slope (Fig. 4.4(b)) is very steep and the subthreshold swing S is given by22:

$$S_1^{dep} = 2.3\,\frac{kT}{q}\left(1 + \frac{C_{it1}}{C_{ox1}} + \alpha_1 \frac{C_{si}}{C_{ox1}}\right) \tag{4.5}$$

The interface coupling coefficient $\alpha_1$

$C_{ox2} + C_{it2}$ […]

[…] ID2·L2/W2, which makes V1 > V2. When the PH1 switches are closed, the steady-state voltage across the capacitor is V1 – V2, and the current in M3 is set by ID2. When the PH2 switches close, V2 is momentarily increased, while V1 is decreased. The transient is repeated when the PH1 switches close. The net effect of the switched capacitor bias boost is that the current in M3 increases after both clock edges. Notice that the current is increased, whether it is needed or not. This circuit is fast because there is no feedback loop, but it is not the most efficient because it does not actually sense the current in the differential pair.

FIGURE 19.47 Adaptive biasing using a switched capacitor circuit.

In all three approaches, the quiescent current is less than the current when the output is required to slew. Therefore, output swing and gain are not degraded when the bias current returns to its quiescent value. All three adaptive bias circuits cause the opamps to put larger current spikes on the supplies. The width of the power supply lines should be increased to compensate for the increased IR drop and crosstalk. Enhanced slew rate circuits are not linear time-invariant systems because the transconductance and output impedance of the transistors are bias current dependent, and the bias is time varying. A transient analysis is the most dependable way to evaluate settling time in this case.

Output Swing

Decreasing power supply voltages put an uncomfortable squeeze on the design engineer. To maintain the desired signal-to-noise ratio with a smaller signal swing, circuit impedances must decrease, which often cancels any power savings that may be gained by a lower supply voltage. To get the best signal swing, differential circuits are used. If the output stages use cascode current mirrors, bias voltages must be generated that keep both the mirror and cascode transistors in saturation with the minimum voltage on the output. Another example of a high swing bias circuit (refer also to Section 19.2) and a current mirror is shown in Fig. 19.48. First, let the W/L ratio of M2, M4, and M6 be equal, and the W/L ratio of M3 and M5 be equal. Now recall that to keep a MOSFET in saturation

$$V_{DS} \ge V_{GS} - V_{THN} \tag{19.61}$$

The minimum output voltage that will keep both M5 and M6 in saturation with proper biasing is

$$V_{out} \ge V_{GS6} - 2V_{THN} + V_{GS5} \tag{19.62}$$

ignoring the bulk effects. For a given current, the minimum drain voltage can be rewritten as

$$V_{DS} \ge \sqrt{\frac{I_D \cdot L}{K \cdot W}} \tag{19.63}$$

FIGURE 19.48 A high swing biasing circuit for low power supply applications.

The equation for the minimum VOUT can be rewritten as

$$V_{OUT} \ge \sqrt{\frac{I_{OUT} \cdot L_6}{K_6 \cdot W_6}} + \sqrt{\frac{I_{OUT} \cdot L_5}{K_5 \cdot W_5}} \tag{19.64}$$
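The minimum output voltage of Eq. (19.64) is easy to evaluate for trial device sizes. The sketch below assumes the square-law convention I_D = K·(W/L)·(V_GS – V_THN)², so that K absorbs any factor of one half; the numerical values of K, the current, and the device sizes are placeholders chosen for illustration, not values from the text.

```python
import math


def vds_sat(i_d: float, k: float, w: float, l: float) -> float:
    """Minimum V_DS for saturation, Eq. (19.63).

    Assumes I_D = K * (W / L) * (V_GS - V_THN)**2.
    """
    return math.sqrt(i_d * l / (k * w))


def min_vout(i_out: float,
             k5: float, w5: float, l5: float,
             k6: float, w6: float, l6: float) -> float:
    """Minimum output voltage keeping M5 and M6 saturated, Eq. (19.64)."""
    return vds_sat(i_out, k6, w6, l6) + vds_sat(i_out, k5, w5, l5)


if __name__ == "__main__":
    # Placeholder values: 100 uA output current, K = 50 uA/V^2, W/L = 20.
    i_out = 100e-6
    k = 50e-6
    v_min = min_vout(i_out, k, 20.0, 1.0, k, 20.0, 1.0)
    print(f"minimum V_OUT: {v_min:.3f} V")
    # Quadrupling both widths halves each saturation voltage.
    v_wide = min_vout(i_out, k, 80.0, 1.0, k, 80.0, 1.0)
    print(f"with 4x wider devices: {v_wide:.3f} V")
```

The square-root dependence means that widening the output devices buys swing only slowly: a fourfold increase in width is needed to halve the minimum output voltage.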

The trick to making this bias generator work is setting VDS of M1 equal to the minimum VDS required by M3 and M5. M1 is biased in the linear region, while we wish to keep M3 and M5 saturated. It is a good idea to set L1 = L3 = L5 to match etching tolerances. A second trick is to make sure M3 and M4 stay saturated. M4 is inserted between the gate and drain connections of M3 to make VDS3 = VDS5. If the W/L ratio of M3 is too small, M3 will be forced out of saturation by the source of M4. If the W/L ratio of M3 is too large, the gate voltage of M3 will not be large enough to keep M4 in saturation.

DC Gain

Start by calculating the DC gain of the folded cascode OTA in Fig. 19.38. If we assume the output impedance of the n-channel cascode current mirror is much greater than the output impedance of the individual transistors, then the DC gain is approximately

$$\frac{v_o}{v_{in}} = -g_{m1} \cdot \left(r_{o1} \parallel r_{o3}\right) \cdot g_{m2} \cdot r_{o2} \tag{19.65}$$

If we assume that the current from the M3 splits equally to M1 and M2, the gain can be written as

$$\frac{v_o}{v_{in}} \propto \frac{\sqrt{W_1 L_1 W_2 L_2}}{I_{D1}} \cdot \frac{L_3}{2L_1 + L_3} \tag{19.66}$$
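The scaling behavior implied by Eq. (19.66) can be explored with a few lines of code. The sketch below evaluates only the right-hand side of the proportionality, so the result is a relative figure with arbitrary units; the function name and the device sizes are illustrative assumptions.

```python
import math


def relative_dc_gain(i_d1: float, w1: float, l1: float,
                     w2: float, l2: float, l3: float) -> float:
    """Right-hand side of Eq. (19.66), up to a process-dependent constant."""
    area_term = math.sqrt(w1 * l1 * w2 * l2) / i_d1
    length_term = l3 / (2.0 * l1 + l3)
    return area_term * length_term


if __name__ == "__main__":
    base = relative_dc_gain(100e-6, 10.0, 1.0, 10.0, 1.0, 2.0)
    # Doubling the current alone halves the gain.
    more_current = relative_dc_gain(200e-6, 10.0, 1.0, 10.0, 1.0, 2.0)
    # Doubling the gate area of both M1 and M2 restores it.
    scaled = relative_dc_gain(200e-6, 20.0, 1.0, 20.0, 1.0, 2.0)
    # A longer M3 pushes the length term toward one.
    long_m3 = relative_dc_gain(100e-6, 10.0, 1.0, 10.0, 1.0, 8.0)
    print(f"2x current, same devices: {more_current / base:.2f} x base")
    print(f"2x current, 2x gate areas: {scaled / base:.2f} x base")
    print(f"longer M3 (L3 = 8 L1):     {long_m3 / base:.2f} x base")
```

The length term L3/(2L1 + L3) saturates at one, so lengthening M3 beyond a few times L1 brings diminishing returns.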

We can see that the gate area of the differential pair and cascode transistors must both double each time current is doubled to maintain the same gain. We also note that it is desirable to make L3 > L1. If the current in the amplifier were raised to increase gain-bandwidth, or slew rate, it would be desirable to increase the widths of the transistors by the same factor to maintain output swing.

Regulated gate cascode outputs increase the gain of the OTA by effectively multiplying the gm of M4 by the gain of the RGC amplifier. The stability of the RGC amplifier loop must be considered. An example of a gain boosted output stage is shown in Fig. 19.49.

Gain Bandwidth and Phase Margin

Again, start with the transfer function for the folded cascode OTA of Fig. 19.38. If we assume the output impedances of the n-channel cascode current mirrors are very large, and the gain of M2 is much greater than one, we have

$$\frac{v_o}{v_{in}} = \frac{g_{m1}\, r_{o1}\, g_{m2}\, r_{o2}}{\left(C_1 \cdot C_{out} \cdot r_{o1} \cdot r_{o2}\right)\left[s^2 + \left(\dfrac{g_{m2}}{C_1} + \dfrac{1}{C_1 \cdot r_{o1}} + \dfrac{1}{C_1 \cdot r_{o2}} + \dfrac{1}{C_{out} \cdot r_{o2}}\right)s + \dfrac{1}{C_1 \cdot C_{out} \cdot r_{o1} \cdot r_{o2}}\right]} \tag{19.67}$$

where ro1 is now the parallel combination of ro1 and ro3. If we further assume that the poles are spaced far apart, and that gm2 is much larger than 1/ro1 and 1/ro2, then the gain-bandwidth product is gm1/Cload. The second pole, which will determine phase margin, is approximately

g m2 1 - + ------------------ω = -----C 1 C out ⋅ r o2 The depletion capacitance of the drains of M1 and M3 will also add to this capacitance. As a first cut, let C1 = Kc·W2·L2. Now the equation for the second pole boils down to © 2000 by CRC Press LLC

FIGURE 19.49 OTA with regulated gate cascode output.

FIGURE 19.50 A telescopic OTA.


$$\omega \approx \frac{\sqrt{K I_{D2}}}{K_C L_2 \sqrt{W_2 L_2}} + \frac{I_{D2}}{L_2 C_{out}}$$

To get maximum phase margin, we clearly want to use as short a channel length as the gain specification will allow. The folded cascode OTA and two-stage OTA both have n-channel and p-channel transistors in the signal path. Since holes have lower mobility than electrons, it is necessary to make a silicon p-channel transistor about three times wider than an n-channel transistor of the same length to get the same transconductance. The added parasitic capacitance of the wider p-channel transistor is a hindrance for high-speed design. The telescopic OTA shown in Fig. 19.50 has only n-channel transistors in the signal path, and can therefore achieve very high bandwidths with acceptable phase margin. Its main drawback is that the output common-mode voltage must be more positive than the input common-mode voltage. This amplifier can achieve even wider bandwidth with acceptable phase margin if M2 is replaced by an npn bipolar transistor.

References

1. R. J. Baker, H. W. Li, and D. E. Boyce, CMOS: Circuit Design, Layout, and Simulation, IEEE Press, 1998.
2. A. S. Sedra and K. C. Smith, Microelectronic Circuits, fourth edition, Oxford University Press, London, 1998.
3. P. E. Allen and D. R. Holberg, CMOS Analog Circuit Design, Saunders College Publishing, Philadelphia, 1987.
4. P. R. Gray, Basic MOS Operational Amplifier Design: An Overview, Analog MOS Integrated Circuits, IEEE Press, 1980.
5. H. Qiuting, A CMOS power amplifier with a novel output structure, IEEE J. of Solid-State Circuits, vol. 27, no. 2, pp. 203-207, Feb. 1992.
6. M. Ismail and T. Fiez, Analog VLSI: Signal and Information Processing, McGraw-Hill, Inc., New York, 1994.
7. S. Sen and B. Leung, A class-AB high-speed low-power operational amplifier in BiCMOS technology, IEEE J. of Solid-State Circuits, vol. 31, no. 9, pp. 1325-1330, Sept. 1996.


De Veirman, G.E. "Bipolar Amplifier Design" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


20 Bipolar Amplifier Design

Geert A. De Veirman, Silicon Systems, Inc.

20.1 Introduction
20.2 Single-Transistor Amplifiers
Basic Principles • Common-Emitter Amplifier • Common-Collector Amplifier (Emitter Follower) • Common-Base Amplifier • Darlington and Pseudo-Darlington Pairs
20.3 Differential Amplifiers
Introduction: Amplification of dc and Difference Signals • Bipolar Differential Pairs (Emitter-Coupled Pairs) • Gain Enhancement Techniques • Linearization Techniques • Rail-to-Rail Common-Mode Inputs and Minimum Supply Voltage Requirement
20.4 Output Stages
Class A Operation • Class B and Class AB Operation
20.5 Bias Reference
20.6 Operational Amplifiers
Introduction • Ideal Op-Amps • Op-Amp Non-idealities
20.7 Conclusion

20.1 Introduction

This chapter gives an overview of amplifier design techniques using bipolar transistors. An elementary understanding of the operation of the bipolar junction transistor is assumed. Section 20.2 reviews the basic principles of amplification and details the proper selection of an operating point. This section also introduces the three fundamental single-transistor amplifier stages: the common-emitter, the common-collector, and the common-base configurations. Section 20.3 deals with the problem of amplification of dc and difference signals. The emitter-coupled differential pair is discussed in great depth. Issues specific to output stages are presented in Section 20.4. Section 20.5 briefly touches on supply-independent biasing techniques. Next, Section 20.6 combines all the acquired building block knowledge in a condensed overview of operational amplifiers. A short conclusion is presented in Section 20.7, followed by a list of references.

20.2 Single-Transistor Amplifiers

Basic Principles

Bipolar Transistor Operation

Although prior exposure to bipolar transistor fundamentals is expected, a few elementary concepts are briefly reviewed here. The bipolar junction transistor (BJT) contains two back-to-back pn junctions,

known as the base–emitter and base–collector junctions. In this chapter, it is assumed that the BJT is operated in the so-called normal mode with a forward-bias applied at the base–emitter junction and a reverse voltage across the base–collector interface. Then, the collector current IC, which is β times the base current IB, is exponentially related to the base–emitter voltage VBE. The mathematical expression of this dependency is known as the Ebers-Moll equation. For small signal variations, the relationship between the current through the transistor (IC) and the control input voltage (VBE) is expressed by the transconductance gm. When designing bipolar transistor amplifiers, it is also essential to understand the relationship between IC and the voltage across the transistor (the collector-emitter voltage VCE). For each input voltage VBE, a different characteristic exists in the IC-VCE diagram, as shown in Fig. 20.1. As long as VCE is high enough to keep the transistor in the linear region, the IC vs. VCE characteristics are fairly flat. When VBE is held constant, increasing VCE corresponds to applying more reverse voltage across the base–collector junction. As a result, the base width is modulated and this causes the IC curves to increase approximately linearly with VCE. When the IC curves are extrapolated to negative VCE values, they eventually all go through a single pivot point, which is referred to as the transistor’s Early voltage VA. For very small values of VCE, the base–collector junction becomes forward-biased. As a result, the transistor saturates and the collector current drops off rapidly.

FIGURE 20.1 IC(Ithrough) vs. VCE(Vacross) characteristics of the npn bipolar junction transistor.

In Fig. 20.1, VBE1 is less than VBE2. In the linear operating region, the exponential nature of the BJT renders gm1 < gm2, which means that for an identical incremental input voltage ∆, IC2 changes by a larger amount than IC1. Also, since both curves merge at VA, IC2 is a stronger function of VCE than IC1. Mathematical expressions for the relationships discussed above will be introduced as needed in the following sections.

Basic Bipolar Amplifier Concepts

In order to amplify a voltage with a bipolar transistor, a load is placed in series with the device. In its simplest form, this load consists of a resistor RL. Taking into account the polarity of electrons and the direction of current flow, one ends up with a transistor and supply arrangement as shown in Fig. 20.2, where the voltage to be amplified (Vi) drives the base. Hence, the only available output node (Vo) in Fig. 20.2 is the voltage between the transistor and the load resistor. This voltage can be determined in the IC–VCE diagram of Fig. 20.1 by subtracting the voltage drop across RL from the supply voltage VCC. The result is a straight line, which is known as the load line. The load line, shown in Fig. 20.3, graphically represents the possible operating points of the bipolar transistor. The line is straight simply because the resistor RL is a linear element. Every point on the load line represents a state, which is determined by the voltage at the transistor’s base. This state in turn determines how much current flows through the transistor and how much voltage exists across the device.
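As a rough numerical illustration of the load-line construction, the following Python sketch evaluates Vo = VCC − IC·RL for a few base voltages, using the ideal exponential IC = IS·exp(VBE/VT). It ignores the Early effect and models saturation only by clamping the output at 0 V. The values of VCC, RL, and IS are assumptions chosen for illustration and are not taken from the text.

```python
import math

V_T = 0.02585  # thermal voltage kT/q near 300 K, in volts


def operating_point(v_be, v_cc=5.0, r_l=2e3, i_s=1e-16):
    """Return (I_C, V_o) on the load line for a given base-emitter voltage.

    v_cc, r_l and i_s are assumed example values (not from the text).
    The Early effect is ignored. Once the resistor would drop more than
    v_cc, the current is limited to v_cc / r_l and V_o is held at 0 V as
    a crude stand-in for saturation.
    """
    i_c = i_s * math.exp(v_be / V_T)
    i_c = min(i_c, v_cc / r_l)   # the load resistor limits the current
    v_o = v_cc - i_c * r_l       # point on the load line
    return i_c, v_o


for v_be in (0.60, 0.70, 0.75, 0.80):
    i_c, v_o = operating_point(v_be)
    print(f"V_BE = {v_be:.2f} V   I_C = {i_c * 1e3:7.4f} mA   V_o = {v_o:.3f} V")
```

Sweeping the base voltage in finer steps traces out a transfer characteristic with the same general shape as the one the load line implies: nearly flat at VCC for small inputs, steep in the middle, and flat again once the transistor saturates.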


FIGURE 20.2 Conceptual schematic of single-transistor amplifier.

FIGURE 20.3 Load line characteristic.

To determine the circuit’s voltage amplification, all one has to do is to plot the output voltage as a function of the input voltage. For different values of RL, this process results in the transfer characteristics depicted in Fig. 20.4. The transfer characteristics allow us to visually determine two important items: first, the range in which the input voltage Vi should lie in order to obtain the maximum possible voltage variation at the output; and second, the optimum value of RL. Since the transfer characteristics in Fig. 20.4 are far from linear, the best one can do is to identify a region where the curves can be linearized, provided the signal swings are not too large. Also, note that the transfer curves do not pass through the (0,0) coordinate. This means that the desired proportionality between input and output is not present. Consequently, one would want to move the Vo and Vi axes, so that their origin coincides with a desired point on the curves, as is the case for the Vo′/Vi′ coordinate system illustrated in Fig. 20.4. The next subsection will discuss how such a suitable operating point (or quiescent point) can be established.


FIGURE 20.4 Transfer characteristics.

Figure 20.4 furthermore graphically clarifies how the bipolar transistor can be used in three distinct ways: first, as a switch; second, as a linear amplifier for small signals; and third, as a linear amplifier for large signals. When operating the BJT as a switch, the (logic) designer is primarily concerned about the speed of the transitions between the OFF and ON states. Such an obviously non-linear mode of operation is outside the scope of this chapter. Large signals pose particular difficulties, which will be covered a bit later in the sections on differential pairs and output stages. In the next several sections, we will deal mainly with linear small-signal operation.

Setting a Suitable Operating Point

The choice of a proper operating point depends on several factors. Three key decision criteria are:

1. Maximum allowable power dissipation of the active element. Each transistor has a maximum power limit. If this limit is exceeded, permanent damage (e.g., degradation of β) occurs or, in the worst case, the device may be destroyed. (Actually, maximum power ratings can be temporarily exceeded, as long as such transgressions are very fast. But such operation goes beyond this discussion.) In the IC–VCE diagram, constant power curves are represented by hyperbolas. One such hyperbola, representing the maximum allowable power dissipation Pmax, is included in Fig. 20.3. The designer must guarantee that excursions from the operating point along the amplifier’s load line do not cross into the forbidden zone above the Pmax curve.
2. Proper location on the transfer curve. As mentioned above during the discussion of Fig. 20.4, this choice directly affects the achievable linearity and signal swing.
3. Bias current.

Figure 20.5 illustrates how the vertical axis in the Vo/Vi diagram can be moved. All one has to do is apply a proper dc bias voltage and superpose a small signal input. Making use of a separate bias voltage, as shown in Fig. 20.5(a), is one possibility.
Alternatively, the dc voltage at the base can be set by means of a resistive divider from VCC, and the ac signal can be capacitively coupled through Cci, as depicted in Fig. 20.5(b). Similarly, a coupling capacitor Cco can be employed to separate the dc bias at Vo from the ac signal swing. As such, Cco moves the horizontal axis in the Vo/Vi diagram. In Fig. 20.6(a), a second load resistor RL′ is added. Since the capacitor Cco acts as an open circuit for dc signals, the operating point remains solely determined by the intersection of the chosen transistor characteristic and the static load line, which results from the combination of VCC and RL. Assuming Cco


FIGURE 20.5 Common-emitter amplifier: (a) Vbias in series with ac signal source, (b) arrangement with coupling capacitors.

is large, this coupling capacitor behaves as a short-circuit for ac signals. Hence, RL′ is effectively in parallel with RL. Therefore, signal excursions occur along a new dynamic load line, as illustrated in Fig. 20.6(b).

Common-Emitter Amplifier

The circuits in Figs. 20.5 and 20.6(a) are known as common-emitter amplifiers. This name is derived from the fact that the emitter terminal is common with the ground node.

Small-Signal Gain

The small-signal equivalent circuit of the common-emitter amplifier is shown in Fig. 20.7. This circuit is derived based on the observation that for ac signals, fixed dc bias sources are identical to ac ground. We have also assumed (for now) that the coupling capacitors Cci and Cco are so large that for any frequencies of interest, they effectively behave as short-circuits. Moreover, parasitic capacitances internal to the transistor are ignored. RL′ has been eliminated for simplicity (or one could assume that RL represents the effective parallel resistance). RS is the series output resistance of the ac source Vi and rb stands for the physical resistance of the base. rπ models the linearized input voltage–current characteristic


FIGURE 20.6 (a) common-emitter amplifier with ac coupled load R′L, (b) IC vs. VCE diagram with static and dynamic load lines.

FIGURE 20.7 Small-signal equivalent circuit for the common-emitter amplifier.


of the base–emitter junction. In other words, rπ is the ratio between vbe and ib for small-signal excursions from the operating point set by VBE. Although the parallel combination of R1 and R2 shunts rπ, we assume here that both resistors have such high ohmic values that their effect on rπ can be regarded as immaterial. From the Ebers-Moll equation (with VBE >> VT),

$$I_C = \beta I_B = I_S \exp\left(\frac{V_{BE}}{V_T}\right)\left(1 + \frac{V_{CE}}{V_A}\right) \tag{20.1}$$

where IS is the transistor’s saturation current and VT = kT/q the thermal voltage, one readily derives the following expressions for the transconductance gm, the input resistance rπ, and the output resistance ro.

$$g_m = \frac{dI_C}{dV_{BE}} = \frac{I_C}{V_T} \tag{20.2}$$

$$r_\pi = \left(\frac{dI_B}{dV_{BE}}\right)^{-1} = \left(\frac{I_C}{\beta V_T}\right)^{-1} = \frac{\beta}{g_m} \tag{20.3}$$

$$r_o = \left(\frac{dI_C}{dV_{CE}}\right)^{-1} = \left[\frac{I_S}{V_A}\exp\left(\frac{V_{BE}}{V_T}\right)\right]^{-1} \tag{20.4}$$

In general, VA >> VCE. Therefore,

$$r_o \approx \frac{V_A}{I_C} \tag{20.5}$$

The following two equations can be written for the circuit in Fig. 20.7.

$$v_\pi = \frac{r_\pi}{r_b + R_S + r_\pi} V_i \tag{20.6}$$

$$V_o = -g_m v_\pi (r_o \,\|\, R_L) \tag{20.7}$$

where || denotes the parallel combination of two resistors. Combining Eqs. (20.6) and (20.7) leads to the expression for the small-signal gain

$$A = \frac{V_o}{V_i} = -g_m \, \frac{r_\pi}{r_b + R_S + r_\pi} \cdot \frac{r_o R_L}{r_o + R_L} \tag{20.8}$$
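The chain from bias point to gain in Eqs. (20.2), (20.3), (20.5), and (20.8) can be sketched numerically as follows. The bias current, β, Early voltage, and resistor values are assumed example numbers, not values from the text, and the function names are illustrative.

```python
V_T = 0.02585  # thermal voltage kT/q near 300 K, in volts


def small_signal_params(i_c, beta, v_a):
    """Hybrid-pi parameters from Eqs. (20.2), (20.3) and (20.5)."""
    g_m = i_c / V_T      # transconductance, Eq. (20.2)
    r_pi = beta / g_m    # input resistance, Eq. (20.3)
    r_o = v_a / i_c      # output resistance, Eq. (20.5)
    return g_m, r_pi, r_o


def ce_gain(i_c, beta, v_a, r_l, r_s=0.0, r_b=0.0):
    """Small-signal voltage gain of the common-emitter stage, Eq. (20.8)."""
    g_m, r_pi, r_o = small_signal_params(i_c, beta, v_a)
    input_divider = r_pi / (r_b + r_s + r_pi)
    load = r_o * r_l / (r_o + r_l)   # r_o in parallel with R_L
    return -g_m * input_divider * load


# Assumed example: I_C = 1 mA, beta = 100, V_A = 50 V, R_L = 2 kohm,
# R_S = 50 ohm, r_b = 100 ohm.
g_m, r_pi, r_o = small_signal_params(1e-3, 100.0, 50.0)
gain = ce_gain(1e-3, 100.0, 50.0, r_l=2e3, r_s=50.0, r_b=100.0)
print(f"g_m = {g_m * 1e3:.2f} mS, r_pi = {r_pi:.0f} ohm, r_o = {r_o / 1e3:.0f} kohm")
print(f"A = {gain:.1f}")
```

With these assumed numbers the input divider and the finite ro each cost a few percent of gain relative to the ideal value −gm·RL.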

Assuming rb + RS > VT (where VT = kT/q is the thermal voltage) and assuming that both transistors are matched (i.e., the saturation currents IS1 = IS2), the difference voltage ViD can be expressed as follows


$$V_{iD} = V_{BE1} - V_{BE2} = V_T \ln\left(\frac{I_{C1}}{I_{C2}}\right) \tag{20.63}$$

After some manipulation and substituting IC1 + IC2 = αIEE (the total current flowing through RO), one gets

$$I_{C1} = \frac{\alpha I_{EE}}{1 + \exp\left(-\dfrac{V_{iD}}{V_T}\right)} \tag{20.64}$$

$$I_{C2} = \frac{\alpha I_{EE}}{1 + \exp\left(\dfrac{V_{iD}}{V_T}\right)} \tag{20.65}$$

where α is defined as β/(β + 1). Since VoD = –RL (IC1 – IC2), the expression for the differential output voltage VoD becomes

$$V_{oD} = -\alpha R_L I_{EE}\, \frac{\exp\left(\dfrac{V_{iD}}{2V_T}\right) - \exp\left(-\dfrac{V_{iD}}{2V_T}\right)}{\exp\left(\dfrac{V_{iD}}{2V_T}\right) + \exp\left(-\dfrac{V_{iD}}{2V_T}\right)} = -\alpha R_L I_{EE} \tanh\left(\frac{V_{iD}}{2V_T}\right) = -\alpha R_L I_{EE} \tanh x \tag{20.66}$$

The transfer function expressed in Eq. (20.66) is quite non-linear. A graphical representation is given in Fig. 20.25. When ViD > 2VT, the current through one of the two transistors is almost completely cut off and for further increases in ViD the differential output signal eventually clips at –αRLIEE. On the other hand, for small values of x, tanh x ≈ x. Under this small-signal assumption,

$$A_{DD} = \frac{V_{oD}}{V_{iD}} = -\frac{\alpha I_{EE}}{2V_T} R_L = -g_m R_L \tag{20.67}$$

FIGURE 20.25 Emitter-coupled pair’s dc transfer characteristic.
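The large-signal behavior of Eqs. (20.64) to (20.66) and the small-signal limit of Eq. (20.67) can be explored with a short Python sketch. The tail current and load resistor below are assumed example values; the function names are illustrative.

```python
import math

V_T = 0.02585  # thermal voltage kT/q near 300 K, in volts


def collector_currents(v_id, i_ee, alpha=1.0):
    """Collector currents of the matched pair, Eqs. (20.64) and (20.65)."""
    i_c1 = alpha * i_ee / (1.0 + math.exp(-v_id / V_T))
    i_c2 = alpha * i_ee / (1.0 + math.exp(v_id / V_T))
    return i_c1, i_c2


def diff_output(v_id, i_ee, r_l, alpha=1.0):
    """Differential output voltage, Eq. (20.66)."""
    return -alpha * r_l * i_ee * math.tanh(v_id / (2.0 * V_T))


# Assumed example: I_EE = 1 mA, R_L = 2 kohm, alpha = 1.
I_EE, R_L = 1e-3, 2e3
small_signal_gain = -(I_EE / (2.0 * V_T)) * R_L   # Eq. (20.67)
for v_id in (0.001, 0.010, 0.052, 0.200):
    v_od = diff_output(v_id, I_EE, R_L)
    print(f"V_iD = {v_id * 1e3:6.1f} mV   V_oD = {v_od:8.4f} V   "
          f"linear estimate = {small_signal_gain * v_id:8.4f} V")
```

The printout compares the exact tanh characteristic with the straight-line estimate ADD·ViD; the two agree closely for inputs of a few millivolts and diverge as the input approaches and exceeds 2VT, where the output approaches its clipping level of −αRLIEE.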

While the next subsection contains a more rigorous small-signal analysis, a noteworthy observation here is that, under conditions of equal power dissipation, the differential amplifier of Fig. 20.24 has only one half the transconductance value and hence only one half the gain of a single-transistor common-emitter amplifier. From Eq. (20.67), one furthermore concludes that when the tail current IEE is derived from a voltage source that is proportional to absolute temperature (PTAT) and a resistor of the same type as RL, the transistor pair’s differential gain is determined solely by a resistor ratio. As such, the gain is well-controlled and insensitive to absolute process variations. A similar observation was made in Section 20.2 during the discussion of the common-emitter amplifier’s operating point stability. In Section 20.5, a suitable PTAT reference will be presented. An intuitive analysis of the common-mode gain can be carried out under the assumption that RO is large (e.g., assume RO represents the output resistance of a current source). Then, a common-mode input signal ViC results only in a small current change iC through RO and therefore VBE remains approximately constant. With iC ≈ ViC/RO and VoC = –RLiC/2, the common-mode gain can be expressed as

$$A_{CC} = \frac{V_{oC}}{V_{iC}} \approx -\frac{R_L}{2R_O} \tag{20.68}$$

Combining Eqs. (20.67) and (20.68) yields

$$\mathrm{CMRR} = \frac{A_{DD}}{A_{CC}} \approx 2 g_m R_O \tag{20.69}$$

Half-Circuit Equivalents

Figure 20.26 illustrates the derivation of the emitter-coupled pair’s differential mode half-circuit equivalent representation. For a small differential signal, the sum of the currents through both transistors remains constant and the current through RO is unchanged. Therefore, the voltage at the emitters remains constant. The transistors operate as if no degeneration resistor were present, resulting in a high gain. In sum mode, on the other hand, the common resistor RO provides negative feedback, which significantly lowers the common-mode gain. In fact, with identical signals at both inputs, the symmetrical circuit can be split into two halves, each with a degeneration resistor 2RO as depicted in Fig. 20.27.

FIGURE 20.26 Difference mode.

Low-Frequency Small-Signal Analysis

Figure 20.28 represents the low-frequency small-signal differential mode equivalent circuit wherein RSD models the corresponding source impedance. Under the presumption of matched devices,

FIGURE 20.27 Sum mode.

$$A_{DD} = \frac{V_{oD}}{V_{iD}} = -g_m R_L \, \frac{r_\pi}{r_\pi + r_b + \frac{1}{2} R_{SD}} \tag{20.70}$$

With rb > 1, Eq. (20.71) reduces to –RL/2RO, the intuitive result obtained earlier in Eq. (20.68). For RE = 2RO, this result is also identical to Eq. (20.31). The combination of Eqs. (20.70) and (20.71) leads to

$$\mathrm{CMRR} = \frac{A_{DD}}{A_{CC}} = \frac{r_\pi + r_b + 2R_{SC} + 2(\beta + 1)R_O}{r_\pi + r_b + \frac{1}{2} R_{SD}} \approx 2 g_m R_O \tag{20.72}$$

Let us consider the special case where RO models the output resistance of a current source, implemented by a single bipolar transistor. Then, RO = VA/IEE, where VA is the transistor’s Early voltage. With gm = αIEE/2VT,

$$\mathrm{CMRR} = \frac{\alpha V_A}{V_T} \tag{20.73}$$

which is independent of the amplifier’s bias conditions, but only depends on the process technology and temperature. At room temperature, with α ≈ 1 and VA ≈ 25 V, the amplifier’s CMRR would be approximately 60 dB. The use of an improved current source, for example a bipolar transistor in series with an emitter degeneration resistor RD, can significantly increase the CMRR. More specifically,

$$\mathrm{CMRR} = \frac{\alpha V_A}{V_T}\left(1 + \frac{I_{EE} R_D}{V_T}\right) \tag{20.74}$$

FIGURE 20.28 Small-signal equivalent circuit for difference mode.


FIGURE 20.29 Small-signal equivalent circuit for sum mode.


For IEERD = 250 mV, the CMRR in Eq. (20.74) is eleven times higher than in Eq. (20.73). In addition to expressions for the gain, the emitter-coupled pair’s differential and common-mode input resistances can readily be derived from the small-signal circuits in Figs. 20.28 and 20.29.

$$R_{inD} = 2 r_\pi \tag{20.75}$$

$$R_{inC} = \frac{1}{2} r_\pi + R_O(\beta + 1) \tag{20.76}$$
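The CMRR figures quoted for Eqs. (20.73) and (20.74), and the input resistances of Eqs. (20.75) and (20.76), can be checked with a few lines of Python. The sketch assumes the rounded value VT = 25 mV and α = 1; the β, tail current, and RO used for the input resistances are assumed example values.

```python
import math

V_T = 0.025  # rounded thermal voltage in volts (assumption)


def cmrr_simple_source(v_a, alpha=1.0):
    """CMRR with a single-transistor tail current source, Eq. (20.73)."""
    return alpha * v_a / V_T


def cmrr_degenerated_source(v_a, v_rd, alpha=1.0):
    """CMRR with an emitter-degenerated tail source, Eq. (20.74).

    v_rd is the dc drop I_EE * R_D across the degeneration resistor.
    """
    return cmrr_simple_source(v_a, alpha) * (1.0 + v_rd / V_T)


def input_resistances(beta, i_ee, r_o):
    """(R_inD, R_inC) from Eqs. (20.75) and (20.76), with alpha taken as 1."""
    g_m = i_ee / (2.0 * V_T)
    r_pi = beta / g_m
    return 2.0 * r_pi, 0.5 * r_pi + r_o * (beta + 1.0)


def to_db(ratio):
    return 20.0 * math.log10(ratio)


base = cmrr_simple_source(v_a=25.0)
improved = cmrr_degenerated_source(v_a=25.0, v_rd=0.250)
print(f"CMRR, simple source:      {to_db(base):.1f} dB")
print(f"CMRR, degenerated source: {to_db(improved):.1f} dB "
      f"({improved / base:.0f}x higher)")

# Assumed example: beta = 100, I_EE = 1 mA, R_O = V_A / I_EE = 25 kohm.
r_in_d, r_in_c = input_resistances(100.0, 1e-3, 25e3)
print(f"R_inD = {r_in_d / 1e3:.1f} kohm, R_inC = {r_in_c / 1e6:.2f} Mohm")
```

The common-mode input resistance comes out far larger than the differential one because RO appears multiplied by (β + 1).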

Taking into account the thermal noise of the transistors’ base resistances and the load resistors RL, as well as the shot noise caused by the collector currents, the emitter-coupled pair’s total input referred squared noise voltage per Hertz is given by

$$\frac{V_{iN}^2}{\Delta f} = 8kT\left(r_b + \frac{1}{2g_m} + \frac{1}{g_m^2 R_L}\right) \tag{20.77}$$
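To get a feel for the magnitudes in Eq. (20.77), the sketch below evaluates the input-referred noise density for one set of assumed values (rb = 100 Ω, gm = 20 mS, RL = 2 kΩ, T = 300 K). These numbers are illustrative assumptions, not data from the text.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K


def input_noise_density(r_b, g_m, r_l, temp_k=300.0):
    """Input-referred noise voltage density in V/sqrt(Hz), from Eq. (20.77)."""
    # Equivalent noise resistance: base resistance, shot noise, load resistor.
    r_equiv = r_b + 1.0 / (2.0 * g_m) + 1.0 / (g_m ** 2 * r_l)
    return math.sqrt(8.0 * BOLTZMANN * temp_k * r_equiv)


e_n = input_noise_density(r_b=100.0, g_m=0.02, r_l=2e3)
print(f"input-referred noise: {e_n * 1e9:.2f} nV/sqrt(Hz)")
```

For these assumed values the base resistance term dominates, the shot noise term 1/(2gm) is about a quarter of it, and the load resistor contribution is nearly negligible because it is divided by the square of the gain.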

Due to the presence of base currents, there is also a small input noise current, which however will be ignored here and in further discussions.

Small-Signal Frequency Response

When the emitter-base capacitance Cπ, the collector-base capacitance Cµ, the collector-substrate capacitance Ccs, and the transistor’s output resistance ro are added to the transistor’s hybrid-π small-signal model in Fig. 20.28, the differential gain transfer function becomes frequency dependent. Although the differential-mode small-signal equivalent circuit is identical to that of the non-degenerated common-emitter amplifier analyzed in Section 20.2, the high-frequency analysis is repeated here for the sake of completeness. With Ri representing the parallel combination of (RSD/2 + rb) with rπ, and RC similarly designating the parallel combination of RL with ro, Eq. (20.70) must be rewritten as

$$A_{DD} = -\frac{g_m R_i R_C}{\frac{1}{2} R_{SD} + r_b} \cdot \frac{N(s)}{D(s)} \tag{20.78}$$

where

$$\frac{N(s)}{D(s)} = \frac{1 - \dfrac{sC_\mu}{g_m}}{1 + s\,(C_\pi R_i + C_\mu R_i + C_\mu R_C + C_\mu R_i R_C g_m + C_{cs} R_C) + s^2\,(C_\pi C_{cs} R_i R_C + C_\pi C_\mu R_i R_C + C_\mu C_{cs} R_i R_C)} \tag{20.79}$$

As mentioned before, the right half-plane zero located at sz = gm/Cµ results from the capacitive feedthrough from input to output. This right half-plane zero is usually at such a high frequency that it can be ignored in most applications. Unlike in a single-ended amplifier, in a differential pair this zero can easily be canceled. One only has to add two capacitors CC = Cµ between the bases of the transistors and the opposite collectors, as illustrated in Fig. 20.30(a). Rather than using physical capacitors, perfect tracking can be achieved by making use of the base–collector capacitances of transistors, whose emitters are either floating or shorted to the bases, as illustrated in Figs. 20.30(b) and (c). Note, however, that the compensating transistors contribute additional collector-substrate parasitics, which to some extent counteract the intended broadbanding effect.

FIGURE 20.30 Cµ cancellation: (a) capacitor implementation, (b) transistor with floating emitter, and (c) transistor with shorted base–emitter junction.


If the dominant pole assumption is valid, D(s) can be factored in the following manner

$$D(s) = \left(1 - \frac{s}{p_1}\right)\left(1 - \frac{s}{p_2}\right) \approx 1 - \frac{s}{p_1} + \frac{s^2}{p_1 p_2} \tag{20.80}$$

Equating Eqs. (20.79) and (20.80) yields

$$p_1 = -\frac{1}{R_i} \cdot \frac{1}{C_\pi + C_{cs}\dfrac{R_C}{R_i} + C_\mu\left(1 + g_m R_C + \dfrac{R_C}{R_i}\right)} \tag{20.81}$$

$$p_2 = -\frac{1}{R_C (C_\mu + C_{cs})} \cdot \frac{C_\pi + C_{cs}\dfrac{R_C}{R_i} + C_\mu\left(1 + g_m R_C + \dfrac{R_C}{R_i}\right)}{C_\pi + \dfrac{C_\mu C_{cs}}{C_\mu + C_{cs}}} \tag{20.82}$$
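How good the dominant-pole approximation is can be checked by comparing Eqs. (20.81) and (20.82) against the exact roots of the second-order denominator in Eq. (20.79). The following sketch does this for one set of assumed element values (gm, Ri, RC, Cπ, Cµ, Ccs are illustrative assumptions, not values from the text).

```python
import math


def denominator_coeffs(g_m, r_i, r_c, c_pi, c_mu, c_cs):
    """Coefficients (a1, a2) of D(s) = 1 + a1*s + a2*s**2, per Eq. (20.79)."""
    a1 = (c_pi * r_i + c_mu * r_i + c_mu * r_c
          + c_mu * r_i * r_c * g_m + c_cs * r_c)
    a2 = r_i * r_c * (c_pi * c_cs + c_pi * c_mu + c_mu * c_cs)
    return a1, a2


def approx_poles(a1, a2):
    """Dominant-pole estimates, equivalent to Eqs. (20.81) and (20.82)."""
    return -1.0 / a1, -a1 / a2


def exact_poles(a1, a2):
    """Exact roots of a2*s**2 + a1*s + 1 = 0 (real poles assumed)."""
    disc = math.sqrt(a1 * a1 - 4.0 * a2)
    p1 = (-a1 + disc) / (2.0 * a2)   # pole closer to the origin
    p2 = (-a1 - disc) / (2.0 * a2)
    return p1, p2


# Assumed example values: g_m = 20 mS, R_i = 500 ohm, R_C = 2 kohm,
# C_pi = 1 pF, C_mu = 0.1 pF, C_cs = 0.2 pF.
a1, a2 = denominator_coeffs(0.02, 500.0, 2e3, 1e-12, 0.1e-12, 0.2e-12)
for name, (p1, p2) in (("approx", approx_poles(a1, a2)),
                       ("exact ", exact_poles(a1, a2))):
    print(f"{name}: f1 = {abs(p1) / (2 * math.pi) / 1e6:7.1f} MHz, "
          f"f2 = {abs(p2) / (2 * math.pi) / 1e9:5.2f} GHz")
```

For these assumed values the two poles are separated by more than a decade, and the approximate and exact pole frequencies agree to within a few percent. When the poles move closer together, the approximation degrades and the exact roots should be used.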

Rather than presenting a similarly detailed analysis, the discussion of the emitter-coupled pair’s common-mode frequency response is limited here to the effect of the unavoidable capacitor CO (representing, for instance, the collector–base and collector–substrate parasitic capacitances of the BJT), which shunts RO. The parallel combination of RO and CO yields a zero in the common-mode transfer function. Correspondingly, a pole appears in the expression for the amplifier’s CMRR. Specifically,

$$\mathrm{CMRR} = 2 g_m \frac{R_O}{1 + sC_O R_O} \tag{20.83}$$

The important conclusion from Eq. (20.83) is that at higher frequencies, the amplifier’s CMRR rolls off by 20 dB per decade.

dc Offset

Input Offset Voltage

Until now, perfect matching between like components has been assumed. While ratio tolerances in integrated circuit technology can be very tightly controlled, minor random variations between “equal” components are unavoidable. These minor mismatches result in a differential output voltage, even if no differential input signal is applied. When the two bases in Fig. 20.24 are tied together, but the transistors and load resistors are slightly mismatched, the resulting differential output offset voltage can be expressed as

$$V_{oO} = -(R_L + \Delta R_L)(I_C + \Delta I_C) + R_L I_C = -(R_L + \Delta R_L)(I_S + \Delta I_S)\exp\left(\frac{V_{BE}}{V_T}\right) + R_L I_S \exp\left(\frac{V_{BE}}{V_T}\right) \tag{20.84}$$

or

$$V_{oO} \approx -\left(\frac{\Delta R_L}{R_L} + \frac{\Delta I_S}{I_S}\right) R_L I_S \exp\left(\frac{V_{BE}}{V_T}\right) = -\left(\frac{\Delta R_L}{R_L} + \frac{\Delta I_S}{I_S}\right) R_L I_C \tag{20.85}$$

Conversely, the output offset can be referred back to the input through a division by the amplifier’s differential gain.

$$V_{iO} = \frac{V_{oO}}{-g_m R_L} = V_T\left(\frac{\Delta R_L}{R_L} + \frac{\Delta I_S}{I_S}\right) \tag{20.86}$$

The input referred offset voltage ViO represents the voltage that must be applied between the input terminals in order to nullify the differential output voltage. In many instances, the absolute value of the offset voltage is not important because it can easily be measured and canceled, either by an auto-zero technique or by trimming. Rather, when offset compensation is applied, the offset stability under varying environmental conditions becomes the primary concern. The drift in offset voltage over temperature can be calculated by differentiating Eq. (20.86):

$$\frac{dV_{iO}}{dT} = \frac{V_{iO}}{T} \tag{20.87}$$
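A quick numerical example helps put Eqs. (20.86) and (20.87) in perspective. The sketch below assumes 1% mismatch in both the load resistors and the saturation currents at T = 300 K; these mismatch figures are illustrative assumptions, not values from the text.

```python
BOLTZMANN = 1.380649e-23           # Boltzmann constant in J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # electron charge in C


def thermal_voltage(temp_k):
    """Thermal voltage V_T = kT/q in volts."""
    return BOLTZMANN * temp_k / ELEMENTARY_CHARGE


def input_offset_voltage(dr_over_r, dis_over_is, temp_k=300.0):
    """Input-referred offset voltage from Eq. (20.86)."""
    return thermal_voltage(temp_k) * (dr_over_r + dis_over_is)


def offset_drift(v_io, temp_k=300.0):
    """Offset drift dV_iO/dT in V/K from Eq. (20.87)."""
    return v_io / temp_k


# Assumed mismatches: 1% in R_L and 1% in I_S.
v_io = input_offset_voltage(0.01, 0.01)
drift = offset_drift(v_io)
print(f"V_iO  = {v_io * 1e3:.3f} mV")
print(f"drift = {drift * 1e6:.2f} uV/K")
```

Because the offset scales with VT = kT/q, an offset nulled at one temperature reappears as the temperature changes, at a rate set by the original (untrimmed) offset divided by the absolute temperature.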

From Eq. (20.87), one concludes that the drift is proportional to the magnitude of the offset voltage and inversely proportional to the absolute temperature.

Input Offset Current

Since in most applications the differential pair is driven by a low-impedance voltage source, its input offset voltage is an important parameter. Alternatively, the amplifier can be controlled by high-impedance current sources. Under this condition, the input offset current IiO, which originates from a mismatch in the base currents, is the offset parameter of primary concern. Parallel to the definition of ViO, IiO is the value of the current source that must be placed between the amplifier’s open-circuited input terminals to reduce the differential output voltage to zero.

$$I_{iO} = \frac{I_C + \Delta I_C}{\beta + \Delta\beta} - \frac{I_C}{\beta} \approx \frac{I_C}{\beta}\left(\frac{\Delta I_C}{I_C} - \frac{\Delta\beta}{\beta}\right) \tag{20.88}$$

The requirement of zero voltage difference across the output terminals can be expressed as

$$(R_L + \Delta R_L)(I_C + \Delta I_C) = R_L I_C \tag{20.89}$$

Eq. (20.89) can be rearranged as

$$\frac{\Delta I_C}{I_C} \approx -\frac{\Delta R_L}{R_L} \tag{20.90}$$

Substituting Eq. (20.90) into Eq. (20.88) yields

$$I_{iO} = -\frac{I_{EE}}{2\beta}\left(\frac{\Delta R_L}{R_L} + \frac{\Delta\beta}{\beta}\right) \tag{20.91}$$

IiO’s linear dependence on the bias current and its inverse relationship to the transistors’ current gain β as expressed by Eq. (20.91) intuitively make sense.

Gain Enhancement Techniques

From Eq. (20.67), one concludes that there are two ways to increase the emitter-coupled pair’s gain: namely, an increase in the bias current or the use of a larger valued load resistor. Similar to the earlier discussion of the common-emitter amplifier, practical limitations of the available supply voltage and the corresponding limit on the allowable I-R voltage drop across the load resistors (in order to avoid saturating either of the two transistors), however, limit the maximum gain that can be achieved by a single stage.

This section introduces two methods that generally allow the realization of higher gain while avoiding the dc bias limitations. Negative Resistance Load In the circuit of Fig. 20.31, a gain boosting positive feedback circuit is connected between the output terminals. The output dc bias voltage is simply determined by VCC, together with the product of RL and the current flowing through it, which is now equal to (IE + IR)/2. However, for ac signals, the added circuit — consisting of two transistors with cross-coupled base–collector connections and the resistors RC between the emitters — represents a negative resistance of value –2(RC + 1/gmc), where gmc = αIR/2VT. The amplifier’s differential gain can now be expressed as

$$A_{DD} \approx -g_m R_L \, \frac{1}{1 - \dfrac{g_{mc} R_L}{1 + g_{mc} R_C}} \tag{20.92}$$
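The gain boost predicted by Eq. (20.92) is easy to evaluate numerically. In the sketch below, the bias currents and resistor values are assumed for illustration only. The function refuses to return a value when the term gmc·RL/(1 + gmc·RC) reaches unity, since the denominator of Eq. (20.92) then vanishes and the small-signal expression no longer describes a usable amplifier.

```python
V_T = 0.02585  # thermal voltage kT/q near 300 K, in volts


def boosted_gain(i_ee, i_r, r_l, r_c, alpha=1.0):
    """Differential gain with a negative resistance load, Eq. (20.92)."""
    g_m = alpha * i_ee / (2.0 * V_T)    # input pair transconductance
    g_mc = alpha * i_r / (2.0 * V_T)    # cross-coupled pair transconductance
    feedback = g_mc * r_l / (1.0 + g_mc * r_c)
    if feedback >= 1.0:
        raise ValueError("positive feedback term >= 1: Eq. (20.92) breaks down")
    return -g_m * r_l / (1.0 - feedback)


# Assumed example: I_EE = 1 mA, I_R = 0.2 mA, R_L = 1 kohm, R_C = 1.5 kohm.
plain = boosted_gain(1e-3, 0.0, 1e3, 1.5e3)       # no cross-coupled current
boosted = boosted_gain(1e-3, 0.2e-3, 1e3, 1.5e3)
print(f"gain without boost: {plain:.1f}")
print(f"gain with boost:    {boosted:.1f}")
```

The closer the feedback term gets to one, the larger the gain, but also the more sensitive it becomes to component tolerances, so some margin is normally left.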

FIGURE 20.31 Emitter-coupled pair with negative resistor load.

Active Load Another approach to increase the gain consists of replacing the load resistors by active elements, such as pnp transistors. Figure 20.32 shows a fully differential realization of an emitter-coupled pair with active loads. The differential gain is determined by the product of the transconductance of the input devices and the parallel combination of the output resistances of the npn and pnp transistors. Since gm = IC/VT, ron = VAn/IC, and rop = VAp/IC, the gain becomes

$$A_{DD} = -g_m \frac{V_{An} V_{Ap}}{(V_{An} + V_{Ap}) I_C} = -\frac{1}{V_T} \cdot \frac{V_{An} V_{Ap}}{V_{An} + V_{Ap}} \tag{20.93}$$

FIGURE 20.32 Emitter-coupled pair with active pnp load.

Consequently, the gain is relatively high and independent of the bias conditions. The disadvantage of the fully differential realization with active loads is that the output common-mode voltage is not well-defined. This problem also exists for the corresponding single-ended implementation (see Fig. 20.9(a)), as mentioned in Section 20.2. If one were to use a fixed biasing scheme for both types of transistors in Fig. 20.32, minor, but unavoidable mismatches between the currents in the npn and pnp transistors will result in a significant shift of the operating point. The solution lies in a common-mode feedback (CMFB) circuit that controls the bases of the active loads and forces a predetermined voltage at the output nodes. The CMFB circuit has high gain for common-mode signals, but does not respond to differential signals present at its inputs. A possible realization of such a CMFB circuit is seen in the right portion of Fig. 20.32. Via emitter followers and resistors RF, the output nodes are connected to one input of a differential pair, whose other input terminal is similarly tied to a reference voltage VREF . The negative feedback provided to the pnp load transistors forces an equilibrium state where the dc voltages at the output terminals of the differential pair gain stage are equal to VREF . An alternative active load implementation with a single-ended output is shown in Fig. 20.33. Contrary to the low CMRR of a single-ended realization

FIGURE 20.33 Emitter-coupled pair with active load and single-ended output.


with resistive loads, the circuit in Fig. 20.33 inherently possesses the same CMRR as a differential realization since the output voltage depends on a current differencing as a result of the pnp mirror configuration. The drawback of the single-ended circuit is a lower frequency response, particularly when low-bandwidth lateral pnp transistors are used.

Linearization Techniques

As derived previously, the linear range of operation of the emitter-coupled pair is limited to approximately ViD ≈ 2VT. This section describes two techniques that can be used to extend the linear range of operation.

Emitter Degeneration

The most common technique to increase the linear range of the emitter-coupled pair relies on the inclusion of emitter degeneration resistors, as shown in Fig. 20.34. The analysis of the differential gain transfer function proceeds as before; however, no closed-form expression can be derived. Intuitively, the inclusion of RE introduces negative feedback, which lowers the gain and extends the amplifier’s linear operating region to a voltage range approximately equal to the product RE·IE. The small-signal differential gain can be expressed as

FIGURE 20.34 Differential pair with degeneration resistors.

$$A_{DD} \approx -G_M R_L \tag{20.94}$$

where GM is the effective transconductance of the degenerated input stage. Therefore,

$$G_M = \frac{g_m}{1 + g_m R_E} \approx \frac{1}{R_E} \tag{20.95}$$

Consequently,

$$A_{DD} \approx -\frac{g_m R_L}{1 + g_m R_E} \tag{20.96}$$

In case gmRE >> 1,

$$A_{DD} \approx -\frac{R_L}{R_E} \tag{20.97}$$
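As a quick numerical illustration of Eqs. (20.94) to (20.97), the following Python sketch evaluates the degenerated gain and compares it with the limit –RL/RE. The component values (0.5 mA collector current per transistor, RE = 1 kΩ, RL = 5 kΩ) and the thermal voltage are assumptions chosen for illustration only; they do not come from the text.

```python
V_T = 0.02585  # thermal voltage near 300 K in volts (assumed)

def degenerated_gain(g_m, r_e, r_l):
    """Return (G_M, A_DD) following Eqs. (20.95) and (20.96)."""
    g_eff = g_m / (1.0 + g_m * r_e)  # effective transconductance G_M
    return g_eff, -g_eff * r_l       # A_DD = -G_M * R_L

i_c = 0.5e-3          # assumed collector current per transistor, A
r_e, r_l = 1e3, 5e3   # assumed R_E and R_L, ohms
g_m = i_c / V_T       # transconductance of each input transistor, S
g_eff, a_dd = degenerated_gain(g_m, r_e, r_l)
a_dd_limit = -r_l / r_e  # Eq. (20.97), valid when g_m * R_E >> 1

print(f"g_m * R_E = {g_m * r_e:.1f}")
print(f"A_DD = {a_dd:.2f} (limit -R_L/R_E = {a_dd_limit:.2f})")
```

With these assumed values gmRE is roughly 19, so the exact gain falls within about 5% of the resistor-ratio limit.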

In comparison to the undegenerated differential pair, the gain is reduced by a factor (1 + gmRE) ≈ gmRE, which is proportional to the increase in linear input range. The common-mode gain transfer function for the circuit in Fig. 20.34 is

$$A_{CC} \approx -\frac{R_L}{2R_O + R_E} \tag{20.98}$$

For practical values of RE, ACC remains relatively unchanged compared to the undegenerated prototype. As a result, the amplifier’s CMRR is reduced by approximately the factor gmRE. Also, the input referred squared noise voltage per Hertz can be derived as

$$\frac{V_{iN}^2}{\Delta f} = 8kT\left[ r_b + \frac{1}{2g_m}(g_m R_E)^2 + \frac{1}{g_m^2 R_L}(g_m R_E)^2 + R_E \right] \tag{20.99}$$

This means that, to a first order, the noise also increases by the factor gmRE. Consequently, although the amplifier’s linear input range is increased, its signal-to-noise ratio (SNR) remains unchanged. To complete the discussion of the emitter degenerated differential pair, the positive effect emitter degeneration has on the differential input resistance RinD, and, to a lesser extent, on RinC should be mentioned. For the circuit in Fig. 20.34,

$$R_{inD} = 2\left[ r_\pi + (\beta + 1)R_E \right] \tag{20.100}$$

$$R_{inC} = \frac{1}{2} r_\pi + \frac{(\beta + 1)R_E}{2} + R_O(\beta + 1) \tag{20.101}$$
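To give a feel for the magnitudes involved, the sketch below evaluates Eqs. (20.100) and (20.101) with and without degeneration. All numerical values (β = 100, rπ = 5 kΩ, RE = 1 kΩ, RO = 1 MΩ) are assumed for illustration and are not taken from the text.

```python
def input_resistances(beta, r_pi, r_e, r_o):
    """Return (R_inD, R_inC) following Eqs. (20.100) and (20.101)."""
    r_in_d = 2.0 * (r_pi + (beta + 1.0) * r_e)
    r_in_c = 0.5 * r_pi + 0.5 * (beta + 1.0) * r_e + (beta + 1.0) * r_o
    return r_in_d, r_in_c

# Assumed example values: beta = 100, r_pi = 5 kOhm, R_O = 1 MOhm.
plain = input_resistances(100.0, 5e3, 0.0, 1e6)        # no degeneration
degenerated = input_resistances(100.0, 5e3, 1e3, 1e6)  # R_E = 1 kOhm

print(f"R_inD: {plain[0]:.3g} -> {degenerated[0]:.3g} ohm")
print(f"R_inC: {plain[1]:.4g} -> {degenerated[1]:.4g} ohm")
```

For these assumed numbers the differential input resistance rises by more than an order of magnitude, whereas the common-mode input resistance, already dominated by the RO(β + 1) term, barely changes.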

Parallel Combination of Asymmetrical Differential Pairs

A second linearization technique consists of adding the output currents of two parallel asymmetrical differential pairs with respective transistor ratios 1:r and r:1 as shown in Fig. 20.35. The reader will

FIGURE 20.35 Parallel asymmetrical pairs.

observe that each differential pair in Fig. 20.35 is biased by a current source of magnitude IEE/2 so that the power dissipation as well as the output common-mode voltage remain the same as for the prototype circuit in Fig. 20.24. Assuming, as before, an ideal exponential input voltage–output current relationship for the bipolar transistors, the following voltage transfer function can be derived:

$$V_{oD} = -\frac{\alpha}{2} I_{EE} R_L \left[ \tanh\left( \frac{V_{iD}}{2V_T} - \frac{\ln r}{2} \right) + \tanh\left( \frac{V_{iD}}{2V_T} + \frac{\ln r}{2} \right) \right] \tag{20.102}$$

After Taylor series expansion and some manipulation, Eq. (20.102) can be rewritten as

$$V_{oD} = -\alpha I_{EE} R_L (1 - d) \left[ \frac{V_{iD}}{2V_T} + \left( d - \frac{1}{3} \right) \left( \frac{V_{iD}}{2V_T} \right)^3 + \ldots \right] \tag{20.103}$$

where

$$d = \left( \frac{r - 1}{r + 1} \right)^2 \tag{20.104}$$

Equation (20.103) indicates that the dominant third-harmonic distortion component can be canceled by setting d = 1/3 or r = 2 + √3 ≈ 3.732. The presence of parasitic resistances within the transistors tends to require a somewhat higher ratio r for optimum linearization. In practice, the more easily realizable ratio r = 4 (or d = 9/25) is frequently used. When the linear input ranges at a 1% total harmonic distortion (THD) level of the single symmetrical emitter-coupled pair in Fig. 20.24 and the dual circuit with r = 4 in Fig. 20.35 are compared, a nearly threefold increase is noted. For r = 4 and neglecting higher-order terms, Eq. (20.103) becomes

$$A_{DD} \approx -0.64\, g_m R_L \tag{20.105}$$

where gm = αIEE/2VT as before. Equation (20.105) means that the tradeoff for the linearization is a reduction in the differential gain to 64% of the value obtained by a single symmetrical emitter-coupled pair with equal power dissipation. The squared input referred noise voltage per Hertz for the two parallel asymmetrical pairs can be expressed as

$$\frac{V_{iN}^2}{\Delta f} = \frac{8kT}{(0.64)^2} \left[ \frac{r_b}{5} + \frac{1}{2g_m} + \frac{1}{g_m^2 R_L} \right] \tag{20.106}$$

The factor rb/5 appears because of an effective reduction in the base resistance by a factor (r + 1) due to the presence of five transistors vs. one in the derivation of Eq. (20.77). If the unit transistor size in Fig. 20.35 is scaled down accordingly, a subsequent comparison of Eqs. (20.77) and (20.106) reveals that the input referred noise for the linearized circuit of Fig. 20.35 is 1/0.64, or 1.56 times higher than for the circuit in Fig. 20.24. Combined with the nearly threefold increase in linear input range, this means that the SNR nearly doubles. The approximately 6-dB increase in SNR is a distinct advantage over the emitter degeneration linearization technique. Moreover, the linearization approach introduced in this section can be extended to a summation of the output currents of three, four, or more parallel asymmetrical pairs. However, there is a diminished return in the improvement. Also, for more than two pairs, the required device ratios become quite large and the sensitivity of the linear input range to small mismatches in the actual ratios versus their theoretical values increases as well.
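The numbers quoted in this subsection can be cross-checked with a few lines of Python. The sketch below evaluates d from Eq. (20.104), confirms that r = 2 + √3 cancels the cubic term, recovers the factor 0.64 for r = 4 from the small-signal slope of Eq. (20.102), and converts the quoted threefold range increase and 1/0.64 noise penalty into decibels. The function names are hypothetical, and the threefold range figure is simply taken from the text rather than computed.

```python
import math

def distortion_coefficient(r):
    """d of Eq. (20.104) for a transistor ratio r."""
    return ((r - 1.0) / (r + 1.0)) ** 2

def normalized_output(x, r):
    """-V_oD / (alpha * I_EE * R_L) from Eq. (20.102), x = V_iD / (2 V_T)."""
    a = 0.5 * math.log(r)
    return 0.5 * (math.tanh(x - a) + math.tanh(x + a))

d_opt = distortion_coefficient(2.0 + math.sqrt(3.0))  # cancels the cubic term
d_4 = distortion_coefficient(4.0)                     # practical ratio r = 4
gain_factor = 1.0 - d_4                               # cf. Eq. (20.105)
slope = normalized_output(1e-4, 4.0) / 1e-4           # small-signal slope

# SNR change: about 3x linear range (from the text) times 0.64 (noise penalty).
snr_ratio = 3.0 * gain_factor
snr_db = 20.0 * math.log10(snr_ratio)

print(f"d_opt = {d_opt:.4f}, d(r=4) = {d_4:.2f}, slope = {slope:.4f}")
print(f"SNR ratio = {snr_ratio:.2f} ({snr_db:.1f} dB)")
```

The computed SNR ratio of 1.92 corresponds to slightly less than 6 dB, in line with the "nearly doubles" statement above.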

Rail-to-Rail Common-Mode Inputs and Minimum Supply Voltage Requirement

With the consistent trend toward lower power supplies, the usable input common-mode range as a percentage of the supply voltage is an important characteristic of differential amplifiers. Full rail-to-rail input compliance is a highly desirable property. Particularly for low-power applications, the ability to operate from a minimal supply voltage is equally important. For the basic emitter-coupled pair in Fig. 20.24, the input is limited on the positive side when the npn transistors saturate. Therefore,

$$V_{iC,pos} = V_{CC} - \frac{1}{2} R_L I_{EE} + V_{bc,forward} \tag{20.107}$$

If one limits RLIEE/2 < Vbc,forward, ViC,pos can be as high as VCC or even slightly higher. On the negative side, the common-mode input voltage is limited to that level, where the tail current source starts saturating. Assuming a single bipolar transistor is used as the current source,

$$V_{iC,neg} > V_{BE} + V_{CE,sat} \approx 1\ \text{V} \tag{20.108}$$

The opposite relationships hold for the equivalent pnp transistor-based circuit. As a result, the rail-to-rail common-mode input requirement can be resolved by putting two complementary stages in parallel. In general, as the input common-mode traverses between VCC and ground, three distinct operating conditions can occur: (1) at high voltage levels, only the npn stage is active; (2) at intermediate voltage levels, both the npn and pnp differential pairs are enabled; and finally, (3) for very low input voltages, only the pnp stage is operating. If care is not taken, three distinct gain ranges can occur: based on gmn only; resulting from gmn + gmp; and, contributed by gmp only. Non-constant gm and gain that depends on the input common-mode is usually not desirable for several reasons, not the least of which is phase compensation if the differential pair is used as the first stage in an operational amplifier. Fortunately, the solution to this problem is straightforward if one recognizes that the transconductance of the bipolar transistor is proportional to its bias current. Therefore, the only requirement for a bipolar constant-gm complementary circuit with full rail-to-rail input compliance is that under all operating conditions the sum of the bias currents of the npn and pnp subcircuits remains constant. A possible implementation is shown in Fig. 20.36. If ViC < VREF , the pnp input stage is enabled and the npn input transistors are off. When ViC > VREF , the bias current is switched to the npn pair and the pnp input devices turn off. For RLIEE/2 < Vcb,forward,n, the minimum required power supply voltage is VBE,n + VBE,p + VCE,sat,n + VCE,sat,p, which is lower than 2 V.
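Both claims at the end of this discussion lend themselves to a small numerical check: the total transconductance stays constant as long as the two tail currents add up to a fixed value, and the minimum supply voltage stays below 2 V for typical junction voltages. The sketch below is only an illustration; the 100 µA total tail current, VBE = 0.7 V, VCE,sat = 0.2 V, α = 1, and the thermal voltage are all assumed values.

```python
V_T = 0.02585  # thermal voltage near 300 K in volts (assumed)

def total_gm(i_npn, i_pnp, alpha=1.0):
    """Sum of the pair transconductances, each g_m = alpha * I / (2 V_T)."""
    return alpha * (i_npn + i_pnp) / (2.0 * V_T)

def min_supply(v_be_n=0.7, v_be_p=0.7, v_ce_sat_n=0.2, v_ce_sat_p=0.2):
    """Minimum supply: V_BE,n + V_BE,p + V_CE,sat,n + V_CE,sat,p."""
    return v_be_n + v_be_p + v_ce_sat_n + v_ce_sat_p

i_total = 100e-6  # assumed constant sum of the two tail currents, A
# Steer the tail current from the pnp pair to the npn pair in five steps.
gm_values = [total_gm(k * i_total / 4, i_total - k * i_total / 4)
             for k in range(5)]

print([f"{g * 1e3:.3f} mS" for g in gm_values])
print(f"minimum supply = {min_supply():.1f} V")
```

However the current is divided between the two pairs, every entry of gm_values is the same, which is exactly the constant-gm condition stated above.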

20.4 Output Stages

Output stages are specially designed to deliver large signals (as close to rail-to-rail as possible) and a significant amount of power to a specified load. The load to be driven is often very low-ohmic in nature; for example, 4 to 8 ohms in the case of audio loudspeakers. Therefore, output stages must possess a low output impedance and be able to supply high amounts of current without distorting the signal. The output stage also needs to have a relatively wide bandwidth, so that it does not contribute major frequency limitations to the overall amplifier. Equally desirable is a high efficiency in the power transfer. Preferably, the output stage consumes no (or very little) power in the quiescent state, when no input signal is present. In this section, two major classes of amplifiers, distinguished by their quiescent power needs, are discussed. Class A amplifiers consume the same amount of power regardless of the presence of an ac signal. Class B amplifiers, on the other hand, only consume power while activated by an input signal and dissipate absolutely no power in stand-by mode. A hybrid between these two distinct cases is Class AB operation, which consumes only a small amount of quiescent power. Because output stages deliver high power levels, care must be taken to guarantee


FIGURE 20.36 Low-voltage rail-to-rail input circuit.

that the transistors do not exceed their maximum power ratings, even under unintended operating conditions, such as when the output is shorted to ground. To avoid permanent damage or total destruction, output stages often include some sort of overload protection circuitry.

Class A Operation

The emitter follower, analyzed in Section 20.2, immediately comes to mind as a potential output stage configuration. The circuit is revisited here with an emphasis on its signal-handling capability and power efficiency. Figure 20.37(a) shows an emitter follower transistor Q1 biased by a current source Q2 and loaded by a resistor RL. For generality, separate positive (VCC) and negative (–VEE) supplies, as well as ground are used in Fig. 20.37(a) and the analysis below. The following identities can readily be derived

$$V_i = V_{BE,1} + V_o \tag{20.109}$$

$$V_i = V_T \ln\frac{I_1}{I_S} + V_o = V_T \ln\frac{I_Q + \dfrac{V_o}{R_L}}{I_S} + V_o \tag{20.110}$$

where IQ is the quiescent current supplied by the current source Q2. If one assumes that RL is quite large, the output current Io = Vo/RL is relatively small and VBE,1 is approximately constant. Then,

$$V_o \approx V_i - V_T \ln\frac{I_Q}{I_S} \tag{20.111}$$

or there is approximately a fixed voltage drop between input and output.
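The nearly constant input-to-output drop, and the way it degrades when RL is small, can be seen by evaluating Eq. (20.110) directly. The following sketch is illustrative only: the saturation current (1 fA), the quiescent current (1 mA), the thermal voltage, and both load resistances are assumptions, and the helper names are hypothetical.

```python
import math

V_T = 0.02585  # thermal voltage near 300 K in volts (assumed)
I_S = 1e-15    # assumed saturation current of Q1, A
I_Q = 1e-3     # assumed quiescent current from Q2, A

def input_for_output(v_o, r_l):
    """V_i required for a given V_o, from Eq. (20.110)."""
    i_1 = I_Q + v_o / r_l  # emitter current of Q1
    if i_1 <= 0.0:
        raise ValueError("Q1 is cut off; the output clips at -I_Q * R_L")
    return V_T * math.log(i_1 / I_S) + v_o

def drop(v_o, r_l):
    """Input-to-output voltage drop V_i - V_o."""
    return input_for_output(v_o, r_l) - v_o

for r_l in (100e3, 3e3):
    spread = drop(2.0, r_l) - drop(-2.0, r_l)
    print(f"R_L = {r_l:.0f} ohm: drop at 0 V = {drop(0.0, r_l):.3f} V, "
          f"change over +/-2 V = {spread * 1e3:.1f} mV")
```

For the large load the drop changes by about a millivolt over a ±2 V swing, consistent with Eq. (20.111), whereas for the small load it changes by tens of millivolts.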

FIGURE 20.37 (a) Emitter-follower output stage (Class A), and (b) voltage transfer diagram.

The positive output excursion is limited by the eventual saturation of Q1. This means that Vo cannot exceed (VCC – VCEsat,1). VCEsat,1 is less than VBE,1. However, since Vi is most often provided by a previous gain stage, its voltage level can generally not be raised above the supply. In practice, this means that the maximum output voltage is limited to

$$V_{o,max} = V_{CC} - V_{BE,1} \approx V_{CC} - V_T \ln\frac{I_Q}{I_S} \tag{20.112}$$

Similarly, the negative excursion is limited by the saturation of Q2. Therefore,

$$V_{o,min} = -V_{EE} + V_{CEsat,2} \tag{20.113}$$

If the assumption about the load resistor is not valid and RL is small, the slope of the transfer characteristic is not exactly 1 and some curvature occurs for larger signal excursions. Also, negative output clipping can occur sooner. Indeed, the maximum current flow through RL during the negative excursion is bounded by the bias IQ. Consequently,

$$V_{o,min} = -I_Q R_L \tag{20.114}$$
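The two negative limits can be combined into a single rule: the output clips at whichever bound is reached first. The sketch below encodes this reading of Eqs. (20.112) to (20.114); the supply voltages, bias current, VBE = 0.7 V, and VCEsat = 0.2 V are assumed values, and the function name is hypothetical.

```python
def class_a_swing(v_cc, v_ee, i_q, r_l, v_be=0.7, v_ce_sat=0.2):
    """Return (V_o,min, V_o,max) of the Class A emitter follower."""
    v_max = v_cc - v_be                         # Eq. (20.112)
    v_min = max(-v_ee + v_ce_sat, -i_q * r_l)   # Eqs. (20.113) and (20.114)
    return v_min, v_max

# Assumed example: +/-5 V supplies and I_Q = 1 mA.
print(class_a_swing(5.0, 5.0, 1e-3, 100e3))  # large R_L: limited by Q2 saturation
print(class_a_swing(5.0, 5.0, 1e-3, 2e3))    # small R_L: limited by I_Q * R_L
```

With the 2 kΩ load the negative swing stops at –2 V, well before Q2 saturates, which illustrates why IQ must be chosen with the smallest expected load in mind.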

Figure 20.37(b) graphically represents the emitter follower’s transfer characteristic for both RL assumptions. If VCC and VEE are much larger than VBE and VCEsat, the maximum symmetrical swing one can obtain is

$$V_{o,peak} \approx \frac{V_{CC} + V_{EE}}{2} \tag{20.115}$$

provided Vi has the proper dc bias and the quiescent current is equal to or greater than the optimum value

$$I_{Q,opt} = I_{o,peak} = \frac{V_{o,peak}}{R_L} \tag{20.116}$$

Under the conditions of Eqs. (20.115) and (20.116), the emitter follower’s power dissipation is

$$P_{supply} = (V_{CC} + V_{EE})\, I_{Q,opt} \approx 2 V_{o,peak} I_{o,peak} \tag{20.117}$$

For sinusoidal input conditions, the average power delivered to the load is expressed as

$$P_{load} = \frac{1}{2} V_{o,peak} I_{o,peak} \tag{20.118}$$

Therefore, the highest achievable power efficiency is limited to

$$\eta = \frac{P_{load}}{P_{supply}} \approx 25\% \tag{20.119}$$
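The 25% figure, and how quickly the efficiency drops at reduced signal levels, can be reproduced with the short sketch below. It follows Eqs. (20.115) to (20.118) under the same idealization (VBE and VCEsat neglected); the ±5 V supplies and the 8 Ω load are assumed example values.

```python
def class_a_efficiency(v_cc, v_ee, r_l, v_o_peak=None):
    """Efficiency of the Class A follower biased at I_Q,opt."""
    v_peak_max = (v_cc + v_ee) / 2.0       # Eq. (20.115)
    i_q = v_peak_max / r_l                 # Eq. (20.116)
    p_supply = (v_cc + v_ee) * i_q         # Eq. (20.117), independent of signal
    if v_o_peak is None:
        v_o_peak = v_peak_max              # default: full swing
    p_load = v_o_peak ** 2 / (2.0 * r_l)   # sinusoid into R_L, cf. Eq. (20.118)
    return p_load / p_supply

eta_full = class_a_efficiency(5.0, 5.0, 8.0)
eta_half = class_a_efficiency(5.0, 5.0, 8.0, v_o_peak=2.5)
print(f"full swing: {eta_full:.1%}, half swing: {eta_half:.2%}")
```

Because the supply power is fixed while the load power scales with the square of the amplitude, halving the swing cuts the efficiency by a factor of four.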

In conclusion, while the emitter follower can be used as an output stage, it suffers from two major limitations. First, the output swing is asymmetrical and not rail-to-rail. Second, the operation is Class A and thus consumes dc power. The emitter follower’s maximum power efficiency, which is only reached at full signal swing, is very poor.

Class B and Class AB Operation

Figure 20.38(a) shows a symmetrical configuration with the emitters of an npn and a pnp transistor tied together at the output node, while the input is provided to their joint bases. This dual emitter follower combination is frequently referred to as a push-pull arrangement. When no input is applied, clearly no current flow is possible and, thus, the operation is Class B. The push-pull configuration, however, does not solve all the problems of the single-ended emitter follower. Indeed, the output swing is still not rail-to-rail, but is limited to one VBE drop from either supply rail (assuming Vi cannot exceed VCC or VEE). The circuit’s transfer characteristic is shown in Fig. 20.38(b). Note that both transistors are off, not just for zero input as desired, but they also do not turn on when small inputs are applied. In effect, for VBE,p < Vi < VBE,n, the output has a dead zone. Such hard non-linearity leads to undesirable cross-over distortion. A method to overcome this problem will be discussed shortly. For larger inputs, only one of the transistors conducts.


FIGURE 20.38 (a) Emitter-follower push-pull output stage (Class B), and (b) voltage transfer diagram.

The push-pull configuration’s power efficiency under sinusoidal input conditions can be derived as follows. The dissipated power is equal to

$$P_{supply} = (V_{CC} + V_{EE})\, I_{supply} = (V_{CC} + V_{EE}) \frac{V_{o,peak}}{\pi R_L} \tag{20.120}$$

while the power delivered to the load is expressed as

$$P_{load} = \frac{V_{o,peak}^2}{2R_L} \tag{20.121}$$

Thus,

$$\eta = \frac{P_{load}}{P_{supply}} = \frac{\pi}{2} \frac{V_{o,peak}}{V_{CC} + V_{EE}} \tag{20.122}$$

Equation (20.122) says that the efficiency is directly proportional to the peak output amplitude. In case the base–emitter voltage can be neglected relative to the supply voltages,

$$V_{o,peak,max} = \frac{V_{CC} + V_{EE} - V_{BE,n} - V_{BE,p}}{2} \approx \frac{V_{CC} + V_{EE}}{2} \tag{20.123}$$

The power efficiency’s upper bound is therefore given by

$$\eta_{max} \approx \frac{\pi}{4} \approx 78.5\% \tag{20.124}$$
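The linear dependence of the efficiency on the output amplitude, and its π/4 upper bound, follow directly from Eqs. (20.120) to (20.122). The sketch below evaluates them for assumed ±5 V supplies and an assumed 8 Ω load; the second call subtracts assumed base–emitter drops of 0.7 V to show the effect of the neglected VBE terms in Eq. (20.123).

```python
import math

def class_b_efficiency(v_o_peak, v_cc, v_ee, r_l):
    """Efficiency of the Class B push-pull stage for a sinusoidal output."""
    p_supply = (v_cc + v_ee) * v_o_peak / (math.pi * r_l)  # Eq. (20.120)
    p_load = v_o_peak ** 2 / (2.0 * r_l)                   # Eq. (20.121)
    return p_load / p_supply                               # Eq. (20.122)

eta_ideal = class_b_efficiency(5.0, 5.0, 5.0, 8.0)  # V_BE neglected
eta_vbe = class_b_efficiency(4.3, 5.0, 5.0, 8.0)    # one 0.7 V drop per side
print(f"ideal: {eta_ideal:.1%}, with V_BE drops: {eta_vbe:.1%}")
```

At low supply voltages the VBE drops are no longer negligible, and the achievable efficiency falls noticeably below the ideal bound.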

The dead zone and resulting cross-over distortion can be eliminated by inserting two conducting diodes between the bases of the npn and pnp output devices, as shown in Fig. 20.39(a). Strictly speaking, the push-pull circuit is then no longer Class B, but rather becomes Class AB as a small stand-by current constantly flows through both output devices. The resulting linear transfer characteristic is shown in Fig. 20.39(b). One should, however, note that the actual quiescent current is not well-defined, as it depends on matching between the base–emitter voltage drops of the diodes and the output transistors. For discrete implementations, this configuration would clearly be too sensitive. Even in integrated circuits, it is often desirable to stabilize the operating point through the inclusion of degeneration resistors, as illustrated in Fig. 20.40(a). In addition, these series resistors act as passive current limiters. Indeed, a potential problem occurs in the circuit of Fig. 20.39(a) when the output node is shorted to ground (RL = 0). If the input voltage is large and positive, Q2 and the diode string are off, forcing all the current IB to flow into the base of Q1, where it gets multiplied by the (high) current gain β1. The resulting collector current may be so large as to cause Q1 to self-destruct. Obviously, the voltage drop, which builds across either degeneration resistor, limits the maximum current and, as such, can prevent damage. Unfortunately, the inclusion of series resistors is a far from perfect solution as the values needed for quiescent current stabilization and overdrive protection are often unacceptably large. Thus, the degeneration resistors reduce the power efficiency, limit the output swing, and raise the circuit’s output resistance. Fortunately, this issue can be circumvented if a diode is added in parallel with the degeneration resistor, as shown in Fig. 20.40(a).
For low current values, the diode is off and the circuit is characterized by the high resistance of R1 (R2). At higher current levels, the diode turns on and provides a low dynamic resistance. For small input signals, the degeneration resistor is in series with the load and voltage division occurs. At higher input levels when the diode is on, the output again follows the input. The transfer characteristic, illustrated in Fig. 20.40(b), shows that rather than being completely eliminated as in the circuit of Fig. 20.39, the dead zone is replaced by a “slow zone.” In other words, the hard non-linearity of the simple push-pull circuit in Fig. 20.38 is transformed into a more acceptable soft non-linearity. The transfer characteristics can be further linearized by applying negative feedback around the complete amplifier. The reader should observe, however, that the diode voltage drop further limits the maximum signal swing. While the diodes may reduce the passive current limiting provided by the resistors, superior active limiting is achieved when they are replaced by transistors, as depicted in Fig. 20.40(c). When the voltage drop across R1 becomes high enough to forward-bias the base–emitter junction of QD1, the transistor starts to pull current away from the base of Q1, delivering it harmlessly to the load, without additional multiplication. One should note that the limiting effect for large negative currents is not nearly as effective in this scheme, since this is largely determined by the current sinking capability of the driving transistor (not shown in Fig. 20.40), which pulls current out of the base of Q2. On the other hand, pnp transistors are usually lateral or substrate devices with relatively low current gain, which furthermore rolls off very quickly at higher current

FIGURE 20.39 (a) Emitter-follower push-pull output stage (Class AB), and (b) voltage transfer diagram.

levels. Consequently, the potential problem of negative current overload is inherently less severe. Whereas these lateral or substrate pnp’s are generally adequate for low to moderate power applications, if high power must be delivered, a complementary bipolar process with isolated vertical pnp’s is required. When such a more complicated and expensive process is not available, quasi-complementary structures can be used instead. In Fig. 20.41, for example, the pnp transistor is replaced by a pseudo-Darlington pair (see Section 20.2). In summary, the Class AB push-pull configuration meets the requirements of an output stage with relatively high efficiency in its power transfer and low stand-by dissipation. For very low voltage


FIGURE 20.40 (a) Emitter-follower push-pull output stage (Class AB) with resistors and diodes; (b) voltage transfer diagram; and (c) emitter-follower push-pull output stage (Class AB) with resistors and transistors.

applications, however, the voltage drop across the base–emitter junction (and the series diodes) constitutes a serious limitation. A true rail-to-rail output swing (apart from an unavoidable, but low VCEsat) can only be achieved by a complementary common-emitter output stage, as drawn in Fig. 20.42. Following the discussion in Section 20.2, the reader will likely interject that this arrangement suffers from a high output impedance and potential frequency limitations. An in-depth treatment of low-voltage common-emitter output configurations is beyond the scope of this chapter. The interested reader is referred to Ref. 5.

20.5 Bias Reference

It is definitely not the author’s intention to present an in-depth discussion of voltage and current reference design. However, on several occasions, the terms “bandgap voltage” and “PTAT voltage” were mentioned.


FIGURE 20.41 Quasi-complementary push-pull output stage.

FIGURE 20.42 Rail-to-rail complementary common-emitter output stage.

It was also noted that a current, which is proportional to absolute temperature and inversely related to a resistor, is quite often desired in order to stabilize an amplifier’s gain. A circuit that provides such supply-independent voltages and currents is shown in Fig. 20.43. By inspection,

$$V_{BG} = V_{BE1} + R_2 I = V_{BE1} + V_{PTAT} \tag{20.125}$$

The current mirror consisting of Q3 and Q4 forces the current I to split evenly between Q1 and Q2. As a result, the following identity is valid:

$$V_{BE1} = V_{BE2} + R_1 \frac{I}{2} \tag{20.126}$$

Substituting the Ebers-Moll identity into Eq. (20.126), where IS2 is N times IS1, yields

$$I = \frac{2}{R_1} \left( V_T \ln\frac{I}{2I_{S1}} - V_T \ln\frac{I}{2I_{S2}} \right) = \frac{2}{R_1} V_T \ln N \tag{20.127}$$

FIGURE 20.43 Bias reference.

By combining Eqs. (20.125) and (20.127), one gets

$$V_{BG} = V_{BE1} + 2 \frac{R_2}{R_1} V_T \ln N \tag{20.128}$$

Also,

$$V_{PTAT} = 2 \frac{R_2}{R_1} V_T \ln N = 2 \frac{R_2}{R_1} \frac{kT}{q} \ln N \tag{20.129}$$

From Eq. (20.129), one concludes that VPTAT is indeed proportional to absolute temperature. Apart from the absolute temperature T, VPTAT only depends on physical constants (k and q), an area multiple N, and the ratio of two resistors. In addition to VPTAT, the expression for VBG (Eq. (20.128)) contains the term VBE1, which decreases by 2 mV per degree increase in temperature. Through an appropriate selection of the resistors R1 and R2, the voltage VBG can be made temperature independent. This occurs for VBG ≈ 1.25 V, known as the BJT bandgap voltage. The currents ISOURCE and ISINK in Fig. 20.43 are simply mirrored copies of I, and thus exhibit the desired PTAT and resistor dependence. Further observation of the circuit in Fig. 20.43 reveals that it possesses a second (although unstable) operating point. Indeed, the circuit is also in equilibrium when VBG = VPTAT = 0 V and there is no current flow. To prevent the bias reference from being stuck in this undesired state, an initial start-up circuit is added as shown. When power is first applied, the start-up circuit injects a small current into the mirror Q3-Q4, forcing the circuit to wake up and drift away from the zero state. As VBG increases toward 1.25 V, the differential pair eventually switches and the start-up current is simply thrown away. The reader should realize that the circuit in Fig. 20.43 is conceptual in nature and specifically drawn to show that a supply voltage VCC < 2 V suffices. However, it suffers from non-idealities due to base currents and poor supply rejection resulting from the simple current mirrors with relatively low output impedances. At the expense of added circuit complexity and the need for a higher supply voltage, significant improvements can be made. Such specialized bias reference discussion, however, goes beyond the scope of this chapter.
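The resistor-ratio selection mentioned above can be sketched numerically. Under a first-order model in which VBE1 falls linearly by 2 mV per kelvin, the temperature coefficient of VBG vanishes when the slope of VPTAT in Eq. (20.129) equals +2 mV/K. The area ratio N = 8 and the value VBE1 = 0.65 V at 300 K are assumptions made for this illustration; real designs must also account for the curvature of VBE, which this sketch ignores.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant over electron charge, V/K

def v_ptat(ratio_r2_r1, n, temp_k):
    """Eq. (20.129): V_PTAT = 2 (R2/R1) (kT/q) ln N."""
    return 2.0 * ratio_r2_r1 * K_OVER_Q * temp_k * math.log(n)

def ratio_for_zero_tc(n, dvbe_dt=-2e-3):
    """R2/R1 for which dV_PTAT/dT cancels an assumed linear dV_BE/dT."""
    return -dvbe_dt / (2.0 * K_OVER_Q * math.log(n))

n = 8                                   # assumed emitter area ratio
ratio = ratio_for_zero_tc(n)            # required R2/R1
v_bg = 0.65 + v_ptat(ratio, n, 300.0)   # assumed V_BE1 = 0.65 V at 300 K

print(f"R2/R1 = {ratio:.2f}, V_PTAT(300 K) = {v_ptat(ratio, n, 300.0):.3f} V")
print(f"V_BG = {v_bg:.2f} V")
```

With these assumptions the required ratio is about 5.6 and VBG lands at 1.25 V, consistent with the value quoted in the text.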

20.6 Operational Amplifiers

Introduction

Operational amplifiers (or op-amps) are key analog circuit building blocks that find widespread use in a variety of applications, such as precision amplification circuits (e.g., instrumentation amplifiers), continuous-time and switched-capacitor filters, etc. The traditional symbolic representation of the operational amplifier is shown in Fig. 20.44. The op-amp is a five-terminal device, with inverting and non-inverting input terminals (hence, accommodating a differential input signal), a single-ended output terminal, as well as positive and negative supply terminals. Most commercially available op-amps require a dual supply system of equal, but opposite value (e.g. ±15 V or ±5 V); however, asymmetrical or single-supply circuits are also available (e.g., +5 V and ground). Special-purpose operational amplifiers with differential or fully balanced outputs also exist. The internal circuitry of op-amps combines the different building blocks, which were previously discussed. Op-amps typically consist of two or three stages. The input stage, based on a differential pair, provides the initial amplification. A second or intermediate stage may be included to boost the amplifier’s gain. Differential to single-ended conversion is also accomplished in the first stage, or, if applicable, in the second stage. The output stage, typically an emitter follower push-pull configuration, provides a low impedance, large swing, and high current drive. An elementary op-amp schematic can be found in Fig. 20.47(a).

FIGURE 20.44 Op-amp symbol.

Ideal Op-Amps

The op-amp’s input–output relationship can be expressed as

$$V_o = A \left( V_i^+ - V_i^- \right) \tag{20.130}$$

In the case of an ideal op-amp, A is assumed to have infinite magnitude as well as bandwidth. As such, there is no phase shift over frequency, as illustrated in Fig. 20.45(a). Since the output Vo must remain bounded, the assumption of infinite gain leads to the fundamental principle of virtual ground (virtual short). In other words, if the op-amp is ideal, there cannot exist a voltage difference between the input terminals, and $V_i^+$ must equal $V_i^-$. The assumption of ideality also calls for an infinite input impedance (i.e., the op-amp does not load the driving source) and a zero output impedance (i.e., the op-amp can accommodate arbitrarily small loads).

Op-Amp Non-idealities

Real operational amplifiers generally approximate the ideal op-amp model reasonably well; however, they naturally have finite gain, finite bandwidth, and finite input as well as output impedances. Specific non-idealities are itemized below.


FIGURE 20.45 Magnitude and phase responses: (a) ideal op-amp; (b) real op-amp (potentially unstable in unity-gain feedback loop); (c) real op-amp with internal compensation (guaranteed stable, phase margin approx. 90 degrees).

Finite Gain

The op-amp’s (low-frequency) gain can be made quite high, particularly when multiple gain stages are cascaded. The absolute gain value, however, is not very well-defined as it depends on widely varying process parameters, such as the transistor’s current gain β. If a precise gain is required, the op-amp must be configured into a feedback network; for example, the non-inverting and inverting gain stages shown in Figs. 20.46(a) and (b), respectively. Assuming finite op-amp gain but infinite input impedance, the gain of the non-inverting amplifier in Fig. 20.46(a) can be expressed as

$$\frac{V_o}{V_i} = A \frac{R_i + R_f}{R_i + R_f + A R_i} \tag{20.131}$$

If A is high, Eq. (20.131) can be approximated by

$$\frac{V_o}{V_i} \approx \frac{A}{A + 1} \frac{R_i + R_f}{R_i} \approx 1 + \frac{R_f}{R_i} \tag{20.132}$$

For practical resistor values, the closed-loop gain in Eq. (20.132) is not very high (at least compared to the op-amp’s open-loop gain A), but it can be accurately set, even over process corners, by the ratio of two like components. Similarly, the inverting amplifier of Fig. 20.46(b) provides a closed-loop gain

$$\frac{V_o}{V_i} = -A \frac{R_f}{R_i + R_f + A R_i} \approx -\frac{A}{A + 1} \frac{R_f}{R_i} \approx -\frac{R_f}{R_i} \tag{20.133}$$
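The insensitivity of the closed-loop gain to the open-loop gain A can be made concrete with a short calculation based on Eqs. (20.131) and (20.133). The resistor values (Ri = 1 kΩ, Rf = 9 kΩ) and the two open-loop gains in the sketch below are assumed for illustration.

```python
def noninverting_gain(a, r_i, r_f):
    """Closed-loop gain of Fig. 20.46(a), Eq. (20.131)."""
    return a * (r_i + r_f) / (r_i + r_f + a * r_i)

def inverting_gain(a, r_i, r_f):
    """Closed-loop gain of Fig. 20.46(b), Eq. (20.133)."""
    return -a * r_f / (r_i + r_f + a * r_i)

r_i, r_f = 1e3, 9e3  # assumed resistors: ideal gains are 10 and -9
for a in (1e3, 1e5):
    print(f"A = {a:.0e}: non-inverting {noninverting_gain(a, r_i, r_f):.4f}, "
          f"inverting {inverting_gain(a, r_i, r_f):.4f}")
```

A hundredfold change in A moves the non-inverting gain only from about 9.90 to about 10.00, which is why the resistor ratio, rather than the op-amp itself, sets the gain accuracy.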

An important building block in active filter design is the inverting integrator circuit displayed in Fig. 20.46(c). By inspection,

$$\frac{V_o}{V_i} = \frac{-A}{(A + 1) s R_i C_f + 1} \approx \frac{-1}{s R_i C_f} \tag{20.134}$$

Unlike for the previous two amplifiers, the gain in Eq. (20.134) depends on the product of absolute resistor and capacitor values. When implemented as a monolithic circuit, the integrator’s time constant is therefore subject to large tolerances, which need to be compensated for by trimming or an automatic tuning loop.
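A brief numerical look at Eq. (20.134) shows both effects discussed here: the finite gain A turns the ideal integrator into a low-pass response with dc gain –A, and the unity-gain frequency tracks the absolute RiCf product. The component values (Ri = 10 kΩ, Cf = 10 nF, A = 10⁴) and the 20% tolerance in the sketch below are assumed for illustration only.

```python
import math

def integrator_gain(freq_hz, a, r_i, c_f):
    """Complex gain of the inverting integrator, Eq. (20.134), at s = j*2*pi*f."""
    s = 1j * 2.0 * math.pi * freq_hz
    return -a / ((a + 1.0) * s * r_i * c_f + 1.0)

a, r_i, c_f = 1e4, 10e3, 10e-9                   # assumed values
f_unity = 1.0 / (2.0 * math.pi * r_i * c_f)      # ideal unity-gain frequency
dc_gain = integrator_gain(0.0, a, r_i, c_f)      # equals -A at dc
mag_nominal = abs(integrator_gain(f_unity, a, r_i, c_f))
mag_plus20 = abs(integrator_gain(f_unity, a, 1.2 * r_i, c_f))  # +20% in R_i

print(f"f_unity = {f_unity:.1f} Hz, dc gain = {dc_gain.real:.0f}")
print(f"|gain| at f_unity: nominal {mag_nominal:.4f}, +20% R_i {mag_plus20:.4f}")
```

The 20% resistor error shifts the gain at the nominal unity-gain frequency by the same proportion, which illustrates why trimming or automatic tuning is needed in monolithic filters.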

FIGURE 20.46 (a) Noninverting amplifier; (b) inverting amplifier; and (c) inverting integrator.


Finite Bandwidth

The op-amp’s internal electronics are characterized by parasitic poles and/or zeros. As previously discussed, these unavoidable parasitics cause the gain to roll off at high frequencies and also introduce phase shifts. A typical op-amp’s magnitude and phase responses are shown in Fig. 20.45(b). Assuming the op-amp has a dominant pole, the gain initially decreases by 20 dB/decade and the phase shift approaches –90°. At higher frequencies, the gain starts to decrease faster due to the combined effect of additional non-dominant parasitics and the phase increases as well. Since op-amps are used in gain stabilizing negative feedback circuits, as discussed in the previous sub-section, this high-frequency behavior can lead to instability. The worst possible situation occurs for unity feedback configurations. To avoid potential oscillation problems even under these conditions, commercial op-amps are nearly always internally compensated. Typically, a low-frequency dominant pole is introduced, which causes the gain to drop below unity at the –180° phase cross-over frequency, as shown in Fig. 20.45(c). The difference between the actual phase shift at the op-amp’s unity-gain frequency and –180° is referred to as the phase margin. Whereas an op-amp is unconditionally stable when its phase margin is positive, values well above 45° are highly desirable to obtain quick settling to input transients. Conversely, the ratio between unity and the actual gain at the frequency corresponding to –180° phase shift is called the gain margin. If the dominant pole assumption is valid, the op-amp’s gain A should be rewritten as

$$A(s) = A(j\omega) = \frac{A_o \omega_o}{s + \omega_o} \approx \frac{A_o \omega_o}{s} \tag{20.135}$$

where Ao is the dc gain and ωo is the –3 dB frequency in radians per second. Aoωo is referred to as the op-amp’s gain–bandwidth product. When Eq. (20.135) is substituted into Eqs. (20.132) and (20.133), the latter gain expressions become frequency dependent. If Eq. (20.135) is likewise combined with the integrator gain in Eq. (20.134), the latter’s denominator turns into a second-order polynomial.

Finite Input Impedance

An op-amp’s input stage is usually a differential pair. Expressions for its differential and common-mode input resistances have been previously derived. If a high differential input resistance is desired, several design options exist. First, npn input pairs are better than pnp’s, thanks to the higher β. Also, the input impedance increases as the input transconductance is lowered, either by reducing the bias current or by adding degeneration resistors. A Darlington pair can be used when an even higher input resistance is required. At high frequencies, the input impedance becomes capacitive (and thus decreases) due to the BJT’s Cπ and Cµ.

Input Bias Current

The bipolar transistors in the input differential pair require base current. For npn transistors, the current flows into the base, whereas current must be pulled out of the base in the case of pnp transistors. As a result of the higher β, for a given transconductance, the required input bias current is lower in absolute value in the case of an npn input stage compared to a pnp. Darlington or pseudo-Darlington input pairs can further reduce the input bias current requirement. Alternatively, input bias or base current cancellation techniques are sometimes applied.

Input Offset Voltage

Unavoidable device mismatches require the application of a small difference voltage between the input terminals in order to get zero volts at the output terminal. The reader is referred back to Section 20.3 (Input Offset Voltage).

Input Offset Current

Similar to the input offset voltage, when the inputs are currents rather than voltages, small component mismatches require the application of a small input difference current in order to get zero volts at the output node. See Section 20.3 (Input Offset Current) for more details.

Finite Output Impedance A typical output stage consists of an emitter follower push-pull configuration. Hence, the output resistance depends on 1/gm, plus the resistance of the driving stage divided by β. Since β rolls off at high frequencies, Ro increases accordingly. As such, the output impedance appears inductive. This phenomenon can lead to stability problems when driving capacitive loads.

Finite Common-Mode Rejection Ratio In Section 20.3, the common-mode rejection ratio was defined as the ratio between the differential and common-mode gains. However, an op-amp’s CMRR can be more meaningfully explained in terms of the input offset voltage. In this way, the CMRR can be defined as the change in input offset voltage due to a unit change in common-mode input voltage. The CMRR of commercial op-amps typically measures 100 to 120 dB.

Finite Power Supply Rejection Ratio Similar to the CMRR, the power supply rejection ratio (PSRR) is defined as the change in input offset voltage due to a unit change in the supply voltage. An op-amp’s PSRR is nearly always different with respect to its positive and negative supplies. Hence, two individual performance numbers are specified. Typical values are in the range of the CMRR.

Slew Rate, Full-Power Bandwidth, and Unity-Gain Frequency When a step input is applied, the op-amp’s output cannot change instantaneously. Rather, a finite transition time is needed. The maximum rate of change in output voltage is referred to as the op-amp’s slew rate (SR). To determine an expression for the slew rate, consider the elementary op-amp in Fig. 20.47(a). This basic circuit consists of an input differential pair gain stage, which also converts the input to a single-ended signal. An intermediate stage provides additional gain. In addition, a dominant pole is introduced by means of the Miller capacitor Cc. The output stage is a Class AB emitter follower push-pull configuration.
For the purpose of slew rate calculation, the elementary op-amp can be replaced by the equivalent circuit in Fig. 20.47(b). Then,

$$\mathrm{SR} = \left.\frac{dV_o}{dt}\right|_{\max} = \left.\frac{g_m V_i}{C_c}\right|_{\max} \tag{20.136}$$

The maximum value of gmVi is equal to α times the tail current source IEE. Thus,

$$\mathrm{SR} = \frac{\alpha I_{EE}}{C_c} \tag{20.137}$$

Based on the equivalent circuit in Fig. 20.47(b), the op-amp’s radial unity-gain frequency is readily determined as

$$\omega_u = \frac{g_m}{C_c} \tag{20.138}$$

By combining Eqs. (20.137) and (20.138), the following relationship can be identified between the op-amp’s slew rate and its unity-gain frequency

$$\mathrm{SR} = \frac{\alpha I_{EE}}{g_m}\,\omega_u \tag{20.139}$$

From Eq. (20.137), one could conclude that to improve the slew rate, IEE must be increased and/or Cc lowered. However, for the elementary op-amp in Fig. 20.47(a), gm is also directly proportional to IEE.

FIGURE 20.47 (a) Elementary op-amp, and (b) equivalent circuit for slew rate calculation.

Thus, both methods of increasing the slew rate at the same time increase the unity-gain bandwidth ωu. Obviously, a higher ωu would also be desirable. However, ωu has an upper limit, which is determined by the op-amp’s non-dominant poles. Indeed, less than –180° phase shift must be maintained at ωu in order to guarantee stability when external feedback is applied. This requirement for ωu severely limits our flexibility to enhance the op-amp’s slew rate. For a given ωu, Eq. (20.139) suggests that the slew rate can be improved by increasing the IEE/gm ratio. Two approaches that fall into this category have been previously described in Section 20.3 on differential pair linearization. First, instead of a simple emitter-coupled pair, a differential pair with degeneration resistors can be used. Unfortunately, as pointed out before, the degeneration resistors negatively impact the circuit’s noise and also degrade its offset performance. A better approach for improved slew rate is the use of parallel asymmetrical differential pairs, such as two pairs with a 1:r emitter ratio. Although the emphasis in Section 20.3 was on linearization, and a ratio r = 4 was chosen specifically to eliminate the third harmonic distortion, the reader may recall that the tradeoff was a reduction in current efficiency (gm/IEE) to 64% of that of a simple differential pair. In the case of an op-amp, a wide linear input range is of lesser or no concern (when feedback is applied, there is virtually no differential signal across the input terminals) and therefore different emitter ratios can be chosen. gm/IEE further decreases with larger r values. As such, the slew rate can be improved, while keeping ωu constant. The reader is referred to Ref. 7 for further details.

Another parameter that can directly be correlated to the slew rate is the full-power bandwidth ωmax. The full-power bandwidth is defined as the maximum radian frequency for which the op-amp achieves a full output swing under the assumption of a sinusoidal signal. Let

$$V_o = \hat{V}\sin\omega t \tag{20.140}$$

Then,

$$\frac{dV_o}{dt} = \hat{V}\,\omega\cos\omega t \tag{20.141}$$

and

$$\left.\frac{dV_o}{dt}\right|_{\max} = \hat{V}\,\omega_{\max} \tag{20.142}$$

The left-hand side in Eq. (20.142) is, per definition, the slew rate. Hence, the slew rate and full-power bandwidth are simply related by the amplitude of the sinusoidal output signal.
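As a numerical illustration of Eqs. (20.137), (20.138), and (20.142), the short Python sketch below evaluates the slew rate, unity-gain frequency, and full-power bandwidth of the elementary op-amp. The component values (`ALPHA`, `I_EE`, `C_C`, `G_M`, `V_PEAK`) and the function names are assumptions chosen only for illustration; they are not taken from Fig. 20.47.

```python
import math


def slew_rate(alpha, i_ee, c_c):
    """Eq. (20.137): SR = alpha * I_EE / C_c, in V/s."""
    return alpha * i_ee / c_c


def unity_gain_frequency(g_m, c_c):
    """Eq. (20.138): omega_u = g_m / C_c, in rad/s."""
    return g_m / c_c


def full_power_bandwidth(sr, v_peak):
    """Eq. (20.142) solved for omega_max = SR / V_peak, in rad/s."""
    return sr / v_peak


# Illustrative values, assumed for this sketch only.
ALPHA = 0.99    # common-base current gain
I_EE = 20e-6    # tail current, A
C_C = 30e-12    # Miller compensation capacitor, F
G_M = 0.4e-3    # input-stage transconductance, S
V_PEAK = 10.0   # amplitude of the sinusoidal output, V

sr = slew_rate(ALPHA, I_EE, C_C)
omega_u = unity_gain_frequency(G_M, C_C)
omega_max = full_power_bandwidth(sr, V_PEAK)

print(f"SR    = {sr / 1e6:.2f} V/us")
print(f"f_u   = {omega_u / (2 * math.pi) / 1e6:.2f} MHz")
print(f"f_max = {omega_max / (2 * math.pi) / 1e3:.2f} kHz")
```

With these assumed numbers, the full-power bandwidth comes out orders of magnitude below the unity-gain frequency, which is why large-signal and small-signal bandwidths are normally specified separately.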

20.7 Conclusion In this chapter, the basic concepts behind signal amplification using bipolar transistors were introduced. Different circuit configurations, which fulfill distinct roles as building blocks in multi-stage designs or operational amplifiers, were presented. During their analysis, no parallel was drawn nor was a comparison made with respect to competing CMOS designs. This chapter would, however, not be complete if the main pros and cons of bipolar amplifiers are not very briefly mentioned. Bipolar amplifiers typically enjoy an advantage by offering higher gain, wider bandwidth, lower noise, and smaller offsets compared to their CMOS counterparts. The transconductance of an MOS transistor, while proportional to the device’s size, is generally much lower than for bipolar and, rather than linearly, only increases with the square root of the bias current. One disadvantage of bipolar amplifiers is the need for input bias currents, coupled with unavoidable input offset currents. Second, MOS transistors in saturation approximate a square-law I–V relationship and are therefore inherently more linear than bipolar devices, which are exponential in nature. Additionally, the linearity of MOS designs can be improved simply by proper device sizing. But, perhaps the most significant drawback of bipolar technology is a more complicated and expensive process (especially in the case of a complementary process with true isolated pnp transistors). Furthermore, bipolar’s incompatibility with mainstream VLSI processes prevents higher levels of system integration.

References
Text Books
1. R. D. Middlebrook, Differential Amplifiers, Wiley, New York, 1963.
2. L. J. Giacoletto, Differential Amplifiers, Wiley, New York, 1970.
3. P. R. Gray and R. G. Meyer, Analysis and Design of Analog Integrated Circuits, 2nd ed., Wiley, New York, 1984.
4. A. B. Grebene, Bipolar and MOS Analog Integrated Circuit Design, Wiley, New York, 1984.
5. J. Fonderie, Design of Low-Voltage Bipolar Operational Amplifiers, Delft University Press, Delft, 1991.
Articles
6. J. E. Solomon, The monolithic opamp: a tutorial study, IEEE J. Solid-State Circuits, vol. SC-9, pp. 314-332, Dec. 1974.
7. J. Schmoock, An input stage transconductance reduction technique for high-slew rate operational amplifiers, IEEE J. Solid-State Circuits, vol. SC-10, pp. 407-411, Dec. 1975.
8. J. O. Voorman, W. H. A. Bruls, and P. J. Barth, Bipolar integration of analog gyrator and Laguerre type filters, Proc. ECCTD, 1983, Stuttgart, pp. 108-110.
9. J. H. Huijsing and D. Linebarger, Low-voltage operational amplifier with rail-to-rail input and output ranges, IEEE J. Solid-State Circuits, vol. SC-20, pp. 1144-1150, Dec. 1985.
10. R. J. Widlar and M. Yamatake, A fast settling opamp with low supply currents, IEEE J. Solid-State Circuits, vol. SC-24, pp. 796-802, June 1989.
11. G. A. De Veirman, S. Ueda, J. Cheng, S. Tam, K. Fukahori, M. Kurisu, and E. Shinozaki, A 3.0 V 40 Mbit/s hard disk drive read channel IC, IEEE J. Solid-State Circuits, vol. SC-30, pp. 788-799, July 1995.
12. G. A. De Veirman, Differential circuits, in Encyclopedia of Electrical and Electronics Engineering, J. G. Webster, Editor, Wiley, New York, 1999.

Toumazou, C., Payne, A. "High-Frequency Amplifiers" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

8593-21-frame Page 1 Friday, January 28, 2000 02:19 PM

21 High-Frequency Amplifiers

Chris Toumazou, Alison Payne
Imperial College, University of London

21.1 Introduction
21.2 The Current Feedback Op-Amp
Current Feedback Op-Amp Basics • CMOS Compound Device • Buffer and CFOA Implementation
21.3 RF Low-Noise Amplifiers
Specifications • CMOS Common-Source LNA: Simplified Analysis • CMOS Common-Source LNA: Effect of Cgd • Cascode CS LNA
21.4 Optical Low-Noise Preamplifiers
Front-End Noise Sources • Receiver Performance Criteria • Transimpedance (TZ) Amplifiers • Layout for HF Operation
21.5 Fundamentals of RF Power Amplifier Design
PA Requirements • Power Amplifier Classification • Practical Considerations for RF Power Amplifiers • Conclusions
21.6 Applications of High-Q Resonators in IF-Sampling Receiver Architectures
IF Sampling • Linear Region Transconductor Implementation • A gm-C Bandpass Biquad
21.7 Log-Domain Processing
Instantaneous Companding • Log-Domain Filter Synthesis • Performance Aspects • The Basic Log-Domain Integrator • Synthesis of Higher-Order Log-Domain Filters

21.1 Introduction As the operating frequency of communication channels for both video and wireless increases, there is an ever-increasing demand for high-frequency amplifiers. Furthermore, the quest for single-chip integration has led to a whole new generation of amplifiers predominantly geared toward CMOS VLSI. In this chapter, we will focus on the design of high-frequency amplifiers for potential applications in the front-end of video, optical, and RF systems. Figure 21.1 shows, for example, the architecture of a typical mobile phone transceiver front-end. With channel frequencies approaching the 2-GHz range, coupled with demands for reduced chip size and power consumption, there is an increasing quest for VLSI at microwave frequencies. The shrinking feature size of CMOS has facilitated the design of complex analog circuits and systems in the 1- to 2-GHz range, where more traditional low-frequency lumped circuit techniques are now becoming feasible. Since the amplifier is the core component in such systems, there has been an abundance of circuit design methodologies for high-speed, low-voltage, low-noise, and low distortion operation.

FIGURE 21.1 Generic wireless transceiver architecture.

This chapter will present various amplifier designs that aim to satisfy these demanding requirements. In particular, we will review, and in some cases present new ideas for, power amps, LNAs, and transconductance cells, which form core building blocks for systems such as that of Fig. 21.1. Section 21.2 begins by reviewing the concept of current feedback, and shows how this concept can be employed in the development of low-voltage, high-speed, constant-bandwidth CMOS amplifiers. The next two sections of the chapter focus on receiver front-end amplifiers, investigating performance requirements and design strategies for high-frequency low-noise amplifiers (Section 21.3) and optical receiver amplifiers (Section 21.4). Section 21.5 considers the design of amplifiers for the transmitter side, and in particular the design and feasibility of Class E power amps are discussed. Finally, Section 21.7 reviews a very recent low-distortion amplifier design strategy termed “log-domain,” which has shown enormous potential for high-frequency, low-distortion tunable filters.

21.2 The Current Feedback Op-Amp Current Feedback Op-Amp Basics The operational amplifier (op-amp) is one of the fundamental building blocks of analog circuit design.1,2 High-performance signal processing functions such as amplifiers, filters, oscillators, etc. can be readily implemented with the availability of high-speed, low-distortion op-amps. In the last decade, the development of complementary bipolar technology has enabled the implementation of single-chip video op-amps.3–7 The emergence of op-amps with non-traditional topologies, such as the current feedback op-amp, has improved the speed of these devices even further.8–11 Current feedback op-amp structures are well known for their ability to overcome (to a first-order approximation) the gain-bandwidth tradeoff and slew rate limitation that characterize traditional voltage feedback op-amps.12

Figure 21.2 shows a simple macromodel of a current feedback op-amp (CFOA), along with a simplified circuit diagram of the basic architecture. The topology of the current feedback op-amp differs from the conventional voltage feedback op-amp (VOA) in two respects. First, the input stage of a CFOA is a unity-gain voltage buffer connected between the inputs of the op-amp. Its function is to force Vn to follow Vp, very much like a conventional VOA does via negative feedback. In the case of the CFOA, because of the low output impedance of the buffer, current can flow in or out of the inverting input, although in normal operation (with negative feedback) this current is extremely small. Secondly, a CFOA provides a high open-loop transimpedance gain Z(jω), rather than open-loop voltage gain as with a VOA. This is shown in Fig. 21.2, where a current-controlled current source senses the current IINV delivered by the buffer to the external feedback network, and copies this current to a high impedance Z(jω). The voltage conveyed to the output is given by Eq. 21.1:

$$V_{OUT} = Z(j\omega)\cdot I_{INV} \;\Rightarrow\; \frac{V_{OUT}}{I_{INV}}(j\omega) = Z(j\omega) \tag{21.1}$$

FIGURE 21.2 Current feedback op-amp macromodel.

When the negative feedback loop is closed, any voltage imbalance between the two inputs due to some external agent will cause the input voltage buffer to deliver an error current IINV to the external network. This error current IINV = I1 – I2 = IZ is then conveyed by the current mirrors to the impedance Z, resulting in an output voltage as given by Eq. 21.1. The application of negative feedback ensures that VOUT will move in the direction that reduces the error current IINV and equalizes the input voltages. We can approximate the open-loop dynamics of the current feedback op-amp as a single pole response. Assuming that the total impedance Z(jω) at the gain node is the combination of the output resistance of the current mirrors Ro in parallel with a compensation capacitor C, we can write:

$$Z(j\omega) = \frac{R_o}{1 + j\omega R_o C} = \frac{R_o}{1 + j\dfrac{\omega}{\omega_o}} \tag{21.2}$$

where ωo = 1/(Ro ⋅ C) represents the frequency where the open-loop transimpedance gain is 3 dB down from its low frequency value Ro. In general, Ro is designed to be very high in value. Referring to the non-inverting amplifier configuration shown in Fig. 21.3:

$$I_{INV} = \frac{V_{IN}}{R_G} - \frac{V_{OUT} - V_{IN}}{R_F} = \frac{V_{IN}}{R_G \parallel R_F} - \frac{V_{OUT}}{R_F} \tag{21.3}$$

FIGURE 21.3 CFOA non-inverting amplifier configuration.

Substituting Eq. 21.1 into Eq. 21.3 yields the following expression for the closed-loop gain:

$$A_{CL}(j\omega) = \left(1 + \frac{R_F}{R_G}\right)\cdot\frac{Z(j\omega)}{R_F + Z(j\omega)} = \left(1 + \frac{R_F}{R_G}\right)\cdot\frac{1}{1 + \dfrac{R_F}{Z(j\omega)}} \tag{21.4}$$

Combining Eqs. 21.2 and 21.4, and assuming that the low frequency value of the open-loop transimpedance is much higher than the feedback resistor (Ro >> RF) gives:

$$A_{CL}(j\omega) = \left(1 + \frac{R_F}{R_G}\right)\cdot\frac{1}{1 + j\dfrac{R_F\cdot\omega}{R_o\cdot\omega_o}} = \frac{A_{Vo}}{1 + j\dfrac{\omega}{\omega_\alpha}} \tag{21.5}$$

Referring to Eq. 21.5, the closed-loop gain AVo = 1 + RF /RG , while the closed-loop –3 dB frequency ωα is given by:

$$\omega_\alpha = \frac{R_o}{R_F}\cdot\omega_o \tag{21.6}$$
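The gain-independent bandwidth predicted by Eq. 21.6 can be checked numerically. The sketch below evaluates Eq. 21.4 with the single-pole Z(jω) of Eq. 21.2 and searches for the –3 dB frequency at several closed-loop gains. The values of `R_O` and `R_F` are assumed for illustration (only the 0.5-pF capacitor corresponds to a value quoted later in this section), and the helper names are hypothetical.

```python
import math


def z_open_loop(omega, r_o, c):
    """Eq. 21.2: single-pole open-loop transimpedance."""
    return r_o / (1 + 1j * omega * r_o * c)


def closed_loop_gain(omega, r_f, r_g, r_o, c):
    """Eq. 21.4: non-inverting closed-loop voltage gain."""
    z = z_open_loop(omega, r_o, c)
    return (1 + r_f / r_g) * z / (r_f + z)


def bandwidth_3db(r_f, r_g, r_o, c):
    """Find the -3 dB radian frequency by bisection on a log scale."""
    target = abs(closed_loop_gain(0.0, r_f, r_g, r_o, c)) / math.sqrt(2)
    lo, hi = 1.0, 1e12
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if abs(closed_loop_gain(mid, r_f, r_g, r_o, c)) > target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


# Assumed illustrative values.
R_O = 1e6      # gain-node resistance, ohms
C = 0.5e-12    # compensation capacitor, F
R_F = 2e3      # feedback resistor, ohms

for gain in (1, 2, 5, 10):
    # R_G sets the gain; gain = 1 corresponds to an open R_G.
    r_g = R_F / (gain - 1) if gain > 1 else float("inf")
    bw = bandwidth_3db(R_F, r_g, R_O, C)
    print(f"gain {gain:2d}: f_3dB = {bw / (2 * math.pi) / 1e6:.1f} MHz")
```

In this idealized model the bandwidth is the same at every gain setting and lies close to 1/(RF ⋅ C), since the output impedance of the input buffer is taken as zero.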

Eq. 21.6 indicates that the closed-loop bandwidth does not depend on the closed-loop gain as in the case of a conventional VOA, but is determined by the feedback resistor RF. Explaining this intuitively, the current available to charge the compensation capacitor at the gain node is determined by the value of the feedback resistor RF and not Ro, provided that Ro >> RF. So, once the bandwidth of the amplifier is set via RF, the gain can be independently varied by changing RG. The ability to control the gain independently of bandwidth constitutes a major advantage of current feedback op-amps over conventional voltage feedback op-amps.

The other major advantage of the CFOA compared to the VOA is the inherent absence of slew rate limiting. For the circuit of Fig. 21.3, assume that the input buffer is very fast and thus a change in voltage at the non-inverting input is instantaneously conveyed to the inverting input. When a step ∆VIN is applied to the non-inverting input, the buffer output current can be derived as:

$$I_{INV} = \frac{V_{IN} - V_{OUT}}{R_F} + \frac{V_{IN}}{R_G} \tag{21.7}$$

Eq. 21.7 indicates that the current available to charge/discharge the compensation capacitor is proportional to the input step regardless of its size, that is, there is no upper limit. The rate of change of the output voltage is thus:

$$\frac{dV_{OUT}}{dt} = \frac{I_{INV}}{C} \;\Rightarrow\; V_{OUT}(t) = \Delta V_{IN}\cdot\left(1 + \frac{R_F}{R_G}\right)\cdot\left(1 - e^{-t/(R_F\cdot C)}\right) \tag{21.8}$$
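A brief sketch of Eq. 21.8 shows why no slew limit appears in this idealized model: the initial output slope scales with the input step, while the settling time depends only on RF ⋅ C. The resistor and capacitor values and the function names below are assumptions made for illustration.

```python
import math


def cfoa_step_response(t, dv_in, r_f, r_g, c):
    """Eq. 21.8: output voltage at time t after an input step dv_in."""
    return dv_in * (1 + r_f / r_g) * (1 - math.exp(-t / (r_f * c)))


def initial_slope(dv_in, r_f, r_g, c):
    """dV_OUT/dt at t = 0+, i.e. I_INV / C from Eq. 21.7 with V_OUT = 0."""
    return dv_in * (1 / r_f + 1 / r_g) / c


def settling_time(r_f, c, tolerance=0.01):
    """Time for the exponential of Eq. 21.8 to settle within `tolerance`."""
    return -r_f * c * math.log(tolerance)


# Assumed illustrative values.
R_F = 2e3      # feedback resistor, ohms
R_G = 2e3      # gain-setting resistor, ohms (closed-loop gain of 2)
C = 0.5e-12    # compensation capacitor, F

for step in (0.1, 1.0):
    slope = initial_slope(step, R_F, R_G, C)
    print(f"step {step:.1f} V: initial slope = {slope / 1e6:.0f} V/us")

print(f"1% settling time = {settling_time(R_F, C) * 1e9:.2f} ns for any step size")
```

In a real CFOA the finite current capability of the input buffer and mirrors eventually bounds the slope, so the proportionality holds only within the range where the macromodel is valid.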

Eq. 21.8 indicates an exponential output transition with time constant τ = RF ⋅ C. Similar to the small-signal frequency response, the large-signal transient response is governed by RF alone, regardless of the magnitude of the closed-loop gain. The absence of slew rate limiting allows for faster settling times and eliminates slew rate-related non-linearities. In most practical bipolar realizations, Darlington-pair transistors are used in the input stage to reduce input bias currents, which makes the op-amp somewhat noisier and increases the input offset voltage. This is not necessary in CMOS realizations due to the inherently high MOSFET input impedance. However, in a closed-loop CFOA, RG should be much larger than the output impedance of the buffer. In bipolar realizations, it is fairly simple to obtain a buffer with low output resistance, but this becomes more of a problem in CMOS due to the inherently lower gain of MOSFET devices. As a result, RG typically needs to be higher in a CMOS CFOA than in a bipolar realization, and consequently, RF needs to be increased above the value required for optimum high-frequency performance. Additionally, the fact that the input buffer is not in the feedback loop imposes linearity limitations on the structure, especially if the impedance at the gain node is not very high. Regardless of these problems, current feedback op-amps exhibit excellent high-frequency characteristics and are increasingly popular in video and communications applications.13 The following sections outline the development of a novel low-output impedance CMOS buffer, which is then employed in a CMOS CFOA to reduce the minimum allowable value of RG.

CMOS Compound Device A simple PMOS source follower is shown in Fig. 21.4. The output impedance seen looking into the source of M1 is approximately Zout = 1/gm, where gm is the small signal transconductance of M1. To increase gm, the drain current of M1 could be increased, which leads to an increased power dissipation. Alternatively, the dimensions of M1 can be increased, resulting in additional parasitic capacitance and hence an inferior frequency response. Figure 21.5 shows a configuration that achieves a higher transconductance than the simple follower of Fig. 21.4 for the same bias current.11 The current of M2 is fed back to M1 through the a:1 current mirror. This configuration can be viewed as a compound transistor whose gate is the gate of M1 and whose source is the source of M2. The impedance looking into the compound source can be approximated as Zout = (gm1 – a ⋅ gm2)/(gm1 ⋅ gm2), where gm1 and gm2 represent the small signal transconductance of M1 and M2, respectively. The output impedance can be made small by setting the current mirror transfer ratio a = gm1/gm2.
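The effect of the mirror ratio on the compound device can be illustrated with the approximate expression Zout = (gm1 – a ⋅ gm2)/(gm1 ⋅ gm2) quoted above. The transconductance values below are assumed purely for illustration, and the function name is hypothetical.

```python
def compound_zout(g_m1, g_m2, a):
    """Approximate output impedance of the compound device, in ohms."""
    return (g_m1 - a * g_m2) / (g_m1 * g_m2)


# Assumed illustrative transconductances.
G_M1 = 1.0e-3   # transconductance of M1, S
G_M2 = 0.5e-3   # transconductance of M2, S

ideal_ratio = G_M1 / G_M2   # mirror ratio that nulls Zout in this approximation

for a in (0.0, 1.0, 1.8, ideal_ratio):
    print(f"a = {a:.1f}: Zout = {compound_zout(G_M1, G_M2, a):7.1f} ohm")
```

With no feedback (a = 0) the expression reduces to 1/gm2, and the impedance falls linearly as a approaches gm1/gm2. Since the first-order result is a difference of two terms, the achievable reduction in practice depends on how accurately the ratio is maintained.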

FIGURE 21.4 Simple PMOS source follower.

FIGURE 21.5 Compound MOS device.

FIGURE 21.6 Actual p-compound device implementation.

The p-compound device is practically implemented as in Fig. 21.6. In order to obtain a linear voltage transfer function from node 1 to 2, the gate-source voltages of M1 and M3 must cancel. The current mirror (M4-M2) acts as an NMOS-PMOS gate-source voltage matching circuit14 and compensates for the difference in the gate-source voltages of M1 and M3, which would normally appear as an output offset. DC analysis, assuming a square law model for the MOSFETs, shows that the output voltage exactly follows the input voltage. However, in practice, channel length modulation and body effects preclude exact cancellation.15

Buffer and CFOA Implementation The current feedback op-amp shown in Fig. 21.7 has been implemented in a single-well 0.6-µm digital CMOS process11; the corresponding layout plot is shown in Fig. 21.8. The chip has an area of 280 µm by 330 µm and a power dissipation of 12 mW. The amplifier comprises two voltage followers (input and output) connected by cascoded current mirrors to enhance the gain node impedance. A compensation capacitor (Cc = 0.5 pF) at the gain node ensures adequate phase margin and thus closed-loop stability. The voltage followers have been implemented with two compound transistors, p-type and n-type, in a push-pull arrangement. Two such compound transistors in the output stage are shown shaded in Fig. 21.7.

FIGURE 21.7 Current feedback op-amp schematic.

FIGURE 21.8 Current feedback op-amp layout plot.

The input voltage follower of the current feedback op-amp was initially tested open-loop, and measured results are summarized in Table 21.1. The load is set to 10 kΩ/10 pF, except where mentioned otherwise, 10 kΩ being a limit imposed by overall power dissipation of the chip. Intermodulation distortion was measured with two tones separated by 200 kHz. The measured output impedance of the buffer is given in Fig. 21.9. It remains below 80 Ω up to a frequency of about 60 MHz, when it enters an inductive region. A maximum impedance of 140 Ω is reached around 160 MHz. Beyond this frequency, the output impedance is dominated by parasitic capacitances. The inductive behavior is characteristic of the use of feedback to reduce output impedance, and can cause stability problems when driving capacitive loads. Small-signal analysis (summarized in Table 21.2) predicts a double zero in the output impedance.15

TABLE 21.1 Voltage Buffer Performance

| Parameter | Condition | Value |
|---|---|---|
| Power supply | | 5 V |
| Dissipation | | 5 mW |
| DC gain (no load) | | –3.3 dB |
| Bandwidth | | 140 MHz |
| Output impedance | | 75 Ω |
| Min. load resistance | | 10 kΩ |
| HD2 (Vin = 200 mVrms) | 1 MHz | –50 dB |
| | 10 MHz | –49 dB |
| | 20 MHz | –45 dB |
| IM3 (Vin = 200 mVrms) | 20 MHz, ∆f = 200 kHz | –53 dB |
| Slew rate | Load = 10 pF | +130 V/µs, –72 V/µs |
| Input-referred noise | | 10 nV/√Hz |

Note: Load = 10 kΩ/10 pF, except for slew rate measurement.

Making factor G in Table 21.2 small will reduce the output impedance, but also moves the double zero to lower frequencies and intensifies the inductive behavior. The principal tradeoff in this configuration is between output impedance magnitude and inductive behavior. In practice, the output impedance can be reduced by a factor of 3 while still maintaining good stability when driving capacitive loads. Figure 21.10 shows the measured frequency response of the buffer. Given the low power dissipation, excellent slew rates have been achieved (Table 21.1).

FIGURE 21.9 Measured buffer output impedance characteristics.

TABLE 21.2 Voltage Transfer Function and Output Impedance of Compound Device

$$Z_{out} = \frac{G}{(g_{m1} + g_{ds1} + g_{ds2})\cdot(g_{m3} + g_{ds3})\cdot(g_{m4} + g_{ds4})}$$

$$\frac{V_{out}}{V_{in}} = \frac{g_{m1}\cdot g_{m3}\cdot(g_{m4} + g_{ds4})}{(g_{m1} + g_{ds1} + g_{ds2})\cdot(g_{m3} + g_{ds3})\cdot(g_{m4} + g_{ds4}) + g_L\cdot G}$$

$$G = (g_{m1} + g_{ds1} + g_{ds2})\cdot(g_{m4} + g_{ds4} + g_{ds3}) - g_{m2}\cdot g_{m3}$$

FIGURE 21.10 Measured buffer frequency response.

After the characterization of the input buffer stage, the entire CFOA was tested to confirm the suitability of the compound transistors for the implementation of more complex building blocks. Open-loop transimpedance measurements are shown in Fig. 21.11. The bandwidth of the amplifier was measured at gain settings of 1, 2, 5, and 10 in a non-inverting configuration, and the feedback resistor was trimmed to achieve maximum bandwidth at each gain setting separately. CFOA measurements are summarized in Table 21.3, loading conditions are again 10 kΩ/10 pF.

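FIGURE 21.11 Measured CFOA open-loop transimpedance gain.

TABLE 21.3 Current Feedback Op-Amp Measurement Summary

Power supply: 5 V. Power dissipation: 12 mW.

| Gain | Bandwidth (MHz) |
|---|---|
| 1 | 117 |
| 2 | 118 |
| 5 | 113 |
| 10 | 42 |

| Frequency | Input (mV rms) | Gain | HD2 (dB) |
|---|---|---|---|
| 1 MHz | 140 | 2 | –51 |
| 1 MHz | 40 | 5 | –50 |
| 1 MHz | 10 | 10 | –49 |
| 10 MHz | 80 | 2 | –42 |
| 10 MHz | 40 | 5 | –42 |
| 10 MHz | 13 | 10 | –43 |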

Fig. 21.12 shows the measured frequency response for various gain settings. The bandwidth remains approximately constant at 110 MHz for gains of 1, 2, and 5, consistent with the expected behavior of a CFOA. The bandwidth falls to 42 MHz for a gain of 10 due to the finite output impedance of the input buffer stage, which serves as the CFOA inverting input. Figure 21.13 illustrates the step response of the CFOA driving a 10 kΩ/10 pF load at a voltage gain of 2. It can be seen that the inductive behavior of the buffers has little effect on the step response. Finally, distortion measurements were carried out for the entire CFOA for gain settings 2, 5, and 10 and are summarized in Table 21.3. HD2 levels can be further improved by employing a double-balanced topology. A distortion spectrum is shown in Fig. 21.14; the onset of HD3 is due to clipping at the test conditions.

FIGURE 21.12 Measured CFOA closed-loop frequency response.

FIGURE 21.13 Measured CFOA step response.

21.3 RF Low-Noise Amplifiers This section reviews the important performance criteria demanded of the front-end amplifier in a wireless communication receiver. The design of CMOS LNAs for front-end wireless communication receiver applications is then addressed. Section 21.4 considers the related topic of low-noise amplifiers for optical receiver front-ends.

Specifications The front-end amplifier in a wireless receiver must satisfy demanding requirements in terms of noise, gain, impedance matching, and linearity.

FIGURE 21.14 CFOA harmonic distortion measurements.

Noise Since the incoming signal is usually weak, the front-end circuits of the receiver must possess very low noise characteristics so that the original signal can be recovered. Provided that the gain of the front-end amplifier is sufficient so as to suppress noise from the subsequent stages, the receiver noise performance is determined predominantly by the front-end amplifier. Hence, the front-end amplifier should be a low-noise amplifier (LNA).

Gain The voltage gain of the LNA must be high enough to ensure that noise contributions from the following stages can be safely neglected. As an example, Fig. 21.15 shows the first three stages in a generic front-end receiver, where the gain and output-referred noise of each stage are represented by Gi and Ni (i = 1, 2, 3), respectively. The total noise at the third stage output is given by:

$$N_{out} = N_{in}G_1G_2G_3 + N_1G_2G_3 + N_2G_3 + N_3 \tag{21.9}$$

FIGURE 21.15 Three-stage building block with gain Gi and noise Ni per stage.

This output noise (Nout) can be referred to the input to derive an equivalent input noise (Neq):

$$N_{eq} = \frac{N_{out}}{\mathrm{Gain}} = \frac{N_{out}}{G_1G_2G_3} = N_{in} + \frac{N_1}{G_1} + \frac{N_2}{G_1G_2} + \frac{N_3}{G_1G_2G_3} \tag{21.10}$$
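Eqs. 21.9 and 21.10 are easily evaluated for a cascade of any length. The sketch below is one possible arrangement; the function names and the numerical gains and noise values are assumptions chosen for illustration, in arbitrary but consistent units.

```python
import math


def output_noise(n_in, gains, stage_noise):
    """Eq. 21.9: propagate the input noise and each stage's noise to the output."""
    n_out = n_in
    for gain, noise in zip(gains, stage_noise):
        n_out = n_out * gain + noise
    return n_out


def equivalent_input_noise(n_in, gains, stage_noise):
    """Eq. 21.10: refer each stage's output noise back to the input."""
    n_eq = n_in
    cumulative_gain = 1.0
    for gain, noise in zip(gains, stage_noise):
        cumulative_gain *= gain
        n_eq += noise / cumulative_gain
    return n_eq


# Assumed illustrative values.
N_IN = 0.5
STAGE_NOISE = [1.0, 4.0, 8.0]   # N1, N2, N3

for g1 in (2.0, 10.0, 50.0):
    gains = [g1, 5.0, 2.0]      # G1, G2, G3
    n_eq = equivalent_input_noise(N_IN, gains, STAGE_NOISE)
    n_out = output_noise(N_IN, gains, STAGE_NOISE)
    # Both routes to the input-referred noise must agree.
    assert math.isclose(n_eq, n_out / math.prod(gains))
    print(f"G1 = {g1:4.0f}: N_eq = {n_eq:.3f}")
```

Raising G1 shrinks every term except Nin, which is the quantitative form of the statement that the first stage dominates the noise performance.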

According to Eq. 21.10, the gain of the first stage should be high in order to reduce noise contributions from subsequent stages. However, if the gain is too high, a large input signal may saturate the subsequent stages, yielding intermodulation products which corrupt the desired signal. Thus, the first-stage gain must be chosen as a compromise between these two requirements.

Input Impedance Matching The input impedance of the LNA must be matched to the antenna impedance over the frequency range of interest, in order to transfer the maximum available power to the receiver.

Linearity Unwanted signals at frequencies fairly near the frequency band of interest may reach the LNA with signal strengths many times higher than that of the wanted signal. The LNA must be sufficiently linear to prevent these out-of-band signals from generating intermodulation products within the wanted frequency band, and thus degrading the reception of the desired signal. Since third-order mixing products are usually dominant, the linearity of the LNA is related to the “third-order intercept point” (IP3), which is defined as the input power level that results in equal power levels for the output fundamental frequency component and the third-order intermodulation components. The dynamic range of a wireless receiver is limited at the lower bound by noise and at the upper bound by non-linearity.

CMOS Common-Source LNA: Simplified Analysis Input Impedance Matching by Source Degeneration For maximum power transfer, the input impedance of the LNA must be matched to the source resistance, which is normally 50 Ω. Impedance-matching circuits consist of reactive components and therefore are (ideally) lossless and noiseless. Figure 21.16 shows the small signal equivalent circuit of a CS LNA input stage with impedance-matching circuit, where the gate-drain capacitance Cgd is assumed to have negligible effect and is thus neglected.16,17 The input impedance of this CS input stage is given by:

$$Z_{in} = j\omega(L_g + L_s) + \frac{1}{j\omega C_{gs}} + \frac{g_m}{C_{gs}}L_s \tag{21.11}$$

FIGURE 21.16 Simplified small-signal equivalent circuit of the CS stage.

Thus, for matching, the two conditions below must be satisfied:

$$\text{(i)}\;\; \omega_o^2 = \frac{1}{(L_g + L_s)C_{gs}} \qquad \text{and} \qquad \text{(ii)}\;\; \frac{g_m}{C_{gs}}L_s = R_s \tag{21.12}$$
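The two conditions of Eq. 21.12 can be solved directly for the inductors once the device parameters are known. The following sketch does so and then checks the result against Eq. 21.11. The operating frequency and the values of gm and Cgs are assumed for illustration, and the function names are hypothetical.

```python
import math


def design_matching_network(f_o, r_s, g_m, c_gs):
    """Solve Eq. 21.12 for the source and gate inductances, in henries."""
    omega_o = 2 * math.pi * f_o
    l_s = r_s * c_gs / g_m                   # condition (ii)
    l_g = 1 / (omega_o ** 2 * c_gs) - l_s    # condition (i)
    if l_g < 0:
        raise ValueError("condition (i) cannot be met with a positive L_g")
    return l_s, l_g


def input_impedance(omega, l_g, l_s, g_m, c_gs):
    """Eq. 21.11: input impedance of the CS stage with C_gd neglected."""
    return 1j * omega * (l_g + l_s) + 1 / (1j * omega * c_gs) + g_m / c_gs * l_s


# Assumed illustrative values.
F_O = 1.9e9      # operating frequency, Hz
R_S = 50.0       # source resistance, ohms
G_M = 20e-3      # transconductance, S
C_GS = 0.3e-12   # gate-source capacitance, F

l_s, l_g = design_matching_network(F_O, R_S, G_M, C_GS)
z_in = input_impedance(2 * math.pi * F_O, l_g, l_s, G_M, C_GS)

print(f"L_s = {l_s * 1e9:.2f} nH, L_g = {l_g * 1e9:.2f} nH")
print(f"Z_in at f_o = {z_in.real:.2f} + j{z_in.imag:.2e} ohm")
```

The real part of the input impedance is set by Ls alone, so Lg is free to tune out the remaining reactance at the operating frequency.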

FIGURE 21.17 Simplified noise equivalent circuit of the CS stage. $\overline{v_{Rs}^2} = 4kTR_s$; $\overline{i_d^2} = 4kT\Gamma g_{d0}$.

Noise Figure of CS Input Stage Two main noise sources exist in a CS input stage, as shown in Fig. 21.17: thermal noise from the source resistor Rs (denoted $\overline{v_{Rs}^2}$) and channel thermal noise from the input transistor (denoted $\overline{i_d^2}$). The output noise current due to $\overline{v_{Rs}^2}$ can be determined from Fig. 21.17 as:

$$\overline{i_{nout1}^2} = \frac{g_m^2\,\overline{v_{Rs}^2}}{\omega_o^2\,(g_m L_s + R_s C_{gs})^2} = \frac{g_m^2\,\overline{v_{Rs}^2}}{4\,\omega_o^2 R_s^2 C_{gs}^2} \tag{21.13}$$

while the output noise current due to $\overline{i_d^2}$ can be evaluated as:

$$i_{nout2} = \frac{i_d}{1 + \dfrac{g_m L_s}{R_s C_{gs}}} = \frac{1}{2}\,i_d \qquad \therefore\; \overline{i_{nout2}^2} = \frac{1}{4}\,\overline{i_d^2} \tag{21.14}$$

From Eqs. 21.13 and 21.14, the noise figure of the CS input stage is determined as: 2 ω o R s C gs  Ls  i nout2 - = 1 + Γ  -------------NF = 1 + ------------2 = 1 + Γ  --------------------    g L m s + Lg i nout1 2

2

(21.15)
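As a numerical cross-check of Eq. 21.15, the sketch below evaluates the noise factor in both of its forms, once from the inductor ratio and once from the device parameters, for an input that satisfies Eq. 21.12. The value Γ = 2/3 (a long-channel figure), the approximation g_d0 ≈ g_m, and the device values are assumptions made here for illustration only.

```python
import math

def noise_factor_from_inductors(ls, lg, gamma=2.0 / 3.0):
    """Last form of Eq. 21.15: F = 1 + Gamma * Ls / (Ls + Lg)."""
    return 1.0 + gamma * ls / (ls + lg)

def noise_factor_from_device(gm, cgs, rs, f0, gamma=2.0 / 3.0):
    """Middle form of Eq. 21.15: F = 1 + Gamma * w0^2 * Rs * Cgs^2 / gm."""
    w0 = 2.0 * math.pi * f0
    return 1.0 + gamma * w0 ** 2 * rs * cgs ** 2 / gm

# Hypothetical matched design: gm = 20 mS, Cgs = 0.2 pF, Rs = 50 ohm, f0 = 2 GHz
gm, cgs, rs, f0 = 20e-3, 0.2e-12, 50.0, 2e9
w0 = 2.0 * math.pi * f0
ls = rs * cgs / gm                    # Eq. 21.12 (ii)
lg = 1.0 / (w0 ** 2 * cgs) - ls       # Eq. 21.12 (i)

f_a = noise_factor_from_inductors(ls, lg)
f_b = noise_factor_from_device(gm, cgs, rs, f0)
print(f"F = {f_a:.5f} ({10 * math.log10(f_a):.3f} dB), cross-check F = {f_b:.5f}")
```

The two forms agree only when both matching conditions hold. The very small result reflects ideal, lossless inductors.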

In practice, any inductor (especially a fully integrated inductor) has an associated resistance that will contribute thermal noise, degrading the noise figure in Eq. 21.15.

Voltage Amplifier with Inductive Load

Referring to Fig. 21.15, the small signal current output is given by:

$$i_{out} = \frac{g_m v_s}{[1 - \omega^2 C_{gs}(L_g + L_s)] + j\omega (g_m L_s + R_s C_{gs})} \tag{21.16}$$

For an inductive load (L1) with a series internal resistance rL1, the output voltage is thus:

$$v_{out} = -i_{out}(r_{L1} + j\omega L_1) = \frac{-(r_{L1} + j\omega L_1)\, g_m v_s}{[1 - \omega^2 C_{gs}(L_g + L_s)] + j\omega (g_m L_s + R_s C_{gs})} \tag{21.17}$$

Assuming that the input is impedance matched, the voltage gain at the output is given by:

$$\frac{v_{out}}{v_s} = \frac{(r_{L1})^2 + (\omega_o)^2 (L_1)^2}{2\omega_o L_s\, r_{L1}} = \frac{r_{L1}}{2\omega_o L_s}\left[ 1 + \left( \frac{\omega_o L_1}{r_{L1}} \right)^2 \right] \cong \frac{1}{2}\,\omega_o \left( \frac{L_1}{L_s} \right) \left( \frac{L_1}{r_{L1}} \right) \tag{21.18}$$

CMOS Common-Source LNA: Effect of Cgd

In the analysis so far, the gate-drain capacitance (Cgd) has been assumed to be negligible. However, at very high frequencies, this component cannot be neglected. Figure 21.18 shows the modified input stage of a CS LNA including Cgd and an input ac-coupling capacitance Cin. Small signal analysis shows that the input impedance is now given by:

$$Z_{in} = \frac{g_m L_s}{C_{gs} + C_{gd}\left[ j\omega L_s g_m + \dfrac{g_m (1 - \omega^2 L_s C_{gd})}{\dfrac{1}{Z_L} + j\omega C_{gd}} \right]} \tag{21.19}$$

Equation 21.19 exhibits resonance frequencies that occur when:

$$1 - \omega^2 L_s C_{gs} = 0 \qquad \text{and} \qquad 1 - \omega^2 L_g C_{in} = 0 \tag{21.20}$$

Equation 21.19 indicates that the input impedance matching is degraded by the load ZL when Cgd is included in the analysis.

FIGURE 21.18 Noise equivalent circuit of the CS stage, including effects of Cgd.

Input Impedance with Capacitive Load

If the load ZL is purely capacitive, that is,

$$Z_L = \frac{1}{j\omega C_L} \tag{21.21}$$

then the input impedance can be easily matched to the source resistor Rs. Substituting Eq. 21.21 for ZL, the bracketed term in the denominator of Eq. 21.19 becomes:

$$d_1 = j\omega L_s g_m + \frac{g_m (1 - \omega^2 L_s C_{gd})}{j\omega (C_{gd} + C_L)} = 0 \tag{21.22}$$

under the condition that

$$1 - \omega^2 L_s (2C_{gd} + C_L) = 0 \tag{21.23}$$

The three conditions in Eqs. 21.20 and 21.23 should be met to ensure input impedance matching. However, in practice, we are unlikely to be in the situation of using a load capacitor.

Input Impedance with Inductive Load

If ZL = jωLL, the CS LNA input impedance is given by:

$$Z_{in} = \frac{g_m L_s}{C_{gs} + j\omega C_{gd}\, g_m \left[ L_s + L_L \left( \dfrac{1 - \omega^2 L_s C_{gd}}{1 - \omega^2 L_L C_{gd}} \right) \right]} \tag{21.24}$$

In order to match to a purely resistive input, the value of the reactive term in Eq. 21.24 must be negligible, which is difficult to achieve.

Cascode CS LNA

Input Matching

As outlined in the paragraph above, the gate-drain capacitance (Cgd) degrades the input impedance matching and therefore reduces the power transfer efficiency. In order to reduce the effect of Cgd, a cascoded structure can be used.18–20 Figure 21.19 shows a cascode CS LNA. Since the voltage gain from the gate to the drain of M1 is unity, the gate-drain capacitance (Cgd1) no longer sees the full input-output voltage swing, which greatly improves the input-output isolation. The input impedance can be approximated by Eq. 21.11, thus allowing a simple matching circuit to be employed.18

FIGURE 21.19 Cascode CS LNA.

Voltage Gain

Figure 21.20 shows the small-signal equivalent circuit of the cascode CS LNA. Assuming that input is fully matched to the source, the voltage gain of the amplifier is given by:

$$\frac{v_{out}}{v_s} = -\frac{1}{2} \left( \frac{g_{m1}}{(1 - \omega^2 L_s C_{gs1}) + j\omega L_s g_{m1}} \right) \left( \frac{g_{m2}}{g_{m2} + j\omega C_{gs2}} \right) \left( \frac{j\omega L_1}{1 - \omega^2 L_1 C_{gd2}} \right) \tag{21.25}$$

FIGURE 21.20 Equivalent circuit of cascode CS LNA.

At the resonant frequency, the voltage gain is given by:

$$\frac{v_{out}}{v_s}(\omega_o) = -\frac{1}{2} \left( \frac{L_1}{L_s} \right) \left( \frac{1}{1 - \omega_o^2 L_1 C_{gd2}} \right) \times \frac{1}{1 + j\omega_o \left( \dfrac{C_{gs2}}{g_{m2}} \right)} \approx -\frac{1}{2} \left( \frac{L_1}{L_s} \right) \times \frac{1}{1 + j \left( \dfrac{\omega_o}{\omega_T} \right)} \tag{21.26}$$

From Eq. 21.26, the voltage gain is dependent on the ratio of the load and source inductance values. Therefore, high gain accuracy can be achieved since this ratio is largely process independent.

Noise Figure

Figure 21.21 shows an equivalent circuit of the cascode CS LNA for noise calculations. Three main noise sources can be identified: the thermal noise voltage from Rs, and the channel thermal noise currents from M1 and M2. Assuming that the input impedance is matched to the sources, the output noise current due to $\overline{v_{RS}^2}$ can be derived as:

$$i_{out1} = \frac{1}{2j\omega_o L_s (1 - \omega_o^2 L_1 C_{gd2})} \left( \frac{g_{m2}}{g_{m2} + j\omega_o C_{gs2}} \right) v_{RS} \tag{21.27}$$

FIGURE 21.21 Noise equivalent circuit of cascode CS LNA.

The output noise current contribution due to $\overline{i_{d1}^2}$ of M1 is given by:

$$i_{out2} = \frac{1}{2 (1 - \omega_o^2 L_1 C_{gd2})} \left( \frac{g_{m2}}{g_{m2} + j\omega_o C_{gs2}} \right) i_{d1} \tag{21.28}$$

The output noise current due to $\overline{i_{d2}^2}$ of M2 is given by:

$$i_{out3} = \frac{j\omega_o C_{gs2}}{(1 - \omega_o^2 L_1 C_{gd2})(g_{m2} + j\omega_o C_{gs2})}\, i_{d2} \tag{21.29}$$

The noise figure of the cascode CS LNA can thus be derived as:

$$NF = 1 + \frac{\overline{i_{out2}^2}}{\overline{i_{out1}^2}} + \frac{\overline{i_{out3}^2}}{\overline{i_{out1}^2}} = 1 + \Gamma \left( 1 + \frac{4\omega_o^2 C_{gs2}^2}{g_{m1} g_{m2}} \right) \tag{21.30}$$

In order to improve the noise figure, the transconductance values (gm) of M1 and M2 should be increased. Since the gate-source capacitance (Cgs2) of M2 is directly proportional to the gate width, the gate width of M2 cannot be enlarged to increase the transconductance. Instead, this increase should be realized by increasing the gate bias voltage.
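The sizing argument above can be illustrated numerically with Eq. 21.30. In this Python sketch, "widening M2" is modelled to first order as doubling both gm2 and Cgs2, while "raising the gate bias" is idealised as doubling gm2 with Cgs2 unchanged. Both scaling rules, the value Γ = 2/3, and the device values are assumptions chosen for illustration.

```python
import math

def cascode_noise_factor(gm1, gm2, cgs2, f0, gamma=2.0 / 3.0):
    """Eq. 21.30: F = 1 + Gamma * (1 + 4 * w0^2 * Cgs2^2 / (gm1 * gm2))."""
    w0 = 2.0 * math.pi * f0
    return 1.0 + gamma * (1.0 + 4.0 * w0 ** 2 * cgs2 ** 2 / (gm1 * gm2))

# Hypothetical baseline: gm1 = gm2 = 20 mS, Cgs2 = 0.2 pF, f0 = 2 GHz
f_base = cascode_noise_factor(gm1=20e-3, gm2=20e-3, cgs2=0.2e-12, f0=2e9)

# Doubling the width of M2 at fixed bias: gm2 and Cgs2 both double (assumed scaling)
f_wide = cascode_noise_factor(gm1=20e-3, gm2=40e-3, cgs2=0.4e-12, f0=2e9)

# Raising the gate bias of M2: gm2 doubles, Cgs2 held constant (idealised)
f_bias = cascode_noise_factor(gm1=20e-3, gm2=40e-3, cgs2=0.2e-12, f0=2e9)

for label, f in (("baseline", f_base), ("wider M2", f_wide), ("higher bias", f_bias)):
    print(f"{label:12s} F = {f:.4f} ({10 * math.log10(f):.3f} dB)")
```

Because the M2 term scales as Cgs2²/gm2, widening the device makes that term larger, whereas increasing gm2 through bias alone makes it smaller.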

21.4 Optical Low-Noise Preamplifiers

Figure 21.22 shows a simple schematic diagram of an optical receiver, consisting of a photodetector, a preamplifier, a wide-band voltage amplifier, and a pre-detection filter. Since the front-end transimpedance preamplifier is critical in determining the overall receiver performance, it should possess a wide bandwidth so as not to distort the received signal, high gain to reject noise from subsequent stages, low noise to achieve high sensitivity, wide dynamic range, and low inter-symbol-interference (ISI).

FIGURE 21.22 Front-end optical receiver.

Front-End Noise Sources

Receiver noise is dominated by two main noise sources: the detector (PIN photodiode) noise and the amplifier noise. Figure 21.23 illustrates the noise equivalent circuit of the optical receiver.

FIGURE 21.23 Noise equivalent circuit of the front-end optical receiver.

PIN Photodiode Noise

The noise generated by a PIN photodiode arises mainly from three shot noise contributions: quantum noise Sq(f), thermally generated dark-current shot noise SD(f), and surface leakage-current shot noise SL(f). Other noise sources in a PIN photodiode, such as series resistor noise, are negligible in comparison. The quantum noise Sq(f), also called signal-dependent shot noise, is produced by the light-generating nature of photonic detection and has a spectral density Sq(f) = 2qIpd ∆f, where Ipd is the mean signal current arising from the Poisson statistics. The dark-current shot noise SD(f) arises in the photodiode bulk material. Even when there is no incident optical power, a small reverse leakage current still flows, resulting in shot noise with a spectral density SD(f) = 2qIDB ∆f, where IDB is the mean thermally generated dark current. The leakage shot noise SL(f) occurs because of surface effects around the active region, and is described by SL(f) = 2qISL ∆f, where ISL is the mean surface leakage current.

Amplifier Noise

For a simple noise analysis, the pre- and post-amplifiers in Fig. 21.22 are merged to a single amplifier with a transfer function of Av(ω). The input impedance of the amplifier is modeled as a parallel combination of Rin and Cin. If the photodiode noise is negligibly small, the amplifier noise will dominate the whole receiver noise performance, as can be inferred from Fig. 21.23. The equivalent noise current and voltage spectral densities of the amplifier are represented as Si (A²/Hz) and Sv (V²/Hz), respectively.

Resistor Noise

The thermal noise generated by a resistor is directly proportional to the absolute temperature T and is represented by a series noise voltage generator or by a shunt noise current generator21 of value:

$$\overline{v_R^2} = 4kTR\,\Delta f \qquad \text{or} \qquad \overline{i_R^2} = 4kT\,\frac{1}{R}\,\Delta f \tag{21.31}$$

where k is Boltzmann’s constant and R is the resistance.
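For a feel of the magnitudes involved, the shot-noise expressions 2qI∆f and the resistor expressions of Eq. 21.31 can be evaluated directly. The Python sketch below does this. The function names, the temperature default of 300 K, and the example current, resistance, and bandwidth are assumptions made for illustration.

```python
import math

K_B = 1.380649e-23      # Boltzmann's constant, J/K
Q_E = 1.602176634e-19   # electronic charge, C

def shot_noise_current_sq(i_mean, bandwidth):
    """Mean-square shot noise current 2*q*I*df in A^2 (form used for Sq, SD, SL)."""
    return 2.0 * Q_E * i_mean * bandwidth

def thermal_noise_voltage_sq(r, bandwidth, temp=300.0):
    """Eq. 21.31, series form: 4*k*T*R*df in V^2."""
    return 4.0 * K_B * temp * r * bandwidth

def thermal_noise_current_sq(r, bandwidth, temp=300.0):
    """Eq. 21.31, shunt form: 4*k*T*df / R in A^2."""
    return 4.0 * K_B * temp * bandwidth / r

# Hypothetical example: 1 uA mean photocurrent, 1 kohm resistor, 1 GHz bandwidth
bw = 1e9
i_shot = math.sqrt(shot_noise_current_sq(1e-6, bw))
i_res = math.sqrt(thermal_noise_current_sq(1e3, bw))
print(f"shot noise: {i_shot * 1e9:.1f} nA rms, resistor noise: {i_res * 1e9:.1f} nA rms")
```

The series and shunt forms describe the same source: dividing the mean-square voltage by R² gives the mean-square current.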

Receiver Performance Criteria

Equivalent Input Noise Current $\langle i_{eq}^2 \rangle$

The transfer function from the current input to the amplifier output voltage is given by:

$$Z_T(\omega) = \frac{V_{out}}{I_{pd}} = Z_{in} A_v(\omega) = \frac{R_{in}}{1 + j\omega R_{in}(C_{pd} + C_{in})}\, A_v(\omega) \tag{21.32}$$

where Cpd is the photodiode capacitance, and Rin and Cin are the input resistance and capacitance of the amplifier, respectively. Assuming that the photodiode noise contributions are negligible and that the amplifier noise sources are uncorrelated, the equivalent input noise current spectral density can be derived from Fig. 21.23 as:

$$S_{eq}(f) = S_i + \frac{S_v}{|Z_{in}|^2} = S_i + S_v \left[ \frac{1}{R_{in}^2} + (2\pi f)^2 (C_{pd} + C_{in})^2 \right] \tag{21.33}$$

The total mean-square noise output voltage $\langle v_{no}^2 \rangle$ is calculated by combining Eqs. 21.32 and 21.33 as follows:

$$\langle v_{no}^2 \rangle = \int_0^\infty S_{eq}(f)\, |Z_T(f)|^2 \, df \tag{21.34}$$

This total noise voltage can be referred to the input of the amplifier by dividing it by the squared dc gain $|Z_T(0)|^2$ of the receiver, to give an equivalent input mean-square noise current:

$$\begin{aligned} \langle i_{eq}^2 \rangle = \frac{\langle v_{no}^2 \rangle}{|Z_T(0)|^2} &= \left( S_i + \frac{S_v}{R_{in}^2} \right) \int_0^\infty \frac{|Z_T(f)|^2}{|Z_T(0)|^2}\, df + S_v \left[ 2\pi (C_{pd} + C_{in}) \right]^2 \int_0^\infty f^2\, \frac{|Z_T(f)|^2}{|Z_T(0)|^2}\, df \\ &= \left( S_i + \frac{S_v}{R_{in}^2} \right) I_2 B + \left[ 2\pi (C_{pd} + C_{in}) \right]^2 I_3 B^3 S_v \end{aligned} \tag{21.35}$$

where B is the operating bit-rate, and I2 (= 0.56) and I3 (= 0.083) are the Personick second and third integrals, respectively, as given in Ref. 22. According to Morikoni et al.,23 the Personick integral in Eq. 21.35 is correct only if a receiver produces a raised-cosine output response from a rectangular input signal at the cut-off bit rate above which the frequency response of the receiver is zero. However, the Personick integration method is generally preferred when comparing the noise (or sensitivity) performance of different amplifiers.

Optical Sensitivity

Optical sensitivity is defined as the minimum received optical power incident on a perfectly efficient photodiode connected to the amplifier, such that the presence of the amplifier noise corrupts on average only one bit per $10^9$ bits of incoming data. Therefore, a detected power greater than the sensitivity level guarantees system operation at the desired performance. The optical sensitivity is predicted theoretically by calculating the equivalent input noise spectral density of the receiver, and is calculated24 via Eq. 21.36:

$$S = 10 \log_{10} \left[ Q\, \frac{hc}{q\lambda} \sqrt{\langle i_{eq}^2 \rangle} \cdot \frac{1}{1\,\text{mW}} \right] \ \text{(dBm)} \tag{21.36}$$

where h is Planck’s constant, c is the speed of light, q is electronic charge, and λ (µm) is the wavelength of light in an optical fiber. $Q = \sqrt{SNR}$, where SNR represents the required signal-to-noise ratio. The value of Q should be 6 for a bit error rate (BER) of $10^{-9}$, and 7.04 for a BER of $10^{-12}$. The relation between Q and BER is given by:

$$BER = \frac{\exp(-Q^2/2)}{\sqrt{2\pi}\, Q} \tag{21.37}$$
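Equations 21.35 to 21.37 chain together naturally: amplifier noise densities give an equivalent input noise current, which gives a sensitivity for a chosen Q, which in turn corresponds to a BER. The Python sketch below follows that chain. The function names and the example amplifier values (Si, Sv, Rin, total capacitance, 1 Gb/s, 1.55 µm) are assumptions for illustration; note that the code takes the wavelength in metres rather than µm.

```python
import math

H = 6.62607015e-34      # Planck's constant, J*s
C = 2.99792458e8        # speed of light, m/s
Q_E = 1.602176634e-19   # electronic charge, C
I2, I3 = 0.56, 0.083    # Personick integrals quoted in the text

def equivalent_input_noise_sq(s_i, s_v, r_in, c_total, bit_rate):
    """Final form of Eq. 21.35; c_total stands for Cpd + Cin. Returns A^2."""
    return ((s_i + s_v / r_in ** 2) * I2 * bit_rate
            + (2.0 * math.pi * c_total) ** 2 * I3 * bit_rate ** 3 * s_v)

def sensitivity_dbm(i_eq_sq, wavelength_m, q=6.0):
    """Eq. 21.36 with the wavelength expressed in metres."""
    p_watts = q * H * C / (Q_E * wavelength_m) * math.sqrt(i_eq_sq)
    return 10.0 * math.log10(p_watts / 1e-3)

def ber_from_q(q):
    """Eq. 21.37: BER = exp(-Q^2 / 2) / (sqrt(2*pi) * Q)."""
    return math.exp(-q * q / 2.0) / (math.sqrt(2.0 * math.pi) * q)

# Hypothetical amplifier: Si = 1e-24 A^2/Hz, Sv = 1e-18 V^2/Hz, Rin = 1 kohm,
# Cpd + Cin = 0.5 pF, bit rate 1 Gb/s, wavelength 1.55 um
noise = equivalent_input_noise_sq(1e-24, 1e-18, 1e3, 0.5e-12, 1e9)
print(f"<i_eq^2> = {noise:.3e} A^2, sensitivity = {sensitivity_dbm(noise, 1.55e-6):.2f} dBm")
print(f"BER(Q=6) = {ber_from_q(6.0):.2e}, BER(Q=7.04) = {ber_from_q(7.04):.2e}")
```

Because the second term of Eq. 21.35 grows as B³, the capacitance term quickly dominates as the bit rate rises in a sketch like this one.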

Since the number of photogenerated electrons in a single bit is very large (more than $10^4$) for optoelectronic integrated receivers,25 Gaussian statistics of the above BER equation can be used to describe the detection probability in PIN photodiodes.

SNR at the Photodiode Terminal22

Among the photodiode noise sources, quantum noise is generally dominant and can be estimated as:

$$\langle i_n^2 \rangle_q = 2 q I_{pd} B_{eq} \tag{21.38}$$

where Ipd is the mean signal current and Beq is the equivalent noise bandwidth. The signal-to-noise ratio (SNR) referred to the photodiode terminal is thus given by:

$$SNR = \frac{I_{pd}^2}{\langle i_n^2 \rangle_{pd} + \dfrac{4kTB_{eq}}{R_B} + \langle i_{eq}^2 \rangle_{amp}} \tag{21.39}$$

where all noise contributions due to the amplifier are represented by the equivalent noise current $\langle i_{eq}^2 \rangle_{amp}$. It is often convenient to combine the noise contributions from the amplifier and the photodiode with the thermal noise from the bias resistor, by defining a noise figure NF:

$$\langle i_n^2 \rangle_{pd} + \frac{4kTB_{eq}}{R_B} + \langle i_{eq}^2 \rangle_{amp} = \frac{4kTB_{eq}\, NF}{R_B} \tag{21.40}$$

The SNR at the photodiode input is thus given by:

$$SNR \cong \frac{I_{pd}^2 R_B}{4kTB_{eq}\, NF} \tag{21.41}$$

Inter-Symbol Interference (ISI)

When a pulse passes through a band-limited channel, it gradually disperses. When the channel bandwidth is close to the signal bandwidth, the expanded rise and fall times of the pulse signal will cause successive pulses to overlap, deteriorating the system performance and giving higher error rates. This pulse overlapping is known as inter-symbol interference (ISI). Even with raised signal power levels, the error performance cannot be improved.26 In digital optical communication systems, sampling at the output must occur at the point of maximum signal in order to achieve the minimum error rate. The output pulse shape should therefore be chosen to maximize the pulse amplitude at the sampling instant and give a zero at other sampling points; that is, at multiples of 1/B, where B is the data-rate. Although the best choice for this purpose is the sinc-function pulse, in practice a raised-cosine spectrum pulse is used instead. This is because the sinc-function pulse is very sensitive to changes in the input pulse shape and variations in component values, and because it is impossible to generate an ideal sinc-function.

Dynamic Range

The dynamic range of an optical receiver quantifies the range of detected power levels within which correct system operation is guaranteed. Dynamic range is conventionally defined as the difference between the minimum input power (which determines sensitivity) and the maximum input power (limited by overload level). Above the overload level, the bit-error-rate (BER) rises due to the distortion of the received signal.

Transimpedance (TZ) Amplifiers

High-impedance (HZ) amplifiers are effectively open-loop architectures, and exhibit a high gain but a relatively low bandwidth. The frequency response is similar to that of an integrator, and thus HZ amplifiers require an output equalizer to extend their frequency capabilities. In contrast, the transimpedance (TZ) configuration exploits resistive negative feedback, providing an inherently wider bandwidth and eliminating the need for an output equalizer. In addition, the use of negative feedback provides a relatively low input resistance and thus the architecture is less sensitive to the photodiode parameters. In a TZ amplifier, the photodiode bias resistor RB can be omitted, since bias current is now supplied through the feedback resistor.

In addition to wider bandwidth, TZ amplifiers offer a larger dynamic range because the transimpedance gain is determined by a linear feedback resistor, and not by a non-linear open-loop amplifier as is the case for HZ amplifiers. The dynamic range of TZ amplifiers is set by the maximum voltage swing available at the amplifier output, provided no integration of the received signal occurs at the front end. Since the TZ output stage is a voltage buffer, the voltage swing at the output can be increased with high current operation. The improvement in dynamic range in comparison to the HZ architecture is approximately equal to the ratio of open-loop to closed-loop gain.27 In conclusion, the TZ configuration offers a better performance compromise than the HZ topology, and hence this architecture is preferred in optical receiver applications.

A schematic diagram of a TZ amplifier with PIN photodiode is shown in Fig. 21.24. With an open-loop, high-gain amplifier and a feedback resistor, the closed-loop transfer function of the TZ amplifier is given by:

$$Z_T(s) = \frac{-R_f}{\dfrac{1 + A}{A} + sR_f\, \dfrac{C_{in} + (1 + A) C_f}{A}} \cong \frac{-R_f}{1 + sR_f \left( \dfrac{C_{in}}{A} + C_f \right)} \tag{21.42}$$

where A is the open-loop mid-band gain of the amplifier which is assumed to be greater than unity, Rf is the feedback resistance, Cin is the total input capacitance of the amplifier including the photodiode and the parasitic capacitance, and Cf represents the stray feedback capacitance. The –3 dB bandwidth of the TZ amplifier is approximately given by:

$$f_{-3\,\text{dB}} = \frac{1 + A}{2\pi R_f C_T} \tag{21.43}$$

where CT is the total input capacitance including the photodiode capacitance. The TZ amplifier can thus have wider bandwidth by increasing the open-loop gain, although the open-loop gain cannot be increased indefinitely without stability problems.

FIGURE 21.24 Schematic diagram of a transimpedance amplifier with photodiode.

However, a tradeoff between low noise and wide bandwidth exists, since the equivalent input noise current spectral density of TZ amplifier is given by:

$$S_{eq}(f) = \frac{4kT}{R_f} + \frac{4kT}{R_B} + S_i(f) + S_v(f) \left[ \left( \frac{1}{R_f} + \frac{1}{R_B} \right)^2 + (2\pi f)^2 (C_{pd} + C_{in})^2 \right] \tag{21.44}$$

where Cin is the input capacitance of the input transistor. Increasing the value of Rf reduces the noise current in Eq. 21.44 but also shrinks the bandwidth in Eq. 21.43. This conflict can be mitigated by making A in Eq. 21.43 as large as the closed-loop stability allows.28 However, the feedback resistance Rf cannot be increased indefinitely due to the dynamic range requirements of the amplifier, since too large a feedback resistance causes the amplifier to be overloaded at high signal levels. This overloading can be avoided by using automatic gain control (AGC) circuitry, which automatically reduces the transimpedance gain in discrete steps to keep the peak output signal constant.27

The upper limit of Rf is set by the peak amplitude of the input signal. Since the dc transimpedance gain is approximately equal to the feedback resistance Rf, the output voltage is given by Ipd × Rf, where Ipd is the signal photocurrent. If this output voltage exceeds the maximum voltage swing at the output, the amplifier will be saturated and the output will be distorted, yielding bit errors. The minimum value of Rf is determined by the output signal level at which the performance of the receiver is degraded due to noise and offsets. For typical fiber-optic communication systems, the input signal power is unknown, and may vary from just above the noise floor to a value large enough to generate 0.5 mA at the detector diode.29

The TZ configuration has some disadvantages compared with HZ amplifiers. The power consumption is fairly high, partly due to the broadband operation provided by negative feedback. A propagation delay exists in the closed loop of the feedback amplifier that may reduce the phase margin of the amplifier and cause peaking in the frequency response. Additionally, any stray feedback capacitance Cf will further deteriorate the ac performance.
Among three types of TZ configuration in CMOS technology (common-source, common-drain, and common-gate TZ amplifiers), the common-gate configuration has potentially the highest bandwidth due to its inherently lower input resistance. Using a common-gate input configuration, the resulting amplifier bandwidth can be made independent of the photodiode capacitance (which is usually the limiting factor in achieving GHz preamplifier designs). Recently, a novel common-gate TZ amplifier has been demonstrated, which shows superior performance compared to various other configurations.30,31
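The Rf trade-off discussed in this section (bandwidth from Eq. 21.43, the feedback-resistor noise term of Eq. 21.44, and the overload limit Ipd × Rf) can be tabulated with a few lines of Python. The open-loop gain, capacitance, and candidate Rf values below are hypothetical; only the 0.5 mA peak photocurrent comes from the text.

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K

def tz_bandwidth(a, rf, ct):
    """Eq. 21.43: f_-3dB = (1 + A) / (2 * pi * Rf * CT), in Hz."""
    return (1.0 + a) / (2.0 * math.pi * rf * ct)

def rf_noise_density(rf, temp=300.0):
    """First term of Eq. 21.44: 4kT / Rf, in A^2/Hz."""
    return 4.0 * K_B * temp / rf

def output_swing(i_pd, rf):
    """Approximate output voltage Ipd * Rf for a dc transimpedance of Rf."""
    return i_pd * rf

# Hypothetical design point: A = 9, CT = 1 pF; peak photocurrent 0.5 mA
for rf in (1e3, 5e3, 20e3):
    bw = tz_bandwidth(9.0, rf, 1e-12)
    noise = math.sqrt(rf_noise_density(rf))
    swing = output_swing(0.5e-3, rf)
    print(f"Rf = {rf / 1e3:5.1f} kohm: BW = {bw / 1e6:7.1f} MHz, "
          f"noise = {noise * 1e12:5.2f} pA/sqrt(Hz), peak swing = {swing:5.2f} V")
```

Larger Rf lowers the noise density but narrows the bandwidth and raises the required output swing, which is the conflict that the AGC approach mentioned above is meant to relieve.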

Layout for HF Operation

Wideband high-gain amplifiers have isolation problems irrespective of the choice of technology. Coupling from output to input, from the power supply rails, and from the substrate are all possible. Therefore, careful layout is necessary, and special attention must be given to stray capacitance, both on the integrated circuit and associated with the package.32

Input/Output Isolation

For stable operation, a high level of isolation between I/O is necessary. Three main factors degrade the I/O isolation33,34: (1) capacitive coupling between I/O signal paths through the air and through the substrate; (2) feedback through the dc power supply rails and ground-line inductance; and (3) the package cavity resonance, since at the cavity resonant frequency the coupling between I/O can become very large. In order to reduce the unwanted coupling (or to provide good isolation, typically more than 60 dB) between I/O, the I/O pads should be laid out to be diagonally opposite each other on the chip with a thin ‘left-to-right’ geometry between I/O. The small input signal enters on the left-hand side of the chip, while the large output signal exits on the far right-hand side. This helps to isolate the sensitive input stages from the larger signal output stages.35,36 The use of fine line-widths and shielding are effective techniques to reduce coupling through the air. Substrate coupling can be reduced by shielding and by using a thin and low-dielectric substrate. Akazawa et al.33 suggest a structure for effective isolation: a coaxial-like signal-line for high shielding, and a very thin dielectric dc feed-line structure for low characteristic impedance.

Reduction of Feedback Through the Power Supply Rails

Careful attention should be given to layout of power supply rails for stable operation and gain flatness. Power lines are generally inductive; thus, on-chip capacitive decoupling is necessary to reduce the high-frequency power line impedance. However, a resonance between these inductive and capacitive components may occur at frequencies as low as several hundred MHz, causing a serious dip in the gain-frequency response and an upward peaking in the isolation-frequency characteristics. One way to reduce this resonance is to add a series damping resistor to the power supply line, making the Q factor of the LC resonance small. Additionally, the power supply line should be widened to reduce the characteristic impedance/inductance. In practice, if the characteristic impedance is as small as several ohms, the dip and peaking do not occur, even without resistive termination.33 Resonance also occurs between the IC pad capacitance (Cpd) and the bond-wire inductance (Lbond). This resonance frequency is typically above 2 GHz in miniature RF packages. Also in layout, the power supply rails of each IC chip stage should be split from the other stages in order to reduce the parasitic feedback (or coupling effect through wire-bonding inductance), which causes oscillation.34 This helps to minimize crosstalk through the power supply rail. The IC is powered through several pads and each pad is individually bonded to the power supply line.

I/O Pads

The bond pads on the critical signal path (e.g., input pad and output pads) should be made as small as possible to minimize the pad-to-substrate capacitance.35 A floating n-well placed underneath the pad will further reduce the pad capacitance since the well capacitance will appear in series with the pad capacitance. This floating well also prevents the pad metal from spiking into the substrate.
High-Frequency (HF) Ground

The best possible HF grounds to the sources of the driver devices (and hence the minimization of interstage crosstalk) can be obtained by separate bonding of each source pad of the driver MOSFETs to the ground plane that is very close to the chip.36 A typical bond-wire has a self-inductance of a few nH, which can cause serious peaking within the bandwidth of amplifiers or even instability. By using multiple bond-wires in parallel, the ground-line inductance can be reduced to less than 1 nH.

Flip-Chip Connection

In noisy environments, the noise-insensitive benefits of optical fibers may be lost at the receiver connection between the photodiode and the preamplifier. Therefore, proper shielding, or the integration of both components onto the same substrate, is necessary to prevent this problem. However, proper shielding is costly, while integration restricts the design to GaAs technologies. As an alternative, the flip-chip interconnection technique using solder bumps has been used.37,38 Small solder bumps minimize the parasitics due to the short interconnection lengths and avoid damage from mechanical stress. The technique also requires only relatively low bonding temperatures, which further reduces damage to the devices. Easy alignment and precise positioning of the bonding can be obtained by a self-alignment effect. Loose chip alignment is sufficient because the surface tension of the molten solder during re-flow produces precise self-alignment of the pads.34 Solder bumps are fabricated onto the photodiode junction area to reduce parasitic inductance between the photodiode and the preamplifier.
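The pad and bond-wire figures quoted in this layout discussion can be sanity-checked with the standard LC resonance formula f = 1/(2π√(LC)). The Python sketch below does so. The specific inductance and capacitance values are illustrative assumptions, and the parallel-wire estimate deliberately ignores mutual inductance between wires, so a real reduction will be somewhat smaller.

```python
import math

def lc_resonance_hz(inductance, capacitance):
    """Resonant frequency of an ideal LC pair: 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance * capacitance))

def parallel_bondwire_inductance(l_single, n_wires):
    """First-order estimate L/n for n identical wires (mutual inductance neglected)."""
    return l_single / n_wires

# Hypothetical values: 2 nH bond wire resonating with a 1 pF pad capacitance
f_res = lc_resonance_hz(2e-9, 1e-12)
print(f"pad/bond-wire resonance: {f_res / 1e9:.2f} GHz")

# Hypothetical ground connection: four 3 nH wires in parallel
l_gnd = parallel_bondwire_inductance(3e-9, 4)
print(f"ground inductance with 4 wires: {l_gnd * 1e9:.2f} nH")
```

With these example values the resonance falls above 2 GHz and four wires bring the ground inductance below 1 nH, in line with the ranges mentioned in the text.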

21.5 Fundamentals of RF Power Amplifier Design PA Requirements An important functional block in wireless communication transceivers is the power amplifier (PA). The transceiver PA takes as input the modulated signal to be transmitted, and amplifies this to the power level required to drive the antenna. Because the levels of power required to transmit the signal reliably are often fairly high, the PA is one of the major sources of power consumption in the transceiver. In many systems, power consumption may not be a major concern, as long as the signal can be transmitted with adequate power. For battery-powered systems, however, the limited amount of available energy means that the power consumed by all devices must be minimized so as to extend the transmit time.

Therefore, power efficiency is one of the most important factors when evaluating the performance of a wireless system. The basic requirement for a power amplifier is the ability to work at low supply voltages as well as high operating frequencies, and the design becomes especially difficult due to the tradeoffs between supply voltage, output power, distortion, and power efficiency that can be made. Moreover, since the PA deals with large signals, small-signal analysis methods cannot be applied directly. As a result, both the analysis and the design of PAs are challenging tasks. This section will first present a study of various configurations employed in the design of state-of-the-art non-linear RF power amplifiers. Practical considerations toward achieving full integration of PAs in CMOS technology will also be highlighted.

Power Amplifier Classification

Power amplifiers currently employed for wireless communication applications can be classified into two categories: linear power amplifiers and non-linear power amplifiers. For linear power amplifiers, the output signal is controlled by the amplitude, frequency, and phase of the input signal. Conversely, for non-linear power amplifiers, the output signal is only controlled by the frequency of input signal. Conventionally, linear power amplifiers can be classified as Class A, Class B, or Class AB. These PAs produce a magnified replica of the input signal voltage or current waveform, and are typically used where accurate reproduction of both the envelope and the phase of the signal is required. However, either poor power efficiency or large distortion prevents them from being extensively employed in wireless communications. Many applications do not require linear RF amplification. Gaussian Minimum Shift Keying (GMSK),39 the modulation scheme used in the European standard for mobile communications (GSM), is an example of constant envelope modulation. In this case, the system can make use of the greater efficiency and simplicity offered by non-linear PAs. The increased efficiency of non-linear PAs, such as Class C, Class D, and Class E, results from techniques that reduce the average collector voltage-current product (i.e., power dissipation) in the switching device. Theoretically, these switching-mode PAs have 100% power efficiency since, ideally, there is no power loss in the switching device.

Linear Power Amplifiers

Class A

The basic structure of the Class A power amplifier is shown in Fig. 21.25.40 For Class A amplification, the conduction angle of the device is 360°, that is, the transistor is in its active region for the entire input cycle. The serious shortcoming with Class A PAs is their inherently poor power efficiency, since the transistor is always dissipating power. The efficiency of a single-ended Class A PA is ideally limited to 50%. However, in practice, few designs can reach this ideal efficiency due to additional power loss in the passive components. In an inductorless configuration, the efficiency is only about 25%.41

FIGURE 21.25 Single-ended Class A power amplifier.

Class B

A PA is defined as Class B when the conduction angle for each transistor of a push-pull pair is 180° during any one cycle. Figure 21.26 shows an inductorless Class B power amplifier. Since each transistor only conducts for half of the cycle, the output suffers crossover distortion due to the finite threshold voltage of each transistor. When no signal is applied, there is no current flowing; as a result, any current through either device flows directly to the load, thereby maximizing the efficiency. The ideal efficiency can reach 78%,41 allowing this architecture to be of use in applications where linearity is not the main concern.

FIGURE 21.26 Inductorless Class B power amplifier.

Class AB The basic idea of Class AB amplification is to preserve the Class B push-pull configuration while improving the linearity by biasing each device slightly above threshold. The implementation of Class AB PAs is similar to Class B configurations. By allowing the two devices to conduct current for a short period, the output voltage waveform during the crossover period can be smoothed, which thus reduces the crossover distortion of the output signal. Nonlinear Power Amplifiers Class C A Class C power amplifier is the most popular non-linear power amplifier used in the RF band. The conduction angle is less than 180° since the switching transistor is biased on the verge of conduction. A portion of the input signal will make the transistor operate in the amplifying region, and thus the drain current of the transistor is a pulsed signal. Figures 21.27(a) and (b) show the basic configuration of a Class C power amplifier and its corresponding waveforms; clearly, the input and output voltages are not linearly related. The efficiency of an ideal Class C amplifier is 100% since at any point in time, either the voltage or the current waveforms are zero. In practice, this ideal situation cannot be achieved, and the power efficiency should be maximized by reducing the power loss in the transistor. That is, minimize the current through the transistor when the voltage across the output is high, and minimize the voltage across the output when the current flows through the device.

FIGURE 21.27

(a) Class C power amplifier, and (b) Class C waveforms.

Class D A Class D amplifier employs a pair of transistors and a tuned output circuit, where the transistors are driven to act as a two-pole switch and the output circuit is tuned to the switching frequency. The theoretical power efficiency is 100%. Figure 21.28 shows the voltage-switching configuration of a Class D amplifier. The input signals of transistors Q1 and Q2 are out of phase, and consequently when Q1 is on, Q2 is off, and vice versa. Since the load network is a tuned circuit, we can assume that it provides little impedance to the operating frequency of the voltage vd and high impedance to other harmonics. Since vd is a square wave, its Fourier expansion is given by

$$v_d(\omega t) = V_{dc}\left[\frac{1}{2} + \frac{2}{\pi}\sin(\omega t) + \frac{2}{3\pi}\sin(3\omega t) + \cdots\right] \qquad (21.45)$$

The impedance of the RLC series load at resonance is equal to RL, and thus the current is given by:

$$i_L(\omega t) = \frac{2V_{dc}}{\pi R_L}\sin(\omega t) \qquad (21.46)$$

FIGURE 21.28

Class D power amplifier.

Each of the devices carries the current during one half of the switching cycle. Therefore, the output power is given by:

$$P_o = \frac{2}{\pi^2}\,\frac{V_{dc}^2}{R_L} \qquad (21.47)$$
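Equations 21.45 to 21.47 can be checked numerically. The sketch below projects an ideal 0 to Vdc square wave onto sin(ωt) to recover the fundamental amplitude, then compares the resulting load power with the dc power drawn from the supply. The supply voltage, load resistance, and function name are illustrative assumptions, and the switches are taken to be lossless.

```python
import math

def fundamental_amplitude(v_dc, n=100_000):
    """Numerically project a 0..v_dc square wave onto sin(wt) over one period."""
    acc = 0.0
    for k in range(n):
        wt = 2 * math.pi * (k + 0.5) / n
        v_d = v_dc if wt < math.pi else 0.0   # switch node: high for half the cycle
        acc += v_d * math.sin(wt)
    return 2.0 * acc / n                      # Fourier sine coefficient b1

v_dc, r_load = 3.0, 50.0                      # illustrative values (assumptions)
b1 = fundamental_amplitude(v_dc)
p_out = b1 ** 2 / (2 * r_load)                # power of a sinusoid of peak b1 in r_load
i_avg = (b1 / r_load) / math.pi               # average of the half-sine supply current
p_dc = v_dc * i_avg

print(b1, 2 * v_dc / math.pi)                               # amplitude vs. Eq. 21.46
print(p_out, 2 * v_dc ** 2 / (math.pi ** 2 * r_load))       # power vs. Eq. 21.47
print(p_out / p_dc)                                         # ideal efficiency
```

Because the upper device carries the load current for only half of each cycle, the average supply current is the half-sine average, and the ratio of output power to dc power comes out as unity, in line with the 100% theoretical efficiency.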

Design efforts should focus on reducing the switching loss of both transistors as well as generating the input driving signals. Class E The idea behind the Class E PA is to employ non-overlapping output voltage and output current waveforms. Several criteria for optimizing the performance can be found in Ref. 42. Following these guidelines, Class E PAs have high power efficiency, simplicity, and relatively high tolerance to circuit variations.43 Since there is no power loss in the transistor as well as in the other passive components, the ideal power efficiency is 100%. Figure 21.29 shows a class E PA, and the corresponding waveforms are given in Fig. 21.30.

FIGURE 21.29

Class E power amplifier.

FIGURE 21.30

Waveforms of Class E operation.

The Class E waveforms indicate that the transistor should be completely off before the voltage across it changes, and that the device should be completely on before it starts to allow current to flow through it. Refs. 44 and 45 demonstrate practical Class E operation at RF frequencies using a GaAs process.

Practical Considerations for RF Power Amplifiers More recently, single-chip solutions for RF transceivers have become a goal for modern wireless communications due to potential savings in power, size, and cost. CMOS must clearly be the technology of choice for a single-chip transceiver due to the large amount of digital baseband processing required. However, the power amplifier design presents a bottleneck toward full integration, since CMOS power amplifiers are still not available. The requirements of low supply voltage, gigahertz-band operation, and high output power make the implementation of CMOS PAs very demanding. The proposal of “microcell” communications may lead to a relaxed demand for output power levels that can be met by designs such as that described in Ref. 46, where a CMOS Class C PA has demonstrated up to 50% power efficiency with 20 mW output power. Non-linear power amplifiers seem to be popular for modern wireless communications due to their inherent high power efficiency. Since significant power losses occur in the passive inductors as well as the switching devices, the availability of on-chip, low-loss passive inductors is important. The implementation of CMOS on-chip spiral inductors has therefore become an active research topic.47 Due to the poor spectral efficiency of a constant envelope modulation scheme, the high power efficiency benefit of non-linear power amplifiers is eliminated. A recently proposed linear transmitter using a nonlinear power amplifier may prove to be an alternative solution.48 The development of high mobility devices such as SiGe HBTs has led to the design of PAs demonstrating output power levels up to 23 dBm at 1.9 GHz with power-added efficiency of 37%.49 Practical power amplifier designs require that much attention be paid to issues of package and harmonic terminations. 
Power losses in the matching networks must be absolutely minimized, and tradeoffs between power-added efficiency and linearity are usually achieved through impedance matching. Although GaAs processes provide low-loss impedance matching structures on the semi-insulating substrate, good shielding techniques for CMOS may prove to be another alternative.

Conclusions Although linear power amplifiers provide conventional “easy-design” characteristics and linearity for modulation schemes such as π/4-DQPSK, modern wireless transceivers are more likely to employ

non-linear power amplifiers due to their much higher power efficiency. As the development of high-quality on-chip passive components makes progress, the trend toward full integration of the PA is becoming increasingly plausible. CMOS technology, with its rapid development, seems to be the most promising choice for PA integration, and vast improvements in frequency performance have been gained through device scaling. These improvements are expected to continue as silicon CMOS technologies scale further, driven by the demand for high-performance microprocessors. The further development of high mobility devices such as SiGe HBTs may finally see GaAs MOSFETs being replaced in wireless communication applications, since SiGe technology is compatible with CMOS.

21.6 Applications of High-Q Resonators in IF-Sampling Receiver Architectures Transconductance-C (gm-C) filters are currently the most popular design approach for realizing continuous-time filters in the intermediate frequency range in telecommunications systems. This section will consider the special application area of high-Q resonators for receiver architectures employing IF sampling.

IF Sampling A design approach for contemporary receiver architectures that is currently gaining popularity is IF digitization, whereby low-frequency operations such as second mixing and filtering can be performed more efficiently in the digital domain. A typical architecture is shown in Fig. 21.31. The IF signal is digitized, multiplied with the quadrature phases of a digital sinusoid, and lowpass filtered to yield the quadrature baseband signals. Since processing takes place in the digital domain, I/Q mismatch problems are eliminated. The principal issue in this approach, however, is the performance required from the A/D converter (ADC). Noise referred to the input of the ADC must be very low so that selectivity remains high. At the same time, the linearity of the ADC must be high to minimize corruption of the wanted signal through intermodulation effects. Both the above requirements should be achieved at an input bandwidth commensurate with the value of the IF frequency, and at an acceptable power budget.
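The digital part of this architecture can be sketched in a few lines. The example below assumes an IF of 10 MHz sampled at 40 MHz and uses a plain average over an integer number of carrier cycles as a stand-in for the lowpass filter; these values, the filter choice, and the function name are assumptions made for illustration only.

```python
import math

def iq_demodulate(samples, f_if, f_s):
    """Mix real IF samples with a digital quadrature LO and average (crude lowpass)."""
    n = len(samples)
    i_sum = q_sum = 0.0
    for k, x in enumerate(samples):
        phase = 2 * math.pi * f_if * k / f_s
        i_sum += x * math.cos(phase)       # in-phase branch
        q_sum -= x * math.sin(phase)       # quadrature branch
    # The factor 2 restores the amplitude halved by the mixing product
    return 2 * i_sum / n, 2 * q_sum / n

f_if, f_s = 10e6, 40e6                     # assumed IF and sampling rate
amp, phi = 0.5, math.radians(30)           # test carrier amplitude and phase
x = [amp * math.cos(2 * math.pi * f_if * k / f_s + phi) for k in range(4000)]
i_bb, q_bb = iq_demodulate(x, f_if, f_s)
print(math.hypot(i_bb, q_bb), math.degrees(math.atan2(q_bb, i_bb)))
```

Since both LO phases are generated numerically from the same phase value, they are in exact quadrature, which is the reason the text gives for the absence of I/Q mismatch.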

FIGURE 21.31

IF-sampling receiver.

Oversampling has become popular in recent years because it avoids many of the difficulties encountered with conventional methods for A/D and D/A conversion. Conventional converters are often difficult to implement in fine-line, very large-scale integration (VLSI) technology, because they require precise analog components and are very sensitive to noise and interference. In contrast, oversampling converters trade off resolution in time for resolution in amplitude, in such a way that the imprecise nature of the analog circuits can be tolerated. At the same time, they make extensive use of digital signal processing power, taking advantage of the fact that fine-line VLSI is better suited for providing fast digital circuits than for

providing precise analog circuits. Therefore, IF-digitization techniques utilizing oversampling Sigma-Delta modulators are very well suited to modern sub-micron CMOS technologies, and their potential has made them the subject of active research. Most Sigma-Delta modulators are implemented with discrete-time circuits, switched-capacitor (SC) implementations being by far the most common. This is mainly due to the ease with which monolithic SC filters can be designed, as well as the high linearity which they offer. The demand for high-speed Σ∆ oversampling ADCs, especially for converting bandpass signals, makes it necessary to look for a technique that is faster than switched-capacitor circuits. This demand has stimulated researchers to develop a method for designing continuous-time Σ∆ ADCs. Although continuous-time modulators are not easy to integrate, they possess a key advantage over their discrete-time counterparts. The sampling operation takes place inside the modulator loop, making it possible to “noise-shape” the errors introduced by sampling, and to provide a certain amount of anti-aliasing filtering at no cost. On the other hand, they are sensitive to memory effects in the DACs and are very sensitive to jitter. They must also process continuous-time signals with high linearity. In communications applications, meeting the latter requirement is complicated by the fact that the signals are located at very high frequencies. As shown in Fig. 21.32, integrated bandpass implementations of continuous-time modulators require integrated continuous-time resonators to provide the noise shaping function. The gm-C approach of realizing continuous-time resonators offers advantages of complete system integration and total design freedom. However, the design of CMOS high-Q high-linearity resonators at tens of MHz is very challenging.
Since the linearity of the modulator is limited by the linearity of the resonators utilized, the continuous-time resonator is considered to be the most demanding analog sub-block of a bandpass continuous-time Sigma-Delta modulator. Typical specifications for a gm-C resonator used to provide the noise-shaping function in a Σ∆ modulator in a mobile receiver (see Fig. 21.32) are summarized in Table 21.4.
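The trade of resolution in time for resolution in amplitude can be seen in the simplest possible modulator. The sketch below is a discrete-time, first-order, lowpass modulator with a 1-bit quantizer; it is not the continuous-time bandpass structure discussed in this section, and is used only because it shows the principle in a few lines. The function name and test input are assumptions.

```python
def first_order_sigma_delta(u, n_samples):
    """Discrete-time first-order modulator with a 1-bit quantizer (illustrative only)."""
    state = 0.0
    bits = []
    for _ in range(n_samples):
        y = 1.0 if state >= 0.0 else -1.0   # 1-bit quantizer
        bits.append(y)
        state += u - y                      # integrator accumulates the error
    return bits

u = 0.3                                     # dc input, must satisfy |u| < 1
bits = first_order_sigma_delta(u, 10_000)
print(sum(bits) / len(bits))                # average of the bitstream tracks u
```

Each output sample carries only one bit of amplitude information, yet the average over many samples converges on the input, because the integrator keeps the accumulated quantization error bounded.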

FIGURE 21.32

Continuous-time Σ∆ A/D in IF-sampling receiver.

Linear Region Transconductor Implementation The implementation of fully integrated, high-selectivity filters operating at tens to hundreds of MHz provides benefits for wireless transceiver design, including chip area economy and cost reduction. The

TABLE 21.4 Fully Integrated Continuous-Time Resonator Specifications

Resonator Specifications
Center frequency                50 MHz
Quality factor                  50
Spurious free dynamic range     >30 dB
Power dissipation               Minimal

main disadvantages of on-chip active filter implementations when compared to off-chip passives include increased power dissipation, deterioration in the available dynamic range with increasing Q, and Q and resonant frequency integrity (because of process variations, temperature drifts, and aging, automatic tuning is often unavoidable, especially in high-Q applications). The transconductor-capacitor (gm-C) technique is a popular technique for implementing high-speed continuous-time filters and is widely used in many industrial applications.52 Because gm-C filters are based on integrators built from an open-loop transconductance amplifier driving a capacitor, they are typically very fast but have limited linear dynamic range. Linearization techniques that reduce distortion levels can be used, but often lead to a compromise between speed, dynamic range, and power consumption. As an example of the tradeoffs in design, consider the transconductor shown in Fig. 21.33. This design consists of a main transconductor cell (M1, M2, M3, M4, M10, M11, and M14) with a negative resistance load (M5, M6, M7, M8, M9, M12, and M13). Transistors M1 and M2 are biased in the triode region of operation using cascode devices M3 and M4 and determine the transconductance gain of the cell. In the triode region of operation, the drain current versus terminal voltage relation can be approximated (for simple hand calculations) as ID = K[2(VGS – VT)VDS – VDS²], where K and VT are the transconductance parameter and the threshold voltage, respectively. Assuming that VDS is constant for both M1 and M2, both the differential mode and the common mode transconductance gains can be derived as GDM = GCM = 2KVDS, which can thus be tuned by varying VDS. The high value of common-mode transconductance is undesirable since it may result in regenerative feedback loops in high-order filters.
To improve the CMRR and avoid the formation of such loops, transistor M10 is used to bias the transconductor, thus transforming it from a pseudo-differential to a fully differential transconductor.53 Transistors M11 and M14 constitute a floating voltage source, thus maintaining a constant drain-source voltage for M1 and M2.
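The result GDM = GCM = 2KVDS follows directly from the hand-calculation triode model, and a short numerical check makes the cancellation of the quadratic term visible. The device parameter, bias voltages, and function names below are hypothetical values chosen so that both devices stay in the triode region.

```python
def triode_current(v_gs, v_ds, k, v_t):
    """Hand-calculation triode model: I_D = K[2(V_GS - V_T)V_DS - V_DS^2]."""
    return k * (2.0 * (v_gs - v_t) * v_ds - v_ds ** 2)

K, V_T, V_DS, V_CM = 100e-6, 0.8, 0.2, 2.0  # hypothetical device and bias values

def differential_current(v_id):
    """Current difference of M1 and M2 for a differential input v_id."""
    i1 = triode_current(V_CM + v_id / 2, V_DS, K, V_T)
    i2 = triode_current(V_CM - v_id / 2, V_DS, K, V_T)
    return i1 - i2

for v_id in (0.05, 0.1, 0.2):
    print(v_id, differential_current(v_id) / v_id, 2 * K * V_DS)

# Common-mode gain of a single device: change in I_D per change in gate bias
dv = 0.01
g_cm = (triode_current(V_CM + dv, V_DS, K, V_T) - triode_current(V_CM, V_DS, K, V_T)) / dv
print(g_cm, 2 * K * V_DS)
```

With VDS held constant the current is linear in VGS, so the ratio is independent of the input amplitude. In a real device the linearity is limited by the effects discussed next, which this simple model does not capture.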

FIGURE 21.33

Triode region transconductor.

The non-linearities in the voltage-to-current transfer of this stage are mainly due to three effects. The first is the finite impedance levels at the sources of the cascode devices, which cause a signal-dependent variation of the corresponding drain-source voltages of M1 and M2. A fast floating voltage source and large cascode transistors therefore need to be used to minimize this non-linearity. The second cause of non-linearity is the variation of carrier mobility µ of the input devices M1 and M2 with VGS – VT, which becomes more apparent when short-channel devices are used (K = µ ⋅ Cox ⋅ W/(2 ⋅ L)). A simple first-order model for transverse-field mobility degradation is given by µ = µ0/(1 + θ ⋅ (VGS – VT)), where µ0 and θ are the zero field mobility and the mobility reduction parameter, respectively. Using this model, the third-order distortion can be determined by a Maclaurin series expansion as θ²/4(1 + θ(VCM – VT)).54 This expression cannot be regarded as exact, although it is useful to obtain insight. Furthermore, it is valid only at low frequencies, where reactive effects can be ignored and the coefficients of the Maclaurin series expansion are frequency independent. At high frequencies or when very low values of distortion are predicted by the Maclaurin series method, a generalized power series method (Volterra series) must be employed.55,56 Finally, a further cause of non-linearity is mismatch between M1 and M2, which can be minimized by good layout. A detailed linearity analysis of this transconductance stage is presented in Ref. 60. To provide a load for the main transconductor cell, a similar cell implemented by p-devices is used. The gates of the linear devices M5 and M6 are now cross-coupled with the drains of the cascode devices M7 and M8. In this way, weak positive feedback is introduced.
The differential-mode output resistance can now become negative and is tuned by the VDS of M5 and M6 (M12 and M13 form a floating voltage source), while the common-mode output resistance attains a small value. When connected to the output of the main transconductor cell as shown in Fig. 21.33, the cross-coupled p-cell forms a high-ohmic load for differential signals and a low-ohmic load for common-mode signals, resulting in a controlled common-mode voltage at the output.54,57 CMRR can be increased even further using M10, as described previously. Transistor M9 is biased in the triode region of operation and is used to compensate the offset common-mode voltage at the output. The key performance parameter of an integrator is the phase shift at its unity-gain frequency. Deviations from the ideal –90° phase include phase lead due to finite dc gain and phase lag due to high-frequency parasitic poles. In the transconductor design of Fig. 21.33, dc gain is traded for phase accuracy, thus compensating the phase lag introduced by the parasitic poles. The reduction in dc gain for increased phase accuracy is not a major problem for bandpass filter applications, since phase accuracy at the center frequency is extremely important while dc gain has to be adequate to ensure that attenuation specifications are met at frequencies below the passband. From simulation results using parameters from a 0.8-µm CMOS process, with the transconductor unity gain frequency set at 50 MHz, third-order intermodulation components were observed at –78 dB with respect to the fundamental signals (two input signals at 49.9 MHz and 50.1 MHz were applied, each at 50 mVpp).
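The trade between dc gain and phase accuracy can be made concrete with a two-pole integrator model. The sketch below assumes H(s) = A0 / ((1 + s·A0/ωu)(1 + s/ωp)), where ωu is the nominal unity-gain frequency and ωp a single parasitic pole; the model, the 2.5 GHz parasitic pole, and the function name are assumptions made for illustration and are not taken from the design of Fig. 21.33.

```python
import math

def phase_error_deg(a0, f_unity, f_parasitic):
    """Deviation from -90 degrees at the nominal unity-gain frequency.

    Assumed model: H(s) = a0 / ((1 + s*a0/w_u) * (1 + s/w_p)).
    Positive result = net phase lead, negative = net phase lag.
    """
    lead = math.atan(1.0 / a0)                   # from finite dc gain
    lag = math.atan(f_unity / f_parasitic)       # from the parasitic pole
    return math.degrees(lead - lag)

f_u, f_p = 50e6, 2.5e9                           # hypothetical pole positions
for a0 in (1000.0, 100.0, f_p / f_u):
    print(f"dc gain {a0:7.1f}: phase error {phase_error_deg(a0, f_u, f_p):+.3f} deg")
```

In this model the lead and lag cancel when the dc gain equals the ratio of the parasitic pole frequency to the unity-gain frequency, so a lower dc gain than the maximum available can give a smaller phase error.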

A gm-C Bandpass Biquad Filter Implementation The implementation of on-chip high-Q resonant circuits presents a difficult challenge. Integrated passive inductors have generally poor quality factors, which limits the Q of any resonant network in which they are employed. For applications in the hundreds of MHz to a few GHz, one approach is to implement the resonant circuit using low-Q passive on-chip inductors with additional Q-enhancing circuitry. However, for lower frequencies (tens of MHz), on-chip inductors occupy a huge area and this approach is not attractive. As discussed above, an alternative method is to use active circuitry to eliminate the need for inductors. gm-C-based implementations are attractive due to their high-speed potential and good tunability. A bandpass biquadratic section based upon the transconductor of Fig. 21.33 is shown in Fig. 21.34. The transfer function of Fig. 21.34 is given by:

FIGURE 21.34

Biquad bandpass.

$$\frac{V_o}{V_i} = \frac{g_{mi} \cdot R_o}{(R_o \cdot C)^2} \cdot \frac{1 + s \cdot R_o \cdot C}{s^2 + s\,\dfrac{2 \cdot R_o + g_m^2 \cdot R_o^2 \cdot R}{R_o^2 \cdot C} + \dfrac{1 + g_m^2 \cdot R_o^2}{R_o^2 \cdot C^2}} \qquad (21.48)$$

Ro represents the total resistance at the nodes due to the finite output resistance of the transconductors. R represents the effective resistance of the linear region transistors in the transconductor (see Fig. 21.33), and is used here to introduce damping and control the Q. From Eq. 21.48, it can be shown that ωo ≈ gm/C, Q ≈ gm ⋅ Ro/(2 + Ro ⋅ R ⋅ gm²), Qmax = Q(R = 0) = (gm ⋅ Ro)/2, and Ao ≈ (gmi/gm) ⋅ Q. Thus, gm is used to set the center frequency, R is used to control the Q, and gmi controls the bandpass gain Ao. A dummy gmi is used to provide symmetry and thus better stability against process variations, temperature, and aging. One of the main problems when implementing high-Q high-frequency resonators is maintaining the stability of the center frequency ωo and the quality factor Q. This problem calls for very careful layout and the implementation of an automatic tuning system. There is also a fundamental limitation on the available dynamic range: the dynamic range (DR) of high-Q gm-C filters has been found to be inversely proportional to the filter Q.57 The maximum dynamic range is given by:

$$\mathrm{DR} = \frac{V_{max}^2}{V_{noise}^2} = \frac{V_{max}^2 \cdot C}{4 \cdot k \cdot T \cdot \xi \cdot Q} \qquad (21.49)$$

where Vmax is the maximum rms voltage across the filter capacitors, C is the total capacitance, k is Boltzmann’s constant, T is the absolute temperature, and ξ is the noise factor of the active circuitry (ξ = 1 corresponds to output noise equal to the thermal noise of a resistor of value R = 1/gm, where gm is the transconductor value used in the filter). In practice, the dynamic range achieved will be less than this maximum value due to the amplification of both noise and intermodulation components around the resonant frequency. This is a fundamental limitation, and the only solution is to design the transconductors for low noise and high linearity. The linearity performance in narrowband systems is characterized by the spurious-free dynamic range (SFDR). SFDR is defined as the signal-to-noise ratio when the power of the third-order intermodulation products equals the noise power. As shown in Ref. 60, the SFDR of the resonator in Fig. 21.34 is given by:

3 ⋅ V o, peak ⋅ C 1 -  -----------------------------SFDR = -----------------------2⁄3 2 ( k ⋅ T )  4 ⋅ ξ ⋅ IM 3, int 2

2⁄3

1 ------2 Q

(21.50)

where IM3, int is the third-order intermodulation point of the integrator used to implement the resonator. The spurious free dynamic range of the resonator thus deteriorates by 6 dB if the quality factor is doubled,

assuming that the output swing remains the same. In contrast, implementing a resonant circuit using low-Q passive on-chip inductors with additional Q-enhancing circuitry leads to a dynamic range amplified by a factor Qo, where Qo is the quality factor of the on-chip inductor itself.59 However, as stated above, for frequencies in the tens of MHz, on-chip inductors occupy a huge area and thus the Qo improvement in dynamic range is not high enough to justify the area increase. Simulation Results To confirm operation, the filter shown in Fig. 21.34 has been simulated in HSPICE using process parameters from a commercial 0.8-µm CMOS process. Figure 21.35 shows the simulated frequency and phase response of the filter for a center frequency of 50 MHz and a quality factor of 50. Figure 21.36 shows the simulated output of the filter when the input consists of two tones at 49.9 MHz and 50.1 MHz, respectively, each at 40 mVpp. At this level of input signal, the third-order intermodulation components were found to be at the same level as the noise. Thus, the predicted SFDR is about 34 dB with Q = 50. Table 21.5 summarizes the simulation results.
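The design relations derived from Eq. 21.48 can be exercised numerically. The sketch below picks a hypothetical capacitance and transconductor output resistance, sets gm from ωo ≈ gm/C for a 50 MHz center frequency, solves the approximate Q expression for the damping resistance R, and then evaluates the exact center frequency and Q from the denominator coefficients. All component values and function names are assumptions; only the target of 50 MHz and Q = 50 comes from the text.

```python
import math

def biquad_parameters(g_m, c, r_o, r):
    """Exact center frequency (Hz) and Q from the denominator of Eq. 21.48."""
    w0 = math.sqrt(1.0 + (g_m * r_o) ** 2) / (r_o * c)
    bandwidth = (2.0 * r_o + g_m ** 2 * r_o ** 2 * r) / (r_o ** 2 * c)  # equals w0 / Q
    return w0 / (2 * math.pi), w0 / bandwidth

def damping_for_q(q_target, g_m, r_o):
    """Solve Q = g_m*R_o / (2 + R_o*R*g_m^2) for R."""
    return (g_m * r_o / q_target - 2.0) / (r_o * g_m ** 2)

c, r_o = 1e-12, 1e6                    # hypothetical capacitance and output resistance
g_m = 2 * math.pi * 50e6 * c           # from w0 ~ g_m / C for a 50 MHz center frequency
r = damping_for_q(50.0, g_m, r_o)
f0, q = biquad_parameters(g_m, c, r_o, r)
print(f"R = {r:.1f} ohm, f0 = {f0 / 1e6:.3f} MHz, Q = {q:.2f}, Qmax = {g_m * r_o / 2:.1f}")

# Penalty for doubling Q at constant swing: 1/Q in Eq. 21.49, 1/Q^2 in Eq. 21.50
print(f"DR: {10 * math.log10(1 / 2):.1f} dB, SFDR: {10 * math.log10(1 / 4):.1f} dB")
```

With gm ⋅ Ro much greater than one, the exact values stay very close to the approximations, which is why the hand expressions are adequate for first-pass design.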

FIGURE 21.35

Simulated bandpass frequency response.

21.7 Log-Domain Processing Instantaneous Companding The concept of instantaneous companding is an emerging area of interest within the field of analog integrated circuit design. Currently, the main area of application for this technique is the implementation of continuous-time, fully integrated filters with wide dynamic range, high-frequency potential, and wide tunability. With the drive toward lower supply voltages and higher operating frequencies, traditional analog integrated circuit design methodologies are proving inadequate. Conventional techniques to linearize inherently non-linear devices require an overhead in terms of increased power consumption or reduced operating speed. Recently, the use of companding, originally developed for audio transmission, has been

FIGURE 21.36

Simulated two-tone intermodulation test.

TABLE 21.5 Simulation Results

Power dissipation (supply voltage = 5 V)                                                                  12.5 mW
Common-mode output offset
Center frequency
Quality factor
Output noise voltage (integrated over the band from 40 MHz to 60 MHz with Q = 50)
Output signal voltage (so that intermodulation components are at the same level as the noise, Q = 50)
Spurious free dynamic range (Q = 50)

1/ft. Thus, it would seem that the usable cut-off frequency of the basic log-domain first-order filter is limited by the actual ft of the transistors. The second pole time constant τp2 (assuming that τp1 = C/gm2) is:

$$\tau_{p2} = (C_{\mu 4} + C_{\pi 4})\left(r_{b4} + \frac{1}{g_{m3}}\right) + (C_{\mu 2} + C_{\pi 2})\left(r_{b2} + \frac{1}{g_{m1}}\right) + \frac{C_{\pi 1} + C_{cs1}}{g_{m1}} + \frac{C_{\pi 3} + C_{cs3}}{g_{m3}} + C_{\mu 1} r_{b1} + C_{\mu 4} r_{b4} \qquad (21.77)$$

This corresponds approximately to the ft of the transistors, although Eq. 21.77 shows that the collector–substrate capacitance also contributes toward limiting the maximum operating frequency. The zero time constant τz is given by:

$$\tau_z = (C_{\pi 1} + C_{\mu 1})\, r_{b1} + \frac{C_{\pi 2}}{g_{m2}} + \frac{C_{\pi 4}}{g_{m4}} + C_{\mu 4} r_{b4} \qquad (21.78)$$

This is of the same order of magnitude as the second pole. This means that the first zero and the second pole will be close together, and will compensate to a certain degree. However, in reality, there are more poles and zeros than Eq. 21.75 would suggest, and it is likely that others will also occur around the actual ft of the transistors. Noise Noise in companding and log-domain circuits is discussed in some detail in Refs. 69 to 71, and a complete treatment is beyond the scope of this discussion. For linear (non-companding) circuits, noise is generally assumed to be independent of signal level, and the signal-to-noise ratio (SNR) will increase with increasing input signal level. This is not true for log-domain systems. At small input signal levels, the noise value can be assumed approximately constant, and an increase in signal level will give an increase in

SNR. At high signal levels, the instantaneous value of noise will increase, and thus the SNR levels out at a constant value. This can be considered as an intermodulation of signal and noise power. For the Class A circuits discussed above, the peak signal level is limited by the dc bias current. In this case, the large-signal noise is found to be of the same order of magnitude as the quiescent noise level, and thus a linear approximation is generally acceptable (this is not the case for Class AB circuits).

Synthesis of Higher-Order Log-Domain Filters The state-space synthesis technique outlined above proves difficult if implementation of high-order filters is required, since it becomes difficult to define and manipulate a large set of state equations. One solution is to use the signal flow graph (SFG) synthesis method proposed by Perry and Roberts72 to simulate LC ladder filters using log-domain building blocks. The interested reader is also referred to Refs. 73 through 75, which present modular and transistor-level synthesis techniques that can be easily extended to higher-order filters.

References

1. A. Sedra and K. Smith, Microelectronic Circuits, Oxford, 1998.
2. P. R. Gray and R. G. Meyer, Analysis and Design of Analog Integrated Circuits, Wiley, 1993.
3. W. H. Gross, New high speed amplifier designs, design techniques and layout problems, in Analog Circuit Design, Ed. J. S. Huijsing, R. J. van der Plassche, W. Sansen, Kluwer Academic, 1993.
4. D. F. Bowers, The impact of new architectures on the ubiquitous operational amplifier, in Analog Circuit Design, Ed. J. S. Huijsing, R. J. van der Plassche, W. Sansen, Kluwer Academic, 1993.
5. J. Fonderie and J. H. Huijsing, Design of low-voltage bipolar opamps, in Analog Circuit Design, Ed. J. S. Huijsing, R. J. van der Plassche, and W. Sansen, Kluwer Academic, 1993.
6. M. Steyaert and W. Sansen, Opamp design towards maximum gain-bandwidth, in Analog Circuit Design, Ed. J. S. Huijsing, R. J. van der Plassche, W. Sansen, Kluwer Academic, 1993.
7. K. Bult and G. Geelen, The CMOS gain-boosting technique, in Analog Circuit Design, Ed. J. S. Huijsing, R. J. van der Plassche, W. Sansen, Kluwer Academic, 1993.
8. J. Bales, A low power, high-speed, current feedback opamp with a novel class AB high current output stage, IEEE Journal Solid-State Circuits, vol. 32, no. 9, Sep. 1997, p. 1470.
9. C. Toumazou, Analogue signal processing: the ‘current way’ of thinking, Int. Journal of High-Speed Electronics, vol. 32, no. 3-4, p. 297, 1992.
10. K. Manetakis and C. Toumazou, A new CMOS CFOA suitable for VLSI technology, Electron. Letters, vol. 32, no. 12, June 1996.
11. K. Manetakis, C. Toumazou, and C. Papavassiliou, A 120MHz, 12mW CMOS current feedback opamp, Proc. of IEEE Custom Int. Circuits Conf., p. 365, 1998.
12. D. A. Johns and K. Martin, Analog Integrated Circuit Design, Wiley, 1997.
13. C. Toumazou, J. Lidgey, and A. Payne, Emerging techniques for high-frequency BJT amplifier design: a current-mode perspective, Parchment Press for Int. Conf. on Electron. Circuits Syst., Cairo, 1994.
14. M. C. H. Cheng and C. Toumazou, 3V MOS current conveyor cell for VLSI technology, Electron. Lett., vol. 29, p. 317, 1993.
15. K. Manetakis, Intermediate frequency CMOS analogue cells for wireless communications, Ph.D. thesis, Imperial College, London, 1998.
16. R. A. Johnson et al., A 2.4GHz silicon-on-sapphire CMOS low-noise amplifier, IEEE Microwave and Guided Wave Lett., vol. 7, no. 10, pp. 350-352, Oct. 1997.
17. A. N. Karanicolas, A 2.7V 900MHz CMOS LNA and mixer, IEEE Digest of I.S.S.C.C., pp. 50-51, 1996.
18. D. K. Shaffer and T. H. Lee, A 1.5-V, 1.5-GHz CMOS low noise amplifier, IEEE J.S.S.C., vol. 32, no. 5, pp. 745-759, May 1997.

19. J. C. Rudell et al., A 1.9GHz wide-band IF double conversion CMOS integrated receiver for cordless telephone applications, Digest of IEEE I.S.S.C.C., pp. 304-305, 1997.
20. E. Abou-Allam et al., CMOS front end RF amplifier with on-chip tuning, Proc. of IEEE ISCAS96, pp. 148-151, 1996.
21. P. R. Gray and R. G. Meyer, Analysis and Design of Analogue Integrated Circuits and Systems, Chap. 11, 3rd ed., John Wiley & Sons, New York, 1993.
22. M. J. N. Sibley, Optical Communications, Chap. 4-6, Macmillan, 1995.
23. J. J. Morikuni et al., Improvements to the standard theory for photoreceiver noise, J. Lightwave Tech., vol. 12, no. 4, pp. 1174-1184, Jul. 1994.
24. A. A. Abidi, Gigahertz transresistance amplifiers in fine line NMOS, IEEE J.S.S.C., vol. SC-19, no. 6, pp. 986-994, Dec. 1984.
25. M. B. Das, J. Chen, and E. John, Designing optoelectronic integrated circuit (OEIC) receivers for high sensitivity and maximally flat frequency response, J. of Lightwave Tech., vol. 13, no. 9, pp. 1876-1884, Sep. 1995.
26. B. Sklar, Digital Communication: Fundamentals and Applications, Prentice-Hall, 1988.
27. S. D. Personick, Receiver design for optical fiber systems, IEEE Proc., vol. 65, no. 12, pp. 1670-1678, Dec. 1977.
28. J. M. Senior, Optical Fiber Communications: Principles and Practice, Chap. 8-10, PHI, 1985.
29. N. Scheinberg et al., Monolithic GaAs transimpedance amplifiers for fiber-optic receivers, IEEE J.S.S.C., vol. 26, no. 12, pp. 1834-1839, Dec. 1991.
30. C. Toumazou and S. M. Park, Wide-band low noise CMOS transimpedance amplifier for gigahertz operation, Electron. Lett., vol. 32, no. 13, pp. 1194-1196, Jun. 1996.
31. S. M. Park and C. Toumazou, Giga-hertz low noise CMOS transimpedance amplifier, Proc. IEEE ISCAS, vol. 1, pp. 209-212, June 1997.
32. D. M. Pietruszynski et al., A 50-Mbit/s CMOS monolithic optical receiver, IEEE J.S.S.C., vol. 23, no. 6, pp. 1426-1432, Dec. 1988.
33. Y. Akazawa et al., A design and packaging technique for a high-gain, gigahertz-band single-chip amplifier, IEEE J.S.S.C., vol. SC-21, no. 3, pp. 417-423, Jun. 1986.
34. N. Ishihara et al., A design technique for a high-gain, 10-GHz class-bandwidth GaAs MESFET amplifier IC module, IEEE J.S.S.C., vol. 27, no. 4, pp. 554-561, Apr. 1992.
35. M. Lee and M. A. Brooke, Design, fabrication, and test of a 125Mb/s transimpedance amplifier using MOSIS 1.2 µm standard digital CMOS process, Proc. 37th Midwest Sym., Cir. and Sys., vol. 1, pp. 155-157, Aug. 1994.
36. R. P. Jindal, Gigahertz-band high-gain low-noise AGC amplifiers in fine-line NMOS, IEEE J.S.S.C., vol. SC-22, no. 4, pp. 512-520, Aug. 1987.
37. N. Takachio et al., A 10Gb/s optical heterodyne detection experiment using a 23GHz bandwidth balanced receiver, IEEE Trans. M.T.T., vol. 38, no. 12, pp. 1900-1904, Dec. 1990.
38. K. Katsura et al., A novel flip-chip interconnection technique using solder bumps for high-speed photoreceivers, J. Lightwave Tech., vol. 8, no. 9, pp. 1323-1326, Sep. 1990.
39. K. Murota and K. Hirade, GMSK modulation for digital mobile radio telephony, IEEE Trans. Commun., vol. 29, pp. 1044-1050, 1981.
40. H. Krauss, C. W. Bostian, and F. H. Raab, Solid State Radio Engineering, New York, Wiley, 1980.
41. A. S. Sedra and K. C. Smith, Microelectronic Circuits, 4th Edition, 1998.
42. N. O. Sokal and A. D. Sokal, Class E, A new class of high efficiency tuned single-ended switching power amplifiers, IEEE Journal of Solid-State Circuits, vol. SC-10, pp. 168-176, June 1975.
43. F. H. Raab, Effects of circuit variations on the class E tuned power amplifier, IEEE Journal of Solid-State Circuits, vol. SC-13, pp. 239-247, 1978.
44. T. Sowlati, C. A. T. Salama, J. Sitch, G. Robjohn, and D. Smith, Low voltage, high efficiency class E GaAs power amplifiers for mobile communications, in IEEE GaAs IC Symp. Tech. Dig., pp. 171-174, 1994.

© 2000 by CRC Press LLC

8593-21-frame Page 47 Friday, January 28, 2000 02:19 PM

45. _________ Low voltage, high efficiency GaAs class E power amplifiers for wireless transmitters, IEEE Journal of Solid-State Circuits, vol. SC-13, no. 10, pp. 1074-1080, 1995. 46. A. Rofougaran et al., A single-chip 900 MHz spread-spectrum wireless transceiver in 1-µm CMOS. part I: architecture and transmitter design, IEEE Journal of Solid-State Circuits, vol. SC-33, no. 4, pp. 515-534. 47. J. Chang, A. A. Abidi, and M. Gaitan, Large suspended inductors on silicon and their use in a 2µm CMOS RF amplifier, IEEE Electron Device Letters, vol. 14, no. 5, May 1993. 48. T. Sowlati et al., Linearized high efficiency class E power amplifier for wireless communications, IEEE Custom Integrated Circuits Conf. Proc., pp. 201-204, 1996. 49. G. N. Henderson, M. F. OKeefe, T. E. Boless, P. Noonan, et al., SiGe bipolar junction transistors for microwave power applications, IEEE MTT-S Int. Microwave Symp. Dig., pp. 1299-1302, 1997 50. O. Shoaei and W. M. Snelgrove, A wide-range tunable 25MHz-110MHz BiCMOS continuous-time filter, Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Atlanta, 1996. 51. P.-H. Lu, C.-Y. Wu, and M.-K. Tsai, Design techniques for VHF/UHF high-Q tunable bandpass filters using simple CMOS inverter-based transresistance amplifiers, IEEE Journal of Solid-State Circuits, vol. 31, no. 5, May 1996. 52. Y. Tsividis, Integrated continuous-time filter design - an overview, IEEE Journal of Solid-State Circuits, vol. 29, no. 3, Mar. 1994. 53. F. Rezzi, A. Baschirotto, and R. Castello, A 3V 12-55MHz BiCMOS Pseudo-Differential Continuous-Time Filter, IEEE Trans. on Circuits and Systems-I, vol. 42, no. 11, Nov. 1995. 54. B. Nauta, Analog CMOS Filters for Very High Frequencies, Kluwer Academic Publishers. 55. C. Toumazou, F. Lidgey, and D. Haigh, Analogue IC Design: The Current-Mode Approach, Peter Peregrinus Ltd. for IEEE Press, 1990. 56. S. Szczepanski and R. 
Schauman, Nonlinearity-induced distortion of the transfer function shape in high-order filters, Kluwer Journal of Analog Int. Circuits and Signal Processing, vol. 3, p. 143-151, 1993. 57. S. Szczepanski, VHF fully-differential linearized CMOS transconductance element and its applications, Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), London 1994. 58. A. A. Abidi, Noise in active resonators and the available dynamic range, IEEE Trans. on Circuits and Systems-I, vol. 39, no. 4, Apr. 1992. 59. S. Pipilos and Y. Tsividis, RLC active filters with electronically tunable center frequency and quality factor, Electron. Letters, vol. 30, no. 6, Mar. 1994. 60. K. Manetakis, Intermediate frequency CMOS analogue cells for wireless communications, Ph.D. thesis, Imperial College, London, 1998. 61. K. Manetakis and C. Toumazou, A 50MHz high-Q bandpass CMOS filter, Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Hong-Kong, 1997. 62. Y. Tsividis, Externally linear time invariant systems and their application to companding signal processors, IEEE Trans. CAS-II, vol. 44(2), pp. 65-85, Feb.1997. 63. D. Frey, Log-domain filtering: an approach to current-mode filtering, IEE Proc.-G, vol. 140, pp. 406-416, 1993. 64. B. Gilbert, Translinear circuits: a proposed classification, 1975, Electron. Lett., vol. 11, no. 1, pp. 14-16, 1975. 65. P. Grey, and R. Meyer, Analysis and Design of Analog Integrated Circuits, John Wiley & Sons Inc., New York, 3rd Edition, 1993. 66. E. M. Drakakis, A. Payne, and C. Toumazou, Log-domain state-space: a systematic transistor-level approach for log-domain filtering, accepted for publication in IEEE Trans. CAS-II, 1998. 67. V. Leung, M. El-Gamal, and G. Roberts, Effects of transistor non-idealities on log-domain filters, Proc. IEEE Int. Symp. Circuits Syst., Hong-Kong, pp. 109-112, 1997. 68. D. Perry and G. Roberts, Log domain filters based on LC-ladder synthesis, Proc. 1997 IEEE Int. Symp. on Circuits and Syst. (ISCAS), pp. 311-314, Seattle, 1995.

© 2000 by CRC Press LLC

8593-21-frame Page 48 Friday, January 28, 2000 02:19 PM

69. J. Mulder, M. Kouwenhoven, and A. van Roermund, Signal × noise intermodulation in translinear filters, Electron. Lett, vol. 33(14), pp. 1205-1207. 70. M. Punzenberger and C. Enz, Noise in instantaneous companding filters, Proc. 1997 IEEE Int. Symp. Circuits Syst., Hong Kong, pp. 337-340, June 1997. 71. M. Punzenberger and C. Enz, A 1.2V low-power BiCMOS class-AB log-domain filter, IEEE J. SolidState Circuits, vol. SC-32(12), pp. 1968-1978, Dec. 1997. 72. D. Perry and G. Roberts, Log-domain filters based on LC ladder synthesis, Proc. 1995 IEEE Int. Symp. Circuits Syst., Seattle, pp. 311-314, 1995. 73. E. Drakakis, A. Payne, and C. Toumazou, Bernoulli operator: a low-level approach to log-domain processing, Electron. Lett., vol. 33(12), pp. 1008-1009, 1997. 74. F. Yang, C. Enz, and G. Ruymbeke, Design of low-power and low-voltage log-domain filters, Proc. 1996 IEEE Int. Symp. Circuits Syst., Atlanta, pp. 125-128, 1996. 75. J. Mahattanakul and C. Toumazou, Modular log-domain filters, Electron. Lett., vol. 33(12), pp. 1130-1131, 1997. 76. D. Frey, A 3.3 V electronically tuneable active filter useable to beyond 1 GHz, Proc. 1994 IEEE Int. Symp. Circuits Syst., London, pp. 493-496, 1994. 77. M. El-Gamal, V. Leung, and G. Roberts, Balanced log-domain filters for VHF applications, Proc. 1997 IEEE Int. Symp. Circuits Syst., Monterey, pp. 493-496, 1997.

© 2000 by CRC Press LLC

Wassenaar, R.F., et al. "Operational Transconductance Amplifiers." The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000.

© 2000 by CRC PRESS LLC

22 Operational Transconductance Amplifiers

R.F. Wassenaar, University of Twente
Mohammed Ismail, The Ohio State University
Chi-Hung Lin, The Ohio State University

22.1 Introduction
22.2 Noise Behavior of the OTA
22.3 An OTA with an Improved Output Swing
22.4 OTAs with High Drive Capability
     OTAs with 1:B Current Mirrors • OTA with Improved Output Stage • Adaptively Biased OTAs • Class AB OTAs
22.5 Common-Mode Feedback
22.6 Filter Applications with Low-Voltage OTAs

22.1 Introduction

In many analog or mixed analog/digital VLSI applications, an operational amplifier may not be the appropriate active element. For example, when designing integrated high-frequency active filter circuitry, a much simpler building block, called an operational transconductance amplifier (OTA), is often used (Ref. 1). This type of amplifier is characterized as a voltage-driven current source and, in its simplest form, is a combination of a differential input pair with a current mirror, as shown in Fig. 22.1. It is a simple circuit with a relatively small chip area. Further, it has a high bandwidth and a good common-mode rejection ratio up to very high frequencies. The small-signal transconductance, gm = ∂Iout/∂Vin, can be controlled by the tail current. This chapter discusses CMOS OTA design for modern VLSI applications. We begin with a brief study of noise in OTAs, followed by OTA design techniques.

22.2 Noise Behavior of the OTA

The noise behavior of the OTA is discussed here. Attention is paid to thermal and flicker noise and to the fact that, for minimal noise, some voltage gain from the input of the differential pair to the input of the current mirror is required. With such gain, the noise of the input pair dominates and the other noise sources can be neglected to first order. The noise behavior of a single MOS transistor is modeled by a single noise voltage source placed in series with the input (gate) of a “noiseless” transistor. Fig. 22.2(a) shows the simple OTA, including the noise sources, while Fig. 22.2(b) shows the same circuit with all the noise referred to the input of the stage.


FIGURE 22.1 (a) A NMOS differential pair with a PMOS current mirror forming an OTA; (b) the symbol for a single-ended OTA; and (c) the symbol for a fully differential OTA.

FIGURE 22.2 (a) The OTA with its noise voltage sources, and (b) the same circuit with the noise voltage sources referred to one of the input nodes.

All the noise sources indicated in Fig. 22.2(a) are converted to equivalent input noise voltages, which are then added to form a single noise source at the input (Fig. 22.2(b)). As a result, we obtain (assuming gm1 = gm2 and gm3 = gm4) the following mean-square input-referred noise voltage:

g m3 2 2 2 2 - ( V p3 2 + V p4 2 ) V eq = V n1 + V n2 +  ----- g m1

© 2000 by CRC Press LLC

(22.1)

The thermal noise contribution of one transistor, over a band ∆f, is written as:

$$V_{th}^2 = \frac{2}{3}\, 4kT\, \frac{1}{g_m}\, \Delta f \tag{22.2}$$

where k is the Boltzmann constant and T is the absolute temperature. The equivalent noise voltage $V_{theq}^2$ becomes:

$$V_{theq}^2 = \frac{2}{3}\, 4kT \left[\frac{1}{g_{m1}} + \frac{1}{g_{m2}} + \left(\frac{g_{m3}}{g_{m1}}\right)^2 \left(\frac{1}{g_{m3}} + \frac{1}{g_{m4}}\right)\right] \Delta f \tag{22.3}$$

and because gm1 = gm2 and gm3 = gm4, $V_{theq}^2$ becomes:

$$V_{theq}^2 = \frac{16}{3}\, kT \left[\frac{1}{g_{m1}} + \left(\frac{g_{m3}}{g_{m1}}\right)^2 \frac{1}{g_{m3}}\right] \Delta f \tag{22.4}$$

or

$$V_{theq}^2 = \frac{16}{3}\, \frac{kT}{g_{m1}} \left(1 + \frac{g_{m3}}{g_{m1}}\right) \Delta f \tag{22.5}$$

Expressing gm in physical parameters results in:

$$V_{theq}^2 = \frac{16kT}{3\sqrt{\mu_n C_{ox} (W/L)_1 I_0}} \left(1 + \sqrt{\frac{\mu_p (W/L)_3}{\mu_n (W/L)_1}}\right) \Delta f \tag{22.6}$$

In this equation, I0 represents the tail current of the differential pair. Note that the term between brackets represents the relative noise contribution of the current mirror. This term can be neglected if M3 and M4 are chosen relatively long and narrow in comparison to M1 and M2. It should be mentioned that the thermal noise of an N-MOS transistor and a P-MOS transistor with equal transconductance is the same. In most standard IC processes, a three to ten times lower 1/f noise is observed for P-MOS transistors in comparison to N-MOS transistors of the same size. However, in modern processes, the 1/f noise contribution of N- and P-MOS transistors tends to be equal. For the 1/f noise, it is usually assumed for standard IC processes that:

$$V_{1/f}^2 = \frac{K'}{W L C_{ox} f}\, \Delta f \tag{22.7}$$

where K′ is the flicker noise coefficient, in the range of $10^{-24}$ J for N-MOS transistors and in the range of $3 \times 10^{-25}$ to $10^{-25}$ J for P-MOS transistors. The equivalent 1/f input noise source of the OTA in Fig. 22.2(b) yields:

$$V_{eq(1/f)}^2 = \frac{2K'_n}{W_1 L_1 C_{ox}}\, \frac{\Delta f}{f} \left[1 + \frac{K'_p\, \mu_p}{K'_n\, \mu_n} \left(\frac{L_1}{L_3}\right)^2\right] \tag{22.8}$$

Here, the noise contributions of the current mirror (M3, M4) will be negligible if L3 is chosen much larger than L1.
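The sizing rules that follow from Eqs. (22.5) and (22.8) can be checked numerically. The Python sketch below evaluates both input-referred noise densities (per unit bandwidth) so that the excess contribution of the current mirror is visible. The function names, argument names, and the device values in the usage lines are illustrative assumptions and do not come from the text.

```python
K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K

def thermal_noise_density(gm1, gm3, temperature=300.0):
    """Input-referred thermal noise density in V^2/Hz, following Eq. (22.5)."""
    return (16.0 / 3.0) * K_BOLTZMANN * temperature / gm1 * (1.0 + gm3 / gm1)

def flicker_noise_density(f, w1, l1, l3, cox, kn, kp, mu_p_over_mu_n):
    """Input-referred 1/f noise density in V^2/Hz, following Eq. (22.8)."""
    # Bracketed term of Eq. (22.8): relative contribution of the mirror M3, M4.
    mirror_term = (kp / kn) * mu_p_over_mu_n * (l1 / l3) ** 2
    return 2.0 * kn / (w1 * l1 * cox * f) * (1.0 + mirror_term)

# Assumed example: mirror gm ten times smaller than the input-pair gm.
excess_thermal = thermal_noise_density(1e-3, 1e-4) / thermal_noise_density(1e-3, 0.0)

# Assumed example: mirror length ten times the input-pair length.
args = dict(f=1e3, w1=100e-6, l1=1e-6, l3=10e-6, cox=3.45e-3,
            kn=1e-24, mu_p_over_mu_n=0.4)
excess_flicker = (flicker_noise_density(kp=1e-25, **args)
                  / flicker_noise_density(kp=0.0, **args))
```

In both cases the excess factor is the bracketed term of the corresponding equation, so it approaches 1 as gm3/gm1 or L1/L3 shrinks.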


The offset voltage of a differential pair is lowest when its transistors operate in weak inversion, whereas the mismatch in the current transfer of a current mirror is lowest when its transistors are deep in strong inversion. Hence, the conditions that have to be fulfilled for minimal equivalent input noise and for minimal offset are easy to combine.

22.3 An OTA with an Improved Output Swing

A CMOS OTA with an output swing much higher than that of Fig. 22.1(a) is shown in Fig. 22.3. This configuration needs two extra current mirrors and consumes more current, but the output voltage “window” is about doubled when the common-mode input voltage is zero. The rules discussed earlier for sizing the input transistors and current-mirror transistors to reduce noise and offset still apply. However, there is a tradeoff. On the one hand, a high voltage gain from the input nodes to the current mirror is good for reducing noise and mismatch effects; on the other hand, too much gain also reduces the upper limit of the common-mode input voltage range and the phase margin needed to ensure stability (this will be discussed later) (Ref. 2). A voltage gain on the order of 3 to 10 is advised.

The frequency behavior of the OTA in Fig. 22.3 is rather complex because there are two different signal paths in parallel, as shown in Fig. 22.4. In this scheme, rp represents the parallel combination of the output resistance of the stage (ro6||ro8) and the load resistance (RL); therefore,

$$r_p = r_{o6} \parallel r_{o8} \parallel R_L \tag{22.9}$$

FIGURE 22.3 An OTA with an improved output window.

FIGURE 22.4 The signal paths of the OTA in Fig. 22.3.

FIGURE 22.5 (a) The Bode plot belonging to signal path 1 in the OTA in Fig. 22.3 and 22.4, (b) signal path 2, and (c) to the combined signal path.

The capacitor Cp represents the sum of the parasitic output capacitance and the load capacitance Cp = Co + CL. Using the half-circuit principle for the differential pair, a fast signal path can be seen from M2 via current mirror M7, M8 to the output. This signal path contributes an extra high-frequency pole. The other signal path leads from transistor M1 via both current mirrors M3, M4 and M5, M6 to the output. In this path, two extra poles are added. The transfer of both signal paths and their combination are shown in the plots in Fig. 22.5, assuming equal pole positions of all three current mirrors. Note that the first (dominant) pole (ω1) is determined by rp and Cp.

$$\omega_1 = \frac{1}{r_p C_p} \tag{22.10}$$

The second pole (ω2) is determined by the transconductance of M3 and the sum of the gate-source capacitance of M3 and M4. If M3 and M4 are equal, the second pole is located at:

$$\omega_2 = \frac{g_{m3}}{2C_{gs3}} \tag{22.11}$$

The unity-gain corner frequency ωT of the loaded OTA is at:

$$\omega_T = \frac{g_{m1}}{C_p} \tag{22.12}$$

Therefore, the ratio ω2/ωT is:

$$\frac{\omega_2}{\omega_T} = \frac{g_{m3}}{g_{m1}}\, \frac{C_p}{2C_{gs3}} \tag{22.13}$$
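To make the pole relations of Eqs. (22.10) to (22.13) concrete, the following Python sketch computes the three corner frequencies, the ratio ω2/ωT, and the extra phase lag that a single real pole at ω2 adds at ωT. Treating the mirror as one real pole is a simplifying assumption (the OTA of Fig. 22.3 has two parallel signal paths), and the function name and numeric values are illustrative only.

```python
import math

def ota_frequency_figures(gm1, gm3, cgs3, rp, cp):
    """Corner frequencies (rad/s) of the loaded OTA from Eqs. (22.10)-(22.13)."""
    w1 = 1.0 / (rp * cp)        # dominant pole, Eq. (22.10)
    w2 = gm3 / (2.0 * cgs3)     # current-mirror pole, Eq. (22.11)
    wt = gm1 / cp               # unity-gain frequency, Eq. (22.12)
    ratio = w2 / wt             # Eq. (22.13)
    # Assumed single-pole model: phase lag added at wt by the pole at w2.
    excess_phase_deg = math.degrees(math.atan(wt / w2))
    return {"w1": w1, "w2": w2, "wt": wt, "ratio": ratio,
            "excess_phase_deg": excess_phase_deg}

# Assumed example values: gm1 = 1 mS, gm3 = 0.2 mS, Cgs3 = 50 fF,
# rp = 1 Mohm, Cp = 1 pF.
figures = ota_frequency_figures(gm1=1e-3, gm3=2e-4, cgs3=50e-15, rp=1e6, cp=1e-12)
```

Under this model, a larger gain gm1/gm3 lowers the ratio and therefore increases the deviation from the ideal 90° integrator phase at ωT.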

When the OTA is used for high-frequency filter design, an integrator behavior is required, that is, a constant 90° phase at least at frequencies around ωT. Therefore, a high value of the ratio ω2/ωT is needed in order to have as little influence as possible from the second pole. It is obvious from Eq. 22.13 that the low-frequency voltage gain from the input nodes of the circuit to the input of the current mirrors (= gm1/gm3) must not be chosen too high. As mentioned, this is in contrast to the requirements for minimum noise and offset.

Sometimes, OTAs are used as unity-gain voltage buffers; for example, in switched-capacitor filters. In this case, the emphasis is more on obtaining a high open-loop voltage gain, an improved output window, and a good capability to drive capacitive loads (or small resistors) efficiently; the integrator behavior is of less importance. To increase the unloaded voltage gain, cascode transistors can be added in the output stage. This greatly increases the output impedance of the OTA and hardly decreases the phase margin. The penalty that has to be paid is an additional pole in the signal path and some reduction of the maximum possible output swing. This reduction can be very small if the cascode transistors are biased in weak inversion. The open-loop voltage gain can be on the order of 40 to 60 dB. A possible realization of such a configuration is shown in Fig. 22.6 (Ref. 3).

FIGURE 22.6 An OTA with improved output impedance.

22.4 OTAs with High Drive Capability

For driving capacitive loads (or small resistors), a large available output current is necessary. In the OTAs shown so far, the amount of output current available is equal to twice the quiescent current (i.e., the tail current I0). In some situations, this current can be too small. There are several ways to increase the available current efficiently. Four design principles will be discussed here:

1. Increasing the quiescent current by using current mirrors with a current transfer ratio greater than 1
2. Using a two-transistor level structure to drive the output transistors
3. Adaptive biasing techniques
4. Class AB techniques

OTAs with 1:B Current Mirrors

One way to increase the available output current is to increase the transfer ratio of the current mirrors CM1 and CM2 by a factor B, as indicated in Fig. 22.7 (Ref. 4). The available output current and the overall transconductance increase by the same factor. Unfortunately, the –3 dB frequency of the CM1 and CM2 current mirrors is reduced by a factor (B + 1)/2 due to the larger gate-source capacitance of the mirror output transistors. Moreover, ωT increases and ω2 decreases, so the ratio ω2/ωT deteriorates strongly. The available output current, though, is B times the tail current.

FIGURE 22.7 An OTA with improved load current using 1:B current mirrors.

It is also possible to increase the current transfer ratio of current mirror CM3 instead of CM1. A better current efficiency then results, but at the expense of more asymmetry in the two signal paths. Although the maximum available output current is B times the tail current in both situations, the ratio between the maximum available current and the quiescent current of the output stage remains equal to two, just as in the OTAs discussed previously.
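The effect of the mirror ratio B on the frequency figures can be sketched as follows. The Python function below simply applies the two scalings stated above (transconductance times B, mirror –3 dB frequency divided by (B + 1)/2) to a given ωT and ω2. The function name and the sample values are assumptions for illustration.

```python
def scale_with_mirror_ratio(wt, w2, b):
    """Scale wT and w2 (rad/s) of the OTA for 1:B current mirrors CM1, CM2."""
    wt_b = b * wt                  # overall transconductance grows by B
    w2_b = w2 * 2.0 / (b + 1.0)    # mirror -3 dB frequency drops by (B + 1)/2
    return wt_b, w2_b, w2_b / wt_b

# Assumed starting point: wT = 1e9 rad/s, w2 = 2e9 rad/s (ratio 2).
wt_3, w2_3, ratio_3 = scale_with_mirror_ratio(1e9, 2e9, b=3.0)
```

With these scalings the ratio ω2/ωT falls by a factor B(B + 1)/2, which is 6 for B = 3.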

OTA with Improved Output Stage

Another way of increasing the maximum available output current is illustrated in Fig. 22.8 (Ref. 6). It improves upon the factor-two relationship between quiescent and maximum available current. Assuming equal K factors for all transistors shown in the circuit, the effective gate-source voltage of transistor M11 (= VGS11 – VT11) equals that of transistor M1 (= VGS1 – VT1), since they carry the same current, provided that transistors M1, M4, and M6 are in saturation. Because the current drawn through transistor M9 is equal to the current in transistor M2, their effective source-gate voltages must also be equal, assuming equal K factors for M2 and M9. Since the sum of the effective gate-source voltages of transistors M11 and M12, and also of M9 and M10, is fixed and equal to VB, a situation exists that is equivalent to the two-transistor level structure described in Ref. 5.

FIGURE 22.8 An OTA with an improved ratio between the maximum available current and the quiescent current of the output stage.


The ratio between the maximum available output current and the quiescent current of the output stage can be chosen by the designer. It is equal to $(V_B / (V_B - V_{GS0}))^2$, where VGS0 is the quiescent gate-source voltage of transistor M11. If the OTA is used in an overdrive situation, $|V_{in}| > \sqrt{2I_0/K}$, then either M6 or M5 will be cut off, while the other transistor carries its maximum current. As a result, one of the output transistors (M10 or M12) carries its maximum current, while the other transistor is in a low-current standby situation. The maximum current that one of the output transistors carries is therefore proportional to $V_B^2$. With the high-ohmic resistor R (indicated in Fig. 22.8 with dotted lines), this maximum current corresponds to either $(V_P - V_{SS} - V_{TN})^2$ or $(V_{DD} - V_Q - V_{TP})^2$, because in that situation no current flows through the resistor. Hence, with the extra resistor, it becomes possible to increase the maximum current in overdrive situations and therefore reduce the slewing time. Because resistor R is chosen to be high, it does not disturb the behavior of the circuit discussed previously. In practice, resistor R is replaced by transistor MR working in the triode region, as shown in Fig. 22.9(a). Figure 22.9(b) shows the circuit that was used in Ref. 5 for biasing the gates of transistors M0, M9, and M11. It is much like so-called “replica biasing.” The current in the circuit is strongly determined by the voltage across R (and its value) and is therefore very sensitive to variations in the supply voltage.
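The design choice described above, the ratio of maximum to quiescent output current, can be evaluated with a few lines of Python. The sketch assumes square-law devices and interprets both VB and VGS0 as effective (threshold-subtracted) voltages; the function names and the sample numbers are hypothetical.

```python
import math

def max_to_quiescent_ratio(vb, vgs0):
    """Ratio (VB / (VB - VGS0))^2 between maximum and quiescent output current.

    Assumption: vb and vgs0 are effective voltages with 0 <= vgs0 < vb.
    """
    if not 0.0 <= vgs0 < vb:
        raise ValueError("expected 0 <= vgs0 < vb")
    return (vb / (vb - vgs0)) ** 2

def overdrive_threshold(i0, k):
    """Differential input voltage sqrt(2*I0/K) beyond which one side cuts off."""
    return math.sqrt(2.0 * i0 / k)

ratio_half = max_to_quiescent_ratio(vb=1.0, vgs0=0.5)            # VGS0 = VB/2
ratio_three_quarters = max_to_quiescent_ratio(vb=1.0, vgs0=0.75)
v_overdrive = overdrive_threshold(i0=100e-6, k=200e-6)
```

Moving VGS0 closer to VB lowers the standby current of the output transistor and so raises the ratio, which is the degree of freedom the text attributes to the designer.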

FIGURE 22.9 (a) The complete OTA and (b) its bias stage.

Adaptively Biased OTAs

Another combination of high available output current with low standby current can be realized by making the tail current of the differential input pair signal dependent. Figure 22.10 shows the basic idea of such an OTA with adaptive biasing (Ref. 7). The tail current I0 of the differential pair is the sum of a fixed value IR and an additional current equal to the absolute value of the difference between the drain currents multiplied by the current feedback factor B (I0 = IR + B|I1 – I2|). Therefore, with zero differential input voltage, only a low bias current IR flows through the input pair. A differential input voltage, Vind, will cause a difference in the drain currents, which will increase the tail current. This, in turn, gives rise to a greater difference in the drain currents, and so on. This is a kind of positive feedback that can bring the differential input pair from weak inversion into strong inversion, depending on the input voltage and the chosen current feedback factor B. Normally, when Vind = Vin+ – Vin– is small, the input transistors are in weak inversion. The differential output current (I1 – I2) of a differential pair operating in weak inversion equals the tail current times tanh(qVind/(2AkT)). This leads to the following equation:

$$(I_1 - I_2) = \left(I_R + B\,|I_1 - I_2|\right) \tanh\left(\frac{qV_{ind}}{2AkT}\right) \tag{22.14}$$

or

$$(I_1 - I_2) = I_R\, \frac{\tanh\left(\dfrac{qV_{ind}}{2AkT}\right)}{1 - B \tanh\left(\dfrac{qV_{ind}}{2AkT}\right)} \tag{22.15}$$

and because Iout = (I1 – I2):

$$I_{out} = I_R\, \frac{\tanh\left(\dfrac{qV_{ind}}{2AkT}\right)}{1 - B \tanh\left(\dfrac{qV_{ind}}{2AkT}\right)} \tag{22.16}$$

FIGURE 22.10 An OTA with an input dependent tail current.

However, in the case of large currents, this expression will no longer be valid since M1 – M2 will leave the weak-inversion domain and enter the strong-inversion region. If that is the case, the output current becomes:

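$$I_{out} = \begin{cases} \dfrac{K}{2} V_{ind} \sqrt{\dfrac{4I_R}{K} - (1 - B^2) V_{ind}^2} + B\, \dfrac{K}{2} V_{ind}^2 & \text{for } V_{ind} > 0 \\[2ex] \dfrac{K}{2} V_{ind} \sqrt{\dfrac{4I_R}{K} - (1 - B^2) V_{ind}^2} - B\, \dfrac{K}{2} V_{ind}^2 & \text{for } V_{ind} < 0 \end{cases} \tag{22.17}$$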
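As a consistency check on Eqs. (22.16) and (22.17), the Python sketch below evaluates both expressions and verifies that each one satisfies its implicit definition, namely the pair characteristic evaluated at the tail current I0 = IR + B|I1 – I2|. The square-law pair expression, the default slope factor A = 1.5, and all numeric values are assumptions made for illustration; Eq. (22.16) is used as written, for Vind ≥ 0.

```python
import math

Q_ELECTRON = 1.602176634e-19   # elementary charge in C
K_BOLTZMANN = 1.380649e-23     # Boltzmann constant in J/K

def iout_weak_inversion(v_ind, i_r, b, a_slope=1.5, temperature=300.0):
    """Eq. (22.16) for v_ind >= 0; diverges as b * tanh(...) approaches 1."""
    t = math.tanh(Q_ELECTRON * v_ind / (2.0 * a_slope * K_BOLTZMANN * temperature))
    return i_r * t / (1.0 - b * t)

def iout_strong_inversion(v_ind, i_r, b, k_factor):
    """Eq. (22.17); the sign of the B term follows the sign of v_ind."""
    root = math.sqrt(4.0 * i_r / k_factor - (1.0 - b * b) * v_ind ** 2)
    sign = 1.0 if v_ind >= 0.0 else -1.0
    return 0.5 * k_factor * v_ind * (root + sign * b * v_ind)

def square_law_pair(v_ind, i_tail, k_factor):
    """I1 - I2 of a square-law pair (assumed model) for a given tail current."""
    return 0.5 * k_factor * v_ind * math.sqrt(4.0 * i_tail / k_factor - v_ind ** 2)

# Assumed values: K = 200 uA/V^2, IR = 10 uA, B = 2, Vind = 0.1 V.
d = iout_strong_inversion(0.1, 10e-6, 2.0, 200e-6)
d_check = square_law_pair(0.1, 10e-6 + 2.0 * abs(d), 200e-6)
```

Here `d_check` reproduces `d`, which confirms that Eq. (22.17) is the closed-form solution of the feedback relation under the square-law assumption.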

In order to keep some control over the output current, negative overall feedback must be applied, which is usually the case. For example, when an OTA is used as a unity-gain buffer with a load CL (see Fig. 22.11) and a positive input step is applied, the output current increases dramatically due to the positive feedback action described previously and, as a result, the output voltage increases. This leads to a decrease of the differential input voltage Vind (Vind = Vs – Vout). The result is a very fast settling of the output voltage, which is the desired behavior.

FIGURE 22.11 An OTA used as a unity gain buffer.

In order to realize the current |I1 – I2|, two current-subtracter circuits can be combined (see Fig. 22.12). If the current I2 is larger than the current I1, the output of current-subtracter circuit 1 (Iout1) will carry a current; otherwise, the output current will be zero. The opposite situation is found for the output current of subtracter circuit 2 because of the interchange of their input currents (Iout2 = B(I1 – I2)). Consequently, either Iout1 or Iout2 will draw a current B|I1 – I2| and the other current will be zero. It is for this reason that the upper current mirrors (in Fig. 22.13) have two extra outputs to supply the currents for the circuit in Fig. 22.12. A practical realization of the adaptive biasing OTA is shown in Fig. 22.13. In order to avoid unwanted, relatively high standby currents due to transistor mismatches, the transfer ratio of the current mirrors (M12, M13) and (M19, M18) can be chosen somewhat larger than 1. This ensures an inactive region of the input voltage range in which the feedback loop is deactivated.
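The behavior of the two current subtracters can be summarized in an idealized Python model. The `mirror_ratio` parameter is a hypothetical way of representing the "somewhat larger than 1" transfer ratio mentioned above; how that ratio enters the real circuit of Fig. 22.13 is not specified in the text, so this is only one plausible reading. Function names are likewise assumptions.

```python
def subtracter_outputs(i1, i2, b, mirror_ratio=1.0):
    """Idealized outputs (Iout1, Iout2) of the two current subtracters.

    Each output is clipped at zero because a subtracter can only sink current.
    mirror_ratio > 1 (assumed model) creates an inactive zone around I1 = I2.
    """
    iout1 = b * max(i2 - mirror_ratio * i1, 0.0)   # active when I2 > I1
    iout2 = b * max(i1 - mirror_ratio * i2, 0.0)   # active when I1 > I2
    return iout1, iout2

def adaptive_tail_current(i1, i2, i_r, b, mirror_ratio=1.0):
    """Tail current I0 = IR + B*|I1 - I2| built from the two subtracter outputs."""
    return i_r + sum(subtracter_outputs(i1, i2, b, mirror_ratio))

# Arbitrary current units: only one subtracter output is nonzero at a time.
outputs = subtracter_outputs(3.0, 1.0, b=2.0)
tail = adaptive_tail_current(3.0, 1.0, i_r=5.0, b=2.0)
# With a 10% larger mirror ratio, a 5% imbalance no longer activates the loop.
dead_zone = subtracter_outputs(1.0, 1.05, b=2.0, mirror_ratio=1.1)
```

With `mirror_ratio=1.0` the sum of the two outputs equals B|I1 – I2| exactly; with a ratio above 1, small mismatch-induced differences produce no extra tail current.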

FIGURE 22.12 A combination of two current subtracters for realizing the adaptive biasing current for the circuit in Fig. 22.10.

FIGURE 22.13 A practical realization of an OTA with adaptive biasing of its tail current.


FIGURE 22.14 An OTA using a minimum selector for adapting the tail current.

Another example of an adaptive tail current circuit is shown in Fig. 22.14 (Ref. 8). It has a normal OTA structure except that the input pair is duplicated and the tail current transistor is used in a feedback loop. This feedback loop includes the inner differential pair and tail current transistor M0 as well as a minimum current selector, the current source IU, transistor MR, and a current sink IL. The minimum current selector (Ref. 9) delivers an output current equal to the lowest of its input currents (I′out = Min(I′1, I′2)). The feedback loop ensures that the output current of the minimum current selector is equal to the difference in currents between the upper and lower current sources. Assume that the upper current source carries a current 2IB and the lower current source carries IB; then the feedback loop will bias the tail current in such a way that either I′1 or I′2 becomes equal to IB. For positive values of Vind, that will be I′2. It should be realized that at Vind = 0, all four input transistors are biased at the same gate-source voltage (VGS0), corresponding to a drain current IB. In the case of positive input voltages, the gate-source voltage of M2/M′2 will not change. Therefore, all the input voltage will be added to the bias voltage of M1/M′1, that is,

$$V_{GS1} = V_{GS0} + V_{ind} \tag{22.18}$$

Figure 22.15 shows the ID vs. VGS characteristic for both transistors M1/M′1 and M2/M′2. Accordingly, the relationship between (I1 – I2) and Vind (for Vind > 0) follows the right side of the ID – VGS curve of M1, starting from the standby point (VGS0, IB), as indicated by the solid curve in Fig. 22.15. A similar view can be taken for negative values of the input voltage Vind, resulting in an equal (I1 – I2) vs. Vind curve rotated 180°. The result is shown in Fig. 22.16.

FIGURE 22.15 The ID vs. VGS characteristic for transistors M1/M′1 and M2/M′2, showing their standby point (VGS0, IB).

FIGURE 22.16 I1 – I2 vs. Vind.

Note that this input stage has a relationship between (I1 – I2) and Vind that is different from that of a simple differential input stage. As Vind increases, the slope increases and, to a first-order approximation, there is no limit to the maximum value of (I1 – I2). Note that there is an additional MOS transistor MR in the circuit in Fig. 22.14 to fix the output voltage of the minimum current selector circuit. The lower current source IL is necessary to be able to discharge the gate-source capacitor C of M0 (indicated in Fig. 22.14 with dotted lines). The OTA in Fig. 22.14 is simpler than that in Fig. 22.13. However, its bandwidth is lower due to the high impedance of node P in the feedback loop.
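The transfer characteristic described above can be reproduced with a short Python sketch. It assumes ideal square-law devices with drain current (K/2)(VGS – VT)² and an ideal feedback loop that holds the smaller drain current at IB, so that Eq. (22.18) applies directly. The function name and the numbers are illustrative assumptions.

```python
import math

def current_difference(v_ind, i_b, k_factor):
    """I1 - I2 of the minimum-selector input stage under a square-law model.

    For v_ind > 0 the loop keeps I2 = i_b and VGS1 = VGS0 + v_ind (Eq. 22.18);
    negative inputs give the same curve rotated by 180 degrees.
    """
    v_eff0 = math.sqrt(2.0 * i_b / k_factor)   # VGS0 - VT at the standby point
    magnitude = 0.5 * k_factor * (v_eff0 + abs(v_ind)) ** 2 - i_b
    return math.copysign(magnitude, v_ind)

# Assumed values: IB = 10 uA, K = 200 uA/V^2.
d1 = current_difference(0.1, 10e-6, 200e-6)
d2 = current_difference(0.2, 10e-6, 200e-6)
```

Expanding the square gives I1 – I2 = K·Veff0·Vind + (K/2)·Vind² for positive inputs, so the slope grows with Vind and, in this first-order model, the output current has no upper bound.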

Class AB OTAs

Another way to design an OTA with good current efficiency is to use an input stage exhibiting a class AB characteristic (Ref. 11). The input stage in Fig. 22.17 contains two CMOS pairs (Ref. 12) connected as class AB input transistors. They are driven by four source followers. When a differential input voltage is applied, the current through one of the input pairs will increase while the current through the other will decrease. The maximum current that can flow through the CMOS pair is, to first order, unlimited. In practice, it is limited by the supply voltage, the Keq factor, the mobility reduction factor, and the series resistance. The currents are delivered to the output with the help of two current mirrors.

FIGURE 22.17 An OTA having a class AB input stage.

In the OTA shown in Fig. 22.17, only one of the two outputs of each CMOS pair is used. The other output currents flow directly to the supply rails. Instead of wasting the other output currents, they can be used to supply an extra output. So, with the addition of two current mirrors, an OTA with complementary outputs as shown in Fig. 22.18 can be achieved (Ref. 10). An improvement of the output impedance and low-frequency voltage gain can be obtained by cascoding the output transistors of the current mirrors (Fig. 22.19). Usually, this reduces the output window. The function of transistors M41-M44 is to control the dc output voltages. They form part of a common-mode feedback system, which will be discussed next.

FIGURE 22.18 An OTA having a class AB input stage and two complementary outputs.

FIGURE 22.19 An improved fully differential OTA. (From Ref. 10. With permission.)


The relationship between the differential input voltage Vin and one of the output currents Iout is shown in Fig. 22.20. There is a linear relationship between Vind and Iout for small to moderate values of Vind. In the case of larger values of Vind, one of the CMOS pairs becomes cut off, resulting in a quasi-quadratic relationship. At a further increase of Vind, the output current will be somewhat saturated due to mobility reduction and to the fact that one of the transistors of the CMOS pair leaves saturation mode. The latter effect is, of course, also strongly dependent on the common input voltage.
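The first two regions of this characteristic (linear, then quasi-quadratic) can be illustrated with a simple model. The Python sketch below assumes that each CMOS pair behaves as a square-law device with factor Keq whose effective drive is the quiescent value shifted by plus or minus Vind; this biasing model, the names, and the numbers are assumptions, and the saturation at large Vind caused by mobility reduction is deliberately not modeled.

```python
def class_ab_output_current(v_ind, v_eff0, k_eq):
    """Idealized class AB transfer: difference of two square-law pair currents.

    v_eff0 is the assumed quiescent effective drive voltage of each CMOS pair.
    A pair is cut off once its effective drive becomes negative.
    """
    i_up = 0.5 * k_eq * max(v_eff0 + v_ind, 0.0) ** 2
    i_down = 0.5 * k_eq * max(v_eff0 - v_ind, 0.0) ** 2
    return i_up - i_down

# Assumed values: Veff0 = 0.5 V, Keq = 100 uA/V^2.
i_small = class_ab_output_current(0.1, 0.5, 1e-4)   # both pairs conducting
i_large = class_ab_output_current(1.0, 0.5, 1e-4)   # one pair cut off
```

While both pairs conduct, the squared terms cancel and the output is 2·Keq·Veff0·Vind, which is linear; once |Vind| exceeds Veff0, only one square-law term remains.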

FIGURE 22.20 The Vin → Iout characteristic of the OTA in Fig. 22.19.

22.5 Common-Mode Feedback

A fully differential OTA circuit, as in Fig. 22.19, has many advantages compared with its single-ended counterpart. It is a basic building block in filter design. A fully differential approach, in general, leads to more efficient current use, a doubling of the maximum output-voltage swing, and an improvement of the power-supply rejection ratio (PSRR). It also leads to a significant reduction of the total harmonic distortion, since all even harmonics are canceled out by the symmetrical structure. Even when there is a small imperfection in the symmetry, the reduction in distortion will be significant.

However, this type of symmetrical circuit needs an extra feedback loop. The feedback around a single-ended OTA usually provides only differential-mode feedback and is ineffective for common-mode signals. So, in the case of the fully differential OTA, a common-mode feedback (CMFB) circuit is needed to control the common-mode output voltage. Without CMFB, the common-mode output voltage of the OTA is not defined and may drift out of its high-gain region. The general structure of a simple OTA circuit with a differential output and a CMFB circuit is shown in Fig. 22.21. The need for a CMFB circuit is a drawback because it counters many of the advantages of the fully differential approach: the CMFB circuit requires chip area and power, introduces noise, and limits the output-voltage swing.

FIGURE 22.21 The general structure of a simple OTA circuit having a differential output and the required CMFB.

Figure 22.22(b) shows a simple implementation of a CMFB circuit. A differential pair (M1, M2) is used to sense the common-mode output voltage. For this purpose, the voltage at the common source of this differential pair (Vs) is used; it follows, with a level shift of one VGS, the common-mode output voltage of the OTA. The voltage at this node is, to first order, insensitive to the differential input voltage. The relationship between the differential input voltage Vin of the differential pair, superimposed on a common-mode input voltage VCM, and its common-source voltage Vs is shown in Fig. 22.22(a). The common-mode output voltage of the OTA is determined by the VGS of M1/M2 and M9/M10 and can be controlled by the voltage source V0. There might be an offset in the dc value of the two output voltages due to a mismatch in transistors M9 and M10. If the amplitude of the differential output voltage increases, the common-mode voltage will not remain constant, but will be slightly modulated by the differential output voltage, with a modulation frequency that is twice the differential input signal frequency. This modulation is caused by the “non-flat” Vs vs. Vin characteristic of the differential pair (M1, M2) (see Fig. 22.22(a)).
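The "non-flat" Vs vs. Vin characteristic, and the reason the modulation appears at twice the signal frequency, can be seen from a square-law model of the sensing pair. In the Python sketch below, the gates sit at VCM ± Vin/2, the tail current is constant, and both devices are assumed to conduct; the function name, the device model, and the numeric values are illustrative assumptions.

```python
import math

def common_source_voltage(v_in, v_cm, i_tail, k_factor, v_t):
    """Common-source voltage Vs of a square-law pair with gates at v_cm +/- v_in/2.

    From I1 + I2 = i_tail with Id = (K/2)*(VGS - VT)^2, the mean effective
    drive is sqrt(i_tail/K - v_in^2/4). Valid while |v_in| <= sqrt(2*i_tail/K).
    """
    v_eff = math.sqrt(i_tail / k_factor - 0.25 * v_in ** 2)
    return v_cm - v_t - v_eff

# Assumed values: VCM = 1.5 V, I0 = 100 uA, K = 200 uA/V^2, VT = 0.5 V.
params = dict(v_cm=1.5, i_tail=100e-6, k_factor=200e-6, v_t=0.5)
vs_rest = common_source_voltage(0.0, **params)
vs_pos = common_source_voltage(0.2, **params)
vs_neg = common_source_voltage(-0.2, **params)
```

Because Vs depends only on Vin², it is an even function of the differential voltage: positive and negative half-cycles raise Vs equally, which is why the common-mode level is modulated at twice the differential signal frequency.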

FIGURE 22.22 (a) The relationship between the differential input voltage, superimposed on a common-mode voltage VCM, of a differential pair (M1, M2) and its common-source voltage Vs; (b) a fully differential OTA with the differential pair (M1, M2) providing common-mode feedback.

Another commonly used CMFB circuit is shown in the fully differential folded cascode OTA in Fig. 22.23 [13]. In this circuit, a high output resistance and a high unloaded voltage gain, similar to those of normal cascode circuits, can be achieved. An advantage of the folded cascode technique, however, is a higher


FIGURE 22.23 A fully differential folded cascode OTA with another commonly used CMFB circuit.

accuracy in the signal-current transfer because current mirrors are avoided. In Fig. 22.23, all transistors are in saturation, with the exception of M1, M11, and M12, which are in the triode region. The CMFB is provided with the help of M11 and M12. These two transistors sense the output voltages VP and VQ. Since they operate in the triode region, their sum-current is insensitive to the differential output voltage (VP – VQ) and depends only on the common-mode output voltage ((VP + VQ)/2). Because the currents that flow through M17 and M18 force the value of this sum-current, they also determine, together with Vbias4, the common-mode output voltage. By choosing Vbias1 in such a way that IM19 is twice IM17, and making the width of transistor M1 twice that of M11 (= M12), the nominal common-mode output voltage will be equal to the gate voltage of M1.
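The insensitivity of the sum-current to the differential output voltage can be checked with a first-order triode model. The sketch below is illustrative only: the model I = β[(VGS − VT)·VDS − VDS²/2], the parameter values, and the assumption that M11 and M12 share the same source and drain voltages are not taken from the text.

```python
def triode_current(v_gate, v_ds, beta=1e-4, v_t=0.7, v_source=0.0):
    """First-order drain current of a MOS transistor in the triode region.

    beta, v_t and v_source are hypothetical values chosen for illustration.
    """
    v_ov = v_gate - v_source - v_t          # overdrive voltage
    return beta * (v_ov * v_ds - 0.5 * v_ds ** 2)


def sum_current(v_cm, v_diff, v_ds=0.1):
    """Sum of the currents of two triode devices sensing VP and VQ."""
    v_p = v_cm + v_diff / 2.0
    v_q = v_cm - v_diff / 2.0
    return triode_current(v_p, v_ds) + triode_current(v_q, v_ds)


if __name__ == "__main__":
    # Same common-mode level, different differential swing.
    print(sum_current(1.5, 0.0), sum_current(1.5, 0.4))
    # Different common-mode level.
    print(sum_current(1.6, 0.0))
```

Because the model is linear in the gate voltage, the differential part cancels in the sum and only VP + VQ remains; in a real device this holds only to first order.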

22.6 Filter Applications with Low-Voltage OTAs

Usually, gm-C filters are considered suitable candidates for high-speed and low-power applications. Compared with the SC op-amp and RC op-amp techniques, the applicability of the gm-C filter is limited by its low dynamic range and medium, even poor, linearity. The strategy of both simplifying the architecture and designing an ultra-low-voltage OTA [16] is used to meet low-power and dynamic range requirements. The filter topology is derived from a passive ladder form of a 5th-order elliptic filter. Using element replacement and sharing multiple inputs for gyrators, a fully differential 5th-order elliptic filter is obtained, as shown in Fig. 22.24 [14]. This multi-input sharing in the filter design reduces the number of OTAs from 11 to 6. Especially for wide-bandwidth designs, the method saves almost half of the die area and power dissipation. This design uses balanced signals to reduce even harmonics and to relax parasitic matching requirements. The capacitors realizing the filter poles are connected between the outputs of the transconductors and signal ground. This helps the stability of the common-mode feedback circuit because the capacitors load both the common-mode and the fully differential signal paths. The inherent 6 dB loss at low frequency is compensated for by the first transconductor with 2gm gain. Fig. 22.25 shows the frequency response and the passband of the filter. The filter has a –3 dB frequency tuning range of 1.2 MHz to 2.9 MHz and –60 dB stopband rejection. Obviously, the tuning range is limited by the low power-supply rail. This presents a problem for automatic tuning design at very low supply voltages. A possible application for this filter is channel selection in wideband handy phones [15].


FIGURE 22.24 A gm-C elliptic filter with low-voltage OTAs.

FIGURE 22.25 Frequency response and passband of the filter.


References

1. M. Ismail and T. Fiez, Analog VLSI Signal and Information Processing, McGraw-Hill, 1994.
2. E. A. Vittoz, The design of high-performance analog circuits on digital CMOS chips, IEEE J. Solid-State Circuits, vol. SC-20, pp. 657-665, June 1985.
3. F. Krummenacher, High voltage gain CMOS OTA for micro-power SC filters, Electronics Letters, vol. 17, pp. 160-162, 1981.
4. M. S. J. Steyaert, W. Bijker, P. Vorenkamp, and J. Sevenhans, ECL-CMOS and CMOS-ECL interface in 1.2 μm CMOS for 150 MHz digital ECL data transmission systems, IEEE J. Solid-State Circuits, vol. SC-26, pp. 15-24, Jan. 1991.
5. R. F. Wassenaar, Analysis of analog C-MOS circuits, Ph.D. thesis, University of Twente, The Netherlands, 1996.
6. S. L. Wong and C. A. T. Salama, An efficient CMOS buffer for driving large capacitive loads, IEEE J. Solid-State Circuits, vol. SC-21, pp. 464-469, June 1986.
7. M. G. Degrauwe, J. Rijmenants, E. A. Vittoz, and H. J. DeMan, Adaptive biasing CMOS amplifiers, IEEE J. Solid-State Circuits, vol. SC-17, pp. 522-528, June 1982.
8. E. Seevinck, R. F. Wassenaar, and W. de Jager, Universal adaptive biasing principle for micro-power amplifiers, Digest of Technical Papers ESSCIRC’84, pp. 59-62, Sept. 1984.
9. R. F. Wassenaar, Current-mode minimax circuit, IEEE Circuits and Devices, vol. 8, p. 47, Nov. 1992.
10. S. H. Lewis and P. R. Gray, A pipelined 5MHz 9b ADC, Proceedings ISSCC’87, pp. 210-211, 1987.
11. S. Dupuie and M. Ismail, High frequency CMOS transconductors, Ch. 5 in Analog IC Design: The Current-Mode Approach, Toumazou, Lidgey, and Haigh, Eds., Peter Peregrinus, Ltd., London, 1990.
12. E. Seevinck and R. F. Wassenaar, A versatile CMOS linear transconductor/square-law function circuit, IEEE J. Solid-State Circuits, vol. SC-22, no. 3, pp. 366-377, June 1987.
13. T. C. Choi, R. T. Kaneshiro, R. Brodersen, and P. R. Gray, High-frequency CMOS switched capacitor filters for communication applications, Proceedings ISSCC’83, pp. 246-247, 314, 1983.
14. R. Schaumann, Continuous-Time Integrated Filters, Ch. 80, The Circuits and Filters Handbook, W.-K. Chen, Editor-in-Chief, CRC Press and IEEE Press, New York, 1995.
15. C.-C. Hung, K. A. I. Halonen, M. Ismail, V. Porra, and A. Hyogo, A low-voltage, low-power CMOS fifth-order elliptic GM-C filter for baseband mobile, wireless communication, IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 4, pp. 584-592, August 1997.
16. C.-H. Lin and M. Ismail, Design and analysis of an ultra low-voltage CMOS class-AB V-I converter for dynamic range enhancement, International Symposium on Circuits and Systems, Orlando, Florida, June 1999.


Muroga, S. "Expressions of Logic Functions" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


23 Expressions of Logic Functions

Saburo Muroga, University of Illinois at Urbana-Champaign

23.1 Introduction to Basic Logic Operations
Basic Logic Expressions • Logic Expressions • Logic Expressions with Cubes
23.2 Truth Tables
Decimal Specifications
23.3 Karnaugh Maps
Two Completely Specified Functions to Express an Incompletely Specified Function
23.4 Binary Decision Diagrams

23.1 Introduction to Basic Logic Operations

In a contemporary digital computer, logic operations for computational tasks are usually done with signals that take values of 0 or 1. These logic operations are performed by many logic networks which constitute the computer. Each logic network has input variables x1, x2, …, xn and output functions f1, f2, …, fm. Each of the input variables and output functions takes only a binary value, 0 or 1. Now let us consider one of these output functions, f. Any logic function f can be expressed by a combination table (also called a truth table), exemplified in Table 23.1.

TABLE 23.1 Combination Table

x  y  z  f
0  0  0  0
0  0  1  0
0  1  0  1
0  1  1  1
1  0  0  0
1  0  1  0
1  1  0  0
1  1  1  1

Basic Logic Expressions

Any logic function can be expressed with three basic logic operations: OR, AND, and NOT. It can also be expressed with other logic operations, as explained in a later section. The OR operation of n variables x1, x2, …, xn yields the value 1 whenever at least one of the variables is 1, and 0 otherwise, where each of x1, x2, …, xn assumes the value 0 or 1. This is denoted by x1 ∨ x2 ∨ … ∨ xn. The OR operation defined above is sometimes called logical sum, or disjunction. Also, some authors use “+”, but throughout Section V, we use ∨, and + is used to mean an arithmetic addition. The AND operation of n variables yields the value 1 if and only if all variables x1, x2, …, xn are simultaneously 1. This is denoted by x1·x2·x3 … xn. These dots are usually omitted: x1x2x3…xn. The AND operation is sometimes called conjunction, or logical product. The NOT operation of a variable x yields the value 1 if x = 0, and 0 if x = 1. This is denoted by $\bar{x}$ or x′. The NOT operation is sometimes called complement or inversion. Using these operations, AND, OR, and NOT, a logic function, such as the one shown in Table 23.1, can be expressed in the following formula:

$$ f = \bar{x} y \bar{z} \lor \bar{x} y z \lor x y z \tag{23.1} $$
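As a quick check, the formula can be evaluated for every input combination and compared with the f column of Table 23.1. This is a minimal sketch; the names `TABLE_23_1` and `f_expr` are introduced here for illustration, and the expression uses the three product terms that correspond to the rows with f = 1 (010, 011, and 111).

```python
from itertools import product

# Column f of Table 23.1, rows ordered (x, y, z) = 000, 001, ..., 111.
TABLE_23_1 = [0, 0, 1, 1, 0, 0, 0, 1]


def f_expr(x, y, z):
    """Eq. 23.1 on 0/1 values: (NOT x) y (NOT z)  OR  (NOT x) y z  OR  x y z."""
    nx, nz = 1 - x, 1 - z
    return (nx & y & nz) | (nx & y & z) | (x & y & z)


values = [f_expr(x, y, z) for x, y, z in product((0, 1), repeat=3)]
print(values == TABLE_23_1)  # True
```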

Logic Expressions

Expressions with logic operations with AND, OR, and NOT, such as Eq. 23.1, are called switching expressions or logic expressions. Variables x, y, and z are sometimes called switching variables or logic variables, and they assume only binary values 0 and 1. In logic expressions such as Eq. 23.1, each variable, xi, appears with or without the NOT operation, that is, as $\bar{x}_i$ or xi. Henceforth, $\bar{x}_i$ and xi are called the literals of a variable xi.

Logic Expressions with Cubes

Logic expressions such as $f = x\bar{y} \lor yz \lor \bar{x}y\bar{z}$ can be expressed alternatively as a set, {(10-), (-11), (010)}, using components in a vector expression such that the first, second, and third components of the vector represent x, y, and z, respectively, where the value “1” represents xi, “0” represents $\bar{x}_i$, and “-” represents the absence of the variable. For example, (10-) represents $x\bar{y}$. These vectors are called cubes. Logic expressions with cubes are used often because of their convenience for processing by a computer.
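The convenience of cubes for computer processing can be illustrated with a short sketch. It assumes cubes are stored as strings over the characters "1", "0", and "-" and input vectors as strings of "0"/"1"; the function names are hypothetical.

```python
from itertools import product


def cube_covers(cube, bits):
    """True if the product term encoded by `cube` equals 1 for input `bits`."""
    return all(c == "-" or c == b for c, b in zip(cube, bits))


def evaluate(cubes, bits):
    """Value (0 or 1) of the disjunction of all cubes for input `bits`."""
    return int(any(cube_covers(c, bits) for c in cubes))


# The set {(10-), (-11), (010)} from the text, components ordered x, y, z.
COVER = ["10-", "-11", "010"]

true_vectors = ["".join(b) for b in product("01", repeat=3)
                if evaluate(COVER, "".join(b))]
print(true_vectors)  # ['010', '011', '100', '101', '111']
```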

23.2 Truth Tables

The value of a function f for different combinations of values of variables can be shown in a table, as exemplified in Tables 23.1 and 23.2. The table for n variables has 2^n rows. Thus the table size increases rapidly as n increases.

TABLE 23.2 Truth Table with Don’t-Care Conditions

Decimal Number   Variables   Function
of Row           x  y  z     f
0                0  0  0     0
1                0  0  1     1
2                0  1  0     d
3                0  1  1     0
4                1  0  0     d
5                1  0  1     1
6                1  1  0     1
7                1  1  1     0

Under certain circumstances, some of the combinations of input variable values never occur, or even if they occur, we do not care what values f assumes. These combinations are called don’t-care conditions, or simply don’t-cares, and are denoted by “d” or “*”, as shown in Table 23.2.


Decimal Specifications

A concise means of expressing the truth table is to list only rows with f = 1 and d, identifying these rows with their decimal numbers in the following decimal specifications. For example, the truth table of Table 23.2 can be expressed, using ∑, as

f(x, y, z) = Σ(1, 5, 6) + d(2, 4)

If only rows with f = 0 and d are considered, the truth table in Table 23.2 can be expressed, using ∏, as

f(x, y, z) = Π(0, 3, 7) + d(2, 4)
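The two decimal specifications can be read off a truth-table column mechanically. The following sketch assumes the f column of Table 23.2 is stored as a list indexed by decimal row number, with the string "d" marking a don't-care; the names are illustrative.

```python
# Column f of Table 23.2, indexed by the decimal number of the row.
TABLE_23_2 = [0, 1, "d", 0, "d", 1, 1, 0]


def decimal_spec(column):
    """Return the row numbers with f = 1, f = 0, and f = d, in that order."""
    on = [i for i, v in enumerate(column) if v == 1]
    off = [i for i, v in enumerate(column) if v == 0]
    dc = [i for i, v in enumerate(column) if v == "d"]
    return on, off, dc


on_rows, off_rows, dc_rows = decimal_spec(TABLE_23_2)
print("Sigma", on_rows, "+ d", dc_rows)   # Sigma [1, 5, 6] + d [2, 4]
print("Pi", off_rows, "+ d", dc_rows)     # Pi [0, 3, 7] + d [2, 4]
```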

23.3 Karnaugh Maps

Logic functions can be visually expressed using a Karnaugh map, which is simply a different way of representing a truth table, as exemplified for four variables in Fig. 23.1(a). For the case of four variables, for example, a Karnaugh map consists of 16 cells; that is, 16 small squares as shown in Fig. 23.1(a). Here, two-bit binary numbers along the horizontal line above the squares show the values of x1 and x2, and two-bit binary numbers along the vertical line on the left of the squares show the values of x3 and x4. The top left cell in Fig. 23.1(a) has 1 inside for x1 = x2 = x3 = x4 = 0. Also, the cell in the second row and the second column from the left has d inside. This means f = d (i.e., don’t-care) for x1 = 0, x2 = 1, x3 = 0, and x4 = 1.

The binary numbers that express variables are arranged in such a way that binary numbers for any two cells that are horizontally or vertically adjacent differ in only one bit position. Also, the two numbers in each row in the first and last columns differ in only one bit position and are interpreted to be adjacent. Similarly, the two numbers in each column in the top and bottom rows are interpreted to be adjacent. Thus, the four cells in the top row are interpreted to be adjacent to the four cells in the bottom row in each column, and the four cells in the first column are interpreted to be adjacent to the four cells in the last column in each row. With this arrangement of cells and this interpretation, a Karnaugh map is more than a concise representation of a truth table; it can express many important algebraic concepts, as we will see later. A Karnaugh map is a two-dimensional representation of the 16 cells on the surface of a torus, as shown in Fig. 23.1(b), where the two ends of the map are connected vertically and horizontally.
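The adjacency property, including the wrap-around between the first and last rows and columns, can be checked with a few lines of code. The sketch assumes the usual label order 00, 01, 11, 10 along both edges of the map, which is consistent with the cell positions quoted above but is not spelled out in the text; all names are illustrative.

```python
GRAY = ["00", "01", "11", "10"]  # assumed order of labels along each edge


def hamming(a, b):
    """Number of bit positions in which two equal-length strings differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def cell(row, col):
    """Input vector x1x2x3x4 of a cell; columns carry x1x2, rows carry x3x4.

    Indices wrap modulo 4, which models the torus interpretation.
    """
    return GRAY[col % 4] + GRAY[row % 4]


def all_neighbors_differ_in_one_bit():
    """Check every cell against its right and lower neighbor (with wrap-around)."""
    for r in range(4):
        for c in range(4):
            for dr, dc in ((0, 1), (1, 0)):
                if hamming(cell(r, c), cell(r + dr, c + dc)) != 1:
                    return False
    return True


print(cell(1, 1))                          # 0101: second row, second column
print(all_neighbors_differ_in_one_bit())   # True
```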

FIGURE 23.1 Karnaugh map for four variables.

Figure 23.2 shows the correspondence between the cells in the map in Fig. 23.2(a) and the rows in the truth table in Fig. 23.2(b). Notice that the rows in the truth table are not shown in consecutive order in the Karnaugh map. The Karnaugh map labeled with variable letters, instead of with binary numbers,


FIGURE 23.2 Correspondence between the cells in a Karnaugh map for four variables and the rows in a truth table.

shown in Fig. 23.2(c), is also often used. Although a 1 or 0 shows the function’s value corresponding to a particular cell, 0 is often not shown in each cell. Cells that contain 1’s are called 1-cells (similarly, 0-cells). Patterns of Karnaugh maps for two and three variables are shown in Figs. 23.3(a) and (b), respectively. As we extend this treatment to the cases of 5 or more variables, the maps, which will be explained in a later subsection, become increasingly complicated.

FIGURE 23.3 Karnaugh maps for two and three variables.

A rectangular loop that consists of 2^i 1-cells without including any 0-cells expresses a product of literals for any i, where i ≥ 1. For example, the square loop consisting of four 1-cells in Fig. 23.4(a) represents the product $x_2\bar{x}_3$, as we can see from the fact that $x_2\bar{x}_3$ takes the value 1 only for these 1-cells, i.e., x1 = 0, x2 = 1, x3 = x4 = 0; x1 = x2 = 1, x3 = x4 = 0; x1 = 0, x2 = 1, x3 = 0, x4 = 1; and x1 = x2 = 1, x3 = 0, x4 = 1. A rectangular loop consisting of a single 1-cell, such as the one in Fig. 23.4(b), for example, represents the product of literals that the cube (0001) expresses, i.e., $\bar{x}_1\bar{x}_2\bar{x}_3x_4$. Thus, the map in Fig. 23.4(a) expresses the function x 2 x 3 ∨ x 1 x 3 x 4 and the map in Fig. 23.4(b) expresses the function x 1 x 2 x 3 x 4 ∨ x 1 x 2 x 3 ∨ x 2 x 4 .


FIGURE 23.4 Products of literals expressed on Karnaugh maps.

Two Completely Specified Functions to Express an Incompletely Specified Function

Suppose we have an incompletely specified function f, as shown in Fig. 23.5(a), i.e., a function that has some don’t-cares. This incompletely specified function f can be expressed alternatively with two completely specified functions, $f^{ON}$ and $f^{OFF}$, shown in Fig. 23.5(b) and (c), respectively. $f^{ON}$ is called the ON-set of f and is the function whose value is 1 for f = 1 and 0 for f = 0 and d. $f^{OFF}$ is called the OFF-set of f and is 1 for f = 0 and 0 for f = 1 and d. The don’t-care set of f can be derived as

$$ f^{DC} = \overline{f^{ON} \lor f^{OFF}} $$

FIGURE 23.5 Expression of an incompletely specified function f with two completely specified functions.

23.4 Binary Decision Diagrams

Truth tables or logic expressions can be expressed with binary decision diagrams, which are usually abbreviated as BDDs. Compared with logic expressions or truth tables, BDDs have unique features, such as a unique concise representation, processing speed, and memory space, as discussed in a later section. Let us consider a logic expression, x 1 x 2 ∨ x3, for example. This can be expressed as the truth table shown in Fig. 23.6(a). Then, this can be expressed as the BDD shown in Fig. 23.6(b). It is easy to see why Fig. 23.6(b) represents the truth table in Fig. 23.6(a). Let us consider the row, (x1, x2, x3) = (011), for example, in Fig. 23.6(a). In Fig. 23.6(b), starting from the top node which represents x1, we go down to

FIGURE 23.6 Binary decision diagram.

the left node which represents x2, following the dotted line corresponding to x1 = 0. From this node, we go down to the second node from the left which represents x3, following the solid line corresponding to x2 = 1. From this node, we go down to the fourth rectangle from the left, which represents f = 1, following the solid line corresponding to x3 = 1. Thus, we have the value of f that is shown for the row, (x1, x2, x3) = (011), in the truth table in Fig. 23.6(a). Similarly, for any row in the truth table, we reach the rectangle that shows the value of f identical to that in the truth table in Fig. 23.6(a), by following a solid or dotted line corresponding to 1 or 0 for each of x1, x2, x3, respectively. The BDD in Fig. 23.6(b) can be simplified to the BDD shown in Fig. 23.6(c), which is called the reduced BDD, by the reduction to be described in a later section. When a function has don’t-cares, d’s, we can treat it in the same manner by considering a rectangle for d’s, as shown in Fig. 23.7.
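The walk from the top node to a terminal rectangle can be sketched in code. The example below builds an unreduced decision diagram from a function and then follows the 0-edges (dotted) and 1-edges (solid). The data layout (nested tuples) and the example function x1x2 ∨ x3 are assumptions made for illustration; the function is not necessarily the one drawn in Fig. 23.6, and no reduction is performed.

```python
from itertools import product


def build_bdd(f, n, prefix=()):
    """Unreduced diagram: terminals are 0/1, inner nodes are (var, low, high)."""
    if len(prefix) == n:
        return f(*prefix)                    # terminal rectangle
    low = build_bdd(f, n, prefix + (0,))     # dotted edge: variable = 0
    high = build_bdd(f, n, prefix + (1,))    # solid edge: variable = 1
    return (len(prefix), low, high)


def evaluate(node, bits):
    """Follow one edge per variable until a terminal value is reached."""
    while isinstance(node, tuple):
        var, low, high = node
        node = high if bits[var] else low
    return node


def example(x1, x2, x3):
    # Hypothetical example function chosen for this sketch.
    return (x1 & x2) | x3


bdd = build_bdd(example, 3)
print(evaluate(bdd, (0, 1, 1)))  # 1
```

A reduced BDD would additionally merge identical subgraphs and remove nodes whose two edges lead to the same successor; that step is left out here.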

FIGURE 23.7 Reduced BDD with don’t-cares.


Muroga, S. "Basic Theory of Logic Functions" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


24 Basic Theory of Logic Functions

Saburo Muroga, University of Illinois at Urbana-Champaign

24.1 Basic Theorems
Logic Expressions and Expansions
24.2 Implication Relations and Prime Implicants
Consensus • Derivation of All Prime Implicants from a Disjunctive Form

24.1 Basic Theorems

Theory on logic functions where the values of variables and their functions are 0 or 1 only is called switching theory. Here, let us discuss the basics of switching theory. Let us denote the set of input variables by the vector expression (x1, x2, …, xn). There are 2^n different input vectors when each of these n variables assumes the value 1 or 0. An input vector (x1, x2, …, xn) such that f(x1, x2, …, xn) = 1 or 0 is called a true (input) vector, or a false (input) vector of f, respectively. Vectors with n components are often called n-dimensional vectors if we want to emphasize that there are n components. When the value of a logic function f is specified for each of the 2^n vectors (i.e., for every combination of the values of x1, x2, …, xn), f is said to be completely specified. Otherwise, f is said to be incompletely specified; that is, the value of f is specified for fewer than 2^n vectors. Input vectors for which the value of f is not specified are called don’t-care conditions, usually denoted by “d” or “*”, as described in Chapter 23. These input vectors are never applied to a network whose output realizes f, or the values of f for these input vectors are not important. Thus, the corresponding values of f need not be considered.

If there exists a pair of input vectors (x1, …, xi-1, 0, xi+1, …, xn) and (x1, …, xi-1, 1, xi+1, …, xn) that differ only in a particular variable xi, such that the values of f for these two vectors differ, the logic function f(x1, x2, …, xn) is said to be dependent on xi. Otherwise, it is said to be independent of xi. In this case, f can be expressed without xi in the logic expression of f. If f is independent of xi, xi is called a dummy variable. If f(x1, x2, …, xn) depends on all its variables, it is said to be non-degenerate; otherwise, degenerate. For example, $x_1 \lor x_2x_3 \lor x_2\bar{x}_3$ can be expressed as x1 ∨ x2 without dummy variable x3.
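The definition of dependence translates directly into an exhaustive test: flip one variable in every input vector and see whether the function value ever changes. The sketch below uses hypothetical names and zero-based variable indices (index 2 stands for x3).

```python
from itertools import product


def depends_on(f, n, i):
    """True if some pair of vectors differing only in variable i changes f."""
    for bits in product((0, 1), repeat=n):
        if bits[i] == 0:
            flipped = bits[:i] + (1,) + bits[i + 1:]
            if f(*bits) != f(*flipped):
                return True
    return False


def example(x1, x2, x3):
    # x1 OR x2 x3 OR x2 (NOT x3), the example from the text.
    return x1 | (x2 & x3) | (x2 & (1 - x3))


print([depends_on(example, 3, i) for i in range(3)])  # [True, True, False]
```

The last entry being False confirms that x3 is a dummy variable of this function.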

Logic Expressions and Expansions

Given variable xi, xi and $\bar{x}_i$ are called literals of xi, as already explained.

Definition 24.1: (1) A conjunction (i.e., a logical product) of literals where a literal for each variable appears at most once is called a term (or a product). A term may consist of a single literal. A disjunction (i.e., logical sum) of terms is called a disjunctive form (or a sum of products). (2) Similarly, a disjunction (i.e., a logical sum) of literals where a literal for each variable appears at most once is called an alterm. An alterm may consist of a single literal. A conjunction of alterms is called a conjunctive form (or a product of sums).

For example, x 1 x 2 x 3 is a term, and $x_1 \lor x_2 \lor \bar{x}_3$ is an alterm. Also, x1x2 ∨ x1 ∨ x2 and $x_1 \lor \bar{x}_1x_2$ are disjunctive forms that are equivalent to the logic function x1 ∨ x2, but x1 ∨ x 2 ( x 1 ∨ x 2 ) is not a disjunctive form, although it expresses the same function. A disjunctive form does not contain products of literals that are identically 0 (e.g., $x_1x_2\bar{x}_2$), from the first sentence of (1). Similarly, a conjunctive form does not contain disjunctions of literals that are identically 1 (e.g., $x_1 \lor \bar{x}_2 \lor x_2 \lor x_3$).

The following expressions of a logic function are important special cases of a disjunctive form and a conjunctive form.

Definition 24.2: Assume that n variables, x1, x2, …, xn, are under consideration. (1) A minterm is defined as the conjunction of exactly n literals, where exactly one literal for each variable (xi and $\bar{x}_i$ are two literals of a variable xi) appears. When a logic function f of n variables is expressed as a disjunction of minterms without repetition, it is called the minterm expansion of f. (2) A maxterm is defined as a disjunction of exactly n literals, where exactly one literal for each variable appears. When f is expressed as a conjunction of maxterms without repetition, it is called the maxterm expansion of f.

For example, when three variables, x1, x2, and x3, are considered, there exist 2^3 = 8 minterms: $\bar{x}_1\bar{x}_2\bar{x}_3$, $\bar{x}_1\bar{x}_2x_3$, $\bar{x}_1x_2\bar{x}_3$, $\bar{x}_1x_2x_3$, $x_1\bar{x}_2\bar{x}_3$, $x_1\bar{x}_2x_3$, $x_1x_2\bar{x}_3$, and $x_1x_2x_3$. For the given function x1 ∨ x2x3, the minterm expansion is $\bar{x}_1x_2x_3 \lor x_1\bar{x}_2\bar{x}_3 \lor x_1\bar{x}_2x_3 \lor x_1x_2\bar{x}_3 \lor x_1x_2x_3$ and the maxterm expansion is $(x_1 \lor x_2 \lor x_3)(x_1 \lor x_2 \lor \bar{x}_3)(x_1 \lor \bar{x}_2 \lor x_3)$, as explained in the following. Also notice that the row for each true vector in a truth table and also its corresponding 1-cell in the Karnaugh map correspond to a minterm. If f = 1 for x1 = x2 = 1 and x3 = 0, then this row in the truth table and its corresponding 1-cell in the Karnaugh map correspond to the minterm $x_1x_2\bar{x}_3$. Also, as will be described in a later section, the row for each false vector in a truth table and also its corresponding 0-cell in the Karnaugh map correspond to a maxterm. For example, if f = 0 for x1 = x2 = 1 and x3 = 0, then this row in the truth table and its corresponding 0-cell in the Karnaugh map correspond to the maxterm $(\bar{x}_1 \lor \bar{x}_2 \lor x_3)$.

Theorem 24.1: Any logic function can be uniquely expanded with minterms and also with maxterms.

For example, f(x1, x2, x3) = x1 ∨ x2x3 can be uniquely expanded with the minterms as

$$ x_1 \lor x_2x_3 = \bar{x}_1x_2x_3 \lor x_1\bar{x}_2\bar{x}_3 \lor x_1\bar{x}_2x_3 \lor x_1x_2\bar{x}_3 \lor x_1x_2x_3 $$

and also can be uniquely expressed with maxterms as

$$ x_1 \lor x_2x_3 = (x_1 \lor x_2 \lor x_3)(x_1 \lor x_2 \lor \bar{x}_3)(x_1 \lor \bar{x}_2 \lor x_3) $$

These expansions have different expressions but both express the same function x1 ∨ x2x3.

The following expansions, called Shannon’s expansions, are often useful. Any function f(x1, x2, …, xn) can be expanded into the following expression with respect to x1:

$$ f(x_1, x_2, \ldots, x_n) = x_1 f(1, x_2, x_3, \ldots, x_n) \lor \bar{x}_1 f(0, x_2, x_3, \ldots, x_n) $$

where f(1, x2, x3, …, xn) and f(0, x2, x3, …, xn), which are f(x1, x2, …, xn) with x1 set to 1 and 0, respectively, are called cofactors. By further expanding each of f(1, x2, x3, …, xn) and f(0, x2, x3, …, xn) with respect to x2, we have

$$ f(x_1, x_2, \ldots, x_n) = x_1x_2 f(1, 1, x_3, \ldots, x_n) \lor x_1\bar{x}_2 f(1, 0, x_3, \ldots, x_n) \lor \bar{x}_1x_2 f(0, 1, x_3, \ldots, x_n) \lor \bar{x}_1\bar{x}_2 f(0, 0, x_3, \ldots, x_n) $$
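The expansion with respect to x1 can be checked exhaustively for a small function. In the sketch below, `cofactor`, `shannon`, and `minterms` are hypothetical helper names, and the example function is x1 ∨ x2x3; expanding with respect to every variable amounts to listing the true vectors, which is what `minterms` does.

```python
from itertools import product


def cofactor(f, value):
    """Fix the first argument of f to `value`; return the remaining function."""
    return lambda *rest: f(value, *rest)


def shannon(f, x1, *rest):
    """x1 f(1, ...) OR (NOT x1) f(0, ...), evaluated on 0/1 values."""
    return (x1 & cofactor(f, 1)(*rest)) | ((1 - x1) & cofactor(f, 0)(*rest))


def minterms(f, n):
    """True vectors of f; each one corresponds to a minterm of the expansion."""
    return [bits for bits in product((0, 1), repeat=n) if f(*bits)]


def example(x1, x2, x3):
    return x1 | (x2 & x3)


print(all(shannon(example, *b) == example(*b)
          for b in product((0, 1), repeat=3)))  # True
print(minterms(example, 3))
```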


Then we can further expand with respect to x3, and so on. Similarly,

$$ f(x_1, x_2, \ldots, x_n) = \big(x_1 \lor f(0, x_2, x_3, \ldots, x_n)\big)\big(\bar{x}_1 \lor f(1, x_2, x_3, \ldots, x_n)\big) $$

$$ f(x_1, x_2, \ldots, x_n) = \big(x_1 \lor x_2 \lor f(0, 0, x_3, \ldots, x_n)\big)\big(x_1 \lor \bar{x}_2 \lor f(0, 1, x_3, \ldots, x_n)\big)\big(\bar{x}_1 \lor x_2 \lor f(1, 0, x_3, \ldots, x_n)\big)\big(\bar{x}_1 \lor \bar{x}_2 \lor f(1, 1, x_3, \ldots, x_n)\big) $$

and so on. These expansions can be extended to the case with m variables factored out, where m ≤ n, although only the expansions for m = 1 (i.e., x1) and 2 (i.e., x1 and x2) are shown above. Of course, when m = n, the expansions become the minterm and maxterm expansions.

Theorem 24.2 (De Morgan’s Theorem):

$$ \overline{x_1 \lor x_2 \lor \cdots \lor x_n} = \bar{x}_1 \bar{x}_2 \cdots \bar{x}_n $$

and

$$ \overline{x_1 x_2 \cdots x_n} = \bar{x}_1 \lor \bar{x}_2 \lor \cdots \lor \bar{x}_n $$

A logic function is realized by a logic network that consists of logic gates, where logic gates are realized with hardware, such as transistor circuits. De Morgan’s Theorem 24.2 has many applications. For example, it asserts that a NOR gate, i.e., a logic gate whose output expresses the complement of the OR operation on its inputs, with noncomplemented variable inputs x1, x2, …, xn is interchangeable with an AND gate, i.e., a logic gate whose output expresses the AND operation on its inputs, with complemented variable inputs $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$, since the outputs of both gates express the same function. This is illustrated in Figure 24.1 for n = 2.
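The interchangeability of the two gates can be confirmed by brute force for small n. The sketch below models gates as Python functions on 0/1 values; the function names are illustrative, and the check covers both identities of De Morgan's theorem.

```python
from itertools import product


def nor_gate(*inputs):
    """Complement of the OR of the inputs."""
    return 1 - max(inputs)


def and_of_complements(*inputs):
    """AND gate fed with the complemented inputs."""
    return min(1 - x for x in inputs)


def nand_gate(*inputs):
    """Complement of the AND of the inputs."""
    return 1 - min(inputs)


def or_of_complements(*inputs):
    """OR gate fed with the complemented inputs."""
    return max(1 - x for x in inputs)


def de_morgan_holds(n):
    """Check both identities for every n-bit input vector."""
    return all(
        nor_gate(*bits) == and_of_complements(*bits)
        and nand_gate(*bits) == or_of_complements(*bits)
        for bits in product((0, 1), repeat=n)
    )


print(all(de_morgan_holds(n) for n in range(1, 6)))  # True
```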

FIGURE 24.1 Application of De Morgan’s theorem.

Definition 24.3: The dual of a logic function f(x1, x2, …, xn) is defined as $\bar{f}(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n)$, where $\bar{f}$ denotes the complement of the entire function f(x1, x2, …, xn) [in order to denote the complement of a function f(x1, x2, …, xn), the notation $\overline{f(x_1, x_2, \ldots, x_n)}$ might be used instead of $\bar{f}$]. Let it be denoted by f^d(x1, x2, …, xn). In particular, if f(x1, x2, …, xn) = f^d(x1, x2, …, xn), then f(x1, x2, …, xn) is called a self-dual function.

For example, when f(x1, x2) = x1 ∨ x2 is given, we have $f^d(x_1, x_2) = \bar{f}(\bar{x}_1, \bar{x}_2) = \overline{\bar{x}_1 \lor \bar{x}_2}$. This is equal to x1x2 by the first identity of Theorem 24.2. In other words, f^d(x1, x2) = x1x2. The function x1x2 ∨ x2x3 ∨ x1x3 is self-dual, as can be seen by applying the two identities of Theorem 24.2. Notice that, if f^d is the dual of f, the dual of f^d is f.

The concept of a dual function has many important applications. For example, it is useful in the conversion of networks with different types of gates, as in Fig. 24.2, where the replacement of the AND and OR gates in Fig. 24.2(a) by OR and AND gates, respectively, yields the output function f^d in Fig. 24.2(b), which is dual to the output f of Fig. 24.2(a). As will be explained in a later chapter, a logic gate in CMOS is another important application example of the concept of “dual,” where CMOS stands for complementary MOS and is a logic family, i.e., a type of transistor circuit for realizing a logic gate. Duality is utilized for reducing the power consumption of a logic gate in CMOS. The following theorem shows a more convenient method of computing the dual of a function than direct use of Definition 24.3.
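Definition 24.3 can be applied directly to functions given as code: complement every input, evaluate f, and complement the result. The sketch below uses hypothetical helper names and checks the examples from the text, namely that the dual of x1 ∨ x2 is x1x2 and that x1x2 ∨ x2x3 ∨ x1x3 is self-dual.

```python
from itertools import product


def dual(f):
    """Dual per Definition 24.3: complement of f on complemented inputs."""
    return lambda *bits: 1 - f(*(1 - b for b in bits))


def same_function(f, g, n):
    """True if f and g agree on all 2**n input vectors."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=n))


def or2(x1, x2):
    return x1 | x2


def and2(x1, x2):
    return x1 & x2


def majority(x1, x2, x3):
    return (x1 & x2) | (x2 & x3) | (x1 & x3)


print(same_function(dual(or2), and2, 2))           # True
print(same_function(dual(majority), majority, 3))  # True: self-dual
print(same_function(dual(dual(or2)), or2, 2))      # True: dual of dual is f
```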


FIGURE 24.2 Duality relation between two networks.

Theorem 24.3 (Generalized De Morgan’s Theorem): Let f(x1, x2, …, xn) be a function expressed by “∨” and “·” (and possibly also by parentheses and the constants 0 and 1). Let g(x1, x2, …, xn) be a function that is obtained by replacing every “∨” and “·” by “·” and “∨”, respectively, throughout the logic expression of f(x1, x2, …, xn) (and also, if 0 or 1 is contained in the original expression, by replacing 0 and 1 by 1 and 0, respectively). Then,

$$ f^d(x_1, x_2, \ldots, x_n) = g(x_1, x_2, \ldots, x_n) $$

For example, when f(x1, x2, x3) = x1 ∨ x 2 ⋅ x 3 is given, f^d = x1 · ( x 2 ∨ x 3 ) is obtained by this theorem. Here, notice that in f, the calculation of · precedes that of ∨; and in f^d, the ∨ must correspondingly be calculated first [thus, the parentheses are placed as ( x 2 ∨ x 3 )]. When $f = x_1 \lor 0 \cdot x_2 \cdot \bar{x}_3$ is given, $f^d = x_1 \cdot (1 \lor x_2 \lor \bar{x}_3)$ results by this theorem. When $f = x_1 \lor 1 \cdot x_2 \cdot \bar{x}_3$ is given, $f^d = x_1 \cdot (0 \lor x_2 \lor \bar{x}_3)$ results. For example, the dual of x1 ∨ x2x3 is $\overline{\bar{x}_1 \lor \bar{x}_2\bar{x}_3}$ according to Definition 24.3, which is a somewhat complicated expression. But by using the generalized De Morgan’s theorem, we can immediately obtain the expression without bars, x1 · (x2 ∨ x3) = x1x2 ∨ x1x3.

24.2 Implication Relations and Prime Implicants

In this section, we discuss the algebraic manipulation of logic expressions; that is, how to convert a given logic expression into others. This is very useful for simplification of logic expressions. Although simplification of a logic expression based on a Karnaugh map, which will be discussed in Chapter 25, is convenient in many cases, algebraic manipulation is more convenient in many other situations [3].

Definition 24.4: Let two logic functions be f(x1, x2, …, xn) and g(x1, x2, …, xn). If every vector (x1, x2, …, xn) satisfying f(x1, x2, …, xn) = 1 satisfies also g(x1, x2, …, xn) = 1 but the converse does not necessarily hold, we write

f(x1, x2, …, xn) ⊆ g(x1, x2, …, xn)    (24.1)

and we say that f implies g. In addition, if there exists a certain vector (x1, x2, …, xn) satisfying simultaneously f(x1, x2, …, xn) = 0 and g(x1, x2, …, xn) = 1, we write

f(x1, x2, …, xn) ⊂ g(x1, x2, …, xn)    (24.2)

and we say that f strictly implies g (some authors use different symbols instead of ⊂). Therefore, Eq. 24.1 means f(x1, x2, …, xn) ⊂ g(x1, x2, …, xn) or f(x1, x2, …, xn) = g(x1, x2, …, xn). These relations are called implication relations. The left- and right-hand sides of Eq. (24.1) or (24.2) are called antecedent and consequent, respectively. If an implication relation holds between f and g, that is, if f ⊆ g or f ⊇ g holds, f and g are said to be comparable (more precisely, “⊆-comparable” or “implication-comparable”). Otherwise, they are incomparable.


When two functions, f and g, are given, we can determine whether or not an implication relation exists between f and g by, for example, using a truth table for f and g, directly based on Definition 24.4. If and only if there is no row in which f = 1 and g = 0, the implication relation f ⊆ g holds. Furthermore, if there is at least one row in which f = 0 and g = 1, the relation is tightened to f ⊂ g. Table 24.1 shows the truth table for $f = x_1\bar{x}_3 \lor \bar{x}_1x_2x_3$ and g = x1 ∨ x2. There is no row with f = 1 and g = 0, so f ⊆ g holds. Furthermore, there is a row with f = 0 and g = 1, so the relation is actually f ⊂ g.

TABLE 24.1 Example for f ⊆ g

x1  x2  x3  f  g
0   0   0   0  0
0   0   1   0  1
0   1   0   0  1
0   1   1   1  1
1   0   0   1  1
1   0   1   0  1
1   1   0   1  1
1   1   1   0  1
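The truth-table test of Definition 24.4 can be automated as follows. This is a sketch with hypothetical function names; f is written so that it equals 1 exactly for the rows 011, 100, and 110, matching the f column of Table 24.1, and g is taken as x1 ∨ x2 as stated in the text.

```python
from itertools import product


def implies(f, g, n):
    """f implies g: there is no input vector with f = 1 and g = 0."""
    return all(not (f(*b) == 1 and g(*b) == 0)
               for b in product((0, 1), repeat=n))


def strictly_implies(f, g, n):
    """f strictly implies g: f implies g and some vector has f = 0, g = 1."""
    return implies(f, g, n) and any(
        f(*b) == 0 and g(*b) == 1 for b in product((0, 1), repeat=n))


def f(x1, x2, x3):
    return (x1 & (1 - x3)) | ((1 - x1) & x2 & x3)


def g(x1, x2, x3):
    return x1 | x2


print(implies(f, g, 3), strictly_implies(f, g, 3))  # True True
print(implies(g, f, 3))                             # False
```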

Although “g implies f ” means “if g = 1, then f = 1,” it is to be noticed that “g does not imply f ” does not necessarily mean “f implies g” but does mean either “f implies g” or “g and f are incomparable.” In other words, it does mean “if g = 1, then f becomes a function other than the constant function which is identically equal to 1.” (As a special case, f could be identically equal to 0.) Notice that “g does not imply f ” does not necessarily mean “if g = 0, then f = 0.” Definition 24.5: An implicant of a logic function f is a term that implies f. For example, x1, x2, x1x2, , and x 1 x 3 x 2 are examples of implicants of the function x1∨x2. But x 1 x 2 is not. Notice that x1x3 is an implicant of x1∨x2 even though x1∨x2 is independent of x3. (Notice that every product of an implicant of f with any dummy variables is also an implicant of f. Thus, f has an infinite number of implicants.) But x1x2 is not an implicant of f = x1x3 ∨ x 2 because x1x2 does not imply f. (When x1x2 = 1, we have f = x3, which can be 0. Therefore, even if x1x2 = 1, f may become 0.) Some implicants are not obvious from a given expression of a function. For example, x1x2 ∨ x 1 x 2 has implicants x2x3 and x2x3x4. Also, x 1 x 2 ∨ x 2 x 3 ∨ x 1 x 3 has an implicant x3 because, if x3 = 1, x 1 x 2 ∨ x 2 x 3 ∨ x 1 x 3 becomes x 1 x 2 ∨ x2 ∨ x 1 = x1 ∨ x2 ∨ x 1 = (x1 ∨ x 1 ) ∨ x2 = 1 ∨ x2, which is equal to 1. Definition 24.6: A term P is said to subsume another term Q if all the literals of Q are contained among those of P. If a term P which subsumes another term Q contains literals that Q does not have, P is said to strictly subsume Q. For example, term x 1 x 2 x 3 x 4 subsumes x 1 x 3 x 4 and also itself. More precisely speaking, x 1 x 2 x 3 x 4 strictly subsumes x 1 x 3 x 4 . Notice that Definition 24.6 can be equivalently stated as follows: “A term P is said to subsume another term Q if P implies Q; that is, P ⊆ Q. 
Term P strictly subsumes another term Q if P ⊂ Q.” Notice that when we have terms P and Q, we can say, “P implies Q” or, equivalently, “P subsumes Q.” But the word “subsume” is ordinarily not used in other cases, except for comparing two alterms (as we will see in Section 25.4). For example, when we have functions f and g that are not in single terms, we usually do not say “f subsumes g.” On a Karnaugh map, if the loop representing a term P (always a single rectangular loop consisting of 2^i 1-cells because P is a product of literals) is part of the 1-cells representing function f, or is contained in the loop representing a term Q, P implies f or subsumes Q, respectively. Figure 24.3 illustrates this. Conversely, it is easy to see that, if a term P, which does not contain any dummy variables of f, implies f, the loop for P must consist of some 1-cells of f, and if a term P, which does not contain any dummy variables of another term Q, implies Q, the loop for P must be inside the loop for Q.
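The two relations above, "a term implies f" and "a term subsumes another term," can be checked mechanically on a truth table. The following is a minimal sketch, not part of the original text: the dictionary encoding of terms and the names `is_implicant` and `subsumes` are assumptions made for illustration, and the second function is read as x1x3 ∨ x̄2 because the complement bars are lost in this copy.

```python
from itertools import product

def term_value(term, assignment):
    """Value of a product term; term maps a variable index to True (x) or False (complemented x)."""
    return all(assignment[v] == pol for v, pol in term.items())

def is_implicant(term, f, n):
    """True if f = 1 for every assignment of the n variables that makes the term 1."""
    for bits in product([False, True], repeat=n):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if term_value(term, assignment) and not f(assignment):
            return False
    return True

def subsumes(p, q):
    """Term p subsumes term q if every literal of q is also a literal of p."""
    return all(p.get(v) == pol for v, pol in q.items())

f1 = lambda a: a[1] or a[2]                    # x1 v x2
f2 = lambda a: (a[1] and a[3]) or not a[2]     # x1x3 v x2' (assumed reading)

print(is_implicant({1: True, 3: True}, f1, 3))   # x1x3 against x1 v x2
print(is_implicant({1: True, 2: True}, f2, 3))   # x1x2 against f2
```

Under these assumptions, `subsumes(p, q)` only compares literal sets, while `is_implicant` needs the whole truth table, which is why the text reserves "subsume" for term-to-term comparisons.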


FIGURE 24.3 Comparison of “imply” and “subsume.” (a) Term P implies a function f. (b) Term P subsumes a term Q.

The following concept of “prime implicant” is useful for deriving a simplest disjunctive form for a given function f (recall that logic expressions for f are not unique) and consequently for deriving a simplest logic network realizing f.

Definition 24.7: A prime implicant of a given function f is defined as an implicant of f such that no other term subsumed by it is an implicant of f.

For example, when f = x1x2 ∨ x̄1x3 ∨ x1x2x3 ∨ x 1 x 2 x 3 is given, x1x2, x̄1x3, and x2x3 are prime implicants. But x1x2x3 and x 1 x 2 x 3 are not prime implicants, although they are implicants (i.e., if any of them is 1, then f = 1). Prime implicants of a function f can be obtained from other implicants of f by stripping off unnecessary literals until further stripping makes the remainder no longer imply f. Thus, x2x3 is a prime implicant of x1x2 ∨ x̄1x3, and x2x3x4 is an implicant of this function but not a prime implicant. As seen from this example, some implicants, such as x2x3, and accordingly some prime implicants are not obvious from a given expression of a function. Notice that, unlike implicants, a prime implicant cannot contain a literal of any dummy variable of a function.

On a Karnaugh map, all prime implicants of a given function f of up to four variables can be easily found. As is readily seen from Definition 24.7, each rectangular loop that consists of 2^i 1-cells, with i chosen as large as possible, is a prime implicant of f. If we find all such loops, we will have found all prime implicants of f. Suppose that a function f is given as shown in Fig. 24.4(a). Then, the prime implicants are shown in Fig. 24.4(b). In this figure, we cannot make the size of the rectangular loops any bigger. (If we increase the size of any one of these loops, the new rectangular loop will contain a number of 1-cells that is not 2^i for any i, or will include one or more 0-cells.)
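The literal-stripping idea described above can be sketched in a few lines. This is an illustrative sketch rather than the author's procedure: the function names are hypothetical, terms are dictionaries from variable index to polarity, and the function is taken as x1x2 ∨ x̄1x3, the reading under which x2x3 is an implicant.

```python
from itertools import product

def is_implicant(term, f, n):
    """True if every assignment that makes the term 1 also makes f 1."""
    for bits in product([False, True], repeat=n):
        a = {i + 1: b for i, b in enumerate(bits)}
        if all(a[v] == pol for v, pol in term.items()) and not f(a):
            return False
    return True

def strip_to_prime(term, f, n):
    """Drop literals one at a time while the remainder still implies f."""
    term = dict(term)
    for v in sorted(term):
        trial = {u: pol for u, pol in term.items() if u != v}
        if is_implicant(trial, f, n):
            term = trial
    return term

f = lambda a: (a[1] and a[2]) or (not a[1] and a[3])     # x1x2 v x1'x3 (assumed reading)

print(strip_to_prime({2: True, 3: True, 4: True}, f, 4))  # start from x2x3x4
```

One pass over the literals suffices here, because a literal that cannot be dropped from a larger term cannot be dropped from any smaller term derived from it. The prime implicant reached depends on the order in which literals are tried, so this finds one prime implicant per starting implicant, not all of them.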

FIGURE 24.4 Expression of prime implicants on Karnaugh maps.


Consensus

Next, let us systematically find all prime implicants, including those not obvious, for a given logic function. To facilitate our discussion, let us define a consensus.

Definition 24.8: Assume that two terms, P and Q, are given. If there is exactly one variable, say x, appearing noncomplemented in one term and complemented in the other (in other words, if P = xP′ and Q = x̄Q′, where no other variable appears complemented in one of P′ and Q′ and noncomplemented in the other), then the product of all literals except the literals of x, that is, P′Q′ with duplicates of literals deleted, is called the consensus of P and Q.

For example, if we have two terms, x 1 x 2 x 3 and x 1 x 2 x 4 x 5 , the consensus is x 2 x 3 x 4 x 5 . But x 1 x 2 x 3 and x 1 x 2 x 4 x 5 do not have a consensus because two variables, x1 and x2, appear noncomplemented and complemented in these two terms.

A consensus can easily be shown on a Karnaugh map. For example, Fig. 24.5 shows a function f = x1x2 ∨ x̄1x4. In addition to the two loops shown in Fig. 24.5(a), which correspond to the two prime implicants, x1x2 and x̄1x4, of f, this f can have another rectangular loop, which consists of 2^i 1-cells with i chosen as large as possible, as shown in Fig. 24.5(b). This third loop, which represents x2x4, the consensus of x1x2 and x̄1x4, intersects the two loops in Fig. 24.5(a) and is contained within the 1-cells that represent x1x2 and x̄1x4. This is an important characteristic of a loop representing a consensus. Notice that these three terms, x2x4, x1x2, and x̄1x4, are prime implicants of f. When rectangular loops of 2^i 1-cells are adjacent (not necessarily exactly in the same row or column), the consensus is a rectangular loop of 2^i 1-cells, with i chosen as large as possible, that intersects and is contained within these loops. Therefore, if we obtain all largest possible rectangular loops of 2^i 1-cells, we can obtain all prime implicants, including consensuses, which intersect and are contained within other pairs of loops.

Sometimes, a consensus term can be obtained from a pair consisting of another consensus and a term, or a pair of other consensuses that do not appear in a given expression. For example, x1x̄2 and x2x3 of x1x̄2 ∨ x2x3 ∨ x̄1x3 yield consensus x1x3, which in turn yields consensus x3 with x̄1x3. Each such consensus is also obtained among the above largest possible rectangular loops.

FIGURE 24.5 Expression of a consensus on Karnaugh maps.

As we can easily prove, every consensus that is obtained from terms of a given function f implies f. In other words, every consensus generated is an implicant of f, although not necessarily a prime implicant.
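Definition 24.8 and the claim that a consensus implies f can both be illustrated on the function of Fig. 24.5. The sketch below is only an illustration: the dictionary encoding and the names `consensus` and `implies` are assumptions, and f is taken as x1x2 ∨ x̄1x4, the reading under which x2x4 is the consensus.

```python
from itertools import product

def consensus(p, q):
    """Consensus of two terms (variable index -> True for x, False for its complement), or None."""
    opposed = [v for v in p if v in q and p[v] != q[v]]
    if len(opposed) != 1:          # exactly one variable must appear in both polarities
        return None
    x = opposed[0]
    return {v: pol for v, pol in {**p, **q}.items() if v != x}

def implies(term, f, n):
    """True if f = 1 whenever the term is 1."""
    for bits in product([False, True], repeat=n):
        a = {i + 1: b for i, b in enumerate(bits)}
        if all(a[v] == pol for v, pol in term.items()) and not f(a):
            return False
    return True

p = {1: True, 2: True}      # x1x2
q = {1: False, 4: True}     # x1'x4 (assumed reading)
f = lambda a: (a[1] and a[2]) or (not a[1] and a[4])

c = consensus(p, q)
print(c, implies(c, f, 4))
```

A pair such as x1x2 and x̄1x̄2 returns `None`, since two variables are opposed and Definition 24.8 does not apply.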

Derivation of All Prime Implicants from a Disjunctive Form

The derivation of all prime implicants of a given function f is easy, using a Karnaugh map. If, however, the function has five or more variables, the derivation becomes increasingly complicated on a Karnaugh map. Therefore, let us discuss an algebraic method, which is convenient for implementation in a computer program, although for functions of many variables even algebraic methods are too time-consuming and we need to resort to heuristic methods. The following algebraic method to find all prime implicants of a given function, which Tison [4, 5] devised, is more efficient than the previously known iterated-consensus method, which was first proposed by Blake in 1937 [2].

Definition 24.9: Suppose that p products, A1, …, Ap, are given. A variable such that only one of its literals appears throughout A1, …, Ap is called a unate variable. A variable such that both of its literals appear in A1, …, Ap is called a binate variable.

For example, when x 1 x 2 , x 1 x 3 , x 3 x 4 , and x 2 x 4 are given, x1 and x4 are binate variables, since x1 and x4 as well as their complements, x̄1 and x̄4, appear, and x2 and x3 are unate variables.

Procedure 24.1: The Tison Method: Derivation of All Prime Implicants of a Given Function

Assume that a function f is given in a disjunctive form f = P ∨ Q ∨ … ∨ T, where P, Q, …, T are terms, and that we want to find all prime implicants of f. Let S denote the set {P, Q, …, T}.

1. Among P, Q, …, T in set S, first delete every term subsuming any other term. Among all binate variables, choose one of them. For example, when f = x1x2x4 ∨ x1x3 ∨ x2x̄3 ∨ x2x4 ∨ x̄3x̄4 is given, delete x1x2x4, which subsumes x2x4. The binate variables are x3 and x4. Let us choose x3 first.

2. For each pair of terms, that is, one with the complemented literal of the chosen variable and the other with the noncomplemented literal of that variable, generate the consensus. Then add the generated consensus to S. From S, delete every term that subsumes another. For our example, x3 is chosen as the first binate variable. Thus we get consensus x1x2 from pair x1x3 and x2x̄3, and consensus x1x̄4 from pair x1x3 and x̄3x̄4. None subsumes another. Thus, S, the set of prime implicants, becomes x1x3, x2x̄3, x2x4, x̄3x̄4, x1x2, and x1x̄4.

3. Choose another binate variable in the current S. Then go to Step 2. If all binate variables are tried, go to Step 4. For the above example, for the second iteration of Step 2, we choose x4 as the second binate variable and generate two new consensuses as follows:

But when they are added to S, each one subsumes some term contained in S. Therefore, they are eliminated.


4. The procedure terminates because all binate variables are processed, and all the products in S are desired prime implicants. The disjunction of all the products in S is called the complete sum or the all-prime-implicant disjunction. The complete sum is the first important step in deriving the most concise expressions for a given function.

Generation of prime implicants for an incompletely specified function, which is more general than the case of a completely specified function described in Procedure 24.1, is significantly sped up with the use of BDDs (described in Chapters 24.1 and 24.4) by Coudert and Madre [1].
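Procedure 24.1 can be sketched compactly by encoding each term as a set of signed integers (+i for x_i, −i for its complement). This is a sketch under stated assumptions, not the handbook's implementation: the encoding and the names `absorb`, `consensus`, and `tison` are hypothetical, and the example function is read as x1x2x4 ∨ x1x3 ∨ x2x̄3 ∨ x2x4 ∨ x̄3x̄4, the reading consistent with the binate variables and consensuses stated in the procedure (the complement bars are lost in this copy).

```python
def absorb(terms):
    """Delete every term that strictly subsumes (is a proper superset of) another term."""
    terms = set(terms)
    return {t for t in terms if not any(u < t for u in terms)}

def consensus(p, q, x):
    """Consensus of p (containing x) and q (containing -x) on variable x, or None."""
    merged = (p - {x}) | (q - {-x})
    if any(-lit in merged for lit in merged):   # a second opposed variable: no consensus
        return None
    return frozenset(merged)

def tison(terms):
    """Process each binate variable once, adding consensuses and absorbing after each pass."""
    s = absorb(frozenset(t) for t in terms)
    tried = set()
    while True:
        lits = {lit for t in s for lit in t}
        binate = sorted(v for v in lits if v > 0 and -v in lits and v not in tried)
        if not binate:
            return s
        x = binate[0]
        tried.add(x)
        pos = [t for t in s if x in t]
        neg = [t for t in s if -x in t]
        new = {consensus(p, q, x) for p in pos for q in neg} - {None}
        s = absorb(s | new)

# Assumed reading of the example in Procedure 24.1.
example = [(1, 2, 4), (1, 3), (2, -3), (2, 4), (-3, -4)]
for term in sorted(tison(example), key=sorted):
    print(sorted(term, key=abs))
```

With this reading, the pass on x3 adds x1x2 and x1x̄4, and the pass on x4 regenerates only terms already in S, which matches the behavior described in Steps 2 and 3.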

References
1. Coudert, O. and J. C. Madre, “Implicit and incremental computation of primes and essential primes of Boolean functions,” Design Automation Conf., pp. 36-39, 1992.
2. Brown, F. M., “The origin of the iterated consensus,” IEEE Tr. Comput., p. 802, Aug. 1968.
3. Muroga, S., Logic Design and Switching Theory, John Wiley & Sons, 1979 (now available from Krieger Publishing Co.).
4. Tison, P., Ph.D. dissertation, Faculty of Science, University of Grenoble, France, 1965.
5. Tison, P., “Generalization of consensus theory and application to the minimization of Boolean functions,” IEEE Tr. Electron. Comput., pp. 446-456, Aug. 1967.


Muroga, S. "Simplification of Logic Expressions" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


25 Simplification of Logic Expressions

Saburo Muroga, University of Illinois at Urbana-Champaign

25.1 Minimal Sums
25.2 Derivation of Minimal Sums by Karnaugh Map
Maps for Five and Six Variables
25.3 Derivation of Minimal Sums for a Single Function by Other Means
25.4 Prime Implicates, Irredundant Conjunctive Forms, and Minimal Products
25.5 Derivation of Minimal Products by Karnaugh Map

25.1 Minimal Sums

In this chapter, using only AND and OR gates, we will synthesize a two-level logic network. This is the fastest network, if we assume that every gate has the same delay, since the number of levels cannot be reduced further unless a given function can be realized with a single AND or OR gate. If there is more than one such network, we will derive the simplest network. Such a network has a close relation to important concepts in switching algebra, that is, irredundant disjunctive forms and minimal disjunctive forms (or minimal sums), as we discuss in the following.

Now let us explore basic properties of logic functions. For many functions, some terms in their complete sums are redundant. In other words, even if we eliminate some terms from a complete sum, the remaining expression may still represent the original function for which the complete sum was obtained. Thus, we have the following concept.

Definition 25.1: An irredundant disjunctive form for f (sometimes called an irredundant sum-of-products expression or an irredundant sum) is a disjunction of prime implicants such that removal of any of the prime implicants makes the remaining expression not express the original f.

An irredundant disjunctive form for a function is not necessarily unique.

Definition 25.2: Prime implicants that appear in every irredundant disjunctive form for f are called essential prime implicants of f. Prime implicants that do not appear in any irredundant disjunctive form for f are called absolutely eliminable prime implicants of f. Prime implicants that appear in some irredundant disjunctive forms for f but not in all are called conditionally eliminable prime implicants of f.

Different types of prime implicants are shown in the Karnaugh map in Fig. 25.1 for a function f = x 1 x 3 x 4 ∨ x 1 x 2 x 3 ∨ x 1 x 2 x 4 ∨ x2x3x4 ∨ x1x3x4 ∨ x 1 x 2 x 4 . Here, x 1 x 2 x 4 , x 1 x 2 x 4 , and x 1 x 2 x 4 are essential prime implicants, x 2 x 3 x 4 and x 1 x 3 x 4 are conditionally eliminable prime implicants, and x 1 x 2 x 3 is an absolutely eliminable prime implicant.


FIGURE 25.1 Different types of prime implicants.

The concepts defined in the following play an important role in switching theory.

Definition 25.3: Among all irredundant disjunctive forms of f, those with a minimum number of prime implicants are called minimal sums (some authors call them “sloppy minimal sums”). Among minimal sums, those with a minimum number of literals are called absolute minimal sums for f.

Irredundant disjunctive forms for a given function f can be obtained by deleting prime implicants one by one from the complete sum in all possible ways, after obtaining the complete sum by the Tison method discussed in Chapter 24. Then the minimal sums can be found among the irredundant disjunctive forms. Usually, however, this approach is excessively time-consuming because a function has many prime implicants when the number of variables is very large. When the number of variables is too large, derivation of a minimal sum is practically impossible. Later, we will discuss efficient methods to derive minimal sums within reasonable computation time when the number of variables is small.
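The exhaustive approach just described, enumerating irredundant disjunctive forms from the complete sum and then picking those with the fewest prime implicants, can be sketched as follows. This is a brute-force illustration only: the signed-integer encoding and function names are assumptions, and the six prime implicants form a hypothetical three-variable complete sum that is not taken from the text.

```python
from itertools import combinations, product

def cells(term, n):
    """1-cells covered by a term given as signed literals (+i is x_i, -i its complement)."""
    return {bits for bits in product((0, 1), repeat=n)
            if all(bits[abs(lit) - 1] == (lit > 0) for lit in term)}

def irredundant_forms(primes, n):
    """All sets of prime implicants that cover f and from which no member can be removed."""
    onset = set().union(*(cells(p, n) for p in primes))

    def covers(selection):
        return set().union(*(cells(p, n) for p in selection)) == onset

    forms = []
    for k in range(1, len(primes) + 1):
        for selection in combinations(primes, k):
            if covers(selection) and not any(
                    covers([q for q in selection if q != p]) for p in selection):
                forms.append(selection)
    return forms

def minimal_sums(forms):
    """Irredundant forms with the fewest prime implicants (Definition 25.3)."""
    fewest = min(len(form) for form in forms)
    return [form for form in forms if len(form) == fewest]

# Hypothetical complete sum of a three-variable function (illustration only).
PRIMES = [(-1, -2), (-1, -3), (-2, 3), (2, -3), (1, 3), (1, 2)]
forms = irredundant_forms(PRIMES, 3)
print(len(forms), len(minimal_sums(forms)))
```

The loop visits every subset of the prime implicants, so its cost doubles with each additional prime implicant, which is the practical limit the paragraph above points to. Absolute minimal sums could be selected from `minimal_sums(forms)` by additionally comparing literal counts.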

25.2 Derivation of Minimal Sums by Karnaugh Map

Because of its pictorial nature, a Karnaugh map is a very powerful tool for deriving manually all prime implicants, irredundant disjunctive forms, minimal sums, and also absolute minimal sums. Algebraic concepts such as prime implicants and consensuses can be better understood on a map. One can derive all prime implicants, irredundant disjunctive forms, and minimal sums by the following procedures on Karnaugh maps, when the number of variables is small enough for the map to be manageable.

Procedure 25.1: Derivation of Minimal Sums on a Karnaugh Map

This procedure consists of three steps:

1. On a Karnaugh map, encircle all the 1-cells with rectangles (also squares as a special case), each of which consists of 2^i 1-cells, choosing i as large as possible, where i is a non-negative integer. Let us call these loops prime implicant loops, since they correspond to prime implicants in the case of the Karnaugh map for four or fewer variables. (In the case of a five- or six-variable map, the correspondence is more complex, as will be explained later.) Examples are shown in Fig. 25.2(a).

2. Cover all the 1-cells with prime implicant loops so that removal of any loop leaves some 1-cells uncovered. These sets of loops represent irredundant disjunctive forms. Figs. 25.2(b) through 25.2(e) represent four irredundant disjunctive forms, obtained by choosing the loops in Fig. 25.2(a) in four different ways. For example, if the prime implicant loop x 1 x 3 x 4 is omitted in Fig. 25.2(b), the 1-cells for (x1, x2, x3, x4) = (1 1 0 1) and (1 0 0 1) are not covered by any loops.

3. From the sets of prime implicant loops formed in Step 2 for irredundant disjunctive forms, choose the sets with a minimum number of loops. Among these sets, the sets that contain as many of the largest loops as possible (a larger loop represents a product of fewer literals) represent minimal sums. Figure 25.2(c) expresses the unique minimal sum for this function, since it contains one less loop than Fig. 25.2(b), (d), or (e).


FIGURE 25.2 Irredundant disjunctive forms and a minimal sum.

It is easy to see, from the definitions of prime implicants, irredundant disjunctive forms, and minimal sums, why Procedure 25.1 works. When we derive irredundant disjunctive forms or minimal sums by Procedure 25.1, the following property is useful. When we find all prime implicant loops by Step 1, some 1-cells may be contained in only one loop. Such 1-cells are called distinguished 1-cells and are labeled with asterisks. (The 1-cells shown with asterisks in Fig. 25.2(a) are distinguished 1-cells.) A prime implicant loop that contains distinguished 1-cells is called an essential prime implicant loop. The corresponding prime implicant is an essential prime implicant, as already defined. In every irredundant disjunctive form and every minimal sum to be found in Steps 2 and 3, respectively, essential prime implicants must be included, since each 1-cell on the map must be contained in at least one prime implicant loop and distinguished 1-cells can be contained only in essential prime implicant loops. Hence, if essential prime implicant loops are first identified and chosen, Procedure 25.1 can be carried out quickly.

Even if the don’t-care condition d is contained in some cells, prime implicants can be formed in the same manner, by simply regarding d as being 1 or 0 whenever necessary to draw a greater prime implicant loop. For example, in Fig. 25.3, we can draw a greater rectangular loop by regarding two d’s as being 1. One d is left outside and is regarded as being 0. We need not consider loops consisting of d’s only.
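The notion of distinguished 1-cells translates directly into a count of how many prime implicants cover each 1-cell. The sketch below is illustrative only: the encoding and names are assumptions, and the six prime implicants are the complete sum of the Procedure 24.1 example under the reading x1x3, x2x̄3, x2x4, x̄3x̄4, x1x2, x1x̄4 (complement bars are lost in this copy, so this reading is itself an assumption).

```python
from itertools import product

def cells(term, n):
    """1-cells covered by a term given as signed literals (+i is x_i, -i its complement)."""
    return {bits for bits in product((0, 1), repeat=n)
            if all(bits[abs(lit) - 1] == (lit > 0) for lit in term)}

def distinguished_cells(primes, n):
    """Map each 1-cell covered by exactly one prime implicant to that prime implicant."""
    covered_by = {}
    for p in primes:
        for cell in cells(p, n):
            covered_by.setdefault(cell, []).append(p)
    return {cell: ps[0] for cell, ps in covered_by.items() if len(ps) == 1}

# Assumed reading of the complete sum from the Procedure 24.1 example.
PRIMES = [(1, 3), (2, -3), (2, 4), (-3, -4), (1, 2), (1, -4)]
distinguished = distinguished_cells(PRIMES, 4)
essential = set(distinguished.values())
print(sorted(distinguished.items()))
```

Under this reading, the three essential prime implicants found this way already cover every 1-cell, so the remaining three prime implicants appear in no irredundant disjunctive form.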

Maps for Five and Six Variables The Karnaugh map is most useful for functions of four or fewer variables, but it is often useful also for functions of five or six variables. A map for five variables consists of two four-variable maps, as shown in Fig. 25.4, one for each value of the first variable. A map for six variables consists of four four-variable maps, as shown in Fig. 25.5, one for each combination of values of the first two variables. Note that the four maps in Fig. 25.5 are arranged so that binary numbers represented by x1 and x2 differ in only one bit horizontally and vertically (the map for x1 = x2 = 1 goes to the bottom right-hand side).


FIGURE 25.3 A map with don’t-care conditions.

FIGURE 25.4 Karnaugh map for five variables.

FIGURE 25.5 Karnaugh map for six variables.


In a five-variable map, 1-cells are combined in the same way as in the four-variable case, with the additional feature that rectangular loops of 2^i 1-cells that are on different four-variable maps can be combined to form a greater loop replacing the two original loops only if they occupy the same relative position on their respective four-variable maps. Notice that these loops may be inside other loops in each four-variable map. For example, if f15 and f31 are 1, they can be combined; but even if f15 = f29 = 1, f15 and f29 cannot. In a six-variable map, only 1-cells in two maps that are horizontally or vertically adjacent can be combined if they are in the same relative positions. In Fig. 25.5, for example, if f5 and f37 are 1, they can be combined; but even if f5 and f53 are 1, f5 and f53 cannot. Also, four 1-cells that occupy the same relative positions in all four-variable maps can be combined as representing a single product. For example, f5, f21, f37, and f53 can be combined if they are 1.

In the case of a five-variable map, we can find prime implicant loops as follows.

Procedure 25.2: Derivation of Minimal Sums on a Five-Variable Map

1. Unlike Step 1 of Procedure 25.1 for a function of four variables, this step requires the following two substeps to form prime implicant loops:

a. On each four-variable map, encircle all the 1-cells with rectangles, each of which consists of 2^i 1-cells, choosing the number of 1-cells contained in each rectangle as large as possible. Unlike the case of Procedure 25.1, these loops are not necessarily prime implicant loops because they may not represent prime implicants, depending on the outcome of substep b. In Figs. 25.6 and 25.7, for example, loops formed in this manner are shown with solid lines.

The prime implicants of this function are x 1 x 2 x 5 and x2x3x5.

FIGURE 25.6 Prime implicant loops on a map for five variables.

The prime implicants of this function are x 1 x 3 x 5 , x2x3x5, x2x4x5, and x1x2x3.

FIGURE 25.7 Prime implicant loops on a map for five variables.


b. On each four-variable map, encircle all the 1-cells with rectangles, each of which consists of 2^i 1-cells in exactly the same relative position on the two maps, choosing i as great as possible. Then connect each pair of the corresponding loops with an arc. On each four-variable map, some of these loops may be inside some loops formed in substep a. In Fig. 25.6, one pair of loops that is in the same relative position is newly formed. One member of the pair, shown in a dotted line, is contained inside a loop formed in substep a. The pair is connected with an arc. The other loop coincides with a loop formed in substep a. In Fig. 25.7, two pairs of loops are formed: one pair is newly formed, as shown in dotted lines, and the second pair is the connection of loops formed in substep a.

The loops formed in substep b and also those formed in substep a but not contained in any loop formed in substep b are prime implicant loops, since they correspond to prime implicants. In Fig. 25.6, the loop formed in substep a, which represents x 1 x 2 x 3 x 5 , is contained in the prime implicant loop formed in substep b, which represents x2x3x5. Thus, the former loop is not a prime implicant loop, and consequently x 1 x 2 x 3 x 5 is not a prime implicant.

2. Processes for deriving irredundant disjunctive forms and minimal sums are the same as Steps 2 and 3 of Procedure 25.1.

In the case of a six-variable map, the derivation of prime implicant loops requires more comparisons of four-variable maps, as follows.

Procedure 25.3: Derivation of Minimal Sums on a Six-Variable Map

1. Derivation of all prime implicant loops requires the following three substeps:

a. On each four-variable map, encircle all the 1-cells with rectangles, each of which consists of 2^i 1-cells, choosing i as great as possible.

b. Find the rectangles (each of which consists of 2^i 1-cells) occupying the same relative position on every two adjacent four-variable maps, choosing i as great as possible. (Two maps in diagonal positions are not adjacent.) Thus, we need four comparisons of two maps (i.e., upper two maps, lower two maps, left two maps, and right two maps).

c. Then find the rectangles (each of which consists of 2^i 1-cells) occupying the same relative position on all four-variable maps, choosing i as great as possible.

Prime implicant loops are loops formed at substep c, loops formed at substep b but not contained inside those formed at substep c, and loops formed at substep a but not contained inside those formed at substep b or c.

2. Processes for deriving irredundant disjunctive forms and minimal sums are the same as Steps 2 and 3 of Procedure 25.1.

An example is shown in Fig. 25.8. Irredundant disjunctive forms and minimal sums are derived in the same manner as in the case of four variables. Procedures 25.1, 25.2, and 25.3 can be extended to the cases of seven or more variables with increasing complexity. It is usually hard to find a minimal sum, however, because each prime implicant loop consists of 1-cells scattered in many maps.
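The rule for combining 1-cells that lie on different four-variable maps amounts to requiring that the cells form a cube: 2^k cells that agree in all but k variables. The sketch below checks the combinations mentioned in the text. It assumes that the subscript of f_j is the decimal value of the input vector (x1, …, xn) read as a binary number with x1 as the most significant bit, which is consistent with the examples given but is not stated in this section; the name `can_combine` is hypothetical.

```python
def can_combine(cell_indices):
    """True if the cells form a cube: 2**k distinct cells differing in exactly k bit positions."""
    cells = set(cell_indices)
    k = len(cells).bit_length() - 1
    if len(cells) != 1 << k:          # the number of cells must be a power of two
        return False
    base = next(iter(cells))
    differing = 0
    for c in cells:
        differing |= c ^ base         # collect every bit position in which cells disagree
    return bin(differing).count("1") == k

print(can_combine([15, 31]), can_combine([15, 29]))   # five-variable examples
print(can_combine([5, 37]), can_combine([5, 53]))     # six-variable examples
print(can_combine([5, 21, 37, 53]))                   # same position on all four maps
```

Two cells on diagonal maps, such as f5 and f53, differ in both x1 and x2, so the check fails for the pair, although the same two cells can be part of a valid group of four.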

25.3 Derivation of Minimal Sums for a Single Function by Other Means

Derivation of a minimal sum for a single function by Karnaugh maps is convenient for manual processing because designers can understand the nature of logic networks better than with automated minimization (to be described in Chapter 27), but its usefulness is limited to functions of four or five variables.


FIGURE 25.8 Prime implicant loops on a map for six variables.

There are several methods for derivation of a minimal sum for a single function. A minimal sum can be found by forming a so-called covering table, where the minterms of a given function f are listed on the horizontal coordinate and all the prime implicants of f are listed on the vertical coordinate. A minimal sum can be described as a solution, that is, a minimal set of prime implicants that covers all the minterms [2]. The feasibility of this approach based on the covering table is limited by the number of minterms and prime implicants rather than the number of variables, which is the limiting factor for the feasibility of the derivation of minimal sums based on Karnaugh maps. This approach based on the table can be converted to an algebraic method, with prime implicants derived by the Tison method, as described by Procedure 4.6.1 in Ref. 3. This is much faster than the approach based on the table. Another approach is to generate irredundant disjunctive forms and then derive a minimal sum among them [4, 5]. The feasibility of this approach is limited by the number of consensuses rather than the number of variables, minterms, or prime implicants. As the number of variables, minterms, or prime implicants increases, the derivation of absolute minimal sums or even prime sums becomes too time-consuming, although the enhancement of the feasibility has been explored [1]. When exact derivation is too time-consuming, we need to resort to heuristic minimization, as will be described in Chapter 27 and Chapter 42, Section 42.3.

25.4 Prime Implicates, Irredundant Conjunctive Forms, and Minimal Products The “implicates,” “irredundant conjunctive forms,” and “minimal products,” which can be defined all based on the concept of conjunctive form, will be useful for deriving a minimal network that has OR gates in the first level and one AND gate in the second level. First, let us represent the maxterm expansion of a given function f on a Karnaugh map. Unlike the map representation of the minterm expansion of f, where each minterm contained in the expansion is represented by a 1-cell on a Karnaugh map, each maxterm is represented by a 0-cell. Suppose that a function f is given as shown by the 1-cells in Fig. 25.9(a). This function can be expressed in the maxterm expansion:


FIGURE 25.9 Representation of a function f on a map with alterms.

f = ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 ) ( x1 ∨ x2 ∨ x3 ∨ x4 )

The first maxterm, (x̄1 ∨ x2 ∨ x3 ∨ x4), for example, is represented by the 0-cell that has components (x1, x2, x3, x4) = (1 0 0 0) in the Karnaugh map in Fig. 25.9(a) (i.e., (x̄1 ∨ x2 ∨ x3 ∨ x4) becomes 0 for x1 = 1, x2 = x3 = x4 = 0). Notice that each literal in the maxterm is complemented or noncomplemented, corresponding to 1 or 0 in the corresponding component, respectively, instead of corresponding to 0 or 1 in the case of a minterm. All other maxterms are similarly represented by 0-cells. It may look somewhat strange that these 0-cells represent the f expressed by 1-cells on the Karnaugh map in Fig. 25.9(a). But the first maxterm, (x̄1 ∨ x2 ∨ x3 ∨ x4), for example, assumes the value 0 only for (x1, x2, x3, x4) = (1 0 0 0) and assumes the value 1 for all other combinations of the components. The situation is similar with other maxterms. Therefore, the conjunction of these maxterms becomes 0 only when any of the maxterms becomes 0. Thus, these 0-cells represent f through its maxterms. This is what we discussed earlier about the maxterm expansion or the Π-decimal specification of f, based on the false vectors of f.

Like the representation of a product on a Karnaugh map, any alterm can be represented by a rectangular loop of 2^i 0-cells by repeatedly combining a pair of 0-cell loops that are horizontally or vertically adjacent in the same rows or columns, where i is a non-negative integer. For example, the two adjacent 0-cells in the same column representing maxterms (x̄1 ∨ x2 ∨ x3 ∨ x4) and (x̄1 ∨ x2 ∨ x3 ∨ x̄4) can be combined to form alterm x̄1 ∨ x2 ∨ x3, by deleting literals x4 and x̄4, as shown in a solid-line loop in Fig. 25.9(b). The function f in Fig. 25.9 can be expressed in the conjunctive form f = ( x 1 ∨ x2 ∨ x3)(x1 ∨ x 2 ∨ x3)(x1 ∨ x 4 )( x 1 ∨ x 2 ∨ x 3 ∨ x4), using a minimum number of such loops. The alterms in this expansion are represented by the loops shown in Fig. 25.9(b).

Definition 25.4: An implicate of f is an alterm implied by a function f.

Notice that an implicate’s relationship with f is opposite to that of an implicant that implies f. For example, (x1 ∨ x2 ∨ x3) and (x1 ∨ x2 ∨ x̄3) are implicates of function f = x1 ∨ x2, since whenever f = 1, both (x1 ∨ x2 ∨ x3) and (x1 ∨ x2 ∨ x̄3) become 1. The implication relationship between an alterm P and a function f can sometimes be more conveniently found, however, by using the property “f implies P if f = 0 whenever P = 0,” which is a restatement of the property “f implies P if P = 1 whenever f = 1” defined earlier (we can easily see this by writing a truth table for f and P). Thus, (x1 ∨ x3), (x1 ∨ x 2 ), and (x1 ∨ x2 ∨ x3) are implicates of f = (x1 ∨ x 2 )( x 2 ∨ x3), because when (x1 ∨ x3), (x1 ∨ x 2 ), or (x1 ∨ x2 ∨ x3) is 0, f is 0. [For example, when (x1 ∨ x3) is 0, x1 = x3 = 0 must hold. Thus, f = ( x 2 ) (x2) = 0. Consequently, (x1 ∨ x3) is an alterm implied by f, although this is not obvious from the given expressions of f and (x1 ∨ x3).] Also, (x1 ∨ x 2 ∨ x4) and (x1 ∨ x2 ∨ x3 ∨ x4), which contain the literals of a dummy variable x4 of this f, are implicates of f.
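The property “f implies P if f = 0 whenever P = 0” gives a direct truth-table test for implicates. The following sketch is an illustration under assumptions: the signed-integer encoding of alterms and the name `is_implicate` are hypothetical, and f is read as (x1 ∨ x̄2)(x2 ∨ x3), the reading consistent with the bracketed remark that f reduces to a variable times its complement when x1 = x3 = 0.

```python
from itertools import product

def alterm_value(alterm, bits):
    """Value of a disjunction of signed literals (+i is x_i, -i its complement)."""
    return any(bits[abs(lit) - 1] == (lit > 0) for lit in alterm)

def is_implicate(alterm, f, n):
    """True if f = 0 for every assignment of the n variables that makes the alterm 0."""
    for bits in product((0, 1), repeat=n):
        if not alterm_value(alterm, bits) and f(bits):
            return False
    return True

# Assumed reading: f = (x1 v x2')(x2 v x3), with x4 as a dummy variable.
f = lambda b: (b[0] or not b[1]) and (b[1] or b[2])

for alterm in [(1, 3), (1, -2), (1, 2, 3), (1, -2, 4), (2,)]:
    print(alterm, is_implicate(alterm, f, 4))
```

The single-literal alterm x2 is included only as a counterexample: it is 0 at (x1, x2, x3) = (0, 0, 1), where f is 1.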


Definition 25.5: An alterm P is said to subsume another alterm Q if all the literals in Q are among the literals of P.

The alterm (x1 ∨ x 2 ∨ x3 ∨ x4) subsumes ( x 2 ∨ x3). Summarizing the two definitions of “subsume” for “alterms” in Definition 25.5 and “terms” in Definition 24.6, we have that “P subsumes Q” simply means “P contains all the literals in Q,” regardless of whether P and Q are terms or alterms. But the relationships between “subsume” and “imply” in the two cases are opposite. If an alterm P subsumes another alterm Q, then Q ⊆ P holds; whereas if a term P subsumes another term Q, then P ⊆ Q holds.

Definition 25.6: A prime implicate of a function f is defined as an implicate of f such that no other alterm subsumed by it is an implicate of f.

In other words, if deletion of any literal from an implicate of f makes the remainder not an implicate of f, the implicate is a prime implicate. For example, (x2 ∨ x3) and (x1 ∨ x3) are prime implicates of f = (x1 ∨ x 2 )(x2 ∨ x3)( x 1 ∨ x2 ∨ x3), but x1 ∨ x3 ∨ x4 is not a prime implicate of f, although it is still an implicate of this f. As seen from these examples, some implicates are not obvious from a given conjunctive form of f. Such an example is x1 ∨ x3 in the above example. As a matter of fact, x1 ∨ x3 can be obtained as the consensus by the following Definition 25.7 of two alterms, (x1 ∨ x 2 ) and (x2 ∨ x3), of f. Also notice that, unlike implicates, prime implicates cannot contain a literal of any dummy variable of f.

On a Karnaugh map, a loop for an alterm P that subsumes another alterm Q is contained in the loop for Q. For example, in Fig. 25.10, the loop for alterm (x1 ∨ x 3 ∨ x4), which subsumes alterm (x1 ∨ x 3 ), is contained in the loop for x1 ∨ x 3 . Thus, a rectangular loop that consists of 2^i 0-cells, with i as large as possible, represents a prime implicate of f.

FIGURE 25.10 An alterm that subsumes another alterm.

The consensus of two alterms, V and W, is defined in a manner similar to the consensus of two terms.

Definition 25.7: If there is exactly one variable, say x, appearing noncomplemented in one alterm and complemented in the other (i.e., if two alterms, V and W, can be written as V = x ∨ V′ and W = x̄ ∨ W′, where V′ and W′ are alterms free of literals of x), the disjunction of all literals except those of x (i.e., V′ ∨ W′ with duplicate literals deleted) is called the consensus of the two alterms V and W.

For example, when V = x ∨ y ∨ z ∨ u and W = x ∨ y ∨ u ∨ v are given, their consensus is y ∨ z ∨ u ∨ v . On a Karnaugh map, a consensus is represented by the largest rectangular loop of 2^i 0-cells that intersects, and is contained within, two adjacent loops of 0-cells that represent two alterms. For example, in Fig. 25.11 the dotted-line loop represents the consensus of two alterms, ( x 2 ∨ x 4 ) and ( x 1 ∨ x 2 ∨ x 3 ) , which are represented by the two adjacent loops.


FIGURE 25.11

Consensus of two adjacent alterms.
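The consensus operation of Definition 25.7 is easy to state in code. The following is a minimal sketch in Python; the signed-integer encoding of literals (+i for a variable, −i for its complement) and the function name are our illustrative assumptions.

```python
def consensus(v, w):
    """Return the consensus of alterms v and w, or None if it is undefined.

    Alterms are frozensets of signed integers: +i is a variable, -i its complement.
    The consensus exists only if exactly one variable appears with opposite
    polarity in the two alterms.
    """
    opposed = [lit for lit in v if -lit in w]
    if len(opposed) != 1:
        return None
    x = opposed[0]
    # Disjunction of all remaining literals; the set union removes duplicates.
    return (v - {x}) | (w - {-x})

# V = x v y v z v u and W = x' v y v u v v, with x, y, z, u, v numbered 1..5.
V = frozenset({1, 2, 3, 4})
W = frozenset({-1, 2, 4, 5})
example = consensus(V, W)
```

The same function applies unchanged to terms, because the definition of the consensus depends only on the literals and not on whether they are joined by conjunction or disjunction.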

All prime implicates of a function f can be algebraically obtained from a conjunctive form for f by modifying the Tison method discussed in Procedure 24.1; that is, by using dual operations in the method. We can define the following concepts, which are dual to irredundant disjunctive forms, minimal sums, absolute minimal sums, essential prime implicants, complete sums, and others. Definition 25.8: An irredundant conjunctive form for a function f is a conjunction of prime implicates such that removal of any of them makes the remainder not express f. The minimal products are irredundant conjunctive forms for f with a minimum number of prime implicates. The absolute minimal products are minimal products with a minimum number of literals. Prime implicates that appear in every irredundant conjunctive form for f are called essential prime implicates of f. Conditionally eliminable prime implicates are prime implicates that appear in some irredundant conjunctive forms for f, but not in others. Absolutely eliminable prime implicates are prime implicates that do not appear in any irredundant conjunctive form for f. The complete product for a function f is the product of all prime implicates of f.

25.5 Derivation of Minimal Products by Karnaugh Map

Minimal products can be derived by the following procedure, based on a Karnaugh map.

Procedure 25.4: Derivation of Minimal Products on a Karnaugh Map

Consider the case of a map for four or fewer variables (cases for five or more variables are similar, using more than one four-variable map).

1. Encircle 0-cells, instead of 1-cells, with rectangular loops, each of which consists of 2^i 0-cells, choosing i as large as possible. These loops are called prime implicate loops because they represent prime implicates (not prime implicants). Examples of prime implicate loops are shown in Fig. 25.12 (including the dotted-line loop). The prime implicate corresponding to a loop is formed by making a disjunction of literals, instead of a conjunction, using a noncomplemented variable corresponding to 0 of a coordinate of the map and a complemented variable corresponding to 1. (Recall that, in forming a prime implicant, variables corresponding to 0’s were complemented and those corresponding to 1’s were noncomplemented.) Thus, corresponding to the loops of Fig. 25.12, we get the prime implicates, ( x 1 ∨ x 4 ) , ( x 1 ∨ x 2 ∨ x 3 ) and ( x 2 ∨ x 3 ∨ x 4 ) .

2. Each irredundant conjunctive form is derived by the conjunction of prime implicates corresponding to a set of loops, so that removal of any loop leaves some 0-cells uncovered by loops. An example is the set of two solid-line loops in Fig. 25.12, from which the irredundant conjunctive form ( x 1 ∨ x 4 ) ( x 1 ∨ x 2 ∨ x 3 ) is derived.


FIGURE 25.12

Prime-implicate loops and the corresponding prime implicates.

3. Among sets of a minimum number of prime implicate loops, the sets that contain as many of the largest loops as possible yield minimal products.

The don’t-care conditions are dealt with in the same manner as in the case of minimal sums. In other words, whenever possible, we can form a larger prime implicate loop by interpreting some d’s as 0-cells. Any prime-implicate loops consisting of only d-cells need not be formed.

Procedure 25.4 can be extended to five or more variables in the same manner as Procedures 25.2 and 25.3, although the procedure will be increasingly complex.
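For functions of a few variables, the map-based procedure can be imitated by exhaustive search. The following Python sketch is only an illustration under our own assumptions: alterms are frozensets of signed integers (+i for xi, −i for its complement), the function names are hypothetical, f is assumed not to be a constant, and don’t-cares are not handled. It first enumerates all prime implicates (the largest loops of 0-cells) and then selects covers of the 0-cells that use the fewest prime implicates and, among those, the fewest literals.

```python
from itertools import combinations, product

def value(alterm, cell):
    """Value of an alterm at a map cell; cell[i-1] is the value of x_i."""
    return any(cell[abs(lit) - 1] == (lit > 0) for lit in alterm)

def prime_implicates(f, n):
    """Enumerate all prime implicates of f(x1..xn) by brute force."""
    ones = [c for c in product((0, 1), repeat=n) if f(*c)]

    def implicate(alterm):
        return all(value(alterm, c) for c in ones)

    found = []
    for signs in product((0, 1, -1), repeat=n):   # each variable: absent, plain, complemented
        alterm = frozenset(s * (i + 1) for i, s in enumerate(signs) if s)
        if alterm and implicate(alterm) and not any(
                implicate(alterm - {lit}) for lit in alterm):
            found.append(alterm)
    return found

def minimal_products(f, n):
    """Covers of all 0-cells with the fewest prime implicates, then the fewest literals."""
    primes = prime_implicates(f, n)
    zeros = [c for c in product((0, 1), repeat=n) if not f(*c)]
    for k in range(1, len(primes) + 1):
        covers = [s for s in combinations(primes, k)
                  if all(any(not value(a, c) for a in s) for c in zeros)]
        if covers:
            fewest = min(sum(len(a) for a in s) for s in covers)
            return [set(s) for s in covers if sum(len(a) for a in s) == fewest]
    return []

# Example from Section 25.4: f = (x1 v x2')(x2 v x3)(x1' v x2 v x3).
def f(x1, x2, x3):
    return (x1 or not x2) and (x2 or x3) and (not x1 or x2 or x3)

primes = set(prime_implicates(f, 3))
best = minimal_products(f, 3)
```

For this example the search finds the three prime implicates (x1 ∨ x̄2), (x2 ∨ x3), and (x1 ∨ x3), and a single minimal product, (x1 ∨ x̄2)(x2 ∨ x3). The search is exponential in the number of variables, so it is useful only as a check on small maps.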

References
1. Coudert, O., “Two-Level Logic Minimization: An Overview,” Integration, vol. 17, pp. 97-140, Oct. 1994.
2. McCluskey, E. J., “Minimization of Boolean functions,” Bell System Tech. J., vol. 35, no. 5, pp. 1417-1444, Nov. 1956.
3. Muroga, S., Logic Design and Switching Theory, John Wiley & Sons, 1979 (now available from Krieger Publishing Co.).
4. Tison, P., Ph.D. dissertation, Faculty of Science, University of Grenoble, France, 1965.
5. Tison, P., “Generalization of consensus theory and application to the minimization of Boolean functions,” IEEE Tr. Electron. Comput., pp. 446-456, Aug. 1967.


Minato, S., Muroga, S."Binary Decision Diagrams" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


26 Binary Decision Diagrams Shin-ichi Minato NTT Network Innovation Laboratories

Saburo Muroga University of Illinois at Urbana-Champaign

26.1 Basic Concepts 26.2 Construction of BDD Based on a Logic Expression Binary logic operation • Complement Edges • Derivation of a Logic Expression from a BDD

26.3 Data Structure 26.4 Ordering of Variables for Compact BDDs 26.5 Remarks

26.1 Basic Concepts

Binary decision diagrams (BDDs), which were introduced in Chapter 23, Section 23.1, are a powerful means for computer processing of logic functions because, in many cases, BDDs require less memory to store logic functions and allow function values to be calculated faster than truth tables or logic expressions do. As logic design has come to be done with computers in recent years, BDDs have been used extensively because of these features. BDDs are used in computer programs for automation of logic design, verification (i.e., identifying whether two logic networks represent the identical logic functions), diagnosis of logic networks, simplification of transistor circuits (such as ECL and MOSFET circuits, as explained in Chapters 35, 36, and 37), and other areas, including those not related to logic design. In this chapter, we discuss the basic data structures and algorithms for manipulating BDDs. Then we describe the variable ordering problem, which is important for the effective use of BDDs.

The concept of the BDD was devised by Lee in 1959 [15]. The binary decision programs that Lee discussed are essentially binary decision diagrams. Then, in 1978, Akers showed their usefulness for expressing logic functions [3,4]. Since Bryant [6] developed a reduction method in 1986, BDDs have been used extensively in design automation for logic design and related areas.

From a truth table, we can easily derive the corresponding binary decision diagram. For example, the truth table shown in Fig. 26.1(a) can be converted into the BDD in Fig. 26.1(b). But there are generally many BDDs for a given truth table, that is, for the logic function expressed by this truth table. For example, all BDDs shown in Fig. 26.1(c) through (e) represent the logic function that the truth table in Fig. 26.1(a) expresses. Here, note that in each of Figs. 26.1(b), (c), and (d), the variables appear in the same order and none of them appears more than once in any path from the top node. But in (e), they appear in different orders in different paths. The BDDs in Figs. 26.1(b), (c), and (d) are called ordered BDDs (or OBDDs), whereas the BDD in Fig. 26.1(e) is called an unordered BDD. These BDDs can be reduced into a simple BDD by the procedure described below (Procedure 26.1).

In a BDD, the top node is called the root; it represents the given function f(x1, x2, …, xn). Rectangles that have 1 or 0 inside are called terminal nodes. They are also called the 1-terminal and the 0-terminal.


FIGURE 26.1

Truth table and BDDs for x1x2 ∨ x 2 x 3 .

Other nodes are called non-terminal nodes denoted by circles with variables inside. They are also called simply nodes, differentiating themselves from the 0- and 1-terminals. From a node with xi inside, two lines go down. The solid line is called 1-edge, representing xi = 1; and the dotted line is called 0-edge, representing xi = 0. In an OBDD, the value of the function f can be evaluated by following a path of edges from the root node to one of the terminal nodes. If the nodes in every path from the root node to a terminal node are assigned with variables, x1, x2, x3, …, and xn, in this order, then f can be expressed as follows, according to the Shannon expansion (described in Chapter 24). By branching with respect to x1 from the root node, f(x1, x2, …, xn) can be expanded as follows, where f(0, x2, …, xn) and f(1, x2, …, xn) are functions that the nodes at the low ends of 0-edge and 1-edge from the root represent, respectively:

$$f(x_1, x_2, \ldots, x_n) = \bar{x}_1 \, f(0, x_2, \ldots, x_n) \vee x_1 \, f(1, x_2, \ldots, x_n)$$

Then, by branching with respect to x2 from each of the two nodes that f(0, x2, …, xn) and f(1, x2, …, xn) represent, each of f(0, x2, …, xn) and f(1, x2, …, xn) can be expanded as follows:

$$f(0, x_2, \ldots, x_n) = \bar{x}_2 \, f(0, 0, x_3, \ldots, x_n) \vee x_2 \, f(0, 1, x_3, \ldots, x_n)$$

and

$$f(1, x_2, \ldots, x_n) = \bar{x}_2 \, f(1, 0, x_3, \ldots, x_n) \vee x_2 \, f(1, 1, x_3, \ldots, x_n)$$

And so on. As we go down from the root node toward the 0- or 1-terminal, more variables of f are set to 0 or 1. The factors that remain when the literal xi or x̄i is excluded from each term of these expansions [i.e., f(0, x2, …, xn) and f(1, x2, …, xn) in the first expansion, f(0, 0, x3, …, xn) and f(0, 1, x3, …, xn) in the second expansion, etc.] are called cofactors. Each of the nodes at the lower ends of the 0-edge and the 1-edge from a node in an OBDD represents a cofactor of the Shannon expansion of the logic function at the node from which these edges come down.

Procedure 26.1: Reduction of a BDD

1. For the given BDD, apply the following steps in any order.

a. If two nodes, va and vb, that represent the same variable xi, branch to the same nodes in a lower level for each of xi = 0 and xi = 1, then combine them into one node that still represents variable xi. In Fig. 26.2(a), two nodes, va and vb, that represent variable xi, go to the same node vu for xi = 0 and the same node vv for xi = 1. Then, according to Step a, the two nodes, va and vb, are merged into one node vab, as shown in Fig. 26.2(b). The node vab represents variable xi.

FIGURE 26.2

Step 1(a) of Procedure 26.1.

b. If a node that represents a variable xi branches to the same node in a lower level for both xi = 0 and xi = 1, then that node is deleted, and the 0- and 1-edges that come down to the former are extended to the latter. In Fig. 26.3(a), node va that represents variable xi branches to the same node vb for both xi = 0 and xi = 1. Then, according to Step b, node va is deleted, as shown in Fig. 26.3(b), because xi is a don’t-care at va, and the edges that come down to va are extended to the node to which va branched.

c. Terminal nodes with the same value, 0 or 1, are merged into a single terminal node with that value. All the 0-terminals in Fig. 26.1(b) are combined into one 0-terminal in each of Figs. 26.1(c), (d), and (e). All the 1-terminals are similarly combined.

2. When no step can be applied any longer after repeated application of steps (a), (b), and (c), the reduced ordered BDD (ROBDD), also called simply the reduced BDD (RBDD), is obtained for the given function.


FIGURE 26.3

Step 1(b) of Procedure 26.1.
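The effect of Procedure 26.1 can be reproduced in a few lines of code. The following Python sketch is not the procedure as stated (it does not rewrite an existing diagram); instead, under our own assumptions, it expands a function by the Shannon expansion in the order x1, x2, …, xn and applies the three reduction rules while the nodes are created. The node representation (a tuple of variable index, 0-edge target, 1-edge target, with the integers 0 and 1 as terminals) and all names are hypothetical choices for illustration.

```python
def build_robdd(f, n):
    """Build a reduced ordered BDD for f(x1, ..., xn) in the order x1, x2, ..., xn.

    Returns (root, unique), where unique holds every non-terminal node once.
    """
    unique = {}  # (var, low, high) -> the single shared node: Step 1(a)

    def mk(var, low, high):
        if low == high:              # Step 1(b): both edges reach the same node
            return low
        key = (var, low, high)
        return unique.setdefault(key, key)

    def expand(prefix):
        # prefix holds the values already chosen for x1 .. x_len(prefix)
        if len(prefix) == n:
            return int(bool(f(*prefix)))   # Step 1(c): only two terminals, 0 and 1
        low = expand(prefix + (0,))        # cofactor for the next variable = 0
        high = expand(prefix + (1,))       # cofactor for the next variable = 1
        return mk(len(prefix) + 1, low, high)

    return expand(()), unique

# Two different expressions for the same function, x1 v (x2 . x3).
root, nodes = build_robdd(lambda x1, x2, x3: (x2 and x3) or x1, 3)
same_root, _ = build_robdd(lambda x1, x2, x3: (x1 or x2) and (x1 or x3), 3)
```

Because the expansion visits all 2^n assignments, this sketch has the same exponential cost as starting from a truth table; it is meant only to make the reduction rules concrete. The two expressions above yield structurally identical diagrams with three non-terminal nodes.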

The following process is faster than the repeated application of Step 1(a) of Procedure 26.1. Suppose a BDD contains two nodes, va and vb, both of which represent xi, such that a sub-BDD, B1, that stretches downward from va is completely identical to another sub-BDD, B2, that stretches downward from vb. In this case, each sub-BDD is said to be isomorphic to the other. Then the two sub-BDDs can be merged. For example, the two sub-BDDs, B1 and B2, shown in the two dotted rectangles in Fig. 26.4(a), both representing x4, can be merged into one, B3, as shown in Fig. 26.4(b).

FIGURE 26.4

Merger of two isomorphic sub-BDDs.

Theorem 26.1: Any completely or incompletely specified function has a unique reduced ordered BDD and any other ordered BDD for the function in the same order of variables (i.e., not reduced) has more nodes.


According to this theorem, the ROBDD is unique for a given logic function when the order of the variables is fixed. (A BDD that has don’t-cares can be expressed with addition of the d-terminal or two BDDs for the ON-set and OFF-set completely specified functions, as explained with Fig. 23.5. For details, see Ref. 24.) Thus, ROBDDs give canonical forms for logic functions. This property is very important in practical applications, as we can easily check the equivalence of two logic functions by only checking the isomorphism of their ROBDDs. Henceforth in this chapter, ROBDDs will be referred to as BDDs for the sake of simplicity.

It is known that a BDD for an n-variable function requires an exponentially increasing memory space in the worst case, as n increases [16]. However, the size of memory space for the BDD varies with the types of functions, unlike truth tables, which always require memory space proportional to 2^n. Many logic functions that we encounter in design practice can be represented by BDDs without large memory space [14]. This is an attractive feature of BDDs. Fig. 26.5 shows the BDDs representing typical functions, AND, OR, and the parity function with three and n input variables. The parity function for n variables, x1 ⊕ x2 ⊕ … ⊕ xn, can be represented by a BDD of 2(n − 1) + 1 nodes and 2 terminals; whereas, if a truth table or a logic expression is used, the size increases exponentially as n increases.

FIGURE 26.5

BDDs for typical logic functions.
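The node counts quoted for these typical functions can be verified for small n. The following Python sketch (the function names and the tuple representation of nodes are our assumptions, and no complement edges are used) counts the non-terminal nodes of the reduced BDD obtained by exhaustive Shannon expansion.

```python
from functools import reduce
from operator import xor

def robdd_size(f, n):
    """Count the non-terminal nodes of the reduced BDD of f for the order x1..xn.

    f takes one tuple of n bits. Nodes are tuples (var, low, high).
    """
    unique = set()

    def expand(prefix):
        if len(prefix) == n:
            return int(bool(f(prefix)))
        low = expand(prefix + (0,))
        high = expand(prefix + (1,))
        if low == high:                  # redundant test: skip the node
            return low
        node = (len(prefix) + 1, low, high)
        unique.add(node)                 # equal tuples are stored only once
        return node

    expand(())
    return len(unique)

def parity(bits):
    return reduce(xor, bits)

# Sizes for n = 2 .. 8: parity needs 2(n - 1) + 1 nodes, AND and OR need n nodes.
sizes = {n: (robdd_size(parity, n), robdd_size(all, n), robdd_size(any, n))
         for n in range(2, 9)}
```

The parity BDD has one node at the top level and two nodes at every other level, which gives the count 2(n − 1) + 1.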

A set of BDDs representing many functions can be combined into a graph that consists of BDDs sharing sub-graphs among them, as shown in Fig. 26.6. This idea saves the time and space that duplicating isomorphic BDDs would require. By sharing all the isomorphic sub-graphs completely, no two nodes that express the same function coexist in the graph. Such a graph is called a shared BDD (SBDD) [25], or a multi-rooted BDD. In a shared BDD, no two root nodes express the same function. Shared BDDs are now widely used, and the algorithms for them are more concise than those for ordinary BDDs. In the remainder of this chapter, we deal with shared BDDs only.

f1 = x1 ⋅ x2 , f2 = x1 ⊕ x2 , f3 = x2 , f4 = x1 ∨ x2

FIGURE 26.6

A shared BDD.


26.2 Construction of BDD Based on a Logic Expression

Procedure 26.1 shows a way of constructing compact BDDs from the truth table for a function f of n variables. This procedure, however, is not efficient because the size of its initial BDD is always of the order of 2^n, even for a very simple function. To avoid this problem, Bryant [6] presented a method to construct BDDs by applying a sequence of logic operations in a logic expression. Figure 26.7 shows a simple example of constructing a BDD for f = (x2 ⋅ x3) ∨ x1. First, trivial BDDs for x1, x2, and x3 are created in Fig. 26.7(a). Next, by applying the AND operation between x2 and x3, the BDD for x2 ⋅ x3 is generated. Then, the BDD for the entire expression is obtained as the result of the OR operation between (x2 ⋅ x3) and x1. After deleting the nodes that are not on the paths from the root node for f toward the 0- or 1-terminal, the final BDD is shown in Fig. 26.7(b).

FIGURE 26.7

Generation of BDDs for f = (x2 ⋅ x3) ∨ x1.

In the following, we show a formal algorithm for constructing BDDs for an arbitrary logic expression. This algorithm is generally far more efficient than that based on a truth table.

Binary logic operation

Suppose we perform a binary operation between two functions, g and h, and this operation is denoted by g ∘ h, where “∘” is one of OR, AND, Exclusive-OR, and others. Then by the Shannon expansion of a function explained previously, g ∘ h can be expanded as follows:

$$g \circ h = \bar{x}_i \cdot \left( g|_{x_i = 0} \circ h|_{x_i = 0} \right) \vee x_i \cdot \left( g|_{x_i = 1} \circ h|_{x_i = 1} \right)$$

with respect to variable xi, which is the variable of the node that is in the highest level among all the nodes in g and h. This expansion creates the 0- and 1-edges which go to the next two nodes from the node for the variable xi, by the operations (g|xi=0 ∘ h|xi=0) and (g|xi=1 ∘ h|xi=1). The two subgraphs whose roots are these two nodes represent the cofactors derived by the Shannon expansion, (g|xi=0 ∘ h|xi=0) and (g|xi=1 ∘ h|xi=1). Repeating this expansion for all the variables, as we go down toward the 0- or 1-terminal, we eventually reach trivial operations, such as g ⋅ 1 = g, g ⊕ g = 0, and h ∨ 0 = h, finishing the construction of a BDD for g ∘ h.

When BDDs for functions g and h are given, we can derive a new BDD for function f = g ∘ h by the following procedure.

Procedure 26.2: Construction of a BDD for Function f = g ∘ h, Given BDDs for g and h

Given BDDs for functions g and h (e.g., Fig. 26.8(a)), let us construct a new BDD (e.g., Fig. 26.8(b)) with respect to the specified logic operation g ∘ h and then reduce it by the previous Procedure 26.1 (e.g., Fig. 26.8(d)).

FIGURE 26.8

Procedure of binary operation.

1. Starting with the root nodes for g and h and going down toward the 0- or 1-terminal, repeatedly apply steps a and b below, taking steps c and d into account.

a. When the node under consideration, say Na, in one of the two BDDs for g and h is in a higher level than the other node, Nb: If Na and Nb are for variables xa and xb, respectively, and if the 0-edge from Na for xa = 0 goes to node Na0 and the 1-edge from Na for xa = 1 goes to node Na1, then we create the following two new nodes, to which the 0- and 1-edges go down from the corresponding node N′a for the variable xa in the new BDD (i.e., N′a corresponds to this combination of Na and Nb), as a part of the new BDD: one new node for the operation ∘ between Na0 and Nb for xa = 0, and the other new node for the operation ∘ between Na1 and Nb for xa = 1. The variable for the first node (i.e., the node for Na0 ∘ Nb) will be the one in the higher level of the two nodes, Na0 and Nb, in the BDDs for g and h; and the variable for the second node (i.e., the node for Na1 ∘ Nb) will be the one in the higher level of the two nodes, Na1 and Nb, in the BDDs for g and h. In this case, creation of edges is not considered with respect to Nb in the original BDDs.

For example, suppose the binary operation g ∘ h is to be performed on the functions g and h in the BDD shown in Fig. 26.8(a) and, furthermore, “∘” is OR in this case. We need to start the OR operation with the root nodes for g and h (i.e., N8 and N7, respectively). N8 is in a higher level than N7. So, in Fig. 26.8(b), which explains how to construct the new BDD according to this procedure, we create two new nodes: one for OR(N4, N7) for the 0-edge for x1 = 0 from the node for OR(N8, N7), and the other for OR(N6, N7) for the 1-edge for x1 = 1 from the node for OR(N8, N7), corresponding to this step (a). Thus, we next need to form the OR between N4 and N7 and also the OR between N6 and N7 at these nodes in the second level in Fig. 26.8(b). These two nodes are both for variable x2. For OR(N4, N7), N7 is now in a higher level than N4. So, corresponding to this step (a), we create the new nodes for OR(N4, N2) and also for OR(N4, N5) for the 0- and 1-edges, respectively, in the third level in Fig. 26.8(b).

b. When the node under consideration, say Na, in one of the BDDs for g and h is in the same level as the other node, Nb (if Na is for a variable xa, these two nodes are for the same variable xa because both g and h have this variable): If the 0-edge from Na for xa = 0 goes to node Na0 and the 1-edge from Na for xa = 1 goes to node Na1, and if the 0-edge from Nb for xb = 0 goes to node Nb0 and the 1-edge from Nb for xb = 1 goes to node Nb1, then we create the following two new nodes, to which the 0- and 1-edges come down from the corresponding node N′a for the variable xa in the new BDD (i.e., N′a corresponds to this combination of Na and Nb), as a part of the new BDD: one new node for the operation ∘ between Na0 and Nb0 for xa = 0, and the other new node for the operation ∘ between Na1 and Nb1 for xa = 1. The variable for the first node (i.e., the node for Na0 ∘ Nb0) will be the one in the higher level of the two nodes, Na0 and Nb0, in the BDDs for g and h; and the variable for the second node (i.e., the node for Na1 ∘ Nb1) will be the one in the higher level of the two nodes, Na1 and Nb1, in the BDDs for g and h.

For the example in Fig. 26.8(b), we need to create two new nodes for the operations OR(N4, N2) and OR(N2, N5), to which the 0- and 1-edges go for x2 = 0 and x2 = 1, respectively, from the node OR(N6, N7), because N6 and N7 are in the same level for variable x2 in Fig. 26.8(a). These two nodes are both for variable x3.

c. In the new BDD for the operation g ∘ h, we need not have more than one subgraph that represents the same logic function. So, during the construction of the new BDD, whenever a new node (i.e., a root node of a subgraph) is for the same operation as an existing node, we need not continue creation of succeeding nodes from that new node.


For the example in Fig. 26.8(b), the node for operation OR(N4, N2) appears twice, as shown with a dotted arc, and we do not need to continue the creation of new nodes from the second and succeeding nodes for OR(N4, N2). OR(N2, N3) also appears twice, as shown with a dotted arc.

d. In the new BDD for the operation g ∘ h, if one of Na and Nb is the 0- or 1-terminal, or Na and Nb are the same (i.e., Na = Nb), then we can apply the logic operation that g ∘ h defines. If g ∘ h is the AND operation, for example, the node for AND(N1, Na) represents Na because N1 is the 1-terminal in Fig. 26.8(a), and furthermore, if Na represents function g, this node represents g. Also, it is important to notice that only this step (d) depends on the logic operation defined by g ∘ h, whereas all other steps, (a) through (c), are independent of the logic operation g ∘ h.

For the example in Fig. 26.8(b), the node for operation OR(N0, N2) represents N2, which is for variable x4. Also, the node for operation OR(N0, N1) represents the constant value 1 because N0 and N1 are the 0- and 1-terminals, respectively, and consequently 0 ∨ 1 = 1.

The complement f̄ of a function f can be computed by the binary operation f ⊕ 1, and its processing time is linearly proportional to the size of the BDD. However, this is improved to constant time by using the complement edge, which is discussed in the following. This technique is now commonly used.

2. Convert each node in the new BDD that is constructed in Step 1 to a node with the corresponding variable inside. Then derive the reduced BDD by Procedure 26.1. For the example, each node for operation OR(Na, Nb) in Fig. 26.8(b) is converted to a node with the corresponding variable inside in Fig. 26.8(c), which is reduced to the BDD in Fig. 26.8(d).
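The steps of Procedure 26.2 map naturally onto a short recursive routine. The following Python sketch is one possible rendering under our own assumptions, not the implementation of any particular BDD package: nodes are tuples (variable index, 0-edge target, 1-edge target), terminals are the integers 0 and 1, a smaller variable index means a higher level, and the names apply, shortcut, mk, and var are hypothetical. Steps (a) and (b) become the choice of cofactors, step (c) becomes a memo table keyed by the pair of operand nodes, and step (d) becomes the terminal shortcut. Reduction is applied as nodes are created instead of in a separate pass.

```python
from operator import and_, or_, xor

INF = float("inf")

def level(node):
    """Variable index of a node; terminals sit below every variable."""
    return INF if isinstance(node, int) else node[0]

def mk(var, low, high):
    """Create a node unless both edges agree (Step 1(b) of Procedure 26.1)."""
    return low if low == high else (var, low, high)

def var(i):
    """Trivial BDD for the single variable x_i."""
    return mk(i, 0, 1)

def shortcut(op, g, h):
    """Step (d): settle the operation early when an operand is a terminal."""
    if isinstance(g, int) and isinstance(h, int):
        return op(g, h)
    if isinstance(g, int):
        r0, r1 = op(g, 0), op(g, 1)
        if r0 == r1:
            return r0            # result is a constant, e.g. 0 AND h = 0
        if (r0, r1) == (0, 1):
            return h             # result is the other operand, e.g. 1 AND h = h
    if isinstance(h, int):
        r0, r1 = op(0, h), op(1, h)
        if r0 == r1:
            return r0
        if (r0, r1) == (0, 1):
            return g
    return None                  # no shortcut: keep expanding

def apply(op, g, h, memo=None):
    """Build the BDD for g op h from the BDDs for g and h."""
    memo = {} if memo is None else memo
    result = shortcut(op, g, h)
    if result is not None:
        return result
    if (g, h) in memo:           # step (c): this pair was already expanded
        return memo[(g, h)]
    top = min(level(g), level(h))
    # Steps (a)/(b): only an operand whose root is at the top level is split.
    g0, g1 = (g[1], g[2]) if level(g) == top else (g, g)
    h0, h1 = (h[1], h[2]) if level(h) == top else (h, h)
    result = mk(top, apply(op, g0, h0, memo), apply(op, g1, h1, memo))
    memo[(g, h)] = result
    return result

# f = (x2 . x3) v x1, the example of Fig. 26.7.
f = apply(or_, apply(and_, var(2), var(3)), var(1))
not_f = apply(xor, f, 1)         # complement computed as f XOR 1
```

Because equal tuples compare equal in Python, isomorphic subgraphs are automatically treated as the same node in this sketch; a table-based implementation would obtain the same effect with a hash table of existing nodes.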

Complement Edges

The complement edge is a technique to reduce the computation time and memory requirement of BDDs by using edges that indicate that the function of the subgraph pointed to by the edge is to be complemented, as shown in Fig. 26.9(a). This idea was first shown by Akers [3] and later discussed by Madre and Billon [18]. The use of complement edges brings the following outstanding merits.

• The BDD size is reduced by up to a half
• Negation can be performed in constant time
• Logic operations are sped up by applying rules such as f ⋅ f̄ = 0, f ∨ f̄ = 1, and f ⊕ f̄ = 1

Use of complement edges may break the uniqueness of BDDs. Therefore, we have to adopt the following two rules, as illustrated in Fig. 26.9(b):

FIGURE 26.9

Complement edges.


1. Using the 0-terminal only. 2. Not using a complement edge as the 0-edge of any node (i.e., use it on 1-edge only). If necessary, the complement edges can be carried over to higher nodes.
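The two rules can be enforced at the single point where nodes are created. The following Python sketch is only an illustration under our own assumptions: an edge is a pair (node, complemented flag), there is a single 0-terminal, and the names neg, mk, var, and evaluate are hypothetical. When a complement mark would land on a 0-edge, it is carried over to the incoming edge by complementing both outgoing edges.

```python
ZERO = ("0-terminal",)          # rule 1: the 0-terminal is the only terminal
FALSE = (ZERO, False)           # plain edge to the 0-terminal
TRUE = (ZERO, True)             # complemented edge to the 0-terminal

def neg(edge):
    """Negation in constant time: flip the complement mark on the edge."""
    node, complemented = edge
    return (node, not complemented)

def mk(var, low, high):
    """Create an edge to a node for var, keeping the 0-edge uncomplemented."""
    if low == high:
        return low
    if low[1]:
        # Rule 2: no complement mark on a 0-edge. Complement both outgoing
        # edges and carry the mark over to the edge that points to this node.
        return ((var, neg(low), neg(high)), True)
    return ((var, low, high), False)

def var(i):
    return mk(i, FALSE, TRUE)

def evaluate(edge, assignment):
    """Follow one path, counting complement marks modulo 2."""
    node, complemented = edge
    while node != ZERO:
        v, low, high = node
        node, c = high if assignment[v] else low
        complemented ^= c
    return complemented          # value of the 0-terminal, complemented or not

x1, x2 = var(1), var(2)
f_or = mk(1, x2, TRUE)           # x1 v x2
f_nor = neg(f_or)                # shares every node with f_or
```

With these rules, building the complement of x1 directly, mk(1, TRUE, FALSE), yields exactly the same edge as neg(var(1)), so uniqueness is preserved.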

Derivation of a Logic Expression from a BDD

A logic expression for f can be easily derived from a BDD for f. A path from the root node for f to the 1-terminal in a BDD is called a 1-path, where the values of the variables on the edges on this path make f equal 1. For each 1-path, a product of literals is formed by choosing xi for xi = 1 or its complement, x̄i, for xi = 0. The disjunction of such products for all 1-paths yields a logic expression for f. For example, in Fig. 26.8(a), the sequence of nodes, N8-N6-N4-N2-N1, is a 1-path, and the values of variables on this path, x1 = 1, x2 = 0, x3 = 1, and x4 = 1, make the value of f (g for this example) equal 1. Finding all 1-paths, a logic expression for f is derived as f = x1x̄2x3x4 ∨ x1x2x4 ∨ x̄1x3x4. It is important to notice that logic expressions that are derived from all 1-paths are usually not minimal sums for the given functions.
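Enumerating the 1-paths is a simple depth-first traversal. The following Python sketch uses our own conventions (nodes as tuples of variable index, 0-edge target, and 1-edge target; terminals as the integers 0 and 1; a complemented literal written with a leading "~"). The example diagram is our reading of the BDD for g in Fig. 26.8(a) as it is described in the text; in particular, the assumption that the 0-edges of N4 and N2 go to the 0-terminal is inferred from the three-term expression derived from that figure and is not stated explicitly.

```python
def one_paths(node, path=()):
    """Yield every 1-path as a tuple of (variable index, value) pairs."""
    if node == 1:
        yield path
    elif node != 0:
        var, low, high = node
        yield from one_paths(low, path + ((var, 0),))
        yield from one_paths(high, path + ((var, 1),))

def product_term(path):
    """Write a 1-path as a product of literals, '~' marking a complement."""
    return " ".join(("x%d" if value else "~x%d") % var for var, value in path)

# Assumed structure of g in Fig. 26.8(a) (see the lead-in above).
N2 = (4, 0, 1)        # x4: 1-edge to the 1-terminal
N4 = (3, 0, N2)       # x3: 1-edge to N2
N6 = (2, N4, N2)      # x2: 0-edge to N4, 1-edge to N2
N8 = (1, N4, N6)      # x1: 0-edge to N4, 1-edge to N6

terms = [product_term(p) for p in one_paths(N8)]
```

The disjunction of the resulting terms is the sum-of-products expression for the function; as noted above, it is usually not a minimal sum.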

26.3 Data Structure

In a typical realization of a BDD manipulator, all the nodes are stored in a single table in the main memory of the computer. Figure 26.10 is a simple example of realization for the BDD shown in Fig. 26.6. Each node has three basic attributes: the input variable and the next nodes accessed by the 0- and 1-edges. Also, the 0- and 1-terminals are first allocated in the table as special nodes. The other non-terminal nodes are created one by one during the execution of logic operations.

FIGURE 26.10

Table-based realization of a shared BDD.

Before creating a new node, we check the reduction rules of Procedure 26.1. If the 0- and 1-edges go to the same next node (Step 1(b) of Procedure 26.1) or if an equivalent node already exists (Step 1(a)), then we do not create a new node but simply use that node as the next node. To find an equivalent node, we check a table which lists all the existing nodes. The hash table technique is very effective in accelerating this check. (It can be done in constant time for BDDs of any size, unless the table overflows the main memory.)

When generating BDDs for logic expressions, as in Procedure 26.2, many intermediate BDDs are temporarily generated. It is important for memory efficiency to delete such unnecessary BDDs. In order to determine whether a node is still needed, a reference counter is attached to each node, which shows the number of incoming edges to the node.

In a typical implementation, the BDD manipulator consumes 20 to 30 bytes of memory for each node. Today, there are personal computers and workstations with more than 100 Mbytes of memory, and these allow us to generate BDDs containing as many as millions of nodes. However, the BDDs still grow beyond the memory capacity in some practical applications.
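The table-based realization described above can be sketched as follows. This is a minimal Python illustration under our own assumptions: the class and method names are hypothetical, a Python dictionary stands in for the hash table, and external references to root nodes are counted together with incoming edges. It is not the layout of any particular BDD package.

```python
class NodeTable:
    """All nodes of a shared BDD in one table; rows 0 and 1 are the terminals."""

    def __init__(self):
        # Each row is [variable, 0-edge target, 1-edge target, reference count].
        self.rows = [[None, None, None, 0], [None, None, None, 0]]
        self.index = {}          # hash table: (var, low, high) -> row number

    def get_node(self, var, low, high):
        """Return the row number of the node, creating it only if necessary."""
        if low == high:                       # Step 1(b): no node needed
            return low
        key = (var, low, high)
        if key in self.index:                 # Step 1(a): equivalent node exists
            return self.index[key]
        self.rows.append([var, low, high, 0])
        number = len(self.rows) - 1
        self.index[key] = number
        self.rows[low][3] += 1                # two new incoming edges
        self.rows[high][3] += 1
        return number

    def ref(self, number):
        """Register an external reference, e.g. a root node kept by the user."""
        self.rows[number][3] += 1

    def deref(self, number):
        """Drop one reference; free the node when nothing points to it."""
        row = self.rows[number]
        row[3] -= 1
        if number > 1 and row[3] == 0:        # terminals are never freed
            var, low, high, _ = row
            del self.index[(var, low, high)]
            self.rows[number] = None          # slot could be recycled via a free list
            self.deref(low)
            self.deref(high)

    def live_nodes(self):
        return sum(1 for row in self.rows[2:] if row is not None)

table = NodeTable()
x2 = table.get_node(2, 0, 1)
f_and = table.get_node(1, 0, x2)      # x1 . x2
f_or = table.get_node(1, x2, 1)       # x1 v x2
table.ref(f_and)
table.ref(f_or)
```

In this usage example the node for x2 is shared by both roots, so its reference count is 2; releasing the roots with deref frees the nodes that are no longer reachable.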


26.4 Ordering of Variables for Compact BDDs

BDDs are a canonical representation of logic functions under a fixed order of input variables. A change of the order of variables, however, may yield different BDDs of significantly different sizes for the same function. The effect of variable ordering depends on the logic function, and it sometimes changes the size of BDDs dramatically. Variable ordering is therefore an important problem in using BDDs. It is generally very time consuming to find the best order [30]. Currently known algorithms are limited to small BDDs with up to about 17 variables [13]. It is difficult to find the best order for larger problems in a reasonably short processing time. However, if we can find a fairly good order, it is useful for practical applications. There are many research works on heuristic methods of variable ordering. Empirically, the following properties are known about variable ordering.

1. Closely-related variables: Variables that are in close relationship in a logic expression should be close in variable order (e.g., x1 in x1 ⋅ x2 ∨ x3 ⋅ x4 is in closer relationship with x2 than x3). The 2-level logic network of AND-OR gates with 2n inputs shown in Fig. 26.11(a) has 2n nodes in the best order with the expression (x1 ⋅ x2) ∨ (x3 ⋅ x4) ∨ … ∨ (x_{2n−1} ⋅ x_{2n}), as shown for n = 2 in Fig. 26.11(b), while it needs (2^(n+1) − 2) nodes in the worst order, as shown in Fig. 26.11(c). If the same order of variables as the one in Fig. 26.11(b) is kept on the BDD, Fig. 26.11(c) represents the function (x1 ⋅ x_{n+1}) ∨ (x2 ⋅ x_{n+2}) ∨ … ∨ (xn ⋅ x_{2n}).

FIGURE 26.11

BDDs for 2-level logic network with AND/OR gates.
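The two node counts for this network can be checked for small n by building the reduced BDD under both orders. The following Python sketch is an illustration under our own assumptions (exhaustive Shannon expansion, tuple nodes, hypothetical function names, no complement edges); the "worst" order used here places all odd-indexed variables before all even-indexed ones, which separates the two inputs of every AND gate.

```python
def count_nodes(f, order):
    """Count non-terminal nodes of the reduced BDD of f for a variable order.

    f takes a dict {variable index: 0 or 1}; order lists the variable indices
    from the top level to the bottom level.
    """
    unique = set()

    def expand(i, assignment):
        if i == len(order):
            return int(bool(f(assignment)))
        v = order[i]
        low = expand(i + 1, {**assignment, v: 0})
        high = expand(i + 1, {**assignment, v: 1})
        if low == high:
            return low
        node = (v, low, high)
        unique.add(node)
        return node

    expand(0, {})
    return len(unique)

def and_or(n):
    """(x1 . x2) v (x3 . x4) v ... v (x_{2n-1} . x_{2n})."""
    return lambda a: any(a[2 * k - 1] and a[2 * k] for k in range(1, n + 1))

def best_order(n):
    return list(range(1, 2 * n + 1))                                    # x1, x2, x3, ...

def worst_order(n):
    return list(range(1, 2 * n, 2)) + list(range(2, 2 * n + 1, 2))      # odd, then even

results = {n: (count_nodes(and_or(n), best_order(n)),
               count_nodes(and_or(n), worst_order(n)))
           for n in range(1, 5)}
```

For n = 1 to 4 the counts agree with 2n for the first order and 2^(n+1) − 2 for the second, for example 8 versus 30 nodes at n = 4.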

2. Influential variables: The variables that greatly influence the nature of a function should be at higher position. For example, the 8-to-1 selector shown in Fig. 26.12(a) can be represented by a linear size of BDD when the three control inputs are ordered high, but when the order is reversed, it becomes of exponentially increasing size as the number of variables (i.e., the total number of data inputs and control inputs) increases, as shown in Fig. 26.12(b).


FIGURE 26.12

BDDs for 8-to-1 selector.

Based on empirical rules like these, Fujita et al. [9] and Malik et al. [20] presented methods in which the given logic network is traversed from an output in a depth-first manner, and an input variable that can be reached by going back toward the inputs of the network is placed at the highest position in the variable order. Minato [25] devised another heuristic method in which each output function of the given logic network is weighted, and the input variables are then ordered by the weight of each input variable, determined by how each input can be reached from an output of the network. Butler et al. [8] proposed another heuristic based on a measure which uses not only the connection configuration of the network, but also the output functions of the network. These methods attempt to find a good order before the BDDs are generated. They find good orders in many cases, but there is no method that is always effective for a given network.

Another approach reduces the size of BDDs by reordering input variables. A greedy local exchange (swapping adjacent variables) method was developed by Fujita et al. [10]. Minato [22] presented another reordering method which measures the width of BDDs as a cost function. In many cases, these methods find a fairly good order using no additional information. A drawback of these methods is that they cannot start if the initial BDD is too large.

One remarkable work is dynamic variable ordering, presented by Rudell [27]. In this technique, the BDD package itself determines and maintains the variable order. Every time the BDDs grow to a certain size, the reordering process is invoked automatically. This method is very effective in terms of the reduction of BDD size, although it sometimes takes a long computation time.

Table 26.1 shows experimental results on the effect of variable ordering. The logic network “sel8” is an 8-bit data selector, and “enc8” is an 8-bit encoder. “add8” and “mult6” are an 8-bit adder and a 6-bit multiplier. The rest are chosen from the benchmark networks in MCNC’90 [1]. Table 26.1 compares four different orders: the original order, a random order, and two heuristic orders. The results show that the heuristic ordering methods are very effective except for a few cases which are insensitive to the order.


TABLE 26.1 Effect of Variable Ordering

        Network Feature                           BDD Size (with complement edges)
Name    No. of    No. of     No. of Inputs    Original    Random      Heur-1    Heur-2
        Inputs    Outputs    to All Gates
sel8    12        2          29               16          88          23        19
enc8    9         4          31               28          29          28        27
add8    17        9          65               83          885         41        41
mult6   12        12         187              2183        2927        2471      2281
vg2     25        8          97               117         842         97        86
c432    36        7          203              3986        (>500k)     27302     1361
c499    41        32         275              115654      (>500k)     52369     40288
c880    60        26         464              (>500k)     (>500k)     23364     9114

Note: Heuristic-1 (Heur-1): Heuristic order based on connection configuration [5]. Heuristic-2 (Heur-2): BDD reduction by exchanging variables [14].

Unfortunately, there are some hard examples where variable ordering is powerless. For example, Bryant [6] proved that an n-bit multiplier function requires an exponentially increasing number of BDD nodes for any variable order, as n increases. However, for many other practical functions, the variable ordering methods are useful for generating compact BDDs in a reasonably short time.

26.5 Remarks

Several groups developed BDD packages, and some of them are open to the public. For example, the CUDD package [12] is well-known to BDD researchers in the U.S. Many other BDD packages may be found on the Internet. BDD packages, in general, are based on the quick search of hash tables and linked-list data structures. They greatly benefit from the property of the random access machine model [2], where any data in main memory can be accessed in constant time.

Presently, considerable research is in progress. Detection of total or partial symmetry of a logic function with respect to variables has been a very time-consuming problem, but now it can be done in a short time by BDDs [26,29]. Also, decomposition of a logic function, which used to be very difficult, can be quickly solved with BDDs [5,21].

A number of new types of BDDs have been proposed in recent years. For example, the Zero-Suppressed BDD (ZBDD) [23] is useful for solving covering problems, which are used in deriving a minimal sum and in other combinatorial problems. The Binary Moment Diagram (BMD) [7] is another type of BDD that is used for representing logic functions for arithmetic operations. For those who are interested in more detailed techniques related to BDDs, several good surveys [11,24,28] are available.

References
1. ACM/SIGDA Benchmark Newsletter, DAC ’93 Edition, June 1993.
2. Aho, A. V., J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA, 1974.
3. Akers, S. B., “Functional testing with binary decision diagrams,” Proc. 8th Ann. IEEE Conf. Fault-Tolerant Comput., pp. 75-82, 1978.
4. Akers, S. B., “Binary decision diagrams,” IEEE Trans. on Computers, vol. C-27, no. 6, pp. 509-516, June 1978.
5. Bertacco, V. and M. Damiani, “The disjunctive decomposition of logic functions,” ICCAD ’97, pp. 78-82, Nov. 1997.
6. Bryant, R. E., “Graph-based algorithms for Boolean function manipulation,” IEEE Trans. on Computers, vol. C-35, no. 8, pp. 677-691, Aug. 1986.

7. Bryant, R. E. and Y.-A. Chen, “Verification of arithmetic functions with binary moment diagrams,” Proc. 32nd ACM/IEEE DAC, pp. 535-541, June 1995.
8. Butler, K. M., D. E. Ross, R. Kapur, and M. R. Mercer, “Heuristics to compute variable orderings for efficient manipulation of ordered binary decision diagrams,” Proc. of 28th ACM/IEEE DAC, pp. 417-420, June 1991.
9. Fujita, M., H. Fujisawa, and N. Kawato, “Evaluation and improvement of Boolean comparison method based on binary decision diagrams,” Proc. IEEE/ACM ICCAD ’88, pp. 2-5, Nov. 1988.
10. Fujita, M., Y. Matsunaga, and T. Kakuda, “On variable ordering of binary decision diagrams for the application of multi-level logic synthesis,” Proc. IEEE EDAC ’91, pp. 50-54, Feb. 1991.
11. Hachtel, G. and F. Somenzi, Logic Synthesis and Verification Algorithms, Kluwer Academic Publishers, 1996.
12. http://vlsi.colorado.edu/software.html.
13. Ishiura, N., H. Sawada, and S. Yajima, “Minimization of binary decision diagrams based on exchanges of variables,” Proc. IEEE/ACM ICCAD ’91, pp. 472-475, Nov. 1991.
14. Ishiura, N. and S. Yajima, “A class of logic functions expressible by a polynomial-size binary decision diagrams,” in Proc. Synthesis and Simulation Meeting and International Interchange (SASIMI ’90, Japan), pp. 48-54, Oct. 1990.
15. Lee, C. Y., “Representation of switching circuits by binary-decision programs,” Bell Sys. Tech. Jour., vol. 38, pp. 985-999, July 1959.
16. Liaw, H.-T. and C.-S. Lin, “On the OBDD-representation of general Boolean functions,” IEEE Trans. on Computers, vol. C-41, no. 6, pp. 661-664, June 1992.
17. Lin, B. and F. Somenzi, “Minimization of symbolic relations,” Proc. IEEE/ACM ICCAD ’90, pp. 88-91, Nov. 1990.
18. Madre, J. C. and J. P. Billon, “Proving circuit correctness using formal comparison between expected and extracted behaviour,” Proc. 25th ACM/IEEE DAC, pp. 205-210, June 1988.
19. Madre, J. C. and O. Coudert, “A logically complete reasoning maintenance system based on a logical constraint solver,” Proc. Int’l Joint Conf. Artificial Intelligence (IJCAI ’91), pp. 294-299, Aug. 1991.
20. Malik, S., A. R. Wang, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Logic verification using binary decision diagrams in a logic synthesis environment,” Proc. IEEE/ACM ICCAD ’88, pp. 6-9, Nov. 1988.
21. Matsunaga, Y., “An exact and efficient algorithm for disjunctive decomposition,” SASIMI ’98, pp. 44-50, Oct. 1998.
22. Minato, S., “Minimum-width method of variable ordering for binary decision diagrams,” IEICE Trans. Fundamentals, vol. E75-A, no. 3, pp. 392-399, Mar. 1992.
23. Minato, S., “Zero-suppressed BDDs for set manipulation in combinatorial problems,” Proc. 30th ACM/IEEE DAC, pp. 272-277, June 1993.
24. Minato, S., Binary Decision Diagrams and Applications for VLSI CAD, Kluwer Academic Publishers, 1995.
25. Minato, S., N. Ishiura, and S. Yajima, “Shared binary decision diagram with attributed edges for efficient Boolean function manipulation,” Proc. 27th IEEE/ACM DAC, pp. 52-57, June 1990.
26. Möller, D., J. Mohnke, and M. Weber, “Detection of symmetry of Boolean functions represented by ROBDDs,” Proc. IEEE/ACM ICCAD ’93, pp. 680-684, Nov. 1993.
27. Rudell, R., “Dynamic variable ordering for ordered binary decision diagrams,” Proc. IEEE/ACM ICCAD ’93, pp. 42-47, Nov. 1993.
28. Sasao, T., Ed., Representation of Discrete Functions, Kluwer Academic Publishers, 1996.
29. Sawada, H., S. Yamashita, and A. Nagoya, “Restricted simple disjunctive decompositions based on grouping symmetric variables,” Proc. IEEE Great Lakes Symp. on VLSI, pp. 39-44, Mar. 1997.
30. Tani, S., K. Hamaguchi, and S. Yajima, “The complexity of the optimal variable ordering problems of shared binary decision diagrams,” The 4th Int’l Symp. Algorithms and Computation, Lecture Notes in Computer Science, vol. 762, Springer, 1993, pp. 389-398.

Muroga, S. "Logic Synthesis with AND and OR Gates in Two Levels" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

27 Logic Synthesis With AND and OR Gates in Two Levels

Saburo Muroga, University of Illinois at Urbana-Champaign

27.1 Introduction
27.2 Design of Single-Output Minimal Networks with AND and OR Gates in Two Levels
27.3 Design of Multiple-Output Networks with AND and OR Gates in Two Levels
Multiple-Output Prime Implicants • Paramount Prime Implicants • Design of a Two-Level Network with a Minimum Number of AND and OR Gates • Networks That Cannot be Designed by the Preceding Procedure • Applications of Procedure 27.1

27.1 Introduction

When logic networks are realized in transistor circuits on an integrated circuit chip, each gate in the logic networks usually realizes a more complex function than an AND or OR gate does. But handy design methods are not available for designing logic networks with such complex gates under very diversified, complex constraints such as delay time and layout rules. Thus, designers often design logic networks with AND, OR, and NOT gates as a starting point for design with more complex gates under complex constraints. AND, OR, and NOT gates are much easier for human minds to deal with. Once designed, logic networks with AND, OR, and NOT gates are often converted into transistor circuits. This conversion process, illustrated in Fig. 27.1 (the transistor circuit in this figure will be explained in later chapters), is called technology mapping. As can be seen, technology mapping is complex because logic gates and the corresponding transistor logic gates usually do not correspond one to one before and after technology mapping, and layout has to be considered for speed and area.

Also, logic gates are realized by different types of transistor circuits, depending on design objectives. These types are called logic families. There are several logic families, such as ECL, nMOS circuits, static CMOS, and dynamic CMOS, as discussed in Chapters 30, 35, and 36. Technology mapping is different for different logic families.

First, let us describe the design of logic networks with AND and OR gates in two levels, because handy methods are not available for designing logic networks with AND and OR gates in more than two levels. Although logic networks with AND and OR gates in two levels may not be directly useful for designing transistor circuits to be laid out on an integrated circuit chip, there are some cases where they are directly usable, such as programmable logic arrays, to be described in a later chapter.

FIGURE 27.1 Conversion of a logic network with simple gates by technology mapping.

27.2 Design of Single-Output Minimal Networks with AND and OR Gates in Two Levels

Suppose that we want to obtain a two-level network with a minimum number of gates and then, as a secondary objective, a minimum number of connections under Assumptions 27.1 below, regardless of whether AND gates are in the first level and OR gates in the second level, or the other way around. In this case, we have to design a network based on the minimal sum and another based on the minimal product, and then choose the better network.

Suppose that we want to design a two-level AND/OR network for the function shown in Fig. 27.2(a). This function has only one minimal sum, as shown with loops in Fig. 27.2(a). Also, it has only one minimal product, as shown in Fig. 27.3(a). The network in Fig. 27.3(b), based on this minimal product, requires one fewer gate, despite more loops, than the network based on the minimal sum in Fig. 27.2(b), and consequently the network in Fig. 27.3(b) is preferable.

FIGURE 27.2 Minimal sum and the corresponding network.

FIGURE 27.3 Minimal product and the corresponding network.

The above design procedure of a minimal network based on a minimal sum is meaningful under the following assumptions.

Assumptions 27.1:
(1) The number of levels is at most two;
(2) Only AND gates are used in one level and only OR gates in the other level;
(3) Complemented variables x̄i as well as noncomplemented variables xi, for each i, are available as the network inputs;
(4) No maximum fan-in restriction is imposed on any gate in a network to be designed;
(5) Among networks realizable in two levels, we will choose networks that have a minimum number of gates. Then, from those with the minimum number of gates, we will choose a network that has a minimum number of connections.

If any of these is violated, the number of logic gates as the primary objective and the number of connections as the secondary objective are not necessarily minimized. If we do not have the restriction “at most two levels,” we can generally have a network of fewer gates.

Karnaugh maps have been widely used because of their convenience when the number of variables is small. But when the number of variables is large, maps become increasingly complex and processing with them becomes tedious, as discussed in Chapter 25. Furthermore, the corresponding logic networks with AND and OR gates in two levels are not useful because of excessive fan-ins and fan-outs.
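The two-part objective in Assumptions 27.1 can be written down as a small cost function. The following Python sketch is an illustration under stated assumptions rather than part of the design procedure: it assumes that complemented inputs are free (assumption 3), that a single-literal term is wired directly into the output gate, and it uses the hypothetical example f = (x1 + x2)(x3 + x4) instead of the function of Fig. 27.2. The name `two_level_cost` is likewise hypothetical.

```python
def two_level_cost(terms):
    """Return (gates, connections) for a two-level network.

    terms is a list of first-level terms, each a list of literals.
    The same count applies to a sum of products (AND then OR) and to
    a product of sums (OR then AND).
    """
    first_level = [t for t in terms if len(t) > 1]  # single literals need no gate
    needs_output_gate = len(terms) > 1
    gates = len(first_level) + (1 if needs_output_gate else 0)
    connections = sum(len(t) for t in first_level)
    if needs_output_gate:
        connections += len(terms)  # one input of the output gate per term
    return gates, connections


# Illustrative function f = (x1 + x2)(x3 + x4), written in both forms.
minimal_sum = [["x1", "x3"], ["x1", "x4"], ["x2", "x3"], ["x2", "x4"]]
minimal_product = [["x1", "x2"], ["x3", "x4"]]

candidates = {
    "AND-OR from minimal sum": two_level_cost(minimal_sum),
    "OR-AND from minimal product": two_level_cost(minimal_product),
}
# Tuples compare gates first and connections second, matching assumption (5).
best = min(candidates, key=candidates.get)
print(candidates, best)
```

Comparing the tuples lexicographically mirrors the rule of minimizing gates first and connections second; for this example the network based on the minimal product is chosen.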

27.3 Design of Multiple-Output Networks with AND and OR Gates in Two Levels

So far, we have discussed the synthesis of a two-level network with a single output. In many cases in practice, however, we need a two-level network with multiple outputs; so here let us discuss the synthesis of such a network, which is more complex than a network with a single output. An obvious approach is to design a network for each output function separately. But this approach usually yields a less compact network than synthesizing the functions collectively, because, for example, in Fig. 27.4, the AND gate h, which can be shared by two output gates for fi and fj, must be repeated in separate networks for fi and fj by this approach.

FIGURE 27.4 A two-level network with multiple outputs.

Before discussing a design procedure, let us study the properties of a minimal two-level network that has only AND gates in the first level and only OR gates for given output functions f1, f2, …, fm in the second level, as shown in Fig. 27.4, where “a minimal network” means that the network has a minimum number of gates as the primary objective and a minimum number of connections as the secondary objective. The number of OR gates required for this network is at most m. Actually, when some functions are expressed as single products of literals, the number of OR gates can be less than m because these functions can be realized directly at the outputs of the AND gates without the use of OR gates. Also, when a function fi has a prime implicant consisting of a single literal, that literal can be directly connected to the OR gate for fi without the use of an AND gate. (These special cases can be treated easily by modifying the synthesis under the following assumption.) However, for simplicity, let us assume that every variable input is connected only to AND gates in the first level and every function fi, 1 ≤ i ≤ m, is realized at the outputs of OR gates in the second level. This is actually required with some electronic realizations of a network (e.g., when every variable input needs to have the same delay to the network outputs, or when PLAs which will be described in Chapter 42 are used to realize two-level networks).

Multiple-Output Prime Implicants

Suppose that we have a two-level, multiple-output network that has AND gates in the first level, and m OR gates in the second (i.e., output) level for functions f1, f2, …, fm (as an example, f1, f2, and f3 are given as shown in the Karnaugh maps in Fig. 27.5). The network has a minimum number of gates as the primary objective, and a minimum number of connections as the secondary objective. Let us explore the basic properties of such a minimal network.

Property 27.1: First consider an AND gate, g, that is connected to only one output gate, say for fi, in Fig. 27.4. If gate g has inputs xg1, xg2, …, xgp, its output represents the product xg1xg2…xgp. Then, if the product assumes the value 1, fi becomes 1. Thus, the product xg1xg2…xgp is an implicant of fi. Since the network is minimal, gate g has no unnecessary inputs, and the removal of any input from gate g will make the OR gate for fi express a different function (i.e., in Fig. 27.5, the loop that the product represents becomes larger, containing some 0-cells, when any variable is deleted from the product, and then the new product is not an implicant of fi). Thus, xg1xg2…xgp is a prime implicant of fi.

Property 27.2: Next consider an AND gate, h, that is connected to two output gates for fi and fj in Fig. 27.4. This time, the situation is more complicated. If the product xh1xh2…xhq realized at the output of gate h assumes the value 1, both fi and fj become 1 and, consequently, the product fifj of the two functions also becomes 1 (thus, if a Karnaugh map is drawn for fifj as shown in Fig. 27.5, the map has a loop consisting of only 1-cells for xh1xh2…xhq). Thus xh1xh2…xhq is an implicant of the product fifj. The network is minimal in realizing each of the functions f1, f2, …, fm; so, if any input is removed from AND gate h, at least one of fi and fj will be a different function and xh1xh2…xhq will no longer be an implicant of fifj. (That is, if any input is removed from AND gate h, a loop in the map for at least one of fi and fj will become larger. For example, if x1 is deleted from M: x 1 x 2 x 3 for f1f2, the loop for x 2 x 3, the product of the remaining literals, appears in the maps for f1 and also for f2 in Fig. 27.5. The loop in the map for f2 contains a 0-cell. Thus, if a loop is formed in the map for f1f2 as a product of these loops in the maps for f1 and f2, the new loop contains a 0-cell in the map for f1f2. This means that if any variable is removed from xh1xh2…xhq, the remainder is not an implicant of f1f2.) Thus, xh1xh2…xhq is a prime implicant of the product fifj. (But notice that the product xh1xh2…xhq is not necessarily a prime implicant of each single function fi or fj, as can be seen in Fig. 27.5. For example, M: x 1 x 2 x 3 is a prime implicant of f1f2 in Fig. 27.5. Although this product, x 1 x 2 x 3, is a prime implicant of f2, it is not a prime implicant of f1 in Fig. 27.5.)

Generalizing this, we obtain the following conclusion. Suppose that the output of an AND gate is connected to r OR gates for functions fi1, …, fir. Since the network is minimal, this gate has no unnecessary inputs. Then the product of input variables realized at the output of this AND gate is a prime implicant of the product fi1…fir (but is not necessarily a prime implicant of any product of r – 1 or fewer of these fi1, …, fir).
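Properties 27.1 and 27.2 can be checked mechanically on small functions. The sketch below is a hedged illustration: the helper names (`on_set`, `cube_cells`, `is_implicant`, `is_prime_implicant`) and the two three-variable functions are hypothetical and are not the functions of Fig. 27.5. A product term is stored as a dictionary from variable index to the value its literal requires, and the product of two functions is the intersection of their sets of 1-cells.

```python
from itertools import product


def on_set(predicate, n):
    """Set of input combinations (bit tuples) on which the function is 1."""
    return {cell for cell in product((0, 1), repeat=n) if predicate(*cell)}


def cube_cells(cube, n):
    """Cells covered by a product term; cube maps variable index -> literal value."""
    free = [i for i in range(n) if i not in cube]
    cells = set()
    for bits in product((0, 1), repeat=len(free)):
        cell = [0] * n
        for i, value in cube.items():
            cell[i] = value
        for i, bit in zip(free, bits):
            cell[i] = bit
        cells.add(tuple(cell))
    return cells


def is_implicant(cube, ones, n):
    return cube_cells(cube, n) <= ones


def is_prime_implicant(cube, ones, n):
    """An implicant is prime if deleting any one literal breaks the implication."""
    if not is_implicant(cube, ones, n):
        return False
    for dropped in cube:
        smaller = {i: v for i, v in cube.items() if i != dropped}
        if is_implicant(smaller, ones, n):
            return False
    return True


# Illustrative functions of x1, x2, x3 (indices 0, 1, 2); not those of Fig. 27.5.
f1 = on_set(lambda x1, x2, x3: x2, 3)
f2 = on_set(lambda x1, x2, x3: (x1 and x2) or (not x1 and not x2 and x3), 3)
f1f2 = f1 & f2  # the product of two functions is 1 where both are 1

h = {0: 1, 1: 1}  # the product x1 x2 realized by a shared AND gate
print(is_prime_implicant(h, f1f2, 3))
print(is_prime_implicant(h, f2, 3))
print(is_implicant(h, f1, 3), is_prime_implicant(h, f1, 3))
```

In this example the product x1x2 is a prime implicant of f1f2 and of f2, but only an implicant (not a prime implicant) of f1, which is the same situation the text describes for the loop M.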

FIGURE 27.5 Multiple-output prime implicants on Karnaugh maps.

As in the synthesis of a single-output network, we need to find all prime implicants and then develop disjunctive forms by choosing an appropriate set of prime implicants. (Notice that each of these disjunctive forms is not necessarily a minimal sum for one of the functions, fi.) But, unlike the synthesis of a single-output network, in this case we must consider all prime implicants not only for each of the given functions f1, …, fm, but also for all possible products of them, f1f2, f1f3, …, fm–1fm; f1f2f3, f1f2f4, …; …; …; f1f2…fm–1fm.

Definition 27.1: Suppose that m functions f1, …, fm are given. All prime implicants for each of these m functions, and also all prime implicants for every possible product of these functions, that is,

f1, f2, …, fm
f1f2, f1f3, …, fm–1fm
f1f2f3, f1f2f4, …, fm–2fm–1fm
…
f1f2…fm–1, f1f2…fm–2fm, …, f2f3…fm
f1f2f3…fm–1fm

are called the multiple-output prime implicants of f1, …, fm.

When the number of variables is small, we can find all multiple-output prime implicants on Karnaugh maps, as illustrated in Fig. 27.5 for the case of three functions of four variables. In addition to the maps for the given functions f1, f2, and f3, we draw the maps for all possible products of these functions; that is, f1f2, f2f3, f1f3, and f1f2f3. Then, prime-implicant loops are formed on each of these maps. These loops represent all the multiple-output prime implicants of the given functions f1, f2, and f3.
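Definition 27.1 translates directly into an exhaustive search when the number of variables is small. The following Python sketch assumes a hypothetical representation in which a loop is a string over '0', '1', and '-' ('-' marks a variable absent from the product) and a function is the set of its 1-cells; the two example functions are illustrative and are not those of Fig. 27.5.

```python
from itertools import combinations, product


def cells(cube):
    """1-cells covered by a loop such as '1-0' ('-' marks an absent variable)."""
    choices = [("0", "1") if c == "-" else (c,) for c in cube]
    return {"".join(p) for p in product(*choices)}


def is_prime(cube, ones):
    """True if the loop holds only 1-cells and cannot be enlarged in any direction."""
    if not cells(cube) <= ones:
        return False
    for i, c in enumerate(cube):
        if c != "-":
            bigger = cube[:i] + "-" + cube[i + 1:]
            if cells(bigger) <= ones:
                return False
    return True


def multiple_output_prime_implicants(on_sets, n):
    """Map every non-empty group of functions to the prime implicants of its product."""
    result = {}
    names = sorted(on_sets)
    for r in range(1, len(names) + 1):
        for group in combinations(names, r):
            # The product of the functions is 1 exactly on the common 1-cells.
            common = set.intersection(*(on_sets[name] for name in group))
            all_cubes = ("".join(c) for c in product("01-", repeat=n))
            result[group] = sorted(c for c in all_cubes if is_prime(c, common))
    return result


# Illustrative functions of three variables: f1 = x2, f2 = x1x2 + x1'x2'x3.
on_sets = {
    "f1": {"010", "011", "110", "111"},
    "f2": {"001", "110", "111"},
}
mopi = multiple_output_prime_implicants(on_sets, 3)
for group, primes in mopi.items():
    print(group, primes)
```

The search tries all 3^n candidate loops for each of the 2^m – 1 products, so it is meant only to mirror what is done by eye on the maps, not to scale.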

Paramount Prime Implicants

Suppose we find all multiple-output prime implicants for the given functions. Then, if a prime implicant P appears more than once as a prime implicant for different products of functions, P for the product of the largest number of functions among these products of functions is called a paramount prime implicant for P in order to differentiate this P from other multiple-output prime implicants. As a special case, if P appears only once, it is the paramount prime implicant for P.

For example, among all multiple-output prime implicants for f1, f2, and f3 (i.e., among all prime implicants for f1, f2, f3, f1f2, f2f3, f1f3, and f1f2f3 shown as loops in Fig. 27.5), the prime implicant x 1 x 2 x 4 appears three times in Fig. 27.5. In other words, x 1 x 2 x 4 appears as a prime implicant for the function f1 (see the map for f1 in Fig. 27.5), as a prime implicant for f2, and also as a prime implicant for the product f1f2 (in the map for f1f2, this appears as L: x 1 x 2 x 4). Then, the prime implicant x 1 x 2 x 4 for f1f2 is the paramount prime implicant for x 1 x 2 x 4 because it is a prime implicant for the product of two functions, f1 and f2, but the prime implicant x 1 x 2 x 4 for f1 or f2 is a prime implicant for a single function f1 or f2 alone. In the two-level network, x 1 x 2 x 4 for the product f1f2 realizes the AND gate with output connections to the OR gates for f1 and f2, whereas x 1 x 2 x 4 for f1 or f2 realizes the AND gate with an output connection to only the OR gate for f1 or f2, respectively. Thus, if we use x 1 x 2 x 4 for the product f1f2 instead of x 1 x 2 x 4 for f1 or f2 (in other words, if we use AND gates with more output connections), then the network will be realized with no more gates. In this sense, x 1 x 2 x 4 for the product f1f2 is more desirable than x 1 x 2 x 4 for the function f1 or f2.

Prime implicant x 1 x 2 x 4 for the product f1f2 is called a paramount prime implicant, as formally defined in the following, and is shown with label L in a bold line in the map for f1f2 in Fig. 27.5; whereas x 1 x 2 x 4 for f1 or f2 alone is not labeled and is also not in a bold line in the map for f1 or f2. Thus, when we provide an AND gate whose output realizes the prime implicant x 1 x 2 x 4, we can connect its output in three different ways, i.e., to f1, f2, or both f1 and f2, realizing the same output functions, f1, f2, and f3. In this case, the paramount prime implicant means that we can connect the largest number of OR gates from this AND gate; in other words, this AND gate has the largest coverage, although some connections may turn out to be redundant later.

Definition 27.2: Suppose that when all the multiple-output prime implicants of f1, …, fm are considered, a product of some literals, P, is a prime implicant for the product of k functions fp1, fp2, …, fpk (and possibly also a prime implicant for products of k – 1 or fewer of these functions), but is not a prime implicant for any product of more functions that includes all these functions fp1, fp2, …, fpk. (For the above example, x 1 x 2 x 4 is a prime implicant for the product of two functions, f1 and f2, and also a prime implicant for a single function, f1 or f2. But x 1 x 2 x 4 is not a prime implicant for the product of more functions, including f1 and f2, that is, for the product f1f2f3.) Then, P for the product of fp1, fp2, …, fpk is called the paramount prime implicant for this prime implicant (x 1 x 2 x 4 for f1f2 is a paramount prime implicant, but x 1 x 2 x 4 for f1 or f2 is not). As a special case, if P is a prime implicant for only one function but not a prime implicant for any product of more than one function, it is a paramount prime implicant (B: x 1 x 3 x 4 for f1 is such an example in Fig. 27.5).

In Fig. 27.5, only labeled loops shown in bold lines represent paramount prime implicants. If a two-level network is first designed only with the AND gates that correspond to the paramount prime implicants, we can minimize the number of gates. This does not necessarily minimize the number of connections as the secondary objective. But in some important electronic realizations of two-level networks, such as PLAs (which will be explained later), this is not important. Thus, let us consider only the minimization of the number of gates in the following for the sake of simplicity.
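Definition 27.2 amounts to keeping, for each product term, only the largest group of functions for which it is a prime implicant. The sketch below assumes that the multiple-output prime implicants are already available as a dictionary from groups of function names to lists of loops; the dictionary shown is a hypothetical three-variable example (f1 = x2, f2 = x1x2 + x1'x2'x3), not the data of Fig. 27.5, and the function name is also hypothetical.

```python
def paramount_prime_implicants(mopi):
    """Keep (loop, group) pairs whose loop is not prime for any larger group.

    mopi maps a tuple of function names to the prime implicants of the
    product of those functions.
    """
    paramount = []
    for group, cubes in mopi.items():
        for cube in cubes:
            # Dominated: the same loop is also prime for a strict superset of group.
            dominated = any(
                set(group) < set(other) and cube in other_cubes
                for other, other_cubes in mopi.items()
            )
            if not dominated:
                paramount.append((cube, group))
    return sorted(paramount)


# Hypothetical input: loops are strings over '0', '1', '-' for x1, x2, x3.
mopi = {
    ("f1",): ["-1-"],
    ("f2",): ["001", "11-"],
    ("f1", "f2"): ["11-"],
}
paramount = paramount_prime_implicants(mopi)
for cube, group in paramount:
    print(cube, group)
```

Here the loop '11-' is kept only for the product f1f2, so the corresponding AND gate would initially feed both OR gates, while '-1-' and '001' are paramount for a single function each.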

Design of a Two-Level Network with a Minimum Number of AND and OR Gates

In the following procedure, we will derive a two-level network with a minimum number of gates (but without minimizing the number of connections as the secondary objective) by finding a minimal number of paramount prime implicants to represent the given functions.

Procedure 27.1: Design of a Multiple-Output Two-Level Network That Has a Minimum Number of Gates Without Minimizing the Number of Connections

We want to design a two-level network that has AND gates in the first level and OR gates in the second level. We shall assume that variable inputs can be connected to AND gates only (not to OR gates), and that the given output functions f1, f2, …, fm are to be realized only at the outputs of OR gates. Suppose we have already found all the paramount prime implicants for the given functions and their products.

1. Find a set of the smallest number of paramount prime implicant loops that covers all 1-cells in the Karnaugh maps for the given functions f1, f2, …, fm. In this case, maps for their products, such as f1f2, need not be considered. If there is more than one such set, choose a set that has loops as large as possible (i.e., choose a set of loops such that the total number of inputs to the AND gates that correspond to these loops is the smallest). For example, suppose that the three functions of four variables, f1, f2, f3, shown in the Karnaugh maps in Fig. 27.5 are given. Then, using only the bold-lined loops labeled with letters (i.e., paramount prime implicants), try to cover all 1-cells in the maps for f1, f2, and f3 only (i.e., using only the top three maps in Fig. 27.5). Then, we find that more than one set of loops has the same number of loops with the same sizes. AKLCDMNFHJ, one of these sets, covers all functions f1, f2, and f3, as illustrated in Fig. 27.6.

FIGURE 27.6 Covering f1, f2, and f3 with the minimum number of paramount prime implicants.

2. Construct a network corresponding to the chosen set of paramount prime implicant loops. For the set AKLCDMNFHJ, this uniquely yields the network of 13 gates shown in Fig. 27.7. Letter N (i.e., x 1 x 2 x 3 x 4 ), for example, is a paramount prime implicant for the product f1f2f3, so the output of an AND gate with inputs x1, x 2 , x3, and x 4 is connected to the OR gates for f1, f2, and f3.
3. From the logic network derived in Step 2, delete unnecessary connections, or replace some logic gates by new ones, by the following steps, which consider whether some Karnaugh maps have a paramount prime implicant loop that is inside another one, or adjacent to another one.

FIGURE 27.7 The network corresponding to AKLCDMNFHJ.

a. If, in a Karnaugh map for a function fi, there is a paramount prime implicant loop that is inside another one, then the former is not necessary, because the 1-cells contained in this loop are covered by the latter loop. Thus, we can delete the connection from the AND gate corresponding to the former loop to the OR gate for fi, without changing the output functions of the logic network. For example, loop K in the map for f2 is inside the loop C in Fig. 27.6. Thus, we can delete K from the map for f2, because the 1-cell in K is covered by the loop C. This means that we can delete the connection (shown with * in Fig. 27.7) from the AND gate labeled K in Fig. 27.7. Similarly, we can delete the loops N from the maps for f1 and f2, and accordingly, the connections (shown with * in Fig. 27.7) from the AND gate labeled N to the OR gates for f1 and f2.
b. If, in a Karnaugh map for a function fi, there is a paramount prime implicant loop that is adjacent to another one, we may be able to expand the former by incorporating some 1-cells of the latter without having any 0-cells and at the same time replace the loop by the expanded loop (i.e., with the number of logic gates unchanged). If we can do so, replace the former loop by the expanded loop. The expanded loop represents a product of fewer literals. Thus, we can delete the connection to the AND gate corresponding to the expanded loop without changing the output functions of the logic network. For example, loop K in the map for f1 has an adjacent loop A. Then we can replace the loop K by a larger loop (i.e., loop B in Fig. 27.5) by incorporating the 1-cell on its left, which represents the product x 1 x 3 x 4. Also, K appears only in the map for f2, besides K in the map for f1. K in the map for f2 is concluded to be eliminable, so the AND gate for K in Fig. 27.7 can be replaced by the new gate for B, keeping the number of logic gates unchanged. This means that we can delete the connection of input x2 (shown with ** in Fig. 27.7) to the AND gate labeled K in Fig. 27.7 (i.e., this is essentially replacement of K by B).

Thus we can delete, in total, 4 connections from the logic network shown in Fig. 27.7, ending up with a simpler network with 13 gates and 41 connections.

When the number of paramount prime implicants is very small, we can find, directly on the maps, a minimum number of paramount prime implicants that cover all 1-cells in the maps for f1, f2, …, fm (not their products) and derive a network with a minimum number of gates, using Procedure 27.1. But when the number of paramount prime implicants is large, the algebraic procedure of Section 4.6 in Ref. 5 is more efficient.

Procedure 27.1 with the deletion of unnecessary connections, however, may not necessarily yield a minimum number of connections as the secondary objective, although the number of gates is minimized as the primary objective. If we want to have a network with a minimum number of connections as the secondary objective, while keeping the same minimum number of gates as the primary objective, then we need to modify Procedure 27.1 and then delete unnecessary connections, as described in Procedure 5.2.1 of Ref. 5. But this procedure is more tedious and time-consuming.
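Steps 1 and 3a of Procedure 27.1 can be sketched as a brute-force search for very small problems. The code below is an illustration under stated assumptions, not the algebraic procedure of Ref. 5: it takes a hypothetical list of paramount prime implicants (loops as strings over '0', '1', '-') for two illustrative three-variable functions, tries loop sets in order of increasing size, breaks ties by the total number of AND-gate inputs, and then drops every connection whose loop lies inside another loop of the same map. Step 3b (expanding a loop into an adjacent one) is not covered.

```python
from itertools import combinations, product


def cells(cube):
    """1-cells covered by a loop such as '1-0' ('-' marks an absent variable)."""
    choices = [("0", "1") if c == "-" else (c,) for c in cube]
    return {"".join(p) for p in product(*choices)}


def covers(chosen, on_sets):
    """True if, for every function, the chosen loops cover exactly its 1-cells."""
    for name, ones in on_sets.items():
        covered = set()
        for cube, group in chosen:
            if name in group:
                covered |= cells(cube)
        if covered != ones:
            return False
    return True


def and_gate_inputs(chosen):
    return sum(len(cube) - cube.count("-") for cube, _ in chosen)


def minimum_cover(paramount, on_sets):
    """Step 1: fewest loops first, then fewest AND-gate inputs (largest loops)."""
    for size in range(1, len(paramount) + 1):
        found = [c for c in combinations(paramount, size) if covers(c, on_sets)]
        if found:
            return min(found, key=and_gate_inputs)
    return None


def prune_connections(chosen):
    """Step 3a: drop an AND-to-OR connection whose loop lies inside another loop."""
    links = [(cube, name) for cube, group in chosen for name in group]
    kept = []
    for cube, name in links:
        inside_another = any(
            other_name == name and other != cube and cells(cube) <= cells(other)
            for other, other_name in links
        )
        if not inside_another:
            kept.append((cube, name))
    return kept


# Hypothetical example: f1 = x2 and f2 = x1x2 + x1'x2'x3, with their paramount loops.
on_sets = {"f1": {"010", "011", "110", "111"}, "f2": {"001", "110", "111"}}
paramount = [("-1-", ("f1",)), ("001", ("f2",)), ("11-", ("f1", "f2"))]

chosen = minimum_cover(paramount, on_sets)
links = prune_connections(chosen)
print(chosen)
print(links)
```

In this small example all three loops are needed, and the connection from the AND gate for '11-' to the OR gate for f1 is removed because that loop lies inside '-1-' in the map for f1. The search is exponential in the number of loops, which is consistent with the remark that a different procedure is needed when there are many paramount prime implicants.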

Networks That Cannot be Designed by the Preceding Procedure

Notice that the design procedures in this section yield only a network that has all AND gates in the first level and all OR gates in the second level. (If 0-cells on Karnaugh maps are worked on instead of 1-cells in Procedure 27.1, we have a network with all OR gates in the first level and all AND gates in the second level.) If AND and OR gates are mixed in each level, or the network need not be in two levels, Procedure 27.1 does not guarantee the minimality of the number of logic gates.6 These two problems can be solved by the integer programming logical design method (to be mentioned in Chapter 31, Section 31.5), which is complex and can be applied only to networks with a small number of gates.

Applications of Procedure 27.1

Procedure 27.1 has important applications, such as PLAs, but it cannot be applied for designing large PLAs. For multiple-output functions with many variables, absolute minimization is increasingly time-consuming. Using BDDs (described in Chapter 26), Coudert, Madre, and Lin extended the feasibility of absolute minimization.2,3 But when absolute minimization becomes too time-consuming, we have to give it up and resort to heuristic minimization, such as the powerful method called MINI,4 which was later improved as ESPRESSO.1

References
1. Brayton, R., G. Hachtel, C. McMullen, and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, 1984.
2. Lin, B., O. Coudert, and J. C. Madre, “Symbolic prime generation for multiple-valued functions,” DAC 1992, pp. 40-44, 1992.
3. Coudert, O., “On solving covering problems,” DAC, pp. 197-202, 1996.
4. Hong, S. J., R. G. Cain, and D. L. Ostapko, “MINI: a heuristic approach for logic minimization,” IBM Jour. Res. Dev., pp. 443-458, Sept. 1974.
5. Muroga, S., Logic Design and Switching Theory, John Wiley & Sons, 1979 (now available from Krieger Publishing Co.).
6. Weiner, P. and T. F. Dwyer, “Discussions of some flaws in the classical theory of two level minimizations of multiple-output switching networks,” IEEE Tr. Comput., pp. 184-186, Feb. 1968.

Muroga, S. "Sequential Networks with AND and OR Gates" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

28 Sequential Networks

Saburo Muroga, University of Illinois at Urbana-Champaign

28.1 Introduction
28.2 Flip-Flops and Latches
S-R Latches • Flip-Flops
28.3 Sequential Networks in Fundamental Mode
Transition Tables • Fundamental Mode
28.4 Malfunctions of Asynchronous Sequential Networks
Racing Problem of Sequential Networks • Remedies for the Racing Problem
28.5 Different Tables for the Description of Transitions of Sequential Networks
28.6 Steps for the Synthesis of Sequential Networks
General Model of Sequential Networks • Synthesis as a Reversal of Network Analysis • Design Steps for Synthesis of Sequential Networks
28.7 Synthesis of Sequential Networks
Raceless Flip-Flops • Example of Design of a Synchronous Sequential Network • Design of Synchronous Sequential Networks in Skew Mode • Design of Asynchronous Sequential Networks in Fundamental Mode • Advantages of Skew Mode

28.1 Introduction

A logic network is called a sequential network when the values of its outputs depend not only on the current values of inputs but also on some of the past values, whereas a logic network is called a combinational network when the values of its outputs depend only on the current values of inputs but not on any past values. Analysis and synthesis of sequential networks are far more complex than those of combinational networks. When reliable operation of the networks is very important, the operations of logic gates are often synchronized by clocks. Such clocked networks, whether they are combinational or sequential networks, are called synchronous networks, and networks without clocks are called asynchronous networks.

28.2 Flip-Flops and Latches

Because in a sequential network the outputs assume values depending not only on the current values but also on some past values of the inputs, a sequential network must remember information about the past values of its inputs. Simple networks called flip-flops that are realized with logic gates are usually used as memories for this purpose. Semiconductor memories can also serve as memories for sequential networks, but flip-flops are used if higher speed is necessary to match the speed of logic gates. Let us explain the simplest flip-flops, which are called latches.

S-R Latches

The network in Fig. 28.1(a), which is called an S-R latch, consists of two NOR gates. Assume that the values at terminals S and R are 0, and the value at terminal Q is 0 (i.e., S = R = 0 and Q = 0). Since gate 1 has inputs of 0 and 0, the value at terminal Q̄ is 1 (i.e., Q̄ = 1). Since gate 2 in the network has two inputs, 0 and 1, its output is Q = 0. Thus, signals 0 and 1 are maintained at terminals Q and Q̄, respectively, as long as S and R remain 0. Now let us change the value at S to 1. Then, Q̄ is changed to 0, and Q becomes 1 after a short time delay. Even if Q = 1 and Q̄ = 0 were their original values, the change of the value at S to 1 still yields Q = 1 and Q̄ = 0. In other words, Q is set to 1 by supplying 1 to S, no matter whether we originally had Q = 0, Q̄ = 1, or Q = 1, Q̄ = 0. Similarly, when 1 is supplied to R with S remaining at 0, Q and Q̄ are set to 0 and 1, respectively, after a short time delay, no matter what values they had before. Thus, we get the first three combinations of the values of S and R shown in the table in Fig. 28.1(b). In other words, as long as S = R = 0, the values of Q and Q̄ are not changed. If S = 1, Q is set to 1. If R = 1, Q is set to 0. Thus, S and R are called set and reset terminals, respectively. In order to let the latch work properly, the value 1 at S or R must be maintained until new values of Q and Q̄ are established.

The S-R latch is usually denoted as in Fig. 28.1(c). An S-R latch can also be realized with NAND gates, as shown in Fig. 28.1(d). Latches and flip-flops have a direct reset-input terminal and a direct set-input terminal, although these input terminals are omitted in Fig. 28.1 and in the figures for other flip-flops for the sake of simplicity. These input terminals are convenient for initial setting to Q = 1, or resetting to Q = 0.

FIGURE 28.1

S-R latches.

When S = R = 1 occurs, the outputs Q and Q̄ are both 0. If S and R simultaneously return to 0, these two outputs cannot maintain 0. Actually, a simultaneous change of S and R to 0 or 1 is physically impossible, often causing the network to malfunction, unless we make the network more sophisticated, such as by synchronizing the logic gates with a clock, as will be explained later. If S returns to 0 from 1 before R does, we have Q = 0 and Q̄ = 1. If R returns to 0 from 1 before S does, we have Q = 1 and Q̄ = 0. Thus, it is not possible to predict what values we will have at the outputs after having S = R = 1. The output of this network is not defined for S = R = 1, as this combination is not used. For simplicity, let us assume that only one of the inputs to any network changes at a time, unless otherwise noted. This is a reasonable and important assumption.
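The behavior of the NOR latch described above can be checked with a small simulation. The following is only a sketch under a simplifying assumption that is not part of the text: both gates are given the same unit delay and are updated together in each step. The function names `nor` and `settle` are illustrative.

```python
def nor(a, b):
    """Two-input NOR gate on 0/1 values."""
    return int(not (a or b))


def settle(s, r, q, q_bar, max_steps=10):
    """Update the two cross-coupled NOR gates until the outputs stop changing.

    Gate 1 computes Q-bar from S and Q; gate 2 computes Q from R and Q-bar.
    Assumption: both gates have the same unit delay and switch together.
    """
    for _ in range(max_steps):
        new_q_bar = nor(s, q)
        new_q = nor(r, q_bar)
        if (new_q, new_q_bar) == (q, q_bar):
            return q, q_bar
        q, q_bar = new_q, new_q_bar
    raise RuntimeError("latch did not settle under equal gate delays")


if __name__ == "__main__":
    print(settle(0, 0, 0, 1))  # S = R = 0: the stored values are kept
    print(settle(1, 0, 0, 1))  # S = 1: the latch is set
    print(settle(0, 1, 1, 0))  # R = 1: the latch is reset
    print(settle(1, 1, 0, 1))  # S = R = 1: both outputs are forced to 0
```

Under the equal-delay assumption, releasing S and R together from S = R = 1 makes `settle(0, 0, 0, 0)` raise, because the idealized outputs keep switching between 0 and 1; releasing one input first, as in `settle(0, 1, 0, 0)`, settles immediately. This is consistent with the statement that the result after S = R = 1 depends on which input returns to 0 first.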

Flip-Flops

Usually, S-R latches are used in designing asynchronous sequential networks, although sequential networks can be designed without them. For example, the memory function can be realized with longer loops of gates than the loop in the latch, and more sophisticated flip-flops than S-R latches can also be used. For example, a loop consisting of a pair of inverters and a special gate called a transmission gate is used in CMOS networks, as we will see in Chapter 36, Section 36.2. For synchronous networks, however, raceless flip-flops (described later in this chapter) are particularly important.


28.3 Sequential Networks in Fundamental Mode

In a sequential network, the value of the network output depends on both the current input values and some of the past input values stored in the network. We can find what value the output has for each individual combination of input values, depending on what signal values are stored in the network. This essentially means interpreting the signals stored inside the network as new input variables called internal variables and then interpreting the entire network as a combinational network of the new input variables, plus the original inputs, which are external input variables. Let us consider the sequential network in Fig. 28.2. Assume that the inputs are never changed unless the network is in a stable condition, that is, unless none of the internal variables is changing. Also assume that, whenever the inputs change, only one input changes at a time. Let y1, ȳ1, y2, and ȳ2 denote the outputs of the two S-R latches.

FIGURE 28.2

A sequential network with S-R latches.

Let us assume that y1 = y2 = 0 (accordingly, ȳ1 = ȳ2 = 1), x1 = 0, and x2 = 1. Then, as can be easily found, we have z1 = 1 and z2 = 0 for this combination of values. Because x1 = 0 and x2 = 1, the inputs of the latches have values R2 = 1 (accordingly, y2 = 0) and S1 = S2 = R1 = 0. Then y1 and y2 remain 0. As long as x1, y1, and y2 remain 0 and x2 remains 1, none of the signal values in this network changes and, consequently, this combination of values of x1, x2, y1, and y2 is called a stable state.

Next let us assume that x1 is changed to 1, keeping x2 = 1 and y1 = y2 = 0. For this new combination of input values, we get z1 = 0 and z2 = 0 after a time delay of τ, where τ is the switching time (delay time) of each gate, assuming for the sake of simplicity that each gate has the same τ (this is not necessarily true in practice). The two latches have new inputs S1 = 1 and S2 = R1 = R2 = 0 after a time delay of τ. Then, they have new output values y1 = 1 (previously 0), ȳ1 = 0, y2 = 0, and ȳ2 = 1 after a delay due to the response time of the latches. Outputs z1 and z2 both change from 0 to 1. After this change, the network does not change any further.

Summarizing the above, we can say that, when the network has the combination x1 = x2 = 1, z1 = z2 = y1 = y2 = 0, the network does not remain in this combination, but changes into the new combination x1 = x2 = z1 = z2 = y1 = 1, y2 = 0. Also, outputs z1 and z2 change into z1 = z2 = 1, after assuming the values z1 = z2 = 0 temporarily. After the network changes into the combination x1 = x2 = y1 = 1, y2 = 0, and z1 = z2 = 1, it remains there. The combination x1 = x2 = 1 and z1 = z2 = y1 = y2 = 0 is called an unstable state. The transition from an unstable to a stable state, such as the above transition to the stable state, is the key to the analysis of a sequential network.

Transition Tables

This and other transitions for other combinations of values of x1, x2, y1, y2, z1, and z2 can be shown on the map in Table 28.1, which is called a transition-output table, showing the next values of y1 and y2 as Y1 and Y2, respectively. The entry in each cell in Table 28.1 consists of the next states Y1, Y2 and the outputs z1 and z2. The above transition from x1 = 0, x2 = 1, y1 = y2 = 0 to x1 = x2 = y1 = 1, y2 = 0 is shown by the line with an arrow in Table 28.1. For this transition, the network is initially in the cell labeled with ∗ in Table 28.1; and when x1 changes to 1, the network moves to the next cell labeled with the dot with this cell’s entry, Y1 = 1, Y2 = 0, and z1 = z2 = 0, during the transient period. Here, it is important to notice that this cell shows the next values of the internal variables y1 and y2 (i.e., Y1 and Y2), but the network actually has the current values of y1 and y2, that is, y1 = y2 = 0, during the transition period corresponding to this cell. Y1 and Y2 in this cell, which is in an unstable state, indicate what values y1 and y2 should take after the transition. Then, the network moves to the bottom cell where the values of y1 and y2 are identical to Y1 and Y2, respectively, as indicated by the line with an arrow, because Y1 and Y2 will be the current y1 and y2 after the transition.

The present values of the internal variables are shown with y1 and y2 in small letters and their next values are shown with Y1 and Y2 in capital letters. More specifically, variables y1 and y2 are called present-state (internal) variables, and Y1 and Y2 are called next-state (internal) variables. As can easily be seen, when the values of Y1 and Y2 shown inside a cell are identical to those of y1 and y2 shown on the left of the table, respectively, a state containing these values of Y1, Y2, y1, and y2 is stable, because there is no transition of y1 and y2 into a new, different state. Next-state variables in stable states are encircled in Table 28.1. Each column in this table corresponds to a combination of values of network inputs x1 and x2, that is, an input state. Each row corresponds to a combination of values of internal variables y1 and y2, that is, a present internal state. Thus, each cell corresponds to a total state, or simply a state; that is, a combination of values of x1, x2, y1, and y2.

TABLE 28.1 Transition-Output Table in Fundamental Mode

When only next states Y1 and Y2 are shown, instead of showing all next states Y1, Y2, and outputs z1 and z2 in each cell of a transition-output table, the table is called a transition table, and when only outputs z1 and z2 are shown, the table is called an output table.

Fundamental Mode

A sequential network is said to be operating in fundamental mode when the transition occurs horizontally in the transition table corresponding to each network input change and, then, unless the new state in the same row is stable, the transition continues vertically (not diagonally), settling in a new stable state, such as the transition shown with the line with an arrow in Table 28.1. The transitions of the network take place under the assumption that only one of the network inputs changes at a time, and only when the network is not in transition; that is, only when the network is settled in a stable state.


By using a transition-output table, the output sequence (i.e., the sequence of output values) corresponding to any input sequence (i.e., the sequence of input values) can be easily obtained. In other words, the output sequence can be obtained by choosing columns corresponding to the input sequence and then moving to stable states whenever the states entered are unstable.
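The procedure for reading an output sequence from a transition-output table can be sketched in a few lines. Since Table 28.1 is not reproduced here, the dictionary below holds only the three cells that the worked transition in the text states explicitly; it is a partial table for illustration, and the names `PARTIAL_TABLE`, `respond`, and `run` are assumptions of this sketch.

```python
# Partial transition-output table: (x1x2, y1y2) -> (Y1Y2, z1z2).
# Only the cells described in the worked example of the text are included.
PARTIAL_TABLE = {
    ("01", "00"): ("00", "10"),  # stable state marked with * in the text
    ("11", "00"): ("10", "00"),  # unstable state entered when x1 changes to 1
    ("11", "10"): ("10", "11"),  # stable state reached after the transition
}


def respond(table, x, y):
    """Move horizontally to column x, then vertically until a stable state.

    Returns the list of (present state, output) pairs visited, ending with
    the stable state.
    """
    visited = []
    seen = set()
    while True:
        if y in seen:
            raise RuntimeError("no stable state reached in this column")
        seen.add(y)
        next_y, z = table[(x, y)]
        visited.append((y, z))
        if next_y == y:  # Y1Y2 equals y1y2: the state is stable
            return visited
        y = next_y


def run(table, input_sequence, y):
    """Return the final internal state and the settled output for each input."""
    outputs = []
    for x in input_sequence:
        y, z = respond(table, x, y)[-1]
        outputs.append(z)
    return y, outputs


if __name__ == "__main__":
    print(respond(PARTIAL_TABLE, "11", "00"))
    print(run(PARTIAL_TABLE, ["01", "11"], "00"))
```

For the input change from (x1, x2) = (01) to (11), the trace passes through the unstable cell with z1 = z2 = 0 and ends in the stable cell with z1 = z2 = 1, which matches the temporary output values described earlier.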

28.4 Malfunctions of Asynchronous Sequential Networks

An asynchronous sequential network does not work reliably unless appropriate binary numbers are assigned to internal variables for each input change.

Racing Problem of Sequential Networks

A difference in time delays of signal propagation along different paths may cause a malfunction of an asynchronous sequential network that is called a race. This is caused by a difference in the delay times of gates. Let us consider the transition-output table of a certain asynchronous sequential network shown in Table 28.2.

TABLE 28.2 Transition-Output Table in Fundamental Mode

Suppose that the network is in the stable state (x1, x2, y1, y2) = (1110). If the inputs change from (x1, x2) = (11) to (10), the network changes into the unstable state (x1, x2, y1, y2) = (1010), marked with ∗ in Table 28.2. Because of y1 = 1, y2 = 0, Y1 = 0 and Y2 = 1, two logic gates whose outputs represent y1 and y2 in the network must change their output values simultaneously. However, it is extremely difficult for the two gates to finish their changes at exactly the same time, because no two paths that reach these gates have identical time delays. Actually, one of these two logic gates finishes its change before the other does. In other words, we have one of the following two cases:

1. y2 changes first, making the transition of (y1, y2) from (10) to (11) and leading the network into stable state (x1, x2, y1, y2) = (1011).
2. y1 changes first, making the transition of (y1, y2) from (10) to (00) and reaching the unstable state (x1, x2, y1, y2) = (1000), and then y2 changes, making the further transition of (y1, y2) from (00) to (01), settling in the stable state (x1, x2, y1, y2) = (1001).

Thus, the network will go to either state (y1, y2) = (11), in case 1, or state (y1, y2) = (00), in case 2, instead of going directly to (y1, y2) = (01). If (y1, y2) = (11) is reached, state (11) is stable, and the network stops here. If (00) is reached, this is an unstable state, and another transition to the next stable state, (01), occurs. State (01) is the destination stable state (i.e., desired stable state), but (11) is not. Thus, depending on which path of gates works faster, the network may malfunction. This situation is called a race. The network may or may not malfunction, depending on which case actually occurs.


Next, suppose that the network is in the stable state (x1, x2, y1, y2) = (0100) and that inputs (x1, x2) = (01) change to (11). The cell labeled with ∗∗ in Table 28.2 has (Y1, Y2) = (11). Thus, y1 and y2 must have simultaneous changes. Depending on which one of the two logic gates whose outputs represent y1 and y2 changes its output faster, there are two possible transitions. But in this case, both end up in the same stable state, (y1, y2) = (10). This is another race, but the network does not malfunction, regardless of the order of change of the internal variables. Hence, this is called a noncritical race, whereas the previous race for cases 1 and 2 is termed a critical race because the performance is unpredictable (it is hard to predict which path of gates will have the signal change faster), possibly causing the network to malfunction.

We may have more complex critical racing. In other words, if we have a network such that the output of the gate whose output represents y1 feeds back to the input of a gate on a path that reaches the gate whose output represents y1, then y1 may continue to change its value from 0 to 1 (or from 1 to 0), then change back to 0 by feedback of the new value 1, and so on. This oscillatory race may continue for a long time.
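The distinction between critical and noncritical races can be tested mechanically by letting the racing variables change one at a time in every possible order. The sketch below uses only the column (x1, x2) = (10) of Table 28.2 as described in the text (the rest of the table is not reproduced); the names `COLUMN_10`, `final_states`, and `classify` are illustrative, and the model ignores the idealized case of exactly simultaneous changes.

```python
# Column (x1, x2) = (10) of Table 28.2, from the entries stated in the text:
# present state y1y2 -> next state Y1Y2.
COLUMN_10 = {"10": "01", "11": "11", "00": "01", "01": "01"}


def final_states(column, y, depth=0):
    """Stable states reachable from y when racing bits change one at a time."""
    if depth > 2 * len(column):
        raise RuntimeError("no stable state reached (oscillatory race)")
    target = column[y]
    if target == y:
        return {y}
    reached = set()
    for i, (old, new) in enumerate(zip(y, target)):
        if old != new:
            # Internal variable i wins the race and changes first.
            after_first_change = y[:i] + new + y[i + 1:]
            reached |= final_states(column, after_first_change, depth + 1)
    return reached


def classify(column, y):
    """Label the cell for present state y in the given column."""
    changing = sum(old != new for old, new in zip(y, column[y]))
    if changing < 2:
        return "no race"
    if len(final_states(column, y)) == 1:
        return "noncritical race"
    return "critical race"


if __name__ == "__main__":
    print(sorted(final_states(COLUMN_10, "10")))
    print(classify(COLUMN_10, "10"))
```

For the cell marked with ∗, the reachable stable states are (01) and (11), so the cell is classified as a critical race, in agreement with cases 1 and 2 above.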

Remedies for the Racing Problem

Whenever a network has critical races, we must eliminate them for reliable operation. One approach is to make the paths of gates have definitely different time delays. This approach, however, may not be the most desirable for the following reasons. First, the speed may be sacrificed by adding gates for delay. Second, there are cases where this approach is impossible if a network contains more than one critical race. By eliminating a critical race in one column in the transition table by using a path of different time delay, a critical race in another column may occur or may be aggravated.

A better approach is to choose some entries in the transition table so that no critical races occur. Then, on the basis of this new table, we synthesize a network with the desired performance, according to the synthesis method to be described later. A change of entries in some unstable states without changing destination stable states, such that only one internal variable changes its value at a time, is one method for eliminating critical races. The critical race discussed above can be eliminated from Table 28.2 by replacing the entry marked with ∗ by (00), as shown in Table 28.3, where only Y1 and Y2 are shown without the network output z. If every entry that causes the network to malfunction can be changed in this manner, a reliable network with the performance desired in the original is produced, because every destination stable state to which the network should go is not changed, and the output value for the destination stable state is also not changed. (We need not worry about noncritical races, since they cause no malfunctions.)

TABLE 28.3 Transition Table in Fundamental Mode, Derived by Modifying Table 28.2

However, sometimes, there are entries for some unstable states that cannot be changed in this manner. For example, consider Table 28.4 (some states are not shown, for the sake of simplicity). The entry (01) for (x1, x2, y1, y2) = (0010) is such an entry and causes a critical race. [State (x1, x2, y1, y2) = (0010) in Table 28.2 may have the same property, but actually, the network never enters this state because of the assumption that inputs x1 and x2 do not change simultaneously.] Since this entry requires simultaneous transitions of two internal variables, y1 and y2, we need to change it to (00) or (11). Both lead the network to stable states different from the destination stable state (01).

TABLE 28.4 Transition Table in Fundamental Mode

When a change of entries in unstable states does not work, we need to redesign the network completely by adding more states (e.g., 8 states with 3 internal variables, y1, y2, and y3, instead of 4 states with 2 internal variables, y1 and y2, for Table 28.4) without changing transitions among stable states, as we will see later in this chapter. This redesign may include the reassignment of binary numbers to states, or the addition of intermediate unstable states through which the network goes from one stable state to another. The addition of more states, reassignment of binary numbers to states, and addition of intermediate unstable states for this redesign, however, is usually cumbersome. So, designers usually prefer the use of synchronous sequential networks because design procedures are simpler and the racing problem, including oscillatory races due to the existence of feedback loops in a sequential network, is completely eliminated.
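The remedy of changing an unstable entry can be explored by trying every replacement that differs from the present state in exactly one internal variable and keeping those that still lead to the original destination. This is a minimal sketch, not a synthesis procedure: the first column is the (x1, x2) = (10) column of Table 28.2 as described in the text, the second column is hypothetical (Table 28.4 is not reproduced), and the function names are illustrative.

```python
def hamming(a, b):
    """Number of internal variables that differ between two states."""
    return sum(p != q for p, q in zip(a, b))


def path(column, y, limit=16):
    """States visited from y, following the entries until a stable state."""
    states = [y]
    while column[states[-1]] != states[-1]:
        if len(states) > limit:
            raise RuntimeError("no stable state reached")
        states.append(column[states[-1]])
    return states


def safe_replacements(column, y):
    """Single-variable replacements for the entry at y that keep its destination."""
    destination = path(column, y)[-1]
    flip = {"0": "1", "1": "0"}
    good = []
    for i in range(len(y)):
        candidate = y[:i] + flip[y[i]] + y[i + 1:]
        patched = {**column, y: candidate}
        try:
            states = path(patched, y)
        except RuntimeError:
            continue
        one_at_a_time = all(hamming(a, b) == 1 for a, b in zip(states, states[1:]))
        if states[-1] == destination and one_at_a_time:
            good.append(candidate)
    return good


# Column (x1, x2) = (10) of Table 28.2, from the entries stated in the text.
COLUMN_10 = {"00": "01", "01": "01", "11": "11", "10": "01"}

# Hypothetical column in which both neighbors of (10) are stable states.
HYPOTHETICAL_COLUMN = {"00": "00", "01": "01", "11": "11", "10": "01"}

if __name__ == "__main__":
    print(safe_replacements(COLUMN_10, "10"))
    print(safe_replacements(HYPOTHETICAL_COLUMN, "10"))
```

For the column from Table 28.2, the only replacement found is (00), the entry used in Table 28.3. An empty result, as for the hypothetical column, corresponds to the situation in which entries cannot be changed and the network must be redesigned with more states.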

28.5 Different Tables for the Description of Transitions of Sequential Networks

When we look at a given sequential network from the outside, we usually cannot observe the values (i.e., binary numbers) of the internal variables; that is, the inside of the network (e.g., if the network is implemented in an IC package, no internal variables may appear at its pins). Also, binary numbers are not very convenient to use. Therefore, binary numbers for the internal states of the network may be replaced in a transition-output table by arbitrary letters or decimal numbers. The table that results is called a state-output table. For example, Table 28.5 is the state-output table obtained from the transition-output table in Table 28.2, where y1y2 is replaced by s and Y1Y2 is replaced by S. A present state of the internal state is denoted by s and its next state by S.

In some state tables, networks go from one stable state to another through more than one unstable state, instead of exactly one unstable state. For example, in the state-output table in Table 28.5, when inputs (x1, x2) change from (00) to (01), the network goes from stable state (x1, x2, s) = (00C) to another stable state, (01A), by first going to unstable state (01C), then to intermediate unstable state (01D), and finally to stable state (01A). Here, (x1, x2, s) = (01D) is called an intermediate unstable state. Such a multiple transition from one stable state to another in a state-output table cannot be observed well from outside the network, and is not important as far as the external performance of the network is concerned. Even if each intermediate unstable state occurring during multiple transitions is replaced by the corresponding ultimate stable state, it does not make any difference if the network performance is observed from outside. Such a table is called a flow-output table. The flow-output table corresponding to the state-output table in Table 28.5 is shown in Table 28.6, where D in the intermediate unstable state (x1, x2, s) = (01C), for example, in Table 28.5 is replaced by A in Table 28.6.

TABLE 28.5 State-Output Table Derived from Table 28.2

TABLE 28.6 Flow-Output Table Derived from Table 28.5
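The conversion from a state table to a flow table, in which each next-state entry is replaced by the ultimate stable state, can be sketched as follows. Only the three entries of the column (x1, x2) = (01) that the text mentions are used (Tables 28.5 and 28.6 are not reproduced), and the names `STATE_COLUMN_01` and `to_flow_column` are illustrative.

```python
# Column (x1, x2) = (01) of the state table, from the transitions in the text:
# present state s -> next state S (the row for state B is not described).
STATE_COLUMN_01 = {"C": "D", "D": "A", "A": "A"}


def to_flow_column(column):
    """Replace every next state by the stable state that is finally reached."""
    flow = {}
    for s, nxt in column.items():
        seen = {s}
        while column[nxt] != nxt:  # nxt is still an unstable state
            if nxt in seen:
                raise ValueError("column contains a cycle without a stable state")
            seen.add(nxt)
            nxt = column[nxt]
        flow[s] = nxt
    return flow


if __name__ == "__main__":
    print(to_flow_column(STATE_COLUMN_01))
```

In the result, the entry for present state C becomes A instead of D, which is the replacement described for the flow-output table.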

28.6 Steps for the Synthesis of Sequential Networks

Let us first introduce the general model for sequential networks and then describe a sequence of steps for designing a sequential network.

General Model of Sequential Networks

A sequential network may be generally expressed in the schematic form shown in Fig. 28.3. A large block represents a loopless network of logic gates only (without any flip-flops). All loops, both those that contain flip-flops and those that do not, are drawn outside the large block. This loopless network has external input variables x1, …, xn, and internal variables y1, …, yp, as its inputs. It also has external output variables z1, …, zm, and excitation variables e1, …, eq, as its outputs. Some of the excitation variables e1, …, eq are inputs to the flip-flops, serving to excite the flip-flops. The remainder of the excitation variables are starting points of the loops without flip-flops, which end up at some of the internal variables. For loops without flip-flops, ei = Yi holds for each i. For example, the network in Fig. 28.2 can be redrawn in the format of Fig. 28.3, placing the latches outside the loopless network.


FIGURE 28.3

A general model of a sequential network.

Synthesis as a Reversal of Network Analysis

The outputs e1, …, eq, z1, …, zm of the combinational network inside the sequential network in Fig. 28.3 express logic functions that have y1, …, yp, x1, …, xn as their variables. Thus, if these logic functions are given, the loopless network can be designed by the methods discussed in earlier chapters. This means that we have designed the sequential network, since the rest of the general model in Fig. 28.3 is simply loops with or without flip-flops to be placed outside this combinational network. But we cannot derive these logic functions e1, …, eq, z1, …, zm directly from the given design problem, so let us find, in the following, what gap to fill in.

When loops have no flip-flops, ei = Yi holds for each i, where Y is the next value of y, and the binary-value relationship between the inputs y1, …, yp, x1, …, xn and the outputs Y1, …, Yq, z1, …, zm of the combinational network inside the sequential network in Fig. 28.3 is expressed by a transition-output table. But when loops have flip-flops, we need the relationship between e1, …, eq, z1, …, zm and y1, …, yp, x1, …, xn, where e1, …, eq are the inputs S1, R1, S2, R2, … of the latches. A table that shows this relationship is called an excitation-output table. For this purpose, we need to derive the output-input relation of the S-R latch, as shown in Table 28.8, by reversing the input-output relationship in Table 28.7, which shows the output y of a latch and its next value Y for every feasible combination of the values of S and R.

TABLE 28.7 Input-Output Relationship of an S-R Latch

S  R  y  Y
0  0  0  0
0  0  1  1
0  1  0  0
0  1  1  0
1  0  0  1
1  0  1  1

TABLE 28.8 Output-Input Relationship of an S-R Latch

y  Y  S  R
0  0  0  d
0  1  1  0
1  0  0  1
1  1  d  0

In Table 28.8, the d’s mean don’t-care conditions. For example, y = Y = 0 in Table 28.8 results from S = R = 0 and also from S = 0, R = 1 in Table 28.7. Therefore, the first row in Table 28.8, y = Y = 0, S = 0, R = d, is obtained, because y = Y = 0 results from S = 0 only, but R can be 0 or 1; that is, don’t-care, d. By using Table 28.8, the transition-output table in Table 28.1, for example, is converted into the excitation-output table in Table 28.9. For example, corresponding to y1 = Y1 = 0 in the first row and the first column in Table 28.1, S1 = 0, R1 = d is obtained as the first 0d in the cell in the first row and the first column of Table 28.9, because the first row in Table 28.8 corresponds to this case. Of course, when a network such as that in Fig. 28.2 is given, we can derive the excitation-output table directly from the network in Fig. 28.2 rather than from the transition-output table in Table 28.1 (which was derived for Fig. 28.2). But when we are going to synthesize a sequential network from a transition-output table, we do not have the network yet and we need to construct an excitation-output table from a transition-output table, using Table 28.8.

TABLE 28.9 Excitation-Output Table Derived from Table 28.1

                        x1, x2
y1y2   00            01            11            10
00     0d, 0d, 10    0d, 0d, 10    10, 0d, 00    0d, 10, 00
01     0d, d0, 10    0d, 01, 10    10, d0, 10    0d, d0, 10
11     01, d0, 10    d0, 01, 10    d0, d0, 10    d0, d0, 10
10     0d, 0d, 10    d0, 0d, 11    d0, 0d, 11    d0, 0d, 10

Cell entries: S1R1, S2R2, z1z2
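The reversal of the latch's input-output relationship into its output-input relationship can be reproduced with a short program. This is a sketch with illustrative names (`sr_next`, `excitation`, `cell`): it enumerates the three feasible (S, R) combinations, keeps those that produce the required next value, and writes d for an input that may take either value.

```python
def sr_next(y, s, r):
    """Next output Y of an S-R latch; S = R = 1 is not used."""
    if s and r:
        raise ValueError("S = R = 1 is not a feasible input combination")
    return 1 if s else (0 if r else y)


def excitation(y, next_y):
    """Return the S and R values (as a two-character string) that take y to next_y."""
    feasible = [(s, r) for s, r in ((0, 0), (0, 1), (1, 0))
                if sr_next(y, s, r) == next_y]
    s_values = {s for s, _ in feasible}
    r_values = {r for _, r in feasible}
    # For this latch every feasible set is a product of per-input sets,
    # so each input can be summarized on its own as 0, 1, or d.
    s_entry = "d" if len(s_values) == 2 else str(min(s_values))
    r_entry = "d" if len(r_values) == 2 else str(min(r_values))
    return s_entry + r_entry


def cell(present, nxt):
    """Excitation entries S1R1, S2R2, ... for one cell of a transition table."""
    return ", ".join(excitation(int(a), int(b)) for a, b in zip(present, nxt))


if __name__ == "__main__":
    for y in (0, 1):
        for next_y in (0, 1):
            print(y, next_y, excitation(y, next_y))
    print(cell("00", "10"))
```

The four printed rows agree with the output-input relationship 0d, 10, 01, d0, and `cell("00", "10")` gives the latch inputs for the transition from y1 = y2 = 0 to Y1 = 1, Y2 = 0 discussed earlier.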

Design Steps for Synthesis of Sequential Networks

A sequential network can be designed in the sequence of steps shown in Fig. 28.4. The required performance of a network to be designed for the given problem is first converted into a flow-output table. Then this table is converted into a state-output table, and next into a transition-output table, by choosing an appropriate assignment of binary numbers to all states. Then, the transition-output table is converted into an excitation-output table if the loops contain flip-flops. If the loops contain no flip-flops, the excitation-output table need not be prepared, since it is identical to the transition-output table. Finally, a network is designed, using the logic design procedures discussed in the preceding chapters.

FIGURE 28.4

Steps for designing a sequential network.


28.7 Synthesis of Sequential Networks

Since the use of clocks has many advantages, synchronous sequential networks with clocks are in many cases preferred to asynchronous sequential networks. In addition to the ease of design and elimination of hazards including racing problems, the speed of logic networks can sometimes (e.g., Domino CMOS) be improved by synchronizing the operations of logic gates by clocks, and waveforms of signals can be reshaped into clean ones. For synchronous sequential networks, more sophisticated flip-flops than latches are usually used, along with clocks. With these flip-flops, which are called raceless flip-flops, network malfunctioning due to hazards can be completely eliminated. Design of sequential networks with raceless flip-flops and clocks is much simpler than that of networks in fundamental mode, since we can use simpler flow-output and state-output tables, which are said to be in skew mode, and multiple changes of internal variables need not be avoided in assigning binary numbers to states.

Raceless Flip-Flops

Raceless flip-flops are flip-flops that have complex structures but eliminate network malfunctions due to races. The most widely used raceless flip-flops are master-slave flip-flops and edge-triggered flip-flops. A master-slave flip-flop (or simply, an MS flip-flop) consists of a pair of flip-flops called a master flip-flop and a slave flip-flop. Let us explain the features of master-slave flip-flops based on the J-K master-slave flip-flop shown in Fig. 28.5. For the sake of simplicity, all the gates in these flip-flops are assumed to have equal delay times, although in actual electronic implementations, this may not be true. Its symbol is shown in Fig. 28.6, where the letters MS are shown inside the rectangle. Each action of the master-slave flip-flop is controlled by the leading and trailing edges of a clock pulse, as explained in the following.

FIGURE 28.5

J-K master-slave flip-flop.

FIGURE 28.6

Symbol for J-K master-slave flip-flop.

The J-K master-slave flip-flop in Fig. 28.5 works with the clock pulse, as illustrated in Fig. 28.7, where the rise and fall of the pulse are exaggerated for illustration. When the clock has the value 0 (i.e., c = 0), NAND gates 1 and 2 in Fig. 28.5 have output values 1, and then the flip-flop consisting of gates 3 and 4 does not change its state. As long as the clock stays at 0, gates 5 and 6 force the flip-flop consisting of gates 7 and 8 to assume the same output values as those of the flip-flop consisting of gates 3 and 4 (i.e., the former is slaved to the latter, which is the master). Each of the master and slave is a slight modification of the latch in Fig. 28.1(d).

When the clock pulse starts to rise to the lower threshold at time t1 in Fig. 28.7, gates 5 and 6 are disabled: in other words, the slave is cut off from the master. (The lower threshold value of the clock pulse still presents the logic value 0 to gates 1 and 2. The clock waveform is inverted to gates 5 and 6 through the inverter. The inputs to gates 5 and 6 from the inverter present the logic value 0 also to gates 5 and 6, because the inverted waveform is still close to logic value 1 but is not large enough to let gates 5 and 6 work. The inverter is actually implemented by a diode which is forward-biased, so that the network works in this manner.) When the clock pulse reaches the upper threshold at t2 (i.e., c = 1), gates 1 and 2 are enabled, and the information at J or K is read into the master flip-flop through gate 1 or 2. Since the slave is cut off from the master by disabled gates 5 and 6, the slave does not change its state, maintaining the previous output values of Q and Q̄. When the clock pulse falls to the upper threshold at t3 after its peak, gates 1 and 2 are disabled, cutting off J and K from the master. In other words, the outputs of gates 1 and 2 become 1, and the master maintains the current output values. When the clock pulse falls further to the lower threshold at t4, gates 5 and 6 are enabled and the information stored at the master is transferred to the slave, gates 7 and 8.

FIGURE 28.7

Clock pulse waveform.

The important feature of the master-slave flip-flop is that the reading of information into the flip-flop and the establishment of new output values are done at different times; in other words, the outputs of the flip-flop can be completely prevented from feeding back to the inputs, possibly going through some gates outside the master-slave flip-flop, while the network that contains the flip-flop is still in transition. The master-slave flip-flop does not respond to its input until the leading edge of the next clock pulse. Thus, we can avoid oscillatory races. Now we have a J-K flip-flop that works reliably, regardless of how long a clock pulse or signal 1 at terminal J or K lasts, since input gates 1 and 2 in Fig. 28.5 are gated by output gates 7 and 8, which do not assume new values before gates 1 and 2 are disconnected from J and K. As we will see later, sequential networks that work reliably can be constructed compactly with master-slave flip-flops, without worrying about hazards.

Other types of master-slave flip-flops are also used. The T (type) master-slave flip-flop (this is also called toggle flip-flop, trigger flip-flop, or T flip-flop) has only a single input, labeled T, as shown in Fig. 28.8(a), which is the J-K master-slave flip-flop with J and K tied together as T. Whenever we have T = 1 during the clock pulse, the outputs of the flip-flop change, as shown in Fig. 28.8(b). The T-type flip-flop, denoted as in Fig. 28.8(c), is often used in counters. The D (type) master-slave flip-flop (D implies “delay”) has only a single input D, and is realized as shown in Fig. 28.9(a). As shown in Fig. 28.9(b), no matter what value Q has, Q is set to the value of D that is present during the clock pulse. The D-type flip-flop is used for delay of a signal or data storage and is denoted by the symbol shown in Fig. 28.9(c).
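The two-phase operation described above, reading into the master while c = 1 and transferring to the slave when the clock falls, can be illustrated with a behavioral model. This is a sketch, not a gate-level model of Fig. 28.5: thresholds and gate delays are ignored, the class and function names are illustrative, and realizing the D type by applying the complement of D to K is an assumption of this sketch (a common realization), since the figure is not reproduced here.

```python
class JKMasterSlave:
    """Behavioral model of a J-K master-slave flip-flop."""

    def __init__(self, q=0):
        self.master = q  # state of the master (gates 3 and 4)
        self.q = q       # output Q of the slave (gates 7 and 8)

    def clock_high(self, j, k):
        """c = 1: J and K are read into the master; the slave is cut off."""
        if j and not self.q:   # the J input is gated by Q-bar of the slave
            self.master = 1
        elif k and self.q:     # the K input is gated by Q of the slave
            self.master = 0

    def clock_low(self):
        """c = 0: the master is cut off from J and K; the slave copies it."""
        self.q = self.master
        return self.q

    def pulse(self, j, k):
        """Apply one complete clock pulse and return the new output Q."""
        self.clock_high(j, k)
        return self.clock_low()


def t_pulse(flip_flop, t):
    """T type: J and K tied together as T."""
    return flip_flop.pulse(t, t)


def d_pulse(flip_flop, d):
    """D type, assuming J = D and K = complement of D."""
    return flip_flop.pulse(d, 1 - d)


if __name__ == "__main__":
    ff = JKMasterSlave()
    print([ff.pulse(j, k) for j, k in [(1, 0), (0, 0), (0, 1), (1, 1), (1, 1)]])
```

Because `clock_high` changes only the master, the output Q stays at its previous value for the whole time the clock is 1, which is the property that prevents the outputs from feeding back to the inputs during a transition.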


FIGURE 28.8

T-type master-slave flip-flop.

FIGURE 28.9

D-type master-slave flip-flop.

Edge-triggered flip-flops are another type of raceless flip-flop. Either the leading or the trailing edge of a clock pulse (not both) causes a flip-flop to respond to an input, and then the input is immediately disengaged from the flip-flop. Edge-triggered flip-flops are mostly used in the same manner as master-slave flip-flops.

Example of Design of a Synchronous Sequential Network

Let us now explain synthesis of a synchronous sequential network with an example.

Specification for the Design Example

Suppose we want to synthesize a clocked network with two inputs, x1 and x2, and a single output z under the following specifications:

1. Inputs do not change during the presence of clock pulses, as illustrated in Fig. 28.10. Inputs x1 and x2 cannot assume value 1 simultaneously during the presence of clock pulses. During clock pulses, an input signal of value 1 appears at exactly one of the two inputs, x1 and x2, of the network, or does not appear at all.
2. The value of z changes as follows.
   a. The value of z becomes 1 when the value 1 appears during the clock pulse at the same input at which the last value 1 appeared. Once z becomes 1, z remains 1 regardless of the presence or absence of clock pulses, until signal 1 starts to appear at the other input during clock pulses. (This includes the following case. Suppose that we have z = 0 when signal 1 appears at one of the inputs during a clock pulse. Then, signal 0 follows at both x1 and x2 during the succeeding clock pulses. If signal 1 comes back to the same input, z becomes 1.) As illustrated in Fig. 28.10, z becomes 1 at time t1 at the leading edge of the second pulse because signal 1 is repeated at input x1. Then z continues to remain 1 even though signal 1 appears at neither input at the third pulse starting at t2.
   b. The value of z becomes 0 when the value 1 appears at the other input during the clock pulse. Once z becomes 0, z remains 0 regardless of the presence or absence of clock pulses, until signal 1 starts to appear at the same input during clock pulses. In Fig. 28.10, z continues to be 1 until the leading edge of the pulse starting at time t3. Then z continues to remain 0 until the time t4.


FIGURE 28.10

Waveform for the design example.
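The specification can be restated as a small reference model that processes one pair of input values per clock pulse. This is a sketch for checking a design against the specification, not the network to be synthesized. The function name and its parameters are illustrative, and the treatment of the very first 1 (z is left unchanged when no earlier 1 is known) is an assumption, since the specification does not cover that case.

```python
def reference_output(pulses, last=None, z=0):
    """Return the value of z after each clock pulse.

    pulses: sequence of (x1, x2) pairs, one pair per clock pulse.
    last:   input (1 or 2) at which the most recent 1 appeared, or None.
    z:      value of the output before the first pulse.
    """
    outputs = []
    for x1, x2 in pulses:
        if x1 and x2:
            raise ValueError("x1 and x2 cannot both be 1 during a clock pulse")
        current = 1 if x1 else (2 if x2 else None)
        if current is not None:
            if last is not None:
                # Same input as the last 1 gives z = 1; the other input gives z = 0.
                z = 1 if current == last else 0
            last = current
        outputs.append(z)  # with no 1 at either input, z keeps its value
    return outputs


if __name__ == "__main__":
    print(reference_output([(1, 0), (1, 0), (0, 0), (0, 1), (0, 1)]))
    print(reference_output([(1, 0), (0, 0), (0, 0), (1, 0)]))
```

The second call corresponds to the parenthetical case in specification 2a: a 1 at x1, followed by pulses with 0 at both inputs, followed by a 1 at x1 again, after which z becomes 1.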

Let us prepare a flow-output table for this design problem in the following steps.

1. An output value of z must depend on an internal state only, because z has to maintain its value until the next change. Let us assume that there are exactly two states, A and B, such that z = 0 when the network is in state A and z = 1 when the network is in state B. (We will try more than two states if two are found insufficient.)
2. Assume that during the absence of a clock pulse, the network is in state A. Let c denote the clock. Since the network must stay in this state as long as c = 0, the next state, S, in the column for c = 0 must be A, as shown in Table 28.10(a). Similarly, the next state for the second row in the column c = 0 must be B. It is to be noted that during c = 0, there are four combinations of values of x1 and x2 (i.e., x1 = x2 = 0; x1 = 1 and x2 = 0; x1 = 0 and x2 = 1; and x1 = x2 = 1). But the state of the network to be synthesized is independent of the values of x1 and x2. Thus, in Table 28.10, we have only one column corresponding to c = 0, instead of four columns.

TABLE 28.10 Two States Are Not Enough

z 0 1

c=1 x1, x2 s c=0 00 01 11 A A, 0 — B B, 1 — S, z (a) Entering A is not correct.

10 A

z 0 1

c=1 x1, x2 s c=0 00 01 11 A A, 0 — B B, 1 — S, z (b) Entering B is not correct.

10 B

3. When the network is in state A during c = 0, suppose that we have x1 = 1 at the next clock pulse. Let us choose A as the next state, S, as shown in the last column in Table 28.10(a). But this choice means that, if we apply value 1 repeatedly at x1, the network goes back and forth between the two states in the first and last columns in the first row in Table 28.10(a). Then z = 1 must result from the specification of the network performance; but since the network stays in the first row, we must have z = 0. This is a contradiction. Thus, the next state for (x1, x2, s) = (10A) cannot be A.
4. Next assume that the next state for (x1, x2, s) = (10A) is B, as shown in Table 28.10(b). Suppose that the value 1 has been alternating between x1 and x2. If we had the last 1 at x2, the network must be currently in A because of z = 0 for alternating 1’s. When the next 1 appears at x1, z = 0 must still hold because the value 1 is still alternating. But the network will produce z = 1 for state (10A), because B is assumed to correspond to z = 1. This is a contradiction, and the choice of B is also wrong.

5. In conclusion, two states are not sufficient, so we try again with more states. For this example, considering only two states corresponding to z = 0 and 1 is not appropriate.
6. In the above, we had contradictions by assuming only two states, A and B, corresponding to z = 0 and 1; we did not know which input’s 1 led to each state. Thus, in addition to the values of z, let us consider which input had the last 1. In other words, we have four states corresponding to the combinations of z and the last 1, as shown in Table 28.11. At this stage, we do not know whether or not three states are sufficient. But let us assume four states for the moment. As a matter of fact, both three states and four states require two internal variables. Hence, in terms of the number of internal variables, it does not matter whether we have three or four states, although the two cases may lead to different networks.

TABLE 28.11 Flow-Output Table in Skew Mode

7. Derivation of a flow-output table in skew mode: Let us form a flow-output table, assuming the use of J-K master-slave flip-flops. In the column of c = 0 in Table 28.11, the network must stay in each state during the absence of a clock pulse. Thus, the next state S in every cell in the column of c = 0 must be identical to s in that row. Suppose that the network is in state (c, x1, x2, s) = (000A) with output z = 0 after having the last 1 at x1. When the value 1 appears at x1 and c becomes 1, the next state S must be B for the following reason. When the current clock pulse disappears, this 1 at x1 will be “the last 1” at x1 (so S must be A or B) and z will have to be 1 because of the repeated occurrence of 1’s at x1. (This contradicts z = 0, which we will have if A is entered as S.) Hence, the possibility of S being A is ruled out, and S must be B. Since value 1 is repeated at x1, we have z = 1. The next states and the values of output z in all other cells in Table 28.11 can be found in a similar manner.

Let us analyze how a transition among stable states occurs in this table. According to the problem specification, inputs x1 and x2 can change only during the absence of clock pulses. Suppose that during the absence of a clock pulse, the network is in state (c, x1, x2, s) = (000A) in Table 28.11. Suppose that inputs (x1, x2) change from (00) to (10) sometime during the absence of a clock pulse and (x1, x2) = (10) lasts at least until the trailing edge of the clock pulse. If this transition is interpreted on Table 28.11, the network moves from (c, x1, x2, s) = (000A) to (110A), as shown by the dotted line with an arrow, at the leading edge of the clock pulse.

Then the network must stay in this state, (110A), until the trailing edge of the clock pulse, because the outputs of the J-K master-slave flip-flops keep their current values during the clock pulse, changing to their new values only at the trailing edge of the clock pulse, and thus the internal state s does not yet change to its new state S. (This is different from the fundamental mode, in which the network does not stay in this state unless the state is a stable one, and moves vertically to a stable state in a different row in the same column.) Then the network moves from (c, x1, x2, s) = (110A) to (010B), as shown by the solid line with an arrow in Table 28.11, when s assumes the new state S at the trailing edge of the clock pulse. Thus, the transition occurs horizontally and then diagonally in Table 28.11. This type of transition is called skew mode, in order to differentiate it from the fundamental mode.

Notice, however, that in the new state (110A) at the leading edge of the clock pulse in the above transition from (c, x1, x2, s) = (000A), the network output z assumes the new output value. Consequently, in this new stable state during c = 1 in Table 28.11, the new current value of the network output, z, is shown, while the internal-state entry in this new stable state shows the next state S, though the network is actually in s. In the fundamental mode, when the network moves horizontally to a new unstable state in the same row in the state-output table, the current value of the network output is shown and the internal state represents the next state, S (in this sense the situation is not different), but the network output lasts only during a short, transient period, unless the new state is stable. In contrast, in skew mode, the output value for the new state is not transient (because the network stays in this state during the clock pulse) and is essential in the description of the network performance. We will synthesize a network for this flow-output table in skew mode later.

8. Derivation of a flow-output table in fundamental mode: Next let us try to interpret this table in fundamental mode. Table 28.11 shows that, when the network placed in state (c, x1, x2, s) = (000A) receives x2 = 1 before the appearance of the next pulse, the next state will be C at the leading edge of the next pulse. But the network goes to unstable state (c, x1, x2, s) = (101C), which has entry D, since if we assume fundamental mode, it must move vertically, instead of making the diagonal transition of skew mode. Hence, the network must go further to stable state (101D), without settling in the desired stable state, C, if the network still keeps x2 = 1 and c = 1. Therefore, the network cannot be in fundamental mode. (Recall that the next-state entries in a flow-output table, unlike a state-output table, do not show intermediate unstable states but do show the destination stable states.)
The above difficulty can be avoided by adding two new rows, E and F, as shown in Table 28.12. When the network placed in state (c, x1, x2, s) = (000A) receives x2 = 1, the network goes to the new stable state F in column (x1, x2) = (01) and in row F. For this state F, z = 0, without causing contradiction. When the clock pulse disappears, the network goes to stable state (000C) after passing through unstable state (000F). The problem with the other states is similarly eliminated. All stable states are encircled.

TABLE 28.12 Flow-Output Table in Fundamental Mode

The values of z for stable states are easily entered. The values of z for unstable states can be entered with certain freedom. For example, suppose that the network placed in state (c, x1, x2, s) = (000A) receives x1 = 1. Then the next state S is B. In this case, we may enter 0 or 1 as z for (110A) for the following reason. We have z = 0 for the initial stable state A during c = 0 and z = 1 for the destination stable state B during c = 1, so z for the unstable state in between may be either 0 or 1. This does not make much difference as far as the external behavior of the network is concerned, because the network stays in this unstable state (110A) only for the short transient period, and the choice simply means that z = 1 appears a little bit earlier or later. The network that results, however, may be different. Accordingly, z = d (d denotes “don’t-care”) would be the best assignment, since this gives flexibility in designing the network later.
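The skew-mode reading of the four-state table in step 7 can be sketched as a small simulation. This is an illustration under stated assumptions, not the handbook's table itself: the dictionary `FLOW` and the function `run_skew_mode` are hypothetical names, and the entries are filled in from the state meanings of step 6 (A: z = 0, last 1 at x1; B: z = 1, last 1 at x1; C: z = 0, last 1 at x2; D: z = 1, last 1 at x2) and the reasoning of step 7.

```python
# (state s, (x1, x2)) -> (next state S, output z during the clock pulse)
FLOW = {
    ("A", (0, 0)): ("A", 0), ("A", (1, 0)): ("B", 1), ("A", (0, 1)): ("C", 0),
    ("B", (0, 0)): ("B", 1), ("B", (1, 0)): ("B", 1), ("B", (0, 1)): ("C", 0),
    ("C", (0, 0)): ("C", 0), ("C", (1, 0)): ("A", 0), ("C", (0, 1)): ("D", 1),
    ("D", (0, 0)): ("D", 1), ("D", (1, 0)): ("A", 0), ("D", (0, 1)): ("D", 1),
}


def run_skew_mode(state, pulses):
    """Simulate one clock pulse per element of pulses.

    Leading edge: the network moves horizontally; s is unchanged but z
    already takes the value stored in the cell.
    Trailing edge: the flip-flops change and s assumes the next state S.
    Returns a list of (s during the pulse, z during the pulse, S after it).
    """
    events = []
    for inputs in pulses:
        next_state, z = FLOW[(state, inputs)]
        events.append((state, z, next_state))
        state = next_state  # takes effect at the trailing edge
    return events


for event in run_skew_mode("A", [(1, 0), (0, 0), (0, 1), (0, 1)]):
    print(event)
```

In fundamental mode the same cell contents would force an immediate vertical move to row S, which is why the text has to add rows E and F before the table can be used there.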

Design of Synchronous Sequential Networks in Skew Mode

Now let us design a synchronous sequential network based on a flow-output table in skew mode, using J-K master-slave flip-flops. As pointed out previously, when master-slave flip-flops are used, the network does not have racing hazards even if internal variables make multiple changes, because the flip-flops do not respond to any input changes during clock pulses. Thus, we need not worry about hazards due to multiple changes of internal variables, nor, consequently, about finding an appropriate assignment of binary numbers to states when forming a transition-output table from a state-output table or a flow-output table in the design steps in Fig. 28.4. But the number of gates, connections, or levels in a network to be designed can differ, depending on how binary numbers are assigned to states (this is of secondary importance compared with the hazard problem, which makes networks useless if they malfunction). Making state assignments without considering multiple changes of internal variables is much easier than having to take these changes into account. Let us derive the transition-output table shown in Table 28.13 from Table 28.11, using a state assignment as shown.

TABLE 28.13 Transition-Output Table in Skew Mode

                          c = 1, (x1, x2)
s   y1y2   c = 0    00      01      11      10
A   00     00, 0    00, 0   11, 0   dd, d   01, 1
B   01     01, 1    01, 1   11, 0   dd, d   01, 1
C   11     11, 0    11, 0   10, 1   dd, d   00, 0
D   10     10, 1    10, 1   10, 1   dd, d   00, 0
           Entries: Y1Y2, z

Reversing the input-output relationship of J-K master-slave flip-flops shown in Table 28.14(a), we have the output-input relationship shown in Table 28.14(b) (other master-slave flip-flop types can be treated in a similar manner). Table 28.14(b) shows what values inputs J and K must take for each change of internal variable y to its next value Y. (In order to have y = Y = 0, J = K = 0 or J = 0 and K = 1 must hold, as we can see in Table 28.14(a), and thus we have J = 0 and K = d in Table 28.14(b).) Using Table 28.14(b), we form the excitation table in Table 28.15. Decomposing Table 28.15 into five Karnaugh maps for J1, K1, J2, K2, and z, we can find a minimal sum for each of them. On the basis of these logic expressions, we design the loopless network inside the general model of sequential networks in Fig. 28.3. Then, placing two J-K master-slave flip-flops outside this loopless network, we obtain the sequential network shown in Fig. 28.11. Master-slave flip-flops do not respond to changes in their inputs from the time their outputs change until the leading edge of the next clock pulse. Thus, no network malfunction due to races occurs, and no post-analysis of whether the designed network malfunctions because of races is necessary. This is the advantage of clocked networks with raceless flip-flops.
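The step from a transition-output entry to an excitation entry is mechanical, as the following sketch suggests. The names `JK_EXCITATION` and `excitation` are hypothetical; the mapping itself is the standard output-input relationship of a J-K flip-flop (0 to 0: J = 0, K = d; 0 to 1: J = 1, K = d; 1 to 0: J = d, K = 1; 1 to 1: J = d, K = 0).

```python
# (present value y, next value Y) -> (J, K); "d" denotes don't-care.
JK_EXCITATION = {
    (0, 0): ("0", "d"),
    (0, 1): ("1", "d"),
    (1, 0): ("d", "1"),
    (1, 1): ("d", "0"),
}


def excitation(present, nxt):
    """Convert present-state bits y1y2 and next-state bits Y1Y2
    (given as strings such as "01") into the entry "J1K1, J2K2"."""
    if "d" in nxt:
        # Unspecified next state: every J and K is a don't-care.
        return ", ".join("dd" for _ in present)
    pairs = (JK_EXCITATION[(int(y), int(Y))] for y, Y in zip(present, nxt))
    return ", ".join(j + k for j, k in pairs)


print(excitation("00", "11"))  # 1d, 1d
print(excitation("11", "10"))  # d0, d1
print(excitation("10", "00"))  # d1, 0d
```

Applying `excitation` to every cell of a transition-output table, and copying z unchanged, yields the corresponding excitation-output table.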

TABLE 28.14 Input-Output Relationship and Output-Input Relationship of J-K Master-Slave Flip-Flop

(a) Input-output relationship

Inputs    Outputs
J   K     y   Y
0   0     0   0
          1   1
0   1     0   0
          1   0
1   0     0   1
          1   1
1   1     0   1
          1   0

(b) Output-input relationship

Outputs   Inputs
y   Y     J   K
0   0     0   d
0   1     1   d
1   0     d   1
1   1     d   0

TABLE 28.15 Excitation-Output Table Derived from Table 28.13

                                 c = 1, (x1, x2)
y1y2   c = 0        00           01           11           10
00     0d, 0d, 0    0d, 0d, 0    1d, 1d, 0    dd, dd, d    0d, 1d, 1
01     0d, d0, 1    0d, d0, 1    1d, d0, 0    dd, dd, d    0d, d0, 1
11     d0, d0, 0    d0, d0, 0    d0, d1, 1    dd, dd, d    d1, d1, 0
10     d0, 0d, 1    d0, 0d, 1    d0, 0d, 1    dd, dd, d    d1, 0d, 0
       Entries: J1K1, J2K2, z

FIGURE 28.11 Synthesized network based on Table 28.15.

Design of Asynchronous Sequential Networks in Fundamental Mode

Now let us design an asynchronous sequential network based on a flow-output table in fundamental mode. According to the design steps of Fig. 28.4, we have to derive a transition-output table from the flow-output table shown in Table 28.12, via a state-output table, by assigning appropriate binary numbers to states such that multiple changes of internal variables do not occur in any transition from one binary number to another, possibly changing some intermediate unstable states to others. Then, if the designers want to use S-R latches, we can derive an excitation table by finding the output-input relationship of the S-R latch, as illustrated with Tables 28.7, 28.8, and 28.9. Then, we can design the loopless network inside the general model of Fig. 28.3 by deriving minimal sums from the Karnaugh maps decomposed from the excitation-output table. If the designers do not want to use S-R latches, we can design the loopless network inside the general model of Fig. 28.3 by deriving minimal sums from the Karnaugh maps decomposed from the transition-output table, without deriving an excitation-output table. In the case of an asynchronous sequential network, we need post-analysis of whether the designed network works reliably.

Advantages of Skew Mode

The advantages of skew-mode operation with raceless flip-flops, such as master-slave or edge-triggered flip-flops, can be summarized as follows:

1. The flow-output tables (or state-output tables) used in skew mode are no more complex, and often simpler, than those required in fundamental mode, making design easier (because we need not consider both unstable and stable states for each input change, and need not consider adding extra states or changing intermediate unstable states in order to avoid multiple changes of internal variables).
2. State assignments are greatly simplified because we need not worry about hazards due to multiple changes of internal variables. (If we want to minimize the number of gates, connections, or levels, we need to try different state assignments. This is less important than the reliable operation of the networks to be synthesized.)
3. Networks synthesized in skew mode usually require fewer internal variables than those in fundamental mode.
4. After the network synthesis, we do not need to check whether the networks contain racing hazards. This is probably the greatest of all the advantages of skew mode, since checking hazards and finding remedies is usually very cumbersome and time-consuming.

References

1. Kohavi, Z., Switching and Automata Theory, 2nd ed., McGraw-Hill, 1978.
2. McCluskey, E. J., Logic Design Principles: With Emphasis on Testable Semicustom Circuits, Prentice-Hall, 1986.
3. Miller, R., Switching Theory, vol. 2, John Wiley & Sons, 1965.
4. Muroga, S., Logic Design and Switching Theory, John Wiley & Sons (now available from Krieger Publishing Co.), 1979.
5. Roth, C. H. Jr., Fundamentals of Logic Design, 4th ed., West Publishing Co., 1992.
6. Unger, S. H., The Essence of Logic Circuits, 2nd ed., IEEE Press, 1997.

Nakamura, Y., Muroga, S. "Logic Synthesis with AND and OR Gates in Multi-levels" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

29 Logic Synthesis with AND and OR Gates in Multi-levels

Yuichi Nakamura, NEC Corporation
Saburo Muroga, University of Illinois at Urbana-Champaign

29.1 Logic Networks with AND and OR Gates in Multi-levels
29.2 General Division
29.3 Selection of Divisors
29.4 Limitation of Weak Division

29.1 Logic Networks with AND and OR Gates in Multi-levels

In logic networks, the number of levels is defined as the number of gates in the longest path from external inputs to external outputs. When we design logic networks with AND and OR gates, those in multi-levels can be designed with no more gates than those in two levels. Logic networks in multi-levels have more levels than those in two levels, but this does not necessarily mean that they have greater delay time, because a logic gate that has many fan-out connections generally has greater delay time than gates that have fewer fan-out connections (Remark 29.1). Also, a logic gate that has many fan-in connections from other logic gates tends to have larger area on the chip and longer delay time than gates that have fewer fan-in connections. Thus, if we want to design a logic network with a small delay time and small area, we need to design a logic network in many levels, keeping the maximum fan-out and fan-in of each gate under a reasonably small limit.

Remark 29.1: When the line width in an IC chip is large, the delay time of logic gates is larger than that of connections and, once a logic network is designed, it can usually be laid out on the chip without further modifications. But when the line width becomes very small, under 0.25 µm, long connections add more delay, due to parasitic capacitance and resistance, than the gates do. The length of connections, however, cannot be known until the layout on an IC chip is made after logic design is finished. So at the time of logic design, prior to layout, designers know only the number of fan-out connections from each gate, and this is only partial information for delay estimation. Thus, when the line width becomes very small, it is difficult to estimate precisely the delay of a logic network at the time of logic design. We need to modify a logic network interactively, as we lay it out on the chip.
Such a multi-level logic network can be derived by rewriting a logic expression with parentheses. For example, a logic expression

f = ab ∨ acd ∨ ace ∨ bce ∨ bcd    (29.1)

can be realized with five AND gates and one OR gate in two levels, as illustrated in Fig. 29.1(a). However,

f = c(a ∨ b)(d ∨ e) ∨ ab    (29.2)

can be obtained by rewriting the expression with parentheses, as explained in the following. This logic expression can be realized with three OR gates and two AND gates in three levels, as illustrated in Fig. 29.1(b). The network in Fig. 29.1(b) would have a smaller area and smaller delay than the one in Fig. 29.1(a) because of fewer logic gates and a smaller maximum fan-in (a logic gate with five fan-ins, for example, has more than twice the area and delay of a gate with two fan-ins). A logic network in two levels with the fewest gates can be derived by minimal sums or minimal products, which can be derived by reasonably simple algorithms (described in Chapter 27). But if we try to derive multi-level logic networks, only a few reasonable algorithms are known. One of them is the weak division described in the following, although minimality is not guaranteed and its execution is not straightforward. Another algorithm is the special case (i.e., AND and OR gates) of the map-factoring method (described in Chapter 31).

FIGURE 29.1 Networks for f = ab ∨ acd ∨ ace ∨ bce ∨ bcd.
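That Eq. 29.1 and Eq. 29.2 describe the same function can be confirmed by exhaustive evaluation over the 32 input combinations. The sketch below is only a check of this particular pair; the function names `f_two_level` and `f_multi_level` are hypothetical.

```python
from itertools import product


def f_two_level(a, b, c, d, e):
    # Eq. 29.1: f = ab v acd v ace v bce v bcd
    return (a and b) or (a and c and d) or (a and c and e) or (b and c and e) or (b and c and d)


def f_multi_level(a, b, c, d, e):
    # Eq. 29.2: f = c(a v b)(d v e) v ab
    return (c and (a or b) and (d or e)) or (a and b)


equivalent = all(
    bool(f_two_level(*values)) == bool(f_multi_level(*values))
    for values in product([False, True], repeat=5)
)
print(equivalent)  # True
```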

29.2 General Division

Rewriting of a logic expression using parentheses can be done by the following division. The division is based on the use of sub-expressions that can be found in the given logic expression. The given logic expression in a sum-of-products can be rewritten with parentheses if it has common sub-expressions. For example, the logic expression in Eq. 29.1 can be converted to the following expression, using a sub-expression (a ∨ b):

f = cd(a ∨ b) ∨ ce(a ∨ b) ∨ ab

This can be further rewritten into the following, by sharing the common sub-expression (a ∨ b):

f = c(a ∨ b)(d ∨ e) ∨ ab    (29.3)

This rewriting can be regarded symbolically as division. Rewriting of the expression in Eq. 29.1 into the one in Eq. 29.3 may be regarded as division with the divisor x = a ∨ b, the quotient q = cd ∨ ce, and the remainder r = ab. Then, the expression f can be represented as follows:

f = xq ∨ r, with divisor x = a ∨ b, quotient q = cd ∨ ce = c(d ∨ e), and remainder r = ab.

The division is symbolically denoted as f/x. Generally, the quotient should not be 0, but the remainder may be 0. The division, however, may yield many different results because there are many possibilities in choosing a divisor and also the given logic function can be written in many different logic expressions, as explained in the following.
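The division of Eq. 29.1 by x = a ∨ b can be reproduced with a short routine. This is a sketch of the usual algebraic division on sums of products, offered as an assumption about how the division is carried out, since the text defines it only by example. The helper names `sop` and `divide` are hypothetical, products are written with "+" for OR purely for ASCII convenience, and every literal is treated as an independent symbol.

```python
def sop(text):
    """Parse e.g. "ab + acd" into a set of products; each product is a
    frozenset of single-letter literals."""
    return {frozenset(term.strip()) for term in text.split("+")}


def divide(f, divisor):
    """Return (quotient, remainder) such that f = divisor * quotient v remainder."""
    quotient = None
    for d in divisor:
        # Products of f that contain d, with d removed.
        partial = {product - d for product in f if d <= product}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    # Products reproduced by multiplying the divisor and the quotient.
    covered = {q | d for q in quotient for d in divisor}
    remainder = f - covered
    return quotient, remainder


f = sop("ab + acd + ace + bce + bcd")
q, r = divide(f, sop("a + b"))
print(q == sop("cd + ce"), r == sop("ab"))  # True True
```

Choosing a different divisor, for example `sop("d + e")`, gives a different quotient and remainder, which is the non-uniqueness noted at the end of the paragraph above.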

Division can be repeated on one given logic expression. Suppose f = ab̄ ∨ ac̄ ∨ bā ∨ bc̄ ∨ cā ∨ cb̄ is given. Repeating division three times, choosing successively b̄ ∨ c̄, ā ∨ c̄, and ā ∨ b̄ as divisors, the following result is derived:

f = x1a ∨ x2b ∨ x3c, with divisors x1 = b̄ ∨ c̄, x2 = ā ∨ c̄, and x3 = ā ∨ b̄.

29.3 Selection of Divisors

Among the divisions, those with a certain type of divisor, described in the following, are called weak divisions. The objective of the weak division is the derivation of a logic network with a logic expression having a minimal total number of literals, by repeatedly applying the weak division to the given logic expression until the division cannot be applied any further. In this case, the total number of literals is intended to be an approximation of the total number of inputs to all the logic gates, although not exactly, as explained later. Thus, we should start with a logic expression in a minimal sum of products, obtained by two-level logic minimization [1,2], before weak division in order to obtain a good result.

In the weak division, divisor selection is the most important problem in producing a compact network, because there are many divisor candidates to be found in the given logic expression and the result of the weak division depends on divisor selection. For example, the expression f = ab ∨ acd ∨ ace ∨ bce ∨ bcd in Eq. 29.1 with 14 literals has many divisor candidates: a, b, c, a ∨ b, b ∨ cd, and others. When a ∨ b is first selected as the divisor, the resultant network illustrated in Fig. 29.1(b) for f = c(a ∨ b)(d ∨ e) ∨ ab with 7 literals is obtained. On the other hand, if b ∨ cd is first selected as the divisor, the resultant network for f = c(e(a ∨ b) ∨ bd) ∨ a(b ∨ cd) with 10 literals illustrated in Fig. 29.1(c) is obtained, which is larger than the network illustrated in Fig. 29.1(b).

Divisors can be derived by finding sub-expressions called kernels. All the kernels for the given logic expression can be found as follows. First, all the subsets of products in the given logic expression are enumerated. Next, for each subset of products, the product of the largest number of literals that is common to all the products in this subset is found. This product is called a co-kernel. Then the sum of products obtained by eliminating this co-kernel from each product in the subset is a kernel. For example, the sum of products abc ∨ abd has the co-kernel ab as the product of the largest number of literals that is common to all the products, abc and abd. The kernel of abc ∨ abd is c ∨ d. However, ab ∨ ac ∨ d, taken as a whole, yields no kernel, because no literal is common to all of its products. The kernel b ∨ c with co-kernel a is found when the subset of products, ab ∨ ac, is considered.

Certainly, by trying all divisor candidates and selecting the best one, we can derive a network as small as possible by the weak division. However, such an exhaustive search is too time-consuming and requires a huge memory space. The branch-and-bound method is generally more efficient than the exhaustive search [3]. Thus, a heuristic method has been proposed in which the divisor candidates are restricted to sums of products with a specific feature [2]. This method, called kernel decomposition, reduces the number of divisor candidates. A kernel of an expression for f is a sum of at least two products such that the products in the sum contain no common literal (e.g., ab ∨ c is a kernel, but ab ∨ ac and abc are not kernels, because both products, ab and ac, in ab ∨ ac contain a, and abc is a single product). In particular, a kernel whose subsets contain no other kernels is called a level-0 kernel (e.g., a ∨ b is a level-0 kernel, but ab ∨ ac ∨ d is not a level-0 kernel because the sub-expression ab ∨ ac contains the kernel b ∨ c). The level of kernels is defined recursively: a level-K kernel contains at least one level-(K - 1) kernel (e.g., ab ∨ ac ∨ d is a level-1 kernel because it contains the level-0 kernel b ∨ c). Usually, a level-K kernel with K ≥ 1 is not used as a divisor, to save processing time, because the results obtained by using kernels of all levels as divisors are almost the same as those obtained by using only the level-0 kernels. Thus, all the kernels that are not level-0 kernels are excluded from divisor candidates.
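The subset-based procedure for finding co-kernels and kernels described above can be sketched directly. The function names `cokernel_and_kernel` and `all_kernels` are hypothetical, products are represented as frozensets of single-letter literals, and the enumeration is the brute-force one from the text (exponential in the number of products), so it is suitable only for small examples.

```python
from itertools import combinations


def cokernel_and_kernel(products):
    """For a list of products, return (co-kernel, kernel), or None when
    no literal is common to all of them."""
    common = frozenset.intersection(*products)
    if not common:
        return None
    kernel = {p - common for p in products}
    return common, kernel


def all_kernels(f):
    """Enumerate every subset of at least two products of f and collect
    the (co-kernel, kernel) pairs that result."""
    found = set()
    products = list(f)
    for size in range(2, len(products) + 1):
        for subset in combinations(products, size):
            result = cokernel_and_kernel(list(subset))
            if result is not None:
                co_kernel, kernel = result
                found.add((co_kernel, frozenset(kernel)))
    return found


abc, abd = frozenset("abc"), frozenset("abd")
print(cokernel_and_kernel([abc, abd]))  # co-kernel {a, b}, kernel {c, d}
```

For ab ∨ ac ∨ d, `all_kernels` reports only the pair (a, b ∨ c), matching the example in the text.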

For example, the logic expression illustrated in Fig. 29.1, f = ab ∨ acd ∨ ace ∨ bce ∨ bcd, has 16 kernels as shown in Table 29.1. By eliminating all the level-1 kernels, ad ∨ ae ∨ bd ∨ be, ea ∨ eb ∨ bd, and others from Table 29.1, we have the divisor candidates in Table 29.2.

TABLE 29.1 Kernels for f = ab ∨ acd ∨ ace ∨ bce ∨ bcd

Kernel               Co-kernel   Level
ad ∨ ae ∨ bd ∨ be    c           1
ea ∨ eb ∨ bd         c           1
eb ∨ ed ∨ cd         c           1
ab ∨ ae ∨ bd         c           1
ad ∨ ae ∨ be         c           1
a ∨ ce ∨ cd          b           1
b ∨ cd               a           0
b ∨ ce               a           0
a ∨ be               b           0
a ∨ cd               b           0
d ∨ e                ac          0
ad ∨ be              c           0
a ∨ b                cd          0
ae ∨ bd              c           0
a ∨ b                ce          0
d ∨ e                bc          0

TABLE 29.2 Level-0 Kernels and Co-kernels Derived from Table 29.1

Kernel     Co-kernel   Weight of Kernel
b ∨ cd     a           1
b ∨ ce     a           1
a ∨ be     b           1
a ∨ cd     b           1
ad ∨ be    c           1
ae ∨ bd    c           1
a ∨ b      ce, cd      6
d ∨ e      ac, bc      6

The next step is the selection of one divisor from all candidates. A candidate that decreases the largest number of literals in the expression by the weak division is selected as a divisor. If there is a tie, choose one of them. The difference in the number of literals before and after the weak division by the kernel is called the weight of the kernel. In the above example, the result of the weak division by the kernel b ∨ cd is f = a(b ∨ cd) ∨ ace ∨ bce ∨ bcd. The weight of the kernel b ∨ cd is 1, because the number of literals is reduced from 14 to 13 by this division. The weight of the kernels can be easily calculated by the number of literals in kernels and co-kernels without execution of weak division, because the quotient of the division by a kernel is a product of a co-kernel and other sub-expressions. The weight of the kernels for this example is shown in Table 29.2. Then the kernel a ∨ b is selected as a divisor with the largest weight. The expression f = ab ∨ acd ∨ ace ∨ bce ∨ bcd is divided by a ∨ b. In the next division, d ∨ e is selected and divides the expression after division by a ∨ b. Finally, the network f = c(a ∨ b)(d ∨ e) ∨ ab illustrated in Fig. 29.1(b) is obtained. These operations, enumeration of all the kernels, calculation of the weights of all kernels, selection of the largest one, and division are applied repeatedly until no kernel can be found. In this case, we can choose a different sequence of divisors, deriving a different result. But often we do not have a different result, so it may not be worthwhile to try many different sequences of divisors.
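The literal counts quoted in this section are easy to check mechanically. The sketch below uses a hypothetical helper, `literal_count`, that counts every letter occurrence in an expression string; it assumes single-letter, uncomplemented literals, as in the expressions of this example, and it evaluates the weight by actually comparing the expressions before and after division rather than by the shortcut mentioned in the text.

```python
def literal_count(expression):
    """Count literal occurrences; operators, parentheses, and spaces are ignored."""
    return sum(ch.isalpha() for ch in expression)


original = "ab ∨ acd ∨ ace ∨ bce ∨ bcd"
after_b_or_cd = "a(b ∨ cd) ∨ ace ∨ bce ∨ bcd"

# Weight of the kernel b ∨ cd: reduction in literals caused by the division.
weight = literal_count(original) - literal_count(after_b_or_cd)
print(literal_count(original), literal_count(after_b_or_cd), weight)  # 14 13 1

# The two fully divided results compared in the text.
print(literal_count("c(a ∨ b)(d ∨ e) ∨ ab"))          # 7
print(literal_count("c(e(a ∨ b) ∨ bd) ∨ a(b ∨ cd)"))  # 10
```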

Instead of the kernel decomposition method, a faster method has been proposed [4]. In this method, the divisor candidates are restricted to only level-0 kernels with two products, along with the introduction of complement-sharing: when both ab̄ ∨ āb and ab ∨ āb̄ are among the divisor candidates, one is realized with a logic gate while the other is realized by complementing it with an inverter (the complement of ab̄ ∨ āb is (ā ∨ b)(a ∨ b̄) = ab ∨ āb̄ for this example). Although the restriction is stronger than that of the kernel decomposition method, the method produces smaller networks and can run faster than the kernel decomposition in many cases.

29.4 Limitation of Weak Division

Although simpler networks can be easily derived by the weak division, the weak division cannot derive certain types of logic networks because of the restrictions in its rewriting of logic expressions with parentheses. Rewriting of logic expressions without such restrictions is called strong division. For example, complements of sub-expressions, such as f = \overline{(ab ∨ ac)} ∨ ba ∨ c, are not used in the weak division. Also, the two literals, x and x̄, for each variable x are regarded as different variables in the weak division, without using identities such as ab ∨ a = a, aā = 0, and a ∨ ā = 1. Suppose the sum-of-products f = ab̄ ∨ ac̄ ∨ bā ∨ bc̄ ∨ cā ∨ cb̄ is given. This can be rewritten in the following two different logic expressions:

f1 = ab̄ ∨ ac̄ ∨ bā ∨ bc̄ ∨ cā ∨ cb̄

and

f2 = aā ∨ ab̄ ∨ ac̄ ∨ bā ∨ bb̄ ∨ bc̄ ∨ cā ∨ cb̄ ∨ cc̄, using the identities aā = 0, bb̄ = 0, and cc̄ = 0.

They can be further rewritten as follows:

f1 = a(b̄ ∨ c̄) ∨ b(ā ∨ c̄) ∨ c(ā ∨ b̄)

and

f2 = (a ∨ b ∨ c)(ā ∨ b̄ ∨ c̄)

Both of these expressions can be written in the following forms, using divisors, quotients, and remainders, the remainders being 0 in this particular example:

f1 = x11q11 ∨ x12q12 ∨ x13q13 ∨ r1, with divisors x11 = b̄ ∨ c̄, x12 = ā ∨ c̄, x13 = ā ∨ b̄, quotients q11 = a, q12 = b, q13 = c, and remainder r1 = 0

f2 = x21q21 ∨ r2, with divisor x21 = a ∨ b ∨ c, quotient q21 = ā ∨ b̄ ∨ c̄, and remainder r2 = 0.

Corresponding to these expressions, we have two different logic networks, as shown in Fig. 29.2. The function f2 is derived by division by a divisor as above but actually cannot be obtained by the weak division. Thus, the logic network for f2 is labeled as the result of strong division in Fig. 29.2. Strong division is rewriting of logic expressions using any form, including the complement of a sub-expression, and accordingly has far greater possibilities than the division explained so far.

FIGURE 29.2 Strong and weak divisions.

The results of division are evaluated by the number of literals contained in each of the obtained expressions. This number is an approximation of the total number of fan-ins of gates in networks in the following sense: inputs to a logic gate from other gates are not counted as literals. In Fig. 29.2, the number of literals of f1 is 9, and the number of literals of f2 is 6. But in the logic network for f1, an input to each of the three AND gates in Fig. 29.2 that comes from another gate, for example, is not counted as a literal. Counting such inputs, the total number of fan-ins of all logic gates in the logic network for f1, which is 15 in Fig. 29.2, is larger than the total number of fan-ins of all gates in the logic network for f2, which is 8.
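The claim that the weak-division result f1 and the strong-division result f2 realize the same function can be checked by enumerating all eight input combinations. The sketch below assumes the reading of the example in which the second literal of each product is complemented, so that f2 = (a ∨ b ∨ c)(ā ∨ b̄ ∨ c̄); the function names are hypothetical.

```python
from itertools import product


def f1(a, b, c):
    # Weak division: a(b' v c') v b(a' v c') v c(a' v b')
    return (a and (not b or not c)) or (b and (not a or not c)) or (c and (not a or not b))


def f2(a, b, c):
    # Strong division: (a v b v c)(a' v b' v c')
    return (a or b or c) and (not a or not b or not c)


same = all(bool(f1(*v)) == bool(f2(*v)) for v in product([0, 1], repeat=3))
zeros = [v for v in product([0, 1], repeat=3) if not f2(*v)]
print(same, zeros)  # True [(0, 0, 0), (1, 1, 1)]
```

Both expressions are 0 only when all three inputs agree, yet f2 needs 6 literals where f1 needs 9.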

References

1. R. K. Brayton, G. D. Hachtel, C. McMullen, and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, Boston, 1984.
2. R. K. Brayton, A. Sangiovanni-Vincentelli, and A. Wang, “MIS: A Multiple-Level Logic Optimization System,” IEEE Transactions on CAD, CAD-6(6), pp. 1062-1081, July 1987.
3. G. De Micheli, A. Sangiovanni-Vincentelli, and P. Antognetti, Design System for VLSI Circuits: Logic Synthesis and Silicon Compilation, Martinus Nijhoff Publishers, pp. 197-248, 1987.
4. J. Rajski and J. Vasudevamurthy, “Testability Preserving Transformations in Multi-level Logic Synthesis,” IEEE ITC, pp. 265-273, 1990.

Muroga, S. "Logic Properties of Transistor Circuits" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

30 Logic Properties of Transistor Circuits

Saburo Muroga, University of Illinois at Urbana-Champaign

30.1 Basic Properties of Connecting Relays
30.2 Analysis of Relay-Contact Networks
     Transmission of Relay-Contact Networks
30.3 Transistor Circuits
     Bipolar Transistors • MOSFET (Metal-Oxide Semiconductor Field Effect Transistor) • Difference in the Behavior of n-MOS and p-MOS Logic Gates

30.1 Basic Properties of Connecting Relays Relays are probably the oldest means to realize logic operations. Relays, which are electromechanical devices, and their solid-state equivalents (i.e., transistors) are extensively used in many industrial products, such as computers. Relays are conceptually simple and appropriate for introducing physical realization of logic operations. More importantly, the connection configuration of a relay contact network is the same as that of transistors inside a logic gate realized with transistors, in particular MOSFETs (which stands for metal-oxide semiconductor field effect transistors). A relay consists of an armature, a magnet, and a metal contact. An armature is a metal spring made of magnetic material with a metal contact on it. There are two different types of relays: a make-contact relay and a break-contact relay. A make-contact relay is a relay such that, when there is no current through the magnet winding, the contact is open. When a direct current is supplied through the magnet winding, the armature is attracted to the magnet and, after a short time delay, the contact is closed. This type of relay contact is called a “make-contact” and is usually denoted by a lower-case x. The current through the magnet is denoted by a capital letter X, as shown in Fig. 30.1. A break-contact relay is a relay such that when there is no current through the magnet winding, the contact closes. When a direct current is supplied, the armature is attracted to the magnet and, after a short time delay, the contact opens. This type of relay contact is called a “break-contact” and is usually denoted by x . The current through the magnet is again denoted by X, as shown in Fig. 30.2. In either case of a make-contact relay or a break-contact relay, no current in a magnet is represented by X = 0, and the flow of a current is represented by X = 1. Then x = X, no matter whether X = 0 or 1. 
But the contact of a make-contact relay is open or closed according to whether X = 0 or 1, because the contact is expressed by x, whereas the contact of a break-contact relay is closed or open according to whether X = 0 or 1, because the contact is expressed by x̄. Let us connect two make-contacts x and y in series, as shown in Fig. 30.3. Since X and x assume identical values at any time, the magnet, along with its symbol X, will henceforth be omitted in figures unless it is needed for some reason. Then we have the combinations of states shown in Table 30.1(a),


FIGURE 30.1

A make-contact relay.

FIGURE 30.2

A break-contact relay.

FIGURE 30.3

Series connection of relay contacts.

TABLE 30.1 Combinations of States for the Series Connection in Fig. 30.3

(a)
x        y        Entire Path Between a and b
Open     Open     Open
Open     Closed   Open
Closed   Open     Open
Closed   Closed   Closed

(b)
x   y   f
0   0   0
0   1   0
1   0   0
1   1   1

which has only two states, “open” and “closed.” Let f denote the state of the entire path between terminals a and b, where f is called the transmission of the network. Since “open” and “closed” of a make-contact are represented by x = 0 and x = 1, respectively, Table 30.1(a) may be rewritten as shown in Table 30.1(b). This table shows the AND of x and y, defined in Chapter 23 and denoted by f = xy. Thus the network of a series connection of make-contacts realizes the AND operation of x and y. Let us connect two make-contacts x and y in parallel as shown in Fig. 30.4. Then we have the combinations of states shown in Table 30.2(a). Replacing “open” and “closed” by 0 and 1, respectively, we may rewrite Table 30.2(a) as shown in Table 30.2(b). This table shows the OR of x and y, defined in Chapter 24 and denoted by f = x ∨ y.
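The correspondence between contact connections and logic operations can be checked with a few lines of Python. This is only a minimal sketch: the function names `series`, `parallel`, and `truth_table` are illustrative and do not come from the text, and each contact is modeled simply as 0 (open) or 1 (closed).

```python
from itertools import product

def series(x, y):
    """Two make-contacts in series: the path is closed only if both are closed."""
    return x & y

def parallel(x, y):
    """Two make-contacts in parallel: the path is closed if either is closed."""
    return x | y

def truth_table(network):
    """List (x, y, f) for every combination of contact states, 0 = open, 1 = closed."""
    return [(x, y, network(x, y)) for x, y in product((0, 1), repeat=2)]

if __name__ == "__main__":
    for name, net in (("series", series), ("parallel", parallel)):
        print(name)
        for x, y, f in truth_table(net):
            print(f"  x={x} y={y} f={f}")
```

The series rows reproduce Table 30.1(b) and the parallel rows reproduce Table 30.2(b).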

30.2 Analysis of Relay-Contact Networks

Let us analyze a relay contact network. “Analysis of a network” means the description of the logic performance of the network in terms of a logic expression.¹


FIGURE 30.4

Parallel connection of relay contacts.

TABLE 30.2 Combinations of States for the Parallel Connection in Fig. 30.4

(a)
x        y        Entire Path Between a and b
Open     Open     Open
Open     Closed   Closed
Closed   Open     Closed
Closed   Closed   Closed

(b)
x   y   f
0   0   0
0   1   1
1   0   1
1   1   1

Transmission of Relay-Contact Networks

We now discuss general procedures to calculate the transmission of a network in which relay contacts are connected in a more complex manner. The first general procedure is based on the concept of tie sets, defined as follows.

Definition 30.1: Consider a path that connects two external terminals, a and b, and no part of which forms a loop. Then the literals that represent the contacts on this path are called a tie set of this network.

An example of a tie set is the contacts x1, x6, x̄2, and x5 on the path numbered 1 in Fig. 30.5.

FIGURE 30.5

Tie sets of non-series-parallel network.

Procedure 30.1: Derivation of the Transmission of a Relay-Contact Network by Tie Sets

Find all the tie sets of the network. Form the product of all literals in each tie set. Then the disjunction of all these products yields the transmission of the given network.

These tie sets represent all the shortest paths that connect terminals a and b. As an example, the network of Fig. 30.5 has the following tie sets:

For path 1: x1, x6, x̄2, x5
For path 2: x2, x4, x5
For path 3: x1, x6, x3, x4, x5
For path 4: x2, x3, x̄2, x5


Then, we get the transmission of the network:

$$f = x_1 x_6 \bar{x}_2 x_5 \lor x_2 x_4 x_5 \lor x_1 x_6 x_3 x_4 x_5 \lor x_2 \bar{x}_2 x_3 x_5 \tag{30.1}$$

where the last term, x2 x̄2 x3 x5, may be eliminated, since it is identically equal to 0 for any value of x2. Procedure 30.1 yields the transmission of the given network because all the tie sets correspond to all the possibilities for making f equal 1. For example, the first term, x1 x6 x̄2 x5, in Eq. 30.1 becomes 1 for the combination of variables x1 = x6 = x5 = 1 and x2 = 0. Correspondingly, the two terminals a and b of the network in Fig. 30.5 are connected for this combination. The second general procedure is given after the following definition.

Definition 30.2: Consider a set of contacts that satisfy the following conditions:
1. If all of the contacts in this set are opened simultaneously (ignoring functional relationships among contacts; in other words, even if two contacts, x and x̄, are included in this set, it is assumed that contacts x and x̄ can be opened simultaneously), the entire network is split into exactly two isolated subnetworks, one containing terminal a and the other containing b.
2. If any of the contacts are closed, the two subnetworks can be connected.
Then, the literals that represent these contacts are called a cut set of this network.



As an example, let us find all the cut sets of the network in Fig. 30.5, which is reproduced in Fig. 30.6. First, let us open contacts x1 and x2 simultaneously, as shown in Fig. 30.6(a). Then terminals a and b are completely disconnected (thus condition 1 of Definition 30.2 is satisfied). If either of contacts x1 and x2 is closed, the two terminals a and b can be connected by closing the remaining contacts (thus, condition 2 of Definition 30.2 is satisfied). We have all the cut sets shown in the following list and also in Fig. 30.6(b):

For cut 1: x1, x2
For cut 2: x6, x2
For cut 3: x1, x3, x4
For cut 4: x6, x3, x4
For cut 5: x̄2, x3, x2
For cut 6: x̄2, x4
For cut 7: x5

FIGURE 30.6

Cut sets of the network in Fig. 30.5.

Procedure 30.2: Derivation of the Transmission of a Relay-Contact Network by Cut Sets

Find all the cut sets of a network. Form the disjunction of all literals in each cut set. Then the product of all these disjunctions yields the transmission of the given network.

On the basis of the cut sets in the network of Fig. 30.5 derived above, we get

$$f = (x_1 \lor x_2)(x_6 \lor x_2)(x_1 \lor x_3 \lor x_4)(x_6 \lor x_3 \lor x_4)(\bar{x}_2 \lor x_3 \lor x_2)(\bar{x}_2 \lor x_4)(x_5) \tag{30.2}$$


This expression looks different from Eq. 30.1, but they are equivalent, since we can get identical truth tables for both expressions. Procedure 30.2 yields the transmission of a relay contact network because all the cut sets correspond to all possible ways to disconnect two terminals, a and b, of a network; that is, all possibilities of making f equal 0. Any way to disconnect a and b that is not a cut set consists of some cut set plus additional unnecessary open contacts, as can easily be seen. The disjunction inside each pair of parentheses in Eq. 30.2 corresponds to a different cut set. A disjunction that contains both literals of any variable (e.g., (x̄2 ∨ x3 ∨ x2) in Eq. 30.2 contains the two literals, x̄2 and x2, of the variable x2) is identically equal to 1 and is insignificant in multiplying out f. Therefore, a cut set that contains both literals of some variable need not be considered in Procedure 30.2.
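The claimed equivalence of Eq. 30.1 and Eq. 30.2 can be checked by brute force over all 64 input combinations. The sketch below is illustrative only: the tie sets and cut sets are copied from the lists above, with x̄2 read as the complemented literal of x2, while the encoding of a literal as a `(variable, polarity)` pair and all function names are assumptions made for this example.

```python
from itertools import product

# A literal is (variable index, polarity); polarity False means the complemented literal.
TIE_SETS = [
    [(1, True), (6, True), (2, False), (5, True)],             # path 1
    [(2, True), (4, True), (5, True)],                         # path 2
    [(1, True), (6, True), (3, True), (4, True), (5, True)],   # path 3
    [(2, True), (3, True), (2, False), (5, True)],             # path 4 (always 0)
]
CUT_SETS = [
    [(1, True), (2, True)],                 # cut 1
    [(6, True), (2, True)],                 # cut 2
    [(1, True), (3, True), (4, True)],      # cut 3
    [(6, True), (3, True), (4, True)],      # cut 4
    [(2, False), (3, True), (2, True)],     # cut 5 (always 1)
    [(2, False), (4, True)],                # cut 6
    [(5, True)],                            # cut 7
]

def literal_value(literal, assignment):
    """Value of a literal under an assignment {variable index: 0 or 1}."""
    var, positive = literal
    return assignment[var] if positive else 1 - assignment[var]

def by_tie_sets(assignment):
    """Procedure 30.1: disjunction of the products of the tie sets."""
    return int(any(all(literal_value(l, assignment) for l in ts) for ts in TIE_SETS))

def by_cut_sets(assignment):
    """Procedure 30.2: product of the disjunctions of the cut sets."""
    return int(all(any(literal_value(l, assignment) for l in cs) for cs in CUT_SETS))

def all_assignments():
    """Every combination of values for x1 through x6."""
    for bits in product((0, 1), repeat=6):
        yield dict(zip(range(1, 7), bits))

def equivalent():
    """True if both procedures give the same transmission for every input."""
    return all(by_tie_sets(a) == by_cut_sets(a) for a in all_assignments())

if __name__ == "__main__":
    print("Eq. 30.1 and Eq. 30.2 agree on all inputs:", equivalent())
```

Comparing truth tables this way is exactly the argument the text uses; the code only automates the enumeration.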

30.3 Transistor Circuits

Bipolar transistors and MOSFETs are currently the two most important types of transistors for integrated circuit chips, although MOSFETs are becoming increasingly popular.² A transistor is made of pure silicon that contains a trace of impurities (i.e., n-type silicon or p-type silicon). When a larger amount of impurity than standard is added, we have n+- and p+-type silicon. When less is added, we have n–- and p–-type silicon. A bipolar transistor has a structure of n-type regions (or simply n-regions) consisting of n-type silicon and p-type regions (or simply p-regions) consisting of p-type silicon, as illustrated in Fig. 30.7, which differs from the structure of the MOSFET illustrated in Fig. 30.10.

Bipolar Transistors

An implementation example of an n-p-n bipolar transistor, which has three electrodes (i.e., an emitter, a base, and a collector), is shown in Fig. 30.7(a) along with its symbol in Fig. 30.7(b). A p-n-p transistor has the same structure except that the p-type and n-type regions are exchanged (and likewise the n+- and p+-type regions).

FIGURE 30.7

n-p-n transistor.

Suppose that an n-p-n transistor is connected to a power supply with a resistor and to the ground, as shown in Fig. 30.8(a). When the input voltage vi increases, the collector current ic gradually increases, as shown in Fig. 30.8(b). (Actually, ic is 0 until vi reaches about 0.6 V. Then ic gradually increases and then starts to saturate.) As ic increases, the output voltage vo decreases from 5 V to 0.3 V or less because of the voltage difference across the resistor R, as shown in Fig. 30.8(c). Therefore, when the input vi is a high


FIGURE 30.8

Inverter circuit.

voltage (about 5 V), the output vo is a low voltage (about 0.3 V), and when vi is a low voltage (about 0.3 V), vo is a high voltage (about 5 V). This is illustrated in Table 30.3(a). Thus, if binary logic values 0 and 1 are represented by low and high voltages, respectively, we have the truth table in Table 30.3(b). This means that the circuit in Fig. 30.8(a) works as an inverter. In other words, if vi represents a logic variable x, then vo represents the logic function x̄.

TABLE 30.3 Input-Output Relations of the Inverter in Fig. 30.8(a)

(a) Voltage Required
Input vi        Output vo
Low voltage     High voltage
High voltage    Low voltage

(b) Truth Table
vi   vo
0    1
1    0

Since we are concerned with binary logic values in designing logic networks, we will henceforth consider only the on-off states of currents or the corresponding voltages in electronic circuits (e.g., A and B in Fig. 30.8(b), or A′ and B′ in Fig. 30.8(c)), without considering their voltage magnitudes. As we will see later, the transistor circuit in Fig. 30.8(a) is often used as part of more complex transistor circuits that constitute logic gates. Here, notice that if resistor R′ is added between the emitter and the ground, and the output terminal vo is connected to the emitter, instead of to the collector, as shown in Fig. 30.9, then the new circuit does not work as an inverter. In this case, when vi is a high voltage, vo is also a high voltage, because the current that flows through the transistor produces the voltage difference across resistor R′. When vi is a low voltage, vo is also a low voltage, because no current flows and

FIGURE 30.9

Emitter follower.


consequently no voltage difference develops across R′. So if vi represents a variable x, vo represents the logic function x, and no logic operation is performed. The transistor circuit in Fig. 30.9, which is often called an emitter follower, works as a current amplifier. This circuit is often used as part of other circuits to supply a large output current by connecting the collector of the transistor directly to Vcc without R. A logic gate based on bipolar transistors generally consists of many transistors that are connected in a more complex manner than in Fig. 30.8 or 30.9, and realizes a more complex logic function than x̄. (See Chapter 35 on ECL.)

MOSFET (Metal-Oxide Semiconductor Field Effect Transistor)

In integrated circuit chips, two types of MOSFETs are usually used: the n-channel enhancement-mode MOSFET (abbreviated as n-channel enhancement MOS, or enhancement nMOS) and the p-channel enhancement-mode MOSFET (abbreviated as p-channel enhancement MOS, or enhancement pMOS). The structure of the former is illustrated in Fig. 30.10(a). They are expressed by the symbols shown in Fig. 30.10(b) and (c), respectively. Each of them has three terminals: gate, source, and drain. In Fig. 30.10(a), a gate realized with metal is shown for the sake of simplicity, but a more complex structure, called the silicon-gate MOSFET, is now far more widely used. The “gate” in Fig. 30.10 should not be confused with “logic gates.” The thin area underneath the gate between the source and drain regions in Fig. 30.10(a) is called a channel, through which a current flows whenever the channel is conductive.

FIGURE 30.10

MOSFET.

Suppose that the source of an n-channel enhancement-mode MOSFET is grounded and the drain is connected to the power supply of 3.3 V through resistor R, as illustrated in Fig. 30.11(a). When the input voltage vi increases from 0 V, the current i, which flows from the power supply to the ground through R and the MOSFET, increases as shown in Fig. 30.11(b), but for vi smaller than the threshold voltage VT, essentially no current flows. Then, because of the voltage drop across R, the output voltage vo decreases, as shown in Fig. 30.11(c). Since we use binary logic, we need to use only two different voltage values, say 0.2 and 3.3 V. If vi is 0.2 V, no current flows from the power supply to the ground through the MOSFET and vo is 3.3 V. If vi is 3.3 V, the MOSFET becomes conductive and a current flows; vo is then 0.2 V because of the voltage drop across R. Thus, if vi is a low voltage (0.2 V), vo is a high voltage (3.3 V); and if vi is a high voltage, vo is a low voltage, as shown in Table 30.4(a). If low and high voltages represent logic values 0 and 1, respectively (in other words, if we use positive logic), Table 30.4(a) is converted to the truth table in Table 30.4(b). (When low and high voltages represent 1 and 0, respectively, this is said to be negative logic.) Thus, if vi represents logic variable x, output vo represents function x̄. This means that the electronic circuit in Fig. 30.11(a) works as an inverter.


FIGURE 30.11

Inverter.

TABLE 30.4 Truth Table for the Circuit in Fig. 30.11

(a) Voltage Required
Input vi        Output vo
Low voltage     High voltage
High voltage    Low voltage

(b) Truth Table
Input x   Output f
0         1
1         0

Resistor R shown in Fig. 30.11 occupies a large area, so it is usually replaced by a MOSFET, called an n-channel depletion-mode MOSFET, as illustrated in Fig. 30.12, where the depletion-mode MOSFET is denoted by the MOSFET symbol with double lines. Notice that the gate of this depletion-mode MOSFET is connected to the output terminal instead of the power supply. The logic gate with the depletion-mode MOSFET replacing the resistor works in the same manner as before. Logic gates with depletion-mode MOSFETs work faster and are more immune to noise.

FIGURE 30.12

A logic gate with depletion-mode MOSFET.

The n-channel depletion-mode MOSFET is different from the n-channel enhancement-mode MOSFET in having a thin n-type silicon layer embedded underneath the gate, as illustrated in Fig. 30.13. When a positive voltage is applied at the drain against the source, a current flows through this thin n-type silicon layer even if the voltage at the gate is 0 V (against the source). As the gate voltage becomes more positive, a greater current flows. As the gate voltage becomes more negative, a smaller current flows. If the gate voltage decreases below the threshold voltage VT, no current flows. This relationship, called a transfer curve, between the gate voltage VGS (against the source) and the current i is shown in Fig. 30.14(a), as compared with that for the n-channel enhancement-mode MOSFET shown in Fig. 30.14(b). By connecting many n-channel enhancement MOSFETs, we can realize any negative function, i.e., the complement of a sum-of-products where only non-complemented literals are used ($\overline{x \lor yz}$ is an example of a negative function). For example, if we connect three MOSFETs in series, including the one


FIGURE 30.13

n-Channel depletion-mode MOSFET.

FIGURE 30.14

Shift of threshold voltage VT (VGS is the voltage between gate and source).

for resistor replacement, as shown in Fig. 30.15, the output f realizes the NAND function of variables x and y. Only when both inputs x and y have high voltages do the two MOSFETs for x and y become conductive, so that a current flows through them. Then the output voltage is low. Otherwise, at least one of them is non-conductive and no current flows. Then the output voltage is high. This relationship is shown in Table 30.5(a). In positive logic, this is converted to the truth table in Table 30.5(b), showing that the circuit represents $\overline{xy}$, which is called the NAND function. Figure 30.16 shows a logic circuit for $\overline{x \lor y}$ (called the NOR function), obtained by connecting MOSFETs in parallel. A more complex example is shown in Fig. 30.17.

FIGURE 30.15

Logic gate for the NAND function.

TABLE 30.5 Truth Table for the Circuit in Fig. 30.15

(a) Voltage Relation
x      y      f
Low    Low    High
Low    High   High
High   Low    High
High   High   Low

(b) Truth Table
x   y   f
0   0   1
0   1   1
1   0   1
1   1   0

FIGURE 30.16

Logic gate for the NOR function.

FIGURE 30.17

A logic gate with many MOSFETs.

The MOSFET that is connected between the power supply and the output terminal is called a load or load MOSFET in each of Figs. 30.15 through 30.17. The other MOSFETs, which are directly involved in logic operations, are called the driver or driver MOSFETs in each of these circuits.

Procedure 30.3: Calculation of the Logic Function of a MOS Logic Gate

The logic function f for the output of each of these MOS logic gates can be obtained as follows.
1. Calculate the transmission of the driver, regarding each n-channel MOSFET as a make-contact of a relay, as illustrated in Fig. 30.18. When x = 1, a current flows through the magnet in the make-contact relay in Fig. 30.18 and the contact x is closed and becomes conductive; likewise, an n-MOS becomes conductive when x = 1, i.e., when input x of the n-MOS is a high voltage.
2. Complement it.

For example, the transmission of the driver in Fig. 30.16 is x ∨ y. Then by complementing it, we have the output function $f = \overline{x \lor y}$. The output function of a more complex logic gate, such as that in Fig. 30.17, can be calculated in the same manner. Thus, a MOS circuit expresses a negative function with respect to input variables connected to driver MOSFETs.
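The two steps of this procedure can be sketched in Python by composing series and parallel connections of driver MOSFETs and then complementing the result. The helper names (`nmos`, `series`, `parallel`, `gate_output`) are hypothetical, and the three-input gate at the end is the negative function $\overline{x \lor yz}$ mentioned earlier in the text, not a transcription of Fig. 30.17.

```python
from itertools import product

def nmos(name):
    """A driver n-MOS treated as a make-contact: conductive when its input is 1."""
    return lambda inputs: inputs[name]

def series(*branches):
    """Series connection: conductive only if every branch is conductive."""
    return lambda inputs: int(all(b(inputs) for b in branches))

def parallel(*branches):
    """Parallel connection: conductive if at least one branch is conductive."""
    return lambda inputs: int(any(b(inputs) for b in branches))

def gate_output(driver):
    """Step 2 of the procedure: complement the transmission of the driver."""
    return lambda inputs: 1 - driver(inputs)

def truth_table(gate, names):
    """Output column of the truth table, inputs enumerated in binary order."""
    return [gate(dict(zip(names, bits))) for bits in product((0, 1), repeat=len(names))]

# Driver of two series MOSFETs (as in Fig. 30.15) and of two parallel MOSFETs (as in Fig. 30.16).
nand_gate = gate_output(series(nmos("x"), nmos("y")))
nor_gate = gate_output(parallel(nmos("x"), nmos("y")))
# Illustrative driver: x in parallel with the series connection of y and z.
complex_gate = gate_output(parallel(nmos("x"), series(nmos("y"), nmos("z"))))

if __name__ == "__main__":
    print("NAND:", truth_table(nand_gate, ("x", "y")))
    print("NOR: ", truth_table(nor_gate, ("x", "y")))
    print("complement of x OR yz:", truth_table(complex_gate, ("x", "y", "z")))
```

Because the driver is built only from uncomplemented inputs combined by series and parallel connections, the gate output is always a negative function, which is the closing remark of the procedure.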

FIGURE 30.18

Analogy between n-MOS and a make-contact relay.


FIGURE 30.19

Behavior of logic gates with n-MOS and p-MOS.


Difference in the Behavior of n-MOS and p-MOS Logic Gates

As illustrated in Fig. 30.19, an n-MOS logic gate behaves differently from a p-MOS logic gate. The power supply of an n-MOS logic gate, which consists of all n-MOSFETs and is illustrated in Fig. 30.19(a), must be a positive voltage, say +3.3 V, whereas the power supply of a p-MOS logic gate, which consists of all p-MOSFETs and is illustrated in Fig. 30.19(b), must be a negative voltage. Otherwise, these logic gates do not work. Each n-MOS in the driver in Fig. 30.19(a) becomes conductive when a high voltage (e.g., +3.3 V) is applied to its MOSFET gate, and becomes non-conductive when a low voltage (e.g., 0 V) is applied to its MOSFET gate. A direct current flows to the ground from the power supply, as shown by the bold arrow, when the driver becomes conductive. In contrast, each p-MOS in the driver in Fig. 30.19(b) becomes non-conductive when a high voltage (e.g., 0 V) is applied to its MOSFET gate, and becomes conductive when a low voltage (e.g., –3.3 V) is applied to its MOSFET gate. A direct current flows to the power supply from the ground (i.e., in the opposite direction to the case of the n-MOS logic gate in (a)), as shown by the bold arrow, when the driver becomes conductive.

The relationships among voltages at the inputs and outputs of these logic gates are shown in Figs. 30.19(c) and (d). In positive logic (i.e., interpreting the higher and lower voltages as 1 and 0, respectively), the table in Fig. 30.19(c) yields the truth table for NOR, i.e., $\overline{x \lor y}$, shown in (e), and in negative logic (i.e., interpreting the higher and lower voltages as 0 and 1, respectively), the table in (d) also yields the truth table for NOR in (e). Similarly, in negative logic, the table in Fig. 30.19(c) yields the truth table for NAND, i.e., $\overline{xy}$, shown in (f), and in positive logic, the table in (d) also yields the truth table for NAND in (f). The two functions in (e) and (f) are dual.
The relationships in these examples are extended into the general statements, as shown in the upper part of Fig. 30.19. Whether we use n-MOSFETs or p-MOSFETs in a logic gate, the output of the gate represents the same function by using positive or negative logic, respectively, or negative or positive logic, respectively. Thus, if we have a logic gate, realizing a function f, that consists of all n-MOS or all p-MOS, then the logic gate that is derived by exchanging n-MOS and p-MOS realizes its dual fd, no matter whether positive logic or negative logic is used. The output function of a logic gate that consists of all p-MOS can be calculated by Procedure 30.3, regarding each p-MOS as a make-contact relay. But each p-MOS is conductive (instead of being nonconductive) when its gate voltage is low, and is non-conductive (instead of being conductive) when its gate voltage is high. Also the output of the p-MOS logic gate is high when its driver is conductive (in the case of the n-MOS logic gate, the output of the gate is low when its driver is conductive) and is low when its driver is non-conductive (in the case of the n-MOS logic gate, the output of the gate is high when its driver is non-conductive). In other words, the polarity of voltages at the inputs and output of a p-MOS logic gate is opposite to that of an n-MOS logic gate. Thus, by using negative logic, a p-MOS logic gate has the same output function as the n-MOS logic gate in the same connection configuration in positive logic, as illustrated in Fig. 30.19.
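The interplay of gate type and logic convention can be illustrated with a small Python sketch. It assumes a driver of two parallel MOSFETs for both gates and models voltages only as the symbols "H" (higher) and "L" (lower); the function names and this two-symbol abstraction are assumptions of the example, not taken from Fig. 30.19.

```python
from itertools import product

HIGH, LOW = "H", "L"

def nmos_gate(vx, vy):
    """n-MOS gate, parallel driver: conducts on a high input, pulling the output low."""
    return LOW if HIGH in (vx, vy) else HIGH

def pmos_gate(vx, vy):
    """p-MOS gate, parallel driver: conducts on a low input, making the output high."""
    return HIGH if LOW in (vx, vy) else LOW

def positive_logic(voltage):
    """Higher voltage is 1, lower voltage is 0."""
    return 1 if voltage == HIGH else 0

def negative_logic(voltage):
    """Higher voltage is 0, lower voltage is 1."""
    return 1 if voltage == LOW else 0

def truth_table(gate, encode):
    """Output column of the truth table of a gate under a given logic convention."""
    decode = {encode(v): v for v in (HIGH, LOW)}   # logic value -> voltage
    return [encode(gate(decode[x], decode[y])) for x, y in product((0, 1), repeat=2)]

NOR = [1, 0, 0, 0]
NAND = [1, 1, 1, 0]

if __name__ == "__main__":
    print("n-MOS, positive logic:", truth_table(nmos_gate, positive_logic))
    print("p-MOS, negative logic:", truth_table(pmos_gate, negative_logic))
    print("n-MOS, negative logic:", truth_table(nmos_gate, negative_logic))
    print("p-MOS, positive logic:", truth_table(pmos_gate, positive_logic))
```

Under these assumptions the first two tables come out as NOR and the last two as NAND, matching the statement that exchanging n-MOS and p-MOS, or exchanging the logic convention, replaces the function by its dual.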

References
1. Muroga, S., Logic Design and Switching Theory, John Wiley & Sons (now available from Krieger Publishing), 1979.
2. Muroga, S., VLSI System Design, John Wiley & Sons, 1982.


Muroga, S. "Logic Synthesis with NAND (or NOR) Gates in Multi-levels" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


31 Logic Synthesis With NAND (or NOR) Gates in Multi-levels

Saburo Muroga
University of Illinois at Urbana-Champaign

31.1 Logic Synthesis with NAND (or NOR) Gates
31.2 Design of NAND (or NOR) Networks in Double-Rail Input Logic by the Map-Factoring Method
Networks with AND and OR Gates in Two Levels, as a Special Case • Consideration of Restrictions Such as Maximum Fan-in • The Map-Factoring Method for NOR Network
31.3 Design of NAND (or NOR) Networks in Single-Rail Input Logic
Extension of the Concept of Permissible Loop
31.4 Features of the Map-Factoring Method
31.5 Other Design Methods of Multi-level Networks with a Minimum Number of Gates

31.1 Logic Synthesis with NAND (or NOR) Gates

In the previous sections, we have discussed the design of a two-level network with AND and OR gates in double-rail input logic (i.e., both xi and x̄i for each xi are available as network inputs) that has a minimum number of gates as the primary objective and a minimum number of connections as the secondary objective. If a network need not be in two levels, we may be able to further reduce the number of gates or connections, but there is no known simple systematic design procedure for this purpose, whether tabular, algebraic, or graphical, that guarantees the minimality of the network. (The integer programming logic design method⁴,⁶ can do this but is complex, requiring long processing time.) But when multi-level minimal networks with NAND gates only (or NOR gates only) are to be designed, there is a method called the map-factoring method to design a logic network based on a Karnaugh map. In designing a logic network in single-rail input logic (i.e., only one of xi and x̄i for each xi is available as a network input) with the map-factoring method, it is less easy to see the minimality of the number of gates, although when two-level minimal networks with NAND gates in double-rail input logic are to be designed, it is as easy to see the minimality as for two-level minimal networks with AND and OR gates in double-rail input logic on Karnaugh maps. By using the designer’s intuition based on the pictorial nature of a Karnaugh map, at least reasonably good networks in multi-levels can be derived after trial-and-error efforts. As a matter of fact, minimal networks can sometimes be obtained, although the method does not enable us to prove their minimality. However, if we are satisfied with reasonably good networks, the map-factoring method is useful for manual design. It is actually an extension of the Karnaugh map method for minimal two-level networks with AND and OR gates discussed so far, with far greater


flexibility: by the map-factoring method, we can design networks not only in two levels but also in multi-levels, in single-rail or double-rail input logic. Moreover, two-level minimal networks with NAND gates in double-rail input logic that can be designed by the map-factoring method are essentially two-level minimal networks with AND and OR gates, as will be discussed later. The map-factoring method, which was first described in Chapter 6 of Ref. 3 for single-rail input logic as discussed later in this chapter, is extended here with minor modification.⁵ Logic networks with NOR gates only or NAND gates only are useful in some cases, such as gate arrays (to be described in Chapter 43), because the simple connection configuration of MOSFETs in each of these gates makes the area small and the speed high. In designing a logic network in multi-levels, we usually minimize the number of logic gates as the primary objective and then the number of connections as the secondary objective. This ordering is easier to design for than the reverse (connections first, then gates), although minimizing the number of connections, or the lengths of connections (which are far more difficult to minimize), is important because connections occupy significantly large areas on an integrated circuit chip. But judging from a limited-scale experiment with the integer programming logic design method, either approach yields the same or nearly the same minimal logic networks.⁷

31.2 Design of NAND (or NOR) Networks in Double-Rail Input Logic by the Map-Factoring Method

NAND gates and NOR gates, which are realized with MOSFETs, are often used in realizing integrated circuit chips, although logic gates that express negative functions more complex than NAND or NOR are generally used. (A “negative function” is the complement of a disjunctive form of noncomplemented variables. An example is $\overline{xy \lor z}$.) NAND gates are probably more often used than other types of logic gates realizing negative functions because a NAND gate in CMOS has a simple connection configuration of MOSFETs and is fast. Let us consider representing a NAND gate on a Karnaugh map. The output of the NAND gate with inputs x, ȳ, and z shown in Fig. 31.1(a) can be expressed as the loop for xȳz consisting of only 0-cells on the map in Fig. 31.1(b). (Recall that this loop represents product xȳz on an ordinary Karnaugh map in the previous chapters.) In this case, it is important to note that only 0-cells are contained inside the loop and all the 1-cells are outside. This is because the output of this NAND gate is the complement of the AND operation of inputs x, ȳ, and z (i.e., $\overline{x\bar{y}z}$). Thus, if all the 0-cells in the map can be encircled by a single rectangular loop representing a product of some literals (i.e., the number of 0-cells constituting this loop is 2^i, where i is a non-negative integer), the map represents the output of a NAND gate whose

FIGURE 31.1

Representation of a NAND gate on a map.


inputs are the literals in the product represented by the loop. The value of f is 0 for x = ȳ = z = 1 (i.e., x = z = 1 and y = 0) in both Figs. 31.1(a) and (b). The value of f is 1 if at least one of x, ȳ, and z is 0. Next, connect a new NAND gate (numbered gate 2) to the output of the above NAND gate (i.e., gate 1) and also connect inputs w and z to the new gate, as shown in Fig. 31.2(a). Unlike the case of gate 1 explained in Fig. 31.1, the output f of this new NAND gate is not expressed by the entire loop for wz in Fig. 31.2(b), but is expressed by a portion of the loop (i.e., only the 0-cells inside the loop for wz), because the input of gate 2 from gate 1 becomes 0 for the combination w = x = z = 1 and y = 0, and consequently f becomes 1 for this combination. (If there were no connection from gate 1, the rectangular loop representing wz would contain only 0-cells, like the loop for gate 1 explained in Fig. 31.1.) This may be interpreted as the rectangular loop representing wz for gate 2 being inhibited by the loop for gate 1, as shown in Fig. 31.2(b). The remainder of the loop (i.e., its 0-cells, which are actually all the 0-cells throughout the map) is encircled by a loop and is shaded. This shaded loop represents the output of gate 2 and is said to be associated with the output of gate 2. In other words, the loop (labeled wz) which represents wz denotes NAND gate 2, whereas the shaded loop (labeled 2) inside this loop in Fig. 31.2(b) denotes the output function of gate 2. Notice that the entire loop for gate 1 is shaded to represent the output function of gate 1, because gate 1 has no inputs from other gates (i.e., gate 1 is inhibited by no other shaded loops) and consequently the loop representing gate 1 coincides with the shaded loop representing the output of gate 1 (i.e., the shaded loop associated with the output of gate 1).
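The inhibition of one loop by another can be checked numerically. The following Python sketch assumes that gate 1 has inputs x, ȳ, and z (reading the second input as the complement of y, consistent with the combination x = z = 1 and y = 0 given above) and that gate 2 has inputs w, z, and the output of gate 1. The representation of map cells as `(w, x, y, z)` tuples and all names are illustrative.

```python
from itertools import product

def nand(*inputs):
    """Output of a NAND gate: 0 only when every input is 1."""
    return 1 - int(all(inputs))

def network(w, x, y, z):
    """Two-gate network assumed for Fig. 31.2: returns (output of gate 1, output of gate 2)."""
    g1 = nand(x, 1 - y, z)   # gate 1: inputs x, complement of y, z
    g2 = nand(w, z, g1)      # gate 2: inputs w, z, and the output of gate 1
    return g1, g2

# Every cell of the four-variable map, as a tuple (w, x, y, z).
CELLS = list(product((0, 1), repeat=4))

# Rectangular loops as sets of cells.
loop_gate1 = {c for c in CELLS if c[1] == 1 and c[2] == 0 and c[3] == 1}   # product x, not y, z
loop_wz = {c for c in CELLS if c[0] == 1 and c[3] == 1}                    # product w z

# Shaded loops: gate 1 has no gate inputs, gate 2 is inhibited by the shaded loop of gate 1.
shaded1 = loop_gate1
shaded2 = loop_wz - shaded1

# Cells where each gate output is 0, obtained by simulating the network.
zeros_gate1 = {c for c in CELLS if network(*c)[0] == 0}
zeros_gate2 = {c for c in CELLS if network(*c)[1] == 0}

if __name__ == "__main__":
    print("gate 1 zeros equal its shaded loop:", zeros_gate1 == shaded1)
    print("gate 2 zeros equal its shaded loop:", zeros_gate2 == shaded2)
```

The set difference `loop_wz - shaded1` is the code counterpart of shading the loop for wz while excluding the shaded loop of gate 1.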

FIGURE 31.2

Representation of a network of two NAND gates.

Now let us state a formal procedure to design a NAND network on a Karnaugh map.

Procedure 31.1: The Map-Factoring Method: Design of a Network in Double-Rail Input Logic with as Few NAND Gates as Possible

1. Make the first rectangular loop of 2^i cells. This loop may contain 1-cells, 0-cells, d-cells, or a mixture. Draw a NAND gate corresponding to this loop. As inputs to this gate, connect all the literals in the product that this loop represents. Shade the entirety of this loop.
For example, let us synthesize a network for f = w̄ȳ ∨ xȳ ∨ z̄ shown in the Karnaugh map in Fig. 31.3(a). Let us make the first rectangular loop as shown in Fig. 31.3(a). (Of course, the first loop can be chosen elsewhere.) This loop represents product wx̄. Draw gate 1, corresponding to this loop, and connect inputs w and x̄ to this gate. Shade the entirety of this loop because this gate has no input from another gate and consequently the entirety of this loop is associated with the output function of gate 1.
2. Make a rectangular loop consisting of 2^i cells, encircling 1-cells, 0-cells, d-cells, or a mixture. Draw a NAND gate corresponding to this loop. To this gate, connect literals in the product that this loop represents.


FIGURE 31.3

Example for Procedure 31.1.

Up to this point, this step is identical to Step 1. Now, to this new gate, we further connect the outputs of some or all of the gates already drawn, if we choose to do so. There are the following possibilities:
a. If we choose not to connect any previous gate to the new gate, the new loop is entirely shaded.
b. If we choose to connect some or all of the previously drawn gates to the new gate, encircle and shade the area inside the new loop, excluding the shaded loops of the previously drawn gates connected to the new gate. The shaded loop thus formed is associated with the output of this new gate.
Let us continue our example of Fig. 31.3. Let us make the loop labeled ȳ shown in Fig. 31.3(b) as the next rectangular loop consisting of 2^i cells. Draw the corresponding gate 2. Connect input ȳ to gate 2 because this loop represents ȳ. If we choose to connect the output of gate 1 also to gate 2 (i.e., by choosing case b above), the shaded loop labeled 2 in Fig. 31.3(b) represents the output of gate 2.
3. Repeat Step 2 until the following condition is satisfied:
Termination condition: When a new loop and the corresponding new gate are introduced, all the 0-cells on the entire map and possibly some d-cells constitute the shaded loop associated with the output of the new gate.
Continuing our example, let us make the loop labeled z as the next rectangular loop consisting of 2^i cells, as shown in Fig. 31.3(c). Draw the corresponding gate 3 with input z connected. Choosing case b in Step 2, connect the output of gate 2 as an input of gate 3. (In case b, we have three choices: connection of the output of gate 1 only, connection of the output of gate 2 only, and connection of both outputs of gates 1 and 2. Let us take the second choice now.) Then, the output of gate 3 is expressed by the shaded loop labeled 3 in Fig. 31.3(c).
Now, the termination condition is satisfied: all the 0-cells on the entire map constitute the shaded loop associated with the output of new gate 3. Thus, a network for the given function f has been obtained in Fig. 31.3(c). For the sake of simplicity, the example does not contain d-cells. Even when a map contains d-cells, that is, cells for don’t-care conditions, the map-factoring method, Procedure 31.1, can be easily used by appropriately interpreting each d-cell as a 0-cell or a 1-cell only when we examine the termination  condition in Step 3.


Notice that we can choose different loops (including the first one in Step 1) in each step, leading to different final networks.
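
The shading bookkeeping of Procedure 31.1 can be sketched in a few lines of Python. The sketch below treats each loop as a set of map cells and relies on the fact that a NAND gate outputs 0 exactly on its shaded loop. The three-gate chain has the same structure as the example (gate 1 feeds gate 2, and gate 2 feeds gate 3), but the literal polarities used here (`w=0, x=0`, `y=1`, `z=1`) and all function names are assumptions chosen for illustration; they are not read off Fig. 31.3.

```python
from itertools import product

VARS = ("w", "x", "y", "z")
CELLS = list(product((0, 1), repeat=len(VARS)))  # all 16 map cells

def loop(**fixed):
    """Cells of the rectangular loop for a product of literals.

    loop(w=0, y=1) is the loop for (not w)(y): 2^(4-2) = 4 cells.
    """
    return frozenset(
        c for c in CELLS
        if all(c[VARS.index(v)] == val for v, val in fixed.items())
    )

def shade(new_loop, connected_shaded):
    """Shaded loop of a new gate: its loop minus the shaded loops of the
    previously drawn gates connected to it (case b of Step 2)."""
    result = set(new_loop)
    for s in connected_shaded:
        result -= s
    return frozenset(result)

def nand(*inputs):
    return 0 if all(inputs) else 1

def network_output(cell):
    """Gate-level evaluation of the same three-gate chain."""
    w, x, y, z = cell
    g1 = nand(1 - w, 1 - x)   # literals (not w), (not x): assumed polarity
    g2 = nand(y, g1)
    g3 = nand(z, g2)
    return g3

s1 = shade(loop(w=0, x=0), [])   # Step 1: first loop, entirely shaded
s2 = shade(loop(y=1), [s1])      # Step 2: loop y with gate 1 connected
s3 = shade(loop(z=1), [s2])      # Step 3: loop z with gate 2 connected

# The 0-cells of the function realized by gate 3 should be exactly s3.
zero_cells = frozenset(c for c in CELLS if network_output(c) == 0)
print(s3 == zero_cells)
```

Checking the termination condition then amounts to comparing the last shaded set with the set of 0-cells of the target function, with d-cells allowed on either side.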

Networks with AND and OR Gates in Two Levels, as a Special Case

If we circle only 1-cells, possibly along with some d-cells but without any 0-cells, in each step, we can derive a logic network with NAND gates in two levels, as illustrated in Fig. 31.4(a). This can be easily converted to the network with AND and OR gates in two levels shown in Fig. 31.4(b).

FIGURE 31.4 Network in two levels.

Consideration of Restrictions Such as Maximum Fan-in

If a restriction on maximum fan-in or fan-out is imposed, loops and connections must be chosen so as not to violate it. With the map-factoring method, it is easy to take such a restriction into consideration. Also, the maximum number of levels in a network can be easily controlled.

The Map-Factoring Method for NOR Network

A minimal network of NOR gates for a given function f can be designed by the following approach. Use the map-factoring method to derive a minimal network of NAND gates for $f^d$, the dual of the given function f. Then replace the NAND gates in the network with NOR gates. The result will be a minimal network of NOR gates for f.
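
The NAND-to-NOR replacement can be checked by brute force. The following Python sketch evaluates one gate structure twice, once with NAND gates and once with NOR gates, and confirms that the NOR version realizes the dual of the NAND version. It follows that a NAND network for the dual of f becomes a NOR network for f. The network itself, the encoding of literals as `(variable_index, polarity)` pairs, and the function names are hypothetical.

```python
from itertools import product

# A network is a list of gates in topological order. Each gate is a pair
# (literals, gate_inputs): literals are (variable_index, polarity) pairs,
# polarity 1 for the variable and 0 for its complement (double-rail inputs);
# gate_inputs are indices of earlier gates. The last gate is the output.
NETWORK = [
    ([(0, 1), (1, 0)], []),      # gate 0: inputs x0, not x1
    ([(2, 1)], [0]),             # gate 1: inputs x2, gate 0
    ([(1, 1)], [0, 1]),          # gate 2: inputs x1, gates 0 and 1
]
N_VARS = 3

def evaluate(network, x, gate_type):
    """Evaluate the network with every gate taken as NAND or as NOR."""
    outs = []
    for literals, gate_inputs in network:
        ins = [x[i] if pol else 1 - x[i] for i, pol in literals]
        ins += [outs[j] for j in gate_inputs]
        if gate_type == "nand":
            outs.append(0 if all(ins) else 1)
        else:  # "nor"
            outs.append(0 if any(ins) else 1)
    return outs[-1]

def dual(func):
    """Dual of func: complement every input and the output."""
    return lambda x: 1 - func(tuple(1 - b for b in x))

def nand_func(x):
    return evaluate(NETWORK, x, "nand")

def nor_func(x):
    return evaluate(NETWORK, x, "nor")

inputs = list(product((0, 1), repeat=N_VARS))
same = all(nor_func(x) == dual(nand_func)(x) for x in inputs)
print(same)
```

The check passes for any gate structure of this kind, because complementing every literal turns each NAND into the complement of the corresponding NOR, level by level.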

31.3 Design of NAND (or NOR) Networks in Single-Rail Input Logic

In Section 31.2, we discussed the design of multi-level NAND networks in double-rail input logic, using the map-factoring method. Now let us discuss the design of a multi-level NAND network in single-rail input logic (i.e., no complemented variables are available as inputs to the network) using the map-factoring method.

First let us define permissible loops on a Karnaugh map. A permissible loop is a rectangle consisting of $2^i$ cells that contains the cell whose coordinates are variable values of all 1’s (i.e., the cell marked with the asterisk in each map in Fig. 31.5), where i is one of 0, 1, 2, …. A permissible loop may consist of 1-cells, 0-cells, d-cells (i.e., don’t-care cells), or a mixture. All permissible loops for three variables are shown in Fig. 31.5.

In the following, let us describe the map-factoring method for this case. The procedure is the same as Procedure 31.1 except that a permissible loop, instead of a rectangular loop of $2^i$ cells placed anywhere on the map, is used to represent a gate.
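
The definition of a permissible loop can be made concrete with a short enumeration. A rectangular loop of a power-of-two number of cells that contains the all-1's cell fixes some subset of the variables at 1 and leaves the others free, so each permissible loop corresponds to a product of non-complemented literals. The Python sketch below lists the loops for three variables under that reading; the function name and the label `"1"` for the loop covering the whole map are assumptions made for illustration.

```python
from itertools import combinations, product

def permissible_loops(variables):
    """Map each product of non-complemented literals to its loop cells."""
    n = len(variables)
    cells = list(product((0, 1), repeat=n))
    loops = {}
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            # The empty product (whole map) is named "1".
            name = "".join(variables[i] for i in subset) or "1"
            loops[name] = [c for c in cells if all(c[i] == 1 for i in subset)]
    return loops

loops = permissible_loops(("x", "y", "z"))
for name, cells in loops.items():
    print(name, len(cells), cells)
```

For n variables this gives one loop per subset of the variables, and every loop contains the cell (1, 1, …, 1).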


FIGURE 31.5 Permissible loops for three variables.

Procedure 31.2: The Map-Factoring Method: Design of a Network in Single-Rail Input Logic with as Few NAND Gates as Possible

We want to design a network with as few NAND gates as possible, under the assumption that non-complemented variables but no complemented variables are available as network inputs.

1. Make the first permissible loop, encircling 1-cells, 0-cells, d-cells, or a mixture of them. Draw a NAND gate corresponding to this loop. As inputs to this gate, connect all the literals in the product that this loop represents. Shade the entirety of this loop.

For example, when a function $f = x \lor y \lor \bar{z}$ is given, let us choose the first permissible loop as shown in Fig. 31.6(a), although there is no reason why we should choose this particular loop. (There is no guiding principle for finding which loop should be the first permissible loop. Another loop could be better, but we cannot guess at this moment which loop is the best choice. Another possibility in choosing the first permissible loop will be shown later in Fig. 31.7(a).) The loop we have chosen is labeled 1 and is entirely shaded. Then NAND gate 1 is drawn. Since the loop represents x, we connect x to this gate.

2. Make a permissible loop, encircling 1-cells, 0-cells, d-cells, or a mixture of them. Draw a NAND gate corresponding to this loop. To this new gate, connect literals in the product that this loop represents.

In the above example, a new permissible loop is chosen as shown in Fig. 31.6(b), although there is no strong reason why we should choose this particular permissible loop. This loop represents product yz, so y and z are connected to the new gate, which is labeled 2.

To this new NAND gate, we further connect the outputs of some gates already drawn, if we choose to do so. There are the following possibilities.

a. If we choose not to connect any previous gate to the new gate, the new permissible loop is entirely shaded. If we prefer not to connect gate 1 to gate 2 in the above example, we get the network in Fig. 31.6(b). Ignoring the shaded loop for gate 1, the new permissible loop is entirely shaded and is labeled 2, as shown in Fig. 31.6(b).
b. If we choose to connect some previously drawn gates to the new gate, encircle and shade the area inside the new permissible loop, excluding the shaded loops of the previously drawn gates that are connected to the new gate. In other words, the new permissible loop, except the portion inhibited by the shaded areas associated with the outputs of the connected previously drawn gates, is shaded. In the above example, if we choose to connect gate 1 to gate 2, we obtain the network in Fig. 31.6(b′). The portion of the new permissible loop that is not covered by the shaded loop associated with gate 1 is shaded and labeled 2.


FIGURE 31.6 Map-factoring method applied for $f = x \lor y \lor \bar{z}$.

3. Repeat Step 2 until the following condition is satisfied:

Termination condition: When a new permissible loop and the corresponding new gate are introduced, all the 0-cells on the entire map and possibly some d-cells constitute the shaded loop associated with the output of the new gate (i.e., the shaded loop associated with the output of the new gate contains all the 0-cells on the entire map, but no 1-cell).

Let us continue Fig. 31.6(b). If we choose the new permissible loop as shown in Fig. 31.6(c), and if we choose to connect gates 1 and 2 to the new gate 3, the above termination condition is satisfied; in other words, all the 0-cells on the entire map constitute the shaded loop associated with the output of the new gate 3. We have obtained a network of NAND gates for the given function.

Depending on what permissible loops we choose and how we make connections among gates, we can obtain different networks. After several trials, we choose the best network. As an alternative to the network for f obtained in Fig. 31.6(c), let us continue Fig. 31.6(b′). If we choose the new permissible loop as shown in Fig. 31.6(c′), and gates 1 and 2 are connected to gate 3, we satisfy the termination condition in Step 3, and we obtain the network in Fig. 31.6(c′), which is different from the one in Fig. 31.6(c).

If we choose the first permissible loop 1 as shown in Fig. 31.7(a), differently from Fig. 31.6(a), we can proceed with the map-factoring method as shown in Fig. 31.7(b), and we obtain the third network as shown in Fig. 31.7(c). Of course, we can continue differently in Figs. 31.7(b) and (c). Also, the first permissible loop can be chosen differently from Fig. 31.6(a) or 31.7(a), but it is too time-consuming to try all possibilities, so we have to be content with a few trials. We need a few trials to gain a good feeling for how to make good selections, and thereafter we may obtain a reasonably good network.

FIGURE 31.7 Network obtained by choosing permissible loops different from Fig. 31.6.

For the above example, $f = x \lor y \lor \bar{z}$, the network obtained in Fig. 31.7 happens to be the minimal network (the minimality of the number of gates and then, as the secondary objective, the minimality of the number of connections can be proved by the integer programming logic design method [4,6]).

As might be observed already, any permissible loop represents a product of non-complemented literals (e.g., the loop chosen in Fig. 31.6(b) represents yz) because the permissible loop contains the cell whose coordinates are all 1's (w = x = y = z = 1 in the case of four variables). In contrast, a rectangular loop of $2^i$ cells that may or may not contain this cell represents a product of complemented literals, non-complemented literals, or a mixture, as observed previously.
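
For a map as small as three variables, the trial-and-error choice of permissible loops and connections can be replaced by an exhaustive search. The Python sketch below tries every permissible loop and every subset of previously drawn gates at each step, and stops at the smallest gate count for which the termination condition holds. It assumes a completely specified function whose only 0-cell is (x, y, z) = (0, 0, 1); that choice of function, the data layout, and the function names are assumptions made for illustration, and the gate list it prints need not match any of the networks in Figs. 31.6 and 31.7.

```python
from itertools import combinations, product

VARS = ("x", "y", "z")
CELLS = list(product((0, 1), repeat=3))

# Permissible loops: a subset of the variables is fixed at 1 (single-rail).
LOOPS = {
    subset: frozenset(c for c in CELLS if all(c[i] for i in subset))
    for r in range(4)
    for subset in combinations(range(3), r)
}

def search(zero_cells, max_gates, gates=()):
    """Depth-first search for a gate list whose last shaded loop equals
    zero_cells. Each gate is (literal_indices, connected_gates, shaded)."""
    for subset, loop in LOOPS.items():
        for r in range(len(gates) + 1):
            for conn in combinations(range(len(gates)), r):
                shaded = loop
                for j in conn:
                    shaded = shaded - gates[j][2]
                candidate = gates + ((subset, conn, shaded),)
                if shaded == zero_cells:
                    return candidate
                if len(candidate) < max_gates:
                    found = search(zero_cells, max_gates, candidate)
                    if found:
                        return found
    return None

def fewest_gates(zero_cells, limit=4):
    """Try 1, 2, ... gates and return the first network found."""
    for k in range(1, limit + 1):
        net = search(zero_cells, k)
        if net:
            return net
    return None

# Assumed example: f is 0 only in the cell (x, y, z) = (0, 0, 1).
ZERO = frozenset({(0, 0, 1)})
net = fewest_gates(ZERO)
for number, (subset, conn, shaded) in enumerate(net, start=1):
    lits = [VARS[i] for i in subset]
    print(f"gate {number}: literals {lits}, from gates {[j + 1 for j in conn]}")
```

The search space grows very quickly with the number of variables and gates, which is consistent with the remark above that trying all possibilities by hand is too time-consuming.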

Extension of the Concept of Permissible Loop

By defining permissible loops differently, we can extend the map-factoring method to designing NAND gate networks with some variables complemented and others non-complemented as network inputs. This is useful when it is not convenient to have other combinations of complemented variables and non-complemented variables as network inputs (e.g., long connections are needed). For example, let us define a permissible loop as a rectangular loop consisting of $2^i$ cells which contains the cell with coordinates x = y = 0 and z = 1. Then we can design NAND networks with only $\bar{x}$, $\bar{y}$, and z as network inputs.

Some functions can be realized with simpler networks by using some network inputs complemented. For example, the function in Fig. 31.7 can be realized with only one NAND gate by using $\bar{x}$, $\bar{y}$, and z as network inputs, instead of the three gates in Fig. 31.7(c). Complemented inputs can be realized at the outputs of another network to be connected to the inputs of the network under consideration, or can be realized with inverters. When these network inputs run some distance on a PC board or a chip, the area covered by them is smaller than in the case of double-rail input logic, because the latter requires another set of long-running lines for their complements.

31.4 Features of the Map-Factoring Method

Very often in design practice, we need to design small networks. For example, we may need to modify large networks by adding small networks, to design large networks by assembling small networks, or to design frequently used networks that are to be stored as cells in a cell library (these will be discussed in Chapter 45). Manual design is still useful in these cases because designers can understand very well the functional relationships among inputs, outputs, and gates, and also complex constraints.


The map-factoring method has unique features that the design methods based on Karnaugh maps with AND and OR gates, described in the previous sections, do not have. The map-factoring method can synthesize NAND (or NOR) gate networks in single-rail or double-rail input logic, not only in two levels but also in multiple levels. Also, constraints such as maximum fan-in, maximum fan-out, and the number of levels can be easily taken into account. In contrast, the methods for designing logic networks with AND and OR gates on Karnaugh maps described in the previous sections yield networks only under the strong restrictions stated previously, such as only two levels and no maximum fan-in restriction. Note, however, that networks in two levels in double-rail input logic with NAND (or NOR) gates derived by the map-factoring method are essentially two-level networks of AND and OR gates in double-rail input logic.

The usefulness of the map-factoring method is due to the use of a single gate type, the NAND gate (or the NOR gate). Although the map-factoring method is intuitive, derivation of minimal networks with NAND gates in multiple levels is often not easy because there is no good guideline for each step (i.e., where we should choose a loop), and in this sense the method is heuristic.

The map-factoring method can be extended to a design problem where some inputs to a network are independent variables x1, x2, ..., xn and other inputs are logic functions of these variables.

31.5 Other Design Methods of Multi-level Networks with a Minimum Number of Gates

Heuristic design methods, such as the map-factoring method, do not guarantee the minimality of networks. If we want to design a network with a minimum number of NAND (or NOR) gates (or a mixture) under arbitrary restrictions, we cannot do so within the framework of switching algebra, unlike the case of minimal two-level AND/OR gate networks (which can be designed based on minimal sums or minimal products). The integer programming logic design method is currently the only method available [4,6]. This method is not appropriate for hand processing, but it can design minimal networks of up to about 10 gates within reasonable processing time by computer. The method can also design multiple-output networks, regardless of whether functions are completely or incompletely specified. Also, networks with a mixture of different gate types (such as NAND, NOR, AND, or OR gates), with gates having double outputs, or with wired-OR can be designed, although the processing time and the complexity of programs increase correspondingly. Also, the number of connections can be minimized as the primary, instead of the secondary, objective.

Although the primary purpose of the integer programming logic design method is the design of minimal networks with a small number of variables, minimal networks for some important functions of an arbitrary number of variables (i.e., no matter how large n is for a function of n variables), such as adders [1,7] and parity functions [2], have been derived by analyzing the intrinsic properties of minimal networks for these functions.

References

1. Lai, H. C. and S. Muroga, “Minimum binary adders with NOR (NAND) gates,” IEEE Tr. Computers, vol. C-28, pp. 648-659, Sept. 1979.
2. Lai, H. C. and S. Muroga, “Logic networks with a minimum number of NOR (NAND) gates for parity functions of n variables,” IEEE Tr. Computers, vol. C-36, pp. 157-166, Feb. 1987.
3. Maley, G. A. and J. Earle, The Logical Design of Transistor Digital Computers, Prentice-Hall, Englewood Cliffs, NJ, 1963.
4. Muroga, S., “Logic design of optimal digital networks by integer programming,” Advances in Information Systems Science, vol. 3, ed. by J. T. Tou, Plenum Press, pp. 283-348, 1970.
5. Muroga, S., Logic Design and Switching Theory, John Wiley & Sons (now available from Krieger Publishing Co.), 1979.
6. Muroga, S., “Computer-aided logic synthesis for VLSI chips,” Advances in Computers, vol. 32, ed. by M. C. Yovits, Academic Press, pp. 1-103, 1991.
7. Muroga, S. and H. C. Lai, “Minimization of logic networks under a generalized cost function,” IEEE Tr. Computers, vol. C-25, pp. 893-907, Sept. 1976.
8. Sakurai, A. and S. Muroga, “Parallel binary adders with a minimum number of connections,” IEEE Tr. Computers, vol. C-32, pp. 969-976, Oct. 1983. (In Fig. 7, labels a0 and c0 should be exchanged.)


Muroga, S. "Logic Synthesis with a Minimum Number of Negative Gates" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


32 Logic Synthesis with a Minimum Number of Negative Gates

Saburo Muroga
University of Illinois at Urbana-Champaign

32.1 Logic Design of MOS Networks
Phase 1 • Phase 2 • Phase 3
32.2 Algorithm DIMN

32.1 Logic Design of MOS Networks

A MOS logic gate can express a negative function, and it is not directly associated with a simple logic expression such as a minimal sum. So, it is not a simple task to design a network with MOS logic gates so that the logic capability of each MOS logic gate to express a negative function is utilized to the fullest extent. Here let us describe one of the few design procedures that design transistor circuits directly from the given logic functions [7,8].

A logic gate whose output represents a negative function is called a negative gate. A MOS logic gate is a negative gate. We now design a network with a minimum number of negative gates. The feed-forward network shown in Fig. 32.1 (the output of each gate feeds forward to the gates in the succeeding levels) can express any loopless network. Let us use Fig. 32.1 as the general model of a loopless network. The following procedure designs a logic network with a minimum number of negative gates based on this model, assuming that only non-complemented input variables (i.e., x1, x2,…, xn) are available as network inputs (i.e., single-rail input logic) [5,6,9].

Procedure 32.1: Design of Logic Networks with a Minimum Number of Negative Gates in Single-Rail Input Logic

We want to design a MOS network with a minimum number of MOS logic gates (i.e., negative gates) for the given function f(x1, x2,…, xn). (The number of interconnections among logic gates is not necessarily minimized.) It is assumed that only non-complemented variables are available as network inputs. The network is supposed to consist of MOS logic gates gi whose outputs are denoted by ui, as shown in Fig. 32.1.

Phase 1

1. Arrange all input vectors V = (x1, x2,…, xn) in a lattice, as shown in Fig. 32.2, where the nodes denote the corresponding input vectors shown in parentheses. White nodes, black nodes, and nodes with a cross in a circle, ⊗, denote true vectors, false vectors, and don’t-care vectors, respectively. The number of 1’s contained in each vector V is defined as the weight of the vector. All vectors with the same weight are placed on the same level, with vectors of greater weight on higher levels, and every pair of vectors that differ in only one bit position is connected by a short line.


FIGURE 32.1 A feed-forward network of negative gates.

FIGURE 32.2 Lattice example for Procedure 32.1.

Figure 32.2 is a lattice example for an incompletely specified function of four variables.

2. We assign the label L(V) to each vector V = (x1, x2,…, xn) in the lattice in Steps 2 and 3. Henceforth, L(V) is shown without parentheses in the lattice. First assign the value of f to the vector (11…1) of weight n at the top node as L(11…1). If f for the top node is “don’t-care,” assign 0. In the example in Fig. 32.2, we have L(11…1) = 1 because the value of f for the top node is 1, as shown by the white node.

3. When we finish the assignment of L(V) to each vector of weight w, where 0 < w ≤ n, assign L(V′) to each vector V′ of weight w – 1 as the smallest binary number satisfying the following conditions. If f(V′) is not “don’t-care,”

a. The least significant bit of L(V′) is f(V′) (i.e., the least significant bit of L(V′) is 0 or 1, according to whether f is 0 or 1), and
b. The other bits of L(V′) must be determined such that L(V′) ≥ L(V) holds for every vector V of weight w that differs from V′ in only one bit position. In other words, the binary number represented by L(V′) is not smaller than the binary number represented by L(V).

If f(V′) is “don’t-care,” ignore (a) and consider (b) only. [Consequently, the least significant bit of L(V′) is determined such that (b) is met.]

For the example, we get a label L(1110) = 10 for the node for vector V′ = (1110) because the last bit must be 0 = f(1110) by (a), and the number must be equal to or greater than the label 1 for the top node for vector (1111) by (b). Also, for vector V′ = (1000), we get a label L(1000) = 100 because the last bit must be 0 = f(1000) by (a), and the label L(1000) as a binary number must be equal to or greater than each of the labels 10, 11, and 11 already assigned to the three nodes (1100), (1010), and (1001), respectively.

4. Repeat Step 3 until a label L(00…0) is assigned to the bottom vector (00…0). Then the bit length of L(00…0) is the minimum number of MOS logic gates required to realize f. Denote it by R. Then make all previously obtained L(V) into binary numbers of the same length as L(00…0) by adding 0’s in front of them, such that every label L(V) has exactly R bits. For the example, we have R = 3, so the label L(11…1) = 1 obtained for the top node is changed to 001 by adding two 0’s.
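
The labeling rule of Steps 2 to 4 is mechanical enough to sketch in code. In the Python sketch below, labels are stored as integers, `None` stands for a don’t-care value, and the smallest label that satisfies (a) and (b) is found by taking the largest label among the neighbours one level up and adding 1 when its last bit disagrees with f. The function names and the two-variable exclusive-OR example are hypothetical; the function of Fig. 32.2 is not reproduced here.

```python
from itertools import product

def next_label(above, value):
    """Smallest label >= every label in `above` whose last bit is `value`.

    `value` is 0, 1, or None for a don't-care (condition (a) is skipped).
    """
    label = max(above)                      # condition (b)
    if value is not None and label % 2 != value:
        label += 1                          # condition (a): adjust the last bit
    return label

def minimal_labeling(n, f):
    """Label every vector of the lattice, top node first (Steps 2 to 4).

    `f` maps each input vector (a tuple of 0/1) to 0, 1, or None.
    Returns (labels, R); labels are integers and R is the bit length of
    the bottom label. A bottom label of 0 (R = 0) is not handled specially.
    """
    top = (1,) * n
    labels = {top: f[top] or 0}             # a don't-care at the top gets 0
    for weight in range(n - 1, -1, -1):
        for v in product((0, 1), repeat=n):
            if sum(v) != weight:
                continue
            # Neighbours one level up: flip one 0 of v to 1.
            above = [labels[v[:i] + (1,) + v[i + 1:]]
                     for i in range(n) if v[i] == 0]
            labels[v] = next_label(above, f[v])
    return labels, labels[(0,) * n].bit_length()

# Hypothetical example: two-variable exclusive-OR, completely specified.
xor = {v: v[0] ^ v[1] for v in product((0, 1), repeat=2)}
labels, R = minimal_labeling(2, xor)
for v, label in labels.items():
    print(v, format(label, f"0{R}b"))
```

The helper agrees with the two labels worked out in the text: `next_label([0b1], 0)` gives binary 10 for the node (1110), and `next_label([0b10, 0b11, 0b11], 0)` gives binary 100 for the node (1000). For the exclusive-OR example the bottom label is binary 10, so R = 2.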

Phase 2

Now let us derive MOS logic gates from the L(V)’s found in Phase 1.

1. Denote each L(V) obtained in Phase 1 as (u1,…, ui, ui+1,…, uR). As will be seen later, u1,…, ui, ui+1,…, uR are logic functions realized at the outputs of logic gates g1,…, gi, gi+1,…, gR, respectively, as shown in Fig. 32.1.

2. For each L(V) = (u1,…, ui, ui+1,…, uR) that has ui+1 = 0, make a new vector (V, u1,…, ui) (i.e., (x1,…, xn, u1,…, ui)), which does not include ui+1,…, uR. In other words, for each i, find all labels L(V) whose (i + 1)-th bit is 0, and for each of these labels create a new vector (x1,…, xn, u1,…, ui) that contains only the first i bits of the label L(V).

For example, for u1 = 0 (i.e., by setting i of “ui+1 = 0” to 0), the top node of the example lattice in Fig. 32.2 has the label (u1, u2, u3) = (001), which has u1 = 0 (the label 1 assigned in Step 2 of Phase 1 was changed to 001 in Step 4). For this label, we need to create a new vector (x1, x2, x3, x4, u1,…, ui), but the last bit ui becomes u0 because we set i = 0 for ui+1 = 0. There is no u0, so the new vector is (x1, x2, x3, x4) = (1111), excluding u1,…, ui. (In this sense, the case of u1 = 0 is special, unlike the cases of u2 = 0 and u3 = 0 in the following.) For other nodes with labels having u1 = 0, we create new vectors in the same manner.

Next, for u2 = 0 (i.e., by setting i of “ui+1 = 0” to 1), the top node has the label (u1, u2, u3) = (001), which has u2 = 0. For this label, we need to create a new vector (x1, x2, x3, x4, u1), including u1 this time. So, the new vector (x1, x2, x3, x4, u1) = (11110) results from the label L(1111) = 001 for the top node. Also, a new vector (11010) results from L(1101) = 001, and so on.

3. Find all the minimal vectors from the set of all the vectors found in Step 2, where the minimal vectors are defined as follows. When ak ≥ bk holds for every k for a pair of distinct vectors A = (a1,…, am) and B = (b1,…, bm), the relation is denoted by

A ≻ B

and B is said to be smaller than A. In other words, A and B are compared bit-wise. If no vector in the set is smaller than B, B is called a minimal vector of the set. For example, (10111) ≻ (10101) because the bit (i.e., 1 or 0) in every bit position of (10111) is greater than or equal to the corresponding bit in (10101). But (10110) and (10101) are incomparable because the bit in each of the first four bit positions of (10110) is greater than or equal to the corresponding bit of (10101), but the fifth bit of (10110) (i.e., 0) is smaller than the corresponding bit (i.e., 1) of (10101).

For the example, for u1 = 0, the minimal vectors are (0100), (0010), and (0001). Then, for u2 = 0, we get (11110), (11010), (01110), (10001), and (00001) by Step 2. The minimal vectors are then (11010), (01110), and (00001). Here, notice that (11110), for example, cannot be a minimal vector because (11110) ≻ (11010). Also, (11010) cannot be compared with the other two, (01110) and (00001), with respect to ≻.

4. For every minimal vector, make the product of the variables that have 1’s in the components of the vector, where the components of the vector (V, u1,…, ui) denote variables x1, x2, …, xn, u1, …, ui, in this order. For example, we form x1x2x4 for vector (11010). Then make a disjunction of all these products and denote it by $\bar{u}_{i+1}$. For the example, we get

$\bar{u}_2 = x_1x_2x_4 \lor x_2x_3x_4 \lor u_1$

from the minimal vectors (11010), (01110), and (00001) for u2 = 0.

5. Repeat Steps 2 through 4 for each of i = 1, 2,…, R – 1. For the example, we get

$\bar{u}_1 = x_2 \lor x_3 \lor x_4$ and $\bar{u}_3 = x_1x_3x_4u_2 \lor x_1u_1 \lor x_2u_2$

6. Arrange R MOS logic gates in a line along with their output functions, u1, …, uR. Then construct each MOS logic gate according to the disjunctive forms obtained in Steps 4 and 5, and make connections from other logic gates and input variables (e.g., MOS logic gate g2, whose output function is u2, has connections from x1, x2, x3, x4, and u1 to the corresponding MOSFETs in g2, according to the disjunctive form $\bar{u}_2 = x_1x_2x_4 \lor x_2x_3x_4 \lor u_1$). The network output, u3 (i.e., $u_3 = \overline{x_1x_3x_4u_2 \lor x_1u_1 \lor x_2u_2}$, which can be rewritten as $\bar{x}_1\bar{x}_2 \lor x_1\bar{x}_3x_4 \lor x_2x_3x_4 \lor \bar{x}_2x_3\bar{x}_4$), realizes the given function f. For the example, we get the network shown in Fig. 32.3 (the network shown here uses logic gates in n-MOS only, but it is easy to convert it into CMOS, as we will see in Chapter 36).
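
Steps 2 to 6 of Phase 2 can also be sketched in code. The Python sketch below builds, for each gate, the vectors (V, u1,…, ui) from the labels whose (i + 1)-th bit is 0, keeps the minimal vectors under the bit-wise comparison of Step 3, and then simulates the resulting negative gates. The first printed line applies `minimal_vectors` to the five vectors listed in the text for u2 = 0. The label table is a hypothetical two-variable exclusive-OR example with R = 2, not the lattice of Fig. 32.2, and all function names are assumptions.

```python
def dominates(a, b):
    """True if a and b are distinct and a_k >= b_k in every position,
    i.e., b is smaller than a in the bit-wise comparison of Step 3."""
    return a != b and all(x >= y for x, y in zip(a, b))

def minimal_vectors(vectors):
    """Vectors of the set that have no smaller vector in the set."""
    return [b for b in vectors if not any(dominates(b, c) for c in vectors)]

def gate_terms(labels, R):
    """For gate i+1, collect (V, u1..ui) from every label whose bit u_{i+1}
    is 0 and keep the minimal vectors. Each minimal vector is one product
    in the disjunctive form of the complement of u_{i+1}."""
    terms = []
    for i in range(R):
        vectors = []
        for v, label in labels.items():
            bits = tuple(int(b) for b in format(label, f"0{R}b"))
            if bits[i] == 0:
                vectors.append(v + bits[:i])
        terms.append(minimal_vectors(vectors))
    return terms

def simulate(terms, x):
    """Evaluate the negative gates in order and return (u1, ..., uR)."""
    u = ()
    for products in terms:
        values = x + u
        positive = any(all(val for val, bit in zip(values, p) if bit)
                       for p in products)
        u += (0 if positive else 1,)
    return u

# The five vectors listed in the text for u2 = 0.
example = [(1, 1, 1, 1, 0), (1, 1, 0, 1, 0), (0, 1, 1, 1, 0),
           (1, 0, 0, 0, 1), (0, 0, 0, 0, 1)]
print(minimal_vectors(example))

# Hypothetical labels for two-variable exclusive-OR (R = 2 bits).
labels = {(1, 1): 0b00, (1, 0): 0b01, (0, 1): 0b01, (0, 0): 0b10}
terms = gate_terms(labels, 2)
for x in labels:
    print(x, simulate(terms, x))
```

In the exclusive-OR example the first gate is a NOR of x1 and x2, and the second gate is the complement of x1x2 ∨ u1, so the last gate output reproduces the function at every input vector.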

Phase 3

The bit length R of label L(00…0) for the bottom node shows the number of MOS logic gates in the network given at the end of Phase 2. Thus, even if we do not choose the smallest binary number in Step 3 of Phase 1, but choose a binary number that still satisfies the other conditions (i.e., (a) and (b)) in Step 3 of Phase 1, we can still obtain a MOS network with the same minimum number of MOS logic gates, as long as the bit length R of L(00…0) is kept the same. (For the top node also, we do not need to choose the smallest binary number as L(11…1), no matter whether f for the node is don’t-care.) This freedom may change the structure of the network, although the number of logic gates is still the minimum. Among all the networks obtained, there is a network that has a minimum number of logic gates as its primary objective and a minimum number of interconnections as its secondary objective. (Generally, it is not easy to find such a network because there are too many possibilities.)

Although the number of MOS logic gates in the network designed by Procedure 32.1 is always minimized, the networks designed by Procedure 32.1 may have the following problems: (1) the number of interconnections is not always minimized; (2) some logic gates may become very complex, so that these logic gates may not work properly with reasonably small gate delay times. If so, we need to split these logic gates into a greater number of reasonably simple logic gates, giving up the minimality of the number of logic gates. Also, after designing several networks according to Phase 3, we may be able to find a satisfactory network.

FIGURE 32.3 MOS network based on the labels L(V) in Fig. 32.2.

32.2 Algorithm DIMN

Compared with problem (1) of Procedure 32.1, problem (2) presents a far more serious difficulty. Thus, Algorithm DIMN (an acronym for Design of Irredundant Minimal Network) was developed to design a MOS network in single-rail input logic such that the number of gates is minimized and every connection among cells is irredundant (i.e., if any connection among logic gates is removed, the network output will be changed) [2,3]. Algorithm DIMN is powerful but is far more complex than the minimal labeling procedure (i.e., Phases 1 and 2 of Procedure 32.1). So, let us only outline it.

Procedure 32.2: Outline of Algorithm DIMN

1. All the nodes of a lattice are labeled by the minimal labeling procedure (i.e., Phases 1 and 2 of Procedure 32.1), starting with the top node and moving downward. Let the number of bits of each label be R. Then all the nodes are labeled by a procedure similar to the minimal labeling procedure, starting with the bottom node, which is now labeled with the largest binary number of R bits, and moving upward on the lattice. Then, the first negative gate with irredundant MOSFETs is designed after finding as many don’t-cares as possible by comparing the two labels at each node that are derived by these downward and upward labelings.

2. The second negative gate with irredundant MOSFETs is designed after downward and upward labelings to find as many don’t-cares as possible. This process is repeated to design each gate until the network output gate with irredundant MOSFETs is designed.

Unlike the minimal labeling procedure, where the downward labeling is done only once and then the entire network is designed, Algorithm DIMN repeats the downward and upward labelings for designing each negative gate. The network derived by DIMN for the function x1x2x3x4 ∨ x 1 x 2 x 3 x 4 ∨ x 1 x 2 x 3 x 4 ∨ x 1 x 2 x 3 x 4 , for example, has three negative gates, the same as that of the network derived by the minimal labeling procedure for the same function. The former, however, has only 12 n-channel MOSFETs, whereas the latter has 20 n-channel MOSFETs. DIMN usually yields networks that are substantially simpler than those designed by the minimal labeling procedure, but it has the same problems as Procedure 32.1 when the number of input variables of the function becomes large. The computational efficiency of DIMN was later improved [4]. Also, a version of DIMN specialized to two-level networks was developed [1].

References

1. Hu, K. C., Logic Design Methods for Irredundant MOS Networks, Ph.D. dissertation, Dept. Comput. Sci., University of Illinois at Urbana-Champaign, Aug. 1978. Also Report No. UIUCDCS-R-80-1053, 317, 1980.
2. Lai, H. C. and S. Muroga, “Automated logic design of MOS networks,” Chapter 5 in Advances in Information Systems Science, vol. 9, ed. by J. Tou, Plenum Press, New York, pp. 287-336, 1985.
3. Lai, H. C. and S. Muroga, “Design of MOS networks in single-rail input logic for incompletely specified functions,” IEEE Tr. CAD, 7, pp. 339-345, March 1988.
4. Limqueco, J. C., Algorithms for the Design of Irredundant MOS Networks, Master’s thesis, Dept. Comput. Sci., University of Illinois, Urbana, IL, 87, 1998.
5. Liu, T. K., “Synthesis of multilevel feed-forward MOS networks,” IEEE TC, pp. 581-588, June 1977.
6. Liu, T. K., “Synthesis of feed-forward MOS network with cells of similar complexities,” IEEE TC, pp. 826-831, Aug. 1977.
7. Muroga, S., VLSI System Design, John Wiley & Sons, 1982.
8. Muroga, S., “Computer-aided logic synthesis for VLSI chips,” Advances in Computers, vol. 32, ed. by M. C. Yovits, Academic Press, pp. 1-103, 1991.
9. Nakamura, K., N. Tokura, and T. Kasami, “Minimal negative gate networks,” IEEE TC, pp. 5-11, Jan. 1972.


Yoshikawa, K., et al. "Logic Synthesizer with Optimizations in Two Phases" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

33 Logic Synthesizer with Optimizations in Two Phases

Ko Yoshikawa
NEC Corporation

Saburo Muroga
University of Illinois at Urbana-Champaign

Logic networks, and then the corresponding transistor circuits to be laid out on integrated chips, have traditionally been designed manually, which takes a long time and involves repeatedly making and correcting mistakes. As the cost and size of transistor circuits continue to decline, logic networks that are realized by transistor circuits have been designed with an increasingly large number of logic gates. Manual design of such large logic networks is becoming too time-consuming and prone to design mistakes, thus necessitating automated design. Logic synthesizers are automated logic synthesis systems used for this purpose; they transform the given logic functions into technology-dependent logic circuits that can be easily realized as transistor circuits. The quality of technology-dependent logic circuits derived by logic synthesizers is not necessarily better than that of manually designed ones, at least for circuits with a small number of logic gates, but technology-dependent logic circuits with millions of logic gates cannot be designed manually within a reasonably short time.

Automated design of logic networks has been attempted since the early 1960s [3]. Since the beginning of the 1960s, IBM has pushed research on design automation in logic design. In the 1970s, a different type of algorithm, called the Transduction method, was devised for automated design of logic networks, as described in Chapter 34. Since the beginning of the 1980s, as the integration size of integrated chips has increased tremendously, research and development of automated logic synthesis has become very active at several places [1,2,4,9], and powerful logic synthesizers have become commercially available [6]. There are different types of logic synthesizers.
Here, let us describe a logic synthesizer that derives a technology-dependent logic circuit in two phases.7 (There are also logic synthesizers in which, unlike the two-phase synthesizer described here, only technology-dependent optimization is used throughout the transformation of the given logic functions into technology-dependent logic circuits that are easily realizable as transistor circuits.) In the first phase, an optimum logic network that is not necessarily realizable as a transistor circuit is designed by Boolean algebraic approaches, which are easy to use. This phase is called technology-independent optimization. In the second phase, the logic network derived in the first phase is converted into a technology-dependent logic circuit that is easily realizable as a transistor circuit. This phase is called technology-dependent optimization. (Note that all logic synthesizers, including the logic synthesizer in two phases, must eventually produce technology-dependent logic circuits that are easily realizable as transistor circuits.) The logic synthesizer in two phases has variations, depending on which optimization algorithms are used, and some of them have the advantage of short processing time.


Suppose a logic network is designed for the given functions by some means, including the design methods described in the previous chapters. Figure 33.1 shows an example of such a logic network. The logic network is then processed by many logic optimization algorithms in sequence, as shown in Fig. 33.2. As illustrated in Fig. 33.1, a subnetwork (i.e., several logic gates that constitute a small logic network by themselves) that has a single output logic function is called a node, and its output function is expressed in a sum-of-products form. Thus, the entire logic network is expressed as a network of nodes. During the first phase, technology-independent optimization, a node does not correspond to a transistor circuit. The second phase, technology-dependent optimization, illustrated in Fig. 33.3, converts the logic network derived by the first phase into technology-dependent circuits, in which each node corresponds to a cell [i.e., a small transistor circuit (shown in each dot-lined loop in Fig. 33.3)] in a library of cells. The cell library consists of a few hundred cells, where a cell is a small transistor circuit with a good layout. These cells are designed beforehand and can be used repeatedly for logic synthesis.

FIGURE 33.1 Logic network to be processed by a logic synthesizer.

FIGURE 33.2 A logic optimization flow in the logic synthesizer in two phases.

FIGURE 33.3 Technology-dependent logic circuit. OAI22, OAI21, and AOI21 are the cells in the cell library.

A typical flow of logic optimization in the logic synthesis in two phases works as follows and is illustrated in Fig. 33.2, although there may be variations:

1. Sweep, illustrated in Fig. 33.4: Nodes that do not have any fan-outs are deleted. A node that consists of only a buffer or an inverter is merged into the next nodes. The following rules optimize nodes that have inputs of constant values, 0 or 1:

   A ⋅ 1 = A, A ⋅ 0 = 0, A ∨ 1 = 1, A ∨ 0 = A

2. Collapsing (also often called flattening), illustrated in the upper part of Fig. 33.5: More than one node is merged into one node, so that it has a more complicated logic function than before.

3. Two-level logic minimization: Two-level logic minimization2 is carried out at each node, as illustrated in the lower part of Fig. 33.5. (PLA minimization programs to be described in Chapter 42 can be used for this purpose.)

4. Decomposition, illustrated in Fig. 33.6: Decomposition4 extracts common expressions from the sum-of-products forms of nodes, adds intermediate nodes whose logic functions are the same as the common expressions, and re-expresses the sum-of-products form at the original nodes, using the intermediate nodes. This transformation, called weak division (described in Chapter 29), reduces the area of the circuit.

5. Technology mapping,5 illustrated in Fig. 33.7: Technology mapping finds an appropriate mapping of the network of the nodes onto the set of cells in the target cell library. Technology mapping works as follows:

FIGURE 33.4 Sweep.

   a. Select a tree structure: a tree structure where all nodes have one fan-out node is extracted from the circuit.
   b. Decompose it to two-input NANDs and inverters: a selected tree is decomposed to two-input NANDs and inverters. The result of this decomposition is called a subject tree.
   c. Matching: all cells in the target cell library are also decomposed to two-input NANDs and inverters, as shown in Fig. 33.8. Pattern matching between the subject tree and the cells in the cell library is carried out to get the best circuit.

   Timing or power optimization, that is, minimization of delay time or power consumption under other conditions (e.g., minimization of delay time without excessively increasing power consumption, or minimization of power consumption without sacrificing speed too much), is also carried out by changing the structure of the nodes and their corresponding cells in the library.

6. Fanout optimization, illustrated in Fig. 33.9: Fanout optimization attempts to reduce delay time by the following transformations.

FIGURE 33.5 Collapsing and two-level logic minimization.

FIGURE 33.6 Decomposition.

FIGURE 33.7 Technology mapping.

FIGURE 33.8 Pattern trees for a library.

FIGURE 33.9 Fan-out optimization.

   (a) Buffering: inserting buffers and inverters in order to reduce the load capacitance of logic gates that have many fanouts. In particular, a critical path that is important for performance is separated from other fanouts.8
   (b) Repowering: replacing a logic gate that has many fanouts by another cell in the cell library that has greater output power (its cell is larger) and the same logic function as the original one. This does not change the structure of the circuit.8
   (c) Cloning: creating a clone node for a logic gate that has many fanouts and distributing the fanouts among these nodes to reduce the load capacitance.

This is a typical logic optimization flow. Many other optimization algorithms can be incorporated within this flow to synthesize better circuits.
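As a small illustration of the sweep step in the flow above, the constant-input rules A ⋅ 1 = A, A ⋅ 0 = 0, A ∨ 1 = 1, and A ∨ 0 = A can be sketched in Python. The nested-tuple encoding of expressions and the name `sweep_constants` are assumptions made for this sketch only; they are not part of any particular synthesizer.

```python
def sweep_constants(expr):
    """Apply A.1 = A, A.0 = 0, A v 1 = 1, A v 0 = A from the leaves upward.

    expr is the constant 0 or 1, a variable name (str), or a tuple
    ('and', e1, e2, ...) or ('or', e1, e2, ...).  (Hypothetical encoding.)
    """
    if not isinstance(expr, tuple):
        return expr                              # constant or variable
    op, *args = expr
    args = [sweep_constants(a) for a in args]
    # For AND, 0 absorbs and 1 is the identity; for OR it is the reverse.
    absorbing, identity = (0, 1) if op == "and" else (1, 0)
    if any(isinstance(a, int) and a == absorbing for a in args):
        return absorbing                         # A.0 = 0  or  A v 1 = 1
    args = [a for a in args if not (isinstance(a, int) and a == identity)]
    if not args:
        return identity                          # only identity constants were left
    if len(args) == 1:
        return args[0]                           # A.1 = A  or  A v 0 = A
    return (op, *args)


simplified = sweep_constants(("or", ("and", "A", 1), ("and", "B", 0)))
print(simplified)
```

In this example the first product reduces to A, the second to 0, and the sum A ∨ 0 reduces to A.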

References

1. Bergamaschi, R. A., et al., “High-level synthesis in an industrial environment,” IBM Jour. Res. and Dev., vol. 39, pp. 131-148, Jan./March 1995.
2. Brayton, R. K., G. D. Hachtel, C. McMullen, and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, Boston, 1984.
3. Breuer, M. A., “General survey of design automation of digital computers,” Proc. IEEE, vol. 54, no. 12, pp. 1708-1721, Dec. 1966.
4. Brayton, R. K., A. Sangiovanni-Vincentelli, and A. Wang, “MIS: A multiple-level logic optimization system,” IEEE Tr. CAD, CAD-6, no. 6, pp. 1062-1081, Nov. 1987.
5. Detjens, E., G. Gannot, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, “Technology mapping in MIS,” Proc. Int’l Conf. CAD, pp. 116-119, 1987.
6. Kurup, P. and T. Abbasi, Ed., Logic Synthesis Using Synopsys, 2nd ed., Kluwer Academic Publishers, 322, 1997.
7. Rudell, R., “Tutorial: Design of a logic synthesis system,” Proc. 33rd Design Automation Conf., pp. 191-196, 1996.
8. Singh, K. J. and A. Sangiovanni-Vincentelli, “A heuristic algorithm for the fanout problem,” Proc. 27th Design Automation Conf., pp. 357-360, 1990.
9. Stok, L., et al. (IBM), “BooleDozer: Logic synthesis for ASICs,” IBM Jour. Res. Dev., vol. 40, no. 4, pp. 407-430, July 1996.


Muroga, S. "Logic Synthesizer by the Transduction Method" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


34 Logic Synthesizer by the Transduction Method

Saburo Muroga, University of Illinois at Urbana-Champaign

34.1 Technology-Dependent Logic Optimization
34.2 Transduction Method for the Design of NOR Logic Networks
     Permissible Functions • Pruning Procedure • Calculation of Sets of Permissible Functions • Derivation of an Irredundant Network Using the MSPF • Calculation of Compatible Sets of Permissible Functions • Comparison of MSPF and CSPF • Transformations
34.3 Various Transduction Methods
     Computational Example of the Transduction Method
34.4 Design of Logic Networks with Negative Gates by the Transduction Method

34.1 Technology-Dependent Logic Optimization

In this chapter, a logic optimization method called the transduction method, which was developed in the early 1970s, is described. Unlike the logic synthesizer in two phases described in Chapter 33, logic synthesizers based on the transduction method perform totally technology-dependent optimization and can yield transistor circuits of better quality, although the method takes more time to execute. Some underlying ideas in the method came from analyzing the minimal logic networks derived by the integer programming logic design method mentioned in Chapter 31, Section 31.5. The integer programming logic design method is very time-consuming, although it can design absolutely minimal networks. Its processing time increases almost exponentially as the number of logic gates increases, and it appears to be excessively time-consuming to design minimal logic networks with more than about 10 logic gates. Unlike the integer programming logic design method, logic synthesizers based on the transduction method are heuristic, just like any other logic synthesizer for synthesizing large circuits. The minimality of the transistor circuits derived is not guaranteed, but one can design circuits with far greater numbers of transistors.

34.2 Transduction Method for the Design of NOR Logic Networks

The transduction method can be applied to any type of gate, including negative gates (i.e., MOS logic gates), AND gates, OR gates, or logic gates for more complex logic operations; but in the following, the transduction method is described with logic networks of NOR gates for the sake of simplicity. The basic concept of the transduction method is “permissible functions,” which will be explained later. This method, which is drastically different from any previously known logic design method, is called the transduction method (transformation and reduction) because it repeats the transformation and reduction as follows:

Procedure 34.1: Outline of the transduction method

Step 1. Design an initial NOR network by any known design method. Alternatively, any network to be simplified can be used as an initial network.
Step 2. Remove the redundant part of the network, by use of permissible functions. (This is called the pruning procedure, as defined later.)
Step 3. Perform transformation (which is local or global) of the current network, using permissible functions.
Step 4. Repeat Steps 2 and 3 until no further improvement is possible.
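The control structure of Procedure 34.1 can be sketched as a simple improvement loop. This is only a schematic under stated assumptions: `prune`, `transform`, and `cost` are hypothetical callables supplied by the caller (the actual pruning and transformation procedures are defined later in this chapter), and the stopping rule "no decrease in cost" is one possible reading of Step 4.

```python
def transduce(network, prune, transform, cost):
    """Schematic of Procedure 34.1 (hypothetical interface).

    prune(network)     -> network with a redundant part removed (Step 2)
    transform(network) -> transformed network (Step 3)
    cost(network)      -> number to minimize, e.g. gates or connections
    """
    network = prune(network)                      # Step 2 on the initial network
    best = cost(network)
    while True:
        candidate = prune(transform(network))     # Step 3, then Step 2 again
        candidate_cost = cost(candidate)
        if candidate_cost >= best:                # Step 4: no further improvement
            return network
        network, best = candidate, candidate_cost


# Toy stand-ins: the "network" is just its gate count, and each
# transformation saves one gate until a floor of 3 gates is reached.
result = transduce(7,
                   prune=lambda n: n,
                   transform=lambda n: max(n - 1, 3),
                   cost=lambda n: n)
print(result)
```

The cost is evaluated only after pruning because a transformation such as adding a connection may enlarge the network temporarily before pruning shrinks it.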



Let us illustrate how Procedure 34.1 simplifies the initial network shown in Fig. 34.1(a), as an example. By a transformation (more precisely speaking, transformation called “connectable condition,” which is explained later), we have the new network shown in Fig. 34.1(b), by connecting the output of gate v6 to the gate v8 (shown by a bold line). By the pruning procedure, the connection from gate v6 to gate v4 (shown by a dotted line) is deleted, deriving the new network shown in Fig. 34.1(c), and then the connection from gate v9 to gate v6 is deleted, deriving the new network shown in Fig. 34.1(d). Then, by a transformation (“connectable condition” again), we have the new network shown in Fig. 34.1(e), by connecting the output of gate v6 to the gate v7. By the pruning procedure, we can delete the output connection from gate v10 to gate v7, and then we can delete gate v10. During all these transformations and prunings, the network output f is not changed. Thus, the initial network with 7 gates and 13 connections, shown in Fig. 34.1(a), has been simplified to the one with 6 gates and 11 connections, shown in Fig. 34.1(f), by the transduction method.

FIGURE 34.1 An example for the simplification of a logic network by the transduction method.

Many transformation procedures for Step 3 were devised to develop efficient versions of the transduction method. Reduction of a network is done in Step 2 and also sometimes in Step 3. The initial network stated in Step 1 of Procedure 34.1 can be designed by any known method and need not be the best network. The transduction method is essentially an improvement of a given logic network (that is, the initial network) by repeated transformation and reduction, as illustrated in the flow diagram in Fig. 34.2. In contrast, traditional logic design is the design of a logic network (i.e., the initial network in the case of the transduction method) for given logic functions; once the network is derived, no improvement is attempted. This was natural because it was not easy to find where to delete gates or connections in logic networks. For example, can we simplify the network in Fig. 34.1(a), when it is given? The transduction method shows how to do it.

FIGURE 34.2 Basic structure of the transduction method.

The transduction method can reduce a network of many NOR gates into a new network with as few gates as possible. Also, for a given function, the transduction method can be used to generate networks of different configurations, among which we can choose those suitable for the most compact layouts on chips, since chip areas are greatly dependent on connection configurations of networks. The transduction method was developed in the early 1970s with many reports and summarized in Refs. 9 and 10.

Permissible Functions

First, let us discuss the concept of permissible functions, which is the core concept of the transduction method.

Definition 34.1: If no output function of a network of NOR gates changes by replacing the function realized at a gate (or an input terminal) vi, or a connection cij, with a function g, then function g is called a permissible function for that vi, or cij, respectively. (Note that in this definition, changes in don’t-care values, ∗, of the network output functions do not matter.)

For example, we want to design a network for a function ḟ(x1, x2, x3) shown in the truth table in Fig. 34.3(a). Suppose the network in Fig. 34.3(b) is designed by some means to realize function ḟ(x1, x2, x3) = (01∗10∗11). The output function realized at each gate vi in Fig. 34.3(b) is denoted by f(vi) in Fig. 34.3(a). The columns in Fig. 34.3(a) are shown horizontally as vectors in Fig. 34.3(b). For example, the output function of gate v5, f(v5), is shown by vector (00001000) placed just outside gate v5 in Fig. 34.3(b). (Henceforth, columns in a truth table will be shown by vectors in figures for logic networks.)

FIGURE 34.3 Permissible functions of a network.

Notice that the function ḟ(x1, x2, x3) = (01∗10∗11) (denoted by f with a dot on top of it) is a function to be realized by a network and generally is incompletely specified, containing don’t cares (i.e., ∗’s), whereas the output function f(v4) (denoted by f without a dot on top of it) of gate v4, for example, in the actually realized network in Fig. 34.3(b) is completely specified, containing no ∗, because the value of the function f(v4) realized at the output of gate v4 is 0 or 1 for each combination of values of input variables x1 = (00001111), x2 = (00110011), and x3 = (01010101) (assuming xi contains no ∗’s in its vector for each i).

Let us find permissible functions at the output of gate v5. Because the first component of the output ḟ of gate v4 is 0, the first component of the output of gate v6 is 1, and v4 is a NOR gate, the first component of a permissible function at the output of gate v5 can be 0 or 1 (in other words, ∗). Because the second component of the output ḟ of v4 is 1 and v4 is a NOR gate, the second component of a permissible function at the output of v5 must be 0. Because the third component of ḟ at v4 is ∗, the third component of a permissible function at the output of gate v5 can be 0 or 1 (i.e., ∗). Calculating every component in this manner, we will find that (00001000), (10001000), … are all permissible functions at the output of gate v5, including the original vector (00001000) shown at v5 in Fig. 34.3(b). In other words, even if we replace the function f(v5) realized at gate v5 by any of these permissible functions, the network output at v4 still realizes the given function ḟ(x1, x2, x3) = (01∗10∗11).

Notice that the value of the signal at a connection cij from gate vi to vj is always identical to the output value of vi, but permissible functions for a connection cij are defined separately from those for a gate vi. As we shall see later, this is important when gate vi has more than one output connection. When vi has only one output connection cij, permissible functions for vi are identical to those for cij (the networks in Fig. 34.3 are such cases and the permissible functions for each cij are identical to those for vi); but when vi has more than one output connection, they are not necessarily identical, as we will see later.

Usually, there is more than one permissible function for each of vi and cij. But when we find a set of permissible functions for a gate or connection, all the vectors representing these permissible functions can be expressed by one vector, as discussed in the following. This is convenient for processing. For example, suppose the output function f(vi) of a gate vi in a network has the values shown in the column labeled f(vi) in the truth table in Table 34.1, where the network has input variables x1, x2, …, xn.
Let us write this column as a vector, f(vi) = (0011 …). Suppose that no output of the network changes even if this function f(vi) in Table 34.1 is replaced by g1 = (0101 …), g2 = (0001 …), …, or gh = (0 … … 1 …) shown in Table 34.1. In other words, g1, …, gh are permissible functions (they are not necessarily all the permissible functions at gate vi). Then, permissible functions, g1 through gh, can be expressed by a single vector G(vi) = (0∗∗1 …) for the following reasons:

TABLE 34.1 Truth Table

x1   …   xn–1   xn     f(vi)   g1   g2   g3   …   gh     G(vi)
0    …   0      0      0       0    0    0    …   0      0
0    …   0      1      0       1    0    1    …   …      ∗
0    …   1      0      1       0    0    1    …   …      ∗
0    …   1      1      1       1    1    1    …   1      1
…    …   …      …      …       …    …    …    …   …      …

1. Suppose permissible functions, g1 and g2, in Table 34.1 differ only in their second components. In other words, the second component is 1 in g1 and 0 in g2. Consequently, even if the second component of the output function of gate vi is don’t care (i.e., ∗), no network output will change. Thus, (0∗01…) (i.e., the vector g1 = (0101…) with the second component replaced by ∗) is another permissible function at gate vi. This can be interpreted as follows. Original permissible functions, g1 and g2, can be replaced by the new permissible function, interpreting ∗ to mean 0 or 1; in other words, (0∗01…) means g1 = (0101…) and g2 = (0001…). When there is a permissible function gj with ∗ in the second component, any other permissible function, gk, can be replaced by gk itself with its second component replaced by ∗, even if other components of gj may not be identical to those of gk. For example, a permissible function, g3 = (0111…), in Table 34.1 can be replaced by another permissible function (0∗11…). This is because the value of each component of the output function is independent of any other component (in other words, the value of the output function at each gate for a combination of values of x1, x2, …, xn is completely independent of those for other combinations). This means that if the second component of one permissible function, gj, is ∗, then the second components of all other permissible functions can be ∗, where 1 ≤ j ≤ h.
2. Suppose the first and fourth components are 0 and 1, respectively, in permissible functions, g1 through gh.
3. Then, the set of permissible functions, g1, g2, …, gh, for gate vi can be expressed as a single vector, G(vi) = (0∗∗1…), as shown in the last column in Table 34.1.
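The argument above can be checked by brute force on the small example discussed earlier, where the output gate v4 is a NOR gate with inputs from v5 and v6, f(v6) = (10000000), and the required output is (01∗10∗11). The following Python sketch enumerates every completely specified function g at v5, keeps those that leave the required output satisfied (Definition 34.1), and merges them into one vector. The string encoding of vectors and the helper names are assumptions of this sketch.

```python
from itertools import product

SPEC = "01*10*11"     # required output of v4; '*' is a don't-care
F_V6 = "10000000"     # function realized at gate v6

def nor(*vectors):
    """Component-wise NOR of equal-length strings of '0'/'1'."""
    return "".join("0" if "1" in bits else "1" for bits in zip(*vectors))

def satisfies(actual, spec):
    """True if 'actual' agrees with 'spec' wherever 'spec' is not '*'."""
    return all(s in ("*", a) for a, s in zip(actual, spec))

def merge(vectors):
    """One vector for a set of permissible functions: '*' where they differ."""
    return "".join(col[0] if len(set(col)) == 1 else "*"
                   for col in zip(*vectors))

# g is permissible at v5 if v4 = NOR(g, f(v6)) still satisfies SPEC.
candidates = ("".join(bits) for bits in product("01", repeat=8))
permissible = [g for g in candidates if satisfies(nor(g, F_V6), SPEC)]
G_v5 = merge(permissible)
print(len(permissible), G_v5)   # prints: 8 *0*01*00
```

The merged vector has three ∗ components, which accounts for the 2³ = 8 permissible functions found, among them (00001000) and (10001000).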

Pruning Procedure

From the basic properties of permissible functions just observed, we have the following.

Theorem 34.1: The set of all permissible functions at a gate (or an input terminal) or connection, or any subset of it, can be expressed by a single incompletely specified function.

Henceforth, let G denote the vector that expresses this single incompletely specified function. G(vi) = (0∗∗1…) in Table 34.1 is such an example.

When an arbitrary NOR gate network, such as Fig. 34.3(b), is given, it is generally very difficult to find which connections and/or gates are unnecessary (i.e., redundant) and can be deleted from the network without changing any output functions of the network. But it is easy to do so by the following fundamental property based on the concept of permissible functions:

If a set of permissible functions at a gate, a connection, or an input terminal in a given NOR gate network has no 1-component (in other words, consists of 0’s, ∗’s, or a mixture of 0’s and ∗’s), then that gate (and its output connections), connection, or input terminal is redundant.

In other words, it can be removed from the network without changing the network outputs, because the function on this gate, connection, or input terminal can be the vector consisting of all 0-components (by changing every ∗ to 0); and even if all the output connections of it are removed from the NOR gates whose inputs have these connections, the output functions of the network do not change. (This is because an input of constant 0 to a NOR gate that has more than one input does not affect the output of that NOR gate and hence is removable.)

Based on this property, we can simplify the network by the following procedure:

Procedure 34.2: Pruning procedure

We calculate a set of permissible functions for every gate, connection, and input terminal in a network of NOR gates, starting with the connections connected to output terminals of the network and moving toward the input terminals of the network. Whenever the single vector, G, that expresses this set of permissible functions is found during the calculation to consist of only 0 or ∗ components, without 1-components, we can delete that gate, connection, or input terminal without changing any network output.

In this case, it is important to notice that we generally can prune only one of the gates, connections, and input terminals whose G’s consist of only 0-components and ∗-components, as we will see later. If we want to prune more than one of them, we generally need to recalculate permissible functions of all gates, connections, and input terminals throughout the network after pruning one.
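The redundancy test used by the pruning procedure is a one-line check on G. The sketch below is a minimal illustration with assumed names (`is_prunable`, `all_zero_member`) and illustrative vectors; it also confirms that an all-0 input leaves the output of a NOR gate unchanged, which is why such a connection can be removed.

```python
def nor(*vectors):
    """Component-wise NOR of equal-length strings of '0'/'1'."""
    return "".join("0" if "1" in bits else "1" for bits in zip(*vectors))

def is_prunable(G):
    """True if G consists only of '0' and '*' components (no 1-component)."""
    return "1" not in G

def all_zero_member(G):
    """The permissible function obtained from G by changing every '*' to '0'."""
    return G.replace("*", "0")

G_a = "****0***"           # illustrative vector with no 1-component
G_b = "*0*01*00"           # illustrative vector with a 1-component

other_input = "01110111"   # another input of the same NOR gate (assumed)
zeros = all_zero_member(G_a)
# Dropping an input that is constantly 0 does not change the NOR output.
unchanged = nor(other_input, zeros) == nor(other_input)
print(is_prunable(G_a), is_prunable(G_b), unchanged)   # prints: True False True
```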

Calculation of Sets of Permissible Functions

Let us introduce simple terminology for later convenience. If there is a connection cij from the output of vi to the input of vj, vi is called an immediate predecessor of vj, and vj is called an immediate successor of vi. IP(vi) and IS(vi) denote the set of all immediate predecessors of vi and the set of all immediate successors of vi, respectively. If there is a path of gates and connections from the output of gate vi to the input of vj, then vi is called a predecessor of vj and vj is called a successor of vi. P(vi) and S(vi) denote the set of all predecessors of vi and the set of all successors of vi, respectively.

Thus far, we have considered only an arbitrary set of permissible functions at a gate, connection, or input terminal, vi (or cij). Now let us consider the set of all permissible functions at vi (or cij).
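The four sets IP, IS, P, and S can be computed directly from a list of connections. The Python sketch below assumes a connection list inferred from the function vectors and connection names quoted in this chapter for the network of Fig. 34.3(b) (the figure itself is not reproduced here, so the list is an assumption), and it treats input terminals v1, v2, v3 as nodes as well.

```python
# Connections (from, to); v1..v3 are input terminals, v4 is the output gate.
# This list is an assumption inferred from the vectors quoted in the text.
CONNECTIONS = [("v1", "v7"), ("v1", "v6"), ("v2", "v6"), ("v3", "v6"),
               ("v2", "v5"), ("v3", "v5"), ("v7", "v5"),
               ("v5", "v4"), ("v6", "v4")]

def IP(v):
    """Immediate predecessors of v."""
    return {a for a, b in CONNECTIONS if b == v}

def IS(v):
    """Immediate successors of v."""
    return {b for a, b in CONNECTIONS if a == v}

def closure(v, step):
    """All nodes reachable from v by applying 'step' (IP or IS) repeatedly."""
    seen, stack = set(), list(step(v))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(step(u))
    return seen

def P(v):
    """All predecessors of v."""
    return closure(v, IP)

def S(v):
    """All successors of v."""
    return closure(v, IS)

print(sorted(IP("v5")), sorted(IS("v2")), sorted(S("v1")))
```

For this connection list, IP(v5) = {v2, v3, v7}, IS(v2) = {v5, v6}, and S(v1) = {v4, v5, v6, v7}.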


Definition 34.2: The set of all permissible functions at any gate, connection, or input terminal is called the maximum set of permissible functions, abbreviated as MSPF. This set can be expressed by a single vector, as already shown.

The MSPF of a gate, connection, or input terminal can be found by examining which functions change the network outputs. But MSPFs throughout the network can be far more efficiently calculated by the following procedure.

Procedure 34.3: Calculation of MSPFs in a network

MSPFs of every gate, connection, and input terminal in a network can be calculated, starting from the outputs of the network and moving toward the input terminals, as follows:

1. Suppose we want to realize a network that has input variables x1, x2, …, xn and output functions ḟ1, ḟ2, …, ḟm, that may contain don’t-care components (i.e., ∗’s). Then, suppose we have actually realized a network with R NOR gates for these functions by some means and the output gates of this network realize f1, f2, …, fm, respectively. (v1, v2, …, vn are regarded as input terminals where x1, x2, …, xn are provided, respectively.) Notice that the output functions, f1, f2, …, fm, of this network do not contain any don’t-care components because each network output function is 1 or 0 for each combination of values of x1, x2, …, xn. Calculate output function f(vj) at each gate vj in this network. Then, the MSPF, GM(vt), of each gate, vt, whose output function is one of the network outputs, fk, is set to ḟk (not fk, which is a completely specified function at vt of the actually realized network); that is, GM(vt) = ḟk, where k = 1, 2, …, m.

2. Suppose a gate vj in the network has MSPF, GM(vj).
Then, the MSPF of each connection cij to the input of this gate can be calculated component by component as follows: If a component of GM(vj) is 0 and the disjunction (i.e., logical sum, or OR) of the values of the corresponding components of the input connections other than cij is 0, then the corresponding component of the MSPF of input connection cij is 1. If they are 1 and 0, respectively, the corresponding component of the MSPF of cij is 0. And so on. This operation, denoted by ⊙, is summarized in Table 34.2, where the first operand is a component of GM(vj), the second operand is the disjunction of the component values of the input connections other than cij, and “–” denotes “undefined” because vj is a NOR gate and consequently these cases do not occur. Calculate each component of GM(cij) in this manner.

TABLE 34.2 Definition of ⊙

First operand: a component of GM(vj). Second operand: the corresponding component of the disjunction of f(v) over all v ∈ IP(vj) with v ≠ vi.

                     First operand
Second operand       0     1     ∗
0                    1     0     ∗
1                    ∗     –     ∗
∗                    –     –     –

– : undefined

If this gate vj has only one input connection cij (i.e., vj becomes an inverter), each component of the MSPF of cij is 1, 0, or ∗ according to whether the corresponding component of GM(vj) is 0, 1, or ∗. This calculation can be formally stated by the following formula:

$$G_M(c_{ij}) = G_M(v_j) \odot \Bigl( \bigvee_{v \in IP(v_j),\; v \neq v_i} f(v) \Bigr) \qquad (34.1)$$

where if IP(vj) = {vi} (in other words, gate vj has only one input connection, from vi), then

$$\bigvee_{v \in IP(v_j),\; v \neq v_i} f(v)$$

is regarded as the function that is constantly 0.

For example, let us calculate MSPFs of the network shown in Fig. 34.3(b), which realizes the function ḟ(x1, x2, x3) = (01∗10∗11) given in Fig. 34.3(a). Because gate v4 is the network output, this ḟ is the MSPF of gate v4 according to Step 1; that is, GM(v4) = (01∗10∗11), as shown in Fig. 34.3(c). Gate v4 has input connections realizing functions f(c54) = (00001000) and f(c64) = (10000000). Then, GM(c54), the MSPF of c54, can be calculated as follows. The first component of GM(c54) is ∗, using Table 34.2, because the first component of GM(v4) is 0 and the first component of f(c64) (which, in this case, is the disjunction of the first components of functions realized at connections other than c54, because c64 is the only connection other than c54) is 1. Proceeding with the calculation based on Table 34.2, we have GM(c54) = (∗0∗01∗00). (GM(c54) and other GM(cij)’s are not shown in Fig. 34.3.)

3. The MSPF of a gate or input terminal, vi, can be calculated from GM(cij) as follows: If a gate or input terminal, vi, has only one output connection, cij, whose MSPF is GM(cij), then GM(vi), the MSPF of vi, is given by:

$$G_M(v_i) = G_M(c_{ij}) \qquad (34.2)$$

Thus, we have GM(v5) = GM(c54), and this is shown in Fig. 34.3(c). If vi has more than one output connection to gates vj’s, then GM(vi) is not necessarily identical to GM(cij). In this case, the MSPF for any gate or input terminal, vi, in the network is given by the following H:

$$H = \bigl(H^{(1)}, H^{(2)}, \ldots, H^{(2^n)}\bigr) \qquad (34.3)$$

whose w-th component is given by:

$$H^{(w)} = \begin{cases} \ast & \text{if } f_k^{(w)} \supseteq f_k^{\prime(w)} \text{ for every } k \text{ such that } 1 \le k \le m, \\ f^{(w)}(v_i) & \text{if } f_k^{(w)} \supseteq f_k^{\prime(w)} \text{ does not hold for some } k \text{ such that } 1 \le k \le m, \end{cases}$$

where $f_k' = (f_k^{\prime(1)}, f_k^{\prime(2)}, \ldots, f_k^{\prime(2^n)})$ is the new k-th output function of the network, to which the k-th output of the original network, fk, changes by the following reconfiguration: insert an inverter between gate or input terminal, vi, and its immediate successor gates, vj’s (so the immediate successor gates receive the complement $\bar{f}(v_i)$ instead of the original function f(vi) at vi). Here, “⊇,” which means ⊃ or =, is used as an ordinary set inclusion, i.e., ∗ ⊇ 0, ∗ ⊇ 1, 1 ⊇ 1, and 0 ⊇ 0, interpreting ∗ as the set of 1 and 0. $f^{(w)}(v_i)$ is the w-th component of f(vi). Notice that the calculation of the MSPF for vi, based on Eq. 34.3, is done by finding out the new values of the network outputs, f1, f2, …, fm, for the preceding reconfiguration, without using GM(cij)’s. Thus, this calculation is complex and time-consuming.

4. Repeat Steps 2 and 3 until finishing the calculation of MSPFs throughout the network.

Let us continue the example in Fig. 34.3(c), where input variables, x1, x2, and x3, are at input terminals v1, v2, and v3, respectively. For the input variables, x1 = f(v1) = (0 0 0 0 1 1 1 1), x2 = f(v2) = (0 0 1 1 0 0 1 1), and x3 = f(v3) = (0 1 0 1 0 1 0 1), the outputs of gates are f(v4) = (0 1 1 1 0 1 1 1), f(v5) = (0 0 0 0 1 0 0 0), f(v6) = (1 0 0 0 0 0 0 0), and f(v7) = (1 1 1 1 0 0 0 0). Because gate v5 has only one output connection, GM(v5) = GM(c54), according to Step 3. The first component of GM(c25) (i.e., the MSPF of connection c25 from input terminal v2, which actually has input variable x2) is ∗ by Table 34.2 because the first component of GM(v5) is ∗. The second component of GM(c25) is ∗ by Table 34.2 because the second component of GM(v5) is 0 and the disjunction of the second components of f(c75) and f(c35) is 1 ∨ 1 = 1. And so on.
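The operation of Table 34.2 and Eq. 34.1 is easy to express as a lookup applied component by component. The Python sketch below is a minimal illustration under assumed names (`TABLE_34_2`, `mspf_connection`) with vectors encoded as strings; the undefined entries of the table are simply left out of the lookup, so reaching one raises a KeyError. It covers only Eq. 34.1, not the inverter-insertion calculation of Eq. 34.3.

```python
# (component of GM(vj), OR of the other inputs) -> component of GM(cij)
TABLE_34_2 = {("0", "0"): "1", ("0", "1"): "*",
              ("1", "0"): "0",
              ("*", "0"): "*", ("*", "1"): "*"}

def vec_or(vectors, width):
    """Component-wise OR; with no vectors the result is constantly 0."""
    if not vectors:
        return "0" * width
    return "".join("1" if "1" in bits else "0" for bits in zip(*vectors))

def mspf_connection(G_vj, other_inputs):
    """Eq. 34.1 for one input connection cij of NOR gate vj.

    G_vj         : MSPF of gate vj, a string over '0', '1', '*'
    other_inputs : functions f(v) of the inputs of vj other than vi
    """
    others = vec_or(other_inputs, len(G_vj))
    return "".join(TABLE_34_2[(g, o)] for g, o in zip(G_vj, others))

# GM(c54) from GM(v4) = (01*10*11) and f(v6) = (10000000)
G_c54 = mspf_connection("01*10*11", ["10000000"])
# GM(c25) from GM(v5) = GM(c54), x3 = (01010101), and f(v7) = (11110000)
G_c25 = mspf_connection(G_c54, ["01010101", "11110000"])
print(G_c54, G_c25)
```

The first call reproduces GM(c54) = (∗0∗01∗00) from the text, and the first two components of the second result are both ∗, as stated for c25.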

Derivation of an Irredundant Network Using the MSPF

The following procedure provides a quick means to identify redundant gates, connections, or input terminals in a network. By repeating the calculation of MSPFs at all gates, connections, and input terminals throughout the network and the deletion of some of them by the pruning procedure (Procedure 34.2), we can derive an irredundant logic network as follows, where “irredundant network” means that if any connection, gate, or input terminal is deleted, some network outputs will change; in other words, no connection, gate, or input terminal is redundant.

Procedure 34.4: Derivation of an irredundant network using the MSPF

1. Calculate the MSPFs for all gates, connections, and input terminals throughout the network by Procedure 34.3, starting from each output terminal and moving toward input terminals.
2. During Step 1, we can remove a connection, gate, or input terminal, without changing any output function of the network, as follows:
   a. If there exists any input connection of a gate whose MSPF consists of 0’s and ∗’s only, remove it. We can remove only one connection because if two or more connections are simultaneously removed, some network outputs may change.
   b. As a result of removing any connection, there may be a gate without any output connections. Then remove such a gate.
3. If a connection and possibly a gate are removed in Step 2, return to Step 1 with the new network. Otherwise, go to Step 4.
4. Terminate the procedure; we have obtained an irredundant network.

This procedure does not necessarily yield a network with a minimum number of gates or connections, because an irredundant network does not necessarily have a minimum number of gates or connections. But the procedure is useful in many cases because every network with a minimum number of gates or connections must be irredundant and an obtained network may be minimal.

Example 34.1: Let us apply Procedure 34.4 to the network shown in Fig. 34.3(c). Because each gate has only one output connection, GM(vi) = GM(cij) holds for every vi. Thus, using Eq. 34.1 and GM(v4) = ḟ(x1, x2, x3), MSPFs are obtained as follows:

GM(c54) = GM(v5) = GM(v4) ⊙ f(v6) = (* 0 * 0 1 * 0 0),
GM(c64) = GM(v6) = GM(v4) ⊙ f(v5) = (1 0 * 0 * * 0 0), and
GM(c75) = GM(v7) = GM(v5) ⊙ (x2 ∨ x3) = (* * * * 0 * * *)

Then we can remove connection c75 and consequently gate v7 by the pruning procedure, as shown in Fig. 34.3(d). For this new network, we obtain the following:

f(v4) = (0 1 1 1 0 1 1 1),
f(v5) = (1 0 0 0 1 0 0 0),
f(v6) = (1 0 0 0 0 0 0 0),
GM(v4) = ḟ(x1, x2, x3) = (0 1 * 1 0 * 1 1),
GM(c54) = GM(v5) = (* 0 * 0 1 * 0 0), and
GM(c64) = GM(v6) = GM(v4) ⊙ f(v5) = (* 0 * 0 * * 0 0)

Then we can remove connection c64 and then gate v6. Thus, the original network of four gates and nine connections is reduced to the network of two gates with three connections in Fig. 34.3(e), where GM’s are recalculated and no gate can be removed by the pruning procedure.
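The connection MSPFs of Example 34.1 can be reproduced with a few lines of Python. The sketch below is an illustration only: Table 34.2 is not repeated in this part of the chapter, so the operator table coded in `odot` is an assumption reconstructed from the worked values in the text (a ∗ stays ∗, a required output 1 forces the input to 0, and a required output 0 gives ∗ if another input is already 1 and 1 otherwise). The function names are hypothetical.

```python
def odot(g, d):
    """Assumed entry of Table 34.2.

    g: component of the gate's MSPF ('0', '1', or '*').
    d: disjunction ('0' or '1') of the other inputs of the same gate.
    """
    if g == "*":
        return "*"
    if g == "1":
        return "0"                    # NOR output 1 needs every input at 0
    return "*" if d == "1" else "1"   # NOR output 0 needs at least one input at 1

def disjunction(*fs):
    """Component-wise OR of 0/1 strings."""
    return "".join("1" if "1" in col else "0" for col in zip(*fs))

def connection_mspf(G_vj, *other_inputs):
    """MSPF of one input connection of a NOR gate, given the other inputs."""
    d = disjunction(*other_inputs) if other_inputs else "0" * len(G_vj)
    return "".join(odot(g, b) for g, b in zip(G_vj, d))

def removable(G):
    """Pruning test of Procedure 34.4: only 0's and *'s."""
    return set(G) <= {"0", "*"}

x2, x3 = "00110011", "01010101"
GM_v4 = "01*10*11"

# Original network of Fig. 34.3(c).
GM_c54 = connection_mspf(GM_v4, "10000000")   # other input: f(v6)
GM_c64 = connection_mspf(GM_v4, "00001000")   # other input: f(v5)
GM_c75 = connection_mspf(GM_c54, x2, x3)      # GM(v5) = GM(c54)
print(GM_c54, GM_c64, GM_c75, removable(GM_c75))

# After c75 and v7 are removed, f(v5) becomes (1 0 0 0 1 0 0 0).
GM_c64_new = connection_mspf(GM_v4, "10001000")
print(GM_c64_new, removable(GM_c64_new))
```

Only c75 passes the pruning test in the original network among the three connections computed here, and c64 passes it after the first removal, in line with the example.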


Whenever we delete any one connection, gate, or input terminal by the pruning procedure, we need to calculate MSPFs of the new network from scratch. If we cannot apply the pruning procedure, the final logic network is irredundant. This means that if any gate, connection, or input terminal is deleted from the network, some of the network outputs change. It is important to notice that an irredundant network is completely testable; that is, we can detect, by providing appropriate values to the input terminals, whether or not the network contains faulty gates or connections. (A redundant network may contain gates or connections such that it is not possible to detect whether they are faulty.) Use of MSPFs is time-consuming because the calculation of MSPFs by Eq. 34.3 in Step 3 of Procedure 34.3 is time-consuming and also because we must recalculate MSPFs whenever we delete a connection, gate, or input terminal by the pruning procedure.
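The remove-one-connection-then-recalculate loop of Procedure 34.4 can be sketched as follows. This is a hedged illustration, not the original implementation: the network connectivity is the one inferred from the gate functions of the running example, the MSPF of a connection is obtained by complementing that single connection (the idea of Eq. 34.3 applied to a connection), and the input list of v5 is ordered so that c75 is tried first, mirroring the choice made in Example 34.1. All names are hypothetical.

```python
from itertools import product

ROWS = list(product((0, 1), repeat=3))
INPUTS = {f"x{i + 1}": tuple(r[i] for r in ROWS) for i in range(3)}
ORDER = ["v7", "v6", "v5", "v4"]          # evaluation order, inputs toward output
NET = {"v4": ["v5", "v6"], "v5": ["v7", "x2", "x3"],
       "v6": ["x1", "x2", "x3"], "v7": ["x1"]}
SPEC = "01*10*11"                          # required output with don't-cares

def evaluate(net, flip=None):
    """Evaluate all NOR gates; `flip` = (source, gate) complements one connection."""
    f = dict(INPUTS)
    for g in ORDER:
        if g not in net:
            continue
        ins = [tuple(1 - b for b in f[s]) if flip == (s, g) else f[s]
               for s in net[g]]
        f[g] = (tuple(int(not any(col)) for col in zip(*ins))
                if ins else (1,) * len(ROWS))
    return f

def connection_mspf(net, src, dst, out="v4"):
    """'*' where complementing the connection keeps the output within SPEC."""
    base = evaluate(net)
    changed = evaluate(net, flip=(src, dst))[out]
    return "".join("*" if s == "*" or int(s) == changed[w] else str(base[src][w])
                   for w, s in enumerate(SPEC))

def prune(net, out="v4"):
    """Remove one redundant connection at a time, recomputing MSPFs each time."""
    net = {g: list(ins) for g, ins in net.items()}
    while True:
        target = next(((s, g) for g in net for s in net[g]
                       if set(connection_mspf(net, s, g, out)) <= {"0", "*"}), None)
        if target is None:
            return net                      # no connection can be pruned
        src, dst = target
        net[dst].remove(src)
        while True:                         # drop gates left without output connections
            dead = [h for h in net if h != out
                    and not any(h in ins for ins in net.values())]
            if not dead:
                break
            for h in dead:
                del net[h]

print(prune(NET))
```

With this trial order the sketch ends at the two-gate, three-connection network of Fig. 34.3(e). A different trial order may remove other connections first and end at a different irredundant network, since more than one connection can pass the pruning test at a given step.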

Calculation of Compatible Sets of Permissible Functions

Reduction of calculation time is very important for developing a practical procedure. For this purpose, we introduce the concept of a compatible set of permissible functions, which is a subset of an MSPF.

Definition 34.3: Let V be the set of all the gates and connections in a network, and e be a gate (or input terminal) vi or a connection cij in this set. All the sets of permissible functions for all e’s in V, denoted by GC(e)’s, are called compatible sets of permissible functions (abbreviated as CSPFs), if the following properties hold with every subset T of V:
a. Replacement of the function at each gate or connection, t, in set T by a function f(t) ∈ GC(t) results in a new network where each gate or connection, e, such that e ∉ T and e ∈ V, realizes function f(e) ∈ GC(e).
b. Condition a holds, no matter which function f(t) in GC(t) is chosen for each t.

Consequently, even if the function of a gate or connection, vt, is replaced by f(vt) ∈ GC(vt), the function f(vu) at any other gate or connection, vu, still belongs to the original GC(vu). CSPFs at all gates and connections in a network can be calculated as follows.

Procedure 34.5: Calculation of CSPFs in a logic network
CSPFs of all gates, connections, and input terminals in a network can be calculated, starting from the network outputs and moving toward the input terminals, as follows:
1. Suppose we want to realize a network that has inputs x1, x2, …, xn and output functions ḟ1, ḟ2, …, ḟm, which may contain don’t-care components (i.e., *’s). Then, suppose we have actually realized a network with R NOR gates for these functions by some means. Calculate output function f(vj) at each gate vj in this network. Then, the CSPF of each gate, vt, whose output function is one of the network outputs, fk, is GC(vt) = ḟk (not fk), where k = 1, 2, …, m.
2. Unlike MSPFs, the CSPF at each gate, connection, or input terminal is highly dependent on the order of processing gates, connections, and input terminals in a network. So, before starting the calculation of CSPFs, we need to determine an appropriate order of processing. Selection of a good processing order, denoted by r, is important for the development of an efficient transduction method. Let r(vi) denote the ordinal number of a gate or connection, vi, in this order. Suppose a gate vj in the network has CSPF, GC(vj). Then, the CSPF of each input connection cij, GC(cij), of this gate can be calculated by the following formula, which is somewhat different from Eq. 34.1:

\[
G_C(c_{ij}) = \Bigl\{ G_C(v_j) \odot \Bigl( \bigvee_{\substack{v \in IP(v_j) \\ r(v) > r(v_i)}} f(v) \Bigr) \Bigr\} \,\#\, f(v_i) \tag{34.4}
\]

by calculating the terms on the right-hand side in the following steps:
a. Calculate the disjunction (i.e., component-wise disjunction) of all the functions f(vt)’s of immediate predecessor gates, vt’s, of the gate vj, whose ordinal numbers r(vt)’s are greater than the ordinal number r(vi) of the gate vi. If there is no such gate vt, the disjunction is 0.
b. Calculate the expression inside { } in Eq. 34.4 by using Table 34.2 (the definition of ⊙). In this table, GC(vj) is used as the first operand, and the disjunction calculated in Step a, i.e., \(\bigvee_{v \in IP(v_j),\, r(v) > r(v_i)} f(v)\), is used as the second operand.
c. Calculate the right-hand side of Eq. 34.4 using Table 34.3, with the value calculated in Step b as the first operand and f(vi) as the second operand.

TABLE 34.3 Definition of #

First operand, i.e.,              Second operand, i.e., f(vi)
GC(vj) ⊙ (∨ f(v))                   0     1     *
        0                           0     *     *
        1                           *     1     *
        *                           *     *     *

(The disjunction in the first operand is taken over v ∈ IP(vj) with r(v) > r(vi).)

For example, suppose gate vj in Fig. 34.4 has CSPF, GC(vj) = (010*), and input connections cij, cgj, and chj, from gates vi, vg, and vh, whose functions are f(vi) = (1001), f(vg) = (0011), and f(vh) = (0010). (Notice that f(va) = f(cab) always holds for any gate or input terminal, va, no matter whether va has more than one output connection.) Suppose we choose an order, r(vi) < r(vg) < r(vh). Then the first component of GC(cij) is {0 ⊙ (0 ∨ 0)}#1 because the first components of GC(vj), f(vg), f(vh), and f(vi), which appear in this order in Eq. 34.4, are 0, 0, 0, and 1, respectively. Using Tables 34.2 and 34.3, this becomes {0 ⊙ (0 ∨ 0)}#1 = {0 ⊙ 0}#1 = 1#1 = 1. Calculating other components similarly, we have GC(cij) = (10**). We have GC(cgj) = (*0**) because the fourth component of GC(cgj), for example, is {∗ ⊙ 0}#1 since the fourth components of GC(vj), f(vh), and f(vg) are *, 0, and 1, respectively. In this case, the value of f(vi) is not considered, unlike the calculation of the MSPF, GM(cgj), because f(vi) is not included in Eq. 34.4 due to the order, r(vi) < r(vg). Also, we have GC(chj) = (*01*) because the fourth component of GC(chj), for example, becomes {* ⊙ 0}#0 since the fourth components of GC(vj), the disjunction \(\bigvee_{v \in IP(v_j),\, r(v) > r(v_i)} f(v)\) of Eq. 34.4, and f(vh) are ∗, 0 (because there is no gate v such that r(v) > r(vh)), and 0, respectively. In this case, f(vi) and f(vg) are not considered.

FIGURE 34.4

An example of the calculation of MSPFs and CSPFs.

For comparison, let us also calculate MSPFs by Procedure 34.3. The MSPFs of connections cij, cgj, and chj can be easily calculated as GM(cij) = (10∗∗), GM(cgj) = (*0**), and GM(chj) = (*0**), respectively, as shown in Fig. 34.4. Comparing with the CSPFs, we can find GC(cij) = GM(cij) and GC(cgj) = GM(cgj). But GC(chj) is a subset of GM(chj) (denoted as GC(chj) ⊂ GM(chj)) because the third component of GC(chj) = (*01*) is 1, which is a subset of the third component, * (i.e., 0 or 1), of GM(chj) = (*0**), while other components are identical.

The CSPF and MSPF of a gate, connection, or input terminal can be identical. For the gate with GC(vj) = GM(vj) = (010*) shown in Fig. 34.5, for example, we have GC(cij) = GM(cij) = (10**), GC(cgj) = GM(cgj) = (*0**), and GC(chj) = GM(chj) = (*01*) with the ordering r(vi) < r(vg) < r(vh).
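The calculation of Eq. 34.4 for the gate of Fig. 34.4 can be sketched in Python, including its dependence on the processing order r. This is a minimal sketch under assumptions: the entries of Tables 34.2 and 34.3 coded in `odot` and `sharp` are reconstructed from the worked components in the text rather than copied from the tables, and the function names are hypothetical.

```python
def odot(g, d):
    """Assumed Table 34.2: g is a component of GC(vj), d the disjunction bit."""
    if g == "*":
        return "*"
    if g == "1":
        return "0"
    return "*" if d else "1"

def sharp(a, b):
    """Assumed Table 34.3: keep the value when the operands agree, else '*'."""
    return a if a == b else "*"

def cspf_connections(G_vj, inputs, order):
    """CSPFs of the input connections of one NOR gate by Eq. 34.4.

    inputs: predecessor name -> function as a 0/1 string.
    order:  predecessor names from the smallest to the largest ordinal r.
    """
    rank = {name: k for k, name in enumerate(order)}
    result = {}
    for vi, f_vi in inputs.items():
        # Only predecessors with a larger ordinal number enter the disjunction.
        later = [f for v, f in inputs.items() if rank[v] > rank[vi]]
        result[vi] = "".join(
            sharp(odot(g, any(f[w] == "1" for f in later)), f_vi[w])
            for w, g in enumerate(G_vj)
        )
    return result

inputs = {"vi": "1001", "vg": "0011", "vh": "0010"}
print(cspf_connections("010*", inputs, ["vi", "vg", "vh"]))
print(cspf_connections("010*", inputs, ["vh", "vg", "vi"]))
```

With the order r(vi) < r(vg) < r(vh) the sketch returns the values derived in the text. With the reversed order the required 1 in the third component moves from chj to cgj, which illustrates why a connection processed earlier tends to receive more ∗'s.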

FIGURE 34.5

An example of the calculation of MSPFs and CSPFs.

As can be seen in the third components of the GM’s in the example in Fig. 34.4, when gate vj in a network has more than one input connection whose function has 1 in its w-th component and we have GM(w)(vj) = 0, the w-th components of the MSPFs for all these input connections are ∗’s. The w-th components of the CSPFs, however, are *’s except for one input connection whose value is required to be 1, as seen in the third component of GC(chj) in the example in Fig. 34.4. Which input connection is required to be 1 depends upon order r. Intuitively, an input connection to the gate vj from an immediate predecessor gate that has a smaller ordinal number in order r tends to have more *’s in its CSPF and, consequently, is more likely to be removed.

3. The CSPF of a gate or input terminal, vi, can be calculated from GC(cij) as follows. If a gate or input terminal, vi, has only one output connection, cij, whose CSPF is GC(cij), then GC(vi), the CSPF of vi, is given by

\[
G_C(v_i) = G_C(c_{ij}) \tag{34.5}
\]

If vi has more than one output connection, then GC(vi) is not necessarily identical to GC(cij). In this case, the CSPF for any gate or input terminal, vi, in a network is given by the following:

\[
G_C(v_i) = \bigcap_{v_j \in IS(v_i)} G_C(c_{ij}) \tag{34.6}
\]

where the right-hand side of Eq. 34.6 is the intersection of GC(cij)’s of output connections, cij’s, of gate vi.

Unlike Eq. 34.3 for the case of MSPFs, Eq. 34.6 is simple and can be calculated in a short time.
4. Repeat Steps 2 and 3 until the calculation of CSPFs throughout the network is finished.

A gate or connection may have different CSPFs if the order r of processing is changed. On the other hand, each gate or connection has a unique MSPF, independent of the order of processing.
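The intersection in Eq. 34.6 is a component-wise operation on vectors over 0, 1, and ∗, which the following sketch makes concrete. The two vectors in the usage lines are hypothetical CSPFs of two output connections of one gate, chosen only for illustration, and the function names are not from the handbook.

```python
from functools import reduce

def intersect(a, b):
    """Component-wise intersection of two permissible-function sets.

    '*' stands for {0, 1}; returns None when some component has no common value.
    """
    if a is None or b is None:
        return None
    out = []
    for p, q in zip(a, b):
        if p == "*":
            out.append(q)
        elif q == "*" or p == q:
            out.append(p)
        else:
            return None          # one side requires 0, the other requires 1
    return "".join(out)

def gate_cspf(connection_cspfs):
    """Eq. 34.6: CSPF of a gate from the CSPFs of its output connections."""
    return reduce(intersect, connection_cspfs)

print(gate_cspf(["10**", "*01*"]))
print(intersect("10**", "0***"))
```

For CSPFs computed by Procedure 34.5 the intersection is never empty, because the function actually realized at the gate belongs to the CSPF of each of its output connections. The `None` branch matters for uses such as the merging test of two different gates, where an empty intersection is possible.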

Comparison of MSPF and CSPF

It is important to notice the difference in the ways of defining MSPF and CSPF. Any function f(vi) (or f(cij)) that belongs to the MSPF of a gate, connection, or input terminal, vi (or cij), can replace the original function realized at this vi (or cij) without changing any network output, keeping the functions at all other gates, connections, or input terminals intact. If functions at more than one gate, connection, and/or input terminal are simultaneously replaced by permissible functions in their respective MSPFs, some network outputs may change. In the case of CSPF, simultaneous replacement of the functions at any number of gates, connections, and input terminals by permissible functions in their respective CSPFs does not change any network output.

Example 34.2: This example illustrates that if functions realized at more than one gate, connection, or input terminal are simultaneously replaced by permissible functions in their respective MSPFs, some network outputs may change, whereas simultaneous replacement by permissible functions in their respective CSPFs does not change any network outputs. Let us consider the network in Fig. 34.6(a) where all the gates have the same MSPFs as those in Fig. 34.4. In Fig. 34.6(a), let us simultaneously replace functions (1001), (0011), and (0010) realized at the inputs of gate vj in Fig. 34.4 by (1000), (0001), and (1001), respectively, such that (1000) ∈ GM(cij) = (10**), (0001) ∈ GM(cgj) = (*0**), and (1001) ∈ GM(chj) = (∗0∗∗) hold. Then the output function of gate vj in Fig. 34.6(a) becomes (0110). But we have (0110) ∉ GM(vj) because the third component 1 is different from the third component 0 of GM(vj). So, (0110) is not a permissible function in the MSPF, GM(vj). But if we replace the function at only one input to gate vj by a permissible function of that input, the output function of vj is still a permissible function in the MSPF, GM(vj).
For example, if only (0011) at the second input of gate vj in Fig. 34.4 is replaced by (0001), the output function of vj becomes (0100), which is still a permissible function of GM(vj), as shown in Fig. 34.6(b).

FIGURE 34.6

MSPFs.


If we use CSPF, we can replace more than one function. For example, let us consider the network in Fig. 34.7, where gate vj has GC(vj), the same as GC(vj) in Fig. 34.4. The functions at the inputs of gate vj in Fig. 34.7 belong to CSPFs calculated in Fig. 34.4; in other words, (1000) ∈ GC(cij) = (10∗∗), (0001) ∈ GC(cgj) = (∗0∗∗), and (1010) ∈ GC(chj) = (∗01∗). Even if all functions (1001), (0011), and (0010) in Fig. 34.4 are simultaneously replaced by these functions, function (0100) realized at the output of gate vj is still a permissible function in GC(vj). 
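The contrast drawn in Example 34.2 can be checked mechanically. The sketch below recomputes the NOR outputs quoted in the text and then tries every combination of functions drawn from the three MSPFs and from the three CSPFs of Fig. 34.4. The helper names are illustrative, not part of the transduction programs.

```python
from itertools import product

def nor(*fs):
    """Component-wise NOR of 0/1 strings."""
    return "".join("0" if "1" in col else "1" for col in zip(*fs))

def member(f, G):
    """True if completely specified function f belongs to the set G."""
    return all(g in ("*", b) for b, g in zip(f, G))

def members(G):
    """All completely specified functions contained in G."""
    return ["".join(p)
            for p in product(*[("0", "1") if g == "*" else (g,) for g in G])]

def always_permissible(input_sets, G_out):
    """True if every simultaneous choice of inputs keeps the output in G_out."""
    return all(member(nor(*combo), G_out)
               for combo in product(*map(members, input_sets)))

G_vj = "010*"
mspfs = ["10**", "*0**", "*0**"]   # GM(cij), GM(cgj), GM(chj)
cspfs = ["10**", "*0**", "*01*"]   # GC(cij), GC(cgj), GC(chj)

print(nor("1000", "0001", "1001"), member(nor("1000", "0001", "1001"), G_vj))
print(nor("1001", "0001", "0010"), member(nor("1001", "0001", "0010"), G_vj))
print(nor("1000", "0001", "1010"), member(nor("1000", "0001", "1010"), G_vj))
print(always_permissible(mspfs, G_vj), always_permissible(cspfs, G_vj))
```

The exhaustive check succeeds for the CSPFs because the third component of GC(chj) is fixed to 1, so at least one input always supplies the 1 that the output value 0 requires there.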

FIGURE 34.7

CSPFs.

Procedures based on CSPFs have the following advantages and disadvantages:
1. For the calculation of CSPFs, we need not use Eq. 34.3, which is time-consuming.
2. Even if a redundant connection is removed, we need not recalculate CSPFs for the new network. In other words, CSPFs at different locations in the network are independent of one another, whereas MSPFs at these locations may not be. Thus, using CSPFs, we can simultaneously remove more than one connection, whereas using MSPFs, we need to recalculate MSPFs throughout the network whenever one connection is removed. If, however, we use CSPFs instead of MSPFs, we may not be able to remove some redundant connections by the pruning procedure because each CSPF is a subset of its respective MSPF and depends on processing order r. Because gates with smaller ordinal numbers in order r tend to have more ∗-components, the probabilities of removing these gates (or their output connections) are greater. Thus, if a gate is known to be irredundant, or hard to remove, we can assign a larger ordinal number in order r to this gate, which helps give ∗-components to the CSPFs of other gates.
3. Property 2 is useful for developing network transformation procedures based on CSPFs, which will be discussed later.
4. Because each CSPF is a subset of an MSPF, the network obtained by the use of CSPFs is not necessarily irredundant. But if we use MSPFs for pruning after repeated pruning based on CSPFs, then the final network will be irredundant.

For these reasons, there is a tradeoff between the processing time and the effectiveness of procedures.

Transformations

We can delete redundant connections and gates from a network by repeatedly applying the pruning procedure (in other words, by repeating only Step 2, without using Step 3, in Procedure 34.1, the outline of the transduction method). In this case, if MSPF is used, as described in Procedure 34.4, the network that results is irredundant. However, to have greater reduction capability, we developed several transformations of a network. By alternately repeating the pruning procedure (Step 2 in Procedure 34.1) and transformations (Step 3), we can reduce networks far more than by the use of only one of them. The following gate substitution procedure is one of these transformations.


Procedure 34.6: Gate substitution procedure
If there exist two gates (or input terminals), vi and vj, satisfying the following conditions, all the output connections of vj can be replaced by the output connections of vi without changing network outputs. Thus, vj is removable.
1. f(vi) ∈ G(vj), where G(vj) is a set of permissible functions of vj which can be an MSPF or a CSPF.
2. vi is not a successor of vj (no loop will be formed by this transformation).

This is illustrated in Fig. 34.8. The use of MSPFs may give a better chance for a removal than their subsets such as CSPFs, although the calculation of MSPFs is normally time-consuming.
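Condition 1 of Procedure 34.6 is a membership test, so candidate substitutions can be listed with a short search. The sketch below uses the functions and CSPFs quoted in the example that follows this procedure (the network of Fig. 34.9(a)). It is only an illustration: condition 2 (the loop check) needs the network connectivity, which is shown only in the figure, so it is deliberately left out here, and the names are hypothetical.

```python
def member(f, G):
    """True if completely specified function f belongs to the set G."""
    return all(g in ("*", b) for b, g in zip(f, G))

# Functions at input terminals v1..v3 and gates v4..v9.
f = {
    "v1": "00001111", "v2": "00110011", "v3": "01010101",
    "v4": "01110001", "v5": "00001010", "v6": "10001100",
    "v7": "11110000", "v8": "01110000", "v9": "10001000",
}
# CSPFs of the gates for the processing order used in the example.
GC = {
    "v4": "01110001", "v5": "*000**10", "v6": "100011*0",
    "v7": "**1***0*", "v8": "01**00**", "v9": "10******",
}

def substitution_candidates(f, G):
    """Pairs (vi, vj) with f(vi) in G(vj); the loop check is NOT done here."""
    return [(vi, vj) for vj in G for vi in f
            if vi != vj and member(f[vi], G[vj])]

for vi, vj in substitution_candidates(f, GC):
    print(f"f({vi}) is in GC({vj}): {vj} may be replaceable by {vi}")
```

Each reported pair still has to pass condition 2 before the substitution is applied. The pair (v8, v7) is the one used in the example.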

FIGURE 34.8

Gate substitution.

Example 34.3: Let us apply Procedure 34.6 to the network shown in Fig. 34.9(a), which realizes the function f = x̄1x2 ∨ x̄1x3 ∨ x2x3. Functions realized at input terminals and gates are as follows:

x1 = f(v1) = (0 0 0 0 1 1 1 1),
x2 = f(v2) = (0 0 1 1 0 0 1 1),
x3 = f(v3) = (0 1 0 1 0 1 0 1),
f(v4) = (0 1 1 1 0 0 0 1),
f(v5) = (0 0 0 0 1 0 1 0),
f(v6) = (1 0 0 0 1 1 0 0),
f(v7) = (1 1 1 1 0 0 0 0),
f(v8) = (0 1 1 1 0 0 0 0), and
f(v9) = (1 0 0 0 1 0 0 0)

Let us consider the following two different approaches.
1. Transformation by CSPFs: CSPFs for the gates are calculated as follows if, at each gate, the input in the lower position has higher processing order than the input in the upper position:

GC(v4) = (0 1 1 1 0 0 0 1),
GC(v5) = (* 0 0 0 * * 1 0),
GC(v6) = (1 0 0 0 1 1 * 0),
GC(v7) = (* * 1 * * * 0 *),
GC(v8) = (0 1 * * 0 0 * *), and
GC(v9) = (1 0 * * * * * *)


FIGURE 34.9

An example of gate substitution.

Because f(v8)∈ GC(v7), the network in Fig. 34.9(b) is obtained by substituting connection c86 for c75. Then, gate v7 is removed, yielding a simpler network. 2. Transformation by MSPFs: In this case, the calculation of MSPFs is very easy because each gate in Fig. 34.9(a) has only one output connection. MSPFs for gates are as follows:

GM(v4) = (0 1 1 1 0 0 0 1),
GM(v5) = (* 0 0 0 * * 1 0),
GM(v6) = (1 0 0 0 * 1 * 0),
GM(v7) = (* * 1 * * * 0 *),
GM(v8) = (0 1 * * * 0 * *), and
GM(v9) = (1 0 * * * * * *)

Here, GM(v7) = GC(v7), and we get the same result in Fig. 34.9(b). This result cannot be obtained by the gate merging to be discussed later.

This gate substitution can be further generalized. In other words, a gate, vj, can be substituted by more than one gate, instead of by only one gate vi in Procedure 34.6.

If we connect a new input to a gate or disconnect an existing input from a gate, the output of the gate may be modified. But if the new output is still contained in the set of permissible functions at this gate, the modification does not change the network outputs. By applying this procedure, we can change the network configuration and possibly can remove connections and/or gates. Even if we cannot reduce the network, a modification of the network is useful for further applications of other transformations. Such a modification is possible if the connectable condition (Procedure 34.7) or the disconnectable condition (Procedure 34.8) is satisfied.

Procedure 34.7: Connectable condition
Let G(vj) be a set of permissible functions for gate vj which can be an MSPF or a CSPF. We can add a connection from input terminal or gate, vi, to vj without changing network outputs, if the following conditions are satisfied:
1. f(w)(vi) = 0 for all w’s such that G(w)(vj) = 1.
2. vi is not a successor of vj (no loop will be formed by this transformation).

This is illustrated in Fig. 34.10. This transformation procedure based on the connectable condition can be extended into forms that can be used to change the configuration of a network. When we cannot apply any transformations to a given network, we may be able to apply those transformations after these extended transformations.
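Condition 1 of the connectable condition can be sketched as a component-wise test. The example reuses the gate of Fig. 34.4 (permissible set (010*), inputs (1001), (0011), and (0010)); the two candidate functions (1011) and (0100) are hypothetical and chosen only to show one accepted and one rejected case. Condition 2, the loop check, again depends on the connectivity and is not modeled.

```python
def nor(*fs):
    """Component-wise NOR of 0/1 strings."""
    return "".join("0" if "1" in col else "1" for col in zip(*fs))

def member(f, G):
    """True if completely specified function f belongs to the set G."""
    return all(g in ("*", b) for b, g in zip(f, G))

def connectable(f_vi, G_vj):
    """Condition 1 of Procedure 34.7: f(vi) is 0 wherever G(vj) requires 1."""
    return all(not (g == "1" and b == "1") for b, g in zip(f_vi, G_vj))

G_vj = "010*"
current_inputs = ["1001", "0011", "0010"]

for candidate in ("1011", "0100"):
    ok = connectable(candidate, G_vj)
    new_output = nor(*current_inputs, candidate)
    print(candidate, ok, new_output, member(new_output, G_vj))
```

Adding an input to a NOR gate can only turn output 1's into 0's, which is why it is enough to inspect the components where the permissible set requires a 1.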


FIGURE 34.10

Connectable condition.

Example 34.4: If we add two connections to the network in Fig. 34.11(a), as shown in bold lines in Fig. 34.11(b), then the output connection (shown in a dotted line) of gate v12 becomes disconnectable and v12 can be removed. Then we have the network shown in Fig. 34.11(c).

Procedure 34.8: Disconnectable condition
If we can find a set of inputs of gate vk such that the disjunction of the w-th component of the remaining inputs of vk is 1 for every w satisfying G(w)(vk) = 0, then this set of inputs can be deleted, as being redundant, without changing network outputs.

Procedures 34.7 and 34.8 will be collectively referred to as the connectable/disconnectable conditions (or procedures).

In the network in Fig. 34.12(a), x2 is connectable to gate v6, and x1 is connectable to gate v7. After adding these two connections, the outputs of v6 and v7 become identical, so v7 can be removed as shown in Fig. 34.12(b). This transformation is called gate merging. This can be generalized, based on the concept of permissible functions, as follows.

Procedure 34.9: Generalized gate merging
1. Find two gates, vi and vj, such that the intersection, GC(vi) ∩ GC(vj), of their CSPFs is not empty, as illustrated in Fig. 34.13.
2. Consider an imaginary gate, vij, whose CSPF is to be GC(vi) ∩ GC(vj).
3. Connect all the inputs of gates vi and vj to vij. If vij actually realizes a function in GC(vi) ∩ GC(vj), then vij can be regarded as a merged gate of vi and vj. Otherwise, vi and vj cannot be merged without changing network outputs in this generalized sense.
4. If vij can replace both vi and vj, then remove redundant inputs of vij.

Next, let us outline another transformation, called the error compensation procedure. In order to enhance the gate removal capability of the transduction method, the concept of permissible function is generalized to “a permissible function with errors.”5 Because the transformation procedures based on the error compensation are rather complicated, we outline the basic idea of these procedures along with an example, as follows:

1. Remove an appropriate gate or connection from the network.
2. Calculate errors in components of functions at gates or connections that are caused by the removal of the gate or connection, and then calculate permissible functions with errors throughout the network. These permissible functions with errors represent functions with erroneous components (i.e., components whose values are erroneous) as well as ordinary permissible functions that have no erroneous components.
3. Try to compensate for the errors by changing the connection configuration of the network. In order to handle the errors, the procedures based on ordinary permissible functions are modified.

FIGURE 34.11

An example of the connectable/disconnectable conditions.

FIGURE 34.12

An example of gate merging.

FIGURE 34.13

Generalized gate merging.

Example 34.5: Figure 34.14(a) shows a network whose output at gate v5 realizes

(1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1)


FIGURE 34.14

An example of error compensation.

In order to reduce the network, let us remove gate v8, having the network in Fig. 34.14(b) whose output at gate v5 is

(1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1)

Note that the outputs of the two networks differ only in the 6-th component. We want to compensate for the value of the erroneous component of the latter network by adding connections. Functions realized at the input terminals v1 through v4 and gates v5 through v12 in the original network in Fig. 34.14(a) are as follows:

x1 = f(v1) = (0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1)
x2 = f(v2) = (0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1)
x3 = f(v3) = (0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1)
x4 = f(v4) = (0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1)
f(v5) = (1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1)
f(v6) = (0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0)
f(v7) = (0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 0)
f(v8) = (0 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0)
f(v9) = (0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0)
f(v10) = (1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0)
f(v11) = (0 0 0 0 0 0 0 0 1 1 0 0 1 1 0 0)
f(v12) = (1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0)

The values of the 6-th component (i.e., the values corresponding to the input combination x1 = x3 = 0 and x2 = x4 = 1) are shown in Figs. 34.14(a) and (b). Components of vectors representing CSPFs can be calculated independently, so we calculate CSPFs for all components except the 6-th component (shown by “-”) of the network in Fig. 34.14(b), as follows:

GC(v5) = (1 0 0 0 0 – 0 0 0 0 1 0 1 1 0 1)
GC(v6) = (0 * * * 1 – 1 * * * 0 * 0 0 1 0)
GC(v7) = (0 1 * 1 * – * * 1 1 0 1 0 0 * 0)
GC(v9) = (0 * 1 * * – * 1 * * 0 * 0 0 * 0)
GC(v10) = (1 0 * 0 0 – 0 * 0 0 1 0 * * 0 *)
GC(v11) = (0 * * * 0 – 0 * 1 * 0 * 1 * 0 *)
GC(v12) = (1 * 0 * 1 – * 0 0 * * * 0 * * *)

If we can change the 6-th component of any of f(v6), f(v7), or f(v9) (i.e., immediate predecessors of gate v5) from 0 to 1, the error in the network output can be compensated, as can be seen in Fig. 34.14(b) where v8 is removed. The value 0 in the 6-th component of the output at gate v6, v7, or v9 is due to x4 = 1, x2 = 1, or f(v12) = 1, respectively. If we want to change the output of v9 from 0 to 1, the 6-th component of f(v12) must be 0. If we can change the output of v12 to any function in the set of permissible functions

H = (1 * 0 * 1 0 * 0 0 * * * 0 * * *)

that is, GC(v12) with the 6-th component specified to be 0, the error will be compensated. We can generate such a function by connecting x4 to gate v12 and consequently by changing the output of v12 into

(1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0)

which is contained in H. The network obtained is shown in Fig. 34.14(c). Thus, the network with 8 gates and 20 connections shown in Fig. 34.14(a) is reduced to the network with 7 gates and 18 connections in Fig. 34.14(c).

Let us describe the error compensation procedure illustrated by Example 34.5:
1. Remove a gate or a connection from a given network N, having a new network N′.
2. Calculate the erroneous components in the outputs of network N′.
3. Calculate the components of vectors representing MSPFs or CSPFs for the functions realized at the remaining gates and connections, corresponding to all error-free components of the outputs of N′.
4. Compensate for the errors by adding or removing connections.

This procedure can remove gates and connections in a more systematic manner than the other transformation procedures discussed so far.
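The bookkeeping of Example 34.5 can be retraced in a few lines. The sketch below works only with the vectors quoted in the example. It does not simulate the whole network of Fig. 34.14, whose connectivity is available only in the figure, and it uses the fact that adding an input x to a NOR gate changes its output f into f AND (NOT x). Variable and function names are illustrative.

```python
def member(f, G):
    """True if completely specified function f belongs to the set G."""
    return all(g in ("*", b) for b, g in zip(f, G))

def add_nor_input(f_gate, f_new):
    """Output of a NOR gate after one more input is connected to it."""
    return "".join("1" if a == "1" and b == "0" else "0"
                   for a, b in zip(f_gate, f_new))

x4 = "0101010101010101"
f5_original = "1000000000101101"     # output of v5 in Fig. 34.14(a)
f5_without_v8 = "1000010000101101"   # output of v5 after v8 is removed
f12 = "1100110000000000"
GC_v12 = "1*0*1-*00***0***"          # '-' marks the erroneous 6-th component

# Step 2: erroneous components (1-indexed, as in the text).
errors = [w + 1 for w, (a, b) in enumerate(zip(f5_original, f5_without_v8))
          if a != b]

# Target set for v12: GC(v12) with the 6-th component forced to 0.
H = GC_v12.replace("-", "0")

# Step 4: try connecting x4 to v12.
new_f12 = add_nor_input(f12, x4)

print(errors)
print(H)
print(new_f12, member(new_f12, H), member(f12, H))
```

The original f(v12) is outside H only because of its 6-th component, and the added input x4 clears exactly that component together with others that H leaves free.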

34.3 Various Transduction Methods

In addition to gate substitution, connectable/disconnectable conditions, generalized gate merging, and error compensation, outlined thus far, some known transformations can be generalized for efficient processing, using permissible functions. In the gate merging procedure, for example, a permissible function which is common to two gates, vi and vj, can be easily found. Without permissible functions, the transformations would be excessively time-consuming. We can have different transduction methods by combining different transformations and the pruning procedure. In other words, we can have different transduction methods based on different orders in processing gates, connections, and components of MSPFs or CSPFs.


These transduction methods can be realized within the structure of Fig. 34.2, which illustrates the basic structure of the transduction method outlined in Procedure 34.1. We can use these transduction methods in the following different manners:
1. An initial network can be designed by any conventional logic design method. Then we apply the transduction methods to such an initial network. The transduction methods applied to different initial networks usually lead to different final networks.
2. Instead of applying a transduction method only once, we can apply different transduction methods to an initial network in sequence. In each sequence, different or identical transduction methods can be applied in different orders. This usually leads to many different final networks.

Thus, if we want to explore the maximum potential of the transduction methods, we need to use them in many different ways, as explained in items 1 and 2 above.3,4

Computational Example of the Transduction Method

Let us show a computational example of the transduction method. Suppose the initial network, which realizes a four-variable function, is given as illustrated in Fig. 34.15(a), and this function has the minimal network shown in Fig. 34.15(b) (its minimality is proved by the integer programming logic design method). Beginning with the initial network of 12 gates shown in Fig. 34.15(a), the transduction method with error-compensation transformation (this version was called NETTRA-E3) produced the tree of solutions shown in Fig. 34.16. The size of the tree can be limited by the program parameter, NEPMAX.6 (In Fig. 34.16, NEPMAX was set to 2. If it is set to 8, we will have a tree of 81 networks.) The notation “a/b:c” in Fig. 34.16 means a network numbered a (numbered according to the order of generation), consisting of b gates and c connections, and a line connecting a larger network with a smaller one means that the smaller is derived, treating the larger as an initial network. In Fig. 34.16, it is important to notice that while some paths lead to terminal nodes representing minimal networks, others lead to terminal nodes representing networks not very close to the minimal. By comparing the numbers of gates and connections in the networks derived at the terminal nodes of this solution tree, a best solution can be found.

FIGURE 34.15

Initial and final networks for Fig. 34.16.

Intermediate solutions are logic networks with different connection configurations of negative gates, so some of them may be more appropriate for layout than others.
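Choosing a best terminal node from labels in the “a/b:c” notation can be sketched as below. The labels in the usage lines are hypothetical placeholders, not values read from Fig. 34.16, and ranking by gates first and connections second is an assumption; the text only says that both counts are compared.

```python
def parse_label(label):
    """Split 'a/b:c' into (network number, gates, connections)."""
    number, rest = label.split("/")
    gates, connections = rest.split(":")
    return int(number), int(gates), int(connections)

def best_network(labels):
    """Label with the fewest gates, ties broken by fewest connections (assumed rule)."""
    return min(labels, key=lambda s: parse_label(s)[1:])

# Hypothetical terminal nodes of a solution tree.
terminal_nodes = ["5/7:15", "9/6:13", "12/6:12"]
print(best_network(terminal_nodes))
```

A designer who cares more about layout than about gate count could substitute a different key function, in the spirit of the remark that some intermediate solutions may suit layout better than others.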

FIGURE 34.16

A tree of solutions generated by the transduction method based on error compensation.

34.4 Design of Logic Networks with Negative Gates by the Transduction Method

The transduction method has been described for the logic networks of NOR gates for the sake of simplicity, but it can be applied to logic networks of other types of gates, such as MOS logic gates and a mixture of AND gates and OR gates, by tailoring its basic concepts (i.e., permissible functions and transformations). In this sense, it is important to understand what features different types of logic gates, and consequently the corresponding transistor circuits, have in terms of logic operations and network transformations. In order to test the feasibility of design of logic networks with negative gates (MOS logic gates are negative gates) by the transduction method, a few synthesizers, called SYLON (an acronym for SYnthesis of LOgic Networks), were developed by modifying the transduction method.1,2,7,8,11,12 Some SYLON logic synthesizers consist of a mixture of technology-dependent optimization and technology-independent optimization. Here, let us outline SYLON-REDUCE,7 a logic synthesizer that performs totally technology-dependent optimization and is more algorithmic, wherein a logic network is processed in its target technology throughout the execution of REDUCE. REDUCE reduces an initial network, using permissible functions. In order to make each logic gate easily realizable as a MOS logic circuit, each logic gate throughout the execution of REDUCE is a negative gate that satisfies prespecified constraints on the maximum number of MOSFETs connected in series in each path and the maximum number of parallel paths. The reduction is done by repeatedly resynthesizing each negative gate. In other words, the outputs of some candidate gates or network inputs are connected to a gate under resynthesis and the connection configuration inside the gate is restructured, reducing the complexity of the gate and disconnecting unnecessary candidate gates or network inputs. The resynthesized gate is adopted if it has no more MOSFETs than the old gate and does not violate the constraints on the complexity (i.e., the specified maximum number of MOSFETs connected in series or the specified maximum number of parallel paths); otherwise, it is discarded, restoring the old gate. This resynthesis of each gate is repeated until no improvement can be made. Thus, the network transformation is done in a more subtle manner than in the original transduction method. The result is a network where each gate still satisfies the same constraints on the complexity and contains no more MOSFETs than the corresponding gate in the original network, while the connection configuration of the network may be changed.
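The acceptance rule for a resynthesized gate described above can be written as a small predicate. This is a hypothetical sketch: the `GateCost` record, its field names, and the limit values in the usage lines are invented for illustration and are not taken from SYLON-REDUCE.

```python
from dataclasses import dataclass

@dataclass
class GateCost:
    """Hypothetical summary of one negative gate's transistor structure."""
    mosfets: int          # total number of MOSFETs in the gate
    max_series: int       # longest chain of MOSFETs connected in series
    parallel_paths: int   # number of parallel paths

def adopt(new, old, series_limit, parallel_limit):
    """Keep the resynthesized gate only if it is no larger and within limits."""
    return (new.mosfets <= old.mosfets
            and new.max_series <= series_limit
            and new.parallel_paths <= parallel_limit)

old = GateCost(mosfets=6, max_series=3, parallel_paths=2)
smaller = GateCost(mosfets=5, max_series=3, parallel_paths=2)
too_deep = GateCost(mosfets=5, max_series=5, parallel_paths=1)

print(adopt(smaller, old, series_limit=4, parallel_limit=4))
print(adopt(too_deep, old, series_limit=4, parallel_limit=4))
```

When the predicate is false the old gate is restored, so the MOSFET count of each gate never increases from one resynthesis pass to the next.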


References
1. Chen, K.-C., “Logic Synthesis and Optimization Algorithms,” Ph.D. diss., Dept. of Comput. Sci., Univ. of Illinois, Urbana, 320, 1990.
2. Chen, K.-C. and S. Muroga, “SYLON-DREAM: A multi-level network synthesizer,” Proc. Int’l. Conf. on Computer-Aided Design, pp. 552-555, 1989.
3. Hu, K. C., “Programming manual for the NOR network transduction system,” UIUCDCS-R-77-887, Dept. Comp. Sci., Univ. of Illinois, Urbana, Aug. 1977.
4. Hu, K. C., and S. Muroga, “NOR(NAND) network transduction system (The principle of NETTRA system),” UIUCDCS-R-77-885, Dept. Comp. Sci., Univ. of Illinois, Urbana, Aug. 1977.
5. Kambayashi, Y., H. C. Lai, J. N. Culliney, and S. Muroga, “NOR network transduction based on error compensation (Principles of NOR network transduction programs NETTRA-E1, NETTRA-E2, NETTRA-E3),” UIUCDCS-R-75-737, Dept. of Comp. Sci., Univ. of Illinois, Urbana, June 1975.
6. Lai, H. C. and J. N. Culliney, “Program manual: NOR network transduction based on error compensation (Reference manual of NOR network transduction programs NETTRA-E1, NETTRA-E2, and NETTRA-E3),” UIUCDCS-R-75-732, Dept. Comp. Sci., Univ. of Illinois, Urbana, June 1975.
7. Limqueco, J. C. and S. Muroga, “SYLON-REDUCE: A MOS network optimization algorithm using permissible functions,” Proc. Int’l. Conf. on Computer Design, Cambridge, MA, pp. 282-285, Sept. 1990.
8. Limqueco, J. C., “Logic Optimization of MOS Networks,” Ph.D. thesis, Dept. of Comput. Sci., University of Illinois, Urbana, 250, 1992.
9. Muroga, S., “Computer-aided logic synthesis for VLSI chips,” Advances in Computers, vol. 32, Ed. by M. C. Yovits, Academic Press, pp. 1-103, 1991.
10. Muroga, S., Y. Kambayashi, H. C. Lai, and J. N. Culliney, “The transduction method --- Design of logic networks based on permissible functions,” IEEE TC, 38, 1404-1424, Oct. 1989.
11. Xiang, X. Q., “Multilevel Logic Network Synthesis System, SYLON-XTRANS, and Read-Only Memory Minimization Procedure, MINROM,” Ph.D. diss., Dept. of Comput. Sci., Univ. of Illinois, Urbana, 286, 1990.
12. Xiang, X. Q. and S. Muroga, “Synthesis of multilevel networks with simple gates,” Int’l. Workshop on Logic Synthesis, Microelectronic Center of North Carolina, Research Triangle Park, NC, May 1989.

Muroga, S. "Emitter-Coupled Logic" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


35 Emitter-Coupled Logic

Saburo Muroga
University of Illinois at Urbana-Champaign

35.1 Introduction
35.2 Standard ECL Logic Gates
Emitter-Dotting • Design of a Logic Network with Standard ECL Gates
35.3 Modification of Standard ECL Logic Gates with Wired Logic
Collector-Dotting
35.4 ECL Series-Gating Circuits

35.1 Introduction

ECL, which stands for emitter-coupled logic, is based on bipolar transistors and is currently the logic family with the highest speed and a high output power capability, although its power consumption is also the highest. ECL is more complicated to fabricate, occupies a larger chip area, and is more expensive than any other logic family. As the speed of CMOS improves, ECL is used less often, but it is still useful where high speed together with large output power is necessary, such as high-speed transmission over communication lines. ECL has three types of transistor circuits: standard ECL logic gates, their modification with wired logic, and ECL series-gating (Refs. 4, 6, 10, 13).

35.2 Standard ECL Logic Gates

A standard ECL logic gate is a stand-alone logic gate, and logic networks can be designed using many such gates as building blocks. Its basic circuit is shown in Fig. 35.1. ECL has a unique logic capability, as explained in the following.

The logic operation of the ECL gate shown in Fig. 35.1 is analyzed in Fig. 35.2, where the input z in Fig. 35.1 is eliminated for simplicity. The resistors connected to the bases of transistors Tx and Ty protect these transistors from possible damage due to heavy currents; they play no part in the logic operation and can be eliminated if there is no possibility of an excessively heavy current.

When input x has a high voltage representing logic value 1 and y has a low voltage representing logic value 0, transistor Tx becomes conductive, Ty becomes non-conductive, and a current flows through Tx and resistor R1, as illustrated in Fig. 35.2(a). In this case, the current through resistor Rp raises the voltage at the emitter of transistor Tr above –4 V, as shown. The base of Tr is then not sufficiently high relative to its emitter to make Tr conductive, so there is no current through Tr and resistor R2. Consequently, transistor Tf has a high voltage at its base, which makes Tf conductive, and output f has a high voltage representing logic value 1. On the other hand, transistor Tg is almost non-conductive (actually a small current flows, but let us ignore it for simplicity), since its base has a low voltage due to the voltage drop developed across resistor R1 by the current shown. Thus, output g has a low voltage representing logic value 0.

FIGURE 35.1 Basic circuit of standard ECL logic gate.

FIGURE 35.2 Logic operation of ECL gate in Fig. 35.1.

Even when y (instead of x), or both x and y, has a high voltage, the above situation does not change, except for the current through Ty.

Next, suppose that both inputs x and y have low voltages, as shown in Fig. 35.2(b). Then there is no current through resistor R1. Since the base of transistor Tr has a higher voltage than its emitter (0 V at the base and –0.8 V at the emitter), a current flows through R2 and Tr, as illustrated in Fig. 35.2(b). Thus, Tf has a low voltage at its base and becomes almost non-conductive (more precisely, less conductive). Output f consequently has a low voltage, representing logic value 0. Transistor Tg has a high voltage at its base and becomes conductive. Thus, output g has a high voltage, representing logic value 1.

Therefore, a current flows through only one of R1 and R2 at any time, switching quickly between these two paths. Notice that resistor Rp in Fig. 35.2, which is connected to a power supply of negative voltage, is essential for this current steering because the voltage at the top end of Rp determines whether Tr becomes conductive or not. The emitter followers (shown in the dot-lined rectangles in Fig. 35.1) can deliver heavy output currents because an output current flows only through either transistor Tf or Tg and the on-resistance of the transistor is low.

The above analysis of Fig. 35.2 leads to the truth table in Table 35.1. From this table, the network in Fig. 35.2 has two outputs: $f = x \lor y$ and $g = \overline{x \lor y}$.

TABLE 35.1 Truth Table for Fig. 35.2

Inputs    Outputs
x   y     f   g
0   0     0   1
0   1     1   0
1   0     1   0
1   1     1   0

In a similar manner, we can find that the ECL gate in Fig. 35.1 has two outputs: $f = x \lor y \lor z$ and $g = \overline{x \lor y \lor z}$. The gate is denoted by the symbol shown in Fig. 35.3. The simultaneous availability of OR and NOR as double-rail output logic, with few extra components, is the unique feature of the ECL gate and makes its logic capability powerful.
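The OR/NOR behavior derived above can be checked with a small logic-level model. The sketch below is only an illustration: the function names `ecl_gate` and `truth_table` are hypothetical, and the model ignores voltages and currents entirely, treating each input as 0 or 1.

```python
from itertools import product

def ecl_gate(*inputs):
    """Return (f, g) = (OR, NOR) of the inputs, each given as 0 or 1."""
    f = 1 if any(inputs) else 0   # OR output, taken from emitter follower Tf
    g = 1 - f                     # NOR output, taken from emitter follower Tg
    return f, g

def truth_table(n):
    """List (inputs, f, g) rows for an n-input gate in binary counting order."""
    return [(bits, *ecl_gate(*bits)) for bits in product((0, 1), repeat=n)]

# Print the two-input table in the column order x, y, f, g.
for bits, f, g in truth_table(2):
    print(*bits, f, g)
```

Calling `truth_table(3)` gives the corresponding rows for the three-input gate of Fig. 35.1.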

FIGURE 35.3 Symbol for the standard ECL logic gate of Fig. 35.1.

Emitter-Dotting

Suppose we have the emitter follower circuit shown in Fig. 35.4(a) (also shown in the dot-lined rectangles in Fig. 35.1) as part of an ECL logic gate. Its output function, f, at the emitter of the bipolar transistor is 1 when the voltage at the emitter is high (i.e., the bipolar transistor in (a) is conductive), and f is 0 when the voltage at the emitter is low (i.e., the bipolar transistor in (a) is non-conductive). Then, suppose there is another similar circuit whose output function at the emitter is g. If the emitters of these two circuits are tied together as shown in Fig. 35.4(b), the new output function at the tied point is $h = f \lor g$, replacing the original functions, f and g. This connection is called emitter-dotting, and it realizes Wired-OR. The tied point represents the new function $f \lor g$ because if both transistors, T1 and T2, are non-conductive, the voltage at the tied point is low; otherwise (i.e., if one of T1 and T2, or both, is conductive), the voltage at the tied point is high.

FIGURE 35.4 Emitter-dotting.

Wired-OR is an important feature of the ECL gate. The ECL gate in Fig. 35.1 has two emitter followers: one for f and the other for g. As shown in Fig. 35.5, the OR of the outputs of ECL gates can be realized without using an extra gate, simply by tying these outputs together. This is very convenient in logic design. Once an output is Wired-ORed with an output of another ECL gate, however, it no longer expresses its original function, so the original function is no longer available for connection to other gates or for a further Wired-OR. But if the same output is repeated by adding an emitter follower inside the gate, as shown in Fig. 35.6 (i.e., the emitter follower inside a dot-lined rectangle in Fig. 35.1), then the new output can be Wired-ORed with another gate output or connected without Wired-OR to the succeeding gates. In the ECL gate at the top position in Fig. 35.6, for example, the first output $f = x \lor y \lor z$ is connected to gates in the next level, while the same f in the second output is used to produce the output $u \lor v \lor x \lor y \lor z$ by Wired-ORing with the output of the second ECL gate.
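The effect of repeating an output before emitter-dotting can be sketched at the logic level. In the sketch below, all function names are hypothetical, and the second gate is assumed to have inputs u and v with its OR output used for the dotting, which is what the expression for the tied point suggests.

```python
from itertools import product

def or_output(*inputs):
    """OR output of a standard ECL gate (inputs are 0 or 1)."""
    return int(any(inputs))

def wired_or(*emitters):
    """Emitter-dotting: the tied point is high if any tied emitter is high."""
    return int(any(emitters))

def top_gate(x, y, z):
    """Gate with its OR output repeated on two separate emitter followers."""
    f = or_output(x, y, z)
    return f, f   # (copy sent to the next level, copy used for Wired-OR)

def check_all():
    for u, v, x, y, z in product((0, 1), repeat=5):
        f_direct, f_dotted = top_gate(x, y, z)
        tied = wired_or(f_dotted, or_output(u, v))
        if tied != int(u or v or x or y or z):   # tied point is the 5-input OR
            return False
        if f_direct != int(x or y or z):         # undotted copy is unchanged
            return False
    return True

print(check_all())
```

The check confirms that the undotted copy still expresses the original function while the dotted copy and the second gate's output together express the larger OR.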

FIGURE 35.5 Wired-OR of ECL gates.

Design of a Logic Network with Standard ECL Gates

An ECL gate network can be designed by starting with a network of NOR gates only, for the following reason. Consider the logic network of ECL logic gates shown in Fig. 35.7(a), where Wired-ORs are included. This network can be converted into the network without Wired-OR shown in (b) by connecting the connections in each Wired-OR directly to the inputs of a NOR gate, without changing the outputs at gates 4 and 5, possibly sacrificing the maximum fan-in restriction. Then, two NOR outputs of gate 2 in (a), for example, can be combined into one in (b). Next, this network can be converted into the network shown in (c) by eliminating the OR outputs of gates 1, 2, and 3 in (b) and connecting the inputs of these gates directly to gates 4 and 5. Thus, the network in (c), which expresses the same outputs as the network in (a) (i.e., the outputs of gates 4 and 5 in (c) are the same as those in (a)), consists of NOR gates only, possibly further sacrificing the maximum fan-in restriction at some gates. Notice that in this conversion, the number of gates either stays the same or decreases (if an ECL gate, like gate 1 in (c), has no outputs used, it can be deleted from (c)). Thus, from the given network of standard ECL gates with Wired-ORs, we can derive a NOR network with the same number of gates or fewer, possibly with greater fan-in at some gates, as shown in Fig. 35.7(d). Even if each gate has many NOR outputs or OR outputs, the situation does not change.

FIGURE 35.6 Multiple-output ECL gates for Wired-OR.
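Both conversion steps rest on the fact that an OR feeding a NOR gate can be absorbed into that NOR gate's inputs. The sketch below checks this exhaustively for small cases; the signal names are hypothetical and are not the labels of Fig. 35.7.

```python
from itertools import product

def nor(*inputs):
    """NOR gate on 0/1 inputs."""
    return int(not any(inputs))

def or_(*inputs):
    """OR of 0/1 signals (a Wired-OR or the OR output of an ECL gate)."""
    return int(any(inputs))

def conversion_identities_hold():
    for a, b, c, x, y, w in product((0, 1), repeat=6):
        # (a) -> (b): a Wired-OR of a and b feeding a NOR gate equals
        # feeding a and b to that NOR gate as separate inputs.
        if nor(or_(a, b), c) != nor(a, b, c):
            return False
        # (b) -> (c): the OR output of a gate with inputs x, y feeding a
        # NOR gate equals connecting x and y directly to that NOR gate.
        if nor(or_(x, y), w) != nor(x, y, w):
            return False
    return True

print(conversion_identities_hold())
```

Each absorption adds inputs to the receiving NOR gate, which is why the conversion may exceed the maximum fan-in restriction.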

FIGURE 35.7 Conversion of an ECL gate network into a NOR gate network.

When we want to design a minimal standard ECL gate network for given functions f and $\bar{f}$, it can be designed by reversing the preceding conversion, as follows.

Procedure 35.1: Design of logic networks with standard ECL logic gates

1. Design a network for a given logic function f and another network for its complement, $\bar{f}$, using NOR gates only, without considering the maximum fan-in or fan-out restriction at each gate. Use a minimum number of NOR gates in each case. (The map-factoring method described in Chapter 31 is usually convenient for manual design of logic networks with a minimum number of NOR gates in single- or double-rail input logic.)
2. Choose one of the two logic networks obtained. Reduce the number of input connections to each gate by providing Wired-ORs or by using OR outputs of other gates, if possible. In this case, extra NOR or OR outputs at each ECL gate must be provided whenever necessary (like the reverse conversion from Fig. 35.7(b) to Fig. 35.7(a), or from Fig. 35.7(c) to Fig. 35.7(b)). Thus, if any gate violates the maximum fan-in restriction, we can try to avoid it by using Wired-ORs or OR outputs.
3. This generally reduces the fan-out of gates also; but if any gate still violates the maximum fan-out restriction, try to avoid it by using extra ECL gates (no simple good methods are known for doing this). The output ECL gate of this network presents f and $\bar{f}$. If no gate violates the maximum fan-in and fan-out restrictions in Steps 2 and 3, the number of NOR gates in the original NOR network chosen in Step 2 is equal to the number of ECL gates in the resultant ECL network. So, if we originally have a network with a minimum number of NOR gates, the designed ECL network also has the minimum number of standard ECL logic gates. But extra ECL gates have to be added if some gates violate the maximum fan-in restriction, maximum fan-out restriction, or other constraints.
4. Repeat Steps 2 and 3 for the other network. Choose the better one.

Notice that the use of OR outputs and Wired-ORs generally reduces the number of connections or fan-ins (i.e., input transistors) and also reduces the total sum of connection lengths, thus saving chip area. For example, the total length of the connections for x and y in Fig. 35.7(c) can be almost twice the connection length between two gates in Fig. 35.7(b). Also, the total length of two connections in Fig. 35.7(b) can be almost twice the length for Wired-OR in Fig. 35.7(a). In Procedure 35.1, NOR networks with a minimum number of gates are important initial networks. It is known that when the number of gates is minimized, the number of connections in the networks also tends to be minimized (Ref. 11). (For the properties of wired logic, see Ref. 7.)

35.3 Modification of Standard ECL Logic Gates with Wired Logic

More complex logic functions than the output functions of the standard ECL logic gate shown in Fig. 35.1 can be realized by changing the internal structures of standard ECL logic gates. That is, if we connect points inside one ECL gate to some points of another ECL gate, we can realize a complex logic function with a simpler electronic circuit configuration. In other words, we realize logic functions by freely connecting transistors, resistors, and diodes, instead of regarding fixed connection configurations of transistors, resistors, and diodes as logic gates whose structure cannot be changed. This approach could be called transistor-level logic design. Wired logic is a powerful means for this, and collector-dotting and emitter-dotting are the basic techniques of wired logic.

Collector-Dotting

Collector-dotting is commonly used in bipolar transistor circuitry to realize the Wired-AND operation. Suppose we have the inverter circuit shown in Fig. 35.8(a) as part of an ECL logic gate. Its output function, f, at the collector of the bipolar transistor is 1 when the voltage at the collector is high, and f is 0 when the voltage at the collector is low. Then, suppose there is another similar circuit whose output function at the collector is g. If the collectors of these two circuits, instead of the emitters as in the emitter-dotting of Fig. 35.4(b), are tied together as shown in Fig. 35.8(b), the new output function at the tied point is $h = f \cdot g$, replacing the original functions, f and g. This connection is called collector-dotting, and it realizes Wired-AND. The tied point represents the new function $f \cdot g$ because if one of T1 and T2, or both, is conductive in Fig. 35.8(b), the voltage at the tied point is low; otherwise (i.e., only when both transistors, T1 and T2, are non-conductive), the voltage at the tied point can be high.

FIGURE 35.8 Collector-dotting.

In Fig. 35.9(a), Fig. 35.2 is repeated as gate 1 and gate 2. Transistor Tx has input x at its base, and its collector represents function $\bar{x}$ if Ty does not exist, because when its base has a low voltage (i.e., logic value 0), its collector has a high voltage (i.e., logic value 1), and vice versa. Similarly, the collector of transistor Ty represents $\bar{y}$ if Tx does not exist. Then, by tying together these collectors (i.e., collector-dotting), the tied point (i.e., point A) represents $\bar{x} \cdot \bar{y} = \overline{x \lor y}$, as already explained with respect to Fig. 35.2. Notice that the collector of Tx and the collector of Ty do not represent the original functions $\bar{x}$ and $\bar{y}$, respectively, after collector-dotting. Since the voltage level at B is always opposite to that at A, point B represents $x \lor y$.

We can use collector-dotting more freely. Point A in gate 1 in Fig. 35.9(a) can be connected to point A′ or B′ in gate 2. Point B can also be connected to point A′ or B′. Such connections realize Wired-AND, or collector-dotting. By collector-dotting points B and B′ as shown in Fig. 35.9(b), point B (also B′) represents the new function $(x \lor y) \cdot (z \lor w)$, which also appears at the emitter of transistor T. After this collector-dotting, points B and B′ no longer represent the original functions $x \lor y$ and $z \lor w$, respectively. Also, note that the function at any point that is not collector-dotted, such as A and A′, is unchanged by the collector-dotting of B and B′. In Fig. 35.9(b), two transistors, two diodes, and resistors (shown in the dotted line) are added for adjustment of voltage and current, but they have nothing to do with logic operations.

Another example is the parity function $x\bar{y} \lor \bar{x}y$ realized by connecting two ECL gates as shown in Fig. 35.10. The parity function requires four ECL gates if designed with standard ECL logic gates as shown in Fig. 35.10(c), but can be realized by the much simpler electronic circuit of Fig. 35.10(b). In other words, $x\bar{y}$ and $\bar{x}y$ are realized by Wired-AND, and then these two products are Wired-ORed in order to realize $x\bar{y} \lor \bar{x}y$. In Fig. 35.10, as well as in Fig. 35.9, some resistors or transistors may be necessary for electronic performance improvement (since resistors R1 and R2 draw too much current in gate 1 in Fig. 35.10(a), new resistors are added in Fig. 35.10(b) in order to clamp the currents), and unnecessary resistors or transistors may be deleted, although such an addition or elimination of resistors or transistors has nothing to do with logic operations.
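The collector-dotting of points B and B′ in Fig. 35.9(b) can be checked with a logic-level sketch. The function names below are hypothetical, gate 2 is assumed to have inputs z and w, and the added voltage-adjustment components are ignored because they do not affect the logic.

```python
from itertools import product

def ecl_points(a, b):
    """Return (A, B): A is the dotted collectors of the two input
    transistors (NOR), B is the collector of the reference transistor (OR)."""
    A = int(not (a or b))
    return A, 1 - A

def collector_dot(*collectors):
    """Wired-AND: the tied point is high only if every tied collector is high."""
    return int(all(collectors))

def dotted_B(x, y, z, w):
    """Function at the tied point after dotting B of gate 1 with B' of gate 2."""
    _, B = ecl_points(x, y)
    _, B_prime = ecl_points(z, w)
    return collector_dot(B, B_prime)

def check_all():
    """Compare the tied point with (x OR y) AND (z OR w) for all inputs."""
    return all(
        dotted_B(x, y, z, w) == int((x or y) and (z or w))
        for x, y, z, w in product((0, 1), repeat=4)
    )

print(check_all())
```

Dotting a different pair of points, such as A with B′, is modeled the same way by passing the corresponding values to `collector_dot`.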

FIGURE 35.9 Example of Wired-AND.

FIGURE 35.10 Parity function realized by ECL gate with Wired logic.


35.4 ECL Series-Gating Circuits

Figure 35.11(a) shows the basic pair of bipolar transistors in series-gating ECL, where A is connected to a power supply and B is connected to another power supply of negative voltage through a resistor. (Notice that this pair is Tx and Tr in Fig. 35.2, from which Ty is eliminated.) Transistor T1 has an input x connected to its base, and the other transistor T2 has a constant voltage vref at its base. As illustrated in Fig. 35.2, vref is grounded through a resistor (i.e., vref = 0), where this resistor protects transistor Tr from damage by a heavy current, and vref works as a reference voltage against changes of x. (The voltage at vref can be provided by a subcircuit consisting of resistors, diodes, and transistors, like the one in Fig. 35.9(b).) The collector of T1 represents the logic function $\bar{x}$ because the collector of T1 can have a high voltage (i.e., logic value 1) only when T1 is non-conductive, that is, when the input is a low voltage (x = 0).

FIGURE 35.11 Series-gating.

The collector of T2 represents function x because it has a high voltage only when the input x is a high voltage (i.e., when the input is a high voltage, T1 becomes conductive and T2 becomes non-conductive, because a current flows at any time through exactly one of the two transistors, T1 and T2; thus, the collector of T2 has a high voltage).

In Fig. 35.11(b), we have two pairs of transistors. In other words, we have the pair with input y, in addition to the pair with input x shown in (a). Then let us connect them in series without Rpy, R1, and the power supply for R1, as shown in (c). The voltage at the collector of T3 is low only when T3 and T1 are both conductive and, consequently, a current i flows through T3 and T1. The voltage at the collector of T3 is high when either T3 or T1 is non-conductive (i.e., x = 0 or y = 0) and consequently no current i flows through T3 and T1. Thus, the collector of T3 represents the function $\overline{xy}$, replacing the original function $\bar{y}$ shown in (b). This can be rewritten as $\overline{xy} = \bar{x} \lor \bar{y}$, so series-gating can be regarded as the OR operation on complemented inputs.

Many of the basic pairs of transistors shown in Fig. 35.11(a) are connected in a tree structure, as shown in Fig. 35.12, where inputs x, y, and z, as well as reference voltages vref-1, vref-2, and vref-3, need to be at appropriate voltage levels. Then the complements of all minterms can be realized at the collectors of transistors in the top level of the series connections. Two of these complemented minterms (each a disjunction of one literal of each of x, y, and z, such as $\bar{x} \lor \bar{y} \lor \bar{z}$ for minterm $xyz$) are shown with emitter followers, as examples, at the far right end of Fig. 35.12. Some of these collectors of transistors in the top level can be collector-dotted to realize the desired logic functions, as illustrated in Fig. 35.13. Notice that once collectors are collector-dotted, these collectors do not express their respective original functions.
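The series-gating tree can be sketched at the logic level as follows. The function names are hypothetical, and the model captures only which path carries the current, not the voltage levels that the inputs and reference voltages need.

```python
from itertools import product

def series_collector(*gate_inputs):
    """Collector of the top transistor of a series stack: low (0) only when
    every transistor in the stack conducts, i.e. when all inputs are 1."""
    return int(not all(gate_inputs))

def complemented_minterms(x, y, z):
    """Top-level collectors of a three-level tree, keyed by the input
    combination (a, b, c) whose minterm each collector complements."""
    out = {}
    for a, b, c in product((0, 1), repeat=3):
        # A branch conducts when each variable matches its branch value.
        out[(a, b, c)] = series_collector(x == a, y == b, z == c)
    return out

def check_all():
    for x, y in product((0, 1), repeat=2):
        # De Morgan: the complement of xy equals (not x) or (not y).
        if series_collector(x, y) != int((not x) or (not y)):
            return False
    for x, y, z in product((0, 1), repeat=3):
        cols = complemented_minterms(x, y, z)
        # Exactly one path carries current, so exactly one collector is low.
        if [k for k, v in cols.items() if v == 0] != [(x, y, z)]:
            return False
    return True

print(check_all())
```

The second check mirrors the property that, for any input combination, the current flows through exactly one path of the tree.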

FIGURE 35.12 ECL series-gating.

FIGURE 35.13 ECL series-gating.

The full adder in Fig. 35.14 is a more complex example of series-gating (Ref. 3). In this figure, we use collector-dotting by tying together some of the collectors to realize Wired-AND, as explained already. For example, the voltage at the collector of transistor T31 represents function $\overline{xyc}$ because of series-gating with T11, T21, and T31. Usually, at most three transistors are connected in series (the two transistors in the bottom level, T01 and T02, in Fig. 35.14 are for controlling the amount of current as part of the power supply). This is because too many transistors in series tend to slow down the gate due to parasitic capacitances to ground. Then, by collector-dotting some collectors of the transistors in the top level, sum s and carry c∗ are realized, as well as their complements, $\bar{s}$ and $\overline{c^*}$. Baugh and Wooley have designed a full adder in double-rail logic (Ref. 2). Ueda designed a full adder with ECL gates in single-rail logic with fewer transistors (Ref. 16). The implementation of Wired-AND in this manner requires careful consideration of readjustments of voltages and currents. (Thus, transistors or resistors may be added or changed in order to improve electronic performance, but this is not directly related to logic operations.)

FIGURE 35.14 ECL full adder with series-gating. (From Ref. 3.)

ECL series-gating can be extended as follows. Unlike the series-gating in Figs. 35.12, 35.13, and 35.14, the same input variables are not necessarily used in each level. For example, in the top level in Fig. 35.15, y and z are connected to the bases of transistors, instead of all y's. Then, collectors can be collector-dotted, although collector-dotting is not done in this figure. Complements of products, such as $\overline{xy}$ and $\overline{xz}$, can be realized at collectors in the top level by the series-gating, as shown in Fig. 35.15. By this free connection of input variables, functions can generally be realized with fewer transistors.
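The way collector-dotting of complemented minterms yields the sum and carry of a full adder can be sketched at the logic level. This is an illustration of the principle only: the function names are hypothetical, and the choice of which collectors to dot is an assumption that is not taken from Fig. 35.14.

```python
from itertools import product

def complemented_minterm(m, inputs):
    """Top-level collector for minterm m: low only when inputs equal m."""
    return int(inputs != m)

def collector_dot(*collectors):
    """Wired-AND of the tied collectors."""
    return int(all(collectors))

def realize(target, inputs):
    """Realize target(x, y, c) by collector-dotting the complemented
    minterms of every input combination for which target is 0."""
    off_set = [m for m in product((0, 1), repeat=3) if not target(*m)]
    return collector_dot(*(complemented_minterm(m, inputs) for m in off_set))

def sum_bit(x, y, c):
    """Sum output of a full adder."""
    return x ^ y ^ c

def carry_bit(x, y, c):
    """Carry output of a full adder (majority of the three inputs)."""
    return int(x + y + c >= 2)

def check_all():
    for inputs in product((0, 1), repeat=3):
        if realize(sum_bit, inputs) != sum_bit(*inputs):
            return False
        if realize(carry_bit, inputs) != carry_bit(*inputs):
            return False
    return True

print(check_all())
```

Dotting the complemented minterms of the inputs for which the target is 1 instead gives the complement of the target, which is how the complemented sum and carry can be obtained from the same tree.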

FIGURE 35.15 Series-gating.

CMOS has very low power consumption at low frequency but may consume more power than ECL at high speed (i.e., at high frequency). This is because the power consumption of CMOS is proportional to $CFV^2$, where C is the parasitic capacitance, F is the switching frequency, and V is the power supply voltage. Thus, at high frequency, the power consumption of CMOS exceeds that of ECL, which is almost constant. It is important to note that, compared with the standard ECL logic gate illustrated in Fig. 35.1, series-gating ECL is faster and has lower power consumption, for the following reasons:

• Because bipolar transistors have been sped up (by reduction of base thickness, among other things), the delay time over connections among standard ECL logic gates is greater than the delay time inside logic gates. Series-gating ECL can realize far more complex logic functions than the NOR or OR realized by a standard ECL gate and consequently eliminates the long connections required among standard ECL logic gates, so it can have shorter delay.
• A series-gating ECL circuit has lower power consumption than a logic network with standard ECL logic gates because the power supplies to all standard ECL logic gates are combined into one for the series-gating ECL circuit and a current flows in only one path at any time (Refs. 1, 5, 9, 12).
• In recent years, the power consumption of series-gating ECL has been reduced with improved circuits (Ref. 8).
• The power consumption of series-gating ECL can also be reduced by active pull-down of some emitters (Refs. 14, 15).
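The frequency at which CMOS power overtakes a constant ECL power follows directly from the $CFV^2$ expression. The sketch below uses hypothetical component values chosen only for illustration; the function names are also hypothetical, and the ECL power is treated as exactly constant.

```python
def cmos_dynamic_power(c_farads, f_hertz, v_volts):
    """Dynamic power of a CMOS gate, P = C * F * V**2, in watts."""
    return c_farads * f_hertz * v_volts ** 2

def crossover_frequency(p_ecl_watts, c_farads, v_volts):
    """Frequency above which C*F*V**2 exceeds a constant ECL gate power."""
    return p_ecl_watts / (c_farads * v_volts ** 2)

# Hypothetical values chosen only for illustration.
C = 100e-15      # 100 fF of parasitic capacitance
V = 3.3          # supply voltage in volts
P_ECL = 1e-3     # 1 mW per ECL gate, treated as independent of frequency

f_cross = crossover_frequency(P_ECL, C, V)
print(f"crossover near {f_cross / 1e6:.0f} MHz")
```

Below `f_cross` the CMOS gate in this model consumes less than the ECL gate, and above it more; real crossover points depend on the actual capacitances, voltages, and ECL circuit.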

References

1. Abe, S., Y. Watanabe, M. Watanabe, and A. Yamaoka, “M parallel series computer for the changing market,” Hitachi Review, vol. 45, no. 5, pp. 249-254, 1996.
2. Baugh, C. R. and B. A. Wooley, “One bit full adder,” U.S. Patent 3,978,329, August 31, 1976.
3. Garret, L. S., “Integrated-circuit digital logic families III — ECL and MOS devices,” IEEE Spectrum, pp. 30-42, Dec. 1970.
4. Gopalan, K. G., Introduction to Digital Microelectronic Circuits, McGraw-Hill, 1996.
5. Higeta, K. et al., “A soft-error-immune 0.9-ns 1.15-Mb ECL-CMOS SRAM with 30-ps 120 k logic gates and on-chip test circuitry,” IEEE Jour. of Solid-State Circuits, vol. 31, no. 10, pp. 1443-1450, Oct. 1996.
6. Jager, R. C., Microelectronics Circuit Design, McGraw-Hill, 1997.
7. Kambayashi, Y. and S. Muroga, “Properties of wired logic,” IEEE TC, vol. C-35, pp. 550-563, 1986.
8. Kuroda, T., et al., “Capacitor-free level-sensitive active pull-down ECL circuit with self-adjusting driving capability,” Symp. VLSI Circuits, pp. 29-30, 1993.
9. Mair, C. A., et al., “A 533-MHz BiCMOS superscalar RISC microprocessor,” IEEE JSSC, pp. 1625-1634, Nov. 1997.
10. Muroga, S., VLSI System Design, John Wiley and Sons, 1982.
11. Muroga, S. and H.-C. Lai, “Minimization of logic networks under a generalized cost function,” IEEE TC, pp. 893-907, Sept. 1976.
12. Nambu, H., et al., “A 0.65-ns, 72-kb ECL-CMOS RAM macro for a 1-Mb SRAM,” IEEE Jour. of Solid-State Circuits, vol. 30, no. 4, pp. 491-499, April 1995.
13. Sedra, A. S. and K. C. Smith, Microelectronic Circuits, 4th ed., Oxford University Press, 1998.
14. Shin, H. J., “Self-biased feedback-controlled pull-down emitter follower for high-speed low-power bipolar logic circuits,” Symp. VLSI Circuits, pp. 27-28, 1993.
15. Toh, K.-Y. et al., “A 23-ps/2.1-mW ECL gate with an AC-coupled active pull-down emitter-follower stage,” Jour. SSC, pp. 1301-1306, Oct. 1989.
16. Ueda, T., Japanese Patent Sho 51-22779, 1976.


Muroga, S. "CMOS" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


36 CMOS

Saburo Muroga
University of Illinois at Urbana-Champaign

36.1 CMOS (Complementary MOS)
Output Logic Function of a CMOS Logic Gate • Problem of Transfer Curve Shift
36.2 Logic Design of CMOS Networks
36.3 Logic Design in Differential CMOS Logic
36.4 Layout of CMOS
36.5 Pseudo-nMOS
36.6 Dynamic CMOS
Domino CMOS • Dynamic CVSL • Problems of Dynamic CMOS

36.1 CMOS (Complementary MOS)

A CMOS logic gate consists of a pair of subcircuits, one consisting of nMOSFETs and the other of pMOSFETs, where all MOSFETs are of the enhancement mode described in Chapter 30, Section 30.3. CMOS, which stands for complementary MOS (Refs. 6, 8–10), means that the nMOS and pMOS subcircuits are complementary.

As a simple example, let us explain CMOS with the inverter shown in Fig. 36.1. A p-channel MOSFET is connected between the power supply of positive voltage Vdd and the output terminal, and an n-channel MOSFET is connected between the output terminal and the negative side, Vss, of the power supply, which is usually grounded. When input x is a high voltage, pMOS becomes non-conductive and nMOS becomes conductive. When x is a low voltage, pMOS becomes conductive and nMOS becomes non-conductive. This is the property of pMOS and nMOS when the voltages of the input and the power supply are properly chosen, as explained with Fig. 30.19. In other words, when either pMOS or nMOS is conductive, the other is non-conductive. When x is a low voltage (logic value 0), pMOS is conductive and nMOS is non-conductive, and the output voltage is a high voltage (logic value 1), which is close to Vdd. When x is a high voltage, nMOS is conductive and pMOS is non-conductive, and the output voltage is a low voltage. Thus, the CMOS logic gate in Fig. 36.1 works as an inverter. The pMOS subcircuit in this figure essentially works as a variable load.

FIGURE 36.1 CMOS inverter.

When x stays at either 0 or 1, one of the pMOS and nMOS subcircuits in Fig. 36.1 is always non-conductive, and consequently no current flows from Vdd to Vss through these MOSFETs. In other words, when no input changes, the power consumption is simply the product of the power supply voltage V (if Vss is grounded, V is equal to Vdd) and the very small current of a non-conductive MOSFET. (Ideally, there should be no current flowing through a non-conductive MOSFET, but actually a very small current, less than 10 nA, flows. Such an undesired, very small current is called a leakage current.) This is called the quiescent power consumption. Since the leakage current is typically a few nanoamperes, the quiescent power consumption of CMOS is less than tens of nW, which is very small compared with that of other logic families.

Whenever the input x of this CMOS logic gate changes to a low voltage (i.e., logic value 0), the parasitic capacitance C at the output terminal (including parasitic capacitances at the inputs of the succeeding CMOS logic gates, to which the output of this logic gate is connected) must be charged up to a high voltage through the conductive pMOS. (The current can be as large as 0.3 mA or more.) Then, when the input x changes to a high voltage (i.e., logic value 1) at the next input change, the electric charge stored in the parasitic capacitance must be discharged through the conductive nMOS. Therefore, much larger power consumption than the quiescent power consumption occurs whenever the input changes. This dynamic power consumption due to the current during the transition period is given by $CFV^2$, where C is the parasitic capacitance, V is the power supply voltage, and F is the switching frequency of the input. Thus, the power consumption of CMOS is a function of frequency. CMOS consumes very little power at low frequency, but it consumes more than ECL as the frequency increases. As the integration size increases, CMOS is being used almost exclusively in VLSI because of its low power consumption. But even CMOS has difficulty in dissipating the heat generated in the chip when the switching frequency increases. In order to alleviate this difficulty, variants such as dynamic CMOS, which will be described later, have been used.
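The expression $CFV^2$ can be recovered from a simple energy count. The derivation below assumes an idealized gate: the output swings fully between 0 and V, the output completes F charge-and-discharge cycles per second, and leakage as well as any current flowing directly from Vdd to Vss during switching are ignored.

```latex
% Charge moved onto C while the output rises from 0 to V
Q = C V
% Energy drawn from the supply at constant voltage V during charging
E_{\text{supply}} = \int V \, i(t) \, dt = V Q = C V^2
% Half is stored on C, half is dissipated in the conductive pMOS
E_{\text{stored}} = \tfrac{1}{2} C V^2, \qquad E_{\text{pMOS}} = \tfrac{1}{2} C V^2
% The stored half is dissipated in the conductive nMOS during discharge
E_{\text{nMOS}} = \tfrac{1}{2} C V^2
% Total per cycle, and average power at F cycles per second
E_{\text{cycle}} = C V^2, \qquad P = F \, E_{\text{cycle}} = C F V^2
```

As F approaches zero, this term vanishes and only the quiescent power consumption remains.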

Output Logic Function of a CMOS Logic Gate

Let us consider a CMOS logic gate in which many MOSFETs of the enhancement mode are connected in each of the pMOS and nMOS subcircuits (e.g., the CMOS logic gate in Fig. 36.2). By regarding the pMOS subcircuit as a variable load, the output function f can be calculated in the same manner as that of an nMOS logic gate:

1. Calculate the transmission between the output terminal and Vss (or the ground), considering nMOSFETs as make-contacts of relays (transmission and relays are described in Chapter 30).
2. Then, its complement is the output function of this CMOS logic gate.

FIGURE 36.2 CMOS NOR gate.

Thus, the CMOS logic gate in Fig. 36.2, for example, has the output function $f = \overline{x \lor y}$. We can prove that if the pMOS subcircuit of any CMOS logic gate has a transmission between Vdd and the output terminal, calculated by regarding each pMOS as a make-contact relay, that is the dual of the transmission of the nMOS subcircuit, then one of the pMOS and nMOS subcircuits is always non-conductive, with the other conductive, for any combination of input values. (Note that regarding each pMOS as a make-contact relay, as we do each nMOS, is a means of finding the relationship between the connection configurations of the nMOS subcircuit and the pMOS subcircuit.) In the CMOS logic gate of Fig. 36.2, the pMOS subcircuit has transmission $g^d = xy$, which is dual to the transmission $g = x \lor y$ of the nMOS subcircuit, where the superscript d on g means “dual.” Thus, any CMOS logic gate has the unique features of unusually low quiescent power consumption and dynamic power consumption $CV^2F$.

The input resistance of a CMOS logic gate is extremely high, at least $10^{14}$ Ω. This permits large fan-outs from a CMOS logic gate. Thus, if inputs do not change, CMOS has almost no maximum fan-out restriction. The practical maximum fan-out is 30 or more, which is very large compared with other logic families. If there are too many fan-out connections from a CMOS logic gate, the waveform of a signal becomes distorted. Also, fan-out increases the parasitic capacitance and consequently reduces the speed, so fan-out is limited to a few when high speed is required.

In addition to extremely low power consumption, CMOS has the unique feature that CMOS logic networks work reliably even if the power supply voltage fluctuates, the temperature changes over a wide range, or there is plenty of noise interference. This makes CMOS appropriate for use in rugged environments, such as in automobiles, factories, and weapons.
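The rule for the output function and the duality condition stated earlier in this subsection can be checked for the NOR gate of Fig. 36.2 with a small logic-level sketch. The function names are hypothetical; each MOSFET is modeled as an ideal switch, with an nMOSFET conducting on input 1 and a pMOSFET conducting on input 0.

```python
from itertools import product

def nmos_transmission(x, y):
    """Transmission g of the nMOS subcircuit of the NOR gate: x OR y
    (two nMOSFETs in parallel, each conducting when its input is 1)."""
    return int(x or y)

def dual_transmission(x, y):
    """Dual g^d of g: x AND y, the series connection used for the pMOS side."""
    return int(x and y)

def pmos_conducts(x, y):
    """A pMOSFET conducts when its input is 0, so the pMOS subcircuit
    conducts when g^d evaluated on complemented inputs is 1."""
    return dual_transmission(1 - x, 1 - y)

def output(x, y):
    """Output function: the complement of the nMOS transmission."""
    return 1 - nmos_transmission(x, y)

def check_all():
    for x, y in product((0, 1), repeat=2):
        n_on, p_on = nmos_transmission(x, y), pmos_conducts(x, y)
        if n_on + p_on != 1:          # exactly one subcircuit conducts
            return False
        if output(x, y) != p_on:      # output is high when pMOS pulls up
            return False
    return True

print(check_all())
```

Because exactly one subcircuit conducts for every input combination, there is no steady path from Vdd to Vss, which is the source of the low quiescent power consumption.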

Problem of Transfer Curve Shift

Unfortunately, when a CMOS logic gate has many inputs, its transfer curve (which shows the relationship between the input voltage and the output voltage of a CMOS logic gate) shifts, depending on how many of the inputs change values. For example, in the two-input NAND gate shown in Fig. 36.3(a), the transfer curve for the simultaneous change of the two inputs (1 and 2) is different from that for the change of only input 1, with input 2 kept at a high voltage. This is different from an nMOS logic gate (or a pMOS logic gate) discussed in the previous sections, where every driver MOSFET in its conductive state must have a much lower resistance than the load MOSFET in order to have a sufficiently large voltage swing. If only input 1 in Fig. 36.3(a), for example, changes, the resistance of the pMOS subcircuit is twice as large as that for the simultaneous change of the two inputs 1 and 2, so parasitic capacitance C is charged in a shorter time in the latter case. Other examples are shown in (b) and (c) in Fig. 36.3. Because of this problem of transfer curve shift, the number of inputs to a CMOS logic gate is practically limited to four if we want to maintain good noise immunity. If we need not worry about noise immunity, the number of inputs to a CMOS logic gate can be greater.
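The factor of two in the pull-up resistance can be illustrated with a first-order model. The sketch below assumes identical pMOSFETs connected in parallel, each replaced by a fixed on-resistance, and uses hypothetical component values; real devices are nonlinear, so the numbers only indicate the trend.

```python
def parallel(*resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)

def pull_up_resistance(r_on, inputs_switching_low):
    """Pull-up resistance of a NAND gate whose pMOSFETs are in parallel:
    only the pMOSFETs whose inputs go low conduct."""
    return parallel(*[r_on] * inputs_switching_low)

# Hypothetical values chosen only for illustration.
R_ON = 10e3    # on-resistance of one pMOSFET, in ohms
C = 50e-15     # parasitic capacitance at the output, in farads

r_one = pull_up_resistance(R_ON, 1)    # only input 1 changes
r_both = pull_up_resistance(R_ON, 2)   # inputs 1 and 2 change together
print(r_one / r_both)                  # ratio of the two pull-up resistances
print(r_one * C, r_both * C)           # RC time constants, in seconds
```

In this model the RC time constant for the simultaneous change is half that for the single-input change, consistent with the shorter charging time described above.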

36.2 Logic Design of CMOS Networks

The logic design of CMOS networks can be done in the same manner as that of nMOS logic networks, because the nMOS subcircuit in each CMOS logic gate, with the pMOS subcircuit regarded as a variable load, essentially performs the logic operation, as seen from Fig. 36.2. The design procedures discussed for nMOS networks in Chapter 30, Section 30.3 can be used more effectively than in the case of nMOS networks, because more than four MOSFETs can be in series inside each logic gate, unless we are concerned about the transfer-curve shift problem or high-speed operation. Also, an appropriate use of transmission gates, discussed in the next paragraph, often simplifies networks.

FIGURE 36.3 CMOS transfer curves for different numbers of inputs.

The transmission gate shown in Fig. 36.4 is a counterpart of the pass transistor (i.e., transfer gate) of nMOS, and is often used in CMOS network design. It consists of a pair of p-channel and n-channel MOSFETs. The control voltage d is applied to the gate of the n-channel MOSFET, and its complement $\overline{d}$ is applied to the gate of the p-channel MOSFET. If d is a high voltage, both MOSFETs become conductive, and the input is connected to the output. (Unlike a pass transistor with nMOS whose output voltage is somewhat lower than the input voltage, the output voltage of the transmission gate in CMOS is the same as the input voltage after the transition period.) If d is a low voltage, both MOSFETs become non-conductive, and the output is disconnected from the input, keeping the output voltage at its parasitic capacitance as it was before the disconnection (although this voltage gradually becomes low because of current leakage). Since the input and output are interchangeable, the transmission gate is bidirectional. A D-type flip-flop is shown in Fig. 36.5, as an example of CMOS circuits designed with transmission gates.

FIGURE 36.4 Transmission gate.

FIGURE 36.5 D-type flip-flop (c is a clock).

A full adder in CMOS with transmission gates is shown in Chapter 38. Also, pass transistors realized in nMOS can be mixed with CMOS logic gates to reduce area or power consumption, as will be discussed in Chapter 37.

36.3 Logic Design in Differential CMOS Logic

Differential CMOS logic is a logic gate that works very differently from the CMOS logic gates discussed so far. It has two outputs, f and its complement $\overline{f}$, and works like a flip-flop such that when one output is a high voltage, it always makes the other output have a low voltage. A logic gate in cascode voltage switch logic, which is abbreviated as CVSL, is illustrated in Fig. 36.6. CVSL is sometimes called differential logic because a CVSL gate is a CMOS logic gate that realizes both an output function, f, and its complement, $\overline{f}$, switching their values quickly.

The CVSL gate shown in Fig. 36.6(a) has a driver that is a tree consisting of nMOSFETs, where each pair of nMOSFETs in one level has input $x_i$ and its complement $\overline{x_i}$. The top end of each path in the tree expresses the complement of a minterm (just like series-gating ECL described in Chapter 35). Then by connecting some of these top ends to both the gate of one pMOSFET (i.e., P1) and the drain of the other pMOSFET (i.e., P2), we can realize the complement of a sum-of-products. The connection of the remaining top ends to both the gate of P2 and the drain of P1 realizes its complement. The outputs in Fig. 36.6(a) realize

f = x1 x2 x3 ∨ x1 x2 x3 ∨ x1 x2 x3

and

$\overline{f}$ = x1 x2 x3 ∨ x1 x2 x3 ∨ x1 x2 x3 ∨ x1 x2 x3 ∨ x1 x2 x3

FIGURE 36.6 (a) Static CVSL; (b) Static CVSL for the same functions as in (a).

When one path in the tree is conductive, the output terminal at its top end has a low voltage, and the pMOSFET whose gate is connected to that terminal becomes conductive. Then, the other output has a high voltage. The driver tree in Fig. 36.6(a), which resembles series-gating ECL, can be simplified as shown in (b). (The tree structure in CVSL in (b) can be obtained from an ordered reduced binary decision diagram described in Chapters 23 and 26.) Notice that P1 and P2 in (a) are cross-connected for fast switching.

CVSL can be realized without a tree structure. Figure 36.7 shows such a CVSL gate, for f = xy and its complement. The connection configuration of nMOSFETs connected to one output terminal is dual to that connected to the other output terminal (in Fig. 36.7, the nMOSFETs for the output terminal for $\overline{f}$ are connected in series, while the nMOSFETs for f are connected in parallel).

FIGURE 36.7 Static CVSL.

CVSL has a variant called dynamic CVSL, to be described later. To distinguish it from this variant, the CVSL described here is usually called static CVSL.
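The dual pull-down configuration of the static CVSL gate for f = xy can be illustrated with a small switch-level model. This is only a sketch under stated assumptions: the series chain is assumed to be driven by x and y and the parallel pair by their complements (the figure is not reproduced here), and the cross-connected pMOSFETs are modeled simply as pulling up whichever output is not pulled down.

```python
from itertools import product


def series(*switches):
    # A series chain conducts only if every nMOSFET conducts.
    return all(switches)


def parallel(*switches):
    # A parallel group conducts if any nMOSFET conducts.
    return any(switches)


def cvsl_and(x, y):
    """Switch-level model of a static CVSL gate for f = xy.

    Returns (f, f_bar). The wiring of inputs to the two nMOS
    networks is an assumption made for this illustration.
    """
    pull_down_f_bar = series(x, y)           # series nMOSFETs on the complement output
    pull_down_f = parallel(not x, not y)     # parallel nMOSFETs on the f output
    # The cross-connected pMOSFETs pull up the output that is not pulled down.
    f = not pull_down_f
    f_bar = not pull_down_f_bar
    return f, f_bar


if __name__ == "__main__":
    for x, y in product([False, True], repeat=2):
        f, f_bar = cvsl_and(x, y)
        print(int(x), int(y), int(f), int(f_bar))
```

Because the two nMOS networks are duals of each other, exactly one of them conducts for every input combination, so the two outputs are always complementary.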

36.4 Layout of CMOS

Layout of CMOS logic networks is more complex than that of logic networks where only nMOSFETs or pMOSFETs are used, because in CMOS networks, nMOSFETs and pMOSFETs need to be fabricated with different materials. We often need to consider an appropriate connection configuration of MOSFETs inside each logic gate and a logic network such that speed or area is optimized. When we want to raise the speed, the switching speed of the pMOS subcircuit should be comparable to that of the nMOS subcircuit. Since the mobility of electrons is roughly 2.5 times higher than that of holes, the channel width W of p-channel MOSFETs (which work based on holes) must be much wider than that of n-channel MOSFETs (which work based on electrons) in order to compensate for the low speed of p-channel MOSFETs if the same channel length is used in each. In practice, the width is often made 1.5 to 2 times wider rather than 2.5 times, because of different doping levels. Thus, for high-speed applications, NAND logic gates such as Fig. 36.3(a) are preferred to NOR logic gates, because the channel width of p-channel MOSFETs in series in a NOR logic gate such as Fig. 36.3(b) must be further increased such that the parasitic capacitance C can be charged up in a shorter time with low series resistance of these p-channel MOSFETs. When designers are satisfied with low speed, the same channel width is chosen for every p-channel MOSFET as for n-channel MOSFETs, since a CMOS logic gate already requires more area than a pMOS or nMOS logic gate and a further increase by widening the channel is not desirable unless really necessary.
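The preference for NAND over NOR gates can be made concrete with a first-order sizing estimate. The sketch below is an illustration only: it assumes that on-resistance is inversely proportional to channel width, that k MOSFETs in series are each made k times wider to keep the series resistance constant, and that a pMOSFET needs `width_ratio` times the nMOSFET width (2.0 here, within the 1.5 to 2 range quoted above). The function name and the unit width are hypothetical.

```python
def gate_widths(n_inputs, gate, width_ratio=2.0, w_unit=1.0):
    """Return (nMOS width, pMOS width, total width) for an n-input gate.

    First-order model: widths are chosen so that the worst-case
    pull-up and pull-down resistances match those of a reference inverter.
    """
    if gate == "NAND":
        w_n = n_inputs * w_unit              # nMOSFETs in series: widen each one
        w_p = width_ratio * w_unit           # pMOSFETs in parallel: inverter size
    elif gate == "NOR":
        w_n = w_unit                         # nMOSFETs in parallel: inverter size
        w_p = n_inputs * width_ratio * w_unit  # pMOSFETs in series: widen each one
    else:
        raise ValueError("gate must be 'NAND' or 'NOR'")
    total = n_inputs * (w_n + w_p)           # one nMOSFET and one pMOSFET per input
    return w_n, w_p, total


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, gate_widths(n, "NAND"), gate_widths(n, "NOR"))
```

Under these assumptions the NOR gate needs more total channel width than the NAND gate for the same number of inputs, and the gap grows with the number of inputs, because the slower pMOSFETs are the ones placed in series.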

36.5 Pseudo-nMOS

In the case of the CMOS discussed so far, each CMOS logic gate consists of a pair of pMOS and nMOS subcircuits which realize dual transmission functions, as explained with Fig. 36.2. In design practice, however, some variations are often used. For example, Fig. 36.8 realizes NOR. The pMOS subcircuit consists of only a single MOSFET. Thus, the chip area is reduced and the speed is faster than static CMOS discussed so far because of a smaller parasitic capacitance. But more power is dissipated for some combinations of input values¹ because the pMOS subcircuit in (a) is always conductive.⁹,¹⁰

FIGURE 36.8 Pseudo-nMOS.

36.6 Dynamic CMOS

Dynamic CMOS⁴ is a CMOS logic gate that works in a manner very different from the CMOS logic gates discussed so far, which are usually called static CMOS. Using a clock pulse, a parasitic capacitance is precharged, and then whether the parasitic capacitance is discharged or not is evaluated, depending on the values of the input variables. Dynamic CMOS has often been used for high speed because the parasitic capacitance can be made small due to the unique connection configuration of MOSFETs inside a logic gate, although good layout is not easy. Power consumption is not necessarily small. (In static CMOS, a current flows from the power supply to the ground during the transition period, whereas in dynamic CMOS there is no such current. This does not mean that dynamic CMOS consumes less power, because when input variables do not change their values, static CMOS does not consume power at all, whereas dynamic CMOS may consume power by repeating precharging. Dynamic CMOS and static CMOS have completely different power consumption mechanisms.)

Domino CMOS

Domino CMOS, illustrated in Fig. 36.9, consists of pairs of a CMOS logic gate and an inverter CMOS logic gate.⁴ The first CMOS logic gate in each pair (such as logic gates 1, 3, and 5) has the pMOS subcircuit, consisting of a single pMOSFET with a clock, and the nMOS subcircuit, consisting of many nMOSFETs with logic inputs and a single nMOSFET with a clock. The first CMOS logic gate is followed by an inverter CMOS logic gate (such as logic gates 2, 4, and 6). When a clock pulse is absent at all terminals labeled c, all parasitic capacitances (shown by dotted lines) are charged to value 1 (i.e., a high voltage) because all pMOSFETs are conductive. This process is called precharging. Thus, the outputs of all inverters become value 0. Suppose that x = v = 1 (i.e., a high voltage) and y = z = u = 0 (i.e., a low voltage). When a clock pulse appears, that is, c = 1, all pMOSFETs become non-conductive but the nMOS subcircuit in each of logic gates 1 and 5 becomes conductive, discharging the parasitic capacitance. Then the outputs of logic gates 1, 2, 3, 4, 5, and 6 become 0, 1, 1, 0, 0, and 1, respectively. Notice that the output of logic gate 3 remains precharged because its nMOSFET for u remains non-conductive.

FIGURE 36.9 Domino CMOS.

Domino CMOS has the following advantages:

• It has a small area because the pMOS subcircuit in each logic gate consists of a single pMOSFET.
• It is faster (about twice) than the static CMOS discussed so far because parasitic capacitances are reduced by using a single pMOS in each logic gate and the first logic gate is buffered by an inverter. Also, an inverter, such as logic gate 2, has a smaller parasitic capacitance at its output because it connects only to an nMOSFET in the next logic gate (logic gate 3, for example), whereas the output of a static CMOS logic gate connects to both a pMOSFET and an nMOSFET in each static CMOS logic gate that it drives. This also makes domino CMOS faster.
• It is free of glitches (i.e., the transition is smooth) because at the output of each logic gate, a high voltage remains or decreases, but no voltage increases from low to high.⁷

Domino CMOS has the following disadvantage:

• Only positive functions with respect to input variables can be realized. (If both $x_i$ and $\overline{x_i}$ for each $x_i$ are available as network inputs, the network can realize any function. But if only one of them, say $x_i$, is available, functions that are dependent on $\overline{x_i}$ cannot be realized by a domino CMOS logic network.) So we have to have domino CMOS networks in double-rail input logic (e.g., Ref. 2), or add inverters whenever necessary. Thus, although the number of MOSFETs in domino CMOS networks in single-rail input logic, such as Fig. 36.9, is almost half that of static CMOS networks, the number of MOSFETs in domino CMOS networks that realize arbitrary logic functions may become comparable to the number of MOSFETs in static CMOS networks in single-rail input logic.
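The precharge and evaluate phases described above can be sketched with a behavioral model of one domino stage (a dynamic gate followed by its inverter). This is a hedged illustration: the two-stage network and its logic functions are hypothetical and do not reproduce Fig. 36.9, and charge leakage and timing are ignored.

```python
from itertools import product


def domino_stage(pull_down, inputs, clock):
    """Model one dynamic gate plus its inverter.

    pull_down: function returning a truthy value when the nMOS logic conducts.
    Returns (dynamic node value, inverter output).
    """
    if clock == 0:
        node = 1                                  # precharge through the single pMOSFET
    else:
        node = 0 if pull_down(*inputs) else 1     # evaluate: discharge if a path conducts
    return node, 1 - node


def two_stage_network(x, y, z, clock):
    # Hypothetical network: stage A realizes x AND y, stage B realizes A OR z.
    _, a = domino_stage(lambda p, q: p and q, (x, y), clock)
    _, b = domino_stage(lambda p, q: p or q, (a, z), clock)
    return a, b


if __name__ == "__main__":
    for x, y, z in product([0, 1], repeat=3):
        print((x, y, z), two_stage_network(x, y, z, 0), two_stage_network(x, y, z, 1))
```

During precharge every inverter output is 0, and during evaluation an inverter output can only stay at 0 or rise to 1. Each stage therefore sees inputs that change in one direction only, which is why only positive functions of the inputs can be realized.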

Dynamic CVSL

Static CVSL, described previously, can be easily converted into dynamic CVSL, which is faster, as illustrated in Fig. 36.10. The parasitic capacitances of the two output terminals are precharged through each of the two pMOSFETs during the absence of a clock pulse. Dynamic CVSL works in a similar manner to domino CMOS. Notice that the two pMOSFETs are not cross-connected as in static CVSL, so we have essentially two independent logic gates. Dynamic CVSL with two outputs, f and $\overline{f}$, is in double-rail logic, so unlike domino CMOS, it can realize not only positive functions but any logic function. It is fast because the pMOS subcircuit of static CMOS is replaced by one pMOSFET, so the parasitic capacitance is small, and also because the output of a dynamic CVSL gate is connected only to the nMOS subcircuit of the next logic gate, instead of to both the nMOS and pMOS subcircuits as with a static CMOS logic gate. It is also free of glitches.

FIGURE 36.10 Dynamic CVSL.

Problems of Dynamic CMOS

Dynamic CMOS, such as domino CMOS and differential CMOS logic, is increasingly important for circuits that require high speed, such as arithmetic/logic units,⁵ although design and layout for appropriate distribution of voltages and currents are far trickier than in static CMOS. Dynamic CMOS with a single-phase clock has the advantage of simple clock distribution lines.³

References

1. Cooper, J. A., J. A. Copland, and R. H. Krambeck, "A CMOS microprocessor for telecommunications applications," ISSCC '77, pp. 137-138.
2. Heikes, C., "A 4.5mm2 multiplier array for a 200MFLOP pipelined coprocessor," ISSCC '94, pp. 290-291. (Double-rail domino CMOS.)
3. Ji-Ren, Y., I. Karlsson, and C. Svensson, "A true single-phase-clock dynamic CMOS circuit technique," IEEE JSSC, pp. 899-901, Oct. 1987.
4. Krambeck, R. H., C. M. Lee, and H.-F. S. Law, "High-speed compact circuits with CMOS," IEEE JSSC, pp. 614-619, June 1982. (First paper on domino CMOS.)
5. Lu, F. and H. Samueli, "A 200-MHz CMOS pipelined multiplier-accumulator using quasi-domino dynamic full-adder cell design," IEEE JSSC, pp. 123-132, Feb. 1993.
6. Muroga, S., VLSI System Design, John Wiley & Sons, 1982.
7. Murphy, et al., "A CMOS 32b single chip microprocessor," ISSCC '81, pp. 230, 231, 276.
8. Shoji, M., CMOS Digital Circuit Technology, Prentice-Hall, 1988.
9. Uyemura, J. P., CMOS Logic Circuit Design, Kluwer Academic Publishers, 1999.
10. Weste, N. H. E. and K. Eshraghian, Principles of CMOS VLSI Design, 2nd ed., Addison Wesley, 1993.

Yano, K., Muroga, S. "Pass Transistors" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

37 Pass Transistors

Kazuo Yano
Hitachi Ltd.

Saburo Muroga
University of Illinois at Urbana-Champaign

37.1 Introduction
37.2 Electronic Problems of Pass Transistors
37.3 Top-down Design of Logic Functions with Pass-Transistor Logic

37.1 Introduction

A MOSFET is usually used such that a voltage at the gate terminal of the MOSFET controls a current between its source and drain. When the voltages at the gate terminals of MOSFETs that constitute a logic gate are regarded as input variables x, y, and so on, the voltage at the output terminal represents the output function f. Here, x = 1 means a high voltage and x = 0 a low voltage. For example, MOSFETs in the logic gate in CMOS shown in Fig. 37.1(a) are used in this manner and the output function f represents $\overline{x \vee y}$.

But a MOSFET can also be used such that an input variable x applied at one of the source and drain of the MOSFET is delivered to the other or not, depending on whether a voltage applied at the gate terminal of the MOSFET is high or low. For example, the n-channel MOSFET shown in Fig. 37.1(b) works in this manner, and a MOSFET used in this manner is called a transfer gate or, more often, a pass transistor. When the control voltage c is high, the MOSFET becomes conductive and the input voltage x appears at the output terminal f (henceforth, let letters f, x, and others represent terminals as well as voltages or signal values), no matter whether the voltage x is high or low. When the control voltage c is low, the MOSFET becomes non-conductive and the input voltage at x does not appear at the output terminal f.

FIGURE 37.1 CMOS logic gate and pass-transistor circuit.

Pass transistors have been used for simplification of transistor circuits. Logic functions can be realized with fewer MOSFETs than logic networks of logic gates where MOSFETs are used like those in Fig. 37.1(a). The circuit in Fig. 37.2, for example, realizes the even-parity function $\overline{x \oplus y}$.² This circuit works in the following manner. When x and y are both low voltages, n-channel MOSFETs 1 and 2 are non-conductive and consequently no current flows from the power supply Vdd to the terminal x or y. Thus, the output voltage f is high because it is the same as the power supply voltage at Vdd. When x is a low voltage and y is a high voltage, MOSFET 1 becomes conductive and 2 is non-conductive. Consequently, a current flows from the power supply to x because x is a low voltage. Thus, the output voltage at f is low. Continuing the analysis, we have the truth table shown in Fig. 37.3(a). Then we can derive the truth table for function $f = \overline{x \oplus y}$ in Fig. 37.3(b) by regarding low and high voltages as 0 and 1, respectively. A logic gate for this function in CMOS requires eight MOSFETs, as shown in Fig. 37.4, whereas the circuit realized with pass transistors in Fig. 37.2 requires only three MOSFETs. Notice that inputs x and y are connected to the sources of MOSFETs 1 and 2, unlike MOSFETs in ordinary logic gates. Signal x at the source of MOSFET 1 is either sent to the drain or not, according to whether or not its MOSFET gate has a high voltage.

FIGURE 37.2 Circuit with pass transistors for the even-parity function.

FIGURE 37.3 Behavior of the circuit in Fig. 37.2.

Pass transistors, however, are sometimes used inside an ordinary logic gate, mixed with ordinary MOSFETs. MOSFETs 1, 2, and 3 in Fig. 37.5 are such pass transistors. (Actually, the pair of 1 and 2 is a transmission gate, to be described in the following and also in Chapter 36, Section 36.2.) Logic networks in CMOS where MOSFETs are used in the same way as those in Fig. 37.1(a) can sometimes be simplified by an appropriate use of pass transistors, possibly with speed-up.

Pass transistors are widely used in DRAM memory cells, each of which consists of one capacitor and one pass transistor. Charging and discharging of the memory capacitor are controlled through the pass transistor. This shows the impact of this technique in area reduction, power consumption reduction, and possibly also in speed-up. Pass transistors are also used in the arithmetic-logic unit of a computer, which requires speed and small area, such as fast adders (actually the logic gate in Fig. 37.5 is part of such an adder), multipliers, and multiplexers.³,⁷,¹¹,¹²

The circuit in Fig. 37.6(a) in double-rail input logic (i.e., $x$, $\overline{x}$, $y$, and $\overline{y}$ are available as inputs) realizes the odd-parity function x ⊕ y. A circuit for the inverter shown by the triangle with a circle in Fig. 37.6(a) is shown in (b). Pass transistors are often used for forming a demultiplexer, as illustrated in Fig. 37.7. A series connection of pass transistors has some resistance, so the number of control variables (here, x and y) is limited to at most four, and the inverter shown in Fig. 37.6(b) is added after input g as a buffer.
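The voltage-level analysis of the even-parity circuit of Fig. 37.2 given earlier in this section can be reproduced with a small switch-level model. The sketch below is an illustration under assumptions: MOSFET 1 is taken to have its gate driven by y and to connect the output to x (as the text states for the case x low, y high), and MOSFET 2 is assumed, by symmetry, to have its gate driven by x and to connect the output to y. The third MOSFET is modeled as a load that holds the output high unless a conductive pass transistor connects it to a low input.

```python
from itertools import product


def pass_transistor_even_parity(x, y):
    """Switch-level model of the three-MOSFET circuit (wiring assumed, see lead-in)."""
    mosfet1_on = (y == 1)   # gate driven by y, source at x
    mosfet2_on = (x == 1)   # gate driven by x, source at y
    # The output is pulled low when a conductive pass transistor
    # connects it to an input that is at a low voltage.
    pulled_low = (mosfet1_on and x == 0) or (mosfet2_on and y == 0)
    return 0 if pulled_low else 1


def truth_table():
    return {(x, y): pass_transistor_even_parity(x, y)
            for x, y in product([0, 1], repeat=2)}


if __name__ == "__main__":
    for (x, y), f in truth_table().items():
        print(x, y, f)
```

The resulting table is 1 when x and y are equal and 0 otherwise, which is the complement of x ⊕ y.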

FIGURE 37.4 CMOS logic gate for function $f = \overline{x \oplus y}$.

FIGURE 37.5 Pass transistors in a logic gate.

FIGURE 37.6 Parity function realized with pass transistors.

Using pass transistors, a latch (more precisely, a D latch to store data) can be constructed as shown in Fig. 37.8. When control input c is a high voltage, the feedback loop is cut by the pass transistor with $\overline{c}$, and the input value is fed into the cascade of two inverters. When c becomes a low voltage, the input is cut off and the loop that consists of two inverters and one pass transistor retains the information.
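The latch behavior just described (transparent while c is high, holding while c is low) can be captured in a few lines. This is a behavioral sketch only: the class and method names are hypothetical, the two cascaded inverters are modeled as returning the stored value unchanged, and leakage and threshold-voltage effects are ignored.

```python
class DLatch:
    """Behavioral model of a pass-transistor D latch."""

    def __init__(self, initial=0):
        # Value held at the input of the inverter cascade.
        self.node = initial

    def step(self, c, d):
        if c == 1:
            # Input pass transistor conducts and the feedback loop is cut:
            # the latch is transparent and follows d.
            self.node = d
        # When c == 0 the input is cut off and the feedback loop
        # (two inverters and one pass transistor) keeps the old value.
        return self.node  # two inversions in cascade restore the polarity


if __name__ == "__main__":
    latch = DLatch()
    for c, d in [(1, 1), (0, 0), (0, 1), (1, 0), (0, 1)]:
        print(f"c={c} d={d} -> q={latch.step(c, d)}")
```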

FIGURE 37.7 Demultiplexer with pass transistors.

FIGURE 37.8 Latch with pass transistors.

The use of pass transistors in logic networks, however, has been very limited because of the electronic problems to be discussed in the following. The majority of commercially available logic networks have been in static CMOS circuits. But as higher speed and smaller area are desired, this situation is gradually changing. Pass-transistor logic has recently attracted much attention under these circumstances and is anticipated to be widely used in the near future for its area/power saving and high-performance benefits. Besides pass transistors, there are many other unconventional MOS networks. All these networks are useful for simplification of electronic realization of logic networks or for improvement of performance, although complex adjustments of voltages or currents are often required.

37.2 Electronic Problems of Pass Transistors

Suppose an n-channel MOSFET is used as a pass transistor, as shown in Fig. 37.9. Then, the MOSFET behaves electronically as follows. When the control voltage at c is high, the voltage at the input x is delivered to the output terminal f, no matter whether the voltage at the input x is high or low. But if the voltage at the input x is high, a current flows through the MOSFET to charge up the parasitic capacitance (shown in dotted lines in Fig. 37.9) at f and stops when the difference between the voltage at f and the voltage at c reaches the threshold voltage, making the voltage at f somewhat lower than the voltage at x. When the voltage at c becomes low, the pass transistor becomes non-conductive and the electric charge stored on the parasitic capacitance gradually leaks to the ground through the parasitic resistor. If the voltage at the input x is low when the control voltage at c is high, any electric charge that is still stored on the parasitic capacitance (i.e., that has not yet leaked to the ground) flows to the ground through the pass transistor and the input terminal x.

FIGURE 37.9 Electronic behavior of a pass transistor.

This complex electronic behavior of the pass transistor makes a circuit with pass transistors unreliable. The intermediate value of the voltage at f, which is lower than the voltage at x by the threshold voltage or is a partially leaked voltage, causes unpredictable operations of the logic network when the voltage at f is fed to ordinary CMOS logic gates in the next stage. Moreover, it degrades the switching speed of the CMOS logic gates. In the worst case, the circuit loses noise margin or does not operate properly.

There are three techniques to avoid this drawback. The first one is to combine an nMOS pass transistor and a pMOS pass transistor in parallel, as shown in Fig. 37.10(a). With this technique, when the pass transistor is conductive, the output voltage at f reaches exactly the same value as the input voltage at x, no matter whether the input voltage is high or low. This pair of nMOS and pMOS pass transistors is sometimes called a transmission gate. Although this has better stability than the pass transistor circuit in Fig. 37.9, it consumes roughly twice as large an area.
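The threshold-voltage drop of an nMOS pass transistor and the full swing of a transmission gate can be compared with a simple steady-state model. This is a sketch under assumptions: the body effect and leakage are ignored, and the supply voltage (3.0 V) and threshold voltage (0.5 V) are hypothetical values chosen only for illustration.

```python
def nmos_pass_output(v_in, v_gate, v_th):
    """Steady-state output of a conductive nMOS pass transistor.

    The output stops charging once it is within one threshold
    voltage of the gate voltage.
    """
    return min(v_in, v_gate - v_th)


def transmission_gate_output(v_in):
    # With nMOS and pMOS in parallel, one of them always passes the
    # input without a threshold drop, so the full level is delivered.
    return v_in


if __name__ == "__main__":
    VDD, VTH = 3.0, 0.5   # hypothetical values
    print("nMOS pass, high input:", nmos_pass_output(VDD, VDD, VTH))
    print("nMOS pass, low input: ", nmos_pass_output(0.0, VDD, VTH))
    print("transmission gate:    ", transmission_gate_output(VDD))
```

With these numbers a high input of 3.0 V is delivered as only 2.5 V by the single nMOS pass transistor, while a low input is passed without degradation.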

FIGURE 37.10 Techniques to avoid the influence of the low output voltage.

The second approach is to use a pMOS feedback circuit at the output of the nMOS pass transistor, as shown in Fig. 37.10(b). The gate of a p-channel MOSFET is driven by the CMOS inverter (shown as the triangle with a small circle), which works as an amplifier. When the CMOS inverter discharges the electric charge at the output, it also turns on the feedback pMOS to raise the pass transistor output to the power supply voltage, eliminating the unreliable operation. One limitation of this approach is that it does not solve the degradation of switching speed due to low voltage, because the speed is determined by the initial voltage swing before the pMOS turns on. The area increase with this approach is smaller than with the transmission gate in Fig. 37.10(a).

The third approach is to raise the gate voltage swing up to the normal voltage plus the threshold voltage, which is used in DRAM and referred to as "word boost," as shown in Fig. 37.10(c). This approach requires a boost driver every time the gate signal is generated, which makes it difficult to use in logic functions in general. In addition, a voltage that is higher than the power supply voltage is applied to the gate of a MOSFET, which requires special care against breakdown and reliability problems (these need to be solved by increasing the thickness of gate insulation).

Another important consideration for pass-transistor operation is how many pass transistors can be connected in series without buffers. Many pass transistors connected in series can be treated as a serially connected resistor-capacitor circuit. Doubling the number of pass transistors doubles both R and C, so the delay of this RC (resistor-capacitor) circuit, which is proportional to the product of R and C, becomes roughly four times larger. Thus, the delay of this circuit is proportional to the square of the number of pass transistors. This means that it is not beneficial to connect too many pass transistors in series. However, inserting CMOS inverters as buffers at short intervals increases the delay overhead of the buffers themselves. Empirically, the optimal number of pass-transistor stages for delay minimization is known to be about two to three. In design practice, the number of pass transistors cannot be arbitrarily chosen because designers want to have a certain number of fan-out connections and a buffer cannot support too many fan-outs and pass transistors. Also, the structure of a logic network cannot be arbitrarily chosen because of area size and power consumption.
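The quadratic growth of delay with the number of series pass transistors, and the trade-off against buffer overhead, can be illustrated with the Elmore delay of a uniform RC ladder. This is a sketch, not the authors' analysis: each pass transistor is assumed to contribute the same resistance r and node capacitance c, and the buffer delay of 3.0 (in units of rc) is a hypothetical value.

```python
def elmore_delay(n, r=1.0, c=1.0):
    """Elmore delay of n identical pass transistors in series.

    The capacitance at node k is charged through k resistors,
    so the total is r*c*(1 + 2 + ... + n) = r*c*n*(n + 1)/2.
    """
    return sum(k * r * c for k in range(1, n + 1))


def buffered_chain_delay(total, stages_per_buffer, t_buffer, r=1.0, c=1.0):
    """Delay of `total` pass transistors with a buffer after every group."""
    if total % stages_per_buffer != 0:
        raise ValueError("total must be a multiple of stages_per_buffer")
    groups = total // stages_per_buffer
    return groups * (elmore_delay(stages_per_buffer, r, c) + t_buffer)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print("unbuffered", n, elmore_delay(n))
    for k in (1, 2, 3, 4, 6, 12):
        print("12 transistors, buffer every", k, "->", buffered_chain_delay(12, k, 3.0))
```

Doubling the chain length from 4 to 8 raises the unbuffered delay from 10 to 36 units, close to a factor of four. With the assumed buffer delay, grouping two or three pass transistors per buffer gives the smallest total, in line with the empirical figure quoted above, although the optimum depends on the actual buffer delay.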

37.3 Top-down Design of Logic Functions with Pass-Transistor Logic

Traditionally, logic networks were designed first, manually or by CAD programs, and computer systems were then designed from them. This is called a bottom-up design approach. In the 1990s, so-called top-down design was accepted as the mainstream design approach. In top-down logic design, register-transfer-level functionality is described with a hardware-description language, such as Verilog-HDL or VHDL (Very High Speed Integrated Circuit Hardware Description Language), rather than by directly designing the gate-level structure of logic networks, or "netlist." This description is then converted to logic networks of logic gates using a logic synthesizer (i.e., CAD programs for automated design of logic networks). This process resembles the compilation process in software construction and is sometimes referred to as "compile." Based on this netlist, placement and routing of transistors are done automatically on an IC chip. By using this top-down approach, a logic designer can focus on the functional aspect of the logic rather than the in-depth structural aspect. This enhances productivity. Also, it enables one to easily port a design from one technology to another.

Automated design of logic networks with pass transistors has been difficult to realize because of their complex electronic behavior. So, conventionally, pass-transistor logic has been manually designed, particularly in arithmetic modules, as shown in this section. But as reduction of power consumption, speed-up, or area reduction is strongly desired, this is changing. Logic design based on selectors with pass transistors can be done in this top-down manner.¹⁰

Pass transistors have often been used as a selector by combining two pass transistors, as shown in Fig. 37.11(a). A selector is also called a multiplexer. The output f of the selector becomes input x when c = 1 and input y when c = 0. Figure 37.11(b) shows a selector realized in a logic gate and also in pass transistors. Compared with the selector in a CMOS logic gate shown on the left side of Fig. 37.11(b), which consists of ten MOSFETs, the selector in pass transistors shown on the right side of Fig. 37.11(b) consists of only four MOSFETs, reducing the number of MOSFETs to less than half, and consequently the area. A selector is known to be a universal logic element because it can be used as an AND, an OR, and an XOR (i.e., Exclusive-OR) by changing its inputs, as shown in Fig. 37.11(c). This property is also useful in the top-down design approach discussed in the following. The speed of a logic network with pass transistors is sometimes up to 2 times better than that of a conventional CMOS logic network, depending on logic functions.

FIGURE 37.11 Selectors and various logic functions realized by pass transistors.

One limitation of this pass-transistor selector is that it suffers a relatively slow switching speed when the control signal arrives later than the selected input signals. This is because an inverter is needed for the selector to have a complementary signal applied to the gate of a pass transistor. To circumvent this limitation, CPL (which stands for complementary pass-transistor logic) has been conceived.¹¹,¹² In CPL, complementary signals are used for both inputs and outputs, eliminating the need for the inverter. Circuits that require complementary signals, like CPL, are sometimes categorized as dual-rail logic. Because of the need for complementary signals, CPL is sometimes twice as large as CMOS, but is sometimes surprisingly small if a designer succeeds in fully utilizing the functionality of a pass-transistor circuit. A very fast and compact CPL full adder, a multiplier, and a carry-propagate-chain circuit have been reported. A full adder realized with CMOS logic gates is compared with a full adder realized with selectors in pass transistors in Fig. 37.12. Speed and power consumption are significantly improved. Variants of CPL have also been reported, including DPL,⁸ SRPL,⁵ and SAPL.⁴

However, conventional switching theory, on which widely used logic synthesizers are based, cannot be conveniently used for this purpose because it is very difficult to derive convenient logic expressions based on the output function $xc \vee y\overline{c}$ of the selector. Instead, BDDs (i.e., binary decision diagrams) are used, as follows.
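The claim that a selector is a universal logic element can be checked directly from its defining behavior (output x when c = 1, output y when c = 0). The input assignments below are one possible choice and are an assumption of this sketch; they may differ from the ones drawn in Fig. 37.11(c).

```python
from itertools import product


def selector(c, x, y):
    # Two-input selector (multiplexer): f = x when c = 1, f = y when c = 0.
    return x if c else y


def and_gate(a, b):
    return selector(a, b, 0)        # a = 1 passes b, a = 0 passes constant 0


def or_gate(a, b):
    return selector(a, 1, b)        # a = 1 passes constant 1, a = 0 passes b


def xor_gate(a, b):
    return selector(a, 1 - b, b)    # a = 1 passes the complement of b, a = 0 passes b


if __name__ == "__main__":
    for a, b in product([0, 1], repeat=2):
        print(a, b, and_gate(a, b), or_gate(a, b), xor_gate(a, b))
```

Note that the XOR realization needs the complement of b as a data input, which is readily available in double-rail styles such as CPL.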
A simple approach to use a selector as the basic logic element is to build a binary tree of passtransistor selectors, as shown in Fig. 37.13. The truth table shown in Fig. 37.13(a) is directly mapped into the tree structure shown in Fig. 37.13(b). When x = 1, y = 0, and z = 1, for example, the third 1 from the left in the top of Fig. 37.13(b) is connected to the output as f = 1. This original tree generally has redundancy, so it should be reduced to an irredundant form as shown in Fig. 37.13(c). This approach is simple and effective when the number of input variables is less than 5 or so. However, this does not work for functions with more input variables, because of the explosive increase of the tree size. To solve this, a binary decision diagram (i.e., BDD), has been utilized.10 Basic design flow of BDDbased pass-transistor circuit synthesis is shown in Fig. 37.14. The logic expressions for functions f1 and f2 shown in Fig. 37.14(a) are converted to the BDD in (b). Then, buffers (shown as triangles) are inserted in Fig. 37.14(c). In this case, only locations where the buffers should be inserted in Fig. 37.14(d) are specified and the nature of the BDD in Fig. 37.14(c) is not changed. In both Figs. 37.14(b) and (c), each solid line denotes the value 1 of a variable and each dotted line the value 0. For example, the downward solid line from the right-hand circle with w inside denotes w = 1. From f1, if we follow dotted lines three times and then the solid line once in each of (b) and (c), we reach the 0 inside the rectangle in the bottom. This means that f1 = 0 for w = x = y = 0 and z = 1. Preparation of an appropriate cell library based on selectors is required, as shown in Fig. 37.15, which consists of a simple two-input selector (Cell 1) and its variants (Cells 2 and 3). The inverters shown with a dot inside the triangle in Fig. 37.15, which is different from the simple inverter shown in Fig. 
37.6(b), is to keep the electric charge on the parasitic capacitance at its input. In Fig. 37.14(d), the inverters of this kind have to be inserted corresponding to the buffers shown in (c). But in this case, the insertion has to be done such that the outputs f1 and f2 have the same polarity in both (c) and (d) because the inverters change signal values from 1 to 0 or from 0 to 1. In the design flow in Fig. 37.14, starting from the logic functions which are represented with logic equations or a truth table, the logic functions are then converted to a BDD. Each node of the BDD represents two-input selector logic, and, in this way, mapping to the above selector-based cells is straightforward, requiring only consideration of the fan-out and signal polarity. One difficulty of this approach with BDD is optimization of the logic depth, that is, the number of pass transistors from an input to an output. One important desired capability of a logic synthesizer for

© 2000 by CRC Press LLC

FIGURE 37.12

Full adder realized in CMOS logic gates and complementary pass-transistor logic (CPL).

FIGURE 37.13

Binary-tree-based simple construction method of pass-transistor logic.

FIGURE 37.14

Design flow of BDD-based pass-transistor logic synthesis.

FIGURE 37.15

Pass-transistor-based cell library.

this approach is the control of the logic depth, for example, so that a designer can limit the delay time from an input to an output. It is difficult to incorporate this requirement in the framework of a BDD. Another difficulty of BDD-based synthesis is that the number of pass transistors connected in series increases linearly as the number of inputs increases, and this number may become excessive. To solve these difficulties, MPL (multi-level pass-transistor logic) and its representation, the multi-level BDD, have been proposed.6 In the simple BDD-based approach above, the output of a pass-transistor selector is connected only to the source-drain path of another pass-transistor selector. This restriction causes the above difficulties. In MPL, the output of a pass-transistor selector is flexibly connected to either a source-drain path or the gate of another MOSFET. Because of this freedom, the delay of the circuit can be flexibly controlled. It is known empirically that the delay, especially of a logic network having a large number of input variables, is reduced by a factor of 2 compared to the simple BDD approach. Another important extension of pass-transistor logic is to incorporate CMOS circuits in a logic network.9 Logic networks based on pass transistors are not always smaller than CMOS logic networks in area, delay, and power consumption. They are effective when selectors fit the target logic functions well. Otherwise, conventional CMOS logic networks are a better choice. For example, a simple NAND function implemented as a CMOS logic network has better delay, area, and power consumption than its pass-transistor-based counterpart. Combining pass-transistor logic and CMOS logic gives the best solution. Pass-transistor logic synthesis is still not as well developed as CMOS-based logic synthesis. However, even at its current level of development, it has shown generally positive results. For example, a 10 to 30% power reduction is possible compared with pure CMOS,9 showing enough potential1 to be further exploited in future research.
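
The binary-tree construction described for Fig. 37.13 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the synthesis procedure of Ref. 10: the function names and the example truth table are hypothetical, each tuple stands for one two-input selector controlled by one variable, and the reduction shown only removes selectors whose two data inputs carry the same function (it does not share subgraphs as a BDD would).

```python
# Hypothetical sketch: a truth table mapped onto a binary tree of two-input
# selectors, then reduced by removing selectors with identical data inputs.

def build_tree(table, variables):
    """table lists f for all input combinations, first variable most significant."""
    if not variables:
        return table[0]                      # leaf: constant 0 or 1
    half = len(table) // 2
    var, rest = variables[0], variables[1:]
    # (control variable, branch taken when var = 0, branch taken when var = 1)
    return (var, build_tree(table[:half], rest), build_tree(table[half:], rest))

def reduce_tree(node):
    """Drop every selector whose two data inputs realize the same function."""
    if not isinstance(node, tuple):
        return node
    var, low, high = node
    low, high = reduce_tree(low), reduce_tree(high)
    return low if low == high else (var, low, high)

def evaluate(node, assignment):
    """Follow the selected branch at each selector until a constant is reached."""
    while isinstance(node, tuple):
        var, low, high = node
        node = high if assignment[var] else low
    return node

def count_selectors(node):
    if not isinstance(node, tuple):
        return 0
    return 1 + count_selectors(node[1]) + count_selectors(node[2])
```

For an assumed three-variable table such as [0, 0, 0, 1, 0, 1, 1, 1], the full tree uses seven selectors, and the reduction leaves five while realizing the same function.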

References
1. A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, “Low-power CMOS digital design,” IEEE J. Solid-State Circuits, SC-27, pp. 473-484, 1992.
2. A. H. Frei, W. K. Hoffman, and K. Shepard, “Minimum area parity circuit building block,” ICCC 80, pp. 680-684, 1980.
3. K. Kikuchi, Y. Nukada, Y. Aoki, T. Kanou, Y. Endo, and T. Nishitani, “A single-chip 16-bit 25ns realtime video/image signal processor,” 1989 IEEE International Solid-State Circuits Conference, pp. 170-171, 1989.
4. M. Matsui, H. Hara, Y. Uetani, L-S. Kim, T. Nagamatsu, Y. Watanabe, A. Chiba, K. Matsuda, and T. Sakurai, “A 200 MHz 13 mm² 2-D DCT macrocell using sense-amplifying pipeline flip-flop scheme,” IEEE J. Solid-State Circuits, vol. 29, pp. 1482-1489, 1994.
5. A. Parameswar, H. Hara, and T. Sakurai, “A high speed, low power, swing restored pass-transistor logic based multiply and accumulate circuit for multimedia applications,” IEEE 1994 Custom Integrated Circuits Conference, pp. 278-281, 1994.
6. Y. Sasaki, K. Yano, S. Yamashita, H. Chikata, K. Rikino, K. Uchiyama, and K. Seki, “Multi-level pass-transistor logic for low-power ULSIs,” IEEE Symp. Low Power Electronics, pp. 14-15, 1995.
7. Y. Shimazu, T. Kengaku, T. Fujiyama, E. Teraoka, T. Ohno, T. Tokuda, O. Tomisawa, and S. Tsujimichi, “A 50MHz 24b floating-point DSP,” 1989 IEEE International Solid-State Circuits Conference, pp. 44-45, 1989.
8. M. Suzuki, N. Ohkubo, T. Yamanaka, A. Shimizu, and K. Sasaki, “A 1.5 ns 32 b CMOS ALU in double pass-transistor logic,” 1993 IEEE International Solid-State Circuits Conference Digest of Technical Papers, pp. 90, 91, 267, 1993.
9. S. Yamashita, K. Yano, Y. Sasaki, Y. Akita, H. Chikata, K. Rikino, and K. Seki, “Pass-transistor/CMOS collaborated logic: The best of both worlds,” 1997 Symp. VLSI Circuits Digest of Technical Papers, pp. 31-32, 1997.
10. K. Yano, Y. Sasaki, K. Rikino, and K. Seki, “Top-down pass-transistor logic design,” IEEE J. Solid-State Circuits, vol. 31, pp. 792-803, 1996.
11. K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, “A 3.8ns CMOS 16×16 multiplier using complementary pass-transistor logic,” IEEE 1989 Custom Integrated Circuits Conference, 10.4.1-4, 1989.
12. K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and A. Shimizu, “A 3.8-ns CMOS 16×16-b multiplier using complementary pass-transistor logic,” IEEE J. Solid-State Circuits, SC-25, pp. 388-395, 1990.

Takagi, N., et al. "Adders." The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000.

38 Adders

Naofumi Takagi Nagoya University

Haruyuki Tago Toshiba Semiconductor Company

Charles R. Baugh C. R. Baugh and Associates

Saburo Muroga University of Illinois at Urbana-Champaign

38.1 Introduction
38.2 Addition in the Binary Number System
38.3 Serial Adder
38.4 Ripple Carry Adder
38.5 Carry Skip Adder
38.6 Carry Look-Ahead Adder
38.7 Carry Select Adder
38.8 Carry Save Adder

38.1 Introduction

Adders are the most common arithmetic circuits in digital systems. Adders are also used for subtraction and are key components of multipliers and dividers, as described in Chapters 39 and 40. There are various types of adders with different speeds, areas, and configurations, so we can select the one that best satisfies the given requirements. For the details of adders and addition methods, see Refs. 2, 4–6, 12, 15, 17, and 20.

38.2 Addition in the Binary Number System

Before considering adders, let us take a look at addition in the binary number system. In digital systems, numbers are usually represented in the binary number representation, although the representation most familiar to us is the decimal one. The binary number representation has the radix (base) 2 and the digit set {0, 1}, while the decimal number representation has the radix 10 and the digit set {0, 1, 2, …, 9}. For example, a binary number (i.e., a number in the binary number representation) [1101] represents $1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0 = 13$, whereas a decimal number (i.e., a number in the decimal number representation) [8093] represents $8 \cdot 10^3 + 0 \cdot 10^2 + 9 \cdot 10^1 + 3 \cdot 10^0 = 8093$.

In the binary number representation, an integer is represented as $[x_{n-1} x_{n-2} \ldots x_0]$, where each binary digit $x_i$, called a bit, is one of the elements of the digit set {0, 1}. The binary representation $[x_{n-1} x_{n-2} \ldots x_0]$ represents the integer $\sum_{i=0}^{n-1} x_i \cdot 2^i$.

As with the decimal number representation, the binary number representation can represent not only an integer but also a number that has a fractional part as well as an integral part. The binary representation $[x_{n-1} x_{n-2} \ldots x_0 . x_{-1} x_{-2} \ldots x_{-m}]$ represents the number $\sum_{i=-m}^{n-1} x_i \cdot 2^i$. For example, [1101.101] represents 13.625. By a binary representation with an n-bit integral part and an m-bit fractional part, we can represent $2^{n+m}$ numbers in the range from 0 to $2^n - 2^{-m}$.

Let us consider addition of two binary numbers, $X = [x_{n-1} x_{n-2} \ldots x_0 . x_{-1} x_{-2} \ldots x_{-m}]$ and $Y = [y_{n-1} y_{n-2} \ldots y_0 . y_{-1} y_{-2} \ldots y_{-m}]$. We can perform addition by calculating the sum at the i-th position, $s_i$, and the carry to the next higher position, $c_{i+1}$, from $x_i$, $y_i$, and the carry from the lower position, $c_i$, according to the truth table shown in Table 38.1, successively from the least significant bit to the most significant one, that is, from i = –m to n – 1, where $c_{-m} = 0$. Then, we have a sum $S = [s_n s_{n-1} s_{n-2} \ldots s_0 . s_{-1} s_{-2} \ldots s_{-m}]$, where $s_n = c_n$ (i.e., $s_n$ is actually the carry $c_n$ from the (n – 1)-th position).

TABLE 38.1 Truth Table for One-bit Addition

xi   yi   ci   ci+1   si
0    0    0    0      0
0    0    1    0      1
0    1    0    0      1
0    1    1    1      0
1    0    0    0      1
1    0    1    1      0
1    1    0    1      0
1    1    1    1      1

There are two major methods for representing negative numbers in the binary number representation. One is the sign and magnitude representation, and the other is the two’s complement representation. In the sign and magnitude representation, the sign and the magnitude are represented separately, as in the usual decimal representation. The first bit is the sign bit and the remaining bits represent the magnitude. The sign bit is normally selected to be 0 for positive numbers and 1 for negative numbers. For example, 13.625 is represented as [01101.101] and –13.625 is represented as [11101.101]. The sign and magnitude binary representation $[x_{n-1} x_{n-2} \ldots x_0 . x_{-1} x_{-2} \ldots x_{-m}]$ represents the number $(-1)^{x_{n-1}} \cdot \sum_{i=-m}^{n-2} x_i \cdot 2^i$. By the sign and magnitude representation with an n-bit integral part (including a sign bit) and an m-bit fractional part, we can represent $2^{n+m} - 1$ numbers in the range from $-2^{n-1} + 2^{-m}$ to $2^{n-1} - 2^{-m}$.

In the two’s complement representation, a positive number is represented in exactly the same manner as in the sign and magnitude representation. On the other hand, a negative number –X, where X is a positive number, is represented as $2^n - X$. For example, –13.625 is represented as [100000.000] – [01101.101], i.e., [10010.011]. The first bit of the integral part (the most significant bit) is 1 for negative numbers, indicating the sign of the number. The binary representation of $2^n - X$ is called the two’s complement of X. We can obtain the representation of –X, i.e., $2^n - X$, by complementing the representation of X bitwise and adding [0.00…001] (i.e., 1 in the position of $2^{-m}$ but 0 in all other positions). This is because, when $X = [x_{n-1} x_{n-2} \ldots x_0 . x_{-1} x_{-2} \ldots x_{-m}] = \sum_{i=-m}^{n-1} x_i \cdot 2^i$, the negation of X, i.e., –X, becomes $2^n - X = \left( \sum_{i=-m}^{n-1} 2^i + 2^{-m} \right) - X = \sum_{i=-m}^{n-1} (1 - x_i) \cdot 2^i + 2^{-m}$, where $1 - x_i$ is 1 or 0 according to whether $x_i$ is 0 or 1, i.e., $1 - x_i$ is the complement of $x_i$. For example, given a binary number [01101.101], we can obtain its negation [10010.011] by complementing [01101.101] bitwise and adding [0.001] to it, i.e., by [10010.010] + [0.001]. By a two’s complement representation with an n-bit integral part (including a sign bit) and an m-bit fractional part, we can represent $2^{n+m}$ numbers in the range from $-2^{n-1}$ to $2^{n-1} - 2^{-m}$.

All of the binary number representations described so far (i.e., the positive number representation, the sign and magnitude representation, and the two’s complement representation) can express essentially the same number of values, that is, $2^{n+m}$ numbers, although the sign and magnitude representation expresses $2^{n+m} - 1$ numbers (i.e., one number fewer) when a number is expressed with n + m bits. Thus, these representations lose essentially no precision whether or not one of the n + m bits is used as a sign bit, although the range of the numbers is different in each case.

When we add two numbers represented in the sign and magnitude representation, we calculate the sign and the magnitude separately. When the operands (the augend and the addend) have the same sign, the sign of the sum is the same as that of the operands, and the magnitude of the sum is the sum of those of the operands. A carry of 1 from the most significant position of the magnitude part indicates overflow. On the other hand, when the signs of the operands are different, the sign of the sum is the same as that of the operand with the larger magnitude, and the magnitude of the sum is the difference of those of the operands.
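
The two's complement rule described in this section (complement every bit, then add 1 at the least significant position) can be checked with a short Python sketch. The function names and the string format with a binary point are assumptions made for illustration; the bit strings are the chapter's own example values.

```python
from fractions import Fraction

def unsigned_value(bits):
    """Value of an unsigned binary string such as '1101.101'."""
    integral, _, fraction = bits.partition(".")
    digits = integral + fraction
    return Fraction(int(digits, 2), 2 ** len(fraction))

def twos_complement_value(bits):
    """Value of a two's complement string whose first bit is the sign bit."""
    integral, _, _ = bits.partition(".")
    value = unsigned_value(bits)
    if bits[0] == "1":
        value -= 2 ** len(integral)          # negative numbers are stored as 2**n - X
    return value

def negate(bits):
    """Complement every bit, then add 1 at the least significant position."""
    integral, dot, fraction = bits.partition(".")
    digits = integral + fraction
    width = len(digits)
    complemented = int(digits, 2) ^ ((1 << width) - 1)
    result = (complemented + 1) % (1 << width)   # a carry out of the top bit is dropped
    out = format(result, "0{}b".format(width))
    return out[:len(integral)] + dot + out[len(integral):]
```

With these definitions, negate("01101.101") gives "10010.011", and twos_complement_value("10010.011") evaluates to –13.625, in agreement with the example in the text.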

The addition of two numbers represented in the two’s complement representation, X + Y, for $X = [x_{n-1} x_{n-2} \ldots x_0 . x_{-1} x_{-2} \ldots x_{-m}]$ and $Y = [y_{n-1} y_{n-2} \ldots y_0 . y_{-1} y_{-2} \ldots y_{-m}]$, where $x_{n-1}$ and $y_{n-1}$ are their sign bits, can be done as follows, no matter whether each of X and Y is a positive or negative number:
1. The sign bits are added in the same manner as the two bits, $x_i$ and $y_i$, in any other bit position, that is, according to Table 38.1. As illustrated in Fig. 38.1(a), the sign bits, $x_{n-1}$ and $y_{n-1}$, and the carry, $c_{n-1}$, from the (n – 2)-th position are added, producing the sum bit $s_{n-1}$ (in the sign bit position) and the carry $c_n$. The carry $c_n$ is always ignored, whether it is 1 or 0.
2. When $x_{n-1} = y_{n-1}$ (i.e., X and Y have the same sign bit), an overflow occurs if $s_{n-1}$ is different from $x_{n-1}$ and $y_{n-1}$ (i.e., $c_n \oplus c_{n-1} = 1$ means an overflow, as can be seen from Table 38.1). This case is illustrated in Fig. 38.1(b), where $s_{n-1} = 0$ while $x_{n-1} = y_{n-1} = 1$, and hence $s_{n-1}$ is equal to neither $x_{n-1}$ nor $y_{n-1}$.
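
The two steps above can be sketched in Python for n-bit integers (no fractional part, for brevity). The function name and the encoding of operands as non-negative bit patterns are assumptions made for this illustration.

```python
def add_twos_complement(x, y, n):
    """Add two n-bit two's complement bit patterns (0 <= x, y < 2**n).

    Returns the n-bit sum pattern and an overflow flag.
    """
    carries = [0]                            # c_0 = 0
    total = 0
    for i in range(n):                       # the sign position is treated like any other
        xi, yi, ci = (x >> i) & 1, (y >> i) & 1, carries[i]
        total |= (xi ^ yi ^ ci) << i         # s_i from Table 38.1
        carries.append((xi & yi) | (ci & (xi | yi)))   # c_{i+1} from Table 38.1
    overflow = bool(carries[n] ^ carries[n - 1])       # c_n XOR c_{n-1}
    return total, overflow                   # c_n itself is ignored
```

For n = 4, adding 0101 (5) and 0110 (6) yields the pattern 1011 with the overflow flag set, while adding 1101 (–3) and 1100 (–4) yields 1001 (–7) without overflow.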

FIGURE 38.1

Examples of addition of numbers in two’s complement representation.

Next, let us consider the subtraction of two numbers represented in the two’s complement representation, X – Y, that is, subtraction of Y from X, where each of X and Y is a positive or negative number. This can be done as addition, as explained in the previous paragraph, after taking the two’s complement of Y (i.e., deriving $2^n - Y$), no matter whether Y is a negative or positive number. In practice, the subtraction X – Y can be realized by adding X and the bitwise complement of Y with a carry input of 1 to the least significant position. This is convenient for realizing a subtracter circuit from either a serial or a parallel adder (to be described later). Henceforth, let us consider addition of n-bit positive binary integers (without the sign bit) for the sake of simplicity. Let the augend, addend, and sum be $X = [x_{n-1} x_{n-2} \ldots x_0]$, $Y = [y_{n-1} y_{n-2} \ldots y_0]$, and $S = [s_n s_{n-1} s_{n-2} \ldots s_0]$ with $s_n = c_n$, respectively, where each of $x_i$, $y_i$, and $s_i$ assumes a value of 0 or 1.
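
As a minimal sketch of the subtraction method just described, the following Python function adds X, the bitwise complement of Y, and a carry input of 1. The function name and the use of n-bit integer patterns are assumptions made for illustration.

```python
def subtract(x, y, n):
    """Compute X - Y on n-bit two's complement bit patterns."""
    mask = (1 << n) - 1
    y_complement = ~y & mask                 # bitwise complement of Y, kept to n bits
    total = x + y_complement + 1             # carry input of 1 at the least significant position
    return total & mask                      # the carry out of the sign position is ignored
```

For n = 4, subtract(0b0011, 0b0101, 4) returns the pattern 1110, which is –2 in two's complement.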

38.3 Serial Adder

A serial adder operates similarly to manual addition. At each step, the serial adder calculates the sum and carry at one bit position. It starts at the least significant bit position (i.e., i = 0), and at each successive step it moves to the next more significant bit position, where it calculates the sum and carry. At the n-th step, it calculates the sum and carry at the most significant bit position (i.e., i = n – 1). In other words, the serial adder serially adds augend X and addend Y by adding $x_i$, $y_i$, and $c_i$ at the i-th bit position from i = 0 to n – 1. From the truth table shown in Table 38.1, we have the sum bit $s_i = x_i \oplus y_i \oplus c_i$ and the carry to the next higher bit position $c_{i+1} = x_i \cdot y_i \vee c_i \cdot (x_i \vee y_i)$ (also $c_{i+1} = x_i \cdot y_i \vee c_i \cdot (x_i \oplus y_i)$), where “·” is AND, “∨” is OR, and “⊕” is XOR, and henceforth, “·” will be omitted. This serial addition can be realized by the logic network, called a serial adder or bit-serial adder, shown in Fig. 38.2, where its operation is synchronized by a clock. The addition is done at a rate of one bit per clock cycle, producing the sum bits $s_i$ at the same rate, from the least significant bit to the most significant one. In each cycle, $s_i$ and $c_{i+1}$ are calculated from $x_i$, $y_i$, and the carry from the previous cycle, $c_i$.

FIGURE 38.2

A serial adder.

The core logic network, shown in the rectangle in Fig. 38.2, for this one-bit addition for the i-th bit position is called a full adder (abbreviated as FA). A logic network for an FA using AND, OR, and XOR gates is shown in Fig. 38.3. A D-type flip-flop may be used as a delay element which stores the carry for a cycle. Full adders realized in ECL (emitter-coupled logic) are described in Chapter 35. FAs with a minimum number of logic gates are known for different types of logic gates.10
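
The behavior of the full adder and of the serial adder built around it can be sketched in Python as follows. This is a behavioral model only, assuming bit lists ordered least significant bit first; the function names are hypothetical, and the variable carry plays the role of the D-type flip-flop.

```python
def full_adder(x, y, c):
    """One-bit addition according to Table 38.1: returns (s_i, c_{i+1})."""
    s = x ^ y ^ c
    c_out = (x & y) | (c & (x ^ y))
    return s, c_out

def serial_add(x_bits, y_bits):
    """Add two equal-length bit lists, least significant bit first, one bit per cycle."""
    carry = 0                                # stored carry, reset to c_0 = 0
    sum_bits = []
    for xi, yi in zip(x_bits, y_bits):       # one loop iteration models one clock cycle
        si, carry = full_adder(xi, yi, carry)
        sum_bits.append(si)
    sum_bits.append(carry)                   # s_n = c_n
    return sum_bits
```

For example, serial_add([1, 0, 1, 1], [1, 1, 0, 0]) adds 13 and 3 and returns [0, 0, 0, 0, 1], i.e., 16.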

FIGURE 38.3

A full adder.

A serial subtracter can be constructed with a minor modification of a serial adder, as explained in the last paragraph of Section 38.2.

38.4 Ripple Carry Adder

A parallel adder performs addition at all bit positions simultaneously, so it is faster than a serial adder. The simplest parallel adder is a ripple carry adder. An n-bit ripple carry adder is constructed by cascading n full adders, as shown in Fig. 38.4. The carry output of each FA is connected to the carry input of the FA of the next higher position. The amount of its hardware is proportional to n. Its worst-case delay is also proportional to n because of ripple carry propagation. In designing an FA for a fast ripple carry adder, it is critical to minimize the delay from the carry-in, $c_i$, to the carry-out, $c_{i+1}$.

FIGURE 38.4

A ripple carry adder.

An FA can be realized with logic gates, such as AND gates, OR gates, and XOR gates, as exemplified in Fig. 38.3, and also can be realized with MOSFETs, including pass transistors,18,23 such that a carry $c_i$ goes through logic gates which have some delays. But the speed of adders is very important for the speed of the entire computer. So, FAs are usually realized with more sophisticated transistor circuits using MOSFETs such that a carry $c_i$ can propagate fast to higher bit positions through pass transistors.22 An example of such an adder, realized in CMOS, is shown in Fig. 38.5. In Fig. 38.5(a), a carry propagates through transmission gate T (described in Chapter 36, Section 36.2). When $x_i = y_i = 0$, T becomes non-conductive and nMOSFETs 3 and 4 become conductive. Then, the carry-out, $c_{i+1}$, becomes 0 because the carry-out terminal $c_{i+1}$ is connected to the ground through 3 and 4. When $x_i = y_i = 1$, T becomes non-conductive and pMOSFETs 1 and 2 become conductive. Then, the carry-out, $c_{i+1}$, becomes 1 because the carry-out terminal $c_{i+1}$ is connected to the power supply Vdd through 1 and 2. When $x_i = 0$ and $y_i = 1$, or $x_i = 1$ and $y_i = 0$, T becomes conductive and the carry-out terminal is connected to neither Vdd nor the ground, so the carry-in $c_i$ is sent to $c_{i+1}$ as the carry-out. Thus, we have the values of $c_{i+1}$ for different combinations of values of $x_i$, $y_i$, and $c_i$, as shown in Table 38.1. This carry path is called a Manchester carry chain. (T1 is another transmission gate, whereby a circuit on the carry path is simplified and carry propagation is sped up, and nMOSFET 9 works as a pass transistor.) A ripple carry adder with a Manchester carry chain is referred to as a Manchester adder. This idea combines very well with the carry skip technique to be mentioned in Section 38.5.

FIGURE 38.5

A Manchester-type full adder in CMOS. (a) Full adder with non-complemented carries. (b) Full adder with complemented carries.

The FA in Fig. 38.5(a) cannot send a carry over many positions in a ripple carry adder. For speedup, we need to insert an inverter every few positions to send a high output power over many higher bit positions. In order to reduce the number of inverters, which have delays in themselves, we can use the FA shown in Fig. 38.5(b), which works with complemented carries. An example with an inverter inserted at the end of every four cascaded FAs is shown in Fig. 38.6, where a block of four FAs of Fig. 38.5(a) and a block of four FAs of Fig. 38.5(b) are alternated. In Fig. 38.5(b), the inverters consisting of MOSFETs 5, 6, 7, and 8 are eliminated from (a), and the function $x_i \oplus y_i$ at point 10 in (b) is the complement of the function at the corresponding point in (a).
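
The three cases of the Manchester carry path described in this section can be modeled behaviorally in Python. This sketch captures only the logic values, not the transistor-level circuit of Fig. 38.5; the function names are assumptions.

```python
def manchester_carry(x, y, c_in):
    """Carry-out of one position of a Manchester carry chain (logic values only)."""
    if x == y:
        # x = y = 0: the node is pulled to ground; x = y = 1: it is pulled to Vdd.
        return x
    # x != y: the transmission gate conducts and passes the carry-in.
    return c_in

def ripple_carry_add(x, y, n):
    """n-bit ripple carry addition; the result includes s_n = c_n."""
    carry, total = 0, 0
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        total |= (xi ^ yi ^ carry) << i
        carry = manchester_carry(xi, yi, carry)
    return total | (carry << n)
```

The carry values produced by manchester_carry agree with the column for $c_{i+1}$ in Table 38.1 for all eight input combinations.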

FIGURE 38.6

A ripple carry adder realized by connecting Figs. 38.5(a) and (b).

For high speed, a Manchester full adder realized in dynamic CMOS is used instead of the Manchester full adder shown in static CMOS, where dynamic CMOS and static CMOS are two different variations of CMOS. For example, a Manchester full adder in dynamic CMOS is used inside an adder (to be mentioned later) which is more complex but faster than ripple carry adders.7 In the simultaneous addition in all n bit positions, a carry propagates n positions in the worst case, but on the average, it propagates only about $\log_2 n$ positions.13 The average computation time of a ripple carry adder can be reduced by detecting the carry completion, because we need not always wait for the worst delay. An adder with such a mechanism is called a carry completion detection adder3 and is useful for asynchronous systems. When an FA is realized with ordinary logic gates, say NOR gates only, the total number of logic gates in the ripple carry adder is not minimized, even if the number of logic gates in each FA is minimized. But the number of logic gates in the ripple carry adder can be reduced by using modules that have more than one input for a carry-in (the carry-in $c_i$, for example, in Fig. 38.7 is represented by two lines instead of one line) and more than one output for a carry-out (the complemented carry-out $\bar{c}_{i+1}$ in Fig. 38.7 is represented by three lines), as shown in Fig. 38.7 (where modules are shown in dot-lined rectangles), instead of using FAs which have only one input for a carry-in and only one output for a carry-out. The number of NOR gates of such a module is minimized by the integer-programming logic design method,11 and it is found that there are 13 minimal modules. Different types of ripple carry adders can be realized by cascading such minimal modules. Some of these adders have carry propagation times shorter than that of the ripple carry adder realized with FAs with NOR gates. In addition, an adder that has a minimum number of NOR gates can be constructed by cascading the three consecutive minimal modules shown in dot-lined rectangles in Fig. 38.7, where the module for the least significant bit position is slightly modified (i.e., the two carry inputs are replaced by a single carry input) and one NOR gate is added to convert the multiple-line carry out of the adder into a single-line carry out. It has been proved that the total number of NOR gates in this ripple carry adder is minimum for any value of n.8 Also, there is a ripple adder such that the number of connections, instead of the number of logic gates, is minimized.14 Related adders are referred to in Section 2.4.3 of Ref. 11.
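
The remark that a carry travels far fewer than n positions on the average can be explored with a small exhaustive experiment in Python. This is only a sketch under an assumed definition of chain length (a chain starts where a carry is generated and grows by one for each following position that propagates it); Ref. 13 may define the quantity differently, and the function names are hypothetical.

```python
from itertools import product
from math import log2

def longest_carry_chain(x, y, n):
    """Longest number of positions travelled by a single carry when adding x and y."""
    longest = current = 0
    for i in range(n):
        xi, yi = (x >> i) & 1, (y >> i) & 1
        if xi & yi:                          # a carry is generated: a new chain starts
            current = 1
        elif (xi ^ yi) and current:          # an existing carry is propagated
            current += 1
        else:                                # no carry arrives, or it is absorbed here
            current = 0
        longest = max(longest, current)
    return longest

def average_longest_chain(n):
    """Average of the longest chain over all pairs of n-bit operands."""
    pairs = product(range(1 << n), repeat=2)
    return sum(longest_carry_chain(x, y, n) for x, y in pairs) / 4 ** n
```

Under this definition, average_longest_chain(8) stays below log2(8) = 3, whereas the worst case is 8 positions.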

FIGURE 38.7 Three consecutive minimal modules for a ripple carry adder that has the minimal number of NOR gates for any arbitrary bit length. (A ripple adder that has the minimal number of NOR gates for any arbitrary bit length can be realized by cascading these three consecutive minimal modules.)

38.5 Carry Skip Adder

In a ripple carry adder, a carry propagates through the i-th FA when $x_i \neq y_i$, i.e., $x_i \oplus y_i = 1$. Henceforth, we denote $x_i \oplus y_i$ as $p_i$. A carry propagates through a block of consecutive FAs when all $p_i$’s in the block are 1. This condition (i.e., all $p_i$’s are 1) is called the carry propagation condition of the block. A carry skip adder is a ripple carry adder that is partitioned into several blocks of FAs, with a carry skip circuit attached to each block, as shown in Fig. 38.8.9 A carry skip circuit detects the carry propagation condition of the block and lets the carry from the next lower block bypass the block when the condition holds. In Fig. 38.8, carry skip circuits are not attached to the blocks at the most and least significant few positions because the attachment does not speed up the carry propagation much.

FIGURE 38.8

A carry skip adder.

In the carry skip circuit included in Fig. 38.8, the carry output, $C_{h+1}$, from block h that consists of k FAs starting from the j-th position is calculated as:

$$C_{h+1} = c_{j+k} \vee P_h C_h$$

where $c_{j+k}$ is the carry from the FA at the most significant position of the block,

$$P_h = p_{j+k-1} p_{j+k-2} \cdots p_j$$

is the formula for the carry propagation condition of the block, $C_h$ is the carry from the next lower block, and the $p_i$’s are calculated in the FAs. An example of a carry skip circuit is shown in Fig. 38.9.
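
The block carry formula above can be exercised with a functional Python model of a carry skip adder. This sketch assumes that n is a multiple of the block size k and uses hypothetical names; it reproduces the logic values only. In software the skip term is logically redundant, since the rippled carry already contains it; in hardware its role is to shorten the carry path.

```python
def carry_skip_add(x, y, n, k):
    """Add two n-bit integers using blocks of k full adders (n a multiple of k)."""
    block_carry = 0                          # C_h entering the lowest block
    total = 0
    for j in range(0, n, k):                 # block starting at position j
        carry = block_carry
        block_propagate = 1                  # P_h = p_{j+k-1} ... p_j
        for i in range(j, j + k):
            xi, yi = (x >> i) & 1, (y >> i) & 1
            p = xi ^ yi
            total |= (p ^ carry) << i        # s_i
            carry = (xi & yi) | (p & carry)  # ripple inside the block
            block_propagate &= p
        # C_{h+1} = c_{j+k} OR P_h C_h
        block_carry = carry | (block_propagate & block_carry)
    return total | (block_carry << n)
```

For instance, carry_skip_add(0b11111111, 0b00000001, 8, 4) returns 256: the carry generated at position 0 ripples through the lowest block and satisfies the propagation condition of the upper block.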

FIGURE 38.9

A carry skip circuit used in Fig. 38.8.

A carry may ripple through FAs in the block where it is generated, bypass the blocks where the carry propagation condition holds, and then ripple through FAs in the block where the carry propagation condition does not hold. When all blocks are of the same size, k FAs, the worst case occurs when a carry is generated at the least significant position and propagates to the most significant position. The worst delay is the sum of the delay for rippling through k – 1 FAs, the delay for bypassing n/k – 2 blocks, and the delay for rippling through k – 1 FAs. In the case that k is a constant independent of n, as well as in the case that k is proportional to n, the delay is proportional to n. We can reduce the worst delay to being proportional to $\sqrt{n}$ by letting k be proportional to $\sqrt{n}$. The amount of hardware for the entire adder is proportional to n in any case. Applying to the basic carry skip adder the same principle by which the carry skip adder was developed from the ripple carry adder, we obtain a two-level carry skip adder for further improvement. Recursive application of the principle yields a multi-level carry skip adder.12
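
The dependence of the worst delay on the block size can be illustrated with a simple delay model in Python. The model is an assumption made for illustration: it charges t_fa per full adder rippled through and t_skip per bypassed block, following the three terms named in the text, and it only considers block sizes that divide n and leave at least two blocks.

```python
def carry_skip_worst_delay(n, k, t_fa=1.0, t_skip=1.0):
    """Worst delay: ripple k-1 FAs, bypass n/k - 2 blocks, ripple k-1 FAs."""
    return 2 * (k - 1) * t_fa + (n / k - 2) * t_skip

def best_block_size(n, t_fa=1.0, t_skip=1.0):
    """Block size (a divisor of n with at least two blocks) giving the smallest delay."""
    candidates = [k for k in range(1, n // 2 + 1) if n % k == 0]
    return min(candidates, key=lambda k: carry_skip_worst_delay(n, k, t_fa, t_skip))
```

With unit delays, the best delay found is 20 for n = 64 and 44 for n = 256, so quadrupling n roughly doubles the delay, as expected when k is chosen proportional to $\sqrt{n}$.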

38.6 Carry Look-Ahead Adder

As previously stated, the carry, $c_{i+1}$, produced at the i-th position is calculated as $c_{i+1} = x_i \cdot y_i \vee c_i \cdot (x_i \oplus y_i)$. This means that a carry is generated if both $x_i$ and $y_i$ are 1, or an incoming carry is propagated if one of $x_i$ and $y_i$ is 1 and the other is 0. Therefore, letting $g_i$ denote $x_i y_i$, we have $c_{i+1} = g_i \vee c_i p_i$, where $p_i = x_i \oplus y_i$. Here, $g_i$ is the formula for the carry generation condition at the i-th position, i.e., when $g_i$ is 1, a carry is generated at this position. Substituting $g_{i-1} \vee p_{i-1} c_{i-1}$ for $c_i$, we get

$$c_{i+1} = g_i \vee p_i g_{i-1} \vee p_i p_{i-1} c_{i-1}$$

Recursive substitution yields

$$c_{i+1} = g_i \vee p_i g_{i-1} \vee p_i p_{i-1} g_{i-2} \vee \cdots \vee p_i p_{i-1} \cdots p_0 c_0$$

A carry look-ahead adder can be realized according to this expression, as illustrated in Fig. 38.10 for the case of four bits.21 According to this expression, the $c_{i+1}$’s are calculated at all positions in parallel. It is hard to realize an n-bit carry look-ahead adder precisely according to this expression, unless n is small, because the maximum fan-in and fan-out restrictions are violated at higher positions. Large fan-out causes large delay, so the maximum fan-out is restricted. Also, the maximum fan-in is usually limited to 5 or less. There are some ways to alleviate this difficulty, as follows.

FIGURE 38.10

A 4-bit carry look-ahead adder.

One way is to partition the carry look-ahead adder into several blocks such that each block consists of k positions, starting from the j-th one. In a block h, the carry at each position, $c_{i+1}$, where $j \le i \le j + k - 1$, is calculated as

$$c_{i+1} = g_i \vee p_i g_{i-1} \vee \cdots \vee p_i p_{i-1} \cdots p_{j+1} g_j \vee p_i p_{i-1} \cdots p_{j+1} p_j C_h$$

where $C_h$, i.e., $c_j$, is the carry from the next lower block. The carry from the next lower block, $C_h$, goes to only the positions of the block h, so the fan-outs and fan-ins do not increase beyond a certain value. Therefore, we can form an n-bit adder by cascading n/k k-bit carry look-ahead adders, where k is a small constant independent of n, often 4. The worst delay of this type of adder and also the hardware amount are proportional to n.

Another way of alleviating the difficulty is to apply the principle of carry look-ahead recursively to groups of blocks. The carry output of block h, $C_{h+1}$ (= $c_{j+k}$), is calculated as

$$C_{h+1} = g_{j+k-1} \vee p_{j+k-1} g_{j+k-2} \vee \cdots \vee p_{j+k-1} p_{j+k-2} \cdots p_{j+1} g_j \vee p_{j+k-1} p_{j+k-2} \cdots p_{j+1} p_j C_h$$

This means that in the block, a carry is generated if

$$G_h = g_{j+k-1} \vee p_{j+k-1} g_{j+k-2} \vee \cdots \vee p_{j+k-1} p_{j+k-2} \cdots p_{j+1} g_j$$

is 1, and an incoming carry is propagated if $P_h = p_{j+k-1} p_{j+k-2} \cdots p_{j+1} p_j$ is 1. $G_h$ is the formula for the carry generation condition of the block and $P_h$ is the formula for the carry propagation condition of the block. (They are shown as P and G in Fig. 38.10.) Let us consider a super-block, that is, a group of several consecutive blocks. The carry generation and carry propagation conditions of a super-block are detected from those of the blocks, in the same way that $G_h$ and $P_h$ are detected from the $g_i$’s and $p_i$’s. Once the carry input to a super-block is given, the carry outputs from the blocks in the super-block are calculated immediately. Consequently, we obtain a fast adder in which small carry look-ahead circuits, which include carry calculation circuits, are connected in a tree form. Figure 38.11 shows a 4-bit carry look-ahead circuit and Fig. 38.12 shows a 16-bit carry look-ahead adder using the 4-bit carry look-ahead circuits, where carry look-ahead circuits are shown as CLA in Fig. 38.12. The worst delay of this type of adder is proportional to log n. The number of logic gates is proportional to n.
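
The expanded carry expression and the block conditions $G_h$ and $P_h$ can be checked with a Python sketch. It evaluates each carry directly from the $g_i$’s, $p_i$’s, and $c_0$, without using any previously computed carry, to mirror the parallel structure; the function names and the list encoding (least significant position first) are assumptions.

```python
def lookahead_carries(g, p, c0):
    """Return [c_0, c_1, ..., c_n], each c_{i+1} computed from the expanded expression."""
    carries = [c0]
    for i in range(len(g)):
        terms = []
        for j in range(i, -1, -1):           # term p_i p_{i-1} ... p_{j+1} g_j
            term = g[j]
            for t in range(j + 1, i + 1):
                term &= p[t]
            terms.append(term)
        last = c0                            # term p_i p_{i-1} ... p_0 c_0
        for t in range(i + 1):
            last &= p[t]
        terms.append(last)
        carries.append(int(any(terms)))
    return carries

def block_generate_propagate(g, p):
    """G_h and P_h of a block, given its g_i's and p_i's."""
    G, P = 0, 1
    for gi, pi in zip(g, p):
        G = gi | (pi & G)
        P &= pi
    return G, P

def cla_add(x, y, n):
    """n-bit addition using the look-ahead carries; the result includes s_n = c_n."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    c = lookahead_carries(g, p, 0)
    total = sum((p[i] ^ c[i]) << i for i in range(n))
    return total | (c[n] << n)
```

For any block, the carry output given by lookahead_carries equals G | (P & c0), which is the relation $C_{h+1} = G_h \vee P_h C_h$ used to build the tree of carry look-ahead circuits.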

FIGURE 38.11

A 4-bit carry look-ahead circuit.

FIGURE 38.12 A 16-bit carry look-ahead adder. CLA stands for a carry look-ahead circuit which includes the carry calculation circuit shown in Fig. 38.10.

38.7 Carry Select Adder

We can reduce the worst delay of a ripple carry adder by partitioning the adder into two blocks: one for higher bit positions and the other for lower bit positions. In the block for higher bit positions, we calculate two candidate sums in parallel, one assuming a carry input of 0 from the block for lower bit positions and the other assuming a carry input of 1; then we select the correct sum based on the actual carry output from the block for lower bit positions. When we partition the adder into two blocks of the same size, the delay becomes about half because the calculations in these two blocks are carried out concurrently. An adder based on this principle is called a carry select adder.1 We can further reduce the delay by partitioning the adder into more blocks. Figure 38.13 shows a block diagram of a carry select adder. When all blocks are of the same size, k positions, the worst case occurs when a carry is generated at the least significant position and stops at the most significant position. The worst delay is the sum of the delay for rippling through k – 1 FAs and the delay for n/k – 1 selectors.

FIGURE 38.13

A carry select adder.

In the case that k is a constant independent of n, as well as in the case that k is proportional to n, the delay is proportional to n. We can reduce the worst delay to being proportional to $\sqrt{n}$ by letting k be proportional to $\sqrt{n}$. The amount of hardware is proportional to n in any case. Note that a selector is unnecessary for the least significant few positions (probably fewer than k) in Fig. 38.13 because it is known whether the carry-in is 0 or 1. We can reduce the amount of hardware by calculating the two candidate sums using only one adder in each block.19 Applying the principle used to develop the carry select adder to each block, we can realize a two-level carry select adder. Recursive application of the principle yields a multi-level carry select adder. A conditional sum adder16 can be regarded as the extreme case. A carry select adder is used in a high-performance microprocessor.7
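
The carry select principle can be sketched functionally in Python. The sketch assumes that n is a multiple of the block size k and uses hypothetical names; ripple_block stands in for a k-bit ripple carry block and is modeled with integer addition for brevity.

```python
def ripple_block(x, y, width, carry_in):
    """Stand-in for a width-bit ripple carry block: returns (sum bits, carry-out)."""
    total = x + y + carry_in
    return total & ((1 << width) - 1), total >> width

def carry_select_add(x, y, n, k):
    """Add two n-bit integers with blocks of k positions (n a multiple of k)."""
    mask = (1 << k) - 1
    carry, result = 0, 0
    for j in range(0, n, k):
        xb, yb = (x >> j) & mask, (y >> j) & mask
        if j == 0:
            # The carry-in of the lowest block is known, so no selector is needed.
            s, carry = ripple_block(xb, yb, k, 0)
        else:
            s0, c0 = ripple_block(xb, yb, k, 0)   # candidate assuming carry-in 0
            s1, c1 = ripple_block(xb, yb, k, 1)   # candidate assuming carry-in 1
            s, carry = (s1, c1) if carry else (s0, c0)   # selector
        result |= s << j
    return result | (carry << n)
```

In hardware, the two candidates of every block are computed concurrently, so only the selectors lie on the path from the lowest block to the sum output; the sequential loop here reproduces the values, not the timing.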

38.8 Carry Save Adder

When we add up several numbers sequentially, it is not necessary to propagate the carries during each addition. Instead, the carries generated during an addition may be saved as partial carries and added with the next operand during the next addition. Namely, we can accelerate each addition by postponing the carry propagation. This leads to the concept of carry save addition. We may add up numbers by a series of carry save additions, followed by a carry propagate addition. Namely, for multiple-operand addition, only one carry propagate addition is required. An adder for carry save addition is referred to as a carry save adder, while the adders mentioned in the previous sections are called carry propagate adders. A carry save adder sums up a partial sum and a partial carry from the previous stage as well as an operand and produces a new partial sum and partial carry. An n-bit carry save adder consists of just n full adders without interconnections among them.
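
Carry save addition can be sketched in Python with bitwise operations, where each bit position behaves as an independent full adder. The function names are assumptions, and Python's unbounded integers are used so that no word-length handling is needed.

```python
def carry_save_add(a, b, c):
    """One carry save addition: three operands in, (partial sum, partial carry) out."""
    partial_sum = a ^ b ^ c                            # sum output of each full adder
    partial_carry = ((a & b) | (b & c) | (a & c)) << 1   # carry outputs, weighted by 2
    return partial_sum, partial_carry

def add_operands(operands):
    """Add several numbers by carry save additions and one carry propagate addition."""
    partial_sum, partial_carry = 0, 0
    for operand in operands:
        partial_sum, partial_carry = carry_save_add(partial_sum, partial_carry, operand)
    return partial_sum + partial_carry                 # the only carry propagate addition
```

For example, carry_save_add(5, 3, 6) returns (0, 14), whose two components add up to 14 = 5 + 3 + 6, and add_operands([13, 7, 9, 22]) returns 51.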

References
1. Bedrij, O. J., “Carry-select adder,” IRE Trans. Elec. Comput., vol. EC-11, pp. 340–346, June 1962.
2. Cavanagh, J. J. F., Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
3. Gilchrist, B., J. H. Pomerene, and S. Y. Wong, “Fast carry logic for digital computers,” IRE Trans. Elec. Comput., vol. EC-4, pp. 133–136, 1955.
4. Hennessy, J. L. and D. A. Patterson, Computer Architecture: A Quantitative Approach, Appendix A, Morgan Kaufmann Publishers, 1990.
5. Hwang, K., Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
6. Koren, I., Computer Arithmetic Algorithms, Prentice-Hall, 1993.
7. Kowaleski, J. A. Jr. et al., “A dual-execution pipelined floating-point CMOS processor,” ISSCC Digest of Technical Papers, pp. 358–359, Feb. 1996.
8. Lai, H.-C. and S. Muroga, “Minimum parallel binary adders with NOR (NAND) gates,” IEEE TC, pp. 648–659, Sept. 1979.
9. Lehman, M. and N. Burla, “Skip techniques for high-speed carry propagation in binary arithmetic units,” IRE Trans. Elec. Comput., vol. EC-10, pp. 691–698, Dec. 1961.
10. Liu, T.-K., K. Hohulin, L.-E. Shiau, and S. Muroga, “Optimal one-bit full adders with different types of gates,” IEEE TC, pp. 63–70, Jan. 1974.
11. Muroga, S., “Computer-aided logic synthesis for VLSI chips,” pp. 1–103, Advances in Computers, vol. 32, Ed. by M. C. Yovits, Academic Press, 1991.
12. Omondi, A. R., Computer Arithmetic Systems: Algorithms, Architecture and Implementations, Prentice-Hall, 1994.
13. Reitwiesner, G. W., “The determination of carry propagation length for binary addition,” IRE Trans. Elec. Comput., vol. EC-9, pp. 35–38, 1960.
14. Sakurai, A. and S. Muroga, “Parallel binary adders with a minimum number of connections,” IEEE Trans. Comput., C-32, pp. 969–976, Oct. 1983. (Correction: In Fig. 7, labels, a0 and c0, should be interchanged.)
15. Scott, N. R., Computer Number Systems & Arithmetic, Prentice-Hall, 1985.
16. Sklansky, J., “Conditional sum addition logic,” IRE Trans. Elec. Comput., vol. EC-9, pp. 226–231, June 1960.
17. Spaniol, O., Computer Arithmetic Logic and Design, John Wiley & Sons, 1981.
18. Suzuki, M. et al., “A 1.5-ns 32-b CMOS ALU in double pass-transistor logic,” IEEE Jour. of Solid-State Circuits, pp. 1145–1151, Nov. 1993.
19. Tyagi, A., “A reduced-area scheme for carry-select adders,” IEEE Trans. Comput., vol. 42, pp. 1163–1170, Oct. 1993.
20. Waser, S. and M. J. Flynn, Introduction to Arithmetic for Digital Systems Designers, Holt, Rinehart and Winston, 1982.
21. Weinberger, A. and J. L. Smith, “A one-microsecond adder using one-megacycle circuitry,” IRE Trans. Elec. Comput., vol. EC-5, pp. 65–73, 1956.
22. Weste, N. H. E. and K. Eshraghian, Principles of CMOS VLSI Design, 2nd ed., Addison Wesley, 1993.
23. Zhuang, N. and H. Wu, “A new design of the CMOS full adder,” IEEE JSSC, pp. 840–844, May 1992. (Full adder with transfer gates.)


Takagi, N., et al. "Multipliers" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


39 Multipliers

Naofumi Takagi, Nagoya University
Charles R. Baugh, C. R. Baugh and Associates
Saburo Muroga, University of Illinois at Urbana-Champaign

39.1 Introduction
39.2 Sequential Multiplier
39.3 Array Multiplier
39.4 Multiplier Based on Wallace Tree
39.5 Multiplier Based on a Redundant Binary Adder Tree

39.1 Introduction Many microprocessors and digital signal processors now have fast multipliers in them. There are several types of multipliers with different speeds, areas, and configurations. For the details of multipliers and multiplication methods, see Refs. 3–5, 7–10, and 14. Here, let us consider multiplication of n-bit positive binary integers (without the sign bit), where a multiplicand $X = [x_{n-1} x_{n-2} \ldots x_0]$ is multiplied by a multiplier $Y = [y_{n-1} y_{n-2} \ldots y_0]$ to derive the product $P = [p_{2n-1} p_{2n-2} \ldots p_0]$, where each of $x_i$, $y_i$, and $p_i$ takes a value of 0 or 1.

39.2 Sequential Multiplier A sequential multiplier works in a manner similar to manual multiplication of two decimal numbers, although two binary numbers are multiplied in this case. A multiplicand $X = [x_{n-1} x_{n-2} \ldots x_0]$ is multiplied by each bit $y_j$ of a multiplier $Y = [y_{n-1} y_{n-2} \ldots y_0]$, forming the multiplicand-multiple $Z = [z_{n-1} z_{n-2} \ldots z_0]$, where $z_i = x_i y_j$ for each i = 0, …, n – 1. Then, Z is shifted left by j bit positions and is added, in all digit positions in parallel, to the partial product $P_{j-1}$ formed by the previous steps, to generate the partial product $P_j$. Repeating this step for j = 0 to n – 1, the 2n-bit product $P = [p_{2n-1} p_{2n-2} \ldots p_0]$ is derived. The only difference between this sequential multiplier and manual multiplication is that each multiplicand-multiple is added as soon as it is formed, instead of all multiplicand-multiples being added together once at the end. This sequential multiplier is realized as shown in Fig. 39.1. It consists of an n-bit Multiplicand Register for storing multiplicand X, a 2n-bit Shift Register for storing multiplier Y and partial product $P_{j-1}$, a Multiplicand-Multiple Generator (denoted as MM Generator) for generating a multiplicand-multiple $y_j \cdot X$, and an n-bit Parallel Adder. Initially, X is stored in the Multiplicand Register, Y is stored in the lower half (i.e., the least significant bit positions) of the Shift Register, and the upper half of the Shift Register stores 0. This sequential multiplier performs one iteration step described above in each clock cycle. In other words, in each clock cycle, a multiplier bit $y_j$ is read from the right-most position of Shift Register. A multiplicand-multiple $y_j \cdot X$, which is X or 0 based on whether $y_j$ is 1 or 0, is produced by Multiplicand-Multiple Generator and is fed to Parallel Adder. The upper n bits of the partial product are read from the upper half of Shift Register and also fed to Parallel Adder. The content of Shift Register is shifted one position to the right. The (n + 1)-bit output of Parallel Adder including the carry output, which is the upper part of the updated partial product, is stored into the upper (n + 1) positions of Shift Register. After n cycles, Shift Register holds the 2n-bit product, P.
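The register-level behavior described above can be mimicked with integer operations. This is a minimal sketch under the assumption that a Python integer stands in for the 2n-bit Shift Register; the function name `sequential_multiply` and its variable names are illustrative only.

```python
def sequential_multiply(x, y, n):
    """Model one-bit-per-cycle shift-and-add multiplication of n-bit integers."""
    mask = (1 << n) - 1
    # Upper half holds 0, lower half holds the multiplier Y.
    shift_register = y & mask
    for _ in range(n):
        y_j = shift_register & 1               # right-most bit of the register
        multiple = x if y_j else 0             # Multiplicand-Multiple Generator
        upper = shift_register >> n            # upper n bits of the partial product
        adder_out = upper + multiple           # (n + 1)-bit Parallel Adder output
        lower_shifted = (shift_register & mask) >> 1  # lower half shifted right
        # The adder output fills the upper (n + 1) positions after the shift.
        shift_register = (adder_out << (n - 1)) | lower_shifted
    return shift_register


print(sequential_multiply(13, 11, n=4))
```

Each loop iteration corresponds to one clock cycle; the multiplier bits are consumed from the right end while the product grows into the register from the left.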


FIGURE 39.1

A sequential multiplier.

We can use any adder described in Chapter 38. The faster the adder, the shorter the clock cycle and hence the faster the multiplier. When we use a carry save adder as Parallel Adder, we have to modify Shift Register so that its upper half stores two binary numbers (i.e., a partial sum and a partial carry). Besides the carry save adder, we need a carry propagate adder for summing up the final partial sum and carry. We can accelerate sequential multiplication by processing multiple bits of the multiplier per clock cycle. When k bits of multiplier Y are processed per cycle, an n-bit multiplication is performed through n/k clock cycles. There are two methods for processing multiple bits of the multiplier per cycle. One is generating several candidate multiplicand-multiples and then choosing an appropriate one among them. The other is generating k multiplicand-multiples and summing them up in one cycle. Also, these two methods can be combined. In the first method, Multiplicand-Multiple Generator generates $2^k$ different multiplicand-multiples, 0, X, 2X, 3X, …, $(2^k - 1)X$. For example, when k = 2, Multiplicand-Multiple Generator generates 2X and 3X, as well as X and 0. Multiplicand-Multiple Generator consists of a look-up table containing the necessary multiples of the multiplicand and a selector for selecting an appropriate multiple. The look-up table need not hold all multiples, because several of them can be generated from others by shifting whenever necessary. For example, when k = 2, only 3X must be pre-computed and held. Extended Booth’s method13 is useful for reducing the number of pre-computed multiples. When k bits are processed per cycle, k-bit Booth’s method is applied, which recodes multiplier Y into radix-$2^k$ redundant signed-digit representation with the digit set $\{-2^{k-1}, -2^{k-1}+1, \ldots, 0, \ldots, 2^{k-1}-1, 2^{k-1}\}$, where each digit in radix $2^k$ takes a value among those in this digit set. Y is recoded into $\hat{Y}$ by considering k + 1 bits of Y, i.e., $y_{kj+k-1}, y_{kj+k-2}, \ldots, y_{kj+1}, y_{kj}, y_{kj-1}$ (i.e., k + 1 bits among $y_{n-1}, y_{n-2}, \ldots, y_0$ of Y) per cycle, instead of only a single bit of Y (say, $y_j$) per cycle, as illustrated in Fig. 39.2(a). More specifically, the j-th digit $\hat{y}_j$ of the recoded multiplier, where $j = 0, 1, \ldots, \lceil (n+1)/k \rceil - 1$, is calculated as $\hat{y}_j = -2^{k-1} \cdot y_{kj+k-1} + 2^{k-2} \cdot y_{kj+k-2} + \cdots + 2 \cdot y_{kj+1} + y_{kj} + y_{kj-1}$. In this case, since all components of $Y = [y_{n-1} y_{n-2} \ldots y_0]$ are recoded k components at a time, the recoded number becomes $\hat{Y} = [\hat{y}_{\lceil (n+1)/k \rceil - 1}\, \hat{y}_{\lceil (n+1)/k \rceil - 2} \ldots \hat{y}_0]$ with $\lceil (n+1)/k \rceil$ components, in contrast to multiplier Y, which has n components. Then we have a multiplicand-multiple $\hat{y}_j \cdot X$. Since the negation of a multiple can be produced by complementing it bitwise and adding 1 at the least significant position (this 1 is treated as a carry into the least significant position of Parallel Adder), the number of multiples to be held is reduced. For example, when k = 2, the multiplier is recoded to radix-4 redundant signed-digit representation with the digit set {–2, –1, 0, 1, 2} by means of the 2-bit Booth’s method (i.e., the extended Booth’s method with k = 2) as $\hat{y}_j = -2y_{2j+1} + y_{2j} + y_{2j-1}$, and all multiples are produced from X by shift and/or complementation. In 2-bit Booth recoding, multiplier Y is partitioned into 2-bit blocks, and then at the j-th block, the recoded multiplier digit $\hat{y}_j$ is calculated from the two bits of the block, i.e., $y_{2j+1}$ and $y_{2j}$, and the higher bit of the next lower block, i.e., $y_{2j-1}$, according to the rule shown in Table 39.1. For example, 11110001010 is recoded to $[2\,0\,\bar{2}\,1\,\bar{1}\,\bar{2}]$, where $\bar{1}$ and $\bar{2}$ denote –1

FIGURE 39.2 Recoding in the extended Booth’s method.

TABLE 39.1 The Recoding Rule of 2-bit Booth’s Method

  y_{2j+1}   y_{2j}   y_{2j-1}   ŷ_j
     0         0         0         0
     0         0         1         1
     0         1         0         1
     0         1         1         2
     1         0         0        –2
     1         0         1        –1
     1         1         0        –1
     1         1         1         0

and –2, respectively, as illustrated in Fig. 39.2(b). (In this case, whenever a bit is not available in Y, such as the next lower bit of the least significant bit, it is regarded as 0.) When the extended Booth’s method is employed, Parallel Adder is modified such that negative numbers can be processed. In the second method for processing multiple-bits of the multiplier per clock cycle, Parallel Adder sums up k + 1 numbers, using k adders. Any adder can be used, but usually carry save adders are used for the sake of speed and cost. Carry save adders can be connected either in series or in tree form. Of course, the latter is faster, but the structure of Parallel Adder becomes more complicated because of somewhat irregular wire connections. By letting k be n, we have a parallel multiplier, processing the whole multiplier-bits in one clock cycle, as will be mentioned in the following section.
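The 2-bit Booth recoding rule of Table 39.1 can be checked with a short sketch. The function names `booth_recode_radix4` and `recoded_value` are assumptions for illustration; bits that are not available in Y are taken as 0, as stated in the text.

```python
def booth_recode_radix4(bits):
    """Recode an unsigned multiplier, given as a bit string (MSB first),
    into radix-4 signed digits in {-2, -1, 0, 1, 2} (most significant first)."""
    n = len(bits)
    y = [int(b) for b in reversed(bits)]       # y[i] is bit i of the multiplier

    def bit(i):
        # Bits outside the multiplier (e.g., y_{-1} or y_n) are regarded as 0.
        return y[i] if 0 <= i < n else 0

    num_digits = (n + 2) // 2                  # ceil((n + 1) / 2)
    digits = [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
              for j in range(num_digits)]
    return digits[::-1]


def recoded_value(digits):
    """Value of a radix-4 signed-digit number, most significant digit first."""
    value = 0
    for d in digits:
        value = 4 * value + d
    return value


digits = booth_recode_radix4("11110001010")
print(digits, recoded_value(digits))  # [2, 0, -2, 1, -1, -2] 1930
```

The printed digits match the recoded example in the text, and the recoded value equals the original multiplier (11110001010 in binary is 1930).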

39.3 Array Multiplier The simplest parallel multiplier is an array multiplier2 in which the multiplicand-multiples (i.e., (yj · X)’s) are summed up one by one by means of a series of carry save adders. It has a two-dimensional array structure of full adders as shown in Fig. 39.3. Each row of full adders except the bottom one forms

© 2000 by CRC Press LLC

FIGURE 39.3

An array multiplier.

a carry save adder. The bottom row forms a ripple carry adder for the final carry propagate addition. An array multiplier is suited for VLSI realization because of its regular cellular array structure. The number of logic gates is proportional to $n^2$. The delay is proportional to n. We can reduce the delay in the final adder by using a faster carry propagate adder such as a carry select adder. Also, we can reduce the delay in the array part in Fig. 39.3 by means of the 2-bit Booth’s method mentioned in Section 39.2. Since the 2-bit Booth’s method reduces the number of multiplicand-multiples to about half, the number of necessary carry save additions is also reduced to about half, and hence, the delay in the array part is reduced to about half. But the amount of hardware is not reduced much because a 2-bit Booth recoder and multiplicand-multiple generators, which essentially work as selectors, are required. Another method to reduce the delay in the array part is to double the accumulation stream.6 Namely, we divide the multiplicand-multiples into two groups, sum up the members of each group by a series of carry save adders independently of the other group, and then sum up the two accumulations into one. The delay in the array part is reduced to about half. The 2-bit Booth’s method can be combined with this method. We can further reduce the delay by increasing the number of accumulation streams, although it complicates the circuit structure.

39.4 Multiplier Based on Wallace Tree In the multiplier based on Wallace tree,13 the multiplicand-multiples are summed up in parallel by means of a tree of carry save adders. A carry save adder sums up three binary numbers and produces two binary numbers (i.e., a partial sum and a partial carry). Therefore, using n/3 carry save adders in parallel, we can reduce the number of multiplicand-multiples from n to about 2n/3. Then, using about 2n/9 carry save adders, we can further reduce it to 4n/9. Applying this principle about $\log_{3/2} n$ times, the number of multiplicand-multiples can be reduced to only two. Finally, we sum up these two multiplicand-multiples by means of a fast carry propagate adder. Figure 39.4 illustrates a block diagram of a multiplier based on Wallace tree. This consists of full adders, just like the array multiplier described previously. (Recall that a carry save adder consists of full


FIGURE 39.4

A multiplier with Wallace tree.

adders.) The delay is small. It is proportional to log n when a fast carry propagate adder with O(log n) delay is used for the final addition. The number of logic gates is about the same as that of an array multiplier and is proportional to $n^2$. However, its circuit structure is not well suited for VLSI realization because of its complexity. The 2-bit Booth’s method can also be applied to this type of multiplier. However, it is not as effective as in the array multiplier, because the height of the Wallace tree decreases by only one or two, even though the number of the multiplicand-multiples is reduced to about half. A full adder, which is used as the basic cell in a multiplier based on Wallace tree, can be regarded as a counter which counts the 1’s in the three input bits and outputs the result as a 2-bit binary number. Namely, a full adder can be regarded as a 3-2 counter. We can also use larger counters, such as 7-3 counters and 15-4 counters, as the basic cells, instead of full adders. We can increase the regularity in the circuit structure by replacing Wallace tree with a 4-2 adder tree,12 where a 4-2 adder is formed by connecting two carry save adders, shown in the dot-lined rectangles, in series, as shown in Fig. 39.5. A 4-2 adder is an adder that sums up four binary numbers, $A = [a_{n-1} \ldots a_0]$, $B = [b_{n-1} \ldots b_0]$, $C = [c_{n-1} \ldots c_0]$, and $D = [d_{n-1} \ldots d_0]$, and produces two binary numbers, $E = [e_n \ldots e_0]$ and $F = [f_n \ldots f_0]$, where $f_0 = 0$. We can form a 4-2 adder tree by connecting 4-2 adders in a binary tree form. The delay of a multiplier based on a 4-2 adder tree is slightly larger than that of a multiplier with Wallace tree but is still proportional to log n. The number of logic gates is about the same as a multiplier based on Wallace tree, and is proportional to $n^2$. It is more suited for VLSI realization than the Wallace tree.
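The level-by-level 3-to-2 reduction can be illustrated with a behavioral sketch. It assumes unsigned operands, uses Python integers as bit vectors, and groups the operands three at a time in list order, which is only one possible wiring; `wallace_multiply` is a hypothetical name.

```python
def carry_save_add(a, b, c):
    """3-to-2 reduction: three numbers in, partial sum and partial carry out."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def wallace_multiply(x, y, n):
    """Multiply n-bit unsigned integers; return (product, carry save levels)."""
    # Multiplicand-multiples y_j * X, already shifted to their bit positions.
    operands = [(x << j) if (y >> j) & 1 else 0 for j in range(n)]
    levels = 0
    while len(operands) > 2:
        reduced = []
        full_groups = len(operands) // 3
        for g in range(full_groups):
            s, c = carry_save_add(*operands[3 * g:3 * g + 3])
            reduced += [s, c]
        # Operands that do not fill a group of three pass to the next level.
        reduced += operands[3 * full_groups:]
        operands = reduced
        levels += 1
    # Final carry propagate addition of the two remaining numbers.
    return sum(operands), levels


print(wallace_multiply(200, 123, n=8))
```

With this grouping, 32 operands need 8 levels while 17 operands need 6, which is in line with the remark that halving the number of multiplicand-multiples lowers the tree height by only one or two.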


FIGURE 39.5

A 4-2 adder.

39.5 Multiplier Based on a Redundant Binary Adder Tree There is another fast multiplier based on a rather regular structure called a multiplier based on a redundant binary adder tree.11 In it, multiplicand-multiples are generated first as in other parallel multipliers, being regarded as redundant binary numbers, and are summed up pairwise by means of redundant binary adders connected in binary tree form. Then, finally, the product represented in the redundant binary representation is converted into the ordinary binary representation. The redundant binary representation, also called the binary signed-digit representation, is a binary representation with a digit set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ denotes –1.1 An n-digit redundant binary number $A = [a_{n-1} a_{n-2} \ldots a_0]$ has the value $\sum_{i=0}^{n-1} a_i \cdot 2^i$, where $a_i$ takes a value $\bar{1}$, 0, or 1. There may be several redundant binary numbers which have the same value. For example, $[0101]$, $[011\bar{1}]$, $[1\bar{1}01]$, $[1\bar{1}1\bar{1}]$, and $[10\bar{1}\bar{1}]$ all represent 5. Because of this redundancy, we can add two redundant binary numbers without carry propagation. Let us consider the addition of two redundant binary numbers, that is, the augend, $A = [a_{n-1} a_{n-2} \ldots a_0]$, and the addend, $B = [b_{n-1} b_{n-2} \ldots b_0]$, to derive the sum, $S = [s_n s_{n-1} \ldots s_0]$, where each of $a_i$, $b_i$, and $s_i$ takes a value –1, 0, or 1. The addition without carry propagation is done in two steps. In the first step, an intermediate carry $c_{i+1}$ and an intermediate sum $d_i$ in the i-th position are determined such that $a_i + b_i = 2c_{i+1} + d_i$ is satisfied (the 2 of $2c_{i+1}$ means shifting $c_{i+1}$ to the next higher digit position as a carry), where each of $c_{i+1}$ and $d_i$ is –1, 0, or 1. In this case, $c_{i+1}$ and $d_i$ are determined such that a new carry will not be generated in the second step. In the second step, in each digit position, sum $s_i$ is determined by adding intermediate sum $d_i$ and intermediate carry $c_i$ from the next lower position, where $c_i$ takes a value –1, 0, or 1. Suppose one of augend digit $a_i$ and addend digit $b_i$ is 1 and the other is 0. If $c_{i+1} = 0$ and $d_i = 1$ were chosen in the first step, a new carry would be generated in the second step when $c_i = 1$ arrives from the next lower digit position. So, if there is a possibility of $c_i = 1$, we choose $c_{i+1} = 1$ and $d_i = \bar{1}$. On the other hand, if there is a possibility of $c_i = \bar{1}$, we choose $c_{i+1} = 0$ and $d_i = 1$. This makes use of the fact that 1 can be expressed by $[01]$ and $[1\bar{1}]$ in the redundant binary number representation. Whether $c_i$ becomes 1 or $\bar{1}$ can be detected by examining $a_{i-1}$ and $b_{i-1}$ in the next lower digit position, but not further lower digit positions. For other combinations of the values of $a_i$, $b_i$, and $c_i$, $c_{i+1}$ and $d_i$ can be similarly determined. In the second step, $s_i$ is determined by adding only two digits, $c_i$ and $d_i$. Suppose $c_i = 1$. Then $s_i$ is 0 or 1, based on whether $d_i$ is $\bar{1}$ or 0. Notice that two combinations, $c_i = d_i = 1$ and $c_i = d_i = \bar{1}$, never occur. For other combinations of the values of $c_i$ and $d_i$, $s_i$ is similarly determined. Consequently, sum digit $s_i$ at each digit position can be determined by three digit positions of the augend and addend, $a_i$, $a_{i-1}$, and $a_{i-2}$, and $b_i$, $b_{i-1}$, and $b_{i-2}$. A binary number is a redundant binary number as it is, so there is no need for converting it to the redundant binary number representation. But conversion of a redundant binary number to a binary number requires an ordinary binary subtraction. Namely, we subtract the binary number that is derived by replacing every 1 by 0 and every $\bar{1}$ by 1 in the redundant binary number, from the binary number that is derived by replacing every $\bar{1}$ by 0. For example, $[1\bar{1}0\bar{1}1]$ is converted to [00111] by the subtraction [10001] – [01010]. We need borrow propagate subtraction for this conversion. This conversion in a multiplier based on a redundant binary adder tree corresponds to the final addition in the ordinary multipliers.
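The two-step addition and the final conversion can be sketched as follows. Digits are plain integers –1, 0, 1, listed most significant first. The text spells out only the case in which one digit is 1 and the other is 0; the remaining rows of the first-step rule below are filled in with a commonly used choice and should be read as an assumption. The function names are illustrative.

```python
def rb_value(digits):
    """Value of a redundant binary number (digits -1, 0, 1; MSD first)."""
    value = 0
    for d in digits:
        value = 2 * value + d
    return value


def rb_to_binary(digits):
    """Convert by one ordinary subtraction: (positive part) - (negative part)."""
    positive = rb_value([1 if d == 1 else 0 for d in digits])
    negative = rb_value([1 if d == -1 else 0 for d in digits])
    return positive - negative


def rb_add(a, b):
    """Add two equal-length redundant binary numbers without carry propagation."""
    n = len(a)
    a_le, b_le = a[::-1], b[::-1]              # index i is digit position i
    c = [0] * (n + 1)                          # intermediate carries c_i
    d = [0] * n                                # intermediate sums d_i
    for i in range(n):
        total = a_le[i] + b_le[i]
        # If both lower digits are nonnegative, the incoming carry c_i cannot be -1.
        lower_nonneg = i == 0 or (a_le[i - 1] >= 0 and b_le[i - 1] >= 0)
        if total == 2:
            c[i + 1], d[i] = 1, 0
        elif total == 1:
            c[i + 1], d[i] = (1, -1) if lower_nonneg else (0, 1)
        elif total == 0:
            c[i + 1], d[i] = 0, 0
        elif total == -1:
            c[i + 1], d[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:
            c[i + 1], d[i] = -1, 0
    # Second step: each sum digit needs only d_i and c_i; no new carry arises.
    s = [d[i] + c[i] for i in range(n)] + [c[n]]
    return s[::-1]


print(rb_to_binary([1, -1, 0, -1, 1]))          # 7, as in the example above
print(rb_add([0, 1, 1, -1], [1, -1, 0, 1]))     # a redundant binary form of 10
```

Every sum digit depends only on positions i, i – 1, and i – 2 of the operands, so all digits can be computed in parallel.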

References
1. Avizienis, A., “Signed-digit number representations for fast parallel arithmetic,” IRE Trans. Elec. Comput., vol. EC-10, pp. 389–400, Sep. 1961.
2. Baugh, C. R. and B. A. Wooley, “A two’s complement parallel array multiplier,” IEEE Trans. Computers, vol. C-22, no. 12, pp. 1045–1047, Dec. 1973.
3. Cavanagh, J. J. F., Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
4. Hennessy, J. L. and D. A. Patterson, Computer Architecture: A Quantitative Approach, Appendix A, Morgan Kaufmann Publishers, 1990.
5. Hwang, K., Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
6. Iwamura, J., et al., “A 16-bit CMOS/SOS multiplier-accumulator,” Proc. IEEE Intnl. Conf. on Circuits and Computers, 12.3, Sept. 1982.
7. Koren, I., Computer Arithmetic Algorithms, Prentice-Hall, 1993.
8. Omondi, A. R., Computer Arithmetic Systems: Algorithms, Architecture and Implementations, Prentice-Hall, 1994.
9. Scott, N. R., Computer Number Systems & Arithmetic, Prentice-Hall, 1985.
10. Spaniol, O., Computer Arithmetic Logic and Design, John Wiley & Sons, 1981.
11. Takagi, N., H. Yasuura, and S. Yajima, “High-speed VLSI multiplication algorithm with a redundant binary addition tree,” IEEE Trans. Comput., vol. C-34, no. 9, pp. 789–796, Sep. 1985.
12. Vuillemin, J. E., “A very fast multiplication algorithm for VLSI implementation,” Integration, VLSI Journal, vol. 1, no. 1, pp. 39–52, Apr. 1983.
13. Wallace, C. S., “A suggestion for a fast multiplier,” IEEE Trans. Elec. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
14. Waser, S. and M. J. Flynn, Introduction to Arithmetic for Digital Systems Designers, Holt, Rinehart and Winston, 1982.


Takagi, N., Muroga, S. "Dividers" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


40 Dividers

Naofumi Takagi, Nagoya University
Saburo Muroga, University of Illinois at Urbana-Champaign

40.1 Introduction
40.2 Subtract-And-Shift Dividers
     Restoring Method • Non-Restoring Method • SRT Method
40.3 Higher Radix Subtract-And-Shift Dividers
40.4 Even Higher Radix Dividers with a Multiplier
40.5 Multiplicative Dividers

40.1 Introduction There are two major classes of division methods: subtract-and-shift methods and multiplicative methods. For details of division methods and dividers, see Refs. 2, 5–11, and also Ref. 3 for subtract-and-shift division. Suppose two binary numbers, X and Y, are normalized to the range equal to or greater than 1/2 but smaller than 1, and also that X < Y holds for the sake of simplicity. Then, a dividend $X = [0.1x_2 \ldots x_n]$, which is an n-bit normalized binary fraction, is to be divided by a divisor, $Y = [0.1y_2 \ldots y_n]$, which is another n-bit normalized binary fraction, where each of $x_i$ and $y_i$ takes a value of 0 or 1. We calculate the quotient $Z = [0.1z_2 \ldots z_n]$ which satisfies $|X/Y - Z| < 2^{-n}$, where $z_i$ takes a value of 0 or 1.

40.2 Subtract-And-Shift Dividers The subtract-and-shift divider works in a manner similar to manual division of one decimal number by another, using paper and pencil. In the case of manual division of decimal numbers, each $x_i$ of dividend X and $y_i$ of divisor Y is selected from {0, 1, …, 9} (i.e., the radix is 10). Each $z_i$ of quotient Z is selected from {0, 1, …, 9}. In the following case of dividers for a digital system, dividend X and divisor Y are binary numbers, so each $x_i$ of X and $y_i$ of Y is selected from {0, 1}, but the quotient (which is not necessarily represented as a binary number) is expressed in radix r. The radix r is usually chosen to be $2^k$ (i.e., 2, 4, 8, and so on). So, a quotient is denoted with $Q = [0.q_1 q_2 \ldots q_{\lceil n/k \rceil}]$, to be differentiated from $Z = [0.1z_2 \ldots z_n]$ expressed in binary numbers, although both Q and Z will henceforth be called quotients. The subtract-and-shift method with a radix r iterates the recurrence step of computing $R_j = r \cdot R_{j-1} - q_j \cdot Y$, where $q_j$ is the j-th quotient digit and $R_j$ is the partial remainder after the determination of $q_j$. Initially, $R_0$ is X. Each recurrence step consists of the following four substeps. Suppose that $r = 2^k$.

1. Shift of the partial remainder $R_{j-1}$ to the left by k bit positions to produce $r \cdot R_{j-1}$.
2. Determination of the quotient digit $q_j$ by quotient-digit selection.
3. Generation of the divisor multiple $q_j \cdot Y$.
4. Subtraction of $q_j \cdot Y$ from $r \cdot R_{j-1}$ to calculate $R_j$.

The dividers for a digital system have many variations, depending on the methods in choosing the radix, the quotient-digit set from which $q_j$ is chosen ($q_j$ is not necessarily 0 or 1, even if it is in radix 2), and the representation of the partial remainder. The simplest cases are with a radix of 2 and the partial remainder represented in the non-redundant form. When r = 2, the recurrence is $R_j = 2R_{j-1} - q_j \cdot Y$. There are three methods: the restoring, the non-restoring, and the SRT methods.

Restoring Method In the radix-2 restoring method, a quotient digit, qj, is chosen from the quotient-digit set {0, 1}. When 2Rj – 1 – Y ≥ 0 (2Rj – 1 means shift of Rj – 1 by one bit position to the left), 1 is selected, and otherwise 0 is selected. Namely, Rj′ = 2Rj – 1 – Y is calculated first, and then, when Rj′ ≥ 0 holds, we set qj = 1 and Rj = Rj′, and otherwise qj = 0 and Rj = Rj′ + Y (i.e., Y is added back to Rj′). For every j, Rj is kept in the range, 0 ≤ Rj < Y. The j-th bit of the quotient in binary number Z = [0.1z2…zn], zj, is equal to qj (i.e., zj is the same as qj in radix 2 in this case of the restoring method). This method is called the restoring method, because Y is added back to Rj′ when Rj′ < 0. For speed-up, we can use 2Rj – 1 as Rj by keeping Rj – 1, instead of adding back Y to Rj′, when Rj′ < 0. Figure 40.1 shows an example of radix-2 restoring division.

FIGURE 40.1 An example of radix-2 restoring division. (Each of A, B, and C is a sign bit. Notice that D is always equal to E and is ignored.)
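The radix-2 restoring recurrence can be traced with exact rational arithmetic. This is a minimal sketch, assuming 1/2 ≤ X < Y < 1 as in the text; the function name `restoring_divide` and the sample operands are illustrative and are not taken from Fig. 40.1.

```python
from fractions import Fraction


def restoring_divide(x, y, n):
    """Radix-2 restoring division; returns (quotient bits z_1..z_n, R_n)."""
    remainder = x                              # R_0 = X
    bits = []
    for _ in range(n):
        trial = 2 * remainder - y              # R_j' = 2 R_{j-1} - Y
        if trial >= 0:
            bits.append(1)
            remainder = trial
        else:
            bits.append(0)
            remainder = trial + y              # restore: add Y back to R_j'
    return bits, remainder


x, y, n = Fraction(5, 8), Fraction(3, 4), 6
bits, r = restoring_divide(x, y, n)
z = sum(Fraction(b, 2 ** (j + 1)) for j, b in enumerate(bits))
print(bits, r)                                 # [1, 1, 0, 1, 0, 1] 1/4
print(x == z * y + r / 2 ** n)                 # True
```

The last line checks the relation $X = Z \cdot Y + 2^{-n} R_n$, which follows from unrolling the recurrence.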

Non-Restoring Method In the radix-2 non-restoring method, a quotient digit, $q_j$, is chosen from the quotient-digit set {–1, 1}. Quotient digit $q_j$ is chosen according to the sign of $R_{j-1}$. In other words, we set $q_j = 1$ when $R_{j-1} \ge 0$ and, otherwise, we set $q_j = -1$, abbreviated as $q_j = \bar{1}$. Then, $R_j = 2R_{j-1} - q_j \cdot Y$ is calculated. Even if $R_j$ is negative, Y is not added back in this method, so this method is called the non-restoring method. Note that since we have $R_0 = X > 0$ and $1/2 \le X < Y < 1$ by the assumption, we always have $q_1 = 1$ and $R_1 = 2R_0 - Y = 2X - Y > 0$, and hence, $q_2 = 1$. For every j, $R_j$ is kept in the range $-Y \le R_j < Y$. The j-th bit of the quotient in binary number $Z = [0.1z_2 \ldots z_n]$, $z_j$, is 0 or 1, based on whether $q_{j+1}$ is –1 or 1, and we always have $z_n = 1$. For example, when $Q = [0.111\bar{1}\bar{1}1]$, where $\bar{1}$ denotes –1, we have Z = [0.110011]. (The given number $[0.111\bar{1}\bar{1}1]$ is calculated as [0.111001] – [0.000110] = [0.110011]. In other words, the number derived by replacing all 1’s by 0’s and all $\bar{1}$’s by 1’s in the given number is subtracted from the number derived by replacing all $\bar{1}$’s by 0’s in the given number. This turns out to be a simple conversion between $z_j$ and $q_{j+1}$, as stated above, without requiring the subtraction.) Combining this conversion with the recurrence on $R_j$ yields the method in which $R_1 = 2X - Y$, and for j ≥ 2, when $R_{j-1} \ge 0$, $z_{j-1} = 1$ and $R_j = 2R_{j-1} - Y$ and, otherwise, $z_{j-1} = 0$ and $R_j = 2R_{j-1} + Y$. Figure 40.2 shows an example of radix-2 non-restoring division. Note that the remainder for the restoring method is always non-negative, whereas the remainder for the non-restoring method (also the SRT method to be described in the following) can be negative, and consequently, when the remainder is negative, the quotient of the latter is greater by $2^{-n}$ than the former. (This explains the difference between the quotients in Figs. 40.1 and 40.2, where R6 = 1 in Fig. 40.2 indicates that the remainder is negative.)

FIGURE 40.2

An example of radix-2 non-restoring division.
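The non-restoring recurrence and the subtraction-free conversion from Q to Z can be sketched with exact rational arithmetic. The names `q_to_z` and `nonrestoring_divide` and the sample operands are assumptions for illustration and are not taken from Fig. 40.2.

```python
from fractions import Fraction


def q_to_z(q):
    """Convert digits q_1..q_n in {-1, 1} to bits z_1..z_n.

    z_j is 1 or 0 according to whether q_{j+1} is 1 or -1, and z_n = 1.
    """
    return [1 if digit == 1 else 0 for digit in q[1:]] + [1]


def nonrestoring_divide(x, y, n):
    """Radix-2 non-restoring division; returns (q digits, z bits, R_n)."""
    remainder = x                              # R_0 = X
    q = []
    for _ in range(n):
        digit = 1 if remainder >= 0 else -1    # chosen from the sign of R_{j-1}
        remainder = 2 * remainder - digit * y  # Y is never added back
        q.append(digit)
    return q, q_to_z(q), remainder


print(q_to_z([1, 1, 1, -1, -1, 1]))            # [1, 1, 0, 0, 1, 1]
q, z, r = nonrestoring_divide(Fraction(5, 8), Fraction(3, 4), 6)
print(q, z, r)
```

The first printed line reproduces the conversion example in the text, where $Q = [0.111\bar{1}\bar{1}1]$ gives Z = [0.110011].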

Figure 40.3 shows a radix-2 non-restoring divider that performs one recurrence step in each clock cycle. Register For Partial Remainder Rj – 1 initially stores the dividend X and then, during the division, stores a partial remainder Rj – 1 which may become negative. Divisor, Y, in Register For Divisor Y is added to or subtracted from the twice of the Rj – 1 stored in Register For Partial Remainder Rj – 1 by Adder/Subtracter, based on whether the left-most bit of Register For Partial Remainder Rj – 1 (i.e., the sign bit of Rj – 1) is 1 or 0, and then the result (i.e., Rj) is stored back into Register For Partial Remainder Rj – 1. Concurrently, the complement of the sign bit of Rj – 1 (i.e., zj – 1) is fed to Shift Register For Zj – 2 (which stores the partial quotient Zj – 2 = [0.1z2…zj – 2]) from the right end, and the partial quotient stored in Shift Register For Zj –2 is shifted one position to the left. The divider performs one recurrence step in each clock cycle. After n cycles, Shift Register For Zj -2 holds the quotient Z. We can use any carry propagate adder (subtracter) as the Adder/Subtracter. The faster the adder, the shorter the clock cycle, and hence, the faster the divider.

© 2000 by CRC Press LLC

FIGURE 40.3

A radix-2 non-restoring divider.

SRT Method In the radix-2 SRT method, the quotient-digit set is {–1, 0, 1}. In this case, –1, 0, or 1 is selected as $q_j$, based on whether $2R_{j-1}$ …

… 0 makes it more difficult to satisfy Eq. 47.5.

© 2000 by CRC Press LLC

$$\forall \langle R_i, R_f \rangle:\; t_{cd}^{i} = t_{cd}^{f} \;\Rightarrow\; T_{Skew}(i, f) = 0 \tag{47.7}$$

Therefore, Eqs. 47.5 and 47.6 become

$$T_{Skew}(i, f) = t_{cd}^{i} - t_{cd}^{f} = 0 \le T_{CP} - \hat{D}_{PM}^{i,f} \tag{47.8}$$

$$-\hat{D}_{Pm}^{i,f} \le 0 = T_{Skew}(i, f) = t_{cd}^{i} - t_{cd}^{f} \tag{47.9}$$

Note that Eq. 47.8 can be satisfied for each local data path $R_i \rightsquigarrow R_f$ in a circuit if a sufficiently large value, larger than the greatest value $\hat{D}_{PM}^{i,f}$ in the circuit, is chosen for $T_{CP}$. Furthermore, Eq. 47.9 can be satisfied across an entire circuit if it can be ensured that $\hat{D}_{Pm}^{i,f} \ge 0$ for each local data path $R_i \rightsquigarrow R_f$ in the circuit. The timing constraints Eqs. 47.8 and 47.9 can be satisfied since choosing a sufficiently large clock period $T_{CP}$ is always possible and $\hat{D}_{Pm}^{i,f}$ is positive for a properly designed local data path $R_i \rightsquigarrow R_f$. The application of this zero clock skew methodology (Eqs. 47.7, 47.8, and 47.9) has been central to the design of fully synchronous digital circuits for decades.13,26 By requiring the clock signal to arrive at each register $R_j$ with approximately the same delay $t_{cd}^{j}$,7 these design methods have become known as zero clock skew methods. As shown by previous research,13,15-17,27-29 both double and zero clocking hazards may be removed from a synchronous digital circuit even when the clock skew is non-zero; that is, $T_{Skew}(i, f) \ne 0$ for some (or all) local data paths $R_i \rightsquigarrow R_f$. As long as Eqs. 47.5 and 47.6 are satisfied, a synchronous digital system can operate reliably with non-zero clock skews, permitting the system to operate at higher clock frequencies while removing all race conditions. The column vector of clock delays $T_{CD} = [t_{cd}^{1}, t_{cd}^{2}, \ldots]^{T}$ is called a clock schedule.13,25 If $T_{CD}$ is chosen such that Eqs. 47.5 and 47.6 are satisfied for every local data path $R_i \rightsquigarrow R_f$, $T_{CD}$ is called a consistent clock schedule. A clock schedule that satisfies Eq. 47.7 is called a trivial clock schedule. Note that a trivial clock schedule $T_{CD}$ implies global zero clock skew since for any i and f, $t_{cd}^{i} = t_{cd}^{f}$; thus, $T_{Skew}(i, f) = 0$. Fishburn25 first suggested an algorithm for computing a consistent clock schedule that is non-trivial. Furthermore, Fishburn showed25 that by exploiting negative and positive clock skew within the local data paths $R_i \rightsquigarrow R_f$, a circuit can operate with a clock period $T_{CP}$ less than the clock period achievable by a trivial (or zero skew) clock schedule that satisfies the conditions specified by Eqs. 47.5 and 47.6. In fact, Fishburn25 determined an optimal clock schedule by applying linear programming techniques to solve for $T_{CD}$ so as to satisfy Eqs. 47.5 and 47.6 while minimizing the objective function $F_{objective} = T_{CP}$. The process of determining a consistent clock schedule $T_{CD}$ can be considered as the mathematical problem of minimizing the clock period $T_{CP}$ under the constraints Eqs. 47.5 and 47.6. However, there are important practical issues to consider before a clock schedule can be properly implemented. A clock distribution network must be synthesized such that the clock signal is delivered to each register with the proper delay so as to satisfy the clock skew schedule $T_{CD}$. Furthermore, this clock distribution network must be constructed so as to minimize the deleterious effects of interconnect impedances and process parameter variations on the implemented clock schedule. Synthesizing the clock distribution network typically consists of determining a topology for the network, together with the circuit design and physical layout of the buffers and interconnect within the clock distribution network.13
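A small numerical check can make the idea of a consistent clock schedule concrete. The sketch below assumes that Eqs. 47.5 and 47.6, which are not reproduced here, have the general forms suggested by Eqs. 47.8 and 47.9, namely $T_{Skew}(i,f) \le T_{CP} - \hat{D}_{PM}^{i,f}$ and $T_{Skew}(i,f) \ge -\hat{D}_{Pm}^{i,f}$. The function names, the three-register circuit, and all delay values are made up for illustration; this is a constraint checker, not Fishburn's linear programming algorithm.

```python
def skew(clock_delays, i, f):
    """T_Skew(i, f) = t_cd^i - t_cd^f."""
    return clock_delays[i] - clock_delays[f]


def is_consistent(clock_delays, paths, t_cp):
    """Check the assumed forms of Eqs. 47.5 and 47.6 on every local data path.

    paths maps (i, f) to (d_min, d_max), standing in for D_Pm and D_PM.
    """
    for (i, f), (d_min, d_max) in paths.items():
        s = skew(clock_delays, i, f)
        if s > t_cp - d_max:                   # assumed Eq. 47.5 violated
            return False
        if s < -d_min:                         # assumed Eq. 47.6 violated
            return False
    return True


def min_period(clock_delays, paths):
    """Smallest T_CP meeting the assumed Eq. 47.5 for a given schedule."""
    return max(skew(clock_delays, i, f) + d_max
               for (i, f), (_, d_max) in paths.items())


# Hypothetical circuit: R1 -> R2 is a slow path, R2 -> R3 is a fast path.
paths = {(1, 2): (2, 8), (2, 3): (3, 4)}
trivial = {1: 0, 2: 0, 3: 0}                   # zero clock skew everywhere
skewed = {1: 0, 2: 2, 3: 0}                    # clock arrives later at R2

print(min_period(trivial, paths), min_period(skewed, paths))
print(is_consistent(skewed, paths, t_cp=6), is_consistent(trivial, paths, t_cp=6))
```

In this made-up example the non-trivial schedule borrows time from the fast path, so the circuit tolerates a shorter clock period than the trivial schedule allows.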

Structure of the Clock Distribution Network

The clock distribution network is typically organized as a rooted tree structure [13,15,23], as illustrated in Fig. 47.4, and is often called a clock tree [13]. A circuit schematic of a clock distribution network is shown in Fig. 47.4(a). An abstract graphical representation of the tree structure depicted in Fig. 47.4(a) is shown in Fig. 47.4(b). The unique source of the clock signal is at the root of the tree. This signal is distributed from the source to every register in the circuit through a sequence of buffers and interconnects. Typically, a buffer in the network drives a combination of other buffers and registers in the VLSI circuit. An interconnection network of wires connects the output of the driving buffer to the inputs of these driven buffers and registers. An internal node of the tree corresponds to a buffer, and a leaf node of the tree corresponds to a register. There are $N$ leaves (footnote 8) in the clock tree labeled $F_1$ through $F_N$, where leaf $F_j$ corresponds to register $R_j$.

A clock tree topology that implements a given clock schedule $T_{CD}$ must enforce a clock skew $T_{Skew}(i,f)$ for each local data path $R_i \rightsquigarrow R_f$ of the circuit in order to ensure that both Eqs. 47.5 and 47.6 are satisfied. This topology, however, can be affected by three important issues relating to the operation of a fully synchronous digital system.

FIGURE 47.4 Tree structure of a clock distribution network. (a) Circuit structure of the clock distribution network. (b) Clock tree structure that corresponds to the circuit shown in (a).

7 Equivalently, it is required that the clock signal arrive at each register at approximately the same time.

Linear Dependency of the Clock Skews

An important corollary related to the conservation property [13] of clock skew is that there is a linear dependency among the clock skews of a global data path that form a cycle in the underlying graph of the circuit. Specifically, if $v_0, e_1, v_1 \ne v_0, \ldots, v_{k-1}, e_k, v_k \equiv v_0$ is a cycle in the underlying graph of the circuit, then

$$0 = [t_{cd}^{0} - t_{cd}^{1}] + [t_{cd}^{1} - t_{cd}^{2}] + \cdots = \sum_{i=0}^{k-1} T_{Skew}(i, i+1) \tag{47.10}$$

The property described by Eq. 47.10 is illustrated in Fig. 47.3 for the undirected cycle $v_1, v_4, v_3, v_2, v_1$. Note that

$$0 = (t_{cd}^{1} - t_{cd}^{4}) + (t_{cd}^{4} - t_{cd}^{3}) + (t_{cd}^{3} - t_{cd}^{2}) + (t_{cd}^{2} - t_{cd}^{1}) = T_{Skew}(1,4) + T_{Skew}(4,3) + T_{Skew}(3,2) + T_{Skew}(2,1) \tag{47.11}$$
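The cancellation in Eq. 47.11 holds for any choice of clock delays, which a few lines of Python can confirm. The delay values below are hypothetical integers picked only to keep the arithmetic exact; the cycle is the one named in the text.

```python
def skew(t_cd, i, f):
    """Clock skew T_Skew(i, f) = t_cd[i] - t_cd[f]."""
    return t_cd[i] - t_cd[f]


def cycle_skew_sum(t_cd, cycle):
    """Sum of the skews between consecutive vertices of a closed cycle."""
    return sum(skew(t_cd, a, b) for a, b in zip(cycle, cycle[1:]))


t_cd = {1: 3, 2: 7, 3: 1, 4: 12}   # hypothetical clock delays
cycle = [1, 4, 3, 2, 1]            # the undirected cycle v1, v4, v3, v2, v1

total = cycle_skew_sum(t_cd, cycle)
print(total)                        # 0

# Any one skew on the cycle is fixed by the other three.
dependent = -(skew(t_cd, 1, 4) + skew(t_cd, 4, 3) + skew(t_cd, 3, 2))
print(dependent == skew(t_cd, 2, 1))   # True
```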

The importance of this property is that Eq. 47.10 describes the inherent correlation among certain clock skews within a circuit. Therefore, these correlated clock skews cannot be optimized independently of each other. Returning to Fig. 47.3, note that it is not necessary that a directed cycle exists in the directed graph $G$ of a circuit for Eq. 47.10 to hold. For example, $v_2, v_3, v_4$ is not a cycle in the directed circuit graph $G$ in Fig. 47.3(a), but $v_2, v_3, v_4$ is a cycle in the undirected circuit graph $G_u$ in Fig. 47.3(b). In addition, $T_{Skew}(2,3) + T_{Skew}(3,4) + T_{Skew}(4,2) = 0$; that is, the skews $T_{Skew}(2,3)$, $T_{Skew}(3,4)$, and $T_{Skew}(4,2)$ are linearly dependent. A maximum of $(V - 1) = (N - 1)$ clock skews can be chosen independently of each other in a circuit, which is easily proven by considering a spanning tree of the underlying circuit graph $G_u$ [23,24]. Any spanning tree of $G_u$ will contain $(N - 1)$ edges, each edge corresponding to a local data path, and the addition of any other edge of $G_u$ will form a cycle such that Eq. 47.10 holds for this cycle. Note, for example, that for the circuit modeled by the graph shown in Fig. 47.3, four independent clock skews can be chosen such that the remaining three clock skews can be expressed in terms of the independent clock skews.

8 N denotes the number of registers in the circuit.

Permissible Ranges

Previous research [17,29] has indicated that tight control over the clock skews rather than the clock delays is necessary for the circuit to operate reliably. The relationships in Eqs. 47.5 and 47.6 are used in Ref. 29 to determine a permissible range of the allowed clock skew for each local data path. The concept of a permissible range for the clock skew $T_{Skew}(i,f)$ of a local data path $R_i \rightsquigarrow R_f$ is illustrated in Fig. 47.5. When $T_{Skew}(i,f) \in [-\hat{D}_{Pm}^{i,f},\ T_{CP} - \hat{D}_{PM}^{i,f}]$, as shown in Fig. 47.5, Eqs. 47.5 and 47.6 are satisfied. The clock skew $T_{Skew}(i,f)$ is not permitted to be in either the interval $(-\infty, -\hat{D}_{Pm}^{i,f})$, because a race condition will be created, or the interval $(T_{CP} - \hat{D}_{PM}^{i,f}, +\infty)$, because the minimum clock period will be limited.

FIGURE 47.5 The permissible range of the clock skew of a local data path $R_i \rightsquigarrow R_f$. A timing violation exists if $T_{Skew}(i,f) \notin [-\hat{D}_{Pm}^{i,f},\ T_{CP} - \hat{D}_{PM}^{i,f}]$.
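The permissible range of Fig. 47.5 translates directly into a small check. The following Python sketch is illustrative only: the function names, the returned labels, and the numeric values are assumptions, while the interval bounds are those stated above.

```python
def permissible_range(t_cp, d_min, d_max):
    """Return (lower, upper) bounds of the permissible clock skew.

    d_min and d_max stand for the hatted minimum and maximum delays of the
    local data path; t_cp is the clock period.
    """
    return (-d_min, t_cp - d_max)


def classify_skew(skew, t_cp, d_min, d_max):
    """Label a clock skew relative to the permissible range."""
    lower, upper = permissible_range(t_cp, d_min, d_max)
    if skew < lower:
        return "race condition"
    if skew > upper:
        return "clock period too short"
    return "ok"


# Hypothetical path: T_CP = 10, minimum delay 2, maximum delay 7.
print(permissible_range(10, 2, 7))   # (-2, 3)
print(classify_skew(-3, 10, 2, 7))   # race condition
print(classify_skew(0, 10, 2, 7))    # ok
print(classify_skew(4, 10, 2, 7))    # clock period too short
```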

Also, note that the reliability of the circuit is related to the probability of a timing violation occurring for any local data path $R_i \rightsquigarrow R_f$. Therefore, the reliability of any local data path $R_i \rightsquigarrow R_f$ of the circuit (and therefore of the entire circuit) is increased in two ways:

1. By choosing the clock skew $T_{Skew}(i,f)$ for a local data path as far as possible from the borders of the interval $[-\hat{D}_{Pm}^{i,f},\ T_{CP} - \hat{D}_{PM}^{i,f}]$, that is, by (ideally) positioning the clock skew $T_{Skew}(i,f)$ in the middle of the permissible range, $T_{Skew}(i,f) = \frac{1}{2}[T_{CP} - (\hat{D}_{PM}^{i,f} + \hat{D}_{Pm}^{i,f})]$
2. By increasing the width $T_{CP} - (\hat{D}_{PM}^{i,f} - \hat{D}_{Pm}^{i,f})$ of the permissible range of the local data path $R_i \rightsquigarrow R_f$

Due to the linear dependence of the clock skews shown previously, however, it is not possible to build a typical circuit such that for each local data path $R_i \rightsquigarrow R_f$, the clock skew $T_{Skew}(i,f)$ is in the middle of the permissible range.

Differential Character of the Clock Tree

In a given circuit, the clock signal delay $t_{cd}^{j}$ from the clock source to the register $R_j$ is equal to the sum of the propagation delays of the buffers on the unique path that exists between the root of the clock tree and the leaf $F_j$ corresponding to the j-th register. Furthermore, if $R_i \rightsquigarrow R_f$ is a sequentially-adjacent pair of registers, there is a portion of the two paths, denoted $P_{if}^{*}$, between the root of the clock tree and $R_i$ and $R_f$, respectively, that is common to both paths.

This concept is illustrated in Fig. 47.6. A portion of a clock tree is shown in Fig. 47.6 where each of the vertices 1 through 10 corresponds to a buffer in the clock tree. The vertices 4, 5, and 9 are leaves of the tree and correspond to the registers $R_4$, $R_5$, and $R_9$, respectively (footnote 9). The local data paths $R_4 \rightsquigarrow R_5$ and $R_5 \rightsquigarrow R_9$ are indicated with arrows in Fig. 47.6, while the paths of the clock signals to each of the registers $R_4$, $R_5$, and $R_9$ are shown in Fig. 47.6 lightly shaded. The portion of the clock signal paths common to both registers of a local data path is shaded darker in Fig. 47.6; note the segments 1 → 2 → 3 for $R_4 \rightsquigarrow R_5$ and 1 → 2 for $R_5 \rightsquigarrow R_9$.

Similarly, there is a portion of the clock signal path to each of the registers $R_i$ and $R_f$ in a sequentially-adjacent pair of registers $R_i \rightsquigarrow R_f$, denoted by $P_{if}^{i}$ and $P_{if}^{f}$, respectively, that is unique to this register.

9 Note that not all of the vertices correspond to registers.

FIGURE 47.6 Illustration of the differential nature of the clock tree.

Returning to Fig. 47.6, the segments 3 → 4 and 3 → 5 are unique to the clock signal paths to the registers $R_4$ and $R_5$, while the segments 2 → 3 → 5 and 2 → 6 → 9 are unique to the clock signal paths to the registers $R_5$ and $R_9$, respectively. Note that the clock skew $T_{Skew}(i,f)$ between the sequentially-adjacent pair of registers $R_i \rightsquigarrow R_f$ is equal to the difference between the accumulated buffer propagation delays of $P_{if}^{i}$ and $P_{if}^{f}$, that is, $T_{Skew}(i,f) = Delay(P_{if}^{i}) - Delay(P_{if}^{f})$. Therefore, any variations of circuit parameters over $P_{if}^{*}$ will not affect the value of the clock skew $T_{Skew}(i,f)$. For the example shown in Fig. 47.6, $T_{Skew}(4,5) = Delay(P_{4,5}^{4}) - Delay(P_{4,5}^{5})$ and $T_{Skew}(5,9) = Delay(P_{5,9}^{5}) - Delay(P_{5,9}^{9})$.

The differential feature of the clock tree suggests an approach for minimizing the effects of process parameter variations on the correct operation of the circuit. To illustrate this approach, each branch p → q of the clock tree shown in Fig. 47.6 is labeled with two numbers: $\tau_{p,q} > 0$ is the intended delay of the branch and $\varepsilon_{p,q} \ge 0$ is the maximum error (deviation) of this delay (footnote 10). In other words, the actual delay of the branch p → q is in the interval $[\tau_{p,q} - \varepsilon_{p,q},\ \tau_{p,q} + \varepsilon_{p,q}]$. With this notation, the target clock skew values for the local data paths $R_4 \rightsquigarrow R_5$ and $R_5 \rightsquigarrow R_9$ are shown in the middle column in Table 47.1. The bounds of the actual clock skew values for the local data paths $R_4 \rightsquigarrow R_5$ and $R_5 \rightsquigarrow R_9$ (considering the $\varepsilon$ variations) are shown in the right-most column in Table 47.1.

TABLE 47.1 Target and Actual Values of the Clock Skews for the Local Data Paths $R_4 \rightsquigarrow R_5$ and $R_5 \rightsquigarrow R_9$ Shown in Fig. 47.6

| Clock Skew | Target Skew | Actual Skew Bounds |
|---|---|---|
| $T_{Skew}(4,5)$ | $\tau_{3,4} - \tau_{3,5}$ | $\tau_{3,4} - \tau_{3,5} \pm (\varepsilon_{3,4} + \varepsilon_{3,5})$ |
| $T_{Skew}(5,9)$ | $\tau_{2,3} + \tau_{3,5} - \tau_{2,6} - \tau_{6,9}$ | $\tau_{2,3} + \tau_{3,5} - \tau_{2,6} - \tau_{6,9} \pm (\varepsilon_{2,3} + \varepsilon_{3,5} + \varepsilon_{2,6} + \varepsilon_{6,9})$ |

10 The deviation ε is due to parameter variations during circuit manufacturing as well as to environmental conditions during operation of the circuit.

As the results in Table 47.1 demonstrate, it is advantageous to maximize $P_{if}^{*}$ for any local data path $R_i \rightsquigarrow R_f$ with a relatively narrow permissible range, such that the parameter variations on $P_{if}^{*}$ do not affect $T_{Skew}(i,f)$. Similarly, when the permissible range $[-\hat{D}_{Pm}^{i,f},\ T_{CP} - \hat{D}_{PM}^{i,f}]$ is wider, $P_{if}^{*}$ may be permitted to be only a small fraction of the total path from the root to $R_i$ and $R_f$, respectively. Future research work will explore this approach of synthesizing a clock tree based on choosing a tree structure which restricts the possible variations of those local data paths with narrow permissible ranges, and tolerates larger delay variations for those local data paths with wider permissible ranges.
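The entries of Table 47.1 can be reproduced mechanically from the tree structure. The Python sketch below encodes only the branches of Fig. 47.6 that are named in the text (1 → 2 → 3 → 4, 1 → 2 → 3 → 5, and 1 → 2 → 6 → 9). The numeric $\tau$ and $\varepsilon$ values and all function names are hypothetical. The shared portion of the two clock paths is discarded before the skew and its worst-case error are computed, which is the differential property described above.

```python
# parent[q] = p for each clock tree branch p -> q named in the text
parent = {2: 1, 3: 2, 4: 3, 5: 3, 6: 2, 9: 6}

# Hypothetical (tau, eps) for each branch: intended delay and maximum deviation
branch = {
    (1, 2): (5, 1), (2, 3): (4, 1), (3, 4): (3, 1),
    (3, 5): (2, 1), (2, 6): (3, 1), (6, 9): (2, 1),
}


def path_from_root(leaf):
    """List of branches (p, q) from the root of the clock tree to the leaf."""
    edges = []
    node = leaf
    while node in parent:
        edges.append((parent[node], node))
        node = parent[node]
    return edges[::-1]


def skew_bounds(i, f):
    """Return (target, lowest, highest) clock skew between leaves i and f."""
    path_i, path_f = path_from_root(i), path_from_root(f)
    common = 0
    while (common < min(len(path_i), len(path_f))
           and path_i[common] == path_f[common]):
        common += 1                        # skip the shared portion P*_if
    unique_i, unique_f = path_i[common:], path_f[common:]
    target = (sum(branch[e][0] for e in unique_i)
              - sum(branch[e][0] for e in unique_f))
    error = sum(branch[e][1] for e in unique_i + unique_f)
    return target, target - error, target + error


print(skew_bounds(4, 5))   # (1, -1, 3)
print(skew_bounds(5, 9))   # (1, -3, 5)
```

With equal per-branch deviations, the pair with the longer shared path (registers 4 and 5) ends up with the tighter skew bounds, in line with the recommendation to maximize the shared portion for paths with narrow permissible ranges.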

47.4 Timing Properties of Synchronous Storage Elements

Common Storage Elements

The general structure and principles of operation of a fully synchronous digital VLSI system were described in Section 47.2. In this section, the timing constraints due to the combinational logic and the storage elements within a synchronous system are reviewed. The clock distribution network provides the time reference for the storage elements (or registers), thereby enforcing the required logical order of operations. This time reference consists of one or more clock signals that are delivered to each and every register within the integrated circuit. These clock signals control the order of computational events by controlling the exact times the register data inputs are sampled.

The data signals are inevitably delayed as these signals propagate through the logic gates and along interconnections within the local data paths. These propagation delays can be evaluated within a certain accuracy and used to derive timing relationships among signals in a circuit. In this section, the properties of commonly used types of registers and their local timing relationships for different types of local data paths are described. After discussing registers in general in the next subsection, the properties of level-sensitive registers (latches) and the significant timing parameters of these registers are reviewed. Edge-sensitive registers (flip-flops) and their timing parameters are also analyzed. Properties and definitions related to the clock distribution network are reviewed, and finally, the mathematical foundation for analyzing timing violations in both flip-flops and latches is discussed.

Storage Elements

The storage elements (registers) encountered throughout VLSI systems vary widely in their function and temporal relationships. Independent of these differences, however, all storage elements share a common feature: the existence of two groups of signals with largely different purposes. A generalized view of a register is depicted in Fig. 47.7. The I/O signals of a register can be divided into two groups as shown in Fig. 47.7. One group of signals, called the data signals, consists of input and output signals of the storage element. These input and output signals are connected to the data signal terminals of other storage elements as well as to the terminals of ordinary logic gates. Another group of signals, identified by the name control signals, are those signals that control the storage of the data signals in the registers but do not participate in the logical computation process.

Certain control signals enable the storage of a data signal in a register independently of the values of any data signals. These control signals are typically used to initialize the data in a register to a specific well-known value. Other control signals, such as a clock signal, control the process of storing a data signal within a register. In a synchronous circuit, each register has at least one clock (or control) signal input.

The two major groups of storage elements (registers) are considered in the following sections based on the type of relationship that exists among the data and clock signals of these elements. In latches, it is the specific value or level of a control signal (footnote 11) that determines the data storage process. Therefore, latches are also called level-sensitive registers. In contrast to latches, a data signal is stored in flip-flops as controlled by an edge of a control signal. For that reason, flip-flops are also called edge-triggered registers. The timing properties of latches and flip-flops are described in detail in the following two sections.

FIGURE 47.7 A general view of a register.

Latches

A latch is a register whose behavior depends upon the value or level of the clock signal [8,30-36]. Therefore, a latch is often referred to as a transparent latch, a level-sensitive register, or a polarity hold latch. A simple type of latch with a clock signal C and an input signal D is depicted in Fig. 47.8(a); the output of the latch is typically labeled Q. This type of latch is also known as a D latch and its operation is illustrated in Fig. 47.8(b).

FIGURE 47.8 Schematic representation and principle of operation of a level-sensitive register (latch). (a) A level-sensitive register or latch. (b) Idealized operation of the latch shown in (a).

The register illustrated in Fig. 47.8 is a positive-polarity (footnote 12) latch since it is transparent during that portion of the clock period for which C is high. The operation of this positive latch is summarized in Table 47.2. As described in Table 47.2 and illustrated in Fig. 47.8(b), the output signal of the latch follows the data input signal while the clock signal remains high, that is, $C = 1 \Rightarrow Q = D$. Therefore, the latch is said to be in a transparent state during the interval $t_0 < t < t_1$ shown in Fig. 47.8(b). When the clock signal C changes from 1 to 0, the current value of D is stored in the register and the output Q remains fixed to that value regardless of whether the data input D changes. The latch does not pass the input data signal to the output, but rather holds onto the last value of the data signal when the clock signal made the high-to-low transition. By analogy with the term transparent introduced above, this state of the latch is called opaque and corresponds to the interval $t_1 < t < t_2$ shown in Fig. 47.8(b) where the input data signal is isolated from the output port. As shown in Fig. 47.8(b), the clock period is $T_{CP} = t_2 - t_0$.

The edge of the clock signal that causes the latch to switch to its transparent state is identified as the leading edge of the clock pulse. In the case of the positive latch shown in Fig. 47.8(a), the leading edge of the clock signal occurs at time $t_0$. The opposite direction edge of the clock signal is identified as the trailing edge, which is the falling edge at time $t_1$ shown in Fig. 47.8(b). Note that for a negative latch, the leading edge is a high-to-low transition and the trailing edge is a low-to-high transition.

TABLE 47.2 Operation of the Positive-Polarity D Latch

| Clock | Output | State |
|---|---|---|
| High | Passes input | Transparent |
| Low | Maintains output | Opaque |

11 This signal is most frequently the clock signal.

12 Or simply a positive latch.

Parameters of Latches

Registers such as the D latch illustrated in Fig. 47.8 and the flip-flops described later are built of discrete transistors. The exact relationships among signals on the terminals of a register can be presented and evaluated in analytical form [37-39]. In this section, however, registers are considered at a higher level of abstraction in order to hide the details of the specific electrical implementation. The latch parameters are briefly introduced next.

Note: The remaining portion of this section uses an extensive notation for various parameters of signals and storage elements. A glossary of terms used throughout this chapter is listed in the appendix.
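The transparent and opaque behavior of the positive latch summarized in Table 47.2 can be mimicked with a short behavioral model. The Python sketch below is an idealization with zero propagation delay that operates on sampled clock and data values; the function name and the sample sequences are illustrative assumptions, not part of the chapter.

```python
def d_latch(clock, data, q0=0):
    """Idealized positive-polarity D latch on sampled signals.

    While the clock sample is 1 the latch is transparent (Q follows D);
    while it is 0 the latch is opaque (Q holds its last value).
    """
    q = q0
    outputs = []
    for c, d in zip(clock, data):
        if c == 1:          # transparent state
            q = d
        outputs.append(q)   # opaque state keeps the previous q
    return outputs


clock = [1, 1, 1, 0, 0, 0, 1, 1]
data = [0, 1, 0, 1, 0, 1, 1, 0]
print(d_latch(clock, data))   # [0, 1, 0, 0, 0, 0, 1, 0]
```

The output tracks the data input during the first three samples, ignores the data transitions while the clock is low, and resumes tracking once the clock returns high.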
Minimum Width of the Clock Pulse

The minimum width of the clock pulse $C_{Wm}^{L}$ is the minimum permissible width of this portion of the clock signal during which the latch is transparent. In other words, $C_{Wm}^{L}$ is the length of the time interval between the leading and the trailing edge of the clock signal such that the latch will operate properly. Increasing the value of $C_{Wm}^{L}$ any further will not affect the values of $D_{DQ}^{L}$, $\delta_{S}^{L}$, and $\delta_{H}^{L}$ (defined later). The minimum width of the clock pulse, $C_{Wm}^{L} = t_6 - t_1$, is illustrated in Fig. 47.9. The clock period is $T_{CP} = t_8 - t_1$.

Latch Clock-to-Output Delay

The clock-to-output delay $D_{CQ}^{L}$ (typically called the clock-to-Q delay) is the propagation delay of the latch from the clock signal terminal to the output terminal. The value of $D_{CQ}^{L} = t_2 - t_1$ is depicted in Fig. 47.9 and is defined assuming that the data input signal has settled to a stable value sufficiently early, that is, setting the data input signal earlier with respect to the leading clock edge will not affect the value of $D_{CQ}^{L}$.

Latch Data-to-Output Delay

The data-to-output delay $D_{DQ}^{L}$ (typically called the data-to-Q delay) is the propagation delay of the latch from the data signal terminal to the output terminal. The value of $D_{DQ}^{L}$ is defined assuming that the clock signal has set the latch to its transparent state sufficiently early, that is, making the leading edge of the clock signal occur earlier will not change the value of $D_{DQ}^{L}$. The data-to-output delay $D_{DQ}^{L} = t_4 - t_3$ is illustrated in Fig. 47.9.

Latch Setup Time

The latch setup time $\delta_{S}^{L} = t_6 - t_5$, shown in Fig. 47.9, is the minimum time between a change in the data signal and the trailing edge of the clock signal such that the new value of D would propagate to the output Q of the latch and be stored within the latch during its opaque state.


FIGURE 47.9 Parameters of a level-sensitive register.

Latch Hold Time

The latch hold time $\delta_{H}^{L}$ is the minimum time after the trailing clock edge that the data signal must remain constant so that this value of D is successfully stored in the latch during the opaque state. This definition of $\delta_{H}^{L}$ assumes that the last change of the value of D has occurred no later than $\delta_{S}^{L}$ before the trailing edge of the clock signal. The term $\delta_{H}^{L} = t_7 - t_6$ is shown in Fig. 47.9.

Note: The latch parameters previously introduced are used to refer to any latch in general, or to a specific instance of a latch when this instance can be unambiguously identified. To refer to a specific instance i of a latch explicitly, the parameters are additionally shown with a superscript. For example, $D_{CQ}^{Li}$ refers to the clock-to-output delay of latch i. Also, adding m and M to the subscript of $D_{CQ}^{L}$ and $D_{DQ}^{L}$ can be used to refer to the minimum and maximum values of $D_{CQ}^{L}$ and $D_{DQ}^{L}$, respectively.

Flip-Flops

An edge-triggered register or flip-flop is a type of register which, unlike the latches described previously, is never transparent with respect to the input data signal [8,30-36]. The output of a flip-flop normally does not follow the input data signal at any time during the register operation, but rather holds onto a previously stored data value until a new data signal is stored in the flip-flop. A simple type of flip-flop with a clock signal C and an input signal D is shown in Fig. 47.10(a); similar to latches, the output of a flip-flop is usually labeled Q. This specific type of register, shown in Fig. 47.10(a), is called a D flip-flop and its operation is illustrated in Fig. 47.10(b).

In typical flip-flops, data is stored either on the rising edge (low-to-high transition) or on the falling edge (high-to-low transition) of the clock signal. The flip-flops are known as positive-edge-triggered and negative-edge-triggered flip-flops, respectively. The terms latching, storing, or positive edge are used to identify the edge of the clock signal on which storage in the flip-flop occurs. For the sake of clarity, the latching edge of the clock signal for flip-flops will also be called the leading edge (compare with the previous discussion of latches). Also, note that certain flip-flops, known as double-edge-triggered (DET) flip-flops [40-44], can store data at either edge of the clock signal. The complexity of these flip-flops, however, is significantly higher and these registers are therefore rarely used.

As shown in the timing diagram in Fig. 47.10(b), the output of the flip-flop remains unchanged most of the time, regardless of the transitions in the data signal. Only values of the data signal in the vicinity of the storing edge of the clock signal can affect the output of the flip-flop. Therefore, changes in the output will only be observed when the currently stored data has a logic value $x$, and the storing edge of the clock signal occurs while the input data signal has the complementary logic value $\bar{x}$.

FIGURE 47.10 Schematic representation and principle of operation of an edge-triggered register (flip-flop). (a) An edge-triggered register or flip-flop. (b) Idealized operation of the flip-flop shown in (a).

Parameters of Flip-Flops

The significant timing parameters of an edge-triggered register are similar to those of latches and are presented next. These parameters are illustrated in Fig. 47.11.

Minimum Width of the Clock Pulse

The minimum width of the clock pulse $C_{Wm}^{F}$ is the minimum permissible width of the time interval between the latching edge and the non-latching edge of the clock signal. The minimum width of the clock pulse $C_{Wm}^{F} = t_6 - t_3$ is shown in Fig. 47.11 and is defined as the minimum interval between the latching and non-latching edges of the clock pulse such that the flip-flop will operate correctly. Further increasing $C_{Wm}^{F}$ will not affect the values of the setup time $\delta_{S}^{F}$ and hold time $\delta_{H}^{F}$ (defined later). The clock period $T_{CP} = t_6 - t_1$ is also shown in Fig. 47.11.

FIGURE 47.11 Parameters of an edge-triggered register.


Flip-Flop Clock-to-Output Delay

As shown in Fig. 47.11, the clock-to-output delay $D_{CQ}^{F}$ of the flip-flop is $D_{CQ}^{F} = t_5 - t_3$. This parameter, typically called the clock-to-Q delay, is the propagation delay from the clock signal terminal to the output terminal. The value of $D_{CQ}^{F}$ is defined assuming that the data input signal has settled to a stable value sufficiently early, that is, setting the data input any earlier with respect to the latching clock edge will not affect the value of $D_{CQ}^{F}$.

Flip-Flop Setup Time

The flip-flop setup time $\delta_{S}^{F} = t_3 - t_2$ is shown in Fig. 47.11. The parameter $\delta_{S}^{F}$ is defined as the minimum time between a change in the data signal and the latching edge of the clock signal such that the new value of D propagates to the output Q of the flip-flop and is successfully latched within the flip-flop.

Flip-Flop Hold Time

The flip-flop hold time $\delta_{H}^{F}$ is the minimum time after the arrival of the latching clock edge in which the data signal must remain constant in order to successfully store the D signal within the flip-flop. The hold time $\delta_{H}^{F} = t_4 - t_3$ is illustrated in Fig. 47.11. This definition of the hold time assumes that the last change of D has occurred no later than $\delta_{S}^{F}$ before the arrival of the latching edge of the clock signal.

Note: Similar to latches, the parameters of these edge-triggered registers refer to any flip-flop in general, or to a specific instance of a flip-flop when this instance is uniquely identified. To refer to a specific instance i of a flip-flop explicitly, the flip-flop parameters are additionally shown with a superscript. For example, $\delta_{S}^{Fi}$ refers to the setup time of flip-flop i. Also, adding m and M to the subscript of $D_{CQ}^{F}$ can be used to refer to the minimum and maximum values of $D_{CQ}^{F}$, respectively.
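The edge-triggered behavior and the setup and hold window defined above can be illustrated with a brief Python sketch. Both functions are idealized and hypothetical: the first models a positive-edge-triggered D flip-flop on sampled signals with zero delay, and the second flags any data transition that falls strictly inside the window from $\delta_{S}^{F}$ before the latching edge to $\delta_{H}^{F}$ after it. Treating a transition exactly on the window boundary as acceptable is an assumption made here for simplicity.

```python
def d_flip_flop(clock, data, q0=0):
    """Idealized positive-edge-triggered D flip-flop on sampled signals."""
    q, previous_clock = q0, 0
    outputs = []
    for c, d in zip(clock, data):
        if previous_clock == 0 and c == 1:   # latching (rising) edge
            q = d
        outputs.append(q)
        previous_clock = c
    return outputs


def violates_setup_hold(data_changes, edge_time, t_setup, t_hold):
    """True if a data transition lies strictly inside the setup/hold window."""
    return any(edge_time - t_setup < t < edge_time + t_hold
               for t in data_changes)


clock = [0, 1, 1, 0, 0, 1, 1, 0]
data = [1, 1, 0, 0, 0, 0, 1, 1]
print(d_flip_flop(clock, data))                    # [0, 1, 1, 1, 1, 0, 0, 0]

# Latching edge at t = 10 with a setup time of 2 and a hold time of 1.
print(violates_setup_hold([3, 7.5], 10, 2, 1))     # False
print(violates_setup_hold([9], 10, 2, 1))          # True
print(violates_setup_hold([10.5], 10, 2, 1))       # True
```

Unlike the latch, the flip-flop output changes only at the two rising clock edges, even though the data input toggles while the clock is high.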

The Clock Signal

The clock signal is typically delivered to each storage element within a circuit. This signal is crucial to the correct operation of a fully synchronous digital system. The storage elements serve to establish the relative sequence of events within a system so that those operations that cannot be executed concurrently operate on the proper data signals. A typical clock signal $c(t)$ in a synchronous digital system is shown in Fig. 47.12. The clock period $T_{CP}$ of $c(t)$ is indicated in Fig. 47.12. In order to provide the highest possible clock frequency, the objective is for $T_{CP}$ to be the smallest number such that

$$\forall t:\ c(t) = c(t + nT_{CP}) \tag{47.12}$$

where $n$ is an integer. The width of the clock pulse $C_W$ is shown in Fig. 47.12; the meaning of $C_W$ has been previously explained.

FIGURE 47.12 A typical clock signal.


Typically, the period of the clock signal $T_{CP}$ is a constant, that is, $\partial T_{CP}/\partial t = 0$. If the clock signal $c(t)$ has a delay $\tau$ from some reference point, then the leading edges of $c(t)$ occur at times

$$\tau + mT_{CP} \quad \text{for} \quad m \in \{\ldots, -2, -1, 0, 1, 2, \ldots\} \tag{47.13}$$

and the trailing edges of $c(t)$ occur at times

$$\tau + C_W + mT_{CP} \quad \text{for} \quad m \in \{\ldots, -2, -1, 0, 1, 2, \ldots\} \tag{47.14}$$

In practice, however, it is possible for the edges of a clock signal to fluctuate in time, that is, not to occur precisely at the times described by Eqs. 47.13 and 47.14 for the leading and trailing edges, respectively. This phenomenon is known as clock jitter and may be due to various causes, such as variations in the manufacturing process, ambient temperature, power supply noise, and oscillator characteristics. To account for this clock jitter, the following parameters are introduced:

• The maximum deviation $\Delta_L$ of the leading edge of the clock signal: that is, the leading edge is guaranteed to occur anywhere in the interval $(\tau + kT_{CP} - \Delta_L,\ \tau + kT_{CP} + \Delta_L)$
• The maximum deviation $\Delta_T$ of the trailing edge of the clock signal: that is, the trailing edge is guaranteed to occur anywhere in the interval $(\tau + C_W + kT_{CP} - \Delta_T,\ \tau + C_W + kT_{CP} + \Delta_T)$

Clock Skew

Consider a local data path such as the path shown in Fig. 47.2(b). Without loss of generality, assume that the registers shown in Fig. 47.2(b) are flip-flops. The clock signal with period $T_{CP}$ is delivered to each of the registers $R_i$ and $R_f$. Let the clock signal driving the register $R_i$ be denoted by $C_i$ and the clock signal driving the register $R_f$ be denoted by $C_f$. Also, let $t_{cd}^{i}$ and $t_{cd}^{f}$ be the delays of $C_i$ and $C_f$ to the registers $R_i$ and $R_f$, respectively (footnote 13). As described by Eq. 47.13, the latching or leading edges of $C_i$ occur at times

$$\ldots,\ \tau + t_{cd}^{i} - T_{CP},\ \tau + t_{cd}^{i},\ \tau + t_{cd}^{i} + T_{CP},\ \ldots$$

Similarly, the latching or leading edges of $C_f$ occur at times

$$\ldots,\ \tau + t_{cd}^{f} - T_{CP},\ \tau + t_{cd}^{f},\ \tau + t_{cd}^{f} + T_{CP},\ \ldots$$

again following Eq. 47.13.

The clock skew $T_{Skew}(i,f) = t_{cd}^{i} - t_{cd}^{f}$ between $C_i$ and $C_f$ is introduced next as the difference of the arrival times of $C_i$ and $C_f$ [13]. This concept is illustrated by Fig. 47.13. Note that depending on the values of $t_{cd}^{i}$ and $t_{cd}^{f}$, the skew can be zero ($t_{cd}^{i} = t_{cd}^{f}$), negative ($t_{cd}^{i} < t_{cd}^{f}$), or positive ($t_{cd}^{i} > t_{cd}^{f}$). Furthermore, note that the clock skew as defined above is only defined for sequentially-adjacent registers, that is, a local data path (such as the path shown in Fig. 47.2(b)).

FIGURE 47.13 Lead/lag relationships causing clock skew to be zero, negative, or positive.

13 Note that these delays $t_{cd}^{i}$ and $t_{cd}^{f}$ are measured with respect to the same reference point.
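The definitions above reduce to a few lines of arithmetic. The following Python sketch lists the leading-edge times of a delayed clock signal and classifies the skew of a local data path; the function names and the integer delay values are illustrative assumptions.

```python
def leading_edges(tau, t_cd, t_cp, periods):
    """Leading-edge times tau + t_cd + m*T_CP for each integer m in periods."""
    return [tau + t_cd + m * t_cp for m in periods]


def clock_skew(t_cd_i, t_cd_f):
    """T_Skew(i, f) = t_cd^i - t_cd^f."""
    return t_cd_i - t_cd_f


def skew_kind(skew):
    """Classify a clock skew as zero, negative, or positive."""
    if skew == 0:
        return "zero"
    return "positive" if skew > 0 else "negative"


# Hypothetical values: tau = 0, T_CP = 10, clock delays of 3 and 5.
print(leading_edges(0, 3, 10, range(-1, 2)))   # [-7, 3, 13]
print(leading_edges(0, 5, 10, range(-1, 2)))   # [-5, 5, 15]
print(skew_kind(clock_skew(3, 5)))             # negative
print(skew_kind(clock_skew(5, 3)))             # positive
print(skew_kind(clock_skew(4, 4)))             # zero
```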

Analysis of a Single-Phase Local Data Path with Flip-Flops

A local data path composed of two flip-flops and combinational logic between the flip-flops is shown in Fig. 47.14. Note the initial flip-flop $R_i$, which is the origin of the data signal, and the final flip-flop $R_f$, which is the destination of the data signal. The combinational logic block $L_{if}$ between $R_i$ and $R_f$ accepts the input data signals supplied by $R_i$ and other registers and logic gates and transmits the operated-upon data signals to $R_f$. The period of the clock signal is denoted by $T_{CP}$ and the delays of the clock signals $C_i$ and $C_f$ to the flip-flops $R_i$ and $R_f$ are denoted by $t_{cd}^{i}$ and $t_{cd}^{f}$, respectively. The input and output data signals to $R_i$ and $R_f$ are denoted by $D_i$, $Q_i$, $D_f$, and $Q_f$, respectively.

FIGURE 47.14 A single-phase local data path.

An analysis of the timing properties of the local data path shown in Fig. 47.14 is offered in the following sections. First, the timing relationships to prevent the late arrival of data signals to $R_f$ are examined in the next subsection. The timing relationships to prevent the early arrival of signals to the register $R_f$ are then described. These analyses borrow some notation from Refs. 11 and 12. Similar analyses of synchronous circuits from the timing perspective can be found in Refs. 45 through 49.

Preventing the Late Arrival of the Data Signal in a Local Data Path with Flip-Flops

The operation of the local data path $R_i \rightsquigarrow R_f$ shown in Fig. 47.14 requires that any data signal that is being stored in $R_f$ arrives at the data input $D_f$ of $R_f$ no later than $\delta_{S}^{Ff}$ before the latching edge of the clock signal $C_f$. It is possible for the opposite event to occur, that is, for the data signal $D_f$ not to arrive at the register $R_f$ sufficiently early in order to be stored successfully within $R_f$. If this situation occurs, the local data path shown in Fig. 47.14 fails to perform as expected and it is said that a timing failure or violation has been created. This form of timing violation is typically called a setup (or long path) violation.

A setup violation is depicted in Fig. 47.15 and is used in the following discussion. The identical clock periods of the clock signals $C_i$ and $C_f$ are shaded for identification in Fig. 47.15. Also shaded in Fig. 47.15 are those portions of the data signals $D_i$, $Q_i$, and $D_f$ that are relevant to the operation of the local data path shown in Fig. 47.14. Specifically, the shaded portion of $D_i$ corresponds to the data to be stored in $R_i$ at the beginning of the k-th clock period. This data signal propagates to the output of the register $R_i$ and is illustrated by the shaded portion of $Q_i$ shown in Fig. 47.15. The combinational logic operates on $Q_i$ during the k-th clock period. The result of this operation is the shaded portion of the signal $D_f$, which must be stored in $R_f$ during the next, (k + 1)-th, clock period.

Observe that, as illustrated in Fig. 47.15, the leading edge of $C_i$ that initiates the k-th clock period occurs at time $t_{cd}^{i} + kT_{CP}$. Similarly, the leading edge of $C_f$ that initiates the (k + 1)-th clock period occurs at time $t_{cd}^{f} + (k + 1)T_{CP}$. Therefore, the latest arrival time $t_{AM}^{Ff}$ of $D_f$ at $R_f$ must satisfy

$$t_{AM}^{Ff} \le [t_{cd}^{f} + (k + 1)T_{CP} - \Delta_{L}^{F}] - \delta_{S}^{Ff} \tag{47.15}$$

FIGURE 47.15 Timing diagram of a local data path with flip-flops with violation of the setup constraint. f

The term $[t_{cd}^{f} + (k+1)T_{CP} - \Delta_L^{F}]$ on the right-hand side of Eq. 47.15 corresponds to the critical situation of the leading edge of $C_f$ arriving earlier by the maximum possible deviation $\Delta_L^{F}$. The $-\delta_S^{Ff}$ term on the right-hand side of Eq. 47.15 accounts for the setup time of $R_f$ (recall the definition of $\delta_S^{F}$). Note that the value of $t_{AM}^{Ff}$ in Eq. 47.15 consists of two components:

1. The latest arrival time $t_{QM}^{Fi}$ at which a valid data signal $Q_i$ appears at the output of $R_i$: that is, the sum $t_{QM}^{Fi} = t_{cd}^{i} + kT_{CP} + \Delta_L^{F} + D_{CQM}^{Fi}$ of the latest possible arrival time of the leading edge of $C_i$ and the maximum clock-to-Q delay of $R_i$.
2. The maximum propagation delay $D_{PM}^{i,f}$ of the data signals through the combinational logic block $L_{if}$ and interconnect along the path $R_i \rightsquigarrow R_f$.

Therefore, $t_{AM}^{Ff}$ can be described as

$$t_{AM}^{Ff} = t_{QM}^{Fi} + D_{PM}^{i,f} = \left( t_{cd}^{i} + kT_{CP} + \Delta_L^{F} + D_{CQM}^{Fi} \right) + D_{PM}^{i,f}. \tag{47.16}$$

By substituting Eq. 47.16 into Eq. 47.15, the timing condition guaranteeing correct signal arrival at the data input D of $R_f$ is

$$\left( t_{cd}^{i} + kT_{CP} + \Delta_L^{F} + D_{CQM}^{Fi} \right) + D_{PM}^{i,f} \le \left[ t_{cd}^{f} + (k+1)T_{CP} - \Delta_L^{F} \right] - \delta_S^{Ff}. \tag{47.17}$$

The above inequality can be transformed by subtracting the $kT_{CP}$ terms from both sides of Eq. 47.17. Furthermore, by grouping terms and noting that $T_{Skew}(i,f) = t_{cd}^{i} - t_{cd}^{f}$ is the clock skew between the registers $R_i$ and $R_f$, Eq. 47.17 becomes

$$T_{Skew}(i,f) + 2\Delta_L^{F} \le T_{CP} - \left( D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} \right). \tag{47.18}$$

Note that a violation of Eq. 47.18 is illustrated in Fig. 47.15. The timing relationship Eq. 47.18 represents three important results describing the late arrival of the signal $D_f$ at the data input of the final register $R_f$ in a local data path $R_i \rightsquigarrow R_f$:

1. Given any values of $T_{Skew}(i,f)$, $\Delta_L^{F}$, $D_{PM}^{i,f}$, $\delta_S^{Ff}$, and $D_{CQM}^{Fi}$, the late arrival of the data signal at $R_f$ can be prevented by controlling the value of the clock period $T_{CP}$. A sufficiently large value of $T_{CP}$ can always be chosen to relax Eq. 47.18 by increasing the upper bound described by the right-hand side of Eq. 47.18.
2. For correct operation, the clock period $T_{CP}$ does not necessarily have to be larger than the term $(D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff})$. If the clock skew $T_{Skew}(i,f)$ is properly controlled, choosing a particular negative value for the clock skew will relax the left side of Eq. 47.18, thereby permitting Eq. 47.18 to be satisfied despite $T_{CP} - (D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff}) < 0$.
3. Both the term $2\Delta_L^{F}$ and the term $(D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff})$ are harmful in the sense that these terms impose a lower bound on the clock period $T_{CP}$ (as expected). Although negative skew can be used to relax the inequality of Eq. 47.18, these two terms work against relaxing the values of $T_{CP}$ and $T_{Skew}(i,f)$.

Finally, the relationship in Eq. 47.18 can be rewritten in a form that clarifies the upper bound on the clock skew $T_{Skew}(i,f)$ imposed by Eq. 47.18:

$$T_{Skew}(i,f) \le T_{CP} - \left( D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} \right) - 2\Delta_L^{F} \tag{47.19}$$
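The two readings of the setup constraint (a bound on the skew for a fixed clock period, or a bound on the clock period for a fixed skew) can be sketched numerically. The following is a minimal illustration only: the function names and the picosecond values are hypothetical and are not taken from the text.

```python
def max_skew_setup(t_cp, d_cq_max, d_p_max, t_setup, delta_l):
    """Upper bound on T_Skew(i, f) from Eq. 47.19 (all times in one unit)."""
    return t_cp - (d_cq_max + d_p_max + t_setup) - 2 * delta_l


def min_period_setup(t_skew, d_cq_max, d_p_max, t_setup, delta_l):
    """Smallest T_CP that satisfies Eq. 47.18 for a given clock skew."""
    return t_skew + 2 * delta_l + (d_cq_max + d_p_max + t_setup)


# Hypothetical path parameters in picoseconds (assumed for illustration).
path = dict(d_cq_max=300, d_p_max=4200, t_setup=200, delta_l=100)

print(max_skew_setup(t_cp=5000, **path))      # 100
print(min_period_setup(t_skew=0, **path))     # 4900
print(min_period_setup(t_skew=-500, **path))  # 4400
```

With a skew of -500 ps, the minimum clock period (4400 ps) is smaller than the sum of the register and logic delays (4700 ps), which is the situation described in the second result above.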

Preventing the Early Arrival of the Data Signal in a Local Data Path with Flip-Flops

Late arrival of the signal $D_f$ at the data input of $R_f$ (see Fig. 47.14) was analyzed in the previous subsection. In this section, the timing relationships of the local data path $R_i \rightsquigarrow R_f$ that prevent early arrival of $D_f$ are analyzed. To this end, recall from the previous discussion that any data signal $D_f$ being stored in $R_f$ must lag the arrival of the leading edge of $C_f$ by at least $\delta_H^{Ff}$. It is possible for the opposite event to occur, that is, for new data $D_f^{new}$ to overwrite the value of $D_f$ and be stored within the register $R_f$. If this situation occurs, the local data path shown in Fig. 47.14 will not perform as desired because of a catastrophic timing violation known as a hold (or short path) violation.

In this section, hold timing violations are analyzed. It is shown that a hold violation is more dangerous than a setup violation since a hold violation cannot be removed by simply adjusting the clock period $T_{CP}$ (unlike the case of a data signal arriving late, where $T_{CP}$ can be increased to satisfy Eq. 47.18). A hold violation is depicted in Fig. 47.16, which is used in the following discussion.

The situation depicted in Fig. 47.16 differs from the situation depicted in Fig. 47.15 in the following sense. In Fig. 47.15, a data signal stored in $R_i$ during the k-th clock period arrives too late to be stored in $R_f$ during the (k + 1)-th clock period. In Fig. 47.16, however, the data stored in $R_i$ during the k-th clock period arrives at $R_f$ too early and destroys the data that had to be stored in $R_f$ during the same k-th clock period. To clarify this concept, certain portions of the data signals are shaded for easy identification in Fig. 47.16. The data $D_i$ being stored in $R_i$ at the beginning of the k-th clock period is shaded. This data signal propagates to the output of the register $R_i$ and is illustrated by the shaded portion of $Q_i$ shown in Fig. 47.16. The output of the logic (left unshaded in Fig. 47.16) is being stored within the register $R_f$ at the beginning of the (k + 1)-th clock period. Finally, the shaded portion of $D_f$ corresponds to the data that must be stored in $R_f$ at the beginning of the k-th clock period.

Note that, as illustrated in Fig. 47.16, the leading (or latching) edge of $C_i$ that initiates the k-th clock period occurs at time $t_{cd}^{i} + kT_{CP}$. Similarly, the leading (or latching) edge of $C_f$ that initiates the k-th clock period occurs at time $t_{cd}^{f} + kT_{CP}$. Therefore, the earliest arrival time $t_{Am}^{Ff}$ of the data signal $D_f$ at the register $R_f$ must satisfy the following condition:

$$t_{Am}^{Ff} \ge \left( t_{cd}^{f} + kT_{CP} + \Delta_L^{F} \right) + \delta_H^{Ff} \tag{47.20}$$

FIGURE 47.16 Timing diagram of a local data path with flip-flops with a violation of the hold constraint.

The term $(t_{cd}^{f} + kT_{CP} + \Delta_L^{F})$ on the right-hand side of Eq. 47.20 corresponds to the critical situation of the leading edge of the k-th clock period of $C_f$ arriving late by the maximum possible deviation $\Delta_L^{F}$. Note that the value of $t_{Am}^{Ff}$ in Eq. 47.20 has two components:

1. The earliest arrival time $t_{Qm}^{Fi}$ at which a valid data signal $Q_i$ appears at the output of $R_i$: that is, the sum $t_{Qm}^{Fi} = t_{cd}^{i} + kT_{CP} - \Delta_L^{F} + D_{CQm}^{Fi}$ of the earliest arrival time of the leading edge of $C_i$ and the minimum clock-to-Q delay of $R_i$.
2. The minimum propagation delay $D_{Pm}^{i,f}$ of the signals through the combinational logic block $L_{if}$ and interconnect wires along the path $R_i \rightsquigarrow R_f$.

Therefore, $t_{Am}^{Ff}$ can be described as

$$t_{Am}^{Ff} = t_{Qm}^{Fi} + D_{Pm}^{i,f} = \left( t_{cd}^{i} + kT_{CP} - \Delta_L^{F} + D_{CQm}^{Fi} \right) + D_{Pm}^{i,f} \tag{47.21}$$

By substituting Eq. 47.21 into Eq. 47.20, the timing condition that guarantees that $D_f$ does not arrive too early at $R_f$ is

$$\left( t_{cd}^{i} + kT_{CP} - \Delta_L^{F} + D_{CQm}^{Fi} \right) + D_{Pm}^{i,f} \ge \left( t_{cd}^{f} + kT_{CP} + \Delta_L^{F} \right) + \delta_H^{Ff} \tag{47.22}$$

The inequality Eq. 47.22 can be further simplified by regrouping terms and noting that $T_{Skew}(i,f) = t_{cd}^{i} - t_{cd}^{f}$ is the clock skew between the registers $R_i$ and $R_f$:

$$T_{Skew}(i,f) - 2\Delta_L^{F} \ge -\left( D_{CQm}^{Fi} + D_{Pm}^{i,f} \right) + \delta_H^{Ff} \tag{47.23}$$

Recall that a violation of Eq. 47.23 is illustrated in Fig. 47.16. The timing relationship described by Eq. 47.23 provides certain important facts describing the early arrival of the signal $D_f$ at the data input of the final register $R_f$ of a local data path:

1. Unlike Eq. 47.18, the inequality Eq. 47.23 does not depend on the clock period $T_{CP}$. Therefore, a violation of Eq. 47.23 cannot be corrected by simply manipulating the value of $T_{CP}$. A synchronous digital system with hold violations is non-functional, while a system with setup violations will still operate correctly at a reduced speed.¹⁴ For this reason, hold violations result in catastrophic timing failure and are considered significantly more dangerous than the setup violations previously described.
2. The relationship in Eq. 47.23 can be satisfied with a sufficiently large value of the clock skew $T_{Skew}(i,f)$. However, both the term $2\Delta_L^{F}$ and the term $\delta_H^{Ff}$ are harmful in the sense that these terms impose a lower bound on the clock skew $T_{Skew}(i,f)$ between the registers $R_i$ and $R_f$. Although positive skew may be used to relax Eq. 47.23, these two terms work against relaxing the values of $T_{Skew}(i,f)$ and $(D_{CQm}^{Fi} + D_{Pm}^{i,f})$.

Finally, the relationship in Eq. 47.23 can be rewritten to stress the lower bound imposed on the clock skew $T_{Skew}(i,f)$ by Eq. 47.23:

$$T_{Skew}(i,f) \ge -\left( D_{Pm}^{i,f} + D_{CQm}^{Fi} \right) + \delta_H^{Ff} + 2\Delta_L^{F} \tag{47.24}$$
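Taken together, Eqs. 47.19 and 47.24 bound the clock skew of a flip-flop path from above and below. The sketch below computes that permissible range under assumed values; the function names and the picosecond numbers are hypothetical, chosen only to show that the lower (hold) bound is unaffected by the clock period.

```python
def min_skew_hold(d_cq_min, d_p_min, t_hold, delta_l):
    """Lower bound on T_Skew(i, f) from Eq. 47.24."""
    return -(d_p_min + d_cq_min) + t_hold + 2 * delta_l


def max_skew_setup(t_cp, d_cq_max, d_p_max, t_setup, delta_l):
    """Upper bound on T_Skew(i, f) from Eq. 47.19."""
    return t_cp - (d_cq_max + d_p_max + t_setup) - 2 * delta_l


def permissible_skew_range(t_cp, d_cq_min, d_cq_max, d_p_min, d_p_max,
                           t_setup, t_hold, delta_l):
    """Return (lower, upper), or None when no skew satisfies both bounds."""
    lower = min_skew_hold(d_cq_min, d_p_min, t_hold, delta_l)
    upper = max_skew_setup(t_cp, d_cq_max, d_p_max, t_setup, delta_l)
    return (lower, upper) if lower <= upper else None


# Hypothetical path parameters in picoseconds (assumed for illustration).
path = dict(d_cq_min=200, d_cq_max=300, d_p_min=500, d_p_max=4200,
            t_setup=200, t_hold=150, delta_l=100)

print(permissible_skew_range(t_cp=5000, **path))   # (-350, 100)
print(permissible_skew_range(t_cp=10000, **path))  # (-350, 5100)
print(permissible_skew_range(t_cp=4400, **path))   # None
```

Doubling the clock period widens the range only at its upper end; the hold bound stays at -350 ps, so a path whose skew lies below that value fails at any clock frequency.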

Analysis of a Single-Phase Local Data Path with Latches

A local data path consisting of two level-sensitive registers (or latches) and the combinational logic between these registers (or latches) is shown in Fig. 47.17. Note the initial latch $R_i$, which is the origin of the data signal, and the final latch $R_f$, which is the destination of the data signal. The combinational logic block $L_{if}$ between $R_i$ and $R_f$ accepts the input data signals sourced by $R_i$ and other registers and logic gates and transmits the data signals that have been operated on to $R_f$. The period of the clock signal is denoted by $T_{CP}$, and the delays of the clock signals $C_i$ and $C_f$ to the latches $R_i$ and $R_f$ are denoted by $t_{cd}^{i}$ and $t_{cd}^{f}$, respectively. The input and output data signals to $R_i$ and $R_f$ are denoted by $D_i$, $Q_i$, $D_f$, and $Q_f$, respectively.

FIGURE 47.17 A single-phase local data path with latches.

An analysis of the timing properties of the local data path shown in Fig. 47.17 is offered in the following sections. The timing relationships to prevent the late arrival of the data signal at the latch $R_f$ are examined, as well as the timing relationships to prevent the early arrival of the data signal at the latch $R_f$.

¹⁴ Increasing the clock period $T_{CP}$ in order to satisfy Eq. 47.18 is equivalent to reducing the frequency of the clock signal.

The analyses presented in this section build on assumptions regarding the timing relationships among the signals of a latch similar to those assumptions used in the previous chapter section. Specifically, it is guaranteed that every data signal arrives at the data input of a latch no later than $\delta_S^{L}$ time before the trailing clock edge. Also, this data signal must remain stable for at least $\delta_H^{L}$ time after the trailing edge, that is, no new data signal should arrive at a latch within $\delta_H^{L}$ time after the latch has become opaque.

Observe the differences between a latch and a flip-flop (Refs. 45 and 50). In flip-flops, the setup and hold requirements described in the previous paragraph are relative to the leading edge of the clock signal, not to the trailing edge. As with flip-flops, the late and early arrival of the data signal at a latch give rise to timing violations known as setup and hold violations, respectively.

Preventing the Late Arrival of the Data Signal in a Local Data Path with Latches

A signal setup similar to the example illustrated in Fig. 47.15 is assumed in the following discussion. A data signal $D_i$ is stored in the latch $R_i$ during the k-th clock period. The data $Q_i$ stored in $R_i$ propagates through the combinational logic $L_{if}$ and the interconnect along the path $R_i \rightsquigarrow R_f$. In the (k + 1)-th clock period, the result $D_f$ of the computation in $L_{if}$ is stored within the latch $R_f$. The signal $D_f$ must arrive at least $\delta_S^{L}$ time before the trailing edge of $C_f$ in the (k + 1)-th clock period.

Similar to the discussion presented in the previous section, the latest arrival time $t_{AM}^{Lf}$ of $D_f$ at the D input of $R_f$ must satisfy

$$t_{AM}^{Lf} \le \left[ t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L} \right] - \delta_S^{Lf} \tag{47.25}$$

Note the difference between Eqs. 47.25 and 47.15. In Eq. 47.15, the first term on the right-hand side is $[t_{cd}^{f} + (k+1)T_{CP} - \Delta_L^{F}]$, while in Eq. 47.25, the first term on the right-hand side has an additional term $C_{Wm}^{L}$. The addition of $C_{Wm}^{L}$ corresponds to the concept that, unlike flip-flops, a data signal is stored in a latch, shown in Fig. 47.17, at the trailing edge of the clock signal (the $C_{Wm}^{L}$ term). Similar to the case of flip-flops, the term $[t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L}]$ on the right-hand side of Eq. 47.25 corresponds to the critical situation of the trailing edge of the clock signal $C_f$ arriving earlier by the maximum possible deviation $\Delta_T^{L}$.

Observe that the value of $t_{AM}^{Lf}$ in Eq. 47.25 consists of two components:

1. The latest arrival time $t_{QM}^{Li}$ when a valid data signal $Q_i$ appears at the output of the latch $R_i$.
2. The maximum signal propagation delay $D_{PM}^{i,f}$ through the combinational logic block $L_{if}$ and the interconnect along the path $R_i \rightsquigarrow R_f$.

Therefore, $t_{AM}^{Lf}$ can be described as

$$t_{AM}^{Lf} = D_{PM}^{i,f} + t_{QM}^{Li} \tag{47.26}$$

However, unlike the situation of flip-flops discussed previously, the term $t_{QM}^{Li}$ on the right-hand side of Eq. 47.26 is not simply the sum of the delays through the register $R_i$. The reason is that the value of $t_{QM}^{Li}$ depends on whether the signal $D_i$ arrived before or during the transparent state of $R_i$ in the k-th clock period. Therefore, the value of $t_{QM}^{Li}$ in Eq. 47.26 is the greater of the following two quantities:

$$t_{QM}^{Li} = \max\left[ \left( t_{AM}^{Li} + D_{DQM}^{Li} \right), \left( t_{cd}^{i} + kT_{CP} + \Delta_L^{L} + D_{CQM}^{Li} \right) \right] \tag{47.27}$$

There are two terms on the right-hand side of Eq. 47.27:

1. The term $(t_{AM}^{Li} + D_{DQM}^{Li})$ corresponds to the situation in which $D_i$ arrives at $R_i$ after the leading edge of the k-th clock period.
2. The term $(t_{cd}^{i} + kT_{CP} + \Delta_L^{L} + D_{CQM}^{Li})$ corresponds to the situation in which $D_i$ arrives at $R_i$ before the leading edge of the k-th clock pulse arrives.

By substituting Eq. 47.27 into Eq. 47.26, the latest time of arrival $t_{AM}^{Lf}$ is

$$t_{AM}^{Lf} = D_{PM}^{i,f} + \max\left[ \left( t_{AM}^{Li} + D_{DQM}^{Li} \right), \left( t_{cd}^{i} + kT_{CP} + \Delta_L^{L} + D_{CQM}^{Li} \right) \right] \tag{47.28}$$

which is in turn substituted into Eq. 47.25 to obtain

$$D_{PM}^{i,f} + \max\left[ \left( t_{AM}^{Li} + D_{DQM}^{Li} \right), \left( t_{cd}^{i} + kT_{CP} + \Delta_L^{L} + D_{CQM}^{Li} \right) \right] \le \left[ t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L} \right] - \delta_S^{Lf} \tag{47.29}$$

Equation 47.29 is an expression for the inequality that must be satisfied in order to prevent the late arrival of a data signal at the data input D of the register $R_f$. By satisfying Eq. 47.29, setup violations in the local data path with latches shown in Fig. 47.17 are avoided. For a circuit to operate correctly, Eq. 47.29 must be enforced for any local data path $R_i \rightsquigarrow R_f$ consisting of the latches $R_i$ and $R_f$.

The max operation in Eq. 47.29 creates a mathematically difficult situation since it is unknown which of the quantities under the max operation is greater. To overcome this obstacle, this max operation can be split into two conditions:

$$D_{PM}^{i,f} + \left( t_{AM}^{Li} + D_{DQM}^{Li} \right) \le \left[ t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L} \right] - \delta_S^{Lf} \tag{47.30}$$

$$D_{PM}^{i,f} + \left( t_{cd}^{i} + kT_{CP} + \Delta_L^{L} + D_{CQM}^{Li} \right) \le \left[ t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L} \right] - \delta_S^{Lf} \tag{47.31}$$

Taking into account that the clock skew $T_{Skew}(i,f) = t_{cd}^{i} - t_{cd}^{f}$, Eqs. 47.30 and 47.31 can be rewritten as

$$D_{PM}^{i,f} + \left( t_{AM}^{Li} + D_{DQM}^{Li} \right) \le \left[ t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L} \right] - \delta_S^{Lf} \tag{47.32}$$

$$T_{Skew}(i,f) + \left( \Delta_L^{L} + \Delta_T^{L} \right) \le \left( T_{CP} + C_{Wm}^{L} \right) - \left( D_{CQM}^{Li} + D_{PM}^{i,f} + \delta_S^{Lf} \right) \tag{47.33}$$

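Equation 47.33 can be rewritten in a form that clarifies the upper bound on the clock skew $T_{Skew}(i,f)$ imposed by Eq. 47.33, while Eq. 47.32 is carried over unchanged:

$$D_{PM}^{i,f} + \left( t_{AM}^{Li} + D_{DQM}^{Li} \right) \le \left[ t_{cd}^{f} + (k+1)T_{CP} + C_{Wm}^{L} - \Delta_T^{L} \right] - \delta_S^{Lf} \tag{47.34}$$

$$T_{Skew}(i,f) \le \left( T_{CP} + C_{Wm}^{L} - \Delta_L^{L} - \Delta_T^{L} \right) - \left( D_{CQM}^{Li} + D_{PM}^{i,f} + \delta_S^{Lf} \right) \tag{47.35}$$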
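The role of the max operation in Eqs. 47.27 through 47.29 can be made concrete with a small numerical sketch. The function names and all values below (in picoseconds) are hypothetical; the example only illustrates that the same path can pass or fail the setup check depending on when $D_i$ arrives at the transparent latch $R_i$.

```python
def latest_output_time(t_am_i, d_dq_max, t_cd_i, k, t_cp, delta_ll, d_cq_max):
    """Latest time t_QM^{Li} at which Q_i is valid, following Eq. 47.27."""
    data_limited = t_am_i + d_dq_max                         # D_i arrives while R_i is transparent
    clock_limited = t_cd_i + k * t_cp + delta_ll + d_cq_max  # D_i waits for the leading edge
    return max(data_limited, clock_limited)


def setup_slack(t_qm_i, d_p_max, t_cd_f, k, t_cp, c_w_min, delta_tl, t_setup):
    """Deadline of Eq. 47.25 minus the arrival time of Eq. 47.26; >= 0 passes."""
    arrival = d_p_max + t_qm_i
    deadline = (t_cd_f + (k + 1) * t_cp + c_w_min - delta_tl) - t_setup
    return deadline - arrival


# Hypothetical values in picoseconds (assumed for illustration).
launch = dict(d_dq_max=250, t_cd_i=0, k=0, t_cp=5000, delta_ll=100, d_cq_max=300)
capture = dict(d_p_max=6000, t_cd_f=0, k=0, t_cp=5000, c_w_min=2500,
               delta_tl=100, t_setup=200)

early = latest_output_time(t_am_i=-500, **launch)  # clock-limited: 400
late = latest_output_time(t_am_i=1000, **launch)   # data-limited: 1250

print(setup_slack(early, **capture))  # 800
print(setup_slack(late, **capture))   # -50
```

In this sketch the logic delay (6000 ps) exceeds the clock period (5000 ps), yet the first case still meets the deadline because the capturing latch stores data at the trailing edge; the second case violates Eq. 47.29 through its data-limited term.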

Preventing the Early Arrival of the Data Signal in a Local Data Path with Latches

A signal setup similar to the example illustrated in Fig. 47.16 is assumed in the discussion presented in this section. Recall the difference between the late arrival of a data signal at $R_f$ and the early arrival of a data signal at $R_f$. In the former case, the data signal stored in the latch $R_i$ during the k-th clock period arrives too late to be stored in the latch $R_f$ during the (k + 1)-th clock period. In the latter case, the data signal stored in the latch $R_i$ during the k-th clock period propagates to the latch $R_f$ too early and overwrites the data signal that was already stored in the latch $R_f$ during the same k-th clock period.

In order for the proper data signal to be successfully latched within $R_f$ during the k-th clock period, there should not be any changes in the signal $D_f$ until at least the hold time after the arrival of the storing (trailing) edge of the clock signal $C_f$. Therefore, the earliest arrival time $t_{Am}^{Lf}$ of the data signal $D_f$ at the register $R_f$ must satisfy the following condition:

$$t_{Am}^{Lf} \ge \left( t_{cd}^{f} + kT_{CP} + C_{Wm}^{L} + \Delta_T^{L} \right) + \delta_H^{Lf} \tag{47.36}$$

The term $(t_{cd}^{f} + kT_{CP} + C_{Wm}^{L} + \Delta_T^{L})$ on the right-hand side of Eq. 47.36 corresponds to the critical situation of the trailing edge of the k-th clock period of the clock signal $C_f$ arriving late by the maximum possible deviation $\Delta_T^{L}$. Note that the value of $t_{Am}^{Lf}$ in Eq. 47.36 consists of two components:

1. The earliest arrival time $t_{Qm}^{Li}$ at which a valid data signal $Q_i$ appears at the output of the latch $R_i$: that is, the sum $t_{Qm}^{Li} = t_{cd}^{i} + kT_{CP} - \Delta_L^{L} + D_{CQm}^{Li}$ of the earliest arrival time of the leading edge of the clock signal $C_i$ and the minimum clock-to-Q delay $D_{CQm}^{Li}$ of $R_i$.
2. The minimum propagation delay $D_{Pm}^{i,f}$ of the signal through the combinational logic $L_{if}$ and the interconnect along the path $R_i \rightsquigarrow R_f$.

Therefore, $t_{Am}^{Lf}$ can be described as

$$t_{Am}^{Lf} = t_{Qm}^{Li} + D_{Pm}^{i,f} = \left( t_{cd}^{i} + kT_{CP} - \Delta_L^{L} + D_{CQm}^{Li} \right) + D_{Pm}^{i,f} \tag{47.37}$$

By substituting Eq. 47.37 into Eq. 47.36, the timing condition guaranteeing that $D_f$ does not arrive too early at the latch $R_f$ is

$$\left( t_{cd}^{i} + kT_{CP} - \Delta_L^{L} + D_{CQm}^{Li} \right) + D_{Pm}^{i,f} \ge \left( t_{cd}^{f} + kT_{CP} + C_{Wm}^{L} + \Delta_T^{L} \right) + \delta_H^{Lf} \tag{47.38}$$

The inequality Eq. 47.38 can be further simplified by reorganizing the terms and noting that $T_{Skew}(i,f) = t_{cd}^{i} - t_{cd}^{f}$ is the clock skew between the registers $R_i$ and $R_f$:

$$T_{Skew}(i,f) - \left( \Delta_L^{L} + \Delta_T^{L} \right) \ge -\left( D_{CQm}^{Li} + D_{Pm}^{i,f} \right) + \delta_H^{Lf} \tag{47.39}$$

The timing relationship described by Eq. 47.39 represents two important results describing the early arrival of the signal $D_f$ at the data input of the final latch $R_f$ of a local data path:

1. The relationship in Eq. 47.39 does not depend on the value of the clock period $T_{CP}$. Therefore, if a hold timing violation in a synchronous system has occurred,¹⁵ this timing violation is catastrophic.
2. The relationship in Eq. 47.39 can be satisfied with a sufficiently large value of the clock skew $T_{Skew}(i,f)$. Furthermore, both the term $(\Delta_L^{L} + \Delta_T^{L})$ and the term $\delta_H^{Lf}$ are harmful in the sense that these terms impose a lower bound on the clock skew $T_{Skew}(i,f)$ between the latches $R_i$ and $R_f$. Although positive skew $T_{Skew}(i,f) > 0$ can be used to relax Eq. 47.39, these two terms make it difficult to satisfy the inequality in Eq. 47.39 for specific values of $T_{Skew}(i,f)$ and $(D_{CQm}^{Li} + D_{Pm}^{i,f})$.

Furthermore, Eq. 47.39 can be rewritten to emphasize the lower bound on the clock skew $T_{Skew}(i,f)$ imposed by Eq. 47.39:

$$T_{Skew}(i,f) \ge \left( \Delta_L^{L} + \Delta_T^{L} \right) - \left( D_{CQm}^{Li} + D_{Pm}^{i,f} \right) + \delta_H^{Lf} \tag{47.40}$$
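As a rough illustration, the skew bounds for a latch-based path can be evaluated side by side from Eqs. 47.35 and 47.40 as written above. The function name and the picosecond values are hypothetical. Note that this sketch covers only the clock-limited setup bound; the data-limited condition of Eq. 47.34 has to be checked separately.

```python
def latch_skew_bounds(t_cp, c_w_min, delta_ll, delta_tl,
                      d_cq_min, d_cq_max, d_p_min, d_p_max,
                      t_setup, t_hold):
    """Return (lower, upper) bounds on T_Skew(i, f) for a latch path."""
    # Lower bound from the hold constraint, Eq. 47.40.
    lower = (delta_ll + delta_tl) - (d_cq_min + d_p_min) + t_hold
    # Upper bound from the clock-limited setup constraint, Eq. 47.35.
    upper = (t_cp + c_w_min - delta_ll - delta_tl) - (d_cq_max + d_p_max + t_setup)
    return lower, upper


# Hypothetical values in picoseconds (assumed for illustration).
lower, upper = latch_skew_bounds(
    t_cp=5000, c_w_min=2500, delta_ll=100, delta_tl=100,
    d_cq_min=200, d_cq_max=300, d_p_min=500, d_p_max=4200,
    t_setup=200, t_hold=150,
)
print(lower, upper)  # -350 2600
```

Compared with a flip-flop path using the same delays, the upper bound is larger by the clock width $C_{Wm}^{L}$, which reflects the additional time available before the trailing edge.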

¹⁵ As described by the inequality Eq. 47.39 not being satisfied.

47.5 A Final Note

The properties of registers and local data paths were described in this chapter. Specifically, the timing relationships to prevent setup and hold timing violations in a local data path consisting of two positive edge-triggered flip-flops were analyzed. The timing relationships to prevent setup and hold timing violations in a local data path consisting of two positive-polarity latches were also analyzed.

In a fully synchronous digital VLSI system, however, it is possible to encounter types of local data paths different from those circuits analyzed in this chapter. For example, a local data path may begin with a positive-polarity, edge-sensitive register $R_i$ and end with a negative-polarity, edge-sensitive register $R_f$. It is also possible that different types of registers are used, for example, a register with more than one data input. In each individual case, the analyses described in this chapter illustrate the general methodology used to derive the proper timing relationships specific to that system. Furthermore, note that for a given system, the timing relationships that must be satisfied for the system to operate correctly (such as Eqs. 47.19, 47.24, 47.34, 47.35, and 47.40) are collectively referred to as the overall timing constraints of the synchronous digital system (Refs. 13 and 51 through 55).
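The idea of a collective set of timing constraints can be sketched as a check over every local data path of a small circuit. The example below is a simplified illustration that assumes all registers are positive edge-triggered flip-flops, so only Eqs. 47.19 and 47.24 apply; the `Path` type, the register names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One local data path R_src ~> R_dst with its delay parameters."""
    src: str
    dst: str
    d_cq_min: int
    d_cq_max: int
    d_p_min: int
    d_p_max: int
    t_setup: int
    t_hold: int


def find_violations(paths, clock_delay, t_cp, delta_l):
    """Return (src, dst, kind) for each path violating Eq. 47.19 or Eq. 47.24."""
    violations = []
    for p in paths:
        skew = clock_delay[p.src] - clock_delay[p.dst]  # T_Skew(i, f)
        upper = t_cp - (p.d_cq_max + p.d_p_max + p.t_setup) - 2 * delta_l
        lower = -(p.d_p_min + p.d_cq_min) + p.t_hold + 2 * delta_l
        if skew > upper:
            violations.append((p.src, p.dst, "setup"))
        if skew < lower:
            violations.append((p.src, p.dst, "hold"))
    return violations


# Hypothetical three-register pipeline, times in picoseconds.
paths = [
    Path("R1", "R2", 200, 300, 500, 4200, 200, 150),
    Path("R2", "R3", 200, 300, 100, 1000, 200, 150),
]
clock_delay = {"R1": 0, "R2": 0, "R3": 400}

print(find_violations(paths, clock_delay, t_cp=5000, delta_l=100))
# [('R2', 'R3', 'hold')]
```

The short path from R2 to R3 fails its hold bound (skew of -400 ps against a lower bound of 50 ps), and because that bound does not involve the clock period, slowing the clock would not remove the violation.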

References

1. Kilby, J. S., “Invention of the Integrated Circuit,” IEEE Transactions on Electron Devices, vol. ED-23, pp. 648-654, July 1976.
2. Rabaey, J. M., Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, 1995.
3. Gaddis, N. and Lotz, J., “A 64-b Quad-Issue CMOS RISC Microprocessor,” IEEE Journal of Solid-State Circuits, vol. SC-31, pp. 1697-1702, Nov. 1996.
4. Gronowski, P. E. et al., “A 433-MHz 64-bit Quad-Issue RISC Microprocessor,” IEEE Journal of Solid-State Circuits, vol. SC-31, pp. 1687-1696, Nov. 1996.
5. Vasseghi, N., Yeager, K., Sarto, E., and Seddighnezhad, M., “200-MHz Superscalar RISC Microprocessor,” IEEE Journal of Solid-State Circuits, vol. SC-31, pp. 1675-1686, Nov. 1996.
6. Bakoglu, H. B., Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley Publishing Company, 1990.
7. Bothra, S., Rogers, B., Kellam, M., and Osburn, C. M., “Analysis of the Effects of Scaling on Interconnect Delay in ULSI Circuits,” IEEE Transactions on Electron Devices, vol. ED-40, pp. 591-597, Mar. 1993.
8. Weste, N. W. and Eshraghian, K., Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley Publishing Company, Reading, MA, 2nd ed., 1992.
9. Mead, C. and Conway, L., Introduction to VLSI Systems. Addison-Wesley Publishing Company, Reading, MA, 1980.
10. Anceau, F., “A Synchronous Approach for Clocking VLSI Systems,” IEEE Journal of Solid-State Circuits, vol. SC-17, pp. 51-56, Feb. 1982.
11. Afghani, M. and Svensson, C., “A Unified Clocking Scheme for VLSI Systems,” IEEE Journal of Solid-State Circuits, vol. SC-25, pp. 225-233, Feb. 1990.
12. Unger, S. H. and Tan, C.-J., “Clocking Schemes for High-Speed Digital Systems,” IEEE Transactions on Computers, vol. C-35, pp. 880-895, Oct. 1986.
13. Friedman, E. G., Clock Distribution Networks in VLSI Circuits and Systems. IEEE Press, 1995.
14. Bowhill, W. J. et al., “Circuit Implementation of a 300-MHz 64-bit Second-generation CMOS Alpha CPU,” Digital Technical Journal, vol. 7, no. 1, pp. 100-118, 1995.
15. Neves, J. L. and Friedman, E. G., “Topological Design of Clock Distribution Networks Based on Non-Zero Clock Skew Specification,” Proceedings of the 36th IEEE Midwest Symposium on Circuits and Systems, pp. 468-471, Aug. 1993.
16. Xi, J. G. and Dai, W. W.-M., “Useful-Skew Clock Routing With Gate Sizing for Low Power Design,” Proceedings of the 33rd ACM/IEEE Design Automation Conference, pp. 383-388, June 1996.
17. Neves, J. L. and Friedman, E. G., “Design Methodology for Synthesizing Clock Distribution Networks Exploiting Non-Zero Localized Clock Skew,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. VLSI-4, pp. 286-291, June 1996.
18. Jackson, M. A. B., Srinivasan, A., and Kuh, E. S., “Clock Routing for High-Performance ICs,” Proceedings of the 27th ACM/IEEE Design Automation Conference, pp. 573-579, June 1990.
19. Tsay, R.-S., “An Exact Zero-Skew Clock Routing Algorithm,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. CAD-12, pp. 242-249, Feb. 1993.
20. Chou, N.-C. and Cheng, C.-K., “On General Zero-Skew Clock Net Construction,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. VLSI-3, pp. 141-146, Mar. 1995.
21. Ito, N., Sugiyama, H., and Konno, T., “ChipPRISM: Clock Routing and Timing Analysis for High-Performance CMOS VLSI Chips,” Fujitsu Scientific and Technical Journal, vol. 31, pp. 180-187, Dec. 1995.
22. Leiserson, C. E. and Saxe, J. B., “A Mixed-Integer Linear Programming Problem Which is Efficiently Solvable,” Journal of Algorithms, vol. 9, pp. 114-128, Mar. 1988.
23. Cormen, T. H., Leiserson, C. E., and Rivest, R. L., Introduction to Algorithms. MIT Press, 1989.
24. West, D. B., Introduction to Graph Theory. Prentice-Hall, 1996.
25. Fishburn, J. P., “Clock Skew Optimization,” IEEE Transactions on Computers, vol. C-39, pp. 945-951, July 1990.
26. Lee, T.-C. and Kong, J., “The New Line in IC Design,” IEEE Spectrum, pp. 52-58, Mar. 1997.
27. Friedman, E. G., “The Application of Localized Clock Distribution Design to Improving the Performance of Retimed Sequential Circuits,” Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, pp. 12-17, Dec. 1992.
28. Kourtev, I. S. and Friedman, E. G., “Simultaneous Clock Scheduling and Buffered Clock Tree Synthesis,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1812-1815, June 1997.
29. Neves, J. L. and Friedman, E. G., “Optimal Clock Skew Scheduling Tolerant to Process Variations,” Proceedings of the 33rd ACM/IEEE Design Automation Conference, pp. 623-628, June 1996.
30. Glasser, L. A. and Dobberpuhl, D. W., The Design and Analysis of VLSI Circuits. Addison-Wesley Publishing Company, 1985.
31. Uyemura, J. P., Circuit Design for CMOS VLSI. Kluwer Academic Publishers, 1992.
32. Kang, S. M. and Leblebici, Y., CMOS Digital Integrated Circuits: Analysis and Design. The McGraw-Hill Companies, Inc., 1996.
33. Sedra, A. S. and Smith, K. C., Microelectronic Circuits. Oxford University Press, 4th ed., 1997.
34. Kohavi, Z., Switching and Finite Automata Theory. McGraw-Hill Book Company, New York, NY, 2nd ed., 1978.
35. Mano, M. M. and Kime, C. R., Logic and Computer Design Fundamentals. Prentice-Hall, Inc., 1997.
36. Wolf, W., Modern VLSI Design: A Systems Approach. Prentice-Hall, Inc., 1994.
37. Kacprzak, T. and Albicki, A., “Analysis of Metastable Operation in RS CMOS Flip-Flops,” IEEE Journal of Solid-State Circuits, vol. SC-22, pp. 57-64, Feb. 1987.
38. Jackson, T. A. and Albicki, A., “Analysis of Metastable Operation in D Latches,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. CAS I-36, pp. 1392-1404, Nov. 1989.
39. Friedman, E. G., “Latching Characteristics of a CMOS Bistable Register,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. CAS I-40, pp. 902-908, Dec. 1993.
40. Unger, S. H., “Double-Edge-Triggered Flip-Flops,” IEEE Transactions on Computers, vol. C-30, pp. 447-451, June 1981.
41. Lu, S.-L., “A Novel CMOS Implementation of Double-Edge-Triggered D-Flip-Flops,” IEEE Journal of Solid-State Circuits, vol. SC-25, pp. 1008-1010, Aug. 1990.
42. Afghani, M. and Yuan, J., “Double-Edge-Triggered D-Flip-Flops for High-Speed CMOS Circuits,” IEEE Journal of Solid-State Circuits, vol. SC-26, pp. 1168-1170, Aug. 1991.
43. Hossain, R., Wronski, L., and Albicki, A., “Double Edge Triggered Devices: Speed and Power Constraints,” Proceedings of the 1996 IEEE International Symposium on Circuits and Systems, vol. 3, pp. 1491-1494, 1993.
44. Blair, G. M., “Low-Power Double-Edge Triggered Flip-Flop,” Electronics Letters, vol. 33, pp. 845-847, May 1997.
45. Lin, I., Ludwig, J. A., and Eng, K., “Analyzing Cycle Stealing on Synchronous Circuits with Level-Sensitive Latches,” Proceedings of the 29th ACM/IEEE Design Automation Conference, pp. 393-398, June 1992.
46. Lee, J.-fuw, Tang, D. T., and Wong, C. K., “A Timing Analysis Algorithm for Circuits with Level-Sensitive Latches,” IEEE Transactions on Computer-Aided Design, vol. CAD-15, pp. 535-543, May 1996.
47. Szymanski, T. G., “Computing Optimal Clock Schedules,” Proceedings of the 29th ACM/IEEE Design Automation Conference, pp. 399-404, June 1992.
48. Dagenais, M. R. and Rumin, N. C., “On the Calculation of Optimal Clocking Parameters in Synchronous Circuits with Level-Sensitive Latches,” IEEE Transactions on Computer-Aided Design, vol. CAD-8, pp. 268-278, Mar. 1989.
49. Sakallah, K. A., Mudge, T. N., and Olukotun, O. A., “checkTc and minTc: Timing Verification and Optimal Clocking of Synchronous Digital Circuits,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 552-555, Nov. 1990.
50. Sakallah, K. A., Mudge, T. N., and Olukotun, O. A., “Analysis and Design of Latch-Controlled Synchronous Digital Circuits,” IEEE Transactions on Computer-Aided Design, vol. CAD-11, pp. 322-333, Mar. 1992.
51. Kourtev, I. S. and Friedman, E. G., “Topological Synthesis of Clock Trees with Non-Zero Clock Skew,” Proceedings of the 1997 ACM/IEEE International Workshop on Timing Issues in the Specification and Design of Digital Systems, pp. 158-163, Dec. 1997.
52. Kourtev, I. S. and Friedman, E. G., “Topological Synthesis of Clock Trees for VLSI-Based DSP Systems,” Proceedings of the IEEE Workshop on Signal Processing Systems, pp. 151-162, Nov. 1997.
53. Kourtev, I. S. and Friedman, E. G., “Integrated Circuit Signal Delay,” Encyclopedia of Electrical and Electronics Engineering. Wiley Publishing Company, vol. 10, pp. 378-392, 1999.
54. Neves, J. L. and Friedman, E. G., “Synthesizing Distributed Clock Trees for High Performance ASICs,” Proceedings of the IEEE ASIC Conference, pp. 126-129, Sept. 1994.
55. Neves, J. L. and Friedman, E. G., “Buffered Clock Tree Synthesis with Optimal Clock Skew Scheduling for Reduced Sensitivity to Process Parameter Variations,” Proceedings of the ACM/SIGDA International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 131-141, Nov. 1995.
56. Deokar, R. R. and Sapatnekar, S. S., “A Fresh Look at Retiming via Clock Skew Optimization,” Proceedings of the 32nd ACM/IEEE Design Automation Conference, pp. 310-315, June 1995.


Appendix Glossary of Terms The following notations are used in this section. 1. Clock Signal Parameters TCP:

The clock period of a circuit

∆L :

The tolerance of the leading edge of any clock signal

∆T :

The tolerance of the trailing edge of any clock signal

∆L : L

The tolerance of the leading edge of a clock signal driving a latch

∆T :

The tolerance of the trailing edge of a clock signal driving a latch

∆L : F

The tolerance of the leading edge of a clock signal driving a flip-flop

∆T : F

The tolerance of the trailing edge of a clock signal driving a flip-flop

L

The minimum width of the clock signal in a circuit with latches

F

The minimum width of the clock signal in a circuit with flip-flops

L

C Wm : C Wm :

2. Latch Parameters L

D CQ :

The clock-to-output delay of a latch

D

Li CQ

D

L CQm

: The minimum clock-to-output delay of a latch

D

Li CQm

: The minimum clock-to-output delay of the latch Ri

:

The clock-to-output delay of the latch Ri

L

D CQM : The maximum clock-to-output delay of a latch Li

D CQM : The maximum clock-to-output delay of the latch Ri L

D DQ :

The data-to-output delay of a latch

D

Li DQ

D

L DQm

: The minimum data-to-output delay of a latch

D

Li DQm

: The minimum data-to-output delay of the latch Ri

D

L DQM

: The maximum data-to-output delay of a latch

D

Li DQM

: The maximum data-to-output delay of the latch Ri

:

The data-to-output delay of the latch Ri

δS :

The setup time of a latch

δ :

The setup time of the latch Ri

δ :

The hold time of a latch

δ :

The hold time of the latch Ri

L

Li S L H

Li H

t

L AM

:

The latest arrival time of the data signal at the data input of a latch

t

Li AM

:

The latest arrival time of the data signal at the data input of the latch Ri

t

L Am

:

The earliest arrival time of the data signal at the data input of a latch

t

Li Am

:

L

t QM :

The earliest arrival time of the data signal at the data input of the latch Ri The latest arrival time of the data signal at the data output of a latch

© 2000 by CRC Press LLC

Li

The latest arrival time of the data signal at the data output of the latch Ri

t QM : t

L Qm

:

The earliest arrival time of the data signal at the data output of a latch

t

Li Qm

:

The earliest arrival time of the data signal at the data output of the latch Ri

3. Flip-flop Parameters

$D_{CQ}^{F}$: The clock-to-output delay of a flip-flop

$D_{CQ}^{Fi}$: The clock-to-output delay of the flip-flop Ri

$D_{CQm}^{F}$: The minimum clock-to-output delay of a flip-flop

$D_{CQm}^{Fi}$: The minimum clock-to-output delay of the flip-flop Ri

$D_{CQM}^{F}$: The maximum clock-to-output delay of a flip-flop

$D_{CQM}^{Fi}$: The maximum clock-to-output delay of the flip-flop Ri

$\delta_{S}^{F}$: The setup time of a flip-flop

$\delta_{S}^{Fi}$: The setup time of the flip-flop Ri

$\delta_{H}^{F}$: The hold time of a flip-flop

$\delta_{H}^{Fi}$: The hold time of the flip-flop Ri

$t_{AM}^{F}$: The latest arrival time of the data signal at the data input of a flip-flop

$t_{AM}^{Fi}$: The latest arrival time of the data signal at the data input of the flip-flop Ri

$t_{Am}^{F}$: The earliest arrival time of the data signal at the data input of a flip-flop

$t_{Am}^{Fi}$: The earliest arrival time of the data signal at the data input of the flip-flop Ri

$t_{QM}^{F}$: The latest arrival time of the data signal at the data output of a flip-flop

$t_{QM}^{Fi}$: The latest arrival time of the data signal at the data output of the flip-flop Ri

$t_{Qm}^{F}$: The earliest arrival time of the data signal at the data output of a flip-flop

$t_{Qm}^{Fi}$: The earliest arrival time of the data signal at the data output of the flip-flop Ri

4. Local Data Path Parameters

$R_i \leadsto R_f$: A local data path from register Ri to register Rf exists

$R_i \not\leadsto R_f$: A local data path from register Ri to register Rf does not exist

Hwang, J. "ROM/PROM/EPROM" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

48 ROM/PROM/EPROM

Jen-Sheng Hwang
National Science Council

48.1 Introduction
48.2 ROM
Core Cells • Peripheral Circuitry • Architecture
48.3 PROM
Read Only Memory Module Architecture • Conventional Diffusion Programming ROM • Conventional VIA-2 Contact Programming ROM • New VIA-2 Contact Programming ROM • Comparison of ROM Performance

48.1 Introduction

Read-only memory (ROM) is the densest form of semiconductor memory. It is used in applications such as video game software, laser printer fonts, dictionary data in word processors, and sound-source data in electronic musical instruments. The ROM market segment grew well through the first half of the 1990s, closely coinciding with a jump in sales of personal computers (PCs) and other consumer-oriented electronic systems, as shown in Fig. 48.1 [1]. The segment then declined because a very large ROM application base (video games) moved toward compact disc ROM-based systems (CD-ROM). In addition, memory products with greater functionality have become relatively cost-competitive with ROM. Nevertheless, it is believed that the ROM market will continue to grow moderately through the year 2003.

48.2 ROM

Read-only memories (ROMs) consist of an array of core cells whose contents or state is preprogrammed during the fabrication process, using the presence or absence of a single transistor as the storage mechanism. The contents of the memory are therefore maintained indefinitely, regardless of the previous history of the device and/or the previous state of the power supply.

Core Cells

A binary core cell stores binary information through the presence or absence of a single transistor at the intersection of the wordline and bitline. ROM core cells can be connected in two possible ways: a parallel NOR array of cells or a series NAND array of cells, each requiring one transistor per storage cell. In this case, either connecting or disconnecting the drain connection from the bitline programs the ROM cell. The NOR array is larger, as there is potentially one drain contact per transistor (or per cell) made to each bitline. It is also potentially faster, as there are no serially connected transistors as in the NAND array approach. The NAND array, in contrast, is much more compact because no contacts are required within the array itself, but the serially connected pull-down transistors that comprise the bitline are potentially very slow [2].

FIGURE 48.1 The ROM market growth and forecast.

Encoding multiple-valued data in the memory array involves a one-to-one mapping of logic value to transistor characteristics at each memory location and can be implemented in two ways: (i) adjust the width-to-length (W/L) ratios of the transistors in the core cells of the memory array; or (ii) adjust the threshold voltage of the transistors in the core cells of the memory array [3]. The first technique works on the principle that the W/L ratio of a transistor determines the amount of current that can flow through the device (i.e., the transconductance). This current can be measured to determine the size of the device at the selected location and hence the logic value stored at this location. To store 2 bits per cell, one would use one of four discrete transistor sizes. Intel Corp. used this technique in the early 1980s to implement high-density look-up tables in its i8087 math co-processor. Motorola Inc. also introduced a four-state ROM cell with an unusual transistor geometry that had variable W/L devices. The conceptual electrical schematic of the memory cell, along with the surrounding peripheral circuitry, is shown in Fig. 48.2 [2].

Peripheral Circuitry

The four states in a two-bit per cell ROM are four distinct current levels. There are two primary techniques to determine which of the four possible current levels an addressed cell generates. One technique compares the current generated by a selected memory cell against three reference cells using three separate sense amplifiers. The reference cells are transistors with W/L ratios that fall in between the four possible standard transistor sizes found in the memory array, as illustrated in Fig. 48.3 [2]. The approach is essentially a two-bit flash analog-to-digital (A/D) converter. An alternate method for reading a two-bit per cell device is to compute the time it takes for a linearly rising voltage to match the output voltage of the cell. This time interval can then be mapped to the equivalent two-bit binary code corresponding to the memory contents.
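The three-reference comparison described above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of Fig. 48.3: the function name `sense_two_bit_cell` and all current values are hypothetical, chosen so that each reference falls between two adjacent cell currents.

```python
def sense_two_bit_cell(cell_current_ua, reference_currents_ua):
    """Decode a two-bit value from a cell current, flash-A/D style.

    Each of the three comparisons stands in for one sense amplifier.
    The number of references exceeded (a thermometer code) is the
    stored two-bit value, 0 through 3.
    """
    if len(reference_currents_ua) != 3:
        raise ValueError("a two-bit cell needs exactly three references")
    comparisons = [cell_current_ua > ref for ref in sorted(reference_currents_ua)]
    return sum(comparisons)


# Hypothetical currents (in microamperes) for the four transistor sizes.
CELL_CURRENTS_UA = [10.0, 20.0, 30.0, 40.0]
# Hypothetical references placed between adjacent cell currents.
REFERENCE_CURRENTS_UA = [15.0, 25.0, 35.0]

decoded = [sense_two_bit_cell(i, REFERENCE_CURRENTS_UA) for i in CELL_CURRENTS_UA]
print(decoded)
```

Placing the references midway between adjacent cell currents gives each comparison the largest margin against current variation, which is the reason the reference transistors are sized between the four standard sizes.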

FIGURE 48.2 Geometry-variable multiple-valued NOR ROM.

FIGURE 48.3 ROM sense amplifier.

Architecture

Constructing large ROMs with fast access times requires the memory array to be divided into smaller memory banks. This gives rise to the concept of divided wordlines and divided bitlines, which reduces the capacitance of these structures and allows for faster signal dynamics. Typically, memory blocks would be no larger than 256 rows by 256 columns.

To compare the area advantage of the multiple-valued approach quantitatively, one can calculate the area per bit of a two-bit per cell ROM divided by the area per bit of a one-bit per cell ROM. Ideally, one would expect this ratio to be 0.5. In the case of a practical two-bit per cell ROM [4], the ratio is 0.6 because the cell is larger than a regular ROM cell in order to accommodate any one of the four possible transistor sizes. ROM density in the Mb capacity range is in general very comparable to that of DRAM density, despite the differences in fabrication technology [2].

In user-programmable or field-programmable ROMs, the customer can program the contents of the memory array by blowing selected fuses (i.e., physically altering them) on the silicon substrate. This allows for a “one-time” customization after the ICs have been fabricated. The quest for a memory that is nonvolatile and electrically alterable has led to the development of EPROMs, EEPROMs, and flash memories [2].
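The area-per-bit comparison in the Architecture discussion above is simple arithmetic, sketched here in Python. The helper name is hypothetical, and the implied cell-area growth is an inference that assumes the ratio is set by cell area alone (peripheral circuitry ignored); the text states only the 0.5 and 0.6 figures.

```python
def area_per_bit_ratio(cell_area_ratio, bits_per_cell):
    """Area per bit of a multi-bit cell relative to a one-bit cell.

    cell_area_ratio: multi-bit cell area divided by one-bit cell area.
    """
    return cell_area_ratio / bits_per_cell


# Ideal case: the two-bit cell is no larger than the one-bit cell.
ideal_ratio = area_per_bit_ratio(1.0, 2)

# Working backward from the quoted practical ratio of 0.6
# (assumption: only the cell area differs between the two ROMs).
implied_cell_area_ratio = 0.6 * 2

print(ideal_ratio, implied_cell_area_ratio)
```

Under that assumption, a ratio of 0.6 corresponds to a two-bit cell about 20% larger than a regular cell.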

48.3 PROM

Since process technology has shifted to QLM or PLM to achieve better device performance, it is important to develop a ROM technology that offers short turnaround time (TAT), high density, high speed, and low power. There are many types of ROM, each with merits and demerits [5]:

• The diffusion programming ROM has excellent density but a very long process cycle time.
• The conventional VIA-2 contact programming ROM has a better cycle time but poor density.
• A new-architecture VIA-2 contact programming ROM for QLM and PLM processes combines simple processing with high density and obtains excellent results targeting 2.5 V and 2.0 V supply voltages.

Read Only Memory Module Architecture

The details of the ROM module configuration are shown in Fig. 48.4. This ROM has a single access mode (16-bit data read from half of the ROM array) and a dual access mode (32-bit data read from both halves of the ROM array), selected with external address and control signals. One block in the array contains 16 bit lines and is connected to a sense amplifier circuit, as shown in Fig. 48.5. In the decoder, only one bit line in 16 is selected and precharged by P1 and T1 [5]. The 16 bits of a half array in single access mode, or 32 bits in dual access mode, are dynamically precharged to the VDD level. D1 is a pull-down transistor that keeps unselected bit lines at ground level.

The speed of the ROM is limited by the bit line discharge time under the worst-case ROM coding. When connections exist on all bit cells along a bit line in the vertical direction, the total parasitic capacitance on the bit line, Cbs from the N-diffusions and Cbg, is at its maximum. This situation is shown in Fig. 48.6a. In the 8KW ROM, 256 bit cells lie in the vertical direction, giving 256 times the cell bit line capacitance. In this case, the discharge time from VDD to the GND level is about 6 to 8 ns at VDD = 1.66 V and depends on the ROM programming type, such as diffusion or VIA-2.

Short-circuit currents in the sense amplifier circuits are avoided by using a delayed enable signal (Sense Enable). There are dummy bit lines on both sides of the array, as indicated in Fig. 48.4. Such a line contains “0”s on all 256 cells and has the longest discharge time. It is used to generate the timing for the delayed enable signal that activates the sense amplifier circuits. These circuits were used for all types of ROM to provide a fair comparison of the performance of each type [5].

FIGURE 48.4 ROM module array configuration.

FIGURE 48.5 Detail of low power selective bit line precharge and sense amplifier circuits.

Conventional Diffusion Programming ROM

The diffusion programmed ROM is shown in Fig. 48.6. This ROM has the highest density because the bit line contact to the discharge transistor can be shared by two bit cells (as shown in Fig. 48.6). Cell-A in Fig. 48.6a codes “0” by adding the diffusion that forms a transistor, whereas Cell-B codes “1”: it has no diffusion, leaving field oxide without a transistor, as shown in Fig. 48.6c. This ROM requires a very long fabrication cycle time because the process steps for diffusion programming are required [5].

FIGURE 48.6 Diffusion programming ROM.

Conventional VIA-2 Contact Programming ROM

To obtain a better fabrication cycle time, the conventional VIA-2 contact programming ROM shown in Fig. 48.7 was used. Cell-C in Fig. 48.7a codes “0” and Cell-D codes “1”. These values are determined by the presence or absence of VIA-2 code on the bit cells. VIA-2 is a final stage of the process: the base process can be completed just before VIA-2 etching, and the remaining process steps are few. As a result, the VIA-2 ROM fabrication cycle time is about 1/5 that of the diffusion ROM. The demerit of VIA-2 contact and other types of contact programming ROM is poor density. Because the diffusion area and contact must be separated in each ROM bit cell, as shown in Fig. 48.7c, density and speed are reduced and power is increased. Metal4 and VIA-3 of the QLM process were used for the word line strap in the ROM, since the RC delay time on these nodes is critical for a 100-MIPS DSP [5].

New VIA-2 Contact Programming ROM

The new-architecture VIA-2 programming ROM is shown in Fig. 48.8. Each 8-bit block is built as a complex matrix with GND on each side. Cell-E in Fig. 48.8a codes “0”: Bit4 and N4 are connected by VIA-2. Cell-F codes “1” because Bit5 and N5 are disconnected. Coding of the other bit lines (Bit 0, 1, 2, 3, 5, 6, and 7) follows the same procedure. This is one coding example used to discuss the worst-case operating speed. In the layout shown in Fig. 48.8b, the word line transistor is used not only in the active mode but also to isolate each bit line in the inactive mode. When the word line goes high, all transistors are turned on.

FIGURE 48.7 Conventional VIA-2 programming ROM.

FIGURE 48.8 New VIA-2 programming ROM.

All nodes (N0-N7) are horizontally connected with respect to GND. If VIA-2 code exists on all or some of the nodes (N0-N7) in the horizontal direction, the discharge time of the bit lines is very short because this ROM uses a selective bit line precharge method [5]. Figure 48.9 shows the timing chart of each key signal. When Bit4 is accessed, for example, only this line is precharged during the precharge phase, while all other bit lines are pulled down to GND by the D1 transistors, as shown in Fig. 48.4. When VIA-2 code exists, as between N4 and Bit4, this line is discharged. If it does not exist, the line stays at the VDD level dynamically during the word line active phase, as shown in Fig. 48.9. After this operation, valid data appears on the data out node of the data latch circuits [5].
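The read sequence just described (selective precharge, then conditional discharge while the word line is active) can be captured in a small behavioral model. This Python sketch abstracts away all analog behavior; the function name, the normalized voltage levels, and the boolean coding list are hypothetical illustrations, not the circuit of Fig. 48.8.

```python
VDD = 1.0  # normalized logic-high level (assumption for illustration)
GND = 0.0


def read_bit(via_present, selected_bit):
    """Behavioral model of one read with selective bit line precharge.

    via_present[k] is True when VIA-2 connects bit line k to its node Nk.
    Returns the data value read from the selected bit line.
    """
    # Precharge phase: only the selected bit line goes to VDD;
    # every other bit line is held at GND by its pull-down transistor.
    bit_lines = [GND] * len(via_present)
    bit_lines[selected_bit] = VDD

    # Word line active phase: a VIA-2 connection discharges the line,
    # otherwise the line keeps its precharged level dynamically.
    if via_present[selected_bit]:
        bit_lines[selected_bit] = GND

    return 1 if bit_lines[selected_bit] == VDD else 0


# Coding example from the text: Bit4 connected ("0"), Bit5 disconnected ("1").
coding = [False] * 8
coding[4] = True
print(read_bit(coding, 4), read_bit(coding, 5))
```

In this model a connected cell reads as “0” and a disconnected cell as “1”, matching the Cell-E and Cell-F coding described for Fig. 48.8a.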

FIGURE 48.9 Timing chart of new VIA-2 programming ROM.

To evaluate the worst-case speed, a coding with no VIA-2 on the horizontal bit cells was used, because the transistor series resistance to GND in the active mode is then at its maximum. Even in this situation, charge sharing effects and lower transistor resistance during the word line active mode allow fast discharge of the bit lines, despite the parasitic capacitance on the bit line increasing to 1.9 times. This is because all other nodes (N0-N7) stay at GND dynamically. The capacitance ratio between the bit line (Cb) and all nodes except N4 (Cn) was about 20:1. A fast voltage drop is obtained by charge sharing at the initial stage of bit line discharging: about a 5% voltage drop was obtained on the 8KW configuration through the charge-sharing path shown in Fig. 48.8c. With this phenomenon, the full-level discharge was mainly determined by the complex transistor RC network connected to GND, as shown in Fig. 48.8a. This new ROM has a much wider transistor width than conventional ROMs and much smaller speed degradation due to process deviations, because conventional ROMs typically use the minimum allowable transistor size to achieve higher density and are therefore more sensitive to process variations [5].
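The size of the initial charge-sharing drop follows from the 20:1 capacitance ratio quoted above. The sketch below assumes ideal charge redistribution between a bit line precharged to VDD and nodes that start at GND; the function name is hypothetical, and the real drop also depends on transistor resistance and timing, which this model ignores.

```python
def charge_sharing_drop_fraction(c_bit_line, c_nodes):
    """Fractional bit line voltage drop after ideal charge sharing.

    The bit line starts at VDD and the nodes start at GND. Conserving
    charge gives V_final = VDD * Cb / (Cb + Cn), so the fractional
    drop is Cn / (Cb + Cn).
    """
    return c_nodes / (c_bit_line + c_nodes)


# Cb : Cn is about 20 : 1 in the text; only the ratio matters here.
drop = charge_sharing_drop_fraction(20.0, 1.0)
print(f"{drop:.1%}")
```

The result, a little under 5% of VDD, is consistent with a drop of about five percent on the 8KW configuration.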

Comparison of ROM Performance

The performance comparison of each type of ROM is listed in Table 48.1. The 8KW ROM module area ratio is given for the same array configuration and peripheral circuits, with layout optimization, to achieve a fair comparison. The conventional VIA-2 ROM was 20% bigger than the diffusion ROM, but the new VIA-2 ROM was only 4% bigger. The TAT ratio (days for processing) was reduced to 0.2 because programming occurs at a final stage of the process steps.

SPICE simulations were performed to evaluate the performance of each ROM for low-voltage applications. The DSP targets 2.5 V and 2.0 V supply voltages as the chip specification, with low-voltage corners at 2.3 V and 1.8 V, respectively. However, a lower voltage was used in the SPICE simulations for speed evaluation to account for the expected 7.5% supply voltage reduction due to the IR drop from the external supply voltage on the DSP chip. Based on this assumption, VDD = 2.13 V and VDD = 1.66 V were used for speed evaluation. The speed of the new VIA-2 ROM was optimized at 1.66 V to exceed 100 MHz, and it demonstrated 106 MHz operation at VDD = 1.66 V, 125°C (based on typical process models). Additionally, 149 MHz at VDD = 2.13 V, 125°C was demonstrated with the typical model and 123 MHz with the slow model. This is a relatively small deviation induced by changes in process parameters such as width reduction of the transistors. With the fast model, operation at 294 MHz was demonstrated without any timing problems. This means the new ROM has very high productivity, even with three sigma of process deviation and over a wide range of voltages and temperatures [5].

TABLE 48.1 Comparison of ROM Performance

Comparison Item                                                      Diffusion ROM   Conventional VIA-2 ROM   New VIA-2 ROM
8KW (area ratio)                                                     1.0             1.2                      1.04
TAT (day ratio)                                                      1.0             0.2                      0.2
Speed @ 2.13 V, 125°C, weak                                          83 MHz          86 MHz                   123 MHz
Speed @ 2.13 V, 125°C, typical                                       166 MHz         98 MHz                   149 MHz
Speed @ 2.81 V, -40°C, strong                                        277 MHz         179 MHz                  294 MHz
Speed @ 1.66 V, 125°C, typical                                       103 MHz         75 MHz                   106 MHz
Power @ … V, -40°C, strong, 100 MHz (16-bit single access)           15.6 mW         19.3 mW                  2 UrnW
Power @ … V, -40°C, strong, 100 MHz (32-bit dual access)             29.6 mW         37.1 mW                  401 mW

Performance was measured with worst coding (all coding “1”).
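The simulation supply voltages quoted in this section can be checked with a short calculation. This sketch assumes the 7.5 figure for the IR drop is a percentage of the low-voltage corner; the function and constant names are hypothetical.

```python
IR_DROP_FRACTION = 0.075  # assumption: the quoted 7.5 is a percentage


def simulation_supply(corner_voltage, drop_fraction=IR_DROP_FRACTION):
    """Supply voltage seen on chip after the assumed IR drop."""
    return corner_voltage * (1.0 - drop_fraction)


# Low-voltage corners for the 2.5 V and 2.0 V specifications.
v_from_2v3 = simulation_supply(2.3)
v_from_1v8 = simulation_supply(1.8)
print(v_from_2v3, v_from_1v8)
```

The two results are approximately 2.13 V and 1.66 V, which agrees with the VDD values used for speed evaluation.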

References

1. Karls, J., Status 1999: A Report on the Integrated Circuit Industry, Integrated Circuit Engineering Corporation, 1999.
2. Gulak, P. G., A review of multiple-valued memory technology, IEEE International Symposium on Multi-valued Logic, 1998.
3. Rich, D. A., A survey of multivalued memories, IEEE Trans. on Comput., vol. C-35, no. 2, pp. 99-106, Feb. 1986.
4. Prince, B., Semiconductor Memories, 2nd ed., John Wiley & Sons Ltd., New York, 1991.
5. Takahashi, H., Muramatsu, S., and Itoigawa, M., A new contact programming ROM architecture for digital signal processor, Symposium on VLSI Circuits, 1998.

Tseng, Y. "SRAM" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

49 SRAM

Yuh-Kuang Tseng Industrial Research and Technology Institute

49.1 Read/Write Operation
49.2 Address Transition Detection (ATD) Circuit for Synchronous Internal Operation
49.3 Decoder and Word-Line Decoding Circuit
49.4 Sense Amplifier
49.5 Output Circuit

49.1 Read/Write Operation

Figure 49.1 shows a simplified readout circuit for an SRAM. The circuit has static bit-line loads composed of pull-up PMOS devices M1 and M2, which pull the bit-lines up to VDD. During the read cycle, one word-line is selected. The bit-line BL is discharged to a level determined by the bit-line load transistor M1, the access transistor N1, and the driver transistor N2, as shown in Fig. 49.1(b). At this time, all selected memory cells consume a dc column current flowing through the bit-line load transistors, access transistors, and driver transistors. This current flow increases the operating power and decreases the access speed of the memory.

Figure 49.2 shows a simplified circuit diagram for the SRAM write operation. During the write cycle, the input data and its complement are placed on the bit-lines. Then the word-line is activated. This forces the memory cell to flip into the state represented on the bit-lines, so that the new data is stored in the memory cell. The write operation can be described as follows. Suppose that a high voltage level is stored at node 1 and a low voltage level at node 2. If the opposite data is written into the cell, node 1 becomes low and node 2 becomes high. During this write cycle, a dc current flows from VDD through the bit-line load transistor M1 and the write circuits to ground. This extra dc current in the write cycle increases the power consumption and degrades the write speed. Moreover, in the tail portion of the write cycle, if data 0 has been written into node 1 as shown in Fig. 49.2, the turned-on word-line transistor N1 and driver transistor N2 form a path that discharges the bit-line voltage. Thus, the write recovery time is increased. In high-speed SRAM, write recovery time is an important component of the write cycle time. It is defined as the time necessary to recover from the write cycle to the read state after the WE signal is disabled [1]. During the write recovery period, the selected cell is in the quasi-read condition [2], which consumes dc current as in the read cycle.

Based on the above discussion, the dc current problems that occur in the read and write cycles should be overcome to reduce power dissipation and improve speed. Some solutions for the dc current problems of conventional SRAM are described below. During the active mode (read cycle or write cycle), the word-line is activated, and all selected columns consume a dc current. Thus, the word-line activation duration should be shortened to reduce the power consumption and improve speed during the active mode. This is possible by using the Address Transition Detection (ATD) technique [3] to generate a pulsed word-line signal that lasts just long enough to complete the read and write operations, as shown in Fig. 49.3.
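The benefit of pulsing the word-line can be illustrated with a first-order duty-cycle estimate. This is a minimal sketch under simplifying assumptions: each selected column draws a constant dc current only while the word-line is high, and every numeric value below is hypothetical rather than taken from the text.

```python
def average_column_current(i_column_a, n_columns, t_wordline_s, t_cycle_s):
    """Average dc current drawn by the selected columns over one cycle.

    Assumes each selected column draws i_column_a only while the
    word-line is active, and nothing otherwise.
    """
    duty_cycle = min(t_wordline_s / t_cycle_s, 1.0)
    return i_column_a * n_columns * duty_cycle


# Hypothetical values: 100 uA per column, 128 columns, 20 ns cycle.
I_COLUMN = 100e-6
N_COLUMNS = 128
T_CYCLE = 20e-9

static_word_line = average_column_current(I_COLUMN, N_COLUMNS, T_CYCLE, T_CYCLE)
pulsed_word_line = average_column_current(I_COLUMN, N_COLUMNS, 5e-9, T_CYCLE)
print(static_word_line, pulsed_word_line)
```

With these illustrative numbers, a 5 ns pulse in a 20 ns cycle cuts the average column current to one quarter of the static word-line case.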

FIGURE 49.1 (a) Simplified readout circuit for an SRAM, (b) signal waveform.

FIGURE 49.2 Simplified circuit diagram for SRAM write operation.

FIGURE 49.3 Word-line signal and current reduction by pulsing the word line.

However, the memory cells asserted by the pulsed word-line signal still consume dc current from VDD through the bit-line load transistors, access transistors, and driver transistors or write circuits to ground during the word-line activation period. A dynamic bit-line load circuit technique [2,4-6] can be used to eliminate the dc power consumption during the operation period. Figure 49.4 shows a simplified circuit configuration and timing diagram for read and write operation. In the read cycle, the bit-line load transistors are turned off because the ΦLD signal is in the high state. The bit-line load then consists of only the stray capacitance. Therefore, the selected memory cell can rapidly drive the bit-line load, resulting in a fast access time. Moreover, the dc column current consumed by the other activated memory cells is eliminated. Similarly, the dc current consumption in the write cycle can be eliminated.

A memory cell’s readout current Icell depends on the channel conductance of the transfer gates in the memory cell. As the supply voltage is scaled down, the speed of the SRAM decreases significantly because of the small cell readout current. To increase the channel conductance, the channel width can be widened and/or the word-line voltage can be boosted. For low-voltage operation, boosting the word-line voltage is more effective in shortening the delay time than widening the channel width. However, boosting increases power dissipation and causes a long transition time because of the enlarged bit-line swing. To solve these problems, a step-down boosted-word-line scheme that shortens the readout time with little power dissipation penalty was reported by Morimura and Shibata in 1998 [7].

FIGURE 49.4 Simplified circuit configuration and time diagram for read and write operation.

The concept of this scheme is shown in Fig. 49.5(b), in contrast to the conventional full-boosted-word-line scheme in Fig. 49.5(a). The step-down boosted-word-line scheme also boosts the selected word-line, but the boosted period is restricted to the beginning of the memory cell access. This enables the sensing operation to start early, through a fast bit-line transition. During the sensing period of the bit-line signals, the word-line potential is stepped down to the supply voltage to suppress the power dissipation; the reduced bit-line signals are sufficient to read out data by current sensing, and the reduced bit-line swing is effective in shortening the bit-line transition time in the next read cycle (Fig. 49.5(c)). As a result, fast readout is accomplished with little dissipation penalty (Fig. 49.5(d)).

FIGURE 49.5 Step-down boosted-word-line scheme: (a) conventional boosted word-line, (b) step-down boosted word-line, (c) bit-line transition, and (d) current consumption of a selected memory cell. (From Ref. 7.)

The step-down boosted-word-line scheme is also used in data writing. In the write cycle, the proposed scheme is just as effective in reducing the memory-cell current, because the memory cells unselected by column-address signals consume the same power as in the read cycle. The boosted word-line voltage shortens the time for writing data because it increases the channel conductance of the access transistors in the selected memory cells. The write recovery operation starts after the word-line voltage is stepped down. Reducing the memory cell’s current accelerates the recovery of the lower bit-lines, so a shorter recovery time than that of the conventional full-boosted-word-line scheme is obtained.

Other circuit techniques for dc column current reduction, such as the divided word-line (DWL) [8] and hierarchical word decoding (HWD) [9] structures, are described in the following sections.

49.2 Address Transition Detection (ATD) Circuit for Synchronous Internal Operation [1,10]

The address transition detection (ATD) circuit plays an important role in achieving internal synchronization of operation in SRAM. ATD pulses can be used to generate the different timing signals for pulsing word-lines, sense amplifiers, and bit-line equalization. The ATD pulse φ(ai) is generated with XOR circuits by detecting an “L” to “H” or “H” to “L” transition of any input address signal ai, as shown in Fig. 49.6. The ATD pulses generated from all the address input transitions are summed into one pulse, φATD, as shown in Fig. 49.6. The pulse width of φATD is controlled by the delay element τ. The pulse width is usually stretched out with a delay circuit and used to reduce or speed up signal propagation in the SRAM.
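The XOR-and-sum principle described above can be sketched as a discrete-time model in Python. The address is sampled at fixed time steps and the delay element τ is modeled as a whole number of steps; the function name and the sample values are hypothetical, and real ATD circuits operate in continuous time.

```python
def atd_pulse(address_samples, delay_steps):
    """Discrete-time model of the summed ATD pulse.

    address_samples: address word (an int) at each time step.
    delay_steps: the delay element, expressed in time steps.

    Each address bit is XORed with its delayed copy, and the per-bit
    pulses are ORed together. A nonzero bitwise XOR of the whole word
    is equivalent to that OR of per-bit XORs.
    """
    pulse = []
    for t, current in enumerate(address_samples):
        delayed = address_samples[max(t - delay_steps, 0)]
        pulse.append(1 if (current ^ delayed) != 0 else 0)
    return pulse


# Hypothetical 3-bit address: changes at step 2 (two bits) and step 6 (one bit).
samples = [0b000, 0b000, 0b101, 0b101, 0b101, 0b101, 0b100, 0b100, 0b100]
phi_atd = atd_pulse(samples, delay_steps=2)
print(phi_atd)
```

In this model each address transition produces a pulse whose width equals the delay, which mirrors the statement that the delay element controls the pulse width.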

FIGURE 49.6 (a) Summation circuit of all ATD pulses generated from all address transitions (b) ATD pulse waveform. (From Ref. 10.)

49.3 Decoder and Word-Line Decoding Circuit [10-13]

Two kinds of decoders are used in SRAM: the row decoder and the column decoder. Row decoders select one word-line out of the set of rows in the array. A fast decoder can be implemented by using AND/NAND and OR/NOR gates. Figure 49.7 shows the schematic diagrams of static and dynamic AND gate decoders. The static NAND-type structure is chosen for its low power consumption, since only the decoded row makes a transition. The dynamic structure is chosen for its speed and power improvement over conventional static NAND gates.

FIGURE 49.7 Circuit diagrams of a three-input AND gate: (a) static CMOS, (b) dynamic CMOS.

From a low-voltage operation standpoint, dynamic NOR-based decoding provides lower delay times through the decoder because of the limited stacking of devices. Figure 49.8 shows circuit diagrams of dynamic NOR gates. The dynamic CMOS gate shown in Fig. 49.8(a) consists of input NMOSs whose drain nodes are precharged to a high level by a PMOS when the clock signal Φ is at a low level, and conditionally discharged by the input NMOSs when Φ is at a high level. The delay time of the dynamic NOR/OR gate does not increase when the number of input signals increases. This is because only one PMOS and two NMOSs are connected in series, even if the number of input signals is large. However, the OR output is slower than the NOR output because the OR signal is generated by an inverter driven by the NOR signal.

Figure 49.8(b) shows the source-coupled-logic (SCL) [11] NOR/OR circuit. When the clock signal Φ is at a low level, the drain nodes of the NMOSs (N1, N2) are precharged to a high level. If at least one of the input signals of the circuit is at a high level and the clock Φ then turns to a high level, node N1 is discharged to a low level and node N2 remains at a high level. On the other hand, if all the input signals are at a low level and Φ then turns to a high level, node N2 is discharged and node N1 remains at a high level. The SCL circuit can produce an OR signal and a NOR signal simultaneously. Thus, the SCL circuit is suitable for predecoders that have a large number of input signals and for address buffers that need to produce OR and NOR signals simultaneously.

FIGURE 49.8 Circuit diagrams of three-input NOR/OR gates: (a) dynamic CMOS, (b) SCL. (From Ref. 11.)

Column decoders select the desired bit pairs out of the sets of bit pairs in the selected row. A typical dynamic AND gate decoder, as shown in Fig. 49.7(b), can be used for column decoding because the AND structure meets the delay requirements (column decode is not in the worst-case delay path) and does so at a much lower power consumption.

A highly integrated SRAM adopts a multi-divided memory cell array structure to achieve high-speed word decoding and reduce column power dissipation. For this purpose, many high-speed word-decoding circuit architectures have been proposed, such as the divided word-line (DWL) [8] and hierarchical word decoding (HWD) [9] structures. The multi-stage decoder circuit technique is adopted in both word-decoding structures to achieve high-speed and low-power operation. The multi-stage decoder has advantages over the one-stage decoder in reducing the number of transistors and the fan-in. It also reduces the loading on the address input buffers.

Figure 49.9 shows the decoder structure for a typical partitioned memory array with divided word-line (DWL). The cell array is divided into NB blocks. If the SRAM has NC columns, each block contains NC/NB columns. The divided word-line in each block is activated by the global word-line and the vertical block select line. Consequently, only the memory cells connected to one divided word-line within a selected block are accessed in a cycle. Hence, the column current is reduced because only the selected columns switch. Moreover, the word-line selection delay, which is the sum of the global word-line delay and the divided word-line delay, is reduced. This is because the total capacitance of the global word-line is smaller than that of a conventional word-line, and the delay time of each divided word-line is small because of its short length. In the block decoder, an additional signal Φ, generated from an ATD pulse generator, can be adopted to enable the decoder and ensure a pulse-activated word-line.

However, in high-density SRAM with a capacity of more than 4 Mb, the number of blocks in the DWL structure has to increase. The capacitance of the global word-line therefore increases, which causes the delay and power to increase. To solve this problem, the hierarchical word decoding (HWD) [9] circuit structure, shown in Fig. 49.10, was proposed. The word-line is divided into multiple levels. The number of levels is determined by the total capacitance of the word select line, so as to distribute it efficiently.
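The divided word-line selection described above can be sketched as simple index arithmetic. This Python model only tracks which block is selected and how many columns become active; the function name and the array dimensions are hypothetical, and it assumes the columns divide evenly among the blocks.

```python
def dwl_select(row, column, n_columns, n_blocks):
    """Model of divided word-line (DWL) selection.

    The global word-line picks the row and the block select line picks
    the block; only the divided word-line where the two meet is
    activated, so only that block's columns draw column current.
    """
    if n_columns % n_blocks != 0:
        raise ValueError("columns must divide evenly among blocks")
    columns_per_block = n_columns // n_blocks  # NC / NB
    block = column // columns_per_block
    return {
        "global_word_line": row,
        "block_select": block,
        "active_columns": columns_per_block,
    }


# Hypothetical array: NC = 1024 columns split into NB = 8 blocks.
selection = dwl_select(row=37, column=300, n_columns=1024, n_blocks=8)
print(selection)
```

With these illustrative dimensions only 128 of the 1024 columns are activated per access, which is the source of the column current reduction.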

FIGURE 49.9 Divided word-line (DWL) structure. (From Ref. 8.)

FIGURE 49.10 Hierarchical word decoding structure.

Hence, the delay and power are reduced. Figure 49.11 shows the delay time and the total capacitance of the word decoding path comparison for the optimized DWL and HWD structures of 256-Kb, 1-Mb, and 4-Mb SRAMs.

49.4 Sense Amplifier [10]

During the read cycle, the bit-lines are initially precharged by the bit-line load transistors. When the selected word-line is activated, one of the two bit-lines is pulled low by the driver transistor, while the other stays high. The bit-line pull-down speed is very slow because of the small cell size and large bit-line load capacitance. Differential sense amplifiers are used for speed because they can detect and amplify a very small level difference between the two bit-lines. Thus, a fast sense amplifier is an important factor in realizing a fast access time.

FIGURE 49.11 Comparison of DWL and HWD. (From Ref. 9. With permission.)

Figure 49.12 shows a switching scheme for the well-known current-mirror sense amplifiers [14]. Two amplifiers are serially connected to obtain a full supply voltage swing output, because one stage of the amplifier does not provide enough gain for a full swing. The signal ΦSA is generated from an ATD pulse. It is asserted for a period long enough to amplify the small difference on the data lines; then it is deactivated and the amplified output is latched. Hence, the switch reduces the power consumption, especially at relatively low frequencies.

FIGURE 49.12 Two-stage current-mirror sense amplifier. (From Refs. 10 and 14. With permission.)

A latch-type sense amplifier such as the PMOS cross-coupled amplifier [15] shown in Fig. 49.13 greatly reduces the dc current after amplification and latching, because the amplifier provides a nearly full supply voltage swing through positive feedback of the outputs to the PMOSFETs. As a result, the current in the PMOS cross-coupled sense amplifier is less than one fifth of that in a current-mirror amplifier. Moreover, this positive feedback gives a much faster sensing speed than the conventional amplifier. To obtain correct and fast operation, the equalization element EQL is connected between the output terminals and is turned on by the pulse signal ΦS and its complement during the transition period of the input signals.

FIGURE 49.13 PMOS cross-coupled amplifier. (From Ref. 15. With permission.)

However, the latch-type sense amplifier has a large dependence on the input voltage swing, especially at low current operation conditions. An NMOS source-controlled latched sense amplifier,16 as shown in Fig. 49.14, is able to quickly amplify an input voltage swing as small as 10 mV. The sense amplifier consists of two PMOS loads, two NMOS drivers, and two feedback inverters. The sense amplifier control (SAC) signal is driven by the CS input buffer, and ΦS is a sense-amplifier equalizing pulse generated by the ATD pulse. The gate terminal of each NMOS driver is connected to the local data bus (LD1 and LD2), and the source terminal of each NMOS driver is controlled by the feedback inverter connected to the opposite output node of the sense amplifier. Thus, the NMOS driver connected to the high-going output node turns off immediately. Therefore, the charge-up time of that node can be reduced because no current is wasted in the NMOS driver.

A bidirectional sense amplifier, called a bidirectional read/write shared sense amplifier (BSA),17 is shown in Fig. 49.15. The BSA plays three roles. It functions as a sense amplifier for read operations, and it serves as a write circuit and a data input buffer for write operations. It consists of an 8-to-1 column selector and bit-line precharger, a CMOS dynamic sense amplifier, an SR flip-flop, and an I/O circuit.


FIGURE 49.14 NMOS source-controlled latched sense amplifier. (From Ref. 16. With permission.)

FIGURE 49.15 Schematic diagram of BSA. (From Ref. 17. With permission.)


Eight bit-line pairs are connected to a CMOS dynamic sense amplifier through CMOS transfer gates. The BLSW signal is used to select a column and to precharge bit-lines. When the BLSW signal is high, one of eight bit-line pairs is connected to the sense amplifier. When the BLSW signal is low, all bit-line pairs are precharged to the VDD level. The SAEQB signal controls the sense amplifier equalization. When the SAEQB signal is low, sense nodes D and DB are equalized and precharged to the VDD level. The SENB signal activates the CMOS dynamic sense amplifier. The SR flip-flop holds the result. The output circuit consists of four p-channel transistors. If the result is high, I/O is connected to VDD (3.3 V) and IOB is connected to VDDL (3 V) through p-channel devices. VDDL is a 3-V power supply provided externally. The I/O pair is connected to the sense amplifier through p-channel transfer gates controlled by ISWB. During write operations, ISWB falls to connect the I/O pair to the sense amplifier.

Figure 49.16 shows operational waveforms of the BSA. At the beginning of the read operations, after some intrinsic delay from the rising edge of the SACLK, data from the selected cell is read onto the bit-line pair. At the same time, the BLSW and the SAEQB rise. One of the eight CMOS transfer gates is turned on, the bit-line pair is connected to sense nodes D and DB, and precharging of the CMOS sense amplifier and bit-line pair is terminated. After the signal on the bit-line pair is sufficiently developed, the BLSW falls to disconnect the bit-line pair from the sense nodes D and DB. At the same time, the SENB falls to activate the sense amplifier. After the differential output data is latched onto the SR flip-flop, the SAEQB falls to start the equalization of the bit-line pair and the CMOS sense amplifier.
At the beginning of the write operations, after some delay from the rising edge of SACLK, the ISWB signal falls, and the differential I/O pair is directly connected to the sense amplifier through p-channel transfer gates. After the signals D and DB are sufficiently developed, ISWB turns off the p-channel transfer gates to disconnect the sense amplifier from the I/O pair. At the same time, the SENB falls to sense the data, and the BLSW rises to connect the sense amplifier to the bit-line pair. After the data is written into the selected memory cell, SAEQB and BLSW fall to start equalization of the bit-line pair and the CMOS sense amplifier.

FIGURE 49.16 Operational waveforms of the BSA. (From Ref. 17. With permission.)

Conventional sense amplifiers operate incorrectly when the threshold voltage deviation is larger than the bit-line swing, whereas a current-sensing sense amplifier proposed by Izumikawa et al. in 1997 continues to operate normally.18 Figure 49.17 illustrates the sense amplifier operations. Bit-lines are always charged up to VDD through load PMOSFETs. When memory cells are selected with a word-line, a voltage difference appears in a bit-line pair (Fig. 49.17(a)). During this period, all column-select PMOSFETs are off, and no dc current flows in the sense amplifier. The sense amplifier differential outputs, referred to as ReadData, are equalized at ground level through pull-down NMOSFETs M7 and M8. After a 40-mV difference appears in a bit-line pair, power switch M9 of the sense amplifier and one column-select pair of PMOSFETs are turned on (Fig. 49.17(b)). The difference in bit-line voltages causes a current difference between the differential-pair PMOSFETs in the sense amplifier, which appears as an output voltage difference. This voltage difference is amplified, and the read operation is accomplished. The current is automatically cut off because of the CMOS inverter. Consequently, the small bit-line swing is sensed without dc current consumption.

FIGURE 49.17(a) Sense amplifier operation: (a) before sensing. (From Ref. 18. With permission.)

FIGURE 49.17(b) Sense amplifier operation: (b) sensing. (From Ref. 18. With permission.)

49.5 Output Circuit4

The key issue for designing the high-speed SRAM with byte-wide organization is noise reduction. There are two kinds of noise: VDD noise and GND noise. In the high-speed SRAM with byte-wide organization, when the output transistors drive a large load capacitance, the noise is generated and multiplied by 8 because eight outputs may change simultaneously. It is a fundamentally serious problem for the data zero output. That is to say, when the output NMOS transistor drives the large load capacitance, the GND potential of the chip goes up because of the peak current and the parasitic inductance of the GND line. Therefore, the address buffer and the ATD circuit are influenced by the GND bounce, and unnecessary signals are generated.


FIGURE 49.18 Noise-reduction output circuit. (From Ref. 4. With permission.)

FIGURE 49.19 Waveforms of noise-reduction output circuit (solid line) and conventional output circuit: (a) gate bias, (b) data output, and (c) GND bounce. (From Ref. 4. With permission.)


Figure 49.18 shows a noise-reduction output circuit. The waveforms of the noise-reduction output circuit and the conventional output circuit are shown in Fig. 49.19. In the conventional circuit, nodes A and B are connected directly as shown in Fig. 49.18. Its operation and characteristics are shown by the dotted lines in Fig. 49.19. Due to the high-speed driving of transistor M4, the GND potential goes up, and the valid data are delayed by the output ringing. The new noise-reduction output circuit consists of one PMOS transistor, two NMOS transistors, one NAND gate, and the delay part (its characteristics are shown by the solid lines in Fig. 49.19). The operation of this circuit is explained as follows. The control signals CE and OE are at high level and signal WE is at low level in the read operation. When the data zero output of logical high level is transferred to node C, transistor M1 is cut off, and M2 raises node A to the middle level. Therefore, the peak current that flows into the GND line through transistor M4 is reduced to less than one half that of the conventional circuit because M4 is driven by the middle level. After a 5-ns delay from the beginning of the middle level, transistor M3 raises node A to the VDD level. As a result, the conductance of M4 becomes maximum, but the peak current is small because of the low output voltage. Therefore, the increase of GND potential is small, and the output ringing does not appear.
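The dependence of GND bounce on the number of simultaneously switching outputs can be illustrated with the usual first-order inductive estimate, V = N · L · dI/dt. The sketch below is only an illustration of that relation: the inductance and current-slew values are hypothetical and are not taken from Ref. 4, and the assumption that halving the peak current also halves the current slew is a simplification.

```python
def ground_bounce_volts(n_outputs, l_gnd_henry, di_dt_amp_per_s):
    """First-order GND bounce estimate: V = N * L * dI/dt."""
    return n_outputs * l_gnd_henry * di_dt_amp_per_s


# Hypothetical values, chosen only for illustration (not from the source).
L_GND = 5e-9   # assumed parasitic inductance of the GND line: 5 nH
DI_DT = 2e7    # assumed current slew per output: 20 mA/ns = 2e7 A/s

# Byte-wide organization: eight outputs may switch at the same time.
conventional = ground_bounce_volts(8, L_GND, DI_DT)

# Assumption: driving M4 from the middle level halves the current slew
# together with the peak current.
two_step = ground_bounce_volts(8, L_GND, DI_DT / 2)

print(f"conventional: {conventional:.2f} V, two-step gate drive: {two_step:.2f} V")
```

Under these assumed numbers the estimate scales linearly with both the output count and the current slew, which is why limiting the initial gate drive of the output NMOS transistor is effective.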

References
1. Bellaouar, A. and Elmasry, M. I., Low-Power Digital VLSI Design Circuit and Systems, Kluwer Academic Publishers, 1995.
2. Ishibashi, K. et al., “A 1-V TFT-Load SRAM Using a Two-Step Word-Voltage Method,” IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1519-1524, Nov. 1992.
3. Chen, C.-W. et al., “A Fast 32KX8 CMOS Static RAM with Address Transition Detection,” IEEE J. Solid-State Circuits, vol. SC-22, no. 4, pp. 533-537, Aug. 1987.
4. Miyaji, F. et al., “A 25-ns 4-Mbit CMOS SRAM with Dynamic Bit-Line Loads,” IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1213-1217, Oct. 1989.
5. Matsumiya, M. et al., “A 15-ns 16-Mb CMOS SRAM with Interdigitated Bit-Line Architecture,” IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1497-1502, Nov. 1992.
6. Mizuno, H. and Nagano, T., “Driving Source-Line Cell Architecture for Sub-1V High-Speed Low-Power Applications,” IEEE J. Solid-State Circuits, no. 4, pp. 552-557, Apr. 1996.
7. Morimura, H. and Shibata, N., “A Step-Down Boosted-Wordline Scheme for 1-V Battery-Operated Fast SRAM’s,” IEEE J. Solid-State Circuits, no. 8, pp. 1220-1227, Aug. 1998.
8. Yoshimoto, M. et al., “A Divided Word-Line Structure in the Static RAM and Its Application to a 64 K Full CMOS RAM,” IEEE J. Solid-State Circuits, vol. SC-18, no. 5, pp. 479-485, Oct. 1983.
9. Hirose, T. et al., “A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture,” IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1068-1074, Oct. 1990.
10. Itoh, K., Sasaki, K., and Nakagome, Y., “Trends in Low-Power RAM Circuit Technologies,” Proceedings of the IEEE, pp. 524-543, Apr. 1995.
11. Nambu, H. et al., “A 1.8-ns Access, 550-MHz, 4.5-Mb CMOS SRAM,” IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1650-1657, Nov. 1998.
12. Caravella, J. S., “A Low Voltage SRAM For Embedded Applications,” IEEE J. Solid-State Circuits, vol. 32, no. 3, pp. 428-432, Mar. 1997.
13. Prince, B., Semiconductor Memories: A Handbook of Design, Manufacture, and Application, 2nd edition, John Wiley & Sons, 1991.
14. Minato, O. et al., “A 20-ns 64 K CMOS RAM,” in ISSCC Dig. Tech. Papers, pp. 222-223, Feb. 1984.
15. Sasaki, K. et al., “A 9-ns 1-Mbit CMOS SRAM,” IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1219-1224, Oct. 1989.
16. Seki, T. et al., “A 6-ns 1-Mb CMOS SRAM with Latched Sense Amplifier,” IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 478-482, Apr. 1993.
17. Kushiyama, N. et al., “An Experimental 295 MHz CMOS 4K X 256 SRAM Using Bidirectional Read/Write Shared Sense Amps and Self-Timed Pulse Word-Line Drivers,” IEEE J. Solid-State Circuits, vol. 30, no. 11, pp. 1286-1290, Nov. 1995.
18. Izumikawa, M. et al., “A 0.25-µm CMOS 0.9-V 100-MHz DSP Core,” IEEE J. Solid-State Circuits, vol. 32, no. 1, pp. 52-60, Jan. 1997.


Wu, C. "Embedded Memory" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


50 Embedded Memory

Chung-Yu Wu
National Chiao Tung University

50.1 Introduction
50.2 Merits and Challenges
On-chip Memory Interface • System Integration • Memory Size
50.3 Technology Integration and Applications
50.4 Design Methodology and Design Space
Design Methodology
50.5 Testing and Yield
50.6 Design Examples
A Flexible Embedded DRAM Design • Embedded Memories in MPEG Environment • Embedded Memory Design for a 64-bit Superscalar RISC Microprocessor

50.1 Introduction

As CMOS technology progresses rapidly toward the deep submicron regime, the integration level, performance, and fabrication cost increase tremendously. Thus, low-integration, low-performance small circuits or system chips designed using deep submicron CMOS technology are not cost-effective. Only high-performance system chips that integrate CPU (central processing unit), DSP (digital signal processing) processors or multimedia processors, memories, logic circuits, analog circuits, etc. can afford the deep submicron technology. Such system chips are called system-on-a-chip (SOC) or system-on-silicon (SOS).1,2 A typical example of SOC chips is shown in Fig. 50.1. Embedded memory has become a key component of SOC and more practical than ever for at least two reasons:3

1. Deep submicron CMOS technology affords a reasonable tradeoff for large memory integration in other circuits. It can afford ULSI (ultra large-scale integration) chips with over 10^9 elements on a single chip. This scale of integration is large enough to build an SOC system. This size of circuitry inevitably contains different kinds of circuits and technologies. Data processing and storage are the most primitive and basic components of digital circuits, so that the memory implementation on a logic chip has the highest priority. Currently in quarter-micron CMOS technology, chips with up to 128 Mbits of DRAM and 500 Kgates of logic circuit, or 64 Mbits of DRAM and 1 Mgates of logic circuit, are feasible.
2. Memory bandwidth is now one of the most serious bottlenecks to system performance. The memory bandwidth is one of the performance determinants of current von Neumann-type MPU (microprocessing unit) systems. The speed gap between MPUs and memory devices has widened in the past decade. As shown in Fig. 50.1, the MPU speed has improved by a factor of 4 to 20 in the past decade. On the other hand, in spite of exponential progress in storage capacity, minimum access times for each quadrupled storage capacity have improved only by a factor of two, as shown in Fig. 50.2. This is partly due to the I/O speed limitation and to the fact that major efforts in semiconductor memory development have focused on density and bit cost improvements. This speed gap creates a strong demand for memory integration with the MPU on the same chip. In fact, many MPUs with cycle times better than 60 ns have on-chip memories.

The new trend in MPUs (i.e., RISC architecture) is another driving force for embedded memory, especially for cache applications.4 RISC architecture is strongly dependent on memory bandwidth, so that high-performance, non-ECL-based RISC MPUs with more than 25 to 50 MHz operation must be equipped with embedded cache on the chip.

FIGURE 50.1 An example of system-on-a-chip (SOC).

50.2 Merits and Challenges

The main characteristics of embedded memories can be summarized as follows.5

On-chip Memory Interface

Advantages include:
1. Replacing off-chip drivers with smaller on-chip drivers can reduce power consumption significantly, as large board wire capacitive loads are avoided. For instance, consider a system which needs a 4-Gbyte/s bandwidth and a bus width of 256 bits. A memory system built with discrete SDRAMs (16-bit interface at 100 MHz) would require about 10 times the power of an embedded DRAM with an internal 256-bit interface.
2. Embedded memories can achieve much higher fill frequencies6 than discrete memories, where the fill frequency is defined as the bandwidth (in Mbit/s) divided by the memory size in Mbit (i.e., the fill frequency is the number of times per second a given memory can be completely filled with new data). This is because the on-chip interface can be up to 512 bits wide, whereas discrete memories are limited to 16 to 64 bits. Continuing the above example, it is possible to make a 4-Mbit embedded DRAM with a 256-bit interface. In contrast, it would take 16 discrete 4-Mbit chips (256 K×16) to achieve the same width, so the granularity of such a discrete system is 64 Mbits. But the application may only call for, say, 8 Mbits of memory.
3. As interface wire lengths can be optimized for the application in embedded memories, lower propagation times and thus higher speeds are possible. In addition, noise immunity is enhanced.
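The fill-frequency comparison in item 2 can be checked with a few lines of arithmetic. The sketch below assumes a 100-MHz interface clock for both cases (the text states this clock only for the discrete SDRAMs) and ignores the difference between 10^6 and 2^20 bits; the function and variable names are illustrative.

```python
def fill_frequency_per_s(width_bits, clock_mhz, size_mbit):
    """Fill frequency = bandwidth (Mbit/s) / memory size (Mbit)."""
    bandwidth_mbit_s = width_bits * clock_mhz  # bits per cycle * 10^6 cycles/s
    return bandwidth_mbit_s / size_mbit


CLOCK_MHZ = 100  # assumed common clock so that the two cases are comparable

# Embedded case from the text: 4 Mbit behind a 256-bit interface.
embedded = fill_frequency_per_s(256, CLOCK_MHZ, 4)

# Discrete case from the text: 16-bit-wide 4-Mbit chips ganged to 256 bits.
chips = 256 // 16
discrete_size_mbit = chips * 4  # granularity of the discrete system
discrete = fill_frequency_per_s(256, CLOCK_MHZ, discrete_size_mbit)

print(f"embedded: {embedded:.0f} fills/s")
print(f"discrete: {chips} chips, {discrete_size_mbit} Mbit, {discrete:.0f} fills/s")
```

With equal interface width and clock, the ratio of the two fill frequencies equals the ratio of the memory sizes, which is the granularity argument made in the text.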


Challenges and disadvantages include:
1. Although the power consumption per system decreases, the power consumption per chip may increase. Therefore, junction temperature may increase and memory retention time may decrease. However, it should be noted that memories are usually low-power devices.
2. Some sort of minimal external interface is still needed in order to test the embedded memory. The hybrid chip is neither a memory nor a logic chip. Should it be tested on a memory or logic tester, or on both?

System Integration

Advantages include:
1. Higher system integration saves board space, packages, and pins, and yields better form factors.
2. A pad-limited design may be transformed into a non-pad-limited one by choosing an embedded solution.
3. Speed scales better along with CMOS technology scaling.

Challenges and disadvantages include:
1. More expensive packages may be needed. Also, memories and logic circuits require different power supplies. Currently, the DRAM power supply (2.5 V) is lower than the logic power supply (3.3 V), but this situation will reverse in the future due to the back-biasing problem in DRAMs.
2. The embedded memory process adds another technology for which libraries must be developed and characterized, macros must be ported, and design flows must be tuned.
3. Memory transistors are optimized for low leakage currents, yielding low transistor performance, whereas logic transistors are optimized for high saturation currents, yielding high leakage currents. If a compromise is not acceptable, expensive extra manufacturing steps must be added.
4. Memory processes have fewer layers of metal than do logic circuit processes. Layers can be added at the expense of fabrication cost.
5. Memory fabs are optimized for large-volume production of identical products, for high-capacity utilization, and for high yield. Logic fabs, while sharing these goals, are slanted toward lower batch sizes and faster turnaround time.

Memory Size

The advantage is that:
1. Memory size can be customized and memory architecture can be optimized for dedicated applications.

Challenges and disadvantages include:
1. The system designer must know the exact memory requirement at the time of design. Later extensions are not possible, as there is no external memory interface. From the customer’s point of view, the memory component goes from a commodity to a highly specialized part that may command premium pricing. As memory fabrication processes are quite different, second-sourcing problems abound.

50.3 Technology Integration and Applications3,5

The memory technologies for embedded memories vary widely, from ROM to RAM, as listed in Table 50.1.3 In choosing among these technologies, one of the most important figures of merit is the compatibility with the logic process.


TABLE 50.1 Embedded Memory Technologies and Applications

Embedded Memory Technology | Compatibility to Logic Process | Applications
ROM | Diffusion, Vt, contact programming. High compatibility to logic process | Microcode, program storage, PAL, ROM-based logic
E/E2PROM | High-voltage device, tunneling insulator required | Program, parameter storage, sequencer, learning machine
SRAM | 6-Tr/4-Tr single/double poly load cells. Wide range of compatibility | High-speed buffers, cache memory
DRAM | Gate capacitor/4-T/planar/stacked/trench cells. Wide range of compatibility | High-density, high bit rate storage

Source: From Ref. 3.

1. Embedded ROM: ROM technology has the highest compatibility with the logic process. However, its application is rather limited. PLA, or ROM-based logic design, is a well-used but rather special case of the embedded ROM category. Other applications are limited to storage for microcode or well-debugged control code. A large ROM for table or dictionary applications may be implemented in generic ROM chips with lower bit cost.
2. Embedded EPROM/E2PROM: EPROM/E2PROM technology includes high-voltage devices and/or thin tunneling insulators, which require two to three mask and processing steps in addition to the logic process. Due to their unique functionality, PROM-embedded MPUs7 are well used. To minimize process overhead, a single-poly E2PROM cell has been developed.8 Counterparts to this approach are piggy-back packaged EPROM/MPUs or battery-backed SRAM/MPUs. However, considering process technology innovation, on-chip PROM implementation is winning the game.
3. Embedded SRAM: SRAM is one of the most frequently used memories embedded in logic chips. Major applications are high-speed on-chip buffers such as TLB, cache, register file, etc. Table 50.2 gives a comparison of some approaches for SRAM integration. A six-transistor cell approach may be the most highly compatible process, unless any special structures used in standard 6-Tr SRAMs are employed. The bit density is not very high. Polysilicon resistor load 4-Tr cells provide higher bit density at the cost of the process complexity associated with additional polysilicon-layer resistors. The process complexity and storage density may be compromised to some extent using a single layer of polysilicon. In the case of a polysilicon resistor load SRAM, which may have relaxed specifications with respect to data holding current, the requirement for substrate structure to achieve good soft error immunity is more relaxed as compared to low stand-by generic SRAMs. Therefore, the TFT (thin-film transistor) load cell may not be required for several generations due to its complexity.

TABLE 50.2 Embedded SRAM Options

SRAM Cell Type | Features
CMOS 6-Tr cell | No extra process steps to logic. Lower bit density (cell size, Acell = 2.0 a.u.). Wide operational margin. Low data-load current
NMOS 4-Tr polysilicon load cell, single poly | 1 additional step to logic process. Higher density (Acell = 1.25 a.u.)
NMOS 4-Tr polysilicon load cell, double poly | 3 additional steps to logic process. Higher density (Acell = 1 a.u.)

Source: From Ref. 3.

4. Embedded DRAM (eDRAM): eDRAM is not as widely used as SRAM. Its high density features, however, are very attractive. Several different embedded DRAM approaches are listed in Table 50.3. A trench or stacked cell used in commodity DRAMs has the highest density, but the complexity is also high. The cost is seldom attractive when compared to a multi-chip approach using standard DRAM, which is the ultimate in achieving low bit cost. This type of cell is well suited for ASM (application specific memory), which will be described in the next section. A planar cell with multiple (double) polysilicon structures is also suitable for memory-rich applications.9 A gate capacitor storage cell approach can be fully compatible with the logic process while providing relatively high density.10 The four-Tr cell (4-Tr SRAM cell minus resistive load) provides the same speed and density as SRAM and full compatibility with the logic process, but requires refresh operation.11

TABLE 50.3 Embedded DRAM Technology Options

Technology | Features
Standard DRAM trench/stacked cell | High density (cell size Acell = 1 a.u.). Large process overhead, >45% additional to logic
Planar C-plate poly-Si cell | High density (Acell ≥ 1.3 a.u.). Process overhead >35% additional to logic
Gate capacitor + 1-Tr cell | Relatively high density (Acell = 2.5 a.u.). No additional process to logic
4-Tr cell | High speed, short cycle time. Density is equivalent to 2-poly SRAM cell (equivalent to SRAM except refresh; Acell = 5 a.u.)

Source: From Ref. 3.
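The tradeoff between cell density and process overhead in Table 50.3 can be expressed as a simple selection rule. The following sketch is only an illustration: the overhead figures are the lower bounds quoted in the table (">45%", ">35%"), the zero overhead assigned to the 4-Tr cell is inferred from the statement that it is fully compatible with the logic process, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellOption:
    name: str
    area_au: float        # relative cell area (a.u.) from Table 50.3
    overhead_pct: float   # lower bound on process overhead relative to logic


OPTIONS = [
    CellOption("trench/stacked", 1.0, 45.0),
    CellOption("planar C-plate poly-Si", 1.3, 35.0),
    CellOption("gate capacitor + 1-Tr", 2.5, 0.0),
    CellOption("4-Tr", 5.0, 0.0),  # overhead inferred from the text, not the table
]


def densest_within_budget(options, max_overhead_pct):
    """Return the smallest-area cell whose overhead fits the budget, or None."""
    feasible = [o for o in options if o.overhead_pct <= max_overhead_pct]
    return min(feasible, key=lambda o: o.area_au) if feasible else None


for budget in (0, 40, 50):
    choice = densest_within_budget(OPTIONS, budget)
    print(f"overhead budget {budget}%: {choice.name} (Acell = {choice.area_au} a.u.)")
```

Because the tabulated overheads are lower bounds, the selection is optimistic for the trench/stacked and planar cells; a real decision would also weigh speed, refresh, and volume.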

50.4 Design Methodology and Design Space3,5

Design Methodology

The design style of embedded memory should be selected according to applications. This choice is critically important for the best performance and cost balancing. Figure 50.2 shows the various design styles to implement embedded memories.

FIGURE 50.2 Various design styles for embedded memories. (From Ref. 3.)


The most primitive semi-custom design style is based on the unit memory cell. It provides high flexibility in memory architecture and short design TAT (turnaround time). However, the memory density is the lowest among the various approaches.

The structured array is a kind of gate array that has a dedicated memory array region in the master chip that is configurable to several variations of memory organizations by metal layer customization. Therefore, it provides relatively high density and short TAT. Configurability and fixed maximum memory area are the limitations of this approach.

The standard cell design has high flexibility to the extent that the cell library has a variety of embedded memory designs. But in many cases, a new system design requires new memory architectures. The memory performance and density are high, but the mask-to-chip TAT tends to be long.

Super integration is an approach that integrates existing chip designs, including I/O pads, so the design TAT is short and proven designs can be used. However, the availability of memory architectures is limited and the mask-to-chip TAT is long.

Hand-craft design (which does not necessarily mean the literal use of human hands, but heavily interactive design) provides the most flexibility, high performance, and high density, but the design TAT is the longest. Thus, the design cost is the highest, so that the applications are limited to high-volume and/or high-end systems. Standard memories, well-defined ASMs, such as video memories,12 integrated cache memories,13 and high-performance MPU-embedded memories, are good examples.

An eDRAM (embedded DRAM) designer faces a design space that contains a number of dimensions not found in standard ASICs, some of which we will subsequently review. The designer has to choose from a wide variety of memory cell technologies which differ in the number of transistors and in performance. Also, both DRAM technology and logic technology can serve as a starting point for embedding DRAM. Choosing a DRAM technology as the base technology will result in high memory densities but suboptimal logic performance. On the other hand, starting with logic technology will result in poor memory densities, but fast logic circuits. To some extent, one can therefore trade logic speed against logic area. Finally, it is also possible to develop a process that gives the best of both worlds, most likely at higher expense. Furthermore, the designer can trade logic area for memory area in a way heretofore impossible.

Large memories can be organized in very different ways. Free parameters include the number of memory banks, which allow the opening of different pages at the same time, the length of a single page, the word width, and the interface organization. Since eDRAM allows one to integrate SRAMs and DRAMs, the decisions on on-chip/off-chip DRAM partitioning and on SRAM/DRAM partitioning must be made. In particular, the following problems must be solved at the system level:
• Optimizing the memory allocation
• Optimizing the mapping of the data into memory such that the sustainable memory bandwidth approaches the peak bandwidth
• Optimizing the access scheme to minimize the latency for the memory clients and thus minimize the necessary FIFO depth

The goals are to some extent independent of whether or not the memory is embedded. However, the number of free parameters available to the system designer is much larger in an embedded solution, and the possibility of approaching the optimal solution is thus correspondingly greater. On the other hand, the complexity is also increased. It is therefore incumbent upon eDRAM suppliers to make the tradeoffs transparent and to quantize the design space into a set of understandable, if slightly suboptimal, solutions.
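The link between access latency and FIFO depth mentioned in the last bullet can be made concrete with a common first-order sizing rule: a client that consumes data at a constant rate must buffer at least the words it uses while waiting out the worst-case memory latency. The sketch below assumes exactly that model; the rates and latencies are hypothetical and the function name is illustrative.

```python
import math


def min_fifo_depth(client_rate_mword_s, worst_latency_ns):
    """Words consumed during the worst-case latency, rounded up.

    Mword/s * ns = 10^6 * 10^-9 words = 10^-3 words, hence the division by 1000.
    """
    return math.ceil(client_rate_mword_s * worst_latency_ns / 1000)


# Hypothetical client: 50 Mword/s, with two candidate access schemes.
print(min_fifo_depth(50, 200))  # longer worst-case latency
print(min_fifo_depth(50, 80))   # shorter worst-case latency
```

Under this model the required depth scales linearly with the worst-case latency, which is why an access scheme that bounds latency for each client directly reduces buffer area.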

50.5 Testing and Yield3,5

Although embedded memory occupies a minor portion of the total chip area, the device density in the embedded memory area is generally overwhelming. Failure distribution is naturally localized at memory areas. In other words, embedded memory is a determinant of total chip yield to the extent that the memory portion has higher device density weighted by its silicon area.


For a large memory-embedded VLSI, memory redundancy is helpful to enhance the chip yield. Therefore, the embedded-memory testing, combined with the redundancy scheme, is an important issue. The implementation of means for direct measurement of embedded memory on wafer as well as in assembled samples is necessary. In addition to off-chip measurement, on-chip measurement circuitry is essential for accurate AC evaluation and debugging.

Testing DRAMs is very different from testing logic. The main points to note are as follows:
• The fault models of DRAMs explicitly tested for are much richer. They include bit-line and word-line failures, crosstalk, retention time failures, etc.
• The test patterns and test equipment are highly specialized and complex. As DRAM test programs include a lot of waiting, DRAM test times are quite high, and test costs are a significant fraction of total cost.
• As DRAMs include redundancy, the order of testing is: (1) pre-fuse testing, (2) fuse blowing, (3) post-fuse testing. There are thus two wafer-level tests.

The implication for eDRAMs is that a high degree of parallelism is required in order to reduce test costs. This necessitates on-chip manipulation and compression of test data in order to reduce the off-chip interface width. For instance, Siemens Corp. offers a synthesizable test controller supporting algorithmic test pattern generation (ATPG) and expected-value comparison [partial built-in self test (BIST)].

Another important aspect of eDRAM testing is the target quality and reliability. If eDRAM is used for graphics applications, occasional “soft” problems, such as too short a retention time of a few cells, are much more acceptable than if eDRAM is used for program data. The test concept should take this cost-reduction potential into account, ideally in conjunction with the redundancy concept.

A final aspect is that a number of business models are common in eDRAM, from foundry business to ASIC-type business. The test concept should thus support testing the memory from either a logic tester or a memory tester, so that the customer can do memory testing on his logic tester if required.
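As an illustration of algorithmic pattern generation with expected-value comparison, the sketch below runs the textbook March C- sequence against a simple software memory model. It is not the Siemens test controller described above; the `FaultyMemory` class, its stuck-at fault model, and the function name are assumptions made for the example, and real DRAM test programs cover many more fault types (retention, crosstalk, and so on).

```python
class FaultyMemory:
    """One-bit-per-address memory model with optional stuck-at faults."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}  # address -> forced bit value

    def write(self, addr, bit):
        # A stuck-at cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, bit)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def march_c_minus(mem, size):
    """Run March C- and return the sorted addresses with a read mismatch."""
    up = range(size)
    down = range(size - 1, -1, -1)
    # Each element: (address order, expected read value or None, write value or None)
    elements = [
        (up, None, 0),    # write 0 everywhere
        (up, 0, 1),       # ascending: read 0, write 1
        (up, 1, 0),       # ascending: read 1, write 0
        (down, 0, 1),     # descending: read 0, write 1
        (down, 1, 0),     # descending: read 1, write 0
        (up, 0, None),    # final read 0
    ]
    failures = set()
    for order, expect, write in elements:
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                mem.write(addr, write)
    return sorted(failures)


print(march_c_minus(FaultyMemory(16), 16))
print(march_c_minus(FaultyMemory(16, stuck_at={3: 1}), 16))
```

Only the list of failing addresses leaves the test loop, which mirrors the idea of compressing test data on-chip so that a narrow external interface suffices.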

50.6 Design Examples

Three examples of embedded memory designs are described. The first one is a flexible embedded DRAM design from Siemens Corp.5 The second one is the embedded memories in an MPEG environment from Toshiba Corp.14 The last one is the embedded memory design for a 64-bit superscalar RISC microprocessor from Toshiba Corp. and Silicon Graphics, Inc.15

A Flexible Embedded DRAM Design5

There is an increasing gap between processor and DRAM speed: processor performance increases by 60% per year in contrast to only a 10% improvement in the DRAM core. Deep cache structures are used to alleviate this problem, albeit at the cost of increased latency, which limits the performance of many applications. Merging a microprocessor with DRAM can reduce the latency by a factor of 5 to 10, increase the bandwidth by a factor of 50 to 100, and improve the energy efficiency by a factor of 2 to 4.16

Developing memory is a time-consuming task and cannot be compared with a high-level based logic design methodology which allows fast design cycles. Thus, a flexible memory concept is a prerequisite for a successful application of eDRAM. Its purpose is to allow fast construction of application-specific memory blocks that are customized in terms of bandwidth, word width, memory size, and the number of memory banks, while guaranteeing first-time-right designs accompanied by all views, test programs, etc.

A powerful eDRAM approach that permits fast and safe development of embedded memory modules is described. The concept, developed by Siemens Corp. for its customers, uses a 0.24-µm technology based on its 64/256 Mbit SDRAM process.5 Key features of the approach include:

© 2000 by CRC Press LLC

• Two building-block sizes, 256 Kbit and 1 Mbit; memory modules with these granularities can be constructed
• Large memory modules, from 8 to 16 Mbit upwards, achieving an area efficiency of about 1 Mbit/mm²
• Embedded memory sizes up to at least 128 Mbits
• Interface widths ranging from 16 to 512 bits per module
• Flexibility in the number of banks as well as the page length
• Different redundancy levels, in order to optimize the yield of the memory module to the specific chip
• Cycle times better than 7 ns, corresponding to clock frequencies better than 143 MHz
• A maximum bandwidth per module of about 9 Gbyte/s
• A small, synthesizable BIST controller for the memory (see next section)
• Test programs, generated in a modular fashion

Siemens Corp. has made eDRAM since 1989 and has a number of possible applications of its eDRAM approach in the pipeline, including TV scan-rate converters, TV picture-in-picture chips, modems, speech-processing chips, hard-disk drive controllers, graphics controllers, and networking switches. These applications cover the full range of memory sizes (from a few Mbits to 128 Mbits), interface widths (from 32 to 512 bits), and clock frequencies (from 50 to 150 MHz), which demonstrates the versatility of the concept.
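The quoted maximum bandwidth is consistent with the widest interface and the shortest cycle time in the feature list. The following is a minimal check, assuming one full-width transfer per clock cycle (an assumption; the text does not state the transfer protocol). The function name is illustrative only.

```python
def peak_bandwidth_bytes_per_s(interface_bits, clock_hz):
    """Peak bandwidth, assuming one full-width transfer per clock cycle."""
    return interface_bits / 8 * clock_hz


cycle_time_s = 7e-9                  # "cycle times better than 7 ns"
clock_hz = 1 / cycle_time_s          # roughly 143 MHz
bandwidth = peak_bandwidth_bytes_per_s(512, clock_hz)

print(f"clock = {clock_hz / 1e6:.0f} MHz")
print(f"bandwidth = {bandwidth / 1e9:.2f} Gbyte/s")
```

With a 512-bit interface this gives a little over 9 Gbyte/s, in line with the figure in the list.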

Embedded Memories in MPEG Environment14

Recently, multimedia LSIs, including MPEG decoders, have been drawing attention. The key requirements in realizing multimedia LSIs are their low-power and low-cost features. This example presents embedded memory-related techniques to achieve these requirements, which can be considered as a review of the state-of-the-art embedded memory macro techniques applicable to other logic LSIs.

Figure 50.3 shows embedded memory macros associated with the MPEG2 decoder. Most of the functional blocks use their own dedicated memory blocks and, consequently, memory macros are rather

FIGURE 50.3

Block diagram of MPEG2 decoder LSI. (From Ref. 14.)


small and distributed on a chip. Memory blocks are also connected to a central address/data bus for implementing direct test mode. An input buffer for the IDCT is shown in Fig. 50.4. Eight 16-bit data from D0 to D7 come from the inverse quantization block sequentially. The stored data should then be read out as 4-bit chunks orthogonal to the input sequence. The 4-bit data is used to address a ROM in the IDCT to realize a distributed arithmetic algorithm.
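The role of the orthogonal read can be seen in a small model of distributed arithmetic: bit j of several input words forms a ROM address, and the ROM entries are shifted and accumulated. The sketch below assumes unsigned inputs and four words per ROM address; which four of the eight words D0 to D7 form one 4-bit chunk is not stated in the text, and the coefficient values are hypothetical.

```python
def build_da_rom(coeffs):
    """ROM entry `addr` holds the sum of the coefficients whose bit is set in `addr`."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def bit_slice(words, j):
    """Orthogonal read: gather bit j of every word into one ROM address."""
    addr = 0
    for k, w in enumerate(words):
        addr |= ((w >> j) & 1) << k
    return addr


def da_dot_product(words, rom, width=16):
    """Dot product of unsigned `width`-bit words with the coefficients baked into `rom`."""
    return sum(rom[bit_slice(words, j)] << j for j in range(width))


coeffs = [3, -1, 4, 2]            # hypothetical coefficients
words = [5, 1000, 65535, 12]      # four 16-bit inputs
rom = build_da_rom(coeffs)        # 16-entry ROM addressed by a 4-bit chunk
result = da_dot_product(words, rom)
```

The words arrive one at a time (row-wise), but the ROM address needs one bit from each word (column-wise), which is why a buffer that can be read orthogonally to the write direction is useful.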

FIGURE 50.4

Input buffer structure for IDCT. (From Ref. 14.)

The circuit diagram of an orthogonal memory is shown in Fig. 50.5. It realizes the above-mentioned functionality with 50% of the area and the power that would be needed if the IDCT input buffer were built with flip-flops. In the orthogonal memory, word-lines and bit-lines run both vertically and horizontally to achieve the functionality. The macro size of the orthogonal memory is 420 µm × 760 µm, with a memory cell size of 10.8 µm × 32.0 µm.

FIGURE 50.5

Circuit diagram of orthogonal memory. (From Ref. 14.)


FIFOs and other dual-port memories are designed using a single-port RAM operated twice in one clock cycle to reduce area, as shown in Fig. 50.6. A dual-port memory cell is twice as large as a single-port memory cell.

FIGURE 50.6

Realizing dual-port memory with a single-port memory (FIFO case). (From Ref. 14.)
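A behavioral sketch of this idea follows: a FIFO built on a single-port RAM that is accessed at most twice per clock cycle, once for a write and once for a read. The write-then-read ordering within a cycle and all class names are assumptions for illustration, not details taken from Ref. 14.

```python
class SinglePortRAM:
    """RAM model allowing one access (read or write) per call; counts accesses."""

    def __init__(self, depth):
        self.data = [0] * depth
        self.accesses = 0

    def access(self, addr, write=False, value=None):
        self.accesses += 1
        if write:
            self.data[addr] = value
            return None
        return self.data[addr]


class TimeMultiplexedFIFO:
    """FIFO that uses the single port for a write phase and a read phase per clock."""

    def __init__(self, depth):
        self.ram = SinglePortRAM(depth)
        self.depth = depth
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.count = 0

    def clock(self, push=None, pop=False):
        """One clock cycle. A push to a full FIFO is ignored; a pop from an empty FIFO returns None."""
        # Phase 1: write access through the single port.
        if push is not None and self.count < self.depth:
            self.ram.access(self.wr_ptr, write=True, value=push)
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
            self.count += 1
        # Phase 2: read access through the same port.
        out = None
        if pop and self.count > 0:
            out = self.ram.access(self.rd_ptr)
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
            self.count -= 1
        return out
```

From the outside, the FIFO accepts one push and one pop per clock as a dual-port memory would, while the array itself only needs single-port cells.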

All memory blocks are synchronous self-timed macros and contain address pipeline latches. Otherwise, the timing design needs more time, since the lengths of the interconnections between latches and a decoder vary from bit to bit. Memory power management is carried out using a Memory Macro Enable signal when a memory macro is not accessed, which reduces the total memory power to 60%.

The flip-flop (F/F) is one of the memory elements in logic LSIs. Since digital video LSIs tend to employ several thousand F/Fs on a chip, the design of the F/F is crucial for small area and low power. The optimized F/F with hold capability is shown in Fig. 50.7. Due to the optimized smaller transistor sizes, especially for clock input transistors, and a minimized layout accommodating a multiplexer and a D-F/F in one cell, power and area are 40% smaller than those of a normal ASIC F/F.

FIGURE 50.7

Optimized flip-flop. (From Ref. 14.)

Establishing full testability of on-chip memories without much overhead is another important issue. Table 50.4 compares three on-chip memory test strategies: a BIST (Built-In Self Test), a scan test, and a direct test. The direct test mode, where all memories can be directly accessed from outside in a test mode, is implemented because of its inherently small area. In a test mode, DRAM interface pads are turned into test pins and can access each memory block through internal buses, as shown in Figs. 50.3 and 50.8.


TABLE 50.4 Comparison of Various Memory Test Strategies

Items             Direct   Scan   BIST
Area              Good     Fair   Poor
Test time         Good     Poor   Good
Pattern control   Good     Good   Poor
Bus capacitance   Fair     Good   Good
At-speed test     Good     Poor   Good

Source: Ref. 14.

FIGURE 50.8

Direct test architecture for embedded memories. (From Ref. 14.)

The present MPEG2 decoder contains a RISC whose firmware is stored in an on-chip ROM. In order to make the debugging easy and extensive, an instruction RAM is put outside the pads in parallel to the instruction ROM and activated by an Al-masterslice in an initial debugging stage as shown in Fig. 50.9. For a sample chip mounted in a plastic package, the instruction RAM is cut out by a scribe line. This scheme enables extensive debugging and early sampling at the same time for firmware-ROM embedded LSIs.

Embedded Memory Design for a 64-bit Superscalar RISC Microprocessor15

High-performance embedded memory is a key component in VLSI systems because of the high-speed and wide bus width capability eliminating inter-chip communication. In addition, multi-ported buffer memories are often demanded on a chip. Furthermore, a dedicated memory architecture that meets the special constraint of the system can neatly reduce the system critical path.

On the other hand, there are several issues in embedded RAM implementation. The specialty or variety of the memories could increase design cost and chip cost. Reading very wide data causes large power dissipation. Test time of the chip could be increased because of the large memory. Therefore, design efficiency, careful power bus design, and careful design for testability are necessary.


FIGURE 50.9

Instruction RAM masterslice for code debugging. (From Ref. 14.)

TFP is a high-speed and highly concurrent 64-bit superscalar RISC microprocessor, which can issue up to four instructions per cycle.17,18 Very wide bandwidth of on-chip caches is vital in this architecture. The design of the embedded RAMs, especially the caches and TLB, is reported. The TFP integer unit (IU) chip implements two integer ALU pipelines and two load/store pipelines. The block diagram is shown in Fig. 50.10. A five-stage pipeline is shown in Fig. 50.11. In the TFP IU chip, RAM blocks occupy a dominant part of the real estate. The die size is 17.3 mm × 17.3 mm. In addition to the caches, TLB, and register file, the chip also includes two buffer queues: SAQ (store address queue) and FPQ (floating point queue). Seventy-one percent of the overall 2.6 million transistors are used for memory cells. Transistor counts of each block are listed in Table 50.5.

FIGURE 50.10

Block diagram of TFP IU. (From Ref. 15.)

The first generation of TFP chip was fabricated using Toshiba’s high-speed 0.8 µm CMOS technology: double poly-Si, triple metal, and triple well. A deep n-well was used in PLL and cache cell arrays in order to decouple these circuits from the noisy substrate or power line of the CMOS logic part. The chip operates up to 75 MHz at 3.1 V and 70°C, and the peak performance reaches 300 MIPS. Features of each embedded memory are summarized in Table 50.6. Instruction, branch, and data caches are direct mapped because of the faster access time. High-resistive poly-Si load cells are used for these caches since the packing density is crucial for the performance.


FIGURE 50.11

TFP IU pipelining. (From Ref. 15.)

TABLE 50.5 Transistor Counts

Block                               Transistor Count   Ratio (%)
Cache, TLB memory cell                     1,761,040       67.02
RegFile, FPQ, SAQ memory cells               106,624        4.06
Custom block without memory cell             509,218       19.38
Random blocks                                250,621        9.54
Total                                      2,627,503      100.00

Source: Ref. 15.
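The statement that about 71% of the transistors are memory cells can be checked directly from the two memory-cell rows and the total of Table 50.5. This is only a worked arithmetic check; the variable names are illustrative.

```python
cache_tlb_cells = 1_761_040          # Cache, TLB memory cells
regfile_queue_cells = 106_624        # RegFile, FPQ, SAQ memory cells
total_transistors = 2_627_503

memory_cell_share = (cache_tlb_cells + regfile_queue_cells) / total_transistors
print(f"memory cells: {memory_cell_share:.1%} of all transistors")
```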

Instruction cache (ICACHE) is 16 KB of virtual address memory. It provides four instructions (128 bit wide) per cycle. Branch cache (BCACHE) contains branch target address with one flag bit to indicate a predicted branch. BCACHE contains 1-K entries and is virtually indexed in parallel with ICACHE.

Data cache (DCACHE) is 16 KB, dual ported, and supports two independent memory instructions (two loads, or one load and one store) per cycle. Total memory bandwidth of ICACHE and DCACHE reaches 2.4 GB/s at 75 MHz. Floating point load/store data bypass DCACHE and go directly to bigger external global cache.17,19 DCACHE is virtually indexed and physically tagged.

TLB is dual ported, three-set-associative memory containing 384 entries. A unique address comparison scheme is employed here, which will be described in the following section. It supports several different page sizes, ranging from 4 KB to 16 MB. TLB is indexed by low-order 7 bits of virtual page number (VPN). The index is hashed by exclusive-OR with a low-order ASID (address space identifier) so that many processes can co-exist in TLB at one time.

Since several different RAMs are used in TFP chip, the design efficiency is important. Consistent circuit schemes are used for each of the caches and TLB RAMs. Layout is started from the block that has the tightest area restriction, and the created layout modules are exported to other blocks with small modification.

The basic block diagram of cache blocks is shown in Fig. 50.12, and timing diagram is shown in Fig. 50.13. Unlike a register file or other smaller queue buffers, these blocks employ dual-railed bit-lines. To achieve 75-MHz operation in the worst-case condition, it should operate at 110 MHz under typical conditions. In this targeted 9-ns cycle time, address generation is done about 3 ns before the end of the


TABLE 50.6 Summary of Embedded RAM Features

Block                         Feature                                          Cell Size
Instruction cache (ICACHE)    16 KB, direct mapped                             Hi-R cell
                              32 B line size                                   6.75 µm × 9 µm
                              Virtually addressed
                              4 instructions per cycle
Branch cache (BCACHE)         1 K entries, direct mapped                       Hi-R cell
                                                                               6.75 µm × 9 µm
Data cache                    2-ported, 16 KB, direct mapped                   Hi-R cell
                              32 B line size                                   12.6 µm × 9.45 µm
                              Virtually indexed and physically tagged
                              Write through
Valid RAM (VRAM)              One valid bit for 32 b word                      CMOS cell
                              4-ported (2 read, 2 write)                       34.3 µm × 18.9 µm
TLB                           3 sets, 384 entries                              CMOS cell
                              2-ported                                         21.2 µm × 13.7 µm
                              Index is hashed by ASID
                              Supported page size: 4K, 8K, 16K, 64K, 1M, 4M, 16M
Register file                 64 b × 32 entries                                CMOS cell
                              13-ported (9 read, 4 write)                      59.5 µm × 42.8 µm
Floating point queue (FPQ)    Dispatches 4 floating-point instructions         16.1 µm × 40.7 µm
                              per cycle
                              3-ported (2 read, 1 write)
                              16 entries
Store address queue (SAQ)     Content addressable                              CMOS cell
                              3-ported (1 read, 1 write, 1 compare)            35.1 µm × 17.1 µm
                              32 entries, 2 banked

Source: Ref. 15.
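The TLB row of Table 50.6 (3 sets, 384 entries, index hashed by ASID) implies 128 rows per set and hence a 7-bit index. The following is a minimal sketch of such an index computation. It assumes that the VPN has already been extracted for the page size in use and that the low-order 7 bits of the ASID take part in the exclusive-OR; the text only says "a low-order ASID", so the exact bit count is an assumption, as is the function name.

```python
SETS = 3
ENTRIES = 384
ROWS_PER_SET = ENTRIES // SETS                 # 128
INDEX_BITS = ROWS_PER_SET.bit_length() - 1     # 7
INDEX_MASK = (1 << INDEX_BITS) - 1


def tlb_index(vpn, asid):
    """Row index: low-order VPN bits XORed with low-order ASID bits (assumed 7 bits)."""
    return (vpn ^ asid) & INDEX_MASK


# The same virtual page in two address spaces lands in different rows,
# so translations of several processes can co-exist.
row_a = tlb_index(0x12345, asid=0)
row_b = tlb_index(0x12345, asid=1)
```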

cycle, as shown in Fig. 50.11. To take advantage of this big address set-up time, address is received by transparent latch: TLAT_N (transparent while clock is low) instead of flip-flop. Thus, decode is started as soon as address generation is done and is finished before the end of the cycle. Another transparent latch—TLAT_P (transparent while clock is high)—is placed after the sense amplifier and it holds read data while the clock is low. Word-line (WL) is enabled while clock is high. Since the decode is already finished, WL can be driven to “high” as fast as possible. The sense amplifier is enabled (SAE) with a certain delay after the wordline. The paired current-mirror sense amplifier is chosen since it provides good performance without overly strict SAE timing. Bit-line is precharged and equalized while the clock is low. The clock-to-data delay of DCACHE, which is the biggest array, is 3.7 ns under typical conditions: clock-to-WL is 0.9 ns and WL-to-data is 2.8 ns. Since on-chip PLL provides 50% duty clock, timing pulses such as SAE or WE (write enable) are created from system clock by delaying the positive edge and negative edge appropriately. As both word-line and sense amplifier are enabled in just half the time of one cycle, the current dissipation is reduced by half. However, the power dissipation and current spike are still an issue because the read/write data width is extremely large. Robust power bus matrix is applied in the cache and TLB blocks so that the dc voltage drop at the worst place is limited to 60 mV inside the block. From a minimum cycle time viewpoint, write is more critical than read because write needs bigger bit-line swing, and the bit-line must be precharged before the next read. To speed up precharge time,


FIGURE 50.12

Basic RAM block diagram. (From Ref. 15)

FIGURE 50.13

RAM timing diagram. (From Ref. 15)
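The timing and bandwidth figures quoted for the cache blocks can be tied together with a few lines of arithmetic. The sketch below uses only numbers from the text, plus two labeled assumptions: that read data should arrive within the clock-high phase (since the word-line is enabled and the output latch is transparent while the clock is high), and that each of the two DCACHE ports is 64 bits wide.

```python
def cycle_time_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / freq_mhz


worst_case_cycle_ns = cycle_time_ns(75)     # about 13.3 ns
typical_cycle_ns = cycle_time_ns(110)       # about 9.1 ns, the targeted "9-ns" cycle

# DCACHE read path under typical conditions (figures from the text).
clock_to_wl_ns = 0.9
wl_to_data_ns = 2.8
clock_to_data_ns = clock_to_wl_ns + wl_to_data_ns   # about 3.7 ns

# Assumption: with a 50% duty clock, read data should arrive within the high phase.
high_phase_ns = typical_cycle_ns / 2
read_margin_ns = high_phase_ns - clock_to_data_ns

# Peak cache bandwidth at 75 MHz.
icache_bytes_per_cycle = 128 // 8           # four 32-bit instructions per cycle
dcache_bytes_per_cycle = 2 * (64 // 8)      # assumption: two 64-bit ports
total_bandwidth_gb_s = (icache_bytes_per_cycle + dcache_bytes_per_cycle) * 75e6 / 1e9
```

Under these assumptions the read path leaves a little under 1 ns of slack in the high phase at 110 MHz, and the combined cache bandwidth matches the 2.4 GB/s quoted in the text.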

precharge circuitry is placed on both the top and bottom of the bit-line. In addition, the write circuitry dedicated to cache-refill is placed on the top side of DCACHE and ICACHE to minimize the wire delay of the write data from input pad. Write data bypass selector is implemented so that the write data is available as read data in the same cycle with no timing penalty. Virtual to physical address translation and following cache hit check are almost always one of the critical paths in a microprocessor. This is because the cache tag comparison has to wait for the VTLB (RAM that contains virtual address tag) search operation and the following physical address selection from PTLB (RAM that contains physical address).20 A timing example of the conventional scheme is


shown in Fig. 50.14. In TFP, the DCACHE tag is directly compared with all the three sets of PTLB data in parallel—which are merely candidates of physical address at this stage—without waiting for the VTLB hit results. The block diagram and timing are shown in Figs. 50.15 and 50.16. By the time this hit check of the cache tag is done, VTLB hit results are just ready and they select the PTLB hit result immediately. The “ePmatch” signal in Fig. 50.16 is the overall cache hit result. Although three times more comparators are needed, this scheme saves about 2.8 ns as compared to the conventional one. In TLB, sense amplifiers of each port are separately placed on the top and bottom of the array to mitigate the tight layout pitch of the circuit. A large amount of wire creates problems around VTLB,

FIGURE 50.14

Conventional physical cache hit check. (From Ref. 15.)

FIGURE 50.15

TFP physical cache hit check. (From Ref. 15.)


FIGURE 50.16

Block diagram of TLB and DTAG. (From Ref. 15.)
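The difference between the conventional hit check and the TFP scheme can be sketched behaviorally. In the model below, `vtlb` and `ptlb` are the three virtual tags and three physical addresses read from the indexed TLB row; these names, the list representation, and the assumption that at most one set matches are illustrative only. Python evaluates the comparisons one after another, whereas in hardware the three tag comparisons and the VTLB search run concurrently.

```python
def conventional_hit(vpn, cache_tag, vtlb, ptlb):
    """Serial scheme: find the matching TLB set first, then compare a single tag."""
    for s in range(len(vtlb)):
        if vtlb[s] == vpn:                 # VTLB search must finish first
            return ptlb[s] == cache_tag    # then one physical tag comparison
    return False


def tfp_style_hit(vpn, cache_tag, vtlb, ptlb):
    """Parallel scheme: compare the cache tag with every PTLB candidate, then select."""
    pmatch = [p == cache_tag for p in ptlb]   # three comparators, one per set
    vmatch = [v == vpn for v in vtlb]         # VTLB search, concurrent in hardware
    return any(v and p for v, p in zip(vmatch, pmatch))


vtlb_row = [10, 20, 30]       # hypothetical virtual tags of the three sets
ptlb_row = [100, 200, 300]    # hypothetical physical addresses of the three sets
```

Both functions return the same result; the second form trades three comparators for a shorter critical path, because the tag comparison no longer waits for the VTLB result.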

PTLB, and DTAG (DCACHE tag RAM) from both layout and critical path viewpoints. This was solved by piling them to build a data path (APATH: Address Data Path) by making the most of the metal-3 vertical interconnection. Although this metal-3 signal line runs over TLB arrays in parallel with the metal-1 bit-line, the TLB access time is not degraded since the horizontal metal-2 word-line shields the bit-line from the coupling noise. The data fields of three sets are scrambled to make the data path design tidy; the 39-bit (in VTLB) and 28-bit (in PTLB) comparators of each set consist of an optimized AND-tree. Wired-OR type comparators are rejected because a longer wired-OR node in this array configuration would have a speed penalty. As TFP supports different page sizes, the VPN and PFN (page frame number) fields change, depending on the page size. The index and comparison field of TLB are thus made selectable by control signals.

32-bit DCACHE data are qualified by one valid bit. A valid bit needs the read-modify-write operation based on the cache hit results. However, this is not realized in one cycle access because of tight timing. Therefore, two write ports are added to the valid bit and write access is moved to the next cycle: the W-stage. The write data bypass selector is essential here to avoid a data hazard. To minimize the hardware overhead of the VRAM (valid bit RAM) row decoder, two schemes are applied. First, row decoders of read ports are shared with DCACHE by pitch-matching one VRAM cell height with two DCACHE cells. Second, write word-line drivers are made of shift registers that have read word-lines as inputs. The schematic is shown in Fig. 50.17.

FIGURE 50.17

VRAM row decoder. (From Ref. 15.)


Although the best way to verify the whole chip layout is to do a DRC (design rule check) and an LVS (layout versus schematic) check that include all sections and the chip, this was not possible in TFP since the transistor count is too large for CAD tools to handle. Thus, it was necessary to exclude a large part of the memory cells from the verification flow. To avoid possible mistakes around the boundary of the memory cell array, a few rows and columns were retained on each of the four sides of a cell array. Where this breaks signal continuity, text is added on the top level of the layout to make a virtual connection, as shown in Fig. 50.18. These steps are handled essentially by CAD software plus a small amount of programming, without editing the layout by hand.

Direct testing of large on-chip memory is highly preferable in VLSI because of faster test time and complete test coverage. TFP IU defines cache direct test in JTAG test mode, in which cache address, data, write enable, and select signals are directly controlled from the outside. Thus, very straightforward evaluation is possible. By utilizing the 64-bit general-purpose bus that runs across the chip, the additional hardware for the data transfer is minimized.

FIGURE 50.18

RAM layout verification. (From Ref. 15.)

Since defect density is a function of device density and device area, large on-chip memory can be a determinant of total chip yield. Raising embedded memory yield can directly lead to the rise of the chip yield. Failure symptoms of the caches have been analyzed by making a fail-bit-map, and this has been fed back to the fabrication process.

References

1. Borel, J., Technologies for Multimedia Systems on a Chip. In 1997 International Solid State Circuits Conference, Digest of Technical Papers, 40, 18-21, Feb. 1997.
2. De Man, H., Education for the Deep Submicron Age: Business as Usual? In Proceedings of the 34th Design Automation Conference, p. 307-312, June 1997.
3. Iizuka, T., Embedded Memory: A Key to High Performance System VLSIs. Proceedings of 1990 Symposium on VLSI Circuits, p. 1-4, June 1990.
4. Horowitz, M., Hennessy, J., Chow, P., Gulak, P., Acken, J., Agrawal, A., Chu, C., McFarling, S., Przybylski, S., Richardson, S., Salz, A., Simoni, R., Stark, D., Steenkiste, P., Tjiang, S., and Wing, M., A 32b Microprocessor with On-chip 2K-Byte Instruction Cache. ISSCC Dig. of Tech. Papers, p. 30-31, Feb. 1987.
5. Wehn, N. and Hein, S., Embedded DRAM architectural trade-offs. Proceedings of Design, Automation and Test in Europe, p. 704-708, 1998.
6. Przybylski, S. A., New DRAM Technologies: A Comprehensive Analysis of the New Architectures. Report, 1996.
7. Wada, Y., Maruyama, T., Chida, M., Takeda, S., Shinada, K., Sekiguchi, K., Suzuki, Y., Kanzaki, K., Wada, M., and Yoshikawa, M., A 1.7-Volt Operating CMOS 64 KBit E2PROM. Symp. on VLSI Circ., Kyoto, Dig. of Tech. Papers, p. 41-42, May 1989.
8. Matsukawa, M., Morita, S., Shinada, K., Miyamoto, J., Tsujimoto, J., Iizuka, T., and Nozawa, H., A High Density Single Poly Si Structure EEPROM with LB (Lowered Barrier Height) Oxide for VLSI’s. Symp. on VLSI Technology, Dig. of Tech. Papers, p. 100-101, 1985.
9. Sawada, K., Sakurai, T., Nogami, K., Iizuka, T., Uchino, Y., Tanaka, Y., Kobayashi, T., Kawagai, K., Ban, E., Shiotari, Y., Itabashi, Y., and Kohyama, S., A 72K CMOS Channelless Gate Array with Embedded 1Mbit Dynamic RAM. IEEE CICC, Proc. 20.3.1, May 1988.
10. Archer, D., Deverell, D., Fox, F., Gronowski, P., Jain, A., Leary, M., Olesin, A., Persels, S., Rubinfeld, P., Schmacher, D., Supnik, B., and Thrush, T., A 32b CMOS Microprocessor with On-Chip Instruction and Data Caching and Memory Management. ISSCC Digest of Technical Papers, p. 32-33, Feb. 1987.
11. Beyers, J. W., Dohse, L. J., Fucetola, J. P., Kochis, R. L., Lob, C. G., Taylor, G. L., and Zeller, E. R., A 32b VLSI CPU Chip. ISSCC Digest of Technical Papers, p. 104-105, Feb. 1981.
12. Ishimoto, S., Nagami, A., Watanabe, H., Kiyono, J., Hirakawa, N., Okuyama, Y., Hosokawa, F., and Tokushige, K., 256K Dual Port Memory. ISSCC Digest of Technical Papers, p. 38-39, Feb. 1985.
13. Sakurai, T., Nogami, K., Sawada, K., Shirotori, T., Takayanagi, T., Iizuka, T., Maeda, T., Matsunaga, J., Fuji, H., Maeguchi, K., Kobayashi, K., Ando, T., Hayakashi, Y., and Sato, K., A Circuit Design of 32Kbyte Integrated Cache Memory. 1988 Symp. on VLSI Circuits, p. 45-46, Aug. 1988.
14. Otomo, G., Hara, H., Oto, T., Seta, K., Kitagaki, K., Ishiwata, S., Michinaka, S., Shimazawa, T., Matsui, M., Demura, T., Koyama, M., Watanabe, Y., Sano, F., Chiba, A., Matsuda, K., and Sakurai, T., Special Memory and Embedded Memory Macros in MPEG Environment. Proceedings of IEEE 1995 Custom Integrated Circuits Conference, p. 139-142, 1995.
15. Takayanagi, T., Sawada, K., Sakurai, T., Parameswar, Y., Tanaka, S., Ikumi, N., Nagamatsu, M., Kondo, Y., Minagawa, K., Brennan, J., Hsu, P., Rodman, P., Bratt, J., Scanlon, J., Tang, M., Joshi, C., and Nofal, M., Embedded Memory Design for a Four Issue Superscaler RISC Microprocessor. Proceedings of IEEE 1994 Custom Integrated Circuits Conference, p. 585-590, 1994.
16. Patterson, D. et al., Intelligent RAM (IRAM): Chips that Remember and Compute. In 1997 International Solid State Circuits Conference, Digest of Technical Papers, 40, 224-225, February 1997.
17. Hsu, P., Silicon Graphics TFP Micro-Supercomputer Chip Set. Hot Chips V Symposium Record, p. 8.3.1-8.3.9, Aug. 1993.
18. Ikumi, N. et al., A 300 MIPS, 300 MFLOPS Four-Issue CMOS Superscaler Microprocessor. ISSCC 94 Digest of Technical Papers, Feb. 1994.
19. Unekawa, Y. et al., A 110 MHz/1Mbit Synchronous TagRAM. 1993 Symposium on VLSI Circuits Digest of Technical Papers, p. 15-16, May 1993.
20. Takayanagi, T. et al., 2.6 Gbyte/sec Cache/TLB Macro for High-Performance RISC Processor. Proceedings of CICC’91, p. 10.21.1-10.2.4, May 1991.


Shen R.S., et al. "Flash Memories" The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000


51 Flash Memories

Rick Shih-Jye Shen, Frank Ruei-Ling Lin, Amy Hsiu-Fen Chou, Evans Ching-Song Yang, Charles Ching-Hsiang Hsu
National Tsing-Hua University

51.1 Introduction
51.2 Review of Stacked-Gate Non-volatile Memory
51.3 Basic Flash Memory Device Structures
n-Channel Flash Cell • p-Channel Flash Cell
51.4 Device Operations
Device Characteristics • Carrier Transport Schemes • Comparisons of Electron Injection Operations • List of Operation Modes
51.5 Variations of Device Structure
CHEI Enhancement • FN Tunneling Enhancement • Improvement of Gate Coupling Ratio
51.6 Flash Memory Array Structures
NOR Type Array • AND Type Families • NAND Type Array
51.7 Evolution of Flash Memory Technology
51.8 Flash Memory System
Applications and Configurations • Finite State Machine • Level Shifter • Charge-Pumping Circuit • Sense Amplifier • Voltage Regulator • Y-Gating • Page Buffer • Block Register • Summary

51.1 Introduction

In past decades, owing to process simplicity, stacked-gate memory devices have become the mainstream in the non-volatile memory market. This chapter is divided into seven sections to review the evolution of stacked-gate memory, device operation, device structures, memory array architectures, and the flash memory system. In Section 51.2, a short historical review of the stacked-gate memory device and the current flash device is given. Following this, the current–voltage characteristics, charge injection/ejection mechanisms, and the write/erase configurations are described in detail. Based on the descriptions of device operation, some modifications in the memory device structure to improve performance are addressed in Section 51.5. Following the introduction of single memory device cells, the memory array architectures are described in Section 51.6 to facilitate the understanding of device operation. In Section 51.7, a table lists the history of flash memory development over the past decade. Finally, Section 51.8 is dedicated to the issues related to implementation of a flash memory system.

51.2 Review of Stacked-Gate Non-Volatile Memory

The concept of a memory device with a floating gate was first proposed by Kahng and Sze in 1967.1 The suggested device structure was started from a basic MOS structure. As shown in Fig. 51.1, the insulator in the conventional MOS structure was replaced with a thin oxide layer (I1), an isolated metal layer (M1), and a thick oxide layer (I2). These stacked oxide and metal layers led to the so-called MIMIS structure. In this


FIGURE 51.1

Schematic cross-section of MIMIS structure.

device structure, the first insulator layer I1 had to be thin enough to allow electrons to be injected into the floating gate M1, while the second insulator layer I2 had to be thick enough to avoid the loss of stored charge during the charge injection operation. During the electron injection operation, a high electric field (~10 MV/cm) enables electrons to tunnel directly through I1; the injected electrons are captured in the floating gate and thus change the I–V characteristics. During the discharge operation, a negative voltage is applied at the external gate to remove the stored electrons by the same direct tunneling mechanism. Owing to the very thin oxide layer I1, defects in the oxide and back tunneling lead to a poor charge retention capability. However, this MIMIS structure demonstrated, for the first time, the possibility of implementing a non-volatile memory device based on the MOS structure.

After MIMIS was invented, several improvements were proposed to enhance its performance. One was the utilization of a dielectric material with a large number of electron-trapping centers as a replacement for the floating metal gate.2,3 The injected electrons would be trapped in the bulk and also at the interface traps in the dielectric material, such as silicon nitride (Si3N4), Al2O3, or Ta2O5. The device structure with these insulating layers as the electron storage node was referred to as a charge trapping device. Another solution to improve the oxide quality and charge retention capability was to increase the thickness of the tunnel dielectric I1. This device structure, based on the MIMIS structure but with a thicker insulating layer, was referred to as a floating gate device.

In the initial development period, the charge trapping devices had several advantages compared with floating gate devices. They allowed high density, good write/erase endurance capability, and fast programming/erase time. However, the main obstacle to the wide application of charge trapping devices was their poorer charge retention capability compared with floating gate devices. On the other hand, the floating gate devices showed a major drawback of not being electrically erasable. Therefore, the erase operation had to be preceded by the time-consuming UV-irradiation process. However, the floating gate devices had been applied successfully because of the following advantages and improvements. First, the floating gate devices were compatible with the standard double polysilicon NMOS process and then became compatible with the CMOS process after minor modification. Second, an excellent charge retention capability was obtained because of the thicker gate oxide. In addition, the thicker oxide relieved the gate disturbance issue. Furthermore, the development of the electrical erase operation technique during the 1980s made the write/erase operation easier and more efficient. For these reasons, most commercial non-volatile memory companies focused their research efforts on the floating gate devices. Therefore, floating gate devices have become the mainstream product in the non-volatile market.

A high operation voltage is unavoidable when the thickness of oxide I1 increases in the MIMIS structure. Thus, another way to achieve electron injection was necessary to make the injection operation more efficient. In 1971, a memory element with an avalanche injection scheme was demonstrated.4 This first operating floating gate device, named Floating gate Avalanche injection MOS (FAMOS) and shown in Fig. 51.2, was a p-channel MOSFET in which no electrical contact was made


FIGURE 51.2

Schematic cross-section of FAMOS structure.

to the silicon gate. The injection operation of the FAMOS memory structure is initiated by avalanche phenomena in the drain region underneath the gate. The electron-hole pair generation is caused by applying a high reverse bias at the drain/substrate junction. Some of the generated electrons drift toward the floating gate under the positive oxide field, which is induced by the capacitive coupling between the floating gate and the drain. However, the inefficient injection process was the major drawback of this device structure. In order to improve the injection efficiency, the Stacked-gate Avalanche injection MOS (SAMOS) with an external gate was proposed, as shown in Fig. 51.3. Owing to the additional gate bias, the programming speed was improved by an increased drift velocity of electrons in the oxide and the field-induced energy barrier lowering at the Si–SiO2 interface. In addition, by employing this control gate, the electrical erase operation became possible by building up a high electric field across the inter-polysilicon dielectric.

FIGURE 51.3

Schematic cross-section of p-channel SAMOS structure.

All the stacked-gate devices mentioned above are p-channel devices, which utilize avalanche injection scheme. However, if a smaller access time is required for the read operation, n-channel devices are necessary because of higher channel carrier mobility. Since the avalanche injection in an n-channel device is based on the hole injection, other injection mechanisms are required for n-channel stacked-gate memory cells. There are two major injection schemes for the n-channel memory cell. One is the channel hot electron injection (CHEI) and the other one is high electric field (Fowler-Nordheim, FN) tunneling mechanism. These two operation schemes lead to different device structures. The memory devices using


the CHEI scheme allow a thicker gate oxide, whereas the memory devices using the FN tunneling scheme require a thinner oxide. In 1980, researchers at Intel Corp. proposed the FLOTOX (FLOating gate Tunnel OXide) device, as shown in Fig. 51.4, in which the electrons are injected into and ejected from the floating gate through a high-quality thin oxide region outside the channel region.5 The FLOTOX cell must be isolated by a select transistor to avoid the over-erase issue and therefore it consists of two transistors. Although this limits the density of such memory in comparison with EPROM and the Flash cell, it enables the byte-by-byte erase and reprogramming operation without having to erase the entire chip or sector. Based on this, the FLOTOX cell is suitable for applications in which low density, high reliability, and non-volatile memory are required.

FIGURE 51.4 Schematic cross-section of FLOTOX structure.

A further modification of EEPROM operation is to erase the whole memory chip rather than a single byte. With an electrical erase signal, all cells in the memory chip, which is called a Flash device, are erased simultaneously. The first Flash memory cell was proposed and realized in a three-layer polysilicon technology by Toshiba Corp.6 The first polysilicon is used as the erase gate, the second polysilicon as the floating gate, and the third polysilicon as the control gate, as shown in Fig. 51.5(c). In this device, programming is performed by channel hot electron injection, and erase is carried out by extracting the stored electrons from the floating gate to the erase gate for all the bits at the same time.

51.3 Basic Flash Memory Device Structures

n-Channel Flash Cell

Based on the concept proposed by researchers at Toshiba Corp., developments in Flash memory have burgeoned since the end of the 1980s. There are three categories of device structures based on the n-channel MOS structure. Besides the triple-polysilicon Flash cell, the most popular Flash cell structures are the ETOX cell and the split-gate cell. In 1985, Mukherjee et al.7,9 proposed a source-erase Flash cell called the ETOX (EPROM with Tunnel OXide). This cell structure is the same as that of the UV-EPROM, as shown in Fig. 51.6, but with a thin tunnel oxide layer. The cell is programmed by CHEI and erased by applying a high voltage at the source terminal. A split-gate memory cell was proposed by Samachisa et al. in 1987.8 This drain-erase split-gate Flash cell has two polysilicon layers, as shown in Fig. 51.7. The cell can be regarded as two transistors in series. One is a floating-gate memory, which is similar to an EPROM cell; the other, which is used as a select transistor, is an enhancement transistor controlled by the control gate.


FIGURE 51.5 Triple-gate Flash memory structure proposed by Toshiba: (a) layout of the cell; (b) cross-section along the channel length; and (c) cross-section along the channel width.

p-Channel Flash Cell

The p-channel Flash memory cell was first proposed by Hsu et al. in 1992.9 Recently, several studies have been done on this device structure.10–13 This Flash cell structure is similar to the ETOX cell but with a p-channel. The erase mechanism is still FN tunneling. For electron injection, two schemes can be employed: CHEI and BBHE (Band-to-Band tunneling induced Hot Electron injection).11 The p-channel Flash cell features high electron injection efficiency, scalability, immunity to hot hole injection, and reduced oxide field during programming. Based on these advantages, the p-channel Flash memory cell appears to have high potential for future low-power Flash applications.

51.4 Device Operations

Device Characteristics

Capacitive Coupling Effects and Coupling Ratios

The I–V characteristics of a stacked-gate device can be derived from the MOSFET characteristics together with the capacitive-coupling factors. A stacked-gate device can be depicted as an equivalent capacitive circuit, as shown in Fig. 51.8. Because the floating gate is isolated from the other terminals, its potential, VFG, is determined by the contributions of the four device terminals and of the charge stored on the floating gate:


FIGURE 51.6 Schematic cross-section of ETOX-type Flash memory cell: (a) the top view of the cell, and (b) the cross-section along the channel length and channel width.

FIGURE 51.7 Schematic cross-section of split-gate Flash memory cell.

$$V_{FG} = \frac{C_{FG}}{C_{TOTAL}} V_G + \frac{C_B}{C_{TOTAL}} V_{WELL} + \frac{C_D}{C_{TOTAL}} V_D + \frac{C_S}{C_{TOTAL}} V_S - \frac{Q}{C_{TOTAL}} \tag{51.1}$$

$$C_{TOTAL} = C_{FG} + C_B + C_D + C_S \tag{51.2}$$

and

$$\alpha_{FG} = \frac{C_{FG}}{C_{TOTAL}}, \quad \alpha_B = \frac{C_B}{C_{TOTAL}}, \quad \alpha_D = \frac{C_D}{C_{TOTAL}}, \quad \alpha_S = \frac{C_S}{C_{TOTAL}} \tag{51.3}$$

FIGURE 51.8 Schematic cross-section of stacked-gate device and its equivalent capacitive model.

where CFG, CB, CD, and CS are the capacitances between the floating gate and the control gate, well terminal, drain terminal, and source terminal, respectively. Q is the charge stored on the floating gate, and αFG, αB, αD, and αS are the gate, well, drain, and source coupling ratios, respectively.

Current–Voltage Characteristics

The current–voltage relationship in a stacked-gate device has been studied and modeled in detail.14,15 By employing Eq. 51.1 for general I–V characteristics in MOSFETs, a simplified I–V relationship in stacked-gate devices can be obtained:

$$V_{FG} = \frac{C_{FG}}{C_{TOTAL}} V_G + \frac{C_D}{C_{TOTAL}} V_D - \frac{Q}{C_{TOTAL}} = \alpha_{FG}\left(V_G + \frac{C_D}{C_{FG}} V_D - \frac{Q}{C_{FG}}\right) \tag{51.4}$$

for VS = VWELL = 0 V. In the linear region,

$$I_D = \frac{\mu_n C_{ox} W}{L}\left(V_{FG} - V_{TH} - \frac{V_D}{2}\right) V_D = \frac{\alpha_{FG}\,\mu_n C_{ox} W}{L}\left[V_G + \left(\frac{C_D}{C_{FG}} - \frac{1}{2\alpha_{FG}}\right) V_D - \frac{Q}{C_{FG}} - \frac{V_{TH}}{\alpha_{FG}}\right] V_D \tag{51.5}$$

and in the saturation region,

$$I_D = \frac{\mu_n C_{ox} W}{2L}\left(V_{FG} - V_{TH}\right)^2 = \frac{\alpha_{FG}^2\,\mu_n C_{ox} W}{2L}\left(V_G + \frac{C_D}{C_{FG}} V_D - \frac{Q}{C_{FG}} - \frac{V_{TH}}{\alpha_{FG}}\right)^2 \tag{51.6}$$
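The coupling relations in Eqs. 51.1 to 51.6 can be sketched numerically. The following is a minimal Python sketch, not a calibrated device model: the capacitance values, the gain factor `K_MOS` (standing for µn·Cox·W/L), and the floating-gate threshold `V_TH` are hypothetical numbers chosen only for illustration, and subthreshold conduction is ignored.

```python
"""Floating-gate potential and drain current of a stacked-gate cell (Eqs. 51.1-51.6)."""


def floating_gate_potential(v_g, v_well, v_d, v_s, q_stored, c_fg, c_b, c_d, c_s):
    """Eq. 51.1: V_FG from the terminal voltages and the stored charge Q (in coulombs)."""
    c_total = c_fg + c_b + c_d + c_s  # Eq. 51.2
    return (c_fg * v_g + c_b * v_well + c_d * v_d + c_s * v_s - q_stored) / c_total


def drain_current(v_g, v_d, q_stored, c_fg, c_b, c_d, c_s, k_mos, v_th):
    """Eqs. 51.5 and 51.6 with V_S = V_WELL = 0; k_mos stands for mu_n * C_ox * W / L."""
    v_fg = floating_gate_potential(v_g, 0.0, v_d, 0.0, q_stored, c_fg, c_b, c_d, c_s)
    overdrive = v_fg - v_th
    if overdrive <= 0.0:
        return 0.0  # subthreshold conduction is ignored in this sketch
    if v_d < overdrive:
        return k_mos * (overdrive - v_d / 2.0) * v_d  # linear region, Eq. 51.5
    return 0.5 * k_mos * overdrive ** 2  # saturation region, Eq. 51.6


# Hypothetical cell parameters (assumptions, not values from the text).
CAPS = dict(c_fg=1.0e-15, c_b=0.4e-15, c_d=0.1e-15, c_s=0.1e-15)  # farads
K_MOS = 100e-6  # A/V^2
V_TH = 1.0      # V, threshold seen from the floating gate

i_sat_3v = drain_current(5.0, 3.0, 0.0, k_mos=K_MOS, v_th=V_TH, **CAPS)
i_sat_5v = drain_current(5.0, 5.0, 0.0, k_mos=K_MOS, v_th=V_TH, **CAPS)
print(f"I_D at V_D = 3 V: {i_sat_3v * 1e6:.1f} uA")
print(f"I_D at V_D = 5 V: {i_sat_5v * 1e6:.1f} uA")
```

Both bias points are in saturation, yet the current still rises with drain bias because the drain couples to the floating gate through CD.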

From Eqs. 51.5 and 51.6, it is clear that the stacked-gate device suffers from drain bias coupling during operation. An increase of drain current can be observed in both the output characteristics and the transfer characteristics. Fig. 51.9 shows the subthreshold characteristics of both the n-channel and p-channel Flash devices. An obvious increase of the subthreshold current can be observed as the drain bias increases. In addition, the increased drain current characteristics in the saturation region are shown in Fig. 51.10.

FIGURE 51.9 The subthreshold characteristics of n- and p-channel Flash memory cells.

FIGURE 51.10 The output characteristics of stacked-gate memory cells.

Threshold Voltage of Flash Memory Devices

Threshold voltage is defined as the minimum voltage needed to turn on the device. For a stacked-gate device, the threshold voltage measured from the control gate is an indicator of the charge storage condition. From Eq. 51.4, we can obtain

$$V_{FGTH} = \alpha_{FG}\left(V_{GTH} + \frac{C_D}{C_{FG}} V_D - \frac{Q}{C_{FG}}\right) \tag{51.7}$$

According to this equation, the threshold voltage measured from the floating gate is linearly related to the threshold voltage measured from the control gate, the drain bias, and the amount of stored charge. The threshold voltage measured from the floating gate is determined only by the process and the device structure. Therefore, under a fixed drain bias in a given stacked-gate device, the change of the threshold voltage measured from the control gate depends linearly on the change of the stored charge. This can be expressed as

$$\Delta V_{GTH} = \frac{\Delta Q}{C_{FG}} \tag{51.8}$$

Based on this relationship, the amount of charge stored in a stacked-gate memory cell can be monitored through the measured threshold voltage. As shown in Fig. 51.11, the transfer characteristic shifts toward a higher gate bias region as more electrons are stored in the floating gate, for both n- and p-channel Flash memory cells. Thus, device conduction during the read operation reveals the stored information of the stacked-gate device. At a specific gate bias condition for reading, as shown in Fig. 51.11, a memory cell with stored charge and one without lead to different amounts of drain current. In the n-channel Flash cell, electrons stored in the floating gate prevent current flow through the channel at the “READ” bias, whereas the p-channel cell with electrons stored in the floating gate conducts during the read operation. The sense amplifier in the peripheral circuit detects the drain current and provides the stored information for external applications.
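As a rough illustration of Eq. 51.8, the sketch below converts a control-gate threshold shift into a number of stored electrons. The capacitance `C_FG` of 1 fF is a hypothetical value, not one given in the text.

```python
Q_E = 1.602176634e-19  # elementary charge, C


def threshold_shift(delta_q, c_fg):
    """Eq. 51.8: control-gate threshold shift for a change delta_q in stored charge."""
    return delta_q / c_fg


def stored_electrons(delta_v_gth, c_fg):
    """Number of electrons corresponding to a threshold shift delta_v_gth."""
    return delta_v_gth * c_fg / Q_E


C_FG = 1.0e-15  # F, hypothetical floating-gate to control-gate capacitance

n_electrons = stored_electrons(3.0, C_FG)
print(f"A 3 V threshold shift corresponds to about {n_electrons:.0f} electrons")
```

Under this assumption a shift of a few volts corresponds to only some tens of thousands of electrons.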

FIGURE 51.11 The transfer characteristics of n- and p-channel Flash memory cells.

Carrier Transport Schemes

Transport of charge through the oxide layer is the basic mechanism that permits operation of stacked-gate memory devices. It makes possible charging and discharging of the floating gate. To achieve the write/erase operations, the charge must move across the potential barrier built by the insulating layers between the floating gate and the other terminals of the memory device. There are different charge transport mechanisms, and they can be categorized by the charge energy:16

1. Charges with sufficiently high energy can surmount the Si–SiO2 potential barrier, including:
   a. Hot electrons initiated from substrate avalanche
   b. Hot electrons in a junction (initiated from p-n junction avalanche)
   c. Thermally excited electrons (thermionic emissions and Schottky effect)
   d. “Lucky” electrons at the drain side (Auger scattering)
2. Charges with lower energy can cross the barrier by quantum mechanical tunneling effects:
   a. Trap-assisted tunneling through sites located within the barrier
   b. Direct tunneling when the tunneling distance is equal to the thickness of the oxide
   c. Fowler-Nordheim (FN) tunneling

Hot carrier injection and FN tunneling injection are the common charge injection mechanisms in Flash memory cells. In this section, these charge injection mechanisms will be described in more detail.

Channel Hot Electron Injection (CHEI)

Figure 51.12 shows the schematic diagram of CHEI for n- and p-channel MOSFETs. When a high voltage is applied at the drain terminal of an on-state device, electrons moving from the source terminal to the drain side are accelerated by the high lateral channel electric field near the drain terminal. Figure 51.13 shows the plots of simulated electric field along the channel region. Notice that the electric field increases abruptly in the pinch-off region as the location approaches the drain terminal. Under an oxide field that is favorable for attracting electrons, part of the heated electrons gain enough energy to surmount the Si–SiO2 potential barrier and are injected into the gate terminal.

FIGURE 51.12 Schematic illustration of the channel hot carrier effect in (a) n-channel MOSFET, and (b) p-channel MOSFET.

FIGURE 51.13 Simulated electric field along the channel in the n-channel MOSFET.


Figure 51.14 shows the qualitative plot of the gate current characteristic for n-channel MOSFETs. For gate bias in region I, only a very small gate current is observed. In this subthreshold region, the carrier injection mainly originates from avalanche injection, which will be discussed in the next section. In region II, the channel conducts and the channel current increases as the gate bias increases, and thus the gate current induced by CHEI increases. As the gate bias increases further, the gate current peaks at a high gate bias. Beyond the peak, the decrease of the gate current is mainly caused by the decrease of the lateral electric field, as illustrated in region III.

FIGURE 51.14 Schematic gate current behavior in n-channel MOSFET.

On the other hand, the measured gate current characteristic in p-channel MOSFETs is shown in Fig. 51.15. Owing to the large potential barrier and short mean free path, the hot holes generated and accelerated in the channel cannot gain enough energy to surmount the oxide barrier. Thus, electron current initiated by channel hot electrons is still the dominant component of gate current in the p-channel MOSFET.17,18 In addition, the gate current peaks at a lower gate bias in a p-channel MOSFET and has a larger peak value than that in an n-channel MOSFET. In larger gate bias regions, the gate current is dominated by hole injection, which may be caused by the oxide field favoring the injection of the conducting holes into the gate terminal.19

FIGURE 51.15 The gate current behavior of p-channel MOSFET measured from the threshold voltage shift of the stacked-gate structure.


In the 1980s, there were several approaches to describe the channel hot electron injection into the gate terminal. Takeda et al.20 modeled the gate current in n-channel MOSFETs as thermionic emission from the heated electron gas over the Si–SiO2 potential barrier. This thermionic gate current model, referred to as the “effective electron temperature model,” assumes that the heated electrons become an electron gas with a Maxwellian distribution with an effective temperature Te(x). The temperature Te(x) depends on the electric field and the location in the channel. The gate current is given by

$$J_G = q \cdot n_s \cdot \left(\frac{k T_e}{2\pi m^*}\right)^{1/2} \cdot \exp\left(-\frac{\Phi_B}{k \cdot T_e}\right) \cdot \exp\left(-\frac{d}{\lambda}\right) \tag{51.9}$$

where ns is the surface electron density, k is the Boltzmann constant, m* is the effective electron mass, ΦB is the Si–SiO2 potential barrier, d is the distance of the electron from the interface at Te(x), and λ is the mean free path. The last term in Eq. 51.9 accounts for the probability of energy loss due to collisions while the electron moves toward the Si–SiO2 interface. Another gate current model, the lucky electron model, is based on the assumption that an electron is injected into the oxide by obtaining enough energy from the lateral channel electric field without suffering any collision. The lucky electron approach for hot electron injection was originated by Shockley21 and Verway et al.,22 who applied it in the study of substrate hot electron injection in MOSFETs, and was subsequently refined and verified by Ning et al.23 Hu modified the substrate lucky electron injection model and applied it to CHEI in MOSFETs.24 In this model, three probabilities describe the physical mechanism responsible for the CHEI gate current.25 They are (1) the probability of a hot electron gaining enough kinetic energy and normal momentum, (2) the probability of not suffering any inelastic collision during transport to the Si–SiO2 interface, and (3) the probability of not suffering a collision in the oxide image-potential well. Thus, the gate current originating from CHEI is given by

$$I_G = \int_0^L I_D \frac{P_1 \cdot P_2 \cdot P_3}{\lambda_r}\,dx \tag{51.10}$$

where ID is the channel current, L is the channel length, and λr is the redirection scattering mean free path. P1 is the probability that an electron gains an energy equal to the energy barrier under the channel electric field E without suffering optical phonon scattering, and can be expressed as

$$P_1 = \exp\left(-\frac{\Phi_B}{E\lambda}\right) \tag{51.11}$$

where λ is the mean free path for optical phonon scattering. P2 is the probability of not suffering any inelastic collision during transport to the Si–SiO2 interface and can be expressed as

$$P_2 = \frac{\displaystyle\int_{y=0}^{\infty} n(y) \cdot \exp\left(-\frac{y}{\lambda}\right) dy}{\displaystyle\int_{y=0}^{\infty} n(y)\,dy} \tag{51.12}$$

The last probability factor is the scattering in the oxide image-potential well. P3 can be expressed as26:

$$P_3 = \exp\left(-\frac{y_o}{\lambda_{ox}}\right) \tag{51.13}$$
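The three probability factors of Eqs. 51.11 to 51.13 can be evaluated with a short script. This is a sketch under stated assumptions: the mean free paths, the lateral field, the distance y_o, and the exponential carrier profile n(y) = exp(−y/y_c) are all hypothetical choices for illustration, and ΦB is entered in eV so that the elementary charge cancels numerically against a field in V/cm and a length in cm.

```python
import math


def p1(phi_b_ev, e_field, mfp):
    """Eq. 51.11. phi_b_ev in eV, e_field in V/cm, mfp in cm."""
    return math.exp(-phi_b_ev / (e_field * mfp))


def p2(n_of_y, mfp, y_max, steps=20000):
    """Eq. 51.12 by the trapezoidal rule, with both integrals truncated at y_max."""
    dy = y_max / steps
    num = 0.0
    den = 0.0
    for i in range(steps + 1):
        y = i * dy
        weight = 0.5 if i in (0, steps) else 1.0
        n = n_of_y(y)
        num += weight * n * math.exp(-y / mfp) * dy
        den += weight * n * dy
    return num / den


def p3(y_o, mfp_ox):
    """Eq. 51.13. Both lengths in the same unit."""
    return math.exp(-y_o / mfp_ox)


# Hypothetical illustration values (assumptions, not taken from the text).
PHI_B = 3.1    # eV, Si-SiO2 barrier height
MFP = 9.0e-7   # cm, phonon scattering mean free path
Y_C = 2.0e-7   # cm, decay length of the assumed carrier profile n(y)

prob1 = p1(PHI_B, 4.0e5, MFP)  # lateral field of 4e5 V/cm
prob2 = p2(lambda y: math.exp(-y / Y_C), MFP, y_max=40 * MFP)
prob3 = p3(1.0e-7, 3.2e-7)
print(f"P1 = {prob1:.3e}, P2 = {prob2:.3f}, P3 = {prob3:.3f}")
print(f"P1 * P2 * P3 = {prob1 * prob2 * prob3:.3e}")
```

For the assumed exponential profile, P2 has the closed form λ/(λ + y_c), which gives a quick check on the numerical integration. With these numbers P1 dominates the product, which is why the lateral field has such a strong influence on the injection efficiency.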

Ong et al. modified the lucky electron model to analyze the hot electron injection effects in p-channel MOSFETs.27,28 Based on Eq. 51.10 and substituting substrate current (ISUB) for drain current (ID), the gate current in p-channel MOSFETs can be expressed as:

$$I_G = \int_{y=0}^{y=L} I_{SUB} \frac{P_1 \cdot P_2 \cdot P_3}{\lambda_r}\,dy \tag{51.14}$$

Having described the channel hot electron injection mechanisms, we now discuss the charge injection characteristics based on the CHEI scheme. First, consider the output characteristics (ID–VD) of a memory cell. The output characteristic of a stacked-gate device can be regarded as an injection indicator for examining the effects of channel hot electron injection under different device operation conditions and device structures. The output characteristics of the n-channel Flash memory under a high gate bias are shown in Fig. 51.16(a). The drain current rolls off at a lower drain bias as the channel length of the device decreases. This clearly indicates that channel length reduction increases the lateral channel electric field and therefore enhances hot electron injection. Once electron injection starts, the stored electrons retard the conduction of the channel and the device is gradually turned off owing to the continuous electron injection. In contrast, the output characteristics of the p-channel Flash memory, as shown in Fig. 51.16(b), reveal a quite different I–V behavior after electron injection. Owing to the reduction of threshold voltage after electron injection, channel conduction is further enhanced as the drain bias increases.

FIGURE 51.16 (a) The output characteristics of the n-channel Flash memory at high gate bias, and (b) the output characteristics of the p-channel Flash memory at high gate bias.


Second, the programming characteristics of the n- and p-channel Flash memory are demonstrated. Figure 51.17(a) shows the gate bias effects on the CHEI programming characteristics in an n-channel Flash memory cell. The threshold voltage increases as electron injection proceeds and then saturates at different values for different gate biases. On the other hand, Fig. 51.17(b) shows the CHEI programming characteristics in a p-channel Flash memory cell. Compared with the n-channel cell, the programming characteristic in the p-channel Flash cell depends strongly on the gate bias condition. This is mainly because CHEI occurs only within a narrower range of gate bias. The gate current in the p-MOSFET peaks at a lower gate bias and decreases steeply when the gate bias becomes more negative. Therefore, the injected electrons during programming, together with the control gate bias, lead to a more negative floating gate potential, and the programming behavior is quite different at different gate bias conditions.

FIGURE 51.17 (a) The programming characteristics of the n-channel Flash memory using channel hot electron injection scheme; (b) the programming characteristics of the p-channel Flash memory using channel hot electron injection.

Drain Avalanche Hot Carrier (DAHC) Injection

As shown in region I of Fig. 51.14, the gate current is still a function of the gate voltage in n-channel MOSFETs. When VG is smaller than VG*, drain avalanche hot hole (DAHH) injection dominates the carrier injection into the gate. On the other hand, when VG is larger than VG*, drain avalanche hot electron (DAHE) injection dominates the carrier injection into the gate terminal. VG* is the point at which the amounts of injected hot holes and injected hot electrons are in balance. At this gate bias condition, no gate current is observed. Conceptually, the existence of hot hole injection seems questionable because of the high barrier (3.8 eV) for hole injection at the Si–SiO2 interface. However, hot hole gate currents have been experimentally identified and modeled.29,32 Hofmann et al.30 employed the effective electron temperature model20 and the concept of oxide scattering effects25 based on the two-dimensional distribution of electric field, charge carrier, and current density calculated by a computer simulator. The hot hole injection and hot electron injection initiated by avalanche generation were demonstrated qualitatively. Saks et al.32 proposed a modified floating gate technique to characterize these extremely small gate currents. It showed that a small positive gate current exists for gate bias near the threshold voltage. They also suggested that the hole current increases with increasing drain bias and decreasing effective channel length, which is analogous to the dependencies for channel hot electron injection. Comparison of hot hole and hot electron gate current as a function of the effective channel length also suggested that the lateral electric field near the drain plays an important role in the hole injection. In stacked-gate devices, in the DAHH region, holes are injected into the floating gate, which gradually increases the floating gate voltage until it reaches the point VG*. Conversely, in the DAHE region, electrons are injected into the floating gate, which decreases the floating gate voltage until it also reaches the point VG*. Thus, the threshold voltage of the stacked-gate device settles at a specific value after the DAHC injection operation. As shown in Fig. 51.18, the threshold voltage of the Flash cell converges to a specific value after a period of DAHC operation. For a cell with a threshold voltage larger than the converged value, the floating gate voltage is more negative than VG*, so hole injection occurs and decreases the threshold voltage. On the other hand, a cell with a threshold voltage smaller than the converged value has a more positive floating gate potential, so electron injection occurs and increases the threshold voltage.
In the Flash application, the DAHC injection is usually applied to the convergent operation.33 Owing to the process-induced device variations, the electron ejection operation usually causes a wide threshold distribution. Additionally, a trapped hole in the oxide enhances the FN tunneling current and generates the erratic erased cell.34 By employing the DAHC operation, a tighter threshold voltage distribution can be obtained.35

FIGURE 51.18 The convergence characteristics of the n-channel Flash memory cell with DAHC operation.

Band-to-Band Tunneling Induced Hot Carrier Injection (BBHC)

Carrier injection initiated by band-to-band tunneling accompanied by a lateral junction electric field is also an important charge transport mechanism in Flash memory. As shown in Fig. 51.19, the BBHC operation conditions for n- and p-channel devices lead to different charge injection behaviors. For n-channel MOSFETs, the negative gate bias and positive drain bias lead to possible hole injection toward the gate terminal. For p-channel MOSFETs, the operation conditions lead to possible electron injection toward the gate terminal. The initiation of BBHC injection can be divided into two steps. One is the band-to-band tunneling, and the other is the acceleration due to the lateral electric field and injection due to a favorable oxide field.


FIGURE 51.19 The schematic illustration for BBHC injection for: (a) n-channel MOSFET, and (b) p-channel MOSFET.

The band-to-band tunneling phenomenon is usually referred to as gate-induced drain leakage current.36 When a high drain voltage is applied with a grounded gate terminal, a deep depletion region is formed underneath the gate-to-drain overlap region. Electron-hole pairs are generated by the tunneling of valence band electrons into the conduction band and are then collected by the drain and substrate terminals, respectively. Since the minority carriers (holes in the n-MOSFET and electrons in the p-MOSFET) generated by band-to-band tunneling in the drain region flow to the substrate due to the lateral electric field, the deep depletion region is always present and the band-to-band tunneling process proceeds without forming an inversion layer. The band-to-band tunneling characteristic can be estimated by calculating the electric field distribution and the tunneling probability.37,38 Based on the depletion approximation and the assumption of uniform impurity distribution, the electric field E(x) in the depletion region is given by

$$E(x) = \frac{q \cdot N_o}{\varepsilon_{si}} \sqrt{\frac{2 \cdot \varepsilon_{si} \cdot V_{bend}}{q \cdot N_o}} \left(1 - x\sqrt{\frac{q \cdot N_o}{2 \cdot \varepsilon_{si} \cdot V_{bend}}}\right) \tag{51.15}$$

where Vbend is the value of the band bending, No is the impurity density, and x is the coordinate normal to the Si–SiO2 interface. The continuity equation at the Si–SiO2 interface can be expressed as

$$\varepsilon_{si} \cdot E(x = 0) = \varepsilon_{ox} \cdot E_{ox} = \varepsilon_{ox} \frac{V_D - V_{bend}}{T_{ox}} \tag{51.16}$$
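Equations 51.15 and 51.16 together fix the band bending for a given drain bias. Evaluating Eq. 51.15 at x = 0 gives εsi·E(0) = sqrt(2·q·No·εsi·Vbend), so Eq. 51.16 becomes a quadratic in sqrt(Vbend). The sketch below solves it in SI units; the doping, oxide thickness, and drain bias are hypothetical illustration values, and the relative permittivities 11.7 (Si) and 3.9 (SiO2) are the commonly used ones.

```python
import math

Q_E = 1.602176634e-19    # elementary charge, C
EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m
EPS_SI = 11.7 * EPS0     # silicon permittivity
EPS_OX = 3.9 * EPS0      # SiO2 permittivity


def band_bending(v_d, n_o, t_ox):
    """Solve Eqs. 51.15 (at x = 0) and 51.16 for V_bend, grounded gate, SI units."""
    a = math.sqrt(2.0 * Q_E * n_o * EPS_SI)  # eps_si * E(0) = a * sqrt(V_bend)
    c = EPS_OX / t_ox                        # eps_ox * E_ox = c * (V_D - V_bend)
    # c * s^2 + a * s - c * V_D = 0 with s = sqrt(V_bend); take the positive root.
    s = (-a + math.sqrt(a * a + 4.0 * c * c * v_d)) / (2.0 * c)
    return s * s


def surface_field(v_bend, n_o):
    """Eq. 51.15 evaluated at x = 0, in V/m."""
    return math.sqrt(2.0 * Q_E * n_o * v_bend / EPS_SI)


# Hypothetical illustration values (assumptions, not from the text).
V_D = 6.0     # V
N_O = 1.0e24  # m^-3 (1e18 cm^-3)
T_OX = 10e-9  # m

v_bend = band_bending(V_D, N_O, T_OX)
e_si = surface_field(v_bend, N_O)
e_ox = (V_D - v_bend) / T_OX
print(f"V_bend = {v_bend:.2f} V")
print(f"E_si(0) = {e_si / 1e8:.2f} MV/cm, E_ox = {e_ox / 1e8:.2f} MV/cm")
```

The resulting surface field is the quantity that enters the tunneling expression in the following paragraph.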

The tunneling characteristics are usually approximated by the relationship derived from the reverse biased p-n junction tunnel diode:39

$$J = B_1 \cdot E^2 \exp\left(-\frac{B_2}{E}\right) \tag{51.17}$$

where B1 and B2 are physical constants. Most of the generated minority carriers are drained away through the substrate terminal. However, owing to the sufficient lateral electric field across the depletion region, these hot carriers may undergo Auger scattering and generate another electron-hole pair.40 When the drain bias is higher than the Si–SiO2 barrier, the top barrier position seen by the cold generated minority carriers is lower at the depletion edge in the channel. Thus, the injection probability of the minority carrier becomes much higher. The probability of the generated minority carrier injection is given by41

$$P_{inject} = \int \exp\left(-\frac{d(V)}{\lambda}\right) dW(V) \approx \left(\frac{2V_D}{\Phi_B} - 1\right) \cdot \exp\left(-\frac{\Phi_B}{q \cdot E_m \cdot \lambda}\right) \tag{51.18}$$

Thus, the injected current accompanied with Eq. 51.17 and oxide scattering factor P expressed in Eq. 51.13 can be given by

$$J_{inject} = P \cdot P_{inject} \cdot J \tag{51.19}$$

In the n-channel MOSFET, the BBHC injection process leads to a significant amount of hot hole injection.42,43 This situation is mostly encountered in the electron ejection operation of a Flash memory device with “edge” Fowler-Nordheim tunneling. Hole injection into the gate terminal results not only in deviation of the memory state, but also in severe long-term device instability. In contrast, the BBHC injection process leads to electron injection in the p-channel MOSFET and has been employed in the programming scheme for the p-channel Flash memory cell.10,11 Figure 51.20(a) shows the BBHE characteristics of the p-channel MOSFET. The drain and gate currents increase monotonically with the gate bias because of the increase of the band-to-band tunneling efficiency and the more favorable oxide field for electron injection. Because the device operates in the off state, the electron injection efficiency of the BBHE scheme is much larger than that of the CHEI operation. The BBHE injection reveals a rather high injection efficiency (IG/ID) of up to 10⁻², which provides quite efficient programming for the p-channel Flash cell.10 Figure 51.20(b) shows the programming characteristics based on the BBHE injection mechanism. The programming time is greatly shortened as the control gate voltage increases. Compared with the CHEI scheme shown in Fig. 51.17(b), the BBHE approach indeed reveals a faster programming speed.

Fowler-Nordheim (FN) Tunneling

The FN tunneling formula proposed by Fowler and Nordheim in 1928 can be described as

$$J_{tunnel} = C_o \cdot E^2 \cdot \exp\left(-\frac{4\sqrt{2m^* \cdot \Phi_B^3}}{3 \cdot q \cdot \hbar \cdot E}\right) \tag{51.20}$$

where Jtunnel and E are the tunneling current density and the electric field across the oxide layer, respectively. Co is a material-dependent constant and m* is the carrier effective mass. The tunneling theory is developed using the semi-classical independent electron model. For a carrier with energy qUo, the general expression for the transmission coefficient Tc through an energy barrier depends on the barrier shape U(x), as shown in Fig. 51.21. The value of Tc is derived using the WKB (Wentzel-Kramers-Brillouin) approximation.44,46


FIGURE 51.20 (a) The BBHE behavior in p-channel MOSFET with different bias conditions; and (b) the programming characteristics in p-channel Flash memory cell with BBHE injection scheme.

FIGURE 51.21 Schematic diagram of the potential barrier in the polysilicon-oxide-silicon system under applied high voltage.

$$\ln T_c = -\frac{\sqrt{8 \cdot m^* \cdot q}}{\hbar} \cdot \int_0^{X_{tunnel}} \sqrt{U(x) - U_o}\,dx \tag{51.21}$$

The tunneling current is obtained by integrating the product of the density of states Nc(W) and the transmission coefficient from lowest occupied energy WG to infinity,

$$J_{tunnel} = \int_{W_G}^{\infty} N_c(W)\,T_c(W)\,dW \tag{51.22}$$

This expression is valid for any barrier shape. Under a strong oxide field E, the effective barrier is triangular and the coefficient can be obtained by integrating,

$$U(x) = \phi_B - E \cdot x \tag{51.23}$$

$$\ln T_c = -\frac{4\sqrt{2 \cdot m^* \cdot \Phi_B^3}}{3 \cdot \hbar \cdot q \cdot E} \tag{51.24}$$

where ΦB is the barrier height, ΦB = qφB. Solving Eqs. 51.22 and 51.24 with the assumption that only electrons at the Fermi level contribute to the current yields the Fowler-Nordheim formula for the tunneling current density Jtunnel at high electric field:

$$J_{tunnel} = \frac{q^3 \cdot E^2}{16 \cdot \pi^2 \cdot \hbar \cdot \Phi_B} \cdot \exp\left(-\frac{4\sqrt{2 \cdot m^* \cdot \Phi_B^3}}{3 \cdot \hbar \cdot q \cdot E}\right) \tag{51.25}$$

This equation can also be expressed as

$$J_{tunnel} = \alpha \cdot E^2 \exp\left(-\frac{\beta}{E}\right) \tag{51.26}$$

where α and β are the Fowler-Nordheim constants. The value of α is in the range of 4.7 × 10⁻⁵ to 6.32 × 10⁻⁷ A/V² and β is in the range of 2.2 × 10⁸ to 3.2 × 10⁸ V/cm.47 The barrier height and tunneling distance determine the tunneling efficiency. Generally, the barrier height at the Si–SiO2 interface is about 3.1 eV, which is material dependent. This parameter is determined by the electron affinity and work function of the gate material. On the other hand, the tunneling distance depends on the oxide thickness and the voltage drop across the oxide. As indicated in Eq. 51.26, the tunneling current depends exponentially on the oxide field. Thus, a small variation in the oxide thickness or voltage drop leads to a significant change in tunneling current. Figure 51.22 shows the Fowler-Nordheim plot, from which the Fowler-Nordheim constants α and β can be extracted. The Si–SiO2 barrier height can be determined from this F-N plot by quantum-mechanical (QM) modeling.48
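The sensitivity of Eq. 51.26 to oxide thickness can be illustrated with a short calculation. In this sketch, α and β are obtained by matching Eq. 51.25 to Eq. 51.26, using the 3.1 eV barrier quoted above and an assumed effective mass of 0.42 m0; the effective mass, the 10 V oxide voltage, and the two thicknesses are illustration values rather than data from the text. The last function mimics a Fowler-Nordheim plot, in which ln(J/E²) plotted against 1/E is a straight line of slope −β.

```python
import math

Q_E = 1.602176634e-19   # elementary charge, C
HBAR = 1.054571817e-34  # reduced Planck constant, J*s
M_0 = 9.1093837015e-31  # free electron mass, kg


def fn_constants(phi_b_ev, m_ratio):
    """alpha [A/V^2] and beta [V/cm] obtained by matching Eq. 51.25 to Eq. 51.26."""
    phi_b = phi_b_ev * Q_E  # barrier height in joules
    alpha = Q_E ** 3 / (16.0 * math.pi ** 2 * HBAR * phi_b)
    beta_si = 4.0 * math.sqrt(2.0 * m_ratio * M_0 * phi_b ** 3) / (3.0 * HBAR * Q_E)
    return alpha, beta_si / 100.0  # convert beta from V/m to V/cm


def fn_current_density(e_ox, alpha, beta):
    """Eq. 51.26: J in A/cm^2 for an oxide field e_ox in V/cm."""
    return alpha * e_ox ** 2 * math.exp(-beta / e_ox)


def fn_plot_slope(e_1, e_2, alpha, beta):
    """Slope of ln(J/E^2) versus 1/E between two fields; equals -beta."""
    y_1 = math.log(fn_current_density(e_1, alpha, beta) / e_1 ** 2)
    y_2 = math.log(fn_current_density(e_2, alpha, beta) / e_2 ** 2)
    return (y_2 - y_1) / (1.0 / e_2 - 1.0 / e_1)


ALPHA, BETA = fn_constants(3.1, 0.42)  # 0.42 m0 is an assumed effective mass

j_10nm = fn_current_density(10.0 / 10e-7, ALPHA, BETA)  # 10 V across 10 nm
j_9nm = fn_current_density(10.0 / 9e-7, ALPHA, BETA)    # 10 V across 9 nm
print(f"beta = {BETA:.3e} V/cm")
print(f"J(9 nm) / J(10 nm) = {j_9nm / j_10nm:.1f}")
```

Under these assumptions a 10% reduction in oxide thickness raises the tunneling current by more than an order of magnitude, which is the sensitivity described above.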

Comparisons of Electron Injection Operations

As mentioned in the above section, several operation schemes can be employed for electron injection, whereas only FN tunneling can be employed for ejecting electrons out of the floating gate. Owing to the specific features of each electron injection mechanism, the choice of an electron injection scheme determines the device structure design, process technology, and circuit design. The main features of CHEI and FN tunneling for the n-channel Flash memory cell, and of CHEI and BBHE injection for the p-channel Flash memory cell, are compared in Tables 51.1 and 51.2.


FIGURE 51.22 Fowler-Nordheim plot of the thin oxide.

TABLE 51.1 Comparisons of Fowler-Nordheim Tunneling and Channel Hot Electron Injection as Programming Scheme for Stacked-Gate Devices

FN Tunneling Injection Scheme              CHEI Scheme
Low power consumption                      High power consumption
  • Single external power supply             • Complicated circuitry technique
High oxide field                           Low oxide field
  • Thinner oxide thickness required         • Oxide can be thicker
  • Higher trap generation rate              • Higher oxide integrity
  • More severe read disturbance issue       • Low read disturbance issue
  • Highly technological problem
Slower programming speed                   Faster programming speed

TABLE 51.2 Comparisons of Band-to-Band Tunneling Induced Hot Electron Injection and Channel Hot Electron Injection as Programming Scheme for Stacked-Gate Devices

                              BBHE Injection Scheme    CHEI Scheme
Power consumption             Lower                    Higher
Injection efficiency          Higher                   Lower
Programming speed             Faster                   Slower
Electron injection window     Wider                    Narrower
Oxide field                   Higher                   Lower

List of Operation Modes

The use of different electron transport mechanisms for the programming and erase operations leads to different device operation modes. Typically, in commercial applications, there are three different operation modes for n-channel Flash cells and two different operation modes for p-channel Flash cells. In the n-channel cell, as shown in Fig. 51.23, the write/erase operation modes include: (1) programming operation with CHEI and erase operation with FN tunneling ejection at source or drain side,6–8,49–61 as shown in Fig. 51.23(a), usually referred to as NOR-type operation mode; (2) programming operation with FN tunneling ejection at drain side and erase operation with FN tunneling injection through channel region,62–70 as shown in Fig. 51.23(b), usually referred to as AND-type operation mode; and (3) programming and erase operations with FN tunneling injection/ejection through channel region,71–78 usually referred to as NAND-type operation mode. As to the p-channel cell, as shown in Fig. 51.24, the write/erase operation modes include: (1) programming operation with CHEI at drain side and erase operation with FN tunneling ejection through channel region,9 as shown in Fig. 51.24(a); and (2) programming operation with BBHE at drain side and erase operation with FN tunneling injection through channel region,10,11 as shown in Fig. 51.24(b).


FIGURE 51.23 Different n-channel Flash write/erase operations: (a) programming operation with CHEI at drain side and erase operation with FN tunneling ejection at source side; (b) programming operation with FN tunneling ejection at drain side and erase operation with FN tunneling injection through channel region; and (c) programming and erase operations with FN tunneling injection/ejection through channel region.

These operation modes lead not only to different device structures but also to different memory array architectures. Device structures are varied across operation modes mainly for reasons of operation efficiency, reliability requirements, and fabrication procedures. In addition, the operation modes and device structures determine, and are also determined by, the memory array architectures. In the following sections, the general improvements of the Flash device structures and the array architectures for specific operation modes are described.

51.5 Variations of Device Structure

CHEI Enhancement

As mentioned above, alternative operation modes have been proposed to serve a range of purposes and features, based on either CHEI or FN tunneling injection. Moreover, over 90% of the Flash memory products ever shipped are reported to be CHEI-based Flash memory devices.79 Driven by competition among the major manufacturers, many innovations and efforts have been dedicated to improving the performance and reliability of CHEI schemes.50,53,56,57,61,80–83 As described in Eq. 51.11, an increase in the


FIGURE 51.24 Different p-channel Flash write/erase operations: (a) programming operation with CHEI at drain side and erase operation with FN tunneling ejection through channel region; and (b) programming operation with BBHE at drain side and erase operation with FN tunneling injection through channel region.

electric field can enhance the probability of the electrons gaining enough energy. Therefore, the major approach to improving the channel hot electron injection efficiency is to enhance the electric field near the drain side. One structure modification uses a large-angle implanted p-pocket (LAP) around the drain to improve the programming speed.56,57,60,83 LAP has also been used to enhance punch-through immunity for scaling-down capability.50,53 As demonstrated in Fig. 51.13, the maximum electric field in the device with LAP is twice that in the device without the LAP structure. In addition, according to our previous report,83 the LAP cell with proper process design can satisfy cell performance requirements such as read current and punch-through resistance, as well as reliable long-term charge retention. Furthermore, p-pocket implantation can achieve low-voltage operation and scaling-down capability simultaneously.

FN Tunneling Enhancement

From the standpoint of power consumption, programming/erase operation based on the FN tunneling mechanism is the inevitable choice because of its low current during operation. As the dimensions of Flash memory continue to scale down, a thinner tunnel oxide is needed in order to lower the operation voltage. However, it is difficult to scale down the oxide thickness further due to reliability concerns. There are two ways to overcome this issue. One method is to raise the tunneling efficiency by employing an electron injector layer on top of the tunnel oxide. Another method is to improve the gate coupling ratio of the memory cell without changing the properties of the insulator between the floating gate and well. The electron injectors on top of the tunnel oxide enhance the electric field locally, and thus the tunneling efficiency is improved. Therefore, the onset of tunneling takes place at a lower operation voltage. Two materials are used as electron injectors: a polyoxide layer84 and a silicon-rich oxide (SRO) layer.85 The surface roughness of the polyoxide is the main feature that makes it an electron injector. However, owing to the properties of the polyoxide, electron trapping during write/erase operation limits its application in Flash memory cells. On the other hand, the oxide layer containing excess silicon exhibits lower charge trapping and larger charge-to-breakdown characteristics. These silicon components


in the SRO layer form tiny silicon islands. The high tunneling efficiency is caused by the electric field enhancement of these silicon islands. Lin et al.47 reported that the Flash cell with an SRO layer can achieve a write/erase capability of up to 10⁶ cycles. However, the charge retention of a Flash memory cell with electron injector layers would be poorer than that of the conventional memory cell, because the field enhancement of the SRO layer also aggravates charge loss. Thus, the stacked-gate device with an SRO layer was also proposed as a volatile memory cell that can feature a longer refresh time than the conventional DRAM cell.86

Improvement of Gate Coupling Ratio

Another way to reduce the operation voltage is to increase the gate coupling ratio of the memory cell. As described in Section 51.4, the floating gate potential can be increased with an increased gate coupling ratio, through an enlarged inter-polysilicon capacitance. To obtain a large interpoly capacitance, it is necessary to reduce the interpoly dielectric thickness or increase the interpoly capacitor area. However, a reduced interpoly dielectric thickness would lead to charge loss during long-term operation. Therefore, a structure modification that increases the interpoly capacitance without increasing the effective cell size is necessary. It was proposed to put an extended floating gate layer over the bit-line region by employing two steps of polysilicon layer deposition.68,87 Such a device structure, with memory array modifications, would achieve a smaller effective cell size and a high coupling ratio (up to 0.8). Shirai et al.88 proposed a process modification to increase the effective area on the top surface of the floating gate layer. This modified process, which forms a hemispherical-grained (HSG) polysilicon layer, can achieve a high capacitive coupling ratio (up to 0.8). However, charge retention would be a major concern, since the material may also act as an electron injector.
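To make the coupling-ratio argument concrete, the following minimal sketch uses the common capacitive-divider picture of a stacked-gate cell. It assumes an uncharged floating gate with all terminals other than the control gate grounded; the function names and the numerical values are illustrative assumptions, not taken from Section 51.4.

```python
def gate_coupling_ratio(c_interpoly, c_other):
    """Fraction of the control-gate voltage that appears on the floating gate.

    c_interpoly: interpoly (control gate to floating gate) capacitance.
    c_other: sum of all other floating-gate capacitances (tunnel oxide,
             source/drain overlap), in the same units. Assumed model.
    """
    return c_interpoly / (c_interpoly + c_other)


def required_control_gate_voltage(v_floating_gate, coupling_ratio):
    """Control-gate voltage needed to reach a given floating-gate potential,
    assuming no stored charge and all other terminals at 0 V."""
    return v_floating_gate / coupling_ratio


if __name__ == "__main__":
    # Hypothetical target: 10 V on the floating gate.
    for ratio in (0.5, 0.65, 0.8):
        v_cg = required_control_gate_voltage(10.0, ratio)
        print(f"coupling ratio {ratio:.2f} -> control gate {v_cg:.1f} V")
```

Under these assumptions, raising the coupling ratio from 0.5 to 0.8 lowers the required control-gate voltage from 20 V to 12.5 V for the same floating-gate potential, which is the motivation for enlarging the interpoly capacitance.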

51.6 Flash Memory Array Structures

NOR Type Array

In general, most Flash memory arrays are NOR-type arrays, as shown in Fig. 51.25(a).49–61 In this array structure, two neighboring memory cells share a bit-line contact and a common source line. Therefore, each unit memory cell occupies half of a drain contact and half of the source line width. Since the memory cell is connected to the bit-line directly, the NOR-type array features random access and lower series resistance. The NOR-type array can be operated with a larger read current and thus a faster read operation speed. However, the drawback of the NOR-type array is the large area per unit cell. In order to maintain the advantages of the NOR-type array while reducing the cell size, several efforts have been made to improve the array architecture. The major improvement in the NOR-type array is the elimination of bit-line contacts through a buried bit-line configuration.52 This concept evolved from the contactless EPROM proposed by Texas Instruments Inc. in 1986.89 By using this contactless bit-line concept, the memory cell size is reduced by 34%.

AND Type Families

Another modification of the NOR-type array, accompanied by a different operation mode, is the AND-type array. In the NOR-type array, CHEI is used as the electron injection scheme. In the AND-type families, however, both the programming and erase operations utilize FN tunneling, which addresses the concerns of power consumption and of the series resistance contributed by the buried bit-line/source. Several improvements and modifications based on the NOR-type array have been proposed, including DIvided-bitline NOR (DINOR) proposed by Mitsubishi Corp.,65,68 Contactless NOR (AND) proposed by Hitachi Corp.,64,66 Asymmetrical Contactless Transistor (ACT) cell by Sharp Corp.,69 and Dual String NOR (DuSNOR) by Samsung Corp.70 and Macronix, Inc.67 The DINOR architecture employs a main bit-line and sub-bit-line configuration to reduce the disturbance issue during FN programming. The AND and DuSNOR structures


FIGURE 51.25 (a) Schematic top view and cross-section of the NOR-type Flash memory array; and (b) schematic top view and cross-section of the NAND-type Flash memory array.

consist of strings of memory cells with n⁺ buried source lines and bit-lines. String-select and ground-select transistors are attached to the bit-line and source line, respectively. In the DuSNOR structure, a smaller cell size can be realized because every two adjacent cell strings share a source line. Although a smaller cell size can be obtained by utilizing the buried bit-line and source line, the resistance of the buried diffusion line degrades the read performance. Read operation considerations are therefore the dominant factor in determining the size of a memory string in the AND and DuSNOR structures.

NAND Type Array

In order to realize a smaller Flash memory cell, the NAND structure was proposed in 1987.90 As shown in Fig. 51.25(b), the memory cells are arranged in series. It was reported that the cell size of the NAND structure is only 44% of that in the NOR-type array under the same design rules. The operation mechanisms of a single memory cell in the NAND architecture are the same as in the NOR and AND architectures. However, the programming and read operations are more complex. In addition, the read operation is slower than in the NOR-type structure because a number of memory cells are connected in series. Originally, the NAND structure was operated with CHEI programming and FN tunneling through the channel region.90 Later on, edge FN ejection at the drain side was employed.62,63 However, owing to reliability concerns, operations utilizing the bi-polarity write/erase scheme were then proposed to reduce the oxide damage.71–78 Because the memory cells in the NAND structure are written and erased by FN tunneling, the booster plate technology for the NAND structure was proposed by Samsung Corp.77 to improve the FN operation efficiency and reduce the operation voltage.


51.7 Evolution of Flash Memory Technology

Table 51.3 lists the development of device structures, process technology, and array architectures for Flash memory by date. The rapid development of Flash memory devices points to a promising future.

TABLE 51.3 The Development of the Flash Memory

Year | Technology | Affiliation | Ref.
1984 | Flash memory (2 µm, 64 µm²) | Toshiba (Japan) | 6
1985 | Source-side erase type Flash (1.5 µm, 25 µm², 512 Kb) | EXCL (USA) | 7
1986 | Source-side injection (SI-EPROM) | UC Berkeley (USA) | 49
1987 | Drain-erase type Flash, split gate device (128 Kb) | Seeq, UC Berkeley (USA) | 8
1987 | NAND structure EEPROM (1 µm, 6.43 µm², 512 Kb) | Toshiba (Japan) | 90
1987 | Source-side erase Flash (0.8 µm, 9.3 µm²) | Hitachi (Japan) | 50
1988 | ETOX-type Flash (1.5 µm, 36 µm², 256 Kb) | Intel (USA) | 91
1988 | NAND EEPROM (1 µm, 9.3 µm², 4 Mb) | Toshiba (Japan) | 62
1988 | NAND EEPROM (1 µm, 12.9 µm², 4 Mb) | Toshiba (Japan) | 63
1988 | Poly-poly erase Flash (1.2 µm, 18 µm²) | WSI (USA) | 92
1988 | Contactless Flash (1.5 µm, 40.5 µm²) | TI (USA) | 93
1989 | Negative gate erase | AMD (USA) | 94
1989 | ETOX-type Flash (1 µm, 15.2 µm², 1 Mb) | Intel (USA) | 95
1989 | Sidewall Flash (1 µm, 14 µm²) | Toshiba (Japan) | 51
1989 | Punch-through-erase | Toshiba (Japan) | 96
1990 | Well-erase, bi-polarity W/E operation | Toshiba (Japan) | 71, 72
1990 | NAND, new self-aligned patterning (0.6 µm, 2.3 µm²) | Toshiba (Japan) | 97
1990 | Contactless Flash, ACEE (0.8 µm, 8.6 µm², 4 Mb) | TI (USA) | 98
1990 | FACE cell (0.8 µm, 4.48 µm²) | Intel (USA) | 52
1990 | Negative gate erase (0.6 µm, 3.6 µm², 16 Mb) | Mitsubishi (Japan) | 54
1990 | Tunnel diode-based contactless Flash | TI (USA) | 99
1990 | p-Pocket EPROM cell (0.6 µm, 16 Mb) | Toshiba (Japan) | 53
1991 | SAS process | Intel (USA) | 100
1991 | PB-FACE cell (0.8 µm, 4.16 µm²) | Intel (USA) | 101
1991 | Burst-pulse erase (0.6 µm, 3.6 µm²) | NEC (Japan) | 56
1991 | SSW-DSA cell (0.4 µm, 1.5 µm², 64 Mb) | NEC (Japan) | 57
1991 | Sector erase (0.6 µm, 3.42 µm², 16 Mb) | Hitachi (Japan) | 64
1991 | Self-convergence erase | Toshiba (Japan) | 33, 35
1991 | Virtual ground, auxiliary gate (0.5 µm, 2.59 µm²) | Sharp (Japan) | 59
1992 | AND cell (0.4 µm, 1.28 µm², 64 Mb) | Hitachi (Japan) | 66
1992 | DINOR array (0.5 µm, 2.88 µm², 16 Mb) | Mitsubishi (Japan) | 65
1992 | 2-Step erase method | NEC (Japan) | 102
1992 | Buried source side injection | TI (USA) | 60
1992 | p-Channel Flash cell with SRO layer | IBM (USA) | 9
1993 | HiCR cell (0.4 µm, 1.5 µm², 64 Mb) | NEC (Japan) | 87
1993 | 3-D sidewall Flash | Philip, Stanford (USA) | 103
1993 | Asymmetrical offset S/D DINOR (0.5 µm, 1.0 µm²) | Mitsubishi (Japan) | 68
1993 | NAND EEPROM (0.4 µm, 1.13 µm², 64 Mb) | Toshiba (Japan) | 74
1994 | Self-convergent method | Motorola (USA) | 104
1994 | Substrate hot electron (SHE) erase | Mitsubishi (Japan) | 105
1994 | Dual-bit Split-Gate (DSG) cell (multi-level cell) | Hyundai (Korea) | 106
1994 | SA-STI NAND EEPROM (0.35 µm, 0.67 µm², 256 Mb) | Toshiba (Japan) | 75
1994 | SST cell | SST (USA) | 124
1994 | AND cell (0.25 µm, 0.4 µm², 256 Mb) | Hitachi (Japan) | 107
1995 | Multi-level NAND EEPROM | Toshiba (Japan) | 108
1995 | Convergence erase scheme | UT, AMD (USA) | 109
1995 | DuSNOR array (0.5 µm, 1.6 µm²) | Samsung (Korea) | 70
1995 | CISEI programming scheme | AT&T, Lucent (USA) | 110
1995 | SAHF cell (0.3 µm, 0.54 µm², 256 Mb) | NEC (Japan) | 88
1995 | P-Flash with BBHE scheme (0.4 µm) | Mitsubishi (Japan) | 10
1995 | ACT cell (0.3 µm, 0.39 µm²) | Sharp (Japan) | 69
1995 | Multi-level with self-convergence scheme | National (USA) | 111
1995 | Multi-level SWATT NAND cell (0.35 µm, 0.67 µm²) | Toshiba (Japan) | 112
1995 | SCIHE injection scheme | AMD (USA) | 113
1995 | Alternating word-line voltage pulse | NKK (Japan) | 114
1996 | Self-limiting programming p-Flash | Mitsubishi (Japan) | 11
1996 | High speed NAND (HS-NAND) (2 µm², 16 Mb) | Samsung (Korea) | 76
1996 | Booster plate NAND (0.5 µm, 32 Mb) | Samsung (Korea) | 77
1996 | Shared bitline NAND (256 Mb) | Samsung (Korea) | 115
1997 | Φ-Cell | SGS-Thomson (France) | 116
1997 | NAND with STI (256 Mb) | Toshiba (Japan) | 117
1997 | Shallow groove isolation (SGI) | Hitachi (Japan) | 118
1997 | Word-line self-boosting NAND | Samsung (Korea) | 119
1997 | SPIN cell | Motorola (USA) | 120
1997 | Booster line technology for NAND | Samsung (Korea) | 121
1997 | AMG array | WSI (USA) | 122
1997 | High k interpoly dielectric | Lucent (USA) | 123
1997 | Self-convergent operation for p-Flash | NTHU (ROC) | 12

51.8 Flash Memory System Applications and Configurations

Flash memory is a single-transistor memory with a floating gate for storing charge. Since 1985, mass-produced Flash memory has taken a share of the non-volatile memory market. The advantages of high density and electrically erasable operation make Flash memory indispensable in programmable systems, such as network hubs, modems, PC BIOS, microprocessor-based systems, etc. Recently, image cameras and voice recorders have adopted Flash memory as the storage medium. These applications require battery operation and cannot afford large power consumption. Flash memory, a true non-volatile memory, is very suitable for these portable applications because stand-by power is not necessary. For portable systems, the specification requirements of Flash memory include some special features that other memories (e.g., DRAM, SRAM) do not have, for example: multiple internal voltages with a single external power supply, power-down during stand-by, direct execution, simultaneous erase of multiple blocks, simultaneous re-program/erase of different blocks, precise regulation of internal voltage, and embedded program/erase algorithms to control the threshold voltage. Since 1995, an emerging need for Flash memory has been to increase the density by doubling the number of bits per cell. The charge stored in the floating gate is controlled precisely to provide multi-level threshold voltages. The information stored in each cell can be 00, 01, 10, or 11. Using multi-level storage can decrease the cost per bit tremendously. Multi-level Flash memories have two additional requirements: (1) fast sensing of multi-level information, and (2) high-speed multi-level programming. Since the memory cell characteristics degrade after cycling, which leads to fluctuation of the programmed states, fast sensing and fast programming are challenged by the variation of threshold voltage in each level.
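The two-bits-per-cell idea can be sketched as a lookup of the threshold voltage against three read references. The reference voltages and the assignment of codes to levels below are hypothetical choices made only for illustration; the text does not specify them.

```python
import bisect

# Hypothetical read reference voltages (V) separating four threshold levels.
REFERENCES = [2.0, 3.5, 5.0]
# Assumed code assignment: lowest threshold level -> "11", highest -> "00".
CODES = ["11", "10", "01", "00"]


def decode_cell(vt):
    """Return the 2-bit code for a cell with threshold voltage vt."""
    level = bisect.bisect_right(REFERENCES, vt)  # number of references <= vt
    return CODES[level]


def sensing_margin(vt):
    """Distance (V) from vt to the nearest reference; small margins are the
    cells most exposed to the threshold-voltage variation discussed above."""
    return min(abs(vt - ref) for ref in REFERENCES)


if __name__ == "__main__":
    for vt in (1.0, 2.7, 4.2, 6.0):
        print(vt, decode_cell(vt), round(sensing_margin(vt), 2))
```

With four levels in the same voltage window, the margin available to each level is much narrower than in a single-bit cell, which is why the spread of the programmed threshold voltage becomes the limiting factor.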
Another development is analog storage in Flash memory, which is suitable for image storage and voice recording. The threshold voltage can be varied continuously between the maximum and minimum values to meet the analog requirements. Analog storage is suitable for recording information that can tolerate distortion between the stored information and the restored information (e.g., image and speech data).

Before exploring the system design of Flash memory, the major differences between Flash memory and other digital memories, such as SRAM and DRAM, should be clarified. First, multiple sets of voltages are required in Flash memory for programming, erase, and read operations. The high-voltage circuitry is a unique feature that sets it apart from other memories (e.g., DRAM, SRAM). Second, the characteristics of the Flash memory cell degrade because of the stress of programming and erasing. Accurate control of the threshold voltage by an internal finite state machine is a special function that Flash memory must have. In addition to the features mentioned, address decoding, sense amplifiers, and I/O drivers are all required in Flash memory. A Flash memory system can therefore be regarded as a simplified mixed-signal product that employs both digital and analog design concepts.

Figure 51.26 shows the block diagram of Flash memory. The word-line driver, bit-line driver, and source-line driver control the memory array. The word-line driver is high-voltage circuitry, which includes a logic X-decoder and level shifter. The interface between the bit-line driver and the memory array is the Y-gating. Along the bit-line direction, the sense amplifier and data input/output buffer are in charge of reading and temporary storage of data. The high-voltage parts include charge-pumping and voltage regulation circuitry. The generated high voltage is used to carry out programming and erasing operations. Behind the X-decoder, the address buffer latches the address. Finally, a finite state machine, which executes the operation code, dictates the operations of the system. The heart of the finite state machine is the clocking circuit, which also feeds the clock to a two-phase generator for the charge-pumping circuits. In the following sections, the functions of each block will be discussed in detail.

FIGURE 51.26 The block diagram of Flash memory system.

Finite State Machine

A finite state machine (FSM) is a control unit that processes commands and operation algorithms. Figure 51.27(a) demonstrates an example of an FSM. Figure 51.27(b) shows the details of an FSM. The command logic unit is an AND-OR-based logic unit that generates the next state codes, while the state register latches the current state. The current state depends on the previous state and the input state. State transitions follow the designated state diagram or state table that describes the functionality, and the state codes are translated into the control signals required by other circuits in the memory. The trend in Flash memory development is toward simultaneous program, erase, and read in different blocks. The global FSM takes charge of command distribution, address transition detection (ATD), and data input/output. The address, command, and data are queued when the selected FSM is busy. The local FSM deals with


FIGURE 51.27 (a) The hierarchical architecture of a finite state machine; and (b) the block diagram of a finite state machine.

operations, including read, program, and erase, within the local block. The local FSM is activated and completes an operation independently when a command is issued. The global FSM distributes tasks among the local FSMs according to the address. The hierarchical local and global FSMs can provide parallel processing; for instance, one block can be programmed while another block is being erased. This simultaneous read/write feature reduces the system overhead and speeds up the Flash memory. One example of the algorithm used in the FSM is shown in Fig. 51.28. The global FSM loads the operation code (OP code) first; then the address transition detection (ATD) enables latching of the address when a different but valid address is observed. The status of the selected block is checked to determine whether the command can be executed right away; if not, the command, address, and/or data input are stored in the queues. The queue is read when the local FSM is ready to execute the next command. The operation code and address are decoded. Sense amplifiers are activated if a read command is issued. Charge-pumping circuits are activated if a write command is issued. After all preparations are made, the process routine begins, which will be explained later. Following the completion of the process routine, the FSM checks its queues. If any command is queued for delayed operation, the local FSM reads the queued data and continues the described procedures. Since these operations are invisible to the external systems, the system overhead is reduced. The process routine is shown in Fig. 51.29. The read procedure waits for the completion signal of the sense amplifier, and then the valid data is sent immediately. The programming and erasing operations require a verification procedure to ascertain completion of the operation. The program-verification and erase-verification iterations proceed to fine-tune the threshold voltage. However, if the verification time exceeds the predetermined value, the block is identified as a failure block, and further operation on this block is inhibited. Since the FSM controls the operations of the whole chip, a good FSM design can improve the operational speed.
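The program-verify iteration and the failure-block rule can be sketched as follows. This is a behavioral illustration only: the cell model (a fixed threshold shift per pulse scaled by a hypothetical "efficiency" factor), the pulse budget, and all names are assumptions, and Fig. 51.29 may differ in detail.

```python
def program_with_verify(cell, target_vt, max_pulses=10, step=0.3):
    """Apply programming pulses until the verify read passes or the
    pulse budget (the 'predetermined value') is exhausted."""
    for pulse in range(1, max_pulses + 1):
        # One programming pulse: assumed fixed Vt shift per pulse.
        cell["vt"] += step * cell.get("efficiency", 1.0)
        # Verify read: compare the cell against the target threshold.
        if cell["vt"] >= target_vt:
            return {"ok": True, "pulses": pulse}
    return {"ok": False, "pulses": max_pulses}


def program_block(block_status, block_id, cells, target_vt):
    """Program every cell of a block; mark the block failed on timeout."""
    if block_status.get(block_id) == "failed":
        return False  # further operation on a failure block is inhibited
    for cell in cells:
        if not program_with_verify(cell, target_vt)["ok"]:
            block_status[block_id] = "failed"
            return False
    return True


if __name__ == "__main__":
    status = {}
    good = [{"vt": 1.0}, {"vt": 1.2}]
    weak = [{"vt": 1.0, "efficiency": 0.0}]  # cell that no longer programs
    print(program_block(status, 0, good, 2.0))  # True
    print(program_block(status, 1, weak, 2.0))  # False, block 1 marked failed
    print(status)
```

The loop trades programming time for threshold accuracy: smaller steps give a tighter distribution but need more verify cycles, which is one place where the FSM design affects operational speed.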

Level Shifter

The level shifter is an interface between low-voltage and high-voltage circuits. Flash memory requires high voltage on the word-line and bit-line during programming and erasing operations. A high voltage applied for a short time is regarded as a pulse. Figure 51.30 shows an example of a level shifter. The input signal is a pulse at the Vcc/ground level, which controls the duration of a high-voltage pulse. The supply


FIGURE 51.28 The algorithm of a finite state machine for simultaneous read-write feature.

of the level shifter determines the output voltage level of the high-voltage pulse. The level shifter is a positive feedback circuit, which is stable at the ground level and at the supply voltage level (the high voltage is generated by the charge-pumping circuits). The operation of the level shifter can be understood as follows. The low-voltage input can only turn off the NMOS transistor but cannot turn off the PMOS parts. On the other hand, the high voltage can only turn off the PMOS transistor. Therefore, generating two mutually inverted signals can turn off the individual loading paths and provide no leakage current during standby. The challenges of the design are the transition power consumption and the possibility of latch-up. The delay of the feedback loop results in a large leakage current flowing from the high-voltage supply to ground. The leakage current is similar to the transition current of conventional CMOS circuits, but larger because of the delay of the feedback loop. Because the large leakage current generates substrate current through hot carriers, the level shifter is susceptible to latch-up. The design of the level shifter should focus on speeding up the feedback loop and employing a latch-up-free structure. More sophisticated level shifters should be designed to provide a tradeoff between switching power and switching speed. The level shifter is used in the word-line driver, and in the bit-line driver if the bit-line requires a voltage larger than the external power supply. The driver is expected to be small because the word-line pitch is nearly the minimum feature size. Thus, the major challenges are to simplify the level shifter and to provide a high-performance switch.


FIGURE 51.29 The algorithm of the process routine in Fig. 51.28.

Charge-Pumping Circuit

The charge-pumping circuit is a high-voltage generator that supplies high voltage for programming and erasing operations. This kind of circuit is well-known in power equipment, such as power supplies, high-voltage switches, etc. A conventional voltage generator requires a power transformer, which ideally transforms input power to output power without loss. In other words, low voltage and large current are transformed to high voltage and low current. The transformer uses inductance and magnetic flux to generate high voltage very efficiently. However, in the VLSI arena, it is difficult to produce inductors, and the charge-pumping method is used instead. Figure 51.31 shows an example of a charge-pumping circuit that consists of multiple-stage pumping units. Each unit is composed of a one-way switch and a capacitor. The one-way switch is a high-voltage switch that does not allow charge to flow back to the input. The capacitor stores the transferred charge and gradually produces high voltage. No two consecutive stages operate at the same time. In other words, when one stage is transferring charge, the next stage and the previous stage should serve as isolation switches, which eliminates charge loss. Therefore, a two-phase clocking signal is required to carry out the charge-pumping operation, producing no voltage drop between the input and output of the switch and large current drivability at the output. In addition, the voltage level of each stage must be higher than that of the previous stage. Therefore, the two-phase clocking signal must be level-shifted to individual high voltages to turn the one-way switch in each pumping unit on and off. A smaller charge-pumping circuit or a more sophisticated level-shift circuit can be employed as self-boosted parts. The generated high voltage, in most cases, is higher than the required voltage. A regulation circuit, which can generate a stable voltage and is immune to fluctuation of the external supply voltage and the operating temperature, is used to regulate the voltage and will be described later.
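As a first-order illustration of how the output voltage grows with the number of pumping units, the sketch below uses a simplified Dickson-style estimate in which each stage adds the clock amplitude minus any switch drop, with no load current. This formula and all names are assumptions introduced for illustration; the circuit in Fig. 51.31 is not modeled in detail.

```python
def ideal_pump_output(v_in, v_clk, stages, switch_drop=0.0):
    """Unloaded output voltage of an idealized multi-stage charge pump.

    Each stage adds (v_clk - switch_drop); the output switch costs one more
    drop. switch_drop=0 corresponds to the boosted, level-shifted switches
    described in the text (no voltage drop across the one-way switch).
    """
    return v_in - switch_drop + stages * (v_clk - switch_drop)


def stages_needed(v_target, v_in, v_clk, switch_drop=0.0, max_stages=64):
    """Smallest number of stages whose idealized output reaches v_target."""
    if v_clk <= switch_drop:
        raise ValueError("clock amplitude must exceed the switch drop")
    for stages in range(max_stages + 1):
        if ideal_pump_output(v_in, v_clk, stages, switch_drop) >= v_target:
            return stages
    raise ValueError("target not reachable within max_stages")


if __name__ == "__main__":
    # Hypothetical 3 V supply and clock, 12 V programming voltage.
    print(stages_needed(12.0, 3.0, 3.0))                   # boosted switches
    print(stages_needed(12.0, 3.0, 3.0, switch_drop=0.7))  # diode-like switches
```

Under these assumptions, removing the switch drop saves stages (and therefore capacitor area), which is one motivation for level-shifting the clock at each pumping unit.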


FIGURE 51.30 Level shifter: (a) positive polarity pulse, and (b) negative polarity pulse.

FIGURE 51.31 (a) Charge-pumping circuit; (b) two-phase clock; and (c) pumping voltage.


Sense Amplifier

The sense amplifier is an analog circuit that amplifies small voltage differences. Many circuits can be employed, from the simplest two-transistor cross-coupled latches to complicated cascaded current-mirror sense amplifiers. Here, a symbolic diagram is used to represent the sense amplifier in the following discussion. The focus of the sensing circuit is on multi-level sensing, which is currently the engineering issue in Flash memory. Figures 51.32(a) and (b) show the schemes of parallel sensing and consecutive sensing, respectively. These two schemes are based on analog-to-digital conversion (ADC). In the parallel scheme, information stored in the Flash memory can be read simultaneously with multiple comparators working at the same time. The outputs of the comparators are encoded into N digits for 2^N levels. Figure 51.32(b) shows the consecutive sensing scheme. The sensing time will be N times longer than that of parallel sensing for 2^N levels. The sensing algorithm is a conventional binary search that compares against the middle value of the remaining range of interest. Only one sense amplifier is required for a cell. In the example, the additional sense amplifier is used for speeding up the sensing process. The second-stage sense amplifier can be precharged and prepared while the first-stage sense amplifier is amplifying the signal. Thus, the sensing time overhead is reduced. When a multi-level scheme is used, the threshold voltage distribution should be as tight as possible for each level. The depletion of unselected cells is strictly inhibited because the leakage current from unselected cells will destroy

FIGURE 51.32 (a) Parallel sensing scheme, and (b) consecutive sensing scheme.


the true signal, which leads to errors during sensing. Another challenge in multi-level sensing is the generation of reference voltages. Since the reference voltages are generated from the power supply, leakage along the voltage divider path is unavoidable. In addition, the generated voltages are susceptible to temperature variation and process-related resistance variation. If the variation of the reference voltages cannot be kept below a certain value, ambiguous decisions will be made in multi-level sensing because of the unavoidable threshold spread of each level. Therefore, providing a high-sensitivity sense amplifier and generating precise and robust reference voltages are the major development goals for Flash memory with more than four levels.
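The difference between the two sensing schemes can be sketched behaviorally. The following is an illustration under assumed, evenly spaced reference voltages; the function names and the voltage window are hypothetical and the circuits of Fig. 51.32 are not modeled.

```python
def parallel_sense(v_cell, n_bits, v_min, v_max):
    """Parallel scheme: 2**n_bits - 1 comparators fire in one step.

    Returns (level, sensing_steps). The comparator outputs form a
    thermometer code whose count of ones is the level index.
    """
    levels = 2 ** n_bits
    step = (v_max - v_min) / levels
    references = [v_min + step * k for k in range(1, levels)]
    level = sum(1 for ref in references if v_cell > ref)
    return level, 1


def consecutive_sense(v_cell, n_bits, v_min, v_max):
    """Consecutive scheme: one comparator, binary search over n_bits steps."""
    levels = 2 ** n_bits
    step = (v_max - v_min) / levels
    low, high = 0, levels  # the level index lies in [low, high)
    steps = 0
    while high - low > 1:
        mid = (low + high) // 2
        if v_cell > v_min + step * mid:  # compare with the middle reference
            low = mid
        else:
            high = mid
        steps += 1
    return low, steps


if __name__ == "__main__":
    for v in (1.5, 2.5, 3.4, 4.9):
        print(v, parallel_sense(v, 2, 1.0, 5.0), consecutive_sense(v, 2, 1.0, 5.0))
```

Both functions return the same level; the parallel version needs 2^N − 1 comparators and one step, while the consecutive version needs one comparator and N steps, which mirrors the area versus sensing-time tradeoff described above.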

Voltage Regulator

A voltage regulator is an accurate voltage generator that is immune to temperature variation, process-related variation, and parasitic component effects. The concept of voltage regulation arises from the temperature-compensated device and negative feedback circuits. Semiconductor carrier concentration and mobility both depend on the ambient temperature. Some devices have positive temperature coefficients, while others have negative coefficients. Both kinds of devices can be combined to produce a composite device with complete compensation. Figure 51.33 shows two back-to-back connected diodes that can be insensitive to temperature over the temperature range of interest, if the doping concentration is properly designed. The forward-biased diode has a negative temperature coefficient: the higher the temperature, the lower the cut-in voltage. On the other hand, the reverse-biased diode shows the opposite characteristic in its breakdown voltage. When the two diodes are connected and the diode characteristics are optimized, the regulated voltage can be insensitive to temperature. Nevertheless, the generated voltage is usually not the desired one. A feedback loop, as shown in Fig. 51.34, is needed to generate precise

FIGURE 51.33 (a) Back-to-back connected temperature-compensated dual diodes; and (b) the characteristics of a diode as a function of temperature.

FIGURE 51.34 Voltage regulation block diagram.


programming and erasing voltage. The charge-pumping output voltage and drivability are functions of the two-phase clocking frequency. The pumping voltage can be scaled to be compared with the precise voltage generator to provide a feedback signal for the clocking circuit whose frequency can be varied. With the feedback loop, the generated voltage can be insensitive to temperature. Whatever the desired output voltage is, the structure can be applied in general to produce temperature-insensitive voltage.
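The feedback loop can be illustrated with a small numerical sketch: the pump output is divided down, compared with a reference, and the error adjusts the clock frequency. The loaded-pump model (each stage losing i_load / (f · C) per cycle), the component values, and the simple integrating controller are all assumptions chosen for illustration and are not taken from Fig. 51.34.

```python
def pump_output(freq_hz, stages=8, vdd=3.0, c_stage=1e-12, i_load=10e-6):
    """Assumed loaded charge-pump model: higher clock frequency means less
    droop per stage and therefore a higher output voltage."""
    return vdd + stages * (vdd - i_load / (freq_hz * c_stage))


def regulate(v_ref=1.2, divider=0.1, freq_hz=20e6, gain=1e6, iterations=200):
    """Adjust the clock frequency until divider * v_out matches v_ref.

    gain is in Hz per volt of scaled error (a hypothetical loop constant).
    Returns the settled (frequency, output voltage).
    """
    for _ in range(iterations):
        v_out = pump_output(freq_hz)
        error = v_ref - divider * v_out  # comparator input
        freq_hz += gain * error          # speed the clock up if output is low
    return freq_hz, pump_output(freq_hz)


if __name__ == "__main__":
    freq, v_out = regulate()
    print(f"clock settles near {freq / 1e6:.2f} MHz, output {v_out:.2f} V")
```

In this sketch the output settles at v_ref / divider (12 V for the values shown) regardless of the starting frequency, as long as the loop gain is small enough for stable convergence; the accuracy of the result then rests on the reference voltage and the divider ratio.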

Y-Gating

Y-gating is the decoding path of bit-lines. The bit-line pitch is as small as the minimum feature size, so one register and one sense amplifier per bit-line is difficult to achieve. Y-gating serves as a switch that makes multiple bit-lines share one latch and one sense amplifier. Two approaches used for Y-gating, indirect decoding and direct decoding, are shown in Figs. 51.35(a) and (b), respectively. With indirect decoding, if 2^N bit-lines are decoded using one-to-two decoding units, N cascaded stages are required with N decoding control lines. However, when the direct decoding scheme is used, 2^N bit-lines require 2^N decoding lines to establish a one-to-2^N decoding network, and a pre-decoder is required to generate the decoding signals. The area penalty of indirect decoding is reduced but the voltage drop along

FIGURE 51.35

(a) Indirect decoding, and (b) direct decoding.

© 2000 by CRC Press LLC

the decoding path is of concern. To avoid the voltage drop, a boosted decoding line should be used to overcome the threshold voltage of the passing transistor. Another approach to eliminate voltage drop is the employment of a CMOS transfer gate. However, the area penalty arises again due to well-to-well isolation. Since Flash memory is very sensitive to the drain voltage, the boosted decoding control lines, together with the indirect decoding scheme, are suggested.
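The trade-off between the two schemes can be made concrete with a small counting sketch. The function names and the dictionary layout are hypothetical; the control-line counts follow the text, while the number of series pass transistors (N for the cascaded tree, one for the direct network) is an assumption of this sketch used to illustrate why the voltage drop matters more for indirect decoding.

```python
def indirect_decoding(n):
    """Cascade of one-to-two units selecting one of 2**n bit-lines."""
    return {"stages": n, "control_lines": n, "series_pass_transistors": n}


def direct_decoding(n):
    """Single-stage one-to-2**n network driven by a pre-decoder."""
    return {"stages": 1, "control_lines": 2 ** n, "series_pass_transistors": 1}


def select_path(column, n):
    """Branch taken (0 or 1) at each cascaded stage, most significant bit
    first, to reach bit-line `column` in the indirect scheme."""
    if not 0 <= column < 2 ** n:
        raise ValueError("column out of range")
    return [(column >> (n - 1 - stage)) & 1 for stage in range(n)]


if __name__ == "__main__":
    n = 4  # 16 bit-lines sharing one sense amplifier
    print("indirect:", indirect_decoding(n))
    print("direct:  ", direct_decoding(n))
    print("path to bit-line 5:", select_path(5, n))
```

For 16 bit-lines, the indirect scheme needs 4 control lines against 16 for the direct scheme, at the cost of a longer series path.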

Page Buffer

A page buffer is static memory (SRAM-like memory) that serves as temporary storage for input data and for read data. With the page buffer, Flash memory can increase its throughput or bandwidth during programming and reading, because external devices can exchange data with the page buffer quickly without waiting for the slow programming of the Flash memory. After the input data is transferred to the page buffer, the Flash memory begins programming and the external devices can perform other tasks. The page size should be chosen carefully according to the application. The larger the page, the more data can be transferred into Flash memory without having to wait for programming to complete. However, the area penalty limits the page size, so the page buffer must be sized as a trade-off between throughput and area for the application of interest.

Block Register

The block register stores information about an individual block. This information includes failure of the block, write inhibit, read inhibit, executable operations, etc., according to the application of interest. Some blocks, especially the boot block, are write-inhibited after the first programming. This prevents virus injection in some applications, such as PC BIOS. The block registers are themselves Flash memory cells, so the block information is retained after power-off. When the local FSM works on a certain block, it first checks the status of the block by reading the register. If the block is identified as a failed block, no further operation can be performed on it.
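The status check performed by the local FSM can be sketched as follows. The flag names, their bit positions, and the decision to treat erase like a write for the purpose of write inhibit are all assumptions made for illustration; the text does not specify the register layout.

```python
from enum import IntFlag


class BlockStatus(IntFlag):
    """Hypothetical block-register bits (layout assumed, not from the text)."""
    FAILED = 1
    WRITE_INHIBIT = 2
    READ_INHIBIT = 4


def operation_allowed(status, operation):
    """Decide whether the FSM may run `operation` on a block."""
    if status & BlockStatus.FAILED:
        return False  # a failed block accepts no further operation
    if operation in ("program", "erase") and status & BlockStatus.WRITE_INHIBIT:
        return False  # e.g. a boot block after its first programming
    if operation == "read" and status & BlockStatus.READ_INHIBIT:
        return False
    return True


if __name__ == "__main__":
    boot_block = BlockStatus.WRITE_INHIBIT
    print(operation_allowed(boot_block, "read"))     # reads still permitted
    print(operation_allowed(boot_block, "program"))  # programming blocked
```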

Summary

Flash memory is a mixed analog and digital system. The analog circuits include voltage-generation circuits, analog-to-digital converter circuits, sense amplifier circuits, and level-shifter circuits. These circuits must function well while consuming little area. The complicated designs used in pure analog circuits do not meet the requirements of Flash memory, which demands high array efficiency, high memory density, and large storage volume. Therefore, these analog circuits tend toward simplified designs that still deliver adequate function. On the other hand, the digital parts of Flash memory are not as complicated as the digital circuits used in pure digital signal processing. Therefore, the mixed analog and digital Flash memory system can be implemented in a simplified way. Furthermore, Flash memory is a memory cell-based system: all the circuit functions are designed according to the characteristics of the memory cell, so a different cell structure results in a completely different system design.


Cheng, K. "Dynamic Random Access Memory" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


52 Dynamic Random Access Memory

Kuo-Hsing Cheng Tamkang University

52.1 Introduction
52.2 Basic DRAM Architecture
52.3 DRAM Memory Cell
52.4 Read/Write Circuit
52.5 Synchronous (Clocked) DRAMs
52.6 Prefetch and Pipelined Architecture in SDRAMs
52.7 Gb SDRAM Bank Architecture
52.8 Multi-level DRAM
52.9 Concept of 2-bit DRAM Cell
Sense and Timing Scheme • Charge-Sharing Restore Scheme • Charge-Coupling Sensing

52.1 Introduction

The first dynamic RAM (DRAM) was proposed in 1970 with a capacity of 1 Kb. Since then, DRAMs have been the major driving force behind VLSI technology development. The density and performance of DRAMs have increased at a very fast pace; in fact, DRAM densities have quadrupled about every three years. The first experimental Gb DRAM was proposed in 1995 [1,2], and Gb DRAMs are commercially available in 2000. In addition, multi-level storage DRAM techniques are used to improve the chip density and to reduce the defect-sensitive area on a DRAM chip [3,4]. These developments in VLSI technology have produced DRAMs with a lower cost per bit than other types of memories.

52.2 Basic DRAM Architecture

The basic block diagram of a standard DRAM architecture is shown in Fig. 52.1. Unlike SRAM, the addresses of a standard DRAM are multiplexed into two groups to reduce the number of address input pins and to improve the cost-effectiveness of packaging. Although the multiplexed address scheme halves the number of address input pins, it makes the timing control more complex and reduces the operation speed. For high-speed DRAM applications, separate address input pins can be used to reduce the timing control complexity and to improve the operation speed. In general, an address transition detector (ATD) circuit is not needed in a DRAM. The DRAM controller provides the Row Address Strobe (RAS) and Column Address Strobe (CAS) signals to latch in the row addresses and the column addresses. As shown in Fig. 52.1, the pins of a standard DRAM are:

• Address pins, which are multiplexed in time into two groups: the row addresses and the column addresses
• Address control signals: the Row Address Strobe RAS and the Column Address Strobe CAS
• Write enable signal: WRITE
• Input/output data pins
• Power-supply pins

FIGURE 52.1 Basic block diagram of a standard DRAM architecture.

An example of address-multiplexed DRAM timing during the basic READ mode is shown in Fig. 52.2. The falling edge of the row address strobe (RAS) samples the row address and starts the READ operation. The row addresses are applied to the address pins first, and then the RAS signal latches them. Column addresses are not required until the row addresses have been sent in and latched. The column addresses are then applied to the same address pins and latched in by the column address strobe (CAS) signal. The time tRAS is the minimum time for which the RAS signal must be low, and tRC is the minimum READ cycle time. Notice that the multiplexed address arrangement penalizes the access time of the standard DRAM.

FIGURE 52.2 Read timing diagram for 4M × 1 DRAM.

CMOS DRAMs have several rapid access modes in addition to the basic modes. Figure 52.3 shows an example: the page mode operation. In this mode, the row addresses are applied to the address pins and clocked in by the RAS signal, and the column addresses are latched into the DRAM chip on the falling edge of the CAS signal, as in the basic READ mode. Along a selected row, individual column bits can then be accessed rapidly, and the readout is controlled in random order by the column address and the CAS signal. By using the page mode, the access time per bit is reduced.

FIGURE 52.3 Fast page mode read timing diagram.
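The address multiplexing and the benefit of page mode can be sketched in a few lines. The example assumes a 4M × 1 device whose 22 address bits are split evenly into 11 row bits and 11 column bits sharing the same pins; the function names and the strobe-counting model are hypothetical simplifications that ignore refresh and all timing parameters.

```python
ROW_BITS = 11  # assumed even split of the 22 address bits of a 4M x 1 DRAM
COL_BITS = 11


def split_address(addr):
    """Return the (row, column) pair presented in turn on the shared pins."""
    if not 0 <= addr < 1 << (ROW_BITS + COL_BITS):
        raise ValueError("address out of range")
    row = addr >> COL_BITS                # latched on the falling edge of RAS
    col = addr & ((1 << COL_BITS) - 1)    # latched on the falling edge of CAS
    return row, col


def count_strobes(addresses):
    """Count RAS and CAS falling edges when consecutive accesses to the
    same row are served in page mode (row kept open, only CAS toggles)."""
    ras = cas = 0
    open_row = None
    for addr in addresses:
        row, _ = split_address(addr)
        if row != open_row:
            ras += 1          # a new row needs a fresh RAS cycle
            open_row = row
        cas += 1              # every access needs its own CAS
    return ras, cas


if __name__ == "__main__":
    same_row = [(5 << COL_BITS) | c for c in range(4)]
    print(split_address(same_row[3]))  # row 5, column 3
    print(count_strobes(same_row))     # one RAS cycle shared by four reads
```

Four reads within one row need a single RAS cycle in this model, which is the source of the reduced access time per bit in page mode.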

52.3 DRAM Memory Cell In early CMOS DRAM storage cell design, three-transistor and four-transistor cells were used in 1-Kb and 4-Kb generations. Later, a particular one-transistor cell, as shown in Fig. 52.4(a), became the industry standard.5,6 The one-transistor (1T) cell achieves smaller cell size and low cost. The cell consists of an nchannel MOSFET and a storage capacitor Cs. The charge is stored in the capacitor Cs and the n-channel MOSFET functions as the access transistor. The gate of the n-channel MOSFET is connected to the wordline WL and its source/drain is connected to the bit-line. The bit-line has a capacity CBL, including the parasitic load of the connected circuits. The DRAM cell stores one bit of information as the charge on the cell storage capacitor Cs. Typical values for the storage capacitor Cs are 30 to 50 fF. When the cell stores “1”, the capacitor is charged to VDD – Vt. When the stores “0”, the capacitor is discharged to 0 V. During the READ operation, the voltage of the selected word-line is high; the access n-channel MOSFET is turned on, thus connecting the storage capacitor Cs to the bit-line capacitance CBL as shown in Fig. 52.4(b). The bit-line capacitance CBL, including the parasitic load of the connected circuits, is


FIGURE 52.4 (a) The one-transistor DRAM cell; and (b) during the READ operation, the voltage of the selected word-line is high, thus connecting the storage capacitor Cs to the bit-line capacitance CBL.

about 30 times larger than the storage capacitor Cs. Before the selection of the DRAM cell, the bit-line is precharged to a fixed voltage, typically VDD/2 (Ref. 7). By the charge conservation principle, during the READ operation the bit-line voltage changes by

$$V_s = \Delta V_{BL} = \frac{C_S}{C_{BL} + C_S}\left(V_{cs} - \frac{V_{DD}}{2}\right) \qquad (52.1)$$

Here, Vcs is the storage voltage on the DRAM cell capacitor Cs. The ratio R = CBL/Cs is important for the read sensing operation. If the cell stores “1” with a voltage Vcs = VDD – Vt, we have the small bit-line sense signal

$$\Delta V(1) = \frac{1}{1 + R}\left(\frac{V_{DD}}{2} - V_t\right) \qquad (52.2)$$

If the cell stores “0” with a voltage Vcs = 0, we have the small bit-line sense signal

$$\Delta V(0) = -\frac{1}{1 + R}\left(\frac{V_{DD}}{2}\right) \qquad (52.3)$$

Since the ratio R = CBL/Cs is large, the readout bit-line sense signals ∆V(1) and ∆V(0) are very small. Typical values for the sense signal are about 100 mV. For low-voltage operation, the supply voltage VDD is reduced. Thus, a lower R ratio is required to keep enough sense-signal margin against noise. The main approach is to use a large cell storage capacitor Cs. As shown in Fig. 52.5, a conventional Cs was implemented by a simple planar-type capacitor. The charge storage in the cell takes place on both the poly-1 gate oxide and the depletion capacitances. Planar DRAM cells were used in 1T DRAMs from the 16-Kb to the 1-Mb generation. The limits of the planar DRAM cell for retaining sufficient capacitance were reached in the mid-1980s in the 1-Mb DRAM. For densities higher than 1 Mb, smaller horizontal geometry on the surface of the wafer can be achieved by making increased use of the vertical dimension.8 One approach is to use a trench capacitor, as shown in Fig. 52.6(a).9 It is folded vertically into the surface of the silicon in the form of a trench. Another approach for reducing horizontal capacitor size is to stack the capacitor Cs over the n-channel MOSFET access transistor, as shown in Fig. 52.6(b).
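As a rough numerical illustration of Eqs. (52.1) to (52.3), the following Python sketch evaluates the bit-line sense signal for a stored “1” and a stored “0”. The supply voltage and threshold voltage are assumed values chosen only for illustration; only the ratio R of about 30 comes from the text.

```python
def bitline_sense_signal(v_cs, v_dd, r):
    """Eq. (52.1) rewritten with R = C_BL / C_S: dV = (V_cs - V_DD/2) / (1 + R)."""
    return (v_cs - v_dd / 2.0) / (1.0 + r)

# Illustrative values; V_DD and V_T are assumptions, not data from the text.
V_DD = 5.0   # supply voltage in volts (assumed)
V_T = 1.0    # access-transistor threshold voltage in volts (assumed)
R = 30.0     # C_BL / C_S, "about 30" according to the text

dv1 = bitline_sense_signal(V_DD - V_T, V_DD, R)  # cell stores "1", Eq. (52.2)
dv0 = bitline_sense_signal(0.0, V_DD, R)         # cell stores "0", Eq. (52.3)

print(f"dV(1) = {dv1 * 1e3:.1f} mV")
print(f"dV(0) = {dv0 * 1e3:.1f} mV")
```

With these assumed values both signals come out at a few tens of millivolts, and halving R roughly doubles them, which is why a larger storage capacitor helps at reduced supply voltage.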

FIGURE 52.5 Structural innovations of planar DRAM cells.

52.4 Read/Write Circuit

As shown in the previous section, the readout process is destructive because the resulting voltage of the cell capacitor Cs will no longer be (VDD – Vt) or 0 V. Thus, the same data must be amplified and written back to the cell in every readout process. Next to the storage cells, a sense amplifier with positive feedback structure, as shown in Fig. 52.7, is the most important component in a memory chip; it amplifies the small readout signal in the readout process. The input and output nodes of the differential positive feedback sense amplifier are connected to the bit-lines BL and $\overline{BL}$. The small readout signal appearing between BL and $\overline{BL}$ is detected by the differential sense amplifier and amplified to a full-voltage swing at BL and $\overline{BL}$. For example, if the DRAM

FIGURE 52.6 Schematic cross-section of DRAM cells: (a) trench capacitor cell, and (b) stacked capacitor cell.

FIGURE 52.7 A differential sense amplifier connected to the bit-line.

memory cell on BL stores a “1”, then a small positive voltage ∆V(1) is generated and added to the bit-line BL voltage after the readout process. The voltage on the bit-line BL becomes VDD/2 + ∆V(1). At the same time, the complementary bit-line $\overline{BL}$ keeps its precharged level of VDD/2. Thus, the small positive voltage ∆V(1) appears between BL and $\overline{BL}$, with $V_{BL}$ higher than $V_{\overline{BL}}$, immediately after the readout process. It is amplified by the differential sense amplifier. The waveforms of VB before and after activating the sense amplifier are shown in Fig. 52.8. After the sensing and restoring operations, the voltage $V_{BL}$ rises to VDD, and the voltage $V_{\overline{BL}}$ falls to 0 V. The output at BL is then sent to the DRAM output pin.

FIGURE 52.8 Timing waveform of VB.

The various circuits for the read, write, precharge, and equalization functions are shown in Fig. 52.9. The read operation is performed in the following sequence.

1. Initially, both bit-lines BL and $\overline{BL}$ are precharged to VDD/2 and equalized before the data readout process. The precharge and equalizer circuits are activated by raising the control signal Φp. This causes the bit-lines BL and $\overline{BL}$ to be at equal voltage. The control signal Φp goes low after the precharge and equalization.
2. The word-line WL is selected by the row decoder. It goes high to connect the storage cell to the bit-lines BL and $\overline{BL}$. A small voltage difference then appears between the bit-lines. The voltage level of the word-line signal WL can be greater than VDD to overcome the threshold voltage drop of the n-channel MOSFET. Thus, the stored voltage level of data “1” at the memory cell can be raised to VDD.
3. Once a small voltage difference is generated between the bit-lines BL and $\overline{BL}$ by the storage cell, the differential sense amplifier is turned on by pulsing the sense control signal Φs high and the complementary sense control signal $\overline{\Phi_s}$ low. The small voltage difference is then amplified by the differential sense amplifier. The voltage levels on BL and $\overline{BL}$ quickly move to VDD or 0 V by the regenerative action of the positive feedback in the differential sense amplifier.
4. After the readout sensing and restoring operations, the voltage levels of the bit-lines have a full voltage swing. The differential voltage levels at the bit-lines are then read out to the differential output lines O and $\overline{O}$ through a read circuit. A main sense amplifier is used to read and amplify the output lines. After these processes, the output data is selected and transferred to the output buffer.
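The read sequence above can be sketched as a small behavioral model in Python. This is only an idealized illustration, not a circuit simulation: the function name `dram_read`, the capacitor values, and the supply voltage are assumptions, and the sense amplifier is modeled as an ideal comparator that drives the bit-lines to the rails.

```python
def dram_read(stored_bit, v_dd=3.3, c_s=40e-15, c_bl=1.2e-12):
    """Idealized read of one cell: precharge, charge sharing, sensing, restore."""
    # Step 1: precharge and equalize both bit-lines to VDD/2.
    v_bl = v_blb = v_dd / 2.0

    # Step 2: a boosted word-line connects the cell; a stored "1" sits at full VDD.
    v_cell = v_dd if stored_bit else 0.0
    v_bl = (c_bl * v_bl + c_s * v_cell) / (c_bl + c_s)  # charge conservation
    delta = v_bl - v_blb                                # small sense signal

    # Step 3: the regenerative sense amplifier drives the pair to the rails.
    sensed = 1 if delta > 0 else 0
    v_bl, v_blb = (v_dd, 0.0) if sensed else (0.0, v_dd)

    # Restore: the cell is still connected, so it follows the driven bit-line.
    v_cell = v_bl

    # Step 4: the full-swing bit-line level is what the read circuit passes on.
    return sensed, delta, v_cell


for bit in (0, 1):
    sensed, delta, v_cell = dram_read(bit)
    print(f"stored {bit}: delta = {delta * 1e3:+.1f} mV, "
          f"sensed {sensed}, restored cell = {v_cell:.1f} V")
```

The model makes the destructive nature of the read visible: after charge sharing the cell holds nearly VDD/2, and only the restore step brings it back to a full level.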


FIGURE 52.9 (a) Schematic circuit diagram of DRAM.

In the write mode, the write control signal WRITE is activated. The selected bit-lines BL and $\overline{BL}$ are connected to a pair of input data lines controlled by the write control and write driver. The write circuit drives the voltage levels at the bit-lines to VDD or 0 V, and the data are transferred to the DRAM cell when the access transistor is turned on.


FIGURE 52.9(b) READ operation waveforms.

52.5 Synchronous (Clocked) DRAMs

Multimedia applications are growing rapidly, and multimedia systems require high speed and large memory capacity to improve the quality of data processing. Under this trend, high density, high bandwidth, and fast access time are the key requirements of future DRAMs. The synchronous DRAM (SDRAM) offers fast access speed and is widely used for memory applications in multimedia systems. The first SDRAM appeared in the 16-Mb generation, and the current state-of-the-art product is a Gb SDRAM with GB/s bandwidth.10–14

Conventionally, the internal signals in asynchronous (non-clocked) DRAMs are generated by “address transition detection” (ATD) techniques. The ATD clock can be used to activate the address decoder and driver, the sense amplifier, and the peripheral circuits of the DRAM. Therefore, asynchronous DRAMs require no external system clock and have a simple interface. However, during the asynchronous DRAM access cycle, the processor unit must wait for the data from the asynchronous DRAM, as shown in Fig. 52.10. Therefore, the effective speed of the asynchronous DRAM is low. Synchronous (clocked) DRAMs, on the other hand, operate under the control of the edges of the system clock. The input addresses of a synchronous DRAM are latched into the DRAM, and the output data is available after a given number of clock cycles, during which the processor unit is

FIGURE 52.10 Read cycle timing diagram for asynchronous DRAM.

FIGURE 52.11 Read cycle timing diagram for synchronous DRAM.

FIGURE 52.12 Block diagrams of a synchronous DRAM.

free and does not wait for the data from the SDRAM, as shown in Fig. 52.11. The block diagram of an SDRAM is shown in Fig. 52.12. With the synchronous interface scheme, the effective operation speed of a given system is improved.

52.6 Prefetch and Pipelined Architecture in SDRAMs

The system clock drives the SDRAM architecture. To reduce the average access time, the system clock can be used to store the next address in the input latch, or to sequentially clock out the data for each address access from the output buffer, as shown in Fig. 52.13.15 During the read cycle of the prefetch SDRAM, more than one data word is fetched from the memory array and sent to the output buffer. Using the system clock to control the prefetch register and buffer, multiple words of data can be sequentially clocked out for each address access. As shown in Fig. 52.13, the SDRAM has a 6-clock-cycle RAS latency to prefetch 4-bit data.
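The prefetch idea described above can be illustrated with a toy Python model. The class name `PrefetchReadPath`, its methods, and the list standing in for the memory array are all hypothetical; latency and timing are ignored, and the sketch only shows that one array access feeds several clocked output words.

```python
from collections import deque


class PrefetchReadPath:
    """Toy model: one array access fills a buffer that is drained one word per clock."""

    def __init__(self, array, prefetch=4):
        self.array = array          # stands in for the memory array (assumption)
        self.prefetch = prefetch    # words fetched per address access
        self.buffer = deque()       # prefetch register / output buffer
        self.array_accesses = 0

    def read(self, address):
        """One address access: fetch `prefetch` consecutive words into the buffer."""
        self.array_accesses += 1
        self.buffer.extend(self.array[address:address + self.prefetch])

    def clock(self):
        """One system-clock edge: shift one word out, or None if the buffer is empty."""
        return self.buffer.popleft() if self.buffer else None


memory = list(range(100, 116))     # 16 illustrative words
path = PrefetchReadPath(memory, prefetch=4)
path.read(4)
burst = [path.clock() for _ in range(4)]
print(burst, "from", path.array_accesses, "array access")
```

Four output words are delivered on four consecutive clocks while the array is accessed only once, which is the source of the improved average access time.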


FIGURE 52.13 Block diagrams of two types of synchronous DRAM output: (a) prefetch (b) pipelined.

52.7 Gb SDRAM Bank Architecture

In realizing a Gb SDRAM, the chip layout and the bank/data bus architecture are important for data access. Figure 52.14 shows the conventional bank/data bus architecture of a 1-Gb SDRAM.16 It contains 64 DQ pins, 32 × 32-Mb SDRAM blocks, and four banks, all with a 4-bit prefetch. During the read cycle, the eight 32-Mb DRAM blocks of one bank are accessed simultaneously. The 256-bit data is delivered to the 64 DQ pins with 4 bits prefetched per pin. In an activated 32-Mb array block, 32-bit data is accessed and associated with eight specific DQ pins. Therefore, a data I/O bus switching circuit is required between the 32-Mb SDRAM bank and the eight DQ pins. This makes the data I/O bus more complex and the access time slower.
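The figures quoted for the conventional 1-Gb architecture are mutually consistent, as the short Python check below shows. The variable names are illustrative; every input value is taken from the paragraph above and the remaining quantities are derived from them.

```python
# Values stated in the text for the conventional 1-Gb SDRAM.
DQ_PINS = 64          # data pins
PREFETCH_BITS = 4     # bits prefetched per DQ pin
BANKS = 4
BLOCKS = 32           # number of 32-Mb blocks on the chip
BLOCK_MBIT = 32       # capacity of one block in Mb

# Derived quantities.
capacity_mbit = BLOCKS * BLOCK_MBIT                    # total capacity in Mb
blocks_per_bank = BLOCKS // BANKS                      # blocks accessed together
bits_per_access = DQ_PINS * PREFETCH_BITS              # bits fetched per read
bits_per_block = bits_per_access // blocks_per_bank    # bits from one block
dq_per_block = bits_per_block // PREFETCH_BITS         # DQ pins served by a block

print(f"capacity        : {capacity_mbit} Mb")
print(f"blocks per bank : {blocks_per_bank}")
print(f"bits per access : {bits_per_access}")
print(f"bits per block  : {bits_per_block}")
print(f"DQ pins / block : {dq_per_block}")
```

The check reproduces the eight blocks per bank, the 256-bit access, the 32 bits per activated block, and the eight DQ pins associated with each block.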

FIGURE 52.14 1-Gb SDRAM bank/data bus architecture.

In order to simplify the bus structure, the distributed bank (D-bank) architecture is proposed as shown in Fig. 52.15. The 1-Gb SDRAM is implemented by 32 × 32-Mb distributed banks. A 32-Mb distributed bank contains two 16-Mb memory arrays as shown in Fig. 52.16. The divided word-line technique is used to activate the segment along the column direction. Using this scheme, each of the eight 2-Mb segments is selectively activated; sense amplifiers of one of the eight segments are activated; and all the 16-K sense amplifiers are activated simultaneously. As compared with the conventional architecture, the distributed bank architecture has a much simplified data I/O bus structure.

FIGURE 52.15 1-Gb SDRAM D-bank architecture.

FIGURE 52.16 16-Mb memory array for the D-bank architecture.

52.8 Multi-level DRAM

In modern application-specific IC (ASIC) memory designs, several important items need to be considered: memory capacity, fabrication yield, and access speed. The memory capacity required for ASIC applications has been increasing very rapidly, and bit-cost reduction is one of the most important issues for file application DRAMs. In order to achieve high yield, it is important to reduce the defect-sensitive area on a chip. The multi-level storage DRAM technique is one of the circuit technologies that can reduce the effective cell size. It stores multiple voltage levels in a single DRAM cell. For example, in a four-level system, each DRAM cell holds one of the 2-bit data values “11”, “10”, “01”, and “00”. Thus, the multi-level storage technique can improve the chip density and reduce the defect-sensitive area on a DRAM chip, and it is one of the solutions to the “density and yield” problem.

52.9 Concept of 2-bit DRAM Cell

The 2-bit DRAM is an important architecture in the multi-level DRAM. Let us discuss an example of a multi-level technique used for a 4-Gb DRAM by NEC.17 Table 52.1 lists both the 2-bit/4-level storage concept and the conventional 1-bit/2-level storage concept. In the conventional 1-bit/2-level DRAM cell, the storage voltage levels are Vcc or GND, corresponding to logic values “1” or “0”. The signal charge is one half of the maximum storage charge. In the 2-bit/4-level DRAM cell, the storage voltage levels are Vcc, two-thirds Vcc, one-third Vcc, and GND, corresponding to logic values “11”, “10”, “01”, and “00”, respectively. Three reference voltage levels are used to detect these four storage levels. The reference levels are positioned midway between adjacent storage levels. Thus, the signal charge between the storage and reference levels is one sixth of the maximum storage charge.

TABLE 52.1 Four-Level Storage Data

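Storage scheme | Data | Storage voltage level | Reference level (midway to the next lower level) | Signal level
4-level (2-bit) storage | 11 | Vcc | 5/6 Vcc | 1/6 Vcc
4-level (2-bit) storage | 10 | 2/3 Vcc | 3/6 Vcc | 1/6 Vcc
4-level (2-bit) storage | 01 | 1/3 Vcc | 1/6 Vcc | 1/6 Vcc
4-level (2-bit) storage | 00 | GND | (none) | 1/6 Vcc
2-level storage | 1 | Vcc | 1/2 Vcc | 1/2 Vcc
2-level storage | 0 | GND | (none) | 1/2 Vcc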
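The storage, reference, and signal levels of Table 52.1 follow from dividing the range between GND and Vcc evenly. The Python sketch below derives them as fractions of Vcc; the function name `storage_levels` is illustrative, and the equal spacing of levels is the assumption stated in the text.

```python
from fractions import Fraction


def storage_levels(bits):
    """Return (storage levels, reference levels, signal level) as fractions of Vcc."""
    n = 2 ** bits
    # Equally spaced storage levels from GND (0) to Vcc (1).
    levels = [Fraction(k, n - 1) for k in range(n)]
    # Each reference sits midway between two adjacent storage levels.
    refs = [(lo + hi) / 2 for lo, hi in zip(levels, levels[1:])]
    # Signal: distance from a storage level to its nearest reference.
    signal = (levels[1] - levels[0]) / 2
    return levels, refs, signal


for bits in (1, 2):
    levels, refs, signal = storage_levels(bits)
    print(f"{bits}-bit cell: levels {[str(x) for x in levels]}, "
          f"references {[str(x) for x in refs]}, signal {signal} Vcc")
```

For the 2-bit cell this yields references at 1/6, 1/2 (that is, 3/6), and 5/6 of Vcc and a signal of 1/6 Vcc, one third of the 1/2 Vcc signal of the conventional cell.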

Sense and Timing Scheme

The circuit diagram of the 2-bit/4-level storage technique is shown in Fig. 52.17. A pair of bit-lines is separated into two sections by transfer switches in order to have a capacitance ratio of two between Sections A and B. Two sense amplifiers and two cross-coupled capacitors Cc are connected to each section. During the stand-by cycle, the transfer signal TG is high and the transfer switch is turned ON. The bit-lines are precharged to the half-Vcc level. As shown in Fig. 52.17(b), at time T1, the circuit is operated in the active cycle, and a word-line is selected and the charge stored in the cell Cs is transferred to the bit-lines. At time T2, the transfer switches are turned OFF and the bit-lines are isolated. At time T3, the sense amplifier in Section A is activated and the bit-lines in Section A are driven to Vcc and GND, depending on the stored data. The amplified data in Section A is the most significant bit (MSB) of the stored data because the reference level is half-Vcc.

FIGURE 52.17 Principle of sense and restore: (a) circuit diagram, and (b) timing diagram.

In the same time interval, the MSB is transferred to the bit-lines in Section B through a cross-coupled capacitor Cc. This changes the bit-line level in Section B for the subsequent least significant bit (LSB) sensing. At time T4, the sense amplifier in Section B is activated and the LSB is sensed. At time T5, the transfer switch is turned ON, the charge on each bit-line is shared, and the read-out data is restored to the memory cell.


Charge-Sharing Restore Scheme

Table 52.2 lists the restored level generated by the charge-sharing restore scheme. The MSB is latched in Section A, and the LSB is latched in Section B. The capacitance ratio between Sections A and B is 2. The charge of the MSB and the charge of the LSB are combined on the bit-line, and the restore level Vrestore is generated.

TABLE 52.2 Charge-Sharing Restore Scheme
Restore level | MSB = 1 | MSB = 0
LSB = 1 | Vcc | 1/3 Vcc
LSB = 0 | 2/3 Vcc | 0 (GND)

$$V_{restore} = V_{cc}\,\frac{2C_b \cdot \mathrm{MSB} + C_b \cdot \mathrm{LSB}}{3C_b}$$
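The restore-level expression can be checked against the entries of Table 52.2 with a few lines of Python. The function name `restore_level` and the use of exact fractions are illustrative choices; the capacitance ratio of 2 between Sections A and B is taken from the text.

```python
from fractions import Fraction


def restore_level(msb, lsb, ratio=2):
    """V_restore / Vcc when Section A (ratio * Cb, holding the MSB) shares charge
    with Section B (Cb, holding the LSB)."""
    return Fraction(ratio * msb + lsb, ratio + 1)


for msb in (1, 0):
    for lsb in (1, 0):
        print(f"MSB={msb} LSB={lsb} -> V_restore = {restore_level(msb, lsb)} Vcc")
```

The four results, 1, 2/3, 1/3, and 0 times Vcc, are exactly the four storage levels of the 2-bit cell, so charge sharing alone regenerates the stored level without a separate level generator.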

Charge-Coupling Sensing

Figure 52.18 shows the change in bit-line levels due to the coupling capacitor Cc. The MSB is sensed using the reference level of half-Vcc, as mentioned earlier. The MSB generates the reference level for LSB sensing. When Vs is defined as the absolute signal level of data “11” and “00”, the absolute signal level of data “10” and “01” is one-third of Vs. Here, Vs is directly proportional to the ratio between the storage capacitor Cs and the bit-line capacitance.

FIGURE 52.18 Charge-coupling sensing.

In the case of sensing data “11”, the initial signal level is Vs. After MSB sensing, the bit-line level in Section B is changed for LSB sensing by the MSB through coupling capacitor Cc. The reference bit-line in Section B is raised by Vc, and the other bit-line is reduced by Vc. For LSB sensing, Vc is one-third of Vs due to the coupling capacitor Cc. Using the two-step sensing scheme, the 2-bit data in a DRAM cell can be implemented.
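The two-step sensing can be followed numerically with a small Python sketch. It works in units of Vs and models only the differential bit-line signal. The names `SIGNAL` and `two_step_sense` are hypothetical, and the behavior for MSB = 0 (coupling in the opposite direction) is an assumption made by symmetry with the “11” case described above.

```python
from fractions import Fraction

# Initial differential signal relative to half-Vcc, in units of Vs (from the text).
SIGNAL = {
    "11": Fraction(1),
    "10": Fraction(1, 3),
    "01": Fraction(-1, 3),
    "00": Fraction(-1),
}


def two_step_sense(signal, vc=Fraction(1, 3)):
    """Return the 2-bit data recovered from the initial differential signal."""
    # Step 1: MSB sensing against the half-Vcc reference.
    msb = 1 if signal > 0 else 0
    # Step 2: coupling through Cc moves the reference line by Vc and the other
    # line by Vc in the opposite direction, a differential shift of 2 * Vc.
    # Assumption: the shift reverses sign when the MSB is 0.
    shift = 2 * vc if msb else -2 * vc
    lsb = 1 if signal - shift > 0 else 0
    return f"{msb}{lsb}"


for data, signal in SIGNAL.items():
    print(f"stored {data}: sensed {two_step_sense(signal)}")
```

In this model the LSB sense amplifier always sees a differential of plus or minus one-third of Vs, the same magnitude as the smallest MSB signal.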


References

1. Sekiguchi, T. et al., “An Experimental 220MHz 1Gb DRAM,” ISSCC Dig. Tech. Papers, pp. 252-253, Feb. 1995.
2. Sugibayashi, T. et al., “A 1Gb DRAM for File Applications,” ISSCC Dig. Tech. Papers, pp. 254-255, Feb. 1995.
3. Murotani, T. et al., “A 4-Level Storage 4Gb DRAM,” ISSCC Dig. Tech. Papers, pp. 74-75, Feb. 1997.
4. Furuyama, T. et al., “An Experimental 2-bit/Cell Storage DRAM for Macrocell or Memory-on-Logic Application,” IEEE J. Solid-State Circuits, vol. 24, no. 2, pp. 388-393, April 1989.
5. Ahlquist, C.N. et al., “A 16k 384-bit Dynamic RAM,” IEEE J. Solid-State Circuits, vol. SC-11, no. 3, Oct. 1976.
6. El-Mansy, Y. et al., “Design Parameters of the Hi-C SRAM cell,” IEEE J. Solid-State Circuits, vol. SC-17, no. 5, Oct. 1982.
7. Lu, N. C. C., “Half-VDD Bit-Line Sensing Scheme in CMOS DRAM’s,” IEEE J. Solid-State Circuits, vol. SC-19, no. 4, Aug. 1984.
8. Lu, N. C. C., “Advanced Cell Structures for Dynamic RAMs,” IEEE Circuits and Devices Magazine, pp. 27-36, Jan. 1989.
9. Mashiko, K. et al., “A 4-Mbit DRAM with Folded-Bit-Line Adaptive Sidewall-Isolated Capacitor (FASIC) Cell,” IEEE J. Solid-State Circuits, vol. SC-22, no. 5, Oct. 1987.
10. Prince, B. et al., “Synchronous Dynamic RAM,” IEEE Spectrum, p. 44, Oct. 1992.
11. Yoo, J.-H. et al., “A 32-Bank 1Gb DRAM with 1GB/s Bandwidth,” ISSCC Dig. Tech. Papers, pp. 378-379, Feb. 1996.
12. Nitta, Y. et al., “A 1.6GB/s Data-Rate 1Gb Synchronous DRAM with Hierarchical Square-Shaped Memory Block and Distributed Bank Architecture,” ISSCC Dig. Tech. Papers, pp. 376-377, Feb. 1996.
13. Yoo, J.-H. et al., “A 32-Bank 1 Gb Self-Strobing Synchronous DRAM with 1 Gbyte/s Bandwidth,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1635-1644, Nov. 1996.
14. Saeki, T. et al., “A 2.5-ns Clock Access, 250-MHz, 256-Mb SDRAM with Synchronous Mirror Delay,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1656-1668, Nov. 1996.
15. Choi, Y. et al., “16Mb synchronous DRAM with 125Mbyte/s data rate,” IEEE J. Solid-State Circuits, vol. 29, no. 4, April 1994.
16. Sakashita, N. et al., “A 1.6GB/s Data-Rate 1-Gb Synchronous DRAM with Hierarchical Square Memory Block and Distributed Bank Architecture,” IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1645-1655, Nov. 1996.
17. Okuda, T. et al., “A Four-Level Storage 4-Gb DRAM,” IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1743-1747, Nov. 1997.
18. Prince, B., Semiconductor Memories, 2nd edition, John Wiley & Sons, 1993.
19. Prince, B., High Performance Memories New Architecture DRAMs and SRAMs Evolution and Function, 1st edition, Betty Prince, 1996.
20. Toshiba Applications Specific DRAM Databook, D-20, 1994.


Margala, M. "Low-Power Memory Circuits" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


53 Low-Power Memory Circuits

Martin Margala
University of Alberta

53.1 Introduction
53.2 Read-Only Memory (ROM)
Sources of Power Dissipation • Low-Power ROMs
53.3 Flash Memory
Low-Power Circuit Techniques for Flash Memories
53.4 Ferroelectric Memory (FeRAM)
53.5 Static Random-Access Memory (SRAM)
Low-Power SRAMs
53.6 Dynamic Random-Access Memory (DRAM)
Low-Power DRAM Circuits
53.7 Conclusion

53.1 Introduction

In recent years, rapid development in VLSI fabrication has led to decreased device geometries and increased transistor densities of integrated circuits, and circuits with high complexities and very high frequencies have started to emerge. Such circuits consume an excessive amount of power and generate an increased amount of heat. Circuits with excessive power dissipation are more susceptible to run-time failures and present serious reliability problems. Increased temperature from high-power processors tends to exacerbate several silicon failure mechanisms. Every 10°C increase in operating temperature approximately doubles a component’s failure rate. Increasingly expensive packaging and cooling strategies are required as chip power increases.1,2 Due to these concerns, circuit designers are realizing the importance of limiting power consumption and improving energy efficiency at all levels of design. The second driving force behind the low-power design phenomenon is a growing class of personal computing devices, such as portable desktops, digital pens, audio- and video-based multimedia products, and wireless communications and imaging systems, such as personal digital assistants, personal communicators, and smart cards. These devices and systems demand high-speed, high-throughput computations, complex functionalities, and often real-time processing capabilities.3,4 The performance of these devices is limited by the size, weight, and lifetime of batteries. Serious reliability problems, increased design costs, and battery-operated applications have prompted the IC design community to look more aggressively for new approaches and methodologies that produce more power-efficient designs, which means significant reductions in power consumption for the same level of performance.

Memory circuits form an integral part of every system design: dynamic RAMs, static RAMs, ferroelectric RAMs, ROMs, and Flash memories all contribute significantly to system-level power consumption. Two examples of recently presented reduced-power processors show that 43% and 50.3%, respectively, of the total system power consumption is attributed to memory circuits.5,6 Therefore, reducing the power dissipation in memories can significantly improve the system power-efficiency, performance, reliability, and overall costs.


In this chapter, all sources of power consumption in different types of memories will be identified; several low-power techniques will be presented; and the latest developments in low-power memories will be analyzed.

53.2 Read-Only Memory (ROM)

ROMs are widely used in a variety of applications (permanent code storage for microprocessors or data look-up tables in multimedia processors) for fixed long-term data storage. The high area density and new submicron technologies with multiple metal layers increase the popularity of ROMs for a low-voltage, low-power environment. In the following section, sources of power dissipation in ROMs and applicable efficient low-power techniques are examined.

Sources of Power Dissipation

A basic block diagram of a ROM architecture is presented in Fig. 53.1.7,8 It consists of an address decoder, a memory controller, a column multiplexer/driver, and a cell array. Table 53.1 lists an example of the power dissipation in a 2 K × 18 ROM designed in 0.6-µm CMOS technology at 3.3 V and clocked at 10 MHz.8 The cell array dissipates 89% of the total ROM power, and 11% is dissipated in the decoder, control logic, and the drivers. The majority of the power consumed in the cell array is due to the precharging of large capacitive bit-lines. During the read and write cycles, more than 18 bit-lines are switched per access because the word-line selects more bit-lines than necessary. The example in Fig. 53.2 shows a 121 multiplexer and a bit-line with five transistors connected to it. This topology consumes excessive amounts of power because 4 more bit-lines will switch instead of just one. The power dissipated in the decoder, control logic, and drivers is due to the switching activity during the read and precharge cycles and to generating control signals for the entire memory.

FIGURE 53.1 Basic ROM architecture. (© 1997, IEEE. With permission.)

Low-Power ROMs

In order to significantly reduce the power consumption in ROMs, every part of the architecture has to be targeted and multiple techniques have to be applied. Angel and Swartzlander8 have identified several architectural improvements in the cell array that minimize energy waste and improve efficiency. These techniques include:

• Hierarchical word line
• Selective precharging
• Minimization of non-zero terms
• Inverted ROM core(s)
• Row(s) inversion
• Sign magnitude encoding
• Sign magnitude and inverted block
• Difference encoding
• Smaller cell arrays

TABLE 53.1 Power Dissipation of a 2 K × 18 ROM
Block | Power (mW) | Percentage (%)
Decoder | 0.06 | 2.1
ROM core | 2.24 | 89
Control | 0.18 | 7.2
Drivers | 0.05 | 1.7
(Source: © 1997, IEEE. With permission.)

FIGURE 53.2 ROM bit-lines. (© 1997, IEEE. With permission.)

All of these methods result in a reduction of the capacitance and/or switching activity of bit- and row-lines. A hierarchical word-line approach divides the memory into separate blocks and runs the block word-line in one layer and a global word-line in another layer. As a result, only the bit cells of the desired block are accessed. A selective precharging method addresses the problem of activating multiple bit-lines although only a single memory location is being accessed. With this method, only those bit-lines that are being accessed are precharged. The hardware overhead for implementing this function is minimal. A minimization of non-zero terms reduces the total capacitance of bit- and row-lines because zero-terms do not switch bit-lines. This also reduces the number of transistors in the memory core. An inverted ROM applies to a memory with a large number of 1s. In this case, the entire ROM array can be inverted and the final data inverted back in the output driver circuitry. Consequently, the number of transistors and the capacitance of bit- and row-lines are reduced. An inverted row method also minimizes non-zero terms, but on a row-by-row basis. This type of encoding requires an extra bit (MSB) that indicates whether or not a particular row is encoded. A sign and magnitude encoding is used to store negative numbers. This method also minimizes the number of 1s in the memory. However, a two’s complement conversion is required when data are retrieved from the memory. A sign and magnitude and an inverted block is a combination of the two techniques described previously. A difference encoding can be used to reduce the size of the cell array. In applications where


a ROM is accessed sequentially and the data read from one address does not change significantly from the following address, the memory core can store the difference between these two entries instead of the entire value. The disadvantage is the need for an additional adder circuit to calculate the original value. In applications where different bit sizes of data are needed, smaller memory arrays are useful. If the data are stored in a single memory array, its bit size is determined by the largest number. However, most of the bit positions in smaller numbers are occupied by non-zero values that would increase the bit-line and row-line capacitance. Therefore, by grouping the data into smaller memory arrays according to their size, significant savings in power can be achieved.

On the circuit level, powerful techniques that minimize the power dissipation can be applied. The first and most common technique is reducing the power supply voltage to approximately Vdd ≈ 2Vt, in conjunction with architectural-based scaling. In this region of operation, CMOS circuits achieve the maximum power efficiency.9,10 This results in large power savings because the supply voltage is a quadratic term in the well-known dynamic power equation. In addition, the static power and short-circuit power are also reduced. Second, all the transistors in the decoder, control logic, and driver block must be sized properly for low-power, low-voltage operation. Rabaey and Pedram9 have shown that the ideal low-power sizing is when Cd = CL/2, where Cd is the total parasitic capacitance from driving transistors and CL is the total load capacitance of a particular circuit node. By applying this method to every circuit node, a maximum power efficiency can be achieved. Third, different logic styles should be explored for the implementation of the decoder, control logic, and drivers. Some alternative logic styles are superior to standard CMOS for low-power, low-voltage operation.11,12 Fourth, by reducing the voltage swing of the bit-lines, a significant reduction in switching power can be obtained. One way of implementing this technique is to use NMOS precharge transistors. The bit-lines are then precharged to Vdd – Vt. A fifth method can be applied in cases when the same location is accessed repeatedly.8 In this case, a circuit called a voltage keeper can be used to store past history and avoid transitions in the data bus and adder (if sign and magnitude is implemented). The sixth method involves limiting short-circuit dissipation during address decoding and in the control logic and drivers. This can be achieved by careful design of individual logic circuits.
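Two of the architectural encodings described earlier in this section, difference encoding and row inversion, can be illustrated with a minimal Python sketch. The function names and the sample data are hypothetical, and the sketch ignores how signed differences would be represented in an actual ROM; it only shows the encode and decode steps and the role of the extra flag bit.

```python
def difference_encode(values):
    """Store the first entry in full and each later entry as a delta from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def difference_decode(stored):
    """Rebuild the original values with a running sum (the additional adder)."""
    out, acc = [], 0
    for delta in stored:
        acc += delta
        out.append(acc)
    return out


def invert_row(word, width):
    """Row inversion: if more than half of the bits are 1, store the complement
    together with a flag bit so that fewer non-zero terms are stored."""
    ones = bin(word).count("1")
    if ones > width // 2:
        return 1, word ^ ((1 << width) - 1)
    return 0, word


table = [1000, 1003, 1001, 1006]          # slowly varying sequential data (made up)
stored = difference_encode(table)
print(stored, "->", difference_decode(stored))

flag, row = invert_row(0b11101111, 8)
print(f"flag={flag}, stored row={row:08b}")
```

The stored differences need far fewer bits than the full values, and the inverted row holds a single 1 instead of seven, at the cost of an adder in the first case and a flag bit plus output inversion in the second.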

53.3 Flash Memory

In recent years, flash memories have become one of the fastest growing segments of semiconductor memories.13,14 Flash memories are used in a broad range of applications, such as modems, networking equipment, PC BIOS, disk drives, digital cameras, and various new microcontrollers for leading-edge embedded applications. They are primarily used for permanent mass data storage. With the rapidly emerging area of portable computing and mobile telecommunications, the demand for low-power, low-voltage flash memories increases. Under such conditions, flash memories must employ low-power tunneling mechanisms for both write and erase operations, thinner tunneling dielectrics, and on-chip voltage pumps.

Low-Power Circuit Techniques for Flash Memories

In order to prolong the battery life in mobile devices, significant reductions of power consumption in all electronic components have to be achieved. One of the fundamental and most effective methods is a reduction in power supply voltage. This method has also been applied to Flash memories. Designs with a 3.3-V power supply, as opposed to the traditional 5-V power supply, have been reported.15–20 In addition, multi-level architectures that lower the cost per bit, increase memory density, and improve energy efficiency per bit have emerged.17,20 Kawahara et al.22 and Otsuka and Horowitz23 have identified major bottlenecks when designing Flash memories for low-power, low-voltage operation and proposed suitable technologies and techniques for deep sub-micron, sub-2V power supply Flash memory design. Due to its construction, a Flash memory requires high voltage levels for program and erase operations, often exceeding 10 V (Vpp). The core circuitry that operates at these voltage levels cannot be as aggressively


scaled as the peripheral circuitry that operates with standard Vdd. Peripheral devices are designed to improve the power and performance of the chip, whereas core devices are designed to improve the read performance. Parameters such as the channel length, the oxide thickness, the threshold voltage, and the breakdown voltage must be adjusted to withstand high voltages. Technologies that allow two different transistor environments on the same substrate must be used. An example of transistor parameters in a multi-transistor process is given in Table 53.2. TABLE 53.2 Transistor Parameters VDD transistor Channel length Oxide thickness Threshold voltage

nmos 0.6 µm 10 nm 0.4 V

pmos 1.2 µm

VPP transistor nmos

pmos

22.3 nm 0.79 V

0.97 V

Source: © 1997, IEEE. With permission.

Technologies reaching deep sub-micron levels (0.25 µm and lower) can experience three major problems (summarized in Fig. 53.3): (1) layout of the peripheral circuits due to a scaled Flash memory cell; (2) accurate voltage generation for the memory cells to provide the required threshold voltage and narrow deviation; and (3) deviations in dielectric film characteristics caused by large numbers of memory cells. Kawahara et al.22 have proposed several circuit enhancements that address these problems. They proposed a sensing circuit with a relaxed layout pitch, bit-line clamped sensing multiplex, and intermittent burst data transfer for a three times feature-size pitch. They also proposed a low-power dynamic bandgap generator with voltage boosted by using triple-well bipolar transistors and voltage-doubler charge pumping, for accurate generation of 10 to 20 V while operating at Vdd under 2.5 V. They demonstrated these improvements on a 128-Mb experimental chip fabricated using 0.25-µm technology. On the circuit level, three problems have been identified by Otsuka and Horowitz:23 (1) interface between peripheral and core circuitry; (2) sense circuitry and operation margin; and (3) internal high voltage generation.

FIGURE 53.3 Quarter-micron flash memory. (© 1996, IEEE. With permission.)


During program and erase modes, the core circuits are driven with a higher voltage than the peripheral circuits. This voltage is higher than Vdd in order to achieve good read performance. Therefore, a level-shifter circuit is necessary to interface between the peripheral and core circuitry. However, when a standard power supply (Vdd) is scaled to 1.5 V and lower, the threshold voltage of Vpp transistors will become comparable to one half of Vdd or less, which results in significant delay and poor operation margin of the level shifter and, consequently, degrades the read performance. A level shifter is necessary for the row decoder, column selection, and source selection circuit. Since the inputs to the level shifters switch while Vpp is at the read Vpp level, the performance of the level shifter needs to be optimized only for a read operation. In addition to a standard erase scheme, Flash memories utilizing a negative-gate erase or program scheme have been reported.15,19 These schemes utilize a single voltage supply that results in lower power consumption. The level shifters in these Flash memories have to shift a signal from Vdd to Vpp and from Gnd to Vbb. Conventional level shifters suffer from delay degradation and increased power consumption when driven with a low power supply voltage. Several factors contribute to these effects. First, at low Vdd (1.5 V), the threshold voltage of Vpp transistors is close to half the power supply voltage, which results in an insufficient gate swing to drive the pull-down transistors, as shown in Fig. 53.4. This also reduces the operation margin of these shifters against threshold voltage fluctuation of the Vpp transistor. Second, the rapid increase in power consumption at Vdd under 1.5 V is due to dc current leakage from Vpp to Gnd during transient switching. At 1.5 V, 28% of the total power consumption of Vpp is due to dc current leakage.
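The insufficient gate swing can be illustrated with a short calculation. The sketch below computes the gate overdrive (Vdd minus threshold voltage) left for a pull-down transistor as Vdd is scaled. The 0.79 V threshold is taken from Table 53.2 on the assumption that it is the Vpp nMOS value; the function name and the list of supply voltages are illustrative and do not come from the cited work.

```python
def gate_overdrive(vdd, vth):
    """Gate overdrive (V) of a pull-down transistor whose gate swings from 0 to vdd."""
    return max(vdd - vth, 0.0)

# Assumption: 0.79 V is the Vpp nMOS threshold voltage listed in Table 53.2.
VTH_VPP_NMOS = 0.79

for vdd in (3.3, 2.5, 1.5):  # illustrative supply voltages
    vov = gate_overdrive(vdd, VTH_VPP_NMOS)
    print(f"Vdd = {vdd:.1f} V: Vth/Vdd = {VTH_VPP_NMOS / vdd:.2f}, overdrive = {vov:.2f} V")
```

Under these assumptions, the threshold at Vdd = 1.5 V is slightly more than half of the supply, so the pull-down device is left with roughly 0.7 V of overdrive.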
Two signal-shifting schemes have been proposed: one for standard Flash memories and another for negative-gate erase or program Flash memories. The first proposed design is shown in Fig. 53.5. This high-level shifter uses a bootstrapping switch to overcome the degradation due to a low input gate swing and improves the current driving capability of both pull-down drivers. It also improves the switching delay and the power consumption at 1.5 V because the

FIGURE 53.4 Conventional high-level shifter circuits with (a) feedback pMOS, (b) cross-coupled pMOS. (© 1997, IEEE. With permission.)


FIGURE 53.5 A high-level shifter circuit with bootstrapping switch. (© 1997, IEEE. With permission.)

bootstrapping reduces the dc current leakage during the transient switching. Consequently, the bootstrapping technique increases the operation margin. The layout overhead from the bootstrapping circuit, capacitors, and an isolated n-well is negligible compared to the total chip area because it is used only as the interface between the peripheral circuitry and the core circuitry. Figure 53.6 shows the operation of the proposed high-level shifter, and Fig. 53.7 illustrates the switching delay and the power consumption

FIGURE 53.6 Operation of the proposed high-level shifter circuit. (© 1997, IEEE. With permission.)


FIGURE 53.7 Comparison between proposed and conventional high-level shifters. (© 1997, IEEE. With permission.)

versus the power supply voltage of the conventional design and the proposed design. The second proposed design, shown in Fig. 53.8, is a high/low-level shifter that also utilizes a bootstrapping mechanism to improve the switching speed, reduce dc current leakage, and improve operation margin. The operation of the proposed shifter is illustrated in Fig. 53.9. At 1.5 V, the power consumption decreases by 40% compared to a conventional two-stage high/low-level shifter, as shown in Fig. 53.10. The proposed level shifter does not require an isolated n-well and therefore the circuit is suitable for a tight-pitch design and a conventional well layout. In addition to the more efficient level-shift scheme, Otsuka and Horowitz23 also addressed the problem of sensing under very low power supply voltages (1.5 V) and proposed a new self-bias bit-line sensing method that reduces the delay’s dependence on bit-line capacitance and achieves a 19-ns reduction of the sense delay at low voltages. This enhances the power efficiency of the chip. On a system level, Tanzawa et al.25 proposed an on-chip error correcting circuit (ECC) with only 2% layout overhead. By moving the ECC from off-chip to on-chip, the 522-Byte temporary buffers that are required for conventional ECC, and that occupy a large part of the ECC area, have been eliminated. As a result, the area of the ECC circuit has been reduced by a factor of 25. The on-chip ECC has also been optimized, which improved its power efficiency by a factor of two.

53.4 Ferroelectric Memory (FeRAM)

Ferroelectric memory combines the non-volatility of Flash memory with the density and speed of DRAM. Advances in low-voltage, low-power design toward mobile computing applications have been seen in the literature.28,29 Hirano et al.28 reported a new 1-transistor/1-capacitor nonvolatile ferroelectric memory architecture that operates at 2 V with 100-ns access time. They achieved these results using two new improvements: a bit-line-driven read scheme and a non-relaxation reference cell. In previous ferroelectric architectures, either a cell-plate-driven or a non-cell-plate-driven read scheme, as shown in Figs. 53.11(a) and (b), was used.30,31 Although the first architecture could operate at low


FIGURE 53.8 Proposed high/low-level shifter circuit. (© 1997, IEEE. With permission.)

FIGURE 53.9 Operation of the proposed high/low-level shifter circuit. (© 1997, IEEE. With permission.)

supply voltages, the large capacitance of the cell plate, which connects to many ferroelectric capacitors and a large parasitic capacitor, would degrade the performance of the read operation because of the long transient time needed to drive the cell plate. The second architecture suffers from two problems. The first problem is the risk of losing the data stored in the memory due to the leakage current of a capacitor. The storage


FIGURE 53.10 Comparison between proposed and conventional high/low-level shifters. (© 1997, IEEE. With permission.)

node of a memory cell is floating, and the parasitic p-n junction between the storage node and the substrate leaks current. Consequently, the storage node reaches the Vss level while the other node of the capacitor is kept at 1/2 Vdd, which destroys the data. Therefore, this scheme requires a refresh operation of the memory cell data. The second problem arises in low-voltage operation. Because the voltage across the memory cell capacitor is 1/2 Vdd under this scheme, the supply voltage must be twice as high as the coercive voltage of the ferroelectric capacitors, which prevents low-voltage operation. To overcome these problems, Hirano et al.28 have developed a new bit-line-driven read scheme, which is shown in Figs. 53.12 and 53.13. The bit-line-driven circuit precharges the bit-lines to the supply voltage Vdd. The cell plate line is fixed at ground voltage in the read operation. An important characteristic of this configuration is that the bit-lines are driven, while the cell plate is not driven. Also, the precharged voltage level of the bit-lines is higher than that of the cell plate. Figure 53.14 shows the limitations of previous schemes and the new scheme. During the read operation, the first previously presented scheme30 requires a long delay time to drive the cell plate line. The proposed scheme, however, exhibits a faster transient response because the bit-line capacitance is less than 1/100 of the cell plate-line capacitance. The second previously presented scheme31 requires a data refresh operation in order to secure data retention. The read scheme proposed by Hirano et al.28 does not require any refresh operation since the cell plate voltage is at 0 V during the stand-by mode. The reference voltage generated by a reference cell is a critical aspect of low-voltage operation of ferroelectric memory. The reference cell is constructed with one transistor and one ferroelectric capacitor.
While a voltage is applied to the memory cell to read the data, the bit-line voltage read from the reference cell is set to about the midpoint of the “H” and “L” levels read from the main-memory-cell data. The state of the reference cell is set to “Ref” as shown at the left side of Fig. 53.15. However, a ferroelectric capacitor suffers from the relaxation effect, which decreases the polarization as shown at the right side of Fig. 53.15. As a result, each state of the main memory cells and the reference cell is shifted, and the read operation of “H” data becomes marginal, which prohibits scaling of the power supply voltage.


FIGURE 53.11 (a) Cell-plate-driven read scheme, and (b) non-cell-plate-driven read scheme. (© 1997, IEEE. With permission.)

FIGURE 53.12 Memory cell array architecture. (© 1997, IEEE. With permission.)


FIGURE 53.13 Memory cell and peripheral circuit with bit-line-driven read scheme. (© 1997, IEEE. With permission.)

FIGURE 53.14 Limitations of previous schemes and proposed solutions. (© 1997, IEEE. With permission.)

FIGURE 53.15 Reference cell proposed by Sumi et al. in Ref. 30. (© 1997, IEEE. With permission.)

Hirano et al.28 have developed a reference cell that does not suffer from the relaxation effect: it always moves along the curve from the “Ref” point and therefore enlarges the read operation margin for “H” data. This proposed scheme enables low-voltage operation down to 1.4 V. Fujisawa et al.29 addressed the problem of achieving high-speed and low-power operation in ferroelectric memories. Previous designs suffered from excessive power dissipation because they needed a refresh cycle30,31 to compensate for the leakage current from a capacitor storage node to the substrate when the cell plates are fixed at 1/2 Vdd. Figure 53.16 shows a comparison of the power dissipation between ferroelectric memories


FIGURE 53.16 Comparison of the power dissipation between FeRAMs and DRAMs. (© 1997, IEEE. With permission.)

(FeRAMs) and DRAMs. It can be observed that the power consumption of peripheral circuits is identical, but the power consumption of the memory array sharply increases in the 1/2 Vdd plate FeRAMs. These problems can be summarized as follows:
• The memory cell capacitance is large and therefore the capacitance of the data-line needs to be set larger in order to increase the signal voltage of non-volatile data.
• The non-volatile data cannot be read by the 1/2 Vdd subdata-line precharge technique because the cell plate is set to 1/2 Vdd. Therefore, the data-line is precharged to Vdd or Gnd.
When the memory cell density rises, the number of activated data-lines increases. This increases the power dissipation of the array. A selective subdata-line activation technique, shown in Fig. 53.17 and proposed by Hamamoto et al., overcomes this problem. However, its access time is slower than that of all-subdata-line activation because the selective subdata-line activation requires a preparation time. Therefore, neither of these two techniques can simultaneously achieve low-power and high-speed operation.

FIGURE 53.17 Low power dissipation techniques. (© 1997, IEEE. With permission.)

Fujisawa et al.29 demonstrated low-power, high-speed FeRAM operation using an improved charge-share modified (CSM) precharge-level architecture. The new CSM architecture solves the problems of slow access speed and high power dissipation. This architecture incorporates two features that reduce the sensing period, as shown in Fig. 53.18. The first feature is the charge sharing between the parasitic capacitance of the main data-line (MDL) and the subdata-line (SDL). During the stand-by mode, all SDLs and MDLs are precharged to 1/2 Vdd and Vdd, respectively. During the read operation, the


FIGURE 53.18 Principle of the CSM architecture. (© 1997, IEEE. With permission.)

precharge circuits are all cut off from the data lines (time t0). After the y-selection signal (YS) is activated (time t1), the charge in the parasitic capacitance of the MDL (Cmdl) is transferred to the parasitic capacitance of the selected SDL (Csdl), and the selected SDL potential is raised by charge sharing. As a result, the voltage is applied only to the memory cell at the intersection of the selected word-line (WL) and YS. The second feature is the simultaneous activation of WL and YS without causing a loss of the readout voltage. During the write operation, only the data of the selected memory cell is written, whereas all the other memory cells keep their non-volatile data. Consequently, the power dissipation does not increase during this operation. The writing period is equal to the sensing period because WL and YS can also be activated simultaneously in the write cycle.
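The charge-sharing step follows from charge conservation between the two precharged parasitic capacitances. The sketch below is a minimal illustration only: it ignores the ferroelectric cell capacitance, and the supply voltage and both capacitance values are hypothetical numbers, not taken from the cited design.

```python
def charge_share_voltage(c_mdl, v_mdl, c_sdl, v_sdl):
    """Final voltage after two precharged capacitances are connected (charge conservation)."""
    return (c_mdl * v_mdl + c_sdl * v_sdl) / (c_mdl + c_sdl)

# Hypothetical values chosen only for illustration.
VDD = 3.0         # V, supply voltage
C_MDL = 400e-15   # F, parasitic capacitance of the main data-line
C_SDL = 100e-15   # F, parasitic capacitance of the selected subdata-line

# Stand-by precharge levels from the text: MDL at Vdd, SDL at Vdd/2.
v_final = charge_share_voltage(C_MDL, VDD, C_SDL, VDD / 2)
print(f"selected SDL rises from {VDD / 2:.2f} V to {v_final:.2f} V")
```

The larger Cmdl is relative to Csdl, the closer the selected SDL gets to Vdd; unselected SDLs stay at 1/2 Vdd because their YS switches remain open.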

53.5 Static Random-Access Memory (SRAM)

SRAMs have experienced a very rapid development of low-power, low-voltage memory design during recent years due to an increased demand for notebooks, laptops, hand-held communication devices, and IC memory cards. Table 53.3 summarizes some of the latest experimental SRAMs for very low-voltage and low-power operation.

TABLE 53.3 Low-Power SRAMs Performance Comparison

Memory Size (Ref.)   Power Supply   CMOS Technology   Access Time   Power Dissipation
4 Kb (40)            0.9 V          0.6 µm            39 ns         18 µW @ 1 MHz
4 Kb (40)            1.6 V          0.6 µm            12 ns         64 µW @ 1 MHz
32 Kb (44)           1 V            0.35 µm           17 ns         5 µW @ 50 MHz
32 Kb (48)           1 V            0.35 µm           11.8 ns       3 µW @ 10 MHz
32 Kb (49)           1 V            0.25 µm           7.3 ns        0.9 µW @ 100 MHz
32 Kb (42)           1 V            0.25 µm           —             0.9 µW @ 100 MHz
32 Kb (55)           1 V            0.25 µm           7 ns          3.9 µW @ 100 MHz
256 Kb (53)          1.4 V          0.4 µm            60 ns         3.6 µW @ 5 MHz
1 Mb (50)            1 V            0.5 µm            74 ns         1 µW @ 10 MHz
1 Mb (52)            0.8 V          0.35 µm           10 ns         5 µW @ 100 MHz
4.5 Mb (51)          1.8 V          0.25 µm           1.8 ns        2.8 W @ 550 MHz
7.5 Mb (47)          3.3 V          0.6 µm            6 ns          8.42 µW @ 50 MHz
7.5 Mb (58)          3.3 V          0.8 µm            18 ns         4.8 µW @ 20 MHz

In this section, active and passive sources of power dissipation in SRAMs will be discussed and common low-power techniques will be analyzed.


Low-Power SRAMs

Sources of SRAM Power

There are different sources of active and stand-by (data retention) power present in SRAMs. The active power is the sum of the power consumed by the following components:
• Decoders
• Memory array
• Sense amplifiers
• Periphery circuits (I/O circuitry, write circuitry, etc.)

The total active power of an SRAM with an m × n array of cells can be summarized by the expression9,33,34:

$$P_{active} = \left( m\,i_{active} + m(n-1)\,i_{leak} + (n+m)\,f\,C_{DE}V_{INT} + m\,i_{DC}\,\Delta t\,f + C_{PT}V_{INT}\,f + I_{DCP} \right) V_{dd} \qquad (53.1)$$

where iactive is the effective current of selected cells, ileak is the effective data retention current of the unselected memory cells, CDE is the output node capacitance of each decoder, VINT is the internal power supply voltage, iDC is the dc current consumed during the read operation, ∆t is the activation time of the dc current consuming parts (i.e., sense amplifiers), f is the operating frequency, CPT is the total capacitance of the CMOS logic and the driving circuits in the periphery, and IDCP is the total static (dc) or quasistatic current of the periphery. Major sources of IDCP are column circuitry and differential amplifiers on the I/O lines. The stand-by power of an SRAM is dominated by the data retention current of all m × n cells, mn·ileak, because the static current from other sources is negligibly small (sense amplifiers are disabled during this mode). Therefore, the total stand-by power can be expressed as:

$$P_{standby} = m\,n\,i_{leak}\,V_{dd} \qquad (53.2)$$
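Equations (53.1) and (53.2) translate directly into code, which can help when comparing the individual contributions. The following is a minimal sketch; the function names and every numeric value in the example call are hypothetical placeholders, not data from the cited references.

```python
def sram_active_power(m, n, i_active, i_leak, c_de, v_int, i_dc, dt, c_pt, i_dcp, f, vdd):
    """Eq. (53.1): total active power (W) of an SRAM with an m x n cell array."""
    current = (m * i_active                    # selected cells
               + m * (n - 1) * i_leak          # unselected cells (data retention)
               + (n + m) * f * c_de * v_int    # decoder output nodes
               + m * i_dc * dt * f             # dc current during read (sense amplifiers)
               + c_pt * v_int * f              # periphery CMOS logic and drivers
               + i_dcp)                        # static periphery current
    return current * vdd


def sram_standby_power(m, n, i_leak, vdd):
    """Eq. (53.2): stand-by (data retention) power (W)."""
    return m * n * i_leak * vdd


# Hypothetical operating point, for illustration only.
p_act = sram_active_power(m=128, n=256, i_active=50e-6, i_leak=1e-12,
                          c_de=50e-15, v_int=1.0, i_dc=100e-6, dt=2e-9,
                          c_pt=20e-12, i_dcp=100e-6, f=50e6, vdd=1.0)
p_sby = sram_standby_power(m=128, n=256, i_leak=1e-12, vdd=1.0)
print(f"active: {p_act * 1e3:.2f} mW, stand-by: {p_sby * 1e9:.1f} nW")
```

Evaluating each term separately in this way shows which of the techniques in the following subsection targets the largest contribution for a given design point.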

Techniques for Low-Power Operation

In order to significantly reduce the power consumption in SRAMs, all contributors to the total power must be targeted. The most efficient techniques used in recent memories are:
• Capacitance reduction of word-lines (and the number of cells connected to them), data-lines, I/O lines, and decoders
• DC current reduction using new pulse operation techniques for word-lines, periphery circuits, and sense amplifiers
• AC current reduction using new decoding techniques (i.e., multi-stage static CMOS decoding)
• Operating voltage reduction
• Leakage current reduction (in active and stand-by mode) utilizing multiple threshold voltage (MT-CMOS) or variable threshold voltage (VT-CMOS) technologies

Capacitance Reduction

The largest capacitive elements in a memory are word-lines, bit-lines, and data-lines, each with a number of cells connected to them. Therefore, reducing the size of these lines can have a significant impact on power consumption reduction. A common technique often used in large memories is called Divided Word Line (DWL), which adopts a two-stage hierarchical row decoder structure as shown in Fig. 53.19.34 The number of sub-word-lines connected to one main word-line in the data-line direction is generally four, substituting the area of a main row decoder with the area of a local row decoder. DWL features two-step decoding for selecting one word-line, greatly reducing the capacitance of the address lines to a row decoder and the word-line RC delay.


FIGURE 53.19 Divided word-line structure (DWL). (© 1995, IEEE. With permission.)

A single bit-line cross-point cell activation (SCPA) architecture reduces the power further by improving the DWL technique.36 The architecture enables the smallest column current possible without increasing the block division of the cell array, thus reducing the decoder area and the memory core area. The cell architecture is shown in Fig. 53.20. The access transistors are controlled by the Y-address as well as the X-address. Since only the one memory cell at the cross-point of the X- and Y-lines is activated, a column current is drawn only by the accessed cell. As a result, the column current is minimized. In addition, SCPA allows the number of blocks to be reduced because the column current is independent of the number of block divisions. The disadvantage of this configuration is that during the write “high” cycle, both X- and Y-lines have to be boosted using a word-line boost circuit.

FIGURE 53.20 Memory cell used for SCPA architecture. (© 1994, IEEE. With permission.)

Caravella proposed a subdivision technique similar to DWL, which he demonstrated on a 64 × 64 bit-cell array.39,40 If Cj is the parasitic capacitance associated with a single bit cell load on a bit-line (junction and metal) and Cch is the parasitic capacitance associated with a single bit cell on the word-line (gate, fringe, and metal), then the total bit-line capacitance is 64 × Cj and the total word-line capacitance is 64 × Cch. If the array is divided into four isolated sub-arrays of 32 × 32 bit cells, the total bit-line and word-line capacitances are halved, as shown in Fig. 53.21. The total capacitance per read/write that needs to be discharged or charged is 1024 × Cj + 32 × Cch for the sub-array architecture, as opposed to 4096 × Cj + 64 × Cch for the 64 × 64 array. This technique carries a penalty due to additional decode and control logic and routing.
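The capacitance figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes, as the quoted totals imply, that every bit-line of the accessed array plus the one selected word-line is switched per access. The function name and the normalized unit capacitances are illustrative.

```python
def access_capacitance(rows, cols, c_j, c_ch):
    """Capacitance switched per access: all `cols` bit-lines (each loaded by `rows`
    cells) plus the one selected word-line (loaded by `cols` cells)."""
    return cols * rows * c_j + cols * c_ch

# Normalized unit capacitances (hypothetical); real values depend on the process.
C_J, C_CH = 1.0, 1.0

full = access_capacitance(64, 64, C_J, C_CH)  # 4096*Cj + 64*Cch
sub = access_capacitance(32, 32, C_J, C_CH)   # 1024*Cj + 32*Cch
print(f"sub-array / full-array switched capacitance = {sub / full:.3f}")
```

With equal unit capacitances, the 32 × 32 sub-array switches roughly one quarter of the capacitance of the undivided array, before accounting for the extra decode, control, and routing overhead mentioned in the text.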


FIGURE 53.21 Memory architecture. (© 1997, IEEE. With permission.)

Pulse Operation Techniques

Pulsing the word-lines, equalization, and sense lines can shorten the active duty cycle and thus reduce the power dissipation. In order to generate different pulse signals, an on-chip address transition detection (ATD) pulse generator is used.34 This circuit, shown in Fig. 53.22, is a key element for the active power reduction in memories.

FIGURE 53.22 Address transition detection circuits: (a) and (b) ATD pulse generators; (c) ATD pulse wave-forms; and (d) a summation circuit of all ATD pulses generated from all address transitions. (© 1995, IEEE. With permission.)

An ATD generator consists of delay circuits (i.e., inverter chains) and an XOR circuit. The ATD circuit generates a φ(ai) pulse every time it detects an “L”-to-“H” or “H”-to-“L” transition on the input address signal ai. Then, all ATD-generated pulses from all address transitions are summed through an OR gate into a single pulse φATD. This final pulse is usually stretched out with a delay circuit to generate the different pulses needed in the SRAM, which are used to reduce power or to speed up signal propagation. Pulsed operation techniques are also used to reduce power consumption by reducing the signal swing on high-capacitance predecode lines, write-bus-lines, and bit-lines without sacrificing performance.37,42,49 These techniques target the power that is consumed during write and decode operations.
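The behavior of the ATD generator can be mimicked with a small discrete-time model: each address bit is XORed with a delayed copy of itself, and the per-bit pulses are ORed together. This is only a behavioral sketch; the sampled waveforms, the delay of two samples, and the function names are assumptions made for illustration.

```python
def delay(signal, d):
    """Delay a sampled logic signal by d samples, holding the initial value."""
    return [signal[0]] * d + signal[:len(signal) - d]

def atd_pulse(signal, d):
    """XOR of an address bit with its delayed copy: high for d samples after each edge."""
    return [a ^ b for a, b in zip(signal, delay(signal, d))]

def atd_sum(address_bits, d):
    """OR of the per-bit ATD pulses, giving the single pulse phi_ATD."""
    pulses = [atd_pulse(bit, d) for bit in address_bits]
    return [int(any(column)) for column in zip(*pulses)]

# Two hypothetical address bits sampled over 12 time steps.
a0 = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
a1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(atd_sum([a0, a1], d=2))
```

Both rising and falling edges produce a pulse whose width equals the delay, which matches the role of the inverter chain in Fig. 53.22.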


Most of the power savings come from operating the bit-lines from Vdd/2 rather than Vdd. This approach is based on the new half-swing pulse-mode gate family. Figure 53.23 shows a half-swing pulse-mode AND gate. The operating principle is the merger of a voltage-level converter with a logical AND. A positive half-swing (a transition from the rest state Vdd/2 to Vdd and back to Vdd/2) and a negative half-swing (a transition from the rest state Vdd/2 to Gnd and back to Vdd/2), combined with the receiver-gate logic style, result in a full gate overdrive with negligible effects of the low-swing inputs on the performance of the receiver. This structure is combined with self-resetting circuitry and a PMOS leaker to improve the noise margin and the speed of the output reset transition, as shown in Fig. 53.24.

FIGURE 53.23 Half-swing pulse-mode AND gate: (a) NMOS-style, and (b) PMOS-style. (© 1998, IEEE. With permission.)

Both negative and positive half-swing pulses can reduce the power consumption further by using charge recycling. The charge used to produce the assert transition of a positive pulse can also be used to produce the reset transition of a negative pulse. If the capacitances of positive and negative pulses match, then no current is drawn from the Vdd/2 power supply (the Vdd/2 voltage is generated by an on-chip voltage converter). By combining the half-swing pulse-mode logic with the charge recycling technique, 75% of the power on high-capacitance lines can be saved.49

AC Current Reduction

One of the circuit techniques that reduces AC current in memories is multi-stage decoding. Fast static CMOS decoders are commonly based on OR/NOR and AND/NAND architectures. Figure 53.25 shows one example of a row decoder for a three-bit address. The input buffers drive the interconnect capacitance


FIGURE 53.24 Self-resetting half-swing pulse-mode gate with a PMOS leaker. (© 1998, IEEE. With permission.)

of the address line and also the input capacitance of the NAND gates. By using a two-stage decode architecture, the number of transistors, the fan-in, and the loading on the address input buffers are reduced, as shown in Fig. 53.26. As a result, both speed and power are optimized. The signal φx, generated by the ATD pulse generator, enables the decoder and secures pulse-activated word-lines.

Operating Voltage Reduction and Low-Power Sensing Techniques

Operating voltage reduction is the most powerful method for power conservation. Power supply voltage reductions down to 1 V35,42,44,46,48–50,55 and below40,52,53 have been reported. This aggressively scaled environment requires new fast and low-power sensing schemes. A charge-transfer sense

FIGURE 53.25 A row decoder for a three-bit address.


a + b : number of bits for row decoding.

FIGURE 53.26 A two-stage decoder architecture.

amplifying scheme combined with a dual-Vt CMOS circuit achieves a fast sensing speed and a very low power dissipation at a 1-V power supply.44,55 At this voltage level, the threshold-voltage “roll-off” versus gate length at the shortest gate lengths causes a Vth mismatch between the pair of MOSFETs in the differential sense amplifier. Figure 53.27 shows the schematic of a charge-transfer sense amplifier. The charge-transfer (CT) transistors perform the sensing and act as a cross-coupled latch. For the read operation, the supply voltage of the sense amplifiers is switched from 1 V to 1.5 V by p-MOSFETs. The threshold voltage mismatch between the two CTs is completely compensated because the CTs themselves form a latch. Consequently, the bit-line precharge time before the word-line pulse can be omitted due to the improved sensitivity. The cycle time is shortened because all clock timing signals in the read operation are completed within the width of the word-line pulse.
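Returning briefly to the two-stage row decoder of Fig. 53.26 discussed under AC current reduction, its logical behavior can be sketched as follows. The a + b address bits are split into two groups that are predecoded separately, and each word-line is selected by a 2-input AND of one line from each group. This is a functional model under assumed names (`one_hot`, `two_stage_decode`), not a description of the circuit in the cited work.

```python
from itertools import product

def one_hot(bits):
    """Fully decode a tuple of address bits (MSB first) into a one-hot list."""
    index = int("".join(str(b) for b in bits), 2) if bits else 0
    return [int(i == index) for i in range(2 ** len(bits))]

def two_stage_decode(address, a):
    """Predecode the upper `a` bits and the remaining lower bits separately, then AND
    one line from each group to select one of 2**len(address) word-lines."""
    upper, lower = one_hot(address[:a]), one_hot(address[a:])
    return [u & l for u, l in product(upper, lower)]

# Three-bit address split as a = 1 upper bit and b = 2 lower bits (illustrative split).
word_lines = two_stage_decode((1, 0, 1), a=1)
print(word_lines.index(1))
```

In this model, each final gate has a fan-in of two regardless of the address width, whereas a single-stage decoder would need an (a + b)-input gate per word-line.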

FIGURE 53.27 Charge-transfer sense amplifier. (© 1998 IEEE. With permission.)


Another method is the step-down, boosted-word-line scheme combined with current-sensing amplification. Boosting a selected word-line voltage shortens the bit-line delay before the stored data are sensed. However, boosting by itself increases the power dissipation and the transition time because of the enhanced bit-line swing. The power consumption during word-line selection is therefore reduced by stepping down the selected word-line potential.46 The operation of this scheme is shown in Fig. 53.28. The boost of the selected word-line is restricted to a short period at the beginning of the memory-cell access. This enables an early sensing operation. When the bit-lines are sensed, the word-line potential is reduced to the supply voltage level to suppress the power dissipation. Reduced signals on the bit-lines are sufficient to complete the read cycle with the current sensing. A fast read operation is obtained with little power penalty. The step-down boosting method is also used for the write operation. The circuit diagram of this method is shown in Fig. 53.29. Word drivers are connected to the boosted-pulse generator via switches S1 and S2. These switches separate the parasitic capacitance CB from the boosted line, thus reducing its capacitance. NMOS transistors are more suitable for implementing these switches because they do not require a level-shift circuit. Transistor Q1 is used for the stepping-down function. During the boost, its gate electrode is set to Vdd. If the word-line voltage exceeds Vdd + |Vtp| (where |Vtp| is the threshold voltage of Q1), Q1 turns on and the word-line is clamped. After the stepping-down process, φSEL switches low and Q1 guarantees the Vdd voltage on the word-line.
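The clamping and step-down behavior described above can be summarized in a small behavioral model. It is a sketch only: the phase names, the function name, and the numeric values (supply, threshold, and boost target) are hypothetical and are not taken from the cited design.

```python
def word_line_voltage(phase, v_boost_target, vdd, vtp_abs):
    """Idealized selected-word-line voltage in each phase of the step-down scheme."""
    if phase == "boost":
        # Q1 (gate at Vdd) turns on and clamps the line once it exceeds Vdd + |Vtp|.
        return min(v_boost_target, vdd + vtp_abs)
    if phase == "step_down":
        # After the step-down, the word-line is held at the supply level.
        return vdd
    return 0.0  # word-line not selected

# Hypothetical values, for illustration only.
VDD, VTP_ABS, V_BOOST = 1.0, 0.4, 1.8
for phase in ("idle", "boost", "step_down"):
    print(phase, word_line_voltage(phase, V_BOOST, VDD, VTP_ABS))
```

The model makes the trade-off visible: the line sees the boosted (clamped) level only at the start of the access and spends the rest of the cycle at Vdd.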

FIGURE 53.28 Step-down boosted word-line scheme: (a) conventional, (b) step-down boosted word-line, (c) bit-line transition, and (d) current consumption of a selected memory cell. (© 1998 IEEE. With permission.)

An efficient method for reducing the AC power of bit-lines and data-lines is to use current-mode read and write operations based on new current-based circuit techniques.47,56,57 Wang et al. proposed a new SRAM cell that supports current-mode operations with very small voltage swings on bit-lines and data-lines. A fully current-mode technique consumes only 30% of the power consumed by a previous current-read-only design. Very small voltage swings on bit-lines and data-lines lead to a significant reduction of ac power. The new memory cell has seven transistors, as shown in Fig. 53.30. The additional


FIGURE 53.29 Circuit schematic of step-down boosted word-line method. (© 1998 IEEE. With permission.)

FIGURE 53.30 New 7-transistor SRAM memory cell. (© 1998, IEEE. With permission.)

transistor Meq clears the content of the memory cell prior to the write operation; that is, it performs the cell equalization. This transistor is turned off during the read operation so that it does not disrupt normal operation. An n-type current conveyor is inserted between the data input cell and the memory cell in order to perform a current-mode write operation, which is complementary to the read operation. The equalization transistor is sized to be as large as possible to speed up equalization without increasing the cell size. After suitable sizing, the new 7-transistor cell is 4.3% smaller than its 6-transistor counterpart, as illustrated in Fig. 53.31. Another new current-mode sense amplifier for a 1.5-V power supply was proposed by Wang and Lee.57 The new circuit overcomes the pattern-dependency problem of a conventional sense amplifier by implementing a modified current conveyor. The pattern-dependency problem limits the scaling of the operating voltage. Also, the circuit does not consume any DC power because it is constructed as a complementary device. As a result, the power consumption is reduced by 61 to 94% compared with a


FIGURE 53.31 SRAM cell layout: (a) 6T cell, and (b) new 7T cell. (© 1998, IEEE. With permission.)

conventional design. The circuit structure of the modified current conveyor is similar to a conventional current conveyor design. However, an extra PMOS transistor Mp7, as seen in Fig. 53.32, is used. The transistor is controlled by the RX signal (the complement of CS). After every read cycle, transistor Mp7 is turned on and equalizes nodes RXP and RXN, which eliminates any residual differential voltage between these two nodes (a limitation of conventional designs).

FIGURE 53.32 SRAM read circuitry with the new current-mode sense amplifier. (© 1998, IEEE. With permission.)


Leakage Current Reduction

In order to effectively reduce the dynamic power consumption, the threshold voltage is reduced along with the operating voltage. However, low threshold voltages increase the leakage current during both active and stand-by modes. The fundamental method for leakage current reduction is a dual-Vth or a variable-Vth circuit technique. An example of one such technique is shown in Fig. 53.33.44,55 Here, high Vth MOS transistors are utilized to reduce the leakage current during stand-by mode. As the supply voltage for the word decoder (g) is lowered to 1 V, all transistors forming the decoder are low Vth to retain high performance. The leakage currents during the stand-by mode are substantially reduced by a cut-off switch (SWP, SWN). SWN consists of a high Vth transistor, and SWP consists of a low Vth transistor. Both switches are controlled by a 1.5-V signal. Hence, SWN gains considerable conductivity. SWP can be quickly cut off because of the reverse-biasing. The operating voltage of the local decoder (w) is boosted to 1.5 V. The high operating voltage gives sufficient drivability even to high Vth transistors.
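The sensitivity of leakage to threshold voltage can be estimated with the standard first-order subthreshold model, in which the leakage current falls by one decade for every S volts of threshold increase (S being the subthreshold swing). The sketch below uses that model; the swing of 85 mV/decade and the two threshold values are assumptions chosen for illustration, not parameters of the cited circuits.

```python
def leakage_ratio(vth_low, vth_high, swing=0.085):
    """Ratio I_leak(low Vth) / I_leak(high Vth) for the first-order model
    I_leak ~ 10 ** (-Vth / S), where S is the subthreshold swing in V/decade."""
    return 10 ** ((vth_high - vth_low) / swing)

# Hypothetical thresholds: 0.2 V for the fast logic, 0.5 V for the cut-off switch.
ratio = leakage_ratio(vth_low=0.2, vth_high=0.5)
print(f"a low-Vth device leaks about {ratio:.0f}x more than a high-Vth device")
```

This exponential dependence is the reason a single high-Vth cut-off switch in series with low-Vth logic can suppress stand-by leakage by orders of magnitude.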

FIGURE 53.33 Dual Vth CMOS circuit scheme. (© 1998, IEEE. With permission.)

FIGURE 53.34 Dynamic leakage cut-off scheme: (a) circuit schematic and (b) its operation. (© 1998, IEEE. With permission.)


This technique belongs to schemes that use dynamic boosting of the power supply voltage and word-lines. However, in these schemes, the gate voltage of MOSFETs is often raised to more than 1.4 V, although the operating voltage is 0.8 V. This creates reliability problems. Kawaguchi et al.54 introduced a new technique, the dynamic leakage cut-off (DLC) scheme. Operation waveforms are shown in Fig. 53.34. The key feature of this architecture is a dynamic change of the n-well and p-well bias voltages to Vdd and Vss, respectively, for selected memory cells. At the same time, the nonselected memory cells are biased with ~2Vdd for VNWELL and ~–Vdd for VPWELL. As a result, the Vth of the selected cells becomes low, which provides high drive and thus fast operation. On the other hand, the Vth of the unselected memory cells is high enough to achieve low subthreshold current consumption. This technique is similar to the Variable Threshold CMOS (VT CMOS) technique; the difference is in the synchronization signal of the well bias. In VT CMOS, the well bias is synchronized with a stand-by signal, whereas in the DLC technique it is synchronized with the word-line signal. Nii et al.48 improved the MT-CMOS technique further and proposed the Auto-Backgate Controlled (ABC) MT-CMOS method. The ABC MT-CMOS significantly reduces the leakage current during the “sleep” mode. The circuit diagram of this method is shown in Fig. 53.35. Transistors Q1-Q4 are high-threshold devices that act as switches to cut off the leakage current. The internal circuitry is designed with low-Vt devices. During the active mode, signal SL is pulled low and its complement is pulled high. Q1, Q2, and Q3 turn on, Q4 turns off, and the virtual power supply VVDD and the substrate bias BP become 1 V. During the sleep mode, signal SL is pulled high, its complement is pulled low, and Q1, Q2, and Q3 turn off, whereas Q4 turns on and BP becomes 3.3 V.
The leakage current that flows from Vdd2 to ground through D1, and D2 determines voltages Vd1, Vd2, and Vm. Vd1 is a bias between the source and the substrate of the PMOS transistors, Vd2 is a bias of the NMOS transistors, and Vm is a voltage between the virtual power line VVDD and the virtual ground VGND. The leakage current is reduced to 20 pA/cell.

FIGURE 53.35

A schematic diagram of ABC-MT-CMOS circuit. (© 1998, IEEE. With permission.)

53.6 Dynamic Random-Access Memory (DRAM)

Like the memory types discussed previously, DRAM has undergone remarkable development toward higher access speed, higher density, and reduced power.34,61–64 A variety of techniques targeting the various sources of power dissipation in DRAMs have been reported. In this section, the sources of power consumption are discussed first, and then several methods for reducing the active and data-retention power in DRAMs are described.


Low-Power DRAM Circuits

Sources of DRAM Power

The total power dissipated in a DRAM has two components: the active power and the data-retention power. The major contributors to the active power are the decoders (row and column), the memory array, the sense amplifiers, the DC current dissipation of other circuits (refresh circuitry, a substrate back-bias generator, a boosted level generator, a voltage reference circuit, a half-Vdd generator, and a voltage-down converter), and the remaining periphery circuits (main sense amplifier, I/O buffers, write circuitry, etc.). The total active power can be described as:

$$P_{\mathrm{active}} = \left[ \left( m C_D \Delta V_D + C_{PT} V_{INT} \right) f + I_{DCP} \right] V_{dd} \qquad (53.3)$$

where CD is the data-line capacitance, ∆VD is the data-line voltage swing (0.5 Vdd), m is the number of cells connected to the activated data-line, CPT is the capacitance of the periphery circuits, VINT is the internal supply voltage, f is the operating frequency, and IDCP is the static current. The total data-retention power is given as:

$$P_{\mathrm{retention}} = \left[ \left( m C_D \Delta V_D + C_{PT} V_{INT} \right) \left( n / t_{REF} \right) + I_{DCP} \right] V_{dd} \qquad (53.4)$$

where n is the number of words that require refresh and 1/tREF is the frequency of the refresh operation (current).

Techniques for Low-Power Operation

To reduce power consumption during both modes of DRAM operation, many circuit techniques can be applied, including:

• Capacitance reduction, especially of data-lines, word-lines, and shared I/O, using partial activation of multi-divided data-lines and partial activation of multi-divided word-lines
• Lowering of external and internal voltages
• DC power reduction of peripheral circuits during the active mode by using static CMOS decoders, pulse techniques, and ATD circuits, similar to SRAMs
• Refresh power reduction (in addition to capacitance reduction and operating voltage reduction, which also apply to the refresh mode, decreasing the frequency of the refresh cycle or decreasing the number of words n that require refresh lowers the total refresh power)
• AC and DC power reduction of circuits such as a voltage-down converter (VDC), a half-voltage generator (HVG), a boosted voltage generator (BVG), and a back-bias generator (BBG)

Capacitance Reduction

Charging and discharging the large data-lines and word-lines accounts for a large share of the power dissipated in a DRAM.34,64 Minimizing the capacitance of these lines can therefore yield significant power savings. Two fundamental methods are used to reduce capacitance in DRAMs: partial activation of a multi-divided data-line and partial activation of a multi-divided word-line. The concept of both techniques is shown in Figs. 53.36 and 53.37. Partial activation of a multi-divided data-line (Fig. 53.36) is based on reducing the number of memory cells connected to an active data-line, thus reducing its capacitance CD. The data-lines are divided into small sections with shared I/O circuitry and a shared sense amplifier. Sharing these resources reduces CD further. The partial activation is performed by activating only one sense amplifier along the data-line.

The principle of partial activation of a multi-divided word-line (see Fig. 53.37) is very similar to that used in SRAMs. A single word-line is divided into several sub-word-lines, each driven by a sub-word-line driver (SWL). Every SWL has to be selected by the main word-line (MWL) and the row select line signal (RX). Thus, only part of the word-line is activated.
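As a rough illustration of Eqs. 53.3 and 53.4, the following Python sketch evaluates both expressions and shows how a smaller data-line capacitance or a longer refresh interval lowers the power. All numerical values, the function names, and the assumption that dividing a data-line into four sections cuts CD to about one quarter are hypothetical; they are chosen only to exercise the formulas and are not taken from any reported design.

```python
def active_power(m, c_d, dv_d, c_pt, v_int, f, i_dcp, v_dd):
    """Eq. 53.3: P_active = [(m*C_D*dV_D + C_PT*V_INT)*f + I_DCP] * V_dd."""
    return ((m * c_d * dv_d + c_pt * v_int) * f + i_dcp) * v_dd


def retention_power(m, c_d, dv_d, c_pt, v_int, n, t_ref, i_dcp, v_dd):
    """Eq. 53.4: same charge per cycle, but the cycle rate is n / t_REF."""
    return ((m * c_d * dv_d + c_pt * v_int) * (n / t_ref) + i_dcp) * v_dd


# Hypothetical operating point (illustrative values only).
V_DD = 3.3
params = dict(m=1024, c_d=200e-15, dv_d=0.5 * V_DD, c_pt=100e-12,
              v_int=2.5, i_dcp=1e-3, v_dd=V_DD)

p_full = active_power(f=10e6, **params)

# Assumption: dividing each data-line into 4 sections gives C_D / 4.
divided = {**params, "c_d": params["c_d"] / 4}
p_divided = active_power(f=10e6, **divided)

# Retention: 4096 words refreshed once per 64 ms versus once per 256 ms.
p_ret_64 = retention_power(n=4096, t_ref=64e-3, **params)
p_ret_256 = retention_power(n=4096, t_ref=256e-3, **params)

print(f"active, undivided data-line: {p_full * 1e3:.3f} mW")
print(f"active, divided data-line:   {p_divided * 1e3:.3f} mW")
print(f"retention, t_REF = 64 ms:    {p_ret_64 * 1e3:.3f} mW")
print(f"retention, t_REF = 256 ms:   {p_ret_256 * 1e3:.3f} mW")
```

With these assumed numbers the static term IDCP quickly dominates the retention power, which is one reason the DC current of the on-chip generators receives separate attention.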


FIGURE 53.36

Multi-divided data-line architecture. (© 1995, IEEE. With permission.)

FIGURE 53.37

Hierarchical word-line architecture. (© 1995, IEEE. With permission.)
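The selection rule of the hierarchical word-line architecture can be sketched as a small logical model: a sub-word-line is driven only when both its main word-line (MWL) and its row select signal (RX) are asserted. The array dimensions, the way the row address is split between MWL and RX, and the function name are assumptions made for illustration; the actual organization in the cited designs may differ.

```python
def driven_sub_word_lines(row, block, num_mwl=8, rx_per_mwl=4, num_blocks=4):
    """Return the (mwl, block, rx) sub-word-lines driven for one access.

    Assumed model: upper row-address bits select the MWL, lower bits select
    the RX line, and RX is asserted only in the addressed block.
    """
    mwl_sel, rx_sel = divmod(row, rx_per_mwl)
    driven = []
    for mwl in range(num_mwl):
        for blk in range(num_blocks):
            for rx in range(rx_per_mwl):
                mwl_on = mwl == mwl_sel
                rx_on = blk == block and rx == rx_sel
                if mwl_on and rx_on:  # sub-word-line driver acts as an AND gate
                    driven.append((mwl, blk, rx))
    return driven


driven = driven_sub_word_lines(row=13, block=2)
total = 8 * 4 * 4
print(f"{len(driven)} of {total} sub-word-lines driven: {driven}")
```

In this model only one short sub-word-line is charged per access, rather than a word-line spanning all blocks, which is the source of the capacitance saving.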

A similar method, a hierarchical decoding scheme with a dynamic CMOS series-logic predecoder, has been proposed for synchronous DRAMs (SDRAMs).65,66 This method targets the power losses in the peripheral region of the memory, where power is consumed by the large capacitive loading of the data-lines, the address-lines, and the predecoder lines. The scheme is shown in Fig. 53.38. The hierarchical decoder uses predecoded signal lines, and the redundancy circuits are connected directly to the global lines. This results in reduced capacitive loading and a 50% reduction in the number of bus lines (column and row decoders). This circuit technique can be combined with a small-swing single-address driver with a dynamic predecoder.65,66 This scheme allows a reduction of 23 address lines. The schematic diagram of this circuit is shown in Fig. 53.39. The scheme also achieves a small swing on the address lines by using a short-pulse-driven pull-up transistor together with a level holder supplied from half-VINT. The pull-up of the reduced-swing bus line is driven by a short pulse whose width brings the bus signal close to the small-swing voltage (VINTL).

FIGURE 53.38 A decoding scheme with the hierarchical predecoded row signal and global signals shared with redundancy. (© 1998, IEEE. With permission.)

DC Current Reduction

During the active mode, most of the DC power in DRAMs and SDRAMs is consumed by the periphery circuits and I/O lines. The decoding and pulsed-operation techniques based on an ATD circuit, similar to those used for SRAMs, can be applied. To minimize the power consumption of the I/O lines in SDRAMs, two circuit techniques have been proposed.68 In the first technique, the extended small-swing read operation (∆VI/O = ±200 mV), the small-swing data paths (local I/O and global I/O) are extended up to the output buffer stages through the main I/O (MIO) lines (see Fig. 53.39). Shared current sense amplifiers (I/O sense amplifiers) also reduce power consumption. In the second technique, the single-I/O-line driving write operation halves the operating current of the long global I/O lines and main I/O lines. By combining these two methods, as much as 30% of the total peripheral power can be saved.

FIGURE 53.39

Block diagram of I/O datapath. (© 1996, IEEE. With permission.)

Another power-saving method for low-power SDRAMs is based on a new cell-operating concept.69 When the operating voltage of the memory array is scaled to 1.8 V for 1-Gb SDRAMs, the performance degrades significantly for two reasons. First, the sensing speed decreases due to the noticeable threshold voltage of source-floated transistors. Second, a triple-pumping circuit may be required to increase the power of the boosted word-lines (relatively high Vpp). In the proposed method, the bit-lines are precharged to ground level (Vss), as compared with 1/2 Vdd in conventional schemes, and the word-line reset voltage is –0.5 V, so that cell leakage current is prevented while the threshold voltage of the pass transistors is lowered. This eliminates word-line boosting, so the triple-pumping circuit is no longer required.

Operating Voltages Reduction

Lowering the external and internal operating voltages is an important technique for achieving significant power savings. In both active and stand-by modes, the voltages Vdd, VINT, and ∆VD that appear in Eqs. 53.3 and 53.4 contribute heavily to the total power consumption.

Over the last decade, the external power supply voltage Vdd for DRAMs has been steadily reduced, from 12 V down to 3.3, 2.5, and 1.2 V.66,67,69,76,79 An experimental circuit with Vdd as low as 1 V has recently been reported.77 The lack of a universal standard for the external power supply voltage has resulted in DRAMs with an on-chip voltage-down converter (VDC), which accepts a widely used power supply voltage Vdd, such as 5 V or more recently 3.3 V, and lowers the operating voltage of the memory core, thus saving power.33,34,73 The VDC is one of the most important circuits for achieving DRAM operation at battery voltage levels. In power-limited applications, the VDC must have a stand-by current of less than 1 µA over a wide range of operating temperatures and of process and power supply voltage variations. Its output impedance must also be low.

There are additional on-chip voltage generators: the half-Vdd generator (HVG) for precharging the bit-lines; the back-bias generator (BBG) for reducing subthreshold current and junction capacitance, improving device isolation and latch-up immunity, and protecting circuits against voltage undershoots of input signals; and the boosted voltage generator (BVG) for driving the word-lines.33,34

The HVG circuit has been used since the 1-Mb DRAM generation. It is an efficient way to reduce the voltage swing on the bit-lines from a full Vdd swing to a 1/2Vdd swing. During sensing, one bit-line switches from 1/2Vdd to Vdd and the other from 1/2Vdd to ground. As a result, the peak switching current is reduced and the noise level is suppressed. Recently, a new technique that eliminates 1/2Vdd bit-line switching was proposed.70 This method, called “non-precharged bit-line sensing” (NPBS), provides the following three features (as seen in Fig. 53.40): (1) the precharge operation time is reduced by 78% because the bit-lines are not substantially precharged; (2) the sensing speed increases because the bit-lines that have not been precharged remain at low or high levels, increasing the VGS and VDS voltages of the sense amplifier transistor; and (3) the power dissipation is reduced when the same data occur on the bit-line. The power is reduced by about 43%.

FIGURE 53.40

NPBS circuit and its operation. (© 1998, IEEE. With permission.)

In order to maintain or improve the speed and reliability of DRAM operation, the threshold voltage Vt has to follow the same scaling pattern as the main power supply voltage. This, however, results in a rapid increase of leakage currents in the entire memory during both active and stand-by modes. Therefore, an internal back-bias generator (BBG) circuit, also known as a charge pump, is needed to improve low-voltage, low-power operation by reducing the subthreshold currents. Figure 53.41 shows the schematic of a pumping circuit that avoids the Vt losses.71 When the clock (clk) is at logic low, the voltage of node A reaches |Vtp| – Vdd. The PMOS transistor p1 clamps the voltage of node B to the ground level. The VBB voltage settles at |Vtp| – Vdd – Vtn. When clk changes to logic high, node A changes to |Vtp| and node B is capacitively coupled to –Vdd. As a result, the VBB voltage changes to –Vdd. This circuit requires triple-well technology to eliminate minority carrier injection of the n1 transistor. To limit the power consumption of this circuit during the DRAM’s stand-by mode, the frequency of the clk signal can be reduced. This can be implemented with the BBG’s own ring oscillator, controlled by the BBG’s enable signal.

FIGURE 53.41

Low-voltage pumping circuit.

A boosted voltage generator (BVG) is used in DRAMs to generate a supply voltage higher than Vdd for driving the word-lines. This word-line voltage is higher than Vdd by at least the threshold voltage. The boosted level cannot be applied directly to drive the load; an isolation transistor is necessary to separate the switching boosted voltage from the load. One such arrangement is shown in Fig. 53.42.72 This particular circuit generates an output of 2Vdd. Voltage scaling has no effect on its performance and, therefore, it is suitable for Vdd reduction down to sub-1-V levels.

Leakage Current Reduction and Data-Retention Power

The key limitation in achieving battery (1 V) or solar-cell (0.5 V) operation will be the subthreshold power consumption, which will dominate in both active and stand-by DRAM modes. In this subsection, circuit techniques that drastically reduce leakage and data-retention power are described. Several methods have been proposed to address the exponential increase of subthreshold current as the threshold voltage is lowered in rapidly scaled technologies. One such method, a well-driving scheme, obtains a dynamic Vt by driving the well (see Fig. 53.43).64,74 The threshold voltage is thus higher during the stand-by mode than in the active mode. The advantages of this method are fast operation in the active mode and leakage current suppression in the stand-by mode. To reduce the subthreshold currents in the various DRAM voltage generators, a self-off-time detector circuit can be used.75 It automatically evaluates the optimal off-time interval and controls the dynamic ON/OFF switching ratio of power-dissipating circuits such as level detectors. This method is directly applicable to any on-chip voltage generator or self-refresh circuit. The block diagram of this architecture is shown in Fig. 53.44.
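To give a feel for why raising Vt in the stand-by mode is effective, the sketch below uses the standard first-order model in which the subthreshold current at zero gate bias falls by one decade for every S volts of threshold voltage, where S is the subthreshold swing. The swing of 85 mV/decade, the reference current, and the two threshold voltages are assumed values for illustration only; they do not come from the cited designs.

```python
def subthreshold_current(v_t, i0=1e-7, swing=0.085):
    """First-order model at V_gs = 0: I_sub = I0 * 10**(-V_t / S).

    i0 (A) and swing (V/decade) are assumed, illustrative parameters.
    """
    return i0 * 10 ** (-v_t / swing)


# Hypothetical thresholds: low V_t while active, raised V_t in stand-by.
vt_active, vt_standby = 0.20, 0.45

i_active = subthreshold_current(vt_active)
i_standby = subthreshold_current(vt_standby)
ratio = i_active / i_standby

print(f"leakage at V_t = {vt_active} V: {i_active:.3e} A")
print(f"leakage at V_t = {vt_standby} V: {i_standby:.3e} A")
print(f"reduction factor: {ratio:.0f}x")
```

Under these assumptions a 0.25 V shift of the threshold changes the leakage by almost three decades, which is why a dynamically driven well can combine fast active operation with low stand-by current.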


FIGURE 53.42

Boosted voltage generator. (© 1991, IEEE. With permission.)

FIGURE 53.43

Low-voltage well-driving scheme. (© 1995, IEEE. With permission.)

FIGURE 53.44

Block diagram of BBG circuit using the self-off-time detector. (© 1997, IEEE. With permission.)


A charge-transfer presensing scheme (CTPS) with 1/2Vcc bit-line precharge and a nonreset block control scheme (NRBC) reduces the data-retention current by 75%.76 The principle of the CTPS technique is shown in Fig. 53.45. The sense amplifier SA and the bit-line BL are separated by the transfer-gate TG. The bit-line is precharged to 1/2VccA (power supply voltage for the array) and the sense amplifier node is precharged to a voltage higher than VccA. When TG is at a low level, the word-line WL is activated and the data from the memory cell MC is transferred to the bit-line BL. A small voltage change appears on the bit-line pair. Then, the TG voltage is set to the voltage for the charge-transfer condition, and the charge of SA node is transferred to the bit-line. The transfer is complete when the bit-line voltage reaches VTG – Vtn. After that, a large variation of the readout voltage appears on the SA pair.

FIGURE 53.45

Concept of CTPS and its circuit organization; BL = 1/2Vcc, VccA = 0.8 V. (© 1997, IEEE. With permission.)

The CTPS technique reduces the active array current and prolongs the data-retention time. The data-retention power can be reduced further by the nonreset row block control scheme (NRBC), which reduces the number of charge/discharge operations of the row block control circuits to 1/128 of that in the conventional method. The NRBC architecture is shown in Fig. 53.46. NRBC is a divided word-line structure in which one sub-word-line (SWL) in the selected row block is activated if one main word-line (MWL) and one of four subdecode signals (SD0~3) are activated in this row block. The transfer-gates TG_L and TG_R are also activated on both sides of this row block. After the data-retention mode is set, the SD and TG signals do not swing fully at every cycle but only every 128 cycles for activating the same row block. As a result, the row control current is reduced by 70% compared with the conventional scheme.

FIGURE 53.46

Basic circuits of the row block control in NRBC. (© 1997, IEEE. With permission.)

Another effective method for leakage current reduction is the subthreshold leakage current suppression system (SCSS), shown in Fig. 53.47.78 The method features high drivability (Ids) and low-Vt transistors. Its principle is to reduce the active-mode leakage current with body bias control and to reduce the stand-by-mode current with body bias and switched-source impedance. PMOS transistors use the boosted word-line voltage as a body bias, whereas NMOS transistors use the memory cell substrate voltage as a body bias.

In addition to leakage suppression techniques, extending the refresh time can also significantly reduce power consumption during the stand-by mode, as shown in Eq. 53.4.67,80,81 The refresh time is determined by how long the charge stored in the memory cell keeps enough margin against leakage at high temperature. To achieve long refresh characteristics for low-voltage operation, a negative word-line method can be applied.67 Figure 53.48 shows the concept of this method.

FIGURE 53.47

Subthreshold leakage current suppression system. (© 1998, IEEE. With permission.)

FIGURE 53.48

Principle of the negative voltage word-line technique. (© 1997, IEEE. With permission.)


A negative gate-source voltage Vgs is applied, which decreases the subthreshold current of the MC transistor and provides a noise-free dynamic refresh. It also enables a shallow back-bias voltage Vbb, which reduces the electric field between the storage node and the p-well region under the memory cell and results in a small junction leakage current. This achieves a longer static refresh time. Figure 53.49 shows an example of the negative voltage word-line driver.

FIGURE 53.49

Negative voltage word-line driver. (© 1997, IEEE. With permission.)

FIGURE 53.50 A schematic diagram of mode-register controlled DPS-refresh method. (© 1998, IEEE. With permission.)

FIGURE 53.51

Timing diagram: (a) PROM read operation, and (b) DPS-refresh operation. (© 1998, IEEE. With permission.)

The dual-period self-refresh (DPS-refresh) scheme can extend the refresh time by four to six times.80 The principle of the DPS-refresh scheme is shown in Fig. 53.50 and the corresponding timing diagram in Fig. 53.51. The key concept is to use two different internal self-refresh periods. All word-lines are separated into two groups according to retention test data that are stored in a PROM mode register implemented in the chip periphery. The short period t1 corresponds to a conventional self-refresh period determined by the minimum retention time in a chip. The long period t2 is set to the optimum refresh value. If all memory cells connected to a specific word-line have a retention time longer than t2, the word-line is called a long-period word-line (LPWL) and is refreshed in the long period t2. Otherwise, it is called a short-period word-line (SPWL) and is refreshed in the short period t1.

The DPS-refresh operation is then achieved by periodically skipping refresh cycles for LPWLs. The operation is composed of T1 periods repeated (n – 1) times, followed by a T2 period. For a refresh cycle during a T1 period, the inhibit_k signal, where k is from 0 to 3, goes low if the word-line selected in array block k is an LPWL, and disables all AND-gated MSi signals. As a result, the refresh operation is not executed. During the T2 period, however, the inhibit_k signals are driven high by the T2 clock signal. This signal is generated by dividing the most significant refresh address bit, A11, by p using the programmable divide-by-p counter. The period of A11 is equal to the short refresh period t1. Consequently, LPWLs are refreshed every “p × t1” periods. The advantage of the DPS-refresh operation is that word-lines that have the same refresh address but are located in different array blocks are individually controlled by the inhibit_k signals, which helps to prolong the refresh time. Using this method, one half of the self-refresh current is saved compared with the conventional self-refresh technique.
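The refresh-skipping idea of the DPS-refresh scheme can be sketched in a few lines of Python. The model below counts how many word-line refreshes are issued over one long period t2 = p × t1 when LPWLs are refreshed only in the T2 period and SPWLs in every period. The retention times, t1, p, and all names are hypothetical, and the on-chip PROM and inhibit_k logic are reduced to a simple per-word-line flag.

```python
def classify_word_lines(retention_times, t2):
    """Flag a word-line as LPWL when its weakest cell retains data longer than t2."""
    return [t > t2 for t in retention_times]


def refreshes_per_t2(is_lpwl, p):
    """Count refresh operations issued during p consecutive t1 periods.

    Periods 0..p-2 play the role of T1 (refresh of LPWLs is inhibited);
    the last period plays the role of T2 (every word-line is refreshed).
    """
    count = 0
    for period in range(p):
        t2_period = period == p - 1
        for lpwl in is_lpwl:
            if t2_period or not lpwl:
                count += 1
    return count


# Hypothetical data: minimum retention time (s) of the cells on each word-line.
t1 = 0.064
p = 4
retention = [0.07, 0.30, 0.50, 0.09, 0.40, 0.35, 0.60, 0.28]

flags = classify_word_lines(retention, t2=p * t1)
dps = refreshes_per_t2(flags, p)
conventional = len(retention) * p  # every word-line refreshed in every t1

print(f"LPWLs: {flags.count(True)} of {len(flags)}")
print(f"refreshes per t2: {dps} (DPS) vs. {conventional} (conventional)")
```

The saving in this toy model depends entirely on the assumed share of LPWLs and on p; it is not meant to reproduce the figure reported for the actual chip.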

53.7 Conclusion

In this chapter, the latest developments in low-power circuit techniques and methods for ROMs, Flash memories, FeRAMs, SRAMs, and DRAMs were described. All major sources of power dissipation in these memories were analyzed, and the key techniques for drastic reduction of power consumption were identified: capacitance reduction, very low operating voltages, DC and AC current reduction, and suppression of leakage currents. Many of the reviewed techniques are also applicable to other circuits, such as ASICs and DSPs. Battery and solar-cell operation requires operating voltages in the sub-1-V range. These conditions demand new design approaches and more sophisticated concepts to retain high device reliability. Experimental circuits operating at these voltage levels are slowly starting to emerge in all types of memories. However, there is no universal solution for any of these designs, and many challenges still await memory designers.

References

1. Pivin, D., “Pick the Right Package for Your Next ASIC Design,” EDN, vol. 39, no. 3, pp. 91–108, Feb. 3, 1994.
2. Small, C., “Shrinking Devices Put the Squeeze on System Packaging,” EDN, vol. 39, no. 4, pp. 41–46, Feb. 17, 1994.
3. Manners, D., “Portables Prompt Low-Power Chips,” Electronics Weekly, no. 1574, p. 22, Nov. 13, 1991.


4. Mayer, J., “Designers Heed the Portable Mandate,” EDN, vol. 37, no. 20, pp. 65–68, Nov. 5, 1992.
5. Stephany, R. et al., “A 200MHz 32b 0.5W CMOS RISC Microprocessor,” in ISSCC Digest of Technical Papers, pp. 15.5-1 to 15.5-2, Feb. 1998.
6. Igura, H. et al., “An 800MOPS 100mW 1.5V Parallel DSP for Mobile Multimedia Processing,” in ISSCC Digest of Technical Papers, pp. 18.3-1 to 18.3-2, Feb. 1998.
7. Sharma, A. K., Semiconductor Memories: Technology, Testing and Reliability, IEEE Press, 1997.
8. de Angel, E. and Swartzlander, E. E. Jr., “Survey of Low Power Techniques for ROMs,” in Proceedings of ISLPED’97, pp. 7–11, Aug. 1997.
9. Rabaey, J. and Pedram, M., Editors, Low-Power Methodologies, Kluwer Academic Publishers, 1996.
10. Margala, M. and Durdle, N. G., “Noncomplementary BiCMOS Logic and CMOS Logic Styles for Low-Voltage Low-Power Operation: A Comparative Study,” IEEE Journal of Solid-State Circuits, vol. 33, no. 10, pp. 1580–1585, Oct. 1998.
11. Margala, M. and Durdle, N. G., “1.2 V Full-Swing BiNMOS Logic Gate,” Microelectronics Journal, vol. 29, no. 7, pp. 421–429, Jul. 1998.
12. Margala, M. and Durdle, N. G., “Low-Power 4-2 Compressor Circuits,” International Journal of Electronics, vol. 85, no. 2, pp. 165–176, Aug. 1998.
13. Grossman, S., “Future Trends in Flash Memories,” in Proceedings of MTDT’96, pp. 2–3, Aug. 1996.
14. Verma, R., “Flash Memory Quality and Reliability Issues,” in Proceedings of MTDT’96, pp. 32–36, Aug. 1996.
15. Ohkawa, M. et al., “A 98 mm² Die Size 3.3-V 64-Mb Flash Memory with FN-NOR Type Four-Level Cell,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1584–1589, Nov. 1996.
16. Kim, J.-K. et al., “A 120-mm² 64-Mb NAND Flash Memory Achieving 180 ns/Byte Effective Program Speed,” IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 670–679, May 1997.
17. Jung, T.-S. et al., “A 117-mm² 3.3-V Only 128-Mb Multilevel NAND Flash Memory for Mass Storage Applications,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1575–1583, Nov. 1996.
18. Hiraki, M. et al., “A 3.3V 90 MHz Flash Memory Module Embedded in a 32b RISC Microcontroller,” in Advanced Program of ISSCC’99, p. 17, Nov. 1998.
19. Atsumi, S. et al., “A 3.3 V-only 16 Mb Flash Memory with row-decoding scheme,” in ISSCC Digest of Technical Papers, pp. 42–43, Feb. 1996.
20. Takeuchi, K. et al., “A Multipage Cell Architecture for High-Speed Programming Multilevel NAND Flash Memories,” IEEE Journal of Solid-State Circuits, vol. 33, no. 8, pp. 1228–1238, Aug. 1998.
21. Takeuchi, K. et al., “A Negative Vth Cell Architecture for Highly Scalable, Excellently Noise Immune and Highly Reliable NAND Flash Memories,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 234–235, Jun. 1998.
22. Kawahara, T. et al., “Bit-Line Clamped Sensing Multiplex and Accurate High Voltage Generator for Quarter-Micron Flash Memories,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1590–1600, Nov. 1996.
23. Otsuka, N. and Horowitz, M., “Circuit Techniques for 1.5-V Power Supply Flash Memory,” IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1217–1230, Aug. 1997.
24. Mihara, M. et al., “A 29 mm² 1.8V-Only 16 Mb DINOR Flash Memory with Gate-Protected Poly-Diode Charge Pump,” in Advanced Program of ISSCC’99, p. 17, Nov. 1998.
25. Tanzawa, T. et al., “A Compact On-Chip ECC for Low Cost Flash Memories,” IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 662–669, May 1997.
26. Nozoe, A. et al., “A 256Mb Multilevel Flash Memory with 2MB/s Program Rate for Mass Storage Application,” in Advanced Program of ISSCC’99, p. 17, Nov. 1998.
27. Imamiya, K. et al., “A 130 mm² 256Mb NAND Flash with Shallow Trench Isolation Technology,” in Advanced Program of ISSCC’99, p. 17, Nov. 1998.
28. Hirano, H. et al., “2-V/100ns 1T/1C Nonvolatile Ferroelectric Memory Architecture with Bitline-Driven Read Scheme and Nonrelaxation Reference Cell,” IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 649–654, May 1997.


29. Fujisawa, H. et al., “The Charge-Share Modified (CSM) Precharge-Level Architecture for High-Speed and Low-Power Ferroelectric Memory,” IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 655–661, May 1997.
30. Sumi, T. et al., “A 256Kb nonvolatile ferroelectric memory at 3 V and 100 ns,” in ISSCC Digest of Technical Papers, pp. 268–269, Feb. 1994.
31. Koike, H. et al., “A 60-ns 1-Mb Nonvolatile Ferroelectric Memory with a Nondriven Cell Plate Line Write/Read Scheme,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1625–1634, Nov. 1996.
32. Womack, R. et al., “A 16-kb ferroelectric nonvolatile memory with a bit parallel architecture,” in ISSCC Digest of Technical Papers, pp. 242–243, Feb. 1989.
33. Bellaouar, A. and Elmasry, M. I., Low-Power Digital VLSI Design, Circuits and Systems, Kluwer Academic Publishers, 1996.
34. Itoh, K. et al., “Trends in Low-Power RAM Circuit Technologies,” Proceedings of the IEEE, pp. 524–543, Apr. 1995.
35. Morimura, H. and Shibata, N., “A 1-V 1-Mb SRAM for Portable Equipment,” in Proceedings of ISLPED’96, pp. 61–66, Aug. 1996.
36. Ukita, M. et al., “A Single Bitline Cross-Point Cell Activation (SCPA) Architecture for Ultra Low Power SRAMs,” in ISSCC Digest of Technical Papers, pp. 252–253, Feb. 1994.
37. Amrutur, B. S. and Horowitz, M. A., “Techniques to Reduce Power in Fast Wide Memories,” in Proceedings of SLPE’94, pp. 92–93, 1994.
38. Toyoshima, H. et al., “A 6-ns, 1.5-V, 4-Mb BiCMOS SRAM,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1610–1617, Nov. 1996.
39. Caravella, J. S., “A 0.9 V, 4 K SRAM for Embedded Applications,” in Proceedings of CICC, pp. 119–122, May 1996.
40. Caravella, J. S., “A Low Voltage SRAM for Embedded Applications,” IEEE Journal of Solid-State Circuits, vol. 32, no. 3, pp. 428–432, Mar. 1997.
41. Haraguchi, Y. et al., “A Hierarchical Sensing Scheme (HSS) of High-Density and Low-Voltage Operation SRAMs,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 79–80, Jun. 1997.
42. Mori, T. et al., “A 1V 0.9 mW at 100 MHz 2k×16b SRAM utilizing a Half-Swing Pulsed-Decoder and Write-Bus Architecture in 0.25 µm Dual-Vt CMOS,” in ISSCC Digest of Technical Papers, pp. 22.4-1 to 22.4-2, Feb. 1998.
43. Kuang, J. B. et al., “SRAM Bitline Circuits on PD SOI: Advantages and Concerns,” IEEE Journal of Solid-State Circuits, vol. 32, no. 6, pp. 837–843, Jun. 1997.
44. Kawashima, S. et al., “A Charge-Transfer Amplifier and an Encoded-Bus Architecture for Low-Power SRAM’s,” IEEE Journal of Solid-State Circuits, vol. 33, no. 5, pp. 793–799, May 1998.
45. Amrutur, B. S. and Horowitz, M. A., “A Replica Technique for Wordline and Sense Control in Low-Power SRAM’s,” IEEE Journal of Solid-State Circuits, vol. 33, no. 8, pp. 1208–1219, Aug. 1998.
46. Morimura, H. and Shibata, N., “A Step-Down Boosted-Wordline Scheme for 1-V Battery Operated Fast SRAM’s,” IEEE Journal of Solid-State Circuits, vol. 33, no. 8, pp. 1220–1227, Aug. 1998.
47. Wang, J.-S. et al., “Low-Power Embedded SRAM Macros with Current-Mode Read/Write Operations,” in Proceedings of ISLPED, pp. 282–287, Aug. 1998.
48. Nii, K. et al., “A Low Power SRAM using Auto-Backgate-Controlled MT-CMOS,” in Proceedings of ISLPED, pp. 293–298, Aug. 1998.
49. Mai, K. W. et al., “Low-Power SRAM Design Using Half-Swing Pulse-Mode Techniques,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1659–1671, Nov. 1998.
50. Sato, H. et al., “A 5-MHz, 3.6mW, 1.4-V SRAM with Nonboosted, Vertical Bipolar Bit-Line Contact Memory Cell,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1672–1681, Nov. 1998.
51. Nambu, H. et al., “A 1.8-ns Access, 550-MHz, 4.5-Mb CMOS SRAM,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1650–1658, Nov. 1998.
52. Yamauchi, H. et al., “A 0.8V/100MHz/sub-5mW-Operated Mega-bit SRAM Cell Architecture with Charge-Recycle Offset-Source Driving (OSD) Scheme,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 126–127, Jun. 1996.


53. Itoh, K. et al., “A Deep Sub-V, Single Power-Supply SRAM Cell with Multi-Vt Boosted Storage Node and Dynamic Load,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 132–133, Jun. 1996.
54. Kawaguchi, H. et al., “Dynamic Leakage Cut-off Scheme for Low-Voltage SRAM’s,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 140–141, Jun. 1998.
55. Fukushi, I. et al., “A Low-Power SRAM Using Improved Charge Transfer Sense Amplifiers and a Dual-Vth CMOS Circuit Scheme,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 142–143, Jun. 1998.
56. Khellah, M. and Elmasry, M. I., “Circuit Techniques for High-Speed and Low-Power Multi-Port SRAMS,” in Proceedings of ASIC, pp. 157–161, Sep. 1998.
57. Wang, J.-S. and Lee, H. Y., “A New Current-Mode Sense Amplifier for Low-Voltage Low-Power SRAM Design,” in Proceedings of ASIC, pp. 163–167, Sep. 1998.
58. Shultz, K. J. et al., “Low-Supply-Noise Low-Power Embedded Modular SRAM,” IEE Proceedings – Circuits, Devices and Systems, vol. 143, no. 2, pp. 73–82, Apr. 1996.
59. van der Wagt, P. et al., “RTD/HFET Low Standby Power SRAM Gain Cell,” Texas Instruments Research Web-site, 4 pages, 1997.
60. Greason, J. et al., “A 4.5 Megabit, 560MHz, 4.5GByte/s High Bandwidth SRAM,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 15–16, Jun. 1997.
61. Aoki, M. and Itoh, K., “Low-Voltage and Low-Power ULSI Circuit Techniques,” IEICE Transactions on Electronics, vol. E77-C, no. 8, pp. 1351–1360, Aug. 1994.
62. Suzuki, T. et al., “High-Speed Circuit Techniques for Battery-Operated 16 MBit CMOS DRAM,” IEICE Transactions on Electronics, vol. E77-C, no. 8, pp. 1334–1342, Aug. 1994.
63. Lee, K. et al., “Low-Voltage, High-Speed Circuit Designs for Gigabit DRAM’s,” IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 642–648, May 1997.
64. Itoh, K. et al., “Limitations and Challenges of Multigigabit DRAM Chip Design,” IEEE Journal of Solid-State Circuits, vol. 32, no. 5, pp. 624–634, May 1997.
65. Lee, K.-C. et al., “A 1GBit SDRAM with an Independent Sub-Array Controlled Scheme and a Hierarchical Decoding Scheme,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 103–104, Jun. 1997.
66. Lee, K. et al., “A 1GBit SDRAM with an Independent Sub-Array Controlled Scheme and a Hierarchical Decoding Scheme,” IEEE Journal of Solid-State Circuits, vol. 33, no. 5, pp. 779–786, May 1998.
67. Tsuruda, T. et al., “High-Speed/High-Bandwidth Design Methodologies for On-Chip DRAM Core Multimedia System LSI’s,” IEEE Journal of Solid-State Circuits, vol. 32, no. 3, pp. 477–482, Mar. 1997.
68. Joo, J.-H. et al., “A 32-Bank 1 Gb Self-Strobing Synchronous DRAM with 1 GByte/s Bandwidth,” IEEE Journal of Solid-State Circuits, vol. 31, no. 11, pp. 1635–1644, Nov. 1996.
69. Eto, S. et al., “A 1-Gb SDRAM with Ground-Level Precharged Bit Line and Nonboosted 2.1-V Word Line,” IEEE Journal of Solid-State Circuits, vol. 33, no. 11, pp. 1697–1702, Nov. 1998.
70. Kato, Y. et al., “Non-Precharged Bit-Line Sensing Scheme for High-Speed Low-Power DRAMs,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 16–17, Jun. 1998.
71. Tsikikawa, Y. et al., “An Efficient Back-Bias Generator with Hybrid Pumping Circuit for 1.5V DRAMs,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 85–86, May 1993.
72. Nakagome, Y. et al., “An Experimental 1.5-V 64-Mb DRAM,” IEEE Journal of Solid-State Circuits, vol. 26, no. 4, pp. 465–471, Apr. 1991.
73. Tanaka, H. et al., “A Precise On-Chip Voltage Generator for a Giga-Scale DRAM with a Negative Word-Line Scheme,” in Digest of Technical Papers of Symposium on VLSI Circuits, pp. 94–95, Jun. 1998.
74. Seta, K. et al., “50% Active Power Saving Without Speed Degradation Using Standby Power Reduction (SPA) Circuit,” in ISSCC Digest of Technical Papers, pp. 318–319, Feb. 1995.
75. Song, H. J., “A Self-Off-Time Detector for Reducing Standby Current of DRAM,” IEEE Journal of Solid-State Circuits, vol. 32, no. 10, pp. 1535–1542, Oct. 1997.


76. Tsukude, M. et al., “A 1.2- to 3.3-V Wide Voltage-Range/Low-Power DRAM with a Charge-Transfer Presensing Scheme,” IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1721–1727, Nov. 1997.
77. Shimomura, K. et al., “A 1-V 46-ns 16-Mb SOI-DRAM with Body Control Technique,” IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1712–1720, Nov. 1997.
78. Hasegawa, M. et al., “A 256 Mb SDRAM with Subthreshold Leakage Current Suppression,” in ISSCC Digest of Technical Papers, pp. 5.5-1 to 5.5-2, Feb. 1998.
79. Okudi, T. and Murotani, T., “A Four-Level Storage 4-Gb DRAM,” IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1743–1747, Nov. 1997.
80. Idei, Y. et al., “Dual-Period Self-Refresh Scheme for Low-Power DRAM’s with On-Chip PROM Mode Register,” IEEE Journal of Solid-State Circuits, vol. 33, no. 2, pp. 253–259, Feb. 1998.
81. Tanizaki, T. et al., “Practical Low Power Design Architecture for 256 Mb DRAM,” in Proceedings of ESSCIRC’97, pp. 188–191, Sep. 1997.
82. Hamanoto, T. et al., “400-MHz Random Column Operating SDRAM Techniques with Self-Skew Compensation,” IEEE Journal of Solid-State Circuits, vol. 33, no. 5, pp. 770–778, May 1998.


Song, B. "Nyquist-Rate ADC and DAC" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


54 Nyquist-Rate ADC and DAC

Bang-Sup Song
University of California at San Diego

54.1 Introduction
Resolution • Linearity • Monotonicity • Clock Jitter • Nyquist-Rate vs. Oversampling

54.2 ADC Design Arts
State of the Art • Technical Challenge in Digital Wireless • ADC Figure of Merit

54.3 ADC Architectures
Slope-Type ADC • Successive-Approximation ADC • Flash ADC • Subranging ADC • Multi-Step ADC • Pipeline ADC • Digital Error Correction • One-Bit Pipeline ADC • Algorithmic, Cyclic, or Recursive ADC • Time-Interleaved Parallel ADC • Folding ADC

54.4 ADC Design Considerations
Sampling Error Considerations • Techniques for High-Resolution and High-Speed ADCs

54.5 DAC Design Art

54.6 DAC Architectures
Resistor-String DAC • Current-Ratioed DAC • R-2R Ladder DAC • Capacitor-Array DAC • Thermometer-Coded Segmented DAC • Integrator-Type DAC

54.7 DAC Design Considerations
Effect of Limited Slew Rate • Glitch • Techniques for High-Resolution DACs

54.1 Introduction

The rapidly growing electronics field has witnessed the digital revolution that started with the digital telephone switching system in the early 1970s. The trend continued with digital audio in the 1980s and with digital video in the 1990s. The digital technique is expected to prevail in the coming multimedia era and to influence even future wireless PCS/PCN systems. All electrical signals in the real world are analog in nature, and their waveforms are continuous in time. Since most signal processing is done numerically in discrete time, devices that convert an analog waveform into a stream of discrete digital numbers, or vice versa, have become technical necessities in implementing high-performance digital processing systems. The former is called an analog-to-digital converter (ADC or A/D converter), and the latter is called a digital-to-analog converter (DAC or D/A converter).

Typical systems in this digital era can be grouped and explained as in Fig. 54.1. The processed data are stored and recovered later using magnetic or optical media such as tape, magnetic disc, or optical disc. The system can also transmit or receive data through communication channels such as telephone switch, cable, optical fiber, and wireless RF media. Through the Internet computer networks, even compressed digital video images are now made accessible from anywhere and at any time.

FIGURE 54.1 Information processing systems.

Resolution

Resolution describes the minimum voltage or current that an ADC/DAC can resolve. The fundamental limit is the quantization noise due to the finite number of bits used in the ADC/DAC. In an N-bit ADC, the minimum incremental input voltage of Vref/2^N can be resolved with a full-scale input range of Vref. That is, only 2^N digital codes are available to represent the continuous analog input. Similarly, in an N-bit DAC, 2^N input digital codes can generate distinct output levels separated by Vref/2^N with a full-scale output range of Vref. The signal-to-noise ratio (SNR) is defined as the power ratio of the maximum signal to the in-band uncorrelated noise. The spectrum of the quantization noise is evenly distributed within the Nyquist bandwidth (half the sampling frequency). Because of this even distribution, the in-band noise power decreases by 3 dB each time the oversampling ratio is doubled. This implies that, when oversampled, the SNR within the signal band can be made higher. The SNR of an ideal N-bit ADC/DAC is approximated as

$$\mathrm{SNR} = 1.5 \times 2^{2N} \approx 6.02N + 1.76 \ \ (\mathrm{dB}) \tag{54.1}$$

The resolution is usually characterized by the SNR, but the SNR accounts only for the uncorrelated noise. The real noise performance is better represented by the signal-to-noise and distortion ratio (SNDR, SINAD, or TSNR), which is the ratio of the signal power to the total in-band noise including harmonic distortion. Also, a slightly different term is often used in place of the SNR. The useful signal range or dynamic range (DR) is defined as the power ratio of the maximum signal to the minimum signal. The minimum signal is defined as the smallest signal for which the SNDR is 0 dB, while the maximum signal is the full-scale signal. Therefore, the SNR of the non-ideal ADC/DAC can be lower than the ideal DR because the noise floor can be higher with a large signal present.

FIGURE 54.2 Definition of ADC nonlinearities: (a) DNL and (b) INL.

In practice, performance is limited not only by the quantization noise but also by non-ideal factors such as noise from circuit components, power supply coupling, a noisy substrate, timing jitter, settling, and nonlinearity. An alternative definition of the resolution is the effective number of bits (ENOB), which is defined by

$$\mathrm{ENOB} = \frac{\mathrm{SNDR} - 1.76}{6.02} \ \ (\mathrm{bits}) \tag{54.2}$$

Usually, the ENOB is defined for the signal at half the sampling frequency.
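As a quick numerical check of Eqs. (54.1) and (54.2), the following minimal sketch evaluates the ideal SNR and the ENOB. The function names and the 62-dB SNDR value are illustrative assumptions, not taken from the text.

```python
import math

def ideal_snr_db(n_bits: int) -> float:
    """Ideal quantization-limited SNR of Eq. (54.1), expressed in dB."""
    return 10.0 * math.log10(1.5 * 2 ** (2 * n_bits))

def enob(sndr_db: float) -> float:
    """Effective number of bits of Eq. (54.2); sndr_db is the SNDR in dB."""
    return (sndr_db - 1.76) / 6.02

snr_12 = ideal_snr_db(12)   # close to 6.02 * 12 + 1.76 = 74 dB
bits_62 = enob(62.0)        # hypothetical measured SNDR of 62 dB, about 10 bits
```

Feeding the ideal SNR back into `enob` returns approximately N, which is a convenient sanity check on any measured SNDR figure.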

Linearity

The input/output ranges of an ideal N-bit ADC/DAC are equally divided into 2^N small units, and one least significant bit (LSB) in the digital code corresponds to the analog incremental voltage of Vref/2^N. Static ADC/DAC performance is characterized by differential nonlinearity (DNL) and integral nonlinearity (INL). The DNL is a measure of the deviation of the actual ADC/DAC step from the ideal step for one LSB, and the INL is a measure of the deviation of the ADC/DAC output from the ideal straight line drawn between the two end points of the transfer characteristic. Both DNL and INL are measured in units of an LSB. In practice, the largest positive and negative numbers are usually quoted to specify the static performance. Examples of these DNL and INL definitions for an ADC are given in Fig. 54.2.

However, several different definitions of INL may result, depending on how the two end points are defined. In some architectures, the two end points are not exactly 0 and Vref. The non-ideal reference point causes an offset error, while the non-ideal full-scale range gives rise to a gain error. In most applications, these offset and gain errors resulting from the non-ideal end points do not matter, and the integral linearity is better defined as a relative measure against a straight line fitted to the actual transfer characteristic rather than against the ideal end points. The straight line can be drawn through the two end points of the actual transfer function, or it can be a theoretical straight line adjusted for best fit. The former definition is sometimes called end-point linearity, while the latter is called best-straight-line linearity.

Unlike an ADC, a DAC produces a sampled-and-held step waveform that is held constant during a word clock period. Any deviation from the ideal step waveform causes an error in the DAC output. High-speed DACs, which usually have a current output, are either terminated with a low-impedance load of 50 to 75 Ω or buffered by a wideband transresistance amplifier. The linearity of a DAC is often limited dynamically by the non-ideal settling of the output node. Anything other than ideal exponential settling results in linearity errors.
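The static DNL and INL definitions given above can be illustrated with a short sketch. It assumes the code-transition voltages of an ADC have already been measured and uses the end-point definition of INL. The function name and the 3-bit data are hypothetical.

```python
def dnl_inl(transitions, lsb):
    """Return (DNL, INL) lists in LSB from measured code-transition voltages.

    transitions[k] is the input voltage at which the output code changes
    from k to k+1; lsb is the ideal step size. INL uses the end-point
    definition: a straight line through the first and last transitions.
    """
    steps = [b - a for a, b in zip(transitions, transitions[1:])]
    dnl = [s / lsb - 1.0 for s in steps]
    n = len(transitions) - 1
    first, last = transitions[0], transitions[-1]
    inl = [(t - (first + (last - first) * k / n)) / lsb
           for k, t in enumerate(transitions)]
    return dnl, inl

# Hypothetical 3-bit ADC with Vref = 8 V (1 LSB = 1 V); one transition is 0.25 V late.
levels = [1.0, 2.0, 3.25, 4.0, 5.0, 6.0, 7.0]
dnl, inl = dnl_inl(levels, lsb=1.0)
```

In this example the late transition widens one code by 0.25 LSB and narrows the next by the same amount, while the INL peaks at 0.25 LSB at the displaced transition.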


Monotonicity

In both the ADC and the DAC, the output should increase over its full range as the input increases. That is, the magnitude of any negative DNL should be smaller than one LSB for an ADC/DAC to be monotonic. Monotonicity is critical in most applications, in particular digital control or video applications. The source of nonmonotonicity is an inaccuracy in the binary weighting of a DAC. For example, the most significant bit (MSB) has a weight of half the full range. If the MSB weight is not accurate, the full range is divided into two non-ideal half ranges, and a major error occurs at the midpoint of the full scale. Similar nonmonotonicity can take place at the quarter and one-eighth points.

In DACs, monotonicity is inherently guaranteed if a DAC uses thermometer decoding. However, it is impractical to implement high-resolution DACs using thermometer codes since the number of elements grows exponentially as the number of bits increases. Therefore, to guarantee monotonicity in practical applications, DACs have been implemented using either a segmented DAC or an integrator-type DAC. Oversampling interpolative DACs also achieve monotonicity using a pulse-density modulated bitstream filtered by a lossy integrator or by a low-pass filter. Similarly, ADCs using slope-type, subranging, or oversampling architectures are monotonic.
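The contrast between binary weighting and thermometer decoding can be made concrete with a small sketch. The element values below are hypothetical and are expressed in LSB units. The binary-weighted DAC has an MSB that is 1.5 LSB too small, and the thermometer DAC has mismatched but positive unit elements.

```python
def binary_dac(code, weights):
    """Sum the weights of the bits that are set; weights[0] is the MSB."""
    n = len(weights)
    return sum(w for i, w in enumerate(weights) if (code >> (n - 1 - i)) & 1)

def thermometer_dac(code, elements):
    """Turn on the first `code` unit elements; each step only adds an element."""
    return sum(elements[:code])

def is_monotonic(levels):
    """True if the output never decreases as the code increases."""
    return all(b >= a for a, b in zip(levels, levels[1:]))

# 3-bit binary-weighted DAC with a hypothetical MSB error of -1.5 LSB.
weights = [4.0 - 1.5, 2.0, 1.0]
binary_levels = [binary_dac(c, weights) for c in range(8)]

# Seven unit elements with hypothetical positive mismatch.
units = [1.1, 0.8, 1.3, 0.9, 1.0, 0.7, 1.2]
thermo_levels = [thermometer_dac(c, units) for c in range(8)]
```

The binary-weighted output drops at the midpoint (code 3 to code 4), which is the major error the text describes, while the thermometer output cannot decrease because every code step only adds a positive element.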

Clock Jitter

Jitter is loosely defined as a timing error in analog-to-digital and digital-to-analog conversions. The clock jitter greatly affects the noise performance of both ADCs and DACs. For example, in ADCs, the right signal sampled at the wrong time is the same as the wrong signal sampled at the right time. Similarly, DACs need precise timing to correctly reproduce an analog output signal. If an analog waveform is not generated with timing identical to that with which it was sampled, distortion will result because the output changes at the wrong time. This in turn introduces spurious components related to the jitter frequency if the jitter is periodic, or a higher noise floor if it is not. If the jitter has a Gaussian distribution with an rms jitter of ∆t, the worst-case SNR resulting from this random clock jitter is

$$\mathrm{SNR} = -20 \log \left( \frac{2\pi B \,\Delta t}{M^{1/2}} \right) \ \ (\mathrm{dB}) \tag{54.3}$$

where B is the signal bandwidth and M is the oversampling ratio. The oversampling ratio M is defined as

$$M = \frac{f_s}{2B} \tag{54.4}$$

where fs is the sampling clock frequency. The timing jitter error is more critical in reproducing high-frequency components. In other words, for an N-bit ADC/DAC, an upper limit for the tolerable clock jitter is

$$\Delta t \le \frac{1}{2\pi B \, 2^{N}} \left( \frac{2M}{3} \right)^{1/2} \tag{54.5}$$

This implies that the error power induced in the baseband by clock jitter should be no larger than the quantization noise. For example, a Nyquist-rate 16-b ADC/DAC with a 22-kHz bandwidth should have a clock jitter of less than 90 ps.
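The jitter limit can be checked numerically. The sketch below evaluates Eqs. (54.3) and (54.5) directly. The function names are illustrative, and a Nyquist-rate converter is assumed to have M = 1.

```python
import math

def jitter_snr_db(bandwidth_hz, rms_jitter_s, osr=1.0):
    """Worst-case SNR of Eq. (54.3) for Gaussian clock jitter, in dB."""
    return -20.0 * math.log10(
        2.0 * math.pi * bandwidth_hz * rms_jitter_s / math.sqrt(osr))

def max_jitter_s(n_bits, bandwidth_hz, osr=1.0):
    """Upper limit on the rms clock jitter from Eq. (54.5), in seconds."""
    return math.sqrt(2.0 * osr / 3.0) / (2.0 * math.pi * bandwidth_hz * 2 ** n_bits)

# 16-b converter with a 22-kHz bandwidth at the Nyquist rate (M = 1).
dt_audio = max_jitter_s(16, 22e3)   # roughly 90 ps, as quoted in the text
```

Substituting the limit from Eq. (54.5) back into Eq. (54.3) returns the ideal quantization-limited SNR of Eq. (54.1), which confirms that the two expressions are consistent.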

Nyquist-Rate vs. Oversampling

In recent years, high-resolution ADCs and DACs at the low end of the spectrum, such as those for digital audio, voice, and instrumentation, have been implemented predominantly using oversampling techniques. Although Nyquist-rate techniques can achieve comparable resolution, such techniques are in general sensitive to non-ideal factors such as process, component matching, and even environmental changes. The inherent advantage of oversampling provides a unique solution in the digital VLSI environment. The oversampling technique achieves high resolution by trading speed for accuracy. Oversampling lessens the effect of quantization noise and clock jitter. However, the quantization or regeneration of a signal above the MHz range using oversampling techniques is costly, even if possible. Therefore, typical applications with high sampling rates require sampling at the Nyquist rate.

54.2 ADC Design Arts

The conversion speed of the ADC is limited by the time needed to complete all comparator decisions. Flash ADCs make all the decisions at once, while successive-approximation ADCs make one bit decision at a time. Although it is fast, the complexity of the flash ADC grows exponentially. On the other hand, the successive-approximation ADC is simple but slow since the bit decisions are made in sequence. Between these two extremes, there exist many architectures resolving a finite number of bits at a time, such as pipeline and multi-step ADCs. They balance complexity and speed. Figure 54.3 shows recent high-speed ADC applications in the resolution-versus-speed plot. ADC architecture depends on system requirements. For example, with IF (intermediate frequency) filters, wireless receivers need only a 5- to 6-b ADC at a sampling rate of a few MHz. However, without IF filters, a dynamic range of 12 to 14 b is required for IF sampling, depending on the IF, as shown in Fig. 54.3.

State of the Art

Some architectures are preferred to others for certain applications. Three architectures stand out for three important areas of applications. For example, the oversampling converter is exclusively used to achieve high resolution above the 12-b level at low frequencies. The difficulty in achieving better than 12-b matching in conventional techniques gives a fair advantage to the oversampling technique. For medium speed with high resolution, pipeline or multi-step ADCs are promising. At extremely high frequencies, only flash and folding ADCs survive, but with low resolution. Figure 54.4 is a resolution-versus-speed plot showing this trend. As both semiconductor process and design technologies advance, the performance envelope will be pushed further. The demand for higher resolution at higher sampling rates is a main driver of this trend.

FIGURE 54.3 Recent high-speed ADC applications.

FIGURE 54.4 Performance of recently published ADCs: resolution versus speed.

Technical Challenge in Digital Wireless

In digital wireless systems, a need to quantize and to create a block of spectrum with low intermodulation has become the single most challenging problem. Implementing IF filters digitally has already become a necessity in wireless cell sites and base stations. Even in hand-held units, placing data conversion blocks closer to the RF (radio frequency) has many advantages. A substantial improvement in system cost and complexity of the RF circuitry can be realized by implementing the high-selectivity function digitally, and the digital IF can increase immunity to adjacent and alternate channel interference. Furthermore, the RF transceiver architecture can be made independent of the system and can be adapted to different standards using software. Low-spurious, low-power data converters are key components in this new software radio environment.

The fundamental limits in quantizing the IF spectrum are crosstalk and overload, and the system performance depends heavily on the SFDR (spurious-free dynamic range) of the sampling ADC. To meet this growing demand, low-spurious data conversion blocks are being actively developed in ever-increasing numbers. For a 14-b-level ideal dynamic range while sampling at 50 MHz, it is necessary to keep the sampling jitter below 0.32 ps. Considering that the current state-of-the-art commercial bipolar chip exhibits jitter in the range of 0.7 ps, jitter on the order of a fraction of a picosecond is considered to be at the limit of CMOS capability. However, unlike nonlinearity that causes interchannel mixing, the random jitter in IF sampling increases only the random noise floor. As a result, the random jitter is not considered fundamental in this application.

This challenging new application for digital IF processing will lead to the implementation of data converters with a very wide SFDR of more than 90 dB. Considering the current state of the art in CMOS ADCs, most architectures known to date are unlikely to achieve a high sampling rate of over 50 MHz, even with 0.2 to 0.3 µm technologies. Although a higher sampling rate of 65 MHz is reported using bipolar and BiCMOS (bipolar and CMOS), it has been implemented at a 12-b level. However, two high-speed architectures, pipeline (or multi-step) and folding, are potential candidates to challenge these limits with new system approaches.

FIGURE 54.5 Figure of merit (L) versus year.

ADC Figure of Merit

The ADC performance is often represented by a figure of merit L, which is defined as L = 2^N × fs/P, where N is the number of bits, fs is the sampling rate in Msamples/s, and P is the power consumption in mW. The higher the L is, the more bits are obtained at higher speed with lower power. The plot of L versus year shown in Fig. 54.5 shows the low-power and high-speed trend both for leading integrated CMOS and bipolar/BiCMOS ADCs published in the last decade.
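The figure of merit is simple to evaluate. The sketch below applies the definition to two hypothetical converters, whose specifications are assumed for illustration and are not data from Fig. 54.5.

```python
def figure_of_merit(n_bits, fs_msps, power_mw):
    """L = 2^N * fs / P, with fs in Msamples/s and P in mW."""
    return (2 ** n_bits) * fs_msps / power_mw

# Hypothetical converters, not taken from Fig. 54.5.
l_a = figure_of_merit(10, 20.0, 50.0)    # 10 b, 20 Msamples/s, 50 mW
l_b = figure_of_merit(12, 10.0, 100.0)   # 12 b, 10 Msamples/s, 100 mW
```

Both cases give L = 409.6: under this metric, two extra bits of resolution exactly offset halving the sampling rate and doubling the power.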

54.3 ADC Architectures

In general, the main criteria for choosing an ADC architecture are resolution and speed, but auxiliary requirements such as power, chip area, supply voltage, latency, operating environment, or technology often limit the choices. The current trend is toward low-cost integration without using expensive discrete technologies such as thin film and laser trimming. Therefore, a growing number of ADCs are being implemented using mainstream VLSI technologies such as CMOS or BiCMOS.

Slope-Type ADC

Traditionally, slope-type ADCs have been used for multimeters or digital panel meters mainly because of their simplicity and inherent high linearity. There can be many variations, but dual- or triple-slope techniques are commonly used because the single-slope method is sensitive to the switching error. The resolution of this type of ADC depends on the accurate control of charge on the capacitor.

The dual-slope technique in Fig. 54.6(a) starts with the initialization of the integrating capacitor by opening the switch S1 with the input switch S2 connected to Vref. If Vref is negative, Vx will increase linearly with a slope of Vref/RC. After a time T1, the switch S2 is switched to Vin. Then, Vx will decrease with a new slope of -Vin/RC. The comparator detects the zero-crossing time T2. From T1 and T2, the digital ratio of Vin/Vref can be obtained as T1/T2.

The triple-slope technique shown in Fig. 54.6(b) needs no op-amp to reduce the offset effect. Unlike the dual-slope method comparing two slopes, it measures three times T1, T2, and T3 by charging the capacitor with Vref, Vin, and ground with three switches S1, S2, and S3, respectively. The comparator threshold can be set to negative VTH. From three time measurements, the ratio of Vin/Vref can be computed as (T2 - T3)/(T1 - T3).

FIGURE 54.6 (a) Dual-slope and (b) triple-slope ADC techniques.
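The dual-slope relation Vin/Vref = T1/T2 can be reproduced with an idealized sketch in which time is counted in clock periods and the RC product cancels. The function name, the use of the magnitude of Vref, and the example values are assumptions made for illustration.

```python
def dual_slope_ratio(v_in, v_ref_mag, t1_counts):
    """Ideal dual-slope conversion with times measured in clock counts.

    The integrator ramps up for t1_counts at a rate proportional to
    v_ref_mag, then ramps down at a rate proportional to v_in until the
    comparator detects the zero crossing. Returns (t2_counts, t1/t2),
    where t1/t2 estimates Vin/|Vref|. RC cancels and is omitted.
    """
    if v_in <= 0:
        raise ValueError("v_in must be positive for the ramp to return to zero")
    peak = v_ref_mag * t1_counts           # integrator level after the first ramp
    t2_counts = 0
    while peak - v_in * t2_counts > 0:     # comparator has not tripped yet
        t2_counts += 1
    return t2_counts, t1_counts / t2_counts

t2, ratio = dual_slope_ratio(2.0, 1.0, 1000)
```

With Vin twice the reference magnitude, the down ramp takes half as many counts as the up ramp, so the ratio T1/T2 evaluates to 2.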

Successive-Approximation ADC

The simplest concept of A/D conversion is comparing the analog input voltage with the output of a DAC. The comparator output is fed back through the DAC as explained in Fig. 54.7. The successive-approximation register (SAR) performs the most straightforward binary comparison. The sampled input is compared with the DAC output by progressively dividing the range by two as explained in the 4-b example. The conversion starts by sampling the input, and the first MSB decision is made by comparing the sample-and-hold (S/H) output with Vref/2 by setting the MSB of the DAC to 1. If the input is higher, the MSB stays as 1. Otherwise, it is reset to 0. In the second bit decision, the input is compared with 3Vref/4 in this example by setting the second bit to 1. Note that the previous decision set the MSB to 1. If the input is lower, as in the example shown, the second bit is set to 0, and the third bit decision is done by comparing the input with 5Vref/8. This comparison continues until all the bits are decided. Therefore, the N-bit successive-approximation ADC requires N+1 clock cycles to complete one sample conversion.

FIGURE 54.7 Successive-approximation ADC technique.

The performance of the successive-approximation ADC is limited by the DAC resolution and the comparator accuracy. The commonly used DACs for this architecture are a resistor-string DAC and a capacitor-array DAC. Although binary-weighted capacitors have a 10-b-level matching in MOS [1], diffused resistors have poor matching and a high voltage coefficient. If differential resistor-string DACs are used, performance can be improved to the capacitor-array DAC level [2]. In general, the capacitor DAC exhibits poor DNL while the resistor-string DAC exhibits poor INL.
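The bit-by-bit search described above can be sketched as follows. This is a minimal behavioral model assuming an ideal DAC and comparator. The function name and the 0.7-V input are illustrative.

```python
def sar_convert(v_in, v_ref, n_bits):
    """Binary-search conversion; returns the code and the DAC levels tried."""
    code = 0
    trials = []
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)                # tentatively set this bit to 1
        v_dac = v_ref * trial / (1 << n_bits)    # ideal DAC output for the trial code
        trials.append(v_dac)
        if v_in >= v_dac:                        # comparator decision
            code = trial                         # keep the bit; otherwise it stays 0
    return code, trials

# 4-b example with Vref = 1 V and a hypothetical sampled input of 0.7 V.
code, trials = sar_convert(0.7, 1.0, 4)
```

For this input the DAC levels tried are Vref/2, 3Vref/4, 5Vref/8, and 11Vref/16, the first three matching the sequence in the 4-b example, and the result is the code 1011.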

Flash ADC

The most straightforward way of making an ADC is to compare the input with all the divided levels of the reference simultaneously. Such a converter is called a flash ADC, and the conversion occurs in one step. The flash ADC is the fastest among all ADCs. The flash ADC concept is explained in Fig. 54.8, where divided reference voltages are compared to the input. The binary encoder is needed because the output of the comparator bank is thermometer-coded. The resolution is limited both by the accuracy of the divided reference voltages and by the comparator resolution. The metastability of the comparator produces a sparkle noise when the gain of the comparator is low.

FIGURE 54.8 Flash ADC technique.

The reference division can be done using capacitor dividers [3,4] or transistor sizing [5] for small-scale flash ADCs. However, only resistor-string DACs can provide references as the number of bits grows. In practical implementations, the limit is the exponential growth in the number of comparators and resistors. For example, an N-bit flash needs 2^N - 1 comparators and 2^N resistors. Furthermore, for Nyquist-rate sampling, the input needs an S/H to freeze the input for comparison. As the number of bits grows, the comparator bank presents a significant loading to the input S/H, diminishing the speed advantage of this architecture. Also, the control of the reference divider accuracy and the comparator resolution degrades, and the power consumption becomes prohibitively high. As a result, flash converters with more than 10-b resolution are rare. Flash ADCs are commonly used as coarse quantizers in the pipeline or multi-step ADCs. The folding/interpolation ADC, which is conceptually a derivative of the flash ADC, reduces the number of comparators by folding the input range [6].

For high resolution, the flash ADC needs a low-offset comparator with high gain, and the comparator is often implemented in a multi-stage configuration with offset cancellation. The front-end of the multi-stage comparator is called a preamplifier. A technique called interpolation reduces the number of preamplifiers by interpolating the adjacent preamplifier outputs as shown in Fig. 54.9(a), where two preamplifier outputs Va and Vb are used to generate three more outputs V1, V2, and V3 using a resistor divider. The interpolation can improve the DNL within the interpolated range, but the overall DNL and INL are not improved. Interpolating any arbitrary number of levels is possible by making more resistor taps. The interpolation is usually done using resistors, but interpolation in the current domain is also possible. However, interpolating with independent current sources does not improve the DNL.

FIGURE 54.9 (a) Interpolation and (b) averaging techniques.

FIGURE 54.10 Coarse and fine reference ladders for two-step subranging ADC.

Another technique called averaging, as explained in Fig. 54.9(b), is often used to average out the offsets of the neighboring preamplifiers as well as to enhance the accuracy of the reference divider [7]. The idea is to couple the outputs of the preamplifier transconductance (Gm) stage so that the significant errors can be spread over the adjacent preamplifier outputs as explained. For example, if the coupling resistor value is infinite, there exists no averaging. As the coupling resistor value decreases, one preamplifier output becomes the weighted sum of the outputs of its neighboring preamplifiers. Therefore, the overall DNL and INL can improve significantly [8]. However, for the case in which the errors to average have the same polarity, the averaging is not that effective. In practice, both the interpolation and the averaging concepts are often combined.
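The basic flash conversion described in this section, with 2^N - 1 simultaneous comparisons followed by an encoder, can be sketched as a behavioral model. The ideal ladder taps, the ones-counting encoder, and the function name are assumptions made for illustration. Practical encoders differ in how they handle metastability and sparkle codes.

```python
def flash_convert(v_in, v_ref, n_bits):
    """Compare v_in with all 2^N - 1 ladder taps at once and encode the result."""
    n_levels = 1 << n_bits
    taps = [v_ref * k / n_levels for k in range(1, n_levels)]   # 2^N - 1 references
    thermometer = [1 if v_in > tap else 0 for tap in taps]      # comparator bank
    code = sum(thermometer)                                      # ones-counting encoder
    return thermometer, code

# 3-b example with Vref = 1 V and a hypothetical input of 0.40 V.
thermo, code = flash_convert(0.40, 1.0, 3)
```

The comparator outputs form the thermometer code 1110000, which the encoder converts to the binary code 3. Each additional bit doubles the length of `taps`, which is the exponential growth the text refers to.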

Subranging ADC

Although the interpolation and averaging techniques simplify the flash ADC, the number of comparators stays the same. Instead of making all the bit decisions at once, resolving a few bits at a time makes the system simpler and more manageable. It also enables us to use a digital error correction concept. The simplest subranging ADC concept is explained in Fig. 54.10 for the two-step conversion case. It is a straightforward subranging since one subrange out of 2^M subranges is chosen in the coarse M-bit decision. Once one subrange is selected, the N-bit fine decision can be done using a fine reference ladder interpolating the selected subrange. Note that the subrange after the coarse decision is Vref/2^M and the fine comparators should have a resolution of M+N bits. Unless the digital error correction with redundancy is used, the coarse comparators should also have a resolution of M+N bits.
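A behavioral sketch of the two-step decision follows, assuming ideal coarse and fine ladders and no redundancy. The function name and the example input are illustrative assumptions.

```python
def two_step_subranging(v_in, v_ref, m_bits, n_bits):
    """Coarse M-bit decision picks a subrange; fine N-bit decision resolves within it."""
    coarse_step = v_ref / (1 << m_bits)                    # subrange width Vref / 2^M
    coarse = min(int(v_in / coarse_step), (1 << m_bits) - 1)
    fine_step = coarse_step / (1 << n_bits)                # fine step Vref / 2^(M+N)
    fine = min(int((v_in - coarse * coarse_step) / fine_step), (1 << n_bits) - 1)
    return coarse, fine, (coarse << n_bits) | fine

# M = 2, N = 2 with Vref = 1 V and a hypothetical input of 0.7 V.
coarse, fine, code = two_step_subranging(0.7, 1.0, 2, 2)
```

The coarse decision selects the third subrange (code 2), the fine decision returns 3, and the combined 4-b code is 1011. Because the fine step is Vref/2^(M+N), any coarse threshold error larger than one fine step selects the wrong subrange, which is why the coarse comparators need M+N-bit accuracy when no redundancy is used.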

Multi-Step ADC

The tactic of making a few bit decisions at a time as shown in the subranging case can be generalized. A slight modification of the subranging architecture shown in Fig. 54.11(a) to include a residue amplifier with a gain of 2^M results in Fig. 54.11(b). The residue is defined as the difference between the input and the nearest DAC output lower than the input. The difference between the two concepts is subtle, but including one residue amplifier drastically changes the system requirements. The obvious advantage of using the residue amplifier is that the fine comparators do not need to be accurate because the residue from the coarse decision is amplified by 2^M. That is, the subrange after the coarse decision is no longer Vref/2^M. The disadvantage is the accuracy and settling of the high-gain residue amplifier.

FIGURE 54.11 Variations of the subranging concepts: (a) without and (b) with residue amplifier.

Whether the residue is amplified or not, the subranging block consists of a coarse ADC, a DAC, a residue subtractor, and an amplifier. In theory, this block can be repeated as shown in Fig. 54.12. How many times it is repeated determines the number of steps. So, in general terms, the n-step ADC has n-1 subranging blocks. To complete a conversion in one cycle, poly-phase subdivided clocks are usually needed. Due to the difficulty in clocking, the number of steps for the multi-step architecture is usually limited to two, which does not incur a speed penalty and needs only the standard two-phase clocking.

FIGURE 54.12 Multi-step ADC architecture.

There are many variations in the multi-step architecture. If no poly-phase clocking is used, it is called a ripple ADC. Also in the two-step ADC, if one ADC is repeatedly used both for the coarse and fine decisions, it is called a recycling ADC [9]. In this ADC example, the capacitor-array multiplying DAC (MDAC) also performs the S/H function in addition to the residue amplification. This MDAC, with either a binary-ratioed or thermometer-coded capacitor array, is a general form of the residue amplifier. The same capacitor array has been used with a comparator to implement a charge-redistribution successive-approximation ADC [1]. This MDAC is suited for MOS technologies, but other forms of the residue amplification are possible using resistor-strings or current DACs.

Pipeline ADC

The complexity of the two-step ADC, although manageable and simpler than the flash ADC, still grows exponentially as the number of bits to resolve increases. Specifically, for high resolution above 12 b, the complexity approaches its practical limit, and a need to pipeline subranging blocks arises. The pipeline ADC architecture shown in Fig. 54.13 is the same as the subranging or multi-step ADC architecture shown in Fig. 54.12 except for the interstage S/H. Since the S/Hs are clocked by alternating clock phases, each stage needs to perform the decision and the residue amplification in each clock phase. Pipelining the residue greatly simplifies the ADC architecture. The complexity grows only linearly with the number of bits to resolve. Due to its simplicity, the pipeline ADC has been gaining popularity in the digital VLSI environment.

FIGURE 54.13 Pipeline ADC architecture.

In the pipeline ADC, each stage resolves a few bits quickly and transfers the residue to the following stage so that the residue can be resolved further in the subsequent stages. Therefore, the accuracy of the interstage residue amplifier limits the overall performance. The following four non-ideal error sources can affect the performance of the multi-step or pipeline ADCs: ADC resolution, DAC resolution, gain error of the residue amplifier, and inaccurate settling of the residue amplifier. The offset of the residue amplifier does not affect the linearity, but it appears as a system offset. Among these four error sources, the first three are static, but the residue amplifier settling is dynamic. If the residue amplifier is assumed to settle within one clock phase, the three static error sources limit the linearity performance.

Figure 54.14 explains the residue from the 2-b stage in the systems shown in Figs. 54.12 and 54.13. In the ideal case, as the input is swept from 0 to the full range Vref, the residue change from 0 to Vref repeats each time Vref is subtracted at the ideal locations of the 2-b ADC thresholds, which are Vref/4 apart. In this case, the 2-b stage does not introduce any nonlinearity error. However, in the other cases with ADC, DAC, and gain errors, the residue ranges do not match the ideal full-scale Vref. If the residue range is smaller than the full range, missing codes are generated; and if the residue goes out of bounds, excessive codes are generated at the ADC thresholds. Unlike the DAC and gain errors, the ADC error appears as a shift of the residue by either Vref or -Vref as long as the DAC subtracts the ideal Vref and the residue amplifier gain is ideal. This suggests that the error can be corrected digitally by adding or subtracting the full range.

FIGURE 54.14 2-b residue versus input: (a) ideal, and with (b) ADC, (c) DAC, and (d) gain errors.

Digital Error Correction

Any multi-step or pipeline ADC system can be made insensitive to the ADC error if the ADC error is digitally corrected. The residue normally going out of the full range can still be digitized by the following stage if the residue amplifier gain is reduced. That is, if the residue amplifier gain is set to 2^(N-1) instead of 2^N, the residue plots are as shown in Fig. 54.15. If the residue is bounded within the full range of 0 to Vref, the inner range from Vref/4 to 3Vref/4 is the normal conversion range, and two redundant outer ranges are used to cover the residue error resulting from the inaccurate coarse conversion. Now the problem is that this redundancy requires extra resolution to cover the overrange. The half ranges on both sides are used to cover the residue error in this 2-b case. That is, one full bit of extra resolution is necessary for redundancy in the 2-b case. However, the amount of redundancy depends on the ADC error range to be corrected in the multi-bit cases [10]. In general, it is a tradeoff between comparator resolution and redundancy.

FIGURE 54.15 Over-ranged 2-b residue versus input: (a) ideal and with (b) ADC errors.

The range marked as B is the normal range, and A and C indicate the correction ranges in Fig. 54.15(b). The digital correction works by subtracting 1 from the previous decision if the residue exceeds 3Vref/4, as in the case marked as A. On the other hand, if the residue goes below Vref/4, as in the case marked as C, 1 is added to the previous decision.

Although the digital subtraction is simple, biasing the ADC threshold levels by half the ideal interval has an advantage. Figure 54.16 compares the residue plots for two cases with and without the ADC threshold shift by Vref/8. This is to fully utilize the ADC conversion range from 0 to Vref. The shift of Vref/8 makes the residue start from 0 and end at Vref in Fig. 54.16(b), contrary to the previous case where the residue starts from Vref/4 and ends at 3Vref/4. This results in saving one comparator. The former case needs 2^N - 1 comparators, while the latter case needs 2^N - 2. The only minor issue is that the latter exhibits a half-LSB systematic offset due to this shift. This half-bit-level shift makes the ADC error occur only with the same polarity. As a result, only the addition is necessary for digital correction in the case of Fig. 54.16(b). This is explained in the 4-b ADC made of three stages using one-bit correction per stage in Fig. 54.17. The vertical axis marks the signal and residue levels as well as ADC decision levels. The dotted and shaded areas follow the residue paths when the ADC error occurs, but the end results are the same after digital correction. This half-interval shift is valid for stages resolving any number of bits.

FIGURE 54.16 Half-bit shifted 2-b residue versus input: (a) with and (b) without ADC threshold-level shift.

Overall, the digital error correction enables fast data conversion using inaccurate comparators. However, the DAC resolution and the residue gain error still remain as the fundamental limits in multi-step and pipeline ADCs. The currently known ways to overcome these limits are either trimming or self-calibration.

One-Bit Pipeline ADC The degenerate case of the pipeline ADC is when only one bit is resolved per stage as shown in Fig. 54.18. Each stage multiplies its input Vin by two and subtracts the reference voltage Vref to generate the residue voltage. If the sign of 2Vin – Vref is positive, the bit is 1 and the residue goes to the next stage. Otherwise, the bit is 0 and Vref is added back to the residue before it goes to the next stage. However, in reality, it is more desirable if the reference restoring time is saved. In the non-restoring algorithm, the previous bit decision affects the polarity of the reference voltage to be used in the current bit decision. If the previous bit is 1, the residue voltage is 2Vin – Vref , as in the restoring algorithm. But if the previous bit is 0, the residue voltage is 2Vin + Vref . The switched-capacitor implementation of the basic functional block performing 2Vin ± Vref is explained using two identical capacitors and one op-amp in Fig. 54.19 [11]. During the sampling phase, the bottom plates of two capacitors are switched to the input, and the top plate is connected either to the op-amp output or to the op-amp input common-mode voltage. During the amplification phase, one


FIGURE 54.17

Example of digital error correction for three-stage 4b ADC (1100).

FIGURE 54.18

One-bit per stage pipeline ADC architecture.

of the capacitors is connected to the output of the op-amp for feedback, but the other is connected to ±Vref . Then, the output of the op-amp will be 2Vin – Vref and 2Vin + Vref , respectively, after the op-amp settles. However, this simple one-bit pipeline ADC is of no use if the comparator resolution is limited. If any redundancy is used for digital correction, at least two bits should be resolved. A close look at Fig. 54.16(b) gives a clue to using the simple functional block shown in Fig. 54.19 for the 2-b residue amplification. The case explained in Fig. 54.16(b) is sometimes called 1.5-b rather than 2-b because it needs only three


FIGURE 54.19

The simplest two-level MDAC: (a) sampling phase and (b) amplification phase.

DAC levels rather than four. The functional block in Fig. 54.19 can have a two-level DAC subtracting ±Vref . However, in differential architecture, by shorting the input, one midpoint can be interpolated. Using the tri-level DAC, the 1.5-b pipeline ADC can be implemented with the following algorithm.12 If the input Vin is lower than –Vref /4, the residue output is 2Vin + Vref . If the input is higher than Vref /4, the residue output is 2Vin – Vref . If the input is in the middle, the output is 2Vin .
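The 1.5-b algorithm above can be sketched in a few lines. This is a minimal behavioral model, not a circuit description: the function names, the signed-digit encoding (-1, 0, +1), and the per-stage comparator offsets are assumptions introduced for illustration. It reconstructs the input from the stage decisions and shows that threshold errors of up to Vref/4 leave the residue inside the conversion range.

```python
def stage_1p5b(v, vref=1.0, offset_lo=0.0, offset_hi=0.0):
    """One 1.5-b stage: returns (digit, residue), digit in {-1, 0, +1}.

    offset_lo / offset_hi model comparator threshold errors (hypothetical
    parameters); the nominal thresholds are -vref/4 and +vref/4.
    """
    if v < -vref / 4 + offset_lo:
        return -1, 2 * v + vref   # tri-level DAC adds Vref
    if v > vref / 4 + offset_hi:
        return +1, 2 * v - vref   # tri-level DAC subtracts Vref
    return 0, 2 * v               # inputs shorted: midpoint level


def convert_1p5b(vin, n_stages, vref=1.0, offsets=None):
    """Run vin through n_stages stages; returns (estimate, final_residue)."""
    offsets = offsets or [(0.0, 0.0)] * n_stages
    v, estimate = vin, 0.0
    for i, (lo, hi) in enumerate(offsets):
        digit, v = stage_1p5b(v, vref, lo, hi)
        # Stage i carries a weight of Vref / 2^(i+1); overlapping digits
        # are simply added, which is the digital correction step.
        estimate += digit * vref / 2 ** (i + 1)
    return estimate, v
```

Because each stage computes residue = 2v - digit*Vref, the input always equals the estimate plus the final residue divided by 2^n, so the conversion error stays within Vref/2^n as long as the residue remains bounded by Vref.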

Algorithmic, Cyclic, or Recursive ADC The interstage S/H used in the multi-step architecture gives the pipeline architecture its flexibility. In the pipeline structure, the same hardware repeats as shown in Fig. 54.13. That is, the throughput rate of the pipeline is fast while the overall latency is limited by the number of stages. Instead of repeating the hardware, using the same stage repeatedly greatly saves hardware, as shown in Fig. 54.20. In this case, the throughput rate of the pipeline is directly traded for hardware simplicity. Such a converter is called an algorithmic, cyclic, or recursive ADC. The functional blocks used for the algorithmic ADC are identical to the ones used in the pipeline ADC.
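As an illustration of reusing one stage, the following sketch applies the restoring one-bit step (double the input, subtract Vref, restore if the result is negative) in a loop. The function name and the unipolar input range 0 to Vref are assumptions made for this example.

```python
def cyclic_adc(vin, n_bits, vref=1.0):
    """Behavioral cyclic ADC: one stage reused n_bits times (MSB first)."""
    residue, code = vin, 0
    for _ in range(n_bits):
        trial = 2 * residue - vref
        if trial >= 0:
            bit, residue = 1, trial
        else:
            bit, residue = 0, trial + vref   # restoring step: add Vref back
        code = (code << 1) | bit
    return code
```

One conversion occupies the single stage for n_bits clock cycles, which is the throughput cost paid for the hardware saving.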

Time-Interleaved Parallel ADC The algorithmic ADC just described sacrifices the throughput rate for small hardware. However, the time-interleaved parallel ADC takes quite the opposite direction. It duplicates more hardware in parallel for higher throughput rates. The system is shown in Fig. 54.21, where the throughput rate increases by the number of parallel paths strobed. Although it significantly improves the throughput rate and many refinements have been reported, it suffers from many problems.13 Due to the multiplexing, even static nonlinearity mismatch between paths appears as a fixed pattern noise. Also, it is difficult to generate clocks with exact delays, and inaccurate clocking increases the noise floor.
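The fixed pattern noise can be seen in a toy model in which each path adds its own static offset. The function name and the offset values used below are hypothetical, and only offset mismatch is modeled (no gain or timing mismatch).

```python
def interleave(samples, channel_offsets):
    """Sample i is converted by path i mod M, which adds that path's offset."""
    m = len(channel_offsets)
    return [x + channel_offsets[i % m] for i, x in enumerate(samples)]
```

With a constant input and four paths, for example `interleave([0.5] * 8, [0.0, 0.01, -0.02, 0.005])`, the output repeats with a period of four samples. The mismatch therefore shows up as a fixed pattern tied to the path rotation rather than as random noise.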

FIGURE 54.20

Algorithmic, cyclic, or recursive ADC architecture.


FIGURE 54.21

Time-interleaved parallel ADC architecture.

Folding ADC The folding ADC is similar to the flash ADC except for using fewer comparators. This reduction in the number of comparators is achieved by replacing the comparator preamplifiers with folding amplifiers. In its original arrangements,14 the folding ADC digitizes the folded signal with a flash ADC. The folded signal is equivalent in concept to the residue of the subranging, multi-step, or pipeline ADC, but the difference is that the generation of the folding signal is done solely in the analog domain. Since the digitized code from the folding amplifier output repeats over the whole input range, a coarse coding is required, as in all subranging-type ADCs. Consider a system configured as a 4-b folding ADC as shown in Fig. 54.22. Four folding amplifiers can be placed in parallel to produce four folded signals. Comparators check the outputs of the folding amplifiers for zero crossing. If the input is swept, the outputs of the fine comparators show a repeating pattern, and eight different codes can be obtained by the four comparators. Because there are two identical fine code patterns, one comparator is needed to distinguish them. However, if this coarse comparator is misaligned with the fine quantizer, missing codes will result. A digital correction similar to that for the multi-step or pipeline ADC can be employed to correct the coarse quantizer error. For this system example, one-bit redundancy is used by adding two extra comparators in the coarse quantizer. The shaded region in the figure is where errors occur. Having several folded signals instead of one has many advantages in designing high-speed ADCs. The folding amplifier requires neither linear output nor accurate settling. This is because in the folding ADC, the zero-crossings of the folded signals matter, but not their absolute values. Therefore, the offset of the folding amplifiers becomes the most important design issue. 
The resolution of the folding ADC can be further improved using the interpolation concept. When the adjacent folded signals are interpolated by I times, the number of zero-crossing points is also increased by I times, so the resolution of the final ADC is improved by log2(I) bits. The upper bound for the degree of interpolation is set by the comparator resolution, the gain of the folding amplifiers, the linearity of folded signals, and the interpolation accuracy. Since the folding process increases the internal signal bandwidth by the number of foldings, the folding ADC performance is limited by the folding amplifier bandwidth. To increase the number of foldings while maintaining a reasonable comparator resolution, the folding amplifier’s gain should be high. Since the higher gain limits the amplifier bandwidth, it is necessary to cascade the folding stages.6,8

54.4 ADC Design Considerations In general, multi-step ADCs are made of cascaded low-resolution ADCs. Each low-resolution ADC stage provides a residue voltage for the subsequent stage, and the accuracy of the residue voltage limits the resolution of the converter. One of the residue amplifiers commonly used in CMOS is a switched-capacitor MDAC, whose connections during two clock phases are illustrated in Fig. 54.23 for an N-bit case. An


FIGURE 54.22

A 4-b folding ADC example with digital correction.

extra capacitor C is usually added to double the feedback capacitor size so that the residue voltage may remain within the full-scale range for digital correction.

Sampling Error Considerations Since the ADC works on a sampled signal, the accuracy in sampling fundamentally limits the system performance. It is well known that the noise power sampled on a capacitor along with the signal is kT/C, which is inversely proportional to the sampling capacitor size. The sampled rms voltage noise is 64 µV with 1 pF, but decreases to 20 µV with 10 pF. For accurate sampling, sampling capacitors should be large, but sampling on large capacitors takes time. The speed of the ADC is therefore fundamentally limited by the sampling kT/C noise. In sampling, there exists another important error source. Direct sampling on a capacitor suffers from switch feedthrough error due to the charge injection when switches are opened. A common way to reduce


FIGURE 54.23

General N-bit residue amplifier: (a) sampling phase and (b) amplification phase.

FIGURE 54.24

Open-loop bottom-plate differential sampling on capacitors.

this charge feedthrough error is to turn off the switches connected to the sampling capacitor top plate slightly earlier than the switches connected to the bottom plate. This is explained in Fig. 54.24. Usually, the top plate is connected to the op-amp summing or comparator input node. The top plate switch is implemented with one MOS transistor driven by a clock phase marked as Φ′1, which makes a falling transition earlier than the other clocks. The bottom plate is switched with a CMOS switch (both NMOS and PMOS) driven by the clock Φ1 and its complement Φ̄1. These clocks make falling transitions after the prime clock does. The net effect is that the feedthrough voltage stays constant because the top plate samples the same voltage repeatedly. The differential sampling using two capacitors symmetrically is known to provide the most accurate sampling known to date. Unless limited by speed, the sampling error as well as the low-end spectrum of the sampled noise can be eliminated using a correlated double sampling (CDS) technique. The technique has been used to remove flicker noise or slowly varying offsets, such as in charge-coupled devices (CCDs). The CDS needs two sampling clocks. The idea is to subtract the previously sampled sampling error from the new sample


TABLE 54.1 Three Dominant ADC Architectures

              Interpolated Flash                        Multi-step       Pipeline
Matching      Least                                     Medium           Most critical
Feedthrough   Most critical                             Medium           Least
Bandwidth     Least                                     Most critical    Medium
Settling      Least                                     Medium           Most critical
Gain          Least                                     Medium           Most critical
Speed         Fast                                      Slow             Medium
Complexity    Complex                                   Medium           Simple
Problems      Clock jitter, time skew, sampling error   Low loop gain    Matching, wide bandwidth, high gain

after one clock delay. The result is to null the sampling error spectrum at every multiple of the sampling frequency fs. The CDS is effective only for the low-frequency spectrum.
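The kT/C figures quoted in this subsection, and the nulls produced by CDS, can be checked numerically. In the sketch below the temperature of 300 K is an assumption (the text does not state one), and the CDS is modeled as an ideal one-clock-delay subtraction, 1 - z^-1; the function names are illustrative only.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K


def ktc_noise_rms(cap_farads, temp_kelvin=300.0):
    """RMS noise voltage sampled on a capacitor: sqrt(kT/C)."""
    return math.sqrt(K_BOLTZMANN * temp_kelvin / cap_farads)


def cds_gain(f, fs):
    """Magnitude of 1 - z^-1 at frequency f for sampling rate fs."""
    return 2 * abs(math.sin(math.pi * f / fs))
```

Under these assumptions, `ktc_noise_rms(1e-12)` and `ktc_noise_rms(10e-12)` come out close to the 64 µV and 20 µV mentioned in the text, and `cds_gain` goes to zero at dc and at every multiple of fs.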

Techniques for High-Resolution and High-Speed ADCs Considering typical requirements, three representative ADC architectures are compared in Table 54.1. The techniques known to date for improving ADC resolution are trimming, dynamic matching, ratio-independent techniques, capacitor-error averaging, walking reference, and self-calibration. However, trimming is irreversible and expensive. It is only possible at the factory or with precision equipment. The dynamic matching technique is effective, but it generates high-frequency noise. The ratio-independent techniques either require many clock cycles or are limited to improving differential linearity. The latter case is good for monotonicity, but it also requires accurate comparison. The capacitor-error averaging technique requires three clock cycles, and the walking reference is sensitive to clock sampling error. The self-calibration technique requires extra hardware for calibration and digital storage, but its compatibility with existing proven architectures may provide potential solutions both for high resolution and for high speed. The ADC self-calibration concepts originated from the direct code-mapping concept using memory. The calibration predistorts the digital input to the DAC so that the DAC output matches the ideal level from the calibration equipment. Due to the precision equipment needed, this method has limited use. The first self-calibration concept, applied to the binary-ratioed successive-approximation ADC, internally measures capacitor DAC ratio errors using a resistor-string calibration DAC as shown in Fig. 54.25.15 Later, an improved concept of the digital-domain calibration was developed for the multi-step or pipeline ADCs.16

FIGURE 54.25

Self-calibrated successive-approximation ADC.


FIGURE 54.26

Segment-error measuring technique for digital self-calibration.

The general concept of the digital-domain calibration is to measure code or bit errors, to store them in the memory, and to subtract them during the normal operation. The concept is explained in Fig. 54.26 using a generalized N-bit MDAC with a capacitor array. If the DAC code increases by 1, the MDAC output should increase by Vref or Vref /2 with digital correction. Any deviation from this ideal step is defined as a segment error. Code errors are obtained by accumulating segment errors. This segment-error measurement needs two cycles. The first cycle is to measure the switch feedthrough error, and the next cycle is to measure the segment error. The segment error is simply measured as shown in Fig. 54.26 using the LSB-side of the ADC by increasing the digital code by 1. In the case of N = 1, the segment error becomes a bit error. If binary bit errors are measured and stored, code errors should be calculated for subtraction during the normal operation. How to store DAC errors is a tradeoff issue. Examples of the digital calibration are well documented for the cases of segment-error17 and bit-error18 measurements, respectively.
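The bookkeeping behind this scheme can be sketched as follows. The function names and the choice of storing one accumulated error per code are assumptions for illustration; how the errors are actually stored is, as noted, a tradeoff.

```python
from itertools import accumulate


def code_errors(segment_errors):
    """Code error for code k is the sum of the first k segment errors.

    Code 0 is taken as the error-free reference (an assumption).
    """
    return [0.0] + list(accumulate(segment_errors))


def calibrate(raw_value, msb_code, errors):
    """Subtract the stored code error during normal operation."""
    return raw_value - errors[msb_code]
```

For example, segment errors of 0.5, -0.25, and 0.125 LSB accumulate to code errors of 0, 0.5, 0.25, and 0.375 LSB, and a raw result at code 2 is corrected by subtracting 0.25 LSB.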

54.5 DAC Design Arts There are many different circuit techniques used to implement DACs, but the popular ones widely used today are of the parallel type in which all bits change simultaneously upon the application of an input code word. Serial DACs, on the other hand, produce an analog output only after receiving all digital input data in a sequential form. When DACs are used as stand-alone devices, their output transient behaviors, limited by glitch, slew rate, word clock jitter, settling, etc., are of paramount importance, but when used as subblocks of ADCs, DACs need only to settle within a given time interval. An output S/H, usually called a deglitcher, is often used for better transient performance. Three of the most popular architectures of DACs are resistor string, ratioed current sources, and a capacitor array. The current-ratioed DAC finds most applications as a stand-alone DAC, while the resistor-string and capacitor-array DACs are mainly used as ADC subblocks. For speeds over 100 MHz, most state-of-the-art DACs employ current sources switched directly to output resistors.19-23 Furthermore, owing to the high bit counts (12 to 16 b), segmented architectures are employed, with the current sources broken into two or three segments. The CMOS design has the advantages of lower power, smaller area, and lower manufacturing costs. In all cases, it is of interest to note that the dynamic performance of the DACs degrades rapidly as input frequencies increase, and true dynamic performance is not attained except at low frequencies. Since a major application of wide-bandwidth, high-resolution DACs is in communications, poor dynamic performance is undesirable,


owing to the noise leakage from frequency multiplexed channels into other channels. The goal of better dynamic performance continues to be a target of ongoing research and development.

54.6 DAC Architectures An N-bit DAC provides a discrete analog output level, either voltage or current, for each of the 2^N digital words applied to the input. Therefore, an ideal voltage DAC generates 2^N discrete analog output voltages for digital inputs varying from 000…00 to 111…11. In the unipolar case, the reference point is 0 when the digital input is 000…00; but in bipolar or differential DACs, the reference point is the midpoint of the full scale when the digital input is 100…00. Although purely current-output DACs are possible, voltage-output DACs are common in most applications.

Resistor-String DAC The simplest voltage divider is a resistor string. Reference levels can be generated by connecting 2^N identical resistors in series between Vref and ground. Switches to connect the divided reference voltages to the output can be either a 1-out-of-2^N decoder or a binary tree decoder, as shown in Fig. 54.27 for the 3-b example. Since it requires a good switch, the stand-alone resistor-string DAC is easier to implement using CMOS. However, the lack of good switches does not limit the application of the resistor string as a voltage reference divider subblock for ADCs in other process technologies. Resistor strings are widely used as an integral part of the flash ADC as a reference divider. All resistor-string DACs are inherently monotonic and exhibit good differential linearity. However, they suffer from poor integral linearity and also have the drawback that the output resistance depends on the digital input

FIGURE 54.27

Resistor-string DAC.


FIGURE 54.28

Current-ratioed DAC.

code. This causes a code-dependent settling time when charging the capacitive load. This non-uniform settling time problem can be alleviated by adding low-resistance parallel resistors or by compensating the MOS switch overdrive voltages.
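The code dependence of the output resistance follows from the Thevenin equivalent seen at each tap. The sketch below assumes an ideal zero-impedance reference, ideal switches, and 2^N equal unit resistors; the function name and parameters are illustrative only.

```python
def resistor_string_tap(code, n_bits, vref=1.0, r_unit=1.0):
    """Open-circuit voltage and Thevenin resistance at tap `code`.

    The tap sees code*R to ground in parallel with (2^N - code)*R to Vref.
    """
    n = 2 ** n_bits
    v_tap = vref * code / n
    r_out = r_unit * code * (n - code) / n
    return v_tap, r_out
```

For the 3-b case the output resistance is zero at code 0 and peaks at 2R at mid-scale (code 4), so the RC settling into a capacitive load is slowest near mid-scale.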

Current-Ratioed DAC The most popular stand-alone DACs in use today are current-ratioed DACs. There are two types: one is a weighted-current DAC and the other is an R-2R DAC. The weighted-current DAC shown in Fig. 54.28 is made of an array of switched binary-weighted current sources and the current summing network. In bipolar technology, the binary weighting is achieved by ratioed transistors and emitter resistors with binary related values of R, R/2, R/4, etc., while in MOS technology, only ratioed transistors are used. DACs relying on active device matching can achieve an 8b-level performance with a 0.2 to 0.5% matching accuracy using a 10- to 20-µm device feature size, while degeneration with thin-film resistors gives a 10b-level performance. The current sources are switched on or off by means of switching diodes or emitter-coupled differential pairs (source-coupled pairs in CMOS). The output current summing is done by a wideband transresistance amplifier; but in high-speed DACs, the output current directly drives a resistor load for maximum speed. The weighted-current design has the advantage of simplicity and high speed, but it is difficult to implement a high-resolution DAC because a wide range of emitter resistors and transistor sizes are used, and very large resistors cause problems with both temperature stability and speed.

R-2R Ladder DAC This large resistor ratio problem is alleviated by using a resistor divider known as an R-2R ladder, as shown in Fig. 54.29. The R-2R network consists of series resistors of value R and shunt resistors of value 2R. The top of each shunt resistor of value 2R has a single-pole double-throw electronic switch that connects the resistor either to ground or to the current summing node. The operation of the R-2R ladder network is based on the binary division of current as it flows down the ladder. At any junction of series resistor of value R, the resistance looking to the right side is 2R. Therefore, the input resistance at any junction is R, and the current splits into two equal parts at the junction since it sees equal resistances in both directions. As a result, binary-weighted currents flow into shunt resistors in the ladder. The digitally controlled switches direct the currents to either ground or to the summing node. The advantage of the


FIGURE 54.29

R-2R DAC.

R-2R ladder method is that only two values of resistors are used, greatly simplifying the task of matching or trimming and temperature tracking. In addition, for high-speed applications, relatively low resistor values can be used. Excellent results can be obtained using laser-trimmed thin-film resistor networks. Since the output of the R-2R DAC is the product of the reference voltage and the digital input word, the R-2R ladder DAC is often called an MDAC.
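The binary current division can be traced junction by junction. This sketch assumes an ideal, properly terminated ladder whose input resistance is R and whose summing node sits at virtual ground; the function name is hypothetical.

```python
def r2r_output_current(bits_msb_first, vref=1.0, r=1.0):
    """Sum of the shunt currents steered to the summing node."""
    branch = vref / r          # total current entering the ladder
    total = 0.0
    for bit in bits_msb_first:
        branch /= 2            # half flows into the 2R shunt at this junction
        if bit:
            total += branch    # switch steers this shunt current to the output
    return total
```

With Vref/R normalized to 1, the input word 101 gives 1/2 + 1/8 = 0.625, that is, 5/8 of the full-scale current, and the result scales linearly with Vref, which is the multiplying property noted above.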

Capacitor-Array DAC Capacitors made of double-poly or poly-diffusion in MOS technology are considered one of the most accurate passive components comparable to thin-film resistors in the bipolar process, both in the matching accuracy and voltage and temperature coefficients.1 The only disadvantage in the capacitor-array DAC implementation is the use of a dynamic charge redistribution principle. A switched-capacitor counterpart of the resistor-string DAC is a parallel capacitor array of 2^N unit capacitors with a common top plate. The capacitor-array DAC is not appropriate for stand-alone applications without a feedback amplifier virtually grounding the top plate and an output S/H or deglitcher. The operation of the capacitor-array DAC shown in Fig. 54.30(a) is based on the thermometer-coded DAC principle and has the distinct advantage of monotonicity. However, due to the complexity of handling the thermometer-coded capacitor array, a binary-weighted capacitor array is often used, as shown in Fig. 54.30(b) by grouping unit capacitors in binary ratio values. One important application of the capacitor-array DAC is as a reference DAC for ADCs. As in the case of the R-2R MDAC, the capacitor-array DAC can be used as an MDAC to amplify residue voltages for multi-step or pipeline ADCs.
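The monotonicity difference between the two arrays can be illustrated with a simple charge-redistribution model in which the output is Vref times the selected capacitance over the total capacitance. This ideal model, the function names, and the mismatch values in the usage note are all assumptions for illustration.

```python
def cap_dac_thermometer(code, unit_caps, vref=1.0):
    """Thermometer array: code k switches the first k unit capacitors."""
    return vref * sum(unit_caps[:code]) / sum(unit_caps)


def cap_dac_binary(code, caps_msb_first, c_term, vref=1.0):
    """Binary-weighted array with a terminating capacitor c_term."""
    n = len(caps_msb_first)
    selected = sum(c for i, c in enumerate(caps_msb_first)
                   if (code >> (n - 1 - i)) & 1)
    return vref * selected / (sum(caps_msb_first) + c_term)


def is_monotonic(levels):
    """True if the output never decreases as the code increases."""
    return all(b >= a for a, b in zip(levels, levels[1:]))
```

In the thermometer array each code step adds one more positive capacitor, so the output cannot decrease whatever the mismatch. In the binary array a deliberately undersized MSB capacitor, for example 3.2 units against LSB-side capacitors of 2.2 and 1.1 units, makes the output fall at the 011 to 100 transition.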

Thermometer-Coded Segmented DAC Applying a two-step conversion concept, a DAC can be made in two levels using coarse and fine DACs. The fine DAC divides one coarse MSB segment into fine LSBs. If one fixed MSB segment is subdivided to generate LSBs, mismatch among MSB segments creates a non-monotonicity problem. However, if the next MSB segment is subdivided instead of the fixed segment, the segmented DAC can maintain monotonicity regardless of the MSB matching. This is called the next-segment approach. The most widely used segmented DAC is a current-ratioed DAC, whose MSB DAC is made of identical elements for the next-segment approach, except that the LSB DAC is a current divider as shown in Fig. 54.31. To implement a segmented DAC using two resistor-string DACs, voltage buffers are needed to drive the LSB DAC


FIGURE 54.30

Capacitor-array DACs: (a) thermometer-coded and (b) binary-weighted.

without loading the MSB DAC. Although the resistor-string MSB DAC is monotonic, overall monotonicity is not guaranteed due to the offsets of the voltage buffers. The use of a capacitor-array LSB DAC eliminates the need for voltage buffers.
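The difference between subdividing a fixed segment and subdividing the next segment can be sketched with mismatched MSB segments. The function names and the segment values in the usage note are hypothetical, and the LSB divider is taken as ideal.

```python
def next_segment_dac(msb, lsb, segments, lsb_bits):
    """Sum the segments below `msb`, then subdivide segment `msb` itself."""
    return sum(segments[:msb]) + segments[msb] * lsb / 2 ** lsb_bits


def fixed_segment_dac(msb, lsb, segments, lsb_bits):
    """Same MSB sum, but the LSBs always subdivide segment 0."""
    return sum(segments[:msb]) + segments[0] * lsb / 2 ** lsb_bits


def transfer(dac, segments, lsb_bits):
    """Output levels for all codes, in increasing code order."""
    return [dac(m, l, segments, lsb_bits)
            for m in range(len(segments)) for l in range(2 ** lsb_bits)]


def is_monotonic(levels):
    return all(b >= a for a, b in zip(levels, levels[1:]))
```

With segments of 1.0, 0.5, 1.25, and 1.0 units and two LSB bits, the next-segment transfer curve stays monotonic, while the fixed-segment curve steps backward when the undersized second segment is crossed.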

Integrator-Type DAC As mentioned, monotonicity is guaranteed only in a thermometer-coded DAC. The thermometer coding of a DAC output can be implemented either by repeating identical DAC elements many times or by using


FIGURE 54.31

Thermometer-coded segmented DAC.

the same element over and over. The former requires more hardware, but the latter requires more time. In the continuous-time integrator-type DAC, the integrator output is a linear ramp and the time to stop integration can be controlled digitally. Therefore, monotonicity can be maintained. Similarly, the discrete-time integrator can integrate a constant amount of charge repeatedly and the number of integrations can be controlled digitally. The integration approach can give high accuracy, but its disadvantage is that its slow speed limits its applications.

54.7 DAC Design Considerations Figure 54.32 illustrates two step responses of a DAC when it settles with a time constant τ and when it slews with a slew rate S. For a step of height h, the transient errors given by the shaded areas are hτ and h^2/2S, respectively. This implies that single-time-constant settling, as in the former case, only generates a linear error in the output, which does not affect the DAC linearity, but the slew-limited settling generates a nonlinear error. Even in the single-time-constant case, a code-dependent settling time constant can introduce a nonlinearity error because the settling error is a function of the time constant τ. This is true for a resistor-string DAC, which exhibits a code-dependent settling time because the output resistance of the DAC depends on the digital input.
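The two shaded areas follow from integrating the error waveforms. The short derivation below assumes an ideal step of height h applied at t = 0, a pure single-pole response in the first case, and a constant-slope ramp in the second.

```latex
% Exponential settling: the error decays as h e^{-t/\tau}
\int_0^{\infty} h\,e^{-t/\tau}\,dt = h\tau
% Slew-limited settling: the error falls linearly from h to 0 in a time h/S
\int_0^{h/S} \left(h - S\,t\right)dt = \frac{h^2}{2S}
```

The first area is proportional to h, so it scales with the signal and introduces no distortion, whereas the second grows as the square of h.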

FIGURE 54.32

DAC settling cases: (a) exponential and (b) slew-limited case.


Effect of Limited Slew Rate The slew-rate limit is a significant source of nonlinearity since the error is proportional to the square of the signal, as shown in Fig. 54.32(b). The height and width of the error term change with the input. The worst-case harmonic distortion (HD) when generating a sinusoidal signal with a magnitude Vo with a limited slew rate of S is24 :

$$\mathrm{HD}_k = 8\,\frac{\sin^2\left(\omega T_c/2\right)}{\pi k\left(k^2-4\right)} \times \frac{V_o}{S\,T_c}, \qquad k = 1, 3, 5, 7, \ldots$$ (54.6)

where Tc is the clock period. For a given distortion level, Eq. (54.6) sets the minimum slew rate. Any exponential system with a bandwidth of ωo gives rise to signals with the maximum slew rate of 2ωoVo. Therefore, by making S > 2ωoVo, the DAC system will exhibit no distortion due to the limited slew rate.
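Equation (54.6) can be evaluated directly to size the slew rate. The sketch below assumes the equation has the form HD_k = 8 sin^2(ωTc/2) / (πk(k^2 - 4)) × Vo/(S Tc), and takes ω to be the angular frequency of the generated sinusoid, which the text does not define explicitly. The function name is illustrative.

```python
import math


def slew_hd(k, omega, t_clk, v_out, slew_rate):
    """Worst-case k-th harmonic distortion per the assumed form of Eq. (54.6)."""
    if k % 2 == 0:
        raise ValueError("Eq. (54.6) is stated for odd harmonics only")
    shape = 8 * math.sin(omega * t_clk / 2) ** 2 / (math.pi * k * (k * k - 4))
    return shape * v_out / (slew_rate * t_clk)
```

In this form the distortion is inversely proportional to S, so doubling the slew rate halves every harmonic, and the fifth harmonic is a factor of seven below the third.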

Glitch Glitches are caused by small time differences between some current sources turning off and others turning on. Take, for example, the major code transition at half-scale from 011…11 to 100…00. Here, the MSB current source turns on while all other current sources turn off. The small difference in switching times results in a narrow half-scale glitch, as shown in Fig. 54.33. Such a glitch, for example, can produce distorted characters in CRT display applications. To alleviate both glitch and slew-rate problems related to transients, a DAC is followed by a deglitcher. The deglitcher stays in the hold mode while the DAC changes its output value. After the switching transients have settled, the deglitcher is changed to the sampling mode. By making the hold time suitably long, the output of the deglitcher can be made independent of the DAC transient response. However, the slew rate of the deglitcher is on the same order as that of the DAC, and the transient distortion will still be present — now as an artifact of the deglitcher.
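The half-scale glitch at the major code transition can be reproduced by letting the MSB switch at a different instant from the lower bits. The function below is a hypothetical illustration that returns the transient code present while the two groups are out of step.

```python
def glitch_code(old_code, new_code, n_bits, msb_late=True):
    """Intermediate code when the MSB and the lower bits switch at different times."""
    msb = 1 << (n_bits - 1)
    if msb_late:
        # Lower bits have already changed; the MSB still holds its old value.
        return (new_code & ~msb) | (old_code & msb)
    # The MSB has already changed; the lower bits still hold their old values.
    return (old_code & ~msb) | (new_code & msb)
```

For the 4-b transition from 0111 to 1000, a late MSB passes through 0000 and an early MSB passes through 1111, so in either case the output momentarily moves by about half of full scale.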

Techniques for High-Resolution DACs The following methods are often used to improve the linearity of DACs: laser trimming, off-chip adjustment, common-centroid layout technique, dynamic element matching technique, voltage or current sampling, and electronic calibration techniques. The trend is toward more sophisticated and intelligent electronic solutions that overcome and compensate for some of the limitations of conventional trimming techniques. Electronic calibration is a general term to describe various circuit techniques, which

FIGURE 54.33

DAC output glitch.


usually predistort the DAC transfer characteristic so that the DAC linearity can be improved. Self-calibration incorporates all the calibration mechanisms and hardware on the DAC as a built-in function so that users can recalibrate whenever necessary. The application of dynamic element matching to the binary-weighted current DAC is a straightforward switching of two complementary currents.25 Its application to the binary voltage divider using two identical resistors or capacitors requires exchanging resistors or capacitors. This can be easily achieved by reversing the polarity of the reference voltage for the divide-by-two case. However, in the general case of N-element matching, the current division is inherently simpler than the voltage division. In general, to match the N independent elements, a switching network with N inputs and N outputs is required. The function of the switching network is to connect any input out of N inputs to one output with an average duty cycle of 1/N. The simplest one is a barrel shifter rotating the input-output connections in a predetermined manner. This barrel shifter generates a low-frequency modulated error when N gets larger because the same pattern repeats every N clocks. A more sophisticated randomizer with the same average duty cycle can distribute the mismatch error over the wider frequency range. The voltage or current sampling concept is an electronic alternative to direct mechanical trimming. The voltage sampler is usually called an S/H, while the current sampler is called a current copier. The voltage is usually sampled on the input capacitor of a buffer amplifier, and the current is usually sampled on the input capacitor of a transconductance amplifier such as a MOS transistor gate. Therefore, both voltage and current sampling techniques are ultimately limited by their sampling accuracy. The idea behind the voltage or current sampling DAC is to use one voltage or current element repeatedly.
One example of the voltage sampling DAC is a discrete-time integrating DAC. The integrator integrates a constant charge repeatedly, and its output is sampled. This is equivalent to generating equally spaced reference voltages by stacking identical unit voltages.26 The fundamental problem associated with this sampling voltage DAC approach is the accumulation of the sampling error and noise in generating larger voltages. Similarly, the current sampling DAC can sample a constant current on current sources made of MOS transistors.27 Since one reference current is copied on other identical current samplers, the matching accuracy can be maintained as long as the sampling errors are kept constant. Since it is not practical to make a high-resolution DAC using voltage or current sampling alone, this approach is limited to generating MSB DACs for the segmented DAC or for the subranging ADCs. Self-calibration is based on an assumption that the segmented DAC linearity is limited by the MSB DAC so that only errors of MSBs can be measured, stored in memory, and recalled during normal operation. There are two different ways of measuring the MSB errors. In one method, individual-bit non-linearities, usually appearing as component mismatch errors, are measured digitally,15,18 and a total error, which is called a code error, is computed from individual-bit errors depending on the output code during normal conversion. On the other hand, the other method measures and stores digital code errors directly and eliminates the digital code-error computation during normal operation.16,17 The former requires less digital memory, while the latter requires fewer digital computations.
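The averaging effect of the barrel shifter used for dynamic element matching can be sketched with N mismatched unit elements. This is a simplified model introduced for illustration: the function names are hypothetical, and the rotation is applied to a unit-element (thermometer-style) selection rather than to any specific circuit from the references.

```python
def dem_output(code, elements, clock):
    """Sum `code` elements, starting at a position that rotates every clock."""
    n = len(elements)
    return sum(elements[(clock + i) % n] for i in range(code))


def dem_average(code, elements):
    """Average output over one full rotation of N clocks."""
    n = len(elements)
    return sum(dem_output(code, elements, t) for t in range(n)) / n
```

Over N clocks each element is selected the same number of times, so the average equals the code times the mean element value, and the mismatch appears only as a ripple that repeats every N clocks. For elements of 1.25, 0.75, 1.5, and 0.5 units, a code of 2 produces 2.0, 2.25, 2.0, and 1.75 on successive clocks, averaging exactly 2.0.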

References
1. J. McCreary and P. Gray, All-MOS charge redistribution analog-to-digital conversion techniques-part I, IEEE J. Solid-State Circuits, vol. SC-10, pp. 371-379, Dec. 1975.
2. S. Ramet, A 13-bit 160kHz differential analog to digital converter, ISSCC Dig. Tech. Papers, pp. 20-21, Feb. 1989.
3. C. Mangelsdorf, H. Malik, S. Lee, S. Hisano, and M. Martin, A two-residue architecture for multistage ADCs, ISSCC Dig. Tech. Papers, pp. 64-65, Feb. 1993.
4. W. Song, H. Choi, S. Kwak, and B. Song, A 10-b 20-Msamples/s low power CMOS ADC, IEEE J. Solid-State Circuits, vol. 30, pp. 514-521, May 1995.
5. T. Cho and P. Gray, A 10-bit, 20-Msamples/s, 35-mW pipeline A/D converter, IEEE J. Solid-State Circuits, vol. 30, pp. 166-172, Mar. 1995.
6. P. Vorenkamp and R. Roovers, A 12-bits, 60MSPS cascaded folding & interpolating ADC, IEEE J. Solid-State Circuits, vol. 32, pp. 1876-1886, Dec. 1997.
7. K. Kattmann and J. Barrow, A technique for reducing differential nonlinearity errors in flash A/D converters, ISSCC Dig. Tech. Papers, pp. 170-171, Feb. 1991.
8. K. Bult and A. Buchwald, An embedded 240-mW 10-bit 50MS/s CMOS ADC in 1-mm^2, IEEE J. Solid-State Circuits, vol. 32, pp. 1887-1895, Dec. 1997.
9. B. Song, S. Lee, and M. Tompsett, A 10b 15-MHz CMOS recycling two-step A/D converter, IEEE J. Solid-State Circuits, vol. SC-25, pp. 1328-1338, Dec. 1990.
10. S. Lewis and P. Gray, A pipelined 5-Msamples/s 9-bit analog-to-digital converter, IEEE J. Solid-State Circuits, vol. SC-22, pp. 954-961, Dec. 1987.
11. B. Song, M. Tompsett, and K. Lakshmikumar, A 12-bit 1-Msample/s capacitor error-averaging pipelined A/D converter, IEEE J. Solid-State Circuits, vol. SC-23, pp. 1324-1333, Dec. 1988.
12. S. Lewis, S. Fetterman, G. Gross Jr., R. Ramachandran, and T. Viswanathan, A 10-b 20-Msample/s analog-to-digital converter, IEEE J. Solid-State Circuits, vol. SC-27, pp. 351-358, Mar. 1992.
13. C. Conroy, D. Cline, and P. Gray, An 8-b 85-MS/s parallel pipeline A/D converter in 1-µm CMOS, IEEE J. Solid-State Circuits, vol. SC-28, pp. 447-454, Apr. 1993.
14. R. Plassche and R. Grift, A high speed 7 bit A/D converter, IEEE J. Solid-State Circuits, vol. SC-14, pp. 938-943, Dec. 1979.
15. H. Lee, D. Hodges, and P. Gray, A self-calibrating 15-bit CMOS A/D converter, IEEE J. Solid-State Circuits, vol. SC-19, pp. 813-819, Dec. 1984.
16. S. Lee and B. Song, Digital-domain calibration of multistep analog-to-digital converters, IEEE J. Solid-State Circuits, vol. SC-27, pp. 1679-1688, Dec. 1992.
17. S. Kwak, B. Song, and K. Bacrania, A 15-b, 5-Msamples/s low spurious CMOS ADC, IEEE J. Solid-State Circuits, vol. 32, pp. 1866-1875, Dec. 1997.
18. A. Karanicolas, H. Lee, and K. Bacrania, A 15-b 1-Msample/s digitally self calibrated pipeline ADC, IEEE J. Solid-State Circuits, vol. 28, pp. 1207-1215, Dec. 1993.
19. D. Mercer, A 16-b D/A converter with increased spurious free dynamic range, IEEE J. Solid-State Circuits, vol. 29, pp. 1180-1185, Oct. 1994.
20. B. Tesch and J. Garcia, A low glitch 14-b 100-MHz D/A converter, IEEE J. Solid-State Circuits, vol. 32, pp. 1465-1469, Sept. 1997.
21. D. Mercer and L. Singer, 12-b 125 MSPS CMOS D/A designed for special performance, Proc. IEEE Int. Symp. Low Power Electronics and Design, pp. 243-246, Aug. 1996.
22. C. Lin and K. Bult, A 10b 250MSample/s CMOS DAC in 1mm^2, ISSCC Dig. Tech. Papers, pp. 214-215, Feb. 1998.
23. A. Marques, J. Bastos, A. Bosch, J. Vandenbusche, M. Steyaert, and W. Sansen, A 12b accuracy 300M sample/s update rate CMOS DAC, ISSCC Dig. Tech. Papers, pp. 216-217, Feb. 1998.
24. D. Freeman, Slewing distortion in digital-to-analog conversion, J. Audio Eng. Soc., vol. 25, pp. 178-183, Apr. 1977.
25. R. Plassche, Dynamic element matching for high accuracy monolithic D/A converters, IEEE J. Solid-State Circuits, vol. SC-11, pp. 795-800, Dec. 1976.
26. D. Kerth, N. Sooch, and E. Swanson, A 12-bit 1-MHz two-step flash ADC, IEEE J. Solid-State Circuits, vol. SC-24, pp. 250-255, Apr. 1989.
27. D. Groeneveld, H. Schouwenaars, H. Termeer, and C. Bastiaansen, A self-calibration technique for monolithic high-resolution D/A converters, IEEE J. Solid-State Circuits, vol. SC-24, pp. 1517-1522, Dec. 1989.


Fattaruso, J.W., Williams III, L.A. "Oversampled Analog-to-Digital and Digital-to-Analog Converters" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

55 Oversampled Analog-to-Digital and Digital-to-Analog Converters

John W. Fattaruso and Louis A. Williams III
Texas Instruments, Incorporated

55.1 Introduction
55.2 Basic Theory of Operation
Time-Domain Representation • Frequency-Domain Representation • Sigma-Delta Modulators in Data Converters • Tones
55.3 Alternative Sigma-Delta Architectures
High-Order Modulators • Cascaded Modulators • Bandpass Modulators
55.4 Filtering for Sigma-Delta Modulators
Anti-alias and Reconstruction Filters • Decimation and Interpolation Filters
55.5 Circuit Building Blocks
Switched-Capacitor Integrators • Operational Amplifiers • Comparators • Complete Modulator • D/A Circuits • Continuous-Time Modulators
55.6 Practical Design Issues
kT/C Noise • Integrator Gain Scaling • Amplifier Gain • Sampling Nonlinearity and Reference Corruption • High-Order Integrator Reset • Multi-level Feedback
55.7 Summary

55.1 Introduction

In the absence of some form of calibration or trimming, the precision of the Nyquist-rate converters described in Chapter 54 is strictly dependent on the precision of the VLSI components that comprise the converter circuits. Oversampled data converters are a means of exchanging the speed and data processing capability of modern sub-micron integrated circuits for precision that would otherwise not be readily attainable [1,2]. The precision of an oversampled data converter can exceed the precision of its circuit components by several orders of magnitude.

In this chapter, the basic operation and design techniques of the most widely used class of oversampled data converters, sigma-delta modulators (see footnote 1), are described. In Section 55.2, the basic theory of sigma-delta modulators is presented, using both time-domain and frequency-domain approaches. The issue of nonharmonic tones is also discussed. In Section 55.3, more complex sigma-delta architectures are described, including higher-order architectures, cascaded architectures, and bandpass architectures. Filtering techniques unique to sigma-delta modulators are presented in Section 55.4. In Section 55.5, the basic circuit building blocks for sigma-delta modulators are described; and in Section 55.6, circuit design issues specific to sigma-delta-based data converters are discussed.

Footnote 1: The reader will find functionally identical modulator blocks in the literature named either “sigma-delta” modulators or “delta-sigma” modulators, with the choice of terminology largely up to personal preference. We use the former term here.

55.2 Basic Theory of Operation

Oversampled data conversion techniques have their roots in the design of signal coders for communication systems [3-6]. Oversampling techniques differ from Nyquist techniques in that their comprehension and design procedures draw equally from time-domain and frequency-domain representations of signals, whereas Nyquist techniques are readily understood in the time domain alone. Data conversion by oversampling is typically performed by a serial connection of a modulator and various filter blocks. In analog-to-digital (A/D) conversion, shown in Fig. 55.1, the analog input signal xi(t) is first bandlimited by an anti-alias filter, then sampled at a rate fS. This sampling rate is M times faster than that of a comparable Nyquist-rate converter with the same signal bandwidth; the value of M is the oversampling ratio. The sampled signal x[n] is coded by a modulator block that quantizes the data into a finite number of discrete levels. The resulting coded signal y[n] is down-sampled, or decimated, by a factor of M to produce an output that is comparable to that of a Nyquist-rate converter.

Digital-to-analog oversampled data conversion is basically the reverse of analog-to-digital conversion. As shown in Fig. 55.2, the Nyquist-rate digital samples are oversampled by an interpolation filter, coded by a modulator, and then reconstructed in the analog domain by an analog filter. In both analog-to-digital and digital-to-analog data conversion, the block with the most distinctive signal processing properties is the modulator. The remainder of this section and the subsequent sections focus on the properties and architectures of oversampled data modulators.

Time-Domain Representation

The simplest modulator that would perform the requisite conversion to discrete output levels is the quantization function Q(x) shown in Fig. 55.3. This quantization can be thought of as merely the sum of the original signal x[n] with a sampled error signal e[n], as illustrated in Fig. 55.4. In Nyquist-rate converters, the error is reduced by using a large number of small steps in the quantizer characteristic. In oversampled data converters, specifically sigma-delta modulators, the error is corrected by a feedback network. This correction is made by estimating the error in advance and subtracting it from the input, as shown in Fig. 55.5, where ê[n] is the error estimate. If this estimate were perfect, ê[n] would equal e[n] and the output y[n] would equal the input x[n]. However, since the error is not known until it is made, e[n] is not known when ê[n] is needed. Therefore, some means must be found to estimate the error. In the case of sigma-delta converters, the error can be estimated by exploiting some knowledge of the frequency-domain behavior of the input signal. Specifically, it is assumed that the signal is changing very slowly from sample to sample, or equivalently, that its bandwidth is much less than the sampling rate.

FIGURE 55.1 Oversampled A/D conversion.

FIGURE 55.2 Oversampled D/A conversion.

FIGURE 55.3 Quantizer transfer function.

FIGURE 55.4 Quantization error.

FIGURE 55.5 Error correction feedback.

FIGURE 55.6 First-order error estimation.

For exceedingly slow signals, a first-order estimate of the error to be committed in quantization can be formed. The first-order estimate of the current error e[n] is simply the previous error e[n – 1]. This error may be found simply by a subtraction across the quantization block as shown in Fig. 55.6, and the output y[n] is

$$y[n] = x[n] + e[n] - e[n-1] \tag{55.1}$$
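The behavior of the error-feedback loop of Fig. 55.6 can be checked numerically. The following is a minimal Python sketch rather than the simulation used for this chapter: the rounding quantizer, the level separation of 0.25, the test sinusoid, and all function names are assumptions made for illustration. It subtracts the previous quantizer error from each input sample and measures how far the output departs from Eq. 55.1.

```python
import math

def quantize(v, delta=0.25):
    """Ideal uniform quantizer (assumed): round to the nearest multiple of delta."""
    return delta * round(v / delta)

def first_order_error_feedback(x, delta=0.25):
    """Subtract the previous quantizer error from each input sample."""
    y, e = [], []
    e_prev = 0.0                      # no error exists before the first sample
    for xn in x:
        v = xn - e_prev               # input minus the error estimate e[n-1]
        yn = quantize(v, delta)
        en = yn - v                   # error measured across the quantizer
        y.append(yn)
        e.append(en)
        e_prev = en
    return y, e

# Hypothetical test input: a slow sinusoid, 64 samples per period
x = [0.4 * math.sin(2 * math.pi * n / 64) for n in range(256)]
y, e = first_order_error_feedback(x)

# Largest deviation from Eq. 55.1, y[n] = x[n] + e[n] - e[n-1]
residual = max(abs(y[n] - (x[n] + e[n] - e[n - 1])) for n in range(1, len(x)))
print("max deviation from Eq. 55.1:", residual)
print("sum(y) - sum(x):", sum(y) - sum(x))
```

Because the error terms in Eq. 55.1 telescope when summed, the accumulated output differs from the accumulated input by no more than the final quantizer error, which is the sense in which the loop corrects earlier errors over time.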

The essential property of this structure is that if an error is committed that is not large enough to be corrected by a displacement to another quantizer level on the next sample, then the history of successive errors accumulates in the feedback loop until it eventually pushes the quantizer into another level. In this manner, the output of the quantizer will, over time, correct the errors committed in previous samples, increasing the precision of the information being generated as a time sequence of samples. As will be shown in Section 55.5, the most convenient and accurate sampled-data circuit building block in practice is an integrator. With a few straightforward steps, the system of Fig. 55.6 can be transformed into that of Fig. 55.7, where the delay element is now immersed in an integrator feedback loop. The output of this transformed modulator is

$$y[n] = x[n-1] + e[n] - e[n-1] \tag{55.2}$$

FIGURE 55.7 First-order equivalent modulator.

FIGURE 55.8 Second-order error estimation.
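The integrator form of the first-order loop can be sketched in the same way. The topology below (one delaying integrator, a quantizer, and unity feedback) is an assumption about the structure of Fig. 55.7, and the quantizer and input are again hypothetical. The sketch confirms that the output obeys Eq. 55.2, that is, Eq. 55.1 with one extra sample of input delay.

```python
import math

def quantize(v, delta=0.25):
    """Ideal uniform quantizer (assumed): round to the nearest multiple of delta."""
    return delta * round(v / delta)

def first_order_sigma_delta(x, delta=0.25):
    """Delaying integrator followed by a quantizer, with unity feedback."""
    u = 0.0                           # integrator state
    y, e = [], []
    for xn in x:
        yn = quantize(u, delta)       # the quantizer sees the integrator output
        y.append(yn)
        e.append(yn - u)              # quantizer error e[n]
        u = u + xn - yn               # accumulate (input - output) for the next sample
    return y, e

x = [0.4 * math.sin(2 * math.pi * n / 64) for n in range(256)]
y, e = first_order_sigma_delta(x)

# Largest deviation from Eq. 55.2, y[n] = x[n-1] + e[n] - e[n-1]
residual = max(abs(y[n] - (x[n - 1] + e[n] - e[n - 1])) for n in range(1, len(x)))
print("max deviation from Eq. 55.2:", residual)
```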

Comparing Eqs. 55.1 and 55.2, it is evident that this transformation does require the addition of a single clock delay block in the input path x[n], but this extra clock cycle of latency has no effect on the precision or frequency response of the modulator. The structure in Fig. 55.7 is generally known as a first-order sigma-delta modulator [7-9].

An increase in precision can be obtained by using more accurate estimates of the expected quantizer error [6]. A second-order estimate of e[n] may be formed by assuming that the error e[n] varies linearly with time. In this case, an estimate of the current error e[n] may be computed by changing the previous error e[n – 1] by an amount equal to the change between e[n – 2] and e[n – 1]. The second-order error estimate is thus

$$\hat{e}[n] = e[n-1] + \left(e[n-1] - e[n-2]\right) = 2e[n-1] - e[n-2] \tag{55.3}$$

and is illustrated in Fig. 55.8. The output of the second-order estimation modulator is

$$y[n] = x[n] + e[n] - 2e[n-1] + e[n-2] \tag{55.4}$$

It can be shown, after a number of steps, that the modulator in Fig. 55.8 can be transformed into a modulator in which the feedback loop delays are again immersed in practical integrator blocks. This second-order sigma-delta modulator [10-12] is shown in Fig. 55.9; the output of this transformed modulator is

$$y[n] = x[n-2] + e[n] - 2e[n-1] + e[n-2] \tag{55.5}$$

which is entirely equivalent to that given by Eq. 55.4, except for the addition of two inconsequential delays of the input signal x[n].

A further increase in precision can be obtained using even higher-order estimates of the quantizer error, such as quadratic or cubic. These high-order error estimate modulators can also be transformed into a series of delaying integrators in a feedback loop. Unfortunately, as discussed in Section 55.3, practical difficulties emerge for orders greater than two, and alternative architectures are generally needed.

Computer simulation of modulator systems is straightforward, and Fig. 55.10 shows the simulated output of the first-order modulator of Fig. 55.7 when fed with a simple sinusoidal input. The resolution of the quantizer in the modulator loop was assumed to be eight levels. (The modulator output is drawn with continuous lines to emphasize its oscillatory nature, but the quantities plotted have meaning only at each sample time.) The coarsely quantized output code generally follows the input, but with occasional transitions that track intermediate values over local sample regions.

A second-order modulator with an eight-level quantizer exhibits the simulated behavior shown in Fig. 55.11. Note that the oscillations by which the loop attempts to minimize quantization error appear “busier” than in the first-order case of Fig. 55.10. It will be shown in the frequency domain that, for a given signal bandwidth, the more vibrant output code oscillations in Fig. 55.11 actually represent the input signal with higher precision than the first-order case in Fig. 55.10.

A special case that is of practical significance is the second-order modulator with a two-level quantizer, that is, simply a comparator closing the feedback loop. A simulation of such a modulator is shown in Fig. 55.12. Although the quantized representation at the output appears crude, the continuous nature of the input level is expressed in the density of output codes. When the input level is high (around sample numbers 32 and 160), there is a greater preponderance of ‘1’ output codes; and at low swings of the input (around sample numbers 96 and 224), the ‘0’ output code dominates.

The examples in Figs. 55.10 to 55.12 demonstrate that information generated from the modulator expresses, in a high-speed coded form, the coarsely quantized input signal and the deviation between the signal and the quantization levels. Although the time-domain coded modulator output looks somewhat unintelligible, the output characteristics are clearer in the frequency domain.

FIGURE 55.9 Second-order equivalent modulator.

FIGURE 55.10 First-order sample output.
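A two-level second-order loop of this kind is easy to simulate. The sketch below is illustrative only: the topology (two delaying integrators, with the comparator output fed back once to the first integrator and twice to the second) is one common arrangement and is an assumption about Fig. 55.9; the ±0.5 output levels, the input amplitude, and the 256-sample input period are likewise hypothetical. It checks the output against Eq. 55.5 and compares the density of high output codes near the input peak and the input trough.

```python
import math

def second_order_two_level(x, hi=0.5, lo=-0.5):
    """Two delaying integrators closed by a comparator (assumed topology)."""
    u1 = u2 = 0.0                     # integrator states
    y, e = [], []
    for xn in x:
        yn = hi if u2 >= 0 else lo    # two-level quantizer (comparator)
        y.append(yn)
        e.append(yn - u2)             # quantizer error e[n]
        # Both updates use the previous value of u1, so each integrator delays
        u1, u2 = u1 + xn - yn, u2 + u1 - 2 * yn
    return y, e

period = 256
x = [0.3 * math.sin(2 * math.pi * n / period) for n in range(4 * period)]
y, e = second_order_two_level(x)

# Largest deviation from Eq. 55.5
residual = max(
    abs(y[n] - (x[n - 2] + e[n] - 2 * e[n - 1] + e[n - 2]))
    for n in range(2, len(x))
)

# Fraction of high codes in 64-sample windows centred on a peak and a trough
peak = y[period // 4 - 32 : period // 4 + 32]
trough = y[3 * period // 4 - 32 : 3 * period // 4 + 32]
density_peak = sum(1 for v in peak if v > 0) / len(peak)
density_trough = sum(1 for v in trough if v > 0) / len(trough)
print("max deviation from Eq. 55.5:", residual)
print("high-code density near peak / trough:", density_peak, density_trough)
```

The identity of Eq. 55.5 holds for any quantizer placed in this loop; only the size of e[n], and hence the stability margin, depends on the number of levels and the input amplitude.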

FIGURE 55.11 Second-order sample output.

FIGURE 55.12 Second-order one-bit modulator sample output.

FIGURE 55.13 Generalized sigma-delta modulator.

Frequency-Domain Representation

The modulators in Figs. 55.7 and 55.9 can be generalized as the sampled-data system shown in Fig. 55.13, where the time-domain signals x[n], y[n], and e[n] are written as their frequency-domain equivalents X(z), Y(z), and E(z). The modulator output in Fig. 55.13 can be written in terms of the input X(z) and the quantizer error E(z) as

$$Y(z) = H_x(z)\,X(z) + H_e(z)\,E(z) \tag{55.6}$$

where the input transfer function, Hx(z), is

$$H_x(z) = \frac{A(z)}{1 + A(z)F(z)} \tag{55.7}$$

and the error transfer function, He(z), is

$$H_e(z) = \frac{1}{1 + A(z)F(z)} \tag{55.8}$$

Strictly speaking, the error E(z) is directly dependent on the input X(z). Nonetheless, if the input to the quantizer is sufficiently busy, that is, if it crosses through several quantization levels, the quantizer error behaves approximately as a random variable that is uniformly distributed between ±δ/2 and is uncorrelated with the input, where δ is the quantization level separation illustrated in Fig. 55.3. In the frequency domain, the error noise power spectrum is uniform with a total error power of δ²/12 [13,14]. The error power between the frequencies fL and fH is

$$S_{ee} \approx \int_{f_L}^{f_H} \frac{\delta^2}{6 f_S} \left| H_e\!\left(e^{j 2\pi f / f_S}\right) \right|^2 df \tag{55.9}$$

where fS is the sampling rate. The error power between fL and fH can be reduced independent of the quantizer error level separation δ by having a small error transfer function He(z) in that frequency band. Sigma-delta modulators have this property. A sigma-delta modulator is a system such as that in Fig. 55.13 in which the error transfer function He(z) is small and the input transfer function Hx(z) is about unity for some band of frequencies. That is,

$$\left| H_e\!\left(e^{j 2\pi f / f_S}\right) \right| \ll 1; \quad f_L \le f \le f_H \tag{55.10}$$

$$\left| H_x\!\left(e^{j 2\pi f / f_S}\right) \right| \approx 1; \quad f_L \le f \le f_H \tag{55.11}$$

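The requirements in Eqs. 55.10 and 55.11 are equivalent to requiring that the loop gain be large and the feedback gain be unity, that is

$$\left| A\!\left(e^{j 2\pi f / f_S}\right) F\!\left(e^{j 2\pi f / f_S}\right) \right| \gg 1; \quad f_L \le f \le f_H \tag{55.12}$$

$$\left| F\!\left(e^{j 2\pi f / f_S}\right) \right| \approx 1; \quad f_L \le f \le f_H \tag{55.13}$$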

There are many system designs that have the sigma-delta properties of high loop gain and unity feedback gain. The previous examples in Figs. 55.7 and 55.9 are part of an important class of modulator architectures called noise-differencing modulators that are particularly well suited to VLSI implementation. The forward path in a noise-differencing sigma-delta modulator consists of a series of delaying integrators. The order of the modulator is defined as the number of integrators. The forward gain of an L-th order modulator is

$$A(z) = \left( \frac{z^{-1}}{1 - z^{-1}} \right)^{L} \tag{55.14}$$

The feedback gain in a noise-differencing modulator is designed such that the modulator open-loop gain is

$$A(z)F(z) = \frac{1}{\left(1 - z^{-1}\right)^{L}} - 1 \tag{55.15}$$

From Eqs. 55.14 and 55.15, it follows that the output for an L-th order noise-differencing sigma-delta modulator is

$$Y(z) = z^{-L} X(z) + \left(1 - z^{-1}\right)^{L} E(z) \tag{55.16}$$
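The step from Eqs. 55.14 and 55.15 to Eq. 55.16 can be confirmed numerically by substituting the forward and loop gains into Eqs. 55.7 and 55.8 at points on the unit circle. The short Python sketch below does this; the function names and the particular test frequencies are arbitrary choices made for illustration.

```python
import cmath

def forward_gain(z, L):
    """A(z) of Eq. 55.14: L cascaded delaying integrators."""
    return (z**-1 / (1 - z**-1)) ** L

def loop_gain(z, L):
    """A(z)F(z) of Eq. 55.15."""
    return 1 / (1 - z**-1) ** L - 1

def transfer_functions(z, L):
    """Hx(z) and He(z) obtained from Eqs. 55.7 and 55.8."""
    hx = forward_gain(z, L) / (1 + loop_gain(z, L))
    he = 1 / (1 + loop_gain(z, L))
    return hx, he

def max_mismatch():
    """Largest difference from Hx = z^-L and He = (1 - z^-1)^L over the test grid."""
    worst = 0.0
    for L in (1, 2, 3):
        for f_norm in (0.01, 0.1, 0.25, 0.4):        # f / fS
            z = cmath.exp(2j * cmath.pi * f_norm)
            hx, he = transfer_functions(z, L)
            worst = max(worst, abs(hx - z**-L), abs(he - (1 - z**-1) ** L))
    return worst

print("largest mismatch:", max_mismatch())
```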

The simulated frequency response for a second-order noise-differencing sigma-delta modulator with a sinusoidal input is shown in Fig. 55.14. The large spike in the center is the original input signal. It is clear from the plot that the noise energy is lowest at low frequencies. Noise-differencing modulators are designed to reduce the quantization noise in the baseband, that is, fL = 0 and fH ≪ fS. The oversampling ratio, M, is

$$M = \frac{f_S}{2 f_H} \tag{55.17}$$

(In a Nyquist-rate converter, M = 1.) The baseband noise power for a noise-differencing modulator is

$$S_{ee} \approx \frac{\delta^2}{6 f_S} \int_{0}^{f_H} \left| \left(1 - e^{-j 2\pi f / f_S}\right)^{L} \right|^2 df \approx \frac{\delta^2}{12}\,\frac{\pi^{2L}}{2L+1}\,\frac{1}{M^{2L+1}} \tag{55.18}$$

FIGURE 55.14 Second-order simulated frequency response.

FIGURE 55.15 Calculated dynamic range vs. oversampling ratio.
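The closed-form approximation in Eq. 55.18 follows from replacing |1 − e^{−jθ}| by θ for small θ. As a rough check, the sketch below compares a direct numerical integration of the left-hand expression with the closed form; the midpoint integration rule, the normalization fS = 1, and the function names are assumptions made for this illustration.

```python
import math

def baseband_noise_exact(L, M, delta=1.0, steps=20000):
    """Midpoint-rule evaluation of the integral in Eq. 55.18, with fS normalized to 1."""
    f_h = 1.0 / (2 * M)                          # band edge from Eq. 55.17
    df = f_h / steps
    total = 0.0
    for k in range(steps):
        f = (k + 0.5) * df
        # |1 - exp(-j*2*pi*f)|^2 equals 4*sin^2(pi*f); raise it to the L-th power
        total += (4 * math.sin(math.pi * f) ** 2) ** L * df
    return delta**2 / 6 * total

def baseband_noise_approx(L, M, delta=1.0):
    """Closed-form right-hand side of Eq. 55.18."""
    return delta**2 / 12 * math.pi ** (2 * L) / (2 * L + 1) / M ** (2 * L + 1)

for L, M in ((1, 16), (2, 64), (3, 32)):
    ratio = baseband_noise_exact(L, M) / baseband_noise_approx(L, M)
    print(f"L={L}, M={M}: exact/approx = {ratio:.5f}")
```

For these oversampling ratios the two expressions agree to well within one percent, so the closed form is adequate for first-pass design estimates.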

One important measure of a sigma-delta modulator is its dynamic range, defined here as the ratio of the maximum sinusoidal input power to the noise power. With the quantizer output limited, as shown in Fig. 55.3, to a range of ∆, the maximum sinusoidal signal power is ∆²/8. The quantizer range ∆ is related to the quantizer level separation δ by the number of quantization levels K, where

$$\delta = \frac{\Delta}{K - 1} \tag{55.19}$$

The dynamic range of a noise-differencing sigma-delta modulator is then

$$DR = \frac{3}{2}\,\frac{2L+1}{\pi^{2L}}\,M^{2L+1}\,(K-1)^2 \tag{55.20}$$

Because the dynamic range is such a strong function of the oversampling ratio, the number of bits required to achieve a given dynamic range is substantially less in a sigma-delta modulator than in a Nyquist-rate converter. To illustrate this, the dynamic range, as given by Eq. 55.20, is shown in Fig. 55.15 as a function of the oversampling ratio, M, for three combinations of modulator order, L, and number of quantization levels, K. The equivalent resolution in bits that would be required of a Nyquist-rate converter to achieve the same dynamic range is shown in the right-hand axis of this figure. It can be inferred from Eq. 55.20 that a large dynamic range can be obtained even with only two quantization levels. This is important when circuit imperfections in actual sigma-delta data converter implementations are considered.
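Equation 55.20 is simple to evaluate directly. The following sketch tabulates the dynamic range in decibels for a few hypothetical combinations of L, M, and K. The conversion to equivalent Nyquist-rate bits uses the common 6.02B + 1.76 dB rule, which is an assumption here and may not match the exact scaling of the right-hand axis of Fig. 55.15.

```python
import math

def dynamic_range(L, M, K):
    """Eq. 55.20 as a power ratio."""
    return 1.5 * (2 * L + 1) / math.pi ** (2 * L) * M ** (2 * L + 1) * (K - 1) ** 2

def dynamic_range_db(L, M, K):
    """Eq. 55.20 expressed in decibels."""
    return 10 * math.log10(dynamic_range(L, M, K))

def equivalent_bits(dr_db):
    """Assumed mapping: an ideal B-bit Nyquist converter has DR of about 6.02*B + 1.76 dB."""
    return (dr_db - 1.76) / 6.02

# Hypothetical design points: (order L, oversampling ratio M, quantizer levels K)
for L, M, K in ((1, 128, 2), (2, 128, 2), (2, 64, 8)):
    dr = dynamic_range_db(L, M, K)
    print(f"L={L}, M={M}, K={K}: DR = {dr:.1f} dB, about {equivalent_bits(dr):.1f} bits")
```

Each doubling of M raises the dynamic range by 3.01(2L + 1) dB, which is why even a two-level second-order modulator reaches high resolution at moderate oversampling ratios.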

Sigma-Delta Modulators in Data Converters

The generalized modulator shown in Fig. 55.13 must be subtly modified when applied to the A/D and D/A converters in Figs. 55.1 and 55.2. In an A/D converter, the quantizer is actually a coarse K-level A/D converter (ADC), having an analog input and a digital output. (Since K is generally small, the K-level ADC is usually just a small flash converter, as described in Chapter 54.) The quantized digital code is fed back into the analog F(z) through a K-level D/A converter (DAC), as shown in Fig. 55.16. Imperfections in this K-level DAC will introduce an additional error term DAD(z) as shown in Fig. 55.17. With the addition of this DAC error term, the modulator output is

$$Y_{AD}(z) = H_x(z)\left[ X(z) - F(z)\,D_{AD}(z) \right] + H_e(z)\,E(z) \tag{55.21}$$

FIGURE 55.16 Sigma-delta modulator for A/D conversion.

FIGURE 55.17 K-level DAC error in sigma-delta A/D converter.

Since the feedback transfer function, F(z), is unity in the band of interest (see Eq. 55.13), the DAC error is indistinguishable from the input. If there are more than two quantization levels, any mismatch between the level separations in the DAC will manifest itself as distortion because the DAC input is signal dependent. On the other hand, if there are only two quantization levels, there is only one level separation, and errors in the DAC levels will not cause distortion. (At worst, DAC errors in a two-level modulator will introduce a dc offset and a gain error.) Thus, with two-level sigma-delta modulators, it is possible to achieve low-distortion and low-noise performance without precise component matching.

Unfortunately, most, if not all, of the statistical conditions that led to Eq. 55.9 and the subsequent equations are violated when a two-level single-threshold quantizer is used in a sigma-delta modulator [15]. Furthermore, the effective gain of the quantizer, which in Fig. 55.3 is implied to be unity, is undefined for a single-threshold quantizer. Nonetheless, empirical evidence has indicated that Eq. 55.20 is still a reasonable approximation for two-level noise-differencing sigma-delta modulators, and is useful as a design guideline for the amount of oversampling needed to achieve a specific dynamic range for a given modulator order [16].

FIGURE 55.18 Sigma-delta modulator for D/A conversion.

FIGURE 55.19 K-level DAC error in sigma-delta D/A converter.

As in sigma-delta ADCs, there is also a DAC error term in sigma-delta-based DACs. In a sigma-delta DAC, the modulator loop is implemented digitally, and the output of that loop is applied to a coarse K-level DAC that provides the analog input for the reconstruction filter, as shown in Fig. 55.18. Imperfections in the K-level DAC will introduce an error term DDA(z) as shown in Fig. 55.19. With the addition of this error term, the modulator output is

$$Y_{DA}(z) = H_x(z)\,X(z) + D_{DA}(z) + H_e(z)\,E(z) \tag{55.22}$$

Since the input transfer function Hx(z) is unity in the band of interest (see Eq. 55.11), the DAC error is indistinguishable from the input, just as in the A/D case. Once again, two-level quantization can be used to avoid DAC-introduced distortion.

Tones

One problem with the simplified noise model of sigma-delta modulators that led to Eq. 55.20 is the failure to predict non-harmonic tones. This is especially true for two-level modulators. Repetitive patterns in the coded modulator output that cause discrete spectral peaks at frequencies not harmonically related to any input frequency can occur in sigma-delta modulators [10,16-18]. These tones can manifest themselves as odd “chirps” or “pops,” and they exist even in ideal sigma-delta modulators; they are not caused by circuit imperfections [2].

The origin of sigma-delta tones is illustrated in the following example. Consider a first-order sigma-delta modulator, such as that in Fig. 55.7, with a dc input of 0.0005. Let the quantizer have two output levels, +0.5 and –0.5. The output of such a modulator will be a sequence of +0.5’s and –0.5’s such that the time average of the outputs is 0.0005. To achieve this average, the output of the first-order modulator will be a stream of alternating +0.5’s and –0.5’s, with an extra +0.5 every 1000 clock cycles. This is illustrated in Fig. 55.20(a), where T is the clock period (T = 1/fS). The two-cycle running average of this output is shown in Fig. 55.20(b). For the most part, this running average is zero, except that every 1000 clock cycles there is a pulse one clock cycle wide. This repetitive pulse produces a tone in the output spectrum at a frequency of

$$f_P = \frac{f_S}{1000} = \frac{M}{500}\,f_H \tag{55.23}$$

FIGURE 55.20 (a) Output sequence with average of 0.0005; and (b) running average of (a).

If the oversampling ratio M is less than 500, this tone will appear in the baseband spectrum. In sigma-delta modulators with more active input signals, the output sequence is typically more complex than that illustrated in Fig. 55.20. Nonetheless, the concept underlying tone behavior is that repeating patterns in the quantizer output cause non-uniformity in the quantizer error spectrum, which in the worst case is a discrete spectral peak. A measured tone for a second-order modulator with a dc input is shown in Fig. 55.21 [19].

FIGURE 55.21 Measured tone in second-order modulator with dc input.
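The dc example above can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions: it uses the integrator form of the first-order loop with a comparator threshold at zero (ties resolved to +0.5) and a zero initial integrator state, and it uses exact rational arithmetic so that the tiny dc input is not disturbed by floating-point rounding. It locates the samples at which the alternating pattern is broken by a repeated code.

```python
from fractions import Fraction

def first_order_dc(x, n_samples):
    """First-order two-level loop driven by a constant input x (exact arithmetic)."""
    half = Fraction(1, 2)
    u = Fraction(0)                   # integrator state, assumed to start at zero
    y = []
    for _ in range(n_samples):
        yn = half if u >= 0 else -half
        y.append(yn)
        u = u + x - yn                # accumulate (input - output)
    return y

y = first_order_dc(Fraction(5, 10000), 10000)

# Indices at which the +0.5 / -0.5 alternation is interrupted by a repeated code
repeats = [n for n in range(1, len(y)) if y[n] == y[n - 1]]
gaps = [b - a for a, b in zip(repeats, repeats[1:])]
mean_y = float(sum(y) / len(y))

print("repeated codes at samples:", repeats)
print("average spacing (clock cycles):", sum(gaps) / len(gaps))
print("time average of output:", mean_y)
```

With these assumptions every interruption is an extra +0.5, and the interruptions recur on average once per 1000 clock cycles, which is the repetition rate that gives rise to the tone at fS/1000 in Eq. 55.23.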

Several means of mitigating sigma-delta tones have been used. The first rule is to avoid using first-order sigma-delta modulators. Aside from having inferior noise-shaping properties compared to other modulator architectures, first-order modulators have terrible tone properties [15,16]. The situation improves dramatically with second- and higher-order modulators. In fact, the presence of tones may only be a perceptual or marketing concern, as the total tone power is usually less than the broadband quantization noise power [20].

When tone magnitudes must be reduced, several techniques have proven effective. These include dither, cascaded architectures, and multi-level quantization. Of these three, dither is the only technique whose sole benefit is the reduction of tones. The simplest type of dither is to add a moderate-amplitude out-of-band signal, such as a square wave, to the input [9,21,22]. This dither signal is attenuated by the same filter that attenuates the quantization noise. The purpose of this dither is to keep the sigma-delta modulator out of modes that produce patterns, and for some types of tones this technique is effective. A more rigorously effective technique is to add a large-amplitude pseudo-random noise signal at the quantizer input [23]. This noise is spectrally shaped just like the quantization noise, and is the most effective dither scheme for eliminating tones. Its drawbacks are the expense in silicon area of the random noise generator and the 2- to 3-dB reduction in dynamic range caused by the dither noise.

The other two tone mitigation techniques, cascaded architectures and multi-level quantization, are simply more complex sigma-delta architectures that happen to have improved tone properties over simple noise-differencing two-level sigma-delta modulators. These techniques are covered in Sections 55.3 and 55.6, respectively.

55.3 Alternative Sigma-Delta Architectures

Equation 55.20 appears to indicate that the order of the modulator, L, can be any value, and that increasing L would be beneficial. However, one further problem with two-level sigma-delta modulators is that two-level noise-differencing modulators of order greater than two can exhibit unstable behavior [10]. For this reason, only first- and second-order modulators were discussed in Section 55.2. Nonetheless, there are acceptably stable practical alternative architectures that achieve quantization noise shaping superior to that of a second-order modulator. Two such architectures, high-order and cascaded modulators, are discussed in this section. Another assumption in the previous section was that the noise-shaped region in a sigma-delta modulator is centered around dc. This is not necessarily the case; sigma-delta modulators with noise-shaped regions at frequencies other than near dc are called bandpass modulators and are discussed at the end of this section.

High-Order Modulators

A high-order modulator is a modulator such as that depicted in Fig. 55.13 in which there are more than two zeros in the noise transfer function. As stated earlier, if two-level quantization is employed, a simple noise-differencing series of integrators cannot be used, as such architectures produce unstable oscillations with large inputs that do not recover when the input is removed. To overcome this problem, high-order modulators use forward and feedback transfer functions that are more complex than the noise-differencing functions in Eqs. 55.14 and 55.15 [24-26]. The general rule of thumb in the design of high-order modulators is that the modulator can be made stable if

$$\lim_{z \to \infty} H_e(z) = 1 \tag{55.24}$$

$$\left| H_e(z) \right| \le A, \quad \text{for } |z| = 1 \tag{55.25}$$

and the integrator outputs are clipped and/or scaled to prevent self-sustaining instability [26,27]. The maximum error gain A is about 1.5, but the value used represents a tradeoff between noise attenuation and modulator stability. These rules cover a broad class of filter types and modulator architectures, and the type of filter used generally follows the traditions of the previous designers in an organization.

As an example, consider a fourth-order modulator with a highpass Butterworth error transfer function having a maximum gain, A, of 1.5, and a cutoff frequency set such that Eq. 55.24 is satisfied. The error spectrum of the Butterworth filter is shown in Fig. 55.22, along with the error transfer function of an ideal fourth-order difference. While the Butterworth filter holds the maximum gain to 1.5 (3.5 dB), and while both filters have a fourth-order noise shaping slope in the baseband (27 dB/octave), the error power in the baseband is 44 dB higher with the Butterworth filter than with the ideal noise-differencing filter. This error penalty is typical of high-order designs; there is usually a direct tradeoff between stability and noise reduction.

FIGURE 55.22 Fourth-order error spectrum.

Consider the more general case of an L-th order highpass Butterworth error transfer function. The error transfer function of such a filter around the unit circle is

$$\left| H_e\!\left(e^{j\omega}\right) \right|^2 = \frac{A^2 \left( \dfrac{1}{\beta} \tan\dfrac{\omega}{2} \right)^{2L}}{1 + \left( \dfrac{1}{\beta} \tan\dfrac{\omega}{2} \right)^{2L}} \tag{55.26}$$

The filter coefficients for He(z) needed to satisfy Eq. 55.26 can be computed using standard digital filter design techniques [28]. For a given filter order, L, and gain, A, the parameter β must be chosen to satisfy Eq. 55.24. (The condition in Eq. 55.25 is always satisfied when Eq. 55.26 is true.) These solutions can be computed numerically, and it is found empirically that

$$\beta = A^{e}\,\beta_N \tag{55.27}$$

where the values for βN are tabulated in Table 55.1. The loss in dynamic range relative to an ideal noise-differencing modulator, given by A²/(2β)^{2L}, is also tabulated. In spite of this loss, high-order modulators can still achieve better noise performance than second-order modulators. However, because of the compromise in dynamic range required to stabilize high-order modulators, third-order modulators are generally not worth the effort. More common are fourth- and fifth-order modulators.

TABLE 55.1 High-Order Butterworth Gain Factors and Dynamic Range (DR) Loss

                                          Zero Placement
L   βN         β        Loss in DR (dB)   DR Improvement (dB)   Net Loss in DR (dB)
3   0.052134   0.1570   33.7              8.0                   25.8
4   0.051709   0.1557   44.1              12.8                  31.2
5   0.033866   0.1020   72.6              17.9                  54.7
6   0.034903   0.1051   84.8              23.2                  61.6
7   0.025390   0.0764   117.7             28.6                  89.1

Note: Except for βN, all values are calculated for A = 1.5.

The noise penalty required to stabilize high-order modulators can be mitigated to some extent by alternate zero placement [25]. Classic noise-differencing modulators place all of the zeros of the error transfer function at dc (z = 1). This causes most of the noise power to be concentrated at the highest baseband frequencies. If, instead, the zeros are distributed throughout the baseband, the total noise in the baseband can be reduced, as illustrated in Fig. 55.23. The amount by which zero placement can improve the noise transfer function is summarized in Table 55.1. Also tabulated is the net loss in dynamic range of a high-order Butterworth modulator that uses zero placement relative to an ideal noise-differencing modulator that has zeros at dc.
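The β and loss columns of Table 55.1 can be regenerated from the βN column. The sketch below assumes that Eq. 55.27 reads β = A^e βN, with e the base of the natural logarithm, and that the loss is A²/(2β)^{2L} expressed in decibels; the dictionary simply restates the tabulated values, and the function names are hypothetical.

```python
import math

# L: (beta_N, tabulated beta, tabulated loss in dB), restated from Table 55.1 (A = 1.5)
TABLE_55_1 = {
    3: (0.052134, 0.1570, 33.7),
    4: (0.051709, 0.1557, 44.1),
    5: (0.033866, 0.1020, 72.6),
    6: (0.034903, 0.1051, 84.8),
    7: (0.025390, 0.0764, 117.7),
}

def beta_from_normalized(beta_n, A=1.5):
    """Eq. 55.27, read as beta = A**e * beta_N."""
    return A ** math.e * beta_n

def dr_loss_db(L, beta, A=1.5):
    """Loss relative to ideal noise differencing, A^2 / (2*beta)^(2L), in dB."""
    return 10 * math.log10(A**2 / (2 * beta) ** (2 * L))

for L, (beta_n, beta_tab, loss_tab) in TABLE_55_1.items():
    beta = beta_from_normalized(beta_n)
    loss = dr_loss_db(L, beta)
    print(f"L={L}: beta={beta:.4f} (table {beta_tab}), loss={loss:.1f} dB (table {loss_tab})")
```

The computed values reproduce the tabulated β to four decimal places and the tabulated loss to within a rounding step, which supports this reading of Eq. 55.27 and of the loss expression.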

Cascaded Modulators Cascaded, or multi-stage, architectures are an alternative means of achieving higher-order noise shaping without the stability problems of the high-order modulators described in the previous section.29,30 In a cascaded modulator, two or more stable first- or second-order modulators are connected in series, with

FIGURE 55.23

Fourth-order distributed zeros.

© 2000 by CRC Press LLC

FIGURE 55.24

Cascaded sigma-delta modulators.

the input of each stage being the error from the previous stage, as illustrated in Fig. 55.24. Referring to this illustration, the first stage of the cascade has two outputs, y1 and e1. The output y1 is an estimate of the input x. The error in this estimate is e1. The second stage has as its input the error from the first stage, e1, and its outputs are y2 and e2. The second stage output y2 is an estimate of the first stage error e1. By subtracting this estimate of the first stage error from the output of the first stage, y1, only the second stage error remains. Thus, the error cancellation network uses the output of one stage to cancel the error in the previous stage. For example, in a cascaded architecture comprising a second-order noise-differencing modulator followed by a first-order noise-differencing modulator, the transforms of the output of the two stages, as given by Eq. 55.16, are

$$Y_1(z) = z^{-2} X(z) + \left(1 - z^{-1}\right)^2 E_1(z) \tag{55.28}$$

$$Y_2(z) = z^{-1} E_1(z) + \left(1 - z^{-1}\right) E_2(z) \tag{55.29}$$

If the error cancellation network combines the two outputs such that

$$Y(z) = z^{-1} Y_1(z) - \left(1 - z^{-1}\right)^2 Y_2(z) \tag{55.30}$$

then the error in the first stage will be cancelled, and the output will be

$$Y(z) = z^{-3} X(z) - \left(1 - z^{-1}\right)^3 E_2(z) \tag{55.31}$$
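The cancellation in Eqs. 55.28 through 55.31 can be checked mechanically by treating each transfer function as a polynomial in z^-1. The following is a minimal sketch rather than part of the original derivation; the helper names `polymul`, `polysub`, and `combine` are illustrative choices, and a list index k is assumed to hold the coefficient of z^-k.

```python
def polymul(a, b):
    """Multiply two polynomials in z^-1; index k holds the coefficient of z^-k."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def polysub(a, b):
    """Subtract polynomial b from polynomial a, padding the shorter with zeros."""
    n = max(len(a), len(b))
    a = a + [0] * (n - len(a))
    b = b + [0] * (n - len(b))
    return [ai - bi for ai, bi in zip(a, b)]


DELAY = [0, 1]                  # z^-1
DIFF = [1, -1]                  # 1 - z^-1
DIFF2 = polymul(DIFF, DIFF)     # (1 - z^-1)^2

# Stage transfer functions read off Eqs. 55.28 and 55.29.
y1_from_x = polymul(DELAY, DELAY)   # z^-2
y1_from_e1 = DIFF2                  # (1 - z^-1)^2
y2_from_e1 = DELAY                  # z^-1
y2_from_e2 = DIFF                   # 1 - z^-1


def combine(from_y1, from_y2):
    """Error cancellation network of Eq. 55.30: z^-1 * Y1 - (1 - z^-1)^2 * Y2."""
    return polysub(polymul(DELAY, from_y1), polymul(DIFF2, from_y2))


y_from_x = polymul(DELAY, y1_from_x)        # x reaches Y only through Y1
y_from_e1 = combine(y1_from_e1, y2_from_e1)  # should vanish
y_from_e2 = polysub([0], polymul(DIFF2, y2_from_e2))
```

Under these assumptions the first-stage error coefficient list is all zeros, the signal path is a pure three-sample delay, and the second-stage error is shaped by the negated coefficients of (1 - z^-1)^3, in agreement with Eq. 55.31. The check assumes perfect matching between the analog stages and the digital cancellation network.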

The final output of this cascaded modulator is third-order noise shaped. As a general rule, the noise shaping of a cascaded architecture is comparable to that of a single-stage modulator whose order is the sum of all the orders in the cascade. The extent to which the errors in a cascaded modulator can be cancelled depends on the matching between the stages. The earliest multi-stage modulators were cascades of three first-order stages, often called the MASH architecture.29 The disadvantage of this structure is that in order to achieve third-order performance, the error in the first stage, which is only first-order shaped, must be cancelled. Cancelling this relatively large error places a stringent requirement on inter-stage matching. An alternative architecture that has much more relaxed matching requirements is the cascade of a second-order modulator followed by a first-order modulator. This architecture, like the MASH, ideally achieves third-order noise shaping. Its advantage is that the matching can be 100 times worse than in a MASH and still achieve better noise shaping performance.31

An additional benefit of cascaded modulators is improved tone performance. It has been shown both analytically and experimentally that the error spectra of the second and subsequent stages in a cascade are not plagued by the spectral tones that can exist in single-stage modulators.19,32 To the extent that the first-stage error is cancelled, any tones in the first-stage error spectrum are attenuated, and the final output of the cascaded modulator is nearly tone-free.

Bandpass Modulators

The aforementioned sigma-delta architectures, called herein baseband modulators, all have zeros at or near dc; that is, at frequencies much less than the modulator sampling rate. It is also possible to group these zeros at some other point in the sampling spectrum; such architectures are called bandpass modulators. Bandpass architectures are useful in systems that need to quantize a narrowband signal that is centered at some frequency other than dc. A common example of such a signal is the intermediate frequency (IF) signal in a communications receiver. The simplest method for designing a bandpass modulator is to apply a transformation to an existing baseband modulator architecture. The most common transformation2 is to replace occurrences of z with –z². Such an architecture has zeros at fS/4 and is stable if the baseband modulator is stable.33 A comparison of the error transfer function of baseband and bandpass modulators is shown in Fig. 55.25. Note that a bandpass modulator generated through this transformation has twice the order of its equivalent baseband counterpart. For example, a fourth-order bandpass modulator is comparable to a second-order baseband modulator. The noise shaping properties of a bandpass modulator generated through the –z² transformation are equivalent to those of the baseband modulator that was transformed. Thus, the approximation in Eq. 55.20 can be used, where L is the order of the baseband modulator that was transformed and M is the effective oversampling ratio, which in a bandpass modulator is the sampling rate divided by the signal bandwidth.
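The effect of the baseband-to-bandpass substitution on the zeros can be illustrated numerically. The sketch below assumes the ideal noise-differencing error transfer function (1 - z^-1)^L for the baseband prototype; the function names are hypothetical and the order L = 2 is an arbitrary example.

```python
import cmath
import math


def ntf_baseband(z, order):
    """Ideal noise-differencing error transfer function (1 - z^-1)^L."""
    return (1 - 1 / z) ** order


def ntf_bandpass(z, order):
    """Same function after substituting -z^2 for z, i.e. (1 + z^-2)^L."""
    return ntf_baseband(-z * z, order)


def unit_circle(f_over_fs):
    """Point on the unit circle at normalized frequency f/fS."""
    return cmath.exp(2j * math.pi * f_over_fs)


L = 2  # order of the baseband prototype (example value)

mag_dc = abs(ntf_baseband(unit_circle(0.0), L))        # baseband zero at dc
mag_fs4 = abs(ntf_bandpass(unit_circle(0.25), L))      # bandpass zero at fS/4
mag_bp_at_dc = abs(ntf_bandpass(unit_circle(0.0), L))  # no attenuation at dc

# A small offset d from fS/4 in the bandpass response corresponds to the
# frequency 2*d in the baseband prototype.
d = 0.01
mag_bp_offset = abs(ntf_bandpass(unit_circle(0.25 + d), L))
mag_bb_offset = abs(ntf_baseband(unit_circle(2 * d), L))
```

In this sketch the transformed function vanishes at fS/4 instead of dc, and the shape of the notch around fS/4 is a frequency-compressed copy of the baseband notch, which is why the bandpass modulator has twice the order of its prototype.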

FIGURE 55.25 Bandpass noise transfer function.


FIGURE 55.26 IQ demodulation with a bandpass modulator.

There are advantages and disadvantages to bandpass modulators when compared with traditional down-conversion and baseband modulation. One advantage of the bandpass modulator is its insensitivity to 1/f noise. Since the signal of interest is far from dc, 1/f noise is often insignificant. Another advantage of bandpass modulation applies specifically to bandpass modulators having zeros at fS/4 that are used in quadrature I and Q demodulation systems. If the narrowband IF signal is to be demodulated by a cosine and sine waveform, as shown in Fig. 55.26, the demodulation operation34 becomes simple multiplication by 1, –1, or 0 when the demodulation frequency is fS/4. Furthermore, because a single modulator is used, the bandpass modulator is free of the I/Q path mismatch problems that can exist in baseband demodulation approaches. Two disadvantages of bandpass modulators involve the sampling operation. Sampling in a bandpass modulator has linearity requirements that are comparable to those of a Nyquist-rate converter sampling at the same IF; this is much more severe than the linearity requirements of the sampling operation in a baseband converter with the same signal bandwidth. Also, because of the higher signal frequencies, the sampling in bandpass modulators is much more sensitive to clock jitter. To date, state-of-the-art bandpass modulators achieve about 20 dB less dynamic range than comparable baseband modulators.2 While the remainder of this chapter focuses once again on baseband modulators, many of the techniques are applicable to bandpass modulators as well.
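The reason demodulation at fS/4 needs no true multiplier is that the sampled cosine and sine repeat with the four-sample patterns 1, 0, –1, 0 and 0, 1, 0, –1. A minimal sketch of that operation follows; the function name and the sample values are illustrative only, and the subsequent filtering of the I and Q paths is omitted.

```python
# One period of cos(pi*n/2) and sin(pi*n/2), written as exact integers.
COS_FS4 = (1, 0, -1, 0)
SIN_FS4 = (0, 1, 0, -1)


def iq_demodulate(samples):
    """Multiply a sample stream by the fS/4 cosine and sine sequences."""
    i_out, q_out = [], []
    for n, s in enumerate(samples):
        # Each product is the sample itself, its negation, or zero.
        i_out.append(s * COS_FS4[n % 4])
        q_out.append(s * SIN_FS4[n % 4])
    return i_out, q_out


# Example single-bit modulator output, coded as +1/-1 (made-up values).
bits = [1, -1, -1, 1, 1, 1, -1, -1]
i_path, q_path = iq_demodulate(bits)
```

Because every product is a pass-through, a sign change, or a zero, the mixer in hardware reduces to a small amount of steering logic.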

55.4 Filtering for Sigma-Delta Modulators

In Sections 55.2 and 55.3 of this chapter, the discussion focused on the operation of the sigma-delta modulator core. While this core is the most distinctive aspect of sigma-delta data conversion, there are also filtering blocks that constitute an important part of sigma-delta A/D and D/A converters. In this section, the non-modulator components in baseband sigma-delta converters, namely the analog and digital filters, are described. First, the requirements of the analog anti-alias and reconstruction filters are described. Second, typical architectures for the decimation and interpolation filters are discussed. While much of the design of these filters uses standard techniques covered elsewhere in this volume, there are aspects of these filters that are specific to sigma-delta modulator applications.

Anti-alias and Reconstruction Filters

The purpose of the anti-alias filter, shown in Fig. 55.1 at the input of the sigma-delta A/D converter, is, as the name would indicate, to prevent aliasing. The sampling operation28 maps, or aliases, all frequencies into the range bounded by ±fS/2. Specifically, all signals that lie within one baseband bandwidth of a multiple of the sampling rate are mapped into the baseband. This is generally undesirable, so the anti-alias filter is designed to attenuate these aliased signals to some tolerable level. One advantage of sigma-delta converters over Nyquist-rate converters is that this anti-aliasing filter has a relatively wide transition region. As illustrated in Fig. 55.27, the passband region for this filter is the signal bandwidth fB, while the stopband region for this filter is only within fB of the sampling rate. Thus, the transition region is 2(M – 1)fB, and since M ≫ 1, the transition region is relatively wide. A wide transition region generally means a simple filter design. The precise nature of the anti-alias filter is application dependent, and the filter can be designed using any number of standard analog filter techniques.35

FIGURE 55.27 Anti-alias filter for sigma-delta A/D converters.

The reconstruction filter, shown in Fig. 55.2 at the output of the sigma-delta D/A converter, is also an analog filter. Its primary purpose is to remove unwanted out-of-band quantization noise. The extent to which this noise must be removed varies widely from system to system. If the analog output is to be applied to an element that is naturally bandlimited, such as a speaker, then very little attenuation may be necessary. On the other hand, if the output is applied to additional analog circuitry, care must be taken lest the high-frequency noise distort and map itself into the baseband. Circuit techniques for this filter are addressed further in Section 55.5.
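The transition width quoted for the anti-alias filter follows directly from the band edges. The short derivation below assumes that the oversampling ratio is defined as M = f_S/(2f_B), which is the definition consistent with the stated result; the numerical values are illustrative and not taken from the text.

```latex
\begin{align*}
f_{\text{pass}} &= f_B, \qquad f_{\text{stop}} = f_S - f_B, \qquad f_S = 2 M f_B \\
f_{\text{stop}} - f_{\text{pass}} &= (2 M f_B - f_B) - f_B = 2 (M - 1) f_B \\
\text{Example: } M = 64,\ f_B = 20\ \text{kHz}
  &\;\Rightarrow\; f_S = 2.56\ \text{MHz}, \quad
  f_{\text{stop}} - f_{\text{pass}} = 2 \cdot 63 \cdot 20\ \text{kHz} = 2.52\ \text{MHz}
\end{align*}
```

With these example numbers the filter has more than two decades between its passband edge and its stopband edge, which is why a low-order analog filter is usually sufficient.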

Decimation and Interpolation Filters

In general, the filter characteristics of the decimation filter, shown in Fig. 55.1 at the output of the sigma-delta A/D converter, are much sharper than those of the anti-alias filter; that is, the transition region is narrower. The saving grace is that the filter is implemented digitally, and modern sub-micron processes have made complex digital filters economically feasible. Nonetheless, care must be taken or the filter architecture will become more computationally complex than is necessary. The basic purpose of the decimation filter is to attenuate quantization noise and unwanted signals outside the baseband so that the output of the decimation filter can be down-sampled, or decimated, without significant aliasing. Normally, the most efficient means of accomplishing this is to apply a multirate filter architecture, such as that illustrated in Fig. 55.28.36,37 The comb filter is a relatively crude, but easy to implement, filter that has zeros equally spaced throughout the sampled spectrum. The transfer function of an N-th order comb filter, HC(z), is

 1 1− z−R  HC z =  −1   R 1− z 

()

N

(55.32)

where R is the impulse response length of the comb filter. If R is set equal to the decimation ratio of the comb filter (the comb filter input rate divided by its output rate), then the filter zeros will occur at every point that would alias to dc.38,39 If the filter order N is one more than the modulator order, then the comb filter will be adequate to attenuate the out-of-band quantization noise to the point where it does not adversely increase the baseband noise after decimation.40
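The placement of the comb filter zeros can be confirmed by evaluating Eq. 55.32 on the unit circle. The sketch below is illustrative only: the function name is hypothetical, and the values R = 16 and N = 3 are arbitrary examples. It uses the equivalent moving-average form of a single comb stage so that the expression stays finite at dc.

```python
import cmath
import math


def comb_response(f_over_fs, R, N):
    """Evaluate Eq. 55.32 on the unit circle at normalized frequency f/fS."""
    z_inv = cmath.exp(-2j * math.pi * f_over_fs)
    # (1 - z^-R) / (1 - z^-1) equals the sum of z^-k for k = 0..R-1, which
    # avoids dividing by zero at dc.
    one_stage = sum(z_inv ** k for k in range(R)) / R
    return one_stage ** N


R, N = 16, 3  # example decimation ratio and filter order

dc_gain = abs(comb_response(0.0, R, N))
# Frequencies k*fS/R are the ones that alias to dc after decimation by R.
alias_gains = [abs(comb_response(k / R, R, N)) for k in range(1, R)]
```

In this sketch the dc gain is unity and the response is numerically zero at every multiple of fS/R, the frequencies that fold onto dc when the output is decimated by R.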


FIGURE 55.28 Typical decimation filter architecture.

Following the comb filter is typically a series of one or more FIR filters. Since the sample rates of these FIR filters are much slower than the oversampled clock rate, each filter output can be computed over many clock cycles. Also, since the output of each filter is decimated, only the samples that will be output need to be calculated. These properties can be exploited to devise computationally efficient structures for decimation filtering.41 In the example in Fig. 55.28, the first FIR filter is decimating from 4× to 2× oversampling. Since the output of this filter is still oversampled, the transition region is relatively wide and the attenuation at midband need not be very high. Thus, an economical half-band filter (a filter in which every other coefficient is zero) can be used.37 The final FIR filter is by far the most complex. It usually has to have a very sharp transition region, and for strict anti-alias performance, it cannot be a half-band filter. In high-performance sigma-delta modulators, this filter is often in the range of 50 to 200 taps in length. Standard digital filter design techniques can be used to select the tap weights for this filter.28 Since it is going to be a complex filter anyway, it can also be used to compensate for any frequency droop in the previous filter stages.

The interpolation filter, shown in Fig. 55.2 at the input of the sigma-delta D/A converter, up-samples the input digital words to the oversampling rate. In many ways, this filter is the inverse of a decimation filter, typically comprising a complex up-sampling FIR filter, optionally followed by one or more simple FIR filters, followed by an up-sampling comb filter. The up-sampling operation, without this filter, would produce images of the baseband spectrum at multiples of the baseband frequency. The purpose of the interpolation filter is to attenuate these images to a tolerable level. What constitutes tolerable is very much a system-dependent criterion.
Design techniques for the interpolation filter parallel those of the decimation filter discussed above.
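Returning to the decimating FIR stages described above, the two savings mentioned there (computing only the outputs that survive decimation, and skipping the zero-valued taps of a half-band filter) can be sketched as follows. The function names are hypothetical, and the 7-tap coefficient set is a small illustrative half-band example, not a filter taken from the text.

```python
def fir_full_then_decimate(x, taps, ratio):
    """Reference: compute every filter output, then keep every ratio-th one."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y[::ratio]


def fir_decimate(x, taps, ratio):
    """Compute only the retained outputs, and only with the nonzero taps."""
    nonzero = [(k, h) for k, h in enumerate(taps) if h != 0.0]
    y = []
    for n in range(0, len(x), ratio):
        acc = 0.0
        for k, h in nonzero:
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y


# Illustrative 7-tap half-band filter: every other coefficient is zero,
# apart from the center tap.
HALF_BAND = [-1 / 32, 0.0, 9 / 32, 16 / 32, 9 / 32, 0.0, -1 / 32]

x = [float(n % 5) for n in range(32)]
reference = fir_full_then_decimate(x, HALF_BAND, 2)
efficient = fir_decimate(x, HALF_BAND, 2)
```

Both routines produce the same decimated sequence, but the second evaluates half as many outputs and uses five of the seven taps for each, which is the kind of saving the multirate structure is intended to exploit.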

55.5 Circuit Building Blocks

For analog-to-digital conversion, the modulator is implemented primarily in the analog domain as shown in Fig. 55.16. In digital-to-analog conversion, the modulator output is filtered by an analog reconstruction filter as depicted in Fig. 55.2. The basic analog circuit building blocks for these data converters are described in this section. These building blocks include switched-capacitor integrators, the amplifiers that are embedded in the integrators, comparators, and circuits for sigma-delta based D/A conversion. At the end of this section, the techniques for continuous-time sigma-delta modulation are briefly discussed.

Switched-Capacitor Integrators

Switched-capacitor integration stages are commonly used to perform the signal processing functions of integration and summation required for realization of the discrete-time transfer functions A(z) and F(z) in Fig. 55.16. The circuit techniques outlined herein are drawn from a rich literature of switched-capacitor filters42-45 that is detailed elsewhere in this volume. Figure 55.29 is a typical integrator stage for the case of single-bit feedback,11,20 and is designed to perform the discrete-time computation

$$V_{OUT}(z) = K \, \frac{z^{-1}}{1 - z^{-1}} \left( V_{IN}(z) - V_{DAC}(z) \right) \tag{55.33}$$

independent of the parasitic capacitances associated with the capacitive devices shown. The curved line in the capacitor symbol is the device terminal with which the preponderance of the parasitic capacitance is associated. For example, this will be the bottom plate of a stacked planar capacitance structure, where the parasitic capacitance is that between the bottom plate and the IC substrate. The circuit’s precision stems from the conservation of charge at the two input nodes of the operational amplifier, and the cyclic return of the potential at those nodes to constant voltages. More details may be found in the chapter on switched-capacitor filters (Chapter 59).

FIGURE 55.29 Typical integrator stage.

Fully differential circuits will be shown here, as these are almost universally preferred over single-ended circuits in monolithic implementations owing to their greatly improved power supply rejection, MOS switch feedthrough rejection, and suppression of even-order non-linearities. The switches shown in Fig. 55.29 are generally full CMOS switches, as detailed in Fig. 55.30. However, integrators with very low power supply voltages may necessitate the use of only one polarity of switch device, possibly with a switch gate voltage-boosting arrangement.46

FIGURE 55.30 Full CMOS switch.

Sampling capacitors CSP and CSM are designed with the same capacitance CS, and the effect of slight fabrication mismatches between the two will be mitigated by the common-mode rejection of the amplifier. Similarly, integration capacitors CIP and CIM are designed to be identical with capacitance CI. The discrete-time signal to be integrated is applied between the input terminals VIN+ and VIN–, and the output is taken between VOUT+ and VOUT–. VINCM is the common-mode input voltage required by the amplifier. The single-bit DAC feedback voltage is applied between VDAC+ and VDAC–. The stage must be clocked by two non-overlapping signals, φ1 and φ2.

During the φ1 phase, the differential input voltage is sampled on the bottom plates of CSP and CSM, while their top plates are held at the amplifier common-mode input level. During this phase, the amplifier summing nodes are isolated from the capacitor network, and the amplifier output will remain constant at its previously integrated value. During the φ2 phase, the bottom plates of the sampling capacitors CSP and CSM experience a differential potential shift of (VDAC – VIN), while the top plates are routed into the amplifier summing nodes. By forcing its differential input voltage to a small level, the amplifier will effect a transfer of a charge of CS(VIN – VDAC) to the integration capacitors, and therefore the differential output voltage will shift to a new value by an increment of (CS/CI)(VIN – VDAC). Since this output voltage shift will accumulate from cycle to cycle, the discrete-time transfer function will be that of Eq. 55.33, with

$$K = \frac{C_S}{C_I} \tag{55.34}$$
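In the time domain, Eqs. 55.33 and 55.34 describe a running sum in which each cycle adds K times the difference between the input and the DAC feedback. The following behavioral sketch assumes an ideal amplifier and complete settling; the function name, capacitor values, and input samples are illustrative only.

```python
def integrate(v_in, v_dac, c_s, c_i, v_start=0.0):
    """Return the integrator output sequence for paired input/DAC samples."""
    k = c_s / c_i  # integrator gain, Eq. 55.34
    v_out = [v_start]
    for vin, vdac in zip(v_in, v_dac):
        # The charge C_S*(vin - vdac) gathered in one cycle appears as an
        # output step of K*(vin - vdac) in the next one; this is the z^-1
        # in the numerator of Eq. 55.33.
        v_out.append(v_out[-1] + k * (vin - vdac))
    return v_out


# Example: 1 pF sampling and 2 pF integration capacitors give K = 0.5.
steps = integrate(v_in=[0.25] * 4,
                  v_dac=[1.0, -1.0, 1.0, -1.0],
                  c_s=1.0e-12, c_i=2.0e-12)
```

The model ignores finite amplifier gain, incomplete settling, and switch charge injection, which are the non-idealities the circuit techniques in this section are meant to control.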

Over several cycles of initial operation, the amplifier input terminals will be driven to the common-mode level that is precharged onto the top plates of the sampling capacitors. In order to suppress any signal-dependent clock feedthrough from the switches, it is helpful to slightly delay the clock phases that switch variable signal voltages with respect to the phases that switch current into constant potentials. The channel charge in each turned-on switch device can potentially dissipate onto the sampling capacitors when the switches are turned off, producing an error in the sampled charge. This channel charge is dependent on the difference between the switch gate-to-source voltage and its threshold voltage; and as the source voltage varies with signal voltage, the clock feedthrough charge will vary with the signal. By turning the switches that see constant potentials at the end of each cycle off first, and thus floating the sampling capacitor, the only clock feedthrough is a charge that is to the first order independent of signal level, and results only in a common-mode shift that is suppressed by the amplifier. This acts to reduce the non-linearity of the integrator and the harmonic distortion generated by the modulator. The timing for the delayed and undelayed clocks is illustrated in Fig. 55.31, where the clock phases φ1D and φ2D represent phases that are slightly delayed versions of φ1 and φ2, respectively. The delayed clocks drive the switches that are subject to full-signal voltage swings, the analog and reference voltage inputs, as shown in Fig. 55.29. The undelayed clocks drive the switches associated with the amplifier summing node and common-mode input bias voltage, which will always be driven to the same potential by the end of each clock cycle. A typical clock generator circuit to produce these phase relationships is shown in Fig. 55.32. The delay time ∆t is generated by the propagation delay through two CMOS inverters. 
Other, more complex integration circuits are used in some sigma-delta implementations, for example, to suppress errors due to limited amplifier gain47,48 or to effectively double the sampling rate of the integrators.49,50 For the modulator structures discussed in Section 55.3 that are more elaborate than a second-order loop, more complex switched-capacitor filtering is required. These may still, however, be designed with the same basic integrator architecture as in Fig. 55.29, but with extra sampling capacitors feeding the amplifier summing node to implement additional signal paths.26,33,51 Consult Chapter 59 in this volume on switched-capacitor filtering for more information.

FIGURE 55.31 Delayed clock timing.

FIGURE 55.32 Non-overlapping clock generator with delayed clocks.

Operational Amplifiers

Embedded in the switched-capacitor integrator shown in Fig. 55.29 is an operational amplifier. There are three major types of operational amplifiers typically used in switched-capacitor integrators52: the folded cascode amplifier,42 shown in Fig. 55.33; the two-stage amplifier,43 shown in Fig. 55.34; and the class AB amplifier,45 shown in Fig. 55.35. When the available supply voltage is high enough to permit stacking of cascode devices to develop high gain, a folded cascode amplifier is commonly used. A typical topology is shown in Fig. 55.33. The input devices are PMOS, since most IC processes feature PMOS devices that exhibit lower 1/f noise than their NMOS counterparts.53 The input differential pair M1 and M2 is biased with the drain current of M3. FETs M5-M8 function as current sources, and M9-M12 form cascode devices that boost the output impedance. The amplifier is compensated for stability in the integrator feedback loop by the dominant pole that is formed at its output node with the high output impedance and the load capacitance. In an integrator stage, the amplifier will be loaded with the sampling capacitance of the following stage as well as its own integration capacitance. The non-dominant pole at the drains of M1 and M2 limits the unity-gain frequency, which can be quite high.

FIGURE 55.33 Folded cascode amplifier.


FIGURE 55.34 Two-stage amplifier.

FIGURE 55.35 Class AB amplifier.

When the power supply voltage is limited, and cascode devices cannot be stacked and still preserve adequate signal swing, a two-stage amplifier is a common alternative to the folded cascode amplifier. As shown in Fig. 55.34, the input differential pair of M1 and M2 now feeds the active load current sources of M9 and M10 to form the first stage. The second stage comprises common-source amplifiers M7 and M8, loaded with current sources M5 and M6. Due to the presence of two poles from the two stages of roughly comparable frequencies, compensation is generally achieved with a pole-splitting RC local feedback network as shown.52 Often, the resistors RC1 and RC2 are actually implemented as NMOS devices biased into their ohmic region by tying their gates to VDD. In this arrangement, the effective resistance of RC1 and RC2 will approximately track any drift in mobility of M7 and M8 over temperature and processing variations, preserving the compensated phase margin. For a given process, the bandwidth of a two-stage amplifier is less than what can be achieved by a folded cascode design; but because the two-stage amplifier has no stacked cascode devices, the signal swing is higher.

In the case of modulators with higher clock speeds, both folded cascode and two-stage amplifiers may have unacceptably long settling times; in these amplifiers, the maximum slewing current that can be applied to charge or discharge the load capacitance is limited by fixed current sources. This slewing limitation can be overcome by a class AB amplifier topology that can supply a variable amount of output current and is capable of providing a large pulse of current early in the settling cycle when the differential input error voltage is high. A typical class AB amplifier topology is shown in Fig. 55.35. The input differential pair from the folded cascode and two-stage designs is replaced by M1 through M4, and their drain currents are mirrored to the output current sources M9–M12 by diode-connected devices M5–M8. Cascode devices M13–M16 enhance the output impedance and gain. As with the folded cascode design, frequency compensation is accomplished by a dominant pole at the output node.
The input voltage is fed directly to the NMOS input devices and to the PMOS input devices through the level-shifting source follower and diode combination M17–M20. This establishes the quiescent bias current through the input network M1–M4, and therefore through the output devices as well.

In each of the three amplifier topologies discussed above, there is either one or a set of two matched current sources driving both differential outputs. These current sources are controlled by a gate bias line labeled VCMBIAS. The current output of these devices will determine the common-mode output voltage of the amplifier independent, to the first order, of the amplified differential signal. The appropriate potential for VCMBIAS is determined by a feedback loop that is only operable in the common mode and is separate from the differential feedback instrumental in the charge integration process. Since a discrete-time modulator is, by its nature, clocked periodically, a natural choice for the implementation of this common-mode feedback loop is the switched-capacitor network of Fig. 55.36.44,45 Capacitors CCM1 and CCM2 act as a voltage divider for transient voltages that derives the average, or common-mode, voltage of the amplifier output terminals. This applies corrective negative feedback transients to the VCMBIAS node to stabilize the feedback loop during each clock period while the amplifier is differentially settling. A dc bias is then maintained on CCM1 and CCM2 by the switched-capacitor network on the left side of the figure. This will slowly transfer the charge necessary to establish and maintain a dc level shift that makes up the difference between the common-mode level desired at the amplifier output terminals (VCMDES) and the approximate gate bias required by the common-mode current devices (VBAPPROX). The former is usually set at mid-supply by a voltage divider, and the latter can be derived from a matched diode-connected device. Since the clocking of this switching network is done synchronously to the amplifier integrator clocking, no charge injection will occur during the sensitive settling process of the amplifier. In order to minimize the charge injection at the clock transitions, capacitors CS1 and CS2 are usually made very small, and therefore dozens of clock cycles may be required for the common-mode bias to settle and the modulator to become operable.

FIGURE 55.36 Switched-capacitor common-mode feedback.

Comparators

The noise shaping mechanism of the modulator feedback loop allows the loop behavior to be tolerant of large errors in circuit behavior at locations closer to the output end of the network. Modulators are generously tolerant of large offset errors in the comparators used in the A/D converter forming the feedback path. For this reason, almost all modulators use simple regenerative latches as comparators. No preamp is generally needed, as the small error from clock kickback can easily be tolerated. Simulations show that offset errors even as large as 10% of the reference level will not degrade modulator performance significantly. The circuit of Fig. 55.37 is typical.54 This is essentially a latch composed of two cross-connected CMOS inverters, M1–M4. Switch devices M5–M8 will disconnect this network when the clock input is low, and throw the network into a regenerative mode with the rising edge of the clock. The state in which the regeneration will settle may be steered by the relative strengths of the bias current output by devices M9 and M10, which in turn depend on the differential input voltage.

Complete Modulator

Figure 55.38 illustrates a complete second-order, single-bit feedback modulator assembled from the components discussed above.11 The discrete-time integrator gain factors that are derived in Sections 55.2 and 55.3 are realized by appropriate ratios between the integration and sampling capacitors in each stage.

FIGURE 55.37 Typical modulator comparator.


FIGURE 55.38 Complete second-order sigma-delta modulator.

Since the single-bit feedback DAC is only responsible for generating two output levels, it may be implemented by simply switching an applied differential reference voltage VREF+ to VREF– in a direct or reversed sense to the sampling capacitor bottom plates during the amplifier integration phase, φ2 .
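The behavior of such a loop can be explored with a few lines of simulation. The sketch below is a behavioral model only, not a description of the circuit in Fig. 55.38: it assumes two delaying integrators of the form of Eq. 55.33, an ideal comparator, a reference normalized to 1, and integrator gains of 0.5 each. Those gain values are an assumption chosen for illustration; the gains derived in Sections 55.2 and 55.3 are not reproduced here.

```python
def second_order_modulator(x, a1=0.5, a2=0.5):
    """Behavioral model of a second-order, single-bit sigma-delta loop.

    Both integrators are delaying, and the comparator output y = +1 or -1
    stands for feedback of +VREF or -VREF with VREF normalized to 1.
    """
    u1 = u2 = 0.0          # integrator output voltages
    bits = []
    for sample in x:
        y = 1.0 if u2 >= 0.0 else -1.0   # comparator decision
        bits.append(y)
        # Both integrators update from the values held during this cycle;
        # the tuple assignment uses the old u1 for the second stage.
        u1, u2 = u1 + a1 * (sample - y), u2 + a2 * (u1 - y)
    return bits


dc_input = 0.3
bits = second_order_modulator([dc_input] * 20000)
mean_output = sum(bits) / len(bits)
```

For a constant input well inside the reference range, the average of the bit stream converges toward the input value, which is the property the decimation filter relies on to recover the signal.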

D/A Circuits

For the DAC system shown in Fig. 55.2, the oversampled bit stream is generated by straightforward digital implementations of the modulator signal flow graphs discussed in Section 55.2. The remaining analog components are the low-resolution DAC block and the reconstruction filter. Integrated sigma-delta D/A implementations often employ two-level quantization, and the DAC block may be designed as either charge-based55 or current-based.56 Multi-level DAC approaches are also used, but to keep harmonic content below about –60 dB relative to the reference, some form of dynamic element matching must be added, as discussed in Section 55.6. The charge-based approach for sigma-delta D/A conversion is illustrated in Fig. 55.39, which is similar to the switched-capacitor integrator of Fig. 55.29, but without an analog signal input. As in Fig. 55.38, VDAC may be either polarity of VREF according to the bit value to be converted. Figure 55.40 shows a typical topology for current-based converters. In both cases, the leaky integration function around the amplifier contributes the first pole of the reconstruction filtering. An efficient combination of the current-based approach and a digital delay line realizing an FIR reconstruction filter is also possible.57 Additional reconstruction filtering beyond that provided by the DAC may also be necessary. This is accomplished using the appropriate analog sampled-data filtering techniques described in Chapter 59 of this section.


FIGURE 55.39 Charge-based DAC.

FIGURE 55.40 Current-based DAC.

Continuous-Time Modulators

In general, the amplifiers contained in the switched-capacitor integrators in a sampled-data sigma-delta data converter dissipate the majority of the analog circuit power. Since the integrator sections must settle accurately within each clock period at the oversampled rate, the amplifiers must often be designed with a unity-gain frequency much higher than the oversampled rate; typical unity-gain frequencies are in the hundreds of MHz. In applications in which dissipating the lowest possible power is important, sigma-delta modulators may also be implemented using continuous-time integrators. In these continuous-time modulators, the analog signal is not sampled until the quantizer at the back of the modulator loop.58 Owing to the typical means employed for the DAC feedback, continuous-time modulators tend to be more sensitive to sampling clock jitter, but the influences of any aliasing distortion and non-linearity at the sampler take place late in the loop where noise shaping is steepest, and as a consequence the anti-aliasing filter of Fig. 55.1 may often be omitted.59 The power advantage comes from the relaxed speed requirement of the integrator stages, which now need only have unity-gain frequencies on the order of the oversampled clock frequency.


FIGURE 55.41 Gm-C integrator.

Instead of switched-capacitor discrete-time integrators, the continuous-time modulators generally use active Gm-C integrators. Circuits like that shown in Fig. 55.41 are typical.59 The input differential pair M1 and M2 is degenerated by source resistance R1 to improve linearity. The output analog voltage is developed across capacitor C1, which may be split as shown to place the bottom plate parasitic capacitance at a common-mode node. As the integrator is now unclocked, continuous-time common-mode feedback must be used, as discussed in the literature for continuous time filtering.60

55.6 Practical Design Issues

As with any design involving analog components, there are a number of circuit limitations and tradeoffs in sigma-delta data converter design. The design considerations discussed in this section include kT/C noise, integrator scaling, amplifier gain, and sampling non-linearity. Also discussed in this section are the techniques of integrator reset and multi-level feedback.

kT/C Noise In switched-capacitor-based modulators, one fundamental non-ideality associated with using a MOS device to sample a voltage on a capacitor is the presence of a random variation of the sampled voltage after the MOS switch opens.61-63 This random component has a Gaussian distribution with a variance of kT/C, where k is Boltzman’s constant, C is the capacitance, and T is the absolute temperature. The variation stems from thermal noise in the resistance of the MOS channel as it is opening. The noise voltage has a mean power of 4kTRB, where R is the channel resistance and B is the bandwidth. It is lowpass filtered by its characteristic resistance and the sampling capacitor to an equivalent noise bandwidth of 1/RC. The total integrated variance will thus be kT/C, independent of the resistance of the switch. If, in the process of developing the integrated signal, a sampling operation on n capacitors is used, then since we assume Gaussian noise distribution, the variance of the eventual integrated value will be nkT/C. For the case of a fully differential integrator, where a differential signal is sampled onto two sampling capacitors and then transferred to two integration capacitors, n is 4. This effect, along with the input referred noise of the amplifier, will limit the achievable noise floor of the modulator. The first stage sampling capacitors must be sized so as to limit this noise contribution to an acceptable level. From this


FIGURE 55.42 Integrator output distribution for an eight-level modulator.

starting point, and the capacitive ratios required for realizing the various integrator gains, the remaining capacitor sizes can be determined. The modulator will be much less sensitive to kT/C noise generated in integrators past the first, and the capacitors in these integrators can be made considerably smaller.
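As a rough numerical illustration of the kT/C limit, the following Python sketch computes the rms sampled noise for a given capacitor and the smallest capacitor that meets a noise target. The function names, the 300 K default temperature, and the example values are assumptions chosen for illustration; n_samples plays the role of n in the text (4 for the fully differential integrator).

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann's constant, J/K

def ktc_noise_vrms(cap_farads, n_samples=1, temp_kelvin=300.0):
    """RMS noise voltage after n independent sampling operations: sqrt(n*k*T/C)."""
    return math.sqrt(n_samples * K_BOLTZMANN * temp_kelvin / cap_farads)

def min_sampling_cap(target_vrms, n_samples=1, temp_kelvin=300.0):
    """Smallest capacitance that keeps the sampled noise at or below target_vrms."""
    return n_samples * K_BOLTZMANN * temp_kelvin / target_vrms ** 2

def integrated_rc_noise(r_ohms, cap_farads, temp_kelvin=300.0):
    """Noise density 4kTR times the single-pole noise bandwidth 1/(4RC).

    R cancels, which is why the result does not depend on the switch resistance.
    """
    bandwidth_hz = 1.0 / (4.0 * r_ohms * cap_farads)
    return 4.0 * K_BOLTZMANN * temp_kelvin * r_ohms * bandwidth_hz

if __name__ == "__main__":
    print(ktc_noise_vrms(1e-12))        # one sample on 1 pF, about 64 uV rms
    print(ktc_noise_vrms(1e-12, 4))     # fully differential integrator, n = 4
    print(min_sampling_cap(10e-6, 4))   # capacitance for a 10 uV rms target, n = 4
```

Because the noise voltage falls only as the square root of the capacitance, halving the noise requires four times the capacitance, which is why the first-stage capacitors tend to dominate the area.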

Integrator Gain Scaling The integration stages in Section 55.2 were discussed as ideal elements, capable of developing any real output voltage. In practice, the output voltage of real integrators is limited to at most the supply voltage of the embedded amplifier. To ensure that this limitation does not adversely affect the modulator performance, a survey of the likely limit of integrator output voltages must be made for a given value of the DAC reference voltage. The modulator may be simulated over a large number of samples with a representative sinusoidal input, and a histogram of all encountered output voltages tabulated. These histograms may be expected to scale linearly with the reference voltage level. In general, this statistical survey will show that a modulator designed to realize the integrator gain constants in the ideal topologies of Sections 55.2 and 55.3 will have different ranges of expected output voltages from each of its integrators. For example, Figs. 55.42 and 55.43 show the simulated output voltages at the two integrators in a second-order modulator with eight-level and two-level feedback, respectively. Since the largest possible reference level will generally give the best available signal-to-noise ratio for a given circuit power consumption, the integrator gain constants may be adjusted from their straightforward values so that the overall modulator transfer function remains the same, but the output voltages are scaled so that no integrator limits the signal swing markedly before the others.11 Figures 55.44 and 55.45 illustrate the properly scaled second-order modulator examples.
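The survey and rescaling described above can be sketched in Python. This is a minimal behavioral model under stated assumptions: a generic second-order, two-level loop with delaying integrators (not necessarily the exact topology of Section 55.2), hypothetical coefficient names b1, a1, c1, a2, and a sinusoidal test input with a 1024-sample period. Scaling an integrator output by k multiplies every coefficient feeding that integrator by k and divides every coefficient it drives by k, so the bit stream is unchanged.

```python
import math

def simulate(b1, a1, c1, a2, n_samples=20000, amplitude=0.5, vref=1.0):
    """Assumed loop: x1 += b1*u - a1*y ; x2 += c1*x1 - a2*y ; y = sign(x2)*vref.

    Returns the output bits and the peak magnitude at each integrator output.
    """
    x1 = x2 = 0.0
    peak1 = peak2 = 0.0
    bits = []
    for n in range(n_samples):
        u = amplitude * math.sin(2.0 * math.pi * n / 1024.0)
        y = vref if x2 >= 0.0 else -vref      # two-level quantizer and DAC
        x2 = x2 + c1 * x1 - a2 * y            # uses the previous x1 (delaying stage)
        x1 = x1 + b1 * u - a1 * y
        peak1 = max(peak1, abs(x1))
        peak2 = max(peak2, abs(x2))
        bits.append(1 if y > 0 else 0)
    return bits, peak1, peak2

def rescale(b1, a1, c1, a2, k1, k2):
    """Scale the first integrator output by k1 and the second by k2."""
    return b1 * k1, a1 * k1, c1 * k2 / k1, a2 * k2

if __name__ == "__main__":
    plain = (1.0, 1.0, 1.0, 2.0)              # unscaled coefficients (assumed)
    scaled = rescale(*plain, k1=0.5, k2=0.25)
    bits_a, p1a, p2a = simulate(*plain)
    bits_b, p1b, p2b = simulate(*scaled)
    print("unscaled peaks:", p1a, p2a)
    print("scaled peaks:  ", p1b, p2b)
    print("same bit stream:", bits_a == bits_b)
```

A full survey would tabulate a histogram of x1 and x2 rather than only the peaks; the peak values are used here to keep the sketch short. The scale factors are powers of two so that the floating-point comparison of the two bit streams is exact.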

Amplifier Gain Another mechanism by which the actual characteristic of the integrator circuits falls short of the ideal is the limitation of finite amplifier gain. A study of many simulations of modulators with various amplifier gains11 has shown that a modulator needs amplifiers with gains about numerically equal to the decimation ratio of the filter that follows it in order to avoid significant noise shaping errors. This result holds for perfectly linear amplifiers; in practice, amplifier gains often need to be at least 10 times this high to avoid distortion in the integrator characteristic due to the non-linearity of the amplifier gain characteristic.
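The rule of thumb above is easy to turn into numbers. The short Python sketch below converts a decimation ratio into the two gain targets in decibels; the function name and the default margin of 10 are illustrative assumptions that simply restate the guideline in the text.

```python
import math

def amplifier_gain_targets_db(decimation_ratio, nonlinearity_margin=10.0):
    """Return (gain for a linear amplifier, practical gain) in dB.

    The first value equals the decimation ratio expressed in dB; the second
    adds the factor-of-ten margin suggested for amplifier non-linearity.
    """
    linear_case = 20.0 * math.log10(decimation_ratio)
    practical = 20.0 * math.log10(decimation_ratio * nonlinearity_margin)
    return linear_case, practical

if __name__ == "__main__":
    # Example: a decimation ratio of 256 gives roughly 48 dB and 68 dB.
    print(amplifier_gain_targets_db(256))
```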


FIGURE 55.43 Integrator output distribution for a two-level modulator.

FIGURE 55.44 Integrator output distribution for a scaled eight-level modulator.

One approach used when the simple circuits of Section 55.5 (Operational Amplifiers) do not develop enough gain in a given process is the regulated cascode gain enhancement.64,65,68 Figure 55.46 illustrates a typical circuit topology. This subcircuit may be substituted for the output common-source amplifier stages in the amplifiers of Figs. 55.33 to 55.35 if the power supply voltage can accommodate its somewhat increased requirement for headroom.

Sampling-Nonlinearity and Reference Corruption The sigma-delta modulator is remarkably tolerant of most circuit non-idealities past the input sampling network. However, the linearity of the sampling process at the very first input sampling capacitor will be the upper bound for the linearity for the entire modulator. Care must be exercised to ensure the


FIGURE 55.45 Integrator output distribution for a scaled two-level modulator.

FIGURE 55.46 Regulated cascode topology.

switches are sufficiently large so that the sampled voltage will be completely settled through their nonlinear resistance, but not so large that any residual signal-dependent clock feedthrough is significant. Another susceptibility of modulators is to non-linear corruption of the reference voltage. If the digital bit stream output, through a parasitic feedback path either on- or off-chip, can affect the reference voltage sampled during clock phase φ2 in Fig. 55.29, then there will be a term in the output signal dependent on the square of the input voltage. This will distort the noise shaping properties of the modulator and generate second harmonic distortion, even with fully differential circuitry. This is illustrated in the spectrum in Fig. 55.47, which is the output of a modulator having the same conditions as Fig. 55.14, except that a parasitic feedback path is assumed that would change the reference voltage by 1% for the “1” output bits on the previous cycle, relative to its value with “0” output bits. As can be seen by comparison with Fig. 55.14, the ability of the modulator to shift quantization noise out of the baseband has been greatly compromised, and a prominent second harmonic has been generated. Care must be taken in chip and printed circuit board application design so that the reference voltage remains isolated from the signals carrying the output bit stream.
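One way to see where the squared term may come from is a simplified model, given here as an illustrative assumption rather than an analysis of Fig. 55.29. Let y[n] = ±1 be the output bit, b[n] = (1 + y[n])/2 its 0/1 form, and ε the fractional change of the reference caused by a “1” on the previous cycle. The feedback voltage is then approximately:

```latex
\begin{aligned}
V_{fb}[n] &= y[n]\,V_{REF}\,\bigl(1 + \varepsilon\,b[n-1]\bigr),
  \qquad b[n-1] = \tfrac{1}{2}\bigl(1 + y[n-1]\bigr) \\
          &= V_{REF}\Bigl[\bigl(1 + \tfrac{\varepsilon}{2}\bigr)\,y[n]
             + \tfrac{\varepsilon}{2}\,y[n]\,y[n-1]\Bigr]
\end{aligned}
```

The first term is only a small gain change. The second term is a product of two successive output samples; since the bit stream carries the input signal in the baseband, this product contains a component proportional to the square of the input, which is consistent with the second harmonic described above. With ε = 0.01, as in the example of Fig. 55.47, the product term has a weight of 0.005 relative to the reference.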


FIGURE 55.47 Output spectrum with reference corruption.

Fully differential circuitry is almost universally employed in integrated VLSI modulators to reduce sampling non-linearity and reference contamination. Even-order non-linearities and common-mode switch feedthrough are cancelled with fully differential circuits, and power supply rejection is greatly improved, leading to more isolated reference potentials. For high-precision modulators, the integrator topology is often changed from that of Fig. 55.29 to Fig. 55.48.26,51 The input signal and the DAC output voltage are sampled independently during phase φ1, and then both are discharged together into the summing node during φ2. At the expense of additional area for capacitors and higher kT/C noise, this

FIGURE 55.48 Integrator with separate DAC feedback capacitor.


arrangement ensures that the same charge is drawn from the reference supply onto the DAC sampling capacitors CDP and CDM and then discharged into the summing node each cycle. Thus, a potentially undesirable mechanism for reference supply loading that is dependent on the output bit history is eliminated.66

High-Order Integrator Reset Although careful design of the loop filter for higher-order modulators, as discussed above in Section 55.3 (High-Order Modulators), will yield a generally stable design, their stability cannot be mathematically guaranteed, as in the case of second-order loops. To protect against the highly undesirable state of low frequency limit cycle oscillations due to an occasional, but improbable, input overload condition, some form of forced integrator reset is sometimes used.26,51 Generally, these count the consecutive ‘1’ or ‘0’ bits out of the modulator, and close a resetting switch to discharge integration capacitors for a short time if the modulator generates a longer consecutive sequence than normal operation allows. This will naturally interrupt the operation of the modulator, but will only be triggered in the case of pathological input patterns for which linear operation would not necessarily be expected. Another approach to a stability safety mechanism for higher-order loops is to arrange the scaling of the integrator gains so that they clip against their maximum output voltage swings in a prescribed sequence as the input level rises. The sequence is designed to gradually lower the effective order of the modulator,59 and return operation to a stable mechanism.
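The run-length counting described above can be sketched as a small state machine. The Python class below is a behavioral illustration only; the class name, the max_run threshold, and the choice to restart the count after a reset are assumptions, since the text does not specify them.

```python
class ResetMonitor:
    """Requests an integrator reset when one bit value repeats too many times."""

    def __init__(self, max_run):
        self.max_run = max_run   # longest run expected in normal operation (assumed)
        self.last_bit = None
        self.run = 0

    def update(self, bit):
        """Feed one modulator output bit; return True if a reset should be issued."""
        if bit == self.last_bit:
            self.run += 1
        else:
            self.last_bit = bit
            self.run = 1
        if self.run > self.max_run:
            # Start counting afresh once the reset has been requested.
            self.run = 0
            self.last_bit = None
            return True
        return False

if __name__ == "__main__":
    monitor = ResetMonitor(max_run=3)
    stream = [1, 0, 1, 1, 1, 1, 0]
    print([monitor.update(b) for b in stream])
```

In a real modulator the threshold would be derived from simulations of the longest run seen with in-range inputs, so that the reset only fires under overload.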

Multi-level Feedback Expanding the second-order modulator to more than two-level feedback can be accomplished by the circuit in Fig. 55.49. For K-level feedback, K – 1 latch comparators are arranged in a flash structure as

FIGURE 55.49 Multi-bit second-order modulator.


shown on the right. There must be a different offset voltage designed into each of the comparators so that they detect when each quantization level is crossed by the second integrator output. This can be implemented by a resistor string54,67 or an input capacitive sampling network.68 The output of the K – 1 comparators is then a thermometer code representation of the integrator output. This may be translated into binary for the modulator output, but the raw thermometer code is the most convenient to use as a set of feedback signals. They each will drive a switch that will select either VREF+ or VREF– to be used as the bottom plate potential for the integrator sampling capacitors. If all sampling capacitors are of equal value, the net charge being integrated will have a term that varies linearly with the quantization level. Each comparator output drives two switches, so there are 2(K – 1) switches and capacitors in the sampling array. In any practical integrated structure, even if careful common-centroid layout techniques are used, the precision with which the various capacitors forming the sampling array will be matched is typically limited to 0.1 or 0.2%. As discussed in Section 55.2, this will limit the harmonic distortion that is inherent in the modulator to about –60 dB or higher. However, by varying the assignment of which sampling switch is driven by which comparator output dynamically as the modulator is clocked, much of this distortion may be traded off for white or frequency-shaped noise at the modulator output. This technique is referred to as dynamic element matching. One simple way of implementing dynamic element matching is indicated in Fig. 55.49 with the block labeled “scrambler.” This block typically comprises an array of switches that provide a large number of permutations in the way the comparator output lines can be mapped onto sampling switch lines. 
A multiple-stage butterfly network is one relatively simple approach.69 The mapping permutation is then changed every modulator clock cycle. Assuming each comparator output will, over a time period less than the final baseband period, end up mapped to all the sampling capacitors, all of the capacitor mismatches will be averaged out. The energy that would be found in input signal harmonics without scrambling will be spread out in some fashion into a higher modulator noise output. There have been various algorithms published in the literature on how best to control the sequence of mapping permutations. A simple random sequence will render the mismatch into white noise, increasing the baseband output noise floor.68,69 More complex sequences relying on a knowledge of the history of quantizer levels are capable of coloring the spread noise so that much of it appears outside the baseband and is suppressed by the following decimation filtering.70-73
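The flash quantizer and the random form of dynamic element matching can be illustrated with a short Python sketch. It is a simplified, single-ended model with K – 1 unit capacitors, and it uses a random permutation in place of a butterfly network; the function names, the evenly spaced thresholds, and the example mismatch values are all assumptions made for illustration.

```python
import random

def flash_thermometer(v, k_levels, vref=1.0):
    """K-1 comparators with thresholds midway between adjacent feedback levels."""
    n_comp = k_levels - 1
    step = 2.0 * vref / n_comp
    thresholds = [-vref + (i + 0.5) * step for i in range(n_comp)]
    return [1 if v > t else 0 for t in thresholds]

def feedback_charge(code, caps, vref=1.0):
    """Net charge: +vref*C for each capacitor driven by a 1, -vref*C for a 0."""
    return sum(c * (vref if bit else -vref) for bit, c in zip(code, caps))

def scramble(code, rng):
    """Randomly permute comparator lines onto switch lines (white-noise scrambling)."""
    return rng.sample(code, len(code))

def average_charge(m_ones, caps, trials, rng=None):
    """Average feedback charge for a level with m_ones active comparators."""
    code = [1] * m_ones + [0] * (len(caps) - m_ones)
    total = 0.0
    for _ in range(trials):
        mapped = scramble(code, rng) if rng is not None else code
        total += feedback_charge(mapped, caps)
    return total / trials

if __name__ == "__main__":
    caps = [1.002, 0.999, 1.001, 0.998, 1.0, 1.001, 0.999]   # assumed 0.1-0.2% mismatch
    print(flash_thermometer(0.1, 8))                         # eight-level quantizer
    print(average_charge(1, caps, 1))                        # fixed mapping
    print(average_charge(1, caps, 20000, random.Random(0)))  # scrambled mapping
```

With a fixed mapping, each level always uses the same capacitors, so the level spacing carries a static error. With scrambling, every capacitor is equally likely to be selected, and the average charge for a level approaches the ideal value, at the cost of cycle-to-cycle noise.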

55.7 Summary In this chapter, a brief overview of sigma-delta data converters has been presented. Sigma-delta data conversion is a technique that effectively trades speed for resolution. High-linearity data conversion can be accomplished in modern IC processes without expensive device trimming or calibration. For a far more detailed treatment of this topic, refer to Norsworthy, Schreier, and Temes.2 For a compilation of some of the seminal papers that helped establish sigma-delta modulation as a mainstream technique, refer to Candy and Temes.1

References
1. J. Candy and G. Temes, Oversampling Delta-Sigma Data Converters, IEEE Press, 1992.
2. S. Norsworthy, R. Schreier, and G. Temes, Delta-Sigma Data Converters: Theory, Design, and Simulation, IEEE Press, 1996.
3. C. Cutler, Transmission systems employing quantization, U.S. Patent 2,927,962, Mar. 8, 1960.
4. H. Spang III and P. Schultheiss, Reduction of quantizing noise by use of feedback, IRE Trans. on Communication Systems, pp. 373–380, Dec. 1962.
5. H. Inose and Y. Yasuda, A unity bit coding method by negative feedback, Proc. IEEE, vol. 51, pp. 1524–1535, Nov. 1963.


6. S. Tewksbury and R. Hallock, Oversampled, linear predictive and noise-shaping coders of order N>1, IEEE Trans. on Circuits and Systems, vol. CAS-25, pp. 436–447, July 1978.
7. J. Candy, A use of limit cycle oscillations to obtain robust analog-to-digital converters, IEEE Trans. on Communications, vol. COM-22, pp. 298–305, Mar. 1974.
8. H. Fiedler and B. Hoefflinger, A CMOS pulse density modulator for high-resolution A/D converters, IEEE J. of Solid-State Circuits, vol. SC-19, pp. 995–996, Dec. 1984.
9. B. Leung, R. Neff, P. Gray, and R. Brodersen, Area-efficient multichannel oversampled PCM voiceband coder, IEEE J. of Solid-State Circuits, vol. SC-23, pp. 1351–1357, Dec. 1988.
10. J. Candy, A use of double integration in sigma delta modulation, IEEE Trans. on Communications, vol. COM-33, pp. 249–258, Mar. 1985.
11. B. Boser and B. Wooley, The design of sigma-delta modulation analog-to-digital converters, IEEE J. of Solid-State Circuits, vol. 23, pp. 1298–1308, Dec. 1988.
12. V. Friedman, D. Brinthaupt, D. Chen, T. Deppa, J. Elward, E. Fields, J. Scott, and T. Viswanathan, A dual-channel voice-band PCM codec using Σ∆ modulation technique, IEEE J. of Solid-State Circuits, vol. 24, pp. 274–280, Apr. 1989.
13. W. Bennett, Spectra of quantized signals, Bell System Tech. Journal, vol. 27, pp. 446–472, 1948.
14. B. Widrow, Statistical analysis of amplitude quantized sampled-data systems, Trans. AIEE, vol. 79, pp. 555–568, Jan. 1961.
15. R. Gray, Oversampled sigma-delta modulation, IEEE Trans. on Communications, vol. COM-35, pp. 481–489, May 1987.
16. J. Candy and O. Benjamin, The structure of quantization noise from sigma-delta modulation, IEEE Trans. on Communications, vol. COM-29, pp. 1316–1323, Sept. 1981.
17. B. Boser and B. Wooley, Quantization error spectrum of sigma-delta modulators, 1988 IEEE Intnl. Symp. on Circuits and Systems, pp. 2331–2334, 1988.
18. R. Gray, Quantization noise spectra, IEEE Trans. Information Theory, vol. 36, pp. 1220–1244, Nov. 1990.
19. L. Williams and B. Wooley, A third-order sigma-delta modulator with extended dynamic range, IEEE J. of Solid-State Circuits, vol. 29, Mar. 1994.
20. B. Brandt, D. Wingard, and B. Wooley, Second-order sigma-delta modulation for digital-audio signal acquisition, IEEE J. of Solid-State Circuits, vol. 26, pp. 618–627, Apr. 1991.
21. J. Everard, A single-channel PCM codec, IEEE J. of Solid-State Circuits, vol. SC-14, pp. 25–37, Feb. 1979.
22. M. Hauser, P. Hurst, and R. Brodersen, MOS ADC-filter combination that does not require precision analog components, ISSCC Dig. Tech. Papers, pp. 80–82, Feb. 1985.
23. S. Norsworthy, Effective dithering of sigma-delta modulators, Proc. of the 1992 IEEE Intnl. Symp. on Circuits and Systems, pp. 1304–1307, May 1992.
24. D. Welland, B. Del Signore, E. Swanson, T. Tanaka, K. Hamashita, S. Hara, and K. Takasuka, A stereo 16-bit delta-sigma A/D converter for digital audio, J. Audio Engineering Society, vol. 37, pp. 476–486, June 1989.
25. K. Chao, S. Nadeem, W. Lee, and C. Sodini, A higher order topology for interpolative modulators for oversampling A/D converters, IEEE Trans. on Circuits and Systems, vol. 37, pp. 309–318, Mar. 1990.
26. R. Adams, P. Ferguson, A. Ganesan, S. Vincelette, A. Volpe, and R. Libert, Theory and practical implementation of a fifth-order sigma-delta A/D converter, J. Audio Eng. Soc., vol. 39, pp. 515–528, July/Aug. 1991.
27. R. Schreier, An empirical study of high-order single-bit delta-sigma modulators, IEEE Trans. on Circuits and Systems. II. Analog and Digital Signal Processing, vol. 40, Aug. 1993.
28. A. Oppenheim and R. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989.
29. Y. Matsuya, K. Uchimura, A. Iwata, T. Kobayashi, M. Ishikawa, and T. Yoshitome, A 16-bit oversampling A-to-D conversion technology using triple-integration noise shaping, IEEE J. of Solid-State Circuits, vol. SC-22, pp. 921–929, Dec. 1987.


30. L. Longo and M. Copeland, A 13 bit ISDN-band oversampling ADC using two-stage third order noise shaping, IEEE 1988 Custom Integrated Circuits Conference, pp. 21.2.1–4, 1988.
31. L. Williams and B. Wooley, Third-order cascaded sigma-delta modulators, IEEE Trans. on Circuits and Systems, vol. 38, pp. 489–498, May 1991.
32. P.-W. Wong and R. Gray, Two stage sigma-delta modulation, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, pp. 1937–1952, Nov. 1990.
33. L. Longo and B.-R. Horng, A 15b 30kHz bandpass sigma-delta modulator, 1993 IEEE Intnl. Solid-State Circuits Conf., pp. 226–227, Feb. 1993.
34. R. Schreier and W. M. Snelgrove, Decimation for bandpass sigma-delta analog-to-digital conversion, 1990 IEEE Intnl. Symp. on Circuits and Systems, vol. 3, pp. 1801–1804, May 1990.
35. R. Gregorian and G. Temes, Analog MOS Integrated Circuits for Signal Processing, John Wiley & Sons, 1986.
36. D. Goodman and M. Carey, Nine digital filters for decimation and interpolation, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-25, pp. 121–126, Apr. 1977.
37. R. Crochiere and L. Rabiner, Multirate Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1983.
38. E. Hogenauer, An economical class of digital filters for decimation and interpolation, IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-29, pp. 155–162, Apr. 1981.
39. S. Chu and C. Burrus, Multirate filter designs using comb filters, IEEE Trans. on Circuits and Systems, vol. CAS-31, pp. 913–924, Nov. 1984.
40. J. Candy, Decimation for sigma delta modulation, IEEE Trans. on Communications, vol. COM-34, pp. 72–76, Jan. 1986.
41. B. Brandt and B. Wooley, A low-power, area-efficient digital filter for decimation and interpolation, IEEE J. of Solid-State Circuits, vol. 29, pp. 679–687, June 1994.
42. T. Choi et al., High-frequency CMOS switched-capacitor filters for communications applications, IEEE J. of Solid-State Circuits, vol. 15, pp. 929–938, Dec. 1980.
43. W. C. Black et al., A high-performance low power CMOS channel filter, IEEE J. of Solid-State Circuits, vol. 15, pp. 929–938, Dec. 1980.
44. D. Senderowitz et al., A family of differential NMOS analog circuits for a PCM codec filter chip, IEEE J. of Solid-State Circuits, vol. 17, pp. 1014–1023, Dec. 1982.
45. R. Castello and P. R. Gray, A high-performance micropower switched-capacitor filter, IEEE J. of Solid-State Circuits, vol. 20, pp. 1122–1132, Dec. 1985.
46. T. B. Cho and P. R. Gray, A 10b, 20 Msample/s, 35 mW pipeline A/D converter, IEEE J. of Solid-State Circuits, vol. 30, pp. 166–172, Mar. 1995.
47. K. Nagaraj et al., Switched-capacitor integrator with reduced sensitivity to amplifier gain, Electronics Letters, vol. 22, p. 1103, Oct. 1986.
48. K. Haug, G. C. Temes, and L. Martin, Improved offset-compensation scheme for SC circuits, 1984 IEEE Intnl. Symp. on Circuits and Systems, pp. 1054–1057, 1984.
49. P. J. Hurst and W. J. McIntyre, Double sampling on switched-capacitor delta-sigma A/D converters, 1990 IEEE Symp. on Circuits and Systems, pp. 902–905, May 1990.
50. D. Senderowitz et al., Low voltage double-sampled sigma-delta converters, IEEE J. of Solid-State Circuits, vol. 32, pp. 1907–1919, Dec. 1997.
51. P. Ferguson et al., An 18b, 20kHz dual sigma-delta A/D converter, 1991 IEEE Intnl. Solid-State Circuits Conf., pp. 68–69, Feb. 1991.
52. P. R. Gray and R. G. Meyer, MOS operational amplifier design - A tutorial overview, IEEE J. of Solid-State Circuits, vol. 17, pp. 969–981, Dec. 1982.
53. A. Abidi, C. Viswanathan, J. Wu, and J. Wikstrom, Flicker noise in CMOS: A unified model for VLSI processes, 1987 Symp. VLSI Technology, pp. 85–86, May 1987.
54. A. Yukawa, A CMOS 18-bit high speed A/D converter IC, IEEE J. of Solid-State Circuits, vol. 20, pp. 775–779, June 1985.


55. B. Kup, E. Dijkmans, P. Naus, and J. Sneep, A bit-stream digital-to-analog converter with 18-b resolution, IEEE J. of Solid-State Circuits, vol. 26, pp. 1757–1763, Dec. 1991.
56. R. Adams, K. Q. Nguyen, and K. Sweetland, A 113-dB SNR oversampled DAC with segmented noise-shaped scrambling, IEEE J. of Solid-State Circuits, vol. 33, pp. 1871–1878, Dec. 1998.
57. D. Su and B. Wooley, A CMOS oversampling D/A converter with a current-mode semidigital reconstruction filter, IEEE J. of Solid-State Circuits, vol. 28, pp. 1224–1233, Dec. 1993.
58. R. Schreier and B. Zhang, Delta-sigma modulators employing continuous-time circuitry, IEEE Trans. on Circuits and Systems. I. Fundamental Theory and Applications, vol. 44, pp. 324–332, Apr. 1996.
59. E. J. van der Zwan and E. C. Dijkmans, A 0.2-mW CMOS sigma-delta modulator for speech coding with 80dB dynamic range, IEEE J. of Solid-State Circuits, vol. 31, pp. 1873–1880, Dec. 1996.
60. Y. P. Tsividis, Integrated continuous-time filter design - an overview, IEEE J. of Solid-State Circuits, vol. 29, pp. 166–176, Mar. 1994.
61. K. C. Hsieh, Noise limitations in switched-capacitor filters, Ph.D. dissertation, Univ. California, Berkeley, Dec. 1981.
62. C. Gobet and A. Knob, Noise analysis of switched-capacitor networks, 1981 Intnl. Symp. on Circuits and Systems, Apr. 1981.
63. C. Gobet and A. Knob, Noise generated in switched-capacitor networks, Electronics Letters, vol. 19, no. 19, 1980.
64. E. Säckinger and W. Guggenbühl, A high-swing, high-impedance MOS cascode circuit, IEEE J. of Solid-State Circuits, vol. SC-25, pp. 289–298, Feb. 1990.
65. K. Bult and G. J. G. M. Geelen, A fast-settling CMOS opamp for SC circuits with 90-dB DC gain, IEEE J. of Solid-State Circuits, vol. SC-25, no. 6, pp. 1379–1384, Dec. 1990.
66. D. Ribner, R. Baertsch, S. Garverick, D. McGrath, J. Krisciunas, and T. Fuji, A third-order multistage sigma-delta modulator with reduced sensitivity to nonidealities, IEEE J. of Solid-State Circuits, vol. 26, pp. 1764–1774, Dec. 1991.
67. B. Brandt and B. Wooley, A 50-MHz multibit sigma-delta modulator for 12-b 2-MHz A/D conversion, IEEE J. of Solid-State Circuits, vol. 26, pp. 1746–1756, Dec. 1991.
68. J. Fattaruso et al., Self-calibration techniques for a second-order multibit sigma-delta modulator, IEEE J. of Solid-State Circuits, vol. 28, pp. 1216–1223, Dec. 1993.
69. L. Carley, A noise-shaping coder topology for 15+ bit converters, IEEE J. of Solid-State Circuits, vol. 24, pp. 267–273, Apr. 1989.
70. F. Chen and B. Leung, A high resolution multibit sigma-delta modulator with individual level averaging, IEEE J. of Solid-State Circuits, vol. 30, pp. 453–460, Apr. 1995.
71. T. Kwan, R. Adams, and R. Libert, A stereo multi-bit Σ∆ D/A with asynchronous master-clock interface, IEEE J. of Solid-State Circuits, vol. 31, pp. 1881–1887, Dec. 1996.
72. B. Leung and S. Sutarja, Multibit Σ-∆ A/D converter incorporating a novel class of dynamic element matching techniques, IEEE Trans. on Circuits and Systems. II. Analog and Digital Signal Processing, vol. 39, pp. 35–51, Jan. 1992.
73. L. Williams III, An audio DAC with 90 dB linearity using MOS to metal-metal charge transfer, 1998 IEEE Intnl. Solid-State Circuits Conf., pp. 58–59, Feb. 1998.


Steyaert, M., Borremans, M., Janssens, J., De Muer, B. "RF Communication Circuits" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

56 RF Communication Circuits

Michiel Steyaert, Marc Borremans, Johan Janssens, Bram De Muer
Katholieke Universiteit Leuven, ESAT-MICAS

56.1 Introduction
56.2 Technology
Active Devices • Passive Devices
56.3 The Receiver
Receiver Topologies • Full Integration • The Downconverter • The LNA
56.4 The Synthesizer
Synthesizer Topology • The Oscillator • The Prescaler • Fully Integrated Synthesizer
56.5 The Transmitter
Down-conversion vs. Up-conversion • CMOS Mixer Topologies
56.6 Toward Fully Integrated Transceivers
56.7 Conclusion

56.1 Introduction A few years ago, the world of wireless communications and its applications started to grow rapidly. The main cause of this growth was the introduction of digital coding and digital signal processing in wireless communications. This digital revolution is driven by the development of high-performance, low-cost CMOS technologies that allow for the integration of an enormous number of digital functions on a single die. This allows, in turn, for the use of sophisticated modulation schemes, complex demodulation algorithms, and high-quality error detection and correction systems, resulting in high-performance lossless communication channels. Today, the digital revolution and the high growth of the wireless market also bring many changes to the analog transceiver front-ends. The front-ends are the interface between the antenna and the digital modem of the wireless transceiver. They have to detect very weak signals (µV) which come in at a very high frequency (1 to 2 GHz) and, at the same time, they have to transmit high power levels (up to 2 W) at the same high frequency. This requires high-performance analog circuits, like filters, amplifiers, and mixers, which translate the frequency bands between the antenna and the A/D conversion and digital signal processing. Low cost and low power consumption are the driving forces, and they make the analog front-ends the bottleneck for future RF design. Both low cost and low power are closely linked to the trend toward full integration. A still higher level of integration yields significant space, cost, and power reductions. Many different techniques to obtain a higher degree of integration for receivers, transmitters, and synthesizers have been presented over the past years.1-3 This chapter introduces these techniques and analyzes their advantages, disadvantages, and fundamental limitations.


Parallel to the trend to further integration, there is the trend to the integration of RF circuitry in CMOS technologies. The mainstream use for CMOS technologies is the integration of digital circuitry. The use of these CMOS technologies for high-performance analog circuits yields, however, many benefits. The technology is, of course — if used without any special adaptations toward analog design — cheap. This is especially true if one wants to achieve the ultimate goal of full integration: the complete transceiver system on a single chip, with both the analog front-end and the digital demodulator implemented on the same die. This can only be achieved in either a CMOS or a BiCMOS process. BiCMOS has better devices for analog design, but its cost will be higher, not only due to the higher cost per area, but also due to the larger area that will be needed for the digital part. Plain CMOS has the extra advantage that the performance gap between devices in BiCMOS and nMOS devices in deep sub-micron CMOS, and even nMOS devices in the same BiCMOS process, is becoming smaller and smaller due to the much higher investments in the development of CMOS than bipolar. The fT’s of the nMOS devices are getting close to the fT’s of npn devices. 
Although some research had been done in the past on the design of RF in CMOS technologies,4 it is only in the last few years that real attention has been given to its possibilities.5,6 Today several research groups at universities and in industry are researching this topic.2,3,7,9 As bipolar devices are inherently better than CMOS devices, RF CMOS is seen by some as suitable only for low-performance systems with reduced specifications (like ISM),8,10 or as requiring adaptations of the CMOS process, like substrate etching under inductors.7 Others feel, however, that the benefits of RF CMOS can be much bigger and that it will be possible to use plain deep sub-micron CMOS for the full integration of transceivers for high-performance applications, like GSM, DECT, and DCS 1800.2,3 First, this chapter analyzes some trends, limitations, and problems in technologies for high-frequency design. Second, the down-converter topologies and implementation problems are addressed. Third, the design and trends toward fully integrated low-phase-noise PLL circuits are discussed. Finally, the design of fully integrated up-converters is studied.

56.2 Technology Active Devices Due to the never-ending progress in technology and the requirement to achieve a higher degree of integration for DSP circuits, sub-micron technologies are nowadays considered standard CMOS technologies. The trend is even toward deep sub-micron technologies (e.g., transistor lengths of 0.1 µm). The square-law relationship for MOS transistors can no longer be used to calculate the ft of a MOS device, due to the high electrical fields. Using a more accurate model, which includes the mobility degradation due to the high electrical fields, results in

$$f_t = \frac{g_m}{2\pi C_{gs}} = \frac{\mu}{2\pi\,\frac{2}{3}L^2}\cdot\frac{(V_{gs} - V_t)}{1 + 2\left(\theta + \frac{\mu}{v_{max}L}\right)(V_{gs} - V_t)} \tag{56.1}$$
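To get a feel for how strongly the second term in the denominator limits the speed gain, Eq. 56.1 can be evaluated numerically. The Python sketch below does this alongside the matching g_m expression of Eq. 56.2. All parameter values (mobility, θ, v_max, overdrive voltage, C_ox, W) are illustrative assumptions and are not taken from this chapter; the relation C_gs = (2/3)·W·L·C_ox is also assumed, as implied by the 2/3 factor in Eq. 56.1.

```python
import math

def degradation(mu, l, vov, theta, vmax):
    """Common factor 1 + 2*(theta + mu/(vmax*L))*(Vgs - Vt) of Eqs. 56.1 and 56.2."""
    return 1.0 + 2.0 * (theta + mu / (vmax * l)) * vov

def gm(mu, cox, w, l, vov, theta, vmax):
    """Eq. 56.2: transconductance including mobility degradation."""
    return mu * cox * w * vov / (l * degradation(mu, l, vov, theta, vmax))

def ft(mu, l, vov, theta, vmax):
    """Eq. 56.1: cut-off frequency; W and Cox cancel for Cgs = (2/3)*W*L*Cox."""
    return mu * vov / (2.0 * math.pi * (2.0 / 3.0) * l ** 2
                       * degradation(mu, l, vov, theta, vmax))

if __name__ == "__main__":
    # Assumed example values in SI units (hypothetical, for illustration only).
    params = dict(mu=0.04, vov=0.2, theta=0.1, vmax=1.0e5)
    ft_long = ft(l=1.0e-6, **params)
    ft_short = ft(l=0.1e-6, **params)
    print("ft at 1 um:   %.3g Hz" % ft_long)
    print("ft at 0.1 um: %.3g Hz" % ft_short)
    print("improvement:  %.1fx (pure square law would give 100x)"
          % (ft_short / ft_long))
```

With these assumed values, a tenfold reduction in L improves f_t by clearly less than the factor of 100 that the square law alone would predict.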

Hence, by going to deep sub-micron technologies, the square law benefit in L for speed improvement drastically reduces due to the second term in the denominator of Eq. 56.1. Even for very deep sub-micron technologies, the small signal parameter gm has no square law relationship anymore:


FIGURE 56.1 Comparison of ft and fmax.

$$g_m = \frac{\mu C_{ox} W (V_{gs} - V_t)}{L\left[1 + 2\left(\theta + \frac{\mu}{v_{max}L}\right)(V_{gs} - V_t)\right]} \tag{56.2}$$

with transistor lengths smaller than approximately

L
ωB , the positive pulses appear at QB while QA = 0. If ωA = ωB, then the PFD generates pulses at either QA or QB with a width equal to the phase difference between the two inputs. The outputs QA and QB are usually called the “up” and “down” signals, respectively. If the input signal fails, which usually happens in NRZ data recovery applications during missing or extra transitions, the output of the PFD would stick in the high state (or low state). This condition may cause the VCO to speed up or slow down abruptly, which results in jitter or even loss of lock. This problem can be remedied by additional control logic that makes the PFD output toggle back and forth between the two logic levels with 50% duty cycle,18 which the loop interprets as zero phase error. The “rotational FD” described by Messerschmitt can also solve this issue.9 The output of a PFD can be converted to a dc control voltage by driving a three-state charge pump, as described in Section 57.2.
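The behavior described above can be reproduced with a small event-driven model of a three-state PFD. The Python sketch below is an idealized illustration that assumes rising-edge-triggered inputs and an instantaneous reset; the function name and the edge lists are hypothetical. The state +1 means QA (“up”) is high, –1 means QB (“down”) is high, and 0 means both are low.

```python
def pfd_pulse_widths(edges_a, edges_b):
    """Return the total time QA (up) and QB (down) spend high.

    edges_a and edges_b are lists of rising-edge times of the two inputs.
    """
    events = sorted([(t, "A") for t in edges_a] + [(t, "B") for t in edges_b])
    state = 0
    t_last = 0.0
    up_time = down_time = 0.0
    for t, source in events:
        # Accumulate the time spent in the current state since the last edge.
        if state == 1:
            up_time += t - t_last
        elif state == -1:
            down_time += t - t_last
        # An edge on A moves the state up, an edge on B moves it down (saturating).
        if source == "A":
            state = min(state + 1, 1)
        else:
            state = max(state - 1, -1)
        t_last = t
    return up_time, down_time

if __name__ == "__main__":
    a = [0.0, 1.0, 2.0, 3.0]
    b = [0.1, 1.1, 2.1, 3.1]
    print(pfd_pulse_widths(a, b))        # A leads: pulses appear on QA only
    print(pfd_pulse_widths(b, a))        # A lags: pulses appear on QB only
    print(pfd_pulse_widths(a, [0.1]))    # edges missing on B: QA stays high
```

The last call shows the failure mode mentioned in the text: when edges are missing on one input, the output stays in one state for whole periods instead of producing narrow pulses.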


FIGURE 57.19 Waveforms of the signals for the JK-flipflop phase detector: (a) waveforms at zero phase error, (b) waveforms at positive phase error

57.4 PLL Applications Clock and Data Recovery In data transmission systems such as optical communications, telecommunications, disk drive systems, and local networks, data is transmitted on baseband or passband. In most of these applications, only data signals are transmitted by the transmitter; clock signals are not transmitted in order to save hardware cost. Therefore, the receiver should have some scheme to extract the clock information from the received data stream and regenerate the transmitted data using the recovered clock. This scheme is called timing recovery or clock recovery. To recover the data correctly, the receiver must generate a synchronous clock from the input data stream, and the recovered clock must synchronize with the bit rate (the baud rate of the data). The PLL can be used to recover the clock from the data stream, but there are some special design considerations. For example, because of the random nature of data, the choice of phase-frequency detectors is restricted. In particular, a three-state PD is not suitable; because of missing data transitions, the PD will interpret the VCO frequency to be higher than the data frequency, and the PD output will stay in the “down” state and make the PLL lose lock, as shown in Fig. 57.21. Thus, the choice of phase-frequency detector for random binary data requires a careful examination of its response when some transitions are absent. One useful method is the rotational frequency detector described in Ref. 9. The random data also causes the PLL to introduce undesired phase variation in the recovered clock; this is called timing jitter and is an important issue in clock recovery.

FIGURE 57.20 (a) PFD diagram and (b) input and output waveforms of PFD.

FIGURE 57.21 Response of a three-state PD to random data.

Data Format

Binary data is usually transmitted in NRZ (non-return-to-zero) format, as shown in Fig. 57.22(a), for reasons of bandwidth efficiency. In NRZ format, each bit has a duration of TB (the bit period). The signal does not go to zero between adjacent pulses representing 1’s. It can be shown23 that the corresponding spectrum has no line component at ƒB = 1/TB; most of the spectrum of this signal lies below ƒB/2. The term “non-return-to-zero” distinguishes this format from another data type called “return-to-zero” (RZ), shown in Fig. 57.22(b), in which the signal goes to zero between consecutive bits. Therefore,

FIGURE 57.22 (a) NRZ data and (b) RZ data.

the spectrum of RZ data has a frequency component at ƒB . For a given bit rate, RZ data needs a wider transmission bandwidth; therefore, NRZ data is preferable when channel or circuit bandwidth is a concern. Because the NRZ format lacks a spectral component at the bit rate, a clock recovery circuit may lock to spurious signals or fail to lock at all. Thus, nonlinear processing of the NRZ data is essential to create a frequency component at the baud rate.

Data Conversion

One way to recover a clock from NRZ data is to convert it to RZ-like data that has a frequency component at the bit rate, and then recover the clock from that data using a PLL. Transition detection is one of the methods of converting NRZ data to RZ-like data. As illustrated in Fig. 57.23(a), edge detection requires a mechanism to sense both positive and negative data transitions. In Fig. 57.23(b), the NRZ data is delayed and compared with itself by an exclusive-OR gate, so the transition edges are detected. In Fig. 57.24, the NRZ data Vi is first differentiated to generate pulses corresponding to each transition. These pulses are made all positive by squaring the differentiated signal dVi/dt. The result is the signal Vi′, which looks just like RZ data, where pulses are spaced at an interval of TB .
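The delay-and-XOR edge detector of Fig. 57.23(b) can be illustrated with a small discrete-time sketch. This is only a behavioral model under stated assumptions: the waveform is oversampled at a hypothetical `samples_per_bit`, the delay is half a bit period, and the function names are illustrative rather than taken from the handbook.

```python
def nrz_waveform(bits, samples_per_bit):
    """Expand a bit sequence into an oversampled NRZ waveform (list of 0/1)."""
    return [b for b in bits for _ in range(samples_per_bit)]


def xor_edge_detect(nrz, delay):
    """XOR the waveform with a copy of itself delayed by `delay` samples.

    The delayed copy is padded with the first sample, so no pulse is
    produced before the first real transition.
    """
    delayed = [nrz[0]] * delay + nrz[:len(nrz) - delay]
    return [a ^ b for a, b in zip(nrz, delayed)]


# Hypothetical data pattern; repeated bits produce no transition.
bits = [1, 0, 0, 1, 1, 1, 0]
samples_per_bit = 8
pulses = xor_edge_detect(nrz_waveform(bits, samples_per_bit),
                         delay=samples_per_bit // 2)

# Sample indices where a detector pulse begins (one per data transition).
pulse_starts = [i for i in range(1, len(pulses))
                if pulses[i] and not pulses[i - 1]]
print(pulse_starts)
```

Every pulse starts on a bit boundary and has the width of the delay, whether the data edge was rising or falling. Runs of identical bits leave gaps in the pulse train, which is why the output is RZ-like rather than a clean clock.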

FIGURE 57.23 Edge detection of NRZ data.

FIGURE 57.24 Converting NRZ to RZ-like signal.

Clock Recovery Architecture

Based on different PLL topologies, there are several clock recovery approaches. Here, the early-late and the edge-detector-based methods are described. Figure 57.25 shows the block diagram of the early-late method, and Fig. 57.26 shows the waveforms for the case in which the input lags the VCO output. In Fig. 57.26, the early integrator integrates the input signal for the early half-period of the clock signal and holds it for the remainder of the clock period. The late integrator, on the other hand, integrates the input signal for the late half-period of the clock signal and holds it for the next early half-period. The average difference between the absolute values of the late hold and early hold voltages, obtained from a low-pass filter, gives the control signal that adjusts the frequency of the VCO. As mentioned above, this method is popular for rectangular pulses. However, it has some drawbacks. Because the method relies on the shape of the pulses, a static phase error can be introduced if the pulse shape is not symmetric. In high-speed applications, this approach requires a fast-settling integrator, which limits the operating speed of the clock recovery circuit, and the acquisition time cannot be easily controlled.

FIGURE 57.25 Early-late block diagram.

FIGURE 57.26 Clock waveforms for early-late architecture.

The most widely used technique for clock recovery in high-performance, wide-band data transmission applications is the edge-detection-based method. Edge detection is used to convert the data format so that the PLL can lock to the correct baud frequency; more details were given in the previous subsection. There are many variations of this method, depending on the exact implementation of each PLL loop component. The “quadricorrelator” introduced by Richman7 and modified by Bellisio24 is a frequency-difference discriminator and has been implemented in a clock recovery architecture. Figure 57.27 is a phase-recovery locked loop using the edge-detection method and a quadricorrelator to recover timing information from NRZ data.25 As shown in Fig. 57.27, the quadricorrelator follows the edge detector with a combination of three loops sharing the same VCO. Loops I and II form a frequency-locked loop that contains the quadricorrelator for frequency detection. Loop III is a typical phase-locked loop for phase alignment. Because the phase- and frequency-locked loops share the same VCO, the interaction between the two loops is a very important issue. As described in Ref. 25, when ω1 ≈ ω2 , the dc feedback signal produced by loops I and II approaches zero, and loop III dominates the loop performance. A composite frequency- and phase-locked loop is a good method to achieve fast acquisition and a narrow PLL loop bandwidth

FIGURE 57.27 Quadricorrelator.

to minimize the VCO drift. Nevertheless, because the wide band frequency-locked loop can respond to noise and spurious components, it is essential to disable the frequency-locked loop when the frequency error gets into the lock-in range of the PLL to minimize the interaction. More clock recovery architectures are described in Refs. 18, 20, 22, 26–28.

Frequency Synthesizer

A frequency synthesizer generates any of a number of frequencies by locking a VCO to an accurate frequency source such as a crystal oscillator. For example, RF systems usually require a high-frequency local oscillator whose frequency can be changed in small and precise steps. The ability to multiply a reference frequency makes PLLs attractive for synthesizing frequencies. The basic configuration used for frequency synthesis is shown in Fig. 57.28(a). This system is capable of generating an integer multiple of a reference frequency. A quartz crystal is usually used as the reference clock source because of its low jitter. Due to the limited speed of CMOS devices, it is difficult to generate frequencies directly in the GHz range or above. To generate higher frequencies, prescalers are used; they are implemented in other IC technologies such as ECL. Figure 57.28(b) shows a synthesizer structure using a prescaler V; the output frequency becomes

$$f_{out} = \frac{N V f_i}{M} \tag{57.46}$$

Because the scaling factor V is much greater than one, it is no longer possible to generate every desired integer multiple of the reference frequency. This drawback can be circumvented by using a so-called dual-modulus prescaler, as shown in Fig. 57.29. A dual-modulus prescaler is a divider whose division ratio can be switched from one value to another by a control signal. The following shows that the dual-modulus prescaler makes it possible to generate a number of output frequencies that are spaced by only one reference frequency. The VCO output is divided by a V/(V + 1) dual-modulus prescaler. The

FIGURE 57.28 Frequency-synthesizer block diagrams: (a) basic frequency-synthesizer system; (b) system extends the upper frequency range by using an additional high-speed prescaler.

FIGURE 57.29 The block diagram of dual-modulus frequency synthesizer.

output of the prescaler is fed into a “program counter” 1/N and a “swallow counter” 1/A. The dual-modulus prescaler is initially set to divide by V + 1. After A pulses out of the prescaler, the swallow counter is full and changes the prescaler modulus to V. After an additional N – A pulses out of the prescaler, the program counter changes the prescaler modulus back to V + 1, restarts the swallow counter, and the cycle is repeated. In this way, the VCO frequency is equal to (V + 1)A + V(N – A) = VN + A times the reference frequency. Note that N must be larger than A. If this is not the case, the program counter would be full earlier than the swallow counter 1/A, and both counters would be reset; the dual-modulus prescaler would then never be switched from V + 1 to V. For example, if V = 64, then A must be in the range of 0 to 63, such that Nmin = 64. The smallest realizable division ratio is

$$\left(N_{tot}\right)_{min} = N_{min} V = 4096 \tag{57.47}$$
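The counting cycle described above can be checked with a short behavioral sketch. It is not a model of any particular circuit: the function names are hypothetical, and the split of a total ratio into N and A via integer division is simply one convenient way to satisfy Ntot = VN + A with A < V.

```python
def divide_ratio_by_counting(v, n, a):
    """Count the VCO cycles in one full cycle of the program counter.

    The prescaler divides by v + 1 for the first `a` output pulses
    (until the swallow counter is full) and by v for the remaining
    n - a pulses.
    """
    if not 0 <= a < n:
        raise ValueError("program counter N must be larger than A")
    vco_cycles = 0
    for pulse in range(n):
        vco_cycles += (v + 1) if pulse < a else v
    return vco_cycles


def counters_for_ratio(n_tot, v):
    """Split a total division ratio into (N, A) with n_tot = v*N + A, A < v."""
    n, a = divmod(n_tot, v)
    if n <= a:
        raise ValueError(f"ratio {n_tot} is not realizable with V = {v}")
    return n, a


V = 64
n, a = counters_for_ratio(4100, V)
print(n, a, divide_ratio_by_counting(V, n, a))  # 64 4 4100
```

With V = 64, the ratio 4095 would need N = A = 63 and is rejected, while every integer ratio from 4096 upward passes the check, in line with Eq. 57.47.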

The synthesizer of Fig. 57.29 is able to generate all integer multiples of the reference frequency, starting from Ntot = 4096. For extending the upper frequency range of frequency synthesizers, but still allowing the synthesis of lower frequency, the four-modulus prescaler is a solution.1 Based on the above discussions, the synthesized frequency is an integer multiple of a reference frequency. In RF applications, the reference frequency is usually larger than the channel spacing for loop dynamic performance considerations, in which the wider loop bandwidth for a given channel spacing allows faster settling time and reduces the phase jitter requirements to be imposed on the VCO. Therefore, a “fractional” scaling factor is needed. Fractional division ratios of any complexity can be realized. For example, a ratio of 3.7 is obtained if a counter is forced to divide by 4 in seven cycles of each group of ten cycles and by 3 in the remaining three cycles. On the average, this counter effectively divides the input frequency by 3.7.
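The 3.7 example can be sketched in a few lines. The text only requires that the counter divide by 4 in seven of every ten cycles; spreading those cycles evenly with an accumulator, as done here, is an assumption made for illustration, and the function name is hypothetical.

```python
def fractional_divider_pattern(low, high_count, group):
    """Return the divider moduli for one group of `group` cycles.

    The counter divides by low + 1 in `high_count` cycles and by `low`
    in the others. An accumulator spreads the low + 1 cycles evenly
    (an illustrative choice, not taken from the text).
    """
    pattern = []
    acc = 0
    for _ in range(group):
        acc += high_count
        if acc >= group:
            acc -= group
            pattern.append(low + 1)
        else:
            pattern.append(low)
    return pattern


pattern = fractional_divider_pattern(low=3, high_count=7, group=10)
average = sum(pattern) / len(pattern)
print(pattern, average)
```

Over one group of ten cycles the counter passes 37 input periods, so the average division ratio is 3.7 regardless of how the divide-by-4 cycles are ordered within the group.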

References

1. R. E. Best, Phase-Locked Loops Theory, Design, Applications, McGraw-Hill, New York, 1984.
2. D. G. Troha and J. D. Gallia, Digital phase-locked loop design using S-N54/74LS297, Application Note AN 3216, Texas Instruments Inc., Dallas, TX.
3. W. B. Rosink, All-digital phase-locked loops using the 74HC/HCT297, Philips Components, 1989.
4. F. M. Gardner, Phaselock Techniques, 2nd ed.
5. S. G. Tzafestas, Walsh Functions in Signal and Systems Analysis and Design, Van Nostrand, 1985.
6. F. M. Gardner, Acquisition of phaselock, Conference Record of the International Conference on Communications, vol. I, pp. 10-1 to 10-5, June 1976.
7. D. Richman, Color carrier reference phase synchronization accuracy in NTSC color television, Proc. IRE, vol. 42, pp. 106-133, Jan. 1954.
8. F. M. Gardner, Properties of frequency difference detector, IEEE Trans. on Communication, vol. COM-33, no. 2, pp. 131-138, Feb. 1985.
9. D. G. Messerschmitt, Frequency detectors for PLL acquisition in timing and carrier recovery, IEEE Trans. on Communication, vol. COM-27, no. 9, pp. 1288-1295, Sept. 1979.
10. R. B. Lee, Timing recovery architecture for high speed data communication system, Masters thesis, 1993.
11. M. Bazes, A novel precision MOS synchronous delay lines, IEEE J. Solid-State Circuits, vol. 20, pp. 1265-1271, Dec. 1985.
12. M. G. Johnson and E. L. Hudson, A variable delay line PLL for CPU-coprocessor synchronization, IEEE J. Solid-State Circuits, vol. 23, pp. 1218-1223, Oct. 1988.
13. B. Kim, T. C. Weigandt, and P. R. Gray, PLL/DLL systems noise analysis for low jitter clock synthesizer design, ISCAS Proceedings, pp. 31-35, 1994.
14. M. V. Paemel, Analysis of a charge-pump PLL: a new model, IEEE Trans. on Comm., vol. 42, no. 7, pp. 131-138, Feb. 1994.
15. F. M. Gardner, Charge-pump phase-locked loops, IEEE Trans. on Comm., vol. COM-28, pp. 1849-1858, Nov. 1980.
16. F. M. Gardner, Phase accuracy of charge pump PLL’s, IEEE Trans. on Comm., vol. COM-30, pp. 2362-2363, Oct. 1982.
17. T. C. Weigandt, B. Kim, and P. R. Gray, Analysis of timing recovery jitter in CMOS ring oscillator, ISCAS Proceedings, pp. 27-30, 1994.
18. T. H. Lee and J. F. Bulzacchelli, A 155-MHz clock recovery delay- and phase-locked loop, IEEE J. of Solid-State Circuits, vol. 27, no. 12, pp. 1736-1746, Dec. 1992.
19. M. P. Flynn and S. U. Lidholm, A 1.2 µm CMOS current-controlled oscillator, IEEE J. Solid-State Circuits, vol. 27, no. 7, pp. 982-987, July 1992.
20. S. K. Enam and A. A. Abidi, NMOS IC’s for clock and data regeneration in gigabit-per-second optical-fiber receivers, IEEE J. Solid-State Circuits, vol. 27, no. 12, pp. 1763-1774, Dec. 1992.
21. M. Horowitz et al., PLL design for a 500MB/s interface, ISSCC Digest Technical Paper, pp. 160-161, Feb. 1993.
22. A. Pottbacker and U. Langmann, An 8GHz silicon bipolar clock-recovery and data-regenerator IC, IEEE J. Solid-State Circuits, vol. 29, no. 12, pp. 1572-1751, Dec. 1994.
23. B. P. Lathi, Modern Digital and Analog Communication System, HRW, Philadelphia, 1989.
24. J. S. Bellisio, A new phase-locked loop timing recovery method for digital regenerators, IEEE Int. Comm. Conf. Rec., vol. 1, pp. 10-17 to 10-20, June 1976.
25. B. Razavi, A 2.5-Gb/s 15-mW clock recovery circuit, IEEE J. Solid-State Circuits, vol. 31, no. 4, pp. 472-480, Apr. 1996.
26. R. J. Baumert, P. C. Metz, M. E. Pedersen, R. L. Pritchett, and J. A. Young, A monolithic 50-200MHz CMOS clock recovery and retiming circuit, IEEE Custom Integrated Circuits Conference, pp. 14.5.5-14.5.4, 1989.
27. B. Lai and R. C. Walker, A monolithic 622Mb/s clock extraction data retiming circuit, IEEE Inter. Solid-State Circuits Conference, pp. 144-145, 1991.
28. B. Kim, D. M. Helman, and P. R. Gray, A 30MHz hybrid analog/digital clock recovery circuit in 2-µm CMOS, IEEE J. Solid-State Circuits, vol. 25, no. 6, pp. 1385-1394, Dec. 1990.

Khoury, J.M. "Continuous-Time Filters" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

58 Continuous-Time Filters

John M. Khoury, Lucent Technologies

58.1 Introduction
58.2 State-Variable Synthesis Techniques
Biquadratic Filters • Leapfrog Filters
58.3 Realization of VLSI Integrators
Gm-C Integrators and Filters • Gm-OTA-C Filters • MOSFET-C Filters • Alternate Continuous-Time Filter Techniques
58.4 Filter Tuning Circuits
Direct Tuning • Master-Slave Tuning • Q Tuning Loops
58.5 Conclusion

58.1 Introduction

Modern Very Large Scale Integrated (VLSI) circuits realize complex mixed analog-digital systems on a monolithic semiconductor chip. These systems generally incorporate signal processing operations that can be performed either in the digital domain using digital signal processing (DSP) techniques or in the analog domain with analog signal processing (ASP) circuits. ASP techniques fall into two basic categories: continuous-time or sampled-data. The selection of DSP, continuous-time ASP, or sampled-data ASP approaches is highly dependent on the system requirements; however, continuous-time filters are generally preferable in applications that require low power, high-frequency operation, and moderate dynamic range. Fully integrated continuous-time filters have found wide application in many VLSI systems, including modems, telephone circuits, disk drive read channels, video processing circuits, and others. The applications usually fall into one of the three basic configurations shown in Fig. 58.1. In the top two views, the continuous-time filter provides anti-aliasing and smoothing functions for sampled-data signal processing operations that are performed with switched-capacitor (SC), switched-current (SI), or DSP filters. Generally, for these applications, the precise signal processing functions are kept in the sampled-data domain. The continuous-time filter can then have non-stringent frequency response specifications, provided that the ratio of half the sampling rate to the band edge is large for the sampled-data filter. Often, for lower power operation or extremely high frequency operation, the entire signal processing is performed with continuous-time ASP, as shown at the bottom of Fig. 58.1. When designing a system, the natural question arises as to which is the best approach for performing the core signal processing operations: DSP or ASP.
In general, DSP is usually the obvious choice if the application has the following attributes: (1) frequency response specifications that must be repeatable to within fractions of decibels (dB) over all manufacturing processes, (2) band edges are in the 100-kHz range and below, (3) dynamic range requirements exceed 80 dB, and (4) a high degree of programmability or coefficient adaptation is needed.

FIGURE 58.1 General uses of continuous-time filter.

ASP typically has a clear advantage when the critical frequencies in the filter exceed several hundred kHz. In today’s 0.35-µm CMOS technologies, SC filters are generally limited to sampling rates below 100 MHz; hence, filter passbands will typically be below 25 MHz. SI filters have similar limitations. Continuous-time filters are then the only viable alternative for passbands in the tens of MHz and higher. Continuous-time filters may be designed for the entire frequency range from audio to above 100 MHz. When used in audio and above-audio applications, continuous-time filters can achieve the lowest possible power of all the filtering techniques; dynamic ranges above 90 dB can be obtained, and linearity above 80 dB is possible with certain continuous-time design approaches.1-4 Higher frequency continuous-time filters can also be designed to achieve excellent linearity.5 In the 10- to 150-MHz range, continuous-time filters usually achieve linearity and dynamic range performance in the 60-dB range and cutoff frequency accuracy in the range of a few percent.6-11 Integrated continuous-time filters to date have not achieved the performance required in high-frequency, high-Q, and high-linearity wireless applications. The fundamental limitations of thermal noise present in all integrated continuous-time, SC, and SI filters severely limit the achievable dynamic range under high-Q conditions.12 Integrated continuous-time filters differ from discrete active filters and SC or SI filters in that a frequency tuning circuit is almost always required to obtain accurate frequency response characteristics.1,7,9,10,13-15 In VLSI chips, the capacitor values can vary by ± 15% for linear capacitors such as double-poly or metal-metal capacitors. Similarly, the resistance element, whether it is a diffused resistor, polysilicon resistor, or transistor, will vary widely with processing and temperature. The combined effect often results in RC products that vary by as much as ± 50%.
In continuous-time filter applications such as antialiasing or reconstruction functions, such variation is often acceptable. However, to achieve corner frequency accuracy of a few percent, a tuning circuit is required. In addition to frequency tuning/scaling the filter, circuits to tune the quality factor of the filter are sometimes employed.6,8,11,16-18 The following sections in this chapter cover the state-variable implementation of continuous-time filters, the design of VLSI integrators, and the design of highly linear continuous-time filters. The chapter concludes with the design of tuning circuits.

58.2 State-Variable Synthesis Techniques

In the 1960s, considerable research was performed in the area of active filter design. At that time, the focus was on discrete circuit implementations that operated with single-ended circuitry. Although many creative and theoretically appealing approaches were invented and used commercially for discrete designs,

only a few of the circuit topologies are well suited to VLSI implementations. An extensive discussion of active filter realizations can be found in Ref. 19. Of all the possible active filter topologies, the state-variable filter is the most general in form and the most widely used in VLSI continuous-time filters today. The key advantage of state-variable filters is that they require only two basic building blocks: (1) integrators and (2) weighted summers. In VLSI solutions, the integrators are realized with on-chip capacitors, an active element such as an operational amplifier (op-amp), and a resistive element or transconductance amplifier. Signal summation is performed in the voltage or current domain, depending on the technique used. The topology of state-variable filters can take on many varied forms. In the most general case, a linear system with N state variables would consist of N integrators with signal coupling between any and all integrators. In practice, coupling between integrators is limited to make the design realizable. In the biquadratic (biquad) filter structure, the N-th order filter is realized as a cascade of second-order circuits, followed by a first-order circuit if N is odd. The biquad approach is widely used for its simplicity, ease of design, and ease of debugging.19 An alternate form of state-variable filter, called the “leapfrog” topology, is realized by simulating the equations that govern RLC ladder filters.19 In the leapfrog topology, only the state variables (i.e., integrators) that are adjacent to one another are coupled. Leapfrog filters are more difficult to design, but they generally offer improved passband magnitude response accuracy and better dynamic range performance than the cascade of biquads.

Biquadratic Filters

The biquad structure realizes the filtering function as a cascade of second-order filters. The structure decouples the poles of the system and can ease the overall design approach. The general equation governing the biquadratic filter is

$$\frac{V_{out}(s)}{V_{in}(s)} = K\,\frac{s^2 + \dfrac{\omega_{oz}}{Q_z}\,s + \omega_{oz}^2}{s^2 + \dfrac{\omega_{op}}{Q_p}\,s + \omega_{op}^2} \tag{58.1}$$

where Vout(s) and Vin(s) are the output and input signals of the biquad, respectively, K is a gain constant, ωop and ωoz are the frequencies of the poles and zeroes, and Qp and Qz are the quality factors of the poles and zeroes, respectively. Although many methods of realizing this transfer function are possible, the state-variable approach uses a loop of two integrators connected with negative feedback to realize the poles. Damping around one (or both) integrator(s) makes the corresponding integrator lossy and implements the pole quality factor, Qp . The zeroes of the biquad can be achieved by (1) creating an output signal Vout(s) that is the weighted summation of the two integrator outputs, as well as the input signal, Vin(s), or (2) summing scaled values of the input signal into both integrators as well as directly to the output. The block diagram of the generalized biquad, shown in Fig. 58.2, places the zeroes and adjusts the overall gain of the filter with the K1, K2 , and K3 constants. The block diagram of Fig. 58.2 can be easily converted to an integrated VLSI filtering technique with a one-for-one substitution of the integrators and weighted summers with the corresponding VLSI circuits. Integrated implementations will be discussed in the sections on Gm-C, Gm-OTA-C, and MOSFET-C filters. If the high-order transfer function is of the form:

$$H(s) = \frac{N(s)}{D(s)} \tag{58.2}$$

FIGURE 58.2 Biquad block diagram.

then H(s) can be factored into second-order sections where the numerators and denominators of these biquads are at most second order, as in Eq. 58.1. Issues of how to arrange the cascade of second-order functions and how to pair the poles and zeroes can greatly affect the filter’s performance in terms of signal swing, dynamic range, and dc offset accumulation. A few simple rules of how to realize a cascaded filter are enumerated here.

1. Factor into Biquadratic Terms: Split the numerator, N(s), and the denominator, D(s), into products of second-order functions. If either N(s) or D(s) is odd-order, a first-order term will be necessary. The transfer function is then in the following form:

$$H(s) = \frac{N_1(s)\,N_2(s)\,N_3(s)\cdots}{D_1(s)\,D_2(s)\,D_3(s)\cdots} \tag{58.3}$$

2. Pair Poles and Zeroes: Convert Eq. 58.3 into a product of second-order transfer functions, HA(s) HB(s) HC(s) …, by pairing each Ni(s) with a Dj(s) in such a way that HA(jω), HB(jω), HC(jω), etc. have as flat a magnitude response over the passband as possible. In this way, the signal at the various points in the cascade of the filter will be large and hence less susceptible to interference. Interference could be due to the thermal noise of the active and passive circuits, power supply noise, and crosstalk from digital circuits on-chip. To make HA(jω), HB(jω), HC(jω), etc. as flat as possible over the passband, pair the zeroes of each Ni(s) with the poles of the Dj(s) that are closest to them in frequency and Q. This method minimizes the variation caused by Ni(jω)/Dj(jω) because the effects of the pole and zero pairs tend to partially cancel.

3. Choose Cascade Order: The next decision is to order the biquads (and maybe a first-order term). Many practical factors influence the optimum ordering. A few examples are:
a. Order the cascade to equalize signal swing as much as possible throughout the filter to maximize dynamic range.
b. Choose the first biquad to be lowpass or bandpass to reject high-frequency noise, eliminating overload in the remaining stages.
c. If the offset at the filter output is critical, the last stage should be a highpass or bandpass to block the dc of previous stages.
d. Avoid high-Q biquads at the last stage because these biquads have higher fundamental noise and worse sensitivity to power supply noise than low-Q stages.12

FIGURE 58.3 Dynamic range scaling.

e. In general, do not place allpass stages at the end of the cascade because these have wideband noise. It is usually best to place allpass stages near the beginning of the filter.
f. If several highpass or bandpass stages are available, one can place them at the beginning, middle, and end of the filter. This will prevent input dc offset from overloading the filter, will prevent internal offsets of the filter itself from accumulating (and hence decreasing available signal swing), and will provide a filter output that has low dc offset.
g. The effect of thermal noise at the filter output varies with ordering; therefore, several decibels (dB) of SNR can often be gained with biquad reordering.

4. Dynamic Range Optimization: Dynamic range optimization is simply the scaling of gains within the filter to make sure that the overload levels of the integrators (or summers) are equalized so that all elements will saturate at the same signal level. If the frequency spectrum of the input signal is known, then dynamic range scaling of the filter should be performed with this signal. The maximum amplitude input signal should be provided and the gains scaled until all integrator and summer outputs are at their maximum level. Note that gain scaling should be performed so as not to modify any loop gains in the filter; otherwise, the transfer function would be altered. If the frequency spectrum of the input signal is unknown, the typical approach is to assume the input signal is a single sinusoid. The filter is then dynamic range scaled so that for the maximum amplitude input sinusoid, all integrator and summer outputs have the same maximum value for any possible sinusoidal frequency. Usually, the frequency of the input sinusoid is swept over the filter’s passband and the maximum levels are then gain scaled. Pictorially this can be seen in Fig. 58.3. Here, the filter consists of a cascade of three biquads: A, B, C. The frequency response to each biquad output is shown.
In case 1, the signal will clip at the output of biquads A and B first; whereas in case 2, the output, C, will saturate first. It is only in case 3 — where the maximum gains have been equalized — that clipping occurs in all three biquads at the same level. Dynamic range scaling must be performed not only at the output of each biquad, but also at the output of the internal integrator. As an example, consider the classical Tow-Thomas biquad in Fig. 58.4. The derivation of this biquad from the block diagram in Fig. 58.2 should be self-evident. The frequency response shows that the internal node Vx will clip at a lower input amplitude level than the output. The signal amplitude at node Vx must be reduced by a factor F. The reduction in gain can be achieved by lowering the impedance in the feedback loop of the first integrator by F. However, to maintain constant loop gain around the two integrator loop, the input resistor of the second integrator must become R/F. The result of this dynamic range scaling is shown in Fig. 58.5.
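The swept-sinusoid scaling of biquad outputs (case 3 of Fig. 58.3) can be sketched numerically. This is a simplified illustration under several assumptions: each stage is an all-pole lowpass biquad (Eq. 58.1 with no finite zeroes), the stage parameters are made up, only the biquad outputs are equalized (internal integrator nodes such as Vx are not modeled), and the function names are hypothetical.

```python
import math


def biquad_lowpass(s, k, w0, q):
    """All-pole lowpass biquad: K*w0^2 / (s^2 + (w0/Q)*s + w0^2)."""
    return k * w0 ** 2 / (s * s + (w0 / q) * s + w0 ** 2)


def stage_peaks(stages, freqs):
    """Peak magnitude at each biquad output for a unit sinusoid swept over freqs."""
    peaks = [0.0] * len(stages)
    for w in freqs:
        h = 1.0
        for i, (k, w0, q) in enumerate(stages):
            h *= biquad_lowpass(1j * w, k, w0, q)
            peaks[i] = max(peaks[i], abs(h))
    return peaks


def equalize_peaks(stages, freqs):
    """Rescale stage gains so every output peaks at the final-output level.

    Gain moved into one stage is removed from the next, so the overall
    transfer function of the cascade is unchanged.
    """
    peaks = stage_peaks(stages, freqs)
    target = peaks[-1]
    scaled, prev = [], 1.0
    for (k, w0, q), peak in zip(stages, peaks):
        cumulative = target / peak        # desired scaling up to this output
        scaled.append((k * cumulative / prev, w0, q))
        prev = cumulative
    return scaled


# Hypothetical two-biquad cascade, passband swept up to 1 rad/s.
stages = [(1.0, 1.0, 2.0), (1.0, 1.0, 0.6)]
freqs = [0.01 * i for i in range(1, 101)]
scaled_stages = equalize_peaks(stages, freqs)
print(stage_peaks(stages, freqs))
print(stage_peaks(scaled_stages, freqs))
```

Before scaling, the high-Q first stage peaks well above the final output and would clip first. After scaling, both outputs reach the same maximum, and the product of the stage gains is unchanged.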

Leapfrog Filters

The leapfrog filter topology uses active integrators and weighted summers to simulate all the equations governing RLC ladder filters.19 The question naturally arises as to why passive ladder filters should be chosen. First, a wealth of knowledge and design tables exist for these filters. Designers can easily use

FIGURE 58.4 Tow-Thomas biquad prior to dynamic range scaling.

FIGURE 58.5 Tow-Thomas biquad after dynamic range scaling.

tabulated data to design classical ladder filters that implement Butterworth, Chebyshev, Bessel, etc. responses. With a few simple steps, these ladders can be transformed into an active leapfrog topology with element values.19 The second and more important reason to simulate ladder filters is that, in the passband, the sensitivity of the filter’s magnitude response to element value variation is extremely low. This low sensitivity does not hold in the stopband, nor does it hold for the phase response of the filter. Since leapfrog filters simulate all the equations governing the ladder filter, these sensitivity advantages carry over to the active realization. Finally, filters that are relatively insensitive to component errors usually have lower thermal noise. In most applications, leapfrog filters will have superior performance relative to biquadratic filters in terms of noise and passband magnitude response accuracy. Snelgrove and Sedra20 analyzed biquad filters, leapfrog filters, and filters optimized for noise and magnitude response sensitivity. The leapfrog filters achieved performance close to the optimized design, but the biquad approach showed significantly degraded performance. The design of leapfrog filters can be found in Ref. 19. Here, we will show by example the design of these filters. Consider the third-order lowpass doubly terminated ladder shown in Fig. 58.6. Since the

FIGURE 58.6 Third-order doubly terminated lowpass LC ladder filter.

active filter must simulate all equations governing the ladder, the first step is to write all the equations. The two-terminal branch relationships are:

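$$I_A = \frac{V_A}{R_A},\quad V_1 = \frac{I_1}{sC_1},\quad I_2 = \frac{V_2}{sL_2},\quad V_3 = \frac{I_3}{sC_3},\quad I_O = \frac{V_O}{R_L} \tag{58.4}$$

The KVL equations are:

$$V_A = V_{in} - V_1,\quad V_2 = V_1 - V_3,\quad V_O = V_3 \tag{58.5}$$

The KCL equations are:

$$I_1 = I_A - I_2,\quad I_3 = I_2 - I_O \tag{58.6}$$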

As can be seen in Eqs. 58.4 to 58.6, some variables are currents while others are voltages. In the implementations, usually all signal variables are either voltages or currents. Here, voltage signals will be assumed. To convert the current signals in Eqs. 58.4 to 58.6 to voltages, all currents can be scaled by a resistance r of arbitrary value (e.g., 1 Ω). After the scaling by r, Eqs. 58.4 to 58.6 become:

$$rI_A = \frac{rV_A}{R_A},\quad V_1 = \frac{rI_1}{srC_1},\quad rI_2 = \frac{V_2}{sL_2/r},\quad V_3 = \frac{rI_3}{srC_3},\quad rI_O = \frac{rV_O}{R_L},$$
$$V_A = V_{in} - V_1,\quad V_2 = V_1 - V_3,\quad V_O = V_3,\quad rI_1 = rI_A - rI_2,\quad rI_3 = rI_2 - rI_O \tag{58.7}$$

Using Eq. 58.7, the signal flow graph (SFG) shown in Fig. 58.7 can be obtained. Arrows flowing into a circle represent summation, and the values next to the arrows indicate a scaling operation. In the SFG, the KVL equations are implemented with the top two summation circles, while the KCL equations are

FIGURE 58.7 Signal flow graph representation of LC ladder filter.

FIGURE 58.8 Leapfrog filter with gain scaling extracted.

on the bottom side. If one were to implement this SFG directly, the gain of the filter would be less than 0 dB (e.g., –6 dB for an equally terminated ladder). In fact, the gain would be the same as the original RLC ladder. By replicating the gain block, r/RA , as shown in Fig. 58.8 an additional degree of freedom is obtained to implement arbitrary filter gains. Dotted lines are used to indicate that the integrators on the end of the filter are damped, while the inner integrator is lossless. In the realization of high-order ladder filters, the internal integrators will always be lossless, while the outside ones will be lossy due to the ladder terminations. Highpass and bandpass leapfrog filters can be realized directly from the lowpass LC ladder with the use of the classical lowpass-to-highpass or lowpass-to-bandpass transformations.19 For illustrative purposes, the bandpass case is considered here. Starting from a lowpass prototype with frequency domain variable s, a bandpass filter with bandwidth BW and center frequency ωo can be realized in the frequency domain variable p with the following transformation:

$$s = \frac{p^2 + \omega_o^2}{p\,BW} \tag{58.8}$$
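As a quick numerical check of Eq. 58.8, the sketch below evaluates the transformation on the jω axis. The center frequency and bandwidth are hypothetical values chosen only for illustration, and the function name is not from the text. The center frequency should map to dc in the lowpass prototype, and the two band edges to s = ±j.

```python
import math

def lowpass_to_bandpass(p: complex, w0: float, bw: float) -> complex:
    """Eq. 58.8: lowpass variable s that corresponds to bandpass variable p."""
    return (p * p + w0 * w0) / (p * bw)

# Hypothetical example values in rad/s (assumptions, not from the text).
W0 = 2 * math.pi * 10e6   # center frequency
BW = 2 * math.pi * 1e6    # bandwidth

# Band edges satisfy w_hi - w_lo = BW and w_hi * w_lo = W0**2.
W_HI = BW / 2 + math.sqrt((BW / 2) ** 2 + W0 ** 2)
W_LO = W_HI - BW

S_CENTER = lowpass_to_bandpass(1j * W0, W0, BW)    # maps to dc
S_HI = lowpass_to_bandpass(1j * W_HI, W0, BW)      # maps to +j
S_LO = lowpass_to_bandpass(1j * W_LO, W0, BW)      # maps to -j

print(S_CENTER, S_HI, S_LO)
```

The same mapping applied to each integrator 1/s is what turns the lowpass integrators into bandpass biquads.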

Applying the transformation element by element to a third-order lowpass ladder, one obtains the bandpass ladder shown in Fig. 58.9. An SFG can be generated directly from the bandpass ladder and the active filter realized. Alternatively, the lowpass-to-bandpass transformation can be applied directly to the lowpass active filter of Fig. 58.8, resulting in the bandpass active filter in Fig. 58.10. Notice that each integrator has been replaced with a bandpass biquad and that the biquads corresponding to the terminations are damped.

FIGURE 58.9

Bandpass ladder filter.


FIGURE 58.10

Bandpass leapfrog filter realization.

Designers proficient in the use of SFGs can readily transform the active leapfrog realization to include zeroes in the transfer function.

58.3 Realization of VLSI Integrators

Once the state-variable topology has been created, the VLSI filter realization is determined by the approach used for the integrator. This section describes the most common types of VLSI integrators and their corresponding summing circuits. The three most common types of implementations are the Gm-C, Gm-OTA-C, and MOSFET-C filters. Gm-C filters are generally recognized to offer the highest possible frequency operation at the lowest power; however, the structures are sensitive to parasitic capacitances and generally have higher noise and offset than other techniques. Gm-OTA-C filters are far less sensitive to parasitics than Gm-C designs, but at the cost of higher power. Finally, MOSFET-C filters generally are the least sensitive to parasitics and have the least noise and offset; however, the frequency of operation is usually the lowest of the three approaches. In BiCMOS technology, where extremely wideband op-amps can be made, MOSFET-C techniques possess bandwidth capabilities approaching those of Gm-C and Gm-OTA-C filters.

Gm-C Integrators and Filters

Gm-C filters implement integrators with a transconductance amplifier loaded by a capacitor. As shown in Fig. 58.11, a differential transconductance amplifier (also called a transconductor) takes an input voltage, Vind, and produces at its output a current Iout = Gm Vind. This output current is integrated by the capacitor to produce the output voltage signal, Vout.

FIGURE 58.11 Gm-C integrator.

The transfer function of the Gm-C integrator is

$$H(s) = \frac{G_m}{sC} = \frac{\omega_o}{s} \tag{58.9}$$

where ωo is the unity-gain frequency of the integrator. The ideal integrator has infinite dc gain, a unity-gain frequency of ωo, and a phase shift of –90° for all frequencies. Capacitors in VLSI technology are usually high quality, so all stringent integrator requirements fall on the transconductor design. Since the transconductor is a voltage-to-current (V → I) converter, it should have: (1) high input impedance to accurately sense the input voltage signal, (2) high output impedance so the output signal appears as a current source, (3) high dc gain, (4) wide bandwidth so as not to create phase and magnitude errors in the integrator response, (5) large signal handling capability at the input and output for good dynamic range, and (6) a well-defined and tunable V → I mechanism to be used for frequency scaling the filter to remove process and temperature variations.

In CMOS or BiCMOS technology, achieving high input impedance is simple due to the gate terminal of the MOSFET. High output impedance can be achieved with cascoding and with the use of regulated cascodes21; however, there is a tradeoff between high bandwidth and high output impedance. The most difficult aspect of Gm cell design is making the V → I mechanism tunable while simultaneously achieving good linearity in the presence of large input signal swings.

Building state-variable filters with Gm-C integrators follows directly from the block diagrams or SFGs. Signal summation is performed in the current domain by placing transconductor outputs in parallel. Consider the SFG in Fig. 58.12. The bandpass Gm-C filter is readily implemented as in Fig. 58.13. The loop of two

FIGURE 58.12

Signal flow graph representation of a state-variable biquad.

FIGURE 58.13

Gm-C realization of a bandpass biquad.


FIGURE 58.14

An MOS differential pair used as a V → I converter.

integrators is on the top of the figure, biquad damping is performed with the Gm /Q transconductor, and input signal scaling and summation are achieved with the remaining Gm cell. The Gm /Q transconductor connected in the negative feedback configuration on itself implements a resistor of value Q/Gm. The key aspect in the design of Gm-C filters is the transconductor design and more specifically, the V → I converter. In the simplest case, the V → I converter can be a simple MOS differential pair, as shown in Fig. 58.14. The large signal differential output current, Ioutd , is given by22 :

$$I_{outd} = 2I_{out} = \mu_n C_{ox}\frac{W}{2L}\,V_{ind}\sqrt{\frac{4I}{\mu_n C_{ox} W/(2L)} - V_{ind}^2} \tag{58.10}$$

Ioutd is a non-linear function of the input Vind . The transconductance of the differential pair, Gm = dIoutd /dVind , is maximum at Vind = 0 and falls off for increased signal swing. The maximum Gm is:

$$G_m = \sqrt{2I\mu_n C_{ox} W/L} = g_{m1} = g_{m2} \tag{58.11}$$
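Eqs. 58.10 and 58.11 can be explored numerically. The sketch below is illustrative only: the process constant, W/L, and bias current are hypothetical values, and the clamp to ±2I outside the range where both devices conduct is an added assumption that makes the function usable for any input. It checks that the slope of Eq. 58.10 at Vind = 0 equals Eq. 58.11, that Gm falls off with signal swing, and that a 4:1 current change gives only a 2:1 change in Gm.

```python
import math

def diff_pair_current(v_ind, i_bias, k):
    """Eq. 58.10 with K = mu_n*Cox*W/(2L); tail current is 2*i_bias.

    Outside |v_ind| < sqrt(2*I/K) one device is off, so the output is
    clamped to the full tail current (assumption added for this sketch).
    """
    v_limit = math.sqrt(2.0 * i_bias / k)
    if abs(v_ind) >= v_limit:
        return math.copysign(2.0 * i_bias, v_ind)
    return k * v_ind * math.sqrt(4.0 * i_bias / k - v_ind ** 2)

def gm_peak(i_bias, k):
    """Eq. 58.11 rewritten with K: Gm = sqrt(4*I*K) = sqrt(2*I*mu_n*Cox*W/L)."""
    return math.sqrt(4.0 * i_bias * k)

def gm_numeric(v_ind, i_bias, k, dv=1e-6):
    """Small-signal Gm = dI_outd/dV_ind by central difference."""
    hi = diff_pair_current(v_ind + dv, i_bias, k)
    lo = diff_pair_current(v_ind - dv, i_bias, k)
    return (hi - lo) / (2.0 * dv)

# Hypothetical device values (assumptions): mu_n*Cox = 100 uA/V^2, W/L = 10.
K = 0.5 * 100e-6 * 10.0
I_BIAS = 100e-6

print("Gm at 0 V   :", gm_numeric(0.0, I_BIAS, K))
print("Gm at 0.2 V :", gm_numeric(0.2, I_BIAS, K))
print("Gm ratio for 4x current:", gm_peak(4 * I_BIAS, K) / gm_peak(I_BIAS, K))
```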

The transconductance can be tuned with the tail current 2I to frequency scale the Gm-C filter; however, the tuning range is small since Gm only varies as the square root of the current. In general, if the targeted filter application requires no programmability and the critical frequencies are nominally fixed, then roughly a 2:1 tuning range is needed to accommodate process and temperature variations. The tail current would then require a 4:1 variation, greatly impacting power dissipation. The more significant disadvantage of this V → I converter is its small linear input range. It can be shown that the linear differential input voltage range is much smaller than ±√2 (Vgs1,2–bias – VT), where Vgs1,2–bias is the bias level of M1 and M2 for Vind = 0. To maximize the linear input range, Vgs1,2–bias must be kept large by using small W/L ratios. Even with the use of small W/L ratios, the input range is typically limited to less than ±200 mV for linearity of 40 to 60 dB.

Many linearization techniques have been invented using MOSFETs in the saturation and triode regions, as well as BiCMOS solutions. A few approaches are discussed here. In the basic MOS differential pair, the output current, Ioutd, increases with Vind; however, the rate of increase drops off at higher input amplitude levels. One solution is to have another source of current that is added to the transconductor’s output. If that current is zero for low levels of Vind, but increases with Vind, the net effect is to linearize the overall Gm cell. Rather than adding a current, we can instead subtract a current from the Gm stage’s output. The amount to be subtracted would be maximum for Vind = 0 and would drop to zero for large differential input signals. This technique uses an additional differential pair, as shown in Fig. 58.15.10 Transistors M3 and M4 operate at lower current than M1 and M2 and are biased such that (Vgs3,4–bias – VT) < (Vgs1,2–bias – VT). Detailed design equations for the sizing of M1-M4, I1, and I2 can be found in


FIGURE 58.15

A cross-coupled linearized MOS V → I converter.

FIGURE 58.16

Individual differential pair Gm (dotted curves) and combined Gm (solid curve).

Ref. 10. The current through M3 and M4 will saturate at lower input voltages than the drain currents of M1 and M2. The concept is more clearly understood with Fig. 58.16. The dotted lines show the transconductance of the individual differential pairs versus input differential signal. The dotted lines for positive Gm correspond to the transistors M1 and M2, while the negative Gm refers to M3 and M4. Adding the Gm curves, the solid curve is obtained. Notice now that the transconductance curve is flat for small Vind. It is possible to add the outputs of multiple differential pairs with slightly different non-linearities to broaden the region over which Gm is constant. A related approach is given in Ref. 23.

A classical linearization method is to use a differential pair with source degeneration. However, in most technologies, the resistor used for degeneration would vary with temperature and processing and be nominally fixed in value. To afford tunability, the degeneration resistor can be replaced with a MOSFET, M3, operating in the triode region, as shown in the Gm cell of Fig. 58.17.24 M1 and M2 act as source followers and the transconductor’s signal current, Iout, is ideally the drain current of M3. Expanding the drain current of M3 in a Taylor series for the case of zero drain-to-source voltage, one obtains the following from the “3/2 Power” model25 of the MOSFET:


FIGURE 58.17 Source degenerated MOS V → I converter.

$$I = \frac{W}{L}\mu C_{ox}\left[(V_C - V_Q - V_T)(V_D - V_S) + \sum_{i=2}^{\infty} a_i\left(V_D^i - V_S^i\right)\right] \tag{58.12}$$

where VQ is the bias level of the source and drain with respect to the body, VS and VD are the voltages on the source and drain terminals, respectively, and the ai are constants. Notice that if the source and drain voltages are balanced around a common-mode voltage VQ, such that VD = VQ + Vind/2 and VS = VQ – Vind/2, then as can be seen from Eq. 58.12,

$$I = \frac{W}{L}\mu C_{ox}\left[(V_C - V_Q - V_T)V_{ind} + \sum_{i>1,\,odd}^{\infty} a_i V_{ind}^i\right] \approx \frac{W}{L}\mu C_{ox}(V_C - V_Q - V_T)V_{ind} \tag{58.13}$$

For many applications, the remaining odd-order non-linearity is low enough (e.g., –65 dB) to be inconsequential. For applications requiring superior linearity, cross-coupled triode degenerated differential pairs can be used to theoretically cancel all high-order non-linearities.24,26 Based on Eq. 58.13 the transconductance of the Gm cell is:

$$G_m = \frac{W}{L}\mu C_{ox}(V_C - V_Q - V_T) \tag{58.14}$$

assuming M1 and M2 are ideal source followers. As desired, the Gm is tunable with the control voltage, VC , connected to the gate of M3. The linear input range can in practice be on the order of ±1.0 V, provided that M3 remains in the triode region. The maximum input signal is thus equal to

$$V_{ind\text{-}max} = 2(V_C - V_Q - V_T) \tag{58.15}$$
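Eqs. 58.14 and 58.15 share the factor (VC – VQ – VT), so tuning Gm also moves the input range. The following sketch makes that coupling concrete; the value of (W/L)µCox and all voltages are hypothetical numbers chosen for illustration.

```python
def triode_gm(beta, v_c, v_q, v_t):
    """Eq. 58.14, with beta = (W/L)*mu*Cox of the triode device M3."""
    return beta * (v_c - v_q - v_t)

def max_differential_input(v_c, v_q, v_t):
    """Eq. 58.15: largest V_ind for which M3 stays in the triode region."""
    return 2.0 * (v_c - v_q - v_t)

# Hypothetical values (assumptions, not from the text).
BETA = 200e-6   # A/V^2
V_Q = 1.5       # common-mode level, V
V_T = 0.8       # threshold voltage, V

for v_c in (2.8, 3.3):  # two settings of the control voltage
    gm = triode_gm(BETA, v_c, V_Q, V_T)
    v_max = max_differential_input(v_c, V_Q, V_T)
    print(f"VC = {v_c:.1f} V: Gm = {gm * 1e6:.0f} uS, Vind-max = {v_max:.1f} V")
```

With these numbers, tuning Gm down by 2:1 also halves the maximum input signal, because the ratio Vind-max/Gm = 2/beta is fixed by the device.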

The maximum input signal swing is a function of the tuning control voltage, VC ; consequently, the dynamic range is tightly coupled to the tuning range. In some situations where a programmable filter is required, multiple transistors can replace M3 and the triode devices can either be connected to VC, or


FIGURE 58.18

Gm cell operating in triode and saturation.

switched off to implement ranges in the filter.7 In contrast to the transconductors in Figs. 58.14 and 58.15, power dissipation is unaffected by tuning. Finally, source followers M1 and M2 must be extremely low impedance (i.e., have high gm) to drive the resistance of M3. Either large W/L devices can be used for M1 and M2, or negative feedback can be used around M1 and M2 to reduce their impedance at their source terminals.5,27 As an alternative to requiring low impedance source followers, the Gm cell shown in Fig. 58.18 can be used.9 For small input signals, M3 and M4 operate in the triode region as in the previous design; however, the effective control voltage to tune the devices is set to the gate-to-source bias level of M1 and M2. For large positive input differential signals, more current flows in M1, increasing the VGS of M1. If M3 and M4 had fixed gate voltages, the current through them would drop off, resulting in lower Gm as in the design of Fig. 58.17. However, the gate voltage of M3 increases under this input condition and helps to maintain a constant Gm. By scaling M1 and M2 to M3 and M4 (e.g., to a ratio of about 7:1) the linear range can be expanded. The linear range is also larger than that of Fig. 58.17 because M3 and M4 can operate in both the triode and saturation regions. All transconductor designs discussed to this point have achieved an expanded linear range as a function of device matching or use of balanced input differential signals. Since matching and signal balancing can never be perfect, the achievable linearity is a strong function of layout and processing. In contrast, the transconductor of Fig. 58.19 achieves high linearity by maintaining constant drain-to-source voltage across triode devices M1 and M2.8,28,29 Using the basic triode equation for a MOSFET,

$$I = \frac{1}{2}\mu_n C_{ox}\frac{W}{L}\left[2(V_{gs} - V_t)V_{DS} - V_{DS}^2\right] \tag{58.16}$$

one can see that if the drain-to-source voltage is held constant, the relationship between VGS and I is linear, except for an offset. Cascode devices Q1 and Q2 in Fig. 58.19 are used to hold VDS1 = VDS2 = IdRd, resulting in a linear transconductance of

$$g_m = \mu_n C_{ox}\frac{W}{L}R_d I_d \tag{58.17}$$

FIGURE 58.19

V → I converter with differential pair operating in triode with constant drain-to-source voltage.

The Gm cell is easily tuned with the collector current of Q3, Id. Q1-Q3 could be replaced with MOSFETs; however, since the transconductance of bipolar devices is higher than that of MOSFETs, the BiCMOS solution provides superior cascoding to hold the drain-to-source voltages of M1 and M2 constant.

Many alternative linearization schemes exist for the V → I converters in CMOS, BiCMOS, and bipolar technology.5,17,23,30-32 Nearly all the techniques require matching of transistors and/or balanced signals to achieve optimal linearity performance. Also, many of the techniques, particularly the MOSFET-based transconductors, rely on simplified large signal models (e.g., square law) of the transistors to model and cancel the non-linearity. In reality, more complex transistor equations, as found in Ref. 33, are needed to better predict performance. Ultimately, only experimental results over many process lots can be used to guarantee a specified linearity.

Once the V → I converter design has been determined, the entire Gm amplifier or integrator can be assembled using known op-amp structures. The design in Fig. 58.20 uses the V → I converter of Fig. 58.17 with a folded-cascode output stage. The cascoding raises the output impedance of the amplifier and

FIGURE 58.20

Folded cascode MOS Gm-C integrator.


FIGURE 58.21

Non-idealities modeled in the Gm-C integrator.

optionally A1 through A4 create regulated cascodes to raise the output impedance further.21 The nominally equal capacitors Clp and Cln serve two functions. For differential-mode output signals, the capacitors integrate the output current. For common-mode signals, they provide high frequency common-mode feedback (CMFB) via the gates of M12 and M13, while low frequency CMFB is performed with standard techniques.34 Folding the cascode structure permits larger output voltage swings and equal input and output common-mode voltage levels. Use of folding as opposed to an unfolded cascode (i.e., telescopic structure) will generally result in increased input-referred noise and offset due to the addition of transistors M6 and M7.

Gm-C Integrator Frequency Response Errors

The Gm-C integrator will have magnitude and phase errors due to parasitic capacitances and resistances. Consider the non-ideal integrator shown in Fig. 58.21. C is the integrating capacitance, cin is the parasitic capacitance due to wire routing and the input of the next stage, and cout and rout are the parasitic output capacitance and resistance of the transconductor, respectively. Rlead is a series resistance that is sometimes added to provide phase lead for correction of parasitic effects. Assuming that the Gm amplifier has a dc value of Gmo, a parasitic pole at ωp, and a parasitic zero at ωz, the transfer function of the non-ideal integrator can be derived as:

$$H_a(s) = \frac{G_{mo}\,(1 + s/\omega_z)\,r_{out}}{(1 + s/\omega_p)\left[1 + s\,r_{out}(C + c_{out} + c_{in})\right]} \tag{58.18}$$

The ideal integrator has infinite dc gain, a rolloff of 6 dB/octave, a unity-gain frequency of ωo = Gmo/C, and a phase of –90° for all frequencies. Parasitics generally have a much stronger effect on the integrator’s phase response than on the magnitude response. The integrator phase errors in turn are the largest source of filter magnitude response errors. In general, the integrator phase accuracy at ωo is most critical. The Gm-C integrator’s phase error as a function of frequency is:

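$$\phi_{I\text{-}error}(\omega) \approx \frac{\pi}{2} + \frac{\omega}{\omega_z} - \frac{\omega}{\omega_p} - \arctan\left[G_{mo}r_{out}\left(\frac{\omega}{\omega_{ox}}\right)\right] \tag{58.19}$$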

where ωox is the actual as opposed to the ideal unity-gain frequency. As an example, consider a Gm-C integrator with a unity-gain frequency of 20 MHz, C = 1 pF, Gmo = 125.7 µS, and rout = 1 MΩ. If the transconductor has a parasitic pole at 300 MHz, but no parasitic zero, the resulting phase error at 20 MHz is –3.4°. Depending on the application, such an error may be acceptable. Two methods exist for correcting the phase error. The simplest approach is to add a zero to the transfer function to create phase lead of +3.4° at 20 MHz, resulting in zero net phase error at ωo. The small value resistor, Rlead, in Fig. 58.21 is used for this purpose. Phase lead can also be created within the Gm amplifier with known feedforward techniques. If the accuracy of the phase is critical, tunable phase lead or lag can be performed.6,8,11,16
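The worked example above can be reproduced directly from Eq. 58.19. The short sketch below does so; the function name is illustrative, and it assumes that the actual unity-gain frequency ωox equals the nominal 20 MHz.

```python
import math

def integrator_phase_error(w, w_ox, gmo_rout, w_p=None, w_z=None):
    """Eq. 58.19, in radians. Pass None for a pole or zero that is absent."""
    err = math.pi / 2 - math.atan(gmo_rout * w / w_ox)  # finite dc gain term
    if w_z is not None:
        err += w / w_z                                   # lead from parasitic zero
    if w_p is not None:
        err -= w / w_p                                   # lag from parasitic pole
    return err

# Values from the example in the text.
W_O = 2 * math.pi * 20e6          # unity-gain frequency, rad/s
GMO_ROUT = 125.7e-6 * 1e6         # dc gain Gmo * rout
W_P = 2 * math.pi * 300e6         # parasitic pole, rad/s

ERR_DEG = math.degrees(integrator_phase_error(W_O, W_O, GMO_ROUT, w_p=W_P))
print(f"phase error at 20 MHz: {ERR_DEG:.1f} degrees")
```

The pole contributes about –3.8° of lag, while the finite dc gain contributes about +0.5° of lead, which together give the quoted –3.4°.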


FIGURE 58.22

Integrator phase error impact on biquad filter Q.

The natural question arises as to what integrator phase accuracy is required. Although the requirement is dependent on the filter topology and transfer function, a good estimate can be determined by considering the damped loop of two integrators shown in Fig. 58.22. This loop of integrators is found in cascade-of-biquad filters and leapfrog filters and is therefore quite general. The system poles are determined by the loop gain transfer function, T(s),

$$T(s) = \left(\frac{G_m}{sC}\right)\left(\frac{G_m}{sC + G_m/Q}\right) = \left(\frac{1}{s/\omega_o}\right)\left(\frac{1}{s/\omega_o + 1/Q}\right) \tag{58.20}$$

Under ideal conditions and assuming a high value of Q, the loop gain phase shift is

$$\phi(\omega) \approx \frac{\omega_o}{\omega Q} \tag{58.21}$$
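Eq. 58.21 can be checked against the exact phase of T(jω) from Eq. 58.20. The sketch below reads φ(ω) as the loop phase measured from –180°; that reading is an interpretation adopted here, since it is what makes the exact value equal arctan(ωo/(ωQ)). The function names are illustrative.

```python
import cmath
import math

def loop_phase_deg(w_over_wo, q):
    """Phase of T(jw) from Eq. 58.20, measured from -180 degrees."""
    s = 1j * w_over_wo                      # s normalized to w_o
    t = (1.0 / s) * (1.0 / (s + 1.0 / q))   # T(s) of Eq. 58.20
    return math.degrees(cmath.phase(t)) + 180.0

def loop_phase_approx_deg(w_over_wo, q):
    """High-Q approximation of Eq. 58.21, converted to degrees."""
    return math.degrees(1.0 / (w_over_wo * q))

for q in (2.0, 10.0, 50.0):
    exact = loop_phase_deg(1.0, q)
    approx = loop_phase_approx_deg(1.0, q)
    print(f"Q = {q:4.0f}: exact = {exact:6.2f} deg, Eq. 58.21 = {approx:6.2f} deg")
```

For Q = 2 the exact value at ωo is about 26.6°, while for Q = 50 only about 1.1° of phase remains, so an integrator phase error of a degree or so is comparable to the entire nominal loop phase.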

Notice that as Q increases, the net phase shift around the loop approaches zero; thus making any integrator phase errors a large source of damping errors. Consider two examples. First, the continuous-time filter used in hard disk drive read channels often has a Bessel response where all the poles are low Q. Assuming Q ≈ 2, the nominal phase shift around the integrator loop, φ(ωo) ≈ 26.6°, is quite large. If the total integrator phase error is to be kept 78) is needed. This results in large die area, and large power consumption. The capacitor spread could be reduced with a slight modification of the transfer function to the one given in the following:

$$H(z) = \frac{z^{-1} - 1}{1 - 1.9719\cdot z^{-1} + 0.98755\cdot z^{-2}} \tag{59.20}$$

With respect to the bilinear transformation of Eq. 59.9, in this case, the zero at dc is maintained, while the zero at Nyquist frequency (at Fs/2, i.e., at z = –1) is eliminated. The normalized capacitor values are indicated again in Table 59.1, in Column II. It can be seen that a large reduction of the capacitor spread is obtained (from 80 to 8, for the E-type). The obtained frequency response is reported in Fig. 59.16 with line II. In the passband no significant changes occur; on the other hand in the stopband, the maximum signal attenuation is about –35 dB. In some applications, this solution is acceptable, also in consideration of the considerable capacitor spread reduction. For this reason, if not strictly necessary, the zero at Fs/2 can be eliminated. However reducing the factor fo/Fs results in reducing the stopband attenuation. For instance, for fo = 200 kHz (i.e., fo/Fs = 0.2), the frequency response with and without the Nyquist zero are reported in Fig. 59.16 with line III and IV, respectively. In this case, the stopband attenuation is reduced to –22 dB and therefore the Nyquist zero could be strongly needed. The relative normalized capacitor values are indicated in Table 59.1 in Column III (with zeros at {z = 1, z = –1}), and in Column IV (with zeros at z = 1).
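The capacitor spread figures quoted above can be recomputed from the E-type entries of Table 59.1. The sketch below assumes that "spread" means the ratio of the largest to the smallest nonzero normalized capacitor in a design, which is consistent with the quoted reduction from 80 to 8; the data layout and function name are illustrative.

```python
# E-type normalized capacitor values A..J from Table 59.1, one list per design.
E_TYPE = {
    "I":   [10.131, 80.885, 1.2565, 10.131, 1.9998, 0, 1, 1, 1, 0],
    "II":  [1, 7.963, 1, 8.0838, 1.5915, 0, 1.5885, 1.5885, 0, 0],
    "III": [12.366, 12.094, 12.561, 12.366, 1.9991, 0, 1, 1, 1, 0],
    "IV":  [1.0690, 1.0000, 7.4235, 7.6405, 1.1815, 0, 1, 1, 0, 0],
}

def capacitor_spread(values):
    """Largest / smallest nonzero capacitor (assumed definition of spread)."""
    nonzero = [v for v in values if v > 0]
    return max(nonzero) / min(nonzero)

SPREADS = {name: capacitor_spread(vals) for name, vals in E_TYPE.items()}
for name, spread in SPREADS.items():
    print(f"design {name}: spread = {spread:.1f}")
```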

A Biquadratic Cell for High Sampling Frequency

In the previous biquadratic cell, the two op-amps operate in cascade during the same clock phase. This requires that the second op-amp in the cascade wait for the first op-amp to settle completely before it can complete its own settling. This, of course, reduces the maximum achievable sampling frequency or, alternatively, for a given sampling frequency, increases the required power consumption, since op-amps with larger bandwidth are needed. To avoid this problem, the biquad shown in Fig. 59.17 can be used. In this scheme, the two op-amps settle in different clock phases, and thus each has the full clock phase time slot to settle. The transfer function of the biquadratic cell is given in Eq. 59.21. As can be seen, a limitation occurs in the possible transfer function, since the term in z⁻² is not present.


FIGURE 59.17 High-frequency biquadratic cell.

TABLE 59.1 Capacitor Values for Different Designs

E-type    I         II        III       IV
A         10.131    1         12.366    1.0690
B         80.885    7.963     12.094    1.0000
C         1.2565    1         12.561    7.4235
D         10.131    8.0838    12.366    7.6405
E         1.9998    1.5915    1.9991    1.1815
F         0         0         0         0
G         1         1.5885    1         1
H         1         1.5885    1         1
I         1         0         1         0
J         0         0         0         0

F-type    I         II        III       IV
A         10.006    5.1148    11.305    5.9919
B         78.885    39.446    10.095    5.0497
C         1.2565    1         12.561    7.4235
D         10.006    8.1404    11.305    7.0793
E         0         0         0         0
F         1.9998    1         1.9991    1
G         1         1.5885    1         1
H         1         1.5885    1         1
I         1         0         1         0
J         0         0         0         0

()

H z =−

(

(

)

C 3 ⋅ C 5 + C1⋅ C 4 − C 3 ⋅ C 5 − C 2 ⋅ C 4 z −1

) (

)

C 3 ⋅ C 8 + C 6 + C 4 ⋅ C 9 − C 3 ⋅ C 8 − 2 ⋅ C 3 ⋅ C 6 ⋅ z −1 + C 3 ⋅ C 6 ⋅ z −2

(59.21)

High-Order Filters

The previous first- and second-order cells can be used to build up high-order filters. The main architectures are taken from the theory of active RC filters. Some of the most significant ones are: ladder6 (with good amplitude response robustness with respect to component spread), cascade of first- and second-order cells (with good phase response robustness with respect to component spread), and follow-the-leader feedback (for low-noise systems, like reconstruction filters in oversampled DACs).


FIGURE 59.18

Output waveform evolution.

59.6 Implementation Aspects

The arguments presented so far have to be implemented in actual integrated circuits. Such implementations have to minimize the effects of the non-idealities of the actual blocks, which are capacitors, switches, and op-amps. The capacitor behavior is quite stable, apart from capacitance non-linearities, which affect the circuit performance only as a second-order effect. On the other hand, switches and op-amps must be properly designed to operate in the SC system. The switches must guarantee a minimum conductance to ensure a complete charge transfer within the available timeslot. For the same reason, the op-amps must ensure large dc gain, large unity-gain bandwidth, and large slew rate. For instance, in Fig. 59.18, the ideal output waveform of an SC network is shown with a solid line, while the more realistic actual waveform is illustrated with the dotted line. The output sample is updated during phase φ1, while it is held (at the value achieved at the end of phase φ1) during phase φ2. In phase φ1, the output value moves from its initial to its final value. The speed of this transition is limited by switch conductance, op-amp slew rate, and op-amp bandwidth.

The transient response of the system can be studied using the linear model of Fig. 59.19, where the conductive switches are replaced by their on-resistance Ron and the impulsive charge injection is replaced by a voltage step. The assumption of a completely linear system allows one to study the system evolution exactly. In this case, the circuit time constants depend on the input branch (τin = 2·Ron·Cs), the op-amp frequency response, and the feedback factor.

FIGURE 59.19

Linear model for transient analysis.


FIGURE 59.20

Poly1-poly2 capacitor cross-section.

Non-linear analysis is, however, necessary when op-amp slew-rate limiting occurs. This analysis is difficult to carry out, and optimum performance is usually found with computer simulations. Usually, for typical device models, 10% of the available timeslot (i.e., Ts/2) is used for slew rate, while 40% is used for linear settling.

Integrated Capacitors

Integrated capacitors in CMOS technology for SC circuits are mainly realized using a poly1-poly2 structure, whose cross-section is shown in Fig. 59.20. This capacitor implementation guarantees linear behavior over a large signal swing. The main drawbacks of integrated capacitors are related to their absolute and relative inaccuracy, and to their associated parasitic capacitance. The absolute value of integrated capacitors can change ±30% from their nominal values. However, the matching between equal capacitors can be on the order of 0.2%, provided that proper layout solutions are adopted (in close proximity, with guard rings, with common centroid structure). The matching of two capacitors of a different value C can be expressed with the standard deviation of their ratio σC, which is related to the standard deviation of the ratio between two identical capacitors of value C1, σC1, by Eq. 59.22.7

$$\sigma_C = \frac{\sigma_{C1}}{\sqrt{C/C_1}} \tag{59.22}$$

This model can be used to evaluate the robustness of the SC system performance with respect to random capacitor variations using a Monte Carlo analysis. The plates of a poly1-poly2 capacitor of value C present a parasitic capacitance toward the substrate, as shown in Fig. 59.20. Typically, this capacitance is about 10% of C for the bottom plate (cp1 = C/10), and it is 1% of C for the top plate (cp2 = C/100). In order to reduce the effect of these parasitic capacitances in the transfer function of the SC systems, it is useful to connect the top plate to the op-amp input node, and the bottom plate to low impedance nodes (op-amp output nodes or voltage sources). In addition, in Fig. 59.20, an n-well, biased with a clean voltage VREF, is placed under the poly1-poly2 capacitor in order to reduce noise coupling from the substrate through the parasitic capacitance.
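A minimal Monte Carlo sketch of the matching model is given below. It assumes Eq. 59.22 in the form σC = σC1/√(C/C1) and assumes Gaussian, independent ratio errors; neither the distribution nor the function names come from the text. The 0.2% figure is the matching of equal capacitors quoted above, and the C/C1 = 4 case is a hypothetical example.

```python
import math
import random
import statistics

def ratio_sigma(c_over_c1, sigma_c1):
    """Eq. 59.22 (assumed form): ratio std. dev. for capacitors of value C."""
    return sigma_c1 / math.sqrt(c_over_c1)

def monte_carlo_ratios(nominal_ratio, sigma, n_trials, seed=0):
    """Draw perturbed capacitor ratios, assuming a Gaussian relative error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [nominal_ratio * (1.0 + rng.gauss(0.0, sigma)) for _ in range(n_trials)]

SIGMA_C1 = 0.002      # 0.2% matching between equal capacitors (from the text)
C_OVER_C1 = 4.0       # hypothetical: capacitors four times the reference value

SIGMA_C = ratio_sigma(C_OVER_C1, SIGMA_C1)
SAMPLES = monte_carlo_ratios(nominal_ratio=1.0, sigma=SIGMA_C, n_trials=20000)
MEASURED = statistics.pstdev(SAMPLES)

print(f"predicted sigma = {SIGMA_C:.4%}, sampled sigma = {MEASURED:.4%}")
```

In a full analysis, each sampled ratio would be substituted into the filter coefficients and the resulting frequency responses compared against the specification.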

MOS Switches

The typical situation during the sampling operation is shown in Fig. 59.21(a) (this is the input branch of the integrator of Fig. 59.7(a)). The input signal Vi is sampled on the sampling capacitor Cs in order to have Vc = Vi.


FIGURE 59.21

(a) ideal sampling structure, (b) sampling structure with NMOS switches.

In Fig. 59.21(b), each switch is replaced by a single NMOS device which operates in the triode region with an approximately zero voltage drop between drain and source. The switch on-resistance Ron can be expressed as:

$$R_{on} = \frac{1}{\mu_n C_{ox}\frac{W}{L}(V_{gs} - V_{th})} = \frac{1}{\mu_n C_{ox}\frac{W}{L}(V_G - V_i - V_{th})} \tag{59.23}$$

where VG is the amplitude of the clock driving phase, µn is the electron mobility, Cox is the oxide capacitance, and W and L are the width and length of the MOS device. Using VDD = 5 V (i.e., VG = 5 V), the dependence of Ron on the input voltage is plotted in Fig. 59.22(a). This means that if Ron is required to be lower than a given value for a given capacitor value (to implement a low Ron·Cs time constant), the possible input swing is limited. For instance, if the maximum acceptable Ron is 2.5 kΩ, the maximum input signal swing is [0 V–3.5 V]. To avoid this limitation, a complementary switch can be used. It consists of an NMOS and a PMOS device in parallel, as shown in Fig. 59.23. The PMOS switch presents an Ron behavior complementary to that of the NMOS, as plotted in Fig. 59.22(b). The complete switch Ron is then given by the parallel combination of the two contributions, which is sufficiently low over the whole signal swing. This solution requires distributing two clock lines, one controlling the NMOS and one the PMOS devices. This could be critical for SC filters operating at high sampling frequency, also in consideration of the synchronization of the two phases and of the digital noise from the distributed clocks, which could reduce the dynamic range.
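The swing limitation follows directly from Eq. 59.23 and can be sketched as below. The device parameters are hypothetical: µn·Cox·W/L = 500 µA/V² and a constant Vth = 0.7 V (body effect ignored) were picked so that the result reproduces the 2.5 kΩ and 3.5 V figures quoted above. They are not values given in the text.

```python
def r_on_nmos(v_i, v_g, v_th, beta):
    """Eq. 59.23 with beta = mu_n*Cox*W/L; infinite once the device is off."""
    v_ov = v_g - v_i - v_th
    if v_ov <= 0.0:
        return float("inf")
    return 1.0 / (beta * v_ov)

def max_input_for_r_on(r_on_max, v_g, v_th, beta):
    """Largest input voltage for which R_on stays below r_on_max."""
    return v_g - v_th - 1.0 / (beta * r_on_max)

# Hypothetical parameters (assumptions chosen to match the quoted example).
BETA = 500e-6   # A/V^2
V_TH = 0.7      # V, treated as constant
V_G = 5.0       # V, clock amplitude

V_I_MAX = max_input_for_r_on(2.5e3, V_G, V_TH, BETA)
print(f"max input for R_on <= 2.5 kOhm: {V_I_MAX:.2f} V")
for v_i in (0.0, 2.0, 3.5, 4.5):
    print(f"Vi = {v_i:.1f} V -> R_on = {r_on_nmos(v_i, V_G, V_TH, BETA):.0f} Ohm")
```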

FIGURE 59.22

Switch on-resistance.


FIGURE 59.23 Sampling structure with complementary switches.

FIGURE 59.24 Sampling operation linear model.

Once a minimum conductance is guaranteed, the structure can be studied using the linear model for the MOS devices S1 and S2 which operate in the triode region, resulting in the circuit of Fig. 59.24. In this case, Vc follows Vi, through an exponential law with a time constant τin = Cs·2·Ron. Typically, at least 6·τin must be guaranteed in the sampling timeslot to ensure sufficient accuracy. For a given sampling capacitance value, this is achieved using switches with sufficiently low on-resistance and no voltage drop across its nodes. Large on-resistance results in a long time constant and incomplete settling, while a voltage drop results in an incorrect final value. MOS technology allows the implementation of analog switches satisfying both the previous requirements.
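The 6·τin guideline can be translated into a residual settling error with a few lines of code. The sketch below assumes single-pole exponential settling, as in the linear model above; the 1 pF sampling capacitor is a hypothetical value, while 2.5 kΩ is the on-resistance used in the earlier example.

```python
import math

def tau_in(r_on, c_s):
    """Input-branch time constant: two switches in series with Cs."""
    return 2.0 * r_on * c_s

def settling_error(n_tau):
    """Relative error left after n_tau time constants of exponential settling."""
    return math.exp(-n_tau)

R_ON = 2.5e3     # Ohm, from the earlier example
C_S = 1e-12      # F, hypothetical sampling capacitor

TAU = tau_in(R_ON, C_S)
ERR_6 = settling_error(6.0)
print(f"tau_in = {TAU * 1e9:.1f} ns, 6*tau_in = {6 * TAU * 1e9:.1f} ns")
print(f"residual error after 6 tau: {ERR_6 * 100:.2f} %")
```

Six time constants leave an error of about 0.25%, so a tighter accuracy target requires a proportionally longer sampling slot or a lower on-resistance.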

Transconductance Amplifier

The SC technique appears to be the natural application of the design features of available CMOS technology. This is also true for the op-amp design. In fact, SC circuits require an infinite op-amp input impedance, as provided by an op-amp using MOS input devices. On the other hand, CMOS op-amps are particularly efficient when the load impedance is not resistive and low, but only capacitive, as in the case of SC circuits. In addition, SC circuits allow one to process a full swing (rail-to-rail) signal, and this is possible for CMOS op-amps. The main requirements to be satisfied by the op-amp remain the bandwidth, the slew rate, and the dc gain. The bandwidth and the slew rate must be sufficiently large to guarantee accurate settling for all the signal steps. The op-amp gain must be sufficiently large to ensure a complete charge transfer. A tradeoff between large dc gain (achieved with low-current and/or multistage structures) and large bandwidth (obtained with high-current and/or simple structures) must be optimized. For this case, the use of a mixed technology (like BiCMOS) could help the design optimization.

59.7 Performance Limitations

The arguments described thus far are valid assuming an ideal behavior of the devices in the SC network (i.e., op-amps, switches, and capacitors). However, in an actual realization, each of them presents non-idealities which reduce the performance accuracy of the complete SC circuit. The main limitations and their effects are described in the following. Finally, considerations about noise in SC systems conclude this section.

Limitation Due to the Switches

As described before, CMOS switches satisfy both low on-resistance and zero voltage-drop requirements. However, they introduce some performance limitations due to their intrinsic CMOS realization. The cross-section of an NMOS switch in its on-state is shown in Fig. 59.25. The connection between its nodes N1 and N2 is guaranteed by the presence of the channel, made up of the charge Qch.

FIGURE 59.25 Switch charge profile of an NMOS switch in the on-state.

The amount of charge Qch can be written as:

$$Q_{ch} = (W\cdot L)\cdot C_{ox}\cdot(V_G - V_i - V_{TH}) \tag{59.24}$$

where Vi is the channel (input) voltage. Both nodes N1 and N2 are at voltage Vi (no voltage drop between the switch nodes). In addition, the gate oxide which guarantees infinite MOS input impedance constitutes a capacitive connection between gate and both source and drain. This situation results in two non-ideal effects: charge injection and clock feedthrough. Charge Injection At the switch turn-off, the charge Qch given in Eq. 59.24 is removed from the channel and it is shared between the two nodes connected to the switch, with a partition depending on the node impedance level. The charge k·Qch is injected in N2 and collected on a capacitor Cc. A voltage variation nVc across the capacitor arises, which is given by:

$$\Delta V_c = k \cdot \frac{Q_{ch}}{C_c} \qquad (59.25)$$
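As a rough numerical illustration of Eqs. 59.24 and 59.25, the following Python sketch evaluates the channel charge and the resulting voltage step for a few input voltages. All device values (W, L, Cox, VG, VTH, Cc) and the partition factor k = 0.5 are hypothetical assumptions chosen only for illustration; they are not taken from the text.

```python
def channel_charge(w, l, cox, vg, vi, vth):
    """Channel charge Qch = W*L*Cox*(VG - Vi - VTH) of Eq. 59.24, in coulombs."""
    return w * l * cox * (vg - vi - vth)


def injection_step(qch, cc, k=0.5):
    """Voltage step k*Qch/Cc of Eq. 59.25; k is an assumed partition factor."""
    return k * qch / cc


# Hypothetical example values (assumptions, not from the source text).
W, L = 10e-6, 0.5e-6   # switch width and length, m
COX = 5e-3             # gate-oxide capacitance per unit area, F/m^2
VG, VTH = 3.3, 0.7     # gate-on voltage and threshold voltage, V
CC = 1e-12             # collecting capacitor, F

if __name__ == "__main__":
    for vi in (0.0, 1.0, 2.0):
        qch = channel_charge(W, L, COX, VG, vi, VTH)
        dv = injection_step(qch, CC)
        print(f"Vi = {vi:.1f} V: Qch = {qch * 1e15:.1f} fC, dVc = {dv * 1e3:.2f} mV")
```

Under these assumed values, the step changes with Vi, which is the signal dependence that turns charge injection into distortion for a switch connected to the signal swing.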

For all the switches of a typical SC integrator, as shown in Fig. 59.26(a), this effect is important. For instance, for the switch S4 connected to the op-amp virtual ground, the charge injected into the virtual ground is collected on the feedback capacitor and is processed as an input signal. The amount of this charge injection depends on different parameters (see Eq. 59.24 and Eq. 59.25). Charge Qch depends on the switch size W, which, however, cannot be reduced beyond a certain level; otherwise, the switch on-resistance would increase. Thus, a tradeoff between charge injection and on-resistance is present. In addition, charge Qch depends on the voltage Vi to which the switch is connected. For the switches S2, S3, and S4, the injected charge is proportional to (VG – Vgnd) and is always fixed; as a consequence, it can be considered as an offset. On the other hand, for the switch S1 connected to the signal swing, the channel charge Qch depends on (VG – Vi), i.e., on the signal amplitude, and thus the charge injection is also signal dependent. This creates additional signal distortion.

FIGURE 59.26 Charge displacement during turn-off.

Possible solutions for the reduction of the charge injection are: use of dummy switches, use of a slowly varying clock phase, use of differential structures, use of delayed clock phases [8], and use of a signal-dependent charge pump [9]. Dummy switches operate with complementary phases in order to sink the charge rejected by the original switches. The use of differential structures reduces the offset charge injection to the mismatch of the two differential paths. For the signal-dependent charge injection, the delayed phases of Fig. 59.26(b) are applied to the integrator of Fig. 59.26(a). This clock phasing is based on the concept that, at turn-off, S3 opens before S1. In this way, when S1 opens, the impedance toward Cs is infinite and no signal-dependent charge injection occurs into Cs.

Clock Feedthrough

The clock feedthrough is the amount of signal that is injected into the sampling capacitor Cc from the clock phase through the MOS overlap capacitor (Cov) path, shown in Fig. 59.27, and it is therefore proportional to the area of the switches. Using large switches to reduce the on-resistance results in large charge injection and large clock feedthrough. This error is typically constant (it depends on a capacitance partition) and therefore it can be greatly reduced by using differential structures. The voltage error ∆Vc across a capacitance Cc due to the feedthrough of the clock amplitude (VDD – VSS) can be written as:

$$\Delta V_c = \left(V_{DD} - V_{SS}\right) \cdot \frac{C_{ov}}{C_{ov} + C_c} \qquad (59.26)$$
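The capacitive partition of Eq. 59.26 can be evaluated with a short Python sketch. The clock swing, the overlap capacitances, and the sampling capacitor used below are hypothetical values assumed only to show the order of magnitude.

```python
def clock_feedthrough(vdd, vss, cov, cc):
    """Voltage error of Eq. 59.26: (VDD - VSS) * Cov / (Cov + Cc), in volts."""
    return (vdd - vss) * cov / (cov + cc)


if __name__ == "__main__":
    # Hypothetical values: 3.3 V clock swing and a 1 pF sampling capacitor.
    for cov in (0.5e-15, 1e-15, 2e-15):
        dv = clock_feedthrough(3.3, 0.0, cov, 1e-12)
        print(f"Cov = {cov * 1e15:.1f} fF -> dVc = {dv * 1e3:.2f} mV")
```

Since the result does not depend on the signal, the same error appears on both paths of a differential structure and is cancelled up to the mismatch between them.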

FIGURE 59.27 Clocking scheme for signal-dependent charge injection reduction.

FIGURE 59.28 Op-amp induced errors.

Limitation Due to the Op-amp

The operation of SC networks is based on the availability of a “good” virtual ground, which ensures a complete charge transfer from the sampling capacitors to the feedback capacitor. Whenever this charge transfer is incomplete, the SC network performance deviates from its nominal behavior. The main op-amp non-idealities are: finite dc-gain, finite bandwidth, finite slew-rate, and gain non-linearity.

Finite Op-amp dc-Gain Effects [10,11]

The op-amp finite gain results in a deviation of the output voltage at the end of the sampling period from the ideal one, as shown in Fig. 59.28(a). This output sample deviation translates into a deviation of the SC system performance. For the finite gain effect, an analysis that relates the op-amp gain Ao to the SC network performance deviation can be carried out under the hypothesis that the op-amp bandwidth is sufficiently large to settle within the available timeslot. For the summing amplifier of Fig. 59.10, it can be demonstrated that the finite op-amp dc-gain (Ao) results only in an overall gain error. For this reason, SC FIR filters (based on this scheme) exhibit a low sensitivity to op-amp finite dc-gain. On the other hand, for SC integrators, the finite gain effect results in pole and gain deviation. For instance, the transfer function of the integrator of Fig. 59.8(a) becomes:

$$H_a(z) = \frac{V_o(z)}{V_i(z)} = \frac{C_s}{C_f} \cdot \frac{z^{-1}}{1 + \dfrac{1}{A_o}\left(1 + \dfrac{C_s}{C_f}\right) - \left(1 + \dfrac{1}{A_o}\right) \cdot z^{-1}} \qquad (59.27)$$
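The pole and gain deviation can be checked numerically. The sketch below reads Eq. 59.27 as H(z) = (Cs/Cf)·z⁻¹ / [1 + (1/Ao)(1 + Cs/Cf) − (1 + 1/Ao)·z⁻¹] and compares it with its Ao → ∞ limit. The capacitor ratio and the gain value are assumptions used only for illustration.

```python
import cmath
import math


def h_ideal(z, cs, cf):
    """Ao -> infinity limit of Eq. 59.27: (Cs/Cf) * z^-1 / (1 - z^-1)."""
    return (cs / cf) / (z - 1)


def h_finite_gain(z, cs, cf, ao):
    """Integrator transfer function with finite op-amp dc-gain Ao (Eq. 59.27)."""
    den = 1 + (1 + cs / cf) / ao - (1 + 1 / ao) / z
    return (cs / cf) / z / den


def pole(cs, cf, ao):
    """Pole of Eq. 59.27 in the z-plane; it moves from z = 1 to inside the unit circle."""
    return (1 + 1 / ao) / (1 + (1 + cs / cf) / ao)


if __name__ == "__main__":
    cs, cf, ao = 1.0, 15.92, 100.0   # assumed example values
    print(f"pole at z = {pole(cs, cf, ao):.6f}")
    for f_over_fs in (1e-4, 1e-3, 1e-2, 1e-1):
        z = cmath.exp(2j * math.pi * f_over_fs)
        ideal_db = 20 * math.log10(abs(h_ideal(z, cs, cf)))
        real_db = 20 * math.log10(abs(h_finite_gain(z, cs, cf, ao)))
        print(f"f/Fs = {f_over_fs:g}: ideal {ideal_db:.2f} dB, finite gain {real_db:.2f} dB")
```

With this reading of the equation, the dc gain of the integrator saturates at Ao instead of growing without bound, which is the lossy-integrator behavior expected from a finite-gain op-amp.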

For a biquadratic cell, the op-amp finite gain results in deviations of the pole frequency and of the pole quality factor. The actual frequency and quality factor of the pole (foA and QA) are related to their nominal values (fo and Q) by the relationships:

$$f_{oA} = \frac{A_o}{1 + A_o} \cdot f_o \qquad Q_A = \frac{1}{\dfrac{1}{Q} + \dfrac{2}{A_o}} \approx \left(1 - \frac{2 \cdot Q}{A_o}\right) \cdot Q \qquad (59.28)$$
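A minimal sketch of Eq. 59.28 follows, reading it as foA = Ao/(1 + Ao)·fo and QA = 1/(1/Q + 2/Ao) ≈ (1 − 2Q/Ao)·Q. The nominal pole frequency, Q, and gain values are hypothetical.

```python
def actual_pole_frequency(fo, ao):
    """foA = Ao / (1 + Ao) * fo (Eq. 59.28)."""
    return ao / (1 + ao) * fo


def actual_q(q, ao):
    """QA = 1 / (1/Q + 2/Ao) (Eq. 59.28, exact form)."""
    return 1 / (1 / q + 2 / ao)


def actual_q_approx(q, ao):
    """First-order approximation QA ~ (1 - 2*Q/Ao) * Q, valid for Q << Ao."""
    return (1 - 2 * q / ao) * q


if __name__ == "__main__":
    fo, q = 100e3, 10.0   # assumed nominal pole frequency (Hz) and quality factor
    for ao in (100.0, 1000.0, 10000.0):
        print(f"Ao = {ao:g}: foA = {actual_pole_frequency(fo, ao):.1f} Hz, "
              f"QA = {actual_q(q, ao):.3f} (approx {actual_q_approx(q, ao):.3f})")
```

The exact and approximate expressions for QA agree only when 2Q/Ao is small, so high-Q cells are the most demanding in terms of op-amp gain.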

Finite Bandwidth and Slew-Rate [10,11]

Op-amp finite bandwidth and slew-rate also result in incomplete charge transfer, which again corresponds to a deviation of the output sample from its nominal value. For the case of only finite bandwidth, the effect is shown in Fig. 59.28(b). An analysis similar to that of the finite gain is not easily carried out for the finite bandwidth and slew-rate effects. This is due to the fact that incomplete settling is caused by the combination of a linear effect (the finite bandwidth) and a non-linear effect (the slew rate). In addition, the situation is worsened by the fact that, in some structures, several op-amps are connected in cascade, so that each op-amp (apart from the first one) has to wait for the preceding one to complete its operation.

Op-amp Gain Non-linearity

Since the SC structure allows one to process large-swing signals, the op-amp has to provide constant gain over the whole signal swing. When the op-amp gain is not constant over the necessary output swing, distortion arises. An analysis can be carried out for the case of the integrator of Fig. 59.8(a) [12]. Assuming an op-amp input (vi)-to-output (vo) relationship expressed in the form:

$$v_o = a_1 \cdot v_i + a_2 \cdot v_i^2 + a_3 \cdot v_i^3 + \ldots \qquad (59.29)$$

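The resulting harmonic components (for ωo·Ts ≪ 1) are given by:

$$HD_2 = \frac{a_2}{2 \cdot a_1^3 \cdot \beta} \cdot \left(1 + \frac{V_o}{2 \cdot V_i}\right) \cdot V_o \qquad (59.30a)$$

$$HD_3 = \frac{a_3}{2 \cdot a_1^4 \cdot \beta} \cdot \left(1 + \frac{V_o}{3 \cdot V_i}\right) \cdot V_o^2 \qquad (59.30b)$$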

The distortion can then be reduced either by making the gain constant (i.e., reducing a2 and a3) or by using a very large op-amp gain (i.e., increasing a1). The second approach is the strategy usually adopted.

Noise in SC Systems [13,14]

In SC circuits, the main noise sources are the switches (thermal noise) and the op-amp (thermal noise and 1/f noise). These noise sources are processed by the SC structure as an input signal, i.e., they are sampled (with consequent folding) and transferred to the output with a given transfer function. As explained for the signal, the output frequency range of an SC filter is limited to the band [0–Fs/2]. This means that, for any noise source, its sampled noise band is limited to the [0–Fs/2] range, independently of how large it is at the source before sampling. On the other hand, the total noise power remains constant after sampling; this means that the power density of the sampled noise is increased by the factor Fb/(Fs/2), where Fb is the noise band at its source. This can be seen for the switch noise in the following simple example. Consider the structure of Fig. 59.29, where the resistance represents the switch on-resistance. Its associated noise spectral density is given by vn² = 4kT·Ron (where k is Boltzmann’s constant and T is the absolute temperature). The transfer function to the output node (the voltage across the capacitor) is H(s) = 1/(1 + s·Ron·Cs). The total output noise can be calculated from the expression:

$$v_{no}^2 = \int_0^{\infty} v_n^2 \cdot \left|H(s)\right|^2 \cdot df = \frac{kT}{C_s} \qquad (59.31)$$

FIGURE 59.29 Sampling the noise on a capacitor.

The total sampled noise is then given by kT/Cs, and presents a bandwidth of Fs/2. This means that the output noise power density is kT/Cs·2/Fs. (See Fig. 59.30.)
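The kT/Cs result can be verified with a short Python sketch. It uses the closed-form integral of 4kT·Ron·|H(j2πf)|² up to a finite frequency, with H(s) = 1/(1 + s·Ron·Cs), and shows that the limit does not depend on Ron. The capacitor, resistance, temperature, and sampling-frequency values are assumptions chosen for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann's constant, J/K


def sampled_noise_power(cs, temp=300.0):
    """Total sampled noise power kT/Cs of Eq. 59.31, in V^2."""
    return K_B * temp / cs


def band_limited_noise_power(ron, cs, f_max, temp=300.0):
    """Integral of 4kT*Ron / (1 + (2*pi*f*Ron*Cs)^2) from 0 to f_max, in V^2.

    The antiderivative is 4kT*Ron * atan(2*pi*Ron*Cs*f) / (2*pi*Ron*Cs).
    """
    tau = 2 * math.pi * ron * cs
    return 4 * K_B * temp * ron * math.atan(tau * f_max) / tau


def sampled_noise_density(cs, fs, temp=300.0):
    """Noise power density after sampling: (kT/Cs) * 2/Fs, in V^2/Hz."""
    return sampled_noise_power(cs, temp) * 2 / fs


if __name__ == "__main__":
    cs, fs = 1e-12, 10e6   # assumed 1 pF capacitor and 10 MHz sampling frequency
    print(f"kT/Cs rms = {math.sqrt(sampled_noise_power(cs)) * 1e6:.1f} uV")
    for ron in (100.0, 1e3, 10e3):
        total = band_limited_noise_power(ron, cs, 1e15)
        print(f"Ron = {ron:g} ohm: integrated rms = {math.sqrt(total) * 1e6:.1f} uV")
    print(f"sampled density = {sampled_noise_density(cs, fs):.3e} V^2/Hz")
```

A larger Ron raises the noise density but narrows the noise bandwidth by the same factor, so the integrated power stays at kT/Cs.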

FIGURE 59.30 Folding of the noise power spectral density.

The same folding concept can be applied to the op-amp noise. For the op-amp 1/f noise, the corner frequency is usually lower than Fs/2. Therefore, the 1/f noise is not modified by the sampling. On the other hand, the white noise presents a bandwidth Fb larger than Fs/2. This means that this noise component is transformed into a noise source of bandwidth Fs/2, with its noise power density multiplied by the factor Fb/(Fs/2). When the noise sources are evaluated in this way, the output noise of an SC cell can be evaluated by summing the different components, properly weighted by their transfer functions from their source position to the output node.

A few considerations follow for the noise performance of an SC cell. Switch noise is independent of Ron, since the dependence of the noise density on Ron is cancelled by the dependence of the bandwidth on Ron. Thus, this noise source depends only on Cs. Noise reduction is achieved by increasing Cs. This, however, must be traded against the power increase necessary to drive the enlarged capacitance. Of course, even if Ron does not appear in the noise expression, as the capacitor is enlarged, Ron must be adequately decreased in order to guarantee a proper sampling accuracy. For the op-amp noise, the noise band is usually correlated with the signal bandwidth. Therefore, good op-amp settling (achieved with a large signal bandwidth) is in contrast with low-noise performance (achieved with a reduced noise bandwidth). For this reason, in low-noise systems, the bandwidth of the op-amp is designed to be the minimum that guarantees proper settling.

59.8 Compensation Technique (Performance Improvements)

SC systems usually operate with a two-phase clock in which the op-amp is ‘really’ active only during one phase, and during the other phase is ‘sleeping.’ Provided that the op-amp output node is not read during the second phase, this non-active phase can be used to improve the performance of the SC system, as shown in the following [15]. 1/f noise and offset can be reduced with Correlated Double Sampling (CDS) or the chopper technique. Similar structures are also able to compensate for the error due to the finite gain of the operational amplifier. On the other hand, proper structures are able to reduce the capacitor spread occurring in particular situations (high-Q or large time constant filters). Finally, the double-sampled technique can be used to increase, by a factor of two, the sampling frequency of the SC system.

CDS Offset-Compensated SC Integrator

The extra phase available in a two-phase SC system can be used to reduce op-amp offset and 1/f noise effects at the output. A possible scheme is shown in Fig. 59.31 [16], and operates as follows. Capacitor Cof is used to sample, during φ1, the offset voltage Voff as it appears at the inverting node of the op-amp in a close-to-unity feedback configuration. During φ2, the inverting node is still at a voltage very close to Voff, since the bandwidth of Voff is assumed to be very small with respect to the sampling frequency. Capacitor Cof maintains the charge on its plates and acts like a battery. Thus, node X is a good virtual ground, independent of the op-amp offset. In the same way, the output signal, read only during φ2, is offset independent. The effect of this technique can be simulated using the values of the first-order cell of the previous example (i.e., Cf = 15.92, and Cs = Cd = 1), and Cof = 1. The transfer function Vo/Voff is shown in Fig. 59.32. At low frequency, Voff is highly rejected, which is not the case for the standard (uncompensated) integrator. The main problem with this solution is the unity-feedback operation of the structure during phase φ1. This requires the op-amp to be stable in that configuration, which could require a very high power consumption.

FIGURE 59.31 Offset-compensated SC integrator.

FIGURE 59.32 Offset-compensated SC integrator performance.

Chopper Technique

An alternative solution to reduce offset and 1/f noise at the output is given by the chopper technique. It consists of placing one SC mixer for frequency Fs/2 at the op-amp input and a similar one at the op-amp output. This action does not affect white noise. On the other hand, offset and 1/f noise are shifted to around Fs/2, not affecting the frequencies around dc, where the signal to be processed is supposed to be. This concept is shown for a fully differential op-amp in Fig. 59.33. In Fig. 59.34, the input-referred noise power spectral density (PSD) without and with chopper modulation is shown. The white noise level (wnl) is not affected by the chopper operation and remains constant. It will be modified by the folding of the high-frequency noise, as previously described. This technique is particularly advantageous for SC systems since the mixer can be efficiently implemented with the SC technique, as shown in Fig. 59.35.

FIGURE 59.33 Op-amp with chopper.

FIGURE 59.34 Input-referred noise power spectral density (PSD) without and with chopper technique.

FIGURE 59.35 Op-amp with SC chopper configuration.

Finite-Gain Compensated SC Integrator

In the op-amp design, a tradeoff between op-amp dc-gain and bandwidth exists. Therefore, when a large bandwidth is needed, a finite dc-gain necessarily results, reducing the accuracy of the SC filter performance. To avoid this, the available extra phase can be used to self-calibrate the structure with respect to the error due to the op-amp finite gain. In the literature, several techniques have been proposed. The majority of them are based on the concept of using a preview of the future output samples to pre-charge a capacitor placed in series with the op-amp inverting input node in order to create a ‘good’ virtual ground (as for offset cancellation). The various approaches differ in how they get the preview and how they calibrate the new virtual ground. Depending on the case, they can be effective for a large bandwidth [17,18], for a small bandwidth [19,20], or for a passband bandwidth [21].

As an example of this kind of compensation, one of the earliest proposed schemes is depicted in Fig. 59.36. The op-amp finite gain makes the op-amp inverting input node deviate from the ideal virtual ground behavior and assume the value –Vo/Ao, where Vo is the output value and Ao is the op-amp dc-gain. In the scheme of Fig. 59.36, the future output sample is assumed to be close to the previous sample, sampled on Cg1. This limits the effectiveness of this scheme to signal frequencies f for which this assumption is valid (i.e., for f/Fs ≪ 1). The circuit operates as follows. During φ1, auxiliary capacitor Cg1 samples the output; during φ2, Cg1 is used to precharge Cg2 to –Vo/Ao, generating a good virtual ground at node X. In Fig. 59.37, the frequency responses of different integrators are compared. Line I refers to an uncompensated integrator with Ao = 100; line II refers to the uncompensated integrator with Ao = 10,000. This line matches line III, which corresponds to the compensated integrator with Ao = 100. Finally, line IV is the frequency response of the ideal integrator. This comparison shows that the effect of the compensation is to achieve, with an op-amp gain Ao, a performance similar to that achieved with an op-amp gain Ao². Alternative solutions to the op-amp gain compensation are based on the use of a replica amplifier matched with the main one. In this case, too, the solution achieves a performance accuracy corresponding to an op-amp dc-gain of Ao².

FIGURE 59.36 Finite-gain-compensated SC integrator.

FIGURE 59.37 Finite-gain-compensated SC integrator performance.

The Very-Long Time-Constant Integrator

In the design of Very-Long Time-Constant integrators using the scheme of Fig. 59.8(a), typical key points to be considered are:

• The capacitor spread: if the pole frequency fp is very low with respect to the sampling frequency Fs, then the capacitor spread S = Cf/Cs of a standard integrator (Fig. 59.8(a)) will be very large. This results in a large die area and in reduced performance accuracy due to poor matching.
• The sensitivity to the parasitic capacitances: proper structures can reduce the capacitor spread. They, however, suffer from the presence of parasitic capacitance. Parasitic-insensitive or at least parasitic-compensated designs should then be considered.
• The offset of the operational amplifier: offset-compensated op-amps are needed when the op-amp offset contribution cannot be tolerated.

In the literature, several SC solutions have been proposed, more oriented toward reducing the capacitor spread than toward compensating either the parasitics or the op-amp offset. A first solution is based on the use of a capacitive T-network in a standard SC integrator, as shown in Fig. 59.38 [22]. The sampling T-structure realizes a passive charge partition between the capacitors Cs1 and Cs2 + Cs3. The final result is that only the charge on Cs3 is injected into the virtual ground. Therefore, the effect of this scheme is that Cs is replaced with Cs_equiv, given by the expression:

$$C_{s\_equiv} = C_{s3} \cdot \frac{C_{s1}}{C_{s1} + C_{s2} + C_{s3}} \qquad (59.32)$$

The net gain of this approach is that, using Cs2 = √S·Cs1 = √S·Cs3, the capacitor spread is reduced to √S. For example, an integrator with Cs = 1 and Cf = 40 can be realized with Cs1 = 1, Cs2 = 6, Cs3 = 1, and Cf = 5, i.e., with the capacitor spread reduced to 6.

$$\frac{V_o}{V_i} = \frac{C_{s1}}{C_{s1} + C_{s2} + C_{s3}} \cdot \frac{C_{s3}}{C_f} \cdot \frac{1}{1 - z^{-1}} \qquad (59.33)$$

FIGURE 59.38 A T-network long-time-constant SC integrator.
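The worked example above can be checked with exact rational arithmetic. The sketch below implements the equivalent capacitance Cs_equiv = Cs3·Cs1/(Cs1 + Cs2 + Cs3) of Eq. 59.32 and compares the capacitor spread of the standard and T-network realizations; the function names are illustrative and not taken from the text.

```python
from fractions import Fraction


def cs_equivalent(cs1, cs2, cs3):
    """Equivalent sampling capacitor of the T-network (Eq. 59.32)."""
    cs1, cs2, cs3 = Fraction(cs1), Fraction(cs2), Fraction(cs3)
    return cs3 * cs1 / (cs1 + cs2 + cs3)


def spread(*caps):
    """Capacitor spread: ratio of the largest to the smallest capacitor."""
    return Fraction(max(caps)) / Fraction(min(caps))


if __name__ == "__main__":
    # Values from the example in the text: Cs = 1, Cf = 40 versus the T-network.
    cs1, cs2, cs3, cf = 1, 6, 1, 5
    cs_eq = cs_equivalent(cs1, cs2, cs3)
    print("Cs_equiv =", cs_eq)
    print("Cf / Cs_equiv =", Fraction(cf) / cs_eq)
    print("standard spread =", spread(1, 40))
    print("T-network spread =", spread(cs1, cs2, cs3, cf))
```

Both realizations give the same ratio Cf/Cs, so they implement the same integrator gain, while the largest-to-smallest capacitor ratio drops from 40 to 6.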

The major problem of the circuit of Fig. 59.38 is that the T-network is sensitive to the parasitic capacitance Cp (due to Cs1, Cs2, and Cs3) at the middle node of the T-network, which is added to Cs2, reducing the frequency response accuracy. A parasitic-insensitive circuit is the one proposed by Nagaraj [23] and shown in Fig. 59.39. In this case, the transfer function is given by Eq. 59.34. Here, too, Cs = Cx = √S·Cf is usually adopted to reduce the standard spread from S to √S. However, in the Nagaraj integrator, the op-amp is used on both phases, which rules out the use of a double-sampled structure.

$$\frac{V_o}{V_i} = \frac{C_f}{C_s} \cdot \frac{C_x}{C_f + C_x} \cdot \frac{1}{1 - z^{-1}} \qquad (59.34)$$

FIGURE 59.39 The Nagaraj’s long-time-constant SC integrator.

It is also possible to combine a long-time-constant scheme with an offset-compensated scheme to obtain a long-time-constant offset-compensated SC integrator [15].

Double-Sampling Technique

If the output value of the SC integrator of Fig. 59.8(b) is read only at the end of φ2, the requirement for the op-amp to settle can be relaxed. For the integrator of Fig. 59.8(b), the time available for the op-amp to settle is Ts/2. The equivalent double-sampled structure is shown in Fig. 59.40. The capacitor values for the two structures are the same, and thus they implement the same transfer function. The time evolution of the two structures is compared in Fig. 59.41. For the double-sampled SC integrator, the time available for the op-amp to settle is doubled. This advantage can be used in two ways. First, when a high sampling frequency is required, if the op-amp cannot settle in Ts/2, the extra time allows it to meet the speed requirement (i.e., the double-sampling technique is used to increase the sampling frequency). Second, at low sampling frequency, when the power consumption must be strongly reduced, the smaller bandwidth required from the op-amp reduces its power consumption. The cost of the double-sampled structure is the doubling of all the switched capacitors. In addition, in the case of a small mismatch between the two parallel paths, mismatch energy could be present around Fs/4 [24].

FIGURE 59.40 Double-sampled SC integrator.

FIGURE 59.41 Double-sampled operation.

59.9 Advanced SC Filter Solutions

In this section, alternative solutions are proposed that overcome some basic limitations on the improvement of SC system performance. They deal with the bandwidth-vs.-gain tradeoff in the op-amp design for high-frequency SC filters, and with the implementation of low-voltage SC filters.

Precise Op-amp Gain (POG) for High-Speed SC Structures [25,26]

Standard design of SC networks assumes operation with infinite-gain and infinite-bandwidth op-amps. However, in the op-amp design, a tradeoff exists between speed and gain. As a consequence, with a high sampling frequency, the large bandwidth needed limits the op-amp gain to low values, thus limiting the achievable accuracy. Standard design is therefore less feasible at high sampling frequencies. A possible solution to this limitation is the Precise Op-amp Gain (POG) design approach, which consists of designing high-frequency SC networks taking into account the precise gain value of the op-amps as a parameter in the capacitor design. The standard design op-amp tradeoff between speed and gain is then changed into the POG design tradeoff between speed and gain precision, which is more affordable in high-frequency op-amps. If the op-amp dc-gain is Ao, the transfer function of the first-order cell shown in Fig. 59.42 is given by:

$$H_{POG}(z) = -\frac{C_s^{POG}}{C_f^{POG} + C_d^{POG} + \dfrac{C_s^{POG} + C_f^{POG} + C_d^{POG}}{A_o} - C_f^{POG} \left(1 + \dfrac{1}{A_o}\right) z^{-1}} \qquad (59.35)$$

FIGURE 59.42 Damped SC integrator.

For Ao tending to infinity, this expression becomes the standard-case transfer function HST, which is given by:

$$H_{ST}(z) = -\frac{C_s}{\left(C_f + C_d\right) - C_f \cdot z^{-1}} \qquad (59.36)$$

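To obtain the same t.f., the POG capacitor values are obtained from the standard values by equating the coefficients of Eq. 59.35 with those of Eq. 59.36, as given in the following:

$$C_s^{POG} = C_s \qquad C_f^{POG} = \frac{C_f}{1 + \dfrac{1}{A_o}} \qquad C_d^{POG} = \frac{C_d - \dfrac{C_s}{A_o}}{1 + \dfrac{1}{A_o}} \qquad (59.37)$$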
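The capacitor mapping can be cross-checked numerically. The sketch below solves the two matching conditions between Eq. 59.35 and Eq. 59.36 directly, namely Cf_POG·(1 + 1/Ao) = Cf and (Cf_POG + Cd_POG)·(1 + 1/Ao) + Cs/Ao = Cf + Cd with Cs_POG = Cs, and then compares the two transfer functions on the unit circle. The capacitor values are those of the first-order cell used earlier in the chapter (Cf = 15.92, Cs = Cd = 1); the gain Ao = 100 and the function names are assumptions made for illustration.

```python
import cmath


def h_standard(z, cs, cf, cd):
    """Standard-case transfer function of Eq. 59.36 (infinite op-amp gain)."""
    return -cs / ((cf + cd) - cf / z)


def h_pog(z, cs_p, cf_p, cd_p, ao):
    """First-order cell with finite op-amp dc-gain Ao (Eq. 59.35)."""
    den = cf_p + cd_p + (cs_p + cf_p + cd_p) / ao - cf_p * (1 + 1 / ao) / z
    return -cs_p / den


def pog_capacitors(cs, cf, cd, ao):
    """Capacitor values that make Eq. 59.35 coincide with Eq. 59.36.

    Obtained by matching the z^0 and z^-1 denominator coefficients
    while keeping the sampling capacitor unchanged.
    """
    g = 1 + 1 / ao
    return cs, cf / g, (cd - cs / ao) / g


if __name__ == "__main__":
    cs, cf, cd, ao = 1.0, 15.92, 1.0, 100.0   # Ao = 100 is an assumed gain
    cs_p, cf_p, cd_p = pog_capacitors(cs, cf, cd, ao)
    print(f"Cs_POG = {cs_p:.4f}, Cf_POG = {cf_p:.4f}, Cd_POG = {cd_p:.4f}")
    for w in (0.01, 0.3, 3.0):
        z = cmath.exp(1j * w)
        err_pog = abs(h_pog(z, cs_p, cf_p, cd_p, ao) - h_standard(z, cs, cf, cd))
        err_std = abs(h_pog(z, cs, cf, cd, ao) - h_standard(z, cs, cf, cd))
        print(f"w*Ts = {w}: POG error = {err_pog:.2e}, uncorrected error = {err_std:.2e}")
```

When the actual gain equals the nominal Ao, the corrected capacitors reproduce the standard response to within rounding error, whereas the uncorrected capacitors with the same finite gain leave a visible deviation.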

The concept here applied to the SC integrator can be applied to higher-order SC filters. It can be demonstrated that using the POG approach with an op-amp nominal gain Ao and an actual gain in the range [Ao(1 – ε), Ao(1 + ε)] achieves the same response accuracy as the standard approach with an infinite op-amp nominal gain with an actual gain given by:

$$A_{eff} = \frac{\left(A_o + 1\right)\left(1 \pm \varepsilon\right)}{\varepsilon} \approx \frac{A_o}{\varepsilon} \qquad (59.38)$$

The value Aeff can then be defined as the effective gain of the POG approach. For example, for an op-amp with Ao = 100 and ε = 0.08, the same performance accuracy is achieved as when using standard design with op-amp gain A = 1250. This value of Aeff can then be used in Eq. 59.27 to evaluate filter performance accuracy.
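The effective-gain figure quoted above follows directly from Eq. 59.38. A minimal Python sketch, with illustrative function names, evaluates both the approximate value Ao/ε and the two bounds given by the ± sign.

```python
def effective_gain_approx(ao, eps):
    """Approximate effective gain Aeff ~ Ao / eps (Eq. 59.38)."""
    return ao / eps


def effective_gain_bounds(ao, eps):
    """The two values of (Ao + 1) * (1 +/- eps) / eps from Eq. 59.38."""
    low = (ao + 1) * (1 - eps) / eps
    high = (ao + 1) * (1 + eps) / eps
    return low, high


if __name__ == "__main__":
    ao, eps = 100.0, 0.08   # values from the example in the text
    low, high = effective_gain_bounds(ao, eps)
    print(f"Aeff ~ {effective_gain_approx(ao, eps):.0f} "
          f"(bounds {low:.1f} to {high:.1f})")
```

The tighter the gain tolerance ε, the larger the effective gain, which is why the POG approach trades gain magnitude for gain precision.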

Low-Voltage Switched-Capacitor Solutions

In the last few years, the interest in low-power, low-voltage integrated systems has grown steadily, due to the increasing importance of portable equipment and to the reduction of the supply voltage of modern scaled-down standard CMOS technologies. For the design of SC filters operating at reduced supply voltages [27,28], capacitor properties are quite stable. On the other hand, at low supply, it is difficult to properly operate the MOS switches and the op-amps. With the supply voltage reduction, the overdrive voltage of the MOS switches is lowered, inhibiting proper operation of the classical transmission gate (complementary switches). The switch conductance for different input voltages changes, depending on the supply voltage VDD. In Fig. 59.22, the case for VDD = 5 V was shown. In Fig. 59.43, the case for VDD = 1 V is reported for comparison. In this case, there is a critical voltage region centered around VDD/2 in which neither switch is conducting. In SC circuits, to achieve rail-to-rail swing, the output of the op-amp must necessarily cross this critical region, where the switches connected to the op-amp output node will not properly operate at VDD = 1 V. On the other hand, op-amp operation can be achieved with proper design using a supply voltage as low as VTH + 2·VOV, with some modifications at system level.

FIGURE 59.43 Switch Ron with VDD = 1 V.

Switches and op-amp sections may use different supply voltages. In fact, using a voltage multiplier (a possible scheme is shown in Fig. 59.44 [29]), it is possible to generate on-chip a voltage higher than the supply voltage. This “multiplied” voltage can then be used to power the entire SC circuit (op-amp and switches) or to drive only the switches. If the higher supply is used to bias the op-amp and switches, standard design solutions can be implemented. In addition, the op-amp powered with a higher supply voltage can manage a larger signal swing, with a consequently larger dynamic range. However, in a scaled-down technology, the maximum acceptable electric field between gate and channel (for gate oxide breakdown) and between drain and source (for hot electron damage) must be reduced. This puts an absolute limit on the value of the multiplied supply voltage. In addition, the need to supply a dc-current to the op-amp from the multiplied supply forces the use of an external capacitor, which is an additional cost.

FIGURE 59.44 Charge-pump for on-chip voltage multiplication.

An alternative approach consists of using the multiplied supply to drive only the switches. In this case, the voltage multiplier does not supply any dc-current, thus avoiding any external capacitor. This solution, like the previous one, must not exceed the technology limit associated with the gate oxide breakdown. Nonetheless, this approach is widely used because it allows the filter to operate at a high sampling frequency.

In order to avoid any kind of voltage multiplier, the Switched-OpAmp (SOA) approach was proposed [30,31]. It is based on the following considerations:

1. The best condition for the switches driven with a low supply voltage is to be connected either to ground or to VDD. Thus, to operate properly, S2, S3, and S4 in Fig. 59.8(a) have to be referred to ground or to VDD (i.e., the op-amp input dc-voltage has to be either ground or VDD). This allows one to minimize the required op-amp supply voltage. On the other hand, the op-amp dc output voltage has to be at VDD/2 in order to have rail-to-rail output swing.
2. The switch S1 connected to the signal swing cannot operate properly for the full signal swing, as explained before.

The resulting SOA SC integrator is shown in Fig. 59.45. The SOA approach uses an op-amp that can operate in a tri-state mode. In this way, the critical output switch S1 is no longer necessary and can be eliminated, moving the critical problem to the op-amp design. The function of the eliminated critical switch S1 is implemented by turning the op-amp on and off through Sa, which is connected to ground. The input dc-voltage is set to ground. Therefore, all the switches are connected to ground (and realized with a single NMOS device) or to VDD (and realized with a PMOS device), and are driven with the maximum overdrive, given by VDD – VTH. Capacitor CDC in Fig. 59.45 gives a fixed charge injection into the virtual ground, producing a voltage level shift between the input (Vin_DC) and output (Vout_DC) op-amp dc-voltages. Since Vin_DC is set to ground, using CDC = CIN/2 sets Vout_DC = VDD/2. CDC has no effect on the signal transfer function, which is given by:

FIGURE 59.45 Switched-op-amp SC integrator.

$$H(z) = \frac{C_{IN} \cdot z^{-1/2}}{C_F \cdot \left(1 - z^{-1}\right)} \qquad (59.39)$$

A fully differential architecture of the scheme of Fig. 59.45 provides both signal polarities at each node, which is useful to build up high-order structures without any extra elements (e.g., an inverting stage). In addition, any disturbance (offset or noise) injected by CDC results in a common-mode signal, which is largely rejected by the fully differential operation. The SOA approach suffers from the following problems:

1. The SOA structure operates with an op-amp that is turned on and off. Its turn-on time becomes the main limitation on the maximum possible sampling frequency.
2. In an SOA structure, the output signal is available only during one clock phase; during the other clock phase, the output is set to zero (return-to-zero), as shown in Fig. 59.46. If the output signal is read as a continuous-time waveform, the return to zero has two effects: a loss of 6 dB in the transfer function, and an increased distortion due to the large output steps. On the other hand, when the SOA integrator is used in front of a sampled-data system (like an ADC), the output signal is sampled only when it is valid and both the above problems are cancelled.

FIGURE 59.46 Switched-op-amp output waveform.

A comparison between the three different low-voltage SC filter design approaches is given in Table 59.2.

TABLE 59.2 Comparison Between Different Low-voltage SC Designs

                                  Supply multiplier   Clock multiplier   Switched op-amp
VDDswitch                         VDDmult             VDDmult            VDD
VDDop-amp                         VDDmult             VDD                VDD
New op-amp design                 No                  No/Yes             Yes
New switch design                 No                  No                 Yes
Output swing                      +                   –                  –
Gate breakdown (VGS limit)        –                   –                  +
Hot electron (VDS limit)          –                   +                  +
Sampling frequency limitation     +                   +                  –
Power consumption                 –                   –                  +
External component                Yes                 No                 No
Continuous waveform: gain loss    1                   1                  1/2
Return-to-zero distortion         +                   +                  –

References

The literature about SC filters is so wide that any list of referred publications should be considered incomplete. In the following, just a few papers related to the discussed topics are indicated.

1. B. J. Hosticka, R. W. Brodersen, and P. R. Gray, MOS sampled-data recursive filters using switched-capacitor integrators, IEEE J. Solid-State Circuits, vol. SC-12, pp. 600-608, Dec. 1977.
2. J. T. Caves, M. A. Copeland, C. F. Rahim, and S. D. Rosenbaum, Sampled analog filtering using switched capacitors as resistor equivalents, IEEE J. Solid-State Circuits, vol. SC-12, pp. 592-599, Dec. 1977.
3. R. Gregorian and G. C. Temes, Analog MOS Integrated Circuits for Signal Processing, John Wiley & Sons, 1986.
4. G. T. Uehara and P. R. Gray, A 100MHz output rate analog-to-digital interface for PRML magnetic disk read channels in 1.2µ CMOS, IEEE Int. Solid State Circ. Conf. 1994, Digest of Tech. Papers, pp. 280-281, 1994.
5. P. E. Fleischer and K. R. Laker, A family of active switched-capacitor biquad building blocks, Bell Syst. Tech. J., vol. 58, pp. 2235-2269, 1979.
6. G. M. Jacobs, D. J. Allstot, R. W. Brodersen, and P. R. Gray, Design techniques for MOS switched-capacitor ladder filters, IEEE Trans. on Circ. and Syst., vol. CAS-25, pp. 1014-1021, Dec. 1978.
7. J. B. Shyu, G. C. Temes, and F. Krummenacher, Random error effects in matched MOS capacitors and current sources, IEEE J. of Solid-State Circuits, vol. SC-19, pp. 948-955, 1984.
8. D. G. Haigh and B. Singh, A switching scheme for switched-capacitor filters which reduces the effects of parasitic capacitances associated with switch control terminals, IEEE Intern. Symp. on Circ. and Systems, ISCAS 1993.
9. T. Brooks, D. H. Robertson, D. F. Kelly, A. DelMuro, and S. W. Harston, A cascaded sigma-delta pipeline A/D converter with 1.25MHz signal bandwidth and 89dB SNR, IEEE J. of Solid-State Circuits, vol. SC-32, pp. 1896-1906, Dec. 1997.
10. G. C. Temes, Finite amplifier gain and bandwidth effects in switched capacitor filters, IEEE J. Solid-State Circuits, vol. SC-15, pp. 358-361, June 1980.
11. K. Martin and A. S. Sedra, Effects of the op amp finite gain and bandwidth on the performance of switched-capacitor filters, IEEE Trans. Circ. Syst., vol. CAS-28, no. 8, pp. 822-829, Aug. 1981.
12. K. Lee and R. G. Meyer, Low-distortion switched-capacitor filter design techniques, IEEE Journal of Solid State Circuits, Dec. 1985.
13. C. A. Gobet and A. Knob, Noise analysis of switched-capacitor networks, IEEE Trans. on Circ. and Systems, vol. CAS-30, pp. 37-43, Jan. 1983.
14. J. H. Fischer, Noise sources and calculation techniques for switched-capacitor filters, IEEE Journal of Solid State Circuits, vol. SC-17, pp. 742-752, Aug. 1982.
15. C. Enz and G. C. Temes, Circuit techniques for reducing the effects of opamp imperfections: autozeroing, correlated double sampling, and chopper stabilization, Proceedings of IEEE, vol. 84, no. 11, pp. 1584-1614, Nov. 1996.
16. K. K. K. Lam and M. A. Copeland, Noise-cancelling switched-capacitor (SC) filtering technique, Electronics Letters, vol. 19, pp. 810-811, Sept. 1983.
17. K. Nagaraj, T. R. Viswanathan, K. Singhal, and J. Vlach, Switched-capacitor circuits with reduced sensitivity to amplifier gain, IEEE Trans. on Circuits and Systems, vol. CAS-34, pp. 571-574, May 1987.
18. L. E. Larson and G. C. Temes, Switched-capacitor building-blocks with reduced sensitivity to finite amplifier gain, bandwidth, and offset voltage, IEEE International Symposium on Circuits and Systems (ISCAS ‘87), pp. 334-338, 1987.
19. K. Haug, F. Maloberti, and G. C. Temes, Switched-capacitor integrators with low finite-gain sensitivity, Electronics Letters, vol. 21, pp. 1156-1157, Nov. 1985.
20. K. Nagaraj, J. Vlach, T. R. Viswanathan, and K. Singhal, Switched-capacitor integrator with reduced sensitivity to amplifier gain, Electronics Letters, vol. 22, pp. 1103-1105, Oct. 1986.
21. A. Baschirotto, R. Castello, and F. Montecchi, Finite gain compensated double-sampled switched-capacitor integrator for high-Q bandpass filters, IEEE Trans. Circuits Syst., vol. CAS-I 39, no. 6, June 1992.
22. T. Huo and D. J. Allstot, MOS SC highpass/notch ladder filter, Proc. IEEE Int. Symp. Circ. Syst., pp. 309-312, May 1980.
23. K. Nagaraj, A parasitic insensitive area efficient approach to realizing very large time constant in switched-capacitor circuits, IEEE Trans. on Circuits and Systems, vol. CAS-36, pp. 1210-1216, Sept. 1989.
24. J. J. F. Rijns and H. Wallinga, Spectral analysis of double-sampling switched-capacitor filters, IEEE Trans. on Circ. and Systems, vol. 38, no. 11, pp. 1269-1279, Nov. 1991.
25. A. Baschirotto, F. Montecchi, and R. Castello, A 15MHz 20mW BiCMOS switched-capacitor biquad operating with 150Ms/s sampling frequency, IEEE Journal of Solid State Circuits, pp. 1357-1366, Dec. 1995.
26. A. Baschirotto, Considerations for the design of switched-capacitor circuits using precise-gain operational amplifiers, IEEE Transaction on Circuits and Systems. II., vol. 43, no. 12, pp. 827-832, Dec. 1996.
27. R. Castello, F. Montecchi, F. Rezzi, and A. Baschirotto, Low-voltage analog filter, IEEE Transactions on Circuits and Systems. II., pp. 827-840, Nov. 1995.
28. A. Baschirotto and R. Castello, 1V switched-capacitor filters, Workshop on Advances in Analog Circuit Design, Copenhagen, 28-30 Apr. 1998.
29. J. F. Dickson, On-chip high-voltage generation in MNOS integrated circuits using an improved voltage multiplier technique, IEEE J. of Solid-State Circuits, vol. SC-11, no. 3, pp. 374-378, June 1976.
30. J. Crols and M. Steyaert, Switched-opamp: an approach to realize full CMOS switched-capacitor circuits at very low power supply voltages, IEEE J. of Solid-State Circuits, vol. SC-29, no. 8, pp. 936-942, Aug. 1994.
31. A. Baschirotto and R. Castello, A 1V 1.8MHz CMOS Switched-opamp SC filter with rail-to-rail output swing, IEEE Journal of Solid State Circuits, pp. 1979-1986, Dec. 1997.

© 2000 by CRC Press LLC

Dharchoudhury, A., et al "Timing and Signal Integrity Analysis" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

© 2000 by CRC PRESS LLC

60 Timing and Signal Integrity Analysis

Abhijit Dharchoudhury
David Blaauw
Motorola, Inc.

Stantanu Ganguly
Intel Corp.

60.1 Introduction
60.2 Static Timing Analysis
DCC Partitioning • Timing Graph • Arrival Times • Required Times and Slacks • Clocked Circuits • Transistor-Level Delay Modeling • Interconnects and Static TA • Process Variations and Static TA • Timing Abstraction • False Paths
60.3 Noise Analysis
Sources of Digital Noise • Crosstalk Noise Failures • Modeling of Interconnect and Gates for Noise Analysis • Input and Output Noise Models • Linear Circuit Analysis • Interaction with Timing Analysis • Fast Noise Calculation Techniques • Noise, Circuit Delays, and Timing Analysis
60.4 Power Grid Analysis
Problem Characteristics • Power Grid Modeling • Block Current Signatures • Matrix Solution Techniques • Exploiting Hierarchy

60.1 Introduction

Microprocessors are rapidly moving into deep submicron dimensions, gigahertz clock frequencies, and transistor counts in excess of 10 million. This trend is being fueled by the ever-increasing demand for more powerful computers on one side and by rapid advances in process technology, architecture, and circuit design on the other. At these small dimensions and high speeds, timing and signal integrity analyses play a critical role in ensuring that designs meet their performance and reliability goals.

Timing analysis is one of the most important verification steps in the design of a microprocessor because it ensures that the chip meets its speed requirements. Timing analysis of multi-million transistor microprocessors is a very challenging task, made even more challenging because, in the deep submicron regime, transistor-level and interconnect-centric analyses become vital. Therefore, timing analysis must satisfy the two conflicting requirements of accurate low-level analysis (so that deep submicron designs can be handled) and efficient high-level abstraction (so that large designs can be handled).

The term signal integrity typically refers to analyses that check that signals do not assume unintended values due to circuit noise. Circuit noise is a broad term that applies to phenomena caused by unintended circuit behavior, such as unintentional coupling between signals and degradation of voltage levels due to leakage currents and power supply voltage drops. Circuit noise does not encompass physical noise effects (e.g., thermal noise) or manufacturing faults (e.g., stuck-at faults). Signal integrity is also becoming a very critical verification task. Among the various signal integrity-related issues, noise induced by coupling between adjacent wires is perhaps the most important one. With the scaling of process technologies, coupling capacitances between wires are becoming a larger fraction of the total wire capacitances.

Coupling capacitances are also larger because a larger number of metal layers are now available for routing, and more and more wires are running longer distances across the chip. As operating frequencies increase, noise induced on signal nets due to coupling is much greater. Noise-related functional failures are increasing as dynamic circuits become more prevalent, with circuit designers looking for increased performance at the cost of noise immunity. Another important problem in submicron high-performance designs is the integrity of the power grid that distributes power from off-chip pads to the various gates and devices in the chip. Increased operating frequencies result in higher current demands from the power and ground lines, which in turn increases the voltage drops seen at the devices. Excessive voltage drops reduce circuit performance and inject noise into the circuit, which may lead to functional failures. Moreover, with reductions in supply voltages, problems caused by excessive voltage drops become more severe. The analysis of the power and ground distribution network to measure the voltage drops at the points where the gates and devices of the chip connect to the power grid is called IR-drop or power grid analysis. In this chapter, we will briefly discuss the important issues in static timing analysis, noise analysis with particular emphasis on coupling noise, and IR-drop analysis methods. Additional information on these topics is available in the literature and the reader is encouraged to look through the list of references.

60.2 Static Timing Analysis

Static timing analysis (TA) [1-4] is a very powerful technique for verifying the timing correctness of a design. The power of this technique comes from the fact that it is pattern independent, implicitly verifies all signal propagation paths in the design, and is applicable to very large designs. Further, it lends itself easily to higher levels of abstraction, which makes it even more computationally feasible to perform full-chip timing analysis.

The fundamental idea in static timing analysis is to find the critical paths in the design. Critical paths are those signal propagation paths that determine the maximum operating frequency of the design. It is easiest to think of critical paths as being those paths from the inputs to the outputs of the circuit that have the longest delay. Since the smallest clock period must be larger than the longest path delay, these paths dictate the operating frequency of the chip. In very simple terms, static TA determines these long paths using breadth-first search as follows. Starting at the inputs, the latest time at which signals arrive at a node in the circuit is determined from the arrival times at its fan-in nodes. This latest arrival time is then propagated toward the primary outputs. At each primary output, we obtain the latest possible arrival time of signals and the corresponding longest path. If the longest path does not meet the timing constraints imposed by the designer, then a violation is detected. Alternatively, if the longest path meets the timing constraints, then all other paths in the circuit will also satisfy the timing constraints. By propagating only the latest arrival time at a node, static TA does not have to explicitly enumerate all the paths in the design.

Historically, simulation-based (dynamic) timing analysis was the most common timing analysis technique.
However, with the increasing complexity and size of recent microprocessor designs, static timing analysis has become an indispensable part of design verification and much more popular than dynamic approaches. Compared to dynamic approaches, static TA offers a number of advantages for verifying the timing correctness of a design. Dynamic approaches are pattern-dependent. Since the possible paths and their delays depend on the state of the circuit, the number of input patterns required to verify all the paths in a circuit is exponential in the number of inputs. Hence, only a subset of paths can be verified with a fixed number of input patterns. Moreover, only moderately large circuits can be verified because of the computational cost and size limitations of transient simulators. Static TA, on the other hand, implicitly verifies all the longest paths in the design without requiring input patterns.

Dynamic timing analysis is still heavily used to verify complex and critical circuitry such as PLLs, clock generators, and the like. Dynamic simulation is also used to generate timing models for block-level static timing analysis. Dynamic timing analysis techniques rely on a circuit simulator (e.g., SPICE [5]) or on a fast timing simulator (e.g., ILLIADS [6], ACES [7], TimeMill [8]) for performing the simulations. Because of the importance of static techniques in verifying the timing behavior of microprocessors, we will restrict the discussion below to the salient points of static TA.

DCC Partitioning

The first step in transistor-level static TA is to partition the circuit into dc connected components (DCCs), also called channel-connected components. A DCC is a set of nodes that are connected to each other through the source and drain terminals of transistors. The transistor-level representation and the DCC partitioning of a simple circuit are shown in Fig. 60.1. As seen in the diagram, a DCC is the same as the gate for typical cells such as inverters, NAND gates, and NOR gates. For more complex structures such as latches, a single cell corresponds to multiple DCCs. The inputs of a DCC are the primary inputs of the circuit or the gate nodes of the devices that are part of the DCC. The outputs of a DCC are either primary outputs of the circuit or nodes that are connected to the gate nodes of devices in other DCCs. Since the gate current is zero and currents flow between the source and drain terminals of MOS devices, a MOS circuit can be partitioned at the gates of transistors into components that can then be analyzed independently. This makes the analysis computationally feasible since, instead of analyzing the entire circuit, we can analyze the DCCs one at a time. By partitioning a circuit into DCCs, we are ignoring the current conducted by the MOS parasitic capacitances that couple the source/drain and gate terminals. Since this current is typically small, the error is small.

FIGURE 60.1 Transistor-level circuit partitioned into DCCs.

As mentioned above, DCC partitioning is required for transistor-level static TA. For higher levels of abstraction, such as gate-level static TA, the circuit has already been partitioned into gates, and their inputs are known. In such cases, one starts by constructing the timing graph as described in the next section.
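The partitioning step can be illustrated with a small union-find sketch. This is only one possible way to do it, not the method of any particular tool: the netlist tuple format, the rail names `VDD`/`GND`, and the function name `partition_dccs` are assumptions made for this example, as is the common convention that supply rails do not merge components.

```python
from collections import defaultdict

SUPPLY = {"VDD", "GND"}  # assumption: rail names; rails never merge DCCs


def partition_dccs(transistors):
    """Group transistors into DCCs with a union-find over channel nodes.

    transistors: list of (name, gate, source, drain) tuples (hypothetical format).
    Returns a sorted list of sorted transistor-name lists, one per DCC.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for _name, _gate, src, drn in transistors:
        # Source and drain are joined by the channel; the gate is isolated.
        if src not in SUPPLY and drn not in SUPPLY:
            parent[find(src)] = find(drn)

    dccs = defaultdict(list)
    for name, _gate, src, drn in transistors:
        # Each device belongs to the component of its non-supply channel node.
        channel_node = src if src not in SUPPLY else drn
        dccs[find(channel_node)].append(name)
    return sorted(sorted(names) for names in dccs.values())


# A 2-input NAND driving an inverter: two DCCs are expected.
netlist = [
    ("MP1", "a", "VDD", "y"), ("MP2", "b", "VDD", "y"),
    ("MN1", "a", "y", "n1"), ("MN2", "b", "n1", "GND"),
    ("MP3", "y", "VDD", "z"), ("MN3", "y", "z", "GND"),
]
print(partition_dccs(netlist))
```

Node `y` drives only gate terminals of the inverter, so the cut falls there and the NAND and the inverter end up in separate components.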

Timing Graph

The fundamental data structure in static TA is the timing graph. The timing graph is a graphical representation of the circuit, where each vertex in the graph corresponds to an input or an output node of the DCCs or gates of the circuit. Each edge or timing arc in the graph corresponds to a signal propagation from the input to the output of the DCC or gate. Each timing arc has a polarity defined by the type of transition at the input and output nodes. For example, there are two timing arcs from the input to the output of an inverter: one corresponds to the input rising and the output falling, and the other to the input falling and the output rising. Each timing arc in the graph is annotated with the propagation delay of the signal from the input to the output. The gate-level representation of a simple circuit is shown in Fig. 60.2(a) and the corresponding timing graph is shown in Fig. 60.2(b). The solid line timing arcs correspond to falling input transitions and rising output transitions, whereas the dotted line arcs represent rising input transitions and falling output transitions.

FIGURE 60.2 A simple digital circuit: (a) gate-level representation, and (b) timing graph.

Note that the timing graph may have cycles, which correspond to feedback loops in the circuit. Combinational feedback loops are broken, and there are several strategies to handle sequential loops (or cycles of latches) [5]. In any event, the timing graph becomes acyclic and the vertices of the graph can be arranged in topological order.
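As a rough illustration of arranging the vertices in topological order, the sketch below uses Kahn's algorithm, which is a standard choice but not one the text prescribes. The function name `levelize` is hypothetical, and the connectivity of nodes a, b, a1, b1, and d is inferred from the worked example in the text rather than read from Fig. 60.2.

```python
from collections import deque


def levelize(arcs):
    """Return the vertices of an acyclic timing graph in topological order.

    arcs: iterable of (u, v) pairs, one per timing arc from u to v.
    Raises ValueError if a cycle remains, i.e., a loop was not broken.
    """
    succ, indeg = {}, {}
    for u, v in arcs:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)

    # Start from the source nodes (no incoming arcs).
    ready = deque(sorted(n for n, d in indeg.items() if d == 0))
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:  # all fan-in nodes of v are already placed
                ready.append(v)

    if len(order) != len(indeg):
        raise ValueError("timing graph still contains a cycle")
    return order


order = levelize([("a", "a1"), ("b", "b1"), ("a1", "d"), ("b1", "d")])
print(order)
```

The cycle check gives an early warning when a feedback loop has been left unbroken, which would otherwise make the later arrival-time pass ill-defined.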

Arrival Times

Given the times at which the signals at the primary inputs or source nodes of the circuit are stable, the minimum (earliest) and maximum (latest) arrival times of signals at all the nodes in the circuit can be calculated with a single breadth-first pass through the circuit in topological order. The early arrival time a(v) is the smallest time by which signals arrive at node v and is given by

$$a(v) = \min_{u \in FI(v)} \bigl[ a(u) + d_{uv} \bigr] \qquad (60.1)$$

Similarly, the late arrival time A(v) is the latest time by which signals arrive at node v and is given by

$$A(v) = \max_{u \in FI(v)} \bigl[ A(u) + d_{uv} \bigr] \qquad (60.2)$$

In the above equations, FI(v) is the set of all fan-in nodes of v, i.e., all nodes that have an edge to v, and $d_{uv}$ is the delay of an edge from u to v. Equations 60.1 and 60.2 compute the arrival times at a node v from the arrival times of its fan-in nodes and the delays of the timing arcs from the fan-in nodes to v. Since the timing graph is acyclic (or has been made acyclic), the vertices in the graph can be arranged in topological order (i.e., the DCCs and gates in the circuit can be levelized). A breadth-first pass through the timing graph using Eqs. 60.1 and 60.2 will yield the arrival times at all nodes in the circuit.

Considering the example of Fig. 60.2, let us assume that the arrival times at the primary inputs a and b are 0. From Eq. 60.2, the maximum arrival time for a rising signal at node a1 is 1, and the maximum arrival time for a falling signal is also 1. In other words, $A_{a1,r} = A_{a1,f} = 1$, where the subscripts r and f denote the polarity of the signal. Similarly, we can compute the maximum arrival times at node b1 as $A_{b1,r} = A_{b1,f} = 1$, and at node d as $A_{d,r} = 2$ and $A_{d,f} = 3$.

In addition to the arrival times, we also need to compute the signal transition times (or slopes) at the output nodes of the gates or DCCs. These transition times are required so that we can compute the delay across the fan-out gates. Note that there are many timing arcs that are incident at the output node and each gives rise to a different transition time. The transition time of the node is picked to be the transition time corresponding to the arc that causes the latest (earliest) arrival time at the node.
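The forward pass of Eqs. 60.1 and 60.2 can be sketched as follows. The arc delays below are not read from Fig. 60.2; they are inferred from the worked numbers in the text and should be treated as assumptions, as should the tuple format and the function name `arrival_times`.

```python
# Timing arcs as (from_node, from_edge, to_node, to_edge, delay).
# "r"/"f" denote rising/falling; delays are inferred from the text's example.
ARCS = [
    ("a", "r", "a1", "f", 1.0), ("a", "f", "a1", "r", 1.0),
    ("b", "r", "b1", "f", 1.0), ("b", "f", "b1", "r", 1.0),
    ("a1", "r", "d", "f", 1.5), ("a1", "f", "d", "r", 1.0),
    ("b1", "r", "d", "f", 2.0), ("b1", "f", "d", "r", 1.0),
]
ORDER = ["a", "b", "a1", "b1", "d"]  # a topological order of the nodes


def arrival_times(arcs, order, primary_inputs, t0=0.0):
    """Return (early, late) arrival times keyed by (node, edge)."""
    late = {(n, e): t0 for n in primary_inputs for e in "rf"}
    early = dict(late)
    for node in order:
        # Scanning all arcs per node is quadratic; fine for a small sketch.
        for u, ue, v, ve, d in arcs:
            if v != node:
                continue
            key = (v, ve)
            late[key] = max(late.get(key, float("-inf")), late[(u, ue)] + d)    # Eq. 60.2
            early[key] = min(early.get(key, float("inf")), early[(u, ue)] + d)  # Eq. 60.1
    return early, late


early, late = arrival_times(ARCS, ORDER, primary_inputs=["a", "b"])
print(late[("d", "r")], late[("d", "f")])
```

With these assumed delays, the late arrival times at node d come out as 2 (rising) and 3 (falling), in line with the values quoted in the text.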

Required Times and Slacks

Constraints are placed on the arrival times of signals at the primary output nodes of a circuit based on performance or speed requirements. In addition to primary output nodes, timing constraints are automatically placed on the clocked elements inside the circuit (e.g., latches, gated clocks, domino logic gates, etc.). These timing constraints check that the circuit functions correctly and at-speed. Nodes in the circuit where timing checks are imposed are called sink nodes. Timing checks at the sink nodes inject required times on the earliest and latest signal arrival times at these nodes. Given the required times at these nodes, the required times at all other nodes in the circuit can be calculated by processing the circuit in reverse topological order considering each node only once. The late required time R(v) at a node v is the required time on the late arriving signal. In other words, it is the time by which signals are required to arrive at that node and is given by

$$R(v) = \min_{u \in FO(v)} \bigl[ R(u) - d_{vu} \bigr] \qquad (60.3)$$

Similarly, the early required time r(v) is the required time on the early arriving signal. In other words, it is the time after which signals are required to arrive at node v and is given by

$$r(v) = \max_{u \in FO(v)} \bigl[ r(u) - d_{vu} \bigr] \qquad (60.4)$$

In these equations, FO(v) is the set of fan-out nodes of v (i.e., the nodes to which there is a timing arc from node v) and $d_{vu}$ is the delay of the timing arc from node v to node u. The minimum in Eq. 60.3 and the maximum in Eq. 60.4 select the tightest requirement imposed by any fan-out node. Note that R(v) is the time before which a signal must arrive at a node, whereas r(v) is the time after which the signal must arrive. The difference between the late required time and the late arrival time at a node v is defined as the late slack at that node and is given by

$$S_l(v) = R(v) - A(v) \qquad (60.5)$$

Similarly, the early slack at node v is defined by

$$S_e(v) = a(v) - r(v) \qquad (60.6)$$

Note that the late and early slacks have been defined in such a way that a negative value denotes a constraint violation. The overall slack at a node is the smaller of the early and late slacks; that is,

$$S(v) = \min \bigl[ S_l(v), S_e(v) \bigr] \qquad (60.7)$$

Slacks can be calculated in the backward traversal along with the required times. If the slacks at all nodes in the circuit are positive, then the circuit does not violate any timing constraint. The nodes with the smallest slack value are called critical nodes. The most critical path is the sequence of critical nodes that connect the source and sink nodes.

Continuing with the example of Fig. 60.2, let the maximum required time at the output node d be 1. Then, the late required time for a rising signal at node a1 is $R_{a1,r} = R_{d,f} - 1.5 = -0.5$, since the delay of the rising-to-falling timing arc from a1 to d is 1.5. Similarly, the late required time for a falling signal at node a1 is $R_{a1,f} = R_{d,r} - 1 = 0$. The required times at the other nodes in the circuit can be calculated to be: $R_{b1,r} = -1$, $R_{b1,f} = 0$, $R_{a,r} = -1$, $R_{a,f} = -1.5$, $R_{b,r} = -1$, and $R_{b,f} = -2$. The slack at each node is the difference between the required time and the arrival time; the values are as follows: $S_{d,r} = -1$, $S_{d,f} = -2$, $S_{a1,r} = -1.5$, $S_{a1,f} = -1$, $S_{b1,r} = -2$, $S_{b1,f} = -1$, $S_{a,r} = -1$, $S_{a,f} = -1.5$, $S_{b,r} = -1$, and $S_{b,f} = -2$. Thus, the critical path in this circuit is b falling → b1 rising → d falling, and the circuit slack is -2.
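The backward pass can be sketched in the same style. The code below propagates late required times in reverse topological order, keeping the tightest (smallest) requirement over the fan-out arcs, and then forms the late slack as required time minus arrival time. The arc delays are inferred from the worked example and the late arrival times are copied from the text; the names `late_required` and `LATE_ARRIVAL` are hypothetical.

```python
# Timing arcs as (from_node, from_edge, to_node, to_edge, delay); delays are
# inferred from the worked example, not read from Fig. 60.2.
ARCS = [
    ("a", "r", "a1", "f", 1.0), ("a", "f", "a1", "r", 1.0),
    ("b", "r", "b1", "f", 1.0), ("b", "f", "b1", "r", 1.0),
    ("a1", "r", "d", "f", 1.5), ("a1", "f", "d", "r", 1.0),
    ("b1", "r", "d", "f", 2.0), ("b1", "f", "d", "r", 1.0),
]
ORDER = ["a", "b", "a1", "b1", "d"]  # a topological order of the nodes

# Late arrival times quoted in the text for the same example.
LATE_ARRIVAL = {
    ("a", "r"): 0.0, ("a", "f"): 0.0, ("b", "r"): 0.0, ("b", "f"): 0.0,
    ("a1", "r"): 1.0, ("a1", "f"): 1.0, ("b1", "r"): 1.0, ("b1", "f"): 1.0,
    ("d", "r"): 2.0, ("d", "f"): 3.0,
}


def late_required(arcs, order, sink_required):
    """Propagate late required times backward from the sink nodes."""
    req = dict(sink_required)
    for node in reversed(order):
        for u, ue, v, ve, d in arcs:
            if u != node:
                continue
            key = (u, ue)
            # Keep the tightest requirement over all fan-out arcs.
            req[key] = min(req.get(key, float("inf")), req[(v, ve)] - d)
    return req


required = late_required(ARCS, ORDER, {("d", "r"): 1.0, ("d", "f"): 1.0})
slack = {k: required[k] - LATE_ARRIVAL[k] for k in required}  # late slack
worst = min(slack.values())
critical = sorted(k for k, s in slack.items() if s == worst)
print(worst, critical)
```

Under these assumptions the worst slack is -2 and the critical nodes are b falling, b1 rising, and d falling, matching the critical path identified in the text.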

Clocked Circuits

As mentioned earlier, combinational circuits have timing checks imposed only at the circuit primary outputs. However, for circuits containing clocked elements such as latches, flip-flops, gated clocks, domino/precharge logic, etc., timing checks must also be enforced at various internal nodes in the circuit to ensure that the circuit operates correctly and at-speed. In circuits containing clocked elements, a separate recognition step is required to detect the clocked elements and to insert constraints. There are two main techniques for detecting clocked elements: pattern recognition and clock propagation.

In pattern recognition-based approaches, commonly used sequential elements are recognized using simple topological rules. For example, back-to-back inverters in the netlist are often an indication of a latch. For more complex topologies, the detection is accomplished using templates supplied by the user. Portions of a circuit are typically recognized in the graph of the original circuit by employing subgraph isomorphism algorithms [9]. Once a subcircuit has been recognized, timing constraints are automatically inserted. Another application of pattern-based subcircuit recognition is to determine logical relationships between signals. For example, in pass-gate multiplexors, the data select lines are typically one-hot. This relationship cannot be obtained from the transistor-level circuit representation without recognizing the subcircuit and imposing the logical relationships for that subcircuit. The logical relationship can then be used by timing analysis tools. However, purely pattern recognition-based approaches can be restrictive and may necessitate a large number of templates from the user for proper functioning.

In clock propagation-based approaches, the recognition is performed automatically by propagating clock signals along the timing graph and determining how these clock signals interact with data signals at various nodes in the circuit. The primary input clocks are identified by the user and are marked as (simple) clock nodes. Starting from the primary clock inputs and traversing the timing arcs in the timing graph, the types of the nodes are determined based on simple rules. These rules are illustrated in Fig. 60.3,

FIGURE 60.3 Sequential element detection: (a) simple clock, (b) gated clock, (c) merged clock, (d) latch node, and (e) footed and footless domino gates. Broken arcs are shown as dotted lines. Each arc is marked with the type of output transition(s) it can cause (e.g., R/F: rise and fall, R: rise only, and F: fall only).

where we show the transistor-level subcircuits and the corresponding timing subgraphs for some common sequential elements.
• A node that has only one clock signal incident on it and no feedback is classified as a simple clock node (Fig. 60.3(a)).
• A node that has one clock and one or more data signals incident on it, but no feedback, is classified as a gated clock node (Fig. 60.3(b)).
• A node that has multiple clock signals (and zero or more data signals) incident on it and no feedback is classified as a merged clock node (Fig. 60.3(c)).
• A node that has at least one clock and zero or more data signals incident on it and has a feedback of length two (i.e., back-to-back timing arcs) is classified as a latch node (Fig. 60.3(d)). The other node in the two-node feedback is called the latch output node. A latch node is of type data. The timing arc(s) from the latch output node to the latch node is (are) broken.

Latches can be of two types: level-sensitive and edge-triggered. To distinguish between edge-triggered and level-sensitive latches, various rules may be applied. These rules are usually design-specific and will not be discussed here. It is assumed that all latches are level-sensitive unless the user has marked certain latches to be edge-triggered.

Note that the domino gates of Fig. 60.3(e) also satisfy the conditions for a latch node. For a latch node, both data and clock signals cause rising and falling transitions at the latch node. For domino gates, data inputs a and b cause only falling transitions at the domino node x. This condition can be used to distinguish domino nodes from latch nodes. Footed and footless domino gates can be distinguished from each other by looking at the clock transitions on the domino node. Since the footed gate has the clocked nMOS transistor at the "foot" of the evaluate tree, the clock signal at CK causes both rising and falling transitions at node x. In the footless domino gate, CK causes only a rising transition at node x.
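The classification rules above can be condensed into a small decision function. This is a simplified sketch under stated assumptions: the function `classify` and its parameters are hypothetical, the caller is assumed to have already counted the clock and data signals incident on a node and detected any length-two feedback, and a node with no incident clock is simply labeled a data node.

```python
def classify(n_clocks, n_data, has_feedback2,
             data_edges=frozenset("RF"), clock_edges=frozenset("RF")):
    """Classify a timing-graph node from the signals incident on it.

    n_clocks, n_data: number of incident clock and data signals.
    has_feedback2:    True if the node is in a feedback loop of length two.
    data_edges:       transitions ("R", "F") that data inputs cause at the node.
    clock_edges:      transitions that the clock causes at the node.
    """
    if n_clocks == 0:
        return "data"  # assumption: no clock reaches this node
    if has_feedback2:
        # Domino nodes look like latches, except that data inputs can only
        # pull the node low.
        if n_data > 0 and set(data_edges) == {"F"}:
            if set(clock_edges) == {"R", "F"}:
                return "footed domino"
            return "footless domino"  # clock causes only a rising transition
        return "latch"
    if n_clocks > 1:
        return "merged clock"
    return "gated clock" if n_data >= 1 else "simple clock"


print(classify(1, 0, False))                    # clock buffer output
print(classify(1, 1, False))                    # clock gated by one data input
print(classify(1, 2, True, data_edges={"F"}))   # footed domino node
```

A real tool would derive these counts and edge sets from the timing arcs during the breadth-first clock propagation; they are passed in directly here to keep the example short.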
Clock propagation stops when a node has been classified as a data node. This type of detection can be easily performed with a simple breadth-first search on the timing graph.

Once the sequential elements have been recognized, timing constraints must be inserted to ensure that the circuit functions correctly and at-speed [10]. These are described below and illustrated in Figs. 60.4 and 60.5.
• Simple clocks: In this case, no timing checks are necessary. The arrival times and slopes at the simple clock node are obtained just as for a normal data node.
• Gated clocks: The basic purpose of a gated clock is to enable or disable clock transitions at the input of the gate from propagating to the output of the gate. This is done by setting the value of the data input. For example, in the gated clock of Fig. 60.3(b), setting the data input to 1 will allow the clock waveform to propagate to the output, whereas setting the data input to 0 will disable transitions at the gate output. To make sure that this is indeed the behavior of the gated clock, the timing constraints should be such that transitions at the data input node(s) do not create transitions at the output node. For the gated NAND clock of Fig. 60.3(b), we have to ensure that the data can transition (high or low) only when the clock is low, i.e., data can transition after the clock turns low (short path constraint) and before the clock turns high (long path constraint). This is shown in Fig. 60.4(a). In addition to imposing this timing constraint, we also break the timing arc from the data node to the gated clock node since data transitions cannot create output clock transitions.
• Merged clocks: Merged clocks are difficult to handle in static TA since the output clock waveform may have a different clock period compared to the input clocks. Moreover, the output clock waveform depends on the logical operation performed by the gate. To avoid these problems, static TA tools typically ask the user to provide the waveform at the merged clock node, and the merged clock node is treated as a (simple) clock input node with that waveform. Users can obtain the clock waveform at the merged clock node by using dynamic simulation with the input clock waveforms.

FIGURE 60.4 Timing constraints and timing graph modifications for sequential elements: (a) gated clock, (b) edge-triggered latch, and (c) level-sensitive latch. Broken arcs are shown as dotted lines.

• Edge-triggered latches: An edge-triggered latch has two types of constraints: set-up constraint and hold constraint. The set-up constraint requires that the data input node should be ready (i.e., the rising and falling signals should have stabilized) before the latch turns on. In the latch shown in Fig. 60.3(d), the latch is turned on by the rising edge of the clock. Hence, the data should arrive some time before the rising edge of the clock (this time margin is typically referred to as the setup time of the latch). This constraint imposes a required time on the latest (or maximum) arrival time at the data input of the latch and is therefore a long path constraint. This is shown in Fig. 60.4(b). The hold constraint ensures that data meant for the current clock cycle does not accidentally appear during the on-phase of the previous clock cycle. Looking at Fig. 60.4(b), this implies that the data should appear some time after the falling edge of the clock (this time margin is called the hold time of the latch). The hold time imposes a required time on the early (or minimum) arrival time at the data input node and is therefore a short path constraint. As the name implies, in edge-triggered latches, the on-edge of the clock causes data to be stored in the latch (i.e., causes transitions at the latch node). Since the data input is ready before the clock turns on, the latest arrival time at the latch node will be determined only by the clock signal. To make sure that this is indeed the behavior of the latch, the timing arc from the data input node to the latch node is broken, as shown in Fig. 60.4(b). One additional set of timing constraints is imposed for an edge-triggered latch. Since data is stored at the latch (or latch output) node, we must ensure that the data gets stored before the latch turns off. In other words, signals should arrive at the latch output node before the off-edge of the clock. 
• Level-sensitive latches: In the case of level-sensitive latches, the data need not be ready before the latch turns on, as is the case for edge-triggered latches. In fact, the data can arrive after the on-edge of the clock; this is called cycle stealing or time borrowing. The only constraint in this case is that the data gets latched before the clock turns off. Hence, the set-up constraint for a level-sensitive latch is that signals should arrive at the latch output node (not the latch node itself) before the falling edge of the clock, as shown in Fig. 60.4(c). The hold constraint is the same as before; it ensures that data meant for the current clock cycle arrives only after the latch was turned off in the previous clock cycle. This is also shown in Fig. 60.4(c). Since the latest arriving signal at the latch node may come from either the data or the clock node, timing arcs are not broken for a level-sensitive latch. Since data can flow through the latch, level-sensitive latches are also referred to as transparent latches.
• Domino gates: Domino circuits have two distinct phases of operation: precharge and evaluate [11]. Looking at the domino gate of Fig. 60.3(e), we see that in the precharge phase, the clock signal is low, the domino node x is precharged to a high value, and the output node y is pre-discharged to a low value. During the evaluate phase, the clock is high and, if the values of the gate inputs establish a path to ground, domino node x is discharged and output node y turns high. The difference between footed and footless domino gates is the clocked nMOS transistor at the "foot" of the nMOS evaluate tree. To demonstrate the timing constraints imposed on domino circuits, consider the domino circuit block diagram and the clock waveforms shown in Fig. 60.5. The footed domino blocks are labeled FD1 and FD2, and the footless blocks are labeled FLD1 and FLD2. From Fig. 60.5(b), note that all three clocks have the same period 2T, but the falling edge of CK2 is 0.25T after the falling edge of CK1, which in turn is 0.5T after the falling edge of CK0. Therefore, the precharge phase for FD1 and FD2 is T, for FLD1 is 0.5T, and for FLD2 is 0.25T. The various timing constraints for domino circuits are illustrated in Fig. 60.5 and discussed below.

FIGURE 60.5 Domino circuit: (a) block diagram, and (b) clock waveforms and precharge and evaluate constraints. Note precharge implies the phase of operation (clock); the signals are falling.

1. We want the output O to evaluate (rise) before the clock starts falling and to precharge (fall) before the clock starts rising.
2. Consider node N1, which is an output of FD1 and an input of FD2. N1 starts precharging (falling) when CK0 falls, and the constraint on it is that it should finish precharging before CK0 starts rising.
3. Next, consider node N2, which is an input to FLD1 clocked by CK1. Since this block is footless, N2 should be low during the precharge phase to avoid short-circuit current. N2 starts precharging (falling) when CK0 starts falling and should finish falling before CK1 starts falling. Note that the falling edges of CK0 and CK1 are 0.5T apart, and the precharge constraint is on the late or maximum arrival time of N2 (long path constraint). Also, N2 should start rising only after CK1 has finished rising. This is a constraint on the early or minimum arrival time of N2 (short path constraint). In this example, N2 starts rising with the rising edge of CK0 and, since all the clock waveforms rise at the same time, the short path constraint will be satisfied trivially.
4. Finally, consider node N3. Since N3 is an input of FLD2, it must satisfy the short-circuit current constraints. N3 starts precharging (falling) when CK1 starts falling and it should fall completely before CK2 starts falling. Since the two clock edges are 0.25T apart, the precharge constraint on N3 is tighter than the one on N2. As before, the short path constraint on N3 is satisfied trivially.

The above discussion highlights the various types of timing constraints that must be automatically inserted by the static TA tool. Note that each relative timing constraint between two signals is actually composed of two constraints. For example, if signal d must rise before clock CK rises, then (1) there is a required time on the late or maximum rising arrival time at node d (i.e., $A_{d,r} < A_{CK,r}$), and (2) there is a required time on the early or minimum rising arrival time at the clock node CK (i.e., $a_{CK,r} < a_{d,r}$).

There is one other point to be noted. Set-up and hold constraints are fundamentally different in nature. If a hold constraint is violated, then the circuit will not function at any frequency. In other words, hold constraints are functional constraints. Set-up constraints, on the other hand, are performance constraints. If a set-up constraint is violated, the circuit will not function at the specified frequency, but it will function at a lower frequency (lower speed of operation). For domino circuits, precharge constraints are functional constraints, whereas evaluate constraints are performance constraints.
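To make the set-up and hold checks for an edge-triggered latch concrete, the sketch below turns the two constraints into required times and slacks, using the convention that a negative slack denotes a violation. The function name `latch_slacks`, its parameters, and the numbers in the usage example are all hypothetical, and the clock edge times are treated as ideal (no clock uncertainty).

```python
def latch_slacks(early_arrival, late_arrival,
                 clk_on_edge, prev_clk_off_edge, t_setup, t_hold):
    """Return (setup_slack, hold_slack) at the data input of a latch.

    early_arrival, late_arrival: earliest/latest data arrival times.
    clk_on_edge:       time of the clock edge that turns the latch on.
    prev_clk_off_edge: time of the clock edge that ended the previous on-phase.
    t_setup, t_hold:   set-up and hold margins of the latch.
    """
    late_required = clk_on_edge - t_setup          # long path (set-up) constraint
    early_required = prev_clk_off_edge + t_hold    # short path (hold) constraint
    setup_slack = late_required - late_arrival     # late slack: required - arrival
    hold_slack = early_arrival - early_required    # early slack: arrival - required
    return setup_slack, hold_slack


# Hypothetical numbers: data arrives between t = 5.5 and t = 9.0.
setup_slack, hold_slack = latch_slacks(
    early_arrival=5.5, late_arrival=9.0,
    clk_on_edge=10.0, prev_clk_off_edge=5.0,
    t_setup=0.5, t_hold=0.25,
)
print(setup_slack, hold_slack)
```

Moving `clk_on_edge` later, which corresponds to a slower clock, increases the set-up slack, which is consistent with set-up constraints being performance constraints.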

Transistor-Level Delay Modeling

In transistor-level static TA, delays of timing arcs have to be computed on the fly using transistor-level delay estimation techniques. There are many different transistor-level delay models which provide different tradeoffs between speed and accuracy. Before reviewing some of the more popular delay models, we define some notation. We will refer to the delay of a timing arc as being its propagation delay (i.e., the time difference between the output and the input completing half their transitions). For a falling output, the fall time is defined as the time to transition from 90% to 10% of the swing; similarly, for a rising output, the rise time is defined as the time to transition from 10% to 90% of the swing. The transition time at the output of the timing arc is defined to be either the rise time or the fall time. In many of the delay models discussed below, the transition time at the input of a timing arc is required to find the delay across the timing arc. At any node in the circuit, there is a transition time corresponding to each timing arc that is incident on that node. Since for long path static TA, we find the latest arriving signal at a node and propagate that arrival time forward, the transition time at a node is defined to be the output transition time of the timing arc which produced the latest arrival time at the node. Similarly, for short path analysis, we find the transition time as the output transition time of the timing arc that produced the earliest arrival time at the node. Analytical closed-form formulae for the delay and output transition times are useful for static TA because of their efficiency. One such model was proposed in Hedenstierna et al.,12 where the propagation

delay across an inverter is expressed as a function of the input transition time sin, the output load CL, and the size and threshold voltages of the NMOS and PMOS transistors. For example, the inverter delay for a rising input and falling output is given by

$$t_d = k_0 \frac{C_L}{\beta_n} + s_{in}\left(k_1 + k_2 V_{tn}\right) \tag{60.8}$$

where βn is the NMOS transconductance (proportional to the width of the device), Vtn is the NMOS threshold voltage, and k0, k1, and k2 are constants. The formula for the rising delay is the same, with PMOS device parameters being used. The output transition time is considered to be a multiple of the propagation delay and can be calibrated to a particular technology. More accurate analytical formulae for the propagation delay and output transition time for an inverter gate have been reported in the literature.13,14 These methods consider more complex circuit behavior such as short-circuit current (both NMOS and PMOS transistors in the inverter are conducting) and the effect of MOS parasitic capacitances that directly couple the input and outputs of the inverter. More accurate models of the drain current and parasitic capacitances of the transistor are also used. The main shortcoming of all these delay models is that they are based on an inverter primitive; therefore, arbitrary CMOS gates seen in the circuit must be mapped to an equivalent inverter.15 This process often introduces large errors. A simpler delay model is based on replacing transistors by linear resistances and using closed-form expressions to compute propagation delays.16,17 The first step in this type of delay modeling is to determine the charging/discharging path from the power supply rail to the output node that contains the switching transistor. Next, each transistor along this path is modeled as an effective resistance and the MOS diffusion capacitances are modeled as lumped capacitances at the transistor source and drain terminals. Finally, the Elmore time constant18 of the path is obtained by starting at the power supply rail and adding the product of each transistor resistance and the sum of all downstream capacitances between the transistor and the output node. The accuracy of this method is largely dependent on the accuracy of the effective resistance and capacitance models. 
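The Elmore computation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `elmore_delay`, the list-based path representation, and the resistance and capacitance values are hypothetical, and in a real tool the values would come from the calibrated tables rather than constants.

```python
def elmore_delay(resistances, capacitances):
    """Elmore time constant of a series charging/discharging path.

    resistances[i]  : effective resistance of the i-th transistor, ordered
                      from the power supply rail toward the output (ohms).
    capacitances[i] : lumped capacitance at the node on the output side of
                      transistor i (farads); the last entry is the output
                      node and includes the load.
    """
    if len(resistances) != len(capacitances):
        raise ValueError("need one node capacitance per transistor")
    tau = 0.0
    for i, r in enumerate(resistances):
        # All capacitance between this transistor and the output node
        # must be charged or discharged through it.
        downstream = sum(capacitances[i:])
        tau += r * downstream
    return tau


# Hypothetical pull-down path of a two-input NAND gate:
# Gnd -> bottom NMOS -> internal node -> top NMOS -> output.
r_path = [2.0e3, 2.0e3]      # assumed effective resistances (ohms)
c_nodes = [5e-15, 30e-15]    # assumed internal-node and output capacitance (F)
tau = elmore_delay(r_path, c_nodes)
print(f"Elmore time constant: {tau * 1e12:.1f} ps")
```

The transistor nearest the rail sees both node capacitances, while the transistor nearest the output sees only the output capacitance, which is why the stack position matters for the delay.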
The effective resistance of a MOS transistor is a function of its width, the input transition time, and the output capacitance load. It is also a function of the position of the transistor in the charging/discharging path. The position variable can have three values: trigger (when the input at the gate of the transistor is switching), blocking (when the transistor is not switching and it lies between the trigger and the output node), and support (when the transistor is not switching and lies between the trigger and the power supply rail). The simplest way to incorporate these effects into the resistance model is to create a table of the resistance values (using circuit simulation) for various values of the transistor width, the input transition, and the output load. During delay modeling, the resistance value of a transistor is obtained by interpolation from the calibration table. Since the position is a discrete variable, a different table must be stored for each position variable. The effective MOS parasitic capacitances are functions of the transistor width and can also be modeled using a table look-up approach. The main drawbacks of this approach are the lack of accuracy in modeling a transistor as a linear resistance and capacitance, as well as not considering the effect of parallel charging/discharging paths and complementary paths. In our experience, this approach typically gives 10–20% accuracy with respect to SPICE for standard gates (inverters, NANDs, NORs, etc.); for complex gates, the error can be greater. These methods do not compute the transition time or slope at the output of the DCC. The transition time at the output node is considered to be a multiple of the propagation delay. Note that the propagation delay across a gate can be negative; this is the case, for example, if there is a slow transition at the input of a strong but lightly loaded gate. 
As a result, the transition time would become negative, giving a large error compared to the correct value. Yet another method of modeling the delay from an input to an output of a DCC (or gate) is based on running a circuit simulator such as SPICE,5 or a fast timing simulator such as ILLIADS6 or ACES.7 Since the waveform at the switching input is known, the main challenge in this method is to determine the assertions (whether an input should be set to a high or low value) for the side inputs which give rise to

a transition at the output of the DCC.19 For example, let us consider a rising transition at the input causing a falling transition at the output. In this case, a valid assertion is one that satisfies the following two conditions: (1) before the transition, there should be no conducting path between the output node and Gnd, and (2) after the transition, there should be at least one conducting path between the output node and Gnd and no conducting path between the output node and Vdd. The sensitization condition for a rising output transition is exactly symmetrical. The valid assertions are usually determined using a binary decision diagram.20 For a particular input-output transition, there may be many valid assertions; these valid assertions may have different delay values since the primary charging/discharging path may be different or different node capacitances in the side paths may be charged/discharged. To find the assertion that causes the worst-case (or best-case) delay, one may resort to explicit simulations of all the valid assertions or employ other heuristics to prune out certain assertions. The main advantage of this type of delay modeling is that very accurate delay and transition time estimates can be obtained since the underlying simulator is accurate. The added accuracy is obtained at the cost of additional runtime. Since static timing analyzers typically use simple delay models for efficiency reasons, the top few critical paths of the circuit should be verified using circuit simulation.21,22
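The two conditions for a valid assertion can be checked by brute-force enumeration on a small gate. The sketch below is an illustration only: it enumerates side-input settings exhaustively instead of using a binary decision diagram, and the gate (out = NOT(a AND (b OR c))), the function names, and the assumption of a fully complementary pull-up network are all hypothetical.

```python
from itertools import product


def valid_assertions(pull_down, pull_up, switching, side_inputs):
    """List side-input settings for which a rising `switching` input
    causes a falling output.

    pull_down(values) / pull_up(values) return True when there is a
    conducting path from the output node to Gnd / Vdd.
    """
    found = []
    for bits in product((0, 1), repeat=len(side_inputs)):
        side = dict(zip(side_inputs, bits))
        before = {**side, switching: 0}
        after = {**side, switching: 1}
        # Condition (1): no path to Gnd before the transition.
        no_gnd_path_before = not pull_down(before)
        # Condition (2): a path to Gnd and no path to Vdd afterwards.
        discharges_after = pull_down(after) and not pull_up(after)
        if no_gnd_path_before and discharges_after:
            found.append(side)
    return found


# Hypothetical complex gate: out = NOT(a AND (b OR c)).
def pd(v):
    return bool(v["a"] and (v["b"] or v["c"]))


def pu(v):
    return not pd(v)  # fully complementary CMOS assumed


assertions = valid_assertions(pd, pu, "a", ["b", "c"])
for side in assertions:
    print(side)
```

For this gate three side-input settings are valid, and each discharges the output through a different combination of parallel transistors, which is why they can have different delays.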

Interconnects and Static TA

As is well known, interconnects are playing a major role in determining the performance of current microprocessors, and this trend is expected to continue in the next generation of processors.23 The effect of interconnects on circuit and system performance should be considered in an accurate and efficient manner during static timing analysis. To illustrate interconnect modeling techniques, we will use the example shown in Fig. 60.6(a) of a wire connecting a driving inverter to three receiving inverters.

FIGURE 60.6 Handling interconnects in static TA: (a) a typical interconnect, (b) distributed RC model of interconnect, (c) reduced π-model to represent the loading of the interconnect, (d) effective capacitance loading, and (e) propagation of waveform from root to sinks.

The simplest interconnect model is to lump all the interconnect and receiver gate capacitances at the output of the driver gate. This approximation may greatly overestimate the delay across the driver gate since, in reality, not all of the downstream capacitance is “seen” by the driver gate because of resistive shielding due to line resistances. A more accurate model of the wire as a distributed RC line is shown in Fig. 60.6(b). This is the wire model output by most commercial RC extraction tools. In Fig. 60.6(b), node r is called the root of the interconnect and is driven by the driver gate, and the other end points of the wire at the inputs of the receiver gates are called sinks of the interconnect and are labeled s1, s2, and s3. Interconnects have two main effects: (1) the interconnect resistance and capacitance determine the effective load seen by the driving gate and therefore its delay, and (2) due to non-zero wire resistances, there is a non-zero delay from the root to the sinks of the interconnect; this is called the time-of-flight delay. To model the effect of the interconnect on the driver delay, we first replace the metal wire with a π-model load as shown in Fig. 60.6(c).24 This is done by finding the first three moments of the admittance Y(s) of the interconnect at node r. It can be shown that the admittance is given by $Y(s) = m_1 s + m_2 s^2 + m_3 s^3 + \dots$. Next, we obtain the admittance of the π-load as $\hat{Y}(s) = s(C_1 + C_2) - s^2 R C_2^2 + s^3 R^2 C_2^3 + \dots$, where R, C1, and C2 are the parameters of the π-load model. To obtain the parameters of the π-load, we equate the first three moments of Y(s) and $\hat{Y}(s)$. This gives us the following equations for the parameters of the π-load model:

$$C_2 = \frac{m_2^2}{m_3}, \qquad C_1 = m_1 - \frac{m_2^2}{m_3}, \qquad \text{and} \qquad R = -\frac{m_3^2}{m_2^3} \tag{60.9}$$
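As a quick check of Eq. (60.9), the sketch below computes the π-load parameters from three admittance moments and verifies them by a round trip through the series expansion of the π-load admittance, s(C1 + C2) − s²RC2² + s³R²C2³. The function names and the numerical values are illustrative assumptions, not part of any tool's interface.

```python
import math


def pi_model_from_moments(m1, m2, m3):
    """Return (C1, C2, R) of the pi-load that matches the first three
    moments of the driving-point admittance."""
    c2 = m2 ** 2 / m3
    c1 = m1 - c2
    r = -(m3 ** 2) / m2 ** 3
    return c1, c2, r


def pi_model_moments(c1, c2, r):
    """First three admittance moments of a pi-load (C1 at the driver
    side, then series R, then C2)."""
    return c1 + c2, -r * c2 ** 2, r ** 2 * c2 ** 3


# Assumed example values: 20 fF near the driver, 100 ohm, 50 fF far end.
m1, m2, m3 = pi_model_moments(20e-15, 50e-15, 100.0)
c1, c2, r = pi_model_from_moments(m1, m2, m3)
print(c1, c2, r)
```

Note that m2 is negative for any physical RC load, so the minus sign in the expression for R yields a positive resistance.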

Now, if we are using a transistor-level delay model or a pre-characterized gate-level delay model that can only handle purely capacitive loading and not π-model loads, we have to determine an effective capacitance Ceff that will accurately model the π-load. The basic idea of this method25,26 is to equate the average current drawn by the π-model load to the average current drawn by the Ceff load. Since the average current drawn by any load is dependent on the transition time at the output of the gate and the transition time is itself a function of the load, we have to iterate to converge to the correct value of Ceff . Once the effective capacitance has been obtained, the delay across the driver gate and the waveform at node r can be obtained. The waveform at the root node is then propagated to the sink nodes s1, s2, s3 across the transfer functions H1(s), H2(s), and H3(s), respectively. This procedure is illustrated in Fig. 60.6(e). If the driver waveform can be simplified as a ramp, the output waveforms at the sink nodes can be computed easily using reduced-order modeling techniques like AWE27 and the time-of-flight delay between the root node and the sink nodes can be calculated.
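The iteration for Ceff can be illustrated with a deliberately simplified model. The sketch below is not the formulation of the cited references: it assumes the driver output is a linear ramp of duration t_tr, equates the total charge (and hence the average current) delivered to the π-load over that interval with the charge delivered to a single capacitor, and uses a hypothetical driver model in which the transition time is 2.2·R_drv·C. All names and values are assumptions.

```python
import math


def ceff_for_transition(c1, c2, r, t_tr):
    """Capacitance that draws the same charge as the pi-load while the
    driver output ramps linearly from 0 to Vdd in time t_tr (assumed)."""
    tau = r * c2
    if tau == 0.0:
        return c1 + c2  # no resistive shielding
    # Fraction of C2 that is charged by the end of the ramp.
    charged = 1.0 - (tau / t_tr) * (-math.expm1(-t_tr / tau))
    return c1 + c2 * charged


def driver_transition_time(c_load, r_drv=1.0e3):
    """Hypothetical driver model: transition time proportional to R*C."""
    return 2.2 * r_drv * c_load


def effective_capacitance(c1, c2, r, tol=1e-6, max_iter=100):
    """Fixed-point iteration: the transition time depends on Ceff, and
    Ceff depends on the transition time."""
    ceff = c1 + c2  # start from the total (lumped) capacitance
    for _ in range(max_iter):
        t_tr = driver_transition_time(ceff)
        new_ceff = ceff_for_transition(c1, c2, r, t_tr)
        if abs(new_ceff - ceff) <= tol * ceff:
            return new_ceff
        ceff = new_ceff
    raise RuntimeError("Ceff iteration did not converge")


ceff = effective_capacitance(20e-15, 50e-15, 500.0)
print(f"Ceff = {ceff * 1e15:.2f} fF (total capacitance 70.00 fF)")
```

Under these assumptions Ceff always lies between C1 and C1 + C2, and it approaches the total capacitance as the wire resistance goes to zero or the driver becomes slow.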

Process Variations and Static TA

Unavoidable variations and disturbances present in IC manufacturing processes cause variations in device parameters and circuit performances. Moreover, variations in the environmental conditions (of such parameters as temperature, supply voltage, etc.) also cause variations in circuit performances.28 As a result, static TA should consider the effect of process and environmental variations. Typically, statistical process and environmental variations are considered by performing analysis at two process corners: the best-case corner and the worst-case corner. These process corners are typically represented as different device model parameter sets and, as the names imply, are for the fastest and slowest devices. For gate-level static TA, gate characterization is first performed at these two corners, yielding two different gate delay models. Then, static TA is performed with the best-case and worst-case gate delay models. Long path constraints (e.g., latch set-up and performance or speed constraints) are checked with the worst-case models and short path constraints (e.g., latch hold constraints) are checked with the best-case models.
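The pairing of corners with constraint types can be shown with a minimal sketch. It assumes the simplest possible setting, a single path between two edge-triggered elements clocked by the same edge with zero clock skew, which is a simplification of the latch-based analysis in this chapter; the function name and all numbers are hypothetical.

```python
import math


def corner_slacks(d_best, d_worst, period, t_setup, t_hold):
    """Set-up slack from the worst-case (slow) delay and hold slack from
    the best-case (fast) delay of the same path.

    Assumes an edge-triggered launch and capture on the same clock with
    zero skew; negative slack means the constraint is violated.
    """
    setup_slack = period - t_setup - d_worst   # long path check
    hold_slack = d_best - t_hold               # short path check
    return setup_slack, hold_slack


# Assumed path delays (ns) at the two corners.
setup_slack, hold_slack = corner_slacks(
    d_best=0.20, d_worst=0.90, period=1.00, t_setup=0.05, t_hold=0.03
)
print(f"set-up slack {setup_slack:.2f} ns, hold slack {hold_slack:.2f} ns")
```

In this simplified setting the hold slack does not depend on the clock period, which matches the earlier observation that hold constraints are functional rather than performance constraints.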

Timing Abstraction

Transistor-level timing analysis is very important in high-performance microprocessor design and verification since a large part of the design is hand-crafted and cannot be pre-characterized. Analysis at the transistor level is also important to accurately consider interconnect effects such as gate loading, charge-sharing, and clock skew. However, full-chip transistor-level analysis of large microprocessor designs is computationally infeasible, making timing abstraction a necessity.

Gate-Level Static TA

A straightforward extension of transistor-level static TA is to the gate level. At this level of abstraction, the circuit has been partitioned into gates, and the inputs and outputs of each gate have been identified. Moreover, the timing arcs from the inputs to the outputs of a gate are typically pre-characterized. The gates are characterized by applying a ramp voltage source at the input of the gate and an explicit load capacitance at the output of the gate. Then, the transition time of the ramp and the value of the load capacitance are varied, and circuit simulation (e.g., SPICE) is used to compute the propagation delays and output transition times for the various settings. These data points can be stored in a table or abstracted in the form of a curve-fitted equation. A popular curve-fitting approach is the k-factor equations,26 where the delay td and output transition time tout are expressed as non-linear functions of the input transition time sin and the capacitive output load CL:

$$t_d = (k_1 + k_2 C_L)\, s_{in} + k_3 C_L^2 + k_4 C_L + k_5 \tag{60.10}$$

$$t_{out} = (k_1' + k_2' C_L)\, s_{in} + k_3' C_L^2 + k_4' C_L + k_5' \tag{60.11}$$
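Reading Eq. (60.10) as t_d = (k1 + k2·CL)·s_in + k3·CL² + k4·CL + k5, the model is linear in the five coefficients, so they can be obtained by ordinary least squares. The sketch below is only an illustration: it uses the normal equations with a small Gaussian-elimination solver to stay within the standard library, and the "characterization data" are synthetic values generated from assumed coefficients rather than SPICE results.

```python
def kfactor_features(s_in, c_load):
    """Basis functions multiplying k1..k5 in the k-factor delay equation."""
    return [s_in, c_load * s_in, c_load ** 2, c_load, 1.0]


def kfactor_eval(k, s_in, c_load):
    return sum(ki * fi for ki, fi in zip(k, kfactor_features(s_in, c_load)))


def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_kfactors(samples):
    """Least-squares fit; samples is a list of (s_in, c_load, value)."""
    n = 5
    ata = [[0.0] * n for _ in range(n)]
    atb = [0.0] * n
    for s_in, c_load, y in samples:
        f = kfactor_features(s_in, c_load)
        for i in range(n):
            atb[i] += f[i] * y
            for j in range(n):
                ata[i][j] += f[i] * f[j]
    return solve_linear(ata, atb)


# Synthetic characterization grid (s_in in ns, CL in pF), assumed values.
true_k = [0.2, 1.5, 0.3, 2.0, 0.05]
grid = [(s, c) for s in (0.1, 0.2, 0.4, 0.8) for c in (0.05, 0.1, 0.2, 0.4, 0.8)]
samples = [(s, c, kfactor_eval(true_k, s, c)) for s, c in grid]
fitted_k = fit_kfactors(samples)
print([round(k, 4) for k in fitted_k])
```

The same fit, with a separate set of primed coefficients, applies to the output transition time of Eq. (60.11).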

The various coefficients in the k-factor equations are obtained by curve fitting the data. Several modifications, including more complex equations and dividing the plane into a number of regions and having equations for each region, have been proposed. The main advantage of gate-level static TA is that costly on-the-fly delay and output transition time calculations can be replaced by efficient equation evaluations or table look-ups. This is also a disadvantage since it requires that all the timing arcs in the design are pre-characterized. This may be a problem when parts of the design are not complete and the delays for some timing arcs are not available. This problem can be avoided if the design flow ensures that at early stages of a design, estimated delays are specified for all timing arcs, which are then replaced by characterized numbers once the design is complete. To apply gate-level TA to designs that contain a large number of custom circuits, timing rules must be developed for the custom circuits also. Gate-level static TA is still at a fairly low level of abstraction and the effects of interconnects and clock skew can be considered. Moreover, at the gate level, the latches and flip-flops of the design are visible and so timing constraints can be inserted directly at those nodes.

Black-Box Modeling

At the next higher level of abstraction, gates are grouped together into blocks and the entire design (or chip) now consists of these blocks or “boxes.” Each box contains combinational gates as well as sequential elements such as latches as shown in Fig. 60.7(a). Timing checks inside the block can be verified using static TA at the transistor or gate level. At the chip level, the internal nodes of the box are no longer visible and its timing behavior must be abstracted at the input, output, and clock pins of the box.
In black-box modeling, we assume that the first and last latch along any path from input to output of the box are edge-triggered latches; in other words, cycle stealing is not allowed across these latches (cycle stealing may be allowed across other transparent latches inside the box). The first latch along a path from input to output is called an input latch and the last latch is called an output latch. With this assumption, there can be two types of paths to the outputs of the box. First, paths that originate at box inputs and end at box outputs without traversing through any latches. These paths are represented as input-output arcs in the black-box with the path delays annotated on the arcs. Second, there are paths that originate

FIGURE 60.7 High-level timing abstraction: (a) a block containing combinational and sequential elements, (b) black-box model, and (c) gray-box model.

at the clock pins of the output edge-triggered latches and end at the box outputs. These paths are represented as clock-to-output arcs in the black-box and the path delays are annotated on the arcs. Finally, the set-up and hold time constraints of the input latches are translated to constraints between the box inputs and clock pins. These constraints will be checked at the chip-level static TA. The constraints and the arcs are shown in Fig. 60.7(b). Note that the timing checkpoints inside a block have been verified for a particular set of clocks when the black-box model is generated. Since these timing checkpoints are no longer available at the chip level, a black-box model is valid only for a particular frequency. If a different clock frequency (or different clock waveforms) is used, then the black-box model must be regenerated.

Gray-Box Modeling

Gray-box modeling removes the edge-triggered latch restrictions of black-box modeling. All latches inside the box are allowed to be level-sensitive and therefore have to be visible at the top level so that the constraints can be checked and cycle-stealing is allowed through these latches. As shown in Fig. 60.7(c), the gray-box model consists of timing arcs from the box inputs to the input latches, from latches to latches, and from the output latches to the box outputs. The clock pins of each of the latches are also visible at the chip level, and so the set-up and hold timing constraints for each latch in the box are checked at the chip level. In addition to these timing arcs, there can also be direct input-output timing arcs. Note that since the timing checkpoints internal to the box are available at the chip level, the gray-box model is frequency independent, unlike the black-box model.

False Paths

To find the critical paths in the circuit, static TA propagates the arrival times from the timing inputs to the timing outputs. Then, it propagates the required times from the outputs back to the inputs and computes the slacks along the way. During propagation, static TA does not consider the logical functionality of the circuit. As a result, some of the paths that it reports to the user may be such that they cannot be activated by any input vector. Such paths are called false paths.29-31 An example of a false path is shown in Fig. 60.8(a). For x to propagate to a, we must set y = 1, which is the non-controlling value of the NAND gate. Similarly, for a to propagate to b, we set z = 1. Now, since y = z = 1, e = 0 (the controlling

FIGURE 60.8 False path examples: (a) static false path, and (b) dynamic false path.

value for a NAND gate), and there can be no signal propagation from b to c. Therefore, there can be no propagation from x to c (i.e., x – a – b – c is a false path). False paths that arise due to logical correlations are called static false paths to distinguish them from dynamic false paths, which are caused by temporal correlations. A simple example of a dynamic false path is shown in Fig. 60.8(b). Suppose we want to find the critical path from node x to the output d. It is clear that there are two such paths, x – a – d and x – a – b – c – d, of which the latter has a larger delay. In order to sensitize the longer path x – a – b – c – d, we would set the other inputs of the circuit to the non-controlling values of the gates (i.e., y = z = u = 1). If there is a rising transition on node x, there will be a falling transition on nodes a and c. However, because of the propagation delay from a to c, node a will fall well before node c. As soon as node a falls, it will set the primary output d to be 1 (since the controlling value of a NAND gate is 0). Because node a always reaches the controlling value before node c, it is not possible for a transition at node c to reach the output. In other words, the path x rising – a falling – b rising – c falling – d rising is a dynamic false path. Note that if we add some combinational logic between the output of the first NAND gate and the input of the last NAND gate to slow the signal a down, then the transition on c could propagate to the output. The example shown above is for purposes of illustration only and may appear contrived. However, dynamic false paths are very common in carry-lookahead adders.32 Finding false paths in a combinational circuit is an NP-complete problem. There are a number of heuristic approaches that find the longest paths in a circuit while determining and ignoring the false paths.29-31 Timing analysis techniques that can avoid false paths specified by the user have also been reported.33,34
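The static false path argument for Fig. 60.8(a) can be reproduced by exhaustive enumeration on this tiny example. The sketch below is an illustration under assumptions: the figure is not reproduced here, so the relation e = NAND(y, z) is inferred from the statement that y = z = 1 forces e = 0, and the function names are hypothetical. Exhaustive search is only feasible for very small circuits, which is consistent with the problem being NP-complete in general.

```python
from itertools import product


def nand(*inputs):
    return 0 if all(inputs) else 1


def statically_sensitizable(side_input_fns, primary_inputs):
    """True if some input vector sets every side input along the path to
    the non-controlling value of a NAND gate (logic 1)."""
    for bits in product((0, 1), repeat=len(primary_inputs)):
        v = dict(zip(primary_inputs, bits))
        if all(fn(v) == 1 for fn in side_input_fns):
            return True
    return False


# Side inputs along x - a - b - c: y at the first gate, z at the second,
# and e at the third, with e = NAND(y, z) assumed from the text.
side_y = lambda v: v["y"]
side_z = lambda v: v["z"]
side_e = lambda v: nand(v["y"], v["z"])

prefix_ok = statically_sensitizable([side_y, side_z], ["y", "z"])
full_ok = statically_sensitizable([side_y, side_z, side_e], ["y", "z"])
print("x - a - b sensitizable:", prefix_ok)
print("x - a - b - c sensitizable:", full_ok)
```

The prefix x – a – b can be sensitized, but adding the third gate makes the requirements contradictory, so the full path is reported as false.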

60.3 Noise Analysis

In digital circuits, nodes that are not switching are at the nominal values of the supply (logic 1) and ground (logic 0) rails. In a digital system, noise is defined as a deviation of these node voltages from their stable high or low values. Digital noise should be distinguished from physical noise sources that are common in analog circuits (e.g., shot noise, thermal noise, flicker noise, and burst noise).35 Since noise causes a deviation in the stable logic voltages of a node, it can be classified into four categories: (1) high undershoot noise reduces the voltage of a node that is supposed to be at logic 1; (2) high overshoot noise increases the voltage of a logic 1 node above the supply level (Vdd); (3) low overshoot noise increases the voltage of a node that is supposed to be at logic 0; and (4) low undershoot noise reduces the voltage of a logic 0 node below the ground level (Gnd).
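The four categories can be written down as a small classification helper. This is a minimal sketch: the function name, the argument names, and the example voltages are assumptions made for illustration.

```python
def classify_noise(voltage, logic_value, vdd, gnd=0.0):
    """Name the noise category for a quiet node, or None if the node is
    exactly at its nominal rail voltage."""
    if logic_value == 1:
        if voltage < vdd:
            return "high undershoot"   # logic 1 node pulled below Vdd
        if voltage > vdd:
            return "high overshoot"    # logic 1 node pushed above Vdd
    else:
        if voltage > gnd:
            return "low overshoot"     # logic 0 node pulled above Gnd
        if voltage < gnd:
            return "low undershoot"    # logic 0 node pushed below Gnd
    return None


# Assumed 1.8 V supply.
print(classify_noise(1.5, 1, vdd=1.8))
print(classify_noise(-0.2, 0, vdd=1.8))
```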

Sources of Digital Noise

The most common sources of noise in digital circuits are crosstalk noise, power supply noise, leakage noise, and charge-sharing noise.36

Crosstalk Noise

Crosstalk noise is the noise voltage induced on a net that is at a stable logic value due to interconnect capacitive coupling with a switching net. The net or wire that is supposed to be at a stable value is called the victim net. The switching nets that induce noise on the victim net are called aggressor nets. Crosstalk noise is the most common source of noise in deep submicron digital designs because, as interconnect wires get scaled, coupling capacitances become a larger fraction of the total wire capacitances.23 The ratio of the width to the thickness of metal wires reduces with scaling, resulting in a larger fraction of the total capacitance of the wire being contributed by coupling capacitances. Several examples of functional failures caused by crosstalk noise are given in the next section.

Power Supply Noise

This refers to noise on the power supply and ground nets of a design that is passed onto the signal nets by conducting transistors. Typically, the power supply noise has two components. The first is produced by IR-drop on the power and ground nets due to the current demands of the various gates in the chip (discussed in the next section). The second component of the power supply noise comes from the RLC response of the chip and package to current demands that peak at the beginning of a clock cycle. The first component of power supply noise can be reduced by making the wires that comprise the power and ground network wider and denser. The second component of the noise can be reduced by placing on-chip decoupling capacitors.37

Charge-Sharing Noise

Charge-sharing noise is the noise induced at a dynamic node due to charge redistribution between that node and the internal nodes of the gate.32 To illustrate charge-sharing noise, let us again consider the two-input domino NAND gate of Fig. 60.9(a). Let us assume that during the first evaluate phase shown in Fig. 60.9(b), both nodes x and x1 are discharged.
Then, during the next precharge phase, let us assume that the input a is low. Node x will be precharged by the PMOS transistor MP, but x1 will not and will remain at its low value. Now, suppose CK turns high, signaling the beginning of another evaluate phase. If during this evaluate phase, a is high but b is low, nodes x and x1 will share charge resulting in the waveforms shown in Fig. 60.9(b): x will be pulled low and x1 will be pulled high. If the voltage on x is reduced by a large amount, the output inverter may switch and cause the output node y to be wrongly set to a logic high value. Charge-sharing in a domino gate is avoided by precharging the internal nodes in the NMOS evaluate tree during the precharge phase of the clock. This is done by adding an anti-charge-sharing device such as MNc in Fig. 60.9(c), which is gated by the clock signal.

Leakage Noise

Leakage noise is due to two main sources: subthreshold conduction and substrate noise. Subthreshold leakage current32 is the current that flows in MOS transistors even when they are not conducting (off).

FIGURE 60.9 Example of charge-sharing noise: (a) a two-input domino NAND gate, (b) waveforms for charge-sharing event, and (c) anti-charge-sharing device.

This current is a strong function of the threshold voltage of the device and the operating temperature. Subthreshold leakage is an important design parameter in portable devices since battery life is directly dependent on the average leakage current of the chip. Subthreshold conduction is also an important noise mechanism in dynamic circuits where, for a part of the clock cycle, a node does not have a strong conducting path to power or ground and the logic value is stored as a charge on that node. For example, suppose that the inputs a and b in the two-input domino NAND gate of Fig. 60.9(a) are low during the evaluate phase of the clock. Due to subthreshold leakage current in the NMOS evaluate transistors, the charge on node x may be drained away, leading to a degradation in its voltage and a wrong value at the output node y. The purpose of the half latch device MPfb is to replenish the charge that may be lost due to the leakage current. Another source of leakage noise is minority carrier back injection into the substrate due to bootstrapping. In the context of mixed analog-digital designs, this is often referred to as substrate noise.38 Substrate noise is often reduced by having guard bands, which are diffusion regions around the active region of a transistor tied to supply voltages so that the minority carriers can be collected.

Crosstalk Noise Failures

In this section, we provide some examples of functional failures caused by crosstalk noise. Functional failures result when induced noise voltages cause an erroneous state to be stored at a memory element (e.g., at a latch node or a dynamic node). Consider the simple latch circuit of Fig. 60.10(a) and let us

FIGURE 60.10 Crosstalk noise-induced functional failures: (a) latch circuit; (b) high undershoot noise on d that does not cause a functional failure; (c) high undershoot noise on d that does cause a failure; (d) same latch circuit with noise induced on an internal node; and (e) low undershoot noise causing a failure.

assume that the data input d is a stable high value and the latch l has a stable low value. If the net corresponding to node d is coupled to another net e and there is a high to low transition on net e, net d will be pulled low. When e has finished switching, d will be pulled back to a high value by the PMOS transistor driving net d and the noise on d will dissipate. Thus, the transition on net e will cause a noise pulse on d. If the amplitude of this noise pulse is large enough, the latch node l will be pulled high. Depending on the conditions under which the noise is injected, it may or may not cause a wrong value to be stored at the latch node. For example, let us consider the situation depicted in Fig. 60.10(b), where CK is high and the latch is open. If the noise pulse on d appears near the middle of the clock phase, then the latch node will be pulled high; but as the noise on d dissipates, latch node l will return to its correct value because the latch is open. However, if the noise pulse on d appears near the end of the clock phase as shown in Fig. 60.10(c), the latch may turn off before the noise on d dissipates, the latch node may not recover, and a wrong value will be stored. A similar unrecoverable error may occur if noise appears on the clock net turning the latch on when it was meant to be off. This might cause a wrong value to be latched. Now, let us consider the latch circuit of Fig. 60.10(d) where the wire between the input inverter and the pass gate of the latch is long and subject to coupling capacitances. Suppose the latch is turned off (CK is low), the data input is high so that the node d′ is low, and a high value is stored at the latch node. If net e transitions from a high to a low value, a low undershoot noise will be introduced on d′. If this noise is sufficiently large, the NMOS pass transistor will turn on even though its gate voltage is zero (since its gate-source voltage will become greater than its threshold voltage).
This will discharge the latch node l, resulting in a functional failure. In order to push performance, domino circuits are becoming more and more prevalent.88 These circuits trade noise immunity for performance and are susceptible to functional noise failures. A noise-related functional failure in domino circuits is shown in Fig. 60.11. Again, let us consider the two-input domino NAND gate shown in Fig. 60.11(a). Let us assume that during the evaluate phase, a is held to a low value by the driving inverter, but b is high. Then, x should remain charged and y should remain low. If an unrelated net d switches high, and there is sufficient coupling between signals a and d, then a low overshoot noise pulse will be induced on node a. If the pulse is large enough, a path to ground will be created and node x will be discharged. As shown in Fig. 60.11(b), this will erroneously set the output node of the domino gate to a high value. When the noise on a dissipates, it will return to a low value, but x and y are not able to recover from the noise event, causing a functional failure. As the examples above demonstrate, functional failures due to digital noise cause circuits to malfunction. Noise is becoming an important failure mechanism in deep submicron designs because of several technology and design trends. First, larger die sizes and greater functionality in modern chips result in longer wires, which makes the circuit more susceptible to coupling noise. Second, scaling of interconnect geometries has resulted in increased coupling between adjacent wires.23 Third, the drive for faster performance has increased the use of faster non-restoring logic families such as domino logic.

FIGURE 60.11 Functional failure in domino gates: (a) two-input NAND gate, and (b) voltage waveforms when input noise causes a functional failure.


These circuit families have faster switching speeds at the expense of reduced noise immunity. False switching events at the inputs of these gates are catastrophic since precharged nodes may be discharged and these nodes cannot recover their original state when the noise dissipates. Fourth, lower supply voltage levels reduce the magnitudes of the noise margins of circuits. Finally, in state-of-the-art microprocessors, many functional units located in different parts of the chip are operating in parallel and this causes a lot of switching activity in long wires that run across different parts of the chip. All of these factors make noise analysis a very important task to verify the proper functioning of digital designs.

Modeling of Interconnect and Gates for Noise Analysis

Let us consider the example of Fig. 60.12(a) where three wires are running in parallel and are capacitively coupled to each other. Suppose that we are interested in finding the noise that is induced on the middle net by the adjacent nets switching. The middle net is called the victim net and the two neighboring nets are called aggressors. Consider the situation when the victim net is held to a stable logic zero value by the victim driver and both the aggressor nets are switching high. Due to the coupling between the nets, a low overshoot noise will be induced on the victim net as shown in Fig. 60.12(a). If the noise pulse is large and wide enough, the victim receiver may switch and cause a wrong value at the output of the inverter. The circuit-level models for this system are explained below and shown in Fig. 60.12(b).

1. The (net) complex consisting of the victim and aggressor nets is modeled as a coupled distributed RC network. The coupled RC lines are typically output by a parasitic extraction tool.
2. The non-linear victim driver is holding the victim net to a stable value. We model the non-linear driver as a linear holding resistance. For example, if the victim driver holds the output to logic 0 (logic 1), we determine an effective NMOS (PMOS) resistance. The value of the holding resistance for a gate can be obtained by pre-characterization using SPICE.
3. The aggressor driver is modeled as a Thevenin voltage source in series with a switching resistance. The Thevenin voltage source is modeled as a shifted ramp, where the ramp starts switching at time t0 and the transition time is ∆t. The switching resistance is denoted by Rs.
4. The victim receiver is modeled as a capacitor of value equal to the input capacitance of the gate.
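The shifted-ramp source in item 3 can be written down directly. The following is a minimal sketch only: the function name and the example values are hypothetical, and the supply voltage vdd as the final value of the ramp is an assumption for a rising aggressor.

```python
def shifted_ramp(t, t0, dt, vdd):
    """Thevenin source voltage of a rising aggressor driver.

    The source sits at 0 until t0, rises linearly to vdd over the
    transition time dt, and stays at vdd afterwards.
    """
    if t <= t0:
        return 0.0
    if t >= t0 + dt:
        return vdd
    return vdd * (t - t0) / dt


# Hypothetical example: ramp starts at 1 ns, 2 ns transition, 1.8 V swing.
samples = [shifted_ramp(t * 1e-9, 1e-9, 2e-9, 1.8) for t in range(5)]
```

A falling aggressor would use vdd minus this value; the switching resistance Rs is a separate series element and is not part of the source waveform.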

FIGURE 60.12 (a) A noise pulse induced on the victim net by capacitive coupling to adjacent aggressor nets, and (b) linearized model for analysis.


These models convert the non-linear circuit into a linear circuit. The multiple sources in this linear circuit can now be analyzed using linear superposition. For each aggressor, we compute the noise pulse at the sink(s) of the victim net while shorting the other aggressors. These noise pulses have different amplitudes and widths; the amplitude and width of the composite noise waveform are obtained by aligning these noise pulses so that their peaks line up. This is a conservative assumption that simulates the worst-case noise situation.
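As an illustration of the peak-alignment step, the sketch below adds per-aggressor pulses after shifting them so that all peaks coincide. The triangular pulse shape, the function names, and the numerical values are assumptions made for the example; in practice each pulse would come from the linear circuit solution.

```python
def triangle(t, peak_time, amplitude, width):
    """Symmetric triangular pulse with the given peak amplitude and base width."""
    half = width / 2.0
    d = abs(t - peak_time)
    return amplitude * (1.0 - d / half) if d < half else 0.0


def composite_noise(pulses, t):
    """Sum of per-aggressor pulses, all shifted so that their peaks sit at t = 0.

    pulses is a list of (amplitude, width) pairs, one per aggressor.
    """
    return sum(triangle(t, 0.0, a, w) for a, w in pulses)


# Two hypothetical aggressor contributions: 0.20 V / 100 ps and 0.10 V / 60 ps.
pulses = [(0.20, 100e-12), (0.10, 60e-12)]
peak = composite_noise(pulses, 0.0)  # amplitudes add when the peaks are aligned
```

With aligned peaks the composite amplitude is simply the sum of the individual amplitudes, which is why this alignment is the conservative choice.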

Input and Output Noise Models

As mentioned earlier, noise creates circuit failures when it propagates to a charge-storage node and causes a wrong value to be stored at the node. Propagating noise across non-linear gates39 makes the noise analysis problem complex. In this discussion, a simpler and more conservative model will be discussed. With each input terminal of a victim receiver gate, we associate a noise rejection curve.40 This is a curve of the noise amplitude versus the noise width that produces a predefined amount of noise at the output. If we assume a triangular noise pulse at the input of the victim receiver, the noise rejection curve defines the amplitude-width combination that produces a fixed amount of noise at the output of the receiver. A sample noise rejection curve is shown in Fig. 60.13. As the width becomes very large, the noise amplitude tends toward the dc noise margin of the gate. Due to the lowpass nature of a digital gate, very sharp noise pulses are filtered out and do not cause any appreciable noise at the output. When the noise pulses at the sink(s) of the victim net have been obtained, the pulse amplitude and width are compared against the noise rejection curve to determine if a noise failure occurs.

FIGURE 60.13 A typical noise rejection curve.

Since we do not propagate noise across gates, noise injected into the victim net at the output of the victim driver must model the maximum amount of noise that may be produced at the output of a gate. The output noise model is a dc noise that is equal to the predefined amount of output noise that was used to determine the input noise rejection curve above. Contributions from other dc noise sources such as IR-drop noise may be added to the output noise. If we assume that there is no resistive dc path to ground, this output noise appears unchanged at the sink(s) of the victim net.
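The comparison against a noise rejection curve can be sketched as a table lookup. Everything specific here is an assumption for illustration: the curve is stored as hypothetical (width, amplitude) points sorted by width, linear interpolation is used between points, and the last point is taken to represent the dc noise margin.

```python
def allowed_amplitude(curve, width):
    """Largest tolerable noise amplitude for a pulse of the given width.

    curve is a list of (width, amplitude) points sorted by increasing width.
    Pulses narrower than the first point are held to the first amplitude,
    which is conservative; pulses wider than the last point are held to the
    last amplitude, taken here to be the dc noise margin.
    """
    if width <= curve[0][0]:
        return curve[0][1]
    for (w0, a0), (w1, a1) in zip(curve, curve[1:]):
        if width <= w1:
            frac = (width - w0) / (w1 - w0)
            return a0 + frac * (a1 - a0)
    return curve[-1][1]


def is_noise_failure(curve, amplitude, width):
    """True if the pulse lies above the rejection curve."""
    return amplitude > allowed_amplitude(curve, width)


# Hypothetical pre-characterized curve: widths in seconds, amplitudes in volts.
curve = [(50e-12, 1.2), (100e-12, 0.8), (200e-12, 0.6), (400e-12, 0.5)]
```

For example, is_noise_failure(curve, 0.75, 150e-12) flags a failure because the interpolated limit at 150 ps is about 0.7 V, while a 0.65 V pulse of the same width passes.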

Linear Circuit Analysis

The linear circuit that models the net complex to be analyzed can be quite large since the victim and aggressor nets are modeled as a large number of RC segments and the victim net can be coupled to many aggressor nets. Moreover, there are a large number of nets to be analyzed. Since general circuit simulation tools such as SPICE can be extremely time-consuming for these networks, fast linear circuit simulation tools such as RICE41 can be used to solve these large net complexes. RICE uses reduced-order modeling and asymptotic waveform evaluation (AWE) techniques27 to speed up the analysis while maintaining sufficient accuracy. Techniques that overcome the stability problems in AWE, such as Padé via Lanczos (PVL),42 Arnoldi-based techniques,43 congruence transform-based techniques (PACT),44 or combinations (PRIMA),45 have been proposed recently.

Interaction with Timing Analysis

Calculation of crosstalk noise interacts tightly with timing analysis since timing analysis lets us determine which of the aggressor nets can switch at the same time. This reduces the pessimism of assuming that for a victim net, all the nets it is coupled to can switch simultaneously and induce noise on it. Timing analysis defines timing windows by the earliest and latest arrival times for all signals. This is shown in Fig. 60.14 for three aggressors A1, A2, and A3 of a particular victim net of interest. Based upon these timing windows, we can define five different scenarios for noise analysis where different aggressors can


FIGURE 60.14 Effect of timing windows on aggressor selection for noise analysis.

switch simultaneously. For example, in interval T1, only A1 can switch; in T2, A1 and A2 can switch; in T3, only A2 can switch; and so on. Note that in this case, all three aggressors can never switch at the same time. Without considering the timing windows provided by timing analysis, we would have overestimated the noise by assuming that all three aggressors could switch at the same time.
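The way timing windows partition time into switching scenarios can be sketched as follows. The window values are hypothetical numbers chosen to reproduce the five-interval pattern described above, and the function name is illustrative.

```python
def switching_scenarios(windows):
    """Split time into intervals and list the aggressors that may switch in each.

    windows maps an aggressor name to its (earliest, latest) arrival times.
    Returns a list of ((start, end), [aggressor names]) for non-empty intervals.
    """
    points = sorted({t for window in windows.values() for t in window})
    scenarios = []
    for lo, hi in zip(points, points[1:]):
        mid = 0.5 * (lo + hi)  # any point strictly inside the interval works
        active = sorted(name for name, (early, late) in windows.items()
                        if early <= mid <= late)
        if active:
            scenarios.append(((lo, hi), active))
    return scenarios


# Hypothetical timing windows (arbitrary time units) for three aggressors.
windows = {"A1": (0.0, 3.0), "A2": (2.0, 7.0), "A3": (5.0, 9.0)}
scenarios = switching_scenarios(windows)
worst_case_count = max(len(active) for _, active in scenarios)
```

For these windows the sketch yields five scenarios, and at most two aggressors are active in any one of them, so the three-aggressor case never needs to be analyzed.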

Fast Noise Calculation Techniques

Any state-of-the-art microprocessor will have many nets to be analyzed, but typically only a small fraction of the nets will be susceptible to noise problems. This motivates the use of extremely fast techniques that provably overestimate the noise at the sinks of a net. If a net passes the noise test under this quick analysis, then it does not need to be analyzed any further; if a net fails the noise test, then it can be analyzed using more accurate techniques. In this sense, these fast techniques can be considered to be noise filters. If these noise filters produce sufficiently accurate noise estimates, then the expectation is that a large number of nets would be screened out quickly. This combination of fast and detailed analysis techniques would therefore speed up the overall analysis process significantly. Note that noise filters must be provably pessimistic and that multiple noise filters with less and less pessimism can be used one after the other to successively screen out nets.

Let us consider the net complex shown in Fig. 60.15(a), where we have modeled the net as distributed RC lines, the victim driver as a linear holding resistance, and the aggressors as voltage ramps and linear

FIGURE 60.15 Noise filters: (a) original net complex with distributed RC models for aggressors and victims, (b) aggressor lines have only coupling capacitances to victim, (c) aggressors are directly coupled to sink of victim, and (d) single (strongest) aggressor and all grounded capacitors of victim moved away from sink.


resistances. The grounded capacitances of the victim net are denoted as Cgv, and the coupling capacitances to the two aggressors are denoted as Cc1 and Cc2. In Figs. 60.15(b-d), we show the steps through which we can obtain a circuit that provides a provably pessimistic estimate of the noise waveform. In Fig. 60.15(b), we have removed the resistances of the aggressor nets. This is pessimistic because, in reality, the aggressor waveform slows down as it proceeds along the net. By replacing it with a faster waveform, more noise will be induced on the victim net. In Fig. 60.15(c), the aggressor waveforms are capacitively coupled directly into the sink net; for each aggressor, the coupling capacitance is equal to the sum of all the coupling capacitances between itself and the victim net. Since the aggressor is directly coupled to the sink net, this transformation will result in more induced noise. In Fig. 60.15(d), we have made two modifications: first, we replaced the different aggressors by one capacitively coupled aggressor and, second, we moved all the grounded capacitors on the victim net away from the sink node. The composite aggressor is just the fastest aggressor (i.e., the aggressor that has the smallest transition time) and it is coupled to the victim net by a capacitor whose value is equal to the sum of all the coupling capacitances in the victim net. To simplify the victim net, we sum all the grounded capacitors and insert the sum at the root of the victim net, and we sum all the net resistances. By moving the grounded (good) capacitors away from the sink net, we increase the amount of coupled noise. This simple network can now be analyzed very quickly to compute the (pessimistic) noise pulse at the sink. An efficient method to compute the peak noise amplitude at the sink of the victim net is described by Devgan.46 Under infinite ramp aggressor inputs, the maximum noise amplitude is the final value of the coupled noise.
For typical interconnect topologies, these analytical computations are simple and quick.
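For the reduced circuit of Fig. 60.15(d), the final value under an infinite ramp can be written in closed form. The sketch below is one reading of that circuit, not a transcription of the cited method: in steady state the total coupling capacitance injects a constant current equal to the capacitance times the ramp slope, and that current returns to ground through the summed net resistance and the holding resistance. The function names, the clipping at the supply voltage, and the numerical values are all assumptions.

```python
def peak_noise_bound(r_hold, r_net, c_couple_total, vdd, dt_fastest):
    """Pessimistic peak noise at the victim sink for an infinite-ramp aggressor.

    Steady state: I = C * dV/dt flows through (r_net + r_hold) to ground.
    The grounded capacitance draws no current once the noise voltage is flat.
    """
    slope = vdd / dt_fastest                 # ramp slope of the fastest aggressor
    v_final = (r_hold + r_net) * c_couple_total * slope
    return min(v_final, vdd)                 # noise cannot exceed the aggressor swing


def passes_filter(bound, allowed_noise):
    """A net is screened out (needs no further analysis) if the bound is safe."""
    return bound <= allowed_noise


# Hypothetical net: 1 kOhm holding resistance, 200 Ohm wire, 50 fF coupling,
# 1.8 V supply, 200 ps fastest aggressor transition.
bound = peak_noise_bound(1000.0, 200.0, 50e-15, 1.8, 200e-12)
```

Here the bound is about 0.54 V, so the net would pass a filter with a 0.6 V limit and would be sent to more accurate analysis under a 0.5 V limit.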

Noise, Circuit Delays, and Timing Analysis

Circuit noise, especially crosstalk noise, significantly affects switching delays. Let us consider the example of Fig. 60.16(a), where we are concerned about the propagation delay from A to C. In the absence of any coupling capacitances, the rising waveform at C is shown by the dotted line of Fig. 60.16(b). However, if net 2 is switching in the opposite direction (node E is rising as in Fig. 60.16(b)), then additional charge is pumped into net 1 due to the coupling capacitors, causing the signals at nodes B1 and B2 to slow down. This in turn causes the inverter to switch later and causes the propagation delay from A to C to be much larger, as shown in the diagram. Note that if net 2 switched in the same direction as net 1, then the delay from A to C would be reduced. This implies that delays across gates and wires depend on the switching activity on adjacent coupled nets. Since coupling capacitances are a large fraction of the total capacitance of wires, this dependence will be significant and timing analysis should account for this behavior. Using the same terminology as crosstalk noise analysis, we call the net whose delay is of primary interest (net 1 in the above example) the victim net and all the nets that are coupled to it aggressor nets.

A model that is commonly used to approximate the effect of coupling capacitors on circuit delays is to replace each coupling capacitor by a grounded capacitor of twice the value. This model is accurate

FIGURE 60.16 Effect of noise on circuit delays: (a) victim and aggressor nets, and (b) typical waveforms.


only when the victim and aggressor nets are identical and the waveforms on the two nets are identical, but switching in opposite directions. For some cases, doubling the coupling capacitance may be pessimistic, but in many cases it is not: the effective capacitance is much more than twice the coupling capacitance. Note that the effect on the propagation delay due to coupling will be strongly dependent on how the aggressor waveforms are aligned with respect to each other and to the victim waveform. Hence, one of the main issues in finding the effect of noise on delay is to determine the aggressor alignments that cause the worst propagation delay.

FIGURE 60.17 Aligning the composite noise waveform with the original waveform to produce worst-case delay.

A more accurate model for considering the effect of noise on delay is described by Dartu et al.47 In this approach, the gates are replaced by linearized models (e.g., the Thevenin model of the gate consists of a shifted ramp voltage source in series with a resistance). Once the circuit has been linearized, the principle of linear superposition is applied. The voltage waveform at the sink of the victim net is first obtained by assuming that all aggressors are “quiet.” Then the victim net is assumed to be quiet and each aggressor is switched one at a time and the resultant noise waveforms at the victim sink node are recorded. These noise waveforms are offset with respect to each other because of the difference in the delays between the aggressors and the victim sink node. Next, the aggressor noise waveforms are shifted such that the peaks line up and a composite noise waveform is obtained by adding the individual noise waveforms. The remaining issue is to align the composite noise waveform with the noise-free victim waveform to obtain the worst delay. This process is described in Fig. 60.17, where we show the original noise-free waveform Vorig and the (composite) noise waveform Vnoise at the victim sink node.
Then, the worst case is to align the noise such that its peak is at the time when Vorig = 0.5Vdd – VN , where VN is the peak noise.47,48 The final waveform at C is marked Vfinal . The impact of noise on delays and the impact of timing windows on noise analysis imply that one has to iterate between timing and noise analysis. There is no guarantee that this process will converge; in fact, one can come up with examples where the process diverges. This is one of the open issues in noise analysis.

60.4 Power Grid Analysis

The power distribution network distributes power and ground voltages to all the gates and devices in the design. As the devices and gates switch, the power and ground lines conduct current and, due to the resistance of the lines, there is an unavoidable voltage drop at the point of distribution. This voltage drop is called IR-drop. As device densities and switching currents increase, larger currents flow in the power distribution network, causing larger IR-drops. Excessive voltage drops in the power grid reduce switching speeds of devices (since they directly affect the current drive of devices) and noise margins (since the effective rail-to-rail voltage is lower). Moreover, as explained in the previous section, IR-drops inject dc noise into circuits, which may lead to functional or performance failures. Higher average current densities lead to undesirable wear-and-tear of metal wires due to electromigration.49 Considering all these issues, a robust power distribution network is vital in meeting performance and reliability goals in high-performance microprocessors. Such a network will achieve good voltage regulation at all the consumption points in the chip, notwithstanding the fluctuations in the power demand across the chip. In this section, we give a brief overview of various issues involved in power grid analysis.

Problem Characteristics

The most important characteristic of the power grid analysis problem is that it is a global problem. In other words, the voltage drop in a certain part of the chip is related to the currents being drawn from that as well as other parts of the chip. For example, if the same power line is distributing power to several


functional units in a certain part of the chip, the voltage drop in one functional unit depends on the currents being drawn by the other functional units. In fact, as more and more of the functional units switch together, the IR-drop in all the functional units will increase because the current demand on the power line is greater. Since IR-drop analysis is a global problem and since power distribution networks are typically very large, a critical issue is the large size of the network. For a state-of-the-art microprocessor, the number of nodes in the power grid is on the order of millions. An accurate IR-drop analysis would simulate the non-linear devices in the chip, together with the non-ideal power grid, making the size of the network even more unmanageable. In order to keep IR-drop analysis computationally feasible, the simulation is done in two steps. First, the non-linear devices are simulated assuming perfect supply voltages, and the power and ground currents drawn by the devices are recorded (these are called current signatures). Next, these devices are modeled as independent time-varying current sources for simulating the power grid, and the voltage drops at the consumption points (where transistors are connected to power and ground rails) are measured. Since voltage drops are typically less than 10% of the power supply voltage, the error incurred by ignoring the interaction between the device currents and the actual supply voltage is usually small. The linear power and ground network is still very large and hierarchy has to be exploited to reduce the size of the analyzed network. Hierarchy will be discussed in more detail later.

Yet another characteristic of the IR-drop analysis problem is that it is dependent on the activity in the chip, which in turn is dependent on the vectors that are supplied. An important problem in IR-drop analysis is to determine what this input pattern should be.
For IR-drop analysis, patterns that produce maximum instantaneous currents are required. This topic has been addressed by a few papers,50-52 but will not be discussed here. However, the fact that vectors are important means that transient analysis of the power grid is required. Since each solution of the network is expensive and since many simulations are necessary, dynamic IR-drop analysis is very expensive. The speed and memory issues related to linear system solution techniques become important in the context of transient analysis. An important issue in transient analysis is related to the capacitances (both parasitic and intentional decoupling) in the power grid. Since capacitors prevent instantaneous changes in node voltages, IR-drop analysis without considering capacitors will be more pessimistic. A pessimistic analysis can be done by ignoring all power grid capacitances, but a more accurate analysis with capacitances may require additional computation time for solving the network.

Yet another issue is raised by the vector dependence. As mentioned earlier, the non-linear simulation to determine the currents drawn from the power grid is done separately (from the linear network) using the supplied vectors. Since the number of transistors in the whole chip is huge, simultaneous simulation of the whole chip may be infeasible because of limitations in non-linear transient simulation tools (e.g., SPICE or fast timing simulators). This necessitates partitioning the chip into blocks (typically corresponding to functional units, like the floating point unit, integer unit, etc.) and performing the simulation one block at a time. In order to preserve the correlation among the different blocks, the blocks must be simulated with the same underlying set of chip-wide vectors. To determine the vectors for a block, a logic simulation of the chip is done, and the signals at the inputs of the block are monitored and used as inputs for the block simulation.
Since dynamic IR-drop analysis is typically expensive (especially since many vectors are required), techniques to reduce the number of simulations are often used. A commonly used technique is to compress the current signatures from the different clock cycles into a single cycle. The easiest way to accomplish this is to find the maximum envelope of the multi-cycle current signature. To find the maximum envelope over N cycles, the single-cycle current signature is computed using

$$ i_{sc}(t) = \max_{1 \le k \le N} \left[ i_{orig}(t + kT) \right], \qquad 0 \le t \le T \tag{60.12} $$

where isc(t) is the single-cycle current signature, iorig(t) is the original current signature, and T is the clock period. Since this method does not preserve the correlation among different current sources (sinks), it may be overly pessimistic.
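For a sampled current signature, the maximum envelope reduces to a pointwise maximum across cycles. The sketch below assumes a uniformly sampled waveform with a whole number of samples per clock period and indexes the recorded cycles from zero; the function name and the sample data are hypothetical.

```python
def single_cycle_envelope(i_orig, samples_per_cycle):
    """Compress a multi-cycle current signature into one cycle.

    i_orig holds uniformly spaced samples covering a whole number of clock
    cycles. For each sample position within the cycle, the result keeps the
    largest current seen at that position in any recorded cycle.
    """
    n_cycles = len(i_orig) // samples_per_cycle
    return [
        max(i_orig[k * samples_per_cycle + j] for k in range(n_cycles))
        for j in range(samples_per_cycle)
    ]


# Hypothetical signature: 3 cycles, 4 samples per cycle (arbitrary current units).
i_orig = [0, 3, 1, 0,
          2, 1, 4, 0,
          1, 2, 2, 5]
i_sc = single_cycle_envelope(i_orig, 4)
```

Because the envelope is computed independently for each current source, peaks that never coincide in time across sources can end up aligned in the compressed cycle, which is the source of the pessimism noted above.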


A final characteristic of IR-drop analysis is related to the way in which the analysis is typically done. Typically, the analysis is done at the very last stages of the design when the layout of the power network is available. However, IR-drop problems that could be revealed at this stage are very expensive or even impossible to fix. IR-drop analysis that is applicable to all stages of a microprocessor design has been addressed by Dharchoudhury et al.53

Power Grid Modeling

The power and ground grids can be extracted by a parasitic extractor to obtain an R-only or an RC network. Extraction implies that the layout of the power grid is available. To insert the transistor current sources at the proper nodes in the power grid, the extractor should preserve the names and locations of transistors. Power grid capacitances come from metal wire capacitances (coupling and grounded), device capacitances, and decoupling capacitors inserted in the power grid to reduce voltage fluctuations. Several interesting issues are raised in the modeling of power grid capacitances. The power or ground net is coupled to other signal nets and since these nets are switching, the effective grounded capacitance is difficult to compute. The same is true for capacitances of MOS devices connected to the power grid. Making the problem worse, the MOS capacitances are voltage dependent. These issues have not been completely addressed as yet. Typically, one resorts to worst-case analysis by ignoring coupling capacitances to signal nets and MOS device capacitances, and considering only the grounded capacitances of the power grid and the decoupling capacitors.

There are three other issues related to power grid modeling. First, for electromigration purposes, via arrays should be extracted as resistance arrays so that current crowding can be modeled. Electromigration problems are primarily seen in the vias and if the via array is modeled as a single resistance, such problems could be masked. Second, the inductance of the package pins also creates a voltage drop in the power grid. This drop is created by the time-varying current in the pins (v = L di/dt). This effect is typically handled by adding a fixed amount of drop on top of the on-chip IR-drop estimate. Third, a word of caution about network reduction or crunching. Most commercial extraction tools have options to reduce the size of an extracted network.
This reduction is typically performed using reduced-order modeling techniques with interconnect delay being the target. This reduction is intended for signal nets and is done so that errors in the interconnect delay are kept below a certain threshold. For IR-drop analysis, such crunching should not be done since we are not interested in the delay. Moreover, during the reduction the nodes at which transistors hook up to the power grid could be removed.

Block Current Signatures

As mentioned above, accurate modeling of the current signatures of the devices that are connected to the power grid is important. At a certain point in the design cycle of a microprocessor, different blocks may be at different stages of completion. This implies that multiple current signature models should be available so that all the blocks in the design can be modeled at various stages in the design.53

The most accurate model is to provide transient current signatures for all the devices that are connected to the supply or ground grid. This assumes that the transistor-level representation of the entire block is available. The transient current signatures are obtained by transistor-level simulation (typically with a fast transient simulator) with user-specified input vectors. As mentioned earlier, in order to maintain correlation with other blocks, the input vectors for each block must be derived from a common chip-wide input vector set. At the chip level, the vectors are usually hot loops (i.e., the vectors try to turn on as many blocks as possible). The block-level inputs for the transistor-level simulation are obtained by monitoring the signal values at the block inputs during a logic simulation of the entire chip with the hot loop vectors.

At the other end of the spectrum, the least accurate current model for a block is an area-based dc current signature. This is employed at early stages of analysis when the block design is not complete. The average current consumption per unit area of the block can be computed from the average power consumption


specification for the chip and the normal supply voltage value. Since the peak current can be larger than the average current, some multiple of the average per-unit-area current is multiplied by the block area to compute the current consumption for the block. An intermediate current model can be derived from a full-chip gate-level power estimation tool. Given a set of input vectors, this tool computes the average power consumed by each block over a cycle. From the average power consumption, an average current can be computed for each cycle. Again, to account for the difference between the peak and average currents, the average current can be multiplied by a constant factor. Hence, one obtains a multi-cycle dc current signature for the block in this model.
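The area-based dc current model amounts to a one-line calculation. The sketch below assumes the per-unit-area current is obtained by spreading the chip's average current uniformly over the chip area; the function name, the peak-to-average multiplier, and the numbers are hypothetical.

```python
def block_dc_current(chip_avg_power, vdd, chip_area, block_area, peak_factor):
    """Area-based dc current estimate for a block that is not yet designed.

    chip_avg_power / vdd is the chip's average current; dividing by chip_area
    gives an average current density, which is scaled by the block area and
    by a multiplier that accounts for peak current exceeding the average.
    """
    avg_current_density = chip_avg_power / vdd / chip_area
    return peak_factor * avg_current_density * block_area


# Hypothetical chip: 40 W at 2.0 V over 200 mm^2; block of 10 mm^2; factor 2.
i_block = block_dc_current(40.0, 2.0, 200.0, 10.0, 2.0)  # amperes
```

The same scaling by a constant factor applies to the intermediate gate-level model, except that the average current is then computed per block and per cycle rather than from the chip-wide specification.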

Matrix Solution Techniques

The large size of power grids places very stringent demands on the linear system solver, making it the most important part of an IR-drop analysis tool. The power grids in typical state-of-the-art microprocessors usually contain multiple layers of metal (processes with up to six layers of metal are currently available) and the grid is usually designed as a mesh. Therefore, the network cannot usually be reduced significantly using a tree-link type of transformation. In older-generation microprocessors, the power network was often “routed” and therefore more amenable to tree-link type reductions. In networks of this type, significant reduction in the size can typically be obtained.54

In general, matrix solution techniques can be categorized into two major types: direct and iterative.55 The size and structure of the conductance matrix of the power grid is important in determining the type of linear solution technique that should be used. Typically, the power grid contains millions of nodes, but the conductance matrix is very sparse (typically, less than five entries per row or column of the matrix). Since it is a conductance matrix, the matrix will also be symmetric positive definite. For a purely resistive grid, however, the conductance matrix may be ill-conditioned.

Iterative solution techniques apply well to sparse systems, but their convergence can be slowed down by ill-conditioning. Convergence can usually be improved by applying pre-conditioners. Another important advantage of iterative methods is that they do not suffer from size limitations as much as direct techniques. Iterative techniques usually need to store the sparse matrix and a few iteration vectors during the solution. The disadvantage of iterative techniques is in transient solution. If constant time steps are used during transient simulation, the conductance matrix remains the same from one time point to another and only the right-hand side vector changes.
Iterative techniques depend on the right-hand side and so a fresh solution is required for each time point during transient simulation. The solution from previous time points cannot be reused. The most widely used iterative solution technique for IR-drop analysis is the conjugate gradient solution technique. Typically, a pre-conditioner such as incomplete Cholesky pre-conditioning is also used in conjunction with the conjugate gradient scheme.

Direct techniques rely on first factoring the matrix and then using these factors with the right-hand side vector to find the solution. Since the matrix is symmetric positive definite, one can apply specialized direct techniques such as Cholesky factorization. The main advantage of direct techniques in the context of IR-drop analysis is in transient analysis. As explained earlier, transient simulation with constant time steps will result in the linear solution of a fixed matrix. Direct techniques can factor this matrix once and the factors can be reused with different right-hand side vectors to give some efficiency. The main disadvantage of direct techniques is the memory needed to store the factors of the conductance matrix. Although the conductance matrix is sparse, its factors are not, and this means that the memory usage will be O(n^2), where n is the size of the matrix.
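The factor-once, solve-many pattern of direct techniques can be sketched on a toy grid. The example below is a dense, pure-Python Cholesky factorization for illustration only; a production solver would use sparse data structures and fill-reducing orderings. The three-node rail, its 1-ohm segments, and the current values are hypothetical, and the unknowns are taken to be voltage drops below the ideal supply.

```python
import math


def cholesky(g):
    """Return lower-triangular L with L * L^T = g for a symmetric positive definite g."""
    n = len(g)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = g[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_with_factor(low, b):
    """Solve (L * L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


# Hypothetical rail: pad -1 ohm- n1 -1 ohm- n2 -1 ohm- n3 (conductances in siemens).
G = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
L = cholesky(G)  # factor once

# Reuse the factor for two time points with different current draws (amperes).
drops_t1 = solve_with_factor(L, [0.1, 0.1, 0.1])
drops_t2 = solve_with_factor(L, [0.0, 0.0, 0.3])
```

Each additional time point costs only the two triangular substitutions, which is the efficiency that makes direct techniques attractive for constant-time-step transient analysis.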

Exploiting Hierarchy

From the discussions above, it is clear that IR-drop analysis of large microprocessor designs can be limited by size restrictions. The most effective way to reduce the size is to exploit the hierarchy in the design. In this discussion, we will assume a two-level hierarchy consisting of the chip and its constituent blocks. This hierarchy in the blocks also partitions the entire power distribution grid into two parts: the global


grid and the intra-block grid. The global grid distributes power from the chip pads to tap points in the various blocks (these are called block ports) and the intra-block grid distributes power from these tap points to the transistors in the block. This partitioning allows us to apply hierarchical analysis. First, the intra-block power grid can be analyzed to find the voltages at the transistor tap points. This analysis assumes that the voltages at the block ports are equal to ideal supply (Vdd) or ground (0). The intra-block analysis must also determine a macromodel for the block, which is then used for analyzing the global grid. A block admittance macromodel will consist of a current source at each port and an admittance matrix relating the currents and voltages among the ports. The size of the admittance matrix will be equal to the number of ports and each entry will model the effect of the voltage at one port on the current at some other port. In other words, the off-diagonal entries in the admittance matrix will model current redistribution between the ports of the block. Note that, in general, the admittance matrix will be dense and have p^2 entries if p is the number of ports. If n is the number of nodes in the intra-block grid, this block would have contributed a sparse submatrix of size n to the global grid during flat analysis. For hierarchical analysis, this block contributes a dense submatrix of size p. If p .

The comparison can be: equal (eq), not equal (ne), greater than (gt), etc. A predicate type is specified for each destination predicate. Predicate defining instructions are also predicated, as specified by Pin. The predicate type determines the value written to the destination predicate register based upon the result of the comparison and of the input predicate, Pin. For each combination of comparison result and Pin, one of three actions may be performed on the destination predicate: it can write 1, write 0, or leave it unchanged.
There are six predicate types which are particularly useful: the unconditional (U), OR, and AND type predicates and their complements (U-bar, OR-bar, and AND-bar). Table 63.1 contains the truth table for these predicate definition types.

TABLE 63.1 Predicate Definition Truth Table

                      Pout
Pin  Comparison   U   U-bar   OR   OR-bar   AND   AND-bar
 0       0        0     0     -      -       -       -
 0       1        0     0     -      -       -       -
 1       0        0     1     -      1       0       -
 1       1        1     0     1      -       -       0

A dash (-) indicates that the destination predicate is left unchanged.

Unconditional destination predicate registers are always defined, regardless of the value of Pin and the result of the comparison. If the value of Pin is 1, the result of the comparison is placed in the predicate register (or its complement for the complemented type, U-bar). Otherwise, a 0 is written to the predicate register. Unconditional predicates are used for blocks that are executed based on a single condition. The OR-type predicates are useful when execution of a block can be enabled by multiple conditions, such as logical AND (&&) and OR (||) constructs in C. OR-type destination predicate registers are set if Pin is 1 and the result of the comparison is 1 (0 for the complemented type, OR-bar); otherwise, the destination predicate register is


FIGURE 63.11 Instruction sequence: (a) program code, (b) traditional execution, (c) predicated execution.

unchanged. Note that OR-type predicates must be explicitly initialized to 0 before they are defined and used. However, after they are initialized, multiple OR-type predicate defines may be issued simultaneously and in any order on the same predicate register. This is possible because an OR-type predicate define either writes a 1 or leaves the register unchanged, which allows implementation as a wired logical OR condition. AND-type predicates are analogous to the OR-type predicates. AND-type destination predicate registers are cleared if Pin is 1 and the result of the comparison is 0 (1 for the complemented type, AND-bar); otherwise, the destination predicate register is unchanged.

Figure 63.11 contains a simple example illustrating the concept of predicated execution. Figure 63.11(a) shows a common if-then-else programming construct. The related control flow representation of that code is illustrated in Fig. 63.11(b). Using if-conversion, the code in Fig. 63.11(b) is then transformed into the code shown in Fig. 63.11(c). The original conditional branch is translated into a pred_eq instruction. Predicate register p1 is set if the condition (A = B) is true, and p2 is set if the condition is false. The “then” part of the if-statement is predicated on p1 and the “else” part is predicated on p2. The pred_eq instruction simply decides whether the addition or the subtraction instruction is performed and ensures that one of the two parts is not executed.

There are several performance benefits for the predicated code. First, the microprocessor does not need to make any branch predictions since all the branches in the code are eliminated. This removes the penalties associated with mispredicted branches. More importantly, the predicated instructions can utilize the multiple-instruction execution capabilities of modern microprocessors while avoiding misprediction penalties.
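The predicate-define rules and the if-conversion example described above can be modeled in a few lines of software. The following Python sketch is illustrative only: the function names, the string labels for the predicate types, and the operands x and y are assumptions made for this example, and the Python `if` statements merely stand in for hardware that issues both instructions and commits only the one whose predicate is 1.

```python
def pred_define(ptype, p_in, cmp, old):
    """Return the new value of a destination predicate register.

    ptype: "U", "OR", "AND" or a complemented type "U-bar", "OR-bar", "AND-bar"
    p_in:  input predicate (0 or 1)
    cmp:   result of the comparison (truthy or falsy)
    old:   current value of the destination predicate
    """
    complemented = ptype.endswith("-bar")
    base = ptype[:-4] if complemented else ptype
    c = (not cmp) if complemented else bool(cmp)  # complemented types invert the comparison
    if base == "U":      # always writes: comparison result if Pin is 1, else 0
        return 1 if (p_in and c) else 0
    if base == "OR":     # writes 1 or leaves the register unchanged
        return 1 if (p_in and c) else old
    if base == "AND":    # writes 0 or leaves the register unchanged
        return 0 if (p_in and not c) else old
    raise ValueError(f"unknown predicate type: {ptype}")


def predicated_if_else(a, b, x, y):
    """If-converted form of: if (a == b) r = x + y; else r = x - y;"""
    # One pred_eq-style define sets both predicates (Pin is taken as 1 here).
    p1 = pred_define("U", 1, a == b, 0)
    p2 = pred_define("U-bar", 1, a == b, 0)
    r = None
    if p1:               # "then" part, predicated on p1
        r = x + y
    if p2:               # "else" part, predicated on p2
        r = x - y
    return r
```

Because an OR-type define can only write 1 or leave the register alone, applying several such defines to a register initialized to 0 gives the same result in any order, which is the property the text relies on for the wired-OR implementation.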

Speculative Execution

The amount of ILP available within basic blocks is extremely limited in non-numeric programs. As such, processors must optimize and schedule instructions across basic block boundaries to achieve higher performance. In addition, future processors must contend with both long-latency load operations and long-latency cache misses. When load data is needed by subsequent dependent instructions, the processor execution must wait until the cache access is complete. In these situations, out-of-order machines dynamically reorder the instruction stream to execute nondependent instructions. Additionally, out-of-order machines have the advantage of executing instructions that follow correctly predicted branch instructions. However, this approach requires complex circuitry at the cost of chip die space. Similar performance gains can be achieved using static compile-time speculation methods without complex out-of-order logic.

Speculative execution, a technique for executing an instruction before knowing whether its execution is required, is an important technique for exploiting ILP in programs. Speculative execution is best known for hiding memory latency. These methods rely on instruction set architecture support in the form of special speculative instructions. A compiler utilizes speculative code motion to achieve higher performance in several ways. First, in regions of code where insufficient ILP exists to fully utilize the processor resources, useful instructions


FIGURE 63.12 Instruction sequence: (a) traditional execution, (b) speculative execution.

may be executed. Second, instructions at the beginning of long dependence chains may be executed early to reduce the computation’s critical path. Finally, long-latency instructions may be initiated early to overlap their execution with other useful operations.

Figure 63.12 illustrates a simple example of code before and after a speculative compile-time transformation is performed to execute a load instruction above a conditional branch. Figure 63.12(a) shows how the branch instruction and its implied control flow define a control dependence that restricts the load operation from being scheduled earlier in the code. Cache miss latencies would halt the processor unless out-of-order execution mechanisms were used. However, with speculation support, the code in Fig. 63.12(b) can be used to hide the latency of the load operation. The solution requires the load to be speculative or nonfaulting. A speculative load will not signal an exception for faults such as address alignment or address space access errors. Essentially, the load is considered silent for these occurrences. The additional check instruction in Fig. 63.12(b) enables these signals to be detected when execution does reach the original location of the load. When the other path of the branch is taken, such silent signals are meaningless and can be ignored. Using this mechanism, the load can be placed above all existing control dependences, providing the compiler with the ability to hide load latency. Details of compiler speculation can be found in Ref. 9.
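The interplay between a nonfaulting load and its check instruction can be illustrated with a small software model. This Python sketch is only an analogy under stated assumptions: the names `load_speculative`, `check`, `DeferredFault`, and the dictionary used as memory are all hypothetical, and real hardware records the deferred fault in architectural state rather than in a Python object.

```python
class DeferredFault:
    """Token recorded by a speculative load that would have faulted."""
    def __init__(self, reason):
        self.reason = reason


MEMORY = {0x100: 42}          # hypothetical memory: address -> value


def load_speculative(addr):
    """Nonfaulting load: never raises, returns data or a silent fault token."""
    if addr not in MEMORY:
        return DeferredFault(f"invalid address {addr:#x}")
    return MEMORY[addr]


def check(value):
    """Check placed at the load's original location: surface any deferred fault."""
    if isinstance(value, DeferredFault):
        raise MemoryError(value.reason)
    return value


def run(take_load_path, addr):
    r1 = load_speculative(addr)   # load hoisted above the branch
    if take_load_path:            # the original conditional branch
        return check(r1) + 1      # original load location: check, then use the data
    return 0                      # other path: any deferred fault is simply ignored
```

In this model, a bad address on the path that never needed the load has no visible effect, while the same bad address on the path that uses the data raises the exception at the point where the original load stood.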

63.6 Industry Trends

The microprocessor industry is one of the fastest moving industries today. Healthy demand from the marketplace has stimulated strong competition, which in turn has resulted in great technical innovations.


Computer Microprocessor Trends

The current trends of computer microprocessors include deep pipelining, high clock frequency, wide instruction issue, speculative and out-of-order execution, predicated execution, natural data types, large on-chip caches, floating-point capabilities, and multiprocessor support.

In the area of pipelining, the Intel Pentium II processor is pipelined approximately twice as deeply as its predecessor, the Pentium. The deep pipeline has allowed the Pentium II processor to run at a much higher clock frequency than Pentium. In the area of wide instruction issue, the Pentium II processor can decode and issue up to three X86 instructions per clock cycle, compared to the two-instruction issue bandwidth of Pentium. Pentium II has dedicated a very significant amount of chip area to the Branch Target Buffer, Reservation Station, and Reorder Buffer to support speculative and out-of-order execution. These structures together allow the Pentium II processor to perform much more aggressive speculative and out-of-order execution than Pentium. In particular, Pentium II can coordinate the execution of up to 40 X86 instructions, which is several times more than Pentium.

In the area of predicated execution, Pentium II supports a conditional move instruction that was not available in Pentium. This trend is furthered by the next-generation IA-64 architecture, where all instructions can be conditionally executed under the control of predicate registers. This ability will allow future microprocessors to execute control-intensive programs much faster than their predecessors.

In the area of data types, the MMX instructions from Intel have become a standard feature of all X86 microprocessors today. These instructions take advantage of the fact that multimedia data items are typically represented with a smaller number of bits (8 to 16 bits) than the width of an integer data path today (32 to 64 bits).
Based on the observation that the same operation is often repeated on all data items in multimedia applications, the architects of MMX specified that each MMX instruction performs the same operation on several multimedia data items packed into one integer word. This allows each MMX instruction to process several data items simultaneously to achieve significant speed-up in targeted applications. In 1998, AMD proposed the 3DNow! instructions to address the performance needs of 3-D graphics applications. The 3DNow! instructions are designed based on the concept that 3-D graphics data items are often represented in single-precision floating-point format and do not require the sophisticated rounding and exception handling capabilities specified in the IEEE Standard format. Thus, one can pack two graphics floating-point data items into one double-precision floating-point register for more efficient floating-point processing of graphics applications. Note that MMX and 3DNow! apply similar concepts to the integer and floating-point domains, respectively.

In the area of large on-chip caches, the popular strategies used in computer microprocessors are either to enlarge the first-level caches or to incorporate second-level and sometimes third-level caches on-chip. For example, the AMD K7 microprocessor has a 64-KB first-level instruction cache and a 64-KB first-level data cache. These first-level caches are significantly larger than those found in the previous generations. For another example, the Intel Celeron microprocessor has a 128-KB second-level combined instruction and data cache. These large caches are enabled by the increased chip density that allows many more transistors on the chip. The Compaq Alpha 21364 microprocessor combines both strategies: a 64-KB first-level instruction cache, a 64-KB first-level data cache, and a 1.5-MB second-level combined cache.
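Returning to the packed-data idea behind MMX described above, the following Python sketch shows what it means for one operation to act on several small items packed into one integer word. It is a software illustration only, not the MMX instruction set: the lane count, the 8-bit item width, the wraparound arithmetic, and the function names are assumptions chosen for this example.

```python
LANES = 8                      # eight items packed into one 64-bit word
BITS = 8                       # each item is 8 bits wide
MASK = (1 << BITS) - 1


def pack(items):
    """Pack LANES small integers into a single integer word (lane 0 in the low bits)."""
    word = 0
    for i, value in enumerate(items):
        word |= (value & MASK) << (i * BITS)
    return word


def unpack(word):
    """Split a packed word back into its LANES items."""
    return [(word >> (i * BITS)) & MASK for i in range(LANES)]


def packed_add(a, b):
    """Add two packed words lane by lane; carries never cross a lane boundary."""
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])
```

A single call to `packed_add` performs eight independent 8-bit additions, which is the source of the speed-up the text attributes to packed multimedia instructions; in hardware the lanes are processed in parallel rather than in a loop.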
In the area of floating-point capabilities, computer microprocessors in general have much stronger floating-point performance than their predecessors. For example, the Intel Pentium II processor achieves several times the floating-point performance of the Pentium processor. For another example, most RISC microprocessors now have floating-point performance that rivals supercomputer CPUs built just a few years ago.

Due to the increasing demand for multiprocessor enterprise computing servers, many computer microprocessors now seamlessly support cache coherence protocols. For example, the AMD K7 microprocessor provides direct support for seamless multiprocessor operation when multiple K7 microprocessors are connected to a system bus. This capability was not available in its predecessor, the AMD K6.


Embedded Microprocessor Trends

There are three clear trends in embedded microprocessors. The first trend is to integrate a DSP core with an embedded CPU/controller core. Embedded applications increasingly require DSP functionalities such as data encoding in disk drives and signal equalization for wireless communications. These functionalities enhance the quality of service of the end products. At the 1998 Embedded Microprocessor Forum, ARM, Hitachi, and Siemens all announced products with both DSP and embedded microprocessors [10].

Three approaches exist in the integration of DSP and embedded CPUs. One approach is to simply have two separate units placed on a single chip. The advantage of this approach is that it simplifies the development of the microprocessor. The two units are usually taken from existing designs. The software development tools can be directly taken from each unit’s respective software support environment. The disadvantage is that the application developer needs to deal with two independent hardware units and two software development environments. This usually complicates software development and verification.

An alternative approach to integrating DSP and embedded CPUs is to add the DSP as a co-processor of the CPU. The CPU fetches all instructions and forwards the DSP instructions to the co-processor. The hardware design is more complicated than in the first approach due to the need to interface the two units more closely, especially in the area of memory accesses. The software development environment also needs to be modified to support the co-processor interaction model. The advantage is that the software developers now deal with a much more coherent environment.

The third approach to integrating DSP and embedded CPUs is to add DSP instructions to a CPU instruction set architecture. This usually requires brand-new designs to implement the fully integrated instruction set architecture.
The second trend in embedded microprocessors is to support the development of single-chip solutions for large-volume markets. Many embedded microprocessor vendors offer designs that can be licensed and incorporated into a larger chip design that includes the desired input/output peripheral devices and Application-Specific Integrated Circuit (ASIC) design. This paradigm is referred to as system-on-a-chip design. A microprocessor that is designed to function in such a system is often referred to as a licensable core.

The third major trend in embedded microprocessors is aggressive adoption of high-performance techniques. Traditionally, embedded microprocessors have been slow to adopt high-performance architecture and implementation techniques. They also tend to reuse software development tools, such as compilers, from the computer microprocessor domain. However, due to the rapid increase of required performance in embedded markets, the embedded microprocessor vendors are now moving quickly to adopt high-performance techniques. This trend is especially clear in the DSP microprocessors. Texas Instruments, Motorola/Lucent, and Analog Devices have all announced aggressive EPIC DSP microprocessors to be shipped before the Intel/HP IA-64 EPIC microprocessors.

Microprocessor Market Trends

Readers who are interested in market trends for microprocessors are referred to Microprocessor Report, a periodical publication by MicroDesign Resources (www.MDRonline.com). In every issue, there is a summary of microarchitecture features, physical characteristics, availability, and pricing of microprocessors.

References

1. J. Turley, RISC volume gains but 68K still reigns, Microprocessor Report, vol. 12, pp. 14-18, Jan. 1998.
2. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco, CA, 1990.
3. J.E. Smith, A study of branch prediction strategies, Proceedings of the 8th International Symposium on Computer Architecture, pp. 135-14, May 1981.
4. W.W. Hwu and T.M. Conte, The susceptibility of programs to context switching, IEEE Transactions on Computers, vol. C-43, pp. 993-1003, Sept. 1994.
5. L. Gwennap, Klamath extends P6 family, Microprocessor Report, vol. 1, pp. 1-9, Feb. 1997.
6. R.M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM Journal of Research and Development, vol. 11, pp. 25-33, Jan. 1967.
7. J.R. Allen et al., Conversion of control dependence to data dependence, Proceedings of the 10th ACM Symposium on Principles of Programming Languages, pp. 177-189, Jan. 1983.
8. V. Kathail, M.S. Schlansker, and B.R. Rau, HPL PlayDoh architecture specification: Version 1.0, Tech. Rep. HPL-93-80, Hewlett-Packard Laboratories, Palo Alto, CA, Feb. 1994.
9. S.A. Mahlke et al., Sentinel scheduling: A model for compiler-controlled speculative execution, ACM Transactions on Computer Systems, vol. 11, Nov. 1993.
10. Embedded Microprocessor Forum (San Jose, CA), Oct. 1998.


Gupta, S., Gupta, R.K. "ASIC Design" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


64 ASIC Design

Sumit Gupta, Rajesh K. Gupta
University of California at Irvine

64.1 Introduction
64.2 Design Styles
64.3 Steps in the Design Flow
64.4 Hierarchical Design
64.5 Design Representation and Abstraction Levels
64.6 System Specification
64.7 Specification Simulation and Verification
64.8 Architectural Design
     Behavioral Synthesis • Testable Design
64.9 Logic Synthesis
     Combinational Logic Optimization • Sequential Logic Optimization • Technology Mapping • Static Timing Analysis • Circuit Emulation and Verification
64.10 Physical Design
     Layout Verification
64.11 I/O Architecture and Pad Design
64.12 Tests after Manufacturing
64.13 High-Performance ASIC Design
64.14 Low Power Issues
64.15 Reuse of Semiconductor Blocks
64.16 Conclusion

64.1 Introduction

Microelectronic technology has matured considerably in the past few decades. Systems which until the start of the decade required a printed circuit board for implementation are now being developed on a single chip. These systems-on-a-chip (SOCs) are becoming a reality due to vast improvements in chip fabrication and process technology. Key components in SOCs and other semiconductor chips are Application-Specific Integrated Circuits (ASICs). These are specialized circuit blocks or entire chips which are designed specifically for a given application or an application domain. For instance, a video decoder circuit may be implemented as an ASIC chip to be used inside a personal computer product or in a range of multimedia appliances. Due to the custom nature of these designs, it is often possible to squeeze in more functionality under performance requirements, while reducing system size, power, heat, and cost, than is possible with standard IC parts. Due to cost and performance advantages, ASICs and semiconductor chips with ASIC blocks are used in a wide range of products, from consumer electronics to space applications.

Traditionally, the design of ASICs has been a long and tedious process because of the different steps in the design process. It has also been an expensive process because of ASIC manufacturing costs, which are prohibitive for all but applications requiring more than tens of thousands of IC parts. Lately, the


situation has been changing in favor of increased use of ASIC parts, in part helped by robust design methodologies and increased use of automated circuit synthesis tools. These tools allow designers to go from high-level design descriptions all the way to final chip layouts and mask generation for the fabrication process. These developments, coupled with an increasing market for semiconductor chips in nearly all everyday devices, have spurred demand for ASICs and chips which have ASICs in them.

ASIC design and manufacturing span a broad range of activities, which includes product conceptualization, design and synthesis, verification, and testing. Once the product requirements have been finalized, a high-level design is done from which the circuit is synthesized or successively refined to the lowest level of detail. The design has to be verified for functionality and correctness at each stage of the process to ensure that no errors are introduced and the product requirements are met. Testing here refers to manufacturing test, which involves determining whether the chip is free of manufacturing defects. This is a challenging problem since it is difficult to control and observe internal wires in a manufactured chip and it is virtually impossible to repair the manufactured chips. At the same time, volume manufacturing of semiconductors requires that the product be tested in a very short time (usually less than a second). Hence, we need to develop a test methodology which allows us to check whether a given chip is functional in the shortest possible amount of time.

In this chapter, we focus on ASIC design issues and their relationship to other ASIC aspects, such as testability, power optimization, etc. We concentrate on the design flow, methodology, synthesis, and physical issues, and relate these to the computer-aided design (CAD) tools available. The rest of this chapter is organized in the following manner. Section 64.2 introduces the notion of a design style and the ASIC design methodologies. Section 64.3 outlines the steps in the design process, followed by a discussion of the role of hierarchy and design abstractions in the ASIC design process. The following sections on architectural design, logic synthesis, and physical design give examples to demonstrate the key ideas. We elucidate the availability and the use of appropriate CAD tools at various steps of the ASIC design.

64.2 Design Styles

ASIC design starts with an initial concept of the required IC part. Early in this product conceptualization phase, it is important to decide the design style that will be most suitable for the design and validation of the eventual ASIC chip. A design style refers to a broad method of designing circuits which uses specific techniques and technologies for the design implementation and validation. In particular, a design style determines the specific design steps and the use of library parts for the ASIC part. Design styles are determined, in part, by the economic viability of the design, as determined by tradeoffs between performance, pricing, and production volume. For some applications, such as defense systems and space applications, although the volume is low, the cost is of little concern due to the time-criticality of the application and the requirements of high performance and reliability. For applications such as consumer electronics, the high volume can offset high production costs.

Design styles are broadly classified into custom and semi-custom designs [1]. Custom designs, as the name suggests, require the complete design to be hand-crafted so as to optimize the circuit for performance and/or area for a given application. Although this is an expensive design style in terms of effort and cost, it leads to high-quality circuits for which the cost can be amortized over large-volume production. The semi-custom design style limits the circuit primitives and uses predesigned blocks which cannot be further fine-tuned. These predesigned primitive blocks are usually optimized, well-designed, and well-characterized, and ultimately help raise the level of abstraction in the design. This design style leads to reduced design times and facilitates easier development of CAD tools for design and optimization.
These CAD tools allow the designer to choose among the various available primitive blocks and interconnect them to achieve the design functionality and performance. Semi-custom design styles are becoming the norm due to increasing design complexity. At the current level of circuit complexity, the loss in quality by using a semi-custom design style is often very small compared to a custom design style.


FIGURE 64.1 Classification of custom and semi-custom design styles.

Semi-custom designs can be classified into two major classes: cell-based design and array-based design, which can be further subdivided into subclasses as shown in Fig. 64.1 [1]. Cell-based designs use libraries of predesigned cells or cell generators, which can synthesize cell layouts given their functional description. The predesigned cells can be characterized and optimized for the various process technologies that the library targets.

Cell-based designs can be based on standard-cell design, in which basic primitive cells are designed once and, thereafter, are available in a library for each process technology or foundry used. Each cell in the library is parameterized in terms of area, delay, and power. These libraries have to be updated whenever the foundry technology changes. CAD tools can then be used to map the design to the cells available in the library in a step known as technology mapping or library binding. Once the cells are selected, they are placed and wired together.

Another cell-based design style uses cell generators to synthesize primitive building blocks which can be used for macro-cell-based design (see Fig. 64.1). These generators have traditionally been used for the automatic synthesis of memories and programmable logic arrays (PLAs), although recently module generators have been used to generate complex datapath components such as multipliers [2]. Module generators for macro-cell generation are parameterizable; that is, they can be used to generate different instances of a module, such as an 8 × 8 and a 16 × 8 multiplier.

In contrast to cell-based designs, array-based designs use a prefabricated matrix of non-connected components known as sites. These sites are wired together to create the circuit required. Array-based circuits can either be pre-diffused or pre-wired, also known as mask programmable and field programmable gate arrays, respectively (MPGAs and FPGAs).
In MPGAs, wafers consisting of arrays of unwired sites are manufactured and then the sites are programmed by connecting them with wires, via different routing layers during the chip fabrication process. There are several types of these pre-diffused arrays, such as gate arrays, sea-of-gates, and compacted arrays (see Fig. 64.1). Unlike MPGAs, pre-wired gate arrays or FPGAs are programmed outside the semiconductor foundry. FPGAs consist of programmable arrays of modules implementing generic logic. In the anti-fuse type of FPGAs, wires can be connected by programming the anti-fuses in the array. Anti-fuses are open-circuit devices that become a short-circuit when an appropriate current is applied to them. In this way, the circuit design required can be achieved by connecting the logic module inputs appropriately by programming the anti-fuses. On the other hand, memory-based FPGAs store the information about the interconnection and configuration of the various generic logic modules in memory elements inside the array. The use of FPGAs is becoming more and more popular as the capacity of the arrays and their performance are improving. At present, they are used extensively for circuit prototyping and verification. Their relative ease of design and customization leads to low cost and time overheads. However, FPGA is still an expensive technology since the number of gate arrays required to implement a moderately complex


design is large. The cost per gate of prototype design is decreasing due to continuous density and capacity improvement in FPGA technology.

Hence, there are several design styles available to a designer, and choosing among them depends upon tradeoffs among factors such as cost, time-to-market, performance, and reliability. In real-life applications, nearly all designs are a mix of custom and semi-custom design styles, particularly cell-based styles. Depending on the application, designers adopt an approach of embedding some custom-designed blocks inside a semi-custom design. This leads to lower overheads since only the critical parts of the design have to be hand-crafted. For example, a microprocessor typically has a custom-designed data path, while the control logic is synthesized using a standard cell-based technique. Given the complexity of microprocessors, recent efforts in CAD are attempting to automate the design process of data path blocks as well [3]. Prototyping and circuit verification using FPGA-based technologies have become popular because a faulty design discovered after the chip is manufactured leads to high costs and time overruns.
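The standard-cell idea discussed earlier in this section, a library whose cells are characterized by area, delay, and power and from which a CAD tool selects during library binding, can be sketched as a small data structure. Everything in this Python example is hypothetical: the cell names, the numeric values, their units, and the selection rule (smallest area that meets a delay bound) are assumptions made for illustration, and real technology mapping covers an entire netlist rather than choosing one cell at a time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str        # library cell name (hypothetical)
    function: str    # logic function the cell implements
    area: float      # arbitrary area units
    delay: float     # arbitrary delay units
    power: float     # arbitrary power units


# A toy library for one process technology; values are invented for the example.
LIBRARY = [
    Cell("NAND2_X1", "nand2", area=1.0, delay=0.30, power=0.8),
    Cell("NAND2_X2", "nand2", area=1.6, delay=0.20, power=1.3),
    Cell("NAND2_X4", "nand2", area=2.8, delay=0.14, power=2.4),
    Cell("INV_X1", "inv", area=0.6, delay=0.15, power=0.4),
]


def bind(function, max_delay, library=LIBRARY):
    """Pick the smallest-area cell that implements `function` within `max_delay`."""
    candidates = [c for c in library
                  if c.function == function and c.delay <= max_delay]
    if not candidates:
        raise LookupError(f"no {function} cell meets delay {max_delay}")
    return min(candidates, key=lambda c: c.area)
```

With a loose delay bound the smallest cell is chosen, and tightening the bound forces a larger, faster variant. Because the library is just data, retargeting to a new foundry process amounts to swapping in a re-characterized library, which mirrors the text's remark that libraries must be updated whenever the technology changes.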

64.3 Steps in the Design Flow

An important decision for any design team is the design flow that they will adopt. The design flow defines the approach used to take a design from an abstract concept through the specification, design, test, and manufacturing steps [28]. The waterfall model has been the traditional model for ASIC development. In this model, the design goes through various steps or phases while it is constantly refined to the highest level of detail. This model involves minimal interaction between design teams working on different phases of the design.

The design process starts with the development of a specification and high-level design of the ASIC, which may include requirements analysis, architecture design, executable specification or C model development, and functional verification of the specification. The design is then coded at the register transfer level (RTL) in hardware description languages such as VHDL [12] or Verilog [13]. The functionality of the RTL code is verified against the initial specification (e.g., C model), which is used as the golden model for verifying the design at every level of abstraction (see Section 64.5). The RTL is then synthesized into a gate-level netlist, which is run through a timing verification tool that verifies that the ASIC meets the specified timing constraints. The physical design team subsequently develops a floorplan for the chip, places the cells, and routes the interconnects, after which the chip is manufactured and tested (see Fig. 64.2).

The disadvantage of this design methodology is that as the complexity of the system being designed increases, the design becomes more error prone. The requirements are not properly tested until a working system model is available, which only happens late in the design cycle. Errors are hence discovered late in the design process, and error correction often involves a major redesign and a rerun through the steps of the design.
This leads to several design reworks and may even involve multiple chip fabrication runs. The steps and different levels of detail that the design of an integrated circuit goes through as it progresses from concept to chip fabrication are shown in Fig. 64.2.

FIGURE 64.2 A typical ASIC design flow.

The requirements of a design are represented by a behavioral model which represents


the functions the design must implement, together with the timing, area, power, testing, and other constraints. This behavioral model is usually captured in the form of an executable functional specification in a language such as C (or C++). This functional specification is simulated for a wide set of inputs to verify that all the requirements and functionalities are met. For instance, when developing a new microprocessor, after the initial architectural design, the design team develops an instruction set architecture. This involves making decisions on issues such as the number of pipeline stages, width of the data path, size of the register file, number and type of components in the data path, etc. An instruction set simulator is then developed so that the range of applications being targeted (or a representative set) can be simulated on the processor simulator. This verifies that the processor can run the application or a benchmark suite within the required timing performance. The simulator also verifies that the high-level design is correct and attempts to identify data and pipeline hazards in the data path architecture. The feedback from the simulator may be used to refine the instruction set of the processor.

The functional specification (or behavioral model) is converted into a register transfer level (RTL) model, either manually or by using a behavioral or high-level synthesis tool [27]. This RTL model uses register-level components like adders, multipliers, registers, multiplexors, etc. to represent the structural model of the design with the components and their interconnections. This RTL model is simulated, typically using event-driven simulation (see Section 64.7), to verify the functionality and coarse-level timing performance of the model. The tested and verified software functional model is used as the golden model against which the results are compared.
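The role of the golden model can be made concrete with a small sketch. The text describes a functional specification written in C; for consistency with the other examples added to this chapter, the sketch below uses Python instead, and the 8-bit adder, the function names, and the exhaustive vector set are all assumptions chosen for illustration. The "RTL-like" model is built structurally from chained full adders, and its outputs are compared against the behavioral golden model on every test vector.

```python
def golden_add8(a, b):
    """Behavioral (golden) model: 8-bit sum and carry-out."""
    total = a + b
    return total & 0xFF, total >> 8


def rtl_add8(a, b):
    """Structural model: eight full adders chained through the carry signal."""
    carry, result = 0, 0
    for i in range(8):
        x, y = (a >> i) & 1, (b >> i) & 1
        s = x ^ y ^ carry                      # full-adder sum bit
        carry = (x & y) | (carry & (x ^ y))    # full-adder carry-out
        result |= s << i
    return result, carry


def verify(model, reference, vectors):
    """Return the input vectors on which `model` disagrees with `reference`."""
    return [v for v in vectors if model(*v) != reference(*v)]


# Exhaustive stimulus is feasible only because the example is tiny.
VECTORS = [(a, b) for a in range(256) for b in range(256)]
MISMATCHES = verify(rtl_add8, golden_add8, VECTORS)
```

An empty mismatch list means the refined model reproduces the golden model on the vectors tried. For realistic designs exhaustive stimulus is not possible, so the quality of this check depends on how representative the chosen vectors are.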
The RTL model is then refined to the logic gate level using logic synthesis tools, which implement the components with gates or combinations of gates, usually using a cell-library-based methodology. The gate-level netlist undergoes the most extensive simulation. Besides functionality, other constraints such as timing and power are also analyzed. Static timing analysis tools are used to analyze the timing performance of the circuit and identify critical paths in the design. The gate-level netlist is then converted into a physical layout by floorplanning the chip area, placing the cells, and routing the interconnects. The layout is used to generate the set of masks¹ required for chip fabrication.

Logic synthesis is a design methodology for the synthesis and optimization of gate-level logic circuits. Before the advent of logic synthesis, ASIC designers used a capture-and-simulate design methodology.4 In this methodology, a team of design architects starts with the requirements for the product and produces a rough block diagram of the chip architecture. This architecture is then refined to ensure completeness and functionality and then given to a team of logic and layout designers who use logic and circuit schematic design tools to capture the design and each of its functional blocks and their interconnections. Layout, placement, and routing tools are then used to map this schematic into the technology library or to another custom or semi-custom design style. However, the development of logic synthesis in the last decade has raised the ante to a describe-and-synthesize methodology. Designs are specified in hardware description languages (HDLs) such as VHDL12 and Verilog,13 using Boolean equations and finite-state machine descriptions or diagrams, in a technology-independent form.
Logic synthesis tools are then used to synthesize these Boolean equations and finite-state machine descriptions into functional units and control units, respectively.5,6,23 Behavioral or high-level synthesis tools work at a higher level of abstraction and use programs, algorithms, and dataflow graphs as inputs to describe the behavior of the system and synthesize the processors, memories, and ASICs from them.24,27 They assist in making decisions that have been the domain of chip architects and have been based mostly on experience and engineering intuition.

The relationship of the ASIC design flow, synthesis methodologies, and CAD tools is shown in Fig. 64.3. This figure shows how the design can go from behavior to register to gate to mask level via several paths, which may be manual or automated or may involve sourcing out to another vendor. Hence, at any stage of the design, the design refinement step can be performed either manually or with the help of a synthesis CAD tool, or the design at that stage can be sent to a vendor who refines the current design to the final fabrication stage. This concept has been popular among fab-less design companies that use technology libraries from foundries for logic synthesis and send out the logic gate netlist design to the foundries for final mask generation and manufacturing. However, in more recent years, vendors have been specializing in the design of reusable blocks which are sold as intellectual property (IP) to other design houses, who then assemble these blocks together to create systems-on-a-chip.28

FIGURE 64.3 Manual design, automated synthesis, and outsourcing.

Frequently, large semiconductor design houses are structured around groups which specialize in each one of these stages of the design. Hence, they can be thought of as independent vendors: the architectural design team defines the blocks in the design and their functionality, and the logic design team refines the system design into a logic-level design for which the masks are then generated by the physical design team. These masks are used for chip fabrication by the foundry. In this way, the design style becomes modular and easier to manage.

¹ Masks are the geometric patterns used to etch the cells and interconnects onto the silicon wafer to fabricate the chip.

64.4 Hierarchical Design

Hierarchical decomposition of a complex system into simpler subsystems, and further decomposition into subsystems of ever-greater simplicity, is a long-established design technique. This divide-and-conquer approach attempts to handle the problem’s complexity by recursively breaking it down into manageable pieces which can be easily implemented. Chip designers extend the same hierarchical design technique by structuring the chip into a hierarchy of components and subcomponents. An example of hierarchical digital design is shown in Fig. 64.4.25 This figure shows how a 4-bit adder can be created using four single-bit full adders (FAs) which are designed using logic gates such as AND, OR, and XOR gates. The FAs are composed into the 4-bit adder by interconnecting their pins appropriately; in this case, the carry-out of the previous FA is connected to the carry-in of the next FA in a ripple-carry manner.

FIGURE 64.4 An example of hierarchical design: (a) a 4-bit ripple-carry adder; (b) internal view of the adder composed of full adders (FAs); (c) full-adder logic schematic.

In the same manner, a system design can be recursively broken down into components, each of which is composed of smaller components until the smallest components can be described in terms of gates and/or transistors. At any level of the hierarchy, each component is treated as a black-box with a known input-output behavior, but how that behavior is implemented is unknown. Each black-box is designed by building simpler and simpler black-boxes based on the behavior of the component. The smallest primitive components (such as gates and transistors) are used at the lowest level of hierarchy. Besides assisting in breaking down the complexity of a large system, hierarchy also allows easier conceptualization of the design and its functionality. At higher levels of the hierarchy, it is easier to understand the functionality at a behavioral level without having to worry about lower-level details. Hierarchical design also enables the reuse of components with little or no modification to the original design.

The design approach described above is a top-down design approach to hierarchy. The top-down design approach is a recursive process that takes a high-level specification and successively decomposes and refines it to the lowest level of detail and ends with integration and verification. This is in contrast to a bottom-up approach, which starts by designing and building the lowest-level components and successively uses these components to build components of ever-increasing complexity until the final design requirements are met. Since a top-down approach assumes that the lowest-level blocks specified can, in fact, be designed and built, the whole process has to be repeated if a low-level block turns out to be infeasible. Current design teams use a mixture of top-down and bottom-up methodologies, wherein critical low-level blocks are built concurrently as the system and block specifications are refined. The bottom-up approach attempts to abstract parameters of the low-level components so that they can be used in a generic manner to build several components of higher complexity.
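The hierarchy of the 4-bit ripple-carry adder can be illustrated with a small software model. The following is a minimal sketch, not a reproduction of the schematic in Fig. 64.4(c): the particular gate structure of the full adder (two XORs, two ANDs, one OR) and all function names are assumptions made for illustration. The top-level adder treats the full adder as a black-box and only wires carry-out to carry-in.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Leaf component: a single-bit full adder built from XOR, AND, and OR.

    The gate structure is an assumed, common realization.
    """
    partial = a ^ b                       # XOR gate
    s = partial ^ cin                     # XOR gate
    cout = (a & b) | (partial & cin)      # two AND gates feeding an OR gate
    return s, cout


def ripple_carry_adder4(a_bits, b_bits, cin=0):
    """Parent component: four full adders chained carry-out to carry-in.

    Bit lists are ordered least-significant bit first.
    """
    sum_bits = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # the FA is used as a black-box
        sum_bits.append(s)
    return sum_bits, carry


def to_bits(value, width=4):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    s, cout = ripple_carry_adder4(to_bits(9), to_bits(5))
    print(from_bits(s), cout)
```

Because the parent depends only on the input-output behavior of `full_adder`, the leaf could be replaced by a different gate-level realization without touching the 4-bit adder, which is the reuse property described above.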

64.5 Design Representation and Abstraction Levels

Another hierarchical approach is based on the concept of design abstraction. This approach views the design with different degrees of resolution at different levels of abstraction. In the design process, the design goes through several levels of abstraction as it progresses from concept to fabrication: system, register-transfer, logic, and geometrical.1 The system-level description of the design consists of a behavioral description in terms of functions, algorithms, etc. At the register-transfer level, the circuit is represented by arithmetic and storage units and corresponds to the register transfer level (RTL) discussed earlier. The register-level components are selected and interconnected so as to achieve the functionality of the design. The logic level describes the circuit in terms of logic gates and flip-flops, and the behavior of the system can be described in terms of a set of logic functions. These logic components are represented at the geometric level by a layout of the cells and transistors using geometric masks.

These levels of abstraction can be further understood with the help of the simplified ASIC design flow shown in Fig. 64.5.26 This figure shows behavior as the initial abstraction level, which represents the system-level functionality of the design. The register-transfer level comprises components and their interconnections and, for more complex systems, may also comprise standard components such as ROMs (read-only memories), ASICs, etc. The logic level corresponds to the gate-level representation, and the set of masks of the physical layout of the chip corresponds to the geometric level. This figure also shows the synthesis processes and the steps involved in each process. These synthesis processes help refine the design from one level of detail to the next finer level of detail. They are known as behavioral synthesis, logic synthesis, and physical synthesis, and each of them is discussed in detail in later sections. It is possible to go from one level of detail to the next by following the steps within the synthesis process, either manually or with the help of CAD tools.

FIGURE 64.5 Simplified ASIC design flow: the progress of the design from the behavior to mask level and the synthesis processes and steps involved.

The circuit can also be viewed at different levels of design detail as the design progresses from concept to fabrication. These different design representations or views are differentiated by the type of information that they capture. These representations can be classified as behavioral, structural, and physical.4 In a behavioral representation, only the functional behavior of the system is described and the design is treated as a black-box.
A structural representation refines the design by adding information about the components in the system and their interconnection. The detailed physical characteristics of the components are specified in the physical representation, including the placement and routing information.

The relationships between the different abstraction levels and design representations or views are captured by the Y-chart shown in Fig. 64.6.10 This chart shows how the same design at the system level can have a behavioral view and a structural view. Whereas the behavioral view would conceptualize the design in terms of flowcharts and algorithms, the structural view would represent the design in terms of processors, memories, and other logic blocks. Similarly, the behavioral view at the register-transfer level would represent the register transfer flow by a set of behavioral statements, whereas the structural view would represent the same flow by a set of components and their interconnections. At the logic level, a circuit can be represented with Boolean equations or finite-state machines in the behavioral view, or it can be represented as a network of interconnected gates and flip-flops in the structural view. The geometric level is represented as transistor functions in the behavioral view, as transistors in the structural view, and as layouts, cells, chips, etc. in the physical view. In this way, the Y-chart model helps to understand the various phases, levels of detail, and views of a design. There have been many extensions to this model, including adding aspects such as testing and design processes.11

FIGURE 64.6 Y-chart: relationship of different abstraction levels and design representations.

64.6 System Specification

In the following sections, we will discuss each of the steps in the design process of an ASIC. Any design or product starts with determining and capturing the requirements of the system. This is typically done in the form of a system requirements specification document. This specification describes the end-product requirements, functionality, and other system-level issues that impose requirements such as environment, power consumption, user acceptance requirements, and system testing. This leads to more specific requirements on the device itself, in terms of functionality, interfaces, operating modes, operating conditions, performance, etc. At this stage, an initial analysis is done on the system requirements to determine the feasibility of the specification. The design style to be used is determined (see Section 64.2), and the foundry, process, and library are also selected. Some other parameters such as packaging, operating frequency, number of pins on the chip, area, and memory size are also estimated.

Traditionally, for simple designs, design entry is done after the high-level architecture design has been completed. This design entry can be in the form of schematics of the blocks that implement the architecture. However, with increasing complexity of designs, concerns about system modeling and verification tools are becoming predominant. System designers want to ensure hardware design quality and quickly produce a working hardware model, simulate it with the rest of the system, and synthesize and formally verify it for specific properties. Hence, designers are adopting high-level hardware description languages (HDLs) for the initial specification of the system. These HDLs are simulatable and, hence, the functionality and architectural design can be simulated to verify the correctness and fulfillment of end-product requirements.

In present ASIC design methodologies used in the industry, HDLs are typically used to capture designs at a register-transfer level, and logic synthesis tools are then used to synthesize the design. However, the use of executable specifications for capturing system requirements has recently become popular, as proposed in the Specify-Explore-Refine (SER) methodology for system design.4 After this specify phase, the explore phase consists of evaluating various system components to implement the system functionality within the specified design constraints. In the refine phase, the specification is updated with the design decisions made during the explore phase. This methodology leads to a better understanding of the system functionality at a very early stage in the process. An executable specification is particularly useful to validate the product functionality and correctness and for the automatic verification of various design properties. Executable specifications can be easily simulated and the same model can be used for synthesis. Current design methodologies, by contrast, produce functional verification models in C or C++ which are then thrown away, and the design is manually entered again for the design tools.

The selection of a language to capture the system specification is an area of active research. The language must be easy to understand and program, and must be able to capture all the system’s characteristics, besides having the support of CAD tools which can synthesize the design from the specification.
Many languages have been used to capture system descriptions, including VHDL,12 Verilog,13 HardwareC,14 Statecharts,15 Silage,16 Esterel,17 and SpecSyn.18 More recently, there has been a move toward the use of programming languages for digital design, both because they can easily express executable behaviors and allow quick hardware modeling and simulation, and because system designers are familiar with general-purpose, high-level programming languages such as C and C++.53 These languages have raised the level of abstraction at which the designer specifies the design, bringing it closer to the conceptual model. The conceptual behavioral design can then be partitioned and structured, and components can be allocated. In this manner, the design progresses from a purely functional specification to a structural implementation in a series of steps known as refinement. This methodology leads to lower design times, more efficient exploration of a larger design space, and lower re-design time.

64.7 Specification Simulation and Verification

Once a design has been captured in a hardware description language or a schematic capture tool, the functionality of the specification needs to be verified. The most popular technique for design verification is simulation, in which a set of input values is applied to the design and the output values are compared to the expected output values. Simulation is used at every stage of the design process and at various levels of design description: behavioral, functional, logic, circuit, and switch level. Formal verification tools attempt to do equivalence checks between different stages of a design.

Currently, in the industry, once the requirements of a design have been finalized, a functional specification is captured by a software model of the design in C or C++, which also models other design properties and architectural decisions. This software model is extensively simulated to verify that the design meets the system requirements and to verify the correctness of the architectural design. Often, a C or C++ model is used as the golden model against which the hardware model is verified at every stage of the design. The functional specification is translated (usually manually) into a structural RTL description, and their outputs are compared by simulation to verify that their functionality is equivalent. This is typically done by applying a set of input patterns to both models and comparing their outputs on a cycle-by-cycle basis. As the design is further refined from RTL to logic level to physical layout, the circuit is simulated at each stage to verify functional correctness and some other design properties, such as timing and area constraints.

The simulations of the RTL, logic, and physical level descriptions are done by different kinds of simulators.19 Logic-level simulators simulate the circuit at the logic gate level and are used extensively to verify the functional correctness of the design.
Circuit-level simulation, which operates at the circuit level, is the most accurate simulation technique. The SPICE program is the foremost circuit simulation and analysis tool.20 SPICE simulates the circuit by solving the matrix differential equations for circuit currents, voltages, resistances, and conductances. Switch-level simulators, on the other hand, model transistors as switches and, unlike logic simulators, do not assume that wires are ideal but instead assume that they have some capacitance. Another simulator, RSIM, is a switch-level simulator with timing, which models CMOS gates as pull-down or pull-up structures and calculates their resistance to power or ground, so that it can be used with output capacitance to determine rise and fall times.21

Logic-level simulators are typically event-driven. These model the system as a discrete-event system by defining appropriate events of interest and how the events are propagated throughout the model.6,22 Hardware description languages (HDLs) such as VHDL and Verilog12,13 have been designed based on event-driven simulation semantics. They have constructs to represent hardware features such as concurrency, hierarchy, and timing. Extensive simulation and functional verification techniques are used by designers at every stage of the design to ensure that no bugs are introduced in the process of refining the design from the behavioral level to the final layout.
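The event-driven style of logic simulation can be made concrete with a short sketch. The model below is an illustration under stated assumptions rather than the algorithm of any particular simulator: nets are two-valued and start at 0, each gate has a fixed transport delay, and the netlist format, the `simulate` function, and the net names are all hypothetical. An event is a change of value on a net at some time; only gates in the fanout of a changed net are re-evaluated.

```python
import heapq
from collections import defaultdict
from itertools import count

GATE_FUNCS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def simulate(gates, stimuli):
    """gates: {output_net: (kind, (in1, in2), delay)}; stimuli: [(time, net, value)].

    Returns the final net values and a trace of (time, net, value) changes.
    Assumption: all nets start at 0, which is a consistent state for
    AND/OR/XOR-only netlists.
    """
    values = defaultdict(int)
    fanout = defaultdict(list)
    for out, (_, inputs, _) in gates.items():
        for net in inputs:
            fanout[net].append(out)

    seq = count()  # tie-breaker so heap entries with equal times stay ordered
    queue = [(t, next(seq), net, v) for t, net, v in stimuli]
    heapq.heapify(queue)
    trace = []
    while queue:
        now = queue[0][0]
        changed = set()
        # Phase 1: apply every event scheduled for the current time.
        while queue and queue[0][0] == now:
            _, _, net, value = heapq.heappop(queue)
            if values[net] != value:          # no change means no event
                values[net] = value
                trace.append((now, net, value))
                changed.add(net)
        # Phase 2: re-evaluate only the gates driven by changed nets.
        for out in sorted({g for net in changed for g in fanout[net]}):
            kind, inputs, delay = gates[out]
            new_value = GATE_FUNCS[kind](*(values[i] for i in inputs))
            heapq.heappush(queue, (now + delay, next(seq), out, new_value))
    return dict(values), trace


# Hypothetical full-adder netlist with unit gate delays.
FULL_ADDER = {
    "p": ("XOR", ("a", "b"), 1),
    "s": ("XOR", ("p", "cin"), 1),
    "g": ("AND", ("a", "b"), 1),
    "t": ("AND", ("p", "cin"), 1),
    "cout": ("OR", ("g", "t"), 1),
}

if __name__ == "__main__":
    final, trace = simulate(FULL_ADDER, [(0, "a", 1), (0, "cin", 1)])
    for event in trace:
        print(event)
```

With these stimuli the trace shows the sum output `s` rising at time 1 and falling again at time 2 before `cout` settles at time 3, a transient that a purely functional (zero-delay) model would not expose.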

64.8 Architectural Design

After the design specification has been captured, the system is partitioned into blocks with clearly defined functionality, and the interfaces and interaction between the blocks are defined. This structuring of the design is known as architectural design. Besides partitioning, architectural decisions include deciding the number and type of components (such as adders, multipliers, and ALUs) and their interconnects (such as buses), whether the design will be pipelined², the number of pipeline stages, and the operations in each pipeline stage. These high-level architectural decisions have traditionally been made by a few experienced system architects in the design team. However, in the last decade, CAD tools such as high-level synthesis tools have been introduced which automatically or interactively make many of these architectural decisions: they schedule the design, allocate components for it, and interconnect them to create a register transfer level design optimized for different parameters.24,27

Behavioral Synthesis

Behavioral or high-level synthesis, which is the automated synthesis of systems from behavioral descriptions, has received a lot of attention recently due to its ability to provide the low turn-around time required for an ASIC design. High-level synthesis accepts a behavioral description of a system and generates a data path for this description at a register-transfer level.29-31 High-level synthesis tools allow designers to work at a system level closer to the original conceptual model of the system. High-level synthesis tools can be targeted to optimize the area, performance, power, and testability of the final design.

The tasks in high-level synthesis can be broadly classified into allocation, scheduling, and binding. Allocation consists of determining the number and type of components and other resources that are required for the implementation of the design. These components and resources are at the register-transfer level (RTL) and are taken from a library of available modules, which includes components such as ALUs, adders, multipliers, register files, registers, and multiplexers. Allocation also determines the number, width, and type of each bus in the system. Scheduling assigns each of the operations in the behavioral description to time intervals, also known as control steps. The data flows from one stage of registers to the next during each control step and may be operated upon by a functional unit. The control steps are usually the length of a clock cycle. The operations in each control step are then assigned to particular register-level components by the binding task. Hence, operations are assigned to functional units and variables to storage units, and the interconnect between the various units is also established.

Consider the sample data flow graph shown in Fig. 64.7(a) and its corresponding data path shown in Fig. 64.7(b).
This data path was synthesized using a high-level synthesis system.30 The data flow graph shows the variables X1, X2, X3, Y1, Y2, Y3, Z1, and W1, and the operations A to E. The data path in Fig. 64.7(b) shows the mapping of the variables to the registers and the operations to the functional units. Multiplexers are not shown in this figure. This example demonstrates the ability of CAD tools to synthesize behavioral descriptions into data paths. These CAD tools can also synthesize the control logic and make high-level decisions, such as number of pipeline stages, etc.27

FIGURE 64.7 High-level synthesis: (a) a sample data flow graph, (b) corresponding data path.

² Pipelining is a technique where a series of operations are done in a pipeline or assembly-line fashion so as to increase concurrency among different types of operations.
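The scheduling, allocation, and binding tasks of high-level synthesis can be sketched on a toy data flow graph. The graph below is hypothetical and is not the one in Fig. 64.7(a); the operation kinds, the function names, and the one-control-step-per-operation assumption are all illustrative. The scheduler used here is the simple as-soon-as-possible (ASAP) heuristic, and allocation is derived from the schedule afterward, whereas production tools generally treat these tasks jointly and under resource constraints.

```python
from collections import defaultdict

# Hypothetical data flow graph: operation -> (kind, predecessor operations).
DFG = {
    "A": ("mul", []),
    "B": ("add", []),
    "C": ("add", ["A", "B"]),
    "D": ("mul", ["B"]),
    "E": ("add", ["C", "D"]),
}


def asap_schedule(dfg):
    """Scheduling: place each operation in the earliest control step that
    follows all of its predecessors (each operation takes one step)."""
    step = {}

    def visit(op):
        if op not in step:
            preds = dfg[op][1]
            step[op] = 1 + max((visit(p) for p in preds), default=0)
        return step[op]

    for op in dfg:
        visit(op)
    return step


def allocate_and_bind(dfg, step):
    """Allocation: for each kind, as many units as the busiest control step
    needs. Binding: assign each operation to a specific unit instance."""
    per_step = defaultdict(list)
    for op in sorted(dfg):
        per_step[(step[op], dfg[op][0])].append(op)

    allocation = defaultdict(int)
    binding = {}
    for (_, kind), ops in per_step.items():
        allocation[kind] = max(allocation[kind], len(ops))
        for index, op in enumerate(ops):
            binding[op] = f"{kind}{index}"   # e.g. "add0", "mul0"
    return dict(allocation), binding


if __name__ == "__main__":
    schedule = asap_schedule(DFG)
    allocation, binding = allocate_and_bind(DFG, schedule)
    print(schedule)
    print(allocation)
    print(binding)
```

For this graph the schedule needs three control steps, and one adder plus one multiplier suffice because no control step contains two operations of the same kind. Register binding and interconnect generation, which the text also assigns to the binding task, are omitted from the sketch.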

Testable Design

Testability of digital circuits has become a major concern with the increasing complexity of designs. Testability refers to the ability to detect manufacturing faults in a fabricated chip. Designers are increasingly using a design for testability (DFT) methodology to ensure that the circuit is testable. DFT attempts to modify the circuit during the design phase, without affecting its functionality, so as to make it testable. There are several approaches and techniques that are used to make chips and the individual components in them testable. Additional test hardware and pins are added to the chip, such as boundary scan test hardware,37 which enables testing the chip, introduces test modes to the chip functionality, and provides pins dedicated to shifting in the test vectors and shifting out their responses.

The testability of the internal components of the chip is enhanced primarily by two techniques: serial scan and built-in self-test (BIST). In the first approach, the components within a chip are tested by applying test vectors to the input pins of the chip, shifting out the output patterns, and checking them for correctness. This approach is known as the full-scan or partial-scan test technique, since all or some of the registers in the chip are connected in a test scan chain. In the second approach, known as the built-in self-test (BIST) technique, the chip is tested by specialized hardware built into the chip that self-tests the components in the chip.

Full-Scan Testing

In practice, the full-scan technique for testing the data path in a chip is more popular among designers. This technique improves the observability and controllability of the circuit by using scan registers.37 A scan register has both serial-shift and parallel-load capability and has additional serial-in and serial-out pins over a standard register.
All the scan registers in the circuit are tied together in a chain by connecting the serial-out of a register to the serial-in of the next register. During normal circuit operation mode, the scan registers behave as parallel-load registers. However, in the test mode, a test pattern is serially scanned into all the registers of the circuit, the circuit is clocked, and the values in the registers are then serially shifted out. The output bit vector values are compared with the expected results to verify that the circuit is functioning correctly. In this way, only one serial-in pin and one serial-out pin have to be assigned at the chip level. However, since each test vector has to be scanned in serially and its response serially scanned out, this approach is very slow; this slow test speed is the main disadvantage of full-scan. The overhead of scan-based test techniques comprises area overhead and performance slow-down. However, the overhead is relatively low compared to other schemes such as BIST.

The full-scan technique is demonstrated in Fig. 64.8. In this figure, there are four combinational blocks, each of which feeds into registers which have been modified to be scan registers. There is a scan-in pin and a scan-out pin at the chip level, and all the scan registers are tied together to form a scan chain.

FIGURE 64.8 Full-scan register-based design.

Built-In Self-Testing

The built-in self-test (BIST) methodology has gained popularity over the past decade, and techniques have been demonstrated to incorporate it into behavioral synthesis tools.30,40 Memory blocks such as RAMs (random access memories) are usually tested by inserting BIST logic in the memory design. These BIST circuits apply pseudo-random patterns to the memory and test it by several techniques, such as writing data into an address location and then reading it back out and comparing the two. Datapath units can also be tested by BIST techniques by applying a set of test vectors to the inputs of the units and doing a signature analysis of the output bit stream.37,39 A matching signature indicates, with high probability, that the unit is not faulty. The input test vectors are generated in a pseudo-random manner using registers which are configured as pseudo-random pattern generators (PRPGs). Similarly, signature analysis is done by configuring registers as signature analyzers (SAs). Registers which can be configured in this manner are known as built-in logic block observers (BILBOs).
One way, then, of ensuring testability of a functional unit is by creating an n:m embedding for the functional unit, where n is the number of inputs to the functional unit and m is the number of outputs. In such an embedding, it is ensured that each functional unit is fed by at least n registers and that the functional unit feeds at least m registers which are different from the input registers. The input registers are configured as PRPGs and the output registers as SAs. In the test mode of the chip, the input PRPGs generate a test vector and a clock cycle is applied to the functional unit’s embedding, at the end of which the outputs of the unit are analyzed by the output registers configured as SAs. In this way, each functional unit can be tested by running the chip in test mode. However, to reduce the test time of the chip, multiple functional units can be tested simultaneously, provided that no input PRPG register of one unit is the output SA register of another. A test schedule or plan can be generated for testing the various units in as few test sessions as possible.38
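The roles of the PRPG and SA registers in such an embedding can be sketched in software. The model below is a minimal illustration under several assumptions that are not taken from the text: 4-bit registers, a linear feedback shift register (LFSR) with feedback polynomial x^4 + x^3 + 1 as the pattern generator, a multiple-input signature register (an LFSR with the response word XORed in each cycle) as the signature analyzer, and a 4-bit adder as the functional unit under test. All names and seeds are hypothetical.

```python
MASK = 0xF  # 4-bit registers (an assumption for this sketch)


def lfsr_step(state: int) -> int:
    """Advance a 4-bit LFSR with feedback polynomial x^4 + x^3 + 1."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & MASK


def misr_step(signature: int, response: int) -> int:
    """Fold one response word of the unit into the running signature."""
    return lfsr_step(signature) ^ (response & MASK)


def run_bist(unit, seed_a=1, seed_b=6, cycles=15):
    """Test mode of a 2-input, 1-output embedding: two PRPG registers feed
    the unit and one SA register compresses its outputs."""
    a, b, signature = seed_a, seed_b, 0
    for _ in range(cycles):
        signature = misr_step(signature, unit(a, b))
        a, b = lfsr_step(a), lfsr_step(b)
    return signature


def good_adder(a, b):
    return (a + b) & MASK


def faulty_adder(a, b):
    return good_adder(a, b) & 0b1110  # models output bit 0 stuck at 0


if __name__ == "__main__":
    golden = run_bist(good_adder)   # reference signature from a good unit
    observed = run_bist(faulty_adder)
    print("fault detected" if observed != golden else "signatures match")
```

The tester only compares the final signature against the golden one, so no per-vector responses need to leave the chip. Because many response streams are compressed into a short signature, a faulty unit can in principle produce the golden signature (aliasing); longer signature registers make this less likely.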


FIGURE 64.9 Built-in self-test (BIST)-based testable data path for sample data flow graph.

Consider the example of the data path of the sample data flow graph shown earlier in Fig. 64.7(b). In this figure, the multiplier module is part of a 2:1 embedding consisting of registers R2, R3, and R5. In the test mode, R2 and R3 are configured as pseudo-random test pattern generators, whereas R5 is configured as a signature register. However, neither of the adders can be part of a 2:1 embedding, since their outputs are stored in the same registers as their inputs. By adding a register R6 (shown dotted in Fig. 64.9) at the output of the left adder, we can make this adder testable, since it becomes part of a 2:1 embedding consisting of input registers R1 and R2 and output register R6. The other adder can be made testable by changing the binding of variables to registers such that Z1 is mapped to R3 and Y3 is mapped to R2, along with the necessary changes in the interconnect. If the modified embedding is used, the second adder will be part of a 2:1 embedding which consists of input registers R3 and R4 and output register R2. The modified testable data path is shown in Fig. 64.9. There are several other ways that this circuit can be modified to make it testable.

Some of the main challenges in this BIST-based methodology for testing data path units are ensuring that each functional unit is part of an n:m embedding while converting as few registers as possible into BILBOs (since these are more expensive in terms of area), and generating an efficient test schedule such that the total test time is minimized.

Although in this section we have attempted to introduce the issues in testability and design for testability, it is by no means a complete picture of the field of testing. Several test issues such as delay faults, mixed-signal test, and partial scan have not been discussed. There are several techniques and test styles which can be adopted, depending on the characteristics of the system under design. Other chapters in this book offer a more detailed discussion on the subject.
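The test scheduling problem mentioned in this section can be illustrated with the embeddings of the modified data path: the multiplier (inputs R2 and R3, output R5), the left adder (inputs R1 and R2, output R6), and the second adder (inputs R3 and R4, output R2). The following is a minimal sketch that applies only the rule stated earlier, namely that two units cannot share a test session if an input PRPG register of one is the output SA register of the other. The unit names, the data layout, and the greedy first-fit grouping are assumptions made for illustration; first-fit is a heuristic and is not guaranteed to find the minimum number of sessions in general.

```python
# Embeddings: unit -> (input PRPG registers, output SA registers).
# Unit names are illustrative labels for the modules discussed in the text.
EMBEDDINGS = {
    "multiplier": ({"R2", "R3"}, {"R5"}),
    "left_adder": ({"R1", "R2"}, {"R6"}),
    "right_adder": ({"R3", "R4"}, {"R2"}),
}


def conflicts(u, v, embeddings):
    """True if a PRPG register of one unit is an SA register of the other."""
    in_u, out_u = embeddings[u]
    in_v, out_v = embeddings[v]
    return bool(in_u & out_v) or bool(in_v & out_u)


def schedule_sessions(embeddings):
    """Greedy first-fit: put each unit in the first session with no conflict."""
    sessions = []
    for unit in embeddings:
        for session in sessions:
            if not any(conflicts(unit, other, embeddings) for other in session):
                session.append(unit)
                break
        else:
            sessions.append([unit])   # no compatible session, open a new one
    return sessions


if __name__ == "__main__":
    for number, session in enumerate(schedule_sessions(EMBEDDINGS), start=1):
        print(number, session)
```

Under this rule the multiplier and the left adder can be tested together, since R2 acts as a pattern generator for both, while the second adder needs its own session because its signature register R2 is a pattern generator for the other two units.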

64.9 Logic Synthesis Logic synthesis deals with the synthesis and optimization of circuits at the logic gate level.5,7-9 Digital circuits typically have sequential and combinational components. These can be specified by finite-state machines, state transition diagrams or tables, Boolean equations, schematic diagrams, or HDL descriptions. Finite-state machine representations are optimized by state minimization and encoding and Boolean functions are optimized either by two-level optimization techniques which are exact or by heuristic multi-level optimization techniques. Logic synthesis includes a range of optimizations and techniques like state machine optimization, multi-level logic optimization, retiming, re-synthesis, technology mapping, or post-layout transistor sizing. The optimization steps are selected and ordered according to the chosen optimization metric,

whether it be area, speed, power, or a tradeoff between these. These steps are divided into two phases: the technology-independent phase, where the logic circuit is optimized by Boolean or algebraic manipulation or state minimization, and the technology-mapping phase, in which the logic network is mapped into a technology library of cells and transistor-level optimizations are then performed. Since circuits are usually a combination of combinational and sequential parts, and the techniques to optimize the two differ considerably, we discuss each one separately.

Combinational Logic Optimization

Combinational circuits can be modeled by two-level sum-of-products expressions. These expressions can be optimized by two-level minimization tools such as Espresso, Mini, or Presto.1,41 Two-level logic networks can be easily mapped onto macrocell-based design styles such as PLAs (programmable logic arrays). However, in practice, logic networks are usually multi-level and, hence, multi-level logic optimization tools such as MIS42 are becoming popular. Unlike two-level logic networks, multi-level network graphs can be mapped onto cell libraries with complex n-level gates, thereby allowing more complex cell and array-based design styles. To demonstrate the technology-independent steps in combinational logic optimization, we show the optimization of a Boolean function representing a two-level logic network in sum-of-products form. Boolean functions can be optimized by minimizing the number of operators using either map-based or table-based methods. The map-based method uses Karnaugh maps to minimize a Boolean function, as the following example shows. Consider the Boolean function:

F = a′b′c′d′ + a′b′c′d + a′b′cd′ + a′b′cd + a′bc′d + a′bcd′ + ab′cd′ + a′bcd + ab′cd + abcd

where a, b, c, and d are single-bit Boolean variables. The Karnaugh map corresponding to this example is shown in Fig. 64.10(a).25 This map represents the terms in the Boolean expression by assigning a 1 in the squares that correspond to a term in the expression. Each term in a Boolean function is called a minterm. A Boolean function of n variables (or literals) has 2^n possible minterms, and an n-cube is defined as a minterm with all n variables. A subcube is a minterm with fewer than n variables in it. From the Karnaugh map shown, we determine that the prime implicants (PIs), which are the subcubes not contained in any other subcube, are a′b′, a′c, a′d, cd, b′c. These are marked in the figure by dashed boxes. The dashed boxes were created by grouping adjacent minterms into the largest possible groups whose sizes are powers of 2 (i.e., 2, 4, 8, etc.). Essential prime implicants are the prime implicants which include a minterm that is not included in any other prime implicant. For this example, all the prime implicants are also essential prime implicants. A cover is a set of prime implicants such that each minterm in the Boolean

FIGURE 64.10 An example function: (a) Karnaugh map, (b) circuit implementation.

function is contained in at least one prime implicant. A minimal cover is a selection of the minimum number of prime implicants that form a cover over all the minterms in the function. For this example, a minimal cover is a′b′, a′c, a′d, cd, b′c. Hence, the reduced Boolean function is:

F = a′b′ + a′c + a′d + cd + b′c

The circuit corresponding to this function is shown in Fig. 64.10(b). The 5-input OR gate at the end of the circuit can be implemented by splitting it into several 2-input OR gates. The same minimization can be done using tabular methods such as the Quine-McCluskey method.25 This method represents the same information in tables and reduces the minterms by iteratively finding subcubes with fewer variables. The reader is referred to standard texts on digital design for further discussion of this method. The Karnaugh map shown in Fig. 64.10(a) conceptually demonstrates the combinational logic optimization process. However, in practice, two-level optimizers such as Espresso are used for logic optimization. Espresso uses an expand-irredundant-reduce iterative algorithm to reduce the size of the given Boolean function.41 An n-variable function can be represented by a set of points in n-dimensional space. The function then has an on-set, which is the set of points for which the function’s value is 1; an off-set, which is the set of points for which the function’s value is 0; and a don’t-care or dc-set, which is the set of points for which the function’s value is don’t care. The basic Espresso algorithm first expands each cube in the on-set to make it as large as possible without covering a point in the off-set (points in the dc-set may be covered). Then, in the irredundant step, for points covered by several cubes, the smaller cubes are removed in favor of the larger covering cubes. Finally, the cubes are reduced so as to minimize the variables in the cubes. The example and strategies discussed above demonstrate the two-level optimization methodology. The final circuit implementation for the example (see Fig. 64.10(b)) has two stages of logic.
However, the cell libraries used to map the gates in the logic circuit to the gates available from the foundry usually contain more complex gates which are a combination of several gates, such as AND-OR, OR-AND, or NOR-AND gates. To fully utilize these cell libraries, multi-level logic optimization techniques are used. These techniques are not restricted to two-level logic networks but instead deal with multiple-level logic circuits. This provides the flexibility required to map the logic network to complex cells in the technology library, hence optimizing area and delay. However, multi-level optimization techniques are not exact, i.e., only heuristics exist for modeling and optimizing multiple-level networks. For further discussion on this subject, the reader is referred to Ref. 1.
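
The tabular (Quine-McCluskey-style) reduction mentioned above can be sketched in a few lines of Python and used to check the prime implicants quoted for the example function F. This is a minimal sketch, not the procedure of any particular tool: the cube encoding (strings over 0, 1, and - for an eliminated variable) and all function names are assumptions made here.

```python
from itertools import combinations

def combine(a, b):
    """Merge two cubes that differ in exactly one specified bit, else return None."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms, n):
    """Iteratively merge cubes; cubes that can no longer be merged are prime."""
    cubes = {format(m, f'0{n}b') for m in minterms}
    primes = set()
    while cubes:
        merged, used = set(), set()
        for a, b in combinations(sorted(cubes), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        primes |= cubes - used
        cubes = merged
    return primes

def covers(cube, minterm, n):
    """True if the cube contains the given minterm."""
    bits = format(minterm, f'0{n}b')
    return all(c in ('-', b) for c, b in zip(cube, bits))

def essential(primes, minterms, n):
    """Primes that are the only cover of at least one minterm."""
    result = set()
    for m in minterms:
        hits = [p for p in primes if covers(p, m, n)]
        if len(hits) == 1:
            result.add(hits[0])
    return result

def to_expr(cube, names='abcd'):
    """Render a cube as a product term, using ' for complement."""
    return ''.join(v + ("'" if c == '0' else '')
                   for v, c in zip(names, cube) if c != '-')

# Minterms of F with a as the most significant bit.
F = [0, 1, 2, 3, 5, 6, 7, 10, 11, 15]
pis = prime_implicants(F, 4)
print(sorted(to_expr(p) for p in pis))
print(essential(pis, F, 4) == pis)   # every prime implicant is essential here
```

For larger functions the covering step needs more than the essential-prime check shown here, which is one reason heuristic tools such as Espresso are used in practice.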

Sequential Logic Optimization

Sequential circuits are usually represented by a finite-state machine (FSM) model. This consists of a combinational circuit and a set of registers as shown in Fig. 64.11. The model has a set of inputs, I, a set of outputs O, the state S, and a clock signal. The clock signal defines the clock cycle, which is a time

FIGURE 64.11 Finite state machine model.

interval in which the combinational circuit analyzes the inputs and the state to calculate the outputs and the next state. At every clock cycle, the data computed by the combinational circuit is stored in the registers along with other state and control information. A finite-state machine (FSM) is defined by the quintuple (S, I, O, f, h), where S, I, and O are the sets of states, inputs, and outputs, respectively, and f and h represent the next-state and output calculation functions. The next-state function f can be represented as f : S × I → S, and the output function h can be represented either as h : S × I → O or as h : S → O, depending on whether the finite-state machine is implemented as a Mealy machine or a Moore machine. In the Mealy machine, the output function depends on the inputs and the state, whereas in the Moore machine the output depends on the state only. In a sequential circuit represented by an FSM, the sets of states, inputs, and outputs, S, I, and O, correspond to k flip-flops, Q0, …, Qk–1, n input signals, I0, …, In–1, and m output signals, O0, …, Om–1. Each of these corresponds to a single bit in the implementation. The finite-state machine model is usually represented using state transition diagrams or state tables.1,25 State transition diagrams are mainly optimized by state minimization and state encoding (explained in the next subsection).

Let us first discuss an example to demonstrate the design of sequential circuits. Consider the example of a modulo-4 counter shown in Fig. 64.12. Figure 64.12(a) shows the finite-state machine transition graph for the counter. The counter counts from 0 to 3 and then wraps back to 0, advancing whenever the count signal C is 1. When the count signal C is 0, the counter stays in the same state. The counter outputs the count Z at each clock cycle. Hence, the state transition graph has four states S0 to S3 corresponding to the count states 0 to 3. There is a transition from one state to the next if C = 1, and the output Z is the count at that time.
If C = 0, the state does not change and the output Z is the same as when entering the state. The states S0 to S3 have been encoded as 00, 01, 11, 10, respectively. This is an example of an input-based or Mealy-type FSM.

FIGURE 64.12 Sequential circuit example: modulo-4 counter (a) FSM for counter, (b) circuit for the counter, (c) state transition table, (d) next state Karnaugh map, (e) output Karnaugh map.

The information from the FSM can be captured in a state transition table as shown in Fig. 64.12(c). In this figure, the present and the next states are shown using their encoding and are marked by bit variables Q1 Q0 and D1 D0, respectively. The output Z is a two-bit variable Z1 Z0 which goes from 0 to 3 (or 00 to 11). The Karnaugh maps corresponding to the next state and the output bit vectors are shown in Figs. 64.12(d) and 64.12(e), respectively. The maximal coverings for all the bits in the next state variables and the output variable are shown in these Karnaugh maps by dotted boxes. Note that although the Karnaugh maps for D1 D0 and Z1 Z0 have been grouped together, their coverings and optimizations are independent. From these coverings, we get the following reduced Boolean equations for the bit variables:

D1 = Q1C′ + Q0C
D0 = Q0C′ + Q1′C
Z1 = Q1C′ + Q0C
Z0 = Q1′Q0C′ + Q1Q0′C′ + Q1Q0C + Q1′Q0′C

where a prime denotes the complement of the preceding variable.
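
As a sanity check on the counter, the Python sketch below evaluates the reduced equations one clock cycle at a time. It assumes the complemented literals fall as D1 = Q1C′ + Q0C, D0 = Q0C′ + Q1′C, Z1 = Q1C′ + Q0C, and Z0 = Q1′Q0C′ + Q1Q0′C′ + Q1Q0C + Q1′Q0′C, which is the placement consistent with the state encoding 00, 01, 11, 10 and with the output being the count reached in that cycle. The function names are illustrative only.

```python
def next_state_and_output(q1, q0, c):
    """Evaluate the reduced equations for one clock cycle (all values are 0 or 1)."""
    nc, nq1, nq0 = 1 - c, 1 - q1, 1 - q0        # complemented literals
    d1 = (q1 & nc) | (q0 & c)
    d0 = (q0 & nc) | (nq1 & c)
    z1 = (q1 & nc) | (q0 & c)
    z0 = (nq1 & q0 & nc) | (q1 & nq0 & nc) | (q1 & q0 & c) | (nq1 & nq0 & c)
    return (d1, d0), (z1 << 1) | z0

# State encodings (Q1, Q0) of S0..S3 as given in the text.
ENCODING = [(0, 0), (0, 1), (1, 1), (1, 0)]

def run(count_inputs, state=(0, 0)):
    """Apply a sequence of C values and collect the output count per cycle."""
    outputs = []
    for c in count_inputs:
        state, z = next_state_and_output(*state, c)
        outputs.append(z)
    return state, outputs

final_state, counts = run([1, 1, 0, 1, 1, 1])
print(counts)        # counts advance when C = 1 and hold when C = 0
```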

The circuit diagram corresponding to these equations is shown in Fig. 64.12(b). The circuit has two D-flip-flops which correspond to the two bit variables in the state, and the combinational part has been implemented using simple AND, OR, and NOT gates. Note that in this example, the state minimization and encoding steps are assumed to have already been done.

State Minimization and Encoding

State minimization aims at reducing the number of machine states used to represent an FSM. Since the minimum number of bits required to encode n states is ⌈log2 n⌉, reducing the number of states can lead to a reduced number of bits and, hence, flip-flops required to encode the states. It also leads to fewer transitions, fewer logic gates, and fewer inputs per gate. These reductions not only lead to lower area cost but also speed up the design and reduce the power consumption. State minimization can be done by finding equivalent states and by using don’t-care information to remove states. Two states are equivalent if and only if, for every input, both the states produce the same output and the corresponding next states are equivalent. Consider the example state transition graph shown in Fig. 64.13(a). The state transition table corresponding to this graph is shown in Fig. 64.13(c). State minimization can be done in two steps. The first

FIGURE 64.13 An example of state minimization: (a) original state transition graph, (b) minimized state transition graph, (c) original state transition table, (d) states grouped based on their outputs, (e) minimized state transition table.

step is finding the states with the same outputs for the same inputs. We group these states such that states in the same group have the same output for each input. This is shown in Fig. 64.13(d). There are three groups u0, u1, and u2 which, respectively, give output 1, 0, and 0 when the input 0 is applied and give output 1, 0, and 1 when the input 1 is applied. In the next step, we compare the next states for each state in a group for all inputs. If, for every input, the next states of two states within a group fall in the same group, then the two states are considered equivalent. In this example, we find that the states s0 and s2 in the group u0 are equivalent, since all the next states of these two states are in the same group. Hence, these two states can be combined into one state, and the minimized state transition table is shown in Fig. 64.13(e). The corresponding minimized state transition graph for the example is shown in Fig. 64.13(b). Note that the transition from s1 to u0 is denoted as X/0 since, for all inputs, when in state s1, the next state is u0 and the output is 0.

After the states have been minimized, state encoding is performed to assign a binary representation to the states of the finite-state machine. In the example shown earlier in Fig. 64.13(b), the minimized state transition graph has four states, whereas the original state transition graph had five states (see Fig. 64.13(a)). Hence, whereas it would have taken 3 bits to encode the five states in the original FSM, the reduced FSM requires only 2 bits for the encoding. Fewer encoding bits imply fewer flip-flops in the circuit and, hence, reduced area and increased speed of the final design. There are several other encoding methodologies, such as Gray encoding, NRZ encoding, etc., which are used to reduce circuit switching, bus switching, etc.1
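
The two-step procedure above (group by outputs, then split groups whose members lead to different groups) can be sketched as a partition refinement in Python. The five-state table below is hypothetical: it was constructed here so that its output groups and merged states resemble the description in the text, and it is not a transcription of Fig. 64.13. The function names are likewise assumptions.

```python
from math import ceil, log2

# Hypothetical machine: state -> {input: (next_state, output)}.
TABLE = {
    's0': {0: ('s2', 1), 1: ('s4', 1)},
    's1': {0: ('s0', 0), 1: ('s2', 0)},
    's2': {0: ('s0', 1), 1: ('s4', 1)},
    's3': {0: ('s1', 0), 1: ('s4', 1)},
    's4': {0: ('s3', 0), 1: ('s1', 1)},
}

def minimize(table):
    """Return the groups of equivalent states as a list of frozensets."""
    inputs = sorted(next(iter(table.values())))
    # Step 1: group states that produce the same output for every input.
    groups = {}
    for s in table:
        key = tuple(table[s][i][1] for i in inputs)
        groups.setdefault(key, set()).add(s)
    partition = [frozenset(g) for g in groups.values()]
    # Step 2: split a group whenever its members' next states fall in
    # different groups; repeat until no group is split any further.
    while True:
        index = {s: k for k, g in enumerate(partition) for s in g}
        refined = {}
        for s in table:
            key = (index[s], tuple(index[table[s][i][0]] for i in inputs))
            refined.setdefault(key, set()).add(s)
        new_partition = [frozenset(g) for g in refined.values()]
        if len(new_partition) == len(partition):
            return new_partition
        partition = new_partition

def encoding_bits(num_states):
    """Minimum number of flip-flops needed to encode the states."""
    return ceil(log2(num_states))

classes = minimize(TABLE)
print(sorted(sorted(c) for c in classes))
print(encoding_bits(len(TABLE)), '->', encoding_bits(len(classes)))
```

In this hypothetical table s0 and s2 merge, so five states reduce to four and the encoding shrinks from 3 bits to 2, mirroring the outcome described for the example in the text.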

Technology Mapping

Technology mapping forms the link between logic synthesis and physical design. After logic synthesis, a circuit-level schematic or netlist of the design is created using a vendor-independent logic library. This library has elements such as low-level gates, flip-flops, latches, and at times, multiplexers, counters, and adders. The schematic entry tool then generates a netlist of the elements with their interconnections. Typically, a netlist translator along with a vendor-specific library is used to replace the vendor-independent generic elements and generate the netlist in a particular vendor’s netlist format. This allows the schematic entry or netlist generation to be independent of the vendor-specific library. The process of transforming the generic cell-based logic network into a vendor library-specific network is known as library binding or technology mapping. This step allows us to retarget the same design to different technologies and implementation styles.

The library contains a set of parameterized logic cells. These cells may be primitive or a combination of a set of cells to produce a commonly used functionality such as adders, shifters, etc. Typically, the cell library vendor provides different libraries optimized for area, performance, power, and/or testability. Each cell in the vendor library contains a physical layout of the cell, its timing model (delay characteristics and capacitances on each input), a wire load model, a behavioral model (VHDL/Verilog model), circuit schematic, cell icon (for schematic tools), and for bigger cells, its routing and testing strategy. CAD tools use the timing characteristics to analyze the circuit and determine the capacitances at each node in the netlist, and use the delay formulas along with the timing characteristics of each element to compute the delays for each node.
Wiring capacitances are included by estimating a wire-load model initially and then later using the back-annotation information from the floorplanning and place-and-route tools (see Section 64.10).

Cell-Library Binding

Cell-library binding is the process of transforming the set of Boolean equations or the Boolean network into a logic gate network with the gates in the cell library. Cell-library binding approaches are classified into two types: rule-based and tree-based approaches. Rule-based approaches iteratively replace parts of the logic network with equivalent cells from the cell library. This is done using local transformations which do not affect the behavior of the circuit. The tree-based approach does either structural covering and matching or Boolean covering and matching. In the structural approach, the logic network is expressed

FIGURE 64.14 Two different network coverings for the same 2-input NAND logic subnetwork.

as an algebraic expression which is represented as a graph. Similarly, the cells in the library are also represented by graphs and the problem is reduced to one of subgraph matching and graph covering. The Boolean approach is similar but uses the matching of Boolean functions instead of graphs. Tree-based matching is similar to pattern matching.33 The cells in the library are represented as pattern graphs and then the aim is to find an optimal covering of the nodes in the logic network so as to optimize for the cost function (which may be area, power, etc.). This problem then reduces to a tree matching and covering problem which can be solved in linear time. One approach is to transform the logic network into a canonical form using only 2-input NAND gates and represent it as a logic graph. The cells in the library are also represented as pattern graphs in the canonical 2-input NAND gate format along with their area and delay costs. The pattern matching algorithm then attempts to find a cover of all the gates in the given logic graph using the cell-library pattern graphs so as to minimize the area and/or delay costs. An illustrative example is shown in Fig. 64.14. In this figure, two different network coverings are shown for the same logic subnetwork. Both these coverings use 3-input NAND gates from the cell library; however, a simple covering could have bound each node with a 2-input NAND gate. Rule-based library binding techniques apply simple rules to identify circuit patterns and replace them with an equivalent pattern from the library. The cells from the library are characterized and rules derived from them. For example, a simple rule might replace two 2-input AND gates in series with a 3-input AND gate. More complex rules can even restructure a subnetwork of the given logic network so as to replace it with a more optimal subnetwork in terms of area and/or delay. 
Rule-based approaches are heuristic, since the quality of results is affected to a great extent by the sequence in which the rules are applied. However, rule-based approaches allow complex transformations such as replacing nodes with high loads by high-drive cells or inserting buffers. Also, rule-based approaches allow stepwise refinement and rebinding of cells to search for globally optimal results.
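
The tree-covering formulation described above can be sketched with a small dynamic program in Python. The subject tree is in the canonical 2-input NAND and inverter form, each library cell is a pattern in the same form, and the best cover of a node is the cheapest matching cell plus the best covers of the subtrees bound to that cell's inputs. The four-cell library, its area values, and all names below are hypothetical and chosen only for illustration.

```python
from functools import lru_cache

# Hypothetical library: cell name -> (pattern in NAND2/INV form, area).
# Single-letter strings in a pattern are cell inputs and match any subtree.
LIBRARY = {
    'INV':   (('inv', 'x'), 1),
    'NAND2': (('nand', 'x', 'y'), 2),
    'AND2':  (('inv', ('nand', 'x', 'y')), 3),
    'NAND3': (('nand', ('inv', ('nand', 'x', 'y')), 'z'), 3),
}

def matches(pattern, node):
    """Yield, for each way the pattern matches at node, the bound input subtrees."""
    if isinstance(pattern, str):              # a cell input binds to any subtree
        yield (node,)
        return
    if isinstance(node, str) or pattern[0] != node[0]:
        return                                # primary input or wrong gate type
    if pattern[0] == 'inv':
        yield from matches(pattern[1], node[1])
        return
    # NAND is commutative, so try both input orders.
    for left, right in ((node[1], node[2]), (node[2], node[1])):
        for bound_l in matches(pattern[1], left):
            for bound_r in matches(pattern[2], right):
                yield bound_l + bound_r

@lru_cache(maxsize=None)
def best_cover(node):
    """Return (minimum area, cells used) for the subtree rooted at node."""
    if isinstance(node, str):                 # primary inputs cost nothing
        return 0, ()
    best = None
    for cell, (pattern, area) in LIBRARY.items():
        for inputs in matches(pattern, node):
            total, cells = area, (cell,)
            for sub in inputs:
                sub_area, sub_cells = best_cover(sub)
                total += sub_area
                cells += sub_cells
            if best is None or total < best[0]:
                best = (total, cells)
    return best

# Subject tree for a 3-input AND expressed with NAND2 and INV nodes only.
SUBJECT = ('inv', ('nand', ('inv', ('nand', 'a', 'b')), 'c'))
area, cells = best_cover(SUBJECT)
print(area, cells)
```

With these assumed areas, binding every node to an INV or NAND2 cell would cost 6, whereas the cover found here uses one inverter and one 3-input NAND. Replacing the area field by a delay estimate gives the analogous delay-driven cover.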

Static Timing Analysis

Timing analysis is required to verify the correctness and the timing performance of a circuit by ensuring that the timing constraints such as set-up and hold times of the flip-flops are met and the critical paths3 in the circuit meet the timing budgets set for them. Static timing analysis exhaustively analyzes all the paths in the circuit netlist to check if they meet the timing requirements of the design. It computes the delay along the various paths and times all of them and determines the critical paths in the circuit. The timing analysis is done using the gate delay, rise time, fall time, capacitance, and load values in the cell library to determine the delay of each gate and the interconnect delay. Delay across a gate (or any other node) depends on the delay through the gate, the loading on the gate, the number of fan-outs, and load due to the interconnect. The delay through a path (i.e., a chain of nodes) is also affected by the

3 A critical path is a path in the circuit which has the maximum delay among all the paths in the circuit from its input to the output of the circuit.

FIGURE 64.15 An example of a false path (i.e., a path which can never be activated).

skew or path delays due to the interconnect capacitances. In deep submicron designs, interconnect delays dominate over gate delays. For computing the path delays during static timing analysis, it is very important to have accurate estimates of the interconnect capacitances and wire-load model of the chip. Early floorplanning techniques are adopted to obtain these accurate estimates (see Section 64.10). In this way, by timing all the paths in the circuit, the timing analyzer can determine all the critical paths in the circuit. However, the circuit may have false paths, which are paths in the circuit which are never exercised during normal circuit operation for any set of inputs. An example of a false path is shown in Fig. 64.15. The path going from the A input of the first multiplexor through the combinational logic out through the B input of the second multiplexor to the output is a false path. This path can never be activated since if the A input of the first multiplexor is activated, then the Sel line will also select the A input of the second multiplexor. Static timing analysis tools are able to identify simple false paths; however, they are not able to identify all the false paths and sometimes report false paths as the critical paths. For hard-to-detect false paths, the designer has to explicitly mark the known false paths as such before running the static timing analysis tool.
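
The core computation of static timing analysis, finding the latest arrival time at every node and tracing back the critical path, can be sketched in Python as a longest-path calculation over the netlist in topological order. The four-gate netlist, the lumped delay values, and the clock period below are hypothetical, and the sketch ignores rise and fall times, interconnect delay, set-up and hold checks, and false paths.

```python
from graphlib import TopologicalSorter

# Hypothetical netlist: node -> list of fan-in nodes; a, b, c are primary inputs.
FANIN = {
    'g1': ['a'],
    'g2': ['g1', 'b'],
    'g3': ['b', 'c'],
    'g4': ['g2', 'g3'],
}
# Assumed lumped delay per gate (ns); primary inputs contribute no delay.
DELAY = {'g1': 1.0, 'g2': 2.0, 'g3': 2.0, 'g4': 2.0}

def arrival_times(fanin, delay):
    """Latest arrival time at each node, plus the fan-in that determines it."""
    arrival, pred = {}, {}
    for node in TopologicalSorter(fanin).static_order():
        inputs = fanin.get(node, ())
        latest = max(inputs, key=lambda p: arrival[p], default=None)
        start = arrival[latest] if latest is not None else 0.0
        arrival[node] = start + delay.get(node, 0.0)
        pred[node] = latest
    return arrival, pred

def critical_path(arrival, pred):
    """Trace back from the node with the largest arrival time."""
    node = max(arrival, key=arrival.get)
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    return path[::-1]

arrival, pred = arrival_times(FANIN, DELAY)
path = critical_path(arrival, pred)
clock_period = 6.0                       # assumed timing budget in ns
slack = clock_period - arrival[path[-1]]
print(path, arrival[path[-1]], slack)
```

A negative slack would indicate that the path misses its budget. Because this traversal is purely structural, it would report a false path as critical if that path happened to be the longest, which is the limitation discussed above.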

Circuit Emulation and Verification

Since testing and correcting a chip once it has been manufactured is a difficult and expensive task, it is essential to verify the functional and timing characteristics of the design beforehand. As mentioned earlier in Section 64.2, FPGAs (field-programmable gate arrays) are increasingly being used for circuit prototyping and verification due to their ease of reconfigurability and programming. Once the netlist of the circuit design has been generated, it is used to program an FPGA-based circuit consisting of several FPGAs (depending on the size of the design).32 Test patterns are then applied to this design to check its functionality, chosen so as to exercise all the functions and all the inputs possible. The outputs of the emulation circuit are compared with the responses expected from the functionality described in the system specification. If design errors are found, the FPGA boards can easily be reprogrammed after the design has been fixed, and it is this ease of reconfigurability that makes FPGAs an attractive, albeit expensive, prototyping system.

64.10 Physical Design

The physical design process consists of specification of area and power of each block, floorplanning, placement, routing, and clock tree design.34,35 The flow of the entire process is shown in Fig. 64.16, starting from logic synthesis to layout, parasitic extraction, and delay calculation. The physical design process starts during the logic synthesis process with the block circuit design, optimization, and characterization steps, along with transistor resizing to take care of loading and timing anomalies.

FIGURE 64.16 Physical design methodology.

Floorplanning is a chip-level layout process where the layout cells, blocks, and inputs/outputs (I/Os) are placed on the chip to create a map of the location of the various blocks and devices. The layout program places the blocks on the chip by defining both their position and orientation, while leaving enough space between blocks for wires and interconnects. An initial floorplan is developed, sometimes as early as the initial architectural design of the system, to assess if the chip can meet its timing, performance, and cost goals. This is done by estimating the sizes of the blocks and the interconnect area. A preliminary floorplan is critical in accurately estimating the area budgets of each of the components, clock distribution requirements of the chip, the wire-load model of the design, and the interconnect resistances and capacitances. These estimates can be used to guide logic synthesis and the layout process. When there is no early floorplanning, an area-based wire-load model is adopted, based on the estimate of the die size of the final chip. However, in this method, the estimates of capacitances for global interconnects can be highly inaccurate.

Placement tools are used to optimally place the components or modules on the chip area. These tools take into account the size, aspect ratios, and pin positions of each component, so that the placement minimizes the area occupied by all the components. Routing tools then lay out or position the wires that connect the components so as to minimize the maximum, total, and average wire length. On-chip routing can be done on multiple layers of metal, depending on the process technology being used. Usually, placement and routing tools make many decisions that affect each other, so they are run iteratively or combined together in a single environment. Place-and-route tools are usually packaged with layout tools.
These tools convert the logic-level design into the mask geometry of the targeted foundry using the technology files of the foundry.

The clock distribution architecture of the chip is determined to a great extent by the area of the chip, placement of the blocks, target clock frequency, and the target library. As the size of chips increases, clock skew and other clock distribution delays become significant. A single clock can be distributed throughout the chip using a balanced clock tree with a low enough skew to prevent hold-time violations for flip-flops that directly drive other flip-flops. However, as the clock frequency and size of the chip increase, this approach leads to extremely large, high-power clock buffers, which are unacceptable. An alternative approach now in use is to distribute a lower-speed clock as a bused signal. Each major block in the chip synchronizes its local clock to the bus clock, either by buffering the bus clock or by using a phase-locked loop (PLL). The local clock can run at a higher frequency which is a multiple of the bus clock.

Once the blocks have been placed and routed, the layout for each block is done either manually or with the help of design automation tools. The layout is verified to check if the design works with the actual values of the parasitics of the interconnect on the chip and the clock distribution network. The parasitics are extracted, the delays along the interconnects are calculated, and the circuit is simulated. The results of the simulation are used to iterate over the entire physical design process as shown in Fig. 64.16.
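
Before any wires are routed, the wire-length objectives mentioned above for placement and routing are commonly estimated from the placement alone. The Python sketch below uses the half-perimeter wire length (HPWL) of each net's bounding box, a widely used estimate that this chapter does not itself name; the block coordinates and the netlist are hypothetical.

```python
# Hypothetical placement: block name -> (x, y) position of its pin.
POSITIONS = {'A': (0, 0), 'B': (4, 0), 'C': (4, 3), 'D': (1, 5)}

# Hypothetical nets: net name -> blocks connected by that net.
NETS = {'n1': ['A', 'B'], 'n2': ['B', 'C', 'D'], 'n3': ['A', 'D']}

def hpwl(pins, positions):
    """Half-perimeter of the bounding box enclosing all pins of one net."""
    xs = [positions[p][0] for p in pins]
    ys = [positions[p][1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def wirelength_summary(nets, positions):
    """Total, maximum, and average estimated wire length over all nets."""
    lengths = [hpwl(pins, positions) for pins in nets.values()]
    return sum(lengths), max(lengths), sum(lengths) / len(lengths)

total, longest, average = wirelength_summary(NETS, POSITIONS)
print(total, longest, average)
```

A placement tool could evaluate such a summary for each candidate placement and prefer the one with the smaller values, before handing the result to the router.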

FIGURE 64.17 Illustrative example of layout design rules.

The final step in the physical design process is the mask generation phase. The masks are the geometric patterns that are used to etch the silicon by lithography. The output of the design process is usually written out in Caltech Intermediate Format (CIF) or GDSII Stream. This is sent to the foundry, which manufactures the chip using the masks and runs its own design rule checks.

Layout Verification

The layout is verified using verification tools such as design rule checkers (DRC) and extractors. The DRC verifies that the geometric layout of the design does not violate the spacing and dimension rules of the foundry. It ensures that the mask layout has the minimum spacing and size required, and also verifies the spacings among the mask features. The extractor produces a netlist file, usually in SPICE format, after analyzing the connectivity of the design. The extracted SPICE file, which includes transistor sizes and parasitic capacitances, is used to run SPICE simulations on the circuit.20 Figure 64.17 demonstrates layout design rules. The numbers used in this figure are illustrative. The figure shows rules such as the minimum separation between two lines of metal-1 or polysilicon, the minimum overlap of polysilicon over the n-type (or p-type) substrate, etc. These design rules are specified by the technology library provider (i.e., the foundry) and have to be obeyed while performing the layout. The DRC tools verify that the rules have been obeyed and flag errors if they have not. The design rules are necessary since violations can potentially lead to manufacturing faults in the chip. Layout and layout verification are discussed in detail in other chapters in this book.
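
Two of the simplest checks a DRC performs, minimum width and minimum spacing on a single layer, can be sketched in Python as follows. The rectangles, the rule values (3 units each), and the function names are hypothetical and are not the numbers of Fig. 64.17; a production checker handles arbitrary polygons, many layers, and far more rule types.

```python
from itertools import combinations
from math import hypot

# Hypothetical metal-1 shapes: name -> (x1, y1, x2, y2), axis-aligned rectangles.
METAL1 = {
    'm1_a': (0, 0, 3, 10),
    'm1_b': (5, 0, 8, 10),
    'm1_c': (12, 0, 13, 10),
}
MIN_WIDTH = 3      # assumed rule value
MIN_SPACING = 3    # assumed rule value

def width_violations(shapes, min_width):
    """Names of shapes whose narrower dimension is below the minimum width."""
    return [name for name, (x1, y1, x2, y2) in shapes.items()
            if min(x2 - x1, y2 - y1) < min_width]

def separation(a, b):
    """Distance between two rectangles (0 if they touch or overlap)."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return hypot(dx, dy)

def spacing_violations(shapes, min_spacing):
    """Pairs of distinct shapes that are closer than the minimum spacing."""
    bad = []
    for (name_a, a), (name_b, b) in combinations(shapes.items(), 2):
        gap = separation(a, b)
        # Touching or overlapping shapes are treated as one connected piece.
        if 0 < gap < min_spacing:
            bad.append((name_a, name_b))
    return bad

print(width_violations(METAL1, MIN_WIDTH))
print(spacing_violations(METAL1, MIN_SPACING))
```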

64.11 I/O Architecture and Pad Design

Another important decision while developing the architecture of the chip is the package and pin count of the chip. The package type is determined by the area and heat generation of the chip. Packages are of various types such as plastic or ceramic, and each one has a different number of pins and different layout of pins in the chips.36 Hence, the pin count is also determined at the same time as the package and is estimated during the initial architecture design.

Pads are the interface between the pins on the outside of the chip and the inputs and outputs in the digital circuits within the chip. Pads are usually distributed around the edge of the chip or, in recent packaging schemes, across the entire chip face. Each pad has an associated input or output circuitry which provides the necessary drive current required. Hence, each pad has Vdd and Vss (i.e., positive and negative voltage) wires running through it. The number of pads and corresponding pins dedicated to Vdd and Vss depends on how much current the chip draws and the power it consumes.

64.12 Tests after Manufacturing

There are several types of defects that can be introduced by the manufacturing process, such as stuck-at faults, delay faults, etc.37 Hence, after the chips have been fabricated, they are tested extensively to find the faulty ones in the batch. Testing is one of the most expensive phases in the production of an integrated circuit; it is done by applying test patterns to the unit being tested and comparing the unit’s responses with the expected outputs for a working unit. Automatic test pattern generation (ATPG) tools use the description of the circuit to derive the sequence of test vectors which exercise as many paths in the design as possible and test for the faults that may occur.37

Manufacturing tests aim at finding several different types of faults, based on which they can be broadly classified into functional tests, diagnostic tests, and parametric tests.43 Functional tests are simple tests which determine if a chip is functional or not and, hence, are also known as go/no go tests. Diagnostic tests are more involved since they aim at debugging the manufactured chip to determine which component in the chip has failed and possibly locate the fault within the component. This test is important to locate a manufacturing fault which is causing a large percentage of manufactured chips to fail. Parametric tests check for clock skew, delay faults, noise margins, clock frequencies, etc. over the range of working conditions, such as supply voltage and temperature, for which the chip is supposed to function.

However, it is very difficult to create a set of test patterns that test for all the potential faults in the circuit. Recent developments have led to design methodologies which aim to improve the testability of the circuit while it is being designed. In this way, it is possible to design a circuit so that a set of test patterns can be generated which tests for all possible faults in the circuit.
A detailed discussion on testing and testing methodologies is beyond the scope of this chapter.
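
The idea of applying test patterns and comparing responses against a working unit can be made concrete with a small single-stuck-at fault simulation in Python. The three-input circuit y = (a AND b) OR (NOT c), its net names, and the two pattern sets are hypothetical; the sketch injects one stuck-at fault at a time and reports which faults a pattern set detects.

```python
# Nets of a hypothetical circuit: n1 = a AND b, n2 = NOT c, y = n1 OR n2.
NETS = ['a', 'b', 'c', 'n1', 'n2', 'y']

def simulate(a, b, c, fault=None):
    """Evaluate the circuit; fault = (net, stuck_value) forces one net."""
    values = {}

    def drive(net, value):
        # A faulty net ignores its driver and keeps the stuck value.
        values[net] = fault[1] if fault and fault[0] == net else value

    drive('a', a)
    drive('b', b)
    drive('c', c)
    drive('n1', values['a'] & values['b'])
    drive('n2', 1 - values['c'])
    drive('y', values['n1'] | values['n2'])
    return values['y']

def fault_coverage(patterns):
    """Return the detected single stuck-at faults and the coverage ratio."""
    faults = [(net, stuck) for net in NETS for stuck in (0, 1)]
    detected = {f for f in faults for p in patterns
                if simulate(*p) != simulate(*p, fault=f)}
    return detected, len(detected) / len(faults)

one_pattern = [(1, 1, 1)]
four_patterns = [(1, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 0)]
print(len(fault_coverage(one_pattern)[0]), 'of', 2 * len(NETS), 'faults detected')
print(fault_coverage(four_patterns)[1])
```

For this tiny circuit four well-chosen patterns detect every single stuck-at fault, while a single pattern detects only a third of them; for realistic designs, finding such a compact pattern set is the job of ATPG tools.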

64.13 High-Performance ASIC Design

The main optimization goal of ASIC chips is usually area. However, in many mission-critical designs, speed is the foremost concern. Such high-performance designs require special design methodologies. Many design teams adopt a completely hand-crafted design methodology for these chips. However, it is recommended to use standard logic synthesis tools to make one pass over the design and the components in the chip, so as to at least get an estimate of the speed and area of the components. Since CAD tools are able to explore a much larger design space, they often can generate fairly optimal designs which come close to meeting the speed constraints of the design team. The design team can then take these components and hand-tune them to improve their speed. Common methods used are transistor resizing and transistor reordering.

Although most of the datapath blocks can be synthesized using standard cell libraries, there are always situations where a component is on the critical path. These critical blocks are typically completely handcrafted. Alternatively, although most of the chip may be in CMOS technology, designers may choose faster technologies for the custom-crafted components and, hence, adopt a mixed-technology methodology for the chip. Dynamic and dual-rail logic are popular as high-speed design styles, although their power consumption is much higher. In dynamic logic, all the nodes are precharged; such circuits typically require fewer transistors than static circuits and, hence, switch faster than static CMOS circuits. However, these circuits are more power hungry, since there is more switching activity and each node has to be precharged. Dual-rail logic has, as the name implies, two rails of signals, one being the complement of the other. The main disadvantage with this type of design is that it leads to reduced current drives,


especially at reduced voltages. However, recent technologies such as the differential current switch logic (DCSL) family offer high-speed and low-power operation.44 Another factor often overlooked by designers is that, in most companies, technology libraries are designed to be optimal in terms of area (i.e., all the cells in the library have been hand-crafted to have the least area). However, there is always an area-speed tradeoff. If a design is speed critical and the system architects are willing to spend more area on the chip in order to improve speed, then the designers should request speed-optimized technology libraries from the physical design team or foundry, as the case may be. This does not necessarily mean that all the cells in the library have to be redesigned to make them faster; instead, only critical cells, such as registers, full adders, or other cells used in components on the critical path, need to be optimized.

64.14 Low Power Issues

The demand for portable semiconductor devices has fueled the need for more power-efficient semiconductor designs, since the battery life of these portable devices is limited. This has led to the development of several power estimation and minimization design techniques. A considerable amount of this work is focused on circuit-level power savings, obtained by modifying circuits and circuit design techniques to introduce low-power modes.45-47 Several synthesis tools23 also incorporate power estimation as part of their cost functions. In general, power management and savings have become a very important issue in IC design. Power dissipation in CMOS circuits arises from three sources: switching (dynamic) power due to the switching current, short-circuit current when both the n-channel and p-channel transistors are momentarily on during switching, and leakage current during static operation. Of these, the main source of power consumption in CMOS gates is the switching current, or dynamic power. The average power consumption of a CMOS gate due to the switching current is given by

\[
P = \alpha\, C_L\, V_{dd}^{2}\, f \tag{64.1}
\]

where f is the system clock frequency, $V_{dd}$ is the supply voltage, $C_L$ is the load capacitance, and α is the switching activity (i.e., the probability of a 0 → 1 transition during a clock cycle). Some of the high-level strategies for reducing power consumption that can be deduced from this expression include:

• Activity-based component shutdown: Shut the component down during periods of inactivity by either shutting off the clock (f = 0) or shutting off the power supply ($V_{dd}$ = 0). This can be done when it is known that a component will not be used in a clock cycle, by gating the clock, gating the power supply, or asserting a disable on the component’s enable input (if any).

• Supply voltage reduction: Operate at the lowest possible supply voltage (since $P \propto V_{dd}^{2}$). Many chips which are embedded in portable devices adopt this methodology, since the battery life of a portable device is limited. However, tradeoffs are made with other factors such as speed, noise margins, etc.

• Switching activity reduction: Make architectural changes that restructure, for example, the computation, communication, or memory so as to reduce the switching activity, α. By far, this has been the area of most research, which has led to methods for achieving fewer transitions, especially on interconnect and memory.

Recent work on system-level power shutdown and the use of low-power modes has shown that significant savings can be achieved by considering high-level system inactivity and usage information.48-50
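Equation (64.1) and the first two strategies can be checked numerically. The following is a minimal sketch; the function name and all numeric values (activity, capacitance, voltages, frequency) are illustrative assumptions and do not come from the text.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Average switching power P = alpha * C_L * Vdd^2 * f, in watts."""
    return alpha * c_load * vdd ** 2 * freq

# Illustrative operating point (assumed values, not from the chapter).
base = dynamic_power(alpha=0.2, c_load=50e-15, vdd=3.3, freq=100e6)

# Supply voltage reduction: power scales with the square of Vdd.
scaled = dynamic_power(alpha=0.2, c_load=50e-15, vdd=2.5, freq=100e6)

# Clock gating during an idle period: f = 0 removes the switching power.
gated = dynamic_power(alpha=0.2, c_load=50e-15, vdd=3.3, freq=0.0)

if __name__ == "__main__":
    print(f"base   = {base:.3e} W")
    print(f"scaled = {scaled:.3e} W (ratio {scaled / base:.3f})")
    print(f"gated  = {gated:.3e} W")
```

With these assumed numbers, dropping the supply from 3.3 V to 2.5 V leaves roughly 57% of the original switching power, since the ratio is (2.5/3.3)². The model covers only the dynamic term; short-circuit and leakage currents are not included.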

64.15 Reuse of Semiconductor Blocks

In the past few years, the reuse of semiconductor functional blocks has become popular. High-level functional blocks such as signal-processing functions, input/output interface devices, audio/video compression and decompression functions etc., are being designed once and reused in several designs. These blocks are


also known as cores, and several companies specializing in developing these cores are selling them as intellectual property (IP).51 These cores are designed with clear, well-defined, and well-documented interfaces so that they can be integrated into system designs easily. The resulting system-on-a-chip (SOC) uses several of these cores, and sometimes a microprocessor core, to implement a complex system targeted at, say, multimedia processing. This is akin to the use of software component libraries in software design. This core reuse methodology has created a new set of challenges for ASIC design.28,52 Frequently, while integrating the cores, a significant amount of “glue logic” is required to tie together the varied integration requirements of the cores. This glue logic affects system verification detrimentally, since the cores have to be tested and verified together with the glue logic. Testing a chip with several cores is an open research problem. A methodology has to be developed that allows core access and isolation during scan-based testing. The industry is moving toward defining modular design styles and standard interface templates for cores, so that they can easily be plugged into a system, and parameterizable features can be included or deleted depending on the design requirements. Bus and interconnect standards are also being developed, which will allow cores to be incorporated with minimal glue logic. New core test strategies are being developed to facilitate test and verification of cores and of their interaction with other cores in the system. This system-on-a-chip technology is driving the next step in the evolution of semiconductor design and the development of CAD tools. Design teams are re-learning the way designs are conceived and created, so as to allow reuse. The bus interface standardization efforts are expected to reduce glue logic and, hence, the performance overheads due to glue logic.
These standardizations will allow the development of CAD tools which will make the use of cores as easy as a standard cell library and core integration tools as interactive as circuit schematic tools of today.

64.16 Conclusion

As advances in semiconductor technology continue to provide the ability to put more on silicon with increasing circuit densities and performance, the ASIC design methodology is evolving toward higher levels of system specification and an increasing use of CAD tools to automate the design process. Increasing complexity has also led to the proliferation of language-based approaches for digital design. More recently, programming languages are being used for system design because they can quickly model and simulate digital system designs and because designers are already familiar with them.53 The use of high-level programming languages for hardware modeling also helps the semiconductor block reuse methodology. At a lower level of abstraction, logic synthesis tools have matured to the extent that they are indispensable for large, complex designs. The linking of physical design and logic synthesis is becoming important and popular, since the effectiveness and accuracy of logic synthesis are affected to a great extent by the feedback and parasitic information provided by floorplanning tools. Behavioral synthesis methodologies are fast becoming available which allow the synthesis of high-level functional descriptions of systems in C-based languages. These tools attempt to raise the abstraction level and the design entry level close to the conceptualization level. These high-level synthesis tools allow a more complete and efficient exploration of the design space than can be done effectively by hand. They shift the onus from “experienced” system designers to tried and proven methodologies. Additionally, the ever-increasing demand for semiconductor devices in all aspects of everyday life is fueling the development of better tools and methodologies with faster design turnaround.
Logic design productivity is increasing due to the availability of new tools and methodologies such as emulators and prototyping environments, cycle simulators, hardware accelerators, formal verification tools, system-on-a-chip methodologies, etc. The need for portable devices is prompting more power-efficient design and power estimation methodologies. Increasingly complex interactions between physical aspects and higher levels of the design are causing a tighter integration of the various levels of design, from high-level synthesis to logic design to physical design. Finally, better development styles are being adopted which allow fast prototyping of a system and involve more interaction between the various design teams working on different levels of the design.


References

1. G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, New York, 1994.
2. Synopsys Module Compiler, http://www.synopsys.com/products/datapath/datapath.html.
3. A. Chowdhary, S. Kale, P. Saripella, N.K. Sehgal, and R.K. Gupta, A general approach for regularity extraction in datapath circuits, International Conference on Computer-Aided Design, 1998.
4. D. Gajski, S. Narayan, L. Ramachandran, F. Vahid, and P. Fung, System design methodologies: aiming at the 100 h design cycle, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 4, no. 1, March 1996.
5. S. Devadas, A. Ghosh, and K. Keutzer, Logic Synthesis, McGraw-Hill, New York, 1994.
6. C.H. Roth Jr., Digital Systems Design Using VHDL, PWS Publishing, 1998.
7. R.H. Katz, Contemporary Logic Design, Benjamin/Cummings Publishing, 1994.
8. G.D. Hachtel and F. Somenzi, Logic Synthesis and Verification Algorithms, Kluwer Academic Publishers, 1996.
9. E.J. McCluskey, Logic Design Principles, Prentice-Hall, Englewood Cliffs, NJ, 1986.
10. D.D. Gajski and R.H. Kuhn, Guest editor’s Introduction: New VLSI tools, IEEE Computer, Dec. 1983.
11. A. Jantsch, A. Hemani, and S. Kumar, The Rugby Model: A Conceptual Frame for the Study of Modeling, Analysis and Synthesis Concepts of Electronic Systems, Design, Automation and Test in Europe, 1999.
12. IEEE Standard, VHDL Language Reference Manual, 1988.
13. D. Thomas and P. Moorby, The Verilog Hardware Description Language, Kluwer Academic Publishers, 1991.
14. D. Ku and G. De Micheli, HardwareC - a language for hardware design, Stanford Univ. Tech. Rep. CSL-TR-90-419, 1988.
15. D. Harel, Statecharts: A visual formalism for complex systems, Sci. Comput. Programming, 8, 1987.
16. P. Hilfinger and J. Rabaey, Anatomy of a Silicon Compiler, Kluwer Academic Publishers, 1992.
17. N. Halbwachs, Synchronous Programming of Reactive Systems, Kluwer Academic Publishers, 1993.
18. F. Vahid, S. Narayan, and D.D. Gajski, SpecCharts: A VHDL frontend for embedded systems, IEEE Trans. Computer-Aided Design, vol. 14, pp. 694-706, 1995.
19. N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective, Addison Wesley, 1994.
20. L.W. Nagel, SPICE2: a computer program to simulate semiconductor circuits, Memo ERL-M520, Dept. Electrical Engineering and Computer Science, University of California, Berkeley, 1975.
21. C. Terman, Timing simulation for large digital MOS circuits, Advances in Computer-Aided Engineering Design, vol. 1, JAI Press, 1984.
22. Z. Navabi, VHDL: Analysis and Modeling of Digital Systems, McGraw-Hill, New York, 1993.
23. Synopsys Design Compiler, http://www.synopsys.com/products/logic/logic.html.
24. D.D. Gajski and L. Ramachandran, Introduction to high-level synthesis, IEEE Design Test Comput., winter 1994.
25. D.D. Gajski, Principles of Digital Design, Prentice-Hall, Englewood Cliffs, NJ, 1997.
26. S. Malik, private communication.
27. Synopsys Behavioral Compiler, http://www.synopsys.com/products/beh_syn/beh_syn.html.
28. M. Keating and P. Bricaud, Reuse Methodology Manual for System-On-a-Chip Designs, Kluwer Academic Publishers, 1998.
29. R. Camposano and W. Wolf, High Level VLSI Synthesis, Kluwer Academic Publishers, 1991.
30. C.P. Ravikumar, S. Gupta, and A. Jajoo, Synthesis of testable RTL designs using adaptive simulated annealing algorithm, Eleventh International Conference on VLSI Design, 1998, India.
31. D.D. Gajski, N.D. Dutt, C.-H. Wu Allen, and Steve Y.-L. Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992.
32. Quickturn Emulation Tools, http://www.quickturn.com/.
33. K. Keutzer, DAGON: Technology Binding and Local Optimization by DAG Matching, Proceedings of the Design Automation Conference, 1987.


34. B. Preas and M. Lorenzetti, Physical Design Automation of VLSI Systems, Benjamin/Cummings Publishing, 1988.
35. S.M. Sait and H. Youssef, VLSI Physical Design Automation, IEEE Press, 1995.
36. W. Wolf, Modern VLSI Design: Systems on Silicon, Prentice-Hall, Englewood Cliffs, NJ, 1998.
37. M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, Computer Science Press, 1990.
38. S.-P. Lin, C. Njinda, and M. Breuer, Generating a family of testable designs using the BILBO methodology, Journal of Electronic Testing: Theory and Applications, pp. 71-89, 1993.
39. L. Avra, Allocation and Assignment in High-Level Synthesis for Self-Testable Data Paths, Proceedings of International Test Conference, pp. 463–472, 1991.
40. V.D. Agrawal, C.R. Kime, and K.K. Saluja, A tutorial on built-in self-test, Part 1. Principles, Part 2. Applications, IEEE Design & Test of Computers, 10, March/June 1993.
41. R.K. Brayton, C. McMullen, G.D. Hachtel, and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, 1984.
42. R.K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, MIS: a multiple-level logic optimization system, IEEE Transactions on CAD/ICAS, CAD-6, Nov. 1987.
43. J.M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, Englewood Cliffs, NJ, 1996.
44. D. Somasekhar and K. Roy, Differential current switch logic: a low power DCVS logic family, European Solid-State Circuits Conference, 1995.
45. F.N. Najm, A survey of power estimation techniques in VLSI circuits, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Dec. 1994.
46. M. Pedram, Power Minimization in IC Design: Principles and Applications, ACM Transactions on Design Automation of Electronic Systems, Jan. 1996.
47. L. Benini and G. De Micheli, Dynamic Power Management: Design Techniques and CAD Tools, Kluwer Academic Publishers, 1997.
48. M.B. Srivastava, A.P. Chandrakasan, and R.W. Brodersen, Predictive system shutdown and other architectural techniques for energy efficient programmable computation, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Mar. 1996.
49. G.A. Paleologo, L. Benini, A. Bogliolo, and G. De Micheli, Policy optimization for dynamic power management, Proc. of 35th Design Automation Conference, June 1998.
50. D. Ramanathan, S. Irani, and R.K. Gupta, Online power management algorithms for embedded systems, submitted for publication.
51. Y. Zorian and R.K. Gupta, Introduction to core-based design, IEEE Design and Test of Computers, Oct. 1997.
52. J.J. Engel et al., Design methodology for IBM ASIC products, IBM Journal of Research and Development, 40, (no. 4), IBM, July 1996.
53. R.K. Gupta and S.Y. Liao, Using a programming language for digital system design, IEEE Design and Test of Computers, Apr. 1997.


Lockwood, J. "Logic Synthesis for Field Programmable Gate Array (FPGA) Technology" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


65 Logic Synthesis for Field Programmable Gate Array (FPGA) Technology

John W. Lockwood
Washington University

65.1 Introduction
65.2 FPGA Structures
Look-up Table (LUT)-Based CLB • PLA-Based CLB • Multiplexer-Based CLB • Interconnect
65.3 Logic Synthesis
Technology-Independent Optimization • Technology Mapping
65.4 Look-up Table (LUT) Synthesis
Library-Based Mapping • Direct Approaches
65.5 Chortle
Tree Mapping Algorithm • Example • Chortle-crf • Chortle-d
65.6 Two-Step Approaches
First Step: Decomposition • Second Step: Node Elimination • MIS-pga 2: A Framework for TLU-Logic Optimization
65.7 Conclusion

65.1 Introduction

Field Programmable Gate Arrays (FPGAs) enable rapid development and implementation of complex digital circuits. FPGA devices can be reprogrammed and reused, allowing the same hardware to be employed for entirely new designs or for new iterations of the same design. While many traditional IC logic synthesis methods apply, FPGA circuits have special requirements that affect synthesis. The FPGA device consists of a number of configurable logic blocks (CLBs), interconnected by a routing matrix. Pass transistors are used in the routing matrix to connect segments of metal lines. There are three major types of CLBs: those based on PLAs, those based on multiplexers, and those based on table look-up (TLU) functions. Automated logic synthesis tools are used to optimize the mapping of the Boolean network to the FPGA device. FPGA synthesis is an extension of the general problem of multi-level logic synthesis. FPGA logic synthesis is usually solved in two phases. The technology-independent phase uses a general multi-level logic optimization tool (such as Berkeley’s MIS) to reduce the complexity of the Boolean network. Next, a technology-dependent optimization phase is used to optimize the logic for the particular type of device. In the case of the TLU-based FPGA, each CLB can implement an arbitrary logic function of a limited


number of variables. FPGA optimization algorithms aim to minimize the number of CLBs used, the logic depth, and the routing density. The Chortle algorithm is a direct method that uses dynamic programming to map the logic into TLU-based CLBs. It converts the Boolean network into a forest of directed acyclic graphs (DAGs); then it evaluates and records the optimal subsolutions to the logic mapping problem as it traverses the DAG. The two-step algorithms operate by first decomposing the nodes and then performing a node elimination. Later sections of this chapter discuss and detail the Xmap, Hydra, and MIS-pga algorithms. FPGA devices are fabricated using the same sub-micron geometries as other silicon devices. As such, the devices benefit from the rapid advances in device technology. The overhead of the programming bits, general function generators, and general routing structures, however, reduces the total amount of logic available to the end user.

65.2 FPGA Structures

An FPGA consists of reconfigurable logic elements, flip-flops, and a reprogrammable interconnect structure. The logic elements are typically arranged in a matrix. The interconnect is arranged as a mesh of variable-length metal wires and pass transistors that interconnect the logic elements. The logic elements are programmed by downloading binary control information from an external ROM, a built-in EPROM, or a host processor. After download, the control information is stored on the device and used to determine the function of the logic elements and the state of the pass transistors. Unlike a PLA, the FPGA can be used for multi-level logic functions. The granularity of an FPGA refers to the complexity of the individual logic elements. A fine-grain logic block appears to the user to be much like a standard mask-programmable gate array. Each logic block consists of only a few transistors and is limited to implementing only simple functions of a few variables. A coarse-grain logic block (such as those from Xilinx, Actel, Quicklogic, and Altera) provides more general functions of a larger number of variables. Each Xilinx 4000-series logic block, for example, can implement any Boolean function of five variables, or two Boolean functions of four variables. It has been found that coarse-grain logic blocks generally provide better performance than fine-grain logic blocks, as the coarse-grained devices require less space for interconnect and routing by combining multiple logic functions into one logic block. In particular, it has been shown that a four-input logic block uses the minimal chip area for a large variety of benchmark circuits.1 The expense of a few extra underutilized logic blocks is outweighed by the area that a larger number of fine-grained logic blocks, with their associated larger interconnect matrix and pass transistors, would require. This chapter focuses on logic synthesis for coarse-grained logic elements.
A coarse-grained configurable logic block (CLB) can be implemented using PLA-based AND/OR elements, multiplexers, or SRAM-based look-up table (LUT) elements. These configurations are described below in detail.

Look-up Table (LUT)-Based CLB

The basic unit of look-up table (LUT)-based FPGAs is the configurable logic block (CLB), implemented as an SRAM of size $2^n \times 1$. Each CLB can implement any arbitrary logic function of n variables, for a total of $2^{2^n}$ distinct functions. An example of an LUT-based FPGA is the Xilinx 4000-series FPGA, as illustrated in Fig. 65.1. Each CLB has three LUT generators and two flip-flops.2 The first two LUTs implement any function of four variables, while the third LUT implements any function of three variables. Separately, each CLB can implement two functions of four variables. Combined, each CLB can implement any one function of five variables, or some restricted functions of nine variables (such as AND, OR, XOR).
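The view of a LUT as a small memory addressed by its inputs can be sketched in a few lines. The `LUT` class below is a hypothetical model for illustration only; it is not a description of the Xilinx CLB, which adds a third function generator, multiplexers, and flip-flops.

```python
from itertools import product

class LUT:
    """An n-input look-up table modeled as a 2^n x 1 memory."""

    def __init__(self, n, func):
        self.n = n
        # "Program" the LUT: store one bit per input combination.
        # The first input is the most significant address bit.
        self.bits = [func(*inputs) & 1 for inputs in product((0, 1), repeat=n)]

    def __call__(self, *inputs):
        # Evaluating the function is just a memory read.
        address = 0
        for bit in inputs:
            address = (address << 1) | bit
        return self.bits[address]

# Any function of n variables fits; here a 3-input XOR and a 4-input majority.
xor3 = LUT(3, lambda a, b, c: a ^ b ^ c)
maj4 = LUT(4, lambda a, b, c, d: int(a + b + c + d >= 3))

if __name__ == "__main__":
    print(xor3.bits)            # stored truth table, 2^3 entries
    print(xor3(1, 0, 1), maj4(1, 1, 0, 1))
    print(2 ** (2 ** 4))        # number of distinct 4-input functions
```

Because every truth table of length 2^n is a valid configuration, the number of distinct functions of a 4-input LUT is 2^16 = 65,536, which is why enumerating them as library cells is impractical.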


FIGURE 65.1 Xilinx 4000-series CLB.

PLA-Based CLB

PLA-based FPGA devices evolved from the traditional PLDs. Each basic logic block is an AND-OR block consisting of wide fan-in AND gates feeding a few-input OR gate. The advantage of this structure is that many logic functions can be implemented using only a few levels of logic, because of the large number of literals that can be used at each block. It is, however, difficult to make efficient use of all inputs to all gates. Even so, the amount of wasted area is minimized by the high packing density of the wired-AND gates. To further improve the density, another type of logic block, called the logic expander, has been introduced. It is a wide-input NAND gate whose output can be connected to the input of the AND-OR block. While its delay is similar, the NAND block uses less area than the AND-OR block, and thus increases the effective number of product terms available to a logic block.

Multiplexer-Based CLB

Multiplexer-based FPGAs utilize multiplexers to implement different logic functions by connecting each input to a constant or a signal.3 The ACT-1 logic block, for example, has three multiplexers and one logic gate. Each block has eight inputs and one output, implementing:

(

)

f =  s3 + s4   s1w + s1x  + s3 + s4  s2 y + s2 x  Multiplexer-based FPGAs can provide a large degree of functionality for a relatively small number of transistors. Multiplexer-based CLBs, however, place high demands on routing resources due to the large number of inputs.
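How tying inputs to constants or signals yields different functions can be sketched as follows. The model assumes the usual reading of the ACT-1 block as three two-to-one multiplexers whose final select is the OR of s3 and s4, with data inputs w, x, y, and z; the helper names `mux`, `act1`, `and2`, and `or2` are hypothetical.

```python
from itertools import product

def mux(sel, d0, d1):
    """Two-to-one multiplexer: d0 when sel = 0, d1 when sel = 1."""
    return d1 if sel else d0

def act1(s1, s2, s3, s4, w, x, y, z):
    """Assumed ACT-1 style block: two first-level muxes feed a third mux
    whose select line is driven by the OR gate (s3 + s4)."""
    top = mux(s1, w, x)
    bottom = mux(s2, y, z)
    return mux(s3 | s4, top, bottom)

def and2(a, b):
    # a AND b: select constant 0 when a = 0, pass b when a = 1.
    return act1(a, 0, 0, 0, 0, b, 0, 0)

def or2(a, b):
    # a OR b: pass b when a = 0, select constant 1 when a = 1.
    return act1(a, 0, 0, 0, b, 1, 0, 0)

if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        print(a, b, and2(a, b), or2(a, b))
```

Both gates use only the upper multiplexer; the unused inputs still have to be tied off, which hints at why such blocks put pressure on routing resources.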

Interconnect

In all structures, a reprogrammable routing matrix interconnects the configurable logic blocks. A portion of the routing matrix for the Xilinx 4000-series FPGA, for example, is illustrated in Fig. 65.2. Local interconnects are used to join adjacent CLBs. Global routing modules are used to route signals across the chip. The routing and placement issues for FPGAs are somewhat different from those of custom logic. For a large fan-out node, for example, an optimal placement for the elements of the fan-out


FIGURE 65.2 Xilinx routing matrix.

FIGURE 65.3 FPGA chip layout.

would be along a single row or column, where the routing could be done using a long line. For custom logic, the optimal placement would be a cluster, where the optimization attempts to minimize the distance between nodes. For the FPGA, the routing delay is influenced more by the number of pass transistors that the signal must cross than by the length of the signal line. The power of the FPGA comes from the flexibility of the interconnect. A block diagram of a typical third-generation FPGA device is shown in Fig. 65.3. The CLB matrix and the mesh of the interconnect occupy most of the chip area. Macro blocks, when present, implement functions such as high-density memory or microprocessor cores. The I/O blocks surround the chip and provide connectivity to external devices.

65.3 Logic Synthesis

Logic synthesis is typically implemented as a two-phase process: a technology-independent phase, followed by a technology mapping phase.4 The first phase attempts to generate an optimized abstract representation of the target circuit, and the second phase determines the optimal mapping of the optimized abstract representation onto a particular type of device, such as an FPGA. The second-phase optimization may drastically alter the circuit to optimize the logic for a particular technology. In most published approaches, the technology-dependent FPGA optimization is based on the area occupied by the logic, as measured by the number of LUTs. The abstract representation of a combinational logic function ƒ is not unique. For example, ƒ may be expressed by a truth table, a sum-of-products (SOP) (such as ƒ = ab + cd + e′), a factored form (such as ƒ = (a + b)(c + (e′(ƒ + g′)))), a binary decision diagram (BDD) directed acyclic graph (DAG), an if-then-else DAG, or any combination of the above forms. The BDD is a DAG in which a logic function is associated with each node, as shown in Fig. 65.4. It is canonical because, for a given function and a given order of the variables along all the paths, the BDD DAG is unique. A BDD may contain a great deal of redundant information, however, as the sub-functions may be replicated in the lower portions of the tree. The if-then-else DAG consists of a set of nodes, each with three children. Each node is a two-to-one selector, where the first child is connected to the control input of the selector and the other two are connected to the signal inputs of the node.

FIGURE 65.4 Binary decision diagram.
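The if-then-else representation can be made concrete with a small evaluator. This is only an illustrative sketch: the tuple encoding, the `ite` and `evaluate` helpers, and the particular variable order are assumptions, and the example function is the SOP ƒ = ab + cd + e′ quoted in the text.

```python
from itertools import product

def ite(control, then_child, else_child):
    """Build a selector node: the first child drives the control input,
    the other two are the signal inputs."""
    return (control, then_child, else_child)

def evaluate(node, env):
    """Evaluate an if-then-else DAG for the variable assignment `env`."""
    if isinstance(node, int):      # constant leaf 0 or 1
        return node
    if isinstance(node, str):      # variable leaf
        return env[node]
    control, then_child, else_child = node
    chosen = then_child if evaluate(control, env) else else_child
    return evaluate(chosen, env)

# f = ab + cd + e', built with shared sub-graphs (a DAG, not a tree).
not_e = ite("e", 0, 1)
rest = ite("c", ite("d", 1, not_e), not_e)   # cd + e'
f = ite("a", ite("b", 1, rest), rest)        # ab + (cd + e')

if __name__ == "__main__":
    for values in product((0, 1), repeat=5):
        env = dict(zip("abcde", values))
        print(values, evaluate(f, env))
```

The nodes `rest` and `not_e` are each referenced twice, which is the sharing that distinguishes a DAG from a tree in which those sub-functions would be replicated.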


FIGURE 65.5 An example of a Boolean network.

Technology-Independent Optimization

In the technology-independent synthesis phase, the combinational logic function is represented by a Boolean network, as illustrated in Fig. 65.5. The nodes of the network are initially general nodes, which can represent any arbitrary logic function. During optimization, these nodes are usually mapped from the general form to a generic form, which consists only of AND, OR, and NOT logic nodes.4 At the end of the first synthesis phase, the complexity and the number of nodes of the Boolean network have been reduced. Two classes of operations, network restructuring and node minimization, are used to optimize the network. Network restructuring operations modify the structure of the Boolean network by introducing new nodes, eliminating others, and adding and removing arcs. Node minimization simplifies the logic equations associated with nodes.5

Restructuring Operations

Decomposition reduces the support of the function F (denoted as sup(F)). The support of the function is the set of variables that F explicitly depends on. The cardinality of the support (denoted by |sup(F)|) is the number of variables that F explicitly depends on. Factoring is used to transform the SOP form of a logic function into a factored form. Substitution expresses one given logic function in terms of another. Elimination merges a subfunction, G, into the function, F, so that F is expressed only in terms of the fan-in nodes of F and G (not in terms of G itself). The efficiency of the restructuring operations depends on finding a suitable divisor, P, to factor the function; that is, given a function F, choose a divisor P and find the functions Q and R such that F = PQ + R. The number of possible divisors is hopelessly large; thus, an effective procedure is required to restrict the search space to good divisors. The Brayton and McMullen kernel matching technique is used.
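The decomposition F = PQ + R can be sketched with a simple algebraic (weak) division over sum-of-products expressions. This is a minimal illustration under stated assumptions, not the procedure used inside MIS: cubes are represented as sets of one-letter positive literals, and the helper names `sop` and `divide` are hypothetical.

```python
def sop(text):
    """Parse 'ac + ad + e' into a set of cubes; each cube is a frozenset
    of one-letter literals (complemented literals are not handled)."""
    return {frozenset(term.strip()) for term in text.split("+")}

def divide(f, p):
    """Algebraic division: return (Q, R) such that F = P*Q + R."""
    quotient = None
    for p_cube in p:
        # Cubes of F that contain this divisor cube, with it removed.
        partial = {cube - p_cube for cube in f if p_cube <= cube}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    # Everything not produced by P*Q stays in the remainder.
    covered = {q_cube | p_cube for q_cube in quotient for p_cube in p}
    return quotient, f - covered

if __name__ == "__main__":
    F = sop("ac + ad + bc + bd + e")
    P = sop("a + b")
    Q, R = divide(F, P)
    print(sorted("".join(sorted(c)) for c in Q))   # quotient cubes
    print(sorted("".join(sorted(c)) for c in R))   # remainder cubes
```

For F = ac + ad + bc + bd + e and P = a + b, the division gives Q = c + d and R = e, so F = (a + b)(c + d) + e. The hard part, as the text notes, is choosing P; the division itself is cheap.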
The kernels of a function F are the set of expressions $K(F) = \{\, g \mid g \in D(F),\ g \text{ is cube-free} \,\}$, where D(F) is the set of primary divisors of F. A cube is a logic function given by the product of literals. A cube of a function F is a cube whose onset does not have vertices in the off-set of F (e.g., if F = ab(c + d), ab is a cube of F). An expression F is cube-free if no cube divides the expression evenly.6 For example, F = ab + c is cube-free, while F = ab + ac is not cube-free, since the cube a divides it evenly. Finally, the primary divisors of F are the set of expressions $D(F) = \{\, F/C \mid C \text{ is a cube} \,\}$.7 Kernel functions can be computed effectively by several fast algorithms. Based on the kernel functions extracted, the restructuring operations can usually generate acceptable results within a reasonable amount of time.4 Speed/quality tradeoffs are still needed, however, as is the case with MIS, which is a multi-level logic synthesis system.8

Node Minimization

Node minimization attempts to reduce the complexity of a given network by using Boolean minimization techniques on its nodes.


A two-level logic minimization that takes the don’t-care inputs and outputs into account can be used to minimize the nodes in the circuit. Two types of don’t-care sets, satisfiability don’t care (SDC) and observability don’t care (ODC), are used in the two-level minimizer. The SDC set represents combinations of input variables that can never occur because of the structure of the network itself, while the ODC set represents combinations of variables that will never be observed at the outputs. If the ODCs and SDCs are too large, a practical running time can only be achieved by using a limited subset of the ODCs and SDCs.8 Another technique is to use a tautology checker to determine whether two Boolean networks are equivalent, by taking the XNOR of their corresponding primary outputs.9 A node is first tentatively simplified by deleting either variables or cubes. If the result of the tautology check is 1 (equivalent), then the deletion is performed. As with the first method, an exhaustive search is usually not possible because of the computational cost of the tautology check.
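The tentative-deletion loop can be illustrated with a brute-force equivalence check. This is a sketch only: exhaustive enumeration stands in for a real tautology checker, don’t-care sets are ignored, and the node functions and the name `equivalent` are made up for the example.

```python
from itertools import product

def equivalent(f, g, n):
    """True if XNOR(f, g) is a tautology, i.e., f and g agree on all
    2^n input assignments (exhaustive check, small n only)."""
    return all(not (f(*v) ^ g(*v)) for v in product((0, 1), repeat=n))

# Original node function: F = ab + ab'c
def original(a, b, c):
    return (a & b) | (a & (1 - b) & c)

# Tentative simplification 1: delete the literal b' from the second cube.
def drop_b_bar(a, b, c):
    return (a & b) | (a & c)

# Tentative simplification 2: delete the literal a from the first cube.
def drop_a(a, b, c):
    return b | (a & (1 - b) & c)

if __name__ == "__main__":
    print(equivalent(original, drop_b_bar, 3))  # deletion can be kept
    print(equivalent(original, drop_a, 3))      # deletion must be undone
```

The first deletion is accepted because ab + ab′c = a(b + c) = ab + ac; the second is rejected because it changes the output when a = 0 and b = 1. The 2^n cost of the exhaustive check is the same scaling problem that limits the search in practice.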

Technology Mapping

Taking the special characteristics of a particular FPGA device into account, the technology mapping phase attempts to realize the Boolean network using a minimal number of CLBs. Synthesis algorithms fall into two main categories: algorithmic approaches and rule-based techniques. By expressing the optimized AND/OR/NOT network as a subject graph (a network of two-input NAND gates) and a library of potential mappings as pattern graphs, the first approach converts the mapping problem into a covering problem, with the goal of finding the minimum-cost cover of the subject graph by the pattern graphs. The problem is NP-hard; thus, heuristics must be used. If the network to be mapped is a tree, an optimal procedure has been found. It is inspired by Aho et al.’s work on optimizing compilers. If the Boolean network is not a tree, it is first decomposed into a forest of trees; then the mapping problem is solved as a tree-covering-by-tree problem, using the optimal tree procedure. The rule-based technique traverses the Boolean network and replaces subnetworks with patterns in the library when a match is found. It is slow compared to the first method, but can generate better results. Mixed approaches, which perform a tree-covering step followed by a rule-based clean-up step, are the current trend in industry.

65.4 Look-up Table (LUT) Synthesis

The existing approaches to synthesizing FPGAs based on look-up tables (LUTs) are summarized in Fig. 65.6. Beginning with an optimized AND/OR/NOT Boolean network generated by a general-purpose multi-level logic minimizer, such as MIS-II, these algorithms attempt to minimize the number of LUTs needed to realize the logic network.

Library-Based Mapping

Library-based algorithms were originally developed for use in the synthesis of standard cell designs. It was assumed that there was a small number of pre-designed logic elements. The goal of the mapping function was to optimize the use of these blocks. MIS is one such library-based approach that performs multi-level logic minimization. It existed long before the conception of FPGAs and has been used for TLU logic synthesis. Non-equivalent functions in MIS are explicitly described in terms of two-input NAND gates. Therefore, an optimal library needs to cover all functions that can be implemented by the TLU. Library-based algorithms are generally not appropriate for TLU-based FPGAs because of the large number of functions that each CLB can implement.

Direct Approaches
Direct approaches generate the optimized Boolean network directly, without the explicit construction of library components. Two classes of methods are currently used: modified tree-covering algorithms (i.e., Chortle and its improved versions) and two-step methods.

FIGURE 65.6 Approaches to synthesize FPGAs based on LUTs.

Modified Tree-Covering Approaches
The modified tree-covering approach begins with an AND/OR representation of the optimized Boolean network. Chortle and its extensions (Chortle-crf and Chortle-d) first decompose the network into a forest of trees by clipping the multiple-fan-out nodes. An optimal mapping of each tree into LUTs is then performed using dynamic programming, and the results are assembled according to the interconnection patterns of the forest. The details of the Chortle algorithms are given in Section 65.5.

Two-Step Approaches
Instead of processing the mapping in one direct step, the two-step methods handle the mapping by node decomposition followed by node elimination. The decomposition operation yields a network that is feasible. A Boolean network is feasible if every intermediate node is realized by a feasible function. A feasible function is a function ƒ that satisfies |sup(ƒ)| ≤ K, where sup(ƒ) is the set of variables on which ƒ depends; informally, it can be realized by one CLB. The node elimination step then reduces the number of nodes by combining nodes based on the particular structure of a CLB. Different two-step approaches have been proposed and implemented, including MIS-pga 1 and MIS-pga 2 from U.C. Berkeley, Xmap from U.C. Santa Cruz, and Hydra from Stanford. Each algorithm has its own advantages and drawbacks. Details of these methods are given in Section 65.6. Comparisons among the direct and two-step methods are given in Section 65.7.
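The feasibility condition can be checked mechanically. The sketch below assumes a hypothetical representation in which a network is a dictionary mapping each intermediate node name to the set of signals it depends on; neither the representation nor the function names come from the tools discussed here.

```python
def infeasible_nodes(network, k):
    """Return the names of nodes whose support has more than k signals."""
    return sorted(name for name, support in network.items() if len(support) > k)


def is_feasible(network, k):
    """A network is feasible when no node needs more than k inputs."""
    return not infeasible_nodes(network, k)


# Hypothetical two-node network: n2 depends on five signals.
example = {
    "n1": {"a", "b", "c"},
    "n2": {"n1", "d", "e", "f", "g"},
}
print(infeasible_nodes(example, 4))
```

With K = 4 only n2 violates the bound, so it is the node a decomposition step would have to split.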

65.5 Chortle
The Chortle algorithm is specifically designed for TLU-based FPGAs. The input to the Chortle algorithm is an optimized AND/OR/NOT Boolean network. Internally, the circuit is represented as a forest of directed acyclic graphs (DAGs), with the leaves representing the inputs and the root representing the output, as shown in Fig. 65.7. The internal nodes represent the logic functions AND/OR. Edges represent inverting or non-inverting signal paths. The goal of the algorithm is to implement the circuit using the fewest K-input CLBs in minimal running time. Efficient running time is a key advantage of Chortle, as FPGA mapping is a computationally intensive operation in the FPGA synthesis procedure.

The terminology of the Chortle algorithm defines the mapping of a node, n, in a tree as the circuit of look-up tables rooted at that node that extends to the leaf nodes. The root look-up table of node n is the look-up table in that mapping that has node n as its single output. The utilization of a look-up table refers to the number of inputs, U, out of the K inputs actually used in the mapping. Finally, the utilization division, µ, is a vector that denotes the distribution of the inputs to the root look-up table among subtrees. For example, a utilization vector of µ = {2,1} refers to a look-up table that takes two of its K inputs from the left logic subtree and one input from the right subtree.

FIGURE 65.7 Boolean network and DAG representation.

FIGURE 65.8 Forest of fan-out-free trees.

Tree Mapping Algorithm
The first step of the Chortle algorithm is to convert the input graph to a forest of fan-out-free trees, where each logic function has exactly one output. As illustrated in Fig. 65.8, node n has a fan-out degree of two; thus, two new nodes, n1 and n2, are created that implement the same Boolean equation as node n. Each subtree is then evaluated independently. Chortle uses a postorder traversal of each DAG to determine the mapping of each node. The logic functions connecting the inputs (leaves) are processed first; the logic functions connecting those functions are processed next, and so on until reaching the output node (root).

Chortle’s tree mapping algorithm is based on dynamic programming. Chortle computes and records the solution to all subproblems, proceeding from the smallest to the largest subproblem, avoiding recomputation of the smaller subproblems. The subproblem refers to computation of the minimum-cost mapping function of the node n in the tree. For each node ni, the subproblem, minMap(ni,U), is solved for each value of U, ranging from 2 … K (U = K refers to a look-up table that is fully utilized, while U = 2 refers to a TLU with only two inputs used). In general, for the same value of U, multiple utilization vectors, µ(u1, u2, …, uƒ), are possible, such that u1 + u2 + … + uƒ = U. The utilization vector determines how many inputs are to be used from each of the previous optimal subsolutions. Chortle examines each possible mapping function to determine this node’s minimum-cost mapping function, cost(minMap(n,U)). For each value of U ∈ {2 … K}, the utilization division of the minimum-cost mapping function is recorded.10

Example
The Chortle mapping function is best illustrated by an example, shown in Fig. 65.9. For this example, we will assume that each CLB may have as many as four inputs (i.e., K = 4). The inputs, {A,B,C,D,E,F}, feed the logic function A ∗ B + (C ∗ D) ∗ E + F. In the postorder traversal, n1 is visited first, followed by n2 … n5. For n1, there is only one possible mapping function, namely, U = 2, µ = {1,1}. The same is true for n2.

When n3 is evaluated, there are two possibilities, as illustrated in Fig. 65.10. First, the function could be implemented as a new CLB with two inputs (U = 2), driven from the output of n2 and from E. This sub-graph would use two CLBs; thus, it would have a cost of 2. For U = 3, only one utilization vector is possible, namely, µ = {2,1}. All three primary inputs C, D, and E are grouped into one CLB, thus producing a cost of 1. We store only the utilization vectors and costs for minMap(n3,2) and minMap(n3,3).

When n4 is evaluated, there are several possibilities, as illustrated in Fig. 65.11. With U = 2 (µ = {1,1}), a two-input CLB would combine the optimal result for n3 with the primary input F, producing a function with a cost of 2. For U = 3 (µ = {2,1}), a three-input CLB would combine the U = 2 result for n3 with both inputs E and F (so that the CLB is driven by the output of n2, E, and F), also at a cost of two CLBs. Finally, for U = 4, a single CLB would implement the function ((C ∗ D) ∗ E) + F, at a cost of 1. We store the utilization vectors and costs for minMap(n4,2), minMap(n4,3), and minMap(n4,4).

Finally, we evaluate the output node, n5, as illustrated in Fig. 65.12. We see that there are four possible mappings and, of those, two are minimal. Chortle may return either of them; in one, the two CLBs implement n5 = (A ∗ B) + n3 + F and n3 = (C ∗ D) ∗ E.

FIGURE 65.9 Chortle mapping example.
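The bottom-up bookkeeping in this example can be reproduced with a short dynamic-programming sketch. It is a simplified illustration under stated assumptions, not the Chortle implementation: trees are written as nested tuples such as ("and", child, child) with strings for primary inputs, inverters are ignored, every node is assumed to have fan-in of at most K (so no node decomposition is attempted), and the name min_map is chosen only to echo minMap in the text.

```python
from itertools import product


def min_map(node, K):
    """Return {U: cost}: the fewest LUTs for the subtree rooted at `node`
    when its root LUT uses exactly U of its K inputs."""
    _op, *children = node
    choices = []
    for child in children:
        if isinstance(child, str):            # primary input: one pin, no LUT
            choices.append([(1, 0)])
            continue
        table = min_map(child, K)
        opts = [(1, min(table.values()))]     # child keeps its own root LUT
        # Alternatively absorb the child's root LUT: its u inputs feed us
        # directly and one LUT is saved.
        opts += [(u, c - 1) for u, c in table.items()]
        choices.append(opts)
    best = {}
    for division in product(*choices):        # one utilization division each
        U = sum(u for u, _ in division)
        if U <= K:
            cost = 1 + sum(c for _, c in division)
            best[U] = min(cost, best.get(U, cost))
    if not best:
        raise ValueError("fan-in exceeds K; the node must be decomposed first")
    return best


# The example circuit: A*B + (C*D)*E + F, with K = 4.
n1 = ("and", "A", "B")
n2 = ("and", "C", "D")
n3 = ("and", n2, "E")
n4 = ("or", n3, "F")
n5 = ("or", n1, n4)
for name, node in [("n3", n3), ("n4", n4), ("n5", n5)]:
    print(name, min_map(node, 4))
```

The tables printed for n3 and n4 match the costs quoted in the example (2 and 1 for n3; 2, 2, and 1 for n4), and the smallest entry for n5 is the two-CLB solution.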

FIGURE 65.10 Mapping of node 3.

FIGURE 65.11 Mapping of node 4.

FIGURE 65.12 Mapping of node 5.

Chortle-crf
The Chortle-crf algorithm is an improvement of the original Chortle algorithm. The major innovation in Chortle-crf is the method for choosing gate-level node decompositions. The other improvements involve the algorithm’s response to reconvergent and replicated logic. The name Chortle-crf is based on the new command-line options (-crf) that may be given when running the program (-c for constructive bin-packing for decomposition, -r for reconvergent optimization, and -f for replication optimization).11 Each of the optimizations is detailed below.

Decomposition
Decomposition involves splitting a node and introducing intermediate nodes. Decomposition is required if a node in the original circuit has a fan-in greater than K. In this case, no one CLB could implement the entire function. More generally, the decomposition of a node may yield a circuit that uses fewer CLBs. Consider, for example, implementations with four-input CLBs (K = 4) of the circuit shown in Fig. 65.13. Without decomposition, the output node forces the sub-optimal use of the first two function generators (i.e., A ∗ B and C ∗ D are implemented as individual CLBs). With decomposition, however, the output OR gate is decomposed to form a new node that implements the function (A ∗ B) + (C ∗ D), which can be implemented in one CLB.

The original Chortle algorithm used an exhaustive search of all possible decompositions to find the optimal decomposition for the subcircuit, causing the running time at a node to increase exponentially as the fan-in increased. As a heuristic within the original Chortle algorithm, nodes were arbitrarily split if the fan-in to a node exceeded 10, allowing each subfunction to be computed in a reasonable amount of time. If a node was split, however, the solution was no longer guaranteed to be optimal.

The improved Chortle-crf algorithm uses a first-fit-decreasing bin-packing algorithm to solve the decomposition problem. Large fan-in nodes are decomposed into smaller subnodes with smaller fan-in. Next, the look-up tables for the input functions are bin-packed into CLBs. A look-up table with k inputs is merged into the first CLB that has at least k unused inputs remaining (i.e., at most K – k inputs already in use). A new CLB is generated, if needed, to accommodate the k inputs.

FIGURE 65.13 Decomposition example.

FIGURE 65.14 Reconvergent logic example.

Reconvergent Logic
Reconvergent logic occurs when a signal is split into multiple function generators, and then those output signals merge at another generator. An example of reconvergent logic is shown in Fig. 65.14. When the XOR gate was converted to SOP form by the technology-independent minimization phase, two AND gates and an OR gate were generated. Both AND gates share the same inputs.
If the total number of distinct inputs does not exceed the number of inputs of a CLB, it is possible to map these functions into one CLB. The Chortle-crf algorithm finds all local reconvergent paths, and then examines the effect of merging those signals into one CLB.

Replicated Logic
For multi-output logic circuits, there are cases when logic duplication uses fewer CLBs than logic that uses subterms generated by a shared CLB. Figure 65.15 shows an example of a six-input circuit with two outputs. One product term is shared by both functions ƒ and g. Without replication, the subfunction implemented by the middle AND gate would be implemented as one CLB, as would each of the subfunctions for ƒ and g. In this case, however, the middle AND gate can be replicated and mapped into both function generators, thus allowing the entire circuit to be implemented using two CLBs rather than three.

When a node has a fan-out greater than one, Chortle may implement the node explicitly or implicitly. For an explicit node, the subfunction is generated by a dedicated CLB, and this output signal is treated as an input to the rest of the logic. For an implicit node, the logic is replicated for each fan-out subcircuit. The algorithm computes the cost of the circuit both with replication and without. Logic replication is chosen if this reduces the number of CLBs used to implement the circuit.
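The first-fit-decreasing packing that Chortle-crf uses for decomposition, described earlier in this section, can be sketched as follows. This is only the packing step, under the assumption that each "box" is the number of inputs used by one fan-in look-up table and each "bin" is a CLB with K inputs; the later step that wires the bins together into a decomposition tree is omitted, and the function name is illustrative.

```python
def first_fit_decreasing(box_sizes, k):
    """Pack boxes (LUT input counts) into bins of capacity k.

    Returns the bins as lists of box sizes, in the order they were opened.
    """
    bins = []
    for size in sorted(box_sizes, reverse=True):   # largest boxes first
        if size > k:
            raise ValueError("a box larger than K cannot fit in any CLB")
        for contents in bins:
            if k - sum(contents) >= size:          # first bin with enough unused inputs
                contents.append(size)
                break
        else:
            bins.append([size])                    # no bin fits: open a new CLB
    return bins


print(first_fit_decreasing([3, 2, 2, 1, 2, 2], 4))
```

For the sample sizes and K = 4 the boxes fit into three bins, which is also the lower bound of ceil(12 / 4) for a total of 12 inputs.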

FIGURE 65.15 Replicated logic example.

Chortle-d
The primary goal of Chortle-d is to reduce the depth of the logic (i.e., the largest number of CLBs on any signal path through the combinational logic).12 By minimizing the longest paths, it is possible to increase the frequency at which the circuit can operate. Chortle-d is an enhancement of the Chortle-crf algorithm. Chortle-d, however, may use more look-up tables than Chortle-crf in order to implement a circuit with a smaller depth.

The Chortle-d algorithm separates logic into strata. Each stratum contains logic at the same depth. When nodes are decomposed, the outputs of the tables with the deepest stratum are connected to those at the next level. Chortle-d also employs logic replication, where possible. Replication often reduces the depth of the logic, as illustrated in Fig. 65.15.

The depth optimization is applied only to the critical paths in the circuit. The algorithm first minimizes depth for the entire circuit to determine the maximum target depth. Next, the Chortle-crf algorithm is employed to find a circuit that has minimum area. For paths in the area-optimized circuit that exceed the target depth, depth-minimization decomposition is performed. This has the effect of equalizing the delay through the circuit. It was found that for the 20 circuits in the MCNC logic synthesis benchmark, the Chortle-d algorithm constructed circuits with 35% fewer logic levels, but at the expense of 59% more look-up tables.
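The depth measure that Chortle-d minimizes is simply the longest chain of look-up tables between a primary input and an output. A minimal sketch, assuming a hypothetical representation in which a mapped circuit is a dictionary from each LUT's output name to the list of signals driving it (any name that is not a key is treated as a primary input):

```python
from functools import lru_cache


def logic_depth(luts):
    """Largest number of LUTs on any input-to-output path."""

    @lru_cache(maxsize=None)
    def depth(signal):
        if signal not in luts:          # primary input contributes no level
            return 0
        return 1 + max(depth(s) for s in luts[signal])

    return max(depth(name) for name in luts)


# Two-LUT mapping of A*B + (C*D)*E + F: n3 feeds n5.
mapping = {"n3": ["C", "D", "E"], "n5": ["A", "B", "n3", "F"]}
print(logic_depth(mapping))
```

Comparing this value for an area-optimized mapping and a depth-optimized one is the kind of trade-off the 35% / 59% figures above summarize.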

65.6 Two-Step Approaches
As with Chortle, the two-step methods start with an optimized network in which the number of literals is minimized. The network is decomposed to be feasible in the first step; then the number of nodes is reduced in the second step. If the given network is already feasible, the first step is skipped.

First Step: Decomposition
For a given FPGA device with a k-input TLU, all nodes of the network with more than k inputs must be decomposed. Different methods decompose the network in different ways.

MIS-pga 1
MIS-pga 1 was developed at Berkeley for FPGA synthesis as an extension of MIS-II. It applies two algorithms, kernel decomposition and Roth-Karp decomposition, to the infeasible nodes separately; then it selects the better result.

Kernel decomposition decomposes an infeasible node ni by extracting a kernel function, ki, and splitting ni based on ki and its residue, ri. The residue, ri, of a kernel, ki, of a function, F, is the expression for F with a new variable substituted for all occurrences of ki in F; for example, if F = x1x2 + x1x3, then ki = x2 + x3 and ri = x1ki. As more than one kernel function may exist for a node, a cost is associated with each kernel: cost(ki) = |sup(ki) ∩ sup(ri)|. The kernel with minimum cost is chosen. A kernel decomposition is illustrated in Fig. 65.16. Splitting infeasible nodes by kernel functions minimizes the number of new edges generated. Therefore, the considerations of wiring resources and logic area are integrated. This procedure is applied recursively until all nodes are feasible. If no kernels can be extracted for a node, an AND-OR decomposition is applied.

FIGURE 65.16 Example of kernel decomposition.

Roth-Karp decomposition is based on the classical decomposition of Ashenhurst and Curtis.13 Instead of building a decomposition chart whose size grows exponentially, as in the original method, a compact cover representation of the on-set and the off-set of the function is used. The Roth-Karp algorithm avoids the expensive computation of the best solution by accepting the first bound set. As with kernel decomposition, the AND-OR decomposition is used as a last resort.

Hydra Decomposition
The Hydra algorithm, developed at Stanford University, is designed specifically for two-output TLU FPGAs.14 Decomposition in Hydra is performed in three stages. The first and third stages are AND-OR decompositions, while the second stage is a simple-disjoint decomposition, which is defined as follows: given a function, F, and its support, S, with F = G(H(Sa), Sb), where Sa, Sb ⊆ S and Sa ∪ Sb = S; if Sa ∩ Sb = ∅, then G is a disjoint decomposition of F. The first stage is executed only if the number of inputs to the nodes in the given network is larger than a given threshold. Without performing the first stage, the efficiency of the second stage would be reduced. The last stage is applied only if the resulting network is still infeasible.
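Both set-based conditions in this subsection are easy to check mechanically. The sketch below takes the kernel cost to be the number of variables shared by the supports of ki and ri, and tests the simple-disjoint condition on Sa and Sb; supports are plain Python sets, and the function names and the placeholder variable "k" for the substituted kernel are illustrative assumptions.

```python
def kernel_cost(kernel_support, residue_support):
    """Number of variables common to the kernel and its residue."""
    return len(set(kernel_support) & set(residue_support))


def is_simple_disjoint(sa, sb, s):
    """True if Sa and Sb together cover S and share no variable."""
    sa, sb, s = set(sa), set(sb), set(s)
    return (sa | sb) == s and not (sa & sb)


# F = x1x2 + x1x3 with kernel x2 + x3 and residue x1*k ("k" is the new variable).
print(kernel_cost({"x2", "x3"}, {"x1", "k"}))
print(is_simple_disjoint({"a", "b"}, {"c", "d"}, {"a", "b", "c", "d"}))
```

For the worked example the kernel and residue share no variables, so the cost is zero, which is the best case for this measure.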
In the second stage, the algorithm searches for all the function pairs that have common variables and then applies the simple-disjoint decomposition to them. As a result, two CLBs with the same fan-ins can be merged into one two-output CLB. The rationale is illustrated in Fig. 65.17. A weighted graph G(V,E,W) that represents the shared-variable relationship is constructed from the given Boolean network. In G(V,E,W), V is the node set corresponding to that of the Boolean network; an edge, eij ∈ E, exists for any pair of nodes, {vi, vj} ⊆ V, that share variables; and the weight, wij ∈ W, is the number of variables the pair shares. Edges are first sorted by weight and then traversed in decreasing order to check for simple-disjoint decomposition. A cost function, which is a linear combination of the number of shared inputs and the total number of variables in the extracted functions, is computed to decide whether or not to accept a certain simple decomposition.

FIGURE 65.17 CLB mapping example.

Xmap Decomposition
Xmap decomposes the infeasible network by converting the SOP form from MIS-II to an if-then-else DAG representation.15 The terms of the SOP network are collected in a set, T; then, variables are sorted in decreasing order of the frequency of their appearance in T; finally, the if-then-else DAG is formed by the following recursive procedure:

• Let V be the most frequently used variable in the current set, T.
• Sort the terms in T into subsets T(Vd), T(V1), and T(V0), according to V. T(Vd) is the subset in which V does not appear, T(V1) is the onset of V, and T(V0) is the offset of V.
• Delete V from all terms in T; then apply the same procedure recursively to the three subsets until all variables are tested.

The resulting if-then-else DAG after the first iteration is given in Fig. 65.18. A circuit that has been mapped to an if-then-else DAG is immediately suited for use with multiplexer-based CLBs.16 Additional steps are used to optimize the DAG for use with TLU functions.

FIGURE 65.18 Result of first iteration.
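The recursive split described above can be sketched in a few lines. This is an illustration under several assumptions rather than Xmap's actual data structure: each product term is a dictionary from variable name to the value it requires (1 for the variable, 0 for its complement), a node is a 4-tuple (V, then, else, don't-care) whose value is "(V ? then : else) OR don't-care", and no sharing of identical subgraphs is attempted, so the result is a tree rather than a true DAG.

```python
from collections import Counter
from itertools import product


def build_ite(terms):
    """Build an if-then-else structure from a list of product terms."""
    if not terms:
        return False                          # empty sum is constant 0
    if any(len(t) == 0 for t in terms):
        return True                           # an empty product is constant 1
    counts = Counter(v for t in terms for v in t)
    var = counts.most_common(1)[0][0]         # most frequently used variable
    strip = lambda t: {k: p for k, p in t.items() if k != var}
    t_d = [t for t in terms if var not in t]              # V does not appear
    t_1 = [strip(t) for t in terms if t.get(var) == 1]    # onset of V
    t_0 = [strip(t) for t in terms if t.get(var) == 0]    # offset of V
    return (var, build_ite(t_1), build_ite(t_0), build_ite(t_d))


def evaluate(node, assignment):
    """Evaluate the structure for a dict of variable values (0 or 1)."""
    if isinstance(node, bool):
        return node
    var, then_part, else_part, dont_care = node
    chosen = then_part if assignment[var] else else_part
    return evaluate(chosen, assignment) or evaluate(dont_care, assignment)


# f = a*b + a'*c + d
terms = [{"a": 1, "b": 1}, {"a": 0, "c": 1}, {"d": 1}]
dag = build_ite(terms)
for a, b, c, d in product((0, 1), repeat=4):
    values = {"a": a, "b": b, "c": c, "d": d}
    assert evaluate(dag, values) == bool((a and b) or ((not a) and c) or d)
```

The loop at the end checks the structure against the original sum of products on all sixteen input combinations.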

Second Step: Node Elimination
Three approaches have been proposed for node elimination: local elimination, covering, and merging.

Local Elimination
The operation used for local elimination is collapsing, which merges node ni into node nj whenever ni is a fan-in node of nj and the new node obtained is feasible. The Hydra algorithm accepts local eliminations as soon as they are found. MIS-pga 1, however, first orders all possible local eliminations as a function of the increase in the number of interconnections resulting from each elimination, and then greedily selects the best local eliminations.

The number of nodes can be reduced by local elimination, but its myopic view of the network causes local elimination to miss better solutions. Additionally, the new node created by merging multi-fan-out nodes may substantially increase the number of connections among TLUs and hence make the wiring problem more difficult. This problem is more severe in Hydra than in MIS-pga 1.

Covering
The covering operation takes a global view of the network by identifying clusters of nodes that could be combined into a single TLU. The operation is a procedure of finding and selecting supernodes. A supernode, Si, of a node ni, is a cluster of nodes consisting of ni and some other nodes in the transitive fan-in of ni such that the number of inputs to Si is at most k. Obviously, more than one supernode may exist for a node.

In MIS-pga 1, the covering operation is performed in two stages. In the first stage, the supernodes are found by repeatedly applying the maxflow algorithm at each node. In the second stage, an optimal subset of the supernodes that covers the whole network using a minimum number of supernodes is selected by solving a binate covering problem whose constraints are: first, every intermediate node must be included in at least one supernode; second, if a supernode Si is selected, supernodes that supply the inputs of Si must also be selected. (The ordinary, unate covering problem has only the first constraint.)

Hydra examines the nodes of the network in order of decreasing number of inputs. An unassigned node with the maximal number of inputs is chosen first. A second node is then chosen such that the two nodes can be merged into the same TLU and the cost function (the same cost function as was used in the decomposition step) is maximized. This greedy procedure stops when all unexamined nodes have been considered.

For Xmap, the logic blocks to be found are sub-DAGs of the if-then-else DAG for the entire circuit.
The algorithm traverses the if-then-else DAG from inputs to outputs and keeps a log of the inputs in the paths (called the signals set) that can be used to compute the function of the node under consideration. Each node in the signals set is either a marked node or a clean node. A marked node isolates its inputs from the current node, while a clean node exposes all its fan-ins. For an overflow node, whose signals set is larger than k (the number of inputs of the TLU), a marking procedure is executed to reduce the fan-ins of the overflow node. Xmap first marks the high-fan-out descendants of the node, and then marks the children of the node in decreasing order of the size of their signals sets. The more inputs Xmap can isolate from the node under consideration, the better. The marking process cuts the if-then-else DAG into pieces, each of which can be mapped into one CLB.

Merging
The purpose of the merging step is to combine nodes that share some inputs in order to exploit particular features of the FPGA architecture. For example, each CLB in the Xilinx XC4000 device has two four-input TLUs and a third TLU combining them with the ninth input (Section 65.3). In the three approaches discussed above, a post-processing step is performed to merge pairs of nodes after the covering operation. The problem is formulated as a maximum cardinality matching problem.

MIS-pga 2: A Framework for TLU-Logic Optimization
MIS-pga 2 is an improved version of MIS-pga 1. It combines the advantageous features of Chortle-crf, MIS-pga 1, Xmap, and Hydra. In each step, MIS-pga 2 tries different algorithms and chooses the best.17 Four decomposition algorithms are executed in the decomposition step:

1. Bin-packing: the algorithm is similar to that of Chortle-crf, except that the heuristic of MIS-pga 2 is Best-Fit Decreasing.
2. Co-factoring decomposition: it decomposes a node based on computing its Shannon cofactors (ƒ = ƒ1 ƒ2 + ƒ′1 ƒ3). The nodes in the resulting network have, at most, three inputs. This approach is particularly effective for functions in which cubes share many variables.
3. AND/OR decomposition: it can always find a feasible network, but the result is usually not a good network for the node elimination step. Therefore, it is used as the last resort.
4. Disjoint decomposition: unlike in Hydra, this method is used on a node-by-node basis. When it is used as a preprocessing stage for the bin-packing approach, a locally optimal decomposition can be found.

MIS-pga 2 interweaves some operations of the two-step methods. For example, the local elimination operation is applied to the original infeasible network as well as to the decomposed, feasible network. This same operation is referred to as partial collapse when applied before decomposition. Unlike MIS-pga 1, which separates the covering and the merging operations, MIS-pga 2 combines these two operations to solve a single binate covering problem. Because MIS-pga 2 performs a more exhaustive decomposition phase, and because the combined covering/merging phase has a more global view of the circuit, MIS-pga 2’s results are almost always superior to those of Chortle-crf, MIS-pga 1, Hydra, and Xmap. For the same reason, MIS-pga 2 is relatively slow compared to the other algorithms.
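The co-factoring step in item 2 rests on the Shannon expansion. Reading ƒ1 as the splitting variable x, ƒ2 as the cofactor of ƒ with x = 1, and ƒ3 as the cofactor with x = 0 (an interpretation of the formula above, not a statement of how MIS-pga 2 is coded), the identity can be checked exhaustively for a small function. The helper names and the choice of a three-input majority function are illustrative.

```python
from itertools import product


def cofactors(f, index):
    """Return (f1, f0): f with input number `index` fixed to 1 and to 0.

    The returned functions keep f's arity and ignore the fixed position.
    """
    def fix(value):
        return lambda *x: f(*x[:index], value, *x[index + 1:])
    return fix(1), fix(0)


def majority(a, b, c):
    return int(a + b + c >= 2)


f1, f0 = cofactors(majority, 0)          # split on the first input, a
for a, b, c in product((0, 1), repeat=3):
    expansion = (a and f1(a, b, c)) or ((not a) and f0(a, b, c))
    assert int(bool(expansion)) == majority(a, b, c)
```

Each expansion step is a two-to-one selection controlled by x, i.e., a node with three inputs (x and the two cofactor signals), which is consistent with the three-input bound quoted in item 2.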

65.7 Conclusion
By understanding how FPGA logic is synthesized, hardware designers can make the best use of their software development tools to implement complex, high-performance circuits. Synthesis of FPGA logic devices combines the algorithms of Chortle and its extensions, Xmap, Hydra, MIS-pga 1, and MIS-pga 2. Each of these methods starts with an optimized Boolean network and then maps the logic into the configurable logic blocks of a field-programmable gate array circuit. Because the optimal covering problem is NP-hard, heuristic approaches must balance the optimality of the solution against the running time of the optimizer. Understanding this tradeoff is the key to rapidly prototyping logic using FPGA technology.

References
1. J. Rose, A.E. Gamal, and A. Sangiovanni-Vincentelli, Architecture of field-programmable gate arrays, Proceedings of the IEEE, vol. 81, pp. 1013-1029, July 1993.
2. Xilinx, Inc., The Programmable Logic Data Book, 1993.
3. ACTEL, FPGA Data Book and Design Guide, 1994.
4. A. Sangiovanni-Vincentelli, A.E. Gamal, and J. Rose, Synthesis methods for field programmable gate arrays, Proceedings of the IEEE, vol. 81, pp. 1057-1083, July 1993.
5. R.K. Brayton, G.D. Hachtel, and A. Sangiovanni-Vincentelli, Multilevel logic synthesis, Proceedings of the IEEE, vol. 78, pp. 264-300, Feb. 1990.
6. R. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, Multi-level logic optimization and the rectangular covering problem, IEEE International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 62-65, 1987.
7. R. Murgai, Y. Nishizaki, N. Shenoy, R.K. Brayton, and A. Sangiovanni-Vincentelli, Logic synthesis for programmable gate arrays, ACM/IEEE Design Automation Conference, (Orlando, FL), pp. 620-625, 1990.
8. R.K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A.R. Wang, MIS: A multiple-level logic optimization system, IEEE Transactions on Computer-Aided Design, vol. CAD-6, pp. 1062-1081, November 1987.
9. D. Bostick, G.D. Hachtel, R. Jacoby, M.R. Lightner, P. Moceyunas, C.R. Morrison, and D. Ravenscroft, The Boulder optimal logic design system, IEEE International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 62-69, 1987.
10. R.J. Francis, J. Rose, and K. Chung, Chortle: A technology mapping program for look-up table-based field programmable gate arrays, ACM/IEEE Design Automation Conference, (Orlando, FL), pp. 613-619, 1990.
11. R.J. Francis, J. Rose, and Z. Vranesic, Chortle-crf: Fast technology mapping for look-up table-based FPGAs, ACM/IEEE Design Automation Conference, (San Francisco, CA), pp. 227-233, 1991.
12. R.J. Francis, J. Rose, and Z. Vranesic, Technology mapping of look-up table-based FPGAs for performance, IEEE International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 568-575, 1991.
13. T. Luba, M. Markowski, and B. Zbierzchowski, Logic decomposition for programmable gate arrays, Euro ASIC ‘92, pp. 19-24, 1992.
14. D. Filo, J.C.-Y. Yang, F. Mailhot, and G.D. Micheli, Technology mapping for a two-output RAM-based field programmable gate array, European Design Automation Conference, pp. 534-538, 1991.
15. K. Karplus, Xmap: a technology mapper for table-lookup field programmable gate arrays, ACM/IEEE Design Automation Conference, (San Francisco, CA), pp. 240-243, 1991.
16. R. Murgai, R.K. Brayton, and A. Sangiovanni-Vincentelli, An improved synthesis algorithm for multiplexor-based PGA’s, ACM/IEEE Design Automation Conference, (Anaheim, CA), pp. 380-386, 1992.
17. R. Murgai, N. Shenoy, R.K. Brayton, and A. Sangiovanni-Vincentelli, Improved logic synthesis algorithms for table look up architectures, IEEE International Conference on Computer-Aided Design, (Santa Clara, CA), pp. 564-567, 1991.

Kanopoulos, N "Testability Concepts and DFT" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

66 Testability Concepts and DFT
Nick Kanopoulos
Atmel, Multimedia and Communications

66.1 Introduction: Basic Concepts
66.2 Design for Testability

66.1 Introduction: Basic Concepts
Physical faults or design errors may alter the behavior of a digital circuit. Design errors are tackled by redesigning the circuit, whereas physical faults can be reduced by determining appropriate operating conditions.1,2 There are many sources of physical faults: improper interconnections between parts, improper assembly, missing parts, and erroneous parts may occur while the circuit is being manufactured. After manufacturing, the circuit may fail due to excessive heat dissipation or for mechanical reasons associated with corrosion and, in general, bad maintenance. Short-circuit faults are those due to connections between signal lines that must be disconnected. Conversely, disconnecting lines that must be connected may cause open-circuit faults.1,3

Failures in the operation of digital circuits are addressed in the testing process, which is abstracted in Fig. 66.1. Typically, the testing process determines the presence of faults. The circuit being tested is often called the circuit under test (CUT). Errors are detected by applying test patterns to the inputs of the CUT and analyzing the responses at its outputs. A test pattern is typically a vector of 0s and 1s in which every bit corresponds to an input of the CUT. A test pattern is generated by a test pattern generator (TPG) tool. The responses are analyzed using an output response verification (ORV) tool. The ORV tool is a comparator circuit.

The testing process is done periodically during the circuit’s life span. It is initially done after fabrication, while the CUT is still on the wafer. Testing is also done when the CUT is removed from the wafer, and later it is tested as part of a printed circuit board (PCB). Testing is done either at the transistor level or at the logical level. We consider here logical-level testing, for which the TPG and ORV are concerned with binary values; that is, the signals are binary values. The components are gates and flip-flops (or latches). We do not consider parametric testing, which analyzes waveforms at the transistor level.

A circuit C = (V,E) is considered as a collection V of components and a collection E of lines. Figure 66.2 depicts a combinational circuit at the logic level. The components represent gates. The integer value on each circuit line indicates its label. The circuit inputs are lines 1, 2, 3, 6, 7, 23, and 24.

The test patterns may be precomputed by a pattern generator program, often referred to as an automatic test pattern generator (ATPG). The goal of an ATPG program is to quickly compute a small set of test patterns that detect all faults. The design of ATPG tools is a difficult task. Once the patterns are generated, they are stored in the memory of an automatic test equipment (ATE) mechanism that applies the test patterns and analyzes the responses using the ORV tool. In order for the ATE tools to test PCBs or complex digital systems, they must be controlled by computer programs.

FIGURE 66.1 The testing process.

FIGURE 66.2 A circuit at the logic level.

ATE equipment is often very expensive. Thus, some circuits are designed so that they can test themselves. This concept is called built-in self-testing (BIST). In BIST, the TPG and ORV tools are on-chip and the concern is twofold: accuracy and hardware cost. Chapter 67 reviews popular ATPG tools and BIST mechanisms. Furthermore, the complexity of current application-specific integrated circuits (ASICs) has led to the development of sophisticated CAD tools that automate the design of BIST mechanisms. Such tools are presented in Chapter 68.

The testing process requires fault models that precisely define the behavior of the (logic-level) circuit. The standard model for logical-level testing is the stuck-at fault model. This model associates two types of faults with each line l of the circuit: the stuck-at 0 fault and the stuck-at 1 fault. The stuck-at 0 fault assumes that line l is permanently stuck at the logic value 0. Similarly, the stuck-at 1 fault assumes it is stuck at 1. The single stuck-at fault model assumes that only one such fault is present at a time. Under the single stuck-at fault model, a circuit with E lines can have at most 2 · E faults. Although the stuck-at fault model appears to be simplistic, it has been shown to be very effective, and a set of patterns that detects all single stuck-at faults covers most (physical) faults as well.

However, the stuck-at fault model is of limited use for faults associated with delays in the operation of the CUT. Such faults are called delay faults. Although it has been shown that testing for delay faults can be theoretically reduced to testing for stuck-at faults in an auxiliary circuit, the size of the latter circuit is prohibitively large. Instead, an alternative fault model, the path delay fault model, is applied successfully. Discussion of the path delay fault model is postponed until Chapter 68.

In order for a test pattern to detect a stuck-at fault on line l, it must guarantee that the complementary logic value is applied on l.
In addition, it must apply an appropriate logic value to each of the other lines in the circuit so that the erroneous behavior of the circuit at line l is propagated all the way to an output line. This way, the fault is observed and detected. The problem of generating a test pattern that detects a given stuck-at fault is intractable; that is, it requires algorithms whose worst-case complexity is exponential in V + E, the size of the input circuit. ATPG algorithms for the stuck-at fault model are described in Chapter 67. In practice they are very efficient, and require seconds per stuck-at fault, even for very large circuits. The stuck-at fault model is easy to use, involves only 2 · E faults, and requires at most 2 · E test patterns.

Once a pattern is applied by the ATE, a process called fault simulation is performed in order to determine how many faults are detected by the applied test pattern. A key measure of the effectiveness of a set of test patterns is its fault coverage, defined as the percentage of faults detected by the set of patterns. Fault simulation is needed in order to determine the fault coverage of a set of test patterns, and it is important in testing with ATE as well as in the design of the on-chip test mechanisms. Fault


simulation is an inherently polynomial process for the stuck-at fault model. An overview of sophisticated fault simulation techniques is presented in Chapter 68.

Exhaustive TPG applies all possible test patterns at the circuit inputs, that is, 2^|I| test patterns for a circuit with |I| inputs. Instead, pseudo-exhaustive TPG guarantees that all stuck-at faults are covered with fewer than 2^|I| patterns. BIST schemes are often designed so that pseudo-exhaustive TPG is guaranteed. (See also Chapter 67.) However, sometimes we need to generate patterns only for a given set of stuck-at faults. This type of TPG is called deterministic TPG, and the generated test patterns must detect the predefined set of faults. A good pseudo-exhaustive or deterministic TPG tool must guarantee that a compact test set is generated.

Consider a three-input NAND gate where lines a, b, and c are the three inputs and line d is the output. There exist three directly controllable lines and one observable line. Let us describe a test pattern as a binary vector of three values applied to lines a, b, and c, respectively. There are 2 · 4 = 8 stuck-at faults. By applying all 2^3 = 8 patterns, all the faults are covered. However, a compact test set needs only four test patterns (and no fewer). Consider the following order of pattern application. Pattern (111) is applied first and covers four stuck-at faults. Pattern (110) covers two additional stuck-at faults. Finally, patterns (101) and (011) are needed to cover the last two faults. The number of applied patterns is also called the test length. The problem of minimizing the test length while guaranteeing 100% fault coverage is intractable. Heuristic methods can be applied to reduce the test length. Two faults are called indistinguishable if they are detected by the same set of test patterns. Identification of indistinguishable faults is an important concept in test set compaction.
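The three-input NAND example above can be checked with a few lines of fault simulation. The following Python sketch is only an illustration: the function names (`nand3`, `detected_by`, `coverage`) and the representation of a fault as a (line, stuck value) pair are assumptions made here, not part of the chapter.

```python
from itertools import combinations, product

LINES = ("a", "b", "c", "d")
# All 2 * 4 single stuck-at faults, written as (line, stuck value).
FAULTS = [(line, v) for line in LINES for v in (0, 1)]
ALL_PATTERNS = list(product((0, 1), repeat=3))


def nand3(a, b, c, fault=None):
    """Evaluate the 3-input NAND gate, optionally with one stuck-at fault."""
    vals = {"a": a, "b": b, "c": c}
    if fault is not None and fault[0] in vals:
        vals[fault[0]] = fault[1]          # input line forced to stuck value
    d = 1 - (vals["a"] & vals["b"] & vals["c"])
    if fault is not None and fault[0] == "d":
        d = fault[1]                       # output line forced to stuck value
    return d


def detected_by(pattern):
    """Faults whose presence changes the output under this pattern."""
    good = nand3(*pattern)
    return {f for f in FAULTS if nand3(*pattern, fault=f) != good}


def coverage(patterns):
    """Fault coverage (in percent) of a collection of patterns."""
    covered = set().union(*(detected_by(p) for p in patterns))
    return 100.0 * len(covered) / len(FAULTS)


compact = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
for p in compact:
    print(p, sorted(detected_by(p)))
print("coverage of compact set:", coverage(compact))

# Brute-force search for the shortest test length with full coverage.
shortest = min(
    k for k in range(1, len(ALL_PATTERNS) + 1)
    if any(coverage(s) == 100.0 for s in combinations(ALL_PATTERNS, k))
)
print("shortest full-coverage test length:", shortest)
```

The brute-force search is feasible only because the gate has eight patterns; for realistic circuits the same minimization is the intractable problem mentioned above.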
A stuck-at fault is called undetectable if it cannot be detected by any pattern. Any circuit that has at least one undetectable fault is called redundant. Any redundant circuit can be simplified by removing the line that contains the undetectable fault, and possibly other lines, without changing its functionality. In the above, the CUT was assumed to be a combinational circuit. The TPG process is significantly more difficult in sequential logic. In order for a stuck-at fault to be detected, a sequence of test patterns rather than a single pattern must be applied. Generating sequences of patterns with ATPG or on-chip TPGs is a tedious job. These concepts are discussed in more detail in Chapter 67.

66.2 Design for Testability

Design for testability (DFT) is applied to reduce difficulties associated with the TPG process on sequential circuits. DFT suggests that the digital circuit be designed with built-in features that assist the testing process. The goal in DFT is to maximize fault coverage while minimizing the effort of test pattern generation, the time required to apply the generated patterns, and the built-in hardware overhead. By definition, DFT is needed for BIST, where TPG and ORV are on-chip. However, the majority of the proposed DFT methods target the simplification of the ATPG process for sequential circuits, and assume that ATE is used. Experienced engineers have developed guidelines that direct the insertion of the built-in mechanisms so that the sequential CUT becomes testable with ATPG tools:
1. Set the circuit to a known state before and during testing. This is achieved by a RESET control line that is connected to the asynchronous CLEAR of each flip-flop in the CUT.
2. Partition the CUT into subcircuits that are easier to test.
3. Simplify the circuit to avoid redundancies.
4. Control and observe lines on feedback paths, lines that are far from inputs and outputs, and lines with high fan-in and fan-out.
One way to implement the first guideline (1) is by inserting test points that control and observe lines x chosen to break all feedback paths. A test point on line x = (xin, xout) is a simple circuit that realizes the function f(x, s, c) = s′ · (x + c). The output of this circuit feeds xout. Input signals s and c are control signals. When s = 0 and c = 0, we have that f = x; that is, this combination can be used in operation mode. When s = 0 and c = 1, function f evaluates to 1. When s = 1 and c = 0, f evaluates to 0. The last two combinations


can be used in the testing mode, and they guarantee that the line is fully controllable. It can be made observable by simply allowing for a new primary output at signal x. Another mechanism is to use bypass latches, also referred to as bypass storage elements (bses). These latches are bypassed during the operation mode and are fully controllable and observable points in the testing mode. This dual functionality is easily obtained with a simple multiplexing circuitry. (See also Fig. 66.3 below.)
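The behavior of the test-point function f(x, s, c) = s′ · (x + c) described above can be tabulated directly. The short Python sketch below is illustrative only; the name `control_point` is chosen here and does not come from the chapter.

```python
def control_point(x, s, c):
    """Single-bit model of f(x, s, c) = s' . (x + c)."""
    return (1 - s) & (x | c)


# Print the full truth table: (s, c) = (0, 0) passes x through,
# (0, 1) forces the line to 1, and s = 1 forces it to 0.
for s in (0, 1):
    for c in (0, 1):
        row = [control_point(x, s, c) for x in (0, 1)]
        print(f"s={s} c={c}  f(x=0)={row[0]}  f(x=1)={row[1]}")
```

Only the first three (s, c) combinations are used in the text; with s = 1 and c = 1 the expression also evaluates to 0, since s′ = 0 dominates.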

FIGURE 66.3

The structure of a bypass storage element.

In both cases, the total hardware must be minimized, subject to a lower bound on the enhancement of the circuit’s testability. This optimization criterion requires sophisticated CAD tools, some of which are described in Chapter 68. The most popular DFT approach is the scan design. The approach is a variation of the bypass latch approach discussed earlier. Instead of adding new latches, as the bypass latch approach suggests, the scan design approach enhances every flip-flop in the circuit with a multiplexing mechanism that allows for the following. In the operation mode, the flip-flop behaves as usual. In the testing mode, all the flip-flops are connected to a single shift chain. The input of this chain is a single controllable point and its output is a single observable point. In the testing mode, each scanned flip-flop is a fully controllable and observable point. Observe that the testing phase amounts to testing combinational logic. Therefore, the ATPG (or the on-chip TPG) needs to generate single patterns instead of sequences of patterns. Each generated pattern is serially shifted in the scan chain. Typically, this process requires as many clock cycles as the number of flip-flops. Once every flip-flop obtains its controlling value, the circuit is turned to operation mode for a single cycle. Now the flip-flops are disconnected from the scan chain, and at the end of the clock cycle, the flip-flops are loaded with values that are to be observed and analyzed. Now the circuit is switched back into the testing mode (i.e., all flip-flops form again a scan chain). At this point, the states of the flip-flops are shifted out and are analyzed. This requires no more clock cycles than the number of flip-flops. The described scan approach is also called full scan because all flip-flops in the circuit are scanned. The advantage of the full scan approach is that it requires only two additional I/O pins: the input and output of the scan chain, respectively. 
The disadvantage is that it is time-consuming due to the shift-in and shift-out processes for each applied pattern, especially for circuits with many flip-flops. For such circuits, it is also hardware intensive because every flip-flop must have dual operation mode capability. The hardware and the application time can be reduced by employing CAD tools. See also Chapter 68. Another way to reduce application time and hardware cost is through partial scan. In partial scan, only a subset of flip-flops is scanned. Selecting the flip-flops to scan and their ordering in the scan chain also requires sophisticated CAD tools. The tradeoff in partial scan is that the ATPG tool may have to generate test sequences


rather than single patterns. A CAD tool is needed in order to select and scan a small number of flip-flops. This guarantees low hardware overhead and low application time. The flip-flop selection must also guarantee an upper bound on the length of any generated test sequence. This simplifies the task of the ATPG tool and has an impact on the test application time.
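The full-scan procedure described earlier in this section (shift in, one operation-mode clock, shift out) can be sketched as a small simulation that also counts clock cycles. This is a simplified model under stated assumptions: the chain is a plain Python list, `next_state` is a hypothetical stand-in for the combinational logic of the CUT, and shift-out is not overlapped with the shift-in of the next pattern.

```python
def scan_test(pattern, next_state):
    """Apply one pattern through a full-scan chain.

    pattern    : list of bits, one per scanned flip-flop
    next_state : hypothetical function modelling the combinational logic,
                 mapping flip-flop contents to the values captured next
    Returns (observed response, number of clock cycles used).
    """
    n = len(pattern)
    chain = [0] * n
    cycles = 0

    # Testing mode: shift the pattern in serially, one bit per clock.
    for bit in reversed(pattern):
        chain = [bit] + chain[:-1]
        cycles += 1

    # Operation mode for a single clock: capture the circuit response.
    chain = list(next_state(chain))
    cycles += 1

    # Testing mode again: shift the captured state out serially.
    observed = []
    for _ in range(n):
        observed.append(chain[-1])
        chain = [0] + chain[:-1]
        cycles += 1

    return observed[::-1], cycles


# Toy combinational logic for three flip-flops (purely illustrative).
toy_logic = lambda q: [q[0] ^ q[1], q[1] & q[2], 1 - q[2]]
response, cycles = scan_test([1, 0, 1], toy_logic)
print(response, cycles)
```

In this model each pattern costs 2n + 1 clocks for n scanned flip-flops, which makes the time penalty of full scan on circuits with many flip-flops easy to see.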

References
1. M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, Computer Science Press, 1990.
2. J.P. Hayes, Introduction to Digital Logic Design, Addison Wesley, Boston, 1993.
3. P.H. Bardell, W.H. McAnney, and J. Savir, Built-In Test for VLSI: Pseudorandom Techniques, John Wiley & Sons, New York, 1987.


Kagaris, D. "ATPG and BIST" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


67 ATPG and BIST

Dimitri Kagaris, Southern Illinois University

67.1 Automatic Test Pattern Generation: TPG Algorithms • Other ATPG Aspects
67.2 Built-In Self-Test: Online BIST • Offline BIST

67.1 Automatic Test Pattern Generation

Automatic test pattern generation (ATPG) refers in general to the set of algorithmic techniques for obtaining a set of test patterns that detect possible faulty behavior of a circuit after its fabrication. Faults during fabrication can affect the functional correctness of the circuit (functional faults) and its timing performance (delay faults). In this chapter, we deal only with functional faults. The physical faults in a circuit (such as breaks, opens, and technology-specific faults) have to be modeled as logical faults (like “stuck-at” and “bridging” faults) in order to reduce the required complexity of ATPG. The most common fault model used in practice is the stuck-at model, where lines in a gate-level or register-transfer-level description of a circuit are assumed to be set permanently to a “1” or “0” value in the presence of a fault. An additional restriction is that the modeled faults cause only one line in the circuit to have a stuck-at value (single stuck-at fault model). Patterns generated under this model have been shown in practice to cover many of the unmodeled faults as well. Given a list of stuck-at faults of interest, the primary goal of ATPG is to generate a test pattern for each of these faults, and additionally to keep the overall number of test patterns generated as small as possible. The latter is required for reducing the time/cost of applying the test patterns to the circuit. In this section, we describe basic test pattern generation (TPG) algorithms for finding a test pattern given a stuck-at fault, and other aspects of the ATPG process for facilitating the task of TPG algorithms and reducing the number of generated test patterns.

TPG Algorithms

Given a target fault of line l being stuck at value v, denoted by l s–a–v, a TPG algorithm attempts to generate a pattern such that (1) the pattern brings l to have the value v̄ (fault activation) and (2) the same pattern carries over the effect of the fault to a primary output (fault propagation). A path from line l to a primary output along each line of which the effect of the fault is carried over is called a sensitized path. The case of a line having a value of “1” in the correct circuit and a value of “0” in the circuit under the fault l s–a–v is denoted by the symbol D and, similarly, the opposite case is denoted by D̄. Given the symbols D and D̄, the basic Boolean operations AND, OR, NOT can be extended in a straightforward manner. For example, AND(1, D) = D, AND(1, D̄) = D̄, AND(0, D) = 0, AND(0, D̄) = 0, AND(x, D) = x, AND(x, D̄) = x (where x denotes the don’t-care case), etc.
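One possible way to realize the extended operations above is to store each value as a (correct-circuit, faulty-circuit) pair and to treat x separately. The Python sketch below is an assumption-laden illustration: the string labels (with "Dbar" standing in for D̄), the pair encoding, and the rule that any result not expressible in the five values collapses to x are choices made here to match the listed identities.

```python
ZERO, ONE, D, DBAR, X = "0", "1", "D", "Dbar", "x"

# Each known value as (value in correct circuit, value in faulty circuit).
PAIR = {ZERO: (0, 0), ONE: (1, 1), D: (1, 0), DBAR: (0, 1)}
NAME = {pair: name for name, pair in PAIR.items()}


def NOT(a):
    if a == X:
        return X
    good, faulty = PAIR[a]
    return NAME[(1 - good, 1 - faulty)]


def AND(a, b):
    if ZERO in (a, b):        # a controlling 0 decides the output
        return ZERO
    if X in (a, b):           # otherwise an unknown input leaves it unknown
        return X
    (ga, fa), (gb, fb) = PAIR[a], PAIR[b]
    return NAME[(ga & gb, fa & fb)]


def OR(a, b):
    # De Morgan: OR(a, b) = NOT(AND(NOT a, NOT b))
    return NOT(AND(NOT(a), NOT(b)))


for a, b in [(ONE, D), (ONE, DBAR), (ZERO, D), (X, D), (D, DBAR)]:
    print(f"AND({a}, {b}) = {AND(a, b)}   OR({a}, {b}) = {OR(a, b)}")
```

Evaluating the correct and faulty circuits side by side in this way is what allows a single pass over the netlist to show whether a fault effect reaches an output.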


TPG Algorithms for Combinational Circuits

A basic TPG algorithm for combinational circuits is the D-algorithm.1 This algorithm works as follows. All lines are initially assigned a value of x, except line l, which is assigned a value of D if the fault is l s–a–0, and a value of D̄ if the fault is l s–a–1. Let G be the gate whose output line is l. The algorithm goes through the following steps:
1. Select an assignment for the inputs of G out of all possible assignments that produce the appropriate D-value (i.e., a D or D̄) at the output of G. This step is known as fault activation. All possible assignments are fixed for each gate type and are referred to as the primitive d-cubes for the fault (pdcfs) of the gate. For example, the pdcfs of a two-input AND gate are 0xD̄, x0D̄, and 11D, and the pdcfs of a two-input OR gate are 1xD, x1D, and 00D̄ (using the notation abc for a gate with input values a and b and output value c).
2. Repeatedly select a gate from the set of gates whose output is currently x but has at least one input with a D-value. This set of gates is known as the D-frontier. Then select an assignment for the inputs of that gate out of all possible assignments that set the output to a D-value. All possible assignments are fixed for each gate type and are referred to as the propagation d-cubes (pdcs) of the gate. For example, the pdcs of a two-input AND gate are 1DD, D1D, 1D̄D̄, D̄1D̄, DDD, and D̄D̄D̄. By repeated application of this step, a D-value is eventually propagated to a primary output. This step is known as fault propagation.
3. Find an assignment of values for the primary inputs that establishes the candidate values required in steps (1) and (2). This step is known as line justification.
For each value that is not currently accounted for, the line justification process tries to establish (“justify”) the value by (a) assigning binary values (and no D-values) on the inputs of the corresponding gate, working its way back to the primary inputs (this process is referred to as backtracing); and (b) determining all values that are imposed by all candidate assignments thus far (implication) and checking for any inconsistencies (consistency check).
4. If, during step (3), an inconsistency is found, then the computation is restored to its state at the last decision point. This process is known as backtracking. A decision point can be (a) the decision in step (1) of which pdcf to select; (b) the decisions in step (2) of which gate to select from the D-frontier and which pdc to select for that gate; (c) the decision in step (3) of which binary combination to select for each value that has to be justified.
5. If line justification is eventually successful after zero or more backtrackings, then the existing values on the primary inputs (some of which may well be x) constitute a test pattern for the fault. Otherwise, no pattern can be found to test the given fault and that fault is thus shown to be redundant.

The order of steps (2) and (3) may be interchanged, or the two steps may even be interleaved, in an attempt to reduce the running time, but whether or not a pattern is found is not affected by such changes.

FIGURE 67.1 Example circuit.

As an example of the application of the D-algorithm, consider the circuit in Fig. 67.1 and the fault G s–a–1. In order to establish G ← D̄, the pdcf CD ← 00 is chosen and the D-frontier becomes {J} (gates are named by their output line). Then, gate J is considered and the pdc setting I ← 1 is selected with result J ← D and new D-frontier {M, N}. Assume gate M is selected. Then, the pdc setting H ← 0 is selected with result M ← D̄.
However, the justification of current values H ← 0 and I ← 1 results in conflict, so the algorithm backtracks and tries the next pdc for gate M which sets H ← D. But again, this cannot be justified. Then the algorithm backtracks once


more and selects gate N from the D-frontier. Then the assignment E ← 1 is made, which results in N ← D. Since the values E ← 1 and I ← 1 can now be justified without conflict, the algorithm terminates successfully, returning test pattern ABCDE = 11001.

FIGURE 67.2 Multipath sensitization.

As another example, consider the circuit in Fig. 67.2 and the fault B s–a–1. In order to establish B ← D̄, the assignment B ← 0 is made and the D-frontier becomes {F, G}. Assume that gate F is selected. In order to propagate the fault to line H, the pdc setting A ← 1 is selected and the pdc of gate H setting G ← 0 is tried. But this results in conflict, as B (and E) are required to be 0. Then the algorithm backtracks and tries the next available pdc of H, which sets G ← D. This value can now be justified by setting C ← 1, with resulting test pattern ABC = 101. A similar thing happens if gate G is selected from the original D-frontier. That is, in this example, the algorithm had to sensitize two paths simultaneously from the fault site to a PO in order to detect the fault. This is referred to as multipath sensitization, but its need rarely arises in practice. To reduce computational time, examination of pdcs involving more than one input being set to D (or D̄) is often omitted.

Another basic TPG algorithm is PODEM.2 The PODEM algorithm also uses the five-valued logic (0, 1, x, D, D̄), and works as follows. Initially, all lines are assigned a value of x except line l, which is assigned a value of D if the fault is l s–a–0, and a value of D̄ if the fault is l s–a–1. The algorithm at each step tries to satisfy an objective (v, l), defined as a desired value v at a line l, by making assignments only to primary inputs (PIs), one PI at a time. The mapping of an objective to a single PI value is done heuristically, as explained below. The initial objective is (v̄, l), assuming that the examined fault is l s–a–v.
Then the algorithm computes all implications of the current pattern of values assigned to PIs. If the effect of the fault is propagated to a primary output (PO), the algorithm terminates with success. If a conflict occurs and the fault cannot be activated or cannot be propagated to a PO, then the algorithm backtracks to the previous decision point, which is the last assignment to a PI. If no conflict occurs but the fault has not been activated or not been propagated to a PO because the currently implied values on the lines involved are x, then the algorithm continues with the same objective (v̄, l) if the fault is still not activated, or with an objective (c̄, l′) if the fault has been activated but not propagated, where l′ is an input line of a gate from the D-frontier that currently has a value of x on it, and c is the controlling value of that gate. The determination of which single PI to select and which value to assign to it given an objective (v, l) is done heuristically (in the worst case, at random). A simple heuristic is to select a path from line l to a PI such that every line of the path except l has an x value on it, and assign to that PI the value v (v̄) if the total number of inverting gates (i.e., NOT, NAND, NOR) along that path is even (odd). In addition, concerning the selection of a gate from the D-frontier, a simple heuristic is to select the gate that is closest to a PO.

As an example of the application of PODEM, consider the circuit of Fig. 67.1 and the fault G s–a–1. The initial objective is (0, G). The chosen PI assignment is C ← 0, and this has no implications. The objective remains the same, with chosen PI assignment D ← 0 and implication G ← D̄. The D-frontier becomes {J} and the next objective is (1, I). This results in PI assignments A ← 1 and B ← 1 with implications F ← 1, H ← 1, I ← 1, M ← 0, J ← D, K ← D, L ← D, and new D-frontier {N}. The next objective is (1, E), which is immediately satisfied and has implication N ← D.
So, the algorithm returns successfully with test pattern ABCDE = 11001. In the example of Fig. 67.2, PODEM works as follows. The original objective is (0, B). With PI assignment B ← 0, the D-frontier becomes {F, G}. Assuming gate F is selected, the next objective is (1, A), which is immediately satisfied with resulting implication F ← D and new D-frontier {G, H}. Given that gate H is selected as closer to the output, the next objective is (0, G), which leads to the PI assignment C ← 1 with implications G ← D and H ← D. That is, the resulting test pattern is ABC = 101. Notice that although the implied value for G was D while the objective generated was (0, G), this is not considered a conflict, since the goal of any objective is only to lead to a PI assignment that activates and propagates the fault to a PO.
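The pattern ABCDE = 11001 obtained for the fault G s–a–1 in the walkthroughs above can be cross-checked by brute-force simulation. Because Fig. 67.1 is not reproduced in the text, the netlist below is an assumption inferred from the implications and fault relations quoted in this chapter: F = AND(A, B) with fan-out branches H and I, G = OR(C, D), J = NAND(G, I) with fan-out branches K and L, M = NOR(H, K), N = AND(E, L), and primary outputs M and N.

```python
from itertools import product


def simulate(a, b, c, d, e, g_stuck=None):
    """Assumed netlist for Fig. 67.1; returns the primary outputs (M, N)."""
    f = a & b                  # F = AND(A, B); H and I are branches of F
    g = c | d                  # G = OR(C, D)
    if g_stuck is not None:
        g = g_stuck            # inject the stuck-at fault on line G
    j = 1 - (g & f)            # J = NAND(G, I); K and L are branches of J
    m = 1 - (f | j)            # M = NOR(H, K)
    n = e & j                  # N = AND(E, L)
    return m, n


# Every input pattern on which the good and faulty circuits differ.
tests_for_g_sa1 = [
    p for p in product((0, 1), repeat=5)
    if simulate(*p) != simulate(*p, g_stuck=1)
]
print(tests_for_g_sa1)
```

Under this assumed netlist the search confirms the pattern found by both algorithms, and it also shows why propagation through M fails: sensitizing J requires F = 1, which forces the NOR gate output to 0 regardless of K.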


FIGURE 67.3 Backtracking in PODEM.

As an example involving backtracking in PODEM, consider the circuit of Fig. 67.3 and the fault J s–a–1. Starting with objective (0, J), the PI assignment A ← 0 is made (using path HFEA) with no implication, and then the PI assignment B ← 0 is made (using path HFEB) with implications E ← 0, F ← 0, G ← 0, H ← 0, I ← 1, J ← 1. But the latter constitutes a conflict, and so the algorithm backtracks, trying PI assignment A ← 1. The implications of this assignment are E ← 1, F ← 1, G ← 1. Since the fault at J is still not activated, the objective (1, B) is generated next (using path HFEB), which is satisfied immediately but has no new implications; then the objective (0, C) is generated (using path HC), which is satisfied immediately and has implication H ← 0. Finally, the objective (1, D) is generated (using path ID), which is satisfied immediately and has implications I ← 0 and J ← 0. Since the fault is now activated and (trivially) propagated, the algorithm terminates successfully with test pattern ABCD = 1101.

Both of these basic algorithms are complete in that, given enough time, they will find a pattern for a fault if and only if the fault is not redundant. The D-algorithm performs an implicit state-space search by assigning values to the lines of the circuit, whereas PODEM performs an implicit state-space search by assigning values to the PIs only. For circuits with no fan-out or without reconvergent fan-out, the algorithms take time linear in the size of the circuit; but for general circuits (with reconvergent fan-out), the algorithms may take exponential time. In fact, the test pattern generation problem has been shown to be NP-complete.3 The implicit state search in conjunction with a variety of heuristic measures can cut down the running time requirements.
For instance, performing as many implications at each point as possible and checking for the existence of at least one path from a gate in the D-frontier to a PO such that every line on that path has an x value (otherwise, fault propagation is impossible) are very useful measures. In general, PODEM is faster than the D-algorithm. Several extensions to PODEM have been proposed, such as working with more than one objective each time, and stopping backtracking before reaching PIs. For instance, the FAN algorithm4 maintains a list of multiple objectives and stops backtracking at headlines rather than just PI lines. A headline is a line that is driven by a subcircuit containing no line that is reachable from some fan-out stem, and, therefore, can be justified at the end with no conflicts. As a short illustration, consider the example in Fig. 67.3. In order to activate the fault (i.e., J ← 0), both lines H and I must be driven to 0. The objectives (0, H) and (0, I) are now both taken into consideration. In order to achieve objective (0, H), the assignment E ← 0 can be selected, as line E is a headline. But in order to achieve objective (0, I), the assignment E ← 1 is required. Therefore, the algorithm selects the alternative assignment C ← 0 (as C is a PI) for objective (0, H), and then selects the assignment E ← 1 (as E is a headline) and D ← 1 (as D is a PI) for objective (0, I), which results in success. The justification of the value on E is left for a final pass with resulting test pattern ABCD = 1x01 or ABCD = x101. There is a plethora of TPG algorithms based on various strategies (see, e.g., Ref. 5 for more information). There are also parallel TPG algorithms designed for particular devices such as ROMs and PLAs.

TPG Algorithms for Sequential Circuits

Detecting faults in sequential circuits is much more difficult than in combinational circuits.
This is due to the fact that because of the memory elements present in the logic, a sequence of patterns is generally required for each fault, along with an appropriate initial state. In general, TPG techniques for combinational circuits can be applied to sequential circuits by considering the iterative logic array model of the sequential circuits. This model applies to both synchronous and asynchronous sequential circuits, although it is more complex for the latter. Given a current state vector Q and a current input vector X, the function of a sequential circuit is specified as a mapping from (X, Q) to (Q+, Z), where Q+ is the next state vector and Z is the resulting


output. In the iterative logic array representation, the sequential circuit is modeled as a series of combinational circuits C0, C1, …, CN, where N is the length of the current input pattern sequence applied to the sequential circuit. Each circuit Ci, referred to as a time frame, is an identical copy of the sequential circuit but with all feedback removed, and has inputs Xi and Qi, and outputs Q+i and Zi. Inputs Xi are driven by the ith pattern applied to the sequential circuit and inputs Qi are driven by the outputs Q+i–1 of the previous time frame for i > 0, with Q0 being set to the original initial state of the sequential circuit. All outputs Zi are ignored except for the outputs ZN of the last time frame, which constitute the output of the sequential circuit resulting from the specific input sequence and initial state. Given a stuck-at fault, the fundamental idea in sequential TPG is to create an iterative logic array of appropriate length N and justify all the values necessary for the fault to be activated and propagated to the outputs ZN of the last time frame. If this can be achieved with the values of the Q0 inputs of the first time frame all being set to ‘x’, then a self-initializing test sequence is produced. Otherwise, the specific values required for the Q0 inputs (preferably, all “0”s) are assumed to be easily established through a reset capability. In principle, one can start from one time frame Ct (with the index t to be appropriately adjusted later) and try to propagate the effect of the fault to either some of the Zt lines or some of the Q+t lines. In case of propagation to the Zt lines, Ct becomes tentatively the last frame in the iterative logic array and line justification by assignments to the Xt and Qt lines is repeatedly done in additional time frames Ct–1, Ct–2, …, Ct–Nb (up to some number Nb), until all lines are justified with Qt–Nb being set either to all ‘x’s or to a resettable initial state.
In case of propagation to the Q+t lines, additional time frames Ct+1, Ct+2, …, Ct+Nf are considered (up to some number Nf), until the effect of the fault is propagated to the ZNf lines. Notice that because each time frame contains the same fault, the propagation can be done from any of the Ct–1, Ct–2, …, Ct–Nb time frames to the ZNf lines. Then, line justification is again attempted as above. In case of conflict during the justification process, backtracking is attempted to the last decision point, and this backtracking can reach as far as the Ct–Nf frame. In order to reduce the storage required for the computation status as well as the time requirements of this process, algorithms that consider only backward justification and no forward fault propagation have been proposed. For example, the Extended Backtrace (EBT) algorithm6 selects a path from the fault site to a primary output, which may involve several time frames Ct–i, Ct–i+1, …, Ct, and then tries to justify all values for the sensitization of this path (along with the requirements for the initial state) by working with time frames Ct, Ct–1, …, Ct–i, …, Ct–Nb. As an illustration of the application of the EBT algorithm, consider the sequential circuit in Fig. 67.4(a). The structure of each time frame in the iterative logic array representation of it is given in Fig. 67.4(b).

FIGURE 67.4

A sequential circuit and a time frame in the iterative logic array representation.


Consider the fault S s–a–0. The EBT algorithm selects the path SQ2Z to propagate the fault. This path involves two time frames, as the value of line S is the value of line Q2 one clock cycle earlier (by definition of the D-type flip-flop). Considering the index of the last frame to be t and following the structure of each time frame (Fig. 67.4(b)), the path actually comprises the lines Z[t], Q2[t], Q+2[t–1]. In order to sensitize this path, line E[t] must be set to 1. Now, in order to activate the fault at line S, which is identified with Q+2[t–1], lines I[t–1] and Q1[t–1] must be set to 1. Assuming a self-initializing sequence is sought, further justification needs to be made for the value Q1[t–1], which is equal to the value of line Q+1[t–2] in an additional time frame indexed by t – 2. Since Q+1[t–2] is set directly by I[t–2], the search is over and the self-initializing sequence (first pattern first) is IE = (1x, 1x, x1).
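The time-frame view of this example can be replayed in a few lines. Since Fig. 67.4 is not reproduced in the text, the gate functions below are assumptions inferred from the walkthrough: Q+1 = I, S = Q+2 = AND(I, Q1), and Z = AND(Q2, E). The helper names (`time_frame`, `final_output`, `fills`, `self_initializing`) are likewise invented for this sketch.

```python
from itertools import product


def time_frame(i, e, q1, q2, s_stuck=None):
    """One combinational time frame: (I, E, Q1, Q2) -> (Q1+, Q2+, Z)."""
    s = i & q1                 # assumed: S = AND(I, Q1) drives flip-flop 2
    if s_stuck is not None:
        s = s_stuck            # the same fault is present in every frame
    z = q2 & e                 # assumed: Z = AND(Q2, E)
    return i, s, z             # assumed: Q1+ = I


def final_output(sequence, q1, q2, s_stuck=None):
    """Chain the frames and return Z of the last time frame."""
    z = None
    for i, e in sequence:
        q1, q2, z = time_frame(i, e, q1, q2, s_stuck)
    return z


def fills(partial):
    """Yield all fully specified sequences, replacing each x (None) by 0/1."""
    flat = [v for pair in partial for v in pair]
    holes = [k for k, v in enumerate(flat) if v is None]
    for bits in product((0, 1), repeat=len(holes)):
        filled = list(flat)
        for k, b in zip(holes, bits):
            filled[k] = b
        yield [(filled[j], filled[j + 1]) for j in range(0, len(filled), 2)]


def self_initializing(partial):
    """True if S s-a-0 is detected for every x fill and every initial state."""
    return all(
        final_output(seq, q1, q2) == 1
        and final_output(seq, q1, q2, s_stuck=0) == 0
        for seq in fills(partial)
        for q1, q2 in product((0, 1), repeat=2)
    )


IE = [(1, None), (1, None), (None, 1)]      # the sequence (1x, 1x, x1)
print(self_initializing(IE))
print(self_initializing([(1, None), (None, 1)]))   # one frame too short
```

Dropping the first pattern leaves Q1[t–1] dependent on the unknown initial state, which is why the third time frame is needed for a self-initializing sequence under these assumptions.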

Other ATPG Aspects

There are several components in the ATPG process that are centered around the TPG algorithm and can be viewed as preprocessing or postprocessing steps to it. Given a list of target faults on which the TPG algorithm is to work, some very useful preprocessing steps include the following:
1. Fault collapsing: For a circuit with n lines in total, there are 2n possible stuck-at faults to consider. Fault collapsing reduces this initial number by taking advantage of equivalence and dominance relations among faults. Two faults are said to be functionally equivalent if every pattern that detects one also detects the other. Given a set of functionally equivalent faults, only one fault from that set has to be considered for test generation. A fault f1 is said to dominate a fault f2 if all patterns that detect f2 also detect f1 and there is at least one pattern that detects f1 but not f2. Then only f2 needs to be considered for test generation. It can be shown that the fault s–a–(c ⊕ i) on the output of a gate is functionally equivalent to the fault s–a–c on any of the gate inputs, and that the fault s–a–(c̄ ⊕ i) on the output of a gate dominates the fault s–a–c̄ on any of the gate inputs, where c is the controlling value of the gate and i is 1 (0) if the gate is inverting (non-inverting). As an example, using these relations on the circuit of Fig. 67.1, we obtain that (F–s–0, A–s–0, B–s–0), (G–s–1, C–s–1, D–s–1), (J–s–1, G–s–0, I–s–0), (M–s–0, H–s–1, K–s–1), (N–s–0, E–s–0, L–s–0) are functionally equivalent sets of faults, and that F–s–1 dominates A–s–1 and B–s–1, G–s–0 dominates C–s–0 and D–s–0, J–s–0 dominates G–s–1 and I–s–1, M–s–1 dominates H–s–0 and K–s–0, and N–s–1 dominates E–s–1 and L–s–1. Given these relations, only the set of faults {A–s–1, B–s–1, C–s–0, D–s–0, G–s–1, I–s–1, H–s–0, K–s–0, E–s–1, L–s–1, F–s–0, M–s–0, N–s–0} need be considered; the number of target stuck-at faults is reduced from 28 to 13.
2. Removal of randomly testable faults: A very simple way of eliminating faults from a target fault list is to generate test patterns at random and verify, by fault simulation, which target faults (if any) each generated pattern detects. The generation of such patterns is done by a pseudorandom method, that is, an algorithmic method whose behavior under specific statistical criteria seems close to random. Eliminating all faults by pseudorandom test pattern generation generally requires a very large number of patterns. For instance, under the assumption of uniform input distribution and independent test pattern generation, the smallest number of patterns needed to detect, with probability $P_s$, a fault whose detection probability is $d$ is $N = \lceil \ln(1 - P_s) / \ln(1 - d) \rceil$, which follows from requiring $1 - (1 - d)^N \ge P_s$. In general, faults with small detection probability are referred to as randomly untestable or hard-to-detect faults, whereas faults with high detection probability are referred to as randomly testable or easy-to-detect faults. For example, in a circuit consisting of a single k-input AND gate with output line l, the fault l s–a–0 is a hard-to-detect fault as only one out of 2^k patterns can detect it, whereas the fault l s–a–1 is an easy-to-detect fault as 2^k – 1 out of 2^k patterns can detect it. In practice, an acceptable number of pseudorandom test patterns are generated and simulated in order to drop many easy-to-detect faults from the target fault list, with all remaining faults given over to a deterministic (as opposed to pseudorandom) TPG tool, in case a complete test is desired.


3. Removal of faults identified by critical path tracing: A critical path under an input pattern t is a path from a primary input or internal line to a primary output such that if there is a change in the value under t of any line in the path, the PO also changes (in other words, input pattern t can serve as a test pattern for each fault l s–a–v̄, where l is any line of the path and v is the value of that line under t). Critical path tracing is a technique for systematically identifying critical paths in a circuit. Starting from an assigned value to a PO (a PO line always constitutes a critical subpath), it works its way back to the PIs, trying to extend current critical subpaths. The extension, however, cannot be done safely through stems of reconvergent fan-out. Given a gate whose output is the beginning of a current critical subpath, the method assigns either only one input of the gate to value c̄ or all inputs of the gate to value c in order to justify the output value, where c is the critical value of the gate. In both cases, longer critical subpaths are created that can be developed further recursively. Once the PIs are reached and all non-critical values are justified, all corresponding faults on lines in critical paths are covered by the resulting input pattern, and so these faults can be dropped from the initial fault list. Some critical paths for the circuit of Fig. 67.3 are shown in Fig. 67.5. Notice that stem E in Fig. 67.5(a) is not critical (as found by separate fault simulation), whereas stem E in Fig. 67.5(b) turns out to be actually critical. Critical path tracing can also be viewed as a fault-independent (in contrast to fault-driven) deterministic TPG algorithm that is generally faster but may not cover all possible detectable faults or prove that a fault is undetectable.

A basic postprocessing step after test patterns have been generated by an ATPG technique is compaction.
Compaction attempts to reduce the number of patterns by taking advantage of any x values in the patterns generated. The basic step is to merge two patterns which do not have conflicting values in any bit position. For example, in Fig. 67.6(a), we can compact patterns t1, t2 and t3, t4 to obtain the test set in Fig. 67.6(b), which cannot be compacted further. However, we can also compact patterns t2, t3, t4 and t1, t5 to obtain the test set in Fig. 67.6(c), which is smaller than that of Fig. 67.6(b). In general, finding a compacted test set of minimum size is an NP-hard problem, but efficient heuristics exist to solve the problem satisfactorily. Compaction can also be done simultaneously with test pattern generation in order to better exploit the x values as soon as they are generated. This is referred to as dynamic compaction (in contrast to static compaction), and its basic idea is to assign appropriately any x values in the last generated pattern in order to obtain test patterns for additional faults.

FIGURE 67.5 Some critical paths (shown in bold) found by critical path tracing.

FIGURE 67.6 Compaction of test patterns.
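
The merging step of static compaction can be illustrated with a minimal sketch. It assumes patterns are strings over '0', '1', and 'x' and uses a simple greedy first-fit order; the function names and the sample patterns are hypothetical (they are not the patterns of Fig. 67.6), and this is only one of many possible heuristics.

```python
def compatible(a, b):
    """Two patterns are compatible if no bit position holds conflicting 0/1 values."""
    return all(p == q or p == "x" or q == "x" for p, q in zip(a, b))


def merge(a, b):
    """Merge two compatible patterns, keeping every specified (non-x) bit."""
    return "".join(q if p == "x" else p for p, q in zip(a, b))


def static_compaction(patterns):
    """Greedy first-fit compaction; the result depends on the pattern order."""
    compacted = []
    for pattern in patterns:
        for idx, existing in enumerate(compacted):
            if compatible(existing, pattern):
                compacted[idx] = merge(existing, pattern)
                break
        else:
            # No compatible pattern found so far: keep this one separately.
            compacted.append(pattern)
    return compacted


# Hypothetical test patterns, used only for illustration.
tests = ["01xx", "0x1x", "x010", "1xx0"]
print(static_compaction(tests))
```

Because the heuristic is order dependent, feeding the same patterns in a different order may produce a compacted set of a different size, which mirrors the difference between Fig. 67.6(b) and Fig. 67.6(c).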

67.2 Built-In Self-Test

In order to make the testing of a VLSI circuit easier, several design-for-testability criteria can be taken into account along with the other “traditional” design criteria of cost, delay, area, power, etc. For example, a common design-for-testability technique known as full-scan transforms a sequential circuit into combinational parts by linking all its flip-flops, in a “test mode,” into a shift register, so that patterns to initialize the flip-flops can be easily loaded and responses can be observed. Built-in Self-Test (BIST) is an ultimate design-for-testability technique in which extra circuitry is introduced on-chip in order to provide test patterns to the original circuit and verify its output responses. The aim is to provide a faster and more economical alternative to external testing. The difficulty in the BIST approach is the discovery of schemes that have very low hardware overhead and provide the required test quality, so that their inclusion on-chip is justified.

Online BIST

A special form of BIST is the design of self-checking circuits in which no explicit test patterns are provided, but the operation of the circuit is tested online by identifying any invalid output responses (i.e., responses that can never occur under fault-free operation). If, however, there is a fault that can cause a valid response to be changed into another valid response, then that fault cannot be detected. The identification of faulty behavior is done by a special built-in circuit called a checker. For example, in a k-to-2^k decoder, a checker can check if exactly one of the 2^k output lines has a value 1 each time. If the number of 1s in the output pattern is zero or more than one, then an error is detected. If, however, a fault in the decoder causes an input pattern to assert only one output line but not the correct one, then the fault cannot be detected by such a checker. In general, the design of self-checking circuits is based on coding theory. The checker has to encode all output responses of the circuit under fault-free operation in order to distinguish between valid and invalid responses. For example, using the single-bit parity code, a checker can compute the parity of the actual response of the circuit for the current input, compute also the parity of the (known) correct output response corresponding to that input, and compare the two parities. Faults in the checker can defeat the purpose of fault detection in the original circuit. However, the assumption is that the logic of the checker is much simpler than the circuit it checks and therefore can be tested far more easily. Research on the design of self-checking checkers seeks to minimize the logic that is not self-testable.
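
The decoder example above can be modeled in a few lines. This is only a behavioral sketch under the assumption that outputs are lists of 0/1 values; the function names and the faulty output vectors are hypothetical.

```python
def decoder(k, value):
    """Fault-free k-to-2^k decoder: exactly one output line is 1."""
    return [1 if line == value else 0 for line in range(2 ** k)]


def checker_flags_error(outputs):
    """One-hot checker: flag an error unless exactly one output is 1."""
    return sum(outputs) != 1


good = decoder(2, 1)          # fault-free response for input 1
no_line = [0, 0, 0, 0]        # faulty: no output asserted
two_lines = [0, 1, 1, 0]      # faulty: two outputs asserted
wrong_line = decoder(2, 3)    # faulty for input 1, yet still a valid response

for name, out in [("good", good), ("no_line", no_line),
                  ("two_lines", two_lines), ("wrong_line", wrong_line)]:
    print(name, checker_flags_error(out))
```

The last case shows the limitation stated in the text: a fault that turns one valid response into another valid response passes the checker unnoticed.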

Offline BIST

In a general offline BIST scheme, test pattern generation and application, as well as output response verification, are done by built-in mechanisms while the circuit operates in a test mode.


FIGURE 67.7 LFSR configurations.

Built-in TPG Mechanisms

Mechanisms that have been considered for built-in test pattern generation and application include read-only memories, counters, cellular automata, and linear feedback shift registers (LFSRs). Of these mechanisms, LFSRs offer the most flexibility and have received the most attention. A linear feedback shift register (LFSR) consists of a series of flip-flops connected in a circular structure by means of exclusive-OR (XOR) gates. The two basic types of LFSR are shown in Fig. 67.7(a) and Fig. 67.7(b). The structure in Fig. 67.7(a) uses the XOR gates externally, while the structure in Fig. 67.7(b) uses the XOR gates internally. The connections of the flip-flops to the XOR gates are fixed for a basic n-bit LFSR and are specified by the values c_i, 1 ≤ i ≤ n, where c_i = 1 denotes a connection and c_i = 0 denotes no connection. The specific pattern of c_i values is conveniently represented as a polynomial $P(x) = 1 + \sum_{i=1}^{n} c_i x^i$ over the field of elements mod 2 and is referred to as the characteristic polynomial of the LFSR. (The representation can also be done by the polynomial $P_r(x) = x^n + \sum_{i=0}^{n-1} c_{n-i} x^i$, which is referred to as the reciprocal polynomial of P(x).) Given an initial state, an LFSR cycles through a sequence of states as determined by its characteristic polynomial. For particular characteristic polynomials known as primitive polynomials, the corresponding sequence of states has the maximum possible length (that is, 2^n – 1, since the all-0 state will cause the LFSR to cycle through it continuously). A primitive polynomial of degree n has the property that the smallest value k such that x^k mod P(x) = 1 is k = 2^n – 1. Primitive polynomials exist for every degree and a list of them can be found in Ref. 7. An example of a specific LFSR with characteristic polynomial P(x) = x^4 + x + 1, along with the sequence of the resulting states, is given in Fig. 67.8(a) for the external-XOR type and in Fig. 67.8(b) for the internal-XOR type.

Although the properties of interest to most BIST applications are the same for the two LFSR types, an external-XOR type LFSR may be slower due to the multiple-level XOR logic. (Notice also that the state of the external-XOR type LFSR at cycle i (starting from i = 0) is exactly the pattern x^i mod P(x).) There are three basic schemes for the design of a built-in test pattern generator: (1) deterministic, (2) pseudorandom, and (3) pseudo-exhaustive.


FIGURE 67.8 LFSRs with (a) characteristic polynomial P(x) = x^4 + x + 1 and (b) resulting sequences.
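
The cycling behavior of an LFSR can be explored with a small simulation. The sketch below assumes one particular external-XOR convention (cells numbered 1 to n, the XOR of all cells i with c_i = 1 fed back into cell 1, contents shifted toward cell n); this convention and the function name are assumptions and may differ in detail from the drawing in Fig. 67.8.

```python
def lfsr_cycle(taps, seed):
    """Return the states visited by an external-XOR LFSR until the seed recurs.

    taps: 1-based cell indices i with c_i = 1 (assumed convention).
    seed: bits of cells 1..n.
    """
    seed = tuple(seed)
    state = seed
    visited = []
    # At most 2^n states exist, so the loop is bounded even for odd tap choices.
    for _ in range(2 ** len(seed)):
        visited.append(state)
        feedback = 0
        for i in taps:
            feedback ^= state[i - 1]
        state = (feedback,) + state[:-1]   # shift; feedback enters cell 1
        if state == seed:
            break
    return visited


# P(x) = x^4 + x + 1, i.e., c_1 = c_4 = 1 (primitive).
states = lfsr_cycle(taps=(1, 4), seed=(0, 0, 0, 1))
print(len(states))
for s in states:
    print("".join(map(str, s)))

# P(x) = x^4 + x^3 + x^2 + x + 1 is not primitive, so the cycle is shorter.
print(len(lfsr_cycle(taps=(1, 2, 3, 4), seed=(0, 0, 0, 1))))
```

With the primitive polynomial the simulation visits every nonzero 4-bit state once before returning to the seed, while the all-0 seed stays at all-0 forever.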

In deterministic TPG, a set of patterns for a list of target faults obtained by a TPG algorithm (after any postprocessing, like compaction) is “embedded” in a TPG mechanism. The obvious solution is to use a read-only memory (ROM) for this purpose, but this is applicable only for very small test sets. An alternative simple solution is to use a binary counter or an LFSR of length w (where w is the test pattern length) that starts from an initial state s_i and cycles through until it reaches another state s_j so that all the desired patterns appear somewhere between states s_i and s_j, with each intermediate state constituting a required or not required pattern. The problem here is to find (if one exists) a pair of states s_i, s_j in the sequence produced by the underlying mechanism such that the absolute distance between s_i and s_j is acceptably smaller than 2^w, in order to keep the number of testing cycles acceptably low.

In pseudorandom built-in TPG, an LFSR is typically used as a pseudorandom generator, which cycles through a subsequence of l states, each state constituting a pseudorandom pattern, where l is again acceptably low. Such a sequence is analyzed by fault simulation in order to determine its fault coverage (defined as the ratio of the number of faults that the patterns in the sequence detect over the number of all detectable faults of interest). In general, very long subsequences are needed to achieve an acceptable level of fault coverage. An enhancement of this idea is to use weighted random LFSRs. These include extra logic in order to change the bit probabilities in the states that the LFSR generates. For example, by having bit i of each test pattern be the output of an AND gate driven by two LFSR bits, the probability of having a ‘1’ in bit i is the product of the probabilities of having a ‘1’ in those LFSR bits.
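
The effect of the AND gate on the bit probability can be checked by counting over one full LFSR period. The sketch below reuses an assumed external-XOR convention (feedback into cell 1) with P(x) = x^4 + x + 1; the names are hypothetical. Note that the product rule is exact only for statistically independent bits, so for a short LFSR the counts are close to, but not exactly, the product.

```python
def lfsr_cycle(taps, seed):
    """States of an external-XOR LFSR over one cycle (assumed convention)."""
    seed = tuple(seed)
    state = seed
    visited = []
    for _ in range(2 ** len(seed)):
        visited.append(state)
        feedback = 0
        for i in taps:
            feedback ^= state[i - 1]
        state = (feedback,) + state[:-1]
        if state == seed:
            break
    return visited


states = lfsr_cycle(taps=(1, 4), seed=(0, 0, 0, 1))   # P(x) = x^4 + x + 1
period = len(states)

ones_cell1 = sum(s[0] for s in states)           # how often cell 1 holds a 1
ones_cell2 = sum(s[1] for s in states)           # how often cell 2 holds a 1
ones_and = sum(s[0] & s[1] for s in states)      # how often the AND output is 1

print(ones_cell1 / period, ones_cell2 / period)  # each close to 1/2
print(ones_and / period)                         # close to 1/4
```

Feeding a circuit input from such an AND gate therefore biases that input toward 0, which helps when a fault needs mostly 0s (or, with an OR gate, mostly 1s) on some inputs.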
In pseudo-exhaustive built-in TPG, the goal is to reduce the testing of the circuit to the testing of appropriate subcircuits of it such that each subcircuit depends on a small number of primary inputs, then apply all possible patterns to each of these subcircuits. The benefits of an exhaustive test set are that no test pattern generation or fault simulation is needed and that the generated patterns guarantee that all detectable faults that do not induce sequential behavior are detected. In order for pseudo-exhaustive TPG to achieve the benefits of exhaustive testing without taking prohibitive time, particular relations must hold between the primary outputs (POs) and the primary inputs (PIs) on which they depend. If such relations do not hold, they may be imposed upon the circuit through design-for-testability techniques. In general, there are many pseudo-exhaustive test sets that can be obtained for a given circuit. The goal in pseudo-exhaustive built-in TPG is to find and embed a pseudo-exhaustive test set that offers the best tradeoff in hardware implementation cost and testing time.

As a simple example of how a pseudo-exhaustive test set can be obtained, consider a circuit with n inputs and one output fed by a two-input gate whose inputs are driven in turn by two disjoint subcircuits. Then, that output can be tested pseudo-exhaustively by 2^(n_1) + 2^(n_2) + 1 patterns instead of 2^n, where n_1 and n_2 are the numbers of the (disjoint) primary inputs that drive the two subcircuits. The first 2^(n_1) of these patterns contain a constant subpattern (consisting of n_2 bits) required to sensitize the paths from the first subcircuit to the output; the next 2^(n_2) of these patterns contain a constant subpattern (consisting of n_1 bits) required to sensitize the paths from the second subcircuit to the output; and the last pattern is required to provide both inputs of the gate with the controlling value of the gate. This pseudo-exhaustive test set could be generated on-chip by using, for instance, a counter and some extra storage for the constant subpatterns, but such pseudo-exhaustive test sets can be impractical to implement in large circuits.

Obtaining suitable pseudo-exhaustive test sets for built-in implementation is based on the consideration of the subsets of PIs on which each PO depends. Let us call such a set a D-set. The size of every D-set must be smaller than the number n of PIs; otherwise, pseudo-exhaustive testing is not applicable. A general preprocessing step for pseudo-exhaustive TPG is to identify groups of PIs that never appear together in a D-set. All PIs in such a group can share the same test signal for the pseudo-exhaustive testing.
In this way, the number of test signals is reduced from n to n′, with an immediate reduction of the test time from 2^n to 2^(n′). Minimizing the value of n′ is an NP-hard problem, but efficient heuristics exist to reduce it in practice.

Pseudo-exhaustive test sets can be obtained by considering only the size k < n of the maximum D-set in a circuit and ignoring the structure of the D-sets as well as their number (i.e., such pseudo-exhaustive test sets are good for any n-input circuit with no output being dependent on more than k inputs). For example, it has been shown (Ref. 8) that a test set that comprises all binary patterns containing w_1 ‘1’s, all binary patterns containing w_2 ‘1’s, etc., up to w_i ‘1’s, where w_1, w_2, …, w_i are all the solutions of the equation w ≡ c (mod n – k + 1), for some constant c ≤ n – k, constitutes a pseudo-exhaustive test set. For instance, if n = 6 and k = 3, the set of all patterns with 0 or 4 ‘1’s (corresponding to c = 0), the set of all patterns with 1 or 5 ‘1’s (corresponding to c = 1), the set of all patterns with 2 or 6 ‘1’s (corresponding to c = 2), and the set of all patterns with 3 ‘1’s (corresponding to c = 3) each constitute a pseudo-exhaustive test set that can be applied to any circuit with n inputs and maximum D-set size k. The structure of one of these sets (corresponding to c = 2) is given in Fig. 67.9. The generation of such a set of patterns can be done by using constant-weight counters, which produce a sequence of states with the same constant number of ‘1’s in each. The disadvantages of this approach are the size of the test set which, although not 2^n, is still large ($\approx 2^n / (n - k + 1)$), and the hardware overhead required for the implementation of a constant-weight counter. Better solutions may be obtained by considering the particular structure of each D-set. A very important mechanism in this regard is the Extended LFSR.
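
The constant-weight test sets for n = 6 and k = 3 described above can be checked by brute force. The sketch below is only a verification aid with hypothetical function names: it builds each set from the congruence on the number of 1s and confirms that every choice of k input positions receives all 2^k combinations.

```python
from itertools import combinations, product


def constant_weight_set(n, k, c):
    """All n-bit patterns whose number of 1s is congruent to c mod (n - k + 1)."""
    m = n - k + 1
    return [p for p in product((0, 1), repeat=n) if sum(p) % m == c]


def is_pseudo_exhaustive(patterns, n, k):
    """True if every choice of k input positions sees all 2^k combinations."""
    for positions in combinations(range(n), k):
        seen = {tuple(p[i] for i in positions) for p in patterns}
        if len(seen) < 2 ** k:
            return False
    return True


for c in range(4):
    tests = constant_weight_set(6, 3, c)
    print(c, len(tests), is_pseudo_exhaustive(tests, 6, 3))
```

The set sizes stay well below the 64 patterns of an exhaustive test, in line with the 2^n / (n – k + 1) estimate.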
An Extended LFSR (also known as LFSR/SR) is a shift register (SR) of n cells whose initial k cells are configured into an LFSR with a characteristic polynomial of degree k. Let P(x) be that characteristic polynomial. It has been shown (see, e.g., Ref. 9) that the successive states of such an LFSR/SR test exhaustively a D-set D = {d_1, d_2, …, d_s}, s = |D| (the d_i elements denote the indices of the cells that drive the circuit inputs), if and only if the vectors x^(d_1) mod P(x), x^(d_2) mod P(x), …, x^(d_s) mod P(x) are linearly independent. If this relation holds for every D-set, then the corresponding test sequence tests the circuit pseudo-exhaustively in time 2^k (after the initialization of the LFSR and SR parts of the LFSR/SR). As an example, consider the D-sets D_1 = {1, 2, 3, 4}, D_2 = {2, 3, 5}, D_3 = {3, 5, 6}. All these D-sets satisfy the above relation under the primitive polynomial P(x) = x^4 + x + 1 (see Fig. 67.10(a)). However, if a D-set D_4 = {1, 2, 5} were also present, that D-set could no longer be tested pseudo-exhaustively, as its corresponding vectors are linearly dependent (see Fig. 67.10(b)).


Obtaining an LFSR/SR under which the independency relation holds for every D-set of the circuit involves basically a search for an applicable polynomial of degree d, k ≤ d ≤ n, among all primitive polynomials of degree d, k ≤ d ≤ n. Primitive polynomials of any degree can be algorithmically generated. An applicable polynomial of degree n is, of course, bound to exist (this corresponds to exhaustive testing), but in order to keep the number of test cycles low, the degree should be minimized.

FIGURE 67.9 A pseudo-exhaustive test set for any circuit with six inputs and largest D-set size 3.

FIGURE 67.10 Linear independence under P(x) = x^4 + x + 1: (a) D-sets that satisfy the condition; (b) a D-set that does not satisfy the condition.

Built-in Output Response Verification Mechanisms

Verification of the output responses of a circuit under a set of test patterns consists, in principle, of comparing each resulting output value against the correct one, which has been precomputed and prestored for each test pattern. However, for built-in output response verification, such an approach cannot be used (at least for large test sets) because of the associated storage overhead. Rather, practical built-in output response verification mechanisms rely on some form of compression of the output responses so that only the final compressed form needs to be compared against the (precomputed and prestored) compressed form of the correct output response. Some representative built-in output response verification mechanisms based on compression are given below.

1. Ones count: In this scheme, the number of times that each output of the circuit is set to ‘1’ by the applied test patterns is counted by a binary counter, and the final count is compared against the corresponding count in the fault-free circuit.

2. Transition count: In this scheme, the number of transitions (i.e., changes both from 0 → 1 and from 1 → 0) that each output of the circuit goes through when the test set is applied is counted by a binary counter, and the final count is compared against the corresponding count in the fault-free circuit. (These counts must be computed under the same ordering of the test patterns.)

3. Signature analysis: In this scheme, the specific bit sequence of responses of each output is represented as a polynomial R(x) = r_0 + r_1 x + r_2 x^2 + … + r_(s–1) x^(s–1), where r_i is the value that the output takes under pattern t_i, 0 ≤ i ≤ s – 1, and s is the total number of patterns. Then, this polynomial is divided by a selected polynomial G(x) = g_0 + g_1 x + g_2 x^2 + … + g_m x^m of degree m for some desired value m, and the remainder of this division (referred to as the signature) is compared against the remainder of the division by G(x) of the corresponding fault-free response C(x) = c_0 + c_1 x + c_2 x^2 + … + c_(s–1) x^(s–1). Such a division is done efficiently in hardware by an LFSR structure such as that in Fig. 67.11(a). In practice, the responses of all outputs are handled together by an extension of the division circuit, known as a multiple-input signature register (MISR). The general form of an MISR is shown in Fig. 67.11(b).

FIGURE 67.11 (a) Structure for division by x^4 + x + 1; (b) general structure of an MISR.

In all compression techniques, it is possible for the compressed forms of a faulty response and the correct one to be the same. This is known as aliasing or fault masking. For example, the effect of aliasing in ones count output response verification is that faults that cause the overall number of ‘1’s in each output to be the same as in the fault-free circuit are not going to be detected after compression, although the appropriate test patterns for their detection have been applied. In general, signature analysis offers a very small probability of aliasing. This is due to the fact that an erroneous response R(x) = C(x) + E(x), where E(x) represents the error pattern (and addition is done mod 2), will produce the same signature as the correct response C(x) if and only if E(x) is a multiple of the selected polynomial G(x).

BIST Architectures

BIST strategies for systems composed of combinational logic blocks and registers generally rely on partial modifications of the register structure of the system in order to economize on the cost of the required mechanisms for TPG and output response verification. For example, in the BILBO (Built-In Logic Block Observer) scheme (Ref. 10), each register that provides input to a combinational block and receives the output of another combinational block is transformed into a multipurpose structure that can act as an LFSR (for test pattern generation), as an MISR (for output response verification), as a shift register (for scan chain configurations), and also as a normal register. An implementation of the BILBO structure for a 4-bit register is shown in Fig. 67.12. In this example, the characteristic polynomial for the LFSR and MISR is P(x) = x^4 + x + 1. By setting B1B2B3 = 001, the structure acts like an LFSR. By setting B1B2B3 = 101, the structure acts like an MISR. By setting B1B2B3 = 000, the structure acts like a shift register (with serial input SI and serial output SO). By setting B1B2B3 = 11x, the structure acts like a normal register, and by setting B1B2B3 = 01x, the register can be cleared.

As two more representatives of system BIST architectures, we mention here the STUMPS scheme (Ref. 11), where each combinational block is interfaced to a scan path and each scan path is fed by one cell of the same LFSR and feeds one cell of the same MISR, and the LOCST scheme (Ref. 12), where there is a single boundary scan chain for inputs and a single boundary scan chain for outputs, with an initial portion of the input chain configured as an LFSR and a final portion of the output chain configured as an MISR.

FIGURE 67.12 BILBO structure for a 4-bit register.
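
The aliasing condition stated for signature analysis can be demonstrated in software. The sketch below models the division by G(x) with integer bit masks (bit i holds the coefficient of x^i) rather than with the LFSR hardware of Fig. 67.11(a); the function names and the sample response sequence are hypothetical.

```python
def gf2_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2); bit i = coefficient of x^i."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)   # cancel the leading term of a(x)
    return a


def signature(bits, g):
    """Signature of a response sequence r_0, r_1, ..., r_(s-1)."""
    r = sum(bit << i for i, bit in enumerate(bits))
    return gf2_mod(r, g)


G = 0b10011                                   # G(x) = x^4 + x + 1
correct = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]      # hypothetical fault-free response C(x)

single_error = list(correct)
single_error[3] ^= 1                          # E(x) = x^3, not a multiple of G(x)

aliased = list(correct)
for i in (2, 3, 6):                           # E(x) = x^2 * G(x) = x^6 + x^3 + x^2
    aliased[i] ^= 1

print(signature(correct, G) == signature(single_error, G))   # error is detected
print(signature(correct, G) == signature(aliased, G))        # error is masked
```

Because the remainder operation is linear, the signatures of R(x) and C(x) differ exactly by the remainder of E(x), which is zero only when G(x) divides E(x).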

References

1. J.P. Roth, W.G. Bouricius, and P.R. Schneider, Programmed algorithms to compute tests to detect and distinguish between failures in logic circuits, IEEE Trans. Electronic Computers, 16, 567, 1967.
2. P. Goel, An implicit enumeration algorithm to generate tests for combinational logic circuits, IEEE Trans. Computers, 30, 215, 1981.
3. M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Co., New York, 1979.
4. H. Fujiwara and T. Shimono, On the acceleration of test generation algorithms, IEEE Trans. Computers, 32, 1137, 1983.
5. M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, Computer Science Press, New York, 1990.
6. R.A. Marlett, EBT: A comprehensive test generation technique for highly sequential circuits, Proc. 15th Design Automation Conf., 335, 1978.
7. W.W. Peterson and E.J. Weldon, Jr., Error-Correcting Codes, MIT Press, Cambridge, MA, 1972.
8. D.T. Tang and L.S. Woo, Exhaustive test pattern generation with constant weight vectors, IEEE Trans. Computers, 32, 1145, 1983.
9. Z. Barzilai, D. Coppersmith, and A.L. Rosenberg, Exhaustive generation of bit patterns with applications to VLSI testing, IEEE Trans. Computers, 32, 190, 1983.
10. B. Koenemann, J. Mucha, and G. Zwiehoff, Built-in test for complex digital integrated circuits, IEEE J. Solid State Circuits, 15, 315, 1980.
11. P.H. Bardell and W.H. McAnney, Parallel pseudorandom sequences for built-in test, in Proc. Intern’l. Test Conf., 302, 1984.
12. J. LeBlanc, LOCST: A built-in self-test technique, IEEE Design and Test of Computers, 1, 42, 1984.


Tragoudas, S. "CAD Tools for BIST/DFT and Delay Faults" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


68 CAD Tools for BIST/DFT and Delay Faults

Spyros Tragoudas
Southern Illinois University

68.1 Introduction
68.2 CAD for Stuck-at Faults
Synthesis of BIST Schemes for Combinational Logic • DFT and BIST for Sequential Logic • Fault Simulation
68.3 CAD for Path Delays
CAD Tools for TPG • Fault Simulation and Estimation

68.1 Introduction

This chapter describes computer-aided design (CAD) tools and methodologies for improved design for testability (DFT), built-in self-test (BIST) mechanisms, and fault simulation. Section 68.2 presents CAD tools for the traditional stuck-at fault model, which was examined in Chapters 66 and 67. Section 68.3 describes a fault model suitable for delay faults: the path delay fault model. The number of path delay faults in a circuit may not be bounded by a polynomial in the circuit size. Thus, this fault model requires sophisticated CAD tools not only for BIST and DFT, but also for ATPG and fault simulation.

68.2 CAD for Stuck-at Faults

In the traditional stuck-at model, each line in the circuit is associated with at most two faults, a stuck-at 0 and a stuck-at 1 fault. We distinguish between combinational and sequential circuits. In the former case, computer-aided design (CAD) tools target efficient synthesis of BIST schemes. The testing of sequential circuits is a far more difficult problem and must be assisted by DFT techniques. The most popular DFT approach is scan design. The following subsections present CAD tools for combinational logic and sequential logic, and then a review of advances in fault simulation.

Synthesis of BIST Schemes for Combinational Logic

The Pseudo-exhaustive Approach

In the pseudo-exhaustive approach, patterns are generated pseudorandomly and target all possible faults. A common circuit preprocessing routine for CAD tools is called circuit segmentation. The idea in circuit segmentation is to insert a small number of storage elements in the circuit. These elements are bypassed in operation mode (that is, they function as wires), but in testing mode they are part of the BIST mechanism. Due to their dual functionality, they are called bypass storage elements (bses). The hardware overhead of a bse amounts to that of a flip-flop and a two-to-one multiplexer. Each bse is a controllable as well as an observable point, and must be inserted so that every observable point (primary output or bse) depends on at most k controllable points (primary inputs or bses), where k is an input parameter not larger than 25. This way, no more than 2^k patterns are needed to pseudo-exhaustively test the circuit.

The circuit segmentation problem is modeled as a combinatorial minimization problem. The objective is to minimize the number of inserted bses so that each observable point depends on at most k controllable points. The problem is NP-hard in general (Ref. 1). However, efficient CAD tools have been proposed (Refs. 2–4). In Ref. 2, the bse insertion tool minimizes the hardware overhead using a greedy methodology. The CAD tool in Ref. 3 uses iterative improvement, and the one in Ref. 4 uses the concept of articulation points.

When the test pattern generator (TPG) is an LFSR/SR with a characteristic polynomial P(x) with period P, P ≥ 2^k – 1, bse insertion must be guided by a sophisticated CAD tool which guarantees that the P different patterns that are generated by the LFSR/SR suffice to test the circuit pseudo-exhaustively. This in turn implies that each observable point which depends on at most k controllable points must receive 2^k – 1 patterns. (The all-zero input pattern is excluded because it cannot be generated by the LFSR/SR.) The example below illustrates the problem.

FIGURE 68.1 An observable point that depends on four controllable points.

Example 1

Consider the LFSR/SR of Fig. 68.1, which has seven cells. In this case, the total number of primary inputs and inserted bses is seven. Consider a consecutive labeling of the LFSR/SR cells in the range [1…7], where the left-most element takes label 1. Assume that an observable point o in the circuit depends on elements 1, 2, 3, and 5 of the LFSR/SR. In this case, k ≥ 4, and the input dependency of o is represented by the set I_o = {1, 2, 3, 5}. Let the characteristic polynomial of the LFSR/SR be P(x) = x^4 + x + 1. This is a primitive polynomial and its period P is P = 2^4 – 1 = 15. We list in Table 68.1 the patterns generated by P(x) when the initial seed is 00010. Any seed besides 00000 will return 2^4 – 1 different patterns.

TABLE 68.1 Patterns generated by P(x) from seed 00010
0 0 0 1 0
1 0 0 0 1
1 1 0 0 0
1 1 1 0 0
1 1 1 1 0
0 1 1 1 1
1 0 1 1 1
0 1 0 1 1
1 0 1 0 1
1 1 0 1 0
0 1 1 0 1
0 0 1 1 0
1 0 0 1 1
0 1 0 0 1
0 0 1 0 0

Although 15 different patterns have been generated, the observable point o will receive the set of subpatterns projected by columns 1, 2, 3, and 5 of the above matrix. In particular, o will receive the patterns in Table 68.2.

TABLE 68.2 Patterns received by o (columns 1, 2, 3, and 5 of Table 68.1)
0 0 0 0
1 0 0 1
1 1 0 0
1 1 1 0
1 1 1 0
0 1 1 1
1 0 1 1
0 1 0 1
1 0 1 1
1 1 0 0
0 1 1 1
0 0 1 0
1 0 0 1
0 1 0 1
0 0 1 0

Although 15 different patterns have been generated by P(x), point o receives only eight different patterns. This happens because there exists at least one linear combination in the set {x^1, x^2, x^3, x^5}, the set of monomials of o, which is divisible by P(x). In particular, the linear combination x^5 + x^2 + x is divisible by P(x), since it equals x · P(x). If no linear combination is divisible by P(x), then o will receive as many different patterns as the period of the characteristic polynomial P(x).

For each linear combination in some set I_o which is divisible by the characteristic polynomial P(x), we say that a linear dependency occurs. Avoiding linear dependencies in the I_o sets is a fundamental problem in pseudo-exhaustive built-in TPG. The following describes CAD tools for avoiding linear dependencies.

The approach in Ref. 3 proposes that the elements of the LFSR/SR (inserted bses plus primary inputs) are assigned appropriate labels in the LFSR/SR. It has been shown that no linear combination in some I_o is divisible by P(x) if the largest label in I_o and the smallest label in I_o differ by less than k units (Ref. 3). We call this property the k-distance property in set I_o. Reference 3 presents a coordinated scheme that segments the circuit with bse insertion, and labels all the LFSR/SR cells so that the k-distance property is satisfied for each set I_o.

It is an NP-hard problem to minimize the number of inserted bses subject to the above constraints. This problem contains as a special case the traditional circuit segmentation problem. Furthermore, Ref. 3 shows that it is NP-complete to decide whether an appropriate LFSR/SR cell labeling exists so that the k-distance property is satisfied for each set I_o without considering the circuit segmentation problem, that is, after bse elements have been inserted so that for each set I_o it holds that |I_o| ≤ k. However, Ref. 3 presents an efficient heuristic for the k-distance property problem. It is reduced to the bandwidth minimization problem on graphs, for which many efficient polynomial time heuristics have been proposed.

The outline of the CAD tool in Ref. 3 is as follows. Initially, bse elements are inserted so that for each set I_o, we have that |I_o| ≤ k. Then, a bandwidth-based heuristic determines whether all sets I_o could satisfy the k-distance property. For each I_o that violates the k-distance property, a modification is proposed by recursively applying a greedy bse insertion scheme, which is illustrated in Fig. 68.2. The primary inputs (or inserted bses) are labeled in the range [1…6], as shown in Fig. 68.2. Assume that the characteristic polynomial is P(x) = x^4 + x + 1, i.e., k = 4. Under the given labeling, sets I_e and I_d satisfy the k-distance property but set I_g violates it.
In this case, the tool finds the closest front of predecessors of g that violate the k-distance property. This is node f. New bses are inserted on the incoming edges of f. (The tool may attempt to insert bses on a subset of the incoming edges.) These bses are assigned labels 7, 8. In addition, 4 is relabeled to 6, and 6 to 4. This way, Ig satisfies the k-distance requirement. The CAD tool can also be executed so that, instead of examining the k-distance, it examines whether each set Io has at least one linear dependency. In this case, it finds the closest front of predecessors that contain some linear dependency, and inserts bse elements on their incoming edges. This approach increases the time performance without significant savings in the hardware overhead.

The reason that primitive polynomials are traditionally selected as characteristic polynomials of LFSR/SRs is that they have a large period P. However, any polynomial could serve as a characteristic polynomial of the LFSR/SR as long as its period P is no less than $2^k - 1$. If P is less than $2^k - 1$, then no set Io with |Io| = k can be tested pseudo-exhaustively. A desirable characteristic polynomial would be one that has a large period P and whose multiples obey a given pattern, which we could try to avoid when relabeling the cells of the LFSR/SR so that appropriate Io sets are formed. This is the idea of the CAD tool in Ref. 5.
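The loss of patterns under projection that was discussed with Tables 68.1 and 68.2 can be reproduced with a short sketch. The feedback rule used here (new leftmost cell = cell 1 XOR cell 4, followed by a right shift) is an assumption inferred from the rows of Table 68.1, and the function name is hypothetical.

```python
def lfsr_sr_patterns(seed, steps):
    """Return `steps` successive states of a 5-cell LFSR/SR.

    Assumption (inferred from Table 68.1): the register shifts right and
    the new leftmost cell is cell 1 XOR cell 4.
    """
    state = list(seed)
    patterns = []
    for _ in range(steps):
        patterns.append(tuple(state))
        new_bit = state[0] ^ state[3]
        state = [new_bit] + state[:-1]
    return patterns


patterns = lfsr_sr_patterns([0, 0, 0, 1, 0], 15)

# Observable point o sees only cells 1, 2, 3 and 5 (0-based 0, 1, 2, 4).
observed = [(p[0], p[1], p[2], p[4]) for p in patterns]

distinct_full = len(set(patterns))
distinct_observed = len(set(observed))

# The linear dependency: cells 1, 2 and 5 sum to zero modulo 2 in every pattern.
dependency_holds = all(p[0] ^ p[1] ^ p[4] == 0 for p in patterns)

print(distinct_full, distinct_observed, dependency_holds)
```

Because cell 5 is always the XOR of cells 1 and 2, the four observed columns carry only three independent bits, which is why at most eight subpatterns can appear at o.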

FIGURE 68.2

Enforcing the k-distance property with bse insertion.


In particular, Ref. 5 proposes that the characteristic polynomial is a product P(x) = P1(x) · P2(x) of two polynomials. P1(x) is a primitive polynomial of degree k, which guarantees that the period of the characteristic polynomial P(x) is at least $2^k - 1$. P2(x) is the polynomial $x^d + x^{d-1} + x^{d-2} + \dots + x^1 + x^0$, whose degree d is determined by the CAD tool. P2(x) is called a consecutive polynomial of degree d. The CAD tool also determines which primitive polynomial of degree k will be implemented in P(x).

The multiples of consecutive polynomials have a given structure. Consider an Io = {i1, i2, …, ik} and a subset I′o = {i′1, i′2, …, i′k′} ⊆ Io. Ref. 5 shows that whether I′o forms a linear combination divisible by P2(x) can be decided from the parities of the remainders of the labels i′j ∈ I′o modulo d. In more detail, the algorithm groups all i′j whose remainder modulo d is x under list Lx, and then checks the parity (even or odd number of elements) of each list Lx. There are d lists, labeled L0 through Ld–1. If all list parities agree (all even or all odd), I′o forms a linear dependency. If not all list parities agree, then there is no linear combination in I′o. (If a list Lx is empty, it has even parity.) The example below illustrates the approach.

Example 2
Let Io = {27, 16, 5, 3, 1} and P2(x) = $x^4 + x^3 + x^2 + x + 1$. Lists L3, L2, L1, and L0 are constructed, and their parities are examined. Set Io contains linear dependencies because in subset I′o = {27, 3}, there are even parities in all lists. In particular, list L3 has two elements and all the remaining lists are empty. However, there are no linear dependencies in the subset I′o = {16, 3, 1}. In this case, L0, L1, and L3 have exactly one element each, and L2 is empty. Therefore, there is no subset of I′o where all Li, 0 ≤ i ≤ 3, have the same parity.

The performance of the approach in Ref. 5 is affected by the relative order of the LFSR/SR cells.
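The list-parity test of Example 2 can be sketched as follows. This is only an illustration, not the tool of Ref. 5: the function names are hypothetical, and the modulus is passed as a parameter, with the value 4 chosen to match the four lists L0 through L3 of the example.

```python
from itertools import combinations


def list_parities(labels, modulus):
    """Parity (0 = even, 1 = odd) of each list L_0 .. L_{modulus-1}."""
    parities = [0] * modulus
    for label in labels:
        parities[label % modulus] ^= 1
    return parities


def is_dependency(labels, modulus):
    """True when all lists have the same parity (all even or all odd)."""
    return len(set(list_parities(labels, modulus))) == 1


def dependent_subsets(labels, modulus):
    """Enumerate every non-empty subset that passes the parity test."""
    found = []
    for size in range(1, len(labels) + 1):
        for subset in combinations(labels, size):
            if is_dependency(subset, modulus):
                found.append(subset)
    return found


io_set = (27, 16, 5, 3, 1)
print(is_dependency((27, 3), 4))          # both labels fall in L3
print(dependent_subsets((16, 3, 1), 4))   # no subset has agreeing parities
print(dependent_subsets(io_set, 4))
```

The exhaustive enumeration is exponential in |Io| and is used here only to check the example by hand; it is not a substitute for the labeling heuristics discussed in the text.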
Given a consecutive polynomial of degree d, one LFSR/SR cell labeling may give linear dependencies in some Io whereas an appropriate relabeling may guarantee that no linear dependencies occur in any set Io. Ref. 5 shows that it is an NP-complete problem to determine whether a relabeling exists so that no linear dependencies occur in any set Io. The idea of Ref. 5 is to label the LFSR/SR cells so that a small fraction of linear dependencies exist in each set Io. In particular, for each set Io, the approach returns a large subset I′o with no linear dependencies with respect to polynomial P2(x). This is promising for pseudorandom built-in TPG. The objective is relaxed so that each set Io receives many different test patterns. Experimentation in Ref. 5 shows that the smaller the fraction of linear dependencies in a set, the larger the fraction of different patterns it will receive. Also observe that many linear dependencies can be filtered out by the primitive polynomial P1(x).

A final approach for avoiding linear dependencies was proposed in Ref. 4. The idea is also to find a maximal subset I′o of each Io where no linear dependencies occur. The maximality of I′o is defined with respect to linear dependencies, that is, I′o cannot be further expanded by adding another label a without introducing some linear dependencies. It is then proposed that cell a receives another label a′ (as small as possible) which guarantees that there are no linear dependencies in I′o ∪ {a′}. This may cause many “dummy” cells in the LFSR/SR (i.e., labels that do not belong to any Io). Such dummy cells are subsequently removed by inserting XOR gates.

The Deterministic Approach
In this section we discuss BIST schemes for deterministic test pattern generation, where the generated patterns target a given list of faults. An initial set T of test patterns is traditionally part of the input instance. Set T has been generated by an ATPG tool and detects all the random pattern resistant faults in the circuit.
The goal in deterministic BIST is to consult T and, within a short period of time, generate patterns on-chip which detect all random pattern resistant faults. The BIST scheme may reproduce a subset of the patterns in T as well as patterns not in T. If all the patterns of T are to be reproduced on-chip, then the mechanism is also called a test set embedding scheme. (In this case, only the patterns of T need to be reproduced on-chip.) The objective in test set embedding schemes is well defined, but the reproduction time or the hardware overhead may be less when we do not insist that all the patterns of T are reproduced on-chip.


FIGURE 68.3

The schematic of a weighted random LFSR.

A very popular method for deterministic on-chip TPG is to use weighted random LFSRs. A weighted random LFSR consists of a simple LFSR/SR and a tree of XOR gates, which is inserted between the cells of the LFSR/SR and the inputs of the circuit under test, as Fig. 68.3 indicates. The tree of XOR gates guarantees that the test patterns applied to the circuit inputs are weighted with appropriate signal probabilities (probability of logic “1”). The idea is to weigh random test patterns with non-uniform probability distributions in order to improve detectability of random pattern resistant faults. The test patterns in T assist in assigning weights. The signal probability of an input is also referred to as the weight associated with that input. The collection of weights on all inputs of a circuit is called a weight set. Once a weight set has been calculated, the XOR tree of the weighted LFSR is constructed.

Many weighted random LFSR synthesis schemes have been proposed in the literature. They mainly focus on determining the weight set, and thus the structure of the XOR tree. Recent approaches consider multiple weight sets. In Ref. 6, it has been shown that patterns with small Hamming distance are easier to reproduce with the same weight set. This observation forms the basis of the approach, which works in sessions. A session starts by generating a weight set for a subset T′ of the patterns in T with small Hamming distance from a given centroid pattern in the subset. Subsequently, the XOR tree is constructed and a characteristic polynomial is selected which guarantees high fault coverage. Next, fault simulation is applied and it is determined how many faults remain undetected. If there are still undetected faults, an automatic test pattern generator (ATPG) is activated, and a new set of patterns T is determined for the next session; otherwise, the CAD tool terminates.

For the test set embedding problem, weighted random LFSRs are not the only alternative.
Binary counters may turn out to be a powerful BIST structure that requires very little hardware overhead. However, their design (synthesis) must be supported by sophisticated CAD tools that quickly and accurately determine the amount of time needed for the counter to reproduce a test matrix T on-chip. Such a CAD tool is described in Ref. 7; it recommends whether a counter may be suitable for the test embedding problem on a given circuit. The CAD tool in Ref. 7 designs a counter which reproduces T within a number of clock cycles that is within a constant factor from the smallest possible by a binary counter.

Consider the test matrix T of Table 68.3, which has four patterns and eight columns, labeled 1 through 8. (The circuit under test has eight inputs.)

TABLE 68.3
1 0 1 0 1 1 0 1
1 0 1 1 1 1 0 1
1 0 1 0 1 1 1 1
0 1 0 0 0 0 0 0

A simple binary counter requires 125 clock cycles to reproduce these four patterns in a straightforward manner. The counter is seeded with the fourth pattern and incrementally will reach the second pattern, which is the largest, after 125 cycles. Instead, the


CAD tool in Ref. 7 synthesizes the counter so that only 4 clock cycles are needed for reproducing these four patterns on-chip. The idea is that matrix T can be manipulated appropriately. The following operations are allowed on T:

• Any constant columns (with all 0 or all 1) can be eliminated, since ground and power wires can be connected to the respective inputs.
• Merging of any two complementary columns. This operation is allowed because the same counter cell (enhanced flip-flop) has two states Q and Q′. Thus, it can produce (over successive clock cycles) a column as well as its complement.
• Many identical columns (and their respective complements) can be merged into a single column, since the output of a single counter cell can fan out to many circuit inputs. However, due to delay considerations we do not allow more than a given number f of identical columns to be merged. Bound f is an input parameter in the CAD tool.
• Columns can be permuted. This corresponds to reordering of the counter cells.
• Any column can be replaced by its complementary column.

These five operations can be applied on T in order to reduce the number of clock cycles needed for reproducing it. The first three operations can be applied easily in a preprocessing step. In the presence of column permutation, the problem of minimizing the number of required clock cycles is NP-hard. In practice, the last two operations drastically reduce the reproduction time. The impact of column permutation is shown in the example in Table 68.4.

TABLE 68.4
1 0 1 0 1 1 0 1        0 1 1 1 1 1 0 0
1 0 1 1 1 1 0 1        0 1 1 1 1 1 1 0
1 0 1 0 1 1 1 1        0 1 1 1 1 1 0 1
0 1 0 0 0 0 0 0        1 0 0 0 0 0 0 0
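The reproduction times quoted for the matrix of Table 68.3 can be checked with a small sketch. It assumes a plain up-counter seeded with the smallest pattern, with the leftmost column as the most significant bit; the helper names and the particular column order are illustrative choices, not taken from Ref. 7.

```python
def counter_cycles(matrix):
    """Cycles an up-counter needs to sweep from the smallest to the largest row."""
    values = [int("".join(str(bit) for bit in row), 2) for row in matrix]
    return max(values) - min(values)


def permute_columns(matrix, order):
    """Reorder the columns of every row according to `order` (0-based)."""
    return [[row[c] for c in order] for row in matrix]


T = [
    [1, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 0, 0],
]

# Illustrative ordering: the complementary column first, then the five
# identical columns, then the two remaining columns.
order = [1, 0, 2, 4, 5, 7, 3, 6]
permuted = permute_columns(T, order)

print(counter_cycles(T))         # original column order
print(counter_cycles(permuted))  # after column permutation
```

With the identical columns and their complement in the most significant positions, all four patterns share almost the same high-order bits, so the counter only has to sweep a very short range.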

The matrix on the left needs 125 cycles to be reproduced on-chip. The column permutation shown to the right reduces the reproduction time to only four cycles. The idea of the counter synthesis CAD tool is to place as many identical columns as possible as the leftmost columns of the matrix. This set of columns can be preceded by a complementary column, if one exists. Otherwise, the first of the identical columns is complemented. The remaining columns are permuted so that a special condition is enforced, if possible. The example in Table 68.5 illustrates the described algorithm. Consider matrix T given in Table 68.5.

TABLE 68.5
1 0 0 0 0 1 1 0 1 1 1 1
1 1 0 1 1 0 1 0 1 1 1 1
0 1 1 0 0 0 0 1 0 0 0 0
1 1 0 1 1 0 1 1 0 0 0 1
1 1 0 0 0 0 1 1 0 0 0 1
0 0 1 0 1 1 0 1 0 0 0 1

Assume that f = 1, that is, no fan-out stems are required. The columns are permuted as given in Table 68.6. The leading (left-most) four columns are three identical columns and a complementary column to them. These four leading columns partition the vectors into two parts. Part 1 consists of the first two vectors with prefix 0111. Part 2 contains the remaining vectors. Consider the subvectors of both parts in the partition, induced when removing the leading columns. This set of subvectors (each has eight bits) will determine the relative order of the remaining columns of T.

TABLE 68.6
0 1 1 1 1 1 1 0 1 0 0 0
0 1 1 1 1 1 1 0 0 1 1 1
1 0 0 0 0 0 0 1 0 1 0 0
1 0 0 0 1 1 1 0 0 1 1 1
1 0 0 0 1 1 1 0 0 1 0 0
1 0 0 0 1 0 0 1 1 0 0 1

The unassigned eight columns are permuted and complemented (if necessary) so that the smallest subvector in part 1 is not smaller than the largest subvector in part 2. We call this condition the low order condition. The column permutation in Table 68.6 satisfies the low order condition. In this example, no column needs to be complemented in order for the low order condition to be satisfied. The CAD tool in Ref. 7 determines in polynomial time whether the columns can be permuted or complemented so that the low order condition is satisfied. If it is satisfied, it is shown that the number of clock cycles required for reproducing T is within a factor of two from the minimum possible. This also holds when the low order condition cannot be satisfied.

A test matrix T may contain don’t-cares. Don’t-cares are assigned so that we maximize the number of identical columns in T. This problem is shown to be NP-hard (Ref. 7). However, an assignment that maximizes the number of identical columns is guided by efficient heuristics for the maximum independent set problem on a graph G = (V, E), which is constructed in the following way. For each column c of T, there exists a node vc ∈ V. In addition, there exists an edge between a pair of nodes if and only if there exists at least one row in which one of the two corresponding columns has 1 and the other has 0. In other words, there exists an edge if and only if there is no don’t-care assignment that makes the respective columns identical. Clearly, G = (V, E) has an independent set of size k if and only if there exists a don’t-care assignment that makes the respective k columns of T identical. The operation of this CAD tool is illustrated in the example below.

Example 3
Consider matrix T with don’t-cares and columns labeled c1 through c6 in Table 68.7. In graph G = (V, E) of Fig. 68.4, node i corresponds to column ci, 1 ≤ i ≤ 6. Nodes 3, 4, 5, and 6 are independent.
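The graph construction of Example 3 can be sketched as follows, using the columns of Table 68.7 written top to bottom as strings. The exhaustive search for a largest independent set is only a stand-in for illustration; Ref. 7 relies on heuristics, and the function names here are hypothetical.

```python
from itertools import combinations

# Columns c1..c6 of Table 68.7, one character per test pattern ("x" = don't-care).
columns = {1: "0x10", 2: "01xx", 3: "10xx", 4: "x01x", 5: "1xx0", 6: "10xx"}


def conflict(a, b):
    """True if some row has 0 in one column and 1 in the other."""
    return any({p, q} == {"0", "1"} for p, q in zip(a, b))


# Edge (i, j), i < j, whenever no don't-care assignment can make ci and cj identical.
edges = {
    (i, j)
    for i, j in combinations(sorted(columns), 2)
    if conflict(columns[i], columns[j])
}


def max_independent_set(nodes, edges):
    """Brute-force search, acceptable only for very small graphs."""
    nodes = sorted(nodes)
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if not any(pair in edges for pair in combinations(subset, 2)):
                return subset
    return ()


best = max_independent_set(columns, edges)
print(sorted(edges))
print(best)
```

Since pairwise compatibility of columns implies that no row holds both a 0 and a 1 among them, every independent set can be turned into a group of identical columns by filling each row with its single specified value.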
The matrix to the left in Table 68.8 shows the don’t-care assignment on columns c3, c4, c5, and c6. The don’t-care assignment on the remaining columns (c1 and c2) is done as follows. First, it is attempted to find a don’t-care assignment that makes either c1 or c2 complementary to the set of identical columns {c3, c4, c5, c6}. Column c2 satisfies this condition. Then, columns c2, c3, c4, c5, and c6 are assigned to the left-most positions of T. As described earlier, the test patterns of T are now assigned in two parts. Part 1 has patterns 1 and 3, and part 2 has patterns 2 and 4. The don’t-cares of column c1 are assigned so that the low order condition is satisfied. The resulting don’t-care assignment and column permutation is shown in the matrix to the right in Table 68.8.

FIGURE 68.4 Graph construction with the don’t-care assignment.

TABLE 68.7
c1 c2 c3 c4 c5 c6
0  0  1  x  1  1
x  1  0  0  x  0
1  x  x  1  x  x
0  x  x  x  0  x

TABLE 68.8
0 0 1 1 1 1        0 1 1 1 1 0
x 1 0 0 0 0        1 0 0 0 0 0
1 x 1 1 1 1        0 1 1 1 1 1
0 x 0 0 0 0        1 0 0 0 0 0

Extensions of the CAD tool involve partitioning of the patterns into submatrices where some or all of the above-mentioned operations are applied independently. For example, the columns of one submatrix can be permuted in a completely different way from the columns of another submatrix. Tradeoffs between hardware overhead and reproduction time have been analyzed among different variations (extensions) of the CAD tools. The tradeoffs are determined by the subset of operations that can be applied independently in each submatrix. The larger the set, the higher the hardware overhead.

DFT and BIST for Sequential Logic

CAD Tools for Scan Designs
In the full scan design, all the flip-flops in the circuit must be scanned and inserted in the scan chain. The hardware overhead is large and the test application time is lengthy for circuits with a large number of flip-flops. Test application time can be drastically reduced by an appropriate reordering of the cells in the scan chain. This cell reordering problem has been formulated as a combinatorial optimization problem which is shown to be NP-hard. However, an efficient CAD tool for determining an efficient cell reordering is presented in Ref. 8.

One useful approach for reducing both of the above costs is to resynthesize the circuit by repositioning its flip-flops so that their number is minimized while the functionality of the design is preserved. We describe such a circuit resynthesis scheme. Let us consider the circuit graph G = (V, E) of the circuit, where each node v ∈ V is either an input/output port or a combinational module. Each edge (u, v) ∈ E is assigned a weight ff(u, v) equal to the number of flip-flops on it. Ref. 9 has shown that flip-flops can be repositioned without changing the functionality of the circuit as follows. Let IO denote the set of input/output ports. The flip-flop repositioning problem amounts to assigning r() values to each node in V so that

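$$r(v) - r(u) \le ff(u, v), \quad \forall (u, v) \in E; \qquad r(v) = 0, \quad \forall v \in IO \tag{68.1}$$

Once an r() value is assigned to each node (with r(v) = 0 at every I/O port), the new number of flip-flops on each edge (u, v) is computed using the formula

$$ff_{new}(u, v) = ff(u, v) + r(u) - r(v) \tag{68.2}$$

The first constraint in Eq. 68.1 states exactly that the new flip-flop count of Eq. 68.2 is nonnegative on every edge.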
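Applying Eq. 68.2 to a small circuit graph can be sketched as below. The graph, the node names, and the r() values are hypothetical and chosen only for illustration; the function rejects any assignment that is nonzero at an I/O port or that would leave a negative number of flip-flops on an edge.

```python
def reposition(ff, r, io_ports):
    """Apply Eq. 68.2: ff_new(u, v) = ff(u, v) + r(u) - r(v) on every edge.

    ff       -- dict mapping edge (u, v) to its flip-flop count
    r        -- dict mapping node to its r() value
    io_ports -- nodes whose r() value must be 0
    """
    for port in io_ports:
        if r[port] != 0:
            raise ValueError(f"r({port}) must be 0 at an I/O port")
    new_ff = {}
    for (u, v), count in ff.items():
        new_count = count + r[u] - r[v]
        if new_count < 0:
            raise ValueError(f"edge ({u}, {v}) would get {new_count} flip-flops")
        new_ff[(u, v)] = new_count
    return new_ff


# Hypothetical circuit: two modules g1, g2 feed module m through one flip-flop each.
ff = {
    ("in1", "g1"): 0,
    ("in2", "g2"): 0,
    ("g1", "m"): 1,
    ("g2", "m"): 1,
    ("m", "out"): 0,
}
r = {"in1": 0, "in2": 0, "out": 0, "g1": 0, "g2": 0, "m": 1}

new_ff = reposition(ff, r, io_ports=["in1", "in2", "out"])
print(sum(ff.values()), "->", sum(new_ff.values()))
```

In this toy instance the two flip-flops on the inputs of m are replaced by a single flip-flop on its output, which is the kind of saving the minimization described in the text looks for.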

The set of constraints in Eq. 68.1 is a set of difference constraints and forms a special case of linear programming which can be solved in polynomial time using Bellman–Ford shortest path calculations. The described resynthesis scenario is also referred to as retiming because flip-flop repositionings may affect the clock period. The above set of difference constraints has an infinite number of solutions. Thus, there exists an infinite number of circuit designs with an equivalent functionality. One can benefit from these alternative designs, and resynthesis can be done in order to optimize certain objective functions. In full scan, the objective is to minimize the total number of flip-flops. The latter quantity is precisely

$$\sum_{(u,v)} ff_{new}(u, v)$$

which can be rewritten (using Eq. 68.2) as

$$\sum_{(u,v)} \bigl( ff(u, v) + r(u) - r(v) \bigr) = \sum_{(u,v)} ff(u, v) + \sum_{(u,v)} \bigl( r(u) - r(v) \bigr) \tag{68.3}$$

Since the first term in Eq. 68.3 is an invariant, the goal is to find r() values that minimize ∑(u,v)(r(u) – r(v)) subject to the constraints in Eq. 68.1. This special case of integer linear programming is polynomially solvable using min-cost flow techniques.9 Once the r() values are computed, Eq. 68.2 is applied to determine where the flip-flops will be repositioned. The resulting circuit has minimum number of flip-flops.9 Although full scan is widely used by the industry, its hardware overhead is often prohibitive. An alternative approach for scan designs is the structural partial scan approach where a minimum cardinality subset of the flip-flops must be scanned so that every cycle contains at least one scanned flip-flop. This is an NP-hard problem. Reference 10 has shown that minimizing the number of flip-flops subject to some constraints additional to Eq. 68.1 turns out to be a beneficial approach for structural partial scan. The idea here is that minimizing the number of flip-flops amounts to maximizing the average number of cycles per flip-flop. This leads to efficient heuristics for selecting a small number of flip-flops for breaking all cycles. Other resynthesis schemes that reposition the flip-flops in order to reduce the partial scan overhead have been proposed in Refs. 11 and 12. Both schemes initially identify a set of lines L that forms a low cardinality solution for partial scan. L may have lines without flip-flops. Thus, the flip-flops must be repositioned so each line of L has a flip-flop which is then scanned. Another important goal in partial scan is to minimize the sequential depth of the scanned circuit. This is defined as the maximum number of flip-flops along any path in the scanned circuit whose endpoints are either controllable or observable. 
The sequential depth of a scanned circuit is a very important quantity because it affects the upper bound on the length of the test sequences which need to be applied in order to detect the stuck-at faults. Since the scanned circuit is acyclic, the sequential depth can be determined in polynomial time by a simple topological graph traversal.

Figure 68.5 illustrates the concept of the sequential depth. Circles denote I/O ports, oval nodes represent combinational modules, solid square nodes indicate unscanned flip-flops, and empty square nodes are scanned flip-flops. The sequential depth of the circuit graph to the left is 2. The figure to the right shows an equivalent circuit where the sequential depth has been reduced to 1. In this figure, the unscanned (solid) flip-flops have been repositioned, while the scanned flip-flops remain at their original positions so that the scanned circuit is guaranteed to be acyclic. Flip-flop repositioning is done subject to the constraints in Eq. 68.1 so that the functionality of the design is preserved.

Let F be the set of observable/controllable points in the scanned circuit. Let F(u, v) denote the maximum number of unscanned flip-flops between u and v, u, v ∈ F, and E′ denote the set of edges in the scanned sequential graph that have a scanned flip-flop. Ref. 10 proves that the sequential depth is at most k if and only if there exists a set of r() values that satisfy the following set of inequalities:

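$$r(u) - r(v) \le k - F(u, v), \quad \forall u, v \in F; \qquad r(u) - r(v) = 0, \quad \forall (u, v) \in E' \tag{68.4}$$

The first inequality bounds by k the quantity F(u, v) + r(u) − r(v), which by Eq. 68.2 is the largest number of unscanned flip-flops on a path from u to v after repositioning. The second keeps every scanned flip-flop in place.

FIGURE 68.5 The impact of flip-flop repositioning on the sequential depth.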

A simple hierarchy search can then be applied in order to find the smallest sequential depth that can be obtained with flip-flop repositioning.

A final objective in partial scan is to be able to balance the scanned circuit. In a balanced circuit, all paths between any pair of combinational modules have the same number of flip-flops. It has been shown that the TPG process for a balanced circuit reduces to TPG for combinational logic (Ref. 13). It has been proposed to balance a circuit by enhancing already existing flip-flops in the circuit and then bypassing them during testing mode (Ref. 13). Multiplexing circuitry needs to be associated with each selected flip-flop. Minimizing the multiplexer-related hardware overhead amounts to minimizing the number of selected flip-flops, which is an NP-hard problem (Ref. 13).

The natural question is whether flip-flop repositioning may help in balancing a circuit with less hardware overhead. Unfortunately, it has been shown that it cannot. It can, however, assist in inserting the minimum possible number of bse elements in order for the circuit to be balanced. Each inserted bse element is bypassed during operation mode but acts as a delay element in testing mode. The algorithm consists of two steps. In the first step, bses are greedily inserted so that the scanned circuit becomes balanced. Subsequently, the number of the inserted bse elements is minimized by repositioning the inserted elements. This is a variation of the approach that was described earlier for minimizing the number of flip-flops in a circuit. Bses are treated as flip-flops, but for every edge (u, v) with original circuit flip-flops, the set of constraints in Eq. 68.1 is enhanced with the additional constraint r(u) – r(v) = 0. This ensures that the flip-flops of the circuit will not be repositioned. The correctness of the approach relies on the property that any flip-flop repositioning on a balanced circuit always maintains the balancing property. This can be easily shown as follows.
In an already balanced circuit, every path p(u, v) between any combinational nodes u, v has the same number of flip-flops c(u, v). When u and v are not adjacent nodes but the endpoints of a path p with two or more lines, a telescoping summation using Eq. 68.2 can be applied on the edges of the path to show that $ff_{new}(p(u, v))$, the number of flip-flops on p after retiming, is

$$ff_{new}(p(u, v)) = c(u, v) + r(u) - r(v)$$

Observe now that the quantity $ff_{new}(p(u, v))$ is independent of the actual path p(u, v), and remains invariant as long as we have a path between nodes u and v. This argument holds for all pairs of combinational nodes u, v. Thus, the circuit remains balanced after repositioning the flip-flops.

Test application time is a complex issue for designs that have been resynthesized for improved partial scan. Test sequences that have been precomputed for the circuit prior to its resynthesis can no longer be applied directly to the resynthesized circuit. However, Ref. 14 shows that one can apply such precomputed test sequences after an initializing sequence of patterns brings the circuit to a given state s. State s guarantees that the precomputed patterns can be applied.

On-chip Schemes for Sequential Logic
Many CAD tools have been proposed in the literature for automating the design of BIST on-chip schemes for sequential logic. The first CAD tool of this section considers LFSR-based pseudo-exhaustive BIST. Then, a deterministic scheme that uses Cellular Automata is presented.

A popular LFSR-based approach for pseudorandom built-in self-test (BIST) of sequential logic proposes to enhance the scanned flip-flops of the circuit into either Built-In Logic-Block Observation (BILBO) cells or Concurrent Built-In Logic-Block Observation (CBILBO) cells. Additional BILBO cells and CBILBO cells that are transparent in normal mode can also be inserted into arbitrary lines in sequential circuits. The approach uses pseudorandom pattern generators (PRPGs) and multiple-input signature registers (MISRs). There are two important differences between BILBO and CBILBO cells. (For the detailed structure of BILBO and CBILBO cells, see Ref. 15, Chapter 11.) First, in testing mode, a CBILBO cell operates both in the PRPG mode and the MISR mode, while a BILBO cell can operate in only one of the two modes.


FIGURE 68.6

Illustration of the different hardware overheads.

The second difference is that CBILBO cells are more expensive than BILBO cells. Clearly, inserting a whole transparent test cell into a line is more expensive, in terms of hardware cost, than enhancing an existing flip-flop. The basic BILBO BIST architecture partitions a sequential circuit into a set of registers and blocks of combinational circuits, with normal registers replaced by BILBO cells. The choice between enhancing existing flip-flops to BILBO cells and inserting transparent BILBO cells generates many alternative scenarios with different hardware overheads.

Consider the circuit in Fig. 68.6(a) with two BILBO registers R1 and R2 in a cycle. In order to test C1, register R1 is set in PRPG mode and R2 in MISR mode. Assuming that the inputs of register R1 are held at the value zero, the circuit is run in this mode for as many clock cycles as needed, and can be tested exhaustively for most cases, except for the all-zero pattern. At the end of this test process, the contents of R2 can be scanned out and the signature is checked. In the same way, C2 can be tested by configuring register R1 into MISR mode and R2 into PRPG mode. However, the circuit in Fig. 68.6(b) does not conform to a normal BILBO architecture. This circuit has only one BILBO register R2 in a self-loop. In order to test C1, register R1 must be in PRPG mode, and register R2 must be in both MISR mode and PRPG mode, which is impossible due to the BILBO cell structure. This situation can be handled either by adding a transparent BILBO register in the cycle or by using a CBILBO that can operate simultaneously in both MISR and PRPG modes.

In order to make a sequential circuit self-testable, each cycle of the circuit must contain at least one CBILBO cell or two BILBO cells. This combinatorial optimization problem is stated as follows. The input is a sequential circuit and a list of hardware overhead costs:

cB: The cost of enhancing a flip-flop to a BILBO cell.
cCB: The cost of enhancing a flip-flop to a CBILBO cell.
cBt: The cost of inserting a transparent BILBO cell.
cCBt: The cost of inserting a transparent CBILBO cell.

The goal is to find a minimum cost solution of this scan register placement problem in order to make every cycle in the circuit have at least one CBILBO cell or at least two BILBO cells. The optimal solution for a circuit may vary, depending upon different cost parameter sets. For example, we can have three different solutions for the circuit in Fig. 68.7. The first is that both flip-flops FF1 and FF2 can be enhanced to CBILBO cells. The second is that one transparent CBILBO cell can be inserted at the output of gate G3 to break the two cycles. The third is that both flip-flops FF1 and FF2 can be enhanced to BILBO cells, together with one transparent BILBO cell inserted at the output of gate G3. Under the cost parameter set cB = 20, cBt = 30, cCB = 40, cCBt = 60, the hardware overheads of the three solutions are 80, 60, and 70, in that order. The second solution, using a transparent CBILBO cell, has the least hardware overhead.
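The overhead figures for the three candidate solutions follow directly from the cost parameters, as the sketch below shows. The dictionary layout and the function name are illustrative choices; each solution is described only by how many cells of each kind it uses.

```python
def overhead(solution, costs):
    """Total hardware overhead of a solution given per-cell costs."""
    return sum(costs[kind] * count for kind, count in solution.items())


costs = {"cB": 20, "cBt": 30, "cCB": 40, "cCBt": 60}

# Number of cells of each kind used by the three solutions described in the text.
solutions = {
    "FF1 and FF2 enhanced to CBILBO": {"cCB": 2},
    "one transparent CBILBO at G3": {"cCBt": 1},
    "FF1 and FF2 enhanced to BILBO plus one transparent BILBO at G3": {"cB": 2, "cBt": 1},
}

totals = {name: overhead(cells, costs) for name, cells in solutions.items()}
cheapest = min(totals, key=totals.get)

for name, total in totals.items():
    print(f"{total:3d}  {name}")
print("cheapest:", cheapest)
```

Re-running the same computation with a different `costs` dictionary shows how the preferred solution shifts with the cost parameter set.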


FIGURE 68.7

The solution depends on the cost parameter set.

However, under the cost parameter set cB = 10, cBt = 30, cCB = 40, cCBt = 60, the third solution, using both transparent and enhanced BILBO cells, is optimal, with a total hardware overhead of 50. Although a CBILBO cell is more expensive than a BILBO cell, and a transparent cell is more expensive than an enhanced one, in some situations using CBILBO cells and transparent test cells may be beneficial to the hardware overhead. For this difficult combinatorial problem, Ref. 16 presents a CAD tool that finds the optimal hardware overhead using a branch and bound approach. The worst-case time complexity of the CAD tool is exponential and, in many instances, its time response is prohibitive. For this reason, Ref. 16 proposes an alternative branch and bound CAD tool that terminates the search whenever solutions close to the optimal are found. Although time complexity still remains exponential, the results reported in Ref. 16 show that branch and bound techniques are promising.

The remainder of this section presents a CAD tool for embedding test sequences on-chip. Checking for stuck-at faults in sequential logic requires the application of a sequence of test patterns to set the values of some flip-flops along with those values required for fault justification/propagation. Therefore, it is imperative that all test patterns in each test sequence are applied in the specified order. Cellular automata (CA) have been proposed as a TPG mechanism to achieve this goal, the advantage being mainly that they are finite state machines (FSMs) with a very regular structure. References 17 and 18 propose that hybrid CAs are used for embedding test sequences on-chip. Hybrid CAs consist of a series of flip-flops fi, 1 ≤ i ≤ n. The next state $f_i^+$ of flip-flop i is a function Fi of the present states of fi–1, fi, and fi+1. (We call them the 3-neighborhood CAs.) For the computation of $f_1^+$ and $f_n^+$, the missing neighbors are considered to be constant 0.
A straightforward implementation of function Fi is by an 8-to-1 multiplexer. Consider a p × w test matrix T comprising p ordered test vectors. The CAD tool in Ref. 18 presents a systematic methodology for this embedding problem. First, we give some definitions (Ref. 18). Given a sequence of three columns (XL, X, XR), each row i, 1 ≤ i ≤ p – 1, is associated to a template

$$\tau_i = \begin{bmatrix} x_L^i & x^i & x_R^i \\ & x^{i+1} & \end{bmatrix}$$

(No template is associated with the last row p.) Let H(τi) denote the upper part $[x_L^i \; x^i \; x_R^i]$ of τi and let L(τi) denote the lower part, $[x^{i+1}]$. Given a sequence of columns (XL, X, XR), two templates τi and τj, 1 ≤ i, j ≤ p – 1, are conflicting if and only if H(τi) = H(τj) and L(τi) ≠ L(τj). A sequence of three columns (XL, X, XR) is a valid triplet if and only if there are no conflicting templates. This is imperative in order to have a properly defined Fi function for the corresponding CA cell that will generate column X of the test matrix, if column X is assigned between columns XL and XR in the CA cell ordering. If a valid triplet cannot be formed from test matrix columns, a so-called “link column” must be introduced (corresponding to an extra CA cell) so as to make a valid triplet. The goal in the studied on-chip embedding problem by a hybrid CA is to introduce the minimum number of link columns (extra CA cells) so as to generate the whole sequence.

The CAD tool in Ref. 18 tackles this problem by a systematic procedure that uses shift-up columns. Given a column $X = (x^1, x^2, \dots, x^p)^{tr}$, the shift-up column of X is the column $\hat{X} = (x^2, x^3, \dots, x^p, d)^{tr}$, where d is a don’t-care. Given a column X, the sequence of columns $(X_L, X, \hat{X})$ is a valid triplet for any column XL.


Moreover, given two columns A and B of the test matrix, a shifting sequence from A to B is a sequence of columns (A, L0, L1, L2, …, Lj, B) such that $L_0 = \hat{A}$, $L_i = \hat{L}_{i-1}$ for 1 ≤ i ≤ j, and $(L_{j-1}, L_j, B)$ is a valid triplet. A shifting sequence is always a valid sequence. The important property of a shifting sequence (A, L0, L1, L2, …, Lj, B) is that column A can be preceded by any other column X in a CA ordering, with the resulting sequence (X, A, L0, L1, L2, …, Lj, B) still being valid. That is, for any two columns A and B of the test matrix, column B can always be placed after column A with some intervening link columns, regardless of which column is placed before A. Given any two columns A and B of the test matrix, the goal of the CAD tool in Ref. 18 is to find a shifting sequence $(A, L_0, L_1, \ldots, L_{j_{AB}}, B)$ of minimum length. This minimum number (denoted by mAB) can be found by successive shift-ups of $L_0 = \hat{A}$ until a valid triplet ending with column B is formed. Given an ordered test matrix T, the CAD tool in Ref. 18 reduces the problem of finding short shifting sequences to that of computing a Traveling Salesman (TS) solution on an auxiliary graph. Experimental results reported in Ref. 18 show that this hybrid CA-based approach is promising.
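The successive shift-up search can be sketched as follows. This is a hedged illustration rather than the procedure of Ref. 18: don't-cares are represented by None, and the validity check is deliberately conservative (an assumption made here), declaring a conflict whenever two upper parts could coincide under some assignment of the don't-cares while the lower parts are specified and differ.

```python
from typing import List, Optional, Sequence

Bit = Optional[int]  # None stands for the don't-care d


def shift_up(col: Sequence[Bit]) -> List[Bit]:
    """Drop the first entry and append a don't-care."""
    return list(col[1:]) + [None]


def _may_equal(a: Bit, b: Bit) -> bool:
    """Two entries can coincide if either is a don't-care or both agree."""
    return a is None or b is None or a == b


def is_valid_triplet(xl: Sequence[Bit], x: Sequence[Bit], xr: Sequence[Bit]) -> bool:
    """Conservative validity check in the presence of don't-cares."""
    rows = range(len(x) - 1)  # the last row has no template
    for i in rows:
        for j in rows:
            if j <= i:
                continue
            uppers_may_match = all(
                _may_equal(a, b)
                for a, b in zip((xl[i], x[i], xr[i]), (xl[j], x[j], xr[j]))
            )
            lo_i, lo_j = x[i + 1], x[j + 1]
            if uppers_may_match and lo_i is not None and lo_j is not None and lo_i != lo_j:
                return False
    return True


def shifting_sequence(a: Sequence[Bit], b: Sequence[Bit]) -> List[List[Bit]]:
    """Return the link columns L0, L1, ..., Lj placed between columns a and b.

    The loop stops at the first j for which (L_{j-1}, L_j, b) is valid, with
    column a playing the role of L_{-1}. It always terminates: after enough
    shift-ups every lower part is a don't-care.
    """
    prev, cur = list(a), shift_up(a)
    links = [cur]
    while not is_valid_triplet(prev, cur, b):
        prev, cur = cur, shift_up(cur)
        links.append(cur)
    return links


if __name__ == "__main__":
    print(shifting_sequence([0, 0, 0, 1, 0], [0, 0, 0, 0, 0]))
```

The number of link columns returned for every ordered pair (A, B) is the kind of edge weight one would place on the auxiliary graph before running a Traveling Salesman heuristic.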

Fault Simulation

Explicit fault simulation is needed whenever the test patterns are generated using an ATPG tool. Fault simulation is needed in scan designs when an ATPG tool is used for TPG. Fault simulation procedures may also be used in the design of deterministic on-chip TPG schemes. On the other hand, pseudoexhaustive/pseudorandom BIST schemes mainly use compression techniques for detecting whether the circuit is faulty. Compression techniques were covered in Chapter 67. (Chapter 10 of Ref. 15 provides a more detailed discussion.) This section reviews CAD tools proposed for fault simulation of stuck-at faults in single-output combinational logic. For a more extensive discussion on the subject, we refer the reader to Ref. 15 (Chapter 5).

The simplest form of simulation is called single-fault propagation. After a test pattern is simulated, the stuck-at faults are inserted one after the other. The values in each faulty circuit are compared with the error-free values. A faulty value needs to be propagated from the line where the fault occurs. The propagation process continues line by line, in topological order, until no faulty value differs from the respective good one. If, instead, a differing value reaches the output, the fault is detected.

In an alternative approach, called parallel-fault propagation, the goal is to simulate n test patterns in parallel using n-bit memory words. Gates are evaluated using Boolean instructions operating on n-bit operands. The problem with this type of simulation is that, at a given gate, events may occur in only a subset of the n patterns. If, on average, a fraction α of the gates have events on their inputs in one test pattern, the parallel simulator will simulate 1/α times as many gates as an event-driven simulator. Since n patterns are simulated in parallel, the approach is more efficient when n ≥ 1/α, and the speed-up is n · α. Single and parallel fault propagation are combined efficiently in a CAD tool proposed in Ref. 19.
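Evaluating gates with Boolean instructions on n-bit operands can be illustrated with Python integers used as bit vectors. The three-input circuit y = (a AND b) OR (NOT c), the internal line name g1, and the helper names are all hypothetical choices for this sketch; bit k of every word holds the value of that line under test pattern k.

```python
from typing import Optional, Sequence


def pack(bits: Sequence[int]) -> int:
    """Pack one line's values over n patterns into a word (bit k = pattern k)."""
    word = 0
    for k, b in enumerate(bits):
        word |= (b & 1) << k
    return word


def simulate(a: int, b: int, c: int, n: int, g1_stuck_at: Optional[int] = None) -> int:
    """Simulate y = (a AND b) OR (NOT c) for n patterns at once.

    If g1_stuck_at is 0 or 1, the internal line g1 = a AND b is forced to
    that value in every pattern, which models a single stuck-at fault.
    """
    mask = (1 << n) - 1
    g1 = a & b
    if g1_stuck_at is not None:
        g1 = mask if g1_stuck_at else 0
    return (g1 | (~c & mask)) & mask


if __name__ == "__main__":
    n = 4
    a, b, c = pack([1, 1, 0, 0]), pack([1, 0, 1, 0]), pack([1, 1, 1, 0])
    good = simulate(a, b, c, n)
    faulty = simulate(a, b, c, n, g1_stuck_at=0)
    # A set bit in the difference marks a pattern that detects the fault.
    print(f"detecting patterns: {good ^ faulty:0{n}b}")
```

A single bitwise instruction here evaluates the gate for all n patterns, whether or not an event occurred in each of them, which is the source of the 1/α overhead mentioned above.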
Another approach for fault simulation is the critical path tracing approach (Ref. 20). For every test pattern, the approach first simulates the fault-free circuit and then determines the detected faults by determining which lines have critical values. A line has critical value 0 (1) in pattern t if and only if test pattern t detects the fault stuck-at 1 (0) at the line. Therefore, finding the lines that are critical in pattern t amounts to finding the stuck-at faults that are detected by t. Critical lines are found by backtracking from the primary outputs. Such a backtracking process determines paths of critical lines that are called critical paths. The process of generating critical paths uses the concept of sensitive inputs of a gate with two or more inputs (for a test pattern t). This is determined easily: if only input l has the controlling value of a gate, then l is sensitive. On the other hand, if all the inputs of a gate have noncontrolling values, then they are all sensitive. There is no other condition for labeling an input line of a gate as sensitive. Thus, the sensitive inputs of a gate can be identified during the fault-free simulation of the circuit. The operation of the critical path tracing algorithm is based on the observation that when a gate output is critical, then all its sensitive inputs are critical. On fan-out free circuits, critical path tracing is a simple traversal that recursively applies the above observation.

The situation is more complicated when there exist reconvergent fan-outs. This is illustrated in Fig. 68.8. In Fig. 68.8(a), starting from g, we determine lines g, e, b, and c1 to be critical, in that order. In order to determine whether c is critical, we need additional analysis. The effects of the fault stuck-at 0 on line c propagate on reconvergent paths with different parities, which cancel each other when they reconverge at gate g. This is called self-masking. Self-masking does not occur in Fig. 68.8(b) because the fault propagation from c2 does not reach the reconvergent point. In Fig. 68.8(b), c is critical. Therefore, the problem is to determine whether self-masking occurs or not at the stem of the circuit. Let 0 (1) be the value of a stem l under test t. A solution is to explicitly simulate the fault stuck-at 1 (0) on l, and if t detects this fault, then l is marked as critical. Instead, the CAD tool uses bottlenecks in the propagation of faults that are called capture lines. Let a be a line with topological level tla, sensitized to stuck-at fault f with a pattern t. If every path sensitized to f either goes through a or does not reach any line with topological level greater than tla, then a is a capture line of f under pattern t. Such a line is common to all paths on which the effects of f can propagate to the primary output under pattern t. The capture lines of a fault form a transitive chain. Therefore, a test t detects fault f if and only if all the capture lines of f under test pattern t are critical in t. Thus, in order to determine whether a stem is critical, the CAD tool does not propagate the effects of the stem fault up to the primary output; it only propagates the fault effects up to the capture line that is closest to the stem.

FIGURE 68.8 The solution depends on the cost parameter set.
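For the fan-out-free case, the sensitive-input rule and the recursive traversal can be sketched as below. The circuit representation (nested tuples whose leaves are primary-input names) and all function names are assumptions made for this illustration; stems and capture lines are not handled.

```python
from typing import Dict, List, Set, Tuple, Union

Node = Union[str, Tuple]  # a primary-input name, or (gate_type, input, input, ...)

CONTROLLING = {"AND": 0, "NAND": 0, "OR": 1, "NOR": 1}
INVERTING = {"NAND", "NOR"}


def evaluate(node: Node, values: Dict[str, int]) -> int:
    """Fault-free value of a line in a tree-shaped (fan-out-free) circuit."""
    if isinstance(node, str):
        return values[node]
    gate, *inputs = node
    cv = CONTROLLING[gate]
    ins = [evaluate(x, values) for x in inputs]
    out = cv if cv in ins else 1 - cv  # value before any output inversion
    return 1 - out if gate in INVERTING else out


def sensitive_inputs(node: Tuple, values: Dict[str, int]) -> List[Node]:
    """Exactly one input at the controlling value: that input is sensitive.
    No input at the controlling value: all inputs are sensitive.
    Otherwise no input is sensitive."""
    gate, *inputs = node
    cv = CONTROLLING[gate]
    at_cv = [x for x in inputs if evaluate(x, values) == cv]
    if len(at_cv) == 1:
        return at_cv
    return list(inputs) if not at_cv else []


def critical_inputs(node: Node, values: Dict[str, int]) -> Set[str]:
    """Primary inputs that are critical, assuming the line at `node` is critical."""
    if isinstance(node, str):
        return {node}
    found: Set[str] = set()
    for child in sensitive_inputs(node, values):
        found |= critical_inputs(child, values)
    return found


if __name__ == "__main__":
    circuit = ("OR", ("AND", "a", "b"), ("NOR", "c", "d"))
    print(sorted(critical_inputs(circuit, {"a": 1, "b": 1, "c": 1, "d": 0})))
```

The sketch re-evaluates subtrees for brevity; a practical tool would record the sensitive inputs once during the fault-free simulation pass, as the text notes.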

68.3 CAD for Path Delays

CAD Tools for TPG

Fault Models and Non-enumerative ATPG

In the path delay fault problem, defects cause the propagation time along paths in the circuit under test to exceed the clock period. We assume here a fully scanned circuit where path delays are examined in combinational logic. A path delay fault is any path where either a rising (0 → 1) or falling (1 → 0) transition occurs on every line in the path. Therefore, for every physical path in the circuit, there exist two path delay faults. The first path delay fault is associated with a rising transition on the first line on the path. The second path delay fault is associated with a falling transition on the first line on the path. In order to detect path delay faults, pairs of patterns must be applied rather than single test patterns. One of the conditions that can be imposed on the tests for path delay faults is the robust condition. Robust tests guarantee the detection of the targeted path delay faults independent of any delays in the rest of the circuit. Table 68.9 lists the conditions for robust propagation of path delay faults in a circuit containing AND, OR, NAND, and NOR gates.

TABLE 68.9 Requirements for Robust Propagation

Gate    Output Transition 0 → 1    Output Transition 1 → 0
AND     Any number of inputs       Single input
OR      Single input               Any number of inputs
NAND    Single input               Any number of inputs
NOR     Any number of inputs       Single input

Thus, when the output of an AND gate has been assigned a rising transition, multiple inputs are allowed to have rising transitions because rising transitions for an AND gate are transitions from a controlling value (cv) to a noncontrolling value (ncv). If, on the other hand, the output of an AND gate has a falling transition (ncv → cv), then only one input is allowed to have a ncv → cv transition in order to satisfy robustness. Some definitions are necessary before we describe additional path delay fault families. Given a path delay fault p and a gate g on p, the on-input of g with respect to path p is the input of g that is also on p. All other inputs of g are called off-inputs of g with respect to path p. Robust path delay faults are a subset of the non-robust path delay faults. A non-robust test vector satisfies the conditions: (1) a transition is launched at the primary input of the target path, and (2) all off-inputs of the target path settle to non-controlling values under the second pattern in the vector. A robust test vector must satisfy the conditions of the non-robust tests and, in addition, whenever the transition at an on-input line a is ncv → cv, each off-input of a must be steady at ncv. The target faults detected by robust test vectors are called robustly testable, and are a subset of the target faults that are detected by non-robust test vectors. The target faults that are not robustly testable and are detected by non-robust test vectors are called non-robustly testable. Non-robust test vectors cannot guarantee the detection of the target fault in the presence of other delay faults. Functionally sensitizable test vectors allow for faults to be detected in the presence of multiple path delays. They detect a set of faults that is a superset of those detected by non-robust test vectors.
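The per-gate conditions can be checked mechanically. The sketch below is a simplified illustration built on the rule of Table 68.9 that only a single input may make a ncv → cv transition, while any number of inputs may go from cv to ncv. It assumes each signal is described only by its values under the two patterns, so it cannot tell a truly steady off-input from one that glitches; the function name and return labels are hypothetical.

```python
from typing import Sequence, Tuple

CONTROLLING = {"AND": 0, "NAND": 0, "OR": 1, "NOR": 1}

Pair = Tuple[int, int]  # (value under first pattern, value under second pattern)


def classify_gate(gate: str, on_input: Pair, off_inputs: Sequence[Pair]) -> str:
    """Classify propagation through one gate as 'robust', 'non-robust' or 'neither'.

    Non-robust: the on-input has a transition and every off-input settles
    to the noncontrolling value under the second pattern.
    Robust: additionally, if the on-input goes ncv -> cv, every off-input
    also starts at ncv (taken here as a stand-in for "steady at ncv").
    """
    cv = CONTROLLING[gate]
    ncv = 1 - cv
    if on_input[0] == on_input[1]:
        return "neither"  # no transition is launched on the on-input
    if any(final != ncv for _, final in off_inputs):
        return "neither"  # some off-input ends at the controlling value
    if on_input == (ncv, cv) and any(first != ncv for first, _ in off_inputs):
        return "non-robust"
    return "robust"


if __name__ == "__main__":
    # AND gate: falling on-input (ncv -> cv) with a rising off-input.
    print(classify_gate("AND", (1, 0), [(0, 1)]))
    # AND gate: rising on-input (cv -> ncv) with a rising off-input.
    print(classify_gate("AND", (0, 1), [(0, 1)]))
```

A full path check would apply this test at every gate along the path and take the weakest classification found.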
A target fault is functionally testable (FT) if there is at least one gate with one or more off-inputs with ncv → cv transition, where all of its off-inputs with ncv → cv transition are also delayed while its remaining off-inputs satisfy the conditions for non-robust test vectors. We say that each such gate satisfies the functionally testable (FT) condition. It has been shown that FT faults have a better probability of being detected when the maximum off-input slack (or, simply, slack) is a small integer. (The slack of an off-input is defined as the difference between the stable time of the on-input signal and the stable time of the off-input signal.) Faults that are not detected by functionally sensitizable test vectors are called functionally unsensitizable. Table 68.10 summarizes the above-mentioned off-input conditions (Ref. 21).

TABLE 68.10 Off-input Signals for Two Input Gates and Fault Classification

Off-input Transition    On-input Transition cv → ncv    On-input Transition ncv → cv
cv → ncv                Robust                          Non-robustly testable
ncv → cv                Funct. unsensitizable           Functionally testable
Stable ncv              Robust                          Robust
Stable cv               Funct. unsensitizable           Funct. unsensitizable

Other classifications of path delay faults have been recently proposed in the literature, but they are not presented here.22,23 Systematic path delay fault classification is very important when considering test pattern generation. For example, test pattern generation for robust path delay faults does not need to consider actual delays on the gates. However, delays have to be considered when generating pairs of patterns for non-robust and functionally testable faults. For the latter fault family, the generator must take into consideration that they are multiple faults, and that the slack is an important parameter for their detection. The conventional approach for generating test patterns for path delay faults is a modification of the test pattern generation for stuck-at faults. It consists of a two-phase loop, each loop iteration resulting in a generated pair of patterns. Initially, transitions are assigned on the lines of path P. This is called the path sensitization phase. Then, a modified ATPG for stuck-at fault is executed twice. The first time, a test pattern must be generated so that every line of the selected path delay fault receives its initial transition value. The second execution of the modified ATPG generates another pattern, which assigns the final transition value on every line on the path. This is called the line justification phase. The problem with this conventional approach is that the repeat loop will be executed as many times as the number of path delay faults, which is an exponential quantity to the size of the circuit. More explicitly, the difficulty of the path delay fault model is that the number of targeted faults is exponential, therefore we cannot afford to generate pairs of test patterns that detect one fault at a time. Any practical ATPG tool must be able to generate a polynomial number of test patterns. Thus, in the case of path delay faults, the two-phase loop must be modified as follows. 
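To see why enumerating one fault at a time is impractical, the number of physical paths can be counted with a memoized traversal of the circuit graph. The sketch below is an illustration only: the chain of k two-way "diamond" stages and all names are hypothetical, chosen because such a chain has 4k lines yet 2^k paths, and therefore 2 · 2^k path delay faults.

```python
from functools import lru_cache
from typing import Dict, Iterable, Sequence


def count_paths(succ: Dict[str, Sequence[str]], sources: Iterable[str],
                sinks: Iterable[str]) -> int:
    """Count source-to-sink paths in a DAG without enumerating them."""
    sink_set = set(sinks)

    @lru_cache(maxsize=None)
    def paths_from(v: str) -> int:
        here = 1 if v in sink_set else 0
        return here + sum(paths_from(s) for s in succ.get(v, ()))

    return sum(paths_from(s) for s in sources)


def diamond_chain(k: int) -> Dict[str, Sequence[str]]:
    """k stages in series; each stage offers two parallel branches."""
    succ: Dict[str, Sequence[str]] = {}
    for i in range(k):
        succ[f"n{i}"] = (f"u{i}", f"v{i}")
        succ[f"u{i}"] = (f"n{i + 1}",)
        succ[f"v{i}"] = (f"n{i + 1}",)
    return succ


if __name__ == "__main__":
    for k in (3, 10, 20):
        paths = count_paths(diamond_chain(k), ["n0"], [f"n{k}"])
        print(k, paths, 2 * paths)  # stages, physical paths, path delay faults
```

Counting takes time linear in the size of the graph, whereas generating one pair of patterns per fault would take time proportional to the exponential count itself.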
The first phase must be able to sensitize multiple paths. The second phase must be able to justify the assigned line transitions of as many sensitized paths as possible. The goal in a non-enumerative ATPG is to generate a pair of patterns that sensitizes and justifies the transitions on all the lines of a subcircuit. Clearly, the average number of paths in each examined subcircuit must be an exponential quantity when the number of paths in the circuit is exponential. Thus, a necessary condition for the path sensitization phase is to generate, on average, subgraphs with large size. The ATPG tools described in this section generate pairs of test patterns for robust path delay faults.24,25 Both tools target an efficient path sensitization phase. A necessary condition for the paths of a subcircuit to be simultaneously sensitized is to be structurally compatible with respect to the parity (on the number of inverters) between any two reconvergent nodes in the subcircuit. This concept is illustrated in Fig. 68.9.

FIGURE 68.9

A graph consisting of structurally compatible paths.


Consider the circuit on the top portion of Fig. 68.9. The subgraph induced by the thick edges consists of two structurally compatible paths. These two paths share two OR gates. The two subpaths that share the same OR gate endpoints have even parity. Any graph that consists of structurally compatible paths is called a structurally compatible (SG) graph. The tools in Refs. 24 and 25 consider a special case of SG graphs with a single primary input and a single primary output. We call such an SG graph a primary compatible SG graph (PCG graph). For the same pair of primary input and output nodes in the circuit, there may be many different PCG graphs, which are called sibling PCG graphs. Sibling PCG graphs contain mutually incompatible paths. The subgraph induced by the thick edges on the bottom portion of Fig. 68.9 shows a PCG that is sibling to the one on the top portion. This graph also contains two paths (the ones induced by the thick edges). The ATPG tool in Ref. 25 generates large sibling PCGs for every pair of primary input and output nodes in the circuit. The size of each returned PCG is measured in terms of the number of structurally compatible paths that satisfy the requirements for robust propagation described earlier. Experimentation in Ref. 25 shows that the line justification phase satisfies the constraints along paths in a manner proportional to the size of the graph returned by the multiple path sensitization phase. Given a pair of primary input and primary output nodes, Ref. 25 constructs large sibling PCGs as follows. Initially, a small number of lines in the circuit are removed so that the subcircuit between the selected primary inputs and outputs is a series-parallel graph. A polynomial time algorithm is applied on the series-parallel graph, which finds the maximum number of structurally compatible paths that satisfy the conditions for robust propagation.
An intermediate tree structure is maintained, which helps extract many such large sibling PCGs for the same pair of primary input and output nodes. Finally, many previously deleted edges are inserted so that the size of the sibling PCGs is increased further by considering paths that do not necessarily belong to the previously constructed series-parallel graph. Once a pair of patterns is generated by the ATPG tool in Ref. 25, fault simulation must be done so that the number of robust paths detected by the generated pair of patterns can be determined. The fault simulation problem for the path delay fault model is not as easy as for the stuck-at model. The difficulty lies in the fact that the number of path delay faults is not necessarily a polynomial quantity. Each pair of patterns generated by the CAD tool in Ref. 25 targets robust path delay faults in a particular sibling PCG. It may, however, detect robust path delay faults in the portion of the circuit outside the targeted PCG. This complicates the fault simulation process. Thus, Ref. 25 suggests that faults be simulated only within the current PCG, in which case a simple topological graph traversal suffices to detect them.

On-chip TPG Aspects

Many on-chip TPG schemes have recently been proposed for generating pairs of patterns. They are classified as either pseudo-exhaustive/pseudorandom or deterministic. A pseudo-exhaustive scheme for generating pairs of patterns on-chip is proposed in Ref. 26. The method is based on a simple LFSR that has 2 · w cells for a circuit with w inputs. Every other LFSR cell is connected to a circuit input. In particular, all the LFSR cells at even positions are connected to circuit inputs, and the remaining LFSR cells are used for "destroying" the shift dependency of the contents in the LFSR cells at even positions. The cells at odd positions are also called separation cells.
Since the contents of the latter cells are independent, the scheme can generate all the possible two-input patterns. The schematic of the approach is given in Fig. 68.10. Such an LFSR scheme is called a full-input separation LFSR.26 It requires a significant hardware overhead and long wire feedback connections. A CAD tool is presented in Ref. 26 that reduces the size of the hardware overhead and the wire lengths by simply observing that separation cells must exist between any two LFSR cells that are connected to inputs that affect at least one circuit output. For each circuit output o, the Io set which contains the labels of all the input cells of the full separation LFSR which affect o is constructed. Then, an LFSR cell relabeling CAD tool is proposed which minimizes the total number of separation cells so that the labels of all Ios are even numbers.26 Weighted random LFSRs can be used for on-chip deterministic TPG of pairs of patterns. Let us, for simplicity, consider the embedding problem. Here, the goal is to reproduce on-chip a matrix T consisting

of n pairs of patterns (pi1, pi2), 1 ≤ i ≤ n, each of size w, that have been generated by an ATPG tool such as the one described in the previous section. A simple approach is to use a weighted random LFSR that generates n patterns pi of size 2w. Every pattern pi is simply the concatenation of patterns pi1 and pi2. Once pattern pi is generated, a simple circuit consisting of two-to-one multiplexers "splits" pattern pi into its two patterns pi1 and pi2 and, in addition, guarantees that patterns pi1 are applied at even clock pulses and patterns pi2 are applied at odd clock pulses. The schematic of the approach is given in Fig. 68.11.

FIGURE 68.10 The schematic of an LFSR-based scheme for pseudo-exhaustive on-chip TPG.

FIGURE 68.11 The schematic of a weighted random LFSR-based approach for deterministic on-chip TPG.
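The behavior of the splitting stage can be modeled in software as follows. This is only a behavioral sketch under assumed conventions: patterns are bit strings, the first w bits of each 2w-bit pattern form pi1, and clock pulses are numbered from 0; none of these details is specified in the text.

```python
from typing import Iterable, Iterator, Tuple


def split_pair(pattern: str, w: int) -> Tuple[str, str]:
    """Split a 2w-bit pattern into its two w-bit halves (p_i1, p_i2)."""
    if len(pattern) != 2 * w:
        raise ValueError("pattern must have exactly 2*w bits")
    return pattern[:w], pattern[w:]


def apply_schedule(patterns: Iterable[str], w: int) -> Iterator[Tuple[int, str]]:
    """Yield (clock pulse, input vector): p_i1 on even pulses, p_i2 on odd ones."""
    for i, pattern in enumerate(patterns):
        first, second = split_pair(pattern, w)
        yield 2 * i, first
        yield 2 * i + 1, second


if __name__ == "__main__":
    for clock, vector in apply_schedule(["0110", "1001"], w=2):
        print(clock, vector)
```

In hardware, the role of the even/odd selection would be played by the select input of the two-to-one multiplexers.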

Fault Simulation and Estimation

Exact fault simulation for path delay faults is not trivial, regardless of the model used to propagate the delays (robust, non-robust, functionally testable path delay faults). The number of path delay faults remains, in the worst case, exponential, independent of propagation restrictions. Ref. 27 presents an exact simulation CAD tool for any type of path delay fault. The drawback of the approach in Ref. 27 is that it may require exponential time (and space) complexity, although experimentation has shown that in practice it is very efficient. The following describes CAD tools for obtaining lower bounds on the number of path delay faults detected by a given set of n pairs of patterns. These approaches apply to any type of path delay fault and are referred to as fault estimation schemes. In Ref. 28, every time a pair of patterns is applied, the CAD tool examines whether there exists at least one line where either a rising or falling transition has not been encountered by the previously applied pairs of test patterns. Let Ei, 1 ≤ i ≤ n, denote the set of lines for which either a rising or a falling transition occurs for the first time when the pair of patterns Pi is applied. When |Ei| > 0, a new set of path delay faults is detected by pattern Pi. These are the paths that contain lines in Ei. A simple topological search of the combinational circuit suffices to detect their number. If for some Pi we have |Ei| = 0, the approach does not detect any path delay faults.
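The estimation idea can be sketched as follows. This is one plausible reading of the scheme, not the algorithm of Ref. 28: each pair of patterns is represented by the subgraph of lines it sensitizes (lines are graph nodes), all transitions are assumed to have the same direction, and the paths containing a line of Ei are counted as all paths in the subgraph minus those that avoid Ei. All names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Sequence, Set, Tuple

Graph = Dict[str, Sequence[str]]
Pair = Tuple[Graph, List[str], List[str]]  # (successor lists, start lines, end lines)


def count_paths(succ: Graph, sources: Iterable[str], sinks: Iterable[str],
                banned: Iterable[str] = frozenset()) -> int:
    """Count source-to-sink paths that avoid every line in `banned`."""
    sink_set, banned_set = set(sinks), set(banned)
    memo: Dict[str, int] = {}

    def walk(v: str) -> int:
        if v in banned_set:
            return 0
        if v not in memo:
            here = 1 if v in sink_set else 0
            memo[v] = here + sum(walk(s) for s in succ.get(v, ()))
        return memo[v]

    return sum(walk(s) for s in sources)


def estimate_detected(pairs: Sequence[Pair]) -> int:
    """Lower bound on the number of detected paths over a sequence of pattern pairs."""
    seen: Set[str] = set()
    detected = 0
    for succ, sources, sinks in pairs:
        lines = set(succ) | {s for outs in succ.values() for s in outs}
        new_lines = lines - seen  # the set E_i of first-time transitions
        if new_lines:
            total = count_paths(succ, sources, sinks)
            old_only = count_paths(succ, sources, sinks, banned=new_lines)
            detected += total - old_only
        seen |= lines
    return detected


if __name__ == "__main__":
    p1 = ({"a": ["c"], "c": ["d"]}, ["a"], ["d"])
    p2 = ({"b": ["c"], "c": ["e"]}, ["b"], ["e"])
    p3 = ({"a": ["c"], "c": ["e"]}, ["a"], ["e"])  # new path, but no new line
    print(estimate_detected([p1, p2, p3]))
```

In the small example, the third pair sensitizes a path that was never exercised before, yet it adds nothing to the count because every one of its lines has already seen a transition; this is why the result is only a lower bound.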


FIGURE 68.12

An undetected path delay fault.

The approach in Ref. 28 is non-enumerative but returns a conservative lower bound on the number of detected paths. Figure 68.12 illustrates a case where a path delay fault may not be counted. Assume that the path delay faults for all three pairs of patterns start with a rising transition. Furthermore, assume that the first pair of patterns detects path delay faults along all the paths of the subgraph that is covered by thick edges. Let the second pair of patterns detect path delay faults on all the paths of the subgraph covered by dotted edges, and let the dashed path indicate a path delay fault detected by the third pair of patterns. Clearly, the latter path delay fault cannot be detected by the approach in Ref. 28. For this reason, Ref. 28 suggests that fault simulation be done by virtually partitioning the circuit into subcircuits. The subcircuits should contain disjoint paths. One implementation of such a partitioning scheme is to consider lines that are independent in the sense that there is no physical path in the circuit that contains any two selected lines. Once a line is selected, we form a subcircuit that consists of all lines that depend on the selected line. In addition, the selected lines must form a cut separating the inputs from the outputs so that every physical path contains a selected line. This way, every path delay fault belongs to exactly one subcircuit. Figure 68.13 below shows three selected lines (the thick lines) of the circuit in Fig. 68.12 that are independent and also separate the inputs from the outputs.

FIGURE 68.13

Three independent lines that form a cut.


FIGURE 68.14

All paths are detected using three subcircuits.

Figure 68.14 contains the subcircuits corresponding to these lines. The first pattern detects path delay faults in the first two subcircuits, and the second pattern detects path delay faults in the third subcircuit. The path delay fault of Fig. 68.12 that is missed for the third pattern is detected on the third subcircuit because, in that subcircuit, its first line does not have a marked rising transition when the third pair of patterns is applied. Reference 29 gives a new dimension to the latter problem. Such a cut of lines is called a strong cut. The idea is to find a maximum strong cut that allows for a maximum collection of subcircuits where fault coverage estimation can take place. A CAD tool is presented in Ref. 29 that returns such a maximum cardinality strong cut. The problem reduces to that of finding a maximum weighted independent set in a comparability graph, which is solvable in polynomial time using a minimum flow technique. There is no formal proof that more subcircuits yield a better fault coverage estimation. However, experimentation verifies this assertion (Ref. 29).

Another CAD tool is given in Ref. 30. Every time a new pair of patterns is applied, the approach searches for sequences of rising and falling transitions on segments that terminate (or originate) at a given line. Therefore, if the CAD tool is implemented using segments of size two, every line can have up to four associated transitions. This enhances fault coverage estimation because new paths can be identified when a new sequence of transitions occurs through a line instead of a single transition.

References

1. S.N. Bhatt, F.R.K. Chung, and A.L. Rosenberg, Partitioning Circuits for Improved Testability, Proc. MIT Conference on Advanced Research in VLSI, 91, 1986.
2. W.B. Jone and C.A. Papachristou, A Coordinated Approach to Partitioning and Test Pattern Generation for Pseudoexhaustive Testing, Proc. 26th ACM/IEEE Design Automation Conference, 525, 1989.
3. D. Kagaris and S. Tragoudas, Cost-Effective LFSR Synthesis for Optimal Pseudoexhaustive BIST Test Sets, IEEE Transactions on VLSI Systems, 1, 526, 1993.
4. R. Srinivasan, S.K. Gupta, and M.A. Breuer, An Efficient Partitioning Strategy for Pseudo-Exhaustive Testing, Proc. 30th ACM/IEEE Design Automation Conference, 242, 1993.
5. D. Kagaris and S. Tragoudas, Avoiding Linear Dependencies for LFSR Test Pattern Generators, Journal of Electronic Testing: Theory and Applications, 6, 229, 1995.
6. B. Reeb and H.J. Wunderlich, Deterministic Pattern Generation for Weighted Random Pattern Testing, Proc. European Design and Test Conference, 30, 1996.
7. D. Kagaris, S. Tragoudas, and A. Majumdar, On the Use of Counters for Reproducing Deterministic Test Sets, IEEE Transactions on Computers, 45, 1405, 1996.
8. S. Narayanan and M.A. Breuer, Asynchronous Multiple Scan Chains, Proc. IEEE VLSI Test Symposium, 270, 1995.
9. C.E. Leiserson and J.B. Saxe, Retiming Synchronous Circuitry, Algorithmica, 6, 5, 1991.
10. D. Kagaris and S. Tragoudas, Retiming-based Partial Scan, IEEE Transactions on Computers, 45, 74, 1996.
11. S.T. Chakradhar and S. Dey, Resynthesis and Retiming for Optimum Partial Scan, Proc. 31st Design Automation Conference, 87, 1994.
12. P. Pan and C.L. Liu, Partial Scan with Preselected Scan Signals, Proc. 32nd Design Automation Conference, 189, 1995.
13. R. Gupta, R. Gupta, and M.A. Breuer, The BALLAST Methodology for Structured Partial Scan Design, IEEE Transactions on Computers, 39, 538, 1990.
14. A. El-Maleh, T. Marchok, J. Rajski, and W. Maly, On Test Set Preservation of Retimed Circuits, Proc. 32nd ACM/IEEE Design Automation Conference, 341, 1995.
15. M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, Computer Science Press, 1990.
16. A.P. Stroele and H.-J. Wunderlich, Test Register Insertion with Minimum Hardware Cost, Proc. International Conference on Computer-Aided Design, 95, 1995.
17. S. Boubezari and B. Kaminska, A Deterministic Built-In Self-Test Generator Based on Cellular Automata Structures, IEEE Transactions on Computers, 44, 805, 1995.
18. D. Kagaris and S. Tragoudas, Cellular Automata for Generating Deterministic Test Sequences, Proc. European Design and Test Conference, 77, 1997.
19. J.A. Waicukauski, E.B. Eichelberger, D.O. Forlenza, E. Lindbloom, and T. McCarthy, Fault Simulation for Structured VLSI, VLSI Systems Design, 6, 20, 1985.
20. M. Abramovici, P.R. Menon, and D.T. Miller, Critical Path Tracing: An Alternative to Fault Simulation, IEEE Design and Test of Computers, 1, 83, 1984.
21. K.-T. Cheng and H.-C. Chen, Delay Testing for Robust Untestable Faults, Proc. International Test Conference, 954, 1993.
22. W.K. Lam, A. Saldanha, R.K. Brayton, and A.L. Sangiovanni-Vincentelli, Delay Fault Coverage and Performance Tradeoffs, Proc. Design Automation Conference, 446, 1993.
23. M.A. Gharaybeh, M.L. Bushnell, and V.D. Agrawal, Classification and Test Generation for Path-Delay Faults Using Stuck-Fault Tests, Proc. International Test Conference, 139, 1995.
24. I. Pomeranz, S.M. Reddy, and P. Uppaluri, NEST: A Nonenumerative Test Generation Method for Path Delay Faults in Combinational Circuits, IEEE Transactions on CAD, 14, 1505, 1995.
25. D. Karayiannis and S. Tragoudas, ATPD: An Automatic Test Pattern Generator for Path Delay Faults, Proc. International Test Conference, 443, 1996.
26. J. Savir, Delay Test Generation: A Hardware Perspective, Journal of Electronic Testing: Theory and Applications, 10, 245, 1997.
27. M.A. Gharaybeh, M.L. Bushnell, and V.D. Agrawal, An Exact Non-Enumerative Fault Simulator for Path-Delay Faults, Proc. International Test Conference, 276, 1996.
28. I. Pomeranz and S.M. Reddy, An Efficient Nonenumerative Method to Estimate the Path Delay Fault Coverage in Combinational Circuits, IEEE Transactions on Computer-Aided Design, 13, 240, 1994.
29. D. Kagaris, S. Tragoudas, and D. Karayiannis, Improved Nonenumerative Path Delay Fault Coverage Estimation Based on Optimal Polynomial Time Algorithms, IEEE Transactions on Computer-Aided Design, 3, 309, 1997.
30. K. Heragu, V.D. Agrawal, M.L. Bushnell, and J.H. Patel, Improving a Nonenumerative Method to Estimate Path Delay Fault Coverage, IEEE Transactions on Computer-Aided Design, 7, 759, 1997.


Long, S.I. "Materials" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


69 Materials

Stephen I. Long
University of California at Santa Barbara

69.1 Introduction
69.2 Compound Semiconductor Materials
69.3 Why III-V Semiconductors?
69.4 Heterojunctions

69.1 Introduction

Very-high-speed digital integrated circuit design is a multidisciplinary challenge. First, there are several IC technologies available for very-high-speed applications. Each of these claims to offer unique benefits to the user. In order to choose the most appropriate or cost-effective technology for a particular application or system, the designer must understand the materials, the devices, the limitations imposed by process on yields, and the thermal limitations due to power dissipation. Second, very-high-speed digital ICs present design challenges if the inherent performance of the devices is to be retained. At the upper limits of speed, there are no digital circuits, only analog. Circuit design techniques formerly thought to be exclusively in the domain of analog IC design are effective in optimizing digital IC designs for highest performance. Finally, system integration when using the highest-speed technologies presents an additional challenge. Interconnections, clock and power distribution, both on-chip and off-chip, require much care and often restrict the achievable performance of an IC in a system.

The entire scope of very-high-speed digital design is much too vast to present in a single tutorial chapter. Therefore, we must focus the coverage in order to provide some useful tools for the designer. We will focus primarily on compound semiconductor technologies in order to restrict the scope. Silicon IC design tutorials can be found in other chapters in this handbook. This chapter gives a brief introduction to compound semiconductor materials in order to justify the use of non-silicon materials for the highest-speed applications. The transport properties of several materials are compared. Second, a technology-independent description of device operation for high-speed or high-frequency applications will be given in Chapter 70. The charge control methodology provides insight and connects the basic material properties and device geometry with performance. Chapter 71 describes the design basics of very-high-speed ICs. Static design methods are illustrated with compound semiconductor circuit examples, but are based on generic principles such as noise margin. The transient design methods emphasize analog circuit techniques and can be applied to any technology. Finally, Chapter 72 describes typical circuit design approaches using FET and bipolar device technologies and presents applications of current interest.

69.2 Compound Semiconductor Materials

The compound semiconductor family is composed of the group III and group V elements shown in Table 69.1. Each semiconductor is formed from at least one group III and one group V element. Group IV elements such as C, Si, and Ge are used as dopants, as are several group II and VI elements such as Be or Mg for p-type and Te and Se for n-type. Binary semiconductors such as GaAs and InP can be grown in large single-crystal ingot form using the liquid-encapsulated Czochralski method¹ and are the materials of choice for substrates. At the present time, GaAs wafers with diameters of 100 and 150 mm are most widely used. InP is still limited to 75 mm diameter.

Three or four elements are often mixed together when grown as thin epitaxial films on top of the binary substrates. The alloys thus formed allow electronic and structural properties such as bandgap and lattice constant to be varied as needed for device purposes. Junctions between different semiconductors can be used to further control charge transport, as discussed in Section 69.4.

TABLE 69.1 Column III, IV, and V Elements Associated with Compound Semiconductors

III    IV     V
B      C      N
Al     Si     P
Ga     Ge     As
In     Sn     Sb

69.3 Why III-V Semiconductors?

The main motivation for using the III-V compound semiconductors for device applications is found in their electronic properties when compared with those of the dominant semiconductor material, silicon. Figure 69.1 is a plot of steady-state electron velocity of several n-type semiconductors versus electric field. From this graph, we see that at low electric fields the slope of the III-V curves (the mobility) is higher than that of silicon. High mobility means that the resistivity will be lower for III-V n-type materials, and it may be easier to achieve lower access resistance. Access resistance is the series resistance between the device contacts and the internal active region. An example would be the base resistance of a bipolar transistor. Lower resistance will reduce some of the fundamental device time constants, to be described in Chapter 70, that often dominate device high-frequency performance.

Figure 69.1 also shows that the peak electron velocity is higher for III-V materials, and the peak velocity can be achieved at much lower electric fields. High velocity reduces transit time, the time required for a charge carrier to travel from its source to its destination, and improves device high-frequency performance, also discussed in Chapter 70. Achieving this high velocity at lower electric fields means that the devices will reach their peak performance at lower voltages, which is useful for low-power, high-speed applications. Mobilities and peak velocities of several semiconductors are compared in Table 69.2.

FIGURE 69.1 Electron velocity versus electric field for several n-type semiconductors.

TABLE 69.2 Comparison of Mobilities and Peak Velocities of Several n- and p-type Semiconductors

Semiconductor     EG (eV)   εr     Electron Mobility   Hole Mobility   Peak Electron Velocity
                                   (cm²/V-s)           (cm²/V-s)       (cm/s)
Si (bulk)         1.12      11.7   1450                450             N.A.
Ge                0.66      15.8   3900                1900            N.A.
InP               1.35 D    12.4   4600                150             2.1 × 10⁷
GaAs              1.42 D    13.1   8500                400             2 × 10⁷
Ga0.47In0.53As    0.78 D    13.9   11,000              200             2.7 × 10⁷
InAs              0.35 D    14.6   22,600              460             4 × 10⁷
Al0.3Ga0.7As      1.80 D    12.2   1000                100             -
AlAs              2.17      10.1   280                 -               -
Al0.48In0.52As    1.92 D    12.3   800                 100             -

Note: In the bandgap energy column, the symbol “D” indicates a direct bandgap; otherwise, the bandgap is indirect. T = 300 K and “weak doping” limit.

On the other hand, as also shown in Table 69.2, p-type III-V semiconductors have rather poor hole mobility when compared with elemental semiconductor materials such as silicon or germanium. Holes also reach their peak velocities at much higher electric fields than electrons. Therefore, p-type III-V materials are used where they are needed, for example in the base of a bipolar transistor, but their thickness must be extremely small to avoid degradation in transit time. Lateral distances must also be small to avoid excessive series resistance. CMOS-like complementary FET technologies have also been developed,² but their performance has been limited by the poorer speed of the p-channel devices.
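To put the mobility comparison of Table 69.2 in concrete terms, the short sketch below evaluates the standard low-field resistivity relation ρ = 1/(q·n·μ) for a few n-type materials. The donor density `N_D` is a hypothetical value chosen to stay near the weak-doping limit of the table, and the function name is an illustrative choice, not something defined in this chapter.

```python
# Sketch: low-field resistivity rho = 1 / (q * n * mu) for n-type material.
Q_E = 1.602e-19  # electron charge, C

# Electron mobilities from Table 69.2, cm^2/V-s (300 K, weak-doping limit).
ELECTRON_MOBILITY = {
    "Si": 1450.0,
    "InP": 4600.0,
    "GaAs": 8500.0,
    "Ga0.47In0.53As": 11000.0,
}


def resistivity_ohm_cm(mobility_cm2_per_vs, donor_density_cm3):
    """Resistivity in ohm-cm, assuming each donor contributes one electron."""
    return 1.0 / (Q_E * donor_density_cm3 * mobility_cm2_per_vs)


N_D = 1e15  # ASSUMED donor density, cm^-3 (hypothetical, low doping)

rho_si = resistivity_ohm_cm(ELECTRON_MOBILITY["Si"], N_D)
for name, mu in ELECTRON_MOBILITY.items():
    rho = resistivity_ohm_cm(mu, N_D)
    print(f"{name:16s} rho = {rho:6.3f} ohm-cm  ({rho / rho_si:.2f} x Si)")
```

At equal doping the resistivity scales inversely with mobility, which is the sense in which higher-mobility III-V layers can make low access resistance easier to reach. At the heavier doping levels used in real contact regions the mobilities fall below the table values, so the numbers here are only a first-order comparison.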

69.4 Heterojunctions

In the past, most semiconductor devices were composed of a single semiconductor material, such as silicon or gallium arsenide, and employed n- and p-type doping to control charge transport. Figure 69.2(a) illustrates an energy band diagram of a semiconductor with uniform composition that is in an applied electric field. Electrons will drift downhill and holes will drift uphill in the applied electric field. The electrons and/or holes could be produced by doping or by ionization due to light. In a heterogeneous semiconductor, as shown in Fig. 69.2(b), the bandgap can be graded from wide bandgap on the left to narrow on the right by varying the composition. In this case, even without an applied electric field, a built-in quasi-electric field is produced by the bandgap variation that will transport both holes and electrons in the same direction.

FIGURE 69.2 (a) Homogeneous semiconductor in uniform electric field, and (b) Heterogeneous semiconductor with graded energy gap. No applied electric field.

The abrupt heterojunction formed by an atomically abrupt transition between AlGaAs and GaAs, shown in the energy band diagram of Fig. 69.3, creates discontinuities in the valence and conduction bands. The conduction band energy discontinuity is labeled ∆EC and the valence band discontinuity, ∆EV. Their sum equals the energy bandgap difference between the two materials. The potential energy steps caused by these discontinuities are used as barriers to electrons or holes. The relative sizes of these potential barriers depend on the composition of the semiconductor materials on each side of the heterojunction. In this example, an electron barrier in the conduction band is used to confine carriers into a narrow potential energy well with triangular shape. Quantum well structures such as these are used to improve device performance through two-dimensional charge transport channels, similar to the role played by the inversion layer in MOS devices. The structure and operation of heterojunctions in FETs and BJTs will be described in Chapter 70.

FIGURE 69.3 Energy band diagram of an abrupt heterojunction.

The overall principle of the use of heterojunctions is summarized in a Central Design Principle: “Heterostructures use energy gap variations in addition to electric fields as forces acting on holes and electrons to control their distribution and flow.”³,⁴ The energy barriers can control motion of charge both across the heterojunction and in the plane of the heterojunction. In addition, heterojunctions are most widely used in light-emitting devices, since the compositional differences also lead to either stepped or graded index of refraction, which can be used to confine, refract, and reflect light. The barriers also control the transport of holes and electrons in the light-generating regions.

Figure 69.4 shows a plot of bandgap versus lattice constant for many of the III-V semiconductors.³ Consider GaAs as an example. GaAs and AlAs have the same lattice constant (approximately 0.56 nm) but different bandgaps (1.4 and 2.2 eV, respectively). An alloy semiconductor, AlGaAs, can be grown epitaxially on a GaAs substrate wafer using standard growth techniques. The composition can be selected by the Al-to-Ga ratio, giving a bandgap that can be chosen across the entire range from GaAs to AlAs. Since both lattice constants are essentially the same, very low lattice mismatch can be achieved for any composition of AlxGa1-xAs. Lattice matching permits low-defect-density, high-quality materials to be grown that have good electronic and optical properties.

FIGURE 69.4 Energy bandgap versus lattice constant for compound semiconductor materials.

It quickly becomes apparent from Fig. 69.4, however, that a requirement for lattice matching to the substrate greatly restricts the combinations of materials available to the device designer. For electron devices, only the low-mismatch GaAs/AlAs alloys, GaSb/AlSb alloys, Al.48In.52As/InP/Ga.47In.53As, and GaAs/In.49Ga.51As combinations are available. Efforts to utilize lattice-matched combinations such as GaP on Si or GaAs on Ge have been generally unsuccessful because of problems with interface structure, polarization, and autodoping.

For several years, lattice matching was considered to be a necessary condition if mobility-damaging defects were to be avoided. This barrier was later broken when it was discovered that high-quality semiconductor materials could still be obtained from lattice-mismatched layers if the thickness of the mismatched layer is sufficiently small.⁵,⁶ This technique, called pseudomorphic growth, opened another dimension in III-V device technology, and allowed device structures to be optimized over a wider range of bandgap for better electron or hole dynamics and optical properties.

Two of the pseudomorphic systems that have been very successful in high-performance millimeter-wave FETs are the InAlAs/InGaAs/GaAs and InAlAs/InGaAs/InP systems. The InxGa1-xAs layer is responsible for the high electron mobility and velocity, both of which improve as the In concentration x is increased. Values up to x = 0.25 for GaAs substrates and x = 0.80 for InP substrates have been demonstrated and result in great performance enhancements when compared with lattice-matched combinations.
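Returning to the abrupt AlGaAs/GaAs heterojunction of Fig. 69.3, the statement that the two band discontinuities sum to the bandgap difference can be made concrete with the bandgaps listed in Table 69.2 for Al0.3Ga0.7As and GaAs. This is only a worked illustration: how the total divides between ∆EC and ∆EV is not given in this section, so only the sum is evaluated, and the value kT ≈ 0.0259 eV at 300 K is a standard figure rather than one taken from the text.

```latex
\begin{align*}
\Delta E_C + \Delta E_V
  &= E_G(\mathrm{Al_{0.3}Ga_{0.7}As}) - E_G(\mathrm{GaAs}) \\
  &= 1.80\ \mathrm{eV} - 1.42\ \mathrm{eV} = 0.38\ \mathrm{eV} \\
\frac{\Delta E_C + \Delta E_V}{kT}\bigg|_{T = 300\ \mathrm{K}}
  &\approx \frac{0.38\ \mathrm{eV}}{0.0259\ \mathrm{eV}} \approx 15
\end{align*}
```

A total step of roughly 15 kT at room temperature indicates why even a modest Al fraction produces barriers large enough to confine carriers.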


Estreich, D.B. "Compound Semiconductor Devices for Digital Circuits" The VLSI Handbook. Ed. Wai-Kai Chen. Boca Raton: CRC Press LLC, 2000


70 Compound Semiconductor Devices for Digital Circuits

Donald B. Estreich
Hewlett-Packard Company

70.1 Introduction
70.2 Unifying Principle for Active Devices: Charge Control Principle
70.3 Comparing Unipolar and Bipolar Transistors
     Charge Transport in Semiconductors • Field-Effect (Unipolar) Transistor • Bipolar Junction Transistors (Homojunction and Heterojunction) • Comparing Parameters
70.4 Typical Device Structures
     FET Structures • FET Performance • Heterojunction Bipolar Structures • HBT Performance

70.1 Introduction

An active device is an electron device, such as a transistor, capable of delivering power amplification by converting dc bias power into time-varying signal power. It delivers more energy to its load than would be delivered if the device were absent. The charge control framework⁷⁻⁹ discussed below presents a unified understanding of the operation of all electron devices and simplifies the comparison of the several active devices used in digital integrated circuits.

70.2 Unifying Principle for Active Devices: Charge Control Principle

Consider a generic electron device as represented in Fig. 70.1. It consists of three electrodes encompassing a charge transport region. The transport region is capable of supporting charge flow (electrons as shown in the figure) between an emitting electrode and a collecting electrode. A third electrode, called the control electrode, is used to establish the electron concentration within the transport region. Placing a control charge, QC, on the control electrode establishes a controlled charge, denoted as –Q, in the transport region.

FIGURE 70.1 Generic charge control device consisting of three electrodes embedded around a charge transport region.

The operation of active devices depends on the charge control principle:⁷ Each charge placed upon the control electrode can at most introduce an equal and opposite charge in the transport region between the emitting and collecting electrodes. At most, we have the relationship –Q = QC. Any parasitic coupling of the control charge to charge on the other electrodes, or remote parts of the device, will decrease the controlled charge in the transport region; that is, –Q < QC more generally. For example, charge coupling between the control electrode and the collecting electrode forms a feedback or output capacitance, say Co. Time variation of QC leads to the modulation of the current flow between emitting and collecting electrodes.

The generic structure in Fig. 70.1 could represent any one of a number of active devices (e.g., vacuum tubes, unipolar transistors, bipolar transistors, photoconductors, etc.). Hence, charge control analysis is very broad in scope, since it applies to all electronic transistors. Starting from the charge control principle, we associate two characteristic time constants with an active device, thereby leading to a first-order description of its behavior.

Application of a potential difference between the emitting and collecting electrodes, say VCC, establishes an electric field in the transport region. Electrons in the transport region respond to the electric field and move across this region with a transit time τr. The transit time¹ is the first of the two important characteristic times used in charge control modeling. With charge –Q in the transport region, the static (dc) current Io between emitting and collecting electrodes is

\[ I_o = \frac{-Q}{\tau_r} = \frac{Q_C}{\tau_r} \tag{70.1} \]

A simple interpretation of τr is as follows: τr is equal to the length l of the transport region, divided by the average velocity of transit (i.e., τr = l/〈v〉). From this perspective, a charge of –Q (coulombs) is swept out the collecting electrode every τr seconds. Now consider Fig. 70.2, showing the common-emitting electrode connection of the active device of Fig. 70.1 connected to input and output (i.e., load) resistances, say Rin and RL , respectively. The second characteristic time of importance can now be defined: It is the lifetime time constant, and we denote it by τ. It is a measure of how long a charge placed on the control electrode will remain on the control terminal. The lifetime time constant is established in one of several ways, depending on the physics of the active device and/or its connection. The controlling charge may “leak away” by (1) discharging through the external resistor Rin as typically happens with FET devices, (2) recombining with intermixed oppositely charged carriers within the device (e.g., base recombination in a bipolar transistor), or (3) discharging through an internal shunt leakage path within the device. The dc current flowing to replenish the lost control charge is given by

\[ I_{in} = \frac{-Q}{\tau} = \frac{Q_C}{\tau} \tag{70.2} \]

The static (dc) current gain GI of a device is defined as the current delivered to the output, divided by the current replenishing the control charge during the same time period. In τ seconds, during which charge –Q is both lost and replenished, charge QC times the ratio τ/τr has been supplied to the output resistor RL. In symbols, the static current gain is

\[ G_I = \frac{I_o}{I_{in}} = \frac{\tau}{\tau_r} \tag{70.3} \]

provided –Q = QC holds.

¹ The transit time τr is best interpreted as an average transit time per carrier (electron). We note that 1/τr is common to all devices; it is related to a device’s ultimate capability to process information.

FIGURE 70.2 Generic charge control device of Fig. 70.1 connected to input and output resistors, Rin and RL, respectively, with bias voltage and input signal applied.

In the dynamic case, the process of small-signal amplification consists of an incremental variation of the control charge QC directly resulting in an incremental change in the controlled charge, –Q. The resulting variation in output current flowing in the load resistor translates into a time-varying voltage vo. The charge control formalism holds just as well for large-signal situations. In the large-signal case, the changes in control charge are no longer small incremental changes. Charge control analysis under large charge variations is less accurate due to the simplicity of the model, but still very useful for approximate switching calculations in digital circuits.

An important dynamic parameter is the input capacitance Ci of the active device. Capacitance Ci is a measure of the work required to introduce a charge carrier in the transport region. Capacitance Ci is given by the change in charge Q for a corresponding change in input voltage vin. It is desirable to maximize Ci in an active device. The transconductance gm is calculated from

\[ g_m = \left.\frac{\partial I_o}{\partial v_{in}}\right|_{v_o} = \frac{\partial I_o}{\partial Q} \cdot \frac{\partial Q}{\partial v_{in}} \tag{70.4} \]

The first partial derivative on the right-hand side of Eq. 70.4 is simply (1/τr), and the second partial derivative is Ci . Hence, the transconductance gm is the ratio

\[ g_m = \frac{C_i}{\tau_r} \tag{70.5} \]

A physical interpretation of gm is the ratio of the work required to introduce a charge carrier to the average transit time of a charge carrier in the transport region. The transconductance is one of the most commonly used device parameters in circuit design and analysis.
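The static relations of Eqs. (70.1) through (70.3) and the transconductance of Eq. (70.5) can be exercised numerically with a few lines of code. The sketch below is only an illustration under the ideal assumption –Q = QC; the function names and the component values are hypothetical and were chosen for round numbers, not taken from any particular device.

```python
def static_currents(control_charge, transit_time, lifetime):
    """Return (Io, Iin, GI) per Eqs. (70.1)-(70.3), assuming -Q = QC."""
    i_out = control_charge / transit_time  # Eq. (70.1): Io = QC / tau_r
    i_in = control_charge / lifetime       # Eq. (70.2): Iin = QC / tau
    return i_out, i_in, i_out / i_in       # Eq. (70.3): GI = tau / tau_r


def transconductance(input_capacitance, transit_time):
    """gm = Ci / tau_r, Eq. (70.5)."""
    return input_capacitance / transit_time


# ASSUMED illustrative values (hypothetical device).
Q_C = 1e-15     # control charge, C
TAU_R = 5e-12   # transit time, s
TAU = 500e-12   # control-charge lifetime, s
C_I = 50e-15    # input capacitance, F

i_o, i_in, g_i = static_currents(Q_C, TAU_R, TAU)
print(f"Io = {i_o * 1e6:.1f} uA, Iin = {i_in * 1e6:.1f} uA, GI = {g_i:.0f}")
print(f"gm = {transconductance(C_I, TAU_R) * 1e3:.1f} mS")
```

With these numbers the lifetime is one hundred times the transit time, so each unit of control charge is reused about one hundred times before it must be replenished, which is the static current gain.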


FIGURE 70.3 Two-port, small-signal, admittance charge control model with the emitting electrode selected as the common terminal to both input and output.

In addition to Ci, another capacitance, say Co, is introduced and associated with the collecting electrode. Capacitance Co accounts for charge on the collecting electrode coupled to either static charge in the transport region or charge on the control electrode. A non-zero Co indicates that the coupling between the controlling electrode and the charge in transit is less than unity (i.e., –Q < QC).

For small-signal analysis, the capacitance parameters are usually taken as fixed numbers evaluated about the device’s bias state. When using charge control in the large-signal case, the capacitance parameters must include the voltage dependencies. For example, the input capacitance Ci can be strongly dependent upon the control electrode to emitting electrode and collecting electrode potentials. Hence, during the change in bias state within a device, the magnitude of the capacitance Ci is time varying. This variation can dramatically affect the switching speed of the active device. Parametric dependencies on the instantaneous bias state of the device are at the heart of accurate modeling of large-signal or switching behavior of active devices.

We introduce the small-signal admittance charge control model shown in Fig. 70.3. This model uses the emitting electrode as the common terminal in a two-port connection. The transconductance gm is the magnitude of the real part of the forward admittance yf and is represented as a voltage-controlled current source positioned from collecting-to-emitting electrode. The input admittance, denoted by yi, is equivalent to (Ci/τ), where τ is the control charge lifetime time constant. Parameter yi can be expressed in the form (gi + sCi), where s = jω. An output admittance, similarly denoted by yo, is given by (Co/τr), where τr is the transit time and, in general, yo = (go + sCo). Finally, the output-to-input feedback admittance yr is included using a voltage-controlled current source at the input. Often, yr is small enough to approximate as zero (the model is then said to be unilateral).

Consider the frequency dependence of the dynamic (ac) current gain Gi. The low-frequency current gain is interpreted as follows: an incremental charge qc is introduced on the control electrode with lifetime τ. This produces a corresponding incremental charge –q in the transport region. Charge –q is swept across the transport region every transit time τr seconds. In time τ, charge –q crosses the transport region τ/τr times, which is identically equal to the low-frequency current gain.


FIGURE 70.4 (a) Small-signal admittance model with output short-circuited, and (b) magnitude of the small-signal current gain Gi plotted as a function of frequency. The unity current gain crossover (i.e., Gi = 1) defines the parameter fT (or ωT).

The lifetime τ associated with the control electrode arises from charge “leaking off ” the controlling electrode. This is modeled as an RC time constant at the input of the equivalent circuit shown in Fig. 70.4(a) with τ equal to ReqCi . Req is the equivalent resistance presented to capacitor Ci . That is, Req is determined by the parallel combination of 1/gi and any external resistance at the input. The break frequency ωB associated with the control electrode is

\[ \omega_B = \frac{1}{\tau} = \frac{1}{R_{eq} C_i} \tag{70.6} \]

When the charge on the control electrode varies at a rate ω less than ωB , Gi is given by τ/τr because charge “leaks off ” the controlling electrode faster than 1/ω. Alternatively, when ω is greater than ωB , Gi decreases with increasing ω because the applied signal charge varies upon the control electrode more rapidly than 1/τ. In this case, Gi is inversely proportional to ω, that is,

\[ G_i = \frac{1}{\omega \tau_r} = \frac{\omega_T}{\omega} \tag{70.7} \]

where ωT is the common-emitter unity current gain frequency. At ω = ωT ( = 2πfT), the ac current gain equals unity, as illustrated in Fig. 70.4(b). Consider the current gain-bandwidth product Gi ∆f. A purely capacitive input impedance cannot define a bandwidth. However, a finite real impedance always appears at the input terminal in any practical application. Let Ri be the effective input resistance of the device (i.e., Ri will be equal to (1/gi) in parallel with the external resistance Rin). Since the input current is equal to qc /τ and the output current is equal to q/τr , the current gain-bandwidth product becomes

\[ G_i \cdot \Delta f = \frac{q/\tau_r}{q_c/\tau} \cdot \frac{\omega}{2\pi} \tag{70.8} \]

For ω ≫ ωB, at τ = 1/ω, and assuming |qc| = |–q|,

\[ G_i \cdot \Delta f = \frac{1}{2\pi\tau_r} = \frac{\omega_T}{2\pi} = f_T \tag{70.9} \]

fT (or ωT) is a widely quoted parameter used to compare or “benchmark” active devices. Sometimes, fT (or ωT) is interpreted as a measure of the maximum speed a device can drive a replica of itself. It is easy to compute and historically has been easy to measure with bridges and later using S-parameters. However, fT does have interpretative limitations because it is defined as current into a short-circuit output. Hence, it ignores input resistance and output capacitance effects upon actual circuit performance. Likewise, voltage and power gain expressions can be derived. It is necessary to define the output impedance before either can be quantified. Let Ro be the effective output resistance at the output terminal of the active device. Assuming both the input and output RC time constants to be identical (i.e., RiCi = RoCo), the voltage gain Gv can be expressed in terms of Gi as

\[ G_v = G_i \frac{R_o}{R_i} = G_i \frac{C_i}{C_o} \tag{70.10} \]

where Ro is the parallel equivalent output resistance from all resistances at the output node. The power gain Gp is computed from the product Gi⋅Gv, along with the power gain-bandwidth product. These results are listed in Table 70.1 as summarized from Johnson and Rose.⁷ These simple expressions are valid for all devices as interpreted from the charge control perspective. They provide for a first-order comparison, in terms of a few simple parameters, among the active devices commonly available.

TABLE 70.1 Charge Control Relations for All Active Devices

Parameter                         Symbol       Expression
Transconductance                  gm           Ci/τr  ⇔  ωT⋅Ci
Current amplification             Gi           1/(ωτr)  ⇔  ωT/ω
Voltage amplification             Gv           [1/(ωτr)]⋅(Ci/Co)  ⇔  (ωT/ω)⋅(Ci/Co)
Power amplification               Gp = GiGv    [1/(ω²τr²)]⋅(Ci/Co)  ⇔  (ωT²/ω²)⋅(Ci/Co)
Current gain-bandwidth product    Gi⋅∆f        1/τr  ⇔  ωT
Voltage gain-bandwidth product    Gv⋅∆f        (1/τr)⋅(Ci/Co)  ⇔  ωT⋅(Ci/Co)
Power gain-bandwidth product      Gp⋅∆f²       (1/τr²)⋅(Ci/Co)  ⇔  ωT²⋅(Ci/Co)

Note: Table assumes RiCi = RoCo. (After Johnson and Rose (Ref. 7), March 1959. © 1959 IEEE, reproduced with permission of IEEE.)

From an examination of Table 70.1, it is evident that maximizing Ci and minimizing τr leads to higher transconductance, higher parametric gains, and greater frequency response. This is an important observation in understanding how to improve upon the performance of any active device. Whereas fT has limitations, the frequency at which the maximum power gain extrapolates to unity, denoted by ωmax, is often a more useful indicator of device performance. The primary limitation of ωmax is that it is very difficult to calculate and is usually extrapolated from S-parameter measurements in which the extrapolation is approximate at best.
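The frequency-domain relations of this section (Eqs. 70.6, 70.7, 70.9, and 70.10) can be collected into a small numerical sketch. The piecewise form used for the current gain is an asymptotic approximation (flat at τ/τr below the break frequency, falling as fT/f above it), not an exact network response, and the device values are hypothetical.

```python
import math


def f_t(transit_time):
    """Unity-current-gain frequency fT = 1 / (2*pi*tau_r), Eq. (70.9)."""
    return 1.0 / (2.0 * math.pi * transit_time)


def current_gain(freq_hz, transit_time, lifetime):
    """Asymptotic |Gi|: tau/tau_r below the break frequency, fT/f above it."""
    f_break = 1.0 / (2.0 * math.pi * lifetime)  # Eq. (70.6), in Hz
    if freq_hz <= f_break:
        return lifetime / transit_time
    return f_t(transit_time) / freq_hz          # Eq. (70.7)


def voltage_gain(freq_hz, transit_time, lifetime, c_in, c_out):
    """Gv = Gi * Ci / Co, Eq. (70.10); assumes Ri*Ci = Ro*Co."""
    return current_gain(freq_hz, transit_time, lifetime) * c_in / c_out


# ASSUMED illustrative values (hypothetical device).
TAU_R, TAU = 5e-12, 500e-12   # transit time and lifetime, s
C_I, C_O = 50e-15, 10e-15     # input and output capacitance, F

print(f"fT = {f_t(TAU_R) / 1e9:.1f} GHz")
for f in (1e8, 1e9, 1e10):
    g_i = current_gain(f, TAU_R, TAU)
    g_v = voltage_gain(f, TAU_R, TAU, C_I, C_O)
    print(f"f = {f:.0e} Hz: Gi = {g_i:7.2f}, Gv = {g_v:7.2f}, Gp = {g_i * g_v:9.1f}")
```

Shortening the transit time raises fT and every gain in Table 70.1 together, while the ratio Ci/Co scales only the voltage and power gains.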

70.3 Comparing Unipolar and Bipolar Transistors

Unipolar transistors are active devices that operate using only a single charge carrier type, usually electrons, in their transport region. Field-effect transistors fall into the unipolar classification. In contrast, bipolar transistors depend on positive and negative charged carriers (i.e., both majority and minority carriers) within the transport region. A fundamental difference arises from the relative locations of the control electrode and transport region: in unipolar devices, they are physically separated, whereas in bipolar devices, they are merged into the same physical region (i.e., the base region). Before reviewing the physical operation of each, transport in semiconductors is briefly reviewed.

Charge Transport in Semiconductors¹⁰⁻¹²

Bulk semiconducting materials are useful because their conductivity can be controlled over many orders of magnitude by changing the doping level. Both electrons and holes¹⁰ can conduct current in semiconductors. In integrated circuits, metal, semiconductor, and insulator layers are used together in precisely positioned shapes and thicknesses to form useful device and circuit functions.

Fig. 69.1 illustrates the behavior of electron velocity as a function of local electric field strength for several important semiconducting materials. Two characteristic regions of behavior can be identified: a linear or ohmic region at low electric fields, and a velocity-saturated region at high fields. At low fields, current transport is proportional to the carrier’s mobility. Mobility is a measure of how easily carriers move through a material.¹⁰ At high fields, carriers saturate in velocity; hence, current levels will correspondingly saturate in active devices. The data in Fig. 69.1 assume low doping levels (i.e., Nx < 10¹⁵ cm⁻³). The dashed curve represents transport in a GaAs quantum well formed adjacent to an Al0.3Ga0.7As layer; in this case, interface scattering lowers the mobility. A similar situation is found for transport in silicon at a semiconductor–oxide interface such as found in metal-oxide-semiconductor (MOS) devices. Several general conclusions can be extracted from these data:

1. Compound semiconductors generally have higher electron mobilities than silicon.
2. At high fields (say E > 20,000 V/cm), saturated electron velocities tend to converge to values close to 1 × 10⁷ cm/s.
3. Many compound semiconductors show a transition region between low and high electric field strengths with a negative differential mobility due to electron transfer from the Γ (k = 0) valley to conduction band valleys with higher effective masses (this gives rise to the Gunn Effect¹³).

Hole mobilities are much lower than electron mobilities in all semiconductors. Saturated velocities of holes are also lower at higher electric fields. This is why n-channel field-effect transistors have higher performance than p-channel field-effect transistors, and why npn bipolar transistors have higher performance than pnp bipolar transistors. Table 69.2 compares electron and hole mobilities for several semiconducting materials.
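The two transport regions described above can be approximated, very crudely, by a velocity that rises as μE and is clamped at a saturation value. The sketch below uses that two-region model to estimate a transit time τr = l/v. The clamp is an assumption made here for illustration: it ignores the velocity peak and negative differential mobility noted in conclusion 3, and the field and length values are hypothetical.

```python
def drift_velocity(e_field_v_per_cm, mobility, v_sat=1.0e7):
    """Two-region approximation: v = mu*E at low field, clamped at v_sat (cm/s)."""
    return min(mobility * e_field_v_per_cm, v_sat)


def transit_time(length_cm, e_field_v_per_cm, mobility, v_sat=1.0e7):
    """tau_r = l / v for a uniform field across the transport region."""
    return length_cm / drift_velocity(e_field_v_per_cm, mobility, v_sat)


# Electron mobilities from Table 69.2, cm^2/V-s.
MOBILITY = {"Si": 1450.0, "GaAs": 8500.0}

LENGTH_CM = 0.5e-4  # ASSUMED transport length of 0.5 um (hypothetical)

for e_field in (1.0e3, 2.0e4):  # V/cm: one low-field and one high-field case
    for name, mu in MOBILITY.items():
        tau = transit_time(LENGTH_CM, e_field, mu)
        print(f"E = {e_field:.0e} V/cm, {name:4s}: tau_r = {tau * 1e12:.1f} ps")
```

In this simple model the mobility advantage appears only at the lower field; once both materials are clamped at the same saturated velocity, the estimated transit times coincide, in line with conclusion 2.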

Field-Effect (Unipolar) Transistor¹⁴⁻¹⁶

Fig. 70.5(a) shows a conceptual view of an n-channel field-effect transistor (FET). As shown, the n-type channel is a homogeneous semiconducting material of thickness b, with electrons supporting the drain-to-source current. A p-type channel would rely on mobile holes for current transport, and all voltage polarities would be exactly reversed from those shown in Fig. 70.5(a). The control charge on the gate region (of length L and width W) establishes the number of conduction electrons per unit area in the channel by electrostatic attraction or exclusion.

FIGURE 70.5 (a) Conceptual view of a field-effect transistor with the channel sandwiched between source and drain ohmic contacts and a gate control electrode in close proximity; and (b) cross-sectional view of the FET with a depletion layer shown such as would be present in a compound semiconductor MESFET.

The cross-section of the FET channel in Fig. 70.5(b) shows a depletion layer, a region void of electrons, as an intermediary agent between the control charge and the controlled charge. This depletion region is present in junction FET and Schottky barrier junction (i.e., metal-semiconductor junction) MESFET structures. In all FET structures, the gate is physically separated from the channel. By physically separating the control charge from the controlled charge, the gate-to-channel impedance can be very large at low frequencies. The gate impedance is predominantly capacitive and, typically, very low gate leakage currents are observed in high-quality FETs. This is a distinguishing feature of the FET: its high input impedance is desirable for many circuit applications.

The channel, positioned between the source and drain ohmic contacts, forms a resistor whose resistance is modulated by the applied gate-to-channel voltage. We know the gate potential controls the channel charge by the charge control relation. Time variation of the gate potential translates into a corresponding time variation of the drain current (and the source current also). Therefore, transconductance gm is the natural parameter to describe the FET from this viewpoint.

Fig. 70.6(a) shows the ID-VDS characteristic of the n-channel FET in the common-source connection with constant electron mobility and a long channel assumed. Two distinct operating regions appear in Fig. 70.6(a): the linear (i.e., non-saturated) region and the saturated region, separated by the dashed parabola. The origin of current saturation corresponds to the onset of channel pinch-off due to carrier exclusion at the drain end of the channel. Pinch-off occurs when the drain voltage is positive enough to deplete the channel completely of electrons at the drain end; this corresponds to a gate-to-source voltage equal to the pinch-off voltage, denoted as –Vp in Figs. 70.6(b) and (c). For constant VDS in the saturated region, the ID vs. VGS transfer curve approximates “square law” behavior; that is,

FIGURE 70.6 (a) Field-effect transistor drain current (ID) versus drain-to-source voltage (VDS) characteristic with the gate-to-source voltage (VGS) as a stepped parameter; (b) ID versus VGS “transfer curve” for a constant VDS in the saturated region of operation, revealing its “square-law” behavior; (c) transconductance gm versus VGS for a constant VDS in saturated region of operation corresponding to the transfer curve in (b). These curves assume constant mobility, no velocity saturation, and the “long-channel FET approximation.”

\[ I_D = I_{D,sat} = I_{DSS}\left(1 - \frac{V_{GS}}{(-V_P)}\right)^2 \quad \text{for } -V_P \le V_{GS} \le \varphi \tag{70.11} \]

where IDSS is the drain current when VGS = 0, and ϕ is a built-in potential associated with the gate-to-channel junction or interface (e.g., a metal-semiconductor Schottky barrier as in the MESFET). The symbol ID,sat denotes the drain current in the saturated region of operation. Transconductance gm is linear with VGS for the saturation transfer characteristic of Eq. (70.11) and is approximated by

\[ g_m = \frac{\partial I_D}{\partial V_{GS}} \cong \frac{2 I_{DSS}}{V_P}\left(1 - \frac{V_{GS}}{(-V_P)}\right) \quad \text{for } -V_P \le V_{GS} \le \varphi \tag{70.12} \]

Equations 70.11 and 70.12 are plotted in Figs. 70.6(b) and (c), respectively.
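As a numerical companion to Eqs. (70.11) and (70.12), the sketch below evaluates the square-law transfer curve and its slope at a few gate voltages. The values of IDSS and VP are hypothetical, VP is treated as a positive magnitude so that pinch-off falls at VGS = –VP, and the upper limit VGS ≤ ϕ is not enforced in this simplified version.

```python
def drain_current_sat(v_gs, i_dss, v_p):
    """Eq. (70.11): ID,sat = IDSS * (1 - VGS/(-VP))**2; zero at or below pinch-off."""
    if v_gs <= -v_p:
        return 0.0  # channel fully pinched off
    return i_dss * (1.0 - v_gs / (-v_p)) ** 2


def transconductance_sat(v_gs, i_dss, v_p):
    """Eq. (70.12): gm = (2*IDSS/VP) * (1 - VGS/(-VP)); linear in VGS."""
    if v_gs <= -v_p:
        return 0.0
    return (2.0 * i_dss / v_p) * (1.0 - v_gs / (-v_p))


# ASSUMED illustrative values (hypothetical long-channel FET).
I_DSS = 10e-3  # drain current at VGS = 0, A
V_P = 1.0      # pinch-off voltage magnitude, V

for v_gs in (-1.0, -0.75, -0.5, -0.25, 0.0):
    i_d = drain_current_sat(v_gs, I_DSS, V_P)
    g_m = transconductance_sat(v_gs, I_DSS, V_P)
    print(f"VGS = {v_gs:+.2f} V: ID = {i_d * 1e3:5.2f} mA, gm = {g_m * 1e3:5.1f} mS")
```

The printed values trace a parabola for ID and a straight line for gm, matching the shapes described for Figs. 70.6(b) and (c) under the constant-mobility, long-channel assumptions.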

Bipolar Junction Transistors (Homojunction and Heterojunction) 13-17 In the bipolar junction transistor (BJT), both the control charge and the controlled charge occupy the same region (i.e., the base region). A control charge is injected into the base region (i.e., this is the base current flowing in the base terminal), causing the emitter-to-base junction’s potential barrier to be lowered. Barrier lowering results in majority carrier diffusion across the emitter-to-base junction. Electrons diffuse into the base and holes into the emitter in the npn BJT shown in Fig. 70.7. By controlling the emitter-to-base junction’s physical structure, the dominant carrier diffusion across this n-p junction should be injection into the base region. For our npn transistor, the dominant carrier transport is electron diffusion into the base region where the electrons are minority carriers. They transit the base region, of base width Wb, by both diffusion and drift. When collected at the collector-to-base junction, they establish the collector current IC. The base width must be short to minimize recombination in the base region (this is reflected in the current gain parameter commonly used with BJT and HBT devices). In homojunction BJT devices, the emitter and base regions have the same bandgap energy. The respective carrier injection levels are set by the ratio of the emitter-to-base doping levels. For high emitter efficiency, that is, the number of carriers diffusing into the base being much greater than the number of carriers simultaneously diffusing into the emitter, the emitter must be much more heavily doped than the base region. This places a limit on the maximum doping level allowed in the base of the homojunction


FIGURE 70.7 (a) Conceptual view of a bipolar junction transistor with the base region sandwiched between emitter and collector regions. Structure is representative of a compound semiconductor heterojunction bipolar transistor. (b) Simplified cross-sectional view of a vertically structured BJT device with primary electron flow represented by large arrow.

BJT, thereby leading to higher base resistance than the device designer would normally desire.16 In contrast, the heterojunction bipolar transistor (HBT) uses different semiconducting materials in the base and emitter regions to achieve high emitter efficiency. A wider bandgap emitter material allows for high emitter efficiency while allowing for higher base doping levels which in turn lowers the parasitic base resistance. An example of a wider bandgap emitter transistor is shown in Fig. 70.8. In this example, the emitter is AlGaAs whereas the base and collector are formed with GaAs. Figure 70.8 shows the band diagram under normal operation with the emitter–base junction forward-biased and the collector–base junction reverse-biased. The discontinuity in the valence band edge at the emitter–base heterojunction is the origin of the reduced diffusion into the emitter region. The injection ratio determining the emitter efficiency depends exponentially on this discontinuity. If ∆Eg is the valence band discontinuity, the injection ratio is proportional to the exponential of ∆Eg normalized to the thermal energy kT:4

$$ \frac{J_n}{J_p} \propto \exp\left( \frac{\Delta E_g}{kT} \right) \tag{70.13} $$
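To get a feel for how steeply the injection ratio of Eq. (70.13) depends on the band discontinuity, the short Python sketch below evaluates the exponential factor exp(∆Eg/kT) for a few values of ∆Eg/kT. The function name is hypothetical, and the proportionality constant (set by doping and transport parameters) is deliberately left out.

```python
import math


def injection_factor(delta_eg_over_kt):
    """Exponential factor exp(dEg/kT) multiplying the injection ratio Jn/Jp."""
    return math.exp(delta_eg_over_kt)


if __name__ == "__main__":
    for x in (4, 6, 8, 9, 10):
        print(f"dEg = {x:2d} kT -> factor ~ {injection_factor(x):9.0f}")
```

For reference, exp(8) evaluates to about 2981 and exp(9) to about 8103, so each additional kT of discontinuity improves the injection ratio by a factor of e.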

For example, ∆Eg equal to 8kT gives an exponential factor of approximately 3000, thereby leading to an emitter efficiency of nearly unity, as desired. The use of the emitter-base band discontinuity is a very efficient way to maintain high emitter efficiency. In bipolar devices, the collector current IC is given by the exponential of the base-emitter forward voltage VBE normalized to the thermal voltage kT/q


FIGURE 70.8 The bandgap diagram for an HBT AlGaAs/GaAs device with the wider bandgap for the AlGaAs emitter (solid line) compared with a homojunction GaAs BJT emitter (dot-dash line). The double dot-dashed line represents the Fermi level in each region.

$$ I_C = I_S \exp\left( \frac{q V_{BE}}{kT} \right) \tag{70.14} $$

The saturation current IS is given by a quantity that depends on the structure of the device; it is inversely proportional to the base doping charge QBASE and proportional to the device’s area A, namely

$$ I_S = \frac{q A D n_i^2}{Q_{BASE}} \tag{70.15} $$

where the other symbols have their usual meanings (D is the minority carrier diffusion constant in the base, ni is the intrinsic carrier concentration of the semiconductor, and q is the electron’s charge). A typical collector current versus collector-emitter voltage characteristic, for several increasing values of (forward-biased) emitter-base voltages, is shown in Fig. 70.9(a). Note the similarity to Fig. 70.6(a), with the BJT having a quicker turn-on for low VCE values compared with the softer knee for the FET. The transconductance of the BJT and HBT is found by taking the derivative of Eq. 70.14, thus

$$ g_m = \frac{\partial I_C}{\partial V_{BE}} = \frac{q I_S}{kT} \exp\left( \frac{q V_{BE}}{kT} \right) \tag{70.16} $$
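The exponential relations of Eqs. (70.14) and (70.16) can be illustrated with a minimal Python sketch. The saturation current IS and the bias points are assumed values chosen only for illustration; the function names are hypothetical.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electron charge, C


def thermal_voltage(temp_k=300.0):
    """Thermal voltage kT/q in volts."""
    return K_B * temp_k / Q_E


def collector_current(vbe, i_s, temp_k=300.0):
    """Collector current from Eq. (70.14)."""
    return i_s * math.exp(vbe / thermal_voltage(temp_k))


def transconductance(vbe, i_s, temp_k=300.0):
    """Transconductance from Eq. (70.16), which equals IC/(kT/q)."""
    return collector_current(vbe, i_s, temp_k) / thermal_voltage(temp_k)


if __name__ == "__main__":
    I_S = 1e-16  # A, assumed saturation current
    for vbe in (0.65, 0.70, 0.75):
        ic = collector_current(vbe, I_S)
        gm = transconductance(vbe, I_S)
        print(f"VBE = {vbe:.2f} V  IC = {ic:.3e} A  gm = {gm:.3e} S")
```

Because gm = IC/(kT/q), the transconductance at a given collector current depends only on temperature, not on the device geometry or the value assumed for IS.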

Both IC and gm are of exponential form, as observed in Fig. 70.9; Eqs. 70.14 and 70.16 are plotted in Figs. 70.9(b) and (c), respectively. The transconductance of the BJT/HBT is generally much larger than that of the best FET devices (this can be verified by comparing Eq. (70.12) with Eq. (70.16) with typical parameter values inserted). This has significant circuit design advantages for the BJT/HBT devices over the FET devices because high transconductance is needed for high current drive to charge load capacitance


FIGURE 70.9 (a) Collector current (IC) versus collector-to-emitter voltage (VCE) characteristic curves with the base-to-emitter voltage (VBE) as stepped parameter; (b) IC versus VBE “transfer curve” for a constant VCE in saturated region of operation shows exponential behavior; and (c) transconductance gm versus VBE for a constant VCE in the saturated region of operation corresponding to the transfer curve in (b).

in digital circuits. In general, higher gm values allow a designer to use feedback to a greater extent in design and this provides for greater tolerance to process variations.

Comparing Parameters

Table 70.2 compares some of the more important features and parameters of the BJT/HBT device with the FET device. For reference, a common-source FET configuration is compared with a common-emitter BJT/HBT configuration. One of the most striking differences is the input impedance parameter. A FET has a high input impedance at low to mid-range frequencies because it essentially is a capacitor. As the frequency increases, the magnitude of the input impedance decreases as 1/ω because a capacitive reactance varies as 1/(ωCgs). The BJT/HBT emitter-base is a forward-biased pn junction, which is inherently a low impedance structure because of the lowered potential barrier to carriers. The BJT/HBT input is also capacitive (i.e., a large diffusion capacitance due to stored charge), but a large conductance (or small resistance) appears in parallel, assuring a low input impedance even at low frequencies.

BJT/HBT devices are known for their higher transconductance gm, which is proportional to collector current. An FET’s gm is proportional to the saturated velocity vsat and its input capacitance Cgs. Thus, device structure and material parameters set the performance of the FET whereas thermodynamics play the key role in establishing the magnitude of gm in a BJT/HBT. Thermodynamics also establishes the magnitude of the turn-on voltage (this follows simply from Eq. 70.14) in the BJT/HBT device. For digital circuits, turn-on voltage (or threshold voltage) is important in terms of repeatability and consistency for circuit robustness. The BJT/HBT is clearly superior to the FET in this regard because doping concentration and physical structure establish an FET’s turn-on voltage. In general, these variables are less controllable. However, the forward turn-on voltage in the AlGaAs/GaAs


TABLE 70.2 Comparing Electrical Parameters for BJT/HBT vs. FET

| Parameter | BJT/HBT | FET |
| --- | --- | --- |
| Input impedance Z | Low Z due to forward-biased junction; large diffusion capacitance Cbe | High Z due to reverse-biased junction or insulator; small depletion layer capacitance Cgs |
| Turn-on voltage | Forward voltage VBE highly repeatable; set by thermodynamics | Pinch-off voltage VP not very repeatable; set by device design |
| Transconductance | High gm [= IC/(kT/q)] | Low gm [≅ vsatCgs] |
| Current gain | β (or hFE) = 50 to 150; β is important due to low input impedance | Not meaningful at low frequencies and falls as 1/ω at high frequencies |
| Unity current gain cutoff frequency fT | fT = gm/2πCBE is usually lower than for FETs | fT = gm/2πCgs (= vsat/2πLg) higher for FETs |
| Maximum frequency of oscillation fmax | fmax = [fT/(8πrb’Cbc)]^(1/2) | fmax = fT [rds/Rin]^(1/2) |
| Feedback capacitance | Cbc large because of large collector junction | Usually Cgd is much smaller than Cbc |
| 1/f noise | Low in BJT/HBT | Very high 1/f noise corner frequency |
| Thermal behavior | Thermal runaway and second breakdown | No thermal runaway |
| Other | | Backgating is a problem in semi-insulating substrates |

HBT is higher (~1.4 V) because of the band discontinuity at the emitter–base heterojunction. For InP-based HBTs, the forward turn-on voltage is lower (~0.8 V) than that of the AlGaAs/GaAs HBT and comparable to the approximately 0.7 V found in silicon BJTs. This is important in digital circuits because reducing the signal swing allows for faster circuit speed and lowers power dissipation by allowing for reduced power supply voltages.

For BJT/HBT devices, current gain (often given the symbol of β or hFE) is a meaningful and important parameter. Good BJT devices inject little current into the emitter and, hence, operate with low base current levels. The current gain is defined as the collector current divided by the base current and is therefore a measure of the quality of the device (i.e., traps and defects, both surface and bulk, degrade the current gain due to higher recombination currents). At low to mid-range frequencies, current gain is not especially meaningful for the high input impedance FET device because of the capacitive input.

The intrinsic gain of an HBT is higher because of its higher Early voltage VA. The Early voltage is a measure of the intrinsic output conductance of a device. In the HBT, the change in the collector voltage has very little effect on the modulation of the collector current. This is true because the band discontinuity dominates the establishment of the current collected at the collector–base junction. A figure of merit is the intrinsic voltage gain of an active device, given by the product gmVA, and the HBT has the highest values compared to silicon BJTs and compound semiconductor FETs.

It is important to have a dynamic figure of merit or parameter to assess the usefulness of an active device for high-speed operation. Both the unity current gain cutoff frequency fT and maximum frequency of oscillation fmax have been discussed in the charge control section above. Both of these figures of merit are used because they are simple and can generally be correlated to circuit speed. The higher the value of both parameters, the better the high-speed circuit performance. This is not the whole story because in digital circuits other factors such as output node-to-substrate capacitance, external load capacitances, and interconnect resistance also play an important role in determining the speed of a circuit.

Generally, 1/f noise is much higher in FET devices than in the BJT/HBT devices. This is usually of more importance in analog applications and oscillators, however.

Thermal behavior in high-speed devices is important as designers push circuit performance. Bipolar devices are more susceptible to thermal runaway than FETs because of the positive feedback associated with a forward-biased junction (i.e., a smaller forward voltage is required to maintain the same current at higher temperatures). This is not true in the FET; in fact, FETs generally have negative feedback under common biases used in digital circuits. Both GaAs and InP have poorer thermal conductivity than silicon, with GaAs being about one-third of silicon and InP being about one-half of silicon.
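The input impedance contrast in Table 70.2 can be made concrete with a rough Python sketch. It models the FET gate as a pure capacitance, |Z| = 1/(2πf·Cgs), and uses the standard hybrid-pi relation rπ = β/gm with gm = IC/(kT/q) for the bipolar input resistance. The hybrid-pi relation is a textbook small-signal approximation rather than a formula from this chapter, and the values of Cgs, β, and IC are assumptions.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electron charge, C


def fet_input_impedance_mag(freq_hz, cgs_f):
    """Magnitude of a purely capacitive gate input impedance, 1/(2*pi*f*Cgs)."""
    return 1.0 / (2.0 * math.pi * freq_hz * cgs_f)


def bjt_input_resistance(beta, ic_a, temp_k=300.0):
    """Low-frequency input resistance r_pi = beta/gm, with gm = IC/(kT/q)."""
    vt = K_B * temp_k / Q_E
    return beta * vt / ic_a


if __name__ == "__main__":
    CGS = 50e-15  # F, assumed gate-source capacitance
    BETA = 100.0  # assumed current gain
    IC = 1e-3     # A, assumed collector bias current
    print(f"BJT r_pi at {IC * 1e3:.1f} mA: {bjt_input_resistance(BETA, IC):.0f} ohm")
    for f in (1e7, 1e8, 1e9, 1e10):
        print(f"FET |Zin| at {f:.0e} Hz: {fet_input_impedance_mag(f, CGS):.0f} ohm")
```

Under these assumptions the FET input impedance is hundreds of kilohms at 10 MHz and falls to a few kilohms at 1 GHz, while the bipolar input resistance is a few kilohms even at dc.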


Finally, circuits built on GaAs or InP semi-insulating substrates are susceptible to backgating. Backgating is similar to the backgate-bias effects in MOS transistors, only it is not as predictable or repeatable as the well-known backgate-bias effect is in silicon MOSFETs on lightly doped silicon substrates. Interconnect traces with negatively applied voltages and located adjacent to devices can change their threshold voltage (or turn-on voltage). It turns out that HBT devices do not suffer from backgating, and this is one of their advantages. Of course, semi-insulating substrates are nearly ideal for microstrip transmission lines on top of the substrates because of their very low loss. Silicon substrates are much lossier in comparison, which is a decided advantage of GaAs and InP substrates.

70.4 Typical Device Structures

In this section, a few typical device structures are described. We begin with FET structures and then follow with HBT structures. There are many variants on these devices and the reader is referred to the literature for more information.15,16,19-22

FET Structures

In the silicon VLSI world, the MOSFET (metal-oxide-semiconductor field-effect transistor) dominates. This device forms a channel at the oxide–semiconductor interface upon applying a voltage to the gate to attract carriers to this interface.23 The thin layer of mobile carriers forms a two-dimensional sheet of carriers. One of the limitations with the MOSFET is that the oxide–semiconductor interface scatters the carriers in the channel and degrades the performance of the MOSFET. This is evident in Fig. 69.1 where the lower electron velocity at the Si-SiO2 interface is compared with electron velocities in compound semiconductors. For many years, device physicists have looked for device structures and materials which increase electron velocity. FET structures using compound semiconductors have led to much faster devices such as the MESFET and the HEMT.

The MESFET (metal-semiconductor FET) uses a thin doped channel (almost always n-type because electrons are much more mobile than holes in these semiconductors) with a reverse-biased Schottky barrier for the gate control.15 The cross-section of a typical MESFET is shown in Fig. 70.10(a). A recessed gate is used along with a highly doped n+ layer at the surface to reduce the series resistance at both the source and drain connections. The gate length and electron velocity in the channel dominate in determining the high-speed performance of a MESFET. Much work has gone into developing processes that form shorter gate structures. For digital devices, lower breakdown voltages are permissible, and therefore shorter gate lengths and higher channel doping are more compatible with such devices. For a given semiconductor material, a device’s breakdown voltage BVGD times its unity current gain cutoff frequency fT is a constant. Therefore, it is possible to trade off BVGD for fT in device design. A high fT is required in high-speed digital circuits because devices with a high fT over their logic swing will have a high gm/C ratio for large-signal operation. A high gm/C ratio translates into a device’s ability to drive load capacitances.

It is also desirable to maximize the charge in the channel per unit gate area. This allows for higher currents per unit gate width and greater ability to drive large capacitive loads. The higher current per unit gate width also favors greater IC layout density. In the MESFET, the doping level of the channel sets this limit. MESFET channels are usually ion-implanted and the added lattice damage further reduces the electron mobility.

To achieve still higher currents per gate width and even higher figures of merit (such as fT and fmax), the HEMT (high electron mobility transistor) structure has evolved.16,22 The HEMT is similar to the MESFET except that the doped channel is replaced with a two-dimensional quantum well containing electrons (sometimes referred to as a 2-D electron gas). The quantum well is formed by a discontinuity in conduction band edges between two different semiconductors (such as AlGaAs and GaAs in Fig. 69.3). From Fig. 69.4 we see that GaAs and Al0.3Ga0.7As have nearly identical lattice constants but with somewhat different bandgaps. One compound semiconductor can be grown (i.e., using molecular beam epitaxy or metal-organic chemical vapor deposition techniques) on a different compound semiconductor if the


FIGURE 70.10 Typical FET cross-sections for (a) GaAs MESFET device with doped channel, and (b) AlGaAs/GaAs HEMT device with single quantum well containing a two-dimensional electron gas.

lattice constants are identical. Another example is Ga0.47In0.53 As and InP, where they are lattice matched. The difference in conduction band edge alignment leads to the formation of a quantum well. The greater the edge misalignment, the deeper the quantum well can be, and generally the greater the number of carriers the quantum well can hold. The charge per unit area that a quantum well can hold directly translates into greater current per unit gate width. Thus, the information in Fig. 69.4 can be used to bandgap engineer different materials that can be combined in lattice matched layers. A major advantage of the quantum well comes from being able to use semiconductors that have higher electron velocity and mobility than the substrate material (e.g., GaAs) and also avoid charge impurity scattering in the quantum well by locating the donor atoms outside the quantum well itself. Figure 70.10(b) shows a HEMT cross-section where the dopant atoms are positioned in the wider bandgap AlGaAs layer. When these donors ionize, electrons spill into the quantum well because of its lower energy. Higher electron mobility is possible because the ionized donors are not located in the quantum well layer. A recessed gate is placed over the quantum well, usually on a semiconductor layer such as the AlGaAs layer in Fig. 70.10(b), allowing modulation of the charge in the quantum well. There are only a few lattice-matched structures possible. However, semiconductor layers for which the lattice constants are not matched are possible if the layers are thin enough (of the order of a few nanometers). Molecular beam epitaxy and MOCVD make it possible to grow layers of a few atomic layers. Such structures are called pseudomorphic HEMT (PHEMT) devices.16,22 This gives more flexibility in selecting quantum well layers which hold greater charge and have higher electron velocities and mobilities. The highest performance levels are achieved with pseudomorphic HEMT devices.

FET Performance

All currently used FET structures are n-channel because hole velocities are very low compared with electron velocities. Typical gate lengths range from 0.5 microns down to about 0.1 microns for the fastest devices. The most critical fabrication step in producing these structures is the gate recess width and depth.


The GaAs MESFET (ca. 1968) was the first compound semiconductor FET structure and is still used today because of its simplicity and low cost of manufacture. GaAs MESFET devices have fT values in the 20 GHz to 50 GHz range corresponding to gate lengths of 0.5 microns down to 0.2 microns, and gm values of the order of 200 to 400 mS/mm, respectively. These devices will typically have IDSS values of 200 to 400 mA/mm, where parameter IDSS is the common-source drain current with zero gate voltage applied in a saturated state of operation. In comparison, the first HEMT used an AlGaAs/GaAs material structure. These devices offer higher performance than the GaAs MESFET (e.g., given an identical gate length, the AlGaAs/GaAs HEMT has an fT about 50% to 100% higher, depending on the details of the device structure and quality of material). Correspondingly higher currents are achieved in the AlGaAs/GaAs HEMT devices. Higher performance still is achieved using InP-based HEMTs. For example, In0.53Ga0.47As/In0.52Al0.48As on InP lattice-matched HEMTs have reported fT numbers greater than 250 GHz with gate lengths of the order of 0.1 microns. Furthermore, such devices have IDSS values approaching 1000 mA/mm and very high transconductances of greater than 1500 mS/mm.22,24 These devices do have low breakdown voltages of the order of 1 or 2 V because of the small bandgap of InGaAs. Changing the stoichiometric ratios to In0.15Ga0.85As/In0.70Al0.30As on a GaAs substrate produces a pseudomorphic HEMT structure. The In0.15Ga0.85As is a strained layer when grown on GaAs. The use of strained layers gives the device designer more flexibility in accessing a wider variety of quantum well depths and electronic properties.
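The quoted fT and gate-length figures can be cross-checked against the relation fT = vsat/(2πLg) listed in Table 70.2. The Python sketch below inverts that relation to estimate an effective carrier velocity from each (fT, Lg) pair given in the text. Treating the result as a single effective velocity is an assumption: it lumps parasitic delays into one number and is not a measured material parameter.

```python
import math


def effective_velocity_cm_per_s(ft_hz, gate_length_um):
    """Invert fT = v/(2*pi*Lg) to get an effective velocity in cm/s."""
    lg_cm = gate_length_um * 1e-4  # 1 micron = 1e-4 cm
    return 2.0 * math.pi * lg_cm * ft_hz


if __name__ == "__main__":
    # (label, fT in Hz, gate length in microns), taken from the text above
    cases = [
        ("GaAs MESFET, 0.5 um", 20e9, 0.5),
        ("GaAs MESFET, 0.2 um", 50e9, 0.2),
        ("InP HEMT,    0.1 um", 250e9, 0.1),
    ]
    for label, ft, lg in cases:
        v = effective_velocity_cm_per_s(ft, lg)
        print(f"{label}: v_eff ~ {v:.2e} cm/s")
```

Both MESFET data points give about 6.3 × 10^6 cm/s, so the quoted fT range scales as 1/Lg, while the InP HEMT figure corresponds to roughly 1.6 × 10^7 cm/s.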

Heterojunction Bipolar Structures

Practical heterojunction bipolar transistor (HBT) devices19,21 are still evolving. Molecular beam epitaxy (MBE) is used to grow the doped layers making up the vertical semiconductor structure in the HBT. In fact, HBT structures were not really practical until the advent of MBE, although the idea behind the HBT goes back to around 1950 (Shockley). The vastly superior compositional control and layer thickness control with MBE is what made HEMTs and HBTs possible. The first HBT devices used an AlGaAs/GaAs junction with the wider bandgap AlGaAs layer forming the emitter region. Compound semiconductor HBT devices are typically mesa structures, as opposed to the more nearly planar structures used in silicon bipolar technology, because top surface contacts must be made to the collector, base, and emitter regions. Molecular beam epitaxy grows the stack of layers over the entire wafer, whereas, in silicon VLSI processes, selective implantations and oxide masking localize the doped regions. Hence, etching down to the respective layers allows for contact to the base and collector regions. An example of such a mesa HBT structure20 is shown in Fig. 70.11. The HBT shown uses an InGaP emitter primarily for improved reliability over the AlGaAs emitter and a carbon-doped p+ base GaAs layer.

FIGURE 70.11 Cross-section of an HBT device with carbon-doped p+ base and an InGaP emitter.20 Note the commonly used mesa structure, where selective layer etching is required to form contacts to the base and collector regions.


Recently, InP-based HBTs21 have emerged as candidates for use in high-speed circuits. The two dominant heterojunctions are InP/InGaAs and AlInAs/InGaAs in InP devices. The small but significant bandgap difference between AlInAs directly on InP greatly limits its usefulness. InP-based HBT device structures are similar to those of GaAs-based devices and the reader is referred to Chapter 5 of Jalali and Pearton16 for specific InP HBT devices. Generally, InP has advantages of lower surface recombination (higher current gain results), better electron transport, lower forward turn-on voltage, and higher substrate thermal conductivity.

HBT Performance

Typical current gain values in production-worthy HBT devices range from 50 at the low range to 150 at the high range. Cutoff frequency fT values are usually quoted under the best (i.e., peak) bias conditions. For this reason, fT values must be carefully interpreted because in digital circuits, the bias state varies widely over the entire switching swing. An averaged fT value would probably be more representative, but it is difficult to determine. Typical fT values for HBT processes in manufacturing (as of 1998) are in the 50 to 150 GHz range. For example, for the HBT example in Fig. 70.11 with a 2 µm × 2 µm emitter, fT is approximately 65 GHz at a current density of 0.6 mA/µm2 and its dc current gain is around 50. Of course, higher values for fT have been reported for R&D or laboratory devices. In HBT devices, the parameter fmax is often lower than its fT value (e.g., for the device in Fig. 70.11, fmax is about 75 GHz). Base resistance (refer to Table 70.2 for equation) is the dominant limiting factor in setting fmax. The best HBT devices have fmax values only slightly higher than their fT values. In comparison, MESFET and HEMT devices typically have higher fmax/fT ratios, although in digital circuits this may be of little importance. Where the HBT really excels is in being able to generate much higher values of transconductance. This is a clear advantage in driving larger loading capacitances found in large integrated circuits. Biasing the HBT in the current range corresponding to the highest transconductance is essential to take advantage of the intrinsically higher transconductance.
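To see how base resistance limits fmax, the Table 70.2 expression fmax = [fT/(8πrb’Cbc)]^(1/2) can be inverted for the rb’Cbc time constant. The Python sketch below does this for the fT and fmax values quoted above for the device of Fig. 70.11. The split of the product into a particular rb’ and Cbc is purely hypothetical and is used only to confirm the round trip.

```python
import math


def fmax_hbt(ft_hz, rb_ohm, cbc_f):
    """Maximum frequency of oscillation from the Table 70.2 expression."""
    return math.sqrt(ft_hz / (8.0 * math.pi * rb_ohm * cbc_f))


def rb_cbc_product(ft_hz, fmax_hz):
    """Invert the fmax expression for the rb'*Cbc time constant in seconds."""
    return ft_hz / (8.0 * math.pi * fmax_hz ** 2)


if __name__ == "__main__":
    FT = 65e9    # Hz, quoted in the text
    FMAX = 75e9  # Hz, quoted in the text
    tau = rb_cbc_product(FT, FMAX)
    print(f"implied rb'*Cbc ~ {tau * 1e12:.2f} ps")

    RB = 46.0       # ohm, hypothetical base resistance
    CBC = tau / RB  # F, collector-base capacitance consistent with tau
    print(f"round trip fmax ~ {fmax_hbt(FT, RB, CBC) / 1e9:.1f} GHz")
```

The implied time constant is about 0.46 ps. Because fmax varies as the inverse square root of rb’Cbc, halving the base resistance raises fmax by only about 41%.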


Long, S.I. "Logic Design Principles and Examples" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


71 Logic Design Principles and Examples

Stephen I. Long
University of California at Santa Barbara

71.1 Introduction
71.2 Static Logic Design
Direct-Coupled FET Logic • Source-Coupled FET Logic • Static and Dynamic Noise Margin and Noise Sources
71.3 Transient Analysis and Design for Very-High-Speed Logic
Zero-Order Delay Estimate • Time Constant Delay Methods: Elmore Delay and Risetime • Time Constant Methods: Open-Circuit Time Constants • Time Constant Methods: Complications

71.1 Introduction

The logic circuits used in high-speed compound semiconductor digital ICs must satisfy the same essential conditions for design robustness and performance as digital ICs fabricated in other technologies. The static or dc design of a logic cell must guarantee adequate voltage and/or current gain to restore the signal levels in a chain of similar cells. A minimum noise margin must be provided for tolerance against process variation, temperature, and induced noise from ground bounce, crosstalk, and EMI so that functional circuits and systems are produced with good electrical yield. Propagation delays must be determined as a function of loading and power dissipation. Compound semiconductor designs emphasize speed, so logic voltage swings are generally low, τr is low so that transconductances and fT are high, and device access resistances are made as low as possible in order to minimize the lifetime time constant τ. This combination makes circuit performance very sensitive to parasitic R, L, and C, especially when the highest operation frequency is desired. The following sections will describe the techniques that can be used for static and dynamic design of high-speed logic.

71.2 Static Logic Design

A basic requirement for any logic family is that it must be capable of restoring the logic voltage or current swing. This means that the voltage or current gain with loading must exceed 1 over part of the transfer characteristic. Figure 71.1 shows a typical Vout vs. Vin dc transfer characteristic for a static ratioed logic inverter as is shown in the schematic diagram of Fig. 71.2. It can be seen that a chain of such inverters will restore steady-state logic voltage levels to VOL and VOH because the high-gain transition region around Vin = VTH will result in voltage amplification. Even if the voltage swing is very small, if centered on the inverter threshold voltage VTH, defined as the intersection between the transfer characteristic and the Vin = Vout line, the voltage will be amplified to the full swing again by each successive stage.
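The level-restoring behavior described above can be demonstrated with a toy model. The Python sketch below uses a hypothetical tanh-shaped transfer characteristic (not a device model from this chapter) whose slope at the threshold is −gain, and passes a signal that starts only 10 mV above VTH through a chain of identical inverters. The values of VOL, VOH, and the gain are assumptions.

```python
import math


def inverter_vtc(vin, vol=0.1, voh=0.6, gain=5.0):
    """Hypothetical smooth inverter transfer curve.

    The curve saturates at vol and voh, crosses Vin = Vout at the
    midpoint vth, and has a slope of -gain at that point.
    """
    vth = 0.5 * (vol + voh)
    swing = voh - vol
    return vth - 0.5 * swing * math.tanh(2.0 * gain * (vin - vth) / swing)


def propagate(vin, stages):
    """Return the output voltage of each stage in a chain of inverters."""
    levels = []
    v = vin
    for _ in range(stages):
        v = inverter_vtc(v)
        levels.append(v)
    return levels


if __name__ == "__main__":
    start = 0.36  # V, 10 mV above the 0.35 V threshold of the model
    for n, v in enumerate(propagate(start, 6), start=1):
        print(f"stage {n}: Vout = {v:.4f} V")
```

With a threshold gain of 5, the 10 mV offset grows to essentially the full 0.1 V to 0.6 V swing within three or four stages. If the gain parameter is set below 1, the same chain lets the offset decay toward VTH instead, which is why a gain greater than 1 is required.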


FIGURE 71.1 Typical voltage transfer characteristic for the logic inverter shown in Fig. 71.2.

FIGURE 71.2 Schematic diagram of a direct-coupled FET logic (DCFL) inverter.

Ratioed logic implies that the logic high and low voltages VOH and VOL shown in Fig. 71.1 are a function of the widths W1 and WL of the FETs in the circuit shown in Fig. 71.2. In III-V technologies, this circuit is implemented with either MESFETs or HEMTs. The circuit in Fig. 71.2 is called Direct Coupled FET Logic or DCFL. The logic levels of non-ratioed logic are independent of device widths. Non-ratioed logic typically occurs when the switching transistors do not conduct any static current. This is typical of logic families such as static CMOS or its GaAs equivalent CGaAs2 which make use of complementary devices. Dynamic logic circuits such as precharged logic25 and pass transistor logic26,27 also do not require static current in pull-down chains. Such circuits have been used with GaAs FETs in order to reduce static power dissipation. They have not been used, however, for the highest speed applications.

Direct-Coupled FET Logic

DCFL is the most widely used logic family for the high-complexity, low-power applications that will be discussed in Chapter 72. The operation of DCFL shown in Fig. 71.2 is easily explained using a load line analysis. Currents are indicated by arrows in this figure. Solid arrows correspond to currents that are nearly constant. Dashed arrows represent currents that depend on the state of the output of the inverter. Figure 71.3 presents an ID–VDS characteristic of the enhancement mode (normally-off with threshold voltage VT > 0) transistor J1. A family of characteristic curves is drawn representing several VGS values. In this circuit, VGS = Vin and VDS = Vout. A load line representing the I–V characteristic of the active load (VGS = 0) depletion-mode (normally-on with VT < 0) transistor J2 is also superimposed on this drawing. Note that the logic low level VOL is determined by the intersection of the two curves when ID = IDD. Load line 1 corresponds to a load device with narrow width; load line 2 for a wider device. It is evident that the narrow, weaker device will provide a lower VOL value and thus will increase the logic swing. However, the weaker device will also have less


FIGURE 71.3 Drain current versus drain-source voltage characteristic of J1. The active load, J2, is also shown superimposed over the J1 characteristics as a load line. Two load lines corresponding to wide and narrow J2 widths are shown. In addition, the gate current IG of J3 versus Vout limits the logic high voltage.

current available to drive any load capacitance, so the inverter with load line 1 will therefore be slower than the one with load line 2. There is therefore a tradeoff between speed and logic swing. So far, the analysis of this circuit is the same as that of an analogous nMOS E/D inverter. In the case of DCFL logic inverters implemented with GaAs-based FETs, the Schottky barrier gate electrode of the next stage will limit the maximum value of VOH to the forward voltage drop across the gate-source diode. This is shown by the gate diode IG–VGS characteristic also superimposed on Fig. 71.3. VOH is given by the point of intersection between the load current IDD and the gate current IG, because a logic high output requires that the switch transistor J1 is off. VOH will therefore also depend on the load transistor current. Effort must be made not to overdrive the gate since excess gate current will flow through the internal parasitic source resistance of the driven device J3, degrading VOL of this next stage.
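The clamping of VOH by the next-stage gate diode can be estimated with a simple calculation. The Python sketch below assumes an ideal diode law for the Schottky gate, IG = IS[exp(V/(n·kT/q)) − 1], and solves IG = IDD for the gate voltage. The diode law, the ideality factor n, the saturation current IS, and the load currents are all assumptions for illustration; series resistance is ignored.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # electron charge, C


def voh_clamp(i_dd, i_s, n=1.1, temp_k=300.0):
    """Gate voltage at which an ideal-diode gate current equals the load current."""
    vt = K_B * temp_k / Q_E
    return n * vt * math.log(i_dd / i_s + 1.0)


if __name__ == "__main__":
    I_S = 1e-13  # A, assumed Schottky gate saturation current
    for i_dd in (50e-6, 100e-6, 200e-6):
        print(f"IDD = {i_dd * 1e6:5.0f} uA -> VOH ~ {voh_clamp(i_dd, I_S):.3f} V")
```

In this model, each doubling of the load current raises VOH by only n·(kT/q)·ln 2, about 20 mV at room temperature, which illustrates how weakly the logic high level depends on the load transistor current.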

Source-Coupled FET Logic

A second widely used type of logic circuit, source-coupled FET logic (SCFL), is shown in Fig. 71.4. SCFL, or its bipolar counterpart, ECL, is widely used for very high-speed applications, which will be discussed in Chapter 72. The core of the circuit consists of a differential amplifier, J1 and J2, a current source J3, and pull-up resistors RL on the drains. The differential topology is beneficial for rejection of common-mode noise. The static design procedure can be illustrated again by a load-line analysis. A maximum current ICS can flow through either J1 or J2.

FIGURE 71.4 Schematic of differential pair J1,J2 used as a source-coupled FET logic (SCFL) cell.


Figure 71.5 shows the ID–VDS characteristic of J1 for example. The maximum current ICS is shown by a dotted line. The output voltage VO1 is either VDD or VDD – ICSRL ; therefore, the maximum differential voltage swing, ∆V = 2 ICSRL , is determined by the choice of RL. Next, the width of J1 should be selected so that the change in VGS needed to produce the voltage drop ICSRL at the drain is less than ICSRL. This will ensure that the voltage gain is greater than 1 (needed to compensate for the source followers described below) and that the device is biased in its saturation region or cutoff at all times. The latter requirement is necessary if the maximum speed is to be obtained from the SCFL stage, FIGURE 71.5 Load-line analysis of the SCFL inverter cell. since device capacitances are minimized in saturation and cutoff. Source followers are frequently used on the output of an SCFL stage or at the inputs of the next stage. Figure 71.6(a) shows the schematic diagram of the follower circuit. The follower can serve two functions: level shifting and buffering of capacitive loads. When used as a level shifter, a negative or positive voltage offset can be obtained between input and output. The only requirement is that the VGS of the source follower must be larger than the FET threshold voltage. If the source follower is at the output of an SCFL cell, it can be used as a buffer to reduce the sensitivity of delay to load capacitance or fanout. The voltage gain of a source follower is always less than 1. This can be illustrated by another load-line analysis. Figure 71.6(b) presents the ID1–VDS1 characteristic of the source follower FET, J1. A constant VGS1 is applied for every curve plotted in the figure. The load line (dashed line) of a depletion-mode, active current source J2 is also superimposed. In this circuit, the output voltage is Vout = VDD – VDS1. The Vout is determined by the intersection of the load line with the ID1 characteristic curves. 
FIGURE 71.6 (a) Schematic of source follower, (b) load-line analysis of source follower, and (c) source follower buffer between SCFL stages.

The current of the pull-down current source is selected according to the amount of level shifting needed. A high current will result in a greater amount of level shift than a small load current. If the devices have high output resistance, so that their characteristic curves are very flat, very little change in VGS1 will be required to change Vout over the
full range from VOL to VDD . If VGS1 remains nearly constant, then Vout follows Vin , hence the name of the circuit. Since the input voltage to the source follower stage is Vin = VGS1 + Vout, a small change in VGS1 would produce an incremental voltage gain close to unity. If the output resistance is low, then the characteristic curves will slope upward and a larger range of VGS1 will be necessary to traverse the output voltage range. This condition would produce a low voltage gain. Small signal analysis shows that

$$A_v = \frac{1}{1 + \dfrac{1}{g_{m1}\,(r_{ds1} \parallel r_{ds2})}} \tag{71.1}$$
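As a rough numerical illustration of Eq. 71.1, the sketch below evaluates the follower gain for a high and a low output resistance. The transconductance and resistance values are hypothetical and chosen only to show the trend; they are not taken from any particular process.

```python
def parallel(r_a: float, r_b: float) -> float:
    """Resistance of r_a in parallel with r_b."""
    return r_a * r_b / (r_a + r_b)


def source_follower_gain(gm1: float, rds1: float, rds2: float) -> float:
    """Small-signal gain of Eq. 71.1: Av = 1 / (1 + 1 / (gm1 * (rds1 || rds2)))."""
    return 1.0 / (1.0 + 1.0 / (gm1 * parallel(rds1, rds2)))


# Hypothetical device values (assumptions for illustration only).
GM1 = 20e-3  # transconductance of J1, in siemens

gain_high_ro = source_follower_gain(GM1, 5e3, 5e3)    # flat output characteristics
gain_low_ro = source_follower_gain(GM1, 500.0, 500.0)  # sloped output characteristics

print(f"Av with rds1 = rds2 = 5 kOhm:   {gain_high_ro:.3f}")
print(f"Av with rds1 = rds2 = 500 Ohm: {gain_low_ro:.3f}")
```

With these assumed numbers the gain stays below unity in both cases and drops noticeably when the output resistances are low, which is the behavior the load-line argument predicts.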

The buffering effect of the source follower is accomplished by reducing the capacitive loading on the drain nodes of the differential amplifier because the input impedance of a source follower is high. Since the output tries to follow the input, the change in VGS will be less than that required by a common source stage. Therefore, the input capacitance is dominated by CGD , typically quite small for compound semiconductor FETs biased in saturation. The effective small-signal input capacitance is

$$C_G = C_{GD} + C_{GS}\,(1 - A_v) \tag{71.2}$$

where Av = dVout /dVin is the incremental voltage gain. The source follower also provides a low output impedance, whose real part is approximately 1/gm at low frequency. The current available to charge and discharge the load capacitance can be adjusted by the width ratio of J1 and J2. If the load is capacitive, Vout will be delayed from Vin . This will cause VGS1 to temporarily increase, providing excess current ID1 to charge the load capacitance. Ideally, for equal rise and fall times, the peak current available from J1 should equal the steady-state current of J2. Source followers can also be used at the input of an SCFL stage to provide level shifting as shown in Fig. 71.6(c). In this case, the drain resistors, RL , should be chosen to provide the proper termination resistance for the on-chip interconnect transmission line. These resistors provide a reverse termination to absorb signals that reflect from the high input impedance of the source follower. Alternatively, the drain resistors can be located at the gate of the source follower, thereby providing a shunt termination at the destination end of the interconnect. This practice results in good signal integrity, but because the practical values of characteristic impedance are less than 100-Ω , the current swing in the differential amplifier core must be large. This will increase power dissipation per stage. SCFL logic structures generally employ more than one level of differential pairs so that complex logic functions (XOR, latch, and flip-flop) that require multiple gates to implement in logic families such as DCFL can be implemented in one stage. More details on SCFL gate structures and examples of their usage will be given in Chapter 72.

Static and Dynamic Noise Margin and Noise Sources

Noise margin is a measure of the ability of a logic circuit to provide proper functionality in the presence of noise.28 There are many different definitions of noise margin, but a simple and intuitive one for the static or dc noise margin is illustrated by Fig. 71.7. Here, the transfer characteristic from Fig. 71.1 is plotted again. In order to evaluate the ability of a chain of such inverters to reject noise, a loop consisting of two identical inverters is considered. This might be representative of the positive feedback core of a bistable latch. Because the inverters are connected in a loop, an infinite chain of inverters is represented. The transfer characteristic of inverter 2 in Fig. 71.7 is plotted in gray lines. For inverter 2, Vout1 = Vin2 and Vout2 = Vin1. Therefore, the axes are reversed for the characteristic plotted for inverter 2.

FIGURE 71.7 Voltage transfer characteristics of an inverter pair connected in a loop. Noise margins are shown as VNH and VNL.

If a series noise source VN were placed within the loop as shown, the maximum static noise voltage allowed will be represented by the maximum width of the loops formed by the transfer characteristic.28 These widths, labeled VNL and VNH, respectively, for the low and high noise margins, are shown in the figure. If the voltage VN exceeds VNL or VNH, the latch will be set into the opposite state and will remain there until
reset. This would constitute a logic failure. Therefore, we must insist that any viable logic circuit provide noise margins well in excess of ambient noise levels in the circuit. The static noise margin defined above utilized a dc voltage source VN in series with logic inverters to represent static noise. This source might represent a static offset voltage caused by IR drop along IC power and ground distribution networks. The DCFL inverter, for example, would experience a shift in VTH that is directly proportional to a ground voltage offset. This shift would skew the noise margins. The smallest noise margin would determine the circuit electrical yield. The layout of the power and ground distribution networks must consider this problem. The width of power and ground buses on-chip must be sufficient to guarantee a maximum IR drop that does not compromise circuit operation. It is important to note that this width is frequently much greater than what might be required by electromigration limits. It is essential that the designer consider IR drop in the layout. Some digital IC processes allow the topmost metal layer to form a continuous sheet, thereby minimizing voltage drops. The static noise voltage source VN might also represent static threshold voltage shifts on the active devices due to statistical process variation or backgating effects. Therefore, the noise margin must be several times greater than the variance in device threshold voltages provided by the fabrication process so that electrical yields will not be compromised.29 The above definition of maximum width noise margin has assumed a steady-state condition. It does not account for transient noise sources and the delayed response of the logic circuit to noise pulses. Unfortunately, pulses of noise are quite common in digital systems. 
For example, the ground potential can often be modified dynamically by simultaneous switching events on the IC chip.30 Any ground distribution bus can be modeled as a transmission line with impedance Z0 where

$$Z_0 = \sqrt{\frac{L_o}{C_o}} \tag{71.3}$$

Here, Lo is the equivalent series inductance per unit length and Co the equivalent shunt capacitance per unit length. Since the interconnect exhibits a series inductance, there will be transient voltage noise ∆V induced on the line by current transients as predicted by

$$\Delta V = L\,\frac{dI}{dt} \tag{71.4}$$

This form of noise is often called ground bounce. The ground bounce ∆V is particularly severe when many devices are being switched synchronously, as would be the case in many applications involving
flip-flops in shift registers or pipelined architectures. The high peak currents that result in such situations can generate large voltage spikes. For example, output drivers are well-known sources of noise pulses on power and ground buses unless they are carefully balanced with fully differential interconnections and are powered by power and ground pins separate from the central logic core of the IC. Designing to minimize ground bounce requires minimization of inductance. Bakoglu30 provides a good discussion of power distribution noise in high-speed circuits. There are several steps often used to reduce switching noise. First, it is standard practice to make extensive use of multiple ground pins on the chip to reduce bond-wire inductance and package trace inductance when conventional packaging is used. Bypass capacitance off-chip can be useful if it can be located inside the package and can have a high series resonant frequency. On-chip bypass capacitance is also helpful, especially if enough charge can be supplied from the capacitance to provide the current during clock transitions. The objective is to provide a low impedance between power and ground on-chip at the clock frequency and at odd multiples of the clock frequency. Finally, as mentioned above, high-current circuits such as clock drivers and output drivers should not share power and ground pins with other logic on-chip. Crosstalk is another common source of noise pulses caused by electromagnetic coupling between adjacent interconnect lines. A signal propagating on a driven interconnect line can induce a crosstalk voltage and current in a coupled line. The duration of the pulse depends on the length of interconnect; the amplitude depends on the mutual inductance and capacitance between the coupled lines.31 In order to determine how much noise is acceptable in a logic circuit, the noise margin definition can be modified to accommodate the transient noise pulse situation. 
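To give a feel for the magnitudes implied by Eqs. 71.3 and 71.4, the following sketch estimates the ground bounce for a single ground pin and for several pins in parallel. All numbers are hypothetical, the current transient is approximated as a linear ramp, and the parallel pins are assumed identical with no mutual inductance, which is optimistic for real packages.

```python
import math


def characteristic_impedance(l_per_len: float, c_per_len: float) -> float:
    """Eq. 71.3: Z0 = sqrt(Lo / Co), with Lo in H/m and Co in F/m."""
    return math.sqrt(l_per_len / c_per_len)


def ground_bounce(inductance: float, delta_i: float, delta_t: float) -> float:
    """Eq. 71.4 with dI/dt approximated by a linear ramp delta_i / delta_t."""
    return inductance * delta_i / delta_t


def parallel_pin_inductance(l_pin: float, n_pins: int) -> float:
    """Effective inductance of n identical pins, ignoring mutual coupling (assumption)."""
    return l_pin / n_pins


# Hypothetical values for illustration only.
L_PIN = 2e-9       # bond wire plus package trace, in henries
DELTA_I = 50e-3    # switched current step, in amperes
DELTA_T = 100e-12  # current transition time, in seconds

bounce_one_pin = ground_bounce(L_PIN, DELTA_I, DELTA_T)
bounce_eight_pins = ground_bounce(parallel_pin_inductance(L_PIN, 8), DELTA_I, DELTA_T)
z0_bus = characteristic_impedance(400e-9, 160e-12)

print(f"Ground bounce, 1 pin:  {bounce_one_pin:.3f} V")
print(f"Ground bounce, 8 pins: {bounce_eight_pins:.3f} V")
print(f"Bus impedance Z0:      {z0_bus:.1f} Ohm")
```

Under these assumptions a single pin would produce a spike comparable to the entire logic swing, which is why multiple ground pins and separate supplies for high-current drivers are recommended above.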
The logic circuit does not respond instantaneously to the noise pulse at the input. This delay in the response is attributed to the device and interconnect capacitances and the device current limitations which will be discussed extensively in Section 71.3. Consider the device input capacitance. Sufficient charge must be transferred during the noise pulse to the input capacitance to shift the control voltage either above or below threshold. In addition, this voltage must be maintained long enough for the output to respond if a logic upset is to occur. Therefore, a logic circuit can withstand a larger noise pulse amplitude than would be predicted from the static noise margin if the pulse width is much less than the propagation delay of the circuit. This increased noise margin for short pulses is called the dynamic noise margin (DNM). The DNM approaches the static NM if the pulse width is wide compared with the propagation delay because the circuit can charge up to the full voltage if not constrained by time. The DNM can be predicted by simulation.

FIGURE 71.8 (a) Set-reset latch used to describe dynamic noise margin simulation, and (b) plot of the pulse amplitude applied to the set input in (a) that results in a logic upset.

Figure 71.8(a) shows the loop connection of the set-reset NOR latch similar to that which was used for the static NM definition in Fig. 71.7. The inverter has been modified to become a NOR gate in this case. An input pulse train V1(t) of fixed duration but with
gradually increasing amplitude can be applied to the set input. The latch was initialized by applying an initial condition to the reset input V2(t). The output response is observed for the input pulse train. At some input amplitude level, the output will be set into the opposite state. The latch will hold this state until it is reset again. The cross-coupled NOR latch thus becomes a logic upset detector, dynamically determining the maximum noise margin for a particular pulse width. The simulation can be repeated for other pulse widths, and a plot of the pulse amplitude that causes the latch to set for each pulse duration can be constructed, as shown in Fig. 71.8(b). Here, any amplitude or duration that falls on or above the curve will lead to a logic upset condition.
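The general shape of such a curve can be sketched with a much cruder model than the latch simulation described above. The example below assumes, purely for illustration, that the gate input behaves as a single RC node with time constant `tau` and that an upset occurs once a rectangular pulse charges that node past the static noise margin. Both the model and the numbers are assumptions, not the simulation method of the text.

```python
import math


def critical_pulse_amplitude(pulse_width: float, tau: float, static_nm: float) -> float:
    """Pulse amplitude that just charges a single-pole node to static_nm.

    The node reaches A * (1 - exp(-tw / tau)) at the end of a rectangular
    pulse of amplitude A and width tw, so the critical amplitude is
    static_nm / (1 - exp(-tw / tau)).
    """
    return static_nm / (1.0 - math.exp(-pulse_width / tau))


# Hypothetical values for illustration only.
STATIC_NM = 0.2  # static noise margin, in volts
TAU = 50e-12     # assumed input time constant, in seconds
WIDTHS = [10e-12, 25e-12, 50e-12, 100e-12, 500e-12]

dnm_curve = [(w, critical_pulse_amplitude(w, TAU, STATIC_NM)) for w in WIDTHS]

for width, amplitude in dnm_curve:
    print(f"pulse width {width * 1e12:6.0f} ps -> upset amplitude {amplitude:.3f} V")
```

In this simplified picture the tolerable amplitude is large for pulses much shorter than `tau` and falls toward the static noise margin as the pulse widens, matching the qualitative behavior of Fig. 71.8(b).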

Power Dissipation

Power dissipation of a static logic circuit consists of a static and a dynamic component as shown below.

$$P_D = V_{DD} I_{DD} + C_L\,\Delta V^2 f \eta \tag{71.5}$$

In the case of DCFL, the current IDD from the pull-up transistor J2 is relatively constant, flowing either in the pull-down (switch) device J1 or in the gate(s) of the subsequent stage(s). Taking its average value, the static power is VDDIDD. The dynamic power CL∆V²fη depends on the frequency of operation f, the load capacitance CL, and the duty factor η. η is application dependent. Since the voltage swing is rather small for the DCFL inverter under consideration (about 0.6 V for MESFETs), the dynamic power will not be significant unless the load capacitance is very large, such as in the case of clock distribution networks. The VDD power supply voltage is traditionally 2 V because of compatibility with the bipolar ECL VTT supply, but a VDD as low as 1 V can be used for special low-power applications. Typical power dissipation per logic cell (inverter, NOR) depends on the choice of supply voltage and on IDD. Power is typically determined based on speed and is usually in the range of 0.1 to 0.5 mW/gate. DCFL logic circuits are often used when the application requires high circuit density and very low power.
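The relative size of the two terms in Eq. 71.5 can be checked with a short calculation. The supply voltage and logic swing below follow the values quoted in the text; the supply current, load capacitance, clock frequency, and duty factor are hypothetical.

```python
def static_logic_power(vdd, idd, c_load, delta_v, freq, duty):
    """Return (static, dynamic) power terms of Eq. 71.5, in watts."""
    p_static = vdd * idd
    p_dynamic = c_load * delta_v ** 2 * freq * duty
    return p_static, p_dynamic


# VDD and the 0.6 V swing come from the text; the rest are assumed values.
p_static, p_dynamic = static_logic_power(
    vdd=2.0,        # volts
    idd=100e-6,     # amperes (assumed)
    c_load=50e-15,  # farads (assumed)
    delta_v=0.6,    # volts
    freq=1e9,       # hertz (assumed)
    duty=0.5,       # duty factor eta (assumed)
)

print(f"static power:  {p_static * 1e3:.3f} mW")
print(f"dynamic power: {p_dynamic * 1e3:.3f} mW")
```

For these assumed values the dynamic term is only a few percent of the static term, consistent with the statement that dynamic power matters mainly for very large load capacitances.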

71.3 Transient Analysis and Design for Very-High-Speed Logic

Adequate attention must be given to static or dc design, as described in the previous section, in order to guarantee functionality under the worst-case situations. In addition, since the only reason to use the compound semiconductor devices for digital electronics at all is their speed, attention must be given to the dynamic performance as well. In this section, we will describe three methods for estimating the performance of high-speed digital logic circuit functional blocks. Each of these methods has its strengths and weaknesses. The most effective methods for guiding the design are those that provide insight that helps to identify the dominant time constants that determine circuit performance. These are not necessarily the most accurate methods, but are highly useful because they allow the designer to determine what part of the circuit or device is limiting the speed. Circuit simulators are far more accurate (at least to the extent that the device models are valid), but do not provide much insight into performance limitations. Without simple analytical techniques to guide the design, performance optimization becomes a trial-and-error exercise.

Zero-Order Delay Estimate

The first technique, which uses the simple relationship between voltage and current in a capacitor,

$$I = C_L\,\frac{dV}{dt} \tag{71.6}$$

is relevant when circuit performance is dominated by wiring or fan-out capacitance. This will be the case if the delay predicted by Eq. 71.6 due to the total loading capacitance, CL, significantly exceeds the intrinsic delay of a basic inverter or logic gate. To apply this approach, determine the average current available from the driving logic circuit for charging (ILH) and discharging (IHL) the load capacitance. The logic swing ∆V is known, so low-to-high (tPLH) and high-to-low (tPHL) propagation delays can be determined from Eq. 71.6. These delays represent the time required to charge or discharge the circuit output to 50% of its final value. Thus, tPLH is given by

$$t_{PLH} = \frac{C_L\,\Delta V}{2 I_{LH}} \tag{71.7}$$

where ILH is the average charging current during the output transition from VOL to VOL + ∆V/2. The net propagation delay is given by

$$t_P = \frac{t_{PLH} + t_{PHL}}{2} \tag{71.8}$$

At this limit, where speed is dominated by the ability to drive load capacitance, we see that increasing the currents will reduce tP . In fact, the product of power (proportional to current) and delay (inversely proportional to current) is nearly constant under this situation. Increases in power lead to reduction of delay until the interconnect distributed RC delays or electromagnetic propagation delays become comparable to tP . The equation also shows that small voltage swing ∆V is good for speed if the noise margin and drive current are not compromised. This means that the devices must provide high transconductance. For example, the DCFL inverter of Fig. 71.2 can be analyzed. Figure 71.9 shows equivalent circuits that represent the low-to-high and high-to-low transitions. The current available for the low-to-high transition, IPLH , shown in Fig. 71.9(a), is equal to the average pullup current, IDD . If we assume that VOL = 0.1 V and VOH = 0.7 V, then the ∆V of interest is 0.6 V. This brings the output up to 0.4 V at V50% . In this range of Vout, the active load transistor J2 is in saturation at all times for VDD > 1 V, so IDD will be relatively constant, and all of the current will be available to charge the capacitor. The high-to-low transition is more difficult to model in this case. Vout will begin at 0.7 V and discharge to 0.4 V. The discharge current through the drain of J1 is going to vary with time because the device is below saturation over this range of Vout . Looking at the Vin = 0.7 V characteristic curve in Fig. 71.3, we see that its ID–VDS characteristic is resistive. Let’s approximate the slope by 1/Ron. Also, the discharge current IPHL is the difference between IDD and ID1 = Vout /Ron , as shown in Fig. 71.9(b). The average current available to discharge the capacitor can be estimated by

$$I_{HL} = \frac{V_{OH} + V_{50\%}}{2R_{on}} - I_{DD} \tag{71.9}$$

Then, tPHL is estimated by

$$t_{PHL} = \frac{C_L\,\Delta V}{2 I_{HL}} \tag{71.10}$$
FIGURE 71.9 (a) Equivalent circuit for low-to-high transition; and (b) equivalent circuit for high-to-low transition.

Time Constant Delay Methods: Elmore Delay and Risetime

Time constant delay estimation methods are very useful when the wiring capacitance is quite small or the charging current is quite high. In this situation, typical of very-high-speed SSI and MSI circuits that
push the limits of the device and process technology, the circuit delays are dominated by the devices themselves. Both methods to be described rely on a large-signal equivalent circuit model of the transistors, an approximation dubious at best. But, the objective of these techniques is not absolute accuracy. That is much less important than being able to identify the dominant contributors to the delay and risetime, since more accurate but less intuitive solutions are easily available through circuit simulation. The construction of the large signal equivalent circuit requires averaging of non-linear model elements such as transconductance and certain device capacitances over the appropriate voltage swing. The propagation delay definition described above, the delay required to reach 50% of the logic swing, must be relaxed slightly to apply methods based on linear system analysis. It was first shown by Elmore in 194832 and apparently rediscovered by Ashar in 196433 that the delay time tD between an impulse function δ(0) applied at t = 0 to the input of a network and the centroid or “center-of-mass” of the impulse response (output) is quite close to the 50% delay. This definition of delay tD is illustrated in Fig. 71.10. Two conditions must be satisfied in order to use this approach. First, the step response of the network is monotonic. This implies that the impulse response is purely a positive function. Monotonic step response is valid only when the circuit poles are all negative and real, or the circuit is heavily damped. Due to feedback through device capacitances, this condition is seldom completely correct. Complex poles often exist. However, strongly underdamped circuits are seldom useful for reliable logic circuits because their transient response will exhibit ringing, so efforts to compensate or damp such oscillations are needed in these cases anyway. 
Then, the circuit becomes heavily damped or at least dominated by a single pole and fits the above requirement more precisely. Second, the correspondence between tD and tPLH is improved if the impulse response is symmetric in shape, as in Fig. 71.10(b). It is shown in Ref. 33 that cascaded stages with similar time constants have a tendency to approach a Gaussian-shaped distribution as the number of stages becomes large. Most logic systems require several cascaded stages, so this condition is often true as well. Assuming that these conditions are approximately satisfied, we can make use of the fact that the impulse response of a circuit in the frequency domain is given by its transfer (or network) function F(s) in the complex frequency s = σ + jω. Then, the propagation delay, tD , can be determined by

$$t_D = \frac{\int_0^\infty t\,f(t)\,dt}{\int_0^\infty f(t)\,dt} = \lim_{s \to 0} \frac{\int_0^\infty t\,f(t)\,e^{-st}\,dt}{\int_0^\infty f(t)\,e^{-st}\,dt} = \left[\frac{-\dfrac{d}{ds}F(s)}{F(s)}\right]_{s=0} \tag{71.11}$$

FIGURE 71.10 (a) Monotonic step response of a network; (b) corresponding impulse response. The delay tD is defined as the centroid of the impulse response.

Fortunately, the integration never needs to be performed. tD can be obtained directly from the network function F(s) as shown. But, the network function must be calculated from the large-signal equivalent circuit of the device, including all important parasitics, driving impedances, and load impedances. This is notoriously difficult if the circuit includes a large number of capacitances or inductances. Fortunately, in most cases, circuits of interest can be subdivided into smaller networks, cascaded, and the presumed linearity of the circuits can be employed to simplify the task. In addition, the evaluation of the function at s = 0 eliminates many terms in the equations that result. In particular, Tien34 shows that two corollaries are particularly useful in cascading circuit blocks:

1. If the network function F(s) = A(s)/B(s), then

$$t_D = \left[\frac{-\dfrac{d}{ds}A(s)}{A(s)}\right]_{s=0} + \left[\frac{\dfrac{d}{ds}B(s)}{B(s)}\right]_{s=0} \tag{71.12}$$

2. If F(s) = A(s)B(s)C(s), then

$$t_D = \left[\frac{-\dfrac{d}{ds}A(s)}{A(s)}\right]_{s=0} + \left[\frac{-\dfrac{d}{ds}B(s)}{B(s)}\right]_{s=0} + \left[\frac{-\dfrac{d}{ds}C(s)}{C(s)}\right]_{s=0} \tag{71.13}$$

This shows that the total delay is just the sum of the individual delays of each circuit block. When computing the network functions, care must be taken to include the driving point impedance of the
previous stage and to represent the previous stage as a Thevenin-equivalent open-circuit voltage source. A good description and illustration of the use of this approach in the analysis of bipolar ECL and CML circuits can be found in Ref. 34. Risetime: the standard definition of risetime is the 10 to 90% time delay of the step response of a network. While convenient for measurement, this definition is analytically unpleasant to derive for anything except simple, first-order circuits. Elmore demonstrated that the standard deviation of the impulse response could be used to estimate the risetime of a network.32 This definition provides estimates that are close to the standard definition. The standard deviation of the impulse response can be calculated using

∞  TR2 = 2π  t 2 f t dt − t D2    0 

∫ ()

(71.14)

Since the impulse response frequently resembles the Gaussian function, the integral is easily evaluated. Once again, the integration need not be performed. Lee35 has pointed out that the transform techniques can also be used to obtain the Elmore risetime directly from the network function F(s).

$$T_R^2 = 2\pi \left[\frac{\dfrac{d^2}{ds^2}F(s)}{F(s)}\right]_{s=0} - 2\pi \left(\left[\frac{\dfrac{d}{ds}F(s)}{F(s)}\right]_{s=0}\right)^2 \tag{71.15}$$

This result can also be used to show that the squares of the risetimes of cascaded networks add. If two networks are characterized by risetimes TR1 and TR2, the total risetime TR,total is given by the root-sum-square of the individual risetimes

$$T_{R,\mathrm{total}}^2 = T_{R1}^2 + T_{R2}^2 \tag{71.16}$$
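The additive properties of the Elmore delay and risetime can be checked numerically on a simple case. The sketch below assumes two cascaded, non-interacting (buffered) single-pole stages with hypothetical time constants `tau1` and `tau2`, integrates the moments of the resulting impulse response with the trapezoidal rule, and compares the result with the sum of the stage delays and with Eq. 71.16. The moments are divided by the area of the impulse response so that the result does not depend on the dc gain.

```python
import math


def two_stage_impulse_response(t: float, tau1: float, tau2: float) -> float:
    """Impulse response of 1 / ((1 + s*tau1) * (1 + s*tau2)), valid for tau1 != tau2."""
    return (math.exp(-t / tau1) - math.exp(-t / tau2)) / (tau1 - tau2)


def elmore_delay_and_risetime(f, t_end: float, steps: int = 200_000):
    """Centroid (Eq. 71.11) and risetime (Eq. 71.14) from numerically integrated moments."""
    dt = t_end / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps + 1):
        t = k * dt
        weight = 0.5 if k in (0, steps) else 1.0  # trapezoidal end-point weights
        area = weight * f(t) * dt
        m0 += area
        m1 += t * area
        m2 += t * t * area
    t_d = m1 / m0
    t_r = math.sqrt(2.0 * math.pi * (m2 / m0 - t_d ** 2))
    return t_d, t_r


# Hypothetical stage time constants, in seconds.
tau1, tau2 = 10e-12, 4e-12

t_d, t_r = elmore_delay_and_risetime(
    lambda t: two_stage_impulse_response(t, tau1, tau2), t_end=40.0 * tau1
)

# Each single-pole stage alone has delay tau and risetime sqrt(2*pi) * tau.
t_r_rss = math.sqrt(2.0 * math.pi * (tau1 ** 2 + tau2 ** 2))

print(f"numerical tD = {t_d * 1e12:.3f} ps, tau1 + tau2 = {(tau1 + tau2) * 1e12:.3f} ps")
print(f"numerical TR = {t_r * 1e12:.3f} ps, root-sum-square = {t_r_rss * 1e12:.3f} ps")
```

The agreement is expected for this idealized case because both the centroid and the variance of an impulse response add when non-interacting stages are cascaded.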

Time Constant Methods: Open Circuit Time Constants

The frequency domain/transform methods for finding delay and risetime are particularly valuable for design optimization because they identify dominant time constants. Once the time constants are found, the designer can make efforts to change biases, component values, or optimize the design of the transistors themselves to improve the performance through addressing the relevant bottleneck in performance. The drawback in the above technique is that a network function must be derived. This becomes tedious and time-consuming if the network is of even modest complexity. An alternate technique was developed36,37 that also can provide reasonable estimates for delay, but with much less computational difficulty. The open-circuit time constant (OCTC) method is widely used for the analysis of the bandwidth of analog electronic circuits just for this reason. It is just as applicable for estimating the delay of very-high-speed digital circuits. The basis for this technique again comes from the transfer or network function F(s) = Vo(s)/Vi(s). Considering transfer functions containing only poles, the function can be written as

$$F(s) = \frac{a_0}{b_n s^n + b_{n-1} s^{n-1} + \cdots + b_1 s + 1} \tag{71.17}$$

The denominator comes from the product of n factors of the form (τjs + 1), where τj is the time constant associated with the j-th pole in the transfer function. The b1 coefficient can be shown to be equal to the sum

$$b_1 = \sum_{j=1}^{n} \tau_j \tag{71.18}$$

of the time constants and bn the product of all the time constants. Often, the first-order term dominates the frequency response. In this case, the 3-dB bandwidth is then estimated by ω3dB = 1/b1. The higher-order terms are neglected. The accuracy of this approach is good, especially when the circuit has a dominant pole. The worst error would occur when all poles have the same frequency. The error in this case is about 25%. Much worse errors can occur, however, if the poles are complex or if there are zeros in the transfer function as well. We will discuss this later. Elmore has once again provided the connection we need to obtain delay and risetime estimates from the transfer function. The Elmore delay is given by

D = b1 − a1

(71.19)

where a1 is the corresponding coefficient of the first-order zero (if any) in the numerator. The risetime is given by

$$T_R^2 = b_1^2 - a_1^2 + 2\,(a_2 - b_2) \tag{71.20}$$

In Eq. 71.20, a2 and b2 correspond to the coefficients of the second-order zero and pole, respectively. At this point, it would appear that we have gained nothing, since finding the time constants associated with the poles and zeros is well known to be difficult. Fortunately, it is possible to obtain the b1 and b2 coefficients directly by a much simpler method: open-circuit time constants. It has been shown that35,36

$$b_1 = \sum_{j=1}^{n} R_{jo} C_j = \sum_{j=1}^{n} \tau_{jo} \tag{71.21}$$

that is, the sum of the time constants τjo , defined as the product of the effective open-circuit resistance Rjo across each capacitor Cj when all other capacitors are open-circuited, equals b1 . These time constants are very easy to calculate since open-circuiting all other capacitors greatly simplifies the network by decoupling many other components. Dependent sources must be considered in the calculation of the Rjo open-circuit resistances. Note that these open-circuit time constants are not equal to the pole time constants, but their sum gives the same result for b1. It should also be noted that the individual OCTCs give the time constant of the network if the j-th capacitor were the only capacitor. Thus, each time constant provides information about the relative contribution of that part of the circuit to the bandwidth or the delay.35 If one of these is much larger than the rest, this is the place to begin working on the circuit to improve its speed. The b2 coefficient can also be found by a similar process,38 taking the sum of the product of time constants of all possible pairs of capacitors. For example, in a three-capacitor circuit, b2 is given by

$$b_2 = R_{1o}C_1 R_{2s}^{1}C_2 + R_{1o}C_1 R_{3s}^{1}C_3 + R_{2o}C_2 R_{3s}^{2}C_3 \tag{71.22}$$

where the $R_{js}^{i}$ resistance is the resistance across capacitor Cj calculated when capacitor Ci is short-circuited and all other capacitors are open-circuited. The superscript indicates which capacitor is to be shorted. So, $R_{3s}^{2}$ is the resistance across C3 when C2 is short-circuited and C1 is open-circuited.

FIGURE 71.11 Schematic of basic ECL inverter.

Note that the first time constant in each product is an open-circuit time constant that has already been calculated. In
addition, for any pair of capacitors in the network, we can find an OCTC for one and a SCTC for the other. The order of choice does not matter because

$$R_{io}C_i R_{js}^{i}C_j = R_{jo}C_j R_{is}^{j}C_i \tag{71.23}$$

so we are free to choose whichever combination minimizes the computational effort.38 At this stage, it would be helpful to illustrate the techniques described above with an example. An ECL inverter whose schematic is shown in Fig. 71.11 is selected for this purpose. The analysis is based on work described in more detail in Ref. 39.

FIGURE 71.12 (a) Large-signal half-circuit model of ECL inverter; and (b) large-signal equivalent circuit of (a).

The first step is to construct the large-signal equivalent circuit. We will discuss how to evaluate the large-signal component values later. Figure 71.12 shows such a model applied to the ECL inverter, where
the half-circuit approximation has been used in Fig. 71.12(a) due to the inherent symmetry of differential circuits.40 The hybrid-pi BJT model shown in Fig. 71.12(b) has been used with several simplifications. The dynamic input resistance, rπ, has been neglected because other circuit resistances are typically much smaller. The output resistance, ro, has also been neglected for the same reason. The collector-to-substrate capacitance, CCS, has been neglected because in III-V technologies, semi-insulating substrates are typically used. The capacitance to substrate is quite small compared to other device capacitances. Retained in the model are resistances Rbb, the extrinsic and intrinsic base resistance, and REX, the parasitic emitter resistance. Both of these are very critical for optimizing high-speed performance. In the circuit itself, RIN is the sum of the driving point resistance from the previous stage, probably an emitter follower output, and Rbb1 of Q1. RL is the collector load resistor, whose value is determined by half of the output voltage swing and the dc emitter current, ICS: RL = ∆V/2ICS. The REX of the emitter follower is included in REF. We must calculate open-circuit time constants for each of the four capacitors in the circuit. First consider C1, the base-emitter diffusion and depletion capacitance of Q1. C2 is the collector-base depletion capacitance of Q1. C3 and C4 are the corresponding base-emitter and base-collector capacitances of Q2.

FIGURE 71.13 Equivalent large-signal half-circuit model for calculation of R1o.

Figure 71.13 represents the equivalent circuit schematic when C2 = C3 = C4 = 0. A test source, V1, is placed at the C1 location. R1o = V1/I1 is determined by circuit analysis to be

$$R_{1o} = \frac{R_{IN} + R_{EX}}{1 + G_{M1}R_{EX}} \tag{71.24}$$

Table 71.1 shows the result of similar calculations for R2o , R3o , and R4o . The b1 coefficient (first-order estimate of tD) can now be found from the sum of the OCTCs:

b1 = R1oC1 + R2oC2 + R3oC3 + R4oC4

(71.25)
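A short calculation shows how Eq. 71.25 exposes the dominant contributor. The resistance expressions below are those listed in Table 71.1; every component value is hypothetical and serves only to illustrate the bookkeeping.

```python
def ecl_open_circuit_resistances(r_in, r_ex, r_l, r_bb, r_ef, gm1, gm2):
    """Open-circuit resistances of Table 71.1 for the ECL half-circuit."""
    gm1_eff = gm1 / (1.0 + gm1 * r_ex)  # G'M1, degenerated by REX
    return {
        "R1o": (r_in + r_ex) / (1.0 + gm1 * r_ex),
        "R2o": r_in + r_l + gm1_eff * r_in * r_l,
        "R3o": (r_bb + r_l + r_ef) / (1.0 + gm2 * r_ef),
        "R4o": r_bb + r_l,
    }


# Hypothetical component values (assumptions for illustration only).
resistances = ecl_open_circuit_resistances(
    r_in=100.0, r_ex=5.0, r_l=200.0, r_bb=50.0, r_ef=500.0, gm1=20e-3, gm2=50e-3
)
capacitances = {"R1o": 100e-15, "R2o": 20e-15, "R3o": 100e-15, "R4o": 20e-15}

# Eq. 71.25: b1 is the sum of the individual open-circuit time constants.
time_constants = {name: resistances[name] * capacitances[name] for name in resistances}
b1 = sum(time_constants.values())
dominant = max(time_constants, key=time_constants.get)

for name, tau in time_constants.items():
    print(f"{name}*C = {tau * 1e12:6.2f} ps")
print(f"b1 = {b1 * 1e12:.2f} ps, dominant term: {dominant}")
```

For these assumed values the R2oC2 term is the largest, reflecting the Miller multiplication of the collector-base capacitance of Q1, so reducing RIN, RL, or C2 would be the first place to look for speed improvement.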

Considering the results in Table 71.1, one can see that there are many contributors to the time constants and that it will be possible to determine the dominant terms after evaluating the model and circuit parameters. Next, estimates must be made of the non-linear device parameters, GMi and Ci . The large signal transconductances can be estimated from

$$G_M = \frac{\Delta I_C}{\Delta V_{BE}} \tag{71.26}$$

TABLE 71.1 Effective Zero-Frequency Resistances for Open-Circuit Time-Constant Calculation for the Circuit of Fig. 71.12 (G′M1 = GM1/(1 + GM1REX))

R1o    (RIN + REX)/(1 + GM1REX)
R2o    RIN + RL + G′M1RINRL
R3o    (Rbb + RL + REF)/(1 + GM2REF)
R4o    Rbb + RL

For the half-circuit model of the differential pair, the current ∆IC is the full value of ICS since the device switches between cutoff and ICS. The ∆VBE corresponds to the input voltage swing needed to switch the device between cutoff and ICS. This is on the order of 3VT (or 75 mV) for half of a differential input. So, GM1 = ICS/0.075 is the large-signal estimate for transconductance of Q1. The emitter follower Q2 is biased at IEF when the output is at VOL. Let us assume that an identical increase in current, IEF, will provide the logic swing needed on the output of the inverter to reach VOH. Thus, ∆IC2 = IEF and REF = (VOH – VOL)/IEF. The difference in VBE at the input required to double the collector current can be calculated from

$$\Delta V_{BE} = V_T \ln(2) = 0.7V_T = 17.5\ \mathrm{mV} \tag{71.27}$$

Thus, GM2 = IEF /0.0175. C1 and C3 consist of the parallel combination of the depletion (space charge) layer capacitance, Cbe , and the diffusion capacitance, CD . C2 and C4 are the base-collector depletion capacitances. Depletion capacitances are voltage varying according to

 V C V = C 0 1 −   φ

( ) ()

−m

(71.28)

where C(0) is the capacitance at zero bias, φ is the built-in voltage, and m the grading coefficient. An equivalent large-signal capacitance can be calculated by

$$C = \frac{Q_2 - Q_1}{V_2 - V_1} \tag{71.29}$$

Qi is the charge at the initial (1) or final (2) state corresponding to the voltages Vi. Q2 – Q1 = ∆Q and

$$\Delta Q = \int_{V_1}^{V_2} C(V)\, dV \tag{71.30}$$
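For the depletion capacitance of Eq. 71.28, the integral in Eq. 71.30 has a closed form when m ≠ 1, so the equivalent large-signal capacitance of Eq. 71.29 can be computed directly. The sketch below is illustrative only: the function names are hypothetical, the parameter values are placeholders, and the model is assumed valid only for V < φ. A midpoint-rule integration is included as a cross-check.

```python
def depletion_capacitance(v, c0, phi, m):
    """Small-signal depletion capacitance at bias v, Eq. 71.28 (requires v < phi)."""
    return c0 * (1.0 - v / phi) ** (-m)


def large_signal_capacitance(v1, v2, c0, phi, m):
    """Equivalent capacitance (Q2 - Q1)/(V2 - V1), Eqs. 71.29 and 71.30, for m != 1."""
    if m == 1.0:
        raise ValueError("closed form used here assumes m != 1")

    def charge(v):
        # Antiderivative of C(0) * (1 - v/phi)**(-m) with respect to v.
        return -c0 * phi * (1.0 - v / phi) ** (1.0 - m) / (1.0 - m)

    return (charge(v2) - charge(v1)) / (v2 - v1)


def large_signal_capacitance_numeric(v1, v2, c0, phi, m, steps=10000):
    """Same quantity by midpoint-rule integration of Eq. 71.30."""
    dv = (v2 - v1) / steps
    delta_q = sum(depletion_capacitance(v1 + (k + 0.5) * dv, c0, phi, m) * dv
                  for k in range(steps))
    return delta_q / (v2 - v1)


# Hypothetical junction: C(0) = 20 fF, phi = 1.0 V, m = 0.5, swing from -2 V to 0 V.
c0, phi, m = 20e-15, 1.0, 0.5
c_eq = large_signal_capacitance(-2.0, 0.0, c0, phi, m)
c_eq_numeric = large_signal_capacitance_numeric(-2.0, 0.0, c0, phi, m)
print(f"closed form: {c_eq * 1e15:.3f} fF, numeric: {c_eq_numeric * 1e15:.3f} fF")
```

Because the junction is reverse biased over this swing, the equivalent capacitance comes out below C(0), as expected from Eq. 71.28.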

The large-signal diffusion capacitance can be found from

CD = G M τ f

(71.31)
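The large-signal estimates of Eqs. 71.26, 71.27, and 71.31 reduce to a few lines of arithmetic. In this sketch the bias currents and the transit delay are hypothetical placeholders, and the thermal voltage is taken as 25 mV, the value implied by the text's 3VT = 75 mV.

```python
import math

V_T = 0.025  # thermal voltage in volts (assumed 25 mV)


def large_signal_gm(delta_i_c, delta_v_be):
    """Large-signal transconductance, Eq. 71.26 (siemens)."""
    return delta_i_c / delta_v_be


def diffusion_capacitance(g_m, tau_f):
    """Large-signal diffusion capacitance, Eq. 71.31 (farads)."""
    return g_m * tau_f


# Hypothetical bias currents and transit delay, for illustration only.
i_cs = 2e-3     # differential pair current source, A
i_ef = 2e-3     # emitter follower bias current, A
tau_f = 2e-12   # forward transit delay, s

g_m1 = large_signal_gm(i_cs, 3 * V_T)      # switch pair: swing of about 3*VT
delta_v_be2 = V_T * math.log(2)            # Eq. 71.27, about 17.3 mV
g_m2 = large_signal_gm(i_ef, delta_v_be2)  # follower: current doubles
c_d1 = diffusion_capacitance(g_m1, tau_f)
c_d2 = diffusion_capacitance(g_m2, tau_f)

print(f"GM1 = {g_m1 * 1e3:.1f} mS, CD1 = {c_d1 * 1e15:.1f} fF")
print(f"GM2 = {g_m2 * 1e3:.1f} mS, CD2 = {c_d2 * 1e15:.1f} fF")
```

The exact value of VT ln 2 is slightly below the rounded 17.5 mV quoted in Eq. 71.27, so GM2 computed this way is marginally larger than IEF/0.0175.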

where τf is the forward transit delay (τr) as defined in Section 70.2. Finally, the Elmore risetime estimate requires the calculation of b2 . Since there are four capacitors in the large-signal equivalent circuit, six terms will be necessary:

$$b_2 = R_{1o}C_1R^{1}_{2s}C_2 + R_{1o}C_1R^{1}_{3s}C_3 + R_{1o}C_1R^{1}_{4s}C_4 + R_{2o}C_2R^{2}_{3s}C_3 + R_{2o}C_2R^{2}_{4s}C_4 + R_{3o}C_3R^{3}_{4s}C_4 \tag{71.32}$$

$R^{1}_{2s}$ will be calculated to illustrate the procedure. The remaining short-circuit equivalent resistances are shown in Table 71.2. Figure 71.14 shows the equivalent circuit for the calculation of $R^{1}_{2s}$, the resistance seen across C2 when C1 is shorted. If C1 is shorted, V1 = 0 and the dependent current source is dead. It can be seen by inspection that

$$R^{1}_{2s} = (R_{IN} \parallel R_{EX}) + R_L \tag{71.33}$$
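Once the short-circuit resistances are known, b2 follows from Eq. 71.32 and can be combined with b1 to estimate delay and risetime. The sketch below rests on several assumptions that should be checked against the original tables: juxtaposed resistances in Table 71.2 are read as parallel combinations, the Elmore estimates are taken in their zero-free form (tD = b1, TR = sqrt(2π(b1² − 2b2))), and all component values and function names are hypothetical.

```python
import math


def parallel(*resistors):
    """Parallel combination of any number of resistances."""
    return 1.0 / sum(1.0 / r for r in resistors)


def time_constant_coefficients(r_in, r_ex, r_l, r_bb, r_ef, g_m1, g_m2, caps):
    """Return (b1, b2) from Eqs. 71.25 and 71.32 for the circuit of Fig. 71.12."""
    c1, c2, c3, c4 = caps
    g_m1_eff = g_m1 / (1.0 + g_m1 * r_ex)

    # Open-circuit resistances (Table 71.1).
    r1o = (r_in + r_ex) / (1.0 + g_m1 * r_ex)
    r2o = r_in + r_l + g_m1_eff * r_in * r_l
    r3o = (r_bb + r_l + r_ef) / (1.0 + g_m2 * r_ef)
    r4o = r_bb + r_l

    # Short-circuit resistances (Table 71.2, as interpreted in the lead-in).
    r2s1 = parallel(r_in, r_ex) + r_l
    r3s1 = r3o
    r4s1 = r4o
    r_node = parallel(1.0 / g_m1_eff, r_in, r_l)
    r3s2 = (r_node + r_bb + r_ef) / (1.0 + g_m2 * r_ef)
    r4s2 = r_node + r_bb
    r4s3 = parallel(r_l + r_bb, r_ef)

    b1 = r1o * c1 + r2o * c2 + r3o * c3 + r4o * c4
    b2 = (r1o * c1 * (r2s1 * c2 + r3s1 * c3 + r4s1 * c4)
          + r2o * c2 * (r3s2 * c3 + r4s2 * c4)
          + r3o * c3 * r4s3 * c4)
    return b1, b2


def elmore_estimates(b1, b2):
    """Delay and risetime for a transfer function without zeros (assumed form)."""
    t_d = b1
    t_r = math.sqrt(2.0 * math.pi * (b1 * b1 - 2.0 * b2))
    return t_d, t_r


# Hypothetical values for illustration only.
b1, b2 = time_constant_coefficients(
    r_in=100.0, r_ex=10.0, r_l=150.0, r_bb=50.0, r_ef=300.0,
    g_m1=0.027, g_m2=0.11, caps=(60e-15, 15e-15, 40e-15, 15e-15))
t_d, t_r = elmore_estimates(b1, b2)
print(f"b1 = {b1 * 1e12:.2f} ps, b2 = {b2 * 1e24:.2f} ps^2")
print(f"tD = {t_d * 1e12:.2f} ps, TR = {t_r * 1e12:.2f} ps")
```

These estimates inherit the limitations of the time constant methods: they assume a monotonic step response and say nothing about ringing.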

TABLE 71.2 Effective Resistances for Short-Circuit Time-Constant Calculation for the Circuit of Fig. 71.12

Resistance      Expression
$R^{1}_{2s}$    (RIN ∥ REX) + RL
$R^{1}_{3s}$    R3o
$R^{1}_{4s}$    R4o
$R^{2}_{3s}$    [(1/G′M1 ∥ RIN ∥ RL) + Rbb + REF]/(1 + GM2 REF)
$R^{2}_{4s}$    (1/G′M1 ∥ RIN ∥ RL) + Rbb
$R^{3}_{4s}$    (RL + Rbb) ∥ REF

FIGURE 71.14 Equivalent circuit model for calculation of $R^{1}_{2s}$.

Time Constant Methods: Complications

As attractive as the time constant delay and risetime estimates are computationally, the user must beware of complications that can degrade the accuracy by a large margin. First, consider that both methods depend on a restrictive assumption of monotonic risetime. In many cases, however, complex poles are not unusual. They can arise from feedback, which leads to inductive input or output impedances, and from emitter or source followers, which also have inductive output impedance. When combined with a predominantly capacitive input impedance, complex poles will generally result unless the circuit is well damped. The time constant methods ignore the complex pole effects, which can be quite significant if the poles are split and σ ≪ jω. In this case, the circuit transient response will exhibit ringing, and time constant estimates of bandwidth, delay, and risetime will be in serious error. Of course, the ringing will show up in the circuit simulation and, if present, must be dealt with by adding damping resistances at appropriate locations.

An additional caution must be given for circuits that include zeros. Although Elmore's equations can modify the estimates for tD and TR when there are zeros, the OCTC method provides no help in finding the time constants of these zeros. Zeros often occur in wideband amplifier circuits that have been modified through the addition of inductance for shunt peaking, for example. The addition of inductance, either intentionally or accidentally, can also produce complex pole pairs. Zeros are intentionally added for the optimization of speed in very-high-speed digital ICs as well; however, the large area required for the spiral inductors when compared with the area consumed by active devices tends to discourage the use of this method in all but the simplest (and fastest) designs.35


Chang, C.E., et al "Logic Design Examples" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


72 Logic Design Examples

Charles E. Chang, Conexant Systems, Inc.
Meera Venkataraman, Troika Networks, Inc.
Stephen I. Long, University of California at Santa Barbara

72.1 Design of MESFET and HEMT Logic Circuits
Direct-Coupled FET Logic (DCFL) • Source-Coupled FET Logic (SCFL) • Advanced MESFET/HEMT Design Examples

72.2 HBT Logic Design Examples
III-V HBT for Circuit Designers • Current-Mode Logic • Emitter-Coupled Logic • ECL/CML Logic Examples • Advanced ECL/CML Logic Examples • HBT Circuit Design Examples

72.1 Design of MESFET and HEMT Logic Circuits

The basis of dc design, definition of logic levels, noise margin, and transfer characteristics were discussed in Chapter 71 using a DCFL and SCFL inverter as examples. In addition, methods for analysis of high-speed performance of logic circuits were presented. These techniques can be further applied to the design of GaAs MESFET, HEMT, or P-HEMT logic circuits with depletion-mode, enhancement-mode, or mixed E/D FETs. Several circuit topologies have been used for GaAs MESFETs, such as direct-coupled FET logic (DCFL), source-coupled FET logic (SCFL), and dynamic logic families,41 and have been extended for use with heterostructure FETs. Depending on whether the design requirement is high speed or low power, the designer can adjust the power-delay product by choosing the appropriate device technology and circuit topology, and by making the correct design tradeoffs.

Direct-Coupled FET Logic (DCFL)

Among the numerous GaAs logic families, DCFL has emerged as the most popular for high-complexity, low-power LSI/VLSI circuit applications. DCFL is a simple enhancement/depletion-mode GaAs logic family, and the circuit diagram of a DCFL inverter was shown in Fig. 71.2. DCFL is the only static ratioed GaAs logic family capable of VLSI densities due to its compactness and low power dissipation. An example demonstrating DCFL's density is Vitesse Semiconductor's 350K sea-of-gates array. The array uses a two-input DCFL NOR as the basic logic structure. The number of usable gates in the array is 175,000. A typical gate delay is specified at 95 ps with a power dissipation of 0.59 mW for a buffered two-input NOR gate with a fan-out of three, driving a wire load of 0.51 mm.42 However, a drawback of DCFL is its low noise margin, the logic swing being approximately 600 mV. This makes the logic sensitive to changes in threshold voltage and ground bus voltage shifts.

DCFL NOR and NAND Gate

The DCFL inverter can easily be modified to perform the NOR function by placing additional enhancement-mode MESFETs in parallel as switch devices. A DCFL two-input NOR gate is shown in Fig. 72.1. If any input rises to VOH, the output will drop to VOL. If n inputs are high simultaneously, then VOL will be decreased because the width ratio W1/WL in Fig. 71.2 has effectively increased by a factor of n.

FIGURE 72.1 DCFL two-input NOR gate schematic.

There is a limit to the number of devices that can be placed in parallel to form very wide NOR functions. The drain capacitance will increase in proportion to the number of inputs, slowing down the risetime of the gate output. Also, the subthreshold current contribution from n parallel devices could become large enough to degrade VOH, and therefore the noise margin. This must be evaluated at the highest operating temperature anticipated because the subthreshold current will increase exponentially with temperature according to43,44:

$$I_D = I_S\left[1 - \exp\left(\frac{cV_{DS}}{V_T}\right)\right]\exp\left(\frac{bV_{DS}}{V_T}\right)\exp\left(\frac{aV_{GS}}{V_T}\right) \tag{72.1}$$

The parameters a, b, and c are empirical fitting parameters. The first term arises from the diffusion component of the drain current, which can be fit from the subthreshold ID–VDS characteristic at low drain voltage. The second and third terms represent thermionic emission of electrons over the channel barrier from source to drain. The parameters can be obtained by fitting the subthreshold ID–VDS and ID–VGS characteristics, respectively, measured in saturation.45

For the reasons described above, the fan-in of the DCFL NOR is seldom greater than 4. In addition to the subthreshold current loading, the forward voltage of the Schottky gate diode of the next stage drops with temperature at the rate of approximately –2 mV/degree. Higher temperature operation will therefore reduce VOH as well, due to this thermodynamic effect.

A NAND function can also be generated by placing enhancement-mode MESFETs in series rather than in parallel for the switch function. However, the low voltage swing inherent in DCFL greatly limits the application of the NAND function because VOL will be increased by the second series transistor unless the widths of the series devices are increased substantially from the inverter prototype. Also, the switching threshold VTH shown in Fig. 71.1 will be slightly different for each input even if width ratios are made different for the two inputs. The combination of these effects reduces the noise margin even further, making the DCFL NAND implementation generally unsuitable for VLSI applications.

Buffering DCFL Outputs

The output (drain) node of a DCFL gate sources and sinks the current required to charge and discharge the load capacitance due to wiring and fan-out. Excess propagation delay of the order of 5 ps per fan-out is typically observed for small DCFL gates. Sensitivity to wiring capacitance is even higher, such that unbuffered DCFL gates are never used to drive long interconnections unless speed is unimportant. Therefore, an output buffer is frequently used in such cases or when fan-out loading is unusually high.

The superbuffer shown in Fig. 72.2(a) is often used to improve the drive capability of DCFL. It consists of a source follower J3 and pull-down J4. The low-to-high transition begins when VIN = VOL. J4 is cut off and J3 becomes active, driving the output to VOH. VOUT follows the DCFL inverter output. For the output high-to-low transition, J4 is driven into its linear region, and the output can be pulled to VOL = 0 V in steady state. J3 is cut off when the DCFL output (drain of J1) switches from high to low. Since this occurs one propagation delay after the input switched from low to high, it is during this transition that the superbuffer can produce a current spike between VDD and ground. J4 attempts to discharge the load capacitance before the DCFL gate output has cut off J3. Thus, superbuffers can become an on-chip noise source, so ground bus resistance and inductance must be controlled.

FIGURE 72.2 (a) Superbuffer schematic, and (b) modified superbuffer with clamp transistor. J5 will limit the output current when Vout > 0.7 V.

There is also a risk that the next stage might be overdriven with too much input current when driven by a superbuffer. This could happen because the source follower output is capable of delivering high currents when its VGS is maximum. This occurs when Vout = VOH = 0.7 V, limited by forward conduction of the gate diodes being driven. For a supply voltage of 2 V, a maximum VGS = 0.7 V is easily obtained on J3, leading to the possibility of excess static current flowing into the gates. This would degrade VOL of the subsequent stage due to voltage drop across the internal source resistance. Figure 72.2(b) shows a modified superbuffer design that prevents this problem through the addition of a clamp transistor, J5. J5 limits the gate potential of J3 when the output reaches VOH, thus preventing the overdriving problem.

Source-Coupled FET Logic (SCFL)

SCFL is the preferred choice for very-high-speed applications. An SCFL inverter, a buffered version of the basic differential amplifier cell shown in Fig. 71.4, is shown in Fig. 72.3. The high-speed capability of SCFL stems from four properties of this logic family: small input capacitance, fast discharging time of the differential stage output nodes, good drive capability, and high Ft. In addition to higher speed, SCFL is characterized by high functional equivalence and reduced sensitivity to threshold voltage variations.29 The current-mode approach used in SCFL ensures an almost constant current consumption from the power supplies and, therefore, the power supply noise is greatly reduced as compared to other logic families. The differential input signaling also improves the dc, ac, and transient characteristics of SCFL circuits.46

SCFL, however, has two drawbacks. First, SCFL is a low-density logic family due to the complex gate topology. Second, SCFL dissipates more power than DCFL, even with the high functional equivalence taken into account.

FIGURE 72.3 Schematic diagram of SCFL inverter with source follower output buffering.

FIGURE 72.4 SCFL two-level series-gated circuit.

SCFL Two-Level Series-Gated Circuit

A circuit diagram of a two-level series-gated SCFL structure is shown in Fig. 72.4. 2-to-1 MUXs, XOR gates, D-latches, and flip-flops can be configured using this basic structure. If the A inputs are tied to the data signals and the B inputs are tied to the select signal, the resulting circuit is a 2-to-1 MUX. If the data is fed to the A1 input, the clock is connected to B, and the A outputs (OA, ŌA) are fed back to the A2 inputs, the resulting circuit is a D-latch, as seen in Fig. 72.5. Finally, an XOR gate is created by connecting A1 = Ā2, forming a new input AIN, and Ā1 = A2, forming the complementary new input ĀIN. The inputs to the two levels require different dc offsets in order for the circuit to function correctly; thus, level-shifting networks using diodes or source followers are required. Series logic such as this also requires higher supply voltages in order to keep the devices in their saturation region. This will increase the power dissipation of SCFL.

FIGURE 72.5 SCFL D latch schematic. Two cascaded latch cells with opposite clock phasing constitute a master-slave flip-flop.

The logic swing of the circuit shown in Fig. 72.4 is determined by the size of the current source T3 and the load resistors R1 and R2 (R1 = R2). Assuming T3 is in saturation, the logic swing on nodes X and Y is

∆VX,Y = Ids3 R1 = Idss3 R1

(72.2)

where Idss3 is the saturation current of T3 at Vgs = 0 V. The logic high and low levels on node X (VX,H , VX,L ) are determined from the voltage drop across R3 and Eq. 72.2.

V X,H = Vdd – (Idss3 R3)

(72.3)

V X,L = Vdd – [Idss3 (R1 + R3)]

(72.4)

The noise margin is the difference between the minimum voltage swing required on the inputs to switch the current from one branch to the other (VSW) and the logic swing ∆VX,Y. VSW is set by the ratio between the sizes of the switch transistors (T1, T2, T4-T7) and T3. For symmetry reasons, all the switch transistors are kept the same size. Assume that the saturation drain-source current of an FET can be described by the simplified square-law equation

$$I_{ds} = \beta W (V_{gs} - V_T)^2 \tag{72.5}$$

where VT is the threshold voltage, W is the FET width, and β is a process-dependent parameter. VSW is calculated assuming all the current from T3 flows only through T2.

$$V_{SW} = V_T \sqrt{\frac{W_3}{W_2}} \tag{72.6}$$
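The square-root dependence on the width ratio follows directly from the square-law model of Eq. 72.5. The short derivation below is a sketch under two assumptions not stated explicitly in the text: T3 operates at Vgs = 0 (so that it carries Idss3), and VSW is taken as the gate overdrive Vgs2 − VT of the conducting switch transistor, which is consistent with the relation Vgs = VSW + Vth used in Eq. 72.7. The magnitude of VT is written explicitly.

```latex
\begin{align*}
I_{dss3} &= \beta W_3 (0 - V_T)^2 = \beta W_3 V_T^2
  && \text{(T3 at } V_{gs} = 0\text{)} \\
\beta W_2 (V_{gs2} - V_T)^2 &= \beta W_3 V_T^2
  && \text{(all of } I_{dss3} \text{ flows through T2)} \\
V_{SW} \equiv V_{gs2} - V_T &= |V_T| \sqrt{\frac{W_3}{W_2}}
\end{align*}
```

Widening the switch transistors relative to the current source (larger W2 for fixed W3) therefore lowers VSW, as the text goes on to describe.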

For a fixed current source size (W3), the larger the size of the switch transistors, the smaller the voltage swing required to switch the current and, hence, the larger the noise margin. Although a better noise margin is desirable, larger switch transistors mean increased input capacitance and decreased speed. Depending on the design specifications, noise margin and speed need to be traded off.

Since all FETs need to be kept in the saturation region for the correct operation of an SCFL gate, level-shifting is needed between nodes A and B and the input to the next gate, in order to keep T1, T2, and T4-T7 saturated. T3 is kept in saturation if the potential at node S is higher than VSS + Vds,sat. The potential at node S is determined by the input voltages to T1 and T2. VS settles at a potential such that the drain-source current of the conducting transistor is exactly equal to the bias current, Idss3, since no current flows through the other transistor. The minimum logic high level at the output node B (VOB,H) is


VOB,H ≥ Vss + Vds,sat + Vgs = Vss + Vds,sat + VSW + Vth

(72.7)

To keep T9 and T11 in saturation, however, requires that

VOB,H ≥ Vss + Vds,sat + VSW

(72.8)

As with the voltage on node S, the drain voltages of T1 and T2 are determined by the voltage applied to the A inputs. The saturation condition for T1 and T2 is

VOA,H – VSW – Vth – VS = VOA,H – VOB,H ≥ Vds,sat

(72.9)

Equation 72.9 shows that the lower switch transistors are kept in saturation if the level-shifting difference between the A and B outputs is larger than the FET saturation voltage. Since diodes are used for level-shifting, the minimum difference between the two outputs is one diode voltage drop, VD. If Vds,sat > VD, more diodes are required between the A and B outputs. The saturation condition for the upper switch transistors, T4 to T7, is determined by the minimum voltage at nodes A and B and the drain voltage of T1 and T2.

VA,min – (VOA,H – VSW – Vth) ≥ Vds,sat

(72.10)

Substituting Eq. 72.4 into Eq. 72.10 yields

$$\left(V_{dd} - I_{dss3}(R_1 + R_3)\right) - \left(V_{OA,H} - V_{SW} - V_{th}\right) \ge V_{ds,sat} \tag{72.11}$$

Rewriting Eq. 72.11 using Eq. 72.8 gives the minimum power supply range

V dd – Vss ≥ Idss3 ∗ (R1+R3) + 3Vds,sat

(72.12)
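One chain of substitutions that reproduces the bound of Eq. 72.12 is sketched below. It is offered as a consistency check rather than as the authors' derivation: it combines Eq. 72.11 with the saturation condition of Eq. 72.9 and the gate-drive bound of Eq. 72.7, which is an assumption about the intended steps.

```latex
\begin{align*}
V_{dd} &\ge I_{dss3}(R_1 + R_3) + V_{ds,sat} + \left(V_{OA,H} - V_{SW} - V_{th}\right)
  && \text{(Eq. 72.11 rearranged)} \\
V_{OA,H} &\ge V_{OB,H} + V_{ds,sat}
  && \text{(Eq. 72.9)} \\
V_{OB,H} &\ge V_{ss} + V_{ds,sat} + V_{SW} + V_{th}
  && \text{(Eq. 72.7)} \\
\Rightarrow\quad V_{OA,H} - V_{SW} - V_{th} &\ge V_{ss} + 2V_{ds,sat} \\
\Rightarrow\quad V_{dd} - V_{ss} &\ge I_{dss3}(R_1 + R_3) + 3V_{ds,sat}
\end{align*}
```

The three Vds,sat terms correspond to the three stacked devices that must remain saturated: the current source T3, the lower switch pair, and the upper switch transistors.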

Equation 72.11 allows the determination of the minimum amount of level-shifting required between nodes A and B to the outputs

VA,H – VOA,H ≥ V ds,sat + Idss3 ∗ R1 – Vth – VSW

(72.13)
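The dc relations above can be gathered into a small calculator to explore tradeoffs between logic swing, noise margin, and supply voltage. This is a sketch under stated assumptions: a single threshold voltage is used for both VT of Eq. 72.5 and Vth of Eq. 72.7, VSW is taken as |VT| times the square root of W3/W2, the noise margin is computed as the logic swing minus VSW, and the function name and all numeric values are hypothetical.

```python
import math


def scfl_dc_design(i_dss3, r1, r3, v_th, w3_over_w2, v_ds_sat, v_ss):
    """Evaluate the dc design relations for the two-level SCFL gate of Fig. 72.4."""
    swing = i_dss3 * r1                                     # Eq. 72.2
    v_sw = abs(v_th) * math.sqrt(w3_over_w2)                # Eq. 72.6 (magnitude)
    return {
        "logic_swing": swing,
        "v_sw": v_sw,
        "noise_margin": swing - v_sw,
        "v_ob_h_min": v_ss + v_ds_sat + v_sw + v_th,        # Eq. 72.7
        "v_dd_min": v_ss + i_dss3 * (r1 + r3) + 3.0 * v_ds_sat,  # Eq. 72.12
        "level_shift_min": v_ds_sat + swing - v_th - v_sw,  # Eq. 72.13
    }


# Hypothetical depletion-mode design point, for illustration only.
design = scfl_dc_design(i_dss3=1e-3, r1=800.0, r3=200.0, v_th=-0.5,
                        w3_over_w2=0.25, v_ds_sat=0.5, v_ss=-5.0)
for key, value in design.items():
    print(f"{key:16s} = {value:6.2f} V")
```

Sweeping `w3_over_w2` shows the tradeoff described in the text: a smaller ratio (wider switch transistors) lowers VSW and raises the noise margin, at the cost of input capacitance that this dc sketch does not model.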

Equations 72.8 to 72.13 can be used for designing the level shifters. The design parameters available in the level-shifters are the widths of the source followers (W8, W10), the current sources (W9, W11), and the diode (WD). Assuming the current source width (W9) is fixed, the voltage drop across the diodes is partially determined by the ratio (WD/W9). This ratio should not be made too small. Operating Schottky diodes at high current density will result in higher voltage drop, but this voltage will be partially due to the ID RS drop across the parasitic series resistance. Since this resistance is often process dependent and difficult to reproduce, poor reproducibility of VD will result in this case. The ratio between the widths of the source follower and the current source (W8/W9) determines the gate-source voltage of the source follower (Vgs8). Vgs8 should be kept below 0.5 V to prevent gate-source conduction.

The dc design of the two-level series-gated SCFL gate in Fig. 72.4 can be accomplished by applying Eqs. 72.2 to 72.13. Ratios between most device sizes can be determined by choosing the required noise margin and logic swing. Only W3 in the differential stage and W9 among the level-shifters are unknown at this stage. All other device sizes can be expressed in terms of these two transistor widths. The relation between W3 and W9 can be determined only by considering transient behavior. For a given total power dissipation, the ratio between the power dissipated in the differential stage and the output buffers determines how fast the outputs are switched. If fast switching at the outputs is desired, more power needs to be allocated to the output buffers and, consequently, less power to the differential stage. While this allocation will ensure faster switching at the output, the switching speed of the differential stage is reduced because of the reduced current available to charge and discharge the large input capacitance of the output buffers.

Finally, it is useful to note that scaling devices to make a speed/power tradeoff is simple in SCFL. If twice as much power is allocated to a gate, all transistors and diodes are made twice as wide while all resistors are reduced by half.46

FIGURE 72.6 2.5-Gb/s optical communication system.
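The SCFL scaling rule (widths scale with the power, resistors scale inversely) can be expressed in a few lines. The sketch below uses hypothetical device names and values, and assumes that saturation current is proportional to width, as in Eq. 72.5, so that the logic swing Idss3·R1 is unchanged by scaling.

```python
def scale_scfl_gate(widths_um, resistors_ohm, power_factor):
    """Scale an SCFL gate for a new power budget.

    Transistor and diode widths are multiplied by power_factor and
    resistors are divided by it, so that voltage levels are preserved.
    """
    scaled_widths = {name: w * power_factor for name, w in widths_um.items()}
    scaled_resistors = {name: r / power_factor for name, r in resistors_ohm.items()}
    return scaled_widths, scaled_resistors


# Hypothetical starting design (widths in micrometers, resistors in ohms).
widths = {"T1": 20.0, "T2": 20.0, "T3": 10.0, "T8": 15.0, "T9": 10.0, "D": 12.0}
resistors = {"R1": 800.0, "R2": 800.0, "R3": 200.0}
current_per_um = 0.1e-3  # assumed Idss per micrometer of width, A/um

new_widths, new_resistors = scale_scfl_gate(widths, resistors, power_factor=2.0)

swing_before = current_per_um * widths["T3"] * resistors["R1"]
swing_after = current_per_um * new_widths["T3"] * new_resistors["R1"]
print(f"logic swing before: {swing_before:.2f} V, after: {swing_after:.2f} V")
```

Because every current doubles while every resistance halves, all node voltages stay where they were and only the current available to charge the load capacitances changes.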

Advanced MESFET/HEMT Design Examples

High-Speed TDM Applications

The need for high bandwidth transmission systems continues to increase as the number of bandwidth-intensive applications in the areas of video imaging, multimedia, and data communication (such as database sharing and database warehousing) continues to grow. This has led to the development of optical communication systems with transmission bit rates of, for example, 2.5 Gb/s and 10 Gb/s. A simplified schematic of a 2.5-Gb/s communication system is shown in Fig. 72.6. As seen in Fig. 72.6, MUXs, DMUXs, and switches capable of operating in the Gb/s range are crucial for the operation of these systems. GaAs MESFET technology has been employed extensively in the design of these high-speed circuits because of the excellent intrinsic speed performance of GaAs. SCFL is especially well suited for these circuits, where high speed is of utmost importance and power dissipation is not a critical factor.

The design strategies employed in the previous subsection can now be further applied to a high-speed 4:1 MUX, as shown in Fig. 72.7. It was shown that the two-level series-gated SCFL structure could be easily configured into a D-latch. The MSFF in the figure is simply a master-slave flip-flop containing two D-latches. The PSFF is a phase-shifting flip-flop that contains three D-latches and has a phase shift of 180° compared with an MSFF. The 4:1 MUX is constructed using a tree architecture in which two 2:1 MUXs each merge two input lines into one output operating at twice the input bit rate. The 2:1 MUX at the second stage takes the two outputs of the first stage and merges them into a single output at four times the primary input bit rate. The architecture is highly pipelined, ensuring good timing at all points in the circuit. The inherent propagation delay of the flip-flops ensures that the signals are passed through the selector only when they are stable.46

FIGURE 72.7 High-speed 4:1 multiplexer (MUX).

The interface between the two stages of 2:1 MUXs is timing-critical, and care needs to be taken to obtain the best possible phase margin at the input of the last flip-flop. To accomplish this, a delay is added between the CLK signal and the clock input to this flip-flop. The delay is usually implemented using logic gates because their delays are well-characterized in a given process. Output jitter can be minimized if 50% duty-cycle clock signals are used. Otherwise, a retiming MSFF will be needed at the output of the 4:1 MUX.

The 4:1 MUX is a good example of an application of GaAs MESFETs with very-high-speed operation and low levels of integration. Vitesse Semiconductor has several standard products operating in the Gb/s range, fabricated in GaAs using its own proprietary E/D MESFET process. For example, the 16 × 16 crosspoint switch, VSC880, has serial data rates of 2.0 Gb/s. The VS8004 4-bit MUX is a high-speed, parallel-to-serial data converter. The parallel inputs accept data at rates up to 625 Mb/s and the differential serial data output presents the data sequentially at 2.5 Gb/s, synchronous with the differential high-speed clock input.42

While the MESFET technologies have proven capable at 2.5 and 10 Gb/s data rates for optical fiber communication applications, higher speeds appear to require heterojunction technologies. The 40-Gb/s TDM application is the next step, but it is challenging for all present semiconductor device IC technologies. A complete 40-Gb/s system has been implemented in the laboratory with 0.1-µm InAlAs/InGaAs/InP HEMT ICs, as reported in Refs. 47 and 48. Chips were fabricated that implemented multiplexers, photodiode preamplifiers, wideband dc 47-GHz amplifiers, decision circuits, demultiplexers, frequency dividers, and limiting amplifiers. The high-speed static dividers used the super-dynamic FF approach.49 A 0.2-µm AlGaAs/GaAs/AlGaAs HEMT quantum well device technology has also demonstrated 40-Gb/s TDM system components. A single chip has been reported that included clock recovery, data decision, and a 2:4 demultiplexer circuit.50 The SCFL circuit approach was employed.
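The tree organization of the 4:1 MUX described earlier in this subsection can be illustrated with a purely behavioral model that ignores clocking, flip-flops, and timing margins. The function names and the pairing of inputs at the first stage (D0 with D2, D1 with D3) are assumptions made for this sketch; the pairing is chosen so that the serial output appears in the order D0, D1, D2, D3.

```python
def mux2(stream_a, stream_b):
    """Behavioral 2:1 time-division MUX: interleave two equal-rate bit streams."""
    merged = []
    for bit_a, bit_b in zip(stream_a, stream_b):
        merged.extend([bit_a, bit_b])
    return merged


def mux4_tree(d0, d1, d2, d3):
    """4:1 MUX built as a tree of three 2:1 MUXs."""
    first_stage_a = mux2(d0, d2)   # runs at twice the input bit rate
    first_stage_b = mux2(d1, d3)   # runs at twice the input bit rate
    return mux2(first_stage_a, first_stage_b)  # four times the input bit rate


# Two bits per parallel input line (hypothetical data).
serial = mux4_tree([0, 1], [1, 1], [0, 0], [1, 0])
print(serial)

input_rate = 625e6                 # per-line rate quoted for the VS8004, b/s
output_rate = 4 * input_rate       # serial output rate, b/s
print(f"serial rate = {output_rate / 1e9:.1f} Gb/s")
```

Each first-stage output carries two of the four inputs, so only the final 2:1 MUX and its retiming flip-flop must operate at the full serial rate.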
Very-High-Speed Dynamic Circuits

Conventional logic circuits using static DCFL or SCFL NOR gates such as those described above are limited in their maximum speed by loaded gate delays and serial propagation delays. For example, a typical DCFL NOR-implemented edge-triggered DFF has a maximum clock frequency of approximately 1/(5τD); the SCFL MSFF is faster, but it is still limited to 1/(2τD) at best. Frequency divider applications that require clock frequencies above 40 GHz have occasionally employed alternative circuit approaches which are not limited in the same sense by gate delays and often use dynamic charge storage on gate nodes for temporarily holding a logic state. These approaches have been limited to relatively simple circuit functions such as divide-by-2 or -4.

The dynamic frequency divider (DFD) technique is one of the well-known methods for increasing clock frequency closer to the limits of a device technology. For example, Fig. 72.8 shows a DFD circuit using a single-phase clock, a cross-coupled inverter pair as a latch to reduce the minimum clock frequency, and pass transistors to gate a short chain of inverters.51,52 These have generally used DCFL or DCFL superbuffers for the inverters. The cross-coupled inverter pair can be made small in width, since its serial delay is not in the datapath. But the series inverter chain must be designed to be very fast, generally requiring high power per inverter in order to push the power-delay product to its extreme high-speed end. Since fan-out is low, the intrinsic delays of an inverter in a given technology can be approached.

FIGURE 72.8 Dynamic frequency divider (DFD) divide-by-2 circuit. (Ref. 51, ©1989 IEEE, with permission.)

The maximum and minimum clock frequencies of this circuit can be calculated from the gate delays of the n series inverters as shown in Eqs. 72.14 and 72.15. An odd number n is required to force an inversion of the data so that the circuit will divide by 2. Here, t1 is the propagation delay of the pass transistor switches, J1 and J2, and tD is the propagation delay of the DCFL inverters. The parameter "a" is the duty cycle of the clock. For a 50% clock duty cycle, the ratio of maximum to minimum clock frequency is about 2 to 1.

$$f_{\phi\,max} = \frac{1}{t_1 + n t_D} \tag{72.14}$$

$$f_{\phi\,min} = \frac{a}{t_1 + n t_D} \tag{72.15}$$
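Equations 72.14 and 72.15 can be evaluated directly to see how the operating window of the divider depends on the delays. In this sketch the function name and the delay values are hypothetical, chosen only to give round numbers.

```python
def dfd_clock_range(t1, t_d, n, duty_cycle):
    """Return (f_min, f_max) in Hz for the dynamic frequency divider.

    t1: pass-transistor delay (s); t_d: inverter delay (s);
    n: number of series inverters (must be odd); duty_cycle: clock duty cycle 'a'.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the data is inverted")
    loop_delay = t1 + n * t_d
    f_max = 1.0 / loop_delay          # Eq. 72.14
    f_min = duty_cycle / loop_delay   # Eq. 72.15
    return f_min, f_max


# Hypothetical delays: 2 ps pass transistor, 6 ps per inverter, three inverters.
f_min, f_max = dfd_clock_range(t1=2e-12, t_d=6e-12, n=3, duty_cycle=0.5)
print(f"f_min = {f_min / 1e9:.1f} GHz, f_max = {f_max / 1e9:.1f} GHz")
```

With a 50% duty cycle the two frequencies differ by a factor of two, matching the 2-to-1 range quoted in the text.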

Clock frequencies as high as 51 GHz have been reported using this approach with a GaAs/AlGaAs P-HEMT technology.52 The power dissipation was relatively high, 440 mW. Other DFD circuit approaches can also be found in the literature.53-55

A completely different approach, shown in Fig. 72.9, utilizes an injection-locked push-pull oscillator (J1 and J2) whose free-running frequency is a subharmonic of the desired input frequency.56 FETs J3 and J4 operate in their ohmic regions and act as variable resistors. The variation in VGS1 and VGS2 causes the oscillator to subharmonically injection-lock to the input source. Here, a divide-by-4 ratio was demonstrated with an input frequency of 75 GHz and a power dissipation of 170 mW using a 0.1-µm InP-based HEMT technology with fT = 140 GHz and fmax = 240 GHz. This divider also operated in the 59–64 GHz range with only –10 dBm RF input power. The frequency range is limited by the tuning range of the oscillator. In this example, about 2 octaves of frequency variation were demonstrated.

FIGURE 72.9 Injection-locked oscillator divide-by-4. (Ref. 56, ©1996 IEEE. With permission.)

Finally, efforts have also been made to beat the speed limitations of a technology by dynamic design methods while still maintaining minimum power dissipation. The quasi-dynamic FF48,49 and quasi-differential FF57 are examples of circuit designs emphasizing this objective. The latter has achieved 16-GHz clock frequency with approximately 2 mW of power per FF.

72.2 HBT Logic Design Examples

From a circuit topology perspective, III-V HBTs and silicon BJTs are interchangeable, with myriad similarities and a few essential differences. The traditional logic families developed for Si BJTs serve as the starting point for high-speed logic with III-V HBTs. During the period of intense HBT development in the 1980s and early 1990s, HBTs implemented ECL, CML, DTL, and I2L logic topologies as well as novel logic families with advanced quantum devices (such as resonant tunneling diodes1) in the hope of achieving some combination of high speed, low power, and high integration level. During that time, III-V HBTs demonstrated their potential integration limits with an I2L 32-bit microprocessor58 and benchmarked their high-speed ability with an ECL 30-GHz static master/slave flip-flop-based frequency divider. During the same time, advances in Si-based technology, especially CMOS, demonstrated that parallel circuit algorithms implemented in a technology with slower low-power devices capable of massive integration will dominate most applications. Consequently, III-V-based technologies such as HBTs and MESFET/HEMT have been relegated to smaller but lucrative niche markets.

As HBT technology evolved into a mature production technology in the mid-1990s, it was clear that III-V HBT technology had an advantage in high-speed digital circuits, microwave integrated circuits, and power amplifier markets. Today, in the high-speed digital arena, III-V HBTs have found success in telecom and datacom lightwave communication circuits for SONET/ATM-based links that operate from 2.5 to 40 Gb/s. HBTs also dominate the high-speed data conversion area with Nyquist-rate ADCs capable of gigabit/gigahertz sampling rates/bandwidths, sigma-delta ADCs with very high oversampling rates, and direct digital synthesizers with gigahertz clock frequencies and ultra-low spurious outputs. In these applications, the primary requirement is ultra-high-speed performance with LSI (10 K transistors) levels of integration. Today, the dominant logic type used in HBT designs is based on non-saturating emitter-coupled pairs such as ECL and current-mode logic (CML), which is the focus of this chapter.

III-V HBT for Circuit Designers III-V HBTs and Si BJTs are inherently bipolar in nature. Thus, from a circuit point of view, both share many striking similarities and some important differences. The key differences between III-V HBT

© 2000 by CRC Press LLC

technology and Si BJT technology, as discussed below, can be traced to three essential aspects: (1) heterojunction vs. homojunction, (2) III-V material properties, and (3) substrate properties. First, the primary advantage of a base–emitter heterojunction is that the wide bandgap emitter allows the base to be doped higher than the emitter (typically 10 to 50X in GaAs/AlGaAs HBTs) without a reduction in current gain. This translates to lower base resistance for improved fmax and reduces base width modulation with Vce for low output conductance. Alternatively, the base can be made thinner for lower base-transit time (τb) and higher ft without having Rb too high. If the base composition is also graded from high bandgap to low, an electric field can be established to sweep electrons across the base for reduced τb and higher ft. With a heterojunction B-E and a homojunction B-C, the junction turn-on voltage is higher in the B-E than it is in the B-C. This results in a common-emitter I–V curve offset from the off to saturation transition. This offset is approximately 200 mV in GaAs/AlGaAs HBTs. With a highly doped base, base punch-through is not typically observed in HBTs and does not limit the ft-breakdown voltage product as in high-performance Si BJTs and SiGe HBT with thin bases. Furthermore, if a heterojunction is placed in the base–collector junction, a larger bandgap material in the collector can increase the breakdown voltage of the device and reduce the I–V offset. Second, III-V semiconductors typically offer higher electron mobility than Si for overall lower τb and collector space charge layer transit times (τcscl). Furthermore, many III-V materials exhibit velocity overshoot in the carrier drift velocity. When HBTs are designed to exploit this effect, significant reductions in τcscl can result. 
With short collectors, the higher electron mobility can result in ultra-high ft ; however, this can also be used to form longer collectors with still acceptable τcscl , but significantly reduced Cbc for high fmax. The higher mobility in the collector can also lead to HBTs with lower turn on resistance in the common emitter I–V curves. Since GaAs/AlGaAs and GaAs/InGaP have wider bandgaps than Si, the turn-on voltage of the B–E (Vbe,on) junction is typically on the order of 1.4 V vs. 0.9 V for advanced high-speed Si BJT. InP-based HBTs can have Vbe,on on the order of 0.7 V; however, most mature production technologies capable of LSI integration levels are based on AlGaAs/GaAs or InGaP/GaAs. The base–collector turn-on voltage is typically on the order of 1 V in GaAs-based HBTs. This allows Vce to be about 600 mV lower than Vbe without placing the device in saturation. The wide bandgap material typically results in higher breakdown voltages, so III-V HBTs typically have a high Johnson figure of merit (ft * breakdown voltage) compared with Si- and SiGe-based bipolar transistors. The other key material differences between III-V vs. silicon materials are the lack of a native stable oxide in III-V, the extensive use of poly-Si in silicon-based processes, and the heavy use of implants and diffusion for doping silicon devices. III-V HBTs typically use epitaxial growth techniques, and interconnect step height coverage issues limit the practical structure to one device type, so PNP transistors are not typically included in an HBT process. These key factors contribute to the differences between HBTs and BJTs in terms of fabrication. Third, the GaAs substrate used in III-V HBTs is semi-insulating, which minimizes parasitic capacitance to ground through the substrate, unlike the resistive silicon substrate. Therefore, the substrate contact as in Si BJTs is unnecessary with III-V HBTs. 
In fact, the RF performance of small III-V HBT devices can be measured directly on-wafer below 26 GHz without significant de-embedding of the probe pads. For interconnects, the line capacitance is typically dominated by parallel wire-to-wire capacitance, and the loss is not limited by a resistive substrate. This allows for the formation of high-Q inductors, low-loss transmission lines, and longer interconnects that can be operated at tens of GHz. Although BESOI and SIMOX Si wafers are insulating, the SiO2 layer is typically thin, resulting in reduced but still significant capacitive coupling across this layer [59]. Most III-V substrates have a lower thermal conductivity than bulk Si, resulting in observable self-heating effects. For a GaAs/AlGaAs HBT, self-heating produces a negative output conductance in the common-emitter I–V curve measured with constant Ib. The thermal time constant for GaAs/AlGaAs HBTs is on the order of microseconds. Since thermal effects cannot track signals that vary faster than this time constant, the output conductance

FIGURE 72.10 Standard differential CML buffer with a simple reference generator.

of HBTs at RF (> 10 MHz) is low but positive. This effect introduces a small complication for HBT models based on the standard Gummel-Poon BJT model.

Current-Mode Logic

The basic current-mode logic (CML) buffer/inverter cell is shown in Fig. 72.10. The CML buffer is a differential amplifier that is operated with its outputs clipped, that is, driven to the logic extremes. The differential inputs (Vin and Vin′) are applied to the bases of Q1 and Q2. The difference in potential between Vin and Vin′ determines which transistor Ibias is steered through, resulting in a voltage drop across either load resistance RL1 or RL2. If Vin = VOH and Vin′ = VOL (Vin,High > Vin,Low), Q1 is on and Q2 is off. Consequently, Ibias flows completely through RL1, causing Vout′ to drop to a logic low. With Q2 off, Vout floats to ground for a logic high. If the terminal assignment of Vout and Vout′ were reversed, this CML stage would be an inverter instead of a buffer. The logic high VOH of a CML gate is 0 V. The logic low output is determined by VOL = –RL1Ibias. With RL1 = RL2 = 200 Ω and Ibias = 2 mA, the traditional logic low of a CML gate is –400 mV. As CML gates are cascaded together, the outputs of one stage directly feed the inputs of another CML gate. As a result, the base-collector junction of the “on” transistor is slightly forward-biased (by 400 mV in this example). For high-speed operation, it is necessary to keep the switching transistors out of saturation. With a GaAs base-collector turn-on voltage near 1 V, a forward bias of 500 to 600 mV is typically tolerated without any saturation effects. In fact, this bias shortens the base-collector depletion region, resulting in the highest ft vs. Vce (fmax suffers due to the increase in Cbc). The maximum logic swing of a CML gate is therefore constrained by the need to keep the transistors out of saturation. As the transistor is turned on, the logic high is actively pulled to a logic low; however, as the transistor is turned off, the logic low is pulled up with an RC time constant.
With a large capacitive loading, it is possible that the risetime is slower than the falltime, and that may result in some complications with high-speed data. A current mirror (Qcs and Rcs) sets the bias (Ibias) of the differential pair. This is an essential parameter in determining the performance of CML logic. In HBTs, the ft dependence on Ic is as follows:

$$\frac{1}{2\pi f_t} = \tau_{ec} = \frac{nkT}{qI_c}\left(C_{bej} + C_{bc} + C_{bed}\right) + R_c\left(C_{bej} + C_{bc}\right) + \tau_b + \tau_{cscl} \tag{72.16}$$

where τec is the total emitter-to-collector transit time, qIc/nkT is the transconductance (gm), Cbc is the base–collector capacitance, Cbej is the base–emitter junction capacitance, Cbed is the base–emitter diffusion capacitance, Rc is the collector resistance, τb is the base transit time, and τcscl is the collector space charge layer transit time. At low currents, the transit time is dominated by the device gm and device capacitance. As the bias increases, τec is eventually limited by τb and τcscl. As this limit is approached, the Kirk effect typically starts to increase τb and τcscl, which decreases ft. In some HBTs, the peak fmax occurs at a slightly higher current than the peak ft. With this in mind, optimal performance is typically achieved when Ibias is near the collector current of peak ft (Ic,max ft) or of peak fmax (Ic,max fmax). In some HBT technologies, the maximum bias may be constrained by thermal or reliability concerns. As a rule of thumb, the maximum current density of HBTs is typically on the order of 5 × 10⁴ A/cm². The bias of CML and ECL logic is typically set with a bias reference generator; the simplest generator is shown in Fig. 72.10. Much effort has been invested in the design of the reference generator to maintain constant bias with power supply and temperature variation. In HBTs, secondary effects of the heterojunction design typically result in an ideality factor that varies slightly with bias. This makes the design of bandgap reference circuits quite difficult in most HBT technologies, which complicates the design of the reference generator. Nevertheless, the reference generators used today typically result in a 2% variation in bias current from –40 to 100°C with a 10% variation in power supply. In most applications, the voltage drop across Rcs is set to around 400 mV. With Vee set at –5.2 V, Vref is typically near –3.4 V. With constant bias, as Vee moves by ±10%, the voltage drop across Rcs remains constant, so Vref moves by the change in power supply (about ±0.5 V).
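Returning briefly to Eq. (72.16), the bias dependence of ft can be explored numerically. The following Python sketch simply evaluates the equation as written; it is an illustration under stated assumptions, not a device model. All component values in EXAMPLE are hypothetical, the function names are invented for this sketch, the Kirk effect is not modeled (so ft rises monotonically toward its high-current asymptote instead of peaking), and Cbed is set to zero so that the transit times enter only once through τb and τcscl.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C

# Hypothetical device values chosen only to exercise the formula;
# they are not taken from the text or from any specific HBT process.
EXAMPLE = dict(
    n=1.0,             # ideality factor
    temp_k=300.0,      # junction temperature, K
    c_bej=20e-15,      # B-E junction capacitance, F
    c_bc=25e-15,       # B-C capacitance, F
    c_bed=0.0,         # B-E diffusion capacitance, F (assumed zero here)
    r_c=10.0,          # collector resistance, ohm
    tau_b=1.0e-12,     # base transit time, s
    tau_cscl=1.0e-12,  # collector space-charge-layer transit time, s
)


def tau_ec(i_c, *, n, temp_k, c_bej, c_bc, c_bed, r_c, tau_b, tau_cscl):
    """Total emitter-to-collector transit time per Eq. (72.16), in seconds."""
    inv_gm = n * K_B * temp_k / (Q_E * i_c)  # nkT/(q*Ic) = 1/gm
    return inv_gm * (c_bej + c_bc + c_bed) + r_c * (c_bej + c_bc) + tau_b + tau_cscl


def f_t(i_c, **device):
    """Unity-current-gain frequency ft = 1/(2*pi*tau_ec), in Hz."""
    return 1.0 / (2.0 * math.pi * tau_ec(i_c, **device))


def f_t_limit(*, r_c, c_bej, c_bc, tau_b, tau_cscl, **_unused):
    """High-current asymptote of ft, where the 1/gm term has vanished."""
    return 1.0 / (2.0 * math.pi * (r_c * (c_bej + c_bc) + tau_b + tau_cscl))


if __name__ == "__main__":
    for i_ma in (0.25, 0.5, 1.0, 2.0, 4.0):
        ft_ghz = f_t(i_ma * 1e-3, **EXAMPLE) / 1e9
        print(f"Ic = {i_ma:4.2f} mA -> ft = {ft_ghz:5.1f} GHz")
    print(f"asymptote = {f_t_limit(**EXAMPLE) / 1e9:5.1f} GHz")
```

Sweeping Ic in this way shows the two regimes described above: at low current the gm-related term dominates, while at high current ft flattens toward the limit set by τb, τcscl, and the RcC product.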
Since the logic levels are referenced to ground, the average value of Vcm (around –1.4 V) remains constant. This implies that changes in the power supply are absorbed by the base–collector junction of Qcs, and it is important that this transistor is not deeply saturated. Since the device goes from the cutoff mode to the forward active mode as it switches, the gate delay is difficult to predict analytically with small-signal analysis. Thus, large-signal models are typically used to numerically compute the delay in order to optimize the performance of a CML gate. Nevertheless, the small-signal, frequency-domain approach described in Chapter 71.3 (Elmore) leads to the following approximation of the delay of a CML gate with unity fan-out [33]:

$$\tau_{d,cml} = \left(1 + g_m R_L\right) R_b C_{bc} + R_b\left(C_{be} + C_d\right) + \frac{2C_{bc} + \tfrac{1}{2}C_{be} + \tfrac{1}{2}C_d}{g_m} \tag{72.17}$$

where Cd is the diffusion capacitance, gm(τb + τcscl). Furthermore, by considering the difference in charge storage at logic high and logic low, divided by the logic swing, the effective CML gate capacitance can be expressed [19] as

$$C_{cml} = \frac{C_{be}}{2} + 2C_{bc} + C_s + \frac{\tau_b + \tau_{cscl}}{R_L} \tag{72.18}$$

where Cs is the sum of the collector-substrate and interconnect capacitances. Both equations show that the load resistor and the bias (which affects gm and the device capacitances) have a strong effect on performance. For a rough estimate of the CML maximum speed without loading, one can assume that ft is the gain-bandwidth product. With the voltage gain set at gmRL, the maximum speed is ft/(gmRL). In the above example, at 1 mA average bias, gm,int = 1/26 S at room temperature. Assuming the internal parasitic emitter resistance RE is 10 Ω and using the fact that gm,ext = gm,int/(1 + REgm,int), the effective extrinsic gm is 1/36 S. With a 200-Ω load resistor, the voltage gain is approximately 5.5. With a 70-GHz ft HBT process, the maximum switching rate is about 13 GHz. Although this estimate is quite rough, it does show that high-speed CML logic favors high device bias and low logic swing. In most differential circuits, only 3 to 4 kT/q is needed to switch the transistors and overcome the noise floor. With such low levels

FIGURE 72.11 Standard differential ECL buffer with outputs taken at two different voltage levels.

and limited gain, the output may not saturate to the logic extremes, resulting in decreasing noise margin. In practice, the differential logic level should not be allowed to drop below 225 mV. With a 225-mV swing vs. 400 mV, the maximum gate bandwidth improves to 23 GHz from 13 GHz.
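The rough gain-bandwidth estimate above is easy to reproduce. The Python sketch below follows the same steps (intrinsic gm, degeneration by RE, voltage gain, ft divided by gain). It rests on a few assumptions that the text implies but does not spell out: the average bias per transistor is taken as Ibias/2, kT/q is rounded to 26 mV, and the lower swing is obtained by reducing RL at constant Ibias. The function name and its parameters are hypothetical.

```python
def cml_speed_estimate(f_t_hz, i_bias, swing_v, r_e=10.0, v_t=0.026):
    """Return (voltage gain, ft/gain) for an unloaded CML stage.

    f_t_hz  : device ft in Hz
    i_bias  : tail current Ibias in A
    swing_v : logic swing RL*Ibias in V
    r_e     : parasitic emitter resistance in ohm
    v_t     : thermal voltage kT/q in V (26 mV assumed, as in the text)
    """
    i_avg = i_bias / 2.0                       # assumed average bias per transistor
    gm_int = i_avg / v_t                       # intrinsic transconductance, S
    gm_ext = gm_int / (1.0 + r_e * gm_int)     # degenerated by RE
    r_load = swing_v / i_bias                  # load resistor that gives this swing
    gain = gm_ext * r_load
    return gain, f_t_hz / gain


if __name__ == "__main__":
    for swing in (0.400, 0.225):
        gain, f_max = cml_speed_estimate(70e9, 2e-3, swing)
        print(f"swing = {swing * 1e3:3.0f} mV: gain = {gain:4.2f}, "
              f"max rate = {f_max / 1e9:4.1f} GHz")
```

With these inputs the sketch gives about 12.6 GHz for the 400-mV swing and about 22.4 GHz for the 225-mV swing, consistent with the rounded 13 GHz and 23 GHz quoted in the text.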

Emitter-Coupled Logic

By adding emitter followers to the basic HBT CML buffer, the HBT ECL buffer of Fig. 72.11 is formed. From a dc perspective, the emitter followers (Qef1 and Qef2) shift the CML logic level down by Vbe. With the outputs at VoutA/VoutA′ and a 400-mV swing from the differential pair, the first-level ECL logic high is –1.4 V and the ECL logic low is –1.8 V. For some ECL logic gates, a second level is created through a Schottky diode voltage shift (Def1/Def2). The typical Schottky diode turn-on voltage for GaAs is 0.7 V, so the output at VoutB/VoutB′ is –2.1 V for a logic high and –2.5 V for a logic low. In general, the HBT ECL levels differ quite a bit from the standard Si ECL levels. Although resistors can be used to bias Qef1/Qef2, current mirrors (Qcs1/Qcs2 and Rcs1/Rcs2) are typically used. Current mirrors offer a bias that is stable with logic level at the expense of higher capacitance, while resistors offer lower capacitance, but the bias varies more and the resistors may be physically quite large. From an ac point of view, emitter followers have high input impedance (approximately β times larger than an unbuffered input) and low output impedance (approximately 1/gm), which makes them ideal buffers. In comparison with CML gates, since the differential pair now drives a higher load impedance, the effect of loading is reduced, yielding increased bandwidth, faster edge rates, and higher fan-out. The cost of this improvement is the increase in power due to the bias current of the emitter followers. For example, in a 50-GHz HBT process, a CML buffer (fan-out = 1, Ibias = 2 mA, RL1/RL2 = 150 Ω) has a propagation delay (tD) of 14.8 ps with a risetime [20 to 80%] (tr) of 31 ps and a falltime [20 to 80%] (tf) of 21 ps. In comparison, an ECL buffer with level 1 outputs (fan-out = 1, Ibias = 2 mA, Ibias1/Ibias2 = 2 mA, RL1/RL2 = 150 Ω) has tD = 14 ps, tf = 9 ps, and tr = 17 ps.
With a threefold increase in Pdiss, the impedance transformation of the EF stage results in slightly reduced gate delays and significant improvements in the rise/falltimes. With the above ECL buffer modified for level 2 (level shifted) outputs, the performance is only slightly lower with tD = 14.2 ps, tf = 11 ps, and tr = 22 ps. In general, emitter followers tend to have bandwidths approaching the ft of the device, which is significantly higher than the differential pair. Consequently, it is possible to obtain high-speed operation with the EF biased lower than would be necessary to obtain the maximum device ft. With the ECL level 1

buffer, if the Ibias1/Ibias2 is lowered to 1 mA from 2mA, the performance is still quite high, with td = 15 ps, tf = 13 ps, and tr = 18 ps. Although tD approaches the CML case, the tf and tr are still significantly better. As the EF bias is increased, its driving ability is also increased; however, at some point with high bias, the output impedance of the EF becomes increasingly inductive. When combined with large load capacitance (as in the case of high fan-out or long interconnect), it may result in severe ringing in the output that can result in excessive jitter on data edges. The addition of a series resistor between the EF output and the next stage can help to dampen the ringing by increasing the real part of the load. This change, however, increases the RC time constant, which usually results in a significant reduction in performance. In practice, changing the impedance of the EF bias source (high impedance current source or resistor bias) does not have a significant effect on the ringing. As a result, the primary method to control the ringing is through the EF bias, which places a very real constraint on bandwidth, fan-out, and jitter that needs to be considered in the topology of real designs. In some FET DCFL designs, several source followers are cascaded together to increase the input impedance and lower the output resistance between two differential pairs for high-bandwidth drive. Due to voltage headroom limits, it is very difficult to cascade two HBT emitter followers without causing the current source to enter deep saturation. In general, ECL gates are typically used for the high-speed sections due to significant improvement in rise/falltimes (bandwidth) and drive ability, although the power dissipation is higher.

ECL/CML Logic Examples

Typically, ECL and CML logic are mixed throughout high-speed GaAs/AlGaAs HBT designs. As a result, there are three available logic levels that can be used to interconnect various gates. The levels are CML (0/–400 mV), ECL1 (–1.4/–1.8 V), and ECL2 (–2.1/–2.5 V). To form more complex logic functions, typically two levels of transistors are used to steer the bias current. Figure 72.12 shows an example of a CML AND/NAND gate. For the ECL counterpart, it is only necessary to add the emitter followers. The top input is VinA/VinA′. The bottom input is VinB/VinB′. In general, the top can be driven with either the

FIGURE 72.12 Two-level differential CML AND gate.

CML or the ECL1 inputs, and the bottom level can be driven by ECL1 and ECL2 levels. The choice of logic input levels is typically dictated by the design tradeoff between bandwidth, power dissipation, and fan-out. As seen in Fig. 72.12, only when both VinA and VinB are high is Ibias steered into the load resistor that makes Vout = VOH. All other combinations make Vout = VOL, as required by the AND function. Due to the differential nature, if the output terminal labels were reversed, this would be a NAND gate. For the worst-case voltage headroom, VinA/VinA′ is driven with ECL1 levels, resulting in Vcm1 of –2.8 V. With an ECL2 high on VinB (–2.1 V), the lower stage (Q3/Q4) has a B-C forward bias of 0.7 V, which may result in slight saturation. This also implies that Vcm2 is around –3.5 V, which results in an acceptable nominal 100-mV forward bias on the base-collector junction of the current source transistor (Qcs). As Vee becomes less negative, the change in Vee is absorbed across Qcs, which pushes Qcs closer to saturation. In saturation, the current source holds Ie in Qcs constant; so if Ib increases (due to saturation), then Ic decreases. For some current source reference designs that cannot source the increased Ib, the increased loading due to the saturated Ib may lower Vref, which would have a global effect on the circuit bias. If the current source reference can support the increase in Ib, then the bias of only the local saturated differential pair starts to decrease, leading to the potential for lower speed and lower logic swing. For HBTs, the worst-case Qcs saturation occurs at low temperature, and the worst-case saturation for Q3/Q4 occurs at high temperature, since Vbe changes by –1.4 mV/°C and Vdiode by –1.1 mV/°C (for constant-current bias). It is possible to decrease the forward bias of the lower stage by using the base-emitter diode as the level shift to generate the second ECL level; however, the power supply voltage needs to increase from –5.2 to possibly –6 V.
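The headroom bookkeeping in this worst-case analysis can be written out as a short calculation. The Python sketch below is only a dc level budget built from the nominal values quoted in the text (Vbe of 1.4 V, Schottky drop of 0.7 V, 400-mV swing, 400 mV across Rcs); it ignores temperature, bias dependence of Vbe, and any saturation behavior. The function names and dictionary keys are hypothetical.

```python
V_BE = 1.4        # nominal B-E turn-on voltage of a GaAs-based HBT, V
V_SCHOTTKY = 0.7  # nominal GaAs Schottky diode drop, V
SWING = 0.4       # CML logic swing, V
V_RCS = 0.4       # voltage drop across the current-source resistor Rcs, V


def logic_levels():
    """Return (high, low) voltages for the CML, ECL1, and ECL2 levels."""
    cml = (0.0, -SWING)
    ecl1 = (cml[0] - V_BE, cml[1] - V_BE)                # one emitter follower
    ecl2 = (ecl1[0] - V_SCHOTTKY, ecl1[1] - V_SCHOTTKY)  # plus a Schottky shift
    return {"CML": cml, "ECL1": ecl1, "ECL2": ecl2}


def two_level_headroom(v_ee=-5.2):
    """Worst-case node voltages of a two-level gate (top: ECL1, bottom: ECL2)."""
    levels = logic_levels()
    v_ina_high = levels["ECL1"][0]
    v_inb_high = levels["ECL2"][0]
    v_cm1 = v_ina_high - V_BE     # emitters of top pair = collectors of Q3/Q4
    v_cm2 = v_inb_high - V_BE     # emitters of bottom pair = collector of Qcs
    v_ref = v_ee + V_RCS + V_BE   # base of Qcs
    return {
        "Vcm1": v_cm1,
        "Vcm2": v_cm2,
        "Vref": v_ref,
        "Vbc_Q3Q4": v_inb_high - v_cm1,  # B-C forward bias of the lower pair
        "Vbc_Qcs": v_ref - v_cm2,        # B-C forward bias of the current source
    }


if __name__ == "__main__":
    for v_ee in (-5.2, -4.68):   # nominal supply and 10% less negative
        print(v_ee, {k: round(v, 2) for k, v in two_level_headroom(v_ee).items()})
```

At the nominal –5.2 V supply this reproduces the values quoted above (Vcm1 = –2.8 V, Vcm2 = –3.5 V, 0.7 V across the B-C junction of Q3/Q4, and 0.1 V across that of Qcs). With Vee 10% less negative, the forward bias on Qcs in this simple budget rises to about 0.62 V, which illustrates why the supply tolerance matters for the current source.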
With the two-level issues in mind, Fig. 72.13 illustrates the topology for a two-level OR/NOR gate. This design is similar to an AND gate except that Vout = VOL if both VinA and VinB are low. Otherwise, Vout = VOH. By using the bottom differential pair to select one of the two top differential pairs, many other prime logic functions can be implemented. In Fig. 72.14, the top pairs are wired such that Vout = VOL if VinA = VinB, forming the XOR/XNOR block. If the top differential pairs are thought of as selectable

FIGURE 72.13 Two-level differential CML OR/NOR gate.

FIGURE 72.14 Two-level differential CML XOR gate.

buffers with a common output as shown in Fig. 72.15, then a basic 2:1 MUX cell is formed. Here, VinB /VinB′ determines which input (VinA1/VinA1′ or VinA2/VinA2′) is selected to the output. This concept can be further extended to a 4:1 MUX if the top signals are CML and the control signals are ECL1 and ECL2, as shown in Fig. 72.16. Here, the MSB (ECL1) and LSB (ECL2) determine which of the four inputs are selected. With the 2:1 MUX in mind, if each top differential pair had separate output resistors with a common input, a 1:2 DEMUX is formed as shown in Fig. 72.17. The last primary cell of importance is the latch. This is shown in Fig. 72.18. Here, the first differential pair is configured as a buffer. The second pair is configured as a buffer with positive feedback. The positive feedback causes any voltage difference between the input transistors to be amplified to full logic swing and that state is held as long as the bias is applied. With this in mind, as the first buffer is selected (VinB = VOH), the output is transparent to the input. As VinB = VOL , the last value stored in the buffer is held, forming a latch, which in this case, is triggered on the falling edge of the ECL2 level. When two of these blocks are connected together in series, it forms the basic master-slave flip-flop.
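The current-steering behavior of the cells described in this section can be mimicked with a small behavioral model. The Python sketch below tracks only which load resistor sinks Ibias; it contains no voltages or timing. The wiring choices (which collector goes to which load, and which polarity of VinB selects which input) are assumptions made to reproduce the truth tables stated in the text and may differ in detail from Figs. 72.12 to 72.18. All names are hypothetical.

```python
from itertools import product

VOUT, VOUT_BAR = "Vout", "Vout'"   # the two load-resistor nodes


def top_pair(x, node_if_high, node_if_low):
    """Upper differential pair: return the node from which Ibias is sunk."""
    return node_if_high if x else node_if_low


def vout_is_high(sink_node):
    """Vout stays at VOH unless Ibias is sunk through its own load resistor."""
    return sink_node != VOUT


def and_gate(a, b):
    # B low: Ibias goes straight to the Vout load. B high: the top pair decides.
    return vout_is_high(top_pair(a, VOUT_BAR, VOUT) if b else VOUT)


def or_gate(a, b):
    # B high: Ibias goes straight to the Vout' load. B low: the top pair decides.
    return vout_is_high(VOUT_BAR if b else top_pair(a, VOUT_BAR, VOUT))


def xor_gate(a, b):
    # Two top pairs with cross-wired collectors; B selects between them.
    sink = top_pair(a, VOUT, VOUT_BAR) if b else top_pair(a, VOUT_BAR, VOUT)
    return vout_is_high(sink)


def mux2(a1, a2, sel):
    # Two top pairs wired as buffers onto shared loads; sel (VinB) picks one.
    return vout_is_high(top_pair(a1 if sel else a2, VOUT_BAR, VOUT))


class Latch:
    """Transparent while clk is high; holds the last value while clk is low."""

    def __init__(self, q=False):
        self.q = q

    def step(self, d, clk):
        if clk:            # input buffer pair selected: output follows d
            self.q = d
        return self.q      # feedback pair selected: state is held


if __name__ == "__main__":
    for a, b in product((False, True), repeat=2):
        print(int(a), int(b), int(and_gate(a, b)), int(or_gate(a, b)), int(xor_gate(a, b)))
```

The model makes the shared structure explicit: every gate is the same bottom-level selection followed by a top-level steering decision, and only the collector wiring changes between functions.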

Advanced ECL/CML Logic Examples

With small-signal amplifiers, the cascode configuration (a common-base stage on top of a common-emitter stage) typically reduces the Miller capacitance for higher bandwidth. In a two-level gate, when the top-level transistor is on, it forms a cascode stage with the lower differential amplifier. This can lead to higher bandwidths and reduced rise/falltimes. However, in large-signal logic, the bottom transistor must first turn on the top cascode stage before the output can change. This added delay results in a larger propagation delay for the bottom pair than for the top switching pair. In the case of the AND/NAND gate of Fig. 72.12, if Q1 or Q2 switches with Q3 on or if Q4 switches, the propagation delay is short. If Q3 switches, it must first turn

FIGURE 72.15 Two-level 2:1 differential CML MUX gate.

FIGURE 72.16 Three-level 4:1 differential CML MUX gate.

FIGURE 72.17 Two-level 1:2 differential CML DEMUX gate.

FIGURE 72.18 CML latch.

FIGURE 72.19 Single-level CML quasi-differential OR gate.

on either Q1 or Q2, leading to the longest propagation delay. In this case, the longest delay limits the usable bandwidth of the AND gate. Likewise, in the XOR case, the delay of the top input is shorter than that of the lower input, which results in asymmetric behavior and reduced bandwidth. For a 10-GHz flip-flop, this can result in as much as a 10-ps delay from the rising edge of the clock to the sample point of the data. This issue must be taken into account in determining the optimal input data phase for lowest bit errors when dealing with digital data. One solution to the delay issue is to use a quasi-differential signal. In Fig. 72.19, a single-ended single-level OR/NOR gate is shown. Here, a reference level of (VH + VL)/2 is applied to VinA′. If either VinA1 or VinA2 is high, then Vout = VOH. This design has more bandwidth than the two-level topology shown in Fig. 72.12, but some of the noise margin may be sacrificed. Figure 72.20 shows an example of a single-level XOR gate with a similar input level reference. In this case, Ibias1 = Ibias2 = Ibias3. The additional Ibias3 is used to make the output symmetric. Ignoring Ibias3, when VinA is not equal to VinB, Ibias1 and Ibias2 are used to force Vout′ = VOL. When VinA = VinB, Vout = Vout′ since both are lowered by Ibias, resulting in an indeterminate state. To remedy this, Ibias3 is added to Vout to make the outputs symmetric. This design results in higher speed due to the single-level design; however, the noise margin is somewhat reduced due to the quasi-differential approach, and the outputs have a common-mode voltage offset of RLIbias. In a standard differential pair, the output load capacitance can be broken into three parts: the base-collector capacitance of the driving pair, the interconnect capacitance, and the input capacitance of the next stage. The interconnect capacitance is on the order of 5 to 25 fF for adjacent to nearby gates. The base-collector depletion capacitance is on the order of 25 fF.
Assuming that the voltage gain is 5.5, the effective Cbc or Miller capacitance is about 140 fF. Cbe,j , when the transistor is off, is typically less than 6 fF. The Cbe,d capacitance when the transistor is on is of the order of 50 to 200 fF. These rough numbers show that the Miller effect has a significant effect on the effective load capacitance. For the switching transistor, the Miller effect increases both the effective internal Cbc as well as the external load. In these situations, a cascode stage may result in higher bandwidth and sharper rise/falltimes with a slight increase in propagation delay. Figure 72.21 shows a CML gate with an added cascode stage. Due to the 400-mV logic swing, the cascode bases are connected to ground. For higher swings, the cascode bases can be biased to a more negative voltage to avoid saturation. The cascode requires that the input level be either

FIGURE 72.20 Single-level CML quasi-differential XOR gate.

FIGURE 72.21 CML buffer with a cascode output stage.

ECL1 or ECL2 to account for the Vbe drop of the cascode. Since the base of the cascode is held at ac ground, the Miller effect is not seen at the input of the common-base stage as the output voltage swings. From the common-emitter point of view, the collector swings only about 60 mV per decade change in Ic , thus the Miller effect is greatly reduced. The reduction of the Miller effect through cascoding reduces

FIGURE 72.22 CML buffer with a cascode output stage and bleed current to keep the cascode “on”.

the effect of both the internal transistor Cbc and the load capacitance due to Cbc of the other stages, which results in reduced rise/falltimes, especially in the logic transition region. As both of the transistors in a cascode turn off, the increased charge stored in both transistors has to discharge through an RC time constant, which may result in a slower edge rate near the logic high of a rising edge. In poorly designed cascode stages, the corner point between the fast-rising edge and the slower-rising edge may occur near the 20/80% point, canceling out some of the desired gains. Furthermore, with the emitter node of the off common-base stage floating in a high-impedance state, its actual voltage varies with time (the RC discharge is long compared to the switching time). This can result in some “memory” effects, where the actual node voltage depends on the previous bit patterns. In this case, as the transistor turns on, the initial voltage may vary, which can result in increased jitter with digital data. With these effects in mind, the cascoded CML design can be employed with performance advantages in carefully considered situations. One way to remedy the off-cascode issues is to use prebias circuits, as shown in Fig. 72.22. Here, the current sources formed with Qpreb1 and Qpreb2 (Iprebias ≪ Ibias) ensure that the cascode is always slightly on by bleeding a small bias current. This results in improvements in the overall rise- and falltimes, since the cascode does not completely turn off. This circuit does, however, introduce a common-mode offset in the output that may reduce the headroom in a two-level ECL gate that it must drive. Furthermore, a series resistor can be introduced between the bleed point and the current source to decouple the current source capacitance from the high-speed node. This design requires careful consideration of the design tradeoffs involving the ratio of Ibias/Iprebias as well as the potential size of the cascode transistor vs.
the switch transistors for optimal performance. When properly designed, the bleed cascode can lead to significant performance advantages. In general, high-speed HBT circuits require careful consideration and design of each high-speed node with respect to the required level of performance, allowable power dissipation, and fan-out. The primary tools the designer has to work with are device bias, device size, ECL/CML gate topology, and logic level to optimize the design. Once the tradeoff is understood, CML/ECL HBT-based circuits have formed

some of the faster circuits to date. The performance and capability of HBT technology in circuit applications are summarized below.

HBT Circuit Design Examples

A traditional method to benchmark the high-speed capability of a technology is to determine the maximum switching rate of a static frequency divider. This basic building block is employed in a variety of high-speed circuits, which include frequency synthesizers, demultiplexers, and ADCs. The basic static frequency divider consists of a master/slave flip-flop where the output data of the slave flip-flop is fed back to the input data of the master flip-flop. The clock of the master and the clock of the slave flip-flop are connected together, as shown in Fig. 72.23. Due to its low transistor count and its importance in many larger high-speed circuits, the frequency divider has emerged as the primary circuit used to demonstrate the high-speed potential of new technologies. As HBTs started to achieve SSI capability in 1984, a frequency divider with a toggle rate of 8.5 GHz [6] was demonstrated. During the transition from research to pilot production in 1992, a research-based AlInAs/GaInAs HBT (ft of 130 GHz and fmax of 90 GHz) was able to demonstrate a 39.5-GHz divide-by-4 [60]. Recently, an advanced AlInAs/GaInAs HBT technology (ft of 164 GHz and fmax of 800 GHz) demonstrated a static frequency divider operating at 60 GHz [61]. This HBT ECL-based design reports the fastest results to date for any semiconductor technology, which illustrates the potential of HBTs and ECL/CML circuit topology for high-speed circuits. Besides high-speed operation, production GaAs HBTs have also achieved a high degree of integration for LSI circuits. For ADCs and DACs, the turn-on voltage of the transistor (Vbe) is determined by material constants, so there is significantly less threshold variation than in FET-based technologies. This enables the design of high-speed and accurate comparators. Furthermore, the high linearity of HBTs enables the design of sample-and-hold circuits with wide dynamic range and high linearity.
These paramount characteristics result in the dominance of GaAs HBTs in super-high-performance, high-speed ADCs. An 8-bit 2-gigasamples/s ADC has been fabricated with 2500 transistors. The input bandwidth is from dc to 1.5 GHz, with a spur-free dynamic range of about 48 dB [62]. Another lucrative area for digital HBTs is high-speed circuits for fiber-optic-based telecommunications systems. The essential circuit blocks (such as 40-Gb/s 4:1 multiplexers [63]

FIGURE 72.23 2:1 Frequency divider based on two CML latches (master/slave flip-flop).

and 26-GHz variable gain-limiting amplifiers [64]) have been demonstrated with HBTs in the research lab. In general, the system-level specifications (SONET) for telecommunication systems are typically very stringent compared with data communication applications at the same bit rate. The tighter specifications in telecom applications are due to the long-haul nature and the need to regenerate the data several times before the destination is reached. Today, there are many ICs that claim to be SONET-compliant at OC-48 (2.5 Gb/s) and some at OC-192 (10 Gb/s) bit rates. Since the SONET specifications apply at the system level, in truth very few ICs have the performance margin over the SONET specification needed for use in real systems. Due to the integration level, high-speed performance, and reliability of HBTs, some of the first OC-48 (2.5 Gb/s) and OC-192 (10 Gb/s) chip sets deployed (e.g., preamplifiers, limiting amplifiers, clock and data recovery circuits, multiplexers, demultiplexers, and laser/modulator drivers) are based on GaAs HBTs. A 16 × 16 OC-192 crosspoint switch has been fabricated with a production 50-GHz ft and fmax process [65]. The LSI capability of HBT technology is showcased with this 9000-transistor switch on a 6730 × 6130 µm² chip. The high-speed performance is illustrated with the 10-Gb/s eye diagram shown in Fig. 72.24. With less than 3.1 ps of RMS jitter (with four channels running), this is the lowest jitter 10-Gb/s switch to date. At this time, only two 16 × 16 OC-192 switches have been demonstrated [65,66], and both were achieved with HBTs. With a throughput of 160,000 Mb/s, these HBT parts have the largest amount of aggregate data running through them of any IC technology. In summary, III-V HBT technology is a viable high-speed circuit technology with mature levels of integration and reliability for real-world applications. Repeatedly, research labs have demonstrated the world’s fastest benchmark circuits with HBTs using ECL/CML-based circuit topologies.
The production line has shown that current HBTs can achieve both the integration and performance levels required for high-performance analog, digital, and hybrid circuits that operate in the high gigahertz range. Today, the commercial success of HBTs is exemplified by the fact that HBT production lines ship several million HBT ICs every month and that several new HBT production lines are in the works. In the future, it is expected that advances in Si-based technology will start to compete in the markets currently held by III-V technology; however, it is also expected that III-V technology will move on to address ever higher speed and performance issues to satisfy our insatiable demand for bandwidth.

FIGURE 72.24 Typical 10 Gbps eye diagram for OC-192 crosspoint switch.

Chung, M.J., Kim, H. "Internet Based Micro-Electronic Design Automation (IMEDA) Framework" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000

73 Internet-Based Micro-Electronic Design Automation (IMEDA) Framework

Moon Jung Chung, Michigan State University
Heechul Kim, Hankuk University of Foreign Studies

73.1 Introduction
73.2 Functional Requirements of Framework
The Building Blocks of Process • Functional Requirements of Workflow Management • Process Specification • Execution Environment • Literature Surveys
73.3 IMEDA System
73.4 Formal Representation of Design Process
Process Flow Graph • Process Grammars
73.5 Execution Environment of the Framework
The Cockpit Program • Manager Programs • Execution Example • Scheduling
73.6 Implementation
The System Cockpit • External Tools • Communications Model • User Interface
73.7 Conclusion

73.1 Introduction

As the complexity of VLSI systems continues to increase, the micro-electronic industry must be able to reconfigure design and manufacturing resources and integrate design activities so that it can quickly adapt to market changes and new technology. Gaining this ability imposes a two-fold challenge: (1) to coordinate design activities that are geographically separated and (2) to represent an immense amount of knowledge from various disciplines in a unified format. The Internet can provide the catalyst by bridging many design activities with resources around the world, not only to exchange information but also to communicate ideas and methodologies. In this chapter, we present a collaborative engineering framework that coordinates distributed design activities through the Internet. Engineers can represent, exchange, and access the design knowledge and carry out design activities. The crux of the framework is the formal representation of process flow using the process grammar, which provides the theoretical foundation for representation, abstraction, manipulation, and execution of design processes. The abstraction of process representation provides mechanisms to represent hierarchical decomposition and alternative methods, which enable designers to manipulate the process flow diagram and select the best method. In the framework, the process information

is layered into separate specification and execution levels so that designers can capture processes and execute them dynamically. While the framework is executing, a designer can be informed of the current status of the design, such as updates and traces of design changes, and can handle exceptions. The framework can improve design productivity by accessing, reusing, and revising a previous process for a similar design. The cockpit of our framework interfaces with engineers to perform design tasks and to negotiate design tradeoffs. The framework can launch whiteboards that enable engineers in a distributed environment to view the common process flows and data and to concurrently execute dynamic activities such as process refinement, selection of alternative processes, and design reviews. The proposed framework provides various browsers in which the tasks and data used in one activity can be organized and retrieved later for other activities.

One of the predominant challenges for micro-electronic design is to handle the increased complexity of VLSI systems. At the turn of the century, it is expected that there will be 100 million transistors in a single chip with 0.1 micron features, which will require an even shorter design time (Spiller, 1997). This increase in chip complexity has given impetus to trends such as system on a chip, embedded systems, and hardware/software co-design. To cope with this challenge, industry uses commercial off-the-shelf (COTS) components, relies on design reuse, and outsources design. In addition, design is highly modularized and carried out by many specialized teams in a geographically distributed environment. Multiple facets of design and manufacturing, such as manufacturability and low power, should be considered at an early stage of design. It is a major challenge to coordinate these design activities (Fair, 94).
The difficulties are caused by the interdependencies among the activities, the delay in obtaining distant information, the inability to respond to errors and changes quickly, and a general lack of communication. At the same time, the industry must contend with decreased expenditures on manufacturing facilities while maintaining rapid responses to market and technology changes. To meet this challenge, the U.S. government has launched several programs. The Rapid Prototyping of Application Specific Signal Processor (RASSP) Program was initiated by the Department of Defense to bring about the timely design and manufacturing of signal processors. One of the main goals of the RASSP program was to provide an effective design environment to achieve a fourfold improvement in the development cycle of digital systems (Chung, 1996). DARPA also initiated a program to develop and demonstrate key software elements for Integrated Product and Process Development (IPPD) and agile manufacturing applications. One of the foci of the earlier program was the development of infrastructure for distributed design and manufacturing. More recently, the program has continued as Rapid Design Exploration & Optimization (RaDEO) to support research, development, and demonstration of enabling technologies, tools, and infrastructure for the next generation of design environments for complex electro-mechanical systems. The design environment of RaDEO is planned to provide cognitive support to engineers by vastly improving their ability to explore, generate, track, store, and analyze design alternatives (Lyons, 1997).

The new information technologies, such as the Internet and mobile computing, are changing the way we communicate and conduct business. More and more design centers use PCs and link them on the Internet/intranet.
Web-based communication allows people to collaborate across space and time, between humans, humans and computers, and computers in a shared virtual world (Berners-Lee, 1994). This emerging technology holds the key to enhancing design and manufacturing activities. The Internet can be used as the medium of a virtual environment where concepts and methodologies can be discussed, accessed, and improved by the participating engineers. Through the medium, resources and activities can be reorganized, reconfigured, and integrated by the participating organizations. This new paradigm certainly impacts the traditional means for designing and manufacturing a complex product. Using Java, programs can be implemented in a platform-independent way so that they can be executed on any machine with a Web browser. Common Object Request Broker Architecture (CORBA) (Yang, 1996) provides distributed services for tools to communicate through the Internet (Vogel). Designers may be able to execute remote tools through the Internet and see the visualization of design data (Erkes, 1996; Chan, 1998; Chung, 1998). Even though the potential impact of this technology on computer-aided design will be great, the Electronic Design Automation (EDA) industry has been slow in adopting it (Spiller, 1997). Until

recently, EDA frameworks used to be a collection of point tools. These complete suites of tools are integrated tightly by the framework using proprietary technology. Such frameworks have been adequate for carrying out routine tasks where the design process is fixed. However, new tools appear constantly, and mixing and matching tools from outside a particular framework is very difficult. Moreover, the tools, expertise, and materials for the design and manufacturing of a single system are dispersed geographically. We have now reached the stage where a single tool or framework is not sufficient to handle the increasing complexity of a chip and emerging new technology. A new framework is necessary, one that is open and scalable. It must support collaborative design activities so that designers can add new tools to the framework and interface them with other CAD systems.

There are two key functions of the framework: (1) managing the process and (2) maintaining the relationship among many design representations. For design data management, refer to (Katz, 1987). In this chapter, we will focus on the process management aspect. To cope with the complex process of VLSI system design, we need a higher-level view of a complete process, i.e., an abstraction of the process that hides all details that need not be considered for the purpose at hand. As pointed out in National Institute of Standards and Technology reports (Schlenoff, 1996; Knutilla, 1998), a “unified process specification language” should meet the following major requirements: abstraction, alternative tasks, complex groups of tasks, and complex sequences. In this chapter we first review the functional requirements of process management in VLSI system design. We then present the Internet-based Micro-Electronic Design Automation (IMEDA) System.
IMEDA is a web-based collaborative engineering framework where engineers can represent, exchange, and access design knowledge and perform design activities through the Internet. The crux of the framework is a formal representation of process flow using a process grammar. Similar to a language grammar, the production rules of the process grammar map tasks into admissible process flows (Baldwin, 1995a). The production rules allow a complex activity to be represented more concisely with a small number of high-level tasks. The process grammar provides the theoretical foundation for representation, abstraction, manipulation, and execution of design and manufacturing processes. It facilitates communication at an appropriate level of complexity. The abstraction mechanism provides a natural way of browsing the process repository and facilitates process reuse and improvement. The strong theoretical foundation of our approach allows users to analyze and predict the behavior of a particular process. The cockpit of our framework interfaces with engineers to perform design tasks and to negotiate design tradeoffs. The framework guides the designer in selecting tools and design methodologies, and it generates process configurations that provide optimal solutions under a given set of constraints. The just-in-time binding and the location transparency of tools maximize the utilization of company resources. The framework is equipped with whiteboards so that engineers in a distributed environment can view the common process flows and data and concurrently execute dynamic activities such as process refinement, selection of alternative processes, and design reviews. With the grammar, the framework gracefully handles exceptions and alternative productions. A layered approach is used to separate the specification of the design process from execution parameters. One of the main advantages of this separation is that it frees designers from overspecification and permits graceful exception handling.
The framework, implemented using Java, is open and extensible. New processes, tools, and user-defined process knowledge and constraints can be added easily.
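The idea that production rules map a task into admissible process flows can be illustrated with a small sketch. It is not the formalism of Section 73.4: flows are simplified here to ordered lists of subtasks rather than process flow graphs, and the task names and the `PRODUCTIONS` table are hypothetical. The sketch only shows how a few high-level rules can stand for several concrete flows.

```python
from itertools import product

# Hypothetical production rules: each logical task maps to one or more
# alternative flows; each flow is an ordered list of subtasks.
# Task names are illustrative only and do not come from the chapter.
PRODUCTIONS = {
    "DesignChip": [["Synthesize", "Layout"]],
    "Synthesize": [["LogicSynthesis"], ["ManualSchematic"]],
    "Layout": [["PlaceAndRoute"]],
}


def is_atomic(task):
    """A task with no production rule is treated as atomic (a bound tool)."""
    return task not in PRODUCTIONS


def expand(task):
    """Return every fully atomic flow derivable from `task`."""
    if is_atomic(task):
        return [[task]]
    flows = []
    for alternative in PRODUCTIONS[task]:
        # Expand each subtask, then combine the choices in sequence order.
        expanded = [expand(sub) for sub in alternative]
        for combo in product(*expanded):
            flows.append([t for part in combo for t in part])
    return flows


if __name__ == "__main__":
    for flow in expand("DesignChip"):
        print(" -> ".join(flow))
```

With these three rules, the single task `DesignChip` stands for two admissible flows, one per alternative production of `Synthesize`. A designer would normally pick one alternative at a time rather than enumerate all of them; the enumeration is used here only to make the rule set visible.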

73.2 Functional Requirements of Framework

Design methodology is defined as a collection of principles and procedures employed in the design of engineering systems. Baldwin and Chung (Baldwin, 1995a) define design methodology management as selecting and executing methodologies so that the input specifications are transformed into desired output specifications. Kleinfeldt (1994) states that “design methodology management provides for the definition, presentation, execution, and control of design methodology in a flexible, configured way.” Given a methodology, we can select a process or processes for that particular methodology. Each design activity, whether big or small, can be treated as a task. A complex design task is hierarchically decomposed into simpler subtasks, and each subtask in turn may be further decomposed. Each

task can be considered as a transformation from input specification to output specification. The term workflow is used to represent the details of a process, including its structure in terms of all the required tasks and their interdependencies. Some processes may be ill-structured, and capturing them as workflows may not be easy. Exceptions, conditional executions, and human involvement during the process make it difficult to model the process as a workflow. There can be many different tools or alternative processes to accomplish a task. Thus, a design process requires design decisions such as selecting tools and processes as well as selecting appropriate design parameters. At a very high level of design, the input specifications and constraints are very general and may even be ill-structured. As we continue to decompose and perform the tasks based on design decisions, the output specifications are refined and the constraints on each task become more restrictive. When the output of a task does not meet certain requirements or constraints, a new process, tool, or set of parameters must be selected. Therefore, the design process is typically iterative and based on previous design experience. The design process is also collaborative, involving many different engineering activities and requiring coordination among engineers, their activities, and the design results. Until recently, it was the designer’s responsibility to determine which tools to use and in what order to use them. However, managing the design process itself has become difficult, since each tool has its own capabilities and limitations. Moreover, new tools are developed and new processes are introduced continually. The situation is further aggravated by incompatible assumptions and data formats between tools.
To manage the process, we need a framework to monitor the process, carry out design tasks, support cooperative teamwork, and maintain the relationship among many design representations (Chiueh, 1990; Katz, 1987). The framework must support concurrent engineering activities by integrating various CAD tools and process and component libraries into a seamless environment. Figure 73.1 shows the RASSP enterprise system architecture (Welsh, 1995). It integrates tools, tool frameworks, and data management functions into an enterprise environment. The key functionality of the RASSP system is managing the RASSP design methodology by “process automation”, that is, controlling CAD program execution through workflow.

FIGURE 73.1 RASSP enterprise system architecture.

The Building Blocks of Process

The lowest-level building block of a design process is a tool. A tool is an indivisible unit of a CAD program. It usually performs a specific task by transforming given input specifications into output specifications. A task is defined as a design activity that includes information about what tools to use and how to use them. It can be decomposed into smaller subtasks. The simplest form of task, called an atomic task, is one that cannot be decomposed into subtasks. In essence, an atomic task is defined as an encapsulated tool. A task is called logical if it is not atomic. A workflow of a logical task describes the details of how the task is decomposed into subtasks, together with the data and control dependencies, such as the relationship between design data used in the subtasks. For a given task, there can be several workflows, each of which denotes a possible way of accomplishing the task. A methodology is a collection of supported workflows together with information on which workflow should be selected in a particular instance.
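The relationships among these building blocks can be made concrete with a minimal data-model sketch. The class and field names below (`Tool`, `Task`, `Workflow`, `Methodology`, `dependencies`, `select`) are assumptions chosen for illustration and are not taken from the IMEDA implementation; the selection rule in particular is a placeholder for whatever information a real methodology would carry.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Tool:
    """An indivisible CAD program."""
    name: str


@dataclass
class Task:
    """A design activity; atomic if it wraps a tool, logical otherwise."""
    name: str
    tool: Optional[Tool] = None

    @property
    def is_atomic(self) -> bool:
        return self.tool is not None


@dataclass
class Workflow:
    """One way to decompose a logical task into subtasks."""
    name: str
    subtasks: List[Task]
    # (producer, consumer) pairs: the consumer uses data from the producer.
    dependencies: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Methodology:
    """Supported workflows per task, plus a rule for choosing among them."""
    workflows: Dict[str, List[Workflow]]

    def select(self, task_name: str, context: dict) -> Workflow:
        candidates = self.workflows[task_name]
        # Hypothetical selection rule: honor a preferred workflow name
        # if the context supplies one, otherwise take the first candidate.
        preferred = context.get("preferred")
        for wf in candidates:
            if wf.name == preferred:
                return wf
        return candidates[0]
```

In this sketch a logical task such as `Task("Verify")` has no tool of its own; it is accomplished by whichever `Workflow` the `Methodology` selects for it, and each subtask in that workflow is either atomic or decomposed again in the same way.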

Functional Requirements of Workflow Management

To be effective, a framework must integrate many design automation tools and allow the designer to specify acceptable methodologies and tools together with information such as when and how they may be used. Such a framework must not only guide the designer in selecting tools and design methodologies, but also aid the designer in constructing a workflow that is suitable for completing the design under given constraints. The constructed workflow should guarantee that required steps are not skipped and that built-in design checks are incorporated into the workflow. The framework must also keep the relationships between various design representations, maintain the consistency between designs, support cooperative teamwork, and allow the designer to interact with the system to adjust design parameters or to modify the previous design process. The framework must be extensible to accommodate rapidly changing technologies and emerging new tools. Such a framework can facilitate developing new hardware systems as well as redesigning a system from a previous design.

During a design process, the particular methodology or workflow selected by a designer must be based on available tools, resources (computing and human), and design data. For example, a company may impose a rule that if the input is a VHDL behavioral description, then designers should use Model Technology’s VHDL simulator, but if the input is Verilog, they must use the ViewLogic simulator. Or, if a component uses Xilinx, then all other components must also use Xilinx. Methodology must be driven by local expertise and individual preference, which in turn are based on the designer’s experience. Process management should not constrain the designer. Instead, it must free designers from routine tasks and guide the execution of the workflow.
User interaction and a designer’s freedom are especially important when exceptions are encountered during the execution of flows, or when designers want to modify the workflow locally. The system must support such activities through “controlled interactions” with designers. Process management can be divided into two parts:
• A formal specification of supported methodologies and tools that must show the tasks and data involved in a workflow and their relationships.
• An execution environment that helps designers to construct workflows and execute them.
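The company rules used as examples above (simulator choice by input language, and a single FPGA vendor across components) can be written down as data that a framework checks, rather than as conventions designers must remember. The following is a minimal sketch under that assumption; the dictionary keys `input_format`, `simulator`, and `fpga_vendor` and both function names are hypothetical.

```python
# Each rule pairs a condition on the design context with a required choice.
# The rule contents mirror the examples in the text; the dictionary keys
# ("input_format", "simulator", "fpga_vendor") are hypothetical.
RULES = [
    (lambda c: c.get("input_format") == "VHDL",
     "simulator", "Model Technology VHDL simulator"),
    (lambda c: c.get("input_format") == "Verilog",
     "simulator", "ViewLogic simulator"),
]


def required_choices(context):
    """Return the choices mandated by every rule whose condition holds."""
    return {key: value for cond, key, value in RULES if cond(context)}


def vendor_conflicts(components):
    """If any component uses Xilinx, list the components that do not."""
    vendors = {name: c.get("fpga_vendor") for name, c in components.items()}
    if "Xilinx" not in vendors.values():
        return []
    return sorted(n for n, v in vendors.items() if v != "Xilinx")
```

Keeping the rules in a table of this kind is one way to let methodology developers add or change constraints without touching the code that executes workflows.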

Process Specification

Methodology management must provide facilities to specify design processes. Specification of processes involves tasks and their structures (i.e., workflow). The tasks involved and the flow of the process, that is, the way the process can be accomplished in terms of its subtasks, must be defined. Processes must be encapsulated and presented to designers in a usable way. Designers want an environment to guide them in building a workflow and to help them execute it during the design process. Designers must be able to browse related processes, and compare, analyze, and modify them.

Tasks

Designers should be able to define tasks, which can be logical or atomic, organize the defined tasks, and retrieve them. Task abstraction refers to using and viewing a task for specific purposes and ignoring the irrelevant aspects of the task. In general, object-oriented approaches are used for this purpose. Abstraction of a task may be accomplished by defining it in terms of “the operations the task is performing” without detailing the operations themselves. Abstraction of tasks allows users to see their behavior clearly and to use them without knowing the details of their internal implementations. Using the generalization–specialization hierarchy (Chung, 1990), similar tasks can be grouped together. In the hierarchy, a node at a lower level inherits its attributes from its predecessors. By inheriting the behavior of a task, the program can be shared, and by inheriting the representation of a task (in terms of its flow), the structure (workflow) can be shared. The Process Handbook (Malone, in press) embodies concepts of specialization and decomposition to represent processes.

There are various approaches to binding a specific tool to an atomic task. A tool can be bound to a task statically at compile time, or dynamically at run time based on available resources and constraints. When a new tool is installed, designers should be able to modify the existing bindings. The simplest approach is to modify the source code or write a script file and recompile the system. The ideal case is plug and play, which requires CAD vendors to address the need for tool interoperability, e.g., through the Tool Encapsulation Specification (TES) proposed by CFI (CFI, 1995).

Workflow

To define a workflow, we must specify the tasks involved in the workflow, the data, and their relationships. A set of workflows defined by methodology developers requires users to follow the flows imposed by the company or group.
Flows may also serve to guide users in developing their own flows. Designers would retrieve the cataloged flows, modify them, and use them for their own purposes based on the guidelines imposed by the developer. It is necessary to generate legal flows. A blackboard approach was used in (Lander, 1995) to generate a particular flow suitable for a given task. In Nelsis (Bosch, 1991), branches of a flow are explicitly represented using “or” nodes and “merge” nodes. A task can be accomplished in various ways. It is necessary to represent alternative methodologies for the task succinctly so that designers can access them and select the best one based on what-if analysis.

IDEF3.X (IDEF) is used to graphically model workflow in the RASSP environment. Figure 73.2 shows an example of a workflow using IDEF3.X. A node denotes a task. It has inputs, outputs, mechanisms, and conditions. The IDEF definition has been around for 20 years, mainly to capture flat models such as a shop floor process. An IDEF specification, however, requires complete information, such as control mechanisms and scheduling, at specification time, making the captured process difficult to understand. In IDEF, “or” nodes are used to represent alternative paths; there is no explicit mechanism to represent alternative workflows. IDEF is well suited only to documenting current practice and is not suitable for executing an iterative process whose course is determined during execution. Perhaps the most important aspect missing from most process management systems is the abstraction mechanism (Schlenoff, 1996).
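Two ideas from the Tasks discussion above, inheritance within a generalization–specialization hierarchy and run-time binding of a tool to a task, can be sketched together. This is only an illustration: the class names, the `tool_kind` attribute, the `REGISTRY` table, and the tool names `sim_a` and `sim_b` are all hypothetical, and the sketch does not represent TES or any vendor interface.

```python
class Task:
    """Base of a generalization-specialization hierarchy of tasks."""
    tool_kind = None  # abstract role that a concrete tool must fill

    def describe(self):
        return f"{type(self).__name__} needs a '{self.tool_kind}' tool"


class Simulate(Task):
    tool_kind = "simulator"


class GateLevelSimulate(Simulate):
    """Specialization: inherits tool_kind and describe() from Simulate."""


# Hypothetical registry of installed tools, keyed by the role they fill.
REGISTRY = {"simulator": ["sim_a", "sim_b"]}


def bind(task, available, registry=REGISTRY):
    """Run-time binding: pick the first registered tool that is available."""
    for tool in registry.get(task.tool_kind, []):
        if tool in available:
            return tool
    raise LookupError(f"no available tool for {task.tool_kind}")
```

Because the binding is looked up when `bind` is called, installing a new tool amounts to adding an entry to the registry; no task definition has to be edited or recompiled, which is the contrast with static binding drawn in the text.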

Execution Environment

The execution environment provides dynamic execution of tasks and tools and binds data to tools, either manually or automatically. Few frameworks separate the execution environment from the specification of the design process. There are several modes in which a task can be executed (Kleinfeldt, 1994): manual mode, manual execution of flow, automatic flow execution, and automatic flow generation. In manual flow execution, the environment executes a task in the context of a flow. In an automatic flow execution environment, tasks are executed in the order specified by the flow graph. In automatic flow generation, the framework generates workflows dynamically and executes them without the guidance of designers. Many frameworks use blackboard- or knowledge-based approaches to generate workflows. However, it is important for designers to be able to analyze the workflow created and share it with others.

FIGURE 73.2 Workflow example using IDEF definition.

That is, repeatability and predictability are important factors if frameworks support dynamic creation of workflows. Each task may be associated with pre- and post-conditions. Before a task is executed, the pre-condition of the task is evaluated. If the condition is not satisfied, the framework either waits until the condition is met, or aborts the task and selects another alternative. After the task is executed, its post-condition is evaluated to determine if the result meets the exit criteria. If the evaluation is unsatisfactory, another alternative should be tried. When a task is complex, involving many subtasks each of which in turn has many alternatives, generating a workflow that would successfully accomplish the task is not easy. If the first try of an alternative is not successful, another alternative should be tried. In some cases, backtracking occurs, which nullifies all executions of the previous workflow.
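The pre-condition, execute, post-condition cycle described above can be sketched as a loop over alternatives. The sketch makes several simplifying assumptions: the design state is a plain dictionary, only the "abort and select another alternative" branch is modeled (not waiting for a condition to become true), and undoing a failed attempt is done by running each alternative on a copy of the state. The names `Alternative` and `execute_task` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alternative:
    name: str
    pre: Callable[[dict], bool]      # must hold before execution
    run: Callable[[dict], dict]      # returns a new design state
    post: Callable[[dict], bool]     # exit criteria on the result


def execute_task(alternatives: List[Alternative], state: dict) -> Optional[tuple]:
    """Try alternatives in order; return (name, new_state) or None."""
    for alt in alternatives:
        if not alt.pre(state):
            continue                   # pre-condition failed: pick another
        result = alt.run(dict(state))  # work on a copy so failure is undone
        if alt.post(result):
            return alt.name, result
        # post-condition failed: discard the result and try the next one
    return None


if __name__ == "__main__":
    # Hypothetical routing alternatives with a 10 ns exit criterion.
    fast = Alternative("fast_router", lambda s: s["placed"],
                       lambda s: {**s, "delay_ns": 12},
                       lambda s: s["delay_ns"] <= 10)
    careful = Alternative("timing_driven_router", lambda s: s["placed"],
                          lambda s: {**s, "delay_ns": 9},
                          lambda s: s["delay_ns"] <= 10)
    print(execute_task([fast, careful], {"placed": True}))
```

In the usage example the first alternative runs but misses the exit criterion, so its result is discarded and the second alternative is selected. Returning `None` when every alternative fails is the point at which an enclosing task would have to backtrack to one of its own alternatives.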

Literature Surveys Many systems have been proposed to generate design process (Knapp, 1991) and manage workflow (Dellen, 1997; Lavana, 1997; Schurmann, 1997; Sutton, 1998). Many of them use the web technology to coordinate various activities in business (Andreoli, 1997), manufacturing (Berners-Lee, 1994; Cutkosy, 1996; Erkes, 1996), and micro-electronic design (Rastogi, 1993, Chan, 1998). WELD (Chan, 1998) is a network infrastructure for a distributed design system that offers users the ability to create a customizable and adaptable virtual design system that can couple tools, libraries, design, and validation services. It provides support not only for designing but also for manufacturing, consulting, component acquisition, and product distribution, encompassing the developments of companies, universities, and individuals throughout the world. Lavana et al. (1997) proposed an Internet-based collaborative design. They use Petri nets as a modeling tool for describing and executing workflow. User teams, at different sites, control the workflow execution by selection of its path. Minerva II (Sutton, 1998) is a software tool that provides design process management capabilities serving multiple designers working with multiple CAD frameworks. The proposed system generates design plan and realizes unified design process management across multiple CAD frameworks and potentially across multiple design disciplines. ExPro (Rastogi, 1993) is an expert-system-based process management system for the semiconductor design process.


There are several systems that automatically determine what tools to execute. OASIS (OASIS, 1992) uses Unix make file style to describe a set of rules for controlling individual design steps. The Design Planning Engine of the ADAM system (Knapp, 1986; Knapp, 1991) produces a plan graph using a forward chaining approach. Acceptable methodologies are specified by listing pre-conditions and post-conditions for each tool in a lisp-like language. Estimation programs are used to guide the chaining. Ulysses (Bushnell, 1986) and Cadweld (Daniel, 1991) are blackboard systems used to control design processes. A knowledge source, which encapsulates each tool, views the information on the blackboard and determines when the tool would be appropriate. The task management is integrated into the CAD framework and Task Model is interpreted by a blackboard architecture instead of a fixed inference mechanism. Minerva (Jacome, 1992) and the OCT task manager (Chiu, 1990) use hierarchical strategies for planning the design process. Hierarchical planning strategies take advantage of knowledge about how to perform abstract tasks which involve several subtasks. To represent design process and workflow, many languages and schema have been proposed. NELSIS (Bosch, 1991) framework is based on a central, object-oriented database and on a flow management. It uses a dataflow graph as Flow Model and provides the hierarchical definition and execution of design flow. PLAYOUT (Schurmann, 1997) framework is based on separate Task and Flow Models which are highly interrelated among themselves and the Product Model. In (Barthelmann, 1996), graph grammar is proposed in defining the task of software process management. Westfechtel (1996) proposed “processnet” to generate the process flow dynamically. However, in many of these systems, the relationship between task and data is not explicitly represented. 
Therefore, representing the case in which a task generates more than one datum and each of them goes to a different task is not easy. In (Schurmann, 1997), the Task Model (describing the I/O behavior of design tools) is used as a link between the Product Model and the Flow Model. The proposed system integrates data and process management to provide traceability. Many systems use IDEF to represent a process (Chung, 1996; IDEF; Stavas). An IDEF specification, however, requires complete information such as control mechanisms and scheduling at specification time, making the captured process difficult to understand. Although there are many other systems that address the problem of managing processes, most proposed systems use either a rule-based approach or a hard-coded process flow. They frequently require source code modification for any change in the process. Moreover, they do not have a mathematical formalism. Without the formalism, it is difficult to handle the iterative nature of the engineering process and to simulate the causal effects of any changes in parameters and resources. Consequently, coordinating the dynamic nature of processes is not well supported in most systems. It is difficult to analyze how an output was generated and where a failure occurred. These systems also lack a systematic way of generating all permissible process flows at any level of abstraction while providing means to hide the details of the flow when they are not needed. Most systems tend to over-specify the flow information, requiring complete details of a process flow before executing the process. In most real situations, the complete flow information may not be known until after the process has been executed, so these systems are limited in their ability to address the underlying problem of process flexibility. They are rather rigid and not centered on users, and do not handle exceptions gracefully. 
Thus, the major functions for the collaborative framework such as adding new tools and sharing and improving the process flow cannot be realized. Most of them are weak in at least one of the following criteria suggested by NIST (Schlenoff et al., 1996): process abstraction, alternative tasks, complex groups of tasks, and complex sequences.

73.3 IMEDA System The Internet-based Micro-Electronic Design Automation (IMEDA) System is a general management framework for performing various tasks in design and manufacturing of complex micro-electronic systems. It provides a means to integrate many specialized tools such as CAD and analysis packages, and allows the designer to specify acceptable methodologies and tools together with information such as when and how they may be used. IMEDA is a collaborative engineering framework that coordinates


design activities distributed geographically. The framework facilitates the flow of multimedia data sets representing design process, production, and management information among the organizational units of a virtual enterprise. IMEDA uses process grammar (Baldwin, 1995) to represent the dynamic behavior of the design and manufacturing process. In a sense, IMEDA is similar to agent-based approaches such as Redux (Petrie, 1996). Redux, however, does not provide a process abstraction mechanism or a facility to display the process flow explicitly. The major functionality of the framework includes
• Formal representation of the design process using the process grammar that captures a complex sequence of activities of micro-electronic design.
• Execution environment that selects a process, elaborates the process, invokes tools, pre- and post-evaluates the productions to determine whether the results meet the criteria, and notifies designers.
• User interface that allows designers to interact with the framework, guides the design process, and edits the process and productions.
• Tool integration and communication mechanism using Internet sockets and HTTP.
• Access control that provides a mechanism to secure the activity, and notification and approval that provide the mechanisms to disperse design changes to, and responses from, subscribers.

IMEDA is a distributed framework. Design knowledge, including process information, manager programs, etc., is maintained in a distributed fashion by local servers. Figure 73.3 illustrates how IMEDA links tools and sites for distributed design activities. The main components of IMEDA are
• System Cockpit: It controls all interactions between the user and the system and between the system components. The cockpit will be implemented as a Java applet and may be executable on any platform for which a Java-enabled browser is available. It keeps track of the current design status and informs the user of possible actions. 
It allows users to collaboratively create and edit process flows, production libraries, and design data.
• Manager Programs: These encapsulate design knowledge. Using pre-evaluation functions, managers estimate the possibility of success for each alternative. They invoke tools and call post-evaluation functions to determine if a tool’s output meets the specified requirements. The interface servers allow cockpits and other Java-coded programs to view and manipulate production, task

FIGURE 73.3 The architecture of IMEDA.


and design data libraries. Manager programs must be maintained by tool integrators to reflect site-specific information such as company design practices and different ways of installing tools.
• Browsers: The task browser organizes the tasks in a generalization-specialization (GS) hierarchy and contains all the productions available for each task. The data-specification browser organizes the data-specifications in a GS hierarchy and contains all the children.
• External Tools: These programs are the objects invoked by the framework during DM activities. Each atomic task in a process flow is bound to an external tool. External tools are typically written by the domain experts.
• Site Proxy Server: Any physical site that will host external tools must have a site proxy server running. These servers provide an interface between the cockpit and the external tools. The site server receives requests from system cockpits and invokes the appropriate tool. Following the tool completion, the site server notifies the requesting cockpit, returning results, etc.
• CGI Servers and Java Servlets: The system cockpit may also access modules and services provided by CGI servers or the more recently introduced Java servlets. Currently, the system integrates modules of this type as direct components of the system (as opposed to external tools that may vary with the flow).
• Database Servers: Access to component data is a very important function. Using an API called JDBC, the framework can directly access virtually any commercially available database server remotely.
• Whiteboard: The shared cockpit, or “whiteboard,” is a communication medium to share information among users in a distributed environment. It allows designers to interact with the system and guides the design process collaboratively. Designers will be able to examine design results and current process flows, post messages, and carry out design activities both concurrently and collaboratively. 
Three types of whiteboards are the process board, the chat board, and the freeform drawing board. Their functionality includes: (i) process board to display the common process flow graph, indicating the current task being executed and the intermediate results arrived at before the current task; (ii) drawing board to load visual design data, and to design and simulate processes; and (iii) chat board to allow participants to communicate with each other via a text-based dialog box.

IMEDA uses a methodology specification based on process flow graphs and process grammars (Baldwin, 1995). Process grammars are the means for transforming high-level process flow graphs into progressively more detailed graphs by applying a set of substitution rules, called productions, to nodes that represent logical tasks. The grammar provides not only the process aspect of design activities but also a mechanism to coordinate them. The formalism in process grammar facilitates abstraction mechanisms to represent hierarchical decomposition and alternative methods, which enable designers to manipulate the process flow diagram and select the best method. The formalism provides the theoretical foundations for the development of IMEDA. IMEDA contains the database of admissible flows, called process specifications. With the initial task, constraints, and execution environment parameters, including personal profile, IMEDA guides designers in constructing process flow graphs in a top-down manner by applying productions. It also provides designers with the ability to discover process configurations that provide optimal solutions. It maintains consistency among designs and allows the designer to interact with the system and adjust design parameters, or modify the previous design process. As the framework is being executed, a designer can be informed of the current status of the design, such as updating and tracing design changes, and is able to handle exceptions. Real-world processes are typically very complex by their very nature; IMEDA provides designers the ability to analyze, organize, and optimize processes in a way never before possible. More importantly, the framework can improve design productivity by accessing, reusing, and revising the previous process for a similar design. The unique features of our framework include

Process Abstraction/Modeling: Process grammars provide an abstraction mechanism for modeling admissible process flows. 
The abstraction mechanism allows a complex activity to be represented


more concisely with a small number of higher-level tasks, providing a natural way of browsing the process repository. The strong theoretical foundation of our approach allows users to analyze and predict the behavior of a particular process. With the grammar, the process flow gracefully handles exceptions and alternative productions. When a task has alternative productions, backtracking occurs to select other productions.

Separation of Process Specification and Execution Environment: Execution environment information such as complex control parameters and constraints is hidden from the process specification. The information of these two layers is merely linked together to show the current task being executed on a process flow. The represented process flow can be executed in both automatic and manual modes. In the automatic mode, the framework executes all possible combinations to find a solution. In the manual mode, users can explore the design space.

Communication and Collaboration: To promote real-time collaboration among participants, the framework is equipped with the whiteboard, a communication medium to share information. Users can browse related processes, compare them with other processes, and analyze and simulate them. Locally managed process flows and productions can be integrated by the framework in the central server. The framework manages the production rules governing the higher-level tasks, while lower-level tasks and their productions are managed by local servers. This permits the framework to be effective in orchestrating a large-scale activity.

Efficient Search of Design Process and Solution: IMEDA is able to select the best process and generate a process plan, or select a production dynamically and create a process flow. The process grammar easily captures design alternatives. The execution environment selects and executes the best one. If the selected process does not meet the requirement, then the framework backtracks and selects another alternative. This backtracking occurs recursively until a solution is found. If the designer is allowed to select the best solution among many feasible ones, the framework may generate multiple versions of the solution.

Process Simulation: The quality of a product depends on the tools (maturity, speed, and special strength of the tool), process (or workflow selected), and design data (selected from the reuse library). Our framework predicts the quality of results (product) and assesses the risk and reliability. This information can be used to select the best process or workflow suitable for a project.

Parallel Execution of Several Processes and Multiple Versions: To reduce the design time and risk, it is necessary to execute independent tasks in parallel whenever they are available. Sometimes it is necessary to investigate several alternatives simultaneously, or the designer may want to execute multiple versions with different design parameters. The key issue in this case is scheduling the tasks to optimize the resource requirements.

Life Cycle Support of Process Management: The process can be regarded as a product. A process (such as airplane designing or shipbuilding) may last many years. During this time, it may be necessary for the process itself to be modified because of new tools and technologies. Life cycle support includes updating the process dynamically, testing and validating the design process, and version history and configuration management of the design process. Tests and validations of the design processes, the simulation of processes, and impact analysis are necessary tools.
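The recursive backtracking over alternative productions mentioned under "Efficient Search" can be sketched as a depth-first search. This is only an illustration under simplifying assumptions: a production is reduced to a rating plus an ordered list of subtasks, tools are plain callables that report success, and the task and tool names are hypothetical. It does not model data specifications, constraint checking, or undoing the side effects of a failed alternative.

```python
from typing import Callable, Dict, List, Tuple

Production = Tuple[float, List[str]]   # (rating, subtasks on the right side)


def execute(task: str,
            productions: Dict[str, List[Production]],
            tools: Dict[str, Callable[[], bool]],
            trace: List[str]) -> bool:
    """Execute a task, trying higher-rated productions first and backtracking on failure."""
    if task in tools:                          # atomic task: invoke the bound tool
        trace.append(task)
        return tools[task]()
    ranked = sorted(productions.get(task, []), key=lambda p: p[0], reverse=True)
    for _rating, subtasks in ranked:
        # all() stops at the first failing subtask of this production
        if all(execute(sub, productions, tools, trace) for sub in subtasks):
            return True                        # this production succeeded
        # otherwise backtrack and try the next alternative production
    return False                               # no production worked for this task


# Hypothetical grammar: two alternatives for one logical task.
productions = {
    "physical synthesis": [
        (0.9, ["fpga partition", "fpga route"]),
        (0.3, ["standard cell layout"]),
    ],
}
tools = {
    "fpga partition": lambda: True,
    "fpga route": lambda: False,               # pretend routing fails
    "standard cell layout": lambda: True,
}
trace: List[str] = []
succeeded = execute("physical synthesis", productions, tools, trace)
```

In this run the higher-rated production fails at its second subtask, so the search falls back to the lower-rated one; `trace` records every tool invocation, including the work that was later abandoned.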

73.4 Formal Representation of Design Process1 IMEDA uses a methodology specification based on process flow graphs and process grammars (Baldwin, 1995). The grammar is an extension of the graph grammar originally proposed by Ehrig (1979) and has been applied to interconnection networks (Derk, 1998) and software engineering (Heiman, 1997).

1 Materials in this section are excerpted from R. Baldwin and M.J. Chung, IEEE Computer, Feb. 1995. With permission.


Process Flow Graph A process flow graph depicts tasks, data, and the relationships among them, describing the sequence of tasks for an activity. Four basic symbols are used to represent a process flow graph. Oval nodes represent Logical Tasks, two-concentric oval nodes represent Atomic Tasks, rectangular nodes represent Data Specifications, and diamond nodes represent Selectors. A task that can be decomposed into subtasks is called logical. Logical task nodes represent abstract tasks that could be done with several different tools or tool combinations. A task that cannot be decomposed is atomic. An atomic task node, commonly called a tool invocation, represents a run of an application program. A selector is a task node that selects data or parameters. Data specifications are design data, where the output specification produced by a task can be consumed by another task as an input specification. Each data specification node, identified by a rectangle, is labeled with a data specification type. Using the graphical elements of the flow graph, engineers can create a process flow in a top-down fashion. These elements can be combined into a process flow graph using directed arcs. The result is a bipartite acyclic directed graph that identifies clearly the task and data flow relationships among the tasks in a design activity. The set of edges indicates those data specifications used and produced by each task. Each specification must have at most one incoming edge. Data specifications with no incoming edges are inputs of the design exercise. T(G), S(G), and E(G) are the sets of task nodes, specification nodes, and edges of graph G, respectively. Figure 73.4 shows a process flow graph that describes a possible rapid

FIGURE 73.4 A sample process flow graph in which a state diagram is transformed into a field-programmable gate array configuration file.


FIGURE 73.5 Graph production from a design process grammar. Two simulation alternatives based on input format are portrayed in (a); two partition alternatives representing different processes for an abstract task are portrayed in (b).

prototyping design process, in which a state diagram is transformed into a field-programmable gate array (FPGA) configuration file. The various specification types form a class hierarchy where each child is a specialization of the parent. There may be several incompatible children. For example, VHDL and Verilog descriptions are both children of simulation models. We utilize these specification types to avoid data format incompatibilities between tools (see Fig. 73.5a). Process flow graphs can describe design processes to varying levels of detail. A graph containing many logical nodes abstractly describes what should be done without describing how it should be done (i.e., specifying which tools to use). Conversely, a graph in which all task nodes are atomic completely describes a methodology. In our prototype, we use the following definitions: In(N) is the set of input nodes of node N: In(N) = { M | (M, N) ∈ E }. Out(N) is the set of output nodes of node N: Out(N) = { M | (N, M) ∈ E }. I(G) is the set of input specifications of graph G: I(G) = { N ∈ S(G) | In(N) = ∅ }.
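The sets T(G), S(G), E(G) and the derived sets In(N), Out(N), and I(G) map directly onto a small data structure. The sketch below is an assumption-laden illustration rather than the prototype's code: the class name `FlowGraph`, its method names, and the specification labels are hypothetical. It enforces the bipartite rule and the "at most one incoming edge per specification" rule but does not check acyclicity or specification types.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class FlowGraph:
    tasks: Set[str] = field(default_factory=set)               # T(G)
    specs: Set[str] = field(default_factory=set)               # S(G)
    edges: Set[Tuple[str, str]] = field(default_factory=set)   # E(G)

    def add_edge(self, src: str, dst: str) -> None:
        task_to_spec = src in self.tasks and dst in self.specs
        spec_to_task = src in self.specs and dst in self.tasks
        if not (task_to_spec or spec_to_task):
            raise ValueError("edges must join a task and a specification")
        if task_to_spec and self.inputs_of(dst):
            raise ValueError("a specification has at most one incoming edge")
        self.edges.add((src, dst))

    def inputs_of(self, n: str) -> Set[str]:
        """In(N) = { M | (M, N) in E }"""
        return {m for (m, x) in self.edges if x == n}

    def outputs_of(self, n: str) -> Set[str]:
        """Out(N) = { M | (N, M) in E }"""
        return {m for (x, m) in self.edges if x == n}

    def input_specs(self) -> Set[str]:
        """I(G) = { N in S(G) | In(N) is empty }"""
        return {n for n in self.specs if not self.inputs_of(n)}


# Hypothetical fragment of a flow: state diagram -> encoding -> logic synthesis.
g = FlowGraph(tasks={"state encoding", "logic synthesis"},
              specs={"state diagram", "encoded machine", "netlist"})
g.add_edge("state diagram", "state encoding")
g.add_edge("state encoding", "encoded machine")
g.add_edge("encoded machine", "logic synthesis")
g.add_edge("logic synthesis", "netlist")
```

Here `g.input_specs()` yields only the state diagram, since every other specification is produced by some task.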

Process Grammars The designer specifies the overall objectives with the initial graph that lists available input specifications, desired output specifications, and the logical tasks to be performed. By means of process grammars, logical task nodes are replaced by the flows of detailed subtasks and intermediate specifications. The output specification nodes are also replaced by nodes that may have a child specification type. The productions in a graph grammar permit the replacement of one subgraph by another. A production in a design process grammar can be expressed formally as a tuple P = (G_LHS, G_RHS, σ_in, σ_out), where G_LHS and G_RHS are process flow graphs for the left side and the right side of the production, respectively, such that (i) G_LHS has one logical task node representing the task to be replaced, (ii) σ_in is a mapping from the input specifications I(G_LHS) to I(G_RHS), indicating the relationship between two input specifications (each input specification of I(G_RHS) is a subtype of the corresponding specification of I(G_LHS)), and (iii) σ_out is a mapping from the output specifications of G_LHS to the output specifications of G_RHS, indicating the correspondence between them (each output specification must be mapped to a specification with the same type or a subtype). Figure 73.5 illustrates productions for two tasks, simulate and FPGA partitioning. The mappings are indicated by the numbers beside the specification nodes. Alternative productions may be necessary to handle different input specification types (as in Fig. 73.5a), or because they represent different processes, separated by the word “or”, for performing the abstract task (as in Fig. 73.5b).

Let A be the logical task node in G_LHS, and A′ be a logical task node in the original process flow graph G such that A has the same task label as A′. The production rule P can be applied to A′, which means that A′ can be replaced with G_RHS, only if the input and output specifications of A′ match the input and output specifications of G_LHS, respectively. If there are several production rules with the same left-side flow graph, there are alternative production rules for the logical task. Formally, the production matches A′ if
(i) A′ has the same task label as A.
(ii) There is a mapping ρ_in from In(A) to In(A′), indicating how the inputs should be mapped. For all nodes N ∈ In(A), ρ_in(N) should have the same type as N or a subtype.
(iii) There is a mapping ρ_out from Out(A′) to Out(A), indicating how the outputs should be mapped. For all nodes N ∈ Out(A′), ρ_out(N) should have the same type as N or a subtype.

The mappings are used to determine how edges that connected the replaced subgraph to the remainder should be redirected to nodes in the new subgraph. Once a match is found in graph G, the production is applied as follows:
(i) Insert G_RHS − I(G_RHS) into G. The inputs of the replaced tasks are not replaced.
(ii) For every N in I(G_RHS) and edge (N, M) in G_RHS, add edge (ρ_in(σ_in(N)), M) to G. This connects the inputs of A′ to the new task nodes that will use them.
(iii) For every N in Out(A′) and edge (N, M) in G, replace edge (N, M) with edge (σ_out(ρ_out(N)), M). This connects the new output nodes to the tasks that will use them.
(iv) Remove A′ and Out(A′) from G, along with all edges incident on them.

Figure 73.6 illustrates a derivation in which the FPGA partitioning task is planned, using a production from Fig. 73.5b. The process grammar provides a mechanism for specifying alternative methods for a logical task. A high-level flow graph can then be decomposed into detailed flow graphs by applying production rules to a logical task. A production rule is a substitution that permits the replacement of a logical task node with a flow graph that represents a possible way of performing the task. The concept of applying productions to logical tasks is somewhat analogous to the idea of productions in traditional (i.e., non-graph) grammars. In this sense, logical tasks correspond to logical (nonterminal) symbols in a grammar, and atomic tasks correspond to terminal symbols.
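The four application steps can be traced on a small example. The sketch below is an illustration under several assumptions, not the authors' implementation: graphs are reduced to sets of directed edges, node names in the right side are assumed not to collide with names already in G, matching and subtype checks are assumed to have been done beforehand, and all four mappings are plain dictionaries. In particular, the dictionary for σ_in is keyed by right-side input nodes so that the composition ρ_in(σ_in(N)) in step (ii) can be evaluated as written; that orientation is a choice made for this sketch. The function name and all node names are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def apply_production(g_edges: Set[Edge], a_prime: str,
                     rhs_edges: Set[Edge], rhs_inputs: Set[str],
                     sigma_in: Dict[str, str], rho_in: Dict[str, str],
                     sigma_out: Dict[str, str], rho_out: Dict[str, str]) -> Set[Edge]:
    """Return the edges of G after replacing logical task a_prime by the right side."""
    out_a = {m for (n, m) in g_edges if n == a_prime}            # Out(A')
    new_edges = set(g_edges)
    # (i) insert the right side without its input specifications
    new_edges |= {(n, m) for (n, m) in rhs_edges if n not in rhs_inputs}
    # (ii) connect the existing inputs of A' to the new tasks that use them
    new_edges |= {(rho_in[sigma_in[n]], m)
                  for (n, m) in rhs_edges if n in rhs_inputs}
    # (iii) redirect consumers of the old outputs to the new output nodes
    for (n, m) in list(new_edges):
        if n in out_a:
            new_edges.discard((n, m))
            new_edges.add((sigma_out[rho_out[n]], m))
    # (iv) remove A', Out(A'), and every edge incident on them
    removed = out_a | {a_prime}
    return {(n, m) for (n, m) in new_edges
            if n not in removed and m not in removed}


# Hypothetical example: expand "chip synthesis" into three subtasks.
g = {("state diagram", "chip synthesis"), ("chip synthesis", "chip"), ("chip", "test")}
rhs = {("rhs in", "state encoding"), ("state encoding", "encoded machine"),
       ("encoded machine", "logic synthesis"), ("logic synthesis", "netlist"),
       ("netlist", "physical synthesis"), ("physical synthesis", "layout")}
result = apply_production(
    g, "chip synthesis", rhs, rhs_inputs={"rhs in"},
    sigma_in={"rhs in": "lhs in"}, rho_in={"lhs in": "state diagram"},
    sigma_out={"lhs out": "layout"}, rho_out={"chip": "lhs out"})
```

After the call, the state diagram feeds the new state encoding task directly, and the downstream `test` task consumes the new `layout` node instead of the removed `chip` node.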

73.5 Execution Environment of the Framework Figure 73.7 illustrates the architecture of our proposed system, which applies the theory developed in the previous section. Decisions to select or invoke tools are split between the designers and a set of manager programs: manager programs make the routine decisions, and the designers make decisions that require higher-level thinking. A program called Cockpit coordinates the interaction among manager programs and the designers. Tool sets and methodology preferences will differ among sites and over time. Therefore, our assumption is that each unit designates a person (or group) to act as system integrator, who writes and maintains the tool-dependent code in the system. We provide the tool-independent code and templates to simplify the task of writing tool-dependent code.


FIGURE 73.6 A sample graph derivation. Nodes in the outlined region, left, are replaced with nodes in the outlined region, right, according to production partition 1 in Fig. 73.5.

FIGURE 73.7 The proposed system based on Cockpit.

The Cockpit Program The designer interacts with Cockpit, a program which keeps track of the current process flow graph and informs the designer of possible actions such as productions that could be applied or tasks that could be executed. Cockpit contains no task-specific knowledge; its information about the design process comes


entirely from a file of graph productions. When new tools are acquired or new design processes are developed, the system integrator modifies this file by adding, deleting, and editing productions. To assist the designer in choosing an appropriate action, Cockpit interacts with several manager programs which encapsulate design knowledge. There are two types of manager programs: task managers and production managers. Task managers invoke tools and determine which productions to execute for logical task nodes. Production managers provide ratings for the productions and schedule the execution of tasks on the right-hand side of the production. Managers communicate with each other using messages issued by Cockpit. Our prototype system operates as follows. Cockpit reads the initial process flow graph from an input file generated by using a text editor. Cockpit then iteratively identifies when productions can be applied to logical task nodes and requests that the production managers assign ratings to indicate how appropriate the productions are for those tasks. The process flow graph and the ratings of possible production applications are displayed for the designer, who directs Cockpit through a graphical user interface to apply a production or execute a task at any time. When asked to execute a task, Cockpit sends a message to a task manager. For an atomic task node, the task manager simply invokes the corresponding tool. For a logical task, the task manager must choose one or more productions, as identified by Cockpit. Cockpit applies the production and requests that the production manager execute it.
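The step in which Cockpit identifies applicable productions and collects their ratings might look like the sketch below. It is a simplified illustration: productions are matched by task label only (the input and output type checks of the formal matching rule are omitted), the `rate` callback stands in for a message exchange with a production manager, and every name and rating value here is hypothetical.

```python
from typing import Callable, Dict, List, NamedTuple


class Production(NamedTuple):
    name: str
    task_label: str        # label of the logical task on the left side


class Candidate(NamedTuple):
    node: str              # logical task node in the current flow graph
    production: str
    rating: float


def rate_candidates(logical_nodes: Dict[str, str],
                    productions: List[Production],
                    rate: Callable[[Production, str], float]) -> List[Candidate]:
    """List every (node, production) pair whose labels match, best rating first."""
    candidates = [Candidate(node, p.name, rate(p, node))
                  for node, label in logical_nodes.items()
                  for p in productions
                  if p.task_label == label]
    return sorted(candidates, key=lambda c: c.rating, reverse=True)


# Hypothetical production file and ratings for a quick-prototype project.
productions = [
    Production("standard cell synthesis", "physical synthesis"),
    Production("full custom synthesis", "physical synthesis"),
    Production("FPGA synthesis", "physical synthesis"),
    Production("one-hot encoding", "state encoding"),
]
ratings = {"standard cell synthesis": 0.3, "full custom synthesis": 0.1,
           "FPGA synthesis": 0.9, "one-hot encoding": 0.7}
shown = rate_candidates({"n1": "physical synthesis"}, productions,
                        lambda p, node: ratings[p.name])
```

The resulting list is what a user interface could display next to the flow graph; the designer, or a task manager, remains free to pick any entry rather than the top-rated one.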

Manager Programs Manager programs must be maintained by system integrators to reflect site-specific information, such as company design practices and tool installation methods. Typically, a manager program has its own thread. A Cockpit may have several manager programs, and therefore multiple threads. We define a communication protocol between Cockpit and manager programs and provide templates for manager programs. The manager programs provide five operations: pre-evaluation, tool invocation, logical task execution, production execution, and query handling. Each operation described below corresponds to a C++ or Java function in the templates, which system integrators can customize as needed. Pre-evaluation: Production managers assign ratings to help designers and task managers select the most appropriate productions. The rating indicates the likelihood of success from applying this production. The strategies used by the system integrator provide most of the code to handle the rating. In some cases, it may be sufficient to assign ratings statically, based on the success of past productions. These static ratings can be adjusted downward when the production has already been tried unsuccessfully on this task node (which could be determined using the query mechanism). Alternatively, the ratings may be an arbitrarily complex function of parameters obtained through the query mechanism or by examining the input files. Sophisticated manager programs may continuously gather and analyze process metrics that indicate those conditions leading to success, adjusting ratings accordingly. Tool Invocation: Atomic task managers must invoke the corresponding software tool when requested by Cockpit, then determine whether the tool completed successfully. In many cases, information may be predetermined and entered in a standard template, which uses the tool’s result status to determine success. 
In other cases, the manager must determine tool parameters using task-specific knowledge or determine success by checking task-specific constraints. Either situation would require further customization of the manager program. Logical Task Execution: Logical task managers for logical tasks must select productions to execute the logical task. Cockpit informs the task manager of available productions and their ratings. The task manager can either direct Cockpit to apply and execute one or more productions, or it can decide that none of the productions is worthwhile and report failure. The task manager can also request that the productions be reevaluated when new information has been generated that might influence the ratings, such as a production’s failure. If a production succeeds, the task manager checks any constraints; if they are satisfied, it reports success.


Production Execution: Production managers execute each task on the right-hand side of the production at the appropriate time and possibly check constraints. If one of the tasks fails or a constraint is violated, backtracking can occur. The production manager can use task-specific knowledge to determine which tasks to repeat. If the production manager cannot handle the failure itself, it reports the failure to Cockpit, and the managers of higher level tasks and productions attempt to handle it. Query Handling: Both production and task managers participate in the query mechanism. A production manager can send queries to its parent (the task manager for the logical task being performed) or to one of its children (a task manager of a subtask). Similarly, a task manager can send a query to its parent production manager or to one of its children (a production manager of the production it executed). The manager templates define C functions, which take string arguments, for sending these queries. System integrators call these functions but do not need to modify them. The manager templates also contain functions which are modified by system integrators for responding to queries. Common queries can be handled by template code; for example, a production manager can frequently ask its parent whether the production has already been attempted for that task and whether it succeeded. The manager template handles any unrecognized query from a child manager by forwarding it to the parent manager. Code must be added to handle queries for task-specific information such as the estimated circuit area or latency.
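Two of the operations above, query forwarding and rating adjustment after a failed attempt, can be sketched together. This is not the template code (which the text says is C, C++, or Java); it is a Python illustration in which the class names, the query strings, and the 0.5 penalty factor are all assumptions. Queries are plain strings, echoing the string-argument functions mentioned above, and a manager that does not recognize a query passes it to its parent.

```python
from typing import Dict, Optional


class Manager:
    """Base behavior: answer known queries, forward the rest to the parent."""

    def __init__(self, parent: Optional["Manager"] = None,
                 answers: Optional[Dict[str, str]] = None) -> None:
        self.parent = parent
        self.answers = dict(answers or {})   # queries this manager recognizes

    def handle_query(self, query: str) -> Optional[str]:
        if query in self.answers:
            return self.answers[query]
        if self.parent is not None:
            return self.parent.handle_query(query)   # forward unrecognized query
        return None                                  # nobody could answer


class ProductionManager(Manager):
    def __init__(self, name: str, static_rating: float, parent: Manager) -> None:
        super().__init__(parent)
        self.name = name
        self.static_rating = static_rating

    def pre_evaluate(self) -> float:
        """Start from the static rating; lower it if this production already failed."""
        rating = self.static_rating
        if self.handle_query("failed:" + self.name) == "yes":
            rating *= 0.5            # penalty factor is an arbitrary assumption
        return rating


# Hypothetical hierarchy: a top-level manager knows the target technology,
# and the state encoding task manager remembers one failed production.
chip_manager = Manager(answers={"technology?": "FPGA"})
encoding_task = Manager(parent=chip_manager, answers={"failed:min_bits": "yes"})
one_hot = ProductionManager("one_hot", 0.8, encoding_task)
min_bits = ProductionManager("min_bits", 0.6, encoding_task)
```

With this setup, `one_hot.handle_query("technology?")` is not recognized by the encoding task manager and is answered two levels up, while `min_bits.pre_evaluate()` comes back lower than its static rating because the parent reports an earlier failure.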

Execution Example

Now we describe a synthesis scenario that illustrates our prototype architecture in use. In this scenario, the objective is to design a controller from a state diagram, which will ultimately be done following the process flow graph in Fig. 73.4. There are performance and cost constraints on the design, and a requirement to produce a prototype quickly. The productions used are intended to be representative but not unique. For simplicity, we assume that a single designer is performing the design and that there is, therefore, only one Cockpit.

The start graph for this scenario contains only the primary task, chip synthesis, and specification nodes for its inputs and outputs (like the graph on the left in Fig. 73.8). Cockpit tells us that the production of Fig. 73.8 can be applied. We ask Cockpit to apply it. The chip synthesis node is then replaced by nodes for state encoding, logic synthesis, and physical synthesis, along with intermediate specification nodes.

FIGURE 73.8 Productions for chip synthesis.

Next, we want to plan the physical synthesis task. Tasks can be planned in an order other than the one in which they are to be performed. Cockpit determines that any of the productions shown in Fig. 73.9 may be applied, then queries each production’s task manager program asking it to rate the production’s appropriateness in the current situation. Based on the need to implement the design quickly, the productions for standard cell synthesis and full custom synthesis are rated low while the production for FPGA synthesis is rated high. Ratings are displayed to help us decide.

When we plan the state encoding task, Cockpit finds two productions: one to use the tool Min-bits Encoder and the other to use the tool One-hot Encoder. One-hot Encoder works well for FPGAs, while Min-bits Encoder works better for other technologies. To assign proper ratings to these productions, their production managers must find out which implementation technology will be used.
First, they send a query to their parent manager, the state encoding task manager. This manager forwards the message to its parent, the chip synthesis production manager. In turn, this manager forwards the query to the physical synthesis task manager for an answer. All messages are routed by Cockpit, which is aware of the entire task hierarchy. This sequence of actions is illustrated in Fig. 73.10.

FIGURE 73.9 Productions for physical synthesis.

FIGURE 73.10 Sequence of actions for query handling.

After further planning and tool invocations, a netlist is produced for our controller. The next step is the FPGA synthesis task. We apply the production in Fig. 73.11 and proceed to the FPGA partitioning task. The knowledge to automate this task has already been encoded into the requisite manager programs, so we direct Cockpit to execute the FPGA partitioning task. It finds the two productions illustrated in Fig. 73.5b and requests their ratings. Next, Cockpit sends an execute message, along with the ratings, to the FPGA partitioning task manager. This manager’s strategy is to always execute the highest-rated production, which in this case is production Partition 1. (Other task managers might have asked that both productions be executed or, if neither were promising, immediately reported failure.) This sequence of actions is shown in Fig. 73.12. Because the Partition 1 manager uses an as-soon-as-possible task scheduling strategy, it asks Cockpit to execute XNFMAP immediately. The other subtask, MAP2LCA, is executed when XNFMAP completes successfully. After both tasks complete successfully, Cockpit reports success to the FPGA partitioning task manager. This action sequence is illustrated in Fig. 73.13.

FIGURE 73.11 Production for field-programmable gate array synthesis.

FIGURE 73.12 Sequence of actions during automatic task execution.

FIGURE 73.13 Sequence of actions during automatic production execution.

Scheduling

In this subsection, we describe and discuss auto-mode scheduling in detail, including the implementation of the linear scheduler. The ability to search through the configuration space of a design process for a design configuration that meets user-specified constraints is important. For example, assume that a user has defined a process for designing a digital filter with several different alternative ways of performing logical tasks such as “FPGA Partitioning” and “Select the Filter Architecture.” One constraint that an engineer may wish to place on the design might be: “Find a process configuration that produces a filter that has maximum delay at most 10 nanoseconds.” Given such a constraint, the framework must search through the configuration space of the filter design process, looking for a sequence of valid atomic tasks that produces a filter with “maximum delay at most 10 nanoseconds.” We call the framework component that performs this search a scheduler.

There are, of course, many different ways of searching through the design process configuration space. In general, a successful scheduler will provide the following functionality:

• Completeness (Identification of Successful Configurations): Given a particular configuration of a process, a correct scheduler will be able to conclusively determine whether the configuration meets user-specified constraints. The scheduler must guarantee that before reporting failure, all possible process configurations have been considered, and if there is a successful configuration, the algorithm must find it.
• Reasonable Performance: The configuration space of a process grows exponentially (in the number of tasks). Ideally, a scheduler will be able to search the configuration space using an algorithm that requires less than exponential time.

The Linear Scheduling Algorithm is very simple yet complete and meets most of the above criteria. In this algorithm, each process flow graph (corresponding to an initial process flow graph or a production) has its own scheduler. Each scheduler is a separate thread with a Task Schedule List (TSL) representing the order in which tasks are to be executed. The tasks in a scheduler’s TSL are called its children tasks. A scheduler also has a task pointer to indicate the child task being executed in the TSL. The algorithm is recursive: each time a production is instantiated for a given task, a new scheduler is created to manage the flow graph representing the selected production. A linear scheduler creates a TSL by performing a topological sort of the initial process flow graph and executes its children tasks in order.
If a child task is atomic, the scheduler executes the task without creating a new scheduler; otherwise, it selects a new alternative, creates a new child scheduler to manage the selected alternative, and waits for a signal from the child scheduler indicating success or failure. When a child task execution is successful, the scheduler increments the task pointer in its TSL and proceeds to execute the next task. If a scheduler reaches the end of its TSL, it signals success to its own parent and then waits for a signal from the parent telling it either to terminate (the whole flow succeeded) or to roll back (try another alternative in search of a new configuration).


If a child task fails, the scheduler tries another alternative for the task. If there are no alternatives left, it rolls back (by decrementing the task pointer) until it finds a logical task that has another alternative to try. If a scheduler rolls back to the beginning of the TSL and cannot find an alternative, then its flow has failed. In this case, it signals a failure to its parent and terminates itself.

In the linear scheduling algorithm, each scheduler can send or receive any of five signals: PROCEED, ROLLBACK, CHILD-SUCCESS, CHILD-FAILURE, and DIE. These signals comprise scheduler-to-scheduler communication, including self-signaling. Each of the five signals is discussed below.

• PROCEED: This signal tells the scheduler to execute the next task in the TSL. It can be self-sent or received from a parent scheduler. For example, a scheduler increments its task pointer and sends itself a PROCEED signal when a child task succeeds, whereas it sends a PROCEED signal to a child scheduler to start that child’s execution.
• ROLLBACK: This is signaled when a task execution has failed. This signal may be self-sent or received from a parent scheduler. A scheduler self-signals ROLLBACK whenever a child task fails. A rollback can result in either trying the next alternative of a logical task, or decrementing the task pointer and trying the previous task in the TSL. If a rollback decrements the task pointer to a child task node that has already signaled success, the parent scheduler sends a ROLLBACK signal to that child task’s scheduler.
• CHILD-SUCCESS: A child scheduler sends a CHILD-SUCCESS signal to its parent scheduler if it has successfully completed the execution of all of the tasks in its TSL. After sending the CHILD-SUCCESS signal, the scheduler remains active, listening for possible ROLLBACK signals from the parent. After receiving a CHILD-SUCCESS signal, a parent scheduler self-sends a PROCEED signal.
• CHILD-FAILURE: A CHILD-FAILURE signal is sent from a child scheduler to its parent in the event that the child’s managed flow fails. After sending a CHILD-FAILURE signal, the child scheduler terminates. Upon receiving a CHILD-FAILURE signal, the parent scheduler self-sends a ROLLBACK signal.
• DIE: This signal may be either self-sent, or sent from parent schedulers to their children schedulers.
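The search order of the linear scheduler can be illustrated with a compact sketch. It is a deliberate simplification and not the IMEDA implementation: instead of one thread per scheduler exchanging signals, it uses recursive Python generators, so that asking a child generator for its next result plays the role of ROLLBACK and exhausting it plays the role of CHILD-FAILURE. The class names `Atomic` and `Logical`, the function `schedule`, and the task names in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple, Union

Trace = Tuple[str, ...]  # names of the atomic tasks executed so far

@dataclass
class Atomic:
    name: str
    run: Callable[[Trace], bool]  # returns True if the task succeeds

@dataclass
class Logical:
    name: str
    alternatives: List[List["Task"]]  # each alternative is the TSL of one production

Task = Union[Atomic, Logical]

def schedule(tsl: List[Task], trace: Trace = ()) -> Iterator[Trace]:
    """Yield the trace of every successful configuration of `tsl`, in order."""
    if not tsl:                       # end of the TSL reached: CHILD-SUCCESS
        yield trace
        return
    head, rest = tsl[0], tsl[1:]
    if isinstance(head, Atomic):
        if head.run(trace):           # success: PROCEED to the next task
            yield from schedule(rest, trace + (head.name,))
        return                        # failure: control rolls back to the caller
    for alternative in head.alternatives:                 # try each production in turn
        for child_trace in schedule(alternative, trace):  # the child scheduler
            yield from schedule(rest, child_trace)

always = lambda trace: True
select = Logical("select architecture",
                 [[Atomic("arch_A", always)], [Atomic("arch_B", always)]])
partition = Logical("FPGA partitioning",
                    [[Atomic("partition_1", always)], [Atomic("partition_2", always)]])
# Stand-in for a design constraint: only one combination meets it.
check = Atomic("check_delay", lambda trace: trace == ("arch_B", "partition_1"))

first = next(schedule([select, partition, check]), None)
```

In this example the constraint check fails for both partitioning alternatives under `arch_A`, so the search rolls back to the architecture selection and succeeds with `arch_B`. Like the algorithm it imitates, the sketch is complete but may visit every configuration in the worst case.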

73.6 Implementation

This section presents a high-level description of the major components of IMEDA and their organization and functionality. It also explains in detail the key concepts involved in the architecture of the Process Management Framework, including external tool integration, the tool invocation process, the Java File System, and state properties.

The System Cockpit

The System Cockpit, as its name suggests, is where nearly all user interaction with the framework takes place. It is here that users create, modify, save, load, and simulate process flow graphs representing design processes. This system component has been implemented as a Java applet. As such, it is possible to run the cockpit in any Java-enabled Web browser such as Netscape’s Navigator or Microsoft’s Internet Explorer. It is also possible to run the cockpit in some Java-enabled operating systems such as IBM’s OS/2. Each cockpit has the following components:

• Root Flow. Every cockpit has a Root Flow. The Root Flow is the flow currently being edited in the Cockpit’s Flow Edit Panel. Notice that the Root Flow may change as a result of applying a production to a flow graph, in which case the Root Flow becomes a derivation of itself.
• Flow Edit Panel. The Flow Edit Panel is the interactive Graphical User Interface for creating and editing process flow graphs. This component also acts as a display for animating process simulations performed by various schedulers such as the manual or auto-mode linear scheduler.
• Class Directory. The Cockpit has two Class Directories: the Task Directory and the Specification Directory. These directories provide the “browser” capabilities of the framework, allowing users to create reusable general-to-specific hierarchies of task classes. Class Directories are implemented using a tree structure.
• Production Database. The Production Database acts as a warehouse for logical task productions. These productions document the alternative methods available for completing a logical task. Each Production Database has a list of Productions. The Production Database is implemented as a tree-like structure, with Productions being on the root trunk, and Alternatives being leaves.
• Browser. Browsers provide the tree-like graphical user interface for users to edit both Class Directories and Databases. There are three Browsers: a Database Browser for accessing the Production Database, a Directory Browser for accessing the Task Directory, and a Directory Browser for accessing the Specification Directory. Both Database Browsers and Directory Browsers inherit properties from the object Browser, and offer the user nearly identical editing environments and visual representations. This deliberate consolidation of Browser interfaces allowed us to provide designers with an interface that was consistent and easier to learn.
• Menu. A user typically accesses most of the system’s key functions from the cockpit’s Menu.
• Scheduler. The cockpit has one or more schedulers. Schedulers are responsible for searching the configuration space of a design process for configurations that meet user-specified design constraints. The Scheduler animates its process simulations by displaying them in the Flow Edit Panel of the Cockpit.

External Tools

External Tools are the concrete entities to which atomic tasks from a production flow are bound. When a flow task object is expanded in the Cockpit Applet (during process simulation), the corresponding external tool is invoked. The external tool uses a series of inputs and produces a series of outputs (contained in files). These inputs and outputs are similarly bound to specifications in a production flow. Outputs from one tool are typically used as inputs for another. IMEDA can handle the transfer of input and output files between remote sites. The site proxy servers, in conjunction with a remote file server (also running at each site), automatically handle the transfer of files from one system to another. External tools may be implemented using any language, and on any platform that has the capability of running a site server. While performing benchmark tests of IMEDA, we used external tools written in C, Fortran, Perl, and csh (Unix shell scripts), as well as Java applications and Mathematica scripts.

External Tool Integration

One of the primary functions of IMEDA is the integration of user-defined external tools into an abstract process flow. IMEDA then uses these tools both in simulating the process flow to find a flow configuration that meets specific constraints, and in managing selected flow configurations during actual design execution. There are two steps to integrating tools with a process flow defined in IMEDA: association and execution. Association involves “linking” or “binding” an abstract flow item (e.g., an atomic task) to an external tool. Execution describes the various steps that IMEDA takes to actually invoke the external tool and process the results.

Binding Tools

External tools may be bound to three types of flow objects: Atomic Tasks, Selectors, and Multiple Version Selectors. Binding an external tool to a flow object is straightforward: it only requires defining certain properties in the flow object.
The following properties must be defined in an object that is to be bound to an external tool:

• SITE. Because IMEDA can execute tools on remote systems, it is necessary to specify the site where the tool is located. Typically, a default SITE will be specified in the system defaults, making it unnecessary to define the SITE property unless the default is to be overridden. Note that the actual site ID specified by the SITE property must refer to a site that is running a Site Proxy Server listening on that ID. See the “Executing External Tools” section below for more details.
• CMDLINE. The CMDLINE property specifies the command to be executed at the specified remote site. The CMDLINE property should include any switches or arguments that will always be sent to the external tool. Basically, the CMDLINE argument should be in the same format that would be used if the command were executed from a shell/DOS prompt.
• WORKDIR. The working directory of the tool is specified by the WORKDIR property. This is the directory in which IMEDA will actually execute the external tool, create temporary files, etc. This property is also quite often defined in the global system defaults, and thus may not necessarily have to be defined for every tool.
• WRAPPERPATH. The JDK 1.0.2 does not allow Java Applications to execute a tool in an arbitrary directory. To handle remote tool execution, a wrapper is provided. It is a “go-between” program that simply changes directories and then executes the external tool. This program can be as simple as a DOS/NT batch file, a shell script, or a Perl program. The external tool is wrapped in this simple script, and executed. Since IMEDA can execute tools at remote and heterogeneous sites, it was very difficult to create a single wrapper that would work on all platforms (WIN32, Unix, etc.). Therefore, the wrapper program may be specified for each tool, defined as a global default, or a combination of the two.

Once the properties above have been defined for a flow object, the object is said to be “bound” to an external tool. If no site, directory, or filename is specified for the outputs of the flow object, IMEDA automatically creates unique file names, and stores the files in the working directory of the tool on the site where the tool was run. If a tool uses as inputs data items that are not specified by any other task, then the data items must be bound to static files on some site.
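The binding rules above amount to merging system defaults with per-object properties and checking that all four properties end up defined. The following sketch illustrates that check; the function name `bind` and every property value shown (site ID, paths, command line) are hypothetical and are not taken from IMEDA.

```python
from typing import Dict

REQUIRED = ("SITE", "CMDLINE", "WORKDIR", "WRAPPERPATH")

def bind(object_props: Dict[str, str], defaults: Dict[str, str]) -> Dict[str, str]:
    """Merge system defaults with an object's own properties and check the binding.

    Properties defined on the flow object override the system defaults.
    """
    merged = {**defaults, **object_props}
    missing = [key for key in REQUIRED if key not in merged]
    if missing:
        raise ValueError("flow object is not bound, missing: " + ", ".join(missing))
    return {key: merged[key] for key in REQUIRED}

# Hypothetical global defaults and one atomic task that only names its command.
system_defaults = {"SITE": "site1", "WORKDIR": "/tmp/work", "WRAPPERPATH": "/opt/bin/wrapper.sh"}
bound = bind({"CMDLINE": "sort -n"}, system_defaults)
```

With the defaults in place, a task needs to define only CMDLINE; without them, the same call raises an error naming the undefined properties.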
Executing External Tools

Once flow objects have been bound to the appropriate external tools, IMEDA can be used to perform process simulation or process management. IMEDA has several “layers” that lie between the Cockpit (a Java applet) and the external tool that is bound to a flow being viewed by a user in the Cockpit. The IMEDA components involved in tool invocation are described below.

• Tool Proxy. The tool proxy component acts as a liaison between flow objects defined in Cockpits and the Site Proxy Server. All communication is done transparently through the communications server utilizing TCP/IP sockets. The tool proxy “packages” information from Cockpit objects (atomic tasks, selectors, etc.) into string messages that the Proxy Server will recognize. It also listens for and processes messages from the Proxy Server (through the communications server) and relays the information back to the Cockpit object that instantiated the tool proxy originally.
• Communications Server. Due to various security restrictions in the 1.0.2 version of Sun Microsystems’ Java Development Kit (JDK), it is impossible to create TCP/IP socket connections between a Java applet and any IP address other than the address from which the applet was loaded. Therefore, it was necessary to create a “relay server” in order to allow cockpit applets to communicate with remote site proxy servers. The sole purpose of the communications server is to receive messages from one source and then to rebroadcast them to all parties that are connected and listening on the same channel.
• Site Proxy Server. Site Proxy Servers are responsible for receiving and processing invocation requests from tool proxies. When an invocation request is received, the site proxy server checks to see that the request is formatted correctly, starts a tool monitor to manage the external tool invocation, and returns the exit status of the external tool after it has completed.
• Tool Monitors. When the site proxy server receives an invocation request and invokes an external tool, it may take a significant amount of time for the tool to complete. If the proxy server had to delay the handling of other requests while waiting for each external tool to complete, IMEDA would become very inefficient. For this reason, the proxy server spawns a tool monitor for each external tool that is to be executed. The tool monitor runs as a separate thread, waiting on the tool, storing its stdout and stderr, moving any input or output files to their appropriate site locations, and notifying the calling site proxy server when the tool has completed. This allows the site proxy server to continue receiving and processing invocation requests in a timely manner.
• Tool Wrapper. The tool wrapper changes directories into the specified WORKDIR, and then executes the CMDLINE.
• External Tool. External tools are the actual executable programs that run during a tool invocation. There is very little restriction on the nature of the external tools.
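The division of labor between the site proxy server, the tool monitor, and the wrapper can be sketched as follows. This is an illustration in Python rather than the Java of the prototype; the function `monitor_tool` and its callback are hypothetical. Note one difference from the text: Python's `subprocess.run` accepts a working directory directly, so the sketch needs no separate wrapper script, whereas IMEDA needed one because of the JDK 1.0.2 limitation described above.

```python
import subprocess
import sys
import tempfile
import threading
from typing import Callable, List, Tuple

Result = Tuple[int, str, str]  # exit status, stdout, stderr

def monitor_tool(cmdline: List[str], workdir: str,
                 on_done: Callable[[Result], None]) -> threading.Thread:
    """Run one tool in `workdir` on its own thread and report when it finishes."""
    def run() -> None:
        # cwd=workdir plays the role of the wrapper: change directory, then execute.
        proc = subprocess.run(cmdline, cwd=workdir, capture_output=True, text=True)
        on_done((proc.returncode, proc.stdout, proc.stderr))
    thread = threading.Thread(target=run)
    thread.start()  # the caller is free to handle other requests meanwhile
    return thread

results: List[Result] = []
with tempfile.TemporaryDirectory() as workdir:
    # The "external tool" here is just the Python interpreter printing one line.
    monitor = monitor_tool([sys.executable, "-c", "print('done')"], workdir, results.append)
    monitor.join()
```

Because each tool runs on its own thread, a long-running tool does not block the handling of further invocation requests, which is the reason the text gives for introducing tool monitors.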

Communications Model

The Communications Model is in some respects the most complex portion of IMEDA. This is where truly distributed communications come into play: one system component communicates with another via network messages rather than function calls. The heart of the communications model is the Communications Server. This server is implemented as a broadcast server. All incoming messages to the server are simply broadcast to all other connected parties. FlowObjects communicate with the Communications Server via ToolProxys. A ToolProxy allows a FlowObject to abstract all network communications and focus on the functionality of invoking tasks. A ToolProxy takes care of constructing a network message to invoke an external tool. That message is then handed to a Communications Client, which takes care of the low-level socket-based communication complexities. The Communications Client sends the message to the Communications Server, which broadcasts the message to all connected clients. The client for which the message was intended (typically a Site Proxy Server) decodes the message and, depending on its type, creates either a ToolMonitor (for an Invocation Message) or an External Redraw Monitor (for a Redraw Request). The Site Proxy Server creates these monitors to track the execution of external programs, rather than monitoring them itself. In this way, the Proxy Server can focus on its primary job – receiving and decoding network messages. When the Monitors invoke an external tool, they must do so within a Wrapper. Once the Monitors have observed the termination of an external program, they gather any output on stdout or stderr and return these along with the exit code of the program to the Site Proxy Server. The results then travel back along the same path: from the Site Proxy Server to the Communications Server, then to the Communications Client, then to the ToolProxy, and finally to the original calling FlowObject.
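The broadcast pattern can be illustrated with an in-memory sketch in which function calls stand in for socket messages. Everything here is hypothetical: the class `BroadcastServer`, the message fields (`to`, `from`, `type`, `exit`, `cmdline`), and the client IDs are invented for illustration and do not reflect IMEDA's actual message format.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]

class BroadcastServer:
    """In-memory stand-in for the relay: each message goes to all other clients."""

    def __init__(self) -> None:
        self.clients: Dict[str, Callable[[Message], None]] = {}

    def connect(self, client_id: str, on_message: Callable[[Message], None]) -> None:
        self.clients[client_id] = on_message

    def send(self, sender_id: str, message: Message) -> None:
        for client_id, deliver in list(self.clients.items()):
            if client_id != sender_id:
                deliver(message)

server = BroadcastServer()
cockpit_inbox: List[Message] = []
ignored: List[str] = []  # IDs of sites that saw a message meant for someone else

def cockpit(message: Message) -> None:
    if message["to"] == "cockpit":
        cockpit_inbox.append(message)

def make_site(site_id: str) -> Callable[[Message], None]:
    def site(message: Message) -> None:
        if message["to"] != site_id:
            ignored.append(site_id)  # not addressed to this site: drop it
            return
        # A real site proxy would start a tool monitor here; we reply at once.
        server.send(site_id, {"to": message["from"], "from": site_id,
                              "type": "RESULT", "exit": 0})
    return site

server.connect("cockpit", cockpit)
server.connect("site1", make_site("site1"))
server.connect("site2", make_site("site2"))
server.send("cockpit", {"to": "site1", "from": "cockpit",
                        "type": "INVOKE", "cmdline": "sort"})
```

The sketch also shows the cost of broadcasting: `site2` receives and discards both the invocation and the result, even though neither was addressed to it.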

User Interface

The Cockpit provides both the user interface and core functionality of IMEDA. While multiple users may use different instances of the Cockpit simultaneously, there is currently no provision for direct collaboration between multiple users. Developing efficient means of real-time interaction between IMEDA users is one of the major thrusts of the next development cycle. Currently the GUI of the cockpit provides the following functionality:

• Flow editing. Users may create and edit process flows using the flow editor module of the Cockpit. The flow editor provides the user with a simple graphical interface that allows the use of a template of tools for “drawing” a flow. Flows can be optimally organized via services provided by a remote Layout Server written in Perl.
• Production Library Maintenance. The Cockpit provides functionality for user maintenance of collections of logical task productions, called libraries. Users may organize productions, modify input/output sets, or create/edit individual productions using flow editors.
• Class Library Maintenance. Users are provided with libraries of task and specification classes that are organized into a generalization-specialization hierarchy. Users can instantiate a class into an actual task, specification, selector, or database when creating a flow by simply dragging the appropriate class from a class browser and dropping it onto a flow editor’s canvas. The Cockpit provides the user with a simple tree structure interface to facilitate the creation and maintenance of class libraries.
• Process Simulation. Processes may be simulated using the Cockpit. The Cockpit provides the user with several scheduler modules that determine how the process configuration space will be explored. The schedulers control the execution of external tools (through the appropriate site proxy servers) and simulation display (flow animation for user monitoring of simulation progress). There are multiple schedulers for the user to choose from when simulating a process, including the manual scheduler, comprehensive linear scheduler, etc.
• Process Archival. The Cockpit allows processes to be archived on a remote server using the Java File System (JFS). The Cockpit uses a JFS client interface to connect to a remote JFS server, where process files are saved and loaded. While JFS has clear advantages, it is awkward that users cannot save process files, libraries, etc. on their local systems. Until version 1.1 of the Java Development Kit, local storage by a Java applet was simply not an option, because the browser JVM definition did not allow access to most local resources. With version 1.1 of the JDK, however, comes the ability to electronically sign an applet. Once this has been done, users can grant privileged resource access to specific applets after a signature has been verified.

FIGURE 73.14 A system cockpit window.


Design Flow Graph Properties

Initially, a flow graph created by a user using the GUI is not associated with any system-specific information. For example, when a designer creates an atomic task node in a flow graph, there is initially no association with any external tool. The framework must provide a mechanism for users to bind flow graph entities to the external tools or activities that they represent. We have used the concept of properties to allow users to bind flow graph objects to external entities. In an attempt to maintain flexibility, properties have been implemented in a very generic fashion. Users can define any number of properties for a flow object. There are a number of key properties that the framework recognizes for each type of flow object. The user defines these properties to communicate needed configuration data to the framework. A property consists of a property label and property contents. The label identifies the property, and consists of an alphanumeric string with no white space. The contents of a property can be any string. Currently users define properties using a free-form text input dialog, with each line defining a property. The first word on a line represents the property label, and the remainder of the line constitutes the property contents.

Property Inheritance

To further extend the flexibility of flow object properties, the framework requires that each flow object be associated with a flow object class. Classes allow designers to define properties that are common to all flow objects that inherit from that flow object class. Furthermore, classes are organized into a general-to-specific hierarchy, with children classes inheriting properties from parent classes. Therefore, the properties of a particular flow object consist of any properties defined locally for that object, in addition to properties defined in the object’s inherited class hierarchy. If a property is defined in both the flow object and one of its parent classes, the property definition in the flow object takes precedence. If a property is defined in more than one class in a class hierarchy, the “youngest” class (i.e., the child in a parent-child relationship) takes precedence. Classes are defined in the Class Browsers of IMEDA. Designers who have identified a clear general-to-specific hierarchy of flow object classes can quickly create design flow graphs by dragging and dropping from class browsers onto flow design canvases. The user would then need only to override those properties in the flow objects that are different from their respective parent classes.
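The parsing rule (first word is the label, the rest of the line is the contents) and the two precedence rules can be sketched together. This is an illustration only: the class `FlowClass`, the functions `parse_properties` and `object_properties`, and all property values are hypothetical, while the class names Search and Insertion are borrowed from the chapter's own example.

```python
from typing import Dict, Optional

def parse_properties(text: str) -> Dict[str, str]:
    """First word on each line is the label; the rest of the line is the contents."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if parts:
            props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props

class FlowClass:
    def __init__(self, name: str, text: str = "",
                 parent: Optional["FlowClass"] = None) -> None:
        self.name = name
        self.props = parse_properties(text)
        self.parent = parent

    def resolved(self) -> Dict[str, str]:
        """Inherited properties, with the youngest class taking precedence."""
        inherited = self.parent.resolved() if self.parent else {}
        return {**inherited, **self.props}

def object_properties(local_text: str, cls: FlowClass) -> Dict[str, str]:
    """Properties of a flow object: local definitions win over the class hierarchy."""
    return {**cls.resolved(), **parse_properties(local_text)}

# Hypothetical property values for illustration.
search = FlowClass("Search", "WORKDIR /tmp/work\nSITE site1\nWRAPPERPATH /opt/bin/wrapper.sh")
insertion = FlowClass("Insertion", "CMDLINE sort -i", parent=search)
props = object_properties("WORKDIR /tmp/other", insertion)
```

The flow object ends up with all four properties: three inherited from Search, CMDLINE from Insertion, and a locally overridden WORKDIR.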

FIGURE 73.15 A task browser.

For example, consider a hierarchy of classes that all invoke the same external sort tool, but pass different flags to the tool, based on the context. It is likely that all of these classes will have properties in common, such as a common working directory and tool site. By defining these common properties in a common ancestor of all of the classes, such as Search, it is unnecessary to redefine the properties in the children classes. Of course, children classes can define new properties that are not contained in the parent classes, and may also override property definitions provided by ancestors. Following these rules, class Insertion would have the following properties defined: WORKDIR, SITE, WRAPPERPATH, and CMDLINE.

FIGURE 73.16 A property window and property inheritance.

Macro Substitution

While performing benchmarks on IMEDA, one cumbersome aspect of the framework that users often pointed out was the need to re-enter properties for tasks or specifications if, for example, a tool name or working directory changed. Finding every property that needed to be changed was a tedious job, and prone to errors. To deal with this problem, we introduced property macros. A macro is a textual substitution rule that can be created by users; a property macro is any macro that is not a key system macro. By using macros in the property databases of flow objects, design flows can be made more flexible and more amenable to future changes. As an example, consider a design flow that contains many atomic tasks bound to an external tool. Our previous example using searches is one possible scenario. On one system, the path to the external tool may be “/opt/bin/sort,” while on another system the path is “/user/keyesdav/public/bin/sort.” Making the flow object properties flexible is easy if a property macro named SORTPATH is defined in an ancestor of all affected flow objects. Children flow objects can then use that macro in place of a static path when specifying the flow object properties. As a further example, consider a modification to the previous “Search task hierarchy” where we define a macro SORTPATH in the Search class, and then use that macro in subsequent children classes, such as the Insertion class.

FIGURE 73.17 Macro definition.


In the highlighted portion of the Property Database text area, a macro called “SORTPATH” is defined. In the Property Databases of subsequent classes, this macro can be used in place of a static path. This makes it easy to change the path for all tools that use the SORTPATH property macro: only the property database dialog where SORTPATH is originally defined needs to be modified.

FIGURE 73.18 Macro substitution.
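Macro substitution of this kind can be sketched as a simple textual replacement. The text does not show how a macro is referenced inside a property, so the `$(NAME)` syntax below is purely an assumption made for illustration, as are the function name `expand` and the `-n input.txt` arguments; the two sort paths are the ones quoted in the text.

```python
import re
from typing import Dict

# Assumed reference syntax: $(NAME). The actual IMEDA syntax is not given in the text.
MACRO = re.compile(r"\$\((\w+)\)")

def expand(value: str, macros: Dict[str, str]) -> str:
    """Replace each $(NAME) that names a defined macro; leave unknown names untouched."""
    return MACRO.sub(lambda match: macros.get(match.group(1), match.group(0)), value)

macros = {"SORTPATH": "/opt/bin/sort"}
cmdline = expand("$(SORTPATH) -n input.txt", macros)

# Moving to another system requires changing only the macro definition.
macros["SORTPATH"] = "/user/keyesdav/public/bin/sort"
cmdline_elsewhere = expand("$(SORTPATH) -n input.txt", macros)
```

Every property that refers to SORTPATH picks up the new path the next time it is expanded, which is the maintenance benefit the text describes.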

Key Framework Properties

In our current implementation of IMEDA, there are a number of key properties defined. These properties allow users to communicate needed information to the framework in a flexible fashion. Most importantly, they allow system architects to define or modify system properties quickly. This is an important benefit when working with evolving software such as IMEDA.

73.7 Conclusion

Managing the design process is a key factor in improving productivity in the microelectronics industry. We have presented an Internet-based Micro-Electronic Design Automation (IMEDA) framework to manage the design process. IMEDA uses a powerful formalism, called design process grammars, for representing design processes. We have also proposed an execution environment that utilizes this formalism to assist designers in selecting and executing appropriate design processes. The proposed approach is applicable not only in rapid prototyping but also in any environment where a design is carried out hierarchically and many alternative processes are possible. The primary advantages of our system are:

• Formalism: A strong theoretical foundation enables us to analyze how our system will operate with different methodologies.
• Parallelism: In addition to performing independent tasks within a methodology in parallel, our system also allows multiple methodologies to be executed in parallel.
• Extensibility: New tools can be integrated easily by adding productions and manager programs.
• Flexibility: Many different control strategies can be used. They can even be mixed within the same design exercise.

The prototype of IMEDA is implemented using Java. We are currently integrating more tools into our prototype system and developing manager program templates that implement more sophisticated algorithms for pre-evaluation, logical task execution, and query handling. Our system will become more useful as CAD vendors adopt open software systems and allow greater tool interoperability.

© 2000 by CRC Press LLC

References Andreoli, J.-M., Pacull, F., and Pareschi, R., XPECT: A framework for electronic commerce, IEEE Internet Comput., vol. 1, no. 4, pp. 40-48, 1998. Baldwin, R. and Chung, M.J., A formal approach to mangaging design processes, IEEE Comput., pp. 54-63, Feb. 1995a. Baldwin, R. and Chung, M.J., Managing engineering data for complex products, Res. Eng. Design, 7, pp. 215-231, 1995b. Barthelmann, K., Process specification and verification, Lect. Notes Comput. Sci., 1073 pp. 225-239, 1996. Berners-Lee, T., Cailliau, R., Luotonen, A., Nielsen, H. F., and Secret, A., The World-Wide Web, Commun. ACM, 37, 8, pp. 76-82, 1994. ten Bosch, K. O., Bingley, P., and Van der Wolf, P., Design flow management in the NELSIS CAD framework, Proc. 28th Design Automation Conf., pp. 711-716, 1991. Bushnell, M. L. and Director, S. W., VLSI CAD tool integration using the Ulysses environment, 23rd ACM/IEEE Design Automation Conf., pp. 55-61, 1986. Casotto, A., Newton, A. R., and Snagiovanni-Vincentelli, A., Design management based on design traces, 27th ACM/IEEE Design Automation Conf., pp. 136-141, 1990. Tool Encapsulation Specification, Draft Standard, Version 2.0, released by the CFI TES Working Group, 1995. Chan, F. L., Spiller, M. D., and Newton, A. R., WELD — An environment for web-based electronic design, 35th ACM/IEEE Design Automation Conf., June 1998. Chiueh, T. F. and Katz, R. H., A history model for managing the VLSI design process, Int. Conf. Comput. Aided Design, pp. 358-361, 1990. Chung, M. J., Charmichael, L., and Dukes, M., Managing a RASSP design process, Comp. Ind., 30, pp. 49-61, 1996. Chung, M. J. and Kim, S., An object-oriented VHDL environment, 27th Design Automation Conf., pp. 431-436, 1990. Chung, M. J. and Kim, S., Configuration management and version control in an object-oriented VHDL environment, ICCAD 91, pp. 258-261, 1991. Chung, M. J. and Kwon, P., A web-based framework for design and manufacturing a mechanical system, 1998 DETC, Atlanta, GA, Sept. 1998. 
Cutkosy, M. R., Tenenbaum, J. M., and Glicksman, J., Madefast: collaborative engineering over the Internet, Commun. ACM, vol. 39, no. 9, pp. 78-87, 1996. Daniell, J. and Director, S. W., An object oriented approach to CAD tool control, IEEE Trans. Comput.Aided Design, pp. 698-713, June 1991. Dellen, B., Maurer, F., and Pews, G., Knowledge-based techniques to increase the flexibility of workflow management, in Data and Knowledge Engineering, North-Holland, 1997. Derk, M. D. and DeBrunner, L. S., Reconfiguartion for fault tolerance using graph grammar, ACM Trans. Comput. Syst., vol. 16, no. 1, pp. 41-54, Feb. 1998. Ehrig, H., Introduction to the algebraic theory of graph grammars, 1st Workshop on Graph Grammars and Their Applications to Computer Science and Biology, pp. 1-69, Springer, LNCS, 1979. Erkes, J. W., Kenny, K. B., Lewis, J. W., Sarachan, B. D., Sobololewski, M. W., and Sum, R. N., Implementing shared manufacturing services on the World-Wide Web, Commun. ACM, vol. 39, no. 2, pp. 34-45, 1996. Fairbairn, D. G., 1994 Keynote Address, 31st Design Automation Conference, pp. xvi-xvii, 1994. Hardwick, M., Spooner, D. L., Rando, T., and Morris, K. C., Sharing manufacturing information in virtual enterprises, Commun. ACM, vol. 39, no. 2, pp. 46-54, 1996. Hawker, S., SEMATECH Computer Integrated Manufacturing(CIM) framework Architecture Concepts, Principles, and Guidelines, version 0.7. Heiman, P. et al., Graph-based software process management, Int. J. Software Eng. Knowledge Eng., vol.7, no. 4, pp. 1-24, Dec. 1997.

© 2000 by CRC Press LLC

Hines, K. and Borriello, G., A geographically distributed framework for embedded system design and validation, 35th Annual Design Automation Conf., 1998. Hsu, M. and Kleissner, C., Objectflow: towards a process management infrastructure, Distributed and Parallel Databases, 4, pp. 169-194, 1996. IDEF http://www.idef.com. Jacome, M. F. and Director, S. W., A formal basis for design process planning and management, IEEE Trans. Comput.-Aided Design of Integr. Circuits Syst., vol. 15, no. 10, pp. 1197-1211, October 1996. Jacome, M. F. and Director, S. W., Design process management for CAD frameworks, 29th Design Automation Conf., pp. 500-505, 1992. Di Janni, A., A monitor for complex CAD systems, 23rd Design Automation Conference, pp. 145-151, 1986. Katz, R. H., Bhateja, R., E-Li Chang, E., Gedye, D., and Trijanto, V., Design version management, IEEE Design and Test, 4(1) pp. 12-22, Feb. 1987. Kleinfeldth, S., Guiney, M., Miller, J. K., and Barnes, M., Design methodology management, Proc. IEEE, vol. 82, no.2, pp. 231-250, Feb. 1994. Knapp, D. and Parker, A., The ADAM design planning engine, IEEE Trans. Comput. Aided Design Integr. Circuits Syst., vol. 10, no. 7, July 1991. Knapp, D. W. and Parker, A. C., A design utility manager: the ADAM planning engine, 23rd ACM/IEEE Design Automation Conf., pp. 48-54, 1986. Kocourek, C., An architecture for process modeling and execution support, Comput. Aided Syst. Theor. — EUROCAST, 1995. Kocourek, C., Planning and execution support for design process, IEEE Interantional symposium and workshop on systems engineering of computer based system proceedings, 1995. Knutilla, A., Schlenoff, C., Ray, S., Polyak, S. T., Tate, A., Chiun Cheah, S., and Anderson, R. C., Process specification language: an analysis of existing representations, NISTIR 6160, National Institute of Standards and Technology, Gaithersburg, MD, 1998. 
Lavana, H., Khetawat, A., Brglez, F., and Kozminski, K., Executable workflows: a paradigm for collaborative design on the Internet, 34th ACM/IEEE Design Automation Conf., June 1997. Lander, S. E., Staley, S. M., and Corkill, D. D., Designing integrated engineering environments: blackboard-based integration of design and analysis tools, Proc. IJCAI-95 Workshop Intelligent Manuf. Syst., AAAI, 1995 Lyons, K., RaDEO Project Overview, http://www.cs.utah.edu/projects/alpha1/arpa/mind/index.html. Malone, T. W., Crowston, K., Lee, J., Pentland, B. T., Dellarocas, C., Wyner, G., Quimby, J., Osborne, C., Bernstein, A., Herman, G., Klein, M., and O’Donnell, E., in press. OASIS Users Guide and Reference Manual, MCNC, Research Triangle Park, North Carolina, 1992. Petrie, C. J., Agent Based Engineering, the Web, and Intelligence, IEEE Expert, Dec. 1996. Rastogi, P., Koziki, M., and Golshani, F., ExPro-an expert system based process management system, IEEE Trans. Semiconductor Manuf., vol. 6, no. 3, pp. 207-218. Schlenoff, C., Knutilla, A., and Ray, S., Unified process specification language: requirements for modeling process, NISTIR 5910, National Institute of Standards and Technology, Gaithersburg, Maryland, 1996. Schurmann, B. and Altmeyer, J., Modeling design tasks and tools — the link between product and flow model, Proc. 34th ACM/IEEE Design Automation Conf., June 1997. Sutton, P. R. and Director, S. W., Framework encapsulations: a new approach to CAD tool interoperability, 35th ACM/IEEE Design Automation Conf., June 1998. Sutton, P. R. and Director, S. W., A description language for design process management, 33rd ACM/IEEE Design Automation Conf., pp. 175-180, June 1996. Spiller, M. D. and Newton, A. R., EDA and Network, ICCAD, pp. 470-475, 1997. Stavas, J. et al., Workflow modeling for implementing complex, CAD-based, design methodologies. Toye, G., Cutkosky, M. R., Leifer, L. J., Tenenbaum, J. 
M., and Glicksman, J., SHARE: a methodology and environment for collaborative product development, Proc. Second Workshop Enabling Technol.: Infrastruct. Collaborative Enterprises, Los Alamitos, California, IEEE Computer Society Press, pp.33-47, 1993.

© 2000 by CRC Press LLC

Vogel, A. and Duddy, K., Java Programming with CORBA, Wiley Computer Publishing, New York. Welsh, J., Kalathil, B., Chanda, B., Tuck, M. C., Selvidge, W., Finnie, E., and Bard, A., Integrated process control and data management in RASSP enterprise system, Proc. of 1995 RASSP Conf., 1995. Westfechtel, B., Integrated product and process management for engineering design applications, Integr. Comput.-Aided Eng., vol. 3, no. 1, pp. 20-35, 1996. Yang, Z. and Duddy, K., CORBA: a platform for distributed object computing ACM Operating Syst. Rev., vol. 30, no. 2, pp. 4-31, 1996.

© 2000 by CRC Press LLC

Parker, A.C., et al. "System-Level Design" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


74 System-Level Design

Alice C. Parker, University of Southern California
Yosef Gavriel, Virginia Polytechnic Institute and State University
Suhrid A. Wadekar, IBM Corp.

74.1 Introduction
Design Philosophies and System-Level Design • The System Design Space
74.2 System Specification
74.3 System Partitioning
Constructive Partitioning Techniques • Iterative Partitioning Techniques
74.4 Scheduling and Allocating Tasks to Processing Modules
74.5 Allocating and Scheduling Storage Modules
74.6 Selecting Implementation and Packaging Styles for System Modules
74.7 The Interconnection Strategy
74.8 Word Length Determination
74.9 Predicting System Characteristics
74.10 A Survey of Research in System Design
System Specification • Partitioning • Non-pipelined Design • Macro-pipelined Design • Genetic Algorithms • Imprecise Computation • Probabilistic Models and Stochastic Simulation • Performance Bounds Theory and Prediction • Word Length Selection

74.1 Introduction The term system, when used in the digital design domain, implies many different entities. A system can consist of a processor, memory, and input/output, all on a single integrated circuit, or it can consist of a network of processors, geographically distributed, each performing a specific application. There can be a single clock, with modules communicating synchronously; multiple clocks with asynchronous communication; or entirely asynchronous operation. The design can be general, or specific to a given application, i.e., application-specific. Together, these variations constitute the system style. System style selection is determined largely by the physical technologies used, the environment in which the system operates, designer experience, and corporate culture, and it is not automated to any great extent. System-level design covers a wide range of design activities and design situations. It includes the more specific activity of system engineering, which involves requirements development, test planning, subsystem interfacing, and end-to-end analysis of systems. System-level design is sometimes called system architecting, a term used widely in the aerospace industry. General-purpose system-level design involves the design of programmable digital systems, including the basic modules containing storage, processors, input/output, and system controllers. At the system level, the design activities include determining the following:

• the power budget (the amount of power allocated to each module in the system);
• the cost and performance budget allocated to each module in the system;
• the interconnection strategy;
• the selection of commercial off-the-shelf modules (COTS);
• the packaging of each module;
• the overall packaging strategy;
• the number of processors, storage units, and input/output interfaces required; and
• the overall characteristics of each processor, storage unit, and input/output interface.

For example, memory system design focuses on the number of memory modules required, how they are organized, and the capacity of each module. A specific system-level issue in this domain can be the question of how to partition the memory between the processor chip and the off-chip memory. At a higher level, a similar issue might involve configuration of the complete storage hierarchy, including memory, disk drives, and archival storage. For each general-purpose system designed, many systems are designed to perform specific applications. Application-specific system design involves the same activities as described previously, but can involve many more issues, since there are usually more custom logic modules involved. Specifications for application-specific systems contain not only requirements on general capabilities but also the functionality required in terms of specific tasks to be executed. Major application-specific, system-level design activities include not only the general-purpose system design activities, but the following activities as well:
• partitioning an application into multiple functional modules;
• scheduling the application tasks on shared functional modules;
• allocating functional modules to perform the application tasks;
• allocating and scheduling storage modules to contain blocks of data as it is processed;
• determining the implementation styles of functional modules;
• determining the word lengths of data necessary to achieve a given accuracy of computation; and
• predicting resulting system characteristics once the system design is complete.

Each of the system design tasks given in the previous two lists will be described in detail in subsequent sections. Since the majority of system design activities are application-specific, this section will focus on system-level design of application-specific systems. Related activities, namely hardware-software co-design, verification, and simulation, are covered in other sections.

Design Philosophies and System-Level Design Many design tools have been constructed with a top-down design philosophy. Top-down design represents a design process whereby the design becomes increasingly detailed until final implementation is complete. Considerable prediction of resulting system characteristics is required in order to make the higher-level decisions with some degree of success. Bottom-up design, on the other hand, relies on designing a set of primitive elements and forming more complex modules from those elements. Ultimately, the modules are assembled into a system. At each stage of the design process there is complete knowledge of the parameters of the lower-level elements. However, the lower-level elements may be inappropriate for the tasks at hand. Industry system designers describe the design process as being much less organized and considerably more complex than the top-down and bottom-up philosophies suggest. There is a mixture of top-down and bottom-up activities, with major bottlenecks of the system receiving detailed design consideration while other parts of the system still exist only as abstract specifications. For this reason, the system-level design activities presented in detail here support such a complex design situation. Modules, elements, and components used to design at the system level might exist or might only exist as abstract estimates along with requirements. The system can be designed after all modules have been designed and manufactured, prior to any detailed design, or with a mixture of existing and new modules.


The System Design Space System design, like data path design, is quite straightforward as long as the constraints are not too severe. However, most modern system designs must solve harder problems than problems solved by existing systems; moreover, designers must race to produce working systems faster than competitors. More variations in design are possible than ever before, and such variations require that a large design space be explored. The dimensions of the design space (its axes) are system properties such as cost, power, design time, and performance. The design space contains a population of designs, each of which possesses different values of these system properties. There are literally millions of system designs for a given specification, each of which exhibits different cost, performance, power consumption, and design time. Straightforward solutions that do not attempt to optimize system properties are easy to obtain but may be inferior to designs that require use of system-level design tools and perhaps many iterations of design. The complexity of system design is not due to the fact that system design is an inherently difficult activity, but that so many variations in design are possible and time does not permit exploration of all of them.

74.2 System Specification Complete system specifications contain a wide range of information including
• constraints on the system power, performance, cost, weight, size, and delivery time;
• required functionality of the system components;
• any required information about the system structure;
• required communication between system components;
• the flow of data between components;
• the flow of control in the system; and
• the specification of input precision and desired output precision.

Most system specifications that are reasonably complete exist first in a natural language. Natural-language interfaces are not currently available with commercial system-level design tools. More conventional system-specification methods, used to drive system-level design tools, include formal languages, graphs, or a mixture of the two. Each of the formal system-specification methods described here contains some of the information found in a complete specification, i.e., most specification methods are incomplete. The remaining information necessary for full system design can be provided interactively by the designer, can be entered later in the design process, or can be provided in other forms at the same time the specification is processed. The required design activities determine the specification method used for a given system design task. There are no widely adopted formal languages for system-level hardware design, although SLDL (System-Level Design Language) is currently being developed by an industry group. Hardware description languages (HDLs) such as VHDL¹ and Verilog² are used to describe the functionality of modules in an application-specific system. High-level synthesis tools can then synthesize such descriptions to produce register-transfer designs. Extensions of VHDL have been proposed to encompass more system-level design properties. Apart from system constraints, VHDL specifications can form complete system descriptions. However, the level of detail required in VHDL, and to some extent in Verilog, requires the designer to make some implementation decisions. In addition, some information explicit in more abstract specifications, such as the flow of control between tasks, is implicit in HDLs. Graphical tools have been used for a number of years to describe system behavior and structure. Block diagrams are often used to describe system structure.
Block diagrams assume that tasks have already been assigned to basic blocks and that their configuration in the system has been specified. They generally cannot represent the flow of data or control, or design constraints. The PMS (Processor Memory Switch) notation invented by Bell and Newell was an early attempt to formalize the use of block diagrams for system specification.³


Petri nets have been used for many years to describe system behavior using a token-flow model. A token-flow model represents the flow of control with tokens, which flow from one activity of the system to another. Many tokens can be active in a given model concurrently, representing asynchronous activity and parallelism, important in many system designs. Timed Petri nets have been used to model system performance, but Petri nets cannot easily be used to model other system constraints, system behavior, or any structural information. State diagrams and graphical tools such as State Charts⁴ provide alternative methods for describing systems. Such tools provide mechanisms to describe the flow of control, but they do not describe system constraints, system structure, data flow, or functionality. Task-flow graphs, an outgrowth from the Control/Data-Flow Graphs (CDFG) used in high-level synthesis, are often used for system specification. These graphs describe the flow of control and data between tasks. When used in a hierarchical fashion, task nodes in the task-flow graph can contain detailed functional information about each task, often in the form of a CDFG. Task-flow graphs contain no mechanisms for describing system constraints or system structure. Spec Charts⁵ incorporate VHDL descriptions into State-Chart-like notation, overcoming the lack of functional information found in State Charts. Figure 74.1 illustrates the use of block diagrams, Petri nets, task-flow graphs, and spec charts.
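The token-flow model described above can be illustrated with a few lines of Python. This is only a minimal sketch: the place and transition names (start, fork, run_a, and so on) are hypothetical and are not taken from Figure 74.1. A transition may fire when each of its input places holds a token, and two transitions enabled at the same time represent concurrent activity.

```python
# Minimal token-flow (Petri net) sketch. All names are hypothetical examples.

def enabled(marking, transition):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking[p] >= 1 for p in transition["inputs"])

def fire(marking, transition):
    """Consume one token per input place and produce one per output place."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for p in transition["inputs"]:
        new_marking[p] -= 1
    for p in transition["outputs"]:
        new_marking[p] += 1
    return new_marking

# "fork" starts two concurrent activities; "join" waits for both to finish.
transitions = {
    "fork": {"inputs": ["start"], "outputs": ["a_ready", "b_ready"]},
    "run_a": {"inputs": ["a_ready"], "outputs": ["a_done"]},
    "run_b": {"inputs": ["b_ready"], "outputs": ["b_done"]},
    "join": {"inputs": ["a_done", "b_done"], "outputs": ["finished"]},
}
places = ["start", "a_ready", "b_ready", "a_done", "b_done", "finished"]
marking = {p: 0 for p in places}
marking["start"] = 1  # a single token marks the initial flow of control

marking = fire(marking, transitions["fork"])
# Both activities are now enabled at once, which models parallelism.
concurrent = sorted(n for n, t in transitions.items() if enabled(marking, t))
for name in ("run_a", "run_b", "join"):
    marking = fire(marking, transitions[name])
print(concurrent, marking)
```

The sketch carries no timing, constraints, or structural information, which mirrors the limitation noted in the text: the marking records only where control currently resides.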

FIGURE 74.1 The use of block diagrams, Petri nets, task flow graphs, and spec charts, shown in simplified form.

74.3 System Partitioning Most systems are too large to fit on a single substrate. If the complexity of the system tasks and the capacity of the system modules are of the same order, then partitioning is not required. All other systems must be partitioned so that they fit into the allowed substrates, packages, boards, multi-chip modules, and cases. Partitioning determines the functions, tasks, or operations in each partition of a system. Each partition can represent a substrate, package, multi-chip module, or larger component. Partitioning is performed with respect to a number of goals, including minimizing cost, design time, or power, or maximizing performance. Any of these goals can be reformulated as specific constraints, like meeting given power requirements.


When systems are partitioned, resulting communication delays must be taken into account, affecting performance. Limitations on interconnection size must be taken into account, affecting performance as well. Pin and interconnection limitations force the multiplexing of inputs and outputs, reducing performance, and sometimes affecting cost. Power consumption must also be taken into account. Power balancing between partitions and total power consumption might both be considerations. In order to meet market windows, system partitions can facilitate the use of COTS, programmable components, or easily fabricated components such as gate arrays. In order to meet cost constraints, functions that are found in the same partition might share partition resources. Such functions or tasks cannot execute concurrently, affecting performance. Partitioning is widely used at the logic level, as well as on physical designs. In these cases, much more information is known about the design properties, and the interconnection structure has been determined. System partitioning is performed when information about the specific components’ properties might be uncertain, and the interconnection structure undetermined. For these reasons, techniques used at lower levels must be modified to include predictions of design properties not yet known and prediction of the possible interconnection structure as a result of the partitioning. The exact partitioning method used depends on the type of specification available. If detailed CDFG or HDL specifications are used, the partitioning method might be concerned with which register-transfer functions (e.g., add, multiply, shift) are found in each partition. If the specification primitives are tasks, as in a task-flow graph specification, then the tasks must be assigned to partitions. Generally, the more detailed the specification, the larger the size of the partitioning problem. 
Powerful partitioning methods can be applied to problems of small size (n < 100). Weaker methods such as incremental improvement must be used when the problem size is larger. Partitioning methods can be based on constructive partitioning or iterative improvement. Constructive partitioning involves taking an unpartitioned design and assigning operations or tasks to partitions. Basic constructive partitioning methods include bin packing using a first-fit decreasing heuristic, clustering operations into partitions by assigning nearest neighbors to the same partition until the partition is full, random placement into partitions, and integer programming approaches.
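As a rough illustration of the first of these methods, the following Python sketch applies first-fit decreasing bin packing to a handful of tasks. The task names, sizes, and the uniform partition capacity are hypothetical, and the sketch opens partitions on demand rather than creating a fixed number in advance, which is a simplification.

```python
def first_fit_decreasing(task_sizes, capacity):
    """Return a list of partitions, each a list of task names."""
    for name, size in task_sizes.items():
        if size > capacity:
            raise ValueError(f"task {name} exceeds the partition capacity")
    partitions = []  # task names held by each partition
    free = []        # remaining capacity of each partition
    # Sort largest first; ties are broken by name so the result is repeatable.
    ordered = sorted(task_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        for idx in range(len(partitions)):
            if size <= free[idx]:  # first partition in which the task fits
                partitions[idx].append(name)
                free[idx] -= size
                break
        else:
            # No existing partition has room, so add another one.
            partitions.append([name])
            free.append(capacity - size)
    return partitions

# Hypothetical task sizes (e.g., estimated gate counts in thousands).
tasks = {"fft": 60, "filter": 50, "ctrl": 40, "io": 30, "dma": 20}
result = first_fit_decreasing(tasks, capacity=100)
print(result)
```

With these numbers the heuristic fills two partitions exactly. In general it ignores connectivity between tasks, so its output is best treated as a starting point for iterative improvement.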

Constructive Partitioning Techniques Bin packing involves creating a number of bins equal in number to the number of partitions desired and equal in size to the size of partitions desired. Then, the tasks or operations are sorted by size. The largest task in the list is placed in the first bin, and then the next largest is placed in the first bin, if it will fit, or if it does not fit, into the second bin. Each task is placed into the first bin in which it will fit, until all tasks have been placed in bins. More bins are added if necessary. This simple heuristic is useful to create an initial set of partitions to be improved iteratively later. Clustering is a more powerful method to create partitions. Here is a simple clustering heuristic. Each task is ranked by the extent of “connections” to other tasks either due to control flow, data flow, or physical position limitations. The most connected task is placed in the first partition, and then the tasks connected to it are placed in the same partition, in order of the strength of their connections to the first task. Once the partition is full, the task with the most total connections remaining outside a partition is placed in a new partition, and other tasks are placed there in order of their connections to the first task. This heuristic continues until all tasks are placed. Random partitioning places tasks into partitions in a greedy fashion until the partitions are full. Some randomization of the choice of tasks is useful in producing a family of systems, of which each member is partitioned randomly. This family of systems can be used successfully in iterative improvement techniques for partitioning, as described later in this section. The most powerful technique for constructive partitioning is mathematical programming. Integer and mixed integer-linear programming techniques have been used frequently in the past for partitioning. 
Such powerful techniques are computationally very expensive, and they are successful only when the number of objects to be partitioned is small. The basic idea behind integer programming used for partitioning is the following: An integer variable, TP(i,j), is used to represent the assignment of tasks to partitions. When TP(i,j) = 1, task i is assigned to partition j. For each task i in this problem, there would be an equation

$$\sum_{j=1}^{\text{partition total}} TP(i,j) = 1 \qquad (74.1)$$

Given that each TP(i,j) is either 0 or 1, this equation states that each task must be assigned to one and only one partition. There would be many constraints of this type in the integer program, some of which would be inequalities. There would be one function representing cost, performance, or other design property, to be optimized. The simultaneous solution of all constraints, given some minimization or maximization goal, would yield the optimal partitioning. Apart from the computational complexity of this technique, the formulation of the mathematical programming constraints is tedious and error prone if performed manually. The most important advantage of mathematical programming formulations is the discipline they impose on the CAD programmer in formulating an exact definition of the CAD problem to be solved. Such problem formulations can prove useful when applied in a more practical environment, as described below in the next section, “Iterative Partitioning Techniques.”
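To make the formulation concrete, the sketch below builds the TP(i,j) variables for a very small problem and finds the optimum by exhaustive enumeration. It is not an integer programming solver; enumeration simply stands in for one, which is workable only because the problem is tiny. The capacity inequality, the cut-traffic objective, and all task data are assumptions introduced for illustration.

```python
from itertools import product

def optimal_partition(sizes, traffic, n_parts, capacity):
    """Enumerate all 0/1 assignments TP[i][j] and keep the cheapest feasible one."""
    tasks = sorted(sizes)
    best_cost, best_tp = None, None
    for choice in product(range(n_parts), repeat=len(tasks)):
        # TP[i][j] == 1 exactly when task i is assigned to partition j.
        tp = {t: [1 if j == choice[k] else 0 for j in range(n_parts)]
              for k, t in enumerate(tasks)}
        # Equality constraint (74.1): each task sits in one and only one partition.
        assert all(sum(row) == 1 for row in tp.values())
        # Assumed inequality constraint: the load of a partition must fit its capacity.
        loads = [sum(sizes[t] * tp[t][j] for t in tasks) for j in range(n_parts)]
        if any(load > capacity for load in loads):
            continue
        # Assumed objective: total data flowing between different partitions.
        cost = sum(w for (a, b), w in traffic.items() if tp[a] != tp[b])
        if best_cost is None or cost < best_cost:
            best_cost, best_tp = cost, tp
    return best_cost, best_tp

# Hypothetical task sizes and data volumes between task pairs.
sizes = {"t1": 3, "t2": 2, "t3": 2, "t4": 3}
traffic = {("t1", "t2"): 8, ("t2", "t3"): 1, ("t3", "t4"): 7, ("t1", "t4"): 2}
cost, tp = optimal_partition(sizes, traffic, n_parts=2, capacity=5)
print(cost, tp)
```

The search space grows as the number of partitions raised to the number of tasks, which is one way to see why exact techniques succeed only when the number of objects to be partitioned is small.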

Iterative Partitioning Techniques Of the many iterative partitioning techniques available, two have been applied most successfully at the system level. These techniques are min-cut partitioning, first proposed by Kernighan and Lin, and genetic algorithms. Min-cut partitioning involves exchanging tasks or operations between partitions in order to minimize the total amount of “interconnections” cut. The interconnections can be computed as the sum of data flowing between partitions, or as the sum of an estimate of the actual interconnections that will be required in the system. The advantage of summing the data flowing is that it provides a quick computation, since the numbers are contained in the task flow graph. Better partitions can be obtained if the required physical interconnections are taken into account, since they are related more directly to cost and performance than is the amount of data flowing. If a partial structure exists for the design, predicting the unknown interconnections allows partitioning to be performed on a mixed design, one that contains existing parts as well as parts under design. Genetic algorithms, highly popular for many engineering optimization problems, are especially suited to the partitioning problem. The problem formulation is similar in some ways to mathematical programming formulations. A simple genetic algorithm for partitioning is described here. In this example, a chromosome represents each partitioned system design, and each chromosome contains genes, representing information about the system. A particular gene, TP(i,j), might represent the fact that task i is contained in partition j when it is equal to 1, and is set to 0 otherwise. A family of designs created by some constructive partitioning technique then undergoes mutation and crossover as new designs evolve.
A fitness function is used to check the quality of the design, and the evolution is halted when the design is considered fit, or when no improvement has occurred after some time. In the case of partitioning, the fitness function might include the estimated volume of interconnections, the predicted cost or performance of the system, or other system properties. The reader might note some similarity between the mathematical programming formulation of the partitioning problem presented here and the genetic algorithm formulation. This similarity allows the CAD developer to create a mathematical programming model of the problem to be solved, find optimal solutions to small problems, and create a genetic algorithm version. The genetic algorithm version can be checked against the optimal solutions found by the mathematical program. However, genetic algorithms can take into account many more details than can mathematical program formulations, can handle non-linear relationships better, and can even handle stochastic parameters.¹ Partitioning is most valuable when there is a mismatch between the sizes of system tasks and the capacities of system modules. When the system tasks and system modules are more closely matched, then the system design can proceed directly to scheduling and allocating tasks to processing modules.
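The genetic algorithm outlined in this section might be sketched as follows. Several choices here are assumptions made for brevity rather than details from the text: each chromosome stores one partition index per task (a compact encoding of the TP(i,j) genes), the starting family is random, the fitness is the cut traffic plus a penalty for exceeding partition capacity, and evolution stops after a fixed number of generations instead of testing for lack of improvement. All task data are hypothetical.

```python
import random

def fitness(chrom, sizes, traffic, n_parts, capacity, penalty=100):
    """Lower is better: data cut between partitions plus a capacity penalty."""
    loads = [0] * n_parts
    for task, part in enumerate(chrom):
        loads[part] += sizes[task]
    overflow = sum(max(0, load - capacity) for load in loads)
    cut = sum(w for (a, b), w in traffic.items() if chrom[a] != chrom[b])
    return cut + penalty * overflow

def evolve(sizes, traffic, n_parts, capacity,
           pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n = len(sizes)

    def score(chrom):
        return fitness(chrom, sizes, traffic, n_parts, capacity)

    # Starting family of designs: random assignments of tasks to partitions.
    pop = [[rng.randrange(n_parts) for _ in range(n)] for _ in range(pop_size)]
    initial_best = min(score(c) for c in pop)
    for _ in range(generations):
        pop.sort(key=score)
        survivors = pop[: pop_size // 2]  # the fitter half survives unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            point = rng.randrange(1, n)        # single-point crossover
            child = mom[:point] + dad[point:]
            if rng.random() < 0.3:             # occasional mutation of one gene
                child[rng.randrange(n)] = rng.randrange(n_parts)
            children.append(child)
        pop = survivors + children
    best = min(pop, key=score)
    return best, score(best), initial_best

# Hypothetical problem: six tasks, two partitions of capacity 6.
sizes = [3, 2, 2, 3, 1, 1]
traffic = {(0, 1): 8, (1, 2): 1, (2, 3): 7, (0, 3): 2, (0, 4): 3, (3, 5): 3}
best, best_score, initial_best = evolve(sizes, traffic, n_parts=2, capacity=6)
print(best, best_score, initial_best)
```

Because the fitter half of each generation is carried forward unchanged, the best score can never get worse from one generation to the next. For a problem this small, the result could also be compared with an exhaustively computed optimum, in the spirit of the cross-check suggested in the text.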

74.4 Scheduling and Allocating Tasks to Processing Modules Scheduling and allocating tasks to processing modules involve the determination of how many processing modules are required, which modules execute which tasks, and the order in which tasks are processed by the system. In the special case where only a single task is processed by each module, the scheduling becomes trivial. Otherwise, if the tasks share modules, the order in which the tasks are processed by the modules can affect system performance or cost. If the tasks are ordered inappropriately, some tasks might wait too long for input data, and performance might be affected. Or, in order to meet performance constraints, additional modules must be added to perform more tasks in parallel, increasing system cost. A variety of modules might be available to carry out each task, with differing cost and performance parameters. As each task is allocated to a module, that module is selected from a set of modules available to execute the task. This is analogous to the task module selection, which occurs as part of high-level synthesis. For the system design problem considered here, the modules can be either general-purpose processors, special-purpose processors (e.g., signal processing processors), or special-purpose hardware. If all (or most) modules used are general-purpose, the systems synthesized are known as heterogeneous application-specific multiprocessors. A variety of techniques can be used for the scheduling and allocation of system tasks to modules. Just as with partitioning, these techniques can be constructive or iterative. Constructive scheduling techniques for system tasks include greedy techniques such as ASAP (as soon as possible) and ALAP (as late as possible). In ASAP scheduling, the tasks are scheduled as early as possible on a free processing module. The tasks scheduled first are the ones with the longest paths from their outputs to final system outputs or system completion. 
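A minimal sketch of ASAP scheduling with this priority rule is given below. The task names, durations, and precedence relations are hypothetical, the task graph is assumed to be acyclic, communication delays are ignored, and ties in priority are broken by the time a task's inputs become available (a choice made here, not one stated in the text). Time is real-valued, since no central clock is assumed.

```python
def asap_schedule(durations, preds, n_procs):
    """Greedy ASAP scheduling; returns {task: (processor, start, finish)}."""
    succs = {t: [] for t in durations}
    for t, ps in preds.items():
        for p in ps:
            succs[p].append(t)

    tail = {}
    def longest_tail(t):
        # Longest path from the output of task t to system completion.
        if t not in tail:
            tail[t] = max((durations[s] + longest_tail(s) for s in succs[t]),
                          default=0.0)
        return tail[t]

    finish = {}
    def data_ready(t):
        # Earliest time at which all inputs of task t are available.
        return max((finish[p] for p in preds.get(t, [])), default=0.0)

    proc_free = [0.0] * n_procs
    schedule = {}
    remaining = set(durations)
    while remaining:
        ready = [t for t in remaining
                 if all(p in finish for p in preds.get(t, []))]
        # Longest tail first; ties go to the task whose inputs arrive first.
        task = min(ready, key=lambda t: (-longest_tail(t), data_ready(t), t))
        ready_time = data_ready(task)
        # Pick the processor on which the task can start soonest.
        proc = min(range(n_procs),
                   key=lambda k: (max(proc_free[k], ready_time), k))
        start = max(proc_free[proc], ready_time)
        finish[task] = start + durations[task]
        proc_free[proc] = finish[task]
        schedule[task] = (proc, start, finish[task])
        remaining.remove(task)
    return schedule

# Hypothetical graph: two lengthy tasks (A, B) and three shorter ones (C, D, E).
durations = {"A": 4.0, "B": 3.5, "C": 1.0, "D": 2.5, "E": 1.5}
preds = {"B": ["A"], "D": ["C"], "E": ["D"]}
schedule = asap_schedule(durations, preds, n_procs=2)
makespan = max(end for _, _, end in schedule.values())
print(schedule, makespan)
```

In this example the three shorter tasks share one processor while the two lengthy tasks run on the other. The result is a feasible greedy schedule, not necessarily an optimal one.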
Greedy techniques such as ASAP and ALAP, with variations, can be used to provide starting populations of system designs to be further improved iteratively. The use of such greedy techniques for system synthesis differs from the conventional use in high-level synthesis, where the system is assumed to be synchronous, with tasks scheduled into time steps. System task scheduling assumes no central clock, and tasks take a wide range of times to complete. Some tasks could even complete stochastically, with completion time a random variable. Other tasks could complete basic calculations in a set time, but could perform finer-grained (more accurate) computations if more time were available.

A simple task-flow graph is shown in Fig. 74.2, along with a Gantt chart illustrating the ASAP scheduling of tasks onto two processors. Note that two lengthy tasks are performed in parallel with three shorter tasks, and that no two tasks take the same amount of time.

Similar to partitioning, scheduling and allocation, along with module selection, can be performed using mathematical programming. In this case, since the scheduling is asynchronous, time becomes a continuous (linear) rather than an integer quantity. Therefore, mixed integer-linear programming (MILP) is employed to model system-level scheduling and allocation. A typical MILP timing constraint is the following:

$$T_{OA}(i) + C_{delay} \le T_{IR}(j) \tag{74.2}$$

where $T_{OA}(i)$ is the time the output is available from task i, $C_{delay}$ is the communication delay, and $T_{IR}(j)$ is the time the input is required by task j. Unfortunately, the actual constraints used in scheduling and allocation are usually more complex than this, because the design choices have yet to be made. Here is another example:

$$T_{OA}(i) \ge T_{IR}(i) + \sum_{k} \left[ P_{delay}(k) \cdot M(i,k) \right] \tag{74.3}$$

This constraint states that the time at which an output from task i is available is no earlier than the time at which the necessary inputs are received by task i plus a processing delay $P_{delay}$. $M(i,k)$ indicates that task i is allocated to module k. $P_{delay}$ can take on a range of values, depending on which of the k modules is being used to implement task i. The summation is actually a linearized select function that picks the value of $P_{delay}$ to use depending on which value of $M(i,k)$ is set to 1.

FIGURE 74.2 An example task-flow graph and schedule.

1 Stochastic parameters represent values that are uncertain. There is a finite probability of a parameter taking a specific value, which can vary with time, but in general that probability is less than one.

As with partitioning, mathematical programming for scheduling and allocation is computationally intensive and impractical for all but the smallest designs, but it does provide a baseline model of design that can be incorporated in other tools. The most frequent technique used for iterative improvement in scheduling and allocation at the system level is a genetic algorithm. The genes can be used to represent task allocation and scheduling. In order to represent asynchronous scheduling accurately, time is generally represented as a continuous (linear) quantity in such genes, rather than an integer quantity.
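To make constraints (74.2) and (74.3) concrete, the following sketch checks a candidate schedule against both of them. It is an illustration, not an MILP solver: the function name, the dictionary layout, and the numeric values are all hypothetical, and a single communication delay is assumed for every edge, as in the form of (74.2) given above.

```python
def violated_constraints(t_ir, t_oa, edges, c_delay, p_delay, alloc):
    """List the violated instances of constraints (74.2) and (74.3).

    t_ir[i], t_oa[i]: input-required and output-available times of task i
    edges:            (i, j) pairs, meaning task j consumes the output of task i
    c_delay:          communication delay (one value for all edges, an assumption)
    p_delay[k]:       processing delay on module k
    alloc[i][k]:      the 0/1 allocation variable M(i, k)
    """
    violations = []
    for i, j in edges:
        if t_oa[i] + c_delay > t_ir[j]:           # constraint (74.2)
            violations.append(("74.2", i, j))
    for i in t_oa:
        # Linearized select: exactly one alloc[i][k] is 1, so the sum picks one delay.
        delay = sum(p_delay[k] * alloc[i][k] for k in p_delay)
        if t_oa[i] < t_ir[i] + delay:             # constraint (74.3)
            violations.append(("74.3", i))
    return violations


# Hypothetical two-task, two-module example.
p_delay = {"cpu": 3.0, "dsp": 1.0}
alloc = {"a": {"cpu": 0, "dsp": 1}, "b": {"cpu": 1, "dsp": 0}}
t_oa = {"a": 1.0, "b": 5.0}
edges = [("a", "b")]

feasible = violated_constraints({"a": 0.0, "b": 2.0}, t_oa, edges, 1.0, p_delay, alloc)
too_early = violated_constraints({"a": 0.0, "b": 1.5}, t_oa, edges, 1.0, p_delay, alloc)
```

In the second call, task b is asked to start before the output of task a can arrive, so only the (74.2) instance for the edge from a to b is reported.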


74.5 Allocating and Scheduling Storage Modules

In digital systems, all data requires some form of temporary or permanent storage. If the storage is shared by several data sets, the use of the storage by each data set must be scheduled. This task was often overlooked in system design in the past, but it has now become an important system-level task. Modern digital systems usually contain some multimedia tasks and data. The storage requirements for multimedia tasks sometimes result in systems where processing costs are dwarfed by storage costs, particularly caching costs. For such systems, storage must be scheduled and allocated either during or after task scheduling and allocation. If storage is scheduled and allocated concurrently with task scheduling and allocation, the total system costs are easier to determine, and functional module sharing can be increased if necessary in order to control total costs. On the other hand, if storage allocation and scheduling are performed after task scheduling and allocation, then both programs are simpler, but the result may not be as close to optimal. Techniques similar to those used for task scheduling and allocation can be used for storage scheduling and allocation.

74.6 Selecting Implementation and Packaging Styles for System Modules

Packaging styles can range from single-chip dual-in-line packages (DIPs) to multi-chip modules (MCMs), boards, racks, and cases. Implementation styles include general-purpose processor, special-purpose programmable processor (e.g., signal processor), COTS modules, Field Programmable Gate Arrays (FPGAs), gate array, standard cell, and custom integrated circuits. For many system designs, system cost, performance, power, and design time constraints determine selection of implementation and packaging styles. Tight performance constraints favor custom integrated circuits, packaged in multi-chip modules. Tight cost constraints favor off-the-shelf processors and gate array implementations, with small substrates and inexpensive packaging. Tight power constraints favor custom circuits. Tight design time constraints favor COTS modules and FPGAs.

If a single design property has high priority, the designer can select the appropriate implementation style and packaging technology. If, however, design time is crucial, but the system to be designed must process video signals in real time, then tradeoffs in packaging and implementation style must be made. The optimality of system cost and power consumption might be sacrificed: the entire design might be built with FPGAs, with much parallel processing and at great cost and large size. Because time-to-market is so important, early market entry systems may sacrifice the optimality of many system parameters initially and then improve them in the next version of the product.

Selection of implementation styles and packaging can be accomplished by adding some design parameters to the scheduling and allocation program, if that program is not already computationally intensive. The parameters added would include

• a variable indicating that a particular functional module was assigned a certain implementation style;
• a variable indicating that a particular storage module was assigned a certain implementation style;
• a variable indicating that a particular functional module was assigned a certain packaging style; and
• a variable indicating that a particular storage module was assigned a certain packaging style.

Some economy of processing could be obtained if certain implementation styles precluded certain packaging styles.
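The effect of such exclusions on the size of the selection problem can be illustrated with a short sketch. It pairs each module with every implementation and packaging style named in this section and then drops the precluded pairs. The exclusion list is purely hypothetical and chosen only for illustration, and representing each surviving combination as one joint 0/1 variable is an assumption of this sketch rather than the formulation described above.

```python
from itertools import product

IMPLEMENTATION_STYLES = [
    "general-purpose processor", "special-purpose programmable processor",
    "COTS module", "FPGA", "gate array", "standard cell", "custom IC",
]
PACKAGING_STYLES = ["DIP", "MCM", "board", "rack", "case"]

# Hypothetical exclusions, for illustration only.
PRECLUDED = {("custom IC", "rack"), ("custom IC", "case"), ("COTS module", "DIP")}


def style_variables(modules, precluded=PRECLUDED):
    """Enumerate the (module, implementation, packaging) combinations that remain
    once precluded pairs are dropped; each one stands for a 0/1 decision variable."""
    return [
        (module, impl, pkg)
        for module in modules
        for impl, pkg in product(IMPLEMENTATION_STYLES, PACKAGING_STYLES)
        if (impl, pkg) not in precluded
    ]


all_variables = style_variables(["m1", "m2"], precluded=set())
pruned_variables = style_variables(["m1", "m2"])
```

Every precluded pair removes one candidate per module before optimization starts, which is the economy of processing referred to above.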

74.7 The Interconnection Strategy

Modules in a digital system are usually interconnected in some carefully architected, consistent manner. If point-to-point interconnections are used, they are used throughout the system, or in a subsystem. In the same manner, buses are not broken arbitrarily to insert point-to-point connections or rings. For this reason, digital system design programs usually assume an interconnection style and determine the system performance relative to that style. The most common interconnection styles are bus, point-to-point, and ring.

74.8 Word Length Determination

Functional specifications for system tasks are frequently detailed enough to contain the algorithm to be implemented. In order to determine the implementation costs of each system task, knowledge of the word widths to be used is important, as system cost varies almost quadratically with word width. Tools to automatically select task word width are currently experimental, but the potential for future commercial tools exists.

In typical hardware implementations of an arithmetic-intensive algorithm, designers must determine the word lengths of resources such as adders, multipliers, and registers. In a recent publication,6 Wadekar and Parker presented algorithm-level optimization techniques to select distinct word lengths for each computation which meet the desired accuracy and minimize the design cost for the given performance constraints. The cost reduction is possible by avoiding unnecessary bit-level computations that do not contribute significantly to the accuracy of the final results. At the algorithm level, determining the necessary and sufficient precision of an individual computation is a difficult task, since the precision of various predecessor/successor operations can be traded off to achieve the same desired precision in the final result. This is achieved using a mathematical model7 and a genetic selection mechanism.6

There is a distinct advantage to word-length optimization at the algorithmic level. The optimized operation word lengths can be used to guide high-level synthesis or designers to achieve an efficient utilization of resources of distinct word lengths and costs. Specifically, only a few resources of larger word lengths and high cost may be needed for operations requiring high precision to meet the final accuracy requirement. Other relatively low-precision operations may be executed by resources of smaller word lengths.
If there is no timing conflict, a large word length resource can also execute a small word length operation, thus improving the overall resource utilization further. These high-level design decisions cannot be made without the knowledge of word lengths prior to synthesis.
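A small numeric sketch shows why distinct word lengths matter when cost grows almost quadratically with word width. It assumes, for illustration only, that the cost of a resource is exactly proportional to the square of its word length. The four resource widths are hypothetical and are not taken from the cited work.

```python
def relative_cost(word_lengths):
    """Sum of w**2 over resources: a stand-in for near-quadratic cost growth."""
    return sum(w * w for w in word_lengths)


# Hypothetical design with four multipliers.
uniform = relative_cost([24, 24, 24, 24])  # every resource sized for the widest operation
mixed = relative_cost([24, 16, 12, 12])    # only one resource keeps the full 24 bits
savings = 1 - mixed / uniform
```

Under this assumption the mixed design costs a little under half of the uniform one, while the single 24-bit resource remains available for the high-precision operations.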

74.9 Predicting System Characteristics

In system-level design, early prediction gives designers the freedom to make numerous high-level choices (such as die size, package type, and latency of the pipeline) with confidence that the final implementation will meet power and energy as well as cost and performance constraints. These predictions can guide power budgeting and subsequent synthesis of various system components, which is critical in synthesizing systems that have low power dissipation or long battery life. The use by synthesis programs of performance and cost lower bounds allows smaller solution spaces to be searched, which leads to faster computation of the optimal solution.

System cost, performance, power consumption, and design time can be computed if the properties of each system module are known. System design using existing modules requires little prediction. However, if system design is performed prior to design of any of the contained system modules, their properties must be predicted or estimated. Prediction techniques are complex enough that describing them would require an entire chapter. A brief survey of related readings is found in the next section. The register-transfer and subsequent lower-level power prediction techniques, such as gate- and transistor-level techniques, are essential for validation before fabricating the circuit. However, these techniques are less efficient for the system-level design process, as a design must be generated before prediction can be done.

74.10 A Survey of Research in System Design

Many researchers have investigated the problem of system design, dating back to the early 1970s. This section highlights work that is distinctive, along with tutorial articles covering relevant topics. Much good research is not referenced here, and the reader is reminded that the field is dynamic, with new techniques and tools appearing almost daily. Issues in top-down vs. bottom-up design approaches were highlighted in the design experiment reported by Gupta et al.8

System Specification

System specification has received little attention historically except in the specific area of software specifications. Several researchers have proposed natural language interfaces capable of processing system specifications and creating internal representations of the systems that are considerably more structured. Of note is the work by Granacki9 and Cyre.10 One noteworthy approach is the Design Specification Language (DSL), found in the Design Analysis and Synthesis Environment.11 One of the few books on the subject concerns the design of embedded systems, that is, systems with hardware and software designed for a particular application set.12 In one particular effort, Petri nets were used to specify the interface requirements in a system of communicating modules, which were then synthesized.13 The SIERA system designed by Srivastava, Richards, and Brodersen14 supports specification, simulation, and interactive design of systems.

Partitioning

Partitioning research covers a wide range of system design situations. Many early partitioning techniques dealt with assigning register-level operations to partitions. APARTY, a partitioner designed by Lagnese and Thomas, partitions CDFG designs for single-chip implementation in order to obtain efficient layouts.15 Vahid16 performed a detailed survey of techniques for assigning operations to partitions. CHOP assigns CDFG operations to partitions for multi-chip design of synchronous, common-clocked systems.17 Vahid and Gajski developed an early partitioner, SpecPart, which assigns processes to partitions.18 Chen and Parker reported on a process-to-partition technique called ProPart.19

Non-pipelined Design

Although research on system design spans more than two decades, most of the earlier works focus on single aspects of design, such as task assignment, and not on the entire design problem. We cite some representative works here. These include graph-theoretical approaches to task assignment,20,21 analytical modeling approaches for task assignment,22 and probabilistic modeling approaches for task partitioning,23,24 scheduling,25 and synthesis.26 Two publications of note cover application of heuristics to system design.27,28 Other noteworthy publications include mathematical programming formulations for task partitioning29 and communication channel assignment.30 Early efforts include those by Soviet researchers since the beginning of the 1970s, such as Linsky and Kornev31 and others, where each model included only a subset of the entire synthesis problem. Chu et al.32 published one of the first mixed integer-linear programming (MILP) models for a sub-problem of system-level design, scheduling.

Recently the program SOS (Synthesis of Systems), including a compiler for MILP models,33,34 was developed, based on a comprehensive MILP model for system synthesis. SOS takes a description of a system in the form of a task-flow graph, a processor library, and some cost and performance constraints, and generates an MILP model to be optimized by an MILP solver. The SOS tool generates MILP models for the design of non-periodic (non-pipelined) heterogeneous multiprocessors. The models share a common structure, which is an extension of the previous work by Hafer and Parker for high-level synthesis of digital systems.35

Performance bounds of solutions found by algorithms or heuristics for system-level design are proposed in many papers, including the landmark papers by Fernandez and Bussel36 and Garey and Graham37 and more recent publications.38


The recent work of Gupta et al.8 reported the successful use of system-level design tools in the development of an application-specific heterogeneous multiprocessor for image processing. Gupta and Zorian39 describe the design of systems using cores, silicon cells with at least 5000 gates. The same issue of Design and Test contains a number of useful articles on design of embedded core-based systems. Li and Wolf40 report on a model of hierarchical memory and a multiprocessor synthesis algorithm which takes into account the hierarchical memory structure. A major project, RASSP, is a rapid-prototyping approach whose development is funded by the U.S. Department of Defense.41 RASSP addresses the integrated design of hardware and software for signal processing applications. An early work on board-level design, MICON, is of particular interest.42 Newer research results solving similar problems with more degrees of design freedom include the research by C-T Chen43 and D-H Heo.44 GARDEN, written by Heo, finds the design with the shortest estimated time to market that meets cost and performance constraints. All the MILP synthesis works cited to this point address only the nonperiodic case.

Synthesis of application-specific heterogeneous multiprocessors is a major activity in the general area of system synthesis. One of the most significant system-level design efforts is Lee's Ptolemy project at the University of California, Berkeley. Representative publications include papers by Lee and Bier describing a simulation environment for signal processing45 and the paper by Kalavade et al.46 Another prominent effort is the SpecSyn project,47 which is a system-level design methodology and framework.

Macro-pipelined Design

Macro-pipelined (periodic) multiprocessors execute tasks in a pipelined fashion, with tasks executing concurrently on different sets of data. Most research work on design of macro-pipelined multiprocessors has been restricted to homogeneous multiprocessors having negligible communication costs. This survey divides the past contributions according to the execution mode: preemptive or nonpreemptive.

Nonpreemptive Mode

The nonpreemptive mode of execution assumes that each task is executed without interruption. It is used quite often in low-cost implementations. Much research has been performed on system scheduling for the nonpreemptive mode. A method to compute the minimum possible value for the initiation interval for a task-flow graph given an unlimited number of processors and no communication costs was found by Renfors and Neuvo.48 Wang and Hu49 use heuristics for the allocation and full static scheduling (meaning that each task is executed on the same processor for all iterations) of generalized perfect-rate task-flow graphs on homogeneous multiprocessors. Wang and Hu apply planning, an artificial intelligence method, to the task scheduling problem. The processor allocation problem is solved using a conflict-graph approach.

Gelabert and Barnwell50 developed an optimal method to design macro-pipelined homogeneous multiprocessors using cyclic-static scheduling, where the task-to-processor mapping is not time-invariant as in the full static case, but is periodic, i.e., the tasks are successively executed by all processors. Gelabert and Barnwell assume that the delays for intra-processor and inter-processor communications are the same, which is an idealized scenario. Their approach is able to find an optimal implementation (minimal iteration interval) in exponential time in the worst case. In his doctoral thesis, Tirat-Gefen51 extended the SOS MILP model to solve for optimal macro-pipelined, application-specific heterogeneous multiprocessors.
He also proposed an integer-linear programming (ILP) model allowing simultaneous optimal retiming and processor/module selection in high-level and system-level synthesis.52 Verhaegh53 addresses the problem of periodic multidimensional scheduling. His thesis uses an ILP model to handle the design of homogeneous multiprocessors without communication costs implementing data-flow programs with nested loops. His work evaluates the complexity of the scheduling and allocation problems for the multidimensional case, which were both found to be NP-complete. Verhaegh proposes a set of heuristics to handle both problems. Passos and Sha54 evaluate the use of multi-dimensional retiming for synchronous data-flow graphs. However, their formalism can only be applied to homogeneous multiprocessors without communication costs.

The Preemptive Mode of Execution

Feng and Shin55 address the optimal static allocation of periodic tasks with precedence constraints and preemption on a homogeneous multiprocessor. Their approach has an exponential time complexity. Ramamritham56 developed a heuristic method that has a more reasonable computational cost. Rate-monotonic scheduling (RMS) is a commonly used method for allocating periodic real-time tasks in distributed systems.57 The same method can be used in homogeneous multiprocessors.
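As a brief illustration of rate-monotonic scheduling, the sketch below orders the periodic tasks assigned to one processor by period (shorter period, higher priority) and applies the classical Liu and Layland utilization bound, U ≤ n(2^(1/n) − 1). The sketch assumes independent preemptable tasks with deadlines equal to their periods; the test is sufficient but not necessary, so a task set that fails it may still be schedulable. The function name and task values are hypothetical.

```python
def rms_check(tasks):
    """tasks: list of (execution_time, period) pairs sharing one processor.

    Returns (tasks in priority order, utilization, bound, passes_bound).
    """
    n = len(tasks)
    by_priority = sorted(tasks, key=lambda task: task[1])  # shorter period first
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1.0 / n) - 1)
    return by_priority, utilization, bound, utilization <= bound


# Hypothetical task sets allocated to a single processor.
light = rms_check([(2, 10), (1, 4), (1, 5)])
heavy = rms_check([(2, 4), (2, 5)])
```

When tasks are allocated across several processors, a check of this kind can be applied to the task set on each processor separately.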

Genetic Algorithms

Genetic algorithms are becoming an important tool for solving the highly non-linear problems related to system-level synthesis. The use of genetic algorithms in optimization is discussed in detail by Michalewicz,58 where formulations for problems such as bin packing, processor scheduling, traveling salesman, and system partitioning are outlined. Research applying genetic algorithms to system-level synthesis problems is starting to be published; examples include the following:

• Hou et al.59: scheduling of tasks in a homogeneous multiprocessor without communication costs;
• Wang et al.60: scheduling of tasks in heterogeneous multiprocessors with communication costs but not allowing cost vs. performance tradeoff, i.e., all processors have the same cost;
• Ravikumar and Gupta61: mapping of tasks into a reconfigurable homogeneous array processor without communication costs;
• Tirat-Gefen and Parker62: a genetic algorithm for design of application-specific heterogeneous multiprocessors (ASHM) with nonnegligible communications costs specified by a nonperiodic task-flow graph representing both control and data flow; and
• Tirat-Gefen51: a full set of genetic algorithms for system-level design of ASHMs incorporating new design features such as imprecise computation and probabilistic design.
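The general shape of such a genetic algorithm can be sketched as follows. This is a deliberately minimal illustration and does not reproduce any of the cited methods: the chromosome encodes only the task-to-module allocation, tasks are executed in a fixed topological order, communication costs are ignored, and all names and numbers are hypothetical.

```python
import random


def makespan(chromosome, durations, preds, order):
    """Decode {task: module} into a schedule and return its completion time."""
    module_free, finish = {}, {}
    for t in order:  # `order` must be a topological order of the task-flow graph
        m = chromosome[t]
        ready = max((finish[p] for p in preds.get(t, [])), default=0.0)
        start = max(ready, module_free.get(m, 0.0))
        finish[t] = start + durations[t][m]
        module_free[m] = finish[t]
    return max(finish.values())


def evolve(durations, preds, order, modules, generations=50, pop_size=20, seed=1):
    rng = random.Random(seed)

    def fitness(chromosome):
        return makespan(chromosome, durations, preds, order)

    population = [{t: rng.choice(modules) for t in order} for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = {t: (a if rng.random() < 0.5 else b)[t] for t in order}
            child[rng.choice(order)] = rng.choice(modules)  # single-gene mutation
            children.append(child)
        population = parents + children
    return min(population, key=fitness)


# Hypothetical heterogeneous example: task times differ per module.
durations = {
    "t1": {"cpu": 4.0, "dsp": 2.0}, "t2": {"cpu": 3.0, "dsp": 5.0},
    "t3": {"cpu": 2.0, "dsp": 6.0}, "t4": {"cpu": 1.0, "dsp": 3.0},
}
preds = {"t3": ["t1"], "t4": ["t2", "t3"]}
order = ["t1", "t2", "t3", "t4"]
best = evolve(durations, preds, order, modules=["cpu", "dsp"])
best_time = makespan(best, durations, preds, order)
```

Because the better half of each generation is carried over unchanged, the best makespan never gets worse from one generation to the next. A fuller encoding would also carry scheduling information, for example real-valued start times or priorities, in the genes.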

Imprecise Computation

The main results in imprecise computation theory are due to Liu et al.,63 who developed polynomial-time algorithms for optimal scheduling of preemptive tasks on homogeneous multiprocessors without communications costs. Ho et al.64 proposed an approach to minimize the total error, where the error of a task being imprecisely executed is proportional to the amount of time that its optional part was not allowed to execute, i.e., the time still needed for its full completion. Polynomial-time optimal algorithms were derived for some instances of the problem.63 Tirat-Gefen et al.65 presented a new approach for application-specific, heterogeneous multiprocessor design that allows tradeoffs between cost, performance, and data quality through incorporation of imprecise computation into the system-level design cycle.
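The error measure described above can be written down directly. In the following sketch each task has an optional part of known length, the schedule grants some of that time, and the error of the task is a weight times the optional time still outstanding. The per-task weights and the example values are hypothetical and serve only to illustrate the definition.

```python
def total_error(tasks):
    """tasks: {name: (optional_time, granted_time, weight)}.

    The error of a task is weight * (optional time not executed).
    """
    error = 0.0
    for optional, granted, weight in tasks.values():
        remaining = max(optional - granted, 0.0)  # time still needed for full completion
        error += weight * remaining
    return error


# Hypothetical schedule outcome for three tasks.
outcome = {
    "filter": (4.0, 4.0, 1.0),  # optional part ran to completion: no error
    "fft": (3.0, 1.0, 2.0),     # 2 time units missing, weight 2
    "scale": (2.0, 0.0, 1.0),   # optional part skipped entirely
}
error = total_error(outcome)
```

A scheduler that minimizes total error would choose the granted times, subject to timing constraints, so that this sum is as small as possible.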

Probabilistic Models and Stochastic Simulation

Many probabilistic models for solving different subproblems in digital design have been proposed recently. The problem of task and data-transfer scheduling on a multiprocessor when some tasks (data transfers) have nondeterministic execution times (communication times) can be modeled by PERT networks, which were introduced by Malcolm et al.66 along with the critical path method (CPM) analysis methodology. A survey of PERT networks and their generalization to conditional PERT networks is given by Elmaghraby.67 In system-level design, the completion time of a PERT network corresponds to the system latency, whose cumulative distribution is a nonlinear function of the probability density distributions of the computation times of the tasks and the communication times of the data transfers in the task-flow graph. The exact computation of the cumulative probability distribution function (c.d.f.) of the completion time is computationally expensive for large PERT networks; it is therefore important to find approaches that approximate the expected value of the completion time and its c.d.f.

One of the first of these approaches was due to Fulkerson,68 who derived an algorithm to find a tight estimate (lower bound) of the expected value of the completion time. Robillard and Trahan69 proposed a different method using the characteristic function of the completion time in approximating the c.d.f. of the completion time. Mehrotra et al.70 proposed a heuristic for estimating the moments of the probabilistic distribution of the system latency tc. Kulkarni and Adlakha71 developed an approach based on Markov processes for the same problem. Hagstrom72 introduced an exact solution for the problem when the random variables modeling the computation and communication times are finite discrete random variables. Kamburowski73 developed a tight upper bound on the expected completion time of a PERT network. An approach using random graphs to model distributed computations was introduced by Indurkhya et al.,23 whose theoretical results were improved by Nicol.24 Purushotaman and Subrahmanyam74 proposed formal methods applied to concurrent systems with a probabilistic behavior. An example of modeling using queueing networks instead of PERT networks is given by Thomasian and Bay.75 Estimating errors due to the use of PERT assumptions in scheduling problems is discussed by Lukaszewicz.76 Tirat-Gefen developed a set of genetic algorithms using stratified stochastic sampling allowing simultaneous probabilistic optimization of the scheduling and allocation of tasks and communications on application-specific heterogeneous multiprocessors with nonnegligible communication costs.51
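When an exact c.d.f. is too expensive, the latency distribution of a small PERT network can be approximated by simulation. The sketch below uses plain Monte Carlo sampling, which is a simpler technique than the stratified sampling or the analytical bounds cited above. It assumes independent, uniformly distributed task times and a graph supplied in topological order, and communication times could be modeled as additional nodes. All names and ranges are hypothetical.

```python
import random
from bisect import bisect_right


def completion_time(durations, preds, order):
    """Latency of one realization; `order` must be a topological order."""
    finish = {}
    for t in order:
        start = max((finish[p] for p in preds.get(t, [])), default=0.0)
        finish[t] = start + durations[t]
    return max(finish.values())


def sample_latencies(ranges, preds, order, samples=2000, seed=0):
    """Draw each duration uniformly from its (low, high) range; return sorted latencies."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(samples):
        durations = {t: rng.uniform(lo, hi) for t, (lo, hi) in ranges.items()}
        latencies.append(completion_time(durations, preds, order))
    return sorted(latencies)


def empirical_cdf(sorted_latencies, x):
    """Fraction of sampled latencies that are <= x."""
    return bisect_right(sorted_latencies, x) / len(sorted_latencies)


# Hypothetical four-task network with uncertain task times.
ranges = {"a": (2.0, 4.0), "b": (1.0, 3.0), "c": (3.0, 5.0), "d": (1.0, 2.0)}
preds = {"c": ["a"], "d": ["a", "b"]}
latencies = sample_latencies(ranges, preds, order=["a", "b", "c", "d"])
expected_latency = sum(latencies) / len(latencies)
```

Calling `empirical_cdf(latencies, x)` then estimates the probability that the system latency does not exceed `x`, with accuracy improving as the number of samples grows.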

Performance Bounds Theory and Prediction

Sastry77 developed a stochastic approach for estimation of wireability (routability) for gate arrays. Kurdahi78 created a discrete probabilistic model for area estimation of VLSI chips designed according to a standard cell methodology. Küçükçakar79 introduced a method for partitioning of behavioral specifications onto multiple VLSI chips using probabilistic area/performance predictors integrated into a package called BEST (Behavioral ESTimation). BEST provides a range of prediction techniques that can be applied at the algorithm level and includes references to prior research. These predictors provide information required by Tirat-Gefen's system-level probabilistic optimization methods.51

Lower bounds on the performance and execution time of task-flow graphs mapped to a set of available processors and communication links were developed by Liu and Liu80 for the case of heterogeneous processors but no communication costs and by Hwang et al.81 for homogeneous processors with communication costs. Tight lower bounds on the number of processors and execution time for the case of homogeneous processors in the presence of communication costs were developed by Al-Mouhamed.82 Yen and Wolf83 provide a technique for performance estimation for real-time distributed systems.

At the system and register-transfer level, estimating power consumption by the interconnect is important.84 Wadekar et al.85 reported "Freedom," a tool to estimate system energy and power that accounts for functional-resource, register, multiplexer, memory, input/output pads, and interconnect power. This tool employs a statistical estimation technique to associate low-level, technology-dependent, physical and electrical parameters with expected circuit resources and interconnection. At the system level, "Freedom" generates predictions with high accuracy by deriving an accurate model of the load capacitance for the given target technology, a task reported as critical in high-level power prediction by Brand and Visweswariah.86 Methods to estimate power consumption prior to high-level synthesis were also investigated by Mehra and Rabaey.87 Liu and Svensson88 reported a technique to estimate power consumption in CMOS VLSI chips. The reader is referred to an example publication that reports power prediction and optimization techniques at the register-transfer level.89

Word Length Selection

Many researchers have studied word-length optimization techniques at the register-transfer level. A few example publications are cited here. These techniques can be classified as statistical techniques applied to digital filters,90 simulated annealing-based optimization of filters,91 and simulation-based optimization of filters, digital communication, and signal processing systems.92 Sung and Kum reported a simulation-based word-length optimization technique for fixed-point digital signal processing systems.93 The objective of these particular architecture-level techniques is to minimize the number of bits in the design, which is related to, but not the same as, the overall hardware cost.

References

1. IEEE Standard VHDL Language Reference Manual, IEEE Std. 1076, IEEE Press, New York, 1987.
2. Bhasker, J., A Verilog HDL primer, Star Galaxy Press, 1997.
3. Bell, G. and Newell, A., Computer Structures: Readings and Examples, McGraw Hill, New York, 1971.
4. Harel, D., Statecharts: a visual formalism for complex systems, Sci. Comput. Progr., 8, 231, 1987.
5. Vahid, F., Narayan, S., and Gajski, D. D., SpecCharts: a VHDL front-end for embedded systems, IEEE Trans. CAD, 14, 694, 1995.
6. Wadekar, S. A. and Parker, A. C., Accuracy sensitive word-length selection for algorithm optimization, Proc. Int. Conf. Circuit Design [ICCD], 54, 1998.
7. Wadekar, S. A. and Parker, A. C., Algorithm-level verification of arithmetic-intensive application-specific hardware designs for computation accuracy, in Digest Third International High Level Design Validation and Test Workshop, 1998.
8. Gupta, P., Chen, C. T., DeSouza-Batista, J. C., and Parker, A. C., Experience with image compression chip design using unified system construction tools, Proc. 31st Design Automation Conf., 1994.
9. Granacki, J. and Parker, A. C., PHRAN-Span: a natural language interface for system specifications, Proc. 24th Design Automation Conf., 416, 1987.
10. Cyre, W. R., Armstrong, J. R., and Honcharik, A. J., Generating simulation models from natural language specifications, Simulation, 65, 239, 1995.
11. Tanir, O. and Agarwal, V. K., A specification-driven architectural design environment, Computer, 6, 26, 1995.
12. Gajski, D. D., Vahid, F., Narayan, S., and Gong, J., Specification and Design of Embedded Systems, Prentice Hall, Englewood Cliffs, NJ, 1994.
13. de Jong, G. and Lin, B., A communicating Petri net model for the design of concurrent asynchronous modules, ACM/IEEE Design Automation Conf., June 1994.
14. Srivastava, M. B., Richards, B. C., and Brodersen, R. W., System level hardware module generation, IEEE Trans. Very Large Scale Integration [VLSI] Syst., 3, 20, 1995.
15. Lagnese, E. and Thomas, D., Architectural partitioning for system level synthesis of integrated circuits, IEEE Trans. Comput.-Aided Design, 1991.
16. Vahid, F., A Survey of Behavioral-Level Partitioning Systems, Technical Report TR ICS 91-71, University of California, Irvine, CA, 1991.
17. Kucukcakar, K. and Parker, A. C., Chop: a constraint-driven system-level partitioner, Proc. 28th Design Automation Conf., 514, 1991.
18. Vahid, F. and Gajski, D. D., Specification partitioning for system design, Proc. 29th Design Automation Conf., 1992.
19. Parker, A. C., Chen, C.-T., and Gupta, P., Unified system construction, Proc. SASIMI Conf., 1993.
20. Bokhari, S. H., Assignment problems in parallel and distributed computing, Kluwer Academic Publishers, 1987.


21. Stone, H. S. and Bokhari, S. H., Control of distributed processes, Computer, 11, 97, 1978.
22. Haddad, E. K., Optimal Load Allocation for Parallel and Distributed Processing, Technical Report TR 89-12, Department of Computer Science, Virginia Polytechnic Institute and State University, April 1989.
23. Indurkhya, B., Stone, H. S., and Cheng, L. X., Optimal partitioning of randomly generated distributed programs, IEEE Trans. Software Eng., SE-12, 483, 1986.
24. Nicol, D. M., Optimal partitioning of random programs across two processors, IEEE Trans. Software Eng., 15, 134, 1989.
25. Lee, C. Y., Hwang, J. J., Chow, Y. C., and Anger, F. D., Multiprocessor scheduling with interprocessor communication delays, Operations Res. Lett., 7, 141, 1988.
26. Tirat-Gefen, Y. G., Silva, D. C., and Parker, A. C., Incorporating imprecise computation into system-level design of application-specific heterogeneous multiprocessors, in Proc. 34th Design Automation Conf., 1997.
27. DeSouza-Batista, J. C. and Parker, A. C., Optimal synthesis of application specific heterogeneous pipelined multiprocessors, Proc. Int. Conf. Appl.-Specific Array Process., 1994.
28. Mehrotra, R. and Talukdar, S. N., Scheduling of Tasks for Distributed Processors, Technical Report DRC-18-68-84, Design Research Center, Carnegie-Mellon University, December 1984.
29. Agrawal, R. and Jagadish, H. V., Partitioning techniques for large-grained parallelism, IEEE Trans. Comput., 37, 1627, 1988.
30. Barthou, D., Gasperoni, F., and Schwiegelshon, U., Allocating communication channels to parallel tasks, in Environments and Tools for Parallel Scientific Computing, Elsevier Science Publishers B.V., 275, 1993.
31. Linsky, V. S. and Kornev, M. D., Construction of optimum schedules for parallel processors, Eng. Cybernet., 10, 506, 1972.
32. Chu, W. W., Hollaway, L. J., and Efe, K., Task allocation in distributed data processing, Computer, 13, 57, 1980.
33. Prakash, S. and Parker, A. C., SOS: synthesis of application specific heterogeneous multiprocessor systems, J. Parallel Distrib. Comput., 16, 338, 1992.
34. Prakash, S., Synthesis of Application-Specific Multiprocessor Systems, Ph.D. thesis, Department of Electrical Engineering and Systems, University of Southern California, Los Angeles, January 1994.
35. Hafer, L. and Parker, A., Automated synthesis of digital hardware, IEEE Trans. Comput., C-31, 93, 1981.
36. Fernandez, E. B. and Bussel, B., Bounds on the number of processors and time for multiprocessor optimal schedules, IEEE Trans. Comput., C-22, 745, 1975.
37. Garey, M. R. and Graham, R. L., Bounds for multiprocessor scheduling with resource constraints, SIAM J. Comput., 4, 187, 1975.
38. Jaffe, J. M., Bounds on the scheduling of typed task systems, SIAM J. Comput., 9, 541, 1991.
39. Gupta, R. and Zorian, Y., Introducing core-based system design, IEEE Design Test Comput., Oct.-Dec., 15, 1997.
40. Li, Y. and Wolf, W., A task-level hierarchical memory model for system synthesis of multiprocessors, Proc. Design Automation Conference, 1997, 153.
41. Design and Test, special issue on rapid prototyping, 13, 3, 1996.
42. Birmingham, W. and Siewiorek, D., MICON: a single board computer synthesis tool, Proc. 21st Design Automation Conf., 1984.
43. Chen, C-T, System-Level Design Techniques and Tools for Synthesis of Application-Specific Digital Systems, Ph.D. thesis, Department of Electrical Engineering and Systems, University of Southern California, Los Angeles, January 1994.
44. Heo, D. H., Ravikumar, C. P., and Parker, A., Rapid synthesis of multi-chip systems, Proc. 10th Int. Conf. VLSI Design, 62, 1997.
45. Lee, E. A. and Bier, J. C., Architectures for statically scheduled dataflow, J. Parallel Distrib. Comput., 10, 333, 1990.


46. Kalavede, A., Pino, J. L., and Lee, E. A., Managing complexity in heterogeneous system specification, simulation and synthesis, Proc. Int. Conf. Acoustics, Speech, Signal Process. (ICASSP), May, 1995. 47. Gajski, D. D., Vahid, F., and Narayan, S., A design methodology for system-specification refinement, Proc. European Design Automation Conf., 458, 1994. 48. Renfors, M. and Neuvo, Y., The maximum sampling rate of digital filters under hardware speed constraints, IEEE Trans. Circuits Syst., CAS-28, 196, 1981. 49. Wang, D. J. and Hu, Y. H., Multiprocessor implementation of real-time DSP algorithms, IEEE Trans. Very Large Scale Integration (VLSI) Syst., 3, 393, 1995. 50. Gelabert, P. R. and Barnwell, T. P., Optimal automatic periodic multiprocessor scheduler for fully specified flow graphs, IEEE Trans. Signal Process., 41, 858, 1993. 51. Tirat-Gefen, Y.G., Theory and Practice in System-Level of Application Specific Heterogeneous Multiprocessors, Ph.D. dissertation, Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, 1997. 52. CasteloVide-e-Souza, Y.G., Potkonjak, M., and Parker, A.C., Optimal ILP-based approach for throughput optimization using algorithm/architecture matching and retiming, Proc. 32nd Design Automation Conf., June 1995. 53. Verhauger, W.F., Multidimensional Periodic Scheduling, Ph.D. thesis, Eindhoven University of Technology, Holland, 1995. 54. Passos, N. L., Sha, E. H., and Bass S. C., Optimizing DSP flow-graphs via schedule-based multidimensional retiming, IEEE Trans. Signal Process., 44, 150, 1996. 55. Feng, D. T. and Shin, K. G., Static allocation of periodic tasks with precedence constraints in distributed real-time systems, Proc. 9th Int. Conf. Distrib. Comput., 190, 1989. 56. Ramamritham, K., Allocation and scheduling of precedence-related periodic tasks, IEEE Trans. Parallel Distrib. Syst., 6, 1995. 57. Ramamritham, K., Stankovic, J. 
A., and Shiah, P.F., Efficient scheduling algorithms for real-time multiprocessors systems, IEEE Trans. Parallel Distrib. Syst., 1, 184, 1990. 58. Michalewicz, Z., Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, Berlin, 1994. 59. Hou, E.S.H, Ansari, N., and Ren, H., A Genetic algorithm for multiprocessor scheduling, IEEE Trans. Parallel Distrib. Syst., 5, 113, 1994. 60. Wang, L., Siegel, H. J., and Roychowdhury, V. P., A genetic-algorithm-based approach for task matching and scheduling in heterogeneous computing environments, Proc. Heterogeneous Comput. Workshop, Int. Parallel Process. Symp., 72, 1996. 61. Ravikumar, C. P. and Gupta, A., Genetic algorithm for mapping tasks onto a reconfigurable parallel processor, IEE Proc. Comput. Digital Tech., 142, 81, 1995. 62. Tirat-Gefen, Y. G. and Parker, A. C., MEGA: an approach to system-level design of applicationspecific heterogeneous multiprocessors, Proc. Heterogeneous Comput. Workshop, Int. Parallel Process. Symp., 105, 1996. 63. Liu, J. W. S., Lin, K.-J., Shih, W.-K., Yu, A. C.-S., Chung, J.-Y., and Zhao, W., Algorithms for scheduling imprecise computations, IEEE Comput., 24, 58, 1991. 64. Ho, K., Leung, J. Y-T. and Wei, W-D., Minimizing Maximum Weighted Error for Imprecise Computation Tasks, Technical Report UNL-CSE-92-017, Department of Computer Science and Engineering, University of Nebraska, Lincoln, 1992. 65. Tirat-Gefen, Y. G., Silva, D. C., and Parker, A. C., Incorporating imprecise computation into systemlevel design of application-specific heterogeneous multiprocessors, Proc. 34th. Design Automation Conf., 1997. 66. Malcolm, D. G., Roseboom, J. H., Clark, C. E., and Fazar, W., Application of a technique for research and development program evaluation, Oper. Res., 7, 646, 1959. 67. Elmaghraby, S. E., The theory of networks and management science: part II, Manage. Sci., 17, B.54, 1970. 68. Fulkerson, D. R., Expected critical path lengths in pert networks, Oper. Res., 10, 808, 1962.


69. Robillard, P. and Trahan, M., The completion time of PERT networks, Oper. Res., 25, 15, 1977. 70. Mehrotra, K., Chai, J., and Pillutla, S., A Study of Approximating the Moments of the Job Completion Time in PERT Networks, Technical Report, School of Computer and Information Science, Syracuse University, New York, 1991. 71. Kulkarni, V. G. and Adlakha, V. G., Markov and Markov-regenerative pert networks, Oper. Res., 34, 769, 1986. 72. Hagstrom, J. N., Computing the probability distribution of project duration in a PERT network, Networks, 20, John Wiley & Sons, New York, 1990, 231. 73. Kamburowski, J., An upper bound on the expected completion time of PERT networks, Eur. J. Oper. Res., 21, 206, 1985. 74. Purushothaman, S. and Subrahmanyam, P. A., Reasoning about probabilistic behavior in concurrent systems, IEEE Trans. Software Eng., SE-13, 740, 1987. 75. Thomasian, A., Analytic queueing network models for parallel processing of task systems, IEEE Trans. Comput., C-35, 1045, 1986 76. Lukaszewicz, J., On the estimation of errors introduced by standard assumptions concerning the distribution of activity duration in pert calculations, Oper. Res., 13, 326, 1965. 77. Sastry, S. and Parker, A. C., Stochastic models for wireability analysis of gate arrays, IEEE Trans. Comput.-Aided Design, CAD-5, 1986. 78. Kurdahi, F. J., Techniques for area estimation of VLSI layouts, IEEE Trans. Comput.-Aided Design, 8, 81, 1989. 79. Küçükçakar, K. and Parker, A. C., A methodology and design tools to support system-level VLSI design, IEEE Trans. Very Large Scale Integration [VLSI] Syst., 3, 355, 1995. 80. Liu, J. W. S. and Liu, C. L., Performance analysis of multiprocessor systems containing functionally dedicated processors, Acta Informatica, 10, 95, 1978 81. Hwang, J. J., Chow, Y. C., Ahnger, F. D., and Lee, C. Y., Scheduling precedence graphs in systems with interprocessor communication times, SIAM J. Comput., 18, 244, 1989. 82. 
Mouhamed, M., Lower bound on the number of processors and time for scheduling precedence graphs with communication costs, IEEE Trans. Software Eng., 16, 1990. 83. Yen, T.-Y. and Wolf, W., Performance estimation for real-time embedded systems, Proc. Int. Conf. Comput. Design, 64, 1995. 84. Landman, P. E. and Rabaey, J. M., Activity-sensitive architectural power analysis, IEEE Trans. on CAD, 15, 571, 1996. 85. Wadekar, S. A., Parker, A. C., and Ravikumar, C. P., FREEDOM: statistical behavioral estimation of system energy and power, Proc. Eleventh Int. Conf. on VLSI Design, 30, 1998. 86. Brand, D. and Visweswariah, C., Inaccuracies in power estimation during logic synthesis, Proc. Eur. Design Automation Conf. (EURO-DAC), 388, 1996. 87. Mehra, R. and Rabaey, J., Behavioral level power estimation and exploration, Proc. First Int. Workshop Low Power Design, 197, 1994. 88. Liu, D. and Svensson, C., Power consumption estimation in CMOS VLSI chips, IEEE J. Solid-State Circuits, 29, 663, 1994. 89. Landman, P. E. and Rabaey, J. M., Activity-sensitive architectural power analysis, IEEE Trans. Comput.-Aided Design, 15, 571, 1996. 90. Zeng, B. and Neuvo, Y., Analysis of floating point roundoff errors using dummy multiplier coefficient sensitivities, IEEE Trans. Circuits Syst., 38, 590, 1991. 91. Catthoor, F., Vandewalle, J., and De Mann, H., Simulated annealing based optimization of coefficient and data word lengths in digital filters, Int. J. Circuit Theor. Appl., 16, 371, 1988. 92. Grzeszczak, A., Mandal, M. K., Panchanathan, S. and Yeap, T., VLSI implementation of discrete wavelet transform, IEEE Trans. VLSI Syst., 4, 421, 1996. 93. Sung, W. and Kum, Ki-II., Simulation-based word-length optimization method for fixed-point digital signal processing systems, IEEE Trans. Signal Process., 43, 3087, 1995.


Bhasker, J. "Synthesis at the Register Transfer Level and the Behavioral Level" The VLSI Handbook. Ed. Wai-Kai Chen Boca Raton: CRC Press LLC, 2000


75 Synthesis at the Register Transfer Level and the Behavioral Level

J. Bhasker
Cadence Design Systems

75.1 Introduction
75.2 The Two HDLs
75.3 The Three Different Domains of Synthesis
75.4 RTL Synthesis
     Combinational Logic • Sequential Logic
75.5 Modeling a Three-State Gate
75.6 An Example
75.7 Behavioral Synthesis
     Scheduling • ALU Allocation • Register Allocation
75.8 Conclusion

75.1 Introduction

This chapter provides a tutorial overview of register transfer level (RTL) synthesis and behavioral synthesis, contrasting the two with examples. The examples are written in VHDL and Verilog HDL, the two dominant hardware description languages (HDLs) in the industry today. The chapter first describes the characteristics that distinguish RTL modeling from behavioral-level modeling. It then uses both HDLs to illustrate how RTL models are mapped to hardware. Both combinational logic synthesis and sequential logic synthesis are presented, including how flip-flops, latches, and three-state gates are inferred from the RTL model. A finite state machine modeling example is also described.

The later part of the chapter presents the behavioral synthesis methodology, with examples that illustrate the flow of transformations that occur during the synthesis process. Many scheduling and resource allocation algorithms exist today; this chapter illustrates the basic ideas behind them. Examples show the architectural exploration that can be performed with behavioral synthesis, something that is not possible with register transfer level synthesis.

Synthesis is here! It has become an integral part of every design process. Once upon a time, all circuits were designed by hand, and logic gates and their interconnections were entered into a system using a schematic capture tool. This is no longer the norm. More and more designers are turning to synthesis because of the advantages it provides, for example, the ability to describe a design at a higher level of abstraction. Doing so requires a language in which to describe the design, and this is where a hardware description language comes in. A hardware description language is a formal language designed with the intent of describing hardware. Additionally, each language construct has a functional semantic associated with it that can be used to verify a design described in the HDL (a design described in an HDL is often called a "model"). The model also serves as a means of documenting the design.

75.2 The Two HDLs

The two dominant hardware description languages in use today are

  i. VHDL
  ii. Verilog HDL

VHDL originated from the Department of Defense through its VHSIC program and became a public domain standard in 1987, whereas Verilog HDL originated from a private company and became a public domain standard in 1995. Both languages are targeted at describing digital hardware. A design can be expressed structurally, in a dataflow style, or in a sequential behavior style. The key difference between the two languages is that VHDL extends the modeling to higher levels of data abstraction, provides for strong type checking, and supports the delta delay mechanism. An excellent introduction to both languages can be found in (Bhasker, 1995) and (Bhasker, 1997). The complete descriptions of the languages can be found in their respective language reference manuals (LRMs), (IEEE, 1993) and (IEEE, 1995).

Here is an example of a simple arithmetic logic unit described using both languages. The design is described using a mixed style: it contains structural components, dataflow, and sequential behavior.

    -- VHDL:
    library IEEE;
    use IEEE.STD_LOGIC_1164.all, IEEE.NUMERIC_STD.all;

    entity ALU is
      port (A, B: in UNSIGNED(3 downto 0);
            SEL: in STD_LOGIC_VECTOR(0 to 1);
            Z: out UNSIGNED(7 downto 0);
            ZComp: out BOOLEAN);
    end;

    architecture MIXED_STYLE of ALU is
      component MULTIPLIER
        port (PortA, PortB: in UNSIGNED (3 downto 0);
              PortC: out UNSIGNED (7 downto 0));
      end component;
      signal MulZ: UNSIGNED (7 downto 0);
    begin
      -- Dataflow part: concurrent signal assignment.
      ZComp <= A < B when SEL = "11" else FALSE;

      -- Structural part: component instantiation.
      M1: MULTIPLIER port map (PortA => A, PortB => B, PortC => MulZ);

      -- Sequential part: process statement.
      process (A, B, SEL, MulZ)
      begin
        Z <= (others => '0');
        case SEL is
          when "00" => Z(3 downto 0) <= A + B;
          when "01" => Z(3 downto 0) <= A - B;
          when "10" => Z <= MulZ;
          when others => Z <= (others => 'Z');
        end case;
      end process;
    end;


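    // Verilog:
    module ALU (A, B, SEL, Z, ZComp);
      input [3:0] A, B;
      input [0:1] SEL;
      output [7:0] Z;
      reg [7:0] Z;
      output ZComp;
      wire [7:0] MulZ;   // declared explicitly; an implicit net would be only 1 bit wide

      // Dataflow part: continuous assignment.
      assign ZComp = (SEL == 2'b11) ? A < B : 'b0;

      // Structural part: module instantiation with named port association.
      MULTIPLIER M1 (.PortA(A), .PortB(B), .PortC(MulZ));

      // Sequential part: always statement (MulZ is read, so it is in the event list).
      always @(A or B or SEL or MulZ)
      begin
        case (SEL)
          2'b00: Z = A + B;
          2'b01: Z = A - B;
          2'b10: Z = MulZ;
          default: Z = 'bz;
        endcase
      end
    endmodule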

In VHDL, the interface of the design (the entity declaration) is separate from the description of the design (the architecture body). Note that each signal is declared with a specific type (for example, UNSIGNED). This type is declared in the package NUMERIC_STD, which in turn is made visible to the design by the context clauses (the library and use clauses). The structural part is described using a component instantiation statement; a component declaration is required to specify the interface of the component. The dataflow part is specified using a concurrent signal assignment. The sequential part is specified using a process statement; this contains a case statement that switches to the appropriate branch based on the value of the case expression.

In the Verilog model, each bit of a variable can take one of four values: 0, 1, x, and z. The model shows the two main data types in Verilog: net and register (a wire is a net data type, while a reg is a register data type). The structural part is described using a module instantiation statement. Notice that named association is used to specify the connections between the ports of the module and the external nets to which they are connected. The dataflow part is modeled using the continuous assignment statement, while the sequential part is represented using the always statement.

In this chapter, we shall use both languages to illustrate the examples when describing synthesis.
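The four-value system and the net/register distinction described above can be observed in simulation. The following is a minimal sketch, not part of the original chapter; the module name FOUR_VALUES and the signals W, R, BUS, and EN are hypothetical names chosen for illustration.

```verilog
// Illustrative only: shows the default values of a net and a register,
// and a net driven to high impedance by a continuous assignment.
module FOUR_VALUES;
  wire W;            // net with no driver: reads as z
  reg R;             // register never assigned: reads as x
  reg EN;            // enable for the three-state driver below
  wire [3:0] BUS;    // net driven by a continuous assignment

  // When EN is 0 the driver releases the bus (all bits z).
  assign BUS = EN ? 4'b1010 : 4'bz;

  initial begin
    #1 $display("W=%b R=%b", W, R);     // W=z R=x
    EN = 0;
    #1 $display("BUS=%b", BUS);         // BUS=zzzz
    EN = 1;
    #1 $display("BUS=%b", BUS);         // BUS=1010
  end
endmodule
```

A net simply reflects whatever drives it, which is why the undriven wire reads z, while a reg holds its last assigned value and therefore starts out unknown (x).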

75.3 The Three Different Domains of Synthesis

There are three distinct domains in synthesis:

  i. logic synthesis
  ii. RTL synthesis
  iii. behavioral synthesis

But first, the definition (at least the author's) of synthesis: Synthesis is the process of transforming an HDL description of a design into logic gates.

The synthesis process itself, starting from HDL, involves a number of tasks that need to be performed. These tasks may or may not be distinct in synthesis tools (Fig. 75.1). Starting from an HDL description, synthesis generates a technology-independent RTL netlist (RTL blocks interconnected by nets). Based on the target technology and design constraints, such as area and delay, the module builder generates a technology-specific gate-level netlist. A logic optimizer further optimizes the logic to meet the design constraints and goals, such as area and delay. The synthesis process may bypass the module build phase and directly generate a gate-level netlist if there are no RTL blocks in the design.
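To make the definition concrete, the sketch below pairs a small HDL description with one gate-level netlist that implements the same function, and checks the two against each other in simulation. It is an illustration only, not taken from the chapter: the module names are hypothetical, and the netlist a real tool produces depends on its target library and constraints.

```verilog
// HDL description, as a designer might write it.
module HALF_ADD_HDL (A, B, SUM, CARRY);
  input A, B;
  output SUM, CARRY;
  assign {CARRY, SUM} = A + B;   // 1-bit add with a 2-bit result
endmodule

// One possible gate-level result, built from Verilog gate primitives.
module HALF_ADD_GATES (A, B, SUM, CARRY);
  input A, B;
  output SUM, CARRY;
  xor G1 (SUM, A, B);
  and G2 (CARRY, A, B);
endmodule

// Exhaustive comparison of the two descriptions.
module HALF_ADD_CHECK;
  reg A, B;
  wire S1, C1, S2, C2;
  integer I, ERRORS;

  HALF_ADD_HDL   U1 (.A(A), .B(B), .SUM(S1), .CARRY(C1));
  HALF_ADD_GATES U2 (.A(A), .B(B), .SUM(S2), .CARRY(C2));

  initial begin
    ERRORS = 0;
    for (I = 0; I < 4; I = I + 1) begin
      {A, B} = I;                // low two bits of I
      #1;
      if (S1 !== S2 || C1 !== C2) begin
        ERRORS = ERRORS + 1;
        $display("Mismatch for A=%b B=%b", A, B);
      end
    end
    $display("Mismatches: %0d", ERRORS);
  end
endmodule
```

Comparing the pre-synthesis model against the post-synthesis netlist in this way is a common sanity check, since both should produce the same outputs for every input combination.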


FIGURE 75.1 The tasks involved in the synthesis process.

In this chapter, we will not discuss logic optimization and module building. One source that describes the algorithms behind logic optimization is (DeMicheli, 1994).

Coming back to the three synthesis domains, let us briefly explore them. The level of abstraction increases as we go from the logic level to the behavioral level. Conversely, the degree of structural inference decreases as we go from the logic level to the behavioral level (Fig. 75.2).

FIGURE 75.2 Varying levels of abstraction.

In the logic synthesis domain, a design is described in terms of Boolean equations. Components may be instantiated to describe hierarchy, or they may be lower-level primitives such as flip-flops. Here is a logic synthesis model for an incrementor whose output is latched.

    -- VHDL:
    library IEEE;
    use IEEE.STD_LOGIC_1164.all;
    use IEEE.NUMERIC_STD.all;

    entity INCREMENT is
      port (A: in UNSIGNED(0 to 2);
            CLOCK: in STD_LOGIC;
            Z: out UNSIGNED(0 to 2));
    end;

    architecture LOGIC_LEVEL of INCREMENT is
      component FD1S3AX
        port (DATA, CLK: in STD_LOGIC;
              Q: out STD_LOGIC);
      end component;
      signal DZ1, DZ2, DZ0, S429, A1BAR: STD_LOGIC;


begin DZ1