Embedded Multiprocessors: Scheduling and Synchronization (Signal Processing and Communications)




Embedded Multiprocessors
Scheduling and Synchronization

Sundararajan Sriram
Shuvra S. Bhattacharyya

MARCEL DEKKER, INC.

NEW YORK • BASEL

2000 00-0~2900 This book is printed on acid-free paper.

Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit): 10 9 8 7 6 5 4 3 2 1

To my parents, and Uma
Sundararajan Sriram

To Arundhati
Shuvra S. Bhattacharyya


Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, and image, audio, and multimedia processing, and shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications; signal processing is everywhere in our lives.

When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way. When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline.

Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:

  • Signal theory and analysis
  • Statistical signal processing
  • Speech and audio processing
  • Image and video processing
  • Multimedia signal processing and technology
  • Signal processing for communications
  • Signal processing architectures and VLSI design

I hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.

K. J. Ray Liu


Embedded systems are computers that are not first and foremost computers. They are pervasive, appearing in automobiles, telephones, pagers, consumer electronics, toys, aircraft, trains, security systems, weapons systems, printers, modems, copiers, thermostats, manufacturing systems, appliances, etc. A technically active person today probably interacts regularly with more embedded systems than conventional computers. This is a relatively recent phenomenon. Not so long ago, automobiles depended on finely tuned mechanical systems for the timing of ignition and its synchronization with other actions. It was not so long ago that modems were finely tuned analog circuits.

Embedded systems usually encapsulate domain expertise. Even small software programs may be very sophisticated, requiring deep understanding of the domain and of supporting technologies such as signal processing. Because of this, such systems are often designed by engineers who are classically trained in the domain, for example, in internal combustion engines or in communication theory. They have little background in the theory of computation, parallel computing, and concurrency theory. Yet they face one of the most difficult problems addressed by these disciplines, that of coordinating multiple concurrent activities in real time, often in a safety-critical environment. Moreover, they face these problems in a context that is often extremely cost-sensitive, mandating optimal designs, and time-critical, mandating rapid designs.

Embedded software is unique in that parallelism is routine. Most modems and cellular telephones, for example, incorporate multiple programmable processors. Moreover, embedded systems typically include custom digital and analog hardware that must interact with the software, usually in real time. That hardware operates in parallel with the processor that runs the software, and the software must interact with it much as it would interact with another software process running in parallel. Thus, in having to deal with real-time issues and parallelism, the designers of embedded software face on a daily basis problems that occur only in esoteric research in the broader field of computer science.

Computer scientists refer to the use of physically distinct computational resources (processors) as “parallelism,” and to the logical property that multiple activities occur at the same time as “concurrency.” Parallelism implies concurrency, but the reverse is not true. Almost all operating systems deal with concurrency, which is managed by multiplexing multiple processes or threads on a single processor. A few also deal with parallelism, for example by mapping threads onto physically distinct processors. Typical embedded systems exhibit both concurrency and parallelism, but their context is different from that of general-purpose operating systems in many ways.

In embedded systems, concurrent tasks are often statically defined, largely unchanging during the lifetime of the system. A cellular phone, for example, has distinct modes of operation (dialing, talking, standby, etc.), and in each mode of operation a well-defined set of tasks is concurrently active (speech encoding, etc.). The static structure of the concurrency permits much more detailed analysis and optimization than would be possible in a more dynamic environment. This book is about such analysis and optimization.

The ordered transaction strategy, for example, leverages that relatively static structure of embedded software to dramatically reduce the synchronization overhead of communication between processors. It recognizes that embedded software is intrinsically less predictable than hardware and more predictable than general-purpose software. Indeed, minimizing synchronization overhead by leveraging static information about the application is the major theme of this book.

In general-purpose computation, communication is relatively expensive. Consider for example the interface between the audio hardware and the software of a typical personal computer today. Because the transaction costs are extremely high, data is extensively buffered, resulting in extremely long latencies. A path from the microphone of a PC into the software and back out to the speaker typically has latencies of hundreds of milliseconds. This severely limits the utility of the audio hardware of the computer. Embedded systems cannot tolerate such latencies.

A major theme of this book is communication between components. The methods given in the book are firmly rooted in manipulable and tractable formalisms, and yet are directly applied to hardware design. The closely related IPC (interprocessor communication) graph and synchronization graph models, introduced in Chapters 7 and 9, capture the essential properties of this communication. By use of graph-theoretic properties of IPC and synchronization graphs, optimization problems are formulated and solved. For example, the notion of resynchronization, where explicit synchronization operations are minimized through manipulation of the synchronization graph, proves to be an effective optimization tool.

In some ways, embedded software has more in common with hardware than with traditional software. Hardware is highly parallel. Conceptually, hardware is an assemblage of components that operate continuously or discretely in time and interact via synchronous or asynchronous communication. Software is an assemblage of components that trade off use of a CPU, operating sequentially, and communicating by leaving traces of their (past and completed) execution on a stack or in memory. Hardware is temporal. In the extreme case, analog hardware operates in a continuum, a computational medium that is totally beyond the reach of software. Communication is not just synchronous; it is physical and fluid. Software is sequential and discrete. Concurrency in software is about reconciling sequences. Concurrency in hardware is about reconciling signals. This book examines parallel software from the perspective of signals, and identifies joint hardware/software designs that are particularly well-suited for embedded systems.

The primary abstraction mechanism in software is the procedure (or the method in object-oriented designs). Procedures are terminating computations. The primary abstraction mechanism in hardware is a module that operates in parallel with the other components. These modules represent non-terminating computations. These are very different abstraction mechanisms. Hardware modules do not start, execute, complete, and return. They just are. In embedded systems, software components often have the same property. They do not terminate.

Conceptually, the distinction between hardware and software, from the perspective of computation, has only to do with the degree of concurrency and the role of time. An application with a large amount of concurrency and a heavy temporal content might as well be thought of using the abstractions that have been successful for hardware, regardless of how it is implemented. An application that is sequential and ignores time might as well be thought of using the abstractions that have succeeded for software, regardless of how it is implemented. The key problem becomes one of identifying the appropriate abstractions for representing the design. This book identifies abstractions that work well for the joint design of embedded software and the hardware on which it runs.

The intellectual content in this book is high. While some of the methods it describes are relatively simple, most are quite sophisticated. Yet examples are given that concretely demonstrate how these concepts can be applied in practical hardware architectures. Moreover, there is very little overlap with other books on parallel processing. The focus on application-specific processors and their use in embedded systems leads to a rather different set of techniques. I believe that this book defines a new discipline. It gives a systematic approach to problems that engineers previously have been able to tackle only in an ad hoc manner.

Edward A. Lee
Professor
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
Berkeley, California

Software implementation of compute-intensive multimedia applications such as video conferencing systems, set-top boxes, and wireless mobile terminals and base stations is extremely attractive due to the flexibility, extensibility, and potential portability of programmable implementations. However, the data rates involved in many of these applications tend to be very high, resulting in relatively few processor cycles available per input sample for a reasonable processor clock rate. Employing multiple processors is usually the only means for achieving the requisite compute cycles without moving to a dedicated ASIC solution. With the levels of integration possible today, one can easily place four or more digital signal processors on a single die; such an integrated multiprocessor strategy is a promising approach for tackling the complexities associated with future systems-on-a-chip. However, it remains a significant challenge to develop software solutions that can effectively exploit such multiprocessor implementation platforms.

Due to the great complexity of implementing multiprocessor software, and the severe performance constraints of multimedia applications, the development of automatic tools for mapping high level specifications of multimedia applications into efficient multiprocessor realizations has been an active research area for the past several years. Mapping an application onto a multiprocessor system involves three main operations: assigning tasks to processors, ordering tasks on each processor, and determining the time at which each task begins execution. These operations are collectively referred to as scheduling the application on the given architecture. A key aspect of the multiprocessor scheduling problem for multimedia system implementation that differs from classical scheduling contexts is the central role of interprocessor communication: the efficient management of data transfer between communicating tasks that are assigned to different processors. Since the overall costs of interprocessor communication can have a dramatic impact on execution speed and power consumption, effective handling of interprocessor communication is crucial to the development of cost-effective multiprocessor implementations.

This book reviews important research in three key areas related to multiprocessor implementation of multimedia systems, and exposes important synergies between efforts related to these areas. Our areas of focus are the incorporation of interprocessor communication costs into multiprocessor scheduling decisions; a modeling methodology, called the “synchronization graph,” for multiprocessor system performance analysis; and the application of the synchronization graph model to the development of hardware and software optimizations that can significantly reduce the interprocessor communication overhead of a given schedule.

More specifically, this book reviews, in a unified manner, several important multiprocessor scheduling strategies that effectively incorporate the consideration of interprocessor communication costs, and highlights the variety of techniques employed in these multiprocessor scheduling strategies to take interprocessor communication into account. The book also reviews a body of research performed by the authors on modeling implementations of multiprocessor schedules, and on the use of these modeling techniques to optimize interprocessor communication costs. A unified framework is then presented for applying arbitrary scheduling strategies in conjunction with the application of alternative optimization algorithms that address specific subproblems associated with implementing a given schedule. We provide several examples of practical applications that demonstrate the relevance of the techniques described in this book.

We are grateful to the Signal Processing Series Editor Professor K. J. Ray Liu (University of Maryland, College Park) for his encouragement of this project, and to Executive Acquisition Editor B. Clark (Marcel Dekker, Inc.) for his coordination of the effort. It was a privilege for both of us to be students of Professor Edward A. Lee (University of California at Berkeley). Edward provided a truly inspiring research environment during our doctoral studies, and gave valuable feedback while we were developing many of the concepts that underlie this book. We also acknowledge helpful proofreading assistance from Nitin Chandrachoodan, Mukul Khandelia, and Vida Kianzad (University of Maryland at College Park); and enlightening discussions with [...] and Dick Stevens (U.S. Naval Research Laboratory), and Praveen Murthy (Angeles Design Systems). Financial support (for S. S. Bhattacharyya) for the development of this book was provided by the National Science Foundation.

Sundararajan Sriram

CONTENTS

Chapter 1
1.1 Multiprocessor DSP systems
1.2 Application-specific multiprocessors
1.3 Exploitation of parallelism
1.4 Dataflow modeling for DSP design
1.5 Utility of dataflow for DSP
1.6 Overview

Chapter 2
2.1 Parallel architecture classifications
2.2 Exploiting instruction level parallelism
  2.2.1 ILP in programmable DSP processors
  2.2.2 Sub-word parallelism
  2.2.3 [...] processors
2.3 Dataflow DSP architectures
2.4 Systolic and wavefront arrays
2.5 Multiprocessor DSP architectures
2.6 Single chip multiprocessors
2.7 Reconfigurable computing
2.8 Architectures that exploit predictable IPC
2.9 Summary

Chapter 3
3.1 Graph data structures
3.2 Dataflow graphs
3.3 Computation graphs
3.4 Petri nets
3.5 Synchronous dataflow
3.6 Analytical properties of SDF graphs
3.7 Converting a general SDF graph into a homogeneous SDF graph
3.8 Acyclic precedence expansion graph
3.9 Application graph
3.10 Synchronous languages
3.11 HSDFG concepts and notations
3.12 Complexity of algorithms
3.13 Shortest and longest paths in graphs
  3.13.1 Dijkstra’s algorithm
  3.13.2 The Bellman-Ford algorithm
  3.13.3 The Floyd-Warshall algorithm
3.14 Solving difference constraints using shortest paths
3.15 Maximum cycle mean
3.16 Summary

Chapter 4
4.1 Task-level parallelism and data parallelism
4.2 Static versus dynamic scheduling strategies
4.3 Fully-static schedules
4.4 Self-timed schedules
4.5 Dynamic schedules
4.6 Quasi-static schedules
4.7 Schedule notation
4.8 Unfolding HSDF graphs
4.9 Execution time estimates and static schedules
4.10 Summary

Chapter 5
5.1 Problem description
5.2 Stone’s assignment algorithm
5.3 List scheduling algorithms
  5.3.1 Graham’s bounds
  5.3.2 The basic algorithms: HLFET and ETF
  5.3.3 The mapping heuristic
  5.3.4 Dynamic level scheduling
  5.3.5 Dynamic critical path scheduling
5.4 Clustering algorithms
  5.4.1 Linear clustering
  5.4.2 Internalization
  5.4.3 Dominant sequence clustering
  5.4.4 Declustering
5.5 Integrated scheduling algorithms
5.6 Pipelined scheduling
5.7 Summary

Chapter 6
6.1 The ordered-transactions strategy
6.2 Shared bus architecture
6.3 Interprocessor communication mechanisms
6.4 Using the ordered-transactions approach
6.5 Design of an ordered memory access multiprocessor
  6.5.1 High level design description
  6.5.2 A modified design
6.6 Design details of a prototype
  6.6.1 Top level design
  6.6.2 Transaction order controller
  6.6.3 Host interface
  6.6.4 Processing element
  6.6.5 FPGA circuitry
  6.6.6 Shared memory
  6.6.7 Connecting multiple boards
6.7 Hardware and software implementation
  6.7.1 Board design
  6.7.2 Software interface
6.8 Ordered [...] and parameter control
6.9 Application examples
  6.9.3 1024 point complex Fast Fourier Transform (FFT)
6.10 Summary

Chapter 7
7.1 Inter-processor communication graph (Gipc)
7.2 Execution time estimates
7.3 Ordering constraints viewed as edges added to Gipc
7.4 Periodicity
7.5 Optimal order
7.6 Effects of changes in execution times
  7.6.1 Deterministic case
  7.6.2 Modeling run-time variations in execution times
  7.6.3 Bounds on the average iteration period
  7.6.4 Implications for the ordered transactions schedule
7.7 Summary

Chapter 8
8.1 The Boolean dataflow model
  8.1.1 Scheduling
8.2 Parallel implementation on shared memory machines
  8.2.1 General strategy
  8.2.2 Implementation on the OMA
  8.2.3 Improved mechanism
  8.2.4 Generating the annotated bus access list
8.3 Data-dependent iteration
8.4 Summary

Chapter 9
9.1 The barrier MIMD technique
9.2 Redundant synchronization removal in non-iterative dataflow
9.3 Analysis of self-timed execution
  9.3.1 Estimated throughput
9.4 Strongly connected components and buffer size bounds
9.5 Synchronization model
  9.5.1 Synchronization protocols
  9.5.2 The synchronization graph Gs
9.6 A synchronization cost metric
9.7 Removing redundant synchronizations
  9.7.1 The independence of redundant synchronizations
  9.7.2 Removing redundant synchronizations
  9.7.3 Comparison with Shaffer’s approach
  9.7.4 An example
9.8 Making the synchronization graph strongly connected
  9.8.1 Adding edges to the synchronization graph
9.9 Insertion of delays
  9.9.1 Analysis of DetermineDelays
  9.9.2 Delay insertion example
  9.9.3 Extending the algorithm
  9.9.4 Complexity
  9.9.5 Related work
9.10 Summary

Chapter 10
10.1 Definition of resynchronization
10.2 Properties of resynchronization
10.3 Relationship to set covering
10.4 Intractability of resynchronization
10.5 Heuristic solutions
  10.5.1 Applying set-covering techniques to pairs of SCCs
  10.5.2 A more flexible approach
  10.5.3 Unit-subsumption resynchronization edges
  10.5.4 Example
  10.5.5 Simulation approach
10.6 Chainable synchronization graphs
  10.6.1 Chainable synchronization graph SCCs
  10.6.2 Comparison to the Global-Resynchronize heuristic
  10.6.3 A generalization of the chaining technique
  10.6.4 Incorporating the chaining technique
10.7 Resynchronization of constraint graphs for relative scheduling
10.8 Summary

Chapter 11
11.1 Elimination of synchronization edges
11.2 Latency-constrained resynchronization
11.3 Intractability of LCR
11.4 Two-processor systems
  11.4.1 Interval covering
  11.4.2 Two-processor latency-constrained resynchronization
  11.4.3 Taking delays into account
11.5 A heuristic for general synchronization graphs
  11.5.1 Customization to transparent synchronization graphs
  11.5.2 Complexity
  11.5.3 Example
11.6 Summary

Chapter 12
12.1 Computing buffer sizes
12.2 A framework for self-timed implementation
12.3 Summary


Chapter 1

The focus of this book is the exploration of architectures and design methodologies for application-specific parallel systems in the general domain of embedded applications in digital signal processing (DSP). Such multiprocessors typically consist of one or more central processing units (micro-controllers or programmable digital signal processors), and one or more application-specific hardware components (implemented as custom application-specific integrated circuits or reconfigurable logic such as field programmable gate arrays (FPGAs)). Such embedded multiprocessor systems are becoming increasingly common today in applications ranging from digital audio/video equipment to portable devices such as cellular phones and personal digital assistants. With increasing levels of integration, it is now feasible to integrate such heterogeneous systems entirely on a single chip. The design task of such multiprocessor systems-on-a-chip is complex, and the complexity will only increase in the future.

One of the critical issues in the design of embedded multiprocessors is managing communication and synchronization overhead between the heterogeneous processing elements. This book discusses systematic techniques aimed at reducing this overhead in multiprocessors that are designed to be application-specific. The scope of this book includes both hardware techniques for minimizing this overhead based on compile time analysis, as well as software techniques for strategically designing synchronization points in a multiprocessor implementation with the objective of reducing synchronization overhead. The techniques presented here apply to DSP algorithms that involve predictable control structure; the precise domain of applicability of these techniques will be formally stated shortly.

Applications in signal, image, and video processing require large computing power and have real-time performance requirements. The computing engines in such applications tend to be embedded as opposed to general-purpose. Custom VLSI implementations are usually preferred in such high throughput applications. However, custom approaches have the well known problems of long design cycles (the advances in high-level VLSI synthesis notwithstanding) and low flexibility in the final implementation. Programmable solutions are attractive in both of these respects: the programmable core needs to be verified for correctness only once, and design changes can be made late in the design cycle by modifying the software program. Although verifying the embedded software to be run on a programmable part is also a hard problem, in most situations changes late in the design cycle (and indeed even after the system design is completed) are much easier and cheaper to make in the case of software than in the case of hardware.

Special processors are available today that employ an architecture and an instruction set tailored towards signal processing. Such software programmable integrated circuits are called “Digital Signal Processors” (DSP chips or DSPs for short). The special features that these processors employ are discussed extensively by Lapsley, Bier, Shoham and Lee [LBSL94]. However, single-processor systems, even those built around DSPs, often cannot deliver the performance required by some applications. In these cases, use of multiple processors is an attractive solution, where both the hardware and the software make use of the application-specific nature of the task to be performed.

For multiprocessor implementation of embedded real-time DSP applications, reducing interprocessor communication (IPC) costs and synchronization costs becomes particularly important, because there is usually a premium on processor cycles in these situations. For example, consider the processing of video images in a video-conferencing application. Video-conferencing typically involves Quarter-CIF (Common Intermediate Format) images; this format specifies data rates of 30 frames per second, with each frame containing 144 lines and 176 pixels per line. The effective sampling rate of the Quarter-CIF video signal is 0.76 Megapixels per second. The highest performance programmable DSP processor available as of this writing (1999) has a cycle time of 5 nanoseconds; this allows about 260 instruction cycles per processor for processing each sample of the video signal sampled at 0.76 MHz. In a multiprocessor scenario, IPC can potentially waste these precious processor cycles, negating some of the benefits of using multiple processors. In addition to processor cycles, IPC wastes power since it involves access to shared resources such as memories and busses. Thus reducing IPC costs also becomes important from a power consumption perspective for portable devices.
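The cycle budget quoted above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function name `cycles_per_sample` and its parameter names are hypothetical, and the input figures are the ones given in the text (Quarter-CIF at 30 frames per second, a 5 ns cycle time).

```python
def cycles_per_sample(lines, pixels_per_line, frames_per_second, cycle_time_ns):
    """Return (sample rate in pixels/s, instruction cycles available per pixel)."""
    # Pixels that must be processed each second.
    sample_rate = lines * pixels_per_line * frames_per_second
    # Processor cycles available each second for the given cycle time.
    clock_rate = 1e9 / cycle_time_ns
    return sample_rate, clock_rate / sample_rate


# Figures from the text: 144 lines x 176 pixels, 30 frames/s, 5 ns cycle.
rate, budget = cycles_per_sample(144, 176, 30, 5.0)
print(f"{rate / 1e6:.2f} Mpixels/s, roughly {budget:.0f} cycles per sample")
```

The exact quotient is a little above the rounded figure of 260 cycles used in the text; either way, every cycle spent on IPC comes out of a budget of only a few hundred cycles per pixel.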

1.1 Multiprocessor DSP systems

Over the past few years several companies have offered boards consisting of multiple DSPs. More recently, semiconductor companies have been offering chips that integrate multiple DSP engines on a single die. Examples of such integrated multiprocessor DSPs include commercially available products such as the Texas Instruments TMS320C80 multi-DSP [GGV92], the Philips Trimedia processor [RSSS], and the Adaptive Solutions CNAPS processor. The Hydra research at Stanford [HO98] is another example of an effort focussed on single-chip multiprocessors. Multiprocessor DSPs are likely to be increasingly popular in the future for a variety of reasons.

First, VLSI technology today enables one to “stamp” 4-5 standard DSPs onto a single die; this trend is certain to continue in the coming years. Such an approach is expected to become increasingly attractive because it reduces the testing time for the increasingly complex VLSI systems of the future.

Second, since such a device is programmable, tooling and testing costs of building an ASIC (application-specific integrated circuit) for each different application are saved by using such a device for many different applications. This advantage of DSPs is going to be increasingly important as circuit integration levels continue their dramatic ascent.

Third, although there has been reluctance in adopting automatic compilers for embedded DSPs, such parallel DSP products make the use of automated tools feasible; with a large number of processors per chip, one can afford to give up some processing power to the inefficiencies in the automatic tools. In addition, new techniques are being researched to make the process of automatically mapping a design onto multiple processors more efficient; the research results discussed in this book are attempts in that direction. The situation is analogous to how logic designers have embraced automatic logic synthesis tools in recent years: logic synthesis tools and VLSI technology have improved to the point that the chip area saved by manual design over automated design is not worth the extra design time involved. One can afford to “waste” a few gates, just as one can afford to waste a limited amount of processor cycles to compilation inefficiencies in a multiprocessor DSP system.

Finally, the proliferation of telecommunication standards and signal formats, often giving rise to multiple standards for the very same application, makes software implementation extremely attractive. Examples of applications in this category include set-top boxes capable of recognizing a variety of audio/video formats and compression standards, modems supporting multiple standards, multi-mode cellular phones and base stations that work with multiple cellular standards, multimedia workstations that are required to run a variety of different multimedia software products, and programmable audio/video codecs. Integrated multiprocessor DSP systems provide a very flexible software platform for this rapidly-growing family of applications.

A natural generalization of such fully-programmable, multiprocessor integrated circuits is the class of multiprocessor systems that consists of an arbitrary, possibly heterogeneous, collection of programmable processors as well as a set of zero or more custom hardware elements on a single chip. Mapping applications onto such an architecture is then a hardware/software codesign problem. However, the problems of interprocessor communication and synchronization are, for the most part, identical to those encountered in fully-programmable systems. In this book, when we refer to a “multiprocessor,” we imply an architecture that, as described above, may be comprised of different types of programmable processors, and may include custom hardware elements. Additionally, the multiprocessor systems that we address in this book may be packaged in a single integrated circuit chip, or may be distributed across multiple chips. All of the techniques that we present in this book apply to this general class of parallel processing architectures.

Although this book addresses broad range of parallel architectures, it focuses on-thedesign of such architectures in the context of specific, well-defi~ed families of applications. We focus on application-specific parallel proce instead of applying the ideas in general purpose parallel systems because systems are typically components embedded app~ications,and the computational characteristics of embedded applications are fundamentally different from those of genera1“purposesystems. General purpose parallel computation involves user-progra~mablecomputing devices, whichcanbeconveniently config~red for wide variety of purposes, and can be re-configured any number of times the user’s needs change. omp put at ion in an embedded app~ication,however, is usually one-time programmed by the designer of that ernbedded system digital cellular radio handset, for example) and is not meant to be programmable by the end user. Also, the computation in embedded syste is specia~ized (the c o ~ p u t a tionin SE” functions such as speech cellular radio handsetinvolvesspecifi compression, channel equalization, modulation, etc.), andthe desi ners of embedded multiprocessor hardware typically have specific knowled applications that will be developed on the p l a t f o ~ sthat they develo trast, ~ c h i t e c t of s general purpose computing systems cannot afYord to customize their hardware too heavily for any specific class of applications. only designers of embedded systems have the oppo~unityto accurately predict and optimi~efor the specific ap ation subsystems that willbe executing on the hardware that theydevelop.wever,ifonly general purpose imple~entation techniques are used in the development of an embedded system, then the designers of that embedded system lose this oppo~unity.

Furthermore, embedded applications face very different constraints compared to general purpose computation. Non-recurring design costs, competitive time-to-market constraints, limitations on the amount and placement of memory, constraints on power consumption, and real-time performance requirements are a few examples. Thus for an embedded application, it is critical to apply techniques for design and implementation that exploit the special characteristics of the application in order to optimize for the specific set of constraints that must be satisfied. These techniques are naturally centered around design methodologies that tailor the hardware and software implementation to the particular application.

Parallel computation has of course been a topic of active research in computer science for the past several decades. Whereas parallelism within a single processor has been successfully exploited (instruction-level parallelism), the problem of partitioning a single user program onto multiple such processors is yet to be satisfactorily solved. Although the hardware for the design of multiple processor machines (the memory, interconnection network, input/output subsystems, etc.) has received much attention, efficient partitioning of a general program (written in C, for example) across a given set of processors arranged in a particular configuration is still an open problem. The need to detect parallelism from within the overspecified sequencing in popular imperative languages such as C, the need to manage overhead due to communication and synchronization between processors, and the requirement of dynamic load balancing for some programs (an added source of overhead) complicate the partitioning problem for general programs. If we turn from general purpose computation to application-specific domains, however, parallelism is often easier to identify and exploit. This is because much more is known about the computational structure of the functionality being implemented. In such cases, we do not have to rely on the limited ability of automated tools to deduce this high-level structure from generic, low-level specifications (for instance, from a general purpose programming language such as C). Instead, it may be possible to employ specialized computational models, such as one of the numerous variants of dataflow and finite state machine models, that expose relevant structure in our targeted applications, and greatly facilitate the manual or automatic derivation of optimized implementations. Such specification models will be unacceptable in a general-purpose context due to their limited applicability, but they present a tremendous opportunity to the designer of embedded applications.
The use of specialized computational models, particularly dataflow-based models, is especially prevalent in the DSP domain.


Similarly, focusing on a particular application domain may inspire the discovery of highly streamlined system architectures. For example, one of the most extensively studied families of application-specific parallel processors is the class of systolic array architectures [Kun88][Rao85]. These architectures consist of regularly arranged arrays of processors that communicate locally, onto which a certain class of applications, specified in a mathematical form, can be systematically mapped. Systolic arrays are further discussed in Chapter 2.

The necessary elements in the study of application-specific computer architectures are: 1) a clearly defined set of problems that can be solved using the particular application-specific approach, 2) a formal mechanism for specification of these applications, and 3) a systematic approach for designing hardware and software from such a specification. In this book we focus on embedded signal, image, and video signal processing applications, and a specification model called Synchronous Dataflow that has proven to be very useful for design of such applications. Dataflow is a well-known programming model in which a program is represented as a set of tasks with data precedences. Figure 1.1 shows an example of a dataflow graph, where computation tasks (actors) A, B, C, and D are represented as circles, and arrows (or arcs) between actors represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Figure 1.2 shows the semantics of a dataflow graph. Actors consume data (or tokens, represented as bullets in Figure 1.2) from their inputs, perform computations on them (fire), and produce a certain number of tokens on their outputs. The functions performed by the actors define the overall function of the dataflow graph; for example in Figure 1.1, A and B could be data sources, C could be a simple addition operation, and D could be a data sink. Then the function of the dataflow graph would be simply to output the sum of two input tokens.

Figure 1.1. An example of a dataflow graph.

Figure 1.2. Actor "firing".

Dataflow graphs are a very useful specification mechanism for signal processing systems since they capture the intuitive expressivity of block diagrams, flow charts, and signal flow graphs, while providing the formal semantics needed for system design and analysis tools. The applications we focus on are those that can be described by Synchronous Dataflow (SDF) [LM87] and its extensions; we will discuss the formal semantics of this computational model in detail in Chapter 3. SDF in its pure form can only represent applications that involve no decision making at the task level. Extensions of SDF (such as the Boolean dataflow (BDF) model [Lee91][Buc93]) allow control constructs, so that data-dependent control flow can be expressed in such models. These models are significantly more powerful in terms of expressivity, but they give up some of the useful analytical properties possessed by the SDF model. For instance, Buck shows that it is possible to simulate any Turing machine in the BDF model [Buc93]. The BDF model can therefore compute all Turing computable functions, whereas this is not possible in the case of the SDF model. We further discuss the Boolean dataflow model in Chapter 8.

In exchange for the limited expressivity of an SDF representation, we can efficiently check conditions such as whether a given SDF graph deadlocks, and whether it can be implemented using a finite amount of memory. No such general procedures can be devised for checking the corresponding conditions (deadlock behavior and bounded memory usage) for a computation model that can simulate any given Turing machine. This is because the problems of determining if any given Turing machine halts (the halting problem), and determining whether it will use less than a given amount of memory (or tape), are undecidable; that is, no general algorithm exists to solve these problems in finite time.

In this work, we first focus on techniques that apply to SDF applications, and we will propose extensions to these techniques for applications that can be specified essentially as SDF, but augmented with a limited number of control constructs (and hence fall into the BDF model). SDF has proven to be a useful model for representing a significant class of DSP algorithms; several computer-aided design tools for DSP have been developed around SDF and closely related models. Examples of commercial tools based on SDF are the Signal Processing Worksystem (SPW) from Cadence [PLN92][BL91], and COSSAP from Synopsys [RPM92]. Tools developed at various universities that use SDF and related models include Ptolemy [PHLB95a], the Warp compiler [Pri92], DESCARTES […M92], GRAPE [LEAP94], and the Graph Compiler [VPS90]. Figure 1.3 shows an example of an F system specified as a block diagram in Cadence SPW.

Figure 1.3. A block diagram specification of an F system in Cadence Signal Processing Worksystem (SPW).
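The firing semantics described for Figure 1.1 can be illustrated with a minimal Python sketch. The `Actor` class and the `run_example` helper are hypothetical names introduced only for this illustration; they assume the simplest case, in which every actor consumes one token per input arc and produces one token per output arc on each firing.

```python
from collections import deque

class Actor:
    """A dataflow actor that may fire when every input FIFO holds a token."""
    def __init__(self, name, func, inputs=(), outputs=()):
        self.name = name
        self.func = func                # maps the list of input tokens to one output token
        self.inputs = list(inputs)      # FIFO queues (arcs) read by this actor
        self.outputs = list(outputs)    # FIFO queues (arcs) written by this actor

    def can_fire(self):
        # A source actor has no inputs, so it is always ready.
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        tokens = [q.popleft() for q in self.inputs]   # consume
        result = self.func(tokens)                    # compute
        for q in self.outputs:                        # produce
            q.append(result)

def run_example(samples_a, samples_b):
    """A and B are sources, C adds its two inputs, D is a sink."""
    ac, bc, cd = deque(), deque(), deque()            # arcs A->C, B->C, C->D
    src_a, src_b = iter(samples_a), iter(samples_b)
    received = []
    a = Actor("A", lambda _: next(src_a), outputs=[ac])
    b = Actor("B", lambda _: next(src_b), outputs=[bc])
    c = Actor("C", lambda t: t[0] + t[1], inputs=[ac, bc], outputs=[cd])
    d = Actor("D", lambda t: received.append(t[0]), inputs=[cd])
    for _ in range(min(len(samples_a), len(samples_b))):
        for actor in (a, b, c, d):                    # one valid firing order
            if actor.can_fire():
                actor.fire()
    return received

print(run_example([1, 2, 3], [10, 20, 30]))
```

The fixed firing order A, B, C, D used here is only one of several orders that respect the data precedences; choosing such an order is exactly the scheduling problem.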

The SDF model is popular because it has certain analytical properties that are useful in practice; we will discuss these properties and how they arise in the following section. The most important property of SDF graphs in the context of this book is that it is possible to effectively exploit parallelism in an algorithm specified as an SDF graph by scheduling computations in the SDF graph onto multiple processors at compile or design time rather than at run-time. Given such a schedule that is determined at compile time, we can extract information from it with a view towards optimizing the final implementation. In this book we present techniques for minimizing synchronization and inter-processor communication overhead in statically (i.e., compile time) scheduled multiprocessors in which the program is derived from a dataflow graph specification. The strategy is to model run-time execution of such a multiprocessor to determine how processors communicate and synchronize, and then to use this information to optimize the final implementation.
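One analytical property of this kind can be sketched in a few lines. Because every SDF actor produces and consumes a fixed number of tokens per firing, a compile-time tool can solve the so-called balance equations (for each arc, firings of the source times tokens produced equals firings of the sink times tokens consumed) to find how often each actor must fire in one period of a schedule. The sketch below is an illustration under stated assumptions, not the book's algorithm: the function name `repetitions_vector` and the edge tuple layout `(src, dst, produced, consumed)` are hypothetical, and the graph is assumed to be connected.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetitions_vector(actors, edges):
    """Return the smallest positive firing counts that balance every arc,
    or None if the token rates are inconsistent.
    edges: list of (src, dst, produced, consumed) tuples (assumed layout)."""
    rate = {actors[0]: Fraction(1)}          # fix one actor, propagate ratios
    pending = [actors[0]]
    while pending:
        a = pending.pop()
        for src, dst, p, c in edges:
            if src == a and dst not in rate:
                rate[dst] = rate[a] * p / c
                pending.append(dst)
            elif dst == a and src not in rate:
                rate[src] = rate[a] * c / p
                pending.append(src)
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, dst, p, c in edges:             # every arc must balance
        if rate[src] * p != rate[dst] * c:
            return None
    # Scale the rational rates to the smallest integer solution.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    counts = {a: int(r * scale) for a, r in rate.items()}
    g = reduce(gcd, counts.values())         # defensive normalization
    return {a: n // g for a, n in counts.items()}

print(repetitions_vector(["A", "B", "C"],
                         [("A", "B", 2, 3), ("B", "C", 1, 2)]))
```

In the usage example, A produces 2 tokens per firing and B consumes 3, so A must fire three times for every two firings of B. A result of `None` indicates a graph for which no periodic schedule can keep buffer sizes bounded.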

As mentioned before, dataflow models such as SDF (and other closely related models) have proven to be useful for specifying applications in signal processing and communications, with the goal of both simulation of the algorithm at the functional or behavioral level, and synthesis from such a high level specification to a software description (e.g., a C program) or a hardware description (e.g., VHDL) or a combination thereof. The descriptions thus generated can then be compiled down to the final implementation, e.g., an embedded processor, or an ASIC. One of the reasons for the popularity of such dataflow based models is that they provide a formalism for block-diagram based visual programming, which is a very intuitive specification mechanism for DSP; the expressivity of the SDF model sufficiently encompasses a significant class of DSP applications, including multirate applications that involve upsampling and downsampling operations. An equally important reason for employing dataflow is that such a specification exposes parallelism in the program. It is well known that imperative programming styles such as C and FORTRAN tend to over-specify the control structure of a given computation, and compilation of such specifications onto parallel architectures is known to be a hard problem. Dataflow on the other hand imposes minimal data-dependency constraints in the specification, potentially enabling a compiler to detect parallelism very effectively. The same argument holds for hardware synthesis, where it is also important to be able to specify and exploit concurrency.


The SDF model has also proven to be useful for compiling DSP applications on single processors. Programmable digital signal processing chips tend to have special instructions such as a single cycle multiply-accumulate (for filtering functions), modulo addressing (for managing delay lines), and bit-reversed addressing (for FFT computation). DSP chips also contain built-in parallel functional units that are controlled from fields in the instruction (such as parallel moves from memory to registers combined with an ALU operation). It is difficult for automatic compilers to optimally exploit these features; executable code generated by commercially available compilers today utilizes one-and-a-half to two times the program memory that a corresponding hand optimized program requires, and results in two to three times higher execution time compared to hand-optimized code [ZVSM95]. There are however significant research efforts underway that are narrowing this gap. For example, see [LDK95][SM~97]. Moreover, some of the newer DSP architectures such as the Texas Instruments TMS320C6x are more compiler friendly than past DSP architectures; automatic compilers for these processors often rival hand optimized assembly code for many standard DSP benchmarks. Block diagram languages based on models such as SDF have proven to be a bridge between automatic compilation and hand coding approaches; a library of reusable blocks in a particular programming language is hand coded, and this library then constitutes the set of atomic SDF actors. Since the library blocks are reusable, one can afford to carefully optimize and fine tune them. The atomic blocks are fine to medium grain in size; an atomic actor in the SDF graph may implement anything from a filtering function to a two input addition operation. The final program is then automatically generated by concatenating code corresponding to the blocks in the program according to the sequence prescribed by a schedule.
This approach is mature enough that there are commercial tools available today, for example the SPW and COSSAP tools mentioned earlier, that employ this technique. Powerful optimization techniques have been developed for generating sequential programs from SDF graphs that optimize for metrics such as program and data memory usage, the run-time efficiency of buffering code, and context switching overhead between sub-tasks [BML96]. Scheduling is a fundamental operation that must be performed in order to implement SDF graphs on both uniprocessors as well as multiprocessors. Uniprocessor scheduling simply refers to determining a sequence of execution of actors such that all precedence constraints are met and all the buffers between actors (corresponding to arcs) return to their initial states. Multiprocessor scheduling involves determining the mapping of actors to available processors, in addition to determining the sequence in which actors execute. We discuss the issues involved in multiprocessor scheduling in subsequent chapters.
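The definition of uniprocessor scheduling given above can be made concrete with a small simulation. The following Python sketch is an illustration only, not one of the optimized techniques cited in the text: it repeatedly fires any actor that still has firings left and enough tokens on its input arcs, and reports deadlock if no actor can fire. The function name `sequential_schedule`, the edge tuple layout, and the `reps` argument (the number of firings of each actor per schedule period, assumed to be known in advance) are all hypothetical.

```python
def sequential_schedule(edges, reps):
    """Build one period of a sequential schedule by simulation.
    edges: list of (src, dst, produced, consumed, initial_tokens) (assumed layout)
    reps:  dict mapping each actor to its number of firings per period
    Returns the firing sequence, or None if the graph deadlocks."""
    tokens = [e[4] for e in edges]            # current buffer contents per arc
    remaining = dict(reps)
    schedule = []
    while any(remaining.values()):
        for actor in remaining:
            ready = remaining[actor] > 0 and all(
                tokens[i] >= e[3]
                for i, e in enumerate(edges) if e[1] == actor)
            if ready:
                for i, e in enumerate(edges):
                    if e[1] == actor:
                        tokens[i] -= e[3]     # consume from input arcs
                    if e[0] == actor:
                        tokens[i] += e[2]     # produce on output arcs
                remaining[actor] -= 1
                schedule.append(actor)
                break
        else:
            return None                       # nothing can fire: deadlock
    # Holds when reps is a valid per-period firing count for the graph.
    assert tokens == [e[4] for e in edges], "buffers did not return to initial state"
    return schedule

edges = [("A", "B", 2, 3, 0), ("B", "C", 1, 2, 0)]
print(sequential_schedule(edges, {"A": 3, "B": 2, "C": 1}))
```

A cycle with no initial tokens, such as `[("A", "B", 1, 1, 0), ("B", "A", 1, 1, 0)]`, makes the function return `None`, since neither actor ever has the input token it needs.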

Overview

The following chapter describes examples of application specific multiprocessors used for signal processing applications. Chapter 3 lays down the formal notation and definitions used in the remainder of this book for modeling run-time synchronization and interprocessor communication. Chapter 4 describes scheduling models that are commonly employed when scheduling dataflow graphs on multiple processors. Chapter 5 describes scheduling algorithms that attempt to maximize performance while accurately taking interprocessor communication (IPC) costs into account. Chapters 6 and 7 describe a hardware based technique for minimizing IPC and synchronization costs; the key idea in these chapters is to predict the pattern of processor accesses to shared resources and to enforce this pattern during run-time. We present the hardware design and implementation of a four processor machine, the Ordered Memory Access Architecture (OMA). The OMA is a shared bus multiprocessor that uses shared memory for IPC. The order in which processors access shared memory for the purpose of communication is predetermined at compile time and enforced by a bus controller on the board, resulting in a low-cost IPC mechanism without the need for explicit synchronization. This scheme is termed the Ordered Transactions strategy. In Chapter 7, we present a graph theoretic scheme for modeling run-time synchronization behavior of multiprocessors using a structure that takes into account the processor assignment and ordering constraints that a self-timed schedule specifies. We also discuss the effect of run-time variations in execution times of tasks on the performance of a multiprocessor implementation. In Chapter 8, we discuss ideas for extending the Ordered Transactions strategy to models more powerful than SDF, for example, the Boolean dataflow (BDF) model. The strategy here is to assume we have only a small number of control constructs in the SDF graph and explore techniques for this case.
The domain of applicability of compile time optimization techniques can be extended to programs that display some dynamic behavior in this manner, without having to deal with the complexity of tackling the general BDF model. The ordered memory access approach discussed in Chapters 6 to 8 requires special hardware support. When such support is not available, we can utilize a set of software-based approaches to reduce synchronization overhead. These techniques for reducing synchronization overhead consist of efficient algorithms that minimize the overall synchronization activity in the implementation of a given self-timed schedule. A straightforward multiprocessor implementation of a dataflow specification often includes redundant synchronization points, i.e., the objective of a certain set of synchronizations is guaranteed as a side effect of other synchronization points in the system. Chapter 9 discusses efficient algorithms for detecting and eliminating such redundant synchronization operations. We also discuss a graph transformation that allows the use of more efficient synchronization protocols. It is also possible to reduce the overall synchronization cost of a self-timed implementation by adding synchronization points between processors that were not present in the schedule specified originally. In Chapter 10, we discuss a technique, called resynchronization, for systematically manipulating synchronization points in this manner. Resynchronization is performed with the objective of improving the throughput of the multiprocessor implementation. Frequently in real-time signal processing systems, latency is also an important issue, and although resynchronization improves the throughput, it generally degrades (increases) the latency. Chapter 10 addresses the problem of resynchronization under the assumption that an arbitrary increase in latency is acceptable. Such a scenario arises when the computations occur in a feedforward manner, e.g., audio/video decoding for playback from media such as Digital Versatile Disk (DVD), and for a wide variety of simulation applications. Chapter 11 examines the relationship between resynchronization and latency, and addresses the problem of optimal resynchronization when only a limited increase in latency is tolerable. Such latency constraints are present in interactive applications such as video conferencing and telephony, where beyond a certain point the latency becomes annoying to the user. In voice telephony, for example, the round trip delay of the speech signal is kept below about 100 milliseconds to achieve acceptable quality. The ordered memory access strategy discussed in Chapters 6 through 8 can be viewed as a hardware approach that optimizes for IPC and synchronization overhead in statically scheduled multiprocessor implementations.
The synchronization optimization techniques of Chapters 9 through 12, on the other hand, operate at the level of a scheduled parallel program by altering the synchronization structure of a given schedule to minimize the synchronization overhead in the final implementation. Throughout the book, we illustrate the key concepts by applying them to examples of practical systems.
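The idea of a compile-time access order enforced by a bus controller can be sketched with a very small timing model in Python. This is a simplified illustration under stated assumptions, not a description of the OMA hardware: the function name `ordered_bus_grants`, the fixed `transfer_time`, and the assumption that the times at which each processor becomes ready are known independently of the grants are all hypothetical.

```python
def ordered_bus_grants(order, ready_time, transfer_time=1):
    """Grant the shared bus strictly in a predetermined order.
    order:      processor ids, one entry per transaction, fixed at compile time
    ready_time: dict mapping each processor to the times at which its
                successive transactions become ready (assumed known)
    Returns a list of (processor, start_time) pairs."""
    next_index = {p: 0 for p in ready_time}
    bus_free = 0
    grants = []
    for p in order:
        t_ready = ready_time[p][next_index[p]]
        next_index[p] += 1
        # The controller waits for the processor whose turn it is, even if
        # another processor has been ready for longer.
        start = max(bus_free, t_ready)
        grants.append((p, start))
        bus_free = start + transfer_time
    return grants

print(ordered_bus_grants(["P1", "P2", "P1"], {"P1": [0, 2], "P2": [5]}))
```

In the usage example the second transaction of P1 is ready at time 2 but is granted only at time 6, after P2 has taken its turn. No arbitration or explicit synchronization is modeled because the order itself serializes the accesses; the price is that a late processor can hold up those behind it in the order.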


elements could themselves be self-contained processors that exploit parallelism within themselves. In the latter case, we can view the parallel program as being split into multiple threads of computation, where each thread is assigned to a processing element. The processing element itself could be a traditional von Neumann-type Central Processing Unit (CPU), sequentially executing instructions fetched from a central instruction storage, or it could employ instruction-level parallelism (ILP) to realize high performance by executing in parallel multiple instructions in its assigned thread. The interconnection mechanism between processors is clearly crucial to the performance of the machine on a given application. For fine-grained and instruction level parallelism support, communication often occurs through a simple mechanism such as a multi-ported register file. For machines composed of more sophisticated processors, a large variety of interconnection mechanisms have been employed, ranging from a simple shared bus to 3-dimensional meshes and hyper-trees [Lei92]. Embedded applications often employ simple structures such as hierarchical busses or small crossbars. The two main flavors of ILP are superscalar and VLIW (Very Long Instruction Word) [PH96]. Superscalar processors (e.g., the Intel Pentium processor) contain multiple functional units (ALUs, floating point units, etc.); instructions are brought into the machine sequentially and are scheduled dynamically by the processor hardware onto the available functional units. Out-of-order execution of instructions is also supported. VLIW processors, on the other hand, rely on a compiler to statically schedule instructions onto functional units; the compiler determines exactly what operation each functional unit performs in each instruction cycle. The "long instruction word" arises because the instruction word must specify the control information for all the functional units in the machine.
Clearly, a VLIW model is less flexible than a superscalar approach; however, the implementation cost of VLIW is also significantly less because dynamic scheduling need not be supported in hardware. For this reason, several modern DSP processors have adopted the VLIW approach; at the same time, as discussed before, the regular nature of DSP algorithms lends itself well to the static scheduling approach employed in VLIW machines. We will discuss some of these machines in detail in the following sections. Given multiple processors capable of executing autonomously, the program threads running on the processors may be tightly or loosely coupled to one another. In a tightly coupled architecture the processors may run in lock-step executing the same instructions on different data sets (e.g., systolic arrays), or they may run in lock-step, but operate on different instruction sequences (similar to VLIW). Alternatively, processors may execute their programs independent of one another, only communicating or synchronizing when necessary. Even in this case there is a wide range of how closely processors are coupled, which can range from a shared memory model where the processors may share the same memory address space to a "network of workstations" model where autonomous machines communicate in a coarse-grained manner over a local area network. In the following sections, we discuss application-specific parallel processors that exemplify the many variations in parallel architectures discussed thus far. We will find that these machines employ tight coupling between processors; these machines also attempt to exploit the predictable run-time nature of the targeted applications, by employing architectural techniques such as VLIW, and employing processor interconnections that reflect the nature of the targeted application set. Also, these architectures rely heavily upon static scheduling techniques for their performance.

DSP processors have incorporated ILP techniques since inception; the key innovation in the very first DSPs was a single cycle multiply-accumulate unit. In addition, almost all DSP processors today employ an architecture that includes multiple internal busses allowing multiple data fetches in parallel with an instruction fetch in a single instruction cycle; this is also known as a "Harvard" architecture. Figure 2.1 shows an example of a modern DSP processor (Texas Instruments TMS320C54x DSP) containing multiple address and data busses, and parallel address generators. Since filtering is the key operation in most DSP algorithms, modern programmable DSP architectures provide highly specialized support for this function. For example, a multiply-and-accumulate operation may be performed in parallel with two data fetches from data memory (for fetching the signal sample and the filter coefficient); in addition, an update of two address registers (potentially including modulo operations to support circular buffers and delay lines), and an instruction fetch can also be done in the same cycle. Thus, many atomic operations are performed in parallel in a single cycle; this allows a finite impulse response (FIR) filter implementation using only one DSP instruction cycle per filter tap. For example, Figure 2.2 shows the assembly code for the inner loop of an FIR filter implementation on a TMS320C54x DSP. The MAC instruction is repeated for each tap in the filter; for each repetition this instruction fetches the coefficient and data pointed to by address registers AR2 and AR3, multiplies and accumulates them into the "A" accumulator, and postincrements the address registers.
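The per-tap work that such a MAC instruction performs can be written out in Python as a behavioral sketch. This is not TMS320C54x code and makes no claim about the actual instruction encoding: the class name `FirFilter` and its fields are hypothetical, and each loop iteration simply lists the operations (two data fetches, a multiply-accumulate, and a modulo address update) that the text says a DSP can complete in one cycle.

```python
class FirFilter:
    """FIR filter with a circular delay line, one multiply-accumulate per tap."""
    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.delay = [0.0] * len(self.coeffs)   # circular buffer of past samples
        self.head = 0                            # index of the newest sample

    def step(self, sample):
        n = len(self.coeffs)
        self.delay[self.head] = sample           # newest sample overwrites the oldest
        acc = 0.0
        idx = self.head
        for c in self.coeffs:                    # one iteration per filter tap
            acc += c * self.delay[idx]           # fetch coefficient and data, then MAC
            idx = (idx - 1) % n                  # modulo update of the data pointer
        self.head = (self.head + 1) % n
        return acc

fir = FirFilter([1.0, 2.0, 3.0])
print([fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]])
```

Feeding a unit impulse, as in the usage line, returns the coefficients in order followed by zeros, which is a quick way to check that the circular addressing is correct.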


DSP processors have a complex instruction set and follow a philosophy very different from "Reduced Instruction Set Computer" (RISC) architectures, which are prevalent in the general purpose high performance microprocessor domain. The advantages of a complex instruction set are compact object code and deterministic performance, while the price of supporting a complex instruction set is lower compiler efficiency and lesser portability of the software. The constraint of low power, and the high performance-to-cost ratio requirement for embedded applications, has resulted in very different evolution paths for DSP processors compared to general-purpose processors. Whether these paths eventually converge in the future remains to be seen.

Sub-word parallelism refers to the ability to divide a wide ALU into narrower slices so that multiple operations on a smaller data type can be performed on the same datapath in an SIMD fashion (Figure 2.3). Several general purpose microprocessors employ a multimedia enhanced instruction set that exploits sub-word parallelism to achieve higher performance on multimedia applications that require a smaller precision. The "MMX Technology"-enhanced Intel Pentium processor [E…] is a well-known general purpose CPU with an enhanced instruction set to handle throughput intensive "media" processing. The MMX instructions allow a 64-bit ALU to be partitioned into 8-bit slices, providing sub-word parallelism. The 8-bit ALU slices work in parallel in an SIMD fashion. The Pentium can perform operations such as addition, subtraction, and logical operations on eight 8-bit samples (e.g., image pixels) in a single cycle. It also can perform data movement operations such as single cycle swapping of bytes within words, packing smaller sized words into a 64-bit register, etc. Operations such as four 8-bit multiplies (with or without saturation), shifts within sub-words, and sum of products of sub-words, may all be performed in a single cycle. Similarly enhanced microprocessors have been developed by Sun Microsystems (the "VIS" instruction set for the SPARC processor [TO…]) and Hewlett-Packard (the MAX instructions for the PA RISC processor). The VIS instruction set includes a capability for performing absolute difference (for image compression applications). The MAX instructions include a sub-word average, shift and add, and fairly generic permute instructions that change the positions of the sub-words within a 64-bit word boundary in a very flexible manner. The permute instructions are especially useful for efficiently aligning data within a 64-bit word before employing an instruction that operates on multiple sub-words. DSP processors such as the TMS320C60 and TMS320C80, and the Philips Trimedia also support sub-word parallelism. Exploiting sub-word parallelism clearly requires extensive static or compile time analysis, either manually or by a compiler.
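The effect of a sub-word add can be emulated in software, which makes the lane structure explicit. The Python sketch below packs four bytes into a 32-bit word and adds them lane by lane, once with wrap-around (truncation) and once with saturation. The function names are hypothetical, and the code models only the arithmetic result, not any particular processor's instruction.

```python
LOW7 = 0x7F7F7F7F   # low seven bits of every byte lane
HIGH = 0x80808080   # top bit of every byte lane

def add_bytes_wrap(x, y):
    """Add four packed bytes lane-wise, each lane wrapping modulo 256."""
    # Adding only the low 7 bits cannot carry into a neighboring lane.
    low = (x & LOW7) + (y & LOW7)
    # The top bit of each lane is then combined without propagating a carry.
    return (low ^ ((x ^ y) & HIGH)) & 0xFFFFFFFF

def add_bytes_saturate(x, y):
    """Add four packed unsigned bytes lane-wise, clamping each lane at 255."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        s = ((x >> shift) & 0xFF) + ((y >> shift) & 0xFF)
        out |= min(s, 0xFF) << shift
    return out

x, y = 0x01FF7F10, 0x01018001
print(hex(add_bytes_wrap(x, y)), hex(add_bytes_saturate(x, y)))
```

In the usage example the lane holding 0xFF + 0x01 wraps to 0x00 in the first function and clamps to 0xFF in the second, while the other three lanes are unaffected. Hardware with sub-word support obtains the same isolation by cutting the carry chain between slices of the ALU.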

Figure 2.3. Example of sub-word parallelism: addition of bytes within a 32-bit register, giving the byte-wise sums a + e, b + f, c + g, d + h (saturation or truncation could be specified).

As discussed before, the lower cost of the compiler-scheduled approach employed in VLIW machines, compared to the hardware scheduling employed in superscalar processors, makes VLIW a good candidate for a DSP architecture. It is therefore no surprise that several semiconductor manufacturers have recently announced VLIW-based signal processor products. The Philips Trimedia processor, for example, is geared towards video signal processing, and employs a VLIW engine. The Trimedia processor also has special hardware for handling various standard video formats. In addition, hardware modules for highly specialized functions such as Variable Length Decoding (used for MPEG video decoding), color and format conversion, are also provided. Trimedia also has instructions that exploit sub-word parallelism among byte-sized samples within a 32-bit word. The Chromatic MPACT architecture [Pur97] uses an interesting hardware/software partitioned solution to provide a programmable platform for PC-based multimedia. The target applications are graphics, audio/video processing, and video games. The key idea behind Chromatic's multimedia solution is to use some amount of processing capability in the native x86 CPU, and use the MPACT processor for accelerating certain functions when multiple applications are operated simultaneously (e.g., when a FAX message arrives while a teleconferencing session is in operation). Finally, the Texas Instruments TMS320C6x DSP [Tex98] is a high performance, general purpose DSP that employs a VLIW architecture. The C6x processor is designed around eight functional units that are grouped into two identical sets of four functional units each (see Figure 2.4). These functional units are the D unit for memory load/store and add/subtract operations; the M unit for multiplication; the L unit for addition/subtraction, logical and comparison operations; and the S unit for shifts in addition to add/subtract and logical operations. Each set of four functional units has its own register file, and a bypass is provided for accessing each half of the register file by either set of functional units. Each functional unit is controlled by a 32-bit instruction field; the instruction word for the processor therefore has a length between 32 bits and 256 bits, depending on how many functional units are actually active in a given cycle. Features such as predicated instructions allow conditional execution of instructions; this allows one to avoid branching when possible, a very useful feature considering the deep pipeline of the C6x.

Several multiprocessors geared towards signal processing are based on the dataflow architecture principles introduced by Dennis [Den80]; these machines deviate from the traditional von Neumann model of a computer. Notable among these are the Hughes Data Flow Multiprocessor [GB91], the Texas Instruments Data Flow Signal Processor [Gri84], and the AT&T Enhanced Modular Signal Processor (EMSP) [Blo86]. The first two perform the processor assignment step at compile time (i.e., tasks are assigned to processors at compile time) and tasks assigned to a processor are scheduled on it dynamically; the AT&T EMSP performs even the assignment of tasks to processors at run-time. The main steps involved in scheduling tasks on multiple processors are discussed fully in Chapter 4. Each of these machines employs elaborate hardware to implement dynamic scheduling within processors, and employs expensive communication networks to route tokens generated by actors assigned to one processor to tasks on other processors that require these tokens. In most DSP applications, however, such dynamic scheduling is unnecessary since compile time predictability makes static scheduling techniques viable. Eliminating dynamic scheduling results in much simpler hardware without an undue performance penalty.

Another example of an application-specific dataflow architecture is the processor described in [Cha84], which is a single chip processor geared towards image processing. Each chip contains one functional unit; multiple such chips can be connected together to execute programs in a pipelined fashion. The actors are statically assigned to each processor, and actors assigned to a given processor are scheduled on it dynamically. The primitives that this chip supports (convolution, bit manipulations, accumulation, etc.) are specifically designed for image processing applications.

Systolic arrays consist of processors that are locally connected and may be arranged in different interconnection topologies: mesh, ring, torus, etc. The term "systolic" arises because all processors in such a machine run in lock-step, alternating between a computation step and a communication step. The model followed is usually SIMD (Single Instruction Multiple Data). Systolic arrays can execute a certain class of problems that can be specified as "Regular Iterative Algorithms (RIA)" [Rao85]; systematic techniques exist for mapping an algorithm specified in RIA form onto dedicated processor arrays in an optimal fashion. Optimality includes metrics such as processor and communication link utilization, scalability with the problem size, and achieving the best [...] for a given number of processors. Several numerical computation problems were found to fall into the RIA category: [...] algebra, matrix operations, singular value decomposition, etc. (see [...][Lei92] for interesting systolic array implementations of a variety of different numerical problems). Only highly regular computations can be specified in the RIA form; this makes the applicability of systolic arrays somewhat restrictive.

Wavefront arrays are similar to systolic arrays except that processors are not under the control of a global clock [Kun88]. Communication between processors is asynchronous or self-timed; a handshake between processors ensures run-time synchronization. Thus processors in a wavefront array can be complex and the arrays themselves can consist of a large number of processors without incurring the associated problems of clock skew and global synchronization. The flexibility of wavefront arrays over systolic arrays comes at the cost of [...].

The Warp array at Carnegie Mellon University [A+87] is an example of a programmable systolic array, as opposed to a dedicated array designed for one specific application. Processors are arranged in a linear array and communicate with their neighbors [...]. Programs are written for this computer [...]. The Warp project also led to the iWarp design [...], which has a more elaborate inter-processor communication mechanism [...]. An iWarp node is a single VLSI component composed of a computation engine and a communication engine. The computation agent consists of an integer and logical unit as well as a floating point add and multiply unit. Each unit is capable of running independently, connected to a multi-ported register file. The communication agent connects to its neighbors via four bidirectional communication links, and provides the interface to support message passing type communication between cells as well as word-based systolic communication. The iWarp nodes can therefore be connected in various single and two dimensional topologies.
Various image processing applications ([...] FFT, image smoothing, computer vision) and matrix algorithms ([...] decomposition) have been reported for this machine [Lou93].

Next, we discuss multiprocessors that make use of multiple off-the-shelf programmable DSP chips. An example of such a system is the SMART architecture [Koh90], a reconfigurable bus-based design comprised of DSP32C processors and custom VLSI components for routing data between processors. Clusters of processors may be connected onto a common bus, or may form a linear array with neighbor-to-neighbor communication. This allows the multiprocessor to be reconfigured depending on the communication requirement of the particular application being mapped onto it. Scheduling and code generation for this machine is done by an automatic parallelizing compiler [HJ92].

The DSP3 multiprocessor [SW92] is comprised of AT&T DSP32C processors connected in a mesh configuration. The mesh interconnect is implemented using custom VLSI components for data routing. Each processor communicates with four of its adjacent neighbors through this router, which consists of input and output queues, and a crossbar that is configurable under program control. Data packets contain headers that indicate the ID of the destination processor.

The Ring Array Processor (RAP) system [M+92] uses TI TMS320C30 processors connected in a ring topology. This system is designed specifically for speech-recognition applications based on artificial neural networks. The RAP system consists of several boards that are attached to a host workstation, and acts as a co-processor for the host. The unidirectional pipelined ring topology employed for interprocessor communication was found to be ideal for the particular algorithms that were to be mapped to this machine. The ring structure is similar to the SMART array, except that no processor ID is included with the data, and processor reads and writes into the ring are scheduled at compile time. The ring is used to broadcast data from one processor to all the others during one phase of the neural network algorithm, and is used to shift data from processor to processor in a pipelined fashion in the second phase.

Several modern off-the-shelf DSP processors provide special support for multiprocessing. Examples include the Texas Instruments TMS320C40 (C40), Motorola DSP96000, Analog Devices ADSP-21060 “SHARC”, as well as the Inmos (now owned by SGS-Thomson) Transputer line of processors. The DSP96000 processor is a floating point DSP that supports two independent busses, one of which can be used for local accesses and the other for inter-processor communication. The C40 processor is also a floating point processor with two sets of busses; in addition it has six 8-bit bidirectional ports for interprocessor communication. The ADSP-21060 is a floating point DSP that provides six bidirectional serial links for interprocessor communication. The Transputer is a CPU with four serial links for interprocessor communications. Owing to the ease with which these processors can be interconnected, a number of multi-DSP machines have been built around the DSP96000, SHARC, and the Transputer. Examples of multi-DSP machines composed of DSP96000s include MUSIC [G+92], which targets neural network applications, as well as the architecture described in Chapter 6; C40 based parallel processors have been designed for beamforming applications [Ger95] and machine vision [DIB96], among others; ADSP-21060 based multiprocessors have been used for speech-recognition applications [T+95], applications in nuclear physics [A+98], and digital music [Sha98]; and machines built around Transputers have targeted applications in scientific computation [Mou96] and robotics [YM96].

Modern VLSI technology enables multiple CPUs to be placed on a single die, to yield a multiprocessor system-on-a-chip. Olukotun et al. [O+96] present an interesting study that concludes that going to a multiple processor solution is a better path to high performance than going to higher levels of instruction level parallelism (using a superscalar approach, for example). Systolic arrays have been proposed as ideal candidates for application-specific multiprocessor on chip implementations; however, as pointed out before, the class of applications targeted by systolic arrays is limited. We discuss next some interesting single chip multiprocessor architectures that have been designed and built to date.

The Texas Instruments TMS320C80 (Multimedia Video Processor) [GGV92] is an example of a single chip multi-DSP. It consists of four DSP cores, and a RISC processor for control-oriented applications. Each DSP core has its own local memory and some amount of shared RAM. Every DSP can access the shared memory in any one of the four DSPs through an interconnection network. A powerful transfer controller is responsible for moving data on-chip, and also supports graphics applications. Another single-chip multiprocessor designed for video applications, the PE9, consists of nine individual processing elements, each of which exploits instruction level parallelism by means of four individual processing units that can perform multiple arithmetic operations each cycle. It is thus a highly parallel architecture that exploits parallelism at multiple levels.

Embedded single-chip multiprocessors may also be composed of heterogeneous processors. For example, many consumer devices today (controllers, etc.) contain two processors: one is a DSP for signal processing tasks, while the other is a microcontroller. Such a two-processor system is increasingly found in embedded applications because of the types of architectural optimization used in each processor. The microcontroller has an efficient interrupt-handling capability, and is more amenable to compilation from a high-level language; however, it lacks the multiply-accumulate performance of a DSP processor. The microcontroller is thus ideal for performing user interface and protocol processing type functions that are somewhat asynchronous in nature, while the DSP is more suited to signal processing tasks that tend to be synchronous and predictable. Even though new DSP processors boasting microcontroller capabilities have been introduced recently (e.g., the Hitachi SH-DSP and the TI TMS320C27x series), an ARM/DSP two-processor solution is expected to be popular for embedded signal processing/control applications in the near future. A good example of such an architecture is described in [Reg94]; this part uses two DSP processors along with a microcontroller to implement audio processing and voice band modem functions in software.

Reconfigurable computers are another approach to application-specific computing that has received significant attention lately. Reconfigurable computing is based on implementing a function in hardware using configurable logic (e.g., a field programmable gate array or FPGA), or higher-level building blocks that can be easily configured and reconfigured to provide a range of different functions. Building a dedicated circuit for a given function can result in large speedups; examples of such functions are bit manipulation in applications such as cryptography and compression; bit-field extraction; highly regular computations such as Fourier and Discrete Cosine Transforms; pseudo random number generation; compact lookup tables, etc. One strategy that has been employed for building configurable computers is to build the machine entirely out of reconfigurable logic; examples of such machines, used for applications such as DNA sequence matching, finite field arithmetic, and encryption, are discussed in [G+91][~~95][GMN96][~+96].

A second and more recent approach to reconfigurable architectures is to augment a programmable processor with configurable logic. In such an architecture, functions best suited to hardware implementation are mapped to the FPGA to take advantage of the resulting speedup, and functions more suitable to software (e.g., control dominated applications, and floating point intensive computation) can make use of the programmable processor. The Garp processor [HW97], for example, combines a Sun UltraSPARC core with an FPGA that serves as a reconfigurable functional unit. Special instructions are defined for configuring the FPGA, and for transferring data between the FPGA and the processor. The authors demonstrate a 24x speedup over a Sun UltraSPARC machine for an encryption application. In [HFHK97] the authors describe a similar architecture, called Chimaera, that augments a RISC processor with an FPGA. In the Chimaera architecture, the reconfigurable unit has access to the processor register file; in the Garp architecture the processor is responsible for directly reading from and writing data to the reconfigurable unit through special instructions that are added to the native instruction set of the RISC processor. Both architectures include special instructions in the processor for sending commands to the reconfigurable unit.

Figure 2.7. A processor augmented with an FPGA-based accelerator [HW97][HFHK97].

Another example of a reconfigurable architecture is Matrix [MD97], which attempts to combine the efficiency of processors on irregular, heavily multiplexed tasks with the efficiency of FPGAs on highly regular tasks. The Matrix architecture allows selection of the granularity according to application needs. It consists of an array of basic functional units (BFUs) that may be configured either as functional units (add, multiply, etc.), or as control for another BFU. Thus one can configure the array into parts that function in SIMD mode under common control, where each such partition runs an independent thread in an MIMD mode.

In [ASI+98] the authors describe the idea of domain-specific processors that achieve low power dissipation for the small class of applications they are optimized for. These processors, augmented with general purpose processors, yield a practical trade-off between flexibility, power and performance. The authors estimate that such an approach can reduce the power utilization of speech coding implementations by over an order of magnitude compared to an implementation using only a general purpose DSP processor.

PADDI (Programmable Arithmetic Devices for DIgital signal processing) is another reconfigurable architecture that consists of an array of high performance execution units (EXUs) with localized register files, connected via a flexible interconnect mechanism [CR92]. The EXUs perform arithmetic functions such as add, subtract, shift, compare, accumulate, etc. The entire array is controlled by a hierarchical control structure: a central sequencer broadcasts a global control word, which is then decoded locally by each EXU to determine its action. The local EXU decoder (“nanostore”) handles local control, for example the selection of operands and program branching.

Finally, Wu and Liu [WLR98] describe a reconfigurable processing unit that can be used as a building block for a variety of video signal processing functions including FIR, IIR, and adaptive filters, and discrete transforms such as the DCT. An array of processing units along with an interconnection network is used to implement any one of these functions, yielding throughput comparable to custom ASIC designs but with much higher flexibility and potential for adaptive operation.

As we will discuss in Chapter 4, compile time scheduling is very effective for a large class of applications in signal processing and scientific computing. Given such a schedule, we can obtain information about the pattern of inter-processor communication that occurs at run-time. This compile time information can be exploited by the hardware architecture to achieve efficient communication between processors. We exploit this fact in a strategy discussed later in the book. In this section we discuss related work in this area of employing compile time information about inter-processor communication coupled with enhancements to the hardware architecture with the objective of reducing IPC and synchronization overhead.

Determining the pattern of processor communications is relatively straightforward in SIMD implementations. Techniques applied to systolic arrays in fact use the regular communication pattern to determine an optimal interconnect topology for a given algorithm. An interesting architecture in this context is the GF11 machine built at IBM [BDW85]. The GF11 is an SIMD machine in which processors are interconnected using a Benes network (Figure 2.8), which allows the GF11 to support a variety of different interprocessor communication topologies rather than a fixed topology. Benes networks are non-blocking, i.e., they can provide one-to-one connections from all the network inputs to the network outputs simultaneously according to any specified permutation. These networks achieve the functional capability of a full crossbar switch with much simpler hardware. The drawback, however, is that in a Benes network, computing the switch settings needed to achieve a particular permutation involves a somewhat complex algorithm [Lei92]. In the GF11, this problem is solved by precomputing the switch settings based on the program to be executed on the array. A central controller is responsible for reconfiguring the Benes network at run-time based on these predetermined switch settings. Interprocessor communication in the GF11 is synchronous with respect to computations in the processors, similar to systolic arrays. The GF11 has been used for scientific computing, e.g., calculations in quantum physics, finite element analysis, LU decomposition, and other applications.

[Figure 2.8: block diagram of the IBM GF11 showing the central controller and the interconnection network.]

An example of a mesh connected parallel processor that uses compile time information at the hardware level is the NuMesh system at MIT [SHL+97]. In this system, it is assumed that the communication pattern (the source and destination of each message) and the communication bandwidth required can be extracted from the parallel program specification. Some amount of dynamic execution is supported by the architecture. Each processing node in the mesh gets a communication schedule which it follows at run-time. If the compile time estimates of bandwidth requirements are accurate, the architecture realizes efficient, hot-spot free, low-overhead communication. Incorrect bandwidth estimates or dynamic execution are not catastrophic, but these do cause lower performance.

Another example of a parallel processor that is configured statically tiles its processing elements in a mesh topology; each element consists of a RISC-like processor with reconfigurable logic that implements special instructions and configurable data widths. Switches enforce a compile-time determined static communication pattern, allowing dynamic switching when necessary. Implementing the static communication pattern reduces synchronization overhead and network congestion. A compiler is responsible for partitioning the program into threads mapped onto each processor, configuring the reconfigurable logic on each processor, and routing communications statically.

In this chapter we discussed various types of application-specific multiprocessors employed for signal processing. Although these machines employ parallel processing techniques well known in general purpose computing, the predictable nature of the computations allows for simplified system architectures. It is often possible to configure processor interconnects statically to make use of compile time knowledge of inter-processor communication patterns. This allows for low overhead interprocessor communication and synchronization mechanisms that employ a combination of simple hardware support and software techniques applied to programs running on the processors. We explore these ideas further in the following chapters.


In this chapter we introduce terminology and definitions used in the remainder of the book, and formalize the dataflow model that was introduced intuitively in Chapter 1. We also briefly introduce the concept of algorithmic complexity, and discuss various shortest and longest path algorithms in weighted directed graphs along with their associated complexity. These algorithms are used extensively in subsequent chapters.

To start with, we define the difference of two arbitrary sets $S_1$ and $S_2$ by $S_1 - S_2 = \{s \in S_1 \mid s \notin S_2\}$, and we denote the number of elements in a finite set $S$ by $|S|$. Also, if $r$ is a real number, then we denote the smallest integer that is greater than or equal to $r$ by $\lceil r \rceil$.

A directed graph is an ordered pair $(V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. Each edge is an ordered pair $(v_1, v_2)$, where $v_1, v_2 \in V$. If $e = (v_1, v_2) \in E$, we say that $e$ is directed from $v_1$ to $v_2$; $v_1$ is the source vertex of $e$, and $v_2$ is the sink vertex of $e$. We refer to the source and sink vertices of a graph edge $e \in E$ by $src(e)$ and $snk(e)$. In a directed graph we cannot have two or more edges that have identical source and sink vertices. A generalization of a directed graph is a directed multigraph, in which two or more edges can have the same source and sink vertices. Figure 3.1(a) shows an example of a directed graph, and Figure 3.1(b) shows an example of a directed multigraph. The vertices are represented by circles and the edges are represented by arrows between the circles. Thus, the vertex set of the directed graph of Figure 3.1(a) is $\{A, B, C, D\}$, and the edge set includes edges such as $(D, B)$.
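The distinction between a directed graph and a directed multigraph can be made concrete with a small sketch. The edge lists below are hypothetical (they are not the edge sets of Figure 3.1), and the helper names `src`, `snk`, and `is_directed_graph` are chosen only to mirror the notation in the text.

```python
from collections import Counter

def src(edge):
    """Source vertex of an edge given as an ordered pair (v1, v2)."""
    return edge[0]

def snk(edge):
    """Sink vertex of an edge given as an ordered pair (v1, v2)."""
    return edge[1]

def is_directed_graph(edges):
    """True if no two edges share the same (source, sink) pair.

    A list that fails this check can still describe a directed multigraph.
    """
    return all(count == 1 for count in Counter(edges).values())

# Hypothetical edge lists, used only for illustration.
graph_edges = [("A", "B"), ("A", "C"), ("D", "B")]
multigraph_edges = graph_edges + [("A", "B")]  # a second edge from A to B
```

Representing a multigraph as a list (rather than a set) of ordered pairs keeps parallel edges distinct, which matters later when each edge carries its own delay.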

A dataflow graph is a directed multigraph, where the vertices (actors) represent computation and edges (arcs) represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Edges thus represent data precedences between computations. Actors consume data (or tokens) from their inputs, perform computations on them (fire), and produce certain numbers of tokens on their outputs.

Programs written in high-level functional languages such as pure LISP, and in dataflow languages such as Id and Lucid, can be directly converted into dataflow graph representations; such conversion is possible because these languages are designed to be free of side effects, i.e., programs in these languages do not contain global variables or data structures, and functions in these languages cannot modify their arguments [Ack82]. Also, since it is possible to simulate any Turing machine in one of these languages, questions such as deadlock (or equivalently, terminating behavior) and determining maximum buffer sizes become undecidable. Restricted dataflow models have therefore been proposed that trade expressive power for properties that facilitate implementing the specified computation in hardware or software.

One such restricted model (and in fact one of the earliest graph-based computation models) is the computation graph of Karp and Miller, where the authors establish that their computation graph model is determinate, i.e., the sequence of tokens produced on the edges of a given computation graph is unique, and does not depend on the order in which the actors in the graph fire, as long as all data dependencies are respected by the firing order. The authors also provide an algorithm that, based on topological and algebraic properties of the graph, determines whether the computation specified by a given computation graph will eventually terminate. Because of the latter property, computation graphs clearly cannot simulate all Turing machines, and hence are not as expressive as a general dataflow language like Lucid or pure LISP. Computation graphs provide some of the theoretical foundations for the SDF model, which is discussed in detail later in this chapter.

Another model of computation relevant to dataflow is the Petri net model [Pet81][Mur89]. A Petri net consists of a set of transitions, which are analogous to actors in dataflow, and a set of places that are analogous to arcs. Each transition has a certain number of input places and output places connected to it. Places may contain one or more tokens. A Petri net has the following semantics: a transition fires when all its input places have one or more tokens and, upon firing, it produces a certain number of tokens on each of its output places.

A large number of different kinds of Petri net models have been proposed in the literature for modeling different types of systems. Some of these Petri net models have the same expressive power as Turing machines: for example, if transitions are allowed to possess “inhibit” inputs (if a place corresponding to such an input to a transition contains a token, then that transition is not allowed to fire) then a Petri net can simulate any Turing machine (p. 201 in [Pet81]). Others (depending on topological restrictions imposed on how places and transitions can be interconnected) are equivalent to finite state machines, and yet others are similar to SDF graphs. Some extended Petri net models allow a notion of time, to model execution times of computations. There is also a body of work on stochastic extensions of timed Petri nets that are useful for modeling uncertainties in computation times. We will touch upon some of these Petri net models again in Chapter 4. Finally, there are Petri nets that distinguish between different classes of tokens in the specification (colored Petri nets), so that tokens can have information associated with them. We refer to [Pet81][Mur89] for details on the extensive variety of Petri nets that have been proposed over the years.
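The firing rule described above can be sketched in a few lines. This is a minimal illustration, not a full Petri net library: the marking is a dictionary from place names to token counts, and the sketch assumes (as in ordinary Petri nets, and not stated explicitly in the text) that firing removes one token from each input place. The place names and the function names `enabled` and `fire` are hypothetical.

```python
def enabled(marking, input_places):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking[p] >= 1 for p in input_places)

def fire(marking, input_places, output_tokens):
    """Fire a transition and return the new marking.

    input_places  : list of place names feeding the transition
    output_tokens : dict mapping each output place to the number of
                    tokens the transition produces on it
    Assumption: one token is removed from each input place on firing.
    """
    if not enabled(marking, input_places):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for p in input_places:
        new_marking[p] -= 1
    for p, n in output_tokens.items():
        new_marking[p] = new_marking.get(p, 0) + n
    return new_marking

# Hypothetical net: one transition with input places p1, p2 and output place p3.
m0 = {"p1": 1, "p2": 2, "p3": 0}
m1 = fire(m0, ["p1", "p2"], {"p3": 2})
```

After the single firing, `p1` is empty, so the transition is no longer enabled; this is the Petri net counterpart of an actor waiting for input tokens.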


The particular restricted dataflow model we are mainly concerned with in this book is the SDF (Synchronous Data Flow) model proposed by Lee and Messerschmitt [LM87]. The SDF model poses restrictions on the firing of actors: the number of tokens produced (consumed) by an actor on each output (input) edge is a fixed number that is known at compile time. The number of tokens produced and consumed by each SDF actor on each of its edges is annotated in illustrations of an SDF graph by numbers at the arc source and sink respectively. In an actual implementation, arcs represent buffers in physical memory. The arcs in an SDF graph may contain initial tokens, which we also refer to as delays. Arcs with delays can be interpreted as data dependencies across iterations of the graph; this concept will be formalized in the following chapter when we discuss scheduling models. We will represent delays using bullets on the edges of the SDF graph; we indicate more than one delay on an edge by a number alongside the bullet. An example of an SDF graph is illustrated in Figure 3.2.

Figure 3.2. An SDF graph.

DSP applications typically represent computations on an indefinitely long data sequence; therefore the SDF graphs we are interested in for the purpose of signal processing must execute in a non-terminating fashion. Consequently, we must be able to obtain periodic schedules for SDF representations, which can then be run as infinite loops using a finite amount of physical memory. Unbounded buffers imply a sample rate inconsistency, and deadlock implies that all actors in the graph cannot be iterated indefinitely. Thus for our purposes, correctly constructed SDF graphs are those that can be scheduled periodically using a finite amount of memory. The main advantage of imposing restrictions on the SDF model (over a general dataflow model) lies precisely in the ability to determine whether or not an arbitrary SDF graph has a periodic schedule that neither deadlocks nor requires unbounded buffer sizes [LM87]. The buffer sizes required to implement arcs in SDF graphs can be determined at compile time (recall that this is not possible for a general dataflow model); consequently, buffers can be allocated statically, and the run-time overhead associated with dynamic memory allocation is avoided. The existence of a periodic schedule that can be inferred at compile time implies that a correctly constructed SDF graph entails no run-time scheduling overhead.

This section briefly describes some useful properties of SDF graphs; for a more detailed and rigorous treatment, please refer to the work of Lee and Messerschmitt [LM87][Lee86]. An SDF graph is compactly represented by its topology matrix. The topology matrix, referred to henceforth as $\Gamma$, represents the SDF graph structure; this matrix contains one column for each vertex, and one row for each edge in the SDF graph. The $(i, j)$th entry in the matrix corresponds to the number of tokens produced by the actor numbered $j$ onto the edge numbered $i$. If the $j$th actor consumes tokens from the $i$th edge, i.e., the $i$th edge is incident into the $j$th actor, then the $(i, j)$th entry is negative. Also, if the $j$th actor neither produces nor consumes any tokens from the $i$th edge, then the $(i, j)$th entry is set to zero. For example, in the topology matrix $\Gamma$ (3-1) for the SDF graph in Figure 3.2, the actors A, B, and C are numbered 1, 2, and 3 respectively, and the edges (A, B) and (A, C) are numbered 1 and 2 respectively. A useful property of $\Gamma$ is stated by the following theorem.

Theorem 3.1: A connected SDF graph with $s$ vertices that has consistent sample rates is guaranteed to have $\mathrm{rank}(\Gamma) = s - 1$, which ensures that $\Gamma$ has a null space.

Proof: See [LM87]. This can easily be verified for (3-1).

This fact is utilized to determine the repetitions vector for an SDF graph. The repetitions vector $q$ for an SDF graph with $s$ actors numbered 1 to $s$ is a column vector of length $s$, with the property that if each actor $i$ is invoked a number of times equal to the $i$th entry of $q$, then the number of tokens on each edge of the SDF graph remains unchanged. Furthermore, $q$ is the smallest integer vector for which this property holds.
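As an illustration of the topology matrix and of the rank condition in Theorem 3.1, the following sketch builds $\Gamma$ from an edge list and computes its rank with exact rational arithmetic. The production and consumption rates used here (2 and 3 on edge (A, B), 1 and 1 on edge (A, C)) are assumptions chosen to agree with the invocation counts 3, 2, 3 quoted in the text; they are not read from Figure 3.2. Actors are numbered from 0 in the code, and the function names are hypothetical.

```python
from fractions import Fraction

def topology_matrix(num_actors, edges):
    """One row per edge, one column per actor.

    edges: list of (src, snk, produced, consumed) with 0-based actor indices.
    Entry is +produced in the source column and -consumed in the sink column.
    """
    rows = []
    for s, t, produced, consumed in edges:
        row = [0] * num_actors
        row[s] += produced
        row[t] -= consumed
        rows.append(row)
    return rows

def rank(matrix):
    """Rank by Gauss-Jordan elimination over exact fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Assumed rates (not taken from the figure): A=0, B=1, C=2.
edges = [(0, 1, 2, 3),   # edge (A, B): A produces 2, B consumes 3
         (0, 2, 1, 1)]   # edge (A, C): A produces 1, C consumes 1
gamma = topology_matrix(3, edges)
```

With these assumed rates the matrix has 3 columns and rank 2, matching the $s - 1$ condition of the theorem for a connected, sample-rate-consistent graph.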


Clearly, the repetitions vector is very useful for generating infinite schedules for SDF graphs by indefinitely repeating a finite length schedule, while maintaining small buffer sizes between actors. Also, $q$ will only exist if the SDF graph has consistent sample rates. The conditions for the existence of $q$ are determined by Theorem 3.1 coupled with the following theorem.

Theorem 3.2: The repetitions vector for an SDF graph with consistent sample rates is the smallest integer vector in the nullspace of its topology matrix. That is, $q$ is the smallest integer vector such that $\Gamma q = 0$.

Proof: See [LM87].

The vector $q$ may be easily obtained by solving a set of linear equations; these are called balance equations, since they represent the constraint that the number of samples produced and consumed on each edge of the SDF graph be the same after each actor fires a number of times equal to its corresponding entry in the repetitions vector. For the example of Figure 3.2, from (3-1),

$$q = [3 \;\; 2 \;\; 3]^T \qquad \text{(3-2)}$$

Clearly, if actors A, B, and C are invoked 3, 2, and 3 times respectively, the number of tokens on the edges remains unaltered. Thus, the repetitions vector in (3-2) brings the SDF graph back to its “initial state”.
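One way to solve the balance equations is to propagate rational firing ratios across the edges of the graph and then scale to the smallest integers. The sketch below does this for a connected SDF graph; it is a minimal illustration rather than the procedure of [LM87]. The rates (2 and 3 on (A, B), 1 and 1 on (A, C)) are assumptions consistent with the values 3, 2, 3 in the text, and the function name `repetitions_vector` is hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetitions_vector(num_actors, edges):
    """Smallest positive integer solution of the balance equations.

    edges: list of (src, snk, produced, consumed), 0-based actor indices.
    Assumes the graph is connected. Returns None if the sample rates
    are inconsistent (no non-trivial solution exists).
    """
    neighbours = {a: [] for a in range(num_actors)}
    for s, t, p, c in edges:
        # Balance on the edge: q[s] * p == q[t] * c
        neighbours[s].append((t, Fraction(p, c)))
        neighbours[t].append((s, Fraction(c, p)))
    rate = {0: Fraction(1)}          # fix actor 0, propagate ratios
    stack = [0]
    while stack:
        a = stack.pop()
        for b, ratio in neighbours[a]:
            if b not in rate:
                rate[b] = rate[a] * ratio
                stack.append(b)
    # Every edge (including those not used for propagation) must balance.
    if any(rate[s] * p != rate[t] * c for s, t, p, c in edges):
        return None
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    q = [int(rate[a] * scale) for a in range(num_actors)]
    common = reduce(gcd, q)
    return [x // common for x in q]

# Assumed rates (not taken from the figure): A=0, B=1, C=2.
q = repetitions_vector(3, [(0, 1, 2, 3), (0, 2, 1, 1)])
```

Adding an edge whose rates contradict the others, for example a unit-rate edge from B to C, makes the function return `None`, which corresponds to a sample rate inconsistency.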

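An SDF graph in which every actor consumes and produces only one token from each of its inputs and outputs is called a homogeneous SDF graph (HSDFG). An HSDFG actor fires when it has one or more tokens on all its input edges; it consumes one token from each input edge when it fires, and produces one token on all its output edges. An HSDFG is very similar to a marked graph in the context of Petri nets: transitions in the marked graph correspond to actors in the HSDFG, places correspond to edges, and initial tokens (or the initial marking) of the marked graph correspond to initial tokens (or delays) in HSDFGs.

The repetitions vector defined in the previous section can be used to convert a general SDF graph $G$ into an equivalent HSDFG. We outline this transformation here; an example conveys the basic idea of this transformation. Each actor of $G$ is replaced by a number of vertices (invocations) of that actor, as given by its entry in the repetitions vector. For an edge $(A, B)$ in $G$, let $n_A$ represent the number of tokens produced on the edge when $A$ fires, and let $n_B$ represent the number of tokens consumed from the edge when $B$ fires. Since each HSDFG vertex produces and consumes only one token from each of its edges, for each edge of which $A$ is the source, the corresponding HSDFG vertex must now be the source vertex for $n_A$ edges. Each of these edges carries one of the tokens that $A$ produces on the original edge, and each vertex corresponding to $B$ is likewise the sink of one edge for each token that $B$ consumes from the original edge. The endpoints of these edges on the HSDFG vertices are ports; let us call these output and input ports respectively. The $k$th sample generated on the edge in an iteration leaves from one particular output port and arrives at one particular input port, and this determines the edges of the HSDFG.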

An SDF graph that is not an HSDFG can always be converted into an equivalent HSDFG [Lee86]. The resulting HSDFG has a larger number of actors than the original SDF graph. It in fact has a number of actors equal to the sum of the entries in the repetitions vector. In the worst case, the SDF to HSDFG transformation may result in an exponential increase in the number of actors (families of SDF graphs exist in which this blowup occurs). Such a transformation, however, appears to be necessary when constructing periodic multiprocessor schedules from multirate SDF graphs, although there has been some work on reducing the complexity of the HSDFG that results from transforming a given SDF graph by applying graph clustering techniques to that SDF graph.

Figure 3.3. Expansion of an edge in an SDF graph into multiple edges in the equivalent HSDFG. Note the input and output ports on the vertices of the HSDFG.

An SDF graph converted into an HSDFG for the purposes of multiprocessor scheduling can be further converted into an Acyclic Precedence Expansion Graph (APEG) by removing from the HSDFG arcs that contain initial tokens (delays). Recall that arcs with initial tokens on them represent dependencies between successive iterations of the dataflow graph. An APEG is therefore useful for constructing multiprocessor schedules that, for algorithmic simplicity, do not attempt to overlap multiple iterations of the dataflow graph by exploiting precedence constraints across iterations. Figure 3.5 shows an example of an APEG. Note that the precedence constraints present in the original HSDFG of Figure 3.4 are maintained by this APEG, as long as each iteration of the graph is completed before the next iteration begins.

Figure 3.4. HSDFG obtained by expanding the SDF graph in Figure 3.2.

Figure 3.5. APEG obtained from the HSDFG in Figure 3.4.
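The expansion of an SDF graph into an HSDFG, and the subsequent removal of delayed arcs to obtain an APEG, can be sketched as follows. This is a commonly used token-by-token construction and is offered as an assumption-laden illustration, not as the exact procedure of [Lee86]: every token produced on an SDF edge during one iteration becomes one HSDFG edge, and initial tokens shift the consuming invocation, wrapping into later iterations as delay. The rates, the tuple layout, and the function names are hypothetical.

```python
def sdf_to_hsdfg(q, edges):
    """Expand an SDF graph into HSDFG edges.

    q     : dict actor -> entry of the repetitions vector
    edges : list of (src, snk, produced, consumed, delay)
    Returns a list of ((src, i), (snk, j), d): invocation i of src feeds
    invocation j of snk (0-based), with d initial tokens on that edge.
    """
    hsdfg_edges = []
    for s, t, produced, consumed, delay in edges:
        for k in range(q[s] * produced):      # k-th token of one iteration
            i = k // produced                 # invocation that produces it
            position = k + delay              # FIFO position after initial tokens
            firing = position // consumed     # firing of snk that consumes it
            j = firing % q[t]                 # invocation within an iteration
            d = firing // q[t]                # iterations crossed, i.e. delay
            hsdfg_edges.append(((s, i), (t, j), d))
    return hsdfg_edges

def apeg(hsdfg_edges):
    """Keep only the edges without delay (intra-iteration precedences)."""
    return [e for e in hsdfg_edges if e[2] == 0]

# Assumed rates, consistent with repetitions 3, 2, 3 quoted in the text.
q = {"A": 3, "B": 2, "C": 3}
hs = sdf_to_hsdfg(q, [("A", "B", 2, 3, 0), ("A", "C", 1, 1, 0)])

# Same A-to-C edge with one initial token (hypothetical variation).
hs_delayed = sdf_to_hsdfg(q, [("A", "C", 1, 1, 1)])
```

The number of HSDFG vertices is the sum of the entries of `q` (8 here), in line with the statement above. In the delayed variant, the token from the last invocation of A is consumed by the first invocation of C in the next iteration, so that edge carries a delay and is dropped by `apeg`.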

Since we are concerned with multiprocessor schedules, we assume henceforth that we start with an application represented as a homogeneous SDF graph, unless we state otherwise. This of course results in no loss of generality because a general SDF graph is converted into a homogeneous graph for the purposes of multiprocessor scheduling anyway. In Chapter 8 we discuss how the ideas that apply to HSDF graphs can be extended to graphs containing actors that display data-dependent behavior (i.e., dynamic actors).

An HSDFG representation of an algorithm (for example, a filter bank, or a Fast Fourier Transform) is called an application graph. For example, Figure 3.7(a) shows an SDF representation of a two-channel multirate filter bank that consists of a pair of analysis filters followed by synthesis filters. This graph can be transformed into an equivalent HSDFG, which represents the application graph for the two-channel filter bank, as shown in Figure 3.7(b). Algorithms that map applications specified as SDF graphs onto single and multiple processors take the equivalent application graph as input. Such algorithms will be discussed in Chapters 4 and 5. Chapter 7 will discuss how the performance of a multiprocessor system after scheduling can be modeled by another HSDFG called the interprocessor communication graph, or IPC graph. The IPC graph is derived from the original application graph and the given parallel schedule. Furthermore, Chapters 9 to 11 will discuss how a third HSDFG, called the synchronization graph, can be used to analyze and optimize the synchronization structure of a multiprocessor system. The full interaction of the application graph, IPC graph, and synchronization graph, and also the formal definitions of these graphs, will then be further elaborated in Chapters 7 through 11.

Figure 3.7. (a) SDF graph representing a two-channel filter bank. (b) Application graph.

SDF should not be confused with synchronous languages (e.g., LUSTRE, SIGNAL, and ESTEREL), which have very different semantics from SDF. Synchronous languages have been proposed for formally specifying and modeling reactive systems, i.e., systems that constantly react to stimuli from a given physical environment. Signal processing systems fall into the reactive category, and so do control and monitoring systems, communication protocols, man-machine interfaces, etc. In synchronous languages, variables are possibly infinite sequences of data of a certain type. Associated with each such sequence is a conceptual (and sometimes explicit) notion of a clock. In LUSTRE, each variable is explicitly associated with a clock, which determines the instants at which the value of that variable is defined. SIGNAL and ESTEREL do not have an explicit notion of a clock. The clock signal in LUSTRE is a sequence of Boolean values, and a variable in a LUSTRE program assumes its $n$th value when its corresponding clock takes its $n$th TRUE value. Thus we may relate one variable with another by means of their clocks. In ESTEREL, on the other hand, clock ticks are implicitly defined in terms of instants when the reactive system corresponding to an ESTEREL program receives (and reacts to) external events. All computations in a synchronous language are defined with respect to these clocks.

In contrast, the term “synchronous” in the SDF context refers to the fact that SDF actors produce and consume fixed numbers of tokens, and these numbers are known at compile time. This allows us to obtain periodic schedules for SDF graphs such that the average rates of firing of actors are fixed relative to one another. We will not be concerned with synchronous languages, although these languages have a close and interesting relationship with dataflow models used for specification of signal processing algorithms [LP95].
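The LUSTRE clock rule described above (a variable takes its $n$th value when its clock takes its $n$th TRUE value) can be mimicked with ordinary lists. The sketch below is plain Python written only to illustrate that rule; it is not LUSTRE code and does not model the language's type or clock calculus. The function names and the sample data are hypothetical.

```python
def defined_instants(clock):
    """Indices of the base instants at which the clock is TRUE."""
    return [t for t, tick in enumerate(clock) if tick]

def align(values, clock):
    """Lay a variable's value sequence out on the base time line.

    The n-th value is placed at the n-th TRUE instant of the clock;
    instants at which the variable is undefined are shown as None.
    """
    remaining = iter(values)
    return [next(remaining, None) if tick else None for tick in clock]

# Hypothetical clock and value sequence.
clock = [True, False, True, True, False]
timeline = align([10, 20, 30], clock)
```

Two variables with different clocks are thus defined at different instants, which is quite unlike SDF, where "synchronous" refers only to fixed token production and consumption rates.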

BACKGROUND TERMINOLOGY AND NOTATION

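A dataflow graph (DFG) is a directed multigraph $(V, E)$, in which $V$ is the set of vertices (actors) and $E$ is the set of edges. Each edge $e \in E$ is directed from its source vertex $\mathrm{src}(e)$ to its sink vertex $\mathrm{snk}(e)$. We denote the delay (the number of initial tokens) on $e$ by $\mathrm{delay}(e)$. We say that $e$ is an output edge of $\mathrm{src}(e)$, and that $e$ is an input edge of $\mathrm{snk}(e)$. We will also use the notation $(v_i, v_j)$, with $v_i, v_j \in V$, for an edge directed from $v_i$ to $v_j$. The delay on the edge is denoted by $\mathrm{delay}((v_i, v_j))$ or simply $\mathrm{delay}(v_i, v_j)$.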

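A path in $(V, E)$ is a finite, non-empty sequence $(e_1, e_2, \ldots, e_n)$, where each $e_i$ is a member of $E$, and $\mathrm{snk}(e_i) = \mathrm{src}(e_{i+1})$ for $1 \le i \le (n - 1)$. We say that the path $p = (e_1, e_2, \ldots, e_n)$ contains each $e_i$ and each subsequence of $(e_1, e_2, \ldots, e_n)$; that $p$ is directed from $\mathrm{src}(e_1)$ to $\mathrm{snk}(e_n)$; and that each member of $\{\mathrm{src}(e_1), \mathrm{src}(e_2), \ldots, \mathrm{src}(e_n), \mathrm{snk}(e_n)\}$ is on $p$. The path $p$ originates at vertex $\mathrm{src}(e_1)$ and terminates at vertex $\mathrm{snk}(e_n)$. A dead-end path is a path that terminates at a vertex that has no successors. That is, $p = (e_1, e_2, \ldots, e_n)$ is a dead-end path if $\mathrm{src}(e) \ne \mathrm{snk}(e_n)$ for all $e \in E$. A path that is directed from a vertex to itself is called a cycle, and a simple cycle is a cycle of which no proper subsequence is a cycle.

If $(p_1, p_2, \ldots, p_k)$ is a finite sequence of paths such that $p_i = (e_{i,1}, e_{i,2}, \ldots, e_{i,n_i})$ for $1 \le i \le k$, and $\mathrm{snk}(e_{i,n_i}) = \mathrm{src}(e_{i+1,1})$ for $1 \le i \le (k - 1)$, then we define the concatenation of $(p_1, p_2, \ldots, p_k)$, denoted $\langle (p_1, p_2, \ldots, p_k) \rangle$, by

$$\langle (p_1, p_2, \ldots, p_k) \rangle \equiv (e_{1,1}, \ldots, e_{1,n_1}, e_{2,1}, \ldots, e_{2,n_2}, \ldots, e_{k,1}, \ldots, e_{k,n_k}).$$

Clearly, $\langle (p_1, p_2, \ldots, p_k) \rangle$ is a path from $\mathrm{src}(e_{1,1})$ to $\mathrm{snk}(e_{k,n_k})$.

If $p = (e_1, e_2, \ldots, e_n)$ is a path in an HSDFG, then we define the path delay of $p$, denoted $\mathrm{Delay}(p)$, by

$$\mathrm{Delay}(p) = \sum_{i=1}^{n} \mathrm{delay}(e_i).$$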

Since the delays on all HSDFG edges are restricted to be non-negative, it is easily seen that between any two vertices $x, y \in V$, either there is no path directed from $x$ to $y$, or there exists a (not necessarily unique) minimum-delay path between $x$ and $y$. Given an HSDFG $G$, and vertices $x, y$ in $G$, we define $\rho_G(x, y)$ to be equal to the path delay of a minimum-delay path from $x$ to $y$ if there exist one or more paths from $x$ to $y$, and equal to $\infty$ if there is no path from $x$ to $y$. If $G$ is understood, then we may drop the subscript and simply write “$\rho$” in place of “$\rho_G$”. It is easily seen that minimum-delay path lengths satisfy the following triangle inequality:

$$\rho_G(x, z) + \rho_G(z, y) \ge \rho_G(x, y), \quad \text{for any } x, y, z \text{ in } G.$$

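By a subgraph of $(V, E)$, we mean the directed graph formed by any $V' \subseteq V$ together with the set of edges $\{e \in E \mid \mathrm{src}(e), \mathrm{snk}(e) \in V'\}$. We denote the subgraph associated with the vertex-subset $V'$ by $\mathrm{subgraph}(V')$. We say that $(V, E)$ is strongly connected if for each pair of distinct vertices $x, y$, there is a path directed from $x$ to $y$ and there is a path directed from $y$ to $x$. We say that a subset $V' \subseteq V$ is strongly connected if $\mathrm{subgraph}(V')$ is strongly connected. A strongly connected component (SCC) of $(V, E)$ is a strongly connected subset $V' \subseteq V$ such that no strongly connected subset of $V$ properly contains $V'$. If $V'$ is an SCC, then when there is no ambiguity, we may also say that $\mathrm{subgraph}(V')$ is an SCC. If $C_1$ and $C_2$ are distinct SCCs in $(V, E)$, we say that $C_1$ is a predecessor SCC of $C_2$ if there is an edge directed from some vertex in $C_1$ to some vertex in $C_2$; $C_1$ is a successor SCC of $C_2$ if $C_2$ is a predecessor SCC of $C_1$. An SCC is a source SCC if it has no predecessor SCC; and an SCC is a sink SCC if it has no successor SCC. An edge $e$ is a feedforward edge of $(V, E)$ if it is not contained in an SCC, or equivalently, if it is not contained in a cycle; an edge that is contained in at least one cycle is called a feedback edge.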

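A sequence of vertices $(v_1, v_2, \ldots, v_k)$ is a chain that joins $v_1$ and $v_k$ if $v_{i+1}$ is adjacent to $v_i$ for $i = 1, 2, \ldots, (k - 1)$. We say that a directed multigraph $(V, E)$ is a connected graph if for any pair of distinct members $A$, $B$ of $V$, there is a chain that joins $A$ and $B$. Given a directed multigraph $G = (V, E)$, there is a unique partition (unique up to a reordering of the members of the partition) $V_1, V_2, \ldots, V_n$ such that for $1 \le i \le n$, $\mathrm{subgraph}(V_i)$ is connected; and for each $e \in E$, $\mathrm{src}(e), \mathrm{snk}(e) \in V_j$ for some $j$. Thus, each $V_i$ can be viewed as a maximal connected subset of $V$, and we refer to each $V_i$ as a connected component of $G$.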

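A topological sort of an acyclic directed multigraph $(V, E)$ is an ordering $v_1, v_2, \ldots, v_{|V|}$ of the members of $V$ such that for each $e \in E$, $((\mathrm{src}(e) = v_i) \text{ and } (\mathrm{snk}(e) = v_j)) \Rightarrow (i < j)$; that is, the source vertex of each edge occurs earlier in the ordering than the sink vertex. An acyclic directed multigraph is said to be well-ordered if it has only one topological sort, and we say that an $n$-vertex well-ordered directed multigraph is chain-structured if it has $(n - 1)$ edges.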
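The definition of a topological sort can be made concrete with a small sketch. The Python code below is an illustration only: the function names and the list-of-pairs graph representation are assumptions introduced here, and the queue-based procedure (often attributed to Kahn) is one common way to compute a topological sort, not a method prescribed by this section. The second function relies on the fact that an acyclic graph has exactly one topological sort when every consecutive pair of vertices in that sort is joined by an edge.

```python
from collections import deque

def topological_sort(vertices, edges):
    """Return one topological sort of an acyclic directed multigraph.

    vertices: list of hashable vertex names (hypothetical representation)
    edges: list of (src, snk) pairs; repeated pairs are allowed (multigraph)
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for src, snk in edges:
        successors[src].append(snk)
        indegree[snk] += 1

    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle; no topological sort exists")
    return order

def is_well_ordered(vertices, edges):
    """True if the acyclic graph has exactly one topological sort."""
    order = topological_sort(vertices, edges)
    edge_set = set(edges)
    # Unique sort <=> each consecutive pair in the sort is joined by an edge.
    return all((a, b) in edge_set for a, b in zip(order, order[1:]))

diamond_vertices = ["A", "B", "C", "D"]
diamond_edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
order = topological_sort(diamond_vertices, diamond_edges)
```

In the diamond-shaped example, B and C may appear in either order, so the graph is not well-ordered; a three-vertex chain with an extra edge from its first to its last vertex is well-ordered but not chain-structured, since it has three edges rather than two.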

For elaboration on any of the graph-theoretic concepts presented in this section, we refer the reader to Cormen, Leiserson, and Rivest.

AG

one of these

A polynomial transformation from “B” to “A” implies that a polynomial time algorithm to solve “A” can be used to solve “B” in polynomial time, and if “B” is NP-complete then the transformation implies that “A” is at least as complex as any NP-complete problem. Such a problem is called NP-hard. We illustrate this concept with a simple example. Consider the set-covering problem, where we are given a collection $C$ of subsets of a finite set $S$, and a positive integer $I \le |C|$. The problem is to find out if there is a subset $C' \subseteq C$ such that $|C'| \le I$ and each element of $S$ belongs to at least one set in $C'$. By finding a polynomial transformation from a known NP-complete problem to the set-covering problem, we can prove that the set-covering problem is NP-hard. For this purpose, we choose the vertex cover problem, where we are given a graph $G = (V, E)$ and a positive integer $I \le |V|$, and the problem is to determine if there exists a subset of vertices $V' \subseteq V$ such that $|V'| \le I$ and for each edge $e \in E$ either $\mathrm{src}(e) \in V'$ or $\mathrm{snk}(e) \in V'$. The subset $V'$ is said to be a vertex cover of $G$. The vertex cover problem is known to be NP-complete, and by transforming it to the set-covering problem in polynomial time, we can show that the set-covering problem is NP-hard. Given an instance of vertex cover, we can convert it into an instance of set covering by first letting $S$ be the set of edges $E$. Then for each vertex $v \in V$, we construct the subset of edges $T_v = \{e \in E \mid v = \mathrm{src}(e) \text{ or } v = \mathrm{snk}(e)\}$. The sets $T_v$, $v \in V$, form the collection $C$. Clearly, this transformation can be done in time at most linear in the number of edges of the input graph, and the resulting $C$ has size equal to $|V|$. Our transformation ensures that $V'$ is a vertex cover for $G$ if and only if $\{T_v \mid v \in V'\}$ is a set cover for the set of edges $E$. Now, we may use a solution of set cover to solve the transformed problem, since a vertex cover with $|V'| \le I$ exists if and only if a corresponding set cover with $|C'| \le I$ exists for $E$. Thus, the existence of a polynomial time algorithm for set cover implies the existence of a polynomial time algorithm for vertex cover. This proves that set cover is NP-hard.
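The transformation from vertex cover to set cover can be sketched in a few lines of Python. The function names and the representation (edges as a list of pairs, identified by their list index so that parallel edges stay distinct) are assumptions made for this illustration. The brute-force `has_set_cover` check is exponential and is included only so that tiny instances can be tried; it is not a proposed algorithm.

```python
from itertools import combinations

def vertex_cover_to_set_cover(vertices, edges):
    """Build the set-cover instance (S, C) from a vertex-cover instance.

    S is the set of edge indices. For each vertex v, the subset T_v
    holds the edges that have v as their source or their sink.
    """
    S = set(range(len(edges)))
    C = {v: {i for i, (src, snk) in enumerate(edges) if v in (src, snk)}
         for v in vertices}
    return S, C

def has_set_cover(S, C, limit):
    """Brute-force check for a cover of S using at most `limit` subsets."""
    subsets = list(C.values())
    for size in range(0, limit + 1):
        for chosen in combinations(subsets, size):
            if set().union(*chosen) == S:
                return True
    return False

# A triangle needs two vertices to cover all three edges.
tri_vertices = ["a", "b", "c"]
tri_edges = [("a", "b"), ("b", "c"), ("c", "a")]
S, C = vertex_cover_to_set_cover(tri_vertices, tri_edges)
```

For the triangle, each $T_v$ contains two of the three edges, so no single subset covers $S$ but any two do, mirroring the fact that the smallest vertex cover of a triangle has two vertices.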
It can easily be shown that the set cover problem is also NP-complete by showing that it belongs to the class NP. However, since a formal discussion of complexity classes is beyond the scope of this book, we refer the interested reader to standard texts on complexity theory for a comprehensive discussion of complexity classes and the definition of the class NP. In summary, by finding a polynomial transformation from a problem that is known to be NP-complete to a given problem, we can prove that the given problem is NP-hard. This implies that a polynomial time algorithm to solve the given problem in all likelihood does not exist, and if such an algorithm does exist, a major breakthrough in complexity theory would be required to find it. This provides justification for solving such problems using suboptimal polynomial time heuristics. It should be pointed out that a polynomial transformation of an NP-complete problem to a given problem, if it exists, is often quite involved, and is not necessarily as straightforward as in the case of the set-covering example discussed here. In Chapter 10, we use the concepts outlined in this section to show that a particular synchronization optimization problem is NP-hard by reducing the set-covering problem to the synchronization optimization problem. We then discuss efficient heuristics to solve that problem.

There is a rich history of work on shortest path algorithms, and there are many variants and special cases of these problems (depending, for example, on the topology of the graph, or on the values of the edge weights) for which efficient algorithms have been proposed. In what follows we focus on the most general and, from the point of view of this book, most useful shortest path algorithms.

Consider a weighted, directed graph $G = (V, E)$ with real-valued edge weights $w(u, v)$ for each edge $(u, v) \in E$. The single-source shortest path problem finds a path with minimum weight (defined as the sum of the weights of the edges on the path) from a given vertex $s \in V$ to all other vertices $u \in V$, $u \ne s$, whenever at least one path from $s$ to $u$ exists. If no such path exists, then the shortest path weight is set to $\infty$. The two best known algorithms for the single-source shortest path problem are Dijkstra's algorithm and the Bellman-Ford algorithm. Dijkstra's algorithm is applicable to graphs with non-negative weights ($w(u, v) \ge 0$). The running time of this algorithm is $O(|V|^2)$. The Bellman-Ford algorithm solves the single-source shortest path problem for graphs that may have negative edge weights; the Bellman-Ford algorithm detects the existence of negative weight cycles reachable from $s$ and, if such cycles are detected, it reports that no solution to the shortest path problem exists. If a negative weight cycle is reachable from $s$, then clearly we can reduce the weight of any path by traversing this negative cycle one or more times. Thus, no finite solution to the shortest path problem exists in this case. An interesting fact to note is that for graphs containing negative cycles, the problem of determining the weight of the shortest simple path between two vertices is NP-hard. A simple path is defined as one that does not visit the same vertex twice, i.e., a simple path does not include any cycles. The all-pairs shortest path problem computes the shortest path between all pairs of vertices in a graph. Clearly, the single-source problem can be applied repeatedly to solve the all-pairs problem. However, a more efficient algorithm based on dynamic programming, the Floyd-Warshall algorithm, may be used to solve the all-pairs shortest path problem in $O(|V|^3)$ time. This algorithm solves the all-pairs problem in the absence of negative cycles.

The corresponding longest path problems may be solved using the shortest path algorithms. One straightforward way to do this is to simply negate all edge weights (i.e., use the edge weights $w'(u, v) = -w(u, v)$) and apply the Bellman-Ford algorithm for the single-source problem. If all the edge weights are positive, determining the weight of the longest simple path becomes NP-hard for a general graph containing cycles reachable from the source vertex.

In the following sections, we briefly describe the shortest path algorithms discussed thus far. We describe the algorithms in pseudo-code, and assume we only need the weight of the longest or shortest path; these algorithms also yield the actual path, but we do not need this information for the purposes of this book. Also, we will not delve into the correctness proofs of these algorithms, but will refer the reader to texts such as Cormen, Leiserson, and Rivest for a detailed discussion of these graph algorithms.

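The pseudo-code for Dijkstra's algorithm is shown in Figure 3.8. The main loop executes $|V|$ times, and a straightforward implementation of extracting the minimum element takes $O(|V|)$ time for each iteration of the loop, which gives a total running time of $O(|V|^2)$. A more clever implementation of the minimum extraction step leads to a modified implementation of the algorithm with an improved running time.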
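As an illustration of Dijkstra's algorithm, the following Python sketch computes single-source shortest path weights. It is a sketch under stated assumptions rather than the book's implementation: the dictionary-based graph representation and the function name are introduced here, and the minimum extraction step uses a binary heap from Python's `heapq` module, which is only one possible choice for that step.

```python
import heapq
import math

def dijkstra(vertices, edges, source):
    """Shortest path weights from `source` for non-negative edge weights.

    edges: dict mapping (u, v) -> weight w(u, v)  (hypothetical format)
    Returns a dict d with d[v] = math.inf for unreachable vertices.
    """
    adjacency = {v: [] for v in vertices}
    for (u, v), w in edges.items():
        if w < 0:
            raise ValueError("Dijkstra's algorithm needs non-negative weights")
        adjacency[u].append((v, w))

    d = {v: math.inf for v in vertices}
    d[source] = 0
    heap = [(0, source)]
    done = set()
    while heap:
        dist_u, u = heapq.heappop(heap)   # extract the minimum element
        if u in done:
            continue                      # stale heap entry, skip it
        done.add(u)
        for t, w in adjacency[u]:
            if dist_u + w < d[t]:         # d(t) = min(d(t), d(u) + w(u, t))
                d[t] = dist_u + w
                heapq.heappush(heap, (d[t], t))
    return d

example_edges = {("s", "a"): 4, ("s", "b"): 1, ("b", "a"): 2, ("a", "c"): 1}
d = dijkstra(["s", "a", "b", "c", "z"], example_edges, "s")
```

In the example, the direct edge from `s` to `a` has weight 4, but the path through `b` has weight 3, so the latter is reported; the isolated vertex `z` keeps the weight $\infty$.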

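The Bellman-Ford algorithm solves the single-source shortest path problem even when edge weights are negative, and it detects negative cycles reachable from the designated source vertex when these are present. The nested For loop in the algorithm determines its complexity, which is $O(|V||E|)$. This algorithm is based on the dynamic programming technique.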
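A minimal Python sketch of the Bellman-Ford algorithm is given below for illustration. The edge-list representation and the function name are assumptions of this sketch. The nested loop performs $|V| - 1$ passes over all edges, and one further pass reports whether a negative cycle is reachable from the source.

```python
import math

def bellman_ford(vertices, edges, source):
    """Single-source shortest paths with possibly negative edge weights.

    edges: list of (u, v, w) triples  (hypothetical format)
    Returns (d, negative_cycles_exist). When the flag is True, the
    values in d are not meaningful shortest path weights.
    """
    d = {v: math.inf for v in vertices}
    d[source] = 0
    for _ in range(len(vertices) - 1):        # |V| - 1 relaxation passes
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    # A further improvement is possible only if a negative cycle is reachable.
    negative_cycles_exist = any(d[u] + w < d[v] for u, v, w in edges)
    return d, negative_cycles_exist

d_ok, flag_ok = bellman_ford(
    ["s", "a", "b"], [("s", "a", 4), ("s", "b", 5), ("b", "a", -3)], "s")
d_bad, flag_bad = bellman_ford(
    ["s", "a", "b"], [("s", "a", 0), ("a", "b", 1), ("b", "a", -2)], "s")
```

In the first example the negative edge from `b` to `a` makes the path through `b` (weight 2) shorter than the direct edge (weight 4). In the second, the cycle between `a` and `b` has total weight $-1$, so the flag is set.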

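Function SingleSourceShortestPath
Input: Weighted directed graph $G = (V, E)$, with a non-negative edge weight $w(e)$ for each $e \in E$, and a source vertex $s \in V$.
Output: $d(v)$, the weight of the shortest path from $s$ to each vertex $v \in V$.
  Initialize $d(s) = 0$, and $d(v) = \infty$ for all other vertices.
  $V_Q \leftarrow V$
  While $V_Q$ is not empty:
    Extract $u \in V_Q$ such that $d(u) = \min(d(v) \mid v \in V_Q)$.
    For each edge $(u, t) \in E$: $d(t) \leftarrow \min(d(t), d(u) + w(u, t))$.

Figure 3.8. Dijkstra's algorithm.

Next, consider the all-pairs shortest path problem. One simple method of solving this is to apply the single-source problem to all vertices in the graph; this takes $O(|V|^2 |E|)$ time using the Bellman-Ford algorithm. The Floyd-Warshall algorithm improves upon this. A pseudo-code specification of this algorithm is given in Figure 3.10. The triply nested loop in this algorithm clearly implies a complexity of $O(|V|^3)$. This algorithm is based upon dynamic programming: at the $k$th iteration of the outermost loop, the shortest path from the vertex numbered $i$ to the vertex numbered $j$ is determined among all paths that do not visit any vertex numbered greater than $k$. Again, we leave it to texts such as Cormen, Leiserson, and Rivest for a formal proof of correctness.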
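The triply nested loop of the Floyd-Warshall algorithm can be sketched in Python as follows. This is an illustration with assumed conventions: vertices are numbered from 0 (rather than from 1), the input is an edge list, and the function name is introduced here.

```python
import math

def floyd_warshall(n, edges):
    """All-pairs shortest path weights in the absence of negative cycles.

    n: number of vertices, numbered 0 .. n-1
    edges: list of (i, j, w) triples  (hypothetical format)
    Returns an n x n matrix A with A[i][j] the shortest path weight,
    or math.inf if no path from i to j exists.
    """
    A = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i, j, w in edges:
        A[i][j] = min(A[i][j], w)         # keep the lightest parallel edge
    for k in range(n):                    # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if A[i][k] + A[k][j] < A[i][j]:
                    A[i][j] = A[i][k] + A[k][j]
    return A

A = floyd_warshall(3, [(0, 1, 4), (1, 2, -2), (0, 2, 5)])
```

Here the path from vertex 0 to vertex 2 through vertex 1 has weight $4 + (-2) = 2$, which is less than the weight 5 of the direct edge.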

Function SingleSourceShortestPath
Input: Weighted directed graph $G = (V, E)$, with edge weight $w(e)$ for each $e \in E$, and a source vertex $s \in V$.
Output: $d(v)$, the weight of the shortest path from $s$ to each vertex $v \in V$, or else a Boolean indicating the presence of negative cycles reachable from $s$.
  Initialize $d(s) = 0$, and $d(v) = \infty$ for all other vertices.
  Repeat $|V| - 1$ times:
    For each edge $(u, v) \in E$: $d(v) \leftarrow \min(d(v), d(u) + w(u, v))$.
  For each edge $(u, v) \in E$: if $d(v) > d(u) + w(u, v)$, set NegativeCyclesExist $=$ TRUE.

Figure 3.9. The Bellman-Ford algorithm.

Function AllPairsShortestPath
Input: Weighted directed graph $G = (V, E)$, with edge weight $w(e)$ for each $e \in E$.
Output: $d(u, v)$, the weight of the shortest path from $u$ to each vertex $v \in V$.
  Let $|V| = n$; number the vertices $1, 2, \ldots, n$.
  Let $A$ be an $n \times n$ matrix; set $A(i, j)$ as the weight of the edge from the vertex numbered $i$ to the vertex numbered $j$. If no such edge exists, $A(i, j) = \infty$. Also, $A(i, i) = 0$.
  For $k$, $i$, $j$ each ranging over $1, 2, \ldots, n$ (with $k$ outermost): $A(i, j) \leftarrow \min(A(i, j), A(i, k) + A(k, j))$.
  For vertices $u, v \in V$ with enumeration $u \leftarrow i$ and $v \leftarrow j$, set $d(u, v) = A(i, j)$.

Figure 3.10. The Floyd-Warshall algorithm.

As discussed in subsequent chapters, a feasible schedule for an HSDFG is obtained as a solution of a system of difference constraints. Difference constraints are of the form

$$x_i - x_j \le c_{ij},$$

where the $x_i$ are unknowns to be determined, and the $c_{ij}$ are given; this problem is a special case of linear programming. The data precedence constraints between actors in a dataflow graph often lead to a system of difference constraints, as we shall see later. Such a system of inequalities can be solved using shortest path algorithms, by transforming the difference constraints into a constraint graph. This graph consists of a number of vertices equal to the number of variables $x_i$, and for each difference constraint $x_i - x_j \le c_{ij}$ the graph contains an edge $(v_j, v_i)$ with edge weight $c_{ij}$. An additional vertex $v_0$ is included, with zero-weight edges directed from $v_0$ to all other vertices in the graph. The solution to the system of difference constraints is then simply given by the weights of the shortest paths from $v_0$ to all other vertices in the graph. That is, setting each $x_i$ to be the weight of the shortest path from $v_0$ to $v_i$ results in a feasible solution to the set of difference constraints. A feasible solution exists if, and only if, there are no negative cycles in the constraint graph. The difference constraints can therefore be solved using the Bellman-Ford algorithm. The reason for adding $v_0$ is to ensure that negative cycles in the graph, if present, are reachable from the source vertex. This in turn ensures that, given $v_0$ as the source vertex, the Bellman-Ford algorithm will determine the existence of a feasible solution. For example, consider a set of inequalities in three variables whose constraint graph is shown in Figure 3.11. A feasible solution is obtained by computing the shortest paths from $v_0$ to each $v_i$; thus $x_1 = -1$, $x_2 = -3$, and $x_3 = 0$ is a feasible solution. Clearly, given such a feasible solution $\{x_i\}$, if we add the same constant to each $x_i$, we obtain another feasible solution. We make use of such a solution of difference constraints in Chapter 7.

Figure 3.11. Constraint graph.
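The constraint-graph construction can be sketched in Python as follows. Everything specific in this block is an assumption made for illustration: the function name, the triple format for constraints, and in particular the three example inequalities, which are not taken from the text but were chosen here so that the computed result coincides with the feasible solution $-1$, $-3$, $0$ quoted above. The shortest paths are computed with Bellman-Ford style relaxation from the added source vertex.

```python
import math

def solve_difference_constraints(num_vars, constraints):
    """Solve a system of constraints x_i - x_j <= c.

    constraints: list of (i, j, c) meaning x_i - x_j <= c, with
    variables indexed 1 .. num_vars  (hypothetical format).
    Returns [x_1, ..., x_n], or None if no feasible solution exists.
    """
    source = 0                                     # the added vertex v_0
    edges = [(j, i, c) for i, j, c in constraints] # edge (v_j, v_i), weight c
    edges += [(source, v, 0) for v in range(1, num_vars + 1)]

    d = [math.inf] * (num_vars + 1)
    d[source] = 0
    for _ in range(num_vars):                      # |V| - 1 passes
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    if any(d[u] + w < d[v] for u, v, w in edges):
        return None                                # negative cycle: infeasible
    return d[1:]

# Assumed example: x1 - x2 <= 2, x2 - x3 <= -3, x3 - x1 <= 1
solution = solve_difference_constraints(3, [(1, 2, 2), (2, 3, -3), (3, 1, 1)])
# Contradictory pair: x1 - x2 <= -1 and x2 - x1 <= -1
infeasible = solve_difference_constraints(2, [(1, 2, -1), (2, 1, -1)])
```

Because every vertex is reachable from the added source through a zero-weight edge, all returned values are finite and non-positive; adding the same constant to each value gives another feasible solution.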

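The maximum cycle mean for an HSDFG $G$ is defined as

$$\mathrm{MCM}(G) = \max_{\text{cycle } C \text{ in } G} \left\{ \frac{\sum_{v \text{ is on } C} t(v)}{\mathrm{Delay}(C)} \right\},$$

where $t(v)$ denotes the execution time of actor $v$, and the numerator is the sum of $t(v)$ over all actors $v$ that are on the cycle $C$. As we shall see in subsequent chapters, the maximum achievable throughput for a given HSDFG is related to the maximum cycle mean of the graph. Comprehensive overviews of algorithms for computing the maximum cycle mean are available in the literature.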
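To make the notion of a maximum cycle mean concrete, the following Python sketch computes it by brute force for a small graph. It assumes the usual definition, in which the mean of a cycle is the sum of the execution times of the actors on the cycle divided by the total delay on the cycle, and it assumes that every cycle carries at least one delay. The function name and data layout are hypothetical, and the exhaustive enumeration of simple cycles is exponential in the worst case; it is not one of the efficient algorithms referred to in the text.

```python
from fractions import Fraction

def maximum_cycle_mean(exec_time, edges):
    """Brute-force maximum cycle mean of a small graph.

    exec_time: dict actor -> execution time t(v)
    edges: list of (src, snk, delay) triples  (hypothetical format)
    Returns a Fraction, or None if the graph has no cycle.
    """
    actors = sorted(exec_time)
    rank = {v: i for i, v in enumerate(actors)}
    out = {v: [] for v in actors}
    for src, snk, delay in edges:
        out[src].append((snk, delay))

    best = None

    def extend(start, v, time_sum, delay_sum, on_path):
        nonlocal best
        for w, delay in out[v]:
            if w == start:                          # closed a simple cycle
                mean = Fraction(time_sum, delay_sum + delay)
                if best is None or mean > best:
                    best = mean
            elif rank[w] > rank[start] and w not in on_path:
                extend(start, w, time_sum + exec_time[w],
                       delay_sum + delay, on_path | {w})

    # Each simple cycle is explored from its lowest-ranked actor only.
    for s in actors:
        extend(s, s, exec_time[s], 0, {s})
    return best

mcm = maximum_cycle_mean(
    {"A": 2, "B": 3, "C": 1},
    [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2)])
```

In the example, the cycle through A and B has mean $(2 + 3) / 1 = 5$ and the cycle through B and C has mean $(3 + 1) / 2 = 2$, so the maximum cycle mean is 5.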

In this chapter we covered background relating to dataflow models of computation. We discussed the conversion of a general SDF graph into an HSDFG, and the generation of an Acyclic Precedence Expansion Graph. We briefly described asymptotic notation and the notion of NP-complete problems. We described some useful shortest path algorithms that are used extensively in the following chapters, and defined the maximum cycle mean. This background will be used extensively in the remainder of this book.


This chapter discusses parallel scheduling of application graphs. The performance metric of interest for evaluating schedules is the average iteration period $T$: the average time it takes for all the actors in the graph to be executed once. Equivalently, we could use the throughput $T^{-1}$ (i.e., the number of iterations of the graph executed per unit time) as a performance metric. Thus an optimal schedule is one that minimizes $T$.

In the execution of a dataflow graph, actors fire when a sufficient number of tokens are present at their inputs. A dataflow graph therefore lends itself naturally to functional parallelism, where the problem is to assign tasks to processors. Systolic and wavefront arrays, on the other hand, exploit data parallelism, where the data set is partitioned among multiple processors executing the same program. Ideally, we would like to exploit data parallelism along with functional parallelism within the same parallel programming framework. Such a combined framework is currently an active research topic; several parallel languages have been proposed recently that allow a programmer to specify both data as well as functional parallelism [BH98][RSB97]. Ramaswamy et al. [RSB97] propose a hierarchical Macro Dataflow Graph representation of programs written in FORTRAN. Atomic nodes at the lowest level of the hierarchy represent tasks that are run in a data parallel fashion on a specified number of processors. The nodes themselves are run concurrently, utilizing functional parallelism. The work of Printz [Pri91] on geometric scheduling, and the Multidimensional SDF model proposed by Lee in [Lee93], are two other promising approaches for combining data and functional parallelism.

strategy is essentially a self-timed approach where the order in which processors communicate is determined at compile time, and the target hardware enforces the predetermined transaction order during run-time. Such a strategy leads to a low-overhead interprocessor communication mechanism. We will discuss this model in greater detail in the following two chapters. The trade-off between the generality of the applications that can be targeted by a particular scheduling model, and the run-time overhead and implementation complexity entailed by that model, is shown in Figure 4.1. We discuss these scheduling strategies in detail in the following sections.

Figure 4.1. Trade-off of generality against run-time overhead and implementation complexity.

In the fully-static strategy, the exact firing time of each actor is assumed to be known at compile time. Such a scheduling style is used in conjunction with the systolic array architectures discussed in Section 2.4, for scheduling the processors discussed in Section 2.2.3, and also in high-level synthesis of applications that consist only of operations with guaranteed worst-case execution times [De 94]. Under a fully-static schedule, all processors run in lock step; the operation each processor performs on each clock cycle is predetermined at compile time and is enforced at run-time either implicitly (by the program each processor executes, perhaps augmented with “nop”s or idle cycles for correct timing) or explicitly (by means of a program sequencer, for example). A fully-static schedule of a simple HSDFG $G$ is illustrated in Figure 4.2. The fully-static schedule is schematically represented as a Gantt chart, which indicates the processors along the vertical axis, and time along the horizontal axis. The actors are represented as rectangles with horizontal length equal to the execution time of the actor. The left edge of each rectangle in the Gantt chart corresponds to the starting time of the corresponding actor. The Gantt chart can be viewed as a processor-time plane; scheduling can then be viewed as a mechanism to tile this plane while minimizing total schedule length, or equivalently minimizing idle time (“empty spaces” in the tiling process). Clearly, the fully-static strategy is viable only if actor execution time estimates are accurate and data-independent, or if tight worst-case estimates are available for these execution times. As shown in Figure 4.2, two different types of fully-static schedules arise, depending on how successive iterations of the HSDFG are treated. Execution times of all actors are assumed to be one time unit in this example. The fully-static schedule in Figure 4.2(b) represents a blocked schedule: successive iterations of the HSDFG in a blocked schedule are treated separately so that each iteration is completed before the next one begins. A more elaborate blocked schedule on five processors is shown in Figure 4.3.
The HSDFG is scheduled as if it executes for only one iteration, i.e., inter-iteration dependencies are ignored; this schedule is then repeated to get an infinite periodic schedule for the HSDFG. The length of the blocked schedule determines the average iteration period $T$. The scheduling problem is then to obtain a schedule that minimizes $T$ (which is also called the makespan of the schedule). A lower bound on $T$ for a blocked schedule is simply the length of the critical path of the graph, which is the longest delay-free path in the graph. Ignoring the inter-iteration dependencies when scheduling an application graph is equivalent to the classical multiprocessor scheduling problem for an Acyclic Precedence Expansion Graph (APEG). As discussed in Section 3.8, the APEG is obtained from the given application graph by eliminating all edges with delays on them (edges with delays represent dependencies across iterations) and replacing multiple edges that are directed between the same two vertices in the same direction with a single edge. This replacement is done because such multiple edges represent identical precedence constraints; these edges are taken into account individually during buffer assignment, however.

MULTIPROCESSOR SCHEDULING MODELS

Figure 4.2. Fully-static schedule: (a) HSDFG and its acyclic precedence graph; (b) blocked schedule; (c) overlapped schedule.

Optimal multiprocessor scheduling of an acyclic graph is known to be NP-hard, and a number of heuristics have been proposed for this problem. One of the earliest, and still popular, solutions to this problem is list-scheduling, first proposed by Hu [Hu61]. List-scheduling is a greedy approach: whenever a task is ready to run, it is scheduled as soon as a processor is available to run it. Tasks are assigned priorities, and among the tasks that are ready to run at any instant, the task with the highest priority is executed first. Various researchers have proposed different priority mechanisms for list-scheduling [ACD74], some of which use critical-path-based methods [Koh75][Bla87] ([Bla87] summarizes a large number of these).
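The greedy idea behind list-scheduling can be illustrated with a short Python sketch. This is a simplified variant written for this illustration, not the procedure of any of the cited works: the function name and data layout are hypothetical, interprocessor communication costs are ignored, and the priority of a task is assumed to be the length of the longest path from that task to any sink (a critical-path-based choice). The input graph is assumed to be acyclic.

```python
def list_schedule(exec_time, edges, num_procs):
    """Greedy list scheduling of an acyclic precedence graph.

    exec_time: dict task -> execution time
    edges: list of (pred, succ) precedence pairs
    Returns dict task -> (processor index, start time).
    """
    succs = {v: [] for v in exec_time}
    preds = {v: [] for v in exec_time}
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    level = {}                        # assumed priority: longest path to a sink
    def longest_to_sink(v):
        if v not in level:
            level[v] = exec_time[v] + max(
                (longest_to_sink(w) for w in succs[v]), default=0)
        return level[v]
    for v in exec_time:
        longest_to_sink(v)

    proc_free = [0] * num_procs       # time at which each processor becomes idle
    finish = {}
    schedule = {}
    unscheduled = set(exec_time)
    while unscheduled:
        # A task is ready once all of its predecessors have been scheduled.
        ready = [v for v in unscheduled if all(p in finish for p in preds[v])]
        # Highest priority first; ties are broken by task name.
        task = min(ready, key=lambda v: (-level[v], v))
        data_ready = max((finish[p] for p in preds[task]), default=0)
        # Pick the processor on which the task can start earliest.
        proc = min(range(num_procs),
                   key=lambda p: max(proc_free[p], data_ready))
        start = max(proc_free[proc], data_ready)
        schedule[task] = (proc, start)
        finish[task] = start + exec_time[task]
        proc_free[proc] = finish[task]
        unscheduled.remove(task)
    return schedule

times = {"A": 1, "B": 2, "C": 1, "D": 1}
precedences = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
schedule = list_schedule(times, precedences, num_procs=2)
makespan = max(start + times[v] for v, (_, start) in schedule.items())
```

In the example, B and C run in parallel on the two processors after A completes, and the resulting schedule length of 4 equals the length of the critical path A, B, D, which is the lower bound for any blocked schedule of this graph.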

(a) HSDFG; (c) fully-static execution ($T = 11$).

N iterations of

is described in detail in Section 4.8.


can be computed efficiently and optimally in polynomial time [PM91][GS92]. Overlapped scheduling heuristics have not been studied as extensively as blocked schedules. The main work in this area is by Lam [Lam88], and deGroot [dGH92], who propose a modified list-scheduling heuristic that explicitly constructs an overlapped schedule. Another work related to overlapped scheduling is the “cyclo-static scheduling” approach proposed by Schwartz. This approach attempts to optimally tile the processor-time plane to obtain the best possible schedule. The search involved in this process has a worst-case complexity that is exponential in the size of the input graph, although it appears that the complexity is manageable in practice, at least for small examples [SI85].

The fully-static approach introduced in the previous section cannot be used when actors have variable execution times; the fully-static approach requires precise knowledge of actor execution times to guarantee sender-receiver synchronization. It is possible to use worst-case execution times and still employ a fully-static strategy, but this requires tight worst-case execution time estimates that may not be available to us. An obvious strategy for solving this problem is to introduce explicit synchronization whenever processors communicate. This leads to the self-timed scheduling (ST) strategy in the scheduling taxonomy of Lee and Ha [LH89]. In this strategy we first obtain a fully-static schedule using techniques that will be discussed in Chapter 5, making use of the execution time estimates. After computing the fully-static schedule (Figure 4.4(b)), we simply discard the timing information that is not required, and only retain the processor assignment and the ordering of actors on each processor as specified by the fully-static schedule (Figure 4.4(c)). Each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Exactly when a processor executes each actor depends on when, at run-time, all input data for that actor is available, unlike the fully-static case where no such run-time check is needed. Conceptually, the processor sending data writes data into a FIFO buffer, and blocks when that buffer is full; the receiver, on the other hand, blocks when the buffer it reads from is empty. Thus flow control is performed at run-time. The buffers may be implemented using shared memory, or using hardware FIFOs between processors. In a self-timed strategy, processors run sequential programs and communicate when they execute the communication primitives embedded in their programs, as shown schematically in Figure 4.4(c).
The multiple DSP machines that we discussed in Section 2.5 all employ some form of self-timed scheduling. Clearly, general purpose parallel machines can also be programmed using the self-timed scheduling style, since these machines provide mechanisms for run-time synchronization and flow control.
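The run-time flow control used in self-timed scheduling can be mimicked in software. The Python sketch below is only an analogy built on assumptions: two threads stand in for two processors, a bounded `queue.Queue` stands in for the FIFO buffer between them, and the function name and the squared-index "token" are invented for the illustration. The sender blocks when the buffer is full and the receiver blocks when it is empty, so neither side needs to know the other's timing.

```python
import queue
import threading

def run_self_timed(num_iterations, buffer_size=2):
    """Two 'processors' (threads) that synchronize only through a FIFO."""
    fifo = queue.Queue(maxsize=buffer_size)   # bounded buffer between processors
    received = []

    def sender():
        for k in range(num_iterations):
            token = k * k            # stand-in for the actors run before 'send'
            fifo.put(token)          # blocks while the buffer is full

    def receiver():
        for _ in range(num_iterations):
            token = fifo.get()       # blocks while the buffer is empty
            received.append(token)   # stand-in for the actors run after 'receive'

    threads = [threading.Thread(target=sender),
               threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

result = run_self_timed(5)
```

The order of tokens is preserved regardless of how the two threads are interleaved, which reflects why a self-timed schedule remains correct when actor execution times vary.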

A self-timed scheduling strategy is robust with respect to changes in execution times of actors, because sender-receiver synchronization is performed at run-time. Such a strategy, however, implies higher IPC costs compared to the fully-static strategy because of the need for synchronization (e.g., using semaphore management). In addition, the self-timed scheduling strategy faces arbitration costs: the fully-static schedule guarantees mutually exclusive access of shared communication resources, whereas shared resources need to be arbitrated at run-time in the self-timed schedule. Consequently, whereas IPC in the fully-static schedule simply involves reading and writing from shared memory (no synchronization or arbitration needed), implying a cost of a few processor cycles for IPC, the self-timed scheduling strategy requires of the order of tens of processor cycles, unless special hardware is employed for run-time flow control. Run-time flow control allows variations in execution times of tasks; in addition, it also simplifies the compiler software, since the compiler no longer needs to perform detailed timing analysis to ensure correct sender-receiver synchronization. Some multiprocessor designs [Lam88] that could potentially use fully-static scheduling still choose to implement such run-time flow control (at the expense of additional hardware) for the resulting software simplicity. This presents an interesting trade-off between hardware complexity and compiler simplicity when we consider dynamic flow control implemented in hardware versus static flow control enforced by a compiler.

Figure 4.4. Steps in a self-timed scheduling strategy: (a) HSDFG; (b) fully-static schedule; (c) self-timed implementation (schematic).

ection 2. l, where an les instructions in the The dataflow

communication. Embedded signal processing systems will usually not require this type of scheduling, owing to the run-time overhead and complexity involved, and the availability of compile-time information that makes static scheduling techniques practical.

Actors that exhibit data-dependent execution time usually do so because they include one or more data-dependent control structures, for example conditionals and data-dependent iterations. In such a case, if we have some knowledge about the statistics of the control variables (the number of iterations a loop will go through, or the Boolean value of the control input to an if-then-else type construct), it is possible to obtain a static schedule that optimizes the average execution time of the overall computation. The key idea here is to define an execution profile for each actor in the dataflow graph. An execution profile for a dynamic construct consists of the number of processors assigned to it, and a local schedule of that construct on the assigned processors; the profile essentially defines the shape that a dynamic actor takes in the processor-time plane. In case the actor execution is data-dependent, an exact profile cannot be pre-determined at compile time. In such a case, the profile is chosen by making use of statistical information about the actor, e.g., average execution time, probability distributions of control variables, etc. Such an approach is called quasi-static scheduling [Lee88b]. Figure 4.5 shows a quasi-static strategy applied to a conditional construct (adapted from [Lee88b]).

Ha [HL97] has applied the quasi-static approach to dataflow constructs representing data-dependent iteration, recursion, and conditionals, where optimal profiles are computed assuming knowledge of the probability density functions of the data-dependent variables that influence the profile. The data-dependent constructs must be identified in a given dataflow graph, either manually or automatically, before Ha's techniques can be applied. These techniques make the simplifying assumption that the control tokens for different dynamic actors are independent of one another, and that each control stream consists of tokens that take TRUE or FALSE values randomly and are independent and identically distributed (i.i.d.) according to statistics known at compile time.

Ha's quasi-static approach constructs a blocked schedule for one iteration of the dataflow graph. The dynamic constructs are scheduled in a hierarchical fashion; each dynamic construct is scheduled on a certain number of processors, and is then converted into a single node in the graph and is assigned a certain execution profile. When scheduling the remainder of the graph, the dynamic construct is treated as an atomic block, and its execution profile is used to determine how to schedule the remaining actors around it; the profile helps in tiling actors in the processor-time plane with the objective of minimizing the overall schedule length. Such a hierarchical scheme effectively handles nested control constructs, e.g., nested conditionals. The locally optimal decisions made for the dynamic constructs are shown to be effective when the variability in a dynamic construct is small. We will return to quasi-static schedules again in Chapter 8.

To model execution times of actors (and to perform static scheduling), we associate an execution time $t(v)$ (a non-negative integer) with each actor $v$ in the HSDFG; $t(v)$ assigns an execution time to each actor (the actual execution time can be interpreted as $t(v)$ cycles of a base clock). Interprocessor communication costs are represented by assigning execution times to the send and receive actors. The values $t(v)$ may be set equal to execution time estimates when exact execution times are not available, in which case results of the computations that make use of these values (e.g., the iteration period $T$) are compile-time estimates.

Recall that actors in an HSDFG are executed essentially infinitely. Each firing of an actor is called an invocation of that actor. An iteration of the HSDFG corresponds to one invocation of every actor in the HSDFG. A schedule specifies processor assignment, actor ordering, and firing times of actors, and these may be done at compile-time or at run-time, depending on the scheduling strategy being employed. To specify firing times, we let the function $\mathrm{start}(v, k) \in Z^{+}$ represent the time at which the $k$th invocation of the actor $v$ starts. Correspondingly, the function $\mathrm{end}(v, k)$ represents the time at which the $k$th execution of the actor $v$ completes, at which point $v$ produces data tokens at its output edges. Since we are interested in the $k$th execution of each actor for $k = 0, 1, 2, \ldots$, we set $\mathrm{start}(v, k) = 0$ and $\mathrm{end}(v, k) = 0$ for $k < 0$ as the “initial conditions”. If the $k$th invocation of an actor $v_j$ takes $t(v_j)$ time units to complete for all $k$, then we can claim:

$$\mathrm{end}(v_j, k) = \mathrm{start}(v_j, k) + t(v_j).$$

Recall that a fully-static schedule specifies a processor assignment, actor ordering on each processor, and also the precise firing times of actors. We use the following notation for a fully-static schedule:

A fully-static schedule $S$ (for $P$ processors) specifies a triple:

$$S = \{\sigma_p(v), \sigma_t(v), T_{FS}\},$$

where $\sigma_p(v) \rightarrow [1, 2, \ldots, P]$ is the processor assignment, and $T_{FS}$ is the iteration period. A fully-static schedule specifies the firing times $\mathrm{start}(v, k)$ of all actors, and since we want a finite representation for an infinite schedule, a fully-static schedule is constrained to be periodic:

$$\mathrm{start}(v, k) = \sigma_t(v) + k T_{FS};$$

$\sigma_t(v)$ is thus the starting time of the first execution of actor $v$ (i.e., $\mathrm{start}(v, 0) = \sigma_t(v)$). Clearly, the throughput for such a schedule is $T_{FS}^{-1}$.

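The $\sigma_p(v)$ function and the $\sigma_t(v)$ values are chosen so that all data precedence constraints and resource constraints are met. We define precedence constraints as follows:

An edge $(v_j, v_i) \in E$ in an HSDFG $(V, E)$ represents the (data) precedence constraint:

$$\mathrm{start}(v_i, k) \ge \mathrm{end}(v_j, k - \mathrm{delay}((v_j, v_i))), \quad \text{for all } k \ge \mathrm{delay}(v_j, v_i). \qquad \text{(4-1)}$$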
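The periodic firing times of a fully-static schedule and the precedence constraint (4-1) can be checked mechanically. The Python sketch below is an illustration under assumptions: the names `fully_static_start` and `satisfies_precedence`, the dictionary `sigma_t` of first start times, and the edge-triple format are all introduced here, and only data precedence is checked (processor resource constraints are not). It uses $\mathrm{start}(v, k) = \sigma_t(v) + k T_{FS}$ and $\mathrm{end}(v, k) = \mathrm{start}(v, k) + t(v)$.

```python
def fully_static_start(sigma_t, T_fs, v, k):
    """start(v, k) = sigma_t(v) + k * T_FS for a periodic schedule."""
    return sigma_t[v] + k * T_fs

def satisfies_precedence(sigma_t, T_fs, exec_time, edges, num_iterations):
    """Check constraint (4-1) for invocations k < num_iterations.

    edges: list of (v_j, v_i, delay) for an edge directed from v_j to v_i.
    """
    for v_j, v_i, delay in edges:
        for k in range(delay, num_iterations):
            start_i = fully_static_start(sigma_t, T_fs, v_i, k)
            end_j = (fully_static_start(sigma_t, T_fs, v_j, k - delay)
                     + exec_time[v_j])
            if start_i < end_j:
                return False
    return True

# Two actors in a loop: A -> B with no delay, B -> A with one delay.
t = {"A": 1, "B": 2}
loop_edges = [("A", "B", 0), ("B", "A", 1)]
first_starts = {"A": 0, "B": 1}
ok_with_period_3 = satisfies_precedence(first_starts, 3, t, loop_edges, 10)
ok_with_period_2 = satisfies_precedence(first_starts, 2, t, loop_edges, 10)
```

With an iteration period of 3, each invocation of A starts exactly when the previous invocation of B ends, so all constraints hold. With a period of 2, A would have to start before B has produced the token it needs, and the check fails.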

The above definition arises because each actor consumes one token from each of its input edges when it fires. Since there are already $\mathrm{delay}(e)$ tokens on each incoming edge $e$ of actor $v$, another $k - \mathrm{delay}(e) - 1$ tokens must be produced on $e$ before the $k$th execution of $v$ can begin. Thus the actor $\mathrm{src}(e)$ must have completed its $(k - \mathrm{delay}(e) - 1)$th execution before $v$ can begin its $k$th execution. The "$-1$"s arise because we define $\mathrm{start}(v, k)$ for $k \ge 0$ rather than $k > 0$. This is done for notational convenience.

Any schedule that satisfies all the precedence constraints specified by the edges of an HSDFG $G$ is called an admissible schedule for $G$ [Rei68]. A valid execution of an HSDFG corresponds to a set of $\mathrm{start}(v, k)$ values that constitute an admissible schedule. That is, a valid execution respects all data precedences specified by the HSDFG.

For the purposes of the techniques presented in this book, we are only interested in the precedence relationships between actors in the HSDF graph. In a general HSDFG, one or more pairs of vertices can have multiple edges connecting them in the same "direction"; in other words, a general HSDFG is a directed multigraph (Section 3.1). Such a multigraph often arises when a multirate SDF graph is converted into an HSDFG. Multiple edges between the same pair of vertices in the same direction are redundant as far as precedence relationships are concerned. Suppose there are multiple edges from vertex $v_i$ to $v_j$, and amongst these edges, the minimum edge delay is equal to $d_{min}$. Then, if we replace all of these edges by a single edge with delay equal to $d_{min}$, it is easy to verify that this single edge maintains the precedence constraints for all of the edges that were directed from $v_i$ to $v_j$. Thus a general HSDFG may be preprocessed into a form where the source and sink vertices uniquely identify an edge in the graph, and we may represent an edge $e \in E$ by the ordered pair $(\mathrm{src}(e), \mathrm{snk}(e))$. That is, an HSDFG that is a directed multigraph may be transformed into a directed graph such that the precedence constraints of the original HSDFG are maintained by the transformation. Such a transformation is illustrated in Figure 4.6. The multiple edges are taken into account individually when buffers are assigned to the arcs in the graph.
We perform such a transformation to avoid needless clutter in analyzing HSDFGs, and to reduce the running time of algorithms that operate on HSDFGs.
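The multigraph-to-graph preprocessing step described above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: edges are given as hypothetical `(source, sink, delay)` triples, and the function name `merge_parallel_edges` is not taken from the text.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable, int]  # (source, sink, delay); assumed layout


def merge_parallel_edges(edges: Iterable[Edge]) -> Dict[Tuple[Hashable, Hashable], int]:
    """Collapse same-direction parallel edges, keeping the minimum delay."""
    merged: Dict[Tuple[Hashable, Hashable], int] = {}
    for src, snk, delay in edges:
        key = (src, snk)
        # The smallest delay imposes the tightest precedence constraint,
        # so it subsumes every other edge between the same ordered pair.
        if key not in merged or delay < merged[key]:
            merged[key] = delay
    return merged


# Three parallel edges from A to B and one edge in the opposite direction.
multigraph = [("A", "B", 2), ("A", "B", 0), ("A", "B", 1), ("B", "A", 3)]
simple = merge_parallel_edges(multigraph)
```

Edges in opposite directions are kept separate, since only edges in the same direction are redundant with respect to precedence.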

In a self-timed scheduling strategy, we determine a fully-static schedule using the execution time estimates, but we retain only the processor assignment and the ordering of actors on each processor as specified by that schedule; we discard the precise timing information specified in the fully-static schedule. Although we may start out by setting the $\mathrm{start}(v, 0)$ values [...], the subsequent $\mathrm{start}(v, k)$ values are determined at run-time based on the availability of data at the input of each actor. The average iteration period of a self-timed [...] analyze the evolution of a self-timed sched-

As we discussed in Section 4.3, in some cases it is advantageous to unfold a graph by a certain unfolding factor, say $N$, and schedule $N$ iterations of the graph together in order to exploit inter-iteration parallelism more effectively. In this section, we describe the unfolding transformation.

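Figure 4.6. Transforming an HSDFG that is a directed multigraph into one that is a directed graph while maintaining precedence constraints.

An HSDFG $G = (V, E)$ unfolded $N$ times represents $N$ iterations of the original graph $G$; the unfolding transformation therefore results in another HSDFG $G^N = (V^N, E^N)$ that contains $N$ copies of each of the vertices of $G$. Let $v_i$ be a vertex in $V$, and let $v_i^0, v_i^1, \ldots, v_i^{N-1}$ be the $N$ copies of $v_i$ in $V^N$. From the definition of $G^N$, it is obvious that

$$\mathrm{start}(v_i^l, m) = \mathrm{start}(v_i, mN + l) \quad \text{and} \quad \mathrm{end}(v_i^l, m) = \mathrm{end}(v_i, mN + l) \quad \text{for all } m \ge 0 \text{ and } 0 \le l < N. \tag{4-2}$$

Also, $G^N$ maintains exactly the same precedence constraints as $G$, and therefore the edges $E^N$ must reflect the same inter- and intra-iteration precedence constraints as the edges $E$. For the precedence constraint in $G$ represented by the edge $(v_j, v_i) \in E$, there will be a set of one or more edges in $E^N$ that represents the same precedence constraint in $G^N$. The construction of $E^N$ is as follows. From (4-1), an edge $(v_j, v_i) \in E$ represents the precedence constraint

$$\mathrm{start}(v_i, k) \ge \mathrm{end}(v_j, k - \mathrm{delay}((v_j, v_i))) \quad \text{for all } k \ge \mathrm{delay}((v_j, v_i)).$$

Now, we can let $k = mN + l$, with $0 \le l < N$, and write $\mathrm{delay}((v_j, v_i))$ as

$$\mathrm{delay}((v_j, v_i)) = \left\lfloor \frac{\mathrm{delay}((v_j, v_i))}{N} \right\rfloor N + (\mathrm{delay}((v_j, v_i)) \bmod N), \tag{4-4}$$

where $(x \bmod y)$ equals the value of $x$ taken modulo $y$, and $\lfloor x / y \rfloor$ equals the quotient obtained when $x$ is divided by $y$. Then (4-1) can be written as

$$\mathrm{start}(v_i, mN + l) \ge \mathrm{end}\left(v_j, \left(m - \left\lfloor \frac{\mathrm{delay}((v_j, v_i))}{N} \right\rfloor\right) N + l - (\mathrm{delay}((v_j, v_i)) \bmod N)\right). \tag{4-5}$$

We now consider two cases:

1. If $(\mathrm{delay}((v_j, v_i)) \bmod N) \le l < N$, then (4-5) may be combined with (4-2) to yield:

$$\mathrm{start}(v_i^l, m) \ge \mathrm{end}\left(v_j^{\,l - (\mathrm{delay}((v_j, v_i)) \bmod N)}, m - \left\lfloor \frac{\mathrm{delay}((v_j, v_i))}{N} \right\rfloor\right). \tag{4-6}$$

2. If $0 \le l < (\mathrm{delay}((v_j, v_i)) \bmod N)$, then (4-5) may be combined with (4-2) to yield:

$$\mathrm{start}(v_i^l, m) \ge \mathrm{end}\left(v_j^{\,N + l - (\mathrm{delay}((v_j, v_i)) \bmod N)}, m - \left\lfloor \frac{\mathrm{delay}((v_j, v_i))}{N} \right\rfloor - 1\right). \tag{4-7}$$

Equations (4-6) and (4-7) are summarized as follows. For each edge $(v_j, v_i) \in E$, there is a set of $N$ edges in $E^N$, which is the edge set of $G^N$. In particular, $E^N$ contains edges from $v_j^{\,l - (\mathrm{delay}((v_j, v_i)) \bmod N)}$ to $v_i^l$, each with delay $\lfloor \mathrm{delay}((v_j, v_i)) / N \rfloor$, for values of $l$ such that $(\mathrm{delay}((v_j, v_i)) \bmod N) \le l < N$. In addition, $E^N$ contains edges from $v_j^{\,N + l - (\mathrm{delay}((v_j, v_i)) \bmod N)}$ to $v_i^l$, each with delay $\lfloor \mathrm{delay}((v_j, v_i)) / N \rfloor + 1$, for values of $l$ such that $0 \le l < (\mathrm{delay}((v_j, v_i)) \bmod N)$. Note that if $\mathrm{delay}((v_j, v_i)) = 0$, then there is a zero-delay edge from each $v_j^l$ to $v_i^l$ for $0 \le l < N$.

Figure 4.7 shows an example of the unfolding transformation. Figure 4.8 lists an algorithm that may be used for unfolding. Note that this algorithm has a complexity of [...].

When constructing a schedule for an unfolded graph, the processor assignment and the actor starting times are defined for all vertices of the unfolded graph (i.e., $\sigma_p$ and $\sigma_t$ are defined for $N$ invocations of each actor); $T_{FS}$ is the iteration period for the unfolded graph, and the average iteration period for the original graph is then $T_{FS} / N$. In what follows, we assume we are dealing with the unfolded graph and we refer only to the iteration period and throughput of the unfolded graph, if unfolding is in fact employed, with the understanding that these quantities can be scaled by the unfolding factor to obtain the corresponding quantities for the original graph.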
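The edge construction for the unfolded graph can be sketched as follows. This is a minimal illustration rather than the algorithm of Figure 4.8: it assumes each edge is uniquely identified by its `(source, sink)` pair, represents copy $l$ of vertex `v` as the hypothetical tuple `(v, l)`, and uses the rule that an edge with delay $d$ yields, for each sink copy $l$, an edge from source copy $(l - d) \bmod N$ with delay $\lfloor d/N \rfloor$ or $\lfloor d/N \rfloor + 1$.

```python
def unfold(vertices, edges, n):
    """Unfold an HSDFG by a factor n.

    vertices: iterable of vertex names.
    edges: dict mapping (src, snk) -> delay (non-negative int).
    Returns (unfolded_vertices, unfolded_edges) using (name, copy) tuples.
    """
    unfolded_vertices = [(v, l) for v in vertices for l in range(n)]
    unfolded_edges = {}
    for (src, snk), delay in edges.items():
        quotient, remainder = divmod(delay, n)
        for l in range(n):
            if l >= remainder:
                # Source copy lies in the same unfolded iteration window.
                unfolded_edges[((src, l - remainder), (snk, l))] = quotient
            else:
                # Source copy wraps around, costing one extra unit of delay.
                unfolded_edges[((src, n + l - remainder), (snk, l))] = quotient + 1
    return unfolded_vertices, unfolded_edges


# A two-actor cycle: A -> B with no delay, B -> A with one delay.
example_edges = {("A", "B"): 0, ("B", "A"): 1}
v3, e3 = unfold(["A", "B"], example_edges, 3)
```

A convenient sanity check on any such construction is that the delays of the $N$ edges generated from one original edge sum to the original delay.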

Figure 4.7. Example of an unfolding transformation: HSDFG $G$ is unfolded by a factor of 3 to obtain the unfolded HSDFG $G^3$.

We assume that we have reasonably good estimates of actor execution times available to us at compile time to enable us to exploit static scheduling techniques; however, these estimates need not be exact, and execution times of actors may even be data-dependent. Thus we allow actors that have different execution times from one iteration of the HSDFG to the next, as long as these variations are small or rare. This is the case when estimates are available for the execution times, and actual execution times are close to the corresponding estimates with high probability, but deviations from the estimates of (effectively) arbitrary magnitude occasionally occur due to phenomena such as cache misses, interrupts, user inputs, or error handling. Consequently, tight worst-case execution time bounds cannot generally be determined for such operations; however, reasonably good execution time estimates can in fact be obtained for these operations, so that static assignment and ordering techniques are viable. For such applications self-timed scheduling is ideal, because the performance penalty due to lack of dynamic load balancing is overcome by the much smaller run-time scheduling overhead involved when static assignment and ordering is employed.

The estimates for execution times of actors can be obtained by several different mechanisms. The most straightforward method is for the programmer to provide these estimates while developing the library of primitive blocks (actors). In this method, the programmer specifies the execution time estimates for each actor as a mathematical function of the parameters associated with that actor (e.g., number of filter taps for an FIR filter, or the block size of a block operation such as an FFT). This strategy is used in the Ptolemy system [Pto98], for example, and is especially effective for libraries in which the primitives are written in the assembly language of the target processor. The programmer can provide a good estimate for blocks written in such a low-level library by counting the number of processor cycles each instruction consumes, or by profiling the block on an instruction-set simulator.

It is more difficult to estimate execution times for blocks that contain control constructs such as data-dependent iterations and conditionals within their body, and when the target processor employs pipelining and caching. Also, it is difficult, if not impossible, for the programmer to provide reasonably accurate estimates of execution times for blocks written in a high-level language (as in the C code generation library in Ptolemy).
The solution adopted in the GRAPE system [LEP90] is to automatically estimate these execution times by compiling the block (if necessary) and running it by itself in a loop on an instruction-set simulator for the target processor. To take into account data-dependent execution behavior, different input data sets can be provided for the block during simulation. Either the worst-case or the average-case execution time is used as the final estimate.

The estimation procedure employed by GRAPE is obviously time-consuming; in fact, estimation turns out to be the most time-consuming step in the GRAPE design flow. Analytical techniques can be used instead to reduce this estimation time; for example, Li and Malik [LM95] have proposed algorithms for estimating the execution time of embedded software. Their estimation technique, which forms a part of a tool called cinderella, consists of two components: 1) determining the sequence of instructions in the program that results in maximum execution time (program path analysis) and 2) modeling the target processor to determine how much time the worst-case sequence determined in step 1 takes to execute (micro-architecture modeling). The target processor model also takes the effect of instruction pipelines and cache activity into account. The input to the tool is a generic C program with annotations that specify the loop bounds (i.e., the maximum number of iterations for which a loop runs). Although the problem is formulated as an integer linear program (ILP), the claim is that practical inputs to the tool can be efficiently analyzed using a standard ILP solver. The advantage of this approach, therefore, is the efficient manner in which estimates are obtained compared to simulation.

It should be noted that the program path analysis component of the Li and Malik technique is, in general, an undecidable problem; therefore, for these techniques to function, the programmer must ensure that his or her program does not contain pointer references, dynamic data structures, recursion, etc., and must provide bounds on all loops. Li and Malik's technique also depends on the accuracy of the processor model, although one can expect good models to eventually evolve for DSP chips and microcontrollers that are popular in the market.

The problem of estimating execution times of blocks is central for us to be able to effectively employ compile-time design techniques. This problem is an important area of research in itself; the strategies employed in Ptolemy and GRAPE, and those proposed by Li and Malik, are useful techniques, and we expect better estimation techniques to be developed in the future.

In this chapter, we discussed various scheduling models for dataflow graphs on multiprocessor architectures that differ in whether scheduling decisions are made at compile time or at run-time. The scheduling decisions are actor assignment, actor ordering, and determination of the exact firing times of each actor. A fully-static strategy lies at one extreme, in which all of the scheduling decisions are made at compile time, whereas a dynamic strategy makes all scheduling decisions at run-time. The trade-off involved is the low complexity of static techniques against the greater generality and tolerance to data-dependent behavior in dynamic strategies. For dataflow-oriented signal processing applications, the availability of compile-time information makes static techniques very attractive. A self-timed strategy is commonly employed for such applications, where actor assignment and ordering is fixed at compile time (or system design time) but the exact firing time of each actor is determined at run-time, in a data-driven fashion. Such a strategy is easily implemented in practical systems through sender-receiver synchronization during interprocessor communication.

In this chapter, we focus on techniques that are used in self-timed scheduling algorithms to handle IPC costs. Since a tremendous variety of scheduling algorithms have been developed to date, it is not possible here to provide comprehensive coverage of the field. Instead, we highlight some of the most fundamental developments to date in IPC-conscious multiprocessor scheduling strategies for HSDFGs.

To date, most of the research on scheduling HSDFGs has focused on the problem of minimizing the schedule makespan $\mu$, which is the time required to execute all actors in the HSDFG once. When a schedule is executed repeatedly, for example by being encapsulated within an infinite loop, as would typically be the case for a DSP application, the resulting throughput is equal to the reciprocal ($1/\mu$) of the schedule makespan if all processors synchronize (perform a "barrier synchronization", as described later in Section 9.1) at the end of each schedule iteration. The throughput can often be improved beyond ($1/\mu$) by abandoning global barrier synchronization and implementing a self-timed execution of the schedule, as described in Section 4.4. To further exploit parallelism between graph iterations, one may employ the technique of unfolding, which is discussed in Chapter 4.

To model the transit time of interprocessor communication data in a multiprocessor system (for example, the time to write and read data values [...]), HSDFG edges are typically weighted by an estimate of the delay to transmit and receive the associated data if the source and sink actors of the edge are assigned to different processors. Such estimates are similar to the execution time estimates that we use to model the run-time of individual dataflow actors, as discussed in Section 4.9.

In this chapter, we are concerned primarily with efficient scheduling of HSDFGs in which an IPC cost is associated with [...] the problem of constructing minimum-makespan [...] for a given target multiprocessor architecture. As this definition suggests, solutions to the scheduling problem are heavily dependent on the underlying target architecture. In an attempt to decompose this problem and separate target-specific aspects from aspects of the problem that are fundamental to the structure of the input HSDFG, some researchers have applied a two-phased approach, pioneered by Sarkar [...]. The first phase involves scheduling the input HSDFG [...] which consists of an infinite number of [...] interconnection network. [...] can perform interproces- [...]

[...] the complexity of this second phase of the scheduling process is [...] uniprocessor target or [...] chain-structured processor interconnection topology [...] focused on the derivation of effective heuristics [...]. Since the inter-iteration dependencies represented [...] relevant in the context of minimum-makespan scheduling, [...] graph is often referred to [...] that for each $e \in E$ we [...] application is specified [...] (Section 3.8) obtained by [...]

A classic algorithm for computing an assignment of actors to processors based on network flow principles was developed by Stone [Sto77]. This algorithm is designed for heterogeneous multiprocessor systems, and its goal is to map actors to processors so that the sum of computation time and time spent on IPC is minimized. More specifically, suppose that we are given a target multiprocessor architecture consisting of $n$ (possibly heterogeneous) processors $P_1, P_2, \ldots, P_n$; a set of actors $A_1, A_2, \ldots, A_m$; a set of actor execution times $\{t_i(A_j)\}$, where for each $i \in \{1, 2, \ldots, n\}$ and each $j \in \{1, 2, \ldots, m\}$, $t_i(A_j)$ gives the execution time of actor $A_j$ on processor $P_i$; and a set of inter-actor communication costs $\{C_{ij}\}$, where $C_{ij} = 0$ if actors $A_i$ and $A_j$ do not exchange data, and otherwise, $C_{ij}$ gives the cost of exchanging data between $A_i$ and $A_j$ if $A_i$ and $A_j$ are assigned to different processors. The goal of Stone's assignment algorithm is to compute an assignment

$$F : \{A_1, A_2, \ldots, A_m\} \rightarrow \{P_1, P_2, \ldots, P_n\}$$

such that the net computation and communication cost

$$\mathrm{cost}(F) = \sum_{j=1}^{m} t_{F(A_j)}(A_j) + \sum_{\substack{i < j \\ F(A_i) \ne F(A_j)}} C_{ij} \tag{5-2}$$

is minimized, where $t_{F(A_j)}(A_j)$ denotes the execution time of $A_j$ on the processor to which $F$ assigns it. Note that minimizing (5-2) is not equivalent to minimizing the makespan. For example, if the set of target processors is homogeneous, then an optimal solution with respect to (5-2) results from simply assigning all actors to a single processor.

The core of the algorithm is an elegant approach for transforming a given instance $I$ of the assignment problem into an instance $Z(I) = (V(I), E(I))$ of the minimum-weight cutset problem. For example, for a two-processor system, two vertices $p_1$ and $p_2$ are created in $Z(I)$ corresponding to the two heterogeneous target processors, and a vertex $a_i$ is created for each actor $A_i$. For each $A_i$, an undirected edge in $Z(I)$ is instantiated between $a_i$ and $p_1$, and the weight of this edge is set to the execution time of $A_i$ on $P_2$. This edge models the execution time cost of actor $A_i$ that results if $a_i$ and $p_1$ lie on opposite sides of the cutset that is computed for $Z(I)$. Similarly, an edge $(a_i, p_2)$ is instantiated with weight $t_1(A_i)$. Finally, for each pair of vertices $a_i$ and $a_j$ such that $C_{ij} > 0$, an edge $(a_i, a_j)$ in $Z(I)$ is instantiated with weight $C_{ij}$.

From a minimum-weight cutset in $Z(I)$ that separates $p_1$ and $p_2$, an optimal solution to the heterogeneous processor assignment problem can easily be derived. Specifically, if $R \subseteq E(I)$ is a minimum-weight cutset that separates $p_1$ and $p_2$, an optimal assignment $F$ can be derived from $R$ by:

$$F(A_i) = \begin{cases} P_1 & \text{if } (a_i, p_2) \in R \\ P_2 & \text{if } (a_i, p_1) \in R \end{cases}$$

The net computation and communication cost of this assignment is simply

$$\mathrm{cost}(F) = \sum_{e \in R} w(e), \tag{5-4}$$

where $w(e)$ denotes the weight of edge $e$.

An illustration of Stone's Algorithm is shown in Figures 5.1 and 5.2. Figure 5.2(a) shows the actor interaction structure of the input application; Figure 5.2(b) specifies the actor execution times; and Figure 5.2(c) gives all non-zero communication costs ($C_{ij} > 0$). The associated instance of the minimum-weight cutset problem that is derived by Stone's Algorithm is depicted in Figure 5.1(a), and a minimum-weight cutset for this instance is shown in Figure 5.1(b). The optimal assignment that results from this cutset is given by [...]. (5-6)

For the two-processor case ($n = 2$), a variety of efficient algorithms can be used to derive a minimum-weight cutset for Stone's construction [...]. When the target architecture contains more than two processors ($n > 2$), the weight of each edge is set to a weighted sum of the values $t_j(A_i)$, and an optimal assignment is derived by computing a minimum $n$-way cutset in $Z(I)$. When $n \ge 4$, Stone's approach becomes computationally intractable.

Figure 5.1. (a) The instance of the minimum-weight cutset problem that is derived from the example of Figure 5.2. (b) An illustration of a solution to this instance of the minimum-weight cutset problem.

Figure 5.2. An example that is used to illustrate Stone's Algorithm for computing heterogeneous processor assignments.
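For the two-processor case, the construction can be sketched with a small max-flow/min-cut routine. The code below is an illustrative sketch, not Stone's original formulation: it assumes the standard cut construction in which each actor is joined to `p1` by an edge weighted with its execution time on $P_2$ and to `p2` by an edge weighted with its execution time on $P_1$, uses a plain breadth-first augmenting-path search, and all names and example values (`stone_two_processor`, `t1`, `t2`, `comm`) are hypothetical. Actor names must differ from the reserved vertex names `"p1"` and `"p2"`.

```python
from collections import defaultdict, deque
from itertools import product


def assignment_cost(assign, t1, t2, comm):
    """Execution cost plus IPC cost of a two-processor assignment."""
    run = sum(t1[a] if assign[a] == "P1" else t2[a] for a in assign)
    ipc = sum(c for (a, b), c in comm.items() if assign[a] != assign[b])
    return run + ipc


def brute_force_cost(t1, t2, comm):
    """Minimum cost over all assignments (exponential; for checking only)."""
    actors = list(t1)
    return min(assignment_cost(dict(zip(actors, choice)), t1, t2, comm)
               for choice in product(("P1", "P2"), repeat=len(actors)))


def stone_two_processor(t1, t2, comm):
    """Return (assignment, cost) from a minimum cut separating p1 and p2."""
    cap = defaultdict(lambda: defaultdict(int))  # residual capacities

    def add_undirected(u, v, w):
        cap[u][v] += w
        cap[v][u] += w

    for a in t1:
        add_undirected("p1", a, t2[a])  # cut when a is assigned to P2
        add_undirected(a, "p2", t1[a])  # cut when a is assigned to P1
    for (a, b), c in comm.items():
        add_undirected(a, b, c)         # cut when a and b are separated

    def reach(stop_at_sink):
        """Breadth-first search over edges with remaining capacity."""
        parent = {"p1": None}
        queue = deque(["p1"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    if stop_at_sink and v == "p2":
                        return parent
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reach(stop_at_sink=True)
        if "p2" not in parent:
            break
        path, v = [], "p2"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= push
            cap[w][u] += push
        flow += push

    side = reach(stop_at_sink=False)  # vertices still connected to p1
    assign = {a: "P1" if a in side else "P2" for a in t1}
    return assign, flow


# Hypothetical instance: three actors, two heterogeneous processors.
t1 = {"x": 2, "y": 6, "z": 5}
t2 = {"x": 7, "y": 3, "z": 1}
comm = {("x", "y"): 4, ("y", "z"): 1}
assign, cost = stone_two_processor(t1, t2, comm)
```

By the max-flow min-cut theorem, the returned flow value equals the weight of the cut, which in turn equals the cost of the derived assignment; `brute_force_cost` is included only to cross-check small instances.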

Although Stone's algorithm has high intuitive appeal and has had considerable influence on the SDF scheduling community, the most effective algorithms known today for self-timed scheduling of SDF graphs have jointly considered both the assignment and ordering sub-problems. The approaches used in these joint algorithms fall into two broad categories: approaches that are driven by iterative, list-based mapping of individual tasks, and those that are based on constructing clusters of tasks that are to be executed on the same processor. These categories of scheduling techniques are discussed in the following three sections.

In list scheduling, a priority list $L$ of actors is constructed; a global time clock cc is maintained; and each task $T$ is eventually mapped into a time interval [...] on some processor (the time intervals for two distinct actors assigned to the same processor cannot overlap). The priority list $L$ is a linear ordering $(v_1, v_2, \ldots, v_{|V|})$ of the actors in the input task graph $G = (V, E)$ (with $V = \{v_1, v_2, \ldots, v_{|V|}\}$) such that for any pair of distinct actors $v_i$ and $v_j$, $v_j$ is to be given higher scheduling priority than $v_i$ if and only if $j < i$. Each actor is mapped to an available processor as soon as it becomes the highest-priority actor, according to $L$, among all actors that are ready. An actor is ready if it has not yet been mapped, but its predecessors have all been mapped, and all of them have completed by time $t$, where $t$ is the current value of cc. For self-timed implementation, actors on each processor are ordered according to the order of their associated time intervals.

An important generalization of list scheduling, which we call ready-list scheduling, has been formalized by Printz [...]. Ready-list scheduling maintains the list scheduling convention that a schedule is constructed by repeatedly selecting and scheduling ready actors, but eliminates the notion of a static priority list and a global time clock. Thus, the only list that is fundamental to the scheduling process is the list of actors that are ready at a given scheduling step. [...] of effective ready-list algorithms for [...] scheduling problem [...]

To be effective when IPC costs are not negligible, a list-scheduling or ready-list algorithm must incorporate the latencies associated with IPC operations. This involves either explicitly scheduling IPC operations onto the communication resources of the target architecture as the scheduling process progresses, or incorporating estimates of the time that it takes for data that is produced by an actor on one processor to be available for consumption by an actor that has been assigned to another processor. In either case, an additional constraint is imposed on the earliest possible starting times of actors that depend on the arrival of data.

Many properties of list scheduling have been established for a special case of the scheduling problem, which we call ideal scheduling. In ideal scheduling, the target multiprocessor architecture consists of $n$ processors that are homogeneous (the execution time of an actor is independent of the processor it is assigned to), and IPC is performed in zero time. Although issues of heterogeneous processing times and IPC cost are avoided, the ideal scheduling problem is intractable [...].

When list scheduling is applied to an instance of the ideal scheduling problem and a given priority list for the problem instance, the resulting schedule is not necessarily unique, and generally depends on the details of the particular list scheduling algorithm that is used. Specifically, the schedule depends on the processor selection scheme that is used when more than one processor is available at a given scheduling step. For example, consider the simple task graph in Figure 5.3(a), and suppose that the actor execution times are [...] and the target multiprocessor architecture consists of two processors $P_1$ and $P_2$. If list scheduling is applied to this example with priority list $L = ([\ldots], C)$, then any one of the four schedules illustrated in Figure 5.3(b) may result, depending on the processor selection scheme.

Let $ISP(G, t, n)$ denote the instance of the ideal scheduling problem that consists of task graph $G = (V, E)$; actor execution times (on each processor in the target architecture) $\{t(v) \mid v \in V\}$; and a target, zero-IPC architecture that consists of $n$ identical processors. Then, given a list-scheduling algorithm $\alpha$ for $ISP(G, t, n)$ and a priority list $L = (v_1, v_2, \ldots, v_{|V|})$, we define $S_\alpha(G, t, n, L)$ to be the schedule produced by $\alpha$ when it is applied with priority list $L$, and we define $\mu_\alpha(G, t, n, L)$ to be the makespan of $S_\alpha(G, t, n, L)$. We also define

$$\Sigma(G, t, n, L) = \{ S_\alpha(G, t, n, L) \mid \alpha \text{ is a list scheduling algorithm} \},$$

the set of schedules that can be produced when list scheduling is applied to $ISP(G, t, n)$ with priority list $L$. For the example of Figure 5.3, we have [...].

Figure 5.3. An example that illustrates the dependence of list scheduling on processor selection (for a given priority list).

It is easily shown that schedules produced by list-scheduling algorithms on a given instance $ISP(G, t, n)$ with a given priority list $L$ all have the same makespan. That is,

$$\left| \{ \mu_\alpha(G, t, n, L) \mid \alpha \text{ is a list scheduling algorithm} \} \right| = 1. \tag{5-9}$$

This property of uniform makespan does not generally hold, however, if we allow heterogeneous processors in the target architecture or if we incorporate non-zero IPC costs.

Clearly, effective construction of the priority list $L$ is critical to achieving high-quality results with list scheduling. Graham has shown that when arbitrary priority lists are allowed, it is possible for the list scheduling approach to produce unusual results. In particular, it is possible that increasing the number of processors, reducing the execution time of one or more actors, or relaxing the precedence constraints in an SDF graph (removing one or more edges) can all cause a list scheduling algorithm to produce results that are worse (longer total execution time) than those obtained when the algorithm is applied with the original number of processors, the original execution times, or the original set of SDF edges, respectively [Gra69]. Graham, however, has established a tight bound on the anomalous performance degradation that can be encountered with list scheduling [Gra69]. This result is summarized by the following theorem.

Suppose that $G = (V, E)$ is a task graph; $L$ is a priority list for $G$; $n$ and $n'$ are positive integers; $t : V \rightarrow \{0, 1, 2, \ldots\}$ and $t' : V \rightarrow \{0, 1, 2, \ldots\}$ are assignments of non-negative integers to members of $V$ (sets of actor execution times) such that for each $v \in V$, $t'(v) \le t(v)$; $E' \subseteq E$; $S \in \Sigma(G, t, n, L)$; and $S' \in \Sigma(G', t', n', L)$, where $G' = (V, E')$. Here $\Sigma(G, t, n, L)$ denotes the set of schedules that list scheduling can produce for the ideal scheduling problem on $G$ with execution times $t$, $n$ identical processors, and priority list $L$. Then

$$\frac{\mathrm{makespan}(S')}{\mathrm{makespan}(S)} \le 1 + \frac{n - 1}{n'}, \tag{5-10}$$

and this is the tightest possible bound.

Graham has also established a tight bound on the variation in list scheduling performance that can be encountered when different priority lists are used for the same instance of the ideal scheduling problem.

Suppose that $G = (V, E)$ is a task graph; $L$ and $L'$ are priority lists for $G$; $n$ is a positive integer; $t : V \rightarrow \{0, 1, 2, \ldots\}$ is an assignment of execution times to members of $V$; $S \in \Sigma(G, t, n, L)$; and $S' \in \Sigma(G, t, n, L')$. Then

$$\frac{\mathrm{makespan}(S')}{\mathrm{makespan}(S)} \le 2 - \frac{1}{n},$$

and this is the tightest possible bound.

The level of an actor $A$ in an acyclic SDF graph $G$ is defined to be the length of the longest directed path in $G$ that originates at $A$. Here, the length of a path is taken to be the sum of the execution times of the actors on the path. Intuitively, actors with high level values need to be scheduled early since long sequences of computation depend on their completion. One of the earliest and most widely-used list-scheduling algorithms is the HLFET (highest level first with estimated times) algorithm [...]. In this algorithm, $L$ is created by sorting the actors in decreasing order of their levels. HLFET is guaranteed to produce an optimal result if there are only two processors, both processors are identical, and all tasks have identical execution times [...]. For the general ideal scheduling problem (any finite number of processors is allowed, execution times need not be identical, and IPC costs are uniformly zero), HLFET has been proven to frequently produce near-optimal schedules [ACD7

Early strategies for incorporating IPC costs into list scheduling include the algorithm of Yu [Yu84]. Yu's algorithm, a modification of HLFET scheduling, repeatedly selects the ready actor that has the highest level, and schedules it on the processor that can finish its execution at the earliest time. The earliest finishing time of a ready actor $A$ on a processor $P$ depends both on the time intervals that have already been scheduled on $P$ and on the IPC time required for the data required from the predecessors of $A$ to arrive at $P$.

In contrast to these early algorithms, the ETF (earliest task first) algorithm of Hwang, Chow and Angers uses the level metric only as a tie-breaking criterion. At each scheduling step in ETF, the value $t_e(A, P)$, the earliest time at which actor $A$ can commence execution on processor $P$, is computed for every ready actor $A$ and every target processor $P$. If an actor-processor pair $(A^*, P^*)$ uniquely minimizes $t_e(A, P)$, then $A^*$ is scheduled to execute on $P^*$ starting at time $t_e(A^*, P^*)$; otherwise, the tie is resolved by selecting the actor-processor pair that has the highest level.
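The level metric and a level-driven list scheduler for the ideal case (identical processors, zero IPC cost) can be sketched as follows. This is a simplified illustration, not the published HLFET algorithm: it omits the explicit global clock, treats an actor as ready once all of its predecessors have been placed, breaks ties by actor name, and picks the processor that allows the earliest start. The names `levels`, `hlfet`, `succ`, and the example graph are assumptions made for the example.

```python
def levels(succ, t):
    """Level of each actor: longest path (sum of execution times) starting at it.

    succ: dict actor -> list of successor actors (acyclic graph).
    t: dict actor -> execution time.
    """
    memo = {}

    def level(v):
        if v not in memo:
            memo[v] = t[v] + max((level(s) for s in succ.get(v, ())), default=0)
        return memo[v]

    for v in t:
        level(v)
    return memo


def hlfet(succ, t, n):
    """Level-driven list schedule on n identical processors with zero IPC.

    Returns dict actor -> (processor index, start time, finish time).
    """
    lvl = levels(succ, t)
    preds = {v: [] for v in t}
    for u, outs in succ.items():
        for v in outs:
            preds[v].append(u)

    proc_free = [0] * n
    schedule = {}
    while len(schedule) < len(t):
        ready = [v for v in t
                 if v not in schedule and all(u in schedule for u in preds[v])]
        # Highest level first; ties broken by name for determinism.
        actor = min(ready, key=lambda v: (-lvl[v], v))
        data_ready = max((schedule[u][2] for u in preds[actor]), default=0)
        proc = min(range(n), key=lambda p: (max(proc_free[p], data_ready), p))
        start = max(proc_free[proc], data_ready)
        schedule[actor] = (proc, start, start + t[actor])
        proc_free[proc] = start + t[actor]
    return schedule


# Hypothetical task graph: A and B feed C; D is independent.
succ = {"A": ["C"], "B": ["C"]}
t = {"A": 3, "B": 2, "C": 2, "D": 4}
lvl = levels(succ, t)
sched = hlfet(succ, t, 2)
makespan = max(finish for _, _, finish in sched.values())
```

Because the ready test only requires predecessors to be placed, the start time is computed explicitly from predecessor finish times, which keeps the sketch short at the cost of departing from the clock-driven formulation.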

El-Rewini and Lewis have proposed a list-scheduling algorithm, called the MH algorithm, which attempts to account for IPC contention within an arbitrary, multi-hop interconnection network. Since maintaining and applying a precise accounting of traffic within such a network can be computationally expensive, the MH algorithm has been devised to maintain an approximate view of the network state $E(t)$, from which reasonable estimates of communication delay can be derived for scheduling purposes. At any given scheduling time step $t$, $E(t)$ incorporates three $|P| \times |P|$ matrices $H$, $L$, and $D$, where $P$ is the set of processors in the target multiprocessor architecture. Given $p_1, p_2 \in P$, $H(p_1, p_2)$ gives the number of hops between $p_1$ and $p_2$ in the interconnection network; $L(p_1, p_2)$ gives the preferred outgoing communication channel of $p_1$ that should be used when communicating data to $p_2$; and $D(p_1, p_2)$ gives the communication delay between $p_1$ and $p_2$ that arises due to contention with other IPC operations in the system.

In the MH algorithm, actors are first prioritized by a static, modified level metric that incorporates the communication costs that are assigned to the task graph edges. For an actor $x$, this modified level is the longest path length in the task graph that originates at $x$, where the length of a path is taken as the sum of the actor execution times and edge communication costs along the path. A list-scheduling loop is then carried out in which, at any given scheduling time step $t$, an actor that maximizes the modified level is selected from among the actors that are ready at $t$. Processor selection is then achieved by assigning the selected actor to the processor that allows the earliest estimated completion time. The estimated completion time is derived from the network state approximation $E(t)$.

El-Rewini and Lewis observe that there is a significant trade-off between the frequency with which the network state approximation $E(t)$ is updated (which affects the accuracy of the approximation), and the time complexity of the resulting scheduling algorithm. The MH algorithm addresses this trade-off by updating $E(t)$ only when a scheduled actor begins sending IPC data to a successor actor that is scheduled on another processor, or when the IPC data associated with a task graph edge $(x, y)$ arrives at the processor that $y$ is assigned to.

Loosely speaking, the priority mechanism in the MH algorithm is the converse of that employed in ETF. In MH, the "earliest actor-processor mapping" is used as a tie-breaking criterion, while the modified level is used as the primary priority function. Note also that when the target processor set is homogeneous, selecting an actor-processor pair that minimizes the starting time is equivalent to selecting a pair that minimizes completion time, while this equivalence does not necessarily hold for a heterogeneous architecture. Thus, the concept of "earliest actor-processor mapping" that is employed by ETF is different from that in MH only in the heterogeneous processor case.

In the DLS (dynamic level scheduling) algorithm of Sih and Lee, the use of levels in traditional HLFET scheduling is replaced by a measure of scheduling priority that is continually re-evaluated as the schedule is constructed [SL93a]. Sih and Lee demonstrated that such a concept is preferable because the "scheduling affinity" between actor-processor pairs depends not only on longest paths in the task graph, but also on the current scheduling state, which includes the actor/time-interval pairs that have already been scheduled on processing resources, and the IPC operations that have already been scheduled on the communication resources. As with ETF, the DLS algorithm also abandons the use of the global scheduling clock cc, and allows all target processors to be considered as candidates in every scheduling step (instead of just those processors that are idle at the current value of cc). With the elimination of cc, Sih's metric for prioritizing actors, the dynamic level, can be formulated as

$$DL(A, P, \Sigma) = SL(A) - \max\{D(A, P, \Sigma), F(P, \Sigma)\}, \tag{5-12}$$

where $\Sigma$ represents the scheduling state at the current scheduling step; $SL(A)$ denotes the conventional (static) level of actor $A$; $D(A, P, \Sigma)$ denotes the earliest time at which all data required by actor $A$ can arrive at processor $P$; and $F(P, \Sigma)$ gives the completion time of the last actor that is presently assigned to $P$.

While the incorporation of scheduling state by the DLS algorithm represents an important advancement in IPC-conscious scheduling, the formulation of the dynamic level in (5-12) contains a subtle limitation, which was observed by Kwok and Ahmad [KA96]. This limitation arises because the relative contributions of the two components in (5-12) (the static level and the data arrival time) to the dynamic level metric vary as the scheduling process progresses. Early in the scheduling process, the static levels are usually high, since the actors considered generally have relatively many topological descendants, and similarly, data arrival times are low, since the scheduling of actors on each processor begins at the origin of the time axis and progresses towards increasing values of time. As more and more scheduling steps are carried out, the static level parameters of ready actors will decrease steadily (implying lower influence on the dynamic level), and the data arrival times will increase (implying higher influence). Thus, the relative weighting of the static level and the data arrival time is not constant, but rather can vary strongly between different scheduling steps. This variation is not taken into account in the DLS algorithm.
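A single selection step driven by a dynamic level can be sketched as below. The sketch assumes that the dynamic level of an actor-processor pair is the static level of the actor minus the larger of (a) the earliest time all of the actor's input data can be present on that processor and (b) the completion time of the last actor already assigned to that processor. The function names, dictionary layouts, and numeric values are hypothetical and only serve the illustration.

```python
def dynamic_level(static_level, data_arrival, last_finish):
    """Static level minus the earliest feasible start on the processor."""
    return static_level - max(data_arrival, last_finish)


def select_pair(ready, processors, static_level, data_arrival, last_finish):
    """Pick the (actor, processor) pair with the largest dynamic level.

    static_level: dict actor -> static level.
    data_arrival: dict (actor, processor) -> earliest data arrival time.
    last_finish: dict processor -> completion time of its last assigned actor.
    """
    candidates = [(a, p) for a in ready for p in processors]
    return max(
        candidates,
        key=lambda ap: dynamic_level(
            static_level[ap[0]], data_arrival[ap], last_finish[ap[1]]
        ),
    )


# Hypothetical scheduling state with two ready actors and two processors.
static_level = {"A": 10, "B": 7}
last_finish = {"P1": 4, "P2": 0}
data_arrival = {("A", "P1"): 2, ("A", "P2"): 6, ("B", "P1"): 0, ("B", "P2"): 3}
best = select_pair(["A", "B"], ["P1", "P2"], static_level, data_arrival, last_finish)
```

In this state, actor A on P1 is preferred even though P1 is busy until time 4, because moving A to the idle processor P2 would delay its data until time 6. All processors are candidates at every step, which is the point of dropping the global clock.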

Motivated partly by their observation on the limitations of dynamic level scheduling, Kwok and Ahmad have developed an alternative variation of list scheduling, called the DCP (dynamic critical path) algorithm, that also dynamically re-evaluates actor priorities [KA96]. The DCP algorithm is motivated by the observation that the set of critical paths in a task graph can change from one scheduling step to the next, where the critical path is defined to be a directed path along which the sum of computation and communication times is maximized.

For example, consider the task graph depicted in Fig. 5.4. Here, the number beside each actor gives the execution time of the actor, and each numeric edge weight gives the IPC cost associated with the edge. Initially, in this graph, the critical path is $A \rightarrow B \rightarrow C$, and the length of this path is 16 time units. If the first two scheduling steps map both actors $A$ and $B$ to the same processor (e.g., to minimize the starting time of $B$), then the weight of the edge $(A, B)$ in Fig. 5.4 effectively changes to zero. The critical path in the new "partially scheduled" graph thus becomes the path [...], which has a length of 14 time units. Because critical paths can change in this manner as the scheduling process progresses, the critical path of the partially scheduled graph is called the dynamic critical path.

The DCP algorithm operates by repeatedly selecting and scheduling actors on the dynamic critical path, and updating the partially scheduled graph as scheduling decisions are made. An elaborate processor selection scheme is also incorporated to map the actor selected at each scheduling step. This scheme not only considers the arrival of required data on each candidate processor, but also takes into account the possible starting times of the task graph successors of the selected actor.

Figure 5.4. An illustration of "dynamic" critical paths in multiprocessor scheduling.

Clustering algorithms for multiprocessor scheduling operate by incrementally constructing groupings, called clusters, of actors that are to be executed on the same processor. Clustering and list scheduling can be used in a complementary fashion. Typically, clustering is applied to focus the efforts of a list scheduler on effective processor assignments. When used efficiently, clustering can significantly enhance the results produced by a list scheduler (and a variety of other scheduling techniques).

A scheduling algorithm (such as a list scheduler) processes a clustered HSDFG by constraining the vertices of $V$ that are encompassed by each cluster to be assigned to the same processor. More than one cluster may be mapped by the scheduling algorithm to execute on the same processor; thus, a sequence of clustering operations does not necessarily specify a complete processor assignment, even when the target processors are all homogeneous.

The net result of a clustering algorithm is to identify a family of disjoint subsets $M_1, M_2, \ldots, M_k \subseteq V$ such that the underlying scheduling algorithm is forced to avoid IPC costs between any pair of actors that are members of the same $M_i$. In the remainder of this section, we examine a variety of algorithms for computing such a family of subsets.

In the Linear Clustering Algorithm of Kim and Browne, longest paths in the input task graph are iteratively identified and clustered until every edge in the graph is either encompassed by a cluster or is incident to a cluster at both its source and sink. The path length metric is based on a function of the computation and communication along a given path. If $p = (e_1, e_2, \ldots, e_n)$ is a task graph path, then the value of Kim and Browne's path length metric for $p$ is given by

$$\omega_1 \sum_{v \in T_p} t(v) + (1 - \omega_1)\left[\omega_2 \sum_{i=1}^{n} c(e_i) + (1 - \omega_2) \sum_{v \in T_p} c_{adj}(v)\right],$$

where $T_p$ is the set of actors traversed by $p$; $t(v)$ is the execution time of actor $v$; $c(e)$ is the IPC cost associated with an edge $e$; $c_{adj}(v)$ is the total IPC cost between an actor $v \in T_p$ and actors that are not contained in $T_p$; and the "normalization factors" $\omega_1$ and $\omega_2$ are parameters of the algorithm.

Kim and Browne do not give a systematic technique for determining the normalization factors that should be used with the Linear Clustering Algorithm. Indeed, the derivation of the most appropriate normalization factors based on characteristics of the input task graph and the target multiprocessor architecture appears to be an interesting direction for further study. When $\omega_1 = 0.5$ and $\omega_2 = 1$, linear clustering reduces to clustering of critical paths.
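The path length metric can be illustrated with a short sketch. It assumes the metric is a weighted combination of the form $\omega_1 \cdot$ (computation on the path) $+ (1 - \omega_1)[\omega_2 \cdot$ (IPC on path edges) $+ (1 - \omega_2) \cdot$ (IPC between path actors and off-path actors)$]$. The function name, the dictionary-based graph encoding, and the example graph are illustrative assumptions, not part of the original algorithm description.

```python
def path_metric(path_edges, exec_time, ipc_cost, w1, w2):
    """Weighted path length for a directed path given as a list of (src, dst) edges.

    exec_time: actor -> execution time; ipc_cost: (src, dst) -> IPC cost.
    w1, w2 play the role of the normalization factors (assumed form, see lead-in).
    """
    actors = [path_edges[0][0]] + [dst for _, dst in path_edges]
    on_path = set(actors)
    computation = sum(exec_time[v] for v in actors)
    path_ipc = sum(ipc_cost[e] for e in path_edges)
    # IPC on edges with exactly one endpoint on the path.
    adjacent_ipc = sum(cost for (u, v), cost in ipc_cost.items()
                       if (u in on_path) != (v in on_path))
    return w1 * computation + (1 - w1) * (w2 * path_ipc + (1 - w2) * adjacent_ipc)


# Hypothetical four-actor task graph.
exec_time = {"A": 2, "B": 3, "C": 1, "D": 4}
ipc_cost = {("A", "B"): 5, ("B", "C"): 2, ("A", "D"): 1}
path = [("A", "B"), ("B", "C")]

# With w1 = 0.5 and w2 = 1 the metric is half of (computation + IPC on the path),
# so ranking paths by it is the same as ranking them by critical path length.
half_critical = path_metric(path, exec_time, ipc_cost, 0.5, 1.0)
```

With these weights the metric for A → B → C is 6.5, half of the path's combined computation and communication time of 13, which is why this parameter choice ranks paths exactly as a critical path computation would.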

Sarkar's internalization algorithm [Sar89] for graph clustering is based on determining a set of clustering operations that do not degrade the performance of a task graph on a machine with boundless processing resources. In internalization, the task graph edges are first sorted in decreasing order of their associated IPC costs. The edges in this list are then traversed according to this ordering. When each edge $e$ is visited in this traversal, an estimate $T_e$ of the parallel execution time is computed with the source and sink vertices of $e$ constrained to execute on the same processor. This estimate is derived for an unbounded number of processors and a fully-connected communication network. If $T_e$ does not exceed the parallel execution time estimate of the current clustered graph, then the current clustered graph is modified by merging the source and sink of $e$ into the same cluster. An important strength of internalization is its simplicity, which makes it easily adaptable to accommodate additional scheduling objectives beyond minimizing execution time. For example, a hierarchical scheduling framework for multirate DSP systems has been developed using Sarkar's clustering technique as a substrate [PBL95]. This hierarchical framework provides a systematic method for combining multiprocessor scheduling algorithms that minimize execution time with uniprocessor scheduling techniques that optimize a target program's code and data memory requirements.
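The edge-by-edge structure of internalization can be sketched as follows. This is a minimal illustration rather than Sarkar's implementation: the parallel-time estimate used here (one processor per cluster, actors started greedily in a topological order, zero IPC inside a cluster) is an assumed stand-in for the estimate described above, and all function names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_time(exec_time, ipc_cost, cluster_of):
    """Estimated makespan with one processor per cluster (assumed estimator)."""
    preds = {v: [] for v in exec_time}
    for u, v in ipc_cost:
        preds[v].append(u)
    finish, cluster_free = {}, {}
    for v in TopologicalSorter(preds).static_order():
        ready = 0
        for u in preds[v]:
            comm = 0 if cluster_of[u] == cluster_of[v] else ipc_cost[(u, v)]
            ready = max(ready, finish[u] + comm)
        start = max(ready, cluster_free.get(cluster_of[v], 0))
        finish[v] = start + exec_time[v]
        cluster_free[cluster_of[v]] = finish[v]
    return max(finish.values())


def internalize(exec_time, ipc_cost):
    """Visit edges in decreasing IPC cost; keep a merge if the estimate does not grow."""
    cluster_of = {v: v for v in exec_time}          # each actor starts alone
    best = parallel_time(exec_time, ipc_cost, cluster_of)
    for u, v in sorted(ipc_cost, key=ipc_cost.get, reverse=True):
        if cluster_of[u] == cluster_of[v]:
            continue
        old, new = cluster_of[v], cluster_of[u]
        trial = {a: (new if c == old else c) for a, c in cluster_of.items()}
        estimate = parallel_time(exec_time, ipc_cost, trial)
        if estimate <= best:
            cluster_of, best = trial, estimate
    return cluster_of, best


# Hypothetical fork-join graph: expensive IPC on the A-B-D branch.
exec_time = {"A": 1, "B": 3, "C": 2, "D": 1}
ipc_cost = {("A", "B"): 5, ("A", "C"): 1, ("B", "D"): 5, ("C", "D"): 1}
clusters, makespan = internalize(exec_time, ipc_cost)
```

In this example the two expensive edges are internalized, leaving A, B, and D in one cluster, while C stays in its own cluster because merging it would serialize the two branches and lengthen the estimate.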

The Dominant Sequence Clustering (DSC) algorithm of Yang and Gerasoulis incorporates principles similar to those used in the DCP algorithm, but applies these principles under the methodology of clustering rather than list scheduling [YG94]. As with DCP, a "partially scheduled graph" (PSG) is repeatedly examined and updated as scheduling steps are carried out. The IPC costs of intra-cluster edges in the PSG are all zero; other IPC costs are the same as the corresponding costs in the task graph. Additionally, the DSC algorithm inserts new intra-cluster edges into the PSG so that a linear (total) ordering of actors is always maintained within each cluster. Initially, each actor in the task graph is assigned to its own cluster. Each clustering step selects a task graph actor that has not been selected in any previous clustering step, and determines whether or not to merge the selected actor with one of its predecessors in the PSG. The selection process is based on a priority function $f(A)$, which is defined to be the length of the longest path (computation and communication time) in the PSG that traverses actor $A$. This priority function fully captures the concept of dynamic critical paths: an actor maximizes $f(A)$ if, and only if, it lies on a critical path of the PSG.
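The priority function can be computed as the sum of the longest path into an actor and the longest path out of it. The sketch below is illustrative only: it zeroes intra-cluster IPC costs but, as a simplifying assumption, does not insert the extra intra-cluster ordering edges that the PSG contains. The function name and the example graph are hypothetical.

```python
from graphlib import TopologicalSorter


def longest_path_through(exec_time, ipc_cost, cluster_of):
    """f(v): length of the longest path through v, with intra-cluster IPC set to zero."""
    def cost(u, v):
        return 0 if cluster_of[u] == cluster_of[v] else ipc_cost[(u, v)]

    preds = {v: [] for v in exec_time}
    succs = {v: [] for v in exec_time}
    for u, v in ipc_cost:
        preds[v].append(u)
        succs[u].append(v)
    order = list(TopologicalSorter(preds).static_order())

    top = {}       # longest path length ending just before v starts
    for v in order:
        top[v] = max((top[u] + exec_time[u] + cost(u, v) for u in preds[v]), default=0)
    bottom = {}    # longest path length from the start of v to the end of the graph
    for v in reversed(order):
        bottom[v] = exec_time[v] + max((cost(v, w) + bottom[w] for w in succs[v]), default=0)
    return {v: top[v] + bottom[v] for v in exec_time}


# Hypothetical graph with two paths from A to C.
exec_time = {"A": 2, "B": 3, "C": 4, "D": 5}
ipc_cost = {("A", "B"): 4, ("B", "C"): 1, ("A", "D"): 1, ("D", "C"): 1}

separate = {v: v for v in exec_time}
before = longest_path_through(exec_time, ipc_cost, separate)

merged = dict(separate, B="A")      # cluster A and B together
after = longest_path_through(exec_time, ipc_cost, merged)
```

Before clustering, A, B, and C maximize the priority (length 14); after A and B are clustered, the critical path moves to A → D → C (length 13), which is the "dynamic" behavior the text describes.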

At a given clustering step, an actor is selected if it is "free," which means that all of its PSG predecessors have been selected in previous clustering steps, and it maximizes $f(A)$ over all free actors. Thus, if a free actor exists that is on a dynamic critical path, then an actor on the dynamic critical path will be selected. However, it is possible that none of the free actors are on the PSG critical path. In such cases, the selected actor is not on the critical path (in contrast, the DCP algorithm always selects actors that are on the dynamic critical path). Once an actor $A$ is "selected," its predecessors are sorted in decreasing order of the sum $t(x) + c(x) + \lambda(x)$, where $t(x)$ is the execution time of predecessor $x$, $c(x)$ is the IPC cost of edge $(x, A)$, and $\lambda(x)$ is the length of the longest directed path in the PSG that terminates at $x$. A set of one or more predecessors $x_1, x_2, \ldots, x_n$ is then chosen from the head of this sorted list such that "zeroing" (setting the IPC cost to zero) the associated output edges $(x_1, A), (x_2, A), \ldots, (x_n, A)$ minimizes the value of $\lambda(A)$, and hence $f(A)$, in the new PSG that results from clustering the subset of PSG vertices $\{x_1, x_2, \ldots, x_n, A\}$.

The DSC algorithm was designed with low computational complexity as the primary objective. The algorithm achieves a time complexity of $O((N + E)\log N)$, where $N$ is the number of task graph actors, and $E$ is the number of edges. In contrast, linear clustering is $O(N(N + E))$ [GY92]; internalization is an $O(E(N + E))$ algorithm; ETF is $O(PN^2)$, where $P$ is the number of target processors; DLS is $O(N^3 P g)$, where $g$ is the complexity of the data routing algorithm that is used to compute $D(A, \ldots)$; and DCP and the Declustering Algorithm, discussed in Section 5.4.4 below, also have polynomial complexity. As with internalization, DSC is designed for a fully connected network containing an unbounded number of processors, and for practical, processor-constrained systems it can be used as a preprocessing or intermediate compilation phase. As discussed in Section 5.1, optimal scheduling in the presence of IPC costs is intractable even for fully connected, infinite processor systems, and thus, given the polynomial complexity of DSC and internalization, we cannot expect guaranteed optimality from these algorithms. However, DSC is shown to be optimal for a number of non-trivial sub-classes of task graphs [YG94].

IPC-CONSCIOUS SCHEDULING

Sih and Lee have developed a clustering approach called declustering that is based on examining pairs of paths in the task graph to systematically determine which instances of parallelism should be preserved during the clustering process [SL93b]. Rather than exhaustively examining all pairs of paths (in general, a task that is hopelessly time-consuming), the Declustering technique focuses on paths that originate at branch actors, which are actors that have multiple successors. Branch actors are examined in increasing order of their static levels. Examination of a branch actor $B$ begins by sorting its successors in decreasing order of their static levels. The two successors $C_1$ and $C_2$ at the head of this list (highest static levels) are then categorized as being either an Nbranch ("non-intersecting branch") or an Ibranch ("intersecting branch") pair. To perform this categorization, it is necessary to compute the transitive closures of $C_1$ and $C_2$. The transitive closure of an actor $X$ in a task graph $G$, denoted $TC(X)$, is simply the set of actors $Y$ such that there is a delayless path in $G$ directed from $X$ to $Y$. Given the transitive closures $TC(C_1)$ and $TC(C_2)$, the successor pair $(C_1, C_2)$ is an Nbranch instance if

$$TC(C_1) \cap TC(C_2) = \emptyset, \qquad (5\text{-}16)$$

and otherwise (if the transitive closures have non-empty intersection), $(C_1, C_2)$ is an Ibranch instance. Intuitively, the transitive closure is relevant to the derivation of parallel schedules since two actors can execute in parallel (execute over overlapping segments of time) if, and only if, neither actor is in the transitive closure of the other.

Once the branch-actor successor pair $(C_1, C_2)$ is categorized as being an Ibranch or Nbranch instance, a two-path parallelism instance (TPPI) is derived from it to determine an effective means for capturing the parallelism associated with $(C_1, C_2)$ within a clustering framework. If $(C_1, C_2)$ is an Nbranch instance, then the TPPI associated with $(C_1, C_2)$ is the subgraph formed by combining a longest path (in terms of cumulative execution time) from $C_1$ to any task graph sink actor (an actor that has no output edges), a longest path from $C_2$ to any task graph sink actor, the associated branch actor $B$, and the connecting edges $(B, C_1)$ and $(B, C_2)$. For example, consider the task graph shown in Figure 5.5(a); for simplicity, assume that the execution time of each actor is unity; observe that the set of branch actors in this graph is {T, U}; and consider the TPPI computation associated with branch actor T. The successors of this branch actor, U and V, have transitive closures $TC(U)$, which contains W and X, and $TC(V)$, which contains Y, and these satisfy $TC(U) \cap TC(V) = \emptyset$, which indicates that for branch actor T, the successor pair (U, V) is an Nbranch instance. The TPPI associated with this branch instance is shown in Figure 5.5(b).

If $(C_1, C_2)$ is an Ibranch instance, then the TPPI associated with $(C_1, C_2)$ is derived by first selecting an actor, called the merge actor, from the intersection $TC(C_1) \cap TC(C_2)$ that has maximum static level. The TPPI is then formed by combining a longest path from $C_1$ to the merge actor, a longest path from $C_2$ to the merge actor, the branch actor $B$, and the connecting edges $(B, C_1)$ and $(B, C_2)$. Among the branch actors in Figure 5.5(a), only actor U has an Ibranch instance associated with it. The corresponding TPPI, derived from the associated merge actor, is shown in Figure 5.5(c).

After a TPPI is identified, an optimal schedule of the TPPI onto a two-processor architecture is derived. Because of the restricted structure of TPPI topologies, such an optimal schedule can be computed efficiently. Furthermore, depending on whether the TPPI corresponds to an Ibranch or an Nbranch instance, and on whether the optimal two-processor schedule utilizes both target processors, the optimal schedule can be represented by removing zero or more arcs, called cut arcs, from the TPPI: after removing the cut arcs from the TPPI, the (one or two) connected components in the resulting subgraph give the processor assignment associated with the optimal two-processor schedule.

The declustering algorithm repeatedly applies the branch actor analysis discussed above for all branch actors in the task graph, and keeps track of all cut arcs that are found during this traversal of branch actors. After the traversal is complete, all cut arcs are temporarily removed from the task graph, and the connected components of the resulting graph are clustered. These clusters are then combined in a pairwise fashion, two clusters at a time, to produce a hierarchy of two-actor clusters. Careful graph analysis is used to guide this hierarchy formation to preserve the most useful instances of parallelism for as large a depth as possible within the cluster hierarchy. Then, during the subsequent phases of the Declustering Algorithm, the cluster hierarchy is systematically broken down and scheduled to match the characteristics of the target multiprocessor architecture. For full details on these phases, the reader is encouraged to consult [Sih91, SL93b].
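The Nbranch/Ibranch categorization reduces to a set intersection on transitive closures, as the following sketch shows. It assumes that $TC(X)$ excludes $X$ itself and that the graph is delayless; the function names and the example graph are hypothetical (the graph is not a reproduction of Figure 5.5), and the selection of successors by static level is omitted.

```python
def transitive_closure(succs, x):
    """Actors reachable from x along one or more edges (x itself excluded)."""
    reached, stack = set(), list(succs.get(x, ()))
    while stack:
        y = stack.pop()
        if y not in reached:
            reached.add(y)
            stack.extend(succs.get(y, ()))
    return reached


def branch_actors(succs):
    """Actors that have more than one successor."""
    return [a for a, out in succs.items() if len(out) > 1]


def classify_pair(succs, c1, c2):
    """Return ('Nbranch' | 'Ibranch', shared descendants) for a successor pair."""
    common = transitive_closure(succs, c1) & transitive_closure(succs, c2)
    return ("Ibranch" if common else "Nbranch", common)


# Hypothetical task graph given as successor lists.
succs = {"T": ["U", "V"], "U": ["W", "X"], "V": ["Y"], "W": ["Z"], "X": ["Z"]}

kind_T = classify_pair(succs, "U", "V")   # successors of branch actor T
kind_U = classify_pair(succs, "W", "X")   # successors of branch actor U
```

Here the successors of T share no descendants (an Nbranch instance), while the successors of U both reach Z (an Ibranch instance, with Z as the only candidate merge actor).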

Figure 5.5. An illustration of TPPIs in the Declustering Algorithm.

Due in part to the high complexity of the assignment and ordering problems in the presence of IPC costs, independent comparisons on subsets of algorithms developed for the scheduling problem consistently reveal that no single algorithm dominates as a clear "best choice" that handles most applications better than all of the other algorithms (for example, see [LAAG94]). Thus, an important challenge facing tool designers for application-specific multiprocessor implementation is the development of efficient methods for integrating the variety of algorithm innovations in IPC-conscious scheduling so that their advantages can be combined in a systematic manner. One example of an initial effort in this direction is the DS (dynamic selection) strategy [LAAG94]. DS is a list scheduling algorithm that compares the number of available processors to the number of executable (ready) actors at each scheduling step. Depending on the outcome of this comparison, either one step of the DLS algorithm is invoked to complete the current scheduling step, or a minor variation of HLFET is applied to complete the step. This algorithm was motivated by experiments that revealed certain "regions of operation" in which scheduling algorithms exhibit particularly strong or weak performance compared to others. The performance of DS is shown to be significantly better than that of DLS or HLFET alone.

Pipelined scheduling algorithms attempt to efficiently partition a task graph into stages, assign groups of processors to stages, and construct schedules for each pipeline stage. Under such a scheduling model, the slowest pipeline stage determines the throughput of the multiprocessor implementation. In general, pipelining can significantly improve the throughput beyond what is achievable by the classical (minimum-makespan) scheduling problem; however, this improvement in throughput may come at the expense of a significant increase in latency (e.g., over the latency that is achievable by employing a minimum-makespan schedule). Research on pipelined scheduling is at a significantly less mature state than research on the classical scheduling problem defined in Section 5.1. Due to its high relevance to DSP and multimedia applications, we expect that in the coming years, there will be increasing activity in the area of pipelined scheduling.

Bokhari developed fundamental results on the mapping of task graphs into pipelined schedules. Bokhari demonstrated an efficient, optimal algorithm for mapping a chain-structured task graph onto a linear chain of processors (the chain pipelining problem). The algorithm is based on an innovative data structure, called the layered assignment graph, for modeling the chain pipelining problem. Figure 5.6 illustrates an instance of the chain pipelining problem, and the corresponding layered assignment graph. The task graph to be scheduled; the linearly-connected target multiprocessor architecture; and the layered assignment graph associated with the given task graph and target architecture are shown in Figures 5.6(a), 5.6(b) and 5.6(c), respectively. The number above each task graph actor $A_i$ in Figure 5.6(a) gives the execution time $t(A_i)$ of $A_i$, and the number above each task graph edge gives the IPC cost between the associated source and sink actors if the source and sink are mapped to successive stages in the linear chain of target processors. If the source and sink actors of an edge are mapped to the same stage (processor), then the communication cost is taken to be zero.

Figure 5.6. An instance of the chain pipelining problem ((a) and (b)), and the associated layered assignment graph (c).

Given an arbitrary chain-structured task graph (V, E) consisting of actors {X_i, ...} such that

are the send actors. The subscript refers to actors that communicate control tokens. The main difference between such an implementation and the self-timed implementation we discussed in earlier chapters is the control tokens. When a conditional construct is partitioned across more than one processor, the control token(s) that determine its behavior must be broadcast to all the processors that execute that construct. Thus, in Figure 8.4, the value c, which is computed by Processor 2 (since the actor that produces c is assigned to Processor 2), must be broadcast to the other two processors. In a shared memory machine this broadcast can be implemented by allowing the processor that evaluates the control

Chapter 8

token (Processor 2 in our example) to write its value to a particular shared memory location preassigned at compile time; the processor will then update this location once for each iteration of the graph. Processors that require the value of a particular control token simply read that value from shared memory, and the processor that writes the value of the control token needs to do so only once. In


EXTENDING THE OMA ARCHITECTURE

this way, actor executions can be conditioned upon the value of control tokens evaluated at run-time. In the previous chapters, we discussed synchronization associated with data transfer between processors. Synchronization checks must also be performed for the control tokens; the processor that writes the value of a token must not overwrite the shared memory location unless all processors requiring the value of that token have in fact read the shared memory location, and processors reading a control token must ascertain that the value they read corresponds to the current iteration rather than a previous iteration. The need for broadcast of control tokens creates additional communication overhead that should ideally be taken into account during scheduling. The methods of Lee and Ha, and also prior research related to quasi-static scheduling that they refer to in their work, do not take this cost into account. Static multiprocessor scheduling applied to graphs with dynamic constructs, taking the costs of distributing control tokens into account, is thus an interesting problem for further study.

Recall that the OMA architecture imposes an order in which shared memory is accessed by processors in the machine. This is done to implement the OT

[Figure: code for Proc 1 and Proc 3, each receiving the control token c and branching on its value, followed by <code for subgraph-1>.]

) and if the value of c were always FALSE, the transaction order would be (s2, ..., <access order for subgraph-1>)

Note that writing the control token c once to shared memory is enough since the same shared location can be read by all processors requiring the value of c.

In the OMA architecture, a possible strategy is to switch between these two access orders at run-time. This is enabled by the preset feature of the transaction controller (Section 6.6.2). Recall that the transaction controller is implemented as a presettable schedule counter that addresses memory containing the processor IDs corresponding to the bus access order. To handle conditional constructs, we derive two bus access lists, one corresponding to each path in the program, and the processor that determines the branch condition (processor 2 in our example) forces the controller to switch between access lists by loading the schedule counter with the appropriate value (address "7" in the bus access schedule of Figure 8.7). Note from Figure 8.7 that there are two points where the schedule counter can be set; one is at the completion of the TRUE branch, and the other is a jump into the FALSE branch. The branch into the FALSE path is best taken care of by processor 2, since it computes the value of the control token c, whereas the branch after the TRUE path (which bypasses the access list of the FALSE branch) is best taken care of by processor 1, since processor 1 already possesses the bus at the time when the counter needs to be loaded. The schedule counter load operations are easily incorporated into the sequential programs of processors 1 and 2. The mechanism of switching between access orders works well when the number of control tokens is small. But if the number of such tokens is large,

Figure 8.7. Bus access list that is stored in the schedule memory for the quasi-static schedule of Figure 8.6. The loading operation of the schedule counter, conditioned on the value of c, is also shown: proc 2 forces the controller to jump to the access list for the FALSE branch if c is FALSE, and proc 1 forces the controller to bypass the access list for the FALSE branch.


then this mechanism breaks down, even if we can efficiently compute a quasi-static schedule for the graph. To see why this is so, consider the graph in Figure 8.8, which contains $n$ conditional constructs in parallel paths going from the input to the output. The functions "$f_i$" and "$g_i$" are assumed to be subgraphs that are assigned to more than one processor. In Ha's hierarchical scheduling approach, each conditional is scheduled independently; once scheduled, it is converted into an atomic node in the hierarchy, and a profile is assigned to it. Scheduling of the other conditional constructs can then proceed based on these profiles. Thus, the scheduling complexity in terms of the number of parallel paths is $O(n)$ if there are $n$ parallel paths. If we implement the resulting quasi-static schedule in the manner stated in the previous section, and employ the switching mechanism above, we would need one bus access list for every combination of the Booleans $b_k$. This is because each $f_i$ and $g_i$ will have its own associated bus access list, which then has to be combined with the bus access lists of all the other branches to yield one list. For example, if all Booleans are TRUE, then all the $f_i$ are executed, and we get one access list. If $b_1$ is TRUE, and $b_2$ through $b_n$ are FALSE, then $f_1$ is executed, and $g_2$ through $g_n$ are executed. This corresponds to another bus access list. This implies $2^n$ bus access lists, one for each combination of $f_i$ and $g_i$ that execute, i.e., one for each possible execution path in the graph.

Although the idea of maintaining separate bus access lists is a simple mechanism for handling control constructs, it can sometimes be impractical, as in the example above. An alternative mechanism based on masking handles parallel conditional constructs more effectively. The main idea behind masking is to store an ID of a Boolean variable along with the processor ID in the bus access list. The Boolean ID determines whether a particular bus grant is "enabled." This allows us to combine the access lists of the nodes $f_1$ through $f_n$ and $g_1$ through $g_n$. The bus grant corresponding to each $f_i$ is tagged with the Boolean ID of the corresponding $b_i$, and an additional bit indicates that the bus grant is to be enabled when $b_i$ is TRUE. Similarly, each bus grant corresponding to the access list of $g_i$ is tagged with the ID of $b_i$, and an additional bit indicates that the bus grant must be enabled only if the corresponding control token has a FALSE value. At run-time, the controller steps through the bus access list as before, but instead of simply granting the bus to the processor at the head of the list, it first checks that the control token corresponding to the Boolean ID field of the list is in its correct state. If it is in the correct state (TRUE for a bus grant corresponding to an $f_i$ and FALSE for a bus grant corresponding to a $g_i$), then the bus grant is performed; otherwise it is masked. Thus the run-time values of the Booleans must be made available to the transaction controller for it to decide whether to mask a particular bus grant or not. More generally, a particular bus grant may need to be enabled by a product of Booleans, for instance when conditionals in the dataflow graph are nested within parallel branches of the graph.

Thus, in general we need to implement an annotated bus access list of the form $\{(c_1)ProcID_1, (c_2)ProcID_2, \ldots\}$, with each bus access annotated with a Boolean-valued condition $c_i$, indicating that the bus should be granted to the processor corresponding to $ProcID_i$ when $c_i$ evaluates to TRUE; $c_i$ could be an arbitrary product function of the Booleans $\{b_1, b_2, \ldots, b_n\}$ in the system and the complements of these Booleans (a bar over a variable indicates its complement).
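The behavior of a controller that steps through an annotated bus access list can be modeled in a few lines. In this sketch each condition is a product term represented as a dictionary from Boolean name to required value (an empty dictionary means the grant is unconditional). The representation, the function name, and the example list are illustrative assumptions and do not describe the hardware encoding.

```python
def granted_processors(annotated_list, booleans):
    """Step through the list and return the processor IDs whose grants are not masked.

    annotated_list: sequence of (condition, proc_id); condition maps a Boolean
    name to the value it must have for the grant to be enabled.
    booleans: current run-time values of the control tokens.
    """
    grants = []
    for condition, proc_id in annotated_list:
        enabled = all(booleans[name] == required for name, required in condition.items())
        if enabled:
            grants.append(proc_id)      # bus grant performed
        # otherwise the grant is masked and the controller moves to the next entry
    return grants


# Hypothetical annotated list: an unconditional grant, a grant for the TRUE
# branch of b1, one for its FALSE branch, and one for the product b1 AND (NOT b2).
annotated = [
    ({}, 2),
    ({"b1": True}, 1),
    ({"b1": False}, 3),
    ({"b1": True, "b2": False}, 1),
]

when_b1_only = granted_processors(annotated, {"b1": True, "b2": False})
when_neither = granted_processors(annotated, {"b1": False, "b2": False})
```

A single list therefore serves every execution path: the same four entries yield the grant sequence 2, 1, 1 when only $b_1$ is TRUE and 2, 3 when both Booleans are FALSE.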


This scheme is implemented as shown in Figure 8.9. The schedule memory now contains two fields corresponding to each bus access: <Condition>:<ProcID>, instead of the <ProcID> field alone that we had before. The <Condition> field encodes a unique product $c_i$ associated with that particular bus access. In the OMA prototype, we can use 3 bits for <ProcID>, and 5 bits for the <Condition> field. This would allow us to handle 8 processors and 32 product combinations of Booleans. There can be up to $3^n$ product terms in the worst case corresponding to $n$ Booleans in the system, because for each Boolean $b_i$, a product term could contain $b_i$, or it could contain its complement $\bar{b}_i$, or else $b_i$ could be a "don't care." It is unlikely that all $3^n$ possible product terms will be required in practice; we therefore expect such a scheme to be practical. The necessary product terms can be implemented within the controller at compile time, based on the bus access pattern of the particular dynamic dataflow graph to be executed.

Figure 8.9. A bus access mechanism that masks bus grants based on control tokens that are evaluated at run-time.

In Figure 8.9, the flags $b_1, b_2, \ldots, b_n$ are 1-bit memory elements (flip-flops) that are memory-mapped to the shared bus, and store the values of the Boolean control tokens in the system. The processor that computes the value of each control token updates the corresponding $b_i$ by writing to the shared memory location that maps to $b_i$. The product combinations $c_1, c_2, \ldots, c_m$ are just AND functions of the $b_i$'s and the complements of the $b_i$'s. As the schedule counter steps through the bus access list, the bus is actually granted only if the condition corresponding to that access evaluates to TRUE. Thus, if an entry for processor 1 whose condition is the product of one Boolean and the complement of another appears at the head of the bus access list, then processor 1 receives a bus grant only if the first control token is TRUE and the second is FALSE; otherwise the bus grant is masked and the schedule counter moves on to the next entry in the list. This scheme can be incorporated into the transaction controller in our architecture prototype, since the controller is implemented on an FPGA. The product terms $c_1, c_2, \ldots, c_m$ may be programmed into the FPGA at compile time; when we generate programs for the processors, we can also generate the annotated bus access list and a hardware description for the FPGA (in VHDL, say) that implements the required product terms.

bus access list

A straightforward, even if inefficient, mechanism for obtaining such a list is to use enumeration; we simply enumerate all possible combinations of Booleans in the system ($2^n$ combinations for $n$ Booleans), and determine the bus access sequence (sequence of ProcIDs) for each combination. Each combination corresponds to an execution path in the graph, and we can estimate the time of occurrence of bus accesses corresponding to each combination from the quasi-static schedule. For example, bus accesses corresponding to one schedule period of the two execution paths in the quasi-static schedule considered earlier are marked along the time axis in Figure 8.10 (we have ignored the bus accesses corresponding to subgraph-1 to keep the illustration simple). The bus access schedules for each of the combinations can now be collapsed into one annotated list, as in Figure 8.10; the fact that accesses for each combination are ordered with respect to time allows us to enforce a global order on the accesses in the collapsed bus access list. The bus accesses in the list are annotated with their respective Boolean conditions.

Figure 8.10. Bus access lists (for c = TRUE and c = FALSE) and the corresponding annotated list.

The collapsed list obtained above can be used, as is, in the masked controller scheme; however, there is potential for optimizing this list. Note that the same transaction may appear in the access lists corresponding to different Boolean combinations, because a particular Boolean token may be a "don't care" for that bus access. For example, the first few bus accesses in Figure 8.10 appear in both execution paths, because they are independent of the value of c. In the worst case, a bus access that is independent of all Booleans will end up appearing in the bus access lists of all of the Boolean combinations. If these bus accesses appear contiguously in the collapsed bus access sequence, we can combine them into one. For example, "(c) Proc2, ($\bar{c}$) Proc2" in the annotated schedule of Figure 8.10 can be combined into a single "Proc 2" entry, which is not conditioned on any control token. Consider another example: if we get contiguous entries "($b_1 b_2$) Proc3" and "($b_1 \bar{b}_2$) Proc3" in the collapsed list, we can replace the two entries with a single entry "($b_1$) Proc3". More generally, if the collapsed list contains a contiguous segment of the form

$$\{\ldots, (c_1)ProcID_k, (c_2)ProcID_k, \ldots, (c_l)ProcID_k, \ldots\},$$

each such contiguous segment can be written as $(c_1 + c_2 + \cdots + c_l)ProcID_k$, where the bus grant condition is an expression $(c_1 + c_2 + \cdots + c_l)$, which is a sum-of-products (SOP) function of the Booleans in the system. Two-level logic minimization can then be applied to determine a minimal representation of each of these expressions. Such two-level minimization can be done by using a logic minimization tool such as ESPRESSO [BHMSV84], which simplifies a given expression into an SOP representation with a minimal number of product terms. Suppose the expression $(c_1 + c_2 + \cdots + c_l)$ can be minimized into another SOP expression $(c'_1 + c'_2 + \cdots + c'_p)$, where $p \le l$. The segment $(c_1)ProcID_k, (c_2)ProcID_k, \ldots, (c_l)ProcID_k$ can then be replaced with an equivalent segment of the form

$$(c'_1)ProcID_k, (c'_2)ProcID_k, \ldots, (c'_p)ProcID_k.$$

This procedure results in a minimal set of contiguous appearances of a bus grant to the same processor. Another optimization that can be performed is to combine annotated bus access lists with the switching mechanism described earlier in this chapter. Suppose we have the following annotated bus access list:

$$\{\ldots, (b_1 b_2)ProcID_i, (b_1 b_3)ProcID_j, (b_1 b_4 b_5)ProcID_k, \ldots\}.$$

Then, by "factoring" $b_1$ out, the above list may be equivalently written as

$$\{\ldots, (b_1)\{(b_2)ProcID_i, (b_3)ProcID_j, (b_4 b_5)ProcID_k\}, \ldots\}.$$

Now, all three bus accesses may be skipped whenever the Boolean $b_1$ is FALSE by loading the schedule counter and forcing it to increment its count by three, instead of evaluating each access separately and skipping over each one individually. This strategy reduces overhead, because it costs an extra bus cycle to disable a bus access when the condition corresponding to that bus access evaluates to FALSE; by skipping over three bus accesses that we know are going to be disabled, we save three idle bus cycles. There is an added cost of one cycle for loading the schedule counter; the total savings in this example is therefore two bus cycles.

One of the problems with the above approach is that it involves explicit enumeration of all possible combinations of Booleans, the complexity of which limits the size of problems that can be tackled with this approach. An implicit mechanism for representing all possible execution paths is therefore desirable. One such mechanism is the use of Binary Decision Diagrams (BDDs), which have been used to efficiently represent and manipulate Boolean functions for the purpose of logic minimization [Bry86]. BDDs have been used to compactly represent large state spaces, and to perform operations implicitly over such state spaces when methods based on explicit techniques are infeasible. One difficulty encountered in applying BDDs to the problem of representing execution paths is that it is not obvious how precedence and ordering constraints can be encoded in a BDD representation. The execution paths corresponding to the various Boolean combinations can be represented using a BDD, but it isn't clear how to represent the relative order between bus accesses corresponding to the different execution paths.
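The explicit enumerate-and-collapse procedure discussed above can be sketched as follows. Each condition is stored as the set of Boolean combinations under which the access occurs (a canonical sum of products), and contiguous grants to the same processor are merged when their conditions are mutually exclusive. This is an illustrative model only: the function names, the timed-access encoding, and the example paths are assumptions, and no two-level minimizer such as ESPRESSO is invoked.

```python
from itertools import product


def collapse(names, accesses_for):
    """Enumerate all 2^n Boolean combinations and build one annotated access list.

    accesses_for(assignment) returns the (time, proc_id) bus accesses of the
    execution path selected by that assignment.
    """
    occurs_under = {}
    for values in product([False, True], repeat=len(names)):
        for time, proc in accesses_for(dict(zip(names, values))):
            occurs_under.setdefault((time, proc), set()).add(values)

    merged = []
    for (time, proc), combos in sorted(occurs_under.items(), key=lambda kv: kv[0]):
        # Merge with the previous entry only if it is for the same processor and
        # no execution path needs both grants (conditions are disjoint).
        if merged and merged[-1][0] == proc and not (merged[-1][1] & combos):
            merged[-1] = (proc, merged[-1][1] | combos)
        else:
            merged.append((proc, set(combos)))
    return merged


def replay(names, merged, assignment):
    """Grant sequence produced by the masked controller for one assignment."""
    key = tuple(assignment[n] for n in names)
    return [proc for proc, combos in merged if key in combos]


def example_paths(assignment):
    # Hypothetical timed accesses for the two execution paths selected by c.
    if assignment["c"]:
        return [(0, 2), (1, 1), (3, 2), (5, 3)]
    return [(0, 2), (1, 1), (4, 2), (6, 1)]


annotated = collapse(["c"], example_paths)
```

In this example the entries "(c) Proc2" and "(not c) Proc2" are adjacent in time order and collapse into one unconditional grant, so the annotated list has five entries instead of six while still reproducing both original access sequences.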

in Figure 8.2(b). A quasi-static schedule for such a construct may look like the one in Figure 8.11. The actors A, B, C, and D of Figure 8.2(b) are assumed to be subgraphs rather than atomic actors. Such a quasi-static schedule can be implemented in a straightforward fashion in the OMA architecture, provided that the data-dependent construct spans all the processors in the system. The bus access schedule corresponding to the iterated subgraph is simply repeated until the iteration construct terminates. The processor responsible for determining when the iteration terminates can be made to force the schedule counter to loop back until the termination condition is reached. This is shown in Figure 8.12.
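The loop-back behavior of the schedule counter can be modeled with a small simulation. The schedule contents, the addresses of the iterated segment, and the function name below are hypothetical; the sketch only illustrates the idea that one processor reloads the counter at the end of the iterated segment until the termination condition holds.

```python
def bus_grants(schedule, loop_start, loop_end, keep_iterating):
    """Simulate a presettable schedule counter with one data-dependent loop.

    schedule: processor IDs stored in schedule memory, indexed by counter value.
    loop_start, loop_end: first and last addresses of the iterated segment.
    keep_iterating: callable evaluated at the end of each pass; True means the
    responsible processor presets the counter back to loop_start.
    """
    grants, counter = [], 0
    while counter < len(schedule):
        grants.append(schedule[counter])
        if counter == loop_end and keep_iterating():
            counter = loop_start        # counter reloaded: repeat the segment
        else:
            counter += 1                # normal increment
    return grants


# Hypothetical access list: address 0 precedes the loop, addresses 1-2 form the
# iterated segment, address 3 follows it. The loop body runs three times.
decisions = iter([True, True, False])
trace = bus_grants([1, 2, 3, 1], 1, 2, lambda: next(decisions))
```

The resulting grant sequence repeats the two-entry segment three times before moving on, without storing a separate access list for each possible iteration count.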

Figure 8.11. Quasi-static schedule for the data-dependent iteration graph of Figure 8.2(b).


This chapter has dealt with extensions of the ordered-transactions approach to graphs with data-dependent control flow. The Boolean Dataflow model was briefly reviewed, along with the quasi-static approach to scheduling conditional and data-dependent iteration constructs. A scheme was then described whereby the Ordered Transactions approach could be used when such control constructs are included in the dataflow graph. In this scheme, bus access schedules are computed for each set of values that the control tokens in the graph evaluate to, and the bus access controller is made to select between these lists at run-time based on which set of values the control tokens actually take at any given time. This was also shown to be applicable to data-dependent iteration constructs. Such a scheme is feasible when the number of execution paths in the graph is small. A mechanism based on masking of bus accesses depending on run-time values of control tokens may be used for handling the case when there are multiple conditional constructs in "parallel."

Figure 8.12. A possible access order list corresponding to the quasi-static schedule of Figure 8.11. The processor that determines the termination condition of the iteration can also reinitialize the schedule counter.


The previous three chapters have been concerned with the Ordered Transactions strategy, which is a hardware approach to reducing IPC and synchronization costs in self-timed schedules. In this chapter and the following chapters, we discuss software-based strategies for minimizing synchronization costs in the final implementation of a given self-timed schedule. These software-based techniques are widely applicable to shared-memory multiprocessors that consist of homogeneous or heterogeneous collections of processors, and they do not require the availability of hardware support for employing the OT approach or any other form of specialized hardware support.

Recall that the self-timed scheduling strategy introduces synchronization checks whenever processors communicate. A straightforward implementation of a self-timed schedule would require that for each inter-processor communication (IPC), the sending processor ascertain that the buffer it is writing to is not full, and the receiver ascertain that the buffer it is reading from is not empty. The processors block (suspend execution) when the appropriate condition is not met. Such sender-receiver synchronization can be implemented in many ways depending on the particular hardware platform under consideration: in shared memory machines, such synchronization involves testing and setting semaphores in shared memory; in machines that support synchronization in hardware (such as barriers), special synchronization instructions are used; and in the case of systems that consist of a mix of programmable processors and custom hardware elements, synchronization is achieved by employing interfaces that support blocking reads and writes. In each type of platform, each IPC that requires a synchronization check costs performance, and sometimes extra hardware complexity. Semaphore checks cost execution time on the processors, synchronization instructions that make use of special synchronization hardware such as barriers also cost execution time, and blocking interfaces between a programmable processor and custom hardware in a combined hardware/software implementation require more hardware than non-blocking interfaces [H+93].

In this chapter, we present algorithms and techniques that reduce the rate at which processors must access shared memory for the purpose of synchronization in multiprocessor implementations of SDF programs. One of the procedures we present, for example, detects when the objective of one synchronization operation is guaranteed as a side effect of other synchronizations in the system, thus enabling us to eliminate such superfluous synchronization operations. The optimization procedure that we propose can be used as a post-processing step in any static scheduling technique (for example, any one of the techniques presented in Chapter 5) for reducing synchronization costs in the final implementation. As before, we assume that "good" estimates are available for the execution times of actors and that these execution times rarely display large variations, so that self-timed scheduling is viable for the applications under consideration. If additional timing information is available, such as guaranteed upper and lower bounds on the execution times of actors, it is possible to use this information to further optimize synchronizations in the schedule. However, use of such timing bounds will be left as future work; we mention this again in Chapter 13.

Among the prior work that is most relevant to this chapter is the barrier MIMD principle of Dietz, Zaafrani, and O'Keefe, which is a combined hardware and software solution to reducing run-time synchronization overhead [DZO92]. In this approach, a shared-memory MIMD computer is augmented with hardware support that allows arbitrary subsets of processors to synchronize precisely with respect to one another by executing a synchronization operation called a barrier. If a subset of processors is involved in a barrier operation, then each processor in this subset will wait at the barrier until all other processors in the subset have reached the barrier. After all processors in the subset have reached the barrier, the corresponding processes resume execution in synchrony.

In [DZO92], the barrier mechanism is applied to minimize synchronization overhead in a self-timed schedule with hard lower and upper bounds on the task execution times. The execution time ranges are used to detect situations where the earliest possible execution time of a task that requires data from another processor is guaranteed to be later than the latest possible time at which the required data is produced. When such an inference cannot be made, a barrier is instantiated between the sending and receiving processors. In addition to performing the required data synchronization, the barrier resets (to zero) the uncertainty between the relative execution times for the processors that are involved in the barrier, and thus enhances the potential for subsequent timing analysis to eliminate the need for explicit synchronizations.

The techniques of barrier MIMD do not apply to the problem that we address because they assume that a hardware barrier mechanism exists; they assume that tight bounds on task execution times are available; they do not address iterative, self-timed execution, in which the execution of successive iterations of the dataflow graph can overlap; and even for non-iterative execution, there is no obvious correspondence between an optimal solution that uses barrier synchronizations and an optimal solution that employs decoupled synchronization checks at the sender and receiver ends. This last point is illustrated in Figure 9.1. Here, in the absence of execution time bounds, an optimal application of barrier synchronizations can be obtained by inserting two barriers, one across one pair of actors and the other across a second pair that includes $A_4$. This is illustrated in Figure 9.1(c). However, the corresponding collection of directed synchronizations (one of which is directed to $A_4$) is not sufficient, since it does not guarantee that the data required by one of the downstream actors is available before that actor begins execution.

In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of a dataflow graph. However, this work, like that of Dietz et al., does not allow the execution of successive iterations of the dataflow graph to overlap. It also avoids having to consider dataflow edges that have delay. The technique that we discuss in this chapter for removing redundant synchronizations can be viewed as a generalization of Shaffer's algorithm to handle delays and overlapped, iterative execution, and we will discuss this further in Section 9.7. The other major software-based techniques for synchronization optimization that we discuss in this book are fundamentally different from Shaffer's technique, since they address issues that are specific to the more general context of overlapped, iterative execution. These techniques are the handling of the feedforward edges of the synchronization graph (to be defined in Section 9.5.2), discussed in Section 9.8, and "resynchronization", discussed in Chapters 10 and 11.

As discussed in Chapter 4, a multiprocessor executing a self-timed schedule is one where each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Thus, exactly when a processor executes each actor depends on when, at run time, all input data for that actor is available, unlike the fully-static case where no such run-time check is needed. In this chapter we use "processor" in slightly general terms: a processor could be a programmable component, in which case the actors mapped to it execute as software entities, or it could be a hardware component, in which case actors assigned to it are implemented and execute in hardware. See [KL93] for a discussion on combined hardware/software synthesis from a single dataflow specification. Examples of application-specific multiprocessors that use programmable processors and some form of static scheduling are described in [B+88][Koh90], which were also discussed in Chapter 2.

Inter-processor communication between processors is assumed to take place via shared memory. Thus the sender writes to a particular shared memory location and the receiver reads from that location. The shared memory itself could be global memory between all processors, or it could be distributed between pairs of processors (as hardware FIFO queues or dual-ported memories, for example). Each inter-processor communication edge thus translates into a buffer of a certain size in shared memory. Sender-receiver synchronization is also assumed to take place by setting flags in shared memory. Special hardware for synchronization (barriers, semaphores implemented in hardware, etc.) would be prohibitive for the multiprocessor machines and applications that we are considering. Interfaces between hardware and software are typically implemented using memory-mapped registers in the address space of the programmable processor (again a kind of shared memory), and synchronization is achieved using flags that can be tested and set by the programmable component; the same can be done by an interface controller on the hardware side [H+93].

Under the model above, the benefits of synchronization optimization become obvious. Each synchronization that is eliminated directly results in one less synchronization check, or, equivalently, one less shared memory access.
For example, where a processor would have to check a flag in shared memory before executing a receive primitive, eliminating that synchronization implies there is no longer a need for such a check. This translates to one less shared memory read. Such a benefit is especially significant for simplifying interfaces between a programmable component and a hardware component: a send or a receive without the need for synchronization implies that the interface can be implemented in a non-blocking fashion, greatly simplifying the interface controller. As a result, eliminating a synchronization directly results in simpler hardware in this case. Thus, the metric for the optimizations we present in this chapter is the total number of accesses to shared memory that are needed for the purpose of synchronization in the final multiprocessor implementation of the self-timed schedule. This metric will be defined precisely in Section 9.6.

We model synchronization in a self-timed implementation using the IPC graph model introduced in the previous chapter. As before, an IPC graph $G_{ipc} = (V, E_{ipc})$ is extracted from a given HSDFG $G$ and multiprocessor schedule; Figure 9.2 shows one such example, which we use throughout this chapter. We will find it useful to partition the edges of the IPC graph in the following manner: $E_{ipc} \equiv E_{int} \cup E_{comm}$, where $E_{comm}$ are the communication edges (shown dashed in Figure 9.2(d)) that are directed from the send to the receive actors in $G_{ipc}$, and $E_{int}$ are the "internal" edges that represent the fact that actors assigned to a particular processor (actors internal to that processor) are executed sequentially according to the order predetermined by the self-timed schedule. A communication edge $e \in E_{comm}$ in $G_{ipc}$ represents two functions: 1) reading and writing of data values into the buffer represented by that edge; and 2) synchronization between the sender and the receiver. As mentioned before, we assume the use of shared memory for the purpose of synchronization; the synchronization operation itself must be implemented using some kind of software protocol between the sender and the receiver. We discuss these synchronization protocols shortly.


Recall from Lemma 7.3 that the average iteration period corresponding to a self-timed schedule with an IPC graph $G_{ipc}$ is given by the maximum cycle mean of the graph, $MCM(G_{ipc})$. If we only have execution time estimates available instead of exact values, and we set the execution times of actors $t(v)$ to be equal to these estimated values, then we obtain the estimated iteration period by computing $MCM(G_{ipc})$. Henceforth we will assume that we know the estimated throughput $MCM^{-1}$, calculated by setting the $t(v)$ values to the available timing estimates. In all the transformations that we present in the rest of the chapter, we will preserve the estimated throughput by preserving the maximum cycle mean of $G_{ipc}$, with each $t(v)$ set to the estimated execution time of $v$. In the absence of more precise timing information, this is the best we can hope to do.
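To make the maximum cycle mean concrete, the following is a minimal Python sketch that computes it by brute-force enumeration of simple cycles. It assumes the usual definition, the maximum over cycles of the sum of actor execution times on the cycle divided by the total delay on the cycle; the function name, the edge encoding `(src, snk, delay)`, and the small two-processor graph are illustrative assumptions, not taken from the text. Enumeration is only practical for small graphs.

```python
def max_cycle_mean(vertices, edges, t):
    """Brute-force maximum cycle mean (illustrative sketch, small graphs only).

    vertices: list of actor names
    edges:    list of (src, snk, delay) tuples
    t:        dict mapping actor name -> estimated execution time
    Returns the max over simple cycles of
    (sum of t(v) on the cycle) / (sum of delays on the cycle).
    """
    order = {v: i for i, v in enumerate(vertices)}
    out = {v: [] for v in vertices}
    for u, v, d in edges:
        out[u].append((v, d))
    best = 0.0

    def dfs(start, v, time, delay, on_path):
        nonlocal best
        for w, d in out[v]:
            if w == start:
                total = delay + d
                if total == 0:
                    raise ValueError("zero-delay cycle: the graph deadlocks")
                best = max(best, time / total)
            # Only visit vertices ordered after `start`, so that each simple
            # cycle is enumerated once, from its lowest-ordered vertex.
            elif order[w] > order[start] and w not in on_path:
                on_path.add(w)
                dfs(start, w, time + t[w], delay + d, on_path)
                on_path.remove(w)

    for s in vertices:
        dfs(s, s, t[s], 0, {s})
    return best


# Hypothetical example: processor 1 runs A, B; processor 2 runs C, D.
V = ["A", "B", "C", "D"]
E = [("A", "B", 0), ("B", "A", 1),   # processor 1 loop (one delay on loop-back)
     ("C", "D", 0), ("D", "C", 1),   # processor 2 loop
     ("A", "C", 0), ("D", "A", 1)]   # communication edges
T = {"A": 2, "B": 3, "C": 1, "D": 1}
estimated_period = max_cycle_mean(V, E, T)
```

In this hypothetical graph the cycle through A and B dominates, so the estimated iteration period is the sum of their execution times divided by the single delay on that cycle.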

Figure 9.2. Self-timed execution.

In dataflow semantics, the edges between actors represent infinite buffers. Accordingly, the edges of the IPC graph are potentially buffers of infinite size. However, from Lemma 7.1, every feedback edge (an edge that belongs to a strongly connected component, and hence to some cycle) can only have a finite number of tokens at any time during the execution of the IPC graph. We will call this constant the self-timed buffer bound of that edge, and for a feedback edge $e$ we will represent this bound by $B_{fb}(e)$. Lemma 7.1 yields the following self-timed buffer bound:

$$B_{fb}(e) = \min(\{\mathit{Delay}(C) \mid C \text{ is a cycle that contains } e\}) \tag{9-1}$$
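As an illustration of the self-timed buffer bound, the minimum delay over all cycles that contain an edge $e$ equals $\mathit{delay}(e)$ plus the minimum path delay from $snk(e)$ back to $src(e)$. The Python sketch below computes it that way with Dijkstra's algorithm; the function names and the example graph are assumptions made for illustration, and a feedforward edge is reported as having an infinite bound.

```python
import heapq
import math


def min_path_delays(edges, source):
    """Dijkstra over non-negative edge delays; returns {vertex: min path delay}."""
    adj = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, d in adj.get(u, []):
            if du + d < dist.get(v, math.inf):
                dist[v] = du + d
                heapq.heappush(heap, (du + d, v))
    return dist


def self_timed_buffer_bound(edges, e):
    """Minimum total delay over cycles containing e; math.inf if e is feedforward."""
    src, snk, delay = e
    return delay + min_path_delays(edges, snk).get(src, math.inf)


# Hypothetical IPC graph: two processors with one delay on each loop-back edge.
E = [("A", "B", 0), ("B", "A", 1),
     ("C", "D", 0), ("D", "C", 1),
     ("A", "C", 0), ("D", "A", 1)]
bound_ac = self_timed_buffer_bound(E, ("A", "C", 0))
```

Here the only cycle through (A, C) returns via (D, A), which carries one delay, so at most one token can ever accumulate on (A, C).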

Feedforward edges (edges that do not belong to any strongly connected component) have no such bound on buffer size; therefore, for practical implementations we need to impose a bound on the sizes of these edges. For example, Figure 9.3(a) shows an IPC graph where the communication edge $(s, r)$ could be unbounded when the execution time of $A$ is less than that of $B$. In practice, we need to bound the buffer size of such an edge; we will denote such an "imposed" bound for a feedforward edge $e$ by $B_{ff}(e)$. Since the effect of placing such a restriction includes "artificially" constraining $src(e)$ from getting more than $B_{ff}(e)$ invocations ahead of $snk(e)$, its effect on the estimated throughput can be modeled by adding to $G_{ipc}$ a reverse edge that has $m$ delays on it, where $m = B_{ff}(e) - \mathit{delay}(e)$ (the grey edge in Figure 9.3(b)). Since the addition of this edge introduces a new cycle in $G_{ipc}$, it has the potential to reduce the estimated throughput; to prevent such a reduction, $B_{ff}(e)$ must be chosen to be large enough so that the maximum cycle mean remains unchanged upon adding the reverse edge with $m$ delays.

Figure 9.3. An IPC graph with a feedforward edge: (a) original graph; (b) imposing bounded buffers.

Sizing buffers optimally such that the maximum cycle mean remains unchanged has been studied by Kung, Lewis and Lo in [KLL87], where the authors propose an integer linear programming formulation of the problem, with the number of constraints equal to the number of fundamental cycles in the HSDFG (potentially an exponential number of constraints). An efficient heuristic procedure to determine $B_{ff}$ is to note that if $B_{ff}(e)$ is chosen sufficiently large for each feedforward edge $e$, then the maximum cycle mean of the resulting graph does not exceed $MCM(G_{ipc})$. Then, a binary search on $B_{ff}(e)$ for each feedforward edge, computing the maximum cycle mean at each search step and ascertaining that it does not exceed $MCM(G_{ipc})$, results in a buffer assignment for the feedforward edges. Although this procedure is efficient, it is suboptimal because the order in which the edges $e$ are chosen is arbitrary and may affect the quality of the final solution.

As we will see in Section 9.8, however, imposing such a bound $B_{ff}$ is a naive approach for bounding buffer sizes, because such a bound entails an added synchronization cost. In Section 9.8 we show that there is a better technique for bounding buffer sizes; this technique achieves bounded buffer sizes by transforming the graph into a strongly connected graph by adding a minimal number of additional synchronization edges. Thus, in the final algorithm, it is not in fact necessary to use or compute these bounds $B_{ff}$.
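The binary search described above can be sketched for a single feedforward edge as follows. This is a hedged illustration, not the procedure of [KLL87]: it assumes the bound is modeled by a reverse edge carrying $B_{ff}(e) - \mathit{delay}(e)$ delays, it uses a brute-force maximum cycle mean that is only suitable for small graphs, and the names `smallest_buffer_bound` and `upper` (a caller-supplied bound assumed to be large enough) are hypothetical.

```python
import math


def max_cycle_mean(vertices, edges, t):
    """Brute-force maximum cycle mean; math.inf if a zero-delay cycle exists."""
    order = {v: i for i, v in enumerate(vertices)}
    out = {v: [] for v in vertices}
    for u, v, d in edges:
        out[u].append((v, d))
    best = 0.0

    def dfs(start, v, time, delay, on_path):
        nonlocal best
        for w, d in out[v]:
            if w == start:
                total = delay + d
                best = max(best, math.inf if total == 0 else time / total)
            elif order[w] > order[start] and w not in on_path:
                on_path.add(w)
                dfs(start, w, time + t[w], delay + d, on_path)
                on_path.remove(w)

    for s in vertices:
        dfs(s, s, t[s], 0, {s})
    return best


def smallest_buffer_bound(vertices, edges, t, e, upper):
    """Smallest B in [max(delay, 1), upper] whose reverse edge keeps the MCM unchanged."""
    src, snk, delay = e
    target = max_cycle_mean(vertices, edges, t)

    def ok(bound):
        trial = edges + [(snk, src, bound - delay)]  # assumed reverse-edge model
        return max_cycle_mean(vertices, trial, t) <= target

    if not ok(upper):
        raise ValueError("upper is too small to preserve the maximum cycle mean")
    lo, hi = max(delay, 1), upper
    while lo < hi:
        mid = (lo + hi) // 2
        if ok(mid):   # more delays on the reverse edge can only lower cycle means
            hi = mid
        else:
            lo = mid + 1
    return lo


# Hypothetical example: a fast sender S (t = 1) feeding a slow receiver R (t = 3).
V = ["S", "R"]
E = [("S", "S", 1), ("R", "R", 1), ("S", "R", 0)]
T = {"S": 1, "R": 3}
bound = smallest_buffer_bound(V, E, T, ("S", "R", 0), upper=8)
```

With these numbers the new cycle through the reverse edge has total execution time 4, so it needs at least two delays to stay at or below the original maximum cycle mean of 3.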

We define two basic synchronization protocols for a communication edge, based on whether or not the length of the corresponding buffer is guaranteed to be bounded from the analysis presented in the previous section. Given an IPC graph $G$ and a communication edge $e$ in $G$, if the length of the corresponding buffer is not bounded (that is, if $e$ is a feedforward edge of $G$), then we apply a synchronization protocol called unbounded buffer synchronization (UBS), which guarantees that (a) an invocation of $snk(e)$ never attempts to read data from an empty buffer; and (b) an invocation of $src(e)$ never attempts to write data into the buffer unless the number of tokens in the buffer is less than some pre-specified limit $B_{ff}(e)$, which is the amount of memory allocated to the buffer as discussed in the previous section. On the other hand, if the topology of the IPC graph guarantees that the buffer length for $e$ is bounded by some value $B_{fb}(e)$ (the self-timed buffer bound of $e$), then we use a simpler protocol, called bounded buffer synchronization (BBS), that only explicitly ensures (a) above. The mechanics of the two synchronization protocols are as follows.

BBS. In this mechanism, a write pointer $wr(e)$ for $e$ is maintained on the processor that executes $src(e)$; a read pointer $rd(e)$ for $e$ is maintained on the processor that executes $snk(e)$; and a copy of $wr(e)$ is maintained in some shared memory location $sv(e)$. The pointers $rd(e)$ and $wr(e)$ are initialized to zero and $\mathit{delay}(e)$, respectively. Just after each execution of $src(e)$, the new data value produced onto $e$ is written into the shared memory buffer for $e$ at offset $wr(e)$; $wr(e)$ is updated by the operation $wr(e) \leftarrow (wr(e) + 1) \bmod B_{fb}(e)$; and $sv(e)$ is updated to contain the new value of $wr(e)$. Just before each execution of $snk(e)$, the value contained in $sv(e)$ is repeatedly examined until it is found to be not equal to $rd(e)$; then the data value residing at offset $rd(e)$ of the shared memory buffer for $e$ is read; and $rd(e)$ is updated by the operation $rd(e) \leftarrow (rd(e) + 1) \bmod B_{fb}(e)$.

UBS. This mechanism also uses the read/write pointers $rd(e)$ and $wr(e)$, and these are initialized the same way; however, rather than maintaining a copy of $wr(e)$ in the shared memory location $sv(e)$, we maintain a count (initialized to $\mathit{delay}(e)$) of the number of unread tokens that currently reside in the buffer. Just after $src(e)$ executes, $sv(e)$ is repeatedly examined until its value is found to be less than $B_{ff}(e)$; then the new data value produced onto $e$ is written into the shared memory buffer for $e$ at offset $wr(e)$; $wr(e)$ is updated as in BBS (except that the new value is not written to shared memory); and the count in $sv(e)$ is incremented. Just before each execution of $snk(e)$, the value contained in $sv(e)$ is repeatedly examined until it is found to be nonzero; then the data value residing at offset $rd(e)$ of the shared memory buffer for $e$ is read; the count in $sv(e)$ is decremented; and $rd(e)$ is updated as in BBS.

Note that we are assuming that there is enough shared memory to hold a separate buffer of size $B_{ff}(e)$ for each feedforward communication edge $e$ of $G_{ipc}$, and a separate buffer of size $B_{fb}(e)$ for each feedback communication edge $e$. When this assumption does not hold, smaller bounds must be imposed on some of the buffers, possibly for feedback edges as well as for feedforward edges, and in general this may require some sacrifice in estimated throughput. Note that whenever a buffer bound smaller than $B_{fb}(e)$ is imposed on a feedback edge $e$, a protocol identical to UBS must be used. The problem of optimally choosing which edges should be subject to stricter buffer bounds when there is a shortage of shared memory, and the selection of these stricter bounds, is an interesting area for further investigation.

An important parameter in an implementation of UBS or BBS is the back-off time $T_b$. If a receiving processor finds that the corresponding IPC buffer is empty, then the processor releases the shared memory bus, and waits $T_b$ time units before requesting the bus again to re-check the shared memory synchronization variable. Similarly, a sending processor waits $T_b$ time units between successive accesses of the same synchronization variable. The back-off time can be selected experimentally by simulating the execution of the given synchronization graph (with the available execution time estimates) over a wide range of candidate back-off times, and selecting the back-off time that yields the highest simulated throughput.

As we discussed in the beginning of this chapter, some of the communication edges in $G_{ipc}$ need not have explicit synchronization, whereas others require synchronization, which needs to be implemented using either the UBS protocol or the BBS protocol. All communication edges also represent buffers in shared memory. Thus we divide the set of communication edges as follows: $E_{comm} \equiv E_s \cup E_r$, where the edges $E_s$ need explicit synchronization and the edges $E_r$ need no explicit synchronization. Recall that a communication edge $(v_j, v_i)$ of $G_{ipc}$ represents the synchronization constraint

$$\mathit{start}(v_i, k) \ge \mathit{end}(v_j, k - \mathit{delay}((v_j, v_i))) \quad \forall k > \mathit{delay}((v_j, v_i)).$$

Thus, before we perform any optimization on synchronizations, $E_s \equiv E_{comm}$ and $E_r \equiv \emptyset$, because every communication edge represents a synchronization point. However, in the following sections we describe how we can move certain edges from $E_s$ to $E_r$, thus reducing synchronization operations in the final implementation. After all synchronization optimizations have been applied, the communication edges of the IPC graph fall into either $E_s$ or $E_r$. At this point the edges $E_s \cup E_r$ in $G_{ipc}$ represent buffer activity, and must be implemented as buffers in shared memory, whereas the edges $E_s$ represent synchronization constraints, and are implemented using the UBS and BBS protocols introduced in the previous section. For the edges in $E_s$, the synchronization protocol is executed before the buffers corresponding to the communication edge are accessed, so as to ensure sender-receiver synchronization. For edges in $E_r$, however, no synchronization needs to be done before accessing the shared buffer. Sometimes we will also find it useful to introduce synchronization edges without actually communicating data between the sender and the receiver (for the purpose of ensuring finite buffers, for example), so that no shared buffers need to be assigned to these edges, but the corresponding synchronization protocol is invoked for these edges.
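The two synchronization protocols described in this section (BBS for feedback edges, UBS for feedforward edges) can be sketched in Python as single-threaded models that count accesses to the shared synchronization variable. This is only an illustrative sketch under stated assumptions: the class and method names are hypothetical; the non-blocking `try_` methods stand in for the busy-wait loops (a caller would back off and retry on failure); real implementations also need the flag updates to be visible atomically; and the BBS model allocates one slot more than the bound so that a completely full buffer is not mistaken for an empty one, a detail chosen here rather than taken from the text.

```python
class BBSEdge:
    """Bounded buffer synchronization: only the receiver checks before access."""

    def __init__(self, bound, initial_tokens=()):
        self.size = bound + 1                  # assumption: one spare slot
        self.buf = [None] * self.size
        for i, tok in enumerate(initial_tokens):
            self.buf[i] = tok
        self.rd = 0                            # read pointer, local to receiver
        self.wr = len(initial_tokens)          # write pointer, local to sender
        self.sv = self.wr                      # shared copy of the write pointer
        self.sync_accesses = 0

    def send(self, token):
        # No check: the graph topology guarantees that space is available.
        self.buf[self.wr] = token
        self.wr = (self.wr + 1) % self.size
        self.sv = self.wr                      # one shared-memory write
        self.sync_accesses += 1

    def try_receive(self):
        self.sync_accesses += 1                # one shared-memory read of sv
        if self.sv == self.rd:
            return (False, None)               # empty: back off and retry later
        token = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.size
        return (True, token)


class UBSEdge:
    """Unbounded buffer synchronization: sender and receiver both check a count."""

    def __init__(self, limit, initial_tokens=()):
        self.limit = limit
        self.buf = [None] * limit
        for i, tok in enumerate(initial_tokens):
            self.buf[i] = tok
        self.rd = 0
        self.wr = len(initial_tokens) % limit
        self.sv = len(initial_tokens)          # shared count of unread tokens
        self.sync_accesses = 0

    def try_send(self, token):
        self.sync_accesses += 1                # read the count
        if self.sv >= self.limit:
            return False                       # full: back off and retry later
        self.buf[self.wr] = token
        self.wr = (self.wr + 1) % self.limit
        self.sv += 1                           # update the count
        self.sync_accesses += 1
        return True

    def try_receive(self):
        self.sync_accesses += 1                # read the count
        if self.sv == 0:
            return (False, None)
        token = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.limit
        self.sv -= 1                           # update the count
        self.sync_accesses += 1
        return (True, token)


bbs = BBSEdge(bound=2)
bbs.send("x")
bbs_result = bbs.try_receive()

ubs = UBSEdge(limit=2)
ubs_sent = ubs.try_send("x")
ubs_result = ubs.try_receive()
```

When neither side has to retry, one token transfer costs two accesses to the synchronization variable under the BBS model and four under the UBS model, which is the difference that the synchronization cost metric of this chapter is built on.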

All optimizations that move edges from $E_s$ to $E_r$ must respect the synchronization constraints implied by $G_{ipc}$. If we ensure this, then we only need to implement the synchronization edges $E_s$. We call the graph $G_s = (V, E_{int} \cup E_s)$ the synchronization graph. $G_s$ represents the synchronization constraints that need to be explicitly ensured, and the algorithms we present for minimizing synchronization costs operate on $G_s$. Before any synchronization-related optimizations are performed, $G_s \equiv G_{ipc}$, because $E_{comm} \equiv E_s$ at this stage, but as we move communication edges from $E_s$ to $E_r$, $G_s$ has fewer and fewer edges. Thus, moving edges from $E_s$ to $E_r$ can be viewed as removal of edges from $G_s$. Whenever we remove edges from $G_s$ we have to ensure, of course, that the synchronization graph $G_s$ at that step respects all the synchronization constraints of $G_{ipc}$, because we only implement synchronizations represented by the edges $E_s$ in $G_s$. The following theorem is useful to formalize the concept of when the synchronization constraints represented by one synchronization graph $G_s^1$ imply the synchronization constraints of another graph $G_s^2$. This theorem provides a useful constraint for synchronization optimization, and it underlies the validity of the main techniques that we will present in this chapter.

Theorem 9.1: The synchronization constraints in a synchronization graph $G_s^1 = (V, E_{int} \cup E_s^1)$ imply the synchronization constraints of the synchronization graph $G_s^2 = (V, E_{int} \cup E_s^2)$ if the following condition holds:

$$\forall \varepsilon \text{ s.t. } \varepsilon \in E_s^2,\ \varepsilon \notin E_s^1: \quad \rho_{G_s^1}(src(\varepsilon), snk(\varepsilon)) \le \mathit{delay}(\varepsilon);$$

that is, if for each edge $\varepsilon$ that is present in $G_s^2$ but not in $G_s^1$ there is a minimum delay path from $src(\varepsilon)$ to $snk(\varepsilon)$ in $G_s^1$ that has total delay of at most $\mathit{delay}(\varepsilon)$.

(Note that since the vertex sets for the two graphs are identical, it is meaningful to refer to $src(\varepsilon)$ and $snk(\varepsilon)$ as being vertices of $G_s^1$ even though there are edges $\varepsilon$ s.t. $\varepsilon \in E_s^2$, $\varepsilon \notin E_s^1$.)

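First we prove the following lemma.

Lemma 9.1: If there is a path $p = (e_1, e_2, e_3, \ldots, e_n)$ in $G_s^1$, then

$$\mathit{start}(snk(e_n), k) \ge \mathit{end}(src(e_1), k - \mathit{Delay}(p)).$$

Proof of Lemma 9.1: The following constraints hold along such a path $p$ (as per (4-1)). For the first edge,

$$\mathit{start}(snk(e_1), k) \ge \mathit{end}(src(e_1), k - \mathit{delay}(e_1)).$$

Similarly,

$$\mathit{start}(snk(e_2), k) \ge \mathit{end}(src(e_2), k - \mathit{delay}(e_2)).$$

Noting that $src(e_2)$ is the same as $snk(e_1)$, we get

$$\mathit{start}(snk(e_2), k) \ge \mathit{end}(snk(e_1), k - \mathit{delay}(e_2)).$$

Causality implies $\mathit{end}(v, k) \ge \mathit{start}(v, k)$, so we get

$$\mathit{start}(snk(e_2), k) \ge \mathit{start}(snk(e_1), k - \mathit{delay}(e_2)).$$

Substituting the constraint for $e_1$ into this inequality,

$$\mathit{start}(snk(e_2), k) \ge \mathit{end}(src(e_1), k - \mathit{delay}(e_2) - \mathit{delay}(e_1)).$$

Continuing along $p$ in this manner, it can easily be verified that

$$\mathit{start}(snk(e_n), k) \ge \mathit{end}(src(e_1), k - \mathit{delay}(e_n) - \mathit{delay}(e_{n-1}) - \cdots - \mathit{delay}(e_1)),$$

that is,

$$\mathit{start}(snk(e_n), k) \ge \mathit{end}(src(e_1), k - \mathit{Delay}(p)).$$

QED.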

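Proof of Theorem 9.1: If $\varepsilon \in E_s^2$ and $\varepsilon \in E_s^1$, then the synchronization constraint due to the edge $\varepsilon$ holds in both graphs. But for each $\varepsilon$ s.t. $\varepsilon \in E_s^2$, $\varepsilon \notin E_s^1$, we need to show that the constraint due to $\varepsilon$,

$$\mathit{start}(snk(\varepsilon), k) \ge \mathit{end}(src(\varepsilon), k - \mathit{delay}(\varepsilon)),$$

holds in $G_s^1$ provided $\rho_{G_s^1}(src(\varepsilon), snk(\varepsilon)) \le \mathit{delay}(\varepsilon)$, which implies there is at least one path $p = (e_1, e_2, \ldots, e_n)$ from $src(\varepsilon)$ to $snk(\varepsilon)$ in $G_s^1$ (so that $src(e_1) = src(\varepsilon)$ and $snk(e_n) = snk(\varepsilon)$) such that $\mathit{Delay}(p) \le \mathit{delay}(\varepsilon)$.

From Lemma 9.1, existence of such a path $p$ implies

$$\mathit{start}(snk(e_n), k) \ge \mathit{end}(src(e_1), k - \mathit{Delay}(p)),$$

that is,

$$\mathit{start}(snk(\varepsilon), k) \ge \mathit{end}(src(\varepsilon), k - \mathit{Delay}(p)).$$

If $\mathit{Delay}(p) \le \mathit{delay}(\varepsilon)$, then $\mathit{end}(src(\varepsilon), k - \mathit{Delay}(p)) \ge \mathit{end}(src(\varepsilon), k - \mathit{delay}(\varepsilon))$. Substituting this in the preceding inequality, we get

$$\mathit{start}(snk(\varepsilon), k) \ge \mathit{end}(src(\varepsilon), k - \mathit{delay}(\varepsilon)).$$

The above relation is identical to the constraint due to $\varepsilon$, and this proves the theorem. QED.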

The above theorem motivates the following definition.

Definition 9.1: If $G_s^1 = (V, E_{int} \cup E_s^1)$ and $G_s^2 = (V, E_{int} \cup E_s^2)$ are synchronization graphs with the same vertex set, we say that $G_s^1$ preserves $G_s^2$ if, for all $\varepsilon$ s.t. $\varepsilon \in E_s^2$, $\varepsilon \notin E_s^1$, we have $\rho_{G_s^1}(src(\varepsilon), snk(\varepsilon)) \le \mathit{delay}(\varepsilon)$.

Thus, Theorem 9.1 states that the synchronization constraints of $(V, E_{int} \cup E_s^1)$ imply the synchronization constraints of $(V, E_{int} \cup E_s^2)$ if $(V, E_{int} \cup E_s^1)$ preserves $(V, E_{int} \cup E_s^2)$.

Given an IPC graph $G_{ipc}$ and a synchronization graph $G_s$ such that $G_s$ preserves $G_{ipc}$, suppose we implement the synchronizations corresponding to the synchronization edges of $G_s$. Then, the iteration period of the resulting system is determined by the maximum cycle mean of $G_s$ ($MCM(G_s)$). This is because the synchronization edges alone determine the interaction between processors; a communication edge without synchronization does not constrain the execution of the corresponding processors in any way.
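The "preserves" relation of Definition 9.1 can be checked mechanically with a shortest-path computation over edge delays. The following Python sketch is one way to do so; the function names, the `(src, snk, delay)` edge encoding, and the two-processor example are assumptions made for illustration.

```python
import heapq
import math


def min_path_delays(edges, source):
    """Dijkstra over non-negative edge delays; returns {vertex: min path delay}."""
    adj = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, d in adj.get(u, []):
            if du + d < dist.get(v, math.inf):
                dist[v] = du + d
                heapq.heappush(heap, (du + d, v))
    return dist


def preserves(int_edges, sync1, sync2):
    """True if (V, E_int + sync1) preserves (V, E_int + sync2)."""
    g1 = int_edges + sync1
    for e in sync2:
        if e in sync1:
            continue  # constraint is implemented directly in both graphs
        src, snk, delay = e
        if min_path_delays(g1, src).get(snk, math.inf) > delay:
            return False
    return True


# Hypothetical example: processor 1 runs A, B, C, D; processor 2 runs E, F, G, H.
E_INT = [("A", "B", 0), ("B", "C", 0), ("C", "D", 0), ("D", "A", 1),
         ("E", "F", 0), ("F", "G", 0), ("G", "H", 0), ("H", "E", 1)]
x1 = ("H", "B", 0)
x2 = ("F", "D", 0)
keeps_x1_only = preserves(E_INT, [x1], [x1, x2])
keeps_x2_only = preserves(E_INT, [x2], [x1, x2])
```

In this example, dropping `x2` is safe because a zero-delay path from F to D still exists through `x1`, whereas dropping `x1` is not, since every remaining path from H to B carries more delay than `x1` does.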

We refer to each access of the shared memory "synchronization variable" $sv(e)$ by $src(e)$ and $snk(e)$ as a synchronization access to shared memory. If synchronization for $e$ is implemented using UBS, then we see that on average, 4 synchronization accesses are required for $e$ in each iteration period, while BBS implies 2 synchronization accesses per iteration period. We define the synchronization cost of a synchronization graph $G_s$ to be the average number of synchronization accesses required per iteration period. Thus, if $n_{ff}$ denotes the number of synchronization edges in $G_s$ that are feedforward edges, and $n_{fb}$ denotes the number of synchronization edges that are feedback edges, then the synchronization cost of $G_s$ can be expressed as $(4 n_{ff} + 2 n_{fb})$. In the remainder of this chapter, we develop techniques that apply the results and the analysis framework developed in the previous sections to minimize the synchronization cost of a self-timed implementation of an HSDFG without sacrificing the integrity of any inter-processor data transfer or reducing the estimated throughput.

Note that in the measure defined above of the number of shared memory accesses required for synchronization, some accesses to shared memory are not taken into account. In particular, the "synchronization cost" metric does not consider accesses to shared memory that are performed while the sink actor is waiting for the required data to become available, or the source actor is waiting for an "empty slot" in the buffer. The number of accesses required to perform these "busy-wait" or "spin-lock" operations is dependent on the exact relative execution times of the actor invocations. Since in the problem context under consideration this information is not generally available to us, the best-case number of accesses (the number of shared memory accesses required for synchronization assuming that IPC data on an edge is always produced before the corresponding sink invocation attempts to execute) is used as an approximation.

In the remainder of this chapter, we discuss two mechanisms for reducing synchronization accesses. The first (presented in Section 9.7) is the detection and removal of redundant synchronization edges, which are synchronization edges whose respective synchronization functions are subsumed by other synchronization edges, and thus need not be implemented explicitly. This technique essentially detects the set of edges that can be moved from the set $E_s$ to the set $E_r$. In Section 9.8, we examine the utility of adding additional synchronization edges to convert a synchronization graph that is not strongly connected into a strongly connected graph. Such a conversion allows us to implement all synchronization edges with BBS. We address optimization criteria in performing such a conversion, and we will show that the extra synchronization accesses required for such a conversion are always (at least) compensated by the number of synchronization accesses that are saved by the more expensive UBS synchronizations that are converted to BBS synchronizations. Chapters 10 and 11 discuss a mechanism, called resynchronization, for inserting synchronization edges in such a way that the number of original synchronization edges that become redundant exceeds the number of new edges added.
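As a small worked illustration of the cost metric $4 n_{ff} + 2 n_{fb}$, the Python sketch below classifies each synchronization edge as feedback (its sink can reach its source, so it lies on a cycle) or feedforward, and then evaluates the formula. The function name and the example graph are hypothetical.

```python
def synchronization_cost(int_edges, sync_edges):
    """Evaluate 4 * n_ff + 2 * n_fb for edges given as (src, snk, delay) tuples."""
    adj = {}
    for u, v, _ in int_edges + sync_edges:
        adj.setdefault(u, []).append(v)

    def reaches(a, b):
        # Depth-first search: is there a directed path from a to b?
        seen, stack = {a}, [a]
        while stack:
            x = stack.pop()
            if x == b:
                return True
            for y in adj.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    # A synchronization edge is a feedback edge if it lies on some cycle.
    n_fb = sum(1 for u, v, _ in sync_edges if reaches(v, u))
    n_ff = len(sync_edges) - n_fb
    return 4 * n_ff + 2 * n_fb


# Hypothetical two-processor graph with one delay on each loop-back edge.
E_INT = [("A", "B", 0), ("B", "A", 1), ("C", "D", 0), ("D", "C", 1)]
cost_one_way = synchronization_cost(E_INT, [("A", "C", 0)])
cost_connected = synchronization_cost(E_INT, [("A", "C", 0), ("D", "A", 1)])
```

In this example, adding the second synchronization edge makes the graph strongly connected, so both edges become feedback edges that can use BBS, and the total cost does not increase even though the number of synchronization edges does.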

The first technique that we explore for reducing synchronization overhead is removal of redundant synchronization edges from the synchronization graph, i.e., finding a minimal set of edges $E_s$ that need explicit synchronization.

Definition 9.2: A synchronization edge is redundant in a synchronization graph $G$ if its removal yields a synchronization graph that preserves $G$. Equivalently, from Definition 9.1, a synchronization edge $e$ is redundant in the synchronization graph $G$ if there is a path $p \ne (e)$ in $G$ directed from $src(e)$ to $snk(e)$ such that $\mathit{Delay}(p) \le \mathit{delay}(e)$. The synchronization graph $G$ is reduced if $G$ contains no redundant synchronization edges.

Thus, the synchronization function associated with a redundant synchronization edge "comes for free" as a by-product of other synchronizations. Figure 9.4 shows an example of a redundant synchronization edge. Here, before executing actor $D$, the processor that executes $\{A, B, C, D\}$ does not need to synchronize with the processor that executes $\{E, F, G, H\}$ because, due to the synchronization edge $x_1$, the corresponding invocation of $F$ is guaranteed to complete before each invocation of $D$ is begun. Thus, $x_2$ is redundant in Figure 9.4 and can be removed from $E_s$ into the set $E_r$. It is easily verified that the path $p = ((F, G), (G, H), \ldots)$ is directed from $src(x_2)$ to $snk(x_2)$, and has a path delay (zero) that is equal to the delay on $x_2$.

In this section, we discuss an efficient algorithm to optimally remove redundant synchronization edges from a synchronization graph.

The following theorem establishes that the order in which we remove redundant synchronization edges is not important; therefore all the redundant synchronization edges can be removed together.

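Theorem 9.2: Suppose that $G_s = (V, E_{int} \cup E_s)$ is a synchronization graph, $e_1$ and $e_2$ are distinct redundant synchronization edges in $G_s$ (i.e., these are edges that could be individually moved to $E_r$), and $\tilde{G}_s = (V, E_{int} \cup (E_s - \{e_1\}))$. Then $e_2$ is redundant in $\tilde{G}_s$. Thus both $e_1$ and $e_2$ can be moved into $E_r$ together.

Proof: Since $e_2$ is redundant in $G_s$, there is a path $p \ne (e_2)$ in $G_s$ directed from $src(e_2)$ to $snk(e_2)$ such that

$$\mathit{Delay}(p) \le \mathit{delay}(e_2). \tag{9-8}$$

Similarly, since $e_1$ is redundant in $G_s$, there is a path $p' \ne (e_1)$, contained in both $G_s$ and $\tilde{G}_s$, that is directed from $src(e_1)$ to $snk(e_1)$ and that satisfies

$$\mathit{Delay}(p') \le \mathit{delay}(e_1). \tag{9-9}$$

Now, if $p$ does not contain $e_1$, then $p$ exists in $\tilde{G}_s$, and we are done. Otherwise, let $p' = (x_1, x_2, \ldots, x_n)$; observe that $p$ is of the form $p = (y_1, y_2, \ldots, y_{k-1}, e_1, y_k, y_{k+1}, \ldots, y_m)$; and define

$$p'' \equiv (y_1, y_2, \ldots, y_{k-1}, x_1, x_2, \ldots, x_n, y_k, y_{k+1}, \ldots, y_m).$$

Clearly, $p''$ is a path from $src(e_2)$ to $snk(e_2)$ in $\tilde{G}_s$. Also,

$$\mathit{Delay}(p'') = \mathit{Delay}(p') + (\mathit{Delay}(p) - \mathit{delay}(e_1)) \le \mathit{Delay}(p) \quad \text{(from (9-9))}$$
$$\le \mathit{delay}(e_2) \quad \text{(from (9-8))}.$$

QED.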

Theorem 9.2 tells us that we can avoid implementing synchronization for all redundant synchronization edges, since the "redundancies" are not interdependent. Thus, an optimal removal of redundant synchronizations can be obtained by applying a straightforward algorithm that successively tests the synchronization edges for redundancy in some arbitrary sequence, and since computing the weight of the shortest path in a weighted directed graph is a tractable problem, we can expect such a solution to be practical.

Figure 9.5 presents an efficient algorithm, based on the ideas presented in the previous subsection, for optimal removal of redundant synchronization edges. In this algorithm, we first compute the path delay of a minimum-delay path from $x$ to $y$ for each ordered pair of vertices $(x, y)$; here, we assign a path delay of $\infty$ whenever there is no path from $x$ to $y$. This computation is equivalent to solving an instance of the well known all-pairs shortest path problem (see Section 3.13). Then, we examine each synchronization edge $e$ in some arbitrary sequence and determine whether or not there is a path from $src(e)$ to $snk(e)$ that does not contain $e$, and that has a path delay that does not exceed $\mathit{delay}(e)$.

This check for redundancy is equivalent to the check that is performed by the if statement in RemoveRedundantSynchs because if $p$ is a path from $src(e)$ to $snk(e)$ that contains more than one edge and that contains $e$, then $p$ must contain a cycle $c$ such that $c$ does not contain $e$; and since all cycles must have positive path delay (from Lemma 7.1), the path delay of such a path $p$ must exceed $\mathit{delay}(e)$. Thus, if an output edge $e_0$ of $src(e)$ satisfies the inequality in the if statement of RemoveRedundantSynchs, and $p^*$ is a minimum-delay path from $snk(e_0)$ to $snk(e)$, then $p^*$ cannot contain $e$. This observation allows us to avoid having to recompute the shortest paths after removing a candidate redundant edge from $G_s$.

From the definition of a redundant synchronization edge, it is easily verified that the removal of a redundant synchronization edge does not alter any of the minimum-delay path values (path delays). That is, given a redundant synchronization edge $e_r$ in $G_s$ and two arbitrary vertices $x, y \in V$, if we let $\tilde{G}_s = (V, E_{int} \cup (E_s - \{e_r\}))$, then $\rho_{\tilde{G}_s}(x, y) = \rho_{G_s}(x, y)$. Thus, none of the minimum-delay path values computed in Step 1 need to be recalculated after removing a redundant synchronization edge in Step 3.

Observe that the complexity of the function RemoveRedundantSynchs is dominated by Step 1 and Step 3. Since all edge delays are non-negative, we can repeatedly apply Dijkstra's single-source shortest path algorithm (once for each vertex) to carry out Step 1 in $O(|V|^3)$ time; we discussed Dijkstra's algorithm in Section 3.13. A modification of Dijkstra's algorithm can be used to reduce the complexity of Step 1 to $O(|V|^2 \log_2(|V|) + |V||E|)$ [CLR92]. In Step 3, $|E|$ is an upper bound for the number of synchronization edges, and in the worst case, each vertex has an edge connecting it to every other member of $V$. Thus, the time complexity of Step 3 is $O(|V||E|)$, and if we use the modification to Dijkstra's algorithm mentioned above for Step 1, then the time complexity of RemoveRedundantSynchs is

$$O(|V|^2 \log_2(|V|) + |V||E| + |V||E|) = O(|V|^2 \log_2(|V|) + |V||E|).$$

Function RemoveRedundantSynchs
Input: a synchronization graph $G_s = (V, E_{int} \cup E_s)$.
Output: the synchronization graph $G_s^* = (V, E_{int} \cup (E_s - E_r))$.

Figure 9.5. An algorithm that optimally removes redundant synchronization edges.
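The procedure described above can be sketched in Python as follows. This is a reconstruction from the prose description (all-pairs minimum path delays, then one test per synchronization edge over the other output edges of its source), not a transcription of Figure 9.5; the function names, the `(src, snk, delay)` edge encoding, and the example graph are assumptions, and the sketch assumes that no two synchronization edges are exact duplicates of one another.

```python
import heapq
import math


def min_path_delays(adj, source):
    """Dijkstra over non-negative edge delays; returns {vertex: min path delay}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, d in adj[u]:
            if du + d < dist.get(v, math.inf):
                dist[v] = du + d
                heapq.heappush(heap, (du + d, v))
    return dist


def remove_redundant_synchs(vertices, int_edges, sync_edges):
    """Return the synchronization edges that remain after redundancy removal."""
    adj = {v: [] for v in vertices}
    for u, v, d in int_edges + sync_edges:
        adj[u].append((v, d))

    # Step 1: minimum path delay for every ordered pair (missing key = infinity).
    rho = {v: min_path_delays(adj, v) for v in vertices}

    # Step 2: start with an empty set of redundant edges.
    redundant = []

    # Step 3: test each synchronization edge against the other output edges
    # of its source vertex; the path delays from Step 1 are never recomputed.
    for e in sync_edges:
        src, snk, delay = e
        skipped_self = False
        for w, d in adj[src]:
            if not skipped_self and (w, d) == (snk, delay):
                skipped_self = True  # this adjacency entry is e itself
                continue
            if d + rho[w].get(snk, math.inf) <= delay:
                redundant.append(e)
                break

    # Step 4: keep only the synchronization edges that were not marked.
    return [e for e in sync_edges if e not in redundant]


# Hypothetical example: processor 1 runs A, B, C, D; processor 2 runs E, F, G, H.
V = list("ABCDEFGH")
E_INT = [("A", "B", 0), ("B", "C", 0), ("C", "D", 0), ("D", "A", 1),
         ("E", "F", 0), ("F", "G", 0), ("G", "H", 0), ("H", "E", 1)]
x1 = ("H", "B", 0)
x2 = ("F", "D", 0)
remaining = remove_redundant_synchs(V, E_INT, [x1, x2])
```

In this example `x2` is marked redundant because the output edge (F, G) of its source leads to D with zero total delay, while `x1` is kept because every alternative route from H to B passes through a loop-back edge that carries a delay.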

In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of an HSDFG under the (implicit) assumption that the execution of successive iterations of the HSDFG are not allowed to overlap. In Shaffer's technique, a construction identical to the synchronization graph is used, except that there is no feedback edge connecting the last actor executed on a processor to the first actor executed on the same processor, and edges that have delay are ignored since only intra-iteration dependencies are significant. Thus, Shaffer's synchronization graph is acyclic. RemoveRedundantSynchs can be viewed as an extension of Shaffer's algorithm to handle self-timed, iterative execution of an HSDFG; Shaffer's algorithm accounts for self-timed execution only within a graph iteration, and in general, it can be applied to iterative dataflow programs only if all processors are forced to synchronize between graph iterations.

In this subsection, we illustrate the benefits of removing redundant synchronizations through a practical example. Figure 9.6(a) shows an abstraction of a three-channel, multi-resolution quadrature mirror filter (QMF) bank, which has applications in signal compression [Vai93]. This representation is based on the general (not homogeneous) SDF model, and accordingly, each edge is annotated with the number of tokens produced and consumed by its source and sink actors. Actors $A$ and $F$ represent the subsystems that, respectively, supply and consume data to/from the filter bank system; $B$ and $C$ each represents a parallel combination of decimating high and low pass FIR analysis filters; $D$ and $E$ represent the corresponding pairs of interpolating synthesis filters. The amount of delay on the edge directed from $B$ to $E$ is equal to the sum of the filter orders of $C$ and $D$. For more details on the application represented by Figure 9.6(a), we refer the reader to [Vai93].

To construct a periodic parallel schedule, we must first determine the number of times that each actor must be invoked in the periodic schedule, as described in Section 3.6. Next, we must determine the precedence relationships between the actor invocations. In determining the exact precedence relationships, we must take into account the dependence of a given filter invocation on not only the invocation that produces the token that is "consumed" by the filter, but also on the invocations that produce the $n$ preceding tokens, where $n$ is the order of the filter. Such dependence can easily be evaluated with an additional dataflow parameter on each actor input that specifies the number of past tokens that are accessed [Pri91]. Using this information, together with the invocation counts, we obtain the precedence relationships specified by the graph of Figure 9.6(b), in which the $i$th invocation of actor $N$ is labeled $N_i$, and each edge $e$ specifies that invocation $snk(e)$ requires data produced by invocation $src(e)$ $\mathit{delay}(e)$ iteration periods after the iteration period in which the data is produced.

Figure 9.6. (a) A multi-resolution QMF filter bank used to illustrate the benefits of removing redundant synchronizations. (b) The precedence graph for (a). (c) A self-timed, two-processor, parallel schedule for (a). (d) The initial synchronization graph for (c).

A self-timed schedule for Figure 9.6(b) that can be obtained from Hu's list scheduling method [Hu61] (described in Section 5.3.2) is specified in Figure 9.6(c), and the synchronization graph that corresponds to the IPC graph of Figure 9.6(b) and Figure 9.6(c) is shown in Figure 9.6(d). All of the dashed edges in Figure 9.6(d) are synchronization edges. If we apply Shaffer's method, which considers only those synchronization edges that do not have delay, we can eliminate the need for explicit synchronization along only one of the 8 synchronization edges. In contrast, if we apply RemoveRedundantSynchs, we can detect the redundancy of this edge as well as four additional redundant synchronization edges (among them edges directed from $A_3$ and from $A_4$ to invocations of $B$). Thus, RemoveRedundantSynchs reduces the number of synchronizations from 8 down to 3, a reduction of 62%. Figure 9.7 shows the synchronization graph of Figure 9.6(d) after all redundant synchronization edges are removed. It is easily verified that the synchronization edges that remain in this graph are not redundant; explicit synchronizations need only be implemented for these edges.
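The redundancy test used in this comparison can be illustrated with a short Python sketch. It is not the book's RemoveRedundantSynchs listing; it is a minimal illustration that assumes a synchronization edge is redundant when some other path from its source to its sink has total delay no greater than the delay of the edge. The function names, the edge representation `(src, snk, delay, is_sync)`, and the small two-processor graph are all hypothetical and do not correspond to Figure 9.6.

```python
import heapq
import math

def min_path_delay(vertices, edges, src, snk):
    """Minimum total delay over all paths src -> snk (Dijkstra; delays are >= 0)."""
    adj = {v: [] for v in vertices}
    for u, v, d, _ in edges:
        adj[u].append((v, d))
    dist = {v: math.inf for v in vertices}
    dist[src] = 0
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist[snk]

def remove_redundant_synchs(vertices, edges):
    """Drop each synchronization edge that is implied by another path whose
    delay does not exceed the delay of the edge. Edges are examined one at a
    time against the graph that remains after earlier removals."""
    kept = list(range(len(edges)))
    for k, (u, v, delay, is_sync) in enumerate(edges):
        if not is_sync:
            continue
        others = [edges[j] for j in kept if j != k]
        if min_path_delay(vertices, others, u, v) <= delay:
            kept.remove(k)
    return [edges[j] for j in kept]

# Hypothetical example: processor 1 runs A1, A2; processor 2 runs B1, B2.
vertices = ["A1", "A2", "B1", "B2"]
edges = [
    ("A1", "A2", 0, False), ("A2", "A1", 1, False),  # processor 1 loop
    ("B1", "B2", 0, False), ("B2", "B1", 1, False),  # processor 2 loop
    ("A1", "B1", 0, True),
    ("A1", "B2", 0, True),   # implied by A1 -> B1 -> B2 (delay 0)
    ("A2", "B2", 0, True),
    ("B2", "A1", 1, True),
    ("B1", "A1", 2, True),   # implied by B1 -> B2 -> A1 (delay 1 <= 2)
]
remaining = [e[:3] for e in remove_redundant_synchs(vertices, edges) if e[3]]
```

In this hypothetical graph, a method that ignores edges with delay would find only the redundancy of the zero-delay edge from A1 to B2, whereas the delay-aware test also removes the edge from B1 to A1.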

1. It should be noted that some SDF-based design environments choose to forgo parallelization across multiple invocations of an actor in favor of simplified code generation and scheduling. For example, in the GRAPE system, this restriction has been justified on the grounds that it simplifies inter-processor data management, reduces code duplication, and allows the derivation of efficient scheduling algorithms that operate directly on general SDF graphs without requiring the use of the acyclic precedence graph (APG) [BELP94].

Figure 9.7. The synchronization graph of Figure 9.6(d) after all redundant synchronization edges are removed.

In Section 9.5.1, we defined two different synchronization protocols: bounded buffer synchronization (BBS), which has a cost of 2 synchronization accesses per iteration period, and can be used whenever the associated edge is contained in a strongly connected component of the synchronization graph; and unbounded buffer synchronization (UBS), which has a cost of 4 synchronization accesses per iteration period. We pay the additional overhead of UBS whenever the associated edge is a feedforward edge of the synchronization graph.

One alternative to implementing UBS for a feedforward edge $e$ is to add synchronization edges to the synchronization graph so that $e$ becomes encapsulated in a strongly connected component; such a transformation would allow $e$ to be implemented with BBS. However, extra synchronization accesses will be required to implement the new synchronization edges that are inserted. In this section, we show that by adding synchronization edges through a certain simple procedure, the synchronization graph can be transformed into a strongly connected graph in a way that the overhead of implementing the extra synchronization edges is always compensated by the savings attained by being able to avoid the use of UBS. That is, the conversion to a strongly connected synchronization graph ensures that the total number of synchronization accesses required (per iteration period) for the transformed graph is less than or equal to the number of synchronization accesses required for the original synchronization graph. Through a practical example, we show that this transformation can significantly reduce the number of required synchronization accesses. Also, we discuss a technique to compute the delay that should be added to each of the new edges added in the conversion to a strongly connected graph. This technique computes the delays in a way that the estimated throughput of the IPC graph is preserved with minimal increase in the shared memory storage cost required to implement the communication edges.

Figure 9.8 presents an efficient algorithm, called Convert-to-SC-graph, for transforming a synchronization graph that is not strongly connected into a strongly connected graph. This algorithm simply "chains together" the source SCCs, and similarly, chains together the sink SCCs. The construction is completed by connecting the first SCC of the "source chain" to the last SCC of the sink chain with an edge that we call the sink-source edge. From each source or sink SCC, the algorithm selects a vertex that has minimum execution time to be the chain "link" corresponding to that SCC. Minimum execution time vertices are chosen in an attempt to minimize the amount of delay that must be inserted on the new edges to preserve the estimated throughput of the original graph. In Section 9.9, we discuss the selection of delays for the edges introduced by Convert-to-SC-graph.

Function Convert-to-SC-graph
Input: A synchronization graph $G$ that is not strongly connected.
Output: A strongly connected graph obtained by adding edges between the SCCs of $G$.
Generate an ordering $C_1, C_2, \ldots, C_m$ of the source SCCs of $G$, and similarly, generate an ordering $D_1, D_2, \ldots, D_n$ of the sink SCCs of $G$.
Select a vertex $v_1 \in C_1$ that minimizes $t(*)$ over $C_1$.
For $i = 2, 3, \ldots, m$
    Select a vertex $v_i \in C_i$ that minimizes $t(*)$ over $C_i$.
    Instantiate the edge $d_0(v_{i-1}, v_i)$.
End For
Select a vertex $w_1 \in D_1$ that minimizes $t(*)$ over $D_1$.
For $i = 2, 3, \ldots, n$
    Select a vertex $w_i \in D_i$ that minimizes $t(*)$ over $D_i$.
    Instantiate the edge $d_0(w_{i-1}, w_i)$.
End For
Instantiate the edge $d_0(w_n, v_1)$.

Figure 9.8. Algorithm Convert-to-SC-graph.

It is easily verified that algorithm Convert-to-SC-graph always produces a strongly connected graph, and that a conversion to a strongly connected graph cannot be attained by adding fewer edges than the number of edges added by Convert-to-SC-graph. Figure 9.9 illustrates a possible solution obtained by algorithm Convert-to-SC-graph. Here, the black dashed edges are the synchronization edges contained in the original synchronization graph, and the grey dashed edges are the edges that are added by Convert-to-SC-graph. The dashed edge labeled $e_s$ is the sink-source edge.

Figure 9.9. An illustration of a possible solution obtained by algorithm Convert-to-SC-graph.

Assuming the synchronization graph is connected, the number of feedforward edges $n_f$ must satisfy $n_f \ge (n_c - 1)$, where $n_c$ is the number of SCCs. This follows from the fundamental graph theoretic fact that in a connected graph $(V^*, E^*)$, $|E^*|$ must be at least $(|V^*| - 1)$. Now, it is easily verified that the number of new edges introduced by Convert-to-SC-graph is equal to $(n_{src} + n_{snk} - 1)$, where $n_{src}$ is the number of source SCCs, and $n_{snk}$ is the number of sink SCCs. Thus, the number of synchronization accesses per iteration period, $S_+$, that is required to implement the edges introduced by Convert-to-SC-graph is $2(n_{src} + n_{snk} - 1)$, while the number of synchronization accesses, $S_-$, eliminated by Convert-to-SC-graph (by allowing the feedforward edges of the original synchronization graph to be implemented with BBS) equals $2 n_f$. It follows that the net change $(S_+ - S_-)$ in the number of synchronization accesses satisfies

$$(S_+ - S_-) = 2(n_{src} + n_{snk} - 1 - n_f) \le 2(n_c - 1 - n_f) \le 2(n_c - 1 - (n_c - 1)),$$

and thus, $(S_+ - S_-) \le 0$. We have established the following result.

Suppose that $G$ is a synchronization graph, and $\hat{G}$ is the graph that results from applying algorithm Convert-to-SC-graph to $G$. Then the synchronization cost of $\hat{G}$ is less than or equal to the synchronization cost of $G$.

For example, without the edges added by Convert-to-SC-graph (the dashed grey edges) in Figure 9.9, there are 6 feedforward edges, which require 24 synchronization accesses per iteration period to implement. The addition of the dashed edges requires 8 synchronization accesses to implement these new edges, but allows us to use BBS for the original feedforward edges, which leads to a savings of 12 synchronization accesses for the original feedforward edges. Thus, the net effect achieved by Convert-to-SC-graph in this example is a reduction of the total number of synchronization accesses by $(12 - 8) = 4$.

As another example, consider Figure 9.10, which shows the synchronization graph topology (after redundant synchronization edges are removed) that results from a four-processor schedule of a synthesizer for plucked-string musical instruments in seven voices based on the Karplus-Strong technique. This algorithm was discussed in Chapter 3, as an example application that was implemented on the ordered memory access architecture prototype. This graph contains 6 synchronization edges (the dashed edges), all of which are feedforward edges, so the synchronization cost is $4 \times 6 = 24$ synchronization accesses per iteration period. Since the graph has one source SCC and one sink SCC, only one edge is added by Convert-to-SC-graph, and adding this edge reduces the synchronization cost to $2 \times 7 = 14$, a savings of about 42%. Figure 9.11 shows the topology of a possible solution computed by Convert-to-SC-graph on this example. Here, the dashed edges represent the synchronization edges in the synchronization graph returned by Convert-to-SC-graph.
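The chaining procedure and the cost comparison above can be sketched in Python as follows. This is a minimal illustration rather than the book's implementation. It assumes that the input graph is connected and not strongly connected, finds SCCs by mutual reachability (chosen for brevity, not efficiency), breaks ties between equal execution times alphabetically, and treats every original edge of the hypothetical example as a feedforward synchronization edge. All function and variable names are hypothetical.

```python
def reachable(adj, start):
    """Set of vertices reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def convert_to_sc_graph(t, edges):
    """Return the new zero-delay edges that chain the source SCCs, chain the
    sink SCCs, and close the loop with the sink-source edge."""
    adj = {v: set() for v in t}
    for u, v in edges:
        adj[u].add(v)
    reach = {v: reachable(adj, v) for v in t}
    # The SCC of v is the set of vertices mutually reachable with v.
    comp = {v: frozenset(w for w in reach[v] if v in reach[w]) for v in t}
    sccs = sorted(set(comp.values()), key=min)
    sources = [c for c in sccs
               if not any(comp[u] != c and comp[v] == c for u, v in edges)]
    sinks = [c for c in sccs
             if not any(comp[u] == c and comp[v] != c for u, v in edges)]
    pick = lambda c: min(sorted(c), key=lambda v: t[v])  # min execution time
    vs = [pick(c) for c in sources]
    ws = [pick(c) for c in sinks]
    new = [(vs[i - 1], vs[i]) for i in range(1, len(vs))]   # source chain
    new += [(ws[i - 1], ws[i]) for i in range(1, len(ws))]  # sink chain
    new.append((ws[-1], vs[0]))                             # sink-source edge
    return new

def is_strongly_connected(vertices, edges):
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
    return all(reachable(adj, v) == set(vertices) for v in vertices)

# Hypothetical example: two sources (a, b), two sinks (d, e), unit times.
t = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e")]
new_edges = convert_to_sc_graph(t, edges)

UBS_COST, BBS_COST = 4, 2  # synchronization accesses per iteration period
cost_before = UBS_COST * len(edges)                     # all feedforward
cost_after = BBS_COST * (len(edges) + len(new_edges))   # all inside one SCC
```

Here $n_{src} + n_{snk} - 1 = 3$ edges are added, so $S_+ = 6$ and $S_- = 8$, and the cost falls from 16 to 14 accesses per iteration period, consistent with $(S_+ - S_-) \le 0$.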

One important issue that remains to be addressed in the conversion of a synchronization graph $G_s$ into a strongly connected graph $\hat{G}_s$ is the proper insertion of delays so that $\hat{G}_s$ is not deadlocked, and does not have lower estimated throughput than $G_s$. The potential for deadlock and reduced estimated throughput arise because the conversion to a strongly connected graph must necessarily introduce one or more new fundamental cycles. In general, a new cycle may be delay-free, or its cycle mean may exceed that of the critical cycle in $G_s$. Thus, we may have to insert delays on the edges added by Convert-to-SC-graph. The location (edge) and magnitude of the delays that we add are significant since they affect the self-timed buffer bounds of the communication edges, as shown subsequently. Since the self-timed buffer bounds determine the amount of memory that we allocate for the corresponding buffers, it is desirable to prevent deadlock and decrease in estimated throughput in a way that the sum of the self-timed buffer bounds over all communication edges is minimized. In this section, we outline a simple and efficient algorithm called DetermineDelays for addressing this problem. Algorithm DetermineDelays produces an optimal result if $G_s$ has only one source SCC or only one sink SCC; in other cases, the algorithm must be viewed as a heuristic. In practice, the assumptions under which we can expect an optimal result are frequently satisfied.

Figure 9.10. The synchronization graph, after redundant synchronization edges are removed, induced by a four-processor schedule of a music synthesizer based on the Karplus-Strong algorithm.

For simplicity in explaining the optimality result that has been established for Algorithm DetermineDelays, we first specify a restricted version of the algorithm that assumes only one sink SCC. After explaining the optimality of this restricted algorithm, we discuss how it can be modified to yield an optimal algorithm for the general single-source-SCC case, and finally, we discuss how it can be extended to provide a heuristic for arbitrary synchronization graphs.

Figure 9.12 outlines the restricted version of Algorithm DetermineDelays that applies when the synchronization graph $G_s$ has exactly one sink SCC. Here, BellmanFord is assumed to be an algorithm that takes a synchronization graph $Z$ as input, and repeatedly applies the Bellman-Ford algorithm discussed in Section 3.13 to return the cycle mean of the critical cycle in $Z$; if one or more cycles exist that have zero path delay, then BellmanFord returns $\infty$.
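To make the quantity returned by BellmanFord concrete, the following Python sketch computes the maximum cycle mean of a small graph directly from the definition: the sum of the execution times of the actors on a cycle divided by the total delay on the cycle, maximized over all simple cycles. This brute-force enumeration is an assumption made for illustration only; it is not the Bellman-Ford-based procedure of Section 3.13 and is practical only for small graphs. The function name, the edge format `(src, snk, delay)`, and the example graph are hypothetical.

```python
from fractions import Fraction
import math

def max_cycle_mean(t, edges):
    """Maximum over simple cycles of (sum of t on the cycle) / (sum of delay).
    Returns math.inf if a delay-free cycle exists, None if the graph is acyclic."""
    order = sorted(t)
    rank = {v: i for i, v in enumerate(order)}
    out = {v: [] for v in t}
    for u, v, d in edges:
        out[u].append((v, d))
    best = None

    def extend(start, v, time, delay, on_path):
        # Grow a simple path from start; each cycle is reported once, from
        # its lowest-ranked vertex.
        nonlocal best
        for w, d in out[v]:
            if w == start:
                total = delay + d
                mean = math.inf if total == 0 else Fraction(time, total)
                if best is None or mean > best:
                    best = mean
            elif w not in on_path and rank[w] > rank[start]:
                extend(start, w, time + t[w], delay + d, on_path | {w})

    for s in order:
        extend(s, s, t[s], 0, {s})
    return best

# Hypothetical example: cycle A->B->A has mean (2+3)/1 = 5,
# cycle A->B->C->A has mean (2+3+1)/2 = 3.
t = {"A": 2, "B": 3, "C": 1}
edges = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "A", 2)]
lam = max_cycle_mean(t, edges)

# Adding a zero-delay edge C->B creates the delay-free cycle B->C->B.
lam_deadlocked = max_cycle_mean(t, edges + [("C", "B", 0)])
```

The second call illustrates the convention described above: a cycle with zero path delay corresponds to an infinite maximum cycle mean.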

Figure 9.11. A possible solution obtained by applying Convert-to-SC-graph to the example of Figure 9.10.


Function DetermineDelays
Input: Synchronization graphs $G_s = (V, E)$ and $\hat{G}_s$, where $\hat{G}_s$ is the graph computed by Convert-to-SC-graph when applied to $G_s$. The ordering of source SCCs generated in Step 2 of Convert-to-SC-graph is denoted $C_1, C_2, \ldots, C_m$. For $i = 1, 2, \ldots, m-1$, $e_i$ denotes the edge instantiated by Convert-to-SC-graph from a vertex in $C_i$ to a vertex in $C_{i+1}$. The sink-source edge instantiated by Convert-to-SC-graph is denoted $e_0$.
Output: Non-negative integers $d_0, d_1, \ldots, d_{m-1}$ such that the estimated throughput when $delay(e_i) = d_i$, $0 \le i \le m-1$, equals the estimated throughput of $G_s$.

$X_0 = \hat{G}_s[e_0 \to \infty, \ldots, e_{m-1} \to \infty]$    /* set delays on each edge to be infinite */
$\lambda_{max} = BellmanFord(X_0)$    /* compute the max. cycle mean of $G_s$ */
$d_{ub} = \lceil (\sum_{x \in V} t(x)) / \lambda_{max} \rceil$    /* an upper bound on the delay required for any $e_i$ */
For $i = 0, 1, \ldots, m-1$
    $\delta_i = MinDelay(X_i, e_i, \lambda_{max}, d_{ub})$
    $X_{i+1} = X_i[e_i \to \delta_i]$    /* fix the delay on $e_i$ to be $\delta_i$ */
End For

Function MinDelay($X$, $e$, $\lambda$, $B$)
Input: A synchronization graph $X$, an edge $e$ in $X$, a positive real number $\lambda$, and a positive integer $B$.
Output: Assuming $X[e \to B]$ has estimated throughput no less than $\lambda^{-1}$, determine the minimum $d \in \{0, 1, \ldots, B\}$ such that the estimated throughput of $X[e \to d]$ is no less than $\lambda^{-1}$.
Perform a binary search in the range $[0, 1, \ldots, B]$ to find the minimum value of $r \in \{0, 1, \ldots, B\}$ such that $BellmanFord(X[e \to r])$ returns a value less than or equal to $\lambda$. Return this minimum value of $r$.

Figure 9.12. An algorithm for determining the delays on the edges introduced by algorithm Convert-to-SC-graph.
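The successive binary searches of Figure 9.12 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not part of the figure: the maximum cycle mean of the original graph is passed in as a parameter instead of being computed; an infinite delay is represented by `None` and the corresponding edge is simply ignored; and each probe of the binary search does not compute a cycle mean, but tests whether any cycle has a mean greater than $\lambda$ by looking for a positive-weight cycle under the edge weights $t(src(e)) - \lambda \cdot delay(e)$. All names and the example graph are hypothetical.

```python
from fractions import Fraction
from math import ceil

def cycle_mean_at_most(t, edges, lam):
    """True iff no cycle has (sum of t) / (sum of delay) greater than lam.
    Bellman-Ford style longest-path relaxation: a positive-weight cycle
    exists iff relaxation still changes something after |V| passes."""
    live = [(u, v, t[u] - lam * d) for (u, v, d) in edges if d is not None]
    dist = {v: Fraction(0) for v in t}
    for _ in range(len(t)):
        changed = False
        for u, v, w in live:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True
    return False

def min_delay(t, edges, k, lam, bound):
    """Smallest delay in {0, ..., bound} on edge k that keeps every cycle
    mean <= lam, assuming that the value bound itself is sufficient."""
    lo, hi = 0, bound
    while lo < hi:
        mid = (lo + hi) // 2
        u, v, _ = edges[k]
        trial = edges[:k] + [(u, v, mid)] + edges[k + 1:]
        if cycle_mean_at_most(t, trial, lam):
            hi = mid
        else:
            lo = mid + 1
    return lo

def determine_delays(t, edges, new_edge_ids, lam_max):
    """new_edge_ids lists the indices of e_0, e_1, ... in that order."""
    lam_max = Fraction(lam_max)
    d_ub = ceil(Fraction(sum(t.values())) / lam_max)
    # Start with infinite delay (None) on every new edge.
    work = [(u, v, None if k in new_edge_ids else d)
            for k, (u, v, d) in enumerate(edges)]
    result = []
    for k in new_edge_ids:
        delta = min_delay(t, work, k, lam_max, d_ub)
        u, v, _ = work[k]
        work[k] = (u, v, delta)  # fix the delay before the next search
        result.append(delta)
    return result

# Hypothetical example: source SCCs {s1}, {s2}; sink SCC {x, y}; unit times.
# The cycle x -> y -> x has two actors and one delay, so lam_max = 2.
t = {"s1": 1, "s2": 1, "x": 1, "y": 1}
edges = [("s1", "x", 0), ("s2", "x", 0), ("x", "y", 0), ("y", "x", 1),
         ("x", "s1", 0),    # e_0: sink-source edge
         ("s1", "s2", 0)]   # e_1: source-chain edge
delays = determine_delays(t, edges, [4, 5], 2)
```

In this example the cycle through $e_0$ alone contains two unit-time actors and therefore needs one delay, and the cycle through $e_0$ and $e_1$ contains three actors and needs two delays in total, so the sketch places one delay on each new edge.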


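In developing the optimality properties of Algorithm DetermineDelays, we will use the following definitions.

If $G = (V, E)$ is a DFG; $(e_0, e_1, \ldots, e_{n-1})$ is a sequence of distinct members of $E$; and $\Delta_0, \Delta_1, \ldots, \Delta_{n-1} \in \{0, 1, \ldots, \infty\}$, then $G[e_0 \to \Delta_0, \ldots, e_{n-1} \to \Delta_{n-1}]$ denotes the DFG

$$(V, ((E - \{e_0, e_1, \ldots, e_{n-1}\}) + \{e_0', e_1', \ldots, e_{n-1}'\})),$$

where each $e_i'$ is defined by $src(e_i') = src(e_i)$, $snk(e_i') = snk(e_i)$, and $delay(e_i') = \Delta_i$. Thus, $G[e_0 \to \Delta_0, \ldots, e_{n-1} \to \Delta_{n-1}]$ is simply the DFG that results from "changing the delay" on each $e_i$ to the corresponding new delay value $\Delta_i$.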

Suppose that $G$ is a synchronization graph that preserves $G_{ipc}$. An IPC sink-source path in $G$ is a minimum-delay path in $G$ directed from $snk(e)$ to $src(e)$, where $e$ is an IPC edge (in $G_{ipc}$).

The motivation for Algorithm DetermineDelays is based on the observations that the set of IPC sink-source paths introduced by Convert-to-SC-graph can be partitioned into $m$ non-empty subsets $P_0, P_1, \ldots, P_{m-1}$ such that each member of $P_i$ contains $e_0, e_1, \ldots, e_i$ and contains no other members of $\{e_0, e_1, \ldots, e_{m-1}\}$, and similarly, the set of fundamental cycles introduced by Convert-to-SC-graph can be partitioned into $W_0, W_1, \ldots, W_{m-1}$ such that each member of $W_i$ contains $e_0, e_1, \ldots, e_i$ and contains no other members of $\{e_0, e_1, \ldots, e_{m-1}\}$. (See Figure 9.12 for the specification of what the $e_i$ represent.)

By construction, a nonzero delay on any of the edges $e_0, e_1, \ldots, e_i$ contributes to reducing the cycle means of all members of $W_i$. Algorithm DetermineDelays starts (iteration $i = 0$ of the For loop) by determining the minimum delay $\delta_0$ on $e_0$ that is required to ensure that none of the cycles in $W_0$ has a cycle mean that exceeds the maximum cycle mean $\lambda_{max}$ of $G_s$. Then (in iteration $i = 1$) the algorithm determines the minimum delay $\delta_1$ on $e_1$ that is required to guarantee that no member of $W_1$ has a cycle mean that exceeds $\lambda_{max}$, assuming that $delay(e_0) = \delta_0$.

Now, if $delay(e_0) = \delta_0$, $delay(e_1) = \delta_1$, and $\delta_1 > 0$, then for any positive integer $k \le \delta_1$, $k$ units of delay can be "transferred" from $e_1$ to $e_0$ without violating the property that no member of $(W_0 \cup W_1)$ contains a cycle whose cycle mean exceeds $\lambda_{max}$. However, such a transformation increases the path delay of each member of $P_0$ while leaving the path delay of each member of $P_1$ unchanged, and therefore such a transformation cannot reduce the self-timed buffer bound of any IPC edge. Furthermore, apart from transferring delay from $e_1$ to $e_0$, the only other change that can be made to $delay(e_0)$ or $delay(e_1)$ without introducing a member of $(W_0 \cup W_1)$ whose cycle mean exceeds $\lambda_{max}$ is to increase one or both of these values by some positive integer amount(s). Clearly, such a change cannot reduce the self-timed buffer bound on any IPC edge. Thus, we see that the values $\delta_0$ and $\delta_1$ computed by DetermineDelays for $delay(e_0)$ and $delay(e_1)$, respectively, optimally ensure that no member of $(W_0 \cup W_1)$ has a cycle mean that exceeds $\lambda_{max}$.

After computing these values, DetermineDelays computes the minimum delay $\delta_2$ on $e_2$ that is required for all members of $W_2$ to have cycle means less than or equal to $\lambda_{max}$, assuming that $delay(e_0) = \delta_0$ and $delay(e_1) = \delta_1$. Given the "configuration" $(delay(e_0) = \delta_0, delay(e_1) = \delta_1, delay(e_2) = \delta_2)$, transferring delay from $e_2$ to $e_1$ increases the path delay of all members of $P_1$, while leaving the path delay of each member of $(P_0 \cup P_2)$ unchanged; and transferring delay from $e_2$ to $e_0$ increases the path delay across $(P_0 \cup P_1)$, while leaving the path delay across $P_2$ unchanged. Thus, by an argument similar to that given to establish the optimality of $(\delta_0, \delta_1)$ with respect to $(W_0 \cup W_1)$, we can deduce that (1) the values $\delta_i$ computed by DetermineDelays for the delays on $e_0, e_1, e_2$ guarantee that no member of $(W_0 \cup W_1 \cup W_2)$ has a cycle mean that exceeds $\lambda_{max}$; and (2) for any other assignment of delays $(\delta_0', \delta_1', \delta_2')$ to $(e_0, e_1, e_2)$ that preserves the estimated throughput across $(W_0 \cup W_1 \cup W_2)$, and for any IPC edge $e$ such that an IPC sink-source path of $e$ is contained in $(P_0 \cup P_1 \cup P_2)$, the self-timed buffer bound of $e$ under the assignment $(\delta_0', \delta_1', \delta_2')$ is greater than or equal to the self-timed buffer bound of $e$ under the assignment $(\delta_0, \delta_1, \delta_2)$ computed by iterations $i = 0, 1, 2$ of DetermineDelays.
After extending this analysis successively to each of the remaining iterations $i = 3, 4, \ldots, m-1$ of the For loop in DetermineDelays, we arrive at the following result.

Theorem 9.4: Suppose that $G_s$ is a synchronization graph that has exactly one sink SCC; let $\hat{G}_s$ and $(e_0, e_1, \ldots, e_{m-1})$ be as in Figure 9.12; let $(d_0, d_1, \ldots, d_{m-1})$ be the result of applying DetermineDelays to $G_s$ and $\hat{G}_s$; and let $(d_0', d_1', \ldots, d_{m-1}')$ be any sequence of $m$ non-negative integers such that $\hat{G}_s[e_0 \to d_0', \ldots, e_{m-1} \to d_{m-1}']$ has the same estimated throughput as $G_s$. Then

$$\Phi(\hat{G}_s[e_0 \to d_0', \ldots, e_{m-1} \to d_{m-1}']) \ge \Phi(\hat{G}_s[e_0 \to d_0, \ldots, e_{m-1} \to d_{m-1}]),$$

where $\Phi(X)$ denotes the sum of the self-timed buffer bounds over all IPC edges in $G_{ipc}$ induced by the synchronization graph $X$.

Figure 9.13 illustrates a solution obtained from DetermineDelays. Here we assume that $t(v) = 1$ for each vertex $v$, and we assume that the set of IPC edges is $\{e_a, e_b\}$ (for clarity, we are assuming in this example that the IPC edges are present in the given synchronization graph). The grey dashed edges are the edges added by Convert-to-SC-graph. We see that $\lambda_{max}$ is determined by the cycle in the sink SCC of the original graph, and its value can be obtained by inspection of this cycle. Also, we see that the set $W_0$, the set of fundamental cycles that contain $e_0$ and do not contain $e_1$, consists of a single cycle that contains three edges. By inspection of this cycle, we see that the minimum delay on $e_0$ required to guarantee that its cycle mean does not exceed $\lambda_{max}$ is 1. Thus, the $i = 0$ iteration of the For loop in DetermineDelays computes $\delta_0 = 1$. Next, we see that $W_1$ consists of a single cycle that contains five edges, and we see that two delays must be present on this cycle for its cycle mean to be less than or equal to $\lambda_{max}$. Since one delay has been placed on $e_0$, DetermineDelays computes $\delta_1 = 1$ in the $i = 1$ iteration of the For loop. Thus, the solution determined by DetermineDelays for Figure 9.13 is $(\delta_0, \delta_1) = (1, 1)$; the resulting self-timed buffer bounds of $e_a$ and $e_b$ are, respectively, 1 and 2, and their sum is $2 + 1 = 3$.

Figure 9.13. An example used to illustrate a solution obtained by algorithm DetermineDelays.

Now $(2, 0)$ is an alternative assignment of delays on $(e_0, e_1)$ that preserves the estimated throughput of the original graph. However, in this assignment, the self-timed buffer bounds of $e_a$ and $e_b$ are identically equal to 2, so their sum is 4, one greater than the corresponding sum from the delay assignment $(1, 1)$ computed by DetermineDelays. Thus, if $\hat{G}_s$ denotes the graph returned by Convert-to-SC-graph for the example of Figure 9.13, we have that

$$\Phi(\hat{G}_s[e_0 \to 2, e_1 \to 0]) > \Phi(\hat{G}_s[e_0 \to 1, e_1 \to 1]),$$

where $\Phi(X)$ denotes the sum of the self-timed buffer bounds over all IPC edges induced by the synchronization graph $X$.

Algorithm DetermineDelays can easily be modified to optimally handle general graphs that have only one source SCC. Here, the algorithm specification remains essentially the same, with the exception that for $i = 1, 2, \ldots, (m-1)$, $e_i$ denotes the edge directed from a vertex in $D_{m-i}$ to a vertex in $D_{m-i+1}$, where $D_1, D_2, \ldots, D_m$ is the ordering of sink SCCs generated in the corresponding invocation of Convert-to-SC-graph ($e_0$ still denotes the sink-source edge instantiated by Convert-to-SC-graph). By adapting the reasoning behind Theorem 9.4, it is easily verified that when it is applicable, this modified algorithm always yields an optimal solution.

As far as we are aware, there is no straightforward extension of DetermineDelays to general graphs (multiple source SCCs and multiple sink SCCs) that is guaranteed to yield optimal solutions. The fundamental problem for the general case is the inability to derive the partitions $W_0, W_1, \ldots, W_{m-1}$ ($P_0, P_1, \ldots, P_{m-1}$) of the fundamental cycles (IPC sink-source paths) introduced by Convert-to-SC-graph such that each $W_i$ ($P_i$) contains $e_0, e_1, \ldots, e_i$, and contains no other members of $E_s$, where $E_s$ is the set of edges added by Convert-to-SC-graph. The existence of such partitions was crucial to our development of Theorem 9.4 because it implied that once the minimum values for $e_0, e_1, \ldots, e_i$ are successively computed, "transferring" delay from some $e_i$ to some $e_j$, $j < i$, is never beneficial. Figure 9.14 shows an example of a synchronization graph that has multiple source SCCs and multiple sink SCCs, and that does not induce a partition of the desired form for the fundamental cycles.

However, DetermineDelays can be extended to yield heuristics for the general case in which the original synchronization graph $G_s$ contains more than one source SCC and more than one sink SCC. For example, if $(a_1, a_2, \ldots, a_k)$ denote edges that were instantiated by Convert-to-SC-graph "between" the source SCCs, with each $a_i$ representing the $i$th edge created, and similarly, $(b_1, b_2, \ldots, b_l)$ denote the sequence of edges instantiated between the sink SCCs, then algorithm DetermineDelays can be applied with the modification that $m = k + l + 1$, and that $(e_0, e_1, \ldots, e_{m-1})$ is a sequence that begins with $e_s$ and contains each $a_i$ and each $b_i$, where $e_s$ is the sink-source edge from Convert-to-SC-graph.

The derivation of alternative heuristics for general synchronization graphs appears to be an interesting direction for further research. It should be noted, though, that practical synchronization graphs frequently contain either a single source SCC or a single sink SCC, or both (such as the example of Figure 9.10), so that algorithm DetermineDelays, together with its counterpart for graphs that have a single source SCC, form a widely-applicable solution for optimally determining the delays on the edges created by Convert-to-SC-graph.

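Figure 9.14. A synchronization graph, after processing by Convert-to-SC-graph, such that there is no $m$-way partition $W_0, W_1, \ldots, W_{m-1}$ of the fundamental cycles introduced by Convert-to-SC-graph that satisfies both (1) each $W_i$ contains $e_0, e_1, \ldots, e_i$; and (2) each $W_i$ does not contain any member of $\{e_{i+1}, e_{i+2}, \ldots, e_{m-1}\}$. Here, the grey dashed edges are the edges instantiated by Convert-to-SC-graph, and it is easily verified that the fundamental cycles introduced by Convert-to-SC-graph cannot be decomposed into a partition of the above form, even if we are allowed to reorder the $e_i$'s.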


If there exist constants $T$ and $D$ such that $t(v) \le T$ for all vertices $v$ and $delay(e) \le D$ for all edges $e$, then the complexity of BellmanFord is $O(|V||E|\log_2(|V|))$ (see Section 3.13.2), and the upper bound over which MinDelay performs its binary search is bounded by a polynomial in $|V|$, so that each binary search requires $O(\log_2(|V|))$ invocations of BellmanFord. Thus, each invocation of MinDelay runs in $O(|V||E|(\log_2(|V|))^2)$ time.

It follows that DetermineDelays and any of the variations of DetermineDelays defined above is $O(m|V||E|(\log_2(|V|))^2)$, where $m$ is the number of edges instantiated by Convert-to-SC-graph. Since $m = n_{src} + n_{snk} - 1$, where $n_{src}$ is the number of source SCCs, and $n_{snk}$ is the number of sink SCCs, it is obvious that $m \le |V|$. With this observation, and the observation that $|E| \le |V|^2$, we have that DetermineDelays and its variations are $O(|V|^4(\log_2(|V|))^2)$. Furthermore, it is easily verified that the time complexity of DetermineDelays dominates that of Convert-to-SC-graph, so the time complexity of applying Convert-to-SC-graph and DetermineDelays in succession is $O(|V|^4(\log_2(|V|))^2)$.

Although the issue of deadlock does not explicitly arise in algorithm DetermineDelays, the algorithm does guarantee that the output graph is not deadlocked, assuming that the input graph is not deadlocked. This is because (from Lemma 7.1) deadlock is equivalent to the existence of a cycle that has zero path delay, and is thus equivalent to an infinite maximum cycle mean. Since DetermineDelays does not increase the maximum cycle mean, it follows that the algorithm cannot convert a graph that is not deadlocked into a deadlocked graph.

Converting a mixed grain HSDFG that contains feedforward edges into a strongly connected graph has been studied by Zivojnovic in the context of retiming when the assignment of actors to processors is fixed beforehand. In this case, the objective is to retime the input graph so that the number of communication edges that have nonzero delay is maximized, and the conversion is performed to constrain the set of possible retimings in such a way that an integer linear programming formulation can be developed. The technique generates two dummy vertices that are connected by an edge; the sink vertices of the original graph are connected to one of the dummy vertices, while the other dummy vertex is connected to each source. It is easily verified that in a self-timed execution, this scheme requires at least four more synchronization accesses per graph iteration than the method that we have proposed. We can obtain further relative savings if we succeed in detecting one or more beneficial resynchronization opportunities. The effect of Zivojnovic's retiming algorithm on synchronization overhead is unpredictable since, on one hand, a communication edge becomes "easier to make redundant" when its delay increases, while on the other hand, the edge becomes less useful in making other communication edges redundant since the path delay of all paths that contain the edge increases.

This chapter has developed two software strategies for minimizing synchronization overhead when implementing self-timed, iterative dataflow programs. These techniques rely on a graph-theoretic analysis framework based on two data structures called the interprocessor communication graph and the synchronization graph. This analysis framework allows us to determine the effects on throughput and buffer sizes of modifying the points in the target program at which synchronization functions are carried out, and we have shown how this framework can be used to extend an existing technique (removal of redundant synchronization edges) for non-iterative programs to the iterative case, and to develop a new method for reducing synchronization overhead: the conversion of the synchronization graph into a strongly connected graph so that a more efficient synchronization protocol can be used.

The main premise of the techniques discussed in this chapter is that estimates are available for the execution times of actors such that the actual execution time of an actor exhibits large variation from its corresponding estimate only with very low frequency. Accordingly, our techniques have been devised to guarantee that if the actual execution time of each actor invocation is always equal to the corresponding execution time estimate, then the throughput of an implementation that incorporates our synchronization minimization techniques is never less than the throughput of a corresponding unoptimized implementation; that is, we never accept an opportunity to reduce synchronization overhead if it constrains execution in such a way that throughput is decreased. Thus, the techniques discussed in this chapter are particularly relevant to embedded applications, where the price of synchronization is high, and accurate execution time estimates are often available, but guarantees on these execution times do not exist due to infrequent events such as cache misses, interrupts, and error handling.

In the next two chapters, we discuss a third software-based technique, called resynchronization, for reducing synchronization overhead in application-specific multiprocessors.


This chapter discusses a technique, called resynchronization, for reducing synchronization overhead in application-specific multiprocessor implementations. The technique applies to arbitrary collections of dedicated, programmable or configurable processors, such as combinations of programmable DSPs, ASICs, and FPGA subsystems. Resynchronization is based on the concept of redundant synchronization operations, which was defined in the previous chapter. The objective of resynchronization is to introduce new synchronizations in such a way that the number of original synchronizations that consequently become redundant is significantly more than the number of new synchronizations.

Intuitively, resynchronization is the process of adding one or more new synchronization edges and removing the redundant edges that result. Figure 10.1(a) illustrates how this concept can be used to reduce the total number of synchronizations in a multiprocessor implementation. Here, the dashed edges represent synchronization edges. Observe that if the new synchronization edge $(C, H)$ is inserted, then two of the original synchronization edges, $(B, G)$ and one of the edges directed from $E$, become redundant. Since redundant synchronization edges can be removed from the synchronization graph to yield an equivalent synchronization graph, we see that the net effect of adding the synchronization edge $(C, H)$ is to reduce the number of synchronization edges that need to be implemented by 1. Figure 10.1(b) shows the synchronization graph that results from inserting the resynchronization edge $(C, H)$ into Figure 10.1(a), and then removing the redundant synchronization edges that result.

Definition 10.1 gives a formal definition of resynchronization. This definition considers resynchronization only "across" feedforward edges. Resynchronization that includes inserting edges into SCCs is also possible; however, in general, such resynchronization may decrease the estimated throughput (see Theorem 10.1 at the end of Section 10.2). Thus, for our objectives, it must be verified that each new synchronization edge introduced in an SCC does not decrease the estimated throughput. To avoid this complication, which requires a check of significant complexity ($O(|V||E|\log_2(|V|))$, where $(V, E)$ is the modified synchronization graph, using the Bellman-Ford algorithm described in Section 3.13.2) for each candidate resynchronization edge, we focus only on "feedforward" resynchronization in this chapter. Future research will address combining the insights developed here for feedforward resynchronization with efficient techniques to estimate the impact that a given resynchronization edge has on the estimated throughput.

Opportunities for feedforward resynchronization are particularly abundant in the dedicated hardware implementation of dataflow graphs. If each actor is mapped to a separate piece of hardware, as in the VLSI dataflow arrays of Kung, then for any application graph that is acyclic, every communication channel between two units will have an associated feedforward synchronization edge. Due to increasing circuit integration levels, such isomorphic mapping of dataflow subsystems into hardware is becoming attractive for a growing family of applications. Feedforward synchronization edges often arise naturally in multiprocessor software implementations as well. A software example is reviewed in detail in Section 10.5.

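Definition 10.1: Suppose that $G = (V, E)$ is a synchronization graph, and $F \equiv \{e_1, e_2, \ldots, e_n\}$ is the set of all feedforward edges in $G$. A resynchronization of $G$ is a finite set $R \equiv \{e_1', e_2', \ldots, e_m'\}$ of edges that are not necessarily contained in $E$, but whose source and sink vertices are in $V$, such that a) $e_1', e_2', \ldots, e_m'$ are feedforward edges in the HSDFG $G^* \equiv (V, ((E - F) + R))$; and b) $G^*$ preserves $G$, that is, $\rho_{G^*}(src(e_i), snk(e_i)) \le delay(e_i)$ for all $i \in \{1, 2, \ldots, n\}$. Each member of $R$ that is not in $E$ is called a resynchronization edge of the resynchronization $R$, $G^*$ is called the resynchronized graph associated with $R$, and this graph is denoted by $\Psi(R, G)$.

If we let $G$ denote the graph in Figure 10.1(a), then the set $F$ of feedforward edges consists of $(B, G)$, two edges directed from $E$, and one edge directed from $H$; and the set $R$ that consists of the resynchronization edge $(C, H)$, one of the feedforward edges directed from $E$, and the feedforward edge directed from $H$ is a resynchronization of $G$. Figure 10.1(b) shows the HSDFG $G^* = (V, ((E - F) + R))$, and from Figure 10.1(b), it is easily verified that $F$, $R$, and $G^*$ satisfy conditions (a) and (b) of Definition 10.1.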
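Conditions (a) and (b) of Definition 10.1 can be checked mechanically. The following Python sketch is a minimal illustration under stated assumptions: an edge is taken to be feedforward when there is no path from its sink back to its source, the minimum path delay is computed by a simple Bellman-Ford style relaxation, and the graph is a hypothetical acyclic example that does not correspond to Figure 10.1. The function names and the edge format `(src, snk, delay)` are likewise hypothetical.

```python
import math

def min_path_delay(vertices, edges, src, snk):
    """Minimum total delay over all paths src -> snk, or math.inf if none."""
    dist = {v: math.inf for v in vertices}
    dist[src] = 0
    for _ in range(len(vertices) - 1):
        for u, v, d in edges:
            if dist[u] + d < dist[v]:
                dist[v] = dist[u] + d
    return dist[snk]

def is_resynchronization(vertices, E, F, R):
    """Check conditions (a) and (b) for a candidate set R, where F is the
    set of feedforward edges of the graph (vertices, E)."""
    g_star = [e for e in E if e not in F] + list(R)
    # (a) every member of R is feedforward in G*: no path from sink to source.
    feedforward = all(min_path_delay(vertices, g_star, v, u) == math.inf
                      for (u, v, _) in R)
    # (b) G* preserves G: each original feedforward constraint is still
    # enforced by some path whose delay does not exceed the edge delay.
    preserves = all(min_path_delay(vertices, g_star, u, v) <= d
                    for (u, v, d) in F)
    return feedforward and preserves

# Hypothetical example: u1 -> u2 and v1 -> v2 are internal (same-processor)
# edges; u2 and v2 each synchronize with both w and x.
vertices = ["u1", "u2", "v1", "v2", "w", "x"]
internal = [("u1", "u2", 0), ("v1", "v2", 0)]
F = [("u2", "w", 0), ("u2", "x", 0), ("v2", "w", 0), ("v2", "x", 0)]
E = internal + F

# One new edge (u2, v1) makes both edges out of u2 redundant.
R = [("u2", "v1", 0), ("v2", "w", 0), ("v2", "x", 0)]
ok = is_resynchronization(vertices, E, F, R)

# Dropping the new edge loses the constraints that originated at u2.
R_bad = [("v2", "w", 0), ("v2", "x", 0)]
bad = is_resynchronization(vertices, E, F, R_bad)
```

In this hypothetical graph, the resynchronization replaces four synchronization edges with three, which mirrors the net reduction of one edge described for Figure 10.1.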

Typically, resynchronization is meaningful only in the context of synchronization graphs that are not deadlocked; that is, synchronization graphs that do not contain any delay-free cycles, or equivalently, that have nonzero estimated throughput. In the remainder of this chapter and throughout Chapter 11, we are concerned only with deadlock-free synchronization graphs. Thus, unless otherwise stated, we assume the absence of delay-free synchronization graph cycles. In practice, this assumption is not a problem, since delay-free cycles can be detected efficiently.

This section reviews a number of useful properties of synchronization redundancy and resynchronization that we will apply throughout the developments of this chapter and Chapter 11.

Fact 10.1: Suppose that $G = (V, E)$ is a synchronization graph and $s$ is a redundant synchronization edge in $G$. Then there exists a simple path $p$ in $G$ directed from $\mathrm{src}(s)$ to $\mathrm{snk}(s)$ such that $p$ does not contain $s$, and $\mathrm{Delay}(p) \le \mathrm{delay}(s)$.

Proof: Let $G' \equiv (V, (E - \{s\}))$ denote the synchronization graph that results when we remove $s$ from $G$. Then from Definition 9.2, there exists a path $p'$ in $G'$ directed from $\mathrm{src}(s)$ to $\mathrm{snk}(s)$ such that

$\mathrm{Delay}(p') \le \mathrm{delay}(s)$. (10-1)


Now observe that every edge in $G'$ is contained in $G$, and thus, $G$ contains the path $p'$. If $p'$ is a simple path, then we are done. Otherwise, $p'$ can be expressed as a concatenation

$(q_1, c_1, q_2, c_2, \ldots, q_{n-1}, c_{n-1}, q_n)$, (10-2)

where each $q_i$ is a simple path, at least one $q_i$ is non-empty, and each $c_j$ is a (not necessarily simple) cycle. Since valid synchronization graphs cannot contain delay-free cycles, we must have $\mathrm{Delay}(c_k) \ge 1$ for $1 \le k \le n - 1$. Thus, since each $c_i$ originates and terminates at the same actor, the path $p'' \equiv (q_1, q_2, \ldots, q_n)$ is a simple path directed from $\mathrm{src}(s)$ to $\mathrm{snk}(s)$ such that $\mathrm{Delay}(p'') < \mathrm{Delay}(p')$. Combining this last inequality with (10-1) yields

$\mathrm{Delay}(p'') < \mathrm{delay}(s)$. (10-3)

Furthermore, since $p'$ is contained in $G$, it follows from the construction of $p''$ that $p''$ must also be contained in $G$. Finally, since $p'$ is contained in $G'$, $G'$ does not contain $s$, and the set of edges contained in $p''$ is a subset of the set of edges contained in $p'$, we have that $p''$ does not contain $s$. QED.

Lemma 10.1: Suppose that $G$ and $G'$ are synchronization graphs such that $G'$ preserves $G$, and $p$ is a path in $G$ from actor $x$ to actor $y$. Then there is a path $p'$ in $G'$ from $x$ to $y$ such that $\mathrm{Delay}(p') \le \mathrm{Delay}(p)$, and $tr(p) \subseteq tr(p')$, where $tr(\varphi)$ denotes the set of actors traversed by the path $\varphi$.

Thus, if a synchronization graph $G'$ preserves another synchronization graph $G$ and $p$ is a path in $G$ from actor $x$ to actor $y$, then there is at least one path $p'$ in $G'$ such that 1) the path $p'$ is directed from $x$ to $y$; 2) the cumulative delay on $p'$ does not exceed the cumulative delay on $p$; and 3) every actor that is traversed by $p$ is also traversed by $p'$ (although $p'$ may traverse one or more actors that are not traversed by $p$).

For example, in Figure 10.1(a), if we let $x$, $y$, and $p$ be [...], then the path $p'$ given by [...] (10-4) in Figure 10.1(b) confirms Lemma 10.1 for this example. Here $tr(p) = \{B, G, H, [\ldots]\}$ and $tr(p') = \{[\ldots], G, H, [\ldots]\}$.

Proof of Lemma 10.1: Let $p = (e_1, e_2, \ldots, e_n)$. By definition of the preserves relation, each $e_i$ that is not a synchronization edge in $G$ is contained in $G'$. For each $e_i$ that is a synchronization edge in $G$, there must be a path $p_i$ in $G'$ from $\mathrm{src}(e_i)$ to $\mathrm{snk}(e_i)$ such that $\mathrm{Delay}(p_i) \le \mathrm{delay}(e_i)$. Let $e_{i_1}, e_{i_2}, \ldots, e_{i_m}$, $i_1 < i_2 < \ldots < i_m$, denote the set of $e_i$ that are synchronization edges in $G$, and define the path $p'$ to be the concatenation obtained from $p$ by replacing each synchronization edge $e_{i_k}$ with the corresponding path $p_{i_k}$. (10-5)

Clearly, $p'$ is a path in $G'$ from $x$ to $y$, and since $\mathrm{Delay}(p_i) \le \mathrm{delay}(e_i)$ holds whenever $e_i$ is a synchronization edge, it follows that $\mathrm{Delay}(p') \le \mathrm{Delay}(p)$. Furthermore, from the construction of $p'$, it is apparent that every actor that is traversed by $p$ is also traversed by $p'$. QED.

The following lemma states that if a resynchronization contains a resynchronization edge $e$ such that there is a delay-free path in the original synchronization graph from the source of $e$ to the sink of $e$, then $e$ must be redundant in the resynchronized graph.

Lemma 10.2: Suppose that $G$ is a synchronization graph; $R$ is a resynchronization of $G$; and $(x, y)$ is a resynchronization edge such that $\rho_G(x, y) = 0$. Then $(x, y)$ is redundant in $\Psi(R, G)$. Thus, a minimal resynchronization (fewest number of elements) has the property that $\rho_G(x', y') > 0$ for each resynchronization edge $(x', y')$.

Proof: Let $p$ denote a minimum-delay path from $x$ to $y$ in $G$. Since $(x, y)$ is a resynchronization edge, $(x, y)$ is not contained in $G$, and thus, $p$ traverses at least three actors. From Lemma 10.1, it follows that there is a path $p'$ in $\Psi(R, G)$ from $x$ to $y$ such that

$\mathrm{Delay}(p') = 0$, (10-6)

and $p'$ traverses at least three actors. Thus,

$\mathrm{Delay}(p') \le \mathrm{delay}((x, y))$ (10-7)

and $p' \ne ((x, y))$. Furthermore, $p'$ cannot properly contain $(x, y)$. To see this, observe that if $p'$ contains $(x, y)$ but $p' \ne ((x, y))$, then from (10-6), it follows that there exists a delay-free cycle in $\Psi(R, G)$ (that traverses $x$), and hence that our assumption of a deadlock-free schedule (Section 10.1) is violated. Thus, we conclude that $(x, y)$ is redundant in $\Psi(R, G)$. QED.

As a consequence of Lemma 10.1, the estimated throughput of a given synchronization graph is always less than or equal to that of every synchronization graph that it preserves. That is, if $G$ is a synchronization graph, and $G'$ is a synchronization graph that preserves $G$, then $\lambda_{max}(G') \ge \lambda_{max}(G)$.

Proof: Suppose that $C$ is a critical cycle in $G$. Lemma 10.1 guarantees that there is a cycle $C'$ in $G'$ such that a) $\mathrm{Delay}(C') \le \mathrm{Delay}(C)$, and b) the set of actors that are traversed by $C$ is a subset of the set of actors traversed by $C'$. Now clearly, b) implies that

$\sum_{v \text{ is traversed by } C'} t(v) \ge \sum_{v \text{ is traversed by } C} t(v)$, (10-8)

and this observation together with a) implies that the cycle mean of $C'$ is greater than or equal to the cycle mean of $C$. Since $C$ is a critical cycle in $G$, it follows that $\lambda_{max}(G') \ge \lambda_{max}(G)$. QED.

Thus, any saving in synchronization cost obtained by rearranging synchronization edges may come at the expense of a decrease in estimated throughput. As implied by Definition 10.1, we avoid this complication by restricting our attention to feedforward synchronization edges. Clearly, resynchronization that rearranges only feedforward synchronization edges cannot decrease the estimated throughput, since no new cycles are introduced and no existing cycles are altered. Thus, with the form of resynchronization that is addressed in this chapter, any decrease in synchronization cost that we obtain is not diminished by a degradation of the estimated throughput.

We refer to the problem of finding a resynchronization with the fewest number of elements as the resynchronization problem. In Section 10.4, it is formally shown that the resynchronization problem is NP-hard, which means that it is unlikely that efficient algorithms can be devised to solve the problem exactly, and thus, for practical use, we should search for good heuristic solutions. In this section, we explain the intuition behind this result. To establish the NP-hardness of the resynchronization problem, we examine the special case that occurs when there are exactly two SCCs, which we call the pairwise resynchronization problem, and we derive a polynomial-time reduction from the classic set-covering problem, a well-known NP-hard problem, to the pairwise resynchronization problem. In the set-covering problem, one is given a finite set $X$ and a family $T$ of subsets of $X$, and asked to find a minimal (fewest number of members) subfamily $T_s \subseteq T$ such that $\bigcup_{t \in T_s} t = X$.

A subfamily of $T$ is said to cover $X$ if each member of $X$ is contained in some member of the subfamily. Thus, the set-covering problem is the problem of finding a minimal cover.
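To make the set-covering problem concrete, here is a small illustrative Python sketch. It is an exhaustive search written only for exposition (its running time is exponential in the size of the family, which is consistent with the problem being NP-hard); the function name `minimal_cover` and the input encoding are assumptions of this sketch, not part of the original text.

```python
from itertools import combinations

def minimal_cover(X, T):
    """Return a smallest subfamily of T whose union contains X, or None.

    X: iterable of elements; T: list of sets (an assumed encoding).
    Subfamilies are tried in order of increasing size, so the first
    cover found has the fewest members.
    """
    X = set(X)
    for k in range(len(T) + 1):
        for subfamily in combinations(T, k):
            if set().union(*subfamily) >= X:
                return list(subfamily)
    return None  # T does not cover X
```

For instance, with `X = {1, 2, 3, 4}` and `T = [{1, 2}, {2, 3}, {3, 4}, {1, 4}]`, no single member covers `X`, and the search returns a two-member cover.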

Definition 10.2: Given a synchronization graph $G$, let $(x_1, x_2)$ be a synchronization edge in $G$, and let $(y_1, y_2)$ be an ordered pair of actors in $G$. We say that $(y_1, y_2)$ subsumes $(x_1, x_2)$ in $G$ if

$\rho_G(x_1, y_1) + \rho_G(y_2, x_2) \le \mathrm{delay}((x_1, x_2))$.

Thus, every synchronization edge subsumes itself, and intuitively, if $(x_1, x_2)$ is a synchronization edge, then $(y_1, y_2)$ subsumes $(x_1, x_2)$ if and only if a zero-delay synchronization edge directed from $y_1$ to $y_2$ makes $(x_1, x_2)$ redundant. The following fact is easily verified from Definitions 10.1 and 10.2.

Fact 10.2: Suppose that $G$ is a synchronization graph that contains exactly two SCCs, $F$ is the set of feedforward edges in $G$, and $R$ is a resynchronization of $G$. Then for each $e \in F$, there exists $e' \in R$ such that $(\mathrm{src}(e'), \mathrm{snk}(e'))$ subsumes $e$ in $G$.

An intuitive correspondence between the pairwise resynchronization problem and the set-covering problem can be derived from Fact 10.2. Suppose that $G$ is a synchronization graph with exactly two SCCs, $C_1$ and $C_2$, such that each feedforward edge is directed from a member of $C_1$ to a member of $C_2$. We start by viewing the set $F$ of feedforward edges in $G$ as the finite set that we wish to cover, and with each member $p$ of $\{(x, y) \mid (x \in C_1, y \in C_2)\}$, we associate the subset of $F$ defined by $\chi(p) \equiv \{e \in F \mid (p \text{ subsumes } e)\}$. Thus, $\chi(p)$ is the set of feedforward edges of $G$ whose corresponding synchronizations can be eliminated if we implement a zero-delay synchronization edge directed from the first vertex of the ordered pair $p$ to the second vertex of $p$. Clearly then, $\{e_1', e_2', \ldots, e_n'\}$ is a resynchronization if and only if each $e \in F$ is contained in at least one $\chi((\mathrm{src}(e_i'), \mathrm{snk}(e_i')))$, that is, if and only if $\{\chi((\mathrm{src}(e_i'), \mathrm{snk}(e_i'))) \mid 1 \le i \le n\}$ covers $F$. Thus, solving the pairwise resynchronization problem for $G$ is equivalent to finding a minimal cover for $F$ given the family of subsets $\{\chi((x, y)) \mid (x \in C_1, y \in C_2)\}$.
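The subsumption test and the sets $\chi(p)$ can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: edges are hypothetical `(src, snk, delay)` triples, $\rho_G$ is computed with the Floyd-Warshall algorithm, and $\rho_G(x, x)$ is taken to be 0 (which is what makes every synchronization edge subsume itself). The helper names are not from the original text.

```python
import math

def min_delay_paths(nodes, edges):
    """rho[x][y] = minimum total delay over directed paths from x to y.

    rho[x][x] is 0 and rho[x][y] is math.inf when y is unreachable from x.
    """
    rho = {x: {y: (0 if x == y else math.inf) for y in nodes} for x in nodes}
    for u, v, d in edges:
        rho[u][v] = min(rho[u][v], d)
    for k in nodes:                       # Floyd-Warshall relaxation
        for i in nodes:
            for j in nodes:
                if rho[i][k] + rho[k][j] < rho[i][j]:
                    rho[i][j] = rho[i][k] + rho[k][j]
    return rho

def subsumes(rho, pair, edge):
    """True if the ordered pair (y1, y2) subsumes the edge (x1, x2, delay)."""
    y1, y2 = pair
    x1, x2, delay = edge
    return rho[x1][y1] + rho[y2][x2] <= delay

def chi(rho, pair, feedforward_edges):
    """Set of feedforward edges subsumed by the ordered pair."""
    return {e for e in feedforward_edges if subsumes(rho, pair, e)}
```

As a usage example, take two SCCs `{a, b}` and `{c, d}` with intra-SCC edges `a->b` (delay 0), `b->a` (delay 1), `c->d` (delay 0), `d->c` (delay 1) and feedforward edges `a->d` and `b->d` (both delay 0). The pair `(b, c)` then subsumes both feedforward edges, whereas `(a, d)` subsumes only `a->d`.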

Figure 10.2 helps to illustrate this intuition. Suppose that we are given the set $X = \{x_1, [\ldots]\}$ and the family of subsets $T = \{t_1, t_2, [\ldots]\}$, where $t_1 = [\ldots]$, $t_2 = [\ldots]$, and [...]. To construct an instance of the pairwise resynchronization problem, we first create two vertices and an edge directed between these vertices for each member of $X$; we label each of the edges created in this step with the corresponding member of $X$. Then for each $t \in T$, we create two vertices $vsrc(t)$ and $vsnk(t)$. Next, for each relation $x_i \in t_j$ (there are six such relations in this example), we create two delayless edges: one directed from the source of the edge corresponding to $x_i$ and directed to $vsrc(t_j)$, and another directed from $vsnk(t_j)$ to the sink of the edge corresponding to $x_i$. This last step has the effect of making each pair $(vsrc(t_i), vsnk(t_i))$ subsume exactly those edges that correspond to members of $t_i$; in other words, after this construction, $\chi((vsrc(t_i), vsnk(t_i))) = t_i$, for each $i$. Finally, for each edge created in the previous step, we create a corresponding feedback edge oriented in the opposite direction, and having a unit delay.

Figure 10.2. (a) An instance of the pairwise resynchronization problem that is derived from an instance of the set-covering problem; (b) the HSDFG that results from a solution to this instance of pairwise resynchronization.

Figure 10.2(a) shows the synchronization graph that results from this construction process. Here, it is assumed that each vertex corresponds to a separate processor; the associated unit delay, self loop edges are not shown to avoid clutter. Observe that the graph contains two SCCs, the SCC [...] and the SCC [...], and that the set of feedforward edges is the set of edges that correspond to members of $X$. Now, recall that a major correspondence between the given instance of set covering and the instance of pairwise resynchronization defined by Figure 10.2(a) is that $\chi((vsrc(t_i), vsnk(t_i))) = t_i$, for each $i$. Thus, if we can find a minimal resynchronization of Figure 10.2(a) such that each edge in this resynchronization is directed from some $vsrc(t_k)$ to the corresponding $vsnk(t_k)$, then the associated $t_k$'s form a minimum cover of $X$. For example, it is easy, albeit tedious, to verify that the resynchronization illustrated in Figure 10.2(b), [...], is a minimal resynchronization of Figure 10.2(a), and from this, we can conclude that [...] is a minimal cover for $X$. From inspection of the given sets $X$ and $T$, it is easily verified that this conclusion is correct.

This example illustrates how an instance of pairwise resynchronization can be constructed (in polynomial time) from an instance of set covering, and how a solution to this instance of pairwise resynchronization can easily be converted into a solution of the set-covering instance. The formal proof of the NP-hardness of pairwise resynchronization that is given in the following section is a generalization of the example in Figure 10.2.

In this section, the NP-hardness of the resynchronization problem is established. This result is derived by reducing an arbitrary instance of the set-covering problem, a well-known NP-hard problem, to an instance of the pairwise resynchronization problem, which is a special case of the resynchronization problem that occurs when there are exactly two SCCs. The intuition behind this reduction is explained in Section 10.3 above.

Suppose that we are given an instance $(X, T)$ of set covering, where $X$ is a finite set, and $T$ is a family of subsets of $X$ that covers $X$. Without loss of generality, we assume that $T$ does not contain a proper nonempty subset $T'$ that satisfies

$\left(\bigcup_{t \in T'} t\right) \cap \left(\bigcup_{t \in (T - T')} t\right) = \emptyset$. (10-9)

We can assume this without loss of generality because if this assumption does not hold, then we can apply the construction below to each "independent subfamily"

separately, and then combine the results to get a minimal cover for $X$. The following steps specify how we construct an HSDFG from $(X, T)$. Except where stated otherwise, no delay is placed on the edges that are instantiated.

1. For each $x \in X$, instantiate two vertices $vsrc(x)$ and $vsnk(x)$, and instantiate an edge $e(x)$ directed from $vsrc(x)$ to $vsnk(x)$.

2. For each $t \in T$:

(a) Instantiate two vertices $vsrc(t)$ and $vsnk(t)$.

(b) For each $x \in t$:

Instantiate an edge directed from $vsrc(x)$ to $vsrc(t)$.

Instantiate an edge directed from $vsrc(t)$ to $vsrc(x)$, and place one delay on this edge.

Instantiate an edge directed from $vsnk(t)$ to $vsnk(x)$.

Instantiate an edge directed from $vsnk(x)$ to $vsnk(t)$, and place one delay on this edge.

3. For each vertex that has been instantiated, instantiate an edge directed from the vertex to itself, and place one delay on this edge.
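The construction steps above can be sketched as follows. This Python sketch is illustrative only: the tuple encoding of vertices (`("vsrc", name)`, `("vsnk", name)`), the `(src, snk, delay)` edge triples, and the function name are assumptions, and the names of the elements of $X$ are assumed to be distinct from the names of the members of $T$.

```python
def build_pairwise_instance(X, T):
    """Build the HSDFG of the reduction from a set-covering instance.

    X: iterable of element names.
    T: dict mapping each subset name to the set of elements it contains.
    Returns (vertices, edges, feedforward), with edges as (src, snk, delay).
    """
    vertices, edges, feedforward = [], [], []

    # Step 1: one delayless feedforward edge e(x) per element x.
    for x in X:
        vertices += [("vsrc", x), ("vsnk", x)]
        e = (("vsrc", x), ("vsnk", x), 0)
        edges.append(e)
        feedforward.append(e)

    # Step 2: two vertices per subset t, linked to the edges of its members.
    for t, members in T.items():
        vertices += [("vsrc", t), ("vsnk", t)]
        for x in members:
            edges.append((("vsrc", x), ("vsrc", t), 0))
            edges.append((("vsrc", t), ("vsrc", x), 1))  # unit-delay feedback
            edges.append((("vsnk", t), ("vsnk", x), 0))
            edges.append((("vsnk", x), ("vsnk", t), 1))  # unit-delay feedback

    # Step 3: a unit-delay self loop on every vertex.
    for v in vertices:
        edges.append((v, v, 1))

    return vertices, edges, feedforward
```

The number of vertices and edges grows linearly with $|X|$ plus the total size of the members of $T$, which is in line with the claim that the reduction is polynomial-time.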

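Observe from our construction that whenever $x \in X$ is contained in $t \in T$, there is an edge directed from $vsrc(x)$ ($vsnk(t)$) to $vsrc(t)$ ($vsnk(x)$), and there is also an edge (having unit delay) directed from $vsrc(t)$ ($vsnk(x)$) to $vsrc(x)$ ($vsnk(t)$). Thus, from the assumption stated in (10-9), it follows that $\{vsrc(z) \mid z \in (X \cup T)\}$ forms one SCC, $\{vsnk(z) \mid z \in (X \cup T)\}$ forms another SCC, and $F \equiv \{e(x) \mid x \in X\}$ is the set of feedforward edges.

Let $G$ denote the HSDFG that we have constructed, and as in Section 10.3, define $\chi((y, z)) \equiv \{e \in F \mid ((y, z) \text{ subsumes } (\mathrm{src}(e), \mathrm{snk}(e)))\}$ for each ordered pair of vertices $(y, z)$ such that $y$ is contained in the source SCC of $G$, and $z$ is contained in the sink SCC of $G$. Clearly, $G$ gives an instance of the pairwise resynchronization problem.

Observation 1: By construction of $G$, observe that $\{x \in X \mid ((vsrc(t), vsnk(t)) \text{ subsumes } (vsrc(x), vsnk(x)))\} = t$, for all $t \in T$. Thus, for all $t \in T$, $\chi((vsrc(t), vsnk(t))) = \{e(x) \mid x \in t\}$.

Observation 2: For each $x \in X$, all input edges of $vsrc(x)$ have unit delay on them. It follows that for any vertex $y$ in the sink SCC of $G$, $\chi((vsrc(x), y)) \subseteq \{e(x)\}$.

Observation 3: For each $t \in T$, the only vertices in $G$ that have a delay-free path to $vsrc(t)$ are those vertices contained in $\{vsrc(x) \mid x \in t\}$. It follows that for any vertex $y$ in the sink SCC of $G$, $\chi((vsrc(t), y)) \subseteq \chi((vsrc(t), vsnk(t))) = \{e(x) \mid x \in t\}$.

Now suppose that $F' = \{f_1, f_2, \ldots, f_m\}$ is a minimal resynchronization of $G$. For each $i \in \{1, 2, \ldots, m\}$, exactly one of the following two cases must apply:

Case 1: $\mathrm{src}(f_i) = vsrc(x)$ for some $x \in X$. In this case, we pick an arbitrary $t \in T$ that contains $x$, and we set $v_i = vsrc(t)$ and $w_i = vsnk(t)$. From Observation 2, it follows that $\chi((\mathrm{src}(f_i), \mathrm{snk}(f_i))) \subseteq \{e(x)\} \subseteq \chi((v_i, w_i))$.

Case 2: $\mathrm{src}(f_i) = vsrc(t)$ for some $t \in T$. We set $v_i = vsrc(t)$ and $w_i = vsnk(t)$. From Observation 3, we have $\chi((\mathrm{src}(f_i), \mathrm{snk}(f_i))) \subseteq \chi((v_i, w_i))$.

Observation 4: From our definition of the $v_i$'s and $w_i$'s, $\{d_0(v_i, w_i) \mid (i \in \{1, 2, \ldots, m\})\}$ is a minimal resynchronization of $G$. Also, each $(v_i, w_i)$ is of the form $(vsrc(t), vsnk(t))$, where $t \in T$.

Now, for each $i \in \{1, 2, \ldots, m\}$, we define $Z_i \equiv \{x \in X \mid ((v_i, w_i) \text{ subsumes } (vsrc(x), vsnk(x)))\}$.

We first show that $\{Z_1, Z_2, \ldots, Z_m\}$ covers $X$. From Observation 4, we have that for each $Z_i$, there exists a $t \in T$ such that $Z_i = \{x \in X \mid ((vsrc(t), vsnk(t)) \text{ subsumes } (vsrc(x), vsnk(x)))\}$. Thus, each $Z_i$ is a member of $T$. Also, since $\{d_0(v_i, w_i) \mid (i \in \{1, 2, \ldots, m\})\}$ is a resynchronization of $G$, each member of $\{(vsrc(x), vsnk(x)) \mid x \in X\}$ must be preserved by some $(v_i, w_i)$, and thus each $x \in X$ must be contained in some $Z_i$.

Next, we show that $\{Z_1, Z_2, \ldots, Z_m\}$ is a minimal cover for $X$ (by contraposition). Suppose there exists a cover $\{Y_1, Y_2, \ldots, Y_{m'}\}$ (among the members of $T$) for $X$, with $m' < m$. Then, each $x \in X$ is contained in some $Y_j$, and from Observation 1, $(vsrc(Y_j), vsnk(Y_j))$ subsumes $e(x)$. Thus, $\{(vsrc(Y_i), vsnk(Y_i)) \mid (i \in \{1, 2, \ldots, m'\})\}$ is a resynchronization of $G$. Since $m' < m$, it follows that $F' = \{f_1, f_2, \ldots, f_m\}$ is not a minimal resynchronization of $G$.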

In summary, we have shown how to convert an arbitrary instance $(X, T)$ of the set-covering problem into an instance $G$ of the pairwise resynchronization problem, and we have shown how to convert a solution $F' = \{f_1, f_2, \ldots, f_m\}$ of this instance of pairwise resynchronization into a solution $\{Z_1, Z_2, \ldots, Z_m\}$ of $(X, T)$. It is easily verified that all of the steps involved in deriving $G$ from $(X, T)$, and in deriving $\{Z_1, Z_2, \ldots, Z_m\}$ from $F'$, can be performed in polynomial time. Thus, from the NP-hardness of set covering, we can conclude that the pairwise resynchronization problem is NP-hard.

A heuristic framework for the pairwise resynchronization problem emerges naturally from the relationship that was established in Section 10.3 between set covering and pairwise resynchronization. Given an arbitrary algorithm COVER that solves the set-covering problem, and given an instance of pairwise resynchronization that consists of two SCCs $C_1$ and $C_2$, and a set $S$ of feedforward synchronization edges directed from members of $C_1$ to members of $C_2$, this heuristic framework first computes the subset

$\chi((u, v)) \equiv \{e \in S \mid \rho_G(\mathrm{src}(e), u) + \rho_G(v, \mathrm{snk}(e)) \le \mathrm{delay}(e)\}$

for each ordered pair of actors $(u, v)$ that is contained in the set

$T \equiv \{(u', v') \mid (u' \text{ is in } C_1 \text{ and } v' \text{ is in } C_2)\}$,

and then applies the algorithm COVER to the instance of set covering defined by the set $S$ together with the family of subsets $\{\chi((u', v')) \mid ((u', v') \in T)\}$. If $\Xi$ denotes the solution returned by COVER, then a resynchronization for the given instance of pairwise resynchronization can be derived by

$\{d_0(u, v) \mid (\chi((u, v)) \in \Xi)\}$.

This resynchronization is the solution returned by the heuristic framework. From the correspondence between set covering and pairwise resynchronization that is outlined in Section 10.3, it follows that the quality of a resynchronization obtained by the heuristic framework is determined entirely by the quality of the solution computed by the set-covering algorithm that is employed; that is, if the solution computed by COVER is X% worse (X% more subfamilies) than an optimal set-covering solution, then the resulting resynchronization will be X% worse (X% more synchronization edges) than an optimal resynchronization of the given instance of pairwise resynchronization.

The application of the heuristic framework for pairwise resynchronization to each pair of SCCs, in some arbitrary order, in a general synchronization graph yields a heuristic framework for the general resynchronization problem. However, a major limitation of this extension to general synchronization graphs arises from its inability to consider resynchronization opportunities that involve paths that traverse more than two SCCs, and paths that contain more than one feedforward synchronization edge.

Thus, in general, the quality of the solutions obtained by this approach will be worse than the quality of the solutions that are derived by the particular set-covering heuristic that is employed, and roughly, this discrepancy can be expected to increase as the number of SCCs increases relative to the number of synchronization edges in the original synchronization graph.

For example, Figure 10.3 shows the synchronization graph that results from a six-processor schedule of a synthesizer for plucked-string musical instruments in 11 voices based on the Karplus-Strong technique. Here, [...] represents the excitation input, each [...] represents the computation for the $i$th voice, and the actors marked with "+" signs specify adders. Execution time estimates for the actors are shown in the table at the bottom of the figure. In this example, the only pair of distinct SCCs that have more than one synchronization edge between them is the pair consisting of the SCC containing [...] and the SCC containing [...], five addition actors, and the actor labeled [...]. Thus, the best result that can be derived from the heuristic extension for general synchronization graphs described above is a resynchronization that optimally rearranges the synchronization edges between these two SCCs in isolation, and leaves all other synchronization edges unchanged. Such a resynchronization is illustrated in Figure 10.4. This synchronization graph has a total of nine synchronization edges, which is only one less than the number of synchronization edges in the original graph. In contrast, it is shown in the following subsection that with a more flexible approach to resynchronization, the total synchronization cost of this example can be reduced to only five synchronization edges.
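The pairwise heuristic framework can be sketched in Python with the set-covering algorithm passed in as a parameter. This is a sketch under stated assumptions, not the authors' implementation: edges are hypothetical `(src, snk, delay)` triples, $\rho_G(x, x)$ is taken to be 0, subsumption follows Definition 10.2, and the default COVER is a simple greedy heuristic chosen here only for illustration. All function names are hypothetical.

```python
import math
from itertools import product

def min_delay_paths(nodes, edges):
    """Floyd-Warshall minimum path delays; rho[x][x] is assumed to be 0."""
    rho = {x: {y: (0 if x == y else math.inf) for y in nodes} for x in nodes}
    for u, v, d in edges:
        rho[u][v] = min(rho[u][v], d)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if rho[i][k] + rho[k][j] < rho[i][j]:
                    rho[i][j] = rho[i][k] + rho[k][j]
    return rho

def greedy_cover(universe, family):
    """Greedy set cover: family maps a key to a set; returns chosen keys."""
    remaining, chosen = set(universe), []
    while remaining:
        best = max(family, key=lambda k: len(family[k] & remaining))
        if not family[best] & remaining:
            raise ValueError("family does not cover the universe")
        chosen.append(best)
        remaining -= family[best]
    return chosen

def pairwise_resynchronize(nodes, edges, C1, C2, S, cover=greedy_cover):
    """Return zero-delay edges (u, v, 0) that together subsume every edge in S.

    C1, C2: vertex lists of the source and sink SCC; S: feedforward edges.
    """
    rho = min_delay_paths(nodes, edges)
    family = {}
    for u, v in product(C1, C2):
        subset = {e for e in S if rho[e[0]][u] + rho[v][e[1]] <= e[2]}
        if subset:
            family[(u, v)] = subset
    return [(u, v, 0) for (u, v) in cover(S, family)]
```

Because `cover` is a parameter, an exact set-covering routine can be substituted for small instances, which mirrors the observation in the text that the quality of the result is determined by the quality of COVER.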

This subsection presents a more global approach to resynchronization, called Algorithm Global-resynchronize, which overcomes the major limitation of the pairwise approach discussed in Section 10.5.1. Algorithm Global-resynchronize is based on the simple greedy approximation algorithm for set covering that repeatedly selects a subset that covers the largest number of remaining elements, where a remaining element is an element that is not contained in any of the subsets that have already been selected. In [Joh74, Lov75] it is shown that this set-covering technique is guaranteed to compute a solution whose cardinality is no greater than $(\ln(|X|) + 1)$ times that of the optimal solution, where $X$ is the set that is to be covered.

To adapt this set-covering technique to resynchronization, we construct an instance of set covering by choosing the set $X$, the set of elements to be covered, to be the set of feedforward synchronization edges, and choosing the family of subsets to be

$T \equiv \{\chi((v_1, v_2)) \mid ((v_1, v_2 \in V) \text{ and } (\rho_G(v_2, v_1) = \infty))\}$, (10-10)

where $G = (V, E)$ is the input synchronization graph. The constraint $\rho_G(v_2, v_1) = \infty$ in (10-10) ensures that inserting the resynchronization edge $(v_1, v_2)$ does not introduce a cycle, and thus that it does not introduce deadlock or reduce the estimated throughput.

Figure 10.3. The synchronization graph that results from a six-processor schedule of a music synthesizer based on the Karplus-Strong technique. (The figure includes a table of actor execution times.)

Algorithm Global-resynchronize assumes that the input synchronization graph is reduced (a reduced synchronization graph can be derived efficiently, for example, by using the redundant synchronization removal technique discussed in the previous chapter). The algorithm determines the family of subsets specified by (10-10), chooses a member of this family that has maximum cardinality, inserts the corresponding delayless resynchronization edge, removes all synchronization edges that it subsumes, and updates the values $\rho_G(x, y)$ for the new synchronization graph that results. This entire process is then repeated on the new synchronization graph, and it continues until it arrives at a synchronization graph for which the computation defined by (10-10) produces the empty set; that is,

Figure 10.4. The synchronization graph that results from applying the heuristic framework based on pairwise resynchronization to the example of Figure 10.3.

Global-resynchronize: given a reduced synchronization graph $G = (V, E)$, compute an alternative reduced synchronization graph that preserves $G$.

the algorithm terminates when no more resynchronization edges can be added. Figure 10.5 gives a pseudocode specification of this algorithm (with some straightforward modifications to improve the running time).

To analyze the complexity of Algorithm Global-resynchronize, the following definition is useful.

Definition 10.3: Suppose that $G$ is a synchronization graph. The delayless connectivity of $G$, denoted $DC(G)$, is the number of distinct ordered vertex-pairs $(x, y)$ in $G$ that satisfy $\rho_G(x, y) = 0$. That is, $DC(G) = |S(G)|$, where

$S(G) \equiv \{(x, y) \mid (\rho_G(x, y) = 0)\}$. (10-11)

The following lemma shows that as long as the input synchronization graph is reduced, the resynchronization operations performed in Algorithm Global-resynchronize always yield a reduced synchronization graph.

Lemma 10.3: Suppose that $G = (V, E)$ is a reduced synchronization graph, and $(x, y)$ is an ordered pair of vertices in $G$ such that $(x, y) \notin E$, $\rho_G(y, x) = \infty$, and $|\chi((x, y))| \ge 1$. Let $G'$ denote the synchronization graph obtained by inserting $d_0(x, y)$ into $G$ and removing all members of $\chi((x, y))$; that is, $G' = (V, E')$, where

$E' \equiv ((E - \chi((x, y))) + d_0(x, y))$.

Then $G'$ is a reduced synchronization graph. In other words, $G'$ does not contain any redundant synchronizations. Furthermore, $DC(G') > DC(G)$.

Proof: We prove the first part of this lemma by contraposition. Suppose that there exists a redundant synchronization edge $s$ in $G'$, and first suppose that $s = (x, y)$. Then from Fact 10.1, there exists a path $p$ in $G'$ directed from $x$ to $y$ such that

$\mathrm{Delay}(p) = 0$ and $p$ does not contain $(x, y)$. (10-12)

Also, observe that from the definition of $E'$,

$(E' - \{(x, y)\}) \subseteq E$. (10-13)

It follows from (10-12) and (10-13) that $G$ also contains the path $p$. Now let $(x', y')$ be an arbitrary member of $\chi((x, y))$. Then

$\rho_G(x', x) + \rho_G(y, y') \le \mathrm{delay}((x', y'))$. (10-14)

Since $G$ contains the path $p$, we have $\rho_G(x, y) = 0$, and thus, from the triangle inequality (3-4) together with (10-14),

$\rho_G(x', y') \le \rho_G(x', x) + \rho_G(x, y) + \rho_G(y, y') \le \mathrm{delay}((x', y'))$. (10-15)

We conclude that $(x', y')$ is redundant in $G$, which violates the assumption that $G$ is reduced.

If, on the other hand, $s \ne (x, y)$, then from Fact 10.1, there exists a simple path $p_s \ne (s)$ in $G'$ directed from $\mathrm{src}(s)$ to $\mathrm{snk}(s)$ such that

$\mathrm{Delay}(p_s) \le \mathrm{delay}(s)$. (10-16)

Also, it follows from (10-13) that $G$ contains $s$. Since $G$ is reduced, the path $p_s$ must contain the edge $(x, y)$ (otherwise $s$ would be redundant in $G$). Thus, $p_s$ can be expressed as a concatenation $p_s = (p_1, (x, y), p_2)$, where either $p_1$ or $p_2$ may be empty, but not both. Furthermore, since $p_s$ is a simple path, neither $p_1$ nor $p_2$ contains $(x, y)$. Hence, from (10-13), we are guaranteed that both $p_1$ and $p_2$ are also contained in $G$. Now from (10-16), we have

$\mathrm{Delay}(p_1) + \mathrm{Delay}(p_2) \le \mathrm{delay}(s)$. (10-17)

Furthermore, from the definition of $p_1$ and $p_2$,

$\rho_G(\mathrm{src}(s), x) \le \mathrm{Delay}(p_1)$ and $\rho_G(y, \mathrm{snk}(s)) \le \mathrm{Delay}(p_2)$. (10-18)

Combining (10-17) and (10-18) yields

$\rho_G(\mathrm{src}(s), x) + \rho_G(y, \mathrm{snk}(s)) \le \mathrm{delay}(s)$, (10-19)

which implies that $s \in \chi((x, y))$. But this violates the assumption that $G'$ does not contain any edges that are subsumed by $(x, y)$ in $G$. This concludes the proof of the first part of Lemma 10.3.

It remains to be shown that $DC(G') > DC(G)$. Now, from Lemma 10.1 and Definition 10.3, it follows that

$S(G) \subseteq S(G')$. (10-20)

Also, from the first part of Lemma 10.3, which has already been proven, we know that $G'$ is reduced. Thus, from Lemma 10.2, we have

$(x, y) \notin S(G)$. (10-21)

But, clearly from the construction of $G'$, $\rho_{G'}(x, y) = 0$, and thus,

$(x, y) \in S(G')$. (10-22)

From (10-20), (10-21), and (10-22), it follows that $S(G)$ is a proper subset of $S(G')$. Hence, $DC(G') > DC(G)$. QED.

Clearly from Lemma 10.3, each time Algorithm Global-resynchronize performs a resynchronization operation (an iteration of the while loop of Figure 10.5), the number of ordered vertex pairs $(x, y)$ that satisfy $\rho_G(x, y) = 0$ is increased by at least one. Thus, the number of iterations of the while loop in Figure 10.5 is bounded above by $|V|^2$. The complexity of one iteration of the while loop is dominated by the computation in the pair of nested for loops. The computation of one iteration of the inner for loop is dominated by the time required to compute $\chi((x, y))$ for a specific actor pair $(x, y)$. Assuming $\rho_G(x', y')$ is available for all $x', y' \in V$, the time to compute $\chi((x, y))$ is $O(s_c)$, where $s_c$ is the number of feedforward synchronization edges in the current synchronization graph. Since the number of feedforward synchronization edges never increases from one iteration of the while loop to the next, it follows that the time complexity of the overall algorithm is $O(s|V|^4)$, where $s$ is the number of feedforward synchronization edges in the input synchronization graph. In practice, however, the number of resynchronization steps (while loop iterations) is usually much lower than $|V|^2$, since the constraints on the introduction of cycles severely limit the number of resynchronization steps. Thus, the $O(s|V|^4)$ bound can be viewed as a very conservative estimate.
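The greedy loop analyzed above can be sketched in Python as follows. This is a simplified illustration and not the pseudocode of Figure 10.5: it assumes `(src, snk, delay)` edge triples, a reduced input graph, and $\rho_G(x, x) = 0$; it skips candidate pairs that are already edges (as in the hypothesis of Lemma 10.3); and it recomputes all minimum path delays from scratch in every iteration instead of updating them incrementally. The iteration cap of $|V|^2$ reflects the bound on while-loop iterations derived in the text.

```python
import math

def min_delay_paths(nodes, edges):
    """Floyd-Warshall minimum path delays; rho[x][x] is assumed to be 0."""
    rho = {x: {y: (0 if x == y else math.inf) for y in nodes} for x in nodes}
    for u, v, d in edges:
        rho[u][v] = min(rho[u][v], d)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if rho[i][k] + rho[k][j] < rho[i][j]:
                    rho[i][j] = rho[i][k] + rho[k][j]
    return rho

def global_resynchronize(nodes, edges, sync_edges):
    """Greedy resynchronization sketch; returns (edges, sync_edges)."""
    edges, sync = list(edges), set(sync_edges)
    for _ in range(len(nodes) ** 2):          # bound on resynchronization steps
        rho = min_delay_paths(nodes, edges)
        present = {(u, v) for u, v, _ in edges}
        best, best_set = None, set()
        for x in nodes:
            for y in nodes:
                # Candidate (x, y) must be new and must not close a cycle.
                if (x, y) in present or rho[y][x] != math.inf:
                    continue
                covered = {e for e in sync
                           if rho[e[0]][x] + rho[y][e[1]] <= e[2]}
                if len(covered) > len(best_set):
                    best, best_set = (x, y), covered
        if best is None:                      # no candidate subsumes any edge
            break
        new_edge = (best[0], best[1], 0)      # delayless resynchronization edge
        edges = [e for e in edges if e not in best_set] + [new_edge]
        sync = (sync - best_set) | {new_edge}
    return edges, sync
```

As a usage example, consider two processors with actors `A, B` and `C, D` (intra-processor edges `A->B`, `C->D` with delay 0 and `B->A`, `D->C` with delay 1) and synchronization edges `A->C` and `B->D`. The sketch replaces both synchronization edges with the single edge `B->C`.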

A resynchronization step of Algorithm Global-resynchronize yields an immediate reduction in synchronization cost as long as a resynchronization edge can be found that subsumes at least two existing synchronization edges. However, in general it may be advantageous to continue the resynchronization process even if each resynchronization candidate subsumes at most one synchronization edge. This is because although such a resynchronization candidate does not lead to an immediate reduction in synchronization cost, its insertion may lead to future resynchronization opportunities in which the number of synchronization edges can be reduced. Figures 10.6 and 10.7 illustrate a simple example. In the synchronization graph shown in Figure 10.6(a), there are 5 synchronization edges, [...], $(B, C)$, [...], $(G, [\ldots])$, and [...]. Self-loop edges incident to actors [...], $C$, and $F$ (each of these four actors executes on a separate processor) are omitted from the illustration for clarity. It is easily verified that no resynchronization candidate in Figure 10.6(a) subsumes more than one synchronization edge. If we terminate the resynchronization process at this point, we must accept a synchronization cost of 5 synchronization edges. However, suppose that we insert the resynchronization edge [...], which subsumes $(B, C)$, and then we remove the subsumed edge $(B, C)$. Then we arrive at the synchronization graph of Figure 10.6(b). In this graph, resynchronization candidates exist that subsume up to two synchronization edges each.

Figure 10.6. An example in which inserting a resynchronization edge that subsumes only one existing synchronization edge eventually leads to a reduction in the total number of synchronizations.

Alternatively, from Figure 10.6(b), we could insert the resynchronization edge $(C, E)$ and remove both $(D, [\ldots])$ and $(A, E)$. This gives us the synchronization graph of Figure 10.7(d), which contains four synchronization edges.

Figure 10.7. A continuation of the example in Figure 10.6.

This is the solution derived by an actual implementation of Algorithm Global-resynchronize [BSL96b] when it is applied to the graph of Figure 10.6(a).

Figure 10.8 shows the optimized synchronization graph that is obtained when Algorithm Global-resynchronize is applied to the example of Figure 10.3 (using the implementation discussed in [BSL96b]). Observe that the total number of synchronization edges has been reduced from 10 to 5. The total number of "resynchronization steps" (number of while-loop iterations) required by the heuristic to complete this resynchronization is 7.

Table 10.1 shows the relative throughput improvement delivered by the optimized synchronization graph of Figure 10.8 over the original synchronization graph as the shared memory access time varies from 1 to 10 processor clock cycles. The assumed synchronization protocol is [...], and the back-off time for each simulation is obtained by the experimental procedure discussed in Section 9.5. The second and fourth columns show the average iteration period for the original synchronization graph and the resynchronized graph, respectively. The average iteration period, which is the reciprocal of the average throughput, is the average number of time units required to execute an iteration of the synchronization graph. From the sixth column, we see that the resynchronized graph consistently attains a throughput improvement of [...] to [...]. This improvement includes the effect of reduced overhead for maintaining synchronization variables and reduced contention for shared memory.

The third and fifth columns of Table 10.1 show the average number of shared memory accesses per iteration of the synchronization graph. Here we see that the resynchronized solution consistently obtains at least a 30% improvement over the original synchronization graph. Since accesses to shared memory typically require significant amounts of energy, particularly for a multiprocessor system that is not integrated on a single chip, this reduction in the average rate of shared memory accesses is especially useful when low power consumption is an important implementation issue.

Figure 10.8. The optimized synchronization graph that is obtained when Algorithm Global-resynchronize is applied to the example of Figure 10.3.

Table 10.1. Performance comparison between the resynchronized solution and the original synchronization graph for the example of Figure 10.3. (Columns: memory access time, from 1 to 10; original graph: iteration period and memory accesses per period; resynchronized graph: iteration period and memory accesses per period; decrease in iteration period.)

The simulation is written in C, making use of a package called CSIM [Sch88] that allows concurrently running processes to be modeled. Each CSIM process is "created," after which it runs concurrently with the other processes in the simulation. Processes communicate and synchronize through events and mailboxes (which are FIFO queues of events between two processes). Time delays are specified by the function hold. Holding for an appropriate time causes the process to be put into an event queue, and the process "wakes up" when the simulation time has advanced by the amount specified by the hold statement. Passage of time is modeled in this fashion. In addition, CSIM allows specification of facilities, which can be accessed by only one process at a time. Mutual exclusion of access to shared resources is modeled in this fashion.

For the multiprocessor simulation, each processor is made into a process, and synchronization is attained by sending and receiving messages from mailboxes. The shared bus is made into a facility. Polling of the mailbox for checking the presence of data is done by first reserving the bus, then checking for the message count on that particular mailbox; if the count is greater than zero, data can be read from shared memory, or else the processor backs off for a certain duration, and then resumes polling.

When a processor sends data, it increments a counter in shared memory, and then writes the data value. When a processor receives, it first polls the corresponding counter, and if the counter is non-zero, proceeds with the read; otherwise, it backs off for some time and then polls the counter again. Experimentally determined back-off times are used for each value of the memory access time. For a send, the processor checks if the corresponding buffer is full or not. For the simulation, all buffers are sized equal to 5; these sizes can of course be jointly minimized to reduce buffer memory. Polling time is defined as the time required to access the bus and check the counter value.
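The receive side of the protocol described above (poll the counter, back off, poll again) can be illustrated with a small deterministic timing model in Python. This is a hedged sketch and not the CSIM simulation used for Table 10.1: it ignores bus contention, and it assumes that a poll observes the counter value at the instant the poll completes. The function and parameter names are hypothetical.

```python
def receive_completion_time(start, data_ready, poll_time, backoff, read_time):
    """Model a polling receiver with a fixed back-off interval.

    start:      time at which the receiver begins polling
    data_ready: time at which the sender has updated the counter
    poll_time:  time to access the bus and check the counter value
    backoff:    idle time between an unsuccessful poll and the next poll
    read_time:  time to read the data once the counter is non-zero
    Returns (completion_time, number_of_polls).
    """
    t, polls = start, 0
    while True:
        t += poll_time                # reserve the bus and check the counter
        polls += 1
        if t >= data_ready:           # counter observed as non-zero
            return t + read_time, polls
        t += backoff                  # back off, then poll again
```

For example, with `start=0`, `data_ready=10`, `poll_time=2`, `backoff=3` and `read_time=2`, the polls complete at times 2, 7 and 12, so the third poll succeeds and the read finishes at time 14. A shorter back-off reduces the waiting time after the data arrives but increases the number of shared-memory accesses, which is the trade-off that the experimentally determined back-off times address.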

In this section, it is shown that although optimal resynchronization is intractable for general synchronization graphs, a broad class of synchronization graphs exists for which optimal resynchronizations can be computed using an efficient polynomial-time algorithm.

Suppose that $C$ is an SCC in a synchronization graph $G$, and $x$ is an actor in $C$. Then $x$ is an input hub of $C$ if for each feedforward synchronization edge $e$ in $G$ whose sink actor is in $C$, we have $\rho_C(x, \mathrm{snk}(e)) = 0$. Similarly, $x$ is an output hub of $C$ if for each feedforward synchronization edge $e$ in $G$ whose source actor is in $C$, we have $\rho_C(\mathrm{src}(e), x) = 0$. We say that $C$ is linkable if there exist actors $x, y$ in $C$ such that $x$ is an input hub, $y$ is an output hub, and $\rho_C(x, y) = 0$. A synchronization graph is chainable if each SCC is linkable.

For example, consider the SCC in Figure 10.9(a), and assume that the dashed edges represent the synchronization edges that connect this SCC with other SCCs. This SCC has exactly one input hub, actor A, and exactly one output hub, actor F, and since $\rho(A, F) = 0$, it follows that the SCC is linkable. However, if we remove the edge (C, F), then the resulting graph (shown in Figure 10.9(b)) is not linkable since it does not have an output hub. A class of linkable SCCs that occur commonly in practical synchronization graphs are those that correspond to only one processor, such as the SCC shown in Figure 10.9(c). In such cases, the first actor executed on the processor is always an input hub and the last actor executed is always an output hub.

In the remainder of this section, we assume that for each linkable SCC, an input hub $x$ and an output hub $y$ are selected such that $\rho(x, y) = 0$, and these actors are referred to as the selected input hub and the selected output hub of the associated SCC. Which input hub and output hub are chosen as the selected ones makes no difference to our discussion of the techniques in this section as long as they are selected so that $\rho(x, y) = 0$.

An important property of linkable synchronization graphs is that if $C_1$ and $C_2$ are distinct linkable SCCs, then all synchronization edges directed from $C_1$ to $C_2$ are subsumed by the single ordered pair formed by the selected output hub of $C_1$ and the selected input hub of $C_2$. Furthermore, if there exists a path between two SCCs $C_1'$, $C_2'$ of the form $((o_1, i_2), (o_2, i_3), \ldots, (o_{n-1}, i_n))$, where $o_1$ is the selected output hub of $C_1'$, $i_n$ is the selected input hub of $C_2'$, and there exist $n - 2$ distinct SCCs other than $C_1'$ and $C_2'$ such that for $k = 2, \ldots, (n-1)$, $i_k$ and $o_k$ are respectively the selected input hub and the selected output hub of the $(k-1)$th of these SCCs, then all synchronization edges between $C_1'$ and $C_2'$ are redundant.
From these properties, an optimal resynchronization for a chainable synchronization graph can be constructed efficiently by computing a topological sort of the SCCs; instantiating a zero-delay synchronization edge from the selected output hub of the $i$th SCC in the topological sort to the selected input hub of the $(i+1)$th SCC, for $i = 1, 2, \ldots, (n-1)$, where $n$ is the total number of SCCs; and then removing all of the redundant synchronization edges that result. For example, if this algorithm is applied to the chainable synchronization graph of Figure 10.10(a), then the synchronization graph of Figure 10.10(b) is obtained, and the number of synchronization edges, originally 4, is reduced. This chaining technique can be viewed as a form of pipelining, where each SCC in the output synchronization graph corresponds to a pipeline stage. As discussed in Chapter 5, pipelining can be used to increase the throughput in multiprocessor DSP implementations through improved parallelism. However, in the form of pipelining that is associated with chainable synchronization graphs, the load of each processor is unchanged, and the estimated throughput is not affected

Figure 10.9. An illustration of input and output hubs for synchronization graph SCCs.

(since no new cyclic paths are introduced), and thus, the benefit to the throughput of the chaining technique arises chiefly from the optimal reduction of synchronization overhead. The time complexity of the optimal algorithm discussed above for resynchronizing chainable synchronization graphs is polynomial in the number of synchronization graph actors.
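The chaining step (topologically sort the SCCs, then connect consecutive SCCs hub to hub) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `chain_sccs`, the dictionary-based inputs, and the example hub names are all assumptions, and the final step of removing the redundant synchronization edges is omitted. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def chain_sccs(scc_edges, in_hub, out_hub):
    """Return the zero-delay synchronization edges added by chaining.

    scc_edges: dict mapping an SCC name to the set of SCC names that it
               has synchronization edges directed to (hypothetical format).
    in_hub / out_hub: selected input / output hub actor of each SCC.
    """
    sorter = TopologicalSorter()
    for scc in in_hub:
        sorter.add(scc)                 # make sure isolated SCCs appear
    for src, dsts in scc_edges.items():
        for dst in dsts:
            sorter.add(dst, src)        # src must precede dst
    order = list(sorter.static_order())
    # Edge from the output hub of the i-th SCC to the input hub of the
    # (i+1)-th SCC in the topological order.
    return [(out_hub[a], in_hub[b]) for a, b in zip(order, order[1:])]


# Hypothetical example with three SCCs.
edges = {"C1": {"C2", "C3"}, "C2": {"C3"}}
in_hub = {"C1": "A", "C2": "D", "C3": "G"}
out_hub = {"C1": "C", "C2": "F", "C3": "I"}
print(chain_sccs(edges, in_hub, out_hub))
# prints [('C', 'D'), ('F', 'G')]
```

With $n$ SCCs the chain contains $n - 1$ edges, which matches the count of "one less than the number of SCCs" used for graphs without feedback synchronization edges.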

It is easily verified that the original synchronization graph for the music synthesis example of Section 10.5.2, shown in Figure 10.3, is chainable. Thus, the chaining technique presented in Section 10.6.1 is guaranteed to produce an optimal resynchronization for this example, and since no feedback synchronization edges are present, the number of synchronization edges in the resynchronized solution is guaranteed to be equal to one less than the number of SCCs in the original synchronization graph; that is, the optimized synchronization graph contains $6 - 1 = 5$ synchronization edges. From Figure 10.8, we see that this is precisely the number of synchronization edges in the synchronization graph that results from the implementation of Algorithm Global-resynchronize that was discussed earlier.

Figure 10.10. An illustration of an algorithm for optimal resynchronization of chainable synchronization graphs. The dashed edges are synchronization edges.

Algorithm Global-resynchronize does not always produce optimal results for chainable synchronization graphs. For example, consider the synchronization graph shown in Figure 10.11(a), which corresponds to an eight-processor schedule in which each of eight subsets of actors is assigned to a separate processor. The dashed edges are synchronization edges, and the remaining edges connect actors that are assigned to the same processor. The total number of synchronization edges is 14. Now it is easily verified that actor K is both an input hub and an output hub for the SCC that contains C and G, and similarly, a single actor is both an input hub and an output hub for the SCC that contains D. Thus, we see that the overall synchronization graph is chainable. It is easily verified that the chaining technique developed in Section 10.6.1 uniquely yields the optimal resynchronization illustrated in Figure 10.11(b), which contains only 11 synchronization edges.

In contrast, the quality of the resynchronization obtained for Figure 10.11(a) by Algorithm Global-resynchronize depends on the order in which the actors are traversed by each of the two nested loops in Figure 10.5. For example, if both loops traverse the actors in alphabetical order, then Global-resynchronize obtains the sub-optimal solution shown in Figure 10.11(c), which contains 12 synchronization edges. However, actor traversal orders exist for which Global-resynchronize achieves optimal resynchronizations of Figure 10.11(a). If both loops traverse the actors in such an order, then Global-resynchronize yields the same resynchronized graph that is computed uniquely by the chaining technique of Section 10.6.1 (Figure 10.11(b)). It is an open question whether or not, given an arbitrary chainable synchronization graph, actor traversal orders always exist with which Global-resynchronize arrives at optimal resynchronizations. Furthermore, even if such traversal orders are always guaranteed to exist, it is doubtful that they can, in general, be computed efficiently.

Figure 10.11. A chainable synchronization graph for which Algorithm Global-resynchronize fails to produce an optimal solution.

The chaining technique developed in Section 10.6.1 can be generalized to optimally resynchronize a somewhat broader class of synchronization graphs. This class consists of all synchronization graphs for which each source SCC has an output hub (but not necessarily an input hub), each sink SCC has an input hub (but not necessarily an output hub), and each internal SCC is linkable. In this case, the internal SCCs are pipelined as in the previous algorithm, and then for each source SCC, a synchronization edge is inserted from one of its output hubs to the selected input hub of the first SCC in the pipeline of internal SCCs, and for each sink SCC, a synchronization edge is inserted to one of its input hubs from the selected output hub of the last SCC in the pipeline of internal SCCs. If there are no internal SCCs, then the sink SCCs are pipelined by selecting one input hub from each SCC, and joining these input hubs with a chain of synchronization edges. Then a synchronization edge is inserted from an output hub of each source SCC to an input hub of the first SCC in the chain of sink SCCs.

In addition to guaranteed optimality, another important advantage of the chaining technique for chainable synchronization graphs is its relatively low time complexity compared with that of Global-resynchronize, when both are expressed in terms of the number of synchronization graph actors and the number of feedforward synchronization edges. The primary disadvantage is, of course, its restricted applicability. An obvious solution is to first check if the general form of the chaining technique (described above in Section 10.6.3) can be applied, apply the chaining technique if the check returns an affirmative result, or apply Algorithm Global-resynchronize if the check returns a negative result. The check must determine whether or not each source SCC has an output hub, each sink SCC has an input hub, and each internal SCC is linkable. This check can be performed in time polynomial in $n$, where $n$ is the number of actors in the input synchronization graph, using a straightforward algorithm. A useful direction for further investigation is a deeper integration of the chaining technique with Algorithm Global-resynchronize for general (not necessarily chainable) synchronization graphs.
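The hub and linkability check mentioned above can be sketched with plain reachability over zero-delay edges. The sketch assumes that an actor $x$ is an input hub of an SCC when every feedforward synchronization edge entering the SCC has a sink reachable from $x$ through zero-delay edges inside the SCC, and symmetrically for output hubs; it also assumes non-negative delays and treats every actor as reaching itself. All function names and the edge-list format are hypothetical, and no attempt is made to match the complexity of any particular published algorithm.

```python
from collections import deque


def zero_delay_reach(edges, scc, start):
    """Actors of `scc` reachable from `start` via zero-delay edges in `scc`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for a, b, delay in edges:
            if a == u and delay == 0 and b in scc and b not in seen:
                seen.add(b)
                queue.append(b)
    return seen


def input_hubs(edges, scc, ff_sync_edges):
    # Sinks of feedforward synchronization edges that enter the SCC.
    sinks = {b for a, b in ff_sync_edges if b in scc}
    return {x for x in scc if sinks <= zero_delay_reach(edges, scc, x)}


def output_hubs(edges, scc, ff_sync_edges):
    # Sources of feedforward synchronization edges that leave the SCC.
    sources = {a for a, b in ff_sync_edges if a in scc}
    return {y for y in scc
            if all(y in zero_delay_reach(edges, scc, s) for s in sources)}


def is_linkable(edges, scc, ff_sync_edges):
    ins = input_hubs(edges, scc, ff_sync_edges)
    outs = output_hubs(edges, scc, ff_sync_edges)
    return any(y in zero_delay_reach(edges, scc, x) for x in ins for y in outs)


# Hypothetical SCC {A, B, C}: A -> B -> C with zero delay, C -> A with delay 1.
scc = {"A", "B", "C"}
edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 1)]
ff = [("X", "B"), ("B", "Y")]          # one entering edge, one leaving edge
print(sorted(input_hubs(edges, scc, ff)), sorted(output_hubs(edges, scc, ff)))
# prints ['A', 'B'] ['B', 'C']
print(is_linkable(edges, scc, ff))     # prints True
```

A dispatcher built on this check would apply the chaining technique when every SCC passes and fall back to a general heuristic otherwise.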

Synchronization rearrangement has also been studied in the context of minimizing the area of synchronization circuitry in the hardware synthesis of digital circuitry, and significant differences in the underlying models prevent these techniques from applying to the context of self-timed HSDFG implementation. In the graphical hardware model used in that work, called the constraint graph model, each vertex corresponds to a separate hardware device and edges have arbitrary weights that specify sequencing constraints. When the source vertex has bounded execution time, a positive weight $w(e)$ (forward constraint) imposes the constraint

$$\mathrm{start}(\mathrm{snk}(e)) \ge w(e) + \mathrm{start}(\mathrm{src}(e)), \tag{10-24}$$

while a negative weight $w(e)$ (backward constraint) implies

$$\mathrm{start}(\mathrm{snk}(e)) \le |w(e)| + \mathrm{start}(\mathrm{src}(e)).$$

If the source vertex has unbounded execution time, the forward and backward constraints are relative to the completion time of the source vertex. In contrast, in the synchronization graph model, multiple actors can reside on the same processing element (implying zero synchronization cost between them), and the timing constraints always correspond to the case where $w(e)$ is positive and equal to the execution time of $\mathrm{src}(e)$.

The implementation models and the associated implementation cost functions are also significantly different. A constraint graph is implemented using a scheduling technique called relative scheduling, which can roughly be viewed as intermediate between self-timed and static scheduling. In relative scheduling, the constraint graph vertices that have unbounded execution time, called anchors, are used as reference points against which other vertices are scheduled: for each vertex $v$, an offset $f_i$ is specified for each anchor $a_i$ that affects the activation of $v$, and $v$ is scheduled to occur once $f_i$ clock cycles have elapsed from the completion of $a_i$, for each $i$. In the implementation of a relative schedule, each anchor has attached control circuitry that generates offset signals, and each vertex has a synchronization circuit that asserts an activation signal when all relevant offset signals are present. The resynchronization optimization is driven by a cost function that estimates the total area of the synchronization circuitry, where the offset circuitry area estimate for an anchor is a function of the maximum offset, and the synchronization circuitry estimate for a vertex is a function of the number of offset signals that must be monitored.

As a result of the significant differences in both the scheduling models and the implementation models, the techniques developed for resynchronizing constraint graphs do not extend in any straightforward manner to the resynchronization of synchronization graphs for self-timed multiprocessor implementation, and the solutions that we have discussed for synchronization graphs are significantly different in structure from those reported in [F 92]. For example, the fundamental relationships that have been established between set covering and the resynchronization of self-timed HSDFG schedules have not emerged in the context of constraint graphs.

This chapter has discussed a post-optimization called resynchronization for self-timed, multiprocessor implementations of DSP algorithms. The goal of resynchronization is to introduce new synchronizations in such a way that the number of additional synchronizations that become redundant exceeds the number of new synchronizations that are added, and thus the net synchronization cost is reduced. It was shown that optimal resynchronization is intractable by deriving a reduction from the classic set-covering problem. However, a broad class of systems was defined for which optimal resynchronization can be performed in polynomial time. This chapter also discussed a heuristic algorithm for resynchronization of general systems that emerges naturally from the correspondence to set covering. The performance of an implementation of this heuristic was demonstrated on a multiprocessor schedule for a music synthesis system. The results demonstrate that the heuristic can efficiently reduce synchronization overhead and improve throughput significantly.

Chapter 10 introduced the concept of resynchronization, a post-optimization for static multiprocessor schedules in which extraneous synchronization operations are introduced in such a way that the number of original synchronizations that consequently become redundant significantly exceeds the number of additional synchronizations. Redundant synchronizations are synchronization operations whose corresponding sequencing requirements are enforced completely by other synchronizations in the system. The amount of run-time overhead required for synchronization can be reduced significantly by eliminating redundant synchronizations [Sha89, BSL97]. Thus, effective resynchronization reduces the net synchronization overhead in the implementation of a multiprocessor schedule, and improves the overall throughput. However, since additional serialization is imposed by the new synchronizations, resynchronization can produce a significant increase in latency. In Chapter 10, we discussed fundamental properties of resynchronization and we studied the problem of optimal resynchronization under the assumption that arbitrary increases in latency can be tolerated (“maximum-throughput resynchronization”). Such an assumption is valid, for example, in a wide variety of simulation applications. This chapter discusses the problem of computing an optimal resynchronization among all resynchronizations that do not increase the latency beyond a prespecified upper bound $L_{max}$. This study of resynchronization is based in the context of self-timed execution of iterative dataflow specifications, which is an implementation model that has been applied extensively for digital signal processing systems. Latency constraints become important in interactive applications such as video conferencing, games, and telephony, where latency beyond a certain point becomes annoying to the user. This chapter demonstrates how to obtain the benefits of resynchronization while maintaining a specified latency constraint.

This section introduces a number of useful properties that pertain to the process by which resynchronization can make certain synchronization edges in the original synchronization graph become redundant. The following definition is fundamental to these properties.

Definition 11.1: If $G$ is a synchronization graph, $s$ is a synchronization edge in $G$ that is not redundant, $R$ is a resynchronization of $G$, and $s$ is not contained in $R$, then we say that $R$ eliminates $s$. If $R$ eliminates $s$, $s' \in R$, and there is a path $p$ from $\mathrm{src}(s)$ to $\mathrm{snk}(s)$ in $R(G)$ such that $p$ contains $s'$ and $\mathrm{Delay}(p) \le \mathrm{delay}(s)$, then we say that $s'$ contributes to the elimination of $s$.

A synchronization edge $s$ can be eliminated if a resynchronization creates a path $p$ from $\mathrm{src}(s)$ to $\mathrm{snk}(s)$ such that $\mathrm{Delay}(p) \le \mathrm{delay}(s)$. In general, the path $p$ may contain more than one resynchronization edge, and thus, it is possible that none of the resynchronization edges allows us to eliminate $s$ “by itself”. In such cases, it is the contribution of all of the resynchronization edges within the path $p$ that enables the elimination of $s$. This motivates the choice of terminology in Definition 11.1. An example is shown in Figure 11.1. The following two facts follow immediately from Definition 11.1.

Fact 11.1: Suppose that $G$ is a synchronization graph, $R$ is a resynchronization of $G$, and $r$ is a resynchronization edge in $R$. If $r$ does not contribute to the elimination of any synchronization edges, then $(R - \{r\})$ is also a resynchronization of $G$. If $r$ contributes to the elimination of one and only one synchronization edge $s$, then $(R - \{r\} + \{s\})$ is a resynchronization of $G$.

Fact 11.2: Suppose that $G$ is a synchronization graph, $R$ is a resynchronization of $G$, $s$ is a synchronization edge in $G$, and $s'$ is a resynchronization edge in $R$ such that $\mathrm{delay}(s') > \mathrm{delay}(s)$. Then $s'$ does not contribute to the elimination of $s$.

For example, let $G$ denote the synchronization graph in Figure 11.2(a). Figure 11.2(b) shows a resynchronization $R$ of $G$. In the resynchronized graph of Figure 11.2(b), the resynchronization edge directed to $y_3$ does not contribute to the elimination of any of the synchronization edges of $G$, and thus Fact 11.1 guarantees that the set obtained from $R$ by removing this edge, illustrated in Figure 11.2(c), is also a resynchronization of $G$. In Figure 11.2(c), it is easily verified that the resynchronization edge directed to $y_4$ contributes to the elimination of exactly one synchronization edge, and from Fact 11.1, we have that replacing this resynchronization edge with the synchronization edge that it eliminates, as illustrated in Figure 11.2(d), also yields a resynchronization of $G$.

Figure 11.1. An illustration of Definition 11.1. Here each processor executes a single actor. A resynchronization of the synchronization graph in (a) is illustrated in (b). In this resynchronization, the two resynchronization edges, which together form a path from V to W, both contribute to the elimination of (V, W).


As discussed in Section 10.2, resynchronization cannot decrease the estimated throughput since it manipulates only the feedforward edges of a synchronization graph. Frequently in real-time DSP systems, latency is also an important issue, and although resynchronization does not degrade the estimated throughput, it generally does increase the latency. This section defines latency for self-timed multiprocessor systems.


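Suppose $G_0$ is an application graph, $G$ is a synchronization graph that results from a multiprocessor schedule for $G_0$, $x$ is an execution source (an actor that has no input edges or has nonzero delay on all input edges) in $G$, and $y$ is an actor in $G$ other than $x$. We define the latency from $x$ to $y$ by $L_G(x, y) \equiv \mathrm{end}(y, 1 + \rho_{G_0}(x, y))$. We refer to $x$ as the latency input associated with this measure of latency, and we refer to $y$ as the latency output.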

Intuitively, the latency is the time required for the first invocation of the latency input to influence the associated latency output, and thus the latency corresponds to the critical path in the dataflow implementation to the first output invocation that is influenced by the input. This interpretation of the latency as the critical path is widely used in VLSI signal processing [Kun88, Mad95].

(Recall that $\mathrm{start}(v, k)$ and $\mathrm{end}(v, k)$ denote the times at which invocation $k$ of actor $v$ commences and completes execution. Also, note that $\mathrm{start}(x, 1) = 0$ since $x$ is an execution source.)

In general, the latency can be computed by performing a simple simulation of the ASAP execution for $G$ through the $(1 + \rho_{G_0}(x, y))$th execution of $y$. Such a simulation can be performed as a functional simulation of an HSDFG $G_{sim}$ that has the same topology (vertices and edges) as $G$, and that maintains the simulation time of each processor in the values of data tokens. Each initial token (delay) in $G_{sim}$ is initialized to have the value 0, since these tokens are present at time 0. Then, a data-driven simulation of $G_{sim}$ is carried out. In this simulation, an actor may execute whenever it has sufficient data, and the value of the output token produced by the invocation of any actor $z$ in the simulation is given by

$$t(z) + \max(\{v \mid v \in T\}),$$

where $T$ is the set of token values consumed during the actor execution. In such a simulation, the $i$th token value produced by an actor $z$ gives the completion time of the $i$th invocation of $z$ in the ASAP execution of $G$. Thus, the latency can be determined as the value of the $(1 + \rho_{G_0}(x, y))$th output token produced by $y$. With careful implementation of the functional simulator described above, the latency can be determined in $O(d \times \max(\{|V|, s\}))$ time, where $d = 1 + \rho_{G_0}(x, y)$ and $s$ denotes the number of synchronization edges in $G$. The simulation approach described above is similar to approaches described in [TTL95].

For a broad class of synchronization graphs, latency can be analyzed even more efficiently during resynchronization. This is the class of synchronization graphs in which the first invocation of the latency output is influenced by the first invocation of the latency input. Equivalently, it is the class of graphs that contain at least one delayless path in the corresponding application graph directed from the latency input to the latency output. For transparent synchronization graphs, we can directly apply well-known longest-path based techniques for computing latency.

Suppose that $G_0$ is an application graph, $x$ is a source actor in $G_0$, and $y$ is an actor in $G_0$ that is not identical to $x$. If $\rho_{G_0}(x, y) = 0$, then we say that $G_0$ is transparent with respect to latency input $x$ and latency output $y$. If $G$ is a synchronization graph that corresponds to a multiprocessor schedule for $G_0$, we also say that $G$ is transparent.
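The data-driven functional simulation used to compute latency in the general case can be sketched as below. Tokens carry time stamps, initial tokens carry the value 0, and each firing of an actor $z$ emits $t(z)$ plus the maximum consumed value. This is a minimal sketch under stated assumptions, not the simulator described in the text: the function name `asap_completion_time`, the `(src, snk, delay)` edge format, and the example graph are hypothetical, and the sketch does not aim for the time bound quoted above.

```python
from collections import deque


def asap_completion_time(exec_time, edges, y, d):
    """Completion time of the d-th invocation of actor y (ASAP execution).

    exec_time: dict actor -> execution time t(actor)
    edges: list of (src, snk, delay); each edge is a FIFO whose initial
           tokens (one per unit of delay) carry the value 0.
    """
    queues = [deque([0] * delay) for (_, _, delay) in edges]
    inputs = {a: [i for i, e in enumerate(edges) if e[1] == a] for a in exec_time}
    outputs = {a: [i for i, e in enumerate(edges) if e[0] == a] for a in exec_time}
    fired = dict.fromkeys(exec_time, 0)
    while True:
        progress = False
        for a in exec_time:
            # An actor may fire when every input edge holds a token; no
            # actor needs more than d firings to determine y's d-th firing.
            if fired[a] < d and all(queues[i] for i in inputs[a]):
                consumed = [queues[i].popleft() for i in inputs[a]]
                value = exec_time[a] + max(consumed, default=0)
                for i in outputs[a]:
                    queues[i].append(value)
                fired[a] += 1
                progress = True
                if a == y and fired[a] == d:
                    return value
        if not progress:
            raise ValueError("deadlock: no actor can fire")


# Hypothetical graph: A -> B -> C and A -> C with no delay, C -> A with delay 1.
t = {"A": 2, "B": 3, "C": 1}
g = [("A", "B", 0), ("B", "C", 0), ("A", "C", 0), ("C", "A", 1)]
print(asap_completion_time(t, g, "C", 1))   # prints 6
print(asap_completion_time(t, g, "C", 2))   # prints 12
```

In the example the first invocation of C completes at time 2 + 3 + 1 = 6, the length of the longest delayless path into C, which is the quantity the transparent-graph analysis computes without simulation.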

If a synchronization graph is transparent with respect to a latency input/output pair, then the latency can be computed efficiently using longest path calculations on an acyclic graph that is derived from the input synchronization graph $G$. This acyclic graph, which we call the first-iteration graph of $G$, denoted $fi(G)$, is constructed by removing all edges from $G$ that have nonzero delay; adding a vertex $\upsilon$, which represents the beginning of execution; setting $t(\upsilon) = 0$; and adding delayless edges from $\upsilon$ to each source actor (other than $\upsilon$) of the partial construction until the only source actor that remains is $\upsilon$. Figure 11.3 illustrates the derivation of $fi(G)$.

Figure 11.3. An example used to illustrate the construction of $fi(G)$. The graph on the bottom is $fi(G)$ if $G$ is the top graph.

Given two vertices $x$ and $y$ in $fi(G)$ such that there is a path in $fi(G)$ from $x$ to $y$, we denote the sum of the execution times along a path from $x$ to $y$ that has maximum cumulative execution time by $T_{fi(G)}(x, y)$. That is,

$$T_{fi(G)}(x, y) = \max\Bigl(\sum_{p \text{ traverses } z} t(z) \Bigm| p \text{ is a path from } x \text{ to } y \text{ in } fi(G)\Bigr).$$

If there is no path from $x$ to $y$, then we define $T_{fi(G)}(x, y)$ to be $-\infty$. Note that $T_{fi(G)}(x, y) < +\infty$ for all $x, y$, since $fi(G)$ is acyclic. The values $T_{fi(G)}(x, y)$ for all pairs $x, y$ can be computed in $O(n^3)$ time, where $n$ is the number of actors in $G$, by using a simple adaptation of the Floyd-Warshall algorithm described in Section 3.13.3.

Lemma: Suppose that $G_0$ is an HSDFG that is transparent with respect to latency input $x$ and latency output $y$, $G_s$ is the synchronization graph that results from a multiprocessor schedule for $G_0$, and $G$ is a resynchronization of $G_s$. Then $\rho_G(x, y) = 0$, and thus $T_{fi(G)}(x, y) \ge 0$ (i.e., $T_{fi(G)}(x, y) \ne -\infty$).

Proof: Since $G_0$ is transparent, there is a delayless path $p$ in $G_0$ from $x$ to $y$. Let $(u_1, u_2, \ldots, u_n)$, where $x = u_1$ and $y = u_n$, denote the sequence of actors traversed by $p$. From the semantics of the HSDFG $G_0$, it follows that for $1 \le i < n$, either $u_i$ and $u_{i+1}$ execute on the same processor, with $u_i$ scheduled earlier than $u_{i+1}$, or there is a zero-delay synchronization edge in $G_s$ directed from $u_i$ to $u_{i+1}$. Thus, for $1 \le i < n$, we have $\rho_{G_s}(u_i, u_{i+1}) = 0$, and thus, that $\rho_{G_s}(x, y) = 0$. Since $G$ is a resynchronization of $G_s$, it follows from the properties of resynchronization established in Chapter 10 that $\rho_G(x, y) = 0$.

The following theorem gives an efficient means for computing the latency for transparent synchronization graphs.

Theorem: Suppose that $G$ is a synchronization graph that is transparent with respect to latency input $x$ and latency output $y$. Then $L_G(x, y) = T_{fi(G)}(\upsilon, y)$.

Proof: By induction, we show that for every actor $w$ in $fi(G)$,

$$\mathrm{end}(w, 1) = T_{fi(G)}(\upsilon, w), \tag{11-3}$$

which clearly implies the desired result. First, let $mt(w)$ denote the maximum number of actors that are traversed by a path in $fi(G)$ (over all paths in $fi(G)$) that starts at $\upsilon$ and terminates at $w$. If $mt(w) = 1$, then clearly $w = \upsilon$. Since both the LHS and RHS of (11-3) are identically equal to $t(\upsilon) = 0$ when $w = \upsilon$, we have that (11-3) holds whenever $mt(w) = 1$.

Now suppose that (11-3) holds whenever $mt(w) \le k$, for some $k \ge 1$, and consider the scenario $mt(w) = k + 1$. Clearly, in the self-timed execution of $G$, invocation $w_1$, the first invocation of $w$, commences as soon as all invocations in the set $\{z_1 \mid z \in P_w\}$ have completed execution, where $z_1$ denotes the first invocation of actor $z$, and $P_w$ is the set of predecessors of $w$ in $fi(G)$. All members $z \in P_w$ satisfy $mt(z) \le k$, since otherwise $mt(w)$ would exceed $(k + 1)$. Thus, from the induction hypothesis, we have

$$\mathrm{start}(w, 1) = \max(\{\mathrm{end}(z, 1) \mid z \in P_w\}) = \max(\{T_{fi(G)}(\upsilon, z) \mid z \in P_w\}),$$

which implies that

$$\mathrm{end}(w, 1) = \max(\{T_{fi(G)}(\upsilon, z) \mid z \in P_w\}) + t(w). \tag{11-4}$$

But, by definition of $T_{fi(G)}$, the RHS of (11-4) is clearly equal to $T_{fi(G)}(\upsilon, w)$, and thus we have that $\mathrm{end}(w, 1) = T_{fi(G)}(\upsilon, w)$. We have shown that (11-3) holds for $mt(w) = 1$, and that whenever it holds for $mt(w) = k \ge 1$, it must hold for $mt(w) = (k + 1)$. Thus, (11-3) holds for all values of $mt(w)$.

In the context of resynchronization, the main benefit of transparent synchronization graphs is that the change in latency induced by adding a new synchronization edge (a “resynchronization operation”) can be computed in $O(1)$ time, given $T_{fi(G)}(a, b)$ for all actor pairs $(a, b)$. We will discuss this further in Section 11.5. Since many practical application graphs contain delayless paths from input to output and these graphs admit a particularly efficient means for computing latency, the first implementation of latency-constrained resynchronization was targeted to the class of transparent synchronization graphs [BSL96a]. However, the overall resynchronization framework described in this chapter does not depend on any particular method for computing latency, and thus, it can be fully applied to general graphs (with a moderate increase in complexity) using the ASAP simulation approach mentioned above. This framework can also be applied to subclasses of synchronization graphs other than transparent graphs for which efficient techniques for computing latency are discovered.

An instance of the latency-constrained resynchronization problem consists of a synchronization graph $G$ with latency input $x$ and latency output $y$, and a latency constraint $L_{max} \ge L_G(x, y)$. A solution to such an instance is a resynchronization $R$ such that $L_{R(G)}(x, y) \le L_{max}$ and no resynchronization of $G$ that results in a latency less than or equal to $L_{max}$ has smaller cardinality than $R$. Given a synchronization graph $G$ with latency input $x$ and latency output $y$, and a latency constraint $L_{max}$, we say that a resynchronization $R$ of $G$ is a latency-constrained resynchronization (LCR) if $L_{R(G)}(x, y) \le L_{max}$. Thus, the latency-constrained resynchronization problem is the problem of determining a minimal LCR.

Generally, resynchronization can be viewed as complementary to the Convert-to-SC-graph optimization defined in Chapter 10: resynchronization is performed first, followed by Convert-to-SC-graph. Under severe latency constraints, it may not be possible to accept the solution computed by Convert-to-SC-graph, in which case the feedforward edges that emerge from the resynchronized solution must be implemented with FFS. In such a situation, Convert-to-SC-graph can be attempted on the original (before resynchronization) graph to see if it achieves a better result than resynchronization without Convert-to-SC-graph. However, for transparent synchronization graphs that have only one source SCC and only one sink SCC, the latency is not affected by Convert-to-SC-graph, and thus, for such systems, resynchronization and Convert-to-SC-graph are fully complementary. This is fortunate since such systems arise frequently in practice.

Trade-offs between latency and throughput have been studied by Potkonjak and Srivastava in the context of transformations for dedicated implementation of linear computations [PS94]. Because this work is based on synchronous implementations, it does not address the synchronization issues and opportunities that we encounter in the self-timed dataflow context.
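For transparent graphs, the longest-path computation on the acyclic graph built from the delayless edges can be sketched as follows. The sketch removes nonzero-delay edges, adds a start vertex with execution time 0 connected to every remaining source actor, and then runs a max-plus variant of the Floyd-Warshall recurrence in which a path's weight is the sum of the execution times of the actors it traverses. The function names, the `"__start__"` vertex name, and the example graph are hypothetical, and the input is assumed to contain no delayless cycle.

```python
NEG_INF = float("-inf")


def first_iteration_graph(exec_time, edges, start="__start__"):
    """Drop nonzero-delay edges and add a zero-time start vertex."""
    zero = [(a, b) for (a, b, delay) in edges if delay == 0]
    has_pred = {b for (_, b) in zero}
    t = dict(exec_time)
    t[start] = 0
    # Connect the start vertex to every actor that has no delayless input.
    zero += [(start, a) for a in exec_time if a not in has_pred]
    return t, zero


def longest_paths(t, zero_edges):
    """T[x][y] = max total execution time over paths from x to y, or -inf."""
    T = {a: {b: NEG_INF for b in t} for a in t}
    for a in t:
        T[a][a] = t[a]
    for a, b in zero_edges:
        T[a][b] = max(T[a][b], t[a] + t[b])
    for k in t:
        for i in t:
            if T[i][k] == NEG_INF:
                continue
            for j in t:
                if T[k][j] == NEG_INF:
                    continue
                # t[k] is counted in both halves, so subtract it once.
                cand = T[i][k] + T[k][j] - t[k]
                if cand > T[i][j]:
                    T[i][j] = cand
    return T


# Hypothetical graph: A -> B -> C and A -> C with no delay, C -> A with delay 1.
exec_time = {"A": 2, "B": 3, "C": 1}
edges = [("A", "B", 0), ("B", "C", 0), ("A", "C", 0), ("C", "A", 1)]
t, zero = first_iteration_graph(exec_time, edges)
T = longest_paths(t, zero)
print(T["__start__"]["C"])   # prints 6
```

The three nested loops give the cubic running time mentioned for the all-pairs computation; once the table is available, looking up a pair of actors is a constant-time operation.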

This section shows that the latency-constrained resynchronization problem is NP-hard even for the very restricted subclass of synchronization graphs in which each SCC corresponds to a single actor, and all synchronization edges have zero delay.

As with the maximum-throughput resynchronization problem, discussed in Chapter 10, the intractability of this special case of latency-constrained resynchronization can be established by a reduction from set covering. To illustrate this reduction, we suppose that we are given the set $X = \{x_1, x_2, x_3, x_4\}$ and the family of subsets $T = \{t_1, t_2, t_3\}$, where each $t_j$ is a subset of $X$. Figure 11.4 illustrates the instance of latency-constrained resynchronization that we derive from the instance of set covering specified by $(X, T)$. Here, each actor corresponds to a single processor and the self-loop edge for each actor is not shown. The numbers beside the actors specify the actor execution times, and the latency constraint is $L_{max} = 103$. In the graph of Figure 11.4, which we denote by $G$, the labeled edges correspond respectively to the members of the set $X$ in the set-covering instance, and the vertex pairs (resynchronization candidates) $(v, st_1)$, $(v, st_2)$, $(v, st_3)$ correspond to the members of $T$. For each relation $x_i \in t_j$, an edge exists that is directed from $st_j$ to $sx_i$. The latency input and latency output are defined to be in and out respectively, and it is assumed that $G$ is transparent.

The synchronization graph that results from an optimal resynchronization of $G$ is shown in Figure 11.5, with redundant resynchronization edges removed. The resynchronization candidates that were chosen to obtain the solution shown in Figure 11.5 correspond to a solution of $(X, T)$ that consists of two members of $T$.

Figure 11.5. The synchronization graph that results from a solution to the instance of latency-constrained resynchronization shown in Figure 11.4.

A correspondence between the set-covering instance $(X, T)$ and the instance of latency-constrained resynchronization defined by Figure 11.4 arises from two properties of the construction described above:

Observation 5: ($x_i \in t_j$ in the set-covering instance) $\Leftrightarrow$ ($(v, st_j)$ subsumes the edge corresponding to $x_i$ in $G$).

Observation 6: If $R$ is an optimal LCR of $G$, then each resynchronization edge in $R$ is of the form

$$(v, st_i),\ i \in \{1, 2, 3\}, \quad \text{or of the form} \quad (st_j, sx_i),\ x_i \notin t_j. \tag{11-5}$$

The first observation is immediately apparent from inspection of Figure 11.4. A proof of the second observation follows.

Proof of Observation 6: We must show that no other resynchronization edges can be contained in an optimal LCR of $G$. Figure 11.6 specifies arguments with which we can discard all possibilities other than those given in (11-5). In the matrix shown in Figure 11.6(a), each entry specifies an index into the list of arguments given in Figure 11.6(b). For each of the six categories of arguments, except for #6, the reasoning is either obvious or easily understood from inspection of Figure 11.4. A proof of argument #6 follows shortly within this same section. For example, some candidate edges cannot be resynchronization edges in $R$ because they already exist in the original synchronization graph; an edge directed to $w$ from an actor that $w$ already reaches in $G$ cannot be in $R$ because it would close a cycle; other candidates are excluded since otherwise there would be a path from in to out that traverses $w$, among other actors, and thus, the latency would be increased to at least 204; candidates whose source is in are excluded by Lemma 10.2, since the corresponding minimum path delay from in is already 0 in $G$; and the remaining candidates of this kind are excluded since otherwise there would be a delayless self loop. Three of the entries in Figure 11.6 point to multiple argument categories. For example, if $x_j \in t_i$, then $(sx_j, st_i)$ introduces a cycle, and if $x_j \notin t_i$, then $(sx_j, st_i)$ cannot be contained in $R$ because it would increase the latency beyond $L_{max}$.

The entries in Figure 11.6 marked OK are simply those that correspond to (11-5), and thus we have justified Observation 6.

In the proof of Observation 6, we deferred the proof of Argument #6 for Figure 11.6. A proof of this argument follows.

Proof of Argument #6 in Figure 11.6: By contraposition, we show that $(w, z)$ cannot contribute to the elimination of any synchronization edge of $G$, and thus, from Fact 11.1, it follows from the optimality of $R$ that $(w, z) \notin R$. Suppose that $(w, z)$ contributes to the elimination of some synchronization edge $s$. Then

$$\rho_{R(G)}(\mathrm{src}(s), w) = \rho_{R(G)}(z, \mathrm{snk}(s)) = 0, \tag{11-6}$$

where $R(G)$ denotes the resynchronized graph (11-7). From the matrix in Figure 11.6, we see that no resynchronization edge can have $z$ as the source vertex. Thus, $\mathrm{snk}(s) \in \{z, out\}$. Now, if $\mathrm{snk}(s) = z$, then from (11-6), there is a zero-delay path from $\mathrm{src}(s)$ to $w$ in $R(G)$. However, the existence of such a path implies the existence of a path from in to out that traverses $w$ and $z$, among other actors, which in turn implies that $L_{R(G)}(in, out) \ge 204$, and thus that $R$ is not a valid LCR.

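Figure 11.6(b). The arguments indexed by the entries of the matrix in Figure 11.6(a):

1. Exists in $G$.
2. Introduces a cycle.
3. Increases the latency beyond $L_{max}$.
4. $\rho_G(a_1, a_2) = 0$ (Lemma 10.2).
5. Introduces a delayless self loop.
6. Proof is given below.

Note: one matrix entry assumes that $x_j \notin t_i$; otherwise a different argument applies.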


On the other hand, if $\mathrm{snk}(s) = out$, then $\mathrm{src}(s)$ is either $z$ or one of the actors $sx_i$. Now, from (11-6), $\mathrm{src}(s) = z$ implies the existence of a zero-delay path from $z$ to $w$ in $R(G)$, which implies the existence of a path from in to out whose traversal in turn implies that the latency is at least 204, which exceeds $L_{max}$. On the other hand, if $\mathrm{src}(s) = sx_i$ for some $i$, then since from Figure 11.6 there are no resynchronization edges that have an $sx_i$ as the source, it follows from (11-6) that there must be a zero-delay path in $G$ from $sx_i$ to $w$. The existence of such a path, however, implies the existence of a cycle in $G$. Thus, $\mathrm{snk}(s) = out$ implies that $R$ is not an LCR.

The following observation states that a resynchronization edge of the form $(st_j, sx_i)$ contributes to the elimination of exactly one synchronization edge, which is the edge of $G$ that corresponds to $x_i$.

Observation 7: Suppose that $R$ is an optimal LCR of $G$, and suppose that $e = (st_j, sx_i)$ is a resynchronization edge in $R$, for some $i \in \{1, 2, 3, 4\}$, $j \in \{1, 2, 3\}$ such that $x_i \notin t_j$. Then $e$ contributes to the elimination of one and only one synchronization edge, namely the edge of $G$ that corresponds to $x_i$.

Proof: Since $R$ is an optimal LCR, we know that $e$ must contribute to the elimination of at least one synchronization edge (from Fact 11.1). Let $s$ be some synchronization edge such that $e$ contributes to the elimination of $s$. Then

Now, from Figure 11.6, it is apparent that there are no resynchronization edges in $R$ that have $sx_i$ or $out$ as their source actor. Thus, from (11-8), $snk(s) = sx_i$ or $snk(s) = out$. Now, if $snk(s) = out$, then $src(s)$ is either one of the actors $sx_k$, for some $k$, or [...]. However, since no resynchronization edge has a member of $\{sx_1, \ldots, sx_4\}$ as its source, we must (from (11-8)) rule out the former case. Similarly, in the latter case, (11-8) implies that there exists a zero-delay path in $R(G)$ from $src(s)$ to $st_j$, which in turn implies that $L_{R(G)}(in, out) \geq 140$. But this is not possible, since the assumption that $R$ is an LCR guarantees that $L_{R(G)}(in, out) \leq 103$. Thus, we conclude that $snk(s) \neq out$, and thus, that $snk(s) = sx_i$.

Now, $snk(s) = sx_i$ implies that (a) $s = $ [...], or (b) $s = (st_k, sx_i)$ for some $k$ such that $x_i \in t_k$ (recall that $x_i \notin t_j$, and thus, that $k \neq j$). If $s = (st_k, sx_i)$, then from (11-8), $\rho_{R(G)}(st_k, st_j) = 0$. It follows that for any member of $t_j$ there is a zero-delay path in $R(G)$ that traverses both $st_k$ and $st_j$. Thus, $s = (st_k, sx_i)$ does not hold, since otherwise $L_{R(G)}(in, out) \geq 140$. Thus, we are left with only possibility (a).

Now, suppose that we are given an optimal LCR $R$ of $G$. From Observation 7 and Fact 11.1, we have that for each resynchronization edge of the form considered in Observation 7 in $R$, we can replace this resynchronization edge with [...] and obtain another optimal LCR. Thus, from Observation 6, we can efficiently obtain an optimal LCR $R'$ such that all resynchronization edges in $R'$ are of the form $(v, st_i)$ or [...]. For each $x_i \in X$

such that

[...] (11-9)

we have that [...] $\notin R'$. This is because $R'$ is assumed to be optimal, and thus, $\Psi(R', G)$ contains no redundant synchronization edges. For each $x_i \in X$ for which (11-9) does not hold, we can replace [...] with any $(v, st_j)$ that satisfies $x_i \in t_j$, and since such a replacement does not affect the latency, we know that the result will be another optimal LCR for $G$. In this manner, if we repeatedly replace each [...] that does not satisfy (11-9), then we obtain an optimal LCR $R''$ such that

each resynchronization edge in $R''$ is of the form $(v, st_i)$, (11-10)

and

for each $x_i \in X$, there exists a resynchronization edge $(v, st_j)$ in $R''$ such that $x_i \in t_j$. (11-11)

It is easily verified that the set of synchronization edges eliminated by $R''$ is [...]. Thus, the set $T' \equiv \{t_j \mid (v, st_j)$ is a resynchronization edge in $R''\}$ is a cover for $X$, and the cost (number of synchronization edges) of the resynchronization $R''$ is given by an expression involving $|X|$ and the number of synchronization edges in the original synchronization graph. Now, it is also easily verified (from Figure 11.4) that given an arbitrary cover for $X$, the resynchronization defined by

[...] (11-12)

is also a valid LCR of $G$, and that the associated cost has the same form. Thus, it follows from the optimality of $R''$ that $T'$ must be a minimal cover for $X$, given the family of subsets $T$.

To summarize, we have shown how, from the particular instance $(X, T)$ of set-covering, we can construct a synchronization graph $G$ such that from a solution to the latency-constrained resynchronization problem instance defined by $G$, we can efficiently derive a solution to $(X, T)$. This example of the reduction from set-covering to latency-constrained resynchronization is easily generalized to an arbitrary set-covering instance. The generalized construction of the initial synchronization graph $G$ is specified by the steps listed in Figure 11.7. The main task in establishing a general correspondence between latency-constrained resynchronization and set-covering is generalizing Observation 6 to


Figure 11.7. A procedure for constructing an instance of latency-constrained resynchronization from an instance of set-covering such that a solution to the former yields a solution to the latter. (Recoverable steps: instantiate the actors $v$, $w$, $z$, $in$, and $out$ together with their associated subgraphs; for each $t \in T$, instantiate an actor labeled $st$; for each $x \in X$, instantiate an actor labeled $sx$ that has execution time 60; for each $x \in t$, instantiate an edge between the corresponding $st$ and $sx$ actors.)
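The set-covering problem that the reduction above starts from can be made concrete with a small sketch. The following Python fragment is only an illustration of what a "minimal cover for $X$ given the family of subsets $T$" means; it performs an exhaustive search (exponential in $|T|$, as expected for an NP-hard problem) and is not part of the construction in the text. The function name `minimum_set_cover` and the particular sets assigned to `t1`, `t2`, `t3` are hypothetical choices made for the example.

```python
from itertools import combinations

def minimum_set_cover(universe, family):
    """Return the names of a smallest sub-collection of `family` whose
    union contains `universe`, or None if no cover exists.

    `family` maps a subset name (hypothetical labels such as "t1") to a
    set of elements. Exhaustive search: tries sub-collections in order
    of increasing size, so the first cover found is a minimum one.
    """
    universe = set(universe)
    names = sorted(family)
    for size in range(len(names) + 1):
        for chosen in combinations(names, size):
            covered = set().union(*(family[name] for name in chosen))
            if covered >= universe:  # superset test: everything is covered
                return list(chosen)
    return None

# Hypothetical instance with four elements and three subsets.
X = {"x1", "x2", "x3", "x4"}
T = {"t1": {"x1", "x3"}, "t2": {"x1", "x2"}, "t3": {"x2", "x4"}}
print(minimum_set_cover(X, T))  # ['t1', 't3']
```

In the reduction, selecting a subset $t_j$ for the cover plays the role of selecting a resynchronization edge associated with the actor $st_j$, which is why a minimum cover corresponds to an optimal LCR of the constructed graph.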

constraints on the task (actor) execution times. Two-processor optimality results in multiprocessor scheduling have also been reported in the context of a stochastic model for parallel computation in which tasks have random execution times and communication patterns [Nic89].

execution times $\{t(x_i)\}$, such that each $x_i$ is the $i$th actor scheduled on the processor that corresponds to the source SCC of the synchronization graph; a set of sink actors $\{y_i\}$ with associated execution times $\{t(y_i)\}$, such that each $y_i$ is the $i$th actor scheduled on the processor that corresponds to the sink SCC of the synchronization graph; a set $S$ of non-redundant synchronization edges such that each $s_i \in S$ is directed from one of the source actors to one of the sink actors; and a latency constraint $L_{max}$, which is a positive integer. A solution to such an instance is a minimal resynchronization $R$ that satisfies the latency constraint. In the remainder of this section, we denote the synchronization graph corresponding to the generic instance of 2LCR by $\tilde{G}$.

We assume that $delay(s_i) = 0$ for all $s_i$, and we refer to the subproblem that results from this restriction as delayless 2LCR. This section demonstrates an algorithm that solves the delayless 2LCR problem in polynomial time in $N$, where $N$ is the number of vertices in $\tilde{G}$. An extension of this algorithm to the general 2LCR problem (arbitrary delays can be present) is also given.

An efficient polynomial-time solution to delayless 2LCR can be derived by reducing the problem to a special case of set-covering called interval covering, in which we are given an ordering $w_1, w_2, \ldots, w_N$ of the members of $X$ (the set that must be covered), such that the collection of subsets $T$ consists entirely of subsets of the form $\{w_a, w_{a+1}, \ldots, w_b\}$, $1 \leq a \leq b \leq N$. Thus, while general set-covering involves covering a set from a collection of subsets, interval covering amounts to covering an interval from a collection of subintervals.

Interval covering can be solved in $O(|X||T|)$ time by a simple procedure that first selects the subset $\{w_1, w_2, \ldots, w_{b_1}\}$, where

$$b_1 = \max(\{b \mid (w_1, w_b \in t) \text{ for some } t \in T\});$$

then selects any subset of the form $\{w_{a_2}, w_{a_2+1}, \ldots, w_{b_2}\}$, $a_2 \leq b_1 + 1$, where

$$b_2 = \max(\{b \mid (w_{b_1+1}, w_b \in t) \text{ for some } t \in T\});$$

then selects any subset of the form $\{w_{a_3}, w_{a_3+1}, \ldots, w_{b_3}\}$, $a_3 \leq b_2 + 1$, where

$$b_3 = \max(\{b \mid (w_{b_2+1}, w_b \in t) \text{ for some } t \in T\});$$

and so on until $b_n = N$.
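The greedy procedure for interval covering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's own implementation: each subset $\{w_a, \ldots, w_b\}$ is represented by the hypothetical pair `(a, b)` of 1-based indices, and the function name `greedy_interval_cover` is chosen here for the example. Each pass scans all of $T$ and extends the covered prefix by at least one element, which matches the $O(|X||T|)$ bound.

```python
def greedy_interval_cover(n, intervals):
    """Cover the positions 1..n using as few intervals as possible.

    `intervals` is a list of (a, b) pairs with 1 <= a <= b <= n, each
    standing for the subset {w_a, ..., w_b}. Returns the chosen pairs
    in left-to-right order, or None if the positions cannot be covered.
    """
    chosen = []
    covered_up_to = 0  # positions 1..covered_up_to are already covered
    while covered_up_to < n:
        target = covered_up_to + 1  # first position not yet covered
        # Intervals that contain the target position.
        candidates = [iv for iv in intervals if iv[0] <= target <= iv[1]]
        if not candidates:
            return None  # no subset contains w_target
        # Greedy step: take the candidate that reaches furthest right.
        best = max(candidates, key=lambda iv: iv[1])
        chosen.append(best)
        covered_up_to = best[1]
    return chosen

# Hypothetical instance: N = 6 and five subintervals.
print(greedy_interval_cover(6, [(1, 2), (1, 3), (2, 5), (4, 6), (6, 6)]))
# [(1, 3), (4, 6)]
```

Choosing the interval with the largest right endpoint at each step is what makes the greedy choice safe: any cover must contain some interval holding the first uncovered position, and exchanging it for the furthest-reaching one never uncovers anything.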


To reduce delayless 2LCR to interval covering, we start with the following observations.

Observation 8: Suppose that $R$ is a resynchronization of $\tilde{G}$, $r \in R$, and $r$ contributes to the elimination of synchronization edge $s$. Then $r$ subsumes $s$. Thus, the set of synchronization edges that $r$ contributes to the elimination of is simply the set of synchronization edges that are subsumed by $r$.

Proof: This follows immediately from the restriction that there can be no resynchronization edges directed from a $y_j$ to an $x_i$ (feedforward resynchronization), and thus in $\Psi(R, \tilde{G})$ there can be at most one synchronization edge in any path directed from $src(s)$ to $snk(s)$. QED.

Observation 9: If $R$ is a resynchronization of $\tilde{G}$, then

$$L_{\Psi(R, \tilde{G})}(x_1, y_q) = \max(\{t_{pred}(src(s')) + t_{succ}(snk(s')) \mid s' \in R\}),$$

where $t_{pred}(x_i) \equiv \sum_{j \leq i} t(x_j)$ for $i = 1, 2, \ldots, p$, and $t_{succ}(y_i) \equiv \sum_{j \geq i} t(y_j)$ for $i = 1, 2, \ldots, q$.

Proof: Given a synchronization edge $(x_a, y_b) \in R$, there is exactly one delayless path from $x_1$ to $y_q$ that contains $(x_a, y_b)$, and the set of vertices traversed by this path is $\{x_1, x_2, \ldots, x_a, y_b, y_{b+1}, \ldots, y_q\}$. The desired result follows immediately. QED.

Now, corresponding to each of the source processor actors $x_i$ that satisfies $t_{pred}(x_i) + t(y_q) \leq L_{max}$, we define an ordered pair of actors (a "resynchronization candidate") by

Consider the example shown in Figure 11.8. Here, we assume that $t(\cdot) = 1$ for each actor and $L_{max} = 10$. From (11-13), we have

If $v_i$ exists for a given $x_i$, then $d_0(v_i)$ can be viewed as the best resynchronization edge that has $x_i$ as the source actor, and thus, to construct an optimal LCR, we can select the set of resynchronization edges entirely from among the $v_i$. This is established by the following two observations.

Observation 10: Suppose that $R$ is an LCR of $\tilde{G}$, and suppose that $(x_a, y_b)$ is a delayless synchronization edge in $R$ such that $(x_a, y_b) \neq d_0(v_a)$. Then $(R - \{(x_a, y_b)\} + \{d_0(v_a)\})$ is an LCR of $\tilde{G}$.


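Proof: Let $v_a = (x_a, y_c)$ and $R' = (R - \{(x_a, y_b)\} + \{d_0(v_a)\})$, and observe that $v_a$ exists, since $(x_a, y_b) \in R$ and $R$ is an LCR.

From Observation 8 and the assumption that $(x_a, y_b)$ is delayless, the set of synchronization edges that $(x_a, y_b)$ contributes to the elimination of is simply the set of synchronization edges that are subsumed by $(x_a, y_b)$. Now, if $s$ is a synchronization edge that is subsumed by $(x_a, y_b)$, then

$$\rho_{\tilde{G}}(src(s), x_a) = 0 \text{ and } \rho_{\tilde{G}}(y_b, snk(s)) = 0. \tag{11-16}$$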

Figure 11.8. An instance of delayless, two-processor latency-constrained resynchronization. In this example, the execution times of all actors are identically equal to unity.

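From the definition of $v_a$, we have that $c \leq b$, and thus, that $\rho_{\tilde{G}}(y_c, y_b) = 0$. It follows from (11-16) that

$$\rho_{\tilde{G}}(src(s), x_a) = 0 \text{ and } \rho_{\tilde{G}}(y_c, snk(s)) = 0,$$

and thus, that $v_a$ subsumes $s$. Hence, $v_a$ subsumes all synchronization edges that $(x_a, y_b)$ contributes to the elimination of, and we can conclude that $R'$ is a valid resynchronization of $\tilde{G}$. From the definition of $v_a$, we know that $t_{pred}(x_a) + t_{succ}(y_c) \leq L_{max}$, and thus, since $R$ is an LCR, we have from Observation 9 that $R'$ is an LCR.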

From Fact .2 and the assumption that the members of $S$ are all delayless, an optimal LCR of $\tilde{G}$ consists only of delayless synchronization edges. Thus, from Observation 10, we know that there exists an optimal LCR that consists only of members of the form $d_0(v_i)$. Furthermore, from Observation 9, we know that a collection $V$ of $v_i$ is an LCR if and only if

$$\bigcup_{v \in V} \chi(v) = \{s_1, s_2, \ldots, s_n\},$$

where $\chi(v)$ is the set of synchronization edges that are subsumed by $v$. The following observation completes the correspondence between 2LCR and interval covering. Let

$s_1', s_2', \ldots, s_n'$ be the ordering of $s_1, s_2, \ldots, s_n$ specified by [...]


such

of ~ ~ s e ~ ~ t iLet o n (xj, and suppose k a positive integer i