
Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

5419

David Kaeli Kai Sachs (Eds.)

Computer Performance Evaluation and Benchmarking

SPEC Benchmark Workshop 2009
Austin, TX, USA, January 25, 2009
Proceedings


Volume Editors

David Kaeli
Northeastern University
Department of Electrical and Computer Engineering
360 Huntington Ave., Boston, MA 02115, USA
E-mail: [email protected]

Kai Sachs
Technische Universität Darmstadt
Dept. of Computer Science
Schlossgartenstr. 73, 64289 Darmstadt, Germany
E-mail: [email protected]

Library of Congress Control Number: Applied for
CR Subject Classification (1998): B.2.4, B.2.2, B.3.3, B.8, C.1, B.1, B.7.1
LNCS Sublibrary: SL 2 – Programming and Software Engineering
ISSN 0302-9743
ISBN-10 3-540-93798-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-93798-2 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2009
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
SPIN: 12603886   06/3180   5 4 3 2 1 0

Preface

This volume contains the set of papers presented at the SPEC Benchmark Workshop 2009, held on January 25, 2009 in Austin, Texas, USA. The program included eight refereed papers, a keynote talk on virtualization technology benchmarking, an invited paper on power benchmarking and a panel on multi-core benchmarking. Each refereed paper was reviewed by at least four Program Committee members. The result is a collection of high-quality papers discussing current issues in the area of benchmarking research and technology.

A number of people contributed to the success of this workshop. Rudi Eigenmann served as General Chair and ably handled many of the details involved with providing a high-quality meeting. We would like to thank the members of the Program Committee for their time and effort in arriving at a high-quality program. We would also like to acknowledge the guidance provided by the SPEC Workshop Steering Committee. We would like to thank the staff at Springer for their cooperation and support. We want to particularly recognize Dianne Rice for her assistance and guidance, and also Kathy Power, Cathy Sandifer and the whole SPEC office for their help. And finally, we want to thank all SPEC members for their continued support and sponsorship of this meeting.

January 2009

David Kaeli
Kai Sachs

Organization

SPEC Benchmark Workshop 2009 was sponsored by the Standard Performance Evaluation Corporation (SPEC) in cooperation with the IEEE Technical Committee on Computer Architecture (TCCA).

Executive Committee

General Chair:      Rudi Eigenmann (Purdue University, USA)
Program Chair:      David Kaeli (Northeastern University, USA)
Publication Chair:  Kai Sachs (TU Darmstadt, Germany)

Program Committee

Jose Nelson Amaral      University of Alberta, Canada
Umesh Bellur            Indian Institute of Technology Bombay, India
Anton Chernoff          AMD, USA
Lieven Eeckhout         University of Ghent, Belgium
Rudi Eigenmann          Purdue University, USA
Jose Gonzalez           Intel Barcelona, Spain
John L. Henning         Sun Microsystems, USA
Lizy K. John            University of Texas at Austin, USA
David Kaeli             Northeastern University, USA
Helen Karatza           Aristotle University of Thessaloniki, Greece
Samuel Kounev           Universität Karlsruhe (TH), Germany
Tao Li                  University of Florida, USA
David Lilja             University of Minnesota, USA
Christoph Lindemann     University of Leipzig, Germany
John Mashey             Consultant, USA
Jeffrey Reilly          Intel Corporation, USA
Kai Sachs               TU Darmstadt, Germany
Resit Sendag            University of Rhode Island, USA
Erich Strohmaier        Lawrence Berkeley National Laboratory, USA
Bronis Supinski         Lawrence Livermore National Laboratory, USA
Petr Tůma               Charles University in Prague, Czech Republic
Reinhold Weicker        (formerly) Fujitsu Siemens, Germany


Workshop Steering Committee

Alan Adamson            IBM, Canada
Jose Nelson Amaral      University of Alberta, Canada
David Bader             Georgia Tech, USA
Rudi Eigenmann          Purdue University, USA
Rema Hariharan          AMD, USA
John L. Henning         Sun Microsystems, USA
Lizy K. John            University of Texas at Austin, USA
David Kaeli             Northeastern University, USA
Samuel Kounev           Universität Karlsruhe (TH), Germany
David Morse             Dell, USA
Kai Sachs               TU Darmstadt, Germany

Table of Contents

Benchmark Suites

SPECrate2006: Alternatives Considered, Lessons Learned .......................   1
   John L. Henning

SPECjvm2008 Performance Characterization .....................................  17
   Kumar Shiv, Kingsum Chow, Yanping Wang, and Dmitry Petrochenko

CPU Benchmarking

Performance Characterization of Itanium® 2-Based Montecito Processor ........  36
   Darshan Desai, Gerolf F. Hoflehner, Arun Kejariwal, Daniel M. Lavery,
   Alexandru Nicolau, Alexander V. Veidenbaum, and Cameron McNairy

A Tale of Two Processors: Revisiting the RISC-CISC Debate ...................  57
   Ciji Isen, Lizy K. John, and Eugene John

Investigating Cache Parameters of x86 Family Processors .....................  77
   Vlastimil Babka and Petr Tůma

Power/Thermal Benchmarking

The Next Frontier for Power/Performance Benchmarking: Energy Efficiency
of Storage Subsystems .......................................................  97
   Klaus-Dieter Lange

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors
Using Geospatial-Based Predictive Models .................................... 102
   Chang-Burm Cho, Wangyuan Zhang, and Tao Li

Modeling and Sampling Techniques

Generation, Validation and Analysis of SPEC CPU2006 Simulation Points
Based on Branch, Memory and TLB Characteristics ............................. 121
   Karthik Ganesan, Deepak Panwar, and Lizy K. John

A Note on the Effects of Service Time Distribution in the M/G/1 Queue ....... 138
   Alexandre Brandwajn and Thomas Begin

Author Index ................................................................ 145

SPECrate2006: Alternatives Considered, Lessons Learned

John L. Henning
Sun Microsystems
[email protected]

Abstract. Since 1992, SPEC has used multiple identical benchmarks to measure multi-processor performance. This “Homogeneous Capacity Method” (aka “SPECrate”) has been criticized on the grounds that real workloads are not homogeneous. Nevertheless, SPECrate provides a useful window into how systems perform when stressed by multiple requests for similar resources. This paper reviews SPECrate’s history, and several performance lessons learned using it: (1) a 4:1 performance gain for startup of a benchmark when I/O was reconfigured; (2) a benchmark that improved up to 2:1 when a TLB data structure was re-sized; and (3) a benchmark that improved by 52% after a change to NUMA page allocation. The SPEC CPU workloads usefully exposed several opportunities for performance improvement.

1   Introduction: A Philosophy of Divots

When systems do not perform as expected, performance anomalies are sometimes called “divots”: an unexpected hole where performance sinks. Is a divot something to be ashamed of? Or an opportunity? This tester suggests: Although it is widely understood that “all software has bugs”, it may not be as widely understood that all systems have performance divots. A repeatable, analyzable workload allows divots to be analyzed. Cherish your divots.

2   Background: About SPECrate

The Original Metric: Speed. The SPEC CPU suites are made up of component benchmarks. The original SPECmark (now referenced as “CPU89”) contained 10 benchmarks, such as gcc, spice, and a lisp interpreter; the most recent suite, CPU2006, contains 29 benchmarks, such as bzip2, GNU Go, GAMESS, POV-Ray, and perl. The SPEC-supplied tool set runs each component benchmark individually, and the time in seconds is reported. For each benchmark, a “SPECratio” is computed by dividing the time on a reference system by the time seen on the system under test. Finally, a bottom line metric (such as SPECint95, SPECfp2000, SPECfp_base2006) is computed as the geometric mean of the benchmark SPECratios.
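For readers who want to experiment with these definitions, the arithmetic can be sketched in a few lines. The fragment below is purely illustrative; it is not part of the SPEC tool set, and the benchmark names and times are invented:

```python
from math import prod

def spec_ratio(reference_seconds, measured_seconds):
    # SPECratio: time on the reference system divided by time on the system under test.
    return reference_seconds / measured_seconds

def bottom_line(ratios):
    # The suite-level speed metric is the geometric mean of the per-benchmark SPECratios.
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical (reference, measured) times in seconds for three benchmarks.
times = {"bzip2": (9650, 720), "gcc": (8050, 610), "mcf": (9120, 880)}
ratios = [spec_ratio(ref, meas) for ref, meas in times.values()]
print(round(bottom_line(ratios), 2))
```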


The bottom line metrics mentioned thus far are called “speed” metrics, and are analogous to speed of travel in the real world in that higher numbers are better, and numbers are comparable. If a sports car takes 1/4 the time to get to Cleveland as a truck, we routinely say that the sports car is 4x as fast; and if a new laptop finishes a well defined task in 1/4 as much time as an old desktop computer, it seems natural to call the laptop 4x as fast as the desktop.

Adding a Throughput Metric. The speed tests run only one copy of each component benchmark at a time, leaving resources idle on multi-processor systems. SPEC addressed this problem in 1992 by adding throughput tests that allow the tester to run multiple copies of identical benchmarks. For example, in a 32-copy SPECint_rate2006 test, the SPEC tool set starts 32 copies of 400.perlbench, waits for all of them to complete, and records the time from start of first to finish of last; then starts 32 copies of 401.bzip2, and so forth. The fact that all copies are running the same workload is the reason that SPECrate was originally known as the “Homogeneous Capacity Method” [1]. The details of the metric calculation have varied somewhat as the suites have evolved, but in all cases a score is calculated for each benchmark which is proportional to the number of copies run divided by the time required to complete the copies (a small illustrative sketch appears at the end of this section). The bottom line metrics (e.g. SPECint_rate95, SPECfp_rate2000) are the geometric means of the benchmark scores.

Interpretation of SPEC CPU throughput metrics is somewhat less intuitive than the speed metrics. For example, if a laptop has a SPECint_rate2006 score of 10, and a server has a SPECint_rate2006 score of 20, it is not immediately obvious if the better result is achieved by running twice as many copies in the same time, or by running the same number of copies in 1/2 the time, or by some other method. The full reports provide the additional level of detail for the motivated reader.

Positioning: A Component Benchmark. Although the throughput metrics exercise more of the system than a single processing unit, while using the compute-intensive portion of real applications [4], it should be understood that SPECrate is not positioned as a whole-system benchmark. It is a design goal to reduce disk I/O, remove network I/O, and eliminate GUIs from the SPEC CPU suites. Use of system services and libraries is also minimized [19].
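The throughput bookkeeping referred to above can be sketched in the same spirit. The constants and normalization used by the real SPEC tools differ by suite; the point here is only that the score grows with the number of copies and shrinks with the elapsed time from start of the first copy to finish of the last:

```python
def rate_score(copies, reference_seconds, start_times, finish_times):
    # Elapsed time is measured from start of the first copy to finish of the last copy.
    elapsed = max(finish_times) - min(start_times)
    # Proportional to copies / time; scaling by a reference time keeps scores comparable
    # across benchmarks (the real tools apply suite-specific normalization).
    return copies * reference_seconds / elapsed

# Hypothetical 4-copy run: all copies start near time 0 and finish near 1000 seconds.
print(round(rate_score(4, 9650, [0.0, 0.1, 0.1, 0.2], [998.0, 1001.5, 1000.2, 1003.0]), 1))
```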

3   Perceived Weaknesses of SPECrate

Scaling. SPECrate has been criticized because the scaling can sometimes appear to be less than credible. For example, on the metric SPECint_rate_base2000, an Alpha 21264A with 16 chips vs. 32 scales at .975 [6]; an SGI R14000 with 8 chips vs. 128 scales at .976 [7]. Explanations for the excellent scaling include: (1) SPECrate jobs are independent – there are no stalls for cross-job communication. (2) As mentioned above, they do little I/O and use few system services. (3) Although one of the design goals is to exercise the memory hierarchy, it has been shown that the benchmarks usually exercise only a relatively small part of main memory at a time [3], and therefore caches are usually effective. (4) Even for the benchmarks that exercise main memory the most, there are often access patterns that compilers and hardware can usefully prefetch. (5) Although anyone can publish a rule-compliant SPEC CPU result, in fact, most publications are done by vendors, who are motivated to ensure that systems are properly set up to ensure good scaling. (6) If good scaling is not possible on a particular system, there is typically no particular motivation for the vendor to publish such a result.

In short, a concern with the scaling is that it may appear to be “too good”. Customer applications that depend on interprocess coordination, system services, I/O, or other components not measured by the SPEC CPU benchmarks are unlikely to scale as well as does SPECrate. This is not to say that SPECrate is a dishonest measurement; rather that it is a component benchmark, and that it needs to be understood as such.

Homogeneity and Convoys. A second perceived weakness of SPECrate is its homogeneity. In real life, servers often host a variety of applications. Even in an environment where everyone has a common interest (say, all are molecular chemists, running a common tool set) jobs do not all start at the same instant, and run identical workloads. It has been hypothesized that SPEC’s implementation of SPECrate may lead to “convoy effects”: for example, 128 memory-intensive programs all hit their most intense memory bandwidth demand at the same time; or 128 programs all try to read their startup data at the same time, thrashing the disk; or 128 programs all try to acquire an OS lock on the filesystem at the same instant.

4   Alternatives Considered by SPEC

In response to concerns about SPECrate, the SPEC CPU Subcommittee has considered various alternatives over the years. Two alternatives are described in this section: “heterogeneous” and “staggered homogeneous”.

4.1   Heterogeneous

During the development of SPEC CPU2006, a prototype was implemented that ran the CPU2000 jobs in a heterogeneous fashion. Tables 1 and 2 show the difference in run order on a system running 4 queues (which would, typically, use 4 processors). For the homogeneous method, each processor runs the same program and workload. For the heterogeneous prototype, each processor starts off running a different job than the other processors (provided that the number of processors is less than the number of benchmarks in the suite). It is important to note that for homogeneous SPECrate, all copies of a benchmark finish, and then the next benchmark is started.

Table 1. Homogeneous run order. All processors run identical jobs.

   Queue 0        Queue 1        Queue 2        Queue 3
   164.gzip       164.gzip       164.gzip       164.gzip
   175.vpr        175.vpr        175.vpr        175.vpr
   176.gcc        176.gcc        176.gcc        176.gcc
   181.mcf        181.mcf        181.mcf        181.mcf
   186.crafty     186.crafty     186.crafty     186.crafty
   197.parser     197.parser     197.parser     197.parser
   252.eon        252.eon        252.eon        252.eon
   254.gap        254.gap        254.gap        254.gap
   253.perlbmk    253.perlbmk    253.perlbmk    253.perlbmk
   255.vortex     255.vortex     255.vortex     255.vortex
   256.bzip2      256.bzip2      256.bzip2      256.bzip2
   300.twolf      300.twolf      300.twolf      300.twolf

Table 2. Heterogeneous run order. Different processors run different jobs.

   Queue 0        Queue 1        Queue 2        Queue 3
   164.gzip       175.vpr        176.gcc        181.mcf
   175.vpr        176.gcc        181.mcf        186.crafty
   176.gcc        181.mcf        186.crafty     197.parser
   181.mcf        186.crafty     197.parser     252.eon
   186.crafty     197.parser     252.eon        254.gap
   197.parser     252.eon        254.gap        253.perlbmk
   252.eon        254.gap        253.perlbmk    255.vortex
   254.gap        253.perlbmk    255.vortex     256.bzip2
   253.perlbmk    255.vortex     256.bzip2      300.twolf
   255.vortex     256.bzip2      300.twolf      164.gzip
   256.bzip2      300.twolf      164.gzip       175.vpr
   300.twolf      164.gzip       175.vpr        176.gcc

Thus one may read Table 1 as implicitly containing 12 phases: between each row there is a pause to wait for all of the row to finish. No such pause occurs with Table 2. In the heterogeneous prototype, each queue runs independently.

Results. Informal (mostly non-quantitative) reports of results with the heterogeneous prototype fell into two categories. Some reports indicated only minor differences in observed run times, within the usual range for run-to-run variation. Others said that benchmarks with noticeable main memory traffic ran noticeably faster, presumably because they tended to compete with less intense jobs, rather than with equally intense copies of themselves. Therefore, the likely bottom line with a heterogeneous method would be slightly better scaling than with the homogeneous method. The possibility that scaling would improve may be seen as a negative aspect of the heterogeneous method, if one is concerned that homogeneous scaling already appears to be “too good”.
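For concreteness, the rotated run order of Table 2 can be generated mechanically: queue q simply starts q positions into the benchmark list and wraps around. The sketch below is illustrative only; it is not code from the SPEC prototype:

```python
BENCHMARKS = ["164.gzip", "175.vpr", "176.gcc", "181.mcf", "186.crafty", "197.parser",
              "252.eon", "254.gap", "253.perlbmk", "255.vortex", "256.bzip2", "300.twolf"]

def heterogeneous_order(queue, benchmarks=BENCHMARKS):
    # Queue q starts with benchmark q and wraps around, so at any instant the
    # queues tend to be running different programs from each other.
    return benchmarks[queue:] + benchmarks[:queue]

for q in range(4):
    print(f"Queue {q}: {heterogeneous_order(q)[:3]} ...")
```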


It seems intuitive that any particular resource stressed by a homogeneous workload would be less stressed by the above heterogeneous method (with the notable exception of hardware and OS support for the instruction stream: for SPECrate, each copy has its own data, but all use the same program binary, allowing the OS the opportunity to load only one copy into physical memory, whereas in a heterogeneous context, obviously, multiple program binaries are active).

Difficulties with the heterogeneous method. On the assumption that such resource stresses are useful to study, reducing their levels in a heterogeneous workload is bad, because it makes them less apparent and harder to analyze. A heterogeneous workload also makes it much more difficult to reproduce performance conditions. For example, suppose that 255.vortex runs more slowly than desired. To reproduce its conditions from Table 1, to a first approximation, one can simply run the 4 copies of vortex. To reproduce its conditions from Table 2, it is necessary to run the whole suite. One cannot try to just run selected “rows”, because the rows in Table 2 do not represent separate phases.

4.2   Staggered Homogeneous

Another alternative prototyped by SPEC delays the start of each job by a small amount (a “stagger”), while running the same job on all processors. The intent of the staggered homogeneous method is to avoid the hypothesized convoy effects mentioned above. The prototype still exists, latent and unsupported, in SPEC CPU2006. The excerpts below are taken from an unmodified copy of the suite:

$ specinvoke -h
  -S msecs     sleep between spawning copies (in milliseconds)

$ runspec --stag
Option stag is ambiguous (stagger, staggeredhomogenousrate)

$ runspec --config oct14a --size test \
      --copies 2 --staggeredhomogen --stagger 6000 473.astar

The specinvoke [11] utility provides a help message that tells us that staggers are expressed in milliseconds. The first runspec command tricks the switch parser into reminding us how to spell its undocumented switches, and the second runspec command runs 2 copies of the test workload for the benchmark 473.astar, with a delay of 6 seconds between each copy. As a reminder, the staggered homogeneous prototype is unsupported. If the reader plays with it, you are reminded that anything you learn from it cannot be represented as an official SPEC metric. If you do decide to use it, you will probably find it easiest to discern what it did by looking in the run directory:



$ cd $SPEC/benchspec/CPU2006/473.astar/run/run*000
$ cat speccmds.out
timer ticks over every 1000 ns
running commands in speccmds.cmd 1 times
runs started at 1225226364, 29870000, Tue Oct 28 16:39:24 2008
run 1 started at 1225226364, 29876000, Tue Oct 28 16:39:24 2008
child started: 0, 1225226364, 29883000, pid=3147, '../run_base_test_oct14a.0000/astar_base.oct14a lake.cfg'
child started: 1, 1225226370, 30218000, pid=3148, '../run_base_test_oct14a.0000/astar_base.oct14a lake.cfg'
child finished: 0, 1225226376, 980432000, sec=12, nsec=950549000, pid=3147, rc=0
child finished: 1, 1225226383, 556000, sec=12, nsec=970338000, pid=3148, rc=0
run 1 finished at: 1225226383, 562000, Tue Oct 28 16:39:43 2008
run 1 elapsed time: 18, 970686000, 18.970686000
runs finished at 1225226383, 597000, Tue Oct 28 16:39:43 2008
runs elapsed time: 18, 970727000, 18.970727000

Notice above that the two copies were started 6 seconds apart (1225226364 and 1225226370 seconds after Jan. 1, 1970), each took just under 13 seconds, and the total elapsed time was just under 19 seconds. The bottom line includes the time for the stagger, as it is measured from start-of-first copy to finish-of-last. One might want to consider other ways of calculating a bottom line. (Reminder: any use of the prototype may not be represented as an official SPEC metric.)

Results. As SPEC experimented with the prototype, the hypothesized convoy effect was not observed. That is, the expectation had been that the normal SPECrate causes unrealistic resource overloads when, for example, 128 copies all try simultaneously to acquire a lock on a filesystem; and that a small stagger (on the order of 10s of milliseconds) would avoid the overloading and actually cause faster overall execution time. Instead, small staggers were observed to make no particular difference to overall time (indistinguishable from noise).

Difficulties with the staggered homogeneous method. Should the metric include the stagger time? If so, unless the staggers are very small, too much idle time may be included. Alternatively, one might try to exclude the staggers by, for example, calculating time from start-of-last to finish-of-first; a disadvantage of this approach is that it could cause performance to be overstated if one copy has more hardware resources than others (e.g. a 16-chip, 64-core system with 4 copies on 15 of the chips, but only 1 copy on the last). Perhaps the most attractive alternative would be to attempt to achieve a steady state of repeated execution, with all processors busy, running staggered workloads; one would compute a metric that sampled execution time for complete jobs during the steady state. The primary disadvantage of this approach is that the suite is sometimes already criticized as taking too long; running repeated workloads to ramp up to a steady state was not viewed as attractive.
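Given per-copy start and finish times such as those in the speccmds.out excerpt above, the two candidate measurements discussed here are easy to compute. A sketch (a hypothetical helper, not a SPEC tool):

```python
def stagger_metrics(copies):
    # copies: list of (start_seconds, finish_seconds) pairs, one per benchmark copy.
    starts = [s for s, _ in copies]
    finishes = [f for _, f in copies]
    include_stagger = max(finishes) - min(starts)   # start-of-first to finish-of-last
    exclude_stagger = min(finishes) - max(starts)   # start-of-last to finish-of-first
    return include_stagger, exclude_stagger

# Values approximating the 473.astar excerpt: two copies started 6 seconds apart,
# each running for roughly 13 seconds.
print(stagger_metrics([(0.0, 12.95), (6.0, 18.97)]))  # about (18.97, 6.95)
```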


SPEC’s Decision. After discussion, neither of the prototyped alternatives was adopted for CPU2006, and SPECrate remains essentially unchanged since 1992.

5   Applying SPECrate2006

SPECrate provides a useful window into how systems perform when stressed by multiple requests for similar resources such as program startup, data initialization, translation lookaside buffer (TLB) requests, and memory allocation. It is understood that in real life, an OS is unlikely to get 128 simultaneous identical requests, so one must be careful not to over optimize to this, or to any other, benchmark. Nevertheless, the homogeneity may be its virtue: in real life, systems do have to deal with intense requests, traffic jams do occur, and SPECrate presents a compute-intensive workload that is repeatable and analyzable. In this section, three case studies are briefly summarized from applying SPECrate2006 to Solaris systems.

5.1   A 4:1 Performance Gain for Startup of a SPEC CPU2006 Benchmark When I/O Was Properly Configured

Although the intent of the SPEC CPU benchmarks is to be compute intensive, some I/O inevitably remains. When multiple copies are run for SPECrate, I/O is magnified. With each suite, it seems that one or two benchmarks stick out as being especially in need of I/O tuning. For CPU95, a benchmark of concern was 126.gcc: each copy compiles 56 input files and writes 112 output files with a total of 8 MB of output data. For CPU2000, the benchmark 200.sixtrack writes 42 files, with a total of 5.3 MB, per copy. For both CPU95 and CPU2000, testers learned that on large systems, it is useful to have striped disks, preferably with journaling file systems that do not stall waiting for writes.

Problem. For CPU2006, a benchmark of concern in large SPECrate runs is 450.soplex, a Simplex Linear Program (LP) Solver. The program is invoked twice, and I/O becomes a problem in startup of part 2, when each benchmark copy needs to read its copy of the 267 MB input file ref.mps.

Methods. In order to focus on the second part of the benchmark, the utility convert_to_development [10] was applied to allow modifications to the ref workload while still using the SPEC tools. The first workload was deleted, leaving only ref.mps in the directory 450.soplex/data/ref/input. Then, 128 run directories were populated on a large server using runspec --action setup. The actual runs were done using specinvoke -r [11]. In order to avoid unwanted file caching effects (which would not be effective in a full reportable run), memory was cleared between tests by running large copies of STREAM [17] and reading a series of unrelated files. CPU and I/O activity were observed using iostat 30.


Metrics. As each run began, CPU utilization was low, and disk activity high, as 128 copies of ref.mps were read. Eventually, the I/O rate (iostat’s kps column) fell to zero and the tested processors achieved 100% utilization. Two metrics are reported: (1) startup time in minutes, determined by counting the 2-per-minute iostat records prior to 100% utilization; (2) kps from the busy period (converted to MB/sec).

Baseline. When a single 10K RPM disk was used, startup required about 24 minutes, reading at about 24 MB/sec.

Software RAID. When Solaris Volume Manager was used with the default 16 KB block size (known as an “interlace size” in the terminology of SVM) on an A5200 Fibre Channel disk array with 6x 10K RPM disks, startup fell to about 20 minutes, reading about 30 MB/sec. With a block size of 256 KB, startup improved to about 8 minutes and 72 MB/sec. For this read-intensive workload, RAID-0 was not particularly faster than RAID-5. Increasing the number of disks in the stripe set had little additional effect on performance, as the maximum observed bandwidth for this somewhat older disk system was about 78 MB/sec.

Hardware RAID. A newer hardware RAID array, the Sun StorageTek 2540 with 6x 15K RPM disks, did not show sensitivity to block size (called “segment size” for this device) over the tested range of 16 KB through 512 KB. This insensitivity may be viewed as a plus, since it may be hard to know in advance what block size to choose. The bandwidth was about 97 MB/sec, roughly matching the limit of the 1 Gb Host Bus Adapter (HBA) used in this test. Once again, read performance was insensitive to use of RAID-0 vs. RAID-5. Further improvement might be possible with a higher bandwidth HBA.

Divot summary. With hardware RAID, a performance divot of idle CPUs waiting on I/O was reduced from 24 minutes to 6 minutes, which is a 4:1 improvement over the original single-disk configuration.

Lessons for tuning other systems. Even in an allegedly CPU intensive environment, I/O lurks. Hardware RAID may offload overhead from the server.
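The startup times above are consistent with a simple back-of-the-envelope model: 128 copies each read the 267 MB ref.mps once, so startup lasts roughly the aggregate input volume divided by the sustained read bandwidth. A quick check of this model against the bandwidths reported above (illustrative code, not from the study itself):

```python
def startup_minutes(copies, input_mb, read_mb_per_sec):
    # Rough model: startup ends when all copies have read their input file once.
    return copies * input_mb / read_mb_per_sec / 60.0

for label, mbps in [("single disk", 24), ("SVM 16 KB", 30), ("SVM 256 KB", 72), ("ST2540", 97)]:
    print(f"{label:12s} {startup_minutes(128, 267, mbps):5.1f} minutes")
# Prints roughly 24, 19, 8 and 6 minutes, close to the observed 24, 20, 8 and 6.
```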

5.2   An Improvement of up to 2:1 for a CPU2006 Benchmark When a TLB Data Structure Was Re-sized

UltraSPARC T2. The UltraSPARC T2 (aka “Niagara2”) and UltraSPARC T2 Plus processors [12] are multi-threaded processors with eight SPARC processor cores. Each core runs 8 hardware threads and has 2 integer units, one floating point unit, an 8 KB L1 data cache, and a 16 KB L1 instruction cache. All cores share a single 4 MB L2 cache. Each core does virtual address translation using a 64-entry instruction Translation Lookaside Buffer (TLB) and a 128-entry data TLB [5]. When TLB misses occur, software-managed direct-mapped Translation Storage Buffers (TSBs) are consulted by a Hardware Table Walker. TSBs are allocated per-process, for up to 4 page sizes (8 KB, 64 KB, 4 MB, 256 MB). By default, each TSB holds 512 entries, but the hardware allows much larger TSBs to be allocated if the operating system so chooses [13].
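To make the discussion below easier to follow, the sketch illustrates the general behavior of a direct-mapped translation cache: each virtual page indexes exactly one slot, so a working set whose pages collide in the same slots keeps missing. This is a conceptual illustration only, not the UltraSPARC T2 hardware or the Solaris implementation:

```python
class DirectMappedTSB:
    def __init__(self, entries=512, page_shift=22):
        # page_shift=22 corresponds to 4 MB pages; entries=512 is the default TSB size
        # mentioned above. Each slot caches one virtual-page-number translation.
        self.entries = entries
        self.page_shift = page_shift
        self.slots = [None] * entries

    def lookup(self, vaddr):
        vpn = vaddr >> self.page_shift
        slot = vpn % self.entries      # direct mapped: one candidate slot per page
        if self.slots[slot] == vpn:
            return True                # hit: the table walker finds the entry
        self.slots[slot] = vpn         # miss: the slot is refilled (greatly simplified)
        return False
```

A larger `entries` value reduces the chance that a big working set keeps evicting its own translations, which is the effect the resizing tunables discussed below exploit.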


Table 3. 436.cactusADM SPECratios (higher is better)

             base     peak
   run #1    86.04    86.56
   run #2    86.09    86.98
   run #3    85.82    63.52

Table 4. Normalized per-copy times (lower is better) for 436.cactusADM

   Metric                               Peak Run 2    Peak Run 3

   Including all 127 copies:
   Median                                   1.0000        .9947
   Arithmetic Mean                           .9916        .9990
   Std. Deviation                            .0232        .0844
   Max                                      1.0104       1.3837

   If the worst 6 copies are dropped:
   Median                                    .9987        .9933
   Arithmetic Mean                           .9907        .9813
   Std. Deviation                            .0235        .0284
   Max                                      1.0072       1.0086

Problem. During testing of CPU2006 on UltraSPARC T2 and UltraSPARC T2 Plus processors, unexplained variability was sometimes seen for the benchmark 436.cactusADM. For example, a single reportable run of the floating point suite from December 2007 with 6 runs of the benchmark (3x base tuning and 3x peak tuning) showed inconsistent performance, as detailed in Table 3. Notice that although the median performance for peak was 86.56, the slowest run was off by more than 1/4.

Analysis: Variation by copy. Recall from the metrics discussion at the beginning of this paper that reported benchmark scores depend on the time from start of first copy to completion of last. Therefore, a primary goal for the tester is to attempt to achieve consistency across all tested copies – in this case, 127 copies on a 2-chip system. Table 4 summarizes the copy-by-copy times in the second and third peak runs. In Table 4, times are normalized to the median time from Peak Run 2. Notice the consistency in Peak Run 2, with the worst of the 127 copies needing only 1.04% more time than the median time. By contrast, the slowest copy in Peak Run 3 needed 38.37% more time than the median of Peak Run 2. The problems in Peak Run 3 are not widespread; in fact, only 6 of the 127 copies were slow. If these 6 copies were eliminated, as shown in the second half of the table, the two runs would match each other. Unfortunately for the tester, the metrics do not allow post-processing to eliminate the slow copies.
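The per-copy statistics of Table 4 are straightforward to reproduce from the individual copy times. A sketch, assuming the times have already been extracted from the SPEC log files (the data layout here is hypothetical):

```python
import statistics

def per_copy_summary(times, reference_median):
    # Normalize each copy's elapsed time to the median time of the reference run,
    # as in Table 4 (lower is better).
    normalized = [t / reference_median for t in times]
    return {
        "median": statistics.median(normalized),
        "mean": statistics.mean(normalized),
        "stdev": statistics.pstdev(normalized),
        "max": max(normalized),
    }

# Usage: times_run2 and times_run3 would each hold the 127 per-copy elapsed times.
# summary = per_copy_summary(times_run3, statistics.median(times_run2))
```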


Considerable time was spent trying to trace the source of the occasional poor copy time for 436.cactusADM, which sometimes was up to 2x worse than the expected time. Analysis of experiment logs did not indicate any particular pattern to the degraded performance. Sometimes, a handful of copies would be slow; often, none would be slow. The slow performance did not appear to be tied to system state, nor to particular virtual processors, as it would move around from one CPU to another. Attempts to instrument the tests were often met by a failure to reproduce the slow performance.

Smoking gun. Eventually, a bad run was caught with trapstat -T [14]:

   cpu    dtsb-miss    %tim
     7      4138331    59.8
    11      4117256    60.2
    14      4135205    59.9
    21      4114273    60.4
    23      4139823    59.5

In the trapstat output, it can be seen that various copies (on virtual processors 7, 11, 14, 21, 23) are estimated to be spending about 60% of their time processing TSB misses. Once this was found, the solution to the variability of 436.cactusADM was straightforward. As mentioned above, the hardware allows TSBs to be expanded, and Solaris supports the hardware feature with a pair of tunables: enable_tsb_rss_sizing and tsb_rss_factor [16]. The former is on by default; the latter provides a measure of how full TSBs have to be before they become candidates for resizing. As can be seen in SPEC CPU submissions from early 2008, this Solaris tuning parameter has been used, and 436.cactusADM performance has been steady. For example, in a large SPECrate submission with 630 copies, the three runs differed from each other by less than 1% [8]. If per-copy results are analyzed (as in Table 4), the worst time across all 1890 copies differs from the median by only 1.52%.

Divot summary. SPECrate was useful for uncovering a hard-to-predict, hard-to-reproduce performance divot of up to 2:1. It was resolved by encouraging the operating system to be more willing to expand the size of the data TSBs.

Lessons for tuning other systems. The default TSB sizing is adequate for most applications, especially if large pages are employed. If it is suspected that large applications (e.g. more than 1 GB, with 4 MB pages) may be running more slowly than desired, trapstat -T can be used to check for TSB activity, and if it is found, tsb_rss_factor can be decreased.

5.3   A Gain of 52% for a CPU2006 Benchmark after a Change to the Operating System Policy for NUMA Page Allocation

Problem. When testing large SPECrate runs, variability was sometimes observed, and, as in the previous section, effort was spent to try to trace it. Unlike the previous case, there appeared to be a pattern, as shown in Figure 1.


Fig. 1. 429.mcf variability by processor number (y-axis: seconds, 0–1200; x-axis: processor, 0–144)

Fig. 2. 434.zeusmp variability by processor number (y-axis: seconds, 0–3000; x-axis: processor, 0–144)

Figure 1 is from a large 72-chip, 144 processor server, running 143 copies of the benchmarks. The server has 18 system boards, each with 8 virtual CPUs. In the graph, the vertical grid delimits system boards. Notice that most copies of 429.mcf completed in about 800 seconds, except for those on the second system board. Attempts to trace the problem showed that generally a single system board would be slow, but it was, at first, hard to predict which board. In Figure 2, taken from a different large server, notice that it is the 4th from the last that is slower.

Graphical analysis. Edward R. Tufte suggests that graphs should be used only if one has large amounts of data needing analysis, and they should contain only pixels that are essential to the analysis, avoiding “chartjunk” [18]. The situation at hand has over 14,000 benchmark observations in each 143-copy reportable SPECfp_rate2006 run, and many more from tuning runs. To ease graphical analysis, a perl procedure was written that extracted data from log files, drove gnuplot with what was viewed as a minimal amount of chartjunk (as in the above graphs), and joined them into a webpage.


NUMA Hypothesis. Because the graphs showed that problems would tend to occur on a single system board, and because it is known that local system board memory access has shorter latency than remote memory, NUMA (Non Uniform Memory Access) differences were suspected. Solaris supports NUMA using a concept of Memory Placement Optimization (MPO) [2], which attempts to place process resources into “latency groups”. A latency group is a set of resources which are within some latency of each other. Systems can have multiple latency groups, and multiple levels of groups.

Tools. NUMA activity can be seen on Solaris 10 systems with the opensolaris.org “NUMA Observability Tools” [15]. Two useful tools are the extended pmap and lgrpinfo. The first is easily installed from the tools binary distribution:

$ gunzip -c ptools-bin-0.1.7.tar.gz | tar xf -
$ cd ptools-bin-0.1.7/
$ ./pmap -Ls $$ | head -10
 Address    Bytes  Pgsz  Mode  Lgrp  Mapped File
 00010000    640K   64K  r-x-     2  /usr/bin/bash
 000C0000     64K   64K  rwx-     1  /usr/bin/bash
 000E0000    128K   64K  rwx-     1  [ heap ]
 FF0F4000      8K    8K  rwxs     1  [ anon ]

In the pmap example above, note that -L tells us the locality group for each memory segment, and -s displays the page size. (In the interest of space, various output is truncated in both the examples in this section.)

To install lgrpinfo requires a couple of extra steps, because a customization is needed for the local version of perl:

$ gunzip -c Solaris-Lgrp-0.1.4.tar.gz | tar xf -
$ cd Solaris-Lgrp-0.1.4/
$ perl Makefile.PL
Writing Makefile for Solaris::Lgrp
$ make
$ make test
All tests successful.
$ su
Password:
# make install
# exit
$ bin/lgrpinfo
lgroup 0 (root):
        Children: 1 2
        CPUs: 0-127
        Memory: installed 130848 Mb, allocated 3924 Mb, free 126924 Mb
        Lgroup resources: 1 2 (CPU); 1 2 (memory)
lgroup 1 (leaf):
        CPUs: 0-63
        Memory: installed 65312 Mb, allocated 1675 Mb, free 63637 Mb
        Lgroup resources: 1 (CPU); 1 (memory)
lgroup 2 (leaf):
        CPUs: 64-127
        Memory: installed 65536 Mb, allocated 2249 Mb, free 63287 Mb
        Lgroup resources: 2 (CPU); 2 (memory)
$

In the lgrpinfo example, the output describes a system with 128 virtual processors and 128 GB memory, divided into two latency groups. (For the sake of brevity, this example is from a simpler system than the one in the graphs.)

Diagnosis. Use of pmap showed that the benchmarks running in the slower locality group were receiving memory of the requested page size (4 MB) but not the desired location. It was also noted that the slow locality group was the one where the SPEC tool suite itself (runspec) was started. Observations with lgrpinfo showed that during the benchmark setup phase, when runspec writes 143 run directories for each of the benchmarks in the suite, physical memory was used up in runspec’s locality group, apparently for file system caches.

Workarounds attempted. It was hypothesized that the setup phase may have fragmented memory on runspec’s system board; and that the operating system might not be able (or, might not be willing) to coalesce fragmented 8 KB pages into 4 MB pages. Asking for smaller page sizes (such as 64 KB or 512 KB) sometimes appeared to succeed, but this compromise was not considered desirable since the benchmarks are large enough that 4 MB pages are known to be helpful. The size of file system caches was reduced using system tuning parameters such as bufhwm and segmap_percent, and memory cleanup was encouraged with reasonably active settings for autoup and tune_t_fsflushr [16]. To improve predictability, runspec was initiated in a known location, namely the system board that is also used by Solaris itself, and the amount of physical memory on that board was doubled. These workarounds were usually helpful, and memory availability usually improved, but the workarounds were viewed as less than completely satisfactory on the grounds that in real life, customers may not have the degree of control that the benchmark tester has.

Colloquially, the problem can be simply summarized as: “Dear Operating System: If I ask for local bigpages, and you don’t have them handy, please don’t give me remote bigpages instead. Please try harder to create local bigpages.” Given this simple summary, a simple suggestion arises: why not just change the default policy to always try harder? There are several reasons to hesitate to change the default policy: (1) Coalescing pages may be expensive, as it requires relocating pages for running processes. (2) For processes that run quickly, it may be better to allocate memory quickly rather than spending extra effort. (3) It is unknown how frequently the problem may occur in real life: how often do long-running programs ask for large memory regions with large pages, which are then used intensely enough to amortize any extra cost required to coalesce pages?

Fig. 3. 433.milc before and after lpg_alloc_prefer (y-axis: seconds, 0–6000; x-axis: processor, 0–144; Peak run 2 has a slow lgroup, while Peak run 3 is consistent across the lgroups)

Given insufficient data to answer questions such as these, the operating system policies must be approached with care.

Changes to Solaris. Over the course of the investigations of these issues, the Solaris development group responded by implementing two changes. First, the algorithm for coalescing pages was made more efficient. Second, a tunable parameter was introduced to allow users to increase the priority of local page allocation: lpg_alloc_prefer [16]. If you have single threaded, long running, large memory applications, then consider setting lpg_alloc_prefer=1. This causes Solaris to spend more CPU time defragmenting memory to allocate local large pages, versus allocating readily available remote large pages. The long term savings from accessing local rather than remote memory may offset the higher allocation cost. This tunable parameter is used in the 256 virtual processor Sun SPARC Enterprise T5440 SPECfp_rate2006 result [9]. When the graphical analysis tools are applied to this result, NUMA effects are not seen.

Divot summary. An early version of lpg_alloc_prefer was applied to a system in the middle of a SPECrate run. The effect was to remove a NUMA performance divot that would sometimes slow down a single system board. The largest effect was on the benchmark 433.milc, as shown in Figure 3. Because the tools report the time from start-of-first to finish-of-last, the bottom line improved by 52%:

   Success 433.milc peak ref ratio=226.74, runtime=5789.629458
   Success 433.milc peak ref ratio=344.64, runtime=3809.035023

SPECrate was useful as a generator of a repeatable, intense workload on NUMA systems, allowing careful study of the divot.

Lessons for tuning other systems. Systems that tend to run large single-threaded programs may benefit from setting lpg_alloc_prefer.

6   Summary

Although it is widely understood that “all software has bugs”, it may not be as widely understood that all systems have performance divots. A repeatable, analyzable workload allows divots to be analyzed. Cherish your divots. (Repetition is a form of emphasis.)

Acknowledgments. Thank you to the SPEC CPU subcommittee for permission to summarize the investigation of SPECrate alternatives, and especially to Cloyce Spradling for implementation of the alternatives. Numerous colleagues within Sun have assisted with the technical investigations summarized in this paper, including Miriam Blatt, Jonathan Chew, Michael Corcoran, Darryl Gove, Aleksandr Guzovskiy, Alexander Kolbasov, Eric Saxe, Steve Sistare, Geetha Vallabhaneni, and Brian Whitney. Karsten Guthridge was the first to catch 436.cactusADM in trapstat.

References

1. Carlton, A.: CINT92 and CFP92 Homogeneous Capacity Method Offers Fair Measure of Processing Capacity, http://www.spec.org/cpu92/specrate.txt
2. Chew, J.: Memory Placement Optimization (MPO), http://opensolaris.org/os/community/performance/mpo_overview.pdf
3. Gove, D.: CPU2006 Working Set Size. ACM SIGARCH Computer Architecture News 35(1), 90–96 (2007), http://www.spec.org/cpu2006/publications/
4. Henning, J.L.: SPEC CPU Suite Growth: An Historical Perspective. ACM SIGARCH Computer Architecture News 35(1), 65–68 (2007), http://www.spec.org/cpu2006/publications/
5. McGhan, H.: Niagara 2 Opens the Floodgates. Microprocessor Report (November 6, 2006), http://www.sun.com/processors/niagara/M45_MPFNiagara2_reprint.pdf
6. SPEC CPU2000 published results, http://www.spec.org/osg/cpu2000/results/res2000q2/cpu2000-20000511-00104.html, http://www.spec.org/osg/cpu2000/results/res2000q2/cpu2000-20000511-00105.html
7. SPEC CPU2000 published results, http://www.spec.org/osg/cpu2000/results/res2002q2/cpu2000-20020422-01329.html, http://www.spec.org/osg/cpu2000/results/res2002q1/cpu2000-20020211-01256.html
8. SPEC CPU2006 published results, http://www.spec.org/cpu2006/results/res2008q2/cpu2006-20080408-04064.html
9. SPEC CPU2006 published results, http://www.spec.org/cpu2006/results/res2008q4/cpu2006-20080929-05409.html
10. SPEC CPU2006 Documentation, http://www.spec.org/cpu2006/docs/utility.html#convert_to_development
11. SPEC CPU2006 Documentation, http://www.spec.org/cpu2006/docs/utility.html#specinvoke
12. Sun Microsystems, UltraSPARC T2 Processor, http://www.sun.com/processors/UltraSPARC-T2/datasheet.pdf
13. Sun Microsystems, UltraSPARC T2 Supplement to the UltraSPARC Architecture, section 12.2 (2007), http://opensparc-t2.sunsource.net/specs/UST2-UASuppl-current-draft-P-EXT.pdf
14. Sun Microsystems, Solaris 10 Reference Manual Collection, http://docs.sun.com/app/docs/doc/816-5166/trapstat-1m?a=view
15. Sun Microsystems, NUMA Observability, http://www.opensolaris.org/os/community/performance/numa/observability/
16. Sun Microsystems, Solaris Tunable Parameters Reference Manual, http://docs.sun.com/app/docs/doc/817-0404
17. STREAM: Sustainable Memory Bandwidth in High Performance Computers, http://www.cs.virginia.edu/stream/
18. Tufte, E.R.: The Visual Display of Quantitative Information, pp. 107–121. Graphics Press, Cheshire (1983)
19. Weicker, R.P., Henning, J.L.: Subroutine Profiling Results for the CPU2006 Benchmarks. ACM SIGARCH Computer Architecture News 35(1), 102–111 (2007), http://www.spec.org/cpu2006/publications/

SPECjvm2008 Performance Characterization

Kumar Shiv, Kingsum Chow, Yanping Wang, and Dmitry Petrochenko
Intel Corporation
{kumar.shiv, kingsum.chow, yanping.wang, dmitry.petrochenko}@intel.com

Abstract. SPECjvm2008 is a new multi-threaded Java benchmark from SPEC, and it replaces the aging single-threaded SPECjvm98. The benchmark is intended to address several shortcomings of the earlier workloads in SPECjvm98 by replacing DB, Chart and Javac, removing Jess, and adding XML, Serial, Crypto, and in-cache and out-of-cache versions of the Scimark workloads. It is targeted at measuring the performance of both JVMs and hardware systems. In this paper we describe the salient features of SPECjvm2008. We then take a first look at the performance of this benchmark on current multi-core platforms and study the sensitivity of the components of the workload to basic architectural aspects such as the number of processor cores, the processor frequency, and the cache and memory subsystem. We focus our study on understanding how the behavior of this workload compares with other standard Java benchmarks, SPECjbb2005 and SPECjAppServer2004, both in the components of the software stack that the workloads touch and in the aspects of the platform that they exercise, and draw conclusions on the usefulness of SPECjvm2008 for practitioners of JVM and hardware performance analysis.

Keywords: SPEC, Java Performance, Workload Characterization.

1   Introduction

The release of SPECjvm98 [6] as a client side workload stirred up a lot of interest in performance analysis of Java workloads. Dieckmann and Holzle [1] studied the allocation behavior of the SPECjvm98 Java benchmarks. Radhakrishna [2] did an in-depth analysis of micro-architectural techniques to enable efficient Java execution. The benchmarks were also used to go beyond Java code, as Li and John [3] characterized operating system activity in the SPECjvm98 benchmarks. However, modern machines are too fast [4] for the 10-year-old benchmark, and an overhaul had been long overdue. Now, ten years later, the release of SPECjvm2008 is expected to stir a lot of interest in how the latest overhaul of the benchmark is going to enable and encourage Java performance analysis on modern architectures. The designers of the new SPECjvm2008 have kept that in mind, and the benchmark is intended to take advantage of multiple cores, higher frequencies, bigger caches, and larger memory bandwidths.

In this work we have performed several experiments with SPECjvm2008.


Our analysis of the workload running on the latest modern processors is intended to understand the sensitivity of benchmark performance to many cores, higher frequencies, and different cache and memory hierarchies. In addition, we also looked at the effectiveness of Java runtime systems, including just-in-time compilation, dynamic optimizations, synchronization, object allocation and other Java technologies. We then compared the characteristics with recently released SPEC Java benchmarks such as SPECjAppServer2004 [13] and SPECjbb2005 [14]. By studying SPECjvm2008 and comparing it with SPECjAppServer2004 and SPECjbb2005, we hope to establish a pattern of behavior of Java workloads on modern architectures and to enable a distinction between the new benchmark and the two more established, bigger benchmarks.

2   Description of SPECjvm2008

Latest advances in processor and Java technologies have necessitated an overhaul of the SPECjvm98 benchmark [6]. Now, 10 years later, the Standard Performance Evaluation Corporation (SPEC) has updated it with a new version, SPECjvm2008 [7]. An overview of the comparison between SPECjvm98 and SPECjvm2008 is summarized in Table 1. SPECjvm2008 comprises many multithreaded workloads that represent a broad collection of Java applications for both servers and clients. It can be used to evaluate the performance of Java Virtual Machines (JVM) and the underlying hardware systems.

Table 1. Comparison between SPECjvm98 and SPECjvm2008

   Features                             SPECjvm98           SPECjvm2008
   Target                               Client              client and server
   Multi-threading                      No                  Yes
   All code is available                No                  Yes
   Number of sub-groups                 7                   11
   Free downloadable                    No                  Yes
   Include Base and Peak scores         No                  Yes
   Fixed run duration                   Yes                 No
   Measurements unit                    Time                Ops/min
   Benchmark output verification        Yes                 Yes
   Single tier                          Yes                 Yes
   JDK                                  JDK 1.1 or later    JDK 5.0 or later
   Only 1 JVM instance is allowed       Yes                 Yes

It can stress various components inside the JVM, such as the Java Runtime Environment (JRE), Just-in-time (JIT) code generation, the memory management system, and threading and synchronization features. SPECjvm2008 is also designed with modern multicore processors in mind. A single JVM instance running the workload will generate enough threads to stress the underlying hardware systems. It is expected to be useful in the evaluation of many hardware features such as the impact of the number of cores and processors, the frequency of the processors, integer and floating point operations, the cache hierarchy and the memory subsystem. SPECjvm2008 comes with a set of analysis tools, such as a plug-in analysis framework that can gather run time information such as heap and power usage. It also comes with a reporter that displays a summary graph of test runs. It is easy to configure and run and provides quick feedback for performance analysis. SPECjvm2008 is perhaps a little biased towards server performance, as the minimum memory requirement is 512 MB per hardware thread.

SPECjvm2008 can be run in 2 modes: base and peak runs. The base run simulates environments in which users do not tune software to increase performance. No configuration or hand tuning of the JVM is allowed. The base run has fixed run durations: a 120 second warm-up, followed by a 240 second measurement interval. The peak run simulates environments in which users are allowed to tune the JVM to increase performance. It also allows feedback optimizations and code caching. The JVM can be configured to obtain the best score possible, using command line parameters and property files, which must be explained in the submission. In addition, the peak run has no restrictions on either the warm-up time or the measurement interval, but only 1 measurement iteration is allowed for each workload. A base submission is required for a peak submission.

SPECjvm2008 is available for free. It can be downloaded from the SPEC website. SPECjvm2008 is composed of 11 groups of Java SE applications for both clients and servers. Each group represents a unique area of Java applications. The overall score is computed by nested geometric means as described by Richard M. Yoo et al. [5].

Score = \sqrt[k]{\sqrt[n_1]{X_{11} \cdots X_{1 n_1}} \cdots \sqrt[n_k]{X_{k 1} \cdots X_{k n_k}}}

The overall SPECjvm2008 score is computed by substituting k by 11 and n_1, ..., n_k by the corresponding numbers of workloads in each group, where X_ij is the score of workload j in group i. Each of the 11 groups therefore carries an equal weight (one 11th root) in the final composite score. The compositions of the 11 groups of workloads are summarized in Table 2. Tests are run in order, i.e., starting with startup.helloworld and ending with xml.validation. A new JVM instance is launched for each “startup” workload. After all the “startup” workloads are run, a single JVM is launched to run the rest of the workloads, i.e., from compiler.compiler to xml.validation. Thus, the environment left from running each workload may impact the performance of the workloads coming after it.
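Read literally, the nested geometric mean can be sketched as follows (illustrative only; the official computation is performed by the SPECjvm2008 harness, and the group scores here are invented):

```python
from math import prod

def geomean(values):
    return prod(values) ** (1.0 / len(values))

def composite_score(group_scores):
    # group_scores: mapping from group name to the list of workload scores in that group.
    # Inner geometric mean per group, outer geometric mean over the (eleven) groups.
    return geomean([geomean(scores) for scores in group_scores.values()])

# Hypothetical example with three of the eleven groups:
print(round(composite_score({
    "compiler": [120.0, 95.0],
    "crypto": [80.0, 140.0, 110.0],
    "compress": [100.0],
}), 1))
```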

Table 2. SPECjvm2008 Benchmark Composition

   Startup (17 workloads): startup.helloworld, startup.compiler.compiler, startup.compiler.sunflow, startup.compress, startup.crypto.aes, startup.crypto.rsa, startup.crypto.signverify, startup.mpegaudio, startup.scimark.fft, startup.scimark.lu, startup.scimark.monte_carlo, startup.scimark.sor, startup.scimark.sparse, startup.serial, startup.sunflow, startup.xml.transform, startup.xml.validation
   Compiler (2 workloads): compiler.compiler, compiler.sunflow
   Compress (1 workload): compress
   Crypto (3 workloads): crypto.aes, crypto.rsa, crypto.signverify
   Derby (1 workload): derby
   Mpegaudio (1 workload): mpegaudio
   Scimark Large (5 workloads): scimark.fft.large, scimark.lu.large, scimark.sor.large, scimark.sparse.large, scimark.monte_carlo
   Scimark Small (5 workloads): scimark.fft.small, scimark.lu.small, scimark.sor.small, scimark.sparse.small, scimark.monte_carlo
   Serial (1 workload): serial
   Sunflow (1 workload): sunflow
   Xml (2 workloads): xml.transform, xml.validation

2.1   Startup

The startup group of workloads measures the JVM startup time of each workload by starting each one of them with a new instance of the JVM.


Each workload in this group is single threaded. Each of them is a startup run of the corresponding throughput test in the suite. The only exception is helloworld, which does not have a corresponding throughput test. When a new JVM is launched for each workload within this group, only default JVM parameters are used. Each test measures the time for a JVM to complete one loop of the corresponding throughput test. A single group score is computed by taking the geometric mean of the 17 individual startup scores.

2.2   Compiler

The compiler group has two workloads: compiler and sunflow. The compiler.compiler workload measures the compilation time for the OpenJDK compiler. The compiler.sunflow workload measures the compilation of the sunflow benchmark. As the goal of these workloads is to evaluate the performance of the compiler, the impact of I/O is reduced by storing input data in memory, or file cache.

2.3   Compress

The compress workload is taken from SPECjvm98. It compresses and decompresses data using a modified Lempel-Ziv method. The input data is extended from 90 KB to 3.36 MB. To minimize the impact of I/O, data is buffered. Its algorithm uses internal tables (~67 KB in size) and pseudo-random access based on input data. This workload exercises just-in-time compiling, inlining, array access and cache performance as the JVM generates and handles mixed length data accesses.

2.4   Crypto

The crypto group contains three different workloads to represent different important areas of cryptography. They test vendor implementations of the protocols as well as JVM execution. The three workloads are crypto.aes, crypto.rsa and crypto.signverify. The crypto.aes workload encrypts and decrypts using the AES and DES protocols, applying CBC/PKCS5Padding and CBC/NoPadding. The input data sizes are 100 bytes and 713 KB, respectively. The crypto.rsa workload encrypts and decrypts using the RSA protocol for input data sizes of 100 bytes and 16 KB. The crypto.signverify workload signs and verifies using the MD5withRSA, SHA1withRSA, SHA1withDSA and SHA256withRSA protocols for input data sizes of 1 KB, 65 KB and 1 MB. Different crypto providers can be used.

2.5   Derby

The derby workload uses an open-source database, derby [8], written in pure Java. Multiple databases are instantiated when the workload is started. Every 4 threads share one database instance. Synchronization is exercised in this workload. This workload extended IBM’s telco benchmark [12] to synthesize business logic and to stress the use of BigDecimal operations.


2.5 Derby

The derby workload uses an open-source database, Derby [8], written in pure Java. Multiple databases are instantiated when the workload is started, and every 4 threads share one database instance, so synchronization is exercised in this workload. This workload extends IBM's telco benchmark [12] to synthesize business logic and to stress the use of BigDecimal operations. These BigDecimal computations are mostly longer than 64 bits, to examine not only 'simple' BigDecimal values, which can be implemented using the long type, but also BigDecimal values that have to be stored in larger data sizes. Thus this workload exercises both database and BigDecimal operations.

2.6 Mpegaudio

As the source for the mpegaudio workload from SPECjvm98 cannot be made available, a new version of mpegaudio was created for SPECjvm2008. It uses the MP3 library JLayer [9], an MPEG audio decoder. This workload is floating-point computation centric. Its input data set contains six MP3 files sized from 20 KB to 3 MB.

2.7 Scimark

Scimark, as the name implies, is based on the well-known Scimark benchmark developed by NIST [10]. This group of workloads evaluates floating-point operations and data access patterns for intensive mathematical computations. Scimark is modified for multi-threading with different data set sizes in SPECjvm2008. Scimark is actually composed of two groups in SPECjvm2008, scimark.large and scimark.small, for large and small data sets. Each thread in the workload consumes one data set. The "large" group runs with a 32 MB data set to simulate out-of-cache access performance, while the "small" group runs with a 512 KB data set to simulate in-cache access performance. Each group is composed of 5 workloads: fft, lu, sor, sparse and monte_carlo. Scimark.monte_carlo is run once but counted in both scimark large and scimark small, as the workload does not operate on different data set sizes.

2.7.1 Scimark.FFT

Scimark.fft computes a one-dimensional, in-place Fast Fourier Transform with bit reversal and N log(N) complexity for large (2 MB) and small (512 KB) data sets.

2.7.2 Scimark.SOR

Scimark.sor performs Jacobi Successive Over-Relaxation for large (2048x2048 grid) and small (250x250 grid) data sets. It exercises typical access patterns in finite difference applications. The algorithm exercises basic "grid averaging" memory patterns, where each A(i,j) is assigned an average weighting of its four nearest neighbors.

2.7.3 Scimark.Sparse

Scimark.sparse performs matrix multiplication using an unstructured sparse matrix stored in compressed-row format with a prescribed sparsity structure. It exercises indirect addressing and non-regular memory references for large and small data sets. The large data set contains a 200000x200000 matrix in compressed form with 4000000 non-zeros in it. The small data set contains a 25000x25000 matrix in compressed form with 62500 non-zeros in it.
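The compressed-row storage and indirect addressing just described can be illustrated with a small sketch. This is our own illustration, not SciMark or SPECjvm2008 source code, and the matrix contents are invented.

```java
// Compressed-row (CRS) sparse matrix-vector product, the access pattern that
// scimark.sparse exercises.
public class SparseMatVec {
    static void multiply(double[] val, int[] col, int[] rowPtr, double[] x, double[] y) {
        for (int r = 0; r < rowPtr.length - 1; r++) {
            double sum = 0.0;
            // Indirect addressing: col[k] selects which x element each non-zero multiplies.
            for (int k = rowPtr[r]; k < rowPtr[r + 1]; k++) {
                sum += val[k] * x[col[k]];
            }
            y[r] = sum;
        }
    }

    public static void main(String[] args) {
        // 3x3 matrix with 4 non-zeros: [[2,0,0],[0,3,1],[0,0,4]]
        double[] val = {2.0, 3.0, 1.0, 4.0};
        int[] col    = {0,   1,   2,   2};
        int[] rowPtr = {0, 1, 3, 4};
        double[] x = {1.0, 1.0, 1.0};
        double[] y = new double[3];
        multiply(val, col, rowPtr, x, y);
        System.out.println(java.util.Arrays.toString(y)); // [2.0, 4.0, 4.0]
    }
}
```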


2.7.4 Scimark.lu

Scimark.lu computes the LU factorization of a dense matrix, in place, using partial pivoting, and then solves a linear system of equations using the prefactored matrix in LU form. It exercises linear algebra kernels (BLAS-style routines) and dense matrix operations for large (2048x2048) and small (100x100) data sets.

2.7.5 Monte-carlo

Scimark.monte_carlo approximates the value of Pi by computing the integral of the quarter circle y = sqrt(1 - x^2) on [0,1]. It chooses random points within the unit square and computes the ratio of points that fall inside the quarter circle to the total number of points. The algorithm exercises random-number generators, synchronized function calls and function inlining. This workload is counted once in each of the Scimark large and Scimark small groups.

2.8 Serial

The serial workload tests the performance of serialization and deserialization of primitives and objects, using byte arrays in memory and a data set taken from a JBoss benchmark. The workload has a producer-consumer scenario in which objects are serialized by producer threads and deserialized by consumer threads on the same system. It exercises the java.lang.reflect package.

2.9 Sunflow

The sunflow workload is a multi-threaded benchmark simulating graphics and visualization using ray tracing. It runs several bundles of dependent threads, with 4 threads per bundle. The number of bundles is, by default, equal to the number of hardware threads, and is re-configurable. This workload is floating-point heavy, and its high object allocation rate puts pressure on the memory bandwidth.

2.10 XML

The XML group contains two workloads: xml.transform and xml.validation. The xml.transform workload exercises the JAXP implementation by executing XSLT transformations with DOM, SAX and Stream sources. It uses the XSLTC engine, which compiles XSL style sheets into Java classes; 10 real-life use cases are implemented. The xml.validation workload exercises the JAXP implementation by validating XML instance documents against an XML schema; 6 real-life use cases are implemented. Both XML workloads have a high object allocation rate and a high level of lock contention, and they heavily exercise string operations. Each use case has approximately the same influence on the workload score.
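Before moving on to measurements, the Monte Carlo kernel of Section 2.7.5 is simple enough to sketch in a few lines. This is our own illustration, not the SciMark or SPECjvm2008 source: the fraction of random points with x^2 + y^2 <= 1 approximates Pi/4.

```java
import java.util.Random;

public class MonteCarloPi {
    public static void main(String[] args) {
        Random rng = new Random(42);
        long inside = 0;
        long samples = 10_000_000L;
        for (long i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;   // point falls inside the quarter circle
            }
        }
        System.out.println("pi ~= " + 4.0 * inside / samples);
    }
}
```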

3 Performance Characterization of SPECjvm2008

In this section we present an initial performance characterization of SPECjvm2008. We look at data regarding various aspects of the workloads from both software and hardware perspectives. For our baseline we have chosen an Intel Core 2 based platform with 4 quad-core sockets, giving us a 16-core system.


Table 3. SPECjvm2008 Benchmark Component Scores

Workload               Score (ops/min)
compiler.compiler      937.23
compiler.sunflow       1119.25
Compress               614.14
crypto.aes             214.77
crypto.rsa             2012.82
crypto.signverify      1173.08
Derby                  174
mpegaudio              350.44
scimark.fft.large      15.49
scimark.lu.large       5.14
scimark.sor.large      25.99
scimark.sparse.large   18.93
scimark.fft.small      4384.62
scimark.lu.small       4903.85
scimark.sor.small      713.61
scimark.sparse.small   509.41
scimark.monte_carlo    4903.85
Sunflow                195.7
xml.transform          1540.12
xml.validation         1117.91

Most of the data was collected on systems running at 2.92 GHz with a 1066 MHz front-side bus. Each socket has two 4 MB last-level caches (LLC). The system had 16 GB of memory. We also used a platform with pre-release i7 processors for some additional experiments. As the i7 processors have not yet been released, we are not able to share raw performance numbers at this time; nevertheless, we are able to show some interesting data. Unless specified otherwise, the data were collected on the Core 2 based platform. On the software side, we used Sun's HotSpot JVM for Java 6, jre-6u4-perf build X64, and ran the benchmark with a heap size of 14 GB. The garbage collector was generational, stop-the-world and parallel, and the JVM allocated data and code into large pages. The XML components used the Xerces parser from Apache. The operating system was Linux (RHEL 5). Table 3 presents one set of baseline data. We have observed a fair amount, occasionally more than 5%, of run-to-run variation, and this will be the cause of some differences in data in subsequent tables. The score of each component is reported in operations per minute and can be seen to vary from a low of 5 ops/min for scimark.lu.large to a high of 4904 ops/min for scimark.lu.small and scimark.monte_carlo. This wide range motivated the benchmark developers to use a geometric mean; an arithmetic mean would clearly have been skewed by the higher-scoring components.
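As a small illustration of why the geometric mean was chosen, the sketch below (our own helper, not part of the benchmark harness) contrasts the two means on the extreme component scores from Table 3.

```java
import java.util.List;

public class MeanComparison {
    // Geometric mean: exp of the average of logs; robust to very wide score ranges.
    static double geometricMean(List<Double> scores) {
        double logSum = 0.0;
        for (double s : scores) {
            logSum += Math.log(s);
        }
        return Math.exp(logSum / scores.size());
    }

    // Arithmetic mean, shown only for contrast; dominated by the largest scores.
    static double arithmeticMean(List<Double> scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    public static void main(String[] args) {
        // The lowest and highest component scores from Table 3 (ops/min).
        List<Double> scores = List.of(5.14, 4903.85);
        System.out.printf("geometric  = %.2f%n", geometricMean(scores));  // ~158.8
        System.out.printf("arithmetic = %.2f%n", arithmeticMean(scores)); // ~2454.5
    }
}
```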


Table 4 looks at the effect of optimizing the benchmark performance through tuning and configuring of the system. SPECjvm2008 facilitates this by requiring the reporting of two scores, base and peak, whenever peak scores are reported. We see that the benchmark's performance is boosted by almost 7%. The performance increase is higher for some components, and is almost 40% for compiler.compiler. A few components, however, actually lose performance. The same set of configuration parameters needs to be used for all the components, and the options that work best for the benchmark as a whole may not be optimal for a few of the components. Bringing up the workloads with all of the configuration parameters, specifically an option referred to in HotSpot as AggressiveOpts, which turns on more sophisticated compiler optimizations in the JIT, now takes longer, hurting the performance of Startup. Performance degradation is highest for scimark.fft.large. This workload computes the FFT of a large set of data and has a 2^N stride through the data; the data access pattern is such that performance is actually hurt by the use of large pages due to ineffective cache utilization, and suffers to the extent of 30%. Most of the other components do benefit from the use of large pages.

Table 4. SPECjvm2008 Base versus Peak Comparisons

Startup compiler.compiler compiler.sunflow compress crypto.aes crypto.rsa crypto.signverify derby mpegaudio scimark.fft.large scimark.lu.large scimark.sor.large scimark.sparse.large scimark.fft.small scimark.lu.small scimark.sor.small scimark.sparse.small scimark.monte_carlo serial sunflow xml.transform xml.validation Composite Score

Peak 30.43 821.71 1103.77 611.66 217.43 1961.54 1154.41 170.58 350.42 15.47 5.14 26.16 19.09 4384.62 4557.69 715.38 481.98 5288.46 265.07 257.23 1556.09 1110.92 340.40

Base 32.38 588.23 871.19 630.47 189.08 1940.39 1139.08 165.84 346.02 22.21 5.18 26.97 18.05 4102.56 4725.42 695.54 464.83 4905.82 249.31 190.55 1399.12 995.87 318.32

Peak/Base 0.940 1.397 1.267 0.970 1.150 1.011 1.013 1.029 1.013 0.697 0.992 0.970 1.058 1.069 0.965 1.029 1.037 1.078 1.063 1.350 1.112 1.116 1.069


In Table 5, we look at some basic metrics. As points of comparison, the corresponding data for SPECjbb2005 and SPECjAppServer2004 are also included. It can be seen that, just like SPECjbb2005 but unlike SPECjAppServer2004, all components of SPECjvm2008 require very little kernel time, and the processor is almost always (>99%) in user mode. SPECjAppServer2004 has a high level of network traffic requiring considerable OS support; SPECjvm2008 has no corresponding I/O requirement. At first blush, many of the SPECjvm2008 components have a remarkably high level of context switches. Scimark.lu.large, for one example, has more than 5000 context switches per operation, and four other components suffer more than a thousand during an operation. On digging deeper, however, we see that the rate, while somewhat high, is not excessively so. Scimark.lu.large only delivers about 5 operations per minute, which means that we only observe about 26000 context switches in a minute. Compare this to SPECjAppServer2004, which only suffers about 40 context switches per transaction (JOP) but delivers about 2000 transactions per second (JOPS), and we can see that the context switch rate in SPECjAppServer2004 is 40 times higher than in Scimark.lu.large. The SPECjvm2008 component with the highest context switch rate is Derby, and its rate is about 30% of the SPECjAppServer2004 rate.

Table 5. Some Fundamental Metrics

compiler.compiler compiler.sunflow compress crypto.aes crypto.rsa crypto.signverify derby mpegaudio scimark.fft.large scimark.lu.large scimark.sor.large scimark.sparse.large scimark.fft.small scimark.lu.small scimark.sor.small scimark.sparse.small scimark.monte_carlo sunflow xml.transform xml.validation SPECjbb2005 SPECjAppServer2004

%user 99.57 99.44 99.93 99.94 98.81 99.91 97.80 99.79 98.83 93.72 97.67 92.70 99.97 99.92 99.97 99.82 99.97 99.45 88.75 99.93 99.79 78.63

%system 0.27 0.41 0.02 0.01 0.87 0.05 1.06 0.19 0.22 0.08 0.09 0.35 0.01 0.02 0.01 0.03 0.03 0.08 0.71 0.02 0.20 19.50

cswch/op 33.92 30.00 50.78 141.20 137.75 27.60 2065.01 87.01 1895.77 5409.42 1073.89 1674.87 6.94 6.37 40.97 59.13 5.71 414.96 417.58 30.47 2.00 37.00


We next look at the object allocation rate and garbage collection rate for the various SPECjvm2008 components, and again we present the data for SPECjbb2005 and SPECjAppServer2004 for comparison. Table 6 presents this data, and we can immediately see that, apart from two components, compiler.compiler and compiler.sunflow, the rest place very little demand on the garbage collection infrastructure of the JVM. Two of the components in fact make no demands on the garbage collector at all, and eleven others spend less than 0.1% of their time in garbage collection. SPECjbb2005, on the other hand, spends more than 2% of its time in GC, and SPECjAppServer2004 spends 7.5%. Not surprisingly, the object allocation data shows the same pattern, with compiler.compiler and compiler.sunflow having allocation rates lying between those of SPECjbb2005 and SPECjAppServer2004, while most other components have relatively low allocation rates. Four components, though, diverge from this pattern and show high allocation rates together with low garbage collection usage. Since the rate at which GC is invoked is directly related to the allocation rate, it follows that these components spend less time in GC because each garbage collection completes faster for them. We can theorize that these components have far fewer live objects when GC is invoked, but we have not yet fully tested this theory. Derby, for instance, has a high object allocation rate due to the frequent allocation of immutable long BigDecimal objects, which do not stay alive very long.

Table 6. Allocation and Garbage Collection Rate

Workload               Alloc (MB/s)   Alloc (MB/op)   GC %
compiler.compiler      2422.45        155.10          3.97
compiler.sunflow       2590.88        143.27          2.31
Compress               141.04         13.72           0.07
crypto.aes             868.22         237.53          0.16
crypto.rsa             298.96         9.14            0.02
crypto.signverify      875.96         44.80           0.05
Derby                  3041.56        1058.30         0.46
Mpegaudio              317.34         54.73           0.01
scimark.fft.large      11.45          44.42           0.05
scimark.lu.large       8.31           97.04           0.06
scimark.sor.large      No GC during measurement
scimark.sparse.large   18.27          57.48           0.01
scimark.fft.small      604.26         8.27            0.04
scimark.lu.small       1277.90        15.95           0.09
scimark.sor.small      No GC during measurement
scimark.sparse.small   160.64         18.88           0.01
scimark.monte_carlo    9.95           0.12            0.03
Sunflow                3405.07        1097.76         0.32
xml.transform          2832.37        109.14          0.27
xml.validation         2343.78        126.39          0.38
SPECjbb2005            3655.00        0.01            2.20
SPECjAppServer2004     1100.00        0.55            7.50
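The allocation behavior attributed to Derby above can be reproduced with a few lines of ordinary BigDecimal arithmetic. The snippet below is purely illustrative and is not taken from the derby workload; it shows why such code produces a stream of short-lived immutable objects.

```java
import java.math.BigDecimal;
import java.math.MathContext;

public class BigDecimalChurn {
    public static void main(String[] args) {
        BigDecimal balance = new BigDecimal("0");
        BigDecimal rate = new BigDecimal("1.0825");
        for (int i = 0; i < 1_000_000; i++) {
            BigDecimal charge = BigDecimal.valueOf(i).movePointLeft(2); // i cents
            // Every add/multiply returns a brand-new BigDecimal rather than mutating
            // in place; the intermediates die almost immediately, which is why GC
            // time can stay low even though the allocation rate (MB/s) is high.
            balance = balance.add(charge.multiply(rate, MathContext.DECIMAL64));
        }
        System.out.println(balance);
    }
}
```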


Turning our attention to hardware performance metrics, we look at the CPI and pathlength (instructions retired per operation) for each component of SPECjvm2008, and once again provide the data for SPECjbb2005 and SPECjAppServer2004 as well. The CPI data shows a very wide range, all the way from 0.35 for a couple of the scimark workloads, sparse.small and monte_carlo, to 37 for scimark.fft.large. While the range is large, only a few of the components have CPI values that are close to the values seen for the established benchmarks, SPECjbb2005 and SPECjAppServer2004. Pathlength, the number of instructions executed per benchmark operation, shows a similarly wide range, from 860 million instructions for scimark.fft.small to 35 billion instructions for scimark.lu.large. It is interesting to note that the SPECjvm2008 component pathlengths are much larger than the pathlengths of SPECjAppServer2004 and SPECjbb2005. The developers of the new benchmark have defined each component's operation to be a bundle of the underlying component transactions, thus leading to significantly higher pathlengths. We next look deeper at the CPI data. Since there is a wide range of CPI, and high CPI values are frequently due to a strong memory dependency, we compared the memory requirements for each component, and we present that data in Table 8.

Table 7. CPI and Pathlength

Workload compiler.compiler compiler.sunflow compress crypto.aes crypto.rsa crypto.signverify derby mpegaudio scimark.fft.large scimark.lu.large scimark.sor.large scimark.sparse.large scimark.fft.small scimark.lu.small scimark.sor.small scimark.sparse.small scimark.monte_carlo sunflow xml.transform xml.validation SPECjbb2005 SPECjAppServer2004

CPI 2.24 1.78 0.66 0.43 0.42 0.46 3.36 0.65 37.36 15.49 8.88 6.69 0.74 0.51 1.31 0.35 0.35 0.85 1.09 1.26 1.21 2.22

Pathlength 1,342,140,681.48 1,411,940,351.17 6,935,664,016.67 30,706,203,731.99 3,365,033,523.97 5,213,088,932.49 4,778,948,803.51 12,402,705,156.31 4,857,444,571.78 35,367,955,150.80 12,196,916,787.61 22,270,815,826.80 862,334,186.50 1,131,124,031.40 3,008,152,045.14 15,584,203,597.27 1,655,077,938.70 16,871,976,705.39 1,498,012,657.15 1,991,052,637.11 67,322.00 5,263,947.00
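As a rough consistency check of our own (not an analysis from the paper), CPI and pathlength can be combined with the 2.92 GHz clock and 16 cores of the baseline system described above to approximate a component's throughput, assuming the per-operation work is spread evenly across all cores.

```java
public class OpsPerMinuteEstimate {
    public static void main(String[] args) {
        // Figures for scimark.lu.large taken from Tables 3 and 7; the even-spreading
        // assumption across the 16 cores is ours.
        double cpi = 15.49;
        double pathlength = 35_367_955_150.80;   // instructions per operation
        double frequencyHz = 2.92e9;
        int cores = 16;

        double cyclesPerOp = cpi * pathlength;
        double coreSecondsPerOp = cyclesPerOp / frequencyHz;
        double opsPerMinute = cores * 60.0 / coreSecondsPerOp;

        System.out.printf("estimated ops/min = %.2f%n", opsPerMinute); // ~5.1, vs. 5.14 reported
    }
}
```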


Table 8. Memory Bandwidth Requirements

Workload               MB/op     MB/sec
compiler.compiler      155.10    2422
compiler.sunflow       143.27    2591
compress               13.72     141
crypto.aes             237.53    868
crypto.rsa             9.14      299
crypto.signverify      44.80     876
derby                  1058.30   3042
mpegaudio              54.73     317
scimark.fft.large      44.42     11
scimark.lu.large       97.04     8
scimark.sor.large      No GC during measurement
scimark.sparse.large   57.48     18
scimark.fft.small      8.27      604
scimark.lu.small       15.95     1278
scimark.sor.small      No GC during measurement
scimark.sparse.small   18.88     161
scimark.monte_carlo    0.12      10
sunflow                1097.76   3405
xml.transform          109.14    2832
xml.validation         126.39    2344
SPECjbb2005            0.02      5946
SPECjAppServer2004     2.00      4091
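The two columns of Table 8 are related through a component's throughput. The quick check below is our own, using the derby figures (1058.30 MB/op here and 174 ops/min from Table 3); the small difference from the tabulated 3042 MB/s is consistent with the run-to-run variation noted earlier.

```java
public class BandwidthCheck {
    public static void main(String[] args) {
        double mbPerOp = 1058.30;      // Table 8, derby
        double opsPerMinute = 174.0;   // Table 3, derby
        double mbPerSecond = mbPerOp * opsPerMinute / 60.0;
        System.out.printf("derby: %.0f MB/s (Table 8 reports 3042)%n", mbPerSecond); // ~3069
    }
}
```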

Interestingly, while there is indeed a correlation (several of the workloads with lower memory bandwidth requirements have lower CPIs), the workloads with the highest CPIs display very low bandwidth demand. Specifically, scimark.fft.large has a CPI of 37 and a memory bandwidth requirement of 9 MB/s. SPECjbb2005, as a point of comparison, has a CPI of 1.22 and a memory bandwidth requirement of 6 GB/s. This is true to a somewhat lesser extent for scimark.sor.large, scimark.lu.large and scimark.sparse.large. These workloads do not have a high CPI because of excessive memory bandwidth demands. However, this does not necessarily rule out memory latency as a cause of their high CPI. Intel processors provide a range of performance event counters. In Table 9 we examine some key metrics: last-level cache (LLC) MPI (misses per instruction), ITLB and DTLB misses, the number of floating-point instructions, and the HITM metric, which captures the sharing behavior of each component. The LLC MPI data gives us a clear pointer to the cause of the very high CPIs suffered by four of the scimark components. Scimark.fft.large, the workload with the highest CPI, has an MPI of 0.05, or one cache miss every 20 instructions, a rate that is approximately 20 times the rate of cache misses in SPECjbb2005 and SPECjAppServer2004.


The memory latency seen by these cache misses causes the high CPI, which strongly restricts the performance of the workload, and the resulting low throughput creates an appearance of low memory bandwidth requirement. The performance of these four workloads is therefore strongly dependent on memory latency. Of the remaining components, several have negligible cache misses, while the few with moderate CPI (compiler.*, xml.*, sunflow, derby) have MPIs of the same order of magnitude as SPECjbb2005 and SPECjAppServer2004. It is not surprising that Derby, with its high allocation rate of immutable BigDecimal objects, has a significant MPI of 0.0057. One criticism that can perhaps be leveled at SPECjbb2005 and SPECjAppServer2004 is their low usage of floating point. Some of the components of SPECjvm2008, on the other hand, can be seen to have significant levels of floating-point usage. Derby, especially, has a floating-point instruction usage rate of 0.01, or 1 out of every 100 instructions.

Table 9. Some Processor Metrics

MPI

compiler.compiler compiler.sunflow compress crypto.aes crypto.rsa crypto.signverify derby mpegaudio scimark.fft.large scimark.lu.large scimark.sor.large scimark.sparse.large scimark.fft.small scimark.lu.small scimark.sor.small scimark.sparse.small scimark.monte_carlo sunflow xml.transform xml.validation SPECjbb2005 SPECjAppServer2004

0.0049 0.0035 0.0003 0.0002 0.0001 0.0002 0.0057 0.0000 0.0530 0.0216 0.0111 0.0208 0.0002 0.0003 0.0000 0.0000 0.0000 0.0011 0.0015 0.0016 0.0028 0.0029

Floating Point Operations Retired 16,725 14,693 21,575 68,220 207,767 105,163 52,579,290 41,137 855,695 17,839,925 506,323 695,015 39,424 3,948 19,519 75,351 2,740 1,210,310 125,201 159,639 110 7,904

ITLB Miss retired

DTLB Miss

1,136,892 889,787 472 2,927 38,639 7,138 2,085,620 52,516 6,684 15,442 4,967 5,156 168 223 299 442 82 18,397 349,122 1,967,790 7 791

2,341,261 1,700,167 59,699 36,211 1,039,284 415,907 4,583,880 926,838 270,140 871,064 146,314 1,378,819 1,139 17,911 5,283 7,483 962 4,326,069 357,000 656,078 90 27,657

HITM / L2 Data Request Miss 0.126 0.138 0.015 0.146 0.206 0.080 0.245 0.662 0.336 0.333 0.594 0.007 0.196 0.124 0.212 0.616 0.573 0.169 0.253 0.303 0.000 0.050


Most of the components have small code footprints. Once again, Derby stands out as the exception, suffering an ITLB miss every 2500 instructions. None of the workloads face much DTLB pressure; some of the DTLB-miss numbers look high only until we recall the high pathlengths of these workloads. Both SPECjbb2005 and SPECjAppServer2004 have negligible HITM rates, indicating low sharing of data between the LLCs on the sockets. While these benchmarks inherently have low sharing, the HITM metric is also lowered by the benchmarks being run with multiple JVMs, one JVM per LLC. The SPECjvm2008 run rules preclude the use of multiple JVMs, which allows us to see the level of sharing amongst the threads. The more significant cases are the components which have both higher MPI and high HITM rates. Derby, the xml workloads, and some of the scimark components all exhibit high levels of cache-to-cache memory traffic. As the number of cores per chip continues to increase, even clients and small servers are rapidly gaining cores, so the scaling of these workloads with the number of processors is of some interest. Table 10 presents the scaling data by showing the performance of each component relative to its performance on one processor. Since our system has 16 processors, we have tested the scaling from 2 to 16 processors. The startup time of the workloads is unaffected by the number of processor cores available, since much of the JVM initialization code is single threaded. Several other workloads, such as compress, crypto.* and mpegaudio, exhibit excellent scaling, while a few show super-linear behavior.

Table 10. Thread Scaling

Workload Startup compiler.compiler compiler.sunflow compress crypto.aes crypto.rsa crypto.signverify derby mpegaudio scimark.fft.large scimark.lu.large scimark.sor.large scimark.sparse.large scimark.fft.small scimark.lu.small scimark.sor.small scimark.sparse.small scimark.monte_carlo serial sunflow xml.transform xml.validation

1T 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

2T 0.98 2 2 2 1.99 1.8 2 1.66 2.04 1.71 1.89 1.97 1.83 2 1.74 2 2 2.01 2.11 0.99 2 1.92

4T 1.01 3.83 4 4.14 3.98 4 4.07 3.15 4.11 2.54 2.62 3 2.62 4 3.75 4 4.03 4.35 4.08 3.95 3.87 3.43

6T 1.02 5.75 6.02 6.22 5.93 5.69 6.07 4.69 6.13 2.97 2.63 2.87 2.8 5.8 5.25 5.79 5.76 6.03 6.25 8.49 5.72 5.16

8T 1.01 7.26 7.21 8.37 7.92 7.8 8.08 5.61 8.25 3.25 2.57 2.9 3.47 7.8 6 8.05 7.93 8.37 8.33 13.31 7.37 7.12

10T 0.99 8.16 8.28 10.37 9.67 9.4 10.01 5.84 10.07 3.4 2.52 2.85 3.31 9.8 8 10.09 9.78 10.72 10.31 15.95 8.26 8.35

12T 1.03 8.58 8.94 12.37 11.94 11.69 12.21 5.8 12.32 3.43 2.58 2.83 3.33 11.2 9.26 12.09 11.85 12.73 12.22 14.67 8.63 9.88

14T 0.99 8.86 9.17 14.43 12.83 13.3 14.02 5.74 14.34 3.8 2.44 2.8 3.35 13.2 11.25 14.17 13.96 14.07 14.61 15.47 8.89 10.98

16T 0.96 7.83 9.57 15.39 15.64 15.3 16.03 5.47 16.4 3.79 2.42 2.79 3.33 15.2 11.85 16.13 15.88 18.42 24.36 18.3 9.44 11.55
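One convenient way to read Table 10 is as parallel efficiency (speedup divided by thread count). The helper below is our own and simply applies that ratio to a few of the 16-thread entries.

```java
public class ScalingEfficiency {
    static double efficiency(double speedup, int threads) {
        return speedup / threads;
    }

    public static void main(String[] args) {
        // 16T speedups taken from Table 10.
        System.out.printf("compress @16T:          %.2f%n", efficiency(15.39, 16)); // ~0.96
        System.out.printf("derby @16T:             %.2f%n", efficiency(5.47, 16));  // ~0.34
        System.out.printf("scimark.sor.large @16T: %.2f%n", efficiency(2.79, 16));  // ~0.17
    }
}
```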


Since there is sufficient run-to-run variation, this data should be treated with caution. While these data points may well be noisy, it must also be noted that this kind of scaling is not theoretically impossible; the availability of more cores allows the JVM to use more threads for compilation and optimization, and this can allow the generation of better code. This, we must emphasize, is just a theory. This benchmark is still new, and it will take some time and additional experiments to filter out the noisy data. Intel recently announced that it would release a new Xeon micro-architecture, the i7. We performed a few experiments on a pre-release platform and present those results next. Table 11 presents the ratio between peak and base scores on the i7; it is similar to that seen with the Core 2 in most respects. One notable exception is scimark.fft.small, which now suffers a 14% degradation whereas our earlier results showed a 7% gain. This workload is sensitive to data layout, and data layout changes have different effects because the two processors have very different cache architectures. The Core 2 has a two-level cache system while the i7 has a three-level cache. The second-level cache on the i7 is much smaller (only 256 KB), relying on the large (8 MB) third-level cache to reduce accesses to memory. As a result, however, the cost of accessing a line from the i7's last-level (third-level) cache is higher than the cost of accessing a line from the second-level cache.

Table 11. i7 Processor Baseline Scores

Workload               Peak/Base
compiler.compiler      1.448
compiler.sunflow       1.096
compress               1.036
crypto.aes             0.997
crypto.rsa             1.000
crypto.signverify      1.003
derby                  1.103
mpegaudio              1.013
scimark.fft.large      0.628
scimark.lu.large       1.029
scimark.sor.large      1.033
scimark.sparse.large   1.257
scimark.fft.small      0.859
scimark.lu.small       1.005
scimark.sor.small      1.001
scimark.sparse.small   0.997
scimark.monte_carlo    1.661
serial                 1.112
sunflow                1.087
xml.transform          1.015
xml.validation         1.155
Composite Score        1.073


For most workloads, the bigger third-level cache provides a performance boost. Here, however, scimark.fft.small has a cache footprint that is too large to fit in the 256 KB second-level cache on the i7 platform but fits easily into the last-level cache on both systems. As a result of these different latencies at different levels of the cache hierarchy, the use of large pages now penalizes scimark.fft.small as well. The i7 processor core includes SMT (Simultaneous Multi-Threading), which allows two software threads to execute simultaneously on one core. In Table 12 we look at the effect of SMT on SPECjvm2008 performance, and note that the benchmark as a whole enjoys a 22% boost. While almost all the components benefit to some extent, scimark.sor.small almost doubles. SMT benefit can often be limited by cache availability; the data footprint of scimark.sor.small is such that this does not happen. On the other hand, scimark.lu.small suffers due to contention for the 256 KB second-level cache. This contention is critical because the performance of scimark.lu.small is limited by L2 cache throughput. Finally, in Table 13 we look at the extent to which SPECjvm2008 benefits from faster processor core frequencies. We ran several components of the benchmark at two frequencies, 2.8 GHz and 2.93 GHz. The 4.6% increase in frequency leads to performance gains of 4-5% in all these cases.

Table 12. Effect of SMT

Workload               SMT Gain
compiler.compiler      1.161
compiler.sunflow       1.173
Compress               1.254
crypto.aes             1.387
crypto.rsa             1.189
crypto.signverify      1.059
Derby                  1.398
Mpegaudio              1.205
Scimark.fft.large      1.061
Scimark.lu.large       1.043
Scimark.sor.large      1.508
Scimark.sparse.large   1.085
Scimark.fft.small      1.018
Scimark.lu.small       0.890
Scimark.sor.small      1.925
Scimark.sparse.small   1.039
Scimark.monte_carlo    1.011
Serial                 1.184
Sunflow                1.254
xml.transform          1.199
xml.validation         1.219
Composite Score        1.216

Table 13. Frequency Scaling of i7 Platforms

Workload               Freq Gain
compiler.compiler      1.044
compiler.sunflow       1.041
compress               1.045
crypto.aes             1.055
crypto.rsa             1.054
crypto.signverify      1.048
derby                  1.045
mpegaudio              1.046
scimark.fft.large      1.214
scimark.lu.large       1.043
scimark.sor.large      1.040
scimark.sparse.large   1.055
scimark.fft.small      1.000
scimark.lu.small       1.009
scimark.sor.small      1.048
scimark.sparse.small   1.048
scimark.monte_carlo    0.880
serial                 1.047
sunflow                1.052
xml.transform          1.050
xml.validation         1.042
Composite Score        1.042

Even Derby, which is more memory-dependent than the other components, benefits fully from the frequency increase. The i7 platform uses QPI (QuickPath Interconnect) and has lower memory access latencies. Actual memory latency is also reduced by improved hardware prefetchers.

4 Analysis and Conclusions

In the early days of studying Java performance, it was very popular to use SPECjvm98. The primary reasons were that it was the first SPEC Java benchmark, that it was simple to use, and that by providing several components it allowed the performance analyst to study several different aspects of the effect of the workload on the platform. This new benchmark continues to provide the latter two benefits. However, SPECjvm2008 contains a wide range of Java tests with significantly different system characteristics, and it poses a great challenge to software optimizations and to the system under test. The fact that the reported metric is the geometric mean of 11 components, several of which have sub-components, implies that platform improvements that affect only one component are unlikely to change the reported benchmark metric.


For example, doubling the performance of Derby, while leaving the other components unchanged, will change the reported SPECjvm2008 result by just 6% (2^(1/11) is approximately 1.07). We expect therefore that keen interest will be focused on individual component scores as much as on the reported score. The workloads in SPECjvm2008 present many opportunities for the JVM to improve code generation, threading, memory management, and lock algorithm tuning. Many such changes could impact all components, though to different degrees. For example, improvements in object allocation will benefit all components, but some, like Derby, will benefit more. Other changes will benefit in a more localized manner. Of particular interest is floating-point behavior. Previous benchmarks did not stress floating point, and there was no generally accepted way of studying platform and JVM improvements in this regard. Workloads like Derby should mitigate that. Similar comments can be made about string and XML behavior, due to the XML components. Many workloads in SPECjvm2008 can also be used to evaluate current and future hardware features, especially in the memory subsystem and in lock optimization. Our conclusion, based on this first analysis of the new benchmark, is that it appears to be a valuable addition to our toolkit. While it cannot replace SPECjbb2005 or SPECjAppServer2004, and it may never be as important or as representative as those two, it provides behavior that is different enough to make it attractive to the performance analyst.

References

1. Dieckmann, S., Holzle, U.: The allocation behavior of the SPECjvm98 Java benchmarks. In: Performance Evaluation and Benchmarking with Realistic Applications, pp. 77–108. MIT Press, Cambridge (2001)
2. Radhakrishnan, R.: Microarchitectural Techniques to Enable Efficient Java Execution. Ph.D. Dissertation, University of Texas at Austin (2000)
3. Li, T., John, L.K.: Characterizing Operating System Activity in SPECjvm98 Benchmarks. In: John, L.K., Maynard, A.M.G. (eds.) Characterization of Contemporary Workloads, pp. 53–82. Kluwer Academic Publishers, Dordrecht (2001)
4. Excelsior JET Benchmarks, http://web.archive.org/web/20071217043141, http://www.excelsior-usa.com/jetbenchspecjvm.html
5. Yoo, R.M., Lee, H.-H.S., Lee, H., Chow, K.: Hierarchical Means: Single Number Benchmarking with Workload Cluster Analysis. In: IEEE International Symposium on Workload Characterization (IISWC 2007), Boston, MA, USA, September 27-29 (2007)
6. SPECjvm98 Benchmarks, http://www.spec.org/jvm98/
7. SPECjvm2008 Benchmarks, http://www.spec.org/jvm2008
8. Apache Derby, http://db.apache.org/derby/
9. JLayer, http://www.javazoom.net/javalayer/javalayer.html
10. Scimark 2.0 Benchmark, http://math.nist.gov/scimark2/
11. Sunflow, http://sunflow.sourceforge.net/
12. IBM Telco Benchmark, http://www2.hursley.ibm.com/decimal/telco.html
13. SPECjAppServer2004 Benchmark, http://www.spec.org/jAppServer2004
14. SPECjbb2005 Benchmark, http://www.spec.org/jbb2005

Performance Characterization of the Intel Itanium 2-Based Montecito Processor

Darshan Desai (2), Gerolf F. Hoflehner (2), Arun Kejariwal (1), Daniel M. Lavery (2), Alexandru Nicolau (1), Alexander V. Veidenbaum (1), and Cameron McNairy (2)

(1) Center for Embedded Computer Systems, University of California, Irvine, Irvine, CA 92697, USA
(2) Intel Compiler Lab, Intel Corporation, Santa Clara, CA 95050, USA

Abstract. This paper presents the performance characteristics of the Intel Itanium 2-based Montecito processor and compares its performance to the previous-generation Madison processor. Measurements on both are done using the industry-standard SPEC CPU2006 benchmarks. The benchmarks were compiled using the Intel Fortran/C++ optimizing compiler and run using the reference data sets. We analyze a large set of processor parameters such as cache misses, TLB misses, branch prediction, bus transactions, resource and data stalls and instruction frequencies. Montecito achieves 1.14× and 1.16× higher (geometric mean) IPC on integer and floating-point applications, respectively. We believe that the results and analysis presented in this paper can potentially guide future IA-64 compiler and architectural research.

1 Introduction

The Itanium 2 family of processors, including the Itanium 2-based dual core (also known as Montecito), provides a fast, wide, and in-order execution core coupled to a fast, wide, out-of-order memory sub-system and system interface [1]. The processor has two dual-threaded cores integrated on die with more than 26.5 MB of cache, in a 90 nm process with 7 layers of copper interconnect. Other improvements over its predecessor include the integration of 2 cores on-die, each with a dedicated 12 MB L3 cache and a 1 MB L2I cache, and dual-threading [2]. In this paper we analyze the key features of Montecito's microarchitecture which yield better performance than its predecessor (Madison) on both integer and floating-point applications. The main contributions of the paper are as follows:
❚ First, we present a description of Montecito's microarchitecture, and discuss and analyze the key enhancements which result in better performance on Montecito than on its predecessor. For this, we present detailed profiling data


corresponding to a large set of performance metrics such as cache misses, branch prediction, resource and data stalls.
❚ Second, we present a detailed characterization of the released SPEC CPU2006 suite [3]. To the best of our knowledge, this is the first work to present the behavior of CPU2006 on two generations of the IA-64 architecture.
❚ Third, we present the relative impact of various performance bottlenecks. Architects and compiler designers can use these results to accurately identify and target the key areas for future performance improvements.

1.1 Data Collection

Table 1. Experimental Setup
Processor: Intel Itanium 2-based (Montecito) Processor, 1.6 GHz
Memory: 4 GB
L1 D-Cache: 16 KB (4-way, line size: 64 bytes)
L1 I-Cache: 16 KB (4-way, line size: 64 bytes)
L2 D-Cache: 256 KB (8-way, line size: 128 bytes)
L2 I-Cache: 1 MB (8-way, line size: 128 bytes)
L3 Cache: 12 MB (12-way, line size: 128 bytes)
Compiler: Intel Fortran/C++ compiler (version 9.1)
OS: Red Hat Enterprise Linux AS release 4 (Nahant Update 3), kernel 2.6.9-36.EL #1 SMP

We obtained the performance data on a 1.6 GHz Montecito processor using the Caliper [4] performance monitoring tool. The detailed configuration is given in Table 1. The benchmarks were compiled using the Intel Fortran/C++ optimizing compiler (version 9.1). The compiler supports a wide variety of optimizations such as software pipelining, predication, software prefetching and whole-program optimizations. The events monitored for each metric, such as IPC (instructions per cycle), are listed at the start of the corresponding sections. The event monitoring process is non-intrusive, as it is built into the hardware and does not require any special setup. The data collected provides valuable insights into system behavior, especially the role played by buses, I/O and disk, which are typically not modeled in simulators. The rest of the paper is organized as follows: Section 2 presents an overview of the Montecito microarchitecture. Sections 3-10 provide in-depth performance characterization results for Montecito and compare it with the previous-generation Madison processor. Finally, we conclude in Section 11.



2 Processor Description

In the following subsections we briefly introduce the core and then the memory sub-system of Intel's Montecito processor. A high-level block diagram of Montecito is shown in Figure 1.

2.1 Execution Core

The Itanium 2 Execution Core consists of a front-end, which is responsible for delivering instructions ready to execute and a back-end, which completes execution and forwards requests to the memory sub-system.


Fig. 1. Block diagram of a single core of Montecito

Fig. 2. Montecito pipeline (IPG ROT EXP REN REG EXE DET WRB, with FP1-FP4 for floating point)
IPG: Instruction pointer generation and fetch
ROT: Instruction rotation
EXP: Instruction template decode, expand and disperse
REN: Rename (for register stack and rotating registers) and decode
REG: Register file read
EXE: ALU execution
DET: Exception detection
WRB: Write back
FPx: Floating-point pipe stage

The front-end, with two levels of branch prediction, two levels of translation look-aside buffers (TLBs) and a zero-cycle branch predictor, feeds two bundles (with 3 instructions each) into the 8-bundle-deep instruction buffer every cycle. Instruction fetch and branch prediction require only two pipe stages (the Montecito pipeline is shown in Figure 2): the IPG and ROT stages. The instruction buffer allows the front-end to continue to deliver instructions to the back-end even when the back-end is stalled, and can be completely bypassed, adding no pipe stages to execution. The instruction buffer delivers two bundles of any alignment to the remaining six pipeline stages. The dispersal logic determines issue groups from the two oldest bundles in the instruction buffer and allocates up to six instructions to the 11 available functional units (two integer, four memory, two floating point, and three branch) in the EXP stage. These instructions form an issue group and travel down the back-end pipeline, experiencing stall conditions in unison. The register renaming logic maps virtual registers in the instruction to physical registers in the REN stage to support software pipelining and stacked registers, which are managed by the Register Save Engine (RSE), which provides seemingly unlimited virtual registers. Further, 32 of the integer registers are direct mapped and do not require renaming, while 96 registers are stacked. The physical register identifiers access the actual 128-entry integer or 128-entry floating-point register file in the REG stage. The register files are highly ported to support 6 instruction accesses per cycle (12 integer read ports, since most integer instructions require 2 sources, and eight floating-point read ports, since floating-point operations may require 4 sources per instruction). The instructions in the issue group perform their operation in the EXE pipe stage, which acts as the primary coupling point between the L1D cache and the execution core. Scoreboard logic, which tracks long-latency operations, may stall the instructions in the issue group at the EXE stage to prevent an instruction from accessing an older instruction's destination until the register is written. The full bypass network allows nearly immediate access to previous instruction


results. Some instructions may fault or trap, while branch instructions may be mis-predicted.

2.2 Memory Subsystem

Montecito supports three levels of on-chip cache. Each core contains a complete cache hierarchy, with nearly 13.3 MB per core and a total of nearly 27 MB of processor cache. The first-level data cache (L1D) is a multi-ported, 16KB, fourway set associative, physically-addressed cache with a 64-byte line size. The L1D is non-blocking and in-order. Lower virtual address bits 11:0, which represent the minimum virtual page, are never translated and are used for cache indexing. The access latency of the L1D is one cycle unless the use is for an address of another load operation (i.e., pointer chasing) in which case it is two cycles. The L1D enforces a write-through, with no write-allocate policy. All stores go to the second-level cache whether they hit or miss in the L1D. If a store hits in the L1D, the data is kept in a store buffer until the L1D array becomes available for update (see Figure 6). These store buffers are capable of merging store data and forwarding it to later loads with restrictions. The L1D allocates on load misses according to temporal hints, load type, and available resources. The major enhancement to the Montecito cache hierarchy starts at the L2 caches where the L2 cache is split into dedicated instruction and data caches. This separation makes it possible to have dedicated access paths to the caches, thereby eliminating contention and capacity pressures at the L2 caches. The L2I holds 1 MB, is 8-way set associative and has a 128-byte line size but has the same sevencycle instruction-access latency as the smaller Itanium 2 unified cache. The tag and data arrays of L2I are single ported, but the control logic supports out-of-order and pipelined accesses, which enable a high utilization rate. L2D has the same structure and organization as the unified 256-Kbyte L2 cache of Itanium 2 but with several microarchitectural improvements to increase throughput and reduce latency and core stalls. In Itanium 2, any accesses to the same cache line beyond the first access that misses L2 will access the L2 tags periodically (recirculate) until the tags detect a hit. The repeated tag accesses consume bandwidth from the core and increase the L2 miss latency. The L2D suspends such secondary misses until the L2D fill occurs. At that point, the fill immediately satisfies the suspended request. This approach greatly reduces bandwidth contention and final latency. The L2D also manages the 32-entry L2 OzQ more efficiently, through pseudo-compression, to increase concurrency and reduce core stalls. The L3 is a multi-way (actual number of ways depends on the model and configuration of the particular processor) on-chip cache. It has a 128 byte line size matching the L2 and only supports entire line accesses. The L3 tags and data arrays are single ported, but pipelined allowing several accesses to each to be in flight at the same time. The L3 will allocate for read misses according to temporal hints, but will not allocate for L2 dirty victim misses (L3 write request). The hardware page walker (HPW) is the third level of address translation and performs page look-ups from the virtual hash page table (VHPT). On a L2 DTLB/ITLB miss, the HPW will access the L2 cache and (if necessary) L3


Fig. 3. IPC

cache and the memory to obtain the page entry. If the HPW does not find the page, it will generate a page fault.

3 IPC

The IPC (instructions per cycle) value signifies the amount of instruction-level parallelism (ILP) that can be achieved using a given compiler and processor. The IPC was computed by taking the ratio of the number of events corresponding to the following hardware performance counters:
IA64_INST_RETIRED: This event counts the number of retired Itanium instructions. It also includes NOP instructions and instructions that were squashed because their predicate was off; we subtract the latter (which are measured using the counters NOPS_RETIRED and PREDICATE_SQUASHED_RETIRED) to compute the effective IPC.
CPU_OP_CYCLES: This event counts the number of CPU operating cycles.
From Figure 3 we observe that Montecito achieves higher IPC than its predecessor Madison across the entire CPU2006 suite. To compare the performance of Montecito and Madison, we first compute the ratio of the IPC on Montecito and Madison for each benchmark and then compute the geometric mean of the ratios. Our analysis shows that Montecito achieves 1.14× and 1.16× higher IPC on CINT2006 and CFP2006 respectively. The higher IPC can be attributed to a number of factors: larger caches and other cache-related microarchitectural enhancements, discussed further in Section 4, and better TLB performance, discussed further in Section 6. The low IPC of applications such as 429.mcf, 471.omnetpp, 450.soplex and 459.GemsFDTD can, in part, be ascribed to their large number of L3 cache misses (see Figure 4). Also, note that in applications such as 456.hmmer and 444.namd, an IPC of more than 4 is achieved.
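A direct transcription of that effective-IPC formula is shown below. This is our own helper with made-up counter values, included only to make the subtraction explicit.

```java
public class EffectiveIpc {
    static double effectiveIpc(long instRetired, long nopsRetired,
                               long predicateSquashedRetired, long cpuOpCycles) {
        // IA64_INST_RETIRED includes NOPs and predicated-off instructions,
        // so both are subtracted before dividing by CPU_OP_CYCLES.
        long effectiveInstructions = instRetired - nopsRetired - predicateSquashedRetired;
        return (double) effectiveInstructions / cpuOpCycles;
    }

    public static void main(String[] args) {
        // Sample counter values, not measurements from the paper.
        System.out.println(effectiveIpc(1_200_000_000L, 150_000_000L, 50_000_000L, 500_000_000L)); // 2.0
    }
}
```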

4 Cache Performance

In this section, we present a detailed analysis of the performance of the data cache. Recall that each core on Montecito has a unified L3 cache. Therefore, while evaluating the L3 performance with respect to the data stream, it is critical to measure


Fig. 4. Number of L1D, L2D and L3 data misses per 1000 retired instructions

Fig. 5. L2D Buffers

the L3 misses which occur due to data reads and writes only. We measured the performance of the data cache using the following hardware performance counters: ❶ L1D READ MISSES.ALL: This event counts the number of L1D read misses. ❷ L2D MISSES: This event counts the number of L2D misses (in terms of the L2D cache line requests sent to L3). ❸ L3 READS.DATA READ.MISS: This event counts the number of L3 load misses. ❹ L3 WRITES.DATA WRITE.MISS: This event counts the number of L3 store misses (excludes L2D write backs, includes L3 read for ownership requests that satisfy stores). The total number of L3 data misses is computed as the sum of ❸ and ❹. This does not include the L3 instruction misses. From Figure 41 we observe that, on an average, Montecito incurs fewer data cache misses as compared to Madison at any level of the cache hierarchy. This can be attributed to the larger caches on Montecito. For example, A reduction of up to 1.38× (429.mcf) and 1.76× (470.lbm) in L1D and L2D cache misses is achieved respectively. In general, we note that the reduction in data cache misses is higher in CINT2006 than in CFP2006. Even then, the L1D miss rate is higher in CINT2006 than CFP2006. 1

L3D in the figure refers to (unified) L3 data read and write misses. It should not be interpreted as misses corresponding to a separate L3 data cache.

42

D. Desai et al.

For applications such as 433.milc and 459.GemsFDTD, we observe that the number of L2D cache misses is higher than L1D cache misses. This stems from the fact that in Itanium floating-point loads do not access the L1D. The benefit of this is that it enables the issue of floating-point loads on any of the 4 memory ports with minimal restrictions. It also explains the higher L2D and L3 data cache miss rate in CFP2006 as compared to CINT2006. For example, CFP2006 incurs over 2-fold L3 data cache misses compared to CINT2006. This in turn increases the memory bus pressure thereby affecting performance adversely. The higher number of floating-point loads also explains the higher L2D cache miss rate than L1D cache miss rate in the integer application 429.mcf. In the rest of this subsection, we discuss the performance of the individual L2D buffers. 4.1

OzQ Buffer

The L2 OzQ and its control logic provide the non-blocking and re-ordering capabilities of the L2 (see Figure 6). This structure holds up to 32 operations that cannot be satisfied by the L1D. L1D requests which are conflict-free at the L2 require fewer than 32 L2 OzQ entries to full streaming. The additional entries allow the L1D pipeline to continue servicing core memory requests that hit and issuing additional requests of the L2 while the L2 resolves the conflicts. The OzQ control logic maintains 3 round robin pointers to track head, tail, and issue. New requests are allocated at the tail pointer which indicates where the youngest OzQ operation exists. The head pointer is the oldest operation in the OzQ. The issue pointer is always between head and tail and indicates where

Fig. 6. Memory subsystem and system interface

R Performance Characterization of Itanium 2-Based Montecito Processor

43

the issue logic should look for new operations to send down the L2 pipeline. A consequence of this head and tail organization is that holes may appear in the OzQ from operations that have issued (OzQ entries between head and tail that are no longer valid). The OzQ is not compressed when these holes develop. Without compression, these holes are not available to new L1D requests. Thus, there may be instances where the OzQ control logic indicates that there is no more room for new L1D requests, despite the fact that only a few OzQ entries are valid. Every cycle the L2 OzQ searches 16 requests, starting at head, for requests to issue to the L2 data array (L2 hits), the system bus/L3 (L2 misses), or back to the L1D for another L2 tag lookup (recirculate). The L2 OzQ control logic allocates up to four contiguous entries per cycle starting from the last entry allocated in the previous cycle (the tail). If there are too few entries available (between 4 and 12), the L1D pipeline is stalled to prohibit any additional operations being passed to the L2. Requests are removed from the L2 OzQ when they complete at the L2 - that is when a store updates the data array, or when a load returns correct data to the core, or when an L2 miss request is accepted by the system bus/L3. Whenever the OzQ is full, there is an increased L2D back pressure which results in back-end stalls. Figure 5 reports the time (as percentage of the total execution time) for which OzQ was full. We measured the number of times the L2D OzQ was full using the L2D OZQ FULL hardware performance counter. From the figure we see that the OzQ is rarely (> 2% on an average) full in the case of CINT2006. On the contrary, in the case of applications such as 410.bwaves and 433.milc of CFP2006, the OzQ is full for more than 50% of the total execution time. This results in a high percentage of data stalls, see Figure 12, which adversely affects the overall performance. Support for elimination/minimization of the number of holes in the OzQ can potentially reduce the number of data stalls. Alternatively, a larger OzQ may yield better performance. 4.2

Fill Buffer

An entry in the L2D Fill Buffer is allocated when L2 speculatively issues an access to the L3/system. The 16 entries in the fill buffer correspond to the maximal 16 simultaneous outstanding L3 memory requests. When the buffer gets full L3 requests have to be served before new requests can be submitted. We measured the number of times the L2D Fill buffer was full using the L2D FILLB FULL hardware performance counter. From Figure 5 we see that the fill buffer is rarely full in the case of CINT2006. In contrast, akin to OzQ, the fill buffer is full for ≈ 10% of the total execution time for applications such as 410.bwaves and 437.leslie3d. However, on an average, the fill buffer is full for only 3% of the total execution time for CFP2006. 4.3

OzD Buffer

Stores that miss in the L2 record data in the 24 entry L2 Oz Data buffer and their address in the OzQ. The data needs to be merged with the 128 bytes delivered

44

D. Desai et al.

Fig. 7. Number of L1I, L2I and L3 data misses per 1000 retired instructions

from the L3/system interface.2 When the buffer is full for a missing store request, the processor stalls until entries can be freed. We measured the number of times the L2D Oz data buffer was full using the L2D OZD FULL hardware performance counter. From Figure 5 we note that the OzD buffer is rarely (> 1% on an average) full in both CINT2006 and CFP2006. This suggests that the OzD buffer is not a performance bottleneck for CPU2006. 4.4

Victim Buffer

The victim buffer holds L2 dirty victim data until it can be issued to the L3/system interface. Operations are issued, up to four at a time, to access the L2 data array when the conflicts are resolved and resources are available. The buffer can hold up to 16 entries. If the buffer is full for a request that misses the L2, the request will recirculate. This in turn increases the L2D back pressure and can cause back-end stalls. We measured the number of times the L2D victim buffer was full using the L2D VICTIMB FULL hardware performance counter. From Figure 5 we note that the victim buffer is rarely (> 1%) full in both CINT2006 and CFP2006. From this we conclude that victim buffer getting full does not impact the overall performance in a significant manner. 4.5

Instruction Cache

Recall that Montecito has a unified L3 cache. Therefore, while evaluating the L3 performance w.r.t. the instruction stream, it very critical to measure the L3 misses which occur due to instruction reads only. We measured the performance of the instruction cache using the following hardware performance counters: ➀ L2I DEMAND READS: This event counts the number of L1I and ISB (instruction stream buffer) misses regardless of whether they hit or miss in the RAB (Request Address Buffer). 2

Assume there is L2 miss and L3 hit. The L3 cache line size is 128 byte. The memory system reads 128 byte out of the L3, merges the data from the L2 Oz data buffer and writes it back to the L3.

R Performance Characterization of Itanium 2-Based Montecito Processor

45

Fig. 8. Data Misspeculation

➁ L2I PREFETCHES: This event counts the number of prefetch requests issued to the L2I. ➂ L2I READS.MISS.ALL: This event counts the fetches which miss the L2I-cache. ➃ L3 READS.ALL.MISS: This event counts the L3 read misses. ➄ L3 READS.DATA READ.MISS: This event counts the number of L3 load misses. The L1I misses are computed as the sum of ➀ and ➁, whereas the L3 instruction misses are computed as difference of ➃ and ➄. From Figure 73 we see that the integer program incur higher number of L1I misses than the floating-point programs. This is due to the fact that integer codes are very control-flow intensive and thus very irregular in nature, which results in higher instruction cache misses. Except 483.xalancbmk, the number of L2I misses are negligible in both CINT2006 and CFP2006. This is primarily due to the presence of large L2I cache (1MB, see Table 1 for the detailed configuration).

5

Data Speculation

Itanium supports data speculation for Source: Assembly: int &g; ld4.a rx=[ra] ;; scheduling a load in advance of one int &h; add ry=rx,1 // t = *h+1 foo() { ... or more stores. The advanced load int t; st4 [rb] = 1 // *g = 1 *g = 1; chk.a rx, rec_code records information including memt = *h + 1; resume: ... ... ory address, size and target register } rec_code: ld4 rx=[ra] ;; number into a hardware structure, add ry=rx,1 br resume ;; the Advanced Load Address Table (ALAT). The ALAT is implemented as fully associative data cache with 32 entries, tagged by the physical register number. An ALAT entry is invalidated when a subsequent store address collides (overlaps). This condition is checked by the chk.a instruction: in the case of a collision, program execution resumes at the compiler generated recovery code, which executes the non-speculative version of the load and returns to the point after the chk.a. Let us consider the example code shown on the right. In 3

L3D in the figure refers to (unified) L3 instruction misses. It should not be interpreted as misses corresponding to a separate L3 instruction cache.

46

D. Desai et al.

the above example, assuming the compiler does not have enough information whether or not the addresses of g and h overlap. In this case one can use data speculation to hoist the load above the store. We measured the data misspeculation rate using the following counters: INST FAILED CHKA LDC ALAT.ALL: This provides information on the number of failed advanced check load (chk.a) and check load (ld.c) instructions that reach retirement. INST CHKA LDC ALAT.ALL:This provides information on the number of all advanced check load (chk.a) and check load (ld.c) instructions that reach retirement. Figure 8 shows the data misspeculation percentage for CPU2006 on Montecito. From the figure we see that only two applications, viz., 435.gromacs and 454.calculix incur data misspeculation for more than 5%. On an average, CINT2006 and CFP2006 incur a data misspeculation rate of 0.65% and 2.62% respectively. Since chk.a and ld.c constitute only 0.28% and 0.46% of the total number of retired instructions in CINT2006 and CFP2006 respectively, data misspeculation does not play a key role in determining the overall performance.

6 TLB Performance

Akin to other processor parameters, the TLB performance is also directly dependent on the nature of the applications [5]. In this section we report the data and instruction TLB performance using CPU2006.

6.1 DTLB

This subsection compares the DTLB performance of Montecito and Madison. Both the processors have a 2-level DTLB. We measured the performance of each DTLB level using the following hardware performance counters: L1DTLB TRANSFER: This event counts the number of times an L1DTLB miss hits in the L2DTLB for an access counted in L1D READS. L2DTLB MISSES: This event counts the number of L2DTLB misses (which is the same as references to HPW (hardware page walker); DTLB HIT=0) for demand requests [6].
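As one illustration (not taken from the paper), the per-level DTLB miss behavior could be derived from these two counters roughly as sketched below. Treating the L1DTLB miss count as the sum of the two events, and normalizing per 1000 retired instructions, are our assumptions about the metric shown in Figure 9.

    /* Illustrative sketch: per-level DTLB misses per 1k retired instructions. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long l1dtlb_transfer = 0;  /* L1DTLB_TRANSFER: L1DTLB misses that hit the L2DTLB */
        unsigned long long l2dtlb_misses   = 0;  /* L2DTLB_MISSES: misses serviced by the page walker  */
        unsigned long long retired_inst    = 1;

        /* approximation: every L1DTLB miss either hits the L2DTLB or misses it */
        unsigned long long l1dtlb_misses = l1dtlb_transfer + l2dtlb_misses;

        printf("L1DTLB misses per 1k inst: %.2f\n", 1000.0 * l1dtlb_misses / retired_inst);
        printf("L2DTLB misses per 1k inst: %.2f\n", 1000.0 * l2dtlb_misses / retired_inst);
        return 0;
    }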

Fig. 9. DTLB Performance


Fig. 10. ITLB Performance

From Figure 9 we observe that the DTLB performance of Montecito is better than that of Madison for both CINT2006 and CFP2006. In general, we note that integer applications incur a higher number of DTLB misses than floating-point applications. This suggests that the former have a more irregular memory access pattern than the latter. Contrary to intuition, we note that the number of L2DTLB misses is higher than the number of L1DTLB misses in CFP2006. This stems from the fact that in Montecito floating-point memory operations bypass the L1DTLB and access the L2DTLB directly (for details see page 94 in [6]). This corresponds to an increase in hardware page walker utilization, which may affect performance adversely. Use of larger pages (> 16 KB) for data can potentially mitigate the above in such cases. The high number of floating-point operations also explains the high L2DTLB miss rate in the integer applications 429.mcf and 483.xalancbmk. Applications such as 434.zeusmp incur a small cache miss rate; likewise, their branch misprediction rate is negligible. However, from Figure 9 we note that 434.zeusmp incurs a large number of DTLB misses, which serve as the primary performance bottleneck. For such applications, exploration of the TLB design space is required to achieve better performance.

6.2 ITLB

This subsection compares the ITLB performance of Montecito and Madison using CPU2006. The Itanium 2-based Montecito has a 2-level ITLB. We measured the performance of each ITLB level using the following hardware performance counters: ITLB MISSES FETCH.L1ITLB: This event counts the number of misses in the L1ITLB, even if the L1ITLB is not updated for an access (uncacheable/nat page/not present page/faulting/some flushed). ITLB MISSES FETCH.L2ITLB: This event counts the total number of misses in the L1ITLB which also missed in the L2ITLB. Unlike the DTLB, from Figure 10 we note that the ITLB performance of Montecito is the same as that of Madison for both CINT2006 and CFP2006. Akin to the DTLB behavior, we observe that integer applications incur a higher number of ITLB misses than floating-point applications. This suggests that integer


applications have a more irregular memory access pattern than floating-point applications.

7 Memory Bus Transactions

In this section we analyze the memory bus pressure exerted by the CPU2006 applications on Montecito and contrast it with that of Madison. For this, we measured, using the following hardware performance counters, the bus memory transactions and correlate them with the number of L3 data read misses. BUS MEMORY.ALL.SELF: This event counts the number of bus memory transactions (i.e., memory-read-invalidate, reserved-memory-read, memory-read, and memory-write transactions). L3 READS.DATA READ.MISS: This event counts the number of L3 Load Misses (excludes reads for ownership used to satisfy stores). From Figure 11 we observe that the memory bus pressure is, on an average, higher in CFP2006 than in CINT2006 on Montecito. This can be ascribed to the higher L3 data miss rates incurred by floating-point applications, which have a larger memory footprint than integer applications. On the other hand, we note that CINT2006 incurs a higher reduction in memory bus pressure, 1.51 vs. 1.1 for CFP2006, while migrating from Madison to Montecito. This correlates to the higher reduction in L3 data cache misses for integer, as compared to floatingpoint, applications.

Fig. 11. Memory Bus Pressure

8 Stalls

In this section we analyze the relative impact of the various resource and data stalls.

8.1 Resource Stalls

Figure 12 shows the different components [7] of the total execution time for applications in CINT2006 and CFP2006. The descriptions of the different components are given below:


Fig. 12. Resource Stalls incurred

❐ Data stalls correspond to full pipe bubbles in the main pipe caused by the L1D or the execution unit (discussed further in the next subsection).
❐ RSE stalls correspond to full pipe bubbles in the main pipe caused by the Register Stack Engine. We measured this using the BE RSE BUBBLE.ALL hardware performance counter.
❐ Branch misprediction stalls correspond to full pipe bubbles in the main pipe due to flushes. We measured this using the BE FLUSH BUBBLE.ALL hardware performance counter.
❐ Front end stalls in the figure correspond to full pipe bubbles in the main pipe due to the front end. The front end can in turn be stalled due to the following reasons: FEFLUSH, TLBMISS, IMISS, branch, FILL-RECIRC, BUBBLE, IBFULL (listed in priority from high to low). We measured this using the BACK END BUBBLE.FE hardware performance counter.
❐ Scoreboarding corresponds to full pipe bubbles in the main pipe due to the FPU. We measured this using the BE L1D FPU BUBBLE.ALL hardware performance counter.

From the figure we see that data stalls are the most prominent amongst the different types of stalls mentioned above. More importantly, note that in applications such as 429.mcf, 471.omnetpp and 433.milc, data stalls account for more than 50%(!) of the total execution time. This can, in part, be attributed to their high L3 cache miss rate (refer to Figure 4). This highlights the high sensitivity of the performance of the emerging applications, represented by CPU2006, w.r.t. the cache sub-system. Further, we note that stalls due to branch mispredictions are second to data stalls. Specifically, the stalls due to branch mispredictions account for 5% and 1.6% of the total execution time, on average, in CINT2006 and CFP2006. On the other hand, front end stalls account for 4.6% and 1.4% of the total execution time, on average, in CINT2006 and CFP2006.
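The per-component percentages in Figure 12 can be obtained by normalizing each bubble counter by the total cycle count; a minimal sketch follows. The variable names and values are ours (placeholders), and the exact composition of the data-stall component (L1D plus execution-unit bubbles) follows the next subsection.

    /* Illustrative sketch: each stall component as a percentage of total
     * execution cycles, using the bubble counters named in the list above. */
    #include <stdio.h>

    int main(void)
    {
        double total_cycles  = 1.0;
        double rse_bubbles   = 0.0;  /* BE_RSE_BUBBLE.ALL     */
        double flush_bubbles = 0.0;  /* BE_FLUSH_BUBBLE.ALL   */
        double fe_bubbles    = 0.0;  /* BACK_END_BUBBLE.FE    */
        double l1d_fpu       = 0.0;  /* BE_L1D_FPU_BUBBLE.ALL */

        printf("RSE stalls:              %5.1f%%\n", 100.0 * rse_bubbles / total_cycles);
        printf("Branch mispred. stalls:  %5.1f%%\n", 100.0 * flush_bubbles / total_cycles);
        printf("Front-end stalls:        %5.1f%%\n", 100.0 * fe_bubbles / total_cycles);
        printf("L1D/FPU (scoreboarding): %5.1f%%\n", 100.0 * l1d_fpu / total_cycles);
        return 0;
    }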

8.2 Data Stalls

The breakdown of data stalls is shown in Figure 13. As mentioned earlier, the data stalls occur due to either the L1D or the execution unit. An L1D stall can


Fig. 13. Data Stalls incurred

potentially occur due to several reasons, such as: the store buffer being full, a recirculate, a hardware page walk, a store in conflict with a returning fill, L2D back pressure, or an L2DTLB to L1DTLB transfer. Register load stalls were measured using the following hardware performance counters and are computed as ①-②+③: ① BE EXE BUBBLE.GRALL: This corresponds to the case when the back-end was stalled by EXE due to a GR/GR or GR/load dependency. ② BE EXE BUBBLE.GRGR: This corresponds to the case when the back-end was stalled by EXE due to a GR/GR dependency. ③ BE EXE BUBBLE.FRALL: This corresponds to the case when the back-end was stalled by EXE due to an FR/FR or FR/load dependency. The other data stall components were measured using the following hardware performance counters: BE L1D FPU BUBBLE.L1D HPW: This measures the back-end stalls due to the Hardware Page Walker. BE L1D FPU BUBBLE.L1D PIPE RECIRC: This measures the back-end stalls due to a recirculate. The most predictable reason for a request to recirculate is that the request misses a line that is already being serviced by the system bus/L3, but has not yet returned to the L2. The L2 only retires L2 hits and primary L2 misses to an L2 line. It does not retire multiple L2 miss requests; additional misses remain in the L2 OzQ and recirculate until the tag lookup returns a hit. The request then issues from the L2 OzQ and returns data (for a load) or updates the array (for a store) as a normal L2 hit request. BE L1D FPU BUBBLE.L1D L2BPRESS: This measures the back-end stalls due to L2D Back Pressure (L2BP). BE L1D FPU BUBBLE.L1D TLB: This measures the back-end stalls due to the L2DTLB to L1DTLB transfer. Note that the various components of data stalls are not mutually exclusive. In other words, there may be overlap between the different components. From Figure 13 we note that register load stalls dominate CINT2006, except for 462.libquantum and 456.hmmer, in which recirculates dominate the data stalls. On the other hand, in CFP2006, 11 out of 17 benchmarks are dominated by register load stalls while others such as 433.milc and 459.GemsFDTD are


Fig. 14. Branch Misprediction %

dominated by the L2BP and/or L2 recirculates. The latter stems from a high number of L3 data cache misses (see Figure 4). From the data we conjecture that applications in which register-load stalls are not the dominating component are memory bandwidth bound.
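The register-load-stall figure used in this subsection (computed as ①-②+③ from the BE EXE BUBBLE counters) can be sketched as below; the variable names and placeholder values are ours, not output of the authors' tools.

    /* Illustrative sketch: register load stalls = GRALL - GRGR + FRALL. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long grall = 0;          /* (1) BE_EXE_BUBBLE.GRALL */
        unsigned long long grgr  = 0;          /* (2) BE_EXE_BUBBLE.GRGR  */
        unsigned long long frall = 0;          /* (3) BE_EXE_BUBBLE.FRALL */
        unsigned long long total_cycles = 1;

        unsigned long long reg_load_stalls = grall - grgr + frall;
        printf("register load stalls: %llu cycles (%.1f%% of total)\n",
               reg_load_stalls, 100.0 * reg_load_stalls / (double)total_cycles);
        return 0;
    }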

9 Branch Prediction

The Itanium 2 processor's branch prediction relies on a two-level prediction algorithm and two levels of branch history storage. The first level of branch prediction storage is tightly coupled to the L1I cache. This coupling allows a branch's taken/not-taken history and a predicted target to be delivered with every L1I demand access in one cycle. The branch prediction logic uses the history to access a pattern history table and determine a branch's final taken/not-taken prediction, or trigger, according to the Yeh-Patt algorithm [8]. The L2 branch cache saves the histories and triggers of branches evicted from the L1I so that they are available when the branch is revisited, providing the second storage level. We measured the branch misprediction rate using the following hardware performance counters: BR MISPRED DETAIL.ALL.ALL PRED: This event counts the number of retired branches, regardless of the prediction result. We denote this by ➀. BR MISPRED DETAIL.ALL.CORRECT PRED: This event counts the number of correctly predicted (both outcome and target) retired branches. We denote this by ➁. The branch misprediction percentage is computed as follows: Branch Misprediction % = (➀ - ➁) / ➀ × 100. Figure 14 shows the branch misprediction percentage on Montecito and Madison for the applications in CPU2006. From the figure we see that, as expected, CINT2006 incurs a higher branch misprediction rate than CFP2006. This explains the higher number of stalls caused due to branch misprediction for integer codes as compared to floating-point codes (refer to Figure 12). We also observe that the performance of the branch predictor on the two machines is almost the same. In the rest of


Fig. 15. Branch Classification

the section, we present the classification of branches and report results for the prediction accuracy for each type of branch.

9.1 Branch Classification

The branches can be broadly classified into the following categories: IP Relative branches, Indirect branches and Return branches. The breakdown of the total number of branches (see Figure 19) is shown in Figure 15. The description of each branch type is given later in this subsection. From the figure we see that IP Relative branches account for more than 90% of the total branches in both CINT2006 and CFP2006. The penalty associated with each type of branch is given in Table 2.

Table 2. Branch Misprediction Penalty

  Branch Type   Whether Prediction   Target Prediction   Penalty (cycles)
  IP Relative   Correct              Correct             0
  IP Relative   Correct              Incorrect           1
  Return        Correct              Correct             1
  Return        Correct              Incorrect           6+
  Indirect      Correct              Correct             2
  Indirect      Correct              Incorrect           6+
  Any           Incorrect            n/a                 6+

On the Itanium 2, the "whether" branch hints are .sptk, .spnt, .dptk and .dpnt (sp = static prediction, dp = dynamic prediction). They are confidence hints generated by the compiler. For example, .sptk means the code generator is certain that the branch will be taken, while .dptk means the code generator 'thinks' the branch will be taken, but is not sure. At the counter level, the WRONG PATH events count the number of mispredicted branches due to a wrong "whether prediction" hint. On the other hand, if the target address is predicted wrong, the misprediction gets accounted under WRONG TARGET.

IP Relative Branches. We measured the misprediction rate of IP relative branches on Montecito using the following hardware performance counters: BR MISPRED DETAIL.IPREL.CORRECT PRED: This event counts the number of correctly predicted (outcome and target) IP relative branches. BR MISPRED DETAIL.IPREL.WRONG PATH: This event counts the number of mispredicted IP relative branches due to wrong branch direction. BR MISPRED DETAIL.IPREL.WRONG TARGET: This event counts the number of mispredicted IP relative branches due to wrong target for taken branches.


Fig. 16. Misprediction behavior of IP relative branches

Fig. 17. Misprediction behavior of indirect branches

For better readability, we only show the percentage of the latter two in Figure 16. From the figure, we note that a high prediction accuracy is achieved for the IP relative branches. Specifically, an accuracy of 95.7% and 98.39% is achieved, on an average, for CINT2006 and CFP2006 applications respectively. Improving the prediction accuracy for IP relative branches can potentially boost the performance of integer codes, albeit by a small amount. Indirect Branches. Indirect branches are predicted on the basis of the current value in the referenced branch register. There is always a 2 cycle penalty for correctly predicted indirect branch. We measured the misprediction rate of indirect branches in CPU2006 on Montecito using the following hardware performance counters: BR MISPRED DETAIL.NRETIND.CORRECT PRED: This event counts the number of correctly predicted (outcome and target) non-return indirect branches. BR MISPRED DETAIL.NRETIND.WRONG PATH: This event counts the number of mispredicted non-return indirect branches due to wrong branch direction. BR MISPRED DETAIL.NRETIND.WRONG TARGET: This event counts the number of mispredicted non-return indirect branches due to wrong target for taken branches.


Fig. 18. Misprediction behavior of return branches

For better readability, we only show the percentage of the latter two in Figure 17. From the figure we note that indirect branches incur a large misprediction rate. Specifically, 50.79% and 54.2% of the total indirect branches are mispredicted in CINT2006 and CFP2006 respectively. In each case, the misprediction occurs due to the wrong target. On the other hand, from Figures 15 and 19, we note that indirect branches constitute a small (< 1%) percentage of the total number of branches. From this, we conclude that improving the prediction accuracy of indirect branches is unlikely to benefit the overall performance in a significant fashion. Return Branches. All predictions for return branches come from an eight-entry return stack buffer (RSB). A branch call pushes both the caller's IP and its current function state onto the RSB. A return pops off this information. There is always a 1 cycle penalty for a correctly predicted return. A return misprediction occurs when the return address popped from the RSB does not match the actual return address; since the RSB has 8 entries, such mispredicts are likely in applications with call stacks deeper than 8. We measured the misprediction rate of return branches in CPU2006 on Montecito using the following hardware performance counters: BR MISPRED DETAIL.RETURN.CORRECT PRED: This event counts the number of correctly predicted (outcome and target) return type branches. BR MISPRED DETAIL.RETURN.WRONG PATH: This event counts the number of mispredicted return type branches due to wrong branch direction. BR MISPRED DETAIL.RETURN.WRONG TARGET: This event counts the mispredicted return type branches due to wrong target for taken branches. This can happen in two cases. First, for predicated returns [(qp) br.ret], e.g., the return is predicted taken although the qualifying predicate (qp) is clear. Second, when the return branch inherits a "wrong" predication hint from another branch that has been issued within a 2-bundle window of the return. For better readability, we only show the percentage of the latter two in Figure 18. From the figure we observe that, on average, returns incur a misprediction rate of more than 1% in both CINT2006 and CFP2006. From this and Table 2 we conclude that a reduction in mispredictions due to returns will not yield significant performance gains.


Fig. 19. Breakdown of retired instructions

10 Instruction Breakdown

Figure 19 presents the instruction breakdown for both CINT2006 and CFP2006. We measured this using the following hardware performance counters: LOADS RETIRED: The event counts the number of retired loads, excluding predicated off loads. The count includes integer, floating-point, RSE, semaphores, VHPT, uncacheable loads and check loads (ld.c) which missed in ALAT and L1D (because this is the only time this looks like any other load). Also included are loads generated by squashed HPW walks. STORES RETIRED: The event counts the number of retired stores, excluding those that were predicated off. The count includes integer, floating-point, semaphore, RSE, VHPT, uncacheable stores. NOPS RETIRED: This event provides information on number of retired nop.i, nop.m, and nop.b and nop.f instructions, excluding nop instructions that were predicated off. BR MISPRED DETAIL.ALL.ALL PRED: This event counts the number of branches retired of all types, regardless of the prediction result. PREDICATE SQUASHED RETIRED: This event provides information on number of instructions squashed due to a false qualifying predicate. Includes all non-Bsyllable instructions which reached retirement with a false predicate. FP OPS RETIRED: This event provides information on number of retired floatingpoint operations, excluding all predicated off instructions. From the figure we see that loads and stores constitute, on an average, 18% and 17% of the total number of retired instructions in CINT2006 and CFP2006 respectively. Also, we note that the percentage of NOPs is higher, on an average, in CFP2006 (32%) than in CINT2006 (18.2%). This is due to the longer latency of the floating-point instructions, e.g., the floating-point multiply add (fma) has a 5 cycle latency [6].

11 Conclusion

This paper presented a detailed performance characterization, using the built-in hardware performance counters, of the dual-core dual-threaded Itanium


Montecito processor. To the best of our knowledge, this is the first work which uses the SPEC CPU2006 benchmark suite for the evaluation of an IA-64 architecture. It also compared the performance of Montecito with the previous generation Madison processor. Based on our analysis we make the following conclusions:

❐ First, Montecito achieves, on a geometric mean basis, 14% and 16% higher IPC for the integer and floating-point applications respectively. These gains are primarily due to the better cache design on Montecito as compared to Madison.
❐ Second, a relatively low IPC value is achieved for the C++ benchmarks and 429.mcf in CINT2006 and for 5 applications in CFP2006. This is primarily due to a high cache miss rate and/or a high DTLB miss rate.
❐ Third, the performance gain achievable using an oracle branch predictor on Itanium is only 5% and 1.5%, on average, for integer and floating-point applications respectively. From this, we conclude that the performance potential for a "better" branch predictor on an Itanium-based platform is relatively low for the SPEC CPU2006 benchmarks.

Acknowledgements The authors would like to thank the anonymous reviewers for their valuable feedback.

References

1. Naffziger, S., Stackhouse, B., Grutkowski, T., Josephson, D., Desai, J., Alon, E., Horowitz, M.: The implementation of a 2-core multi-threaded Itanium-family processor. IEEE Journal of Solid-State Circuits 41(1), 197–209 (2006)
2. McNairy, C., Bhatia, R.: Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro 25(2), 10–20 (2005)
3. SPEC CPU (2006), http://www.spec.org/cpu2006
4. Caliper, http://ieeexplore.ieee.org/iel5/4434/19364/00895108.pdf
5. Kandiraju, G.B., Sivasubramaniam, A.: Characterizing the d-TLB behavior of SPEC CPU 2000 benchmarks. In: Proceedings of the 2002 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Marina Del Rey, CA, pp. 129–139 (2002)
6. Dual-Core Update to the Intel Itanium 2 Processor Reference Manual, Revision 0.9 (January 2006), http://download.intel.com/design/Itanium2/manuals/30806501.pdf
7. Cvetanovic, Z., Bhandarkar, D.: Performance characterization of the Alpha 21164 microprocessor using TP and SPEC workloads. In: Proceedings of the 2nd International Symposium on High-Performance Computer Architecture, San Jose, CA, pp. 270–280 (February 1996)
8. Yeh, T.-Y., Patt, Y.N.: Alternative implementations of two-level adaptive branch prediction. In: Proceedings of the 19th International Symposium on Computer Architecture, Queensland, Australia, pp. 124–134 (1992)

A Tale of Two Processors: Revisiting the RISC-CISC Debate

Ciji Isen1, Lizy K. John1, and Eugene John2

1 ECE Department, The University of Texas at Austin
2 ECE Department, The University of Texas at San Antonio
{isen,ljohn}@ece.utexas.edu, [email protected]

Abstract. The contentious debates between RISC and CISC have died down, and a CISC ISA, the x86 continues to be popular. Nowadays, processors with CISC-ISAs translate the CISC instructions into RISC style micro-operations (eg: uops of Intel and ROPS of AMD). The use of the uops (or ROPS) allows the use of RISC-style execution cores, and use of various micro-architectural techniques that can be easily implemented in RISC cores. This can easily allow CISC processors to approach RISC performance. However, CISC ISAs do have the additional burden of translating instructions to micro-operations. In a 1991 study between VAX and MIPS, Bhandarkar and Clark showed that after canceling out the code size advantage of CISC and the CPI advantage of RISC, the MIPS processor had an average 2.7x advantage over the studied CISC processor (VAX). A 1997 study on Alpha 21064 and the Intel Pentium Pro still showed 5% to 200% advantage for RISC for various SPEC CPU95 programs. A decade later and after introduction of interesting techniques such as fusion of micro-operations in the x86, we set off to compare a recent RISC and a recent CISC processor, the IBM POWER5+ and the Intel Woodcrest. We find that the SPEC CPU2006 programs are divided between those showing an advantage on POWER5+ or Woodcrest, narrowing down the 2.7x advantage to nearly 1.0. Our study points to the fact that if aggressive micro-architectural techniques for ILP and high performance can be carefully applied, a CISC ISA can be implemented to yield similar performance as RISC processors. Another interesting observation is that approximately 40% of all work done on the Woodcrest is wasteful execution in the mispredicted path.

1 Introduction

Interesting debates on CISC and RISC instruction set architecture styles were fought over the years, e.g., the Hennessy-Gelsinger debate at the Microprocessor Forum [8] and the Bhandarkar publications [3, 4]. In the Bhandarkar and Clark study of 1991 [3], the comparison was between Digital's VAX and an early RISC processor, the MIPS. As expected, MIPS had larger instruction counts (an expected disadvantage for RISC) and VAX had larger CPIs (an expected disadvantage for CISC). Bhandarkar et al. presented a metric to indicate the advantage of RISC called the RISC factor. The average RISC factor on the SPEC89 benchmarks was shown to be approximately 2.7. Not even one of the SPEC89 programs showed an advantage on the CISC.


The Microprocessor Forum debate between John Hennessy and Pat Gelsinger included the following two quotes:

"Over the last five years, the performance gap has been steadily diminishing. It is an unfounded myth that the gap between RISC and CISC, or between x86 and everyone else, is large. It's not large today. Furthermore, it is getting smaller." - Pat Gelsinger, Intel

"At the time that the CISC machines were able to do 32-bit microprocessors, the RISC machines were able to build pipelined 32-bit microprocessors. At the time you could do a basic pipelining in CISC machine, in a RISC machine you could do superscalar designs, like the RS/6000, or superpipelined designs like the R4000. I think that will continue. At the time you can do multiple instruction issue with reasonable efficiency on an x86, I believe you will be able to put second-level caches, or perhaps even two processors on the same piece of silicon, with a RISC machine." - John Hennessy, Stanford

Many things have changed since the early RISC comparisons such as the VAX-MIPS comparison in 1991 [3]. The debates have died down in the last decade, and most of the new ISAs conceived during the last two decades have been mainly RISC. However, a CISC ISA, the x86, continues to be popular. It translates the x86 macro-instructions into micro-operations (uops of Intel and ROPS of AMD). The use of the uops (or ROPS) allows the use of RISC-style execution cores, and the use of various micro-architectural techniques that can be easily implemented in RISC cores. A 1997 study of the Alpha and the Pentium Pro [4] showed that the performance gap was narrowing; however, the RISC Alpha still showed a significant performance advantage. Many see CISC performance approaching RISC performance, but exceeding it is probably unlikely. The hardware for translating the CISC instructions to RISC-style operations is expected to consume area, power and delay. Uniform-width RISC ISAs do have an advantage for decoding, and the runtime translations that are required in CISC are definitely not an advantage for CISC. Fifteen years after the heated debates and comparisons, and at a time when all the architectural ideas in Hennessy's quote (on-chip second-level caches, multiple processors) have been put into practice, we set out to compare a modern CISC and RISC processor. The processors are Intel's Woodcrest (Xeon 5160) and IBM's POWER5+ [11, 16]. A quick comparison of key processor features can be found in Table 1. Though the processors do not have identical micro-architectures, there is significant similarity. They were released around the same time frame and have similar transistor counts (276 million for the P5+ and 291 million for the x86). The main difference between the processors is in the memory hierarchy. The Woodcrest has a larger L2 cache while the POWER5+ includes a large L3 cache. The SPEC CPU2006 results of Woodcrest (18.9 for INT / 17.1 for FP) are significantly higher than those of POWER5+ (10.5 for INT / 12.9 for FP). The Woodcrest has a 3 GHz frequency while the POWER5+ has a 2.2 GHz frequency. Even if one were to scale up the POWER5+ results and compare the scores for the CPU2006 integer programs, it is clear that, even ignoring the frequency advantage, the CISC processor exhibits an advantage over the RISC processor. In this paper, we set out to investigate the performance differences of these two processors.


Table 1. Key Features of the IBM POWER5+ and Intel Woodcrest [13]

                                IBM POWER5+        Intel Woodcrest (Xeon 5160)
  Bit width                     64 bit             32/64 bit
  Cores/chip x Threads/core     2 x 2              2 x 1
  Clock frequency               2.2 GHz            3 GHz
  L1 I/D                        2 x 64K/32K        2 x 32K/32K
  L2                            1.92 MB            4 MB
  L3                            36 MB (off-chip)   None
  Execution rate/core           5 issue            5 uops
  Pipeline stages               15                 14
  Out of order                  200 inst           126 uops
  Memory B/W                    12.8 GB/s          10.5 GB/s
  Process technology            90 nm              65 nm
  Die size                      245 mm2            144 mm2
  Transistors                   276 million        291 million
  Power (max)                   100 W              80 W
  SPECint/fp2006 [cores]        10.5 / 12.9        18.9 / 17.1 [4]
  SPECint/fp2006_rate [cores]   197 / 229 [16]     60.0 / 44.1 [4]

Other interesting processor studies in the past include a comparison of the PowerPC601 and Alpha 21064 [12], a detailed study of the Pentium Pro processor [5], a comparison of the SPARC and MIPS [7], etc.

2 The Two Chips

2.1 POWER5+

The IBM POWER5+ is an out-of-order superscalar processor. The core contains one instruction fetch unit, one decode unit, two load/store pipelines, two fixed-point execution pipelines, two floating-point execution pipelines, and two branch execution pipelines. It has the ability to fetch up to 8 instructions per cycle and dispatch and retire 5 instructions per cycle. POWER5+ is a multi-core chip with two processor cores per chip. The core has a 64KB L1 instruction cache and a 32KB L1 data cache. The chip has a 1.9MB unified L2 cache shared by the two cores. An additional 36MB L3 cache is available off-chip, with its controller and directory on the chip. The POWER5+ memory management unit has 3 types of caches to help address translation: a translation look-aside buffer (TLB), a segment look-aside buffer (SLB) and an effective-to-real address table (ERAT). The translation process starts its search with the ERAT. Only on that failing does it search the SLB and TLB. This processor supports simultaneous multithreading.


2.2 Woodcrest

The Xeon 5160 is based on Intel's Woodcrest microarchitecture, the server variant of the Core microarchitecture. It is a dual-core, 64-bit, 4-issue superscalar, moderately pipelined (14 stages), out-of-order MPU, implemented in a 65nm process. The processor can address 36 bits of physical memory and 48 bits of virtual memory. An 8-way 32KB L1 I cache and a dual-ported 32KB L1 D cache, along with a shared 4MB L2 cache, feed data and instructions to the core. Unlike the POWER5+, it has no L3 cache. Branch prediction occurs inside the Instruction Fetch Unit. The Core microarchitecture employs the traditional Branch Target Buffer (BTB), a Branch Address Calculator (BAC) and the Return Address Stack (RAS), and two more predictors. The two predictors are the loop detector (LD), which predicts loop exits, and the Indirect Branch Predictor (IBP), which picks targets based on global history and helps for branches to a calculated address. A queue has been added between the branch target predictors and the instruction fetch to hide single-cycle bubbles introduced by taken branches. The x86 instructions are generally broken down into simpler micro-operations (uops), but in certain specialized cases the processor fuses certain micro-operations to create integrated or chained operations. Two types of fusion operations are used: macro-fusion and micro-fusion.

Fig. 1. IBM POWER5+ Processor [16]

Fig. 2. Front-End of the Intel Woodcrest processor [17]

3 Methodology

In this study we use the 12 integer and 17 floating-point programs of the SPEC CPU2006 [18] benchmark suite and measure performance using the on-chip performance counters. Both the POWER5+ and Woodcrest microprocessors provide on-chip logic to monitor processor-related performance events. The POWER5+ Performance


Monitor Unit contains two dedicated registers that count instructions completed and total cycles, as well as four programmable registers, which can count more than 300 hardware events occurring in the processor or memory system. The Woodcrest architecture has a similar set of registers, two dedicated and two programmable. These registers can count various performance events such as cache misses, TLB misses, instruction types, branch mispredictions and so forth. The perfex utility from the Perfctr tool is used to perform the counter measurements on Woodcrest. A tool from IBM was used for making the measurements on POWER5+. The Intel Woodcrest processor supports both 32-bit and 64-bit binaries. The data we present for Woodcrest corresponds to the best runtime for each benchmark (hence it is a mix of 64-bit and 32-bit applications). Except for gcc, gobmk, omnetpp, xalancbmk and soplex, all other programs were in the 64-bit mode. The benchmarks for POWER5+ were compiled using XL Fortran Enterprise Edition 10.01 for AIX and XL C/C++ Enterprise Edition 8.0 for AIX. The POWER5+ binaries were compiled using the following flags:
C/C++: -O5 -qlargepage -qipa=noobject -D_ILS_MACROS -qalias=noansi -qalloca + PDF (-qpdf1/-qpdf2)
FP: -O5 -qlargepage -qsmallstack=dynlenonheap -qalias=nostd + PDF (-qpdf1/-qpdf2)
The OS used was AIX 5L V5.3 TL05. The benchmarks on Woodcrest were compiled using Intel's compilers - Intel(R) C Compiler for 32-bit applications/EM64T-based applications, Version 9.1, and Intel(R) Fortran Compiler for 32-bit applications/EM64T-based applications, Version 9.1. The binaries were compiled using the flags: -xP -O3 -ipo -no-prec-div / -prof-gen -prof-use. Woodcrest was configured to run using SUSE LINUX 10.1 (X86-64).

4 Execution Characteristics of the Two Processors

4.1 Instruction Count (Path Length) and CPI

According to the traditional RISC vs. CISC tradeoff, we expect POWER5+ to have a larger instruction count and a lower CPI compared to Intel Woodcrest, but we observe that this distinction is blurred. Figure 3 shows the path length (dynamic instruction count) of the two systems for SPEC CPU2006. As expected, the instruction counts on the RISC POWER5+ are higher in most cases; however, the POWER5+ has better instruction counts than the Woodcrest in 5 out of 12 integer programs and 7 out of 17 floating-point programs (indicated with * in Figure 3). The path length ratio is defined as the ratio of the instructions retired by POWER5+ to the number of instructions retired by Woodcrest. The path length ratio (instruction count ratio) ranges from 0.7 to 1.23 for integer programs and 0.73 to 1.83 for floating-point programs. The lack of bias is evident since the geometric mean is about 1 for both integer and floating-point applications. Figure 4 presents the CPIs of the two systems for SPEC CPU2006. As expected, the POWER5+ has better CPIs than the Woodcrest in most cases. However, in 5 out of 12 integer programs and 7 out of 17 floating-point programs, the Woodcrest CPI is better (indicated with * in Figure 4). The CPI ratio is the


Fig. 3. a) Instruction Count (Path Length)-INT

Fig. 3. b) Instruction Count (Path Length) – FP

ratio of the CPI of Woodcrest to that of POWER5+. The CPI ratio ranges from 0.78 to 4.3 for integer programs and 0.75 to 4.4 for floating-point applications. This data is in sharp contrast to what was observed in the Bhandarkar-Clark study. They obtained an instruction count ratio in the range of 1 to 4 and a CPI ratio ranging from 3 to 10.5. In their study, the RISC instruction count was always higher than CISC and the CISC CPI was always higher than the RISC CPI.


Fig. 4. a) CPI of the 2 processors for INT

Fig. 4. b) CPI of the 2 processors for FP

Figure 5 illustrates an interesting metric, the RISC factor and its change from the Bhandarkar-Clark study to our study. Bhandarkar–Clark defined RISC factor as the ratio of CPI ratio to path length (instruction count) ratio. The x-axis indicates the CPI ratio (CISC to RISC) and the y-axis indicates the instruction count ratio (RISC to CISC). The SPEC 89 data-points from the Bhandarkar-Clark study are clustered to the right side of the figure, whereas most of the SPEC CPU2006 points are located closer to the line representing RISC factor=1 (i.e. no advantage for RISC or CISC). This line represents the situation where the CPI advantage for RISC is cancelled out by the path length advantage for CISC. The shift highlights the sharp contrast between the results observed in the early days of RISC and the current results.
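A minimal sketch of the three ratios plotted in Figure 5, as we read their definitions, is given below; the numbers are placeholders, not measured data.

    /* Illustrative sketch: path length ratio, CPI ratio and RISC factor
     * (RISC factor = CPI ratio / path length ratio; values above 1 favor RISC). */
    #include <stdio.h>

    int main(void)
    {
        double insts_power5    = 1.0e12;  /* instructions retired on POWER5+   */
        double insts_woodcrest = 1.0e12;  /* instructions retired on Woodcrest */
        double cpi_power5      = 1.0;
        double cpi_woodcrest   = 1.0;

        double path_len_ratio = insts_power5 / insts_woodcrest;   /* RISC / CISC */
        double cpi_ratio      = cpi_woodcrest / cpi_power5;       /* CISC / RISC */
        double risc_factor    = cpi_ratio / path_len_ratio;

        printf("path length ratio: %.2f\n", path_len_ratio);
        printf("CPI ratio:         %.2f\n", cpi_ratio);
        printf("RISC factor:       %.2f\n", risc_factor);
        return 0;
    }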


4.2 Micro-operations Per Instruction (uops/inst)

Woodcrest converts its instructions into simpler instructions called micro-ops (uops). The number of uops per instruction gives an indication of the complexity of the x86 instructions used in each benchmark. Past studies by Bhandarkar and Ding [5] have recorded the uops per instruction to be in the 1.2 to 1.7 range for SPEC 89 benchmarks. A higher uops/inst ratio would imply that more work is done per instruction, something that is expected of CISC. Our observation on Woodcrest shows the uops per instruction ratio to be much lower than in past studies [5]: an average very close to 1. Table 2 presents the uops/inst for both the SPEC CPU2006 integer and floating-point suites. The integer programs have an average of 1.03 uops/inst and the FP programs have an average of 1.07 uops/inst. Only 482.sphinx3 has a uops/inst ratio that is similar to what was observed by Bhandarkar et al. [5] (a ratio of 1.34). Among the integer benchmarks, mcf has the highest uops/inst ratio - 1.14.

Fig. 5.(a) CPI ratio vs. Path length ratio - INT

Fig. 5.(b) CPI ratio vs. Path length ratio - FP

4.3 Instruction Mix

In this section, we present the instruction mix to help the reader better understand the later sections on branch predictor performance and cache performance. The instruction mix can give us an indication of the difference between the benchmarks. It is far from a clear indicator of bottlenecks, but it can still provide some useful information. Table 3 contains the instruction mix for the integer programs while Table 4 contains the same information for the floating-point benchmarks.


Table 2. Micro-ops per instruction for CPU2006 on Intel Woodcrest

  BENCHMARK          uops/inst     BENCHMARK            uops/inst
  400.perlbench      1.06          433.milc             1.01
  401.bzip2          1.03          434.zeusmp           1.02
  403.gcc            0.97          435.gromacs          1.01
  429.mcf            1.14          436.cactusADM        1.12
  445.gobmk          0.93          437.leslie3d         1.09
  456.hmmer          1.08          444.namd             1.02
  458.sjeng          1.06          447.dealII           1.04
  462.libquantum     1.05          450.soplex           1.00
  464.h264ref        1.02          453.povray           1.07
  471.omnetpp        0.98          454.calculix         1.05
  473.astar          1.07          459.GemsFDTD         1.16
  483.xalancbmk      0.96          465.tonto            1.08
  INT - geomean      1.03          470.lbm              1.00
                                   481.wrf              1.16
                                   482.sphinx3          1.34
                                   410.bwaves.input1    1.01
                                   416.gamess           1.02
                                   FP - geomean         1.07

Table 3. Instruction mix for SPEC CPU2006 integer benchmarks

                      POWER5+                                 Woodcrest
  BENCHMARK           Branches  Stores  Loads  Others         Branches  Stores  Loads  Others
  400.perlbench       18%       15%     25%    41%            23%       11%     24%    41%
  401.bzip2           15%       8%      23%    54%            15%       9%      26%    49%
  403.gcc             19%       17%     18%    46%            22%       13%     26%    39%
  429.mcf             17%       9%      26%    48%            19%       9%      31%    42%
  445.gobmk           16%       11%     20%    53%            21%       14%     28%    37%
  456.hmmer           14%       11%     28%    47%            8%        16%     41%    35%
  458.sjeng           18%       6%      20%    56%            21%       8%      21%    50%
  462.libquantum      21%       8%      21%    50%            27%       5%      14%    53%
  464.h264ref         7%        16%     35%    42%            8%        12%     35%    45%
  471.omnetpp         19%       17%     26%    38%            21%       18%     34%    27%
  473.astar           13%       8%      27%    52%            17%       5%      27%    52%
  483.xalancbmk       20%       9%      23%    47%            26%       9%      32%    33%

In comparing the composition of instructions in the binaries of POWER5+ and Woodcrest, the instruction mix seems to be largely similar for both architectures. We do observe that some Woodcrest binaries have a larger fraction of load instructions compared to their POWER5+ counterparts. For example, the execution of hmmer on POWER5+ has 28% load instructions while the Woodcrest version has 41% loads. Among integer programs, gcc, gobmk and xalancbmk are other programs where the percentage of loads in Woodcrest is higher than that of POWER5+.


Table 4. Instruction mix for SPEC CPU2006 floating-point benchmarks

                      POWER5+                                 Woodcrest
  BENCHMARK           Branches  Stores  Loads  Others         Branches  Stores  Loads  Others
  410.bwaves          1%        7%      46%    46%            1%        8%      47%    44%
  416.gamess          8%        8%      31%    53%            8%        9%      35%    48%
  433.milc            3%        18%     34%    46%            2%        11%     37%    50%
  434.zeusmp          2%        11%     26%    61%            4%        8%      29%    59%
  435.gromacs         4%        14%     28%    54%            3%        14%     29%    53%
  436.cactusADM       0%        14%     38%    48%            0%        13%     46%    40%
  437.leslie3d        1%        12%     28%    59%            3%        11%     45%    41%
  444.namd            5%        6%      28%    61%            5%        6%      23%    66%
  447.dealII          15%       9%      32%    45%            17%       7%      35%    41%
  450.soplex          15%       6%      26%    53%            16%       8%      39%    37%
  453.povray          12%       14%     31%    44%            14%       9%      30%    47%
  454.calculix        4%        6%      25%    65%            5%        3%      32%    60%
  459.GemsFDTD        2%        10%     31%    57%            1%        10%     45%    43%
  465.tonto           6%        13%     29%    52%            6%        11%     35%    49%
  470.lbm             1%        9%      18%    72%            1%        9%      26%    64%
  481.wrf             4%        11%     31%    54%            6%        8%      31%    56%
  482.sphinx3         8%        3%      31%    59%            10%       3%      30%    56%

We also find a difference in the fraction of branch instructions, though not as significant as the differences observed for load instructions. For example, xalancbmk has 20% branches in a POWER5+ execution and 26% branches in the case of Woodcrest. A similar difference exists for gobmk and libquantum. In the case of hmmer, unlike the previous cases, the number of branches is lower for Woodcrest (14% for POWER5+ and only 8% for Woodcrest). Similar examples of differences in the fraction of load and branch instructions can be found in the floating-point programs. A few examples are cactusADM, leslie3d, soplex, GemsFDTD and lbm. FP programs have traditionally had a lower fraction of branch instructions, but three of the programs exhibit more than 12% branches. This observation holds for both POWER5+ and Woodcrest. Interestingly, these three programs (dealII, soplex and povray) are C++ programs.

4.4 Branch Prediction

Branch prediction is a key feature in modern processors allowing out-of-order execution. The branch misprediction rate and misprediction penalty significantly influence the stalls in the pipeline and the amount of instructions that will be executed speculatively and wastefully on the mispredicted path. In Figure 6 we present the branch misprediction statistics for both architectures. We find that Woodcrest outperforms POWER5+ in this aspect. The misprediction rate for Woodcrest among integer benchmarks ranges from a low of 1% for xalancbmk to a high of 14% for astar. Only


gobmk and astar have a misprediction rate higher than 10% for Woodcrest. On the other hand, the misprediction rate for POWER5+ ranges from 1.74% for xalancbmk to 15% for astar. On average the misprediction rate for integer benchmarks is 7% for POWER5+ and 5.5% for Woodcrest. In the case of floating-point benchmarks this is 5% for POWER5+ and 2% for Woodcrest. We see that, in the case of the floating-point programs, POWER5+ branch prediction performs poorly relative to Woodcrest. This is particularly noticeable in programs like gamess, dealII, tonto and sphinx.

Fig. 6. a) Branch misprediction - INT

Fig. 6. b) Branch misprediction - FP


4.5 Cache Misses

The cache hierarchy is one of the important micro-architectural features that differ between the systems. POWER5+ has a smaller L2 cache (1.9M instead of 4M in Woodcrest), but it has a large shared L3 cache. This makes the performance of the cache hierarchies of the two processors of particular interest. Figure 7 shows the L1 data cache misses per thousand instructions for both integer and floating-point benchmarks. Among integer programs mcf stands out, while there are no floating-point programs with a similar behavior. POWER5+ has a higher L1 D cache miss rate for gcc, milc and lbm even though both processors have the same L1 D cache size. In general, the L1 data cache miss rates are under 40 misses per 1k instructions. In spite of the small L2 cache, the L2 miss ratio on POWER5+ is lower than that on Woodcrest. While no data is available to further analyze this, we suspect that differences in the

Fig. 7. a) L1 D cache misses per 1k Instructions - INT

Fig. 7. b) L1 D cache misses per 1k Instructions - FP


Fig. 8. a) L2 cache misses per 1k Instructions - INT

Fig. 8. b) L2 cache misses per 1k Instructions - FP

amount of loads in the instruction mix (as discussed earlier), differences in the instruction cache misses (POWER5+ has a bigger I-cache), etc. can lead to this.

4.6 Speculative Execution

Over the years, out-of-order processors have achieved significant performance gains from various speculation techniques. The techniques have primarily focused on control flow prediction and memory disambiguation. In Figure 9 we present the speculation percentage, a measure of the amount of wasteful execution, for the different benchmarks. We define the speculation % as the ratio of instructions that are executed speculatively but not retired to the number of instructions retired (i.e., (dispatched_inst_cnt / retired_inst_cnt) - 1). We find the amount of speculation in integer benchmarks to be


Fig. 9. (a) Percentage of instructions executed speculatively - INT (P5+: inst disp/compl; WC: uops disp/retired)

Fig. 9. (b) Percentage of instructions executed speculatively - FP (P5+: inst disp/compl; WC: uops disp/retired)

higher than floating-point benchmarks, not surprising considering the higher percentage of branches and branch mispredictions in integer programs. In general, the Woodcrest micro-architecture speculates much more aggressively compared to POWER5+. On an average, an excess of 40% of instructions in Woodcrest and 29% of instructions in POWER5+ are speculative for integer benchmarks. The amount of speculations for FP programs on average is 20% for Woodcrest and 9% for POWER5+. Despite concerns on power consumption, the fraction of instructions spent in mispredicted path has increased from the average of 20% (25% for INT and 15% for FP) seen in the 1997 Pentium Pro study. Among the floating-point programs, POWER5+ speculates more than Woodcrest in four of the benchmarks: dealII, soplex, povray and sphinx. It is interesting to note that 3 of these benchmarks are C++


programs. With limits on power and energy consumption, wastage from execution on the speculative path is of great concern.
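A minimal sketch of the speculation percentage defined at the beginning of this subsection, (dispatched / retired) - 1, follows; the counter names are placeholders (on POWER5+ dispatched vs. completed instructions are used, on Woodcrest dispatched vs. retired uops).

    /* Illustrative sketch: fraction of wasteful speculative execution. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long dispatched = 1;  /* instructions (or uops) dispatched */
        unsigned long long retired    = 1;  /* instructions (or uops) retired    */

        double speculation_pct = 100.0 * ((double)dispatched / (double)retired - 1.0);
        printf("wasteful speculative execution: %.1f%%\n", speculation_pct);
        return 0;
    }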

5 Techniques That Aid Woodcrest

Part of Woodcrest's performance advantage comes from the reduction of micro-operations through fusion. Another important technique is early load address resolution. In this section, we analyze these specific techniques.

5.1 Macro-fusion and Micro-op Fusion

Although the Woodcrest breaks instructions into micro-operations, in certain cases it also uses fusion of micro-operations to combine specific uops into integrated operations, thus taking advantage of simple or complex operations as it sees fit. Macro-fusion [11] is a new feature of Intel's Core micro-architecture, which is designed to decrease the number of micro-ops in the instruction stream. Select pairs of compare and branch instructions are fused together during the pre-decode phase and then sent through any one of the four decoders. The decoder then produces a micro-op from the fused pair of instructions. The hardware can perform a maximum of one macro-fusion per cycle. Table 5 and Table 6 show the percentage of fused operations for integer and floating-point benchmarks. In the tables, fused operations are classified as macro-fusion and micro-fusion. Micro-fusion is further classified into two: loads that are fused with arithmetic operations or an indirect branch (LD_IND_BR) and store address computations fused with data stores (STD_STA). As stated before, the version of each benchmark selected (32-bit vs. 64-bit) depends on the overall performance. This was done to give maximum performance benefit to CISC. It turns out that most of the programs performed best in the 64-bit mode, but in this mode macro-fusion does not work well. Since our primary focus is in comparing POWER5+ with Woodcrest, we used the binaries that yielded the best performance for this study too. The best case runs (runs with highest performance) for integer benchmarks have an average of 19% operations that can be fused by micro- or macro-fusion. This implies that the average uops/inst would go up from 1.03 to 1.23 uops/inst if there were no fusion. The majority of the fusion comes from micro-fusion, an average of 14%, and the rest from macro-fusion. Macro-fusion in integer benchmarks ranges from 0.13% in hmmer to 21% for xalancbmk. For micro-fusion, we find it to range from 6% (astar) to 29% (hmmer). Among the two sub-components of micro-fusion, store address computation fusion is predominant. 'Store address and store' fusion ranges from 4%, for astar, to 18%, for omnetpp. On the other hand, loads fusion (LD_IND_BR - loads that are fused with arithmetic operations or an indirect branch) is the lowest for mcf and the highest for hmmer. The best case runs (runs with highest performance) for FP benchmarks have an average of 15% uops that can be fused by micro- or macro-fusion. Almost all of the fusion is from micro-fusion. The percentage of uops that can be fused via micro-fusion in FP programs ranges from 4% (sphinx) to 21% (leslie3D).
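The relation between the fused-uop fraction and the hypothetical no-fusion uops/inst quoted above can be sketched as below. The assumption that every fused uop would otherwise have been two separate uops is ours, but it reproduces the 1.03 -> 1.23 (INT) and 1.07 -> 1.23 (FP) figures given in the text.

    /* Illustrative sketch: estimated uops/inst if fusion were disabled. */
    #include <stdio.h>

    static double unfused_uops_per_inst(double uops_per_inst, double fused_fraction)
    {
        /* assumption: every fused uop expands into two uops without fusion */
        return uops_per_inst * (1.0 + fused_fraction);
    }

    int main(void)
    {
        printf("INT: %.2f uops/inst without fusion\n", unfused_uops_per_inst(1.03, 0.19));
        printf("FP:  %.2f uops/inst without fusion\n", unfused_uops_per_inst(1.07, 0.15));
        return 0;
    }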

Table 5. Micro & macro-fusion in SPEC CPU2006 integer benchmarks

  BENCHMARK        uops/inst  %macro-fusion  %micro-fusion  %fusion  %LD_IND_BR  %STD_STA
                              uops           uops           uops     uops        uops
  400.perlbench    1.06       0%             13%            13%      3%          11%
  401.bzip2        1.03       0%             12%            12%      4%          9%
  403.gcc          0.97       15%            16%            31%      4%          13%
  429.mcf          1.14       0%             8%             8%       0%          8%
  445.gobmk        0.93       12%            19%            31%      5%          15%
  456.hmmer        1.08       0%             29%            29%      14%         15%
  458.sjeng        1.06       0%             9%             9%       2%          7%
  462.libquantum   1.05       0%             8%             8%       3%          5%
  464.h264ref      1.02       0%             18%            18%      6%          12%
  471.omnetpp      0.98       10%            22%            31%      5%          18%
  473.astar        1.07       0%             6%             6%       1%          4%
  483.xalancbmk    0.96       21%            13%            34%      13%         10%
  Average          1.03       5%             14%            19%      5%          11%

Table 6. Micro & macro-fusion in SPEC CPU2006 - FP benchmarks

  BENCHMARK        uops/inst  %macro-fusion  %micro-fusion  %fusion  %LD_IND_BR  %STD_STA
                              uops           uops           uops     uops        uops
  410.bwaves       1.01       0%             19%            19%      11%         8%
  416.gamess       1.02       0%             20%            20%      11%         9%
  433.milc         1.01       0%             13%            13%      3%          11%
  434.zeusmp       1.02       0%             13%            13%      5%          8%
  435.gromacs      1.01       0%             18%            18%      3%          14%
  436.cactusADM    1.12       0%             20%            20%      8%          12%
  437.leslie3d     1.09       0%             21%            21%      12%         10%
  444.namd         1.02       0%             9%             9%       3%          6%
  447.dealII       1.04       0%             19%            19%      12%         7%
  450.soplex       1.00       4%             15%            20%      8%          7%
  453.povray       1.07       0%             13%            13%      5%          8%
  454.calculix     1.05       0%             9%             9%       6%          3%
  459.GemsFDTD     1.16       0%             13%            13%      5%          9%
  465.tonto        1.08       0%             20%            20%      10%         10%
  470.lbm          1.00       0%             19%            19%      10%         9%
  481.wrf          1.16       0%             13%            13%      7%          6%
  482.sphinx3      1.34       0%             4%             4%       2%          2%
  Average          1.07       0%             15%            15%      7%          8%


Hypothetically, not having fusion would increase the uops/inst for floating-point programs from 1.07 to 1.23 uops/inst and for integer programs from 1.03 to 1.23 uops/inst. It is clear that this micro-architectural technique has played a significant part in blunting the advantage of RISC by reducing the number of uops that are executed per instruction.

5.2 Early Load Address Resolution

The cost of memory access has been accentuated by the higher performance of the logic unit of the processor (the memory wall). The Woodcrest architecture is said to perform an optimization aimed at reducing the load latencies of operations involving the stack pointer [2]. The work by Bekerman et al. [2] proposes tracking the ESP register, and simple operations on it of the form reg±immediate, to enable quick resolution of the load address at decode time. The ESP register in IA32 holds the stack pointer and is almost never used for any other purpose. Instructions such as CALL/RET, PUSH/POP, and ENTER/LEAVE can implicitly modify the stack pointer. There can also be general-purpose instructions that modify the ESP in the fashion ESP←ESP±immediate. These instructions are heavily used for procedure calls and are translated into uops as given below in Table 7. The value of the immediate operand is provided explicitly in the uop.

Table 7. Early load address prediction - Example

  PUSH EAX              ESP ← ESP - immediate
                        mem[ESP] ← EAX
  POP EAX               EAX ← mem[ESP]
                        ESP ← ESP + immediate
  LOAD EAX from stack   EAX ← mem[ESP+imm]

These ESP modifications can be tracked easily after decode. Once the initial ESP value is known, later values can be computed after each instruction decode. In essence, this method caches a copy of the ESP value in the decode unit. Whenever a simple modification to the ESP value is detected, the cached value is used to compute the new ESP value without waiting for the uops to reach the execution stage. The cached copy is also updated with the newly computed value. In some cases the uops cause operations that are not easy to track and compute; for example, loads from memory into the ESP or computations that involve other registers. In these cases the cached value of ESP is flagged and is not used for computations until the uop passes the execution stage and the new ESP value is obtained. In the meanwhile, if any other instruction that follows attempts to modify the ESP value, the decoder tracks the change operation and the delta value it causes. Once the new ESP value is obtained from the uop that passed the execution stage, the observed delta value is applied to it to bring the ESP register up to date. Having the ESP value at hand allows quick resolution of load addresses, thereby avoiding stalls related to that. This technique is expected to bear fruit in workloads where there is significant use of the stack, most likely for function calls. Further details on this optimization can be found in Bekerman et al. [2].
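A small software model of the decode-time ESP tracking just described is sketched below as an illustration; it is our own approximation, not Intel's implementation, and the function and field names are invented for the example.

    /* Illustrative model: the decoder keeps a cached ESP, applies simple
     * +/- immediate updates itself, and accumulates a delta while waiting
     * on an untrackable update to resolve in the execution stage. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct esp_tracker {
        uint64_t cached_esp;    /* decoder's copy of the stack pointer      */
        bool     valid;         /* false while waiting on an untrackable op */
        int64_t  pending_delta; /* simple updates seen while waiting        */
    };

    /* simple update of the form ESP <- ESP +/- immediate (PUSH/POP/CALL/RET...) */
    static void on_simple_update(struct esp_tracker *t, int64_t delta)
    {
        if (t->valid)
            t->cached_esp += delta;     /* resolved at decode, no stall */
        else
            t->pending_delta += delta;  /* remember it for later        */
    }

    /* untrackable update, e.g. a load into ESP: must wait for execution */
    static void on_untrackable_update(struct esp_tracker *t)
    {
        t->valid = false;
        t->pending_delta = 0;
    }

    /* the untrackable uop executed and produced the real ESP value */
    static void on_execute_result(struct esp_tracker *t, uint64_t real_esp)
    {
        t->cached_esp = real_esp + t->pending_delta;
        t->pending_delta = 0;
        t->valid = true;
    }

    int main(void)
    {
        struct esp_tracker t = { .cached_esp = 0x7fff0000, .valid = true };
        on_simple_update(&t, -8);          /* PUSH                         */
        on_untrackable_update(&t);         /* e.g. MOV ESP, [mem]          */
        on_simple_update(&t, -8);          /* PUSH while waiting           */
        on_execute_result(&t, 0x7ffe0000); /* value arrives from execution */
        printf("decoder's ESP: 0x%llx\n", (unsigned long long)t.cached_esp);
        return 0;
    }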


In Table 8 we present data related to the ESP optimization. The percentage of ESP.SYNC refers to the number of times the ESP value had to be synchronized with the delta value, as a percent of the total number of instructions. A high number is not desirable, as it would imply a frequent need to synchronize the ESP data, i.e., the ESP value cannot be computed at the decoder because it has to wait for the value from the execution stage. % ESP.ADDITIONS is a similar percentage for the number of ESP addition operations performed in the decode unit - an indication of the scope of this optimization. A high value for this metric is desirable because the larger the percentage of instructions that use the addition operation, the more cycles are saved. The stack optimization seems to be more predominant in the integer benchmarks than in the floating-point benchmarks. The % ESP addition optimization in integer benchmarks ranges from 0.1% for hmmer to 11.3% for xalancbmk. The % of ESP synchronization is low even for benchmarks with a high % of ESP addition. For example, xalancbmk exhibits 11.3% ESP addition and has only 3.76% ESP synchronization. The C++ programs are expected to have more function calls and hence more scope for this optimization. Among integer programs, omnetpp and xalancbmk are among the ones with a large % ESP addition. The others are gcc and gobmk; the modular and highly control-flow-intensive nature of gcc allows for these optimizations. Although astar is a C++ application, it makes very little use of C++ features [19] and we find that it has a low % of ESP addition. Among the floating-point applications, dealII and povray, both C++ applications, have a higher % of ESP addition.

Table 8. Percentage of instructions on which early load address resolutions were applied

  BENCHMARK         % ESP SYNCH   % ESP ADDITIONS
  400.perlbench     0.90%         6.88%
  401.bzip2         0.30%         1.41%
  403.gcc           1.80%         7.99%
  429.mcf           0.17%         0.24%
  445.gobmk         1.81%         8.45%
  456.hmmer         0.00%         0.11%
  458.sjeng         0.41%         3.19%
  462.libquantum    0.12%         0.13%
  464.h264ref       0.12%         1.44%
  471.omnetpp       3.06%         7.60%
  473.astar         0.01%         0.14%
  483.xalancbmk     3.76%         11.30%
  INT - geomean     1.04%         4.07%
  410.bwaves        0.03%         0.04%
  416.gamess        0.15%         0.76%
  433.milc          0.00%         0.04%
  434.zeusmp        0.00%         0.00%
  435.gromacs       0.03%         0.14%
  436.cactusADM     0.00%         0.00%
  437.leslie3d      0.00%         0.00%
  444.namd          0.00%         0.01%
  447.dealII        0.20%         3.05%
  450.soplex        0.11%         0.54%
  453.povray        0.67%         2.77%
  454.calculix      0.03%         0.09%
  459.GemsFDTD      0.08%         0.33%
  465.tonto         0.26%         0.77%
  470.lbm           0.00%         0.00%
  481.wrf           0.19%         0.35%
  482.sphinx3       0.17%         0.90%
  FP - geomean      0.12%         0.60%


On average, the benefit from the ESP-based optimization is 4% for integer programs and 0.6% for FP programs. Each ESP-based addition that is avoided amounts to avoiding the execution of one uop. Although the average benefit is low, some of the applications benefit significantly from reducing unnecessary computations, thereby helping the performance of those applications in relation to their POWER5+ counterparts.

6 Conclusion

Using the SPEC CPU2006 benchmarks, we analyze the performance of a recent CISC processor, the Intel Woodcrest (Xeon 5160), against a recent RISC processor, the IBM POWER5+. In a CISC vs. RISC comparison in 1991, the RISC processor showed an advantage of 2.7x, and in a 1997 study of the Alpha 21064 and the Pentium Pro, the RISC Alpha showed a 5% to 200% advantage on the SPEC CPU92 benchmarks. Our study shows that the performance difference between RISC and CISC has narrowed further. In contrast to the earlier studies, where the RISC processors showed dominance on all SPEC CPU programs, neither the RISC nor the CISC processor dominates in this study. In our experiments, the Woodcrest shows an advantage on several of the SPEC CPU2006 programs and the POWER5+ shows an advantage on several other programs. Various factors have helped the Woodcrest obtain its RISC-like performance. Splitting the x86 instructions into micro-operations of uniform complexity has helped; interestingly, however, the Woodcrest also combines (fuses) some micro-operations into a single macro-operation. In some programs, up to a third of all micro-operations are seen to benefit from fusion, resulting in chained operations that are executed in a single step by the relevant functional unit. Fusion also reduces the demand on reservation station and reorder buffer entries. Additionally, it reduces the net uops per instruction. The average uops per instruction for the Woodcrest in 2007 is 1.03 for integer programs and 1.07 for floating-point programs, while in Bhandarkar and Ding's 1997 study [5] using SPEC CPU95 programs, the average was around 1.35 uops/inst. Although the POWER5+ has a smaller L2 cache than the Woodcrest, it is seen to achieve equal or better L2 cache performance. The Woodcrest has better branch prediction performance than the POWER5+. Approximately 40%/20% (int/fp) of instructions in the Woodcrest and 29%/9% (int/fp) of instructions in the POWER5+ are seen to be on the speculative path. Our study points out that with aggressive micro-architectural techniques for ILP, CISC and RISC ISAs can be implemented to yield very similar performance.

Acknowledgement

We would like to acknowledge Alex Mericas, Venkat R. Indukuru and Lorena Pesantez at IBM Austin for their guidance. The authors are supported in part by NSF grant 0702694 and an IBM Faculty award. Any opinions, findings and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF) or other research sponsors.


References

1. Agerwala, T., Cocke, J.: High-performance reduced instruction set processors. Technical report, IBM Computer Science (1987)
2. Bekerman, M., Yoaz, A., Gabbay, F., Jourdan, S., Kalaev, M., Ronen, R.: Early load address resolution via register tracking. In: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 306–315
3. Bhandarkar, D., Clark, D.W.: Performance from architecture: comparing a RISC and a CISC with similar hardware organization. In: Proceedings of ASPLOS 1991, pp. 310–319 (1991)
4. Bhandarkar, D.: A Tale of Two Chips. ACM SIGARCH Computer Architecture News 25(1), 1–12 (1997)
5. Bhandarkar, D., Ding, J.: Performance Characterization of the Pentium Pro Processor. In: Proceedings of the 3rd IEEE Symposium on High Performance Computer Architecture, February 01-05, 1997, pp. 288–297 (1997)
6. Chow, F., Correll, S., Himelstein, M., Killian, E., Weber, L.: How many addressing modes are enough. In: Proceedings of ASPLOS-2, pp. 117–121 (1987)
7. Cmelik, et al.: An analysis of MIPS and SPARC instruction set utilization on the SPEC benchmarks. In: ASPLOS 1991, pp. 290–302 (1991)
8. Hennessy, Gelsinger Debate: Can the 386 Architecture Keep Up? John Hennessy and Pat Gelsinger Debate the Future of RISC vs. CISC. Microprocessor Report
9. Hennessy, J.: VLSI Processor Architecture. IEEE Transactions on Computers C-33(11), 1221–1246 (1984)
10. Hennessy, J.: VLSI RISC Processors. VLSI Systems Design, VI:10, pp. 22–32 (October 1985)
11. Inside Intel Core Microarchitecture: Setting New Standards for Energy-Efficient Performance, http://www.intel.com/technology/architecture-silicon/core/
12. Smith, J.E., Weiss, S.: PowerPC 601 and Alpha 21064: A Tale of Two RISCs. IEEE Computer
13. Microprocessor Report – Chart Watch – Server Processors. Data as of October 2007, http://www.mdronline.com/mpr/cw/cw_wks.html
14. Patterson, D.A., Ditzel, D.R.: The case for the reduced instruction set computer. Computer Architecture News 8(6), 25–33 (1980)
15. Patterson, D.: Reduced Instruction Set Computers. Communications of the ACM 28(1), 8–21 (1985)
16. Kanter, D.: Fall Processor Forum 2006: IBM's POWER6, http://www.realworldtech.com/
17. Kanter, D.: Intel's Next Generation Microarchitecture Unveiled. Real World Technologies (March 2006), http://www.realworldtech.com
18. SPEC Benchmarks, http://www.spec.org
19. Wong, M.: C++ benchmarks in SPEC CPU 2006. SIGARCH Computer Architecture News 35(1), 77–83 (2007)

Investigating Cache Parameters of x86 Family Processors

Vlastimil Babka and Petr Tůma

Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Malostranské náměstí 25, Prague 1, 118 00, Czech Republic
{vlastimil.babka,petr.tuma}@dsrg.mff.cuni.cz

Abstract. The excellent performance of contemporary x86 processors is partially due to the complexity of their memory architecture, which therefore plays a role in performance engineering efforts. Unfortunately, the detailed parameters of the memory architecture are often not easily available, which makes it difficult to design experiments and evaluate results when the memory architecture is involved. To remedy this lack of information, we present experiments that investigate detailed parameters of the memory architecture, focusing on information that is typically not available elsewhere.

1 Introduction

The memory architecture of the x86 processor family has evolved over more than a quarter of a century – by all standards, an ample time to achieve considerable complexity. Equipped with advanced features such as translation buffers and memory caches, the architecture represents an essential contribution to the overall performance of the contemporary x86 family processors. As such, it is a natural target of performance engineering efforts, ranging from software performance modeling to computing kernel optimizations. Among such efforts is the investigation of the performance related effects caused by sharing of the memory architecture among multiple software components, carried out within the framework of the Q-ImPrESS project (this work is supported by the European Union under the ICT priority of the Seventh Research Framework Program contract FP7-215013 and by the Czech Academy of Sciences project 1ET400300504). The Q-ImPrESS project aims to deliver a comprehensive framework for multicriterial quality of service modeling in the context of software service development. The investigation, necessary to achieve a reasonable modeling precision, is based on evaluating a series of experiments that subject the memory architecture to various workloads.

In order to design and evaluate the experiments, detailed information about the memory architecture exercised by the workloads is required. Lack of information about features such as hardware prefetching, associativity or inclusivity could result in naive experiment designs, where the workload behavior does not really target the intended part of the memory architecture, or in naive experiment evaluations, where incidental interference between various parts of the memory architecture is interpreted as the workload performance. Within the Q-ImPrESS project, we have carried out multiple experiments on both AMD and Intel processors. Surprisingly, the documentation provided by both vendors for their processors has turned out to be somewhat less complete and correct than necessary – some features of the memory architecture are only presented in a general manner applicable to an entire family of processors, other details are buried among hundreds of pages of assorted optimization guidelines. To overcome the lack of detailed information, we have constructed additional experiments intended specifically to investigate the parameters of the memory architecture. These experiments are the topic of this paper.

We believe that the experiments investigating the parameters of the memory architecture can prove useful to other researchers – some performance relevant aspects of the memory architecture are extremely sensitive to minute details, which makes the investigation tedious and error prone. We present both an overview of some of the more interesting experiments and an overview of the framework used to execute the experiments – Section 2 focuses on the parameters of the translation buffers, Section 3 focuses on the parameters of the memory caches, Section 4 presents the framework. After careful consideration, we have decided against providing an overview of the memory architecture of the x86 processor family. In the following, we assume familiarity with the x86 processor family on the level of the vendor supplied user guides [1,2], or at least on the general programmer level [3].

1.1 Experimental Platforms

For the experiments, we have chosen two platforms that represent common servers with both Intel and AMD processors, further referred to as Intel Server and AMD Server. Intel Server. A server configuration with an Intel processor is represented by the Dell PowerEdge 1955 machine, equipped with two Quad-Core Intel Xeon CPU E5345 2.33 GHz (Family 6 Model 15 Stepping 11) processors with internal 32 KB L1 caches and 4 MB L2 caches, and 8 GB Hynix FBD DDR2-667 synchronous memory connected via Intel 5000P memory controller. AMD Server. A server configuration with an AMD processor is represented by the Dell PowerEdge SC1435 machine, equipped with two Quad-Core AMD Opteron 2356 2.3 GHz (Family 16 model 2 stepping 3) processors with internal 64 KB L1 caches, 512 KB L2 caches and 2 MB L3 caches, integrated memory controller with 16 GB DDR2-667 unbuffered, ECC, synchronous memory. To collect the timing information, the RDTSC processor instruction is used. In addition to the timing information, we collect the values of the performance counters for events related to the experiments using the PAPI library [4] running


on top of perfctr [5]. The performance events supported by the platforms are described in [1, Appendix A.3] and [6, Section 3.14]. For the overhead incurred by the measurement framework, see [7]. Although mostly irrelevant, both platforms are running Fedora Linux 8 with kernel 2.6.25.4-10.fc8.x86_64, gcc-4.1.2-33.x86_64 and glibc-2.7-2.x86_64. Only 4-level paging with 4 KB pages is investigated.
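As an aside, reading the time stamp counter from user code can be done, for example, with the compiler intrinsic shown in the sketch below. This is our own illustration and not part of the framework described in the paper; the __rdtsc intrinsic is provided by x86intrin.h on GCC, and in practice serializing instructions may be needed around the counter reads, which is omitted here.

#include <x86intrin.h>
#include <cstdint>

// Measure the duration of a pointer walk in processor clocks (illustrative sketch).
uint64_t timed_walk (uintptr_t *start, int loopCount) {
    uint64_t before = __rdtsc ();
    uintptr_t *ptr = start;
    for (int i = 0; i < loopCount; i++)
        ptr = (uintptr_t *) *ptr;
    uint64_t after = __rdtsc ();
    // The walk result is otherwise unused; keep the compiler from removing the loop.
    asm volatile ("" : : "r" (ptr));
    return after - before;
}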

1.2 Presenting Results

To illustrate the results, we typically provide plots of values such as the duration of the measured operation or the value of a performance counter, plotted as a dependency on one of the experiment parameters. Durations are expressed in processor clocks. On Platform Intel Server, a single clock tick corresponds to 0.429 ns. On Platform AMD Server, a single clock tick corresponds to 0.435 ns. To capture the statistical variability of the results, we use boxplots of individual samples, or, where the duration of individual operations approaches the measurement overhead, boxplots of averages. The boxplots are scaled to fit the boxes with the whiskers, but not necessarily to fit all the outliers, which are usually not related to the experiment. Where boxplots would lead to poorly readable graphs, we use lines to plot the trimmed means. When averages are used in a plot, the legend of the plot gives the details. The Avg acronym denotes the standard mean of the individual observations – for example, 1000 Avg indicates that the plotted values are standard means from 1000 operations performed by the experiment. The Trim acronym denotes the trimmed mean of the individual observations, where 1% of the minimum and 1% of the maximum observations are discarded – for example, 1000 Trim indicates that the plotted values are trimmed means from 1000 operations performed by the experiment. The acronyms can be combined – for example, 1000 walks Avg Trim means that observations from 1000 walks performed by the experiment were the input of a standard mean calculation, whose outputs were the input of a trimmed mean calculation, whose output is plotted. Since the plots that use averages do not give information about the statistical variability of the results, we point out in the text those few cases where the standard deviation of the results is above 0.5 processor clock cycles or 0.2 performance event counts.
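For clarity, the following is a minimal sketch of the Avg and Trim calculations described above. It is our own illustration, assuming the 1% trimming is applied to the smallest and to the largest observations separately, and it is not code taken from the measurement framework.

#include <algorithm>
#include <vector>

// Standard mean of the observations (Avg). Assumes a non-empty input.
double avg (std::vector<double> v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / v.size ();
}

// Trimmed mean (Trim): discard 1% of the smallest and 1% of the largest observations.
double trim (std::vector<double> v) {
    std::sort (v.begin (), v.end ());
    size_t cut = v.size () / 100;                       // 1% of the observations
    std::vector<double> kept (v.begin () + cut, v.end () - cut);
    return avg (kept);
}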

2 Investigating Translation Buffers

On Platform Intel Server, the translation buffers include an instruction TLB (ITLB), two levels of data TLB (DTLB0, DTLB1), a cache of the third level paging structures (PDE cache), and a cache of the second level paging structures (PDPTE cache). On Platform AMD Server, the translation buffers include two levels of instruction TLB (L1 ITLB, L2 ITLB), two levels of data TLB (L1 DTLB, L2 DTLB), a cache of the third level paging structures (PDE cache), a cache of the second level paging structures (PDPTE cache), and a cache of the first level paging structures (PML4TE cache). Table 1 summarizes the basic parameters of the translation buffers on the two platforms, with the parameters not available in vendor documentation emphasized.

Table 1. Translation Buffer Parameters

Platform Intel Server
  Buffer         Entries       Associativity   Miss [cycles]
  ITLB           128           4-way           18.5
  DTLB0          16            4-way           2
  DTLB1          256           4-way           +7
  PDE cache      present                       +4
  PDPTE cache    present                       +8
  PML4TE cache   not present                   N/A

Platform AMD Server
  Buffer         Entries       Associativity   Miss [cycles]
  L1 ITLB        32            full            4
  L2 ITLB        512           4-way           +40
  L1 DTLB        48            full            5
  L2 DTLB        512           4-way           +35
  PDE cache      present                       +21
  PDPTE cache    present                       +21
  PML4TE cache   present                       +21

We begin our translation buffers investigation by describing experiments targeted at the translation miss penalties, which are not available in vendor documentation.

2.1 Translation Miss Penalties

The experiments we perform are based on measuring durations of memory accesses using various access patterns, constructed to trigger hits and misses as necessary. Underlying the construction of the patterns is an assumption that accesses to the same address generally trigger hits, while accesses to different addresses generally trigger misses, and the choice of addresses determines which part of the memory architecture hits or misses. Due to measurement overhead, it is not possible to measure the memory accesses alone. To minimize the distortion of the experiment results, the measured workload should perform as few additional memory accesses and additional processor instructions as possible. To achieve this, we create the access pattern in advance and store it in memory as the very data that the measured workload accesses. The access pattern forms a chain of pointers and the measured workload uses the pointer that it reads in each access as an address for the next access. The workload is illustrated in Listing 1.1.


Listing 1.1. Pointer walk workload.

// Variable start is initialized by an access pattern generator
uintptr_t *ptr = start;
for (int i = 0; i < loopCount; i++)
    ptr = (uintptr_t *) *ptr;

Experiments with instruction access use a similar workload, replacing chains of pointers with chains of jump instructions. A necessary difference from using the chains of pointers is that the chains of jump instructions must not wrap, but must contain additional instructions that control the access loop. To achieve a reasonably homogeneous workload, the access loop is partially unrolled, as presented in Listing 1.2.

Listing 1.2. Instruction walk workload.

// The jump_walk function contains the jump instructions
int len = loopCount / 16;
while (len --)
    jump_walk (); // The function is invoked 16 times

To measure the translation miss penalties, the experiments need to access addresses that miss in the TLB but hit in the L1 cache. This is done by accessing addresses that map to the same associativity set in the TLB but to different associativity sets in the L1 cache. With a TLB of size S and associativity A mapping pages of size P, the associativity set is selected by log2(S/A) bits starting with bit log2(P) of the virtual address. Similarly, with a virtually indexed L1 cache of size S and associativity A caching lines of size L, the associativity set is selected by log2(S/A) bits starting with bit log2(L) of the virtual address. The two groups of bits can partially overlap, making a choice of an associativity set in the TLB limit the choices of an associativity set in the L1 cache. We generate an access pattern that addresses a single associativity set in the TLB and chooses a random associativity set of the available sets in the L1 cache. The code of the set collision access pattern generator is presented in Listing 1.3 and accepts these parameters:

– numPages: The number of different addresses to choose from.
– numAccesses: The number of different addresses to actually access.
– pageStride: The stride of accesses in units of page size.
– accessOffset: Offset of addresses inside pages when not randomized.
– accessOffsetRandom: Tells whether to randomize offsets inside pages.

Listing 1.3. Set collision access pattern generator.

// Create array of pointers to the allocated pages
uintptr_t **pages = new uintptr_t * [numPages];
for (int i = 0; i < numPages; i++)
    pages [i] = (uintptr_t *) ((char *) buf + (size_t) i * pageStride * PAGE_SIZE);

// Cache line size is considered in units of pointer size
int numOffsets = PAGE_SIZE / LINE_SIZE;

// Create array of offsets in a page
int *offsets = new int [numOffsets];
for (int i = 0; i < numOffsets; i++)
    offsets [i] = i * LINE_SIZE;

// Randomize the order of pages and offsets
random_shuffle (pages, pages + numPages);
random_shuffle (offsets, offsets + numOffsets);

// Create the pointer walk from pointers and offsets
uintptr_t *start = pages [0];
if (accessOffsetRandom) start += offsets [0];
else start += accessOffset;
uintptr_t **ptr = (uintptr_t **) start;
for (int i = 1; i < numAccesses; i++) {
    uintptr_t *next = pages [i];
    if (accessOffsetRandom) next += offsets [i % numOffsets];
    else next += accessOffset;
    (*ptr) = next;
    ptr = (uintptr_t **) next;
}

// Wrap the pointer walk
(*ptr) = start;
delete [] pages;
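As a side note, the stride arithmetic that the generator relies on can be written down explicitly. The following minimal sketch is our own illustration, assuming the DTLB0 parameters from Table 1 (16 entries, 4-way, 4 KB pages); it is not part of the benchmark framework.

#include <cstdio>

int main () {
    // TLB parameters (Intel Server DTLB0 from Table 1).
    int entries = 16, ways = 4, pageSize = 4096;

    // Number of associativity sets and the page stride that keeps
    // all accesses in the same TLB set.
    int sets = entries / ways;   // 4 sets, selected by log2(4) = 2 bits of the page number
    int pageStride = sets;       // pages that are 4 pages apart map to the same set

    printf ("sets = %d, pageStride = %d pages (%d bytes)\n",
            sets, pageStride, pageStride * pageSize);
    return 0;
}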

2.2 Experiment: TLB Miss Penalties

For every DTLB present in the system, the experiments that determine the penalties of translation misses use the set collision pointer walk from Listings 1.1 and 1.3, with pageStride set to the number of entries divided by the associativity, numPages set to a value higher than the associativity, and numAccesses varying from 1 to numPages. When numAccesses is less than or equal to the associativity, all accesses should hit; afterwards the accesses should start missing, depending on the replacement policy. For the ITLBs, we analogically use a jump emitting version of the code from Listing 1.3 together with the code from Listing 1.2. Since the plots that illustrate the results for each TLB are similar in shape, we include only representative examples and comment on the results in writing. All plots are available in [7].

Fig. 1. DTLB0 miss penalty and related performance events on Intel Server

Starting with an example of a well documented result, we choose the experiment with DTLB0 on Platform Intel Server, which requires pageStride set to 4 and numAccesses varying from 1 to 32. The results in Fig. 1 contain both the average access duration and the counts of the related performance events. We see that the access duration increases from 3 to 5 cycles at 5 accessed pages. At the same time, the number of misses in DTLB0 (DTLB_MISSES.L0_MISS_LD events) increases from 0 to 1, but there are no DTLB1 misses (DTLB_MISSES:ANY events). The experiment therefore confirms the well documented parameters of DTLB0, such as the 4-way associativity and the miss penalty of 2 cycles [1, page A-9]. It also suggests that the replacement policy behavior approximates LRU for our access pattern.

Experimenting with DTLB1 on Platform Intel Server requires changing the pageStride parameter to 64 and yields an increase in the average access duration from 3 to 12 cycles at 5 accessed pages. Figure 2 shows the counts of the related performance events, attributing the increase to DTLB1 misses and confirming the 4-way associativity. Since there are no DTLB0 misses that would hit in the DTLB1, the figure also suggests a non-exclusive policy between DTLB0 and DTLB1. The experiment therefore estimates the miss penalty, which is not available in vendor documentation, at 7 cycles. Interestingly, the counter of cycles spent in page walks (PAGE_WALKS:CYCLES events) reports only 5 cycles per access and therefore does not fully capture this penalty. As additional information not available in vendor documentation, we can see that exceeding the DTLB1 capacity increases the number of L1 data cache references (L1D_ALL_REF events) from 1 to 2. This suggests that the page tables are cached in the L1 data cache, and that the PDE cache is present and the page table accesses hit there, since only the last level page walk step is needed.

Fig. 2. Performance event counters related to L1 DTLB misses on Intel Server (left) and L2 DTLB misses on AMD Server (right)

Experimenting with the L1 DTLB on Platform AMD Server requires changing pageStride to 1 for the full associativity. The results show a change from 3 to 8 cycles at 49 accessed pages, which confirms the full associativity and 48 entries of the L1 DTLB; the replacement policy behavior approximates LRU for our access pattern. The performance counters show a change from 0 to 1 in the L1 DTLB miss and L2 DTLB hit events, while the L2 DTLB miss event does not occur. The experiment therefore estimates the miss penalty, which is not available in vendor documentation, at 5 cycles. Note that the value of the L1 DTLB hit counter (L1_DTLB_HIT:L1_4K_TLB_HIT) is always 1, indicating a possible problem with this counter on the particular experiment platform.

For the L2 DTLB on Platform AMD Server, pageStride is set to 128. The results show an increase from 3 to 43 cycles at 49 accessed pages, which means that we observe L2 DTLB misses and also indicates a non-exclusive policy between the L1 DTLB and the L2 DTLB. The L2 associativity, however, is difficult to confirm due to the full L1 associativity. The event counters in Fig. 2 show a change from 0 to 1 in the L2 miss event (L1_DTLB_AND_L2_DTLB_MISS:4K_TLB_RELOAD event). The penalty of the L2 DTLB miss is thus estimated at 35 cycles in addition to the L1 DTLB miss penalty, or 40 cycles in total. On Platform AMD Server, the paging structures are not cached in the L1 cache. The value of the REQUESTS_TO_L2:TLB_WALK event counter shows that each L2 DTLB miss in this experiment results in one page walk step that accesses the L2 cache. This means that a PDE cache is present, as is further examined in the next experiment. Note that the problem with the value of the L1_DTLB_HIT:L1_4K_TLB_HIT event counter persists; it is always 1 even in the presence of L2 DTLB misses.

2.3 Additional Translation Caches

Our experiments targeted at the translation miss penalties indicate that a TLB miss can be resolved with only one additional memory access, rather than as many accesses as there are levels in the paging structures. This means that a cache of the third level paging structures is present on both investigated platforms, and since such additional translation caches are only discussed in general terms in vendor documentation [8], we investigate these caches next.

[Figure: average access duration in cycles and L1 data cache accesses (1000 walks Avg Trim), plotted against the number of accessed pages for page strides of 512, 4 K, 8 K, 64 K, 128 K and 256 K]
Fig. 3. Extra translation caches miss penalty (left) and related L1 data cache reference events (right) on Intel Server

2.4 Experiment: Extra Translation Buffers

With the presence of the third level paging structure cache (PDE cache) already confirmed, we focus on determining the presence of caches for the second level (PDPTE cache) and the first level (PML4TE cache). The experiments use the set collision pointer walk from Listings 1.1 and 1.3. The numAccesses and pageStride parameters are initially set to values that make each access miss in the last level of the DTLB and hit in the PDE cache. By repeatedly doubling pageStride, we should eventually reach a point where only a single associativity set in the PDE cache is accessed, triggering misses when numAccesses exceeds the associativity. This should be observed as an increase of the average access duration and an increase of the data cache access count during page walks. Eventually, the accessed memory range pageStride × numPages exceeds the 512 × 512 pages translated by a single third level paging structure, making the accesses map to different entries in the second level paging structure and thus different entries in the PDPTE cache, if present. Further increase of pageStride extends the scenario analogically to the PML4TE cache.

The change of the average access durations and the corresponding change in the data cache access count for different values of pageStride on Platform Intel Server are illustrated in Fig. 3. Only those values of pageStride that lead to different results are displayed; the results for the values that are not displayed are the same as the results for the previous value. For the 512 pages stride, the average access duration changes from 3 to 12 cycles at 5 accessed pages, which means we hit the PDE cache as in the previous experiment. We also observe an increase of the access duration from 12 to 23 cycles and a change in the L1 cache miss (L1D_REPL event) counts from 0 to 1 at 9 accessed pages. These misses are not caused by the accessed data but by the page walks, since with this particular stride and alignment, we always read the first entry of a page table and therefore the same cache set. We see that the penalty of this miss is 11 cycles, also reflected in the value of the PAGE_WALKS:CYCLES event counter, which changes from 5 to 16. Later experiments will show that the L1 data cache miss penalty for a data load on this platform is indeed 11 cycles, which means that the L1 data cache miss penalty simply adds up with the DTLB miss penalty.
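As a side calculation, the address range covered by one entry at each paging-structure level explains why these particular strides are chosen. The following sketch is our own illustration under 4-level paging with 4 KB pages and 512 entries per paging structure; it is not code from the framework.

#include <cstdio>

int main () {
    const long long page  = 4096;          // 4 KB page, translated by one PTE
    const long long pde   = 512 * page;    // one PDE entry covers 512 pages (2 MB)
    const long long pdpte = 512 * pde;     // one PDPTE entry covers 512 x 512 pages (1 GB)
    const long long pml4e = 512 * pdpte;   // one PML4 entry covers 512 GB

    printf ("PDE reach:   %lld pages\n", pde / page);    // 512
    printf ("PDPTE reach: %lld pages\n", pdpte / page);  // 512 * 512 = 262144
    printf ("PML4E reach: %lld pages\n", pml4e / page);
    return 0;
}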


Fig. 4. Extra translation caches miss penalty (left) and related page walk requests to L2 cache (right) on AMD Server

As we increase the stride, we start to trigger misses in the PDE cache. With the stride of 8192 pages, which spans 16 PDE entries, and 5 or more accessed pages, the PDE cache misses on each access. The L1 data cache references event counter indicates that there are three L1 data cache references per memory access, two of which are therefore caused by the page walk. This means that a PDPTE cache is also present and the PDE miss penalty is 4 cycles. Further increasing the stride results in a gradual increase of the PDPTE cache misses. With the 512 × 512 pages stride, each access maps to a different PDPTE entry. At 5 accessed pages, the L1D_ALL_REF event counter increases to 5 L1 data cache references per access. This indicates that there is no PML4TE cache, since all four levels of the paging structures are traversed, and that the PDPTE cache has at most 4 entries. Compared to the 8192 pages stride, the PDPTE miss adds approximately 19 cycles per access. Out of those, 11 cycles are added by an extra L1 data cache miss, as both the PDE and PTE entries miss the L1 data cache due to being mapped to the same set. The remaining 8 cycles are the cost of walking two additional levels of page tables due to the PDPTE miss. The standard deviation of the results exceeds the limit of 0.5 cycles only when the L1 cache associativity is about to be exceeded – up to 3.5 cycles, and when the translation cache level is about to be exhausted – up to 8 cycles.

The observed access durations and the corresponding change in the data cache access count from an analogous experiment on Platform AMD Server are shown in Fig. 4. We can see that for a stride of 128 pages, we still hit the PDE cache as in the previous experiment. Strides of 512 pages and more need 2 page walk steps and thus hit the PDPTE cache. Strides of 256 K pages need 3 steps and thus hit the PML4TE cache. Finally, strides of 128 M pages need all 4 steps. The access duration increases by 21 cycles for each additional page walk step. With a 128 M stride, we see an additional penalty due to the page walks triggering L2 cache misses. The standard deviation of the results exceeds the limit of 0.5 cycles only when the L2 cache capacity is exceeded – up to 18 cycles, and when the translation cache level is about to be exhausted – up to 10 cycles.


Table 2. Cache parameters

Platform Intel Server
  Cache       Size     Associativity   Index     Miss [cycles]
  L1 data     32 KB    8-way           virtual   11
  L1 code     32 KB    8-way           virtual   30 (2)
  L2 unified  4 MB     16-way          physical  256-286 (3)

Platform AMD Server
  Cache       Size     Associativity   Index     Miss [cycles]
  L1 data     64 KB    2-way           virtual   12 random, 27-40 single set (4)
  L1 code     64 KB    2-way           virtual   20 random (5), 25 single set (6)
  L2 unified  512 KB   16-way          physical  +32-35 random (7), +16-63 single set
  L3 unified  2 MB     32-way          physical  +208 random (8), +159-211 single set

  (2) Includes penalty of branch misprediction.
  (3) Depends on the cache line set where misses occur. Also includes associated DTLB1 miss and L1 data cache miss due to page walk.
  (4) Differs from the 9 cycles penalty stated in vendor documentation [9, page 223].
  (5) Includes partial penalty of branch misprediction and L1 ITLB miss.
  (6) Includes partial penalty of branch misprediction.
  (7) Depends on the offset of the word accessed. Also includes penalty of L1 DTLB miss.
  (8) Includes penalty of L2 DTLB miss.

3 Investigating Memory Caches

On Platform Intel Server, the memory caches include an L1 instruction cache per core, an L1 data cache per core, and a shared L2 unified cache per every two cores. Both L1 caches are virtually indexed, the L2 cache is physically indexed. On Platform AMD Server, the memory caches include an L1 instruction cache per core, an L1 data cache per core, an L2 unified cache per core, and a shared L3 unified cache per every four cores. Table 2 summarizes the basic parameters of the memory caches on the two platforms, with the parameters not available in vendor documentation emphasized. We begin our memory caches investigation by describing experiments targeted at the cache line sizes, which differ between vendor documentation and reported research.

3.1 Cache Line Size

The experiments we perform are still based on measuring durations of memory accesses using various access patterns in the pointer walk from Listing 1.1. To avoid the effects of hardware prefetching, we use a random access pattern generated by the code from Listing 1.4. First, an array of pointers to the buffer of allocSize bytes is created, with a distance of accessStride bytes between two consecutive pointers. Next, the array is shuffled randomly. Finally, the array is used to create the access pattern of a length of accessSize divided by accessStride.

Listing 1.4. Random access pattern generator.

// Create array of pointers in the allocated buffer
int numPtrs = allocSize / accessStride;
uintptr_t **ptrs = new uintptr_t * [numPtrs];
for (int i = 0; i < numPtrs; i++)
    ptrs [i] = (uintptr_t *) ((char *) buffer + (size_t) i * accessStride);

// Randomize the order of the pointers
random_shuffle (ptrs, ptrs + numPtrs);

// Create the pointer walk from selected pointers
uintptr_t *start = ptrs [0];
uintptr_t **ptr = (uintptr_t **) start;
int numAccesses = accessSize / accessStride;
for (int i = 1; i < numAccesses; i++) {
    uintptr_t *next = ptrs [i];
    (*ptr) = next;
    ptr = (uintptr_t **) next;
}

// Wrap the pointer walk
(*ptr) = start;
delete [] ptrs;

3.2 Experiment: Cache Line Size

In order to determine the cache line size, the experiment executes a measured workload that randomly accesses half of the cache lines, interleaved with an interfering workload that randomly accesses all the cache lines. For data caches, both workloads use a pointer emitting version of code from Listing 1.4 to initialize the access pattern and code from Listing 1.1 to traverse the pattern. For instruction caches, both workloads use a jump emitting version of code from Listing 1.4 to initialize the access pattern and code from Listing 1.2 to traverse the pattern. The measured workload uses the smallest possible access stride, which is 8 B for 64 bit aligned pointer variables and 16 B for jump instructions. The interfering workload varies its access stride. When the stride exceeds the cache line size, the interfering workload should no longer access all cache lines, which should be observed as a decrease in the measured workload duration, compared to the situation when the interfering workload accesses all cache lines. The results from both platforms and all cache levels and types, except the L2 cache on Platform Intel Server, show a decrease in the access duration when the access stride of the interfering workload increases from 64 B to 128 B. The counts of the related cache miss events confirm that the decrease in access duration is caused by the decrease in cache misses. Except for the L2 cache on Platform


Fig. 5. The effect of interfering workload access stride on the L2 cache eviction (left); streamer prefetches triggered by the interfering workload during the L2 cache eviction on Intel Server (right)

Intel Server, we can therefore conclude that the line size is 64 B for all cache levels, as stated in the vendor documentation.

Figure 5 shows the results for the L2 cache on Platform Intel Server. These results are peculiar in that they would indicate that the cache line size of the L2 cache is 128 B rather than 64 B, a result that was already reported in [10]. The reason behind the observed results is the behavior of the streamer prefetcher [11, page 3-73], which causes the interfering workload to fetch two adjacent lines into the L2 cache on every miss, even though the second line is never accessed. The interfering workload with a 128 B stride thus evicts two 64 B cache lines. Figure 5 contains the values of the L2 prefetch miss (L2_LINES_IN:PREFETCH) event counter collected from the interfering workload rather than the measured workload, and confirms that L2 cache misses triggered by prefetches occur.

Because the vendor documentation does not explain the exact behavior of the streamer prefetcher when fetching two adjacent lines, we have performed a slightly modified experiment to determine which two lines are fetched together. Both workloads of the experiment access 4 MB with a 256 B stride, the measured workload with offset 0 B, the interfering workload with offsets 0, 64, 128 and 192 B. The offset therefore determines whether both workloads access the same cache associativity sets or not. The offset of 0 B should always evict the lines accessed by the measured code, the offset of 128 B should always avoid them. If the streamer prefetcher fetches a 128 B aligned pair of cache lines, using the 64 B offset should also evict the lines of the measured workload, while the 192 B offset should avoid them. If the streamer prefetcher fetches any pair of consecutive cache lines, using both the 64 B offset and the 192 B offset should avoid the lines of the measured workload. The results in Fig. 6 indicate that the streamer prefetcher always fetches a 128 B aligned pair of cache lines, rather than any pair of consecutive cache lines. Additional experiments also show that the streamer prefetcher does not prefetch the second line of a pair when the L2 cache is saturated with another workload. Running two workloads on cores that share the cache therefore results in fewer prefetches than running the same two workloads on cores that do not share the cache.


Fig. 6. Access duration (left) and L2 cache misses by accesses only (right) investigating streamer prefetch on Intel Server

3.3 Cache Indexing

We continue by determining whether the cache is virtually or physically indexed, since this information is also not always available in vendor documentation. Knowing whether the cache is virtually or physically indexed is essential for the later experiments that determine cache miss penalties. We again use the pointer walk code from Listing 1.1 and create the access pattern so that all accesses map to the same cache line set. To achieve this, we reuse the pointer walk initialization code from the TLB experiments in Listing 1.3, because the stride we need is always a multiple of the page size on our platforms; the difference is that we do not use the offset randomization. For physically indexed caches, the task of constructing the access pattern where all accesses map to the same cache line set is complicated by the fact that the cache line set is determined by the physical rather than the virtual address. To overcome this complication, our framework provides an allocation function that returns pages whose physical and virtual addresses are identical in the bits that determine the cache line set. This allocation function, further called colored allocation, is used in all experiments that define strides in physically indexed caches. Note that we do not have to determine cache indexing for the L1 caches on Platform Intel Server, where the combination of 32 KB size and 8-way associativity means that an offset within a page entirely determines the cache line set.

3.4 Experiment: Cache Set Indexing

We measure the average access time in a set collision pointer walk from Listing 1.1 and 1.3, with the buffer allocated using either the standard allocation or the colored allocation. The number of accessed pages is selected to exceed the cache associativity. If a particular cache is virtually indexed, the results should show an increase in access duration when the number of accesses exceeds associativity for both modes of allocation. If the cache is physically indexed, there should be no increase in access duration with the standard allocation, because the stride in virtual addresses does not imply the same stride in physical addresses.


Fig. 7. Dependency of associativity misses in L2 cache on page coloring on Intel Server

The results from Platform Intel Server show that colored allocation is needed to trigger L2 cache misses, as illustrated in Fig. 7. The L2 cache is therefore physically indexed. Without colored allocation, the standard deviation of the results grows when the L1 cache misses start occurring, staying below 3.2 cycles for 8 accessed pages and below 1 cycle for 9 and more accessed pages. Similarly, with colored allocation, the standard deviation stays below 5.5 cycles for 7 and 8 accessed pages when the L1 cache starts missing, and below 10.5 cycles for 16 and 17 accessed pages when the L2 cache starts missing.

The results from Platform AMD Server in Fig. 8 also show that colored allocation is needed to trigger L2 cache misses with 19 and more accesses. Colored allocation also seems to make a difference for the L1 data cache, but the values of the event counters in Fig. 8 show that the L1 data cache misses occur with both modes of allocation; the difference in the observed duration therefore should not be attributed to indexing. The standard deviation of the results exceeds the limit of 0.5 cycles for small numbers of accesses, with a maximum standard deviation of 2.1 cycles at 3 accesses.

3.5 Cache Miss Penalties

Finally, we measure the memory cache miss penalties, which appear to include effects not described in vendor documentation.

Fig. 8. Dependency of associativity misses in L1 data and L2 cache on page coloring (left) and related performance events (right) on AMD Server


Fig. 9. L2 cache miss penalty when accessing single cache line set (left); dependency on cache line set selection in pages of color 0 (right) on Intel Server

3.6 Experiment: Cache Miss Penalties and Their Dependencies

The experiment determines the penalties of misses in all levels of the cache hierarchy and their possible dependency on the offset of the accesses triggering the misses. We rely again on the set collision access pattern from Listings 1.1 and 1.3, increasing the number of repeatedly accessed addresses and varying the offset within a cache line to determine its influence on the access duration. The results are summarized in Table 2; more can be found in [7].

On Platform Intel Server, we observe an unexpected increase in the average access duration when about 80 different addresses are mapped to the same cache line set. The increase, visible in Fig. 9, is not reflected by any of the relevant event counters. Further experiments, also illustrated in Fig. 9, reveal a difference between accessing odd and even cache line sets within a page. We see that the difference varies with the number of accessed addresses, with accesses to the even cache lines faster than to the odd cache lines for 32 and 64 addresses, and the other way around for 128 addresses. The standard deviation in these results is under 3 clocks.

On Platform AMD Server, we observe an unusually high penalty for the L1 data cache miss, with an even higher peak when the number of accessed addresses just exceeds the associativity, as illustrated in Fig. 10. Determined this way, the

Fig. 10. L1 data cache miss penalty when accessing a single cache line set (left) and random sets (right) on AMD Server


Fig. 11. Dependency of L2 cache miss penalty on access offset in a cache line when accessing random cache line sets (left) and 20 cache lines in the same set (right) on AMD Server

penalty would be 27 cycles, or 40 cycles for the peak, which is significantly more than the stated L2 access latency of 9 cycles [9, page 223]. Without additional experiments, we speculate that the peak is caused by the workload attempting to access data that is still in transit from the L1 data cache to the L2 cache. More light is shed on the unusually high penalty by another experiment, one which uses the random access pattern from Listing 1.4 rather than the set collision pattern from Listing 1.3. The workload allocates a memory range of twice the cache size and varies the portion that is actually accessed. Accessing the full range triggers cache misses on each access; the misses are randomly distributed to all cache sets. With this approach, we observe a penalty of approximately 12 cycles per miss, as illustrated in Fig. 10. We have extended this experiment to cover all caches on Platform AMD Server; the differences in penalties when accessing a single cache line set and when accessing multiple cache line sets are summarized in Table 2.

For the L2 cache, we have also observed a small dependency of the access duration on the access offset within the cache line when accessing random cache sets, as illustrated in Fig. 11. The access duration increases with each 16 B of the offset and can add almost 3 cycles to the L2 miss penalty. A similar dependency was also observed when accessing multiple addresses mapped to the same cache line set, as illustrated in Fig. 11. Again, we believe that illustrating the many variables that determine the cache miss penalties is preferable to the incomplete information available in vendor documentation, especially when results of more complex experiments which include such effects are to be analyzed.

4 Experimental Framework

The experiments described here were performed within a generic benchmarking framework, designed to investigate performance related effects due to sharing of resources such as the processor core or the memory architecture among multiple software components. The framework source is available for download at


http://dsrg.mff.cuni.cz/benchmark together with multiple benchmarks, including all the benchmarks described in this paper, implemented in the form of extensible workload modules. The support provided by the framework includes:

– Creating and executing parametrized benchmarks. The user can specify ranges of individual parameters, the framework executes the benchmark with all the specified combinations of the parameter values.
– Collecting precise timing information through the RDTSC instruction and performance counter values through PAPI [4].
– Executing either isolated benchmarks or combinations of benchmarks to investigate the sharing effects.
– Plotting of results through R [12]. Supports boxplots for examining dependency on one benchmark parameter and plots with multiple lines for different values of other benchmark parameters.

Besides providing the execution environment for the benchmarks, the framework bundles utility functions, such as the colored allocation used in the experiments with physically indexed caches in Section 3. The colored allocation is based on page coloring [13], where the bits determining the associativity set are the same in the virtual and the physical address. The number of the associativity set is called a color. As an example, the L2 cache on Platform Intel Server has a size of 4 MB and 16-way associativity, which means that addresses with a stride of 256 KB will be mapped to the same cache line set [11, page 3-61]. With a 4 KB page size, this yields 64 different colors, determined by the 6 least significant bits of the page address. Although the operating system on our experimental platforms does not support page allocation with coloring, it does provide a way for the executed program to determine its current mapping. Our colored allocation uses this information together with the mremap function to allocate a continuous virtual memory area, determine its mapping and remap the allocated pages one by one to a different virtual memory area with the target virtual addresses matching the color of the physical addresses. This way, the allocator can construct a continuous virtual memory area with virtual pages having the same color as the physical frames that the pages are mapped to.
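As an aside, the color arithmetic described above can be written down in a few lines. The following sketch is our own illustration, assuming the Platform Intel Server L2 parameters quoted above; on Linux the physical frame number of a page can be obtained, for example, from the /proc/self/pagemap interface, which is not shown here, and this sketch is not how the framework itself is necessarily implemented.

#include <cstdint>
#include <cstdio>

// Number of page colors for a physically indexed cache.
int num_colors (long cacheSize, int ways, long pageSize) {
    return (int) ((cacheSize / ways) / pageSize);   // 4 MB / 16 / 4 KB = 64 colors
}

// Color of a page, taken from the low bits of its physical page frame number.
int page_color (uint64_t pfn, int colors) {
    return (int) (pfn & (uint64_t) (colors - 1));
}

int main () {
    int colors = num_colors (4L * 1024 * 1024, 16, 4096);
    printf ("%d colors, page frame 0x12345 has color %d\n",
            colors, page_color (0x12345, colors));
    return 0;
}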

5 Conclusion

We have described a series of experiments designed to investigate some of the detailed parameters of the memory architecture of the x86 processor family. Although the knowledge of the detailed parameters is of limited practical use in general software development, where it is simply too involved and too specialized, we believe it is of significant importance in designing and evaluating research experiments that exercise the memory architecture. Without this knowledge, it is difficult to design experiments that target the intended part of the memory


architecture and to distinguish results that are characteristic of the experiment workload from results that are due to incidental interference. We should point out that the detailed parameters are often not available in vendor documentation, or – since claiming to know all vendor documentation would be somewhat preposterous – at least are often only available as fragmented information buried among hundreds of pages of text. Among the detailed parameters investigated in this paper are the address translation miss penalties (which are partially documented for Platform Intel Server and not documented for Platform AMD Server), the parameters of the additional translation caches (which are not documented for Platform Intel Server and not even mentioned for Platform AMD Server), the cache line size (which is well documented but measured incorrectly in [10]) together with the reasons for the cited incorrect measurement, the cache indexing (which seems to be generally known but is not documented for Platform AMD Server), and the cache miss penalties (which seem to be more complex than documented even when abstracting from the memory itself). Additionally, we show some interesting anomalies such as suspect values of performance counters. We also provide a framework that makes it possible to easily reproduce our experiments, or to execute our experiments on different experiment platforms. The framework is used within the Q-ImPrESS project and many more collected results are available in [7]. To our knowledge, the experiments that we have performed are not available elsewhere. Closest to our work are the results in [10] and [14], which describe algorithms for automatic assessment of basic memory architecture parameters, especially the size and associativity of the memory caches. The workloads used in [10] and [14] share common features with some of our workloads, especially where the random pointer walk is concerned. Our workloads are more varied and therefore provide more results, although the comparison is not quite fair since we did not aim for automated analysis. We also show some effects that the cited workloads would not reveal. Although this paper is primarily targeted at performance evaluation professionals involved in detailed measurements related to the memory architecture of the x86 processor family, our results in [7] demonstrate that the observed effects can impact performance modeling precision at much higher levels. As far as the general applicability of our results is concerned, it should be noted that they are very much tied to the particular experimental platforms, and can change even with minor platform parameters such as processor or chipset stepping. For different experimental platforms, our results can serve to illustrate what effects can be observed, but not to guarantee what effects will really be present. The availability of our experimental framework, however, makes it possible to repeat our experiments with very little effort, leaving only the evaluation of the different results to be carried out where applicable.


References

1. Intel Corporation: Intel 64 and IA-32 Architectures Software Developer Manual, Volume 3: System Programming, Order Nr. 253668-027 and 253669-027 (July 2008)
2. Advanced Micro Devices, Inc.: AMD64 Architecture Programmer's Manual Volume 2: System Programming, Publication Number 24593, Revision 3.14 (September 2007)
3. Drepper, U.: What every programmer should know about memory (2007), http://people.redhat.com/drepper/cpumemory.pdf
4. PAPI: Performance application programming interface, http://icl.cs.utk.edu/papi
5. Pettersson, M.: Perfctr, http://user.it.uu.se/~mikpe/linux/perfctr/
6. Advanced Micro Devices, Inc.: AMD BIOS and Kernel Developer's Guide For AMD Family 10h Processors, Publication Number 31116, Revision 3.06 (March 2008)
7. Babka, V., Bulej, L., Děcký, M., Kraft, J., Libič, P., Marek, L., Seceleanu, C., Tůma, P.: Resource usage modeling, Q-ImPrESS deliverable 3.3 (September 2008), http://www.q-impress.eu
8. Intel Corporation: Intel 64 and IA-32 Architectures Application Note: TLBs, Paging-Structure Caches, and Their Invalidation, Order Nr. 317080-002 (April 2008)
9. Advanced Micro Devices, Inc.: AMD Software Optimization Guide for AMD Family 10h Processors, Publication Number 40546, Revision 3.06 (April 2008)
10. Yotov, K., Pingali, K., Stodghill, P.: Automatic measurement of memory hierarchy parameters. In: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp. 181–192. ACM, New York (2005)
11. Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Manual, Order Nr. 248966-016 (November 2007)
12. R: The R Project for Statistical Computing, http://www.r-project.org/
13. Kessler, R.E., Hill, M.D.: Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10(4), 338–359 (1992)
14. Yotov, K., Jackson, S., Steele, T., Pingali, K.K., Stodghill, P.: Automatic measurement of instruction cache capacity. In: Ayguadé, E., Baumgartner, G., Ramanujam, J., Sadayappan, P. (eds.) LCPC 2005. LNCS, vol. 4339, pp. 230–243. Springer, Heidelberg (2006)

The Next Frontier for Power/Performance Benchmarking: Energy Efficiency of Storage Subsystems

Klaus-Dieter Lange

Hewlett-Packard Company, 11445 Compaq Center Dr. W, Houston, TX-77070, USA
[email protected]

Abstract. The increasing concern about energy usage in datacenters has drastically changed how the IT industry evaluates servers. The energy conscious selection of storage subsystems is the next logical step. This paper first quantifies the possible energy savings of utilizing modern storage subsystems by identifying inherent energy characteristics of next generation disk IO subsystems. Additionally, the power consumption of a variety of workload patterns is demonstrated.

Keywords: SPEC, Benchmark, Power, Energy, Performance, Server, Storage, Datacenter.

1 Introduction

Today's challenge for datacenters is their high energy consumption [1]. The demand for efficient datacenter real estate has shifted the focus toward more power efficient datacenters. This increasing concern about energy usage in datacenters has drastically changed how the IT industry evaluates servers. In response, the Standard Performance Evaluation Corporation (SPEC) [2] has developed and released SPECpower_ssj2008 [3], the first industry-standard benchmark that evaluates the power and performance characteristics of server class computers. The need for this type of measurement was so urgent and necessary that the US Environmental Protection Agency (US EPA) included it in their ENERGY STAR® Program Requirements for Computer Servers [4]. The SPECpower_ssj2008 results [5] are also already being utilized for energy conscious purchase decisions. With the competitive marketplace driving server innovation even further, the next logical phase is adopting an energy conscious evaluation of storage subsystems.

2 Power Consumption of Server and Storage

In order to show the significant impact of the storage subsystem on power consumption, we configured a server with external storage, similar to the publicly released SPECweb2005 result [6]. Two AC power analyzers were connected to separately measure the power consumption of the server and the external storage. The configured system was then benchmarked with the SPECweb2005 (Banking) workload from idle to 100% in 10% increments and the power measurements were automatically


recorded in 1s intervals. The server power consumption ranged from ~286W at idle to ~312W at 100% performance; while the external storage ranged from ~305W at idle to ~400W at 100% performance. Figure 1 represents a graphical view of these data. This test configuration shows that the power consumption of the external storage subsystem can be significantly higher than the server itself; most of the current public SPECweb2005 results exhibit similar tendencies. Another recent study [7] on the energy cost of datacenters shows that in database setups, 63% of power is consumed by the storage systems. For at least these application areas (web serving and database) an industry standard method to measure the energy usage for storage subsystems is necessary. Another interesting discovery was the range in power consumption between idle and 100% performance. For our baseline benchmark configuration this equated to ~9% range for the server and ~30% range for the storage. For comparison, in only one year after its release, SPECpower_ssj2008 results show that companies pushed the server range as far as 50%.

Average Power Consumption 400 Average Power Consumption [W]

Server

External Storage

350 300 250 200 150 100 50 0 idle

10%

20%

30% 40% 50% 60% 70% SPECweb2005 (Banking) Performance

80%

90%

100%

Fig. 1. Average Power Consumption - Graduated SPECweb2005 Workload

3 Saving Energy by Utilizing Modern Storage Subsystems To demonstrate the energy savings when using the latest technology, two generations of storage enclosures, both with a standard rack form factor of 3U, were compared. The older generation storage enclosure holds 14 large form factor (LFF) 3.5” SCSI drives and the current generation storage enclosure holds 25 small form factor (SFF) 2.5” SAS drives. A drive capacity of 32GB was chosen for each drive. Each empty enclosure was attached to a server and then loaded with drives, one drive at a time every 66 seconds. The idle power of the empty SAS enclosure (72W) was slightly

The Next Frontier for Power/Performance Benchmarking

99

Large Form Factor SCSI vs. Small Form Factor SAS

Idle Power Consumption [W]

250

200

150

100 Previous Generation - LFF SCSI Current Generation - SFF SAS

50

0 1

2

3

4

5

6

7

8

9 11 12 13 14 15 16 17 18 19 20 22 23 24 25 Drive Count

Fig. 2. Large Form Factor SCSI vs. Small Form Factor SAS

higher than the SCSI enclosure (58W). Nevertheless, with an idle power of ~12W for an individual LFF SCSI drive and ~6.25 for an individual SFF SAS drive, this advantage was surpassed after the third drive was added. When reaching the maximum drive capacity, the 14 LFF drive enclosure used ~227W; the SFF SAS enclosure with 14 drives used only ~156W, approximately 71W power savings. The SFF SAS enclosure needed to be fully equipped with 25 drives before it would reach the power consumption of the LFF SCSI drive enclosure.

4 Various Workload Pattern The power consumption of a server is dependent on the CPU stress pattern. Different stress patterns cause different consumptions of power (Figure 3). To demonstrate a similar behavior on storage subsystems, the same hardware configuration was utilized as in section 3 and different workloads were applied. Five different workloads were selected for this experiment: 100% random write, 75% random write with 25% random read, 100% random read, 100% sequential write and 100% sequential read. The resulting power consumptions are shown in Figure 4. The findings indicate that random access causes higher power consumption than sequential access – this could be caused by the additional head movement of the drives. For these workloads the SFF SAS enclosure needed to be fully equipped with 25 drives before it would reach the power consumption of the LFF SCSI drive enclosure.

100

K.-D. Lange

lucas

fma3d

sixtrac

apsi

2:40

2:50

3:00

3:10

galgel art equake facere ammp

mesa

applu

mgrid

crafty parser eon perlbmk gap vortex bzip2 twolf 0:30

wupwi swim

gzip vpr gcc mcf 0:00

Power Consumption with different CPU stress patterns

Power Consumption [W]

310

290

270

250

3:40

3:30

3:20

2:30

2:20

2:10

2:00

1:50

1:40

1:30

1:20

1:10

1:00

0:50

0:40

0:20

0:10

230

Fig. 3. CPU – power consumption for various workloads Large Form Factor (SCSI) vs. Small Form Factor (SAS) various workloads q-depth = 128 Previous Generation - max 14 LFF SCSI Drives Current Generation - max 25 SFF SAS Drives block-size = 32 Current Generation 14 SFF SAS Drives 350

Power Consumption [W]

325 300 275 250 225 200 175 150 100% RW

75% RW / 25%

100% RR

100% SW

100% SR

Fig. 4. Storage Subsystem – power consumption for various workloads

5 Conclusion The power consumption of the external storage subsystem has been identified to be significantly higher than the server itself in the application areas of web serving and database. The experiments in sections 3 and 4 show that modern storage subsystems significantly save more energy than their predecessors; however as of December 2008 there

The Next Frontier for Power/Performance Benchmarking

101

is no industry standard benchmark available that can demonstrate these or similar real energy savings. There will be many challenges along the way to create benchmarks that measure the power/performance of server storage subsystems. As in the development of SPECpower_ssj2008, I am convinced that SPEC will again step up to these challenges and convene the best talents from the industry to lead the exploration in this next frontier.

6 Future Work Preliminary measurements of the power/performance characteristics of solid-state drives (SSD) show very promising results which warrant further investigation. Another area of interest is to analyze the impact of energy preserving storage enclosures and advanced power supplies. Once we have studied these measurements, we will provide the results to SPEC to support their benchmark development. The active support of the augmentation of a power component to all applicable SPEC’s benchmarks will be in the industry’s best interest, since it will enable the fair evaluation of servers and their subsystems under a wide variety of workloads.

Acknowledgement The author would like to acknowledge Richard Tomaszewski and Steve Fairchild for their guidance; Kris Langenfeld, Jonathan Koomey, Roger Tipley, Mark Thompson and Raghunath Nambiar for their comments and feedback; Bryon Georgson, David Rogers, Daniel Ames and David Schmidt for their support conducting the power and performance measurements; Dwight Barron, Mike Nikolaiev and Tracey Stewart for their continuous support. SPEC and the benchmark names SPECpower_ssj2008 and SPECweb2005 are registered trademarks of the Standard Performance Evaluation Corporation.

References 1. Koomey, J.: Worldwide electricity used in data centers. Environmental Research Letters 3(034008) (September 23, 2008), http://www.iop.org/EJ/abstract/1748-9326/3/3/034008/ 2. Standard Performance Evaluation Corporation (SPEC), http://www.spec.org 3. SPECpower_ssj2008, http://www.spec.org/power_ssj2008 4. US EPA’s Energy Star for Enterprise Servers, http://www.energystar.gov/ index.cfm?c=new_specs.enterprise_servers 5. SPECpower_ssj2008 results, http://www.spec.org/power_ssj2008/results/power_ssj2008.html 6. SPECweb2005 result, http://www.spec.org/web2005/results/res2006q4/ web2005-20061019-00048.html 7. Poess, M., Nambiar, R.: Energy Cost, The Key Challenge of Today’s Data Centers: A Power Consumption Analysis of TPC-C Results, http://www.vldb.org/pvldb/1/1454162.pdf

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors Using Geospatial-Based Predictive Models Chang-Burm Cho, Wangyuan Zhang, and Tao Li Intelligent Design of Efficient Architecture Lab(IDEAL), Department of ECE, University of Florida {choreno,zhangwy}@ufl.edu, [email protected]

Abstract. This paper presents novel 2D geospatial-based predictive models for exploring the complex thermal spatial behavior of three-dimensional (3D) die stacked multi-core processors at the early design stage. Unlike other analytical techniques, our predictive models can forecast the location, size and temperature of thermal hotspots. We evaluate the efficiency of using the models for predicting within-die and cross-dies thermal spatial characteristics of 3D multicore architectures with widely varied design choices (e.g. microarchitecture, floor-plan and packaging). Our results show the models achieve high accuracy while maintaining low complexity and computation overhead. Keywords: Thermal/power characterization, multi-core architecture, 3D die stacking, analytical modeling.

1 Introduction Three-dimensional (3D) integrated circuit design [1] is an emerging technology that greatly improves transistor integration density and reduces on-chip wire communication latency. It places planar circuit layers in the vertical dimension and connects these layers with a high density and low-latency interface. In addition, 3D offers the opportunity of binding dies, which are implemented with different techniques to enable integrating heterogeneous active layers for new system architectures. Leveraging 3D die stacking technologies to build uni-/multi-core processors has drawn an increased attention to both chip design industry and research community [2- 8]. The realization of 3D chips faces many challenges. One of the most daunting of these challenges is the problem of inefficient heat dissipation. In conventional 2D chips, the generated heat is dissipated through an external heat sink. In 3D chips, all of the layers contribute to the generation of heat. Stacking multiple dies vertically increases power density and dissipating heat from the layers far away from the heat sink is more challenging due to the distance of heat source to external heat sink. Therefore, 3D technologies not only exacerbate existing on-chip hotspots but also create new thermal hotspots. High die temperature leads to thermal-induced performance degradation and reduced chip lifetime, which threats the reliability of the whole system, making modeling and analyzing thermal characteristics crucial in effective 3D microprocessor design. D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 102–120, 2009. © Springer-Verlag Berlin Heidelberg 2009

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

Die2

Die3

Die4

MIX

MEM

CPU

Die1

103

Fig. 1. 2D within-die and cross-dies thermal variation in 3D die stacked multi-core processors

Config. B

Config. C

Config. D

MIX

MEM

CPU

Config. A

Fig. 2. 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations

Previous studies [5, 6] show that 3D chip temperature is affected by factors such as configuration and floor-plan of microarchitectural components. For example, instead of putting hot components together, thermal-aware floor-planning places the hot components by cooler components, reducing the global temperature. Thermal-aware floorplanning [5] uses intensive and iterative simulations to estimate the thermal effect of microarchitecture components at early architectural design stage. However, using detailed yet slow cycle-level simulations to explore thermal effects across large design space of 3D multi-core processors is very expensive in terms of time and cost. To achieve thermal efficient 3D multi-core processor design, architects and chip designers need models with low computation overhead, which allow them to quickly explore the design space and compare different design options. One challenge in modeling the thermal behavior of 3D die stacked multi-core architecture is that the manifested thermal patterns show significant variation within each die and across different dies (as shown in Fig. 1). The results were obtained by simulating a 3D die stacked quad-core processors running multi-programmed CPU (bzip2, eon, gcc,perlbmk), MEM (mcf, equake, vpr, swim) and MIX (gcc, mcf, vpr, perlbmk) workloads. Each program within a multi-programmed workload was assigned to a die

104

C.-B. Cho, W. Zhang, and T. Li

that contains a processor core and caches. More details on our experimental methodologies can be found in Section 4. Figure 2 shows the 2D thermal variation on die 4 under different microarchitecture and floor-plan configurations. On the given die, the 2-dimensional thermal spatial characteristics vary widely with different design choices. As the number of architectural parameters in the design space increases, the complex thermal variation and characteristics cannot be captured without using slow and detailed simulations. As shown in Figs. 1 and 2, to explore the thermal-aware design space accurately and informatively, we need computationally effective methods that not only predict aggregate thermal behavior but also identify both size and geographic distribution of thermal hotspots. In this work, we aim to develop fast and accurate predictive models to achieve this goal. Prior work has proposed various predictive models [9, 10, 11, 12, 13, 14, 15] to cost-effectively reason processor performance and power characteristics at the design exploration stage. A common weakness of existing analytical models is that they assume centralized and monolithic hardware structures and therefore lack the ability to forecast the complex and heterogeneous thermal behavior across large and distributed 3D multi-core architecture substrates. In this paper, we addresses this important and urgent research task by developing novel, 2D multi-scale predictive models, which can efficiently reason the geo-spatial thermal characteristics within die and across different dies during the design space exploration stage without using detailed cycle-level simulations. Instead of quantifying the complex geo-spatial thermal characteristics using a single number or a simple statistical distribution, our proposed techniques employ 2D wavelet multiresolution analysis and neural network non-linear regression modeling. With our schemes, the thermal spatial characteristics are decomposed into a series of wavelet coefficients. In the transform domain, each individual wavelet coefficient is modeled by a separate neural network. By predicting only a small set of wavelet coefficients, our models can accurately reconstruct 2D spatial thermal behavior across the design space. The rest of the paper is organized as follows: In Section 2, we briefly describe the wavelet transform, especially for 2D wavelet transform and the principles of neural networks are also presented. Section 3 provides our wavelet based neural networks for 2D thermal behavior prediction and system details. Section 4 introduces our experimental setup. Section 5 highlights our experimental results on 2D thermal behavior prediction and analyzes the tradeoff between model complexity, configuration, and prediction accuracy. Section 6 discusses related work. Section 7 concludes the paper.

2 Background To familiarize the reader with the general methods used in this paper, we provide a brief overview of wavelet multiresolution analysis and neural network regression prediction in this section. To learn more details about wavelets and neural networks, the reader is encouraged to read [16, 17].

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

105

2.1 1D Wavelet Transform Wavelets are mathematical tools that use a simple, fixed prototype function (called the analyzing or mother wavelet) to transform data of interest into different frequency components and study each component with a resolution that matches its scale. A wavelet transform, which decomposes data of interest by wavelets, provides a compact and effective mathematical representation of the original data. In contrast to Fourier transforms, which only offer frequency representations, wavelets are capable of providing time and frequency localizations simultaneously. Wavelet analysis employs two functions, often referred to as the scaling filter ( H ) and the wavelet filter ( G ), to generate a family of functions that break down the original data. The scaling filter is similar in concept to an approximation function, while the wavelet filter quantifies the differences between the original data and the approximation generated by the scaling function. Wavelet analysis allows one to choose the pair of scaling and wavelet filters from numerous functions. In this section, we provide a quick primer on wavelet analysis using the Haar wavelet, which is the simplest form of wavelets [18]. Equation (1) shows the scaling and wavelet filters for Haar wavelets, respectively. H = (1 / 2 ,1 / 2 )

G = (−1 / 2 ,1 / 2 ) .

(1)

The Haar discrete wavelet transform (DWT) works by averaging two adjacent values on a series of data at a given scale to form smoothed, lower-dimensional data (i.e. approximations), and the resulting coefficients (i.e. details), which are the differences between the values and their averages. By recursively repeating the decomposition process on the averaged sequence, we achieve multi-resolution decomposition. The process continues by decomposing the scaling coefficient (approximation) vector repeating the same steps, and completes when only one coefficient remains. As a result, wavelet decomposition is the collection of average and detail coefficients at all scales. H * = (1 / 2 ,1 / 2 )

G* = (1 / 2 ,−1 / 2 ) .

(2)

The original data can be reconstructed from wavelet coefficients using a pair of wavelet synthetic filters ( H * and G * ), as shown in (2). With the Haar wavelets, this inverse wavelet transform can be achieved by adding difference values back or subtracting differences from the averages. This process can be performed recursively until the finest scale is reached. The original data can be perfectly recovered if all wavelet coefficients are involved. Alternatively, an approximation of the data can be reconstructed using a subset of wavelet coefficients. Using a wavelet transform gives time-frequency localization of the original data. As a result, the original data can be accurately approximated using only a few wavelet coefficients since they capture most of the energy of the input data. Thus, keeping only the most significant coefficients enables us to represent the original data in a lower dimension. Note that in (1) and (2) we use 2 instead of 2 as a scaling factor since just averaging cannot preserve Euclidean distance in the transformed data.

106

C.-B. Cho, W. Zhang, and T. Li

2.2 2D Wavelet Transform To capture the 2D spatial thermal characteristics effectively in 3D integrated multicore chips, we propose to use 2D wavelet analysis in this study. With 1D wavelet analysis that uses Haar wavelet filters, each adjacent pair of data in a discrete interval is replaced with its average and difference. A similar concept can be applied to obtain a 2D wavelet transform of data in a discrete plane.

Fig. 3. Illustration of 2D wavelet transforms

As shown in Fig. 3, the 1D analysis filter bank is first applied to the rows (horizontal filtering) of the data and then applied to the columns (vertical filtering). This kind of 2D DWT leads to a decomposition of approximation coefficients at level j in four components: the approximation (LL) at level j+1, and the details in three orientations, e.g., horizontal (LH), vertical (HL), and diagonal (HH). LL1 346

345

LL

2

LH

HL

2

HH2

2

344

343

LH1

342

341

340

HL1

(a) Original thermal behavior

HH

1

(b) 2D wavelet transformed thermal behavior

Fig. 4. An example of using 2D DWT to capture thermal spatial characteristics

To obtain wavelet coefficients for 2D data, we apply a 1D wavelet transform to the data along the horizontal axis first, resulting in low-pass and high-pass signals (average and difference). Next, we apply 1D wavelet transforms to both signals along the vertical axis generating one averaged and three detailed signals. Consequently, 2D wavelet decomposition is obtained by recursively repeating this procedure on the averaged signal. Fig. 4 illustrates the original thermal behavior and 2D wavelet transformed thermal behavior. As can be seen, the 2D thermal characteristics can be effectively captured using a small number of wavelet coefficients (e.g. Average (LL=1) or Average

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

107

(LL=2)). Since a small set of wavelet coefficients provide concise yet insightful information on 2D thermal spatial characteristics, we use predictive models (i.e. neural networks) to relate them individually to various design parameters. Through inverse 2D wavelet transform, we use the small set of predicted wavelet coefficients to synthesize 2D thermal spatial characteristics across the design space. Compared with a simulation-based method, predicting a small set of wavelet coefficients using analytical models is computationally efficient and is scalable to explore the large thermal design space of 3D multi-core architecture. 2.3 Neural Network An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems process information. It is composed of a set of interconnected processing elements working in unison to solve problems.

Fig. 5. The radial basis function network

The most common type of neural network (shown as Fig. 5) consists of three layers of units: a layer of input units is connected to a layer of hidden units, which is connected to a layer of output units. The input is fed into network through input units. Each hidden unit receives the entire input vector and generates a response. The output of a hidden unit is determined by the input-output transfer function that is specified for that unit. Commonly used transfer functions include the sigmoid, linear threshold function and radial basis function (RBF) [19]. The RBF is a special class of function with response decreasing monotonically with distance from a central point. The center, the distance scale, and the precise shape of the radial function are parameters of the model. A typical radial function is the Gaussian which, in the case of a scalar input, is ⎛ ( x − c) 2 ⎞ . ⎟ h( x ) = exp⎜⎜ − r 2 ⎟⎠ ⎝

(3)

Its parameters are its center c and its radius r. A neural network that uses RBF can be expressed as

108

C.-B. Cho, W. Zhang, and T. Li n

f ( x) = ∑ w j h j ( x) . j =1

(4)

r where w ∈ ℜ n is adaptable or trainable weight vector and {h j (⋅)} nj =1 are radial basis functions of the hidden units. As shown in (4), the ANN output, which is determined by the output unit, is computed using the responses of the hidden units and the weights between the hidden and output units. Neural networks outperform linear models in capturing complex, non-linear relations between input and output, which make them a promising technique for tracking and forecasting complex thermal behavior.

3 Combining Wavelets and Neural Network for 2D Thermal Spatial Behavior Prediction We view the 2D spatial thermal characteristics yielded in 3D integrated multi-core chips as a nonlinear function of architecture design parameters. Instead of inferring the spatial thermal behavior via exhaustively obtaining temperature on each individual location, we employ wavelet analysis to approximate it and then use a neural network to forecast the approximated thermal behavior across a large architectural design space.

Fig. 6. Hybrid neuro-wavelet thermal prediction framework

Previous work [9, 10, 11, 12] shows that neural networks can accurately predict the aggregated workload behavior across varied architecture configurations. Nevertheless, monolithic global neural network models lack the ability to reveal complex thermal behavior on a large scale. To overcome this disadvantage, we propose combining 2D

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

109

wavelet transforms and neural networks that incorporate multiresolution analysis into a set of neural networks for spatial thermal characteristics prediction of 3D die stacked multi-core design. The 2D wavelet transform is a very powerful tool for characterizing spatial behavior since it captures both global trend and local variation of large data sets using a small set of wavelet coefficients. The local characteristics are decomposed into lower scales of wavelet coefficients (high frequencies) which are utilized for detailed analysis and prediction of individual or subsets of components, while the global trend is decomposed into higher scales of wavelet coefficients (low frequencies) that are used for the analysis and prediction of slow trends across each die. Collectively, these wavelet coefficients provide an accurate interpretation of the spatial trend and details of complex thermal behavior at a large scale. Our wavelet neural networks use a separate RBF neural network to predict individual wavelet coefficients. The separate predictions of wavelet coefficients proceed independently. Predicting each wavelet coefficient by a separate neural network simplifies the training task (which can be performed concurrently) of each sub-network. The prediction results for the wavelet coefficients can be combined directly by the inverse wavelet transforms to synthesize the 2D spatial thermal patterns across each die. Fig. 6 shows our hybrid neuro-wavelet scheme for 2D spatial thermal characteristics prediction. Given the observed spatial thermal behavior on training data, our aim is to predict the 2D thermal behavior of each die in 3D die stacked multi-core processors under different design configurations. The hybrid scheme involves three stages. In the first stage, the observed spatial thermal behavior in each layer is decomposed by wavelet multiresolution analysis. In the second stage, each wavelet coefficient is predicted by a separate ANN. In the third stage, the approximated 2D thermal characteristics are recovered from the predicted wavelet coefficients. Each RBF neural network receives the entire architecture design space vector and predicts a wavelet coefficient. The training of an RBF network involves determining the center point and a radius for each RBF, and the weights of each RBF, which determine the wavelet coefficients.

4 Experimental Methodology 4.1 Floorplanning and Hotspot Thermal Model In this study, we model four floor-plans that involve processor core and cache structures as illustrated in Fig. 7. As can be seen, the processor core is placed at different locations across the different floor-plans. Each floor-plan can be chosen by a layer in the studied 3D die stacking quad-core processors. The size and adjacency of blocks are critical parameters for deriving the thermal model. The baseline core architecture and floorplan we modeled is an Alpha processor, closely resembling the Alpha 21264.

Fig. 7. The selected floor-plans

110

C.-B. Cho, W. Zhang, and T. Li

Fig. 8 shows the baseline core floorplan. We assume a 65 nm processing technique and the floor-plan is scaled accordingly. The entire die size is 21×21mm and the core size is 5.8×5.8mm. We consider three core configurations: 2-issue (5.8×5.8 mm), 4issue (8.14×8.14 mm) and 8-issue (11.5×11.5 mm). Since the total die area is fixed, the more aggressive core configurations lead to smaller L2 caches. For all three types of core configurations, we calculate the size of the L2 caches based on the remaining die area available. RF Window (ROB+IQ)

ALU

BPRED

LSQ

il1

dl1

L2

L2

L2

Fig. 8. Processor core floor-plan

Table 1 lists the detailed processor core and cache configurations. We use Hotspot4.0 [20] to simulate thermal behavior of a 3D quad-core chip shown as Fig. 9. The Hotspot tool can specify the multiple layers of silicon and metal required to model a three dimensional IC. We choose grid-like thermal modeling mode by specifying a set of 64 x 64 thermal grid cells per die and the average temperature of each cell (32um x 32um) is represented by a value. Hotspot takes power consumption data for each component block, the layer parameters and the floor-plans as inputs and generates the steady-state temperature for each active layer.

Fig. 9. Cross section view of the simulated 3D quad-core chip

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

111

Table 1. Architecture configuration for different issue width 2 issue Processor Width

2-wide fetch/issue/commit

Issue Queue

32

32 entries, 4-way, 200 cycle ITLB miss Branch Predic- 512 entries Gshare, 10-bit tor global history BTB

512K entries, 4-way

4 issue

8 issue

4-wide fetch/issue/commit

8-wide fetch/issue/commit

64

128

64 entries, 4-way, 200 cycle 128 entries, 4-way, 200 cycle miss miss 1K entries Gshare, 10-bit 2K entries Gshare, 10-bit global global history history 1K entries, 4-way

2K entries, 4-way

Return Address 8 entries RAS 32K, 2-way, 32 Byte/line, 2 L1 I-Cache ports, 1 cycle access

16 entries RAS 32 entries RAS 64K, 2-way, 32 Byte/line, 2 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access ports, 1 cycle access

ROB Size

32 entries

64 entries

96 entries

24 entries

48 entries

72 entries

2 I-ALU, 1 I-MUL/DIV, 2 Load/Store 1 FP-ALU, 1FPMUL/DIV/SQRT 64 entries, 4-way, 200 cycle miss 32K, 2-way, 32 Byte/line, 2 ports, 1 cycle access unified 4MB, 4-way, 128 Byte/line, 12 cycle access

4 I-ALU, 2 I-MUL/DIV, 2 Load/Store 2 FP-ALU, 2FP-MUL/ DIV/SQRT 128 entries, 4-way, 200 cycle miss 64KB, 4-way, 64 Byte/line, 2 ports, 1 cycle unified 3.7MB, 4-way, 128 Byte/line, 12 cycle access

8 I-ALU, 4 I-MUL/DIV, 4 Load/Store

Load/ Store Integer ALU FP ALU DTLB L1 D-Cache L2 Cache Memory Access

32 bit wide, 200 cycles access 64 bit wide, 200 cycles latency access latency

4 FP-ALU, 4FP-MUL/DIV/SQRT 256 entries, 4-way, 200 cycle miss 128K, 2-way, 32 Byte/line, 2 ports, 1 cycle access unified 3.2MB, 4-way, 128 Byte/line, 12 cycle access 64 bit wide, 200 cycles access latency

To build a 3D multi-core processor simulator, we heavily modified and extended the M-Sim simulator [21] and incorporated the Wattch power model [22]. The power trace is generated from the developed framework with an interval size of 500K cycles. We simulate a 3D-stacked quad-core processor with one core assigned to each layer. 4.2 Workloads and System Configurations We use both integer and floating-point benchmarks from the SPEC CPU 2000 suite (e.g. bzip2, crafty, eon, facerec, galgel, gap, gcc, lucas, mcf, parser, perlbmk, twolf, swim, vortex and vpr) to compose our experimental multiprogrammed workloads (see Table 2). We categorize all benchmarks into two classes: CPU-bound and MEM bound applications. We design three types of experimental workloads: CPU, MEM and MIX. The CPU and MEM workloads consist of programs from only the CPU intensive and memory intensive categories respectively. MIX workloads are the combination of two benchmarks from the CPU intensive group and two from the memory intensive group. These multi-programmed workloads were simulated on our multi-core simulator configured as 3D quad-core processors. We use the Simpoint tool [23] to obtain a representative slice for each benchmark (with full reference input set) and each

112

C.-B. Cho, W. Zhang, and T. Li Table 2. Simulation configurations

Chip Frequency Voltage Proc. Technology Die Size

Workloads

3G 1.2 V 65 nm 21 mm × 21 mm CPU1 bzip2, eon, gcc, perlbmk CPU2 perlbmk, mesa, facerec, lucas CPU3 gap, parser, eon, mesa MIX1 gcc, mcf, vpr, perlbmk MIX2 perlbmk, mesa, twolf, applu MIX3 eon, gap, mcf, vpr MEM1 mcf, equake, vpr , swim MEM2 twolf, galgel, applu, lucas MEM3 mcf, twolf, swim, vpr

benchmark is fast-forwarded to its representative point before detailed simulation takes place. The simulations continue until one benchmark within a workload finishes the execution of the representative interval of 250M instructions. 4.3 Design Parameters In this study, we consider a design space that consists of 23 parameters (see Table 3) spanning from floor-planning to packaging technologies. These design parameters have been shown to have a large impact on processor thermal behavior. The ranges for these parameters were set to include both typical and feasible design points within the explored design space. Using detailed cycle-accurate simulations, we measure processor power and thermal characteristics on all design points within both training and testing data sets. We build a separate model for each benchmark domain and use the model to predict thermal behavior at unexplored points in the design space. The training data set is used to build the wavelet-based neural network models. An estimate of the model’s accuracy is obtained by using the design points in the testing data set. To train an accurate and prompt neural network prediction model, one needs to ensure that the sample data sets disperse points throughout the design space but keeps the space small enough to maintain the low model building cost. To achieve this goal, we use a variant of Latin Hypercube Sampling (LHS) [24] as our sampling strategy since it provides better coverage compared to a naive random sampling scheme. We generate multiple LHS matrices and use a space filing metric called L2-star discrepancy [25]. The L2-star discrepancy is applied to each LHS matrix to find the representative design space that has the lowest value of L2-star discrepancy. We use a randomly and independently generated set of test data points to empirically estimate the predictive accuracy of the resulting models. In this work, we used 200 train and 50 test data to reach a high accuracy for thermal behavior prediction since our study shows that it offers a good tradeoff between simulation time and prediction accuracy for the design space we considered. In our study, the thermal characteristics across each die are represented by 64×64 samples.

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

113

Table 3. Design space parameters

Thickness (m) Layer0 Floorplan Bench Thickness (m) Layer1 Floorplan Bench 3D Config. Thickness (m) Layer2 Floorplan Bench Thickness (m) Layer3 Floorplan Bench Heat Capacity (J/m^3K) TIM (Thermal Interface Resistivity (m K/W) Material) Thickness (m) Convection capacity (J/k) Convection resistance (K/w) Heat sink Side (m) General Thickness (m) Config. Side(m) Heat Spreader Thickness(m) Others Ambient temperature (K) Archi. Issue width

Keys ly0_th ly0_fl ly0_bench ly1_th ly1_fl ly1_bench ly2_th ly2_fl ly2_bench ly3_th ly3_fl ly3_bench TIM_cap TIM_res TIM_th HS_cap HS_res HS_side HS_th HP_side HP_th Am_temp Issue width_

Low High 5e-5 3e-4 Flp 1/2/3/4 CPU/MEM/MIX 5e-5 3e-4 Flp 1/2/3/4 CPU/MEM/MIX 5e-5 3e-4 Flp 1/2/3/4 CPU/MEM/MIX 5e-5 3e-4 Flp 1/2/3/4 CPU/MEM/MIX 2e6 4e6 2e-3 5e-2 2e-5 75e-6 140.4 1698 0.1 0.5 0.045 0.08 0.02 0.08 0.025 0.045 5e-4 5e-3 293.15 323.15 2 /4/8

5 Experimental Results In this section, we present detailed experimental results using 2D wavelet neural networks to forecast thermal behaviors of large scale 3D multi-core structures running various CPU/MIX/MEM workloads without using detailed simulation. 5.1 Simulation Time vs. Prediction Time To evaluate the effectiveness of our thermal prediction models, we compute the speedup metric (defined as simulation time vs. prediction time) across all experimented workloads (shown as Table 4). To calculate simulation time, we measured the time that the Hotspot simulator takes to obtain steady thermal characteristics on a given design configuration. As can be seen, the Hotspot tool simulation time varies with design configurations. We report both shortest (best) and longest (worst) simulation time in Table 4. The prediction time, which includes the time for the neural networks to predict the targeted thermal behavior, remains constant for all studied cases.

114

C.-B. Cho, W. Zhang, and T. Li Table 4. Simulation time vs. prediction time

Workloads CPU1 CPU2 CPU3 MEM1 MEM2 MEM3 MIX1 MIX2 MIX3

Simulation (sec) [best:worst] 362 : 6,091 366 : 6,567 365 : 6,218 351 : 5,890 355 : 6,343 367 : 5,997 352 : 5,944 365 : 6,091 360 : 6,024

Prediction (sec)

1.23

Speedup (Sim./Pred.) 294 : 4,952 298 : 5,339 297 : 5,055 285 : 4,789 289 : 5,157 298 : 4,876 286 : 4,833 297 : 4,952 293 : 4,898

In our experiment, a total number of 16 neural networks were used to predict 16 2D wavelet coefficients which efficiently capture workload thermal spatial characteristics. As can be seen, our predictive models achieve a speedup ranging from 285 (MEM1) to 5339 (CPU2), making them suitable for rapidly exploring large thermal design space. 5.2 Prediction Accuracy The prediction accuracy measure is the mean error defined as follows: ME =

1 N

N

∑ k =1

~ x (k ) − x(k ) . x (k )

(5)

where: x(k ) is the actual value generated by the Hotspot thermal model, ~x ( k ) is the predicted value and N is the total number of samples (a set of 64 x 64 temperature samples per layer, detailed in section 4.1). As prediction accuracy increases, the ME becomes smaller. We present boxplots to observe the average prediction errors and their deviations for the 50 test configurations against Hotspot simulation results. Boxplots are graphical displays that measure location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution. The central box shows the data between “hinges” which are approximately the first and third quartiles of the ME values. Thus, about 50% of the data are located within the box and its height is equal to the interquartile range. The horizontal line in the interior of the box is located at the median of the data, it shows the center of the distribution for the ME values. The whiskers (the dotted lines extending from the top and bottom of the box) extend to the extreme values of the data or a distance 1.5 times the interquartile range from the median, whichever is less. The outliers are marked as circles. In Fig. 10, the blue line with diamond shape markers indicates the statistics average of ME across all benchmarks.

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

115

20

Error (%)

16 12 8 4 0

CPU1 CPU2 CPU3 MEM1 MEM2 MEM3 MIX1 MIX2 MIX3

Fig. 10. ME boxplots of prediction accuracies (number of wavelet coefficients = 16)

Fig. 10 shows that using 16 wavelet coefficients, the predictive models achieve median errors ranging from 2.8% (CPU1) to 15.5% (MEM1) with an overall median error of 6.9% across all experimented workloads. As can be seen, the maximum error at any design point for any benchmark is 17.5% (MEM1), and most benchmarks show an error less than 9%. This indicates that our hybrid neuro-wavelet framework can predict 2D spatial thermal behavior across large and sophisticated 3D multi-core architecture with high accuracy. Fig. 10 also indicates that CPU (average 4.4%) workloads have smaller error rates than MEM (average 9.4%) and MIX (average 6.7%) workloads. This is because the CPU workloads usually have higher temperature on the small core area than the large L2 cache area. These small and sharp hotspots can be easily captured using just few wavelet coefficients. On MEM and MIX workloads, the complex thermal pattern can spread the entire die area, resulting in higher prediction error.

CPU1

MEM1

MIX1

Prediction

Simulation

Fig. 11. The simulated and predicted thermal behavior

Fig. 11 illustrates the simulated and predicted 2D thermal spatial behavior of die 4 (for one configuration) on CPU1, MEM1 and MIX1 workloads. The results show that our predictive models can tack both size and location of thermal hotspots. We further examine the accuracy of predicting locations and area of the hottest spots and the results are similar to those presented in Figure 10. Fig. 12 shows the prediction accuracies with different number of wavelet coefficients on multi-programmed workloads CPU1, MEM1 and MIX1. In general, the 2D

116

C.-B. Cho, W. Zhang, and T. Li

thermal spatial pattern prediction accuracy is increased when more wavelet coefficients are involved. However, the complexity of the predictive models is proportional to the number of wavelet coefficients. The cost-effective models should provide high prediction accuracy while maintaining low complexity. The trend of prediction accuracy shown in Fig. 12 suggests that for the programs we studied, a set of wavelet coefficients with a size of 16 combine good accuracy with low model complexity; increasing the number of wavelet coefficients beyond this point improves error at a lower rate except on MEM1 workload. Thus, we select 16 wavelet coefficients in this work to minimize the complexity of prediction models while achieving good accuracy. CPU1

Error (%)

8

4

0

16wc

32wc

64wc

96wc

128wc

256wc

96wc

128wc

256wc

96wc

128wc

256wc

MEM1

Error (%)

15 10 5 0

16wc

32wc

64wc MIX1

Error (%)

20

10

0

16wc

32wc

64wc

Fig. 12. ME boxplots of prediction accuracies with different number of wavelet coefficients

We further compare the accuracy of our proposed scheme with that of approximating 3D stacked die spatial thermal patterns via predicting the temperature of 16 evenly distributed locations across 2D plane. The results shown in Fig. 13 indicate that using the same number of neural networks, our scheme yields significant higher accuracy than conventional predictive models. This is because wavelets provide a good time and locality characterization capability and most of the energy is captured by a limited set of important wavelet coefficients. The coordinated wavelet coefficients provide superior interpretation of the spatial patterns across scales of time and frequency domains.

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

117

100 Predicting the wav elet coefficients Predicting the raw data

Error (%)

80 60 40 20 0

CPU1

CPU2

CPU3

MEM1

MEM2

MEM3

MIX1

MIX2

MIX3

Fig. 13. The benefit of predicting wavelet coefficients

Our RBF neural networks were built using a regression tree based method. In the regression tree algorithm, all input parameters (refer to Table 3) were ranked based on split frequency. The input parameters which cause the most output variation tend to be split frequently in the constructed regression tree. Therefore, the input parameters that largely determine the values of a wavelet coefficient have a larger number of splits. We present in Fig. 14 (shown as star plot) the most frequent splits within the regression tree that models the most significant wavelet coefficient. Design Parameters by Regression Tree

ly0 _th

ly0 _fl

ly0 _bench

ly1 _th

ly1 _fl

ly1 _bench

ly2 _th

ly2 _fl

ly2 _bench

ly3 _th

ly3 _fl

ly3 _bench

T IM_cap

T IM_r es

T IM _th

H S_cap

H S_r es

H S_side

H S_th

H P _side

H P _th

am_temp

Iss_size

Clockwise: CPU1 MEM1 MIX1

Fig. 14. The roles of input parameters

A star plot [15] is a graphical data analysis method for representing the relative behavior of all variables in a multivariate data set. Each volume size of parameter is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. From the star plot, we can obtain information such as: What variables are dominant for a given datasets? Which observations show similar behavior? As can be seen, floor-planning of each layer and core configuration largely affect thermal spatial behavior of the studied workloads.

118

C.-B. Cho, W. Zhang, and T. Li

6 Related Work There have been several attempts to build thermal aware microarchitecture [3, 20, 27, 28]. [27, 28] propose invoking energy saving techniques when the temperature exceeds a predefined threshold. [5] proposes a performance and thermal aware floorplanning algorithm to estimate power and thermal effects for 2D and 3D architectures using an automated floor-planner with iterative simulations. To our knowledge, little research has been completed so far in developing accurate and informative analytical methods to forecast complex thermal spatial behavior of emerging 3D multi-core processors at early architecture design stage. Researchers have successfully applied wavelet techniques in many fields, including image and video compression, financial data analysis, and various fields in computer science and engineering [29, 30]. In [31], Joseph and Martonosi used wavelets to analyze and predict the change of processor voltage over time. In [32], wavelets were used to improve accuracy, scalability, and robustness in program phase analysis. In [33], the multiresolution analysis capability of wavelets was exploited to analyze phase complexity. These studies, however, made no attempt to link architecture wavelet domain behavior to various design parameters. In [13] Joseph et al. developed linear models using D-optimal designs to identify significant parameters and their interactions. Lee and Brooks [14, 15] proposed regression on cubic splines for predicting the performance and power of applications executing on microprocessor configurations in a large microarchitectural design space. Neural networks have been used in [9, 10, 11, 12] to construct predictive models that correlate processor performance characteristics with the design parameters. The above studies all focus on analyzing and predicting aggregated architecture characteristics and assume monolithic architecture designs while our work aims to model heterogeneous 2D thermal behavior. Our work significantly extends the scope of these existing studies and is distinct in its use of 2D multiscale analysis to characterize the spatial thermal behavior of large-scale 3D multi-core architecture substrate.

7 Conclusions Leveraging 3D die stacking technologies in multi-core processor design has received increased momentum in both the chip design industry and research community. One of the major road blocks to realizing 3D multi-core design is its inefficient heat dissipation. To ensure thermal efficiency, processor architects and chip designers rely on detailed yet slow simulations to model thermal characteristics and analyze various design tradeoffs. However, due to the sheer size of the design space, such techniques are very expensive in terms of time and cost. In this work, we aim to develop computationally efficient methods and models which allow architects and designers to rapidly yet informatively explore the large thermal design space of 3D multi-core architecture. Our models achieve several orders of magnitude speedup compared to simulation based methods. Meanwhile, our model significantly improves prediction accuracy compared to conventional predictive models of the same complexity. More attractively, our models have the capability of capturing complex 2D thermal spatial patterns and can be used to forecast both the

Thermal Design Space Exploration of 3D Die Stacked Multi-core Processors

119

location and the area of thermal hotspots during thermal-aware design exploration. In light of the emerging 3D multi-core design era, we believe that the proposed thermal predictive models will be valuable for architects to quickly and informatively examine a rich set of thermal-aware design alternatives and thermal-oriented optimizations for large and sophisticated architecture substrates at an early design stage.

References [1] Banerjee, K., Souri, S., Kapur, P., Saraswat, K.: 3-D ICs: A Novel Chip Design for Improving Deep-Submicrometer Interconnect Performance and Systems-on-Chip Integration. Proceedings of the IEEE 89, 602–633 (2001) [2] Tsai, Y.F., Wang, F., Xie, Y., Vijaykrishnan, N., Irwin, M.J.: Design Space Exploration for 3-D Cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 16(4) (April 2008) [3] Black, B., Nelson, D., Webb, C., Samra, N.: 3D Processing Technology and its Impact on IA32 Microprocessors. In: Proc. of the 22nd International Conference on Computer Design, pp. 316–318 (2004) [4] Reed, P., Yeung, G., Black, B.: Design Aspects of a Microprocessor Data Cache using 3D Die Interconnect Technology. In: Proc. of the International Conference on Integrated Circuit Design and Technology, pp. 15–18 (2005) [5] Healy, M., Vittes, M., Ekpanyapong, M., Ballapuram, C.S., Lim, S.K., Lee, H.S., Loh, G.H.: Multiobjective Microarchitectural Floorplanning for 2-D and 3-D ICs. IEEE Trans. on Computer Aided Design of IC and Systems 26(1), 38–52 (2007) [6] Lim, S.K.: Physical design for 3D system on package. IEEE Design & Test of Computers 22(6), 532–539 (2005) [7] Puttaswamy, K., Loh, G.H.: Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors. In: HPCA (2007) [8] Wu, Y., Chang, Y.: Joint Exploration of Architectural and Physical Design Spaces with Thermal Consideration. In: ISLPED (2005) [9] Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J.: A Predictive Performance Model for Superscalar Processors. In: MICRO (2006) [10] Ipek, E., McKee, S.A., Supinski, B.R., Schulz, M., Caruana, R.: Efficiently Exploring Architectural Design Spaces via Predictive Modeling. In: ASPLOS (2006) [11] Yoo, R.M., Lee, H., Chow, K., Lee, H.H.S.: Constructing a Non-Linear Model with Neural Networks For Workload Characterization. In: IISWC (2006) [12] Lee, B., Brooks, D., Supinski, B., Schulz, M., Singh, K., McKee, S.: Methods of Inference and Learning for Performance Modeling of Parallel Applications. In: PPoPP 2007 (2007) [13] Joseph, P.J., Vaswani, K., Thazhuthaveetil, M.J.: Construction and Use of Linear Regression Models for Processor Performance Analysis. In: HPCA (2006) [14] Lee, B., Brooks, D.: Accurate and Efficient Regression Modeling for Microarchitectural Performance and Power Prediction. In: ASPLOS (2006) [15] Lee, B., Brooks, D.: Illustrative Design Space Studies with Microarchitectural Regression Models. In: HPCA (2007) [16] Daubechies, I.: Ten Lectures on Wavelets. Capital City Press, Montpelier (1992) [17] Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood Cliffs (1999)

120

C.-B. Cho, W. Zhang, and T. Li

[18] Daubechies, I.: Orthonomal bases of Compactly Supported Wavelets. Communications on Pure and Applied Mathematics 41, 906–966 (1988) [19] Orr, M., Takezawa, K., Murray, A., Ninomiya, S., Leonard, T.: Combining Regression Tree and Radial Based Function Networks. International Journal of Neural Systems (2000) [20] Skadron, K., et al.: Temperature-Aware Microarchitecture. In: ISCA (2003) [21] http://www.cs.binghamton.edu/~jsharke/m-sim/ [22] Brooks, D., Tiwari, V., Martonosi, M.: Wattch: A framework for architectural-level power analysis and optimizations. In: ISCA (2000) [23] Sherwood, T., Perelman, E., Hamerly, G., Calder, B.: Automatically Characterizing Large Scale Program Behavior. In: ASPLOS (2002) [24] Cheng, J., Druzdzel, M.J.: Latin Hypercube Sampling in Bayesian Networks. In: FLAIRS (2000) [25] Vandewoestyne, B., Cools, R.: Good Permuatations for Deterministic Scrambled Halton Sequences in terms of L2-discrepancy. Journal of Computational and Applied Mathematics 189(1-2) (2006) [26] Chambers, J., Cleveland, W., Kleiner, B., Tukey, P.: Graphical Methods for Data Analysis, Wadsworth (1983) [27] Brooks, D., Martonosi, M.: Dynamic Thermal Management for High-Performance Microprocessors. In: HPCA (2001) [28] Gunther, S., Binns, F., Canmean, D.M., Hall, J.C.: Managing the Impact of Increasing Microprocessor Power Consumption. Intel Technology Journal, Ql (2001) [29] Mallat, S.: Multifrequency Channel Decompositions of Images and Wavelet Models. IEEE Trans. on Acoustic, Speech, and Signal Processing 37, 2091–2110 (1989) [30] Feldmann, A., Gilbert, A.C., Willinger, W., Kurtz, T.G.: The Changing Nature of Network Traffic: Scaling Phenomena. ACM Computer Communication Review 28, 5–29 (1998) [31] Joseph, R., Hu, Z.G., Martonosi, M.: Wavelet Analysis for Microprocessor Design: Experiences with Wavelet-Based dI/dt Characterization. In: HPCA (2004) [32] Cho, C.B., Li, T.: Using Wavelet Domain Workload Execution Characteristics to Improve Accuracy, Scalability and Robustness in Program Phase Analysis. In: ISPASS (2007) [33] Cho, C.B., Li, T.: Complexity-based Program Phase Analysis and Classification. In: PACT (2006)

Generation, Validation and Analysis of SPEC CPU2006 Simulation Points Based on Branch, Memory and TLB Characteristics Karthik Ganesan, Deepak Panwar, and Lizy K. John University of Texas at Austin, 1 University Station C0803, Austin, TX 78712, USA

Abstract. The SPEC CPU2006 suite, released in Aug 2006 is the current industry-standard, CPU-intensive benchmark suite, created from a collection of popular modern workloads. But, these workloads take machine weeks to months of time when fed to cycle accurate simulators and have widely varying behavior even over large scales of time [1]. It is to be noted that we do not see simulation based papers using SPEC CPU2006 even after 1.5 years of its release. A well known technique to solve this problem is the use of simulation points [2]. We have generated the simulation points for SPEC CPU2006 and made it available at [3]. We also report the accuracies of these simulation points based on the CPI, branch misspredictions, cache & TLB miss ratios by comparing with the full runs for a subset of the benchmarks. It is to be noted that the simulation points were only used for cache, branch and CPI studies until now and this is the first attempt towards validating them for TLB studies. They have also been found to be equally representative in depicting the TLB characteristics. Using the generated simulation points, we provide an analysis of the behavior of the workloads in the suite for different branch predictor & cache configurations and report the optimally performing configurations. The simulations for the different TLB configurations revealed that usage of large page sizes significantly reduce the translation misses and aid in improving the overall CPI of the modern workloads.

1

Introduction

Understanding program behaviors through simulations is the foundation for computer architecture research and program optimization. These cycle accurate simulations take machine weeks of time on most modern realistic benchmarks like the SPEC [4] [5] [6] suites incurring a prohibitively large time cost. This problem is further aggravated due to the need to simulate on different micro-architectures to test the efficacy of the proposed enhancement. This necessitates the need to come up with techniques [7] [8] that can facilitate faster simulations of large workloads like SPEC suites. One such well known technique is the Simulation Points. While there are Simulation Points for the SPEC CPU2000 suite widely available and used, the simulation points are not available for the SPEC CPU2006 suite. D. Kaeli and K. Sachs (Eds.): SPEC Benchmark Workshop 2009, LNCS 5419, pp. 121–137, 2009. c Springer-Verlag Berlin Heidelberg 2009 

122

K. Ganesan, D. Panwar, and L.K. John

We used the SimPoint [9] [10] [11] tool to generate these simulation points for the SPEC2006 benchmark suite and provide it for use at [3]. The contributions of this paper are two-fold. The first contribution is the creation of the simulation points, which we make it available at [3] to the rest of the architecture research community. We also provide the accuracy of these simulation points by comparing the results with the full run of select benchmarks. It must be noted that 1.5 years after the release of SPEC CPU2006, simulations based papers using CPU2006 are still not appearing in architecture conferences. The availability of simulation points for CPU2006 will change this situation. The second contribution is the use of CPU2006 simulation points for branch predictor, cache & TLB studies. Our ultimate goal was to find the optimal branch predictor, the cache and the TLB configurations which provide the best performance on most of the benchmarks. For this, we analyzed the benchmark results for different set of static and dynamic branch predictors [12] and tried to come up with the ones that perform reasonably well on most of the benchmarks. We then varied the size of one of these branch predictors to come up with the best possible size for a hardware budget. A similar exercise was performed to come up with the optimum instruction and data cache design parameters. We varied both the associativity and size of caches to get an insight into the best performing cache designs for the modern SPEC CPU workloads. The performance for different TLB configurations was also studied to infer the effect of different TLB parameters like the TLB size, page size and associativity. It should be noted that such a study without simulation points will take several machine weeks. Since the accuracy of the simulation points were verified with several full runs, we are fairly confident of the usefullness of the results.

2

Background

Considerable work has been done in investigating the dynamic behavior of present-day programs. It has been observed that the dynamic behavior varies over time in a way that is not random but structured [1] [13], as sequences of a number of short recurring behaviors. The SimPoint [2] tool intelligently chooses and clusters these representative samples together so that they represent the entire execution of the program. This small set of samples is called simulation points; when simulated and weighted appropriately, they provide an accurate picture of the complete execution of the program with a large reduction in simulation time. Using Basic Block Vectors [14], the SimPoint tool [9][10][11] employs the K-means clustering algorithm to group intervals of execution such that the intervals in one cluster are similar to each other and the intervals in different clusters are different from one another. The Manhattan distance between the Basic Block Vectors serves as the metric to quantify the similarity between two intervals. The SimPoint tool takes the maximum number of clusters as input and generates a representative simulation point for each cluster. The representative simulation point is chosen as the one that has the minimum distance from the centroid of the cluster. Each simulation point is assigned a weight based on the number of intervals grouped into its corresponding cluster. These weights are normalized such that they sum up to unity.
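To make the representative-selection step concrete, the following is a minimal Python sketch (our own illustration, not the SimPoint tool itself) of how a representative interval and its weight might be picked per cluster. The interval count, BBV dimension and cluster labels in the usage example are invented toy values; a real flow would obtain the labels from SimPoint's k-means step.

```python
import numpy as np

def manhattan(u: np.ndarray, v: np.ndarray) -> float:
    """Manhattan (L1) distance between two normalized Basic Block Vectors."""
    return float(np.abs(u - v).sum())

def pick_simulation_points(bbvs: np.ndarray, labels: np.ndarray):
    """Given one BBV per execution interval and a cluster label per interval,
    return a (representative interval, normalized weight) pair per cluster.
    The representative is the member interval closest to the cluster centroid."""
    points = []
    n_intervals = len(labels)
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centroid = bbvs[members].mean(axis=0)
        rep = members[np.argmin([manhattan(bbvs[i], centroid) for i in members])]
        points.append((int(rep), len(members) / n_intervals))  # weights sum to unity
    return points

# Toy usage with invented sizes: 500 intervals, 64 basic blocks, at most 30 clusters.
rng = np.random.default_rng(0)
bbvs = rng.random((500, 64))
labels = rng.integers(0, 30, 500)
print(pick_simulation_points(bbvs, labels)[:3])
```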

3 Methodology

In this paper we used the sim-fast and sim-outorder simulators of the SimpleScalar toolset [6] along with the SimPoint tool to generate the simulation points for the SPEC CPU2006 suite. Figure 1 shows a flowchart of the methodology. We used the sim-fast simulator to identify the different basic blocks in the static code of a benchmark and to generate a Basic Block Vector for every fixed dynamic interval of execution of the program. We chose the interval size to be 100 million instructions. These basic block vectors are then fed as input to the clustering algorithm of the SimPoint tool, which generates the different simulation points (collections of Basic Block Vectors) and their corresponding weights. Having obtained the simulation points and their corresponding weights, the simulation points are tested by fast-forwarding (i.e., executing the program without performing any cycle-accurate simulation, as described in [3]) up to the simulation point and then running a cycle-accurate simulation for 100 million instructions. The sim-outorder tool provides a convenient method of fast-forwarding to simulate programs in this manner; fast-forwarding a program implies only a functional simulation and avoids any time-consuming detailed cycle-accurate measurements. Statistics such as CPI (Cycles Per Instruction), cache misses and branch mispredictions are recorded for each simulation point, and the metrics for the overall program are computed from the weight of each simulation point. The individual simulation points are simulated in parallel and their results aggregated according to their normalized weights. For example, the CPI is computed by multiplying the CPI of each individual simulation point by its corresponding weight, as in eqn (1):

    CPI = \sum_{i=0}^{n} (CPI_i \cdot weight_i)    (1)

On the other hand, ratio-based metrics such as the branch misprediction rate and the cache miss ratio are computed by weighting the numerator and the denominator separately, as in eqn (2):

    MissRatio = \frac{\sum_{i=0}^{n} (misses_i \cdot weight_i)}{\sum_{i=0}^{n} (lookups_i \cdot weight_i)}    (2)

The accuracy of the generated simulation points was studied by performing full program simulations with the sim-outorder simulator and comparing metrics such as CPI, cache miss ratios and branch mispredictions. This validation was performed to establish how faithfully the SimPoint methodology depicts the true behavior of the SPEC CPU2006 [15] programs. Since sim-outorder runs on SPEC CPU2006 take machine-weeks of time, we restricted ourselves to running only a few selected benchmarks for this purpose.
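As a small illustration of these two aggregation rules, the Python sketch below applies eqns (1) and (2) to invented per-point statistics; only the formulas come from the text, all numbers are made up.

```python
def aggregate_cpi(cpis, weights):
    """Eqn (1): overall CPI is the weight-averaged CPI of the simulation points."""
    return sum(c * w for c, w in zip(cpis, weights))

def aggregate_miss_ratio(misses, lookups, weights):
    """Eqn (2): for ratio metrics, weight the numerator and denominator separately."""
    num = sum(m * w for m, w in zip(misses, weights))
    den = sum(l * w for l, w in zip(lookups, weights))
    return num / den

# Invented statistics for three simulation points (weights already sum to one).
weights = [0.5, 0.3, 0.2]
print(aggregate_cpi([1.2, 2.0, 0.9], weights))                           # 1.38
print(aggregate_miss_ratio([3e4, 9e4, 1e4], [1e6, 1e6, 1e6], weights))   # 0.044
```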


[Flowchart: benchmark → sim-fast → BBVs → SimPoint engine → simulation points and weights → parallel sim-outorder runs → aggregated data, compared against a full sim-outorder run to obtain error % and speedup]

Fig. 1. Simulation point generation and verification

For studying the branch behavior of the suite we once again used the sim-outorder simulator available in SimpleScalar [6]. This tool has built-in implementations of most of the common static and dynamic branch predictors, namely Always Taken, Always Not-Taken, Bimodal, Gshare and other two-level adaptive predictors. We studied the influence of the above predictors on program behavior in terms of common metrics such as execution time, CPI and branch misprediction rate. One of the best-performing predictors was chosen, its Pattern History Table (PHT) size was varied, and the results were analyzed to arrive at an optimal PHT size. To gain insight into the memory and TLB behavior of the suite, the same sim-outorder simulator was employed, using which the configurations for the different levels of the cache hierarchy and the TLB were specified. We obtained the corresponding hit and miss rates for various configurations along with their respective CPIs.

4 Simulation Points Generation and Verification

Figure 2 shows the sim-fast results for the SPECINT and SPECFP benchmarks.


Fig. 2. SPEC CPU2006 – Number of simulation points, total number of instructions and the simulation time taken by the sim-fast simulator of SimpleScalar LLC. Note that sim-outorder takes an order of magnitude more time than sim-fast

Fig. 3. Speedup obtained by using the simulation points. The simulation point runs were done on the Texas Advanced Computing Center and the full runs on a quad-core 2 GHz Xeon processor

The tables in Figures 2 and 3 show the number of simulation points generated for each of the benchmarks along with their instruction count and simulation time on a 2 GHz Xeon machine. The interval of execution given to the sim-fast simulator was 100 million instructions, and the maximum number of clusters given to the SimPoint tool was 30. These simulation points were launched as parallel jobs on the Texas Advanced Computing Center (TACC) using the sim-outorder simulator. A node on TACC could be 2x to 3x faster than the Xeon machine to which the execution times are compared; however, the speedup numbers are so large that this discrepancy in machine speeds can safely be ignored. The final aggregated metrics for the simulation point runs were calculated using the formulae given in the previous section.


Fig. 4. CPI comparison between full runs and simulation point runs

Fig. 5. Branch misprediction rate comparison between full runs and simulation point runs

Full-run simulations were also carried out for a few integer and floating point benchmarks, and the accuracy of the generated simulation points was obtained by comparing the results. To verify the accuracy of the simulation points, we further compared the CPIs and cache miss ratios of the simulation point runs to those of the full runs, and analyzed the speedup obtained from the use of simulation points. The configuration used to simulate the various full and simulation point runs has an RUU size of 128, an LSQ size of 64, decode, issue and commit widths of 8, and L1 data and instruction caches of 256 sets with a 64B block size and an associativity of 2.


Fig. 6. Instruction cache miss ratio comparison between full runs and simulation point runs

The L2 data and instruction caches have 4096 sets, a 64B block size and an associativity of 4. The ITLB used has 32 sets, a 4K block size and an associativity of 4; the DTLB has 64 sets, a 4K block size and an associativity of 4. The number of integer ALUs was set to 4 and the number of floating point ALUs to 2. A combined branch predictor with a meta table size of 2048 was used. The error percentage in CPI and the speedup obtained from the use of simulation points are given in Figures 3 and 4. Clearly, performing the simulation using the generated simulation points results in considerable speedup without much loss in accuracy, reducing machine-weeks of time to a few hours. The CPI values obtained using simulation points were within 5 percent of the full-run CPI values for all the benchmarks except 401.bzip, where the value was off by around 8 percent. Even the errors in data and instruction cache miss rates, DTLB miss rates and branch misprediction ratios were within 5 percent for most of the benchmarks, except bzip and libquantum, which have errors of 11% and 13% in the branch misprediction rates. Figures 4, 5, 6 and 7 show the errors in the values of CPI, branch mispredictions, data cache, instruction cache and DTLB miss rates for a set of benchmarks. Though the concept of simulation points has been widely used in various studies of caches, branch predictors and so on, this is the first attempt at validating and studying TLB characteristics based on simulation points. It is quite evident from the results that these simulation points are representative of the whole benchmark even in terms of TLB characteristics. Though the methodology used by SimPoint is microarchitecture independent, this validation was performed taking one specific platform (Alpha) as a case study, and the error rates may vary for other platforms.


Fig. 7. Data cache and DTLB miss rate comparison between full runs and simulation point runs

We hope that the simulation points provided at [3] will serve as a powerful tool for carrying out faster simulations using the large and representative benchmarks of the SPEC CPU2006 suite. The reference currently provides the simulation points for 21 benchmarks; we are in the process of generating the remaining simulation points, which will also be added to the same reference.

5 Simulation Results and Analysis

5.1 Branch Characteristics

As mentioned earlier, sim-outorder supports both static and dynamic branch predictors. Static predictors are well suited to embedded applications due to their simplicity and low power requirements. Static predictors are also employed in the simple cores of single-chip multiprocessors such as Niagara [15], where strict bounds exist on the area and power consumption of each core. They are also commonly used as backup predictors in superscalar processors that require an early rough prediction during training time and when there are misses in the Branch Target Buffer. On the other hand, dynamic predictors give superior performance compared to static ones, but at the cost of increased power and area, as implemented in modern complex x86 processors. Fig. 8 shows the CPI results for two common static branch predictors, viz., Always Taken and Always Not-Taken. As expected, it is clear from Fig. 8 and Fig. 10 that the performance of static predictors is quite poor compared to the perfect predictor. Always Taken has the overhead of branch target calculation, but most of the branches in loops are taken. Fig. 9 shows the CPI results for some common dynamic branch predictors. In this paper, we studied the performance of the following dynamic predictors, viz., Bimodal, Combined, Gshare, PAg and GAp. The configurations used for these predictors, respectively, are:

– Bimodal - 2048
– Combined - 2048 (meta table size)
– Gshare - 1:8192:13:1
– PAg - 256:4096:12:0
– GAp - 1:8192:10:0

Gshare, PAg and GAp are two-level predictors and their configurations are given in the format {l1size:l2size:hist size:xor}. Clearly, the CPI values obtained using dynamic predictors are much closer to the values obtained with the perfect predictor. Also, among these predictors, the Gshare and Combined branch predictors perform much better than the others. Taking a closer look at the graphs, we see that the Gshare predictor is ideal for the FP benchmarks while the Combined predictor fares better for the integer benchmarks. Also, PAg performs better than the GAp predictor, which indicates that a predictor with a global Pattern History Table (PHT) performs better than one with a private PHT. This suggests that constructive interference in a global PHT helps modern workloads and results in an improved CPI. Looking at the performance of the private and global configurations of the Branch History Shift Register (BHSR), it is evident that each of them performs well on specific benchmarks. Fig. 11 shows the misprediction rates for the different dynamic predictors. The improvement in CPI and misprediction rate when moving from a static to a dynamic predictor is drastic for 471.omnetpp and 416.gamess. Both of these benchmarks are fairly small workloads, so their branch behavior is easily captured by these history-based branch predictors. 462.libquantum and 450.soplex also show a significant improvement in CPI compared to their static counterparts, which can be attributed to the fact that the dynamic predictors are able to efficiently capture the branch behavior of these benchmarks.
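For readers unfamiliar with the gshare indexing scheme, the following toy Python model (our own sketch, not SimpleScalar's implementation) shows the essential idea behind a configuration such as 1:8192:13:1: a single table of 2-bit counters indexed by XORing the global history with low-order PC bits.

```python
class Gshare:
    """Toy gshare model: one Pattern History Table of 2-bit counters indexed by
    XORing the global branch history with low-order PC bits.  hist_bits=13 gives
    an 8192-entry PHT, loosely mirroring the 1:8192:13:1 configuration above."""

    def __init__(self, hist_bits: int = 13):
        self.hist_bits = hist_bits
        self.mask = (1 << hist_bits) - 1
        self.pht = [1] * (1 << hist_bits)   # 2-bit counters, start weakly not-taken
        self.history = 0

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.history) & self.mask   # drop byte-offset bits of the PC

    def predict(self, pc: int) -> bool:
        return self.pht[self._index(pc)] >= 2           # counter values 2 and 3 mean "taken"

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```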


Fig. 8. Static branch predictor CPI

Fig. 9. Dynamic branch predictor CPI

Fig. 10. Static branch predictor misprediction rate

To analyze the effect of PHT size on program behavior, we chose one of the best-performing predictors from the previous analysis, i.e., Gshare, and varied the size of its PHT. We used PHT indices of 12, 13 and 14 bits and observed the change in both CPI and branch misprediction rate (Figs. 12 and 13). Different benchmarks responded differently to the increase in PHT size. It can be observed that the integer benchmarks respond more to the increase in PHT size than the floating point benchmarks, and the floating point benchmarks show the least CPI change for an increase in PHT size.

Fig. 11. Dynamic branch predictor misprediction rate

Fig. 12. Misprediction rate for Gshare configurations given as L1size:L2size:hist size & xor

Fig. 13. CPI for Gshare configurations given as L1size:L2size:hist size & xor

This is because the floating point benchmarks have fewer branches and thus their behavior can be captured with a smaller PHT. For instance, considering 435.gromacs, although there is a significant reduction in the misprediction rate with an increase in the PHT size, there is not much improvement in the CPI. After analyzing this benchmark, we found that 435.gromacs has only 2 percent of its instructions as branches, so improving the accuracy of the branch predictor does not have much effect on the CPI of the FP benchmarks. On the other hand, for 445.gobmk, which is an integer benchmark, the improvement in misprediction rate produces a proportional change in the CPI. This is expected, since 445.gobmk has a higher percentage of branches (15 percent of total instructions).
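A rough back-of-envelope model helps explain this contrast. In the sketch below, the branch fractions (2% and 15%) come from the text, but the misprediction rates and the 10-cycle penalty are assumed purely for illustration.

```python
def branch_cpi_penalty(branch_fraction: float, mispredict_rate: float,
                       penalty_cycles: float) -> float:
    """Approximate CPI added by branch mispredictions:
    (branches per instruction) x (mispredictions per branch) x (cycles per misprediction)."""
    return branch_fraction * mispredict_rate * penalty_cycles

# Assumed 10-cycle penalty; the miss rates below are illustrative, not measured values.
# A 435.gromacs-like workload (~2% branches): halving a 5% miss rate saves only ~0.005 CPI.
print(branch_cpi_penalty(0.02, 0.05, 10) - branch_cpi_penalty(0.02, 0.025, 10))
# A 445.gobmk-like workload (~15% branches): the same improvement saves ~0.038 CPI.
print(branch_cpi_penalty(0.15, 0.05, 10) - branch_cpi_penalty(0.15, 0.025, 10))
```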


Fig. 14. CPI for DL1 configurations in format name:no.sets:blk size:associativity & repl. policy

Fig. 15. Miss rate for DL1 configurations in format name:no.sets:blk size:associativity & repl. policy

Fig. 16. CPI for IL1 configurations in format name:no.sets:blk size:associativity & repl. policy

5.2 Memory Characteristics

The design of the memory hierarchy is of paramount importance in modern superscalar processors because of the performance loss due to the von Neumann bottleneck. This makes it necessary to find optimal cache design parameters capable of hiding memory latencies efficiently. In this paper, we analyzed both the level-1 instruction and data caches and tried to arrive at optimal design parameters.


Fig. 17. Miss rate for IL1 configurations in format name:no.sets:blk size:associativity & repl. policy

Fig. 18. CPI for varying associativity with 16KB page sizes

To analyze the L1 caches, we varied both the cache size and the associativity and compared the resulting CPIs and miss ratios. We used the LRU replacement policy for all our experiments, which is denoted by a one in the cache configuration strings in the figures. From the graphs in Figs. 14 and 15, it is evident that increasing the associativity has a more prominent effect on performance than simply increasing the size of the data cache. For some benchmarks, such as 445.gobmk, increasing the associativity to 2 results in a colossal reduction in the miss ratio, which can be attributed to the smaller footprints of these benchmarks. Other benchmarks where associativity provided a significant benefit are 456.hmmer, 458.sjeng and 482.sphinx3, for which increasing the associativity to 2 resulted in more than a 50 percent reduction in miss ratio. However, some benchmarks, such as 473.astar and 450.soplex, responded more to size than to associativity. It can be concluded that 473.astar and 450.soplex access a lot of sequential data, and hence not much benefit can be extracted by increasing the associativity. The CPIs of 462.libquantum and 433.milc respond neither to the increase in cache size nor to that in associativity. This may be due to a smaller memory footprint of these benchmarks, which can be captured completely by just a small direct-mapped cache.
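To relate the configuration strings used in Figs. 14–17 to actual cache capacities, the following small helper (our own illustration, not part of SimpleScalar) decodes a number-of-sets / block-size / associativity triple.

```python
def cache_geometry(nsets: int, block_bytes: int, assoc: int):
    """Decode a sets/block-size/associativity triple into total capacity (bytes)
    and the index/offset address-bit split (assuming power-of-two parameters)."""
    size_bytes = nsets * block_bytes * assoc
    offset_bits = block_bytes.bit_length() - 1
    index_bits = nsets.bit_length() - 1
    return size_bytes, index_bits, offset_bits

# DL1:256:64:2 is a 32 KB, 2-way cache; DL1:128:64:4 keeps the same 32 KB capacity
# but doubles the associativity, which is what several benchmarks respond to.
print(cache_geometry(256, 64, 2))   # (32768, 8, 6)
print(cache_geometry(128, 64, 4))   # (32768, 7, 6)
```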


Fig. 19. TLB miss ratios for varying associativity with 16KB page sizes

Fig. 20. CPI for varying page sizes with 2-way associative TLB

The CPI and the miss ratios for different level-1 instruction cache configurations are shown in Figs. 16 and 17. As expected, the miss ratios of the instruction cache are much lower than those of the data cache because of the uniformity in the pattern of access to the instruction cache. For some of the benchmarks, such as 473.astar, 456.hmmer and 435.gromacs, the miss ratio is almost negligible, and hence further increases in cache size or associativity do not have any effect on performance. The performance benefit from increasing associativity, relative to increasing cache size, is smaller for the instruction cache than for the data cache. This is because the instruction cache responds more to an increase in cache size than to an increase in associativity, owing to the high spatial locality of its references. Considering the tradeoff between performance and complexity, an associativity of two at the instruction cache level seems to be optimal.

5.3 TLB Characteristics

Although designing the data cache is an important step in processor design, it has to be coupled with efficient TLB usage to achieve good performance. Choosing the TLB page size is becoming critical for modern memory-intensive workloads with large footprints. This can be attributed to the recent addition of features such as multiple page sizes to modern operating systems.


Fig. 21. TLB miss ratio for varying page sizes with 2-way associative TLB

Using SimpleScalar, we performed simulations on the SPEC CPU2006 suite for different TLB page sizes and associativities and observed the TLB miss ratio, which characterizes the part of the CPI incurred in page translation. First, we fixed the page size at 16KB and varied the associativity to see the corresponding impact on miss ratios and CPI. As expected, the direct-mapped TLB performed worse than the 2-way and 4-way TLBs, as seen in Figs. 18 and 19. The improvement in performance from 2-way to 4-way is small and does not seem worth the extra hardware complexity; thus, an associativity of two seems optimal for modern workloads. As we increased the page size from 16KB to 16MB, we found that the change in associativity no longer had any effect on performance, which can be attributed to the fact that a page size of 16MB is large enough to reduce the conflict misses to zero. Second, we performed simulations with various page sizes for a 2-way associative TLB. Our results, shown in Figs. 20 and 21, closely match the results reported in [16] for a POWER5 processor. We found that large page sizes resulted in the fewest translation misses, leading to a better CPI. It can be observed that the TLB miss ratio is reduced by around 30% for 471.omnetpp and 80% for 473.astar when the page size is increased from 4KB to 16KB, and there is a consistent improvement in the performance of all the benchmarks as the page size increases. When a page size of 16MB is used, the TLB misses reduce to nearly zero for most of the benchmarks, except 445.gobmk and 450.soplex. One possible cause of the increase in CPI for 445.gobmk and 450.soplex at a 16MB page size is serious memory wastage caused by internal fragmentation. Another reason could be a higher number of conflicts among cache lines if the virtual address bits used in cache tag matches are insufficiently distinct from each other under larger TLB mappings.

6 Conclusion

Simulation points have proved to be an effective technique for reducing the simulation time of the SPEC CPU2006 suite to a large extent without much loss of accuracy. Using simulation points not only reduces the number of dynamic instructions to be simulated but also makes the workload parallel, making it well suited to present-day parallel computers. Further, simulating the different benchmarks with different branch predictors gave insight into the branch behavior of modern workloads, which helped in arriving at the best-performing predictor configurations. We observed Gshare and the Combined (Bimodal & 2-level) predictor to be the ideal predictors, predicting most of the branches to near perfection. Looking at the effect of different cache parameters, we observed that the design of the level-1 data cache parameters proves to be more important in affecting the CPI than that of the instruction cache parameters. Instruction accesses, due to their inherent uniformity, tend to miss less frequently, which makes the task of designing the instruction cache much easier. The line size of the instruction cache seems to be the most important parameter, while for the data cache both the line size and the associativity need to be tailored appropriately to get the best performance. The simulations for the different TLB configurations revealed that the use of large page sizes significantly reduces translation misses and helps improve the overall CPI of modern workloads.

Acknowledgement

We would like to thank the Texas Advanced Computing Center (TACC) for the excellent simulation environment provided for performing all the time-consuming simulations of SPEC CPU2006 with enough parallelism. Our thanks to Lieven Eeckhout and Kenneth Hoste of Ghent University, Belgium for providing us the Alpha binaries for the SPEC suite. This work is also supported in part through NSF award 0702694. Any opinions, findings and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).

References

1. Sherwood, T., Calder, B.: Time varying behavior of programs. Technical Report UCSD-CS99-630, UC San Diego (August 1999)
2. Sherwood, T., Perelman, E., Hamerly, G., Calder, B.: Automatically characterizing large scale program behavior. In: ASPLOS (October 2002)
3. http://www.freewebs.com/gkofwarf/simpoints.htm
4. SPEC: Standard Performance Evaluation Corporation, http://www.spec.org
5. Henning, J.L.: SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer 33(7), 28–35 (2000)
6. Charney, M.J., Puzak, T.R.: Prefetching and memory system behavior of the SPEC95 benchmark suite. IBM Journal of Research and Development 41(3) (May 1997)
7. Haskins, J., Skadron, K.: Minimal subset evaluation: warmup for simulated hardware state. In: Proceedings of the 2001 International Conference on Computer Design (September 2000)


8. Phansalkar, A., Joshi, A., John, L.K.: Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. In: The 34th International Symposium on Computer Architecture (ISCA) (June 2007)
9. Hamerly, G., Perelman, E., Lau, J., Calder, B.: SimPoint 3.0: Faster and more flexible program analysis. In: Workshop on Modeling, Benchmarking and Simulation (June 2005)
10. Hamerly, G., Perelman, E., Calder, B.: How to use SimPoint to pick simulation points. ACM SIGMETRICS Performance Evaluation Review (March 2004)
11. Perelman, E., Hamerly, G., Calder, B.: Picking statistically valid and early simulation points. In: International Conference on Parallel Architectures and Compilation Techniques (September 2003)
12. Yeh, T.-Y., Patt, Y.N.: Alternative implementations of two-level adaptive branch prediction. In: 19th Annual International Symposium on Computer Architecture (May 1992)
13. Lau, J., Sampson, J., Perelman, E., Hamerly, G., Calder, B.: The strong correlation between code signatures and performance. In: IEEE International Symposium on Performance Analysis of Systems and Software (March 2005)
14. Perelman, E., Sherwood, T., Calder, B.: Basic block distribution analysis to find periodic behavior and simulation points in applications. In: International Conference on Parallel Architectures and Compilation Techniques (September 2001)
15. Kongetira, P., Aingaran, K., Olukotun, K.: Niagara: A 32-way multithreaded SPARC processor. IEEE Micro 25(2), 21–29 (2005)
16. Korn, W., Chang, M.S.: SPEC CPU2006 sensitivity to memory page sizes. ACM SIGARCH Computer Architecture News (March 2007)

A Note on the Effects of Service Time Distribution in the M/G/1 Queue

Alexandre Brandwajn¹ and Thomas Begin²

¹ Baskin School of Engineering, University of California Santa Cruz, USA
² Université Pierre et Marie Curie, LIP6, France
[email protected], [email protected]

Abstract. The M/G/1 queue is a classical model used to represent a large number of real-life computer and networking applications. In this note, we show that, for coefficients of variation of the service time in excess of one, higher-order properties of the service time distribution may have an important effect on the steady-state probability distribution for the number of customers in the M/G/1 queue. As a result, markedly different state probabilities can be observed even though the mean numbers of customers remain the same. This should be kept in mind when sizing buffers based on the mean number of customers in the queue. Influence of higher-order distributional properties can also be important in the M/G/1/K queue, where it extends to the mean number of customers itself. Our results have potential implications for the design of benchmarks, as well as the interpretation of their results.

Keywords: performance evaluation, M/G/1 queue, higher-order effects, finite buffers.

1 Introduction

The M/G/1 queue is a classical model used to represent a large number of real-life computer and networking applications. For example, M/G/1 queues have been applied to evaluate the performance of devices such as volumes in a storage subsystem [1], Web servers [13], or nodes in an optical ring network [3]. In many applications related to networking, the service times may exhibit significant variability, and it may be important to account for the fact that the buffer space is finite. It is well known that, in the steady state, the mean number of users in the unrestricted M/G/1 queue depends only on the first two moments of the service time distribution [11]. It is also known [4] that the first three (respectively, the first four) moments of the service time distribution enter into the expression for the second (respectively, the third) moment of the waiting time. In this note our goal is to illustrate the effect of properties of the service time distribution beyond its mean and coefficient of variation on the shape of the stationary distribution of the number of customers in the M/G/1 queue. In particular, we point out the risk involved in dimensioning buffers based on the mean number of users in the system.


2 M/G/1 Queue

Assuming a Poisson arrival process, a quick approach to assess the required capacity for buffers in a system is to evaluate it as some multiplier (e.g., three or six) times the mean number of customers in an open M/G/1 queue (e.g., [12]). From the Pollaczek-Khinchine formula [11], this amounts to dimensioning the buffers based on only the first two moments of the service time distribution. Unfortunately, the steady-state distribution of the number of customers in the M/G/1 queue can exhibit a strong dependence on higher-order properties of the service time distribution. This is illustrated in Figure 1, which compares the distribution of the number of customers for two different Cox-2 service time distributions with the same first two moments, and thus yielding the same mean number of customers in the system. The parameters of these distributions are given in Table 1. Note that distributions I and II both correspond to a coefficient of variation of 3 but have different higher-order properties such as skewness and kurtosis [14]. Similarly, distributions III and IV both correspond to a coefficient of variation of 5 but again have different higher-order properties. The stationary distribution of the number of customers in this M/G/1 queue was computed using a recently published recurrence method [2]. We observe that, perhaps not surprisingly, the effects of the distribution tend to be more significant as the server utilization and the coefficient of variation of the service time distribution increase. It is quite instructive to note, for instance, that with a coefficient of variation of 3 and a server utilization of 0.5, the probability of exceeding 20 users in the queue (a little over 6 times the mean) is about 0.1% in one case, while it is an order of magnitude larger for another service time distribution with the same first two moments.

Table 1. Parameters and properties of the service time distributions used in Figure 1

Distribution | Mean service time | Coefficient of variation | Skewness | Kurtosis  | Rate of service at stage 1 | Probability to go to stage 2 | Rate of service at stage 2
Dist. I      | 1                 | 3                        | 4.5      | 27.3      | 10000.0                    | 2.00*10^-1                   | 2.00*10^-1
Dist. II     | 1                 | 3                        | 3557.4   | 1.90*10^7 | 1.0                        | 2.50*10^-7                   | 2.50*10^-4
Dist. III    | 1                 | 5                        | 7.5      | 75.1      | 10000.0                    | 7.69*10^-2                   | 7.69*10^-2
Dist. IV     | 1                 | 5                        | 6913.2   | 6.63*10^7 | 1.0                        | 8.33*10^-8                   | 8.33*10^-5
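The claim that the mean is insensitive to higher-order properties can be checked directly from the Cox-2 parameters in Table 1. The Python sketch below is our own illustration (the full distributions in Figure 1 require the recurrence of [2]); it computes the first two moments of a Cox-2 service time and the Pollaczek-Khinchine mean number in the system, which comes out essentially the same (about 3 customers at utilization 0.5) for distributions I and II.

```python
def cox2_moments(mu1: float, a: float, mu2: float):
    """First two moments of a two-stage Coxian (Cox-2) service time:
    stage-1 rate mu1, probability a of entering stage 2, stage-2 rate mu2."""
    m1 = 1.0 / mu1 + a / mu2
    m2 = 2.0 / mu1**2 + 2.0 * a / (mu1 * mu2) + 2.0 * a / mu2**2
    return m1, m2

def pk_mean_customers(lam: float, m1: float, m2: float) -> float:
    """Pollaczek-Khinchine mean number in the M/G/1 system:
    rho + lam^2 * E[S^2] / (2 * (1 - rho)), with rho = lam * E[S]."""
    rho = lam * m1
    return rho + lam**2 * m2 / (2.0 * (1.0 - rho))

# Dist. I and Dist. II from Table 1: same mean (~1) and coefficient of variation (~3),
# very different skewness and kurtosis, hence the same P-K mean at utilization 0.5.
for mu1, a, mu2 in [(10000.0, 2.00e-1, 2.00e-1), (1.0, 2.50e-7, 2.50e-4)]:
    m1, m2 = cox2_moments(mu1, a, mu2)
    print(round(m1, 3), round(pk_mean_customers(0.5 / m1, m1, m2), 2))  # ~1.0  ~3.0
```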

3 M/G/1/K Queue

Clearly, using the M/G/1/K queue, i.e., the M/G/1 queue with a finite queueing room, would be a more direct way to dimension buffers. There seem to be fewer theoretical results for the M/G/1/K queue than for the unrestricted M/G/1 queue, but it is well known that the steady-state distribution for the M/G/1/K queue can be obtained from that of the unrestricted M/G/1 queue after appropriate transformations [10, 7, 4]. Clearly, this approach can only work if the arrival rate does not exceed the service rate, since otherwise the unrestricted M/G/1 queue would not be stable.


Fig. 1. Effect of service time distributions on the number of customers in the M/G/1 queue: (a) coefficient of variation 3, server utilization 0.5; (b) coefficient of variation 3, server utilization 0.9; (c) coefficient of variation 5, server utilization 0.5; (d) coefficient of variation 5, server utilization 0.9

While the steady-state distribution for the M/G/1/K queue can be derived from that of the unrestricted M/G/1 queue, and the mean number of users in the latter depends only on the first two moments of the service time distribution, this is not the case for the M/G/1/K queue. Table 2 shows that even the first three moments of the service time distribution do not generally suffice to determine the mean number of customers in the M/G/1/K queue. Here we illustrate the results obtained for two Cox-3 distributions sharing the first three moments but with different higher-order properties. Since the mean number of customers in the unrestricted M/G/1 queue depends only on the first two moments of the service time distribution, and in the M/G/1/K queue for K=1 there is no distributional dependence at all (since there is no queueing), it is interesting to see how the dependence on higher-order properties varies with K, the size of the queueing room. This is the objective of Figure 2, where we represent the relative difference in the probabilities of having exactly one customer in the system, as well as in the probabilities of the buffer being full, for distributions I and II of Table 1. We observe that, although the first two moments of the service time distribution are the same for both distributions, higher-order properties lead to drastically different values for the selected probabilities. Interestingly, for the probability of the buffer being full, although the relative difference between the distributions considered decreases as the size of the queueing room, K, increases, it remains significant even for large values of the latter. To further illustrate the dependence on higher-order properties of the service time distribution, we consider read performance for two simple cached storage devices. When the information requested is found in the cache, a hit occurs and the service time is viewed as a constant (assuming a fixed record length). When the information is not in the cache, it must be fetched from the underlying physical storage device. In Table 3 we show simulation results [8] obtained for two different storage systems with the same first two moments of the service time (resulting from the combination of hit and miss service times) and a queueing room limited to 10.


Table 2. Effect of properties beyond the third moment on the mean number in the M/G/1/K queue

                              First Cox-3    Second Cox-3    Relative difference
Rate of arrivals              1              1
Size of queueing room         30             30
Mean service time             1              1
Coefficient of variation      6.40           6.40
Skewness                      2331.54        2331.54
Kurtosis                      7.43*10^6      1.44*10^7
Mean number in the M/G/1/K    3.98           5.07            27.4 %

Fig. 2. Relative difference in selected probabilities for distributions I and II as a function of the queueing room in the M/G/1/K queue

In one case the service time of the underlying physical device (i.e., the miss service time) is represented by a uniform distribution, and in the other by a truncated exponential [9]. We are interested in the achievable I/O rate such that the mean response time does not exceed 5 ms. We observe that the corresponding I/O rates differ by over 20% in this example (the coefficient of variation of the service time being a little over 1.6). It has been our experience that the influence of higher-order properties tends to increase as the coefficient of variation and the skewness of the service time increase. It is interesting to note that this is precisely the case when one considers instruction execution times in programs running on modern processors, where the most frequent instructions are highly optimized, less frequent instructions can be significantly slower, and certain even less frequent instructions may be implemented as subroutine calls with order-of-magnitude longer execution times.


Table 3. I/O rate for same mean I/O time in two storage subsystems

                                               Uniform miss       Truncated exponential                  Relative
                                               service time       miss service time                      difference
Mean service time                              1.9                1.9
Coefficient of variation                       1.62               1.62
Hit probability                                0.9                0.985
Hit service time                               1                  1.64
Miss service time                              Uniform [2, 18]    Truncated exponential, mean: 20, max: 100
Attainable I/O rate for mean I/O time of 5 ms  0.257              0.312                                   21.4 %

As another example of the effects of higher-order properties of the service time in an M/G/1 queue, consider the probability that a small buffer of 10 messages at an optical network node is full. Incoming packets can be of three different lengths. In the first case, abstracted from reported IP traffic, the packet lengths are 40, 300 and 1500 bytes with probabilities 0.5, 0.3 and 0.2, respectively. In the second case, longer packets are used: 150, 500 and 5000 bytes, with respective probabilities 0.426, 0.561 and 0.013. Both packet length distributions have the same mean of 410 bytes with a coefficient of variation of 1.36, but different higher-order properties. With the average packet arrival rate at 1 per mean packet service time, simulation results indicate that the probability of the buffer being full differs by some 20% depending on the packet mix (12.5% in the first case vs. 10.5% in the second), even though both packet mixes have the same first two moments.
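The first two moments of such discrete mixes are easy to verify. The sketch below is our own check, using only the packet lengths and probabilities quoted above; it reproduces the common mean of about 410 bytes and a coefficient of variation close to 1.36 for both mixes.

```python
import math

def mixture_moments(values, probs):
    """Mean and coefficient of variation of a discrete mixture of service times."""
    m1 = sum(p * x for x, p in zip(values, probs))
    m2 = sum(p * x * x for x, p in zip(values, probs))
    return m1, math.sqrt(m2 - m1 * m1) / m1

# The two packet mixes from the optical-node example above.
print(mixture_moments([40, 300, 1500], [0.5, 0.3, 0.2]))        # mean 410, CV ~1.36
print(mixture_moments([150, 500, 5000], [0.426, 0.561, 0.013])) # mean ~409, CV ~1.35
```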

4 Conclusion

In conclusion, we have shown that, for coefficients of variation of the service time in excess of one, higher-order properties of the service time distribution may have an important effect on the steady-state probability distribution for the number of customers in the M/G/1 queue. As a result, markedly different state probabilities can be observed even though the mean numbers of customers remain the same. This should be kept in mind when sizing buffers based on the mean number of customers in the queue. The influence of higher-order distributional properties can also be important in the M/G/1/K queue, where it extends to the mean number of customers itself. The potentially significant impact of higher-order distributional properties of the service times should also be kept in mind when interpreting benchmark results for systems that may


be viewed as instances of the M/G/1 or M/G/1/K queue, in particular, transaction-oriented systems. Our results imply that it may not be sufficient to look just at the mean, or even the mean and the variance, of the system execution times to correctly assess overall system performance. Another implication relates to benchmark design since, unless one is dealing with a system that satisfies the assumptions of a product-form queueing network, it may not be sufficient to simply preserve the mean of the system load [6].

Acknowledgments. The authors wish to thank colleagues for their constructive remarks on an earlier version of this note.

References

1. Brandwajn, A.: Models of DASD Subsystems with Multiple Access Paths: A Throughput-Driven Approach. IEEE Transactions on Computers C-32(5), 451–463 (1983)
2. Brandwajn, A., Wang, H.: Conditional Probability Approach to M/G/1-like Queues. Performance Evaluation 65(5), 366–381 (2008)
3. Bouabdallah, N., Beylot, A.-L., Dotaro, E., Pujolle, G.: Resolving the Fairness Issues in Bus-Based Optical Access Networks. IEEE Journal on Selected Areas in Communications 23(8), 1444–1457 (2005)
4. Cohen, J.W.: On Regenerative Processes in Queueing Theory. Lecture Notes in Economics and Mathematical Systems. Springer, Berlin (1976)
5. Cohen, J.W.: The Single Server Queue, 2nd edn. North-Holland, Amsterdam (1982)
6. Ferrari, D.: On the foundations of artificial workload design. SIGMETRICS Perform. Eval. Rev. 12(3), 8–14 (1984)
7. Glasserman, P., Gong, W.: Time-changing and truncating K-capacity queues from one K to another. Journal of Applied Probability 28(3), 647–655 (1991)
8. Gross, D., Juttijudata, M.: Sensitivity of Output Performance Measures to Input Distributions in Queueing Simulation Modeling. In: Proceedings of the 1997 Winter Simulation Conference, pp. 296–302 (1997)
9. Jawitz, J.W.: Moments of truncated continuous univariate distributions. Advances in Water Resources 27(3), 269–281 (2004)
10. Keilson, J.: The Ergodic Queue Length Distribution for Queueing Systems with Finite Capacity. Journal of the Royal Statistical Society 28(1), 190–201 (1966)
11. Kleinrock, L.: Queueing Systems. Theory, vol. I. Wiley, Chichester (1974)
12. Mitrou, N.M., Kavidopoulos, K.: Traffic engineering using a class of M/G/1 models. Journal of Network and Computer Applications 21, 239–271 (1998)
13. Molina, M., Castelli, P., Foddis, G.: Web traffic modeling exploiting TCP connections' temporal clustering through HTML-REDUCE 14(3), 46–55 (2000)
14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall / CRC, Boca Raton (1986)

Author Index

Babka, Vlastimil 77
Begin, Thomas 138
Brandwajn, Alexandre 138
Cho, Chang-Burm 102
Chow, Kingsum 17
Desai, Darshan 36
Ganesan, Karthik 121
Henning, John L. 1
Hoflehner, Gerolf F. 36
Isen, Ciji 57
John, Eugene 57
John, Lizy K. 57, 121
Kejariwal, Arun 36
Lange, Klaus-Dieter 97
Lavery, Daniel M. 36
Li, Tao 102
McNairy, Cameron 36
Nicolau, Alexandru 36
Panwar, Deepak 121
Petrochenko, Dmitry 17
Shiv, Kumar 17
Tůma, Petr 77
Veidenbaum, Alexander V. 36
Wang, Yanping 17
Zhang, Wangyuan 102