Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5415
Eduardo César Michael Alexander Achim Streit Jesper Larsson Träff Christophe Cérin Andreas Knüpfer Dieter Kranzlmüller Shantenu Jha (Eds.)
Euro-Par 2008 Workshops – Parallel Processing VHPC 2008, UNICORE 2008, HPPC 2008, SGS 2008, PROPER 2008, ROIA 2008, and DPA 2008 Las Palmas de Gran Canaria, Spain, August 25-26, 2008 Revised Selected Papers
Volume Editors
Eduardo César, Universidad Autónoma de Barcelona, Spain, E-mail: [email protected]
Michael Alexander, Wirtschaftsuniversität Wien, Austria, E-mail: [email protected]
Achim Streit, Jülich Supercomputing Centre, Germany, E-mail: [email protected]
Jesper Larsson Träff, NEC Laboratories Europe, Sankt Augustin, Germany, E-mail: [email protected]
Christophe Cérin, Université de Paris Nord, LIPN, France, E-mail: [email protected]
Andreas Knüpfer, Technische Universität Dresden, Germany, E-mail: [email protected]
Dieter Kranzlmüller, LMU München, Germany, E-mail: [email protected]
Shantenu Jha, Louisiana State University, USA, E-mail: [email protected]

Library of Congress Control Number: Applied for
CR Subject Classification (1998): C.1-4, D.1-4, F.1-3, G.1-2, H.2
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-00954-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-00954-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12643462 06/3180 543210
Preface
Parallel and distributed processing, although within the focus of computer science research for a long time, is gaining more and more importance in a wide spectrum of applications. These proceedings aim to demonstrate the use of parallel and distributed processing concepts in different application fields, and attempt to spark interest in novel research directions in parallel and high-performance computing in general. The objective of these workshops is to specifically address researchers coming from university, industry and governmental research organizations and application-oriented companies in order to close the gap between purely scientific research and the applicability of the research ideas to real-life problems.

Euro-Par is an annual series of international conferences dedicated to the promotion and advancement of all aspects of parallel and distributed computing. The 2008 event was the 14th edition of the conference. Euro-Par has for a long time been eager to attract colocated events sharing the same goal of promoting the development of parallel and distributed computing, both as an industrial technique and an academic discipline, extending the frontier of both the state of the art and the state of the practice.

Since 2006, Euro-Par has been offering researchers the chance to colocate advanced technical workshops back-to-back with the main conference. This is of mutual benefit: the workshops can take advantage of all technical and social facilities that are set up for the conference, so that the organizational tasks are kept to a minimal level; the conference can rely on workshops to experiment with specific areas of research that are not yet mature enough, or too specific, to lead to an official, full-fledged topic at the conference. The 2006 and 2007 events were quite successful, and were extended to a larger size in 2008, where nine events were colocated with the main Euro-Par Conference:

• CoreGRID Symposium is the major annual event of the CoreGRID European Research Network on Foundations, Software Infrastructures and Applications for large-scale distributed, grid and peer-to-peer technologies. It is also an opportunity for a number of CoreGRID Working Groups to organize their regular meetings. The proceedings have been published in a specific volume of the Springer CoreGRID series, Towards Next Generation Grids.

• GECON 2008 is the 5th International Workshop on Grid Economics and Business Models. Euro-Par was eager to attract an event about this very important aspect of grid computing, which has often been overlooked by scientific researchers of the field. Its proceedings are published in a separate volume of Springer’s Lecture Notes in Computer Science series.

• VHPC 2008 is the Workshop on Virtualization/Xen in High-Performance Cluster and Grid Computing. Virtual machine monitors (VMMs) are now integrated with a variety of operating systems and are moving out of research labs into scientific, educational and operational usage. This workshop aimed to bring together researchers and practitioners active in exploring the application of virtualization in distributed and high-performance cluster and grid computing environments. This was a unique opportunity for the Euro-Par community to make connections with this very active research domain.

• UNICORE Summit 2008 brought together researchers and practitioners working with UNICORE in the areas of grid and distributed computing, to exchange and share their experiences, new ideas and latest research results on all aspects of UNICORE. The UNICORE grid technology provides seamless, secure and intuitive access to distributed grid resources. This was the fourth meeting of the UNICORE community, after a meeting in Sophia-Antipolis, France, in 2005, and colocated meetings at Euro-Par 2006 in Dresden, Germany, and Euro-Par 2007 in Rennes, France.

• HPPC 2008 is the Second Workshop on Highly Parallel Processing on a Chip. With a number of both general and special purpose multi-core processors already on the market, it is foreseeable that new designs with a substantial number of processing cores will emerge to meet demands for extremely high performance, dependability and controllable power consumption in mobile and embedded devices, and in response to the convergence of communication, media and compute devices. The HPPC workshop aims to be(come) a forum for discussion of the major challenges to architecture, language and compiler design, algorithms and application developments, in order to fully (or acceptably) exploit the raw compute power of multi-core processors with a significant amount of parallelism.

• SGS 2008 is the First Workshop on Secure, Trusted, Manageable and Controllable Grid Services. It refers to the notions of security, the way we manage such large systems and the way we control the grid system. For instance, the word 'controllable' means how we measure the activity of the grid and how we report it. The word 'manageable' means how we deploy the grid architecture and the grid software, and how we start jobs (under controllable events such as the availability of resources). The word 'security' refers to the traditional fields of authentication and fault tolerance, but also to safe execution (how to certify results, how to adapt computation according to some metric). Moreover, all these services should collaborate, making the building of middleware a challenging problem. The building of chains of trust between software components as well as the integration of security and privacy mechanisms across multiple autonomous and/or heterogeneous grid platforms are key challenges for the community.

• The PROPER 2008 workshop was organized on behalf of the Virtual Institute for High Productivity Supercomputing (VI-HPS), which aims at improving the quality and accelerating the development process of complex simulation codes in science and engineering that are being designed to run on highly parallel computer systems. One part of this mission is the development of integrated state-of-the-art programming tools for high-performance computing that assist programmers in diagnosing programming errors and optimizing the performance of their applications. Accordingly, the workshop topics cover tools for parallel program development and analysis as well as general performance measurement and evaluation approaches. Last but not least, it includes success stories about optimization or parallel scalability achieved using the tools. In particular, the workshop wants to stimulate discussion between tool developers and experts on one hand and tool users and application developers on the other hand. Furthermore, it especially encourages younger researchers to present their work.

• ROIA 2008 is the First International Workshop on Real-Time Online Interactive Applications on the Grid. It aimed to bring together researchers from the domain of ROIAs and grid computing in order to exchange knowledge, experiences, ideas and concepts for combining both fields. The event was closely related to the research performed in the European edutain@grid project.

• DPA 2008 aimed to determine where programming abstractions are important and where non-programmatic abstractions are likely to make greater impact in enabling applications to effectively utilize distributed infrastructure. The workshop aimed for a balance of applications and topical infrastructure developments (such as abstractions for Clouds).
The reader will find in this volume the proceedings of the last seven events.

Hosting Euro-Par 2008 and these colocated events in Las Palmas de Gran Canaria would not have been possible without the support and the help of different institutions and numerous people. Although we are thankful to many more people, we are particularly grateful to the workshop organizers: Martti Forsell and Jesper Larsson Träff for HPPC 2008; Achim Streit and Wolfgang Ziegler for UNICORE Summit 2008; and Michael Alexander and Stephen Childs for VHPC 2008. It has been a pleasure to collaborate with them on this project. We particularly thank them for their interest in our proposal and their trust and availability along the entire preparation process.

Euro-Par 2008 was hosted on the university campus and we would like to thank the University Institute for Intelligent Systems and Numerical Applications in Engineering of the Universidad de Las Palmas de Gran Canaria for the support and infrastructure. We gratefully acknowledge the great organizational support of the Computer Architecture and Operating Systems Department of the Universidad Autónoma de Barcelona. We would also like to thank the Cabildo de Gran Canaria and the City Council of Las Palmas de Gran Canaria for their institutional support.

Finally, we are grateful to Springer for agreeing to publish the proceedings of these seven workshops in a specific volume of its Lecture Notes in Computer Science series. We are definitely eager to pursue this collaboration. It has been a great pleasure to work together on this project in Las Palmas de Gran Canaria.
We hope that the current proceedings are beneficial for the sustainable growth and awareness of parallel and distributed computing concepts in future applications.
December 2008
Eduardo César Michael Alexander Achim Streit Jesper Larsson Träff Christophe Cérin Andreas Knüpfer Dieter Kranzlmüller Shantenu Jha
Organization
Euro-Par Steering Committee

Chair
Christian Lengauer, University of Passau, Germany

Vice-Chair
Luc Bougé, ENS Cachan, France

European Representatives
José Cunha, New University of Lisbon, Portugal
Marco Danelutto, University of Pisa, Italy
Rainer Feldmann, University of Paderborn, Germany
Christos Kaklamanis, Computer Technology Institute, Greece
Anne-Marie Kermarrec, IRISA, Rennes, France
Paul Kelly, Imperial College, UK
Harald Kosch, University of Klagenfurt, Austria
Thomas Ludwig, University of Heidelberg, Germany
Emilio Luque, University Autonoma of Barcelona, Spain
Luc Moreau, University of Southampton, UK
Wolfgang Nagel, Dresden University of Technology, Germany
Rizos Sakellariou, University of Manchester, UK

Non-European Representatives
Jack Dongarra, University of Tennessee at Knoxville, USA
Shinji Tomita, Kyoto University, Japan

Honorary Members
Ron Perrott, Queen's University Belfast, UK
Karl Dieter Reinartz, University of Erlangen-Nuremberg, Germany

Observers
Domingo Benítez, University of Las Palmas, Gran Canaria, Spain
Henk Sips, Delft University of Technology, The Netherlands
Euro-Par 2008 Local Organization

Conference Co-chairs
Emilio Luque, UAB, General Chair
Domingo Benítez, ULPGC, Vice-Chair
Tomàs Margalef, UAB, Vice-Chair

Local Organizing Committee
Eduardo César (UAB)
Ana Cortés (UAB)
Daniel Franco (UAB)
Elisa Heymann (UAB)
Anna Morajko (UAB)
Juan Carlos Moure (UAB)
Dolores Rexachs (UAB)
Miquel Àngel Senar (UAB)
Joan Sorribes (UAB)
Remo Suppi (UAB)

Web and Technical Support
Daniel Ruiz (UAB)
Javier Navarro (UAB)
Euro-Par 2008 Workshop Program Committees

Third Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC 2008)

Program Chairs
Michael Alexander (Chair), WU Vienna, Austria
Stephen Childs (Co-chair), Trinity College, Dublin, Ireland

Program Committee
Jussara Almeida, Federal University of Minas Gerais, Brazil
Padmashree Apparao, Intel Corp., USA
Hassan Barada, Etisalat University College, UAE
Volker Buege, University of Karlsruhe, Germany
Simon Crosby, Xensource, UK
Marcus Hardt, Forschungszentrum Karlsruhe, Germany
Sverre Jarp, CERN, Switzerland
Krishna Kant, Intel Corporation, USA
Yves Kemp, University of Karlsruhe, Germany
Naoya Maruyama, Tokyo Institute of Technology, Japan
Jean-Marc Menaud, Ecole des Mines de Nantes, France
José E. Moreira, IBM T.J. Watson Research Center, USA
Jose Renato Santos, HP Labs, USA
Andreas Schabus, Microsoft, Austria
Yoshio Turner, HP Labs, USA
Andreas Unterkircher, CERN, Switzerland
Dongyan Xu, Purdue University, USA
UNICORE Summit 2008

Program Chairs
Achim Streit, Forschungszentrum Jülich, Germany
Wolfgang Ziegler, Fraunhofer Gesellschaft SCAI, Germany

Program Committee
Agnes Ansari, CNRS-IDRIS, France
Rosa Badia, Barcelona Supercomputing Center, Spain
Thomas Fahringer, University of Innsbruck, Austria
Donal Fellows, University of Manchester, UK
Anton Frank, LRZ Munich, Germany
Edgar Gabriel, University of Houston, USA
Alfred Geiger, T-Systems, Germany
Fredrik Hedman, KTH-PDC, Sweden
Odej Kao, Technical University of Berlin, Germany
Paolo Malfetti, CINECA, Italy
Ralf Ratering, Intel GmbH, Germany
Mathilde Romberg, Forschungszentrum Jülich, Germany
Bernd Schuller, Forschungszentrum Jülich, Germany
David Snelling, Fujitsu Laboratories of Europe, UK
Thomas Soddemann, Max-Planck-Institut für Plasmaphysik - RZG, Germany
Stefan Wesner, University of Stuttgart - HLRS, Germany
Ramin Yahyapour, University of Dortmund, Germany

Additional Reviewers
Max Berger
Kassian Plankensteiner
Second Workshop on Highly Parallel Processing on a Chip (HPPC 2008)

Program Chairs
Martti Forsell, VTT, Finland
Jesper Larsson Träff, NEC Laboratories Europe, NEC Europe Ltd., Germany

Program Committee
David Bader, Georgia Institute of Technology, USA
Gianfranco Bilardi, University of Padova, Italy
Martti Forsell, VTT, Finland
Anwar Ghuloum, Intel, USA
Peter Hofstee, IBM, USA
Chris Jesshope, University of Amsterdam, The Netherlands
Ben Juurlink, Technical University of Delft, The Netherlands
Darren Kerbyson, Los Alamos National Laboratory, USA
Christoph Kessler, University of Linköping, Sweden
Dominique Lavenier, IRISA - CNRS, France
Lasse Natvig, NTNU, Norway
Andrea Pietracaprina, University of Padova, Italy
Jesper Larsson Träff, NEC Laboratories Europe, NEC Europe Ltd., Germany
Uzi Vishkin, University of Maryland, USA
Workshop on Secure, Trusted, Manageable and Controllable Grid Services (SGS 2008)

Steering Committee
Pascal Bouvry, University of Luxembourg, Luxembourg
Christophe Cérin, University of Paris 13, France
Noria Foukia, Otago University, New Zealand
Jean-Luc Gaudiot, University of California, Irvine, USA
Mohamed Jemni, ESSTT, Tunisia
Kuan-Ching Li, Providence University, Taiwan
Jean-Louis Pazat, IRISA, France
Helmut Reiser, Leibniz Supercomputing Centre, Garching, Germany
Workshop on Productivity and Performance (PROPER 2008)

Program Chairs
Matthias S. Müller (Chair)
Andreas Knüpfer (Local Chair)

Program Committee
Matthias Müller, Technical University of Dresden (Chair)
Karl Fürlinger, University of Tennessee
Andreas Knüpfer, Technical University of Dresden
Bettina Krammer, University of Stuttgart
Allen Malony, University of Oregon
Dieter an Mey, RWTH Aachen University
Shirley Moore, University of Tennessee
Martin Schulz, Lawrence Livermore National Lab
Felix Wolf, Forschungszentrum Jülich
Real-Time Online Interactive Applications (ROIA) on the GRID

Program Chairs
Christoph Anthes
Thomas Fahringer
Dieter Kranzlmüller

Program Committee
Alexis Aragon, Darkworks S.A., France
Damjan Ceric, Amis d.o.o, Slovenia
Justin Ferris, IT Innovation Centre, University of Southampton, UK
Frank Glinka, Institute of Computer Science, University of Münster, Germany
Sergei Gorlatch, Institute of Computer Science, University of Münster, Germany
Alexandru Iosup, Parallel and Distributed Systems (PDS) Group, TU Delft, The Netherlands
Roland Landertshamer, Institute of Graphics and Parallel Processing, Joh. Kepler University Linz, Austria
Mark Lidstone, BMT Cordah Ltd., UK
Arton Lipaj, Amis d.o.o, Slovenia
Jens Müller-Iden, Institute of Computer Science, University of Münster, Germany
Vlad Nae, Institute for Computer Science, University of Innsbruck, Austria
Alexander Ploss, Institute of Computer Science, University of Münster, Germany
Radu Prodan, Institute for Computer Science, University of Innsbruck, Austria
Christopher Rawlings, BMT Cordah Ltd., UK
Mike Surridge, IT Innovation Centre, University of Southampton, UK
Jens Volkert, Institute of Graphics and Parallel Processing, Joh. Kepler University Linz, Austria
Abstractions for Distributed Systems (DPA 2008)

Program Chair
Shantenu Jha (LSU and eSI), Chair

Program Committee
Shantenu Jha (LSU and eSI)
Dan Katz (LSU)
Manish Parashar (Rutgers)
Omer Rana (Cardiff)
Murray Cole (Edinburgh)
Table of Contents
Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC 2008)

Preface
  Michael Alexander and Stephen Childs (Program Chairs)

Tools and Techniques for Managing Virtual Machine Images
  Håvard K.F. Bjerke, Dimitar Shiyachki, Andreas Unterkircher, and Irfan Habib

Dynamic on Demand Virtual Clusters in Grid
  Mario Leandro Bertogna, Eduardo Grosclaude, Marcelo Naiouf, Armando De Giusti, and Emilio Luque

Dynamic Provisioning of Virtual Clusters for Grid Computing
  Manuel Rodríguez, Daniel Tapiador, Javier Fontán, Eduardo Huedo, Rubén S. Montero, and Ignacio M. Llorente

Dynamic Resources Management of Virtual Appliances on a Computational Cluster
  Alexander A. Moskovsky, Artem Y. Pervin, and Bruce J. Walker

Complementarity between Virtualization and Single System Image Technologies
  Jérôme Gallard, Geoffroy Vallée, Adrien Lèbre, Christine Morin, Pascal Gallard, and Stephen L. Scott

Efficient Shared Memory Message Passing for Inter-VM Communications
  François Diakhaté, Marc Perache, Raymond Namyst, and Herve Jourdren

An Analysis of HPC Benchmarks in Virtual Machine Environments
  Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong Ong, Christian Engelmann, and Stephen L. Scott
UNICORE Summit 2008

Preface
  Achim Streit and Wolfgang Ziegler (Program Chairs)

Space-Based Approach to High-Throughput Computations in UNICORE 6 Grids
  Bernd Schuller and Miriam Schumacher

The Chemomentum Data Services – A Flexible Solution for Data Handling in UNICORE
  Katharina Rasch, Robert Schöne, Vitaliy Ostropytskyy, Hartmut Mix, and Mathilde Romberg

A Reliable and Fast Data Transfer for Grid Systems Using a Dynamic Firewall Configuration
  T. Oistrez, E. Grünter, M. Meier, and R. Niederberger

Workflow Service Extensions for UNICORE 6 – Utilising a Standard WS-BPEL Engine for Grid Service Orchestration
  S. Gudenkauf, W. Hasselbring, A. Höing, G. Scherp, and O. Kao

Benchmarking of Integrated OGSA-BES with the Grid Middleware
  Fredrik Hedman, Morris Riedel, Phillip Mucci, Gilbert Netzer, Ali Gholami, M. Shahbaz Memon, A. Shiraz Memon, and Zeeshan A. Shah
Second Workshop on Highly Parallel Processing on a Chip (HPPC 2008)

Preface
  Martti Forsell and Jesper Larsson Träff (Program Chairs)

Models for Parallel and Hierarchical On-Chip Computation (Abstract)
  Gianfranco Bilardi

Building a Concurrency and Resource Allocation Model into a Processor’s ISA (Abstract)
  Chris Jesshope

Optimized Pipelined Parallel Merge Sort on the Cell BE
  Jörg Keller and Christoph W. Kessler

Towards an Intelligent Environment for Programming Multi-core Computing Systems
  Sabri Pllana, Siegfried Benkner, Eduard Mehofer, Lasse Natvig, and Fatos Xhafa

Adaptive Read Validation in Time-Based Software Transactional Memory
  Ehsan Atoofian, Amirali Baniasadi, and Yvonne Coady

Compile-Time and Run-Time Issues in an Auto-Parallelisation System for the Cell BE Processor
  Alastair F. Donaldson, Paul Keir, and Anton Lokhmotov

A Unified Runtime System for Heterogeneous Multi-core Architectures
  Cédric Augonnet and Raymond Namyst

(When) Will CMPs Hit the Power Wall?
  Cor Meenderinck and Ben Juurlink
Workshop on Secure, Trusted, Manageable and Controllable Grid Services (SGS 2008)

Preface
  Christophe Cérin

Meta-Brokering Solutions for Expanding Grid Middleware Limitations
  Attila Kertész, Ivan Rodero, and Francesc Guim

Building Secure Resources to Ensure Safe Computations in Distributed and Potentially Corrupted Environments
  Sebastien Varrette, Jean-Louis Roch, Guillaume Duc, and Ronan Keryell

Simbatch: An API for Simulating and Predicting the Performance of Parallel Resources Managed by Batch Systems
  Y. Caniou and J.-S. Gay

Analysis of Peer-to-Peer Protocols Performance for Establishing a Decentralized Desktop Grid Middleware
  Heithem Abbes and Jean-Christophe Dubacq

Towards a Security Model to Bridge Internet Desktop Grids and Service Grids
  Gabriel Caillat, Oleg Lodygensky, Etienne Urbah, Gilles Fedak, and Haiwu He
Workshop on Productivity and Performance (PROPER 2008)

Preface
  Matthias Müller and Andreas Knüpfer (Program Chairs)

Enabling Data Structure Oriented Performance Analysis with Hardware Performance Counter Support
  Karl Fürlinger, Dan Terpstra, Haihang You, Phil Mucci, and Shirley Moore

Complete Def-Use Analysis in Recursive Programs with Dynamic Data Structures
  R. Castillo, F. Corbera, A. Navarro, R. Asenjo, and E.L. Zapata

Parametric Studies in Eclipse with TAU and PerfExplorer
  Kevin A. Huck, Wyatt Spear, Allen D. Malony, Sameer Shende, and Alan Morris

Trace-Based Analysis and Optimization for the Semtex CFD Application – Hidden Remote Memory Accesses and I/O Performance
  Holger Mickler, Andreas Knüpfer, Michael Kluge, Matthias S. Müller, and Wolfgang E. Nagel

Scalasca Parallel Performance Analyses of PEPC
  Zoltán Szebenyi, Brian J.N. Wylie, and Felix Wolf

Comparing the Usability of Performance Analysis Tools
  Christian Iwainsky and Dieter an Mey
Real-Time Online Interactive Applications on the Grid (ROIA 2008)

Preface
  Christoph Anthes, Thomas Fahringer, and Dieter Kranzlmüller (Program Chairs)

Real-Time Performance Support for Complex Grid Applications
  Marian Bubak, Wlodzimierz Funika, Bartosz Baliś, Tomasz Szepieniec, Krzysztof Guzy, and Roland Wismüller

CoUniverse: Framework for Building Self-organizing Collaborative Environments Using Extreme-Bandwidth Media Applications
  Miloš Liška and Petr Holub

Developing VR Applications for the Grid
  Christoph Anthes, Roland Landertshamer, Helmut Bressler, and Jens Volkert

An Information System for Real-Time Online Interactive Applications
  Vlad Nae, Jordan Herbert, Radu Prodan, and Thomas Fahringer

Securing Real-Time On-Line Interactive Applications in edutain@grid
  J. Ferris, M. Surridge, and F. Glinka

The edutain@grid Portals – Providing User Interfaces for Different Kinds of Actors
  Roland Landertshamer, Christoph Anthes, Jens Volkert, Bassem I. Nasser, and Mike Surridge

A Case Study on Using RTF for Developing Multi-player Online Games
  Alexander Ploss, Frank Glinka, and Sergei Gorlatch
Abstractions for Distributed Systems (DPA 2008)

Preface
  Shantenu Jha, Dan Katz, Manish Parashar, Omer Rana, and Murray Cole (Program Committee)

Co-design of Distributed Systems Using Skeleton and Autonomic Management Abstractions
  M. Aldinucci, M. Danelutto, and P. Kilpatrick

Distributed Data Mining Tasks and Patterns as Services
  Domenico Talia

ProActive Parallel Suite: From Active Objects-Skeletons-Components to Environment and Deployment
  Denis Caromel and Mario Leyton

On Abstractions of Software Component Models for Scientific Applications
  Julien Bigot, Hinde Lilia Bouziane, Christian Pérez, and Thierry Priol

Group Abstractions for Organizing Dynamic Distributed Systems
  José C. Cunha, Carmen P. Morgado, and Jorge F. Custódio

Author Index
Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC 2008)
Virtual machine monitors (VMMs) are now integrated with a variety of operating systems and are moving out of research labs into scientific, educational and operational usage. Modern hypervisors exhibit a low overhead and allow concurrent execution of large numbers of virtual machines, each strongly encapsulated. VMMs can offer a network-wide abstraction layer of individual machine resources, thereby opening up new models for implementing high-performance computing (HPC) architectures in both cluster and grid environments. This workshop aims to bring together researchers and practitioners active in exploring the application of virtualization in distributed and high-performance cluster and grid computing environments. Areas that are covered in the workshop series include VMM performance, VMM architecture and implementation, cluster and grid VMM applications, management of VM-based computing resources, hardware support for virtualization, but it is open to a wider range of topics. As basic virtualization technologies mature, the main focus of research now is techniques for managing virtual machines in large-scale installations. This was reflected in this year’s workshop, where five presentations were given on the management of virtualized HPC systems. In total seven papers were accepted for this year’s workshop, with an acceptance rate of approximately 39%. An invited talk by Bernhard Schott of the company Platform gave an overview of the company’s products relative to virtualization. The Chairs would like to thank the Euro-Par organizers, the members of the Program Committee along with the speakers and attendees, whose interaction created a stimulating environment. Our special thanks to Bernhardt Schott for accepting our invitation to speak at the workshop and we acknowledge the financial support of Citrix. VHPC is planning to continue the successful co-location with Euro-Par in 2009. December 2008
E. César et al. (Eds.): Euro-Par 2008 Workshops, LNCS 5415, p. 1, 2009. © Springer-Verlag Berlin Heidelberg 2009
Michael Alexander Stephen Childs
Tools and Techniques for Managing Virtual Machine Images
Håvard K. F. Bjerke¹, Dimitar Shiyachki¹, Andreas Unterkircher¹, and Irfan Habib²
¹ CERN, 1211 Geneva 23, Switzerland
² Centre for Complex Cooperative Systems (CCCS), University of the West of England (UWE), Frenchay, Bristol BS16 1QY, United Kingdom
Abstract. Virtual machines can have many different deployment scenarios and therefore may require generation of multiple VM images. OS Farm is a service that aims to provide VM images that are tailored and generated on the fly. In order to optimize generation of images, a layered copy-on-write image structure is used, and an image cache ensures that identical images are not regenerated. Images can be several hundreds of megabytes large and thus can congest the network and delay their transfer. Content-Based Transfer is a technique which transfers only the difference between the source image and existing target client image data. We present an implementation which achieves an observed bandwidth close to the theoretical maximum and a significant reduction in network congestion.
1 Introduction
Virtualization can add agility to datacenters by providing flexible testing environments, failover with live migration, and consolidation that satisfies different OS flavour requirements. In all these scenarios it is important to have an infrastructure that efficiently handles the needed VM images. Sect. 2 presents a real scenario for the application of image management techniques, as a motivation for this work. Libfsimage is a library and a standalone application, which generates VM images with a rich selection of Linux distributions. It is presented in Sect. 3. OS Farm is a software application that aims to provide a user interface for generating and managing VM images. For generating images, it uses Libfsimage. It employs some techniques in order to optimize the generation and propagation of images, as described in Sect. 4. The large size of VM images is a hurdle for managing images, particularly in image propagation. An implementation of the Content-Based Transfer technique, which optimizes the propagation of VM images over the network, is presented in Sect. 5. This paper presents our work as a solution for image management with a good degree of configuration flexibility and performance.
E. César et al. (Eds.): Euro-Par 2008 Workshops, LNCS 5415, pp. 3–12, 2009. © Springer-Verlag Berlin Heidelberg 2009
2 Application of Image Management Techniques in the EGEE/WLCG Grid
The Enabling Grids for E-sciencE (EGEE)[1] project, funded by the European Union, provides a seamless Grid infrastructure for e-Science. EGEE produces the gLite[2] middleware for grid computing. Tightly coupled to EGEE is the Worldwide LHC Computing Grid (WLCG)[3]. Its mission is to build and maintain a data storage and analysis infrastructure for the entire high energy physics community that will use the Large Hadron Collider (LHC), which is currently being built at the European Organization for Nuclear Research (CERN)[4].
2.1 Grid Middleware Certification
A section in CERN’s IT Department is responsible for the integration, testing and release of the gLite middleware. This activity is carried out in collaboration with several partners all over Europe within EGEE. Testing gLite faces the problem that its components are under active development. To enable progress in certification, the turnaround time from feature submission to certified state must be as small as possible. Bug fixes and new features enter gLite via the concept of a patch. A grid testbed is operated so that new patches can be applied to the relevant grid nodes. However, certification of several patches at the same time can cause conflicts on the testbed. A non-functional patch may spoil the whole testbed. To cope with such problems, an infrastructure of virtual machines based on Xen was established so that certifiers can bring up grid nodes with a certain patch independently. gLite is available on different Linux flavors and architectures: Scientific Linux CERN[5] 3 and 4 on i686 and x86_64 platforms. More Linux flavors (e.g. Debian) are under development. As all these combinations are found in production, interactions of node types with different operating systems and hardware must be tested. To speed up the certification process we need to be able to quickly produce pre-defined images of different gLite node types to use with Xen. We produce such images on a weekly basis in order to reflect the latest updates. The tools Libfsimage and OS Farm were developed at CERN to achieve the aforementioned goals.
2.2 Usage of Virtual Machines on the Grid
The EGEE/WLCG infrastructure lets users send jobs for execution to more than 250 sites. Using the information system, a user can determine the operating system provided by a site. However, as more and more users from different scientific communities join the grid, it becomes difficult for sites to fulfill all their requirements in terms of operating systems and installed software. One way to deal with this problem is to provide each job on the grid with a virtual machine that has a dedicated OS setup. With thousands of users on the grid, transferring images to sites becomes an issue. The content-based image transfer described in Sect. 5 shows how to overcome this problem.
3 Libfsimage
Libfsimage is a library of Linux file system generation routines implemented in Python. Its primary goals are the simultaneous generation of 32-bit and 64-bit file systems for different Linux distributions in an isolated environment, and reuse of the common setup and configuration code between the distributions. At present, the supported Linux flavors are Debian[6], Ubuntu[7], CentOS[8], Scientific Linux CERN and Fedora[9]. The package installation and dependency resolution for the first two are done with Debian package management tools. For the RPM[10]-based distributions this is achieved with Yum[11]. Libfsimage uses pre-built environments that reflect the hardware architecture as well as the type and version of the package manager used in the relevant version of the distribution being generated. When possible, these pre-built environments are shared between the different distributions. Prior to the generation, the appropriate environment is deployed and the library uses the chroot system call to switch to a new root directory. The initial package installations in the generated file system are done from there. Once the generated file system contains the basic libraries, a package manager and configuration tools, further package installations and image configuration are performed from inside the new file system, again by leveraging the chroot capability. Libfsimage can be used both from the command line and as a Python library. In the latter case it manages a Workspace object that keeps track of the deployed environments and their status in order to speed up the file system generation by reusing them. A consequence of the extensive use of the chroot system call by Libfsimage, and of the Debian and RPM-based package managers, is that root privileges are indispensable. This is a major drawback in the scenario where Libfsimage is used for simultaneous creation of images for different third parties that need to install custom packages inside the generated file system.
A harmful pre- or post-install scriptlet that is run during the installation of a package could escape the chroot jail and run arbitrary code with root privileges, thus compromising the host and interfering with the generation of other parties’ images. To address this issue, an SELinux policy module is being developed for narrowing the standard root capabilities to the required minimum and confining the concurrently running generation processes in dedicated SELinux domains, so that misuse of the CAP_SYS_CHROOT privilege cannot lead to a system compromise.
4 OS Farm
OS Farm creates VM images and Virtual Appliances (VA) [12] that can satisfy different needs. It provides a web interface through which users can choose between a range of Linux distributions and yum repositories, with their corresponding yum packages. Images are generated using Libfsimage, which provides a rich selection of Linux distributions, which in OS Farm are called classes. The main interface to an OS Farm service is a web interface, which is shown in Fig. 1. It provides several ways for the user to request a VM image. The most basic method is a Simple request, which allows the user to select image class and architecture. Advanced request extends Simple request and gives the possibility of adding yum packages to the image. A dropdown menu allows the user to choose between a list of predefined yum repositories. Using Asynchronous JavaScript and XML [13], a further list is returned which allows the user to select the corresponding yum packages. OS Farm also supports image requests by XML descriptions. An image description can be uploaded as an XML file. An example image description is shown in Fig. 2.

Fig. 1. OS Farm user interface

Fig. 2. OS Farm VM image specification (XML listing; recoverable values: MySlc4Image, SLC4, i386, emacs, unzip, Base, Core)

4.1 Layered Generation and Caching
The generation of an image is divided into three layers or stages: core, base and image. Core is a small functional image with a minimal set of software required to run the image on a Virtual Machine Monitor or in order to satisfy higher level software dependencies. Base is a layer on top of core, which provides some additional software needed in order to satisfy requirements for VAs. An image is also a layer on top of core and provides user defined software in addition to the core software. A subclass of image is virtual appliance, which is an image with an extra set of rules aimed to allow the image to satisfy requirements of the deployment scenario. Fig. 3 shows that core can be shared between images, while base can be shared between VAs.

Fig. 3. Layers of a VM image (labels: Virtual appliance, Base, Image, Core)

In order to accelerate the generation of images, core and base are always cached the first time they are generated. The layers are cached in Logical Volume Manager (LVM) [14] volumes. This allows higher layers to continue instantaneously, using copy-on-write, with the snapshot feature of LVM. Images are also cached and tagged such that if an image request matches that of a cached image, the image is returned immediately. The tag of a cached image is the checksum of its configuration parameters, such as architecture and yum packages. Whenever an image is requested, a checksum is generated from the requested configuration and looked up in the cache. If an image with the exact same configuration parameters already exists in the cache, the image is returned immediately. A timeout value can be set on a cache entry in order to limit its validity. If an entry has timed out, a request for that entry results in its regeneration. The cache also serves as a browsable repository for images. Instead of requesting images by parameters, images can be browsed and downloaded directly from the cache.
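The tagging scheme described above can be illustrated with a minimal Python sketch. It is not OS Farm's actual code: the configuration keys (`class`, `arch`, `packages`), the `ImageCache` class, the use of SHA-256 and the canonical JSON encoding are all assumptions made for illustration. The text only states that the tag is a checksum of the configuration parameters and that entries may carry a timeout.

```python
import hashlib
import json
import time


def cache_tag(config):
    """Return a checksum of the configuration parameters (the cache tag)."""
    normalized = dict(config)
    # Assumption: package order should not change the tag, so sort the list.
    normalized["packages"] = sorted(set(config.get("packages", [])))
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ImageCache:
    """Hypothetical in-memory stand-in for the image cache."""

    def __init__(self, timeout_seconds=None, clock=time.monotonic):
        self._entries = {}  # tag -> (image_path, creation_time)
        self._timeout = timeout_seconds
        self._clock = clock

    def put(self, config, image_path):
        self._entries[cache_tag(config)] = (image_path, self._clock())

    def get(self, config):
        """Return the cached image path, or None if absent or timed out."""
        tag = cache_tag(config)
        entry = self._entries.get(tag)
        if entry is None:
            return None
        image_path, created = entry
        if self._timeout is not None and self._clock() - created > self._timeout:
            del self._entries[tag]  # expired: the caller must regenerate
            return None
        return image_path
```

A caller would first try `cache.get(config)` and fall back to generating the image, followed by `cache.put(config, path)`, only when `None` is returned.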
Fig. 4. Flowchart of the recursive image request and generation process
Fig. 4 shows the recursive image generation process. A request for an image is effectively a request for a layer, which in turn requires the presence of its underlying layer, core, which, if it does not exist, triggers its creation. The speedup gained from using these techniques varies between image configurations. We have measured the execution times of three simple example scenarios:
– the cache is empty, i.e. the image is generated from scratch: 254 seconds
– core is in the cache: 72 seconds
– image is in the cache: instantaneous
The results show that a significant speedup can be gained when an image or one or more of its layers are cached.
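The recursive request process can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OS Farm implementation: the `PARENT` table is our reading of the layer description in Sect. 4.1, and `request_layer`, `cache` and `build` are hypothetical names.

```python
# Each layer names the layer it is built on (None for the bottom layer).
# Assumption: this mapping follows the layer description in Sect. 4.1.
PARENT = {
    "core": None,
    "base": "core",
    "image": "core",
    "virtual_appliance": "base",
}


def request_layer(layer, cache, build):
    """Return the requested layer, generating missing lower layers first.

    cache -- dict mapping layer name to a generated artifact
    build -- callable(layer, parent_artifact) that generates one layer
    """
    if layer in cache:
        return cache[layer]  # cache hit: nothing to generate
    parent = PARENT[layer]
    parent_artifact = None
    if parent is not None:
        # Recursive step: a missing lower layer triggers its own creation.
        parent_artifact = request_layer(parent, cache, build)
    artifact = build(layer, parent_artifact)
    cache[layer] = artifact
    return artifact
```

With an empty cache, a request for a virtual appliance builds core, base and the appliance in that order; a later request for a plain image then only builds the image layer, which is consistent with the speedup pattern reported above.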
5 Content-Based Transfer
Content-Based Transfer (CBT) is a technique to efficiently transfer VM image data from a source host to a target host. It takes advantage of knowing the structure of the image data to extract common data which need not be transmitted. Most filesystems are aligned on a fixed boundary. For example, in the Ext2 filesystem, a file is aligned in blocks of size $1024 \cdot 2^n$, where $0 \le n \le 2^{32}$ [15]. This means that two identical files on two different ext2 volumes will be contained in a set of identical blocks on both volumes, even if they are fragmented. Moreover, if block sizes are different on the two volumes, as long as the largest block size of the two volumes is a multiple of the smallest block size, all files in both volumes will be aligned on the smallest block boundary. In our experience, block sizes of 1024 and 4096 are most frequent. [16] examines the commonality between filesystem volumes. In our experiments we have analysed two different scenarios:
– two computer centre batch systems,
– two major versions of Scientific Linux CERN (SLC) VM images.
Our experiments have shown that commonality ranges from moderate to significant. The commonality between some example volumes is shown in Tab. 1.

Table 1. Commonality between example volumes
Scenario                        Commonality
Batch system to batch system    84 %
SLC4 to SLC3                    48 %
SLC3 to SLC4                    22 %

Before transmitting a volume, A, across the network, a comparison can be done between an existing volume at the destination, B, and A. If any blocks in A already exist in B, then those blocks need not be transmitted. Moreover, any blocks in A which exist in any of the volumes at the destination need not be transmitted. Content Addressable Storage [17] exploits commonality in order to reduce storage load. In [18] this is exploited in order to reduce the load on the network and storage. We present an implementation which exploits commonality in order to reduce network load and speed up network transfer to close to the theoretical maximum.
5.1 Implementation
Our implementation of Content-Based Transfer uses Java, as a good compromise between efficiency and convenience. Most notably, it exploits Java’s hash digest algorithms and thread-safe hash tables. Identical blocks are identified with checksums, which are calculated using the available hash digest algorithms in Java. In spite of discovered collisions in the MD5 algorithm¹ [19], we have used it for the results in this paper because it is the fastest available algorithm in the Java library. The choice of hashing algorithm, however, is open to the user. The implementation is split into several threads that pass checksums and blocks among each other in a pipelined fashion. For example, one thread scans the source image and generates a checksum for the current block. Immediately, before continuing to the next block, the checksum is passed to the next thread. Concurrently, another thread, at the destination node, receives a checksum and looks it up in its hash table. If the block is not already at the destination, then it is requested from the source, in a similar pipelined fashion. The key to achieving good efficiency in this implementation is to allow the blocks that are already at the destination to be written simultaneously and at the same pace as they are read from the source (which should be the disk read speed), and the other blocks to be transmitted at the pace at which the network allows.
5.2 Hypothetical Maximal Observed Bandwidth
The information to be transmitted from the source to the target consists of a differential part, $s_D$, and an identical part, $s_I$. In the differential part, all data is different between the source and the target, so all source data must be transmitted, and thus the transfer speed for the differential part is bound by the network transmission bandwidth, $v_n$, given that the disk speed is higher than the network bandwidth. In the identical part, data is identical between the source and the target, so data must only be identified and only identification information needs to be transmitted, and thus the transfer speed for the identical part is bound by the disk read speed, $v_d$. Given a purely I/O-bound program (CPU speed is infinite, CPU time is 0), the hypothetical best transfer time is

\[ t = \frac{s_I}{v_d} + \frac{s_D}{v_n}, \tag{1} \]

iff

\[ v_d > v_n. \tag{2} \]

The total transfer time of an image, using this technique, is different from that of a non-optimized technique. With the optimized transfer time, we calculate an observed bandwidth. The hypothetical best observed bandwidth is, given (2),

\[ v = \frac{s}{\frac{s_I}{v_d} + \frac{s_D}{v_n}}, \]

where $s$ is the total image size. In general,

\[ v = \frac{1}{\frac{\Delta_I}{v_d} + \frac{\Delta_D}{v_n}}, \tag{3} \]

where

\[ \Delta_j = \frac{s_j}{s} \in [0, 1]. \tag{4} \]

¹ MD5 is not recommended for security applications, since a collision can actively be created. This concern is not that relevant for our application because the user is not expected to actively create blocks that will collide with his or her own blocks.
The intent of the hypothetical best observed bandwidth is to determine the performance of a hypothetical optimal CBT implementation to use as a benchmark for our CBT implementation. “Observed bandwidth” serves as a measure of performance that is independent of image sizes and indicates the speedup given by the CBT technique compared to a non-optimized technique. On our test system we measured, using the Unix command “dd,” a disk read speed of 35.6 MB/s. The test systems were also equipped with a 100 Mb/s network interface card, and the theoretical max bandwidth of a 100 Mb/s line is 11.9 MB/s. Using these $v_d$ and $v_n$ bandwidths, we calculated the hypothetical observed bandwidths which are given in Fig. 5.
5.3 Experimental Analysis
Fig. 5. CBT bandwidth

Running our implementation of CBT on our test systems, we measured the observed bandwidths which are given in Fig. 5. The results show that our CBT implementation achieves an observed bandwidth which closely follows the hypothetical maximum bandwidth. Also, the real bandwidth, which indicates the network load produced by the actual amount of data transmitted, is reduced, meaning a reduced load on the network. It is worth noting that the observed bandwidths at the two extremes of $\Delta_D$, at 0 and 1, are close to $v_d$ and $v_n$, respectively. In our example, $v_d > v_n$. If $v_n > v_d$, as would be, for us, the case with a gigabit network, the disk read speed is the limitation, and CBT does not give any speedup. CBT in this case still has the advantage of reducing the load on the network.
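The hypothetical curve can be recomputed directly from Eq. (3). The short Python sketch below does this with the measured values quoted in Sect. 5.2 (35.6 MB/s disk read speed, 11.9 MB/s network bandwidth). The function name is ours, and the sketch uses the relation that the identical and differential fractions sum to 1, which follows from the image consisting of exactly these two parts.

```python
def observed_bandwidth(delta_d, v_d, v_n):
    """Hypothetical best observed bandwidth from Eq. (3).

    delta_d -- fraction of the image that differs (between 0 and 1)
    v_d     -- disk read speed
    v_n     -- network bandwidth, in the same unit as v_d
    Eq. (3) is stated under condition (2), i.e. v_d > v_n.
    """
    if not 0.0 <= delta_d <= 1.0:
        raise ValueError("delta_d must lie in [0, 1]")
    delta_i = 1.0 - delta_d  # identical fraction
    return 1.0 / (delta_i / v_d + delta_d / v_n)


V_D = 35.6  # MB/s, disk read speed measured with dd (Sect. 5.2)
V_N = 11.9  # MB/s, theoretical maximum of the 100 Mb/s link (Sect. 5.2)

for tenths in range(0, 11, 2):
    delta_d = tenths / 10
    print(f"delta_D = {delta_d:.1f}: "
          f"{observed_bandwidth(delta_d, V_D, V_N):.1f} MB/s")
```

At the two extremes the function returns the disk read speed and the network bandwidth respectively, matching the observation made about Fig. 5.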
6 Related Work
rPath[20] is a service that offers VAs, and provides a service similar to OS Farm. rPath’s “recipe” approach to constructing a VA is a powerful method and gives great opportunity for reuse of packages between VAs. The approach separates two roles: a VA developer, who develops recipes, and a VA user, who downloads the VAs. The user can choose between a set of predefined VAs, but does not actively change the VA. In OS Farm, the user would normally also be the author of an image, since authoring an image is a trivial exercise. If VM images are to be deployed on a large scale, they need to be adapted to their deployment context. Libfsimage allows parameterized configuration of the images, but a future goal is to allow for contextualization[21]. Rsync[22] is an application that uses commonality in order to speed up the transmission of data. It uses SSH for authentication, which adds some overhead. The observed bandwidth given by Rsync, as calculated from the time reported by Rsync itself (which is lower than the total execution time including authentication), is not as high as that of CBT. For example, for $\Delta_D = 0.1$, Rsync gives a 30 MB/s observed bandwidth, and $\Delta_D = 0.5$ gives a 13 MB/s observed bandwidth. Another advantage of CBT is that it takes a set of images as a source for commonality, as opposed to Rsync, which uses only one target file.
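The idea of using a set of destination images as the source of commonality can be sketched as follows. This is a simplified, single-threaded, in-memory Python illustration, not the authors' pipelined Java implementation: the function names and the 4096-byte block size are assumptions, and MD5 is used only because the paper reports using it.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: one of the common block sizes named in Sect. 5


def iter_blocks(data, block_size=BLOCK_SIZE):
    """Yield (digest, block) for each fixed-size block of a bytes object."""
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        yield hashlib.md5(block).digest(), block


def build_index(destination_images, block_size=BLOCK_SIZE):
    """Index every block found in any image already at the destination."""
    index = {}
    for image in destination_images:
        for digest, block in iter_blocks(image, block_size):
            index.setdefault(digest, block)
    return index


def transfer(source, destination_images, block_size=BLOCK_SIZE):
    """Rebuild source at the destination; return (image, bytes_sent).

    Only blocks whose checksum is unknown at the destination count as
    transmitted; all other blocks are copied from the local index.
    """
    index = build_index(destination_images, block_size)
    rebuilt = bytearray()
    bytes_sent = 0
    for digest, block in iter_blocks(source, block_size):
        if digest in index:
            rebuilt += index[digest]  # already present locally
        else:
            rebuilt += block          # stands in for a network transfer
            bytes_sent += len(block)
    return bytes(rebuilt), bytes_sent
```

The ratio of `bytes_sent` to the source size corresponds to the differential fraction of Sect. 5.2, measured against all destination images rather than a single target file.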
7 Conclusion
We have presented tools and techniques for managing images, which help to overcome some of the problems that present themselves when managing an infrastructure of VMs. Libfsimage provides the different flavors and architectures that are needed in our use case and has a basis which allows extension for further flavors of Linux. It also lends itself as a library for external application, as in the example of OS Farm. OS Farm uses Libfsimage and provides a graphical user interface for generation of VM images. It also provides a repository for VM images, which also serves to cache and optimize image generation through sharing layers of images. CBT exploits commonality between images to optimize the transfer of images across the network. It achieves an observed bandwidth close to the theoretical maximal observed bandwidth. It can also help to avoid network congestion when transferring large images.
Acknowledgements
This work makes use of results produced by the Enabling Grids for E-sciencE project, a project co-funded by the European Commission (under contract number INFSO-RI-031688) through the Sixth Framework Programme. EGEE brings together 91 partners in 32 countries to provide a seamless Grid infrastructure available to the European research community 24 hours a day. Full information is available at http://www.eu-egee.org.
References
1. About EGEE, http://public.eu-egee.org/intro/
2. gLite, http://glite.web.cern.ch/glite/
3. Worldwide LHC Computing Grid, http://lcg.web.cern.ch/LCG/
4. European Organization for Nuclear Research, http://public.web.cern.ch/public/
5. Scientific Linux CERN, http://linux.web.cern.ch/linux/
6. Debian home page, http://www.debian.org/
7. Ubuntu home page, http://www.ubuntu.com/
8. CentOS home page, http://www.centos.org/
9. Fedora home page, http://fedoraproject.org/
10. RPM home page, http://www.rpm.org/
11. Yum home page, http://linux.duke.edu/projects/yum/
12. Sapuntzakis, C., et al.: Virtual Appliances for Deploying and Maintaining Software. In: Proceedings of LISA 2003 (2003)
13. Wikipedia article on AJAX, http://en.wikipedia.org/wiki/AJAX
14. LVM2 Resource Page, http://sources.redhat.com/lvm2/
15. Bovet, D.P., Cesati, M.: Understanding the Linux Kernel, pp. 574–607. O’Reilly, Sebastopol (2003)
16. Tolia, N., et al.: Using Content Addressing to Transfer Virtual Machine State (2002), http://www.intel-research.net/Publications/Pittsburgh/050520030704_127.pdf
17. Tolia, N., et al.: Opportunistic Use of Content Addressable Storage for Distributed File Systems. In: Proceedings of USENIX 2003 (2003)
18. Nath, P., et al.: Design Tradeoffs in Applying Content Addressable Storage to Enterprise-scale Systems Based on Virtual Machines. In: Proceedings of USENIX 2006 (2006)
19. Mikle, O.: Practical Attacks on Digital Signatures Using MD5 Message Digest, Cryptology ePrint Archive: Report 2004/356 (2004)
20. rPath home page, http://www.rpath.com/
21. Bradshaw, R., et al.: A Scalable Approach To Deploying And Managing Appliances. In: TeraGrid 2007, Madison, WI (June 2007)
22. Rsync home page, http://samba.anu.edu.au/rsync/
Dynamic on Demand Virtual Clusters in Grid
Mario Leandro Bertogna¹, Eduardo Grosclaude¹, Marcelo Naiouf², Armando De Giusti², and Emilio Luque³
¹ Department of Computer Science, Universidad Nacional del Comahue, C.P. 8300, Buenos Aires 1400, Argentina, {mlbertog,oso}@uncoma.edu.ar
² Informatic Research Institute LIDI, Universidad Nacional de La Plata, Argentina, {mnaiouf,degiusti}@lidi.info.unlp.edu.ar
³ Computer Architecture and Operating System Department, Universidad Autónoma de Barcelona, Spain, [email protected]
Abstract. In Grid environments, many different resources are intended to work in a coordinated manner, each resource having its own features and complexity. As the number of resources grows, simplifying automation and management is among the most important issues to address. This paper’s contribution lies in the extension and implementation of a grid metascheduler that dynamically discovers, creates and manages on-demand virtual clusters. The first module selects the clusters using graph heuristics. The algorithm then tries to find a solution by searching a set of clusters, mapped to the graph, that achieve the best performance for a given task. The second module, one per grid node, monitors and manages physical and virtual machines. When a new task arrives, these modules modify the configuration of virtual machines or use live migration to dynamically adapt resource distribution at the clusters, obtaining maximum utilization. Metascheduler components and local administrator modules work together to make decisions at run time to balance and optimize system throughput. This implementation results in a performance improvement of 20% in the total computing time, with machines and clusters processing 100% of their working time. These results allow us to conclude that this solution is feasible to implement in Grid environments, where automation and self-management are key to attaining effective resource usage.
1 Introduction
Using geographically distributed clusters in a coordinated manner has a major impact on execution time for parallel applications. Grid computing is a natural environment for this usage, as Grids provide resource sharing through open standards and tight security, making it possible to solve problems faster and more efficiently.
E. César et al. (Eds.): Euro-Par 2008 Workshops, LNCS 5415, pp. 13–22, 2009. © Springer-Verlag Berlin Heidelberg 2009
The Grid metascheduler is an active component in the coordination and management of distributed systems. This component makes it easier for users to access resources across different administrative domains. It can take decisions based on information about the whole system. Owners of physical resources become service providers, and the metascheduler orchestrates them according to negotiated policies and service level agreements. Virtual machines provide a way to make this task easier. These virtual machines can be independently instantiated and configured beforehand with sandbox-like environments[4, 9]. They also allow dynamic tuning of parameters such as memory and the number of CPUs assigned to each virtual machine. The performance of today’s virtual machine technologies on CPU-intensive tasks is comparable to that of native applications [5]. This paper presents a framework extension to a Grid metascheduler. The extension consists of two modules: the first dynamically discovers free machines according to user requirements, and the second creates virtual clusters to efficiently satisfy the submission of parallel jobs. The first module selects the clusters using graph heuristics. Free resources are mapped to a graph, with machines as nodes and network links as edges. The algorithm then tries to find a solution by searching a set of clusters that achieve the best performance for a given task. The second module, of which one instance resides in every grid node, monitors and manages physical machines. When a new task arrives, these modules modify the configuration of virtual machines to dynamically adapt resource distribution at the clusters, thus obtaining maximum utilization. Metascheduler components and local administrator modules work together to make decisions at run time to balance and optimize system throughput.
In the second section of this paper we present the sequence of use and the architecture of the solution, focusing on the metascheduler; the third section describes the model, heuristics and algorithms developed; the fourth section presents experimental results; and finally, related work on this subject and conclusions are presented.
2 Metascheduler
From the design viewpoint, the architecture[1] of this solution is conceptually divided into three layers or tiers. The first, the access tier, defines the clients accessing the system. The second, the management tier, handles access control and creation of resources. The third, the resource tier, deals with the implementation of physical and virtual resources. This paper focuses on the management tier, which is covered by the metascheduler. The implementation began with a study of several proposals; for this work, CSF (Community Scheduler Framework)[2] was chosen. CSF is an open-source implementation of a number of Grid services, which together function as a Grid metascheduler and can be used as a development toolkit. To support on-demand virtual clusters, two extension modules were added to CSF. The first one is the Resource Manager Adapter Service called GramVM. This module is in charge of looking for free machines in the group of clusters.
For management and instantiation of virtual machines within local domains, a new module called Hypervisor Proxy was implemented. Unlike in the original CSF proposal, several Hypervisor Proxy instances can work with one GramVM instance at the same time, in a coordinated manner. In the original implementation, CSF could work with different local schedulers, but only one at a time. In addition, in the original CSF, local schedulers, which have duties similar to those of Hypervisor Proxy, have to be set up before CSF is executed, and once execution in a cluster has started, the assignment of machines cannot be modified until the end of execution. For dynamically instantiated virtual machines, we do not know how many machines each cluster will have until the requirement arrives. GramVM should try to find the resources that best fit the task requirements, leading to a great number of alternatives. Besides, virtual machines can modify their usage of physical resources during execution, either by live migration[6], or by dynamically varying memory and CPU allocation. All of these features essentially reconfigure the pool of available free resources. They can be used to obtain better cluster performance and they are negotiated between Hypervisor Proxy and GramVM at run time.
3 Model
Finding a group of machines with specific characteristics, which is able to efficiently share a given workload, in a short time, is not a trivial problem. To approach this task, we settled on two criteria which were given higher priority: how fast the problem was solved, and how good the outcome was when compared to the optimal solution. To be able to solve the problem in an analytical way, groups of clusters and free machines in the Grid environment are mapped to a graph. Machines are viewed as nodes and network links as edges. Nodes and edges have weights corresponding to machine features and bandwidth. Edges with more bandwidth receive less weight. Node weights are based upon cost functions such as per-time billing, computing power, etc. The strategy is divided into two stages. The first one consists of selecting the groups of clusters. At this stage, a heuristic is used to find an optimal set of machines, taking into account communication overhead and the machines’ computing power. Once a group of clusters is obtained, the second stage starts. For each cluster, an analysis must be done to evaluate how many physical machines will be incorporated. If the number of machines involved is greater than needed, efficiency will decrease, as some machines will stall waiting to send data to another cluster. We seek to keep efficiency over a certain threshold, given beforehand. Our analysis extends the work done in [3]. That work modifies the MPI library to span a number of clusters; a certain, unique, type of task is assumed. In our paper, a virtual environment is proposed where applications can run unmodified over a combination of clusters. Different types of tasks can be supported, as expected from a Grid environment.
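As a small illustration of the mapping described above, the following Python sketch builds such a weighted graph. The function name and data layout are hypothetical, and the choice of 1/bandwidth as the edge weight is only one possible way to make better links lighter; the text does not specify the weighting function.

```python
def build_resource_graph(machines, links):
    """Map free machines and network links to a weighted graph.

    machines -- dict: machine name -> node weight (e.g. a cost value)
    links    -- iterable of (machine_a, machine_b, bandwidth) tuples
    Returns (nodes, edges), where edges maps an unordered machine pair
    to a weight that decreases as bandwidth grows.
    """
    nodes = dict(machines)
    edges = {}
    for a, b, bandwidth in links:
        if a not in nodes or b not in nodes:
            raise KeyError(f"link refers to an unknown machine: {a}, {b}")
        if bandwidth <= 0:
            raise ValueError("bandwidth must be positive")
        # Assumption: weight = 1 / bandwidth, so faster links weigh less.
        edges[frozenset((a, b))] = 1.0 / bandwidth
    return nodes, edges
```

For example, a 1000 Mb/s link between two machines of one cluster would receive a weight ten times smaller than a 100 Mb/s link to a remote cluster.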
Our model was evaluated over master-worker parallel applications. Clusters are dedicated and serve a previously determined number of tasks. All tasks within the same requirement from a user perform the same computation, and send or receive the same amount of data, but the number of tasks can vary across requirements. This scheme is common in graphics and simulation parallel applications. It is assumed that each time a cluster is added to the grid environment, virtual images are characterized and a performance benchmark is run. The network links are monitored regularly to sense bandwidth changes.
3.1 Machine Selection Algorithm
To select a set of feasible clusters to be incorporated into the solution, an iterative improvement algorithm is used. This algorithm, known as Hill-Climbing, is essentially a loop that continually moves in the direction of increasing value; in this case, the direction that maximizes computing power. The algorithm does not maintain a search tree; the node data structure needs to record only the last state reached. This simple policy has some drawbacks: local maxima (peaks that are lower than the highest peak in the state space), plateaux (regions of the state space where the evaluation function is essentially flat) and ridges (a ridge may have steeply sloping sides, so that the search reaches the top of the ridge with ease, but the top may slope only very gently toward a peak). The problem we are trying to solve has particular features, as certain networked geographical regions or provider domains are better provisioned than others. This geographical connectivity pattern is mapped onto the graph edges. When this feature is taken into account, there is no need for random restarts as in the original algorithm. If the graph is partitioned into better-connected geographical zones, or islands, the chances of finding the global maximum grow, and the time to find it decreases. This modification is called Hill-Climbing with k-restarts, where k is the number of partitions of the graph. Each partition is a starting point. To partition the graph into geographical zones, a different kind of algorithm is used, namely Minimum Spanning Trees (MST). A minimum spanning tree connects all nodes in the graph such that the sum of its edge weights is less than or equal to that of any other spanning tree of the graph. The chosen algorithm is Kruskal's, because of the way it builds the MST. The algorithm starts by sorting the edges by weight; nodes are then agglomerated, starting from as many partitions as there are nodes.
Traversing the sorted edges, the solution is checked for cycles at every iteration. If a cycle appears, the most recently introduced edge is discarded. To enhance Hill-Climbing performance, we need to know how many restarts there will be. The number of restarts is the number of suitable partitions in the graph. Once the number of partitions is set as a threshold, Kruskal's algorithm adds edges until the threshold is reached. When Kruskal's algorithm stops, the graph is left partitioned into maximally well-connected trees, since the first step was to sort the edges by weight. The Hill-Climbing algorithm is then applied to each partition in order to obtain a global maximum.
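The partition-then-climb procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: a search state is taken to be a single node, the evaluation function is a per-node computing-power score, each climb starts from an arbitrary node of its partition and stays inside it, and all function names are hypothetical.

```python
def kruskal_partitions(nodes, edges, k):
    """Run Kruskal's edge-merging loop, but stop once k trees remain.

    edges is a list of (weight, u, v) tuples; returns a list of node sets.
    """
    parent = {n: n for n in nodes}          # union-find forest, one tree per node

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    remaining = len(parent)
    for _, u, v in sorted(edges):           # lightest (best-connected) edges first
        if remaining <= k:
            break
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # equal roots would close a cycle: skip
            parent[root_u] = root_v
            remaining -= 1

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def hill_climb(start, allowed, neighbours, score):
    """Move to the best-scoring neighbour until no neighbour improves the score."""
    current = start
    while True:
        candidates = [n for n in neighbours[current] if n in allowed]
        best = max(candidates, key=score, default=current)
        if score(best) <= score(current):
            return current
        current = best


def hill_climb_k_restarts(partitions, neighbours, score):
    """One climb per partition; the best of the local maxima is returned."""
    peaks = [hill_climb(min(part), part, neighbours, score) for part in partitions]
    return max(peaks, key=score)


# Toy graph: two well-connected islands joined by one heavy (slow) link.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [(1, "a", "b"), (1, "b", "c"), (10, "c", "d"), (1, "d", "e"), (1, "e", "f")]
power = {"a": 1, "b": 5, "c": 3, "d": 2, "e": 4, "f": 9}
neighbours = {n: set() for n in nodes}
for _, u, v in edges:
    neighbours[u].add(v)
    neighbours[v].add(u)

partitions = kruskal_partitions(nodes, edges, k=2)
single_start = hill_climb("a", set(nodes), neighbours, power.get)
best = hill_climb_k_restarts(partitions, neighbours, power.get)
```

In this toy graph a single climb from node `a` stops at the local maximum `b`, whereas starting one climb in each of the two partitions also reaches `f`, the node with the highest score.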
Dynamic on Demand Virtual Clusters in Grid
The number of partitions depends on the number of nodes and on the surface over which the Hill-Climbing algorithm will run. As a good practice, the graph was partitioned until each segment contained at least one complete cluster. In most cases the number of partitions was three; this number was sufficient to find the global maximum in every test. The algorithmic complexity of the MST construction is O(E log V). The Hill-Climbing algorithm using adjacency lists is O(E log V), where E is the number of edges and V the number of vertices of the graph. Performance can be improved if the graph nodes represent Grid nodes instead of individual machines, as the number of vertices in the graph decreases.

3.2 Cluster Usage Optimization
A parallel application in a multicluster environment is limited either by the performance of the machines in the cluster (compute-bound) or by network throughput (communication-bound). The maximum performance (maxperf) is reached by an application on a particular cluster when it is compute-bound. If the application is communication-bound, machines will sit idle waiting for network input/output. For a worker task running on a processor, the computation time ($T_{Cpt}$) is defined as the ratio between the number of operations of the task (Oper) and the processor performance (Perf): $T_{Cpt} = Oper / Perf$. The communication time ($T_{Comm}$) is the ratio between the volume of data communicated (Comm; worker task data from and to the master) and the network throughput ($T_{Put}$): $T_{Comm} = N \cdot Comm / T_{Put}$. The maxperf is the performance that can be obtained when $T_{Cpt} \ge T_{Comm}$. Once the set of clusters is computed by the heuristic, the second stage starts. Here we analytically determine how many machines will be used in each cluster, so as to keep maxperf from dropping below a previously fixed threshold. This calculation is based on how many tasks' data the network is able to transfer per time unit, and how many tasks per time unit the cluster is able to process. If the cluster processes more tasks than the network can transfer, the application becomes communication-bound; if the network is able to transfer more tasks' data than the cluster processes, the application becomes compute-bound. Hence, if the application is communication-bound and the number of machines is reduced until a balance is reached, the cluster resources are not fully used, but the machines processing the tasks are used at maximal efficiency. In a Grid environment, multiple possibilities for cluster assignment exist. If we regard the application as being started from different clusters (i.e. we choose different master-clusters), the outcome of our analysis will be different.
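The balance between computation and communication can be made concrete with a short Python sketch. It rests on assumptions the text does not state: N is read as the number of worker machines sharing the cluster's network link, efficiency is modelled as min(1, TCpt/TComm), and the function names and numeric values are illustrative only.

```python
def computation_time(oper, perf):
    """TCpt = Oper / Perf."""
    return oper / perf


def communication_time(n, comm, tput):
    """TComm = N * Comm / TPut (N read here as machines sharing the link)."""
    return n * comm / tput


def efficiency(n, oper, perf, comm, tput):
    # Assumed model: a machine is fully used while TCpt >= TComm,
    # otherwise it idles for the missing fraction of the time.
    return min(1.0, computation_time(oper, perf) / communication_time(n, comm, tput))


def max_machines(oper, perf, comm, tput, available, threshold=1.0):
    """Largest machine count whose efficiency stays at or above the threshold."""
    best = 0
    for n in range(1, available + 1):
        if efficiency(n, oper, perf, comm, tput) >= threshold:
            best = n
    return best


# Illustrative numbers: 2e9 operations per task on 1e9 op/s processors,
# 5 MB of data per task over a 10 MB/s link, 8 machines in the cluster.
task = dict(oper=2_000_000_000, perf=1_000_000_000, comm=5_000_000, tput=10_000_000)
full_efficiency = max_machines(available=8, **task)
relaxed = max_machines(available=8, threshold=0.75, **task)
```

With these numbers each task computes for 2 s and needs 0.5 s of link time, so four machines keep the application compute-bound; accepting an efficiency of 0.75 allows a fifth machine.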
With the heuristic algorithm, the search over candidate master-clusters is minimized, and the best solution (the combination of clusters reaching the highest computing power) is likely to be found. To make this possible, an analytic search has to be done for each graph partition. This limits the number of machines in each cluster so that performance can be held above the threshold for every machine. Communication and computing power have to be evaluated not only for the local cluster, but for the master-cluster as well. If the master-cluster's bandwidth is smaller than the aggregated bandwidth of the worker-clusters, then machines from the
worker-clusters will be idle even though their own network links are sufficient to exploit their full computing capacity for a given task. In this case, a fraction of the master-cluster bandwidth is determined for each cluster, based on its computing power, and the analytic evaluation of computation versus communication is carried out on this value. This solution focuses on maintaining machine performance, but cluster usage is also important. If a cluster is always serving tasks with low computing requirements but high data communication, so that the network link always limits the usable computing power to a few machines, then this is not a good solution. One approach to this problem is to limit the usage of computing power even further, so as to free bandwidth. When tasks with lower communication requirements arrive, they can be submitted to idle machines, thus improving cluster usage; on the downside, if such tasks never arrive, resources are wasted. Our proposal in this paper is to build a cluster over virtual machines in order to release bandwidth on demand: when a new task with smaller communication requirements arrives, machines already executing on the cluster are migrated without interrupting the execution. When two or more virtual machines execute on one physical processor, the scheduler of the virtualization software assigns computing resources fairly, which results in less computing power per machine, so less data per task is sent. The completion times of both the new task and the executing task are known, so the metascheduler module can calculate the time gained by migrating machines and submit the new task to the cluster. If the new task has smaller communication requirements than the migrated one, both the physical nodes that were hosting virtual machines and the idle physical nodes in the cluster can be assigned to the new task, improving overall cluster usage.
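A small Python sketch may clarify how a fraction of the master-cluster bandwidth could be derived from each cluster's computing power. The proportional split and the rule that a worker-cluster cannot exceed its own link are assumptions made for this illustration; the function names and figures are hypothetical.

```python
def share_master_bandwidth(master_bandwidth, cluster_power):
    """Split the master-cluster bandwidth in proportion to computing power."""
    total_power = sum(cluster_power.values())
    return {
        name: master_bandwidth * power / total_power
        for name, power in cluster_power.items()
    }


def effective_throughput(own_link_bandwidth, master_share):
    # A worker-cluster is limited by the slower of its own link
    # and its share of the master-cluster link.
    return min(own_link_bandwidth, master_share)


# Illustrative values: a 100 Mbit/s master link shared by two worker-clusters.
shares = share_master_bandwidth(100, {"cluster_a": 10, "cluster_b": 30})
throughput_b = effective_throughput(own_link_bandwidth=200, master_share=shares["cluster_b"])
```

Here `cluster_b` owns a 200 Mbit/s link but can use only its 75 Mbit/s share of the master link, so the computation/communication analysis for that cluster would be run with the smaller value.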
4 Experimental Results
The experimental evaluation is divided into two phases. The first one shows the improvement gained by using the machine selection heuristic. This strategy is compared to classical grid algorithms [7, 8] as a set of independent tasks arrives. From the system's point of view, a common strategy is to assign tasks according to the load of the resources in order to achieve high system throughput. Three algorithms were selected: a) Minimum Execution Time (MET): assigns each task to the resource with the best expected execution time for that task, regardless of whether this resource is currently available; b) Minimum Completion Time (MCT): assigns each task, in arbitrary order, to the resource with the minimum expected completion time for that task; and c) Opportunistic Load Balancing (OLB): assigns each task, in arbitrary order, to the next machine that is expected to be available, regardless of the task's expected execution time on that machine. The intuition behind OLB is to keep all machines as busy as possible. One advantage of OLB is its simplicity, but because OLB does not consider expected task execution times, the mappings it finds can result in very poor makespans. Classical grid algorithms like Min-min and Max-min were not
selected because they start from the set of all tasks, which is unknown in this case. The problem of grid scheduling can be investigated by taking an experimental or a simulation approach. The advantage of performing actual experiments, scheduling real applications on real resources, is that comparing the efficacy of multiple algorithms is easy and straightforward. However, in an experimental study of scheduling algorithms it is seldom feasible to perform a sufficient number of experiments to obtain meaningful results. Furthermore, it is difficult to explore a variety of resource configurations. A grid environment is typically highly dynamic; variations in resource availability make it difficult to obtain repeatable results. As a result of all these difficulties with the experimental approach, simulation is the most viable way to investigate grid scheduling algorithms effectively. The simulation approach is configurable, repeatable and generally fast, and is the approach we take in this work. Our simulation takes as parameters a description of the existing clusters, their network links, the machines therein (specifying memory and processor type) and finally the tasks with their execution time for each type of processor. For model validation, a real test in [3] was considered, where three clusters with three, five and eight machines were used. The first two clusters were physically located in South America, having less bandwidth and machines with less computing power. The third, more powerful one, was in Spain. If we enforce the same master-cluster as in the real test, i.e. the application is submitted from a cluster in South America, the final computing time and network throughput returned by the simulator are the same as in the real test. However, if the simulator is used along with the machine selection algorithm, the Spanish cluster is selected and the total execution time decreases by nearly 50%.
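For reference, the three baseline heuristics described earlier (MET, MCT and OLB) can be sketched in a few lines of Python. This is a simplified illustration and not the simulator used in this work: all tasks are assumed to arrive at time zero and are mapped in list order, the expected execution times are given as one row per task with one entry per machine, and every name and number is hypothetical.

```python
def met(times, ready):
    """Minimum Execution Time: fastest machine for the task, ignoring availability."""
    return min(range(len(times)), key=lambda m: times[m])


def mct(times, ready):
    """Minimum Completion Time: machine that would finish the task earliest."""
    return min(range(len(times)), key=lambda m: ready[m] + times[m])


def olb(times, ready):
    """Opportunistic Load Balancing: next machine to become available."""
    return min(range(len(ready)), key=lambda m: ready[m])


def makespan(expected_times, policy):
    """Map tasks one by one with the given policy and return the finishing time."""
    ready = [0] * len(expected_times[0])    # time at which each machine is free
    for times in expected_times:
        machine = policy(times, ready)
        ready[machine] += times[machine]
    return max(ready)


# Four identical tasks on three machines: fast, medium and very slow.
expected_times = [[2, 3, 20]] * 4
results = {p.__name__: makespan(expected_times, p) for p in (met, mct, olb)}
```

In this toy case MET piles every task onto the fastest machine (makespan 8), MCT balances the two good machines (makespan 6), and OLB hands a task to the very slow machine simply because it is free (makespan 20).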
This reduction in execution time has a simple explanation: if the application is submitted from a South American cluster, there will be idle machines in the Spanish cluster, whereas if it is submitted from Spain, all three clusters are better used.

4.1 Machine Selection Experiences
Tests were done to verify the impact of the partitioning and master-cluster selection carried out by the Hill-Climbing algorithm. These tests compare how the heuristic algorithm proposed in this paper performs against classical grid algorithms. The tests simulated clusters of eight machines each. Two statistical arrival distributions were chosen: the first, a uniform distribution simulating a low rate of arrivals; the other, an exponential distribution simulating an increasing rate of task arrivals. Nineteen requests were made, each one with three hundred processes to be distributed among the clusters. In all cases MCT performs better than MET and OLB. For greater clarity, the figures compare only the Hill-Climbing and MCT algorithms. The results of the simulation runs can be seen in Figure 1. In the first graph, Hill-Climbing uses the machines of all clusters, selecting the best master-cluster, while MCT selects only the cluster with the minimum completion time. By choosing to run the application on all machines, the heuristic algorithm performs 70% better than MCT;
Fig. 1. Execution time comparison between the heuristic and MCT algorithms with uniform task arrival and exponential task arrival
this is the best case. If each parallel task sends more data while processing, then the application becomes communication-bound (TCpt